Preface


The 28th instalment of the European Safety and Reliability Conference contributes to a long-standing 
tradition of sharing and learning in safety and reliability in Europe and beyond. Academics and 
professionals from all over the world meet in Trondheim to share the state-of-the-art in safety and 
reliability and discuss collaborations and future work. 

The annual European Safety and Reliability Conference (ESREL) is an international conference under 
the auspices of the European Safety and Reliability Association (ESRA). It is one of the largest and
most important safety and reliability events in Europe and has become a recognized conference all over 
the world. 

ESREL aims to be an inclusive event where safety and reliability students can meet renowned 
professionals and partnerships are forged between participants from all parts of the globe. This 
inclusiveness has been a characteristic of ESREL for many years and it contributes to the continued
success of ESREL and ESRA. 

NTNU is a university with an international focus, with headquarters in Trondheim and additional 
campuses in Ålesund and Gjøvik. NTNU has a main profile in science and technology, a variety of
programmes of professional study, and great academic breadth that also includes the humanities, social 
sciences, economics and medicine. The Department of Marine Technology (IMT) is a world leader in 
education, research, and innovation for engineering systems in the marine environment. Areas of research 
include safety and reliability of marine systems, dynamic modeling, marine engineering, transport and 
production for oceans and the Polar Regions. 

The programme of ESREL 2018 offers a variety of arenas for discussion. In invited and plenary lectures,
world-leading scientists explain their contributions and share their views on the future of safe
societies in a changing world. Internationally recognized university teachers, researchers and practitioners
support sharing and discussion in topic sessions, while special sessions focus on topics of interest
to special interest groups. Industry challenge sessions offer an arena for industry professionals to discuss
direct industry interests in the field of safety and reliability. Finally, talented young researchers have
an opportunity to share their work.

This volume contains more than 400 abstracts of the scientific and industry contributions from the 
ESREL conference; it is ordered according to the methodologies that form the backbone of the ESREL 
conference. The full papers are published as Open Access papers on the following website: @@@@@@ 
Comparing HFACS and AcciMaps in a health informatics case study:
the analysis of a medication dosing error


O.O. Igene & C.W. Johnson 
Department of Computing Science, University of Glasgow, Scotland 


ABSTRACT: The use of Information Technology (IT)/software systems is considered a proactive
measure for reducing medication errors, providing clinical efficiency and improving patient safety.
However, it has added a layer of risk that can potentially harm patients and compromise safety. A
comparative study using two accident models, the Human Factors Analysis and Classification System (HFACS)
and Accident Mapping (AcciMaps), was carried out on a health IT related case study of a medication
error involving a Computerized Provider Order Entry (CPOE) system. The results (outcomes) of the analyses
were compared using the usage characteristics criteria developed by Underwood and Waterson (2014).
The usage criteria framework focuses on ease of learning (usability), data requirements, validity and
reliability of analysis. The second objective of our study is to discuss the limitations of both models and
to propose a way forward for enhancing the usability, the validity of results and, more importantly, the
reliability of the AcciMap approach for accident analysis in healthcare.


1 INTRODUCTION 


The development and utilization of Information 
Technology (IT)/software systems within clinical 
settings in hospitals has helped to reduce medical 
errors and improve efficiency in health care deliv- 
ery. However, its implementation and utilization 
has also introduced an additional layer of risks 
and unforeseen errors that can compromise patient 
safety (Koppel et al. 2005, IOM 2012, Magrabi 
et al. 2016). This is especially apparent within 
complex safety critical sociotechnical systems 
like healthcare (IOM 2012, Schneider et al. 2014). 
A sociotechnical system, while regarded as an
imprecise term (Klein 2014), consists of complex
interactions between different entities (people,
technology, process, organization and external
environment) (Sittig & Singh 2010). Accidents/
errors can occur as a result of interactions involving
people (clinicians/physicians) using software/
IT systems. They typically stem from latent conditions
(systemic factors) that stretch well beyond
the “frontline” of healthcare service provision.
The development of accident causation methods
has provided a means of investigating and analyzing
failures. They focus on the subsequent development
of countermeasures to improve patient
and system safety (Johnson 2004). These accident
analysis techniques include, but are not limited to,
linear/event, taxonomy and systemic based methods
based on recognized accident causation theories
like the Swiss Cheese Model (SCM) (Reason 1995,
Qureshi 2008). Systemic methods are considered
to be more suitable for analyzing interactions in
sociotechnical systems than linear and event based
approaches (Qureshi 2008, Leveson 2011). One
such method is the AcciMap method
(Svedung and Rasmussen 2000). Systemic analysis
techniques are typically applied retrospectively
in analyzing incidents/case studies and different
outcomes are produced based on their component
methods.

This paper presents a comparative study using 
a systematic taxonomy framework (HFACS) and 
AcciMaps on a published clinical case study report 
(Horsky et al. 2005). The main purpose of the study
is to identify the putative causes of errors resulting
from interactions between the users and the IT system.
We are also motivated to identify contributing factors,
particularly at the systemic level, to
understand not only what happened but crucially
why. Different components within the sociotech- 
nical system cannot be analyzed in isolation from 
one another especially if system safety is to be 
achieved. The application of HFACS in healthcare 
incident analysis has been established but Acci- 
Maps has not been utilized as an accident analysis 
tool for clinical investigation and analysis. 

This paper is divided into sections covering
the objectives of the comparative analysis,
the methodology applied and the description
of the clinical case study used. Each
of the selected methods is also briefly described and
applied to the case study, and their respective
results (outcomes) are compared and discussed
based on the usage characteristics criteria (Underwood &
Waterson 2014). Limitations of each method are highlighted,
leading to a discussion of a proposed method
to address some of these limitations.


2 OBJECTIVES OF RESEARCH 


The two methods were selected based on the
differences in their methodological approach and
systematic way of analyzing accidents. The objectives
of the study are as follows:


I. Analysis of the medication dosing error case 
study involving the Computerized Provider Order
Entry system using the HFACS and AcciMap 
models. 

II. A comparison of the resulting outputs of each 
accident method based on the usage charac- 
teristics criteria (Underwood and Waterson, 
2014). 

III. Identifying common causes and contributing
factors from analyses relating to the adverse 
outcome. 


3 RESEARCH METHODOLOGY 


This study adopts a qualitative (case study)
approach for identifying, understanding and analyzing
a complex sociotechnical setting
involving healthcare informatics. We aim to expose
the contributing factors that are associated with an
adverse event. A medical case incident was used in a
comparative study of two accident causation methods:
the Human Factors Analysis and Classification System
(HFACS) (Shappell & Wiegmann 2000) and the
Accident Mapping (AcciMaps) method, a component
of the broader Risk Management Framework
(RMF) (Rasmussen and Svedung 2000).

This case incident details a sociotechnical sce- 
nario about risks relating to the house providers 
and the Computerized Provider Order Entry Sys- 
tem (CPOE) (Horsky et al. 2009, IOM 2012). The 
selected accident methods were used to analyze the 
case incident and the outputs were further iterated,
reviewed and validated with an HFACS expert and
an AcciMap expert. The resulting outputs were
then compared using the usage criteria framework,
which covers graphical representation, usability,
validity and reliability (Underwood & Waterson
2014). Strengths and limitations of the methods
applied are elaborated in section 8.


4 DESCRIPTION AND ANALYSIS OF 
CASE STUDY 


The case study involves two providers (A and B)
who were involved in the administration of potassium
chloride (KCl) to a patient who was initially
hypokalemic. The timeline of the events that led to
the patient becoming hyperkalemic (after receiving a high
dosage of KCl), which occurred over a period of three
days, is detailed in the work of Horsky et al. (2005).
The patient was first examined by the first physician 
(Provider A) and administered an intravenous (IV) 
bolus injection thereby repleting the potassium. 
Provider A then realized that the patient already
had an IV and so decided to administer the KCl
as part of the treatment. The patient started receiving
a higher KCl dosage than was originally
intended due to several events that took place. The
initial dosage order was detected to be higher than 
what was allowed by the hospital’s policy and was 
discontinued. A new dosage order had to be writ- 
ten. However this new dosage order was not entered 
correctly into the CPOE system and the maximum 
volume was not indicated for the fluid that was to 
be administered to the patient (Horsky et al. 2005). 
A changeover between the first physician and
the incoming physician (Provider B) took place the
next day, when the latter was notified about the
patient’s KCl levels by the system. However, the
second provider did not know that the laboratory
results were not current and predated the last
potassium repletion (Horsky et al. 2005).
This led Provider B to consider the KCl levels of
the patient to be low and to order an additional
IV injection despite the KCl still running.
This eventually led to the patient becoming severely
hyperkalemic; the problem was immediately
rectified and the patient was discharged (Horsky
et al. 2005). Their study provided a comprehensive
analysis of the contributing factors, arising from
both human errors and system design issues, that
led to the patient experiencing an adverse event.


5 DESCRIPTION OF THE ACCIDENT 
ANALYTICAL METHODS 


The methods selected for the case study are based 
on recognized accident causation theories and are 
briefly described below: 


5.1 Human Factors Analysis and Classification System
(HFACS)


The HFACS method is a systematic approach
that is based on the Swiss Cheese Model (SCM)
(Reason 1995). This approach compares different
levels of ‘causal categories’, ranging from active to
latent conditions, as shown in Figure 1 (Shappell &
Wiegmann 2000). The categories include Unsafe
Acts, Preconditions for Unsafe Acts, Unsafe
Supervision, and Organizational Influences
(Shappell & Wiegmann 2000). Each of them consists
of specific subcategories that place emphasis
on “who” and “what” rather than on “why”
accidents or significant adverse events occurred
(Diller et al. 2013). This method can be utilized
to analyze a comprehensive case study or a set
of incidents to determine trends so as to develop
countermeasures.


Figure 1. HFACS taxonomy (Shappell and Wiegmann,
2002).


5.2 Accident Mapping (AcciMaps) method 


The AcciMap method was developed as a component
of the Risk Management Framework (RMF)
and can be utilized either as a standalone technique
or as part of the broader RMF methodology,
which also uses ActorMaps and Conflict
Maps (Rasmussen & Svedung 2000). The AcciMap
method allows analysts to graphically map
causal relationships leading to an adverse event
across each of the levels in a sociotechnical system
(Branford 2011). The AcciMap levels, as shown
in Figure 2, depict interactions in a sociotechnical
system and consist of Equipment/Surroundings,
Physical Processes/Actor activities, Organizational
(Technical/Operational Management, Company
management), and External (Regulatory bodies/
Trade unions, and Government Policy and Budgeting)
(Rasmussen & Svedung 2000).


Figure 2. The AcciMap method adapted from Rasmussen
and Svedung (2000).


6 RESULTS 


The resulting outputs from both methods are dis- 
cussed below: 


6.1 HFACS output 


The results of applying HFACS were given to an 
experienced human factors specialist for verifica- 
tion and validation. The factors were identified 
and classified according to each level’s defined cat- 
egories and sub-categories of the HFACS taxon- 
omy as shown in Figure 3 and described as follows: 


6.1.1 Unsafe acts 

Operators (clinicians) not using the CPOE system
correctly was categorized under ‘skill based
errors’. Lack of communication between the clinical
providers (A and B), which led to the second provider
administering an additional KCl dosage to the
patient, is considered a ‘decision error’. While there
was no explicit evidence relating to ‘perceptual
errors’, violations of clinical handover procedures
between Providers A and B were noted to
have increased the risk of the patient becoming
hyperkalemic.


6.1.2 Preconditions for unsafe acts 
Several factors at this level include those relating
to the design and usability of the CPOE system
(Environmental factors) and lack of user experience
with the current version of the CPOE system
(Physical and Mental limitations). Also, at the
‘Personal Readiness’ subcategory level, the factors
identified are attributed to lack
of communication between the providers as well as
clinical handover procedures not being adequately carried
out. This could be due to ineffective
training regarding the procedure.


Figure 3. Application of the HFACS method on the
medication dosing error incident.


6.1.3 Unsafe supervision
At both the ‘supervisory violations’ and ‘failure to
correct problem’ subcategories, contributing factors
arise from issues relating to inadequate training on
the use of the current CPOE system. Furthermore,
the system may not have been rigorously tested
at the initial stage for its usability in a simulated
environment (no explicit evidence in the case). In
addition, there is a possibility of inadequate communication
and coordination between the actors
(providers A and B) and those at the management
level, including those responsible for IT systems,
which is classified under the ‘Planned Inappropriate
Operations’ subcategory.


6.1.4 Organizational influence 

It was noted that for the subcategories
‘resource management’, ‘organizational processes’
and ‘organizational climate’, the case study did not
provide explicit evidence of the contributing
factors that enabled the accident. In examining the
other subcategories within this category, it can be
inferred that the organization may have underestimated
the risks and severity relating to IT systems.
In addition, there may not be any existing policies
relating to testing the usability of the CPOE system
within a simulated environment before its deployment.
Other issues, such as budget restrictions
and a low safety culture, would be considered potential
systemic factors.
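
The HFACS classification reported in sections 6.1.1 to 6.1.4 can be sketched as a simple data structure. The Python sketch below is illustrative only. It lists just the subcategories named in this paper rather than the complete HFACS taxonomy, and the `Finding` class, the `classify` helper and the shortened finding descriptions are hypothetical names introduced for this example; they are not part of HFACS or of the original analysis.

```python
from dataclasses import dataclass

# Levels and subcategories as named in this paper (not the complete HFACS taxonomy).
HFACS = {
    "Unsafe Acts": {
        "Skill based errors", "Decision errors", "Perceptual errors", "Violations",
    },
    "Preconditions for Unsafe Acts": {
        "Environmental factors", "Physical and mental limitations", "Personal readiness",
    },
    "Unsafe Supervision": {
        "Supervisory violations", "Failure to correct problem",
        "Planned inappropriate operations",
    },
    "Organizational Influences": {
        "Resource management", "Organizational processes", "Organizational climate",
    },
}


@dataclass(frozen=True)
class Finding:
    """One contributing factor placed in the taxonomy (hypothetical helper type)."""
    description: str
    level: str
    subcategory: str


def classify(findings):
    """Group findings by HFACS level, rejecting labels outside the taxonomy."""
    grouped = {level: [] for level in HFACS}
    for finding in findings:
        if finding.subcategory not in HFACS.get(finding.level, set()):
            raise ValueError(
                f"{finding.subcategory!r} is not a subcategory of {finding.level!r}"
            )
        grouped[finding.level].append(finding)
    return grouped


# Shortened paraphrases of the factors reported in 6.1.1 to 6.1.3.
FINDINGS = [
    Finding("CPOE system not used correctly", "Unsafe Acts", "Skill based errors"),
    Finding("Additional KCl ordered after poor communication",
            "Unsafe Acts", "Decision errors"),
    Finding("Clinical handover procedures not followed", "Unsafe Acts", "Violations"),
    Finding("Design and usability of the CPOE system",
            "Preconditions for Unsafe Acts", "Environmental factors"),
    Finding("Limited experience with current CPOE version",
            "Preconditions for Unsafe Acts", "Physical and mental limitations"),
    Finding("Inadequate training on the CPOE system",
            "Unsafe Supervision", "Supervisory violations"),
]

grouped = classify(FINDINGS)
for level, items in grouped.items():
    print(f"{level}: {len(items)} finding(s)")
```

In this sketch the Organizational Influences level stays empty, which mirrors the observation that the case report gave no explicit evidence at that level. The fixed category sets are also what makes the classification restrictive: a factor that fits no listed subcategory is rejected rather than placed freely.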


6.2 AcciMap output 


The original AcciMap method (Rasmussen and
Svedung, 2000) and the standardized AcciMap
format (Branford 2011) were applied to the case
incident (see Figures 4 and 5). The latter was utilized
by an experienced clinician (e-health pharmacy),
and this method provided a simpler way of
analyzing the case study, particularly at the organizational
level. The AcciMap outputs were externally
validated by a human factors expert.

However, for the purposes of this study, the result of
the AcciMap model is described using the original
AcciMap format as follows:


6.2.1 Equipment and surroundings

In this case study, the equipment is
the CPOE system that allows providers to place
orders regarding the administration of KCl to
the patient. However, human errors occurred as a
result of issues relating to the design and usability
of the system (specifically its interface,
where instructions were difficult to read), leading
to the eventual adverse outcome.


6.2.2 Physical processes and actor activities 

This level refers to the actors involved in the adverse
event (house providers, attending nurse and the
patient). Here, provider A administered a high dosage
of KCl which was entered incorrectly into the
system. The second provider further ordered
an additional dosage of KCl after failing to ascertain
the current KCl levels of the patient, having read
the outdated results from the CPOE system
during the changeover process. Further problems
occurred as a result of miscommunication between
providers A and B on the patient’s KCl levels during
the clinical handover process and, as a result,
the patient became hyperkalemic.


Figure 4. AcciMap analysis of the medication dosing
error case study using the original format.


Figure 5. AcciMap analysis of the medication dosing
error case study using Branford’s standardized AcciMap
format.


6.2.3 Technical and operational management 

At this level, issues relating to the CPOE system include
specifically its interface design flaws, which increased
the probability of an error occurring. This could
also be attributed to a lack of rigorous software simulation
testing of the system before it was deployed
for clinical use. The IT vendor responsible for the
development of the software may cite lack of training
on the use of the software rather than the problem
of the system design. Although this point can
be inferred as a contributing factor, there was no
explicit evidence from the reports that this was the
case. Another inference that can be made is that
similar incidents may not have been reported as
“IT-related” events and that risks arising from
the system design of the CPOE were either underestimated
or not taken into consideration
by the health organization.


6.2.4 Company management and local 
government 

The case report did not contain explicit evidence
regarding specific failures at this level. However,
from the analysis of the preceding levels and some
inferences, errors can occur due to lack of oversight
and underestimation of the severity of risks associated
with IT systems. Additionally,
inadequate training and weak enforcement
of existing policies relating to the clinical handover
process between providers can also contribute
to the adverse event.


6.2.5 Regulatory bodies, trade unions 
and associations 

At this level, contributing factors could include
failure to implement existing policies passed by
the government and other relevant medical bodies
to ensure patient safety. Another inference
could be that incidents relating to IT systems
are not being adequately reported and that
proactive measures may not be in place to handle
such incidents. However, it should be noted
again that there was no explicit evidence in the case
study to support these inferences.


6.2.6 Government policy 

Issues like budget limitations, priorities on budget
allocations and policies regarding the reporting of
incidents relating to software/IT systems implemented
in health organizations would be considered
as systemic factors. However, a much more
detailed report would need to include these systemic
factors that could contribute to the accident or any
similar incidents. Based on the AcciMap outcome,
no explicit cause could be indicated at this external
level (government) due to lack of
evidence relating to systemic factors. This can be attributed to
the fact that incident reports do not typically contain
information on contributing factors
that would enable the government to recognize its role in
creating latent conditions that allow
similar incidents to occur again.


7 USAGE CHARACTERISTICS CRITERIA 


The outputs of the HFACS and AcciMap methods applied to
the case study are compared below against each of the
usage characteristics criteria:


7.1 Data requirements 


The use of both methods requires data, which can
be collected in various formats including case
incidents or a detailed case report of significance.
Other data sources include documentation, interviews
with actors (frontline workers, management)
involved in the incidents and observation of the events
in a sociotechnical scenario. Both methods require
a great deal of detail from the actors directly
involved in the incident and from those at the higher
(organizational) levels to be able to identify valid causes and
contributing factors to accidents.


7.2 Graphical representation of the accident 


HFACS and AcciMaps differ in this category.
HFACS uses only a set of defined categories that
classify causes of the accident and so lacks a way
of graphically representing contributing factors
and their relationships. However, the AcciMap
method provides this advantage by enabling analysts
to use different graphical symbols to represent
different meanings: adverse event (accident),
causes (evidence) and causes (inferred) in relation to
the adverse outcome (Branford 2011, Salmon et al.
2012). AcciMaps also allow the causes identified
to be linked by causal relationships (represented
by arrows) within and between each level of
the sociotechnical system.
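
The graphical representation described here amounts to a directed graph whose nodes carry a level and a kind (adverse event, evidenced cause or inferred cause) and whose arrows are causal links. The Python sketch below illustrates that structure under stated assumptions: the `AcciMap` class and its methods are hypothetical names introduced for this example, the node texts are shortened paraphrases of factors from section 6.2, and the particular links drawn are illustrative rather than a reproduction of Figures 4 and 5.

```python
from dataclasses import dataclass, field

# AcciMap levels, from the external level down to the equipment level.
LEVELS = [
    "Government policy and budgeting",
    "Regulatory bodies and associations",
    "Company management",
    "Technical and operational management",
    "Physical processes and actor activities",
    "Equipment and surroundings",
]
KINDS = {"adverse_event", "evidence", "inferred"}


@dataclass
class AcciMap:
    """Minimal causal map: nodes with a level and a kind, plus directed links."""
    nodes: dict = field(default_factory=dict)  # name -> (level, kind)
    links: list = field(default_factory=list)  # (cause, effect) pairs

    def add_node(self, name, level, kind="evidence"):
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        if kind not in KINDS:
            raise ValueError(f"unknown kind: {kind!r}")
        self.nodes[name] = (level, kind)

    def add_link(self, cause, effect):
        # Links may stay within one level or cross levels.
        if cause not in self.nodes or effect not in self.nodes:
            raise KeyError("both ends of a link must be existing nodes")
        self.links.append((cause, effect))

    def causes_of(self, effect):
        """Return every direct and indirect cause of the given node."""
        found, stack = set(), [effect]
        while stack:
            current = stack.pop()
            for cause, target in self.links:
                if target == current and cause not in found:
                    found.add(cause)
                    stack.append(cause)
        return found

    def inferred(self):
        """Names of causes that rest on inference rather than explicit evidence."""
        return [n for n, (_, kind) in self.nodes.items() if kind == "inferred"]


amap = AcciMap()
amap.add_node("Patient becomes hyperkalemic",
              "Physical processes and actor activities", "adverse_event")
amap.add_node("Dosage order entered incorrectly",
              "Physical processes and actor activities")
amap.add_node("Outdated results read at handover",
              "Physical processes and actor activities")
amap.add_node("Additional KCl ordered", "Physical processes and actor activities")
amap.add_node("Hard-to-read CPOE interface", "Equipment and surroundings")
amap.add_node("Lack of rigorous simulation testing",
              "Technical and operational management", "inferred")

# Illustrative links only; the arrows of the published figures are not reproduced.
amap.add_link("Lack of rigorous simulation testing", "Hard-to-read CPOE interface")
amap.add_link("Hard-to-read CPOE interface", "Dosage order entered incorrectly")
amap.add_link("Dosage order entered incorrectly", "Patient becomes hyperkalemic")
amap.add_link("Outdated results read at handover", "Additional KCl ordered")
amap.add_link("Additional KCl ordered", "Patient becomes hyperkalemic")

print(sorted(amap.causes_of("Patient becomes hyperkalemic")))
print(amap.inferred())
```

Unlike a fixed taxonomy, nothing in this structure constrains what text an analyst writes in a node, which reflects both the flexibility of the method and the source of its subjectivity.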


7.3 Usability of method 


The use of both HFACS and AcciMaps is relative
to the level of skill and experience of the analysts.
While both methods require understanding of the
accident causation theories they are built on, the
HFACS method provides a framework of failure
categories built for classifying failures according
to the Swiss Cheese Model from either single or
multiple incidents. The AcciMap method gives
users a basic understanding of the causes that
interconnect in a vertical manner, which helps explain
why the accident took place. However,
the use of these methods can be relatively
time-consuming, depending on the complexity and
comprehensiveness of the case study, and because of the
subjective nature of analysis by multiple users.


7.4 Validity of method 


The outputs from HFACS and AcciMaps
were each reviewed by an experienced expert user
of both methods for both content and face validity
(hence external validation being required). As the two
methods have differing methodological approaches
(each based on a recognized accident causation
theory), each set of results will only reflect its method’s
perspective on the analysis of the accident and
the safety recommendations that need to be in
place.


7.5 Reliability of analysis 


In previous studies, the HFACS method was noted 
to provide a considerable measure of reliability 
due to the defined failure categories (Salmon et al. 
2012, Ergai 2013). However, as was demonstrated
in the case study, the AcciMap method is limited 
in terms of considering failure categories that can 
exist at different levels including the external level 
as it relates to this case study. This is the reason 
why reliability of the AcciMap method is consid- 
ered to be low, due to the subjective nature of its 
analysis (Salmon et al. 2012, Underwood & Water- 
son 2014) especially because multiple analysts can 
produce different causal outcomes from the same 
case incident. There is also the issue of bias includ- 
ing hindsight bias as to what contributed to the 
adverse outcome. 
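
One common way to put a number on the agreement between two analysts who classify the same findings is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The Python sketch below is a generic illustration and was not part of this study: the function name `cohens_kappa` and the two lists of codings are hypothetical, made-up examples and are not data from the case analysis.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two analysts coding the same items into categories."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both analysts must code the same non-empty set of items")
    n = len(codes_a)
    # Proportion of items on which the two analysts chose the same category.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each analyst's category frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both analysts used a single identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical codings of four findings into HFACS levels (not study data).
analyst_1 = ["unsafe_acts", "unsafe_acts", "preconditions", "supervision"]
analyst_2 = ["unsafe_acts", "preconditions", "preconditions", "supervision"]

kappa = cohens_kappa(analyst_1, analyst_2)
print(f"kappa = {kappa:.3f}")
```

A measure of this kind presupposes a fixed set of categories, which HFACS supplies and the original AcciMap format does not.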


8 DISCUSSION 


The application of both HFACS and AcciMaps
shows the strengths and limitations they
offer for accident analysis. Drawbacks of
the AcciMap method include the subjectivity of its
analysis and the lack of failure categories in each
of the levels (Salmon et al. 2012). The issue of
subjectivity is due in part to the lack of existing standard
guidelines, in addition to the need for external
validation of the results generated. This can ultimately
affect the type of safety recommendations that are
needed, and how effective they are in preventing
a recurrence of the accident. Although Branford
made an attempt to solve that issue through the
development of the standardized AcciMap format,
it still lacks a formal way of classifying failures into
specific categories. This limitation makes the AcciMap
method unsuitable for analyzing multiple
incidents (Salmon et al. 2012, Goode et al. 2017).
There is also the issue of bias, including hindsight
and outcome biases, which can also affect the
results of the accident analysis, especially in determining
why the adverse event occurred (Johnson
2004).

In comparing outputs from the analyses, the causes
common to both include communication
issues between the actors (Providers A and
B), software design issues with the CPOE system,
issues relating to the clinical handover process between
the providers and inadequate/ineffective training
relating to the use of the current CPOE system. The
difference, however, lies in identifying contributing
factors at the higher levels, especially how
factors at external levels (Regulatory bodies
and the Government) can systemically contribute
to the occurrence of the accident. For example,
based on the AcciMap result, an inference was made
as to why there was ineffective feedback between
the health organization (i.e. IT department) and
the software vendor responsible for the design
of the CPOE system. Another inference could
be made regarding inadequate policies relating
to testing the software product (albeit in a simulated
environment) before it was deployed live for
clinical purposes.

While both methods assist in
identifying causes and contributing factors in a
systematic way, the AcciMap method is not restrictive
in terms of identifying and placing causes at
each level, whereas in the case of HFACS the causes
need to be classified according to the defined
framework of failure categories (based on Reason’s
Swiss Cheese Model of accident causation). The
HFACS taxonomy is also considered a generic
method (initially developed for investigating
accidents in the aviation system), but has also
been adopted in the healthcare system (Diller et al.
2013). However, one of the limitations attributed
to the HFACS method is the restrictive nature of
the categories within the HFACS taxonomy. There
is also the need to expand and adapt the taxonomy
within specific healthcare scenarios, including the
addition of higher level related factors (External)
(Salmon et al. 2012) as well as any other factors
not included in the original format.

The advantages
of HFACS and/or any similar health-based
error taxonomies and AcciMaps can potentially be
combined to improve the latter’s reliability.
This step was explored in investigating multiple
led outdoor activity incidents through the development of
an incident reporting and learning system based
on the UPLOADS (Understanding and Preventing
Led Outdoor Accidents Data System) taxonomy
(Salmon et al. 2015). Each AcciMap level
had failure categories based on contributing factors
identified from multiple incident reports as they
related to the actors/decision makers at each
level. Causal relationships, which are a very important
feature of the AcciMap method, were also
depicted between contributing factors (relating
to the actors identified in the system) within and
between each AcciMap level (Salmon et al. 2015,
Goode et al. 2017).

A similar approach can potentially be applied
in investigating significant incidents and near
misses resulting from software issues within the
healthcare context. This could not only potentially
improve the validity of results but also enhance
the usability of the method for accident analysis and its adoption by
safety practitioners in healthcare systems. This
approach can also be utilized to investigate IT failures
or near misses so that lessons learned can be
used as a way of improving safety culture around
the implementation and utilization of health IT
systems by their operators.


9 LIMITATION OF STUDY 


This study has limitations in its identification of
causes and in the validation of outcomes/recommendations
from the analysis. It was particularly
challenging to identify systemic factors at the
higher levels (in the case of the AcciMap analysis)
due to lack of evidence in the report, especially
regarding governance. Another issue is
the knowledge and application of the methods,
especially for first time users, in identifying causes
that are valid and that actually contributed to the
occurrence of the adverse event. The last point is
very important for the development of effective
safety countermeasures and for deciding at which level each
countermeasure is applied. Future studies will
require the use of a more structured approach
involving a multidisciplinary team, ideally from all
sociotechnical levels. This will involve safety practitioners,
IT specialists, users of software systems,
vendors and expert opinions in the analysis of
health IT related accidents.


10 CONCLUSION 


Both methods allow for a systematic identification 
and analysis of accidents in complex sociotech- 
nical systems like healthcare. The purpose of the 
study was not to indicate that one method is better 
than the other, but to highlight the advantages and 
limitations of each method in the analysis of the 
case study. Isolating software related problems and
taking steps to improve the functionality and reliability
of the system does not necessarily improve
patient safety. Accidents where software systems
played a role must be analyzed from a sociotechnical
perspective. This is why human factors engineering
plays a very important role, as it involves
analyzing complex interactions between clinicians
and software systems utilized, and determining
latent conditions based on decisions at the systemic
level.


11 FUTURE WORK 

Beyond the objectives of the present study, a
further study is underway focusing on the development
of a health-specific AcciMap taxonomy
model. This model will then be used to analyze
cases/incidents relating to software/IT related
accidents in healthcare settings. The development
of the model will involve examining
existing health based error taxonomies similar to
HFACS and the contributing factors framework.
These will then be synthesized within the levels of
the standardized AcciMap method. The purpose
of this approach will be to determine whether both the
reliability and validity of the AcciMap method can
be improved through the use of defined failure categories
in a taxonomic structure.
Possibilities of using simulation software to estimate losses
of industrial facilities and installations—critical analysis 


J. Ryczynski, P. Mastalerz, K. Ksiadzyna & T. Smal 
Tadeusz Kosciuszko Military University of Land Forces, Wroclaw, Poland 


ABSTRACT: This paper presents a simulation of a severe failure of a fuel terminal. The results of the
simulation are compared with the real event that took place on 11 December 2005 at the Hertfordshire Oil
Storage terminal near the M1 motorway in England. The comparison shows how beneficial an accurate
simulation of the accident is. Next, the risk analysis is presented. Preliminary Hazard Analysis (PHA) and
the fault tree are methods that show possible causes of the accident. Using simulation, the course of events
before and during the failure can be traced. This kind of research shows how important it is to run the
simulation before an accident occurs and to assess the possible consequences of the adverse situation. This
simple method helps to increase the awareness of fuel terminal workers and of anyone involved in
technical process safety.


1 INTRODUCTION

It is commonly known that the already high demand
for fuel products is growing year by year. Moreover,
products of this kind have to be stored
properly and transported under special conditions.
For that reason, the authors decided to look into
this topic. Analysis of previous petrochemical
failures shows that most accidents involving fuel
products occur during loading and unloading.
These operations are crucial and cannot be omitted
from the transportation process. Both operations
should therefore be performed in a way that
ensures that no unwanted situation occurs. To comply with the
safety rules and eliminate the possible risk, a series
of complex theoretical analyses is carried out before
the fuel products are transported.
Nowadays, computer simulations are used to
estimate the possible consequences of a failure. Simple
computer programs are able to show many indicators,
such as the range of the danger zones, which is
especially helpful for personnel responsible for the safety
of the technological process.

According to the data, the demand for fuel
products rises year by year, hence the need to store
and transport them. Without suitable liquid fuel
storage capacity, the whole transportation chain would
not be able to function. Safety in each of these
processes is a crucial element. Unfortunately, it is often
not possible to forecast events that threaten safety
in general, including the safety of fuel terminals.
Figure 1 presents oil production and the consumption
of crude oil products against accident rates
between 1965 and 2016. For this reason, multi-stage
risk analysis plays such a vital role in the
petrochemical industry.

Figure 1. Oil production with consumption and
amount of accidents between 1960 and 2016 (BP, Statistical
Review of World Energy June 2016, edition 65-th).

For risk analysis to reflect reality accurately, different
databases of industry cases are used.
In Poland there are two programmes: the System
Pomocy W Transporcie Materiałów Niebezpiecznych
(SPOT), an aid system for the transport of dangerous
goods, and Safe Chemistry, an agreement
between the Polish Chamber of Chemical Industry
and the leading plants in the chemical industry
in the Republic of Poland. One of the crucial
elements of this collaboration is to
form a common accident database as a basis for
sharing reports on adverse events.

The European equivalents are the Major
Accident Reporting System (eMARS), Analyse,
Recherche et Informations sur les Accidents (Aria)
and the Failure and Accidents Technical information
System (FACTS). Databases of this sort should be
kept up to date with new industry incidents, establishing
a kind of precedent. The high consumption
of fuel products and the evident importance of safety
have led the authors to take an interest in
fuel terminals.

In this paper, a simulated failure of a
fuel terminal is presented using computer
modeling. The simulation outcome was compared
with the real event that occurred at the
Hertfordshire Oil Storage terminal near the M1 at Hemel
Hempstead in Hertfordshire, England (Mannan
et al., 2009). This approach permits a comparison of
both situations in order to assess the accuracy of the modeling.
Based on the results, we generated a hypothetical
simulation of the explosion of the whole storage
base in order to determine the possible consequences
of adverse events in this type of facility.


2 SAFETY IN THE STORAGE OF LIQUID FUELS


The safety of liquid fuel storage depends primarily
on proper storage infrastructure, correct
handling of loading and unloading operations and
compliance with the provisions on minimum distances
between tanks (Ryng, 1989). It is also important
to note that, in spite of increasing automation,
qualified and experienced personnel remain the major
determinant of workplace safety.

High vapour volatility and continuous fire and explosion
hazards make it necessary to guard against
excessive losses of the stored substance. These losses
are mainly caused by so-called tank 'breathing'.
Measures to limit the losses include (Ryng,
1989):


— the use of “tanks operating under pressure
within the limits held by valves”;

— keeping the tank filled to its maximum level;

— protective coatings that reflect sunlight;

— the use of insulated tanks;

— placing tanks underground.


Flammable liquid tanks are fitted with safety
attachments, which can include:


— respiratory valves protected with valves with a 
fire stop; 

— fire fuses; 

— throats for feeding foam from the fire extin- 
guishing system; 

— Pouring and Emptying Devices (PED) installa- 
tions; 

— control and measuring apparatuses; 

— double bottoms; 

— fire alarm systems (mechanical and automatic); 

— sprinkler systems; 

— Faraday’s grids; 
— tank control systems: Distributed Control System
(DCS);
— emergency pools. 


3 THE FUEL TERMINAL FAILURE 
IN BUNCEFIELD—COMPARATIVE 
SIMULATIONS 


In Hertfordshire, at daybreak on 11 December 2005,
a failure occurred at a fuel terminal.
The event resulted in the loss of about
1/3 of the entire stock of about 10 million litres
of liquid fuels. The explosion of
the accumulated fuel caused considerable damage
up to 10 km from the site (several houses were
severely damaged, while hundreds sustained
minor damage). There were no fatalities, but it is
estimated that over 3,000 people were directly
affected, and more than
2,000 homes and 92 businesses in the
neighbourhood were evacuated. The fire emitted massive
amounts of smoke, which spread over
southern England. The fire burned for
3 days, and enormous amounts of
water and foam were used to extinguish it. At the peak of the
event, 180 firefighters took part in the rescue
operations. The course of events was as follows (Buncefield major
incident investigation, Initial Report, Hemel
Hempstead, 11 Dec. 2005, p. 7):


1. At around 3:00, the fuel gauge for tank 912
(Figure 2) indicated a steady filling level,
even though the tank was being filled at 550 m³/h.
Around 5:20, the tank began to overflow as a
result of overfilling.

2. At 5:38, approximately 300 t of gasoline flowed
through the roof of the tank to the emergency
pool, resulting in an explosive mixture (approx.
5:38 of the emergency pool).

3. At about 5:50, the flow of fuel to the tank was
890 m³/h.

4. At 6:01 am the first explosion occurred, setting
20 storage tanks on fire.


Simulations based on the post-accident study data
have been performed to determine three
phases of toxic cloud formation. In (Krawezyszyn
et al., 2017), the authors used the Buncefield case to show
the accompanying phenomena and the consequences of
the accident, such as the formation of release zones,
the flooded area prior to the explosion and
the explosion itself. In this paper, the Buncefield catastrophe
is used to present the consequences for humans.
It should be emphasized that every result shown in
this work is hypothetical, because there were no fatalities
or injuries in the Buncefield accident. Moreover,
the topography of the area was not considered
in the current work. Possible injuries in the case of fire at
different stages of fuel spread are presented. The next
part of the paper illustrates the injuries that could have been
sustained during the Buncefield catastrophe.


Figure 2. The spread of the toxic cloud according to
accident reports (Mannan et al., 2009).


Figure 3. The spread of the toxic cloud according to
accident reports (Mannan et al., 2009). (Diagram labels: petrol level, access hatch.)


4 FIRST SIMULATION 


Figures 4, 5 and 6 show the degree of burns,
in percent, as a function of exposure duration
for heat radiation of 2 kW/m², 5 kW/m² and 10 kW/m².
The mathematical model is based on the value of
the probit function and its relation to a percentage
(Committee for the Prevention..., 1989):


$$\mathrm{Probit} = -38.48 + 2.56 \ln\left(t\, q^{4/3}\right) \qquad (1)$$


where: t is the time of exposure (s) and q is the heat flux absorbed
(W/m²).
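As a rough illustration of how equation (1) turns an exposure into a percentage, the sketch below evaluates the probit and converts it with the standard normal distribution (a probit of 5 corresponds to 50%). It assumes that the exponent on q is 4/3, as in the commonly used thermal radiation probit, and that q is given in W/m². The function names are hypothetical, and this is not the simulation program used by the authors.

```python
import math


def thermal_probit(t_s: float, q_w_m2: float) -> float:
    """Probit of equation (1); assumes exponent 4/3 and q in W/m^2."""
    return -38.48 + 2.56 * math.log(t_s * q_w_m2 ** (4.0 / 3.0))


def probit_to_percent(probit: float) -> float:
    """Convert a probit value to a percentage via the standard normal CDF."""
    return 50.0 * (1.0 + math.erf((probit - 5.0) / math.sqrt(2.0)))


if __name__ == "__main__":
    # Exposure times below are illustrative, not taken from the paper.
    for q_kw in (2.0, 5.0, 10.0):
        for t in (30.0, 120.0, 600.0):
            pr = thermal_probit(t, q_kw * 1000.0)
            print(f"q = {q_kw:4.1f} kW/m2, t = {t:5.0f} s -> "
                  f"probit {pr:6.2f}, {probit_to_percent(pr):6.2f} %")
```

Under these assumptions the probit passes through zero at roughly 16 s for 10 kW/m² and roughly 130 s for 2 kW/m², which agrees with the onset times for lethal injuries reported in this section.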
Figure 4. Burns degree in percent to exposure duration
for 2 kW/m².

Figure 5. Burns degree in percent to exposure duration
for 5 kW/m².

Figure 6. Burns degree in percent to exposure duration
for 10 kW/m².


In all cases, a longer exposure time produces a
higher degree of burns. According to the figures, people
exposed to 10 kW/m² are likely to suffer
severe injuries in a shorter time (lethal injuries after
16 s) than people exposed to lower heat radiation
(for 2 kW/m², lethal injuries after 130 s).


Figures 4-6 show that first degree burns
increase more quickly than the other injuries. The
abrupt change in body damage appears at
different values in each simulation: the first degree
burn curve saturates soonest, whereas the lethal injuries
curve has a gradual transition and reaches its plateau
much later.

For 2 kW/m² the first degree burns plateau is
reached after 328 s, the second degree burns plateau after
900 s and the lethal injuries plateau after 1080 s.

For 5 kW/m² the first degree burns plateau is
reached after 98 s, the second degree burns plateau after
280 s and the lethal injuries plateau after 370 s.

For 10 kW/m² the first degree burns plateau is
reached after 50 s, the second degree burns plateau after
100 s and the lethal injuries plateau after 125 s.


5 SECOND SIMULATION


The second simulation shows an explosion of the
entire fuel terminal and its consequences for people.
The average weight assumed is 80 kg for a man and
55 kg for a woman. The calculations are based on
the methodology of the Netherlands Organisation for
Applied Scientific Research (TNO) (Committee for the
Prevention..., 1989). The probit function for lung
damage:
$$\mathrm{Probit} = 5.0 - 5.74 \ln S \qquad (2)$$

$$S = \frac{4.2}{\bar{P}} + \frac{1.3}{\bar{i}} \qquad (3)$$

$$\bar{P} = \frac{P}{P_0} \qquad (4)$$

$$\bar{i} = \frac{i}{P_0^{1/2}\, m^{1/3}} \qquad (5)$$

where: P_0 is the atmospheric pressure, P is the pressure
exerted on the body, m is the mass of the body.
The probit function for hearing damage:

$$\mathrm{Probit} = -12.6 + 1.524 \ln P_s \qquad (6)$$

where: P_s is the peak overpressure in the incident.
The probit function for the impact with the
whole body:

$$\mathrm{Probit} = 5 - 2.44 \ln S \qquad (7)$$

$$S = \frac{7.38 \times 10^{3}}{P_s} + \frac{1.3 \times 10^{9}}{P_s\, i} \qquad (8)$$

where: i is the impulse of the incident pressure.
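The blast probit functions (2) to (8) can be sketched in a few lines. The sketch assumes the usual form of the TNO relations (scaled pressure P/P_0, scaled impulse i/(P_0^{1/2} m^{1/3})), pressures in Pa, impulse in Pa·s, mass in kg and a standard atmosphere of 101 325 Pa. All function names and the input values in the usage part are hypothetical and are not the settings of the authors' simulation.

```python
import math

P0_PA = 101_325.0  # assumed standard atmospheric pressure (Pa)


def probit_to_percent(probit: float) -> float:
    """Convert a probit value to a percentage via the standard normal CDF."""
    return 50.0 * (1.0 + math.erf((probit - 5.0) / math.sqrt(2.0)))


def lung_damage_probit(p_pa: float, impulse_pa_s: float, mass_kg: float,
                       p0_pa: float = P0_PA) -> float:
    """Equations (2)-(5): lung damage from scaled pressure and scaled impulse."""
    p_scaled = p_pa / p0_pa
    i_scaled = impulse_pa_s / (math.sqrt(p0_pa) * mass_kg ** (1.0 / 3.0))
    s = 4.2 / p_scaled + 1.3 / i_scaled
    return 5.0 - 5.74 * math.log(s)


def eardrum_probit(ps_pa: float) -> float:
    """Equation (6): hearing damage from peak overpressure."""
    return -12.6 + 1.524 * math.log(ps_pa)


def whole_body_impact_probit(ps_pa: float, impulse_pa_s: float) -> float:
    """Equations (7)-(8): impact of the whole body."""
    s = 7.38e3 / ps_pa + 1.3e9 / (ps_pa * impulse_pa_s)
    return 5.0 - 2.44 * math.log(s)


if __name__ == "__main__":
    ps, impulse = 2.0e5, 2.0e3  # illustrative blast load, not from the paper
    for label, mass in (("man", 80.0), ("woman", 55.0)):
        lung = probit_to_percent(lung_damage_probit(ps, impulse, mass))
        print(f"{label}: lung damage {lung:.2f} %")
    print(f"eardrum damage {probit_to_percent(eardrum_probit(ps)):.2f} %")
```

In this form body mass enters only through the scaled impulse, so a lighter body gives a slightly higher lung damage probit for the same blast load, while the hearing and whole-body functions do not depend on mass at all.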
Figure 7 shows the percentage of body damage as a
function of pressure in bars for an average
man, whereas Figure 8 shows the same dependence
for an average woman.
In addition, Table 1 compares the
body damage for an average man and woman. Based
on the information in Table 1, we can assume that
a potential explosion would have a similar effect on
the bodies of a woman and a man, with only
a slight discrepancy in lung damage.


Figure 8. Percentage of body damage and pressure for
average woman.


Table 1. Comparison of body damage caused by pressure
for average man and woman.

Body damage                                                       Man      Woman

Lung damage when body is perpendicular to blast direction (%)     18.88    19.2
Lung damage when body near reflecting surface (%)                 99.97    100
Whole body displacement (when body collides with object) (%)      100      100
Whole body displacement (when skull collides with object) (%)     100      100
Eardrum damage (%)                                                88.78    88.78


6 CONCLUSIONS 


The purpose of the paper was to present the
potential results of a fuel terminal failure. Choosing
the Buncefield fire incident made it possible to establish
a suitable mathematical model for a simulation of
this type. The overriding goal was to identify potential
consequences for people exposed to
heat radiation generated in the event of an explosion
of the entire storage base. Detailed analysis of
the data and implementation of the risk analysis
allowed the authors to view the issue of storage of
dangerous substances in a tangible way.

It should be mentioned that, based on the available
data, the authors were not able to define all parameters
needed for the simulation, nor to fully anticipate
the failure mechanism. Even so, it is worthwhile to
carry out such analyses and computer simulations to
help us better understand the nature of the
production process.

The analyzed simulations show that people near
the centre of the explosion would suffer serious injuries
or even death. Moreover, there is only a slight
difference between the percentage of body damage
sustained by an average man and an average woman exposed to
pressure at the centre of the explosion. The probable cause
lies in a limitation of the program, because weight
was the only parameter that could be set to distinguish
between a man and a woman. Parameters such
as height or physique are not included and could
potentially influence the results of the simulation.
ABSTRACT: Signal systems are widely used to improve the safety and efficiency of subway systems, but
a signal failure may lead to a major breakdown of subway capacity. Most signaling operation and
maintenance teams still use outdated checklists, while signal operation and maintenance is proving to be an
increasingly difficult task because of internal and external interactions. This paper proposes a hybrid
accident/incident causal model of bowtie and cybernetic control loop to guide the design of an incident data
model. The resulting database is then demonstrated on Beijing subway Line No. 5. Based on the analysis,
statistical characteristics of the incidents are derived. The trends of these factors could help companies improve
the performance of the subway system, and the analysis process gives them a new way to record and
learn from incident data.


1 INTRODUCTION 


Many big cities, including Beijing, rely on the subway
to solve their traffic problems. There are now 21 million
people living in this city, and 6 of the 18 subway
lines carry over one million passengers
every working day (Beijing Mass Transit
Railway Corp. Ltd., 2017). Signal systems are widely
used to ensure safety and efficiency in the subway. While
the advanced block mode and automatic controls
provided by the signal system achieve shorter train
intervals, which allows more trains to operate
at the same time, a failure of this system
may lead to a breakdown of subway capacity.

An event in which a single train is delayed by more than
5 minutes is classified as an operation accident by the Beijing
subway operation corporation. Statistics show that
there were 36 operation accidents caused by signal
failures on the Beijing subway in 2016, which
caused great social repercussions. For example,
a track circuit failure occurred on Beijing subway Line
No. 5 on January 26, 2016. Until the
fault was cleared 37 minutes later, the subway had
to operate by telephone block, which delayed 53
trains by more than 5 minutes. It is therefore of vital importance
to reduce signal failures and to limit the
loss during the recovery period.

Learning from incident data is a way to help subway
companies improve system performance and
prevent accidents. Since operation accidents seldom
happen, and lessons from old accidents cannot
keep up with the current operating state of the system, it is a
good choice to reduce accidents by analyzing minor
incidents and minimizing their number, in line
with Heinrich's Law (Jehring, 1959). Several papers
analyze subway incidents with data. Zhang et al.
(2016) built a database to collect and analyze the time,
location, severity, causes and staff data of near
misses in the Shanghai subway. Ding et al. (2017) used
a fault log database for the safety management of a subway.
The data models of these databases are built on
existing incident records provided by subway companies,
and no safety science theory is invoked to
explain why these data are used and what other data
could also be collected. Moreover, these databases
are designed for subway operation as a whole,
covering aspects such as fires, passenger falls,
explosions, train collisions and vehicle derailments.
There is no database that addresses the prevention
of and recovery from signal failures.


Signal operation and maintenance is proving to
be an increasingly difficult task, not only because
electronic equipment is sensitive to severe environments
but also because the interactions among signaling
systems, surrounding equipment and human
operators are becoming much more complex. However,
most signaling maintenance teams still use
outdated checklists, and only inspect and record the
parameters of the signaling itself. The lack of records
of the working environment, related interfaces and
operations makes it difficult to trace the causes of incidents
and to adjust maintenance dynamically.

Accident/incident models provide insight for
analyzing the causes of an accident/incident. This
paper proposes a signal system operation and
maintenance database based on a hybrid accident/
incident causal model of bow-tie and control loop,
in which the former provides an overview of which
processes should be kept under control to maintain the
performance of the system and mitigate the loss, and
the latter explores how the control processes failed.
Carried out in cooperation with the Communication and Signaling
Branch of Beijing Mass Transit Railway
Operation Corp. Ltd., this research is based
on real subway operation data. By
building a signal-related incident database, the
analysis aims to help reduce the failure rate
and improve the recovery efficiency of the signal system
in the daily operation and maintenance of the Beijing
subway.

The structure of this paper is as follows. The
literature review is in Section 2. Section 3 describes
the proposed model, and the demonstration of
this data model on the Beijing subway is in Section 4.
Finally, Section 5 is the discussion and Section 6 is
the conclusion of this paper.


2 LITERATURE REVIEW 


2.1 Bow-tie model 


Bowtie models represent accidents and incidents
as cause consequence diagrams. Nielsen
(1971) invented the earliest bowtie by combining a
fault tree and an event tree into one diagram, in which
the fault tree explains the causes and the event tree
explains how the consequences arise. A top
event (also called a critical event) is placed at the center
of this kind of diagram, while the elements at the left
end are the causes and the elements at the right end are
the consequences. Haddon (1973) put forward the
concept of a barrier that keeps hazards from impacting
a target, after which Reason (1990) used barriers
as layers in his Swiss Cheese Model to explain
the weaknesses in a system. Barriers then became
an important part of bowtie models, explaining the
linear causal relationship between the top event and
its causes or consequences. Indeed, the fault tree and
event tree are often abandoned in bowtie models
to make them more suitable for qualitative analysis (Visser
1998, Zuijderduijn 1999, Ruijter et al. 2016). Fig. 1
displays the basic structure of a bowtie model. The
barriers to the left of the critical event are control
barriers and those to the right are recovery barriers
(Visser 1998).

Control barriers prevent the cause from occurring
and prevent the cause from leading to a
top event. Recovery barriers prevent the top event
from leading to an undesired loss and mitigate the
loss.

A bowtie model combines a cause consequence
diagram with barriers, which means linking the barriers
of all operation processes in the system. Widely
used for decision making (Ruijter
et al. 2016, Visser 1998, Rathnayaka et al. 2014),
bowtie models can model the operation of an integrated
system and provide an overview of which
kinds of barriers should be put in place.


2.2 Control loop 


A control loop is a form of information flow diagram
of a system, which aims to ensure that the outcome
of a process follows a specified target or
setpoint. The basic structure of a control loop is
shown in Fig. 2.

Figure 1. A basic structure of bowtie model.

Figure 2. A basic structure of control loop.
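To make the idea of a control loop following a setpoint concrete, here is a minimal sketch of a discrete proportional feedback loop. It is a generic textbook illustration, not a model from this paper; the function name, the gain and the numerical values are all assumptions chosen for the example.

```python
def run_control_loop(setpoint: float, initial: float, gain: float, steps: int) -> list:
    """Simulate a discrete proportional feedback loop and return the output history."""
    output = initial
    history = [output]
    for _ in range(steps):
        error = setpoint - output        # comparator: target minus measured output
        output = output + gain * error   # controller acts on the process
        history.append(output)
    return history


if __name__ == "__main__":
    # Illustrative values only: drive the output from 0 towards a setpoint of 10.
    for step, value in enumerate(run_control_loop(10.0, 0.0, 0.5, 6)):
        print(step, round(value, 3))
```

With a gain between 0 and 1 the error shrinks at every step, so the output approaches the setpoint; a disturbance would appear as an extra term added to the output inside the loop.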


Kuhlmann (1986) states that safety can be
understood as a cybernetic problem because of
the interactions among machine, human and environment.
Behavioral cybernetic theory (Smith,
K.U. 1972; Smith, T.J. 1987) maintains that
human behavior is guided as a closed-loop,
feedback controlled process. Cybernetic control loops
are used to analyze the performance of complex
socio-technical systems in which individuals and
organizations operate together with advanced
technology systems (Smith, T.J. et al., 1995). Sonnemans
(2006) uses control loops in accident
analysis in the chemical industry to examine the
control mechanisms inside organizations. Dijkstra
(2007) uses a cybernetic management theory in
aviation to model the organizational structure of
communication, which should be able to adapt to the
changing environment. Kontogiannis (2012) uses
control loops based on VSM to look into general
patterns of breakdown related to structural vulnerabilities
and gradual degradation of performance
in an accident at a Helicopter Emergency Medical
Service.

Control loops are also used in the railway domain.
Appicharla (2011) analyzes the system interfaces of
the railway with cybernetic loops, because cybernetics
studies organization, communication and control
in complex systems. Kohda (2007) uses control
loops in the accident analysis of railway protective
systems. As can be seen, control loops can model a
control process, providing details of the subjects in the
process and of the interactions among humans,
equipment and the environment in a complex system
such as a railway signal system.


3 METHOD PROPOSED 


Beijing subway Line No. 5 is an important line
in the Beijing subway system. It has
23 stations and a length of 27.6 km, and provides over
one million passenger trips every working day.
Although Line No. 5 is the 4th busiest line in Beijing,
it ranks first in signal system failure records and
in operation delays in the incident records of the past
4 years. Since 2016, the RCS lab of Beijing Jiaotong
University has been working with Beijing
Subway to improve the performance of the signal
system, and Line No. 5 is chosen as the
research subject of this paper.


3.1 A hybrid data modeling method proposed 


Bowtie and control loop are combined into a
hybrid model in this paper.

The signal system can guarantee the shortest train
interval under the normal operation mode, while
signal degradation resulting from a signal failure
could cause a traffic jam on the track. Signal-related
incidents have different causes and different
degrees of loss, but all of them share an event in which
the normal operation mode of the signal system failed.
The responsibility of operators and maintainers is
to keep the signal system working in the normal
mode and to recover operation if a signal failure
happens. A bowtie model divides an incident
into causation and effect, with the signal failure placed
in the middle. The operation and maintenance
tasks can then be recognized as barriers on either side of
the top event. Bowtie models are valuable at
a high level for ensuring that all kinds of data
are taken into consideration in the target incident
data model.

The signal system is a complex socio-technical system,
and its operation and maintenance can be
influenced by the complex interactions among
humans, equipment and the environment. While the
bowtie model is built at a high level of abstraction
with little specific information, the management of
barriers moves from the whole picture to
the details of the system. Control loops can therefore
be used as a model for the management of barriers,
indicating what detailed data should be collected or
recorded in the database.

The hybrid model can be established by the fol- 
lowing two steps: 


Step 1 Establish a bowtie model based on prac- 
tical subway operation and maintenance 
situation and gain the barriers. 

Step 2 Establish a control loop specific for every 
barrier gained in the bowtie model. 


The steps are carried out as follows. 


3.2 Establish the bow-tie model 


A qualitative bowtie model is used in this hybrid
model, in which there is no fault tree or event tree.
The various types of failure of the normal signal mode are
placed in the middle of the model as top events.
Component failures of the system are placed at
the left end of the bowtie as the causes of the top
event, while the loss is placed at the right end of the
model. A way is then needed to determine
the barriers of the bowtie model.

The RAMS concept, which is widely used in the mainline
railway and subway domain, is defined in terms
of reliability, availability, maintainability and
safety and their interaction (CENELEC 1999). It
provides a guide for assessing and improving the quality
of service. RAMS is influenced in three ways:
by system conditions, operating conditions
and maintenance conditions. These three conditions
can therefore be used in the bowtie model to determine
the categories of barriers of the subway signal
system.


These three conditions can be directly put on 
the left side of the top event to control the causes 
and can also be used in the right side of the top 
event for recovering within more specific scenarios. 
Thus, the six barriers include system conditions, 
maintenance conditions and operating condi- 
tions as well as degradation conditions, emergency 
maintenance conditions and operational response 
conditions. 

The system conditions barriers focus on the 
maintainability and technical characteristics of 
the system, which are the inherent attribute of the 
system, as well as the internal and external distur- 
bance of the system. 

Maintenance conditions barriers include main- 
tenance procedures such as the conditions of 
preventive maintenance and corrective mainte- 
nance, logistics, and also human factors in the 
maintenance. 

Operating conditions barriers concern
the operating parameters of the system before the
signal failure. Factors such as operation time, train
interval and passenger flow volume can affect the
system, and human factors such as driving manner
and dispatchers' interventions can also contribute
to signal failure.

Degradation conditions describe the system
conditions without the normal operation mode.
The trains have different driving modes and the
ground system has different block modes. When
a signal failure happens, it is important that a
backup mode is ready for operation.

Emergency maintenance conditions correspond
to maintenance conditions. After a
failure happens, it is important to report, diagnose
and repair the system as quickly as possible.
Having enough well-trained maintainers nearby and
sufficient spare parts helps a lot.

Figure 3. The bowtie part of the hybrid model.

Operational response conditions correspond
to operating conditions but emphasize
the recovery operation procedure. The operational
response, under the command of dispatchers,
should decide on and carry out measures such as
degradation, taking trains offline, canceling trains and
detaining trains at stations.

The cause consequence diagram and the 6 categories
of barriers are combined, and the bowtie
part of the hybrid model is shown in Fig. 3.
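The bowtie part described in this section can be sketched as a small data structure: a top event in the middle, causes on the left, losses on the right, and the six barrier categories split into control and recovery sides. The class and field names are hypothetical, and the example cause and consequence are illustrative placeholders rather than entries from the authors' model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Side(Enum):
    CONTROL = "control"   # left of the top event
    RECOVER = "recover"   # right of the top event


# The six barrier categories named in the text, with the side they sit on.
BARRIER_CATEGORIES = {
    "system conditions": Side.CONTROL,
    "maintenance conditions": Side.CONTROL,
    "operating conditions": Side.CONTROL,
    "degradation conditions": Side.RECOVER,
    "emergency maintenance conditions": Side.RECOVER,
    "operational response conditions": Side.RECOVER,
}


@dataclass
class Bowtie:
    top_event: str
    causes: List[str] = field(default_factory=list)
    consequences: List[str] = field(default_factory=list)

    def barriers(self, side: Side) -> List[str]:
        """Return the barrier categories on one side of the top event."""
        return [name for name, s in BARRIER_CATEGORIES.items() if s is side]


if __name__ == "__main__":
    # Illustrative instance; the cause and consequence strings are assumptions.
    bowtie = Bowtie(
        top_event="track circuit failure",
        causes=["component failure"],
        consequences=["train delay"],
    )
    print(bowtie.barriers(Side.CONTROL))
    print(bowtie.barriers(Side.RECOVER))
```

Keeping the barrier categories in one mapping makes the symmetry explicit: each RAMS condition on the control side has a counterpart on the recovery side.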


3.3 Establish the control loops diagram 


There is no unified control loop diagram for the
six categories of barriers. The control loop diagram
can differ between systems, and the
complexity of the loops depends on how far the
analysis is intended to go. The six categories of
barriers are unfolded by control loops to reveal the
detailed factors.

System conditions and degradation conditions
mainly concern how the complex technical
system works in normal mode and in degradation
mode. A control loop diagram covering the
entire signal system of Beijing subway
Line No. 5 is shown in Fig. 4, including the
control loops of normal system conditions and
degradation conditions.

The control loop diagrams of the remaining four barriers
are closer to mission based control. As the
control loop diagram of emergency maintenance
conditions in Fig. 5 shows, persons, groups and
technical systems work together to keep the system
operating.

Figure 4. Control loops of system conditions.

Figure 5. Control loops of EMER MAINT conditions.

Each control loop has three basic elements, 
which are human/equipment, interaction arrow 
and disturbance. Data can be obtained directly 
from these three elements. 
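One way to read the three basic elements is as a template for the fields that an incident record should capture. The sketch below represents a control loop as elements, interaction arrows and disturbances and lists the resulting data fields. The class names are hypothetical, and the example loosely borrows labels from the emergency maintenance loop; the arrow directions and the disturbance are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interaction:
    source: str
    target: str
    label: str


@dataclass
class ControlLoop:
    barrier: str
    elements: List[str]              # humans or equipment in the loop
    interactions: List[Interaction]  # arrows between elements
    disturbances: List[str]          # external influences on the loop

    def data_fields(self) -> List[str]:
        """List the items for which data could be recorded."""
        fields = [f"element: {e}" for e in self.elements]
        fields += [f"interaction: {i.source} -> {i.target} ({i.label})"
                   for i in self.interactions]
        fields += [f"disturbance: {d}" for d in self.disturbances]
        return fields


if __name__ == "__main__":
    loop = ControlLoop(
        barrier="emergency maintenance conditions",
        elements=["dispatcher", "maintenance coordination center", "signal maintenance"],
        interactions=[
            Interaction("dispatcher", "maintenance coordination center", "request for repair"),
            Interaction("maintenance coordination center", "signal maintenance", "ask for repair"),
        ],
        disturbances=["spare part shortage"],  # assumed example
    )
    for name in loop.data_fields():
        print(name)
```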


4 DEMONSTRATION 


4.1 


Based on the hybrid model proposed in Section 3,
a signal-related incident data model for Beijing
subway Line No. 5 is built, as shown in
Fig. 6. There are 16 tables in this data model:
the types of 4 tables are defined by the cause
consequence diagram and the types of 6 tables are
defined by the 6 barrier categories, while the remaining
tables, the joins between tables and the fields
in all of these tables are derived from the control loops.
The data model indicates what kinds of data are
needed for the incident data analysis.

Figure 6. The data model for signal-related subway incidents.

Based on the data model, the signal-related
subway incident database is built with Microsoft
Access 2013.


4.2 Data collection 


The data analysed in this paper combine a variety
of existing Beijing subway records, including
operation accident reports, signal-related incident
records, signal system maintenance records and
equipment maintenance manuals. Based on these
files, an FMEA is carried out and 65 common
subsystem-level failure modes that affect the
normal operation mode are found. The outcome is
used to rebuild the incident records, classifying
them first by the structure of system, subsystems
and equipment, second by subsystem and equipment
failure mode, and third by cause, including
hardware failure, software failure, human error
and environment. The subsystem failure modes are
defined as the top events, and their causes and
consequences are also identified.

Then files from other departments are collected
and attached. Dispatching control records are
gathered from the dispatching center, so the
dispatching information for every incident is
stored in the database. The line arrangement figure
is used to locate the underground and above-ground
parts of the line. The subway dispatching timetable
is used to determine the train interval during the
signal failure. In addition, weather data is
downloaded from the website of the China
Meteorological Administration.

A total of 4352 records of Beijing subway
signal-related incidents from 2013 are entered
into the database; 69.5 percent of them caused no
1-min delay, while 0.67 percent caused delays of
over 5 min.
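
To give a feel for the absolute numbers behind these shares, the short calculation below converts the percentages into approximate record counts. It is only a sketch: the percentages in the text are rounded, so the resulting counts are estimates rather than figures reported by the authors.

```python
TOTAL_RECORDS = 4352       # incident records reported in the text
SHARE_NO_DELAY = 0.695     # share with no 1-min delay
SHARE_OVER_5_MIN = 0.0067  # share with delays of over 5 min

# Approximate counts, rounded to whole records.
no_delay = round(TOTAL_RECORDS * SHARE_NO_DELAY)
over_5_min = round(TOTAL_RECORDS * SHARE_OVER_5_MIN)

print(no_delay, over_5_min)  # prints: 3025 29
```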


4.3 Findings 


4.3.1 Analysis based on RAMS 

The frequency of subsystem failures is shown in
Fig. 7; 85 percent of subsystem failures occurred
in the onboard system. Analysis of the data
reveals several contributing factors.

The data analysis shows that the reliability,
availability and maintainability of the onboard
system of Beijing subway Line No. 5 are low, for
which some technical characteristics of the system
are responsible. In terms of reliability, the
frequencies of internal communication errors and
of APR reader failures are very high. In addition,
the onboard ATP has a 2-out-of-2 structure, which
means there is no redundancy and a single failure
leads to an emergency brake. In terms of
availability, there is no efficient backup mode as
there is on other lines. If the normal mode fails,
the trains have to be operated under the
instruction of signal lights, which cannot cope
with such a short train interval. In terms of
maintainability, firstly, there is no supporting
information about what is wrong after an onboard
system failure except an alarm light, so drivers
and dispatchers cannot act according to the
specific failure and can only follow the same
procedure. Secondly, the signal supplier has gone
out of business, and the causes of many failures
are too hard to determine.

Figure 7. Frequency of incidents by subsystems.
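
The effect of a 2-out-of-2 structure on operational availability can be illustrated with a standard k-out-of-n calculation. This is a sketch under stated assumptions, not a model from the paper: channels are assumed independent, the channel availability `p` is a hypothetical value, and the 2-out-of-3 structure is included only as a common point of comparison.

```python
from math import comb


def k_out_of_n(k, n, p):
    """Probability that at least k of n independent channels work,
    each working with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


p = 0.99  # hypothetical availability of a single channel

two_oo_two = k_out_of_n(2, 2, p)    # both channels must work: p**2
two_oo_three = k_out_of_n(2, 3, p)  # tolerates one failed channel

print(round(two_oo_two, 6), round(two_oo_three, 6))  # prints: 0.9801 0.999702
```

With no spare channel, the 2-out-of-2 arrangement is less available than a single channel, which matches the observation that any single failure stops the train.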


4.3.2 Analysis on environment disturbance 

Track circuits are used on Beijing subway Line
No. 5 to send the MA (movement authority), which
is coded in a low frequency (LF) and a carrier
frequency (CF). In 2016 and the first half of
2017, there were 64 emergency brakes with the
fault code ‘LF error’ or ‘CF error’. An analysis
is therefore carried out to find whether
environmental disturbances contribute to this
failure.

When the LF/CF errors are compared by location,
as shown in Fig. 8, the frequency of these errors
is found to be higher on the track near
Chongwenmen and Dongdan stations. The signal
maintainers inspected these sections and found
that the LF/CF might be disturbed by the traction
return current circuit, which was maintained in
early 2016.


4.3.3 Analysis on human factors of maintainers 
Although relays have long been used for logic
control, they are now also widely used in
computer-based railway signal systems to drive the
system to the safe side after component failures.
After the number of failures increased as a result
of the shortening of the train interval in 2014,
Beijing subway increased the frequency of
scheduled maintenance on the whole signal system
late in that year.

Figure 8. Track circuit code error.

Figure 9. The trend of relay failure over years.


However, as can be seen in Fig. 9, relay failures
increased in 2015, and many of them were found to
be, or could be, related to human error. It seems
that keeping a high maintenance rate for every
component is not a good choice. Relays, for
example, are numerous, and every disassembly and
reassembly carries a potential for human error.
The maintenance frequency of relays was therefore
appropriately reduced, with more emphasis placed
on maintenance quality, and the failure rate fell
in 2016.


4.3.4 Analysis on human factors of operators 

Since the reliability and maintainability of the
onboard signal system of Line No. 5 are poor,
onboard systems often suffer random failures and
the drivers cannot tell what is wrong. Dispatchers
therefore have two choices for handling the failed
train before its recovery. One is to degrade the
block mode to station block, which is the
standardized treatment but needs time to enable
the signal lights. The other is to keep the normal
mode and have the dispatchers guide the train
themselves, which relies on the dispatchers’
efficiency. Both choices are used, and the
decision often depends on the dispatchers’
experience and preference.

Figure 10. Average delays per onboard failure.

Fig. 10 shows the average 1-min delays and 2-min
delays under different conditions. The data shows
that the effect is related to the train interval.
Signal failures with block changes have higher
average delays in the peak period, but the
difference is not so significant during the
non-peak period and weekends. It is therefore
recommended to the subway company not to use a
block change for a single onboard failure in the
peak period, although the related regulations must
be assessed to guarantee safety.
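
A comparison of this kind amounts to grouping incident records by operating period and by whether a block change was made, then averaging the delays in each group. The sketch below shows one way to do this; the record layout and all values are hypothetical and are not data from the paper.

```python
from collections import defaultdict

# Hypothetical records: (period, block_changed, number of 1-min delays caused)
records = [
    ("peak", True, 4), ("peak", True, 6),
    ("peak", False, 2), ("peak", False, 3),
    ("off-peak", True, 1), ("off-peak", False, 1),
]


def average_delays(rows):
    """Average number of delays per incident for each (period, block_changed)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of delays, incident count]
    for period, block_changed, delays in rows:
        entry = totals[(period, block_changed)]
        entry[0] += delays
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


averages = average_delays(records)
for key in sorted(averages):
    print(key, averages[key])
```

In the invented data, block changes cost more in the peak period and make no difference off-peak, which mirrors the pattern described for Fig. 10 without reproducing its values.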


5 DISCUSSION AND CONCLUSION 


The hybrid model proposed in this paper provides a
way for maintainers and operators to gather and
use data to analyze incidents and prevent
operation accidents related to signal failures.
The bowtie provides an overview and helps ensure
that the categories of data are comprehensive,
while the control loops provide a variety of
detailed data in a systematic way. The two parts
are well bonded because control loops can describe
the management of barriers in the bowtie. The
hybrid model is also customized to the subway
domain through the concept of RAMS, which has long
been used as a guide in this industry. It is
therefore not difficult for subway engineers to
learn, and their experience can be carried over
into the structure of the database.

In this paper, the data model built from the
hybrid model replaces the outdated checklist of
Beijing subway and treats the signal system as a
complex socio-technical system. The incident data
analysis can help signal maintainers find the
environmental disturbances behind signal failures.
It can also help them find the human factors that
affect system maintenance so as to adjust their
way of working. Operators can make optimizing
changes to the way they operate in the recovery
period. The data analysis proved useful for the
maintenance and operation of the subway signal
system. Moreover, the findings and changes can be
checked against future data.

The data model and data collection in this paper
play four roles in helping to record and collect
data. Firstly, they can help subway companies
establish a formal way of recording signal-related
incident data, since recorders have different
recording habits. Secondly, maintainers and
operators have more comprehensive data to use, so
when they want to show that some measure improves
the performance of the signal system, their
analysis can be carried out immediately rather
than starting from recording new data. This makes
it more efficient for them to become acquainted
with the system. Thirdly, existing data collected
by different departments in the same subway
company can be put together to analyze the
interaction problems among them, and the
environmental data recorded by other departments
can be used to analyze and eliminate external
disturbances. Fourthly, the model provides a way
to analyze what kinds of data are valuable even
though they are hard to record, so that sensors
could be bought or developed to record more system
condition data and advanced information systems
could be used to record human behaviors.

This paper proposed a hybrid model to identify
what kinds of data should be recorded for
maintainers and operators to improve the
performance of the subway signal system. The
design and establishment of this hybrid model are
introduced in detail. The database is then
demonstrated for Beijing subway Line No. 5. Based
on the analysis, statistical characteristics of
incidents are summarized. The trend of these
factors could help companies improve the
performance of the subway system, and the process
of this analysis gives them a new way to record
and learn from incident data.

In future research, an incident data analysis
method could be proposed based on this hybrid
model. The method may guide people or even
computers in conducting data analysis.
to platform collision incidents
ABSTRACT: The regulatory beginnings of the modern offshore Safety Case (SC) lie in the Health and
Safety at Work Act 1974, which set up two new organisations to oversee its implementation: the Health
and Safety Commission (HSC) (which was dissolved in 2008) and the Health and Safety Executive (HSE).
Following the public inquiry into the Piper Alpha disaster, the responsibilities for offshore safety
regulations were transferred from the Department of Energy to the HSC through the HSE as the singular
regulatory body for safety in the offshore industry (Wang, 2002) (Department of Energy, 1990). In
response, the HSE launched a review of all safety legislation and subsequently implemented changes.
The proposals sought to replace legislation seen as prescriptive with a more “goal setting” approach.
Since these events the SC regulations of 1992 have been amended in both 2005 and 2015, and further
regulations have been released, such as the Offshore Installations Prevention of Fire and Explosion,
and Emergency Response (PFEER) in 1995 and the Safety Zones around Oil & Gas Installations in Waters
around the UK in 2008. However, while Safety Cases are subject to review and updating as often as is
required to keep them up to date, the process of change to the Safety Case may be slow, giving the
document a monolithic appearance. Consequently, several papers have suggested that a dynamic risk
assessment method should be used to assist with SC regulation enforcement. This has led to the
investigation presented in this report, in which 508 vessel to platform collision incidents between
1971 and 2015 have been analysed and compared with the release of key offshore SC regulations. The
incidents have been sourced from the HSE, the World Offshore Accident Databank (WOAD) and the MAIB.
This analysis has identified a key trend between the reporting of offshore collision incidents and
the release and enforcement of offshore SC regulations.


1 INTRODUCTION


Following the public inquiry into the Piper Alpha
disaster, the responsibilities for offshore safety
regulations were transferred from the Department
of Energy to the Health and Safety Commission
(HSC) through the Health and Safety Executive
(HSE) as the singular regulatory body for safety
in the offshore industry (Wang, 2002) (Department
of Energy, 1990). In response, the HSE launched a
review of all safety legislation and subsequently
implemented changes. The proposals sought to
replace legislation seen as prescriptive with a
more “goal setting” approach. Several regulations
were produced, with the mainstay being the Health
and Safety at Work Act (HSE, 1992). Under this a
draft of the offshore installations safety case
regulations was produced. The regulations required
operational safety cases to be prepared for all
offshore installations. An SC is to be submitted
by the “operator” for fixed installations and by
the “owner” for mobile installations. Within this,
all new production installations require a design
safety case and, for mobile installations, the
duty holder is the owner (Wang, 2002).

Offshore operators must submit operational safety
cases (SC) for all existing and new offshore
installations to the Health and Safety Executive's
(HSE) Energy Division (formerly the Offshore
Safety Division) for acceptance, but not approval,
yet it is an offence to operate without an
accepted SC (HSE, 2006a). The SC must show that it
identifies the hazards with potential to produce a
serious accident and that these hazards are below
a tolerability limit and have been reduced to the
ALARP (As Low As Reasonably Practicable) level
(Wang, 2002).

Safety and risk assessment for offshore
installations is rigorous and requires
demonstration from duty holders that all hazards
with potential to cause a major accident are
identified, all major risks have been evaluated,
and measures have been or will be taken to control
the major accident risks to ensure compliance with
the statutory provisions (HSE, 2006b).
Furthermore, it used to be a requirement
stipulated in the SC Regulations that suitable and
sufficient quantitative risk assessment (QRA) was
undertaken. However, the 2015 regulations have
removed the requirement for this specific form of
risk assessment. In practice most duty holders
still use QRA extensively, although it should only
be used as an evaluation tool (HSE, 2015).

This is vitally important as accidents in the
offshore industry can have devastating
consequences, such as the explosion on board the
Deepwater Horizon rig in the Gulf of Mexico, which
was caused by the failure of a subsea blowout
preventer (BOP), with some failures thought to
have occurred before the blowout (Jones, 2010).
This reinforces the use of quantitative risk and
reliability analysis; current models of this kind
can perform both predictive and diagnostic
analysis (Cai, et al., 2013).

After many years of employing the safety case
approach in the UK offshore industry, the
regulations were expanded in 1996 to include
verification of Safety Critical Elements (SCEs).
The offshore installations and wells regulations
were also introduced to deal with the various
stages of the life cycle of the installation. SCEs
are parts of an installation and its plant,
including computer programs, whose failure could
cause or contribute substantially to a major
accident, or whose purpose is to prevent or limit
the effect of one (Wang, 2002) (HSE, 1996).

Recently, however, it has been felt that an
expansion of Safety Cases is necessary, especially
in the offshore and marine industry, as they are
static documents produced at the inception of
offshore installations, even though they contain a
structured argument demonstrating that the
evidence contained therein is sufficient to show
that the system is safe (Auld, 2013)
(Eleye-Datubo, A., et al., 2006).

While the authors were conducting research 
regarding dynamic risk assessment techniques and 
methods for offshore installations and SCEs, more 
specifically data gathering regarding offshore ship 
to platform collisions, a periodic pattern emerged 
between the release of SC regulations and the 
number of collision incidents on the UKCS. 


2 BACKGROUND 


Although Safety Cases are subject to review and
updating as often as is required to keep them up
to date, the process of change to the Safety Case
may be slow, giving the document a monolithic
appearance. An SC is a relatively high-level
document which identifies the operator/owner and
describes the installation and its layout, its
environmental limits, the types of operation to be
undertaken, how the health and safety aspects will
be managed, and the maximum number of persons
expected to be on the installation at any one
time. There also needs to be a description of the
arrangements for detecting toxic or flammable gas,
for the detection, prevention or mitigation of
fires, and for protecting people from the hazards
of explosion, fire, heat, smoke etc., usually in
the form of a temporary refuge, egress routes,
means of evacuation and means of monitoring and
control for an incident. There should also be a
demonstration by suitable risk assessment that the
risks have been mitigated to a level that is
ALARP, and a description of the verification
scheme.

An attempt was made to identify trends between
safety case regulation and incidents regarding
offshore SCEs, such as gas drive turbines and
generators. However, due to possible
under-reporting or limited availability of data,
it is difficult to demonstrate the trend of some
SCEs against the updating of offshore regulations.
On the other hand, it is possible to demonstrate
the effect of a number of possible influences
(slow updating or enforcement of regulations, and
under-reporting or improper recording of
incidents) on incidents on board offshore
platforms. A key area that can be assessed is the
issue of ship to platform collisions.

Before any data is presented, it is important to 
understand the timeline of key offshore regulations 
and incidents that have shaped the modern-day 
safety case regulations. The following list identi- 
fies the key timeline of incidents that have built the 
current safety case regulations. 


• 1974 Health & Safety at Work Act (HSWA)

The HSWA adopted a holistic approach to the
health, safety and welfare of workers. The Act
focuses on the concept that any situations that
may give rise to harm need to be recognised and
suitable measures put in place to eliminate or
reduce the potential for harm. It set up two new
organisations to oversee its implementation: the
Health and Safety Commission (HSC) and the Health
and Safety Executive (HSE). The HSE is the
executive organisation that enforces the
provisions of the HSWA. In April 2008, however,
the HSC was dissolved and merged with the HSE. The
HSC used to protect health and safety at work in
the UK by conducting research, providing training,
and providing advice and information. The
Commission also used to propose new regulations
and approved codes of practice under the authority
of the Act. This is all now conducted by the HSE
(Inge, 2007) (The Stationery Office, 1974).


• 1988 Piper Alpha disaster

Piper Alpha was an oil production platform in the
North Sea off the coast of Aberdeen, Scotland.
The platform began production in 1976, initially
as an oil platform, but was later converted to
accommodate gas production. Oil and gas fires and
explosions destroyed Piper Alpha on 6 July 1988,
killing 167 people, including two crewmen of a
rescue vessel; 61 workers aboard survived. Thirty
bodies were never recovered. The total insured
loss was about £1.7 billion ($3.4 billion), making
it one of the costliest manmade catastrophes ever.
At the time of the disaster, the platform
accounted for approximately ten per cent of North
Sea oil and gas production. The Cullen Inquiry was
set up in November 1988 to establish the cause of
the disaster, chaired by Judge William Cullen.
After 180 days of proceedings, the “Public Inquiry
into the Piper Alpha Disaster” or “Cullen Report”
was released in November 1990. It concluded that
the initial condensate leak was the result of
maintenance work being carried out simultaneously
on a pump and related safety valve. The report was
critical of Piper Alpha’s operator, which was
found guilty of having inadequate maintenance and
safety procedures (Inge, 2007) (Oil & Gas UK,
2008).


• 1989 Offshore installations (safety
representatives & safety committee) regulations

The document provides information on
interpretation and enforcement of the Offshore
Installations (Safety Representatives and Safety
Committees) Regulations 1989. These regulations
were made under the Mineral Workings (Offshore
Installations) Act 1971. They allow the workforce
on an offshore installation to elect safety
representatives from among themselves, and confer
on those representatives functions and powers in
relation to the health and safety of the
workforce. They also provide for time off with pay
for safety representatives so they can perform
these functions and undergo relevant training (The
Stationery Office, 1989).


• 1990 The Cullen report

The Cullen Inquiry was set up in November 1988
to establish the cause of the disaster, chaired by
Judge William Cullen. After 180 days of
proceedings, the “Public Inquiry into the Piper
Alpha Disaster” or “Cullen Report” was released in
November 1990. It concluded that the initial
condensate leak was the result of maintenance work
being carried out simultaneously on a pump and
related safety valve. The report was critical of
Piper Alpha’s operator, which was found guilty of
having inadequate maintenance and safety
procedures. 106 recommendations were made calling
for, amongst many matters, the requirement for SCs
and the transfer of the discharge of offshore
regulation from the Department of Energy to a
discrete division of the HSE. The responsibility
for implementing the recommendations was spread
across the regulators and the industry: the HSE
oversaw 57, the operators were responsible for 40,
the offshore industry was given 8 to progress, and
the final one was for the Standby Ship Owners
Association. The industry acted urgently to carry
out the 48 recommendations that operators were
directly responsible for. The HSE developed and
implemented Lord Cullen’s key recommendation: the
introduction of safety regulations requiring the
operator/owner of every fixed and mobile
installation operating in UK waters to submit to
the HSE, for their acceptance, a SC (Inge, 2007).


• 1992 Safety case regulations

The Offshore Installations (Safety Case) Regula- 
tions came into force in 1992. By November 1993 a 
safety case for every installation had been submit- 
ted to the HSE and by November 1995 all had had 
their safety case accepted by the HSE. The Safety 
Case Regulations require the owner/operator/duty 
holder of every fixed and mobile installation oper- 
ating in UK waters to submit to the HSE, for their 
acceptance, a safety case. The safety case must 
give full details of the arrangements for managing 
health and safety and controlling major accident 
hazards on the installation. It must demonstrate, 
for example, that the company has safety man- 
agement systems in place, has identified risks and 
reduced them to as low as reasonably practicable, 
has introduced management controls, provided a 
temporary safe refuge on the installation and has 
made provisions for safe evacuation and rescue 
(Inge, 2007) (HSE, 2005). 


• 1995 Offshore installations Prevention of Fire and
Explosion, and Emergency Response (PFEER)

PFEER deals primarily with fire and explosion
events, but it also deals with any event which may
require emergency response, and includes systems
that may rely on radar on a standby vessel or
responsible staff on the installation monitoring
incoming vessels. The Regulations, ACOP and
guidance deal with: (a) preventing fires and
explosions, and protecting people from the effects
of any which do occur; (b) securing effective
response to emergencies affecting people on the
installation or engaged in activities in
connection with it, and which have the potential
to require evacuation, escape and rescue (amended
in 2005 and 2015) (HSE, 2015a).


• 1996 Offshore installation (design &
construction) regulations

DCR requires the installation to possess integrity
at all times, so far as is reasonably practicable.
It requires the design of the installation to
withstand such forces as are reasonably
foreseeable and, in the event of foreseeable
damage, to retain sufficient integrity to enable
action to be taken to safeguard the health and
safety of persons on or near it. The duty holder
also has to record the appropriate limits within
which it is to be operated. Further duties can be
found in the Offshore Installations and Wells
(Design and Construction, etc.) Regulations 1996.


• 2005 Offshore installations (safety case)
regulations (April 2006)

The Offshore Installations (Safety Case) Regu- 
lations 2005 came into force on 6 April 2006. 
They replace the previous 1992 Regulations. The 
primary aim of the Regulations is to reduce the 
risks from major accident hazards to the health 
and safety of the workforce employed on offshore 
installations or in connected activities. The Regu- 
lations implement the central recommendation of 
Lord Cullen’s report on the public inquiry into the 
Piper Alpha disaster. This was that the operator 
or owner of every offshore installation should be 
required to prepare a safety case and submit it to 
HSE for acceptance (HSE, 2005). These SC regu- 
lations have been replaced by the 2015 regulations 
(HSE, 2015). 


• 2008 Safety zones around oil & gas installations
in waters around the UK (HSE)

While this document is not a regulation, it
explains the purpose and significance of safety
zones around offshore oil and gas installations
and their effect on marine activities,
particularly relating to fishing vessels. A safety
zone is an area extending 500 m from any part of
an offshore oil and gas installation and is
established automatically around all installations
which project above the sea at any state of the
tide. Subsea installations may also have safety
zones, created by statutory instrument, to protect
them. These safety zones have a 500 m radius from
a central point. Vessels of all nations are
required to respect them. It is an offence (under
section 23 of the Petroleum Act 1987) to enter a
safety zone except under special circumstances
(HSE, 2008b).


• 2015 Offshore installations (offshore safety
directive) (safety cases etc.) regulations (July 2015)

The Offshore Installations (Offshore Safety
Directive) (Safety Case etc.) Regulations 2015
came into force on 19 July 2015. They apply to oil
and gas operations in external waters, that is,
the territorial sea adjacent to Great Britain and
any designated area within the United Kingdom
Continental Shelf (UKCS). They replace the
Offshore Installations (Safety Case) Regulations
2005 in these waters, subject to certain
transitional arrangements (HSE, 2015b).

These key events from 1974 to the present day
have shaped the modern SC into what it is today.
After identifying these key regulations and
incidents it was possible to plot them against the
number of ship to platform collision incidents.
This is the subject of the data gathering and
statistical analysis that follow.


3 STATISTICAL ANALYSIS & DISCUSSION 


The current database of ship to platform
collisions provided by the HSE is outdated, as it
was last published in 2001. Similarly, the OGP
(Oil & Gas Producers) produced a document in 2010
of worldwide collision statistics and Oil & Gas UK
produced a document of accident statistics on the
UKCS for offshore units between 1990 and 2007
(HSE, 2003) (OGP, 2010) (Oil & Gas UK, 2009).
However, the OGP document provides only the
frequency of collision incidents over key offshore
and shipping areas around the world, and the Oil &
Gas UK document lists all accident statistics with
overall frequencies per year. These three
documents are sufficient to demonstrate the trend
between offshore collision incidents and offshore
regulations. Therefore, a statistical analysis is
conducted for ship to platform collisions from
1971 to 2015 across the UKCS.

For this study an incident has been defined as a
reported impact or contact between a vessel and a
fixed or mobile installation in terms of the
Reporting of Injuries, Diseases and Dangerous
Occurrences Regulations (RIDDOR 2013) database,
which utilizes reported incident information from
the OIR/9b and F2508A forms.

The 2001 ship/platform collision incident database
has been further cross-checked and extended. The
complete list contains a total of 508 incidents of
vessels impacting or contacting both fixed and
floating offshore structures. These incidents were
reported from 1st January 1971 to 31st December
2015.

The data has been recorded from a number of 
sources. In many cases the data is supplemented 
or confirmed from additional sources. Data across 
the whole study has been compiled from the fol- 
lowing sources: 


• RIDDOR 2013, utilizing search criteria
“Collisions, or potential collisions”, between
“vessels and offshore installations”. Information
source is labelled as HSE in the database.

• World Offshore Accident Databank (WOAD) using
the search criteria “Collision”, “Offshore Units”
and “Europe North Sea”.

• Marine Accident Investigation Branch (MAIB)
using the search criteria “Offshore
installations”, “collision” and “contact”.

• Global Integrated Shipping Information System
(GISIS) using search criteria “collisions” and
“North Sea”.

• World Energy Related Casualties (WREC) using
search criteria “offshore installations”,
“collisions” and “North Sea”.


In order to demonstrate a coherent analysis, 
an incident frequency and a cumulative incident 
frequency has been calculated based upon the 
approximate level of operating experience per year. 
Table 1 demonstrates the operating experience, the 
number of incidents that have been reported and 


Table 1. List of all reported collision incidents on the UKCS, as well as operating experience and incident frequencies. 

Year  Total       Cumulative  All        Cumulative  Frequency  Cumulative
      experience  experience  incidents  incidents              frequency
      (N)         (N1)        (r)        (r1)        (λ)        (λ1)
1971  -           -           1          -           -          -
1972  -           -           0          -           -          -
1973  -           -           1          -           -          -
1974  -           -           2          -           -          -
1975  89          89          12         12          0.135      0.135
1976  93          182         12         24          0.129      0.132
1977  102         284         20         44          0.196      0.155
1978  103         387         19         63          0.184      0.163
1979  107         494         25         88          0.234      0.178
1980  113         607         17         105         0.150      0.173
1981  121         728         29         134         0.240      0.184
1982  132         860         23         157         0.174      0.183
1983  142         1002        20         177         0.141      0.177
1984  165         1167        12         189         0.073      0.162
1985  177         1344        18         207         0.102      0.154
1986  169         1513        13         220         0.077      0.145
1987  174         1687        6          226         0.034      0.134
1988  195         1882        7          233         0.036      0.124
1989  210         2092        18         251         0.086      0.120
1990  262         2354        24         275         0.092      0.117
1991  281         2635        19         294         0.068      0.112
1992  272         2907        25         319         0.092      0.110
1993  270         3177        14         333         0.052      0.105
1994  276         3453        12         345         0.043      0.100
1995  289         3742        3          348         0.010      0.093
1996  262         4004        8          356         0.031      0.089
1997  271         4275        16         372         0.059      0.087
1998  278         4553        16         388         0.058      0.085
1999  291         4844        14         402         0.048      0.083
2000  300         5144        13         415         0.043      0.081
2001  307         5451        10         425         0.033      0.078
2002  308         5759        7          432         0.023      0.075
2003  311         6070        5          437         0.016      0.072
2004  313         6383        4          441         0.013      0.069
2005  314         6697        7          448         0.022      0.067
2006  315         7012        6          454         0.019      0.065
2007  331         7343        11         465         0.033      0.063
2008  337         7680        8          473         0.024      0.062
2009  338         8018        4          477         0.012      0.059
2010  332         8350        5          482         0.015      0.058
2011  332         8682        6          488         0.018      0.056
2012  335         9017        3          491         0.009      0.054
2013  337         9354        7          498         0.021      0.053
2014  340         9694        3          501         0.009      0.052
2015  331         10025       3          504         0.009      0.050




the frequency of these incidents. It can be seen 
that 4 incidents were recorded from 1971 to 1974; 
however, they are not part of the overall frequency 
analysis as there is limited data regarding operating 
experience for those years. These four incidents are 
confirmed only by WOAD, as the HSE did not begin 
recording incidents until 1975. 
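
The frequencies in Table 1 appear to follow from dividing the incidents by the operating experience, λ = r/N, and the cumulative incidents by the cumulative experience, λ1 = r1/N1. This relationship is inferred from the tabulated values rather than stated explicitly. A minimal Python sketch under that assumption, using the 1975 to 2015 rows of Table 1 (the names N, R and frequencies are illustrative only):

```python
from itertools import accumulate

# Operating experience N and reported incidents r for 1975-2015,
# transcribed from Table 1 (1971-1974 are omitted: no experience data).
YEARS = list(range(1975, 2016))
N = [89, 93, 102, 103, 107, 113, 121, 132, 142, 165, 177, 169, 174, 195,
     210, 262, 281, 272, 270, 276, 289, 262, 271, 278, 291, 300, 307, 308,
     311, 313, 314, 315, 331, 337, 338, 332, 332, 335, 337, 340, 331]
R = [12, 12, 20, 19, 25, 17, 29, 23, 20, 12, 18, 13, 6, 7, 18, 24, 19, 25,
     14, 12, 3, 8, 16, 16, 14, 13, 10, 7, 5, 4, 7, 6, 11, 8, 4, 5, 6, 3,
     7, 3, 3]


def frequencies(n, r):
    """Return cumulative experience, cumulative incidents and both frequencies.

    Assumes frequency = r / N and cumulative frequency = r1 / N1.
    """
    n1 = list(accumulate(n))  # cumulative experience N1
    r1 = list(accumulate(r))  # cumulative incidents r1
    lam = [ri / ni for ri, ni in zip(r, n)]        # yearly frequency
    lam1 = [ri / ni for ri, ni in zip(r1, n1)]     # cumulative frequency
    return n1, r1, lam, lam1


n1, r1, lam, lam1 = frequencies(N, R)
for year, a, b in zip(YEARS, lam, lam1):
    print(year, round(a, 3), round(b, 3))
```

Under this assumption the sketch reproduces, for example, the tabulated frequency of 0.135 for 1975 and the cumulative frequency of 0.050 for 2015.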

Figure 1 demonstrates the number of ship to 
platform collision incidents from 1971 to 2015, as 
well as the key regulations and incidents outlined 
previously (GL, 2017; HSE, 2016; MAIB, 2016). 

It can be seen from Figure 1 that the number 
of ship to platform collision incidents from 1971 
to 2015 fluctuates considerably, as demonstrated 
more clearly by the average trend-line. At first 
glance, this trend seems rather erratic, following no 
logical pattern. However, when the milestones in 
the safety case regulation timeline are taken into 
consideration, patterns begin to emerge in the 
number of incidents each year in the UKCS. 

Initially, from 1971 to 1974 the number of incidents 
is very low, at no more than two per year. A possible 
reason for this is that the data entries for 1971 
to 1974 are from WOAD only, as the HSE began 
recording ship to platform collisions in 1975. 
From 1975 to 1981, however, the number of incidents 
per year rises sharply, from 12 to 29. There are a 
number of possible causes for this rapid increase. 
Firstly, the HSWA was enforced from 1974; hence, 
the recognition of dangerous incidents that can 
cause harm to personnel increased. Secondly, 
as more dangerous incidents were recognised, the 
need to report them also increased. It is therefore 
reasonable to suggest that an increased awareness 
of dangerous situations, coupled with the need to 
report these incidents, gave rise to a marked 
increase in the number of reported collision 
incidents. Thirdly, according to the HSE, 
the approximate number of installations operating 
in the UKCS increased from 89 in 1975 to 121 
in 1981. A larger number of operating platforms 
would be expected to increase the number 
of collisions over that period. 

From 1981, however, the number of incidents 
per year decreases until 1987, from 29 to 6. 
This decrease is much greater than the increase 
in incidents from 1975 to 1981. It is possible that 
the enforcement of the HSWA had a large effect on 
the safety procedures on offshore platforms in the 
UKCS. Over the same period the approximate number 
of platforms operating in the UKCS increased from 
121 in 1981 to 174 in 1987, so the earlier expectation 
that the number of incidents would rise with the 
number of platforms in operation does not hold 
between 1981 and 1987. This further supports the idea 
that the regulations from 1974 were increasingly 
enforced and reduced the number of incidents. 
However, it is also possible that the level of 
reporting of collision incidents decreased. This is 
a much more difficult claim to validate, as there is 
no way to determine whether an incident has happened 
but has not been reported. This is part of the 
reasoning behind the increase in research on offshore 
dynamic asset integrity modelling and autonomous 
systems: the most common reason a detector or sensor 
fails to detect and log information is that it is 
faulty, whereas a human is able to choose not to carry 
out an action. Hence it is difficult to determine 
accurately the level of underreporting that may have 
taken place between 1981 and 1987. 

Figure 1. Graph demonstrating the number of ship to platform collision incidents per year, as well as the key regulations and events that formed the modern safety case. Legend: H&S at Work Act (1974); Off. Inst. (Safety Reps. & Comm.) Regs. (1989); Safety Case Regulations (1992); Off. Inst. (Safety Case) Regs. (2005); incident trendline. 

Continuing, the period between 1988 and 1994 is 
very interesting in terms of collision incidents. 
The year 1988 is well known in the offshore 
industry, and indeed the world, as the year of the 
Piper Alpha disaster, in which 167 crew members 
lost their lives in July of that year. When one 
examines the collision incidents that were reported 
in 1988, approximately 70% were reported after the 
loss of Piper Alpha on 6th July. This may suggest 
that a large-scale disaster, such as Piper Alpha, 
triggered an increase in the level of incident 
reporting. However, the number of collision incidents 
in 1988 alone is not enough to state this with any 
conviction. What is interesting, however, is that 
the number of collision incidents increases from 7 
in 1988 to 18 in 1989. This is a drastic increase in 
the number of reported incidents in the 
UKCS following a large-scale offshore disaster. 

Furthermore, in 1989 the Offshore Installations 
(Safety Representatives & Safety Committee) 
Regulations were published. These stated that the 
workforce could elect safety representatives from 
amongst themselves, which may have increased the 
level of reporting of collision incidents in 1989. 
However, the increase from the previous year 
appears too large to attribute conclusively to the 
new regulations alone. It seems much more likely 
that a combination of the Piper Alpha disaster and 
the release of the Offshore Installations (Safety 
Representatives & Safety Committee) Regulations 
contributed to the large increase in reported 
collision incidents. 

Continuing, in 1990 the Cullen Report, the public 
inquiry into the Piper Alpha disaster, was published. 
The report was heavily critical of the platform 
operators. Lord Cullen made a total of 106 
recommendations within his report, all of which 
were accepted by industry. The responsibility for 
implementing them was spread across the regulators 
and the industry: the HSE oversaw 57, the operators 
were responsible for 40, the offshore industry as a 
whole was given 8 to progress, and the final one was 
for the Standby Ship Owners Association. The industry 
acted urgently to carry out the 48 recommendations 
for which the operators and the wider industry were 
responsible (40 and 8 respectively). By 1993 all had 
been acted upon and substantially implemented. 
Furthermore, the HSE developed and implemented Lord 
Cullen’s key recommendation: the introduction of 
safety regulations requiring the operator/owner of 
every fixed and mobile installation operating in UK 
waters to submit a safety case to the HSE for 
acceptance. Hence, in 1992 the Offshore Installations 
(Safety Case) Regulations came into force. By 
November 1993 a safety case for every installation 
had been submitted to the HSE, and by November 
1995 all had been accepted. 

If the number of collision incidents is examined 
from the Cullen Report in 1990 to the acceptance of 
all installation Safety Cases in 1995, it can 
be seen that the number of incidents per year 
decreases rapidly, from 24 to 3. This is again a 
very large fluctuation following the enforcement of 
a series of key regulations. It suggests that the 
release of new regulations prompts the level of 
incidents to decrease as the regulations are 
enforced. However, as 1995 is a number of years 
after the Cullen Report and the introduction of 
Safety Cases, it is possible that an element of 
complacency in reporting had set in. This can be 
seen from the number of incidents between 1995 and 
2004: the number of collision incidents increases 
from 3 in 1995 to a peak of 16 in 1998, then falls 
to a new low of 4 in 2004. What is key is that this 
fluctuation does not follow the common trend of 
more incidents given more operating installations. 
From 1995 to 1996 the approximate number of 
installations decreases from 289 to 262, then 
steadily increases to 313 in 2004. The number of 
incidents nevertheless decreases after 1998 while 
the approximate number of installations increases. 
This could be attributed to a substantial level 
of regulation enforcement leading to a decrease 
in incidents, or there may be an anomaly in the 
approximate number of installations and incidents, 
given that the approximate number of installations 
generally increases from 1975 to 1995, then suddenly 
decreases. It is also at this time that the HSE 
began utilising the RIDDOR regulations (from 1st 
April 1996). 

What appears more likely is that, at the low 
point of 3 collisions in 1995, a new set of 
regulations was introduced and enforced: the Offshore 
Installations (Prevention of Fire and Explosion, 
and Emergency Response) Regulations (PFEER), along 
with the Offshore Installation (Design & Construction) 
Regulations in 1996. At that point, the number of 
incidents increases and peaks in 1998. It would be 
wrong to conclude that the introduction of new 
offshore regulations causes offshore incidents. What 
is far more likely is that the introduction of new 
regulations prompts more proactive and accurate 
incident reporting. 

In addition, this trend can be seen yet again 
from 2004 to 2012, where the number of collision 
incidents per year increases from 4 in 2004 to 11 
in 2007, then decreases to 3 in 2012. This could 
be attributed to the Offshore Installations (Safety 
Case) Regulations 2005 being enforced in 2006. As 
with the regulations in 1995 and 1996, the number 
of incidents increases and then begins to decrease. 
However, the number of collision incidents is much 
steadier and does not fluctuate as much as in 
previous years: the increase from 4 in 2004 to 11 in 
2007 is not large and could be said to be anomalous 
when looking at the number of incidents in 
that period. It is an increase nonetheless. 

Furthermore, in 2008 the HSE introduced the 
information sheet Safety Zones around Oil & Gas 
Installations in Waters around the UK. This 
specifically targets offshore collisions, 
near misses and general safety when operating in 
an installation’s 500 m zone. It is therefore 
plausible that its introduction has helped to 
maintain a steady level of incidents between 2008, 
with 8, and 2015, with 3. 

Furthermore, by applying a calculated incident 
frequency and cumulative frequency, based upon the 
operating experience on the UKCS from 1975 to 2015, 
it can be seen that the frequency of incidents has 
drastically decreased over the 40 year period. This 
is demonstrated by Figure 2, which shows the trend 
of incident frequencies per year as well as the 
cumulative incident frequency per year. What can be 
seen in Figure 1, and much more clearly in Figure 2, 
is that the average frequency of incidents has 
generally decreased since 1981. This indicates that 
over the 40 year period from the introduction of the 
HSWA to the current amended Safety Case regulations, 
the enforced regulations have had a large 
impact on installation safety in terms of collision 
incidents. While the number of incidents has 
generally decreased, the approximate number of 
operating installations has increased. 

Figure 2. Incidents frequency and cumulative frequency for ship to platform collisions per year from 1975 to 2015. Legend includes: Cumulative Frequency (λ1). 
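
As a rough illustration of this point, the Table 1 entries for 1981 and 2015 can be compared directly. The sketch below assumes that the yearly frequency is the number of incidents divided by the approximate number of installations; the names table_1 and frequency are illustrative and do not come from the study.

```python
# (installations N, incidents r) for two reference years, taken from Table 1.
table_1 = {1981: (121, 29), 2015: (331, 3)}


def frequency(year):
    """Incidents per installation for the given year (assumed to be r / N)."""
    n, r = table_1[year]
    return r / n


# How much lower the 2015 frequency is than the 1981 frequency.
reduction_factor = frequency(1981) / frequency(2015)
# How much the approximate number of installations grew over the same span.
installation_growth = table_1[2015][0] / table_1[1981][0]

print(round(frequency(1981), 3), round(frequency(2015), 3))
print(round(reduction_factor, 1), round(installation_growth, 2))
```

On these figures the number of installations more than doubles while the incident frequency falls by more than an order of magnitude.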


4 CONCLUSION 


From the information presented in Figures 1 and 2 
as well as Table 1, it can be seen that the offshore 
industry can be said to be reactive in its approach 
to reporting incidents, especially in the area of ship 
to platform collisions. What is also apparent is 
that the fluctuation has become gradually smaller 
in more recent times. This suggests that introducing 
and amending regulations over time has had a positive 
effect on the overall trend of collision incidents. 
While this study identifies trends in 
ship to platform collisions, it would still be valid 
to state that the offshore industry would profit 
greatly from having a dynamic risk monitoring 
tool to aid with the continual enforcement of 
regulations across all areas of an offshore platform. 
In the near future, a widely accepted and integrated 
dynamic asset integrity monitoring tool could be a 
distinct possibility. 

ABSTRACT: Safety analyses for nuclear facilities are performed in a conservative manner so that there 
is additional assurance that the nuclear facilities are safe to operate given the potential events that may 
impact or challenge the facilities. The degree of conservatism included in the analysis should be well known 
and should be increased if there is a large degree of uncertainty associated with the analysis. If uncertainty is 
decreased, through better data or more sophisticated and/or rigorous analysis, a decrease in conservatism 
can be made without impacting the margin of safety of the design. 


1 INTRODUCTION 


1.1 Objective 


Providing conservatism in the safety analysis of a 
facility results in additional assurance that the facil- 
ity is safe to operate. This paper explores how the 
degree of conservatism and the degree of uncer- 
tainty of the safety analysis are linked in determin- 
ing the margin of safety of the facility design. The 
objective of this research is to improve the under- 
standing of the theoretical and technical basis for 
ensuring the safety of a facility design. 


1.2 Approach 


This research included the following steps. 


• Reviewing definitions for conservatism 
• Reviewing national and international practices 
  regarding use of conservative inputs into safety 
  analyses 
• Evaluating conservative and best estimate analyses 
  and a potential method for reducing conservatism 
  while maintaining safety margins 


A follow-on paper is planned that will evaluate 
some potential applications of the method to reduce 
conservatism while maintaining safety margins. 


2 CONSERVATISM DEFINITION 


2.1 Definition 


A discussion of conservatism, as it relates to safety 
analysis, is found in the U.S. Nuclear Regulatory 
Commission’s (NRC) NUREG-2122, Glossary of 
Risk-Related Terms in Support of Risk-Informed 
Decision-making (NRC 2013). NUREG-2122 
takes a holistic approach in describing conservatism; 
it defines the term conservative in combination 
with “analysis” and an additional qualifying 
term, “demonstrably”, thus implying that an analysis 
must be evaluated as a whole. A “demonstrably 
conservative analysis” is: “An analysis that 
uses assumptions such that the assessed outcome 
is meant to be less favorable than the expected 
outcome.” 

The more detailed discussion in NUREG- 
2122 goes on to explain that a “demonstrably 
conservative analysis provides a result that may 
not be the worst result of a set of outcomes, but 
produces a quantified estimate of a risk metric 
that is significantly greater than a risk metric esti- 
mate produced using the most realistic information 
available” [emphasis added]. Thus, a conservative 
analysis can be described as being located between 
a bounding analysis and a best estimate analysis— 
but skewed towards the bounding case and signifi- 
cantly away from the results of the best estimate 
analysis. The relationship among the terms “con- 
servative”, “bounding”, and “best estimate” is dis- 
cussed further below. 

A guide by the United Kingdom’s (UK) Office 
for Nuclear Regulation (ONR), Safety Assessment 
Principles for Nuclear Facilities (UK ONR 2014) 
defines conservatism as: 


In analysis, an approach where the use of models, 
data and assumptions would be expected to lead to a 
result that bounds the best estimate (where known) 
on the safe side. The degree of conservatism should 
be proportionate to both the level of uncertainty and 
the overall significance of the estimate to the safety 
case. 


This definition is informative, as it links the 
degree of conservatism to both the level of 
uncertainty and the “significance of the estimate”, 
i.e., the importance of the parameter being 
addressed; however, what the ONR means by 
“bounds the best estimate” is not clear. 


3 NATIONAL AND INTERNATIONAL 
PERSPECTIVES ON CONSERVATIVE 
SAFETY ANALYSES 


3.1 U.S. Nuclear Regulatory Commission 


The NRC’s Safety Goal Policy Statement (NRC 
1986) states that, “to provide adequate protection 
of the public health and safety, current NRC 
regulations require conservatism in design, 
construction, testing, operation and maintenance of 
nuclear power plants” [emphasis added]. Relative 
to conservatism in accident analysis, NRC Regulatory 
Guide 1.203, Transient and Accident Analysis 
Methods (NRC 2005) states that: 


... results of an analysis can be conservative due to 
a combination of code input and modeling assumptions.... 
However, conservatism in just one aspect of 
[a model] ... cannot be used to justify conservatism 
in the [model] as a whole, because other aspects of 
the model may be non-conservative and cause the 
overall model to be non-conservative. The degree of 
conservatism in the overall model must be quantified 
and documented. Showing the degree of conservatism 
in [a model] ... may be accomplished by 
a relatively simple uncertainty analysis.... The key 
to simplifying the uncertainty analysis is identifying 
the small number of parameters and physical phenomena 
that are important in determining the behavior 
of the accident. 


The issue of conservatism is also tied somewhat 
to the issue of probabilistic versus deterministic 
approaches; in probabilistic approaches, conservatism 
and uncertainty can be specifically evaluated, 
whereas in a deterministic approach there is often 
less information to support an understanding of 
the degree of conservatism. This was discussed 
in the NRC’s NUREG/CR-7168, Regulatory 
Approaches for Accessing Facility Risks (NRC 
2015). A companion issue to that of probabilistic 
versus deterministic approaches is whether analyses 
should be based on data and computational 
methodologies that represent the best estimate of 
what might really occur, with uncertainty analysis 
to explore the effects of incorrect data or models, 
or should be based on demonstrably conservative 
data and models. Most regulations and license 
applications have used a conservative, deterministic 
approach. The NRC has identified problems 
with using this approach, as discussed in Appendix C 
of NUREG-1909, Background, Status, and 
Issues Related to the Regulation of Advanced Spent 
Nuclear Fuel Recycle Facilities (NRC 2008). Two 
of the most important problems identified were: 
(1) that using very conservative assumptions can 
mask risk-significant items, and (2) that most 
conservative analyses are not accompanied by an 
uncertainty analysis. 

The NRC has also addressed the issue of conservatism 
in thermal hydraulic code analysis and 
provided guidance on how best-estimate calculations 
can be utilized in place of conservative models. 
Specifically, in Regulatory Guide 1.157, Best-Estimate 
Calculations of Emergency Core Cooling System 
Performance (NRC 1989), the NRC states that: 


the NRC staff amended the requirements of § 50.46 
and Appendix K, “ECCS Evaluation Models” (53 
FR 35996), so that these regulations reflect the 
improved understanding of ECCS performance dur- 
ing reactor transients that was obtained through the 
extensive research performed since the promulgation 
of the original requirements in January 1974. Para- 
graph 50.46(a)(1) now permits licensees or appli- 
cants to use either Appendix K features or a realistic 
evaluation model. These realistic evaluation models 
must include sufficient supporting justification to 
demonstrate that the analytic techniques employed 
realistically describe the behavior of the reactor 
system during a postulated loss-of-coolant accident. 
Paragraph 50.46(a)(1) also requires that the 
uncertainty in the realistic evaluation model be quantified 
and considered when comparing the results of the 
calculations with the applicable limits in paragraph 
50.46(b) so that there is a high probability that the 
criteria will not be exceeded. 


For the purpose of the above regulatory guide, 
the terms “best-estimate” and “realistic” have the 
same meaning. Both terms are used to indicate 
that the techniques attempt to predict realistic 
reactor system thermal-hydraulic response; though 
best-estimate is not used in a statistical sense in this 
guide. 

The use of conservative values has been investigated 
in International Atomic Energy Agency 
(IAEA) Safety Report 52, Best Estimate Safety 
Analysis for Nuclear Power Plants: Uncertainty 
Evaluation (IAEA 2008), and IAEA Publication 1428, 
Deterministic Safety Analysis for Nuclear Power 
Plants (2014). These documents discuss the practices, 
benefits and downsides associated with the use of 
conservative and best estimate analyses. The IAEA 
notes that for accident scenarios with large estimated 
margins to acceptance criteria, “it is appropriate for 
simplicity, and therefore, economy, to use conservative 
analysis”; however, “for scenarios in which the 
margin is smaller, a best estimate is necessary...” 

The IAEA goes on to state that the purpose of best 
estimate analysis in this instance is “to quantify 
the conservatism”; that is, “to show the margins to 
the acceptance criteria that apply in reality.” 
However, the IAEA discussion of best estimate analysis 
in IAEA Safety Report 52 is always combined with 
a focus on evaluation of uncertainties; that is, 
the IAEA discussion is of “best estimate analysis 
together with evaluation of uncertainties.” The use 
of this type of methodology is qualified with the 
caution that “realistic input data are used only if 
the uncertainties or their probabilistic distributions 
are known.” For data that do not meet this 
test, “conservative values” should be used. 

An alternate approach to conservative analysis, 
referred to as “best estimate plus uncertainty,” is 
discussed in IAEA Safety Report 52. The benefits 
of the best estimate plus uncertainty approach 
described there are: (1) it provides more realistic 
information about the physical behavior of the 
facility and thus assists in identifying the most 
relevant safety parameters, (2) it avoids the risk 
that conservative assumptions lead to predicting 
an incorrect event progression or to excluding 
relevant physical phenomena, and (3) it provides 
information about safety margins, which is not always 
obvious in conservative deterministic analyses. However, 
moving toward best estimate analysis with uncertainty 
involves several challenges. For example, it is 
difficult to develop a relevant, validated best 
estimate computational methodology for the analysis. 
In addition, sufficient data on critical parameters 
must be available so that a statistically valid 
probability distribution function can be developed. 
IAEA Publication 1332, Safety Margins of 
Operating Reactors: Analysis of Uncertainties and 
Implications for Decision Making (2003), discusses 
how, in safety analyses, it is customary to demonstrate 
that adequate margins exist between the true 
(but unknown) values of important, safety-related 
parameters of interest and the corresponding regulatory 
limits (requirements or physical limits) that, 
if exceeded, would result in adverse consequences 
(e.g., release of radioactivity). 

Safety margins traditionally are established 
by relying on conservative models, conservative 
assumptions, and conservative interpretation 
of the analysis results. This approach has served 
the nuclear industry well; however, it can lead to 
employing additional safety measures and barriers 
that may not be strictly required, resulting in 
costlier and “overbuilt” designs. Also, the conservative 
approaches are not routinely able to identify the 
amount or degree of conservatism, nor do they 
describe the degree of confidence in the resulting 
conclusions and safety features. 


4 TECHNICAL EVALUATION OF 
CONSERVATIVE AND BEST ESTIMATE 
ANALYSIS 

4.1 Introduction to conservative analysis 
and safety margins 


To illustrate the conservative approach to safety 
analysis, consider Figure 1 (shown at end of this 
paper). In the conservative approach, the results 
are expressed in terms of a set of deterministically 
calculated values for the safety parameters of 
interest (e.g., radioactive dose) that are expected to 
be more pessimistic than the true values of these 
parameters. The difference between a conservative 
estimate of the safety parameter of interest and 
the regulatory/requirement limit is called the safety 
margin. Conservatism is intended to make the 
calculated deterministic value more limiting than the 
true (but unknown) value of a safety parameter of 
interest, to assure that the estimated safety margin 
is smaller than the true safety margin. The difference 
between the true safety margin and the estimated 
safety margin has been described by the 
IAEA (IAEA 2014) as the overbuilt safety margin. 

As illustrated in Figure 1 (figures are shown 
at the end of this paper), when deterministically 
analyzing the safety parameter of interest the con- 
servative safety analysis approach finds a single 
value of that parameter to compare to the regula- 
tory limit/requirement (referred to as the “accept- 
ance criterion” by the IAEA) and to determine if 
it is below that requirement/limit with a minimum, 
but undefined, amount of safety margin. In the 
conservative deterministic approach the degree of 
conservatism, and the true value of the parameter 
in relation to the conservative estimates, remain 
unknown. Generally, analysts believe that the 
conservative estimate provides an estimate of the 
safety margin smaller than the true margin, which 
by definition leaves some “overbuilt” margin. 

As a simple example, consider the following. 
Say the Regulatory Limit for the dose to the 
maximally exposed offsite individual (MOI) is 25 rem 
Committed Effective Dose Equivalent (CEDE). 
Suppose a Conservative Estimate of the dose to the 
MOI in a given accident scenario is calculated to be 
5 rem, while the True Value of the dose to the MOI 
might be 0.5 rem (if the accident scenario, accident 
parameters, and phenomena were exactly known). 

In this case the: 


• Design Safety Margin is 20 rem 
• True Safety Margin is 24.5 rem 
• Overbuilt Margin is 4.5 rem 


Figure 1. The concept of safety margins in conservative safety analysis. (Diagram labels along the axis of the safety parameter of interest: true value of the safety parameter; conservative estimate of the safety parameter; regulatory limit. Spans shown: Safety Margin; Overbuilt Margin; True Safety Margin.) 


Another way to look at the margins is as factors 
below the regulatory limit. In the above example, 
the Safety Margin is anticipated to be a factor 
of 5 (25 rem / 5 rem) and the True Safety Margin is 
a factor of 50 (25 rem / 0.5 rem). The Overbuilt 
Margin of 4.5 rem amounts to 90% of the conservative 
estimate, since the true value is a further factor 
of 10 below it. 
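
The arithmetic of this example can be checked with a short Python sketch. The function name safety_margins and the variable names are illustrative only; the numbers are those of the example above.

```python
def safety_margins(limit, conservative, true_value):
    """Return (design margin, true margin, overbuilt margin) in dose units."""
    design = limit - conservative       # margin seen by the conservative analysis
    true_margin = limit - true_value    # margin that actually exists
    overbuilt = true_margin - design    # equals conservative - true_value
    return design, true_margin, overbuilt


# Values from the example: limit 25 rem, conservative 5 rem, true 0.5 rem.
LIMIT, CONSERVATIVE, TRUE_VALUE = 25.0, 5.0, 0.5
design, true_margin, overbuilt = safety_margins(LIMIT, CONSERVATIVE, TRUE_VALUE)

# The same margins expressed as factors.
design_factor = LIMIT / CONSERVATIVE           # limit / conservative estimate
true_factor = LIMIT / TRUE_VALUE               # limit / true value
conservatism_factor = CONSERVATIVE / TRUE_VALUE  # conservative / true value

print(design, true_margin, overbuilt)
print(design_factor, true_factor, conservatism_factor)
```

In practice the true value is unknown, so only the design margin can be computed from a conservative analysis alone.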

The means of assuring conservatism in deter- 
ministic safety analysis is the use of conservative 
models, codes, assumptions and data with the 
anticipation that these collectively yield pessimistic 
estimates of the safety parameters of interest, rela- 
tive to the regulatory limit/requirements. An alter- 
native approach to conservatism in safety analysis 
is the best estimate approach (as described in the 
NRC Regulatory Guide 1.157) to the assessment 
of the safety parameters of interest that includes 
formal quantification of associated uncertainties 
(or “best estimate plus uncertainty” to use the 
IAEA terminology). 

In the best estimate plus uncertainty approach, 
best estimate models and computational methods, 
field and experimental data, and realistic assumptions 
are used to estimate the safety parameter 
of interest. Clearly, the availability of such tools 
and data is critical to making this approach feasible. 
However, depending on the amount of data and 
information available, sometimes the best estimate 
approach can only provide a rough estimate of the 
uncertainties, which may need to be supplemented 
with some conservative assumptions. Conversely, 
when data and information are abundant, the best 
estimate results are frequently associated with less 
uncertainty. In the best estimate plus uncertainty 
approach, the amount of uncertainty may be 
expressed explicitly by obtaining the probability 
distribution of the safety parameter of interest and 
quantifying the safety margin. This concept is 
discussed in more detail in the following section. 


4.2 Best estimate plus uncertainty approach to 
safety analysis 


In the best estimate plus uncertainty approach, the 
safety parameter of interest is treated as a random 
variable and the probability distribution within which 
the true value of the safety parameter of interest 
resides is estimated. Consider Figure 2, which shows 
a hypothetical distribution of a parameter. The true, 
but unknown, value of the safety parameter of inter- 
est resides somewhere within the span of this distri- 
bution. The regulatory limit/requirement of interest 
is shown within the range of this distribution. 

Also shown in Figure 2 is the desired safety 
margin, set to account for “unknown-unknowns.” 
As such, the safety parameter of interest should lie 
below the regulatory limit/requirement by at least 
this prescribed, desired margin. 

In the best estimate plus uncertainty approach, 
the distribution of the safety parameter of interest, δ, 
is expressed by the probability density function 
(distribution function), f(δ), which is obtained by using 
realistic data, best-estimate models and codes. The 
best-estimate approach does not use conservative 
assumptions. Once developed, the distribution, f(δ), 
may be used to find the probability that the true (but 
unknown) value of the safety parameter of interest 
(expressed by the random variable, δ) is below the 
regulatory limit/requirement, D. In this approach it 
is this probability, and the confidence associated with it, 
that forms the basis for safety decision-making. As 
such, in the best estimate plus uncertainty approach, 
the probability that the true margin, (D - δ), exceeds 
the desired safety margin, Δ, is expressed by: 


$$\Pr[(D - \delta) > \Delta] = \int_{-\infty}^{D - \Delta} f(\delta)\, d\delta$$


Figure 2. Conceptual depiction of the probability density function of a safety parameter.
(Figure label: true value of the safety parameter for a fixed design, unknown.)


Also, the probability that the requirement is not
met would be:

\[ \Pr(\delta > D) = \int_{D}^{\infty} f(\delta)\, d\delta \]

The benefit of the best estimate plus uncertainty
method is in its characterization of the safety margin
in light of the information, data and other evidence
that is available. As the amount of such data
increases, it is natural to expect that f(δ) becomes
narrower (with smaller spread) and represents a
small span within which the true value of the safety
parameter of interest, δ, is most likely located. This
concept is illustrated in Figure 3, assuming that
more information about δ was available, resulting
in a much narrower f(δ) compared to the
probability distribution in Figure 2.


Figure 3. Potential effect of more information: true value of δ distributes narrowly.
(Figure labels: overbuilt margin; desired safety margin Δ; regulatory limit D; the true value and true mean overlap in this figure.)

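
The two probabilities discussed above can be evaluated numerically once a form for f(δ) is chosen. The following is a minimal sketch, assuming purely for illustration that f(δ) is a normal distribution and reading the event (D - δ) > Δ as δ < D - Δ; the function name and all numerical values are hypothetical and do not come from the paper.

```python
from statistics import NormalDist


def margin_probabilities(mean, std, limit, desired_margin):
    """Return (Pr[(D - delta) > Delta], Pr[delta > D]) for a normal f(delta)."""
    f_delta = NormalDist(mu=mean, sigma=std)
    # Pr[(D - delta) > Delta] is the same event as delta < D - Delta.
    p_margin_met = f_delta.cdf(limit - desired_margin)
    # Pr[delta > D]: probability that the requirement is not met.
    p_requirement_not_met = 1.0 - f_delta.cdf(limit)
    return p_margin_met, p_requirement_not_met


# Hypothetical numbers: same best estimate, two levels of uncertainty.
wide = margin_probabilities(mean=80.0, std=10.0, limit=100.0, desired_margin=5.0)
narrow = margin_probabilities(mean=80.0, std=3.0, limit=100.0, desired_margin=5.0)

print(f"wide f(delta):   margin met {wide[0]:.4f}, limit exceeded {wide[1]:.4f}")
print(f"narrow f(delta): margin met {narrow[0]:.4f}, limit exceeded {narrow[1]:.2e}")
```

With the same best estimate, the narrower distribution gives a higher probability that the desired margin is met, which is the effect of additional information that the text describes.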

The best estimate plus uncertainty approach,
as depicted in Figures 2 and 3, illustrates the cases
where the true margin could become very large
(for example, due to conservative initial design) by
showing that Pr[(D - δ) > Δ] could be close to
unity, and highlights the presence of a large overbuilt
margin. Such cases can provide a rationale to
revisit the need for such large margins. This insight
can only be achieved by a best estimate plus uncertainty
approach, where the uncertainties associated
with the margins are formally and quantitatively
assessed, and confidence levels established, allowing
the overbuilt margins to be explicitly defined,
explored and evaluated as to their necessity.


4.3 Best-estimate models/codes versus 
conservative models 


In determining the probability distribution, f(δ),
of the safety parameter of interest, one needs to 
have access to validated best estimate models and 
computational methodologies. In the conservative 
approach, estimates of the safety parameters of 
interest are obtained from conservative or bound- 
ing assumptions, and/or conservative models. To 
compare these two approaches, consider Figure 4. 
This figure shows a case where data about a safety 
parameter of interest are available. The conserva- 
tive model (top-left branch), exemplifies an empiri- 
cal (linear) model that bounds the data—meaning 
all the data fall below the model. As such, when
the independent variable, x, takes the value x_i,
the model estimates a dependent value (e.g., for a
safety parameter of interest) of y_i, which is higher
(more pessimistic) than all the data (evidence) that 
exist (note: if desired, it is possible to account for 
unobserved data, and draw the line above the clus- 
ter of the data with an additional margin, to cover 
for the unknown-unknowns). 

Conversely, the best estimate approach would
fit the empirical line (model) to the data using
a regression analysis, including the quantiles that
describe the uncertainty about this model (the
top-right model and the bottom-right model are best
estimate fits to the data). The model's upper and
lower quantiles represent the model uncertainty
using the residuals (expressed by the model error,
ε). Unlike the bounding model, the best estimate
model, for a given value of the independent
variable x, produces a probability distribution for y.
Similar to the conservative analysis, it may become
necessary to make the best estimate model more
conservative by introducing a bias to account
for the unobserved data. The bottom left branch
shows the same regression model as the lower
right, but with an added conservative bias.
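
The contrast between a bounding model and a best estimate fit can be sketched numerically. The example below is only an illustration under stated assumptions: the data are invented, the bounding line is taken to have the same slope as the least-squares line (shifted up until all data lie on or below it), and the uncertainty band uses approximate 5% and 95% quantiles of normally distributed residuals while ignoring uncertainty in the fitted parameters a and b.

```python
def least_squares_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def residual_std(xs, ys, a, b):
    """Standard deviation of the model error (n - 2 degrees of freedom)."""
    sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    return (sse / (len(xs) - 2)) ** 0.5


def bounding_intercept(xs, ys, a):
    """Smallest intercept for which the line a*x + b lies on or above all data."""
    return max(y - a * x for x, y in zip(xs, ys))


# Invented data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.3, 3.6, 6.4, 7.5, 10.4, 11.8]

a, b = least_squares_line(xs, ys)
sigma = residual_std(xs, ys, a, b)
b_bound = bounding_intercept(xs, ys, a)

x0 = 4.5
best = a * x0 + b                                     # best estimate of y at x0
band = (best - 1.645 * sigma, best + 1.645 * sigma)   # approx. 5% / 95% quantiles
conservative = a * x0 + b_bound                       # single bounding value

print(f"best estimate {best:.2f}, band {band[0]:.2f} to {band[1]:.2f}")
print(f"conservative (bounding) value {conservative:.2f}")
```

The bounding model returns one pessimistic number, while the best estimate model returns a central value together with a spread that can be carried forward into the margin calculation.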

Impact of Conservatism When Multiple Parameters
are Inputs to an Analysis.

When several parameters are inputs into an
analysis used to determine a resulting “final” parameter
or “figure of merit,” which is then
compared against regulatory limits, the amount of
conservatism in the final parameter will be larger
than the conservatism in each of the input parameters.
This can result in large conservatisms in the
resulting “final” parameter. In part, this reflects
the fact that the uncertainty in the final parameter
does increase as a consequence of increases in the
number of input parameters, each with its own
level of uncertainty. When probability distribution
functions, along with associated confidence levels,
are not known (i.e., a “data-deficient” environment),
industry standard approaches work to
ensure, through conservative parametric values
and/or modeling, that an appropriate level of
conservatism is present in the final parameter.
When this result is compared against the regulatory
limit, several possibilities exist: (1) if the margin
is small, more detailed analysis may be called
for to more fully characterize the safety case, as
discussed by the IAEA and NRC above; (2) if the
margin is large and measures incorporated into
system design and/or procedures are not onerous,
further action may not be called for; and
(3) if the margin is large and measures incorporated
into system design and/or procedures are costly or
result in overly complex operations, detailed analysis
may be called for to assess the relative contribution
of such measures to safety.


Figure 4. Comparison between best estimate and conservative models.
(Figure labels: model characteristic, conservative (i.e. biased model) or best estimate; parameters a and b of y = ax + b, deterministic conservative or distribution; model error.)
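
The compounding of conservatism can be illustrated with a small calculation. This is a sketch under assumptions that are not in the paper: the figure of merit is taken to be a simple sum of three independent, normally distributed inputs, and "conservative" is taken to mean using the 95th percentile of each input. All numbers are hypothetical.

```python
from statistics import NormalDist

# Three hypothetical, independent input parameters.
inputs = [NormalDist(10.0, 2.0), NormalDist(20.0, 3.0), NormalDist(5.0, 1.0)]

# Conservative route: take the 95th percentile of every input, then add.
stacked = sum(d.inv_cdf(0.95) for d in inputs)

# Best estimate route: build the distribution of the sum, then take its
# 95th percentile. Adding NormalDist objects assumes independence.
total = inputs[0] + inputs[1] + inputs[2]
combined = total.inv_cdf(0.95)

# Percentile of the final parameter that the stacked value actually represents.
implied_level = total.cdf(stacked)

print(f"sum of input 95th percentiles: {stacked:.2f}")
print(f"95th percentile of the sum:    {combined:.2f}")
print(f"stacked value corresponds to the {implied_level:.1%} level")
```

Under these assumptions the stacked value sits well above the 95th percentile of the final parameter, so the conservatism of the result exceeds that applied to any single input.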


5 CONCLUSIONS 


This paper first provided definitions of the term 
“conservative”—especially as it is applied to 
describe safety analyses. It then explored how the 
degree of conservatism and the degree of uncer- 
tainty from data supporting the safety analysis 
are linked in determining the margin of safety 
obtained in a facility design. It showed that, in the- 
ory, safety margins can be maintained with reduc- 
tions in conservatism if corresponding reductions 
in uncertainty are made (through better data or 
improved analysis). This insight will help support
decision-making on whether experimental or
analytical resources would be best applied to
making more detailed and sophisticated best-estimate
analyses, or whether to use less sophisticated
analysis with larger conservatisms, which could
lead to more resources spent on the facility design.
A follow-on paper is planned to further investigate
this with some example applications.


REFERENCES 


IAEA (2003). Safety Margins of Operating Reactors:
Analysis of Uncertainties and Implications for Decision
Making. IAEA Publication 1332.

IAEA (2008). Best Estimate Safety Analysis for Nuclear
Power Plants: Uncertainty Evaluation. Safety Report
Series Number 52.

IAEA (2014). Deterministic Safety Analysis for Nuclear
Power Plants. IAEA Publication 1428.

NRC (1986). Safety Goals for the Operation of Nuclear
Power Plants. 57 FR 30028.

NRC (1989). Best-Estimate Calculations of Emergency
Core Cooling System Performance. NUREG 1.157.

NRC (2005). Transient and Accident Analysis Methods.
NUREG 1.203.

NRC (2008). Background, Status, and Issues Related
to the Regulation of Advanced Spent Nuclear Fuel
Recycle Facilities. NUREG-1909.

NRC (2013). Glossary of Risk-Related Terms in Support
of Risk-Informed Decision-making. NUREG 2122.

NRC (2015). Regulatory Approaches for Addressing
Reprocessing Facility Risks: An Assessment.
NUREG-CR-7168.

UK ONR (2014). Safety Assessment Principles for
Nuclear Facilities.


Safety and Reliability - Safe Societies in a Changing World — Haugen et al. (Eds) 
© 2018 Taylor & Francis Group, London, ISBN 978-0-8153-8682-7 


Awareness and preparation of the population for emergencies 


M. Vašková 
The National Cyber and Information Security Agency, Brno, Czech Republic 


M. Náplavová 
The College of Regional Development, Prague, Czech Republic 


J. Barta 
University of Defence in Brno, Brno, Czech Republic 


ABSTRACT: The paper deals with the level of citizen awareness in selected hazard zones of the
Vysočina region and with the citizens' preparation for an emergency. The theoretical part of the paper
analyses approaches to informing and training inhabitants for emergencies, both in general and with
a focus on two selected sites in the Vysočina region. Possibilities for training inhabitants and
logistics information flows are also discussed. In the practical part, proposals were created
for preparing the citizens for a potential emergency. The first emergency chosen was a
leakage of a dangerous substance from an ice-skating stadium, for which a simulation of an
ammonia leak was made. The second emergency was a rupture of a local dam. Simulations were
made for both emergencies to see their consequences.


1 INTRODUCTION

The article deals with the awareness of the population
and their preparedness for an extraordinary
event in two zones. Specifically, it focuses on
the emergency zone of a winter stadium, from which a
dangerous substance, ammonia (NH3), can escape.
The second emergency zone is the flood territory
of the dam Vir in the Vysočina.

According to §15 of Act No. 239/2000 Coll.
“The municipal office familiarizes the population
with the character of the possible threats in the
region. Then the office also familiarizes them with
the prepared rescue and liquidation works and
the protection of the population. The office also
organizes their training” (Act No. 239/2000). The
law clearly shows who has the task to inform and
to train the population. However, it is not
mentioned anywhere how often the training
should take place. This gap in the law is one of
the aspects that cause the ignorance of the population.
However, it is not possible to lay the blame
only on the gap in the law, because the initiative
of the population is also very important nowadays.
In today's world, which is full of all kinds of inventions,
and due to trends of travelling and learning
about foreign countries, people often neglect the
importance of knowing the area of their own residence.
Sometimes it also happens that even when a
citizen tries to get some information, it is impossible.

The paper is focused on the awareness of the
population. At the beginning of our work, a
questionnaire was therefore prepared for the residents
of vulnerable areas and for the authorities under
which the affected areas belong. The question
was whether the authorities carried out training on
possible threat scenarios, and whether the citizens know
how to act in case of an emergency or
where to go to get that information. The
questionnaire was developed primarily to obtain
an objective view of the situation and for the
analysis of the current state in selected territories.
The main goal of our work is to find an acceptable
solution to the issue, or at least to deepen the
awareness and knowledge of the population in
selected territories.


2 ANALYSIS OF APPROACHES TO
ADDRESS THE AWARENESS AND
TRAINING OF THE POPULATION


2.1 The dam Vir

The dam Vir is located in the Vysočina, on the
river Svratka; the location of the Vysočina region in
the Czech Republic is shown in Figure 1.
It serves as a source of drinking water for the area
called Žďárské vrchy and the surrounding areas
and a part of Brno. The area of flood planning
for this reservoir is located along the river Svratka.


Figure 1. Location of the Vysočina region in the Czech Republic.


It is reported that the possible flood wave would
most likely reach up to the Brno dam, where it
could cause considerable complications.

The total volume of the reservoir of the dam
is 56.193 million m³ and the size of the flooded
area is 223.6 ha. The concrete gravity dam is
390 m long and 9 m wide at its crown. The harmless
discharge under the dam is 55 m³·s⁻¹. A hydroelectric
power plant is located on the left side of the dam.

The extent of the flood area was determined
according to a model of the passage of a special
flood caused by a breach of the dam. This
model implies that the flood area
lies in the stretch from the dam to the city of Brno.
The vulnerable place considered in this work is the
village of Vir, which is located in the immediate vicinity
of the dam Vir. Other endangered sites are
the villages of Koroužné, Švařec, Štěpánov
nad Svratkou, Ujčov and Dolní Čepí. Of course,
endangered sites also include other municipalities,
but this work deals only with the territory of the
Vysočina region.

Under the operating plan for special floods, the
warning is issued by the owner of the dam Vir
using the owner's own sirens; the owner also notifies the
operational information centre of the fire rescue service (IRS)
of the region about the danger of a special flood.
The IRS of the region then immediately notifies
the threatened population and also provides information
about the development of the emergency (The
operational plan for a special floods, 2017).


2.2 Winter stadium 


The capacity of the winter stadium in Žďár nad
Sázavou is 3500 visitors (Sportis, 2011). Ammonia
(NH3) is used to cool the ice surface.
Under normal conditions ammonia is a colourless
gas with a typical pungent odor; it is alkaline,
irritant and caustic. Ammonia is very toxic for
aquatic organisms (especially fish). Its very good
solubility in water also plays an important role.
Plants can also be negatively affected if they are
exposed to higher concentrations of it in air
or in water. It participates in the acidification of
soils (The integrated pollution register, 2017).

Due to the properties of ammonia, an ecological
disaster could occur, because the river Sázava
flows near the selected stadium.

Short-term exposure to ammonia has an irritating
effect on a person. It can burn the skin
and eyes. It causes coughing and difficulty breathing.
At a concentration higher than 3.5 g·m⁻³, even
short-term exposure is lethal. In the ambient
environment the concentration of ammonia is so low
that it poses almost no risk. An advantage
is its intense pungent odor, which reveals its
presence in the air before it can rise to a dangerous
level (The integrated pollution register, 2017).

According to the current emergency plan, the cooling
system of the winter stadium holds a total ammonia
charge of 6000 kg; in fact, the amount is 1400 kg.
This amount is divided into
three parts: the condenser, the expansion tank
and the ice-skating area (Travnik, 2016).

The zone of emergency planning for the winter
stadium is about 130 m. In that zone there are a
sports hall, 2 restaurants and a few family houses,
which are located on the edge of the zone.

Given the level of security, the availability of hazardous
substances and the large number of people at hockey
games and other events, winter stadiums may be
misused for a terrorist attack through the intentional
discharge of a dangerous substance or the destruction of
the equipment (Zeman & Bren & Urban, 2017).

According to the emergency plan of the winter
stadium, the warning in the event of a release of ammonia is
issued by the doorman of the winter stadium
after notification by the engineer. He begins to
organize the measures on the premises of the winter
stadium (according to the existing emergency response
plan). Alarms are divided into 3 groups according
to the amount of leaked substance:


1. The first level of threat: a leakage of ammonia
up to 1 000 kg, if the spill threatens only the
engineering building.

2. The second level of threat: a leakage of ammonia
up to 2 000 kg, if the spill threatens the entire building.

3. The third level of threat: a leakage of ammonia
up to 3 000 kg, if the spill threatens the area of the
winter stadium and the area outside the emergency zone in
the direction of the wind (Travnik, 2016).
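
The mass thresholds of the three alarm levels can be written as a simple lookup. This is only a sketch: it uses the leaked mass alone and ignores the accompanying condition on which area is threatened, the function name is hypothetical, and treating any leak above 2 000 kg as level 3 is an assumption made here, since the plan as quoted only describes leaks up to 3 000 kg.

```python
def threat_level(leaked_kg):
    """Map a leaked ammonia mass in kg to the alarm level of the emergency plan."""
    if leaked_kg <= 0:
        raise ValueError("leaked mass must be positive")
    if leaked_kg <= 1000:
        return 1  # threat limited to the engineering building
    if leaked_kg <= 2000:
        return 2  # threat to the entire building
    return 3      # assumption: everything above 2 000 kg is treated as level 3


for mass in (280, 1500, 2500):
    print(f"{mass} kg -> level {threat_level(mass)}")
```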


Liquidation of a first-level accident
is undertaken by the staff of the ice rink with the
use of the IRS of the winter stadium. Liquidation
of a second- or third-level accident is managed
by the emergency commission headed by
the head of the winter stadium, who will call in the
appropriate personnel in cooperation with the IRS
(Travnik, 2016).


Given the maximum range of the zone of
emergency planning (130 m), it is necessary to warn
all persons who are in the zone of emergency
planning about the emergency by the general
warning signal (Travnik, 2016).

Protection of persons is ensured
through two kinds of escape routes:


1. From the space of the winter stadium. There
are two entrances in the main building and two
entrances in the eastern and western bleachers.
Escape routes from the machinery spaces lead, in
addition to the main entrance to the lobby, also
to the back of the winter stadium (to the cooling
towers) and in front of the garage into the hall.

2. From the space of a vapour cloud of ammonia,
against the direction of the surface wind
(Travnik, 2016).


Escape routes will be announced to the population
by radio. All escape routes are properly
marked and kept passable. Movement of
persons on the escape routes will be monitored by
the public order service, which will guard access to the
contaminated area (Travnik, 2016).

To determine the state of awareness of the population
in the emergency zone of the winter stadium,
the questionnaire method was used. The residents of
the zone and the visitors of the winter stadium and of the
sports hall located next to it were
given a questionnaire of 9 questions.


3 THE RESULTS OF THE SURVEY 


In the individual zones of danger mentioned above,
it was examined how well the residents are
prepared for an emergency and whether they have
enough information. Residents were given
a questionnaire of 9 questions.


3.1 The results of the survey in the emergency
zone of the dam Vir


The number of persons surveyed in the individual
municipalities in the flood planning zone of the
dam Vir is shown in Table 1.

To the first question, “Do you know that the
place of your residence is located in flood territory?”,
100% of the respondents replied that
they know about it.

To the second question, “Do you know where to
get more information about the threat?”, the majority
of respondents responded positively. The answer
“Yes, I know.” was checked by 70% of the respondents.
The remaining 30% did not know where to get the
necessary information.

The third question examined whether there is
any training about what to do in case of emergencies.
Approximately 57% of the respondents replied
that the training is in progress, 37% did not know
and the remaining 6% claimed that training is conducted.
The answers of the respondents were compared
with the replies of the representatives of the
municipalities in which the survey was conducted.
It should be noted that the municipal representatives
did not know about any training.


Table 1. Number of persons surveyed in the flood planning zone.

Municipality             Number of answered questionnaires
Vir                      32
Koroužné                 19
Štěpánov nad Svratkou    17
Ujčov                    12
Total                    80

The respondents who answered the third question
in the affirmative were asked to evaluate
the training. They stated that the training was
satisfactory and beneficial. Unfortunately, however,
they could not recall how often such training was
held or when it last took place.

An important part of the questionnaire was to
ascertain the opinion of citizens on their readiness
for an emergency, whether they would welcome
the introduction of training and what form
this training should have. Only one fifth
of the respondents showed interest in attending
possible training courses. A total of 55% of
respondents would prefer a combination of lectures
and manuals, 22% lectures only, and the
remaining 23% would be satisfied with prepared
documents (manuals).

The fifth question tested whether the population
knows the manner in which they will be informed that an
exceptional event has occurred. Interviewees had
a choice of three options: that they would hear it
from neighbors, that a warning signal would sound,
or that vehicles of the IRS, especially those of the
fire brigade and the Police of the Czech Republic,
would start to drive around. All of those interviewed
except 8% chose the answer that a warning signal
will sound. The remaining 8% chose the answer
“I learn it from the neighbors”, with the argument
that they learn more and earlier this way.

Question number 6 looked at whether the residents
of the affected territory think that they are
prepared for the emergency. On this question,
respondents split almost in half: according to their
responses, the first half is adequately prepared for
the emergency and the second half is not.


All respondents answered the seventh question
correctly. Their task was to choose the correct
series of numbers for the IRS.

According to the eighth question, 92% of the
respondents can provide first aid.

The last question examined whether the population
knows what an evacuation bag includes. It should
be noted that the questionnaire offered only
a yes or no answer. However, all those who
answered yes were verbally tested on whether they
really know what an evacuation bag should contain.
The others were at least advised. About 52%
of the respondents answered that they know what
an evacuation bag should include.


3.2 The results of the survey in the emergency 
zone of the winter stadium 


The number of persons surveyed in the emergency
zone of the winter stadium is shown in Table 2.

The first question focused on whether the
interviewees know about the potential danger
that the winter stadium represents. Only 30% of
respondents know that the winter stadium is a source
of danger.

The second question asked whether the citizens
know where to obtain more information about this
risk. The answers were also alarming: only 17%
knew where to get the information.

Due to the low number of informed respondents,
it was almost unnecessary to ask the third
question, whether they had ever been trained on what
to do in case of a leakage of ammonia. Nevertheless,
it was found that 5% of the respondents had
completed some training; it was in some way beneficial
for them, and they indicated that such training is
conducted about once every 2 years.

The fourth question asked respondents whether
they would be interested in any training. Only 38%
of those interviewed would be interested in training.
They would prefer a combination of lectures and
manuals, or lectures alone.

To the fifth question, “Do you know how you will
be notified of the fact that there was an emergency?”,
a majority of those surveyed answered correctly.
Only a small part of them chose a different answer.


Table 2. Number of persons surveyed in the emergency
zone of the winter stadium.

Place                 Number of answered questionnaires
Winter stadium        56
Sports hall           44
Residential houses    3
Total                 103


The sixth question dealt with how prepared the
respondents feel for such an emergency.
A total of 83% of respondents do not feel sufficiently
prepared for the emergency.

None of the respondents owns any protective
equipment for the case of a leakage of ammonia.

On the other hand, everyone, as we found out in the
eighth question, knows the emergency numbers.

According to the answers to the ninth question,
26% of respondents can provide first aid in case of
contact with ammonia.

From the responses it can be concluded that the
level of awareness among the population regarding the
aforementioned winter stadium is alarming and that
more attention needs to be paid to this issue.

A questionnaire focused on acquiring information
about ongoing or planned training was sent to the
department of emergency management of the municipal
authority in Žďár nad Sázavou. It was found that no
training takes place and none is planned for the near
future. According to the responses, the population was
informed about the risk arising from the
winter stadium a few years ago through the local
press.


4 EVALUATION OF KNOWLEDGE 
OF THE POPULATION 


Although education in this field at primary, secondary
and higher professional schools is required
by law, in many cases it does not take place, or it
takes place only at a very low level.

The results of the questionnaire survey
under the dam Vir show that the majority of the
population living in the villages of Vir, Koroužné,
Švařec, Štěpánov nad Svratkou and Ujčov is well
informed of the potential danger. However, there
is the problem that there is no training related to this
threat. It should be rated positively that
the population knows the emergency numbers and
can provide first aid. But according to the investigation,
more than half of the respondents could not pack
evacuation luggage.

The positives


• Training is conducted in the framework of the
voluntary fire brigade.
• Awareness is very good.
• People know where to get the information.
• Knowledge of emergency numbers.
• The majority can provide first aid.


The negatives


• Official training within the municipalities is missing.
• Ignorance of the content of the evacuation luggage.


The results of the survey in the emergency zone of
the winter stadium show that a substantial part of
the population, whether living in the zone of emergency
planning or visiting the stadium and sports hall, is
unaware of the potential danger.

The positives


• The population knows the emergency numbers.
• People know how they will be notified of the
emergency event (warning signal).
• Some respondents were trained what to do in
case of leakage of ammonia.
• Training sessions are carried out at least in
some of the schools in Žďár nad Sázavou.
• The municipality warned the population about the
issue through the local press several years ago.


The negatives


• The municipality does not perform training.
• Disinterest of a large number of citizens in
the training.
• Training is not planned in the foreseeable future.
• People do not know where to get the necessary
information.
• People do not know how to provide first aid in case of
contact with ammonia.
• Insufficient readiness of the population for this
extraordinary event.


In determining the current state of the issue, a
shortcoming was identified in the implementation
of the statutory training of the population
by the municipalities. This shortcoming manifested itself
especially in the ignorance of people. The surveyed
population showed large gaps in knowledge of
population protection, information and first aid, and
overall unpreparedness for the emergency.

In contrast, the population living in the flood
territory under the dam Vir is sufficiently informed
and ready for floods. People
know where to get information about the dangers
and know how to provide first aid. In this area it is
necessary to raise awareness about the content of
the evacuation luggage.

After evaluating the results, we decided
to model the leakage of hazardous
substances from the winter stadium and, subsequently,
to prepare and carry out an exercise on
this emergency.


5 APPLIED METHODS 


The extraordinary event chosen is one that
results in a threat to the environment. Leakage
of hazardous substances from refrigeration equipment
occurs quite often and remains a current issue.
Other methods used for the purposes of
this work are a literature review and the
questionnaire method, from which some threats
emerged.

Further, the method of simulation was used,
with the average values from the
long-term monitoring of weather conditions at
the site of the emergency and an expert estimate
of the rate of leakage of hazardous
substances. In constructive simulation, simulated
entities are controlled by
a simulated operator. Constructive simulation is
a kind of simulation in which the model contains everything
needed to replace the original during the simulation,
including humans. Control of this type of
simulation is implemented through the user interface.
The display of the synthetic environment is similar
to a topographic map. Constructive simulation is
used at a variety of distinct levels for different
types of operations to deal with emergencies (Kanj
& Flaus, 2015).


6 MODEL OF LEAKAGE OF HAZARDOUS 
SUBSTANCES FROM THE STADIUM 


On the basis of the requirements for the format of
the output, the available simulation programs for
modelling the leakage of hazardous substances were
evaluated. The information about potential
emergencies and the input data known
about the emergency were taken into account. Other relevant data
were long-term hydrometeorological data from
the vicinity of the emergency. Due to the lack of
input information, the simulation
program TerEx was selected, which allows working with a
minimum of input information. More details
of the evaluated programs were published in the
article The Use of Simulation Programs of Leakage
of Harmful Substances for Crisis management
(Barta, 2015).

The main factor when entering the input data
was the amount of the leaked dangerous substance.
Because the system of the winter
stadium is designed so that it is divided into
three independent parts with the possibility of
pumping the cooling medium into any of them, it
is not expected that, in the event of an accident,
more than 60% of the amount held in one third of
the system would escape. This corresponds to approximately
280 kg of ammonia. Basic input data:


• Model: PUFF (single release of boiled liquid
with rapid cloud evaporation)
• Substance: Ammonia
• Temperature: 7°C
• Total amount of escaped liquid: 280 kg
• Ground layer wind speed: 7 m/s
• Sky overcast: 0%
• Time of incident event occurrence and continuance:
Night, morning or evening
• Type of atmospheric stability: D (isothermic)
• Surface type in direction of substance spreading:
Inhabited area
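
The 280 kg entered as the total amount of escaped liquid follows from the figures given in the text: an actual charge of 1400 kg, three independent parts, and at most 60% of one part escaping. The short check below reproduces that arithmetic; the function and variable names are illustrative only.

```python
def expected_release_kg(total_charge_kg, n_parts, release_fraction):
    """Mass released when one of n equal parts loses the given fraction."""
    return total_charge_kg / n_parts * release_fraction


# 1400 kg actual charge, 3 independent parts, at most 60% of one part escapes.
release = expected_release_kg(1400.0, 3, 0.60)
print(f"expected release: {release:.0f} kg")
```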


On the basis of the determined average temperature
and the average values of the direction and force
of the wind, a simulation was performed in the program
TerEx; its result is shown in Figure 2.

To determine the extent of the emergency,
the output from the modelling program was
exported into the map base. As seen in the blue sector
in Figure 3, with the prevailing west
wind a residential area with several
family homes would be threatened. This was the basis for
preparing the documents for the exercise called the leak of
ammonia from the winter stadium in Žďár nad Sázavou.

For the realization of the exercise it was necessary
to choose a suitable and available simulation
program. In the framework of the research project
we addressed this issue and established
the basic criteria that a simulation
program for the implementation of practical exercises
has to meet (Barta et al. 2016), (Urban et al. 2017),
(Marana et al. 2016). The basic criteria
included user friendliness and the variability
of the simulator:


• Scene Editor;
• Implementation of External Data;
• Simulation Level (Teams or Individuals);
• Communication Possibilities;
• Continuity in Relation to the Surrounding
  Environment.


Figure 2. The output data from the program TerEx.
The output marks threat distances of 48 m, 137.5 m
(threat to persons inside buildings from broken
window glass) and 167 m (threat to persons from the
toxic substance), and recommends evacuation to a
distance of 167 m.


Figure 3. The plot of the leakage of hazardous
substances on the map.


On the basis of the comparison, SIMEX, which is
built on the platform of the simulator WASP, was
evaluated as the most appropriate simulation
program (Barta, Vašková & Urbánek, 2016).

It was best to base the exercise on an emergency
that had happened in the past. At the winter
stadium in Žďár nad Sázavou there was a leakage
of ammonia on 8 June 2011. The scenario of the
exercise was taken over from this event and modified.


6.1 The scenario of the exercise


On 8 June 2011 at 18:18 the Regional operational
and information centre of the Vysočina IRS received
a report of a leakage of ammonia from the winter
stadium in Libušínská street in Žďár nad Sázavou.
A unit of professional firefighters from the Žďár
nad Sázavou station and volunteer firefighters were
dispatched to the location. By surveying the site,
the firefighters found that the ammonia was leaking
from a cracked pipe behind the stadium. The place
was immediately marked as a danger zone. Two persons
with breathing difficulties were found at the site
of the intervention, and an ambulance was called.
The firefighters ordered the evacuation of the
population from the nearest houses downwind. They
also very quickly managed to prevent further leakage
of ammonia. Effluent water with a weak concentration
of ammonia reached the river, but there was no
direct threat to the environment and no fish died.

Before the intervention was terminated, a final
measurement was carried out with a negative result,
and the units returned to their stations.

On the basis of the scenario of the ammonia leakage
exercise, it was necessary to define all entities
that create, complement and comprehensively
participate in the exercise. On the basis of an
analysis of exercises carried out with the leakage
of a hazardous chemical (ammonia), each of the
entities was summarized and then successively
entered into the simulator (Oulehlova et al. 2016).
The decisive step was creating lists that contain
basic clusters of entities, in particular for the areas:


• Staff: for example, threatened people in the
  winter stadium and in its vicinity, the crews of
  the responding units, onlookers, etc.;
• Technical means: for example, vehicles of
  emergency units, auxiliary vehicles, vehicles
  simulating normal traffic in Žďár nad Sázavou, etc.;
• Environment: map data of the place of the
  hazardous substance leakage and the terrain
  database with the layers required for the
  SIMEX simulator.


Table 3. Connection plan of the entities dealing with
the leakage of ammonia from the winter stadium.

Unit                                                      Telephone number   Work station
Regional operational and information centre               112                PS15
Fire brigade, Žďár nad Sázavou                            -                  PS02
The commander of the intervention                         -                  PS01
Units of the volunteer firefighters, Žďár nad Sázavou 2   -                  PS03
Police of the Czech Republic                              158                PS07
Emergency medical service                                 155                PS05
The mayor of the municipality of Žďár nad Sázavou         -                  PS10
Management of the exercise                                -                  PS řídící (control)
Members of the tip-off                                    -                  PS13
Information line                                          1188               PS12
Emergency accommodation of persons                        -                  PS14
Evacuation center                                         -                  PS09
Position available                                        -                  -
Position available                                        -                  -


Table 3 presents the connection plan of the basic
entities for the resolution of the incident.

On the basis of the defined entities, their roles
and envisaged activities within the scenarios
dealing with emergencies were determined (Okstad,
2016). The exercise is currently being prepared and
will serve for the practical training of workers
and emergency crews, and for the general public to
obtain information about a possible danger, its
extent and consequences. Last but not least, the
citizens will receive a lot of information on how
to behave when an emergency with the leakage of
dangerous substances occurs.


7 CONCLUSIONS 


The overall awareness of the population about the
risks and threats in their surroundings is very
important. The basic reaction and behavior of the
population when extraordinary events occur also
depend on this awareness.

In the preparatory phase of the exercise, the
freely available materials on the websites of the
municipality of Žďár nad Sázavou and the Vysočina
region proved very useful. They provided valuable
information for the preparation of exercises for
informing the population.

This revealed that the inhabitants of the town of
Žďár nad Sázavou have sufficient options for
obtaining information about the impending danger
from the winter stadium (release of ammonia), but
these options are not used.

Within the forthcoming exercises, the instructors
received the basic information about the issue,
increased their knowledge of crisis management and
prepared theoretically for work in the selected
simulator. This was a very good basis for training
that is focused on practicing management functions,
implementing established procedures and clear
communication between the individual exercising
entities.

Residents of the city may attend the planned
exercise, become aware of possible dangers and
obtain the necessary knowledge not only for the
case of an ammonia leakage; they will also acquire
basic knowledge of how to behave in emergencies
involving the leakage of dangerous substances.


ABSTRACT: In the present study, the relationship between survey data regarding work conditions and
safety performance is investigated. We transform a questionnaire into several factors which involve work 
environment, safety climate and organizational aspects, and run a series of analyses on repeated cross- 
sectional data in order to predict occurrences of Hydrocarbon (HC) leaks and acute spills. We apply 
survey data from 49 500 respondents across eight distributions from 2001 to 2015. Methodically, we con- 
duct Analyses of Variance (ANOVA), Principal Component Analyses (PCA) with Cronbach’s alpha tests, 
and lastly multiple logistic regressions. Our results give some support to the hypotheses—that factors of 
safety climate and psychosocial aspects may be predictors of safety performance. The results and inherent 
qualities and weaknesses of the present study are discussed, and recommendations for further research 
are presented. This paper is a contribution to the development of proactive lead indicators appropriate 
for safety management. 


1 INTRODUCTION

Preventing hydrocarbon (HC) leaks and acute spills
to the environment is of great importance to the
industry and society. HC leaks can in the worst case
lead to major accidents, especially due to the
inherent ignition and explosion risk. Acute spills
have the potential to injure persons, but they
mainly pose a concern for the environment. Acute
spills from the oil and gas industry on the
Norwegian Continental Shelf are subject to close
attention from both regulatory authorities and
civil society. The Petroleum Safety Authority (PSA)
annually issues a report on the risk level for
acute spills and states that there is a need for more
knowledge on the conditions that lead to acute
spills (PSA, 2017). In the present study, we utilize
precursor survey data and technical data related to
offshore installations to investigate the conditions
that are associated with HC leaks and acute spills.
The survey data is obtained by the standardized
questionnaire called the ‘Norwegian Offshore Risk
and Safety Climate Inventory’ (NORSCI), which
is constructed to measure health and safety climate
and the risk for occupational health and accidents
(Tharaldsen et al. 2008). The study focuses on
individual and organizational aspects measured by a
survey instrument, in order to contribute to the
development of valid safety indicators that may be
used as “early warnings”.


1.1 Safety indicators

The term safety indicators is used in this paper to
denote the independent variables used in the
analysis. An indicator may be defined as a measurable
variable that can be used to describe the state of a
phenomenon, when the actual state of the phenomenon
might be unknown (Haugen et al., 2012). Safety
indicators may be seen as observable measures that
should give information concerning the safety
performance (Kongsvik, 2012). An indicator does not
necessarily imply that there is a causality between
the content of the indicators and the safety
performance. A measure may be used as an indicator
as long as there is a correlation with the
phenomenon one wants to gain knowledge about. One may
differentiate between lead indicators and lag
indicators. Lead indicators are considered proactive
performance indicators (Dyreborg, 2009). Lag
indicators are conceptualized as outcome
measurements, usually represented as survey items
measuring the respondents' evaluation of the work
practice or safety level in their organization (see
e.g. Kongsvik, 2012) or different types of accident
statistics (see e.g. Høivik, 2007).

The question of whether survey data are lag
indicators or lead indicators of safety performance
should be addressed as a part of the exploration
of the relationship between survey data and safety
performance. It seems reasonable that both
directions may apply; one could believe that earlier
safety records may influence the responses of a
respondent in a survey, but the responses may also be
influenced by the conditions leading to the safety
records.

Kilskar et al. (2016) have conducted a review
of 174 publications addressing safety indicators, in
order to gain information regarding the relation
between indicators and accident risk. They
conclude that there is a general lack of documented
valid indicators, and that there is a need for more
empirical research.

The present study’s analysis is a contribution 
towards the ambitions within the community of 
safety research to close this knowledge gap. An 
innovative aspect of this research, in relation to 
previous research, is the attempt to explore and 
develop indicators for acute spills from the oil and 
gas industry. 


1.2 Organizational conditions and safety 
performance 


Vinnem (2012) has conducted an analysis of HC
leaks in the Norwegian offshore industry based on
reported incident data. In his analysis of
operational circumstances for the leaks, he found that
55% of the leaks were related to human
interventions. Among these, 82% of the leaks could be
attributed to latent errors, e.g. errors due to
maintenance and modifications. These findings support
the quest to explore the possible relations between
measurement of human and organizational factors
and HC leaks.

Several survey instruments have been developed
in order to measure conditions and phenomena
that are assumed to influence safety
performance. Various concepts have been used to
denote these phenomena. Concepts that are
relatively common are safety culture, safety
climate, and organizational and psychosocial factors.
In addition, there are different concepts and survey
instruments that are designed to measure phenomena
that are not directly safety-related, but where
researchers have used these instruments to test
possible relations with safety performance (see e.g.
Høivik 2007, Bergh et al. 2014, Olsen et al. 2016).

Measuring safety culture and safety climate has
been extensively debated regarding both the
definition of the concepts and their validity: what
are we actually measuring? In spite of the lack of
consensus regarding what is actually measured, it
has become a relatively common practice in some
industries to conduct these measurements.

Safety climate has been conceptualized as a
representation or a subset of safety culture (for
example Cooper & Phillips, 2004; Zohar, 2003). Others
argue that it is a reflection of the safety culture,
i.e. a kind of representation of the somewhat vague
and intangible safety culture (see e.g. Guldenmund,
2000; Mearns & Flin, 1999).

Safety climate has been defined as the set of per- 
ceptions that employees share regarding safety in 
their environment (Zohar, 1980). A safety climate 
questionnaire consists of a broad range of items 
where the respondents are asked to give responses 
that are assumed to reflect their perceptions of 
safety-related topics. Common features of a safety
climate construct include management/supervi- 
sion, safety competence, safety systems, work pres- 
sure and risk (Flin et al., 2000). 

A safety climate questionnaire often consists of 
items aimed to reflect the respondents’ perceptions 
of how safety is valued in their organization (Grif- 
fin & Neal, 2000); hence these perceptions should 
ideally form the frame of reference for employees 
about the behavior that is expected, supported, 
and rewarded (Zohar, 2010). However, some safety
climate survey instruments consist of items that
measure not only perceptions of how safety is
valued, but also how respondents describe their own
work practice, and perceive and evaluate
organizational conditions such as procedures,
leadership, communication and competence.

Whereas safety climate surveys tend to address
how employees make sense of their work environment,
their values and attitudes, psychosocial survey
instruments seem to be more oriented towards
conditions that influence their cognitive and
physical capacity. Simplified, one may claim that
psychosocial surveys focus more on conditions that
theoretically are assumed to influence human error
(reflected in the use of concepts such as
“stress”, “burnout” and “mental exhaustion”, see
Bergh 2014), whereas safety climate surveys address
conditions that influence violations and lack of
compliance.


1.3 The relationship between safety climate 
and accident statistics 


There have been several previous attempts to
explore the relations between survey data and
safety performance by the use of data from the
Norwegian oil and gas industry. Several of these
studies are based on survey data obtained by the
NORSCI instrument.

Tharaldsen et al. (2008) have investigated the
association between five safety climate dimensions
and accident rates. The researchers found
statistically significant, but rather weak correlations
between safety climate and accident rates. Similarly,
Hestad and Lilleheier (2009) found correlations
between safety climate and HC leaks. Kongsvik
et al. (2011) explored the leading and lagging
qualities of safety climate. In line with Hestad and
Lilleheier (2009), they found that safety climate
could be used as both a leading and lagging
indicator for HC leaks; more negative safety climate
scores were associated with an increased number
of HC leaks the following 12 months. HC leaks
one year before measuring safety climate also
correlated negatively and significantly with the safety
climate indicator; more HC leaks were associated
with worse safety climate scores. The correlations
were medium sized. Vinnem et al. (2010) found
that a safety climate measure together with barrier
failure data explained 37% of the variance in HC
leaks on the installation level. They also found that
the safety climate measure was the strongest
predictor of leaks.

Gilberg et al. (2015) found a significant
relationship between safety climate and safety
performance, conceptualized as HC leaks and dropped
objects. In their study, which consisted of data
from 2001 to 2013, they also found that the leading
effect was stronger than the lagging effect.

In addition to these analyses that have combined
NORSCI survey data and safety records, several
studies have been conducted where items in
the NORSCI survey data have been used as both
independent and dependent variables. Kvalheim &
Dahl (2016) found dimensions in the survey
data to be a strong predictor of accident
precursors such as self-reported safety compliance in
the oil and gas industry, explaining roughly 27% of
the variance in safety compliance over a period of
7 years.

Explorative studies have also been conducted
in Norway regarding the relationship between
survey data and safety performance, by the use of
survey instruments other than NORSCI and with
other samples of respondents. Høivik et al. (2007)
have analyzed the possible relationship between items
in a general work environment survey and health
and safety records (occupational accidents) in one
oil and gas company. They found that management
style and trust in the manager are important
factors for predicting personal injuries. Bergh et al.
(2014) have analyzed the relationship between a
psychosocial risk indicator and HC leak
frequencies, with a sample from one oil and gas company.
Both survey data and some technical data
(variables/indicators: age, weight, number of leakages)
were used as independent variables (lead
indicators). They found that the survey data significantly
accounted for variations in HC leaks. They found no
significant relation between the technical
indicators and leaks.

A general work environment survey within one
single company was also used in a study by Olsen
et al. (2016), in order to predict HC leaks. They
found that several dimensions in the survey data,
identified by the use of factor analysis, were
significantly related to HC leaks.

Internationally, some meta studies have been
conducted regarding the relationship between
survey data and safety performance. Meta studies by
Clarke (2009) and Christian et al. (2009) on the
relationship between safety climate and accident
statistics/injuries demonstrate medium size
correlations (-0.22 and -0.39, respectively).
Payne, Bergman, Beus, Rodriguez, and Henning
(2009) found a negative correlation between safety
climate and releases/contamination and property
damage one year after measuring safety climate in
the process industry. A meta-analysis considering
the job-demands-resources model (Nahrgang et al.
2010) found that job resources like a supportive
environment were related to safety outcomes
(accidents, injuries, adverse events and unsafe
behavior) in several industries.

One example of an individual study is Swaen
et al. (2004). They conducted a cohort study and
found that high psychological job demands were a
risk factor for being injured in occupational
accidents in a wide range of companies and
organizations. Swaen et al. (2004) investigated sleep among
47 860 individuals, and found a relationship between
self-reported sleep and the risk of fatal
occupational accidents.

Many of these previous studies, with notable
exceptions, have conducted analyses on small data
sets, often because either the number of
installations or the number of incidents has been low.


1.4 Hypotheses 


The present study adds to the current research
base by including new and updated data on oil
and gas installations, and a new response variable,
acute spills. Acute spills occur frequently and
thus serve as a response variable with
better inherent statistical power. In addition, we
broaden the scope of survey items to include not
only safety climate questions, but also
organizational factors and psychosocial work environment
factors combined.

Based on the discourse regarding the relationship
between survey data and safety records, we
hypothesize the following:


H1: Negative scores on factors concerning
    organizational, work environment and safety
    climate will be significantly related to higher
    probability of HC leaks the year after
    measurement

H2: Negative scores on factors concerning
    organizational, work environment and safety climate
    will be significantly related to high probability
    of acute spills the year after measurement

H3: The factors will be significantly related to
    HC leaks and acute spills when controlling
    for operator, installation type and area of
    operation


2 METHOD 


2.1 Data and variables 


Each year, the PSA gathers data from the companies
operating on the Norwegian Continental Shelf (NCS):
data regarding over 20 different defined scenarios
of hazard and accident (DSHA), as well as
maintenance data and barrier test data. As a part of
this, there is a biennial survey, using the
‘Norwegian Offshore Risk and Safety Climate Inventory’
(NORSCI) as instrument. This study makes use of the
survey data, accident records regarding HC leaks
and acute spills, and data regarding the types of
installations and the area of the operations.

We have conducted extensive data cleaning and 
quality assurance. This was necessary due to dif- 
ferent ways of reporting across data sources and 
operators. 

As a part of the preparation, we have used
the concept of installation years. Each
observation in the analysis consists of one installation
year, that is, for installation X in year Y we have
survey data, HC leaks and acute spills. Further,
we have data for the same installation every
second year thereafter. This means that each
installation included in the study has a maximum of 8
observations.
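
The installation-year structure can be sketched as a small data-preparation step. The example below is an illustration only, not the procedure used in the study: the dictionary layout, the field names and the function `pair_lead_lag` are assumptions. It pairs each survey observation with the incident count of the following year (leading direction) and of the preceding year (lagging direction).

```python
def pair_lead_lag(surveys, incidents):
    """Build one row per installation year.

    surveys:   {(installation, year): survey score}   (hypothetical layout)
    incidents: {(installation, year): incident count} (hypothetical layout)
    Missing incident years are returned as None.
    """
    rows = []
    for (installation, year), score in sorted(surveys.items()):
        rows.append({
            "installation": installation,
            "survey_year": year,
            "score": score,
            # Leading direction: incidents the year after the survey.
            "incidents_next_year": incidents.get((installation, year + 1)),
            # Lagging direction: incidents the year before the survey.
            "incidents_prev_year": incidents.get((installation, year - 1)),
        })
    return rows


if __name__ == "__main__":
    surveys = {("X", 2001): 4.1, ("X", 2003): 3.9}
    incidents = {("X", 2000): 0, ("X", 2002): 1}
    for row in pair_lead_lag(surveys, incidents):
        print(row)
```

Keeping both directions in the same row makes it straightforward to compare the leading and lagging relationships for the same survey observation.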

The installations are coded into three types: fixed,
floating and mobile units (rigs). Fixed is used as
the reference category in the logistic regressions.


2.2 Survey instrument (lead indicators) 


In total, the data consists of eight distributions 
of a safety and work condition survey through 
the RNNP study, from 2001 to 2015. The data 
is gathered by the PSA. The survey covers safety 
climate factors as well as general health, psycho- 
social work environment factors and background 
questions. 

For our analyses, the data from catering and
cleaning crews were excluded due to their expected
small impact on the causality of HC leaks and
acute spills. The total number of respondents in
the survey data is 69 111. After excluding
responses without a reported installation name, N
was 57 550, and after excluding catering and
cleaning crews the final N totaled 49 500.

The questions are generally answered on a scale
from 1 to 5, where 1 equals a “positive” answer.
However, we have recoded the questions so that a
high score equals a “positive” answer.
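
The recoding described above amounts to reversing the scale. A minimal sketch, assuming a plain 1 to 5 integer scale (the function name `reverse_code` is hypothetical):

```python
def reverse_code(answer, low=1, high=5):
    """Reverse a Likert-type answer so that a high score is "positive".

    On a 1-5 scale this maps 1 -> 5, 2 -> 4, 3 -> 3, 4 -> 2, 5 -> 1.
    """
    if not low <= answer <= high:
        raise ValueError(f"answer {answer} is outside the scale {low}-{high}")
    return low + high - answer


if __name__ == "__main__":
    print([reverse_code(a) for a in (1, 2, 3, 4, 5)])
```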


2.3 Safety performance (lag indicators) 


We include two variables measuring safety
performance from 2001 to 2016: HC leaks and acute spills.

The HC variable is a dichotomous variable based
on the number of HC leaks over 0.1 kg per second
throughout the year. For the analyses using HC
leaks, we excluded mobile units, such as drilling
rigs, which do not have process areas for
hydrocarbons to the same extent as fixed installations
like oil platforms and FPSO ships. We also excluded
normally unmanned installations and data related to
fields rather than installations. We coded the
variable so that 0 = no HC leaks, and 1 = one or more
HC leaks.

Similarly, we dichotomized a variable of the
number of acute spills throughout the year. The
acute spills were divided into three types: chemical,
crude oil, and other oils. We made a binary
variable where 0 = no acute spills and 1 = one or more
acute spills. As for HC leaks, we also excluded
normally unmanned installations and data related to
fields rather than installations.
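
The dichotomization of the two outcome variables can be sketched as follows. This is a simplified illustration; the function names are hypothetical, and the counts are assumed to be yearly totals per installation.

```python
def dichotomize(count):
    """Return 1 if one or more incidents occurred in the year, else 0."""
    if count < 0:
        raise ValueError("incident counts cannot be negative")
    return 1 if count >= 1 else 0


def acute_spill_indicator(chemical, crude_oil, other_oils):
    """Combine the three spill types into one binary outcome."""
    return dichotomize(chemical + crude_oil + other_oils)


if __name__ == "__main__":
    print(dichotomize(0), dichotomize(3))
    print(acute_spill_indicator(chemical=0, crude_oil=1, other_oils=0))
```

Dichotomizing discards the information on how many incidents occurred, which is the price paid for a simple logistic regression outcome.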


2.4 Research design 


The present study is a repeated cross-sectional study
with several data sources. This means that temporal
variations may be investigated, and that common
method bias is reduced. A design using accidents
and incidents as a dependent variable also allows
discussion of the measures’ predictive validity.

Our analysis may be divided into four steps:
exploratory bivariate tests with ANOVA, dimension
reduction with principal component analysis,
tests of reliability with Cronbach’s alpha, and
lastly a multiple logistic regression analysis.

The explorative Analysis of Variance (ANOVA)
was related to H1 and H2. A central question to
ask when conducting these repeated cross-sectional
studies was: Do negative responses on organizational
factors lead to accidents, or do accidents
lead to certain perceptions of the work environment
and organizational factors? Which effect is the
strongest? Both directions of causality can be
argued for, and therefore we chose to test both
directions in the first steps of the study.

In the ANOVA test, we ran a series of analyses
where we compared the means of the survey items
by the two levels of the outcome variables (0 = no
incidents, and 1 = one or more incidents).
For analyses with two groups, the ANOVA test is
equivalent to a Student’s t-test, as F = t^2, with
identical p-value.
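
The equivalence between a two-group ANOVA and a t-test can be checked numerically. The sketch below uses only the Python standard library and made-up scores; it assumes the classical pooled-variance (equal variance) form of Student's t-test, and the function names are hypothetical.

```python
from statistics import mean


def pooled_t(a, b):
    """Student's t statistic for two independent groups (pooled variance)."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


def one_way_f(groups):
    """F statistic of a one-way ANOVA for a list of groups."""
    values = [x for g in groups for x in g]
    grand_mean = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


if __name__ == "__main__":
    # Illustrative item scores, not data from the study.
    no_incident = [4.2, 3.9, 4.5, 4.1]
    incident = [3.6, 3.8, 3.3, 3.9, 3.5]
    t = pooled_t(no_incident, incident)
    f = one_way_f([no_incident, incident])
    print(f"t^2 = {t ** 2:.6f}, F = {f:.6f}")
```

The identity does not hold for Welch's unequal-variance t-test, which is why the pooled form is used here.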

In the following analytic steps, we chose to
go further with the items significantly related
to HC leaks and/or acute spills, but only if they
were either significantly related only to the future
(leading), or better as a predictor of the future
than as a product of the history (lagging).

The Principal Component Analysis (PCA), commonly
denoted as a factor analysis, was conducted
in order to further investigate the findings in the
ANOVA. This was done because many of the
items significantly related to our outcomes were
believed to represent common latent phenomena.

We conducted four PCAs, two on each target
variable. The first iteration was an exploratory
factor analysis using a criterion of eigenvalue > 1
and inspection of the scree plot to decide the number
of factors. Thereafter, we conducted a confirmatory
PCA with the number of factors we chose to
extract based on eigenvalues and scree plot.

A Cronbach’s alpha (α) procedure was run on all
factors in order to ensure the internal consistency
of the factors. The Cronbach’s alpha procedure
is essentially a way of summarizing all
inter-correlations between items of a scale or factor
(Field, 2009). A common rule of thumb is that
factors should have α > 0.70 (Nunnally, 1978).
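
For illustration, Cronbach's alpha can be computed with the standard variance-based formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score). The sketch below is not the software procedure used in the study; the function name and the example scores are assumptions made for the illustration.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a factor.

    items: list of k lists, one per survey item, each holding the scores
    of the same respondents in the same order.
    """
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are needed")
    # Total score per respondent across all items of the factor.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


if __name__ == "__main__":
    # Made-up answers from five respondents on three items.
    factor_items = [
        [4, 5, 3, 2, 4],
        [4, 4, 3, 2, 5],
        [5, 5, 2, 3, 4],
    ]
    alpha = cronbach_alpha(factor_items)
    print(f"alpha = {alpha:.2f}, acceptable: {alpha > 0.70}")
```

Alpha tends to increase with the number of items, so a factor with only two or three items can fall below the threshold even when its items are clearly related.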

Multiple logistic regression was performed to test 
H3, and further H1 and H2. The factors identified 
in the PCA step may be confounded by other fac- 
tors like differences between installation types. To 
control for these factors, and especially the survey 
factors themselves, a multiple regression method 
was viable. 

We conducted one multiple logistic regression
model for each target variable. In the first step, we
included the major companies as dummy variables.
The second step consisted of installation types, and
the third was sea locations (the Norwegian Sea or
the Barents Sea).

In the final step, we included the survey factors
a) to see if they could explain differences in the
target variables even when controlling for these
background variables, and b) to identify which survey
factors were strongest when controlling for
each other.


3 RESULTS 


From the ANOVA, we found that a total of
44 out of 144 items were significant for acute
spills, and 57 for HC leaks. As mentioned in
the methods section, many of these were stronger
as a lagging indicator than as a leading indicator,
that is, the survey results were more strongly
correlated with the previous year’s accident
statistics than with the next year’s. When we
excluded these items, as well as some questions
loading on several or none of the factors in the
PCA, we had 21 items for HC leaks and 14 for
acute spills.

The iterations of PCA showed four factors
for acute spill items and five factors for HC leak
items. The first factor of both HC leaks and acute
spills was by far the most explanatory. For HC
leaks this factor, denoted as Framework conditions,
consisted of items regarding e.g. competence,
training, safety procedures and manning.

One of the factors of HC leak items had a
Cronbach’s alpha level which was not satisfactory,
and was thus excluded from further analysis. The
results are presented below in Tables 1 and 2. In
sum, the ANOVA and PCA findings give support to
hypotheses 1 and 2.

Descriptive statistics and ANOVA results for 
the factors and target variables are presented in 
Table 3 below. 

The results from the logistic regression presented 
in Table 4 show that the model explained 12% of 
the variation in HC leaks the next year, and 21% of 
the Acute spills. Operator 2 has significantly lower 
probability of both incident types when compared 
to operator 1. Operator 3 and 5 also have this rela- 
tionship, but only for Acute spills. For HC leak, 
the factor Framework conditions (see Table 5) is 
significantly and negatively related to HC leaks, 
meaning that for higher (more positive) scores 
on this factor, the probability of one or more HC 
leaks the next year is lower. The same relationship 


Table 1. Factor loadings for survey factors HC leaks. 
HC leak Factor Cronbach’s 

Factor name loadings alpha Items 
Framework C 0.46—-0,76 0,87 10 
Leadership 0,58-0,83 0,88 8 
Work organisation 0,41-0,74 0,84 7 
Coordination 0,43-0,82 0,93 3 
Organsational 0,44-0,69 0,45 3 


risk awareness 


Table 2. Factor loadings for survey factors acute spills. 


Acute spills Factor Cronbach’s 

Factor name loadings alpha Items 
Phys/quant WE 0.63-0,82 0,81 5 
Noise and sleep 0,82—0,89 0,70 2 
Psychosocial WE 0,49-0,82 0,72 4 
Competence 0,74-0,82 0,70 3 


Table 3. Means of the factors related to HC leaks 
(N = 211) and Acute spills (N = 520) next year. 


HC leak Acute spill 

0 lormore 0 1 or more 
Phys/quant WE - - 3,16. 3,63% 
Noise and sleep — — 3,83 3,74** 
Psychosocial WE — — 3,88 3,94** 
Competence — — 3,42 3,44 
Framework cond. 4,19 4,07** — = 
Leadership 4,03 3,89** = = 
Work organisation 3,64 3,47** - - 
Coordination 4,12 4,08** = - 


**Significantly lower on a 0,01 level. 


Table 4. Results from logistic regression predicting HC 
leaks (N = 211), and acute spills (N = 520). 


HC leak Acute spill 
St. St. 

OR error OR error 
Op 2 0,11* 1.06 0,18** 0.38 
Op 3 0,74 0.61 0,21** 0.46 
Op 4 0,00 882.74 0,90 0.54 
Op 5 0,41 1.09 0,35* 0.42 
Floating inst. 1,28 0.38 1,82 0.33 
Mobile unit — — 0,99 0.33 
Norwegian Sea 0,91 0.48 3,52** 0.43 
Phys/quant WE — — 0,10** 0.52 
Noise and sleep - = 2,86 0.44 
Psychosocial WE - - 0,45 0.64 
Competence — — 0,35* 0.37 
Framework cond. 0,01** 1.57 — 
Leadership 6,66 1.33 = = 
Work organisation 0,22 1.05 = — 
Coordination 0,37 0.86 — — 
Nagelkerke r2 12% 21% 


**Significant on a 0,01 level, 
*Significant on a 0,05 level. 


concerns the Physical and Quantitative Work
Environment factor and the Competence factor with
regard to acute spills (see Table 5). These factors
contain questions regarding exposure to chemicals,
psychosocial demands and training. Installations
operating in the Norwegian Sea were significantly
more likely to have acute spills the next year than
installations in the North Sea.

In sum, these findings give partial support to 
hypothesis 3. 
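
To make the odds ratios in Table 4 easier to read, the sketch below converts an odds ratio and its standard error into an approximate 95% Wald interval. It assumes, as is common in statistical software output, that the reported standard error refers to the log-odds coefficient; the paper does not state this explicitly, and the function name is chosen here for illustration. The input values are the rounded figures from Table 4, so the intervals are indicative only.

```python
import math

def odds_ratio_ci(odds_ratio, se_log_odds, z=1.96):
    """Approximate Wald interval for an odds ratio.

    Assumption: se_log_odds is the standard error of the log-odds
    coefficient b, where odds_ratio = exp(b).
    """
    b = math.log(odds_ratio)
    return math.exp(b - z * se_log_odds), math.exp(b + z * se_log_odds)

# Rounded values read from Table 4 (acute spills model).
rows = [
    ("Phys/quant WE", 0.10, 0.52),
    ("Competence", 0.35, 0.37),
    ("Norwegian Sea", 3.52, 0.43),
]
for name, odds_ratio, se in rows:
    low, high = odds_ratio_ci(odds_ratio, se)
    print(f"{name}: OR = {odds_ratio:.2f}, approx. 95% CI {low:.2f} to {high:.2f}")
```

An interval that excludes 1 is consistent with the significance markers in the table: below 1 for the two survey factors and above 1 for the Norwegian Sea.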

The items of the two strongest significantly
related questionnaire factors are presented in
Table 5.

Table 5. Items in the most important survey factors.


Physical and quantitative work environment (acute spills) 


Are you exposed to skin contact with for example oil, 
drilling fluids, cleaning fluids or other chemicals? 

Can you smell chemicals or clearly see dust or smoke 
in the air? 

Do you have difficulties seeing what you should see 
because of lack of, weak or blinding lighting? 

Is it necessary to work at a high pace?

Do you consider the shift arrangement as demanding? 


Framework conditions (HC leaks) 


I have the necessary competence to conduct my work 
in a safe way 

The HSE procedures cover my work tasks well

I have had sufficient training in safety 

I find it easy to find things in governing documents 
(requirements and procedures) 

Does your work demand so much attention that you 
perceive it as demanding? 

The manning level is sufficient so that HSE is taken care 
of in a good way 

Risky work operations are always thoroughly assessed 
before they start 

Information about unwanted events is effectively used
to prevent repetitions

Safety has the highest priority when I do my job


Competence (acute spills) 


I have had sufficient training in safety 

Do you get the necessary training in the use of new ICT 
systems? 

I find it easy to find things in governing documents 
(requirements and procedures) 


4 DISCUSSION 


The aim of the present study was to investigate the 
predictive effect towards HC leaks and Acute spills 
by using a survey addressing work conditions and 
safety climate, offshore installation and company 
information. 

We found that our model explained 12% of the 
variation in HC leaks, and 21% of the Acute spills 
the next year. 

Our results show that these factors could be
used as indicators for future safety performance,
although the total explained variance is modest.
The overall findings are in line with several
previous studies (e.g. Gilberg et al., 2015; Kongsvik
et al., 2011; Beus et al., 2010). We also found that
the set of perceptions that employees share
regarding safety in their environment (Zohar, 1980;
Zohar, 2003; Zohar, 2010) is somewhat related to
safety performance.


In addition, safety climate factors are not the
only ones related to safety performance. The Physical
and quantitative work environment and Noise
and sleep factors are also related to it (although
only bivariately), which corresponds well with
the research of Nahrgang et al. (2010)
and Akerstedt et al. (2002) on work conditions.

In the following, the hypotheses are specifically 
discussed. 


4.1 Hypothesis 1 — HC leaks model 


All the factors were significantly related to HC 
leaks, thus giving support to hypothesis 1. In the 
multiple regression, the Framework factor was 
the strongest predictor of HC leaks. This should 
be highly relevant from the perspective of the 
supervisory authority, because these are aspects 
that can easily be inspected by regulators. Based 
on our findings, targeting documentation of train- 
ing, manning level analyses and the availability and 
quality of HSE procedures on the installations can 
aid in identifying installations at risk of HC leaks. 
The content of the Framework factor is consist- 
ent with the content of several of the safety climate 
dimensions such as safety system, safety compe- 
tence and work pressure presented in Kvalheim 
& Dahl (2016) as important predictors of safety 
compliance. 

Considering the complexity of an HC leak, the
fact that we measure the safety climate the year
before, that no technical condition data are in the
equation, and the inherent validity and reliability
issues of surveys as a method, we
conclude that these are indeed interesting results.
However, the incremental explained variance is
somewhat lower than in earlier studies. This may be
due to the new control variables (for example,
installation type and area of operation), or to the
inclusion of more, new and better data, which reduces
unwanted biases in the observations.


4.2 Hypothesis 2 — Acute spills model 


All factors but one were significantly related to
acute spills the next year, as shown by the ANOVA.
Competence was the factor that did not
significantly explain any variance in the ANOVA;
it was, however, significant as a predictor in
the regression analysis. In sum, this gives support
to hypothesis 2. The strongest predictor was the
Physical and quantitative work environment factor,
and the Competence factor was also related to
acute spills the next year.

Interestingly, the questions and factors related to
acute spills were quite different from those
related to HC leaks. Whereas the acute
spills aspects were related to work environment
factors, for example the subjective experience
of demanding working hours, the HC leaks
aspects were related to framework conditions such
as competence, safety system and HSE procedures.
This difference is consistent with the assumption that
HC leaks are related to more complex work processes,
which involve e.g. more coordination of different
tasks and people, whereas acute spills may
be related to more limited tasks conducted
within a more limited time period.

Some of the questions may be related to the
baseline risk on some installations. This
particularly concerns the two questions addressing
whether the respondents can see or smell chemicals,
or are exposed to oil. One reason for the strong
effect could therefore be that some installations,
for whatever reason, have many acute spills, which
influences the answers to these two questions;
the correlations found here are then to some
degree spurious. However, since we
included only the questions that were more
relevant as leading indicators, some predictive
effect of the questions themselves is present.


4.3. Hypothesis 3 — Multiple regression 


In hypothesis 3, we assumed
that the factors related to safety performance
would be significant even when controlling for
variables and conditions such as operator, installation
type and area of operations.

This was indeed the case for Framework
conditions (with regard to HC leaks) and for Physical
and quantitative work environment and Competence
(with regard to acute spills). The other factors were
not significant. This gives some support to our
hypothesis.

Although we find different factors using a PCA
procedure, it may be that several of the
survey factors in reality measure, to some extent,
one underlying phenomenon, a common work
condition/work environment response, and that
the specific aspects do not have enough explanatory
effect. Indeed, the first factor of each PCA
explains more variance than the others.
The common method bias of a broad survey is also
present to some degree, which may explain why only
one and two factors, respectively, are significantly
related to safety performance.


5 CONCLUSIONS AND FURTHER 
RESEARCH 


In line with theories and findings that show
the complexity of accidents and incidents, we
included a broad spectrum of survey factors,
including safety climate, work environment and
psychosocial factors. We found that survey factors
can indeed be related to future safety performance,
to some extent even when controlling
for operator, installation type and area of operation.
The causal chains that lead to HC leaks are
inherently different from those that lead to acute
spills. This was also reflected in the findings: the
factors from the PCA were quite different between
HC leaks and acute spills.

Further research should consider treating the
variables as continuous to cover more of the data
variation. Additionally, comparing results across
other types of safety performance would be
interesting, for example dropped objects, HC leaks
below 0.1 kg/sec, or personal injuries. Where data
are available, comparisons such as those in this study
would be interesting to conduct in other industries
as well. Further, installations vary in
terms of size, operation type and activity.
Installation types could be nuanced according to these
aspects in further research. Efforts should also be
made to find a normalization variable for the target
variables (e.g. HC leaks per work hour, gas
produced etc.).

Lastly, alternative approaches to aggregating
a single installation score could be considered.
Aggregating from individual to installation
level necessarily entails a loss of data, which should
be investigated using techniques other than the
arithmetic mean.

To sum up, this study shows that there are
modest, but interesting, relationships between
safety climate and work environment survey
measures and future safety performance. These
relationships should be recognized by authorities and
companies operating in the oil and gas industry, and
the factors may be used as risk indicators in safety
management efforts.


ABSTRACT: The power system is one of the most complex systems ever designed. Determining its
reliability is therefore a demanding process, which is mostly handled by stating and solving partial
problems. The loss of load expectation method is selected and improved, with a model of a
conventional power system as the starting point. The reliability of the initial conventional power
system is evaluated. Then the power system is changed: one conventional power plant
is abandoned and new wind power plants take its place. The reliability of the initial and the changed
power system is calculated and compared. The discussion shows that power system reliability
decreases as wind power plants replace a growing share of conventional plants,
unless enough reserve is added to the power system at the same time. Other
related problems are also discussed, to show that the problem is very complex and that one cannot
consider only a limited number of influencing parameters when deciding on power system configurations.


1 INTRODUCTION

The power system is one of the most complex systems
ever designed (Cepin, 2011). Determining its
reliability is therefore a demanding process, which is
mostly handled by stating partial problems regarding
its reliability and then presenting their solutions.
Many methods exist, each of which answers the stated
problem from its own viewpoint. Some of them treat
the power system as a static system and use the term
adequacy as the measure of power system
reliability. Others treat the system as
a dynamic system and use the term security
as the measure of power system reliability.

Loss of load expectation is only one of the
methods identified (from the static point of view
on power system reliability), and it has found a
number of interesting applications (Calabrese,
1947; Billinton, Allan, 1996; Elmakias, 2008;
Bricman Rejc, Cepin, 2014).

The objective of the paper is to extend the static
method of power system reliability assessment
named loss of load expectation into an improved
version of the method and to present realistic
examples of the method, which can serve as a
starting point for discussions about the selection of
new power plants in the system and the amount of
reserve power needed to keep the specified level
of power system reliability.

Loss of load expectation assesses the reliability of
conventional power systems with constant power
and constant plant availability. Renewable power
plants differ from conventional ones in many
aspects, and variable power is one of the
important ones.

The objective is to improve the method so that
the loss of load expectation can be assessed in
several steps, considering both the conventional
plants and the renewable sources, whose power
depends largely on weather parameters (such
as river flows, wind speed and solar irradiance).
In addition to the power variability, the power plant
availability should also be treated as a variable and
not as a constant.


2 METHODS


2.1 Loss of load expectation

Loss of load expectation is a method based on
a probabilistic approach. It is used to determine
the required reserves in the power system or
to assess, as a static feature, how probable the state
is in which the loads of the power system are too high
to be supplied from the available power (Calabrese,
1947; Billinton, Allan, 1996). The method analyses
the probabilities of simultaneous outages of
power plants and, based on a model of the daily
load diagram, determines the expected number of
hours of power production capacity shortage per
period considered, e.g. per year.

Loss of load occurs whenever the power system
load exceeds the available generating capacity of
the power plants. Loss of load expectation expresses
the number of hours or days in the time period
considered when power consumption cannot be
covered, taking into account the probability of
losses of generating units. This time
period is usually one year.

Power system generation planners can evaluate
the reliability of the generation system and determine
how much capacity is required to obtain a specified
level of loss of load expectation. As demand
grows over time, additional generating units are
included in such a way that the loss of load
expectation does not exceed the required criterion.

The initial objective of the method is in fact to
determine the required reserve power in the system
rather than to evaluate its reliability, and the
method is applied here in this sense.


2.2 Mathematical model 


Figure 1 shows a yearly load diagram with the
terms used to explain the Loss Of Load Expectation
(LOLE). The mathematical model includes the
following parameters:

— number of power plants that contribute to the
system,
— number of steps,
— number of hours in the daily load diagram (or
yearly load diagram),
— generating capacity of power plant i in step j,
— availability of power plant i in step j,
— yearly or daily load diagram of the power system,
— probability of state k in step j; a state includes some
available and some unavailable power plants,
each considered with its availability or
unavailability, accordingly,
— time duration of loss of capacity, i.e. the time
period when the power in the yearly (or daily)
diagram is larger than the sum of the generating
capacities of the power plants in the system; it is
related to a specific state and is denoted as the
duration of loss of capacity in Figure 1,
— installed capacity of the power system, obtained
as the sum of the generating capacities of all power
plants contributing to the system in step j,
— unavailability of power plant i, calculated as the
complement of its availability,
— number of system states, calculated as a function
of the number of power plants n that contribute
to the system (st_stanj = 2^n); each state
denotes that a particular power plant is available
or unavailable, so the number of states
increases exponentially with the number of power
plants that contribute to the system; an alternative
recursive algorithm can be a solution for larger
systems, where the number of states would be too
large to be evaluated with current computers,
— time duration of not providing enough load in the
daily load diagram (or yearly load diagram) for
system state k and step j,
— available power capacity of all the power plants
that are available.


Figure 1. Yearly load diagram showing the terms
explaining the loss of load expectation. (Yearly load
diagram sorted by decreasing power; vertical axis:
power, with installed capacity, peak load and capacity
in operation marked; horizontal axis: 365 days, with
the duration of loss of capacity marked.)
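
The recursive alternative mentioned in the list can be sketched as follows. This is a minimal illustration, assuming the standard unit-by-unit convolution used to build capacity outage probability tables (see e.g. Billinton, Allan, 1996); it is not taken from the authors' software. The function name is hypothetical, and the three units (40, 30 and 10 MW with unavailabilities 0.1, 0.05 and 0.04) are the ones used in the capacity table example of this section.

```python
from collections import defaultdict

def capacity_outage_table(units):
    """Build {capacity_lost_MW: probability} by adding one unit at a time.

    units: list of (capacity_MW, unavailability) pairs.
    States with the same lost capacity are merged, so the table grows with
    the number of distinct outage levels rather than with 2**n states.
    """
    table = {0: 1.0}
    for capacity, unavailability in units:
        updated = defaultdict(float)
        for lost, prob in table.items():
            updated[lost] += prob * (1.0 - unavailability)    # unit available
            updated[lost + capacity] += prob * unavailability  # unit on outage
        table = dict(updated)
    return table

units = [(40, 0.10), (30, 0.05), (10, 0.04)]
table = capacity_outage_table(units)
for lost in sorted(table):
    print(f"{lost:3d} MW lost: probability {table[lost]:.4f}")
```

For these three units the two states that both lose 40 MW collapse into a single entry, which is the property that keeps the table manageable for larger systems.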


The general equation for calculation of loss of load
expectation is the following:

\[
\mathrm{LOLE} = \sum_{k=1}^{\mathrm{st\_stanj}} p(k) \cdot \mathrm{tr\_izp\_procent}(k) \tag{1}
\]

where: LOLE = loss of load expectation, p(k) = probability
of state k, k = index of states, st_stanj = number
of system states, and tr_izp_procent(k) = time duration
of not providing enough load in the daily load diagram
(or yearly load diagram) for state k, i.e. the time
during which the power capacity of the power plants
in the specific state is lower than the power in the
daily or yearly load diagram. In Figure 1 this is the
duration of loss of capacity.

The values of p(k) are obtained from a capacity
table, which needs to be prepared for the evaluation.


Table 1. Example of capacity table for three power
plants with powers 40 MW, 30 MW and 10 MW and their
respective unavailabilities 0.1, 0.05 and 0.04, also
referred to as Forced Outage Rates (FOR).

Unit  Unit  Unit  Capacity   Capacity in    Probability of each
A     B     C     lost (MW)  service (MW)   capacity state
1     1     1     0          80             0.90 × 0.95 × 0.96 = 0.8208
1     1     0     10         70             0.90 × 0.95 × 0.04 = 0.0342
1     0     1     30         50             0.90 × 0.05 × 0.96 = 0.0432
0     1     1     40         40             0.10 × 0.95 × 0.96 = 0.0912
1     0     0     40         40             0.90 × 0.05 × 0.04 = 0.0018
0     1     0     50         30             0.10 × 0.95 × 0.04 = 0.0038
0     0     1     70         10             0.10 × 0.05 × 0.96 = 0.0048
0     0     0     80         0              0.10 × 0.05 × 0.04 = 0.0002

1 represents plant operable, 0 represents plant inoperable.
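
The capacity table and equation (1) can be combined in a few lines of code. The sketch below enumerates all states for the three units of Table 1 and sums p(k) times the number of hours in which the load exceeds the capacity in service. The function names and the 24-hour load diagram are hypothetical and serve only to illustrate the calculation; they are not the authors' implementation or data.

```python
from itertools import product

# (name, capacity in MW, forced outage rate) as in Table 1.
units = [("A", 40, 0.10), ("B", 30, 0.05), ("C", 10, 0.04)]

def capacity_states(units):
    """Enumerate all 2**n states as (capacity_in_service_MW, probability)."""
    states = []
    for flags in product([1, 0], repeat=len(units)):  # 1 = operable, 0 = inoperable
        prob, in_service = 1.0, 0
        for up, (_, capacity, outage_rate) in zip(flags, units):
            prob *= (1.0 - outage_rate) if up else outage_rate
            in_service += capacity if up else 0
        states.append((in_service, prob))
    return states

def lole(states, hourly_load):
    """Equation (1): sum over states of p(k) times hours with load above capacity."""
    return sum(
        prob * sum(1 for load in hourly_load if load > capacity)
        for capacity, prob in states
    )

# Hypothetical daily load diagram in MW (illustrative values only).
hourly_load = [35] * 8 + [45] * 8 + [65] * 4 + [75] * 4

states = capacity_states(units)
print(f"LOLE = {lole(states, hourly_load):.4f} hours per day")
```

Full enumeration is adequate for a handful of units; for larger systems the number of states doubles with every added unit, which is why a recursive construction is attractive.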


2.3 Upgrade of the method 


The upgrade of the method considers the
specifics of the power available from power plants
that cannot generate their full power all the time:
in every step j, their real power capacities are used
instead of their nominal power.

Namely, hydro power plants with little
accumulation produce power according to the
incoming flow, wind power plants produce
power according to weather parameters such
as wind speed, and solar power plants produce
power according to weather parameters
and solar power density. Furthermore, the daily
load diagram changes every day, and the more
detailed model includes all of the listed details.

Loss of load expectation therefore changes from day
to day, depending on the power of the power plants
and on the specific daily load diagram of the day. Its
average can be calculated.


\[
\mathrm{LOLE}_{h/d} = \frac{1}{\mathrm{st\_cas\_k}} \sum_{j=1}^{\mathrm{st\_cas\_k}} \mathrm{LOLE}_{h/d}(j) \tag{2}
\]

where: LOLE_{h/d} = loss of load expectation (hours
per day), LOLE_{h/d}(j) = loss of load expectation in
step j (hours per day), st_cas_k = number of
steps, and j = index of steps.

At the same time LOLE expressed in hours per 
year can be calculated: 


\[
\mathrm{LOLE}_{h/y} = \sum_{d=1}^{365} \mathrm{LOLE}_{h/d}(d) \tag{3}
\]

where: LOLE_{h/y} = loss of load expectation (hours
per year), LOLE_{h/d}(d) = loss of load expectation in
day d (hours per day), and d = index of days.

Due to the variability of power from renewable
sources, another dimension can be introduced at
every time step of the calculation: several possible
values of renewable power. This gives the minimum
and maximum of LOLE in addition to its mean value.
In fact, a distribution of values can be assessed
instead of a single value.
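
The averaging over steps, the summation over days and the minimum and maximum described above can be sketched as follows. The function names and the per-step values are hypothetical and only illustrate the bookkeeping; a real evaluation would use 365 days and the per-step results of the LOLE calculation.

```python
def daily_lole(step_values):
    """Equation (2): mean of the per-step LOLE values (hours per day)."""
    return sum(step_values) / len(step_values)

def yearly_lole(daily_values):
    """Equation (3): sum of the daily LOLE values over the days supplied."""
    return sum(daily_values)

# Hypothetical per-step results for three days (hours per day).
steps_per_day = [
    [0.00, 0.02, 0.04],
    [0.10, 0.30, 0.20],
    [0.00, 0.00, 0.00],
]

per_day = [daily_lole(steps) for steps in steps_per_day]
print("daily LOLE (h/day):", [round(v, 3) for v in per_day])
print("sum over the days supplied (h):", round(yearly_lole(per_day), 3))
print("minimum and maximum daily LOLE:", round(min(per_day), 3), round(max(per_day), 3))
```

Keeping the individual daily values, rather than only their sum, is what makes it possible to report a minimum, a maximum or a full distribution alongside the mean.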


2.4 Software support 


Software for supporting calculations has been 
developed, which allows quick evaluation of many 
scenarios and which enables sensitivity studies and 
analyses of variations of power supply in future 
years with increased consumption, which calls for 
new production capabilities. 


3 ANALYSIS AND RESULTS 


3.1 Models — base case


The real conventional power system consists of 11
power plants: 1 nuclear, 6 thermal and 4 hydro
power plants and, in addition, a hydro
pump storage power plant. The reliability model
of the selected system is developed for calculating
the loss of load expectation using the yearly
load diagram of the year 2016. This
model represents the base case.

The data for the base case consist of the name of
the plant, its nominal power (electric power
delivered to the power system) and its forced outage
rate (plant unavailability due to unintentional
causes). The exception is the hydro pump storage
power plant, which is modeled as directly changing
the load diagram, so its forced outage rate is not
needed. When the hydro pump storage power plant
operates in the pumping state, its power is added to
the load diagram; when it operates in the generating
state, its power is subtracted from the load diagram.
The losses are neglected at this point, but
should be considered in the future.
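
The treatment of the pump storage plant as a modification of the load diagram can be sketched as follows. The function name and the hourly schedule are hypothetical; the 185 MW generating and 180 MW pumping values are the ones listed for the plant in Table 2, and losses are neglected as stated above.

```python
def adjust_load(hourly_load, schedule, pump_mw=180.0, gen_mw=185.0):
    """Return the load diagram seen by the other power plants.

    schedule: one entry per hour, "pump", "gen" or "idle" (assumed labels).
    Pumping adds to the load, generating subtracts from it.
    """
    adjusted = []
    for load, mode in zip(hourly_load, schedule):
        if mode == "pump":
            adjusted.append(load + pump_mw)
        elif mode == "gen":
            adjusted.append(load - gen_mw)
        else:
            adjusted.append(load)
    return adjusted

# Hypothetical three-hour excerpt of a load diagram in MW.
print(adjust_load([1000.0, 2000.0, 1500.0], ["pump", "gen", "idle"]))
```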

Table 2 presents the power system data:
plant identification, nominal power and Forced Outage
Rate (FOR). The hydro power plants are arranged into
groups. The power of hydro power plants is
not constant but varies significantly depending
on the incoming water flow. The average power of
all hydro power plants through the year is around
40% of nominal power. The data are obtained from
the real power system of the region.

The variants of the power system are as
follows:


— Variant 1: the nuclear power plant is replaced
by three wind power fields, each consisting
of several wind turbines. Each wind field
has a nominal power of 1156.9 MW (all three:
3470.7 MW). Their combined nominal power
follows the ratio of load factors for nuclear versus
wind, which means that the wind power plants need
approximately 5 times the nominal power
of the replaced nuclear power plant.

— Variant 2: the nuclear power plant is replaced by
two wind power fields, each with a power of
1735.35 MW (both wind power fields together:
3470.7 MW) and each consisting of several
wind turbines, and by a large power plant as
reserve power, a gas power plant (e.g. 348 MW).


Table 2. Power system data — base case.

Power plant identification and net electrical power                             FOR
Nuclear Power Plant, 696 MW                                                     0.99
Thermal Power Plant (coal), TES6, 544 MW                                        0.92
Thermal Power Plant (coal), TES5, 345 MW                                        0.91
Thermal Power Plant (coal), TES4, 275 MW                                        0.91
Thermal Power Plant (gas), TESG, 84 MW                                          0.94
Thermal Power Plant (coal), TETOL, 124 MW                                       0.94
Thermal Power Plant (gas), TEB, 297 MW                                          0.96
Hydro Power Plant, DEM, 590 MW                                                  0.99
Hydro Power Plant, SENG, 321 MW                                                 0.99
Hydro Power Plant, SEL, 118 MW                                                  0.99
Hydro Power Plant, HESS, 156 MW                                                 0.99
Hydro Pump Storage Power Plant, Avče, 185 MW (180 MW for pumping mode)          n/a
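
As a quick consistency check of the two variants, the arithmetic below uses only figures given in the paper: the per-field wind powers listed in Tables 3 and 4, the 696 MW nuclear plant and the 348 MW reserve plant.

```latex
\begin{align*}
\text{Variant 1:}\quad & 3 \times 1156.9\ \text{MW} = 3470.7\ \text{MW} \\
\text{Variant 2:}\quad & 2 \times 1735.35\ \text{MW} = 3470.7\ \text{MW} \\
\text{Wind to nuclear ratio:}\quad & \frac{3470.7\ \text{MW}}{696\ \text{MW}} \approx 4.99 \\
\text{Reserve in variant 2:}\quad & 348\ \text{MW} = \tfrac{1}{2} \times 696\ \text{MW}
\end{align*}
```

Both variants therefore install the same total wind power, roughly five times the nuclear plant's nominal power, and differ only in how it is split and in the added reserve.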


3.2 Models — variant 1 


Table 3 presents the data for the variant 1 of the 
power system. 


3.3. Models — variant 2 


Table 4 presents the data for the variant 2 of the 
power system. 


Table 3. Power system data for variant 1.

Power plant identification and net electrical power                             FOR
Thermal Power Plant (coal), TES6, 544 MW                                        0.92
Thermal Power Plant (coal), TES5, 345 MW                                        0.91
Thermal Power Plant (coal), TES4, 275 MW                                        0.91
Thermal Power Plant (gas), TESG, 84 MW                                          0.94
Thermal Power Plant (coal), TETOL, 124 MW                                       0.94
Thermal Power Plant (gas), TEB, 297 MW                                          0.96
Hydro Power Plant, DEM, 590 MW                                                  0.99
Hydro Power Plant, SENG, 321 MW                                                 0.99
Hydro Power Plant, SEL, 118 MW                                                  0.99
Hydro Power Plant, HESS, 156 MW                                                 0.99
Wind Power Plant 1, 1156.9 MW                                                   0.99
Wind Power Plant 2, 1156.9 MW                                                   0.99
Wind Power Plant 3, 1156.9 MW                                                   0.99
Hydro Pump Storage Power Plant, Avče, 185 MW (180 MW for pumping mode)          n/a


Table 4. Power system data for variant 2.

Power plant identification and net electrical power                             FOR
Thermal Power Plant (coal), TES6, 544 MW                                        0.92
Thermal Power Plant (coal), TES5, 345 MW                                        0.91
Thermal Power Plant (coal), TES4, 275 MW                                        0.91
Thermal Power Plant (gas), TESG, 84 MW                                          0.94
Thermal Power Plant (coal), TETOL, 124 MW                                       0.94
Thermal Power Plant (gas), TEB, 297 MW                                          0.96
Hydro Power Plant, DEM, 590 MW                                                  0.99
Hydro Power Plant, SENG, 321 MW                                                 0.99
Hydro Power Plant, SEL, 118 MW                                                  0.99
Hydro Power Plant, HESS, 156 MW                                                 0.99
Wind Power Plant 1, 1735.35 MW                                                  0.99
Wind Power Plant 2, 1735.35 MW                                                  0.99
Reserve Power Plant (Gas), 348 MW                                               0.99
Hydro Pump Storage Power Plant, Avče, 185 MW (180 MW for pumping mode)          n/a


3.4 Analysis and results — base case 


Loss of load expectation is calculated as 9.92 hours
per year (0.027 hours per day) as the average of a
number of calculations, where the maximum loss of
load expectation is 73.4 hours per year and the
minimum is 0.12 hours per year. The calculations
differ because the hydro power plants are not assumed
to run at nominal power all the time: their power
depends on the incoming river flow through the year
and changes accordingly among the calculations.

If an additional base load power plant of 348 MW
is introduced, the loss of load expectation is
approximately ten times lower: 0.96 hours per year
(0.0026 hours per day) as the average of a number
of calculations, where the maximum loss of load
expectation is 8 hours per year and the minimum is
0.008 hours per year.

If only the nuclear power plant is removed from
the model and the rest stays as it is, the loss of load
expectation increases to an average of 419 hours per
year (1.1 hours per day), with a maximum of
2074.6 hours per year and a minimum of 9.1 hours
per year.


3.5 Analysis and results — variant 1 


The nuclear power plant is replaced with three wind
power plants in the model, and the model has been
evaluated for 3 examples of wind power versus
weather parameters, i.e. 3 different functions of
wind power over time, where the wind speed and
other weather parameters determine the wind power.

Each example gives a different wind power
in the configuration points, which can also be seen as
time points. Consequently, the power of the available
power plants differs among configurations and
thus the loss of load expectation is different.

Figure 2 shows wind power for example 1.
Figure 3 shows loss of load expectation for example 1.

Figure 2. Wind power—variant 1 — example 1.

Figure 4. Wind power—variant 1 — example 2.


Loss of load expectation increases significantly
when wind replaces nuclear, which indicates
a significant decrease of power system reliability:
for example 1 it is 262 hours per year (a 26-fold
increase over the base case) as the average of a
number of calculations, where the maximum loss of
load expectation is 2075 hours per year and the
minimum is so low that it can be rounded to 0.

Figure 4 shows wind power for example 2. 
Figure 5 shows loss of load expectation for example 2. 

For example 2, the loss of load expectation is
199 hours per year as the average of a
number of calculations, where the maximum is
1755 hours per year and the minimum is so low
that it can be rounded to 0.

Figure 6 shows wind power for example 3. 
Figure 7 shows loss of load expectation for 
example 3. 

For example 3, the loss of load expectation is
268 hours per year as the average of a
number of calculations, where the maximum is
1755 hours per year and the minimum is so low
that it can be rounded to 0.

If the results of variant 1 are summarized, one
can observe that the nuclear power plant cannot be
replaced by wind without other significant changes in
the power system, such as more reserve power. In
addition, other factors that are not discussed in this
paper certainly need to be considered for such an
exchange. Some of the most important factors are:
frequency control of the power system, voltage
control of the power system, improved rules for
transmission system operation, improved rules for
the distribution system, and improved rules for
system operation in all aspects that depend on
intermittent power changes and the needed
procedures.

Figure 6. Wind power—variant 1 — example 3.

Figure 7. LOLE—variant 1 — example 3.


3.6 Analysis and results — variant 2 


The nuclear power plant is replaced in the model
with two wind power plants and with reserve power
equal to half of the nuclear plant's power, and the
model has been evaluated for 3 examples of wind
power versus weather parameters.

Each example gives a different wind power
in the configuration points, which can also be seen as
time points. Consequently, the power of the available
power plants differs among configurations and
thus the loss of load expectation is different.

Figure 8 shows wind power for example 1. 
Figure 9 shows loss of load expectation for 
example 1. 

Loss of load expectation increases notably
when nuclear is replaced with wind and
reserve power, which indicates a significant
decrease of power system reliability: for example 1
the loss of load expectation is 47.8 hours per year
(more than four times the base case value)
as the average of a number of calculations, where
the maximum loss of load expectation is 470 hours
per year and the minimum is so low that it can be
rounded to 0.

Figure 10 shows wind power for example 2. 
Figure 11 shows loss of load expectation for 
example 2. 

For example 2, the loss of load expectation is
46.7 hours per year as the average of a
number of calculations, where the maximum is
376 hours per year and the minimum is so low
that it can be rounded to 0.

Figure 12 shows wind power for example 3. 
Figure 13 shows loss of load expectation for exam- 
ple 3. 

For example 3, the loss of load expectation is
46.5 hours per year as the average of a
number of calculations, where the maximum is
470 hours per year and the minimum is so low
that it can be rounded to 0.

If the reserve power is increased to 620 MW
instead of 348 MW, the loss of load expectation
is at approximately the same level as in the base case
(with 696 MW of nuclear power).

Figure 8. Wind power—variant 2 — example 1.

Figure 9. LOLE—variant 2 — example 1.

Figure 10. Wind power—variant 2 — example 2.

Figure 12. Wind power—variant 2 — example 3.

Figure 13. LOLE—variant 2 — example 3.


If the results of variant 2 are summarized, one
can observe that the nuclear power plant cannot be
replaced by wind together with reserve power of
half of its power without a significant reduction of
power system reliability. In addition, other changes
in the power system need to be performed (see the
comments on that in the previous section).
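
The reported averages can be put side by side as multiples of the base case. The sketch below uses only the average values quoted in sections 3.4 to 3.6; the dictionary labels and the function name are chosen here for illustration.

```python
BASE_CASE = 9.92  # average LOLE of the base case, hours per year

# Average LOLE values (hours per year) as reported in the text.
reported = {
    "nuclear removed, no replacement": 419.0,
    "variant 1, example 1": 262.0,
    "variant 1, example 2": 199.0,
    "variant 1, example 3": 268.0,
    "variant 2, example 1": 47.8,
    "variant 2, example 2": 46.7,
    "variant 2, example 3": 46.5,
}

def ratio_to_base(value, base=BASE_CASE):
    """Multiple of the base case LOLE."""
    return value / base

for label, value in reported.items():
    print(f"{label}: {value:6.1f} h/year, "
          f"{value / 365:.3f} h/day, {ratio_to_base(value):.1f} x base case")
```

The ratios reproduce the statements in the text: roughly 26 times the base case for variant 1, example 1, and more than four times for variant 2.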


4 CONCLUSIONS 


The objective of the paper was to present realistic 
examples of the improved method for calculation 
of loss of load expectation to foster discussions 
about the amount of reserve power needed in the 
power system with increasing portions of renew- 
able sources replacing conventional power plants. 

The method was presented and improved. The 
computer code for the application on real exam- 
ples was developed. The realistic cases have been 
analysed and the results were obtained. 

The results show a significant decrease of power
system reliability, or in other words a significant
increase of loss of load expectation, when the nuclear
power plant is replaced with wind power plants
of approximately five times larger combined nominal
power.

The results show a less significant increase of loss
of load expectation if the replacement wind power is
accompanied by additional reserve power, e.g. a gas
power plant with half of the nominal power of the
nuclear power plant. The loss of load expectation
nevertheless still increases several times.

If the nuclear power plant is replaced with five 
times larger nominal wind power and with addi- 
tional reserve (around of 90% of replaced nuclear 
power), the loss of load expectation shows that 
the power system reliability from the static point 
of view is not decreased. More sensitivity studies 
are needed for justifying this statement in more 
details. 

If the number of wind power plants is larger,
the variability of wind power may be smaller than
shown in these examples (variant 1 and variant 2)
and less reserve power may be needed (than the
indicated 90%). If the power system includes
several hydro power plants with large reservoirs,
which by themselves introduce reserve storage
in the power system, the problem
can be solved with less additional power.

Probabilistic methods for calculating the contribution
of renewable sources in the power system may
be used in this sense.

Future work can be oriented towards considering
probability distributions of renewable power
in order to assess probability distributions of loss
of load expectation.
ABSTRACT: The design of pressure vessels intended to contain water at high pressure and
temperature considers the steady state behavior of the fluid. But if the pressure decreases suddenly
because of a leak or a rupture, the water can enter a superheated state. The superheated state is a high energy
and metastable state that could lead to a catastrophic accident such as an explosion. The phase change of
the superheated fluid is expected to entail a violent repressurization of the vessel, and may blow up the
vessel. Little data can be found in the literature about that repressurization process. The aim of this work
was therefore to measure the repressurization dynamics of superheated water following a loss of containment.
Experiments were performed at high temperature and pressure (up to 315°C, 100 bar) in order to
understand the phenomenon. The pressure drop in the tank was fast and went well below the saturation
pressure at the given temperature. Then, a repressurization peak occurred, depending on the pressure drop
history. The set of results is discussed and compared to literature data.


1 INTRODUCTION

Steam is widely used in industry to carry heat.
After condensation, the hot water usually flows back
to a furnace to be reheated and vaporized. Steam
is produced and used at different temperatures and
pressures. Typically, steam below 3.5 barg is termed
low pressure steam, steam above 3.5 barg but
below 17.5 barg is termed medium pressure
steam, and steam above 17.5 barg is termed high
pressure steam. Some users define steam
above 40 barg as ultra-high pressure steam.

A water-steam circuit comprises sections
of piping where water flows at high pressure and
high temperature. In case of a leak the hot water
undergoes a rapid transformation into steam due
to violent boiling or flashing. This phenomenon is
called a steam explosion. The higher the degree of
superheat (i.e. the difference between the temperature
of the water and its atmospheric boiling
temperature), the more violent the explosion.
The water passes from liquid to vapor
with extreme speed, increasing dramatically in volume.
A steam explosion sprays steam, boiling-hot
water and the hot medium that heated it in
all directions, creating a danger of burns. Some
steam explosions appear to be special kinds of Boiling
Liquid Expanding Vapor Explosion (BLEVE),
and rely on the release of stored superheat.

This paper presents (i) thermodynamic
considerations about the superheated state of
water, (ii) the experimental setup, and (iii) results about
the depressurization dynamics in the vessel and the
conclusions that can be drawn from the collected
data.


2 THERMODYNAMIC CONSIDERATIONS


2.1 Phase diagram (P-T)

The thermodynamic properties of water have been widely
investigated. Phase change lines, the triple
point and the critical point are given in Figure 1.
Experimental data about the triple point and the critical
point are given in Table 1.

A specific state of water is called the superheated
state. This state (sometimes referred to as boiling
retardation or boiling delay) is usually described
as the phenomenon in which a liquid is heated to
a temperature higher than its boiling point without
boiling. Superheating is achieved by heating a
homogeneous substance in a clean container, free
of nucleation sites, while taking care not to disturb
the liquid.

More generally, a liquid is said to be superheated
when its temperature exceeds the saturation temperature
at its pressure, or when its pressure falls
below the saturation pressure at its temperature,
while the liquid is still not boiling: $T_l > T_{sat}(P_l)$ or
$P_l < P_{sat}(T_l)$. Superheating may happen at any pressure
below the critical point. The superheat domain
is usually delimited by a line drawn between the
critical point and the superheat limit temperature
at atmospheric pressure.

Figure 1. Phase change diagram of water in (T,P)
coordinates.

Table 1. Specific thermodynamic points of water
(Abbasi & Abbasi 2007).

                  Temperature         Pressure
                  K         °C        Pa              atm
Triple point      273.15    0.01      661.73          0.00653
Critical point    647.37    374       22120 × 10^3    218
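
The superheat criterion above (pressure below the saturation pressure at the liquid temperature) can be checked numerically. The sketch below is an illustration only: it assumes the Antoine correlation with commonly tabulated coefficients for water between roughly 99°C and 374°C (pressure in mmHg, temperature in °C), which is not a correlation used in the paper, and the function names are hypothetical.

```python
# Antoine coefficients for water, high-temperature range (assumed,
# commonly tabulated values; P in mmHg, T in degrees Celsius).
A, B, C = 8.14019, 1810.94, 244.485
MMHG_TO_BAR = 1.33322e-3


def p_sat_bar(t_c):
    """Approximate saturation pressure of water (bar, absolute) at t_c (degC)."""
    return 10.0 ** (A - B / (C + t_c)) * MMHG_TO_BAR


def is_superheated(t_c, p_bar):
    """True if liquid water at (t_c, p_bar) lies below its saturation pressure."""
    return p_bar < p_sat_bar(t_c)


# Liquid water at 300 degC: stable at 100 bar, superheated if the
# pressure suddenly falls to 50 bar while the temperature is unchanged.
print(round(p_sat_bar(300.0), 1))
print(is_superheated(300.0, 100.0), is_superheated(300.0, 50.0))
```

The correlation is only accurate to about a percent in this range, so it is suitable for locating a state relative to the saturation line, not for precise property values.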


2.2 The superheat state 


The superheat state may be described on the usual
P-V diagram as follows (Figure 2). C is the critical
point. [BCB’] is the saturation curve, or
binodal. The isotherm at temperature T = 580 K
(306.85°C) is [ABEFB’D]. B and B’ are equilibrium
states on the binodal. $P_{sat}$ is the equilibrium
pressure at T.

When the liquid state is between A and B, it is
called the subcooled liquid. The liquid at point B is
called the saturated liquid. When the liquid state is
between B and E, it is called the superheated liquid
because its temperature is higher than the
saturation temperature at its pressure, or its pressure
is lower than the saturation pressure at
its temperature.

When the liquid becomes superheated, it also
becomes metastable, which means its stability can
easily be broken by external perturbations. If so, it
can no longer maintain its liquid state and a phase
transition must occur. As the degree of metastability
increases (the liquid approaches point E),
the minimum perturbation required
to break the stability of the liquid becomes smaller.
Finally, at point E, the thermodynamic stability
limit is reached, which means that the phase transition
occurs spontaneously, without any external
perturbation and without any suitable nucleation
site. The stability of the liquid can then be broken by
the density fluctuations of the liquid itself.


2.2.1 Kinetic Superheat Limit (KSL)

In experiments measuring the superheating of
a liquid, bubble nucleation (generation of small
bubbles) starts when point K on the isotherm
in Figure 2 is reached. Point K is called the Kinetic
Superheat Limit (KSL). The KSL can be measured
experimentally provided that early bubble generation
by impurities or wall effects is prevented (no
heterogeneous nucleation). Avedisian (1985) carried out a
comprehensive study of KSL measurement
and reported values of the KSL from the literature. A
large discrepancy was observed, due to different
operating conditions.


2.2.2 Thermodynamic Superheat Limit (TSL)

Point E is called the Thermodynamic Superheat 
Limit (TSL) of the liquid and [CE] is called the 
superheated liquid spinodal. The phase separation 
occurring at the TSL is called spinodal decomposi- 
tion. On Figure 2, [CE], [CF] and [BEFB’] curves 
are not experimental but only illustrative, since 
experimentally the spinodal decomposition has 
only been observed by light scattering techniques 
at a temperature very close to the critical point in 
binary mixture systems and the process is too fast 
to allow transient measurement of thermodynamic 
properties. 

Figure 2. Coexistence line (BCB’), isotherm at equilibrium
(ABB’D), liquid spinodal (CE) and vapor spinodal
(CF) in case of water.

The thermodynamic superheat limit (TSL)
curve is defined by the spinodal curve, given by the
equation:

$$\left(\frac{\partial P}{\partial V}\right)_T = 0 \qquad (1)$$

This equation is a condition of thermodynamic
stability. The superheat limit curve separates two
regions on the P-T diagram (Figure 1), distinguished
by the sign of the pressure gradient:

If $\left(\frac{\partial P}{\partial V}\right)_T < 0$ the system is stable

If $\left(\frac{\partial P}{\partial V}\right)_T > 0$ the system is unstable


The TSL can be predicted by use of equations of
state (EOS). There are several equations of state,
depending on the considered pressure and temperature
range and the molecular structure of the
compound. Reid (1979) used the Redlich-Kwong
EOS to predict the thermodynamic superheat limit
temperature. Kim-E (1981) proposed to use the
Peng-Robinson equation (PR EOS):

$$P = \frac{RT}{V-b} - \frac{a}{V(V+b)+b(V-b)} \qquad (2)$$

Differentiating this equation with respect to V at
constant T and setting the derivative to zero, the
following relation has to be solved:

$$-\frac{RT}{(V-b)^2} + \frac{2a(V+b)}{\left[V(V+b)+b(V-b)\right]^2} = 0 \qquad (3)$$

where V is the molar volume, and a and b are constants
depending on the species being analyzed. The
constants can be calculated from the critical point
data of the substance. For a temperature below the critical
one, the previous equation yields four real
roots. One is less than the saturated liquid volume
and another is greater than the saturated
vapor volume; these are disregarded. The two
remaining roots lie between the liquid and vapor
saturation curves; the smaller root corresponds
to the liquid spinodal curve and the larger to the
vapor spinodal curve. The TSL can therefore be
calculated at atmospheric pressure by solving this
equation.
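
The procedure described above can be sketched numerically. The code below is a minimal illustration under stated assumptions, not the calculation of the cited authors: it uses the standard Peng-Robinson parameter expressions with the usual temperature-dependent alpha function, literature values for the critical constants and acentric factor of water, and a plain bisection on temperature. With the reduced volume x = V/b, the spinodal condition becomes the quartic c(x² + 2x - 1)² - (x + 1)(x - 1)² = 0 with c = RTb/(2a). This sketch keeps only the roots with V > b, i.e. the two physically meaningful spinodal volumes.

```python
import numpy as np

R = 8.314462618                            # J/(mol K)
TC, PC, OMEGA = 647.096, 22.064e6, 0.3443  # water (assumed literature values)


def pr_parameters(t):
    """Standard Peng-Robinson a(T) and b for the constants above."""
    kappa = 0.37464 + 1.54226 * OMEGA - 0.26992 * OMEGA**2
    alpha = (1.0 + kappa * (1.0 - np.sqrt(t / TC))) ** 2
    a = 0.45724 * R**2 * TC**2 / PC * alpha
    b = 0.07780 * R * TC / PC
    return a, b


def pr_pressure(t, v):
    """Peng-Robinson pressure (Pa) at temperature t (K), molar volume v (m3/mol)."""
    a, b = pr_parameters(t)
    return R * t / (v - b) - a / (v * (v + b) + b * (v - b))


def spinodal_volumes(t):
    """Liquid and vapor spinodal molar volumes at t < TC (roots with V > b)."""
    a, b = pr_parameters(t)
    c = R * t * b / (2.0 * a)
    # Quartic in x = V/b obtained from (dP/dV)_T = 0.
    roots = np.roots([c, 4 * c - 1, 2 * c + 1, 1 - 4 * c, c - 1])
    x = sorted(r.real for r in roots if abs(r.imag) < 1e-8 and r.real > 1.0)
    return x[0] * b, x[-1] * b


def superheat_limit(p=101325.0, t_lo=450.0, t_hi=640.0):
    """Temperature (K) at which the liquid spinodal pressure equals p."""
    for _ in range(60):
        t_mid = 0.5 * (t_lo + t_hi)
        v_liq, _v_vap = spinodal_volumes(t_mid)
        if pr_pressure(t_mid, v_liq) < p:
            t_lo = t_mid   # spinodal pressure too low: raise the temperature
        else:
            t_hi = t_mid
    return 0.5 * (t_lo + t_hi)


print(superheat_limit())
```

The bisection relies on the liquid spinodal pressure increasing monotonically with temperature between the two bounds; other equations of state only require replacing `pr_parameters`, `pr_pressure` and the quartic.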

Other equations of state can be used. Abbasi &
Abbasi (2007) used 9 models to calculate the TSL in
the case of water (Table 2), including Van der Waals (VDW),
Soave-Redlich-Kwong (SRK), Peng-Robinson
(PR), Twu-Redlich-Kwong (TRK),
Peng-Robinson-Mathias-Copeman (PRMC) and
Berthelot (B). According to the authors' calculations,
the TSL varied in the range [546.6-604.5] K,
that is a spread of ΔT = 58 K depending on the
considered EOS.
2.2.3 Superheat limit curve

Where is the superheat limit curve drawn on
Figure 2 actually located? Reid (1979) claimed that the
thermodynamic superheat limit (TSL) predicted by the
Redlich-Kwong equation of state is in reasonable
agreement with experimental data, but he doubted
those results because “...no satisfactory correlation
now exists to relate p, v and T in the superheated
liquid region...”. An equation of state is
obtained by correlating experimental data outside
the saturation dome. Using an equation of state
for metastable states inside the saturation dome
(Figure 2) therefore amounts to extrapolating those
experimental data. For slightly superheated liquid,
the extrapolation is still reliable (Mengmeng 2013),
but for highly superheated liquid states, Reid’s
concern cannot be dismissed. The accuracy of an equation of
state in predicting the TSL cannot be demonstrated
by such comparisons unless the measured KSL
has been proven to be very close to the real TSL,
which has not been done in any experiment yet. As long as
the validity of an equation of state in predicting
the thermodynamic superheat limit has not been
proven, close agreement of its predictions with
the measured superheat limit cannot be interpreted
as confirmation.

In the case of water, Abbasi & Abbasi (2007) compared TSL
calculations to the experimental KSL at atmospheric
pressure and calculated absolute deviations. For
water, the VDW-EOS gives the minimum average
absolute deviation, 1.20% from the experimental
value; the RK-EOS gives 4.76%; the Twu-Redlich-Kwong
equation of state (TRK-EOS) gives 5.90%; the Berthelot
equation of state (Berthelot-EOS) gives 7.53%; the PR-EOS and
its modified version, the Peng-Robinson-Mathias-Copeman
equation of state (PRMC-EOS), give
deviations of 7.90% and 8.02% respectively; and the
Soave-Redlich-Kwong equation of state (SRK-EOS) gives
9.27%. For more information refer to the study of
Abbasi & Abbasi (2007) or Salla et al. (2006).

According to these considerations, it appears 
that the superheat limit temperature of water is 
still not well defined. The KSL seems more reliable 
since it can be measured experimentally, keeping in 
mind that the value depends strongly on the sur- 
face state of the apparatus. 

Two values of the KSL of water at atmospheric pressure were
found in the literature: Avedisian et al. proposed a
KSL of 575.1 K (301.95°C), whereas Abbasi et al.
proposed a KSL of 553.2 K (280.05°C).


2.3 Phase change of superheated liquid


The key point of this work is the violent phase
change of the superheated liquid. The phase change
dynamics therefore deserve particular attention.
Two kinds of phase change have to
be considered: homogeneous and heterogeneous
nucleation. When a liquid becomes superheated,
vapor embryos can form because the excess
energy of the superheating can be used to cover the
energy needed for phase change and to maintain
surface tension. This process, called bubble nucleation,
has two forms, depending on the locations
where vapor embryos form:


• Homogeneous nucleation, as the name indicates,
occurs in the bulk of the fluid, where no phase
boundaries are present;

• Heterogeneous nucleation occurs on phase
boundaries such as rough walls or suspended
solid impurities.


To form a vapor embryo with the same volume, 
heterogeneous nucleation requires less energy than 
homogeneous nucleation because the presence 
of the phase boundaries allows a lower interface 
area of the vapor embryo. Generally speaking, het- 
erogeneous nucleation occurs at a lower degree of 
superheat than homogeneous nucleation. 
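
The energy argument above can be illustrated with classical nucleation theory, which is an assumption here since the paper gives no formula. For a bubble forming on a flat, smooth wall, the work of formation of a critical embryo is the homogeneous value multiplied by a geometric factor that depends on the liquid contact angle. The function names and the flat-wall idealization are illustrative.

```python
import math


def critical_radius(sigma, p_vapor, p_liquid):
    """Radius (m) of a critical bubble: r* = 2*sigma / (p_vapor - p_liquid)."""
    return 2.0 * sigma / (p_vapor - p_liquid)


def homogeneous_work(sigma, p_vapor, p_liquid):
    """Work (J) to form a critical bubble in the bulk liquid."""
    return 16.0 * math.pi * sigma**3 / (3.0 * (p_vapor - p_liquid) ** 2)


def wall_factor(theta):
    """Reduction factor for a bubble on a flat wall.

    theta is the contact angle measured through the liquid, in radians:
    1 for a perfectly wetted wall, 0 for a completely non-wetted wall.
    """
    c = math.cos(theta)
    return (2.0 + 3.0 * c - c**3) / 4.0


# Fraction of the homogeneous work needed on walls of decreasing wettability.
for deg in (0, 60, 90, 120):
    print(deg, round(wall_factor(math.radians(deg)), 3))
```

Because the factor never exceeds 1, the work of formation on a wall is at most the homogeneous value, which is consistent with heterogeneous nucleation starting at a lower degree of superheat.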

A distinction has to be made between nucleation
and bubble growth in the metastable state far
from the KSL, and nucleation and vaporization
close to the KSL. Far from
the KSL, homogeneous nucleation is much less
probable than heterogeneous nucleation, and the
question of how fast nucleation proceeds also involves
the availability and properties of surfaces
for nucleation. Avedisian (1985) suggested
that the nucleation rate J (nuclei/cm³·s) depends
on the superheat level ΔT (Figure 3). The figure marks
a minimum nucleation rate below which homogeneous
nucleation is unlikely, two higher rates together with
their corresponding superheats, and the superheat
corresponding to the thermodynamic limit.

When a liquid is superheated but is still far
from the superheat limit, vaporization starts
at the locations most favorable for the formation
of initial small bubbles, that is on solid surfaces or
on dust or other solid particles in the fluid. Once
bubbles are formed, they grow according to growth
laws which have been well studied in the literature.

Close to the superheat limit, the
process generating small bubbles (nuclei) is different
from that far from the superheat limit,
because the small vapor nuclei originate homogeneously
in the fluid and at a much faster rate. The
smallest stable bubbles close to the KSL are much
smaller than those far from the KSL. Classical
homogeneous nucleation theory has been developed to
describe this case. It turns out that its predictions of
the nucleation rate are extremely sensitive to details.

The growth of bubbles at the superheat limit
also proceeds in a different manner than away from
the superheat limit. The rapid growth of bubbles
may create a rise in pressure which counterbalances
the initial pressure drop and keeps the liquid
away from the superheat limit. This phenomenon
must be taken into account in the evaluation of the
effects of rapid depressurization or rapid heating.

Figure 3. Schematic variation of nucleation rate with
temperature at a given ambient pressure (Avedisian 1985).

Moreover, during the depressurization of a superheated
liquid, the decrease in pressure is not felt
instantaneously throughout the liquid but spreads in the
form of a wave. The local temperature at different
locations in the liquid may differ, depending
on the features of the accident causing the vessel
to fail and on the amount of heat used for vaporization.

The starting points at different locations in the
liquid may lie at different points on the saturation
curve. The decrease in pressure due to the opening
of the vessel is felt at different locations at
different moments in time, and once vaporization
starts the decrease in pressure can be stopped by
the expansion of the mixture due to vapor generation.
The rate of depressurization is also controlled
by the time needed for vessel rupture.

This has been taken into account in a refinement
of the superheat limit theory proposed by McDevitt
(1990). As summarized in the review by
Leslie & Birk (1991), homogeneous boiling
only occurs at the rupture location, where the liquid
sucked out of the breach first reaches atmospheric
pressure. Their study focused on the liquid behavior
inside the vessel (liquid hammer, pressure recovery
etc.) before a possible total disintegration of the
vessel. A vaporization front can have a very complex
shape and wildly fluctuating properties.


The break-up of existing bubbles into smaller
bubbles in a wildly fluctuating vaporization front
creates extra interfacial area and also new nuclei for
heterogeneous vaporization. This can enhance the
vaporization rate and the front propagation rate
enormously. The question arises whether the
vaporization in a propagating front can generate
a sufficiently strong volume source for significant
blast propagation.

These considerations indicate that, apart from
experiments and thermodynamic models to
determine the superheat limit curve for various
substances, experiments and fluid dynamic
models are also needed to determine the propagation
speed of vaporization fronts.


3 THE BOILING LIQUID EXPANDING 
VAPOUR EXPLOSION (BLEVE) 


A superheated liquid is in a high energy state and
a metastable equilibrium. It can therefore release
a large amount of energy explosively. A
superheated liquid explosion requires that a large
part of the liquid vaporizes in a very short time.
Different behaviors have been described in the literature. Two
main categories can be distinguished, according to the way
the liquid becomes superheated:


• By sudden pressure loss, as observed in a boiling
liquid expanding vapor explosion (BLEVE)

• By sudden temperature increase, as
observed in a rapid phase transition (RPT)


BLEVE is the most common phenomenon and has
been the subject of many studies. The standard theory of
BLEVE was originally proposed by Reid (1979).
The essential idea is illustrated in Figure 4. Under
normal conditions the content of the vessel, a liquid
and its vapor, is in thermodynamic
equilibrium and the pressure and temperature
combination lies on the saturation curve (points A
or C). In the case of vessel rupture the pressure
suddenly decreases, resulting in superheated liquid.
There is a limit to the degree to which a liquid can
be superheated. At constant pressure, the superheat
limit temperature is the highest temperature
that a liquid can sustain without undergoing
phase transition; at constant temperature, the
superheat limit pressure is the lowest pressure at which
a liquid can maintain its liquid state. According
to Reid’s theory, when the pressure of the liquid
decreases from point C to D, the liquid reaches
the superheat limit curve and a BLEVE will occur,
while in the process from A to B the liquid does not
reach the superheat limit curve and no BLEVE will
occur. Based on the previous section, the superheat limit
referred to in Reid’s theory should be the kinetic
superheat limit (KSL).

Figure 4. Schematic explanation of Reid’s superheat
limit theory for BLEVE in depressurization processes.

As explained previously, the situation in a real
incident will be much more complicated than the
simple trajectories from A to B or from C to D
in Figure 4. In Reid’s superheat limit theory (Reid
1976), the severity of the hazard of a BLEVE
is attributed to the fact that the KSL has been
reached, not the TSL. Throughout a study on
BLEVE one should bear in mind the difference
between the two definitions of the superheat limit, and
the difference between an experimentally observed
phenomenon (KSL) and a theoretical thermodynamic
property (TSL).

A direct correspondence between a BLEVE and
spinodal decomposition has never been proven
by experimental data. Recently, Birk et al. (2007),
after analysing a number of medium
scale BLEVE tests, concluded
that in the case of rupture of high pressure vessels
only partially filled with liquid propane (fill level
in the range 13 to 61%) the shock waves observed
in the far field seem to be produced by expansion
of the vapor rather than by vaporization of the
liquid, which is said to be too slow a process to
generate a strong blast. Birk et al. (2007) mention,
however, that the rapid vaporization
process can produce significant dynamic pressure
effects in the near field. These effects are of particular
importance in case of a BLEVE in a confined
space such as a tunnel. Their demonstration
does not involve superheat limit theory but uses a
thermodynamic estimate of the available energy.
Such estimates can assume isentropic expansion
(Prugh 1991) or, more realistically, adiabatic irreversible
expansion (Planas-Cuchi et al. 2004). The second
estimate is about half of the first (Abbasi &
Abbasi 2007).


4 MATERIALS AND METHODS

The apparatus was designed to heat water to high
temperature and pressure and to release the pressure
through a 1⁄2” fast ball valve. The specifications
of the vessel are given in Table 2. A sketch and a
picture of the setup are given in Figure 5.

Eight K type thermocouples were set on a vertical
axis in the vessel. A stirrer was placed on the symmetry
axis of the vessel, so the thermocouple axis
was shifted 5 cm to the side of the stirrer axis.
Location data are reported in Table 3.

The power of the heater was 8000 W, with a
maximum wall temperature of 450°C, so one hour
was necessary to reach the target temperature
(300°C). The apparatus was completely insulated
to minimize heat losses.

Two water cooled dynamic pressure sensors
(Kistler 601C) were mounted on the vessel to measure
the transient pressure in the vessel. The data acquisition
rate was 200 kHz. Two static pressure sensors
(250 bar) were installed at a distance from the vessel and
remained cold during the tests. A half inch fast discharge
pneumatic valve was mounted at the top of the
vessel and controlled remotely.


Table 2. Experimental vessel specifications.

Pressure vessel standard               CODAP 2005
Volume                                 5 L
Steel                                  X2CrNiMo17-12
Maximum allowable working pressure     142 bar
Working temperature                    0°C to 300°C
Design pressure                        190 bar
Design temperature                     300°C
Test pressure                          305 bar
Safety valve pressure                  183 bar
Internal diameter                      125 mm
Internal height                        395 mm
Wall thickness                         11.5 mm
Vessel volume                          5000 mL
Total volume (incl. pipes)             5090 mL
Figure 5. Detailed sketch of the experimental vessel.

Table 3. Location of the thermocouples in the vessel.

Thermocouple    Distance from bottom    Distance from top
number          mm                      mm
8               391                     4
7               319                     76
6               273                     122
5               217                     178
4               166                     229
3               115                     280
2               68                      327
1               15                      380


5 RESULTS AND DISCUSSION 


A series of 8 experiments was performed. Operat- 
ing conditions of the tests are given in Table 4. 


5.1 Test description


The experimental vessel was filled with an accurately
measured quantity of high purity water (milliQ grade). The
gas space remained filled with air at atmospheric
pressure. The heater was switched on. In order to
purge the air, the relief valve remained open until
vapor was observed at the exit of the valve, typically
for 2-3 minutes. Then the valve was closed.
Very little water was lost during the air purge.
Dissolved gases were also removed during this purge.

Figure 6 shows the evolution of the water temperature
along a vertical axis during the test. T1 was located
at the lowest level whereas T8 was located at the
uppermost level in the tank. Temperature stratification
occurred during the first ten minutes. Then,
boiling at the wall created bubbles and turbulence
that are clearly observable on the temperature
curves. When the boiling was sufficient to cause
complete mixing of the liquid, all temperature
records converged to a single temperature curve.
The internal pressure increased according to the
vapor-liquid equilibrium law (Figure 7).

The measurements agree closely with the literature data
on the (T,P) diagram of water (Figure 8). When the
temperature target was reached, the fast discharge
valve was opened. A powerful and noisy steam
jet was created. A noise recording was performed
at the operator bunker and indicated a level of
102 dB. The temperature and pressure dropped
very rapidly.

These data were recorded at an acquisition rate
of 20 Hz. In order to resolve the fast pressure
dynamics in the tank, two dynamic pressure
gauges and high speed recording equipment
provided data at a 200 kHz recording rate.

After a strong depressurization, a small repressurization
occurred and was measured with both
pressure transducers. This repressurization was
very short; the depressurization quickly
overcame the repressurization and the
pressure started to decrease again. In the considered
case, a depressurization of 4.7 bar occurred during
110 ms and was followed during 140 ms by a 1.8
bar pressure increase. Then, the pressure decreased
again until it reached atmospheric pressure. The
data of both dynamic pressure transducers are
reported in Figure 9. The data on the left correspond
to the pressure sensor located at the bottom
of the vessel, the data on the right correspond
to the pressure sensor located at the top of the vessel.
The two records match closely.

Table 4. Summary of experiments.

              Water       Water temperature    Pressure at       Volume fraction
              quantity    at relief time TR    relief time PR    at TR
Experiment    g           °C                   bar               %
1             3995        202                  17                91
2             3500        236                  32                89
3             3425        267                  53                90
4             3730        281                  66                80
5             3700        290                  76                90
6             3899        300                  88                96
7             3913        281                  67                95
8             3854        281                  68                90

Figure 6. Temperature of the fluids (Test #6).

Figure 7. Pressure in the tank (Test #6).

Figure 8. Thermodynamic transformation on phase
diagram (Test #6).

Figure 9. Dynamic pressure in the tank (200 kHz, test #6).
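
The need for the fast acquisition chain follows from simple arithmetic on the figures quoted in this section for test #6. The sketch below only restates those figures (20 Hz and 200 kHz acquisition rates, 4.7 bar drop in 110 ms, 1.8 bar rise in 140 ms); the variable names are illustrative and the mean rates assume a linear change over each interval, which the actual pressure trace need not follow.

```python
def n_samples(rate_hz, duration_s):
    """Number of samples recorded at rate_hz over duration_s."""
    return round(rate_hz * duration_s)


DROP_BAR, DROP_S = 4.7, 0.110   # initial depressurization, test #6
RISE_BAR, RISE_S = 1.8, 0.140   # repressurization that follows

slow = n_samples(20, DROP_S)       # samples on the 20 Hz recorder
fast = n_samples(200e3, DROP_S)    # samples on the 200 kHz recorder

drop_rate = DROP_BAR / DROP_S      # mean depressurization rate, bar/s
rise_rate = RISE_BAR / RISE_S      # mean repressurization rate, bar/s

print(slow, fast)
print(round(drop_rate, 1), round(rise_rate, 1))
```

With only about two samples during the whole pressure drop, the 20 Hz record cannot resolve the repressurization peak, whereas the 200 kHz record provides tens of thousands of points over the same interval.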


5.2 Influence of superheating temperature 


According to the theory, the nucleation and
repressurization rates are expected to depend on the superheat
level. Therefore five experiments were performed
at different temperatures: 236°C, 267°C, 281°C,
290°C and 300°C.

The transient pressure data are given in
Figure 10. In these tests, a repressurization peak
was observed. The peak is more visible at higher
temperatures (281°C, 290°C and 300°C), but a
slight peak is also noticeable at 267°C and some
oscillation can be observed at 236°C.


5.3 Discussion 


Experimental data about the kinetic superheat limit
are recalled in Table 5.

The tests performed at 281, 290 and 300°C
(Figure 10) were above the lower of the two reported
KSL values (280.05°C) and showed a repressurization
peak, which seems to corroborate the
proposal of Avedisian (1985). Indeed, the pressure
drop due to venting was countered by
nucleation in the vessel. It is not possible to check whether
nucleation was homogeneous or heterogeneous.

The data were processed in order to obtain the numerical
values of the pressure drop and the subsequent
repressurization pressure peaks (Table 6). Following
the theory of Avedisian (1985), the superheat ΔT, i.e. the
difference between the temperature of the water before
release and the standard boiling temperature,
was calculated and reported in Figure 11. This figure
has to be compared with Figure 3, with the
difference that the nucleation rate was unknown and
was replaced by the repressurization peak on the y-axis.
The two quantities are linked by non-linear relations that
were not investigated in this work.

A similar trend was observed, but further work
will be required to investigate the nucleation rate
more accurately for comparison with the work of
Avedisian (1985).

Figure 10. Temperature influence on transient pressure
after depressurization.

Table 5. Experimental values of water KSL.

                    KSL
                    °C
Avedisian et al.    301.95
Abbasi et al.       280.05

Table 6. Pressure drop and rise after depressurization.

        Temperature    Target pressure    Pressure drop    Pressure peak
Test    °C             bar                bar              bar
3       267            53                 11.68            0.42
4       281            65                 6.28             1.97
5       290            76                 TA2              2.62
6       300            88                 5.89             1.64

Figure 11. Repressurization peak as a function of
superheated level.

Regarding the safety of pressure vessels, the results
indicate that the pressure peaks remained small
(< 3 bar) in comparison with the pressure in the
tank at relief time. Therefore, if an industrial vessel
or pipe containing superheated water is depressurized
through a small hole due to corrosion or mechanical
failure, no pressure peak large enough to
break the vessel should be feared for the conditions
studied in this article.
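
The two quantities discussed above, the superheat before release and the size of the repressurization peak relative to the pressure at relief time, can be tabulated directly from the reported data. The sketch below uses the release temperatures of tests 3 to 6 from Table 4 and the target pressures and pressure peaks from Table 6; the standard boiling temperature is taken as 100°C, and the variable names are illustrative.

```python
T_BOIL_C = 100.0  # standard boiling temperature of water, degC

# test number: (release temperature degC, target pressure bar, pressure peak bar)
tests = {
    3: (267, 53, 0.42),
    4: (281, 65, 1.97),
    5: (290, 76, 2.62),
    6: (300, 88, 1.64),
}

summary = {}
for number, (t_c, p_bar, peak_bar) in tests.items():
    superheat = t_c - T_BOIL_C         # superheat level, K
    relative_peak = peak_bar / p_bar   # peak as a fraction of relief pressure
    summary[number] = (superheat, relative_peak)

for number, (superheat, relative_peak) in summary.items():
    print(number, superheat, f"{100 * relative_peak:.1f}%")
```

In every test the peak stays below 4% of the pressure at relief time, and the largest peak is observed at 290°C rather than at the highest temperature.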
ABSTRACT: This investigation examines the safety of DP (Dynamic Positioning) systems by briefly
discussing the existing risk assessment methods and risk control measures provided by the International
Maritime Organization (IMO). Following a study of DP systems and of the loss of position incidents
reported to the International Marine Contractors Association (IMCA) and contained in the World
Offshore Accident Databank (WOAD), the relevant hazards are identified and collated. The analysis
covers the 17 years from 2000 to 2016. The primary causes of loss of position incidents were found to be
positional reference system failures and thruster failures, each contributing 20.6% of the total
incidents over that period. Of these two causes, thruster failures are analysed further, because they
occurred at an increased rate within DP incidents from 2011 to 2016.
Three undesired events for loss of position due
to thruster failures were identified: “drive off”, “drift off” and “time loss”. Further
investigation of incidents in more recent years, in this case 2012 to 2016, identified that the DP control
system accounted for 33.7% of the initiating incidents that led to thruster failures. Furthermore, the
undesired event “time loss” accounted for 71.2% of total incidents caused by thruster failures.


1 INTRODUCTION


Many marine vessels are now equipped with Dynamic Positioning (DP) systems. These vessels have different functions and come in different sizes. Some vessels are used for Simultaneous Operations (SIMOPS) and Combined Operations (COMOPS), bringing them in close proximity with other vessels and/or offshore installations. Some are used to conduct diving operations, with a huge number of vessels also used for drilling operations. The functions of these DP vessels are very numerous, challenging and distinct. To cater for these functions, their level of safety was required to be increased and classified. Through time, with support from different organizational bodies and from lessons learnt from past DP failures, the safety of DP systems and vessels has been increased. These DP systems now utilize a redundancy scheme to ensure that a single failure would not lead to an accident. In this research the term safety is derived from the definition of functional safety outlined by Bell (2010). This states that “functional safety is a part of the overall safety that depends on a system or equipment operating correctly in response to its inputs. In essence, this means the achievement of safety through application of control systems. This requires identifying what has to be done and how well it should be done” (Bell, 2010).

DP has evolved over the years, from use as a tool for mobile offshore drilling units, for maintaining position over offshore wells, to being employed for a wide range of position keeping operations. We see DP systems being fitted on an increasingly large number of new and diverse vessels, from offshore units to shuttle tankers to passenger vessels (IMO 1994).

The increase in the number of diverse applications results in an increase to the estimated risks involved with dynamic positioning vessels, further requiring an increased level of safety. To this end there has been a growth in the development of the dynamic positioning systems. There are now several classes from DP0 to DP3 (DNV 2012), each class with a different level of safety.

The rationale behind this topic is to identify, investigate and analyse the loss of position incidents due to DP system failure. These incidents are analysed according to the seasons in which they occurred, types of vessels, main causes and initiating causes.


2 BACKGROUND 


2.1 Dynamic positioning 


DP can be defined as “a means of holding a ves- 
sel in a relatively fixed position with respect to the 
ocean floor, without using anchors, accomplished 
by two or more propulsive devices, controlled by 
inputs from sonic instruments on the sea bottom 
and on the vessel, by gyrocompass, by satellite nav- 
igation or by other means” (Holvik 1998). 

The launch of the Global Positioning System 
Satellite network brought new ideas and new tech- 
nology to be integrated into the DP vessels for 
more efficient performance. In 1981, the Nautical
Institute began working on a certification process
for DP operators. This was done to reduce accidents
and failures caused by human error. In 1983,
the Department of Energy and the Norwegian 
Petroleum Directorate produced guidelines for 
diving from DP vessels. Howard Shatto in 1983 
further improved on the DP systems, allowing 
greater water depths of 7,500 m and even rougher 
seas to be achievable. From his knowledge of sat- 
ellite positioning and through his participation in 
the first use of Failure Mode and Effects Analysis 
(FMEA) for DP systems in 1983, the Mean Time 
Between Failures (MTBF) for DP systems was 
improved six-fold (DPC-MTS 1996). 

By 1985, the number of DP capable vessels had 
increased to over 150. At this time, the vessel types 
equipped with DP systems had also increased. 
The following are some of the types of DP vessels 
that were available by 1985 based on their func- 
tions: Drillships, Mobile Offshore Drilling Units 
(MODU), Diving Semi-Submersibles, Diving and 
Emergency Response Vessels, Remote Operated 
Vessels, Diving Support Vessels, Shuttle Tankers 
and Accommodation Vessels (Flotels). 

The following years saw an increase in the use of DP systems across various functions and vessel types. In 1990, the first DP Floating Production Storage and Offloading (FPSO) vessel, Seillean, was launched by British Petroleum. The Dynamic Positioning Vessel Owners Association (DPVOA) was formed in the same year. In 1994, IMO provided guidelines for DP systems, MSC/Circ.645, and in the same year the American Bureau of Shipping (ABS) introduced its first DP rules. The following year, 1995, the International Marine Contractors Association (IMCA) was formed through the merger of the Association of Offshore Diving Contractors and the DPVOA. In 1996, the DP committee was founded as a Professional Committee of the Marine Technology Society, in a bid to encourage exchange of information, foster improvement of DP reliability, develop guidelines, train and educate, and address any other issues pertinent to DP that encourage an incident-free operation of DP systems (Sean 2009).

DP systems have been improved considerably 
since the installation of the first DP drill ship, 
Eureka. Vessels used for a variety of functions are
now equipped with DP systems. Some functions 
have been stated previously. 


2.2 Classification of the DP systems 


The different classes in the IMO and the ABS regulations are examined here. The only difference between the two is that the IMO regulations have no Class 0: ABS uses DP0 to refer to vessels without DP systems, whereas IMO does not treat this as a class of DP vessels, and its classes start from Class 1.


1. Class 0: DPS-0
   “For vessels, which are fitted with centralized
   manual position control and automatic heading
   control system to maintain the position and
   heading under the specified maximum environmental
   conditions” (ABS, 2013). Class 0 does
   not exist in the IMO classification.
2. Class 1: DPS-1
   “For vessels, which are fitted with a dynamic
   positioning system, which is capable of automatically
   maintaining the position and heading
   of the vessel under specified maximum environmental
   conditions having a manual position
   control system” (ABS, 2013).
3. Class 2: DPS-2
   “For vessels, which are fitted with a dynamic
   positioning system, which is capable of automatically
   maintaining the position and heading
   of the vessel within a specified operating envelope
   under specified maximum environmental
   conditions during and following any single
   fault, excluding a loss of compartment or
   compartments” (ABS, 2013).
4. Class 3: DPS-3
   “For vessels, which are fitted with a dynamic
   positioning system, which is capable of automatically
   maintaining the position and heading of the
   vessel within a specified operating envelope under
   specified maximum environmental conditions
   during and following any single fault, including
   complete loss of a compartment due to fire or
   flood” (ABS, 2013).


The DPS-1, DPS-2 and DPS-3 classification notations
stated by ABS (2013) are structured to conform
to IMO (2017). Therefore, the classifications
DPS-1, DPS-2 and DPS-3 relate to IMO’s equipment
classifications 1, 2 and 3 (ABS, 2013).
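The fault tolerance implied by the quoted notations can be summarised in a small lookup structure. The Python sketch below is illustrative only: the field names (`automatic_control`, `tolerates_single_fault`, `tolerates_compartment_loss`) and the helper `minimum_notation` are assumptions introduced here to paraphrase the ABS (2013) wording, not terms taken from the rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DPClass:
    notation: str                     # ABS notation
    imo_class: Optional[int]          # matching IMO equipment class, None if there is none
    automatic_control: bool           # automatic position and heading keeping
    tolerates_single_fault: bool      # holds position during and after any single fault
    tolerates_compartment_loss: bool  # also after complete loss of a compartment


# Paraphrase of the four quoted definitions (hypothetical encoding).
DP_CLASSES = (
    DPClass("DPS-0", None, False, False, False),
    DPClass("DPS-1", 1, True, False, False),
    DPClass("DPS-2", 2, True, True, False),
    DPClass("DPS-3", 3, True, True, True),
)


def minimum_notation(single_fault: bool, compartment_loss: bool) -> str:
    """Lowest automatic notation whose quoted definition covers the request."""
    for dp in DP_CLASSES:
        if not dp.automatic_control:
            continue
        if single_fault and not dp.tolerates_single_fault:
            continue
        if compartment_loss and not dp.tolerates_compartment_loss:
            continue
        return dp.notation
    raise ValueError("no notation satisfies the request")


print(minimum_notation(single_fault=True, compartment_loss=False))
```

Encoding the classes as data rather than as branching logic keeps the mapping to the IMO equipment classes visible in one place.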




3 STATISTICAL ANALYSIS


For the purpose of hazard identification, inci- 
dent data has been gathered from two sources: the 
World Offshore Accident Databank (WOAD) and
the IMCA. The data from IMCA has been utilised 
for the majority of the analysis as it is consistent 
and covers a substantial time period. 


3.1 Terms used by IMCA 


The incidents and events listed in the various 
reports have been categorized by IMCA into three 
areas, where PL stands for Position Loss. These 
categories are: 


1. DP Incident (PL 1): This is the loss of automatic
   DP control, loss of position or any other incident
   which has resulted in or should have resulted in a
   Red Alert status (IMCA 2017). In other words,
   these are incidents of a serious nature.

2. DP Undesired Event (PL 2): Loss of position,
   loss of stability, or another event which is
   unexpected or uncontrolled and has resulted in
   or should have resulted in a Yellow Alert status
   (IMCA 2017). As such, these are incidents of a
   less serious nature.

3. DP Downtime: Position keeping problem or
   loss of redundancy which would not warrant
   either a ‘Red’ or ‘Yellow’ alert, but where loss
   of confidence in the DP has resulted in a stand-down
   from operational status for investigation,
   rectification, trials, etc. (IMCA 2017). From
   an operational point of view, a loss of time or
   downtime can be seen as an undesired event
   which should be avoided at all cost, so as to
   save money.


For the purpose of this research, further analysis focuses on the first two categories: DP Incidents (PL 1) and DP Undesired Events (PL 2).
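The three IMCA categories amount to a simple decision rule on the alert status that an event produced, or should have produced. The following Python sketch illustrates that rule under stated assumptions: the `Alert` enum, the `stood_down` flag and both function names are hypothetical and are not part of the IMCA reporting scheme.

```python
from enum import Enum
from typing import Optional


class Alert(Enum):
    RED = "red"
    YELLOW = "yellow"
    NONE = "none"


def imca_category(alert: Alert, stood_down: bool = False) -> Optional[str]:
    """Map the alert status an event resulted in (or should have) to a category."""
    if alert is Alert.RED:
        return "DP Incident (PL 1)"
    if alert is Alert.YELLOW:
        return "DP Undesired Event (PL 2)"
    if stood_down:
        # No alert warranted, but operations were suspended for investigation.
        return "DP Downtime"
    return None  # outside the three categories


def in_scope(alert: Alert) -> bool:
    """True for the two categories retained for further analysis (PL 1 and PL 2)."""
    return alert in (Alert.RED, Alert.YELLOW)


print(imca_category(Alert.YELLOW))
```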

There are two common types of position loss:


1. Drive Off: This is characterised by the thrusters
   going to high unwanted thrust, usually because
   the DP control system believes the position is
   wrong (Jenman 1998).

2. Drift Off: This is caused by the lack of sufficient
   power or thrust. For example, a total blackout
   on a ship on high seas, combined with strong
   currents, would lead to a drift off (Jenman
   1998).


There was a debate as to how one would differentiate a “drift off” or “drive off” in which the vessel was quickly recovered and returned to its original position from one in which the vessel travelled far from its original position and could not be easily recovered (Jenman 1998). These arguments led to the addition of a third type of position loss, Large Excursion.


3. Large Excursion: This is an excursion that takes
   the DP vessel beyond its normal excursion,
   characterised by its footprint. The footprint is the
   outline of the vessel’s movement in a particular
   sea state (Jenman 1998).


3.2 IMCA reporting 


Vessels from all over the world report DP related incidents to IMCA, which ensures that the reporting vessel is kept anonymous, thereby ensuring safety and protecting company integrity. In addition, a range of export regulations and restrictions affect business and trade with many countries of the world. These include restrictions on dual-use goods and technology, as well as the various sanctions regimes (UN, US and EU) targeting individual states, including Regulation (EU) No. 833/2014, which is directed at Russia.

The data used in this research is available on 
the IMCA website, to IMCA members only. How- 
ever, there are exceptions so that the information 
can be released to other interested parties who 
are not IMCA members. IMCA requires both its 
members and non-members to confirm that they 
will not use any IMCA document in breach of the 
Restrictions. 

The data provided spanned over 17 years 
(2000-2016). From the reporting styles over the 
years, some improvements in reporting are visible, 
thereby creating differences in the data provided. 

As shown in Table 1, a total of 1,163 incidents were analysed and documented by IMCA over the 17-year period. Table 1 also suggests that incident reporting by vessel owners is not consistent. The number of reported incidents declined from 2000 to 2004, then picked back up to peak in 2008 and fell again in 2010. There was then a steady increase throughout the remaining years.

More emphasis should be placed on incident reporting, including potential near misses.


Table 1. Incident data analysed from 2000 to 2016.

Year   Incidents reported     Year   Incidents reported
2000   110                    2009   75
2001   98                     2010   56
2002   64                     2011   54
2003   51                     2012   64
2004   34                     2013   64
2005   36                     2014   71
2006   59                     2015   80
2007   67                     2016   78
2008   102


From the 1,163 incidents analysed, 633 of 
those incidents indicated the month in which 
they occurred. This was from the year 2007 to 
2015. Based on these IMCA reports, Table 2 was 
developed. 

It can be seen from Table 2 that spring and summer recorded the highest numbers of reported incidents over the 9 years. Taking a closer look at the individual years and months, it is evident that even though spring has the highest number of incidents reported over the 9 years, there were more years in which summer had more incidents reported than spring. From this, one can state that the majority of the incidents reported occur during summer and spring.
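The seasonal totals quoted from Table 2 can be re-derived from the monthly totals. The short Python sketch below does so; the month-to-season grouping follows the table (December to February as winter, and so on), while the variable names are illustrative assumptions.

```python
# Monthly totals over 2007-2015, taken from the "Total" column of Table 2.
MONTHLY_TOTALS = {
    "Dec": 53, "Jan": 40, "Feb": 43,
    "Mar": 63, "Apr": 58, "May": 58,
    "Jun": 65, "Jul": 53, "Aug": 54,
    "Sep": 48, "Oct": 44, "Nov": 53,
}

# Season grouping as used in Table 2.
SEASONS = {
    "Winter": ("Dec", "Jan", "Feb"),
    "Spring": ("Mar", "Apr", "May"),
    "Summer": ("Jun", "Jul", "Aug"),
    "Autumn": ("Sep", "Oct", "Nov"),
}

season_totals = {
    season: sum(MONTHLY_TOTALS[m] for m in months)
    for season, months in SEASONS.items()
}
grand_total = sum(season_totals.values())

# Seasons ranked from most to fewest reported incidents.
ranked = sorted(season_totals, key=season_totals.get, reverse=True)

# Share of all dated incidents that fall in spring or summer, in percent.
spring_summer_share = round(
    100 * (season_totals["Spring"] + season_totals["Summer"]) / grand_total, 1
)

print(season_totals, ranked, spring_summer_share)
```

On these figures spring and summer together account for a little over half of the dated incidents, which is the basis for the statement above.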

Comparing the seasons of the year with the months in which natural disasters are prominent shows that these fall within the same seasons as those stated above. According to AccuWeather Inc., an American media company that provides commercial weather forecasting services worldwide, hurricanes occur at different times of the year in different regions of the world. For the Atlantic Ocean, hurricane season runs from June to November. In the Eastern Pacific Ocean, it occurs in the months of May through to November. The hurricane season for the Western Pacific Ocean runs from July to November, while in the South Pacific Ocean it runs from October through to May, reaching a peak in late February or early March. The Indian Ocean’s hurricane season runs from April to December in the Northern Indian Ocean and October to May in the Southern region (Mummey 2010).

Research conducted by the University of Manchester showed that tornado season in the UK occurs within the months of May to October (Mulder & Schultz 2015). Mother Nature Network, on tornado season in America, stated that it occurs between March and early June in the Southern regions. They also state that on the Gulf Coast tornadoes occur during the spring, while peaking in June and July in the Northern regions. Some states are also mentioned to experience a later tornado season from October to December (Sarah 2011).

From the analyses above, it is evident that most of the adverse weather conditions occur during the spring to autumn seasons. It is reasonable to attribute the higher frequency of incidents reported within the spring to autumn seasons to this.

Furthermore, IMCA does not state at which time of the day these incidents occurred. Similarly, it should be noted that these incidents are reported from all around the globe, not from one specific region.

For legal, trade and restriction reasons, the number of incidents occurring in different areas of the world has been withheld.


Table 2. Incident data analysed from 2007 to 2015.

Month  2007  2008  2009  2010  2011  2012  2013  2014  2015  Total  Season
Dec      10    10     7     3    11     1     2     3     6     53  Winter (136)
Jan       5     6     4     1     5     4     6     5     4     40
Feb       6     2     5    10     2     4     2     5     7     43
Mar       5     7     7     4     5    10     8     7    10     63  Spring (179)
Apr       6     7    12     4     2     6     3     8    10     58
May       9    14     3     3     5     2     8     8     6     58
Jun       4    11     8     6     5     8    11     5     7     65  Summer (172)
Jul       3     7     7     5     4     7     6     5     7     53
Aug       6    11     9     5     3     3     5     8     4     54
Sep       6     9     3     6     4     4     6     5     5     48  Autumn (145)
Oct       4     7     2     4     2    10     3     7     5     44
Nov       1    11     7     5     6     5     4     5     9     53
Total    67   102    74    56    54    64    64    71    80    632


3.3 Incidents according to vessel type


Several DP vessels with different functions report incidents to IMCA. Inconsistency in reporting and changes of reporting style can be seen in this area. The years stated in this report are the years that were clearly documented by IMCA. The types of vessels that reported incidents over the aforementioned years are as follows:


1. Remote Operated Vehicles (ROVs)
2. DP Diving Support Vessels (DSV)
3. Drilling Vessels
4. Pipe/Cable Lay Vessels
5. Offshore Loading Vessels
6. Standby Vessels
7. Well Operations Vessels
8. Seismic Vessels
9. Multi Service Vessels (MSV)
10. Shuttle Tanker
11. Flotels
12. Construction Vessels
13. FPSO
14. Rock Dumping Vessels
15. Crane Vessels
16. Supply Vessels


Since the vessels are of different types and functions, and not all of them reported incidents every year, Table 3 lists the vessel types that reported incidents consistently through the stated time period.

Table 3 shows that DSVs reported the most incidents in the 10 years covered. From 2008 to 2014, the reporting style changed and the types of vessels that had experienced incidents were not reported.


3.4 Incident main causes 


The IMCA reports also indicate the primary cause of each incident, which simplifies the statistical analysis. Table 4 shows the incident causes for each year from 2000 to 2016.

Table 4 shows the number of incidents caused by each DP element, so that the elements causing the most incidents per year can be identified. These elements have been highlighted in bold in Table 4. Judging from this, it is evident that most of the reported DP incidents are caused by either the reference element or the thruster element. It should be noted that the reference element mentioned in Table 4 includes the sensors, while the thruster element includes the propulsion system.


Table 3. Vessel types that reported incidents consistently (the caption and some cells are illegible in the source).

Year    DSVs          DP Drilling   Pipe/Cable Lay
2000    21            22            4
2001    35            5             15
2002    [illegible]   5             12
2003    15            10            4
2004    4             13            2
2005    9             15            2
2006    13            9             10
2007    [illegible]   [illegible]   [illegible]
2015    [illegible]   10            [illegible]
2016    13            2             [illegible]
Total   152           116           88


It has been found that in the years prior to 2010, reference system failures were the main cause of incidents. However, from 2011 to 2016, it can be seen that thruster failures have consistently been identified as the main cause of DP incidents. After 2000, there was a drastic drop in incidents related to reference systems. The number of reference related incidents then peaks again in 2008 and falls drastically again up to 2016; the reasons for this have not been identified here. It may be assumed that multiple DP regulations were adopted and enforced across the 17-year period, which led to the decrease in reference related incidents.

Looking at the whole 17-year period, Table 4 sums up the total incidents caused by the different systems and elements, showing their various percentages as they relate to each other.


Table 4. Incident main causes from 2000-2016.

Year        Computer  Environment  Power gen.  Operator  Reference  Thruster  Electrical  Other  Total
2000              18            5          12        18         34        17           3      3    110
2001              23           18           8         8         14        14          10      3     98
2002              11            3          13         8         14        12           3      0     64
2003               4            7           8        14          6        11           0      1     51
2004               1            2           8         6         12         4           1      0     34
2005               8            3           4         5          5         7           4      0     36
2006               7            4          12        13         11         4           6      2     59
2007              18            4          11         7         13         8           5      1     67
2008              22            3           9         5         27        21          10      5    102
2009               8            2          13        10         18        12          10      2     75
2010               6            4           5         3         21         2          12      3     56
2011              14            5           7         3          9        13           3      0     54
2012               8            2           6        11         11        20           4      2     64
2013               6            3          13         7         15        20           0      0     64
2014              13            2           9         7         12        26           0      2     71
2015              13           11          10        10         11        24           1      0     80
2016              15            0          15        16          7        24           0      1     78
Incidents        195           78         163       151        240       239          72     25   1163
Percentage     16.8%         6.7%       14.0%     13.0%      20.6%     20.6%        6.2%   2.1%   100%
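The percentage row of Table 4 follows directly from the column totals. As a minimal check, the Python sketch below recomputes each share; the dictionary and variable names are illustrative assumptions, and the counts are the "Incidents" row of Table 4.

```python
# Column totals for 2000-2016, from the "Incidents" row of Table 4.
CAUSE_TOTALS = {
    "Computer": 195,
    "Environment": 78,
    "Power gen.": 163,
    "Operator": 151,
    "Reference": 240,
    "Thruster": 239,
    "Electrical": 72,
    "Other": 25,
}

total_incidents = sum(CAUSE_TOTALS.values())

# Share of each main cause in percent, rounded to one decimal as in the table.
shares = {
    cause: round(100 * count / total_incidents, 1)
    for cause, count in CAUSE_TOTALS.items()
}

# The two leading causes by incident count.
top_two = sorted(CAUSE_TOTALS, key=CAUSE_TOTALS.get, reverse=True)[:2]

print(total_incidents, shares, top_two)
```

Reference and thruster failures differ by a single incident (240 against 239), which is why both round to the same 20.6% share.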


3.5 Thruster statistics analysis 


From the analysis of Table 4, it is evident that 
the thruster system in recent times (2012-2016) 
has been the major cause of incidents. Looking 
further into this, it is possible to define initiating 
causes of undesired events relating to thruster fail- 
ures. For this, a more specific analysis of incidents 
caused by thruster failures, that occurred within 
the years of 2012-2016, was conducted. It can be 
seen that for the period in question, a total of 357 
incidents occurred, of which, 114 incidents were 
thruster-caused (31.9%), which is extremely high 
compared to other elements. Of the 114 incidents, 
105 were categorized into the different undesired 
events; “drift off”, “drive off” and “time loss” 
events. Table 5 shows the undesired incidents that 
occurred and their initiating events. 
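The 31.9% thruster share quoted for 2012-2016 can be reproduced from the yearly rows of Table 4. The sketch below is a simple arithmetic check; the data layout and names are assumptions made for illustration.

```python
# (thruster-caused incidents, all incidents) per year, from Table 4.
YEARLY = {
    2012: (20, 64),
    2013: (20, 64),
    2014: (26, 71),
    2015: (24, 80),
    2016: (24, 78),
}

thruster_incidents = sum(t for t, _ in YEARLY.values())
all_incidents = sum(n for _, n in YEARLY.values())
thruster_share = round(100 * thruster_incidents / all_incidents, 1)

print(thruster_incidents, all_incidents, thruster_share)
```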

From Table 5, it is clear that the majority of thruster related incidents begin with a fault in the DP control system, followed by the electrical system and the mechanical system. It is worth noting here that some minor failures have been grouped together to make up the failures mentioned above. For the control system, there are failures such as feedback error, loss of control, wrong DP operator input, etc. These are the failures that are rectified when the DP control system is restarted. Software errors can also fall under DP control system errors. Under electrical errors, there are occurrences such as loose wiring, fibre optic fault, low/high voltage supply, loose fuse, field circuit failure and DC motor failure. In this case, operation continues once the loose wire or other electrical problem has been fixed.

In the case of mechanical initiating failures, there are faults such as low oil pressure, hydraulic pump failure, faulty valve, engine failure, oil pipe leakage, cooling motor failure, brake failure, etc. In any of these mechanical event scenarios, a redundant system will have to be used or the vessel will be taken off DP control and returned to port for fixing. For the category of human error, there are a number of human failures, such as wrong procedures, lack of maintenance, inexperience and late response. Some human errors fall under the DP control system, because there has to be an interface between the control system and the human. Here, depending on the fault, control of the vessel is regained as soon as possible to avoid a worse undesired event. For the reference system, there are errors such as wind sensor errors, tachometer errors, DGPS errors, etc. These errors, coupled with some hidden errors, can cause the thruster to fail at an odd angle (pitch). Finally, there are failures relating to the power generators. These failures are self-explanatory and can cause more than the thrusters to fail. They also include the failure of the Uninterruptible Power Supply (UPS) system. When these fail, they can cause an immediate failure of the thruster, which can be remedied by a redundant power generation system.


Table 5. Undesired incidents identified from thruster failures and their initiating causes from 2012 to 2016.

Undesired events   Drive off   Drift off   Time loss   Total
Reference                  0           0           2       2
Electrical                 5           3          21      29
Mechanical                 2           2          21      25
Generator                  0           0           1       1
Control system             7           6          22      35
Human error                3           2           7      12
Total                     17          13          74     104
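The shares reported for the initiating causes and undesired events can be recomputed from Table 5. The following Python sketch is an illustrative check; the tuple layout (drive off, drift off, time loss) and the variable names are assumptions.

```python
# Counts per initiating cause: (drive off, drift off, time loss), from Table 5.
TABLE_5 = {
    "Reference": (0, 0, 2),
    "Electrical": (5, 3, 21),
    "Mechanical": (2, 2, 21),
    "Generator": (0, 0, 1),
    "Control system": (7, 6, 22),
    "Human error": (3, 2, 7),
}
EVENTS = ("Drive off", "Drift off", "Time loss")

row_totals = {cause: sum(counts) for cause, counts in TABLE_5.items()}
event_totals = {
    event: sum(counts[i] for counts in TABLE_5.values())
    for i, event in enumerate(EVENTS)
}
grand_total = sum(row_totals.values())

# Share of thruster-failure incidents initiated by the DP control system.
control_share = round(100 * row_totals["Control system"] / grand_total, 1)
# Share of thruster-failure incidents that ended as "time loss".
time_loss_share = round(100 * event_totals["Time loss"] / grand_total, 1)

print(row_totals, event_totals, control_share, time_loss_share)
```

Both figures are taken relative to the 104 categorized incidents in Table 5.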


4 CONCLUSION 


From the statistical analysis presented in this study, it was possible to determine the number of incidents that occurred in the different seasons of the year over a 9-year period. The results showed an increase in incidents in the spring and summer seasons. The sources consulted suggest that this could be a result of the harsh weather conditions that occur during those seasons of the year, such as tornadoes and hurricanes. The high tides that occur during the spring were also mentioned as a factor for increased incidents during these seasons.

The vessel types with the most incidents were also analysed. It was found that, of all the vessels recorded, three types had the most incidents recorded: DSVs, pipelay vessels and drilling vessels. DSVs had the most recorded incidents, with 152 consistently recorded (see Table 3). The reason for this was not fully identified; however, it can be argued that pipelay vessels are not used as frequently as DSVs or drilling vessels. This has resulted in a large difference in the number of reported incidents relating to these vessel types.

Following this, the main causes were analysed and separated into several categories. It was found that the two categories with the highest number of incidents in the 17-year period (2000-2016) were the reference and thruster systems. On further investigation of the data, it was found that within the years 2000 to 2002 there was a drastic drop in reference-caused incidents. Although none was found, it is suspected that a regulation was brought into force during that time that either curbed incident occurrence or prevented operators from reporting incidents.

Failures of thruster systems were further analysed because in recent years (2012-2016) they were the major cause of DP incidents. The analysis led to the identification of the initiating events of the thruster failures and their final consequences (Drift-Off, Drive-Off and Time Loss). This helped to identify the hazards, establish the basic events that would cause the release of such hazards, and provide a base for conducting a risk assessment.

Although there were changes in the style of 
reporting, which led to the inconsistency of some 
information, the data sourced from IMCA was 
sufficient in identifying the hazards relating to DP 
operations. 

Finally, from the hazard analysis, reference fail- 
ures and thruster failures were identified as the 
major causes of DP incidents, with control failure 
being the most frequent initiating event leading to 
thruster failures. 
ABSTRACT: We present a method based on heterogeneous ensemble learning for the prediction of the
Remaining Useful Life (RUL) of cutting tools (knives) used in the packaging industry. Ensemble diversity
is achieved by training multiple prognostic models using different learning algorithms. The combination
of the outcomes of the models in the ensemble is based on a weighted averaging strategy, which assigns
weights proportional to the individual model performances on patterns of a validation set. The proposed
heterogeneous ensemble has been applied to real condition monitoring knife data. It has provided more
accurate RUL predictions compared to those of each individual base model.


1 INTRODUCTION 


As the digital, physical and human worlds continue to integrate, the 4th industrial revolution, the internet of things and big data, and the industrial internet are changing the way we design, manufacture and deliver products and services. In this fast-paced, changing environment, the attributes related to the reliability of components and systems continue to play a fundamental role for industry. On the other hand, the advancements in knowledge, methods and techniques, and the increase in information sharing and data availability, offer new opportunities of analysis and assessment for reliability engineering. Based on this increased knowledge, information and data available, we can improve our reliability prediction capability. In particular, the increased availability of data coming from monitoring the relevant component and system parameters, and the growing ability to treat these data with intelligent algorithms capable of mining out information relevant to the assessment and prediction of their state, have opened wide the doors for Prognostics and Health Management (PHM) and predictive maintenance in many industrial sectors, for improved operation and maintenance (Zio, 2016). Approaches for RUL estimation can be generally categorized into model-based and data-driven (Baraldi et al., 2015a). Model-based approaches use physics-based models to describe the degradation behavior of the equipment (Baraldi et al., 2015a). On the other hand, data-driven methods are of interest when an explicit model of the degradation process is not available, as they rely on the availability of field data collected during the operation of one or more similar components. Among data-driven methods one can distinguish between (i) degradation-based approaches, which model the future evolution of the equipment degradation, and (ii) direct RUL prediction approaches, which directly predict the RUL.

Degradation-based approaches are based on 
statistical models that learn the equipment deg- 
radation time evolution from time series of the 
observed degradation (Baraldi et al., 2017). The 
predicted degradation state is, then, compared with 
a failure criterion, such as the value of degradation 
beyond which the equipment fails to perform its
function (failure threshold). Examples of modeling 
techniques used in degradation-based approaches 
are Auto-Regressive models (Gorjian et al., 
2009), Relevance Vector Machines (Di Maio 
et al., 2012) and Semi-Markov Models (Cannarile 
et al., 2017a) (Cannarile et al., 2018). 

Direct RUL prediction approaches, instead, typically resort to machine learning techniques that directly map the relation between the observable parameters and the equipment RUL, without the need to predict the evolution of the equipment degradation state towards a failure threshold (Schwabacher et al., 2007). Techniques used in direct RUL prediction approaches are, for example, Artificial Neural Networks (ANN) (Wang & Vachtsevanos, 2001), Extreme Learning Machines (ELM) (Yang et al., 2017), Gaussian Processes (GP) (Baraldi et al., 2015b), etc.

When few run-to-failure degradation trajecto- 
ries are available, direct RUL approaches may over- 
fit, i.e., these algorithms customize themselves too 
much to learn the relationship between the observ- 
able parameters and the corresponding RUL in 
the training set. Therefore, these methods tend 
to lose their generalization power, which leads to 
poor performance on new data. To overcome this, 
ensemble approaches, based on the aggregation of 
multiple model outcomes, have been introduced 
(Baraldi et al., 2013a). The basic idea is that the 
diverse models in the ensemble complement each 
other by leveraging their strengths and overcoming 
their drawbacks. 

Thus, the combination of the outcomes of the 
individual models in the ensemble improves the 
accuracy of the predictions compared to the per- 
formance of a single model (Brown et al., 2005) 
(Baraldi et al., 2013a). Different methods, such 
as ANN (Baraldi et al., 2013b), Support Vector 
Machine (SVM) (Liu et al., 2006) and kernel learn- 
ing (Liu et al., 2015), have been used with success 
to build the individual models. For example, an 
ensemble of feedforward Artificial Neural Net- 
works (ANN) has been embedded into a Particle 
Filter (PF) for the prediction of crack length evo- 
lution (Baraldi et al., 2013b) and an ensemble of 
data-driven regression models has been exploited 
for the RUL prediction of lithium-ion batteries 
(Xing et al., 2013). In (Rigamonti et al., 2017) a 
local ensemble of Echo State Networks (ESN) has 
been proposed to improve the RUL prediction 
accuracy of turbofan engines. 

The objective of this work is to predict the RUL 
of knives installed on Tetra Pak® A3/Flex filling 
machines used to cut package material. The prog- 
nostic task is complicated by the fact that few run- 
to-failure degradation trajectories are available, 
and a failure threshold is not available. To cope 
with these issues, this work proposes an ensem- 
ble formed by multiple data-driven direct RUL 
prediction models, capable of aggregating the 
RUL predictions for good performance through- 
out the entire degradation trajectory of a knife. 
Ensemble diversity is achieved by heterogeneous 
ensemble generation, i.e., by training the models 
using different prognostics algorithms. Aggregation is obtained by averaging the output of the individual base models with weights proportional to the inverse of their Empirical Generalization Error (EGE) on retrieved patterns in a validation set. The application of the proposed heterogeneous ensemble method to real condition monitoring knife data has been shown to provide more accurate RUL predictions than each individual base learner in the ensemble.

The paper is organized as follows: in Section 2, 
the objectives of this work and the assumptions 
are discussed; in Section 3, the main concepts of 
ensemble learning for data-driven direct RUL 
prediction are illustrated; in Section 4, performance 
metrics to compare different prognostic models are 
discussed. The application of the methodology to 
Tetra Pak® A3/Flex filling data is described in 
Section 5, whereas Section 6 draws the conclusions 
of the work. 


2 ASSUMPTIONS AND OBJECTIVES 


We assume that run-to-failure degradation trajectories 
are available for N pieces of equipment similar to the 
one currently monitored (test equipment). Let 
$x_i(\tau_i) \in \mathbb{R}^m$, $i = 1, \dots, N$; $\tau_i = 1, \dots, n_i$, 
be the vector of m features extracted from signal 
measurements performed at time $\tau_i$ on the ith 
equipment, with $n_i$ indicating the total number 
of data acquisitions performed on the ith equipment 
before its failure. The ground truth RUL of the ith 
piece of equipment at time $\tau_i$ will be referred to 
as $y_i(\tau_i)$, $i = 1, \dots, N$; $\tau_i = 1, \dots, n_i$. We 
consider a case in which the failure thresholds for 
the extracted features are not known. In this setting, 
fault prognostics is framed as a regression problem: 
given the historical dataset U formed by N realizations 
(degradation trajectories) 
$\{x_i(\tau_i), y_i(\tau_i),\ \tau_i = 1, \dots, n_i\}$, $i = 1, \dots, N$, 
of a stochastic process 
$(X(\tau), Y(\tau)) \in \mathbb{R}^m \times (0, +\infty)$, 
our task is to find a function 
$f: \mathbb{R}^m \to (0, +\infty)$ that associates to a test 
pattern $x_{test}(\tau_{test}) \in \mathbb{R}^m$ the corresponding 
output $y_{test}(\tau_{test})$. In what follows, we refer to f 
as base model or base learner (Zhou, 2012). 


3 ENSEMBLE LEARNING FOR FAULT 
PROGNOSTICS 


In contrast to ordinary learning approaches, which 
try to construct one base learner from training 
data, ensemble methods try to construct a set of 
learners $f_1, \dots, f_H$ and combine them to obtain an 
ensemble learner $f_{ens}$. In this work, we consider 
combination of base learners based on weighted 
averaging (Zhou, 2012), i.e., the combined output 
$f_{ens}$ is obtained by averaging the output of the 
individual learners with different weights $\alpha_h$, 
which implies that the different learners have 
different importance: 

$$f_{ens}(x(\tau)) = \sum_{h=1}^{H} \alpha_h f_h(x(\tau)) \qquad (1)$$

where 

$$\sum_{h=1}^{H} \alpha_h = 1, \quad \alpha_h \ge 0; \quad h = 1, \dots, H \qquad (2)$$

3.1 Error ambiguity decomposition 


In this Subsection, we motivate the use of ensemble 
learning to enhance RUL predictions of a test 
equipment. Referring to the ensemble generalization 
error as $GE(f_{ens})$, one can show that the following 
error-ambiguity decomposition holds (for more details, 
see the Appendix): 

$$GE(f_{ens}) = \overline{GE}(h) - \overline{ambi}(h) \qquad (3)$$

where $\overline{GE}(h) = \sum_{h=1}^{H} \alpha_h GE(f_h)$ is the weighted 
average of the hth individual base learner generalization 
error $GE(f_h)$, and $\overline{ambi}(h) = \sum_{h=1}^{H} \alpha_h\, ambi(f_h)$ is 
the weighted average of the hth individual base learner 
ambiguity $ambi(f_h)$ defined in the Appendix. The quantity 
$ambi(f_h)$ quantifies how much the hth base learner 
predictions, $f_h$, differ from the ensemble predictions. 
On the right-hand side of Eq. (3), the first term 
$\overline{GE}(h)$ represents the individual learner average 
error, which depends on the generalization ability of 
the individual base learners, whereas the second term 
$\overline{ambi}(h)$ represents the ambiguity, which depends on 
the ensemble diversity. Since the second term is never 
negative and is subtracted from the first term, the 
error of the ensemble will never be larger than the 
average error of the individual base learners. Further, 
Eq. (3) shows that the more accurate and the more 
diverse the individual learners, the better the ensemble. 
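
The decomposition can be checked numerically at a single instance. The following is a minimal sketch, assuming three hypothetical base-learner predictions, hypothetical weights and a hypothetical true RUL (none of these numbers come from the case study); it only illustrates that the ensemble squared error equals the weighted average error minus the weighted average ambiguity.

```python
# Numerical check of the error-ambiguity decomposition at one instance x.
# All numbers below are hypothetical and chosen only for illustration.

def ensemble_prediction(preds, weights):
    """Weighted average of the base-learner outputs."""
    return sum(a * p for a, p in zip(weights, preds))

def decomposition(preds, weights, target):
    """Return (ensemble SE, weighted average SE, weighted average ambiguity)."""
    f_ens = ensemble_prediction(preds, weights)
    avg_se = sum(a * (p - target) ** 2 for a, p in zip(weights, preds))
    avg_ambi = sum(a * (p - f_ens) ** 2 for a, p in zip(weights, preds))
    ens_se = (f_ens - target) ** 2
    return ens_se, avg_se, avg_ambi

preds = [12.0, 9.0, 10.5]    # hypothetical base-learner RUL predictions
weights = [0.5, 0.3, 0.2]    # non-negative weights summing to one
target = 10.0                # hypothetical true RUL

ens_se, avg_se, avg_ambi = decomposition(preds, weights, target)
print(ens_se, avg_se - avg_ambi)
```

The two printed values coincide up to floating-point rounding, and the ensemble error does not exceed the weighted average error of the base learners.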


3.2 Ensemble generation 


According to the error-ambiguity decomposition 
discussed in Subsection 3.1, ensemble diversity, 
i.e., the difference among the individual base 
learners, is a fundamental issue in ensemble learning. 
Therefore, since complementarity is more important 
than pure accuracy (Zhou, 2012), an ensemble formed 
only by very accurate learners can perform worse 
than one that also includes some relatively weak 
learners. Two approaches are typically used to 
generate diverse base learners: 

• Homogeneous ensemble generation: different 
  base learners are generated using the same 
  prognostic algorithm and diversity is achieved by 
  manipulating data in different ways: subsampling 
  from the training set (e.g., bagging (Zhou, 2012)) 
  or using different subsets of features. 
• Heterogeneous ensemble generation: different 
  base models are generated using different 
  prognostic algorithms. 


In this work, we have resorted to heterogeneous 
ensemble generation since it has been shown to 
provide better performance than homogeneous 
ensemble methods in cases of few low-dimensional 
data (Rathore & Kumal, 2017). 


3.3 Setting the ensemble base model weights $\alpha_h$ 


The data extracted from the available N run-to-failure 
degradation trajectories of similar components are 
divided into training, validation and test subsets, 
formed by $P_{train}$, $P_{valid}$ and $P_{test}$ instances, 
respectively. The training subset is used to build 
the H individual base models, the validation subset 
to assign them the weights to be used for the 
aggregation of the individual model outcomes (Eq. (1)) 
and the test subset to verify the final ensemble 
performance. The weight $\alpha_h$ associated with the hth 
base learner is calculated based on its performance in 
predicting the RUL of the validation set patterns. 
Performance is measured resorting to the Empirical 
Generalization Error (EGE), which for the hth base 
learner is defined as the mean squared error on the 
validation set patterns: 

$$EGE(f_h) = \frac{1}{P_{valid}} \sum_{p=1}^{P_{valid}} \left( f_h(x_p) - y_p \right)^2 \qquad (4)$$

where $(x_p, y_p)$ denotes the pth validation pattern and 
its ground truth RUL. 

In this work, we have considered weights proportional 
to the inverse of the EGE, i.e., 

$$\alpha_h = \frac{1 / EGE(f_h)}{\sum_{k=1}^{H} 1 / EGE(f_k)}, \quad h = 1, \dots, H \qquad (5)$$
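
The weighting scheme can be sketched in a few lines. This is an illustrative example only: the two base learners, their validation predictions and the validation targets are hypothetical, and the EGE is taken as the plain mean squared error on the validation patterns, as stated in the text.

```python
# Inverse-EGE weighting of base learners (illustrative sketch).
# Validation targets and predictions are hypothetical values.

def ege(predictions, targets):
    """Empirical Generalization Error: mean squared error on validation patterns."""
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(targets)

def inverse_ege_weights(eges):
    """Weights proportional to 1/EGE, normalised to sum to one.
    Assumes every EGE is strictly positive."""
    inverse = [1.0 / e for e in eges]
    total = sum(inverse)
    return [v / total for v in inverse]

def aggregate(base_predictions, weights):
    """Weighted average of the base-learner outputs for one test pattern."""
    return sum(a * p for a, p in zip(weights, base_predictions))

val_targets = [30.0, 20.0, 10.0]
val_predictions = {
    "model_a": [28.0, 21.0, 12.0],   # EGE = 3
    "model_b": [34.0, 24.0, 6.0],    # EGE = 16
}

eges = [ege(p, val_targets) for p in val_predictions.values()]
weights = inverse_ege_weights(eges)
ensemble_rul = aggregate([15.0, 19.0], weights)  # hypothetical test-pattern outputs
print(weights, ensemble_rul)
```

With these numbers the more accurate learner receives weight 16/19 and the less accurate one 3/19, so the ensemble output stays close to the better model while still using the other.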


4 PROGNOSTIC PERFORMANCE 
METRICS 


In addition to EGE, we have considered other 
performance metrics, which are typically considered 
(Rigamonti et al., 2017) for quantitatively assessing 
and comparing the point prediction performance of 
different prognostic algorithms (Saxena et al., 2009). 
A brief description of the implemented metrics is 
given hereafter considering a generic test trajectory 
$(x(\tau), y(\tau))$, $\tau = 1, \dots, n$, and a generic base 
learner f. 


• Relative Accuracy (RA): 

$$RA(f) = 1 - \frac{1}{n} \sum_{\tau=1}^{n} \frac{\left| f(x(\tau)) - y(\tau) \right|}{y(\tau)} \qquad (6)$$

Notice that $RA(f)$ is in the range [0,1] and the 
larger the relative accuracy, the more accurate the 
model. 
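• Precision: 

$$P = \sqrt{\frac{\sum_{\tau=1}^{n} \left( \varepsilon(\tau) - \bar{\varepsilon} \right)^2}{n}} \qquad (7)$$

$$\varepsilon(\tau) = f(x(\tau)) - y(\tau) \qquad (8)$$

$$\bar{\varepsilon} = \frac{1}{n} \sum_{\tau=1}^{n} \varepsilon(\tau) \qquad (9)$$

This measure quantifies the dispersion (stability) 
of the prediction error around its mean. The closer 
the precision is to zero, the more stable the model. 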
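
The two metrics can be illustrated with a short sketch. It rests on two assumptions that should be checked against the cited references: relative accuracy is taken as one minus the mean relative absolute error, and precision as the standard deviation (with denominator n) of the prediction error. The trajectory values are hypothetical.

```python
import math

# Illustrative implementations of the two point-prediction metrics.
# Assumed forms: RA = 1 - mean(|prediction - truth| / truth);
# precision = standard deviation of the prediction error (denominator n).

def relative_accuracy(predictions, targets):
    n = len(targets)
    return 1.0 - sum(abs(p - y) / y for p, y in zip(predictions, targets)) / n

def precision(predictions, targets):
    errors = [p - y for p, y in zip(predictions, targets)]
    mean_error = sum(errors) / len(errors)
    return math.sqrt(sum((e - mean_error) ** 2 for e in errors) / len(errors))

true_rul = [40.0, 30.0, 20.0, 10.0]   # hypothetical ground-truth RUL
pred_rul = [44.0, 27.0, 21.0, 9.0]    # hypothetical predictions

ra = relative_accuracy(pred_rul, true_rul)
prec = precision(pred_rul, true_rul)
print(ra, prec)
```

Because the relative error divides by the true RUL, the same absolute error weighs more near the end of life, whereas a squared-error metric weighs errors equally along the trajectory.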


5 CASE STUDY 


This Section presents the results of the application 
of the proposed method to Tetra Pak® A3/Flex 
filling knife condition monitoring data. 

We have available run-to-failure degradation 
trajectories from N = 10 different knives. For each 
knife, we have available m = 2 health indicators, 
which have been extracted using the procedure 
presented in (Cannarile et al., 2017b). 

In this work, a heterogeneous ensemble has been 
developed considering H = 4 prognostic algorithms: 

• Gaussian Process Regression with Squared 
  Exponential (GPRSE) covariance function; 
• GPR with Matern 3/2 (GRPM) covariance function; 
• Support Vector Regression with Gaussian Kernel 
  (SVRGK); 
• SVR with Quadratic Polynomial Kernel (SVRQPK). 

These algorithms have been selected since they 
have proved to be effective even when few training 
data with no clear patterns of regularity are 
available (Domingos, 2012). To properly compare the 
performance of the ensemble model with that of each 
base model, we have resorted to a twice nested 
Leave-One-Out Cross-Validation (LOOCV) approach. 
The outer loop is used to assess the performance of 
the ensemble and the single base learners, whereas 
the inner loop allows setting the weights 
$\alpha_h$, $h = 1, \dots, 4$. In practice, the weights 
associated with the base learners are computed on 
each outer-validation set (using the inner LOOCV 
loop) and the final performance is measured on the 
corresponding outer-testing set (see Figure 1). 
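
The twice nested LOOCV can be sketched as follows. This is a simplified illustration, not the procedure used in the case study: the two base learners are toy stand-ins for the GPR and SVR models, each trajectory carries a single hypothetical feature, and the inner loop is assumed to average the per-fold validation EGE before the inverse-EGE weights are formed.

```python
# Twice nested leave-one-out cross-validation over trajectories (sketch).
# Toy learners and data are hypothetical stand-ins for illustration only.

def fit_mean(train):
    """Toy learner: always predicts the mean training RUL."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Toy learner: least-squares straight line on one feature."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sum((x - mx) * (y - my) for x, y in train) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

LEARNERS = [fit_mean, fit_line]

def mse(model, patterns):
    return sum((model(x) - y) ** 2 for x, y in patterns) / len(patterns)

def flatten(trajectories):
    return [pattern for traj in trajectories for pattern in traj]

def inner_weights(trajectories):
    """Inner LOOCV: average validation EGE per learner, then inverse-EGE weights."""
    eges = []
    for fit in LEARNERS:
        fold_errors = []
        for v in range(len(trajectories)):
            train = flatten(trajectories[:v] + trajectories[v + 1:])
            fold_errors.append(mse(fit(train), trajectories[v]))
        eges.append(sum(fold_errors) / len(fold_errors))
    inverse = [1.0 / max(e, 1e-12) for e in eges]  # guard against a zero EGE
    total = sum(inverse)
    return [v / total for v in inverse]

def nested_loocv(trajectories):
    """Outer LOOCV: score the weighted ensemble on each held-out trajectory."""
    scores = []
    for t in range(len(trajectories)):
        rest = trajectories[:t] + trajectories[t + 1:]
        weights = inner_weights(rest)
        models = [fit(flatten(rest)) for fit in LEARNERS]
        ensemble = lambda x: sum(w * m(x) for w, m in zip(weights, models))
        scores.append(mse(ensemble, trajectories[t]))
    return scores

trajs = [
    [(0.0, 3.0), (1.0, 2.0), (2.0, 1.0)],
    [(0.0, 3.2), (1.0, 2.1), (2.0, 0.9)],
    [(0.0, 2.9), (1.0, 1.8), (2.0, 1.1)],
    [(0.0, 3.1), (1.0, 2.2), (2.0, 1.0)],
]
scores = nested_loocv(trajs)
print(scores)
```

The held-out trajectory of the outer loop never enters the weight computation, which is what keeps the reported ensemble performance from being optimistically biased.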

Table 1 compares the performance of the developed 
ensemble model with that of the GRPM model, which 
turned out to be the best performing individual 
model. 

Notice that the ensemble model performs better 
than GRPM in all the considered metrics. In 
particular, the average EGE is 11.14% lower (more 
satisfactory) than that of GRPM and the relative 
accuracy of the ensemble is 3.34% larger (more 
satisfactory) than that of GRPM, whereas the two 
methods are comparable from the point of view of 
the precision. Finally, Figs. 2 and 3 show the RUL 
predicted by the ensemble and GRPM for two 
representative test trajectories. 

The most satisfactory ensemble predictions tend 
to be at the beginning of the life of the test knife. 
This is reflected by the great improvement of the 
EGE metric, which is more sensitive to errors at 
the beginning of the run-to-failure trajectory than 
the relative accuracy. 


Figure 1. Twice nested LOOCV (outer loop and inner loop). 


Table 1. Comparison between the ensemble and the 
GRPM performances. 

                                                      Ensemble   GRPM 
Empirical Generalization Error (EGE) (best value 0)   3.2991     3.7127 
Relative Accuracy (RA) (best value 1)                 0.8149     0.7804 
Precision (best value 0)                              0.0569     0.0633 
Figure 2. Predicted RUL by the ensemble (diamonds) 
and GRPM (hexagons) for a test trajectory. 

Figure 3. Predicted RUL by the ensemble (diamonds) 
and GRPM (hexagons) for a test trajectory. 


6 CONCLUSIONS 


In this work, we have developed a heterogeneous 
ensemble model for enhancing the accuracy of 
the RUL prediction of knives used in the packag- 
ing industry. Thanks to the diversity of the base 
learner algorithms, the proposed approach has 
been shown capable of reducing the generaliza- 
tion error and providing more accurate RUL pre- 
dictions compared to that of each individual base 
learner in the ensemble. 
APPENDIX 


Given an instance $x = x(\tau)$, the ambiguity of the 
individual base learner $f_h$ is defined as 

$$ambi(f_h|x) = \left( f_h(x) - f_{ens}(x) \right)^2, \quad h = 1, \dots, H \qquad (10)$$

and the ambiguity of the ensemble is 

$$\overline{ambi}(f_{ens}|x) = \sum_{h=1}^{H} \alpha_h\, ambi(f_h|x) = \sum_{h=1}^{H} \alpha_h \left( f_h(x) - f_{ens}(x) \right)^2 \qquad (11)$$

The ambiguity term measures the disagreement 
among the individual base learners on instance x. 
If we use the Squared Error (SE) to measure the 
performance, then the errors of the individual base 
learner $f_h$ and of the ensemble $f_{ens}$ are, respectively, 

$$SE(f_h|x) = \left( f_h(x) - f(x) \right)^2, \quad h = 1, \dots, H \qquad (12)$$

$$SE(f_{ens}|x) = \left( f_{ens}(x) - f(x) \right)^2 \qquad (13)$$

where $f(x)$ denotes the true output at x. Then, one 
can show that (Zhou, 2012) 

$$\overline{ambi}(f_{ens}|x) = \overline{SE}(h|x) - SE(f_{ens}|x) \qquad (14)$$

where $\overline{SE}(h|x) = \sum_{h=1}^{H} \alpha_h SE(f_h|x)$ is the 
weighted average of the individual base learner errors. 
Since Eq. (14) holds for every instance x, after 
averaging over the input distribution p(x) from which 
the instances are sampled, it still holds that 

$$\sum_{h=1}^{H} \alpha_h \int ambi(f_h|x)\, p(x)\, dx = \sum_{h=1}^{H} \alpha_h \int SE(f_h|x)\, p(x)\, dx - \int SE(f_{ens}|x)\, p(x)\, dx \qquad (15)$$

The generalization error and the ambiguity of 
the individual base learner $f_h$ can be written as, 
respectively, 

$$GE(f_h) = \int SE(f_h|x)\, p(x)\, dx, \quad h = 1, \dots, H \qquad (16)$$

$$ambi(f_h) = \int ambi(f_h|x)\, p(x)\, dx, \quad h = 1, \dots, H \qquad (17)$$

Similarly, the generalization error of the ensemble reads 

$$GE(f_{ens}) = \int SE(f_{ens}|x)\, p(x)\, dx \qquad (18)$$

Based on the notation just introduced and Eq. 
(14), we obtain the error-ambiguity decomposition 
(Zhou, 2012): 

$$GE(f_{ens}) = \overline{GE}(h) - \overline{ambi}(h) \qquad (19)$$
ABSTRACT: Risk Informed Decision Making (RIDM) is based on risk metrics obtained from a Proba- 
bilistic Risk Assessment (PRA). For plants exposed to multiple hazards, Multi-Hazards Risk Aggregation 
(MHRA) is necessary to inform decisions. In practice, this is often done by a simple arithmetic summa- 
tion over the different risk contributors, without taking into account that the state of knowledge of the 
risk models of the different hazards can be quite different. In this paper, we provide a hierarchical frame- 
work to assess the strength of knowledge that PRA models are based upon. The framework is organ- 
ized in three attributes characterizing the knowledge which a PRA model is based upon (assumptions, 
data, phenomenological understanding). These attributes are further broken down into sub-attributes 
and, finally, “leaf” attributes that can be evaluated. The PRA models of two hazards groups for Nuclear 
Power Plants (NPPs) are considered and the strength of knowledge behind each model is assessed using 
the developed framework. 


1 INTRODUCTION 

In risk assessment, quantities are calculated to 
describe the magnitude and likelihood of the 
consequences from accidents that may develop from 
known hazards [1]. The confidence in the calculated 
risk indexes depends on the knowledge available to 
support the risk assessment [3-5]. For example, in 
the risk assessment of Nuclear Power Plants (NPPs), 
there is more experience and knowledge on internal 
events than on other hazard groups like external 
flooding [1]. Evaluating the strength of knowledge 
of a risk assessment is, then, important to evaluate 
how much confidence we can put in the risk outcomes 
that are, then, used to inform decision making [2]. 

Research efforts have been conducted, recently, 
for linking knowledge, knowledge evaluation 
and knowledge management to Risk-Informed 
Decision-Making (RIDM) [4-7]. For example, in 
the nuclear industry, knowledge management has 
been identified as a key factor in sustaining nuclear 
power programs and maintaining their safety and 
security [3]. However, most of the existing works 
are qualitative in nature. A semi-quantitative 
method for evaluating the strength of knowledge 
has been proposed by Flage and Aven [4], where 
the strength of knowledge is evaluated in terms of 
four attributes: (i) phenomenological understanding 
and availability of trustable predicting models; 
(ii) reasonability and realism of assumptions; 
(iii) availability of reliable and relevant data and 
information; (iv) agreement/disagreement among 
peers. The four attributes were assessed in three 
levels (minor, moderate and significant) and 
aggregated for strength of knowledge assessment [4]. 
Although the knowledge attributes proposed are 
plausible and reasonably complete, their definitions 
remain ambiguous. In addition, the evaluation of 
these attributes is somewhat intangible in practice, 
since it is done by simple scoring based on a plain 
description of the attributes. To overcome this 
problem, we expand the work in [5] and introduce a 
hierarchical tree-based framework for evaluating the 
state of knowledge. 

The rest of the paper is organized as follows. 
In Sect. 2, we present the developed framework 
for strength of knowledge assessment. Section 3 
applies it on a case study of two hazard groups 
considered in NPPs risk assessment. Finally, in 
Sect. 4, the paper is concluded with a discussion on 
potential future developments. 


2 ASSESSMENT FRAMEWORK 


We consider the strength of knowledge assessment 
of event tree models, which are widely applied in 
PRA of NPPs. The event probabilities in the event 
tree model are typically calculated by fault tree 
models. The risk index associated with a given 
consequence (e.g., the probability of core damage) 
is calculated by summing the values of the risk 
index from several risk models: 

$$R = \sum_{i=1}^{n_O} \sum_{j=1}^{n_{S,i}} R_{i,j} \qquad (1)$$

where $n_O$ is the number of operation states (O) and 
$n_{S,i}$ is the number of accident sequences (scenarios, S) 
that in operation state i can lead to the given 
consequence. Each $R_{i,j}$ in (1) quantifies the specific 
risk index under scenario j (e.g., medium flood level) 
in operation state i (e.g., emergency shutdown). 

The risk models used to calculate the risk index 
$R_{i,j}$ values are characterized by Initial Events (IEs), 
Basic Events (BEs) and the combinations of the 
latter into Minimal Cut Sets (MCSs). In practice, it 
can often be assumed that the MCSs are mutually 
exclusive; then, $R_{i,j}$ can be calculated by [5]: 

$$R_{i,j} = \sum_{k=1}^{n_{MCS,i,j}} \prod_{q=1}^{n_{BE,k}} P_{BE,q} \qquad (2)$$

where $n_{MCS,i,j}$ is the number of minimal cut sets in 
the risk model for operation state i and scenario j, 
$n_{BE,k}$ is the number of basic events in the kth 
minimal cut set, and $P_{BE,q}$ is the probability of 
having the qth basic event. The five elements, S, O, 
IE, BE and MCS, fully define a PRA model, as shown in 
Figure 1. In this paper, we refer to these five 
elements as “atomic elements”. 
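
Equations (1) and (2) amount to a sum of products over a nested structure. The following is a minimal sketch of that calculation; the operation states, scenarios, cut sets and basic-event probabilities are hypothetical, and the dictionary layout is only one possible way to hold a PRA model in memory.

```python
from math import prod

# Risk index as a sum over operation states, scenarios and minimal cut sets,
# assuming mutually exclusive cut sets. All names and numbers are hypothetical.

def scenario_risk(cut_sets):
    """R_ij: sum over cut sets of the product of their basic-event probabilities."""
    return sum(prod(cut_set) for cut_set in cut_sets)

def total_risk(model):
    """R: sum of R_ij over all operation states i and scenarios j."""
    return sum(
        scenario_risk(cut_sets)
        for scenarios in model.values()
        for cut_sets in scenarios.values()
    )

model = {
    "state_1": {
        "scenario_a": [[1e-2, 1e-3], [5e-3, 2e-3]],  # two cut sets, two BEs each
        "scenario_b": [[1e-4]],                      # one single-event cut set
    },
    "state_2": {
        "scenario_a": [[2e-3, 1e-3]],
    },
}

risk = total_risk(model)
print(risk)
```

Here the total is 2e-5 + 1e-4 + 2e-6 = 1.22e-4, dominated by a single scenario, which is the kind of concentration the reduced-order model exploits.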

To assess the strength of knowledge of a PRA 
model, all the five atomic elements need to be con- 
sidered. In practice, however, PRA models are very 
complex and contain many scenarios and operation 
states combined in large and complex fault trees 
and event trees, that consist of thousands of BEs 
and MCSs [6]. For such a complex risk assessment 
model, it is not practical to consider all atomic ele- 
ments for evaluating the strength of knowledge. To 
address this problem, in this work, we first develop 
a reduced-order model for (1), in order to limit the 
number of atomic elements that need to be analyzed. 

A flowchart of the developed knowledge assessment 
method is given in Figure 2. The first step 
involves developing a reduced-order model for 
the original risk assessment model. A detailed 
discussion on how to construct the reduced-order 
model is given in Sect. 2.1. Then, the strength of 
knowledge supporting each atomic element in the 
reduced-order model is assessed by an Analytical 
Hierarchy Process (AHP), as illustrated in Sect. 
2.2. Finally, the strength of knowledge of each 
element is aggregated to evaluate the strength of 
knowledge of the entire PRA model. A detailed 
discussion is given in Sect. 2.3. 

Figure 1. Atomic elements of a PRA model. 

Figure 2. Steps of PRA model knowledge assessment: 
reduced-order PRA model construction (identifying the 
atomic elements of the PRA model; ranking the atomic 
elements based on their contribution to the total risk; 
building the reduced-order PRA model); knowledge 
assessment of the basic events in the reduced-order PRA 
model; knowledge aggregation based on the reduced-order 
model. 


2.1 Reduced-order PRA model construction 


It is often observed in PRA models that most of 
the contribution to the total risk is due to a small 
number of elements of the problem (known as 
“Pareto principle”) [7]. We can, then, reduce the 
PRA model into a reduced-order model, which 
consists of only the most important “elements”. 

The procedure for constructing the reduced-order 
model comprises three steps. Firstly, the number of 
operation states $n_O$ is reduced to $n_{O,Red}$ as follows: 


• Calculate the risk $R_{O,i}$ for each operation state: 

$$R_{O,i} = \sum_{j=1}^{n_{S,i}} R_{i,j}, \quad 1 \le i \le n_O \qquad (3)$$

  where $R_{i,j}$ is calculated by (2). 

• Rank $R_{O,i}$ in descending order. 

• Find the minimal $n_{O,Red}$ so that 

$$\frac{\sum_{i=1}^{n_{O,Red}} R_{O,i}}{R} \ge \alpha \qquad (4)$$

  where $\alpha$ is the fraction of total risk that can be 
  reproduced by the operation states in the reduced-order 
  model (in the case study in Sect. 3.2.1, we assume that 
  $\alpha = 0.8$). 

• Keep only operation states for $i = 1, \dots, n_{O,Red}$; 
  operation states with $i > n_{O,Red}$ are eliminated. 
The second step is to define the reduced number 
of scenarios $n_{S,Red,i}$ for each operating state i in the 
reduced-order model, where $i = 1, \dots, n_{O,Red}$: 

• For $i = 1, \dots, n_{O,Red}$, calculate the risk $R_{i,j}$, 
  $1 \le j \le n_{S,i}$, by (2). 

• Rank $R_{i,j}$ in descending order. 

• Find the minimal $n_{S,Red,i}$ so that 

$$\frac{\sum_{j=1}^{n_{S,Red,i}} R_{i,j}}{R_{O,i}} \ge \beta \qquad (5)$$

  where $R_{O,i}$ is calculated by (3) and $\beta$ is the fraction 
  of total risk that can be reproduced by the scenarios 
  in the reduced-order model (in the case study in 
  Sect. 3.2.1, we assume that $\beta = 0.8$). 

• Keep only scenarios for $j = 1, \dots, n_{S,Red,i}$; 
  scenarios with $j > n_{S,Red,i}$ are eliminated. 
Finally, the number of minimal cut sets $n_{MCS,i,j}$ 
is tailored to $n_{MCS,Red,i,j}$, $i = 1, \dots, n_{O,Red}$, 
$j = 1, \dots, n_{S,Red,i}$: 

• Calculate $R_{i,j,k}$ by: 

$$R_{i,j,k} = \prod_{q=1}^{n_{BE,k}} P_{BE,q}, \quad 1 \le i \le n_{O,Red},\ 1 \le j \le n_{S,Red,i},\ 1 \le k \le n_{MCS,i,j} \qquad (6)$$

• Rank $R_{i,j,k}$ in descending order. 

• Find the minimal $n_{MCS,Red,i,j}$ so that 

$$\frac{\sum_{k=1}^{n_{MCS,Red,i,j}} R_{i,j,k}}{R_{i,j}} \ge \gamma \qquad (7)$$

  where $R_{i,j,k}$ is calculated by (6) and $\gamma$ is the 
  fraction of total risk that can be reproduced by the 
  minimal cut sets in the reduced-order model (in the case 
  study in Sect. 3.2.1, we assume that $\gamma = 0.8$). 

• Keep only minimal cut sets for $k = 1, \dots, n_{MCS,Red,i,j}$; 
  minimal cut sets with $k > n_{MCS,Red,i,j}$ are eliminated. 
Assuming that the MCSs are mutually exclusive, 
the total risk of the reduced-order PRA model can 
be calculated by: 

$$R_{Red} = \sum_{i=1}^{n_{O,Red}} \sum_{j=1}^{n_{S,Red,i}} \sum_{k=1}^{n_{MCS,Red,i,j}} \prod_{q=1}^{n_{BE,k}} P_{BE,q} \qquad (8)$$

Note that from (4), (5) and (7), the reduced-order 
risk $R_{Red}$ can reconstruct at least a fraction 
$\alpha \times \beta \times \gamma$ of the total risk R. Only the events 
that are contained in the reduced-order model (8) are 
used for assessing the strength of knowledge of the PRA. 
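
Each of the three reduction steps applies the same rule: rank the contributions in descending order and keep the smallest leading group whose cumulative share reaches the chosen fraction. A minimal sketch of that rule follows; the function name and the operation-state risk values are hypothetical.

```python
# Rank-and-truncate rule shared by the three reduction steps (sketch).
# The risk contributions below are hypothetical.

def minimal_prefix(contributions, fraction):
    """Return the names of the fewest top-ranked items whose cumulative share
    of the total is at least `fraction`, together with that share."""
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    total = sum(contributions.values())
    kept, cumulative = [], 0.0
    for name, value in ranked:
        kept.append(name)
        cumulative += value
        if cumulative / total >= fraction:
            break
    return kept, cumulative / total

state_risk = {"O1": 8.6e-6, "O2": 0.7e-6, "O3": 0.4e-6, "O4": 0.3e-6}

kept_80, share_80 = minimal_prefix(state_risk, 0.8)
kept_90, share_90 = minimal_prefix(state_risk, 0.9)
print(kept_80, share_80)
print(kept_90, share_90)
```

With a threshold of 0.8 a single operation state suffices in this example, while raising it to 0.9 brings in a second one; applying the same function to scenarios and then to minimal cut sets gives the nested reduction.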


2.2 Knowledge assessment for the risk elements 


Once the reduced-order model is constructed, the 
strength of knowledge of each atomic element 
in such model is evaluated. In Section 2.2.1, we 
present a tree-based hierarchical framework for 
knowledge assessment. Then, in Section 2.2.2, we 
show how to proceed with the evaluation using the 
Analytical Hierarchy Process (AHP) method. 


2.2.1. Knowledge assessment framework 

A tree-based hierarchical framework is here devel- 
oped for knowledge assessment, as shown in 
Figure 3. The strength of knowledge, represented 
by K (Level 1), represents the solidity of knowl- 
edge that supports a PRA model. A higher value 
of strength of knowledge indicates that the PRA 
model is supported by trustable evidence and reli- 
able knowledge, and, therefore, its results can be 
taken with confidence. 

As in Flage and Aven [4], we evaluate the 
strength of knowledge in terms of three attributes: 
assumptions ($K_1$), data ($K_2$) and phenomenological 
understanding ($K_3$). The attribute $K_1$ represents the 
adequacy, solidity and plausibility of the assumptions 
upon which the model is based; $K_2$ represents 
the amount and quality of the available data that 
are used to estimate the parameters of the model; 
$K_3$ represents the knowledge behind the phenomenon 
described in the model. 

For their evaluation, the three attributes are further 
decomposed into sub-attributes. In particular, 
assumptions ($K_1$) is evaluated in terms of quality 
of assumptions ($K_{11}$), value ladenness ($K_{12}$) and 
impact ($K_{13}$); data ($K_2$) is evaluated in terms of the 
amount of data ($K_{21}$) and the reliability and 
consistency of data ($K_{22}$); phenomenological 
understanding ($K_3$) is evaluated in terms of years of 
experience of the experts involved in the model 
development ($K_{31}$), number of experts involved ($K_{32}$), 
academic evidence ($K_{33}$) and industrial evidence ($K_{34}$). 
Value ladenness and reliability and consistency of data 
are further decomposed into “leaf” sub-attributes 
in level 4 for their evaluation, as shown in Figure 3. 


Figure 3. A hierarchical tree-based framework for knowledge assessment. 

Table 1. References that justify the model in Figure 3. 

Attributes | References 
Strength of knowledge is evaluated by $K_1$, $K_2$ and $K_3$. | [4] 
Realism and plausibility of assumptions ($K_1$) is evaluated by the quality of assumptions ($K_{11}$), the value ladenness and subjectivity of the experts ($K_{12}$) and sensitivity analysis on the assumptions ($K_{13}$). | [9]; [10]; [11] 
Value ladenness ($K_{12}$) is defined by $K_{121}$–$K_{127}$. | [10]; [9]; [12]; [13]; [14] 
Data ($K_2$) is evaluated in terms of amount of available data ($K_{21}$) and reliability of data ($K_{22}$). | [4] 
Reliability of data is defined by: (i) completeness; (ii) consistency; (iii) accuracy; (iv) validity; (v) timeliness. | [15]; [16]; [17] 

The tree structure in Figure 3 is constructed 
based on a thorough literature review related to 
trustworthiness and validity assessment of PRA/ 
QRA. References related to the construction of 
the tree model are given in Table 1. It should be 
noted that for phenomenological understanding, 
few references directly consider its assessment. 
A comprehensive understanding of phenomena 
requires its explanation [8], which depends on the 
capability of the experts involved in the risk 
modeling and analyses. Then, four sub-attributes are 
proposed for the assessment of phenomenological 
understanding: (i) industrial evidence; (ii) academic 
evidence; (iii) number of experts involved; 
(iv) number of years of experience in the domain. 


2.2.2 Evaluation using AHP 
Given the hierarchical tree in Figure 3, the assess- 
ment of the strength of knowledge is carried out 
within a Multi-Criteria Decision Analysis (MCDA) 
framework. AHP is adopted [18], as it is fit for both 
quantitative and qualitative evaluation of attributes 
and factors [19] and for group decision making [20]. 
A first step in applying AHP is to evaluate the 
“leaf” attributes (the non-decomposable attributes 
in Figure 3). A score between 1 and 5 is used to 
represent the strength of knowledge with respect 
to each “leaf” attribute, where 1 represents the 
lowest knowledge level and 5 represents the highest 
knowledge level. The score is evaluated based on 
some predefined evaluation criteria. Due to page 
limits, we only present the evaluation criteria for 
$K_{11}$ as an example (see Table 2). 

Then, the inter-level priorities (weights) are 
determined for each attribute, sub-attribute and 
“leaf” attribute, denoted by $W(K_i)$, $W(K_{ij})$ and 
$W(K_{ijk})$, respectively. Based on [14] and [20], a scale 
of 1-9 is used for evaluating the importance of each 
of these attributes relative to each other, with 
reference to their contribution to the parent attribute: 
a value of 1 is assigned when two attributes of the 
same level of the hierarchy are equally important 
and 9 is assigned when one attribute is significantly 
more important than the other. 

The strength of knowledge of the ith atomic 
element, denoted by $K_i$, is, then, calculated as a 
weighted average of all the scores of the “leaf” 
attributes. The value of $K_i$ is between 1 and 5 and 
a high value indicates that we have stronger 
knowledge on that atomic element. 
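
The roll-up of leaf scores through the hierarchy can be sketched as a recursive weighted average. In this illustration the tree shape, the local weights and the leaf scores are all hypothetical; in the actual method the weights would come from the AHP pairwise comparisons, which are not reproduced here.

```python
# Recursive weighted average of leaf scores over an attribute tree (sketch).
# Tree shape, weights and scores are hypothetical.

def node_score(node):
    """Leaf nodes carry a score in [1, 5]; inner nodes average their children
    using the local weights attached to each child."""
    if "score" in node:
        return float(node["score"])
    total_weight = sum(weight for weight, _ in node["children"])
    weighted = sum(weight * node_score(child) for weight, child in node["children"])
    return weighted / total_weight

tree = {"children": [
    (0.5, {"children": [                 # assumptions
        (0.6, {"score": 3}),
        (0.4, {"score": 1}),
    ]}),
    (0.3, {"score": 2}),                 # data
    (0.2, {"score": 5}),                 # phenomenological understanding
]}

knowledge = node_score(tree)
print(knowledge)
```

Because every level is a normalised weighted average of scores between 1 and 5, the result (2.7 in this example) stays on the same 1 to 5 scale.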


Table 2. Quality of assumptions scoring guidelines. 

Attribute | Score 1 | Score 3 | Score 5 
Quality of assumptions | The assumptions are based on weak knowledge and not realistic (conservative assumptions or over-optimistic) | The assumptions are acceptable based on moderate knowledge, simple model and extrapolated data | The assumptions are based on strong knowledge and established theory, verified by peer review and very plausible 


2.3 Knowledge aggregation 


From (8), the risk index of the reduced-order PRA 
model is the sum of $n_l = \sum_{i=1}^{n_{O,Red}} n_{S,Red,i}$ risk index 
values $R_{Red,l}$ from the corresponding elementary 
risk models, where each elementary risk model is 
further composed of MCSs and BEs, as shown in (2): 

$$R_{Red,l} = \sum_{k=1}^{n_{MCS,Red,l}} \prod_{q=1}^{n_{BE,k}} P_{BE,q} \qquad (9)$$

In (9), $R_{Red,l}$ is the risk index of the l-th reduced 
elementary risk model, where $l = 1, 2, \dots, n_l$, and 
$n_{MCS,Red,l}$ is the number of MCSs in the l-th reduced 
elementary risk model. 

Let $K_{BE,l,q}$ denote the strength of knowledge of 
the q-th BE in the l-th reduced elementary risk model, 
where $K_{BE,l,q} \in [1,5]$ and a large value of $K_{BE,l,q}$ 
indicates strong knowledge of the BE. The $K_{BE,l,q}$ are 
assessed using the procedures described in Sect. 2.2. 

The next step is to aggregate the $K_{BE,l,q}$ to assess 
the strength of knowledge of the whole risk assessment 
model. The aggregation should consider the 
difference in each atomic element’s contribution to 
the total risk. 

Different importance measures can be used to 
evaluate the contribution of the atomic elements 
with respect to the total risk. Since the elementary 
reduced-order risk model is constructed by the 
BEs through MCSs, the weights of the BEs are 
calculated based on Fussell-Vesely importance 
measures: 

$$W_{BE,l,q} = I_{BE,l,q} \qquad (10)$$

where $I_{BE,l,q}$ is the Fussell-Vesely importance 
measure of the corresponding BE in the elementary risk 
model l. 

The strength of knowledge for the l-th elementary 
reduced-order risk model, denoted by $K_l$, is 
calculated by: 

$$K_l = \frac{\sum_{q=1}^{n_{BE,l}} W_{BE,l,q} \cdot K_{BE,l,q}}{\sum_{q=1}^{n_{BE,l}} W_{BE,l,q}} \qquad (11)$$

where $n_{BE,l}$ is the number of BEs in the l-th 
elementary risk model. 
The importance of the elementary reduced-order 
model is evaluated by its contribution to the 
total risk: 

$$W_l = R_{Red,l} \qquad (12)$$

where $R_{Red,l}$ is the risk index value of the l-th 
elementary reduced-order model and is calculated by 
(9). 

To calculate the total strength of knowledge $K_{Red}$ 
of the reduced-order risk model, the knowledge 
indexes $K_l$ of the reduced-order elementary risk 
models are further aggregated by considering their 
contributions: 

$$K_{Red} = \frac{\sum_{l=1}^{n_l} W_l \cdot K_l}{\sum_{l=1}^{n_l} R_{Red,l}} \qquad (13)$$

The index $K_{Red}$ is, then, used to represent the 
strength of knowledge of the entire PRA. Its value 
is between 1 and 5, and a high value indicates that 
we have strong knowledge in support of the PRA 
model and its risk outcomes. 
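
The two-level aggregation can be sketched as two nested weighted averages: basic-event scores are averaged with importance weights inside each elementary model, and the resulting model scores are averaged with risk weights. This is an illustrative sketch under that reading of the procedure; the dictionary keys, importance values, risk values and knowledge scores are all hypothetical.

```python
# Two-level knowledge aggregation (sketch): basic events -> elementary models
# -> whole reduced-order model. All inputs are hypothetical.

def weighted_mean(values, weights):
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def model_knowledge(be_knowledge, be_importance):
    """Importance-weighted average of the basic-event knowledge scores."""
    return weighted_mean(be_knowledge, be_importance)

def total_knowledge(elementary_models):
    """Risk-weighted average of the elementary-model knowledge scores."""
    scores = [model_knowledge(m["K_BE"], m["I_FV"]) for m in elementary_models]
    risks = [m["R_Red"] for m in elementary_models]
    return weighted_mean(scores, risks)

elementary_models = [
    {"K_BE": [2.0, 4.0], "I_FV": [0.75, 0.25], "R_Red": 3e-6},
    {"K_BE": [5.0, 3.0], "I_FV": [0.5, 0.5], "R_Red": 1e-6},
]

k_red = total_knowledge(elementary_models)
print(k_red)
```

In this example the first elementary model scores 2.5 and carries three quarters of the risk, so the overall index (2.875) is pulled toward it: weakly supported events matter most when they also dominate the risk.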


3 CASE STUDY 


3.1 Problem description 


In this section, we apply the developed method to 
assess the strength of knowledge of NPP PRAs. 
Two hazard groups, i.e., external hazards and 
internal events, are considered in this case study. 

External hazards refer to the undesired events 
originating from sources outside the NPP, such as 
external flooding, external fires, seismic hazards, 
etc. [21]. In particular, external flooding is a 
naturally induced hazard that might have different 
causes, such as tides, tsunamis, dam failures, 
snow melts, storm surges, etc. (see [22] for more 
examples). The choice of these initiating events to 
be a part of the external flooding risk assessment 
models is site-specific and some guidance should be 
provided for this purpose [23]. In general, for 
external flooding, the state of PRA practice is 
considered less mature than for internal events [24]. 
For example, the flood frequencies are obtained using 
statistical models and by extrapolating design basis 
flood levels to the fitted historical data (usually 
limited), which results in a very high uncertainty 
[24]. Moreover, for extreme floods, the probability 
of occurrence is very low but, on the other hand, 
the potential consequences can be catastrophic [22]. 
The low probability and the consequent lack of 
data and experience introduce large uncertainties in 
the risk analysis of this type of events [22]. 

Internal events refer to the undesired events that 
originate within the NPP itself, which cause 
initiating events that might lead to loss of important 
systems and might eventually result in core meltdown 
[1]. The internal events are mainly [25]: (i) failures 
of different types of components, systems and 
structures, missiles and fires; (ii) safety systems 
operation and maintenance errors. These types of 
internal events can cause other initiating events such 
as turbine trip or Loss of Coolant Accidents (LOCAs). 
The risk assessment of internal events has been 
significantly developed and is considered to have lower 
uncertainty compared to other hazard groups [1]. 


3.2 Evaluation of hazard group strength 
of knowledge 


In this case study, we consider the risk analysis 
models of two hazard groups developed by 
Électricité de France (EDF) using Risk Spectrum 
Professional software [26]; [27]. The knowledge 
assessment framework developed in Section 2 is 
applied to evaluate the strength of knowledge of the 
risk models for both internal and external events. 
Technical reports were provided to the experts 
to support the knowledge assessment with the 
needed data and information. For simplification, 
we only present the case of the external events 
(specifically flooding). For internal events, we only 
show the results of the application. 


Table 3. Reduced-order model constituents. 


3.2.1 Reduced-order model construction 

Based on Eq. (4) with α = 0.8, we found that only
one out of six operating states (NS/SG: normal
shutdown with cooling using steam generator)
is needed for the reduced-order model, which
contributes 86% of the total risk index. Similarly,
based on Eq. (5) with β = 0.8, only one out of ten
scenarios (water levels) is needed for the reduced-order
model, whose risk contribution is 98.7%.
Based on Eq. (7) with γ = 0.8, the number of MCSs
needed for the reduced-order model is 5 out of
3102, and the risk contribution is 80.1%. Therefore,
a reduced-order model is constructed based on the
atomic elements in Table 3, as shown in Figure 4.
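
The selection logic described above can be illustrated with a short sketch. Eqs. (4), (5) and (7) are not reproduced in this section, so the code assumes that each of them keeps the smallest set of elements, ranked by risk contribution, whose cumulative share reaches the threshold. The function name and the individual shares of the operating states other than NS/SG are hypothetical; only the three retained shares (0.86, 0.987, 0.801) are taken from the text.

```python
from math import prod


def select_by_threshold(contributions, threshold):
    """Keep the largest contributors until their cumulative share reaches the threshold."""
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(contributions.values())
    selected, cumulative = [], 0.0
    for name, value in ranked:
        selected.append(name)
        cumulative += value / total
        if cumulative >= threshold:
            break
    return selected, cumulative


# Hypothetical shares of the six operating states; only NS/SG = 0.86 comes from the text.
operating_states = {
    "NS/SG": 0.86, "state_2": 0.06, "state_3": 0.04,
    "state_4": 0.02, "state_5": 0.01, "state_6": 0.01,
}
selected_states, share = select_by_threshold(operating_states, 0.8)

# Overall risk contribution of the reduced-order model: product of the three retained shares.
overall = prod([0.86, 0.987, 0.801])
print(selected_states, round(overall, 4))  # ['NS/SG'] 0.6799
```

Under this assumption, a single operating state already exceeds the threshold of 0.8, and the product of the three retained shares gives the total risk contribution of about 0.68 covered by the reduced-order model.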


3.2.2 Strength of knowledge assessment 

After constructing the reduced-order model, the
knowledge assessment framework in Section 2.2
is applied to each of the atomic elements
and, then, AHP is used to compare the overall
model strength of knowledge. The strength of
knowledge for external flooding turns out to be
K = 2.78. The results of the knowledge assessment
for both hazard groups are graphically illustrated
in Figure 5. It can be seen from the figure
that the strength of knowledge on the internal
events is higher than that on external flooding,
which confirms our expectations. In
addition, most of the risk assigned to
external flooding is due to two basic events (failure to close
the isolating valve in the auxiliary feedwater system
and failure of the containment spray system),
whose strength of knowledge is very weak.


Operating state | Scenarios     | Number of MCS | Number of basic events BE | Total risk contribution
NS/SG           | Water level A | 5             |                           | 0.86 x 0.987 x 0.801 = 0.6799


Figure 4. An illustration of the reduced-order model (diagram labels: initiating event, loss of offsite power; MCS: (BE1)(BE2)(BE3); operation state).


Figure 5. Representation of hazard groups' levels of risk and strength of knowledge (heterogeneous hazard groups; strength of knowledge is represented by the thickness of the group).


4 CONCLUSION


An analytical hierarchy process-based framework 
has been proposed for assessing the strength of 
knowledge of PRA models. The framework is 
based on three main attributes (assumptions, data, 
and phenomenological understanding), which 
are further decomposed into sub-attributes and 
“leaf” attributes. A reduced-order PRA model is
constructed that reduces the number of atomic
elements to be analyzed. The framework has been
applied to two hazard groups in NPPs.

In the future, model uncertainty in the PRA
model will be considered for a more comprehensive
knowledge assessment. In addition, in the current
framework, the weights of the attributes in the
AHP were subjectively evaluated. Future investigations
will be devoted to how to evaluate the weights
more objectively.


Formalization of RAM contracts for advanced consistency
and completeness checking


A. Joanni & D. Ratiu 
Siemens AG, Corporate Technology 


ABSTRACT: Non-compliance with contractual terms concerning Reliability, Availability and Main- 
tainability (RAM) can have significant financial impact on the profitability of technical projects due to 
e.g. penalties or warranty costs. Often, these terms are stipulated in complex (multi-party) contractual 
agreements. In practice, these contracts are captured as natural language prose and this can lead to ambi- 
guities, inconsistencies or incompleteness. In this paper, we present the benefits obtained when RAM 
contracts have a precise formal description expressed in a semantically rich language. Examples of ben- 
efits are advanced validation and consistency checks such as overlap of time periods, gaps in warranties 
between parties, transfer of risks to other parties, even before a numerical evaluation is performed. We 
present a brief theoretical background of the approach, an illustration by means of exemplary contrac- 
tual warranty terms, and give an outlook on further possibilities and advantages. 


1 INTRODUCTION 

Non-compliance with contractual terms concern- 
ing reliability, availability and maintainability 
(RAM) characteristics can significantly impact the 
profitability of technical projects. Complex projects 
involve contractual agreements between different 
parties working in different stages of the supply
chain. These contracts must be perfectly synchro- 
nized and orchestrated to avoid unnecessary non- 
conformance costs for the contractor, for instance 
due to failures of the integrated components. 

Common practice today is that contracts are 
written in plain natural language and thereby are 
often ambiguous and incomplete. Furthermore, 
natural language contracts are not amenable to 
automatic analyses. 

A typical example of complex RAM contracts is 
related to warranties. Servicing of warranty cases 
involves additional (non-conformance) costs to 
the contractor and these costs depend on product 
reliability and contractual warranty terms between 
the contractor and the customer. Furthermore, 
these costs can be (fully or partially) covered by 
warranty contracts between the contractor and the 
individual suppliers. As extensively discussed in the
literature, see e.g. Murthy & Jack (2003), there is a
wide variety of warranty policies determined by
different warranty periods and associated costs.
Examples of different periods are simple periods,
extended warranties, life-time warranties, physical
life, technological life; examples of different
forms of penalties are product renewal, repairing
and looping of warranty. Making informed decisions
about liability due to warranties requires
transparent assessment of contractual require- 
ments between all stakeholders (customer, contrac- 
tor, suppliers). Other examples of complex RAM 
contracts are related to performance guarantees of 
technical systems. 

Furthermore, automatic analysis of “what if” 
questions like “what additional costs would occur 
if the contractor offered extended warranty to cus- 
tomers” or “what is the longest warranty a contrac- 
tor can offer such that the risk budget is under a 
certain threshold with a given confidence” requires
precise modeling and analysis of contracts. 

In this paper we present our work in progress 
on using domain specific languages to model 
contracts in a precise and unambiguous manner. 
Once semantically rich contract models are avail- 
able, they represent a basis for further analyses for 
consistency and completeness. As an example we 
model warranty clauses of multi-party contracts. 


2 MODELING OF RAM CONTRACTS 


2.1 


Formalization of (business) contracts using formal
models and languages with precise semantics
has been a subject of research for about two decades;
see e.g. Peyton Jones et al. (2000), Hvitved
(2011) or Farmer & Hu (2016). In particular,
these approaches were adopted early on by
the financial industry for pricing of contracts that
were traded e.g. in the financial derivative markets.
It was realized that formalization not only
gives a precise way to describe complex
contracts, which may immediately reveal ambiguities
of contracts expressed in natural language,
but that the formal semantics also provides the basis
for analyzing contracts and integrating formal methods
into them, as well as for automated contract
management. Last but not least, it provides a means to
valuate contracts over time. All of this is, naturally,
compositional and comes with a high degree of
reusability, modularity and abstraction.

Modeling and valuation of contractual RAM 
requirements has been demonstrated in a previous 
paper by Joanni & Ratiu (2018), based on the pioneering
work by Peyton Jones et al. (2000). Namely,
a compact library of so-called contract primitives is
used to perform compositional analysis of contractual
RAM requirements expressed in the library,
and based on this it is shown how to perform valua- 
tion of a broad range of contractual requirements. 
The contract primitives are supplemented by the 
concept of so-called observables, which determine 
how the meaning (and value) of a contract evolves 
over time. Observables, for instance, define when 
certain conditions become true that entail a certain 
payment at that time, or what the precise amount 
of a payment is. In the context of contractual 
RAM requirements: whether a given component 
has failed or not may constitute an observable; 
another example of an observable is the average 
availability of a module of a process plant over a 
given time period. In essence, observables are con- 
cepts that glue together the formal contract model 
and the model of the technical system. 


2.2 Modeling of technical and contractual 
RAM aspects 


This section briefly exemplifies and reiterates how 
contractual RAM requirements are expressed in 
terms of contract primitives and observables as 
demonstrated by Joanni & Ratiu (2018), in order 
to set the stage for advanced consistency and com- 
pleteness checking of RAM contracts which is the 
focus of this paper. 

For the purpose of contractual RAM require- 
ments, the modeling of technical RAM aspects 
is one source of observables that determine how 
the meaning (and value) of a contract evolves over 
time. For instance, suppose that the warranty risks 
for component failures arising from a contract for a
major technical project are to be analyzed, and the
event of a control system failure of a given subsystem,
say system 2, constitutes an observable whose
value becomes true as soon as a control system
failure occurs. Let this observable be denoted by
system2:controlFailure in the following.




Some of the contract primitives are introduced
next, in order to show how contractual RAM
requirements can be composed in terms of these
generic contract primitives and linked to observables.
First, the contract (one EUR) represents a contract
where, once acquired, the holder immediately
receives one unit of the currency Euro. The contract
(give c) is a contract where all rights and obligations
arising from contract c are reversed. If the
contract (or c1 c2) is acquired, then the holder must
immediately choose whether to acquire c1 or c2,
where the choice will be such that the benefit is
greatest. Finally, once the contract (and c1 c2) is
acquired, both of the contracts c1 and c2 are immediately
acquired. As an example, if the contract


(give (and (one EUR) (one EUR)))


is acquired, then the holder is obligated to immediately
pay two Euros. Consequently, in order to be willing
to acquire the contract above, one would expect to
receive two Euros in return as a fair price. Generally,
we consider only a fair price when we talk
about contract valuation, and the usual profit margin
would have to be added afterwards.
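
To make the compositional reading of these four primitives concrete, the following minimal sketch represents them as Python data classes and computes the value to the holder. It is an illustration only and not the implementation of Joanni & Ratiu (2018) or of Peyton Jones et al. (2000); the class names and the function immediate_value are hypothetical, and discounting and currency conversion are ignored.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class One:
    currency: str


@dataclass(frozen=True)
class Give:
    contract: "Contract"


@dataclass(frozen=True)
class And:
    first: "Contract"
    second: "Contract"


@dataclass(frozen=True)
class Or:
    first: "Contract"
    second: "Contract"


Contract = Union[One, Give, And, Or]


def immediate_value(c: Contract) -> float:
    """Value to the holder in currency units (no discounting, single currency)."""
    if isinstance(c, One):
        return 1.0
    if isinstance(c, Give):
        # All rights and obligations are reversed.
        return -immediate_value(c.contract)
    if isinstance(c, And):
        # Both sub-contracts are acquired.
        return immediate_value(c.first) + immediate_value(c.second)
    if isinstance(c, Or):
        # The holder picks the more beneficial alternative.
        return max(immediate_value(c.first), immediate_value(c.second))
    raise TypeError(f"unknown contract: {c!r}")


two_euro_debt = Give(And(One("EUR"), One("EUR")))
print(immediate_value(two_euro_debt))  # -2.0
```

The negative value reflects that the holder owes two Euros, so a fair price for taking on this contract would be two Euros received in return.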

Up to now, the contract primitives did not refer 
to any observables, and all rights and obligations 
took effect immediately. These limitations will be 
overcome with the following contract primitives. If 
the contract (scale o c) is acquired, then the con- 
tract c is immediately acquired where all payments 
arising from the contract c are multiplied by the 
value of observable o at the moment of acquisi- 
tion. The contract (when o c) implies that, once 
acquired, the holder must acquire the contract c as 
soon as the observable o becomes true. 

It should be noted that, while both observables
introduced in the previous paragraph are in
fact stochastic processes, the first one can be seen
as a numerical value, whereas the second one is a
stochastic process that assumes the values true or
false. Other examples of the latter kind are observables
like (at date d), which becomes true at a certain
date d, or (between d1 and d2), which assumes
the value true between the dates d1 and d2 (and
false otherwise).

Using the contract primitives introduced so
far, we will now give examples of how these can be
linked with observables, and how contractual
RAM requirements of practical relevance can be
expressed by exploiting the compositional aspects.
For instance, the contract


(when (at date 2017-01-31) (scale 100 (one EUR)))


pays the holder of the contract the amount of 100
Euros on January 31, 2017. In order to determine
a fair price for the contract at the time of acquisition,
one would have to calculate the discounted
present value at the time of acquisition depending
on interest rates (and possibly currency exchange
rates). Certainly, the value of the contract is zero if
acquired after January 31, 2017. Similarly, we can
define a contract


(when system2:controlFailure (give (scale 100 (one EUR))))


which implies that the holder of the contract is 
obligated to pay 100 Euros on occurrence of the 
event system2:controlFailure (see beginning of the 
section). This already constitutes a simple contrac- 
tual RAM requirement which could potentially be 
found in a real contract. The value of this contract 
at the time of acquisition, i.e., the fair amount of 
money the contractor would receive once it has 
entered into the contract, depends here on the 
number of occurrences within the contract dura- 
tion, and on the time instants the event occurs (in 
case of discounting). 
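
The dependence of the contract value on the number and timing of failures can be sketched as follows. The sketch assumes continuous compounding at a constant interest rate, which is not stated in the paper, and the failure times, the rate and the contract duration are hypothetical inputs (for instance, one sample path of a stochastic simulation).

```python
import math


def present_value(amount, time_years, rate):
    """Discount a payment made at time_years back to the time of acquisition."""
    return amount * math.exp(-rate * time_years)


def failure_liability(failure_times, amount, rate, horizon_years):
    """Value to the holder who pays `amount` at every failure within the horizon."""
    return -sum(
        present_value(amount, t, rate)
        for t in failure_times
        if 0.0 <= t <= horizon_years
    )


# Hypothetical sample path: failures after 0.5, 2 and 6 years, 5-year contract duration.
failure_times = [0.5, 2.0, 6.0]
value = failure_liability(failure_times, amount=100.0, rate=0.03, horizon_years=5.0)
print(round(value, 2))
```

Averaging this quantity over many simulated sample paths would give an estimate of the fair amount the contractor should receive for entering into the contract.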

Next, consider a typical warranty clause from 
a technical contract that requires that a certain 
amount be paid to the customer on occurrence of 
failures of respective components (or, equivalently, 
the components must be replaced at the contrac- 
tor’s expense). This only applies within a given 
warranty period, and the total amount is limited 
by a cap of 20,000 Euros. Expressed in terms of 
contract primitives, this could take the following 
form. 


(or (scale 20000 (give (one EUR)))
    (and (when (system2:controlFailure && between startOfProject and November 03, 2021)
               (give (scale 800 (one EUR))))
         (when (system3:valveFailure && between startOfProject and November 03, 2021)
               (give (scale 130 (one EUR))))
         ...))


where a Boolean expression is used in the condi- 
tion, and the date startOfProject is a variable that 
was defined elsewhere in the model. 
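
The intended effect of this clause, payments per failure within the warranty period with the total limited by the cap, can be illustrated with a small sketch. It is a simplified reading of the clause rather than an evaluation of the contract primitives themselves; the project start date and the failure log are hypothetical, and treating the warranty period as inclusive at both ends is an assumption.

```python
from datetime import date

WARRANTY_END = date(2021, 11, 3)
PAYMENTS = {"system2:controlFailure": 800.0, "system3:valveFailure": 130.0}
CAP = 20000.0


def capped_liability(failures, start_of_project,
                     warranty_end=WARRANTY_END, payments=PAYMENTS, cap=CAP):
    """Total amount owed to the customer for failures in the warranty period, limited by the cap."""
    total = sum(
        payments[observable]
        for observable, occurred_on in failures
        if start_of_project <= occurred_on <= warranty_end
    )
    return min(total, cap)


# Hypothetical project start and failure log.
start = date(2018, 1, 1)
failure_log = [
    ("system2:controlFailure", date(2019, 5, 1)),
    ("system3:valveFailure", date(2020, 2, 1)),
    ("system3:valveFailure", date(2022, 1, 1)),  # after the warranty period, not counted
]
print(capped_liability(failure_log, start))  # 930.0
```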

It is noted here that the possibility to compose 
more complex contractual RAM requirements 
from contract primitives is a key point and adds 
great value to our approach. Without it, it would 
be hardly practicable to capture and analyze the 
various types of contractual requirements that 
may be encountered in realistic contracts. 

It is also important to note that, since our 
approach targets persons in charge of techni- 
cal, contractual and commercial aspects who are 
domain specialists and do not necessarily have 
any programming background, the expressions 
in terms of contract primitives can be just as well 
automatically generated from higher-level expressions
of contractual requirements, which are closer
to natural language formulations. The reader is
referred to Joanni & Ratiu (2018) for details.


3 ADVANCED CONSISTENCY AND 
COMPLETENESS CHECKING 


3.1 Formalization of the contract model 


What has been presented in Section 2 is essentially
a model-based form of a contract. This can be
expressed in the form of a domain specific language
with constructs which directly represent concepts
from the domain of contracts. Individual contracts
can thereby be expressed as models written
using the domain specific language. The model
can be represented as a tree (see Figure 1
for the example from the previous section). The
value of the contract at a certain point in time is
determined by a precisely defined, so-called valuation
semantics; see Peyton Jones et al. (2000).
Moreover, prior to numerical evaluation, it may
be useful to apply certain transformation rules in
order to obtain the contract in an optimized form
(ibid.).

Typically, a project involves contractual obliga- 
tions that are not restricted to one party, but are 
stipulated in several contracts with multiple par- 
ties. Several contracts can be merged in a single 
one where the individual contracts are subsumed 
by an and contract. Care must be taken, though, 
to correctly express the rights and obligations aris- 
ing from the individual contracts from the point 
of view of the party of interest (e.g. the main
contractor). An observable such as a failure of a 
specific component then usually affects more than 
one contract, if the event gives rise to contractual 
rights or obligations for more than one of the par- 
ties involved. 


Figure 1. Tree representation of a contract. Observables
are shown in bold face (the observables related to failure
events are replaced by o1, o2). Node labels in the figure:
(and c1 c2), (when o1 c), (when o2 c), (scale 20000 c),
(scale 800 c), (scale 130 c), (give c), (one EUR).


3.2 Enriching the contract model with 
additional semantics 


Generic contracts can be enriched with additional 
semantics, such as whether a given branch in the 
tree represents the contract with a specific party, 
for instance the customer and the various sup- 
pliers. One possible way to implement this is by 
annotating certain branches of the tree with infor- 
mation on the respective contractual party. More- 
over, certain parts of the tree can be annotated 
with information on the specific type of payments 
arising from this part of the contract, for instance 
contractual penalties or warranty payments. In 
this way, information on the type of payment and 
the contractual parties involved can be retained 
and recorded during the numerical evaluation, e.g. 
through stochastic simulation. 

Once contracts have a precise formal descrip- 
tion expressed in a semantically rich language, it 
is possible to implement advanced validation and 
consistency checks such as overlap of time peri- 
ods, gaps in warranties between parties, transfer 
of certain risks to other parties etc., even before 
a numerical evaluation is performed. Some more 
concrete examples are given in the following: 


— In the context of warranty costs, a formal 
check can reveal if warranty payments owed to
the customer in case of component failures are
always (fully or partially) balanced by warranty
payments received from the respective suppliers
(back-to-back coverage). This is not always
obvious for complex systems with many components,
where warranty periods may even be
subject to looping conditional on certain events.

— In case of random events which entail payments 
and/or liabilities from different parties, a formal 
analysis can help determine bounds on the over- 
all financial impact. This is extremely useful par- 
ticularly if the frequency of occurrence is based 
on uncertain estimates, but determining bounds 
may not be an easy task to do manually if cer- 
tain payments or liabilities are subject to further 
conditions. 


This list is certainly not exhaustive and depends 
on the specific technical system and contractual 
aspects under consideration. 
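
One of the checks named above, gaps in warranty periods between parties, can be sketched with plain interval arithmetic. The code below is a hedged illustration and not the authors' checking rule: the function name is hypothetical, each warranty is reduced to a single (start, end) period, and the project start date is an assumed value. The end dates are those of the customer and supplier contracts in the example of Section 5.

```python
from datetime import date


def uncovered_intervals(customer, supplier):
    """Return the parts of the customer warranty period not covered by the supplier period."""
    c_start, c_end = customer
    s_start, s_end = supplier
    if s_end <= c_start or s_start >= c_end:
        return [(c_start, c_end)]  # no overlap at all
    gaps = []
    if s_start > c_start:
        gaps.append((c_start, s_start))
    if s_end < c_end:
        gaps.append((s_end, c_end))
    return gaps


start_of_project = date(2018, 1, 1)  # hypothetical
customer_warranty = (start_of_project, date(2021, 11, 3))
supplier1_warranty = (start_of_project, date(2021, 1, 1))
supplier2_warranty = (start_of_project, date(2022, 7, 1))

print(uncovered_intervals(customer_warranty, supplier1_warranty))
print(uncovered_intervals(customer_warranty, supplier2_warranty))
```

With these dates, the first supplier leaves the customer warranty uncovered from January 1, 2021 to November 3, 2021, whereas the second supplier's period covers the customer warranty completely.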


4 TOOLING AND IMPLEMENTATION 


In our implementation of the approach, the finan- 
cial and technical aspects of RAM contracts are 
expressed with the help of domain specific lan- 
guages (DSLs). A DSL provides a formal syntax 
(meta-model) for representing the domain concepts 
and relations between them. Once appropriate
constructs are available, simple consistency checks
which define the set of well-formed contracts can 
be implemented easily. In addition, advanced con- 
sistency and completeness checks can be imple- 
mented as described in this paper. 

The technical basis for our implementation is
language engineering technologies as provided by
the JetBrains MPS language workbench
(www.jetbrains.com/mps). MPS allows creation and
composition of different modular DSLs, which 
enables domain experts to directly and explicitly 
express concerns from their business domain. We 
created DSLs to describe both the technical and 
commercial aspects of contractual RAM require- 
ments, so they are available as models. Compared 
with previous work on (business) contract formali- 
zation, which was e.g. implemented in Haskell and 
requires know-how about functional program- 
ming, our approach targets persons in charge of 
technical, contractual and commercial aspects who 
are domain specialists, and it does not require any 
programming background. 

Moreover, our implementation gives confidence 
in evaluation by validation of input and advanced 
consistency checks; the models are “correct by 
construction” and allow time saving due to effi- 
cient specification of the technical, contractual 
and commercial aspects of contractual RAM 
requirements. 


5 EXAMPLE 


In order to demonstrate the modeling and advanced 
checking of contractual RAM requirements, a sim- 
ple example in the context of product warranty is 
given in the following. For a comprehensive review 
of warranty policies see e.g. Murthy & Jack (2003) 
or Rahman & Chattopadhyay (2006). In this paper, 
we restrict ourselves to the case of so-called Free 
Replacement Warranty without renewal, where 
a certain amount is to be paid to the customer 
on occurrence of failures of certain components 
within a fixed time interval. This amount may 
comprise the costs for replacement of the failed 
components as well as additional servicing costs. 
The corresponding model of the customer con- 
tract is shown in the top pane of Figure 2, which 
repeats the exemplary warranty clause from Sec- 
tion 2.2. The model is annotated by a label “Cus- 
tomerWarranty” in order to indicate that, within 
this branch of the tree, any observables related to 
component failures incur a liability towards the 
customer. The specific amount to be paid, the war- 
ranty periods and the fact that a cap was agreed 
are not considered here, but will certainly influence 
the value of the contract. The intention here is to
check whether warranty costs due to component
failures can, in principle, be balanced by warranty
payments received from the respective suppliers,
based on the contracts entered.


define contract CustomerContract
  contract statements:
    [CustomerWarranty: or(scale(give(one EUR), 20000.0),
        and(when(system2:controlFailure && between start of project and November 03, 2021,
                 scale(give(one EUR), 800)),
            when(system3:valveFailure && between start of project and November 03, 2021,
                 scale(give(one EUR), 130))))]
    Error: No coverage

define contract Supplier1Contract
  contract statements:
    [SupplierWarranty: when(system2:controlFailure && between start of project and January 01, 2021,
                            scale(one EUR, 300))]

define contract Supplier2Contract
  contract statements:
    when(system3:valveFailure && between start of project and July 01, 2022,
         scale(one EUR, 300))

Figure 2. Exemplary customer contract (top pane) where the warranty clause indicates an incomplete coverage by an
error message. This is because only the contract with supplier 1 (middle pane) is annotated as a supplier warranty, but
the contract with supplier 2 (bottom pane) is not.

Similarly, the middle pane of Figure 2 shows a 
model for a supplier contract that states that failure 
of the component system2:controlFailure incurs a 
payment received from the supplier of this compo- 
nent. Here, the corresponding model is annotated 
by a label “SupplierWarranty” in order to indicate 
that, within this branch of the tree, any observa- 
bles related to component failures imply that a 
claim can be made towards the supplier. The bot- 
tom pane of Figure 2 shows a model for another 
supplier contract for payments received from the 
supplier on occurrence of failures of component 
system3:valveFailure. Here, however, the model was 
not annotated by a label “SupplierWarranty”, which 
will be revealed by the checking rule as shown below. 

An advanced checking rule was implemented
in order to identify component failures that only
incur a liability towards the customer but no
corresponding claims towards the supplier, based on
all contracts entered. This was done by traversing
the tree of the contract model and searching for
observables related to component failures that are
annotated with the label “CustomerWarranty”, but
not “SupplierWarranty”. Consequently, an error
message is shown in the model of the customer
contract that points out the respective components
whose failures are not covered by a warranty clause
in any of the supplier contracts. In this example,
the obvious remedy would be to annotate the contract
with the second supplier with the label
“SupplierWarranty”. However, it may well be the case
for complex projects that agreeing on a warranty
clause for one or more components has been
overlooked. In this case, the error message suggests
that it would be advisable to renegotiate appropriate
warranty agreements with the suppliers of the
respective components.
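
The checking rule described above can be sketched as a tree traversal. The following code is an illustration under stated assumptions and not the MPS implementation: contracts are encoded as nested tuples, the node encoding and function names are hypothetical, and the time conditions attached to the failure observables are omitted for brevity.

```python
def collect(node, active_label, found):
    """Walk a contract tree and record failure observables under the label in force."""
    kind = node[0]
    if kind == "label":            # ("label", name, contract)
        collect(node[2], node[1], found)
    elif kind == "when":           # ("when", observable, contract)
        found.setdefault(active_label, set()).add(node[1])
        collect(node[2], active_label, found)
    elif kind in ("and", "or"):    # ("and", c1, c2, ...)
        for child in node[1:]:
            collect(child, active_label, found)
    elif kind == "scale":          # ("scale", factor, contract)
        collect(node[2], active_label, found)
    elif kind == "give":           # ("give", contract)
        collect(node[1], active_label, found)
    # ("one", currency) is a leaf and carries no observable
    return found


def uncovered_failures(contracts):
    """Failure observables with a customer liability but no supplier claim."""
    found = {}
    for contract in contracts:
        collect(contract, None, found)
    return found.get("CustomerWarranty", set()) - found.get("SupplierWarranty", set())


one_eur = ("one", "EUR")
customer = ("label", "CustomerWarranty",
            ("or", ("scale", 20000, ("give", one_eur)),
                   ("and", ("when", "system2:controlFailure", ("give", ("scale", 800, one_eur))),
                           ("when", "system3:valveFailure", ("give", ("scale", 130, one_eur))))))
supplier1 = ("label", "SupplierWarranty",
             ("when", "system2:controlFailure", ("scale", 300, one_eur)))
supplier2 = ("when", "system3:valveFailure", ("scale", 300, one_eur))  # label missing

print(uncovered_failures([customer, supplier1, supplier2]))  # {'system3:valveFailure'}
```

Wrapping the second supplier contract in a "SupplierWarranty" label makes the reported set empty, which corresponds to the remedy mentioned in the text.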


6 DISCUSSION AND OUTLOOK 


Based on our approach to express contractual 
RAM requirements by means of a precise formal 
description with a semantically rich language, 
we have explained the principles of performing 
advanced validation and consistency checks of 
certain contractual characteristics of interest (in 
addition to providing a way to efficiently evaluate 
the contractual requirements as demonstrated in 
Joanni & Ratiu (2018)). The example presented in
this paper demonstrates the implementation for a 
rather simple example of warranty costs. However, 
the contract model can, in principle, also be trans- 
formed into other mathematical models that allow 
for more sophisticated checking algorithms that 
cover various aspects such as completeness of con- 
ditions. This is the subject of our ongoing research. 
In our opinion, formalization of contractual 
RAM requirements is not only of theoretical inter- 
est, but is of practical applicability and provides 
significant benefits such as efficient assessment 
and potential reduction of non-conformance costs, 
in addition to increased transparency particularly 
when dealing with complex contractual agreements 
for projects where multiple parties are involved. 


Cost-benefit analysis for non-structural flood risk mitigation measures:
Insights and lessons learnt from a real case study 


G. Pesaro, M.T. Mendoza, G. Minucci & S. Menoni 
Politecnico di Milano, Milan, Italy 


ABSTRACT: Cost-Benefit Analysis (CBA) in flood risk management is becoming increasingly popular
as a tool that makes the relationship between investments in mitigation measures and the related
effects more comprehensible. Even if the application of CBA in flood risk management is nothing new,
there is still limited evidence on its conditions of use and possible limits when applied to real cases. The essay
first suggests a new classification of Flood Mitigation Measures (FMM), towards a new taxonomy that
goes beyond the distinction between structural and non-structural. This taxonomy distinguishes between risk and
damage reduction measures, public and private decision-making processes, and mandatory and voluntary
actions, and allocates more importance to the value of the avoided damages related to the commons
(i.e. cultural heritage) and other non-renewable, non-reproducible or non-restorable territorial resources.
Second, the contribution of CBA as an ex-ante decision-making tool in flood risk reduction is discussed.
Moreover, the essay offers a methodological insight and operational elements as results of a case study
developed in the European research project IDEA, where a CBA for a non-structural mitigation measure
was performed using the data and information available after the 2012 and 2013 flood events in the
Umbria Region (Italy). Here, dams built to produce hydroelectric power have been used to attenuate
the floods, as a non-structural risk mitigation measure. The experimental application of the CBA to this
measure, including the combination of real damage data collected after the floods and damage modeling
for the alternative scenario, provided methodological and operational evidence of its capability to
reduce or avoid a part of the damage. Finally, the essay presents the lessons learnt, the open problems and
future developments required for an effective CBA, in reference to the technical and scientific perspective
and to the difficulties in the understanding and interpretation of the whole of the cause-effect chains
and externalities.


1 INTRODUCTION

In recent years, Cost Benefit Analysis (CBA) has
been increasingly proposed as a key tool to support
decision-making processes for flood risk
prevention and mitigation, including in the European
Floods Directive 2007/60/EC. It serves the
purpose of comparing the costs of mitigation
measures with the potential benefits, interpreted
as reduction of potential Damage and Losses
(D&L). However, there are a number of limitations
implied by most common C&B analyses
regarding a variety of issues (Ale et al. 2015).
There are difficulties in assessing the entirety of
costs, for example indirect ones; questionable
values are attached to both costs and benefits
without adequate empirical supporting evidence;
ethical and equity concerns arise, for example,
when intangibles are assessed in monetary terms.
In order to address some of those limitations,
we have proposed some innovative procedures
while carrying out C&B analyses according to a
more rigorous economic thinking. At first, a better
classification of Flood Mitigation Measures
(FMM) with respect to most C&B applications is
proposed, as the variety of the possible solutions
and the related results when implemented is very
high. This is because both costs and benefits can be
better assessed if mitigation measures are better
identified and distinguishable one from the other.

In fact, a better classification of FMM has
a twofold purpose. On the one hand, if exhaustive
enough, it offers planners a wider variety of
options from which to choose (Yevjievich 1994).
On the other hand, diverse FMM typologies have
different costs and therefore require different
degrees of investment. A better classification of
FMM may also help in highlighting, for each
combination of measures, the different mix of direct
and indirect costs that are associated.
Bouwer et al. (2014) underline the fact that there
is often a solid base of information on the direct 
cost in hard mitigation measures, whereas there 
are few cost assessment methodologies for non- 
structural measures. Moreover, Mechler’s review 
(2016) on available studies on CBA for DRR 
measures shows that mostly structural measures 
are considered. 

In this context, we first propose a new clas- 
sification framing the different types of FMM. 
Then a focus is made on the analysis of a
non-structural measure, namely the use of dams for
hydroelectric power production to retain potential
overflow due to heavy rains. This case study
introduces a second level of novelty in the arena
of most widely applied C&B methodologies, in
that it uses real damage and costs experienced
after events that have actually occurred as a
starting point for establishing monetary values
of benefits intended as avoided damage. This is in
line with the methodological approach taken by
the European funded project, IDEA (Improving
Damage assessments to Enhance cost-benefit
Analysis).


2 FLOOD MITIGATION MEASURES (FMM) 


2.1 Scientific background: FMM classification 


Yevjievich et al. (1994) classify FMM on the 
basis of a geographic approach according to (i) 
adjustment to natural hazards; (ii) typology of 
flood damage prevention; (iii) typology of flood 
damage reduction; (iv) typology of flood 
policymaking and (v) basic categories of individual 
measures. 

The Australian Bureau of Transport and 
Regional Economics in 2002 proposed to classify 
FMM from an economic perspective according to 
(i) flood modification, (ii) property modification 
and (iii) response modification. On the basis of 
the latter, Hawley et al. (2012) categorize FMM 
by (i) structural and non-structural flood control, 
(ii) exposure and property modification and 
(iii) behavioral response modification. Considering 
the time scale before the damaging event, 
Mechler (2005) differentiates between FMM that 
reduce risk (mitigation/prevention and preparedness) 
and FMM that transfer and spread it on a larger 
basis (risk financing). 

The FLOOD-ERA Joint Report (Schanze et al. 
2008) enlarges the economic perspective by con- 
sidering both ex-ante and ex-post evaluation of 
structural and non-structural flood mitigation 
measures. 

A multi-sectoral perspective on FMM ex-post 
evaluation is provided in the FLOODsite project 
(Klijn et al. 2009), which takes into account not 
only the economic efficiency but also the 
effectiveness, robustness, and flexibility of selected 
measures. Accordingly, measures (i.e. physically 
tangible interventions) are differentiated from 
instruments (i.e. interventions that cause effects 
indirectly). 


2.2 Towards a new taxonomy 


On the basis of the literature review, a preliminary 
list of a large selection of possible flood mitigation 
measures and a new FMM classification have 
been developed (Table 1). These measures have 
been classified according to the following criteria: 
(i) typology, i.e. structural or non-structural measures, 
(ii) purpose, i.e. “risk” or “damage” mitigation, 
(iii) who takes the decision, (iv) typology 
of action. The definition of structural and 
non-structural measures adopted in this work is the 
one by the UNISDR (see Terminology on DRR, 
2017). 

Regarding non-structural measures, we propose 
as sub-categories: riverine environment-based 
(e.g. river management), built environment-based 
(e.g. building regulations), social involvement-based 
(e.g. education programs) and 
economic-based (e.g. risk transfer through 
insurance). 

Concerning risk mitigation, investments are 
directed to the reduction of flood exposure, no 
matter how vulnerable the exposed subjects and 
objects are. As for D&L, reduction may mean the 
capability: (i) to reduce exposure, (ii) to recognize 
and intervene on the vulnerability of a great 
variety of individual situations, and (iii) to identify 
the potential loss of territorial resources/values 
that are either irreplaceable or not easily 
restored/rebuilt. 

Moreover, flood mitigation measures involve, 
in different ways, public and private actions, 
decision-making and investments. It is therefore 
important to pay attention not only to risk 
or damage mitigation measures or to structural 
and non-structural interventions, but also to the 
main decision-makers involved in their implementation. 
These were categorized as “public”, 
“public-private”, “economic subject” (private 
firms) and “private individuals”. Regarding the 
typology of action, three sub-categories were 
defined: “regulated” (when regulation exists 
with defined procedural aspects), “mandatory 
by law” (binding action by law) and “voluntary” 
action. Of course, particularly in this last 
category, there might be substantially different 
results for different territorial areas and countries, 
according to the different regulatory and 
institutional systems. 


3 CBA FOR FMM 


3.1 Economic thinking and CBA in flood risk 
management 


The assessment methods and tools to quantify 
some of the damage categories are central elements 
when referring to the economic approach to 
disaster risk reduction. The debate of the last two 
decades converges on the view that damage suffered 
by populations and the built environment is 
nowadays better understood than in the past. By 
contrast, less evidence is available on damage 
suffered by economic sectors and by the entirety 
of territorial resources, encompassing cultural 
heritage and the natural environment and shaping 
communities’ identity and social models (some of these 
elements are discussed in Mechler 2016). This is 
mainly due to the incidence of indirect and systemic 
damage (Cochrane 2004a,b), which is often 
underestimated (Shreve & Kelman 2014), but 
also to the difficulties in assessing the monetary 
value of intangibles, such as public resources and 
cultural and historical heritage (Pesaro 2005). 

Looking at the whole of the resources a territorial 
area can rely on to sustain its production and 
consumption models and its qualitative and 
quantitative growth, the economic perspective suggests 
accounting not only for direct quantitative damage, 
using money as the unit of measure, but also for 
indirect and systemic D&L, even if these are not 
easily quantifiable and measurable. Monetization 
models have been developed first in the field of 
environmental economics and later in risk and 
damage matters, for instance for cultural heritage 
and the natural environment; still, monetary 
evaluation remains very controversial (see, 
among others, Meyers et al. 2013). 

Cost-Benefit Analysis (CBA) for the evaluation 
of flood mitigation measures is an important 
and widely used tool for decision makers to rely 
on, even though the incidence of non-monetizable 
values sometimes limits its potential. In a 
decision-making framework, the main question is how 
to choose among different possible action/intervention 
alternatives using an economic-based toolbox, 
whose strength also lies in the use of quantitative 
units of measure, able to offer clear-to-read results to 
support and guide a selection process. Moreover, 
it makes it possible to look at the results envisaged 
for different mitigation measures not only in terms 
of technical and operational performance but also in 
terms of investment effectiveness. 

CBA can be carried out following either an economic 
approach, which is preferred and adopted in the 
present article, or a purely financial one. Under 
the economic approach, the costs and benefits 
associated with a policy or a project are assessed 
more exhaustively, covering damage and 
effects including those on third parties, whereas 
the financial approach to CBA is mainly (if not 
exclusively) based on a cash flow analysis. 
Consequently, economic CBA aims at highlighting the 
whole range of values and externalities implied by 
a certain mitigation measure. 

Several studies are available introducing cost- 
benefit analysis as a tool to assess the economic 
feasibility of flood management strategies (Botzen 
et al. 2017). Brouwer & van Ek (2004), for instance, 
justify changes in land use and floodplain restora- 
tion in the Netherlands based on both cost-benefit 
and multi-criteria analyses when ecological, social 
and economic benefits in the long term are consid- 
ered. Jonkman et al. (2004) investigate the applica- 
tion of CBA in decision-making on flood protection 
measures in the Netherlands as well. Joseph et al. 
(2014) conceptually apply the cost-benefit analytical 
framework to flood risk adaptation measures 
taken at the property level in the UK and show 
the costs and benefits involved. Moreover, the EU 
Floods Directive (2007/60/EC) requires CBA for 
supporting public decision making to tackle flood 
risk in all member states of the European Union. 

The use of CBA to enhance the implementation 
of disaster mitigation measures relies on the concept 
of intervention profitability, which, in turn, 
refers to the capability of investments in mitigation 
measures to obtain the expected outcomes in terms 
of risk prevention and/or damage mitigation. CBA is 
thus a method for comparing the whole range of direct 
costs associated with each typology of mitigation 
measure with their relative outcomes, measured as 
the total value of the avoided damage and losses 
plus the benefits coming from an increase in safety 
and territorial quality. 
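
The comparison described above can be written compactly. The following is only a sketch: the symbols are introduced here for illustration and do not appear in the original text, and no discounting over time is assumed. For a mitigation measure m, let AD_m be the value of the avoided damage and losses, S_m the benefits from increased safety and territorial quality, and C_m the direct costs of the measure.

```latex
\[
  NB_m \;=\; \underbrace{\left( AD_m + S_m \right)}_{\text{benefits}} \;-\; \underbrace{C_m}_{\text{costs}},
  \qquad
  \text{measure } m \text{ is profitable if } NB_m > 0 .
\]
```

Under these assumptions, alternative measures can be ranked by NB_m, keeping in mind that S_m is often the term that is hardest to monetize.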

In this conceptual framework, CBA is seen as an 
ex-ante decision-making tool, which calls for the 
capability to develop the assessment process in an 
ex-ante perspective. In this case, mitigation meas- 
ures in flood risk allow the system to act on both 
risk and damage mitigation and reduction. 

The results of a CBA might also be effective in 
the light of a problem decision-makers encounter 
very often, namely the perception of the direct 
and quantifiable costs of implementing mitigation 
measures. Quantified costs, even when related 
to damage prevention, are much more evident and 
easier to perceive than the possible reduction 
of potential D&L in an uncertain and unknown 
future. The clearer today’s mitigation costs are, 
without being directly comparable to the future 
decrease in damage, the more difficult it is to 
obtain consensus on expenditure (both public and 
private) in peacetime, that is, when no emergency 
is under way. 

In this respect, CBA should be regarded as an 
effective tool, as it is able at least to elicit relevant 
knowledge and information to support policy-makers 
in choosing among different alternative 
measures and interventions to reduce damage and 
risk, even when some elements cannot be directly 
assessed in monetary terms. In this context, the 
question is how to recognize the “list” of costs and 
benefits suitable for describing the risk in a given 
area and how to evaluate and monetize them. The 
high variety of local conditions in terms of exposure 
and vulnerability, which influences the impacts of 
events on territorial subjects and objects (even when 
considering just floods), makes this task challenging. 


3.2 The costs side 


An effective synthesis focused on the comparison 
among the costs of different available mitigation 
measures assessed by way of different methodolo- 
gies is offered in Hawley et al. (2012). Likewise, 
Bouwer et al. (2014) use a cost classification that 
distinguishes between direct and indirect costs, 
tangible and intangible. 

Besides the direct costs of FMM, such as con- 
struction costs of physical infrastructure or evacu- 
ation costs, negative externalities or “co-costs” 
(Rose 2016) can be generated, e.g. interferences 
with the natural environment and landscape amen- 
ities. The externalities that are usually included in 
economic CBA are the costs or benefits resulting 
from a project that influences third parties without 
monetary compensation (European Commission 
2014). Negative externalities, or external costs, are 
considered in social analyses because they affect 
social welfare, causing market failure when 
not included (Pesaro et al. 2016). 


3.3 The benefits side 


In CBA, benefits are evaluated as the “positive 
impacts obtained because of the reduced or avoided 
D&L” due to the different mitigation measures to 
be implemented. Flood damage assessments are 
therefore crucial to measure avoided damages, which 
may include primarily health problems and loss of 
lives, physical damage to buildings, infrastructures 
and lifelines, environmental assets, cultural heritage, 
and business interruption. Such assessment, 
along with the assessment of the costs side discussed 
in the previous section, can be developed both before 
(ex-ante) and after (ex-post) the flood occurs. 

Concerning avoided damage, economic sub- 
jects and businesses are still less considered than 
households or public infrastructures, even though 
economic activities are core elements for the 
functioning and development of a territory. The 
problem is the great variety characterizing the economic 
sector, its production activities and its assets, which 
makes them differently prone to damage and losses. 

Positive externalities or “co-benefits” of 
disaster mitigation investments, such as 
environmental conservation, improved 
social cohesion and increased agricultural 
productivity, may materialize even in the absence of 
a disaster (Tanner et al. 2015). Discussions held 
during the IDEA Project revealed that civil 
protection representatives, for instance, regard 
the positive reputational effect resulting from 
enhanced prevention and mitigation measures as a 
relevant positive externality. 

Finally, it is important to highlight the “system 
benefits” arising from stakeholders’ perception 
of safety, which have to be better integrated into 
the “benefit scenarios”. If safety were perceived as 
a strength/value of a place to live or to locate 
economic activities, a “safer” place might become more 
attractive. This, of course, would produce an increase 
in the demand for safer space and properties, which, 
in turn, would generate not only a rise in real 
estate values but also a more dynamic environment 
for households and businesses in such safer zones. 
However, over time, as a drawback, exposure to 
residual flood risk would increase as well. Therefore, 
other system benefits should also be acknowledged, 
such as improved knowledge and information 
dissemination, leading to changes in the behaviour of 
households and economic subjects in case of flood. An 
increased awareness of flood risk would make 
communities better able to respond rapidly to any event, 
reducing damages, losses and psychological distress. 


4 THE DAM COST-BENEFIT ANALYSIS 
FOR FLOOD EVENTS IN THE UMBRIA 
REGION 


4.1 Case study background 


After a national directive on the national and regional 
warning system for hydrogeological risk for civil 
protection purposes was issued in Italy in 2004, the 
National Department of Civil Protection established 
a technical panel on this topic focusing on the Tiber 
river basin. As a first result of this activity, an 
“informal agreement” on the use of dams to retain water 
in case of heavy rains was signed among different 
authorities (Tiber river Authority, Dam Operators, 
etc.) in 2005. On this basis, dams were emptied 
following early warnings and were thus able to reduce 
the river flow during the flood events in Umbria in 
2005, 2008, 2010, 2012 and 2013. Furthermore, the Umbria and 
Lazio Regions adopted a framework agreement and 
a flood management plan in June 2016. 


4.2 The “dam exercise”: data collection, maps 
design, data processing and CBA assessment 


During both the 2012 and 2013 events, three 
dams (Montedoglio, Valfabbrica/Casanuova, and 
Corbara) were used to retain water and consequently 
reduce the river flow. The approach proposed 
in this paper for the development of CBA is 
the economic one, as in Brouwer & van Ek (2004), 
while avoided damage has been derived from the 
real damage and costs as evaluated by the Civil 
Protection of the Umbria Region, that is, by a 
public actor. 

In order to understand the net benefits obtained 
by using the dams as non-structural measures, 
five main steps were followed: (i) identification 
of the “avoided event” and its associated 
damages; (ii) identification of damages of the 
“occurred event”; (iii) calculation of “avoided 
damages”; (iv) identification of costs; (v) identification 
of the net benefits. 

By “avoided event” we mean the flooding that 
did not occur thanks to the water retained by the 
dam reservoirs. The “avoided event” was assessed 
by analysing the rainfall data and the water depth 
measured upstream of the dams, taken from the reports 
prepared by the Civil Protection of the Umbria Region 
for the 2012 and 2013 flood events. 

The damage that the “avoided event” could have 
caused was estimated with the Flood-IMPAT 
procedure (see Molinari et al. 2016), which performs 
a damage assessment at the meso-scale 
level by means of depth-damage curves in a GIS 
environment. Flood-IMPAT allowed the estimation of 
direct damages to the residential, industrial/commercial 
and agricultural sectors; consequently, only 
these typologies were considered in the analysis 
(Table 2a). 
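
To illustrate what a depth-damage curve does, the following minimal sketch interpolates a damaged fraction from a water depth and applies it to an exposed value. It is not the Flood-IMPAT implementation: the curve points, the function names and the linear interpolation are all assumptions made here for illustration only.

```python
from bisect import bisect_right

# Hypothetical depth-damage curve: (water depth in m, damaged fraction of value).
# These points are illustrative only and are NOT the Flood-IMPAT curves.
CURVE = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.4), (2.0, 0.7), (3.0, 0.9)]


def damage_fraction(depth_m, curve=CURVE):
    """Linearly interpolate the damaged fraction for a given water depth."""
    depths = [d for d, _ in curve]
    if depth_m <= depths[0]:
        return curve[0][1]   # at or below the first point: no extrapolation
    if depth_m >= depths[-1]:
        return curve[-1][1]  # at or above the last point: cap at the last value
    i = bisect_right(depths, depth_m)
    (d0, f0), (d1, f1) = curve[i - 1], curve[i]
    return f0 + (f1 - f0) * (depth_m - d0) / (d1 - d0)


def direct_damage(depth_m, exposed_value):
    """Direct damage = exposed value times the interpolated damaged fraction."""
    return exposed_value * damage_fraction(depth_m)
```

In a meso-scale assessment, a function of this kind would be evaluated for each land-use unit of the flood map, with one curve per sector, and the results summed by sector.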

The damages due to the “occurred event” 
(Table 2b) were assessed by using the real flood 
damage values collected by the Civil Protection 
through the damage declarations filled in by house- 
holds and economic subjects (see Menoni et al. 
2016). The avoided damages due to the operation 
of the dams during the flood event result from the 
difference between the estimated damages of the 
“avoided event” and the damages of the “occurred 
event” (Table 2c). 
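
The subtraction described above can be checked directly against the tabulated figures. The sketch below uses the values of Tables 2a and 2b (in M€); the variable and function names are chosen here for illustration. The "all sectors" column is taken as reported rather than summed, since the sector values do not add up exactly to it, presumably because of rounding.

```python
# Damage figures in million euro (M€), transcribed from Tables 2a and 2b.
COLUMNS = ("all_sectors", "residential", "agriculture", "industry_commercial")

# Estimated damage of the "avoided event" (Flood-IMPAT, Table 2a).
avoided_event = {
    2012: (55.5, 5.5, 19.0, 30.5),
    2013: (12.5, 1.5, 6.5, 5.0),
}

# Declared damage of the "occurred event" (Table 2b).
occurred_event = {
    2012: (36.0, 2.5, 12.0, 21.0),
    2013: (2.0, 1.5, 0.0, 0.5),
}


def avoided_damage(year):
    """Avoided damage per column: estimated minus occurred damage."""
    return {
        column: estimated - occurred
        for column, estimated, occurred in zip(
            COLUMNS, avoided_event[year], occurred_event[year]
        )
    }
```

Calling `avoided_damage(2012)` and `avoided_damage(2013)` reproduces the rows of Table 2c.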

The assessment of the costs associated with the 
use of the dams for flood lamination (i.e. flood peak 
attenuation) was based on the hypothesis that the 
dams of Montedoglio and Casanuova were used 
exclusively for energy generation. These costs 
(Table 3a) were estimated from the loss of revenue 
due to the energy left unsold because the dams’ 
turbines were not operated during the whole operation 
(including the days before the flood, when the dams 
were emptied). 

Finally, the net benefits associated with the 
flood lamination (Table 3b) were calculated for each 
event as the difference between avoided damages 
(benefits) and the loss of revenue (costs). As 
a further step, the absolute net benefits were 
computed in order to understand the relevance 
of the net benefits in relation to the damages of the 
avoided event. The absolute net benefit for the 2012 
and 2013 flood events is 32% and 73% respectively, 
which means that the use of the dams in both cases 
clearly reduced damages, with a particularly 
high performance in the second case. Such a significant 
result is obtained because a risk mitigation 
measure, and not just a damage mitigation measure, 
has been applied, reducing the severity of the 
hazardous event at its origin. The degree of damage 
reduction would have been even higher 
if the indirect and systemic effects of the avoided 
direct losses had been evaluated. 


Table 2a. Estimate of the total damage of the “avoided” 
events in 2012 and 2013. 

Damages without dam management (Flood-IMPAT) [M€] 

Event   All sectors   Residential   Agriculture   Industry/Commercial 
2012    55.5          5.5           19            30.5 
2013    12.5          1.5           6.5           5 


Table 2b. Occurred total damage per sector for the 2012 
and 2013 events. 

Occurred Damages [M€] 

Event   All sectors   Residential   Agriculture   Industry/Commercial 
2012    36            2.5           12            21 
2013    2             1.5           0             0.5 


Table 2c. Total avoided damages due to the use of the 
dams in the 2012 and 2013 events. 

Avoided damages with the use of dams [M€] 

Event   All sectors   Residential   Agriculture   Industry/Commercial 
2012    19.5          3             7             9.5 
2013    10.5          0             6.5           4.5 


Table 3a. Costs borne by the energy operator 
(operators’ calculations). 

Event   Dam           Laminated volume [m³]   Loss of revenue [M€] 
2012    Corbara       70 M                    2 
        Montedoglio   25 M                    0.7 
        Casanuova     20 M                    0.5 
2013    Montedoglio   25 M                    0.7 
        Casanuova     21 M                    0.6 


Table 3b. Net benefit from the CBA assessment. 

Event   Benefits [M€]   Costs [M€]   Net benefits [M€] 
2012    19.5            1.2          18.3 
2013    10.5            1.3          9.2 
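
The last two steps can be reproduced from the tabulated values. The sketch below uses the all-sector figures of Tables 2a, 2c and 3b (in M€); the names are chosen here for illustration. With these rounded inputs the share of net benefits in the damage of the avoided event comes out at roughly 33% and 74%, slightly above the 32% and 73% reported in the text, presumably because the reported shares were computed from unrounded figures.

```python
# All values in million euro (M€), transcribed from the tables.
avoided_damages = {2012: 19.5, 2013: 10.5}        # benefits, Table 2c / 3b
loss_of_revenue = {2012: 1.2, 2013: 1.3}          # costs, Table 3b
avoided_event_damage = {2012: 55.5, 2013: 12.5}   # Table 2a, all sectors


def net_benefit(year):
    """Net benefit = avoided damages (benefits) minus loss of revenue (costs)."""
    return avoided_damages[year] - loss_of_revenue[year]


def net_benefit_share(year):
    """Net benefit as a fraction of the estimated damage of the avoided event."""
    return net_benefit(year) / avoided_event_damage[year]


for year in (2012, 2013):
    print(year, round(net_benefit(year), 1), f"{net_benefit_share(year):.0%}")
```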


4.3 Lessons learnt 


Dam management as a non-structural risk mitigation 
measure proved to produce high savings in terms of 
avoided damage. The case study made it possible to 
identify a variety of factors related to the use of the 
CBA methodology and to the uncertainties linked to the 
processing of technical and scientific elements. 

Adopting the economic approach, problems arise 
from the difficulties in assessing 
the cause-effect chains and externalities. This is 
mainly due to the presence of indirect and systemic 
damage, suffered by the territorial elements 
and subjects on the one side and by the private 
owners of the dam on the other. This means 
a potential lack of damage information on both 
the benefits side, in terms of avoided damage, 
and the costs side, in terms of the losses produced 
by the implementation of the measure. 
In this particular case, hydroelectric power 
production is concerned, which means that 
the accounting for costs depends on energy 
costs, tariff profiles, customer typology, 
the variable demand for power during different 
periods of the day and the related energy production 
functions. On the other hand, negative externalities 
could occur because of the interruption of 
energy production, depending on the overall mix of 
power production sources and the share of power 
demand covered by the dam production. 

From a technical and scientific perspective, 
flood damage estimation by means of real dam- 
age assessment presents a set of shortcomings 
similar to those implied by fully modelled dam- 
age, depending on the availability of context-based 
functions and local specific characteristics that 
hamper the possibility to transfer results in space 
and time (see Merz et al. 2010). 

In addition, the case study shows that there is 
considerable room for non-structural measures negotiated 
among territorial actors, both ex-ante and ex-post 
events, whose costs are difficult to assess but 
whose benefits might be important. Although not 
directly considered in the case study, the use of the 
dams was possible through a process of negotiation 
and collaboration between private and public 
subjects. Therefore, negotiation costs should be 
better taken into consideration, together with the 
different negotiation models and the related degrees 
of power. In the dam example, for instance, the private 
subjects, that is the private dam managers, 
have been obliged to accept the demand for intervention 
under the specific regulation mentioned 
above, the Umbria Region’s Piani di laminazione 
(flood retaining plans). In principle, private subjects 
should be compensated for the costs incurred 
for public interest purposes; however, their 
right has to be balanced against the fact that they 
are beneficiaries of a licence arrangement to use 
water for energy production (Pesaro 2007). 


5 OPEN PROBLEMS AND FUTURE 
DEVELOPMENTS 


The proposed FMM classification shows that, even 
if traditionally less utilized (WMO 2009), non-structural 
measures are more numerous and varied 
than structural ones. A discussion about Flood 
Mitigation Measures (FMM) is central to better 
understanding the potential of CBA and improving it. 
Furthermore, structural measures are usually less 
cost-effective (Kelman 2013). Non-structural measures 
should then be considered while avoiding professional 
biases or limitations in the appraisal process 
(Mechler 2016). Furthermore, the use of CBA as 
proposed here has been based on the comparison 
between “near to real” costs and benefits, derived 
from damage evaluated by the Regional Civil 
Protection, and not on theoretical monetization 
methods. The aim of the exercise was to provide 
evidence of the potential effectiveness of non-structural 
measures in reducing damage, whose size 
has been calculated from the real damage 
data collected on the ground after the described 
events. Moreover, disaster risk reduction being the 
main goal, attention was directed toward costs in 
terms of interventions able to reduce the territorial 
impacts of floods, and toward non-structural measures 
as investments able to reduce implementation costs 
and the related territorial and landscape impacts. 

Following the proposed example of CBA 
applied to the use of dam reservoirs to store part 
of the peak volume, further efforts are required 
to estimate the economic efficiency of this type 
of measure, as well as to identify and assess the 
related systems of costs. 

Concerning future developments on the “benefits 
side”, avoided damages should also consider 
the high potential weight of the indirect and systemic 
damage to economic activities. More 
attention should also be devoted to showing how 
non-structural measures entail larger positive systemic 
effects on the economy, by taking into consideration 
local and supra-local links and interactions. 
Investments in FMM might produce a system 
of other positive externalities, which should be 
better taken into consideration in decision-making 
processes about the selection of the FMM. Such 
additional side effects might therefore maximize 
the effectiveness of the investments from a system’s 
perspective, producing win-win effects. This is the 
case, for instance, when territorial and natural 
environment management and conservation are 
considered even in the absence of flood events, so 
that consensus can be achieved more easily. 

From the “costs side”, the often neglected 
negative externalities of flood mitigation 
interventions, mainly the structural ones, 
should be included in the analysis (in monetary terms 
or not) because they may make other measures more 
desirable at the system level. This is a reason why, 
finally, a more interdisciplinary perspective is needed 
to better integrate economic thinking 
when assessing the impacts of mitigation. 


Behavioural modelling of attackers’ choices 


S. Panda, I. Oliver & S. Holtmanns 
Cybersecurity Group, Nokia Bell Labs, Finland 


ABSTRACT: This paper examines a cyber environment involving attackers and telecommunications 
operators from the attackers’ perspective. We adopt a behavioural approach to understanding 
attackers’ behaviour during the attack process. Traditionally, security games have been analysed assuming the 
attackers to be of strictly bounded rationality or strategy-less. Furthermore, studies consider that attackers 
aim to maximise their expected gain, which contradicts the assumption of bounded rationality of 
attackers. We have analysed security interactions considering attackers as rational entities with attack 
strategies. To understand the thought process and behavioural decision-making choices of attackers, we 
utilise a decision analysis model capturing the attack process. Based on our analysis, we propose a 
framework providing a way to enhance attack strategies against cooperating and non-cooperating (competing) 
operators. This study is intended to capture essential characteristics of an attacker in order to comprehensively 
understand and predict their expected behaviour, thereby assisting cybersecurity. 


1 INTRODUCTION 


The major concerns in cybersecurity are to measure 
the security risks and to determine the effectiveness 
of one’s security investments against 
perceived threats. In cybersecurity, security is 
defined not only by an individual’s security-related 
investments but also by others’ security 
investments (Anderson and Moore 2006, Laszka 
et al. 2015). This security interdependence adds 
additional complexities in quantifying the security 
risks and crafting appropriate measures against them. 

Game theory, being a mathematical modelling 
tool, has been widely used to study varied aspects 
of security (Roy et al. 2010, Merrick et al. 2016, 
Altman et al. 2006, Liang and Xiao 2013) and 
privacy (Manshaei et al. 2013). Most of the work 
has focused on studying defenders’ behaviour and 
has proposed strategic recommendations, which 
include stochastic approaches, frameworks, and 
cognitive and behavioural models strengthening 
defenders’ chances of successfully defending against 
attempted attacks. Studies have often assumed 
strategy-less behaviour of adversaries with a 
prescribed set of actions consistent with the threat 
models. However, alongside defenders, attackers 
are also intelligent entities and this assumption is 
not ideal in real-world situations which involve 
human adversaries (Camerer 2011). 

In cybersecurity, attackers’ behaviour has been 
less explored due to lack of reliable data on their 
intentions and interactions, limiting our understanding 
of their characteristics and behaviours. 
Veksler & Buchler (2016) and Anderson (2009) 
have indicated that cognitive approaches can aid 
in predicting attackers’ behaviour, addressing 
real-world security problems. 

In addition, over the past years, adversaries have 
become more financially oriented (Gordon 1994, 
Gordon 2000, Franklin et al. 2007), making them 
highly unpredictable. Some of these 
malicious activities are instigated by curiosity or 
by the desire for peer recognition, and are often 
undecided in terms of ethical legitimacy (Gordon 1994, 
Gordon 2000). The possibilities of using illegal methods 
provoke new classes and strategies of attacks, 
creating a need to study and analyse attackers’ 
behaviour to understand their intentions and 
decision-making criteria. 

We performed a game-theoretic investigation 
of attackers’ strategies in the context of cybersecurity. 
The examined scenarios illustrate security 
games between attackers (cybercriminals, hackers) 
and telecommunications operators (defenders). An 
attacker is an external entity with malicious objectives 
attempting to break through the security of 
the targeted entity/system with the intention of 
hampering the existing state of the target. 

A behavioural approach is utilised to anticipate 
the decision-making behaviour of attackers. We intend 
to determine attack strategies optimising attackers’ 
effort in performing an attack and improving their 
perceived utility. A question this paper aims to 
highlight is: when attack strategies are taken into 
consideration, what can the choice of not attacking 
signify? 

This study is a step towards understanding the 
mentality of attackers and their decision-making 
behaviour from a cybersecurity perspective. Given 
the lack of decisive information on adversaries, and 
given that the available security information is highly 
asymmetric in favour of the attackers, the results of 
this study can be used by defenders to assess 
their own conditions and anticipate the most likely 
attack strategies. 

The rest of the paper is organised as follows. 
Section 2 covers the relevant literature and highlights 
the relationship of our work with existing research. 
Section 3 discusses the behavioural aspects of 
attackers and presents an attack framework that 
breaks down the efforts required in an attack process. 
An analysis of how attack strategies can be 
optimised using the attack framework is given 
in Section 4. Section 5 discusses our findings and 
concludes this paper. 


2 RELATED STUDIES 


Even though this paper is confined to understanding 
attackers, attackers’ behaviour depends so closely 
on the operators’ state that 
modelling them without an underlying operators’ 
state is complex and unrealistic. An operator’s 
state is defined by its security-related investments 
and its relationships with other operators, namely 
cooperation (Laszka et al. 2015, Kunreuther and 
Heal 2003, Varian 2004, Hota and Sundaram 
2015) and competition (Jiang et al. 2008, Sun et al. 
2008, Khouzani et al. 2014, Panaousis et al. 2014), 
which induce additional security dependencies. 

Attackers, like operators (defenders), have 
strategic incentives and work towards maximising 
their expected utility (Laszka et al. 2015, Hausken 
2006). The expected utility is a critical influence 
in any decision-making process. For example, an 
operator invests in a particular security technology 
only after acknowledging that the investment will 
attain the expected returns (Hausken 2006). Similarly, 
the expected utility moderates attackers’ strategic 
choices, especially the motivation behind 
attacks (Hausken 2006, Herley 2010). The strategic 
choices of attackers are also influenced by available 
resources (Hausken 2006), the context of the 
interaction and the targeted operator’s state, which 
shape the expected utility. 

From an economic perspective, Herley (Herley 2010) pointed out
that an attack strategy should be defined by the economics of
attacks. He proposed attack strategies that distinguish between
scalable attacks and targeted attacks. In scalable attacks, the
effort is independent of the number of users attacked. In
targeted attacks, the effort grows with each user attacked, which
suggests that targeted attacks must be aimed at users with a
higher than average expected value. A profitable attack strategy
involves accurately distinguishing viable from non-viable targets
and deciding which viable target to attack based on the expected
value (Herley 2012).

Grossklags et al. (Grossklags et al. 2008) analysed the Nash
equilibria and social optima for different classes of attacks and
defences in weakest-link, best-shot, and sum-of-effort security
games. They introduced a weakest-target game “where the attacker
will always be able to compromise the entity (or entities) with
the lowest protection level but will leave other entities
unharmed.” Florêncio and Herley (Florêncio and Herley 2013)
refined this criterion by applying the concept of free-riding
(discussed by (Varian 2004)) to the least protected entity (or
entities). They state that even though economically profitable
targets exist, many attacks are extremely difficult to turn into
profitable ones, which grounds the criterion in the economics of
attacks.

From an extensive literature review, Hausken and Levitin
(Hausken and Levitin 2012) categorised attack tactics into
plausible types of attacks, such as attacks on a single element,
attacks against multiple elements, consecutive attacks, random
attacks, attacks involving a combination of intentional and
unintentional impacts, attacks with incomplete information, and
attacks with variable resources. However, a critical difficulty
in modelling opponents in general, and in the security domain in
particular, is the lack of decisive information regarding
potential adversaries, together with the fact that
attacker-defender interactions are highly complex and extensive
(Pita et al. 2012).

To understand the behavioural aspects of participants in
cybersecurity, (Kusumastuti et al. 2015) and (Ryutov et al. 2015)
have studied not only technical aspects but also psychosocial
aspects through a three-player cybersecurity game. Kusumastuti
et al. (Kusumastuti et al. 2015) used a minimax solution to
identify game parameters and their influence on a player’s
behaviour. Ryutov et al. (Ryutov et al. 2015) aimed at
understanding and modelling the roles, motivations and
conflicting objectives of players. In addition, (Anderson 2009),
(Tambe et al. 2014) and (Veksler and Buchler 2016) have
demonstrated improvement in the predictability of attackers’
behaviour by using behavioural/cognitive modelling in a repeated
security game environment.

To address adversaries’ bounded rationality, researchers have
been pursuing alternative approaches. One approach uses robust
optimisation techniques that avoid adversary modelling (Yang
et al. 2011, Pita et al. 2012, Pita et al. 2010), while the other
incorporates human decision-making models for computing defence
strategies (Nguyen et al. 2013). Our work utilises the latter
approach and differs from the existing research by focusing on
modelling adversaries rather than defenders. Firstly, instead of
strictly bounded rationality of attackers, we consider attackers
with strategic incentives working towards maximising their
expected utility. More precisely, we pose a decision model with
the intention of understanding the mentality of attackers and
their decision-making behaviour. We also introduce a generalised
attack framework that distinguishes the efforts required during
an attack process. The attack framework is used to evaluate and
refine attack strategies. In addition, this framework provides a
way of addressing the abstract states of the decision model,
which we believe applies to a whole class of security scenarios.


3 BEHAVIOURAL ANALYSIS AND 
FRAMEWORK CHARACTERISATION 


We define the attackers-operators interaction as a game-theoretic
model that captures essential characteristics of strategic
decision making. The essence of game theory is to study factors
influencing behaviour by reasoning about what players think other
players will do. However, in reality, having complete and perfect
information regarding your opponents is never feasible. This
applies particularly in the context of security, where the threat
is almost always unknown and the effectiveness of security
investments is very hard to quantify (Laszka et al. 2015). So,
every interaction is considered to involve a certain degree of
uncertainty in committing to decisions. For example, attackers
might know that an operator has invested in security but have no
decisive information on how much the operator has invested or on
which kinds of security technologies the operator has invested in.

Attackers, like defenders, are short of resources (Florêncio and
Herley 2013) and have to act strategically, maximising their
expected gain and optimising their investment of resources. To
this end, we analyse the attackers assuming their end goal is to
successfully attain the expected results while minimising their
effort in the attack process. First, this assumption relaxes the
strictly bounded rationality of attackers and gives them diverse
attack strategies, in contrast to the only choices of attacking
or not attacking in traditional game-theoretic modelling
approaches. In addition, it supports analysing interaction
environments under conditions where attackers do not react
(ignore or watch the target), diverging from the traditional
approach where attackers follow a prescribed path of invariably
attacking.

The introduction of strategic attackers expands the possibilities
in which an action can bear latent objectives and motives,
raising concerns regarding the admissibility of proposed defence
strategies. To take a concrete example, consider a distributed
denial-of-service (DDoS) attack, where an attacker attempts to
prevent an operator from delivering information or services.
With a strategy-less attacker, the resultant action for the
operator would be to invest resources in countering the attack
with full capacity to minimise the damage. Because the attacker
is strategy-less, the attack would be precisely an attempt to
harm the existing state of the operator. However, for a strategic
attacker, the DDoS attack might merely be a probing attack to
assess the strength of the operator, or it could be a diversion
ahead of a powerful targeted attack.

Figure 1 presents a glimpse of an extended action set for
strategic attackers. However, further characterisation of attacks
based on their severity and on the dependencies between them is
beyond the scope of this paper. Additionally, we consider
investment in security as discrete (Grossklags et al. 2008,
Kunreuther and Heal 2003, Lelarge and Bolot 2008), providing
insulation against all forms and degrees of attack, with no
further distinctions in its capability to defend against specific
attacks.

Extracting the intent, and the decisions made to achieve the
intended results, from an instance of a highly restricted
interaction scenario obtained through classical game theory is
challenging. Indeed, this is the very problem in decisively
predicting the behaviour of human players exhibiting strictly
bounded rationality (Camerer et al. 2004), especially when
addressing human adversaries (Camerer 2011), where there is no
evidence on the forms of motivation and intention behind attacks.
One approach to tackle this problem is to understand the context
of interaction, as a decision must be made within a specific
context and can be best represented through a hierarchy of
decision states (Lewis 2013). Figure 2 is a hierarchical decision
analysis tree capturing the mentality of attackers. The lowest
level of the hierarchy represents concrete actions or choices of
an attacker. As we ascend the hierarchy, states become
increasingly abstract and can be further fragmented into
transitional stages precisely representing and supplementing an
interaction scenario.

Figure 1. Attacker’s decision space.

The hierarchical decision analysis tree is a highly simplified
decision model representing the cognitive workflow of attackers.
It starts from the thought of an attack and terminates in a
definitive decision on attacking or not attacking, replicating an
attack process. The thought of an attack is supported by deciding
whether to search for vulnerabilities or which type of attack to
perform within the attacker’s capabilities. Depending on the
context of interaction, there could be numerous other decision
paths for attackers to choose from. These intermediate choices of
paths are latently, innately, or precisely influenced by factors
backing the intended goal. The subsequent steps down the
hierarchy include designing attack strategies and then deciding
whether to commit to an attack. The low-level decisions, which
demonstrate certain behaviour, are modelled using game theory.
Game theory, being a mathematical modelling tool, helps in
determining and quantifying the elements influencing decisions
and assists in predicting behaviour (Burke 1999). The low-level
behaviour can be used to infer the higher-order objectives that
are likely driving such behaviour, augmenting our understanding
of the intention behind attacks. An improved understanding of
intention and motivation will support more rigorous estimation of
attackers’ behaviour. However, understanding the abstract states
(the top levels of the hierarchy) demands a multi-disciplinary
approach with effective application of concepts from behavioural
psychology and cognitive science.

We have characterised the attack process in terms of the
different efforts required in performing an attack, acknowledging
attackers to be rational entities. Figure 3 presents the attack
framework, showing the efforts required in the attack process.
The overall effort can be broadly divided into searching effort
and breaking-in effort. The searching effort includes the efforts
required in searching for victims (targets), gathering
information and searching for vulnerabilities to exploit. The
breaking-in effort represents the efforts required to compromise
a system after choosing a target and a vulnerability to exploit.
Based on the total effort required to compromise the target, an
expected value can be derived. The expected value is one of the
crucial factors moderating an attacker’s decision (Herley 2010,
Laszka et al. 2015).

Converting the attackers’ decision model into the efforts
required in the attack process breaks down the top abstract
levels of the decision model into modellable units. These units
can be used to quantify the expected utilities, revealing the
incentives behind attacks and offering a better understanding of
attackers’ decision-making behaviour. Additionally, the attack
framework assists in evaluating and enhancing attack strategies
by effectively regulating the efforts, strengthening the efficacy
of attacks and ensuring better profits.


4 OPTIMISING ATTACK STRATEGIES 


Lack of complete and perfect information about the target induces
uncertainty in attackers’ decisions. The attack process
eventually converges to a point of choice where an attacker has
to decide whether or not to attack. The fate of an attempted
attack depends on the target’s capability to defend against
attacks, which in turn depends on the extent of its security
investments. Figure 4 represents the expected payoffs of an
attack against a targeted operator. We consider the investment in
security as discrete: it successfully prevents all forms and
degrees of attack. In Figure 4, secure represents a system
capable of successfully defending against an attack and insecure
represents the alternative.

In reality, defenders outnumber attackers. A critical problem
attackers face is identifying targets such that a committed
attack would yield something. For example, suppose the
telecommunications domain has a total of $N$ operators, with
$N_c$ cooperating operators sharing security dependencies and
$N_p$ non-cooperating operators competing on security. It is an
extremely expensive task for attackers to choose a viable target
from such an intertwined mesh of operators. Here, the number of
systems under each operator is ignored, as considering it would
magnify the complexity many times over. Possible tactics
attackers might adopt in this situation are


1. To randomly choose an operator and try breaching the
operator’s defences. This approach would involve a heavy
searching effort and a heavy breaking-in effort. It also adds
uncertainty, as the attacker is unsure of his capability to
successfully compromise the target.

2. To search for a certain type of vulnerability and then try to
exploit it. This approach would involve a heavy searching effort
but a small breaking-in effort. Even though the searching effort
is high, the chances of successfully compromising the chosen
operator are very high.


The expected Utility (U) represents the probable payoff an
attacker will receive on attacking the chosen operator. Based on
the attack framework in Figure 3, the expected Utility for an
attack can be determined as


$$U = \mathrm{cost}(\text{Information\_searching} + \text{Target\_searching} + \text{Vulnerability\_searching} + \text{Breaking\_in}) - \text{expected Value}$$


where, from (Herley 2010),


$$\mathrm{cost}(\text{Information\_searching}) < \mathrm{cost}(\text{Target\_searching})$$


and any other forms of relationships cannot be defined from the
existing literature.

Figure 4. Attacker’s expected payoff.

Gathering and sharing of security-related information is a key
factor in strengthening cybersecurity in both cooperating
(Hausken 2017) and non-cooperating (Khouzani et al. 2014)
environments. However, it is well known that the information made
available by defenders also supports attackers in strategic
decision-making. The following analysis illustrates how commonly
available knowledge on operators can be used to reduce the cost
of an attack. The use of available information reduces the
information-searching effort to a static cost, represented as
$C_I$, rather than a variable cost. In addition, attackers must
bear the vulnerability-searching cost, represented as $C_V$, as a
common cost that is independent of the choice of target. $t$
represents the choice of a target from the set of operators.

In a cooperating environment, the state of an operator is
influenced not only by his own decision but also by the decisions
of the other cooperating operators. An attacker who knows that a
set of operators ($N_c$) are cooperating narrows the
target-searching scope from $N$ operators to $N_c$ operators,
where $N_c < N$, reducing the effort to an extent. The expected
Utility ($U_c$) for attacking cooperating operators can be
defined as


$$U_c = C_I + \mathrm{cost}(\text{Breaking\_in}) + C_V + \sum_{t \in N_c} \mathrm{cost}(\text{Target\_searching}) - \text{expected Value}$$


In a non-cooperating environment, by contrast, an operator’s
security investment might encourage competing operators to invest
in better security measures. On the other hand, it might also
increase the likelihood of attacks on competing operators, as the
attacker will prefer a victim with lower resistance. Knowing that
operators are competing reduces the victim-searching effort
considerably, as it is economically beneficial to attack the
losing operator. The reduced victim-searching effort frees
additional resources for vulnerability searching and for breaking
into the operator’s defences. The expected Utility ($U_p$) for
attacking competing operators, with the target drawn from the set
of competing operators ($N_p$), can be defined as


$$U_p = C_I + \mathrm{cost}(\text{Breaking\_in}) + C_V + \sum_{t \in N_p} \mathrm{cost}(\text{Target\_searching}) - \text{expected Value}$$


Desired Gain represents the amount of gain the attacker wants
from an attack. From an economic perspective, an attacker would
prefer the attack that maximises his desired Gain. That is, from
the available range of attacks which would successfully
compromise the found vulnerability, he chooses the attack which
maximises (U - expected Gain). This indicates the co-existence of
several classes of attacks at a point of attack. The expected
payoff and the desired gain from an attack would moderate the
decisions of the attacker as


$$\text{decision} = \begin{cases} \text{Attack}, & \text{if } U \geq \text{expected Gain} \\ \text{Do not attack}, & \text{if } U < \text{expected Gain} \end{cases}$$
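
The utility calculation and the decision rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the argument names and all numeric values are hypothetical, and the sign convention (total cost minus expected value, attack when U is at least the expected gain) simply follows the equations as printed.

```python
from typing import Sequence


def expected_utility(
    info_cost: float,
    vuln_cost: float,
    breaking_in_cost: float,
    target_search_costs: Sequence[float],
    expected_value: float,
) -> float:
    """Utility as printed in the text: summed effort costs minus expected value.

    info_cost           -- static information-searching cost (assumed given)
    vuln_cost           -- common vulnerability-searching cost (assumed given)
    target_search_costs -- one hypothetical search cost per candidate target
    """
    total_cost = info_cost + vuln_cost + breaking_in_cost + sum(target_search_costs)
    return total_cost - expected_value


def decide(utility: float, expected_gain: float) -> str:
    """Decision rule from the text: attack if U >= expected gain."""
    return "attack" if utility >= expected_gain else "do not attack"


# Hypothetical figures: three candidate targets, each costing 1 unit to search.
u = expected_utility(info_cost=1, vuln_cost=2, breaking_in_cost=5,
                     target_search_costs=[1, 1, 1], expected_value=4)
print(u, decide(u, expected_gain=6))
```

Narrowing the candidate set (for example from all operators to only the cooperating ones) shortens `target_search_costs`, which is how the reduced target-searching effort enters the calculation.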


5 CONCLUSION AND FUTURE WORK 


We investigated the cybersecurity environment from the attackers’
perspective. Our results show that admitting that attackers have
strategies, incentives, etc. implies that defenders (telecoms
operators in our studies) need to change how they perceive,
defend against and react to attackers. Given the rise of targeted
or coordinated attacks relative to uncoordinated attacks (e.g.
DDoS), operators must significantly reassess their investment in
security technologies towards the former, despite the latter
having better ‘security theater’¹.

Traditionally, security games have been analysed assuming the
attackers to be of bounded rationality with a limited set of
prescribed choices. At the same time, studies assume that
attackers aim to maximise their expected gain, which contradicts
the assumption of bounded rationality of the attacker. We study
the attackers on the basis that they share similar
characteristics with defenders, with attack strategies that
maximise their expected gain.

In particular, we model security interactions with an extended
set of actions available to the attackers. This expands the
possibilities in which an action can bear latent objectives and
motives, raising concerns regarding the admissibility of proposed
defence strategies. We present a hierarchical decision tree
capturing the mental model of attackers during the attack
process. The decision model is supported by a generalised
framework representing the attack process in terms of the efforts
required by the attackers, which addresses the abstract levels of
the decision model. Using this framework, attack strategies
against cooperating and competing operators are derived that
optimise attackers’ effort, resulting in a better gain.
Furthermore, it facilitates a way of understanding the strategic
decision-making abilities of attackers.

1. Bruce Schneier, Beyond Security Theater:
https://www.schneier.com/essays/archives/2009/11/beyond_security_thea.html.

Not all attacks are intended to achieve economic targets. A
novice attacker might not aim to maximise his economic payoff but
rather to gain experience or reputation, and the interaction
might end with an attempted attack. However, the picture might be
completely different for an experienced attacker. When such
personality traits of attackers are considered, specifically the
strategic option of not attacking, they unsettle the traditional
security modelling approach, particularly the Stackelberg
approach (Kar et al. 2017), where the game proceeds on the
assumption that the attacker acts (invariably attacks). This
raises a number of research questions challenging the traditional
approach used in modelling cybersecurity. For example:


- Is every interaction between an attacker and a defender a
repetitive process, or is it a single-point interaction which
ends with an attempted attack?

- Is using Stackelberg Security Games to model security
interactions an appropriate choice?

- Considering the economics of scalable and targeted attacks
discussed by (Herley 2010), would it be an effective strategy to
launch a small scalable attack to determine the strength of the
target and then launch a specific attack incapacitating the
target?

In (Kusumastuti et al. 2015), attackers are given the option not
to attack and instead to invest in enhancing their capabilities,
enabling them to launch stronger attacks in the future. This
consideration would reduce the breaking-in effort and the
vulnerability-searching effort. From a psychological perspective,
Rogers (Rogers 2000) classified hackers according to their
expertise (from novice to experienced), areas of interest
(software, hardware, etc.) and behavioural patterns.

Modelling players to be able to predict expected 
behaviour in a more realistic way requires a pro- 
found understanding of their incentives, motives 
and the context of the interaction. Behavioural 
modelling of the attacker will not only assist in 
understanding the expected intentions and behav- 
iour of attackers but will also assist in devising 
comprehensive defences against such characteris- 
tics of attacks. 

However, each instance of an interaction is unique, with a unique
set of parameters characterising and moderating it. Modelling
these unique interactions under common grounds is highly
ineffective. They demand to be modelled based on the context of
the interaction, and using only game-theoretic concepts restricts
the context and the interaction environment to a large extent
through biases, heuristics, and convenience.

This preliminary exploration will guide our 
future studies in aptly modelling behavioural 
aspects of attackers and in refining the attack strat- 
egies and characteristics of attackers by incorpo- 
rating proposed concepts from the existing research 
work. This would further aid in comprehensively 
modelling the behavioural aspects of the partici- 
pants in the context of information-cyber security. 
ABSTRACT: In this work, an approach to estimate the optimal strategy to replace groups of assets
is discussed. The model includes parameters such as the cost of economic depreciation, the cost of
decommissioning, and probability distributions to represent random variables (for instance, time between
failures) and uncertainty in the cost of failures. This simulation model helps to understand the trade-off
between cost and availability across the possible replacement strategies. We present and discuss how a
Markowitz efficient frontier can be created from the simulated values for different replacement
strategies. This work can fill a gap in the literature concerning the problem of asset group replacement,
which is not well explored but is important for decision-makers dealing with real-world problems. The
approach presented also helps the asset replacement strategy, which is part of the operational strategy,
to be more flexible in supporting the high-level business strategy.


1 INTRODUCTION


One of the main tasks of a plant and maintenance engineer is the
replacement of existing assets, which is especially important in
industries such as mining, petroleum and power generation. The
problem faced by companies, though, is not only finding the
economic life of each asset individually, but also assessing the
impact of different asset replacement strategies on system
performance, using techniques such as RAM (Reliability,
Availability and Maintainability) modeling. In this paper, we
employ the Reliability Block Diagram (RBD) and Monte Carlo
simulation to evaluate the impact of asset replacement strategies
on system performance variables such as total cost, which includes
OPEX (Operational Expenditure) and CAPEX (Capital Expenditure),
and downtime. These are conflicting objectives. In this context,
the main objective is to understand the trade-off between cost and
availability according to the possible replacement strategies
available. A numerical example of a generation unit system of a
power generation company in Brazil is used to define the
Markowitz efficient frontier.

In order to fulfill the goals of the paper, the Reliability Block
Diagram (RBD) modeling technique is employed together with Monte
Carlo simulation, as discussed in section 2. Section 3 contains
the results (simulated and calculated). Section 4 presents final
comments.


2 ASSET REPLACEMENT


Authors like Eilton et al. (1966), Bazargan & Hartman (2012) and
Al-Chalabi et al. (2015) use analytical models to find the asset
replacement age that minimizes the Total Cost of Ownership (TCO).
Others, like Shi & Min (2014) and Adkins & Paxson (2017), use
geometric Brownian motion to simulate the costs of an asset for
use in asset replacement management models. The weakness of this
approach to minimizing the TCO is that it focuses on the asset
level and does not take into account the importance of
performance indices such as availability and reliability at
system level, which is the real problem. On this matter, Leung &
Tanchoco (1990) mention that if multiple assets are employed in
an integrated system, then the replacement decisions for all the
equipment of the system must be analyzed simultaneously.

Although the idea of this integrated analysis is not a recent
concern, there is still a lack of models studying real problems
in the asset management literature.

As discussed, the isolated analysis is sometimes not enough to
give the decision maker the correct insight into reality, because
it does not consider the complex configuration of a production
system found in the real world. In the isolated analysis, it does
not matter, from the reliability point of view, whether the
equipment is in series, in active parallel or in standby in the
system. This is the reason why RBD was employed in the current
paper. Simulation is employed to allow more flexibility in this
problem with conflicting objectives. The final model may help to
fill part of the gap in models for solving real replacement
problems in asset management.


3 MARKOWITZ EFFICIENT FRONTIER 


There are N asset replacement strategies, each with a different
and conflicting relation between availability and cost. Hence,
there is an efficient frontier to support management decision
making.

In the financial world, the objective of a portfolio of stocks is
to (a) maximize the average return and (b) minimize risk. There is
a conflict between these two variables: the higher the return,
the higher the risk. As explained by Powel & Baker (2009), both
objectives cannot be met simultaneously, but optimization
techniques can be used to exploit the trade-offs.

For financial analysis, Markowitz (1952; 1959) developed an
approach to examine the trade-off between risk and return of a
portfolio where the random variable is the rate of return over
time. The optimization consists in selecting stocks so as to give
the highest expected return (measured by the average) for a given
risk (measured by the variance of the return).

In the present work on replacement strategy, the attention is not
on the trade-off between risk and return, but on the one between
cost and availability for a system composed of a number of assets
that degrade over time with use.

The managerial flexibility is the replacement decision for each
individual piece of equipment. Each managerial alternative gives
a different cost and availability at system level. Then, by
simulating a number of managerial alternatives, it is possible to
construct a Markowitz efficient frontier made up of the solutions
that are not dominated by others. From this information, managers
will be able to choose the replacement strategy according to the
company’s risk profile.
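
As a rough illustration of how such a frontier can be extracted from simulated results, the sketch below keeps only the strategies that no other strategy beats on both cost and availability. It is a minimal example under stated assumptions: the `Strategy` type, the function names and every number are hypothetical and do not come from the paper's case study.

```python
from typing import List, NamedTuple


class Strategy(NamedTuple):
    name: str
    cost: float          # total cost (lower is better)
    availability: float  # system availability in [0, 1] (higher is better)


def dominates(a: Strategy, b: Strategy) -> bool:
    """True if a is no worse than b on both criteria and strictly better on one."""
    no_worse = a.cost <= b.cost and a.availability >= b.availability
    strictly_better = a.cost < b.cost or a.availability > b.availability
    return no_worse and strictly_better


def efficient_frontier(strategies: List[Strategy]) -> List[Strategy]:
    """Return the non-dominated strategies, sorted by increasing cost."""
    frontier = [
        s for s in strategies
        if not any(dominates(other, s) for other in strategies)
    ]
    return sorted(frontier, key=lambda s: s.cost)


# Hypothetical simulated results (illustrative numbers only).
simulated = [
    Strategy("replace nothing", 100.0, 0.90),
    Strategy("replace BRG", 120.0, 0.94),
    Strategy("replace GNT", 130.0, 0.93),  # costlier and less available than "replace BRG"
    Strategy("replace all", 200.0, 0.98),
]
frontier = efficient_frontier(simulated)
for s in frontier:
    print(f"{s.name}: cost={s.cost}, availability={s.availability}")
```

The pairwise comparison is quadratic in the number of strategies, which is usually acceptable for the modest number of replacement alternatives a planner would simulate.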


4 MODELING 


This paper considers the Markowitz efficient frontier of a Power
Generation Unit. Its Reliability Block Diagram is shown in
Figure 1.

In Figure 1, the equipment is in series, that is, the failure of
any single asset can make the system unavailable. This type of
system demands less initial capital investment, but has a higher
risk of failure.

In discussing modeling by reliability block diagrams, Zhang et al.
(2012) mention that there are two approaches: (1) Top-down, when
the stochastic modeling of the failure time of a system does not
consider its components individually, and (2) Down-top, when the
simulation of the moment of failure of a system assumes that the
statistical model of the time to failure of each individual
component of the equipment is known.

In an RBD analysis, the choice of the Top-down or Down-top
approach is influenced by the availability of historical data.
For some equipment, the amount of data available allows a more
detailed representation of the time until the individual
components fail, but in other cases this is not possible.

The model presented in Figure 1 follows the Top-down approach. It
is worth noting that a RAM analysis based on Monte Carlo
simulation is flexible enough to make the model as complex as the
judgment of the analyst requires. Table 1 summarizes the code,
description and current age of each asset.

According to Table 1, the Auxiliary Service Transformer, for
example, is indicated by the asset code AST and its current age
is 10 years of operation (87,600 hours).

The current age of each asset directly affects its failure rate
and consequently its availability and maintenance cost. Table 2
shows the probability distribution of the time-to-failure of the
assets of the Power Generation System.

As shown in Table 2, for example, the lifetime of the Auxiliary
Service Transformer is modeled using a Weibull distribution with
scale parameter (eta) equal to 56,940 hours and shape parameter
(beta) equal to 1.8. It is interesting to note that all assets
have an increasing failure rate (beta greater than 1), so
degradation is expected to be the main failure mode.

[Figure 1: reliability block diagram with six blocks in series:
AST (Auxiliary service transformer), BRG (Bearing), GNT
(Generator), HLU (Hydraulic lubrication unit), TBN (Turbine) and
WGT (Wicket gate)]

Figure 1. Reliability block generation unit system.

Table 1. Assets characterization for RBD modeling.

Asset code   Description                     Current age (hours)
AST          Auxiliary service transformer    87,600
BRG          Bearing                          87,600
GNT          Generator                       131,400
HLU          Hydraulic lubrication unit       87,600
TBN          Turbine                         131,400
WGT          Wicket gate                      43,800


Table 2. Probabilistic modeling of time-to-failure of
each asset (Weibull distribution parameters).

Asset code   Eta (hours)   Beta
AST          56,940        1.8
BRG          30,660        4.2
GNT          61,320        3.8
HLU          78,840        3.2
TBN          70,080        3.6
WGT          43,800        1.4

Figure 2. Probability density functions.

The probability density functions based on the parameters of
Table 2 can be observed in Figure 2.

Each curve in Figure 2 represents the probability density
function of the first failure of each asset.
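
For readers who want to reproduce curves of this kind, the following sketch evaluates the standard two-parameter Weibull density and hazard rate for the Table 2 parameters (shape values read with a decimal point). The function names and the two evaluation times are illustrative choices, not part of the paper.

```python
import math

# Weibull parameters from Table 2: asset code -> (eta in hours, beta).
WEIBULL = {
    "AST": (56_940, 1.8),
    "BRG": (30_660, 4.2),
    "GNT": (61_320, 3.8),
    "HLU": (78_840, 3.2),
    "TBN": (70_080, 3.6),
    "WGT": (43_800, 1.4),
}


def weibull_pdf(t: float, eta: float, beta: float) -> float:
    """Density f(t) = (beta/eta) (t/eta)^(beta-1) exp(-(t/eta)^beta)."""
    z = t / eta
    return (beta / eta) * z ** (beta - 1) * math.exp(-(z ** beta))


def weibull_hazard(t: float, eta: float, beta: float) -> float:
    """Failure rate h(t) = (beta/eta) (t/eta)^(beta-1); increasing when beta > 1."""
    return (beta / eta) * (t / eta) ** (beta - 1)


# Compare the failure rate at two arbitrary ages for every asset.
for code, (eta, beta) in WEIBULL.items():
    early = weibull_hazard(10_000, eta, beta)
    late = weibull_hazard(50_000, eta, beta)
    print(f"{code}: h(10,000 h) = {early:.3e}, h(50,000 h) = {late:.3e}")
```

Because every beta in Table 2 exceeds 1, the later hazard is larger than the earlier one for each asset, which is the increasing-failure-rate behaviour noted in the text.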

After the failure of each asset, management
applies corrective maintenance. An important
assumption is that, after maintenance, the probability
density function of the second failure is
not the same as that of the first one. Table 3 contains
the characteristics of the corrective maintenance task
of each asset.

Considering again the Auxiliary Service Transformer
(AST), the task duration of the corrective
maintenance is 120 hours, the cost of this task is
$ 66,000 and the age reduction factor is equal to 5%.
To understand the concept of age reduction, consider
that after a maintenance task, the asset can
return to a condition as good as new, as good as
old or intermediate. To quantify this intermediate
condition the following model can be used (Malik,
1979):
$t_a = t_b \cdot (1 - \mathrm{ARF})$ (1)
where $t_a$ is the age of the equipment after maintenance,
$t_b$ is the age before maintenance and ARF is
the age reduction factor. The as good as new condition
is represented by ARF = 100% and the as
good as old condition is represented by ARF = 0%.
The condition after the maintenance depends on
the complexity of the asset, among other things.
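As a minimal sketch of Equation (1), the snippet below applies the age reduction to the Generator (GNT), using its Weibull parameters from Table 2 and its ARF of 5% from Table 3. Taking the age before maintenance to be 67,991.3 hours (the first failure in the single run of Figure 3) is an assumption made only for illustration, and the function names are hypothetical.

```python
def age_after_maintenance(age_before, arf):
    """Equation (1): age after maintenance for an age reduction factor in [0, 1]."""
    if not 0.0 <= arf <= 1.0:
        raise ValueError("ARF must lie between 0 (as good as old) and 1 (as good as new)")
    return age_before * (1.0 - arf)


def hazard(t, eta, beta):
    """Weibull failure rate per hour at age t."""
    return (beta / eta) * (t / eta) ** (beta - 1)


GNT_ETA, GNT_BETA = 61_320, 3.8   # Table 2
GNT_ARF = 0.05                    # Table 3

age_before = 67_991.3             # assumed age at the first failure (hours)
age_after = age_after_maintenance(age_before, GNT_ARF)   # 64,591.735 hours

rate_before = hazard(age_before, GNT_ETA, GNT_BETA)
rate_after = hazard(age_after, GNT_ETA, GNT_BETA)         # lower than rate_before
```

An ARF of 5% removes only a small part of the accumulated age, so the failure rate drops slightly after the repair and then keeps rising, whereas a replacement (ARF = 100%) resets the age to zero.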

The modeled failure rate per hour of an asset
appears in Figure 3, based on a single
simulation.

In the simulation presented in Figure 3, one
failure occurred after 67,991.3 operating hours,
then corrective maintenance was performed
and the failure rate was affected by the age reduction
factor. Other failures occurred after 82,065.8
and 93,971.2 hours and again it is possible to see
the impact on the failure rate. The asset replacement
was performed at 96,360 hours, when the failure
rate returned to the as-good-as-new condition. The
last simulated failure occurred at around
158,454.5 hours.

The problem of finding the right ARF is complex
and requires additional data for modeling. In
this context, the present work does not make
inferences about the unknown parameters of an


Table 3. Corrective maintenance characteristics.

Asset code  Task duration (hours)  Cost ($)  Age reduction factor
AST         120                    66,000    5%
BRG         12                     21,000    85%
GNT         144                    72,000    5%
HLU         96                     60,000    5%
TBN         168                    114,000   5%
WGT         120                    90,000    5%


(Example of Generator (GNT) failure rate per hour plotted against
operation time in hours.)
Figure 3. Profile of failure rate per hour using
simulation.


Table 4. Task and cost used in asset replacement.

Asset code  Task duration (hours)  Cost ($)
AST         24                     300,000
BRG         48                     90,000
GNT         240                    750,000
HLU         120                    270,000
TBN         480                    630,000
WGT         240                    255,000


imperfect repair model as in Melchor-Hernández 
et al. (2014) and Toledo et al. (2015). 

Asset replacement implies a replacement
duration (downtime) and Capital
Expenditure (CAPEX). Table 4 contains the
task durations and costs for the equipment of
Figure 1.

In Table 4, the replacement of the Auxiliary
Service Transformer has a task duration equal to
24 hours and an investment cost equal to $ 300,000.
Data for the other assets are also in Table 4.


5 RESULTS 


Based on the parameters discussed in section 2, excluding
the current age and the asset replacement characteristics,
Figure 4 contains the simulated maintenance cost profile
of each asset for a period of 20 years.

Each profile is the simulated mean
maintenance cost after 10,000 simulations using
the software Isograph Availability Workbench
(AvSim®), considering that each piece of equipment is new
in year zero.

From Figure 4, it is easy to notice that different
assets have different corrective maintenance
cost profiles over time. For example, considering
only year 20, the Turbine (TBN) has the highest
simulated mean maintenance cost. It is therefore
expected that the optimal replacement strategy is
not the same for all assets.

The model of the economic life of an asset based
on cost minimization depends strongly on the maintenance
cost profile and other economic variables.
In this case, consider an opportunity cost
of capital equal to 12% per year, fiscal depreciation
equal to 10% per year, an income tax rate equal
to 25% per year and economic depreciation of the asset
value of 10% per year.

Figure 5 shows the relationship between Total
Cost Ownership and time of replacement, following
Costa Lima & Teodoro-Filho (2013), for different
assets.

Figure 5 highlights that each asset has its own 
economic life. Turbine (TBN) and Generator 
Figure 4. Simulated mean cost of maintenance of different
assets.


(Plot of total cost ownership ($) against age replacement (in years);
the legend includes WGT, HLU, TBN and BRG.)
Figure 5. Total cost ownership according to age
replacement.


(GNT), for example, have economic lives of 9 and
10 years, respectively. On the other hand, the economic
life of the Auxiliary Service Transformer (AST)
is at least 20 years.

In reality, the problem faced by companies
is not only to find the economic life
of each asset, but also to assess the impact of
different sets of asset replacements (strategies) on
system performance. To illustrate the question,
consider the model discussed in section 2
simulated for 3 years. In the case of no replacement at
the beginning of the period, the simulated present
value of the mean total cost is equal to $ 1,389,415
(considering an opportunity cost of capital equal to
12% per year) and the mean system downtime is equal to
2,749 hours, as shown in Table 5.


Table 5. Results of simulation of present values of cost
and downtime, Part 1.

Replacement     Mean total   Mean total  Mean system
decision        OPEX ($)*    cost ($)*   downtime (hours)
Replace all     61,874       1,692,042   611
No replacement  1,389,415    1,389,415   2,749
AST             1,330,456    1,525,853   2,658
BRG             1,354,562    1,413,181   2,688
GNT             834,091      1,429,673   1,775
HLU             1,304,659    1,480,516   2,723
TBN             823,115      1,323,404   2,319
WGT             1,352,345    1,456,770   2,935

*Present value.


In Table 5, replacing all assets at the beginning
of the period results in a mean total cost equal to
$ 1,692,042 and a mean system downtime equal to
611 hours. Replacing only the Auxiliary Service
Transformer (AST) at the beginning of the period
results in a mean total cost equal to $ 1,525,853
and a mean system downtime equal to 2,658 hours.

In short, Table 5 presents the results of the
following replacement strategies: (a) replacing
all assets at the beginning of operation, (b) no
replacement at the beginning and (c) replacing
only one specific asset and none of the others. The mean
total cost in present value is calculated as
the mean OPEX (represented by the corrective
maintenance cost) plus the investment cost in new
assets less the resale value of the assets replaced.
Both results are generated by the simulation.
In this model the resale value is calculated considering
an economic depreciation of 10% in relation
to the previous year.
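The cost calculation just described can be sketched as follows. One reading that reproduces the totals of Table 5 to within rounding is to take the resale value of a replaced asset as its replacement cost (Table 4) depreciated by 10% for each year of its current age (Table 1), with the investment made at the beginning of the period and therefore not discounted. That reading is an assumption, as are the function names; the mean OPEX values are taken from Table 5 rather than simulated.

```python
HOURS_PER_YEAR = 8_760
DEPRECIATION = 0.10  # economic depreciation per year

# asset code -> (current age in hours, Table 1; replacement cost in $, Table 4)
ASSETS = {
    "AST": (87_600, 300_000),
    "BRG": (87_600, 90_000),
    "GNT": (131_400, 750_000),
    "HLU": (87_600, 270_000),
    "TBN": (131_400, 630_000),
    "WGT": (43_800, 255_000),
}


def resale_value(code):
    """Assumed resale value: replacement cost depreciated 10% per year of age."""
    age_hours, cost = ASSETS[code]
    age_years = age_hours / HOURS_PER_YEAR
    return cost * (1 - DEPRECIATION) ** age_years


def mean_total_cost(mean_opex, replaced):
    """Mean OPEX plus investment in new assets less resale value of the old ones."""
    net_investment = sum(ASSETS[code][1] - resale_value(code) for code in replaced)
    return mean_opex + net_investment


tbn_total = mean_total_cost(823_115, ["TBN"])       # Table 5 reports 1,323,404
all_total = mean_total_cost(61_874, list(ASSETS))   # Table 5 reports 1,692,042
```

The same function applied to the AST row (mean OPEX of $ 1,330,456) gives about $ 1,525,852, within one dollar of the tabulated value.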

In contrast, Table 6 contains all 15 combinations
of two-asset replacements.

Among the two-asset replacements,
replacing GNT and TBN results in the lowest mean
system downtime, and replacing BRG and TBN results
in the lowest mean total cost in present value.

Table 7 presents the results of all possible
three-asset replacements.

Replacing GNT, HLU and TBN results in a mean
system downtime equal to 900 hours, which is even
lower than replacing only GNT and TBN. None of
the three-asset replacement strategies results in a lower mean
total cost than replacing only BRG and TBN or
replacing only TBN.

The simulated performance of all four-asset
replacements is presented in Table 8.

Replacing AST, GNT, HLU and TBN results in a
mean total downtime of 780 hours, but has a mean
total cost in present value equal to $ 1,595,473.
Again, no option is the best in both criteria
simultaneously.
Table 6. Results of simulation of present values of cost
and downtime, Part 2.

Replacement  Mean total  Mean total  Mean system
decision     OPEX ($)*   cost ($)*   downtime (hours)
AST; HLU     1,245,466   1,616,719   2,609
AST; WGT     1,297,708   1,597,529   2,832
BRG; AST     1,297,233   1,551,248   2,577
BRG; GNT     805,897     1,460,098   1,671
BRG; HLU     1,273,306   1,507,782   2,620
BRG; TBN     793,179     1,352,087   2,214
BRG; WGT     1,316,862   1,479,906   2,823
GNT; AST     774,217     1,565,195   1,655
GNT; TBN     268,276     1,364,146   1,058
GNT; WGT     797,726     1,497,733   1,722
HLU; GNT     749,450     1,520,888   1,622
HLU; TBN     739,361     1,390,800   2,168
TBN; AST     768,240     1,463,925   2,210
TBN; WGT     791,379     1,396,092   2,273
WGT; HLU     1,274,912   1,555,194   2,801

*Present value.
Table 7. Simulated performance, Part 3.

Replacement    Mean total  Mean total  Mean system
decision       OPEX ($)*   cost ($)*   downtime (hours)
AST; BRG; GNT  748,188     1,597,785   1,553
AST; BRG; HLU  1,216,724   1,646,597   2,510
AST; BRG; TBN  735,067     1,489,371   2,098
AST; BRG; WGT  1,264,536   1,622,976   2,723
AST; GNT; HLU  692,638     1,659,473   1,505
AST; GNT; TBN  210,166     1,501,433   935
AST; GNT; WGT  748,188     1,643,591   1,553
AST; HLU; TBN  683,598     1,555,140   2,055
AST; HLU; WGT  1,217,433   1,693,111   2,692
AST; TBN; WGT  730,615     1,530,725   2,152
BRG; GNT; HLU  717,374     1,547,431   1,509
BRG; GNT; TBN  234,921     1,389,410   940
BRG; GNT; WGT  765,303     1,523,929   1,609
BRG; HLU; TBN  707,366     1,442,130   2,059
BRG; HLU; WGT  1,237,202   1,576,103   2,687
BRG; TBN; WGT  759,511     1,422,843   2,166
GNT; HLU; TBN  184,808     1,456,535   900
GNT; HLU; WGT  716,946     1,592,809   1,574
GNT; TBN; WGT  233,956     1,434,252   1,005
HLU; TBN; WGT  709,079     1,489,649   2,126

*Present value.


Table 9 presents the results of all possible
five-asset replacements.

With the results of Table 9, all possible asset
replacement strategies at the beginning of the
period have been mapped. It is possible to see that
some strategies benefit one criterion and others


Table 8. Simulated performance, Part 4.

Replacement          Mean total  Mean system
decision             cost ($)*   downtime (hours)
AST; BRG; GNT; HLU   1,685,685   1,632
AST; BRG; GNT; TBN   1,526,789   818
AST; BRG; GNT; WGT   1,667,167   1,501
AST; BRG; HLU; TBN   1,579,957   1,944
AST; BRG; HLU; WGT   1,718,868   2,583
AST; BRG; TBN; WGT   1,558,509   2,044
AST; GNT; HLU; TBN   1,595,473   780
AST; GNT; HLU; WGT   1,729,547   1,454
AST; GNT; TBN; WGT   1,572,443   884
AST; HLU; TBN; WGT   1,625,435   2,005
BRG; GNT; HLU; TBN   1,483,405   785
BRG; GNT; HLU; WGT   1,621,005   1,464
BRG; GNT; TBN; WGT   1,461,611   891
BRG; HLU; TBN; WGT   1,516,077   2,016
GNT; HLU; TBN; WGT   1,528,600   851

*Present value.
Table 9. Simulated performance, Part 5.

Replacement               Mean total  Mean system
decision                  cost ($)*   downtime (hours)
AST; BRG; GNT; HLU; TBN   1,620,627   662
AST; BRG; GNT; HLU; WGT   1,758,903   1,347
AST; BRG; GNT; TBN; WGT   1,598,189   767
AST; BRG; HLU; TBN; WGT   1,652,005   1,896
AST; GNT; HLU; TBN; WGT   1,664,867   726
BRG; GNT; HLU; TBN; WGT   1,553,340   732

*Present value.


benefit the other criterion, but some strategies can
be discarded because they are surpassed by at least one
other strategy in both criteria.

All 64 asset replacement combinations were
simulated, but the plot of Figure 6 highlights only
26 solutions. These solutions were chosen for the
plot because they are among the 14 best solutions for
minimizing cost or unavailability. In Figure 6 the
plot indicates the mean total down time and mean
total cost in present value of each strategy.
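The count of 64 strategies follows from choosing, for each of the six assets, whether or not to replace it at the beginning of the period. The enumeration can be sketched as below; the variable names are illustrative.

```python
from itertools import combinations

ASSETS = ["AST", "BRG", "GNT", "HLU", "TBN", "WGT"]

# Every subset of assets is one replacement strategy, from "no replacement"
# (the empty subset) to "replace all" (the full set).
strategies = [
    subset
    for size in range(len(ASSETS) + 1)
    for subset in combinations(ASSETS, size)
]

# Number of strategies for each number of replaced assets.
count_by_size = {
    size: sum(1 for s in strategies if len(s) == size)
    for size in range(len(ASSETS) + 1)
}
# count_by_size == {0: 1, 1: 6, 2: 15, 3: 20, 4: 15, 5: 6, 6: 1}, 64 in total
```

These counts match the number of rows in Tables 5 to 9, and the doubling of the total with each additional asset is what makes exhaustive simulation impractical for large systems.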

It is clear that some strategies are surpassed by
at least one other strategy and others are not. The
strategy of replacing BRG and GNT (green circle),
for example, is surpassed by replacing BRG, GNT
and TBN not only in terms of cost, but also in terms of
availability.

The unsurpassed strategies form the efficient 
frontier as in Table 10. 
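The selection of unsurpassed strategies can be sketched as a simple dominance filter over the two criteria. The example below uses only six of the simulated strategies (values from Tables 5, 6 and 7), so it illustrates the filtering rule and is not a reconstruction of the full frontier; the function names are assumptions.

```python
# (strategy, mean total cost in $ present value, mean system downtime in hours)
STRATEGIES = [
    ("No replacement", 1_389_415, 2_749),
    ("TBN", 1_323_404, 2_319),
    ("BRG; GNT", 1_460_098, 1_671),
    ("GNT; TBN", 1_364_146, 1_058),
    ("BRG; GNT; TBN", 1_389_410, 940),
    ("Replace all", 1_692_042, 611),
]


def dominates(a, b):
    """True if a is no worse than b in both criteria and better in at least one."""
    return a[1] <= b[1] and a[2] <= b[2] and (a[1] < b[1] or a[2] < b[2])


def efficient_frontier(strategies):
    """Keep the strategies that no other strategy dominates."""
    return [s for s in strategies if not any(dominates(o, s) for o in strategies)]


frontier = [name for name, _, _ in efficient_frontier(STRATEGIES)]
# frontier == ["TBN", "GNT; TBN", "BRG; GNT; TBN", "Replace all"]
```

"No replacement" is dropped because replacing only TBN is both cheaper and more available, and "BRG; GNT" is dropped because "BRG; GNT; TBN" beats it in both criteria.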

The strategy of replacing only the Turbine at
the beginning of the period results in the minimum


Figure 6. Mean total down time and mean total cost in
present value to each decision.


Table 10. Markowitz effective frontier.

Replacement               Mean total  Mean system
decision                  cost ($)*   downtime (hours)
All assets replaced       1,692,042   611
TBN                       1,323,404   2,319
GNT; TBN                  1,364,146   1,058
BRG; GNT; TBN             1,389,410   940
GNT; HLU; TBN             1,456,535   900
BRG; GNT; HLU; TBN        1,483,405   785
BRG; GNT; TBN; WGT        1,461,611   891
AST; BRG; GNT; HLU; TBN   1,620,627   662
BRG; GNT; HLU; TBN; WGT   1,553,340   732

*Present value.


total cost in present value, but results in a high
unavailability. The choice of a strategy must rely
on the budget to maintain the system and on availability


(Histogram of total cost in present value.)
Figure 7. BRG, GNT, TBN replaced.

targets. So, if a company does not tolerate a total
down time over the 3 years of more than 650 hours,
the strategy of replacing all assets can be a good
option.

As the values discussed so far consider only
the averages, we can also consider the uncertainty
in costs. The mean failure frequency is used to
simulate the number of failures based on a Poisson
distribution, and each corrective
maintenance cost is treated as an uncertain variable which
can assume a value between 80% and 120% of the
value presented in Table 3.
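A minimal sketch of this kind of cost simulation is given below for a single asset. The mean number of failures (2.0) is a hypothetical input, since the mean failure frequencies are not reported here; drawing each cost uniformly between 80% and 120% of the nominal value is also an assumption, as only the range is stated. Discounting is left out for brevity.

```python
import math
import random
import statistics


def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method (small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_cost(mean_failures, nominal_cost, runs=10_000, seed=1):
    """Total corrective cost per run: Poisson failure count, cost in 80%-120% range."""
    rng = random.Random(seed)
    totals = []
    for _ in range(runs):
        failures = poisson(rng, mean_failures)
        totals.append(sum(nominal_cost * rng.uniform(0.8, 1.2) for _ in range(failures)))
    return totals


# Hypothetical input: 2.0 expected Bearing failures; $ 21,000 is the Table 3 cost.
costs = simulate_cost(mean_failures=2.0, nominal_cost=21_000)
mean_cost = statistics.mean(costs)   # close to 2.0 * 21,000
spread = statistics.stdev(costs)     # dispersion around that mean
```

Summing such simulated costs over all assets, and adding the net investment of the chosen strategy, would give a distribution of total cost rather than a single mean value.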

Figure 7 presents the histogram of the simulated total cost in
present value considering replacement
of the Bearing, Generator and Turbine.

The mean total cost in present value is equal to
$ 1,389,261, similar to the value presented in
Table 10. The advantage of this simulation is that it
shows the dispersion of the total cost, represented by
a standard deviation equal to $ 119,971, a 10th percentile
equal to $ 1,246,437, a 90th percentile equal to
$ 1,549,009, a minimum value equal to $ 1,154,489
and a maximum value equal to $ 1,937,684. So, the
manager, even when making a decision based on the mean
values, can investigate the dispersion of each variable
to have a better understanding of the risks.


6 CONCLUSIONS 


Different strategies have different impacts on total
cost and on total down time. Considering the
existence of trade-offs between the two variables,
an efficient frontier can be created based on the
simulated values for different replacement strategies,
similarly to Markowitz (1952).

A big system contains many assets, which
results in many combinations of replacements
(strategies). This problem easily falls under the curse of
dimensionality, so, for big systems, modelers
must consider optimization algorithms in order to
find a good solution or the efficient frontier.
ABSTRACT: The human wish to explore new areas, environmental changes and growing worldwide
demand have led to the Arctic region’s increasing popularity in recent years. The cruise industry is
continuously evolving in this area, creating an important need for more research into the risk of operating
in the Arctic Ocean. However, most insurance firms do not yet have standard procedures for evaluating
risk or policies for setting insurance premiums for the Arctic cruise ship industry. The paper draws on our
participation in a survival exercise in Arctic waters during May 2017, where the objective was to assess
the capability of rescue means in cold regions. Using our experience of the exercise as a basis, we
present the limitations related to cruise voyages in the Arctic. We suggest an insurance process
that should be followed and finally discuss some key factors that drive an insurance premium’s cost for
the cruise ship industry in the Arctic.


1 INTRODUCTION 

In recent years, the Arctic has gained more and 
more popularity due to the extraordinary environ- 
mental and developmental changes that have taken 
place in this region. Meanwhile, climate change has 
led to extensive thinning of sea ice, making marine 
access in the Arctic Ocean much easier. This
ice reduction extends to all seasons of
the year, giving the maritime industry the opportunity
for extended seasons of navigation and access
to new areas that were previously difficult to reach
(Guy, 2006; Lasserre, 2011; Ostreng et al., 2012;
Sarrabezoles et al., 2014; Lasserre and Pelletier,
2011). At the same time, global marine tourism is
rising and a place of extraordinary beauty like the 
Arctic could not stay unaffected by this trend. The 
potential impacts of these new marine uses—social, 
environmental and economic—are unknown but 
will be significant. Thus, there is great interest from 
both cruise ship owners and insurance companies 
regarding these trips. Safety is the main challenge 
that should be addressed by the owners of cruise 
ships operating in a remote and isolated area, with 
harsh weather conditions and poor infrastructure 
and communications. When a company needs to 
manage the negative consequences of an accident, 
it can: a) take all the consequences if/when an acci- 
dental event occurs, b) reduce the probability of an 
accident and/or its consequences by safety meas- 
ures or c) transfer the consequences of the occur- 
rence to parties better able to carry them (i.e. buying 
insurance) (Abrahamsen and Asche, 2011). As during
any other operation, when planning a cruise,
especially in an inhospitable environment like the
Arctic, both the ship owners and the passengers
must be insured. Thus, marine and travel insurance
companies are keen to increase their involvement 
in Arctic cruising, and this paper aims to give some 
considerations, in relation to insurance policies, 
that should be followed in the Arctic region. The 
limitations that the insurance companies should 
place on the ship owners, in order to offer insurance 
for their vessels, are discussed. In addition, this 
paper presents a standardized procedure that the 
insurance companies should follow before putting a 
value on the insurance contract. Finally, we discuss
which cost drivers matter and how they could affect
the price of an insurance premium.


2 CRUISE INSURANCE 


Cruise insurance is not considered marine insur- 
ance (Burke, 2000) but is better described as the 
sum of two categories: marine insurance and travel 
insurance. 

With the term ‘marine insurance’, we mean the 
contract offered by an insurance company to the 
ship owner. With this contract, the insurer under- 
takes to indemnify the assured, in manner and to 
the extent thereby agreed, against marine losses, 
that is to say, the losses incident to marine adven- 
ture (Marine Insurance Act, 1906). There are spe- 
cific marine insurance types and policies, which the 
cruise ship-owners are obliged to have, according 
to the international law and the national regula- 
tions of each country, depending on the specifica- 
tions and the details of the upcoming voyage of 
their vessel. 


Figure 1. Cruise insurance.


On the other hand, travel insurance is considered
to be the insurance product designed to cover the
costs and losses of the passengers and to reduce the
risk associated with unexpected events that someone
might incur while traveling. Here again, there
are different types that cover the specific needs of
travelers.

In the following, we discuss the policies of
Arctic cruise insurance. The limitations that the
insurance companies should place on the ship
owners, in order to offer insurance for their vessels,
are presented. Finally, we discuss which cost
drivers matter and how they could affect the price of an
insurance premium.


3 LIMITATIONS 


Insurance companies decide whether they
should offer insurance to a shipowner according to
their policies. Whether they offer insurance
to a cruise vessel owner depends on their risk
appetite, meaning that it depends on the ‘amount’
and type of risk that a company is willing to take in
order to meet its strategic objectives. An insurance
company can be defined as: a) risk-averse
when the company dislikes risk and will stay away
from adding high-risk investments to its portfolio,
b) risk-seeking when the company prefers to
take some high-risk investments and c) risk-neutral
when the company seeks high-risk investments but
at an average level.

Insurance is considered an alternative to invest- 
ing in safety measures offered in order to transfer 
risk to a third party (insurance company) (Abra- 
hamsen and Asche, 2011). However, the Arctic is a 
very hostile environment, about which we lack suf- 
ficient knowledge. Although there are no specific 
limitations stated by the regulatory framework 
in the Arctic, in order for a ship-owner to obtain 
insurance, we strongly believe that there should be 
some minimum requirements that should be cov- 
ered by the cruise ship owners prior to an Arctic 
expedition. 

Quite often ship-owners try to save on the
costs of investing in safety measures by buying
insurance. To make Arctic expeditions safer, there
should be some limitations regarding their ability
to obtain insurance. The main requirement that
they must meet is to have an adequate polar- or
ice-class-certified vessel for traveling in the Arctic.
A vessel should not be allowed to operate in the
Arctic area unless it is certified as eligible for the
area. In August 2006, the International Association
of Classification Societies (IACS) released a
document, titled the “Unified Requirements for
Polar Ships”, which standardized global ice classification
specifications for vessels (Table 1).

Another limitation should be the shipping firm’s 
competence in Arctic shipping. There should be a 
strict investigation of the firm’s past behavior in 
relation to safety policies. For example, if a ship- 
ping firm has neglected safety issues in previous 
trips and put passengers’ lives at risk, then it should 
not qualify for an insurance contract. 

Furthermore, before a ship-owner is given the 
right to insure his vessel for an Arctic voyage, there 
should be a thorough inspection of the vessel and 
the extent to which the ship owner has covered the 
requirements for survival equipment. An insurance 
company must deny insurance to a ship that does 
not have sufficient lifeboats and life rafts for all the 
passengers, modified to suit the Arctic needs (i.e. 
winterized). Another example could be the lack 
of adequate insulated survival suits, as well as Per- 
sonal Survival Kits (PSKs) and General Survival 
Kits (GSKs). An insurance company could protect 
itself against wrongdoing or neglect on safety issues 
by the shipping firm, by stating specific terms in 
the insurance premium that place the blame on the 
ship-owner in the case of an accidental event which 
occurred due to the shipowner’s negligence. 

But then an important question arises: Should
those limitations regarding the shipping firm


Table 1. Polar Class descriptions (International
Association of Classification Societies, 2016).

Polar class  Ice descriptions (based on WMO Sea Ice Nomenclature)
PC 1         Year-round operation in all polar waters
PC 2         Year-round operation in moderate multi-year ice conditions
PC 3         Year-round operation in second-year ice which may include
             multiyear ice inclusions
PC 4         Year-round operation in thick first-year ice which may include
             old ice inclusions
PC 5         Year-round operation in medium first-year ice which may include
             old ice inclusions
PC 6         Summer/autumn operation in medium first-year ice which may
             include old ice inclusions
PC 7         Summer/autumn operation in thin first-year ice which may
             include old ice inclusions


depend only on the insurance company’s interpreta- 
tion? As already stated, certain insurance companies 
are considered to be risk seeking, meaning that they 
are willing to take high risks in order to gain profits. 
Thus, it is of great importance that these limitations 
be included in a legally binding agreement. There 
must be a regulatory framework between the Arctic 
countries and the ship owners interested in operat- 
ing in Arctic waters that states specific limitations 
for the vessels, in terms not only of operating in the 
Arctic but also of obtaining insurance coverage. 
This way, the ship owner can be made responsible in 
the case of an accidental event that involves neglect 
or wrongdoing on the shipping firm’s side. 


4 INSURANCE PREMIUMS IN 
THE ARCTIC 


The lack of data and standardized methods 
regarding the assessment and the modeling of the 
risks, as well as the poor background knowledge, 
create a big challenge for insurance companies in 
the Arctic. This leads the underwriters to work on 
a case-by-case basis that vastly increases the cost 
of the insurance premium. Thus, the sustainabil- 
ity of Arctic expeditions is strictly dependent on 
the cost of marine insurance, and the industry is 
calling for more standardized procedures. Here, a 
standardized procedure for the insurance compa- 
nies is suggested (Figure 2). 


(Flow of three phases: submission of marine declaration form
from the ship owner, then risk assessment, then insurance premium pricing.)
Figure 2. Phases of suggested insurance premium
procedure.


Initially, after receiving all the relevant
information regarding the vessel and the trip to be
insured from the ship owner through the marine
declaration form, an insurance company should
check whether the shipping firm satisfies the limitations
on obtaining insurance (polar/ice
class, life-saving equipment, etc.). If those limitations
are satisfied, a risk assessment should
follow, with a focus on the consequences and the
associated probabilities. The risk assessment starts
with a qualitative risk analysis, in which the hazards
are identified and categorized.
After identifying the possible hazards and
their related consequences, a quantitative risk
analysis should follow, in which, according to the
data for the region of the trip concerned and the
background knowledge of previous incidents, a
probability is assigned to each accidental
event. Then, using the specific information
obtained from the shipping firm regarding the
vessel and the trip (e.g. ship design information,
winterization of the ship, quality and quantity of
survival equipment, etc.), the probabilities should be
adjusted to reflect the specifications
of the shipping firm. Special attention
must be given to the strength of knowledge and
the uncertainties, as these may lead to a higher
insurance premium. In general, an insurance premium
is considered fair when it is to a large extent equal
to the expected costs. If an insurance company
does not offer fair
insurance, another company can attract its customers.
Thus, high uncertainties or poor background
knowledge can result in higher insurance
premiums. For example, one important finding
of our participation in the survival exercise in the
Arctic was the influence of an integrated heating
system in the lifeboats on the survivability rate of
the passengers (Gudmestad et al., 2017). A vessel
that does not carry lifeboats with an integrated heating
system will be assigned a higher probability for the
risk of people getting cold due to low temperatures
than a vessel equipped with lifeboats with a heating
system. Thus, the insurance premium will be
lower for the second vessel, as the passengers have
a higher probability of surviving and the insurance
company is taking a lower risk.

After finishing the risk assessment, the pricing
of the insurance premium follows. In this phase we
identify two main steps. During the first step, the
expected costs are calculated by using information
from the risk assessment. The expected cost for
each accidental event, $E[C|A_i]$, is multiplied by the
associated probability $P(A_i)$. This is done for all the
accidental events, and the products are then summed to give an
initial price for the insurance premium. During the
second step, the insurance premiums calculated in
the previous step on the basis of the expected costs are
adjusted in order to take the strength of knowledge
and uncertainties into consideration. Finally, the
administration costs and a percentage of the profit
that the company wants to make should be added to
the price of the premium decided by the insurance
company in the first step. As already mentioned,
an insurance premium is considered fair when it is
to a large extent equal to the expected costs. As a
result, the added profit that an insurance company
wishes to obtain should not be too high. It is also
recommended that the later adjustments of the
second step be made through a broad managerial
review. It is also of great importance that
the insurance company carries out its own assessment
of the vessel, sending its own surveyors to check
the design and the equipment of the cruise ship.
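The two pricing steps can be sketched numerically as follows. All probabilities, costs and loadings below are hypothetical values chosen for illustration, and the way the uncertainty adjustment, administration costs and profit are combined (a multiplicative loading, an added amount and a percentage mark-up) is one possible reading rather than a prescribed formula.

```python
def expected_cost(events):
    """Step 1: sum of P(A_i) * E[C | A_i] over all accidental events."""
    return sum(probability * cost for probability, cost in events)


def premium(events, uncertainty_loading, admin_cost, profit_rate):
    """Step 2 and final additions, under the assumptions stated above."""
    base = expected_cost(events)                  # initial price from step 1
    adjusted = base * (1 + uncertainty_loading)   # strength-of-knowledge adjustment
    return (adjusted + admin_cost) * (1 + profit_rate)


# Hypothetical accidental events: (probability P(A_i), expected cost E[C | A_i] in $).
EVENTS = [
    (0.001, 50_000_000),   # e.g. a major accident
    (0.010, 2_000_000),    # e.g. a medium-severity incident
    (0.050, 100_000),      # e.g. a minor incident
]

base_price = expected_cost(EVENTS)   # about 75,000
price = premium(EVENTS, uncertainty_loading=0.20, admin_cost=5_000, profit_rate=0.10)
# about 104,500
```

A better-equipped vessel would enter this calculation through lower event probabilities or lower expected costs, and a better-documented one through a smaller uncertainty loading.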

A cruise ship operator is concerned with lower- 
ing the costs to increase the profits and will, most 
likely, only want to fulfill the minimum require- 
ments stated by the legal framework. However, if 
the requirements are very strict and conservative, 
it might make the vessel owners stop their Arctic 
cruise business. Having first-class survival equip- 
ment that could fit every passenger on a cruise ship 
of more than 2000-3000 passengers could be quite 
expensive. Thus, there must be a conscious effort 
by the legislators to secure top levels of safety in 
Arctic cruise traffic, while, at the same time, not 
acting conservatively and overestimating the pres- 
ence of extra safety measures. 


5 ARCTIC REGION COST DRIVERS OF
MARINE INSURANCE


After presenting the limitations that should apply 
to Arctic voyages, and our suggestion for a stand- 
ardized procedure that should be implemented on 
determining an insurance premium in the Arctic, 
we will now discuss the influence of specific fac- 
tors on the insurance premium. 

The most important criterion that influences 
the cost of the insurance premium is the ice class 
of the vessel. As previously mentioned, insurance 
companies should not offer insurance contracts to 
firms whose ships are not designed for navigation 
in potentially iced waters. However, there are dif- 
ferent polar and ice class categories and, thus, the 
higher the polar and/or ice class of a vessel is, the 
lower the price of the insurance premium offered 
by the insurance company. This is reasonable, as 
the higher the ice class of a vessel, the higher the 
capacity of the vessel to withstand harsh weather 
conditions and avoid any accidental events. 

Another factor that could influence the cost 
of an insurance premium in the Arctic is the win- 
terization of the vessels. Winterization is a proc- 
ess, which enables vessels to operate in extreme 
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sub-zero temperatures without suffering loss of 
equipment operability, vessel stability and power, 
and personnel habitability, and which permits 
crew operations to be performed safely (Ghosh 
and Rubly, 2015). Class regulations require equip- 
ment and systems to be winterized. Some of those 
winterizations procedures are presented by Has- 
holt (2011). Sheltered pathways, under-deck heat- 
ing, heated mooring equipment, low temperature 
emergency generators, ice navigation radar, ice 
searchlights, etc. are all elements that could drive 
the cost of an insurance premium down. The more 
of the aforementioned elements a cruise ship has, 
the more ‘winterized’ it is and the lower the price 
of the premium. Each element has specific importance 
for operating in the Arctic and together 
they constitute the winterization of the vessel. For 
example, if a vessel is considered 40% winterized 
then there could be an agreement with the insur- 
ance company to lower the insurance premium 
price by 5%-10%. 
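
The idea of a vessel being "40% winterized" can be illustrated with a minimal sketch. It assumes that each winterization element carries a weight reflecting its importance and that the winterization level is the weighted share of elements installed. The weights, the function names and the element keys below are hypothetical placeholders; the text gives no numerical weights, and the discount rate would be whatever the ship owner and insurer agree.

```python
# Hypothetical integer weights (summing to 100). The source states only that
# each element has its own importance; these numbers are illustrative.
ELEMENT_WEIGHTS = {
    "sheltered_pathways": 15,
    "under_deck_heating": 20,
    "heated_mooring_equipment": 15,
    "low_temperature_emergency_generators": 20,
    "ice_navigation_radar": 20,
    "ice_searchlights": 10,
}


def winterization_level(installed):
    """Return the weighted share (0.0 to 1.0) of winterization elements installed."""
    total = sum(ELEMENT_WEIGHTS.values())
    present = sum(w for name, w in ELEMENT_WEIGHTS.items() if name in installed)
    return present / total


def discounted_premium(base_premium, discount_rate):
    """Apply an agreed discount rate (e.g. 0.05 for 5%) to a base premium."""
    if not 0.0 <= discount_rate < 1.0:
        raise ValueError("discount_rate must be in [0, 1)")
    return base_premium * (1.0 - discount_rate)


# Example: two installed elements give a level of 0.4 under these weights.
level = winterization_level({"under_deck_heating", "ice_navigation_radar"})
premium = discounted_premium(100000.0, 0.05)  # lower end of the 5%-10% range
```

Under these assumed weights, a vessel with under-deck heating and ice navigation radar would count as 40% winterized, which corresponds to the case in the text where a reduction of 5%-10% might be negotiated.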

Communication systems and stability play an 
important role in the Arctic region. The remoteness 
of the area highlights the need for survival equip- 
ment able to help the passengers to survive for the 
period of five days stated in the regulatory framework 
for ships operating in the Arctic, the Polar 
Code (IMO, 2016). The presence of enhanced communication 
systems and adequate survival equipment 
could dramatically lower the overall price 
of an insurance premium. During harsh weather, 
improved communication systems could prevent 
loss of communication, which could be vital for 
the cruise ship. Being equipped with sufficient life- 
boats, life rafts, survival suits, PSKs and GSKs is 
one of the previously stated limitations. However, 
during our survival exercise in Arctic waters, it was 
noted that further improving the survival suits, by 
adding integrated woolen underwear, and provid- 
ing sizes for all the passengers could lead to an 
insurance company’s decision to lower their offer, 
as the severity of the consequences in case of an 
unexpected event is reduced (Gudmestad et al., 
2017, Solberg et al., 2016). 

The time of the year and the route that the cruise 
ship is planning to take can also influence the insur- 
ance premium. For instance, a trip planned during 
August would have a lower insurance premium 
than a trip planned at the beginning of the 
Arctic cruise season in May. This is because the 
probable ice concentration and ice movement dur- 
ing August would be lower than the corresponding 
ice concentration and ice movement during May 
(Lasserre, 2014). 

One of the most important criteria for an area 
like the Arctic, where we lack sufficient knowledge, 
is the experience of the captain and crew members. 
After interviewing several companies that offer marine 
insurance for the Arctic, Sarrabezoles et al. (2014) 
found that most firms said that 
trust in the shipping company is important and that 
they therefore might be reluctant to insure firms that 
do not have experience in Arctic shipping. Some of 
them stated that they would examine every submis- 
sion but evaluate the preparedness of the shipping 
firm, the crew’s experience, charts’ accuracy and 
contingency planning in case of problems. However, 
for the majority of firms, it is clear there is a strict 
inspection of the shipping firm’s past behavior and 
safety-related policy. This means that the insurance 
company expects to see proof that the shipping firm 
is able to perform well in Arctic waters. 

Finally, each traveler should have private travel 
insurance. However, most private travel insur- 
ance companies require travelers to have separate 
search and rescue cover. This can raise the price 
for the traveler by two or sometimes three times, 
compared to the initial travel insurance cost. Fur- 
thermore, it is not unusual for those third-party 
companies that offer search and rescue cover, to 
put extra limitations on travelers, such as age or 
the area in which they provide cover. Thus, it is of 
significant importance that the ship-owners should 
acquire such search and rescue cover for all their 
passengers. This would decrease the insurance 
premium provided by the insurance company, as 
there would be higher chances of survival in the 
case of an accidental event and, thus, lower sever- 
ity of consequences. The private travel insurance 
premium of the travelers would also decrease. 
However, the ship owners will have to undertake 
extra cost for the search and rescue cover. 


6 CONCLUSIONS 


Although Arctic cruises are gaining in popularity, 
the Arctic region remains a hostile environment, 
where we lack sufficient background knowledge 
and data to determine insurance premiums. There 
are many hazards related to cruise traffic in the 
Arctic, and even more challenges arise in the case 
of an accidental event during the voyage. 

Through a legislative framework, specific limi- 
tations must be implemented that, unless met by 
the ship owners, forbid the signing of insurance 
cover. Ice and/or polar class certification, thor- 
ough background check, inspection of the vessel 
and adequacy of survival equipment must all be 
included as mandatory limitations before insuring 
a cruise vessel for an Arctic voyage. 

A standardized procedure should be followed by 
underwriters to determine insurance premiums in 
the Arctic. Thus, the creation of a platform, where 
Arctic data would be gathered and easily accessed 
is of high importance. 

Different factors can influence the cost of an 
insurance premium between a shipping firm and 
an insurance company, the most important being 
the level of winterization of the cruise vessel, the 
level of training of the shipmaster and the crew 
members and the coverage of the search and rescue 
procedure for the passengers in case of emergency. 


ABSTRACT: Previous research on aviation and health sectors has found that individual blame for small 
failures discourages incident reporting and so adversely impacts disaster prevention. This finding has 
widely influenced practice in organizations relying on engineers. Based on a survey of Australian engi- 
neers (n = 275) this paper examines how personal legal liability considerations impact on hazard reporting 
and other forms of knowledge sharing. We found that 48% of engineers are more likely to report hazards 
despite changes in societal expectations and the tendency to blame. Only 5% indicated that they were less 
likely to report hazards as a result of their liability concerns. We suggest that these findings are due to 
the nature of engineering work, where decision-making is distributed across time, place and people. In 
this environment, blame and responsibility are less attributable to individual actors. Equally, reporting a 
hazard may act to transfer responsibility and so limit one’s personal liability. 


1 INTRODUCTION 

Excellence in organizational safety performance 
requires that all personnel are willing to learn from 
small faults and failures (Hopkins, 2009) and also 
from past accidents (Snook, 2000) in order to pre- 
vent recurrence. This leads to a range of strategies 
for collecting and sharing information ranging 
from formal incident reporting to informal sharing 
of stories within professional groups (Hayes and 
Maslen, 2015, Maslen and Hayes, 2016). 

When it comes to formal incident reporting sys- 
tems, a key challenge is encouraging people to report 
what they know so that systemic problems can be 
identified and addressed. A culture that encourages 
reporting is thought to be a just culture—one in 
which people can report genuine mistakes without 
fear of retribution (Dekker, 2007, Reason, 1997). 
Much of the research that led to these conclusions 
was originally grounded in the health and aviation 
sectors although the findings have been assumed to 
generalize into any complex socio-technical system 
including those where work is largely carried out 
by engineers. Our study challenges this paradigm 
within the engineering context. 

We have studied the impact of personal legal 
liability concerns on high stakes decision making by 
engineers. The results of our survey show that per- 
sonal legal liability supports good decision-making 
when legal responsibility aligns with issues over 
which engineers have control (Hayes et al., 2017). 
Where these are not in alignment, there is evidence 
that ‘defensive engineering’ practices can develop 
analogous to ‘defensive medicine’—a set of practices 
that have grown up in the health sector in response 
to increased liability concerns (Catino, 2009). 

In this article we focus specifically on those 
aspects of the survey related to the impact of per- 
sonal liability concerns on reporting and knowledge 
sharing. Knowledge sharing practices include inci- 
dent reporting, the sharing of stories about past 
incidents and the release of findings from high 
quality accident investigations that support learn- 
ing. First, we engage with the literature on safety 
culture and responsibility in engineering. We then 
describe our survey method and the survey results 
as they apply to learning from accidents and inci- 
dents. The paper concludes with a discussion of the 
implications of our findings and key conclusions. 

We argue that the nature of engineering work 
means that knowledge sharing behaviors (report- 
ing, storytelling) are less impacted by personal 
liability concerns in engineering-based industries 
than in healthcare and aviation. Rather than inhib- 
iting reporting, concerns about blame may encour- 
age reporting. There is some evidence of impacts of 
personal liability concerns on storytelling among 
our respondents’ colleagues, though this is not 
as pronounced as in healthcare and aviation. The 
implications of this result warrant further inquiry. 


2 JUST CULTURE AND ENGINEERING 
RESPONSIBILITY 


Much organizational learning uses a strategy of 
trial and error, but learning in order to prevent dis- 
asters requires more sophistication. Reason’s (1997) 
famous Swiss cheese model of accident causation 
encourages us to seek out those holes in the cheese 
not only so they can be fixed, but because they pro- 
vide important information about organizational 
weaknesses. Similarly, high reliability theorists tell 
us that high performing organizations are preoc- 
cupied with finding small errors and faults, again 
for the insights they provide on underlying organi- 
zational issues that have the potential to result in 
more serious failures (Weick and Sutcliffe, 2001). 

Such learning strategies rely critically on profes- 
sionals throughout organizations being prepared to 
report small errors and faults but this is an inher- 
ently difficult undertaking when, at least in some 
circumstances, people are effectively being asked to 
report themselves. As Reason (1997, pg. 196) says, 
‘human reactions to making mistakes take various 
forms, but frank confession does not usually come 
high on the list’. 

In response to this problem, a significant research 
literature and safety practice has developed around 
what is known as ‘just culture’ which attempts to 
balance accountability for current problems with 
learning in order to prevent more problems in 
the future (Dekker, 2007). Enacting a just culture 
within an organization is seen as the ultimate goal in 
fostering open reporting of small faults, failures and 
errors and so maximizing opportunities for learn- 
ing (Aveling et al., 2016, Gerede, 2015, Khatri et al., 
2009). These ideas have been taken up with gusto, 
particularly in the health sector. A Scopus search 
for ‘just culture’ returns 175 items of which 71% are 
health sector publications and 12% are air traffic 
management-related. Only six articles are based in 
engineering-related industries and all of these focus 
on application of the just culture concept. 

Despite this lack of specific evidence, the prop- 
osition that a just culture increases reporting is 
assumed to apply in engineering-based industries, 
too. In particular, it has become common for safety 
culture survey tools to include questions based on 
just culture principles (see for example Kines et al., 
2011, Shirali et al., 2013). 

In addition to formal reporting systems, informal 
practices including sharing stories are also important 
for learning and accident prevention. Engineering 
professionals share stories among their professional 
group with the explicit purpose of fostering exper- 
tise (Hayes and Maslen, 2015, Vastveit et al., 2015). 
This also provides a mechanism for learning from 
past failures, be they small faults or major disasters. 
Through stories, workers come to understand the 
potential consequences of their decisions (Maslen, 
2014, Maslen and Hayes, 2016, Storseth and 
Tinmannsvik, 2012). They support the development 
of a ‘safety imagination’—an ability to link one’s 
actions to the potential consequences (Pidgeon and 
O'Leary, 2000). When then faced with potentially 
minor operating anomalies, engineers are able to 
draw connections to past major events and so dig 
deeper into the state of the system (Macrae, 2009). 
Professionals taking responsibility for their own 
expertise and learning to ensure the best decision 
making is an example of what the engineering eth- 
ics literature calls forward-looking responsibility: 
‘a virtue or a moral obligation to see to it that a 
certain state-of-affairs applies’ (Doorn and van de 
Poel, 2012, pg. 10), in this case accident-free opera- 
tions. Some authors have called for an increased 
focus on this aspect of engineering responsibility 
pointing out that the discourse on engineering eth- 
ics and responsibility tends to be dominated by a 
framework of responsibility as blameworthiness 
(Coeckelbergh, 2012, Doorn, 2012, Kermisch, 
2012). Questions of engineering responsibility 
therefore link directly to the issue of learning from 
incidents and accidents as past errors can be inter- 
preted either as deserving of punishment or as key 
lessons to prevent recurrence (Fahlquist, 2006). 
We find these issues occupying the attention of 
survey respondents, but in ways that may not be 
expected based on the body of published research. 


3 METHOD 


To assess the prevalence and impacts of liability con- 
cerns among engineers in Australia, we developed 
and administered a survey with a mixture of closed- 
and open-ended questions. It was designed for self- 
completion via an internet browser on a computer, 
tablet or smartphone and was programmed and 
administered using Qualtrics software. 
The questionnaire addresses six main issues: 


• Risks and benefits of considering personal liability. 
• Impact of serious incidents on personal liability concerns. 
• Impact of others’ individual experiences on personal liability concerns. 
• Impact of personal liability concerns on reporting and knowledge sharing. 
• Impact of personal liability concerns on professional board registration. 
• Impact of personal liability concerns on career. 


Survey participants were recruited from two 
sources. Email invitations to participate in the survey 
were sent to Australian Pipeline and Gas Associa- 
tion Research and Standards Committee members 


and to Engineers Australia’s Brisbane area members. 
A total of 275 Australian engineers participated in 
the survey during the period July to September 2016. 
Due to the way in which Australian engineers were 
recruited for this survey (i.e. a snowballing approach 
was used whereby one group of targeted respond- 
ents was asked to circulate the survey through- 
out their organisation in addition to participating 
themselves), it is not possible to accurately identify 
the total number of Australian engineers invited to 
participate. Therefore, it is not possible to accurately 
calculate the survey response rate nor to comment 
quantitatively on the extent to which the results 
might be indicative of engineers in Australia overall. 

This work has been approved by the relevant 
university human ethics committees. The survey 
data were processed using the IBM SPSS-23 sta- 
tistical software package and both IBM SPSS-23 
and Microsoft Excel were used to produce percent- 
ages. Graphical representations of the data were 
produced in Microsoft Excel. 

The percentage results presented against each 
survey question have been calculated using the 
total number of valid responses received for that 
question. The number of valid responses (as shown 
in each figure) varies due to individuals (either 
intentionally or unintentionally) not providing 
a response to a question and/or individuals not 
being eligible to complete a question given their 
previous responses in the questionnaire. 
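
The valid-response rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' SPSS or Excel procedure: it assumes that a skipped or ineligible respondent is recorded as `None`, and the function name `valid_percentages` is hypothetical.

```python
from collections import Counter


def valid_percentages(responses):
    """Percentage of each answer, using only valid (non-missing) responses.

    `responses` holds one entry per respondent; None marks a respondent who
    skipped the question or was not eligible, and is left out of the
    denominator.
    """
    valid = [r for r in responses if r is not None]
    n_valid = len(valid)
    if n_valid == 0:
        return 0, {}
    counts = Counter(valid)
    return n_valid, {answer: 100.0 * c / n_valid for answer, c in counts.items()}


# Illustrative data only: six respondents, two of whom gave no valid answer.
n, shares = valid_percentages(["yes", "yes", "no", None, "yes", None])
```

With the illustrative data, the denominator is four rather than six, which is why the number of valid responses shown in each figure differs from question to question.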


4 IMPACT OF PERSONAL LIABILITY 
CONCERNS ON REPORTING AND 
KNOWLEDGE SHARING 


The extent to which respondents are concerned 
about personal legal liability provides impor- 
tant context for these results. Thirty-two per cent 
of respondents indicated that they think about 
personal liability to a great extent when making 
decisions at work. Further, 89% of respondents 
indicated that they consider personal liability at least 
to some extent. These findings show that personal 
liability is an important issue for respondents. 

To investigate the impact of liability concerns 
on formal hazard reporting and informal practices 
for sharing knowledge about things that have gone 
wrong, we asked a series of questions regarding 
hazard reporting practices and other practices 
about learning from past failures. 

Over four fifths of respondents (83%) said their 
organization provided a formal system for employ- 
ees to report hazards. Among these respondents, 
close to all (93%) of those who had encountered a 
hazard in their organization said they reported it 
into the organization’s formal system. Only 12 of 
the respondents (out of a possible 107) who had 
encountered a hazard in their organization had not 
formally reported it. 

Members of this small group were then prompted 
to explain why, giving the following reasons for not 
making a formal report: 


• The incident was too small to report. This was 
the most frequently noted reason (‘I felt the hazard 
was minor’; ‘it was isolated...it was personal 
factors, not systemic’; ‘trivial problems - can’t 
cotton ball everything and everyone’). 
• Time to report the incident (‘no time’; ‘not worth 
the time to report’). 
• The incident was already known. One respondent 
observed ‘[I] didn’t want to create a fuss over 
something that was already known’. 
• The incident was immediately fixed (‘the people 
involved investigated and steps [were] taken to 
redress the problem immediately’; ‘I was able to 
do something about it rather than just report it 
for someone else to deal with’). 


In the context of the survey, it is noteworthy that 
no-one indicated that they had failed to report a 
hazard as a result of concerns about liability and 
blame. 

Building on this, Figure 1 shows responses 
regarding whether personal liability concerns 
impact on whether a respondent is likely to report 
hazards into their organization’s formal system. 
Forty-seven per cent of respondents indicated 
that personal liability concerns would have no 
impact. Only 5% of respondents indicated that 
they were less likely to report hazards as a result 
of liability concerns. In contrast, 48% of respond- 
ents indicated that they were more likely to report 
hazards including 21% who said they were much 
more likely to report hazards. In other words, 
personal liability concerns promote rather than 
hinder likelihood to report. This is a critical find- 
ing because it is contrary to what we would antici- 
pate based on previous research that emphasizes 
the importance of a just culture in fostering haz- 
ard reporting. 
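
The grouping of the seven-point scale into "more likely", "no impact" and "less likely" can be sketched as follows. This assumes, based on the scale labels in Figure 1, that scores 5 to 7 count as more likely, 4 as no impact either way, and 1 to 3 as less likely. The function name and the example scores are hypothetical and are not the survey data.

```python
def collapse_likert(scores):
    """Group 7-point scores into three categories and return percentages.

    Assumed mapping: 5-7 = more likely, 4 = no impact, 1-3 = less likely.
    Scores outside 1-7 are treated as invalid and ignored.
    """
    valid = [s for s in scores if s in (1, 2, 3, 4, 5, 6, 7)]
    if not valid:
        raise ValueError("no valid scores")
    groups = {"more likely": 0, "no impact": 0, "less likely": 0}
    for s in valid:
        if s > 4:
            groups["more likely"] += 1
        elif s == 4:
            groups["no impact"] += 1
        else:
            groups["less likely"] += 1
    n = len(valid)
    return {label: 100.0 * count / n for label, count in groups.items()}


# Illustrative scores only (ten hypothetical respondents).
example = collapse_likert([7, 7, 6, 5, 4, 4, 4, 4, 3, 1])
```

Applied to the actual responses, the same grouping would yield the 48%, 47% and 5% figures reported in the text, provided the authors used the same cut points.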

Another way in which learning about prob- 
lems occurs is via informal sharing of stories. 
Figure 2 shows responses regarding whether per- 
sonal liability concerns impact on the likelihood 
that a respondent will share stories of things that 
have gone wrong with colleagues so that lessons can 
be learned. Forty-two per cent of respondents indi- 
cated that personal liability concerns would have no 
impact. Sixteen per cent of respondents indicated 
that they were less likely to share stories as a result 
of liability concerns. While this is more than the 5% 
who indicated they were less likely to report hazards 
as a result of liability concerns it is still a low figure. 
Again, contrary to what might be expected from the 
broader literature on organizational learning and 
blame, 43% of respondents indicated that having 
personal liability concerns made them more likely to 
share stories, including 19% who said the concerns 
made them much more likely to share stories. 


[Figure 1: stacked bar, "Whether I will report hazards into my organisation's formal system" (n=207); horizontal axis 0% to 100%; scale from (7) Much more likely through (4) No impact either way to (1) Much less likely.] 

Figure 1. Does the risk of your personal liability make it more or less likely that you will report hazards into your 
organisation’s formal system? 


[Figure 2: stacked bar, "Whether I will share stories of things that have gone wrong with colleagues so that lessons can be learned" (n=242); horizontal axis 0% to 100%; scale from (7) Much more likely through (4) No impact either way to (1) Much less likely.] 

Figure 2. Does the risk of your personal liability make it more or less likely that you will share stories of things that 
have gone wrong with colleagues so that lessons can be learned? 

While personal liability concerns appear to have 
only a minor impact on willingness to share stories, 
it is important to note that these figures were higher 
when engineers reflected on sharing of stories by 
others. Figure 3 indicates that almost one third of 
respondents (31%) were aware of other engineers 
being unwilling to share stories of things going 
wrong due to concerns about personal liability. 

Those aware of such cases were asked to provide 
more details. Thirty-seven respondents chose to do 
so. 

Key reasons cited that influenced people’s perceived 
willingness to share stories of things going 
wrong are: 


• Contractual relationships and concerns about 
blame and financial liability; 
• Failures being indicative of incompetence and 
so those involved being unwilling to admit what 
has occurred; 
• Fear of blame linked to job security; and 
• Fear of blame linked to legal processes. 


[Figure 3: chart segment labelled "Aware of engineers unwilling to share stories (n=78)", 31%.] 

Figure 3. Are you aware of any engineers (other than 
yourself) who were unwilling to share stories of things going 
wrong due to their concerns about their personal liability? 


We would expect unwillingness to share stories 
on these grounds based on the just culture 
research. The reasons that engineers feel that their 
peers may be less likely to share stories than themselves 
warrant further inquiry. 

Other sources that may be drawn upon to learn 
about disaster prevention include investigations 
into the causes of serious accidents. As shown 
in Figure 4, 72% of respondents reported that 
details of serious accidents in their particular sector 
have been made available in a way that allows 
lessons to be learned. While this is positive, it also 
shows room for improvement. Twenty-eight per 
cent of respondents indicated either that information 
had not been made available or they were 
not aware whether it had been made available. In 
either case, respondents were not able to learn 
from serious accidents. Opportunities to learn 
lessons from major failures are mercifully rare. 
However, this makes it all the more important 
that lessons are maximized when accidents occur 
to prevent future occurrences. In this context, it 
is problematic that almost a third of respondents 
had observed information not being made available 
post-disaster. 

Looking into this issue further, respondents were 
asked if they knew why information had not been 
made available for learning. As shown in Figure 5, 
in more than two thirds of cases (68%), respondents 
reported that lessons had not been generated 
and/or not released. In only 12% of cases was 
the issue one of timing in that lessons would be 
released in the future. In 21% of cases respondents 
were not aware why lessons had not been released. 


[Figure 4: chart segment labelled "Yes, details have been made available (n=135)".] 

Figure 4. [Regarding the serious accidents that 
occurred in your sector, either in Australia or overseas] 
Have details of those serious accidents been made available 
in a way that allows lessons to be learned? 


[Figure 5: horizontal bar chart, axis 0% to 50%. Categories and values as far as they can be recovered:] 

The investigation’s findings have not been made public, and there are no plans to release these findings in future (n=13) 
The investigation did not produce sufficient material on the lessons learned (n=6) 
The investigation’s findings will be made public, but have not been released yet (n=4): 12% 
The accident(s) have not been investigated (n=1): 3% 
Other (n=3): 9% 
I don’t know (n=7): 21% 

Figure 5. Do you know why the details [of the serious accidents] are not available? 
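
As a quick check on the Figure 5 values, the percentages can be recomputed from the response counts given in the figure. The sketch below assumes that the six counts (13, 6, 4, 1, 3 and 7) together form the valid-response total; the shortened category labels and the function name are hypothetical.

```python
# Response counts as given in Figure 5; labels are shortened for readability.
FIGURE_5_COUNTS = {
    "findings not public, no plans to release": 13,
    "insufficient material on lessons learned": 6,
    "findings will be public, not yet released": 4,
    "accident(s) not investigated": 1,
    "other": 3,
    "don't know": 7,
}


def rounded_shares(counts):
    """Return each category's share of the total, rounded to a whole per cent."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


shares = rounded_shares(FIGURE_5_COUNTS)
```

On that assumption the total is 34 responses, and the recomputed shares for the last four categories match the 12%, 3%, 9% and 21% shown in the figure. The shares for the first two categories (about 38% and 18%) are computed here and are not taken from the figure itself.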


5 DISCUSSION 


Based on the small survey population it appears 
that most engineering organizations have in place 
a system for reporting hazards and that engineers 
use such systems with very few exceptions. Despite 
the expectations of the just culture literature that 
concerns about blame would inhibit reporting, this 
does not seem to be the case for these small and rel- 
atively common workplace problems. This surpris- 
ing result is also borne out by our previous work in 
incident reporting in a nuclear power station which 
exhibited over 2000 reports per annum with the 
primary advantage to those making reports being 
seen as quick action to remediate issues raised 
(Hayes, 2009). 

When asked specifically about legal liability and 
hazard reporting, many respondents saw liabil- 
ity as a reason to increase, rather than decrease, 
reporting. Respondents reported similar behavior 
when it came to more informal professional learn- 
ing practices regarding sharing stories with col- 
leagues. This raises the question as to why liability 
is of less concern in the engineering sector than in 
health. Doorn and van de Poel (2012) have high- 
lighted the extent to which responsibility is a key 
issue in engineering ethics (in contrast to medical 
ethics, for example). They maintain that this is 
in response to the collective nature of engineer- 
ing activity and also the uncertainty and indirect 
causality of major engineering failures. As Coeckelbergh 
describes, ‘... between the actions of an 
engineer and the eventual consequences of her 
actions lies a world of relationships, people, things, 
time and space’ (2012, pg. 37). 

It is possible that this goes some way to explain 
our surprising results on this issue. In an environ- 
ment where work is both distributed and collective, 
responsibility for small problems is also distrib- 
uted. Perhaps reporting small problems could be 
seen as ‘insurance’ against blame beyond one’s 
sphere of direct control in the event that something 
goes wrong. Reporting possible warning signs may 
provide something to point to as a defense against 
blame when collective endeavors fail to give the 
required outcomes. This explanation is somewhat 
speculative, but it is consistent with evidence to 
date. 

Overall, the positive link between liability and 
reporting appears to be good news. Concerns about 
legal liability are perhaps raising awareness of the 
need to learn and so increasing forward-looking 
responsibility. At its best, forward-looking respon- 
sibility is conceptualized by ethics scholars as going 
beyond simply doing one’s duty. Duty implies rule- 
following, whereas responsibility is more concerned 
with getting the best outcome. As Goodin (1986, 
pg. 50) explains, ‘[d]uties dictate actions. Responsibilities 
dictate results.’ Other safety researchers 
have warned of the dangers of decision making 
grounded only in compliance where obeying rules 
becomes an end in itself, rather than a means to 
safe operations (Bieder and Bourrier, 2013). It 
seems at least possible that a link between learning 
and responsibility mediated by an increased focus 
on liability may support the development of for- 
ward-looking or virtue responsibility, rather than 
simply a desire to avoid blame. 

The potentially positive impact of liability 
concerns on learning is tempered by the results 
obtained when questioning moves from self- 
reported behavior to observations about the 
behavior of professional colleagues. Almost one 
third of respondents reported seeing colleagues 
who are unwilling to share stories of things going 
wrong for reasons linked to liability. Possibly in 
these cases the potential storyteller has a more 
personal link to the problem at hand and so feels 
more acutely a sense of backward-looking respon- 
sibility as blameworthiness outweighing a sense of 
forward-looking responsibility and virtue in efforts 
to prevent accidents. 

This could also be a reflection of the reflexiv- 
ity of respondents and their discomfort confess- 
ing their own perceived shortcomings. They are 
uncomfortable talking about their own shortcom- 
ings, but are much more talkative about the short- 
comings of their colleagues. In another study, in 
which we examined decision making in the oil and 
gas sector, we found that respondents claimed that 
their decision making was not influenced by finan- 
cial incentives such as bonus payments awarded 
by their employer (Maslen and Hopkins, 2014). 
Incentives would have a negative impact on the 
quality of decisions from a safety perspective, in 
the same way that failing to share stories due to 
personal liability concerns has the potential to 
negatively impact disaster prevention. In contrast, 
many felt that their colleagues’ decisions could be 
influenced by incentives. In an interview context, as 
opposed to a survey, the interviewer has an oppor- 
tunity to explore the reasons for this inconsistency. 
Is it perhaps that liability concerns (like incentives) 
do influence behavior in a way that is not readily 
acknowledged? This is a critical question that war- 
rants attention using qualitative methods such as 
interviewing and observation. 

The final issue of interest is learning from 
significant disasters that have been the subject 
of formal inquiries. The survey results indicate 
that there is room for improvement in this area. 
Respondents reported a significant proportion of 
cases in which lessons are not made available for 
learning. Dekker (2015) describes four psychologi- 
cal purposes of accident investigations: epistemo- 
logical (establishing what happened), preventative 


(identifying pathways to avoidance), moral (trac- 
ing the transgressions that were committed and 
reinforcing moral and regulatory boundaries) and 
existential (finding an explanation for the suffering 
that occurred). These different orientations to an 
investigation result in different findings as shown 
by Gephart (1993) in his detailed analysis of a 
Canadian pipeline failure. As our survey respond- 
ents indicate, not all orientations result in effective 
lessons to prevent recurrence. 


6 CONCLUSIONS 


Much published research focuses on the adverse 
impact of personal liability and blame on learning 
from accidents and incidents. The conventional wis- 
dom is certainly that blame inhibits learning. Our 
study in the engineering profession reveals a more 
complex relationship. It seems that, at least for our 
respondents, a focus on personal legal liability may 
act to increase practices related to learning as a 
result of an increased sense of virtue responsibility. 

This finding is tempered however by reports 
that a significant minority of engineers are unwill- 
ing to share stories as a result of liability concerns. 
Legal constraints may also be responsible for a 
reluctance to share lessons for major accident 
investigations in some cases. 

It appears that liability is a double-edged sword 
with both positive and negative effects on learn- 
ing which are surely worthy of further investiga- 
tion, particularly in this engineering context where 
such issues are little studied by comparison with 
the health sector. 

It must be emphasized that this is effectively 
a pilot study with a relatively small cohort of 
respondents. As such, we make no specific claims 
that these findings are representative of the engi- 
neering profession in Australia or globally. Nev- 
ertheless, the findings are surprising and deserve 
further investigation in the context of organiza- 
tional learning for accident prevention. 
ABSTRACT: Following the implementation of international and national regulations and standards
pertaining to the control of risks in different industrial and civil domains, a general increase in safety
knowledge has been noted and recognized in the literature (e.g. Lee & Harrison, 2000). Indeed, various
campaigns to raise the awareness of the general public have been developed and implemented. However,
although numerous studies have measured safety culture in occupational domains (Choudhry et al.,
2007), a measure of safety culture in civil society is far from being defined. The present paper describes
an attempt to measure public safety knowledge: a simple methodology based on gaming (a safety
quiz linked to a game of darts) was developed to collect data on safety knowledge in the population,
both for children and for adults. Different sets of questions were established for children (aged from 4 to
15) and adults. The quiz was offered during the European Researchers’ Night in 2015 and 2016 in Turin
(Italy), collecting about 250 replies. The paper presents the analysis of the data collected, together
with some observations both on the diffusion of safety culture in the general public and on possible
improvements to the data collection approach.


1 INTRODUCTION

Today, occupational accidents are an economic and
social problem (Hamalainen et al., 2006). For example,
in 2004 the cost of occupational accidents in the
European Union was around 35,000,000,000 €, with
3.2% of workers involved (Battaglia et al., 2014).
For this reason, since 1989 the EU has promoted
policies to reduce occupational accidents, such as
Council Directive 89/391/EEC (1989).

Between 1994 and 2012, ESAW data (European
Statistics on Accidents at Work) showed a decreasing
trend in occupational accidents; however, in 2013
Eurostat (2013) registered 3647 fatal accidents and
3,100,000 non-fatal accidents, values that cannot be
considered negligible. Lehtola et al. (2008) explained
the decreasing trend as the product of multiple
factors, such as regulatory change, demographic
change (Morse et al., 2009) or the deindustrialization
process (Loomis et al., 2004). The data collected in
Italy confirm the European trend (Figure 1). In Italy,
the decreasing trend started around 50 years ago, and
the occurrence of accidents proved to be strictly
correlated with the socio-economic evolution of the
country (Comberti et al. 2017).

Figure 1. Non-fatal occupational accidents in Italy
1951-2005.

The numbers of domestic and non-work related
accidents are more difficult to define; however,
INAIL (the Italian institution for insurance against
accidents at work) estimates the Italian domestic
accidents at around 3,000,000 (Ferrante et al. 2012),
with a stable trend in the decade 1998-2008
(Figure 2). This value highlights the socio-economic
importance of domestic accidents, especially
considering that they can also involve children.

Figure 2. Domestic accidents in Italy.

Several accidents, both occupational and domestic,
are caused by a lack of knowledge of basic safety
rules. For occupational safety, the Italian Legislative
Decree 81/08 required employers to develop
mandatory safety training sessions for the workers.
At the end of each safety training session, the
workers’ acquired knowledge is tested. However,
since the test is not repeated after some time, it is
not possible to verify whether the workers retain,
apply and consolidate the lessons learned.

Literature provides several studies on the safety 
knowledge (Choudhry et al., 2007) in specific work- 
ing sectors such as: Australian manufacturing and 
mining sector (Griffin & Neal, 2000); Iranian meat 
processing (Ansari-Lari et al., 2010); medical sector 
(Kolade 2002); nuclear power plants (Lee & Harrison,
2000), etc. However, few studies are available on
safety knowledge among the general population; for
example, Kennedy et al. (2005) analyzed consumers’
safety knowledge about food management.

In order to increase the safety knowledge and 
safety culture of the population, it is fundamental 
to start teaching the basis of the safety rules to chil- 
dren. For this reason, some tools were promoted 
and developed by EU and the member states, such 
as “Napo for teachers” (napofilm.net, 2017), or 
in Italy, the projects “Scuola e sicurezza — School 
and safety” (Dettoni, 2011), applied in Piedmont 
region, and “A scuola di sicurezza! — At safety 
school!” (Bove et al., 2002), applied in Lombardy 
region. Testing the safety knowledge acquired by 
the children can provide interesting information 
on the proper functioning of these programs. 

This paper presents data on the safety knowledge
of a sample of the Turin population: both adults
and children were tested through participation in
a simple game, in order to verify the continuity
and solidity of the safety information acquired at
school, at work, or through the media.

The data were collected during the “European
Researchers’ Night” in Turin, in 2015 and 2016;
100 stands and more than 1000 researchers were
present (CLOSER, 2016). This choice made it possible
to attract and test a large number of people,
representing a wide range of ages, genders and
working conditions. In the end, 115 adults (between
10 and 73 years old) and 132 children (between 5 and
15 years old) participated in the safety game,
providing meaningful indications on the safety
knowledge levels of civil society.
2 MATERIAL AND METHODS


The data collection was carried out through a
game, developed and then run by the researchers
of the SAFER team of Politecnico di Torino
(Centro Studi su Sicurezza Affidabilita e Rischi),
led by prof. Micaela Demichela. The game was
part of a stand entirely dedicated to safety, aimed
at disseminating safety rules and showing the
results of research in the safety field. A gaming
approach was used to attract more participants to
the data collection. During the game, each
participant was asked 3 safety questions, chosen at
random, and the answers were used to test safety
knowledge in the population.


2.1 Gaming 


The data collection was structured as a game with
the aim of attracting more participants, stimulating
their competitiveness and their willingness to get
involved. The game consisted of a simple round of
darts; depending on the score made on the
dartboard, the participant was asked questions
related to safety, with an increasing level of
difficulty.

The research team prepared 120 questions in 5
categories of difficulty, but only 3 questions,
chosen at random, were asked to each participant.
The participants could play alone or compete in a
sort of “race” to see who could give the most right
answers.

The questions tested the knowledge of the
participants; the collected answers constituted a
useful sample of data on the diffusion and level
of safety awareness in the population. In
particular, the numbers of correct and incorrect
answers were analyzed. The safety game and the
related data collection also gave the researchers
the opportunity to disseminate safety information,
in particular after incorrect answers.
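
The random draw described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used at the stand: the question IDs, the even split of the 120 questions over the 5 difficulty levels, the 0-60 score range and the rule that a higher dart score selects a harder level are all hypothetical.

```python
import random


def build_pool(n_questions=120, n_levels=5):
    """Return {level: [question_id, ...]} with IDs spread evenly over levels.

    The IDs ("Q001", ...) and the even split are assumptions for illustration.
    """
    pool = {level: [] for level in range(1, n_levels + 1)}
    for qid in range(1, n_questions + 1):
        level = (qid - 1) % n_levels + 1
        pool[level].append(f"Q{qid:03d}")
    return pool


def level_from_score(score, max_score=60, n_levels=5):
    """Map a dart score to a difficulty level (assumed rule: higher is harder)."""
    score = max(0, min(score, max_score))  # clamp out-of-range scores
    return min(n_levels, 1 + score * n_levels // (max_score + 1))


def draw_questions(pool, score, k=3, rng=None):
    """Draw k distinct questions at random from the level chosen by the score."""
    rng = rng or random.Random()
    return rng.sample(pool[level_from_score(score)], k)


if __name__ == "__main__":
    pool = build_pool()
    print(draw_questions(pool, score=35))
```

Sampling without replacement guarantees that a participant is never asked the same question twice; recording the drawn IDs next to the answers matches the anonymous form described in Section 2.2.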


2.2 Questions 


The questions were multiple-choice: 3 possible
answers were provided, only one of which was
correct. 2 groups of questions were prepared on the
basis of the age of the participants:


• Junior (less than 15 years old);
• Senior (over 10 years old).


The 2 categories share a common age range,
between 10 and 15 years old. This is because the
questions in the “Junior” category were designed
for younger children, centered on primary school
age, while the questions in the “Senior” category
were more work-oriented (even if not only regarding
safety in the workplace). Therefore, participants
between 10 and 15 years old could find the “Junior”
questions too easy and the “Senior” questions too
difficult. In the end, it was decided to let the
children choose, before starting the game, whether
to answer “Senior” or “Junior” questions. Many
children decided to answer “Senior” questions,
demonstrating that they had adequate knowledge to
deal with them correctly. For each participant, the
following were collected:


• General data for statistical purposes (gender, age);
• ID code of the question asked;
• Answer given.


The data were collected anonymously on a
dedicated form.


2.2.1 Senior questions
Around 70 questions were developed for the “Senior”
category, covering different safety fields, such as:


• Safety rules in the workspace;
• Personal Protective Equipment and its use;
• Properties of chemical substances and their
  classification;
• Safety signals and symbols;
• Other safety fields.


An example of a “Senior” question is: “When you
work on a computer, how far from you is the
monitor?


A. 35-50 cm;
B. 25-40 cm;
C. 50-70 cm.”


2.2.2 Junior questions
Around 40 questions were developed for the “Junior”
category, to verify the awareness of the children in
relation to:


• Safety rules at school;
• Road safety;
• Safety during free time (sport, games, ...);
• Rudiments of emergency management (emergency
  number, behavior in case of earthquake, ...);
• Other safety fields.


The questions present simple scenarios and the
children have to choose a possible solution. An
example is: “You are at a friend’s house and you
are playing together. Since you are thirsty, you go
to the kitchen and you find a bottle without labels.
What are you going to do?


A. Drink;
B. Open the bottle, sniff and then drink;
C. Drink from the tap.”


3 RESULTS 


247 people from 4 years old to 73 years old were
involved in the data collection between 2015 and
2016; 115 people were tested in the “Senior” category,
while 132 children were tested in the “Junior” one.

The following tables and figures show the gender
and age distributions for the “Senior” category
(Table 1 and Figure 3) and the “Junior” category
(Table 2 and Table 3). Among the adults, the
majority of participants were younger than 25
years old and 55% were male. Among the children,
gender was equally distributed, and the majority
were between 7 and 12 years old.


Table 1. Gender distribution in “Senior” category.

         Participants
         Total   Male   Female
2015       40     22      18
2016       75     41      34
Total     115     63      52


Figure 3. Age distribution in the “Senior” category.


Table 2. Gender distribution in “Junior” category.

         Participants
         Total   Male   Female
2015       74     36      38
2016       58     30      28
Total     132     66      66


Table 3. Age distribution in “Junior” category.

         Participants
         age≤6   7≤age≤9   10≤age≤12   13≤age≤15
2015       8       24         40           2
2016       5       20         26           7
Total     13       44         66           9


3.1.1 “Senior” category results
In the “Senior” category, 61.3% of the answers
provided by the participants were correct. The
overall share of correct answers was almost constant
between 2015 (61.9%) and 2016 (61.2%), and there is
apparently no difference between males and females,
as shown by the last bars of Figure 4. However,
in 2015 women gave more correct answers than
men, while in 2016 the opposite occurred.
Figure 5 shows the percentage of correct answers
by age of the participants: the fraction of correct
answers is higher (around or over 60%) for
participants younger than 45 years old. A negative
peak is registered after 45 years old, and then the
share of correct answers increases again. For some
age groups, the uncertainty band is wider: given
the small number of participants, wrong answers
have a large impact on the results. Conversely,
when the number of participants is high in both of
the years analyzed, the uncertainty is lower.
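
The effect of group size on the uncertainty can be illustrated with a confidence interval for a proportion. The paper does not state how the uncertainty boundaries in Figure 5 were computed; the sketch below assumes a Wilson score interval, and the counts are invented for illustration only.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Hypothetical age groups with the same share of correct answers (2/3)
    # but very different numbers of collected answers.
    for label, correct, total in [("small group", 6, 9), ("large group", 120, 180)]:
        low, high = wilson_interval(correct, total)
        print(f"{label}: {correct}/{total} correct, interval {low:.2f} to {high:.2f}")
```

With the same observed fraction, the interval for the small group is much wider, which is the behaviour described for the sparsely populated age classes.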


3.1.2 “Junior” category results

In the “Junior” category, 70.3% of the answers
provided by the children were correct, but the
percentage of correct answers was higher in
2015 (72.9%) than in 2016 (67.1%).

Regarding gender differences, girls gave more
correct answers than boys, a trend particularly
evident in 2015 (Figure 6).

Regarding age, Figure 7 shows that the share of
correct answers grows with the age of the
participants.


Figure 4. Correct answers [%] in the “Senior” category by gender (male, female, total) and year (2015, 2016, 2015-2016).

Figure 5. Trend of correct answers [%] by age of participants, in the “Senior” category.

Figure 6. Correct answers in the “Junior” category by gender.

Figure 7. Trend of correct answers by age of participants, in the “Junior” category.


4 DISCUSSION 


On the basis of the data collected and presented
in the previous paragraphs, some considerations
can be made.

For the adults, the mean values obtained from
the test indicate that the population knows the
basics of safety; however, the level needs to be
increased to reach a sufficient quality and diffusion
of knowledge.

The data show no gender gap in safety
knowledge, while the age trend gives some
encouraging signals for the future. In fact, the
younger participants in the “Senior” category gave
more correct answers than the older ones. This
result probably reflects the importance of the many
safety awareness activities conducted over the last
20 years during the school period (primary,
secondary and high school). By contrast, the worse
results for the older participants point to a
resistance to learning and adapting to safety rules,
a behaviour that is still quite common in Italian
culture.

In the “Junior” category, the mean value of
safety knowledge is higher than for “Senior”,
reaching values in line with those of the younger
“Senior” participants. This good result probably
reflects the effect of the safety awareness
campaigns carried out during school age; it is an
encouraging signal of a possible radical change in
the culture of workers and citizens, although this
trend clearly has to be monitored in the future. For
the moment, the answers given by the younger
“Senior” participants seem to confirm this increased
sensitivity of the younger generations to the
themes of safety culture.

Finally, the last interesting finding that emerged
from the analysis is related to gender: “Junior”
females showed more in-depth safety knowledge than
males. Further investigations should be carried out
to confirm this trend; if confirmed, it could be the
basis for a multidisciplinary investigation into
possible cultural differences in learning mechanisms.


5 CONCLUSION 


In this paper, a method for collecting data on the
safety knowledge of the population is presented. A
gaming approach was adopted to attract and involve
more participants: series of multiple-choice
questions were developed and presented to the users
as a quiz. The game was run during the “European
Researchers’ Night” in Turin in 2015 and 2016,
with different questions for adults and children.

The data collected showed that the adults
have a fair level of safety knowledge, and
that younger participants have more in-depth safety
knowledge than the older ones. Children
demonstrated a good level of safety knowledge.

The data collected indicate that the safety-related
projects applied in schools are producing good
results in the population, but it is fundamental
to continue training both young and older people.
For the latter in particular, it is essential to close
the knowledge gap for people between 50 and 60
years old; these persons may hold influential
positions within their companies, so their awareness
and knowledge of safety themes have a wider
influence than might be expected.
ABSTRACT: The aquaculture industry is economically important in Norway, and the production is
expected to increase in the future. Employees at the fish farms face a high risk of accidents compared to 
employees in other industries and the focus on safety from both industry and researchers has increased 
during the last decade. Adding to the knowledge on safety in aquaculture, the objective of this paper is to 
study employees’ perception of safety climate, and whether aspects related to safety climate may predict 
employees’ compliance. Findings from two surveys aimed at managers and employees in different compa- 
nies are analysed. The first is a telephone survey targeting employees and managers at the fish farms. The 
second is a web-based survey involving the onshore management level. The results show that employees at 
all levels have a positive perception of the safety climate, but they also illustrate challenges related to work 
pressure, maintenance and employee participation. Furthermore, the analysis shows that work pressure 
affects compliance negatively while participation and competence have positive associations with compli- 
ance. These results give input to some practical measures for safety management in the industry. 


1 INTRODUCTION

Norway is the second largest exporter of fish
worldwide, and the largest producer of finfish
(FAO, 2016). Since the 1970s, aquaculture, and
fish farming in particular, has become a significant
contributor to the national value creation. In 2016,
the total production in the aquaculture industry
was 1.3 million metric tons, equal to a value of 65
billion NOK (Norwegian Directorate of Fisheries,
2017). 93% of this was Atlantic salmon.

There are both large and small companies in the
industry, and the structure of management levels
depends on the size. Smaller companies may have
personnel that serve different roles, combining the
responsibility for safety with other areas such as
quality management. Larger companies may have
dedicated management working solely with health
and safety. The lowest management level is the
operational managers, who are responsible for
biological production and personnel at one or two
fish farms, typically manned by three to six
employees (fish farmers).

There are typically six to 12 circular plastic
collar net cages in one fish farm (Jensen et al., 2010,
Holen et al., 2017a). The fish farm also has a
feeding barge for equipment and feed storage, the
feeding system, as well as manager offices, meeting
rooms and accommodation for shift workers.

Fish farmers are required to perform daily
inspections to assess fish welfare and document that
the net cages are in order. Fish farmers often use
boats and cranes in their work, but increasingly rely
on specialized service vessel crews to perform tasks
such as mooring operations and delousing. The
safety for fish, material assets and personnel is
regulated by an extensive set of statutory
requirements (Holmen et al., 2017b), which are
audited by five different regulatory authorities
(Holmen et al., 2017c).

The aquaculture production has the potential to
increase in the future (Olafsen et al., 2012).
Technology innovations aim to enable production at
areas more exposed with respect to climate, wind
and currents. This raises new challenges when
it comes to fish welfare and operational safety
(Bjelland et al., 2015).

The attention to occupational and operational
safety in the aquaculture industry has increased
during the last decade. Recent studies indicate
that safety and risk management systems need to
be improved to maintain a sustainable food
production and ensure a safe work environment at
harsher locations (Storkersen, 2012, Holmen et al.,
2017b, Holmen et al., 2017c).

Being a fish farmer is the second most
risk-exposed occupation in Norway, after being a
fisherman, according to the rate of fatal accidents
(McGuinness et al., 2013, Holen et al., 2017b).
Fall, blow by object, entanglement/crush and 
cuts are the most common modes of injuries in 
the Norwegian aquaculture industry (Holen et al., 
2017a). In a recent study among aquaculture 
workers, three out of four respondents reported 
to have knowledge of near accidents (Thorvald- 
sen et al., 2017). Organisational aspects and safety 
indicators have been the topic in several studies 
(Fenstad et al., 2009, Storkersen, 2012, Thorvald- 
sen et al., 2015, Holmen et al., 2017b). Looking 
back, Allred et al. (2005) found that many compa- 
nies had implemented some systematic HSE work, 
although production often was given priority. 

In this article we will explore the following prob- 
lems: 1. How do fish farmers, operational manag- 
ers and onshore management/staff perceive the 
safety climate? 2. To what extent can safety climate 
predict safety compliance? 


1.1 Safety climate 


Safety culture has been a defined area of research
since the late 1970s, and there is still much research
activity related to the issue. Up until 2015, 1789
research publications related to safety culture had
been published (van Nunen et al., 2017). In the
beginning it was applied as a construct related to
causal analyses of major accidents, with Barry
Turner’s (1978) contribution as an important
starting point. Later, the research interests spread
over different topics, and involved different
disciplines, including anthropology, psychology,
sociology and the engineering sciences. Today it is
considered a multi-dimensional and cross-disciplinary
field of research (van Nunen et al., 2017).

The contributions from psychology are in
particular related to safety climate, a construct which
is in many instances used as synonymous to safety 
culture and defined in similar ways (Guldenmund, 
2007). A much used definition of safety climate is 
provided by Zohar (2003), who sees it as shared 
perceptions about safety policies, procedures and 
practices in a work community. Questionnaire 
surveys are commonly used to measure safety cli- 
mate in a work community, involving assessments 
of topics such as the safety system, work pressure, 
the safety competence and leadership/supervision 
(Flin et al., 2000). 

Much research on safety climate has been related 
to identifying causal links between safety climate 
as an independent variable and different safety
outcomes, summed up in review studies (Clarke, 
2006, Christian et al., 2009). This includes explo- 
ration of safety climate as an indicator (Kongsvik 
et al., 2010). The review studies also point to many 
studies that find a positive relationship between 
safety climate and safety compliance (adherence to 
safety instructions, rules, and procedures). 

A general finding in the research is that the safety 
climate in a work community is associated with the 
work practices in the same community; a positive 
safety climate is related to compliance and partici- 
pation in safety-promoting activities, and also mind- 
ful safety practices (Dahl and Kongsvik, 2018). 


2 METHOD 


2.1 The surveys 


The results are based on two different surveys, that 
were later combined. The first included fish farm- 
ers, operational managers and service vessel crew 
members as well as some employees in other posi- 
tions. Representatives at the management level in a 
selection of 40 companies were informed about the 
aim of the survey, and invited to share employees’ 
phone numbers. A professional polling company 
conducted the survey, and a total of 447 out of 735 
employees participated. Here, answers from 258 fish 
farmers and 110 operational managers are used. 

The second survey included onshore management 
and staff. Companies were contacted and asked to 
provide e-mail addresses to their employees who 
then got an e-mail linking to the digital survey. Some 
companies distributed the link to their employees 
themselves, so the response rate cannot be estimated. 
A total of 135 persons responded. Here, the net sam- 
ple includes 92 onshore managers or staff. 

The questionnaires were developed on the basis of
earlier surveys, but were tailored to the aquaculture
industry. In both surveys, the respondents were
asked to state their agreement with different
statements related to safety climate, on a 5-point
Likert scale ranging from totally disagree to totally
agree.


2.2 Analyses 


Answers from both surveys were extracted and
combined in one data file. Some items had divergent
wording, reflecting the respondents’ position
(operational personnel or onshore management).

The items common for the two samples were 
thematically sorted into four categories: Work 
pressure, Participation, Competence and resources 
and Compliance. The first three categories are 
directly related to the safety climate construct, 
while Compliance is related to safety practices and 
whether employees live up to requirements given 
by the companies. 


The statistical analysis aimed at comparing the 
perceptions of three groups, involving comparing 
means related to the responses. One-way ANOVA 
was applied for comparing the means. The limit for 
statistical significance (P-value) was set to 1%. 
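
As an illustration of the comparison of means, the F statistic of a one-way ANOVA can be computed directly from the group scores. The sketch below is not the authors' analysis script: the function name and the Likert responses are hypothetical, and the P-value (which requires the F distribution, available for instance through `scipy.stats.f_oneway`) is left out to keep the example free of third-party dependencies.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more scores than groups")
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the individual scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical 1-5 Likert answers to one item for the three groups.
    fish_farmers = [2, 1, 3, 2, 2, 1]
    operational_managers = [2, 3, 2, 2, 3]
    onshore_management = [3, 2, 3, 4, 3]
    print(one_way_anova_f([fish_farmers, operational_managers, onshore_management]))
```

A large F relative to the critical value for the given degrees of freedom indicates that at least one group mean differs; with the 1% limit used here, the corresponding P-value would have to fall below 0.01.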

We performed a multiple regression analysis to
explore if compliance could be predicted by the
safety climate factors. The analysis was restricted
to the fish farmers (n = 258) to ensure homogeneity
in the work situation for those included. A
Compliance scale was constructed by combining three
items: 1. I use the required protective equipment.
2. If I see dangerous situations at work, I report
them. 3. Safety has first priority when I do my job.
The Cronbach’s alpha for the scale was .71. The
Work pressure scale consisted of three items: 1.
Sometimes I feel a pressure to continue working,
although safety can be compromised. 2. In practice,
consideration to production is prioritized at
the expense of safety. 3. Inadequate maintenance
has reduced the safety level. The Cronbach’s alpha
for the scale was .67. The Participation scale also
included three items: 1. I participate in making new
procedures. 2. I get involved when new procedures
are to be introduced. 3. My manager appreciates
that the employees take up safety issues. The
Cronbach’s alpha for the scale was .70. The
Competence scale included two items: 1. I have the
necessary competence to handle my work tasks
safely. 2. I have received the necessary training for
handling critical or dangerous situations. The
Cronbach’s alpha for the scale was .60.
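
For readers unfamiliar with the reliability measure reported above, Cronbach's alpha for a scale of k items is k/(k-1) times one minus the ratio of the summed item variances to the variance of the total score. The sketch below is a minimal illustration using only the Python standard library; the function name and the response data are hypothetical and are not taken from the surveys.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    items: list of k lists; each inner list holds one item's scores,
    in the same respondent order across items.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    # Total scale score per respondent.
    totals = [sum(scores) for scores in zip(*items)]
    sum_item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))


if __name__ == "__main__":
    # Hypothetical 1-5 Likert answers from five respondents to three items.
    item_1 = [4, 5, 3, 4, 2]
    item_2 = [4, 4, 3, 5, 2]
    item_3 = [5, 5, 2, 4, 3]
    print(round(cronbach_alpha([item_1, item_2, item_3]), 2))
```

Alpha increases when the items vary together across respondents, which is why it is commonly read as a measure of the internal consistency of a scale.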

The independent variables in the regression
analysis were introduced by forced entry. Missing
values were excluded listwise. The variance
inflation factor (VIF) varied between 1.141 and
1.276, and tolerance values varied between 0.783
and 0.876. This gives no indication of
multicollinearity. To check the assumption of
independent errors, the Durbin-Watson test was
performed. With a test statistic of 2.063, the test
gave no cause for concern regarding autocorrelation
(Field, 2005).
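
The two diagnostics can be made concrete with a short sketch. Tolerance is the reciprocal of the VIF, which is consistent with the reported ranges, and the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals, with values near 2 indicating no first-order autocorrelation. The function names and the residuals below are hypothetical; the actual regression residuals are not available in the paper.

```python
def tolerance_from_vif(vif):
    """Tolerance is the reciprocal of the variance inflation factor."""
    if vif <= 0:
        raise ValueError("VIF must be positive")
    return 1.0 / vif


def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of regression residuals."""
    denominator = sum(e ** 2 for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    return numerator / denominator


if __name__ == "__main__":
    # The VIF range reported in the text; 1/VIF approximately reproduces
    # the reported tolerance range.
    for vif in (1.141, 1.276):
        print(vif, round(tolerance_from_vif(vif), 3))
    # Hypothetical residuals, for illustration only.
    print(durbin_watson([0.4, -0.3, 0.1, -0.2, 0.5, -0.4, 0.2]))
```

The statistic ranges from 0 (strong positive autocorrelation) to 4 (strong negative autocorrelation), so the reported value of 2.063 sits close to the neutral midpoint.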


3 RESULTS 


Here, the mean values reported by the respondents 
of the surveys are compared. Results are presented 
by the three following groups: onshore manage- 
ment (M), operational managers (OM) and fish 
farmers (F). 


3.1 Perceived safety climate 


The safety climate can be expressed by the survey
results about work pressure, participation,
competence and resources, and compliance.


3.1.1 Work pressure
Table 1 reports the results from three items
considering work pressure.

The first item is phrased differently for operational 
and onshore personnel. Yet, all groups somewhat 
disagree that production pressure makes operational 
personnel break safety rules and continue unsafe 
work. Onshore management agree more than opera- 
tional personnel that employees will compromise on 
safety because of production pressure. 

On the next item, this difference is partly
reversed. On average, fish farmers neither agree nor
disagree that production is prioritized over safety,
while both types of managers disagree more. Still,
the analysis shows that 22.9% of the fish farmers and
13.6% of the operational managers agree or totally
agree that consideration to production is
prioritized at the expense of safety.

All three groups’ mean values show they some- 
what disagree that inadequate maintenance has 
reduced the safety level. 19.8% of the fish farmers 
and 18.2% of the operational managers agree or 
totally agree that safety has been reduced due to 
inadequate maintenance. 


3.1.2 Participation 
Table 2 includes three items about employee 
participation. 

All groups agree that managers appreciate
employees’ safety engagement. Onshore managers
report appreciating that employees take up safety
issues to a greater degree than the employees
themselves perceive.


Table 1. Perceptions of Work pressure: means on a
scale from 1 (totally disagree) to 5 (totally agree).

Items                               Groups                 Mean   P-value

F/OM: Sometimes I feel a pressure   Fish farmers           2.02   0.000
to continue working, although       Operational managers   2.16
safety can be compromised           Onshore management     2.63
M: Owing to the company’s
production demands, the employees
sometimes have to break the
safety rules

In practice, consideration to       Fish farmers           2.56   0.001
production is prioritized at the    Operational managers   2.09
expense of safety                   Onshore management     2.19

Inadequate maintenance has          Fish farmers           2.40   0.609 (NS)
reduced the safety level            Operational managers   2.30
                                    Onshore management     2.47


Table 2. Perceptions of Participation: means on a scale
from 1 (totally disagree) to 5 (totally agree).

Items                               Groups                 Mean   P-value

F/OM: My manager appreciates        Fish farmers           4.20   0.000
that the employees take up          Operational managers   4.32
safety issues                       Onshore management     4.92
M: As manager, I appreciate that
the employees take up safety
issues

F/OM: I participate in making       Fish farmers           2.77   0.000
new procedures                      Operational managers   3.54
M: The employees participate in     Onshore management     3.61
making new procedures

F/OM: I get involved when new       Fish farmers           3.35   0.006
procedures are to be introduced     Operational managers   3.73
M: The employees get involved       Onshore management     3.76
when new procedures are to be
introduced


The mean results about employees’ participation 
in development of new procedures lie around neither/nor. 
Fish farmers lean towards disagreement, 
while both management groups agree somewhat 
that employees (including the operational managers 
themselves) participate in making procedures. 

Management agree that employees get involved 
in introduction of procedures. Analysis shows that 
24.4% of the fish farmers disagree or totally disagree 
that they are involved when new procedures 
are introduced, compared to 19.1% of the operational 
managers. 


3.1.3. Competence and resources 
The three items considering Competence and 
resources are presented in Table 3. 

All groups agree that the operational personnel 
have the competence to work safely and that manning 
is sufficient for safe work. Still, 12.4% of the 
fish farmers and 12.7% of the operational managers 
disagree or totally disagree that manning 
is sufficient. 

There are larger differences regarding learning 
from unwanted events. Fish farmers and opera- 
tional managers agree that information regarding 
unwanted events is utilized to prevent recurrence, 
while onshore management state neither/nor. 


3.1.4 Compliance 

On average, all respondent groups agree with 
the three items about compliance, but there are 
some differences (Table 4). 

Table 3. Perceptions of Competence and resources: means on a scale from 1 (totally disagree) to 5 (totally agree). 

Items                                       Groups                    Mean   P-value
F/OM: I have the necessary competence to    Fish farmers              4.49   0.578 (NS)
handle my work tasks safely                 Operational managers      4.56
M: Our employees have the necessary         Onshore management        4.47
competence to handle their work tasks
safely

The manning is sufficient to maintain the   Fish farmers              3.70   0.235 (NS)
safety                                      Operational managers      3.62
                                            Onshore management/staff  3.87

F/OM: Information regarding unwanted        Fish farmers              4.11   0.000
events is utilized adequately to prevent    Operational managers      4.13
recurrence                                  Onshore management        2.91
M: We utilize the information from
reported unwanted events sufficiently in
the preventive work


The analysis displays no significant differences 
regarding perceptions of reporting of dangerous 
situations. However, 31.4% of the fish farmers 
answered that they agreed or totally agreed with the 
statement ‘I think it is uncomfortable to point out lack 
of compliance to safety rules and procedures’, as did 16.4% 
of the operational managers. Onshore 
management were not asked about this aspect, as it 
relates to the operational context and the everyday 
interaction between workers at the fish farms. 

Regarding use of protective equipment, operational 
personnel totally agree that they use it, while 
onshore staff only agree. 

Operational personnel agree (4.30), while 
onshore management almost totally agree (4.71) 
that safety has the first priority. 


3.2 Safety climate’s relation to compliance 


A linear regression analysis restricted to the fish 
farmers in the sample was performed. Work pres- 
sure, Participation and Competence were used to 
predict Compliance. 

Work pressure, Participation and Competence 
explained 30.4% of the variance in Compliance 
(p < 0.001). Each of the predictor variables 
made a significant contribution to the model. 
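
As a rough plausibility check on the statement that each predictor contributes significantly, the ratio B / SE B can be computed from the coefficients reported in Table 5. The sketch below is illustrative only: the critical value of about 1.97 (two-sided, 5% level, roughly 249 degrees of freedom) is an assumption of this example and is not reported in the study.

```python
# Unstandardised coefficients (B) and standard errors (SE B) as reported in Table 5.
coefficients = {
    "Work pressure": (-0.150, 0.044),
    "Participation": (0.112, 0.043),
    "Competence": (0.366, 0.054),
}

# Assumed two-sided 5% critical value of t for N = 253 and three predictors.
# This threshold is an assumption of the sketch, not a figure from the paper.
CRITICAL_T = 1.97


def t_ratio(b, se):
    """Return the t statistic B / SE B for one predictor."""
    return b / se


t_values = {name: t_ratio(b, se) for name, (b, se) in coefficients.items()}
significant = {name: abs(t) > CRITICAL_T for name, t in t_values.items()}

for name, t in t_values.items():
    print(f"{name}: t = {t:.2f}, exceeds critical value: {significant[name]}")
```

All three ratios exceed the assumed threshold in absolute value, which is consistent with the reported result; Competence has the largest ratio.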


Table 4. Perceptions of Compliance: means on a scale from 1 (totally disagree) to 5 (totally agree). 

Items                                       Groups                    Mean   P-value
F/OM: If I see dangerous situations at      Fish farmers              4.38   0.209 (NS)
work, I report them                         Operational managers      4.54
M: The employees use the reporting system   Onshore management/staff  4.32
adequately when it comes to personal
injuries and other serious events

F/OM: I use the required protective         Fish farmers              4.55   0.000
equipment                                   Operational managers      4.72
M: Our employees always use the required    Onshore management        4.20
protective equipment

Safety has first priority when I do my      Fish farmers              4.30   0.000
job                                         Operational managers      4.30
                                            Onshore management        4.71

Table 5. Linear regression analysis predicting ‘Compliance’: Beta-values (B), standard errors (SE B), standardised betas (β) and explained variance (Adjusted R²) (N = 253). 

                  B        SE B     β        Adjusted R²
Constant          2.086    0.301             0.304
Work pressure    -0.150    0.044   -0.197
Participation     0.112    0.043    0.156
Competence        0.366    0.054    0.381


Work pressure had a negative association with 
Compliance, while Participation and Competence 
had a positive association. High perceived 
work pressure was thus associated with lower 
Compliance, while higher degrees of Participation 
and Competence were associated with higher 
Compliance. 
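
To illustrate how the coefficients in Table 5 combine, the following minimal sketch evaluates the fitted regression equation for two hypothetical fish farmers. It assumes that each scale score is a mean on the same 1 to 5 response scale used in the tables; the respondent values are invented for illustration and are not data from the survey.

```python
# Unstandardised coefficients (B) as reported in Table 5.
INTERCEPT = 2.086
B_WORK_PRESSURE = -0.150
B_PARTICIPATION = 0.112
B_COMPETENCE = 0.366


def predict_compliance(work_pressure, participation, competence):
    """Evaluate the fitted linear model for one respondent's scale scores."""
    return (
        INTERCEPT
        + B_WORK_PRESSURE * work_pressure
        + B_PARTICIPATION * participation
        + B_COMPETENCE * competence
    )


# Hypothetical respondents (not survey data); scores assumed to lie between 1 and 5.
neutral = predict_compliance(work_pressure=3, participation=3, competence=3)
favourable = predict_compliance(work_pressure=2, participation=4, competence=5)

print(f"Neutral on all scales: {neutral:.2f}")     # 3.07
print(f"Favourable climate:    {favourable:.2f}")  # 4.06
```

With about 30% of the variance explained, such predictions describe group-level tendencies rather than the behaviour of any individual.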


4 DISCUSSION 


Considering the results, variations in perceptions 
of safety climate related to the three different 
groups and company levels are discussed. 


4.1 Regarding work pressure 


All groups recognize some pressure to continue 
unsafe operations or violate rules and prioritize 
production over safety (first two items, Table 1). 
Almost one quarter of the fish farmers agree that production 
sometimes trumps safety. Unsurprisingly, 
managers disagree more as they often have a more 
positive (Fenstad et al., 2009) or less realistic view 
of operations (Hale and Borys, 2013b, Hollnagel, 
2011). When considering if production pressure will 
make employees compromise on safety, managers 
agree more than operational personnel do. One 
explanation may be that management have HSE 
as their area of expertise, and might have learned, 
in practice and theoretically, that operational personnel 
will feel pressured to work efficiently rather 
than thoroughly (Hollnagel, 2009) or prioritize production 
over protection (Reason, 1997) with a 
drift towards unsafe performance (Dekker, 2011, 
Vaughan, 1997, Rasmussen, 1997). 

This resonates with earlier findings in Norwegian 
aquaculture. In Allred et al. (2005), 21% 
agreed that they were pressured to work in a way 
that could threaten safety. In addition, 27% agreed 
that the operational manager did not have the time 
to sufficiently manage employees’ HSE, meaning 
that they focused on production to make ends 
meet (ibid.). Priority of the biological product, the 
fish, is also qualitatively described (Fenstad et al., 
2009, Størkersen, 2012). 

A more recent study found that work pressure 
may be caused by poor planning of operations, 
insufficient staffing, time pressure and working 
long hours (Thorvaldsen et al. 2015). And even 
though their personal safety may be threatened, 
employees are very conscious about following up 
on their work responsibilities. 

An increased efficiency pressure coincides with 
the statements about maintenance (Table 1). About 
one fifth of the fish farmers and operational managers think 
that inadequate maintenance has reduced the 
safety level. Maintenance on existing equipment 
might be postponed to times with less activity and 
potential earnings. It is shown that coastal vessels 
running for the prosperous aquaculture industry 
have tighter schedules and more stress than vessels 
operating in slower markets (Størkersen, 2017). 

Economic priorities are relevant here, as they may 
affect maintenance of existing equipment at a given 
fish farm or vessel, but also limit investments in 
new technology (Thorvaldsen et al. 2015). A previous 
study also found that fish farmers experienced 
increased profit as more important than workers’ 
safety (Fenstad et al. 2009). 


4.2 Regarding participation 


Personnel on all levels agree that managers appreciate 
employees’ safety engagement. However, 
managers seem to appreciate employees taking up 
safety issues more than fish farmers think. The 
interaction between company levels as well as differences 
between formal and informal communication 
may be reflected here. Fish farmers may take 
up safety issues with the operational managers, 
who then decide whether and how to address a 
specific issue. In larger companies, formal reporting 
systems are also used. As operational managers 
give a higher score than fish farmers, fish farmers 
may think about day-to-day interaction when they 
disagree with the perception of the operational 
managers, or they may be thinking about the (lack 
of) response they get when they address safety 
issues through formal reporting systems. 

Allred et al. (2005) found that 81% agreed that 
employees could influence the HSE conditions to a 
large degree. Still, a third of the respondents said they 
did not have the chance to participate in HSE strategies. 

When it comes to new procedures, and employees’ 
participation in developing them, the answers 
gave a neither/nor result. Fish farmers disagree 
more than the managers. Here, a likely explanation 
is that the onshore managers involve the 
operational managers and some of the fish farmers 
when procedures are made. Even though not all fish 
farmers are involved in such processes, the onshore 
managers do involve some of the employees. 

The situation seems to be somewhat similar 
when it comes to the introduction of new procedures. 
Almost 25% of the fish farmers and 19% of the operational 
managers disagree that they are involved. As procedures 
are often sent to employees via computer-based systems, this 
answer may indicate that this is not seen as involvement, 
but rather as information. 


4.3 Regarding competence and resources 


Overall, the participants agree that operational 
personnel at the fish farms have the competence 
to work safely. Several companies also provide 
external safety courses for employees, and com- 
panies use internal procedures to document how 
operations should be performed. Experience is 
also highly valued at the fish farms (Holmen et al., 
2017a, Thorvaldsen et al., 2015). 

In 2005, 68% stated that employees got adequate 
safety training, but 41-44% wanted more training 
in sickness and injury preventing work and safety 
routines (Allred et al., 2005). 

When it comes to staffing, about 12% of both 
fish farmers and operational managers disagree that 
manning is sufficient. Safety issues when workload 
is increased without additional personnel have been 
discussed in previous studies in aquaculture (Thorvaldsen 
et al. 2015) and maritime industries (Hetherington 
et al., 2006, Osterman and Hult, 2016). 

Fish farmers and operational managers agree 
that they use information from previous events 
for prevention, but onshore staff answer neither/nor. 
It may be that onshore management see 
more potential for learning and prevention both 
on their part and from the operational personnel. 
Onshore management must follow up on non-compliance 
reports from many fish farms, and 
may lack time and resources to do this optimally. 
The operational personnel, on the other hand, may 
answer more positively based on activities and 
measures on their specific fish farm, and not what 
comes from the onshore management. With more 
formalized systems for reporting, one might think 
that this was an area that had improved a lot during 
the last decade. Looking back, however, 72% 
of the participants in 2005 (Allred et al.) answered 
that information about accidents and unwanted 
events was actively used by the companies. 


4.4 Regarding compliance 


All groups agree they report dangerous situations 
(the first item, Table 4), but a third of the fish farmers 
are also uncomfortable pointing out non-compliance. 
Not every company meets deviance from rules with 
an understanding that most personnel follow rules, 
but that some rules can be difficult to follow because 
they are contradictory to other rules, the context, or 
resources (March, 1994). It may even be necessary 
to break a rule to get the job done (Reason, 1990). 
Thus, the safety literature has emphasized that safety 
is not attained by blind rule-following (Hollnagel 
et al., 2006, Hale and Borys, 2013b). Still, compliance 
might be the safest option if the rules can be followed. 
A literature review of quantitative studies indicates 
“a positive linear relationship between safety com- 
pliance and safety. That is, the more compliance the 
better for the state of safety” (Dahl, 2014: 31). In the 
study of Allred et al. (2005), the findings were at least 
as positive as in the current study: 65% stated that 
employees always reported safety issues and dangerous 
conditions, and ¾ stated that the operational manager 
encouraged employees to report such conditions. 

Both the current and previous studies (Allred 
et al., 2005) find that protective equipment mostly 
is used. 

All groups prioritize safety in most situations; the 
management respondents are the most certain. This is 
related to the statement that the production is prioritized 
over safety in some operations, although 
management disagree that production is prioritized 
over safety (Table 1). Management is commonly 
more loosely coupled to the negotiations in the operations. 
A relevant point here, is that our survey has targeted 
management respondents with health and safety as 
their responsibility. Furthermore, as Allred et al. 
(2005) also discuss, we do not know if it is only the 
most positive representatives working in the most 
HSE focused companies who have answered the sur- 
veys (Hollnagel et al., 2006, Hale and Borys, 2013). 


4.5 Safety climate and compliance 


The regression analysis revealed that safety com- 
pliant behaviour was predicted by safety climate 


measures among fish farmers. In our context, com- 
pliant behaviour involves adherence to safety rules, 
and procedures, such as the use of the required 
protective equipment, reporting of dangerous situ- 
ations if they are observed, and prioritizing safety 
when they do their job. 

Among the three safety climate factors, Compe- 
tence was the most important predictor, followed 
by Work pressure and Participation. Compe- 
tence and Participation were positively related to 
Compliance, while Work pressure was negatively 
related to Compliance. 

Several studies have shown similar results, add- 
ing up to a quite robust relationship between safety 
climate and safety behaviour (Clarke, 2006, Christian 
et al., 2009). In general, this research shows that 
those who perceive safety as valued and prioritized 
in their work community, display a more positive 
safety behaviour, including compliant behaviour, 
than those who perceive safety as less valued (Dahl 
and Kongsvik, 2018). 

This relationship has not been studied in the 
aquaculture industry previously. The results 
indicate that the relationship can be valid also in 
this context. This gives input to some practical 
measures for safety management in the industry. 
The competence scale included items on train- 
ing, including training for emergencies. Providing 
such training might increase compliance through 
increased knowledge of the procedures. Further, 
avoiding work pressure that comes at the expense of 
safety might also increase compliance and reduce 
exhaustion for the individual employees. This may 
be a challenge, as there are some very labour-intensive 
periods related to delousing operations 
etc. Still, organizing these periods so as to avoid long 
hours and heavy workloads might have a positive 
influence on compliance and safety climate. 
Lastly, involving employees in the construction 
of procedures, and having an ‘open door’ policy 
regarding safety issues can also have a positive 
influence on ownership, feeling of involvement 
and compliance. 

Although safety compliance is one important 
aspect, it is also true that procedures and rules 
tend to be underspecified and cannot cover all 
eventualities in complex systems (Hollnagel, 2009). 
On the one hand, the work at fish farms includes 
many routine tasks, where the applicability of clear 
procedures is evident. On the other hand, flexibil- 
ity, situational awareness, practical experience, and 
problem-solving skills are also vital qualities in this 
context (Thorvaldsen et al. 2015). Consequently, 
rules and procedures should be dynamic, and 
involve sharp-end workers in formulating and evaluating 
them (Hale and Borys, 2013a). So even if 
compliance in many instances is a basic foundation 
for many work operations in high-risk industries, 
performance variability is also a valuable asset that 
might increase the resilience in a sociotechnical 
system (Hollnagel, 2009, Haavik et al., 2017). 

The results give grounds for further exploring the 
relationship between safety climate and safety com- 
pliance in the aquaculture industry. Future research 
could include onshore personnel, and suitable meas- 
ures for safety climate and safety behaviour. This 
could broaden the view on how accidents can be 
prevented in the industry. Also, the cause and effect 
relationship between climate and safety outcomes 
can be explored. Other studies indicate that this rela- 
tionship might be reciprocal (Kongsvik et al., 2011). 


5 CONCLUSION 


This article explores perceptions of safety climate 
at different company levels based on two surveys 
amongst employees in the aquaculture industry. 
Overall, perceptions of the safety climate are 
positive at all levels. This may reflect an increased 
focus on workers’ health and safety during the last 
decade. Still, there are challenges related to work 
pressure, maintenance and employee participa- 
tion. While aspects related to compliance such as 
reporting, wearing protective equipment and pri- 
oritizing safety get a high score, many fish farmers 
are uncomfortable with pointing out colleagues’ 
lack of compliance. 

The analyses further reveal that fish farmers’ com- 
pliance to safety requirements is predicted by safety 
climate, and in particular by competence. Training, 
including emergency exercises, will be valuable for 
increased safety. The same goes for reduced work 
pressure. Work pressure relates negatively to com- 
pliance, and almost one quarter of the fish farmers 
agree that production is prioritized over safety. 

Differences between company levels reflect dif- 
ferent points of view and responsibilities within 
the companies. It is important that fish farmers 
are involved when their work procedures are cre- 
ated and introduced. Fish farmers and operational 
managers who work at the sharp end are physically 
closest to the occupational hazards, and rely on the 
onshore management to get the necessary means 
to mitigate the risks. 

Societal threat landscapes of petroleum industry activity in 
the high north 

ABSTRACT: Today, industrial and societal systems interact and become more complex than ever, and 
hidden, dynamic and emerging threats and vulnerabilities evolve. By observing the petroleum activity in 
the north (west Barents Sea) along with other developments and societal trends in the region, it is pos- 
sible to sketch out a threat landscape of the high north. A threat landscape is formed from interconnected 
‘pictures’ of threats and actors, given some external conditions. These conditions are ‘on the move’, which 
requires that the risk pictures as well as the landscapes be maintained and revised. Examples of such 
are effects of declining oil prices, or a changing climate. It is important to assess the validity of any picture 
contributing to the landscape. Threats may lead to ‘spillover effects’ and unexpected events if pictures 
become saturated. Scenarios applicable for stress-tests may be derived from such threat landscapes. The 
severity and urgency of a scenario is strengthened by events occurring simultaneously, or as combined 
events. Risk mitigation should thus be handled more in collaborating teams of interconnected actors 
instead of by single entities themselves. Actors involved in the ‘oil in high north’ threat landscape either 
take part in, support, or are affected by the petroleum activity in some way. Each party should consider 
their situation and role in view of the evolving threat landscape, and look for alternative handling of 
emerging risks. As part of a case work in the New Strains of Society project, key actors have been inter- 
viewed. In addition to the interviews, associated workshops have been held. This paper summarizes the 
foundations of a preliminary threat landscape based on these interviews and workshops, with recommen- 
dations for further elaboration by researchers as well as practitioners. Main challenges associated with 
the aforementioned threat landscape are outlined. The concepts of robust organizations and resilience 
are reflected upon as alternatives to the prevalent failure-oriented safety approaches. Knowledge from the 
current study serves as input to the final New Strains of Society framework. 


1 INTRODUCTION 


A threat could be explained as a possible danger 
that might exploit vulnerability of a system and 
cause possible harm. Threat landscapes then cover 
threats and vulnerabilities of a thematic area that 
involve a network of agents and stakeholders, 
interacting normally or more randomly (Grøtan & 
Antonsen, 2016). The landscape metaphor involves 
an aggregate of the more prevalent notion ‘threat 
pictures’. A ‘threat picture’ is confined and focused 
on a narrowed set of topics, either thematic-centric 
(e.g., oil spill) or actor-centric. A picture comprises 
a frame or a border that represents a clear demarcation 
of the relevant vs. non-relevant issues belonging 
to the threat, including the relevant vulnerability 
and presumptions on operational conditions. Such 
presumptions can, however, be turned the other way 
around. They can be recognized both as demarcation 
lines for the validity of a picture in relation to 
hidden, dynamic and emergent effects, but also as 
a key to understanding when a picture will become 
saturated with unaccounted or unprecedented 
conditions. The overall idea is that as threat pictures 
mature, scenarios for stress testing can be built. 

The finalized project New Strains of Society 
(New Strains) dealt with these matters. Empirical 
studies of societal threats and vulnerabilities were 
focused on gradual developments of threat landscapes 
for thematic areas like the petroleum activity 
in the high north. Information was gathered 
from documentation reviews, followed up with 
guided interviews and workshops. By utilizing a 
joint interview guideline, the representatives from 
involved actors were encouraged to elaborate on threat 
pictures and scenarios that fall outside 
their normal beliefs about system behavior. Okstad 
et al. (2017) described a basis for the design of 
empirical case studies, the results of which should 
support the New Strains framework. 


1.1 Objective 


The main objective of this paper is to summarize 
generic findings gathered from interviews and 
workshops with involved actors concerning threat 
landscapes arising from the thematic area ‘oil in 
high north’, and to organize the findings into a 
rudimentary threat landscape that can be subject 
to elaboration and extension at a later stage, 
involving more actors. 


2 INFORMATION GATHERED 


2.1 Description of case studies 


The case ‘oil in high north’ is about possible implications 
of the offshore petroleum business in the 
high north (Barents Sea). In addition to this case, 
there are two other cases covered in the New Strains 
project: ‘Pandemic’ and ‘ICT infrastructure’. 
The latter is largely about cyber safety, security and 
resilience and is, as such, integrated in the other two 
thematic areas. In fact, it deals with vulnerability 
and risk handling of critical ICT infrastructures, 
which are of the utmost importance in maintaining any societal 
function. 

The ‘oil in high north’ case defines threat land- 
scapes much as societal impact of the petroleum 
activity and its presence in the region. Examples 
are environmental concerns, access to limited 
services in emergencies (e.g., SAR), like public 
hospitals and air-borne ambulance, as well as the 
dependability of shared infrastructures. The lat- 
ter could be communication infrastructure, energy 
supply, transportation systems for goods and/or 
services to offshore installations as well as to local 
communities. 

‘Pandemic’ is about dealing with a global epidemic 
that crosses national borders, affecting the 
public widely and thus indirectly threatening 
important societal functions and infrastructures, 
herein questioning the deliberations of saturation 
points and interdependencies of the preliminary 
threat landscape presented in this paper. 


2.2 General aspects; method and interview guide 


In this paper, we try to delineate a preliminary 
threat landscape, based on tentative threat pictures, 
interviews and available public information. The 
key activity is to encircle and ‘carve out’ possible 
‘pictures’ and their saturation characteristics. What 
external events could ‘shake’ or interrupt them? 
Possible spillover effects are here those effects that 
carry the potential to influence other (external or 
overlapping) pictures, and ultimately the whole 
‘landscape’. The strategy for doing this is based 
on the landscape representation of Grøtan and 
Antonsen (2016). Separate interviews were carried 
out and pragmatically used for this purpose, providing 
crucial grounds for the New Strains framework 
development aiming for stress-testing as the 
ultimate objective. Each interview was aligned to 
contribute in an optimal manner to the framework, 
and based on a joint interview guide (Grøtan and 
Antonsen 2016). Key players were interviewed at 
both a general level, and on specific topics related 
to their presumed or self-declared ‘picture’. The 
initial aim was to identify the interviewees’ home 
ground with respect to risk pictures and the risk 
management that had been implemented. What kind of 
events were normally covered by the risk analysis, 
and what have been the criteria for establishing 
emergency and contingency plans? 

By going through past events and how the organ- 
ization responded, characteristics of how they were 
surprised are revealed, in addition to information 
about situation handling. By going through these 
experiences, one should touch upon the outer edge 
of what (situations) are possible to handle by the 
organization. This could be explained as the thickness 
of the threat pictures’ frame. Next, the aim 
was to challenge actors to look for the bigger picture 
given the situations experienced: how could 
the situations possibly escalate? What might have 
happened if the initial event had occurred slightly differently, 
or if conditions had been somewhat different for the 
scenario? At this stage, it was expected that actors 
used their imagination and took the opportunity to 
respond actively on both thinkable, and less think- 
able scenario escalations. Important questions in 
this phase were: 


— What has been done after the incident to be bet- 
ter prepared (learned by experience)? 

— What kind of effects in sense of improved inter- 
action between actors are seen? 

— What kind of overlap with other threats are seen 
in the hypothetical crisis? 

— How could such kinds of interactions and over- 
lap be handled in practice? 


Threat landscapes for ‘oil in high north’ have 
been formulated on the basis of adapted interviews 
and meetings. The strategy was to pursue initiating 
events further and challenge operators and actors 
with respect to possible ways of escalations. By 
exerting pressure on a given threat landscape, one 
or more threat pictures were approaching a satu- 
rated state which triggered thinkable events. These 
events escalated from some defined ‘overlaps’ 
between the set of threat pictures and/or actors in 
the landscape. 

One possible escalation relates to the production 
installation’s dependence on electric power supply 
from shore in normal operation. A scenario 
with overlapping threat pictures, e.g. challenging 
weather or vulnerable power supply, could start 
with a winter storm caused by a polar low. 
The storm knocks out the power supply to 
the offshore production facility, while at the same 
time hampering evacuation of personnel from the 
installation by helicopter. 
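
The escalation logic described above can be sketched as a small graph exercise. This is only one possible reading, not the New Strains method: it assumes that each threat picture has a numeric load and capacity, that a picture is saturated when its load exceeds its capacity, and that the excess spills evenly onto overlapping pictures. All picture names, numbers and the function name propagate_spillover are hypothetical.

```python
from collections import deque


def propagate_spillover(load, capacity, overlaps):
    """Return the pictures that become saturated, in the order they saturate.

    load, capacity: dicts mapping picture name -> number (hypothetical units).
    overlaps: dict mapping picture name -> list of pictures it spills onto.
    """
    load = dict(load)  # work on a copy so the caller's data is untouched
    saturated = []
    queue = deque(name for name in load if load[name] > capacity[name])
    while queue:
        name = queue.popleft()
        if name in saturated:
            continue
        saturated.append(name)
        excess = load[name] - capacity[name]
        neighbours = [n for n in overlaps.get(name, []) if n not in saturated]
        for n in neighbours:
            # Assumption: excess is shared evenly between overlapping pictures.
            load[n] += excess / len(neighbours)
            if load[n] > capacity[n]:
                queue.append(n)
    return saturated


# Hypothetical winter-storm scenario loosely following the example in the text.
capacity = {"weather": 10, "power supply": 10,
            "helicopter evacuation": 10, "health services": 20}
load = {"weather": 18, "power supply": 7,
        "helicopter evacuation": 7, "health services": 5}
overlaps = {
    "weather": ["power supply", "helicopter evacuation"],
    "power supply": ["health services"],
    "helicopter evacuation": ["health services"],
}

print(propagate_spillover(load, capacity, overlaps))
```

In this invented example the storm saturates the weather picture, and its spillover pushes both the power supply and helicopter evacuation pictures past their capacity, while health services remains within its margin.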


Then, we ask what could be done to avoid 
similar experiences, or to be better prepared. The 
risk management should be updated accordingly. 
Finally, taking as a premise that surprises might 
occur in the future anyhow, what could be done, or 
should happen, to deal successfully with these surprises? 
Here, we identify some resilience capabilities. 
The interviews were structured according to 
the predefined sequence of issues starting with the 
interviewee’s home ground, i.e. a description of the 
risk management- and emergency preparedness 
system in service. 


2.3 Actors involved in ‘oil in high north’ 


A list of potential interviewees was established 
based on a predefined actor landscape, shown in 
Figure 1. 

These actors, or organizational units, were 
assumed to play active roles in the threat landscape. 
Hence, they will also be prime candidates for being 
actors in a future ‘resilience landscape’ (Grøtan 
& Antonsen 2016). Founded on the New Strains 
principles of building such landscapes (Grøtan & 
Bergstrom, 2016), Grøtan (2018) utilizes a New 
Strains discursive-support structure that enables 
such a polycentric resilience landscape to evolve 
over time. Based on this, a preliminary threat landscape 
may be elaborated and evolve further. 


2.4 Interviews and workshops accomplished 


During the project work, we managed to get in 
touch with an oil company, the council of county 
emergency preparedness in Finnmark, the regional 
health company and a local health company. We 
also planned to interview representatives from the 
police, the national electric power net agency, tel- 
ecommunication companies and the local electric 
power company. This turned out to be difficult, 
inter alia because we easily ran into asking for classified 
information that we cannot receive. Three 
workshops were held in 2016-2017 on 
the topic of cyber safety, security and resilience of 
critical infrastructure. Information from these 
workshops is elaborated on in the discussion part 
of the paper. 

Figure 1. Actor landscape: ‘Oil in High North’. 

In addition to the obstacles of classified information, 
we also came across or derived information 
and knowledge that is obviously sensitive (but 
unclassified), and it is therefore omitted. 

However, it is emphasized here that the conception 
of ‘sensitive’ is based on our own judgements. 


2.5 Findings from narrative-structured interviews 


The ‘narrative structure’ is based on Grøtan and 
Antonsen (2016). 


2.5.1 General issues 

General characteristics of the high north involving 
the petroleum industry activity and maritime traf- 
fic in the area were i.a. as follows: 


— A situation with limited ‘resources’ in the high 
north puts constraints on crisis mitigation. 
Actors have experienced limited information 
provided by externals in early phases of situations, 
combined with bad communication. 

— The region is known for rough weather conditions 
and large distances between parties and 
service providers in reaching the necessary 
resources. 

— There are still conflicting interests between 
nations in the Barents region. 

— Societal events and/or global changes to society 
seem to have an effect on business as usual. 
Cyber-security and hidden, dynamic and emerging 
(h/d/e) risks are evolving. These issues are incorporated 
in emergency exercises, initiated by the 
county administration, that involve actors and 
stakeholders starting to learn by collaboration. 


2.5.2 Actors’ ‘state of the art’ in risk management
Oil companies give lower priority to security issues
than to pure (physical) safety. The petroleum
business in Norway is not defined as a CI and is thus
not subject to the Security Act like other critical
infrastructures.

Misleading cyber-related actions, or events with
purposes other than what seems to be the case at
first, are challenging for the industry. It is
generally difficult to improvise on such ICT problems
when lacking overview or knowledge of the problem
origin in the early phases of a scenario/event.
Achieving a delayed overview of the situation, or
late awareness of the main purpose of an attack, is
normal, not exceptional. Generally, there are limited
roles or dedicated activities for strengthening the
security field in the petroleum business. Measures
are prescribed mainly for physical protection, e.g.
preventing persons from becoming violent offshore,
implementing a 100% check of personal luggage
before offshore flights, etc. However, the industry is
aware of cyber threats like ‘denial of service’.

From the authority’s point of view, cyber security
is part of the overall emergency preparedness in
society. Finnmark county has included these issues
in its emergency-preparedness plans, and they
constitute an important part of its training program.

Geographical coverage of the health emergency
preparedness is defined in dialogue with national
authorities. Focus is on robust systems for alerting,
assessing and managing unexpected situations.
The regional health company (Finnmark) is prepared
to handle major accidents with many injuries. It
incorporates plans for the critical infrastructures.
Emergency plans, which include infrastructures, are
aligned with the relevant actors, like oil companies,
hospitals, police, coast guard, national defense, etc.
The ambulance services also cross national
borders to Russia and Finland.


2.5.3 Experiences from h/d/e events in own
business

The following gives an overview of some threat
pictures that are relevant for the ‘oil in high
north’. These pictures are mainly derived from the
interviews.

Offshore installations may become vulnerable to
electrical-power shutdown from shore, given the new
Norwegian policy that requires electrical power
supply from shore. The tense political climate, with
environmental activists and Russian collaboration,
makes operations challenging to a certain extent.
Society in the northern part of Norway, including the
petroleum industry, has been surprised by several
big events lately. Three examples were:


— The refugee situation in 2015 with large groups
of people crossing the Norwegian border at
Storskog.

— The big avalanche at Svalbard in 2016 came as
a surprise and challenged the health company’s
preparedness, including transportation of
personnel.

— The ash cloud from the Icelandic volcano in
2010 led to major ambulance-flight restrictions.


Oil companies typically expect only minor, or
no, surprising events from their own business. As
mentioned, cyber security of critical infrastructures
and lack of overview/delayed knowledge about
emerging problems are general issues given
implementation of new/innovative technology in society.

The fact that Finnmark is a large county with
geographically spread resources is also a challenge.
The local hospitals have limited capacity to handle
big events with many injuries. In any case,
unforeseen events happen all the time in the north,
often related to the challenging weather conditions
and long distances.




The time factor in an acute situation often becomes
critical, and the required resources in an emergency
may be far away and take 2-3 hours to mobilize. An
example of such a resource is the fire-fighting service.

There are periods of the year with less access to
back-up health personnel, e.g. during Easter, when
people spend their holiday time out in the terrain.
Personnel resources to solve ICT problems may also
be scarce locally, and this becomes critical if
communication goes down during handling of a crisis.

Finally, the electric power situation depends on
a power distribution grid crossing borders to both
Finland and Russia, which makes it even more
vulnerable. The electrical power net covering the
northern parts of Norway is operated by 7
companies. Although Norway, Russia and Finland
have agreed on training the emergency preparedness
against events (including power supply) that
involve all three countries, the handling of crises
near the borders to each country with utilization
of the common resources remains a critical issue.


2.5.4 Pandemic as a threat 

The national preparedness plan for pandemic flu 
describes how to prevent and reduce the spread 
of infection, morbidity and death, and to provide 
good treatment and care for people affected by the 
flu pandemic. Municipalities in Norway are obligated
through the Civil Protection Act to actively
work with emergency preparedness, e.g. by
having a plan for how to handle possible pandemics.
Some of our project group members were allowed
to observe and challenge participants in the crisis
response team during one specific manoeuvre,
‘exercise Virus’, in one municipality during
autumn 2017. The type of virus was in this case a
seasonal influenza virus. In this exercise, all
municipal offices were involved.

A pandemic is a complicated situation to handle
for a municipality. It involves the administrative and
operative management, and several different
professions and services such as general practitioners,
hospitals, nursing homes, transportation, schools,
day-care and groceries, to mention some. Moreover,
a pandemic flu situation is very different from an
emergency event such as a major explosion, which
leaves an unexpected complex situation with a high
degree of uncertainty. While many emergencies
occur suddenly, a pandemic flu spreads and escalates
in a slower manner. It is possible, to a certain
degree, to be prepared and to limit the severity of
the outbreak. However, it should be mentioned
that the lethality of a flu pandemic could be
abnormally high, which could leave the situation even
more complex and difficult to solve. An Ebola
pandemic, with its high degree of lethality, is another
example of a complex pandemic.

During the ‘exercise Virus’ it was observed that
the organisation involved in handling the crisis is
complex. The distance between the members of the
crisis management, the crisis response team, the
administrative and operative management, as well
as the many other actors involved, could lead to
non-optimal communication between professional
institutions, external actors and decision makers.
At some offices, the contact person could have
changed without this being updated in the
emergency preparedness plan or in the digital crisis
management tool, which led to further communication
problems. Misunderstandings could easily
arise, e.g. about the degree of severity, time
consumption (which tends to be underestimated) and
which resources are available. Moreover, since
neighbouring municipalities are also affected in case
of a pandemic, competition for the same resources,
e.g. beds for a temporary emergency room/hospital,
could occur. Other important issues are the complex
logistics of handling sick persons, continuing safe
operation of critical societal functions, and having
enough qualified personnel to run daily operations.

On the positive side, since the pandemic situation
often escalates in a slow manner, the management
can reallocate human resources. E.g.,
personnel working at the municipal cultural school
could be reallocated to work at vaccination posts
or nursing homes, as there most likely would be a
shortage of health personnel in a pandemic situation.
Still, because of the high numbers of persons
involved in a pandemic, even a pandemic with low
lethality could have the potential to paralyse, to a
certain degree, a whole society.

From this, we hypothesize that a pandemic com- 
ponent in ‘The High North’, could affect the capa- 
bility and function of persons directly involved at 
offshore installations as well as the support net- 
work on-shore from the Finnmark county. This 
could again threaten the safety for the work and 
the environment in ‘The High North’. Although we 
presume that ‘pandemic’ aspects already have been 
considered by the ‘Oil in High North’ actors, we 
find reason to issue a warning that the ‘crippling’ 
effect of a pandemic is likely to jeopardize any 
assessment of saturation effects related to almost 
any risk picture. This is probably even more relevant
for sparsely populated areas, where competent
people probably have several roles, related to e.g.,
authorities, companies, NGOs and civil society.
The latter is especially important for reliance
on community resilience, despite the assumption
that the population is more robust compared to
rural areas (see later discussion on resilience).


2.5.5 Emergence, escalation into ‘bigger pictures’ 

The following gives some examples of possible
escalation routes into ‘bigger pictures’. Terror, where the
health services can be targets themselves (physical/
cyber), is thought of as a possible scenario.
Sensitive patient information may get lost, although
health companies are subject to the Security Act.

A refugee situation could escalate with people
disappearing after crossing the national border,
with a pandemic coming up. At the same time, some
might bring forward violent attitudes and crime.

With changing traffic patterns in the north,
mainly driven by increased tourism etc., the
health companies are challenged on capacity and
logistics in case of simultaneous events. There is
limited capacity in local hospitals to handle big
events with many injuries. These situations may
also affect the health preparedness offered to
the petroleum business.


2.5.6 Expectations and revised risk governance 
The most important lessons learned from incidents
by the oil companies are:


— Risk governance is updated based on learning
from incidents.

— Outsourcing of ICT services requires improved 
focus on contracts. 

— Improved (risk) knowledge is achieved among 
actors involved in the (oil) business. 

— Hospitals, police, etc. are more aware of possible 
new scenarios offshore. 


From a Finnmark county perspective, the new
‘cyber scenarios’ are to be included in the ‘national
risk picture’, but the emergency preparedness
should be made Finnmark-specific. The council of
county emergency preparedness plays an important
role in improving society’s risk awareness
by maintaining an open dialogue and arranging
training exercises.

For the health companies, emergency plans are
expected to be reviewed and adapted to new events
and developments in society. New emergency plans
should also cover ‘hybrid-war’ scenarios. Here,
the rescue function of the armed forces needs close
cooperation with the civilian health service to
handle crises. Interaction competence will thus be
more important. The 22 July terror attack gave us
new views in that respect (Johnsen & Øren, 2015).


2.5.7 Recognizing the usefulness of resilience 
When it comes to specific resilience capabilities, the
county emergency preparedness is used as an example.
The county states that it adapts to situations,
and improvisation often occurs there
and then. It is, however, more difficult to improvise
on ICT problems due to lack of overview/knowledge
in early phases of a situation. Each sector/
actor carries out resilience, or resilient behavior, in
different ways, either as an organization or as
individuals. Early sharing of information to society
(e.g. the municipalities) is one of the county
governor’s main responsibilities and should support
such behavior.

Safety and emergency events offshore are handled
by the Norwegian rescue coordination center
(RCC) in the first instance. The oil business itself may
contribute with its own transportation resources to
assist the health companies in specific rescue
operations, etc. The Sea King helicopter squadron has
become an important resource for the health
companies as well.

Experience from a recent gas-leak event on an
installation in the north shows that health companies
nowadays are notified earlier, even when no serious
event with injuries or real evacuation situation has
occurred. The threshold level to establish
preparedness is maintained as before, but
notification to the health company comes earlier than
before. The main reason is that the crisis management
at health companies is better able to judge, or
assess, the seriousness of situations during the early
stages than operators at the alarm receiving center.

Another aspect of resilience is that inhabitants
of Finnmark tend to try out their premises
extensively in crisis situations. The northern
communities are characterized by their robust populations.
Because the region consists of small, often isolated
communities, the ability to handle unforeseen
situations becomes evident when events occur. There
is, however, a variety of such skills. Living in the
north involves some risks, and the inhabitants are
more used to stepping in when the situation requires
it. Next, our neighbor countries also rely on
Norwegian assistance in crises, e.g. in north Finland
and Russia. An accident near the border to Russia
can also be supported by resources from Russia.
There is no experience from real events of this kind,
but the countries exercise together regularly.


3 THREAT LANDSCAPES OVERVIEW 


Generally, emergency preparedness in oil companies
is mainly designed to handle a single accident
event at a time, e.g. a hydrocarbon leak or
fire. However, problems may escalate if e.g. a
production upset and a hydrocarbon leak occur at
the same time and in combination with an outage of
critical infrastructures. An example of the latter is
loss of electric power from onshore, or technical
failures on critical communication systems. The
following summarizes input to threat landscapes
derived from the interviews:


— Long distances in the north between critical
(emergency preparedness) resources may induce
logistic challenges when facing emergency
situations, e.g. easy/in-time access to onshore
equipment in case of an unexpected oil spill.

— The petroleum industry is not subject to the
‘security act’, which makes the involved actors
less accountable to cyber-attacks than other
domains, e.g., ‘denial of service’ types of viruses.

— Possible ‘spillover effects’ from externals during a
crisis, e.g. environmental activists or political
interests, may interrupt a proper handling of situations.

— Critical operations at the same time as an
infectious disease spreads among key offshore
personnel may reduce the problem-solving capacity
dramatically.

— Terror attacks in combination with bad weather
make it difficult to access accurate information
and to handle multiple demands on limited
resources for normalization.

— ‘Oil in high north’ is currently a Vest-Finnmark
domain. A major accident offshore may, however,
require the whole capacity of the regional
health company. That means the capacity of
single health companies might be too scarce in
those situations.

— Extraordinary events (e.g. the refugee situation
at Storskog) combined with a major offshore
accident could cause trouble for the emergency
assistance from the health companies due to
limited resources.

— Ambulance services by air are a critical function
in emergency situations offshore, and may easily
be interrupted by bad weather or natural
disasters, like the ash cloud in 2010.

— The avalanche at Svalbard in 2016 came as a
surprise and challenged the health preparedness
of health companies, especially with respect to
transportation of personnel. An offshore
accident at the same time would have been difficult
to handle locally.

— Health companies are always challenged on
capacity and logistics by simultaneous events. The
companies are decentralized, often small units
with focus on the ‘acute’ functions during events.

— During an exercise at a hospital with a
CBR incident (Chemical, Biological, Radiological),
a real failure in the communication system
occurred at the same time. The exercise then had
to be put on ‘hold’ to solve it.

— Resources to solve ICT problems become scarce
when communication is difficult. Access to
troubleshooting capacity also becomes a scarce
resource when new technology is constantly introduced.


The above aggregates into three major threat
landscapes that could be elaborated on further (see
Figures 2-4):


— Production upsets 
— Cyber security 
— Emergency preparedness at sea 


A further elaboration of these landscapes is
conceivable, drawing on a discursive support
scheme (Grøtan, 2018) that supports development
of a polycentric resilience landscape, including cyber
vulnerability issues.


3.1 External conditions


In addition to threat landscapes, the following
issues were registered through the interviews as
major external conditions, i.e. kinds of ‘labyrinths
or moving horizons’ (Grøtan & Antonsen, 2016):


— Changing conditions for the offshore petroleum
industry in general, mainly due to lower oil
prices and a continuous need for cost cutting, with
structural and organizational consequences.

— Hidden vulnerabilities in the ‘system’ for
electrical power supply from shore.

— Development strategies for the electrical power
grid are not fully predictable for the industrial
actors. Uncertainty relates to further development,
which is partly geographically conditioned.

— Uncertainty relates to the resilience capabilities
and degree of informal collaboration between
actors involved in a potential (offshore) crisis.

— The degree of information and communication
constraints occurring between governmental
levels and actors in crisis situations that evolve.

— The availability of personnel varies over time
in the regional health companies depending on
changing regional policies that may have an effect
on the quality of the health preparedness.

— The degree of shared information and
communication between the actors in the petroleum
business and health companies has developed
into new levels lately.

— The development of the expected ‘total-defense’
concept is uncertain. To what degree (and when)
can actors rely on sufficient and steady
interaction between actors according to such a concept?

— The technology change and degree of innovation
in the hospital sector imply varying dependency
on the Internet in rescue and health preparedness.
Access to troubleshooting capacity (ICT related)
may become scarce for the health companies in
times of new implementations.

— Required manning in the northern health
companies, and accessibility to personnel, vary during
the year due to individual work and shift schemes.

Figure 2. Threat landscape: Production upset.

Figure 3. Threat landscape: Cyber security.


4 DISCUSSION AND FURTHER WORK 


Threat scenarios are elaborated on the basis of the
threat landscapes with ‘saturated’ threat pictures and
‘moving horizons’ in Sections 3 and 3.1. Scenarios may
originate from the offshore installation itself or from
a passing ship in the area, and be reinforced by bad
weather conditions or loss of critical infrastructures
like electric power from shore. In combination with
other events, scenarios may escalate into severe
accidental events.

The important aim of scenario framing/delimiting
in New Strains was to establish an appropriate
foundation and discussion basis for the actors
and stakeholders involved (Okstad, 2016). Any
actor should take the opportunity to respond to
knowable escalation factors of the identified
scenarios, if they were involved or affected according
to the given threat landscape.

The scenario should be neither too specific nor
too general. In the first case, it probably requires
too detailed knowledge and extensive verification of
its implications for the final acceptance as a scenario
among actors. If scenarios become too general and
vague, they will be counterproductive and of minor
value in clarifying possible escalation factors and
effects of such within a given threat landscape.

Another challenge was how far one should go to 
look for interconnections between contemporary 
threats and adverse circumstances as reinforce- 
ments to possible threat escalations. The right bal- 
ance was found to be just making illustrations of 
possible spill-over effects, or the added effects of 
several overlapping events or vulnerabilities in the 
given threat landscape. 

Threat scenarios that have been discussed 
related to ‘oil in high north’ are: 


— Critical exploration drilling activities go on in
the North Barents Sea, and at the same time a
major ICT-system failure occurs. There is limited
ICT infrastructure in the area. A well control
situation escalates into a critical situation
that requires shutdown and immediate evacuation
of personnel from the rig.

— A large passenger boat catches fire at open
sea, and at the same time an ICT problem strikes
the health company. Combined with large
discharges of oil/chemicals at a site of damage
(CBR), situations quickly get out of control.
From the health company’s point of view,
the size of the accident in terms of the number of
injured people becomes the biggest stress factor.

— Simultaneous electric power cut to an offshore
installation and an oil spill, or operational problem
at site. Challenges then occur related to fast and
effective mitigation of consequences. Finnmark
county is large, with the onshore resources
geographically spread. It therefore takes time to
move, e.g. by helicopter, between west and east.

— Offshore event, international cyber-attack,
critical failure in communication systems/infrastructure
(random failure/outage, or a deliberate
action, e.g. cutting of cables).

— Ship collision with an installation, and at the
same time an outage of electric power from shore
occurs. Finally, an escalating pandemic flu strikes
the area. There is a trend of increasing ship
transportation, with new production and exploration
drilling taking place in the north-east Barents Sea.
In addition, system vulnerabilities exist connected
to ICT and the Internet of Things incorporated in
every critical infrastructure and logistic function.


Our experience from the interviews is generally
positive with respect to the interviewees’ willingness
and engagement to share their opinions and thoughts
on these aspects. However, we faced a challenge
regarding confidentiality and security clearance to a
degree that limited our contact with some actors.

Regarding the workshops, the participants
were motivated to contribute there and then.
Discussions in interdisciplinary work groups were a
success. People from authorities, research units and
practitioners worked very well together, and the
impression was that the time was well spent. However,
the discussions regarding threat pictures, and especially
those related to cyber security, quickly became technical.
In some discussion groups, therefore, the discussion
was raised a level and used more to elaborate on
possible effects or consequences of the cyber threats.
One drawback was the tendency to stretch the
consequences completely (too wide, or thinking ‘worst
case’) instead of thinking of possible event chains or
sequences leading to them. Maybe we should have
emphasized concentrating on a limited set of
consequences, i.e. linked to the given organization.


5 CONCLUSION


In line with the suggested approach to encircling
and building threat landscapes, a preliminary
threat landscape for ‘Oil in High North’ has been
developed. Results from a study of pandemic
handling in a municipality have also been included in
the landscape. This threat picture sensitizes the ‘oil
in high north’ scenario to be further scrutinized
concerning operational presumptions and
considerations of saturation effects.

For a further development to take place, crucial
issues about not only classified information,
but also sharing of obviously ‘delicate’ information
on (e.g., cyber) vulnerabilities would have to
be addressed. The authors’ judgement is that this
should be possible, laying the foundations for
stress-testing according to the New-Strains objective
(Okstad et al. 2016).


ABSTRACT: With growing outsourcing in the petroleum industry, operations are becoming increasingly
interorganizationally complex. This raises the question of how interorganizational complexity
influences safety. The aim of the research project “interorganizational complexity and risk of major accidents”
was to gain a better understanding of safety challenges associated with interorganizational complexity.
Three studies have been completed. In this paper, we present the main findings from the research project
and discuss how the resulting knowledge can best be applied in the petroleum industry. The findings call
for more awareness concerning interorganizational safety challenges in the industry, and point to specific
areas in which challenges may arise. In particular, coordinating work processes, ensuring sufficient levels
of competence and transferring lessons learned and best practices across companies are identified as such
areas. However, the findings also imply practices that help manage interorganizational complexity. The
presence of high-quality work relations appears to be important to achieve safety performance across
collaborating companies. In this regard, middle managers appear to play a central role in terms of aligning
employees from collaborating companies towards a shared focus on operational and safety related goals.


1 INTRODUCTION

In the Norwegian petroleum industry, the outsourcing
of safety-critical activities and services
to contractors in petroleum operations has
increased in recent years. As a result of increasingly
complex drilling operations and recent technological
developments, work processes in oil
and gas production have become more and more
specialized over the years. Contractors and
subcontractor companies hold specialist competence
within specific domains and are hired to provide
various types of equipment, such as drilling
equipment or valves, or specialist services in operation,
such as cementing or drilling fluids. As a result,
work processes in the petroleum industry are
fragmented across a large number of companies
with different areas of responsibility and varying
levels of involvement. Outsourcing is undoubtedly
an advantageous strategy in that it contributes
to enhance performance by expanding the
pool of available expertise, enabling organizations
to become more competitive in an increasingly
high-paced market. However, an increase in
the number of stakeholders also entails increased
complexity and added possibilities of how human,
technological and organizational components
within the system can interact (Perrow, 1984,
Rasmussen, 1997). This does not only have practical
implications in terms of organization by placing
greater demands on aspects such as communication
and coordination of work processes, but may
also have significant implications for safety. With
multiple stakeholders that operate independently,
yet interdependently, within their respective areas
of expertise, it is challenging to maintain a complete
understanding of what is going on. Accordingly,
discovering and responding to safety threats
becomes more challenging.

The Norwegian petroleum industry is known to
demonstrate enhanced focus on safety and invest
significant effort towards strengthening collaboration
and trust across actors in the industry. However,
there seems to be less focus on understanding
safety challenges that arise at the interfaces between
collaborating companies. The developments we see
in the industry in terms of increasingly demanding
drilling conditions due to depleted oil fields and
the onset of petroleum production in the Arctic,
with more demanding weather and temperature
conditions, inadequate infrastructure, and larger
distances to onshore facilities (Petroleum Safety
Authority Norway, 2012) call for more complex
and technologically advanced drilling operations
in the future. Such conditions will require more
in terms of cooperation between the companies
involved, which means that interorganizational
complexity can have consequences for how well
safety is managed. It is therefore important to gain
more knowledge about the safety implications of
interorganizational complexity.

The research project Interorganizational complexity
and risk of major accidents, funded by the
Research Council of Norway (Grant nr: 220798),
was initiated to develop knowledge about the safety
implications of interorganizational complexity in
petroleum operations, focusing particularly on major
accident risk. The main objective was to investigate
how interorganizational complexity contributes to
safety challenges, and to explore and discuss how
challenges can be reduced. The project commenced
in 2013 and was completed in 2017. During this period,
we completed three studies exploring
the influence of interorganizational complexity on
safety from different angles. These include: a review
of empirical literature addressing interorganizational
safety challenges (Milch and Laumann, 2016),
a qualitative study of safety challenges and practices
that help manage interorganizational complexity
on a Norwegian petroleum installation (Milch and
Laumann, 2018; Milch and Laumann, 2017), and
finally, a qualitative analysis of investigation reports,
exploring the influence of interorganizational factors
on incidents in the Norwegian petroleum industry
(Milch and Laumann, submitted). In this paper, we
present the main findings from the three studies
undertaken in this project and discuss practical
implications of the project for the petroleum
industry. In particular, we would like to discuss how
the knowledge from this project can best be utilized
and applied in the petroleum industry.

In this project, to explore the findings in a 
nuanced manner, we have employed different theo- 
retical lenses that represent long-held views as well 
as more recent theoretical developments. These 
are: The High Reliability Organizations Perspec- 
tive (Weick et al., 2008, Weick and Sutcliffe, 2007, 
Weick and Sutcliffe, 2015), Reason’s perspective on 
organizational accidents (Reason, 1997) and the 
perspective of migration towards the boundary/ 
drift into failure (Rasmussen, 1997, Dekker, 2012). 


2 METHOD 


Because of the sparsity of research within the field 
and the topic’s complexity, we opted for a qualita- 
tive approach. Qualitative research is particularly 
suitable for investigating complex phenomena,
particularly in circumstances where prior knowl- 
edge is limited (Miles et al., 2014, Elliott et al., 1999, 
Silverman, 2006). To illuminate different angles, the 
studies in this research project draw on different
sources of data. Study 1 is a literature review of 
empirical literature to identify interorganizational 
safety challenges (Milch and Laumann, 2016). 
The aim of this work was to get a sense of key 
challenges that can arise due to interorganizational
complexity. A literature search was performed to 
identify empirical literature describing interor- 
ganizational issues that negatively influence safety. 
The search resulted in 22 articles, which shows that 
existing research on this topic is limited. In study 
2, we explored how interorganizational complex- 
ity was managed on a Norwegian petroleum pro- 
ducing installation (Milch and Laumann, 2018; 
Milch and Laumann, 2017). The aim of the study 
was to gain a better understanding of the main 
safety challenges. In addition, we were interested in 
exploring practices that help manage interorganiza- 
tional complexity. Semi-structured interviews with 
informants representing various affiliations and 
hierarchical positions were combined with observa- 
tions on site and onshore, and analyses of relevant 
formal documents. In total, 14 interviews were con- 
ducted. In the third and final study, we investigated 
how interorganizational factors have contributed to 
accidents and incidents in the petroleum industry 
through examining investigation reports issued by 
the Petroleum Safety Authority between 2006 and 2016
(Milch and Laumann, submitted). The aim of this 
study was to explore how interorganizational fac- 
tors are related to incidents, and gain knowledge 
about what factors contribute to unwanted
occurrences. 22 reports were identified in which 
interorganizational issues could be linked to inci- 
dents and accidents. 

In all three studies, data were analyzed using 
thematic analysis as described by Braun and 
Clarke (2006). Thematic analysis is a qualitative 
method used to identify overarching patterns 
or themes in the data. The analytical procedure 
consists of five recursive analytical phases. The 
first phase is simply familiarizing oneself with 
the content, reading through all of the material 
to get a sense of its main features. Following 
that, the material is coded, whereby the aspects 
of the material relevant to the research question 
are examined line-by-line, and smaller segments 
are given informative and short labels. The next 
phase involves actively searching for themes. 
Here, resulting codes are examined and compared 
and similar codes are clustered together to form 
candidate themes. The thematic structure at this 
point is not fixed; rather, the entire analytical
process is one of constant comparison, in which
the contents of codes and themes are constantly
compared against each other for deviations and 
adjusted accordingly. When a list of candidate 
themes has emerged, the next phase involves a 
more thorough review of the candidate themes 
to identify what themes represent main themes 
and what themes count as sub-themes. In the 
last phase, a final review is made before the the- 
matic structure is finalized and a thematic map is 
developed. 


3 RESEARCH FINDINGS 


3.1 Study 1 


The findings from this study show that safety challenges
arising from interorganizational complexity fall
into four categories: economic pressures
between companies, disorganization of work proc- 
esses, dilution of competence and organizational 
differences (Milch and Laumann, 2016). 

The literature suggests that economic pres- 
sures between companies can be a source of safety 
issues. This is because stakeholders pursuing dif- 
ferent and perhaps conflicting goals in addition 
to collective operational goals contribute to goal
conflicts for employees at the sharp-end and a 
fragmented overall safety focus. Moreover, due to 
the potential economic consequences of errors or 
accidents for individual companies, the focus on 
assigning blame amongst involved companies can 
often displace the focus on learning from what 
happened. Another aspect that becomes problem- 
atic when multiple companies are involved is the 
organization of work processes. Processes such 
as coordinating tasks and responsibilities and 
communication demand more effort the more
interfaces need to be managed. The literature
indicates that interorganizational complexity
often results in more disorganized and fragmented 
work processes. Findings also suggest that inter- 
organizational complexity often contributes to 
reduced quality in available competence. Con- 
tractor employees, who spend shorter periods of 
time on the site, tend to be unfamiliar with the 
workplace, and can often be inexperienced and 
have little prior training. As such, they are not 
necessarily equipped with the knowledge to deal 
with unexpected safety-critical situations or may 
not be familiar with the hazards and safety con- 
straints in the workplace. The final theme iden- 
tified in the literature concerns organizational 
differences between collaborating companies as a 
source of safety issues. Collaborating companies 
can vary greatly in terms of organizational culture 
and work practices. Such variations can hinder 
efficient collaboration and a shared orientation 
towards operational goals. Moreover, organiza- 
tional differences can also create conflicts and 
distrust between employees from different com- 
panies, which hinders open communication about 
safety issues. 

The interorganizational issues that we found in 
this paper can in various ways impede the ability of 
the organizational system to identify and respond 
to safety threats, as they are associated with the
development of latent conditions and fragmented 
operational focus. Therefore, challenges that arise 
from interorganizational complexity may increase 
the risk of major accidents. 
3.2 Study 2


With regard to safety challenges, informants gen- 
erally expressed few challenges with the interor- 
ganizational elements of their work. The perceived 
challenges largely reflect structural aspects relating 
to managing and organizing operational processes 
between involved companies. Coordination of 
work processes, information flow and varying lev- 
els of experience among contractor personnel were 
identified as the most important challenges (Milch 
and Laumann, 2018). 

The study also identified several practices that 
help manage interorganizational complexity. 
Employees from collaborating companies do not 
necessarily form close collegial relationships, but 
still engage in close collaborations. As such, high-quality
relations, underpinned by friendly interaction
and mutual respect, were deemed crucial for
maintaining well-functioning collaboration across
organizational boundaries. High-quality work 
relations appear to promote a shared focus on 
operational goals, and were also found to stimulate 
constructive and open communication about safety 
matters across organizational boundaries. Findings 
also suggest that long-term organizational relation- 
ships between collaborating companies, worker 
involvement and managers’ interactive role in the 
sharp-end represent important organizational ele- 
ments stimulating high-quality work relations in 
this context (Milch and Laumann, 2018; Milch and 
Laumann, 2017). In addition, the similarities found 
among companies in terms of safety philosophies 
and practices also appear to be important for align- 
ing companies in their efforts to achieve collective 
operational goals. Moreover, similar safety practices
and philosophies also appear to be conducive
to reporting behavior, as employees from collabo- 
rating companies seem to have similar perceptions 
about safe operational conduct. 

The findings show that high-quality work rela- 
tions may be a particularly important element 
in the pursuit of achieving and sustaining safety 
across collaborating companies in an interorgani- 
zational context. Moreover, since many of the 
challenges identified in the literature were not identified
in the current study, it could suggest that the
presence of high-quality relations, combined with 
shared safety philosophies and practices across 
companies, may potentially counteract some of the 
safety challenges that have been found to arise in 
interorganizational collaborations. 


3.3 Study 3 


Findings show that interorganizational issues con- 
tribute to both occupational incidents and major 
near accidents. Four themes were identified that 
describe interorganizational factors contributing
to incidents: ambiguities in roles and responsibilities
between personnel from different companies,
inadequate processes to ensure sufficient competence
across interfaces, inadequate quality control
routines across organizational interfaces, and communication
breakdowns between companies (Milch
and Laumann, submitted).

Roles and responsibilities between personnel 
from collaborating companies were often insuf- 
ficiently clarified, which resulted in confusion 
regarding what company or organizational unit 
was in charge of what. Such ambiguities had con- 
sequences in the form of disruptions in the follow- 
up of operational processes and had in some cases 
also resulted in omissions of safety-critical activi- 
ties. Another interorganizational issue identified 
was inadequate processes to ensure sufficient com- 
petence across interfaces. In several of the reports, 
the lack of installation-specific training and expe- 
rience in personnel, and particularly in contractor 
personnel, was identified as a contributing factor 
to sharp-end incidents. Moreover, failure to ensure 
sufficient competence in planning of safety-critical 
activities was also evident in a serious near major
accident, where relevant contractor personnel with
key expertise had not been involved in planning. 

The study also suggests that inadequate quality control
routines across organizational boundaries
contributed to incidents. This was identified as a
problem in the handover of equipment delivered 
by third party companies, and also with assembled 
structures made up of components delivered from 
multiple companies. 

Finally, communication breakdowns between 
companies were identified as a challenge in the incident
reports. Important information is lost or not
communicated. A central issue in this regard is 
that companies fail to communicate experiences 
and lessons learned from previous incidents. This 
means that known problems and hazards are not 
sufficiently communicated between companies, 
resulting in the repeated occurrence of similar 
incidents. 

The study demonstrates the relevance of includ- 
ing interorganizational factors in the investigation 
process. This is important not just from a preventa- 
tive perspective by increasing the learning poten- 
tial from investigation reports, but more attention 
towards interorganizational factors is equally 
important from a proactive perspective, in terms 
of making such issues more explicit and finding 
ways to cope. 


4 DISCUSSION 


The aim of the research project Interorganisa- 
tional complexity and risk of major accidents has 
been to gain knowledge about safety implications
of interorganizational complexity by investigating 
how interorganizational complexity contributes 
to safety challenges, and exploring and discussing 
how challenges can be reduced. The findings from 
this project offer several practical implications. In 
the following, we will address the project’s main 
implications and discuss how the current findings 
can best be applied and utilized in the industry. 

The industry demonstrates a strong safety focus 
and a lot of effort has been made to strengthen col- 
laboration and trust across actors in the industry. 
However, there seems to have been limited focus 
on safety challenges that arise at the interfaces 
of companies. The findings from this study show 
that interorganizational complexity contributes to 
several safety challenges in the petroleum industry 
that can increase the risk of major accidents. Even 
though the occurrence of adverse events is rare in 
the industry, awareness concerning such issues is 
important. Knowing what to look for can increase 
the capacity of petroleum organizations to rec- 
ognize and respond to weak signals before they 
develop into major accidents (Weick and Sutcliffe, 
2007, Weick and Sutcliffe, 2015). 

One area that appears to be challenging is suc- 
cessfully coordinating work processes between 
companies. The sheer number of interfaces that 
must be coordinated on a petroleum producing
installation, involving interfaces onshore/offshore,
as well as interfaces between the operator company,
the drilling contractor and various contractor
and subcontractor companies, requires continuous
effort, which makes it difficult to ensure sufficient 
information flow and maintain a complete over- 
view of operational processes. The findings imply 
that misunderstandings and confusion regarding 
roles and areas of responsibility between compa- 
nies can occur. This not only impedes the qual- 
ity of monitoring and follow-up efforts, but can 
even lead to the omission of safety-critical activi- 
ties. This is because companies are largely focused 
on their own areas of responsibility, and may be 
unaware of what the other companies are doing. 
As such, the negligence of important activities or 
tasks can go unnoticed. Such fragmentation can 
contribute to the build-up of latent conditions in 
the system, which in combination with triggering 
events can cause adverse events (Reason, 1997). 

The coordination challenge that we find appears
to be partly related to a lack of clarification about
roles and responsibilities among companies before 
projects are commenced. Moreover, it was also 
found that formal documents describing the distri- 
bution of roles and responsibilities among compa- 
nies were in some cases inaccurate or not up to date. 
This implies that operating companies in the Norwegian
petroleum industry do not necessarily have
adequate routines to ensure that work processes are
sufficiently clarified and understood across organi- 
zational boundaries. This might suggest that there 
is a need for better systems to achieve more success- 
ful coordination of organizational and operational 
processes between involved companies in the petro- 
leum industry. In practice, this may require that 
more effort is made by companies to ensure that 
formal documents such as organizational maps and 
documents describing roles and responsibilities 
across companies are accurate and up to date, and 
consistent with how the work is actually organized 
among involved actors in practice. Following up on 
these aspects formally falls under the responsibility
of the operator company, as the actor that man- 
ages the operational processes. However, it could be 
questioned whether it is reasonable to assume that 
operator companies are able to oversee and check 
that all these documents are up to date. According 
to Perrow (1984), maintaining centralized control 
is problematic with high degrees of complexity. In 
this regard, a more decentralized form of control 
may be appropriate for these processes, whereby 
each individual company is responsible for ensuring
that formal documents are up to date and roles and 
responsibilities are adequately described. 

Another way to reduce ambiguities and con- 
fusion between companies is through initiatives 
to ameliorate current communication practices 
between companies in the industry. The industry 
may benefit from developing better communica- 
tional strategies to ensure that all parties have the 
same understanding of their roles and areas of 
responsibility. Research on teamwork has shown 
that the use of closed-loop communication, where 
the receiver communicates the message back to the 
communicator for confirmation, contributes to 
better team performance (Salas et al., 2005). Simi- 
larly, collaborative cross-checks, where employees 
with different perspectives actively discuss and 
explore each other’s assumptions, appear to be a 
promising strategy in terms of making processes 
more observable among employees with different 
perspectives (Patterson et al., 2006). 

The second challenge that we find is the failure 
of experience transfer related to lessons learned 
from incidents, known problems or sources of haz- 
ards and best practices among collaborating com- 
panies. The challenge with transferring experiences 
and learning from incidents is not a new one, and 
has been quite frequently debated in the industry. 
The Petroleum Safety Authority keeps asking:
why aren’t we learning? The current findings sug- 
gest that interorganizational complexity may be 
one part of the explanation. Learning must not 
only occur in one organization, but across multi- 
ple organizations that differ in their expertise and 
operational focus, and vary in degrees of involvement.
Ensuring that lessons learned are absorbed
equally in all involved companies can be challeng- 
ing considering that companies focus on different 
operational domains, and the resources and time 
available to follow up on lessons learnt may vary. 
Another obvious issue is information overload. 
Experiences and lessons learned can easily be lost 
in the vast amount of information that is trans- 
ferred between companies. With an ever growing 
information load, telling relevant information from 
noise can be difficult (Edmunds and Morris, 2000). 
There is no simple solution to this challenge. How- 
ever, industry-wide cooperative initiatives such as 
“Safety forum” and “Working together for safety” 
have proven to be successful in terms of promoting
openness and collaboration across companies in 
the industry about best practices and sharing expe- 
riences about safety issues (Haukelid, 2008, Wiig 
and Karlsen, 2006). Several informants in study 2 
emphasized these forums as important arenas for 
experience transfer between companies. Accord- 
ingly, such collective arenas appear to be very fruit- 
ful for supporting experience transfer and learning 
across companies. The industry may benefit from 
developing these arenas even further to facilitate
learning across companies. 

Ensuring sufficient levels of competence across 
companies also appears to be a challenge in petroleum
operations. Findings from analyzing investigation
reports suggest that the competence
requirements for doing specific tasks or operating 
specific equipment are sometimes not met. Possible 
explanations may be that competence requirements 
are not sufficiently appreciated or understood, that 
competence requirements are not clear enough, 
but it could also be a matter of capacity. In study 
2, some informants mentioned that lacking experi- 
ence and training among contractor company per- 
sonnel could be a problem in times of high activity, 
which would suggest that it is also related to avail- 
ability. This was also identified as a challenge in a 
study that looked at safety issues related to capac- 
ity and competence in the Norwegian petroleum 
industry (Skarholt et al., 2014). In the investigation 
reports, there were several examples where new and 
inexperienced contractor personnel had been put 
to operate equipment that they were not trained 
to handle. This not only increases the risk of
erroneous handling of equipment; the
presence of new and inexperienced personnel can
also constitute a risk in itself because such personnel
will most likely be unaware of local conditions and 
hazards that can influence safety. Moreover, inex- 
perienced personnel will have reduced capacity to 
improvise in unforeseen or unfamiliar situations. 

Companies in the petroleum industry normally 
have a buddy system in which new personnel are 
appointed a “buddy” responsible for familiarizing
them with the installation and with their
work station. In the investigation reports, there 
were some indications to suggest that the buddy 
program on some installations had a more social 
focus, and did not sufficiently equip new person- 
nel with relevant safety information. Moreover, 
there were indications that training programs in 
some cases were too general, and did not detail 
specific tasks or equipment. The level of speciali- 
zation we see today, which will likely increase in 
the future, entails that expertise is key to safe per- 
formance (Weick and Sutcliffe, 2015, Weick and 
Sutcliffe, 2007). Ensuring safe operations requires
that employees have specific training and experi- 
ence to perform their work tasks, so that they can 
recognize when something is wrong. In fact, the 
ability to track small failures has been identified as
an important characteristic underlying the reliable 
and stable safety performance of high reliability 
organizations (Weick et al., 2008). 

While the findings from this project point to
areas in which interorganizational safety challenges
occur, the current research also illuminates several
factors that appear to be important to achieve and 
sustain safety across organizational boundaries. 
The findings suggest that inter-personal factors are 
central in this regard. The presence of high-quality 
work relations between employees from collaborating
companies appears to contribute to strengthening
a collective orientation towards operational goals
as well as being conducive to open communication
about safety-related matters. The majority
of informants in study 2 pointed to high-quality relations
as an important ingredient for
well-functioning collaboration with employees 
from collaborating companies. Because employees 
from collaborating companies do not necessarily 
form long-lasting collegial relationships, but still 
collaborate closely in safety-critical activities, the 
character of work-relationships can be crucial in 
terms of how well the collaboration will work. 
The emphasis on ensuring high-quality work relationships
that we find in our research appears to
reflect the efforts in the industry in recent decades
to improve safety culture and strengthen collaboration
and trust between actors in the industry.
This may suggest that initiatives directed towards
promoting high-quality work relations across col- 
laborating companies may be beneficial to enhance 
collaboration. However, cultural similarities may 
also be an important contributing factor in this 
regard as the majority of employees in the Nor- 
wegian petroleum industry share the same cultural 
background. With a high level of cultural diversity, 
it may be more challenging to form high-quality 
work relationships when employees have different 
cultural values and beliefs and speak different languages.
We still do not know enough about the specific
preconditions and factors that are central to
forming high-quality work relations across organi- 
zational boundaries. More research is needed to 
further explore how high-quality relations are 
formed across collaborating companies. 

With the current level of specialization of work 
among companies that we see in today’s petro- 
leum industry, middle managers seem to play an 
increasingly important role in terms of achieving 
a joint focus among involved companies towards 
mutual operational goals, but also for monitoring 
concurrent work processes. In particular, the inter- 
active role managers have, spending a lot of time 
in the sharp-end, was regarded as an important 
element contributing to well-functioning collabo- 
rations. Middle-managers offshore saw it as the 
most important part of their work, and reported 
that being present in the field was important to 
build trust with personnel from various compa- 
nies, and to get a sense of what was going on in 
the operation. Sharp-end workers emphasized that 
managers being present and approachable eased
bringing up safety-related matters. A potentially 
worrisome tendency in the petroleum industry is 
the increasing amount of paperwork required of 
middle managers. Informants with management 
responsibilities reported concern about the extent 
to which increasing amounts of paperwork ate 
away the time they could spend in the sharp end, 
and were worried about the safety effects this may 
have in the long run. Research has shown that 
managers play an important role in terms of shap- 
ing safety performance (Mearns et al., 2003, Flin, 
2003). While increasing paperwork in many ways 
is the result of a stronger focus on documentation 
of safety-management in the industry, our find- 
ings suggest that this may have the opposite effect 
than intended, at the expense of activities that have 
important, but perhaps less obvious safety func- 
tions. Arguably, the safety functions embedded in 
middle-managers’ interactive role in terms of man- 
aging interorganizational complexity appear to be 
important, but seem to be poorly understood. 


5 CONCLUSION 


The aim of this paper has been to present findings 
from the project “interorganizational complex- 
ity and risk of major accidents” and discuss how 
the implications from this project can be applied
in the industry. Findings call for more awareness
concerning interorganizational safety challenges in
the industry, and point to specific areas in which
challenges can arise. In particular, coordinating 
work processes, ensuring sufficient levels of com- 
petence and transferring lessons learned and best 
practices across companies are identified as areas
in which challenges can arise. However, the findings
also suggest that interpersonal factors and the
presence of high-quality work relations are impor- 
tant to achieve safety performance across collabo- 
rating companies. In this regard, middle managers 
appear to play a central role in aligning employees 
from collaborating companies towards a shared 
focus on operational and safety related goals. 
ABSTRACT: Fire and rescue services in Norway dispatch more often to false and unnecessary alarms
than to real fires and accidents. In 2016, 60% of the emergency dispatches were conducted on the basis 
of false or unnecessary alarms. These unnecessary dispatches are costly in terms of time and resources 
spent, and can in some cases lead to a weakened preparedness towards real incidents. Also, the risk for 
traffic accidents increases when big vehicles rush through the streets on their way to where the alarm was 
triggered. Hence, there are good reasons to work to reduce the number of dispatches of this kind. On the
other hand, one may also argue that there can be some positive effects of a certain number of mobiliza- 
tions for the fire crews. Based on interviews with relevant actors connected to fire and rescue services, as 
well as on statistics collected through the BRIS reporting system, we will discuss possible consequences 
of reducing the number of false and unnecessary alarms and potential effects of implementing measures 
for decreasing unnecessary dispatches. 


1 INTRODUCTION


Fire and rescue services in Norway more often
dispatch to different types of false or unnecessary
alarms than to real fires and accidents. In 2016, 59%
of the emergency dispatches were conducted on the
basis of false or unnecessary alarms (DSB 2017). The
numbers are increasing, from 20 000 in 2013 to
50 000 in 2016. An interesting point is that the number
of emergency dispatches conducted on the basis of
false or unnecessary alarms varies a lot between
different municipalities in Norway; some have 40% and
some as high as 75% (numbers from 2010). There is
currently little knowledge about why this is so.

Unnecessary dispatches are costly in terms of
the time and resources spent, and can, in some
cases, lead to a weakened preparedness towards
real incidents. Also, the risk for traffic accidents
increases when big rescue vehicles rush through the
streets on their way to where the alarm was
triggered. Hence, there are good reasons to work to
reduce the number of dispatches of this kind. On
the other hand, one may also argue that there can
be some positive effects to be derived from a
certain number of mobilisations for fire crews.

As a response to the increasing number of
false alarms, several fire departments in Norway
are seeking to develop measures to reduce these
numbers. This includes different technical
measures and operational measures, as well as fines
for building owners. We will discuss the effects
of these measures, both in terms of reducing the
number of unnecessary and false alarms, but also
in terms of how they may affect the overall safety.
This paper is based on interviews with firefighters
and other relevant professionals, as well as
on existing statistics and reports. We will discuss
the effects of false alarms from an organizational
perspective. We will also discuss DSB’s (Directorate
for Civil Protection and Emergency Planning)
reporting system BRIS, which was introduced in
2015 and contains all assignments registered at the
alarm centrals. With this data as a starting point,
we discuss how projects to reduce the number
of unnecessary dispatches may be designed in a
directed manner, so as to retain the positive effects
of these dispatches in terms of training and
reduction of risk and avoid the negative effects of
implemented measures in terms of the response to real
events.

2 BACKGROUND


The following section will provide some facts 
about the background of the paper and the kind 
of information and data it is built upon. It will give 
the definitions of false alarms and unnecessary dis- 
patches, describe the new reporting system BRIS 
and also describe the qualitative data collected in 
order to secure a deeper understanding of the data 
collected in the BRIS reporting system. 


2.1 Definitions and terms 


Several different terms and concepts are used to 
address the issue of unnecessary and false alarms. 
In English-language literature, it is more common 
to talk about ‘false alarms’ for all alarms resulting in 
unnecessary dispatches (Chagger and Smith 2014; 
Karter 2013). In Denmark, it is also common to 
use the term ‘alarm’, and not ‘dispatches’, but there 
they distinguish between ‘false’ and ‘blind’ alarms (blind
alarms being the same as unnecessary alarms). In 
Norway, the use of the term ‘false alarm’ is con- 
sidered imprecise, and it is recommended to dis- 
tinguish between ‘false alarms’ (intended), and 
‘unnecessary alarms’ (unintended). Since the term
‘alarm’ can result in some confusion, the preferred 
mode of referring to this issue in Norway is to talk 
about the need to reduce ‘unnecessary dispatches’. 
Since fire brigades respond to all types of alarms,
reduction of unnecessary dispatches must involve 
the reduction of both unnecessary and false 
alarms. 

This paper will mainly use the term ‘unnecessary
dispatches’ when addressing this issue. We
have also chosen to use the term ‘dispatch’ and not 
‘response’. The term ‘response’ covers a range of 
actions instigated because of an alarm, while we, 
in this paper, only want to concentrate on response 
dispatches—when fire and rescue services vehicles 
rush out due to an alarm. The common practice 
in Norway is for the alarm central to dispatch a 
basic fire response unit, a standard vehicle with 
four firefighters, who investigate an incident at 
site, irrespective of their expectations of whether 
the alarm is genuine or not. 

Another term that will be used frequently is 
Automatic Fire Alarm systems, or AFA-systems. 
An AFA system is an installation comprising detec- 
tor units connected to a central. This is in contrast 
to a smoke alarm, which is a standalone unit. The 
AFAs described in the current paper are also con- 
nected to an alarm central. In Norway, these alarm 
centrals are called ‘110-centrals’, and their primary 
task is to address emergency messages for the fire 
and rescue services, alerting and calling out crews, 
establishing connections between relevant emer- 
gency actors and logging events. The AFA system 
immediately, or after some pre-set delay, transfers 
the alarm, which in turn leads to a dispatch from 
the local fire and rescue service. AFA-systems are 
mandatory in a number of buildings where there 
are many people, such as nursing homes, schools 
and public buildings. In addition, companies vol- 
untarily choose to install AFA-systems that are 
directly connected to the alarm central. 


2.2 BRIS—Reporting system 


DSB, the Norwegian Directorate for Civil Protection,
introduced the reporting system BRIS (an acronym of
the Norwegian words for Fire, Rescue, Reporting and
Statistics) on 1 January 2016. BRIS is a national
reporting system that gathers information about the
Norwegian fire brigades’ assignments. The overarching
goal of the system is to give the fire brigades a good
foundation for developing targeted measures in their
preventive work, to develop their emergency work, and
to increase data quality so that local and national
decision makers have a better knowledge base for
learning and improvement. Another goal is to provide a
more suitable user interface that allows reporting to be
done at the event site or in the fire truck on the way
back to the station. Over the last few years, DSB and
the Norwegian fire brigades have become aware of the
high and increasing number of false and unnecessary
alarms, which lead to unnecessary dispatches. With BRIS,
they have obtained better statistics and data on what
causes these alarms. Still, there are data-quality issues
to be sorted out, since the BRIS data depends on how
dispatches are categorised and, in this respect, there
are some differences between fire departments and alarm
centrals.
Later on, examples will be provided as to how 
fire brigades can use this data to develop measures 
to counter the increase of unnecessary dispatches. 


2.3 Data collection 


This paper employs statistics from BRIS that describe
the situation regarding false and unnecessary alarms and
unnecessary dispatches. In addition, from March to
November 2017, we collected qualitative data in:


one large fire brigade; 4 interviews (group and
single) in the emergency department, alarm centre
(110-central) and preventive department,

two small fire brigades; interviews with the two
chief fire officers,

one emergency exercise in a small fire brigade;
observations,

the work done by a project group in a large fire
and rescue service, established to develop measures
for reducing the number of unnecessary dispatches;
observations, discussions and document study.


This paper also builds upon knowledge gained in
earlier projects on the organisation of Norwegian fire
brigades and on the development of measures for reducing
the number of fatal fires amongst vulnerable groups
(Fenstad et al. 2013; Storesund et al. 2015; Gjøsund
et al. 2017; Gjøsund & Almklov 2016; Halvorsen et al.
2016).


3 UNNECESSARY DISPATCHES AND THEIR 
CONSEQUENCES 


3.1 The situation in Norway 


Data from BRIS show that 60% of the Norwegian fire
brigades’ dispatches are unnecessary. On average, fire
brigades in Norway had 137 unnecessary dispatches each
day. The share of emergency dispatches made on the basis
of false or unnecessary alarms varies considerably
between municipalities in Norway. Some have a proportion
of 40%, whereas others have a proportion as high as 75%
(numbers from 2010). There is currently little knowledge
about why this is so, but it could be due to variation in
the number of buildings that are directly connected to
the alarm central, or to emergency controllers having
different formal procedures and assessing the response
to incoming alarms differently. Variations in
demographics, housing stock and municipal organisation
may also account for some of the variation.

Norway consists of more than 400 municipali- 
ties that range in population from 200 inhabitants 
to 650 000 (Oslo). These municipalities are highly 
diverse in terms of their demographic profile, 
organisational structure and available resources. 
Equally diverse are the fire and rescue services, in 
terms of ownership, management and organisation. 
Some municipalities own and run their own fire and
rescue services, while others collaborate with
neighbouring municipalities on all or parts of the
services. Large fire brigades consist mostly of full-time
employees, while the small fire brigades have mostly
part-time employees. There are 335 fire and rescue
service brigades and 620 fire stations, but we see a
trend towards merging brigades in order to attain larger
and more specialised services. In Norway, the fire
and rescue service is the only emergency agency that
is municipal. It has a higher density of stations and
clear requirements for response times, and thereby
exhibits shorter response times than the police and
health services.

Most of the alarms resulting in unnecessary
dispatches come from automatic fire alarm (AFA)
systems going off due to various disturbances or errors,
smoke detectors reacting to cooking, steam or dust,
people triggering alarms by mistake or, in some cases,
intentional triggering. Not surprisingly, there is a
correlation between the size of the municipality, in
terms of number of inhabitants, and the relative number
of unnecessary alarms due to AFA-systems. The main
reason is most likely that small places have fewer
objects directly connected to the alarm central than
larger places do.


3.2 Comparison with Denmark 


Denmark has a reporting system very much like BRIS.
It is called ODIN (an acronym of the Danish words for
Online Data Registration and Reporting System). The
latest version, which includes registration of
information about AFA-systems, was released in July
2015. This means that the first year of complete data
is 2016, the same as in Norway. In Denmark, 44% of the
dispatches were due to false or unnecessary alarms, and
42% of all alarms came from AFA systems. Of these AFA
alarms, only 9% were real; the rest were categorised
as blind (or unnecessary). In 2016, there were
dispatches to over 6000 addresses, and 143 of these
addresses alone accounted for 2507 dispatches. This
means that 2.3% of the addresses had 16.3% of all
dispatches.
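
The concentration figures quoted for Denmark can be cross-checked with simple arithmetic. The following is a minimal sketch only: the variable names are ours, and the totals are back-calculated from the rounded percentages in the text, so they are approximations and not figures reported by ODIN.

```python
# Figures quoted in the text for Denmark, 2016.
worst_addresses = 143        # addresses with the most dispatches
worst_dispatches = 2507      # dispatches to those addresses
share_of_addresses = 0.023   # 2.3% of all addresses
share_of_dispatches = 0.163  # 16.3% of all dispatches

# Back-calculated totals (assumption: the quoted shares are rounded,
# so these totals are only approximate).
implied_addresses = worst_addresses / share_of_addresses
implied_dispatches = worst_dispatches / share_of_dispatches

# Average load on each of the most affected addresses.
mean_per_worst_address = worst_dispatches / worst_addresses

print(f"Implied number of addresses:  {implied_addresses:.0f}")
print(f"Implied number of dispatches: {implied_dispatches:.0f}")
print(f"Mean dispatches per worst-case address: {mean_per_worst_address:.1f}")
```

The implied number of addresses comes out at roughly 6200, which agrees with the "over 6000 addresses" stated above, and each of the 143 addresses received about 17.5 dispatches on average during the year.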

As in Norway, the blind alarms (or unnecessary
dispatches) are unevenly distributed throughout
the country. The variation probably reflects an uneven
distribution of institutions, and thereby of buildings
with AFA-systems, throughout the country
(Beredskabsstyrelsen 2017). Not surprisingly, the number
of blind alarms from AFA-systems has also increased,
from 9 000 in 2007 to 15 500 in 2016. The interesting
thing is that even though the absolute number of blind
alarms has increased, the relative number has decreased,
from 8 blind alarms per 1000 detectors in 2007 to 5.7 in
2016 (Beredskabsstyrelsen 2017). This could mean that
AFA-systems in general are more reliable than they were
10 years ago. We have not been able to find good data on
how many AFA-systems are in use at any time in Norway,
and how many of these are connected to an alarm central.
(It may be possible to get approximate numbers by
contacting all the alarm centrals in Norway, but this has
not been part of the project’s scope.) It is likely that
there has been a relative decrease in the number of
unnecessary alarms from AFA-systems in Norway as well.
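
The Danish figures imply how fast the number of installed detectors must have grown for the absolute number of blind alarms to rise while the rate per detector falls. The sketch below is illustrative only: the detector counts are derived by us from the two quoted figures and are not reported as such in the source, and the variable names are our own.

```python
# Blind AFA alarms in Denmark, as quoted in the text.
blind_alarms = {2007: 9_000, 2016: 15_500}
# Blind alarms per 1000 detectors, as quoted in the text.
alarms_per_1000_detectors = {2007: 8.0, 2016: 5.7}

# Implied detector count = alarms / (alarms per 1000 detectors) * 1000.
# Assumption: both quoted figures refer to the same population of detectors.
implied_detectors = {
    year: blind_alarms[year] / alarms_per_1000_detectors[year] * 1000
    for year in blind_alarms
}

alarm_growth = blind_alarms[2016] / blind_alarms[2007]
detector_growth = implied_detectors[2016] / implied_detectors[2007]

for year, count in implied_detectors.items():
    print(f"{year}: about {count:,.0f} detectors implied")
print(f"Blind alarms grew by a factor of {alarm_growth:.2f}")
print(f"Detectors grew by a factor of {detector_growth:.2f}")
```

Under these assumptions the number of detectors more than doubled while the number of blind alarms grew by less than a factor of two, which is what a falling rate per detector requires.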


3.3 Consequences of unnecessary dispatches 


The negative consequences of unnecessary dispatches
are well known, and they are the background for trying
to reduce the number of these dispatches. First of all,
an unnecessary dispatch is resource demanding, as it
takes a crew and a vehicle away from other duties such
as training and maintenance, and makes them unavailable
for other call-outs. In the case of part-time personnel,
it also costs additional money, since the crew is paid
extra for dispatches. There is also a direct risk
connected with the traffic hazard posed by vehicles
speeding with blue lights through traffic. A longer-term
negative effect, and a concern noted by several of our
informants, is the possibility of a cry-wolf effect,
both for the evacuation of the buildings concerned,
which may be delayed, and for the fire brigade itself.
However, possible positive consequences of unnecessary
dispatches have been very little discussed.


4 SOME FINDINGS IN NORWEGIAN 
FIRE BRIGADES 


While it has been generally recognised that a large
and increasing share of the dispatches for the
Norwegian fire and rescue services have been
unnecessary, the introduction of the national BRIS
database represented a step change in the documentation
of dispatches. In BRIS, all call-outs from the alarm
centrals are given a preliminary code by the operator,
and are later given a final code by the responding unit.
While this database also has limitations, particularly
when it comes to how to categorise the different
dispatches (something we will discuss later on), it has
laid the foundation for directed efforts to reduce the
number of unnecessary dispatches, both at a national
level and for individual fire services. The statistics
clearly demonstrate that a majority of dispatches,
particularly for urban fire brigades, are triggered by
automatic detectors, and that in the vast majority of
these cases there is no fire triggering them.


4.1 Developing and implementing measures 


There have been increasing concerns in the fire
community about unnecessary dispatches over the
past years; however, it is only since the implementation
of BRIS that the reporting and the statistics have
become more accurate, and that it has been possible to
establish facts about the actual causes of the
unnecessary dispatches. Based on this data, the
Directorate for Civil Protection (DSB) has requested the
local fire brigades to try to find ways to reduce these
unnecessary dispatches, and several fire brigades have
started this work. Since the complete BRIS data is quite
new, few measures have been implemented so far.


4.2 Large fire and rescue services 


We know that a large fire and rescue service in
a large city in Norway increased the fines for
unnecessary dispatches from 5500 to 8000 N.kr. They
hoped this would motivate house owners to maintain their
alarm systems better and thereby reduce unnecessary
alarms. After 8 months, there was no registered
improvement, but they still hope for a long-term effect.
There is, however, a fear that increasing fees may
result in lower fire safety, which will be discussed
later on.

Another big fire brigade in Norway appointed
a project group to develop measures that would
reduce the number of unnecessary dispatches by
20% by the end of 2018. The average share of
unnecessary dispatches in Norway in 2016 was
59%, while the share for this fire brigade was as
high as 64%. In several real fires, they had experienced
that persons living in the building did not
evacuate or follow instructions because they had
lost respect for fire alarms after too many unnecessary
alarms. Through BRIS data, they found that
they had dispatched almost 2400 times to unnecessary
or false alarms throughout 2016, and that
less than 10% of these dispatches were the result
of false alarms. Further, they found that a majority
of the unnecessary dispatches were to addresses
with Automatic Fire Alarms (AFA). There were
532 unnecessary responses to only 63 addresses,
with between 5 and 28 dispatches to each address. They
found that the most common causes of unnecessary
dispatches were: wrong use or placement of
detector (34%), technical or unknown error in the
AFA-system (19%), intentional false alarm (11%),
dust due to construction work (8%) and
steam (7.5%). The project group in this fire brigade
therefore decided to target measures at these worst
cases, noting that, if they succeeded, a reduction in
such cases alone would make them reach the target of a
20% decrease in unnecessary dispatches. The project
group developed targeted measures for reducing
unnecessary dispatches. The three most important
are: 1) The provision of information and guidance
to house owners/residents on how to avoid unnecessary
alarms, 2) Focus on unnecessary alarms during
the fire and rescue service’s supervision of
AFA-systems, and 3) Particular and direct follow-up on
the worst cases.¹
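
The project group's reasoning that the worst cases alone could deliver the 20% target can be checked against the quoted numbers. This is a rough sketch under stated assumptions: "almost 2400" is treated as exactly 2400, all variable names are ours, and the calculation says nothing about how achievable such a reduction is in practice.

```python
# Figures quoted in the text for one large Norwegian fire brigade, 2016.
unnecessary_dispatches = 2400   # "almost 2400"; treated as 2400 (assumption)
worst_case_dispatches = 532     # dispatches to the most affected addresses
worst_case_addresses = 63
target_reduction = 0.20         # goal: 20% fewer unnecessary dispatches

# Number of dispatches that must be avoided to meet the target.
required = unnecessary_dispatches * target_reduction

# Share of all unnecessary dispatches that the worst cases account for.
worst_case_share = worst_case_dispatches / unnecessary_dispatches

# Fraction of worst-case dispatches that would have to be eliminated
# if the target were met through the worst cases alone.
required_fraction = required / worst_case_dispatches

print(f"Dispatches to avoid for the target: {required:.0f}")
print(f"Worst cases as share of all unnecessary dispatches: {worst_case_share:.1%}")
print(f"Fraction of worst-case dispatches to eliminate: {required_fraction:.1%}")
print(f"Mean dispatches per worst-case address: "
      f"{worst_case_dispatches / worst_case_addresses:.1f}")
```

On these numbers the 63 addresses account for about 22% of the unnecessary dispatches, so the target is reachable through the worst cases alone only if roughly nine out of ten of those dispatches are avoided.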

The tendency for a large number of dispatches to go to
a few addresses with AFA-systems installed is the same
as that found in Denmark.

¹ These numbers and facts are taken from the BRIS database
and from a presentation the mentioned project group
held at the Forum for Fire Safety in Trondheim, Norway,
on 14 November 2017.


4.3 Small fire and rescue services


None of the small fire and rescue brigades we talked
to had started to implement measures for reducing
unnecessary dispatches. In contrast to the big fire and
rescue services, they did not see the same need to
reduce them. The main reason for this was that the total
number of alarms was not very high. When the total
number of dispatches is low, smaller fire brigades do
not get the same opportunity as large fire brigades to
maintain their skills through dispatches. Since most of
the firefighters in the small brigades are part-time,
unnecessary dispatches are an important part of their
training, much more so than for the full-time
firefighters who meet and train together each day. The
small brigades were afraid that their skills would
decline if the number of unnecessary dispatches were
reduced. Part-time fire personnel can earn as little as
20 000 N.kr a year, even though they are required to
have the same skills as full-time firefighters. However,
they earn extra pay for each dispatch. Although it is
sometimes assumed that small fire brigades are reluctant
to reduce the number of unnecessary dispatches because
of the earnings, this was never mentioned as a reason by
the small brigades.

The regulation on the dimensioning of fire
departments scales the size of the brigade according
to the size of the population. Places/municipalities
with fewer than 3000 inhabitants have small
fire brigades consisting of part-time personnel
without fixed schedules, while big cities only
employ full-time fire personnel. For the small fire
brigades, it is very difficult to possess the same
specialised knowledge and skill set as the big fire
and rescue services with full-time employees. However,
one advantage that public services in small places
have is their situated and local knowledge. Part-time
fire personnel have knowledge about houses
and people from their regular jobs, as well as
through neighbours and friends living in the small
place. They are very much aware of this advantage,
and try to strengthen it. In the interviews,
they said that the unnecessary dispatches are a
valuable source of even deeper local
knowledge of the special objects and people in
their neighbourhood.


5 UNINTENDED IMPLICATIONS? 


As we have seen, the overall goal for the Norwegian
Directorate for Civil Protection (DSB) is to
decrease the number of unnecessary dispatches.
For large fire and rescue services this is also an
explicit goal, and they have started to develop and
implement measures to reduce these kinds
of dispatches. There are, however, some questions
that need to be reflected upon when
deciding which kinds of dispatches are to be
reduced, and whether some measures could have
unintended negative consequences.


5.1 The big increase and uncertainty in
categorisation


As mentioned, there has been a large increase in
unnecessary dispatches in Norway, from 20 000 in
2013 to 50 000 in 2016. Differences in reporting
practices are central to understanding this
increase and also the implications of reducing the
unnecessary dispatches. It is reasonable to think
of at least three reasons for the increase: 1) The
increase correlates with the number of detectors
connected to the alarm centrals, which explains the
majority of the increase, 2) The introduction of the
BRIS registration system has led to a more accurate
and systematic registration of not only real
alarms but also unnecessary alarms and dispatches,
and 3) Those who register the dispatches
in the BRIS system are uncertain of how to categorise
them, and systematic biases can occur. For example,
there may be different assessments of whether an
alarm is unnecessary when it is the result
of smoke from cooking. In interviews
with firefighters and employees at alarm centrals, we
have been told that it can be difficult to categorise
borderline cases, and that systematic tendencies
can occur in the reporting of such incidents. One
concrete example concerns cooking: if the
dish is still edible, the dispatch is to be categorised
as unnecessary, and if it is not possible to eat the
food, the dispatch is to be categorised as necessary,
even in cases where there is no fire, since it can be
seen as a preventive dispatch. But who is to decide
whether the food is edible or not? Further, is the
dispatch unnecessary if it prevented further development
of the situation, or if it reduced the probability
of this kind of scenario happening again?
Since the complete BRIS data is available only
for 2016, it is not yet possible to determine how
much of the increase in unnecessary dispatches
is due to each of these factors. In a few years,
however, it will be possible to evaluate the numbers
and variables available in this registration system.
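
As a consistency check, the annual total quoted here can be compared with the daily average of 137 unnecessary dispatches reported in Section 3.1. The sketch below is illustrative only: it assumes that the daily average refers to 2016, which was a leap year, and the variable names are ours.

```python
# Figures quoted in the text.
unnecessary_per_day = 137     # average per day (Section 3.1)
dispatches_2013 = 20_000      # unnecessary dispatches in 2013
dispatches_2016 = 50_000      # unnecessary dispatches in 2016

# Assumption: the daily average refers to 2016, a leap year with 366 days.
days_in_2016 = 366
annual_from_daily = unnecessary_per_day * days_in_2016

# Relative increase from 2013 to 2016.
relative_increase = (dispatches_2016 - dispatches_2013) / dispatches_2013

print(f"Annual total implied by the daily average: {annual_from_daily}")
print(f"Increase from 2013 to 2016: {relative_increase:.0%}")
```

The daily average implies an annual total of just over 50 000, in line with the figure for 2016, and the increase since 2013 corresponds to 150%.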


5.2 Possible positive consequences of unnecessary 
dispatches 


An important question to ask when considering
measures to reduce the number of unnecessary
dispatches is whether there are also some positive
effects. In some cases, one may expect that a certain
number of mobilisations can be useful for training
fire crews. As long as the dispatches are not too
repetitive, they may provide the organisation with
important training and expose it to a variety of
scenarios that strengthens its resilience beyond the
limits of normal training procedures. This is especially
important for smaller fire and rescue departments
with part-time employees and few dispatches.
Since real fires are rare, everything they can learn
about emergency preparedness and risk is considered
useful, and the unnecessary dispatches are
used in the mobilisation phase for training in basic
skills and catching up with colleagues. In large fire
and rescue services, on the other hand, where the
employees are with their brigade each day, unnecessary
dispatches are seen as disturbing. The positive
effects of unnecessary dispatches are therefore more
obvious in small brigades than in large ones.

Also, many of the unnecessary alarms are triggered
in buildings (such as care centres for the
elderly) and areas where the fire risk is high, or
where there are vulnerable groups, so dispatches
to these locations might provide useful knowledge
for future scenarios. They might provide information
regarding incumbent risk, as in Turner’s (1976:381)
“incubation period”, strong or weak signals that
might trigger pre-emptive measures, or other
forms of learning for the fire department. As such,
they might help the fire crews prepare for future
genuine alarms and can also be a way to prevent false
or unnecessary alarms in the future. An unnecessary
alarm can also be considered an exercise for people
who have to evacuate buildings. Studies have shown a
reduction in evacuation times through the repetition of
evacuation drills (Hamilton et al. 2017).

Thus, though a dispatch may be unnecessary, 
seen as an isolated response, it might contribute 
to reducing risk in some ways, and in some cases. 
Measures towards reducing the numbers of unnec- 
essary dispatches must be seen in the light of these 
possibilities for learning and risk reduction, as well 
as the cost and risk associated with the dispatches. 


5.3 Possible negative consequences of measures 


It is possible that measures implemented to reduce
unnecessary dispatches may directly or indirectly
harm emergency preparedness or the response
time to some types of fires. Fines may, for example,
lower the threshold for disconnecting automatic
sensors, or for not installing them in the first place.
As mentioned above, if the measures are successful
in buildings where there is a greater probability
of a fire starting, like buildings housing vulnerable
groups (Gjøsund et al. 2017), the measures for
reducing unnecessary dispatches could indirectly
result in the loss of valuable knowledge, of
possibilities to detect fire hazards, and of chances to
suggest and implement fire-preventive measures amongst
vulnerable groups and in exposed houses.

Because of the reporting system, BRIS, it has 
been possible to develop more appropriate and 
targeted measures for reducing unnecessary dis- 
patches. Since the measures for large fire and 
rescue services are in an early phase, and because 
BRIS was implemented relatively recently, there 
are currently no clear results indicating the impact 
or the implications of the measures. What is cer- 
tain is that the reporting system BRIS will be a 
useful tool when analysing the results and implica- 
tions of the implemented measures. 


6 CONCLUDING REMARKS 


By developing measures, like those of the large fire
brigades, which are directed towards the “worst
cases” (i.e. those with several unnecessary alarms
due to dysfunctional AFA-systems), the chance
of succeeding is greater than if there were no such
measures. It is also likely that this strategy for
reducing unnecessary dispatches will mainly affect
the recurring incidents that do not give valuable
knowledge, while the dispatches that can give
valuable knowledge (for instance, about vulnerable
groups or local and demographic knowledge) are
more likely to persist.

One should not regard unnecessary dispatches as
a homogeneous category and implement measures
blindly aimed at reducing the numbers. Rather, the
detailed statistics of BRIS provide the opportunity
for more targeted and more effective reduction
measures, by which one can also avoid some possible
pitfalls that might lead to an increase in overall risk.
We have suggested two such pitfalls: 1) Measures
may lead to increased fire-related risk if they prompt
users to reduce the number of alarms themselves,
for example by removing sensors altogether,
by reducing their sensitivity or by increasing the time
before they are triggered, and 2) Measures may lead
to a decrease in dispatches that are unnecessary in
the sense that there is no actual fire to put out, but
that may have other positive consequences in terms of
training for the personnel, or as opportunities to
obtain better knowledge of vulnerable buildings.

Still, as our case study has shown, this leaves
many unnecessary dispatches to be eliminated. The
project we studied addressed frequently recurring
dispatches to certain buildings in a targeted manner,
dispatches that clearly did not fall under the
categories where the side effects might be negative.
In sum, then, we conclude that measures to reduce
the number of unnecessary dispatches are important,
but that they should be implemented based
on a careful evaluation of the potential negative
effects they may have. Evaluating the different
categories of reported data in terms of the positive
and negative effects that measures have on total
fire risk, regarding both likelihood and consequences,
is an important first step.
ABSTRACT: We discuss the role of part-time firefighters as a resource for local emergency management
in Norway. Informal social relations, the trust between practitioners and the social capital of the
organization, have been recognized as a resource for emergency management, particularly as they contribute
to improvisation and coordination between actors belonging to different professional groups. Likewise,
social capital, the trust among citizens, has been identified as a resource for societal resilience in crises. We
discuss a combination of these forms: how the social embeddedness of the emergency practitioners in the
community and the multiplexity of roles are important for community resilience. These professionals know
each other through several different social roles, and have resources beyond the formal capacities their
position should suggest. Thus, role multiplexity and social networks provide functional redundancy
and are a resource for resilience in the management of incidents and emergencies. These abilities are hard
to make visible in a work plan and challenging to include in exercises. Moreover, these abilities are affected
by recent developments towards professionalization and centralization.


1 INTRODUCTION


1.1 Background of our study


Based on studies of part-time firefighters we discuss
how the combination of professional roles
and embeddedness in the community of this group
is a resource for community resilience. This is done
through discussing how local knowledge, social
capital and role multiplexity influences practice
and decisions in emergencies.

Our paper supplements the literature on community
resilience and emergency management by
[...]sional roles are both important parts of their
capabilities in the prevention and management of
emergencies.

We understand community resilience as the
adaptive capacity of a community when faced with
emergencies. Key elements of this are improvisation
and redundancy (both in terms of resources
and competence). This will be elaborated in the
theory section.

What we describe here as role multiplexity is the
fact that these professionals have different professional
and social roles, that their “hat”, or in the
case of the firefighters, helmet, does not represent
their only relevant role for the way they solve their
tasks in emergencies.¹ Moreover, role multiplexity
is an important contributing factor for the
firefighters’ local knowledge, a competency that
is well-recognized in the emergency management
community.

Thirdly, the notion of social capital is used
to describe the networks of trust relations as a
resource for coordination in emergencies.²

These characteristics, we will argue, have proven
to be important in the management of several
emergencies in rural districts in Norway. Interestingly,
they tend to elude description in formal
documents, and thus risk being undermined by
administrative reforms in the domain of societal
safety and emergency preparedness, such as
developments towards centralization and
professionalization of the fire departments.
Understanding and documenting the specific competence
and community role of this group is important input
to such processes.

¹ The notion of role multiplexity is inspired by sociological
theory on modernity and bureaucracy. Whereas one
individual in a bureaucratic organization has one role
only, and thus uniplex relations to the ones he interacts
with, in small scale communities most people have several
relations to each other. (Durkheim discussed in Brøgger
(1993: 26ff), see also Almklov et al, 2017).

² The individual sense of the term social capital is
inextricably related to the works of Bourdieu (1986), viewing
social capital as a source of power individuals possess
and use to further their interest. The collective view
on social capital is particularly associated with Robert
Putnam (1995), viewing it more as a property of a group,
a community or a society. We are here referring to the
latter, as a descriptor of how trust based networks are
resources for collective action.


1.2 Part-time firefighters in Norway, 
ongoing changes 


Norway consists of more than 400 municipalities
that range in population from 200 inhabitants to
650 000 (the capital, Oslo). The municipalities are
highly diverse in terms of demographic profile,
geography, size, organizational structure, and
available resources. Equally diverse are the fire and
rescue services, in terms of ownership, management,
and organization. Some municipalities own and
run their own fire and rescue services, while others
collaborate with neighbouring municipalities
either by having joint fire brigades or just in
providing parts of the services. The fire departments
and the placement of fire stations are dimensioned
according to specific criteria for response times,
leading to a relatively high density of fire stations
and shorter response times in most areas compared to
other emergency services.

Whereas large fire brigades in cities and towns
rely mostly on full-time personnel, smaller fire
brigades are largely dependent on personnel in
different forms of part-time employment. For a large
fraction of the latter, their regular payment only
covers mandatory training and a small reimbursement
for being on call, plus additional pay for dispatches.
This means that they typically have full
employment in other trades. The fire and rescue
services in Norway today comprise roughly 3500
full-time firefighters and around 8000 part-time
firefighters. In principle, the competence demands
for part-time firefighters are supposed to be
equivalent to the basic requirements for full-time
personnel. However, in terms of technical skills and
training, they tend to lag behind these demands, while
full-time personnel, on the other hand, train and
rehearse their skills well beyond them. Recruitment,
both of full-time and part-time personnel, has often
sought people with relevant technical skills from other
trades, such as carpenters, electricians and people
with military training. For part-time personnel, an
additional requirement is often that they live close to
the fire station, to be able to mobilize quickly. In
rural districts, farmers often make up a significant
portion of the crew.


1.3 Societal safety and emergency preparedness


In Norway, the municipality has a key responsibility
in terms of risk management and emergency
preparedness. In principle, the municipality “owns”
the total risk picture within its borders, and should
have up-to-date all-hazards risk and vulnerability
analyses and emergency plans. In small municipalities,
the responsibility for this is usually given to an
emergency management coordinator, typically an
official in the technical department of the
municipality who has this as a fraction of their
position. In terms of operative resources, the Fire and
Rescue Service (FRS) is a crucial first line of
response, but the emergency plans also include other
municipal personnel, as well as external actors
(volunteers, industry, municipal technical department etc.).


1.4 The fire & rescue service, the only remaining 
generalists in local emergency management? 


Societal safety as a term and policy area came 
about after the end of the Cold War. Gradually, the
resources available for emergency preparedness in 
the public sector in Norway in particular, and in 
Western Europe more generally, have been reduced 
since then. Moreover, as measures have been imple- 
mented to make the public sector more effective 
and goal oriented, through outsourcing and mar- 
ket based restructuring, generalist capabilities and 
functional redundancy have systematically been 
reduced. Other operative capacities in the public
sector have been trimmed (army, home guard, publicly
owned technical services such as roads, energy, port 
authorities). Within this picture, the FRS is one of 
very few remaining generalists with a substantial 
redundancy. A result of this is, according to our
informants in the FRS, informants from other sectors
and public reports (DSB 2013, Øren et al. 2016),
that the scope of tasks for the FRS is expanding.

In addition, in the rural areas, police services 
and emergency health services are generally sparse, 
so the FRS tend to be first on site for accidents 
and incidents of all sorts. This means that they 
sometimes must fill in for medical personnel or 
the police while waiting for ambulances and police 
patrols. 

The FRS still put out fires, but increasingly 
they respond to other accidents (in particular traf- 
fic accidents) and other emergencies. Due to the 
reduced operative redundancy in the public sector 
generally they are becoming increasingly impor- 
tant as first responders to other forms of emer- 
gencies, such as landslides, floods and storms and 
search and rescue operations. For many areas, cli- 
mate change leads to increases in flash floods and 
associated landslides as well as an increasing risk 
of forest fires. 


Another important development for the FRS 
is the implementation of the joint communication 
network for emergency services in Norway called 
Nødnett (lit. “emergency net”). This means that all
part-time firefighters have a communication radio 
at home, serving both as a call out terminal and as 
a communication tool in operations. Their training
in using these is an important resource for
coordination in emergencies, both within the FRS
itself and because they may act as liaisons with
other municipal professionals (Tilset et al. 2015).
In some FRS, the
firefighters are supposed to have the radio nearby 
at all times. These radios further integrate the FRS 
with other emergency services, as they may com- 
municate in shared working groups with the police, 
health services and other relevant actors. The FRS 
are owned by the municipalities, but often, and to an 
increasing degree, they are parts of inter-municipal 
collaborative arrangements. It is an explicit national 
strategy to increase the size and professionalism of
the FRS, and there are several ongoing changes in
the sector. Understanding the unique role and
competencies of the part-time firefighters will be
important to ensure that these changes are successful.


2 THEORY 


2.1 Resilience and robustness 


Resilience is employed here to describe how an 
organization, community or society absorbs 
shocks and ‘bounces back’ after a disturbance 
(Boin & van Eeten 2013). Resilience has become 
a central concept in safety theory over the last
decades. One early contribution was Wildavsky’s
(1987) insistence that a “search for safety” should 
go beyond trying to mend known weaknesses, by 
including a creative exploration of ways to improve 
society’s ability to sustain new challenges. Also, 
the descriptions and analyses of High Reliability 
Organizations (LaPorte & Consolini 1991, Weick 
and Sutcliffe 1995, see also Roe and Schulman 
2008) stressed that designing robust systems was
only one step on the way to achieving high
reliability, and emphasized the need for redundancy,
flexible organizing and organizational mindfulness
to cope with variability. The most prominent theory
on resilience within safety research is found in the 
“resilience engineering” strand of research, where 
an intense focus on managing and learning
from variability as a resource for safety has been a
cornerstone (Hollnagel et al. 2006). 

Outside the safety literature, the concept of resil- 
ience has also been important in studies on a soci- 
etal and community level (e.g. Boin & van Eeten 
2013), then often referring to the community or 
society’s ability to bounce back (or even forwards)
when confronted by major disasters. The literature
on community resilience is broad and diverse
within several research fields (see e.g. Norris et al.
2008 for some background). In contrast, Resilience
Engineering focuses heavily on the importance of 
resilience as a way of avoiding accidents. 

Based on a study of the response to the 9/11 terror
attacks, Kendra & Wachtendorf (2003) identify some
characteristics of resilience. These are redundancy,
resourcefulness, effective communication, and the
capacity to self-organize, undeterred by extremely
challenging circumstances. They point out that
resilience is essentially a set of attitudes concerning
expediency of actions and the propensity to
acquire new capabilities.

There is a big difference between a well laid out 
plan and a plan that is well played out. The former 
points to the ability to foresee and predict while the 
latter refers to the ability to act when the situation 
calls for the use of a plan. Plans cannot guarantee 
the success of how emergencies are handled. They 
can only provide the backbone of an emergency 
response. A common example employed to explain
resilience is to compare hardwood to bamboo.
The former is strong and does not easily break
when it encounters strong winds. This is, of course,
up to a certain threshold. The robust hardwood
tree will eventually break if the wind blows at
hurricane strength. In comparison, the bamboo
sways with the wind. It has the ability to bounce
back into place. This ability to bounce back is what
defines resilience: the bamboo is able to adapt. A plan cannot
be made for every single possible emergency situ- 
ation in a community. It is the ability to adapt the 
plan according to the situation as it unfolds which 
will help a community to bounce back after a dis- 
turbance (Boin & van Eeten 2013). 

The part-time firefighters add flexibility to the
community response to local emergencies. There
are at least three aspects that contribute to this
flexibility: the firefighters’ diverse backgrounds and
experience (providing a functional redundancy);
their local knowledge and proximity to the hazards;
and the fact that they combine a rudimentary
organizational structure and means for communication
with social embeddedness in the community (easing
swift coordination with volunteers and other external
resources). Interestingly, there is a good overlap
between these characteristics and those identified
by Kendra & Wachtendorf (2003). Comparisons
between such different contexts should, however,
be drawn with caution.


2.2 Role multiplexity, social capital and 
community resilience 


Part-time firefighters have many ties in the
community. They are members of their local fire
brigade. They may be parents of children attending
the local school, colleagues in the municipal
organization, or electricians, plumbers, factory
workers or farmers. They may also be members of
sports clubs or hunting groups, health professionals,
janitors or loyal customers of the local grocery store.
They have social relations throughout the community
and local knowledge of threats and resources.
The repeated interaction and the networks built in
their local community over time develop and
strengthen their social capital. Social networks, reciprocity
and interpersonal trust are aspects that are critical 
to building social capital (Patterson et al. 2010). 
Social capital and networks among citizens are 
recognized to be critical to disaster survival and 
recovery (Aldrich & Meyer 2015) on a societal level. 
Importantly, here we also include the networks that 
go between the response organizations and the 
community, and that criss-cross organizational 
boundaries in the community (Almklov et al. 2017). 

Local knowledge is an important element in dis- 
aster management. For instance, it can help build 
resilience to flooding in local communities by pro- 
viding local information on actual flood patterns, 
frequency, and risk perceptions in the community 
(Ramsey et al. 2016). 

Bureaucratic organizations are built around uniplex
roles, where the person is his role and that is the
only relevant feature (see Almklov et al. 2017,
Brogger 1993). In practice, however, we see,
particularly in small communities, that there are
spill-over effects from other roles that the
emergency professionals have in the community.
These are often key both in establishing trust
relations that go beyond the formal relations and
in providing the individual with a functional
redundancy in terms of knowledge and capacities.
In the empirical section, we will give some brief
examples of this.


3 METHODS AND DATA 


This paper is based on an aggregate of data from
several projects examining the roles of municipal
emergency preparedness and the organization of
the FRS in Norway: a study with scenario analyses
for the future organization of the FR services in
Norway (Fenstad et al. 2013), a study of different
approaches for intersectoral collaboration between
the fire departments and municipal services in
prevention of deadly fires (Gjøsund et al. 2016,
Halvorsen et al. 2017), a study of the implementation
of Nødnett in Norwegian municipalities (Tilset
et al. 2014), a process analysis for a project to
improve regional collaboration between large and
small FR services in Western Norway (Gjøsund &
Almklov 2017), and a study of municipal emergency
preparedness (Øren et al. 2016). All these projects
have been based on interviews with firefighters and 
personnel that they interact with on a daily basis. 
While the scopes of these projects have been diverse, 
they have all contributed pieces to the puzzle regard- 
ing the role and qualities of part-time firefighters. 
For the purpose of writing this paper, we have
conducted five directed interviews with key
informants: two leaders and one firefighter at a
part-time fire department, and two firefighters in an
urban fire department that has some part-time
personnel affiliated. We also conducted an observation
study (with some informal discussions along the way)
of a training session with a part-time FRS. During a
one-day visit (observation and interviews) to a regional
dispatch central, our discussions included the
capacities, call-out procedures and response times of
the part-time FRS under their control, and also their
typical assignments. In addition, we studied reports
from a selection of recent emergencies in Norway. 


4 EMPIRICAL EXAMPLES 


In this section we give some examples from our 
data, supplemented with reports from recent emer- 
gencies, to illustrate how the role of municipal fire- 
fighters in contributing to community resilience 
can be characterized by the concepts of role multi- 
plexity and social capital. 


4.1 Two illustrative examples


One fire chief leading a fire brigade in two rural
municipalities explicitly valued the varied
competencies of his crew. The long distances meant
that his part-time firefighters had to mobilize for
all sorts of accidents, and he had employed
personnel whose day jobs were in the medical sector,
i.e. nurses, adding to their ability both to respond
to medical emergencies and to take better
care of elderly people in trouble. Other occupations
also had qualities he valued. He told us
about an accident on a farm where an old farmer
had fallen and had to be rescued from a silo by the
fire department and evacuated by helicopter.
When the helicopter and ambulance had left and
the firefighters were demobilizing and consoling
the old man’s wife, they heard that the cattle were in
distress. They had not yet been milked. His crew,
consisting of several farmers, would not leave the
site before milking the cows, he said. Farmers just
don’t leave cows in distress! This example might
seem of little relevance, too trivial, for grand
discussions of emergency management. However, it
illustrates how their knowledge and professional values
from other occupations spill over into their role as
firefighters.

Another example from the same fire department 
illustrated the role of local knowledge and role 
multiplexity. An avalanche had hit a road, blocking 
the road and possibly covering some cars. While 
the police’s operational leader, formally in charge 
of the rescue operations, was speeding to the site 
from the closest (yet distant) city, the fire depart- 
ment was first on site, starting a search and rescue 
on their own initiative. The leader of the first vehi- 
cle, realizing that they needed equipment to search 
under the snow, went to his other workplace, a ski- 
ing facility, to pick up search and rescue gear there. 
Thus, the firefighters had mobilized the necessary 
equipment before the other services even reached 
the site, again because of the knowledge and access 
to resources provided by the firefighters’ day-jobs. 
It also illustrates the make-do attitude and
improvisational skills they bring to their job.


4.2 The fires of 2013 


Some very prominent examples in the recent
discussions of part-time firefighters’ and municipal
emergency managers’ improvisational skills in
Norway are the fires in 2013. That winter saw a
very rare weather situation, dry and with little snow
in the normally humid coastal areas, leading to
bushfires in winter and a major fire in the wooden
town of Lærdal. All these fires tested the local fire
departments’ ability to mobilize, organize and
execute an efficient response, and in the aftermath
their effectiveness has been subject to debate (see
Andresen 2017, PWC 2014).

Even before the fire in Lærdal started, some of
the firefighters had concerns about the fire risk due
to the combination of strong eastern winds and
dry weather. Thus, there was already an increased
awareness before the fire started. This is
underscored by the fact that the municipality on
earlier occasions had implemented fire watches when
this weather combination occurred, so this type
of weather was a risk recognized by the locals
(Andresen 2017).

When the firefighters mobilized, the response 
had a very improvised nature, seamlessly inte- 
grating volunteers (such as farmers with manure 
spreaders) in the response. As many firefighters 
were municipal employees, they also had good 
knowledge of available resources, such as access 
to the waterworks (to get pumps started when the 
electricity failed) and equipment. When telecom
services and electricity were lost, coordination was
done by improvised means. Evacuation was greatly 
helped by the firefighters’ and volunteers’ knowl- 
edge of where vulnerable people lived. 

The response was improvised and organic.
Representatives from the larger fire brigades that
eventually assisted the firefighting, and the national
fire authority (DSB), noted the lack of organization
as a shortcoming, while the local community
highlighted the effectiveness of the improvised
response. Both views have some support in the
subsequent investigations. The initial response was
clearly effective, and several problems were solved
in creative ways, but as the fire grew and as more
and more resources arrived and needed to be
coordinated (without effective means of
communication, as Nødnett broke down), the
coordination based on local knowledge and social
networks became less efficient.

The response to the Lærdal fire clearly shows
how part-time firefighters may act as an integrated 
part of a closely-knit community, and how their 
role multiplexity and social embeddedness in the 
community proved invaluable resources for their 
response. However, it also shows that this mode of 
organizing has shortcomings in terms of tactical 
leadership and coordination of resources when the 
control span grows. 

Similarly, an external evaluation of the response 
to the bushfire in Flatanger the same winter 
pointed to a deficient plan compensated by good 
local knowledge and well-oiled collaboration 
machinery in the region. According to the investi- 
gation report, the crew exhibited their willingness 
to go beyond what was expected of them despite 
the notably harsh conditions. Some of the fire- 
fighters even lost their homes while extinguishing 
fire to save other people’s homes (PWC 2014:51). 


4.3 Responses to other emergencies 


Seasonal floods, flash floods and water-induced
landslides have become ever more common types of
emergency in Norway over the last decades (DSB 2013,
Fenstad et al. 2013). Flash floods and water-induced
landslides in particular are commonly associated
with climate change, as this leads to more intense
precipitation.

In rural communities, part-time firefighters and
volunteers (farmers and others with access to
machinery) together are the core first responders
to such events. Again, the personal networks and
role multiplexity of firefighters and municipal
employees provide a combination of a rudimentary
organization and access to resources beyond the
standard gear possessed by the fire department.
Moreover, floods and landslides are events that
typically happen in locations that are known by the
locals to be risky, so local knowledge is important
both in prevention and response. During floods, the
local fire department usually assists in pumping out
water that has flooded buildings. Knocking from door
to door, firefighters also often perform the task of
informing the locals of possible flooding in the
area, and of hazards posed by landslides. The FRS
also clears out trees that may pose a risk to the
public or impede traffic. Other notable examples of
the rural fire departments’ new challenges include
the 2013 triple murder on a regional bus in Årdal.
There, the fire department, together with ambulance
personnel, managed to keep the perpetrator under
control while the police response was severely
delayed. In September 2011, a train on Rørosbanen
was derailed. The first emergency personnel to
arrive on the scene were the local part-time FRS.
They started the evacuation of injured passengers
and cared for them until the paramedics arrived.
They also cleared an evacuation path and organized
an assembly point where the evacuated passengers
were registered before they were allowed to leave
the area.

The cases here illustrate some of the variation
and complexity of the tasks facing these fire and
rescue workers. City firefighters face some of the
same complexity, but they have more support from
police and health services and other professionals
and experts. This difference is also a source
of the respect city firefighters have for part-time
crews. The demands for generalist competencies
are higher for the part-time crews. The part-time
personnel are sometimes (incorrectly) referred
to as volunteer firefighters, but their response is
based on a rudimentary organizational structure
and basic training, and it is also better integrated
with more professional responders than that of most
volunteers. In a discussion of volunteers’ response
to the large storms in the southern US (Katrina and
Harvey), Wachtendorf & Kendra (2017) stress the
importance of such coordination for the efficiency
of volunteer responses. Though the part-time FRS
are not volunteers in the strict sense of the word,
they can be regarded as a hybrid form of response,
connecting the volunteer community and the
official response to events.

Figure 1. Summary of our description of the characteristics of the part-time firefighters. (Recoverable labels: role multiplexity; functional redundancy; access to resources and competence through other roles.)


5 DISCUSSION


5.1 Are part-time firefighters only part firefighters,
or do they bring some unique resources to the table?


Based on our studies in different fire departments 
in Norway, it is clear that the part-time firefighters 
are less skilled and have less training for advanced 
firefighting than their full-time counterparts. The 
part-time fire departments also have shortcomings
in formal communication procedures and on the
management side, particularly when faced with
larger incidents requiring coordination outside of 
their personal networks. 

It would be very surprising if these firefighters
did not lack some skills, as they have highly limited
time for courses and training. This is also noted in
the national “Fire Study” (DSB, 2013). They also lack
resources in terms of equipment, and several
firefighters lack formal qualifications. Moreover,
due to the low frequency of call-outs, many of them
struggle to gain practical experience with
firefighting. There is no shortage of problems in
these fire departments. Chronic underfunding has
also led to some FRS operating antiquated vehicles.
Still, when we talk with full time firefighters and
other professionals in the emergency community,
they generally have great respect for the part-time
crews, for their general skills and for their ability
to solve their tasks. One city fireman described how
impressed he was by a part-time FRS near an
accident-prone highway, how they responded quickly
to horrible accidents, and how they on their own
initiative had started taking first aid courses as a
response to the slow response of the ambulances
in that area. Their high motivation and sense of
responsibility were recognized by several
of our informants.

The part-time firefighters are seen, by full- 
time firefighters, as generalists with improvisation 
skills based on their additional occupations. Also, 
the full-time fire departments have traditionally 
strived for this quality, by actively recruiting peo- 
ple with a variety of professional backgrounds, 
but this is even more pronounced in the part-time 
corps. Their local knowledge (of terrain, threats, 
resources, people and buildings) and embedded- 
ness in the social fabric of the community is impor- 
tant for their ability to respond to emergencies, and 
their day job is sometimes an important resource. 

As professional firefighters, they are inferior to 
the well drilled crews that practice every day, but 
they have other qualities that should not be
underestimated, and that should be evaluated in the
larger context of societal safety and community
resilience, not only in terms of actual firefighting,
which is only a minor part of their task portfolio.


5.2 Role multiplexity and part-time resilience 


Fire departments in general, and the rural ones
in particular, are organic parts of their communities.
The qualities of the part-time firefighters that
we have discussed here are important parts of what 
we have labeled community resilience. Their social 
relations in the community make them more effec- 
tive than their fractions of positions may suggest, 
and they make up a critical part of the local com- 
munities’ ability to withstand and respond to emer- 
gencies of a highly varied nature. We introduced 
two sociological explanations for this: 

Role multiplexity: They wear many hats, and their
many forms of competence give them a broad skill
set and access to resources when faced
with novel situations. In particular, professional
roles such as jobs in the municipality’s technical
services, farming, carpentry or the medical
professions give highly valued additional competencies.

Social capital: They have social networks and trust
relations that can be very useful for coordination
in emergencies, and also for mobilizing equipment
and resources. This also contributes to high
motivation. Informants throughout the sector
are clear that the economic incentive, the paycheck,
is not the primary motivation for most of
the firefighters. Rather, it is the sense of doing an
important job for the community that is the main
motivation for most.


6 CONCLUSION 


One conclusion, not a very daring one, is that we
(researchers and especially policy makers) need to
know more about the specific role of these
firefighters, as Norway is about to restructure its fire
departments. They might not be as competent as
full-time firefighters, but they are different, fill
other roles and are organically involved in community
preparedness and response. From a societal
safety perspective, the part-time firefighters are
possibly the most cost-efficient operative emergency
management resource in Norway.

Beyond the discussion of firefighters in Nor- 
way, our paper emphasizes the importance of role 
multiplexity and social capital in the management 
of societal emergencies. Part-time firefighters are
not volunteers in the traditional sense but not fully
professional actors either. For effective emergency
management, they represent an important hybrid
resource, as they possess rudimentary means
of coordination, communication (most
importantly by being equipped with and trained
to use radio communication terminals) and
leadership while simultaneously being well embedded
in the social fabric of the communities they serve.


ABSTRACT: In the petroleum sector, low oil prices, high cost pressures and rapid technological
development have led to higher demands for cost reductions, reorganisations and employee downsizing. The
aim of this paper was to: 1) Assess trends and development related to reorganisations, downsizing, job 
insecurity, psychosocial factors and safety climate from 2013-2015; 2) Explore the correlation between 
reorganisations, downsizing and psychosocial factors, safety climate and occupational injury in 2015. 
The study is based on cross-sectional survey data collected every 2nd year, involving all offshore and land 
based personnel in the petroleum sector. Results show that employees have experienced an increase in 
reorganisations, downsizing and job insecurity in the period 2013-2015. Furthermore, the results show 
that employees who have experienced reorganisation and downsizing report higher risk of occupational 
injuries, poorer safety climate and psychosocial work environment, compared to employees who do not 
report such changes. The analyses also indicate that the higher risk of injuries reported by those who 
have been affected by downsizing and reorganisation may be associated with a poorer safety climate and 
psychosocial work environment. 


1 INTRODUCTION

Over the years, significant changes have taken
place in the petroleum industry. Technological
changes and fluctuations in market dynamics have
caused demands for cost reductions, organisational
restructuring and downsizing (PSA 2017a, ILO
2013). Organisational change at the workplace is
a wide-ranging concept which often has implications
for the way work is designed, organized and
managed, also referred to as the psychosocial work
environment (Leka & Jain 2010). Over the years, a
growing amount of research has explored the link
between psychosocial and organisational factors
and incidents in the petroleum industry (Bergh
et al., 2014, Cricthon 2005, Mathisen & Bergh
2016, Olsen et al. 2015, Kongsvik et al. 2011).

Organisational change that influences the
employees’ job situation and involves modifications
of the core system of an organization, including
values, work practices, organizational structure and
strategy, is likely to affect employees’ well-being
more than less pervasive changes (Bamberger et al.
2012). Among the most dramatic organisational
changes is restructuring, because this type of
process may involve aspects such as relocations,
offshoring, closure, mergers/acquisitions, outsourcing
and internal restructuring including downsizing
(Eurofound 2014, Mathisen et al. 2016). A review
study performed by Quinlan and Bohle (2009)
found that downsizing was associated with job
insecurity and negative impact on health and safety.
Furthermore, organisational change processes have
been regarded as an important aspect in accident
investigations, and might be highly relevant to the
level of both process and worker safety (Baker,
2007; Grote, 2008). However, research on the
relationship between organisational change processes,
such as restructuring and downsizing, and
risk of major accident is still limited (Grote, 2008;
Koukoulaki 2009, Lofquist 2008).

Changes in the work environment are an everyday
reality in the petroleum industry, partly due to
demands for efficiency and cost reductions.
Development in knowledge, technology, and society
brings about unavoidable changes in our working
practices and experiences. Based on what we know
from research, there is reason to assume that these
changes can lead to an increased risk level in the
petroleum industry. It is important to understand the
impact of the changes, which in turn enables
us to implement appropriate actions to respond
and adapt to these changes, learning
from experience. The RNNP questionnaire survey
is a data source that can help to highlight key issues
related to these concerns.

In light of this, this study assessed the impact
of organisational changes on psychosocial work
environment, safety, health and occupational
injuries in the petroleum industry. To
explore this topic, the results from the RNNP survey
(Trends in risk level in the petroleum activity)
were utilized to:


1. Assess trends and development related to 
reorganizations, downsizing, job insecurity, 
psychosocial factors and safety climate from 
2013-2015. 

2. Explore the correlation between reorganiza- 
tions, downsizing and psychosocial factors, 
safety climate and occupational injury in 2015. 


2 METHOD 


2.1 Sample and measures


This study utilized survey data from personnel on 
offshore facilities on the Norwegian Continental 
Shelf (NCS) and onshore installations, covering all 
employees offshore and onshore in the petroleum 
industry. The RNNP questionnaire-based survey 
is conducted as a cross-sectional study every other 
year, for the first time in 2001 and most recently 
in 2015. The analyses in our study are based on 
data from 2013-2015 with a response rate in 2015 
of 27% for onshore installations and 29.7% for off- 
shore facilities (PSA, 2016a). In spite of a rather 
low response rate, from year to year the sample 
is relatively stable over a range of variables such 
as gender, age, facility, the area of work, ratio 
between operators and entrepreneurs, permanent 
and temporary employees and proportion with 
managerial responsibilities. The RNNP data provide a good comparative basis for survey analyses from year to year. The large number of respondents in the survey contributes to making the results more robust. It should be noted that the regression analysis used data from the most recent offshore survey in 2015 (n = 8509).

The analyses included indicators related to reorganization, downsizing, job insecurity, psychosocial factors and safety climate (RNNP, 2016). The following item measured reorganizations: “During the last year, have you experienced reorganisations that affect the way you plan and/or carry out your work on the facility?”. The response scale for this item was: “Yes” or “No”. Furthermore, the following item measured downsizing: “During the last year, has your workplace been subjected to workforce reductions or redundancies?”. The answer categories ranged from “I have experienced reorganisations with significant consequences” to “I have not experienced reorganisations”.

Based on the results from a factor analysis described in the RNNP report from 2014, four factors that measure the psychosocial work environment were established. The psychosocial work environment dimensions measure employees’ subjective experience of job demands, job control, supportive leadership and supportive colleagues and were measured with five scales. Job demands was measured with three items (α = 0.65). Job control was measured with three items (α = 0.77). Supportive leadership was measured with three items (α = 0.76). Finally, social support from colleagues was measured with two items (α = 0.62).

The answer categories ranged from “Very rarely or never” to “Very often or always”. The mean scale score was converted into three categories: low (1.0-2.0), medium (2.1-3.0) and high (3.1-5.0).
By dichotomizing job control (low = 1, medium 
and high = 0) and quantitative demands (high = 1, 
medium and low = 0), we constructed a job strain 
variable of the combination of low job control and 
high job demands (yes = 1, no = 0). All variables 
were coded so that high exposures indicate assumed 
negative exposure such as high job demands, low 
job control, low supportive leadership, and low 
support from colleagues. 
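
As a minimal sketch of the coding rules described above (not the authors' SPSS syntax), the categorization and the job strain indicator could be expressed as follows. The function names are hypothetical, and rounding the mean score to one decimal before applying the cut-offs is an assumption, since the text does not say how scores between 2.0 and 2.1 were handled.

```python
def categorize(mean_score: float) -> str:
    """Map a mean scale score (1.0-5.0) to the three categories in the text."""
    score = round(mean_score, 1)  # assumption: cut-offs apply to one-decimal scores
    if score <= 2.0:
        return "low"
    if score <= 3.0:
        return "medium"
    return "high"


def job_strain(job_control_mean: float, job_demands_mean: float) -> int:
    """Return 1 for low job control combined with high job demands, else 0."""
    low_control = 1 if categorize(job_control_mean) == "low" else 0
    high_demands = 1 if categorize(job_demands_mean) == "high" else 0
    return 1 if (low_control == 1 and high_demands == 1) else 0


print(job_strain(1.7, 4.0))  # low control and high demands -> 1
print(job_strain(2.7, 4.0))  # medium control -> 0
```

Only respondents falling in both exposed groups are counted as having job strain, which is why the prevalence of job strain is lower than that of either component.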

Safety climate was measured by using an index consisting of 11 items divided into three dimensions. The items are described in greater detail in Birkeland et al. (2013) and Tharaldsen et al. (2008). These dimensions measure the employee’s assessment of: 1) individual intentions and motivation (α = 0.73), 2) the management’s prioritization of safety (α = 0.73) and 3) safety routines (α = 0.74). The response scale for each item was: “Fully agree”, “Partially agree”, “Neither agree nor disagree”, “Partially disagree”, “Fully disagree”. The statements are both positive and negative. The scales are therefore reversed on some of the items, so that a higher value indicates a poorer safety climate. The mean scale score of the 11 items was converted into three categories: good (1-1.53), medium (1.54-2.04) and poor (2.05-5). The cut-off point (2.05) for being included in the “poor” category is the 66.66th percentile. That is, respondents who are among the third with the highest score on the index in the 2015 material are included in the poor category. The cut-off point in 2015 is as such set as a reference. In addition, occupational injuries were measured with a single item: “Have you been injured in a work accident while at the facility during the last year?”. Further information about the items can be found online in the RNNP summary report (RNNP, 2016a).
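
The scoring of the safety climate index can be illustrated with a short sketch. It assumes that responses are coded 1-5, that reverse-coding maps a value x to 6 - x, and that the reported cut-offs apply to two-decimal scores; the function names and the choice of which items are reversed are hypothetical, since the source does not list them.

```python
from statistics import mean


def safety_climate_score(responses, reversed_items):
    """Mean of 1-5 item scores after reverse-coding the given item positions."""
    adjusted = [
        6 - value if index in reversed_items else value
        for index, value in enumerate(responses)
    ]
    return mean(adjusted)


def safety_climate_category(score):
    """Apply the cut-offs reported in the text (2.05 is the 2015 reference)."""
    score = round(score, 2)  # assumption: cut-offs apply to two-decimal scores
    if score <= 1.53:
        return "good"
    if score <= 2.04:
        return "medium"
    return "poor"


# Hypothetical respondent: 11 items, with item position 4 reverse-coded.
answers = [1, 2, 1, 2, 5, 1, 2, 1, 2, 1, 2]
score = safety_climate_score(answers, reversed_items={4})
print(round(score, 2), safety_climate_category(score))
```

Because the 2.05 cut-off is fixed at the 2015 tertile, the same threshold can be applied to other survey years to compare the share of respondents in the poor category.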


2.2 Analysis 


We used chi-square tests to determine whether there were significant changes in the levels of reorganization, downsizing, job insecurity, psychosocial factors, safety climate and occupational injuries in offshore facilities in the period 2013-2015.

RNNP data from 2015 were used to compare psychosocial factors, safety climate and occupational injuries between employees reporting downsizing and reorganizations and employees not reporting such changes in the petroleum industry (offshore and onshore facilities). For the trend analyses, data from the offshore population were utilized. Furthermore, for the multiple logistic regression analysis, both the offshore and onshore data were assessed. Associations between reorganization or downsizing and work injuries were calculated as the odds ratio (OR) with a 95% confidence interval (95% CI). Statistical analyses were conducted using IBM SPSS Statistics for Windows, version 23.0 (IBM Corporation, Armonk, NY, USA).
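
For readers who want to reproduce this kind of comparison outside SPSS, the following is a minimal sketch of a Pearson chi-square test for a 2x2 table without continuity correction. It is not the authors' procedure; the function name is hypothetical, and the example counts are the injury counts for downsizing reported in Section 3.3. With one degree of freedom, the p-value can be obtained from the complementary error function.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Rows: downsizing yes / no. Columns: injured / not injured.
statistic, p_value = chi_square_2x2(246, 5977 - 246, 61, 2159 - 61)
print(f"chi2 = {statistic:.2f}, p = {p_value:.4f}")
```

The result need not match the p-values in the paper exactly, since the published analyses may have handled missing responses or corrections differently.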


3 RESULTS 


3.1 Reorganizations and downsizing 


Reorganization and downsizing vary over time; however, the results from this study show that there has been a considerable increase in the proportion of employees who have experienced reorganizations, downsizing and job insecurity in the period 2013-2015 (Table 1). The results show a similar development for operators, entrepreneurs and mobile facilities.


3.2 Psychosocial work environment, safety 
climate and occupational injuries 


The trend analyses for 2013-2015 show a significant increase in the proportion of employees who report high job demands (19% to 22%), low job control (10% to 12%) and the combination of high job demands and low job control (4.3% to 5.6%) in the industry as a whole. If we look further into offshore facilities (Table 1), we see that between 2013 and 2015 there is a significant change in the percentage reporting low job control among entrepreneurs (6.8% to 9.8%). Among operators, however, no change is observed in the same period, and about 10% report low job control in 2015 (Table 1). Table 1 shows a significant change among entrepreneurs with regard to high job demands (16% to 20%), whereas for employees working for operators no change is observed in the same period. Furthermore, entrepreneurs also report a significant change related to the combination of high job demands and low job control (2.8% to 5.4%). At the same time, no change is observed among operators. There are no significant changes in the prevalence of low supportive leadership and low colleague support among operators, entrepreneurs and workers on mobile facilities. However, it should be noted that operators report lower leadership support than entrepreneurs and workers at mobile facilities. There are only small differences in the psychosocial working environment between entrepreneurs and operators on offshore facilities in 2015. As such, there are few changes in the psychosocial working environment on offshore facilities overall, but in the period 2013-2015 a deterioration of the psychosocial work environment is observed among entrepreneurs and workplaces with a large number of entrepreneurs. Table 1 shows that the proportion of employees reporting poor safety climate (see description) decreased overall until 2013, while there was a significant increase in 2013-2015 among entrepreneurs (30% to 36%) and operators (from 34% to 38%) on offshore facilities. A similar change is however not observed for employees on mobile facilities (27% to 28%). In the period 2013-2015 no significant changes in occupational injuries were reported among entrepreneurs and operators, but a significant reduction in injuries was found among employees on mobile facilities (Table 1).


Table 1. Change in reorganization, downsizing, psychosocial factors, safety climate and occupational injuries 2013-2015, offshore facilities (2013 n = 7952/2015 n = 6675). Values are percentages of respondents.

                             Operators              Entrepreneurs          Mobile facilities
                             2013    2015           2013    2015           2013    2015
Downsizing                   17.0    67.4***        35.7    79.0***        5.0     77.0***
Reorganization               40.3    51.2***        31.5    47.3***        24.4    44.6***
Job insecurity               7.2     12.2***        9.0     [illegible]    4.5     40.4***
High job demand              20.9    20.9           15.5    20.4***        19.2    24.7***
Low job control              10.0    10.9           6.8     9.8***         11.0    13.8*
Job strain                   5.2     5.0            2.8     5.4***         4.0     [illegible]
Low supportive leadership    21.3    22.6           14.1    14.8           14.3    15.8
Low colleague support        3.2     3.5            2.9     2.8            3.0     3.6
Poor safety climate          34.4    38.1*          29.9    [illegible]    27.1    28.4
Occupational injuries        3.1     3.4            5.3     4.5            4.5     3.3*

*P < 0.05; **P < 0.01; ***P < 0.001.


3.3 Reorganization, downsizing and 
occupational injuries 


In total, 4.1% (246/5977) of employees in the petroleum industry (offshore and onshore facilities) who had experienced downsizing reported occupational injuries, compared with 2.8% (61/2159) of those who had not experienced downsizing. Likewise, 4.4% (175/3939) of those who had experienced reorganization reported occupational injuries, compared with 3.1% (130/4163) of those who had not experienced reorganizations.
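
As an illustration of how these counts translate into an odds ratio, the sketch below computes a crude (unadjusted) OR with a 95% confidence interval based on the standard error of the log odds ratio (Woolf's method). The function name is hypothetical, and the choice of Woolf's method is an assumption. Because no adjustment for age, gender or education is made, the output is expected to differ somewhat from the adjusted estimates in Tables 3 and 4.

```python
import math


def odds_ratio_ci(exposed_cases, exposed_total, unexposed_cases, unexposed_total, z=1.96):
    """Crude odds ratio with a Woolf (log-based) confidence interval."""
    a = exposed_cases
    b = exposed_total - exposed_cases
    c = unexposed_cases
    d = unexposed_total - unexposed_cases
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


for label, counts in {
    "downsizing": (246, 5977, 61, 2159),
    "reorganization": (175, 3939, 130, 4163),
}.items():
    odds_ratio, lower, upper = odds_ratio_ci(*counts)
    print(f"{label}: OR = {odds_ratio:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```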

Employees reporting reorganization or downsizing also reported a significantly higher prevalence of all psychosocial exposures and of poor safety climate, with the exception of low colleague support, compared to employees not reporting such changes (Table 2).


3.4 Logistic regression analysis 


A logistic regression analysis was conducted to examine whether downsizing and reorganizations are associated with an increased risk of occupational injuries, and to examine whether this association may be mediated by psychosocial factors. In the initial model (adjusted for age, gender and education), employees who had experienced downsizing had a significantly higher risk of occupational injuries compared with employees not affected by downsizing (OR = 1.54; 95% CI 1.14-2.06) (Table 3). Adjusting for psychosocial factors and safety climate reduced the OR by 44%. The most important factor was safety climate.
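
The reported reductions appear consistent with the relative change in the excess odds ratio, (OR_adjusted - OR_initial) / (OR_initial - 1). This formula is inferred from the values in the "Change" column of Table 3 rather than stated by the authors, and the function name below is hypothetical.

```python
def change_in_or(initial_or: float, adjusted_or: float) -> float:
    """Relative change in the excess odds ratio after further adjustment."""
    return (adjusted_or - initial_or) / (initial_or - 1)


# Downsizing, initial OR 1.54 (Table 3).
print(round(change_in_or(1.54, 1.30), 2))  # all factors included: -0.44
print(round(change_in_or(1.54, 1.38), 2))  # poor safety climate: -0.3
print(round(change_in_or(1.54, 1.43), 2))  # job insecurity: -0.2
```

Since the inputs are ORs rounded to two decimals, recomputed values can differ from the published column in the second decimal.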


Table 2. Description of occupational injuries and explanatory variables for employees in the petroleum industry (n = 8102).

                             Reorganization                Downsizing
                             Yes     No      p-value^a     Yes     No      p-value^a
Job insecurity               29.9    22.2    0.000         29.7    16.1    0.000
High job demand              28.8    15.6    0.000         23.2    18.7    0.000
Low job control              14.9    8.7     0.000         12.6    9.5     0.000
Job strain                   8.0     3.4     0.000         6.2     4.1     0.000
Low supportive leadership    22.9    14      0.000         19.2    15.9    0.000
Low colleague support        4.6     2.8     0.000         3.7     3.6     0.408
Poor safety climate          42.2    28.2    0.000         37.5    28.3    0.000
Occupational injuries        4.4     3.1     0.001         4.1     2.8     0.005

^a Variables were tested with the chi-square test.


Table 3. Multiple logistic regression for occupational injuries and the effects of adjusting for psychosocial factors and safety climate.

                                      Occupational injuries
                                      Odds Ratio (95% CI)^a    Change^c
Initial model^a
  No downsizing (n = 2159 (2.8))^b    1
  Downsizing (n = 5977 (4.1))^b       1.54 (1.14-2.06)^d
Job insecurity                        1.43 (1.06-1.93)         -0.20
Psychosocial factors
  High job demand                     1.46 (1.08-1.96)         -0.15
  Low job control                     1.47 (1.09-1.98)         -0.13
  Low supportive leadership           1.46 (1.09-1.97)         -0.15
  Low colleague support               1.48 (1.10-1.99)         -0.11
Poor safety climate                   1.38 (1.03-1.87)         -0.30
All factors included                  1.30 (0.96-1.76)         -0.44

^a Adjusted for age, gender, education.
^b Number of respondents (cases of injuries).
^c Percentage change in OR after comparing the initial OR with further adjusted OR (i.e., the initial OR adjusted for work-related factors).
^d p = 0.005.


Table 4. Multiple logistic regression for occupational injuries and the effects of adjusting for psychosocial factors and safety climate.

                                         Occupational injuries
                                         Odds Ratio (95% CI)^a    Change^c
Initial model^a
  No reorganization (n = 4163 (3.1))^b   1
  Reorganization (n = 3939 (4.4))^b      1.46 (1.15-1.84)^d
Job insecurity                           1.39 (1.09-1.77)         -0.14
Psychosocial factors
  High job demand                        1.31 (1.03-1.66)         -0.33
  Low job control                        1.38 (1.09-1.75)         -0.17
  Low supportive leadership              1.35 (1.07-1.72)         -0.23
  Low colleague support                  1.38 (1.09-1.75)         -0.17
Poor safety climate                      1.25 (0.98-1.57)         -0.46
All factors included                     1.13 (0.88-1.44)         -0.72

^a Adjusted for age, gender, education.
^b Number of respondents (cases of injuries).
^c Percentage change in OR after comparing the initial OR with further adjusted OR (i.e., the initial OR adjusted for work-related factors).
^d p = 0.002.


The same procedure was applied when analys- 
ing the association between reorganizations and 
the risk of occupational injury, and to determine if 
this association may be mediated by psychosocial 
working conditions and safety climate. In the ini- 
tial model (adjusted for age, gender and education) 
employees who had experienced reorganizations 
had a significantly higher risk of occupational 
injuries compared to employees not affected by 
reorganizations (OR = 1.46; 95% CI 1.15-1.84)
(Table 4). Adjusting for psychosocial factors and
safety climate reduced the OR by 72%. The most
important factor was safety climate.


4 DISCUSSION 


Our study set out to explore whether exposure to downsizing and reorganization increases the risk of occupational injuries and to test whether the association between downsizing, reorganizations and occupational injury may be mediated by psychosocial working conditions and safety climate. As expected, the results showed that employees in the petroleum industry report a considerable increase in reorganizations, downsizing and job insecurity in the period 2013-2015. In the same period, significant changes in the psychosocial work environment offshore are observed among entrepreneurs and among employees on mobile facilities, but not among employees working for operating companies. In terms of safety climate, a deterioration on production facilities is observed, but not among employees on mobile facilities, and during the same period there have been minor changes in the occurrence of self-reported injuries.

The analyses also indicate that a higher risk of injuries is reported by those who have experienced reorganizations and downsizing, and that this is in turn associated with a poorer safety climate and psychosocial work environment. The results show that employees who have experienced reorganization and downsizing report a higher risk of injuries and a poorer safety climate and psychosocial work environment compared to employees who do not report such changes. Organizational changes, such as restructurings involving relocations, offshoring, closure, mergers/acquisitions, outsourcing, and internal restructuring including downsizing (Eurofound 2014), have been found to be linked to ill-health as well as incidents (de Jong et al., 2016, Serck-Hanssen 2002).

Overall, the changes in the psychosocial working environment appear to be somewhat more pronounced on mobile facilities and may be seen in the context of ongoing change processes. Looking at the differences between operators, entrepreneurs and mobile facilities, it might be argued that employees from operating companies do not show a negative development with regard to psychosocial risk and safety climate because they have not been affected by change in the same time period. Operating companies are now moving into a period with more organisational changes being implemented, which might have implications for psychosocial and organisational factors such as job demands, job control, leadership support and social support. Research has found that psychosocial risk factors, such as high workload, inadequate resources, increased time pressure, staff conflicts and lower levels of autonomy or loss of job control, may be exacerbated as a result of organisational change (Leiter & Maslach 2009, Rafferty and Griffin 2006, Rabatin et al. 2015, Day Crown and Ivany 2016). As such, performing the same analysis for the upcoming RNNP survey in 2017 might shed some light on this particular issue.


4.1 Strengths and limitations 


This study is cross-sectional and provides a snapshot of a particular group at a given time. Cross-sectional studies are the best way to determine prevalence and are useful for identifying associations that can then be studied more rigorously using a cohort study or randomized controlled study (Mann 2003). This design limits the ability to draw firm conclusions about relationships observed between exposure (for example, downsizing and reorganization) and outcomes (for example, occupational injuries), and about the extent to which exposure precedes outcome in time. Given that the data were cross-sectional, we cannot draw conclusions about causal relationships. Thus, there is a possibility that those who were injured during the reorganization are somewhat more “negative” when they answer the survey, or that the changes have increasingly affected groups that initially reported a poorer psychosocial work environment and safety. However, it is important to note that research over at least the last decade, including longitudinal studies, has shown that psychosocial hazards can have a negative impact on health and safety (Leka & Jain 2010, Mackey, Palverman Saul et al. 2012). Another important limitation is that those who are absent from work due to sickness, health complaints or injury during the survey are not included in the data. Similarly, studying potential consequences of downsizing or termination of employment is difficult, because those who have been terminated are not included in the data. In other words, only those remaining after the downsizing processes are included in the data.

A strength of this study is that the questionnaire survey contains questions that cover many background variables that can be explored in relation to downsizing, reorganization and HSE outcomes. The sample is also large enough to permit analyses of different sub-groups of employees, i.e. employees in different areas of work and job categories, and in risk groups defined in the RNNP report from 2014. It is also important to note that research over at least the last decade, including longitudinal studies, has shown that psychosocial hazards can have a negative impact on health and safety at the workplace. Going forward, these analyses should be replicated in order to explore whether the changes the industry is currently undergoing will have a more pronounced long-term effect on psychosocial work environment and safety climate.


4.2 Implications for follow-up of the industry 


Psychosocial and organisational factors in relation to health and safety have been the subject of research for the last twenty years. The Petroleum Safety Authority (PSA) has also over the years focused on areas and themes relevant to the promotion of the psychosocial work environment. One example is the theme of collaboration and employee involvement highlighted in PSA’s main topic for 2017, “Reversing the Trend”. Other examples are PSA’s follow-up of companies’ organisational change processes over the last years and the focus on organisational and human factors in barrier management (PSA 2016b, PSA 2017b).

Despite the amount of evidence collected and described in the literature, one of the biggest challenges seems to be for companies to translate knowledge into workplace interventions. For employers it is a legal obligation to assess and manage all types of risk to workers’ health and safety (Arbeidstilsynet 2017). Organisational change represents such a process where the risk should be assessed and managed. As such, for employers and business owners it is vital to see psychosocial and organisational factors as an important element of managing risk and ensuring a sound change process. Risk management of these issues through risk identification, assessment and prioritization, including allocation of resources to minimize, monitor, and control negative impacts, supports the organisation’s objectives for implementing lasting changes (Langenhan et al. 2013). As with health and safety in general, the successful promotion of the psychosocial and organisational work environment requires integration into daily work processes, avoiding treating it as a separate project. Finally, employers must ensure that the interventions implemented target the work environment as well as the individual, in order to create safer workplaces, improve the capacity of workers to protect their safety and health, and maximize their overall effectiveness.

Since both risk of injury and exposure to psychosocial hazards are unevenly distributed among groups, the follow-up should be able to reflect this. Informative and user-friendly risk profiles will support the promotion of risk-based initiatives in times of change. Nevertheless, going forward it is important to monitor the developments from an inspection authority perspective, especially when there is a significantly higher risk of injury and reports of a poorer psychosocial work environment and safety climate among employees affected by changes.


ABSTRACT: This paper takes a macro perspective on quality and safety in elderly care, focusing on the role of the macro level context in two European countries: Norway and the Netherlands. The aim is to conduct a comparison of the healthcare systems as a foundation to understand quality and patient safety improvement efforts in elderly care across the two countries. Our methodological approach is to conduct a review of macro level factors as suggested by previous studies such as governance and financing structure, role of different actors, and types of initiatives taken. Data were collected from open sources such as national statistics, governmental reports, and web pages of key actors at the macro level. The similarities and differences are discussed to highlight the main areas where improvement efforts need to take country-specific factors into account when learning across countries.


1 INTRODUCTION


1.1 Background


In many European countries, there are policy demands for efforts to improve healthcare quality and patient safety. Why improvement interventions and efforts are successful in one context and not in another, and the role of macro level factors, are still under-researched (Krein et al. 2010). Efforts to improve quality may work better in some situations than in others. Thus, quality improvement strategies require information about context (McDonald, 2013). There are large variations in the organization, financing and delivery of healthcare services in elderly care between countries, and these variations can be expected to have an impact on the quality and safety work (Eenoo et al. 2015). European healthcare policy makers looking for sustainable ways to organize healthcare should take these differences in context into account (Eenoo et al. 2015).

This paper takes a macro perspective on quality and safety in elderly care, focusing on the role of the macro level context in two European countries: Norway and the Netherlands. The aim is to conduct a comparison of the healthcare systems as a foundation to understand quality and patient safety improvement efforts in elderly care across the two countries. Improvement efforts have to be adjusted to the specific contexts in which organizations are embedded in order to be effective. The following research question will guide this paper: What do differences in macro level factors mean for quality and safety improvement efforts in elderly care in Norway and the Netherlands?


2 METHODS


Our methodological approach was to conduct a review of macro level factors as suggested by previous studies (Wiig et al. 2014; Wendt, 2009; Grosse-Tebbe, 2005; Leijten et al. 2017) such as governance and financing structure, role of different actors, and types of initiatives taken. Data were collected from open sources such as national statistics, governmental reports, and web pages of key actors at the macro level (e.g. regulators, inspectorates, knowledge centers). We mapped and compared data on governmental expenditures on health, actors and types of community health care services, funding, governmental vision and regulation on community care, involvement of informal care, and national quality and safety indicators. Elderly care in this paper is limited to quality and safety improvement efforts in nursing homes and home care.


3 HEALTH SYSTEM IN NORWAY


3.1 Governance of elderly care


Norway is a parliamentary democracy, divided into three different administrative levels: the state, the 19 counties and the 426 municipalities (Kartverket, 2017). The organizational structure of the Norwegian health-care system is built on the principle of equal access to services for all inhabitants, regardless of their social or economic status, country of origin and geographical location. This overarching goal has been a long-standing feature of the Norwegian welfare system and has been embedded in the national health-care legislation and strategic documents (Ringard et al. 2013).

Since the beginning of the 2000s, the emphasis has been both on achieving structural changes and on organizing health care better. In addition, reforms have been carried out to strengthen patient and user involvement. In recent years, efforts have been directed towards better coordination of health services, while quality and patient safety have received increasing attention (Ringard et al. 2013).

The Norwegian health care system can be char- 
acterized as semi-decentralized. At the national 
level, parliament serves as the political decision- 
making body. The responsibility for specialist care 
lies with the state, administered by the four Regional 
Health Authorities (RHAs), which in turn own 
the hospital trusts. Municipalities are responsible 
for primary care including rehabilitation, physiotherapy,
nursing homes, midwifery, home care and
after-hours emergency services. There is no direct
command and control line from central authorities 
down to the municipalities and the municipalities 
have a great deal of freedom in organizing primary 
care services (Ringard et al. 2013). 

The principles for determining who is responsible for health services vary in different parts of the system. Local politicians are accountable to local inhabitants through local elections. Municipalities are also responsible to the government for following policies and regulations. Furthermore, internal auditors control the municipalities’ budgets and have access to other economic factors of importance.

The Coordination reform was introduced in 
2012, and was part of an effort to improve care 
coordination and quality of the services. It implied 
that municipalities became responsible for more 
treatment tasks. Because of the reform, the munic- 
ipalities now receive more patients with need for 
long-term treatment, but there are still challenges 
regarding capacity and competence of the staff 
at the municipal level (Meld. St. 11. 2014-2015; 
Gautun et al. 2013). 

The main task of the central government is to ensure high quality of services across the municipalities through funding arrangements and legislation (e.g. the 2011 Municipal Health and Care Act). The Directorate of Health issues an annual circular letter to the municipalities. The circular letter of 2016 contained recommendations on preventive home visits in the municipalities. The letter described how the municipalities could use preventive home visits as part of their service provision to the elderly. The purpose of preventive home visits is to strengthen the individual’s ability to stay healthy and active as long as possible (Rundskriv, 2016).


3.2 Funding and legislation 


In 2017, health-care expenditure accounted for approximately 10.5% of Norway’s GDP, placing it 16th in Western Europe in terms of the share of GDP spent on health (OECD, 2017).

Government spending accounted for 85% of overall health spending, among the highest in the OECD, mostly comprising financing from the central and local governments (73%) and from the National Insurance Scheme (12%) through fee-for-service payments and reimbursement of user fees (Ringard et al. 2013).

Primary care is funded from municipal taxes, block grants from the central government, and earmarked grants for specific purposes. Private sources, mainly in the form of out-of-pocket payments, account for approximately 15% of health expenditure (Ringard et al. 2013; OECD, 2017).

Private actors are involved in the provision of 
primary care services. The majority of general 
practitioners are self-employed but are in most
cases fully embedded in the public system through 
contracts with the municipalities. Services pro- 
vided by for-profit institutions include long-term 
nursing care (about 10% of nursing homes in 2010) 
(Ringard et al. 2013). 

The Norwegian healthcare system is regulated 
through a large number of acts and secondary leg- 
islation. Legislation broadly reflects the decentral- 
ized nature of the healthcare system. Specialized 
healthcare, organized at the level of the RHAs, is 
regulated by the Specialist Care Act of 1999 and 
the Health Authorities and Health Trusts Act of 
2001. Primary care which in Norway comprises 
long term elderly care is regulated by the Munici- 
pal Health and Care Act of 2011. A key act is the 
Patients’ and Users’ Rights Act (1999) with the 
aim of ensuring “equal access to health services of 
high quality”. Strengthening the role of patients 
and next of kin has been a policy priority since 
the turn of the millennium, for example, through 
stronger patient choice and complaint procedures 
(Ringard et al. 2013). 

The central government is currently focusing on “the patient’s health and care service”, putting the patient’s needs at the center of healthcare provision (St. Meld. 34. 2015-2016). There are no dedicated rights for the elderly today. The responsibility for the rights of the elderly is divided between a variety of administration levels, supervisory bodies, and agencies. The rights of the elderly are mainly regulated by the Patients’ and Users’ Rights Act, which applies to all health services in Norway. Municipalities often coordinate the services for the elderly (Rundskriv, 2015).


There is no specific patient safety act in Nor- 
way. Demands for improving quality and safety in 
healthcare are incorporated in several acts and reg- 
ulations. The Health and Care Act (2011), covering 
all health and care services provided by municipali- 
ties, requires service providers to work systemati- 
cally to improve quality and patient and user safety 
(Health and care Act (2011) § 4-2). 

A recent regulation on leadership and qual- 
ity improvement in the health and care services 
(2017) requires all service providers to establish a 
management system to ensure internal control of 
the service provision. This includes overview of 
risk areas and how to plan activities, carry them 
out, evaluate, and correct deviations. Municipalities
have the responsibility to supervise their own activities
in line with internal control regulations (Ringard
et al. 2013), but there are also external supervisory
bodies with a mandate to supervise the service provision.
The Norwegian Board of Health Supervision
(NBHS) is the national regulatory body for health 
and care services. It is a public institution organized 
under the Ministry of Health and Care Services. 
At the regional level, 18 county governors oversee 
services within primary and specialized healthcare. 


3.3 Knowledge infrastructure


National professional guidelines exist to help ensure
that health and care services are of high quality,
set proper priorities, avoid unwanted variation in service
provision, resolve coordination challenges and
provide comprehensive patient care. The Directorate
of Health develops, disseminates and maintains
national professional guidelines. National
professional guidelines provide recommendations 
for service provision, but are not legally binding. 
The health service providers are responsible for 
organizing the services to ensure that recommen- 
dations given in national professional guidelines 
can be followed (Helsedirektoratet, 2014). 

There is a National Patient Safety Program
(2014-2018) working on quality and safety
improvement by strengthening the competence
of health professionals and managers and by implementing
measures to reduce patient injuries.
Intervention bundles are offered in areas such as
medication lists, prevention of falls and pressure
ulcers, prevention and treatment of malnutrition,
and early detection of deterioration
(Pasientsikkerhetsprogrammet, 2017).

It is voluntary for municipalities to participate 
in the national patient safety program activities. 
Five municipalities currently participate in a pilot 
“patient and user-safe municipality”, which aims at 
ensuring systematic and sustained work on patient 
and user safety at all levels in the municipal health 
and care services (Pasientsikkerhetsprogrammet, 
2017). The National Patient Safety Program is 
co-operating with the development centers for Nursing
Home and Home Services (USHT). There
is one USHT in each county and they establish
networks with the municipalities. The development
centers play a key role in supporting nursing
homes and home services in quality improvement
efforts, including the organization of learning networks
(Pasientsikkerhetsprogrammet, 2017).

There is a National Quality Indicator System in
Norway, and the number of quality indicators for the
municipal health and care services has increased in recent
years. Examples of quality indicators in nursing
homes and home care are follow-up of nutrition;
healthcare-associated infections; available
medical time; medication review; and occurrence of
readmission among the elderly within 30 days of discharge
to the municipality (Helsedirektoratet, 2014).


4 HEALTH SYSTEM IN 
THE NETHERLANDS 


4.1 Governance of elderly care 


The Netherlands is a parliamentary democracy. 
The two governance levels that are most impor- 
tant for health care are the national and the local 
(municipal) level. The Dutch health care system 
can best be described as a hybrid governance sys- 
tem, with a mixture of top-down government regu- 
lation, market elements (e.g. free choice of provider 
and insurer), self-regulation by professionals and 
consultation amongst relevant actors (Van de Bov- 
enkamp, De Mul et al. 2014). 

Since the 1980s, different themes have figured on the
national agenda. From the mid-1980s onwards,
controlling health care costs, improving qual- 
ity and safety and strengthening the position of 
patients have figured prominently on the agenda. 
Already in the 1980s a system of regulated com- 
petition was proposed which was officially imple- 
mented in 2006 after a process of incremental 
change (this applies to care covered under the 
Health Insurance Act). During the 1990s several 
acts were introduced aimed at regulating health 
care quality, such as the Medical Treatment Agree- 
ments Act (1995), the Quality of Care Act (1996) 
and the Individual Health Care Professions Act 
(1997). These acts came about as a result of con- 
sultation with the field and reflected an emphasis 
on self-regulation. The Quality of Care Act for 
instance dictated that health care organizations, 
including those providing elderly care, should have 
a ‘quality management system’ installed. However, 
what these systems should look like was largely 
left up to the field (Grol 2006). In the last decade,
mostly in response to quality incidents, top-down
regulation has become more stringent, which
has caused the health care inspectorate (responsible
for quality supervision) to play a more stringent
role.

In addition to more stringent supervision, many 
national quality projects have been implemented 
also in elderly care. These national quality projects 
stimulate organizations to improve their quality 
and safety and provide funding to work on quality 
and safety projects. The short-term focus of these
programs (they only last for a couple of years) has
caused problems of fragmentation and projectification,
as organizations move from one project to
the next, which limits the sustainability of improvements
(Slaghuis 2016).

Controlling costs and stimulating patient partic- 
ipation have been key issues of health care policy 
for some time now. These principles were also the 
goals of a large decentralization process that also 
affected elderly care. While part of elderly care is
still governed at the national level (care that falls
under the Health Insurance Act and the Long-term
Care Act), other parts have become the responsibility
of the 388 municipalities from 2015 onwards (care
that falls under the Social Support Act). These acts
are elaborated on in the next section.

Dutch quality and safety policy of the last decade 
has strongly focused on making quality of care trans- 
parent through the use of indicators. This means 
that health care organizations have to put a lot of
effort into registering performance information and
providing it to actors such as the inspectorate and
insurers. Partly because of this, there is now much debate
about regulatory pressure in health care. Profession- 
als are calling for attention to the fact that they spend 
too much time on administration which limits their 
ability to provide care (Bovenkamp, Stoopendaal 
et al. 2017). In response to this, there is now much 
debate on working on quality in a different way that 
allows for local variation and focus on reflexivity 
(ibid.). In elderly care this has also been taken up in 
the latest national quality framework. 


4.2 Funding and legislation 


In 2016, 13.8% of GDP (96 billion euro) was spent
on health care (including welfare) in the Netherlands.¹
This makes the Dutch healthcare system
one of the most expensive worldwide. Hospital
care comprises the largest part of the health care
budget, at 27 billion euro. Elderly care comes in second
with almost 17 billion euro.²


¹ http://statline.cbs.nl/Statweb/publication/?VW=T&DM=SLNL&PA=83075NED&D1=a&D2=0,3,8,13,18,23,29,34,39,(l-2)-l&HD=160517-1206&HDR=G1&STB=T.
² http://statline.cbs.nl/Statweb/publication/?VW=T&DM=SLNL&PA=83038NED&D1=a&D2=a&D3=(l-2)-l&HD=170512-1505&HDR=T,G2&STB=G1.

Health care and elderly care in the Netherlands
are regulated and financed through a variety
of acts. These acts partly determine how care is
financed and who is responsible for the provision
of care. In this section, we focus on the legislation
especially relevant for elderly care. The Health Insurance
Act (Zorgverzekeringswet) works through
private insurance (and for a small part through
out-of-pocket payments). The basic package is
obligatory for all citizens (this package is set by the
Dutch Health Institute). Citizens can voluntarily
choose to buy additional packages (e.g. for dental
care, more physiotherapy). Under the Health
Insurance Act a system of regulated competition
is established. Health insurers compete for insured
persons and can selectively procure care. Health care
providers compete for patients. The care offered under
this act includes acute and curative care (primary,
secondary and tertiary care, and mental health care).

The Long-term Care Act works through social
security and is paid for through taxes and, for some
services, small out-of-pocket payments. A ‘care
indication’ from an indication office is needed
to get access to this care (which includes 24-hour
institutional or home care). The Act also provides
for the possibility of personal budgets, empower- 
ing patients to organize their own care. 

The Social Support Act, for which the municipalities
are responsible, also works through a combination
of taxes and some out-of-pocket payments.
For this type of care, which includes home care, 
special services and public mental health care, 
municipalities are responsible for care indications 
and financing. Municipalities can selectively con- 
tract health providers. 

Elderly people can fall under all three acts, which
means that a lot of coordination is needed between
insurers, municipalities and providers.

As stated above, there are also different acts to
regulate health care quality, ensuring, among other
things, that organizations have a quality system in
place and that professionals comply with certain norms.


4.3 The position of patients 


There is much focus in the Dutch system on
strengthening the position of patients. Much effort,
also in the national quality programs, is devoted to letting
patients participate more in their care. This includes
elderly care, where the latest national program, Dignity
and Pride (Waardigheid en Trots), focuses on
providing care according to the wishes of patients.
Patients are also expected to be as active as
possible themselves. For instance, the
Social Support Act states that people have to be as
self-reliant as possible; only if people cannot take
care of themselves, or cannot find help in their
social network, are they eligible for professional care.


There is also legislation ensuring patients’ rights 
such as the right to informed consent, to choose 
providers and insurers and to file complaints. In 
addition, all health care organizations have to
have a client council that has been given far-reaching
rights of advice (Van de Bovenkamp 2010,
Schillemans, Bovenkamp et al. 2016). Furthermore,
patients are increasingly involved in quality
improvement projects in health care organizations.
In addition to client councils, methods used include
patient surveys, mirror meetings, focus groups
and experience-based co-design (Bovenkamp, Grit
et al. 2008, Schipaanboord, Delnoij et al. 2011, 
Vennik, Bovenkamp et al. 2016). 


4.4 Knowledge infrastructure 


The Dutch healthcare system has an elaborate 
knowledge infrastructure (Bekker, van Egmond 
et al. 2010). For elderly care specifically, a knowledge
institute exists at the national level (http://www.vilans.nl/),
which is especially concerned with quality
improvement. In the early 2000s, the Health Council
of the Netherlands nevertheless argued that an
intensification of research directed at the elderly (and
elderly care) was called for, which resulted in the establishment
of a National Program Elderly Care (Grit, Dwarswaard
et al. 2012, Wehrens, Oldenhof et al. 2017).
The program, with a budget of almost 90 million
euro, stimulated regional network building between
elderly care organizations, municipalities and universities,
as well as experimentation with and implementation
of many quality programs. Participation
of the elderly was an explicit aim of the program. In
addition, the national improvement project Dignity
and Pride offers a platform for knowledge exchange
between care organizations. Professional organizations
also play a role in knowledge exchange. One of
these organizations, for example, organizes a yearly
benchmark based on quality indicators with the aim
of helping elderly care organizations to improve their
care. However, these data are not yet public.


5 DISCUSSION 


The similarities and differences between countries 
are discussed to highlight the main areas where 
quality and safety improvement efforts in elderly 
care need to take country-specific factors into 
account. 


5.1 Similarities and differences 


The healthcare systems in Norway and the Netherlands
are ranked among the best in Europe
(OECD, 2017). Both countries have high health
expenditures, but there are major differences in
how healthcare services are organized and funded.
The Dutch system is based on regulated competition
and insurance, while the Norwegian system is
based on taxes and public funding. For the Netherlands,
this could affect how quality and safety
improvement work is conducted,
because it involves a high number of external actors
in the market and a more competitive system. In
Norway there are fewer actors in the market and the
public role is stronger. As a consequence, quality
work in Dutch elderly care tends to be more fragmented,
with many organizations pushing quality
agendas and elderly care providers taking part (or not)
in a diversified set of improvement initiatives.
On the one hand, this means that there is a
very diversified set of quality and safety initiatives,
which can be attuned to local contexts; on the other
hand, little knowledge about quality
and safety improvement is built up across the sector. Several
national initiatives that enable cross-organizational
learning now try to fill this gap, e.g. by organizing
national meetings on specific quality subjects.
In addition, the national quality framework
states that organizations should develop learning
networks as part of their quality policies.

In both countries, patients have a strong
position. In the Netherlands, strengthening patients’
position was placed on the agenda early on (around 1990),
and Norway established a specific Patients’ and Users’
Rights Act in 1999.

There are no specified rights for the elderly,
but the rights of this user group are governed by
the strong patients’ rights legislation in both
countries. In the Netherlands, there is a stronger
focus on informal care than in Norway.
Patients are also expected to be as
active as possible themselves. For instance, the Social
Support Act states that people have to be as self-reliant
as possible; only if people cannot take care
of themselves, or cannot find help in their social
network, are they eligible for professional care.

Both countries require that service providers
develop quality management systems. The Quality
of Care Act in the Netherlands and the Health and
Care Services Act in Norway require systematic
quality improvement work and systems to be in place.
What these systems should look like, however, is largely
left to the field in both countries (Grol 2006). Over
the past decade, mostly as a result of
quality incidents, regulation has become stricter in
the Netherlands, which has led to a stricter supervisory
role for the healthcare inspectorate. In terms of
national quality indicators, knowledge infrastructure, and
national quality improvement projects, the Dutch
context has experienced stronger pressure over the
years than Norway. The number of indicators in the
Norwegian indicator system has increased
in recent years. Until a few years ago,
it mainly consisted of indicators covering the specialized
healthcare services (Wiig & Lindahl, 2015).
For the Netherlands, although there is much debate
about indicators, a quality standard for elderly care
has been developed and the sector engages in benchmarking
in order to learn from best practices.


5.2 Maturity 


A comparison of the countries on macro-level factors
indicates that the Netherlands has come
further in developing methods for user and
next-of-kin involvement and in initiating
national improvement projects and indicators. At
the same time, this can make healthcare providers
more reserved towards additional quality and safety
improvement interventions and efforts, since they
might already have knowledge and tools in place,
while this is still under-developed in the Norwegian
context, according to our mapping (Meld.St. 26.
2014-2015). A challenge for the Dutch system appears
to be that nursing homes and home care services
experience strong pressure from external short-term
projects, implying fragmentation and possible
difficulties for implementation and sustainability.
While policy makers in Norway emphasize the need
for an increased focus on quality and safety improvement
and for stronger external knowledge support
structures for nursing homes and home care, the
Dutch system has national proposals on how to prevent
these organizations from feeling such strong
external pressure, which requires a lot of documentation
and takes time away from daily patient care.


6 CONCLUSION 


In this study, we have mapped and compared 
macro level factors in Norway and the Nether- 
lands as a foundation for understanding and learn- 
ing from quality and safety improvement efforts 
across countries. Funding
mechanisms, patient rights, quality and safety
requirements in the law, and the maturity of the
knowledge infrastructure supporting the institutions
in elderly care are fundamental macro-level
contextual factors that should be mapped to
understand the responses from institutions in quality
and safety improvement efforts.


ABSTRACT: A multi-method case study, including interviews, a survey and participatory observations,
was undertaken to describe opportunities and challenges related to multicultural workplaces in the Norwegian
construction industry, and the consequences for occupational safety, working environment and
work performance. The findings show that challenges related to language and cultural differences are
the most common, and that employment relationships and the duration of relations affect
many of the challenges. Consequences of challenges related to language and culture were found to be more
prominent for working environment and work performance than for safety. The focus on opportunities is
limited and the potential is not fully exploited. Measures implemented to improve the conditions at multicultural
workplaces mainly focus on solving the challenges and mitigating the consequences in a short-term
perspective. A more future-oriented focus is needed, including measures that may lead to long-term
gains for the industry.


1 INTRODUCTION 


With the free flow of labour in most of Europe, 
companies in Norway are experiencing potential 
for innovation and effectiveness, but also some 
challenges. In the Norwegian construction indus- 
try, it is estimated that around one-third of the 
work force are migrant workers (BNL, 2017). This 
paper discusses the opportunities and challenges 
found in a study on multicultural workplaces in 
the Norwegian construction industry which was 
conducted between August 2016 and August 2017 
(Kilskar et al., 2017). 

The study was carried out to gain more knowledge
about how internationalisation influences the
working environment and safety in the construc- 
tion industry, including culture, management, atti- 
tudes, expectations, communication and behaviour 
amongst both workers and their employers. 

The overall objective of the study was to enable
organisations and businesses in the Norwegian
construction industry to work systematically on
improving performance and productivity through
a good working environment, collaboration and
increased safety at multicultural workplaces. An
important aspect has been to not only look for
problems and challenges at multicultural work- 
places, but also to find the potential benefits of 
having a multicultural working environment. 

The study aimed to answer three research
questions:


1. What are typical characteristics, strengths and
challenges related to safety and working environment
when using multicultural labour in the
construction industry?

2. How do challenges related to a multicultural
workforce influence safety, working environment and
work performance?

3. What measures exist and are proposed by the
construction industry to create good multicultural
workplaces?


1.1 Concepts and definitions 


Construction can be used as a collective term for
several sectors. In Norway, the construction industry
is divided mainly into the building sector (e.g. building
houses, commercial buildings) and the construction
sector (e.g. construction of roads, railways).
This study was limited to the building sector; however,
the term construction is used in this paper.
The study mainly focused on migrant workers
from eastern Europe, but interviews were also conducted
with workers of other nationalities to get
a broad understanding of multicultural workplaces
in the construction industry. In this study, eastern
European workers are defined to include nationalities
from countries in the former Eastern Bloc (i.e.
former Soviet countries, Bulgaria, Czech Rep.,
Hungary, Poland and the Balkans). However, the
main focus of the study was on workers from countries
which joined the European Union (EU) in
2004: Czech Republic, Estonia, Hungary, Latvia,
Lithuania, Poland, Slovakia and Slovenia.


2 BACKGROUND 


2.1 The Norwegian construction industry 


The construction industry is one of the fastest
growing industries in Norway and it accounts for
approximately 12.5 percent of the Norwegian GDP.
In 2016, the industry turnover reached 521.8 billion
NOK, and the combined workforce was
almost 235 thousand employees (SSB, 2017a). It
is characterised by many small and medium-sized
companies, with union density lower than average
in Norway (43 percent in the construction sector
versus 54 percent for other industries in Norway)
(Nergaard, 2016). The industry is project based,
with temporary organisations consisting of multiple
actors that form and dissolve with
each project. The large companies account for a
smaller share of the construction output.

Unemployment rates in Norway have been stable
at a low level, which has resulted in a shortage
of skilled workers. Traditionally, the majority of
Norwegian workers have been permanently hired,
and possibilities for temporary employment are
somewhat limited by the Norwegian Working
Environment Act. However, in recent years,
there has been an increase in the use of temporary
staffing agencies hiring migrant workers in the
construction industry.


2.2 Working environment and occupational safety 
in the construction industry 


The construction industry is one of
the mainland industries in Norway with the highest
number of fatal accidents. In 2016, 8 out of
45 work-related fatal accidents occurred within
this industry (NLIA, 2017a). In addition,
5.8 accidents at work resulting in prolonged
absence from work were reported per 1 000 employees in
the construction industry, whereas the average for
other industries was 3.7 (SSB, 2017b). The Norwegian
Labour Inspection Authority (NLIA) focuses
on controlling and guiding companies and workers
in the construction industry, highlighting the
importance of preventive measures.


2.3 Migrant workforce 


With low unemployment rates and a shortage of
skilled workers, the enlargement of the EU in
2004 gave companies the opportunity to recruit migrant
workers, while at the same time foreign service providers
gained access to the Norwegian construction
market. Between 2007 and 2015, work was the
most common reason for immigration to Norway,
followed by family reunification. The largest groups of
immigrants are from Poland, Somalia and Lithuania
(SSB, 2017). Many of the male immigrants who
came to Norway from eastern Europe after 2004
came to work in the construction industry. The
process was not free of tension, as the labour migration
led to many challenges concerning migrant
workers’ wages and working conditions, accommodation
standards, undeclared work, examples
of exploitation and so-called “social dumping”
(Dølvik et al., 2005).

One serious concern is the question of
whether migrant workers are more prone to work-related
accidents, as previous research both
abroad and in Norway suggests (NLIA, 2012;
Salminen, 2011). To look further into the background
of such numbers, a mixed-methods approach was
chosen, with a focus on the qualitative studies.


3 METHODS 


Triangulation was used to combine the advantages
of qualitative and quantitative methods, collecting
data through semi-structured interviews, participatory
observations and a survey. Additionally, a
literature study was performed prior to the data collection
by searching Scopus and Google Scholar for scientific
documentation. Further, searches on
the internet were done for news articles. Search
terms used were “multicultural workforce”,
“building/construction industry”, “management”
and “migrant workers”. The focus was on Norwegian
publications.

As workers from Poland constitute the largest 
group of migrant workers in Norway, this is also 
the nationality most represented in the interviews 
and observations. 


3.1 Interviews


Interviews were undertaken with 35 persons in
seven cases, each case being either a construction
project or a construction company. The majority of
the informants were from Norway (17) and Poland
(10), while the remaining eight came from other
countries: Germany, Denmark, Russia, Estonia,
Afghanistan and Ethiopia. The interviewees were
managers, work team leaders, safety deputies and
workers (including hired labour and apprentices).

A semi-structured interview guide was devel- 
oped in Norwegian, and questions were adjusted 
for managers, Norwegian workers and migrant 
workers. The interview guide was translated into
Polish and English for interviews with workers,
and into English for managers. As a quality assurance
measure, the interview guide was discussed with persons
from the construction industry, and a pilot
study was performed.

The interview guide was divided into five sec- 
tions. Section A included questions about the 
informants’ background, views on safety and 
working environment in general, and general ques- 
tions about migrant workers in Norway. Section 
B consisted of questions about characteristics, 
opportunities and challenges at multicultural work- 
places. Section C dealt with implications of hav- 
ing multicultural work teams, e.g. communication 
and cooperation between workers and managers, 
management and follow-up at the worksite, and 
characteristics of good multicultural workplaces. 
In section D, the focus was on good practices and
solutions for creating good multicultural workplaces.
Finally, section E included questions to
summarise the interview.

Interviewees were identified by the Federation
of Norwegian Construction Industries (BNL).
The selection process sought persons from different
companies, disciplines and countries, and from
different parts of Norway. Additionally, the size of the
project or company was a criterion. The projects where
the interviewees worked varied from small
projects to large housing complexes.

The interviews were performed in the period 
from November 2016 to May 2017. 


3.2 Observations 


To verify data from the interviews, participatory
observations were conducted at two construction
sites over a period of three weeks. BNL facilitated
contact with the observation sites. The field work
was performed by two anthropology students
supervised by one of the project members. Two
different methodological approaches were used:
1) observation with a high degree of participation
in the daily work, supplemented with a limited
number of interviews; 2) interviews as well
as observation, but with less participation in the daily
work.

The observations were important to gain knowledge
about interactions and cooperation between
Norwegian and migrant workers. The focus in the
observations was on eastern European workers.


3.3 Survey


A survey was performed among managers in com- 
panies that are members of BNL. 

The questionnaire was based on the results
from the interviews and participatory observations,
thus making it possible to generalise the findings
to a national level. The questionnaire
covered the same topics as the interview
guide (see 3.1).

A pilot test of the questionnaire was performed
as quality assurance of the content. BNL distributed
the questionnaire to the
respondents electronically.

In total, 5 774 managers were invited by e-mail
to answer the survey. Of these, 350 e-mails were returned
because of “unknown recipients”. Moreover, several
persons, estimated at about 870, replied that they were
not managers but worked in administrative or finance
departments. Finally, 886 persons in 562 distinct construction
companies answered the survey, resulting
in an approximate response rate of 19.5 percent
for individual respondents and 21 percent for
companies.
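
The individual response rate can be checked with a short calculation. The sketch
below is only an illustration: it assumes that the eligible population is the number
of invited managers minus the returned e-mails and the estimated non-managers,
which the text does not state explicitly. The variable names are ours. The
company-level rate is not reproduced because the number of invited companies is
not reported.

```python
# Figures reported in the text.
invited = 5774        # managers invited by e-mail
bounced = 350         # e-mails returned because of "unknown recipients"
non_managers = 870    # estimate of persons who replied they were not managers
respondents = 886     # persons who answered the survey

# Assumption (not stated in the text): bounced e-mails and non-managers
# are excluded from the eligible population before computing the rate.
eligible = invited - bounced - non_managers
response_rate = respondents / eligible

print(f"eligible managers: {eligible}")
print(f"individual response rate: {response_rate:.1%}")
```

Under this assumption the result agrees with the approximate 19.5 percent reported
for individual respondents.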


3.4 Limitations 


BNL is a business and employer policy organisa- 
tion for companies in the construction industry, 
and an umbrella organisation for 15 industries that 
organise a wide range of companies (in total over 
4000 companies with nearly 70,000 employees). 
The companies range from the smallest companies 
to the largest in the industry including manufactur- 
ing companies, plumbers, carpenters, landscape 
gardeners, masons, painters and entrepreneurs 
(BNL, 2018). Most of BNL’s branch associations 
have defined requirements for a company to apply 
for membership. The organisation advocates that 
Norwegian society should be built by serious, hon- 
est companies and that customers should be able to 
trust their suppliers. 

Most of the companies that participated in the
study were members of BNL. These companies are
assumed to operate responsibly and to ensure
health, safety and environment (HSE)
in accordance with regulations. This could have
influenced the results.

The survey was directed only towards managers,
since BNL only has contact information for managers
and administrative personnel among its
members.


4 RESULTS AND DISCUSSION 


The results of the study show the importance of 
several factors for ensuring the working environ- 
ment, cooperation and safety at construction sites. 
The main findings of the study are presented in the 
following. 


4.1 Differences between migrant and Norwegian workers


The study looked at perceived differences between 
migrant and Norwegian workers. In many aspects, 
including skills, competence and quality of per- 
formed work, the differences were found to be 
minor. However, groups of migrant
workers and Norwegian workers often have different
views of each other. For example, several of the
Polish workers who were interviewed perceived
themselves as more efficient and solution oriented
than Norwegian workers. The Norwegian workers,
on the other hand, had the same perception, but in
their own favour. This can be a source of misunderstandings,
as their perceptions differ. However,
the survey showed that there is, in many aspects,
only a minor difference in how managers perceive
Norwegian and migrant workers, respectively.

Even though there were many similarities 
between workers, one aspect appeared to be very 
different. When it comes to reporting of unwanted 
events there was a major difference between eastern 
European workers and Norwegian workers. The 
eastern European workers report fewer unwanted 
events and dangerous conditions. One reason the 
workers gave was that they do not wish to report 
on others to the management. These findings also 
coincide with previous research (Wasilkiewicz 
et al., 2016). However, these major differences 
were not found for all aspects related to safety. For 
compliance with safety regulations, the results from 
the survey showed that the managers observe 
differences, though to a much smaller degree than for 
reporting unwanted events. 

Findings from the survey further show that 
perceptions of migrant workers to a large degree 
depend on managers’ experiences with a migrant 
workforce. Managers from companies with permanently 
employed migrant workers were less critical 
towards migrant workers than managers from 
companies without permanently employed migrant 
workers. As an example, the leaders were asked the 
following question: “To what degree do you experi- 
ence that migrant workers do not comply with exist- 
ing safety requirements (e.g. do not use required 
protection equipment)?” Nearly 44 percent of the 
respondents from companies with no permanently 
employed migrant workers answered this ques- 
tion with “to a large degree” or “to a very large 
degree”. The corresponding proportion among the 
respondents from companies that do have perma- 
nently employed migrant workers was only close 
to 23 percent. The interviews also supported this 
finding, as managers with a lot of experience with 
permanently hired migrant workers were the most 
positive when discussing this topic. 

Managers’ perceptions of migrant workers 
also appear to be somewhat influenced by the 
size of their respective companies, as managers 
in medium-sized companies (which in the study 
were defined as companies with 22-51 employees) 
answered in more positive terms than the managers 
from smaller and larger companies did. 

This shows that there are large individual dif- 
ferences between workers within a nationality, as 
well as individual differences between managers 
(e.g. their expectations and requirements) and their 
experiences with migrant workers, resulting in dif- 
ferent opinions. 


4.2 Potential not fully exploited 


The interviewees were asked what Norwegian 
workers and managers and migrant workers can 
learn from each other. Additionally, the study 
examined the benefits of having a migrant 
workforce, as well as possible opportunities. 

The opinions varied; some stated that there 
is nothing to learn, while others had examples 
of what they had learned. The opinions to a large 
degree reflected personal experiences. 

When Norwegian managers were asked about 
what they think is positive about migrant workers, 
they emphasised flexibility, e.g. migrant workers’ 
higher willingness to work overtime, stability 
in the workforce (e.g. not taking work days off due 
to sick children), and willingness to continue work 
to complete tasks before the day is over. According 
to the managers, these points were to a large degree 
related to culture. However, they can also be related 
to other factors, such as the fact that many migrant 
workers have their families abroad, and therefore 
want to work as much as possible while they are 
in Norway. It is important to be aware that the 
described behaviour can also be related to power 
relations. Persons from different cultures can have 
different expectations of and relations to managers 
(House et al., 2004; Warner-Søderholm, 2012a; 
2012b), and thus perceive managers’ requests, such 
as working overtime, as duties rather than options. 

When it comes to learning potential in general, 
most Norwegian interviewees did not see that they 
could learn anything from the migrant workers. 
However, several managers pointed out that Norwegian 
workers could benefit from learning 
better work morale from the migrant workers and 
being “less lazy”. A few pointed out that migrant 
workers could teach local workers other working 
techniques and vice versa. 

Some of the migrant workers pointed out that 
they value the characteristics of the Norwegian 
working life, e.g. “calmness” and less stress, think- 
ing things over before starting a work task, and 
that they could learn this from Norwegians both 
for work, as well as for their personal life. 

The industry is not highly conscious of the benefits 
and learning opportunities that migrant workers 
might bring, or of how to utilise this 
potential. Earlier research shows that multicultural and 
diverse groups can be both more efficient and less 
efficient than groups consisting of one single cul- 
ture (Adler & Gundersen, 2008, p. 140). Through 
good collaboration, multicultural teams can be 
more efficient and innovative than homogeneous 
teams. By sharing knowledge and having differ- 
ent perspectives and experiences, such teams can 
achieve better solutions for common problems. 

From the free text field in the survey under- 
taken, an example was found of how the construc- 
tion industry can benefit from the knowledge of 
migrant workers: 


“Migrant workers and Norwegians have discussed 
product alternatives from abroad (...) with the con- 
sequence that products or alternative products are 
being imported from abroad now. (...) This resulted 
in economic benefits in the company because of a 
better calculation of the price, and often easier han- 
dling in production (a win-win situation for workers 
and the company)”. (Survey respondent). 


Migrant workers in the construction industry 
represent a larger resource than what is being uti- 
lised. The full utilisation of their potential is often 
prevented by poor linguistic skills (in Norwegian 
or English) and insufficient recognition of their 
formal competence from abroad. 


4.3 Challenges affected by employment 
relationships and duration of relations 


Results from the interviews, the participant observa- 
tions, as well as the survey indicate that language is 
the most common challenge related to multicultural 
workplaces. The Norwegian informants did, how- 
ever, consider this more of a problem than many of 
the migrant workers did. A shortage of skilled 
workers who speak Norwegian is part of the issue, 
as illustrated by the following response when asked whether 
speaking Norwegian should be an absolute requirement: 


“Yes, I believe so. It is, however, hard to get hold 
of enough people. Finding a man who knows both 
the language and how to do the job is not easy”. 
(Eastern European work team leader). 
[Figure 1: bar chart. Survey question: “Among these migrant workers 
(permanent and hired workers, respectively), can you estimate the 
proportion that speak Norwegian well enough to use it as their primary 
language at work?” Horizontal axis: proportion of migrant workers that 
speak Norwegian well enough to use it as their primary language at work 
(0%, 1-10%, 11-50%, >50%). Vertical axis: proportion of answers (%). 
Series: permanent workers, hired workers.] 


Figure 1. Distribution of managers’ answers regarding 
linguistic skills among their permanent and hired 
migrant workers. 


Figure 1 shows the distribution of managers’ 
answers regarding linguistic skills among their 
migrant workers. 

Close to 50 percent answered that more than 
half of their permanent migrant workers know 
Norwegian well enough to use it as their primary 
language. However, only 15 percent said the same 
about their hired migrant workers. In fact, close to 
60 percent answered that less than one in ten hired 
migrant workers speak Norwegian well enough. 

In addition to language related challenges, issues 
related to cultural differences were often mentioned. 
As an example, many migrant workers, and 
eastern European workers in particular, tend to say 
“yes” and give the impression that they understand 
messages when they do not. The following quote 
from a worker from Poland is illustrative: 


“I sometimes make mistakes when I don’t under- 
stand. I don’t know why I don't ask again. I don't 
know people working here, and I think I should 
understand”. (Eastern European worker). 


Thus, in several cases where communication 
fails, language related issues cannot be seen 
independently of those related to cultural differences. 
The respondents of the survey mostly agreed that 
managing migrant workers is more challenging than 
managing Norwegian workers. It is thus important 
both for managers and fellow workers to gain nec- 
essary understanding of such differences, and how, 
for example, differences in perceiving hierarchy 
and power distance can make Norwegian and east- 
ern European workers act differently. 

Other challenges included some migrant work- 
ers feeling that they were treated differently from 
their Norwegian co-workers, arguing that they 
were assigned to “harder” and more boring work. 


In addition to specific challenges that may occur 
at multicultural workplaces, several of the Norwe- 
gian managers and workers raised a concern that 
an extensive use of migrant workforce may nega- 
tively affect the recruitment of young, Norwegian 
workers to the industry. 

What many of the identified challenges have in 
common is that they appear to be amplified by 
employment relationships and conditions. The 
study shows that employment relationships are of 
greater importance than nationality when it comes 
to the challenges at multicultural workplaces. 
Hired workers from staffing agencies are less likely 
to learn the language, and companies are likewise 
less likely to invest in language courses for these 
workers. 

As mentioned, many managers find it harder to 
manage migrant workers than Norwegian workers. 
Findings from the interviews, however, indicate 
that leadership challenges are greater when involv- 
ing hired workforce rather than permanent workers. 
It is also the hired workers that most often claimed 
to be treated differently. As an example, some had 
experienced having to accept poor working condi- 
tions due to the fear of not being hired again. 

The findings also show that the challenges seem 
to be affected by the duration of relations. Problems 
may arise when workers stay at a building site for 
only a short amount of time: 


“You know, when a subcontractor comes to the con- 
struction site, like those balcony fitters, they are 
staying for two, maybe three weeks before finishing 
their work. You don't have the time to get to know 
them. You don’t know what they stand for and what 
their interests are for doing the work in a safe man- 
ner”. (Norwegian project manager). 


4.4 Greater consequences for work performance 
and working environment than for safety 


Some argued that the number of unreported 
unwanted events is larger among migrant workers, 
and quite a few imagined migrant workers being 
harmed at work more often than Norwegians. 
Despite this, most of the informants said that 
they had no basis for stating that migrant work- 
ers are overrepresented in their company’s accident 
records. 

Results from the survey clearly indicate that 
language related challenges and culture related 
misunderstandings cause building errors or disa- 
greements at work more often than they cause 
accidents or near accidents. This implies that lan- 
guage and culture related challenges cause greater 
consequences for work performance and working 
environment than for safety. The distribution of 
the answers is shown in Figure 2. The proportions 
of respondents answering that they have experienced 
construction errors due to language related 
challenges and cultural differences were 66 and 44 
percent, respectively. Some would argue that these 
numbers, and the corresponding numbers regard- 
ing the working environment and safety are unac- 
ceptably high. Still, when asked, very few could tell 
of any concrete examples in which any of these 
were the case. It is also important to note that the 
percentages represent the number of respondents 
that have either experienced or observed any of 
these things within the past three years. Thus, they 
do not necessarily indicate that these are com- 
mon problems over time. Also, when asked related 
questions in the interviews, almost no one could 
tell of any concrete events in which language or 
culture related challenges caused accidents or near 
accidents. 

Segregation at the construction site leaves many 
workers with an experience of being divided into 
“us and them”. The various nationalities typically 
keep to themselves during lunch and other breaks; 
especially in cases where the different professions 
constitute pure national working groups. This 
minimises the chances of extensive inclusion and 
involvement. The study also revealed several exam- 
ples of Norwegians that felt alone among eastern 
European co-workers. 


[Figure 2: bar chart. Survey question: “Have you in the past three years 
experienced or observed ...” with the items: language related challenges 
leading to accidents or near accidents; culture related misunderstandings 
contributing to accidents or near accidents; language related challenges 
contributing to disagreements or conflicts; cultural differences 
contributing to disagreements or conflicts; language related challenges 
leading to construction errors; culture related misunderstandings 
contributing to construction errors.] 


Figure 2. Results from survey. Managers’ answers as to whether they have 
experienced or observed construction errors, disagreements/conflicts or 
accidents/near accidents due to culture or language in the past three years. 

As described in 4.3, challenges may be affected 
by the duration of relations. As such, focusing on 
duration of relations is also important to maintain 
safety and working environment at multicultural 
workplaces. For the workers, getting to know each 
other may affect whether one is able to commu- 
nicate well about the daily work. This may again 
influence everyday working environment and 
safety. For employers, on the other hand, the dura- 
tion of relations affects how much one is willing to 
invest in the workers, for example through training 
and competence building. This, in turn, influences 
the prerequisites for the individual worker to work 
safely and in accordance with expectations from 
the employer and colleagues. 


4.5 Many measures focus on short-term solutions 


The study explored measures that exist to utilise 
the opportunities and cope with the challenges that 
come with using migrant workforce. It was found 
that most employers and construction sites do not 
put in direct efforts to promote the opportunities 
and fully utilise the resources that are represented 
by migrant workers. Measures to a large degree aim 
to cope with specific challenges in the present. 
As language is most often raised as a challenge, 
most measures also relate to coping with 
linguistic issues. These include, among others, 
translations of safety materials, HSE training in other 
languages, language courses, language requirements 
at work sites, and arrangements of working 
groups by language. 

In practice, some measures, such as language 
requirements, can be difficult to comply with due 
to the limited available workforce. Further, measures 
such as arranging working groups by language 
contribute to a vicious circle where migrant workers do 
not get the opportunity to learn Norwegian as they 
only work with people that speak the same language 
as themselves. And in turn, they do not get to 
work with Norwegians because they do not speak 
Norwegian. This is an example of a measure that 
is reactive in nature: it contributes in the short 
term to reducing linguistic challenges at a specific 
construction site, but does not address 
the challenges in the long run. 

Many of the informants highlight communica- 
tion as a key to success at multicultural workplaces. 
Language is a large part of communication, but 
not the only part. Cultural differences also 
influence communication at workplaces. 

In the study, it has been seen that duration of 
relations is important for successful construction 
sites. Knowing each other and having a relation to 
each other is also important for understanding and 
communication between workers, and between 
workers and managers. 
The study indicates both opportunities and 
challenges with multicultural workplaces. How- 
ever, to reduce the challenges and to benefit from 
the potential that lies in a migrant workforce, fur- 
ther solutions need to be developed, e.g. tools and 
measures. The study shows that there are already 
many measures in place, especially to cope with 
challenges, but many of them aim to solve prob- 
lems in a short-term perspective. There is also a 
need for measures and tools which aim to reduce 
the challenges in the future, rather than reduc- 
ing consequences here and now. Further, there is 
a need to look for opportunities that come with 
migrant workers. 


5 CONCLUSIONS AND FURTHER 
RESEARCH 


The challenges with migrant workers are quite well 
documented and studied both internationally and 
recently in Norway. There have, however, been many 
unjustified assumptions when it comes to multicultural 
labour and consequences for safety, working 
environment and work performance. This study 
has documented both challenges and strengths of 
using migrant workers, and also a construction 
industry that has introduced different measures to 
create well-functioning multicultural workplaces. 

The findings of this study do not support the conclusion that 
migrant workers are involved in more accidents 
than Norwegian workers. However, when it comes 
to reporting of unwanted events, significant differ- 
ences were found between migrant and Norwegian 
workers. 

Rather than resulting in many accidents or near 
accidents, challenges related to language and cul- 
ture were found to lead to more consequences for 
work performance and working environment. Sev- 
eral informants said they imagined language and 
culture causing safety related consequences, but 
few could provide concrete examples of events in 
which this had been the case. 

Further, it was found that employment relation- 
ships are of greater importance than nationality 
when it comes to challenges related to multicul- 
tural workplaces. Duration of relations is impor- 
tant to maintain safety and working environment 
at multicultural workplaces. 

The study also shows that the potential for benefitting 
from knowledge and experiences among 
migrant workers is realised only to a small degree, and 
that the industry’s focus is clearly on the challenges. 

Measures to cope with the challenges were 
found to be mostly of a short-term nature, handling 
challenges right here and now, whereas few 
measures are of such a nature that they improve 
long-term conditions. 


Migrant labour has become a natural part of the 
Norwegian construction industry, and the indus- 
try partners must take responsibility and work 
together to cope with the challenges and promote 
the opportunities. Measures to promote good mul- 
ticultural workplaces include improving managers’ 
understanding of cultural differences, adapting 
leadership style, as well as making conscious deci- 
sions regarding organisation of the work. A future- 
oriented focus is needed, including measures that 
may lead to long-term gains for the industry, and 
measures to highlight the opportunities and exploit the 
potential that lies in the differences between different 
groups of workers at multicultural workplaces. 
ABSTRACT: The industrialisation of Europe and Asia led to the development of innovative and 
reliable technically complex products. Industrial nations like Germany and Japan took separate paths to 
becoming highly developed industrial nations. This led to different goals and philosophies in reliability 
and safety design for product and manufacturing process development. This paper shows essential 
aspects of the differences between German and Japanese reliability and safety engineering. The starting 
point is the fundamental difference in how the principles “innovation” and “optimisation” are realised: 
German product functionality often contains innovative solutions, whereas Japanese product functionality 
often shows optimised solutions. These different designs are caused by different engineering philosophies and 
factors. The paper discusses the historical factors, cultural factors and geographical conditions that 
lead to the different technical reliability and safety engineering solutions in Germany and Japan. Finally, 
the paper shows two examples of German and Japanese technical products and engineering solutions to 
illustrate the different strategies in German and Japanese product design. 


1 INTRODUCTION 

The industrialisation of Europe and Asia led to 
the development of innovative and reliable 
technically complex products. Industrial nations like 
Germany and Japan took separate paths to becoming 
highly developed industrial nations. Both countries have 
the ability to develop and manufacture technically 
complex products to a high reliability and 
safety standard. But a detailed look at Japanese 
and German products shows different characteristics 
regarding the principles “Innovation” and 
“Optimisation”, especially with respect to reliability 
and safety. Newly developed German product 
generations often include innovations in 
functionality. Japanese product engineering 
very often focuses on optimisation relative 
to the previous product generation. An illustrative 
example is the comparison of the Japanese 
and German railway systems (including trains, 
railway network and ticket machine logistics) with a 
focus on criteria such as innovation, optimisation, 
efficiency and reliability. The railway system was 
invented in Europe (Great Britain, afterwards also 
Germany; cf. (Griffin 2010) and (Pollard 1981)). 
Subsequently the U.S. established the railway system 
in Japan after opening the country for trade. 
The Japanese engineers adapted and optimised 
the railway system: the concept (e.g. separation 
of high-speed and regional train systems, as well 
as rail cargo), the logistics (e.g. passengers entering 
and leaving trains) and the ticket machine 
(e.g. parallel paying, non-limited bank-note); cf. 
(Nussbaum and Roth 2005). 

However, the two approaches “Innovation” and 
“Optimisation” lead to different engineering 
solutions for technically complex products 
in terms of reliability and safety. Both reliability 
and safety design approaches have advantages and 
open the chance to learn from each other. This 
paper shows essential aspects of the differences between 
German and Japanese reliability and safety 
engineering within a research study. The starting point 
is the fundamental difference between the approaches 
“innovation” and “optimisation” within the 
product development process. Subsequently, the 
paper shows different impacts (e.g. historical, 
cultural and geographical) on innovative and 
optimised engineering. 


2 GOAL OF RESEARCH STUDY 

In this research study, the main influences which 
support product innovation and optimisation within 
an engineering process are analysed. The goal of the 
research study is to work out the reasons why German 
product functionality is mostly characterised 
by innovative solutions, whereas Japanese products 
often show optimised solutions. 

In a first step, this paper illustrates the principles 
and influences with supporting or restraining 
impacts on creative engineering and product 
innovations (cf. sections 3.1 and 3.2). Furthermore, 
indicators for the measurement of innovation are 
shown (cf. section 3.3). Subsequently, the paper 
discusses the main influences and aspects of Japanese 
and German historical factors, cultural and educational 
factors, as well as geographical conditions, 
which lead to the engineering preference for 
the approach “Innovation” or “Optimisation” (cf. 
chapter 4). Lastly, German and Japanese product 
examples are explained to illustrate the applied 
approaches with a focus on innovative or optimised 
engineering. 


3 BASE OF OPERATIONS: INFLUENCES 
ON CREATIVITY AND INNOVATION 


The principles and main influences that 
support creativity and innovative thinking 
have been worked out by several authors, e.g. Linneweh, 
Rammert et al. and Dehio et al. Section 3.1 gives 
a summary of influences that support creativity 
and innovative thinking within engineering 
processes. Section 3.2 gives a short overview of 
influences with restraining impacts on innovations. 
Finally, section 3.3 shows essential indicators 
that can be used for the measurement of innovation 
capability. 


3.1 Influences with supporting impacts on creative 
thinking and innovative engineering 


The main influences supporting creative thinking 
and the development of product innovations are 
as follows: 


1. Possibility of free thinking (Linneweh 1984), 

2. Possibility for realisation, especially financial 
support, state subsidy; cf. (Rammert et al. 2016) 
and (BDI et al. 2017), 

3. Consumer society based on customer expectations 
shaped by lifestyle thinking (Rammert 
et al. 2016). Note: In the past, the customer 
expressed requirements; today, lifestyle 
is additionally an essential part of decision 
making, 

4. Supply and demand, the base of a free market 
economy; cf. (Rammert et al. 2016), 

5. Social innovations based on customer’s expecta- 
tions; the customers are familiar with dynamic 
changes, cf. (Rammert et al. 2016), 

6. Motivation and incentive (e.g. money, 
reward, appreciation); cf. (Linneweh 1984), 

7. Knowledge and new technologies lead to new 
product functionalities; cf. (Rammert et al. 
2016), 

8. Digitisation (BDI et al. 2017) and 

9. Wish for diversity (e.g. gender, ethnicity, 
religion); cf. (BDI et al. 2017) and (Linneweh 
1984). 


3.2 Influences with restraining impacts 
on innovations 


The main influences that restrain or stop 
creative thinking and the development of product 
innovations are listed below: 


1. Organisation with rules, structure, norms and 
regulations (Linneweh 1984), 

2. Bureaucracy (Linneweh 1984), 

3. Huge enterprises (combined with hierarchical 
structure); cf. (Dehio et al. 2006), 

4. No possibility for creative thinking, combined 
with the hindrance of personal development 
(BDI et al. 2017), 

5. No possibility for experiments and mistakes; cf. 
(BDI et al. 2017). 


3.3 Indicators to measure innovation capability 


The measurement of the innovation capability is 
based on fundamental indicators in relation to a 
country/state, which are shown in the following list 
(cf. (Dehio et al. 2006) and (BDI et al. 2017)): 


1. Patents per inhabitant, 

2. Scientific publications per inhabitant, 

3. Investments regarding science in relation to the 
gross domestic product (GDP), 

4. Turnover/sales based on new or improved prod- 
ucts and 

5. Balance of trade. 
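
The five indicators above can be sketched as simple ratios. The following Python snippet is a minimal illustration only: the `CountryData` fields, the function name `innovation_indicators` and all numbers in the example are hypothetical placeholders, not data from the cited sources, and the exact definitions used by Dehio et al. (2006) and BDI et al. (2017) may differ.

```python
from dataclasses import dataclass


@dataclass
class CountryData:
    """Raw yearly figures for one country (all fields are illustrative assumptions)."""
    inhabitants: int
    patents: int
    publications: int
    science_investment: float  # same currency unit as gdp
    gdp: float
    sales_total: float
    sales_new_products: float  # turnover from new or improved products
    exports: float
    imports: float


def innovation_indicators(c: CountryData) -> dict:
    """Return the five indicators listed in section 3.3 as plain numbers."""
    if c.inhabitants <= 0 or c.gdp <= 0 or c.sales_total <= 0:
        raise ValueError("inhabitants, gdp and sales_total must be positive")
    return {
        "patents_per_inhabitant": c.patents / c.inhabitants,
        "publications_per_inhabitant": c.publications / c.inhabitants,
        "science_investment_share_of_gdp": c.science_investment / c.gdp,
        "new_product_share_of_sales": c.sales_new_products / c.sales_total,
        "balance_of_trade": c.exports - c.imports,
    }


# Made-up numbers, chosen only to exercise the function.
example = CountryData(
    inhabitants=1_000_000,
    patents=500,
    publications=2_000,
    science_investment=3.0,
    gdp=100.0,
    sales_total=50.0,
    sales_new_products=10.0,
    exports=40.0,
    imports=35.0,
)
result = innovation_indicators(example)
print(result)
```

Normalising per inhabitant or per unit of GDP is what makes countries of different size comparable; the balance of trade is kept here as an absolute difference, which is one possible reading of the indicator.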


4 COMPARISON OF GERMAN AND 
JAPANESE ENGINEERING 


Industrial nations like Germany and Japan took 
separate paths to becoming highly developed industrial 
nations, leading to different focuses regarding the 
approaches “Innovation” and “Optimisation” 
within engineering processes. This chapter works 
out the impacts leading to different philosophies 
and designs in terms of “Innovation” and “Optimisation” 
within product and manufacturing engineering 
in the respective countries. Section 4.1 discusses 
the historical impact in relation to the different ways 
of the industrialisation of Japan and Germany. The 
cultural and societal impact, especially the value 
of education, is shown in section 4.2. Section 4.3 
gives an overview of the main geographical 
and locational impacts. 


4.1 Historical impacts: Industrialisation in 
Germany and Japan 


The industrialisation of Europe and Asia led to the 
development of innovative and reliable technically 
complex products. Industrial nations like Germany 
and Japan took separate paths to the level of highly 
developed industrial nations. Industrialisation 
started in Europe in the second half of the 18th 
century in England, and subsequently in Germany; 
cf. (Griffin 2010) and (Pollard 1981). During the 
time of the German empire, industrialisation 
grew rapidly. As an example, the industrialisation 
process can be explained by the development 
of the railway system. Railway history began 
in 1804 with the introduction of 
the first steam locomotive by Richard Trevithick. 
Afterwards, in 1825, the first public railway 
was opened by the Stockton and Darlington 
Railway in England (Tomlinson 1915), enabling 
the transport of people and rail cargo. Because 
travelling times were reduced in Europe 
and North America, the industrial revolution itself 
was supported by the railway system. Furthermore, 
it was pushed by the European wars, especially the 
Franco-German war from 1870 to 1871, where the 
transport of war goods was important. Today, the 
high-speed train ICE (“Inter City Express”) is an 
essential part of the German traffic system: The 
four ICE generations were developed between 1990 
and 2017 (Jehle et al. 2006). 

In Japan, however, industrialisation started 
at the end of the 19th century, caused by external 
pressure from the United States. The external pressure 
led to the end of the Edo era (1603 until 1867) 
and the start of the new Meiji era (1868 until 1912) with 
its rapidly growing industrialisation (Nussbaum 
and Roth 2005). During the Edo era, Japan was 
isolated from the rest of the world. Entry and 
departure for foreigners were not allowed (exception: 
small exchange with China and the Netherlands; 
colony island Dejima) and contact with other states 
was kept to a minimum. In 1853, four warships 
led by Matthew Perry entered the harbour 
of Uraga. Based on a message from US president 
Millard Fillmore, the Tokugawa government was 
pressured to start trade with the United States 
(otherwise the U.S. would go to war); cf. (Morinosuke 1976). 
The arrangement included the opening of the harbours 
Shimoda and Hakodate and the possibility 
of trading. In parallel, the Edo era ended and the 
reform of the empire led to the Meiji era, starting 
in 1868. The rule of the warrior gentry ended, the 
constitution was modernised and the development of the 
country, especially from the engineering point of 
view, started. This point is illustrated 
by the Japanese railway system; cf. (Givoni 
2006): Within the Meiji era, the railway system 
was developed, but limited to a track gauge of 
1067 mm. Therefore, the speed level was limited 
to 100 km/h until 1950. Huge steps in railway 
system development followed in the sixties, 
subsequently resulting in the development of the 
high-speed trains (“Shinkansen”, high speed) in 
the nineties in parallel to Europe (1992: Series 
400; 1997: E3; 1999: E3-1000). Today the railway 
system in Japan is the most reliable and accurate 
system worldwide (Givoni 2006). 

The other historical aspect impacting Japanese 
product development is World War II. Because of 
the labour shortage after World War II, Japanese 
companies trained their engineers for multiple tasks 
and skills (called “Tanoko” in Japanese) based on 
long-term employment and long-term trading 
with suppliers. As a result, Japanese engineers are 
able to work in many different teams at a very 
efficient level. Furthermore, Japanese engineers are 
able to find optimum design solutions faster than 
their colleagues in other countries through a 
trial-and-error process called “Suriawase”, which is the 
Japanese way of negotiating concurrently among several 
teams of engineers, including design teams, production 
divisions and parts manufacturers (suppliers), 
from the initial design phase to the detail design 
phase (Inoue et al. 2011). 

A comparison between Germany and Japan shows the following: In
Germany, innovative engineering was pushed by industrialisation under
environmental conditions that allowed free thinking. The education
system of Prussia and the subsequent German Empire (“Humboldtian
Model”; (Baumgart 1990)) allowed free thinking and the development of
creative solutions (cf. section 4.2). Japanese engineering, in
contrast, was pushed by the opening of the country, caused by
external pressure from the U.S. Many technologies were imported from
Europe and the U.S. during the Meiji era and adapted to the Japanese
environment (Nussbaum and Roth 2005). This is part of the basis for
the Japanese focus on the optimisation of imported technical products
and processes.


4.2 Culture impacts: Education system, society 
and work environment 


The cultures and societies of Germany and Japan are very different.
German society is influenced by a hierarchical system alongside an
established social market economy. As mentioned in section 4.1, the
Prussian education system, founded by Wilhelm von Humboldt, focusses
on a comprehensive education based on a combination of research and
teaching. The “Humboldtian Model” (Baumgart 1990) combines arts and
sciences with research. The goal is comprehensive general learning
and cultural knowledge. The model is still the basis of the German
education system today and is the foundation for creative, innovative
thinking. Furthermore, the social market economy has been established
in Germany since 1949. The social market economy allows free
entrepreneurial activity while providing for social compensation. It
was pushed by Chancellor Ludwig Erhard and is an important factor in
ensuring free thinking in a secure environment.

Japanese society is influenced by a strong hierarchical structure.
Behaviour within the society is characterised by discipline.
Furthermore, education is highly valued within Japanese society (as
in all Confucian-oriented Asian countries).

An important characteristic of the Japanese working environment is
the principle of lifelong employment (until the 20th century); cf.
(Japan 2012). Furthermore, loyalty to the employer is very important.
Loyalty, discipline and healthy food lead to low rates of employee
absence due to illness (Japan: average 1%; Germany: average 3.2%).
The strong connection between working environment and personal
lifestyle leads to a strong identification of Japanese employees with
their work.

Both countries have education systems that ensure a high level of
educational qualification. The differences lie in the social market
economy, the hierarchical system, employee identification with work,
and discipline. These aspects have a strong influence on the
different engineering philosophies regarding product innovation and
product optimisation. On the one hand, a strong hierarchical
structure, discipline and loyalty support product optimisation. On
the other hand, free thinking within a social market economy supports
creative product innovations. The prerequisite for the ability to
create product innovation or optimisation is the educational
foundation, which is present in both countries.


4.3 Geographical and location impacts 


Geographical conditions have a strong impact on innovative or
optimised solutions. The comparison between Japan and Germany shows
obvious differences: Japan is located at the rupture zone of four
tectonic plates of the earth's crust:


North American plate in the north,
Eurasian plate in the west,
Philippine plate in the south and
Pacific plate in the east.


Each year, the plates drift by some centimetres. The Pacific plate
moves below the Eurasian plate, which leads to volcanic eruptions and
many earthquakes. On average, 73 earthquakes with a Richter magnitude
of 4 or higher occur in Japan per month (thereof nine per month with
magnitude 5 or higher; 1.4 earthquakes per month with magnitude 6 or
higher), cf. (Earthquake Research Committee et al. 2011).
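
To put the quoted monthly averages in perspective, the following minimal sketch converts them to yearly and daily rates. Only the three monthly figures come from the text; the function names and the average month length of 365.25/12 days are assumptions made for this illustration.

```python
# Monthly averages quoted in the text (Earthquake Research Committee et al. 2011):
# magnitude threshold -> average number of earthquakes per month in Japan.
MONTHLY_RATES = {4: 73.0, 5: 9.0, 6: 1.4}

# Assumption for this sketch: an average month has 365.25 / 12 days.
DAYS_PER_MONTH = 365.25 / 12


def per_year(monthly_rate: float) -> float:
    """Convert an average monthly rate to an average yearly rate."""
    return monthly_rate * 12


def per_day(monthly_rate: float) -> float:
    """Convert an average monthly rate to an average daily rate."""
    return monthly_rate / DAYS_PER_MONTH


if __name__ == "__main__":
    for magnitude, rate in MONTHLY_RATES.items():
        print(
            f"magnitude >= {magnitude}: "
            f"{per_year(rate):.1f} per year, {per_day(rate):.2f} per day"
        )
```

Under these assumptions, the magnitude 4 figure corresponds to more than two such earthquakes per day on average, which helps explain why vibration loads are treated as a routine design condition.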

Germany, in contrast, is located in the centre of Europe and is
nowadays practically unaffected by earthquakes and volcanic eruptions.

These facts lead to very different strategies in Germany and Japan
regarding safety and reliability engineering in many fields, such as
facility engineering (e.g. buildings), mechanical engineering (e.g.
railway systems) and electromechanical engineering (e.g. consumer
goods). Unlike in German product development, Japanese risk and
reliability analysis always considers safety aspects based on
earthquake-vibration impacts. This is reflected e.g. in
constructional, material and electrical aspects as well as in
shutdown strategies for the respective product concepts.


5 COMPARISON OF JAPANESE AND 
GERMAN CONSUMER PRODUCTS 
CONSTRUCTION 


The comparison of Japanese and German consumer products, e.g. water
boiler (section 5.1) and ticket machine (section 5.2), shows the
influence of historical, cultural and geographical factors on
reliability and safety engineering solutions.


5.1 Safety related construction: Water boiler 


Electric consumer goods like water boilers, microwaves and other
electric devices need electricity during operation. Stress on the
connection cable between power source and electrical device (e.g.
tensile force, bending force or torsion) can lead to damage. As a
consequence, cable damage can lead to fire. The construction of
Japanese electrical consumer products often includes an easily
detachable connection between electric device and power source (cf.
Fig. 1).

The reason for this construction is the geographical impact (cf.
section 4.3): In case of an earthquake and the associated vibrations,
the electrical devices within households are directly disconnected
from the power source. This prevents damage cases with an electrical
root cause (e.g. damage to the cable or technical device, which can
cause fire). In comparison, electric consumer goods built to European
(respectively German) standards mostly do not have such a
safety-related construction (cf. Fig. 2). The geographical location
of Germany or Europe allows possible earthquake impacts to be
neglected.


Figure 1. Japanese safety related construction: Easy detachable
connection to avoid electric cable damage in case of force effect
(e.g. caused by earthquake and involved vibrations).


Figure 2. Fixed (non-detachable) connection between electrical
device and power source (European standard).


5.2 Functionality related optimisation: Ticket 
machine in railway system 


Part of the industrialisation in Great Britain was the establishment
of the railway system (cf. section 3.1), including the train,
network, logistics and ticket machine. This transportation concept
was an innovation in the transport of goods and people. It was built
in Europe (especially Great Britain and Germany) and afterwards in
North America. Subsequently, the railway system was transferred to
Japan at the end of the 19th century. Each component was optimised by
Japanese engineering, e.g. the functionality of the ticket machine.

A direct comparison of the basic function “paying process” shows the
aspects of the European (respectively German) standard construction
and the optimised Japanese construction.




European/German standard (cf. Fig. 3):


• Paying with coins: the procedure is one-by-one,

• Paying with bank-notes is possible only in limited form (e.g. 20
or 50 Euro); the procedure is one-by-one,

• Change is returned in coins when paying with a bank-note.


Japanese optimised construction (cf. Fig. 4):


• Paying with coins can be parallel, because the slot has a hopper
form; furthermore, the machine has a sorting function,

• Paying with a 10,000 Yen bank-note (approximately 100 Euro) is
possible; procedures are “parallel” or “one-by-one”,

• Change is returned in bank-notes in optimal standard units when
paying with a bank-note.


Figure 3. Ticket machine: Standard solution, focus: 
single coin slot regarding paying by coins, procedure is 
one-by-one. 


Figure 4. Ticket machine: Optimised solution based on 
Japanese Engineering. Focus: Parallel paying (e.g. coin 
slot hopper form). 


6 SUMMARY 


This paper shows essential aspects of the differences between German
and Japanese reliability and safety engineering. The engineering
approaches are based on the fundamental difference between the
realised technical approaches “innovation” and “optimisation”: German
product functionality is often characterised by innovative solutions,
whereas Japanese products often show optimised solutions with respect
to the previous product generation.

The starting point is the influencing factors that support or
restrain creative thinking and, as a consequence, support product
innovations or optimisations. Subsequently, the historical factors,
cultural factors and geographical conditions in Germany and Japan are
analysed; these lead to the different technical reliability and
safety engineering solutions in German and Japanese engineering.

Finally, the paper shows two examples of technical products and
engineering solutions in Germany and Japan to illustrate the
different strategies in German and Japanese product design.

Both product design approaches, innovative on the one hand and
optimised on the other, have advantages and offer the chance to learn
from each other.


ABSTRACT: It is noted that the difficult market conditions faced by the oil industry during the last several
years have manifested in a negative safety performance. No direct relationship exists to explain this
trend. Even though many stakeholders instinctively believe that the extreme cost-efficiency drives seen within
the industry are somehow responsible for this outcome, no clear-cut mechanisms or pathways have
yet been proposed. This paper presents the preliminary development of a schematic model basis intended
to explain the impacts of economic pressure on the safety performance of a profit oriented organization when
faced with market challenges. Further development of this model basis is expected to provide a clearer
picture of this interrelation between safety performance and economic performance.


1 INTRODUCTION 

Due to the downturn of the oil industry over the last several years,
many dramatic changes have been seen within the management structures
of commercial entities engaged in the business. Some of these changes
have unintentional consequences for the safety culture, barrier
management, and the overall safety performance of organizations.
These consequences can manifest as long lasting, and sometimes
delayed, impacts and could lead to catastrophic major accidents as
well as gradually deteriorating HSE performance. Therefore, timely
recognition of these impacts and their pathways is crucial to avert
short-term and long-term losses. Reasons (1998) pointed out that
technologically complex high-tech industries are more vulnerable to
organizational (major) accidents due to their intricate systems and
subsystems that no single person could comprehend in isolation.
Accordingly, weakened barriers or latent flaws accumulated during an
economically hard time period could stay dormant for years or decades
before they come to play a role in a major accident.

This paper explains possible mechanisms behind the recent negative
trend in safety performance observed within the oil & gas industry,
presumably instigated by the dramatic downturn of the crude oil
economy seen during the last couple of years (Botheju & Abeysinghe,
2017). While recognizing the necessity of adopting certain
organizational changes in order to face the new economic reality of
the industry, the paper highlights the importance of understanding
the drivers behind this discomforting trend that is threatening the
prudent safety management procedures established over decades. It is
recognized that there can be certain chain-linked relations between
some innocent-looking cost-cutting measures and the organization's
safety culture, which dictates its overall behavior towards safety
performance and barrier management.

This article can also assist in developing pru- 
dent guidelines that could be used to implement 
a robust safety management system that performs 
even under challenging economic circumstances 
without compromising the safety and well-being 
of the organization. 

The article is based on long-term experience of 
the authors, and careful observation of industrial 
dynamics related to safety and risk management. It 
is intended that this paper provides a much needed 
insight into the driving factors behind safety per- 
formance change currently being observed within 
the oil & gas industry, while establishing a sche- 
matic model basis to comprehend its safety dynam- 
ics under cost-efficiency pressure. 


2 RECENT TRENDS IN SAFETY 
PERFORMANCE 


The previous works of the authors (Botheju & Abeysinghe, 2017;
Botheju & Abeysinghe, 2016) have argued that the downturn of the oil
industry has manifested itself as a sudden nosedive of the overall
safety performance. Even more threateningly, some of the resultant
impacts, especially regarding the process safety risks, could come
into effect years from now. In relation to the Norwegian petroleum
industry, Engen et al. (2017) have pointed out that, while the level
of safety and working environment conditions still remains relatively
high, several safety challenges and serious conditions have started
to manifest during the last few years.

The existence of an apparent correlation between economic pressure
and safety performance had previously been identified by other
authors as well (Rasmussen, 1997; Coles et al., 2000; Barden and
Lodestone, 2006; Young, 2015). However, many of the past case
examples relating major accidents to cost-cutting measures had
straightforward links connecting key management decisions to poor
safety barrier management (Chauhan, 2005; US Chemical Safety and
Hazard Investigation Board, 2007). The current trend in the oil
industry is more intricate, and such clear-cut pathways are still
difficult to observe. Among previous modelling attempts, Rasmussen's
(1997) migration model is quite unique. He proposed the existence of
a boundary of functionally acceptable performance; operating outside
of that boundary would lead to accidents. Rasmussen further theorized
that there exists a gradient created by management's pressure
directing the organization towards higher efficiency. This gradient,
unless sufficiently counterbalanced by a gradient of safety culture,
can gradually migrate the activities towards the functionally
acceptable performance boundary.
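
The migration idea can be pictured with a deliberately simple toy sketch. It is not Rasmussen's formal model: the one-dimensional operating point, the constant gradient values, the function name and the step limit are all assumptions made here for illustration only.

```python
from typing import Optional


def steps_to_boundary(
    efficiency_gradient: float,
    safety_gradient: float,
    start: float = 0.0,
    boundary: float = 1.0,
    max_steps: int = 10_000,
) -> Optional[int]:
    """Count the steps until the operating point reaches the boundary.

    The operating point moves on a hypothetical scale from 0.0 (well inside
    the safe region) to `boundary` (boundary of functionally acceptable
    performance). Each step it drifts by the efficiency pressure minus the
    counteracting safety-culture pressure. Returns None if the boundary is
    not reached within `max_steps`.
    """
    position = start
    for step in range(1, max_steps + 1):
        position += efficiency_gradient - safety_gradient
        position = max(position, 0.0)  # cannot be safer than the safe extreme
        if position >= boundary:
            return step
    return None


if __name__ == "__main__":
    # Efficiency pressure fully counterbalanced: no migration.
    print(steps_to_boundary(0.25, 0.25))
    # Efficiency pressure only partly counterbalanced: slow migration.
    print(steps_to_boundary(0.5, 0.25))
    # No counter-gradient at all: fast migration.
    print(steps_to_boundary(0.5, 0.0))
```

The point of the sketch is only the qualitative behaviour: the smaller the counter-gradient relative to the efficiency gradient, the sooner the operating point reaches the boundary.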

The challenge, therefore, is to recognize exactly the driving
mechanisms behind this recent trend. Most commercial entities
continue to emphasize that they are still prioritizing safety as a
paramount factor during their operations, and refuse to accept that
any of their management actions could have led to a deteriorated
safety performance.

In a way, what companies are claiming has a surface truth. Unlike in
past eras, the modern socio-ethical context and the associated legal
and regulatory frameworks leave only limited room for management
bodies to initiate direct actions that could openly jeopardize
safety. And above all, most companies nowadays understand the gravity
of such detrimental actions. Therefore, no sensible management would
consciously support any action that clearly leads to poor safety
performance.

Nevertheless, this paper argues that there are certain pathways
linking the dramatic cost-efficiency actions and deteriorating safety
performance. We denote these links as “Safety X-factors”. The
schematic model, named the “Safety X-factors Model”, which is
currently in its development stage, tries to explain these enigmatic
connections and establish them within a model structure.


3 SAFETY X-FACTOR MODEL 


As indicated in section 2, the overall safety performance within the
oil and gas industry shows a negative trend aligned with the market
downturn, with much visible evidence. Nevertheless, no management
accepts that it is actually driving this trend. This raises the
question “what mechanisms are responsible for this trend then?”. The
Safety X-factor model (abbreviated as SXF model) is presented as a
basic attempt to answer this question. Note that this paper only
presents its preliminary development. Figure 1 provides a schematic
illustration of this model showing some of the key components and
pathways proposed.


3.1 What is the SXF model


SXF can be introduced as a basic schematic model in development aimed
at explaining the safety performance outcome of an organization (or
an entire industry in a broader sense) when faced with one or more
exo-organizational events bearing certain economic impacts on the
organization. In the current context, the relevant exo-organizational
events are the market downturn triggered by low crude oil prices, and
the extreme competition (the second is actually, to a larger degree,
a result of the first in this case). The consequential prime
organizational impact is the reduction of profit or even losses.


Figure 1. Schematic Illustration of the Safety X-factor Model (Preliminary).


3.2 The model structure 


The model includes the behavioral dynamics of three (3) entities;
namely, (i) Management, (ii) Employees/operators (Executors), and
(iii) the technical safety barrier systems. In the current version of
the model, the organizational and procedural safety barriers are not
separately addressed, but considered to be embedded within the human
operators' behavior.

The model describes various influence pathways between the above 3
entities that eventually lead to negative safety performance
materialized as near misses, accidents, poor health, or degrading
environmental impacts.


3.3 Management reactions 


The SXF model describes the apparent management reactions as a
threefold approach; i.e. (i) Cost-efficiency drive, (ii) Attention
diversion/reduction, (iii) Resource diversion/reduction; each of
these is briefly described below.


3.3.1 Cost-efficiency drive 

This is the natural “panic action” by most managements when faced
with market threats. Loss of profits forces top management to pursue
extreme cost reduction/efficiency enhancement measures. While the
necessity of some such actions is justified, some extreme measures
could significantly change an existing positive safety culture, as
explained by Botheju & Abeysinghe (2017). It is argued that such
cost-efficiency measures could lead to negative reinforcements on
behavioral safety under certain situations.


3.3.2 Attention diversion/reduction 

When good economies exist backed by favorable market conditions, top
managements usually pay close attention to HSE aspects. This is the
normal behavior of organizations possessing an adequately good safety
performance history. However, a management is tested when it faces
difficult market conditions and poor economic performance. Will it be
capable of maintaining the same level of commitment to safety under
pressing economic situations? Often, managements yield under such
situations and their attention is significantly drawn away from their
usual emphasis on safe performance to other more urgent matters such
as economic issues and profitability of operations. This in turn
reduces the positive reinforcement previously received by operators
for their good safety performance.


3.3.3 Resource diversion/reduction 

Diversion of monetary and human resources away from the continuous
improvement of safety systems and the proper maintenance of existing
safety barriers is a natural trend that can be observed during
economically challenging periods.

Many technical safety barriers are costly to establish and their
benefits are not immediately apparent, or may be perceived as “it can
wait”. Meanwhile, the actual cost components associated with such
safety systems are very real and have to be paid immediately from the
existing economic resources. Under this scenario, many managements
could be tempted to abandon or delay various continuous safety
improvement actions and maintenance/upgrade actions. This perpetual
conflict between production and protection (Reasons, 2000) can lean
heavily towards production when the resources are more limited.


3.4 Human operator impacts 


The above described management reactions gener- 
ate multiple responses from human operators, as 
briefly described below. 


3.4.1 Efficiency stress and workload stress 

The extreme cost-efficiency drives combined with negative
reinforcements, and the lack of positive reinforcements originating
from the attention diversion reaction, lead to a high level of worker
stress, which is further aggravated by the reduced amount of human
resources (Resource Diversion/Reduction).


3.4.2 Insufficient quality control 

Both the attention diversion management reaction and the resource
diversion/reduction reaction generate this impact on human
operations. The lack of positive reinforcements further weakens
safety quality control barriers.


3.4.3 Communication breakdown

The extreme cost-efficiency drives coupled with associated negative
reinforcements lead to an increased rift between the management and
the executors, leading to the breakdown of efficient two-way
communication. Coherent safety management becomes increasingly
difficult under such a communication breakdown.


3.4.4 Human error 

A human error is an inadvertent event generated through the actions
of human operators while trying to follow a preplanned course of
action. The likelihood of human error rapidly increases when
operators are under stress. The probability of discovering such an
error also diminishes in the face of insufficient quality control.


3.4.5 Shortcut procedures 

A shortcut procedure is an intentional deviation from the planned
(safe) course of action. Operators resort to such shortcut procedures
either because such actions are indirectly promoted by the
organizational culture or else as a way out of the high workload and
stress. The lack of safety quality control would further contribute
to this situation. The shortcuts may work most of the time but can
eventually trigger chain reactions leading to dangerous
incidents/accidents.


3.5 Performance of technical safety barriers 


Stemming primarily from the resource diversion management reaction,
technical safety barriers experience the impacts described below.


3.5.1 Insufficient barriers 

If the number of barriers is not sufficient to cover all the
high-probability accident scenarios, then the risk of an incident
developing into a full-scale accident is high.


3.5.2 Barriers not maintained 

All technical safety barriers require certain maintenance to keep
them at their optimum performance level. The lack of maintenance
leads to their degradation over time and therefore their reliability
decreases.


3.5.3 Decreased robustness 

The robustness can be defined as the spare capac- 
ity of a safety barrier to handle accident scenar- 
ios beyond the normally expected magnitudes, 
frequencies, and operational conditions. A more 
robust safety system has a high tolerance for errors, 
so that it can still break the propagating incidents 
originating from significant human errors. 


3.5.4 Barriers not improved 

All safety systems need continuous improvements/adjustments over
time. Facilities face different kinds of risks during their
lifetimes. For example, an old facility may have a different risk
picture compared to a similar but newer facility. Similarly, plant
modifications lead to changed risk scenarios. Therefore, the safety
barriers must continuously be adapted or upgraded according to the
changing conditions.


3.6 Organizational outcomes


On top of the existing inherent risks of a facility, additional
pathways leading to the development of dangerous incidents are
generated as a result of the aforementioned human operator impacts.
Meanwhile, due to the simultaneous weakening of the safety barriers,
containing/resisting dangerous incidents under propagation becomes
increasingly difficult. This makes the likelihood of accidents
higher. The term “accident” here also embodies other slow-phase
outcomes such as poor health and weak environmental performance.
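
The pathways described in sections 3.3 to 3.6 can be written down as a small directed graph, which makes it possible to list every route from a management reaction to an accident. The sketch below is only one reading of the text: the node names are shorthand chosen here, the edge list is an interpretation of the prose rather than a reproduction of Figure 1, and the four technical barrier impacts of section 3.5 are merged into a single node.

```python
from typing import Dict, List

# Influence pathways as read from sections 3.3 to 3.6 (an interpretation,
# not the authors' figure). Keys influence the nodes listed as values.
SXF_EDGES: Dict[str, List[str]] = {
    # Management reactions (section 3.3)
    "cost_efficiency_drive": ["worker_stress", "communication_breakdown"],
    "attention_diversion": ["worker_stress", "insufficient_quality_control"],
    "resource_diversion": [
        "worker_stress",
        "insufficient_quality_control",
        "weakened_technical_barriers",
    ],
    # Human operator impacts (section 3.4)
    "worker_stress": ["human_error", "shortcut_procedures"],
    "insufficient_quality_control": ["human_error", "shortcut_procedures"],
    "communication_breakdown": ["dangerous_incident"],
    "human_error": ["dangerous_incident"],
    "shortcut_procedures": ["dangerous_incident"],
    # Technical barriers (section 3.5) and outcomes (section 3.6)
    "weakened_technical_barriers": ["accident"],
    "dangerous_incident": ["accident"],
}


def find_pathways(start: str, goal: str = "accident") -> List[List[str]]:
    """Return every cycle-free path from `start` to `goal` (depth-first)."""
    paths: List[List[str]] = []
    stack: List[List[str]] = [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == goal:
            paths.append(path)
            continue
        for successor in SXF_EDGES.get(node, []):
            if successor not in path:  # guard against revisiting a node
                stack.append(path + [successor])
    return paths


if __name__ == "__main__":
    for reaction in ("cost_efficiency_drive", "attention_diversion", "resource_diversion"):
        for path in find_pathways(reaction):
            print(" -> ".join(path))
```

Enumerating the paths this way shows that none of the routes requires a management decision that openly compromises safety, which is the central claim of the model.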


4 CONCLUSIONS 


This paper briefly presents the preliminary development of a
schematic model aimed at recognizing mechanisms behind the apparent
correlation between economic pressure and safety performance of a
profit oriented organization. It is theorized that there exist
several indirect pathways leading to a deteriorated safety
performance, initiating from an economically stressed management,
even when the management may not intentionally compromise safety for
economic gains.

The so-named Safety X-factor (SXF) Model is to be further expanded in
order to fully explain the negative safety performance trends
observed under market downturn situations. The eventual aim is to
describe an organization's safety culture using more concrete and
measurable terms.


ABSTRACT: The Norwegian government is reforming the railway sector. The reform was formally
implemented on the 1st of January 2017. This has transformed the structure of the sector to ensure clearer
division of responsibility, better coordination of service and infrastructure improvements, better aligned 
incentives and a more customer focused industry. As with any major change in any sector a concern is 
raised regarding the potential negative safety implications. This paper investigates safety implications 
of the reform and overall safety responsibilities as the reform progresses. Focus is on factors that can 
contribute to a future successful safety level. A risk analysis of safety implications of the reform, lessons 
learned from a similar restructuring in the Norwegian aviation industry and general safety literature, are 
combined and compared to factors that contributed to a railway accident. A railway accident will be 
investigated to search for correlation with risk factors in a reformed sector. Special focus will be on the 
responsibilities of the Norwegian Railway Directorate. 


1 INTRODUCTION

1.1 Background


To reduce costs within the railway sector and to help build a better
railway for its customers, the Norwegian government is reforming the
railway sector. The reform has transformed the structure of the
sector into several government owned shareholding companies for a
better coordination of service and infrastructure improvements. To
put pressure on the Infrastructure Manager (IM) and Railway
Undertakings (RU) to improve for a more customer focused industry, an
incentive regime is implemented.

The Norwegian Railway Directorate was established to ensure
coordination and to achieve political goals within the sector. To
open for a competing passenger transport market, the Directorate has
requested quotations for future operation from RUs within the
international market.

The Directorate shall have the overall responsibility for
coordination and safety measures of the safety level in the sector.

Bane NOR SF is the government owned infrastructure manager and shall
carry out work according to budgets. Contracts between the
Directorate and Bane NOR regulate the overall responsibilities for
Bane NOR, and follow-up of Bane NOR is based on long term effect
achievement. This gives Bane NOR flexibility for future development
and to manage their construction, operation, maintenance and safety
work within budget limits.

Statens Jernbanetilsyn (SJT) is the National Safety Authority (NSA),
and is responsible for supervising and following up the actors
according to the railway safety regulations. SJT has certified Bane
NOR and existing Norwegian RUs accordingly.

The Railway Undertakings (RUs), such as passenger transport and
cargo transport, use the railway for the purpose which is stated in
their certificate.

The responsibility for safety level delegated to the Directorate is
not well defined, and what kind of responsibility the Directorate can
or shall have concerning safety aspects is still under consideration.

To adapt to the reform, Bane NOR implemented a significant
organizational change in 2016. To be more cost efficient, costs
during the period 2016-2017 are reduced by NOK 750 million. For the
next period of 4 years, an increase in cost-efficiency is planned.
This may increase the financial pressure on the organization. Focus
areas will be economy, staffing, new technology, increase of
productivity at all levels and extended use of contractors.

The complexity within the sector has increased due to an increased
number of interfaces, contract arrangements, splitting of competence
among companies and potential reluctance to share information.

The following three findings are the backdrop and motivation for
proceeding with the search for preventive safety factors:


1. To study safety implications before implementation of the reform,
   the Government ordered an overall risk analysis from SafeTec AS.
   SafeTec AS identified nineteen main hazards. Three of these safety
   hazards are identified as: (i) the interaction capability within
   the sector, (ii) importance and necessity to control the overall
   risk picture, and (iii) postponed maintenance due to financial
   pressure.

2. SJT reports 13 serious accidents in the first eight months of
   2017, compared to 10 in 2016. Five more accidents are under
   investigation.

3. SINTEF (2005) investigated safety aspects during restructuring in
   the Norwegian aviation industry in the period 2000-2005. Lessons
   learned from this restructuring are used as a starting point for
   the investigations. Four of the main threats which affect safety
   were found to be:

   • Reduced organisational ability to discover and interpret
     development of hazards.
   • Loss of barriers due to loss of organisational capability to
     make decisions critical to safety.
   • Uncontrolled reduction of safety margins.
   • Problems connected to interaction.

Special focus will be on the responsibilities of
the Norwegian Railway Directorate, which holds
a sector responsibility for coordination and
development of the safety level. Companies that
operate within a competitive market may give
priority to the incentives of decision makers with
short-term financial and survival criteria instead of
long-term safety effects. This may lead to a
systematic and silent migration toward an
unacceptable safety level. Accidents or faults which
affect cost-effective operation will create conflicts
and probably lead to disputes among the actors.


1.2 Objective 


The objective of this paper is to focus on ways to
structure risk management, and on the importance
of an overall risk-picture and of interaction and
cooperation among the different actors, to achieve
successful operation.


1.3 Approach 


A risk analysis of safety implications of the reform,
lessons learned from a similar restructuring in the
Norwegian aviation industry, and general safety
literature are combined and compared with factors that
contributed to a railway accident. As a case study, a
railway accident is investigated to search for
correlation with general findings in a reformed sector.
Special focus will be on the responsibilities of
the Norwegian Railway Directorate.


2 A MOTIVATING EVENT FOR 
THIS STUDY 


The event involves a cargo train on a route along the
Hovedbanen line from Andalsnes to Oslo, weighing
600 tons, loaded with dangerous goods and driving at
the normal speed limit. Close to Bon station the four
rearmost wagons of the cargo train derailed because of
rail buckling. Both rolling stock and infrastructure were
badly damaged, and the line was closed for 4 weeks.

We will return to the course of events in light of
the relevant literature.

The accident was investigated by the Accident
Investigation Board Norway (Accident Investigation
Board Norway, 2017). The intention is to use this
investigation to draw out points for improving the
future safety level in the railway sector.


3 THEORY AND BASIC TERMS 


The objective of this section is to define the term 
interaction considering organisational and inter- 
organisational interface within safety of the rail- 
way sector. Conflicting goals in the interface in an 
organisation and between organisations and the 
potential of incremental drift towards an unsafe 
state of operation, need to be explained. The need
for an overall coordination of the actors within the
sector also needs to be explained.


3.1 Relevant safety perspective 


Over the past decades, several safety perspectives
have been developed in the theory and practice of
organisational safety management and accident
prevention. However, accidents still happen. The
perspective of Resilience Engineering (RE) has been
developed from earlier safety/accident perspectives as
the 6th perspective (SINTEF, 2010).


Hollnagel (2008) claims that ‘... a resilient
system ... the intrinsic ability of an organization
(system) to adjust its functioning prior to or following
disturbances to continue working in face of the
presence of a continuous stress or major mishaps’.


This adaptation can take place both before
and in response to changes and disturbances. This
system-change capability allows continuity of
operation, even in the case of major accidents or
continuous disturbances.

The Resilience Engineering perspective suggests
that an organization must follow, and thus observe,
the normal daily situation of change, thereby
identifying changes that may be a warning sign of an
undesired event. The uniqueness of RE is not the focus
on proactivity, but the ambition of being proactive
in a system sense. In other words, an organization
can be summed up as resilient when it knows that
unexpected problems are present and unpredictable.
This involves a form of proactive control where the
system (organization or sector) anticipates what is
coming, minimizes or eliminates unwanted variation
and takes advantage of productive variability
(SINTEF, 2010).


3.2 Investigation method and root-cause analysis 


The situation is analysed by the STEP investigation
method (Sequentially Timed Events Plotting,
see Hendrick & Benner, 1987). STEP is a
multi-sequential accident analysis method that helps
to illustrate the events and hazards that cause an
incident or accident (Herrera & Woltjer, 2009).

A Fault Tree Analysis (FTA) is used to investigate
the root causes behind the basic events and their
Risk Influencing Factors (RIFs). A fault tree is a
logic diagram that displays the interrelationships
between a critical event (accident) in a system and
the causes for this event. The strength of FTA is the
deductive reasoning to establish the basic events.
However, the influences of contributing factors cannot
be handled in the FTA. Therefore a hybrid approach
motivated by Gran et al. (2012) is used, where
the RIFs obtained from the STEP analysis are linked
to the basic events.
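
The idea of linking RIFs to the basic events of a fault tree can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the gate structure, the event names, the base probabilities and the RIF multipliers are hypothetical and are not taken from the accident analysis, and the multiplicative RIF adjustment is a simplification, not the model of Gran et al. (2012). Basic events are assumed to be independent.

```python
from math import prod


def or_gate(probs):
    """Probability that at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probs)


def and_gate(probs):
    """Probability that all independent input events occur."""
    return prod(probs)


def adjust_for_rifs(base_prob, rif_multipliers):
    """Scale a basic event probability by RIF multipliers.

    Illustrative assumption: a multiplier above 1 worsens the event
    probability, below 1 improves it; the result is capped at 1.0.
    """
    return min(1.0, base_prob * prod(rif_multipliers))


# Hypothetical basic events with hypothetical base probabilities.
base = {
    "geometry_fault_not_corrected": 0.05,
    "construction_weakens_track": 0.02,
    "high_rail_temperature": 0.10,
    "no_barrier_established": 0.20,
}

# Hypothetical RIF multipliers linked to each basic event.
rifs = {
    "geometry_fault_not_corrected": [2.0],  # e.g. missing overall risk picture
    "construction_weakens_track": [1.5],    # e.g. competence in risk assessment
    "high_rail_temperature": [1.0],         # no RIF adjustment assumed
    "no_barrier_established": [2.0],        # e.g. management situation awareness
}

adjusted = {name: adjust_for_rifs(p, rifs[name]) for name, p in base.items()}

# Hypothetical structure:
# TOP = (geometry fault OR construction damage) AND high temperature AND no barrier
track_weakened = or_gate([adjusted["geometry_fault_not_corrected"],
                          adjusted["construction_weakens_track"]])
top_event = and_gate([track_weakened,
                      adjusted["high_rail_temperature"],
                      adjusted["no_barrier_established"]])

print(f"TOP event probability (illustrative only): {top_event:.5f}")
```

The point of the sketch is only that the fault tree logic stays unchanged while the RIFs act on the basic event probabilities, so the effect of improving a single RIF can be traced to the TOP event.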


3.3 Risk Influencing Factors (RIFs) 


A RIF is a set of relatively stable conditions
influencing the risk. It is not an event, and it is
not a state that fluctuates over time. RIFs are thus
conditions that may be influenced or improved by
specific actions.

A risk model often comprises a formal logical
representation of the system. Fault and event tree
analyses are often building blocks in such a
representation. Barriers, safety functions and/or layers
of protection are typically represented by basic events.
Probabilities are assigned to the outcome of the
basic events, i.e., success or failure. A wide range of
factors and conditions will influence the outcome
of the basic events, and these need to be considered.
In the literature we find many approaches to link
such “soft” factors to the formal logical
representation. The literature also presents different
names for such factors. In human reliability analysis
(HRA) the terms performance shaping factors (PSFs) and
error producing conditions (EPCs) are used (see
e.g., Rausand, 2011). The term contributing success
factors (SCFs) has also been used in an attempt to
establish resilience based early warning indicators
(Øien et al., 2010). To avoid different sets of factors
for positive and negative issues, we will in the
following use the term risk influencing factor (RIF)
as a general term, see e.g., Vinnem et al. (2012) for
risk modelling of planning and execution of critical
tasks.

Interaction and coordination as factor for
successful operation

The term interaction, in the context of safety, can be
explained as the way the actors cooperate to
coordinate their actions in order to achieve safety,
e.g. by proactively cooperating to reduce the risk
of an incident or accident.


Rasmussen (1997) claims that ‘individual decision 
makers cannot see the complete risk picture and 
judge the state of multiple defences conditionally 
depending on decisions taken by other people in 
other departments and organisations.’ 


Maidment (1998) claims: ‘to keep safety levels
intact, safety must have a very high-profile aspect 
in the contractual arrangements between the sepa- 
rate companies.’ To monitor the overall risk-picture 
in a sector it will be important to coordinate all 
assigned contracts on a higher level. 


3.4 Monitoring safety 


The safety in restructuring processes can be monitored
most effectively by following up on all levels,
from individual and technical equipment to sector
level. Symptoms of safety issues may appear on
all of these levels. Safety researchers have until
now been interested in technical conditions, human
factors and organizational levels and to some extent
the level of interaction. Today, therefore, there is a
lack of knowledge and methods at the
inter-organizational/sector level, as well as knowledge
and methods that embrace all levels (Rasmussen, 1997).

Problems connected to interaction might arise
in connection with restructuring processes with a
high level of conflict. Conflicts will arise, and to
avoid safety problems the conflict level needs to be
regulated to an overall acceptable level.

Management and work planning in any organization
apply different control strategies, depending
on time horizon, stability of systems, and
predictability of disturbances. Management is based
on monitoring of performance with reference to
plans, budgets and schedules, and control is aimed
at removing deviations (Rasmussen, 1997). Risk
management in a dynamic market, in which all
actors continuously strive to adapt to changes,
requires an explicit identification of the boundaries
of safe operation together with efforts to make these
boundaries visible and to learn to cope with them.

To give a mental model as a basis we borrow
the “layer of protection” concept from the process
industry. Layers of protection can be seen as
safety barriers, and they are often structured in
the order in which they will be activated in an
accident scenario. Compared to the presentation given
by e.g., Rausand (2014) and CCPS (2007) we simplify and
distinguish between:


a. Process design (by using inherently safe design 
principles). 

b. Control, using basic control functions, alarms, 
and operator responses to keep the system in 
normal (steady) state. 

c. Prevention, using safety-instrumented systems 
(SISs) and other safety barriers to act upon 
deviations from normal state and thereby pre- 
vent an undesired event from occurring. 

d. Mitigation, recovery and emergency response in
case control is lost.


Figure 1 shows the layers of protection.

Having these categories of layers as a basis we
now define: Normal operation corresponds to
layer (b). Normal operation may involve adaptation
to varying conditions as emphasized by resilience
engineering. But normal operation does not
involve use of extraordinary safety functions
representing layer (c). If the primary control is lost
and there is a deviation from the normal state,
layer (c) is intended to bring the system into a safe
state. We define successful operation as a situation
where layer (c) is invoked and can bring the
system back to the primary control state. In some
situations layer (c) is invoked but without being
able to bring the system back to the primary control
state. This is often referred to as a state of
system outside engineering control. If this is the
case, we define successful recovery as the situation
where the system could be brought under control
and back to its normal state (after repairs, clean-up
and so forth).

Figure 1. Protection layers (adapted and simplified
from CCPS, 2007).
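
The definitions of normal operation, successful operation and successful recovery can be written down as a small decision rule. The sketch below is only one possible reading of the definitions above; the function name, the boolean inputs and the fourth outcome label (control not regained) are hypothetical additions for illustration.

```python
from enum import Enum


class Outcome(Enum):
    NORMAL_OPERATION = "normal operation"
    SUCCESSFUL_OPERATION = "successful operation"
    SUCCESSFUL_RECOVERY = "successful recovery"
    NOT_RECOVERED = "control not regained"  # label added for completeness


def classify(layer_c_invoked: bool,
             back_to_primary_control: bool,
             recovered_after_mitigation: bool) -> Outcome:
    """Classify an operation according to which protection layers were needed."""
    if not layer_c_invoked:
        # Layer (b) alone kept the system in its normal state.
        return Outcome.NORMAL_OPERATION
    if back_to_primary_control:
        # Layer (c) was invoked and restored the primary control state.
        return Outcome.SUCCESSFUL_OPERATION
    if recovered_after_mitigation:
        # Layer (d): system outside engineering control, later brought back.
        return Outcome.SUCCESSFUL_RECOVERY
    return Outcome.NOT_RECOVERED


print(classify(layer_c_invoked=True,
               back_to_primary_control=False,
               recovered_after_mitigation=True).value)
```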

In the literature, we also find other mental models
for accident avoidance than the protection layers
understanding shown in Figure 1. One class of
such models is the safety envelope model shown in
Figure 2. The inner ellipse, limited by the defined
operating boundary (DOB), corresponds broadly
to what is controlled by protection layers (a) and
(b) in the protection layer model in Figure 1.

We may also treat the elliptical band limited by
the boundary of controllability (BC) as very similar
to what is controlled by protection layer (c) in
Figure 1. However, there might be differences since
the layer of protection concept in Figure 1 is usually
based on very explicit identification and design
of safety functions, whereas the safety envelope
perspective in Figure 2 does not have such an explicit
understanding of safety functions. The outer elliptic
band limited by the safe envelope boundary
(SEB) represents extra safety margins, which is a
conceptual link to what is controlled by protection
layer (d) in Figure 1, but at this level similarities
are not that evident.

A conceptual advantage of the safe envelope
model over the layer of protection model is the
visualization of the track of the activity over time.
In the safe envelope, safe operation is defined as
an operation where the operating boundaries are
exceeded, but where the boundary of controllability
is not exceeded (as illustrated in Figure 2).
We will consider an operation to be safe if we are
able to remain in the inner ellipse of Figure 2. We
will, however, consider two special cases (subsets)
of normal operations as successful operation. The
first situation is where there are extraordinary
conditions making it very difficult to operate within
the defined operating boundaries. Examples could be
a situation where a long-term development of a
fault is hard to control, for example because the
energy in a track is building up close to rail
buckling, or it could be related to extreme
weather conditions. If we, despite these challenges,
are able to operate within the defined boundaries,
we define the particular operation to be
within safety margins. This will most likely represent
a situation where operation runs smoothly and
production performance is high. A smooth
process represents a situation where we use minimal
effort to control the operation and are presumably
better prepared to handle any process upset that
might occur. However, there is an argument in this
latter situation that would question how safe such
an operation is. The fact that there are no process
upsets or situations to act upon could make the
operations crew in the sharp end passive, limiting
their situation awareness (see, e.g., Endsley, 1995,
for a thorough discussion on situation awareness).

Figure 2. Safe envelope (adapted from Hale et al.,
2007), showing the defined operating boundary (DOB),
the boundary of controllability (BC), the safe envelope
boundary (SEB), the safety margins zone and the track
of activity over time.

Figure 3. Safe envelope with broken boundaries due to
unnoticed deficiencies or problems (based on Hale et al.,
2007).

In the safety envelope model in Figure 2 the
boundaries are illustrated as static in time. For
many operations the boundaries or safety functions
cannot be seen as static. We will in the following
distinguish between a situation where (i)
operating personnel are aware of this situation, and
(ii) operating personnel are not aware of the status
of a safety function or the limits for safe operation.
A simple example of situation (ii) is car driving
where the temperature has dropped below zero
degrees Celsius on a wet road without the driver
noticing.

Another situation is obviously very critical: the
operation follows a trajectory towards an accident
without the attention of operating personnel for a
while, and then operating personnel become aware of
the situation and succeed in bringing it under
control. This is visualised in Figure 3. If
personnel remain unaware of the situation, or are
unable to bring it under control, the operation
will move into the danger zone.
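
The safe envelope can be read as a set of nested boundaries against which a track of activity is checked over time. The sketch below assumes, purely for illustration, that the state of the operation can be summarised as a single deviation measure and that the three boundaries are fixed thresholds; the threshold values, the zone labels for the two middle bands and the example track are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Nested boundaries, expressed as thresholds on one deviation measure."""
    dob: float  # defined operating boundary
    bc: float   # boundary of controllability
    seb: float  # safe envelope boundary

    def __post_init__(self):
        if not (0 < self.dob < self.bc < self.seb):
            raise ValueError("boundaries must satisfy 0 < DOB < BC < SEB")


def zone(env: Envelope, deviation: float) -> str:
    """Return the zone of the safe envelope that a deviation falls in."""
    if deviation <= env.dob:
        return "within DOB"
    if deviation <= env.bc:
        return "outside DOB, controllable"
    if deviation <= env.seb:
        return "safety margins zone"
    return "danger zone"


# Hypothetical envelope and hypothetical track of activity over time.
envelope = Envelope(dob=1.0, bc=1.5, seb=2.0)
track = [0.2, 0.6, 1.1, 1.6, 0.9]
zones = [zone(envelope, d) for d in track]

for step, (deviation, label) in enumerate(zip(track, zones)):
    print(f"t={step}: deviation={deviation:.1f} -> {label}")
```

In this example the track drifts outwards into the safety margins zone and is then brought back inside the defined operating boundary, which corresponds to the situation where personnel become aware of the drift in time. With boundaries that change over time, the thresholds would have to be updated at each step rather than held fixed.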


4 ANALYSIS OF THE EVENT 


The Hovedbanen line is the oldest line in Norway
and was opened in 1854. During the last 20 years
several defects and geometry faults have been reported
on the line, and it was planned for renewal in
2020. The track geometry condition of the national
network is regularly measured with a special inspection
vehicle. The inspection vehicle takes pictures of the
track every 10 metres during inspections, and these
are available to all employees in Bane NOR. To detect
long-term changes or establish fault trends for the
track and its surroundings, Bane NOR needs to compare
pictures from different periods. The local staff and
train drivers often observe changes, and their
knowledge is an important input to achieve a good
safety level.

Vegetation clearing along the railway increased
the exposure of the rails to direct sunlight, which
increased the temperature and the stress in the track.
The management was not aware that the reference
points needed to measure track position were missing.
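
The link between rail temperature and force in the track can be illustrated with the standard relation for a fully restrained steel bar, P = E · A · α · (T_rail − T_neutral). The sketch below assumes continuously welded, fully restrained rail, which is not stated in the investigation material used here; the material constants are typical values for steel, the cross-section is an assumed value of about 76.7 cm² (roughly a 60 kg/m rail profile), and both temperatures are hypothetical.

```python
def thermal_rail_force(rail_temp_c: float,
                       neutral_temp_c: float,
                       youngs_modulus_pa: float = 210e9,    # typical for steel
                       expansion_per_k: float = 1.15e-5,    # typical for steel
                       cross_section_m2: float = 76.70e-4   # assumed rail profile
                       ) -> float:
    """Longitudinal force in one fully restrained rail, in newtons.

    Positive values mean compression (rail warmer than its neutral
    temperature), negative values mean tension.
    """
    delta_t = rail_temp_c - neutral_temp_c
    return youngs_modulus_pa * cross_section_m2 * expansion_per_k * delta_t


# Hypothetical temperatures: neutral rail temperature 21 C, sun-heated rail 45 C.
force_n = thermal_rail_force(rail_temp_c=45.0, neutral_temp_c=21.0)
print(f"Force per rail: {force_n / 1000:.0f} kN (positive = compression)")
```

The relation also shows why the neutral rail temperature matters: if the track has moved and the actual neutral temperature is lower than assumed, the same rail temperature gives a larger compressive force than the line management expects.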

The investigation of the situation is split into
two STEP diagrams: one (long-term) shows the
period from 2014 until May 31st in 2016, and the
other shows the situation causing derailment. The
STEP diagrams show that track faults were present
already from April 2015 and further escalated until
the accident occurred in May 2016. STEP diagrams
will be published in a master thesis in September
2018 (Aarsland, 2018).

Track irregularities escalated due to construction
work and were reported in August 2015, but they
were interpreted to be within acceptable limits.

From the long-term STEP-diagram the following 
safety problems were obtained during the analysis: 


1. Economy: Maintenance back-log, renewal 
planned in 2020. Missing overall risk-situation 
picture. Late plan for setting out lacking exact 
track position marks. 

2. Inadequate risk-assessment before construction
work on an already faulty track.

3. Construction work affects track stability, and 
vegetation clearing along track increases sun 
exposure to rails. 

4. Project management reports potential buckling 
problem to the Hovedbanen line management. 

5. Several reports from drivers of irregular track 
conditions. 

6. The inspection vehicle's 4th measurement again
reports a geometric track fault.

7. Missing overall risk-situation control. Tempera- 
ture rise and direct sun exposure to rails results 
in increased compression forces in track. 

8. Track lateral forces at the limit of buckling.

9. Overall cost-reduction scheme implemented in 
2016. 

From the short-term STEP diagram (situation 
causing derailment that day) the following safety 
problems were obtained during the analysis: 


10. Loaded cargo train at regular line speed 
releases track forces. 
11. Track buckles and results in derailment. 


The fault tree in Figure 4 displays the
interrelationships between the accident and causes of
the accident. The purpose of this FTA is to look at and
evaluate the prerequisites that led to the TOP event
‘derailment’. The TOP event is broken down deductively
into causal events and further into basic events.

Figure 4. Fault tree with listed RIFs.

The RIF model developed by Gran et al. (2012) 
shows that it is advantageous to split basic events 
into mistake, violation and slips & lapses. We have 
considered this expedient and used the model in 
this study. RIFs found as contributors to the acci- 
dent are shown in the diagram with the following 
corresponding numbers: 


1. Missing overall risk-situation picture as a
foundation for renewal investment.

2. Competence to carry out risk assessment.

3. Competence to understand risk of construction
work on a degraded and faulty line.

4. Knowledge about neutral state of forces in the
track (neutral rail temperature).

5. Missing overall risk-picture to prevent derailment.

6. Missing management situation awareness and
hence a wrong decision to keep the planned traffic
level at the regular train speed limit.

7. Management situation awareness: when track
geometry is at a critical level and rail temperature
increases, failure to establish barriers against
buckling and derailment, e.g. reduce train speed
or close the line.

8. Management risk awareness of construction
work on an already degraded and faulty track.

9. Inadequate shared risk understanding at
organisational management levels.


Based on the railway line quality, each of the 
network line managers reports annual renewal 
requirements. Requirements are considered and 
given budget priority in a national network renewal 
plan by Bane NOR senior management. The man- 
agement policy is that safety measures always have 
first priority. 

The track geodetic control marks along the railway
line were not established, which made it difficult
for the line management to know the position of
the track and hence the critical rail temperature for
buckling.


Such a condition requires special attention and
measures. Risk assessment was carried out by the
project management, but did not consider the local
aging condition, possible weakening of lateral
stability due to piling of a cable duct, or the
curvature of the track. Clearing of vegetation, causing
increased sun exposure, was not considered in the risk
assessment either.

It seems clear that a missing overall risk-situa- 
tion picture, situation awareness, competence level, 
and interaction in the interface between the actors 
were contributing factors to the accident. 


5 DISCUSSION 


A comparison of the previously described SafeTec AS
hazards and the SINTEF (2005) bullet points with the
contributing factors in the derailment accident
case shows that the ability to control the overall
risk situation is important in order to discover,
interpret and control the safety level.

A premise in a restructured sector is that all
actors shall have the responsibility for their own
safety under the supervision of SJT. SJT shall
ensure that the operation of Bane NOR and the RUs in
the interface complies with railway safety
regulations. However, not all actors in the sector are
under the supervision of SJT.

Cooperation between and contribution from all
actors are important input to achieve a dynamic
risk-picture for managing risk in a sector. According
to the contract agreement between the Directorate and
Bane NOR, Bane NOR shall establish and maintain
a railway sector risk-picture which reflects types
of hazards and locations and is based on risk
assessments and incidents.

The risk of drifting away from the safety margin,
and ‘Silent Drifting’ due to financial pressure, are
indicated both in the SafeTec study and as
uncontrolled reduction of safety margins in the SINTEF
(2005) investigation. In the accident case, a missing
situation risk picture, and a missing ability to
understand risk development and to prioritise renewal
of the line, resulted in long-term track fault
development. Increased competition, and the future
arrival of new railway undertakings and railway
maintenance companies, will increase financial pressure.
This can cause a risk of “Silent Deviation” in the
operative sharp-end.

The term “Silent Deviation” is used by companies
in the oil industry to describe the mismatch
between procedures and actual work practices
(Tinmannsvik, 2008). This is an attitude among
the staff to reduce the conflict between getting the
job done and compliance with procedures, and is
an expression of how routine violations of written
procedures tacitly become accepted practice.
Reason (1997) uses the term “necessary violations”,
where operators adjust the balance
between procedures and knowledge-based problem
solving to get an acceptable workload.

Problems connected to organizational interaction
appear in SINTEF (2005), in SafeTec hazard 3, and
among the contributors to the accident: lack of
interaction between the project, the railway line
management and Bane NOR management, and lack of
involvement of maintenance personnel, resulted in a
lack of shared safety-related information.

In the restructured sector, the increase in
organisational and inter-organisational interfaces,
together with increased financial pressure, might
challenge the safety margins. To understand how
safety is affected at all interfaces, interaction and
coordination at all levels is crucial. According to
Rasmussen (1997) individual decision makers cannot
see the complete risk picture and judge the
state of multiple defences conditionally depending
on decisions taken by other people in other
organisations. Therefore, monitoring of the safety
level depends on sharing of information, interaction
and cooperation.

The Directorate has the responsibility for the
development of a good sector safety level. SJT
shall ensure that certified actors have a system for
continuous improvement of safety, but does not have
the mandate to set specific goals. The Directorate
can, however, set goals to be achieved in the
contract agreements.

Exceptions from this are railway vehicle
maintenance workshops, entities in charge of
maintenance, vehicle owners or manufacturers. According
to Maidment (1998) contracts must be coordinated
on a high level in the sector. This is the
responsibility of the Directorate.


6 SUMMARY AND CONCLUSIONS 


A case study is used to shed light on overall safety 
responsibility, and possible risk factors in a reform 
with high political focus on the main success fac- 
tors ‘cost efficiency’ and ‘safety level’ in the rail- 
way sector. An accident has been analysed with a 
traditional investigation method to reveal safety 
problems. The STEP method (Hendrick & Benner, 
1987) has been used as a basis for the investiga- 
tion. A fault tree analysis is carried out as a basis 
to find the Risk Influencing Factors contributing 
to the accident. 

The case study shows that interaction in the
interface between different organisational units is
important to ensure safety. The study also shows
that a situation risk-picture is required as an input
to make safety related decisions. Risk management
in a dynamic market, in which all actors continuously
strive to adapt to changes, requires an explicit
identification of the boundaries of safe operation.
The main threats identified in the investigation
following the restructuring of the Norwegian aviation
industry can be addressed as follows:

The Directorate should contribute to strengthening
overall cooperation within the sector, so that Bane
NOR is able to implement and monitor a process
for interaction in the different interfaces and
manage the boundaries of an acceptable safety level.
The following points are essential for organising and
allocating responsibilities:


• Ability to regulate safety at an overall level. A
resilient railway sector requires a dynamic
risk-picture with contribution of risk-assessments
from all actors at all levels. Bane NOR should
develop and actively use the risk-picture as a
sector risk management tool.
• The Directorate should contribute to strengthening
the level of safety among actors in the contract
agreements within the sector.
• Bane NOR should have responsibility to establish
an appropriate level of information sharing, and to
coordinate and propose safety measures in the
interfaces between all actors in the sector.
• Bane NOR should establish an expedient level
of risk monitoring.


Actor-independent safety aspects that are important
to follow up:

• Success Contributing Factors (SCFs) should be
used to establish resilience based early warning
indicators.
• High level of competence to understand RIFs
at all levels in the sector, and as part of risk
assessment.


It is recommended to extend the investigation 
to search for visible explicit boundaries as a tool 
for monitoring overall safety within the railway 
sector.

ABSTRACT: The current study builds on the paper “Tough Men Cry—Learning from Sharp
End Military Aviation”, presented at ESREL 2010. The previous paper reported on findings from
a specialist course at the Royal Norwegian Air Force Academy (RNoAFA) termed “Effectiveness in the
Cockpit—Developing and Taking care of the Man in the Machine” held in 2005. In 2010, five of the pilots
were re-interviewed in order to understand more of the human side of military aviation and see more of
the effects out in the squadrons. In the data from 2010, we see a transfer of knowledge taken back to the
squadrons. We see from the interviews a strong focus on performance and improvement. At the same time,
there is hierarchy to reflect on regardless of hierarchical position. There is a need to protect one's position
(professional hierarchy). However, the professional hierarchy can also be used positively to lead by example
and pave the way for learning.


1 INTRODUCTION 


The current study follows up on a previous study
named “Tough Men Cry—Learning from Sharp End
Military Aviation” that was presented at ESREL 2010
(Steiro, Moldjord, Fredriksen & Firing, 2010). The
previous paper reported on findings from a specialist
course at RNoAFA named “Effectiveness in the
Cockpit—Developing and Taking care of the Man
in the Machine” in 2005 (Moldjord, 2007). All
seven pilots participating in the course were
interviewed and, through a thematic analysis, the
following topics were identified and discussed:


1. Challenging events
2. Emotions on the table
3. The role of the teacher/coach
4. Group based dialogue
5. Meaningful narratives
(Firing & Moldjord, 2007; Steiro, Moldjord,
Fredriksen & Firing, 2010).


In 2010, five of the pilots were re-interviewed 
in order to understand more of the human side of 
military aviation and see more of the effects out in 
the squadrons. 




2 THEORETICAL FRAMEWORK 


Historically, the educational practice in the mili- 
tary can be characterized by a behavioristic view of 
learning where teachers and instructors transmitted
knowledge to the students. A wide set of reinforcements
were used; both negative and positive consequences
were systematically related to bad or good
behavior respectively (Skinner, 1953). In the last
20 years process-oriented coaching seems to have 
turned the attention within educational practice in 
the direction of each student’s thinking and feeling. 
Interviews indicate that students at the RNoAFA 
value practical exercises and cases and post-action 
reflection in groups as the most valuable learning 
for them (Steiro & Firing, 2009). Each student is 
addressed, stimulated and challenged to engage 
in their own process of assimilation and accom- 
modation (Piaget, 1977). This method emphasizes 
the process of raising students’ awareness and the 
development of each student’s own thinking, emo- 
tions and reflection. In the same process a culture of 
coaching has arisen which supports this educational 
philosophy. Experienced officers serve as coaches 
and mentors in the process of constructing new 
knowledge. Their intention is to help the students in 
their own process of growth and realization of their 
potential as officers (Vygotsky, 1978). Today, the
RNoAFA has founded its educational practice on the
concept of learning from experience as educational
philosophy (Dewey, 1997, 1961; Skjevdal, Solheim &
Henriksen, 1995). The RNoAFA has chosen to
build its educational philosophy on three pillars: 


1. Theory 
2. Practical training 
3. Reflection 
(Firing & Lien, 2007; Steiro & Firing, 2009) 


The course “Effectiveness in the cockpit” was a
single semester course offered in the autumn of 2005
and winter of 2006. The course addressed both the
individual and the group level, the same levels that
we have kept in the analysis in this paper.

The course the cadets were provided was aligned
with the pedagogic model at the RNoAFA. In
accordance with how Bruner (1986) addressed the
link between cognition, action and emotion, these
three were also included in the curriculum as a
second triangle. One problem, however, with most
military debriefing is that it often lacks a focus
on emotions and individual inner experience
(Folland, 2009). One way to get more aligned with
Bruner is to broaden the debriefing into a more
holistic reflection on thoughts, emotions, actions,
team relations, and communication.
In a Holistic Debrief (Moldjord, 2015; Moldjord
and Fredriksen, 2017) we asked questions such
as “how did you experience the event?”, “what
affected you most?”, “how did we communicate?”,
and “how did we cooperate?”, in addition to
providing feedback to each other. Reason (1997)
pointed out the following elements as central to
a sound safety culture in an organization:


— A reporting culture 
— A just culture 

— A learning culture 
— A flexible culture 


In order to obtain a sound safety culture, sense- 
making can be seen as an important framework 
for safety. Sensemaking denotes processes of 
interpretation and meaning production through 
which individuals and groups engage their worlds 
on an ongoing basis (Catino & Patriotta, 2013;
Patriotta & Brown, 2011; Weick, Sutcliffe, & 
Obstfeld, 2005). Sensemaking arises when people 
encounter situations that defy their current under- 
standings and call for adjustment. 

Several scholars have pointed out that learning 
from errors allows organizations to improve safety, 
reliability, and resilience (e.g., Reason, 1997; Ron, 
Lipshitz, & Popper, 2006; Weick, 1987; Weick & 
Sutcliffe, 2007; Catino & Patriotta, 2013). “Errors, 
whether actual or anticipated, provide an empirical 
intersection between sensemaking and learning. In 
fact, learning—the detection, reporting, and correc- 
tion of errors—is contingent upon the way in which 
individuals or groups interpret a problematic situ- 
ation to themselves. In this respect, errors play an 
essential role in processes of construction of real- 
ity. This construction revolves around cognitive, 
emotional, and cultural sensemaking, which may or 
may not lead to processes of learning from errors” 
(Catino & Patriotta, 2013:462). Zhao (2011) found
that negative emotions can stimulate motivation
to learn from errors when leaders encourage
a positive and constructive view of errors and 
thereby alleviate individual tension and stress. 

Bennis and Thomas (2007) claim that defining
moments in a person’s life are opportunities to
learn and to grow. Folland (2009) found that all
the staff at the helicopter squadron had some
challenging events that they brought up. Several had
never shared the incidents or, more precisely, how
the incidents had influenced them. In our study,
putting challenging events on the table made it
easier for the other pilots to talk about each
other’s experiences. Language does not only transfer
but also creates and constructs knowledge or
realities (Bruner, 1986). Through the course, the
cadets have turned their experiences into
meaningful narratives. This might have a positive
health effect. Pennebaker (1997) argues that
narratives organize overwhelming events into
smaller units that are easier to handle. By
constructing a narrative, the person goes through a
process that helps the person to better understand
both the experience and himself. The experiences
have now got a structure and a meaning, which in
turn may provide a feeling of resolution. Something
that has been viewed negatively might now be viewed
more positively.


3 METHODS AND MATERIALS 


We observed systematically throughout the course. 
The military psychologist took notes and recorded 
reflections, and the students also used written
reflection as a tool. These records of reflections
were made available to a researcher at the RNoAFA
who was not a part of the course. The same
researcher conducted semi-structured interviews. 
The interviews with each of the seven pilots were 
carried out at the end of the half year course by 
one of the authors. The interviews were taped and 
later transcribed for analysis (Yin, 1994; Creswell, 
1998; Thagaard, 1998). The research team consists 
of both teachers and researchers. One potential dis- 
advantage is that the work can be viewed as action 
research (Greenwood & Levin, 2007), meaning that 
certain norms are imposed on the group and that 
the research then of course is not objective. These 
interviews were used as a background for planning 
and executing the research in 2010. 

In the spring, just after the paper to ESREL 
2010 was submitted, all pilots were invited to a new 
interview. Two of the pilots were not available due 
to practical reasons. Five of the pilots were avail- 
able and agreed to be interviewed. The interviews 
were conducted at the pilots’ respective workplaces
during 2010. The 2010 study demonstrated that
new learning had come up due to the educational
framework created in the RNoAFA and in particular
in the course “Effectiveness in the cockpit”.
Military pilots are often performance-oriented
and aim at high professional performance. Talking
about incidents or accidents can put them in a
difficult position. All informants were informed of
the purpose of the study and signed an informed 
written consent. All the informants also agreed 
that the interviews could be recorded. Most of the 
interviews lasted approximately 70 minutes, while 
one interview lasted 120 minutes. The recordings 
were later fully transcribed so that both authors and
interviewees had full access to the material.
A thematic analysis was performed and centered 
on challenging events, the way emotions played a 
role and what would happen in the dialogue after 
something had happened. This built on the
elements from the 2010 study. The categories
were made through item-centered analysis
(Thagaard, 1998).

The research question was: how did the pilots use
their experiences, and how did they reflect on
learning from experiences in the cockpit?


4 RESULTS AND DISCUSSION 


In the interviews the pilots talked a lot about the
characteristics of being a pilot, so we identified
this as an important theme for analysis. First, we
extract from the interviews the pilots’ perception
of the characteristics of being a military pilot and
how these might enable and limit the ability to
learn and share. Further,
we have extracted three incidents that could serve 
as illustrations and situations that can complement 
each other in understanding the experiences of the 
pilots, how they are dealt with, and the role of the 
group. The three situations can be termed as: 


• A potentially dangerous situation of coming off
course

• A near miss within the envelope with the
instructor taking over

• An actual severe incident and the handling of it
afterwards


The examples illustrate different situations but 
also how the group might shape the narratives. 
Narratives and the reshaping of narratives are 
important to understand safety. 


4.1 The pilots’ perception of characteristics
of being a military pilot and how it might
enable and limit the ability to learn and share


The first theme is of interest since it concerns the
characteristics of being a military pilot and how
these might enable and limit the ability to learn and
share. All the pilots refer to the systematic
processes of sharing in a debrief. The debrief is
mission oriented and performed after each flight,
and it is structured to focus on technical and
tactical aspects. It can also include more emotional
aspects. First, experiences after a flight are shared
within the flight or among the crew members. In
addition, other personnel can be brought in to
contribute, for instance after a near miss or
incident. The severity of situations such as a
shooting accident calls for others to be brought in,
typically flight safety officers, the second in
command of the squadron, and additional personnel.
The pilots talk of early selection and intensive
training, particularly in the earlier part of their
careers.
They also talk strongly of a performance culture
where they measure each other, or are measured
within a group. The professional hierarchy is of
importance. The date of receiving the wings is
important, and the number of hours flown is
typically shown on patches (e.g. “7000 hours in
Fighting Falcon”, that is the F-16), as are courses
and check-outs you have been through and experience
from international operations. A check-out in
aviation means that the pilot is tested, has “passed
the exam” and is allowed, for example, to operate the
20 mm machine cannon in an F-16. Other check-outs
are, for instance, leading other aircraft or leading
a formation at night. The peers from flight training are
important, since they are able to share experiences 
and bonds from the pilot training in the United 
States. A pilot will go through training and check- 
out at the squadron. In the current study, there 
are both fighter pilots and helicopter pilots. The 
fighter pilots operate alone in the aircraft, except 
when an instructor sits in the back seat of a tandem
F-16B. Otherwise, they always operate in pairs,
implying that the relational aspect is present
at all times. The helicopter pilots fly with a system
operator as a minimum, and in addition there might be
rescue divers, nurses or gunners, depending on the
mission and type of helicopter. The team aspect is
always there. At the same time, they will always be
evaluated individually. In the interviews, all pilots
in the current study touch upon their status and
reflect on their ranking. They all reflect on
their position at the squadron, acknowledging that
being a newcomer makes speaking up demanding. All
informants point out that it is easier for a more
senior pilot to speak up. However, senior pilots
like themselves play a crucial role in facilitating
openness in, for example, a helicopter crew.
The pilots should be aware of the hierarchy and,
because of it, make sure that the other crew members
are free to speak out or to speak first.
At the same time it is important to have established
a position as a good pilot. Informant 4 expresses the
positioning of the pilots in the following way:


“You do not need to be the best. But overall, you 
must avoid being the worst. However, it is not com- 
municated, it can be expressed by who gets some 
courses or positions to a greater degree than others, 
but not communicated and stated very clearly, but 
we still know it” 


Pilots lower in the hierarchy can be too focused
on not getting blamed and on pointing to others’
mistakes, which might put themselves in a relatively
better position. The same was reported in a study in
the Italian Air Force (Catino & Patriotta, 2013).
Informant 1 thinks that the youngest pilots are not
necessarily that concerned about their position since
they are in a learning phase and there is so much to
learn at the start in the squadron. However, this can
vary from person to person. For some pilots, flying a
fighter is their main purpose in life. For others it
can be viewed as a job, and other aspects of life are
as important. Informant 1 reflects on the balance
between learning and performance:


“Where is the boundary between learning and
achievement, the limit of how bad you can present
your own presentation for learning purposes? I
think there is a limit there. I can contribute a lot for
learning purposes, but at one point I need to protect
myself”.


Informant 2 remembers commanders who had
stepped out and announced within the squadron:


“If the boss could share his biggest mistakes, so
can I. Nevertheless, I do not trust for sharing
everything with everyone. We have examples of people
in the hierarchy uttering something that makes us
suspicious. The utterings do not seem very thought
through. If you are unsure you learn for yourself or
only share experiences among the closest”.


Informant 3 reflects on facilitating debriefs
after the introduction of Holistic Debrief
in the Norwegian helicopter detachment in
Afghanistan in 2010 (Folland, 2009; Moldjord,
2016):


“A system operator approached me and expressed
concern whether he had to share all my feelings with
me. I said, ‘You do not need to if you do not want to.
But you can if you like’. The system operator seemed
happy with the answer. But we, the older ones, need to
do it, otherwise the youngest will not follow”.


Informant 4 offers another reflection, looking
back at his pilot career and describing an environment
that is very focused on reaching goals, on
improvement and on performance. Informant 4 reasons
that this leads to a focus on the past and the
future, but less on the present.


4.2 A potentially dangerous situation of coming
off course


The first situation is reported by the pilot of a Bell
412 SP flying with a system operator. The system
operator had little experience and they were going
to fly a long distance in wintertime in Norway.
The pilot saw this as a good opportunity to let the
system operator gain more experience. In addition,
he did not choose to have a more experienced
system operator sit in the back seat. They were
flying and they missed some spots, they needed to
reroute, and the weather was getting worse with
lower visibility. Informant 2 explains:


“We start talking. I adjust the speed to the weather
conditions, I do it, and we wrongly navigate, I
support him on the map while flying the route ... I
use him, asking him questions. I see this far. How
long distance can you see? I further asked him what
he is looking forward. I think he is too much focusing
on a spot that he is concentrating on. Also his
vocalization, how much/little he talks, what he talks
of, what he is talking of, he is focusing far out”


The system operator identifies their position
and the tension decreases. They continue flying
the route, but the visibility in a valley gets below
800 meters. The pilot remembers he was no longer
confident with the situation. What happens is
interesting. There is calmness in the cockpit, however:


“I say that we will turn and land, and then, very
sudden, he agrees wholeheartedly. We land in a farm
field, and I shut down the system. At the same time I
reflect that this is not something overly dramatic. It
is dramatic within acceptable boundaries”.


While the engine is shutting down, the pilot
starts by asking the system operator what he thinks
first. In the informant’s view, it is important that
the system operator is not overly influenced by the
hierarchy. The system operator was very concerned
that he did not find geographical features from the
map such as a wire and a church. He is very
concerned with his own mistake. Informant 2
underlines that “we did not find it out”. The
informant also explained to the system operator:


“I explain to him that it was my feeling that it was
not ok to fly further. Then I will raise my voice. I also
say to him that I expect the same the other way. We
have talked about this before, but now we have an
actual shared experience of this”.


They are invited in for a meal at a farmer’s house
and stay there for a while. They arrive at the
destination one day later. It was a trip with a lot of
learning. After landing at the final destination they
debriefed the mission. The informant sums up the
experiences:


“It’s a matter that I was particularly pleased with 
when I recognized my activation (of fear). This is the 
activation we have talked about, written about. This 
I've experienced before, and it’s ok. So it’s not dis- 
turbing, what’s going on with me now, it never turns 
in to high stress. This is information that means that 
I'm affected and I have to take some extra consid- 
eration, calm down. I have the opportunity to make 
some bigger margins now, so I can do that. In several 
situations on the same trip, I have been activated, 
but not done anything about it and put it away, more 
like an emotionally oriented coping strategy, (telling 
myself) this I have to take care of after landing” 


We see an interesting example of the social
interaction in the helicopter. We see that the pilot,
by opening up for questions, invites and encourages
the system operator to also have a say. The pilot
reports that the system operator very quickly agreed,
and assumes that he might have thought the same.
The professional hierarchy plays an important role,
and the pilot works to reduce the authority gradient (Rosness,
2001). Authority gradient was first defined in avia- 
tion when it was noted that pilots and copilots may 
not communicate effectively in stressful situations 
if there is a significant difference in their author- 
ity, experience, or perceived expertise (Crosby & 
Croskerry, 2004). 


4.3 A near miss within the envelope with the 
instructor taking over 


The next situation is told by informant 2, who was
flying with a new fighter pilot. They were sitting in
an F-16B tandem with the instructor (informant 2)
sitting in the back seat. The student was about to
be checked out on firing the 20 mm machine gun
at a ground target. The limit is 5 seconds to dive
the fighter aircraft down, aim and fire, and then you
are at minimum distance. Then there are an additional
4-5 seconds to pull up. In the first part of
the training mission the student pilot had dropped
bombs, which is quite usual on that type of training
mission. The shooting is always at the end of
the mission. At that time the type of ammunition
required the pilots to get near the target, meaning
that this was very stressful for both student and
instructor. The student had already tried shooting
but fired from too far out and missed the target by
too much. The student aimed for a third time. The
informant explains further that on the next round the
student was very concentrated and he was paying
extra attention.


“I imagine he’s going to come in and get closer, and
he comes in too close, so I yelled at him that he should
get off the target, which means he’s going to pull the
stick back, but he did not hear that because he was
aiming, aiming and shooting while I once again
roared to him that he should get off. Then I pulled
the stick back at the same time as him. I wanted him
to become aware of what was going on, so I pulled
the stick back”


This is explained by the informant as target fixa- 
tion: The student was too focused on the target, 
aiming at the target and forgetting about assessing 
the distance. It was a very brief window to
intervene. As instructors, we want them to succeed.


“He immediately understood what had happened 
and became, I think, perplexed. I took over the fly- 
ing in order for him to get back. We talked briefly; 
however, we were running out of fuel, so there was 
not much else to do than landing”. 


After landing, they talked about the situation. 
The student was expressing his views. He said he 
was close, but further expressed that he was not too 
fixated on the target, that he had some degree of 
control. They agreed that they had been within the 
safety limit. They performed a debrief. This was 
his first mission shooting with the 20 mm machine
cannon. Grades are assigned. The student fails one
or two points, and one of them is safety, meaning 
he has to do the mission all over again. At the same 
time the informant explains central elements in the 
debrief session: 


“The student had a need to explain what kind of 
situational awareness he had in the situation, what 
he thought and why it happened. (First) I explain 
what I thought, what I did and why I did it. It’s on 
that level. I say that I'm an instructor open for that I 
might have made a mistake, perhaps intervening too 
early or doing something that’s not good (for the stu- 
dent). Then he’s able to explain how he experienced 
the situation”. 


We see from the example that the student pilot and
the instructor have different risk perceptions. The
instructor pilot gets an uneasy feeling and instructs
the student pilot to pull up. When there is no
reaction, the instructor pilot shouts and pulls the
stick at the same time. We see an interesting process
of making sense of the situation. They both agree that
they were within the safety envelope. At the same
time, the instructor does not pass the student pilot,
and the student pilot needs to perform the shooting
with the 20 mm machine cannon again.


4.4 An actual severe incident and the handling
of it afterwards 


The last situation is from informant 4. The informant
was not directly involved, and the account is about
how the situation was handled afterwards. During
night-time shooting, a pilot missed the targets and
the bullets hit the shooting observation tower. There
were no injuries to personnel, but it was a
potentially very dangerous situation with possibly
several fatalities. Informant 4’s role was being part
of the squadron command team, responsible for
following up on the involved pilot, knowing what
was going to happen to him afterwards and preparing
him for that. Informant 4 was also a part of
the command team responsible for facilitating the
debrief setting for the involved person. In addition,
he was responsible for informing the rest of
the squadron and the involved families. The challenge
was that the pilot reported a high degree of
blame and guilt. The informant attributed a lot of
his ‘follow up’ handling to what he had learned at
the RNoAFA:


“The most central aspect in my knowledge is what 
I learned at the Air Force Academy and the focus 
on reactions and experience with feelings like fear, 
shame and guilt. Now I looked at the person care- 
fully and looked for facial expressions to see and 
meet his needs. I avoided making him sit alone in a 
blood test session for hours, although the formal test 
was alright. But to have someone there to talk with 
a familiar face so trust became focus versus proce- 
dures. That was the most important knowledge that I 
brought with me. I steered consciously clear of such 
thing as “have you taken care of the video”, “can I 
see your data card”, all things that indicate that we 
must check in order to understand what the pilot did 
wrong, versus take care of the person. The others can 
handle the formal things”. 


The informant reflected on what his perspective
might have been earlier, which typically would have
been a focus on the individual: little training,
little flying lately, should have been better
prepared, should have tried another round. The
informant reported that they did focus on the
circumstances. As noted in Section 4.3, the
ammunition had changed, now allowing for firing at
longer distances. However, the shooting range was
built in a period when firing took place at closer
range. In addition, the light conditions, the
arrangement of flashlights and the location of the
towers that were hit could have been better
designed. So informant 4 points to the root causes
and to seeing the pilot in the wider context:


“Looking at the system he was part of”. 


This quote can be seen in relation to learning
at the organizational level as well as the individual
level (Reason, 1997). The informant told us how
important it is to take care of the person behind
the mistake, not just the role of the pilot. The
most important thing is to get the pilot back in
the cockpit, and then it is crucial to take care of
the person. Also, it is important to avoid blaming
and too much focus on procedures right after an
accident. At the same time, this incident has an
organizational perspective. The entire organization
can learn from the error and how it was handled
afterwards. Before one can focus on learning, one
needs to calm down strong emotions such as fear,
guilt, anger, and pride. Creating an arena where
emotions can be expressed is important before
focusing on learning. In such a situation, Holistic
Debrief is very relevant (Moldjord and Fredriksen,
2017). This aligns with the practice of double loop
learning (Argyris & Schön, 1978). The informant
reflects on differences between the squadron and
other environments he knows of and claims they are
good at sharing mistakes. Not everything is fine,
but they are sharing experiences after a mission. If
you experience a close encounter with a wire, you
stand up in the morning meeting and share it with
the others.


5 CONCLUSION 


We see from the interviews a strong focus on
performance and improvement. At the same time,
there is a hierarchy to reflect on regardless of
position. There is a need to protect one’s position.
However, the hierarchy can also be used
positively to lead by example. In the current study,
all the pilots were experienced and had led several
missions, and three of them had management
positions, meaning they had significant positions
in the squadron. They all reflected on how their
position can be used to lead and facilitate learning.
All acknowledged the challenge concerning the degree
of openness and position in the hierarchy. They
also report that some things are “private” and not
shared widely. This is in accordance with the
findings of Catino & Patriotta (2013). It is also a
challenge to create a learning safety culture (Reason,
1997). The informants talk of what we would term
a generative reporting culture. That could also be
termed double loop learning (Argyris & Schön,
1978). Some of the learning that involves blame is
linked to a more internal learning or is only shared
with peers. This is especially evident if it threatens
one’s overall position as a pilot or in the hierarchy.
Being in a strong position means that you can more
easily lead by example. All pilots reported favorable
learning from the course “Effectiveness in the cockpit”.
Moreover, as we interpret it, there has been some
transfer of knowledge from the course to practical
military leadership. It seems important that operative
leaders know how to address the
balance between performance, caring and learning
after an error event or a mistake. Furthermore,
it is important that such experiences are lifted
from an individual level to something that the
entire organization can learn from. Therefore the
organization needs to create arenas where trust is
established and there is enough openness to share
important experiences, like a Holistic Debrief
(Moldjord and Fredriksen, 2017).


5.1 The need for further studies


It is of great value to follow a group of officers
over a period of time since long-term studies are
often missing within this research field. We
therefore plan to conduct a third round of interviews,
which hopefully will be accepted by all the pilots.
We will look in more detail into the topics covered
in the current paper. The analysis in this
paper indicates the strong effect of the context.
Further studies on how the processes are shaped
and could be managed are of interest. Also, we want
to further examine how a squadron can enable
more sharing and learning. We will look more into
the organizational, management and leadership
aspects of learning. The pilots in the current
study have now moved on to administrative positions.
That means that it might be easier for them
to understand the environment from the inside, but
still obtain an analytical distance to the subjects, all
while examining sensemaking processes over time
and seeing the interplay between professional
hierarchy and sensemaking processes.


ABSTRACT: Over the last years, the structure of the legal entities governing all issues of the German
nuclear industry has changed significantly. What used to be just one company is now divided into four
entities, between which responsibilities were shared and new tasks allocated. For this actual example of
comprehensive structural changes, in particular regarding conditioning, handling and interim/final storage
of radioactive waste in Germany, we apply the STAMP (System-Theoretic Accident Model and Processes)
approach in order to demonstrate how the STAMP structure can be used to monitor and master this
change. The choice of the nuclear industry is solely motivated by scientific interest and excludes
commercial aspects. Siemens is not active in the “nuclear waste” management industry.


1 MOTIVATION 
Reorganisation is always a very complex task in any 
undertaking. When dividing existing organisations 
and establishing new ones with new stakeholders, 
formal and informal processes and tasks have to 
be considered and adapted where necessary. This 
becomes even more important when safety-critical 
aspects are involved. As a safety-critical area we 
will be focusing in this article on the Nuclear Waste 
Disposal in Germany. A very clear and structured 
approach is helpful to keep track of the relevant 
changes, but also to visualize those changes in a clear 
manner. The last aspect especially helps to inform all 
concerned parties and can be used as a starting point 
for the discussion and verification of the model. 
The described difficulties can be modeled as a
control problem. Instead of developing a completely
new theory it seemed sensible to use an
existing theory. The advantage of using an existing
model framework is that a certain set of conditions
exists on which the ongoing work can be based.
In our paper (Berg, Griebel, Milius 2017) we have
shown that STAMP, a theory developed by Nancy
Leveson, can be used to model organizations on a
high level. We applied some elements of STAMP
to the current situation in the German nuclear
industry. We modeled the current organizations
and the existing interactions and discussed some
fictional examples. In this paper, we will elaborate
the example further, detailing a certain part of the
existing STAMP structure for the German nuclear
industry, i.e. by clearly visualizing the structure
and the processes using elements of STAMP we
gain further insight.

The paper is structured as follows: 

In the first chapter, we explain the area of
application, that is, the expected changes to the
process of nuclear waste disposal. We will address
the current situation, the planned changes and the
expected challenges. Afterwards, we will give a short overview
of the basic concept of STAMP and explain our 
chosen modeling approach. In the third chapter, 
we will present today’s situation in nuclear waste 
disposal as well as the future situation using the 
visualizing elements of STAMP. We will compare 
both structures and discuss interesting aspects. 


2 NUCLEAR WASTE DISPOSAL 
IN GERMANY 


2.1 General


In general, most radioactive waste in Germany 
comes from nuclear electricity production. How- 
ever, it is also generated in hospitals from the use 
of radioactive material to diagnose and treat the 
sick and sterilize medical products, in the
production of radiopharmaceuticals, and at
universities in conducting vital research in
biology, chemistry and
engineering. 

These wastes must be safely managed at all stages 
prior to and including final safe disposal. Storage is 
an integral part of the waste management process. 


In Germany, disposal in deep geological forma- 
tions is intended for all types of radioactive waste. 
However, there are strong debates concerning vari- 
ous nuclear waste disposal options including direct 
deep geological disposal with and without waste 
retrievability, long-term interim storage, as well as 
new conditioning and treatment concepts such as 
partitioning and transmutation. Scientific research 
to provide input to the decision-making process is 
essential, as this is the only guarantee for reliable 
data and models required for any sound, scientifi- 
cally-based safety assessment. 

Decentralised storage facilities for spent fuel 
were licensed under nuclear law and constructed 
and commissioned at twelve sites with nuclear 
power plants. They are designed as dry storage 
facilities in which transport and storage casks 
loaded with spent fuel are emplaced. Storage casks 
for spent fuel in the three central storage facilities 
have the same properties as the storage casks in on- 
site storage facilities. 

By means of conditioning of the radioactive 
waste, intermediate or final products shall be pro- 
duced which fulfill the requirements on safe han- 
dling, storage and transport also for the period of 
extended interim storage. The radioactive waste 
is to be safely stored until it can be delivered to a 
facility of the Federation for disposal. 

Currently, there are still eight operational 
nuclear power plants left in Germany. For one 
further nuclear power plant the authorisation for 
power operation expired on 31 December 2017 in 
accordance with the German Atomic Energy Act 
(Federal Law Gazette 2017c). 

Moreover, there are currently three research 
reactors, three reactors for training purposes and 
one reactor for educational purposes in operation. 

However, many nuclear facilities are already in 
the process of decommissioning as listed below: 


• seventeen nuclear power plants in the process of 
  decommissioning, as at 30 April 2017, 

• nine research reactors with an electric power of 
  more than 1 MW permanently shut down, in the 
  process of decommissioning, or decommissioning 
  completed and released from nuclear regulatory 
  control, as at 30 April 2017, 

• twenty-nine research reactors with an electric 
  power of less than 1 MW permanently shut 
  down, in the process of decommissioning, or 
  decommissioning completed and released from 
  nuclear regulatory control, as at 30 April 2017, 

• eight experimental and demonstration reactors 
  in the process of decommissioning, or decommissioning 
  completed and released from nuclear 
  regulatory control, as at 30 April 2017, 

• six commercial fuel cycle facilities in the process 
  of decommissioning, or decommissioning 
  completed and released from nuclear regulatory 
  control, as at 30 April 2017. 


This underlines the necessity of urgent solutions. 


2.2 Explanation of current structure 


The Federal Ministry for Environment, Nature 
Conservation, Building and Nuclear Safety 
(BMUB) is the nuclear regulatory authority of the 
Federation. As defined in Article 73 of the Basic 
Law (Federal Law Gazette 2014), the Federation 
shall have exclusive legislative power with respect 
to “the production and utilisation of nuclear 
energy for peaceful purposes, the construction and 
operation of facilities serving such purposes, pro- 
tection against hazards arising from the release of 
nuclear energy or from ionising radiation, and the 
disposal of radioactive substances”. 

The “Gesellschaft für Anlagen- und Reaktorsicherheit 
(GRS)” carries out research and analysis 
in its fields of competence, namely reactor safety 
and radioactive waste management, and supports 
the BMUB on technical issues. 

The Nuclear Waste Management Commission 
(ESK) advises the BMUB on nuclear waste management 
issues (conditioning, interim storage 
and transports of radioactive material and waste, 
decommissioning and dismantling of nuclear installations, 
disposal in deep geological formations). 

The Federal States carry out their tasks under 
nuclear law on behalf of the Federation (federal 
executive administration). Federal supervision 
extends to the legality and appropriateness of 
execution by the Land authorities. According to 
Article 85(3) of the Basic Law, these shall be sub- 
ject to the instructions from the competent highest 
federal authority (BMUB). 

In performing their activities, the Federal State 
authorities may consult technical expert organisa- 
tions or individual experts according to § 20 of the 
Atomic Energy Act (Federal Law Gazette 2017c). 
Today, this is mainly ensured by the technical 
expert organisation (TÜV) for specific issues. With 
the involvement of experts, an examination on the 
safety-related issues is made which is independent 
of that of the applicant. 

The Federal State Authority for Mining, Energy 
and Geology is the mining authority and grants 
licenses under mining law for nuclear waste disposal 
in deep geological formations. 

The Federal Office for Radiation Protection (BfS) 
is a subordinate authority of the BMUB in the area 
of radiation protection and nuclear safety. 
The four technical departments of the BfS deal with 
the statutory tasks in the areas of environmental and 
industrial radiation protection, radiation biology, 
radiation medicine, nuclear fuel supply and waste 
management and nuclear safety. The BfS supports 
the BMUB technically and scientifically, especially 
in the execution of supervision of legality and 
expediency, the preparation of legal and administrative 
procedures, and in intergovernmental cooperation. 
Moreover, the BfS is license holder for nuclear waste 
disposal facilities. Furthermore, the BfS is also the 
nuclear supervisor, responsible for example for 
monitoring compliance with the waste acceptance 
requirements for the respective final disposal facility. 
These requirements result from the safety-analytical 
investigations and need to be complied with when waste 
packages are delivered to the repository in future. 
The supervision is carried out by a self-monitoring 
section tied directly to the vice president of the BfS. 

The German Company for the Construction 
and Operation of Repositories (DBE) has been 
exploring, constructing and operating the Ger- 
man repository projects and mines in Gorleben, 
Morsleben and Salzgitter (Schacht Konrad) since 
1979. After a legally enforceable plan approval 
decision came into force in 2007, the Federation 
decided to set up the Konrad mine as a 
repository. DBE is responsible for the comprehensive 
construction work. 

The Asse GmbH is an operating company and 
as such responsible for all operational work in the 
Asse mine. It is wholly owned by the Federation. The Asse 
GmbH implements the measures ordered by the 
BfS. The Asse GmbH will also be responsible for 
the implementation of the decommissioning work 
and until then ensure operation in accordance with 
the requirements of nuclear legislation. 

The Company for Nuclear Service (GNS) carries 
out services in the field of radioactive waste disposal 
and decommissioning of nuclear facilities and 
operates, through several subsidiaries, interim storage 
depots for spent fuel and radioactive waste. Further 
activities are related to the development of condition- 
ing methods, including the development and qualifica- 
tion of the cask and emplacement systems, processing 
of the waste, loading of the casks, documentation of 
the containers, and control of delivery to the Konrad 
repository. GNS owns 75% of the DBE. 


2.3 Explanation of future structure 


On January 1, 2014, the Federal Act on the estab- 
lishment of a Federal Office for Nuclear Waste 
Management (BfE) entered into force, so that 
BfE was formally founded on this day. However, 
the Federal Office for Radiation Protection (BfS) 
and its tasks remained unaffected at this time. By 
the Act on the Reorganization of the Organiza- 
tional Structure in the Field of Final Disposal, BfE 
was renamed Federal Office for Nuclear Safety in 
Waste Management as of July 30, 2016 (Federal 
Law Gazette 2016a). The main reason for the renaming 
was the intention to differentiate the BfE more 
clearly from the Federal Company for Radioactive 
Waste Disposal (BGE). 

Against this background, the BfS was divided 
into three organizations. 

The Federal Office for Radiation Protection 
(BfS) is now responsible for the safety and protec- 
tion of man and the environment against damage 
caused by ionizing and non-ionizing radiation 
(Federal Law Gazette 2016b). In the area of ionizing 
radiation, this covers, for example, X-ray diagnostics in 
medicine, safety in the handling of radioactive 
substances in nuclear technology, and 
protection against increased natural radioactivity. In 
the area of non-ionizing radiation, its work includes, inter alia, 
protection against ultraviolet radiation and the 
effects of mobile phones. 

The Federal Office for the Safety of Nuclear 
Waste Management (BfE) advises the BMUB on 
nuclear waste disposal and safety issues for other 
nuclear facilities. In addition, the BfE is responsible for 
public participation in the process of determining 
the location in Germany for final disposal of 
high level radioactive waste. The BfE has the task of 
checking the safety of the disposal at all process steps. 
Thus, for the first time in Germany, BfE is now an 
independent regulatory, licensing and supervisory 
authority for the transport of radioactive waste 
and its interim and permanent storage. 

Transitional provisions apply to the Konrad 
repository and the Morsleben repository for radio- 
active waste (ERAM), according to which the Fed- 
eral States remain responsible for licensing until 
this responsibility is transferred to the BfE with the 
granting of the approval of commissioning by the 
nuclear supervisory authority for the Konrad repository 
or until the plan approval decision on decommissioning 
becomes enforceable for the ERAM. Part 
of the workforce planned for the 
BfE, mainly from the administrative sector, will 
remain with the BfS until December 2017. 

Furthermore, two additional Acts recently 
entered into force (Federal Law Gazette 2017a and 2017b), 
partially in extension of the Act on Search and 
Selection of a Site for a Repository for Heat- 
Generating Radioactive Waste and for Amending 
Other Laws (Federal Law Gazette 2015). 

The Federal Company for Radioactive Waste 
Disposal (BGE) is a state-owned company in which 
several tasks have been merged: operational tasks 
of site selection, construction and the operation 
of the repositories for high level nuclear waste as 
well as, e.g., for the Asse II salt mine site where in 
the past low and partially medium level radioactive 
waste has been stored. Moreover, the “Act for 
the Reorganization of Responsibility in Nuclear 
Waste Disposal”, which came into effect in June 
2017, has redefined the responsibility 
for the decommissioning and dismantling 
of nuclear power plants and for the interim 
storage of radioactive waste (Federal Law Gazette 
2017b). Some sections of the BfS have been 
transferred to the BGE. The Asse GmbH and the DBE 
will also be incorporated in due time. 

Moreover, the Company for Interim Storage 
(BGZ) became a company of the Federal Government 
in August 2017. From now on, the 
existing two central interim storage facilities will 
be part of the business of the BGZ. At the beginning 
of 2019, the twelve decentralized temporary 
storage facilities at the nuclear power plants, 
currently operated by the electric power utilities, 
will also become the responsibility of the BGZ. One 
year later, the twelve storage facilities 
for low- and medium-level radioactive waste resulting 
from the operation and dismantling of the German 
nuclear power plants will follow. 


3 WHAT IS STAMP? 


STAMP is an acronym and stands for System- 
Theoretic Accident Model and Processes. The 
theory was developed by Nancy Leveson. The 
need for the model arose when it became 
clear that the traditional mechanisms to explain 
incidents were no longer valid. Typically, incidents 
were explained as a linear chain of events, with a 
failure of one system at the beginning of the chain. 
However, due to the complexity of today’s systems 
this is not necessarily true. Every part of 
a complex system can work perfectly as specified 
but under certain preconditions it might 
nevertheless lead to an incident. For more examples of the 
general idea of this model, see Leveson and 
Stephanopoulos (2014). Leveson argues 
that in the future incidents should be considered as 
control problems. Actors and interactions need to 
be defined in a strict manner. 

A typical STAMP control structure is shown in 
Figure 1. An example of a more complex situation 
can be found, e.g. in our paper (Berg, Griebel & 
Milius 2017). 

[Figure 1. General description of the idea behind STAMP. Diagram labels: control actions, feedback, controlled process.] 

[Figure 2. Comparison of the current (right) and future (left) STAMP structures; numbers (1, 2, 3 and 4) indicate areas of change discussed in the text; letters indicate the discussed control loops.] 


STAMP is the basis for different applications. 
Thomas (2013) distinguishes CAST (Causal 
Analysis using System Theory) and STPA (System- 
Theoretic Process Analysis). For our work, STPA 
is the relevant component as it helps to identify the 
potentially hazardous control structures which can 
lead to failures in the process. 

To identify controls which are potentially haz- 
ardous, four types can be distinguished: 


• Control commands required for safety are not 
  given 

• Unsafe commands are given 

• Potentially safe commands are given too early or 
  too late 

• A control action stops too soon or is applied too long 
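
The four types listed above can be applied mechanically to every control action in a control structure to obtain a list of candidate hazards for review. The following Python fragment is only a minimal sketch of that enumeration step; the class and function names and the two example control actions are illustrative assumptions and not part of the model presented in this paper.

```python
from enum import Enum
from itertools import product


class UnsafeControlType(Enum):
    """The four types of potentially hazardous control named in the text."""
    NOT_PROVIDED = "control command required for safety is not given"
    UNSAFE_PROVIDED = "unsafe command is given"
    WRONG_TIMING = "command is given too early or too late"
    WRONG_DURATION = "control action stops too soon or is applied too long"


def candidate_hazards(control_actions):
    """Pair every control action with each of the four types.

    The result is only a checklist for expert review; whether a given
    combination is actually hazardous must be judged in context.
    """
    return list(product(control_actions, UnsafeControlType))


# Hypothetical control actions, chosen for illustration only.
actions = ["grant mining license", "grant revised license"]

for action, kind in candidate_hazards(actions):
    print(f"{action}: {kind.value}")
```

Each printed line is a question for the analyst rather than a finding, for example whether a license granted too early could lead to an unsafe state.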


The scope of this paper does not allow us to apply 
these types of control in detail; however, we will highlight 
some examples where problems become obvious 
even at first glance. 


4 COMPARING AND ANALYSING THE 
TRANSFORMATION PROCESS 


Using elements of STAMP, the structure and control 
processes of the current and future situation 
are shown in Figures 3 and 4 at the end of this 
paper. 

[Figure 3. Current structure of German nuclear waste disposal process.] 

[Figure 4. Future structure of German nuclear waste disposal process.] 

Looking at how the elements of STAMP 
can contribute to a better understanding and 
evaluation of the change in the processes, the following 
aspects can be identified from the 
structure (Figure 2). 


1. The new process is more streamlined with a 
strong focus on the licensing and supervisory 
authority on the one hand and the license 
holder and operator on the other. 

2. The BfS as a self-monitoring organization of 
final waste repositories is replaced by the BfE, which 
is no longer both license holder and supervisory 
authority. Moreover, the new picture shows a 
clearer structure because in future all licensing 
and supervisory activities will be performed 
by the BfE, whereas in the past the Federal States 
were the licensing authority and the BfS was operator and 
supervisory organization. The new legal basis 
also concentrates the responsibilities for licensing 
(in the past BfS) and supervision (in the 
past Federal States) of interim storage 
facilities, which are not discussed in detail in this paper. 

3. The general structure regarding the involvement 
of the BfS is now much clearer. Whereas in the 
old structure the BfS had several very 
different functions, its role is now clarified and 
more focused on safety and protection of man 
and the environment against damage caused by 
ionizing and non-ionizing radiation. 

4. The license holder and the operator used 
to be separated from each other. This led to an 
additional need for control. Now they are 
subsumed under one roof. This would be a good 
example for looking deeper into the structure to see whether 
all aspects of the external control structure are 
now integrated in the internal processes of the 
license holder and operator. 


Overall, all tasks of nuclear disposal 
after dismantling have now been fully assigned 
to an authority (BfE) and a federally owned private-law company 
(BGE). 

The reorganization of the responsibilities has created 
a central prerequisite for the credibility of the 
actors now responsible. It leads 
to a clear functional separation between ownership 
and operation in a private-law 
company on the one hand and supervision and 
intensified public participation on the 
other hand. 

Regarding the site selection of a repository for 
highly radioactive waste, the BGE now has improved 
structural preconditions for a rapid and well-founded 
presentation of results, because the role of 
project owner in the site search and all operator 
tasks are concentrated in one organization. 

BfE has the task of supervising the implementa- 
tion of the site selection procedure in accordance 
with § 19 (1) to (4) of the Atomic Energy Act (Fed- 
eral Law Gazette 2017c). It, therefore, supports the 
entire procedure from a scientific point of view 
and is the competent authority for the monitoring 
of the enforcement of the site selection procedure 
at all stages of the procedure (Federal Law Gazette 
2017a). 


5 ANALYSIS OF CONTROL LOOPS 


The structured depiction of the control processes 
enables the identification of possible gaps in the 
control and information loops. 

One example of a rather simple and clear loop 
consists of the requesting and granting of mining 
licenses (A). If we applied the different types of 
hazardous control here, we could, e.g., 
discuss in detail what happens when a request is 
made at the wrong time or an application is granted 
too early. STAMP/STPA can help us here to better 
understand the process and guide the user to 
potentially hazardous situations, which can then be 
counteracted, e.g. by suitable processes. 

As for the interdependence between 
the BfE and the BGE, the situation is rather more 
complex. The loop regarding the application for 
license changes and the granting of a revised license 
seems to be a closed one (B). Yet when it comes to 
supervision by the BfE and the provision of 
reports by the BGE, the causal relation would need 
more detailed analysis (C). One reason for this 
is that supervision also includes inspections 
in the waste disposal facility, i.e. not only paperwork 
is involved. This is different from the other 
reports and supervision loops in the structure, where 
only documents are transferred and checked. 
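
The loops A, B and C discussed above can be written down in a uniform way so that they can be compared and screened for gaps. The following Python fragment is a minimal sketch under stated assumptions: the data structure, the field names and the two screening rules are illustrative choices, and the assignment of the BGE as the requesting party in loop A is an assumption made here for the example, not a statement taken from the figures.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ControlLoop:
    name: str
    controller: str
    controlled: str
    control_action: str
    feedback: Optional[str] = None               # None: no feedback path identified
    channels: Tuple[str, ...] = ("documents",)   # how the feedback is obtained


def review_flags(loop):
    """Return the reasons why a loop deserves a closer look (empty if none)."""
    flags = []
    if loop.feedback is None:
        flags.append("open loop: no feedback path")
    if set(loop.channels) - {"documents"}:
        flags.append("feedback is not document-only: analyse in more detail")
    return flags


loops = [
    # Loop A: controlled party assumed to be the BGE (illustrative assumption).
    ControlLoop("A", "mining authority", "BGE",
                "grant mining license", "request mining license"),
    ControlLoop("B", "BfE", "BGE",
                "grant revised license", "apply for license change"),
    ControlLoop("C", "BfE", "BGE",
                "supervision", "reports", ("documents", "inspection")),
]

for loop in loops:
    print(loop.name, review_flags(loop) or "closed, document-based")
```

With these entries only loop C is flagged, which mirrors the observation in the text that supervision involving on-site inspections needs a more detailed analysis than loops based purely on documents.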

This succinct analysis already shows that it is 
important to clearly and comparably construct 
and describe the elements of the control loops. 
Without this structured analysis one might face 
the danger of leaving out pertinent aspects in the 
process of transforming and streamlining existing 
control and supervision structures. 


6 CONCLUSION 


In this paper we have applied some of the elements 
of STAMP to the reorganization of 
nuclear waste management in the German nuclear 
industry. We focused on the process of final waste 
disposal. 

By structuring the process we have made a decisive 
leap towards 


• Transparency (by providing a method of lucidly 
  displaying the complex process) 

• Elucidation of areas for improvement 

• Enabling checks of whether the desired objectives have 
  been attained. 


Thus we have shown that even without going 
into much detail of the daily work, some areas of 
change become obvious and can be used as a basis for 
further discussions. The identified changes can now 
be analyzed further to make sure that the respon- 
sibilities of changed ownership (partially spread 
and partially concentrated compared to the earlier 
structure) are taken care of in the new processes. 

Furthermore, future analyses could check 
which effect envisaged changes would have on 
the overall safety of the processes and the site by 
highlighting the risks 


ABSTRACT: As France was the world’s leading tourist destination in 2015 and has a target of 100 million visitors in 
2030, the French tourism sector is a national priority and a strategic challenge for the country. However, 
the dramatic events of 13 November 2015 (in Paris) and those of 14 July 2016 (in Nice) led to a reversal of 
the trend, and a substantial and worrying decline in activity. The events raise two key questions in terms 
of crisis management: The first is linked to the ability of regions and organizations to recover, as quickly 
as possible, their capacity to meet the needs of the public. The second, which is the direct consequence of 
the first, is the capacity of regions and tourism industries to re-establish their reputation, notwithstanding 
the fear that is generated by the threat. 

This paper presents a review of the management of major risks and the tourism sector at the inter- 
national level and in France. It presents an ongoing qualitative study conducted in partnership with an 
association of event professionals on the management of crises on the French Riviera. The roles of the 
different actors in a crisis are studied from the perspective of organizational resilience. 


1 INTRODUCTION 

Whether resulting from natural or industrial 
causes, or due to terrorism, disasters and crises 
take a heavy toll on tourist activities. In a country 
like France, where tourism accounts for 7% of 
GDP, and represents 2 million direct and indirect 
jobs, a crisis can threaten the sector. Although in 
different ways and in different places, the terrorist 
attacks in Paris in 2015, and the attacks in Nice in 
July 2016 badly affected the sector’s economy. 
While crisis management is well-understood in 
conventional industrial activities (nuclear, oil and 
gas, chemistry, aeronautic...), it is clear that the 
academic literature contains few references to the 
tourism sector. The reason for this oversight is that 
the sector is diverse and geographically dispersed, with 
a high proportion of small businesses, thus making 
it a challenging topic to investigate. Having a 
good understanding of the typology of crises and the 
consequences for the sector is, however, essential in 
order to prepare the resources that are required to 
recover the loss of patronage and business. 

This paper reports an ongoing, longitudinal 
French study located in the tourism sector that 
identifies the key elements needed to successfully 
manage a crisis after a terrorist attack. Applying 
a systemic approach, the aim of the study is 
to increase our understanding of why and how 
regions and organizations become resilient after 
an acute crisis. We also argue that the findings of 
this study could inform the way in which other 
crises are managed. It should be noted that the 
study concentrates on crises associated with acts 
of terrorism. We begin, however, by presenting 
research on the perception of risk and tourism-related 
crisis management. We then introduce the 
concept of resilience applied to crisis management, 
and the leisure and business tourism sectors. 


2 GLOBAL TOURISM, RISK PERCEPTION 
AND IMPACTS ON THE DESTINATION 


The globalization of tourism has created a highly 
competitive environment. As a result, both the 
leisure and business tourism sectors are subject 
to fierce competition within and between destinations. 
Although France is one of the most popular 
destinations, aiming to welcome 120 million tourists 
in 2020 (Huchon, 2016), it has to compete 
with other destinations. Studies of terrorist crises 
(2001 in New York, Hua Hin and Phuket 2016 etc.) 
also show that there is a negative economic impact 
on destinations that have experienced a terrorist 
attack (Enders, 1992, Fainstein, 2002, Beirman, 
2003, Drakos, 2003, de Sausmarez, 2003). In this 
context, it is necessary to know the reasons for 
selecting a particular destination (i.e. the level of risk 
associated with the chosen destination). 

Moreover, a number of studies highlight the fact 
that the level and perception of risks are key dimen- 
sions in the decision-making process. In a study of 
290 young adults born in the United States, Lepp & 
Gibson (2003) report that there are seven risk fac- 
tors associated with tourism: health, political insta- 
bility, terrorism, unusual food, cultural barriers, 
political and religious dogmas and crime. However, 
the study found that men and women differ: women 
care more about health risks. A second difference is 
found in the type of destination: some look for a 
familiar destination, while others seek out novelty. 
The latter group are less sensitive to risk. 

With regard to natural and technological risks, 
Chew & Jahari (2014) analyze the impact of perceived 
risk on the image of a destination. Taking 
the case of the Fukushima disaster in Japan 
in 2011, where a natural disaster that combined 
a tsunami and earthquake caused a nuclear acci- 
dent, the authors analyze the relationship between 
the image of the destination, perceived risk and 
the intention of returning to the destination. Their 
findings show that tourists’ perception of risk is 
an important element in the destination’s image 
and plays a key role in the choice of destination. 
In practice, although increased risk may play an 
inhibiting role in the choice of and return to a des- 
tination, the authors note that recent studies high- 
light repeat purchases despite a rise in the level of 
risk or a recent disaster. Acting on the image of 
the destination would therefore appear to help to 
mitigate the effect of perceived risk on the inten- 
tion to return. 

In sum, risk perception is a component of the 
decision-making process in the visit and revisit of 
a destination. The perceived risk associated with 
the destination, therefore, can play a role in miti- 
gating the image of a destination. 


3 CRISIS MANAGEMENT AND TOURISM 


In order to understand risks and resilient organi- 
zations it is necessary to examine the literature 
on crisis management, with particular reference 
to the tourist sector. Aktas & Gunlu (2005) note 
that, although the concept of crisis management 
has existed as a research field since 1970, studies 
located in the tourism sector only began to emerge 
in the 1990s. 

It is useful, however, to first distinguish between 
“disaster” and “crisis”. Faulkner (2001) notes that 
the difference between the two is that a disaster is 
an external phenomenon while a crisis is an internal 
phenomenon and relates to the organization itself 
and management failures. Aktas & Gunlu (2005), 
taking Faulkner’s example (2001), show that the 
Izmit earthquake in 1999 could be likened to a 
catastrophe on account of its unpredictability and 
external nature. On the other hand, the lack of 
respect for earthquake-resistant standards reflects 
organizational failures and, therefore, a crisis. 

Aktas & Gunlu (2005) further explain that in 
practice, the unpredictable dimension of an unforeseen 
external event can degenerate into a crisis if there 
are management failures. In this case, the disaster 
and the crisis can be subsumed. Here both share 
three elements: a trigger, damage, and a threat to 
life or property. 

Further, crises can be linked to events at differ- 
ent levels. At the micro level, a crisis can be limited 
to the organization (an industrial relations strike 
or epidemic that affects the workforce). Crises can 
also occur at the international level, moving from 
a micro to a macro level, for example a terrorist 
attack or a pandemic. Moreover, a crisis can 
extend over many years (Ritchie, 2004, Ritchie & 
Campiranon, 2014). In this context, both Ritchie 
(2004) and Laws and Prideaux (2006) argue that 
the complexity of crises suggests that there is a 
need to establish a typology in order to provide 
appropriate strategic management plans. Ritchie 
(2004) states that crisis management should be 
based on appropriate plans, trained staff, commu- 
nication plans and established relations with the 
media. That is, crisis management is based on a 
capacity to mobilize resources and communicate at 
the right time. It also requires appropriate lobbying 
of decision-making bodies to rapidly allocate the 
funds that are needed to meet the needs of crisis 
management activities. 

The ability to manage a crisis therefore depends 
on a number of factors that can prepare managers 
to cope, and also to communicate at the right time 
to contain the crisis and avoid changing the image 
of the destination. In addition, a significant part 
of crisis management is based on the creative ability 
of actors; in particular, it relies upon the 
support of actors, or even on drawing from the group 
solutions that could not have been imagined. This 
observation is echoed by Aktas and Gunlu (2005), 
who consider that the implementation of strategic crisis 
management and post-crisis marketing plans 
is necessary for the revival of the destination. 
Finally, proactive and holistic management as well 
as an evaluation of each crisis and its management 
are essential to ensure that visitors return (Laws & 
Prideaux, 2006, Ritchie & Campiranon, 2014). 


While a great deal of this discourse is located in 
the disciplines of marketing and economics, studies 
in other disciplines, such as resilience engineering 
and emergency engineering, have provided another 
perspective. 


4 RISK AND RESPONSE 


There have been a number of scholars who have 
made a significant contribution to the research 
on risk and resilience. One of the most notable 
is Hollnagel (2007), who not only studied the concept 
and the attributes of risk but also introduced the 
concept of resilience engineering. He defines resilience 
as a system’s ability to adapt before, during or 
after changes or perturbations in order to continue 
a set of operations that are identified in expected 
or unexpected conditions (Hollnagel, et al. 2006). 
This key contribution to our understanding of crisis 
management is part of a rich field of research 
on human and organizational factors. Other 
scholars have also made a contribution. For example, 
Perrow’s (1981) work on complex organizations 
and risk examines the organizational 
factors that underlie systemic and catastrophic 
accidents. His central argument is that accidents 
are normal in complex, socio-technical systems 
because they are, by their very nature, susceptible 
to an elevated level of risk. His work initiated critical 
thinking about industrial systems that are exclu- 
sively controlled by managed safety procedures. 

Similarly, the work of the High Reliability 
Organization group at Berkeley University (see La 
Porte, 1996) has examined the resistance criteria 
in complex socio-technical systems. What brought 
the Berkeley colleagues together was their shared 
observation from three different disciplinary per- 
spectives that the attention being paid to stud- 
ies and cases of organizational failure was not 
matched by parallel studies of organizations that 
were operating safely and reliably in similar cir- 
cumstances (Rochlin, 1996). They found that the 
role of technology and structure was not fixed, 
but varied from task to task. The organization 
had not one, but many overlapping cultures, tied 
together by a common purpose. Its performance 
was shaped not only by the skill and dedication of 
the operators, but also by intelligent and sensitive 
management. Weick (1995) also observed that in 
the event of a crisis, following procedures without 
being able to innovate or rethink procedures 
can hamper action, or even end in a catastrophic 
scenario. 

Taking the analysis to the next level, Hollnagel 
(2007) argues that it is critical for an organization 
to anticipate and learn from previous crises. Moreover,
the resilience of organizations is linked to the
ability of actors to adapt to changing conditions.
Organizational resilience, according to Hollnagel 
(2007), is based on four central pillars: the ability 
of a system to respond to normal and abnormal 
conditions; its ability to control threats and per- 
formance in the short term; its ability to anticipate 
threats and opportunities in the long term; and its 
capacity to learn from the positives and negatives 
of past events. 

Research on the management of major crises, 
including those in highly-regulated sectors, shows 
that comprehensive planning together with innova- 
tive solutions are necessary to deal with a crisis that 
is unexpected and has far-reaching, long-lasting 
consequences (Vogus & Sutcliffe, 2007; Burnard & 
Bhamra, 2011). The analysis of the Fukushima 
Daiichi crisis is an example of the innovative capacity
of one man and his team when faced with an
unprecedented crisis that went far beyond any pre- 
viously planned action (Funabashi & Kitazawa, 
2012). Travadel, Martin and Guarnieri (2017) also 
highlight the need for this capacity. They introduce
the notion of acting in extreme situations as
a way of effectively responding to a long-lasting, 
hitherto unknown crisis. 

In addition, Ritchie (2004) recommends scenar- 
io-based planning as a foundation for an appropri- 
ate response to a crisis. He notes that there is value 
in identifying and planning responses to a crisis,
but, given what we know about complex systems,
there will still be a need for some flexibility in order 
to support innovative responses. 

Finally, it is crucial not only to link the capacity
of government with sectoral institutions in order
to generate joint action and commitment, but
also to understand the role of the various
actors and their capacity to respond to crisis
conditions (Glaesser, 2006).

With this in mind, the following section presents 
the main features of a longitudinal study looking 
at the response to the recent crises in the French 
tourism sector. 


5 THREE-TIERED RESPONSE SYSTEM


Over the past two years France has experienced 
several major terrorist crises that have had a signif- 
icant impact on tourism activity. While the French 
government had predicted a significant growth 
in the tourist sector, the terrorist attacks of 13th
November, 2015 brought the predicted growth in
tourist numbers to a halt, and put a great deal of 
economic pressure on all the stakeholders in the 
sector. Paris was particularly affected. The Huchon 
Report (2016) (Governmental Report on Tourism) 
notes that in the days following the attacks, the 
various businesses operating in the sector suffered
significant reductions in visitor numbers and
turnover. The report also noted that commercial 
activities in central Paris were particularly badly 
affected. 

Despite efforts to revive tourist numbers, the 
end of 2015 and the whole of 2016 were still marred
by more terrorist attacks and natural disasters, all
of which continued to tarnish the image of France,
and in particular Paris and the Côte d’Azur, as a
tourist destination. The terrorist attack of July
2016 on the Promenade des Anglais in Nice was
one of the worst, as it occurred in the middle of
the summer season and targeted the second most 
popular tourist destination after Paris: the French 
Riviera. 

Faced with the loss of its foreign clientele and 
a decline in hotel bookings, the sector asked the 
government to take measures to revive activity, 
similar to the actions that were taken after the 
11th September terrorist attacks in New York. The 
Huchon Report (2016) argued that the response 
must be organized around a few key measures.
The first was to set up, at inter-government level, a
crisis unit to revive activity by reassuring the public
about the safety of the destination. The second
was to provide economic support for the sector. The
third was to coordinate crisis communication and
recovery actions, with special attention to Paris as
the country’s flagship destination. 

Based on the Huchon Report’s recommendations,
a three-tiered response system was put in
place. At the national level, a recovery plan was
developed focusing on coordinated action involving
key stakeholders in both the public and private
sectors with the aim of reviving tourist numbers
in Paris and other key tourist destinations. In the
South of France, the Regional Tourism Committee
also established a recovery plan based on a
communication strategy centered on the Côte
d’Azur brand and using existing, well-established
social networks. At the sector level, a communication
plan for tourism professionals was created to
reassure providers.

As a result of these concerted efforts of the 
combined private and public stakeholders at the 
different levels, tourist numbers began to recover. 
On the French Riviera, one year after the attack
of 14 July 2016, figures showed that foreign visi- 
tors were returning. Following a 10% fall in hotel 
bookings in the last quarter of 2016, figures for the 
summer of 2017 indicate that activity is recovering. 

However, the tourist sector in the Côte d’Azur
had to respond not only to the terrorist attacks but
also to climatic disasters. In particular,
the storms of October 2015 badly affected
the destination, with a heavy toll both
human (20 people lost their lives) and economic
(damage to tourist infrastructure). As the region
tried to cope with the aftermath, those working in the
tourist sector had to prepare for the arrival of
13,000 people attending a large international conference.
These efforts by all the parties made the
conference a success and were emblematic of the
region’s ability to respond to crisis and its resilience.


6 LONGITUDINAL STUDY 


The aim of this qualitative study is to examine the 
tourism sector’s processes and structures created in 
response to a series of crises in the Côte d’Azur. In
particular, the study aims to understand the dynamics
underlying the pre- and post-crisis processes and
structures. The focus of this two-year study is on the
Provence Côte d’Azur Event Center, which represents
160 event management companies in the
region (a total of 1.2 billion euros turnover).

Although the Côte d’Azur Crisis Unit provided an
initial impetus, the successful recovery of the tourist
sector is highly dependent on the ability of key
actors to reassemble the affected areas of the sector
during and after the crisis. 

In practice, actors’ ability to adjust to contex- 
tual conditions deserves to be studied in order to 
understand its modalities and reproducibility in 
other regions. The current study thus addresses, on 
the one hand, the organization of the actors during 
the management of the crisis. Was this organiza- 
tion, despite procedures, the result of management 
led by the institutional crisis unit, or was it an 
emerging organization of stakeholders and, pos- 
sibly, the fruit of discussions and initiatives taken 
jointly by professional organizations, economic 
actors and institutional actors? 

On the other hand, it addresses the innovative 
capacities of actors to respond, on a case-by-case 
basis, to the technical problems generated by the 
floods (loss of means of communication, trans- 
port, etc.). Are these capacities for innovation 
and adjustment part of a culture that is specific 
to the region, its actors and the experience of 
professionals? 

The qualitative study will run for two years, 
and will involve professionals and actors that were 
involved in the crisis. 

The plan is to conduct 40 open-ended interviews
with the various socio-economic actors who participated
in the management of the crisis. So far,
only exploratory interviews have been conducted
with the governing bodies of companies involved
in the event. A total of 12 informal interviews, each
lasting an hour and a half, were carried out to explore
the role of actors in the management of the crisis
and to record their perception of the event.

In addition, a systematic study of the regional 
press that covered events will be conducted, in
order to understand its role in the construction of
the image that actors may have developed of the 
crisis, or even their role in the emergence of a crisis 
culture specific to the sector and the region. 

The initial interview data indicates that recov- 
ery was driven by the desire to continue to provide 
a superior quality service while at the same time 
minimizing the crises. Statements like “the show 
must go on” and “putting on a brave face” were 
common. There was also a collective effort to portray
the Côte d’Azur tourist sector as dependable
in spite of the previous crises.

Understanding the resilience of organizations— 
that is how they respond to a crisis—requires an 
understanding of the attitudes of the individuals 
involved. In this case it was the attitudes of the 
individual actors who lived through the crisis and 
who then have had to accommodate the effects of 
the crisis in their daily life. How are they able to 
create a world that encompasses the crisis? In practice,
following Moreau (2017), despite the tragedy,
the world goes on. The way of thinking about the
crisis in the tourism and events world, anchored in 
the mindset of the region, then becomes a condi- 
tion for the commitment of actors and their capac- 
ity to innovate. 


7 CONCLUSION 


The tourism sector is considered to be a key eco- 
nomic player in France. However, it is exposed 
to a variety of crises and disasters at the micro, 
national and international levels. Risk percep- 
tion is an important part of the tourist’s decision 
process in both leisure and business tourism. Risk 
perception can also be mitigated by presenting a 
positive image of the destination and in particular 
how the sector was able to effectively respond and 
manage the crisis. The question here is how that image is
constructed and what its key features are.

Although an increasing number of studies have 
begun to appear in the literature in the past decade, 
there is still a lack of research on crisis manage- 
ment in the tourism sector. Those studies that are 
available all point to the need to implement a strategic
plan at each stage of crisis management.
Moreover, our findings show that a three-tiered 
system is an effective way of providing a compre- 
hensive response that incorporates input at all lev- 
els of the sector. 

The literature also highlights the need to move 
beyond overly procedural plans and instead allow
for innovative responses when required. That is, 
from a resilience perspective, it is necessary to allow 
individuals to be innovative and responsive to the 
environment. Our findings support the literature 
in that any crisis management approach must have
a level of flexibility. By tapping into the mindsets
of “the show must go on” and allowing individuals
to be innovative, a more committed workforce will
emerge, thus ensuring that the crisis management
plans are deployed. But the key ingredient is to create
a dynamic, sectoral environment that is responsive
and resilient during and after crises.


Validation of a gamified measure of safety behavior: The SBT


C.B.D. Burt, L. Crowe & K. Thomas

ABSTRACT: Safety behavior is defined by constructs such as safety compliance, safety participation,
and risk-taking. A safety-compliant employee follows safety rules and policies and, if they engage in safety
participation, goes beyond their job to ensure everyone’s safety. In contrast, risk-taking and safety violation
behaviors can cause accidents and system failures. The significance of safety behavior argues for its
measurement in job applicants, and the use of the resulting data to select applicants in and out of high
risk situations, and/or allocate them to training programs. The development of the Safety Behavior Test (SBT),
a gamified assessment tool operationalized within an animated work simulation environment, is
described. Participants’ SBT scores were correlated with data on their actual safety behavior provided by
an independent source. Results indicate the SBT has good criterion-related validity, but this is influenced
by computer-game playing experience. The advantages of using gamification to measure safety behavior
are discussed.


1 INTRODUCTION 

Research varies in the estimation of the influence 
of behavioral safety in accident causation. Some 
researchers have suggested as much as 90 percent 
of accidents can be shown to have a behavioral 
safety component as a contributing factor (see 
Hofmann et al. 2017 for a useful review). It is also 
clear that specific cohorts of employees, such as 
young employees (e.g., Salminen, 2004), or those 
new to a job can engage in less safe behaviors and 
more risky behaviors (e.g., Burt, 2015). Behavioral 
safety not only requires an individual to engage 
in safe behaviors, but also to avoid a spectrum of 
unsafe or risky behaviors, including routine rule- 
breaking (e.g., Darby et al. 2005). While behavioral 
safety is clearly an important factor in the over- 
all safety performance of a system, organizations 
are somewhat restricted in their ability to manage 
it. Research has examined the influence on safety 
behavior of generating a positive safety culture 
(e.g., Cui et al. 2013, Diaz-Cabrera et al. 2007), and
the impact on safety behavior of having a positive 
safety climate within work teams (e.g., Christian 
et al. 2009, Clarke, 2006). From this work it is clear 
that a positive safety culture and climate can help 
promote safety behaviors, and the avoidance of 
risky behavior. 

A further option to manage safety behavior is 
to recruit people into the organization who are 
likely to behave in a safe way. In order to do this,
organizations need to use measures during recruitment
which allow for the prediction of individuals’
on-the-job safety behavior. In examining the
relationship between recruitment processes and
safety, a number of studies have highlighted the
need to ensure all employees have the skill, knowledge
and training to complete their work in a safe
manner (e.g., Ford & Wiggins, 2012, Grandell-Niemi
et al. 2003, McMullan et al. 2010, Postlethwaite 
et al. 2009), suggesting that appropriate ability tests 
should be used during recruitment. A considerable 
body of work has also attempted to understand the 
link between personality and safety behavior, with 
the idea of an accident-prone personality receiving
research attention for almost 100 years (e.g.,
Dahlback, 1991, Greenwood & Woods, 1919, Visser
et al. 2007), along with the idea that risk-taking is a 
personality trait (e.g., Dahlback, 1990, Eysenck & 
Eysenck, 1977). More recently, Clarke and Rob- 
ertson (2005) used meta-analysis to examine the 
criterion-related validity of the big five personal- 
ity factors as predictors of accident involvement, 
and concluded that conscientiousness and agreea- 
bleness were valid and generalizable predictors of 
safety. 

In addition to ability and personality testing, 
organizations may attempt to predict safety behav- 
ior using questions about safety behavior or the 
individual’s past accident history in an applica- 
tion blank, or in an employment interview. While 
these measurement options have the potential to 
contribute to an organization’s understanding of 
how a job applicant might behave in the future, 
they have serious limitations. Under these ‘self- 
reporting’ conditions, it is very obvious what is 
being measured, which makes responses suscepti- 
ble to social desirability biases. Social desirability 


bias refers to a phenomenon where participants 
over-report favorable opinions and behavior, while 
under-reporting those that are unfavorable, and is 
most common when the subject under investiga- 
tion is considered to be sensitive by the respondent 
(Krumpal, 2013). Sensitive subjects are defined as 
those where there are potential costs or risks to the 
respondent for responding in a particular way, or 
to the collective population that the outcome of 
the question represents (Sieber & Stanley, 1988).
Clearly, a focus on safety within a job application 
process would represent such a situation. 

Given the importance of employee safety 
behavior, an ideal way to assess it would be to use 
a work sample approach where the job applicant 
can demonstrate their behavior. However, there are 
clear ethical reasons why job applicants cannot be
placed into a risky situation to measure how they 
will respond. A way to avoid this ethical issue is to 
use a simulation, and as such this study reports on 
the development of a measure of safety behavior 
that uses a work simulation developed using the 
gamification paradigm which is rapidly growing 
in its application across a number of areas (e.g., 
Chen et al. 2015, Rodrigues, Costa & Oliveira, 
2016, Singh, 2012). Gamification can be defined 
as the use of game design elements in non-game 
contexts, and is predominantly used to make real 
world activities more engaging (Deterding et al. 
2011, Shrofeld, 2010). Organizations have also 
started applying gamification to their selection 
procedures (Chamorro-Premuzic, 2015). The pop- 
ularity of gamification in the work place is likely 
due to the positive impact it has on engagement 
and motivation (Gagne & Deci, 2005, Harter et al. 
2013). Within the area of safety behavior measurement
during employee recruitment, gamification
has considerable potential to avoid measurement 
bias. 

To be of value, a gamified psychometric test
must be valid: it must accurately measure the
intended construct and be predictive of future
performance related to that construct. Very few
studies have investigated the implementation of 
gamified assessments, and even fewer have inves- 
tigated the validity of gamified measures. Of those 
which do report on gamified assessments, validity 
results are mixed. Some studies have reported the
gamified measure examined was not valid (e.g., 
Jaffal & Wloka, 2015, Kim & Shute, 2015, Whetzel 
et al. 2012), while others report they were able to 
develop a valid gamified measure (e.g., Mislevy, 
Almond & Lukas, 2003, Shute, Ventura & Kim,
2013), supporting the use of a gamified measure as 
a way to observe authentic behaviors (Clarke, 2009, 
Clarke-Midura & Dede, 2010, Ketelhut et al. 2008, 
Shavelson et al. 1991). These mixed results may 
not necessarily reflect on gamified assessments as
a whole, but rather on the design of the measurement points
within the specific gamified assessments which 
were examined in each study. 

Gamified assessments contain two key design
components: measurement design and game design.
Measurement design is developed first and con- 
tains the construct intended for measurement, and 
the design of measurement items for this construct. 
During this phase of gamification design the vir- 
tual environment is created and the measurement 
items are applied to decision points within the game 
(Halverson & Owen, 2014, Ifenthaler et al. 2012). 
Each decision point is recorded by a play log and 
summed to provide the user with a score or scores 
on the measured construct. Game design helps 
determine a game-based assessment’s validity and 
ability to predict employee behaviors in terms of its 
similarity to a work-sample test (Gangadharbatla & 
Davis, 2016). Work samples have been consistently 
regarded as one of the most accurate and valid 
measures in predicting performance (Hunter & 
Hunter, 1984, Reilly & Warech, 1993, Roth et al. 
2005). As noted, it is difficult, if not impossible,
to use traditional work sample testing to measure
safety constructs (e.g., measuring the construct
of safety compliance would require an individual
to complete a potentially dangerous task). A
gamified assessment can replicate a work sample
by requiring the individual to provide a
performance-based sample through a virtual rather than
physical environment. In this way, safety behavior
can be assessed with no actual danger
to the participant.

In summary, this study reports on the develop- 
ment and initial validation of a gamified meas- 
ure of safety behavior: the Safety Behavior Test 
(SBT). The SBT was designed to be used during 
the employee recruitment process. The validation 
process involved correlating the SBT scores with 
data on safety behavior obtained from independ- 
ent raters. 
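As an illustration of the validation logic summarized above, the sketch below computes a Pearson correlation between SBT scores and an acquaintance-rated safety scale. It is a minimal example under stated assumptions: the excerpt does not name the correlation coefficient used, and the function name `pearson_r` and both data lists are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(ss_x * ss_y)


# Hypothetical illustration data (NOT the study's data):
# one SBT score (0-13) and one acquaintance-rated scale score (1-5) per participant.
sbt_scores = [4, 7, 9, 10, 12]
rated_compliance = [2.0, 3.3, 3.7, 4.0, 4.7]

r = pearson_r(sbt_scores, rated_compliance)
print(f"criterion-related validity estimate r = {r:.2f}")
```

A positive coefficient for scales such as safety compliance, and a negative one for risk-taking or rule-bending, would be the pattern expected of a valid measure.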


2 METHOD 


2.1 Design


A concurrent criterion-related validity design was
used. The participant completed the SBT, followed
by an Individual Characteristics Questionnaire
(ICQ). The participant was then given an
Acquaintance Questionnaire (AQ), and they nominated
an individual (e.g., supervisor, co-worker)
to complete it. The AQ provided the independent
data on the participant’s safety behavior against
which the SBT score was validated. The ICQ data
were used to determine if any of the variables measured
showed adverse impact on the SBT score, and
thus needed to be controlled for in the validation
analysis. The study was reviewed and approved by
the University of Canterbury Human Ethics Com- 
mittee, reference number HEC 2017/26. 
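One common way to control for a third variable in a validation analysis is a first-order partial correlation. The sketch below assumes that approach purely for illustration; the design description does not state which statistical control was applied, and the function name and the example coefficients are hypothetical.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z.

    r_xy: correlation between SBT score (x) and rated safety behavior (y)
    r_xz: correlation between SBT score (x) and the control variable (z)
    r_yz: correlation between rated safety behavior (y) and the control (z)
    """
    denom = sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom == 0:
        raise ValueError("undefined when the control variable correlates perfectly with x or y")
    return (r_xy - r_xz * r_yz) / denom


# Hypothetical coefficients (NOT results from the study), with z standing for
# an ICQ variable such as months of computer-game playing experience.
adjusted = partial_correlation(r_xy=0.30, r_xz=0.40, r_yz=0.10)
print(f"partial correlation controlling for z: {adjusted:.2f}")
```

When the control variable is unrelated to both measures, the partial correlation equals the original correlation, which makes a convenient sanity check.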


2.2 Sampling 


Haphazard sampling (Weisberg & Bowen, 1977) 
was used to obtain the SBT participants. The 
recruitment criteria were that the participant had to be
working (responses to a question on this indicated 
all participants were working), and had to agree 
that they had adequate vision to use a computer 
to complete the SBT. Acquaintances (e.g., super- 
visor, co-worker) for the collection of independ- 
ent data on the participant’s safety behavior were 
recruited by the SBT participant. To participate 
as an acquaintance, the individual had to have 
knowledge of their respective SBT participant’s 
safety behavior. SBT participants and acquaint- 
ances each received a $10 petrol voucher for their 
participation. 


2.3 Participants 


The study involved 200 individuals: 100 participants
who completed the SBT, with a mean
age of 41.6 years (range 18-66 years), and 100
acquaintances with a mean age of 43.5 years
(range 18-66 years). SBT participants reported
an average total work tenure of 286.8 months
(SD = 158), and an average job risk rating on a
100 point scale of 42.4 (SD = 28.3). Acquaintances
indicated they had known their SBT participant
for an average of 144.7 months (SD = 172.9), and
using a 100 point scale rated the strength of their
relationship with their SBT participant on average
as 73.4 (SD = 22.1). These results suggest that
acquaintances should have had ample opportunity 
to observe the SBT participant’s safety behavior, 
and thus be able to provide valid ratings. 


2.4 Materials 


Data was collected using three sources: the SBT, 
the individual characteristics questionnaire, and 
the acquaintance questionnaire. 


2.4.1 Safety behavior test 

The SBT is a fully animated computer test of the 
point and click game genre, meaning that the test 
taker can point the cursor at an area on the screen 
and click in order to interact with the test envi- 
ronment. First person views are used in the SBT, 
thus the test taker experiences the test as if they 
were a character navigating the test environment. 
A number of gamification design elements provid- 
ing feedback to the test taker are included in the 
SBT, including a timer shown in the top left corner
of the screen which displays play time in seconds, a
click counter also shown in the top left hand corner 
of the screen which displays the number of times 
the player has clicked the mouse, and feedback on 
some decisions, such as a red cross appearing if an 
incorrect choice is made within the game. 

To protect the security of the SBT only a general 
description of it is provided here (note the testing 
procedure and pre-test instructions are described 
in the procedure section). The SBT requires the 
test taker to assume the role of a worker in a waste 
disposal company. The SBT starts with the ani- 
mation opening a door and moving into an office 
scene. A sign on the desk reads “Press red button 
if unattended”, which indicates that the partici- 
pant needs to click the red button. Clicking the 
red button plays a narrative instruction: “Hello 
forklift driver number 1. Sorry, I am up on level 6. 
It’s good that you are here on time, there is only one 
job for you today. You have a shipment for disposal 
at the incinerator. The empty shipping container 
for the shipment is in loading dock C. A truck will 
take the loaded container to the incinerator as soon 
as you have finished loading it. I have already put the 
shipment items into the system, so when you get in 
a forklift the item list will be on the display screen. 
The new semi-automatic forklifts are working great, 
just click an item on the list and off you go to the rel- 
evant floor. Remember that control buttons appear 
when you need them. We have fixed the problem 
with the red right and left directional control arrows, 
and the central yellow stop button is working fine 
on all forklifts. Remember to load the items in the 
order shown on the list. The cloak room is nice and 
tidy this week, so let’s keep it that way. Don’t muck 
around as the transportation firm will charge us if 
they have to wait, but be careful. When you have got 
the order loaded come back here and let me know. 
If you would like me to repeat the instructions, just 
click the red button again”. After the participant
has listened to the narrative (they can only play
it twice), they make a number of decisions which
allow them to navigate through the game/test by
first entering the cloak room, followed by another 
room where they select a forklift. They then control 
the forklift to 5 different levels within the building. 
On each level they collect an item and move it to 
a container in the loading bay. During this process 
they encounter various hazards, and have safety 
related decisions to make. Once they have loaded 
all the required items into the container they exit 
via the supervisor’s office where they complete 
a check list which has questions about the work 
they have just completed. Submitting the answers 
to these questions closes the test. During the SBT 
the player encounters 35 decision points. These
decision points vary in terms of whether the decision
has a safety aspect (e.g., following directional
arrows on the floor while driving the forklift), or
a risk aspect (e.g., avoiding driving over a hose in
the forklift), or is simply a decision required to
advance in the game (e.g., clicking a door handle
to open a door). Safety and risk related decisions
were used to form the SBT score. A safe decision
was given 1 point and these points were summed to
produce the score, with a possible SBT score range
of 0 to 13. The SBT instruction page which participants
were asked to read before taking the test
is shown in Figure 1.


Test Instructions
Before you begin the test it is important that you understand how it works. Please carefully read
the following points.

• This test is a work simulation in the format of a point and
click video game. In the game you play the role of forklift
driver number 1. You will enter a building which contains
items you may interact with by clicking on them with the
mouse pointer. For example to open a door, click on the door
handle.

• It is important to note that it may not be possible to go back
in the game after clicking certain things such as a door
handle, as this action will move you to the next area within
the game. However, in some sections of the game a back
arrow will appear in the bottom left corner of the screen.
Clicking this will move you back in the game.

• For a large part of the game you will be in a forklift. When
you are in the forklift you can only control the game by
clicking areas on the forklift control panel. For example to
select an item location, click on the level location on the
control panel. Controls that you can use within the forklift
will change at various parts of the game.

• The forklift directional control arrows will always be
present. Click these in the middle of the arrow to control
directional movements. You can only control directional
movements when the forklift is stopped.

• At one point in the game a yellow stop button will appear on
the control panel. This button allows you to stop the forklift.

• You can only control the test (e.g. select directional
movements of the forklift) when the test is in manual mode.
Clicking anything when the test is in auto mode will have no
effect, and only waste clicks. Test mode is shown in the
bottom middle of the screen.

• A mouse click counter is shown in the top left corner of the
screen. The test can be completed perfectly in 50 mouse
clicks.

• Further instructions on what you need to do will be given when the game begins.
• Please close other tabs and don't have applications running in the background.
• When you click start the test will take several minutes to load. After which it will automatically start.

START


Figure 1. The SBT instruction page.


2.4.2 Individual characteristics questionnaire

Demographic information on age and gender,
along with job tenure, experience driving a forklift,
and computer game playing experience was
collected. For the latter variable, SBT participants
indicated how many months they had played computer
games. Safety risk of the SBT participant’s
current job was assessed by placing a mark on a
100 point scale (0 = not risky at all, 100 = extremely
risky). 
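Returning to the SBT scoring rule given in Section 2.4.1 (one point for each safe choice at a safety or risk related decision point, with navigation-only decisions unscored), the sketch below shows how a play log could be reduced to a score. The `Decision` record, its field names and the example log are hypothetical; the actual SBT play-log format is not published, and the item content here is invented for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """One logged decision point (hypothetical structure)."""
    point_id: str  # illustrative label for the decision point
    kind: str      # "safety", "risk", or "navigation"
    safe: bool     # True if the safe option was chosen


# Only safety and risk related decisions contribute to the score.
SCORED_KINDS = {"safety", "risk"}


def sbt_score(play_log):
    """Sum one point per safe choice at a scored decision point."""
    return sum(1 for d in play_log if d.kind in SCORED_KINDS and d.safe)


# Hypothetical play log (NOT actual SBT content).
example_log = [
    Decision("follow_floor_arrows", "safety", True),
    Decision("drive_over_hose", "risk", False),
    Decision("open_door", "navigation", True),  # unscored
    Decision("check_load", "risk", True),
]

print(sbt_score(example_log))
```

Keeping the scored kinds in a single set makes it explicit that navigation decisions are logged but never counted.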


2.4.3 Acquaintance questionnaire 
The relationship the acquaintance has with their 
respective SBT participant was measured by how 
long the acquaintance had known the SBT participant
(in months), and a rating of how well they
know them on a 100 point scale where 0 = Not very
well at all to 100 = Extremely well. Additionally, the
SBT participants’ safety behavior was rated by the
acquaintance using scales measuring safety participation,
safety compliance, safety voicing, safety
consciousness, rule-bending and risk taking. The
wording of scale items was adjusted from a first
person format to suit third person acquaintance
ratings. For example, “I always use all the necessary
safety equipment to do my job” was adjusted to “...
always uses all the necessary safety equipment to do
their job”. Acquaintances were instructed that the
“...” referred to the individual that asked them to
complete the questionnaire. Each scale item was
responded to on a 5-point Likert scale anchored
with 1 = strongly disagree and 5 = strongly agree.
Scales were analyzed for internal consistency, 
and coefficient alphas are reported below. Scale 
scores were formed by summing the item ratings 
and dividing the sum by the number of items in 
the scale. Depending on the scale, a higher score is 
indicative of a higher level of safety behavior, or a 
higher level of risk-taking/rule-bending behavior. 
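The two computations described above, the scale score (sum of item ratings divided by the number of items) and coefficient alpha, can be sketched as follows. This is a minimal example: the alpha formula is the standard Cronbach formula, the function names are hypothetical, and the ratings are invented for illustration and are not study data.

```python
from statistics import variance


def scale_score(item_ratings):
    """Scale score: sum of the item ratings divided by the number of items."""
    return sum(item_ratings) / len(item_ratings)


def cronbach_alpha(responses):
    """Coefficient alpha for a list of respondents, each a list of k item ratings.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    item_columns = list(zip(*responses))  # one tuple of ratings per item
    item_var_sum = sum(variance(col) for col in item_columns)
    total_var = variance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical ratings on a three-item scale (1 = strongly disagree, 5 = strongly agree).
ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
]

print([round(scale_score(row), 2) for row in ratings])
print(round(cronbach_alpha(ratings), 2))
```

Dividing by the number of items keeps every scale on the original 1 to 5 metric, so scales of different lengths remain directly comparable.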
Safety compliance and participation, were 
measured using a third person adapted version 
of Neal and Griffin’s (2006) six item scale. Three 
items measured safety compliance, defined as core 
activities an employee needs to engage in to main- 
tain workplace safety. The other three items meas- 
ured safety participation, defined as behaviors 
which help to develop an environment that sup- 
ports safety. An example item for safety participation
is “At work *... puts in extra effort to improve
the safety of the workplace”. Coefficient alphas for 
safety compliance and participation are .87 and .83, 
respectively. Propensity for the SBT participant to 
breach workplace safety rules and procedures was 
measured using an adapted version of Chmiel’s 
(2005) four-item bending the rules scale. An 
example item is, “*... sometimes cuts corners if it 
makes the task easier”. A coefficient alpha of .87 
was found for this scale. Safety consciousness and 
risk taking were measured using the 12-item safety 
consciousness and risk-taking scale developed by 
Westaby and Lee (2003). Safety consciousness 
is defined as “a positive attitude and awareness 
toward acting safely in general”, and risk-taking is 


defined as an “individual’s willingness to engage in 
activities that knowingly have elements of physical 
danger” (Westaby & Lee, 2003, p. 228). An exam- 
ple item for safety consciousness is “*... gets upset 
when seeing other people acting dangerously”, and 
an example for risk-taking is “*... values having 
fun more than being safe”. The safety consciousness
scale coefficient alpha was .82 and the risk-taking
scale coefficient alpha was .82 after removal
of one item. Acquaintance’s perception of their 
respective SBT participant’s safety voicing behav- 
iors was measured using an adapted version of 
Tucker et al.’s (2008) five-item safety voicing scale. 
Safety voicing is measured as “any individual
communication directed at improving safety
conditions” (Tucker et al. 2008, p. 319). An example item
is “*... makes suggestions about how safety could be
improved”. The obtained safety voicing scale
coefficient alpha was .87.
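
The scoring rule and the internal consistency check described above can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the ratings are hypothetical, and the standard formula for coefficient (Cronbach's) alpha is assumed, since the paper does not state how alpha was computed.

```python
from statistics import variance


def scale_score(item_ratings):
    """Scale score as described in the text: sum of item ratings / number of items."""
    return sum(item_ratings) / len(item_ratings)


def cronbach_alpha(responses):
    """Coefficient alpha for a list of respondents, each a list of k item ratings.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(responses[0])
    item_variances = [variance([r[i] for r in responses]) for i in range(k)]
    total_variance = variance([sum(r) for r in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical 5-point Likert ratings from five acquaintances on a 3-item scale.
ratings = [[5, 4, 5], [4, 4, 4], [2, 3, 2], [5, 5, 4], [3, 3, 3]]

scores = [scale_score(r) for r in ratings]  # one scale score per rated participant
alpha = cronbach_alpha(ratings)             # internal consistency of the scale
```

Because the sum is divided by the number of items, every scale score stays on the original 1 to 5 response metric, which makes scales with different item counts comparable.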


2.5 Procedure


SBT participants were tested individually in a 
quiet room. Participants read a consent sheet, 
and an information sheet which indicated they 
were being asked to take a safety test, complete a 
questionnaire and give a further questionnaire to 
an acquaintance who would be asked to complete 
questions about them. After consenting, partici- 
pants were told “Please read the instructions (see 
Figure 1) carefully, as you will only be able to see 
them once. Press the start button when you are ready 
to take the test. I will leave you to take the test pri- 
vately, and will be waiting outside of the room for 
when you have finished. Please imagine that you 
have applied for a job. The test you are about to 
complete is being used to determine your suitability 
for the job. As a job applicant, try to do your best 
on the test”. A Lenovo ideapad 510-15ikb laptop 
which has a 15.6 inch screen was used to run the 
SBT. 

After completing the SBT, the participant com- 
pleted the individual characteristics questionnaire. 
The participant then received an unsealed enve- 
lope containing an acquaintance questionnaire. 
The SBT participant gave this envelope to their
selected acquaintance, who completed the ques- 
tionnaire and sealed it in the envelope which was 
then returned to the researchers. 


3 RESULTS 


Missing data for acquaintance rating scale items 
was replaced with the variable mean. Missing data 
for demographic questions was left as missing, and 
resulted in some variation in n across the analyses.
SBT participants took an average of 17.9 minutes
(SD = 2.84, range 13.1 to 26.1) to complete the
SBT. The average SBT score was 8.0 (SD = 2.72,
range = 0 to 13, skew = -.454, kurtosis = -.067).
Table 1 shows the comparison of SBT scores by 
gender, computer game player versus never played 
computer games, and previously driven a forklift 
versus never driven a forklift. 
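
As a rough check on the group comparisons, a t statistic can be recomputed from the summary statistics reported in Table 1. The sketch below assumes a pooled-variance (Student) independent-samples t-test, which the paper does not state explicitly; the function name is illustrative.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-samples t statistic from summary statistics (pooled variance)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


# Computer game players (n = 36) versus non-players (n = 62), values from Table 1.
t_game, df_game = pooled_t(8.91, 2.61, 36, 7.66, 2.66, 62)
```

Under this assumption the magnitude comes out close to the reported value of 2.263, with the small difference attributable to rounding in the published means and standard deviations. The sign depends only on which group is entered first.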

Inspection of Table 1 indicates that being a 
computer game player significantly increased the 
participant’s SBT score. Thus computer game expe- 
rience in the form of how many months the partici- 
pant had played computer games (mean = 177.9, 
SD = 130.3) was used as a covariate in the crite- 
rion-related validity analysis. Correlational analy- 
sis using Pearson product moment correlations 
showed no significant relationship between the 
participant’s age and SBT score (r = -.02, n = 99),
total work tenure and SBT score (r = .02, n = 96), 
or rated safety risk of their current job and SBT 
score (mean = 42.4, SD = 28.3, r = .01, n = 98). 
Overall, the SBT score data shows a distribution 
suitable for validation analysis, and of the variables 
examined is only significantly impacted by compu- 
ter game playing experience. 

To examine the criterion-related validity of the 
SBT, several correlational analyses were performed. 
Table 2 shows the descriptive statistics for the 
independent safety behavior scale data from the 
acquaintance questionnaire and the correlation 
between each safety behavior scale scores and the 
SBT score. The first correlational analysis (column 
5 of Table 2) used all 100 participants. The second 
analysis used the number of months the participant 
had been playing computer games as a covariate, 
and thus shows partial correlations. The partial 
correlation analysis resulted in a decrease in sam- 
ple size (n = 30) due to the small number of com- 
puter game players in the total sample, and that 
only 30 of these reported their playing experience 


Table 1. Comparison of SBT score by gender, computer
game playing experience, and forklift driving experience.


                       SBT score       SBT score
                       Mean (SD)       Mean (SD)        t-test
Gender                 Male, n = 62    Female, n = 38
                       7.91 (2.92)     8.26 (2.39)      -.611
Computer game player   Yes, n = 36     No, n = 62
                       8.91 (2.61)     7.66 (2.66)      -2.263*
Driven a forklift      Yes, n = 57     No, n = 43
                       7.96 (2.90)     8.16 (2.49)      .358

*P < .05.


Table 2. Safety behavior measure descriptive statistics
and correlations with SBT scores.

                                               Correlation   Partial correlation¹
Safety behavior                 Skew           with SBT      with SBT
measure          Mean (SD)      (Kurtosis)     score         score
                                               N = 100       N = 30
Voice            3.67 (1.1)     -1.27 (1.88)   .02           .25
Compliance       4.03 (.95)     -1.33 (2.39)   .13           .42*
Participation    3.52 (1.2)     -.92 (.63)     .00           .22
Consciousness    3.63 (.78)     -.59 (.14)     .08           .42*
Rule bending     2.35 (2.2)     .08 (-.80)     -.20*         -.46**
Risk taking      2.06 (.92)     .82 (.90)      -.08          -.41*


¹Controlling for months of computer game playing.
*P < .05, **P < .01.


in months (mean = 177.9, SD = 130.3, range = 19 
to 480). Inspection of Table 2 indicates SBT 
criterion-related validity evidence is shown when 
computer game playing experience is controlled for. 
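
For readers unfamiliar with partial correlation, the first-order case with a single covariate can be sketched as below. This is an illustrative implementation of the textbook formula, with x standing for the SBT score, y for a safety behavior scale score and z for months of computer game playing; it is not the authors' analysis code, and the function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_r(r_xy, r_xz, r_yz):
    """Correlation between x and y after removing the linear effect of z from both."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def partial_r_from_data(x, y, z):
    """Convenience wrapper: partial correlation of x and y controlling for z."""
    return partial_r(pearson_r(x, y), pearson_r(x, z), pearson_r(y, z))
```

When the covariate is uncorrelated with both variables, the partial correlation equals the ordinary correlation. A covariate that is related to both variables can either shrink or enlarge the association once it is controlled for.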


4 GENERAL DISCUSSION 


The SBT scores across the 100 participants pro- 
duced a distribution consistent with the SBT being 
a useful measurement tool. The overall mean SBT 
score was close to the score range mid-point, and 
the standard deviation and skew statistics are 
indicative of a normal distribution. These meas- 
urement characteristics are ideal for a measure 
which is attempting to scale job applicants or 
employees on their safety behavior. The overall 
average test time of approximately 18 minutes is 
also likely to be acceptable within the process of 
employee recruitment. Tests that run too long are 
said to be associated with response burden, where 
participants feel strained by the test experience 
and this could adversely impact their test score 
(Rolstad et al. 2011). Of the variables examined,
computer game experience showed an impact on 
SBT performance, and this is very evident in the 
criterion-related validity analysis shown in Table 2. 
The implications of this are discussed below. 

The SBT was developed to help avoid bias asso- 
ciated with self-report measures of safety behav- 
ior. Using the assumption that behavior within 
a gamified work environment simulation would 
accurately reflect behavior in a ‘real’ situation, the 
SBT was designed to include decision points which
could be scored as either safe or unsafe (risky).
Individuals with a propensity to behave safely, not 
bend the rules or take risks should score higher on 
the SBT, and the correlations with the independ- 
ent rating of safety behavior shown in Table 2 
are consistent with the SBT being able to achieve 
this objective. However, a key factor in the SBT’s 
measurement ability is the participant’s experience
with the gamification mode, that is, their experience
with computer game playing. When computer game
experience is controlled for, the partial correlations
support the criterion-related validity of the SBT,
and the size of the obtained partial correlations is
sufficient to ensure that the SBT would have utility
within an operational situation, such as employee 
recruitment. 

The finding that computer game experience 
influenced SBT performance is consistent with 
other validation research into gamified tests (e.g., 
Kim & Shute, 2015). Kim and Shute (2015) examined
the influence of computer game experience on
participants’ performance on their gamified physics
test, and reported that performance on the assessment
was influenced by game playing experience.
Participants identified as gamers were found to have had
an advantage over non-gamers in achieving test
points. A number of explanations for the influence
of computer game experience on SBT performance 
need to be explored in future research. In complet- 
ing the SBT it is possible that non-computer game 
players had to focus on learning how to control the 
game/test, made decision errors because of their 
unfamiliarity with the test mode, and this reduced 
their SBT score. Under such conditions their SBT 
behavior, as reflected in their SBT score, would not 
be expected to be associated with their acquaint- 
ance’s ratings of their safety behavior. The influ- 
ence of test mode unfamiliarity should be able to 
be easily addressed through the use of sufficient 
pre-test instruction given in the operational aspects 
of the test mode. The SBT instructions shown in
Figure 1, while assumed to be sufficient, are static,
and further work is needed to enhance them with brief
animation clips that include point and click control
trials similar to those used in the SBT simulation
environment.

It is also possible that computer game players 
may have characteristics which are associated with 
behaving safely, and their SBT performance has
less to do with game control familiarity and is rather
a reflection of their tendency to behave safely.
Teng (2008) found that computer game play- 
ers had higher scores on openness to experience, 
conscientiousness, and extraversion than non- 
computer game players. Geller (2004) suggested 
that being open to experience would make people 
more likely to accept and engage in new health and 


safety related initiatives, while being conscientious 
would see people being more inherently interested 
in safety processes, and being extraverted would 
make it easier for people to use safety procedures 
that require communication between people. Fur- 
thermore, in a meta-analysis of personality traits 
and accident involvement, Clark and Robertson 
(2005) found low conscientiousness to be a valid 
predictor of occupational accident involvement. 

There may also be a cognitive aspect associated 
with computer game playing which has relevance to 
safety behavior. Pillay (2002) revealed in an inves- 
tigation of the cognitive processes engaged in by 
computer game players that children who played 
recreational computer games performed better on 
subsequent educational tasks than a control group 
of children that did not play computer games. 
Higher levels of intelligence have been shown to 
be associated with lower rates of accidents. Spe- 
cifically, the information processing skills of being 
able to recall relevant information, quickly identify 
problematic situations, and react quickly to unfore- 
seen situations are said to be key skills in accident 
prevention (Gottfredson, 2004). 

While a number of issues need to be addressed 
in future research on the SBT, the evidence from 
this study suggests the SBT has the potential to be 
a useful tool for both employee recruitment and 
research on safety behavior. The merging of tradi- 
tional psychometrics and animated game technol- 
ogy provides for measurement which might be less 
susceptible to bias than self report. In the case of 
safety behavior this may result in significant ben- 
efits to organizations through accident reduction, 
and help protect individuals from injury through 
the identification of those who could benefit 
greatly from instruction around safe workplace
behaviors. 
ABSTRACT: Our contribution to the ESREL conference presents the first results of a qualitative
and quantitative study on the professionalization of alumni of a post master safety program. The post
master Management of Industrial Risks has graduated 197 students since 2004. We sent an online
questionnaire to all the alumni; in addition, five in-depth interviews will be conducted with alumni and
their work colleagues, and a focus group discussion will be held. The online questionnaire investigates
their careers (company, position and wages) but also how safety is organized in their company, with whom
they interact and what their missions are. For the job missions, we focus on their personal convictions,
the importance given by their organization or company and the actual time spent on the missions. We also
investigate the characteristics of their working conditions and what they consider to be their
real contributions to safety in their organization. Finally, the knowledge and competencies they mobilize
or want to mobilize in their job will be assessed. The findings are intended, firstly, to draw a picture of
the safety professional’s daily job, secondly, to identify factors that shape the safety professional’s role
in the organization and, finally, to develop a better adapted curriculum for the training and development
of future safety professionals.


1 INTRODUCTION


Safety science is a rather young science. As
it is a multidisciplinary science mixing social science
with engineering science, its education programs
are also very young compared to the education
for traditional professions (like medicine, law and
applied science professions like engineering), which
depend on one traditional and identified
discipline. Although INSHPO, the International
Network of Safety and Health Practitioner Organizations
(Pryor et al. 2015), considers that OHS
(Occupational Health and Safety) is an emerging
profession that is often not well defined, locally or
globally, the safety profession in France can now
be considered a structured and recognized profession:
several curriculums and descriptions of
the safety or risk profession exist (APEC 2017,
AMRAE 2013). This does not mean that safety was
not managed by professionals in organizations or
companies before this period. The Ecole des
Mines de Paris was founded by King Louis XVI
in 1783 to ensure the management of the mines of
the kingdom. Two issues in the mining business were
important at the time: economy and safety. The
school disappeared at the beginning of the French
Revolution but was re-established by decree of the
Committee of Public Safety in 1794. Engineers
were thus trained in the traditional engineering
disciplines (mathematics, physics, mechanics and
chemistry), but social sciences such as economics and
management were also developed and taught.

In recent years, a keen interest in «professionalization
in/of safety» has emerged in the safety
science community (Gilbert 2015). In 2015, the FONCSI
(Fondation pour une culture de sécurité industrielle)
initiated a strategic analysis
on “skills and competencies for industrial safety”
in which several international researchers are
involved (Bieder, Journé, and Laroche 2015, Bieder
et al. 2018).

Professionalization in/of safety can be seen in two
ways: the professionalization of persons who will
become experts in safety (professionalization in
safety, PiS) and the integration of safety into the
general professionalism of employees (professionalization
of safety, PoS). Although Andrew Hale
has worked since the eighties (Hale et al. 1986,
Hale 1995, Booth et al. 1991, Hale et al. 2005,
Hale and Ytrehus 2004) on the topic of the
professionalization of the profession (PiS), this renewed
interest is giving birth to new models and tools like
the Australian OHS Body of Knowledge (Paul and
Pearse 2016). This article presents a study of
the professional context of a post master safety
program’s alumni.

The general research questions we want to 
tackle are the following: what is the career path of 
alumni of a post master program in risk manage- 
ment? Can we define their daily professional envi- 
ronment? Which knowledge and competencies are 
actually used? What is their position and role in an 
organization? What is the contribution of the post 
master program to their professionalism? How can
education in risk management be improved to
bridge theory and the real-life professional
context? And in a very general way, what could be the
safety professionals’ contribution to safety?

In this article, we will introduce the context and 
the research methods with some early results. First, 
we will differentiate professionalization of and
professionalization in safety. Note that we use
the term “safety professional” in this article: a
“safety practitioner” is vocationally educated,
whereas a safety professional is
university educated (or has attained a similar level
of higher education) (Pryor et al. 2015).


2 PROFESSIONALIZATION OF SAFETY 


Industry has recently begun to wonder whether the effort
put into safety training of employees is worth it
(Gilbert 2015). Although a lot of time and money is
devoted to safety training, accidents still occur.
The professionalization of safety, that is, the way
workers integrate safety issues in their daily work,
is receiving attention from researchers. The concepts
of ordinary and extra-ordinary safety have been
developed by Claude Gilbert (Bieder et al. 2018).
New approaches to safety training are also a topic of
interest. It is clear that the safety professional has
a major role to play, as he is the reference person for
safety in the organization. Thus we see a clear link
between professionalization « in » and « of » safety.
This makes us wonder about the place of the safety
professional in his organization. Gilbert states that
the difficulty in building a good safety training program in
a company (PoS) is that it has to fulfill internal
performance obligations and requirements as well as
external justifications (accountability).


3 PROFESSIONALIZATION IN SAFETY 


There is significant literature about safety professionals.
Andrew Hale has done a lot of work since
the eighties (see above), and professional associations
have also contributed to the subject (Pryor et al.
2015). Even in the sixties, researchers
wrote pertinent insights concerning the
safety professional and his role in organizations
that are still very relevant today (Harper et al. 1962).
Scientific and professional courses have emerged
since the seventies (Arezes and Swuste 2012). Over the
last decade, the number of curriculums in France
dealing with risk management, safety or HSE
(Health, Safety and Environment) has grown
significantly. Private organizations in France publish
rankings of (business) education programs, and
for several years these have included a « risk
management » category (www.eduniversal-ranking.com). Last
November, the author of this article
was asked to review an article for Safety Science in which
Chinese contributors examined the professionalization
of safety in China: « Development of safety
science in Chinese higher education ». Over the last
decade, we can observe a development of the education of
safety professionals in all industrialized countries.
Dekker and colleagues investigated the factors that shape
the role of the safety professional (Provan et al. 2017).
This work is interesting because it gives an idea
of the importance of a safety professional in his
organization and the way he can influence,
model or create safety in the organization. This
is a topic that has to be investigated and taught
to future safety professionals. The present study
also aims to bring some answers to that question.
The following part presents our study object, the
post master degree Industrial Risk Management.


4 POST MASTER DEGREE INDUSTRIAL 
RISK MANAGEMENT 


In 1997, the top management of MINES ParisTech 
decided to launch a research department dedicated 
to risk and crisis with the mission of developing 
studies in cooperation with industrial partners and 
training activities. In 2002, the decision was taken 
to build a post-master program for training HSE 
professionals for the industry. 

The post master degree (“Mastère Spécialisé”
in French) Industrial Risk Management was
launched in 2004 at the request of French industry
and in collaboration with Tongji University of
Shanghai (China), in order to train not only French
students but also Chinese students in risk management
for French companies developing in China.

The curriculum was designed
from an analysis of academic literature and current
practices, in close cooperation with company
representatives, in order to fit the contents of
the training and the pedagogical methods to the
needs of the industrial sector. The program has
undergone several important changes in organization
and content during its existence. In its current form
(class of 2017-2018), it contains 500 hours of
courses over one year (early October to the end of
September). The rest of the time, students carry out
a professional mission with an industrial partner.
The program trains future safety professionals in
the occupational, environmental and industrial
(process safety) fields. The curriculum contains six
main tracks:


• Safety regulations,
• Hazard and risk assessment,
• Safety Management Systems,
• Human and organizational aspects,
• Management and leadership aspects,
• Emergency and crisis management.


Lectures, exercises and practical work are pro- 
vided and supervised by a faculty group composed 
of an equal number of academics and professionals. 

Currently, 31 students attend the program;
almost all of them alternate academic training
and apprenticeship. The apprenticeship schedule
consists of 5 weeks of courses, 3 weeks in the company,
3 weeks of courses, 4 weeks in the company, 3 weeks of
courses, 2 weeks in the company, a 1-week study trip,
5 weeks of courses and 6 months in the company, followed
by the defense of their professional project in front of
a jury. The program in its current form contains
six main tracks, a study trip (United Arab Emirates
in 2017) and a professional conference organized
at the end of March by the students themselves.

The program has evolved from a
«double degree or double competence» approach
to a program that trains «specialized» safety
professionals. The great majority of the students who
apply for the program now already have
a Master degree in Health, Safety or Environment.
The MS MRI makes them more “specialized”,
whereas in the past, students had degrees in
chemistry, engineering, psychology or law. This is
the result both of the increase in HSE Master programs
and of the demand of industry.
Industry wants students already trained in HSE
for their professional missions and collaborations.
This excludes from the program students who
have no knowledge of risk and/or HSE.

The transfer of scientific HSE findings to
industry takes some time. HSE practitioners often
lack knowledge of the genesis of, and relations
between, the HSE theories and methods, models and
metaphors developed since World War II (Swuste,
Gulijk, and Zwaard 2010, Swuste et al. 2014). It is
also important to know the evolution of the HSE
job. The integration of an HSE professionals’ training
program in a dedicated research laboratory
(CRC, MINES ParisTech) facilitates this knowledge
transfer and informs the future HSE professionals
of the latest advances in the field. They will
probably keep their intellectual curiosity for the rest
of their career and hopefully will participate in
co-developing tools using the latest safety models
(Besnard et al. 2009).
Post master programs are known to be «professionalizing»:
the student passes from the status
of student to that of professional. Several characteristics
of the MRI program tend to influence this
transition: many safety professionals participate
and teach; students spend a great deal of
the program in the organization of their industrial
partner; projects and case studies are real industry
problems (with participation of industry); executive
summaries are required on several topics;
students work in teams, give a lot of presentations
and learn to present their work to decision makers;
and the courses on “management and leadership”
make them aware of the importance of soft skills
on the work floor.

Over 13 years, the program has become
more and more professionalizing. But ultimately, what
is the contribution of the post master program
to their professionalism? How can education in risk
management be improved to bridge theory and the
real-life professional context? For what
kind of professional context do we prepare them?
The next part will present the missions of a safety
professional.


5 MISSIONS OF THE SAFETY 
PROFESSIONAL 


Several studies on the missions of safety professionals
have been done. Wybo and Van Wassenhove
(2016) present
the missions of safety professionals (Table 1) and the
skills needed to do the job, and modeled the MRI
program to correspond. They conclude by proposing
requirements for an HSE professional training
curriculum: students getting their degree in HSE
must be fully operational and demonstrate their
professionalism when they start their first job. So
the curriculum must prepare them for their future
work. From the literature review and the analysis
of an HSE professional’s job content, they identified
the important matters our HSE professional’s
curriculum must address:


• Domains to teach:
  o Regulations and HSE management systems,
  o Hazard and risk analysis in occupational
    health, system safety and environment,
  o Human and organizational matters,
  o Emergency and crisis management,
  o Communication, management and leadership,
• A strong involvement of safety professionals,
• The use of realistic case studies,
• Interactions with industry practitioners:
  o “Field work” in industrial sites,
  o Annual conference for industry practitioners,
  o A long internship/presence in a company,


Table 1. Missions of safety professionals in literature (Wybo and Van Wassenhove 2016).
Sources compared: ENSHPO 2005, AMRAE 2013, ASSEF 2007, Wu 2011, INRS 2004, DeJoy 1993, Hale 2004, Kohn 1991.


Job content topic                                        Total
Advising management and decision makers                  6
Definition of the missions and the organization
of the safety management system                          6
Risk management (hazard identification,
evaluation and control)                                  18
Regulatory compliance                                    7
Diffusion of safety culture and culture change           6
Training and communication                               10
Accident and incident investigation                      8
Emergency and crisis management                          8
Monitoring and reporting                                 11
Knowledge management                                     4
Insuring and costing risks                               4

Nevertheless, questions still remain on other
aspects of their professional context. What
knowledge and competencies do they mobilize?
What is their position and role in an organization?
What is their vision of safety and what can be their
contribution to safety? A first step is to look more
closely at what alumni of a post master program in risk
management do once graduated. We developed a
methodology to search for some first answers.


6 METHODOLOGY: GENERAL 
QUESTIONNAIRE 


The post master program MRI now has about 200
alumni. To investigate the professional context of
the safety professional, we have exactly 197
people to consult.

A similar study on the context of a profession
has been conducted in the Netherlands (Corporaal
et al. 2016). It investigates the Human Resources
(HR) professional and uses the following
methodology: students (544 students participated
in the study) asked HR professionals
(571 persons), their hierarchical superiors (542 persons)
and line managers (553 persons) to fill in a questionnaire.
That questionnaire was used afterwards
as a guide for in-depth interviews with those persons.
This study is repeated yearly. Every four
years, a synthesis is done and an education profile
is created. This profile makes it possible to adapt the
curricula of the HR programs. The research themes are
the missions of the HR professional, the role and position
of the HR professional, the competencies needed
for the HR professional and the relations with line
management.

In the field that interests us, the OHS Professional
Capability Framework is a framework for
practice developed by the International Network
of Safety and Health Practitioner Organizations
(Pryor, Hale, and Hudson 2015). INSHPO differentiates
between competency and capability:
competency is about delivering the present based on the
past, while capability is about imagining and being
able to realize the future. Capability goes much further
than competency; it is also about confidence
and adaptability. For our study, we use “competency”.
With the first results, we will see and judge
if there is a need to differentiate between competency
and capability.

From the literature, personal work and observations,
a general questionnaire was developed.
This development was done with the support of
co-researchers (Gilbert De Terssac, sociologist,
research director at CNRS), and the questionnaire was
tested on some alumni. Their remarks were integrated into
the questionnaire. The questionnaire was made available
to the alumni in an online version. Each alumnus
was personally contacted in mid-November 2017 by
email with the request to take some time to answer
the questions. The response period is set to two
months. Several reminders are sent; the aim is
to have at least 100 responses. To answer all questions,
the participant has to take at least 30 minutes
of his time and has to do some critical analysis
of his work situation. Answering the questionnaire
takes a real effort.


The questionnaire is structured as follows:


• questions about the identity
• career description
• job description
• benefits of the MRI program for the professional career
• comments


The job description is the most important part for
our research questions and is composed of twelve
questions. Four questions are very important; we
will present them here.

One question concerns the missions of the professional.
From a synthesis of the literature,
personal observations and discussions with safety
professionals, fourteen missions are proposed
(Table 2). For each mission, the respondent has to
indicate on a scale from 1 (not important) to
10 (very important) the importance of the mission
in his current job, according to several levels:


• according to his or her personal conviction

• according to the organization in which he or she
is working (his or her superiors)

• according to the real time he or she spends on the
mission

In this way, we can identify discordances
between the professional and his or her organization.
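
The comparison of the three rating levels can be illustrated with a minimal sketch. It is not the authors' analysis procedure: the respondent data, the dictionary keys and the function names are hypothetical, and the "discordance" is simply assumed here to be the difference between two ratings of the same mission on the 1 to 10 scale.

```python
# Hypothetical answers of one respondent for three of the fourteen missions.
# Each mission is rated from 1 (not important) to 10 (very important)
# on the three levels named in the text.
ratings = {
    "Manage risks": {"personal": 9, "organization": 6, "time_spent": 4},
    "Develop safety culture": {"personal": 10, "organization": 5, "time_spent": 3},
    "Control compliance with regulations": {"personal": 5, "organization": 9, "time_spent": 8},
}


def check_scale(rating):
    """Reject ratings outside the 1..10 scale used by the questionnaire."""
    for level, score in rating.items():
        if not 1 <= score <= 10:
            raise ValueError(f"{level} rating {score} is outside 1..10")


def discordance(rating):
    """Signed gaps between the three rating levels (an assumed measure)."""
    check_scale(rating)
    return {
        "personal_vs_organization": rating["personal"] - rating["organization"],
        "personal_vs_time": rating["personal"] - rating["time_spent"],
        "organization_vs_time": rating["organization"] - rating["time_spent"],
    }


gaps = {mission: discordance(r) for mission, r in ratings.items()}

# Mission where personal conviction and organizational priority differ most.
largest = max(gaps, key=lambda m: abs(gaps[m]["personal_vs_organization"]))

for mission, gap in gaps.items():
    print(mission, gap)
print("Largest professional/organization gap:", largest)
```

A positive gap means the professional rates the mission higher than the organization (or than the time actually spent on it) suggests; a negative gap means the opposite.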

The second important question concerns the
characteristics of the professional’s job (Table 3).
Sixteen descriptions of work characteristics are
proposed. The professional is asked to score each
situation from 1 (do not agree) to 10 (totally agree).

The third question is about the concept of safety.
Twenty-two ideas are formulated about what
contributes to safety (Table 4). The professional has
to score the ideas (1 = not important, 10 = very
important) according to his or her personal
conviction and

Table 2. The fourteen missions of the safety professional.


Missions of the safety professional


1. Inform and advise the direction
2. Organize the safety management system
3. Follow up of regulations
4. Control compliance with regulations
5. Manage risks (identification/evaluation/treatment)
6. Develop safety culture
7. Inspections on the work floor, organizing information
collection (bottom up)
8. Train and communicate about safety and risk
9. Analyze incidents and accidents
10. Manage crises and emergencies
11. Measure performance: monitoring and reporting
12. Give expertise for a specific type of risk
13. Manage the insurance and financial side of risks
14. Realize business continuity plans


Table 3. Job characteristics description.


Characteristics of the job


1. Presence of an administrative workload and an
important “safety bureaucracy”
2. A big diversity of objects and subjects to handle
3. A lot of subjects to handle with urge
4. No time to treat subjects in depth
5. A lot of traveling
6. A lot of simultaneous missions with different
deadlines
7. Recognition of the direction
8. Recognition of middle management and workers
9. Feeling of being isolated in an organization that is
not much concerned by safety
10. Interruptions and adaptations of planning
11. Difficulties to conciliate safety prevention with
regulatory compliance, requirements of inspection
of the administration and the politics of the
organization
12. Interactions with a lot of stakeholders
13. Lack of means
14. Necessity to adapt constantly the rules (theory) to
the reality of the work floor (practice)
15. Frequent interactions with the work floor
16. Autonomy and liberty of initiative


Table 4. Safety “ideas”.


Components of safety


1. Engagement of the direction
2. Engagement of middle management
3. Compliance with regulations
4. Safety culture
5. Safe technical conception
6. Bottom up consultation
7. Consultation of administration and inspection
8. Consultation and information of the neighborhood
living near the company
9. Presence of a Safety Management System
10. Safety training
11. Team spirit and cooperation on the work floor
12. Respect of procedures
13. Sharing of good practices
14. Sharing of information on accidents and incidents
15. Comprehension and modeling of hazards
16. Presence of a degree of liberty in interpreting
procedures in function of the context
17. Analysis of root causes after an accident
18. Considering the variability of human performance
by analyzing the work conditions
19. A good technical knowledge of the system by the
stakeholders
20. Human error analysis
21. Give sense to work: organize debates on work
situations
22. Organize crisis and emergency exercises


according to what he or she thinks is the conviction
of his or her organization.

The fourth question (Table 5) concerns the knowledge
and competencies (skills or even capabilities)
mobilized by the professional and the competencies
that he or she would like to mobilize more. It
should be noted that knowledge and learning can
be characterized at six levels. These levels are based
on the well-known Bloom taxonomy (Murtonen,
Gruber, and Lehtinen 2017) and go from ‘simple’
remembering to ‘elaborated’ evaluating. However,
the levels of analyzing, synthesizing and evaluating
can be considered hierarchically equal.


• Remembering (knowledge)

• Understanding (comprehension, translating,
interpreting or extrapolating information)

• Applying (using principles or abstractions to
solve novel or real-life problems)


Table 5. Knowledge and competencies. 


Knowledge and competencies 


1. Knowledge of regulations
2. Knowledge of hazards
3. Knowledge of the technical system
4. Knowledge of accident models and theories
5. Knowledge of risk analysis methods
6. Knowledge of business management: finance, human
resources, organization, technologies, innovation...
7. Competency of applying risk analysis methods
8. Competency of “constructing” a risk analysis
methodology adapted to the problem
9. Competency of crisis and emergency management
10. Competency of situation/activity/accident analysis
11. Competency of translation of politics into real
actions
12. Competency of translation of accident statistics
into action plans
13. Competency of strategic planning
14. Competency of hierarchizing problems
15. Competency of looking for the right information
and learning
16. Competency of critical analysis
17. Competency to propose solutions
18. Competency of written and spoken communication
19. Competency of synthesizing and adapting of wording
to the public addressed
20. Competency of organizing personal work
21. Competency of listening
22. Competency of leadership
23. Competency of being exemplary
24. Competency of teamwork
25. Competency of working in a multicultural
organization
26. Competency of using new communication
technologies
27. Ethics




• Analyzing (breaking down complex information
or ideas into simpler parts to understand how
the parts relate or are organized)

• Synthesizing/creating (creation of something
that did not exist before)

• Evaluating (judging something against a given
standard)


A safety professional who is fully competent is
expected to operate at least at level three (applying
knowledge) for every knowledge category
(Pryor, Hale, and Hudson 2015).
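
The level-three threshold can be expressed as a simple check. The sketch below is only an illustration: the `BloomLevel` enumeration, the function names and the example profile are hypothetical, and the numeric order given to the three upper levels is a convenience, since the text treats them as hierarchically equal.

```python
from enum import IntEnum


class BloomLevel(IntEnum):
    """Six levels of the Bloom taxonomy as listed in the text."""
    REMEMBERING = 1
    UNDERSTANDING = 2
    APPLYING = 3
    # The three upper levels may be considered hierarchically equal;
    # the values 4..6 are only used to tell them apart here.
    ANALYZING = 4
    SYNTHESIZING = 5
    EVALUATING = 6


# Threshold quoted in the text: at least "applying" for every category.
MIN_LEVEL = BloomLevel.APPLYING


def knowledge_gaps(profile):
    """Return the knowledge categories rated below the threshold."""
    return [category for category, level in profile.items() if level < MIN_LEVEL]


def is_fully_competent(profile):
    """True when every knowledge category reaches at least level three."""
    return not knowledge_gaps(profile)


# Hypothetical self-assessment on some knowledge items of Table 5.
profile = {
    "Knowledge of regulations": BloomLevel.APPLYING,
    "Knowledge of hazards": BloomLevel.ANALYZING,
    "Knowledge of accident models and theories": BloomLevel.UNDERSTANDING,
    "Knowledge of risk analysis methods": BloomLevel.EVALUATING,
}

print(knowledge_gaps(profile))
print(is_fully_competent(profile))
```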

Skills can be categorized in three sections:
personal skills (example: verbal communication),
professional practice skills (example: problem solving
and critical thinking) and professional technical
skills (example: implementing tools to assess risk).
We chose to list knowledge and skills in Table 5
without differentiating either the type of skills
or the levels of Bloom’s taxonomy. The last item of
the list, ‘ethics’, is a personal value rather than a
skill. Those aspects will be studied in more depth in
interviews with selected people (see below).

The next section presents some first results and a
discussion of how to complete this methodology
(the questionnaire).


7 FIRST RESULTS, DISCUSSION 
AND PERSPECTIVES 


The questionnaire is ongoing and, at this date,
61 alumni of the 197 persons contacted have
responded and completed the questionnaire.
Completing the questionnaire (the majority of
questions are compulsory) requires at least 30
minutes and a great effort of reflection on
professional practices. Another 50 respondents
started but have not yet finalized the questionnaire.
The aim is to have responses from about 70% of the
alumni. Each alumnus will be contacted individually,
and several times if necessary.
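
As a worked check of these figures, the short sketch below computes the current completion rate and the number of completed questionnaires that a 70% target would imply. The variable names are illustrative, and rounding the target up to a whole respondent is an assumption.

```python
import math

contacted = 197            # alumni contacted
completed = 61             # completed questionnaires to date
started_not_finished = 50  # questionnaires started but not finalized
target_share = 0.70        # stated aim: about 70% of the alumni

completion_rate = completed / contacted
# Rate that would be reached if every started questionnaire were finalized.
potential_rate = (completed + started_not_finished) / contacted
# Completed questionnaires needed for the target, rounded up (assumption).
target_count = math.ceil(target_share * contacted)
still_needed = target_count - completed

print(f"completion rate: {completion_rate:.1%}")
print(f"rate if all started questionnaires are finalized: {potential_rate:.1%}")
print(f"target: {target_count} responses, {still_needed} still needed")
```

With the numbers given, roughly 31% of the contacted alumni have completed the questionnaire, and about 138 completed questionnaires would be needed to reach the 70% aim.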

Several biases are present in this study. We
discuss the most important ones. We find that the
youngest alumni were the first to complete the
questionnaire. The older ones have probably ‘lost’
their involvement with their former school and are
less motivated to respond. Also, several contact
addresses of the older alumni are obsolete. Work to
trace them on LinkedIn is ongoing. The Chinese
alumni are also difficult to contact. This could give
a slightly distorted image because we will only have
data from young professionals: the description of the
professional context of a junior safety professional.

The curriculum of the program has been modified
throughout its existence. Several important
modifications have been made: an important content
modification in 2010 and the introduction of
alternate training and apprenticeship in 2014. The
respondents will therefore evaluate a program that
was not the same for all the alumni. We hope that the
first results will give us insights into the
importance for professionalization of introducing
alternate training and apprenticeship with an
industrial partner in the curriculum.

A complementary methodology will allow us
to gather more data. The first results will enable us
to organize focus groups with alumni to discuss
several aspects or findings. A couple of monographs
will also be produced: a safety professional, his or
her superior and colleagues will be interviewed and
observed in their daily work. It could also be
interesting to interview line managers who are in
contact with the safety professional.

The findings of this work will enable us to:


— Have an idea of the careers of our alumni (com- 
pany, wages, function,...) 

— Have an idea of the professional context of our 
safety alumni (missions, job profile, organiza- 
tional culture...) 

— Have an idea of the knowledge and competen- 
cies they mobilize 

— Have an idea of the impact of the MRI program 
on their professionalism 


The perspectives of this work are of different
natures. On the one hand, we continue to implement
changes and improve the MRI program to prepare the
students for the “real life” context of the safety
professional’s job. This is done through new
pedagogical approaches: role plays, case studies,
close interactions with safety professionals, and
collaboration between safety scientists and safety
professionals on the same project. We discussed in a
former paper (Wybo and Van Wassenhove 2016) the
pedagogy used for the education of safety
professionals. The present study will provide data
for improving the pedagogical approaches of the
program.

In order to improve the program, we must question
the ways we evaluate it. The four-level training
evaluation model of Kirkpatrick (Gilibert and Gillet
2010) is widely known. We could think of integrating
a yearly survey of alumni and/or interviews and
observation of alumni in work conditions into a
general evaluation plan based on Kirkpatrick’s model.
Kirkpatrick proposes four levels of evaluation of
learning: reaction (satisfaction of learners),
learning (what knowledge do they acquire), behavior
(do they use the new knowledge in work situations)
and results (outcome for the organization). Paul
Swuste (Swuste and Sillem 2018) is interested in
the quality approach to evaluating a post-academic
course ‘management of safety, health and
environment’. Feedback from alumni alone is not
enough for a global evaluation of the program.

What is the goal of educating students in safety
curricula? Teaching them theoretical risk analysis
methods (HAZOP, FMECA, QRA...)? Preparing them for
the real-world problems encountered when applying
those approved methods? Maybe the role of an
education program is also to form “new”
professionals who import new (scientific) insights
into the profession. They are the bridge between
science and practice. Probably, it is all this and
even more.

Last but not least, in our understanding of
safety in organizations, the role of the safety
professional is very important. David Provan (Provan,
Dekker, and Rae 2017) did a literature review of
the factors shaping the role of a safety professional
(bureaucracy, influence and beliefs). The present
study will certainly contribute to identifying some
of these factors. Being aware of those factors will
surely enable future safety professionals to be more
effective in their job: bringing more safety to our
world. And this is exactly the fourth level of
Kirkpatrick’s evaluation model.


Reversing the trend through collaboration in the petroleum industry


K. Skarholt & G.M. Lamvik 
SINTEF Technology and Society, Trondheim, Norway 


ABSTRACT: The purpose of this paper is to discuss the importance of bipartite and tripartite
cooperation in the Norwegian petroleum industry and how it contributes to improving safety
conditions. Lack of trust between the parties and pressure on the Norwegian model receive great
attention in this industry today, related to major cost reductions and downsizing over the last
2-3 years. We discuss these challenges and how to rebuild trust between the parties. The empirical
material is mainly based on qualitative data from the ongoing four-year RISKOP research project
(Managing Risks in Offshore Operations). In addition, the data is based on a document study about
the safety conditions and collaboration in the petroleum industry.


1 INTRODUCTION


In the Norwegian oil and gas industry there is a
long tradition of employee participation at work,
both at an individual level and through formal
bipartite and tripartite collaboration, inviting
employees’ ideas and concerns about safety issues.
Tripartite collaboration involves the authorities,
employers’ associations and worker unions,
while bipartite collaboration is a local
collaboration between employer/management and worker
unions within a company (Levin et al., 2012).
Employee participation is about involvement
in work-related matters, with the intention of
having influence on working conditions. The term
employee voice is an expression of participation at
work, and is defined as an employee’s discretionary
communication of ideas, suggestions, concerns, or
opinions about work-related issues with the intent
to improve organizational functioning (Morrison,
2011:375). The contribution of employee voice is
to influence decisions and contribute to
improvements in a company. With regard to safety, it
is crucial that employees’ opinions and voices are
listened to; if not, it may lead to dangerous safety
conditions and accidents. However, the conditions
for employee voice are challenged in periods of
economic crisis (Farndale et al., 2011; Skarholt
et al., 2017). Economic crisis and downsizing have
led to a more fragile bipartite collaboration where
the parties and authorities experience a decrease in
trust. Increased economic pressure has led to
concerns about possible negative effects on safety
and the work environment. Figures from the latest
study of trends in risk level in Norway’s petroleum
activity (RNNP 2016) from the Petroleum Safety
Authority Norway (PSA) show that more employees
report higher work pressure and less influence on
HSE than in previous RNNP studies. This negative
development has led to campaigns and initiatives
from both the PSA and the Ministry of Labor
and Social Affairs in Norway.

Through the campaign “Reversing the trend”,
the PSA addresses what it sees as a worrying
development over the past years (PSA, 2016). How to
actually improve bipartite and tripartite
collaboration is one of the main issues in this
initiative. According to the PSA, good collaboration
between employers and employees has helped to boost
the level of safety in Norway’s petroleum industry.
This interaction now appears to be under pressure.
The PSA stresses the importance of employee
participation in handling safety matters, stating
that involvement of employees is a requirement in
all phases of the petroleum sector for every issue
which relates to safety. These rights and duties are
to be practiced both directly by each individual
worker and through union representatives and safety
delegates. Good collaboration between the parties
depends on mutual trust.

The Ministry of Labor and Social Affairs in
Norway appointed an HSE committee representing
authorities, employer organizations and employee
unions in the petroleum sector in 2016, to discuss
and consider the state and development of HSE
conditions in the Norwegian petroleum industry.
The background for this was a negative safety
trend in 2015 and 2016 with serious conditions and
safety challenges. The report from this work (HSE
committee – Ministry of Labor and Social Affairs,
2017) notes that there have been major change
processes with downsizing and restructuring that
may be a challenge to the established collaboration
between employers and employees. Indications
suggest that cooperation between the parties is
more fragile than earlier, although disagreement
prevails between the parties over how far this
collaboration has come under pressure. A main
conclusion in the 2017 report is the need to keep
and build mutual trust and respect for each other’s
roles and responsibilities, in order to maintain
and develop the safety level in the petroleum
industry.

The aim of this paper is to discuss the role
of bipartite and tripartite collaboration in
developing and improving safe operations in the
petroleum sector in Norway. We discuss how employee
participation at work influences safety conditions,
as a means of reducing the risk of injuries and
major accidents.


2 THEORETICAL BACKGROUND 


2.1 Collaboration, trust and safety 


The Norwegian oil industry, or rather the oil
industry on the Norwegian Continental Shelf, has been
characterized by its social agreements grounded in
the Norwegian model, producing alliances between
all core stakeholders and thereby a “we” including
the whole of the industry. The Norwegian model on
a local company level is about the collaboration
between management and union representatives. In
Norway, this cooperation has been characterized
by trustful relations, willingness to collaborate
for increased competitiveness, and a low level of
conflict at work (Levin et al., 2012). The voice of
employees has thus to a large extent been listened
to and has influenced decisions made in the company.

Improved safety is something that all parties
want for the industry. The Norwegian petroleum
sector has been characterized by trustful bipartite
and tripartite collaboration, and the participation
of employees has been important for improving safety
conditions. Important arenas for tripartite
collaboration are “Working Together for Safety”
(Samarbeid for Sikkerhet/SfS) and Safety Forum
(Sikkerhetsforum). The aim of their work is to
increase safety in the petroleum industry and
strengthen trust and cooperation among the actors of
the industry. These arenas are central for
cooperation among the parties in the industry and
the authorities as regards health, safety and
environment in the petroleum activities offshore
and onshore. Here, the unions, the employers’
organizations and the authorities have a significant
influence on the safety agenda in this sector.
One could say that this trust-based tripartite
collaboration is a key cultural value related to how
safety is maintained in the Norwegian petroleum
industry. Trustful bipartite collaboration about
safety matters has also been an important
cornerstone of the safety regime. Bipartite
collaboration is an integrated and critical part of
the regulatory regime for HSE in the Norwegian
petroleum sector (Rosness & Forseth, 2014).
Reporting dangerous safety situations and conditions
is easier in a bipartite collaboration, where there
can be an open relation between leader and
employees/unions. The economic crisis in this sector
has, however, put the collaboration between the
parties under pressure. Several actors indicate that
the Norwegian model with bipartite and tripartite
collaboration is under pressure (PSA, 2016; Ministry
of Labor and Social Affairs; Falkum et al., 2017).
Safety Forum was established in 2000, based on
distrust from unions towards the employer
association Norwegian Oil and Gas, which the unions
stated constantly failed to include the employees in
decisions concerning safety (Rosness & Forseth,
2014). Today we are seeing a development similar to
the situation in 2000, with a decrease of trust
between the parties.

Trust in organizations has been studied in
different ways to address positive outcomes for
organizational phenomena, such as a positive impact
on safety culture and safety performance (Burns et
al., 2006; Conchie et al., 2006; Reason, 1997). Trust
also improves communication, knowledge sharing,
commitment, and organizational learning (McEvily et
al., 2003; Nonaka and Takeuchi, 1995).

Research has shown that the cultural aspects of
work practice influence safety as much as technology
and formal organization structures (Antonsen,
2009; Guldenmund, 2000; Haukelid, 2008). The work
of Tharaldsen (2011), addressing differences in
safety climate and trust between the UK and
Norwegian sectors, also fits this picture.

In high-reliability organizations (HRO),
organizational culture/safety culture influences
safety (Weick and Sutcliffe, 2007). The key aspect of
building safety culture is the level of openness and
trust and access to information that may indicate
that safety is being compromised. Reason (1997)
argues that safety culture is based on an underlying
element of trust, and research shows that high levels
of trust in relationships contribute to high levels
of safety in high-risk enterprises (Conchie et al.,
2006). Their study of safety performance in the
offshore industry concluded that the impact of
trust and distrust on safety performance is
determined by the act of being trusted (or
distrusted).

Trustful relations and openness require the
existence of a variety of channels, both formal
and informal. Bipartite cooperation is a relation
between managers and union representatives (and
safety delegates) in a company. The Norwegian
Working Environment Act from 1977 regulates
formal participation at work by union employee
representatives, who have influence through
information and discussion meetings with the line
management. They thus take part in problem-solving
in different matters at work, such as improvement of
safety, change processes and similar.

A good bipartite collaboration demands good
leadership that listens to the ideas and concerns
of the employees. The dialogue and trust between
the parties will influence how the employees are
involved and have influence on their work and on
safety matters at work. The presence of union
representation at work contributes to increased
individual employee participation (Trygstad & Hagen,
2007). The relations between managers and union
representatives at work will influence employee
participation outside the formal bipartite
cooperation, where an involving leadership style
will strengthen open communication. To make
individual employees actually speak up about safety
concerns, leaders must invite openness and
involvement about safety among the employees. How
managers respond to employees’ opinions about safety
improvements will influence the motivation and
willingness to speak up. If they signal interest and
willingness to act on employee voice, the employees’
motivation to inform about safety concerns is
enhanced (Detert & Burris, 2007; Edmondson, 2003).
Detert and Burris (2007) found that management
openness and transformational leadership behaviour
are consistently positively related to voice. To
speak up involves sharing one’s ideas with someone
who has the power to devote organisational attention
or resources to the issue raised (French & Raven,
1959). Being able to speak up openly about work
environment and safety concerns at a workplace
without fear of being sanctioned/punished is a
prerequisite for reporting (Antonsen, 2009; Trygstad
et al., 2014). Mutual trust between managers and
employees also influences the extent to which
employees will tend to keep silent or use their
voice when safety concerns occur (Skarholt et al.,
2017).


3 METHODS 


To highlight these issues we have drawn upon
various sources. An important one is the project
RISKOP (Managing Risks in Offshore Operations).
This project has been run by Western Norway
University of Applied Sciences in Haugesund,
in collaboration with SINTEF and the University of
Stavanger, supported by the Norwegian Research
Council and nine companies in the industry. In
this project, we interviewed 16 managers from
five different shipping companies. These shipping
companies, which constitute an important part of
the petroleum cluster in Norway, operate
advanced vessels (e.g. supply vessels and anchor
handling vessels) in the offshore petroleum
industry, working for different oil and gas
companies on the Norwegian Continental Shelf. Also,
a broker, the Norwegian Maritime Directorate and
trade unions were interviewed on topics such as:
how the informants/shipping industry experience the
crisis; what they actually do to meet and handle the
situation; and how the situation may affect safe
operations offshore.

Besides the RISKOP project we have analysed
some documents covering certain aspects of
safety and collaboration at work. First, the
Petroleum Safety Authority (PSA) and its campaign
“Reversing the trend” (2016) have been an important
basis for this topic. Second, as an extension
of this campaign, we have made use of the report
“HSE in the petroleum industry” (HSE committee –
Ministry of Labor and Social Affairs, 2017)
to shed some light on the safety situation in the
petroleum sector. Third, the survey “Participation
Barometer” (Medbestemmelsesbarometeret)
(Falkum et al., 2017) asks: “Is participation in
Norwegian working life under pressure?” Its focus is
participation as a main element of leadership and
managerial practice. The survey is owned by six
trade unions, covering the private sector (including
the oil and gas industry) and the public sector in
Norway.


4 RESULTS 


We present 1) results from the interviews with
leaders in shipping companies, union leaders and
a broker, conducted in the RISKOP project, and
2) results from document analysis of the trend
and development of safety in the Norwegian
petroleum industry.


4.1 Strong bipartite cooperation to survive


The economic crisis in the petroleum industry has
led to fewer jobs in the offshore shipping industry,
where the competition for jobs is tough in a
marginal market. One of the companies we interviewed
had reduced its staff by 400 employees, and feared
further layoffs. Consequently, shipping companies
have removed a considerable portion of the offshore
fleet from the market. In December 2017, there were
134 vessels from the offshore fleet in layup, out of
a total fleet of around 550 vessels. Ordinary
Platform Supply Vessels (PSV) were the largest
category of these ships (61), while Anchor Handling
Tug Supply (AHTS) vessels numbered 42 and seismic
vessels 14.
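
The layup figures quoted for December 2017 (134 of around 550 vessels; 61 PSV, 42 AHTS, 14 seismic) can be put in proportion with a small calculation. This is only a worked illustration: the variable names are hypothetical, the fleet size is the approximate figure from the text, and the remainder is simply labelled as vessels of other, unnamed types.

```python
total_fleet = 550   # approximate size of the offshore fleet
in_layup = 134      # vessels in layup, December 2017

# Laid-up vessels by the categories named in the text.
layup_by_type = {"PSV": 61, "AHTS": 42, "Seismic": 14}

layup_share = in_layup / total_fleet
named_total = sum(layup_by_type.values())
# Laid-up vessels not covered by the three named categories.
other_types = in_layup - named_total

# Share of each named category among the laid-up vessels.
type_shares = {name: count / in_layup for name, count in layup_by_type.items()}

print(f"share of fleet in layup: {layup_share:.1%}")
for name, share in type_shares.items():
    print(f"{name}: {share:.1%} of laid-up vessels")
print(f"other vessel types in layup: {other_types}")
```

On these numbers, roughly a quarter of the fleet was out of the market, with PSVs accounting for a little under half of the laid-up vessels.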


This layup activity has resulted in downsizing
of personnel in oil companies, shipping companies
and subcontractors. Another challenge for this
industry is that there are fewer long-term contracts
than before the crisis. Today, most of the
contracts are in the spot market, meaning short-term
assignments with a maximum of one month,
especially operations done by PSVs (Platform Supply
Vessels) and anchor handling vessels. Leaders
from shipping companies we interviewed accepted
contracts at rates they knew to be too low, not even
covering basic running costs, solely to slow the
decline of the company. When the market is weak
and undergoing a crisis, as it is today, many
subcontractors are weakened and not in a
position to negotiate.

We find that one way to deal with this crisis
is actually to fight for the survival of the
workplace together, in a bipartite cooperation. As a
coping strategy in the offshore shipping industry,
collective local organizational arrangements have
been strengthened. Our material shows evidence of
strong and solidary constellations inside the
shipping companies, e.g. close collaboration between
managers and employees in finding new solutions
to handle the crisis in the industry. It seems that
the major challenges in the offshore sector have
made all the staff in the companies realize that
they have a common interest in collaborating and
finding shared solutions. They realized that this is
the time for dancing rather than boxing, to
paraphrase the famous book by Huzzard et al. (2005).

Many of the shipping companies we interviewed
pointed out that the unions and employees were
willing to find ways to save money in the company,
such as reducing salaries for a period. One of the
shipping companies in our study reduced wages by
29 percent among the crew from laid-up vessels, so
that it could keep more of its staff and the
competence they held. Cuts in company-specific
bonuses were another instrument to reduce costs.
Another example was a change in the shift system,
from four to five shifts, to be able to keep more of
the employees working. This allowed for an extra
shift, or crew, onboard the ships, giving extra
leisure time at home. In this way, the economic
crisis made the cooperation between the management
and the employees/union representatives really
strong. As one of the leaders from a shipping company
said: “New solutions to survive is established
because of the unions”. Problems and crises become
like an outer enemy that may build strong alliances
between employer and employees in an organization.
Labor relations become a positive force, building
trust and a good working environment. We see that
this is what happens in the shipping industry.
Another means of survival in the Norwegian shipping
market has been mergers of shipping companies in the
last years.


4.2 The Norwegian model under pressure


On the other hand, there are many indications that
the climate for collaboration between the parties
has become worse in recent years. Increased
economic pressure has led to more concerns about
possible negative effects on safety in the
petroleum industry in Norway. We have analysed
documents describing the development and
trend of safety in this industry. The analysis is
mainly based on these documents: Reversing the
trend (PSA, 2016), HSE in the petroleum industry
(HSE committee – Ministry of Labour and Social
Affairs, 2017) and The Participation Barometer
(Falkum et al., 2017).

Five decades after the Norwegian oil adventure
began, the petroleum sector faces important
safety challenges. Trends are moving in the wrong
direction in a number of areas (PSA, 2017). The
development over the past two years has involved
safety challenges and serious conditions: figures
from the 2016 study of trends in risk level
in Norway’s petroleum activity (RNNP) show
an increase in serious hydrocarbon leaks and well
control incidents. The major accident indicator is
evaluated to be at too high a level (PSA, 2016).
There is reason to believe that this situation is
affected by the economic crisis with restructuring
and downsizing. Changes and demands for greater
efficiency raise the level of conflict.

To get safety development back on the right
track, PSA has started a campaign, “Reversing
the trend”. PSA has made collaboration one of the
main issues in Reversing the trend: “Collaboration
between the various sides in the petroleum sector is
under greater pressure, both between companies and
unions and between them and the government. Such
bi- and tri-partite interaction occupies a key place
in Norwegian safety efforts.” PSA’s concern is that
weakened cooperation could mean a poorer
basis for important decisions by company
managements, and weaker entrenchment with employees
of important choices for the way forward. PSA
worries that weaker employee participation
may have negative consequences for safety
in the petroleum industry. Numerous examples
from major accidents in the petroleum industry
show that information that could have prevented
the accidents was available, but was either not
communicated or not acted upon. This indicates
a safety culture lacking the openness and trust
needed for reporting dangerous situations.
A key aspect of safety culture is the level of
openness and access to information that may indicate
that safety is compromised. PSA’s focus on bipartite
cooperation emphasizes the role industrial
relations play in safety, and places the
responsibility for improvements on the management
of oil and gas companies. PSA wants to remind the
management how safety can be taken care of and
improved through formal collaboration between
employer and employees. The voice of employees
in decision-making is under pressure, and so the
Norwegian model is under pressure.

A leader from one of the major trade unions in
the petroleum sector, whom we interviewed, stated
that he and the union did not currently recommend
that employees involve themselves in union
activities, since it has become very troublesome
for individual representatives to obtain leave for
trade union work. This opinion from
such a strong voice in the industry can be seen as
a barometer or an indication of a lack of trust or
lack of collaboration in the offshore industry.

The report from the HSE committee, Ministry
of Labor and Social Affairs (2017), “HSE in the
petroleum industry”, also emphasizes collaboration
between the parties as a way to enhance safety
conditions in the petroleum industry: “Bipartite and tripartite
collaboration is an important cornerstone of the
safety regime, and must be strengthened and further
developed.” Participation and influence among
unions/employees regarding safety in this industry
have been given high priority for many years, and
have had a positive influence on safety results
offshore and onshore. The Working Environment Act
defines the rights of employees to speak up and
participate at work. There are both formal and informal
arenas for cooperation between the employees’
organizations and the employer organization,
with collective agreements regulating the bipartite
cooperation. The work group, with participants
from employer and employee organizations,
recommends strengthening the collaboration between
the parties in the future, but its members disagree
about how far the collaboration between the parties
has come under pressure. They have different
experiences and interpretations of whether the
cooperation and the Norwegian model are under
pressure or not. Further, they conclude that the
level of HSE and the working environment in the
Norwegian petroleum sector is high. At the same
time, safety challenges and serious conditions have
arisen in the past few years.

The aim of the “Participation Barometer” (Falkum
et al., 2017) is to analyze the development of how
employees experience their influence on work,
and their perceptions of control/management,
organization and leadership at work. This survey
is conducted annually. According to Falkum et al.
(2017), “Employee participation and trustful
relations influence on company development. It serves
both employees and the company’s interests at the
same time”. The literature on leader performance/
practice distinguishes between management and
leadership: management means to lead through
systems and routines, while leadership means to
lead through dialogue and hands-on relations with
employees (Ladegård & Vabo, 2010). The study
measures whether participation at work (the
Norwegian model) is under pressure or not, and shows
the development over the years. The analysis
emphasizes the relations between leaders, union
representatives and employees. The hypothesis is that
leadership practice affects the relations and
cooperation between leaders, union representatives
and employees to a large extent. The sample
comprised 3053 respondents in total, from the private
sector and the public sector (county
council/municipality and state). According to the
results, 42 percent of the respondents
answer that Norwegian working life is becoming
more authoritarian, while 12 percent report a
more democratic development. 28 percent of the
respondents answered no change (status quo). In
the oil and gas industry, 56 percent of the
respondents answer that working life is becoming more
authoritarian, meaning reduced participation at
work and a high degree of control (management)
and loyalty. The analysis of the survey results
is based on a model of the characteristics of
ideal management and leadership
performance/categories, and of how management and
leadership influence trust, restructuring,
professional integrity and conflict handling in
Norwegian working life in 2017.

The conclusion from this survey is that
participation at work is the most widespread form of
leadership in Norway, compared to
standardization/management, even though many experience
that working life has become more authoritarian
than before.


5 DISCUSSION 
5.1 Reversing the trend—through trustful 
cooperation 


Based on our findings, collaboration in the
Norwegian petroleum industry has been both weakened
and strengthened during the economic crisis. We
discuss the relationship between safety,
collaboration, trust and leadership in the petroleum
industry. How can collaboration be improved and
trustful relations rebuilt, so that the safety level
in this industry is developed collectively? On the
one hand, we find that the Norwegian model is under
pressure, that bipartite collaboration needs to be
improved, and that the employee voice needs to be
heard. On the other hand, we find examples of
strengthened bipartite collaboration in the shipping
companies we have studied, in their struggle for
survival. Traditionally, the Norwegian petroleum
industry has been known for its high degree of safety
and trust, in both bipartite and tripartite
collaboration. However, the collaboration between
the parties has come under greater pressure, both
between companies and unions and between them
and the government. Given the negative trend,
with safety challenges and serious
conditions in the petroleum industry over the past
years, safety has been a “hot topic” among
the authorities, oil and gas companies, employer
and employee organizations, and researchers.

The Ministry of Labor and Social Affairs and
the authorities (PSA) have put effort into enhancing
safety conditions at plants and installations,
onshore and offshore, through collaboration.
Strengthening collaboration between the parties, in
both bipartite and tripartite cooperation, has been
one major goal and priority. Through the campaign,
PSA has addressed cooperation, stressing the
responsibility held by leaders in the oil and gas
companies to involve employees more actively in
problem-solving about safety matters at work. PSA is
worried that the loss of employees’ opinions in
problem-solving may lead to poorer safety conditions.

The voice of employees is important for building
a safety culture characterized by open
communication, where one can speak up about concerns
and ideas for improvement. Cost reductions and
downsizing in the petroleum industry may have
influenced the organizational culture in a negative
way, with less openness and increased fear of
sanctions in response to reporting. One of the
union leaders we interviewed said that he would not
recommend anyone to take a role as a union
representative or safety delegate today because of the
pressure on employees holding such positions in a
company. He experienced that the free voice of
union representatives was not appreciated and
listened to as it used to be.

How leaders respond to employees’ voice and
how they deal with concerns will affect
problem-solving about safety matters. If the leaders signal
willingness to act on employee voice, the
employees’ motivation to speak up is enhanced (Detert
and Burris, 2007; Edmondson, 2003). The relation
between worker and manager will thus affect
the degree of employee voice and the employees’
participation in and influence on work and the
development of work. With regard to safety, trust is a
key factor in getting safety issues on the agenda and
in informing the management about what those at the
sharp end of the organization experience and know
about how to run the operations safely. Trust opens
up for good communication, commitment and sharing of
information and knowledge (McEvily et al., 2003).
The workers are close to the operations and everyday
production, with hands-on experience and
knowledge of dangerous situations and possible
incidents. Trust has a positive impact on safety
culture and safety performance in high-risk
organizations (Reason, 1997; Conchie et al., 2006).

It is assumed that trust in organizations is
beneficial for safety (e.g. it promotes open
communication) and that distrust is detrimental (e.g. it
leads to failed safety initiatives) (Conchie &
Donald, 2007). What may
be the consequences of distrust in tripartite and
bipartite collaboration related to safety?
According to Falkum et al. (2017), Norwegian working
life is becoming more authoritarian, and the oil
industry is moving in that direction according to the
survey on participation at work. An authoritarian
leadership style means less involvement and
participation at work among the employees. Their voice
and opinions will therefore not be listened to in the
same way as in a workplace characterized by good
formal bipartite collaboration and by the possibility
for employees individually to bring their concerns
to their line manager (closest leader), trusting
that it is safe to speak up without fear of sanctions.

Leadership practices affect the relations between
managers, union representatives and employees
to a large extent. Models of leadership practice
that invite participation at work lead to greater
agreement in matters of restructuring
processes and to enhanced commitment to
company strategies and values in the
organization (Falkum et al., 2017). They find that trust
decreases with a management style (control), while
trust increases with participation at work and a
trustful relation with the nearest leader. With regard
to safety, an authoritarian management style may lead
to poorer safety because of the problems of
communication not built on trust. The problems
associated with distrust or lack of trust are failed
safety initiatives and an absence of shared inter-group
safety perceptions (Clarke, 1999). Reason (1997)
argues that trust promotes open communication
about safety (reporting culture), which enhances
organizational learning about accidents (informed
culture). The main problem associated with
under-reporting (or biased reporting) is the reduction
in organizational learning and in the development of
informed strategies to improve safety.

There are, however, bright spots regarding
collaboration in the petroleum industry. The
findings from the RISKOP project show how the
collaboration between management and
employees/union representatives in the offshore shipping
industry has strengthened during the economic
crisis. When human societies face an obstacle or
an external enemy, they seem to seek collaboration
and alternative solutions. One fruitful example of
such a mechanism is described by Evans-Pritchard
when he discusses the political system and decision
making among the Nuer people in Sudan
(Evans-Pritchard, 1940). The core idea in this book is that
this community consists of “a system of segmentary
opposition”, which illustrates that local groups
and communities can be united when there is a conflict
at a higher level in the society. Nuer society
holds a potential for alliances and fissions, or as
one informant told the anthropologist: “We fight
against the Rengyan, but when either of us is
fighting a third party we combine with them”
(Evans-Pritchard 1940: 143).

Transferred to the Norwegian offshore sector, it
makes sense that local shipping companies see a
shared value in finding common solutions. When
they face a crisis in the sector, as they do at
the moment, they adopt new and unusual internal
arrangements. Both managers and employees will
seek dancing rather than conflict, to survive as a
company. So once again the metaphors “boxing
and dancing” are useful to illustrate the strategic
choices that the staff of certain companies have in
their daily work during extraordinary times
(Huzzard et al., 2004).


6 CONCLUSION 


In this paper we have discussed the relation between
cooperation, trust, and safety in the petroleum
industry. The authorities (PSA), employee unions
and employer organizations all want to improve
safety in the oil industry. The economic crisis, with
downsizing and cost cutting, seems however to have
changed the cooperation climate between the
parties. Union representatives and employees
experience that it is more difficult to speak up about
and report dangerous situations that may lead to
accidents and unwanted incidents. This may lead to a
safety culture that does not respond to dangerous
situations at installations/plants offshore and
onshore. We argue
that the voice of employees must be listened to
and taken notice of as an important instrument for
improving safety.

On the other hand, in the RISKOP research
project, studying safety in offshore operations in the
shipping industry, we found examples of bipartite
cooperation actually being strengthened. In the
struggle for existence in a marginal market, there have
been collective initiatives between management and
employee unions fighting for the company’s survival
and to keep as many jobs as possible. In this way, the
employees have gone a long way to find solutions,
such as periods of reduced pay, abolished bonuses,
and similar means of saving. Local alliances have
become stronger in order to fight the economic crisis.


REFERENCES 


Antonsen, S. (2009). Safety culture: theory, method and 
improvement, Farnham, Ashgate. 


Burns, C., Mearns, K. & McGeorge, P. (2006). Explicit 
and implicit trust within safety culture. Risk Analysis 
26 (5), 1139-1150. 

Clarke, S. (1999). Perceptions of organizational safety: 
Implications for the development of safety culture. 
Journal of Organizational Behaviour, 20, 185-198. 

Conchie, S.M., Donald, I.J. & Taylor, P.J. (2006). Trust:
Missing Piece(s) in the Safety Puzzle. Risk Analysis,
26 (5).

Detert, J.R. & Burris, E.R. (2007). Leadership behaviour
and employee voice: Is the door really open? The
Academy of Management Journal, 50(4), 869-884.

Edmondson, A.C. (2003). Speaking up in the operating 
room: How team leaders promote learning in interdis- 
ciplinary action teams. Journal of Management Stud- 
ies, 40, 1419-1452. 

Evans-Pritchard, E.E. (1940). The Nuer, Oxford:
Clarendon.

Falkum, E., Nordrik, B., Drange, I. & Wathne, C.T.
(2017). Participation barometer 2017: Change in
working life relations.

Farndale, E., van Ruiten, J., Kelliher, C. & Hope-Hailey,
V. (2011). The influence of perceived employee voice
on organizational commitment: An exchange
perspective. Human Resource Management 50: 113-129.

French, J.R.P. & Raven, B. (1959). The bases of social
power. In D.P. Cartwright (Ed.). Studies in social
power, 150-167. Ann Arbor: University of Michigan.

Guldenmund, F.W. (2000). The Nature of Safety
Culture: A Review of Theory and Research. Safety
Science, 34, 1-14.

Haukelid, K. (2008). Theories of (Safety) Culture
Revisited: An Anthropological Approach. Safety Science,
46:3, 413-426.

Huzzard, T., Gregory, D. & Scott, R. (eds.) (2004).
Boxing or Dancing? Houndmills, Hampshire, UK:
Palgrave Macmillan.

Ladegård, G. & Vabo, S.I. (2010). Ledelse og styring.
Fagbokforlaget, Bergen.

Levin, M. et al. (2012). Demokrati i arbeidslivet: Den
norske samarbeidsmodellen som konkurransefortrinn,
Fagbokforlaget.

McEvily, B., Perrone, V. & Zaheer, A. (2003). Trust as an
Organizing Principle. Organization Science 14(1):
99-103.

Medbestemmelsesbarometeret 2017:
Arbeidslivsrelasjoner i endring. FoU-resultat 2017:05,
Arbeidsforskningsinstituttet ved Høgskolen i Oslo og Akershus.

Ministry of Labor and Social Affairs (2017). HSE in the
petroleum industry (Helse, arbeidsmiljø og sikkerhet i
petroleumsvirksomheten). Report 09/17.

Nonaka, I. & Takeuchi, H. (1995). The Knowledge
Creating Company. Oxford University Press, New York.

Petroleum Safety Authorities (PSA) (2016). Reversing
the trend.

Petroleum Safety Authorities (PSA) (2016). Trends in
Risk Level in Norway’s Petroleum Activity (RNNP).

Reason, J. (1997). Managing the Risks of Organizational
Accidents. Aldershot: Ashgate.

Rosness, R. & Forseth, U. (2014). Boxing and dancing:
Tripartite collaboration as an integral part of a
regulatory regime. In P.H. Lindøe, M. Baram, O. Renn, Eds.
Risk Governance of Offshore Oil and Gas Operations,
New York: Cambridge University Press.


Skarholt, K., Lamvik, G., Antonsen, S., Royrvik, R. & 
Jonassen, J.R. (2017). Crisis in the Norwegian petro- 
leum industry: How does it affect safety conditions 
offshore? Risk, Reliability and Safety: Innovating The- 
ory and Practice —Walls, Revie & Bedford (Eds), pp. 
306, London: Taylor & Francis Group. 

Tharaldsen, J.E. (2011). In Safety We Trust—Sikkerhet, 
risiko og tillit i offshore petroleumsindustri. PhD. 
Universitetet i Stavanger. 


Trygstad, S.C., I.M. Hagen (2007). Ledere i den norske 
modellen. Oslo: Fafo. Faforapport 24. 

Trygstad, S.C., Skivenes, M., Steen, J.H. & Ødegård, A.M. 
(2014). Evaluering av varslerbestemmelsene. Fafo- 
rapport 2014:05. 

Weick, K.E. & Sutcliffe, K.M. (2007). Managing the
Unexpected: Resilient performance in an age of
uncertainty. Jossey-Bass, San Francisco.


Safety and Reliability - Safe Societies in a Changing World — Haugen et al. (Eds) 
© 2018 Taylor & Francis Group, London, ISBN 978-0-8153-8682-7 


Production and protection. Seafarers’ handling of pressure in gemeinschaft and gesellschaft


K.V. Storkersen 
NTNU Social Research, Norway 


A. Laiou 


National Technical University of Athens, Greece 


T.O. Nevestad 


Norwegian Centre for Transport Economics, Norway 


G. Yannis 


National Technical University of Athens, Greece 


ABSTRACT: Seafarers experience conflicting objectives of production and protection in most
operations. This study explores how seafarers deal with such pressures, through an analysis of interview data
from 20 seafarers working on Norwegian- and Greek-controlled coastal and international passenger and
cargo vessels of different sizes. Despite the various contexts, the results show similar conflicting objectives
and pressure handling. The pressure is experienced differently, however, due to diverse organizational
relations. Seafarers on the large vessels in large companies describe business-like relations (gesellschaft) and
direct efficiency pressures from superiors. Seafarers on the smaller vessels in small companies, in contrast,
report close relations (gemeinschaft), devotion to the company and thus an internal wish to be efficient.


1 INTRODUCTION 

Seafarers, like personnel in other industries, have in
recent decades experienced increasing work
pressure. Fewer persons are to complete more tasks in
less time (Osterman and Hult, 2016), with shorter
sea passages and rapid turnarounds (Hetherington
et al., 2006), without added resources (Lappalainen,
2016).

On top of the work pressure, seafarers need to
take care of safety for themselves, the crew, vessel
and cargo. Safety can be understood as the
presence of organizational conditions that allow
operations to be carried out without accidents or harm,
in the short and long run (Kongsvik, 2013).

Conflicting goals of protection and production 
are present in all organizations (Reason, 1997). 
Production includes costs, work, and time pressure 
for the personnel. Protection is about making sure 
no one is harmed by production, and is related to 
competence, procedures, and material and imma- 
terial support. Protective measures can also be 
viewed as pressure. 

In this paper, interviews with seafarers are
analyzed to explore how they deal with pressures of
production and protection. The seafarers work
on Norwegian- and Greek-controlled coastal and
international passenger and cargo vessels. We find
that the seafarers across contexts handle similar
pressures similarly. The main difference involves
how the pressure is experienced, and this seems to
be defined by whether organizational relations are
close (gemeinschaft, mainly on the small vessels,
which are usually owned by small companies) or
business-like (gesellschaft, mainly on the large
vessels owned by large companies).


2 LITERATURE 


2.1 Research about conflicting conditions

Operations are influenced by organizational
structure and management, regulation and
policymaking (Reason, 1997). Within this context,
operational personnel constantly face conflicting
goals.

Managers often value short-term financial
criteria over safety, giving conflicting goals of
production versus protection, or efficiency versus safety
(Rasmussen, 1997, Reason, 1997). Production will
generally be prioritized, since “production
creates the resources that make protection possible”
(Reason, 1997). Vaughan (1997) shows how
personnel often want thorough rule-complying
operations, but that cost and time pressures slowly
drive work practice away from the original
quality-assured routines. Hollnagel (2009) describes an
efficiency/thoroughness trade-off principle.
Managers want efficiency, but if the personnel work
quickly instead of thoroughly, lower safety might
be a result, which paradoxically is not efficient.
Likelihood “of failures grow[s] when produc- 
tion pressures do not allow sufficient time—and 
effort—to develop and maintain the precautions 
that normally keep failure at bay” (Hollnagel, 
2009). This efficiency paradox is also noted for sea- 
farers by Fenstad et al. (2016). Seafarers are known 
to be efficient to help their company remain in 
business (Sampson et al., 2014). Personal injuries, 
violations and risk acceptance on board are related 
to work pressure and poor organizational safety 
culture (Nevestad et al., 2017). Crews’ immaterial
conditions, like time, concentration and
competence, largely influence how safely they can work
(Storkersen, 2017). Critical conditions are minimal
resources, fast pace and an accompanying lack of
discretionary space, while regulation can moderate
such pressures (ibid.). Ferry personnel have several
strategies for meeting schedules rather than
complying with rules:


The ability to keep the schedule and not canceling 
a departure, are associated with high competent 
navigators. Being delayed, or even worse, canceling 
a departure, may damage the navigator’s reputation 
(Aalberg and Bye, 2017). 


Still, operational personnel are expected to
comply, even though, for example, Hale and Swuste
(1998) and Bieder and Bourrier (2013) emphasize
that safety is not assured by blind rule-following.
Compliance with bad rules that do not fit the
real-world situation can lead to accidents (Reason,
1997). Safe work vitally depends on personnel’s
skills and experience (Dekker, 2017), for example
about which rules should be avoided. Formal rules
are not viewed as a positive contribution to the
traditional professional values of seafarers: “Good
seamanship” belongs to a seafarer with practical
and social abilities who maintains safe practices
with professional judgment, without being told
what to do (Antonsen, 2009a, Knudsen, 2009).
Since rules usually define operations that
everyone knows, vessel operations are instead performed
using experience (Bhattacharya, 2012, Aalberg and
Bye, 2017). Many companies implement safety
management systems that are not tailored to
specific vessels and activities. This makes procedures
too numerous, detailed, and distanced from actual
operations (Lappalainen, 2016, Bhattacharya,
2012). For some situations there is more than
one procedure, or too few crewmembers to comply
(Aalberg and Bye, 2017, Storkersen and Johansen,
2014). Seafarers are also required to perform
documentation “essentially outside their primary
functions of ensuring safe and efficient sailing” (Silos
et al., 2012). This can lead to stress and exhaustion,
particularly because it is viewed as unnecessary
and disproportionate (Osterman and Hult, 2016).


2.1.1 Research about different types of relations 

The early sociologist Ferdinand Tönnies
(1855-1936) characterized relationships in
different societies, in ways applicable to maritime
companies. Gemeinschaft, on the one hand, means close,
personal relations with shared language, norms and
values, based on feelings, habits and consciousness
(Falk, 2000). Gesellschaft, on the other hand, defines
impersonal business-like relations characterized by
strategic decisions and exchange of means (ibid.).

Relations have also been a topic in safety 
research. Subordinate levels depend on trust and 
support from upper levels to be able to do their 
work safely. This can include care and concern 
(Jeffcott et al., 2006), personnel, equipment, lead- 
ership, time, rest, etc. 

Vessel operations rely on a balanced relation- 
ship between shore management and crews (Xue 
et al., 2015), with effective communication (Bhat- 
tacharya, 2009) and a management that is commit- 
ted to safety (Lappalainen, 2016). The safety level 
on each vessel depends on safety prioritization 
on board the vessel, in combination with seafar- 
ers’ interactions with ship owners and regulators 
(Fenstad et al., 2016). 

Most maritime studies report a lack of 
trust and communication inside organizations 
(Bhattacharya, 2009). The conclusion of Bhatta- 
charya’s (2009) double case study of vessels and 
ship owners from several countries is that man- 
agers and seafarers had fundamentally different 
understandings. Seafarers wanted to communicate 
as little as possible with shore-based management. 
Distant managers’ top-down instructions about 
compliance bureaucratized the entire system. The 
personnel were offered only low-discretion roles, 
due to a lack of trust by managers. This is mainly 
what Oltedal (2011) found on Norwegian-owned 
tankers, leading her to urge managers to trust their 
highly skilled seafarers to adjust safety manage- 
ment systems. Employer engagement correlates 
with safety levels on vessels (Bhattacharya, 2012). 
Top management in poorly performing shipping
companies has been found not to be committed
to safety issues (Lappalainen, 2016).

Seafarers on short contracts are seen as
particularly vulnerable, as they are in an asymmetrical
relationship with their employers, which prevents
them from speaking up for their labor rights
(Bhattacharya, 2009, Lappalainen, 2016).
Seafarers on long contracts are reluctant to offend their
managers since that can jeopardize their future
plans and lives on the vessel (Xue et al., 2016).
The dangers of a non-functioning relationship are
described by Antonsen (2009a):


... asymmetrical power relations seem to influence
on the decisions regarding when working conditions 
are to be considered safe enough. ... The role of such 
asymmetries in safety-critical decisions should not 
be underestimated. 


Two companies studied by Xue et al. (2015) 
aimed to balance decision-making involvement 
but met with limited success. Interviews with managers
showed little tension between shore and vessels, but 
the personnel on four vessels had contrasting views. 
The seafarers had to follow management instruc- 
tions, even though it compromised their decision 
making and even their safety. They felt obliged to 
maintain hectic sailing schedules and to accept pro- 
longed working hours despite experiencing fatigue. 
The crews did not complain to management, as 
they saw that as useless, but sometimes they made 
decisions against management’s wishes. Their con- 
tribution to safety management was weak overall. 
These conflicts of interest between management
and vessel staff worsened safety practices on board. 


3 METHOD 


The data material consists of 18 qualitative in-depth 
research interviews with 20 seafarers from a range 
of ship-owning companies. The interviews were 
conducted in Greece and Norway (see Table 1). 

The interviews give perspectives from different 
parts of maritime transport. They targeted seafarers
on passenger and cargo ships, with coastal and
international activity around Norway and Greece. 
These cargo vessels transport different types of 
gas, dry bulk cargo, general cargo, fodder for fish 
farms, or live fish. 

The Greek material includes personnel from the
passenger and cargo sectors, where the passenger
vessels are Greek registered and operate in Greece,
while the cargo vessels operate internationally, are
registered both in Greece and in countries with
laxer regulation (called Flags of Convenience), and
thus have mostly foreign crewmembers. All of these
vessels are rather large, usually have crews of some
size (10-40 persons) and are owned by companies
with many vessels.

In the Norwegian data material, however, there
are mainly small vessels transporting cargo along the
Norwegian coast. The vessels have Norwegian
owners; some are registered in Norway and
carry only Norwegian personnel, while others have a
Flag of Convenience, a Norwegian captain and
often an Asian or Eastern European crew.

Table 1. Information about the data material.

Interviewees
    Greece: 10
    Norway: 10

Interviews
    Greece: 10
    Norway: 8

Background of interviewees
    Greece: Crew members with professional experience
        between 3 and 30 years
    Norway: Ship officers and educated navigators. Eight
        work as captains or mates on cargo vessels. One
        works in management. One is partly captain and
        partly ship-owner (common in Norwegian coastal
        cargo)

Nationality of interviewees
    Greece: 8 Greek, 2 Turkish
    Norway: 9 Norwegian and 1 Latvian officer

Gender
    Greece: 9 men, 1 woman
    Norway: 10 men

Contract length
    Greece: 4-7 month contracts, on the ship all the time.
        Unemployed after, but usually new contract and back
        on the ship after a month
    Norway: Norwegians: Permanent contracts, working
        4 weeks and staying at home 4 weeks. Foreigners
        in the crews: Often 4-8 month contracts

Watch schedules
    Greece: Cargo: Two shifts, commonly 4-4 (but in
        practice flexible). Passenger: One shift, and sleep
        at night.
    Norway: One or two shifts. Two shifts commonly have a
        6-6 watchkeeping schedule, but in practice flexible.

Size of crews
    Greece: 10-40
    Norway: 6-15

Vessel type
    Greece: Passenger ships with national routes
        (5 vessels) and cargo tankers with international
        activity (5 vessels)
    Norway: Cargo (7 vessels), mainly with coastal
        activity, some international activity

Registration of vessels
    Greece: Greek and Flags of Convenience
    Norway: Norwegian and Flags of Convenience

Data gathered
    Greece: Spring 2017
    Norway: Spring 2017

The seafarers volunteered to be interviewed after
the researchers had sent information about the project,
through the companies, to all their seafarers. Further
studies should work to include more voices
from groups such as ratings and machine chiefs.

We conducted eight semi-structured research
interviews of 1-2 hours. The interviews were based
on an interview guide constructed to explore safety
culture and its relations to organizational and
societal aspects. The subjects asked about included
conditions for work and rest (manning,
watch-keeping, tasks, etc.), and perceptions of safety,
leadership, team culture, safety management,
safety regulation, and organizational and national
values. In Greece, all interviews were face-to-face
on board vessels. In Norway, six interviews were
by phone, with one researcher talking to one ship
officer. The other interviews were conducted on
vessels, each with one researcher talking to two ship
officers. One of these interviews was recorded and
transcribed verbatim. For all interviews, detailed
and anonymized research notes were written.
Categorization and pattern analysis were performed
manually. The quotes in Section 4 are direct
quotations from the interviews.

This study is not a comparison of the Greek 
and Norwegian maritime industry, since there are 
many groups and characteristics within the data 
collected in Greece and in Norway. It is a part of 
the SafeCulture project, funded by the Research 
Council of Norway, and undertaken by the Insti- 
tute of Transport Economics (Norway), NTNU 
Social Research (Norway) and the National Tech- 
nical University of Athens (Greece). The project’s 
survey results show how work pressure and organi- 
zational safety culture are related to work, which is 
related to personal injuries (Nevestad et al., 2017, 
Nevestad et al., forthcoming). 


4 RESULTS 


Seafarers from many different groups were
interviewed, and they describe many common features
in how they deal with pressures of production and
protection. Some conditions are specific to certain
groups. The differences are most evident between
seafarers on large and small vessels, since the size
of the vessel is connected to the size of the company
and its activity, and to closer or more distant
organizational relationships.


4.1 Protection: Competence of the crew 


Most seafarers are aware that they are responsible
for the safety of their shipmates, the vessel and the
cargo. Many have great knowledge of and interest
in company procedures, and national and
international regulation and policymaking. They know
their job by heart, and have various opinions on
the large changes derived from the implementation
of electronic devices and equipment.

The seafarers say that they always do their tasks
as safely as possible while also being
efficient. Most of the interviews indicate a
pressure to go through with risky operations and to
work while tired. Handling contradictory goals
is talked about as a key characteristic of a good
seafarer, but how much of the decision-making
is left to the seafarer differs.

On the large vessels, an efficiency pressure is
sometimes stated directly to the seafarers by
company management. This is described
especially in interviews from the international and large
vessels of large companies. It is not uncommon
that officers order seafarers to work faster or
under other conditions that they find dangerous,
or that onshore management orders navigators to
take shortcuts to arrive in port on time. (Of course,
many international seafarers say their company
respects seafarers’ judgement and does not force them
to hurry up or push the ship to its limits.)

On most of the smaller vessels, however, the
judgement or handling of conflicting goals is up
to the seafarers. It is underlined as an internal
criterion of being a good seafarer and employee. The
pressure does not come from management, but from within
each seafarer. They take responsibility for keeping
their company in business, and thus indirectly for
keeping their own jobs. The coastal seafarers agree
that some operations cannot be accomplished, but
their doubts and perception of pressure vary.
Navigators on the small vessels have much
decision-making power, and emphasize how they make their
considerations and handle the pressure.


Sometimes you feel it. Maybe when you're approach- 
ing the quay, “will this work or not”, but usually it 
works okay. You have to use your common sense, 
and know your limitations. You can lie at sea until 
the conditions are better. Even if someone stands at 
shore and waits, they just have to wait. But you do 
feel it. But in the end, you don’t care, even though you 
think about it afterwards. Captain, small bulk vessel 


4.2 Production pressure: Costs 


Maritime transport companies are in competition
with other types of transportation, with each other
and with vessels of different registration and
conditions. Consequently, buyers focus on price rates.


It’s awful, just prices. It’s nothing to ask about, just 
price and price and price. They don't look at what’s 
in the dock, just as long as it floats it’s okay.
Captain, small bulk vessel


Both small and large companies need to save
costs, and the results are low manning, limited
potential to buy new equipment, small time margins in
routes or port calls, and so on. The seafarers on all
vessels want quality in spare parts and other
technology, for the sake of safety, but usually they need
to cut costs.


What can I do with a Chinese spare part? I don't 
trust it but it’s cheap. An employee behind a desk 
can't understand the difficulty or the danger. Engi- 
neer, large cargo vessel 


On small vessels, many seafarers see it as their 
responsibility to handle the economic production 
pressure. Mostly production can be performed as 
planned, but sometimes there is doubt whether or 
not one should start or continue an operation, for 
example because of bad weather. If they do not 
go through with operations they will miss out on 
essential earnings. In such situations seafarers 
themselves can make cost saving or profit their 
decision-criteria. 


Yes, we can feel pressured to work. [...] There are 
situations where we wouldn't have approached in that 
weather, but when we're already there we continue 
the operation. No one wants to make the decision to 
abort. It costs a lot to run this vessel. Mate, coastal 
live fish carrier 


Some conditions are truly different on the large
vessels compared to the small vessels. There seem to
be more cost-saving, more pressure from
management, more sanctions, less discretionary space and
fewer labor rights. Two interviewees mention that
their equipment is of such poor quality that they have
to buy new equipment at their own expense. If
they do not buy new equipment, they are not able
to comply with procedures. They cannot risk being
reported to the company for ignoring procedures,
as this will affect their future in this job. Another
seafarer tells about one time he got ill and did not
get sufficient treatment, but he would not press
charges against the company because that could spoil his
reputation so that he could never work on a ship again.


4.3 Production pressure: Time 


Seafarers experience a pressure to work fast, some- 
times under risky circumstances. Our interviewees 
especially feel the time pressure in situations related 
to port calls. They describe narrow time margins 
in all schedules, and too much work to keep the 
schedule. Vessels in large ports can be delayed by 
port authorities or logistics even if they get the 
work done in time themselves. 

Time is a reason why seafarers experience a 
pressure to go through with operations that should 
have been stopped.


We take short cuts; we don’t have manning to get
everything formally right. Captain, small general 
cargo vessel 


Time pressure is common to every interviewee,
but it varies where the pressure is perceived to
come from. On a large vessel, an engineer
mentioned that he felt terrible when he was given only a few
hours to repair serious damage on board.
On a small vessel, the seafarer with engineering
tasks would typically not be given a deadline for
repairing the damage; they rather describe an
internal pressure or a wish to repair the damage before
planned departure from port. Time pressure limits
the seafarers’ possibility to rest and work safely.


4.4 Production pressure: Much work, less sleep 


In addition to the production pressures of costs 
and time, seafarers experience a pressure of addi- 
tional work and tasks. 

On some large vessels, there is usually more than
one shift on board, which is not common on smaller
vessels. Deck personnel mostly rest during sailing,
or in some rare periods of long inactivity. For both
types, port calls prolong watch-keeping hours and
give no potential for rest until the vessel is back in
clear waters (or anchored or docked). This results in
limited discretionary space for all seafarers.


You’ve chosen an occupation and it’s been like this 
since I started at sea. Since I started as deck boy. 
Everyone had to chip in when we loaded, and we 
could relax when the ship was at sea. It’s a culture 
that It’s not possible to change a culture that’s been 
there forever. When the load’s ready: «Oh, no, I have 
to sleep ten hours, I can’t work», right. I wont make 
money and the company won't make money. Then I'd 
have to quit. I'd have to get home and stay on wel- 
fare, that’s next. Captain, small bulk vessel 


Some vessels, both large and small, have sailing
schedules with frequent port calls and short sail-
ings. An engineer on a large vessel told us that he 
was continuously on duty for a long time because 
the ferry docked in many ports and there was no 
shift replacement. This made him feel weak and 
tired, but he accepted it as “how it is” for seafar- 
ers. A similar situation is common on some small 
vessels too. 


Particularly on timber runs, some ports are close to 
each other. You get two-three hours on the pillow 
before it’s up again. And we load for four-five hours 
and continue. Four-five hours to next port, and load- 
ing again. And maybe you have four ports like that 
after each other. Then you'll be tired when you're fin- 
ished. Captain, small bulk vessel 


Organizational conditions such as the amount of
work compared to manning level, watch-keeping
schedules and sailing schedules contribute to lack
of rest. Even though ship-owning companies are in
charge of this, sleepiness is mostly talked about as
something that all seafarers experience and need to
handle. They mostly blame the vast horizon view,
darkness, or the weather.

The majority of the interviewees admit that it 
is easy to fall asleep on duty. On the large vessels, 
if their shift is on the bridge, they might ask for 
permission to leave and take a “power-nap” or just 
ask for a cup of coffee. On small vessels one usually
considers and makes such decisions for oneself.

Seafarers on large vessels also mention that in
their valuable periods of rest, they still have to
stay alert in case someone asks them a question.
Especially electro-technicians and officers who
have specific and special responsibilities are often
asked to solve a problem. To stay alert, even
off-duty, is an “unwritten law” on board the
international vessels in this study. Even though rest is a
luxury on board, these interviewees point out that
in case a superior demands your help, you must
present yourself.


4.5 Compliance and violations 


Protection equipment is essential and seafarers 
wear it as a habit and a necessity. They use gloves,
goggles and boots for their own safety, and not
only to follow procedures.

All the interviewed seafarers report that it is 
compulsory to read and sign the vessel’s safety 
management system and take part in drills. Still, 
the seafarers report that their system is violated on 
a daily basis. 


For instance, it says you're to test the emergency 
radio every day. That's something you just don't 
bother. Mate, small bulk vessel 


Most of the work is done safely and according
to procedures, and violations mostly happen
because procedures do not fit the situation, the
vessel does not have time or manning to comply, the
seafarer does not know the procedure, or because of
slips. Common slips are to forget to use a hard hat
or life vest, but according to the seafarers this has
decreased over time. As we have seen, «short cuts»
or «calculated risks» to work efficiently seem to be a
regular part of work among all interviewees in the
study. Through the stories in the interviews it is
evident that many procedures are neglected regularly
on the coastal vessels. One of the interviewed
seafarers notes that having too many procedures is
dangerous: now no one has oversight, and some
tasks might be neglected for a long time.

In general, it is common for seafarers on both
large and small vessels to violate procedures to do
the job more efficiently.

On the large vessels, we are told it also happens
that crewmembers are ordered by superiors to
violate procedures. One interviewee tells that he has been
forced to pass alone through a tunnel under the
holds of the ship, even though this involved risk
of intoxication.


4.6 Production pressure: Bureaucracy 


Seafarers on all the studied vessels underline that
there is too much paperwork and bureaucracy
every time they approach or leave a port. Many find
this exhausting.


There is too much bureaucracy. Each country has 
regulations outside the IMO. There should be a list 
for when we arrive at the port. Not every country 
sends its own list and in many cases it is sent in the 
local language, not even in English. Deck officer, 
large gas tanker 


All interviewees talk negatively about their safety
management systems. They describe them as too complex,
with procedures that cannot be complied with. For
example, the procedures for maintenance are seen
as detailed and ill-suited, especially for small
vessels. Gas tankers have additional regulations to
follow. If a company owns gas tankers and also other
types of vessels, procedures applied for tankers are
usually implemented on the other vessels too.


The problem is that the ISM-system’s too big and 
extensive for the ordinary man to take the trouble to 
get to know it. So it’s usually the ship management 
that knows what it’s about. This is an overstatement, 
because most [crewmembers] know the basics, but 
not more than that. Mate, small bulk vessel 


On the large gas tankers, it is told that foreigners
sometimes quit because of the extended
bureaucratic procedures.


The bad thing is that paperwork is harvested. For 
example, for each drill done, everyone must sign. 
In these 2 hours I lose, and I lose them every day, I 
would have learned a lot of things. Deck crew, large 
gas tanker 


4.7 Production and protection: Social conditions 


In the interviews, it is described how the crews
take care of one another and do not let each other
ignore safety measures. If someone finds them- 
selves extremely tired, colleagues can replace them 
or change shifts. Inconsiderate actions are neither 
allowed nor forgiven. Interviewees often speak 
about themselves and their colleagues as one, as a 
crew or a team. One’s safety depends on the others’ 
safety and vice versa. This study has revealed a deep 
trust between many crewmembers. On small vessels 
this trust is often shared among all crewmembers. 


From the large vessels, there are many stories
about hierarchy and a gap between crew and ship
management. Seafarers on large vessels follow and
respect hierarchy on board. If something happens,
they usually report to the next rank. If the
situation is of minor importance it does not reach
higher officers or the captain. On cargo ships, it is
reported that it depends on the atmosphere and the
captain’s attitude as a whole. A very strict captain
is best avoided. In this study we have heard
only a few stories about managers giving positive
feedback for crewmembers’ compliance with safety
rules. Still, most interviewees tell that they always
remind forgetful coworkers to wear their personal
protective equipment, even if it is someone of
higher rank. Safety is not perceived as a reason
for misunderstanding or conflict. A cadet with three
years of experience gave a relevant example:


I saw the captain without his helmet. I felt weird, but 
I finally told him “captain you forgot it” and I gave it 
to him. The captain then praised me for this. Cadet, 
large passenger ferry 


On both small and large vessels, problems of 
behavior may occur when the “atmosphere” on 
board is not so warm and friendly. Such prob- 
lems are usually confronted on board and without 
intervention of the company. Several interviewees
informed us that problems can be the result
of long contracts (of 6-7 months or seasonal), as
nerves become strained when the crewmembers are
on board for a long time. Long working periods on
board are more common on large vessels.


You're always under pressure at work because you 
live in a prison. It’s a small place because we live 
on the sea and beneath is the unknown. At most, 
we take a five-minute walk on the ship, we see the 
horizon, but we cannot take five steps. Deck officer, 
large gas tanker 


5 DISCUSSION 


5.1 


In line with earlier research on conflicting
objectives (Rasmussen, 1997, Reason, 1997, Hollnagel,
2009), the seafarers in this study routinely handle
pressures of production and protection, with many
tasks and little time. The previous section showed
examples of how the financial situation and
competition in transport are present in the seafarers’ daily
work. Cost pressure is related to time pressure and
demands of efficiency. As core tasks are plentiful, the
added bureaucracy is not welcomed by the seafarers.
Administration can lead to fatigue and increase
risk (Silos et al., 2012, Osterman and Hult, 2016).
Loading and discharging situations are described
as work intensive, including increased bureaucracy
for each port, and no possibility for rest or going
off duty in these situations. Most of the interviewed
personnel, on smaller or large vessels, can rest on
longer voyages. Seafarers on vessels with frequent port
calls, in coastal or international operations, describe the
situation that is most difficult to handle, because
of fatiguing pressure. Only when business is
slow do such seafarers have time to follow all safety
procedures. Earlier research has also discussed how
schedule and workload heavily influence the
possibilities for safe work and rest (Storkersen, 2017,
Sampson et al., 2014). In the present study, too, it
is difficult to isolate which conditions are related to,
for example, regulation or the market.

All the interviewed seafarers describe how they
handle pressures of production and protection
by taking “short cuts”, working a lot and resting
little. This corresponds with seafarer traditions
(Antonsen, 2009a, Knudsen, 2009, Bhattacharya,
2012, Aalberg and Bye, 2017). The norms onboard
are strictly followed, as research on maritime safety
culture reports (Antonsen, 2009b).

Our results show that conflicting goals of pro- 
duction and protection are constituted by a mix of 
conditions. These conditions stem from employers, 
market, and seafarers themselves. Pressures related 
to costs, time and work are evident for all seafarers 
in our study, but some aspects come out differently 
across the groups of interviewees. 


5.2 Maritime gemeinschaft and gesellschaft 


Two categories of seafarers seem to be divided 
between internal and external production pres- 
sure. Most seafarers on large vessels experience a 
pressure mainly from management, while seafar- 
ers on smaller vessels experience a pressure within 
themselves. Their different organizational condi- 
tions are related to Tönnies’ types gesellschaft and 
gemeinschaft. 

Many seafarers describe organizational relations
corresponding with gesellschaft, with impersonal
relations and strategic decisions (Falk, 2000).
The seafarers describing their context like this
usually work on large vessels with sizeable crews and
companies. Here, hierarchy is firm, and one is
to do as told by superiors. Such relations between
onshore management and vessel personnel are also
described by Xue et al. (2016), Xue et al. (2015) and
Bhattacharya (2012). A common feature is that the
seafarers have single contracts expiring when they
leave the ship, and thus have to act strategically to be
hired for the next voyage. Such seafarers experience
explicit pressures from onshore management, and
sometimes onboard management, and describe
that they will not keep the job if they do not act
upon the pressures. These seafarers’ situation
is also related to a traditional workers’ collective
(Lysgaard, 1961), where subordinates resist
work pressure by working as smoothly and in as
relaxed a manner as possible (Rasmussen, 1997).

Other seafarers elaborate on an internal
pressure, where they want to be efficient because they
are responsible for their company’s survival. Their
relations with shipmates and management
correspond with gemeinschaft’s close relations based
on feelings (Falk, 2000). These seafarers typically
work in small companies, and on smaller vessels on
the Norwegian coast (but not only Norwegian
registered). In the interviews, they elaborate that they
work fast and skip procedures in order to maintain
earnings for their employers. They experience being
supported and trusted by the management. They
value their autonomy and thus take a lot of
responsibility, maybe beyond what management explicitly
has stated or expected. This is also described in
Norwegian coastal passenger transport (Aalberg
and Bye, 2017, Storkersen and Johansen, 2014)
and cargo transport and the aquaculture industry
(Storkersen, 2012). According to earlier research,
management’s safety commitment is important
for the safety on the vessels (Lappalainen, 2016,
Bhattacharya, 2012). In the present study’s
interviews, the safety commitment of the management
in these small companies is not elaborated on. It
is possible that the managers are aware of the
seafarers’ internal pressure, and strategically let them
prioritize production over protection.

Another uncertain factor in these results is
whether the gesellschaft and gemeinschaft seafarers
are different because of their organizations’ or
crews’ sizes, or because of other conditions. For example,
the seafarers in gesellschaft are all from vessels
operating in and around Greece, while the gemeinschaft
seafarers operate in Norway. There is a need for
more research to elaborate on the categories
suggested in this study.


6 CONCLUDING REMARKS 


This study adds to research results about a
connection between safety and organizational
relations. It also shows that traditional sociological
literature on gemeinschaft/gesellschaft is valuable in
safety research, since it gives a clear understanding
of how different relations result in different safety
practices.

All the interviewed seafarers describe how they
handle pressures of production and protection
by taking “short cuts”, working much and resting
less. The vital difference between the seafarers
on large and smaller vessels is that on the large
gesellschaft vessels formal structures and management
explicitly favor production, while in gemeinschaft
seafarers experience support for protective
measures, but still choose to favor production.


Teamwork competence required across operational states: Findings
from nuclear power plant operation


Ann Britt Skjerve & Lars Holmgren 
Institute for Energy Technology, Norway 


ABSTRACT: The tasks of Nuclear Power Plant (NPP) operators are highly interconnected, and operators
need to engage in teamwork to ensure plant safety. Traditionally, teamwork-competence taxonomies
for NPP operators do not distinguish among operational states. This study explored if differences exist
among teamwork-competence requirements across the three main operational states in an NPP: normal
operation, outage and emergencies. Data was collected from a north European NPP using observations,
semi-structured interviews, and a questionnaire survey, and analyzed using a thematic-analysis approach.
The study suggested that the teamwork competencies needed by NPP operators are similar, but not
identical, across the three operational states. The variations were suggested to be caused by a combination
of task differences and different impacts of three performance-shaping factors: time pressure, task
complexity, and proactive attitude to safety. Based on the results, it was suggested that refresher training
should be adjusted to increase resilience in teamwork in NPP operation.


1 INTRODUCTION 


Nuclear Power Plants (NPPs) are key means for 
producing electricity in a range of countries today. 
NPPs are dynamic and highly complex production 
systems, and training of operational staff is one 
of the cornerstones for ensuring safe and efficient 
operation. 

Competence can be defined as the “... ability 
to apply skills, knowledge and attitudes in order 
to perform an activity or a job to specified stand- 
ards in an effective and efficient manner” (IAEA, 
2002). Training of NPP operators addresses both 
technical and teamwork competencies (IAEA, 
1996): The operators need technical competence
to understand the design and functioning of the
process system, and they need teamwork competence
to be able to work in a team setting, due to
the interdependence of their tasks. The technical
competencies required of NPP operators are
well established (e.g. U.S. NRC, 1998; 2007), and
training of these is under continuous development
within NPPs. The teamwork competencies
required are less well specified. This paper aims at
contributing to the understanding of what teamwork
competencies NPP operators need.

Teamwork can be defined as “... a distinguish- 
able set of two or more people who interact dynam- 
ically, interdependently and adaptively toward a 
common goal” (Blickensderfer, Cannon-Bowers 
& Salas 1997, 250). There is general agreement 
that teamwork is a multi-dimensional concept,
but there is no final agreement about the specific
dimensions the concept encompasses.

In an NPP, teamwork is highly regulated by
procedures and routines. Still, the level of detail with
which teamwork is regulated varies substantially.
For example, in some cases it is specified exactly
what information an operator should contribute,
whereas in other cases it is merely stated that
operators should contribute all relevant information. In
addition, operators need teamwork competence to
adapt performance to emerging situational
characteristics, e.g., to the competence possessed by
individual colleagues, the colleagues’ level of
workload, personal concerns, etc.

A teamwork-competence taxonomy is important 
to support the development of teamwork-training 
programs. A taxonomy facilitates identification of 
training needs, as well as documentation of what 
competencies a training program covers. Within 
the domain of NPP operation, various teamwork- 
competence taxonomies exist (e.g. Broberg, 2009; 
Crichton and Flin, 2004; IAEA, 1996; IAEA, 
2001; O’Connor, O’Dea, Flin, and Belton, 2008; 
Skjerve, Kaarstad and Holmgren, 2013). 

Traditionally, the teamwork-competence taxonomies
are general in nature, covering the entire
span of teamwork competencies needed by NPP
operators to perform their tasks safely and
efficiently. Still, based on the observation that the
NPP operators’ tasks are not identical across
operational states, it was hypothesized that the
teamwork-competence requirements might also not
be identical. If this hypothesis is true, traditional
refresher training might not fully address all the
teamwork competencies required by NPP operators,
as it tends to focus on emergency scenarios.

The purpose of this study was to explore if
differences exist between teamwork-competence
requirements for NPP operators across the three
main operational states in an NPP: normal
operation, outage and emergencies.¹

In this paper, the term NPP operators
includes the following roles on a shift: Shift
Manager (SM), Reactor Operator (RO), Assistant
Reactor Operator (ARO), Turbine Operator (TO),
and Field Operator (FO). O’Connor et al. (2008)
and Broberg (2009) both report that no differences
were found between the teamwork-competence
requirements for the two main groups of NPP
operators: control-room operators and field operators.
For this reason, no distinction will be made
between these two types of roles.


2 NUCLEAR POWER PLANT OPERATION 


The main task of an NPP operator team is to ensure
that plant safety is upheld. Overall, NPP operation
can be decomposed into three operational states: 
normal operation, outages, and emergencies. The 
three operational states are defined below based 
on the tasks that are prototypically associated with 
each state. 

Normal operation is the period when an NPP is 
producing electricity according to plan and is oper- 
ated based on the requirements in the standard 
operating procedures, the technical specifications 
of the plant, the plant orders and/or other direc- 
tives provided by the Operational Department. 
Normal operation may also be referred to as power 
operation. The overall task of an operator team is 
to ensure that the operational activities progress 
according to plan. The team’s activities are largely 
based on routines, and involve monitoring plant 
parameters and intervening with planned adjust- 
ments and with immediate adjustment if neces- 
sary. When a shift begins, the first task is to engage 
in shift-handover: First, each position will have a 
semi-structured dialogue with his or her opposite 
on the departing shift to learn about ongoing and 
planned tasks and deviations (if any). Then all 
operators on the team will meet to jointly build a 
common understanding of the situation at hand, 
and decide how to proceed. Often, the SM will be 
away from the control-room for longer periods of 


1. The paper is based on and a further elaboration of 
results reported in a ‘limited distribution’ work report by 
Skjerve & Holmgren (2016). 
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time, leaving the RO in charge of the team. During 
normal operation, operator teams are required to 
perform a set of administrative activities, in addi- 
tion to the operational activities. These involve, 
e.g., logging, preparing for upcoming tag-outs, 
refreshing knowledge, updating descriptions of 
plant systems, and job appraisal talks. 

Outage is the period from when an NPP is
shut down for preventive maintenance,
upgrades, and refueling until it has been started
up again and is ready for production. During an
outage, a plant is operated based on the standard 
operating procedures for shut-down and start-up, 
the technical specifications of the plant, the Out- 
age Plan, and the outage direction documentation 
and plant orders. The overall task of an opera- 
tor team is to ensure that the planned tasks are 
executed in accordance with the specifications in 
the Outage Plan and associated documentation to 
the extent this is possible. Team members’ tasks
are usually proceduralised, but often non-routine.
Their taskload is high, and they need to engage in
teamwork with staff members whom they may not
know well in advance (e.g. staff from the maintenance
departments) and with external parties (e.g.
various types of consultants). Throughout an
outage, time management is also a key concern.
The conditions in the control-room during an outage
differ from those during power operation: a high
number of alarms will go on and off in unusual ways
due to the various tests performed in the plant, and
there tend to be more people present in the
control-room. The RO and the TO, each with their
associated field operators, may come to form what
looks like two islands in the control-room, and it is
important that the SM, who is offloaded in the
daytime by administrative support, helps ensure
internal coordination in the operator team.

Emergency operation is the period in which an
NPP is in a state described in the Safety Analysis
Report (SAR) or in the plant-specific Probabilistic
Risk Assessment (PRA). In these situations, a
plant is operated based on the emergency operation
procedures, functional restoration guidelines
and, in extreme cases, severe accident mitigation
guidelines. The incident and standard operating
procedures may also be applied. The overall task
of the operator team is to ensure that the plant is 
brought to a safe state. When an event occurs that 
has been addressed in SAR or PRAs (e.g. a rup- 
ture of a tube in the steam generator), task per- 
formance is heavily guided by procedures. When 
multiple failures (events) have occurred, the crew 
members will to a larger degree need to adapt the 
procedures to the characteristics of the situation. 
During emergencies, the RO and ARO will typically
be working together to execute the actions required
on the reactor side, whereas the TO will execute the
actions on the turbines, power and I&C supplies.
The FOs will assist in the control-room or out in
the plant. The SM will take a stand-back position
and survey the plant’s behavior, including the critical
safety functions, coordinate the crew members’
activity, and plan ahead.


3 METHOD 


The study was based on data collected in a PWR
unit of a north European nuclear power plant.
Data collection included 108 hours of observation
by the authors in the control-room during normal
operation and outages, distributed between two
operator teams. Observations were further carried
out across refresher training (i.e. regular training
after the operators have been licensed, to refresh and
update competencies, comprising simulator and
classroom sessions), including eight days of simulator
runs addressing emergencies. In addition,
data were obtained from 14 semi-structured interviews
with plant personnel, lasting on average 1.5 hours,
and a questionnaire survey administered to
33 NPP operators. Method triangulation (Denzin,
1978) was applied to increase the validity of
the findings by cancelling out the limitations associated
with each of the three methods.

The data were analyzed using a thematic analysis
approach, i.e. a qualitative method that makes use
of labelling and iterative restructuring of data to
identify patterns, or themes, in the dataset. The
analysis process was developed based on Braun
and Clarke (2006). It comprised four phases.


Phase 1: Familiarization with the dataset. This 
implied reading through notes from observa- 
tions, the interview responses, and the responses 
to the questionnaire survey to obtain an over- 
view of the content. 

Phase 2: Generating initial codes: All data 
obtained, i.e. from observations, interviews, and 
questionnaire survey, was decomposed into seg- 
ments. A segment was defined as an entity that 
described one aspect of the teamwork compe- 
tence required by NPP operators as it emerged 
from the data collected. In all, 136 segments
were identified. Examples of segments include:
“Insights into how adults learn” (Learning and 
Coaching); “Communicating via more informa- 
tion channels to increase the likelihood that a 
message is understood” (Communication), and 
“Team-orientation—expanded to unit, plant, 
and other entities of relevance for ensuring safe 
and efficient plant performance” (Attitudes). 

Phase 3: Searching for themes: Establishing a
taxonomy comprising a set of teamwork-competence
dimensions: First, each segment was
assigned to one of the five categories in the
taxonomy defined by O’Connor et al. (2008)
based on the content of the segment. If
a segment was judged not to fit into any of the
categories, a new category was introduced and/or
the definition of one of the existing categories
was modified to accommodate a broader
range of content. If possible, the segments
were allocated to one or more of the three operational
states: normal operation, outages and/or
emergency operation. If not, the segments were
defined as common to all states.

Phase 4: The segments associated with each of
the three operational states were then grouped
across the teamwork-competence dimensions to
identify whether patterns emerged that could help
clarify the reasons for potential differences.
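
The allocation rule in Phase 3 and the grouping in Phase 4 can be illustrated with a small Python sketch. This is not the authors' tooling, only a minimal illustration under stated assumptions: the segment texts and dimensions are taken from the examples quoted in Phase 2, whereas the operational states attached to the first two segments are hypothetical and chosen only to show the mechanics. The names `Segment`, `state_group` and `tally` are likewise illustrative.

```python
from collections import Counter
from dataclasses import dataclass

STATES = ("normal operation", "outages", "emergencies")


@dataclass(frozen=True)
class Segment:
    text: str
    dimension: str
    states: frozenset = frozenset()  # empty: not tied to particular states

    def __post_init__(self):
        unknown = set(self.states) - set(STATES)
        if unknown:
            raise ValueError(f"unknown operational state(s): {sorted(unknown)}")


def state_group(segment: Segment) -> str:
    """Phase 3 rule: a segment without specific states is common to all."""
    if not segment.states or segment.states == frozenset(STATES):
        return "common to all states"
    return " & ".join(s for s in STATES if s in segment.states)


def tally(segments):
    """Phase 4: count segments per (state group, dimension) pair."""
    return Counter((state_group(s), s.dimension) for s in segments)


# Hypothetical state assignments for the first two segments (illustration only).
segments = [
    Segment("Insights into how adults learn",
            "Learning and coaching", frozenset({"normal operation"})),
    Segment("Communicating via more information channels to increase "
            "the likelihood that a message is understood",
            "Communication"),
    Segment("Team-orientation, expanded to unit, plant, and other entities",
            "Attitudes"),
]

counts = tally(segments)
for (group, dimension), n in sorted(counts.items()):
    print(f"{group:22} {dimension:22} {n}")
```

Keeping the states as a set per segment means a segment shared by two states (for example outages and emergencies) falls into its own group instead of being counted twice.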


4 RESULTS AND DISCUSSION 


The teamwork-competence taxonomy established 
in analysis phase 3 comprised nine dimensions: 
Attitudes, communication, coordination, decision 
making, interpersonal competence, intrapersonal 
competence, leadership, learning and coaching, 
and situation awareness. 

The distribution of segments across the nine
dimensions can be seen in Table 1. The inter-rater
reliability between the two authors showed a
correspondence of 81%.
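
The paper does not state how the 81% correspondence was computed. One common reading is simple percent agreement, i.e. the share of segments that both raters assigned to the same dimension. The Python sketch below assumes that reading; the function name `percent_agreement` and the two short rating lists are hypothetical, and the 110-of-136 figure is only an example of a count that would round to 81%, not a reported result.

```python
def percent_agreement(rater_a, rater_b):
    """Share (in percent) of items that two raters coded identically."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same, non-empty set of items")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


# Hypothetical codings of five segments by two raters.
rater_a = ["Leadership", "Communication", "Attitudes", "Coordination", "Leadership"]
rater_b = ["Leadership", "Communication", "Attitudes", "Decision making", "Leadership"]

print(percent_agreement(rater_a, rater_b))  # 4 of 5 match -> 80.0

# Assumed example: agreement on 110 of the 136 segments rounds to 81%.
print(round(100 * 110 / 136))  # 81
```

Percent agreement does not correct for agreement expected by chance, so it should be read as an upper-leaning indicator of coding consistency.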

The teamwork-competence dimensions identi- 
fied did not differ substantially from the dimen- 
sions identified in earlier studies addressing 
teamwork competencies in NPP operation. This 
was interpreted to support the validity of the taxo- 
nomy (see Table 2). 


4.1 Teamwork requirements across the
teamwork-competence dimensions


The results suggested that teamwork-competence 
requirements for NPP operators overall were 
highly similar: Competence associated with each 
dimension of teamwork was required in all three 
operational states. Still, a more detailed analysis 
showed that except for the teamwork-competence 
dimension attitudes, the specific competence 
aspects NPP operators were required to master 
showed some degree of variation across the operational
states. The main findings associated
with each of the nine teamwork-competence
dimensions are summarized below:


4.1.1 Situation awareness 
The task of building situation awareness is
done based on somewhat different sources of
information and under various workload levels
across the operational states. For example, during
normal operation the ability to establish accurate
situation awareness involves teamwork-competence
aspects associated with obtaining information from
shift-handovers (i.e. semi-structured dialogues with
colleagues), from various logs, etc., and systematically
assessing these with colleagues to build a shared
understanding. During outages and emergencies,
NPP operators need teamwork-competence
aspects associated with obtaining information from
colleagues about a dynamic situation on-the-fly, as
well as competence aspects related to distinguishing
between critical information and other types of
information and addressing critical information in
crew updates in a way all colleagues understand.


Table 1. Distribution of segments across the nine teamwork-competence dimensions.
Common = segments common to normal operation, outages and emergencies;
Specific = segments specific to normal operation, outages or emergencies.

Teamwork-competence          Total                        Normal               Outages &
dimension                 segments    Common  Specific operation   Outages   emergencies  Emergencies
Attitudes                        8         8         0         0         0             0            0
Communication                   13         8         5         0         2             0            3
Coordination                    13         4         9         2         1             4            2
Decision making                 13         7         6         1         0             3            2
Interpersonal competence        17         7        10         4         3             0            3
Intrapersonal competence        13         1        12         3         3             4            2
Leadership                      25         8        17         5         4             0            8
Learning and coaching           16         4        12         7         2             1            2
Situation awareness             18         6        12         1         4             3            4
SUM                            136        53        83        23        19            15           26


Table 2. Comparison of the taxonomy identified in the study with other teamwork-competence taxonomies.

Situation assessment; Building situation awareness; Building situation awareness; Situation awareness: build and maintain an accurate and shared situation understanding; Situation awareness: building and maintaining
Decision making: team focused
Communication; Communication; Communication; Communication: sharing information and insights; Communication
Collaboration; Interpersonal competence
Leadership
Group climate; Attitudes: towards colleagues and the plant; Attitudes
Personality fits; Intrapersonal
Learning and refreshing competencies; Learning and coaching


4.1.2 Decision-making

A range of teamwork-competence aspects associated
with decision-making was shared across the
operational states. This included proactively determining
how to verify the consequences of a decision,
and acknowledging and proactively addressing
uncertainties together with team members. The
differences found were mainly associated with the
overall workload level, but also to some extent with
concerns for ensuring continuous learning in
the operator team. For example, normal
operation required competence aspects associated with
contributing (depending on role) to a more participatory
decision-making process, which ensures that all
understand the basis on which the decision should be
made and is aimed at jointly developing ‘optimal
solutions’. Emergencies, by contrast, required competence
aspects associated with executing a more authoritarian
decision-making approach aimed at finding
‘good enough’ solutions.


4.1.3 Communication

The communication tasks were basically the
same across the three operational states. They
involved, e.g., the use of “Three-way communication”,
adapting communication to the receiver(s)’
competencies, and active listening. Still, the frequency
with which communication tasks had to be executed
varied across the operational states, and with it the
overall level of time pressure associated with task
performance. This implied that the
operators needed to master the communication
competencies with substantially more fluency during
outages and emergencies than during normal
operation: the number of communication tasks
was higher in these operational states, and the time
available to identify and correct misunderstandings
was more limited. Some communication tasks were
further associated mainly with one operational
state. During outages, for example, the operators need
to be prepared to communicate with consultants in
English (a non-native language to the operators).
During emergencies, there is a distinct need
to uphold continuous communication among team
mates during complex and/or stressful situations
to promote collective sense-making processes and
the provision of mutual support.


4.1.4 Coordination 

The requirements for coordination competencies
are essentially similar across operational states, in
the sense that they cover a wide range of activities,
from performance adaptation on-the-fly and engaging
in backup behavior to planning aimed at
ensuring coordination of future activity, all of which
may be needed across the operational states. Still, the
requirements to teamwork-competence aspects
associated with coordination vary more than
was the case for, e.g., communication. The reason is that
the content of coordination tasks prototypically
associated with each operational state is more varied,
and involves different teamwork-competence
aspects. For example: During normal operation, it
is necessary to continually coordinate the performance
of operational tasks with the performance of
administrative tasks. During outages, the
need for carrying out Pre-Job Briefings is more
pronounced than during normal operation, and the
briefings will involve more staff, including external
specialists. During emergencies, coordinating activities
so that they are clear, precise and, not least, timely is
a task of key importance.


4.1.5 Interpersonal competence 

The interpersonal teamwork-competence aspects
were to a large degree similar across the operational
states, except that they in general had to be mastered
with increasing fluency from normal operation,
through outages, to emergencies. They comprised, e.g.,
building trust, mastering interactions, and recogniz- 
ing the achievements of colleagues. The interper- 
sonal teamwork-competence aspects were, however, 
suggested to serve different purposes during normal 
operation and emergencies: During normal opera- 
tion, the overall purpose was to transform operators 
into a team and/or to strengthen the team spirit, 
whereas during emergencies the purpose was to 
uphold the operators’ ability to function efficiently 
as a team under highly challenging conditions. 


4.1.6 Leadership 

This competence dimension was assessed to be useful
for all operators, regardless of their particular
role in the team, because all (with different degrees
of likelihood) may end up in a situation where
they need to lead teammates. The teamwork-competence
aspects required across the operational
states varied, e.g., concerning the leadership style
the operator should master: During normal operation,
competencies associated with executing a
more democratic type of leadership were needed,
e.g. promoting team mates’ motivation by involving
them in decision-making processes and promoting
learning processes. During outages, and especially
during the acute part of emergencies, competence
aspects associated with executing a more authoritarian
type of leadership were needed, e.g. giving
orders and meticulously following up on them.


4.1.7 Attitudes 

Attitude requirements included, e.g., letting safety
concerns pervade all thinking and decision-making
processes, and conscientiousness and commitment to
quality. For this dimension no variation was found.
The attitudes identified were of key importance 
across all operational states. 


4.1.8 Intrapersonal competence 

This dimension contained a set of teamwork-competence
aspects of generic importance for sustaining
sound teamwork, such as the competence to
monitor one’s own ability to operate the plant safely and
efficiently, and the courage to speak up when needing
assistance to achieve these goals. Since intrapersonal
competence was used to fulfill different
purposes across the operational states, the teamwork-competence
aspects associated with each state
varied somewhat. For example, to uphold attention
towards the plant processes during normal operation,
where ‘little happened’ over longer periods of
time, teamwork-competence aspects associated with
reducing the risk of complacency were needed.
To uphold attention during outages and emergencies,
with their prolonged periods of high workload
and/or safety-critical situations, on the other hand,
teamwork-competence aspects associated with preventing
negative impacts of fatigue and/or of stressors
on the task-performance process were required.


4.1.9 Learning and coaching 

Learning and coaching activities may be carried
out as an integrated part of task performance or
as a dedicated activity. The teamwork-competence
aspects implied include, e.g., coaching competence,
the ability to give/receive and constructively use
feedback, and techniques for self-improvement,
alone or assisted by other people.

Dedicated activities to promote learning are pro- 
totypically associated with lower workload periods 
during normal operation. The likelihood that such
activities will take place seems to increase if the
operators find that continuous competence improvement
is important for the team.

During outages and emergencies, competence 
development may to a certain degree be an inte- 
grated part in the task-performance processes, 
involving teamwork-competence aspects associated 
with coaching. 

However, dedicated learning activity in relation
to outages and emergencies will usually be
postponed until after the shift period is over and/or
after the outage or event has been handled. At this
time, a required teamwork-competence aspect is
the ability to address occurrences/events constructively
in a team setting, i.e. avoiding that the parties
involved become defensive and refuse to share and
discuss actions that may contain important lessons
for the entire team.


4.2 Why teamwork-competence requirements are 
not identical across operational states 


Exploratory analysis of the variations found in
the requirements to teamwork-competence aspects
across the three operational states suggested that
the dissimilarities might be caused by a combination
of two influences: (1) differences among the
operational tasks across the operational states, and
(2) differences in the impact of performance-shaping
factors on otherwise similar operational
tasks across the operational states. These two
potential causes for dissimilarity are discussed
below.


4.2.1 Task differences 

The operational tasks that are prototypically 
associated with each of the operational states, as 
described in section 2, are not identical. 

Shift-handover is a task that is prototypically
associated with normal operation. This is, e.g.,
reflected in traditional refresher training, where
the handover process is replaced by a training
instructor simply describing the plant state to the
operator team. From an NPP operator’s perspective,
the shift-handover session at the beginning of
a shift includes a semi-structured dialogue with the
opposite on the departing team to obtain an accurate
understanding of the plant state, including
issues that need attention. Learning how to interact
with the opposite to obtain the needed information
is an important competence. It includes abilities to
identify and constructively address potential omissions,
misunderstandings and uncertainties in the
information provided to build situation awareness.
This type of competence is not addressed in dedicated
training sessions following licensing.

The requirement to work with people from dif- 
ferent professions and/or with whom the NPP 
operator is less familiar or unfamiliar is proto- 
typically associated with outages. During outages 
extended workgroups may arise, which in addition 
to the NPP operator team consist of colleagues 
from other operator teams, maintenance person- 
nel, contractor staff from external companies, etc. 

In this setting, a key teamwork-competence 
aspect required is associated with promoting com- 
mon ground between the diverse members of an 
extended workgroup. This includes the ability
to present information in ways that are understandable
to people with different professional
backgrounds, and ensuring that the concerns of
all parties are adequately brought forward and
addressed. This type of teamwork-competence
aspect is not addressed during training.

Handling of emergencies is a highly proceduralised
activity, especially in the first part of an
event, which is traditionally the part that has been
addressed in refresher training. In cases of multiple
failures, the need to make situation
assessments to understand how to proceed will
increase. A teamwork-competence aspect that is
particularly needed in this situation is the ability
to uphold communication throughout periods of
uncertainty, when operators tend to keep quiet and
keep focusing on making sense of the situation on
their own. Emphasizing communication is important
to promote the team’s ability to build situation
awareness and make sound decisions. Unless
refresher training progresses into this type of situation,
these skills may not be upheld.


4.2.2 Performance-shaping factors 

The study points to three Performance-Shaping 
Factors (PSF) impacting the requirements to 
teamwork-competence aspects across the opera- 
tional states: time pressure, task complexity and 
proactive attitude to safety. The influence of these 
PSFs implies that the performance of otherwise 
similar teamwork tasks will come to require partly 
different teamwork-competence aspects. 

Time pressure implies that a task needs to be 
completed within a given time window. The time 
window is typically defined by constraints in the
plant, e.g., the amount of break flow that a storage
tank can secure. The impact of time pressure on
task performance generally increases from normal 
operation over outages to emergencies. When time 
pressure is high, teamwork tasks should be mas- 
tered with a greater fluency. The ability to com- 
municate concern to a team mate should, e.g., 
preferably be mastered effortlessly, as the time 
available for re-stating information and correcting 
misunderstandings is reduced. 

As the level of task complexity increases, so does
the number of factors (parameters) and interdependencies
an operator needs to address when performing a
task. For this reason, task performance should
preferably be thorough and highly systematic, and
ideally be carried out without any time pressure. In
situations with high task complexity, teamwork-competence
aspects associated with the ability
to lead and coordinate teamwork are particularly
required, to help ensure that all parties involved
in the task performance process will obtain accurate
situation awareness, and thus a sound common
basis for making decisions about the course
of action needed. As for time pressure, task complexity
tends to increase from normal operation,
over outages to emergencies, provided the latter
involve multiple failures. In situations with both
time pressure and task complexity, there will be
a need for mastering the teamwork-competence
aspects associated with handling task complexity
more fluently.

The PSF proactive attitude to safety implies the 
conviction that it is important to establish the best 
possible basis for sound teamwork in future set- 
tings. This may be done by promoting learning 
processes, by coaching or encouraging team mates 
to engage in self-studies, e.g., by studying the background
materials for given procedures, etc. If the
proactive attitude to safety is deeply rooted in the
operators, it will help overcome a tendency to perceive
coaching and dedicated learning sessions as
an “add on” to the normal work practices. It will
encourage the operator to perceive competence-promoting
initiatives as an integrated and important
aspect of task performance processes.

The impact of this PSF is most visible during
normal operation, where operators may or may not
prioritize engaging in learning processes. It also
seems to be visible in the extent to which operators
are able to uphold a ‘questioning attitude’ while
carrying out their work, e.g. reflected in the extent
to which they critically review current work practices
to protect against drifting.


4.2.3 Is capturing variations in
teamwork-competence requirements necessary?

Even if the teamwork-competence requirements
are not identical across the three operational states,
they are highly similar. Does it have any real impact on
a training program if it is based on a generic set of
teamwork-competence aspects, rather than a set
that is decomposed across operational states?
From a practical perspective, any of the three operational
states may from time to time contain characteristics
that are usually associated with one of the other
operational states: During an outage, there may
be intervals resembling normal operation, such as longer
periods of time where ‘little happens’, and during
normal operation, situations may arise where NPP
operators need to collaborate with unfamiliar people
with a different professional background, etc.
Since a certain level of overlap exists between the
tasks that may arise across the three operational
states, it should in principle be possible to uncover
all teamwork-competence aspects required of NPP
operators by studying any of the three operational
states exclusively. However, this approach would be
substantially less effective than studying the characteristics
of each of the three operational states,
as the situational characteristics that are traditionally
associated with the other two operational
states would likely manifest only at
highly irregular intervals in the given operational
state.

Another way of answering the question is to
explore the share of teamwork-competence aspects
missed if one of the operational states is left out
of an analysis. This can be done based on the
distribution of segments reported in Table 1. The
exploratory analysis indicates that if data from
normal operation is left out of an analysis, 23 teamwork-competence
aspects required by NPP operators
may be at risk of remaining hidden, because
the need for these competence aspects is rare during
outages and emergencies. This corresponds to
17% of the entire set of teamwork-competence
requirements identified in the present study. If an
analysis does not include data from outages,
14% of the teamwork-competence aspects (i.e.
19 segments) may be at risk of remaining hidden.
If emergencies are left out of an analysis, the
corresponding figure is 19% (i.e. 26 segments). In
addition, outages and emergency situations share
a unique set of teamwork-competence aspects,
which together amount to 11% (i.e.
15 segments) of the teamwork-competence aspects
required.
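
The percentages quoted above follow directly from the SUM row of Table 1. The short Python sketch below reproduces the arithmetic as a check; the counts are taken from Table 1, while the variable names and dictionary keys are illustrative only.

```python
TOTAL_SEGMENTS = 136   # all segments identified in the study
COMMON_TO_ALL = 53     # segments common to all three operational states

# State-specific segment counts from the SUM row of Table 1.
specific = {
    "normal operation": 23,
    "outages": 19,
    "emergencies": 26,
    "outages and emergencies (shared)": 15,
}

# Consistency check: common and specific segments add up to the total.
assert COMMON_TO_ALL + sum(specific.values()) == TOTAL_SEGMENTS

# Share of all segments at risk of remaining hidden if a state is left out.
shares = {state: round(100 * n / TOTAL_SEGMENTS) for state, n in specific.items()}

for state, pct in shares.items():
    print(f"{state}: {specific[state]} segments = {pct}%")
# normal operation: 23 segments = 17%
# outages: 19 segments = 14%
# emergencies: 26 segments = 19%
# outages and emergencies (shared): 15 segments = 11%
```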

Overall, the results indicate that it is useful to
analyze each operational state when establishing
requirements to teamwork competence for NPP
operators.


4.3 Implications for teamwork training 


The IAEA (1996) recommends that the Systematic 
Approach to Training (SAT) is used as a basis for 
developing training programs. 

When preparing for teamwork training for
NPP operators, it is overall important to promote
their ability to adapt performance to situational
characteristics, including characteristics of team
members, such as their role, current tasks and the
type and level of competencies. Mutual performance
adaptation among team members, in combination
with a clear understanding of the team’s
goals, will promote teamwork processes. Because
of the multitude of requirements placed on teamwork-competence
aspects in an NPP, it can be
expected that operators who master a wider repertoire
of teamwork competencies will be better
able to adapt teamwork processes than operators
who have a more limited repertoire of teamwork
competencies.

One potential use of identifying aspects of teamwork
competence that are prototypically associated
with particular operational states is to provide
a means of deepening operators’ level of teamwork
competencies. Expanding the scope of teamwork
competencies addressed in refresher training will 
support the operators in developing and uphold- 
ing an expanded repertoire of teamwork ‘tech- 
niques’ which can be flexibly applied depending on 
situational characteristics, reducing the risks for 
break-downs in teamwork (Skjerve, Holmgren & 
Widheden, 2015). 

A teamwork-competence aspect that it may be useful
to address in the classroom part of
refresher training, despite it being prototypically
associated with normal operation, is team-building
competence, in particular teamwork-competence
aspects associated with maintaining team members’
ability to work together as a team, including
upholding team spirit. A feeling of team efficacy
and team spirit may promote the operators’ ability
to overcome challenges to teamwork, and thus
contribute to resilient performance.

Similarly, another teamwork-competence aspect 
it may be useful to address as a part of the class- 
room part of refresher training or exercises, despite 
it being prototypically associated with outages, is 
the ability to engage in teamwork with less famil- 
iar or unfamiliar parties with different professional 
backgrounds. This type of training may be included
as an element in emergency exercises comprising
NPP operator teams and key positions in the
technical-support center. It may be done, e.g., by
asking participants to state their expectations of
one another, describe why they have these expectations,
and account for their concerns. This would
contribute to resilience in the extended team by 
further strengthening team mates’ ability to select 
information of relevance to team members and to 
communicate this information accurately, etc. This 
will promote building and maintaining situation 
awareness in the extended team. 

Expanding the scope of refresher training by 
teamwork-competence aspects prototypically 
associated with normal operation and outages may 
further help to promote ‘teamwork mode’
awareness. 

Being aware of the current ‘teamwork mode’
may increase the likelihood that operators will
consciously apply the teamwork-competence aspects that are
prototypically associated with the given mode,
regardless of the operational state they are currently
in. For example, during a non-acute phase of an
emergency with limited workload, an NPP operator
may need to talk to various people from
maintenance to refine the team’s understanding of
the situation. If the operator recognized the parallels
these dialogues may have with the dialogues
involved in a shift-handover process, it could
prompt the operator to apply similar
‘techniques’ (such as focusing on distinguishing facts
from interpretations, specifying what needs to be
further examined after completing the dialogue to
get a clear picture of the situation, etc.).

An operator’s increased focus on how teamwork-competence
aspects interplay may promote
meta-cognition about teamwork competencies.
Meta-cognition may enable the operator to more
readily identify and develop solutions
to teamwork challenges. Transfer of a message
between two persons may, e.g., be unsuccessful for a
variety of reasons: the message may not be stated
clearly, it may not be formulated in a way the receiver
understands, it may be transferred at a time when the
receiver is unable to pay full attention to it,
the receiver may misinterpret the content due
to lack of common ground, etc. Awareness that
a message may fail to reach the receiver
for a variety of reasons will allow the operator to
‘troubleshoot’ potential teamwork break-downs
from a range of different angles, and thus increase
the likelihood that a means of preventing teamwork
break-down will be found.

Both the ability to transfer teamwork-com- 
petence aspects across operational states and to 
engage in meta-cognition about teamwork, may, 
thus, further contribute to resilience in teamwork. 


5 CONCLUSION 


The outcome of the study suggested that the teamwork competencies needed by NPP operators across the three operational states are similar, but not identical. The results indicated that unless requirements for teamwork competence are obtained from all operational states, there is a risk that important teamwork competencies will remain hidden. With respect to refresher training, this would imply that these competencies are not addressed and thus, potentially, that NPP operators will not maintain these competencies to the required standard.

Based on the results, it was suggested that resilience in teamwork could be strengthened if refresher training expanded its traditional focus on teamwork-competence aspects associated with emergencies to include aspects that are prototypically associated with normal operation and outages.

Verification of HTC Vive deployment capabilities for ergonomic
evaluations in virtual reality environments

ABSTRACT: Musculoskeletal risk assessment for the workplace of the future is difficult, because the postures of workers cannot simply be evaluated and predicted in advance. A virtual reality environment offers an initial insight into the future workplace. For better immersion in the environment, the HTC Vive was selected. The ergonomic method selected for assessment is RULA. Because the work takes place in a virtual reality environment, the measurements must be made in a real laboratory environment. As a reference device, an MS Kinect solution robust to occlusion was selected. Measurements were performed in the laboratory according to a recently published article on RULA assessment using MS Kinect. The results show that HTC Vive can serve as an integral tool for assessing musculoskeletal risk when designing new workplaces.


1 INTRODUCTION 

In every real production process there are risks 
that have the potential to endanger the operator 
or equipment. It is clear that the selected risks 
associated with the underlying process (e.g. the 
use of hazardous chemical substances) cannot be 
completely eliminated, but most of these risks can 
be significantly reduced. The human factor is one 
of the most important elements influencing the 
resulting safety. It is also a component that is often 
neglected under the prevailing emphasis on techni- 
cal reliability. The results of major accident inves- 
tigations consistently show that the human factor 
plays a very significant role in their development. 
Therefore, it is important to identify potential
human errors and reduce the likelihood of failure
of the human factor. One way to deal with
this problem is through an analysis of human
factor reliability.

One element of the analysis of human factor reliability is the analysis of the workload. This is costly in both money and time; the objective of the present article is to design an automated evaluation process using the HTC Vive virtual reality system and the Kinect system.

The HTC Vive and other head-mounted displays (HMDs) are beginning to be widely used in a broad range of applications, such as in medicine (Pelargo et al. 2017), where they are used for education. In this case, they are deployed because education requires a large number of repetitive demonstration procedures, which would have an adverse impact on the patient. That work also emphasizes the informative value of the technology in cases such as blood vessels. HMDs are also deployed in various therapies, for example in stress situations to determine whether the patient suffers from a fear of heights (Bun et al. 2016). In the field of culture, virtual tours of galleries (Choi et al. 2017) can be mentioned. Early applications have also been made in the fine arts, and from the point of view of artists involved in digital fine art this technology is highly desirable. In engineering, examples include the use of HMDs in ship design (Pérez et al. 2015) and in the education of robotics students (Crespo et al. 2015). What all these articles have in common is the emphasis on the added value of virtual reality applications. In these cases, the benefit is greater clarity and thus a deeper understanding of the subject.

The present article focuses on ergonomics and the human factor in manufacturing systems and at production sites. Many simulation programs are available for modeling, including Siemens Process Simulate, Siemens Classic Jack (Jack and Process Simulate Human 2017), Plant Simulation (Process Simulate 2017) and IC.IDO (ESI Applications 2017), which deal with the ergonomic aspects of manual work in the early stages of design and planning of product manufacture. Jack and Process Simulate Human allow the safety, performance and comfort of the working environment to be improved with the use of digital human models. Work environments can be analyzed via virtual human models, using a database of human body builds typical of the specified population. A variety of human factors can be tested in a proposed design, including the risk of injury, user comfort, accessibility, lines of sight, energy expenditure, fatigue limits and other important parameters. These products offer recommendations for more user-friendly designs throughout the entire design process to help save costs and time.


Key options include:

- Flexible human body builds that are anthropometrically and biomechanically accurate
- Support for ergonomic analysis of the total workforce using country-specific databases of workers and advanced anthropometric parameters
- A comprehensive set of ergonomic analytical tools
- An advanced positioning algorithm that can also analyze how the body responds to a force exerted in a particular direction
- Management of a wide range of workplace options, including work at different heights, stairs and ramps
- Views and analysis of the field of view
- Envelope curve radius for fast workplace configuration
- Wide support for virtual motion capture technology, including Microsoft Kinect® for Windows
- Support for virtual reality


A new platform in this field is the HTC Vive gaming headset, because it allows an observer to achieve a fully immersive insight into the virtual environment compared to previous headset display systems. The HTC Vive has the following parameters (Oculus Rift vs. HTC Vive: Prices are lower, but our favorite remains the same, 2017):


- Two display panels with a resolution of 1,080 × 1,200 pixels
- Video streaming at a 90 Hz refresh rate
- Laser position sensors: 32 sensors on the headset and 48 sensors on the controllers
- Microelectromechanical systems: gyroscope, accelerometer
- Minimum space for movement of 1.5 m × 1.5 m, with a maximum of 4.5 m × 4.5 m


It is also possible to use hand controllers to interact with
the virtual environment; in the case of HTC Vive,
these are two controllers tracked using two infrared
sensors.

The previous article (Plantard et al. 2017) dealt with deploying MS Kinect in the assessment of a workplace in both laboratory and real conditions. For the reasons given above, HTC Vive is an obvious candidate for further research.

It was of interest to assess how a person is influenced by the deployment of the headset and whether this technology can be deployed in industry. As in the previous article, the assessment was based on a laboratory test in which a box of defined dimensions is lifted. This time, the hand controllers were used to simulate the operator’s hands, and the lifted box was virtual. This model case fits the overall concept of using virtual reality, i.e. using the 3D model of the workplace prior to assembly. Using this technology, it would be possible to verify workplaces from the point of view of ergonomics and the human factor without the additional costs associated with building a mock-up workplace, e.g. by cardboard engineering.


2 METHODS 


As in the previous article, MS Kinect was used for 
validation of this method. 

The test itself was also carried out using the RULA
and REBA methods, with 10 operators
involved in the measurement.


2.1 RULA method (Rapid Upper Limb Assessment)


The RULA method (RULA - Rapid Upper Limb Assessment 2017; Figure 1) was developed by ergonomists from the University of Nottingham and serves for rapid and systematic assessment of the risk of damage to the musculoskeletal apparatus with respect to the upper limbs. Internationally, this method has been used mainly for the assessment of upper limb disorders arising in the workplace and for the assessment of working postures.


2.2 REBA method (Rapid Entire Body Assessment)


Figure 1. RULA method.

This method systematically evaluates the musculoskeletal apparatus and is based on the RULA methodology. Internationally, the REBA method has been used to assess ergonomic risks when working with imaging units and for risk assessment of healthcare staff. Both methods are tools for postural analysis that evaluate the biomechanical and postural loads on individual body parts. The body is divided into segments that are scored individually in relation to the planes of movement. Identifying risk postures is very important for the evaluation. These may be working postures that are physiologically unfavorable or that the worker adopts for most of the work shift. In the RULA and REBA methods, the postures of individual parts of the body (upper arms, forearms, wrists, neck, trunk and lower limbs) are scored with respect to their divergence from the neutral posture (Simulace vyrobnich procest 2017). For each part of the body, so-called basic postures are described to obtain a base score: ranges of flexion and extension that receive more points with increasing divergence from the neutral posture. There are also descriptions of postures that earn additional points, the so-called variable score (e.g. rotation and sideways bending). The final assessment also includes the weight of the manipulated load (load-strength score) and the influence of static postures at work (muscle score, activity score). In addition, REBA takes into account the impact of the gripping technique when handling the load (grip score) (Modern methods for the evaluation of ergonomic risks, 2017).
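
The base-score and variable-score idea described above can be illustrated with a minimal Python sketch for one body segment, the upper arm. The angle bands follow the RULA worksheet as it is commonly reproduced; the function names, the sign convention (negative angles for extension) and the clamping of the result to a minimum of 1 are assumptions made for illustration and are not taken from the authors' implementation.

```python
def upper_arm_base_score(flexion_deg: float) -> int:
    """Base score for the upper arm from shoulder flexion in degrees.

    Assumed convention: positive = flexion, negative = extension.
    """
    if -20.0 <= flexion_deg <= 20.0:
        return 1  # near-neutral posture
    if flexion_deg < -20.0 or flexion_deg <= 45.0:
        return 2  # more than 20 deg extension, or 20-45 deg flexion
    if flexion_deg <= 90.0:
        return 3  # 45-90 deg flexion
    return 4      # more than 90 deg flexion


def upper_arm_score(flexion_deg: float,
                    shoulder_raised: bool = False,
                    abducted: bool = False,
                    supported: bool = False) -> int:
    """Base score plus the additional ("variable") points."""
    score = upper_arm_base_score(flexion_deg)
    score += 1 if shoulder_raised else 0
    score += 1 if abducted else 0
    score -= 1 if supported else 0
    return max(score, 1)  # clamping to 1 is an assumption of this sketch
```

For example, `upper_arm_score(60, abducted=True)` combines a base score of 3 with one additional point for abduction.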


2.3 Experimental procedure in laboratory 
conditions 


This section presents an experiment in simulated conditions. For this purpose, an experiment with 10 participants (age: 30 ± 6 years, height: 1.78 ± 0.1 m, weight: 70 ± 10 kg) was performed. Each participant was equipped with the HTC Vive headset and hand controllers, which simulated the hands for interaction. The movement of each participant was sensed using two MS Kinect devices (Figure 2).

Two MS Kinect devices were used to sense the participants because a single device sensed insufficiently in singular postures when manipulating the virtual object, and because of occlusion. The correction method described in (Plantard et al. 2017) and (Diego-Mas et al. 2014) was also used to increase robustness.

As in (Plantard et al. 2017), the experiment itself consisted of shifting a 40 × 30 × 17 cm box between the starting and target positions (Figure 3). As in (Plantard et al. 2016), two target positions were defined, as seen in Figure 4. The first target point was placed 1.7 m high, 0.35 m to the left and 0.5 m to the front. The second target position was 1.7 m high and 0.55 m to the left.


Figure 2. Record from cameras from two MS Kinect devices.

Figure 3. Course of experiment in laboratory conditions.

Figure 4. Positioning of target points in Unity3D environment with avatar.


Unlike in (Plantard et al. 2016), a virtual box of 
the same defined dimensions was used. The Unity 
3D software was used to create the experimental 
environment. 


2.4 Data analysis 


In the laboratory experiment, we assessed the val- 
ues of the joint angles calculated from the meas- 
ured data from the Kinect device (using parallel 
scanning of two cameras) and compared them with 
the data evaluated by the expert estimate using the 
RULA principle. 
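
As a minimal sketch of how a joint angle can be derived from skeleton data, the following Python function computes the angle at a joint from three 3D joint positions. The function name and the coordinates are hypothetical; the source does not state how the authors computed the angles from the Kinect skeleton.

```python
import math


def joint_angle_deg(a, b, c):
    """Angle in degrees at joint b, formed by the segments b->a and b->c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    if n1 == 0.0 or n2 == 0.0:
        raise ValueError("degenerate segment: two joints coincide")
    # Clamp to [-1, 1] to guard against rounding before acos.
    cos_angle = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_angle))


# Hypothetical joint positions in metres (x, y, z).
shoulder, elbow, wrist = (0.0, 1.4, 0.0), (0.0, 1.1, 0.0), (0.3, 1.1, 0.0)
elbow_angle = joint_angle_deg(shoulder, elbow, wrist)
```

An angle obtained in this way would then be mapped to a posture score by the angle bands of the chosen ergonomic method.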


3 RESULTS 


Table 1 lists the calculated RULA values and the values obtained by expert estimate according to the RULA method.

The present article only evaluated the movement when lifting the object. The measured data were subjected to the Kolmogorov-Smirnov test for normality of the distribution of values. The analysis did not show the data to be normally distributed.
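
The normality check described above can be sketched with the Python standard library. The function below computes the one-sample Kolmogorov-Smirnov statistic D against a normal distribution fitted to the sample; the sample values and the function name are hypothetical, and the software the authors used is not stated in the source. Because the mean and standard deviation are estimated from the same sample, a formal decision would need corrected (Lilliefors-type) critical values, so the sketch reports D without a pass/fail verdict.

```python
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(samples):
    """Kolmogorov-Smirnov distance between the empirical CDF of the
    sample and a normal CDF with the sample's mean and std deviation."""
    xs = sorted(samples)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


# Hypothetical per-trial scores; discrete scores like these tend to
# deviate clearly from a normal distribution.
scores = [2, 2, 2, 3, 3, 2, 2, 3, 2, 2]
d_stat = ks_statistic_normal(scores)
```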

Subsequently, the RMSE coefficient was calcu- 
lated (Table 2) for each condition. Then we com- 
pared the resulting RULA score obtained from 
measurements with parallel use of two Kinect 
systems with a reference value obtained by expert 
estimate. 
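
The RMSE comparison described above can be sketched as follows. This is a minimal illustration assuming paired per-trial scores; the score lists are hypothetical and are not the measured data.

```python
import math


def rmse(calculated, reference):
    """Root-mean-square error between two equally long score sequences."""
    if len(calculated) != len(reference) or not calculated:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(c - r) ** 2 for c, r in zip(calculated, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical scores for one RULA indicator over five trials.
kinect_scores = [2, 3, 2, 4, 2]   # computed automatically
expert_scores = [2, 2, 2, 3, 2]   # expert estimate (reference)
upper_arm_rmse = rmse(kinect_scores, expert_scores)
```

Because RULA scores are small integers, an RMSE close to 1 means the automatic score is typically off by about one scoring level.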

Table 1. RULA score.

RULA indicator                    Calculated   Expert estimate
Upper arm score (left)            2.35         2
Lower arm score (left)            1.83         2
Score A Left (upper body)         2.79         3
RULA Grand Score Left             2.74         3
Upper arm score (right)           2.31         2
Lower arm score (right)           1.9          2
Score A Right (upper body)        2.78         3
RULA Grand Score Right            3.15         3
Neck score                        2.51         3
Trunk score                       2.24         2
Score B (neck, trunk and legs)    2.97         3

Table 2. Calculated RMSE values for each RULA indicator.

Score                             RMSE
Upper arm score (left)            1.05
Lower arm score (left)            0.38
Score A Left (upper body)         0.83
RULA Grand Score Left             0.45
Upper arm score (right)           0.98
Lower arm score (right)           0.3
Score A Right (upper body)        0.7
RULA Grand Score Right            0.46
Neck score                        0.75
Trunk score                       0.44
Score B (neck, trunk and legs)    0.66

The results of the analysis show that the proposed method of automatic measurement of RULA indicators provides relatively accurate results. Only the results for the Upper Arm score (left and right) carry a major error; this could be reduced by changing the size of the field of view of the cameras. Another measurement issue is the large number of singular points. These points could be eliminated by using two MS Kinect devices to scan a moving person. However, this approach also has its disadvantages. One measurement issue was the occlusion of the limbs by the body. At such moments, only one sensor was active. Therefore, measurements and data from the second sensor were not used in this case, since they would introduce a significant error into the whole measurement.

The advantage of this approach is a significant reduction in the cost of evaluation. The accuracy of the results is sufficient for practical measurements; however, for laboratory evaluation, further adjustments of the measuring procedure would be necessary.


4 CONCLUSION 


This article dealt with the idea of using the HTC Vive gaming headset to analyze workloads. The comparison of the angles of movement is based on the RULA method: within each body part, each angular assessment is divided into three sections and is tested for repetitive actions of a single type. This method of automatic analysis has demonstrated its validity in an assessment in a virtual environment.

Despite the constraints arising from the performed measurements, HTC Vive can be considered a promising tool for workload analysis. Another indisputable advantage of the entire system is its very easy deployment in industry, as it does not place high demands on the software readiness of potential users. The favorable price of the entire system also plays an important role. A further challenge will be to test the prepared solutions in the assessment of real production. Data from real-time measurements in operation and feedback from industry partners will provide us with a comprehensive picture of the use of these technologies. A comprehensive examination of the overall production site design requires further measurements, including of fine motor skills. This requires the deployment of other virtual reality technologies as well as other methods of assessing these movements.

ABSTRACT: A lack of empirical data has often been presented as a large challenge for HRA, which raises the question: why is this so difficult? HRA methods were not developed as objective quantitative test methods, but rather as qualitative evaluation methods, because objective data did not exist. Since HRA methods include substantial qualitative evaluation of the meaning of their elements, such as the definitions of the performance shaping factors and their strength, these elements cannot be objectively measured. This paper also discusses other challenges with collecting data from event reports, literature reviews, experiments and databases. The conclusion of this paper is that a decision should be made about how we should look at HRA methods: as qualitative evaluation methods or as objective quantitative test methods. Quantitative and qualitative methods take different approaches to evaluating the quality of a method, which makes it difficult for a method to be something in between.


1 INTRODUCTION 
Most of the Human Reliability Analysis (HRA) 
methods and techniques have been developed to 
estimate human reliability for tasks within Prob- 
abilistic Risk Analysis (PRA) in nuclear power 
plants. Human reliability analysis (HRA) was 
developed because of the lack of empirical data on 
human error probability. If data on the likelihood 
of human error on specific tasks existed, we would 
not need an HRA method. Williams (1992, p.20) 
said: “Therefore in cases where an assessor may 
have access to more specific and accurate task fail- 
ure data, these should be used in preference to the 
HEART generic data-set.” So if we had the data, 
the data should be used and not an HRA method.

HRA methods such as THERP (Swain & Guttmann, 1983), HEART (Williams, 1992), SPAR-H (Gertman et al. 2005; Whaley et al. 2011) and ATHEANA (Forester et al. 2007) were developed as methods to support more qualitative expert judgements. The expert judgement provides gross estimates of failure probabilities for tasks defined by the PRA when better data are missing. The HEART manual (Williams, 1992, p. 4) states: “When considering system safety and reliability, engineers are generally concerned with gross changes in the probability of failure within system e.g. factors of 10, the proverbial order of magnitude. To be of value, therefore human reliability assessment techniques should be concerned with those factors, which are likely to produce probability of failure modification in excess of a factor of 3, and which, when cumulated, could produce significant changes in performance, and possible threaten system safety, operability and reliability.”

HRA methods are also often said to be simplified methods, since there are often not enough resources to perform a comprehensive analysis. This means that the analyst analyzes the most important influences on human reliability with limited time to read and understand guidelines and to perform the analysis.

Even though an HRA analysis results in a quantitative likelihood of failure or success, HRA methods were not developed as positivistic quantitative test methods. HRA methods seem to be closer to a post-positivistic research view. In positivism one assumes that an objective reality exists, that this reality can be objectively measured by scientific methods and that it is possible to develop scientific laws that can be generalized across settings (Guba & Lincoln, 1994). Within this approach, only quantitative research methods are used. In post-positivism one also assumes that an objective reality exists. However, this reality is complex, it can only be imperfectly apprehended, and we can never be sure that a true reality has actually been found (Guba & Lincoln, 1994). Within this approach, both qualitative and quantitative methods are used.

Despite the characteristics of HRA methods described above, many authors claim that the biggest challenge in HRA is the lack of objective empirical data (for example, Boring et al. 2012; Hallbert et al. 2004; Kim et al. 2015; Swain, 1990; Williams, 1985). A question raised here, first presented in Laumann (in review), is what an HRA method actually is: Is an HRA method a qualitative evaluation method that leaves a lot up to the analyst’s qualitative evaluation and where mainly expert judgement was used to develop the method? Or is an HRA method a quantitative test method where empirical tests were used to develop and test the method? HRA today is much more a qualitative evaluation method than a quantitative test method. The words used about HRA methods, such as “analysis” and “techniques”, probably reflect the qualitative rather than quantitative basis of HRA.

In this paper, we present challenges for HRA methods to be quantitative test methods and challenges with collecting quantitative HRA data. First, we define challenges in obtaining quantitative data for HRA that are the same as those found with all kinds of quantitative methods. Then we present challenges in obtaining HRA data that arise with more specific methods, such as literature reviews, experiments designed to collect HRA data, event reports and databases. For simplicity, we have chosen to present examples from SPAR-H (Gertman et al. 2005; Whaley et al. 2011) and HEART (Williams, 1992). However, the challenges for collecting empirical data presented here exist for all HRA methods. SPAR-H and HEART were also chosen because these are methods for which quantitative data collection has been much discussed.


2 DISCUSSION 
2.1 Challenges that exist for all kinds of data
collection in HRA


All kinds of quantitative test methods must meet some demands: they should be valid and reliable. HRA methods have a strong similarity to psychological test methods, where human behavior is predicted from different “psychological” constructs. HRA is also about predicting human behavior from constructs, since the elements in the methods, such as Performance Shaping Factors (PSFs) or Error Producing Conditions (EPCs), are constructs that are assumed to affect human behavior. In psychological test methods, validity is divided into different facets: content validity, construct validity, concurrent validity and predictive validity (Murphy & Davidshoffer, 2014). These demands will not be described here since they are well known and can be found in many textbooks, for example Murphy and Davidshoffer (2014).

A large challenge for HRA is content validity. 
Content validity is achieved by a) judgments and 
descriptions of the constructs and of the structure 
within these concepts and b) definitions and devel- 
opment of measurement scales to measure the 
constructs and their structure. If the content of the 
constructs, their structure, and how they should 
be measured (measurement scales) have not been 
clearly defined, then it is not possible to test the 
other aspects of validity such as construct validity, 
concurrent validity and predictive validity, or reliability. Next, we will give some examples of content validity issues in two HRA methods, SPAR-H and HEART. Laumann and Rasmussen have also presented challenges with content validity in SPAR-H in several papers (Laumann & Rasmussen, 2016; Rasmussen, Sandal & Laumann, 2015; Rasmussen & Laumann, in review).

The elements in SPAR-H (Gertman et al. 2005; Whaley et al. 2011) are two nominal tasks with two nominal failure rates, and eight performance shaping factors. The two nominal tasks in SPAR-H (Gertman et al. 2005; Whaley et al. 2011) are diagnosis and action. A task should be classified as diagnosis if it involves cognitive processing. An action involves limited cognition. The separation between cognition and action within SPAR-H is very peculiar, since probably nothing an operator does within an accident scenario is purely action without cognition. The diagnosis/action separation leaves much room for interpretation of what this actually means.
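
To make the role of these elements concrete, the following Python sketch shows how a SPAR-H human error probability (HEP) is commonly computed from a nominal rate and eight PSF multipliers. The nominal values (0.01 for diagnosis, 0.001 for action) and the adjustment applied when three or more PSFs are negative are given here as they are usually reported for the SPAR-H manual; the function name and the example multipliers are hypothetical. The sketch also shows why the diagnosis/action choice matters: it changes the result by a factor of ten before any PSF is considered.

```python
from math import prod

# Nominal HEPs as commonly reported for SPAR-H (assumption of this sketch).
NOMINAL_HEP = {"diagnosis": 1e-2, "action": 1e-3}


def spar_h_hep(task_type, multipliers):
    """HEP from a nominal rate and the PSF multipliers chosen by the analyst."""
    nhep = NOMINAL_HEP[task_type]
    composite = prod(multipliers)
    negative = sum(1 for m in multipliers if m > 1.0)
    if negative >= 3:
        # Adjustment that keeps the result below 1 when several PSFs are negative.
        return nhep * composite / (nhep * (composite - 1.0) + 1.0)
    return min(nhep * composite, 1.0)


# Hypothetical multipliers for the eight PSFs of one task.
psfs = [10, 5, 2, 1, 1, 1, 1, 1]
hep_diagnosis = spar_h_hep("diagnosis", psfs)
hep_action = spar_h_hep("action", [10, 1, 1, 1, 1, 1, 1, 1])
```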

In the SPAR-H manual (Gertman et al. 2005), there are no specifications of what is meant by a task or at what task level the analysis should be performed. How a task is defined has a large effect on the result of the analysis or the probability of errors (for a discussion see Rasmussen & Laumann, 2017). SPAR-H leaves it up to the analyst to define at what task level SPAR-H should be applied, and this leaves much room for analyst choice and interpretation.

The elements in the HEART (Williams, 1992) method described in the user manual are 14 generic task types (GTTs), each with a nominal human unreliability value and suggested uncertainty bounds, and 38 EPCs. For the EPCs the analyst should also assess the proportion of affect. Williams and Bell (2017) have recently reviewed HEART with a large literature review. Based on this review, 32 of the 38 original EPCs were kept, six of the EPCs were revised slightly and two new ones were incorporated in HEART.
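
How these elements combine can be shown with a minimal Python sketch of the usual HEART calculation, in which each EPC contributes an assessed effect of (maximum multiplier - 1) x proportion of affect + 1. The function name and the numerical values are illustrative assumptions, not values taken from the paper.

```python
def heart_hep(nominal_unreliability, epcs):
    """HEART-style HEP.

    epcs: iterable of (max_multiplier, proportion_of_affect) pairs,
    where the proportion of affect is the analyst's judgement in [0, 1].
    """
    hep = nominal_unreliability
    for max_multiplier, proportion in epcs:
        if not 0.0 <= proportion <= 1.0:
            raise ValueError("proportion of affect must lie in [0, 1]")
        hep *= (max_multiplier - 1.0) * proportion + 1.0
    return min(hep, 1.0)  # a probability cannot exceed 1


# Illustrative values: one GTT nominal value and two EPCs.
example_hep = heart_hep(0.003, [(11, 0.4), (3, 0.5)])
```

The proportion of affect enters the result directly, so two analysts who judge it differently obtain different probabilities from the same GTT and EPCs.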


It is difficult to find a definition of what a generic 
task actually is in HEART. It is said about generic 
tasks in HEART (Williams, 1992, p.8): “The first 
is the assumption that basic human reliability is 
dependent upon the generic nature of the task to be 
performed, i.e. for each task in life there is a basic 
probability of failure.” So a generic task has something to do with the generic nature of the task, which is not a very specific definition. Error producing conditions in HEART are defined as (Williams, 1992, p.1): ‘Error producing conditions are factors that can affect human performance, making it less reliable than it would otherwise be.’ The separation between what is a GTT and what is an EPC is not obvious in HEART, and some of the GTTs include elements that are very similar to the EPCs. This leaves room for different interpretations by different analysts.

HEART specifies that a task should be analyzed at the level that fits the GTT. How to analyze the proportion of affect or the strengths of the EPCs is not well defined in HEART, which leaves much room for interpretation by the analyst.

To show an example of the difficulties with content validity, the definitions of the PSF available time (SPAR-H) and the EPC time shortage (HEART) will be presented.

SPAR-H [Gertman et al. 2005, page 20] defines 
one of its PSFs available time as: “Available time 
refers to the amount of time that an operator or 
a crew has to diagnose and act upon an abnormal 
event. A shortage of time can affect the operator’s 
ability to think clearly and consider alternatives. It 
may also affect the operator’s ability to perform. 
Multipliers differ somewhat, depending on whether 
the activity is a diagnosis activity or an action.” 

The SPAR-H Step-by-Step (Whaley et al. 2014,
page 2-4) gives the following definitions of the
levels for available time in SPAR-H:

Inadequate time—the time margin is negative 
because less time is available than is required. 

Barely Adequate Time—the time margin is zero 
because the time available equals the time required 

Nominal Time—there is a small time margin 
because the time available is slightly greater than 
the time required. 

Extra Time—the time margin is greater than 
zero but less than the time required; the time avail- 
able is greater than the time required 

Expansive Time—the time margin exceeds the 
time required; the time available is much greater 
than the time required. 
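
The level definitions above can be written down as a small Python sketch. The point of the sketch is that it cannot be completed from the definitions alone: the boundary between "nominal" and "extra" time needs a threshold for what counts as a "small" margin. The parameter `nominal_margin_fraction` is a hypothetical choice introduced only for illustration, as is the function name.

```python
def available_time_level(time_available, time_required,
                         nominal_margin_fraction=0.1):
    """Classify available time using the level definitions listed above.

    nominal_margin_fraction is NOT part of SPAR-H; it stands in for the
    analyst's judgement of what a "small" time margin is.
    """
    margin = time_available - time_required
    if margin < 0:
        return "inadequate"
    if margin == 0:
        return "barely adequate"
    if margin > time_required:
        return "expansive"
    if margin <= nominal_margin_fraction * time_required:
        return "nominal"
    # A margin exactly equal to the time required is not covered by the
    # definitions; this sketch treats it as "extra".
    return "extra"
```

With a required time of 10 minutes, 10.5 minutes available is classified as nominal and 15 minutes as extra under the assumed 10 % threshold; a different threshold would move cases between the two levels.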

With these definitions, subjective evaluation that depends on the characteristics of the tasks and the contexts is necessary to decide on a level of available time for a particular task. There is no way to define the level objectively, since there is no objective description of how much time one should assume for the different levels. In addition, it is unclear what the unit of analysis is in SPAR-H. The authors of SPAR-H have claimed that it is an analysis of the average operator. However, if there is barely adequate time for the average operator, one might expect that there is too little time for slower-than-average operators. How much failure one expects becomes circular with the definition of the unit, and the unit of analysis is not well defined in SPAR-H.

HEART has a similar EPC, which is described as: “A shortage of time available for error detection and correction”. HEART gives no advice on how the analyst should go about analyzing this EPC. As shown in the discussion of SPAR-H, a lot of information needs to be clarified to analyze this EPC, and since it is not available in the method, it is left to each analyst’s subjective judgement. A peculiarity of HEART is that the analyst is not instructed on how he/she should go about collecting information about the GTTs and the EPCs. The EPCs are defined by one sentence, and it is then up to the analyst to interpret how this sentence fits their contexts/tasks, or what the sentence actually means: for example, how much shortage of time should exist before the maximum predicted nominal amount is chosen.

These characteristics of HRA methods (such as SPAR-H and HEART) show that they are more qualitative evaluation methods than objective quantitative test methods, and that they are far from being the latter. It is questionable whether we should expect methods that include so few definitions and descriptions (content validity) as SPAR-H and HEART to show interrater reliability. To obtain interrater reliability, the concepts need to be precisely defined. However, with a qualitative evaluation view of the method, we will focus more on the analyst’s ability to predict correct error rates based on qualitative evaluations with the use of an HRA method. With this approach, we might not expect high interrater reliability, but would rather look at the quality of the data and the evaluations that the prediction is based on.

We claim that if HRA methods are going to be tested with quantitative methods, they need to be improved, and that the place to start is to develop good definitions of the content of the concepts included in the method. If good definitions exist, it might be possible to develop measurement scales for the PSFs/EPCs. With such measurement scales, it might be possible to predict how different levels of PSFs, such as complexity, affect performance.

However, it is questionable whether this is possible. It could be that the different elements included in HRA cannot be defined so well that quantitative measurements can be developed. For example, for time available, it could also be that the tasks and contexts that are evaluated in HRA are so different that it is not possible to develop an exact meaning of the PSFs and the PSF levels/strengths that holds for all kinds of contexts and tasks. If this is the case, the qualitative evaluation view of HRA methods might be better, but then the qualitative part of the analysis also needs to be further developed.


2.2 Challenges with data from psychological and 
human factors studies (literature reviews) 


HEART was developed based on the human factors
literature (Williams, 1992). Williams (1992) and
Williams & Bell (2017) have done literature reviews
to investigate studies that include EPCs and their
maximum multipliers and nominal error rates on
GTTs in experimental designs. Laumann,
Sandal & Rasmussen (Laumann & Rasmussen,
2016; Rasmussen, Sandal & Laumann, 2015;
Rasmussen & Laumann, in review) have also done
literature reviews to investigate the meanings of the
PSFs and how large an effect the PSFs have on
human errors in tasks.

One challenge with using literature reviews on 
psychological and human factors studies to collect 
information for HRA is that these studies were not 
designed with the purpose of testing HRA methods 
and therefore it is difficult to transform the data to 
fit the HRA method. For example, the studies in
the literature usually include only one negative level
for a PSF, and this level is difficult to match with the
level descriptions in, for example, SPAR-H. It is also
difficult to match the description of a PSF to the
PSF manipulated in this kind of study.

It is not obvious how the literature review to
collect information on EPCs in HEART was done.
Williams and Bell (2017) say that they have looked
for the maximum multipliers of the EPCs. However,
human factors studies usually do not
intend to manipulate the maximum multipliers of
the EPCs. They usually manipulate just one level,
and it is often not described or discussed what this
level actually is. In addition, the experiments usually
intend to study why the PSFs/EPCs have an
influence on performance rather than how much
they affect performance. Measurement scales for the
PSFs/EPCs have also not been developed in the human
factors literature. We have looked at some of
the studies that are referred to by Williams & Bell
(2017), and it is difficult to see that the maximum
multiplier, in the meaning of ‘the highest possible
negative multiplier’ for the PSFs/EPCs, was the
manipulation in these studies.

We are hopeful that Williams & Bell (2017) will
present more from their literature review and discuss
the evidence for the EPCs’ maximum multipliers
and nominal failure rates on the GTTs. In this
way, other researchers can better understand the
authors’ arguments, and perhaps add to and relate
this evidence to other methods. We do not think
one should look at these data as objective evidence
for an EPC/PSF, but rather as an evaluation of
the available evidence. Since it is an evaluation, it
is important to understand the authors’ arguments
for, for example, including or excluding experiments,
and their arguments about how similar the
experimental manipulations found in these experiments
are to the concepts in HEART.


2.3 Challenges with performing new experiments
to collect data that is relevant for HRA 


The unspecific definitions of the concepts in the
HRA methods are also a large challenge for developing
experiments, since it is difficult to develop
manipulations and measurements that fit the
HRA methods. An example of this is an experiment
performed by Liu and Li (2014) where experimental
data were compared to the multipliers in SPAR-H.
In this experiment, one can see the difficulties
the authors had in matching the definitions of the
PSFs and the levels in their experiment to the SPAR-H
definitions and levels. For example, for experience
and training, the first 20 trials were defined as
the negative level and the later 20 trials as the
nominal level. This manipulation does not fit the
negative level description of experience and training
given in SPAR-H, which is less than 6 months
of relevant experience and training. It was difficult
to develop an experimental manipulation that
fits this level description in SPAR-H. In this
experiment, complexity was also manipulated, but
the measurement of complexity measured the complexity
of the procedures, and it is then a question
whether this should have been regarded as complexity
or as procedures. There were also questions about how
to match the manipulated levels to SPAR-H levels
for complexity and available time.

For the HEART EPCs, one should in experi- 
ments only manipulate the maximum strength of 
the EPCs, since these are the elements included 
in the method. However, the maximum
strength of an EPC usually does not seem to be a
meaningful experimental manipulation if the
“maximum multiplier” is interpreted literally. For example,
in an experiment on EPC 1 (Williams, 1992, p.22): 
‘Unfamiliarity with a situation which is potentially 
important but which only occurs infrequently or 
which is novel’ one would give the participants no 
training on a completely new task. In this situa- 
tion, one would have expected a human error prob- 
ability for failure close to 1, and we might not need 
to test this because the result is too obvious. 

For some of the GTTs in HEART and the nominal
tasks in SPAR-H, an obvious challenge for new
experiments is the number of subjects needed when
errors are expected to occur in 1 of 100, 2 of 100 or 1
of 1000 subjects.
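
To give a feel for the numbers involved, the following sketch computes how many subjects (or trials) would be needed to observe at least one error with a given probability. It assumes that all subjects fail independently with the same error probability; this assumption, the 95% target and the function name are illustrative choices and not part of HEART or SPAR-H.

```python
import math


def subjects_for_one_error(p_error, confidence=0.95):
    """Smallest n with P(at least one error among n subjects) >= confidence.

    Assumes independent subjects with a common error probability p_error,
    so that P(no error among n subjects) = (1 - p_error) ** n.
    """
    if not 0.0 < p_error < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("p_error and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_error))


# The error rates mentioned in the text: 1 of 100, 2 of 100 and 1 of 1000.
for p in (0.01, 0.02, 0.001):
    print(f"p = {p}: about {subjects_for_one_error(p)} subjects")
```

Observing a single error is only a lower bound. Estimating the error rate with useful precision, or comparing two PSF levels, would require considerably more subjects.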


There are many challenges with performing
experiments that are relevant for HRA. One frequently
mentioned challenge is that if these experiments
are to be done with actual operators in a
simulated control room, the cost of the experiments
is high and little data will be collected
(Boring et al., 2012).

Another challenge that exists when performing 
experiments on PSFs (with both operators and 
other participants such as students) is that PSFs 
are difficult to completely control and without 
that control, it is difficult to measure the independ- 
ent effect of the different nominal tasks and/or 
the PSFs/EPCs. We have seen in experiments that 
other PSFs than those manipulated often affect the 
results. For example, poor teamwork is a variable 
that is difficult to control, since this might exist 
within the crews before they come to the experi- 
ment, or develop during the experimental run. 
Other examples of PSFs that are difficult to control
are operators who are ill, hungry, stressed, fatigued
or demotivated. The crews themselves might also 
increase the complexity of a scenario by some erro- 
neous actions or by forgetting a procedural step. 
Then the manipulation for some crews might be 
different from the one the experimenter planned. 

PSFs tend to exist before the
experiments or to occur during the experiments, and
they cannot be completely controlled. Some might be
observed (e.g. poor teamwork), while others
might be more hidden from the experimenter (e.g.
fatigue, stress and illness).

As an example of this, the first author experienced
this in an experiment at the Halden Reactor
Project where we intended to manipulate available
time and information load, but for some of the
crews we observed that poor teamwork also
occurred. The poor teamwork and short available
time combined had a very negative effect on
performance for some of the crews (Laumann,
Braarud & Svengren, 2005).

Another challenge with simulator
experiments with operators is that the simulators,
such as the simulator at the Halden Reactor
Project, are often computer-based simulators
rather than the analog simulators that the operators
use at their own plant, making it difficult to know
how much the “new interface for the operator”
PSF interacts with the manipulated PSFs and how
this affects error rates.


2.4 Challenges with event reports as a basis for
HRA data


One possible source of data in HRA is event
reports. One problem with using event reports
as a basis for HRA data is that event reports
only investigate and report cases where an error occurred,
so we do not know whether the PSFs are usually
present when the task is done or whether there were
some particular PSFs that were present when the
error occurred (Kim et al. 2015).
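
The point about the missing denominator can be made concrete with a small calculation. All counts below are invented for illustration and the function names are hypothetical: event reports cover only the error cases, so they show how often a PSF was present given that an error occurred, whereas HRA needs the error probability given the PSF, which also depends on the unreported successful task performances.

```python
def share_of_errors_with_psf(errors_with_psf, errors_without_psf):
    """What event reports alone can give: P(PSF present | error)."""
    return errors_with_psf / (errors_with_psf + errors_without_psf)


def error_probability(errors, successes):
    """What HRA needs: P(error | condition); requires the unreported successes."""
    return errors / (errors + successes)


# Invented event reports: 8 errors with time pressure present, 2 without.
reported = share_of_errors_with_psf(8, 2)

# Two invented worlds that would produce exactly the same event reports.
# World 1: time pressure is rare, so it is strongly linked to errors.
world_1 = (error_probability(8, 92), error_probability(2, 9898))
# World 2: time pressure is almost always present, so it tells us little.
world_2 = (error_probability(8, 7992), error_probability(2, 1998))

print(f"P(time pressure | error) = {reported:.2f}")
print(f"World 1 multiplier: {world_1[0] / world_1[1]:.0f}")
print(f"World 2 multiplier: {world_2[0] / world_2[1]:.0f}")
```

Both worlds yield identical event reports, but the implied multiplier for the PSF differs by orders of magnitude, which is why the reports alone cannot support a multiplier estimate.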

Another issue is that event reports are often
written by operators who probably do not have
much knowledge about PSFs/EPCs and how to
investigate the presence of PSFs/EPCs. The event
reports that we have seen have not been very specific
about how the data on the PSFs were collected
and how the data were interpreted. In addition, the
strengths or levels of the PSFs are not defined in
the event reports, and much interpretation has to
be done to decide on a specific level or strength.

Another problem with event reports might also 
be that the events occur so infrequently that they do 
not give much data for HRA (Boring et al. 2012). 

It is also a problem for HRA that usually
more than one PSF has occurred
in the event reported, which makes it difficult to estimate
the effects of the orthogonal PSFs/EPCs that are
included in the HRA methods.

It is also a problem with event reports that many
organizations prefer not to be open about such
matters as human errors and why they occur, since
this is regarded as sensitive information.


2.5 Challenges with databases


There have been several attempts to develop databases
for HRA data, which included data from
event reports, literature reviews, and/or
experiments or simulations. Examples of such
databases are NUCLARR (Gertman et al., 1990),
HERA (Hallbert et al., 2006) and CORE-DATA (Gibson,
Basra & Kirwan, 1999).

The challenges described for literature reviews,
experiments and event reports are also challenges
for databases, because this is the information that is
entered into the databases.

The general purpose of a database is to organize
data in some predefined way. For databases too,
the unclear definitions in HRA are a large issue,
because when a database structure is developed, the
definitions, for example of PSFs/EPCs and PSF
levels/EPC strengths from one or more methods,
have to be used as a template, and the data then have
to be interpreted based on these definitions in the
database. Data from databases are never going to
be better than the data included in the first place.
To include quality data in a database, a good and
precise structure and definitions, which were also
used during the data collection from either event
reports or experiments, are required.
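
What a predefined structure means for data entry can be sketched as follows. This is a hypothetical record layout and not the schema of NUCLARR, HERA or CORE-DATA; the level names are simplified placeholders loosely modelled on SPAR-H. The point is that an observation can only be stored if its PSF and level match a definition in the template.

```python
from dataclasses import dataclass

# Hypothetical template: allowed levels per (method, PSF). The level names
# stand in for whatever precise definitions a method would have to supply.
LEVEL_DEFINITIONS = {
    ("SPAR-H", "available time"): {"inadequate", "barely adequate", "nominal",
                                   "extra", "expansive"},
    ("SPAR-H", "complexity"): {"highly complex", "moderately complex", "nominal"},
}


@dataclass(frozen=True)
class PsfObservation:
    method: str           # HRA method whose definitions were used
    psf: str              # PSF/EPC name as defined by that method
    level: str            # level/strength, restricted to the template
    source: str           # e.g. "event report", "simulator experiment"
    error_occurred: bool

    def __post_init__(self):
        allowed = LEVEL_DEFINITIONS.get((self.method, self.psf))
        if allowed is None:
            raise ValueError(f"no definition for {self.psf!r} in {self.method!r}")
        if self.level not in allowed:
            raise ValueError(f"{self.level!r} is not a defined level of {self.psf!r}")


ok = PsfObservation("SPAR-H", "complexity", "nominal", "simulator experiment", False)
print(ok)
```

An entry such as "a bit complex" would be rejected by this sketch, which mirrors the problem described above: data collected without the template's definitions have to be reinterpreted before they fit the database.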

Since the HRA methods include such diverse
definitions of their elements, one structure
in the databases is necessary for each method,
which is very resource demanding.

In HRA, the structure and purpose of the
databases are often not clearly specified, and the
argument for them seems to be that one day, some
analyst (a very smart analyst) could find a good way
to analyze these data in a way that fits HRA. However,
if the developers of the database do not know
more exactly how the database should be used
and what its purpose is, the work invested
in it might be useless. One might wonder whether all the
resources spent on developing HRA databases have been a
good investment, given the amount of data relevant
for HRA that has been provided so far.


3 CONCLUSION 


HRA methods have been criticized for the lack
of predictive data and validation of their results.
However, it seems that HRA methods are criticized
for not being something they were never intended to
be: quantitative test methods. It is not enough to
say that HRA methods are methods to estimate
human reliability on tasks. Within HRA one should
make a choice: whether we should look at HRA
methods as qualitative evaluation methods that
give gross and crude differences based on expert
judgement, or whether HRA methods should be developed
to be quantitative test methods. The inventors
of SPAR-H and HEART sometimes seem
to present their methods as much more objective
quantitative test methods than they actually are.
Of course, this is likely because they were developed
to support probabilistic risk assessment, where a
quantitative result is required. ATHEANA went
in another direction and defined an HRA method
that is mainly a qualitative method, with an
expert-based quantification technique added at the back
end. We think that SPAR-H and HEART are also
mainly qualitative methods, requiring substantial
analyst judgement in order to produce the
quantitative result.

To be a quantitative test method, an HRA method needs
content validity and very good and specific definitions
of the concepts the method includes, definitions of
measurement scales for the concepts, definitions
of what the unit of analysis is, and definitions of
how a task should be defined for that method. An
important question is: Are these definitions pos- 
sible within HRA? Maybe concepts such as PSFs 
and EPCs might be too difficult to be precisely 
defined, because the concepts include too much, 
and because they vary too much from context to 
context or from task to task. It might not be possi- 
ble to develop definitions and measurement scales 
that can apply for all of the contexts and tasks 
where the HRA methods are used. 

However, if we define HRA methods as qualitative
evaluation methods, criteria for a good qualitative
analysis should be developed and discussed.
Qualitative research methods have other ways
to evaluate the quality of the research than
quantitative research methods. One paper by Laumann
(in review) presents criteria for good qualitative
analysis and discusses how these could be applied
to HRA.

It might be that when a qualitative method is used,
the definitions do not need to be that specific
and can be allowed to vary more between contexts.
However, even with a qualitative evaluation
method view, the best possible definitions and
advice about how to perform the analyses should
be available.

This question about how we should look at 
HRA methods should be answered based on what 
we think about our data. Is it possible to precisely 
define the different elements in an HRA method that
can be used across different contexts and tasks? If 
this is not possible, we have to collect HRA data 
with a qualitative approach. 

After working for many years with HRA methods,
definitions of PSFs and their levels, and performing
experiments within HRA, we doubt that
it is possible to define and specify the PSFs/EPCs
and measurement scales for the PSFs/EPCs well enough
that a quantitative approach is possible. An alternative
for HRA is then to focus more on developing
good qualitative methods for evaluation of PSFs/
EPCs, PSF levels/EPC strengths, and error rates.

As HRA methods are today, it would be best
to just admit that they are qualitative expert judgement
methods trying to predict crude differences
in performance, and that they are far from being
objective test methods that can be empirically validated
with quantitative methods. However, HRA
methods are not good and systematic qualitative
methods either, and improvements in the descriptions of
how the qualitative analyses should be performed
are also needed. A qualitative method approach
might demand less specification than a quantitative
test approach.

One could ask: if HRA methods are not objective
test methods, why should they predict performance?
There have been some studies to test the
validity of HRA methods, such as the International
HRA Empirical Study (Forester et al. 2014) and the US
HRA Empirical Study (Forester et al. 2014). These
studies do not give an overall conclusion on how
well HRA methods predict human errors. They
give many and varied answers depending on the
task, the HRA method and the analyst. However,
in these studies one might wonder what is actually
tested. Is it the analysts’ ability to use the HRA
method to predict the likelihood of errors, or is it
the HRA method in itself that is validated? If one
assumes that the method was tested, the researcher
should ensure that the HRA method guideline was
reliably followed by the analysts. This is not possible,
since some of the methods, like SPAR-H and
HEART, do not include complete and prescriptive
descriptions of the qualitative parts of the analysis.
In these studies, it seems to be the analysts’ qualitative
evaluation with the help of the HRA method
that was tested and not the method in itself.

HRA methods should not continue to be some- 
thing in between qualitative and quantitative 
research methods, since then they are based nei- 
ther on good qualitative research methods nor on 
good quantitative research methods. Qualitative 
and quantitative research methods have different 
assumptions about quality and have different ways 
to investigate the quality of the method or the 
quality of the research. A choice should be made 
within each method and the choice has to be made 
by the authors of the methods. 


REFERENCES 


Boring, R. et al. 2012. Microworlds, simulators, and simulation:
Framework for a benchmark of human reliability
data sources. In Joint Probabilistic Safety Assessment 
and Management and European Safety and Reliability 
Conference, 16B-Tu5-5. 

Forester, J. et al. 2014. The International HRA Empiri- 
cal Study. Lessons Learned from Comparing HRA 
Methods Predictions to HAMMLAB Simulator Data. 
NUREG-2127, US Nuclear Regulatory Commission, 
Washington, DC. 

Forester, J. et al. 2014. The US HRA Empirical Study — 
Assessment of HRA Method Predictions against Oper- 
ating Crew Performance on a US Nuclear Power Plant 
Simulator. NUREG-2156, US Nuclear Regulatory 
Commission, Washington, DC. 

Forester, J. et al. 2007. ATHEANA user’s guide. NUREG-1880.
US Nuclear Regulatory Commission, Washington, DC.

Gertman, D. et al. 2005. The SPAR-H human reliability 
analysis method, NUREG/CR-6883. U.S Nuclear Reg- 
ulatory Commission, Washington, DC, USA. 

Gertman, D.I. et al. 1990. Nuclear Computerized library 
for assessing reactor reliability (NUCLARR), NUREG/ 
CR-4639. Nuclear Regulatory Commission, Washing- 
ton, DC, US. 

Gibson, H. et al. 1999. Development of the CORE-Data
database. Safety & Reliability Journal, 19, 6-20.

Guba, E.G. & Lincoln, Y.S. 1994. Competing paradigms in
qualitative research. In N.K. Denzin & Y.S. Lincoln (eds),
Handbook of Qualitative Research. Thousand Oaks,
CA: Sage; p. 105-117.

Hallbert, B. et al. 2004. The use of empirical data sources 
in HRA. Reliability Engineering and System Safety, 82, 
139-143. 


Hallbert, B. et al. 2006. Human Event Repository and 
Analysis (HERA) System Overview. NUREG/CR6903. 
Nuclear Regulatory Commission, Washington, DC, 
US. 

Kim, Y. et al. 2015. A statistical approach to estimating
effects of performance shaping factors on human error 
probabilities of soft control. Reliability Engineering and 
System Safety, 142, 378-387. 

Laumann, K. et al. 2005. The task complexity experiment 
2003/2004, HWR-758. Institute for Energy Technology, 
Halden, Norway. 

Laumann, K. In review. Criteria for qualitative methods in 
Human Reliability Analysis. Reliability Engineering and 
System Safety. 

Laumann, K., & Rasmussen, M. 2016. Suggested improve- 
ments to the definitions of Standardized Plant Analysis 
of Risk-Human Reliability Analysis (SPAR-H) per- 
formance shaping factors, their levels and multipliers 
and the nominal tasks. Reliability Engineering & System 
Safety, 145, 287-300. 

Liu, P. & Li, Z. 2014. Human error data collection and
comparison with prediction by SPAR-H. Risk Analysis,
34, 1706-1719. DOI: 10.1111/risa.12199.

Murphy, K.R. & Davidshofer, C.O. 2014. Psychological 
Testing: Principles and Applications. Sixth edition. Pearson
Education Limited, Essex, UK.

Rasmussen, M. & Laumann, K. In review. The evaluation of
fatigue as a performance shaping factor in the Petro-HRA
method. Reliability Engineering and System Safety.

Rasmussen, M. & Laumann, K. 2017. The impact of decomposition
level in human reliability analysis quantification.
L. Walls, M. Revie & T. Bedford (Eds.), Risk, Reliability 
and Safety: Innovating Theory and Practice. Taylor & 
Francis group, London, ISBN 978-1-138-02997-2. 

Rasmussen, M. et al. 2015. Task complexity as a perform- 
ance shaping factor: a review and recommendations in 
standardized plant analysis risk-human reliability analy- 
sis (SPAR-H) adaption. Safety Science, 76, 228-238. 

Swain, A.D. 1990. Human reliability analysis: Need, status, 
trends and limitations. Reliability Engineering & System 
Safety, 29, 301-313. 

Swain, A.D. & Guttmann, H.E. 1983. Handbook of human
reliability analysis with emphasis on nuclear power plant 
application NUREG/CR-1278, Washington, D.C, USA. 

Whaley A.M. et al. 2011. The SPAR-H step-by-step guid- 
ance. INL/EXT-10-18533, Rev 2, Idaho Falls, USA. 

Williams, J.C. & Bell, J.L. 2016. Consolidation of the 
human error assessment and reduction technique. L. 
Walls, M. Revie & T. Bedford (Eds.), Risk, Reliability 
and Safety: Innovating Theory and Practice. Taylor & 
Francis group, London, ISBN 978-1-138-02997-2. 

Williams, J.C. 1985. Validation of human reliability assess- 
ment techniques, Reliability Engineering, 11, 149-162. 

Williams J.C. 1992. A user manual for the HEART. Stock- 
port, UK: DNV Technica Ltd. 


Safety and Reliability - Safe Societies in a Changing World — Haugen et al. (Eds) 
© 2018 Taylor & Francis Group, London, ISBN 978-0-8153-8682-7 


Usability and user experience: Adaption and application for a railway related environment


Marc Burkhardt & Birgit Milius 


Siemens AG Braunschweig, Germany 


ABSTRACT: In the development of consumer products, usability has long been an issue and an
integral part of the development process. For railway systems, usability is still a rather new area with new
requirements. In our paper, we discuss what makes usability and the newer research area of user experience
special for railways, and how certain features of the railway system, e.g. high requirements regarding
safety, influence how usability can be included in the development process.


1 INTRODUCTION 

In most industries, usability is a concept which has
been applied for years. In railways, at least in Germany,
the application of usability methods and tools is
still rather new. In our paper we give an overview
of the general objectives of usability and user
experience. We present an example of how usability
methods are applied in the railway sector. In
the later chapter, we discuss an aspect which makes
usability in railways special. Railway applications
often have safety requirements which need to be
taken into account. This can lead to solutions
which do not have very good usability. We discuss
this issue and its implications for the
inclusion of usability in a railway systems development
process.


2 USER EXPERIENCE AND USABILITY 


Usability is defined in ISO 9241-11 as “The extent 
to which a product can be used by specified users to 
achieve specified goals with effectiveness, efficiency, 
and satisfaction in a specified context of use.” 
This means that the technical systems should be 
designed in a way that a user can concentrate com- 
pletely on the task at hand and is not distracted by 
the interface design. Consequently, good usability
can only be achieved when the resources and
conditions of a human’s perception and cognition 
are taken into account. 

However, usability only looks at specific aspects
of human-machine interaction. Especially in
recent years, the more holistic concept of user
experience has been developed and researched. ISO
9241-210 defines user experience as “a person's
perceptions and responses that result from the use or
anticipated use of a product, system or service”.
User experience includes all the users’ emotions,
beliefs, preferences, perceptions, physical and
psychological responses, behaviors and accomplishments
that occur before, during and after use.
Even more so than usability, user experience is the
result of the product, the user and the context/
environment.

To make a successful product, it is important to
take both aspects into account. It would be a mistake
to assume that because a product has good
usability, the user experience is good. The
reverse is not true either: just because a product
is beautiful or aesthetically pleasing does not
mean that it can be used easily or intuitively.

While it is obvious why a product which is used
for pleasure, e.g. a game or a mobile phone, should
have good usability and an excellent user experience,
some might argue that an emotional connection
with a product is not necessary in a business setting.
However, on the qualitative side, an emotional
connection to a product or a system makes a person
more relaxed and fosters a better immersion
into work (Hassenzahl 2003, Vorderer 2005). On
the quantitative side, too, it can be shown that user
experience is important. Wright et al. (2007)
showed that personal well-being correlates with
performance in the job. Further research is needed,
but it can be concluded that especially in safety-related
industries this correlation can mean that
satisfied and content users work more reliably.

Despite all the arguments above, in practice
the general ideas of usability are applied more often
than those of user experience when developing new
products. One of the reasons might be that we have
a well-founded set of methods and procedures to
measure usability, whereas a similarly extensive catalog
of tools is missing for user experience.

Figure 1. Usability design process at Siemens AG.


3 USABILITY IN RAILWAYS: AN 
EXAMPLE 


In this chapter, we present an approach for design- 
ing a railway specific system. 

Signaling systems in railways are very complex 
structures which need to be closely monitored to 
avoid major malfunctions which can lead to the 
disruption of railway traffic. Traditionally, the
maintenance supervision of each system or system
group was done separately. However, with
rising complexity, and taking into account new
and existing dependencies between such systems,
it became necessary to transfer all supervisory
maintenance-related tasks into one system. Such
a system allows not only the monitoring of several systems
but also, in cases of disruption, the coordination of
all efforts regarding these systems with the aim of
minimizing disruptions.

Traditionally, when developing such a user interface,
functional requirements were the leading elements.
Due to the new complexity, combined with
usually fewer people being responsible, it becomes
urgent to focus on functional aspects as much as
before, but additionally to take human factors
aspects into account to enable the user to fulfill
the necessary tasks at a high quality. Also, the high
reliability of new railway systems leads to very few
failures. This has the effect that the procedures to
recover from these failures are not as readily available
to a person as they would be if that person were
confronted with them often. Therefore, a supervisory
maintenance system needs to be able to provide detailed
processes and a vast amount of information to enable
and guide the user when solving problems and
repairing systems.

At the beginning of most development processes 
in railways stands a detailed list of requirements. 
The complete list of requirements describes
the scope of the project and is used afterwards
for verification and validation. The following list
gives examples of how usability requirements were
described.


• Req1: Identify which user groups interact
with the system and how they can be
distinguished
• Req2: Identify, at different levels of detail, which
information is needed by which user/user group
• Req3: Identify in which situations and in which
environments/contexts the system will be used
• Req4: Identify which systems with similar tasks/
in similar environments are used and known by
the future users and which aspects of legacy
systems¹ have to be taken into account
• Req5: Identify functions/tasks with safety
relevance, as these need special attention


The process as shown in Figure 1 was used to
get a maximum of information from the users and
to develop a User Interface (UI) with very high
usability. However, because of the way contracts in
this area work, the application is not as straightforward
as it looks.

1. Legacy systems play a huge role in railways as signaling
systems usually (at least in the past) had a very long
life span. Even today, there are still mechanical interlockings
in operation which date back to early last century.

Typically, the first step when deriving the user
requirements is to identify the relevant user groups
and develop the key questions which are needed
to obtain all necessary information. However, especially
when developing a completely new system,
the identification of the user groups is difficult.
The (future) organizational structures, and with
them the functions and hierarchies, are not necessarily
known, so assumptions have to be made. Experience
has shown that it is better for the
development process to start with a more general
description of the user groups and go into detail
later than to start with a too limited description. If
the identified group of potential users is too small,
there is a danger that important aspects are forgotten and
consequently the User Interface (UI) cannot be
used by all users with the same ease. Sometimes,
it helps to think outside the box. In some projects,
the user groups on the operators’ side are generally
known, but the supplier does not have the opportunity to
talk to these people. Here it can help to interview
people with indirect knowledge, e.g. from training,
to get a better understanding of the requirements.

During the interviews it is extremely important 
not to ask for specific functionalities but have peo- 
ple describe their daily tasks and how they are per- 
formed in the current setting. It is not the aim to 
listen to a general process description but rather to 
really understand the motivations behind actions 
and the needs when performing a task. This means that
shortcuts and informal but typical processes should also
become known, as they are an important input to the
development process. The approach becomes more
difficult when safety related, process based tasks 
are to be discussed. If a usability study is done for,
e.g., a traffic controller, the tasks of the job are
described in detail in the rule book, and in general
it is not allowed to deviate from them. There is a
danger that interviews will only show the officially
allowed processes, while potentially safety-relevant
deviations are not talked about. Here, a longer field
study is very helpful. 

As in most other IT industries, the railway
development process follows specific regulations.
In railways, e.g., the CENELEC standards DIN EN
50159 and DIN EN 50129 have very specific
safety-related requirements which have to
be taken into account in UI design. In so-called “context
interviews” it is very important to understand all kinds
of input from these regulations and their influence
on the daily work. We recommend recording
these interviews in audio and/or video.
Experience shows that even with several well-trained
interviewers, very important answers can be missed
during the interview compared to a later
“offline” analysis of audio or video recordings.

Figure 2. Iterative process when designing a UI as
described in ISO 9241-210.

The next step in the process is to derive the
real needs or demands on the system from the
user’s perspective. From these needs and
demands the so-called usage requirements are
generated. This set of usage requirements is the
input for identifying the key tasks and subtasks of
the users’ daily work with the new system.

The goal of this separation of tasks is to find
the concrete subtasks a user wants or has to do and
then to develop usage scenarios from this set of
concrete subtasks. This procedure is time-consuming
but definitely worth doing.

Once you have your set of usage scenarios it is
time to start the first UI prototype design, which
then needs to be tested extensively. It is strongly
recommended to start the UI prototype design with
a paper-and-pencil wireframe. Start
in simple black and white and concentrate on the
content and not on the design. Use placeholders
for parts which should be described in more detail
later and concentrate on the current concrete
subtask. Improve this paper-and-pencil design from
subtask to subtask.

Even though we described the development 
process as straightforward, in reality it is iterative 
and will need several iterations until a final version 
of a UI is developed. The iterative process is visual- 
ized in Figure 2. 

Once you reach a first detailed design of your
system, start using design tools to develop a first
digital version of the design and improve this
design from iteration step to iteration step. Colors
and specific design elements should be introduced in
later iteration steps.


4 USABILITY FOR SAFETY-RELATED 
PRODUCTS: DIALOGUE CRITERIA 


In the previous chapter we described the usability
process as applied by Siemens AG to a supervisory
maintenance system. This process is of a general
nature and only takes the special organizational
requirements of the railway industry into account.
Even though the process might not change
significantly when safety aspects are to be taken into
account, the contents of, e.g., interviews become
more sensitive. Also, at the latest when prototyping
starts, it will become necessary to balance good
usability with the described safety-related processes.

As in any other industry, in safety-related industries
good usability is necessary to help the personnel
to work in a motivated and reliable way. For example,
there is a lot of research going on in the field of health
engineering to provide safe, usable systems for
health care. However, some of the aspects which
are considered to make a product usable cannot
be incorporated in a safety-critical system. As
an example, in this chapter we look at the
seven dialogue criteria from ISO and discuss whether
and under which conditions these criteria can be
used when assessing a safety-relevant dialogue. As
a railway example, we look specifically at the user
interfaces for traffic operators.

Traffic operators usually work in a central location
and get all necessary information via screens. Typically,
at least the tracks are shown, and one can see which
parts of the track are occupied and how the signals
are set. Much more information is shown directly
or via menus. The signals are normally set
automatically, but they can also be handled
manually. In these cases, the tasks of a traffic
controller can be directly safety-critical. Besides
information from the screens, traffic operators get or
deliver information by other means, e.g. by phone.

In the following paragraphs we discuss if the 
seven dialogue criteria can be used to evaluate the 
usability of a display. 

The dialogue criteria are: 


Suitability for the task
Conformity with user expectations
Suitability for learning
Suitability for individualization
Self descriptiveness
Controllability
Error tolerance


An interactive system is suitable for a task when
it helps the user to complete the task. This means
that functionality and dialogue are based on the
task to be done. This is an aspect which should
be applicable to safety systems. However, when
dialogues are prescribed by rules and processes, this
might not always be the case.

A dialogue conforms to user expectations when
it behaves just like other dialogues in the given
system. E.g. if a dialogue has to be confirmed, it
should be confirmed in the same way every time
(e.g. “confirm” and not sometimes “OK”). It also
implies that the system behaves as the user expects
it to (SAP 2017a). This criterion can also be
applied in a safety-related setting. As railways are an
area where many aspects are standardised and
which develops only very slowly, most people
working in the area, e.g. most traffic controllers,
have the same professional background, and their
expectations of the given systems are in general
shaped by the same technical systems. This might
actually make it easier to define a new system, as
the user group is very homogeneous. This criterion
can and should be applied when designing a
display for a traffic controller.

A dialogue is suitable for learning when it helps
the user to learn and understand the system. This
aspect is less important in safety-related
settings, especially in railways. To work
in a control centre, users have to undergo rigorous
training. They are supposed to know the system
very well, and having to learn significant amounts of
new aspects of the system would mean that the training
is not good enough. In other areas, e.g. maintenance
of systems, this aspect might be very relevant
and can help to design systems which actually
make a task easier to achieve.

A dialogue is suitable for individualization
when a user can adapt the man-machine interaction
and/or the display of information to his/her
own preferences. Up to now, adaptation of the
man-machine interface has not been an option for
traffic controllers. One reason might be
the tradition of having all displays look
the same. But there are also scenarios in
which it is necessary that everyone gets the important
information quickly, easily and reliably from
the screen. This is the case, e.g., when a second
opinion is needed in a difficult situation. Today, it
is not possible to apply this criterion. In the
future, research into the positive effects of
individualization, or a change in the general rules of
cooperation in a train control center, could make
the criterion applicable.

Self descriptiveness means that it is obvious to
a user at what point of the process he/she is working
right now, what tasks are to be done and what
options he/she has. In SAP (2017) it is explained that
self-descriptiveness provides simplicity by reducing
users’ memory load. Users can retain their capacity
for their tasks instead of bothering with the system.
They can work more efficiently. For future developments
this might actually be a game changer. The
area controlled by just one controller is getting larger
all the time. It is necessary to provide the controller
with more and well-structured information so that
he/she always has a well-defined picture of
the current situation. Also, railway displays contain
a lot of information, often in abbreviations. Applying
the rule regarding self descriptiveness, this should be
changed or the user should be well supported.

Controllability means that a user is able to
start a dialogue as well as control its direction and
speed. In a safety-related system it is
almost always critical that dialogues are started
and controlled by the system to provide information
on time. This means that the options for a traffic
controller to influence the dialogue are limited.
This might feel uncomfortable to the controller.

Error tolerance means that a user can make errors
and can undo these with acceptable effort. Here, we
have to distinguish between errors in operation in
normal and in degraded mode. In normal mode, errors
either cannot be made because the technical system
prevents them, or they are operational, in which case
they can be undone as long as it is safe. There are
situations which are safe but lead to operational
disturbances, e.g. when a train is led onto a wrong track.
In general, all measures to undo an operational error
are governed by operational rules. In degraded
mode, when technical systems are not or not fully
available, an error can have directly safety-critical
effects. The display might show a critical situation
but might not recognize it as such. Undoing such a
safety-critical error is possible but in general time
critical. As accidents have occurred because of traffic
operator errors (BadAibling 2016), it would be a very
valuable research area to look at how a typical traffic
controller display or complete system set-up can be
changed to at least help to recognize errors.

After having discussed the criteria one after the
other, we will now look at a concrete example. A
typical example in railways where the designed process
does not fully comply with usability standards
for dialogues is how safety-critical actions are
safeguarded. In these cases the so-called “KF Bedienung
(command release task)” has to be performed. After
pushing the necessary buttons to, e.g., allow a train
to pass a red signal, a window will pop up and ask
the user if the planned action is correct and sensible.
This has to be acknowledged after a given time.
During this time the user should take the opportunity
to become aware again of the complete situation on the
tracks. The idea behind this is to make sure that such
actions are not taken lightly or by mistake. The authors
are well aware that from a human factors point of view
this process is flawed. However, for the sake of this
paper, we only look at the design of the dialogue.
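
To make the structure of this two-step dialogue easier to discuss, the
following is a minimal sketch of its logic. It is not taken from any real
interlocking or control system: the class name, the state names and the
length of the waiting time are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommandRelease:
    """Hypothetical two-step release of a safety-critical command."""
    command: str
    min_wait_s: float                     # assumed minimum reflection time
    requested_at: Optional[float] = None
    state: str = "idle"                   # idle -> pending -> released / cancelled

    def request(self, now: float) -> None:
        # Step 1: the operator selects the command and the dialogue opens.
        self.requested_at = now
        self.state = "pending"

    def acknowledge(self, now: float) -> bool:
        # Step 2: the acknowledgement is only accepted after the waiting time.
        if self.state != "pending":
            return False
        if now - self.requested_at < self.min_wait_s:
            return False                  # too early, the dialogue stays open
        self.state = "released"
        return True

    def cancel(self) -> None:
        # The operator cannot alter the dialogue but can stop the command.
        if self.state == "pending":
            self.state = "cancelled"
```

In this sketch nothing can be undone once the state is "released", and the
operator's only influence on a pending dialogue is to cancel it.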


• Suitability for the task: Whether this dialogue
is suitable for this task is an ongoing
research question. Due to several human factors,
it can be argued that the goal of this approach
will not be reached. By designing the dialogue
more intelligently than it is now, and taking into
account how the brain usually works, the dialogue
can be changed to be suitable for the
task.

• Conformity with user expectations: As the traffic
controllers are trained on these systems and the
general processes are defined in rules and regulations,
the dialogue definitely conforms to the
standards of the system and the expectations of
the users.

• Suitability for learning: The dialogue does not
actually promote better learning or understanding
of the system but focuses on repetition. The
second dialogue can be confirmed (and probably
often will be) without a deeper examination of
the situation.

• Suitability for individualization: No individualization
is possible, as this type of dialogue is
fixed in the system.

• Self descriptiveness: The dialogue makes it clear
what actions are expected from the controller.

• Controllability: The user does not have the
chance to manipulate the dialogue, but he/she can
stop the command.

• Error tolerance: There is little error tolerance. If
the user actually made a mistake and realises
it, he/she cannot undo the wrongly given command.
Alternative processes are designed for
these cases.


Overall, the general assessment of the usability
of the dialogue for the command release task is
not very good. As it is known that there is a correlation
between good usability and reliability/safety,
the results of this assessment can be used to start a
general discussion about how to design such processes
in the future. The design process should take
into account the basics of usability and combine
them with the requirements of safety-related,
rule-based actions.


5 USABILITY AND THE SYSTEM LIFE 
CYCLE PROCESS 


Another issue is the interweaving of usability needs
with the needs of the system life cycle. The system
life cycle is strictly structured into different phases
in, e.g., several standards. In an ideal world the
usability process for safety-critical systems will interact
with many of these system life cycle phases.
This was explained for general human factors issues
in Milius (2017). However, as we do not have a very
formalised process for human factors integration,
it is easier to imagine an interweaving of human
factors and risk assessment. For usability, a very
structured process already exists, as was shown in chapter
X of this paper. Combining both approaches is necessary
for better products but is also a difficult task.


6 CONCLUSION 


In the previous chapters we started the discussion
regarding the integration of usability issues into
the railway system development processes. We
focussed on usability rather than user experience
because the latter is still too new and poses several
challenges before it can become part of a formalized
process. The given example has shown how
usability is treated by Siemens AG when a new, not
directly safety-related system is developed. Due to
the fact that all tasks with a connection to safety
operations are strongly regulated, even in the area
of maintenance there are special aspects which
distinguish railway usability from, e.g., usability for a
consumer product.

Another level of difficulty arises when usability
is applied to the design of safety-related systems.
To show the possible effects, the generally acknowledged
dialogue criteria were discussed and it was
assessed whether they can be applied to safety-relevant
functions. It is obvious that several of the criteria
cannot be applied fully and/or directly. This was
also shown using the command release task for
safety-critical actions as an example. Here, a lot of
research still needs to be done.

Usability is just one part of human factors. As was
discussed in earlier papers (Milius 2017), recent incidents
in railways show that even in highly automated
systems humans still have to make safety-critical
decisions. While development in the past focused
heavily on replacing the human, newer research
talks about effectively integrating the human in all
necessary processes. The challenge will be that we
start from a very different set-up than before. As
humans are more and more just supervisors, we have
to make sure that they still understand what is going
on, and we have to use modern technology to keep
them informed and to provide help where necessary.
This means that we have to integrate new functions
in our new and existing systems. But it also means
that we have to update existing processes to take
the human, and usability, into account. Regarding
general human factors integration, some ideas were
already discussed in Milius (2017), but we need to
focus on usability and the life cycle as well. Last but
not least, providing clear processes for how to integrate
usability analysis into the general development process
will lead to better acceptance.

ABSTRACT: Airplanes, ships, nuclear power plants and chemical production plants (including oil & gas
facilities) are examples of industries that depend upon the interaction between operators and machines.
Consequently, to assess the risks of those systems, not only the reliability of the technological
components has to be accounted for, but also the ‘human model’. For this reason, engineers have been working
together with psychologists and sociologists to understand cognitive functions and how the organisational
context influences individual actions.

Human Reliability Analysis (HRA) identifies and analyses the causes, consequences and contributions
of human performance (including failures) in complex sociotechnical systems. Generally, HRA research
concentrates on modelling workers’ performance at the “sharp end”, assessing those directly involved
in handling the system, especially operators. However, in theory, a reliability analysis can be applied to any
kind of human action, including those of designers and managers.

This research will evaluate a way of conducting HRA in the design process, as previous research has
demonstrated that design failure is the predominant contributor to human errors (Moura et al., 2016).

A Bayesian Network (BN), a systematic way of learning from experience and incorporating new evidence
(deterministic or probabilistic), is proposed to model the complex relationships between cognitive
functions, organisational and technological factors. Conditional probability tables have been obtained
from a dataset of major accidents from different industry sectors (Moura et al. 2017), using a classification
scheme developed by Hollnagel (1998) for an HRA method called CREAM (Cognitive Reliability
and Error Analysis Method).

The model makes it possible to infer which factors most influence human performance in different scenarios.
We will also discuss whether the model can be applied to any human action throughout the project life cycle,
from the design phase to the operational phase, including their management.


1 INTRODUCTION


1.1 Human Reliability Analysis (HRA)


Human reliability analysis can have three objectives:
to identify human performance (as failures and
their consequences), to quantify the likelihood of
failure (and error recovery) and to reduce or remediate
those errors in the system (Kirwan, 1997).

The expected results of such a study can be either
qualitative or quantitative, depending on the
industry sector best practice, data availability and
regulatory requirements.

Quantitative results for HRA mean giving
human performance a number, a probability of
occurrence: the so-called Human Error Probability
(HEP). This gives decision-makers the
opportunity to decrease the HEP to as low as practicable
by tackling the factors that impact it, or to
check whether a certain risk criterion is met.

HRA research, practice and regulatory require- 
ments are currently focused on operation and 
maintenance workers—called ‘sharp-end’ work- 
ers—those who actually interact with the processes 
(Reason, 1990; Hollnagel, 1998). 


1.2 HRA in design phase 


Can human reliability analysis be applied in other 
phases of an industrial project, such as design, con- 
struction, commissioning and decommissioning? 

Theoretically speaking, it is possible: where
there is human action, there is the possibility to
model, analyse and measure performance
(Hollnagel, 1998).

This research will focus on the design phase and 
design changes during other phases, as there is evi- 
dence from previous studies that design failure is 
the organisational factor that most triggers human 
failure actions (Moura et al., 2016). 

One of the constraints of this approach is that
there is limited public data on human (engineers’
and managers’) performance in the design stage,
which prevents detailed task analyses.

However, it is also known that design failures
identified in later stages of the lifecycle (i.e. the
operational phase) are much more expensive to correct
than those detected during the design stage
(Kohler and Moffatt, 2003).

Thus, it is believed that understanding engineers’
and managers’ performance during the
design phase has the potential to motivate
improvements in organisational design procedures,
based on overall accident patterns.

It is a trade-off between having perfect data but 
not sufficient resources to make design changes in 
the operational phase and having imperfect data 
but sufficient resources to improve the design in 
earlier stages of the lifecycle. 


1.3 Can human performance influence design? 


Design failure is often considered an organisational
factor in HRAs, as the methods and assessors
assume that it influences human performance
and not the opposite.

In contrast, there are studies outside the safety
and engineering community showing that organisations
are not an unmanned box of procedures,
but individuals deciding whether to use them, based
on other factors such as regulations, knowledge and
resources (Rocha Fernandes et al., 2005). Besides,
those individuals, usually middle and front-line
managers, have a significant influence at all levels
of the organisation, dictating and implementing
the organisational strategy (Wooldridge et al.,
2008; Purcell, 2007).

Another key aspect of this discussion is recognising
the difference between ‘managing the design’
and ‘managing design changes’. First, because they
can occur in different phases of a project and are thus
managed by completely different team profiles.

Second, because decision-making in engineering 
practice can have two distinct meanings: ‘design 
decisions’ are the ones about the design itself (e.g. 
which equipment to choose in a system), while 
‘management decisions’ are the ones about the 
team responsible for designing the system or issues 
that impact this team (Herrmann, 2015). 

Each of these different concepts leads to a dif- 
ferent kind of performance to analyse. 


2 METHODOLOGY 


2.1 Classification scheme used


The classification scheme is the collection
of error modes (cognitive functions and
human actions) and the Performance Shaping Factors
(PSFs) which shape the context that triggers
each error mode.

To achieve the aim of this research, it is essential
to use an HRA classification scheme that recognises
cognitive functions, as both ‘design decisions’
and ‘management decisions’ cannot be evidenced
only by the actions described in most classification
schemes from the first generation of HRAs.

For this reason, a classification scheme of the
second generation of HRA was chosen (see
Hollnagel, 1998, to understand the differences between
the first and the second HRA generation).

Of the publicly available ones, there are only
two methods from the second generation that are
considered useful to the Major Hazard Directorates
of HSE, the UK safety regulator (Bell and
Holroyd, 2009): CREAM (Hollnagel, 1998) and
ATHEANA (Forester et al., 2007).

Of these two choices, the classification scheme of
CREAM (i.e. the Cognitive Reliability and Error
Analysis Method) was chosen to conduct this
research, as it shows a clear distinction between
causes and manifestations. This enables the application
of the method in both directions: to analyse
major accidents retrospectively and to predict
events as a traditional HRA method. This
feature made it possible to use a pre-existing dataset
of major industrial accidents (Moura et al., 2016)
in the current work, as explained in the next section.

This classification scheme splits cognitive functions
into two categories: analysis (the mental
processes used when someone tries to understand a
problem) and synthesis (the mental processes used
to solve the problem). These are further split
into subcategories, as summarized in Table 1.


Table 1. Summary of errors of cognition used in CREAM.

Category   Subcategory      Error of cognition
Analysis   Observation      Observation missed
                            False observation
                            Wrong identification
           Interpretation   Faulty diagnosis
                            Wrong reasoning
                            Decision error
                            Delayed interpretation
                            Incorrect prediction
Synthesis  Planning         Inadequate plan
                            Priority error

One objection that may be raised against
this choice is that most HRAs performed in practice
are from the first generation (Zwirglmaier
et al., 2015; Henderson and Embrey, 2012), such as
THERP, HEART and SPAR-H. Also, according
to CREAM’s creator, all the PSFs presented in the
classification scheme are still useful, apart from
cognitive reliability, which he considers a ‘misleading
oversimplification’ (Hollnagel, 2012). According
to him, “explaining human performance as based
on ‘cognitive processes’ represents a myopic information
processing view, and talking about the reliability
of such processes is an artefact of the PRA/
PSA mindset”.

However, the current research is not using
CREAM as an HRA method, limiting the discussion
to the assessment of the HEPs, as a way to
disclose possible improvements.


2.2 Data used 


To generate Human Error Probabilities (HEPs), or to
validate HRA methods, different types of data are
used. Kirwan (1997) classified them as: (i) real or
operationally derived data (i.e. from incidents and
near misses), (ii) simulator-derived data, (iii) data
from the psychological and ergonomics performance
literature, (iv) expert judgement, (v) other techniques.

Data from real operation are considered to be of the
highest quality, but are also the most difficult
to obtain. That is because, to achieve an
absolute result for the HEP (number of observed
errors divided by the number of opportunities for error),
both the numerator and the denominator of the equation
should be assessed by observing each
human action throughout an industry lifecycle. This is
impractical, as one would have to count even the actions
and errors that have not led to incidents.
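
The ratio described above can be written down in a few lines. The sketch
below is only an illustration of the definition; the function name and the
counts are assumptions and are not taken from any dataset discussed in
this paper.

```python
def human_error_probability(observed_errors: int, opportunities: int) -> float:
    """Point estimate of the HEP: observed errors / opportunities for error."""
    if opportunities <= 0:
        raise ValueError("opportunities must be positive")
    if not 0 <= observed_errors <= opportunities:
        raise ValueError("observed_errors must lie between 0 and opportunities")
    return observed_errors / opportunities


# Illustrative counts only (assumed, not measured).
hep = human_error_probability(observed_errors=3, opportunities=1000)
print(f"HEP = {hep:.3f}")
```

The difficulty named in the text sits in the second argument: the
denominator is rarely observable in real operation.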

For this reason, much research is being conducted
using operationally derived data such as near
misses (i.e. events with the potential for undesirable
consequences (CCPS, 2007)) and accidents that occurred
in industrial installations.

Preischl and Hellmich (2013) used data from
near misses that occurred at German nuclear power
plants to construct their model for estimating HEPs,
in order to validate the THERP handbook
estimates. Groth and Mosleh (2012) used the
HERA database, built from retrospective analyses of
risk-significant events at nuclear power
plants that contain at least one human error.

In the current research, it was decided to use a
dataset derived from major accidents in different
industrial sectors, not yet tested as a model to estimate
HEPs: the MATA-D (Multi-attribute Technological
Accidents Dataset), built by Moura et al. (2016).

Unlike near-miss reports, investigation
reports of major accidents have the potential
to uncover more PSFs that trigger a human error.
This is because major accident investigations
usually use several man-hours of an expert team
(Moura et al. 2016), aiming to achieve an increased
depth of analysis and eventually reaching root causes
such as organizational issues (CCPS, 2007).

MATA-D has been derived from the analysis of 
238 accident reports from different industrial sec- 
tors using the same classification scheme, with the 
intention to optimise the learning from cross-sec- 
tor accidents. All the reports had evidence on the 
presence of organisational, technological and per- 
son-related factors, the PSFs described in Table 3. 
Also, nearly half of all the reports had indications 
about the cognitive functions and actions executed, 
described in Tables 1 and 2. 

The MATA-D dataset is a table of 238 accidents 
by fifty-three parameters (thirty-nine factors, ten 
cognitive functions and four erroneous actions), 
where the number one represents the presence of 
a parameter in an accident report and the zero its 
absence. 
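
To illustrate this presence/absence encoding, the sketch below builds a very
small table of the same kind and counts how often each parameter appears.
The column names and the four rows are synthetic assumptions chosen for the
example; they are not MATA-D records.

```python
# Rows are accident reports, columns are parameters.
# 1 = parameter present in the report, 0 = absent. Synthetic values only.
columns = ["design_failure", "wrong_reasoning", "wrong_time"]
reports = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
]

n_reports = len(reports)

# Share of reports in which each parameter is present.
frequency = {
    name: sum(row[j] for row in reports) / n_reports
    for j, name in enumerate(columns)
}

for name, share in frequency.items():
    print(f"{name}: present in {share:.0%} of {n_reports} reports")
```

The same counting applies unchanged to a table of 238 reports by 53
parameters.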

For the reasons listed above, it has been decided 
to use the available MATA-D data to feed a model, 
in order to understand if the results could describe 
human performance in design and its management, 
instead of modelling the whole process, developing 
PSFs, collecting data and creating a new method 
from scratch. 


2.3 Modelling method used: Bayesian network


Table 2. Erroneous actions used in the classification scheme.

Errors of execution
Wrong time   Wrong type   Wrong place   Wrong object


Table 3. PSFs from CREAM classification scheme.

Organisational factors        Technological factors    Person related factors
Communication failure         Equipment failure        Memory failure
Missing information           Software fault           Fear
Maintenance failure           Inadequate procedure     Distraction
Inadequate quality control    Access limitations       Fatigue
Management problem            Ambiguous information    Performance variability
Design failure                Incomplete information   Inattention
Inadequate task allocation    Access problems          Physiological stress
Social pressure               Mislabelling             Psychological stress
Insufficient skills                                    Functional impairment
Insufficient knowledge                                 Cognitive style
Temperature                                            Cognitive bias
Sound
Humidity
Illumination
Other
Adverse ambient conditions
Excessive demand
Inadequate workplace layout
Inadequate team support
Irregular working hours


The relationships between PSFs, cognitive functions
and human erroneous actions described
above were modelled as a Bayesian network
(BN). A BN is known as a systematic way of learning
from experience and incorporating new evidence
(deterministic or probabilistic), and it was chosen
because it can model those complex
relationships between variables of different natures.
Mkrtchyan et al. (2015) suggested that, by using
BNs, human reliability analysis also benefits from:


i. Its graphical formalism (Figure 1) of condi- 
tional probability equations (Equation 1). Using 
the visual representation of BN is a practical 
way of discussing the relations between factors, 
facilitating the communication between the 
multidisciplinary team that should be involved 
in an HRA meeting analysis, such as engineers, 
psychologists and sociologists. 


$$P(C=c_1 \mid A=a_1, B=b_1), \qquad P(C=c_2 \mid A=a_1, B=b_1) = 1 - P(C=c_1 \mid A=a_1, B=b_1) \qquad (1)$$


ii. A probabilistic representation of uncertainty, 
making it compatible with Probabilistic Safety 
Assessment. 

iii. Combination of different sources of informa- 
tion: empirical sources as databases of events, 
theoretical models of human cognition and 
expert judgement. 


The mathematical background of Bayesian
networks was described by Tolo et al. (2014):
they are statistical models used to represent probability
distributions, which can provide the combined probability
distribution associated with an accident by exploiting
information about the existing conditional dependencies,
e.g. between PSFs and cognitive functions.

BNs are represented by directed acyclic graphs, where
nodes are connected to each other by arcs
(Figure 1). Child nodes must have a causal
relationship with each parent node.

For example, consider the child
node ‘cognitive function’ in Figure 2. The probability
of its occurrence is conditioned on the occurrence of
its parent nodes: organisation, technology and
person-related factors. To have a proper causality,
one has to know the answer to the question:
what is the probability of a cognitive
function occurring when the organisation, technology
and person-related factors all occur together? What
about when none of them occurs? And if only an
organisational factor occurs, and no technology or
person-related factor? All eight possible combinations
of the three parent nodes have to be answered to
establish a proper causality.


[Figure 1 labels: states, parent nodes, child node, arc.]

Figure 1. Directed acyclic graphs typical of a Bayesian
network.


[Figure 2 labels: person related, cognitive functions,
erroneous actions.]

Figure 2. Example of a Bayesian network.
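
The parent/child structure described here can be sketched as a small graph
and checked for cycles. This is only an illustration: the three parents of
the cognitive function node follow the text, while the arc from cognitive
functions to erroneous actions and all node names are assumptions. The
check uses the standard-library graphlib module (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter

# Each node maps to the set of its parent nodes.
# The second entry is an assumed link, not stated explicitly in the text.
parents = {
    "cognitive_function": {"organisation", "technology", "person"},
    "erroneous_action": {"cognitive_function"},
}


def check_acyclic(graph):
    """Return (True, ordering with parents first) or (False, []) on a cycle."""
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError:
        return False, []
    return True, order


ok, order = check_acyclic(parents)
print(ok, order)
```

A graph that fails this check cannot be used as a Bayesian network, because
the conditional probabilities could not be evaluated from parents to
children.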

Generally speaking, the number of combinations that the conditional probability table of a child node has to represent is two (the number of states of each parent node) to the power of the number of parent nodes, i.e. 2^n for n parent nodes.
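As an illustration of this counting argument, the following minimal Python sketch enumerates every presence/absence combination for the three parent nodes of the Figure 2 example. The node names and the helper `parent_combinations` are illustrative assumptions and are not part of the original model.

```python
from itertools import product

# Illustrative parent names, taken from the Figure 2 example (assumption).
parents = ["organisation", "technology", "person_related"]


def parent_combinations(names):
    """Return every absence/presence (0/1) assignment of the given parent nodes."""
    return [dict(zip(names, states)) for states in product((0, 1), repeat=len(names))]


combos = parent_combinations(parents)
print(len(combos))  # 2 ** 3 combinations
for combo in combos:
    print(combo)
```

Each printed dictionary corresponds to one column of the child node's conditional probability table, so adding one binary parent doubles the table.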

This means a high number of combinations if 
all the factors of the CREAM methodology are 
considered. The implications of this issue are dis- 
cussed in the next section. 


3 MODEL 


3.1 Bayesian model of human reliability


To build and test the human reliability model, the summarised process shown in Figure 3 was used. It was proposed by Mkrtchyan et al. (2015) in their review of HRA methods using BN models.

First, the nodes and their states were defined. Then the structure, i.e. the links between the nodes, was developed. After the structure came the assessment of the Conditional Probability Tables (CPT) for each node. Finally, a verification was conducted. The validation will be conducted in future work.


3.2 Nodes and states 


The nodes used in the model are the sub-factors of 
CREAM classification scheme (Hollnagel, 1998), 
where the major factors are human, technology 
and organisation. 

The states of the nodes will be ‘presence’ or 
‘absence’ of the sub-factors observed during the 
investigation of major accidents. 

The MATA-D dataset, presented in the methodology section of this paper, was fed into the model as a matrix of zeros and ones with 53 rows × 238 columns. In the dataset, the absence of a parameter (factor, cognitive function or action) in an accident is represented by the number zero and its presence by the number one.


[Figure 3 boxes: Definition of the nodes and their states; Development of the BN structure; Verification / Validation]

Figure 3. Process to build a BN model for HRA.

Only the factors, cognitive functions and actions 
in italic in Tables 1 and 2 were used as nodes for the 
model. The reason is explained in the next section. 


3.3 Structure 


In essence, the structure of this BN model was created by linking the parent nodes (organisational, technological and person-related factors) by arrows to the child nodes (cognitive functions and erroneous human actions).

It would be that simple if the algorithm used to build the model in the GeNIe software imposed no limitations. For the reason explained in section 2.2, the thirty-nine factors provided by the classification scheme would generate 549,755,813,888 combinations (two to the power of thirty-nine), more than the BN software used could support.

The algorithm supports a large number of 
nodes, but not a large number of connections to 
one child node. Therefore, it was necessary to make 
assumptions to simplify the model structure. 
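To make the scale of this limitation concrete, the short sketch below computes the number of parent-state combinations for a binary network and a rough storage estimate. The function name `cpt_combinations` and the figure of 8 bytes per stored probability are assumptions made only for illustration; they do not describe how GeNIe stores its tables.

```python
def cpt_combinations(n_parents: int, n_states: int = 2) -> int:
    """Number of parent-state combinations a child node's CPT must cover."""
    return n_states ** n_parents


full = cpt_combinations(39)
print(f"{full:,} combinations for 39 binary parents")

# Assumption: one double-precision value (8 bytes) per combination.
full_bytes = full * 8
print(f"about {full_bytes / 1e12:.1f} TB for a single child node")
```

Under these assumptions a single fully connected child node would already need several terabytes, which is why the number of parents per child, rather than the number of nodes, is the binding constraint.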

To make assumptions about connections 
between nodes, one must have a clear understand- 
ing of the causal relationship that factors transmit 
to cognitive functions. 

That is why the most common way to simplify a model for human reliability purposes is to use expert judgement, also known as expert elicitation (Mkrtchyan et al. 2015). However, this is also the stage at which one of the most frequently cited disadvantages of using Bayesian networks for human reliability analysis arises: it is argued that experts can bring more uncertainty into a model through their personal bias.

In an attempt to avoid this kind of uncertainty, the strategy was to let the data ‘speak’ for itself. This strategy has already been used by Groth and Mosleh (2012) in their BN model, where they introduced ‘error context’ nodes to capture certain combinations of PIFs that are more likely to produce human errors than the individual PSFs acting alone.

In the present work, the ‘error context’ was represented by the arcs of the BN model instead of by nodes. The context was imported from the treatment applied by Moura et al. (2017) to their dataset to disclose common patterns and significant features among major accidents. They applied an artificial neural network approach to the dataset, a data mining process that translated the information into a graphical interface, the self-organising maps (SOM). Analysing the maps, one can see that the 238 accidents are allocated to four different regions, shaped by the clustering of accidents with a similar profile and, thus, a similar combination of factors, cognitive functions and actions.


Summing up, the model connections were pro- 
posed based on those SOM relations: factors that 
were in the cluster #1 were linked only to cognitive 
functions located on the same cluster, and the same 
process was repeated for all the clusters. 
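The linking rule described above can be sketched in a few lines of Python. The cluster memberships below are entirely hypothetical placeholders (the real allocation comes from the SOM analysis of Moura et al., 2017), and `build_arcs` is an illustrative helper, not the procedure actually implemented by the authors.

```python
# Hypothetical cluster memberships, for illustration only.
factor_clusters = {"factor_A": {1, 2}, "factor_B": {1}, "factor_C": {3}}
function_clusters = {"function_X": {1}, "function_Y": {2, 3}}


def build_arcs(factors, functions):
    """Link a factor to a cognitive function only if they share a SOM cluster."""
    arcs = []
    for factor, f_clusters in factors.items():
        for function, c_clusters in functions.items():
            if f_clusters & c_clusters:  # non-empty intersection of cluster sets
                arcs.append((factor, function))
    return sorted(arcs)


arcs = build_arcs(factor_clusters, function_clusters)
for parent, child in arcs:
    print(f"{parent} -> {child}")
```

Restricting arcs to shared clusters keeps the number of parents per child node, and therefore the size of each conditional probability table, manageable.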

Simplifications to the network structure were 
applied not only to the connections but also to 
the number of nodes. Using previous research by 
Moura et al. (2017), the nodes were restricted to 
the factors, cognitive functions and actions consid- 
ered significant for the dataset of major accidents 
by the self-organising maps algorithm. 

Consequently, if a factor showed negative or very low variation in the formation of one of the SOM clusters, it was interpreted as not significant to the causation pattern of major accidents and was therefore not included in the Bayesian model presented in this paper. The nodes considered are shown in italic in Tables 1, 2 and 3 and in Figure 4.


[Figure 4 labels (partially recovered): Organisational factors; ambient and working conditions; competences; training; communication; organisation; equipment failure]


Figure 4. Bayesian model considered. 


The model considers that cognitive functions 
are affected by each factor and that human erro- 
neous actions are affected by the factors and by 
the cognitive functions. That is because the model 
assumes that workers have a mental process behind 
their actions (Hollnagel, 1998). 


3.4 Conditional Probability Tables (CPT) 


Conditional probability tables have been obtained 
from the dataset of major accidents from differ- 
ent industry sectors (Moura et al. 2017), using the 
CREAM classification scheme (Hollnagel, 1998). 

After the simplification of the network, the child node with the largest number of parents was ‘wrong place’, with nineteen parent nodes: the occurrence of a ‘wrong place’ action is influenced by sixteen factors and three cognitive functions. That means 524,288 combinations (two to the power of nineteen), and thus 524,288 probabilities of the action occurring or not.

The prior probabilities of the model were 
obtained by calculating how many times a specific 
combination occurred, divided by the total number 
of accidents of the dataset. 
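The counting rule described above can be sketched as follows. This is a minimal illustration on invented toy records, not the authors' Matlab code; the column names `A`, `B`, `C` and the function `combination_frequencies` are assumptions.

```python
from collections import Counter

# Toy accident records (invented): 1 = parameter present, 0 = absent.
records = [
    {"A": 1, "B": 0, "C": 1},
    {"A": 1, "B": 0, "C": 1},
    {"A": 0, "B": 0, "C": 0},
    {"A": 1, "B": 1, "C": 1},
]


def combination_frequencies(rows, names):
    """Relative frequency of each observed combination of the named columns."""
    counts = Counter(tuple(row[name] for name in names) for row in rows)
    total = len(rows)
    return {combo: n / total for combo, n in counts.items()}


freq = combination_frequencies(records, ["A", "B", "C"])
for combo, p in sorted(freq.items()):
    print(combo, p)
```

Only combinations that actually occur receive an entry. With 238 accidents, at most 238 of the 524,288 possible combinations of the largest node can be observed, so the remaining combinations have no count behind them.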


3.5 Software used 


The accident dataset developed by Moura et al. (2016) was originally built as a table of zeros and ones, which was uploaded to the BN software.

If the number of combinations were small, an Excel spreadsheet could be used to compute the CPTs and export them to the GeNIe software. However, as presented in the previous section, the dataset generated conditional probability tables with 524,288 probabilities for some child nodes. This row (or vector) of data exceeds the limits of Excel, and thus Matlab had to be used. In addition, some coding was needed to build the Conditional Probability Tables efficiently, as ‘filtering’ the combinations in Excel would consume too much time.

The BN model was built in GeNIe Modeler for 
academic use (BayesFusion, LLC). The clustering 
algorithm embedded in the software was used to 
calculate the posterior probabilities, and the node 
type used was ‘chance—general’. 

Useful explanations of how to use the GeNIe 
software can be found in the manual provided by 
the developer and also by the authors of an evacu- 
ation time analysis of ships using BN (Sarshar 
et al., 2013). 


4 RESULTS 


After building the model and inserting the prior probabilities of the parent and child nodes (through their conditional probability tables), the marginal probability distributions were calculated using the GeNIe software, as presented in Table 4.


Table 4. Marginal probabilities of human performance.

Cognitive functions                   Actions
  Observation                           Wrong time     4.21e-04
    Observation missed    8.12e-05      Wrong type     1.73e-04
  Interpretation                        Wrong place    4.64e-04
    Faulty diagnosis      1.05e-03
    Wrong reasoning       1.05e-03
    Decision error        6.35e-04
  Planning
    Inadequate plan       1.22e-03
    Priority error        6.50e-04

Although a validation has not yet been conducted, the order of magnitude of the HEPs (cognitive functions and erroneous actions) is consistent with HRA directives from the Oil & Gas industry (OGP, 2010) and with HRA documents obtained from the website of the Environmental Protection Department of the Government of Hong Kong (2017). A validation is needed to understand whether the model is optimistic or pessimistic, according to the validation criteria described by Kirwan (1997).


4.1 Verification step


To verify whether the model behaves according to its specifications, some scenarios were created by changing the factors to their extremes: each parent node was set to 0 and to 1 separately. In other words, each factor (organisational, personal and technological) was assumed to be absent or present in an industrial scenario.

After changing each factor, the Bayesian network was updated and the posterior probabilities of the human performance nodes were calculated.


4.2 Sensitivity of human performance to each factor


To infer which factors most influence human performance, the results of the verification process were used.

When a factor node was set as present in a scenario (state 1 of the node), the resulting variation in the posterior probability of a human performance node was taken as the sensitivity of that parameter to the change. Note that the parameters are represented by small probabilities, so changes in their marginal probability are also expected to be small. Thus, for better visualisation of the sensitivity, the variation was calculated as a percentage.
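A percentage variation of this kind can be sketched as below. The formula (posterior minus marginal, divided by the marginal) is an assumption about how the variation was computed. The marginal value is the ‘observation missed’ entry of Table 4, while the posterior value is a hypothetical figure used only for illustration.

```python
def percent_variation(marginal: float, posterior: float) -> float:
    """Relative change of a node's probability, in percent of its marginal value."""
    if marginal <= 0.0:
        raise ValueError("marginal probability must be positive")
    return (posterior - marginal) / marginal * 100.0


marginal_obs_missed = 8.12e-05   # Table 4, 'observation missed'
posterior_obs_missed = 9.0e-05   # hypothetical posterior with one factor set to present

change = percent_variation(marginal_obs_missed, posterior_obs_missed)
print(f"{change:.1f} % variation")
```

Expressing the change relative to the marginal makes shifts of the order of 1e-05 visible, which would be hard to read as absolute differences.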
As can be seen in Table 5, there is a slight increase in the presence of missed observations when Inadequate Quality Control and Design Failure, both organisational factors, are present in a scenario. A moderate positive variation is also observed when the technological factors equipment failure and inadequate procedure are present. Interpretation functions (faulty diagnosis, wrong reasoning and decision error) are the parameters most influenced by changes in organisational and technological factors.

Errors in design and equipment failures also increase errors in interpretation functions (mainly wrong reasoning), but the results show a more pronounced positive variation in interpretation when changes in quality control, task allocation and knowledge (related to training) occur.

While interpretation has its presence affected by 
training (knowledge), planning incidence seems to 
be more related to experience (skills). 

Failures in equipment increase the possibility of 
poor planning in an accident scenario, but not as 
much as the inadequate quality control and design 
failure, both organisational factors. 

Errors of execution (wrong time, type and place) are triggered by quality control, design failure, task allocation and, to a lesser extent, equipment failure.

The results suggest that some factors do not make cognitive functions vary positively, possibly meaning that an error in cognitive functions would not occur in a scenario with failures in factors such as maintenance, management, social pressure and irregular working hours.


5 CONCLUSION AND DISCUSSIONS 


The Bayesian model proposed was built to serve 
as a tool to predict human performance in indus- 
trial scenarios, i.e. the human failure probability, 
with the potential to be applied in different sectors 
such as chemical (including oil & gas processing), 
nuclear and aviation. 

The novelty of the present model is the use of 
a dataset of major accidents to define the basic 
aspects of a Bayesian network HRA model: nodes, 
states, structure and prior probabilities (conditional 
probability table). The model was developed as one 
of the objectives to achieve the aim of understand- 
ing human performance in the design phase. 

Some discussions are developed below in the 
form of answers to questions proposed for this 
research. 


5.1 Can this model be used for HRA purposes?


Not yet. All the steps of the model were executed, apart from the validation. The intention is to
[Table 5 residue: column headers and cell alignment were lost in extraction. Values are kept in extracted order, with each scenario label where it appeared.]

PERSON RELATED FACTORS

-91  2.1E-07  -100  1.6E-05  2.8E-12  -100  -100  3.8E-06  -99  5.8E-06  -87  4.4E-05  -96  1.2E-05  -88  5.9E-06  -99  1.0E-05
Cognitive bias = 100%

0  1.7E-04  0  6.3E-07  -100  4.2E-04  1.2E-03  0  6.5E-04  0  -94  1.0E-03  0  5.3E-07  -100  6.4E-04  4.8E-06
Distraction = 100%

TECHNOLOGICAL FACTORS

4.4  2.8E-05  -84  4.6E-04  0.1  4.4E-04  1.0E-03  -2  1.6E-04  49  3.9E-04  -38  1.4E-03  15  2.3E-04  -64  11  9.0E-05
Equipment failure = 100%

6.5E-04  0  4.2E-04  0  1.7E-04  0  5.3E-04  14  0  1.0E-03  0  7.1E-05  -33  6.4E-04  0  1.2E-03  11  9.0E-05
Inadequate procedure = 100%

-54  6.2E-06  -96  1.0E-04  -78  2.0E-04  -42  1.1E-03  1  1.0E-05  -91  1.4E-04  -78  1.2E-04  -90  2.0E-04  -69  4.7E-05
Incomplete information = 100%
validate the model against data from recent major accidents that are not yet covered by the original dataset, to measure whether the model describes other real operational data.

In future works, the dataset can be adapted to 
PSFs of classification schemes from other HRA 
methods. 


5.2 Is the model able to describe human actions 
through all the project life cycle? 


Although the verification step suggests the model is capable of inferring which factors most influence human performance, the model still interprets design as a PSF affecting human performance in other stages and does not consider PSFs affecting designers. In addition, it is believed that some factors may influence the design phase in a way that differs from the prior probability extracted from the dataset of major accidents.

Although it is believed that this model cannot describe the design phase, it has the potential to describe changes in design during the operational phase. In fact, that is one of the uncertainties that need to be investigated in the dataset used, as some situations described as a design failure can be attributed to changes in the initial design during later phases. This can also change part of the prior probabilities considered in this model.

Further development must consider the creation of new PSFs, with new organisational factors that should be considered during design. It seems that expert elicitation will have to be considered for this phase, as there is little public evidence on this process.


5.3 Can this model be used to understand
decision-makers’ performance during design?


The model has the potential to describe the routine and emergency performance of front-line and middle managers during the operational phase. It is expected to give a better description of cognitive functions than of actions, reflecting the typical job description of decision-makers.

However, further investigation should be conducted to understand how specific factors affect different people in the organisation, especially for organisational factors for which the results suggested that the impact on cognitive functions is marginal, such as social pressure and irregular working hours. These factors, for instance, are reported in the literature (Thomas et al., 1999) to affect middle managers differently from sharp-end employees.

Understanding managers’ safety performance is part of the present research, as a way of investigating whether improving their performance on safety issues has the potential to lead industries to better organisational factors and fewer accidents. Managers are linked at the same time to top management and operational teams, having the opportunity to sell new ideas to top management and to promote strategic change to employees (Wooldridge et al., 2008; Purcell, 2007).

For this reason, it is recommended that further models consider all factors proposed by CREAM’s classification scheme, instead of only the ones found significant in previous research (Moura et al., 2017). Accounting for factors like ‘excessive demand’ and ‘cognitive style’ might give an improved model for managers’ roles. For this purpose, different software and algorithms for calculating posterior probabilities have to be tested, to support more links between nodes. Therefore, Cossan (Patelli et al., 2018), a software package for uncertainty quantification, and its Bayesian network toolbox will be tested in the future.
ABSTRACT: The article discusses the problems of conducting transport operations in the army. The original contribution of the authors in this field of research lies in the special attention paid to the Human Factor when estimating transport risk. The analysis addresses the impact that the Human Factor has on transport risk, depending on cultural aspects. The most important anthropological factors in carrying out a successful military campaign in a culturally different country (Somalia, Iraq, Afghanistan) were identified. The authors present the results of pilot studies on the impact of the Human Factor on risk assessment in the implementation of transport operations in the army. The elements presented in the article are innovative because cultural factors were taken into account when conducting the analysis in the army.


1 INTRODUCTION 

The design and implementation of military transport is a multilevel logistic process aimed at preparing the cargo for safe transport along a particular segment of a route. Regardless of the mode of transport used, each mode offers its own opportunities and constraints, so it is necessary to make a detailed analysis of the factors that need to be taken into consideration when planning the safe passage of cargo.

When analyzing the needs of the army in the field of movement and transportation, specific features of the transport process should be taken into account. The primary feature is its intangible nature, which means that it is not possible to assess the transport process before implementation (Ryczyński and Smal, 2017). Other specific features include the simultaneous realization and consumption of the transport service and the variability of quality standards, in particular for the movement of military forces, including the mutual interaction between the provider and the recipient. Because transport services cannot be accumulated in advance, the user is forced to anticipate transport needs for military purposes ahead of time.

The movement and transportation of troops and weapons systems is an issue arising mainly from the expeditionary nature of modern armed forces and is associated with the need to move forces in order to plan for their operational use.

The aim of the displacement of troops is to change the location of personnel, weapons, military equipment and means of supply in response to operational, logistic and training needs (Ryczyński and Smal, 2017).

Part of the military’s movement and transportation involves the operation of one of the following subsystems: management, transport and transport networks. Movement and transportation is characterized by an unpredictable nature arising from operational needs, the need to preserve special conditions of caution and the need to prepare the means of transport to carry heavy, oversized and dangerous loads. According to its nature and purpose, military transport is divided into:

Operating, which includes transporting troops and their equipment.

Supply, the carriage of weapons, military equipment, munitions and materials.

Evacuation of unnecessary supplies, failed and damaged weapons, as well as military equipment and packaging.

The proper choice of the means of transport is one of the most important strategic decisions. This decision depends primarily on the operational criteria, the number of personnel to be transported, and the quantities of equipment and inventory.

We also have to take into account the geographical location of the destination, distance, redeployment time, the possibility of using point and linear elements of the communication infrastructure, and cost.

When planning military movement and transportation, we should primarily aim to achieve two functional objectives: ensure the timely completion of tasks, and minimize costs at an acceptable level of effort.


2 RISK ASSESSMENT IN THE 
TRANSPORT—LITERATURE REVIEW 


Currently, there is no single, unified and commonly used definition of the term risk (Aven, 2012). We can even say that the basic concepts of risk are difficult to define and even more difficult to estimate (Heckmann et al., 2015).

In recent decades, we have observed that this term is used in many research areas, such as decision theory, management, emergency planning or critical-structure operations, including the efficiency of transport systems (Tubis and Werbińska-Wojciechowska, 2017). Historical trends in the development of the risk concept have been discussed, e.g. in (Aven, 2012; Anderson, 1989). An overview of risk perspectives and discussions is provided, e.g. in (Aven, 2012; Aven et al., 2009; Aven and Krohn, 2014; Anderson, 1989). Aven distinguishes between two types of approaches to risk analysis: relative frequency-based approaches and Bayesian approaches. The first category includes both traditional methods of statistical inference and the so-called frequency probability approach. Depending on the risk analysis method, the purpose of the analysis is different, and the results are presented in different ways.

These approaches can be characterized as follows (Aven, 2012):


1. Relative frequency-based methods: The risk analysis consists in estimating certain basic probabilities, interpreted as relative frequencies. Such probabilities express the relative fraction of times the event of interest would occur if the analyzed situation were hypothetically “repeated” an infinite number of times. The basic probabilities are unknown and are estimated in the risk analysis. These estimates are uncertain, because there may be large differences between the estimates and the valid (real) probabilities.

2. Bayesian methods: Probability is a measure of uncertainty about future events and results (consequences), seen through the eyes of an assessor and based on certain information and knowledge. Probability is a subjective measure of uncertainty.


These concepts are reflected in the literature dealing with the problem of risk estimation in transport processes. To effectively manage any organization (including companies providing transport services), a new concept, Enterprise Risk Management (ERM), was introduced and promoted. One of the most popular definitions of ERM in the literature is the one presented in the COSO II standard. In accordance with this standard, Enterprise Risk Management is defined as a process carried out by the management staff in order to identify potential events that may reduce the ability to achieve the entity’s goals. According to COSO II, the organization’s ERM system should be aimed at achieving the following four objectives:


1. Strategy: high-level goals, aligned with and supporting the organization’s mission.

2. Operations: effective and efficient use of the organization’s resources.

3. Reporting: reliability of the organization’s reporting system.

4. Compliance: organizational compliance with applicable laws and regulations.


Proper implementation of the ERM concept also affects the risk assessment processes performed in the selected company. Risk assessment is an indispensable and systematic process that is part of risk management; its purpose is to identify and assess risks and to plan actions to address them [34].

Referring to the previously presented definitions and taking into account the perspective of the process, the starting point for the risk assessment should be the identification of hazards that may cause a failure to achieve the goals of providing transport services. The holistic approach assumes that the area of analysis will cover various levels of the implemented process, e.g. technical elements, the Human Factor, and legal and organizational issues (Tubis and Werbińska-Wojciechowska, 2017). Risk assessment should be preceded by a process analysis that identifies potential adverse events. Risk assessment is usually only carried out at specific time points. Correct identification of the largest possible number of disturbing factors is the basis for a correct risk assessment.


3 RISK ASSESSMENT IN TRANSPORT 
DURING MILITARY OPERATIONS 


The proposed methodology of risk assessment in the transport of oversized loads consists of several stages (Fig. 1). The first step is to diagnose the potential risk that accompanies the transport of, e.g., oversized loads. The next step is data collection and risk assessment. The diagnosis is necessary to determine the causes and factors that may cause difficulties in transport (Ryczyński and Smal, 2017). Evaluation of the results helps to choose a strategy for reducing the number of factors affecting the various types of risk and limiting its negative impact. One of the most important elements of risk assessment for the transport of oversized loads is the calculation of the probability of an accident.

Figure 1. Methodology of risk assessment in military transport.

Figure 2. Dependence: accidents – potential consequences.

Transportation of heavy oversized loads may have a negative impact on the infrastructure and the organization of traffic and generate, e.g., delays, damage to roads, damage to buildings located on the route of travel and danger to other road users. To minimize the risk of an accident involving other vehicles and the non-normative (oversized) transport, a ban on the movement of other traffic may be introduced on a particular stretch of road or along the entire length of the route. Fig. 2 shows the possible impact and consequences of a traffic accident involving a vehicle carrying oversized loads. The occurrence of an accident on the road can also affect normative (standard) vehicles.

Often the cause of an accident is the incorrect behaviour of other road users that is not adapted to the conditions prevailing on the road, the wrong choice of means of transport or incorrect load securing.
When modeling the risk for oversized loads, the following data are used:


1. On the motorway network, including its physical characteristics.

2. On accidents involving heavy vehicles, including the frequency and severity of accidents.

3. On the flow of traffic.

4. On the loads and goods, together with their characteristics.

These spatial data can be presented in GIS format. Information on highways, vehicles, accidents and traffic flow is readily available from various transport agencies. However, data on the transported loads and goods pose a bigger problem. Most of the available data in this area are insufficient to achieve the desired level of accuracy of risk assessment.

In practice, there are spreadsheet methods that can be used to assess the degree of risk undertaken in a project. The classic model of risk calculation is as follows:

$$R = p \cdot C \tag{1}$$

where:

R - risk,

p - the probability of an accident,

C - the consequences of the accident.
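The classic model above can be illustrated with a short Python sketch that compares two candidate routes. The route names, probabilities and consequence values (in arbitrary cost units) are invented for illustration, and the function `risk` is a hypothetical helper rather than part of the authors' method.

```python
def risk(probability: float, consequence: float) -> float:
    """Classic risk model: R = p * C."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    return probability * consequence


# Invented example values: (probability of an accident, consequence in cost units).
routes = {
    "route_A": (1e-4, 500_000.0),
    "route_B": (5e-5, 800_000.0),
}

risks = {name: risk(p, c) for name, (p, c) in routes.items()}
best = min(risks, key=risks.get)

for name, value in risks.items():
    print(f"{name}: R = {value:.1f}")
print("lowest risk:", best)
```

In this toy comparison the route with the more severe potential consequence still has the lower risk, because its accident probability is half that of the alternative.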

The probability of an accident depends on many factors: the volume of traffic, the time of the transport operation (day or night), the load of the vehicle, the transport distance, the parameters of the roads, road surface conditions, time of year, weather conditions, visibility, etc. The risk assessment process (Fig. 3) may include more factors and, depending on the situation, more consequences of an accident, but this depends on the availability of data on accidents.


[Figure 3 content, flattened from two boxes]

Factors influencing the probability of an accident:
1. The frequency of transport
2. Capacity
3. The length of the route
4. The quality of the road's surface
5. [text not recovered]
6. Technical parameters of the road
7. Day, seasonality

The consequences of an accident:
1. Delay in cargo delivery
2. Failure to deliver cargo
3. Damage to cargo
4. Accident, damage to the vehicle

Figure 3. Risk assessment in transport: factors x consequences.


4 HUMAN FACTOR IN RISK 
ASSESSMENT 


As shown in Figure 3, one of the most important elements affecting the magnitude of risk in military transport is the Human Factor. Based on expert research, the authors have determined that the Human Factor is more important during the realization of transport plans than, for example, the technical parameters of the roads during transportation.

This issue is particularly important in matters of security and risk assessment related to the conduct of military activities. This is confirmed by the words of David Kilcullen (Kilcullen, 2010), an Australian anthropologist who has been developing new strategies for warfare for many years, including the use of cultural knowledge. Kilcullen claims that “knowing and understanding a foreign culture is a prerequisite for victory over an opponent”.

What exactly is the Human Factor and how is it defined? We can define the Human Factor as understanding human performance within a given system: trust, fear, decision-making and stress are crucial in the so-called “golden hour” (Helsloot et al., 2004).

In industry, the Human Factor (also known as
ergonomics) means the study of how humans behave
physically and psychologically in relation to particu- 
lar environments, products, or services. Many large 
manufacturing companies have a Human Factors 
department or hire a consulting firm to study how 
any major new product would be accepted by the 
users in its design. A Human Factors specialist typi- 
cally has an advanced academic degree in psychology 
or anthropology or has special training. The term 
usability is now sometimes used as an alternative to 
Human Factors like human error or human resource. 

Today there are 2 different views on Human 
Factor as a cause of failure. The so-called “old 
view” means: 


— human error is the cause of most accidents; 

— the engineered systems in which people work are 
made to be basically safe; 

— progress on safety can be made by protect- 
ing these systems from the unreliable human 
through selection, procedures, automation, 
training and discipline. 


The other, “new view” sees human error not as a
cause but as a symptom of failure:


— human error is a symptom of trouble deeper
inside the system;

— safety is not inherent in the systems—people 
have to create safety; 

— human error is systematically connected to 
features of people’s tools, tasks and operating
environment. 


Progress on safety comes from understanding 
and influencing these connections (Gambetti et al.,
2012). 

The Human Factor is very difficult to measure.
Different component factors dominate when it is
assessed in the army than when the same methods
are applied in civilian industry. Based on
their own experience and on the knowledge of
experts who deal with the organization of
transport on a daily basis, the authors of this
article have listed the 12 most important factors
affecting the so-called Human Factor:


1. lack of communication — errors and disrup- 
tions in the information flow; 

2. routine — certainty resulting from long-term 
practice combined with the loss of awareness 
of existing threats, caused by often repetitive 
activities and tedious work; 

3. lack of knowledge — lack of clarity or certainty 
of understanding something, lack of language 
skills; 

4. distraction — caused by distraction, confusion, 
mental chaos; 

5. lack of cooperation in the team — inconsistent 
effort of a group of people caused by the lack of a
sense of common purpose, fear of pointing out
to management the mistakes made by others, an
inappropriate style of leadership or
inappropriate communication;

6. fatigue — it is ignored, because until it is exces- 
sive, people do not realize it; 

7. lack of resources — lack of tools, materials, 
outdated documentation, improper working 
conditions; 

8. pressure — caused by the pressure of superiors 
or colleagues, lack of time, improper setting of 
tasks; 

9. lack of assertiveness — lack of ability to refuse 
to perform a task resulting from lack of self- 
confidence, anxiety or complexes; 

10. stress — nervousness caused by e.g. time
pressure, new methodology, change in the scope of
tasks, competition or private factors;

11. carelessness — incorrect assessment of possible 
consequences of action caused by e.g. pressure,
lack of experience or lack of knowledge; 

12. comfort (deviation) — acceptance by most 
people of deviations from the instructions as 
standards facilitating work. 


The aim of the research was to indicate those
factors that are dominant in the success or
failure to complete the task. The 12 factors
described above were presented to a group of 35
military drivers (participants of foreign missions),
who were asked to choose the most important ones.
Each participant compared all 12 factors with each
other in pairs. In a given pair, the more important
factor was rated 1 and the less important factor 0.
The maximum value that a single factor could
obtain was therefore 11 points. The results are
presented in Figure 4.

[Figure 4: chart of results; axis label: Number of factor.]

Figure 4. Results of the research on the influence of chosen
factors on the successful completion of the task.
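
The pairwise scoring described above can be sketched in a few lines of Python. This is only an illustration for a single respondent: the function name `pairwise_scores` and the preference rule `example_preference` are hypothetical, and the paper does not state how the answers of the 35 drivers were aggregated.

```python
from itertools import combinations

# The 12 factors as listed in the paper.
FACTORS = [
    "lack of communication", "routine", "lack of knowledge", "distraction",
    "lack of cooperation in the team", "fatigue", "lack of resources",
    "pressure", "lack of assertiveness", "stress", "carelessness",
    "comfort (deviation)",
]


def pairwise_scores(factors, prefer):
    """Give 1 point to the preferred factor of every pair, 0 to the other."""
    scores = {factor: 0 for factor in factors}
    for a, b in combinations(factors, 2):
        scores[prefer(a, b)] += 1
    return scores


def example_preference(a, b):
    # Hypothetical rule for illustration only (not study data):
    # the factor listed later is always judged more important.
    return a if FACTORS.index(a) > FACTORS.index(b) else b


scores = pairwise_scores(FACTORS, example_preference)
```

With 12 factors each respondent makes 66 comparisons, and a factor that wins all of its comparisons reaches the maximum of 11 points mentioned in the text.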

The studies carried out have shown that fatigue 
(10 points), stress (10 points), routine (8 points) 
and pressure (7 points) are of the greatest impor- 
tance for success or failure in the implementation 
of the task. The least important were the lack of
resources and the lack of assertiveness. It had been
assumed that the lack of assertiveness would be the
least important factor, because the army is a
hierarchical institution and soldiers carry out orders.

It should be noted that the presented results
come from a pilot study on a small research sample.
Currently, research is being conducted
on groups of 200 people, both civilian and military
drivers. Its completion will allow
for a more complete identification of the most
important elements of the Human Factor in the
process of estimating transport risk.


5 CULTURE ASPECTS OF HUMAN 
FACTOR 


The research conducted by the authors leads to the
conclusion that all available definitions completely
omit the influence of cultural aspects on whether a
task is performed successfully. The authors of the
article, as participants of military foreign
operations, observed this phenomenon during the
missions in Iraq and Afghanistan.

The biggest problem in the course of military
expeditionary operations in foreign countries is the
mentality of the local population. Deliberately
violating applicable procedures, instructions,
requirements or regulations is the standard of
behavior. There is social acceptance for this type
of behavior and it is not treated as an offense.

One of the most representative examples
of the influence of culture on the Human Factor is
the Restore Hope operation in Somalia. It was
planned as a military and humanitarian mission
under the auspices of the United Nations and was
meant to provide transport services, delivering
food to the starving population of Somalia. One of
the biggest problems of the mission was that the
military components of individual countries
understood the culture of Somalia differently and
thus pursued the mission objectives in different
ways. A lack of understanding of the culture and
history of Somalia was also characteristic of the
United Nations, which worked on the assumptions of
the mission. In particular, the clan system and the
decentralized nature of traditional political
institutions in Somalia were not understood, which
was manifested in the pressure on the UN that a
particular clan should rule over the entire
country. Such behavior, degrading the importance of
the leaders of other clans, caused their reluctance
toward the peace forces.

It should be emphasized that cultural problems 
were not only related to the relations between 
UNOSOM and Somali troops. Cultural differences 
influenced the effectiveness of UN force manage- 
ment and control. For example, soldiers of the Ital- 
ian and French contingents ostentatiously ignored 
the orders of the Pakistani commander by follow- 
ing only the instructions coming from their own 
countries. 

Based on the experience of foreign missions, the 
world’s largest armies attach great importance to 
the issue of cultural differences. 

Today, the American standard is Operational
Culture for the Warfighter and Applications for
Operational Culture. Perspectives from the Field,
as well as Through the Lens of Cultural Awareness:
a Primer for US Armed Forces Deploying to
Arab and Middle East Countries. In January 2013,
the British forces published the doctrinal document
JDN3/11 Decision Making and Problem
Solving: Human and Organizational Factors. The
document is devoted to the impact of a culturally
conditioned Human Factor on the functioning of
organizations such as the military, as well as on the
ways of thinking and acting during various crises
and military operations.

According to the mentioned documents,
culture is “acquired” in the process of
enculturation and not “biologically inherited”. It
distinguishes members of one group from another and
is a lens through which members of a given group
perceive and understand the world. The documents
emphasize the importance of culture and people
for the army: “culture is part of a wider context in
which military operations and everyday relations 
take place. Ignoring the importance of culture 
increases the risk of failure of a given company 
or mission and creates barriers to effective interac- 
tion. Developed cultural ability reduces these risks 
and creates: new opportunities that can contribute 
to the enemy’s failure or success in negotiating 
with him; a better understanding of local com- 
munities and the foundations of a given conflict; 
fruitful cooperation with allies. (...) Understand- 
ing the value of culture in the whole spectrum of 
operations and military processes will result in 
increased situational awareness, better preparation 
of combat forces, better protection of forces and 
more effective tactical preparation. Culture is par- 
ticularly important for security and stability where 
the human dimension is crucial”. 

In research on the Human Factor, the concept
of intercultural awareness often appears. It
is supposed to facilitate functioning in culturally
alien areas. An example is the Human Terrain
System (HTS), a new system for gaining knowledge
about people and culture; operations in Iraq and
Afghanistan had revealed that the US army had no
doctrine corresponding to the reality of
operations. HTS operates on the basis of Human
Terrain Teams, consisting of military personnel,
anthropologists, sociologists and linguists. Each
team is recruited and trained to work in a specific
region, where it supports a military unit operating
there. The system was criticized at the beginning
of the 21st century as unethical from the point of 
view of anthropological research. However, the 
words of General David Petraeus are still quoted, 
saying that “knowledge of the culture of a given 
area can be just as important, and sometimes even 
more important, as knowledge of the geography 
of a given area. This observation is to draw atten- 
tion to the fact that people’s knowledge is, in many 
respects, a decisive factor, and that we must study 
the culture of a given area with the same attention 
we have so far devoted to the geography of the 
area” (Petraeus, 2006). 


6 SUMMARY 


When carrying out military transports, in addition
to testing the behavior of drivers, it is necessary to
create procedures according to which the
entire transport system is to be organized.
This system should be adapted to a specific
cultural region. It must refer to all aspects of
transport in which a human plays a role: communication,
updated technical documentation or organization
of teamwork. Components that need to be tested
are:


1. defined main participants involved in transport,
including the main coordinator responsible for
the organization;

2. systematic characterization of the transport
system of a given region and the socio-economic
situation of the state from the residents’
perspective;

3. indication of activities undertaken by residents
and entities involved in the procedure;

4. examining the limitations, challenges and human
potential in the region’s transport system.


The conducted research shows communication 
problems at the local level, as well as problems 
with the lack of appropriate tools and the lack of
knowledge of customs, including stressful and
motivating factors. Cultural research is proving
necessary to understand the cultural environment of
the planned transport operation. The results also indicate
the necessity of synergic cooperation between the 
army and the inhabitants, the government adminis- 
tration, and, in the case of missions, also non-gov- 
ernmental organizations. The aim of such action is 
to minimize and eliminate negative effects

Study on seafarers’ emotion identification during watch-keeping using
bridge simulation

ABSTRACT: Human factors are one of the essential contributors to maritime accidents, and
seafarers’ emotion is sensitive to the working environment and to information inaccessibility. Data collected
from the Self-Assessment Manikin (SAM) questionnaires of 11 experienced seafarers are analysed to investigate
the impact of their emotions during watch-keeping in a bridge simulator. SAM scale rating questionnaires
are collected separately after two sections, emotion calibration and emotion recognition. In the calibration
section, the emotion is induced and identified. In the recognition section, emotion is self-rated after the
crew qualification test and a Support Vector Machine (SVM) model is used for classification. The results
indicate that the SVM can effectively identify the emotions with an accuracy of 72.73%. Seafarers’ emotion
in maritime operations affects their behaviour and decision-making. An overall positive emotion identified by
SAM rating does not imply a positive effect on sailing, while a negative emotion identified by SAM rating does
not necessarily lead to negative behaviour.


1 INTRODUCTION


Nowadays, more than three-quarters of world
trade cargo by volume is carried
by shipping (Grech et al., 2008). Shipping
contributes to the increase of awareness on
transport safety in maritime. Surveys (Ren et al.,
2008) show that 75-96% of maritime accidents are
fully or partially caused by human and
organisational errors. This has been widely
understood since 1993, when the USCG reported that
human errors had been the essential cause of
approximately 80% of maritime accidents and near
misses over the past decades (Grech et al., 2008).
Moreover, out of nearly 62% of pollution and
maritime accidents over the past years (Er and
Celik, 2005), human factors result in 30% of deck
officer error, 7% of shore-based personnel error,
2% of engine officer error, 8% of pilot error. It
reveals that human factors are not the direct cause
of the accident, but they are the beginning of a
considerable incident or catastrophe by triggering
the following chain of events.

In this regard, it is meaningful to investigate
human factors on the ship bridge as it is closer to
the root causes of maritime accidents. One of the
earliest initiatives was prompted by accidents
caused by a typical radar-assisted collision (Grech
et al., 2008). In 1956, the collision between the
two passenger ships Andrea Doria and the Stockholm
was one example. The root causes of the accident
were related to the ship bridge. It demonstrated
that much attention should be paid to human factors
and the bridge. Consequently, it raised interest in
the area of bridge design and cognition, both in
Europe and the United States. Nowadays, the bridge
has become more automated. Automation is often
highlighted because it has been widely understood
that it would reduce the involvement of crew, so as
to reduce human workload and human errors, and
increase efficiency. However, as demonstrated by
the grounding of Royal Majesty (the Panamanian
passenger ship grounded on Rose and Crown Shoal
about 10 miles east of Nantucket Island,
Massachusetts on June 10, 1995) as well as
evidenced by other research results (Lutzhoft and
Dekker, 2002), automation holds promise for human
work and safety, but it cannot simply replace human
work entirely. Fewer crew numbers do not lead to
less workload. There exists increased mental
workload affecting situation awareness (Aguiar et
al., 2015). In this regard, automation in the
bridge creates new error pathways, especially
resulting from human errors, deficiencies in
mission shifts, and postpones chances to correct
errors in the system further into the future. It is
noteworthy that the bridge plays an essential role
in the success and failure of navigation, as well
as in human error research.

In the 1995 amendments to the Seafarers’ Training,
Certification and Watchkeeping (STCW) Code,
human error was classified into three major
categories: operational-based,
management-based, and a combination of the two.
For quantitative assessment of shipping accidents, 
Celik and Cebi (2009) generated a Human Fac- 
tors Analysis and Classification System (HFACS) 
based on a Fuzzy Analytical Hierarchy Process 
(FAHP), to identify human errors in shipping acci- 
dents. In line with the HFACS, as well as Reason’s 
Swiss Cheese Model and Hawkins’ SHEL model, 
Chen et al. (2013) proposed HFACS for a Mari- 
time Accidents (HFACS-MA) model to measure 
the Human and Organisational Factors (HOFs). 
Soner et al. (2015) combined Fuzzy Cognitive 
Mapping (FCM) and HFACS to generate a proac- 
tive model in fire prevention modelling onboard 
ships. Chauvin et al. (2013) applied a modified
HFACS model to collisions reported by the
Marine Accident Investigation Branch (UK)
and the Transportation Safety Board (Canada)
and found that most collisions were due to decision errors.
Meanwhile, the accidents were also associated 
with poor visibility and evidence deficiencies of 
the socio-technical system (technical environment, 
the condition of operators, leadership level, and 
organisational level) (Chauvin et al., 2013). In 
that way, cognition and error in accidents attract 
research attention as well. 

Quite a number of studies exist on human reli- 
ability to define human performance in accidents 
and estimate human failure probabilities (Yang 
et al., 2013, Yoshimura et al., 2015, Yang and 
Wang, 2012). Akyuz and Celik (2015) adopted the
Cognitive Reliability and Error Analysis Method
(CREAM) to assess human reliability during
the cargo loading process, and Akhtar and Utne
(2015) used it to study common patterns of inter- 
linked fatigue factors. It was illustrated that “inat- 
tention”, “inadequate procedures”, “observation 
missed”, and “communication failure” were related 
to fatigue factors that influenced the human cogni- 
tive processes in accidents. The bridge team should 
be trained to recognise fatigue and exercise cau- 
tion related to the fatigue factors. Moreover, Het- 
herington et al. (2006) divided human factors into 
fatigue, stress, health, situation awareness, team- 
work, decision-making, communication, automa- 
tion, and safety cultural diversity. 

Among them, the emotion factor of the crew is 
vulnerable to working space, inaccessible informa- 
tion sources, and communication. Their negative 
emotions are mainly related to irritability, tension,
instability, depression, and burnout with periodic
changes. From this perspective, Liu et al. (2016b) 
proposed an EEG (Electroencephalogram) system 
in bridge simulation to monitor officers’ workload 
and pressure. It was one of the earliest studies 
on seafarer’s psychological response using bridge 
simulators. However, the relation between
psychological response and seafarers’ performance
was not demonstrated. To quantify crew emotion,
this system also monitored
emotion, emotional stress, and environmental
stress (Liu et al., 2016a). It identified the
emotion of cadets in the bridge simulator using
EEG, but did not relate it to human errors either.
Geethanjali et al. (2017) detected and recognised
human emotion using Self-Assessment Manikin
(SAM) rating. The statistical analysis revealed
differences in emotion identification between
several groups. Moreover, other studies on angry
driving in road transportation (Yan et al., 2015, Wan
et al., 2016, Yan et al., 2014, Zhang et al., 2014) have
been conducted to find the emotional connection
between humans and safety. Hence, seafarers’
emotion identification should be further studied by
better incorporating psychological knowledge.

This study identifies emotion
on the bridge using SAM rating questionnaires and
classifies it with a Support Vector Machine
model, using bridge simulators. It would benefit
crew training aimed at navigational safety
and improve watch-keeping while sailing. The
remainder of the paper is organized as follows: In
Section 2, the methodologies of this study are described,
including SAM and SVM methods. The experiment 
design and procedures are illustrated in Section 3. 
The result and discussion are presented in Section 4. 
Finally, the conclusion is given in Section 5. 


2 METHODOLOGY 


2.1 Self-Assessment Manikin (SAM) 


In this paper, the nine-point scale of the
Self-Assessment Manikin (SAM) (Bradley and Lang,
1994; Bradley and Lang, 2007) is used to describe
pleasure, arousal, and dominance in response to
the stimuli. Figure 1 shows the questionnaire that
the test subjects complete after the experiments,
reflecting on their subjective feelings during
the assessment.

The scoring measures the pleasure, arousal,
and dominance associated with the stimuli. The
first SAM scale is the happy/unhappy scale, which
ranges from a smile to a frown. The second scale
is the excited/calm scale, which ranges from left to
right. The last dimension is the
controlled/in-control dimension. The left end of the
scale represents the feeling of being completely
controlled and influenced. The right end of the
scale is the feeling of being completely in
control, important, and dominant.

[Figure 1: questionnaire. 1. SAM rating; descriptive words offered: Joyful, Surprised, Satisfied, Protected, Angry, Fear, Unconcerned, Sad; or give your own descriptive word.]

Figure 1. The questionnaire of emotion with SAM
scale on a nine-point rating (Liu et al., 2016a).

The SAM methodology reveals the specific features
of a test subject’s emotion, as emotion
is a subjective variable. This method quantifies
the emotion at a specific time and under specific
conditions. The experiment is audio-recorded for
each test subject. After the qualification test,
the experts’ comments on the performance of the
seafarers are also audio-recorded, and the test
subjects are given a result of pass or not pass.


2.2 Support Vector Machine (SVM) 


In this study, there are two sorts of emotion catego- 
ries: positive emotion and negative emotion. The 
Support Vector Machine (SVM) is used to iden- 
tify the emotion category for the tested seafarers. 
SVM is a supervised learning model with associ- 
ated learning algorithms that analyse data used for 
classification and regression analysis. As shown in 
Figure 2, there are two kinds of sets, type “e” and 
type “a”. The SVM method finds an optimised
hyperplane (e.g. a line defined by $w$ and $b$),
calculating $w$ and $b$ to maximise the margin while
still separating the points (primal form). $C$ is a cost
function, $C = \max(0, 1 - y_i(w \cdot x_i - b))$
(for data on the wrong side of the margin, $C$ is
proportional to the distance from the margin),
$i = 1, 2, \ldots, l$, and $\xi_i$ is the smallest
non-negative number satisfying
$y_i(w \cdot x_i - b) \ge 1 - \xi_i$.


$$\min \; \frac{1}{l}\sum_{i=1}^{l} \xi_i + C\lVert w \rVert^2 \qquad (1)$$

Figure 2. The support vector machine theory.

In this study, these points are described in three
dimensions illustrated in SAM as pleasure, arousal,
and dominance. As emotion is a subjective variable,
the SVM uses the features of a given emotion
from the calibration to generate the classifier.
Using the classifier trained by the SVM, the
emotion in the qualification test of the seafarers
is identified from the three-dimensional
questionnaire description. Specifically,
a 33 x 3 matrix is generated as input, in which
“33” represents the “11 x 3” questionnaires and “3”
the three emotion dimensions. The
output is a 33 x 1 matrix. The first
22 samples form the training set, while the
last 11 samples form the test set. After normalisation,
the optimal parameters of the SVM are searched by
cross-validation. The kernel function of the model
is calculated, and the emotion category
is then identified.
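
As a rough illustration of the data layout and of the hinge-loss cost described in this section, the following Python sketch builds a 33 x 3 input matrix, splits it into the first 22 and the last 11 samples, and evaluates the cost for one candidate hyperplane. The ratings are hypothetical and not the study data; the normalisation to [0, 1] over the nine-point scale, the labels -1/+1, the hand-picked values of `w` and `b`, and the exact form of the objective (mean hinge loss plus a weighted squared norm of `w`) are all assumptions made for the example. No parameter search or kernel is included.

```python
SAM_MIN, SAM_MAX = 1, 9  # nine-point SAM scale


def normalise(row):
    """Min-max scale one (pleasure, arousal, dominance) rating to [0, 1]."""
    return [(v - SAM_MIN) / (SAM_MAX - SAM_MIN) for v in row]


def hinge_loss(w, b, x, y):
    """max(0, 1 - y * (w.x - b)) for a label y in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - b
    return max(0.0, 1.0 - y * score)


def objective(w, b, samples, labels, reg):
    """Mean hinge loss plus reg * ||w||^2 (assumed soft-margin form)."""
    losses = [hinge_loss(w, b, x, y) for x, y in zip(samples, labels)]
    return sum(losses) / len(losses) + reg * sum(wi * wi for wi in w)


# Hypothetical ratings, NOT the study data: 11 subjects x 3 questionnaires.
negative_clip = [(2 + i % 3, 6, 3) for i in range(11)]  # calibration, label -1
positive_clip = [(7 + i % 3, 5, 6) for i in range(11)]  # calibration, label +1
test_part = [(5 + i % 4, 4, 6) for i in range(11)]      # rated after the exam

X = [normalise(r) for r in negative_clip + positive_clip + test_part]  # 33 x 3
X_train, X_test = X[:22], X[22:]   # first 22 samples train, last 11 test
y_train = [-1] * 11 + [+1] * 11    # labels of the two calibration clips

# Hand-picked candidate hyperplane that separates on pleasure only.
w, b = [8.0, 0.0, 0.0], 4.0
train_losses = [hinge_loss(w, b, x, y) for x, y in zip(X_train, y_train)]
train_objective = objective(w, b, X_train, y_train, reg=0.01)
```

For this candidate every training sample lies on or outside the margin, so all hinge losses are zero and only the norm term contributes to the objective. A sample on the wrong side, such as a low-pleasure rating labelled +1, would receive a loss that grows with its distance from the margin.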


3 EXPERIMENT 


3.1 Test subject selection 


Seafarers from different companies who were
taking the captain and first officer qualification
examinations were recruited for this
study. There were 11 exams scheduled over two days.
Each exam tested one participant who acted as
captain in a four-person exam group. All subjects
were in good health without head injuries. They
had 7.7 years of experience at sea on average, as
they present a more typical emotional response
during sailing than beginners or cadets.
The test subjects ranged from 26 to 38 years old,
with an average of 31.9 years. They were all male.
The seafarers took part in the experiments as
volunteers, which also meant that they
could quit the experiments whenever they changed
their minds. Based on this agreement, the
calibration part of this study was conducted before
the crew qualification exam, and the test part was
carried out after the whole exam. The test subjects
were in the bridge simulator room, while the staff
were in the control room providing scenarios to the
subjects (Figure 3a, 3b).


3.2 Stimuli selection 


The seafarer acting as “captain” in each
four-person exam group was selected as an
independent sample. The subjects rated their
perceived emotion for each stimulus on a SAM scale.
The IADS database was used as the source of stimuli
in two categories (pleasant and unpleasant). The
clips were presented to the test subjects for the
first time; none of them had heard the clips prior
to the experiment.

[Figure 3: (a) Test subjects in simulator room; (b) Staff in control room.]

Figure 3. The test subjects and staff in control room.

Figure 4. Experimental protocol.


3.3 Experimental protocol 


The experiment was conducted with SAM scale rating
questionnaires collected separately after two
sections, emotion calibration and emotion
recognition. In calibration, two types of
emotion were induced using the International
Affective Digitized Sounds (IADS), developed
by the National Institute of Mental Health
Center for the Study of Human Emotion. Every
test subject listened to sound clips from
IADS with eyes closed to avoid blink interruptions.
In the calibration part, emotion 1 began with 5
seconds of silence to calm down, followed by 10
seconds of one category of emotion stimulus, and
then the SAM rating was carried out. After that,
the procedure was repeated for emotion 2, the other
category. This section aimed to calibrate emotion
for each subject. In other words, the specific
features or standard of each personal emotion type
were obtained.

In the test part, the subjects filled in the
questionnaires after an exam of at least 30 minutes
in the bridge simulator. Figure 4 shows the process
of the experiment.


3.4 Statistical analysis 


As the subjective ratings in the emotion
questionnaire were not normally distributed, a
non-parametric test was conducted. The sound clips of
the IADS database are considered as independent
variables; pleasure, arousal, and dominance
are considered as dependent variables. The
correlation between valence and dominance within
subjects is calculated using Spearman’s correlation
(Geethanjali et al., 2017).
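
Spearman’s correlation is the Pearson correlation computed on ranks, which is why it suits ordinal SAM ratings. The sketch below is a plain-Python illustration with average ranks for ties; the function names and the five example ratings are hypothetical and are not taken from the study.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical ratings on the nine-point scale (not study data).
pleasure = [3, 5, 7, 8, 9]
dominance = [9, 7, 6, 4, 1]
rho = spearman_rho(pleasure, dominance)
```

In this made-up example dominance falls whenever pleasure rises, so the ranks are perfectly reversed and rho is -1.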


4 RESULT AND DISCUSSION 


4.1 Descriptive statistics


This study collects 22 (11 x 2) calibration
questionnaires and 11 test questionnaires
reflecting 11 seafarers’ emotions. Table 1 shows
descriptive statistics for the seafarers in this
research, while Table 2 gives the statistics in the
IADS (2nd edition) database. The correlation
coefficients are given in Table 3, and the
correlation is significant at the 0.01 level. Sound
clip 105 represents negative emotion, while clip
220 represents positive emotion in this study. The
letters “p”, “a”, and “d” represent “pleasure”,
“arousal”, and “dominance” respectively, and “t”
means test emotion. Most of the mean values in this
study approximately coincide with the mean values
of the IADS, except for the pleasure dimension in
negative emotion. The difference is also reflected
in the questionnaires: notes state that some of the
subjects could not make sense of the first clip and
therefore gave the neutral rating of 5.


4.2 Emotion identification 


After collecting the emotion data from the seafarers
by SAM questionnaires, SVM is used to identify
the emotion category during watch-keeping. Overall,
the samples from the 11 subjects form a 33 x 3
matrix of emotion descriptions and a 33 x 1 matrix
of emotion labels. The first 22 samples are from
the calibration part and serve as the training set
for the SVM, and the last 11 samples are from the
test part and serve as the test set. The SVM model
is used to find a hyperplane that divides the test
set into two emotion categories. Figure 5 shows the
result of the test classification with an accuracy
of 72.73%, where “1” represents negative emotion
and “2” positive emotion. The kernel function of
this model is calculated as follows: “-t = 2” means
that the kernel type is the radial basis function,
$\exp(-\gamma |x - x'|^2)$; “c = 776.0469” is the
cost parameter C; “-g = 0.0068012” is $\gamma$ in
the kernel function.


Table 1. Statistics of seafarers in the research.

        Min.   Max.   Mean   SD
105p    1      9      4.82   2.601
105a    1      7      4.18   2.272
105d    1      8      5.18   2.523
220p    3      9      8.09   1.814
220a    1      8      5.27   2.195
220d    3      9      6.36   1.912
tp      3      9      5.73   1.679
ta      1      7      4.64   2.063
td      1      9      6.00   2.449

*p: pleasure, a: arousal, d: dominance.


Table 2. Statistics in the IADS (2nd Edition).

        Mean   Std. Deviation
105p    2.88   2.14
105a    6.40   2.13
105d    3.8    2.17
220p    7.28   1.91
220a    6.0    1.93
220d    5.99   1.88


Table 3. Correlation coefficient.

            Pleasure   Arousal   Dominance
Negative    1.000      -0.210    0.043
Positive    1.000      0.321     -0.068
Test        1.000      -0.480    -0.742


[Figure 5: emotion prediction; axes: Sample, Sample label.]

Figure 5. Emotion identification by using the
SVM: Accuracy = 95.4545% (21/22) (training);
Accuracy = 72.73% (8/11) (test).
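
The radial basis function kernel named in Section 4.2 can be evaluated directly. The sketch below uses the reported value of the kernel parameter; the function name `rbf_kernel` and the two example rating vectors are hypothetical, and whether the study applied the kernel to raw or normalised ratings is not assumed here.

```python
import math

GAMMA = 0.0068012  # kernel parameter value reported in the paper


def rbf_kernel(x, x_prime, gamma=GAMMA):
    """Radial basis function kernel: exp(-gamma * |x - x'|^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * squared_distance)


# Hypothetical (pleasure, arousal, dominance) vectors for illustration.
same = rbf_kernel((5, 5, 5), (5, 5, 5))  # identical points
far = rbf_kernel((1, 1, 1), (9, 9, 9))   # opposite ends of the scale
```

The kernel equals 1 for identical points and decays towards 0 as the points move apart, so a small value of the parameter gives a slowly decaying, smooth similarity measure.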

The emotion identifications by questionnaire 
from test subject and the SVM methodology are 
presented in Table 4, where “P” represents positive 
and “N” represents negative. More specifically, 
the self-rated emotions of subjects 2 and 10 are
positive but were predicted as negative.
The self-rated emotion of subject 9 is
negative, while it was predicted as positive. For
all other subjects, self-rating and SVM give the
same result.
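
The reported test accuracy can be checked from the per-subject labels. In the sketch below the two lists follow the description in the text (subjects 2 and 10 self-rated positive but predicted negative, subject 9 self-rated negative but predicted positive, all others in agreement); the variable names are illustrative only.

```python
# Self-rated (SR) and SVM-predicted labels for subjects 1 to 11,
# "P" = positive, "N" = negative, as described in the text.
self_rated = ["P", "P", "P", "P", "N", "N", "P", "N", "N", "P", "P"]
predicted = ["P", "N", "P", "P", "N", "N", "P", "N", "P", "N", "P"]

# Subject IDs (1-based) where the SVM disagrees with the self-rating.
mismatches = [
    subject_id
    for subject_id, (sr, svm) in enumerate(zip(self_rated, predicted), start=1)
    if sr != svm
]

accuracy = (len(self_rated) - len(mismatches)) / len(self_rated)
accuracy_percent = round(100 * accuracy, 2)
```

Eight of the eleven test samples agree, which corresponds to the 72.73% given in Figure 5.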


4.3 Human performance 


The comments on the examination for each test
subject are further analysed to investigate whether
negative emotion identified by the SAM scale
questionnaire affects human errors and human
performance. The comments from the experts, which
are a mandatory part of the qualification exam, are
recorded on audio. This part took place after the
whole session, beginning with summarised comments
from self-evaluation and the third party, and
ending with the experts’ comments.


Table 4. Comments from self-evaluation and third party.

ID | Emotion (SR) | Emotion (SVM) | Self-evaluation | Third party
1 | P | P | Untimely watch keeping in poor visibility; wrong operation sequence | Operate in incorrect sequence when stopping
2 | P | N | Too late to realise poor visibility; speed control problem; inaccurate report in time | Unconcerned watch keeping
3 | P | P | Anxious when collision; wrong decision making (collision at ship body instead of bow) | Not fulfil the Convention on the International Regulations for Preventing Collisions at Sea (COLREGs)
4 | P | P | Tension during ship encounter; response too late; unfamiliar with navigation device; too panic when strand | Mistake sail with the current for sail against the current; not fulfil COLREGs
5 | N | N | Speed control problem; not enough communication; not stop timely | Wrong decision making of the captain; manoeuvring inappropriate
6 | N | N | Speed control problem; course deviation | Not enough communication; not enough cooperation
7 | P | P | Late report in emergency; unconcerned; inappropriate manoeuvring | Wrong manoeuvring; too high speed; course deviation
8 | N | N | Not familiar with rudder failure; failure to meet a contingency | Slow speed affect steering
9 | N | P | Not switch on navigation lights when starting fog | Not on-time watch keeping; too large deflection angle
10 | P | N | Unfamiliar with navigation environment; not report the collision on time; failed to fulfil COLREGs | Unfamiliar with navigation device; ignore environment when reporting
11 | P | P | Anxious when getting hurt; irregular language | Speed control problem

According to the self-evaluation from the subjects and the expert comments after the qualification exam, emotion emerging during watch-keeping commonly affects ship manoeuvring, concentration, response to an emergency, and decision-making.

For example, test subject 1 was not able to concentrate on watch-keeping in poor visibility when sailing, and therefore failed to notice a crew member falling into the water. Moreover, eliminating the hazard required stopping the ship in an accurate and timely operation sequence, which was not followed. Test subjects 2 and 7 were both unconcerned when encountering collision scenarios in poor visibility, resulting in delayed reports and operational problems: subject 2 reported inaccurately in the collision scenario and subject 7 deviated from the course. Subject 3 showed obvious anxiety when the collision occurred and consequently did not fulfil COLREGs (the Convention on the International Regulations for Preventing Collisions at Sea). Subject 11 became anxious when a crew member got hurt, which led to irregular use of language and inappropriate manoeuvring. Subject 4 felt tension during the encounter and panic during stranding, which caused several mistakes, as shown in Table 4. In addition, subjects 4 and 10 had physiological problems because they were unfamiliar with the device, and they did not fulfil COLREGs.

Test subjects 1, 2, 3, 4, 7, 10 and 11, who showed the emotion problems described above, all rated their overall emotion as positive after the sessions. In contrast, the subjects who rated their emotion as negative did not show any apparent emotional interference with performance. From this perspective, an overall positive emotion identified by SAM rating does not imply a positive effect on sailing, and a negative emotion identified by SAM rating does not necessarily lead to negative behaviour. Emotion rating through subjective judgement reflects the overall feeling after the examination, whereas human errors occur at specific instants, so the two do not match well. Moreover, the subjects may hide or ignore their true feelings when the mission is well done, and the questionnaires filled in after the examination depend more on the result of the scenario completion than on the process.


5 CONCLUSION 


Seafarers experience emotions when sailing. These emotions emerge during watch-keeping and could jeopardise the performance and decision-making of seafarers. When an emergency happens, timely reporting and accurate operation are required on ships. This study utilised the SAM rating scales of 11 test subjects to establish a classification model. The model was trained with an SVM classifier and reached an accuracy of 72.73%. The results concerning officers’ emotion in a bridge simulator test reveal that seafarers’ emotion in maritime operations affects their behaviour, as well as the possibility of errors in decision-making. However, there is no strong correlation between the emotion modes identified and the behavioural consequences: an overall positive emotion identified by SAM rating does not imply a positive effect on sailing, and a negative emotion identified by SAM rating does not necessarily lead to negative behaviour.

This study does not capture accurate real-time responses, as the rating is done after the examination. Moreover, some seafarers may hide or ignore their true feelings in the questionnaire after the exam if the emergency problems in the scenarios were solved properly. Thus, the relations between emotion and human errors are complex and need to be further analysed on the basis of real-time physiological responses.

Seafarers tend to be in a vulnerable position when manoeuvring in a bridge simulator. Conducting psychophysiological research in a bridge simulator is therefore a meaningful way to study human factors and human errors in maritime operations. In addition, bridge simulation benefits research on human factors, especially with regard to crew training requirements. From these perspectives, further studies will involve psychophysiological methods to measure seafarers’ state and performance in real-time bridge simulation.


ABSTRACT: Recently, numerous organizations have made progress in developing autonomous ships, motivated by, among other factors, the potential for increased safety. This applies especially to accidents involving human error, as autonomous ships would remove the operator’s role from all or most operations. In reality, although the human role is reduced, autonomous ships would still rely on operators for supervision, remote control, and involvement in case of a glitch or an unexpected situation. Thus, autonomous ships do not fully eliminate the possibility of human error. In this study, we assess the potential for human error in autonomous ship operations. We analyze an unmanned autonomous ship operation, and through a generic analysis of the interaction between the system and the operators working in a Shore Control Centre (SCC), we identify possible Human Failure Events (HFE). This provides a starting point for performing human reliability analysis of autonomous ship operation.


1 INTRODUCTION 

Important projects have drawn attention to 
autonomous ships in recent years. The so-called 
MUNIN (Maritime Unmanned Ships through 
Intelligence in Network) is currently establishing a 
concept for an unmanned merchant ship. AAWA 
(Advanced Autonomous Waterborne Applications 
Initiative), in turn, investigates challenges in dif- 
ferent scientific fields related to autonomous ship- 
ping operations (Laurinen, 2016). A third example 
is the DNV GL ReVolt project, which is a concept 
developed by DNV GL for an unmanned, zero- 
emission, shortsea vessel. 

One of the motivations for using autonomous ships, common to autonomous systems in general, concerns the potential for increased safety and reliability. Human error is an important root cause or contributing factor of accidents in a diversity of industries and activities, e.g. 70 to 80% in aviation (Wiegmann and Shappell, 2012), over 80% in chemical and petrochemical industries (Kariuki and Löwe, 2007), and over 90% in road traffic accidents (Treat et al., 1979). The maritime activity does not differ from those: the European Maritime Safety Agency points to human error as the triggering factor in 62% of incidents with EU registered ships from 2011 to 2016 (EMSA, 2014). Moreover, statistics on fatal accidents have ascertained that work on deck, for example mooring operations, is 5 to 16 times more dangerous than jobs ashore (Blanke, Henrique and Bang, 2017). Therefore, removing the human operator from all (or most) operation tasks is believed to avoid accidents (Laurinen, 2016). However, as highlighted by Rødseth and Tjora (2014), human error can still occur in autonomous ship operation.

Current projects on autonomous ships have different views on how they should operate in terms of crew onboard and autonomy level. The MUNIN project calls for a ship that would be unmanned only on the deep-sea part of the voyage, with a crew onboard for the departure from and approach to port. When unmanned, the ship would be autonomous, but monitored by operators in a Shore Control Center who can take over control of the ship in certain situations. The AAWA project, on the other hand, works with the concept of unmanned ships, i.e., there would be no crew onboard at any time of the voyage, with dynamic levels of autonomy. This means the autonomy level would depend on the state of the vessel and the mission being executed. In some cases, such as navigation in the open seas, the ship can be nearly fully autonomous, whereas for some parts of the voyage it will require close supervision and decision making, or even full tele-operation, from the human operator.

The two concepts above show that, although an autonomous ship would involve less intervention by a human operator, the human is still part of the operation: the ship would still rely on a human operator for one of the voyage phases (e.g. departure and docking) or for taking over control in case there is a situation the autonomous system cannot resolve by itself. Therefore, autonomous ship operations are not free of the possibility of accidents generated or aggravated by human error. The current literature, however, has not focused deeply on the human element when considering autonomous ships’ safety. In fact, as pointed out by Parasuraman, Sheridan and Wickens (2000), for autonomous systems in general there is a voluminous technical literature on automation, but a still small (though growing) research base examining the human capabilities involved in work with automated systems.

The potential for human error in autonomous 
ships can be assessed through a Human Reliability 
Analysis (HRA). HRA is a technique to assess both
quantitatively and qualitatively the human contri- 
bution to accidents. Swain and Guttman (1983) 
define human reliability as the probability that a 
person (1) correctly performs an action required 
by the system in a required time and (2) that this 
person does not perform any extraneous activity 
that can degrade the system. HRA is thus, in short, 
a method by which human reliability is estimated 
(Swain and Guttman, 1983; Swain, 1990). 

To be able to perform an HRA it is essential, first of all, to understand how the operators will interact with the system. If autonomous ships are to become a reality in the years to come, ensuring they are safe and reliable is imperative, and the possibility of human errors cannot be downplayed.

The current literature presents some relevant 
works on autonomous ships containing discus- 
sions on human factors topics, mostly pointing out 
the factors that could affect the operators’ decision 
and actions (Ottesen, 2014; Rødseth and Tjora, 
2014; Laurinen, 2016). A more general discussion 
on these factors, applied to all autonomous sys- 
tems, can also be found (Parasuraman, Sheridan 
and Wickens, 2000; Chen, Haas and Barnes, 2007). 
In terms of identifying the possible human failure 
events (HFEs) in autonomous ships operations, 
however, the literature still falls short—and this 
paper aims to fill in this gap. 

The identification and definition of HFEs can be considered the starting point of an HRA (Ekanem, 2013). Boring (2014) differentiates between two approaches for identifying HFEs. A top-down approach would start with the analysis of hardware faults and deduce human contributions to those faults, and is widely used in Probabilistic Risk Assessments in the nuclear industry. A bottom-up approach, on the other hand, would look at opportunities for human errors and then model them in terms of their potential for affecting safety outcomes. This paper adopts a bottom-up approach, screening the interactions between the operators and the system and the subsequent tasks in order to identify the HFEs. The present paper, hence, aims to analyze the interactions between the operators and the system in the operation of autonomous ships, and to identify the possible human failure events that derive from them.

The discussions of this paper are part of ongoing
research aiming to identify and model the risks
arising from autonomous ship operation. The
scope of this paper is limited to human
actions and human failures during operation, 
under the assumption that the system would work 
as expected. Therefore, it does not cover system 
failures, which will be addressed in forthcoming 
papers by the authors. 

Nonetheless, it is important to acknowledge 
that the possibility of human error is not restricted 
to the operation of the ships. Human error asso- 
ciated with autonomous ships can be related to 
design, construction and installation, testing and 
verification and maintenance, among other activi- 
ties carried out by humans prior to the operation. 
This paper does not cover these tasks, as it focuses 
on the operation only, i.e., it considers that there 
would be no failures in all of these tasks previous 
to operation and navigation. Hence, the question 
it aims to answer is: given perfect design, mainte- 
nance, equipment and instruments behavior, could 
human actions affect safety during autonomous 
ships operation? 

The paper is organized as follows: Section 2 presents the system description and the assumptions made in this study, and Section 3 describes the interactions between the operators and the system and the possible Human Failure Events deriving from these interactions. Section 4 presents some concluding thoughts.


2 SYSTEM DESCRIPTION 


As stated in Section 1, the ongoing projects on autonomous ships have different concepts in terms of whether the ship is manned and of the level of autonomy. Utne et al. (2017) use the following definition of autonomy (adjusted from National Institute of Standards and Technology (NIST) (2008)): “a system’s or sub-system’s own ability of integrated sensing, perceiving, analyzing, communicating, planning, decision-making, and acting, to achieve its goals as assigned by its human operator(s) through designed Human-Machine Interface (HMI)”. From fully manual control to fully autonomous systems there can be distinct levels of autonomy (LoA), and the literature provides different proposals for these levels and their taxonomy. One of the oldest taxonomies is the one proposed by Sheridan and Verplank (1978), with 10 LoA, where Level 1 corresponds to fully manual control and Level 10 to fully autonomous control. A review of the proposals can be found in Vagia et al. (2016).

Among the autonomous ship concepts indicated in Section 1, the analysis in this paper is based on the AAWA concept: unmanned ships with a dynamic level of autonomy. A dynamic level of autonomy means that the autonomy level can change depending on the context of the voyage, e.g., one phase of the voyage is set to be fully autonomous (Level 10 in the Sheridan and Verplank taxonomy) but the operation encounters a small problem and gives the operator a “veto” option before solving it autonomously (Level 6 of autonomy). The reasons for choosing this concept are: (i) because it is unmanned, it offers the case study that differs most from current ship operations, and (ii) because it has a dynamic autonomy level, it covers a wide range of situations, from totally autonomous operation to tele-operated control.

Since the ship is unmanned, the operators would be working onshore, and we assume they would be working in a Shore Control Center (SCC) such as the one proposed in the MUNIN project. The MUNIN project website¹ offers a range of information and publications on the Shore Control Center. Essentially, the Shore Control Center acts as a manned supervisory station for monitoring and remotely controlling a fleet of autonomous ships. Most of the time the ships would operate autonomously, without the need for intervention from shore. When needed, though, the operators in the SCC would provide assistance and may take over control of the ship (Porathe, 2013; Porathe, Prison and Man, 2014; MUNIN, 2016).

The voyage can be divided into four phases, in which the operators would have different possible levels of interaction with the system: voyage planning, unmooring and maneuvering out of dock, open sea navigation, and port approaching and docking. The following description of these phases is based on the information stated in the AAWA whitepaper (Laurinen, 2016) for a general cargo vessel.

The first phase is voyage planning, in which the operators assess or define certain conditions of the voyage. The operators’ assessment makes use of systems that should be present in the ship, such as an automatic system for verifying sea readiness before starting the voyage. Most of the systems can be checked remotely by the operator, while in some areas (such as securing cargo) shore-based crew can also be used to check that the voyage can be started.

¹ http://www.unmanned-ship.org/munin/

One of the conditions that has to be assessed by the operator prior to the voyage is the connectivity: some of the remote control or remote supervision modes might require a latency and bandwidth that exceed the capability of the satellite systems in adverse weather conditions. The operator will then have to ensure that there is sufficient connectivity for the intended mission. If there is enough connectivity for the mission, the operator then has to define the primary operational strategy for each leg (autonomous or manual), considering the weather and environmental conditions. Note that manual operation, in this case, means remote operation from the SCC. Next, the operator defines the navigational and fallback strategies. Although the AAWA whitepaper does not describe what the operators take into account when defining “navigational strategies”, we believe that in addition to predefined paths and waypoints these would also include considerations of maintenance (checking when the next maintenance of a given piece of equipment is due relative to the length of the voyage), propulsion and fuel consumption. The fallback strategy, on the other hand, is a strategy executed if the ship experiences an unexpected situation that would require operator intervention. The fallback strategy could include: asking the operator to take manual control, slowing down and proceeding to the following waypoint, stopping the vessel and staying in DP (dynamic positioning) mode, navigating to the previous waypoint, or navigating back to a preset safe location. The commands and their execution sequence are not the same in all parts of the voyage. For example, trying to maintain position in the middle of a congested and narrow fairway in harsh weather might not be a feasible strategy.
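The idea that the fallback sequence depends on the part of the voyage can be sketched as an ordered list of commands that is filtered per leg. The command names below are taken from the list in the text; the preference order, the function name `plan_fallbacks` and the example legs are hypothetical and are not described in the AAWA whitepaper.

```python
from enum import Enum


class Fallback(Enum):
    """Fallback commands named in the text (the enum itself is illustrative)."""
    ASK_MANUAL_CONTROL = "ask operator to take manual control"
    SLOW_TO_NEXT_WAYPOINT = "slow down and proceed to following waypoint"
    STOP_AND_HOLD_DP = "stop the vessel and stay in DP mode"
    PREVIOUS_WAYPOINT = "navigate to previous waypoint"
    SAFE_LOCATION = "navigate back to preset safe location"


def plan_fallbacks(preferred, infeasible):
    """Return the preferred commands in order, skipping those ruled out for a leg."""
    return [cmd for cmd in preferred if cmd not in infeasible]


# Hypothetical preference order, assumed identical for every leg in this example.
default_order = [
    Fallback.ASK_MANUAL_CONTROL,
    Fallback.STOP_AND_HOLD_DP,
    Fallback.PREVIOUS_WAYPOINT,
    Fallback.SAFE_LOCATION,
]

# Congested, narrow fairway in harsh weather: holding position may not be
# feasible (the example given in the text), so that command is excluded.
fairway_plan = plan_fallbacks(default_order, infeasible={Fallback.STOP_AND_HOLD_DP})

# Open sea leg: no command is excluded in this illustration.
open_sea_plan = plan_fallbacks(default_order, infeasible=set())
```

Keeping the plan as an ordered list per leg mirrors the statement that both the commands and their execution sequence may differ between parts of the voyage.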

It is important to bear in mind that, given 
dynamic autonomy, the definitions made in the 
voyage planning are not static, i.e., it can be that 
one leg was defined to be autonomous but due to 
external circumstances it goes manual. Moreover, 
the voyage plan as well as the fallback strategies 
can always be modified during the voyage using the 
satellite communication link. 

The phase after voyage planning would be 
unmooring and maneuvering out of dock. The 
mooring systems can be fully or semi-automatic. 
A fully automatic mooring system would mean 
that the operation can be remote controlled or 
automatically executed by the autonomous vessel. 
A semi-automatic mooring, on the other hand, 
means that connection to the quay can be made 
automatically but the crew is needed to secure the 
docking. When the ship is maneuvered out of the 
congested harbor area, it can be controlled by 
the operator or it can use the dynamic position- 
ing control computer and autonomous control 
system to reach the waypoint. Moreover, in some 
areas it could go directly to autonomous mode 
instead of starting with teleoperation or supervi- 
sory control. 


The third voyage phase, after maneuvering out of dock, is the open sea navigation. In autonomous mode the ship executes the voyage according to the defined plan, and the operator receives relevant status data such as the ship’s location, heading, speed, ETA to the next waypoint (or area of closer supervision) and key information from the situational awareness systems as well as critical ship systems. For situations where the autonomous navigation system’s autonomous decision making threshold is exceeded, the operator is notified and can intervene. Therefore, the autonomy level is dynamically adjusted if the mission execution is not proceeding according to the original plan and the autonomous navigation system sees that adjustments are needed. AAWA distinguishes between two situations. The first is a “veto” situation, in which, for example, the vessel is deviating from the planned course between two waypoints but stays within the margins specified for the autonomous navigation system. In this case the system would notify the operator about the planned evasion and give the operator the possibility to veto it for a limited time. If modifications are needed, the operator can take the vessel into manual control. It can also be that the vessel would need to change course in such a way that a complete waypoint has to be re-planned. In order to ensure that changes to the plan are made in a safe way, operator confirmation will be requested. The autonomous navigation system will offer one or more alternatives for how the waypoint could be modified, but the operator will make the final decision on how to continue the voyage.

The second case would be a “pan-pan” situation, which arises when there is a complex scenario that the autonomous navigation system’s path planning and algorithms cannot unambiguously solve. An example of this could be when an extremely large number of craft or other objects are detected and the path planning algorithms are not capable of identifying them, so that the system cannot determine how the navigation should proceed. In this type of scenario the vessel will immediately send a “pan-pan” message to the operator indicating that it is in urgent need of assistance. The ship has a predefined set of fallback strategies (defined at the voyage planning phase) that it will start to execute in the planned order if no user response is received, and, depending on the urgency, automatic fallback strategy execution can also be started immediately.
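The two alert types can be summarised as a small decision sketch. It is a simplification under stated assumptions: the function names and action strings are hypothetical, a veto is assumed to result in the operator taking manual control, and a timely operator response to a non-urgent “pan-pan” is assumed to hand the decision to the operator.

```python
def resolve_veto(operator_vetoed_in_time: bool) -> str:
    """Outcome of a "veto" situation once the limited veto window has closed."""
    if operator_vetoed_in_time:
        # Assumption: a veto means the operator takes the vessel into manual control.
        return "operator takes the vessel into manual control"
    return "vessel carries out its planned evasion"


def respond_to_pan_pan(operator_responded: bool, urgent: bool, fallback_plan: list) -> list:
    """Ordered list of actions the vessel starts in a "pan-pan" situation."""
    actions = ["send pan-pan message to operator"]
    if urgent or not operator_responded:
        # Fallback strategies run in the order fixed during voyage planning.
        actions.extend(fallback_plan)
    else:
        # Assumption: with a timely response, the operator decides how to proceed.
        actions.append("hand decision to operator")
    return actions


# Hypothetical two-step fallback plan defined at voyage planning.
plan = ["slow down and proceed to following waypoint", "navigate back to preset safe location"]
no_response = respond_to_pan_pan(operator_responded=False, urgent=False, fallback_plan=plan)
timely_response = respond_to_pan_pan(operator_responded=True, urgent=False, fallback_plan=plan)
```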

The last phase of the voyage is port approaching and docking. Like the other phases, it can be remotely operated or autonomous. This phase, together with open sea navigation and unmooring and maneuvering out of dock, will be referred to as the “navigation phases”.

The next section details the interactions between
the operators and the system in each of these
phases.


3 INTERACTIONS BETWEEN THE OPERATORS AND THE SYSTEM


This section discusses the interactions between operators and the system for each voyage phase of the autonomous ship described in the previous section. It explains the operators’ main tasks and the possible decision/action paths they may take when accomplishing these tasks.

The outcomes of the operators’ actions will be described in this paper as a “success” or a “failure” of that voyage phase. A successful operation is defined here as an operation that did not encounter any unexpected problem, or an operation that did encounter a problem but successfully recovered from it, through the operators’ actions or autonomous problem solving. For instance, if during autonomous navigation in the open sea the ship faces a complex situation it cannot solve autonomously and gives a “pan-pan” alert to the operators, and they take over control in time and manage to bring the ship back to a safe status, this is a successful open sea voyage. An unsuccessful operation, on the other hand, is one that encounters a problem and does not recover from it. In the previous example, if the operators fail to respond to the “pan-pan” alert and the ship follows a fallback strategy that is inadequate, this would be a failure in the open sea voyage, leading to an incident.
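The success and failure definitions above reduce to a simple rule, sketched here for illustration. The function name and boolean arguments are assumptions introduced for this example; recovery may come from the operators’ actions or from autonomous problem solving.

```python
def phase_outcome(problem_encountered: bool, recovered: bool) -> str:
    """Classify a voyage phase as "success" or "failure" per the definition above."""
    if not problem_encountered or recovered:
        # No unexpected problem, or a problem that was recovered from.
        return "success"
    return "failure"


# "Pan-pan" alert answered in time and the ship brought back to a safe status.
recovered_case = phase_outcome(problem_encountered=True, recovered=True)

# Operators fail to respond and the fallback strategy is inadequate.
unrecovered_case = phase_outcome(problem_encountered=True, recovered=False)
```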

Note that for voyage planning a failure will not itself cause an accident, but it will increase the probability of having a “veto” or “pan-pan” situation in the following phases. For example, if during voyage planning the operator decides that the open sea voyage should be autonomous when the environmental conditions are not safe for this mode of operation, there will be a higher chance that an unexpected situation arises during the voyage and the operator receives a “pan-pan” alert about it. Failures in the following phases, on the other hand, can cause accidents. These may, however, differ in terms of gravity: an accident while still in harbor is less likely to have catastrophic consequences than one during the open sea voyage. Yet, the final events treated in this paper will be “success” and “failure”, without distinguishing between the types or severities of this failure, such as collision, grounding, etc. This is illustrated in the general Event Sequence Diagram in Figure 1, where the “success” outcome is represented by the green final event, which is reached if all voyage phases are successful, and the “failures” are represented by the red final events.


Figure 1. Interaction scheme for voyage planning.

For the unmooring and maneuvering out of harbor phase, a fully automatic system was assumed, i.e., the operation can be remote controlled or automatically executed by the autonomous vessel, depending on what was defined at the voyage planning phase. Moreover, although the AAWA whitepaper describes the “veto” and “pan-pan” situations only for the open sea voyage, it was assumed that they could also happen during unmooring and maneuvering out of harbor and during port approaching and docking, when these are in autonomous mode. In this sense, the possible interactions between the operator and the system in the three voyage phases that follow voyage planning are similar: the operation can be manual or autonomous; if it is autonomous it can i) operate as expected, ii) encounter a small problem and generate a “veto” situation, or iii) encounter a more complex problem and generate a “pan-pan” situation. Moreover, the operators’ possible responses to these situations are also similar in the three phases. Thus, what will be discussed below for unmooring and maneuvering out of harbor can be extended to the open sea navigation and port approaching and docking.
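Because the same situations are assumed for all three navigation phases, the cases to be screened can be enumerated mechanically. This is only a sketch: the list names and labels are illustrative, and the filter simply picks out the situations in which the operator receives an alert.

```python
from itertools import product

# The three phases that follow voyage planning.
NAVIGATION_PHASES = [
    "unmooring and maneuvering out of harbor",
    "open sea navigation",
    "port approaching and docking",
]

# Manual operation, plus the three autonomous situations listed in the text.
SITUATIONS = [
    "manual operation",
    "autonomous: operates as expected",
    'autonomous: "veto" situation',
    'autonomous: "pan-pan" situation',
]

# Every phase/situation combination to be screened.
cases = list(product(NAVIGATION_PHASES, SITUATIONS))

# Combinations in which the operator receives an alert from the vessel.
alert_cases = [(phase, situation) for phase, situation in cases
               if "veto" in situation or "pan-pan" in situation]
```

This gives twelve phase and situation combinations, six of which involve an alert to the operator.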

The operator’s interactions with the system in 
the voyage planning phase are described in sub- 
section 3.1, and for the following phases, exempli- 
fied by unmooring and maneuvering out of harbor 
operation, in sub-section 3.2. 


3.1 Voyage planning 


From the description in Section 2, it is possible to 
identify the operator’s tasks and possible paths in 
the voyage planning, which are described in the 
tables below. 


3.2 Unmooring and maneuvering out of harbor


The operator’s tasks and possible paths in Unmoor- 
ing and maneuvering out of harbor are described 
below, and can be extended for the open sea voyage 
and port approaching and docking phases. 


I. Autonomous operation 

When the operation is autonomous there can be 
a small problem that the vessel can solve auton- 
omously, in which case the operator receives a 
“veto” alert (Table 5). If there is a significant 
problem, the vessel gives a “pan-pan” alert to the 
operator (Table 6). In that case, if the operator 
does not take over control the vessel follows the 
fallback strategy.


Table 1. Possible operator decisions for Task 1.

Task 1 (T1): Ensure there is sufficient connectivity

I. If there is not sufficient connectivity, the operator can:
   T1_path 1: be wrong and believe there is sufficient connectivity, and the operation goes on
   T1_path 2: be right about the connectivity level and cancel the voyage
II. If there is sufficient connectivity, the operator can:
   T1_path 3: be wrong and believe the connectivity is not enough, and cancel the voyage
   T1_path 4: be right and the operation goes on

Table 2. Possible operator decisions for Task 2.

Task 2 (T2): Define primary strategy for each leg (autonomous or manual)

T2_path 1: Operator decides that the operation for one leg is autonomous when, due to weather and environmental conditions, it should be manual. For each phase:
   i. Unmooring and maneuvering out of harbor goes autonomous when it should be manual
   ii. Operation in open sea goes autonomous when it should be manual
   iii. Port approaching and docking goes autonomous when it should be manual
T2_path 2: Operator decides that one leg should be manual when it could be autonomous
   i. Unmooring and maneuvering out of harbor goes manual when it could be autonomous
   ii. Operation in open sea goes manual when it could be autonomous
   iii. Port approaching and docking goes manual when it could be autonomous
T2_path 3: Operator correctly decides that the operation for one leg is autonomous. For each phase:
   i. Unmooring and maneuvering out of harbor goes autonomous
   ii. Operation in open sea goes autonomous
   iii. Port approaching and docking goes autonomous
T2_path 4: Operator correctly decides that one leg should be manual:
   i. Unmooring and maneuvering out of harbor goes manual
   ii. Operation in open sea goes manual
   iii. Port approaching and docking goes manual


Table 3. Possible operator decisions for Task 3. 


Task 3 (T3): Define navigational strategies for the 
autonomous operations 


T3_path 1: The operator 
defines an incorrect 
navigational strategy 


T3_path 2: The operator 
decides for a good 
operational strategy 


II. Manual unmooring 

To aid in the visualization of these tasks and paths, these interactions are modeled through the schemes presented in Figure 2 for voyage planning and Figure 3 for unmooring and maneuvering out of harbor.

As stated previously, these interactions were modeled without yet considering system failures, e.g., a “pan-pan” situation in which the alert at the Shore Control Center fails, or the operator taking over manual control and the communication between the SCC and the vessel failing. The model thus isolates the human errors, considering no failure in other aspects of the operation.


Table 4. Possible operator decisions for Task 4.

Task 4 (T4): Define fallback strategy for the autonomous operations

T4_path 1: The operator defines an inadequate fallback strategy
T4_path 2: The operator defines an adequate fallback strategy


Table 5. Possible operator decisions for Task 5, after a “veto” alert is received during autonomous operation.

Task 5 (T5): Respond to “veto” alert

T5_path 1: The operator does not respond to the veto alert
T5_path 2: The operator responds to the alert and supervises the vessel solving the problem autonomously
   T5_path 2_1: The operator should have vetoed the operation and taken over control because the autonomous solution was not adequate
   T5_path 2_2: The autonomous solution is adequate
T5_path 3: The operator responds to the alert and takes over control of the vessel
   T5_path 3_1: The operator successfully operates the ship
   T5_path 3_2: The operator fails when operating the ship


[Figure 2 legend: operation is cancelled due to the operator’s belief that there is insufficient connectivity; decisions that will increase the probability of having problems ahead (in unmooring, open sea operation or port approaching and docking); the operator is not correct, but this will not bring major consequences ahead (in this case, resources and time will be consumed in a manual operation that could be autonomous); correct paths/decisions.]

Figure 2. Interaction scheme for voyage planning.

From the interaction schemes above it is possible
to identify the possible Human Failure Events
that could lead or contribute to accidents in
autonomous ship operation. Table 8 presents the HFEs
involved in the voyage planning phase. Note that
these failures would not cause an accident by themselves,
but would contribute to a “veto” or “pan-pan”
situation in the following phases, as illustrated
in Figure 1. Table 9 presents the HFEs involved in


Table 6. Possible operator decisions for Task 6, after
a “pan-pan” alert is received during autonomous
operation.


Task 6 (T6): Respond to “pan-pan” alert

T6_path 1: The operator does not respond to the “pan-pan” alert
T6_path 2: The operator responds to the alert and supervises
the vessel solving the problem autonomously following the
fallback strategy
  T6_path 2_1: The operator should have taken over control
  of the ship because the fallback strategy is not adequate
  T6_path 2_2: The fallback strategy is adequate
T6_path 3: The operator responds to the alert and takes
over control of the vessel


Figure 3.


Table 7. Possible operator decisions for Task 7, in manual
operation.


Task 7 (T7): Unmooring and maneuvering out of harbor
by tele-operation

T7_path 1: Operator successfully operates the ship
T7_path 2: Operator fails to operate the ship


the following phases (navigation phases). These 
are the HFEs that could lead to the “failure” final 
events in Figure 1. 

The Human Failure Events presented in Tables 8
and 9 are defined rather broadly and can be
decomposed, if needed, to identify sub-HFEs. In
this sense, they are a general representation of
what could go wrong, in terms of human failure, in
autonomous ship operation. They are a starting
point for analyzing, for each HFE, the crew
cognitive processes involved, in order to
identify more specific Failure Modes that would
lead to each HFE. Furthermore, for each Failure
Mode it will be possible to identify and assess the
factors that influence the operator’s decisions and
actions, i.e. the Performance Influencing Factors
(PIFs).

In this sense, as stated in Section 1, the identi- 
fication of Human Failure Events is the first step 
towards a solid Human Reliability Analysis. 

The description of the operator-system inter- 
actions in autonomous ships and possible HFEs 
Figure 3. Interaction scheme for unmooring and maneuvering out of harbour.
Note: the following phases (open sea navigation, and port approaching
and docking) follow a similar scheme.


Table 8. Human failure events in voyage planning
phase.

Path: T1_path 1
Human Failure Event: Failure to correctly assess connectivity level
Description: The operator is wrong about the low level of
connectivity. The operation goes on, and the low level of
connectivity can lead to communication problems between the
SCC and the ship.

Path: T2_path 1
Human Failure Event: Failure to correctly define primary strategy
Description: During definition of the primary strategy for each
leg, the operator believes the conditions are adequate for
autonomous operation when, in that situation, it should be
manual (tele-operated).

Path: T3_path 1
Human Failure Event: Failure to define adequate navigational strategy
Description: The operator defines an inadequate navigation
strategy. This will increase the probability of having problems
ahead and a “veto” or “pan-pan” situation.

Path: T4_path 1
Human Failure Event: Failure to define adequate fallback strategy
Description: The operator defines an inadequate fallback strategy.
In case there is a “pan-pan” situation, the fallback strategy
will be followed by the ship if the operator does not take
manual control of the ship.


Table 9. Human failure events in navigation phases.

Paths: T5_path 1, T6_path 1
Human Failure Event: Failure to respond to an alert
Description: The operator does not respond to an alert, which
may be a “veto” alert or a “pan-pan” alert.

Paths: T5_path 3_2, T6_path 3_2, T7_path 2
Human Failure Event: Failure to remote-operate the ship
Description: The operator is manually operating the ship, which
may be after a “veto” or a “pan-pan” alert or may be from the
beginning of that operation, in case it was defined to be manual.

Paths: T5_path 2_1, T6_path 2_1
Human Failure Event: Failure to take over control of the ship
when necessary
Description: The operator trusts the autonomous solution or
fallback strategy and does not take over control of the ship in
a situation where this is needed.


deriving from it demonstrates that there is still 
room for human failure in its operation. The 
assessment of human error, therefore, cannot be 
neglected or minimized when considering autono- 
mous ships’ safety and reliability. 
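
As a rough illustration of how the path-to-HFE assignment of Table 9 could be held for later decomposition into failure modes, the sketch below stores it as a lookup table and groups the paths by HFE. The path identifiers and HFE names are taken from Table 9; the variable and function names are hypothetical and are not part of the original analysis.

```python
from collections import defaultdict

# Path identifiers and HFE names copied from Table 9.
HFE_BY_PATH = {
    "T5_path 1": "Failure to respond to an alert",
    "T6_path 1": "Failure to respond to an alert",
    "T5_path 3_2": "Failure to remote-operate the ship",
    "T6_path 3_2": "Failure to remote-operate the ship",
    "T7_path 2": "Failure to remote-operate the ship",
    "T5_path 2_1": "Failure to take over control of the ship when necessary",
    "T6_path 2_1": "Failure to take over control of the ship when necessary",
}


def paths_by_hfe(mapping):
    """Group decision paths under the HFE they lead to (hypothetical helper)."""
    grouped = defaultdict(list)
    for path, hfe in mapping.items():
        grouped[hfe].append(path)
    return dict(grouped)


GROUPED = paths_by_hfe(HFE_BY_PATH)
for hfe, paths in GROUPED.items():
    print(f"{hfe}: {', '.join(paths)}")
```

Grouping by HFE rather than by task makes it easier to attach failure modes and PIFs to each HFE once, instead of repeating them for every path.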


4 CONCLUDING THOUGHTS 


This paper demonstrates that, although human
interaction is reduced in autonomous ships compared
to conventional ships, the human still plays a role,
with a potential for human error that has to be
considered.

One of the concerns regarding the operation of
autonomous ships is the new risks it can pose,
and how to assess them. Because this is a novel
operation, the possible interactions between operator
and autonomous ship are not yet clear; this paper
contributes to identifying and modeling these
interactions. The paper also identifies, at a high
level, possible Human Failure Events deriving
from these interactions.

Three HFEs deserve particular attention, for
they can lead or contribute to an accident such as
a collision or grounding: a failure to respond to an
alert, a failure to remotely operate the ship, and a
failure to take over control of the ship when
necessary. A deeper analysis of these events is needed
to identify possible failure modes and the factors that
can influence them. That analysis, in the context of
an HRA, will make it possible to identify
opportunities to reduce the likelihood of critical
human failures.

Moreover, it can be a basis for discussing
whether autonomy will indeed reduce the likelihood
of accidents caused or aggravated by human
error, and for determining the appropriate level of
autonomy that can lead to safer operation.

It is important to point out that this paper
approaches human actions focusing on their
contribution to accidents. Actions of operators onboard,
however, also contribute to avoiding accidents and/or
to reducing the severity of their consequences.
This aspect needs to be evaluated as well in further
discussions on the shift from onboard to onshore
operation.
Human reliability analysis in NPP: A plant-specific sensitivity
analysis considering dynamic operator actions versus accident
management actions

NPP Goesgen-Daeniken AG, Daeniken, Switzerland


ABSTRACT: Human reliability analysis is a method by which, in general terms, the human impact
on the safety and risk of nuclear power plant operation can be modelled, quantified and analyzed. It
is an indispensable element of the PSA process within the nuclear industry today. This paper
presents a sensitivity study of the human reliability analysis performed on a realistic, plant-specific
probabilistic safety assessment model of a nuclear power plant. The analysis is performed on a
pre-selected set of post-initiator operator actions. The purpose of the study is to investigate the
impact of these operator actions on the plant risk by altering their corresponding human error
probabilities over a wide spectrum. The results indicate that future effort should be focused on
maintaining the current human reliability level, i.e. not letting it worsen, rather than on improving it.


1 INTRODUCTION

The human reliability analysis (HRA) is an
integral part of the probabilistic safety analysis
(PSA) in the nuclear power plants (NPP)
throughout the world. While there are those who
contend that it is the single most important
element of the entire plant analysis, others contend
that any attempt at rigorous evaluation of human
reliability is merely an academic exercise in
futility. Human beings are unpredictable, they do
not obey the laws of physics, and it is impossible
to collect relevant statistical evidence to
understand or predict human behavior during complex
“real world” situations. While it may be true that
human behavior is not as statistically predictable
as rolling dice, it is also true that the
probabilistic framework of PSA acknowledges that all data
are uncertain, and it rigorously accounts for these
uncertainties. Lack of directly relevant
experience and statistical evidence does not preclude a
fundamental understanding of the motivations,
physical and psychological factors, and external
constraints that would most strongly influence a
human response in a specific situation.

The HRA has a multirole purpose: identifying
critical human-system interactions, i.e. human
actions; modelling and quantifying the associated
human error probability (HEP); HRA optimization.
Various preventive and mitigative operator
actions (OA) are planned, trained and employed
for various accident scenarios that are postulated
in the nuclear industry.

There are various methods, models and databanks
with estimated HEPs by the help of which
qualitative and quantitative analysis of the
reliability of OAs in the NPPs can be performed. The
inherent variability of the human performance
under different conditions and for different
functions implies relatively wide uncertainty bounds
for the HEP estimates. The uncertainty should be
generally smaller for the routine tasks such as test,
maintenance, normal control room operations and
higher for the OAs as a response to an abnormal
event (IAEA 1995, 1996, U.S. NRC 1983 a, b, c).
The literature offers a wide spectrum of studies in
regards to the various HRA qualitative and
quantitative methods (Hannaman & Worledge 1988,
Lydell 1992, Moieni et al. 1994, Kim & Seong
2006, Khalaquzzaman et al. 2010, Podofillini &
Dang 2013), its coupling with the deterministic
safety analysis (Cepin & Prosek 2008), as well as
HRA uncertainty and sensitivity studies (Fujimoto
et al. 1994, Cepin 2008, Bedford et al. 2013, Baraldi
et al. 2015).

In that sense, the presented paper summarizes
a HRA sensitivity study performed on a realistic
NPP-specific PSA model. The NPP of interest
is the Goesgen-Daeniken NPP (KKG) in Switzerland,
a 3-loop PWR plant. The plant PSA model is
prepared with the RISKMAN® analysis tool (ABS
Consulting Ltd. 2008, 2015).

Firstly, all the Type C or post-initiator OAs
modelled within the PSA model (ca. 30 different
actions with their corresponding split fractions)
are subjected to sensitivity analysis such that
their corresponding HEP values are altered
through a wide spectrum of values. Consequently,
the quantitative effects on the PSA Level 1 (L1)
and Level 2 (L2) risk measures, Core Damage
Frequency (CDF) and Large Early Release
Frequency (LERF), as well as the qualitative effects
regarding the general risk profile and the changes
in the OA importances are studied.

Secondly, the analysed OAs are delineated
into dynamic, post-initiator OAs on one side and
accident management actions on the other side.
Each group is subjected to sensitivity analysis with
respect to the plant risk, and their quantitative and
qualitative impacts are compared and discussed.


2 MODEL 


2.1 Preliminary considerations 


The PSA model herein considers three general 
groups of OAs: 

Type A or pre-initiator system-specific OAs: 
This type of action involves routine activities such 
as restoring a component or flow path to normal 
after the completion of testing, inspection, main- 
tenance, instrument calibration, etc. These system 
specific activities are typically performed by one or 
more individuals as part of their normal workday 
duties. They are not related directly to operator 
actions or equipment response during the plant 
transient after an initiating event. However, errors 
may leave important equipment disabled and 
require additional dynamic actions to restore it to 
service during the transient. These routine testing, 
maintenance, and surveillance actions are identi- 
fied in each system analysis and are quantified as 
specific causes for equipment inoperability. 

Type B OAs: The second type of human action 
that may be considered in a PSA is a person- 
nel error that directly causes or contributes to 
an initiating event. These human errors are not 
quantified separately for most initiating events in 
contemporary PSAs because the available initiat- 
ing event frequency data contain contributions 
from all causes, including human errors. Most of 
the initiating event frequencies for the KKG PSA 
are quantified from a combination of generic and 
plant-specific data. The initiating event databases 
do not differentiate among the specific causes for 
each type of event. Since the frequency of human 
errors is already included in the initiating event 
data, these errors are not quantified separately for 
any initiating events that are derived directly from 
generic and plant-specific experience. 

Type C or post-initiator OAs: The third general
type of human action in a PSA is a scenario-specific,
directed mission activity. This action is an
integral part of plant response to an initiating
event. The operators must accomplish well-defined
tasks for manual initiation, control, and alignment
of plant emergency equipment or selected backup
systems. These tasks are generally guided by the
plant emergency response procedures. The available
time window for successful response, the type
of action that must be taken, and other factors that
influence operator stress and confusion are
determined by the type of event being evaluated and all
preceding actions during a specific response
scenario. These actions are incorporated as distinct
decision points in the KKG PSA event tree models.
These actions, the post-initiator OAs, are
the focus of the HRA sensitivity study herein.
The type C post-initiator actions are
divided into two general groups: the immediate,
dynamic post-initiator OAs and
the accident management actions (AMA). They
are presented in the following table (Table 1).

In addition to the OAs encompassed by Table 1, 
the following three actions are also considered: 
L3 — late fire extinguishing system operation; L4 
— fire extinguishing by fire brigade before ignition 
in adjacent room. 

The part of the top event (TE) VU—related to 
the operator failure to clean debris after season 
event and thus rendering the main water intake 
unavailable—is also considered. 


2.2 Analysis and calculations 


The plant PSA model is prepared with the
RISKMAN® analysis tool, software based on the
small fault tree (FT), large event tree (ET) linking
approach. Each of the OAs summarized in Table 1 is
modelled as a separate top event (TE) within the KKG
PSA model. Each of these TEs is represented via
multiple split fractions (SF), modelling different
variants of the corresponding OA under different
boundary conditions. The failure probability
of each OA is calculated as a logical OR between
the cognitive/diagnosis and the
implementation/manipulation failure probability. Failure
probability density functions (PDF), accounting for the
HRA uncertainty, are assigned to each of the
cognitive and the implementation parts.
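
The logical OR of the two failure contributions can be illustrated with a short sketch. It assumes that the cognitive and implementation failures are independent, which the text does not state, and it uses point values rather than the probability density functions applied in the actual model. The function name and the numerical values are hypothetical.

```python
def combine_or(p_cognitive: float, p_implementation: float) -> float:
    """Probability that at least one of two failure contributions occurs.

    Assumes the two contributions are independent (an assumption made
    for this sketch, not a statement about the KKG PSA model).
    """
    for p in (p_cognitive, p_implementation):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return 1.0 - (1.0 - p_cognitive) * (1.0 - p_implementation)


# Hypothetical point values, chosen only for illustration.
hep = combine_or(1.0e-2, 5.0e-3)
print(f"HEP = {hep:.5f}")
```

For small probabilities the result is close to the simple sum of the two contributions; the product term only matters when either contribution becomes large, for example under a strong worsening of the HEP values.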

At this point, it is important to note that the 
dynamic OAs comprise not only the immediate, 
post-initiator, preventive measures but also recov- 
ery measures within these actions. Namely, one of 
the advantages of the SF modelling approach is 
that one can model different variations of the same 
OA that correspond to different boundary condi- 
tions. This, in turn, corresponds to different stages 
and/or different developments of a given accident. 

This paper presents an HRA sensitivity
study of the values of the L1 and L2 risk measures,


Table 1. Overview of the OAs used in the KKG PSA
model.

Type: Dynamic OAs
  OA ID    Description
  OALP     Start residual heat removal (RHR) cooling
  OCD      Start active cooldown
  ODP      Depressurize/cooldown
  OEFW     Start emergency feedwater (EFW)
  OPUD     Align demineralized water makeup to feedwater tank
  ORT      Reactor trip
  OSG      Isolate ruptured steam generator (SG)

Type: Accident management actions
  OA ID    Description
  OAMFT    Align feedwater tank
  OAMFW    Align fire truck
  OAMIS    Isolate large containment openings
  OAMLI    Isolate letdown line
  OAMPB    Depressurize via primary pressure relief
  OAMPF    Align fire water to low head (LH) pump path
  OAMRW    Align flood tanks to operating high pressure
           injection (HPI) pump
  OAMSR    Open SG relief
  OAMTH    Align TH17/37 and VX01/02
  OAMFL    Injection and recirculation via TH17/37 after
           recovery of bus FL/FM
  OAMIGA   H2 recombiners placed into service prior to
           vessel breach
  OAMIGB   H2 recombiners placed into service after
           vessel breach
  OAMVA    Containment filtered venting system (CFVS)
           placed into service prior to vessel breach
  OAMVB    CFVS placed into service after vessel breach
  OAMVC    CFVS scrubbing tank re-filled as necessary
  OAMVD    Isolate CFVS to prevent large release


i.e. CDF and LERF respectively, as a function of
the altered HEP values for the dynamic
OAs and the AMAs. By assigning a
scalar parameter (e.g. “SENHRA”, “SENHR3”)
to the basic event (BE) calculation module where
the OA failure probabilities are calculated,
one can alter all the desired HEPs and hence
perform a sensitivity study. The SENHRA
parameter is part of the data module of the
applied KKG PSA model. All the HEPs related
to the dynamic OAs as well as the AMAs within
the KKG model are multiplied by the SENHRA
parameter. In this way, one can conduct
systematic sensitivity analyses by simply altering the value
of the SENHRA parameter. Within the nominal
model, i.e. the base case, a value of 1.0 is
assigned to SENHRA. Furthermore, the SENHR3
parameter is assigned to the HEP values related
only to the dynamic OAs, in addition to the already
assigned SENHRA parameter. In this way, one
can perform the HRA sensitivity analysis on the
dynamic OA group only, without altering the AMA group.

Two main cases within the sensitivity study are 
analysed: 


i. altering the HEP values of all the dynamic OAs
as well as all the AMAs together (via altering
the parameter “SENHRA”);

ii. altering the HEP values of the dynamic OAs
only, while leaving the AMAs with their nominal
HEP values (via altering the parameter
“SENHR3”).
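
The multiplier scheme behind these two cases can be sketched as follows. This is only an illustration of the idea described in the text, not the RISKMAN data module itself: the class and function names are hypothetical, the nominal HEP values are invented for the example, and capping the scaled value at 1.0 is an assumption made so that the result stays a valid probability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorAction:
    oa_id: str       # identifier as in Table 1
    group: str       # "dynamic" or "ama"
    base_hep: float  # nominal HEP (hypothetical value)


def scaled_hep(action: OperatorAction, senhra: float = 1.0, senhr3: float = 1.0) -> float:
    """Apply the sensitivity multipliers to the nominal HEP.

    SENHRA scales every action; SENHR3 additionally scales dynamic OAs only.
    The cap at 1.0 is an assumption of this sketch.
    """
    multiplier = senhra
    if action.group == "dynamic":
        multiplier *= senhr3
    return min(1.0, action.base_hep * multiplier)


ACTIONS = [
    OperatorAction("ORT", "dynamic", 1.0e-3),  # hypothetical nominal HEP
    OperatorAction("OAMFT", "ama", 2.0e-2),    # hypothetical nominal HEP
]

# Case i: all HEPs worsened by a factor of 10.
CASE_I = {a.oa_id: scaled_hep(a, senhra=10.0) for a in ACTIONS}
# Case ii: only the dynamic OAs worsened by a factor of 10.
CASE_II = {a.oa_id: scaled_hep(a, senhr3=10.0) for a in ACTIONS}

print(CASE_I)
print(CASE_II)
```

With both multipliers left at 1.0 the nominal model is recovered, which mirrors the base case described above.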


Figure 1 and Figure 2 present the first of the
above-mentioned cases: altering the HEP values
of both the dynamic OAs and the AMAs
defined in Section 2.1. This sensitivity case
shows that a potential “worsening” of the HEP
values would increase the CDF and LERF far more
than an “improvement” of the HEP values would
reduce them. Increasing the HEP values by a factor
of 10 raises the CDF by a factor of 3 and the LERF
by 80%. Increasing the HEP values by a factor of
100 raises the CDF by a factor of 50 and the LERF
by a factor of 20. Conversely, improving the HEP
values by a factor of 10 reduces the CDF by merely
7.5% and the LERF by 7%, and improving them by
a factor of 100 reduces the CDF by merely 8% and
the LERF by 7.6%.
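
To make the asymmetry explicit, the values reported above for the first case can be tabulated relative to the base case (base case = 1.0). The sketch below only restates the numbers given in the text; the names are hypothetical.

```python
# Relative CDF and LERF (base case = 1.0) for the first sensitivity case,
# keyed by the HEP multiplier, as reported in the text.
REPORTED = {
    0.01: {"CDF": 1.0 - 0.08, "LERF": 1.0 - 0.076},
    0.1: {"CDF": 1.0 - 0.075, "LERF": 1.0 - 0.07},
    1.0: {"CDF": 1.0, "LERF": 1.0},
    10.0: {"CDF": 3.0, "LERF": 1.8},
    100.0: {"CDF": 50.0, "LERF": 20.0},
}


def relative_change(multiplier: float, measure: str) -> float:
    """Change of a risk measure relative to the base case (0.0 = no change)."""
    return REPORTED[multiplier][measure] - 1.0


for multiplier in sorted(REPORTED):
    cdf = relative_change(multiplier, "CDF")
    lerf = relative_change(multiplier, "LERF")
    print(f"HEP x {multiplier:<6g} CDF {cdf:+8.1%}  LERF {lerf:+8.1%}")
```

Read this way, a hundredfold improvement of the HEP values removes at most 8% of the CDF, whereas a hundredfold worsening adds 4900%.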

Figure 3 and Figure 4 present the second of the
above-mentioned cases: altering the HEP values
of the dynamic OAs only, while leaving the AMAs
with their nominal HEP values (via altering the
parameter SENHR3).

The effects of this sensitivity case are
qualitatively the same as those of the first sensitivity
case. Namely, they show that a potential
“worsening” of the HEP values would increase the CDF
and LERF far more than an “improvement” of the
HEP values would reduce them. In addition,
when the two cases are compared quantitatively,
they are relatively close to each other. This is
especially true in the case of LERF (Figure 6).
Figure 5 and Figure 6 present this comparison
separately: Figure 5 shows the calculated CDF for
both cases, and Figure 6 shows the calculated LERF
for both cases.

Figure 1. Sensitivity of the CDF as a function of the parameter SENHRA.
Figure 2. Sensitivity of the LERF as a function of the parameter SENHRA.
Figure 3. Sensitivity of the CDF as a function of the parameter SENHR3.
Figure 4. Sensitivity of the LERF as a function of the parameter SENHR3.
Figure 5. Comparison of the CDF between the two cases: CDF (SENHRA) vs. CDF (SENHR3).
Figure 6. Comparison of the LERF between the two cases: LERF (SENHRA) vs. LERF (SENHR3).

In order to inspect the influence of the AMAs
alone, a third case is established:


iii. altering the HEP values of the AMAs only,
while leaving the dynamic OAs with their nominal
HEP values (via altering the parameter
“SENHRI”).


Figure 7 and Figure 8 present this third case.
By worsening the HEP values by a factor of 100, the
CDF increases only by a factor of 2.5 and the LERF
only by ca. 25%. By improving the HEP values by a
factor of 100, the CDF and LERF reduce by ca. 6-7%.
Thus, it is clear that the effect of the AMAs, seen in
general as a group of personnel actions, is smaller
than that of the dynamic OAs presented in case ii.

When comparing Figure 3 with Figure 7 and
Figure 4 with Figure 8, it can be concluded that
the changes in the plant risk (CDF, LERF) due
to increasing the HEP values are predominantly
governed by the changes affecting the dynamic
OAs, when compared to the case where only the
AMA HEP values are increased.

This is a general conclusion, in the sense that it
compares the personnel action groups (dynamic
OAs vs. AMAs) with each other. There are
individual AMAs whose risk achievement
worth (RAW) is higher than the RAW of
some dynamic OAs.
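
For readers unfamiliar with the measure, the risk achievement worth of an action is commonly defined as the ratio of the risk with that action assumed failed (failure probability set to 1) to the baseline risk. The sketch below applies this definition to a deliberately simple toy risk model; the model structure, the names and all numbers are hypothetical and have no relation to the KKG PSA model.

```python
def toy_risk(p: dict) -> float:
    """Hypothetical risk model: damage occurs if the dynamic OA fails, or
    if it succeeds but both the AMA and a hardware train fail."""
    initiating_event_frequency = 1.0e-2  # per year, invented for the example
    p_damage = p["dynamic_oa"] + (1.0 - p["dynamic_oa"]) * p["ama"] * p["hardware"]
    return initiating_event_frequency * p_damage


def risk_achievement_worth(risk_fn, probabilities: dict, event: str) -> float:
    """RAW = risk with the event's failure probability set to 1 / baseline risk."""
    baseline = risk_fn(probabilities)
    failed = {**probabilities, event: 1.0}
    return risk_fn(failed) / baseline


NOMINAL = {"dynamic_oa": 1.0e-3, "ama": 1.0e-2, "hardware": 1.0e-2}  # hypothetical
RAW_DYNAMIC = risk_achievement_worth(toy_risk, NOMINAL, "dynamic_oa")
RAW_AMA = risk_achievement_worth(toy_risk, NOMINAL, "ama")

print(f"RAW dynamic OA: {RAW_DYNAMIC:.1f}")
print(f"RAW AMA:        {RAW_AMA:.1f}")
```

In this toy structure the dynamic OA has the larger RAW because its failure leads to damage directly, while the AMA only matters in combination with a hardware failure. A differently structured model could reverse the ordering for individual actions, as noted above.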


3 DISCUSSION AND CONCLUSIONS 


Probabilistic studies of risk show that the human
factor can contribute considerably to overall risk.
The potential for, and the mechanisms of, human error
to affect plant risk and safety are evaluated by the
human reliability analysis. The HRA has quantitative
and qualitative aspects, aimed at designing
operator interfaces that will minimise operator
error and provide for error detection and recovery
capability. The objectives of HRA, therefore, are to
assure that potential effects on plant safety and
reliability are analysed and that human actions that are
important to plant risk are identified so that they
can be addressed in both PSA and plant design.

The presented paper summarizes an HRA sensitivity
study performed on a realistic NPP-specific
PSA model. This sensitivity analysis is performed
in order to investigate the contribution of the HEPs
of two groups of OAs to the overall plant risk.

Type C or post-initiator OAs are the focus
of the study. These type C post-initiator human
actions are divided into two general groups: the
immediate, dynamic post-initiator OAs and the
accident management actions (AMAs). Each of the
considered OAs is modelled as
a separate TE within the KKG PSA model. Each
of these TEs is represented via multiple SFs,
modelling different variants of the corresponding
OA under different boundary conditions.

The paper presents an HRA sensitivity study of
the values of the L1 and L2 risk measures, CDF
and LERF respectively, as a function of the altered
HEP values for the dynamic OAs and
the AMAs.

Three cases were analysed:


i. altering the HEP values of all the dynamic OAs
as well as all the AMAs together (via altering
the parameter SENHRA);
ii. altering the HEP values of the dynamic OAs
only, while leaving the AMAs with their nominal
HEP values (via altering the parameter
SENHR3);
iii. altering the HEP values of the AMAs only,
while leaving the dynamic OAs with their nominal
HEP values (via altering the parameter
SENHRI).


The effects of the first sensitivity case (i) show
that a potential “worsening” of the HEP values
would increase the CDF and LERF far more than an
“improvement” of the HEP values would reduce
them. Worsening the HEP values by a factor of 10
raises the CDF by a factor of 3 and the LERF by 80%.
Worsening the HEP values by a factor of 100 raises
the CDF by a factor of 50 and the LERF by a factor
of 20. Conversely, improving the HEP values by
a factor of 10 reduces the CDF by merely 7.5%
and the LERF by 7%. Improving the HEP values
by a factor of 100 reduces the CDF by merely
8% and the LERF by 7.6%.

The effects of the second sensitivity case (ii) are
qualitatively the same as those of the first
sensitivity case. In addition, when the first and the
second case are compared quantitatively, they are
relatively close to each other. This is especially true
in the case of LERF.

In order to inspect the influence of the AMAs
alone, the third sensitivity case is run. By
worsening the HEP values by a factor of
100, the CDF increases only by a factor of 2.5 and the
LERF only by ca. 25%. By improving the HEP values
by a factor of 100, the CDF and LERF reduce
by ca. 6-7%. Thus, it is clear that the effect of the
AMAs, seen in general as a group of personnel
actions, is smaller than that of the dynamic OAs
presented in case ii.

The usability of the HRA sensitivity analysis
performed herein is summarized through the
conclusions derived below. The interplay
of the two identified type C human action groups
(dynamic OAs and AMAs) and their significance
and contribution to plant risk are discussed.

From the results of the sensitivity study it can be
concluded that the changes in the plant risk (CDF,
LERF) due to altering the HEP values are
predominantly governed by the changes affecting the
dynamic OAs when compared to the case where
solely the AMA HEP values are altered. This
coincides with the comparison of the RAW factors of
the dynamic OAs vis-a-vis those of the AMAs.
In general, the former group has higher RAW
values than the latter.

Of course, the above stated is a general conclu- 
sion in a sense of comparing the personnel action 
groups (dynamic OAs vs. AMAs) with each other. 
There are, however, AMAs whose RAW is higher 
than the RAW of some dynamic OAs. Addition- 
ally, the effects of both the groups on the plant 
risk are not additive. Still and as already stated, the 
plant risk is much more sensitive to the increase 
of the HEP values of the dynamic OAs considered 
alone than to the increase of the HEP values of the 
AMAs considered alone. 

Therefore, the general conclusion is that the
focus should be put on maintaining the boundary
conditions for diagnosis and execution of all the
type C personnel actions at a level such that the
current reliability of their performance is not
endangered. In other words, future effort should be
focused on maintaining the current HEP values, i.e.
not letting them get worse, rather than on improving
them. More specifically, when distinguishing between
the dynamic OAs on one side and the AMAs on the
other, the focus should be set foremost on
maintaining the HEP values of the dynamic OAs
group, since their deteriorating effect on CDF and
LERF can be much higher than that of the AMAs.
The weighting method’s impact on the weighting process in decision
making problems

ABSTRACT: Assigning values to weights in a multi-criteria decision making problem is critical, since
it introduces the decision maker’s perception and preference over the importance and value of the
decision problem’s criteria and alternatives, respectively. This paper investigates theoretically and
experimentally the impact of the applied weighting method on the decision maker’s determination of
weights. The research develops a methodology to evaluate the impact that a weighting method may have
on the decision maker’s attitude concerning the assignment of weighting values, comprising: a) a
psychometric test revealing the decision maker’s attitude towards risk and ambiguity, with a new
modeling approach based on a psychometric function, and b) an assignment of weighting values by the
decision maker with different weighting methods. The results demonstrate that the expression, and
particularly the ranking, of the decision maker’s attitudinal preference are affected by the weighting
method and that this impact is measurable through the proposed methodology.


1 INTRODUCTION 


The principles of decision analysis set the frame- 
work wherein decision makers determine weights 
through the use of various methods and diverse 
approaches. According to the current theoretical 
approaches, decision analysis is characterized by 
two discrete principles, namely the normative and 
the descriptive (Barzilai, 2010; Riabacke, Danielson 
and Ekenberg, 2012; Aliev et al., 2016), whilst there 
is also the prescriptive principle, which constitutes 
a combination of them (Jia, Fischer and Dyer, 
1998; Riabacke, Danielson and Ekenberg, 2012; 
Aliev et al., 2016). The decision analysis principles 
determine the characteristics of the method’s input 
and process, imposing the use of numerical data 
or the combination of numerical data with verbal 
expressions. 

Recently, great importance has been given to
the cognitive process of weight determination,
which includes three conceptual phases, namely
the extraction of data, their representation and their
interpretation (Riabacke, Danielson and Ekenberg,
2012). The two main classes of weight elicitation
methods are the precise and the approximate
methodological approaches (Jia, Fischer and Dyer,
1998; Bottomley and Doyle, 2001; Barzilai, 2010;
Riabacke, Danielson and Ekenberg, 2012;
Hafezalkotob, Hafezalkotob and Sayadi, 2016; Wang,
Wang and Zhang, 2016; Podinovskaya and
Podinovski, 2017). The literature review revealed that
new weighting methods are constantly added to
the existing ones, integrating contemporary
mathematical analysis (Abidin, Rusli and Shariff, 2016),
as well as complementary techniques for the
determination of weights, like integration of weights
and consensus building (Beliakov and James, 2015;
Peng et al., 2015; Blagojevic et al., 2016).

The determination of attribute weights is essen- 
tially a problem of preference formation in different 
conditions, with many possible limitations. The con- 
cept of preference describes the degree that a sub- 
ject desires a potential outcome (Hogarth, 2010). Its 
formation is related with multiple cognitive, emo- 
tional and even biological mechanisms, connected 
with cognition and behavioral traits (Hogarth and 
Einhorn, 1990; Weller, Levin and Bechara, 2010; 
Panno, Lauriola and Figner, 2013; Naili et al., 
2015). The study of those mechanisms, along with 
the external conditions of risk and ambiguity from 
the fields of Psychology and Economic Theory 
could enhance the comprehension of the prefer- 
ence formation and launch a further challenging 
investigation; the prediction of preferences (Weller, 
Levin and Bechara, 2010; Retief et al., 2013; Brand 
et al., 2014; Roth and Voskort, 2014; Johansen and 
Rausand, 2015; Csermely and Rabas, 2016; Shao, 
Taisch and Ortega-Mier, 2016; Thomas, 2016; van 
Winsen et al., 2016; Jern, Lucas and Kemp, 2017). 

Based on the above, there is evidence to con- 
nect the behavioral preferences with the process of 
weight elicitation. The paper investigates the deter- 
mination of weights as an expression of preference 
and behavioral attitude, which can be described 
and modeled with a psychometric approach 
and a new methodology. This new psychometric 
approach is presented in Section 2 and the new 
methodology for recognition of behavioral atti- 
tudes with the use of the psychometric function is 
discussed in Section 3. The cognitive process leads 
the decision maker to assign weight assessments, 
and this research examines experimentally this 
aspect of the weighting method in Section 4. Sec- 
tion 5 presents the conclusions of this work. 


2 A NEW APPROACH FOR REVEALING 
BEHAVIORAL ATTITUDE 


Previous work in the field offers the theoretical 
background to determine a new approach for the 
identification and classification of behavioral atti- 
tudes, based on the concept of psychometric func- 
tion. The definition of this concept is presented in 
subsection 2.1, while the proposed approach for 
the use of the psychometric function for preference 
expression under risk or ambiguity is discussed in 
subsection 2.2. 


2.1 Definition of the psychometric function 


The psychophysical problem describes the formation 
of the mathematical operation which expresses the 
behavioral preference (Robert, 1985). The psychometric 
function, introduced in the context of Signal Detection 
Theory, offers an effective solution to this problem: 
preferences are treated as sensory responses that depend 
on a parameter of a specific stimulus for decision making 
(Gold and Ding, 2013; Hunter, 2017). The psychometric 
function can be applied to decision making problems with 
only two alternatives, also known as yes-no problems, in 
order to demonstrate the preference shift caused by the 
stimulus intensity (Gold and Ding, 2013) and to reveal the 
behavioral attitude of the subject. The stimulus for making 
a choice is the probability of an event (Hunter, 2017); in 
the decision making context, this stimulus is the gain 
probability, or else the expected added value. 

In practice, the psychometric function allows 
the simultaneous study of multiple preference variables 
and characteristics (Gold and Ding, 2013). In the context 
of the present paper, the psychometric function takes the 
form of the sigmoid function, which is defined by the 
formula $f(x) = \frac{1}{1 + e^{-(x-a)/b}}$, as shown 
in Figure 1 (Hunter, 2017). In this function, the variable a 
determines the shift of the sigmoid and is interpreted as 
the preference’s bias, whilst the variable b determines the 
steepness of the slope and characterizes the preference’s 
sensitivity (Gold and Ding, 2013; Hunter, 2017). 
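
The role of the two variables can be illustrated with a minimal Python sketch. It assumes the common logistic form f(x) = 1 / (1 + exp(-(x - a) / b)), which is consistent with the description of a as the shift and b as the steepness but is an assumption of this sketch; the function name `psychometric` is likewise hypothetical.

```python
import math

def psychometric(x, a, b):
    """Assumed logistic form: response in [0, 1] for gain probability x.

    a: bias (the gain probability at which the response equals 0.5)
    b: sensitivity (smaller b gives a steeper, more step-like curve)
    """
    z = (x - a) / b
    # Numerically stable evaluation for large |z| (e.g. b = 0.001).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Parameters quoted in the caption of Figure 1.
a, b = 0.642, 0.096
for p in (0.2, 0.4, 0.642, 0.8, 1.0):
    print(f"gain probability {p:.3f} -> response {psychometric(p, a, b):.3f}")
```

Under this assumed form the response equals 0.5 exactly at x = a, which is why a can be read as the switching point of the preference.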
2.2 Modeling attitudinal preferences with the 
psychometric function 


In order to fit the psychometric function to choice 
problems affected by risk and ambiguity aversion, it is 
essential to translate the preference expression into 
arithmetic terms. The phenomenon of aversion can be 
defined as a shift of preference, and the subject’s response 
can be translated into values in the interval [0, 1], where 
a value of 0 reflects risk or ambiguity aversion and a value 
of 1 reflects risk or ambiguity seeking, respectively. The 
relevant tasks examine the subject’s multiple responses to 
a specific problem while the gain probabilities progressively 
increase, leading to a shift of the preference from 0 to 1, 
which can be represented both arithmetically and graphically. 
When there are several such responses from multiple tasks, 
the responses are averaged at every 0.10 increment of gain 
probability, so as to extract the average response pattern 
and produce the graph of Figure 2. This sensory response 
pattern represents graphically the shift of attitude for all 
the relevant expressions and also determines the psychometric 
function which best reflects the subject’s behavioral attitude. 
In particular, the input of the sigmoid function is the gain 
probability x and the output f(x) is the value of the subject’s 
response for the given probability, whilst the function is 
fitted with non-linear regression using the least squares 
method. In practice, the variables a and b are estimated with 
the Solver tool in Microsoft Excel 2013 and the psychometric 
function is defined, as shown in the graph of Figure 2. 

Figure 1. The graph of the psychometric function 
with a = 0.642 and b = 0.096, which reflects the attitude 
towards risk. 

Figure 2. Fitting of the sigmoid function under 
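
The least squares fitting step can be sketched in Python as follows. This is not the Excel Solver procedure used in the study: it is a simple grid search over a and b that minimizes the sum of squared errors, assuming the logistic form f(x) = 1 / (1 + exp(-(x - a) / b)). The response values are synthetic, and the names `fit_sigmoid`, `sse` and the search range for b are hypothetical choices made for illustration.

```python
import math

def sigmoid(x, a, b):
    """Assumed logistic psychometric function (stable for large |z|)."""
    z = (x - a) / b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sse(xs, ys, a, b):
    """Sum of squared errors between the model and the observed responses."""
    return sum((sigmoid(x, a, b) - y) ** 2 for x, y in zip(xs, ys))

def fit_sigmoid(xs, ys, steps=200):
    """Grid search for (a, b): a in [0, 1], b in (0, 0.3] (assumed range)."""
    best_err, best_a, best_b = float("inf"), None, None
    for i in range(steps + 1):
        a = i / steps
        for j in range(1, steps + 1):
            b = 0.3 * j / steps
            err = sse(xs, ys, a, b)
            if err < best_err:
                best_err, best_a, best_b = err, a, b
    return best_a, best_b, best_err

# Synthetic averaged responses at every 0.10 step of gain probability.
xs = [k / 10 for k in range(11)]
ys = [sigmoid(x, 0.64, 0.09) for x in xs]

a_hat, b_hat, err = fit_sigmoid(xs, ys)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}, SSE = {err:.6f}")
```

A gradient-based optimizer would normally replace the grid search; the grid is used here only to keep the sketch free of external dependencies.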


3 A METHODOLOGY FOR BEHAVIORAL 
ATTITUDE IDENTIFICATION 


The new approach for revealing the subjects’ 
behavioral traits and attitudes was presented in 
Section 2. This research explores experimentally 
the utility and the significance of this approach 
with the methodology described in this Section, 
and particularly in the subsections 3.1, 3.2 and 3.3. 
The methodology includes the procedure of iden- 
tifying attitudes under the conditions of risk and 
ambiguity, respectively, as presented in the subsec- 
tions 3.4 and 3.5. 


3.1 The experiment 


The objective of this research is the investigation 
of individual attitudinal preferences and their potential 
influence on the determination of weights in decision 
making. To meet this objective, an experiment was conducted 
in two phases. The initial phase included a weighting test 
and a personality test with control questions for the 
evaluation of the subjects’ comprehension and consistency. 
The personality tests consisted of multiple-value checklists 
for two different problems of evaluating alternatives, i.e. 
a ballot choice problem and a problem of choosing the best 
bet among a set of offered bets. The answers to these tests 
reflect the subject’s respective psychometric function, 
which is either risk prone or risk averse (Holt and Laury, 
2002; Csermely and Rabas, 2016; He, Veronesi and Engel, 
2017). 

The evaluation of the responses led to the iden- 
tification of the subjects who responded properly 
to the control questions and so their responses 
could be considered valid. These specific subjects 
were selected to participate in the second phase of 
the experiment, which included some explanatory 
questions about the subjects’ personality and pref- 
erences in the conditions of risk and ambiguity, 
respectively, in order to validate and confirm their 
behavioral preferences and attitudes. The content 
of the second-phase tests was the same as that of the 
respective tests of the first phase. 


3.2 The sample 


The present research was addressed to 50 engineers 
with professional experience in decision making 
in the construction industry; 37 of them 
participated in the experiment, giving a satisfactory 
response rate of 74%. The sample included 
engineers of different specialties, such as civil engi- 
neers, mechanical engineers, architects and survey- 
ors, with various levels of working experience from 
two to 30 years, representing both the private and 
the public sector, and including individuals with 
diverse personalities, behaviors and cultures (Char- 
ness, Gneezy and Kuhn, 2013). Based on these fea- 
tures, the sample can be evaluated as satisfactory in 
the context of this research. 


3.3 Identification of behavioral attitudes 


The initial test was completed by 37 subjects. After 
the evaluation of the results, 10 of the completed tests, 
which represent 27.03% of the initial responses, were 
excluded from further processing because of incorrect 
responses to the control questions. Consequently, the 
number of valid responses in this phase is 27, which is 
sufficient for the estimation of the subjects’ preferences 
under risk and ambiguity and also allows the extraction 
of the weighting tasks’ results. 

The first analyses of the personality test results 
lead to the determination and the optimization 
of each subject’s psychometric functions for the 
shift of preferences in the conditions of risk and 
ambiguity, respectively. The determination of the 
sigmoid function requires the estimation of the 
characteristic values of variables a and b, which 
reveal the preference’s switching point and also 
allow the identification and the categorization of 
the subject’s behavioral attitudes. Moreover, the 
sigmoid function and its variables are uniquely 
determined for every subject in the environment of 
risk, as well as in the environment of ambiguity. 
The estimated values of these variables for the risk 
attitude, and also the corresponding values for the 
ambiguity attitude are presented in Table 1. 


3.4 Identification of risk attitudes 


The determination of risk attitude with the use of 
the psychometric function is based on the assump- 
tion that the switching point of the subject’s 
preference can describe the behavioral attitude. 
According to this assumption, the critical switch- 
ing point is observed when the gain probability is 
60% (Holt and Laury, 2002; Csermely and Rabas, 
2016; He et al., 2017). 

In practice, this means that when the subject 
changes preference from risk avoidance to risk taking 
before the gain probability of 60%, this attitude is 
characterized as risk seeking. Accordingly, if this 
shift of preference occurs when the gain probability 
is greater than 60%, the attitude is characterized as 
risk averse. In the psychometric function approach, the 
switching point can be identified directly, because it is 
the value of the characteristic variable a, as represented 
in Figure 3. The results of this analysis demonstrated 
that, in the sample of 27 subjects, 19 (70.37%) are 
characterized as risk averse, while only 8 (29.63%) are 
characterized as risk prone. 


Table 1. The values of the psychometric function’s variables 
a and b for the subject’s attitudes under risk and 
ambiguity, respectively. 

                 Risk attitude     Ambiguity attitude 
Id   Subject     a       b         a       b 

1    ANA         0.741   0.100     0.070   0.010 
2    ARA         0.767   0.024     0.456   0.001 
3    BOU         0.641   0.070     0.456   0.001 
4    KOU         0.703   0.134     0.456   0.001 
5    GOU         0.502   0.086     0.587   0.083 
6    KOI         0.522   0.037     0.457   0.001 
7    KOT         0.617   0.080     0.588   0.036 
8    MPO         0.698   0.035     0.456   0.001 
9    NEV         0.776   0.031     0.540   0.017 
10   SMP         0.497   0.104     0.540   0.001 
11   TSO         0.589   0.063     0.600   0.002 
12   TZI         0.520   0.035     0.565   0.001 
13   VAS         0.766   0.027     0.580   0.019 
14   VEL         0.609   0.049     0.475   0.096 
15   ZEL         0.689   0.089     0.551   0.029 
16   ATSO        0.626   0.047     0.520   0.001 
17   AC          0.635   0.113     0.456   0.001 
18   NA          0.643   0.045     0.550   0.001 
19   TSI         0.810   0.114     0.666   0.001 
20   FAK         0.657   0.072     0.456   0.001 
21   BIL         0.601   0.058     0.490   0.001 
22   AGG         0.753   0.095     0.510   0.001 
23   PET         0.604   0.053     0.510   0.001 
24   THE         0.687   0.123     0.510   0.001 
25   TOL         0.457   0.186     0.550   0.001 
26   XAM         0.523   0.041     0.456   0.001 
27   KAL         0.545   0.001     0.540   0.001 


Figure 3. An example of the psychometric function 
with the critical switching point representing subject’s 
risk attitude. 

In addition, it is important to examine the value 
of variable b, which expresses the sensitivity of the 
subject’s choice under risk. As the value of variable b 
increases, the sensitivity of the choices also increases 
and, consequently, inconsistent responses are more likely 
to be observed. The value of variable b determines the 
different levels of sensitivity in the sample, as shown 
in Figure 4. In particular, of the 27 subjects, 11 are 
characterized by a b value smaller than 0.050 and therefore 
by great consistency in their preferences, whilst 16 subjects 
demonstrate greater b values, and three of them demonstrate 
remarkably great values, which indicate great sensitivity 
in preference expression. 
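
The classification rules of this subsection can be checked against the risk columns of Table 1 with a short Python sketch. The thresholds 0.60 (for a) and 0.050 (for b) are those stated in the text; the function names are hypothetical, and the entry for subject XAM assumes that the printed value reads 0.523.

```python
# (subject, a, b) for the risk attitude, transcribed from Table 1.
RISK = [
    ("ANA", 0.741, 0.100), ("ARA", 0.767, 0.024), ("BOU", 0.641, 0.070),
    ("KOU", 0.703, 0.134), ("GOU", 0.502, 0.086), ("KOI", 0.522, 0.037),
    ("KOT", 0.617, 0.080), ("MPO", 0.698, 0.035), ("NEV", 0.776, 0.031),
    ("SMP", 0.497, 0.104), ("TSO", 0.589, 0.063), ("TZI", 0.520, 0.035),
    ("VAS", 0.766, 0.027), ("VEL", 0.609, 0.049), ("ZEL", 0.689, 0.089),
    ("ATSO", 0.626, 0.047), ("AC", 0.635, 0.113), ("NA", 0.643, 0.045),
    ("TSI", 0.810, 0.114), ("FAK", 0.657, 0.072), ("BIL", 0.601, 0.058),
    ("AGG", 0.753, 0.095), ("PET", 0.604, 0.053), ("THE", 0.687, 0.123),
    ("TOL", 0.457, 0.186), ("XAM", 0.523, 0.041), ("KAL", 0.545, 0.001),
]

def risk_attitude(a, switching_point=0.60):
    """Switching after the critical gain probability means risk averse."""
    return "risk averse" if a > switching_point else "risk prone"

def is_consistent(b, limit=0.050):
    """A small b indicates low sensitivity, i.e. consistent responses."""
    return b < limit

averse = [s for s, a, _ in RISK if risk_attitude(a) == "risk averse"]
prone = [s for s, a, _ in RISK if risk_attitude(a) == "risk prone"]
consistent = [s for s, _, b in RISK if is_consistent(b)]

print(f"risk averse: {len(averse)} ({100 * len(averse) / len(RISK):.2f}%)")
print(f"risk prone:  {len(prone)} ({100 * len(prone) / len(RISK):.2f}%)")
print(f"b < 0.050:   {len(consistent)} of {len(RISK)}")
```

With the tabulated values the sketch reproduces the reported split of 19 risk averse and 8 risk prone subjects, and 11 subjects with b below 0.050.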


3.5 Identification of ambiguity attitudes 


The same analysis was conducted in order to extract 
results for subjects’ attitudes in the condition of 
ambiguity. The responses’ values are modelled with 
the psychometric function and the variables a and 
b are estimated. Based on the Ellsberg Paradox’s 
assumptions about the inherent ambiguity aver- 
sion, the critical switching point of preference is 
set at 60% gain probability (Lauriola and Levin, 
2001; Schneider and Nunez, 2015). In the sigmoid 
function approach, this fact can be translated to 
variable a = 0.60 and the subjects’ behavioral atti- 
tudes can easily be identified as ambiguity averse 
or prone, as presented in Figure 5. 

The majority of the sample, 25 subjects in total, 
confirms Ellsberg’s assumptions about an inherent 
ambiguity avoidance, as they demonstrate values of a 
between 0.50 and 0.60, while there are only 
two cases where the value of variable a is greater 
than 0.60, a fact that indicates an ambiguity seeking 
attitude, as shown in Figure 6. Moreover, the 
sensitivity of these preferences, as determined by 
the values of variable b, remains small with minor 
differentiations. The b values vary between 0.05 
and 0.01, showing that the subjects express their 
preferences with high consistency and stability. 

Figure 4. The distribution of the values of variable b 
for the risk attitude. 

Figure 5. A psychometric function with the critical 
switching point representing a subject’s ambiguity 
attitude. 

Figure 6. The distribution of the values of variable a 
for the ambiguity attitude. 


4 FINDINGS ABOUT THE WEIGHTING 
METHOD’S IMPACT 


The research aims to evaluate the impact that 
a weighting method may have on the decision 
maker’s attitude concerning the assignment of 
weighting values. For this purpose, the proposed 
methodology employs three different weighting 
methods which are presented in subsection 4.1. 
The procedure of the analysis and the results are 
presented in detail in subsection 4.2. 


4.1 The examined weighting methods 


It is a fact that different weight elicitation meth- 
ods lead to different weighting and consequently 
different decision results, a phenomenon which is 
mainly connected to the methodological approach 
and the cognitive process of weighting, rather than 
random response errors (Jia, Fischer and Dyer, 
1998). The three basic weighting methods selected 
for this examination are the direct rating, the dis- 
tance estimation and the pairwise comparison. 
One of the most popular weighting methods is 
the direct rating, which depends on the decision 
makers’ quantitative estimation of their preference. 
The method leads to a rank order of the alternatives 
in descending order, and also permits the forma- 
tion of linear models for capturing that preference 
(Bottomley and Doyle, 2001). The direct weighting 
is implemented with the direct evaluation of the 
criteria or the alternatives with the application of 


a specific weighting scale. The cognitive approach 
of this method entails a preliminary examination 
of all the criteria, so that the subject discerns the 
significance of each element, and then proceeds to 
the final evaluation. 

The method of distance estimation is used in an effort 
to move away from the direct and precise determination 
of weights. According to this weighting process, the 
decision maker first determines the ideal situation and 
then uses it as the reference base for the determination 
of weights. Every criterion or alternative is compared 
with the ideal situation, so as to determine the distance 
between the two situations, with a sense of proportion 
for all the remaining situations. Jia, Fischer and Dyer 
(1998) observed that this method’s results differ 
significantly depending on the given distance scale. 

The method of pairwise comparison offers a 
completely different approach for the elicitation 
of weights. This particular method depends on the 
concept of trade-off (Jia, Fischer and Dyer, 1998) 
and also offers the expression of reciprocal pref- 
erence relations (Xu et al., 2015). In this case, the 
criteria are not examined generally, but in separate 
pairs and so the decision maker tends to rely more 
on intuition, rather than correlation. 


4.2 The examination of the weighting methods 


Initially, the purpose of this research is the exami- 
nation of the consistency and reliability of each 
weighting method, as a correlation of the deci- 
sion makers’ behavioral attitudes. Each weighting 
task includes discrete sub-tasks, which are used 
to compare the assigned weights and extract the 
correlations, as a measure of each method’s effec- 
tiveness. The correlation between direct rating, 
distance estimation and pairwise comparison has 
to depend on an approach that is not affected by 
each method’s characteristics. For this reason, the 
present approach is based on the rank order of the 
alternatives, and so the rank correlation coefficient, 
known as the Spearman coefficient, is 
extracted. Additionally, the ranking correlations 
and methods’ consistencies are related to the deci- 
sion makers’ behavioral attitude, so as to investi- 
gate any connection between them. 
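
The rank-based comparison can be sketched in Python as below. The Spearman coefficient is computed as the Pearson correlation of the ranks, with tied values receiving their average rank; the weight vectors are hypothetical and serve only to illustrate the calculation, and the function names are not taken from the study.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; ties get their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(u, v):
    """Pearson correlation of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    su = sum((x - mu) ** 2 for x in u) ** 0.5
    sv = sum((y - mv) ** 2 for y in v) ** 0.5
    return cov / (su * sv)

def spearman(weights_1, weights_2):
    """Spearman rank correlation between two weightings of the same items."""
    return pearson(average_ranks(weights_1), average_ranks(weights_2))

# Hypothetical weights assigned to four alternatives in two sub-tasks.
sub_task_1 = [40, 30, 20, 10]
sub_task_2 = [35, 25, 30, 10]
rho = spearman(sub_task_1, sub_task_2)
print(f"Spearman coefficient: {rho:.3f}")
```

Because only the ranks enter the calculation, the coefficient does not depend on the scale each weighting method uses, which is the property the comparison between methods requires.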

The first weighting test is formed with the 
method of direct rating. The specific weighting 
method demonstrates high correlation between its 
results, as Table 2 shows. The overall correlation 
coefficient is the average of the correlations of the 
quantitative and qualitative weightings, which is 
estimated at 0.922, with a minor variance 0.002. 
This high coefficient also shows the high consistency 
of the responses, expressed with the method 
of direct rating. 


Table 2. The correlation coefficients and the compatibility 
degrees of the three weighting methods. 

                 Direct rating   Distance estimation   Pairwise comparison 
                 Correlation     Correlation           Compatibility 
Id   Subject     coefficient     coefficient           degree 

1    ANA         0.946868        0.990680              0.93750 
2    ARA         0.950439        0.959866              0.52083 
3    BOU         0.930605        0.914056              0.79688 
4    KOU         0.944028        0.923726              0.62500 
5    GOU         0.971940        0.986747              0.75000 
6    KOI         0.941935        0.927940              0.68750 
7    KOT         0.946648        0.955168              0.59375 
8    MPO         0.806452        0.875113              0.43750 
9    NEV         0.880783        0.942050              0.50000 
10   SMP         0.868462        0.834058              0.73438 
11   TSO         0.848981        0.962911              0.54688 
12   TZI         0.924662        0.920021              0.78125 
13   VAS         0.942801        0.946782              0.73438 
14   VEL         0.940413        0.936341              0.57813 
15   ZEL         0.928976        0.981065              0.71875 
16   ATSO        0.923407        0.635245              0.57813 
17   AC          0.955210        0.927814              0.79690 
18   NA          0.940804        0.938397              0.78125 
19   TSI         0.931403        0.928515              0.65625 
20   FAK         0.875011        0.905473              0.59375 
21   BIL         0.945380        0.959723              1.00000 
22   AGG         0.943035        0.924585              0.60938 
23   PET         0.925352        0.965619              0.76565 
24   THE         0.866520        0.850975              0.54688 
25   TOL         0.950810        0.953618              0.59375 
26   XAM         0.859410        0.863554              0.76565 
27   KAL         0.924595        0.897692              0.81250 
     Averages    0.921915        0.917263              0.66727 


Generally, referring to the subjects’ attitudinal 
characteristics, both risk averse and risk prone sub- 
jects demonstrate high consistency degree for their 
weightings with this particular method. Specifi- 
cally, the risk averse subjects demonstrate the same 
high consistency 0.922 and smaller variance 0.001, 
whilst the risk prone subjects are characterized by 
a slightly lower consistency coefficient 0.911 with 
variance 0.002. The consistency of weights and its 
relation with the decision makers’ attitude is pre- 
sented graphically in Figure 7. 
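
The group statistics can be illustrated for the risk prone subjects with the short Python sketch below. It uses the direct rating coefficients of Table 2 for the eight subjects whose risk variable a in Table 1 does not exceed 0.60. The variance estimator is not stated in the text, so the population variance is an assumption of this sketch.

```python
from statistics import mean, pvariance

# Direct rating correlation coefficients (Table 2) of the eight subjects
# classified as risk prone (a <= 0.60 in Table 1).
risk_prone_direct = {
    "GOU": 0.971940, "KOI": 0.941935, "SMP": 0.868462, "TSO": 0.848981,
    "TZI": 0.924662, "TOL": 0.950810, "XAM": 0.859410, "KAL": 0.924595,
}

group_mean = mean(risk_prone_direct.values())
# Population variance is assumed; the paper does not name the estimator.
group_variance = pvariance(risk_prone_direct.values())

print(f"consistency average: {group_mean:.3f}")
print(f"variance:            {group_variance:.3f}")
```

Rounded to three decimals, these values agree with the consistency coefficient 0.911 and the variance 0.002 reported for the risk prone subjects.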

The second method examined in this experiment 
is distance estimation. In this weighting method, 
the average correlation coefficient, and therefore 
the consistency of the responses, is also high, at 
the level of 0.917, with a slightly increased variance 
of 0.009 compared to the previous method, as 
demonstrated in Table 2. The method of distance 
estimation helps the decision makers to determine 
weights with high consistency. The risk averse 
decision makers express high consistency, 0.919, with 
variance 0.006, while the risk prone decision makers 
seem equally consistent, with a consistency average 
of 0.918 and variance 0.003. 

Figure 7. The consistency of responses for the three 
weighting methods. 

These results partially confirm the observation 
of Jia, Fischer and Dyer (1998) for the increased 
effectiveness of distance estimation compared to 
the direct rating, only for the risk prone subjects, 
whose consistency increased slightly from 0.911 to 
0.918, as presented in Figure 7. 

The third weighting method examined is the 
pairwise comparison, where the decision maker’s 
responses have the form of bits of a binary system, 
instead of a rank order of the alternatives. This is the 
reason for the determination of a different method 
for the correlation extraction. Every response with 
this weighting method was compared to the cor- 
responding response and ranking with each one of 
the other two examined methods, in order to extract 
conclusions for the consistency of the method. 
When the response that is expressed with the pair- 
wise comparison coincides with the response deter- 
mined with one of the other two weighting methods, 
then this response is considered true and takes the 
value 1, whereas in any different case, with no indi- 
cation of such coincidence, the response takes the 
value 0. For instance, when a decision maker assigns 
greater weight to the alternative with 70% gain prob- 
ability, compared to the alternative with 50% gain 
probability, in a comparison between these specific 
alternatives, it is normally expected to select the alter- 
native with 70% gain probability. This case can be 
described as compatibility between the responses of 
the direct rating and the pairwise comparison. In any 
different case, the coincidence cannot be achieved. 
With this procedure, all the responses of the pairwise 
comparison are tested for their coincidence with the 
other applied methods and finally, the average of the 
response compatibilities is estimated as the compat- 
ibility degree of Table 2, that reveals on the one hand 
the method’s reliability, and on the other hand, the 
subject’s consistency with this weighting method. 
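
The compatibility check described above can be sketched in Python as follows. Each pairwise choice is scored 1 when it coincides with the alternative that received the greater weight under a reference method and 0 otherwise, and the scores are averaged. The data, the names and the treatment of equal weights as non-coincidence are assumptions of this sketch, and only one reference method is shown, whereas the study compares against both of the other methods.

```python
def compatibility_degree(pairwise_choices, reference_weights):
    """Average agreement between pairwise choices and reference weights.

    pairwise_choices: list of (option_1, option_2, chosen_option)
    reference_weights: dict mapping option -> weight from another method
    """
    scores = []
    for first, second, chosen in pairwise_choices:
        w1, w2 = reference_weights[first], reference_weights[second]
        if w1 == w2:
            expected = None  # assumption: a tie gives no coincidence
        else:
            expected = first if w1 > w2 else second
        scores.append(1 if chosen == expected else 0)
    return sum(scores) / len(scores)

# Hypothetical direct rating weights for three gain probabilities.
direct_rating = {"p50": 20, "p70": 50, "p90": 30}

# Hypothetical pairwise responses of the same decision maker.
choices = [
    ("p50", "p70", "p70"),  # agrees with the direct rating -> 1
    ("p70", "p90", "p90"),  # direct rating favors p70      -> 0
    ("p50", "p90", "p90"),  # agrees with the direct rating -> 1
]

degree = compatibility_degree(choices, direct_rating)
print(f"compatibility degree: {degree:.3f}")
```

In this illustration two of the three pairwise responses coincide with the direct rating, so the compatibility degree is 2/3.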


The compatibility degree average, hence, deter- 
mines the consistency of weights, and simultaneously 
reflects the correlation with the two other weighting 
methods. It is obvious in Table 2 and in Figure 7 
that the responses of the pairwise comparison dem- 
onstrate a severe lack of consistency, because with 
the specific weighting method more inconsistent 
answers occur, which entail, for most of the subjects, 
an unexpected risk and ambiguity seeking. 

The correlation degree average for this method 
is significantly lower and estimated at 0.677, with 
a high variance of 0.031. The risk averse subjects seem 
to follow this trend, as they demonstrate a correlation 
average of 0.672 and variance 0.022. However, 
it is worth mentioning that in this case, unlike the 
previous observations, the risk prone decision makers 
seem slightly more consistent compared to the risk 
averse subjects, as they have a correlation average 
of 0.709 and variance 0.009. 


5 CONCLUSIONS 


This research focuses on the decision makers’ 
determination of weights and features the cogni- 
tive process of weighting from the perspective of 
behavioral preference and attitude. Therefore, a 
new approach for revealing and classifying behav- 
ioral attitudes is proposed, based on the decision 
maker’s preference expression with the use of the 
psychometric function. The proposed method- 
ology allows the modeling, the recognition, and 
the classification of individual attitudinal prefer- 
ences and the parallel examination of other crucial 
parameters, like sensitivity. 

Moreover, the paper investigates the impact of 
the applied weighting method on the weighting 
process and consequently, on the determination 
of weights. A methodology is developed to evalu- 
ate the consistency and reliability of the weighting 
method, as well as the impact that it may have on 
the decision maker’s attitude concerning the assign- 
ment of weighting values. The results demonstrate 
that the weighting methods affect the expression 
and the ranking of decision maker’s attitudinal 
preference. In particular, the methods of direct 
rating and distance estimation demonstrate high 
consistency of responses, compared to the method 
of pairwise comparison that has significantly lower 
consistency. Additionally, risk averse subjects seem 
to respond more consistently with the two first 
methods, as opposed to the risk prone subjects, who 
respond better with the method of pairwise comparison. 
This evidence can be very helpful for the 
assignment of managers and decision analysts to 
projects and for the selection of the proper weighting 
method to apply to decision making problems. 


ABSTRACT: Nowadays the majority of organizations operating in the manufacturing field recognize the 
importance of including the Human Factor contribution in industrial process optimization (Hong 
et al. 2007). Technical measures and work organization procedures have been optimized in order to reduce 
defects and waste generation, but Human Performance prediction still represents a difficult task for 
Managers. The prediction of the human performance of all workers involved in a production 
system would help Managers to better allocate human resources. In order to reach this objective, 
a model was designed to quantify the human capability of managing a complex task in a working context 
characterized by a set of physical, organizational and cognitive factors. This paper presents the preliminary 
results of a three-year industry/academia partnership project to assess human performance in a 
manufacturing plant. A multi-disciplinary approach involving both technical and individual factors was adopted. 


1 INTRODUCTION 

In the manufacturing sector, process optimization 
plays an important role in improving production 
efficiency and economic profit. 

Production is influenced by several factors, such 
as technology, organization, energy and workers’ 
performance. 

In many cases, process optimization has been 
primarily focused on technical measures and work 
organization procedures. 

Although the level of automation in the manufacturing 
industry has increased considerably and the 
standardization of working procedures drives the 
working activity, the Human Factor still plays 
an important role in the efficiency of the production 
system (Baines et al., 2005). 

The Human Factor has a strong influence on the 
occurrence of occupational accidents and the 
generation of defects. 

The Human Factor represents a difficult task for 
Managers, even though most organizations 
operating in the manufacturing field recognize the 
importance of including the Human Factor (HF) 
contribution in industrial process optimization 
(Hong et al., 2007). 

The Human Factor analysis has been 
approached differently in several areas. 

Safety and Quality managers focused their 
attention on the deviation of human behavior from 
procedures. Miller (1987) analyzed a set of 
environmental, organizational and individual factors 
in relation to error-related outcomes. The ex-post 
events analysis approach has been used (Comberti 
et al., 2015) to identify causes of occupational 
accidents and defects with the aim of reducing their 
repetition. 

Work Organization managers related the HF 
analysis to ergonomics, with the aim of calibrating 
and optimizing the task time and reducing the 
task-related operational risk (Lin et al., 2001). 

Many studies on Human Performance mod- 
eling suggest that the HF has to be approached 
as a complex system, where behavior, cognition, 
physiology and working condition deeply interact 
(Leva, 2016). 

The knowledge of the relation between the 
human nature and the working condition of all 
workers involved in a production system would be 
crucial for the industrial Management. 

A better allocation of human resources to 
the different tasks will probably reduce the generation 
of defects and the frequency of unsafe actions. 

In order to reach this objective it is necessary 
to model a system able to quantify and predict the 
human capability of managing a complex task in a 
context characterized by a set of physical, 
organizational and cognitive factors (Groth, 2012); in 
other words, a model able to define and assess the 
Human Performance (HP). 


Relevant research on this topic has suggested which 
variables can be used to define the HP. 

Baines & Benedettini (2007) suggested a 
multi-disciplinary approach based on the Sociology, 
Psychology and Engineering disciplines, so as to be 
consistent in the representation of human nature. 
Eklund (1997) showed that ergonomics has to be 
related to quality performance. 

This paper presents the preliminary results of 
an industrial and academic project to develop a 
Human Performance assessment method for safety 
and quality optimization. 

The aim of this work is to approach HP 
modeling so as to facilitate the management of the HF in 
the industrial improvement process. 

The proposed model was developed on the basis 
of the results of Straeter (2000). 

It is based on the fundamental assumption that 
the HP can be represented as directly dependent 
on two macro-factors: 


• Task Complexity (TC): summarizes all 
factors contributing to the physical and mental 
demands of executing a given operative task, 
including work environment factors. 

• Human Capability (HC): summarizes the 
resources of workers under the real working 
conditions. This factor represents the physical, 
mental and cognitive abilities of the worker. 


Section 2 of this paper presents the Conceptual 
Model of this project, while Section 3 shows 
the Operative Model derived from the case 
study. 

Section 3 gives an illustration of the future 
development of the project, with a focus on model 
validation. Conclusions end the paper. 


2 HP PROJECT DESIGN 


This project has been managed in 4 steps, as
Figure 1 shows.


Figure 1. Project structure.


The first step focused on the design of the
“Conceptual Model”.

The Conceptual Model defines the variables and
relations considered in the HP assessment.

The second step was the design of the Operative Model,
which represents the projection of the Conceptual
Model into real industrial life.

In other words, each variable introduced in the
Conceptual Model has to be replaced by a measurable
quantity in the Operative Model.

The third step will focus on HC and TC assessment
through intensive field data collection. This step
will directly involve the workers of the plant, with
skill tests performed during working activity.

In addition, the descriptive parameters of TC will be
collected through an in-depth analysis of the working
places. This step will be completed by a systematic
interview of all workers involved. The interview will
be structured around a set of questions related to
individual motivation, risk perception and perception
of working complexity. The information acquired with
the survey will be used as feedback for safety, work
organization and quality improvements. On the basis of
the results, the model will be validated or modified.
This paper presents the results of the first and
second steps of the project.


2.1 Conceptual model 


The HP model represents the interaction between 
two macro factors: the Human Capability (HC) 
and the Task Complexity (TC). 

Both factors can be analyzed with a wealth of methods
for different purposes, such as data collection, task
analysis (including cognitive task analysis), workload
measurement, situation awareness assessment,
performance assessment (including team performance
assessment), human error identification and interface
evaluation (Stanton, 2004 and 2006).

The conceptual model of Human Performance proposed in
this work is shown in Figure 2.

TC, as mentioned in the previous section, represents
the total demand of resources required to perform a
given task correctly under given work environment
conditions.

TC results from the contribution of two main factors,
Mental Workload (MW) and Physical Workload (PW), both
associated with a single operative task.

The PW factor is easily related to the physical,
motion and postural efforts required to complete a
given task.

Bad ergonomics combined with time pressure 
(coping with pace) have been estimated to cause 
about 50% of all quality deviations (Lin, 2001). 


Figure 2. HP conceptual model. [Diagram labels
recoverable from the figure: Mental Workload; Derived
Variables.]


Several studies demonstrate that a high physical
workload, such as awkward postures, can decrease
performance through discomfort (Erding, 2011).

Low variation, such as repetitive motions and static
workload, was observed to be an additional cause of
muscle fatigue (Punnett, 2000).

Other factors that may be included in PW modelling are
the degree of rotation between high and low demanding
tasks (Horton, 2012) and gender effects, given the
differences in discomfort and muscle fatigue under
repetitive and static workload (Hunter, 2012).

In addition to the above-mentioned factors, which are
related to a specific operative task, PW can also be
affected by environmental workload effects (Jung,
2001), which include improper temperature, lighting,
noise and vibration, and exposure to chemical agents
and to physical agents such as dust.

The physiological effects of these environmental
factors, under industrial conditions, can contribute
to an increase in stress level and consequently to a
loss of human performance (Grandejan, 1985).

MW was defined by Kahneman (1979) as “a 
factor directly related to the proportion of the 
mental capacity of an operator spends on task 
performance”. 

MW has been assessed in various research fields with
both objective and subjective measures, such as
physiological activity under simple normative task
conditions (Kramer, 1991), cognitive performance,
subjective analysis (Didomenico, 2008) and combined
approaches (Miyake, 2001).

All these studies were performed in normative
conditions, with simple standardized tasks and under
controlled environmental conditions. This
configuration is far from the industrial situation.

A relation between the MW of assembly tasks and
quality deviations was recently found by Falck (2014).
That work suggests that MW can be estimated through
the evaluation of the complexity of the task.

In an industrial plant it is more suitable to assess
the MW factor with a combination of subjective
measurement and indirect quantification of
task-related variables than with physiological
measurements and normative cognitive tests.

As a result of the literature review and plant
analysis, a set of variables for the definition of TC
was identified.

Figure 3 summarizes all variables selected for the
definition of TC.

Human Capability (HC), as mentioned in the previous
section, represents the total amount of resources that
a worker is able to provide to execute a given task
under the environmental working conditions.

The HC factor is given by the contribution of 
several human skills that are all engaged in per- 
forming an operative task. 

In particular, the main human skills that have been
considered in relation to an assembly task are:


• Ability: skills such as Precision, Manual Handling
and Coordination are called upon continuously during
front-line assembly work.

• Memory: the ability to remember the sequence of
operations and parts needed to complete a given task
correctly can differ considerably between workers.

• Physical: the ability to maintain constant
performance during the shift and the ability to cope
with pace.


Figure 3. TC conceptual model. [Diagram labels
recoverable from the figure: Working cycle complexi
(truncated); Physical Workload; Coping with
(truncated); Environmental loads.]


2.2 Operative model 


The Conceptual Model defined in the previous section
also serves as the template for the definition of the
Operative Model.

For each factor taken into account in the Conceptual
Model, the Operative Model contains a set of
observable and measurable variables.

The variables were selected after a field analysis 
performed in the beginning stages of the project 
with a participatory approach that involved both 
academic and industry professionals operating in 
the various management areas involved: Safety, 
Work Analysis, Quality, Work Organization. 

The selected observable variables will be measured on
both numerical and qualitative scales.

To allow comparison between variables of different
nature and scale, all variables will be harmonized
onto a common numerical scale.
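
The harmonization step can be illustrated with a minimal sketch. The paper does not state which common scale or which transformation is used, so the 0 to 1 target range, the linear rescaling, the rank-based mapping of qualitative labels and all variable names below are assumptions made only for illustration.

```python
# Hypothetical sketch: harmonize mixed-scale variables onto a common 0-1 scale.
# The target range and both mappings are assumptions, not the project's method.

def rescale_numeric(value, lower, upper):
    """Linearly map a numeric measurement from [lower, upper] to [0, 1]."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    clipped = min(max(value, lower), upper)  # keep out-of-range values in bounds
    return (clipped - lower) / (upper - lower)


def rescale_ordinal(label, ordered_labels):
    """Map a qualitative label to [0, 1] by its rank in an ordered list."""
    if len(ordered_labels) < 2:
        raise ValueError("need at least two ordered labels")
    rank = ordered_labels.index(label)  # raises ValueError for unknown labels
    return rank / (len(ordered_labels) - 1)


# Illustrative values only: a cycle time in seconds and a qualitative noise level.
cycle_time_score = rescale_numeric(45.0, lower=30.0, upper=90.0)
noise_score = rescale_ordinal("medium", ["low", "medium", "high"])
```

Once every variable is expressed on the same range, numerical and qualitative variables can be compared or combined directly.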

The TC factor will be estimated through the assessment
of the observable variables shown in Figure 4 (Mental
Work Load) and Figure 5 (Task Complexity).

In the representation proposed in Figures 4 and 5,
some variables are not used directly in the HP model
but are compared with the results of the worker
interviews mentioned previously.

The HC factors will be estimated through a set of
measures, shown in Figure 6.

These measures will be obtained from skill tests
performed by workers during real working activity.
Figure 6. HC Operative model.


As an example, the “Memory skill” will be tested by
recording the time a worker takes to replicate a
symbol sequence that has been shown briefly.

The Dexterity variable will be measured with 3
“ability tests” that simulate typical operations
required on an assembly line.

The HC of each worker will be assessed by recording
the time taken to complete all tests and the number of
errors made.

In addition to these human skills, modelling Human
Capability must take into account an important
psychological aspect, which can be described as
“Motivation” and which can interact constructively or
disruptively with this factor.

Motivation includes several psychological 
factors: 


- Perception of task risk;
- Perception of task complexity;
- Personal awareness;
- Job satisfaction/dissatisfaction.


It is easy to see conceptually that a severe mismatch
between Task Complexity and Perceived Task Complexity
can facilitate human error or the generation of unsafe
acts.

The information acquired with the interview will be
used to estimate the level of motivation and
perception of each worker.


3 PROJECT DEVELOPMENT AND 
MODEL VALIDATION 


The project has now completed the second step. On the
basis of the Operative Model, field data collection
will be carried out over the next 6 months. This
activity will involve 150 workers operating on 4
assembly lines. The total number of working places is
estimated at approximately 70. The application of this
model will imply the calculation of 170 Human
Capability profiles and 70 Task Complexity profiles.
The HP calculation will be done according to the
scheme shown in Figure 7.

Figure 7. HP assessment.


Figure 7 summarizes the generic scheme for calculating
HP from the TC of a working place, characterized by 5
indices (Variability, Working Cycles, Parts, Physical
Efforts, Saturation), and the HC of a worker,
characterized by 4 indices (Memory, Dexterity,
Steadiness, Coping with pace).

This calculation scheme is based on the operation
“HC-index − TC-index” and leads to the definition of 6
matching indices.

On the basis of the matching indices, two Human
Performance indices are defined:


• HP−: the sum of all negative matching indices.

• HP*: the sum of the absolute values of all matching
indices.
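
The two indices defined above can be sketched in a few lines. The text does not specify which HC index is paired with which TC index to obtain the 6 matching indices, so the six (HC, TC) pairs below, their values and the function names are purely hypothetical; values are assumed to lie on a common 0 to 1 scale.

```python
# Hypothetical sketch of the "HC-index - TC-index" operation and of the two
# HP indices. The pairing of HC and TC indices is an assumption.

def matching_indices(pairs):
    """Return HC - TC for each (hc_value, tc_value) pair."""
    return [hc - tc for hc, tc in pairs]


def hp_minus(indices):
    """HP-: sum of all negative matching indices (capability shortfalls)."""
    return sum(m for m in indices if m < 0)


def hp_star(indices):
    """HP*: sum of the absolute values of all matching indices."""
    return sum(abs(m) for m in indices)


# Six illustrative (HC, TC) pairs for one worker on one working place.
pairs = [(0.75, 0.5), (0.25, 0.5), (0.5, 0.5),
         (0.5, 1.0), (1.0, 0.75), (0.25, 0.0)]
indices = matching_indices(pairs)
worker_hp_minus = hp_minus(indices)
worker_hp_star = hp_star(indices)
```

In this reading, HP− captures only the cases where the task demands more than the worker can offer, while HP* also counts surplus capability as a mismatch.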


The assessment of HP− and HP* for each assembly line
will allow a quantitative calculation of the potential
Human Performance associated with the matching of
workers to working places.

Changing the distribution of the workers will lead to
a different HP estimate.

Minimizing HP− and HP* will therefore optimize the
distribution of the workers across the working places
on the basis of each individual human capability and
each task complexity.
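
A minimal sketch of this optimization is given below. The paper does not describe the search procedure or how HP− and HP* are combined, so the brute-force enumeration, the ordering of the objective (total magnitude of HP− first, total HP* as tie-breaker), the already-paired profiles and all names and values are assumptions for illustration only.

```python
# Hypothetical sketch: choose the worker-to-workplace distribution with the
# smallest mismatch. Objective and search strategy are assumptions.
from itertools import permutations


def hp_scores(hc_profile, tc_profile):
    """Return (magnitude of HP-, HP*) for one worker on one working place."""
    matching = [hc - tc for hc, tc in zip(hc_profile, tc_profile)]
    shortfall = sum(-m for m in matching if m < 0)
    spread = sum(abs(m) for m in matching)
    return shortfall, spread


def best_assignment(workers, workplaces):
    """Enumerate every distribution and keep the one with the lowest score."""
    if len(workers) != len(workplaces):
        raise ValueError("this sketch needs as many workers as working places")
    places = list(workplaces)
    best_key, best_plan = None, None
    for order in permutations(workers):
        shortfall = spread = 0.0
        for worker, place in zip(order, places):
            s, a = hp_scores(workers[worker], workplaces[place])
            shortfall += s
            spread += a
        key = (shortfall, spread)
        if best_key is None or key < best_key:
            best_key, best_plan = key, dict(zip(places, order))
    return best_plan, best_key


# Illustrative profiles with two already-paired indices each.
workers = {"A": (1.0, 0.5), "B": (0.5, 1.0)}
workplaces = {"P1": (0.5, 1.0), "P2": (1.0, 0.5)}
plan, score = best_assignment(workers, workplaces)
```

Enumeration grows factorially with the number of workers, so a line of realistic size would presumably need a dedicated assignment solver rather than this exhaustive search.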

To validate this model, a collaborative process
involving Quality and Production Managers in mutual
learning will be adopted, as suggested by the Action
Research method (Greenwood, 2006).

A set of 25 working places with relevant quality
problems will be selected.

The HP results will be used to identify the best
matching of workers to working places, and on that
basis a new configuration of the line will be defined.

The new worker-task configuration will be monitored
for a period of 3 months, during which quality
indicators will be collected.

The comparison of quality data before and after the
reconfiguration will allow the impact of the method to
be evaluated.

This operation should lead to a reduction of the human
errors related to a wrong matching of worker and
working place.

The HP assessment would be used as a kind of objective
guideline to optimize the distribution of workers on
the assembly lines.

The number of workers involved (more than 150) and of
working places analyzed will allow a statistical
validation of the model.


4 CONCLUSION 


In conclusion, the strengths of the proposed empirical
approach to Human Performance assessment can be
summarized as follows:


• A model was developed that defines HP as the
ultimate product of the balance between TC (driven by
all the factors from the environment) and the operator
characteristics (HC).

• The empirically based analysis will enhance
knowledge of the specific process operations at
managerial level, possibly highlighting latent drivers
of Human Performance.

• The model was developed and will be tested in real
operating conditions with a large number of workers
directly involved. This represents a rare case of
cooperation between academia and manufacturing. The
results of the model application will be applied
directly by industrial management, with a measurable
impact in terms of process optimization.
Symptom-based approach for dynamic HRA and accident management

ABSTRACT: The paper presents a symptom-based approach for dynamic Human Reliability
Assessment (HRA) and accident management through holistic context evaluation by the
Performance Evaluation of Teamwork (PET) method. The macroscopic context evaluation
procedure of the PET method makes it possible to define correctly the emerging issues,
challenges and possible solutions in the field of HRA for errors of commission, dependency
analysis, multi-unit considerations, long time window scenarios, digitalized main control
rooms, etc. In addition, measuring the durations for recognition and disregard of symptoms
as a function of various Performance Shaping Factors (PSFs), using well-developed
stimulus-response models on the microscopic level and extended use of simulator data, could
improve the quality of HRA, accident analysis and Probabilistic Safety Analysis (PSA),
respectively.


1 INTRODUCTION 


Human reliability assessment (HRA) is an applied
science on the frontier between Probabilistic Safety
Analysis (PSA), reliability theory, Deterministic
Safety Analysis (DSA), complex system simulation,
Human Factors (HF), Cognitive Systems Engineering
(CSE), psychology and neuroscience. It tries to apply
what is known from these sciences to reflect human
interference within a Socio-Technical System (STS)¹,
to assess the probability of Human Failure Events
(HFEs), and to design and construct fault-tolerant and
resilient system interactions between humans,
technology, organizations and environment.

With the increase in reliability of equipment in
technical systems, the human who designs, manages and
maintains them becomes a critical element. His/her
reliability has to be explored and increased to match
that of the machine. HRA emerged in the 1960s as a
product of the development of complex technical
systems and technologies in the military, space and
nuclear industries.

Human reliability is an irreplaceable element for the
efficient and safe use of an STS. The understanding
and evaluation of safety is related to probabilistic
risk assessment (PRA). Risk is characterized by at
least two quantities:


• the magnitude (severity) of the possible adverse
consequence(s), and
• the probability of occurrence of each consequence.


¹ STS consists of human, organization, technology and
environment.


The development of nuclear technology is a priori
related to ensuring its safety. Therefore, in the
first systematic and comprehensive risk study,
WASH-1400 (1975), HRA was included as an indisputable
and important element.

There, HRA used the Technique for Human Error-Rate
Prediction (THERP) method (Swain & Guttman, 1983),
developed by Alan Swain and based on the statistics
and HF knowledge existing at that time, theoretical
assumptions and expert judgment. This method gives the
“thesis”, or the aim, of “first generation” HRA
methods: to obtain a Human Error Probability (HEP) of
an identified HFE by guessing, reasoning and weighting
of the internal and external factors influencing the
HEP. Thus, the scenario’s severity is taken into
account by expert judgment through multiplication of
performance shaping factors (PSFs).

In every human action (HA), one may distinguish two
sequential stages: a Cognitive (Diagnosis or
Decision-making) stage and an Executive (Response or
Manual) stage. For the second stage some reliable
statistics exist (e.g. THERP) or can be obtained.
However, for the first, cognitive stage, refuted and
recanted holographic-like² models are not available.
This is due to the “unfinished business related to
HRA, which includes identification, specification and
fitting of human cognition model to define the error
potential and context of human action” (Dougherty,
1993). Hollnagel (1993) also emphasized that HFEs take
place in a cognitive system. He also argued that
detailed knowledge of the objective context of an HA,
and of its subjective image in the human mind, is a
basis for understanding human performance: “...any
description of human actions must recognize that they
occur in context” and in dynamics “on a
second-by-second basis”, and must take into account
the multidimensional dynamic context, “how the context
influences actions” and the whole cognitive process
(Hollnagel, 1998).


² Holographic-like behavior of an STS is that of a
system without any communication among separated
processes (Townsend, 1984).

When considering the reliability of human performance
in accident conditions, the first dimension is time.
The temporal approach used in THERP, e.g. Swain’s Time
Reliability Curves, or its ‘improved’ version called
the Human Cognitive Reliability (HCR) Correlation
(Hannaman & Spurgin, 1984), is “virtually impervious
to context”. It should be complemented by an
influential or contextual approach to avoid “bareness
in modeling” (Dougherty, 1993). Dougherty proposed
replacing the “first generation” HRA models, such as
THERP and HCR, with “second generation” HRA models
(Dougherty, 1990). The “antithesis”, or the aim of
“second generation” HRA models, was to take into
account the HA context with its specifics, severity,
multidimensional dynamics and holistic nature for any
individual or group, mental or manual process.

However, most “first generation” HRA methods made a
formalistic substitution of “influential factors”,
THERP’s holistic PSFs or their modifications, with
contextual factors, which does not substantially alter
the HRA outcomes. Moreover, difficulties arise from
significant uncertainties in the quantification of
each factor for the HEP, due to the lack of data and
of an appropriate method to address uncertainties and
dependencies between PSFs. This substitution
exemplifies Dougherty’s observation that “the
influential and contextual approaches may find
themselves indistinguishable at the quantification
stage because of the paucity of actual data”
(Dougherty, 1993).

Since then the HRA community has been discussing and
substantiating some contentions and challenges of HRA,
which caused much debate and launched the so-called
“second generation” HRA. Some HRA basics are already
indisputable; others are the result of inertia in
group thinking, misleading interpretation, judgment
biases, and business dependences and interests.

The following list shows HRA features that are
valuable³ for accident analysis and whether or not
they are implemented in the second generation HRA
methods discussed (the features are in italic):


³ This means that the HRA method can assess the
severity of the context and its potential for human
error, and thereby reduce the risk or optimize the
action of an operator.


• the response-related model is necessary: not
implemented in Méthode d’Evaluation de la Réalisation
des Missions Operateur (MERMOS);

• “...any description of human actions must recognize
that they occur in context”: implemented in Cognitive
Reliability and Error Analysis Method (CREAM);

• ‘extremely contextual’: implemented in MERMOS;

• situational and knowledge/mental models: implemented
in A Technique for Human Event Analysis (ATHEANA);

• shift the problem from quantification of the
operator behavior to the quantification of the
error-forcing context: implemented in CREAM, MERMOS,
ATHEANA;

• ‘error’ identification: implemented in CREAM,
MERMOS, ATHEANA;

• ‘error’ quantification: implemented in CREAM,
MERMOS, ATHEANA;

• ‘error’ reduction: implemented in CREAM;

• accident context is a function of time “on a
second-by-second basis”: not implemented in CREAM;

• individual dynamics of situation: not implemented in
MERMOS;

• variability of performance is more important than
how actions can fail: not implemented in CREAM;

• some combination between performance-related
effects: implemented in MERMOS;

• context as a collection of weighted PSFs:
implemented in SLIM by domain expert judgments;

• human performance limiting values: implemented in
MERMOS;

• context could be different for each crew member: not
implemented in MERMOS;

• systematic search process by series of lists of
possible Human Failure Events, Unsafe Acts and
error-forcing conditions (EFCs): implemented in
ATHEANA;

• identification of erroneous actions based on the
accident context and not to “subsume the errors
identified with the event sequence mostly by
historical means”: implemented in ATHEANA.


To overcome the above-mentioned problems with the
subjective judgment of PSFs, holistic and dynamic
human performance in the STS, HFE identification,
quantification and reduction, and data mining and
measurement, a more realistic symptom-based approach
for the statistical description of the STS context was
proposed (Petkov and Furuta, 1998). The STS context
quantification procedure is the first step of the
Performance Evaluation of Teamwork (PET) HRA method,
which can implement the above features valuable for
both HRA and accident analysis. It relies on
combinations of recognizable symptoms⁴ for description
of the variability of STS performance and gives
controllable macro structures of the main mental
processes, such as cognition and communication
(Petkov, 2000).

And finally, the HRA “synthesis”: to obtain the HEP of
an identified HFE in the STS context, based on the
recognition of its holistic specific symptoms and on
their severity and dynamics, for any individual or
group, mental or manual response. A symptom’s impact
is reasoned, measured and weighted by its internal and
external “glocal” (global & local) PSFs.


2 SYSTEM APPROACH TO HUMAN 
PERFORMANCE 


HRA experience has shown that the approach to the
study of human actions and failures is often one-sided
and biased, for example towards the point of view of
either psychologists or engineers.

For example, the engineering modeling of human
reliability in a similar way to equipment reliability
encounters well-grounded objections from
psychologists: it does not account for human activity
and variability, and man is cognitive, emotional and
volitional, not an item. On the other hand, the
concerns of some psychologists that man cannot be
simplified to one of the system components, or cannot
be evaluated as a number, i.e. that neuroscience or
HRA should not apply the universal ways of modeling
and sharing scientific knowledge, are unfounded.

There are two approaches to human errors — “the 
person approach and the system approach.” The first 
approach “focuses on the contribution of human 
errors on the system, their own psychological justi- 
fication, accusations of forgetfulness, inattention, 
or moral weakness” (Reason, 2000). For example, 
“Over time, the role and importance of human con- 
tributions has grown for early PRAs, the contribu- 
tion of humans has been set at about 15%, now the 
contribution has grown to 60% to 80%” (Spurgin, 
2009). 

The second approach focuses on the conditions, 
situations and context in which a person deliberately 
and conscientiously performs his/her actions to effec- 
tively manage the system and limit the consequences 
of the risk of its operation. The extreme statement is 
that “Human error is never the root cause.” 

Human performance needs to be considered as the
variability of a whole STS in which the human
interacts with technology, other humans, organizations
and the environment. The system approach is preferable
and more practical for context-based HRA. It is not
important “who blundered, but how and why the defences
failed” (Reason, 2000).


⁴ A symptom is a measurable plant parameter that is
available to the operator in the control room (IAEA, 2006).




3 SYMPTOM-BASED APPROACH TO STS 
STATISTICAL DESCRIPTION 


The system approach means that more components, more
states, more interactions and more information have to
be taken into account. How to analyze the system
performance is the most important issue.

The enormous amount of information and the use of
digital technology for information processing have
changed individual (human) and group (social) mental
processes and people’s behavior. The structural or
functional decomposition customary in PSA and CSE
modeling does not work in HRA, because it breaks the
most important dynamic interactions in the system and
cannot explain its holistic behavior. Real-world
decision-making processes need to be modeled in the
context of the complex multi-agent and multi-level
socio-technical system. The concept of ‘context’ is
crucial if such multi-disciplinary models are really
to make a difference in controlling and coordinating
the balance between the enormous amount of information
and limited human cognitive capacities.


3.1 Symptom-based approach in nuclear accident
management


The symptom-based approach is now usual for 
nuclear accident management. The IAEA NS-R-2 
(2000) establishes the following requirements on 
accident management: “The training of operating 
personnel shall ensure their familiarity with the 
symptoms of accidents beyond the design basis 
and with the procedures for accident manage- 
ment.” Later in IAEA SRS No. 48 (2006) “symp- 
tom/state based procedures” were justified and 
in IAEA NS-G-2.15 (2009) a “symptom-based 
approach” was also recommended: “2.14. The 
approach in accident management should be based 
on directly measurable plant parameters or param- 
eters derived from these by simple calculations.” 


3.2 Statistical entropy description of 
socio-technical system 


The dynamic interactions of the STS are manifested by
the interference of symptoms (stimuli with meaning for
the human), and the system context can be represented
by them on the macro level. Of course, if we want to
understand the root causes of human errors, we should
search in depth at the micro level. But the
macroscopic statistical description of the STS context
helps to identify the dynamic and holistic nature of
the system’s behavior (Petkov et al., 2001).

The basic idea of the distinction between the
macroscopic and microscopic levels is to replace the
set of microscopic accessible states with equivalent
subsets of macroscopic states. A given macroscopic
state can be realized by many microscopic accessible
states. This idea follows Shannon’s theorem (1948)
regarding entropy as the measure of information, and
was the basis for the analogy between energy and
information used in the PET method. The mental process
in the STS is described at each moment by its
microstate. This is a specific quantum state that
represents the most detailed possible description of
the STS.
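
The macro/micro distinction can be illustrated with a short sketch of Shannon entropy over macrostates. This is not the PET formula, which is not given in this section; the assumption that all accessible microstates are equally likely, the macrostate labels and the counts are hypothetical and serve only to show how a statistical measure follows from enumerating accessible states.

```python
# Hypothetical sketch: Shannon entropy of macrostates obtained by counting
# equally likely microstates. Labels and counts are illustrative assumptions.
import math


def macrostate_probabilities(microstate_counts):
    """Probability of each macrostate, assuming equiprobable microstates."""
    total = sum(microstate_counts.values())
    if total <= 0:
        raise ValueError("at least one accessible microstate is required")
    return {name: count / total for name, count in microstate_counts.items()}


def shannon_entropy(probabilities):
    """Shannon entropy in bits of a discrete probability distribution."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Number of microstates realizing each macrostate (illustrative only).
counts = {"no symptom missed": 1, "one symptom missed": 2, "two symptoms missed": 1}
probs = macrostate_probabilities(counts)
entropy_bits = shannon_entropy(list(probs.values()))
```

A macrostate realized by more microstates carries a higher probability, and the entropy summarizes how uncertain the overall state of the system is.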


3.3 Practicality of symptom-based STS 
description 


Since context characterization and analysis in HRA is
carried out with the assistance of the personnel, it
is more practical and natural, first of all, to
describe the STS context in terms of the symptoms used
by the personnel. Recognition of each symptom by an
operator is a mental process involving individual
cognition, communication within the group of
operators, decision-making, checking and recovery.
Each symptom has its specific symptom-influencing
factors (SIFs), which essentially coincide with the
holistic PSFs used in HRA. But if all the symptoms in
a given scenario are described with a common PSF, this
blurs the specific influence of a PSF on a given
symptom, masks the dependencies between the PSFs and
increases the uncertainty of the results.


3.4 Theoretical issues and limitations from 
unexplored mental processes 


In addition to the above, there are deeper theoretical
reasons for using the symptom-based approach, namely
the unexplored mental processes.

Townsend (1984) emphasizes the need for a theory for
connection of holographic-like behavior and separation
of selective and non-selective influence:


• “Systems based entirely on holographic-like behavior
without any communication among separated processes,
are omitted.”

• “The selective influence postulate is critical.”

• The “empirical tests of selective influence will
likely be tied closely to separation of selective from
non-selective systems.”

• “Factors that influence not only durations but also
outputs of processes have not been investigated.”

• “The expectation (the mean) of a sum of random
variables (in this case, serial processing times) is
equal to the sum of the expectations, which is true
for any set of random variables whether or not they
are independent.”


If systems with holographic-like behavior are omitted
from the exploration of mental processes, then the
holistic PSF-based approach to STS context
quantification is very questionable. In the
symptom-based approach every symptom is recognized
separately and context quantification is based on
enumeration of the STS accessible states.

The duration of symptom recognition is measured as the
degree to which an operator's perception of the STS is
compatible with the required action.

A model is needed that establishes a mathematical
function f describing the relation between the symptom
x and the expected duration of its recognition
$\Delta T$: $\Delta T = f(x)$. A common simplification
is to assume that such functions are linear, so we
expect to see a relationship like:

$$\Delta T_i = \Delta T_{i,\min} + x\, b_i \qquad (1)$$

where $b_i$ collects the factors $\beta_j$ (the j-th
factor, SIF$_j$ or PSF$_j$) acting on the i-th symptom.

Since Townsend & Schweickert (1989) also emphasized
that “Although we view the test for additivity as one
important strategy in an overall systematic approach
to uncovering psychological processes, we do not
identify processes with additivity,” the influence of
n factors cannot simply be expressed as
$b_i = \sum_j \beta_j$, where $j = 1 \ldots n$.
However, $\Delta T_{i,\max}$ could be measured for a
given human performance context (STS accessible
state), and the uncertainty of the results for the HEP
could thereby also be reduced.


4 PET EXTENDED DEFINITIONS, 
HUMAN FAILURE EVENTS 
TAXONOMY, EVALUATION 
PROCEDURE AND MODELS 

4.1 Extended definition of dynamic STS context 


Petkov & Furuta (1998) proposed a heuristic concept of
Context Factors and Conditions (CFCs) as quantitative
factors that indicate “how symptoms (CFCs) influence
context”, in order to study their effects on human
performance, i.e. “how context influences actions”
during the accident. They also consider the context to
be a function of time and the second argument of the
HEP function, the first argument being time, i.e.
HEP = f(t, context).

Petkov & Groudev (1999) proposed an indirect
similarity between material processes (‘transition
temperature shifts’) and mental processes (‘human
performance shifts’) to measure dynamic deviations of
symptoms (CFCs). The common ground of this study is
the emergency context. The similarity is seen between
the “reference” temperature for pressurized thermal
shock in materials and the “reference” context
probability for the HFE. The latter is regarded as a
generalized evaluation of the context influence on the
HA. The basic assumption of this study is that the
stress of both materials and humans is due to symptom
deviations or parameter gradients (between real and
expected values), which determine their
characteristics and behavior.

The interactions, or “STS performance shifts”, represent the dynamic STS context or processes in the human-technology-organization-environment system. Matching the object in the situation is an approximation that could be identified and described as an “image” or a “signature” of the object in the situation. The concept “image” is the imprint of the consciousness and sub-consciousness of the “human performance shifts” that affect the person’s physical, physiological, psychological and psychosocial ability to make sense of and perceive the object (STS) and to perform actions in the STS context.

Generally, context is defined by psychologists as “a state of mind” or “a set of internal or mental representations and operations rather than a set of external elements”. In (Petkov, 2004) the context definition was extended to “a common state of universe, mind and situation in their relation” (object-image-situation). In (Petkov, 1999), the term “context probability” (CP) was coined as a measure of the magnitude of severity of the error-forcing context. This general definition of context, however, is not practical as a measure. The practical definition of context is: a statistical measure of the degree of the STS state randomness defined by the number of accessible states taking place in the STSs’ ensemble.

Since the operator analyses and operates the machine most frequently by discrete actions, the combination (number, value and tendency) of symptoms (stimuli with meanings) that she/he manages to distinguish is of greatest importance to her/him. As a measure of the symptoms’ influence, the relative deviation of the symptom image is proposed: $\delta\varphi = |\varphi_o - \varphi_s| / \varphi_o$, where the indices denote the two types of values of interest (o: objective; s: subjective; $\varphi$: image).
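The relative deviation of a symptom image, i.e. the absolute difference between the objective and subjective values divided by the objective value, can be illustrated with a minimal Python sketch. The function name and the pressure values are hypothetical and are not taken from the PET method; the absolute value in the denominator is an added assumption that keeps the result non-negative for negative readings.

```python
def relative_deviation(objective: float, subjective: float) -> float:
    """Relative deviation |objective - subjective| / |objective| of a symptom image."""
    if objective == 0:
        # The ratio is undefined when the objective value is zero.
        raise ValueError("objective value must be non-zero")
    return abs(objective - subjective) / abs(objective)


# Hypothetical example: a pressure of 15.0 MPa (objective value)
# that the operator perceives as 12.0 MPa (subjective value).
deviation = relative_deviation(15.0, 12.0)
print(round(deviation, 3))  # 0.2
```

A deviation of zero corresponds to a subjective image that coincides with the objective one.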


4.2 Extended definition of STS violated image 
of symptom 


Petkov & Petkov (2000) presented the Combinato- 
rial Context Model (CCM) for context quantifica- 
tion of cognitive process based on deviations of the 
symptom image from object in situation. The CCM 
uses the following symptoms for context quantifica- 
tion: event (E), parameter (P), action (A), resource 
(R), function (F), transition (T) & goal (G). 

In order to demonstrate the crucial impact of violations (the term “circumventions” is used in the USA) on the post-accident context, a Violation of Objective Kerbs (VOK) method was also proposed there to account for the probability of aberrant circumstances (prior to or during the initiating event) in the cognitive process. Following Reason’s qualitative definitions (1994), quantitative definitions were proposed for context quantification purposes:

An error is probable when the difference between the objective and subjective images of a context symptom is not zero, $\varphi_o(t) \neq \varphi_s(t)$, where zero context is $|\varphi_o(t) - \varphi_s(t)| \to 0/\min$.

A Violation of Image of Symptom (VIS) occurs when the objective image of a context symptom, $\varphi_o$, is changed from $\varphi'_o(t)$ to $\varphi''_o(t)$ for any reason.

It should be emphasized that the quantitative definitions of error and violation thus formulated apply not only to the human performance context, but to the context of the whole STS. Therefore, they have a wider and also a different meaning from those of Reason. For example, Reason (2000) defines “procedural violations” as active failures that “have a direct and usually short-lived impact on the integrity of defenses.”

In Reason (2000), the strategic decisions “made by designers, builders, procedure writers, and top level management” refer to latent conditions that “can create longlasting holes or weaknesses in the defenses” and “may lie dormant within the system for many years before they combine with active failures and local triggers to create an accident opportunity.”

In the PET method, the VIS is considered a strategic decision that introduces “inevitable ‘resident pathogens’ within the system,” resulting in a resonant increase in the severity of the error-forcing context, i.e. the CP and HEP.

Taking into account the relativity of the time at which the strategic decision was made, seconds or years ago, we can conclude that the distinguishing feature of the PET VIS is that it may arise in apparent or latent conditions, but these conditions must lead to a sharp (resonant) increase in the severity of context, i.e. CP(t).

If we take into account the consequences of the PET VIS, it can be said that they are similar to those of an Error of Commission (EOC). However, such consequences can occur not only after actions but also when the objective image of other symptoms is disturbed, i.e. EOCs are a subset of the set of VISs.

For example, some VIS ≠ EOC during the Fukushima Daiichi NPP #1 accident are the following: VIS-E: SBO (loss of AC power); VIS-T: loss of DC power; VIS-F: containment failure. These VIS are not EOCs, because the violated symptoms are E, T and F, but they dramatically increased the accident severity and CP(t).

For example, some VIS = EOC during the Chernobyl NPP #4 accident are the following: VIS-A: the emergency cooling system had been turned off since the previous shift; VIS-A: the scram was disabled when two turbo-generators were turned off; VIS-A: several manually operated rods were pulled out to increase power, leaving fewer than 8 of the 15 required rods in the reactor core; VIS-A: the operational reserve of reactivity reached a value requiring reactor shutdown, but the shutdown was not carried out in time.


4.3 PET HFE taxonomy 


The concepts, approaches and extended definitions used by the PET method are summarized in the HFE taxonomy shown in Figure 1.


4.4 Recursion of context and recognition 


Context probability (CP) is determined by the 
existing deviations of symptom images in human 
cognition and team communication processes. 

Matching the object in situation is approxima- 
tion of a subjective image to the objective one by 
comparison of certain symptoms (signals, symbols 
and signs) that could be identified and described 
as an image of the object in the situation. These 
symptoms are Event, Parameter, Action, Resource, 
Function, Transition, Goal and VIS. 

Symptom recognition and cognitive context are 
iterative & recursive functions. In order to calculate 
CP, the durations of recognition for any symptom 
(as CFC) & cognitive disregard durations of VIS 
are needed. At the next step of iteration of cogni- 
tive process, we may use the new duration of symp- 
tom recognition based on previous calculation. 

Duration of symptom recognition could be 
based on measurement or expert judgment about 
the type of the recognized symptom (skill-based, 
rule-based, and knowledge-based). The times for 
completion of cognitive process for ‘skill-based’, 
‘rule-based’ & ‘knowledge-based’ symptoms would 
be in correlation 1:5:30 (Petkov, 2017). 
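As a minimal sketch of how the 1:5:30 ratio could be used when only one duration has been measured, the following Python snippet scales a skill-based recognition time to the other two symptom types. The function name and the 2-second baseline are hypothetical and serve only as an illustration; the ratio itself is the one quoted above.

```python
# Ratio of completion times for skill-, rule- and knowledge-based symptoms,
# as quoted in the text (Petkov, 2017).
RATIO = {"skill": 1, "rule": 5, "knowledge": 30}


def recognition_durations(skill_based_seconds: float) -> dict:
    """Scale a measured skill-based duration to all three symptom types."""
    return {kind: skill_based_seconds * factor for kind, factor in RATIO.items()}


# Hypothetical baseline: 2 s to recognize a skill-based symptom.
durations = recognition_durations(2.0)
print(durations)  # {'skill': 2.0, 'rule': 10.0, 'knowledge': 60.0}
```

Measured durations, where available, would replace these scaled values at the next iteration.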

A symptom (CFC), after its recognition, could 
be kept or removed from the operator’s context 
model and his memory depending on the situa- 
tion. Some unimportant symptoms could be dis- 
regarded after their recognition by analogy with 
hypothesis that Rodin (1987) labeled ‘cognitive 
disregard’. 


Figure 1. Human action context and HFE taxonomy. 


The PET procedure for evaluation of context, 
cognition, communication and decision-making 
probabilities consists of eight steps (Petkov, 2018). 


5 PET DIAGRAMS OF HRA PROCESS 
IN PSA 


5.1 PET scenario’s timeline description, 
symptoms and violations of symptom’s image 
qualification 


The PET scenario’s timeline description of symptoms and violations of image of symptom qualification is presented in Figure 2. The following 
abbreviations are used: SAMG (Severe Accident 
Management Guideline); EOP (Emergency Oper- 
ating Procedure); OED (Operational Experience 
Data); PIE (Postulated Initiating Event); TH 
(Thermo-Hydraulic Simulation); CS (Computer 
Simulator); FSS (Full Scope Simulator); normP, 
abnormP (Normal/Abnormal Procedure); OP/MP 
(Operational/Maintenance Procedure); PSA; DSA; 
HRA; VIS; S (Symptom/Stimulus): P, R, E, T, G, 
HA and E 


5.2 PET screening, holistic and atomistic 
quantitative assessment and integration 
in PSA 


The PET screening, holistic and atomistic quantification and integration in PSA are presented in Figure 3. The following abbreviations are used: 
CP; EP (Error Probability); HEP; HA (Human 
Action/Task); HFE; RI (Risk Importance). 


5.3 Holistic dynamic profiles of CPs and EPs 


The PET diagram for obtaining holistic dynamic profiles of context and error probabilities is presented in Figure 4. The following abbreviations are used: CP; EP (Error Probability); HEP; HA (Human Action or Task); HFE; a (appearing); r (recognition); d (disregard); ry (recovery); o (objective); S (subjective); vo (violated o). 

Figure 2. PET scenario’s timeline description, symptoms and violations of symptom’s image qualification. 

Figure 3. PET screening, holistic and atomistic quantitative assessment and integration in PSA. 

Figure 4. PET diagram for obtaining holistic dynamic profiles of context and error probabilities. 

Figure 5. PET diagram for obtaining atomistic potential reduction and root causes. 


5.4 Atomistic error potential reduction and causes 


The PET diagram for obtaining atomistic potential reduction and root causes is presented in Figure 5. 


6 HRA ADVANCED PRACTICE BY 
PET METHOD 


HFEs are unexpected events in an STS that lead to unwanted outcomes. The HEP therefore changes over time before, during and after any HFE, and the severity of the STS context should be treated as a dynamic variable of error-producing potential. In nuclear accident conditions this dynamic and holistic measure could be defined as the context probability, which equals 1 for the most severe context. HRA methods try to predict a HEP in the “prevailing” context, that is, in a statistically average context of an average crew performance, but such a “prevailing” context could exist only for a short time interval during the accident. A static (anchor) value of the HEP is calculated from a judged average crew context for the identified task. The HEP is then adjusted by multiplying it by guessed PSF values, taking into account the variability of all system components. The fuzzy logic of each HRA technique, tabulated and justified by its database, is used to introduce PSFs into the HEP variation. Usually, the cited database is the ‘know-how’ of the HRA method. It is verified and validated by the owner and the national regulator concerned. However, the structures and parameters of the models and the data obtained are not accessible to other users who might wish to check them and repeat the data-mining. The benchmarking of HRA methods is based on results for similar identified tasks, but not on models and experimental data. 

The implicit, static and pseudo-holistic determination of context, based on an anchor HEP and fuzzy PSF values judged by experts, makes HRA methods superficial and insensitive to the STS models (structures and parameters), HFE symptoms and causes, and human performance processes. 

The main reason for this insensitivity is the lack of models and data for the holographic-like behavior of human interactions in a complex situation and multifactorial context. These models are replaced by expert judgments and the multiplication of concurrent PSFs considered for a specific task. This subjective way of evaluating the HEP does not allow a systematic and multi-layered study of human and STS performance. Practical PET applications for retrospective and prospective HRA and accident analyses are shown in (Petkov, 2017). 


7 CONTRIBUTIONS TO EXISTING HRA 


The PET method allows dynamic evaluation of the 
individual CP and communication context prob- 
ability (CCP) over time as gauge parameters. 

Based on the dynamic individual CP and CCP, the HEP and its constituents (cognitive, executive, recovery and decision-making by a group of people or organizations) can be evaluated. 

The average HEP that is usually evaluated by other HRA methods is easily calculated as a mean over the time interval during which the human action is most likely to be performed. 
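A minimal Python sketch of such a time average is given here, assuming the dynamic HEP is available as samples at known times and varies linearly between them (trapezoidal rule). The function name and the sample profile are hypothetical and do not come from a PET calculation.

```python
def time_averaged_hep(times, heps):
    """Mean of a piecewise-linear HEP(t) profile over [times[0], times[-1]]."""
    if len(times) != len(heps) or len(times) < 2:
        raise ValueError("need at least two matching time/HEP samples")
    area = 0.0
    for i in range(len(times) - 1):
        # Trapezoid between two consecutive samples.
        area += 0.5 * (heps[i] + heps[i + 1]) * (times[i + 1] - times[i])
    return area / (times[-1] - times[0])


# Hypothetical dynamic HEP profile sampled every 60 s.
times = [0.0, 60.0, 120.0, 180.0]
heps = [0.01, 0.20, 0.05, 0.01]
average = time_averaged_hep(times, heps)
print(round(average, 4))  # 0.0867
```

The single averaged number hides the peak of 0.20 in this illustrative profile, which is the information a dynamic evaluation retains.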

The PET method provides an opportunity to 
generate HRA database based on simulations 
using thermo-hydraulic codes, computer-based, 
engineering or full-scope simulators. 

It also allows the individual and group performance of the main control room operators to be evaluated and optimized during real events or training. 

Going in depth and moving from a macroscopic to a microscopic context evaluation with the PET method would make it possible not only to scan and monitor the human error potential, but also to assess the importance of each symptom and of the PSFs influencing its recognition. 


8 CONCLUSIONS 


The PET, as a HRA method, applies a realistic 
procedure for dynamic symptom-based context 
evaluation of cognition and communication, and 
context-sensitive digraph models of cognition, 
communication and decision-making. 
The intermediate use of symptoms for dynamic context evaluation gives better opportunities for the systematic identification and the qualitative and quantitative interpretation of time-dependent HFEs during the accident, and for improving emergency response planning and/or severe accident management. 

The PSA modeller is responsible for appropri- 
ate determination of the initiating event progres- 
sion and needed actions based on accident reports, 
thermal-hydraulic simulations or full-scope simula- 
tor training exercises. 

The detailed and qualitative data of accident reports are the best source for the validation and verification of HRA methods. 

The use of data from thermal-hydraulic calculations and full-scope simulators is a valuable option for optimizing the time needed to implement emergency measures and actions to ensure safety during the DBA and DEC/BDBA, based on joint deterministic and probabilistic criteria. 

The qualitative description and analysis of the accident scenario should be improved in order to improve the quantitative assessment of the context and the HFE probability. If important elements of the description are missed, distortions occur in the evaluation. The HRA modeller is responsible for the correct definition of symptoms and violated images of symptoms, and for determining the durations of their manifestation, recognition and disregard, and of action and recovery implementation. It is preferable to measure them, but a guess or judgment could be acceptable as a first approximation. 

The correct distribution of roles in modelling, the limitation of expert guesses and the possibility of experimental verification and validation suggest that PET could be applied with a higher degree of confidence. 

Insufficient exchange of information between 
designers, DSA, PSA and technologists may lead 
to violated context and increased risk. 

The PET is a prospective emerging HRA method for dynamic, context-based, retrospective and prospective analysis and data-mining that could provide PSA studies with HEPs for any specific action/task based on state-of-the-art simulations in order to avoid expert guesses. 

ABSTRACT: Decommissioning of offshore facilities involves changing risk profiles at different decommissioning phases. Bayesian Belief Networks (BBNs) are used as part of the proposed risk assessment method to capture the multiple interactions of a decommissioning activity. The Bayesian Belief Network is structured from the data learning of an accident database and a modification of the BBN nodes to incorporate human factors and barrier performance modelling. The analysis covers one case study of one area of decommissioning operations by extrapolating well workover data to well plugging and abandonment. Initial analysis of the well workover data with a 5-node BBN provided insights on two different levels of accident severity, the “Accident” and “Incident” levels, and on their respective profiles of initiating events and investigation-reported human causes. The initial results demonstrate that the data learnt from the database can be used to structure the BBN and give insights on how human factors pertaining to well activities can be modelled, and that the relative frequencies can act as initial data input for the proposed nodes. It is also proposed that the integrated treatment of various sources of information (database and expert judgement) through a BBN model can support the risk assessment of a dynamic situation such as offshore decommissioning. 


1 INTRODUCTION 

Decommissioning of offshore facilities takes place in different phases, which can include the warm suspension phase, the cold suspension phase and the removal phase. Examples of warm suspension phase activities are pipeline decontamination and sectional removal, or well plugging and abandonment. Well plugging and abandonment involves multiple simultaneous tasks and the assessment of location-specific risks such as the reservoir profile and the presence of gas deposits known from its drilling history. Location-specific historical information can be used to model risks more specifically. A Bayesian Belief Network (BBN) is defined as an acyclic graphical network that is capable of representing qualitative and quantitative relationships between the factors of interest defined in it. The nature of the relationships can be defined by whether there exists a (i) causal, (ii) functional or (iii) statistical relationship. The BBN model structure can be derived from experts’ judgement, statistically, or through a combination of both. The proposed model in this paper is structured from the data learning of an accident database, and the BBN nodes are modified to incorporate human factors and barrier performance modelling. The analysis considers one case study of well workover data extrapolated to well plugging and abandonment activities. The information fed into the BBN consists of relative frequencies learnt from the database, with expert elicitation where there are data gaps. The desired output of the model is a probabilistic risk profile of the potential consequences of a particular decommissioning activity, one that has considered and can trace different sources of information (generic historical data, location-specific information and expert judgement) instead of only the historical data used in most existing methods. 


2 PROPOSED STATISTICAL MODEL FOR 
STRUCTURE LEARNING 


2.1 Data types 


The most comprehensive offshore safety database is the World Offshore Accident Database (WOAD) (DNV GL, 2017), for which the Norwegian company DNV GL has been collecting data since 1976. The WOAD database provides principal information such as accident causes and the chain of consecutive events (e.g. a dropped object resulting in fire and oil spill), the location of the accident facility, the year of the accident, etc. The WOAD database has also been built by merging information from existing databases such as the Offshore Blowout Database (SINTEF, Norway), the MAIB accident database (UK Marine Accident Investigation Branch) and the COIN/ORION database (UK HSE, Offshore Safety Division). 

The data in WOAD are of a discrete, ‘count’ nature: the record is updated each time an accident is reported. Some factors also have ordered states, such as the severity of events, where an “Accident” is more severe than a “Near Miss”. 


2.2 Statistical model—generalised linear 
models 


The proposed model should be able to accept updated information and to present the risk picture according to the most recent snapshot. The Beta and Dirichlet density functions allow the quantification of prior data (belief) and its updating with new information (beliefs). For analyses where only binary variables are involved, such as a Fault Tree/Event Tree where there are only ‘True’ or ‘False’ statuses, or the location of vessels such as ‘In the port’ or ‘At sea’, the Beta density function can be used. The beta density function of f with parameters a, b, N = a + b, where a and b are real numbers > 0, is (Neapolitan, 2004, p. 306): 

$$\rho(f) = \frac{\Gamma(N)}{\Gamma(a)\Gamma(b)} f^{a-1}(1-f)^{b-1} = \mathrm{beta}(f; a, b), \quad 0 \le f \le 1 \qquad (1)$$


With updated information s and t (the observed counts of the two outcomes), where M = s + t, and considering d as the dataset, the updated beta density function given the dataset d can be written as (Neapolitan, 2004, p. 306): 

$$\rho(f \mid d) = \frac{f^{s}(1-f)^{t}\,\rho(f)}{E\left(F^{s}(1-F)^{t}\right)} = \frac{\Gamma(N+M)}{\Gamma(a+s)\Gamma(b+t)} f^{a+s-1}(1-f)^{b+t-1} \qquad (2)$$

For analysis of multinomial variables, the Dirichlet density function can be used to provide equal counts for all statuses to be studied. For example, the severity of a consequence can be ‘Near miss’, ‘Injury/Accident’ or ‘Fatality’. The Beta density function is, in fact, a special case of the more generic Dirichlet density function. Similarly, considering d as the dataset, the probability of having the dataset d, assuming all statuses are Dirichlet-distributed, is shown below (Neapolitan, 2004, pp. 306, 386): 

$$P(d) = \frac{\Gamma(N)}{\Gamma(N+M)} \prod_{k} \frac{\Gamma(a_k + s_k)}{\Gamma(a_k)} \qquad (3)$$

Offshore safety data is based mainly on response and explanatory variables of non-metric, discrete form, such as operation sections on a platform (Drilling and/or Production area) or initiating events of accidents (Falling Load, Leak, etc.). A common response variable is the consequence status, which differs in level of severity; in this case the order of the statuses is important, with a ‘Fatality’ being more severe than a ‘Near Miss’. A statistical model suitable for offshore safety data must be able to handle multivariate data in non-metric format, and hence classical regression models are unsuitable. The proposed statistical model able to handle ordered, categorical data is a log-linear model, which falls under the classification of a Generalised Linear Model (GLM) (Agresti, 2002, p. 125). A GLM extends ordinary regression models to encompass non-normal response distributions and modelling functions of the mean. It has three components (similar to classical regression models): a random component (response variable), a systematic component (explanatory variables) and a link function that transforms the mean to the natural parameter. The categorical data is arranged in table form with its frequency information, and log-linear modelling involves fitting models to the cell counts in the cross-tabulation of categorical variables to derive the association and interaction patterns among the variables. 

For a table of response variable y and explanatory variable x (or two categorical responses) with rows i and columns j, the cell probabilities are $\pi_{ij}$ and the expected frequencies are $\mu_{ij} = n\pi_{ij}$. Under statistical independence, the log-linear model is: 

$$\log \mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y \qquad (4)$$

where $\lambda_i^X$ refers to the row effect and $\lambda_j^Y$ refers to the column effect. 

Assuming there are interaction effects, a saturated model with statistically dependent variables would be (Agresti, 2002, p. 316): 

$$\log \mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY} \qquad (5)$$

where $\lambda_{ij}^{XY}$ refers to deviations from independence. 
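To make the additive structure of the independence model concrete, the following Python sketch computes expected frequencies for a two-way table from its margins and decomposes their logarithms into an overall term, row effects and column effects. The sum-to-zero constraint on the effects and the 2 x 2 counts are assumptions made for illustration only.

```python
from math import log


def independence_fit(table):
    """Expected frequencies (row total * column total / n) under independence."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]


def loglinear_terms(mu):
    """Overall, row and column terms of log(mu), with effects summing to zero."""
    logs = [[log(v) for v in row] for row in mu]
    rows, cols = len(logs), len(logs[0])
    lam = sum(sum(r) for r in logs) / (rows * cols)
    lam_x = [sum(r) / cols - lam for r in logs]
    lam_y = [sum(c) / rows - lam for c in zip(*logs)]
    return lam, lam_x, lam_y


# Hypothetical 2 x 2 table of observed counts.
observed = [[10, 20], [30, 40]]
mu = independence_fit(observed)
lam, lam_x, lam_y = loglinear_terms(mu)
print(mu)  # [[12.0, 18.0], [28.0, 42.0]]
```

For the fitted independence table, the overall term plus the row and column effects reproduces each log expected frequency exactly; an interaction term is needed only when the observed table departs from this structure.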

When extrapolating the assessment from a two 
factor to a three factor contingency table, the fol- 
lowing summarised log-linear models (see Table 1) 
illustrate the combination of potential interactions. 

The modelling process begins with a saturated 
model, i.e. with the highest order interaction 
between the variables, before systematically and 
sequentially removing a higher-order interaction 
term so that model complexity is reduced without 
any significant loss in accuracy. The removal stops 
at a point where any removal leads to a poor fit of 
the data. The threshold of removal is considered 
by comparing the test model against the saturated 
model based on the difference of its deviances. 


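Table 1. Loglinear models for three-dimensional tables. 

| Loglinear model | Symbol |
|---|---|
| $\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z$ | (X, Y, Z) |
| $\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY}$ | (XY, Z) |
| $\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{jk}^{YZ}$ | (XY, YZ) |
| $\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{jk}^{YZ} + \lambda_{ik}^{XZ}$ | (XY, YZ, XZ) |
| $\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{jk}^{YZ} + \lambda_{ik}^{XZ} + \lambda_{ijk}^{XYZ}$ | (XYZ) |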

Deviance is the likelihood-ratio statistic for testing the null hypothesis that the simplified model holds against the saturated model. For a particular GLM for observations $y = (y_1, \ldots, y_N)$, let $L(\mu; y)$ denote the log-likelihood function expressed in terms of the means $\mu = (\mu_1, \ldots, \mu_N)$, and let $L(\hat{\mu}; y)$ denote the maximum of the log-likelihood for the model. Let $L(y; y)$ denote the maximum of the log-likelihood for the saturated model, which provides a baseline for comparison with other model fits. In the saturated model $\hat{\mu} = y$, and the deviance, D, of a Poisson GLM is defined to be (Agresti, 2002, p. 119): 

$$D = -2\log\left[\frac{\text{max likelihood for model}}{\text{max likelihood for saturated model}}\right] = -2\left[L(\hat{\mu}; y) - L(y; y)\right] \qquad (6)$$

Consider two models, $M_0$ with fitted values $\hat{\mu}_0$ and a saturated model $M_1$ with fitted values $\hat{\mu}_1$. Since $M_0$ considers fewer interactions than $M_1$, a smaller set of parameter values satisfies $M_0$ than $M_1$. Maximizing the log likelihood over a smaller space cannot yield a larger maximum, thus $L(\hat{\mu}_0; y) \le L(\hat{\mu}_1; y)$. Continuing from equation (6), and assuming that model $M_1$ holds, the likelihood-ratio test of the hypothesis that $M_0$ holds uses the test statistic (Agresti, 2002, p. 141): 

$$D(y; \hat{\mu}_0) - D(y; \hat{\mu}_1) = -2\left[L(\hat{\mu}_0; y) - L(y; y)\right] - 2\left[L(y; y) - L(\hat{\mu}_1; y)\right] = -2\left[L(\hat{\mu}_0; y) - L(\hat{\mu}_1; y)\right] \qquad (7)$$

Under regularity conditions, this difference of the models’ deviances has approximately a chi-squared (χ²) null distribution with degrees of freedom equal to the difference between the numbers of parameters in the two models being compared (Agresti, 2002, p. 142). 
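The deviance comparison can be sketched in Python for a two-way table, comparing the independence model with the saturated model. The Poisson deviance formula used here, the hypothetical counts and the fitted values are illustrative assumptions; the critical value 3.841 is the standard 5% point of the chi-squared distribution with one degree of freedom.

```python
from math import log


def poisson_deviance(observed, fitted):
    """Poisson deviance 2 * sum[y * log(y / mu) - (y - mu)] over all cells."""
    total = 0.0
    for y, mu in zip(observed, fitted):
        if y > 0:
            total += y * log(y / mu)
        total -= y - mu
    return 2.0 * total


# Hypothetical 2 x 2 table, flattened row by row.
observed = [10, 20, 30, 40]
fitted_independence = [12.0, 18.0, 28.0, 42.0]  # row total * column total / n
fitted_saturated = [float(y) for y in observed]  # saturated model: fitted = observed

d0 = poisson_deviance(observed, fitted_independence)
d1 = poisson_deviance(observed, fitted_saturated)
statistic = d0 - d1
df = 1  # one interaction parameter removed in a 2 x 2 table
CRITICAL_5_PERCENT_DF1 = 3.841

print(round(statistic, 3), statistic > CRITICAL_5_PERCENT_DF1)  # 0.804 False
```

In this illustrative table the statistic stays below the critical value, so removing the interaction term would not give a significantly worse fit.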

Before determining the factors with the loglinear model, it is proposed to use a two-variable dependency test based on Pearson’s χ² test in order to shortlist the factors to be considered in a three-variable analysis. Pearson’s χ² test compares the frequencies observed in certain categories with the frequencies that might be expected in the same categories by chance, with degrees of freedom given by (number of rows − 1) multiplied by (number of columns − 1) (Field, Miles, & Field, 2012, pp. 802-803). 
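A minimal Python sketch of the Pearson χ² statistic for a two-way frequency table follows. The expected frequencies are computed from the margins, and the counts in the example are hypothetical rather than taken from the accident database.

```python
def pearson_chi_square(table):
    """Return (chi-squared statistic, degrees of freedom) for a two-way table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


# Hypothetical 2 x 2 frequency table for two factors.
chi2, df = pearson_chi_square([[10, 20], [30, 40]])
print(round(chi2, 3), df)  # 0.794 1
```

The statistic and the degrees of freedom together determine the P-value used to decide whether two factors are treated as dependent.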

Similar to regression, the residual is simply the error between what the model predicts (the expected frequency) and the data actually observed (the observed frequency), standardised by dividing by the square root of the expected frequency: 

$$\text{standardised residual} = \frac{\text{observed}_{ij} - \text{model}_{ij}}{\sqrt{\text{model}_{ij}}} \qquad (8)$$

By determining the statistical dependencies between factors, the nodes of the BBN are established, from which the base structure of a BBN can be defined. 
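The standardised residual of each cell can be computed directly from the observed and expected frequencies, as in the following Python sketch. The observed counts and the expected frequencies are hypothetical values used only to illustrate the calculation.

```python
from math import sqrt


def standardised_residuals(observed, expected):
    """Cell-wise (observed - expected) / sqrt(expected) for two-way tables."""
    return [
        [(o - e) / sqrt(e) for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(observed, expected)
    ]


# Hypothetical observed counts and their expected frequencies under independence.
observed = [[10, 20], [30, 40]]
expected = [[12.0, 18.0], [28.0, 42.0]]
residuals = standardised_residuals(observed, expected)
print([[round(r, 3) for r in row] for row in residuals])
# [[-0.577, 0.471], [0.378, -0.309]]
```

The squares of these residuals sum to the Pearson χ² statistic of the table, so the cells with the largest absolute residuals are the ones contributing most to any detected dependency.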


3 CASE STUDY—WELL WORKOVER TO 
WELL P&A 


3.1 Well workover 


Well workover data from the UK HSE database refers to operations in which a well is re-entered for any purpose, for example maintenance-related activities such as replacing retrievable downhole safety valves, malfunctioning electrical submersible pumps or worn-out tubing. Before any well workover, the well must be killed. This requires removing the well head, flowline and packers and lifting the tubing hanger from the casing head, before placing a column of heavy fluid in the well bore to prevent the flow of reservoir fluids without the need for pressure control equipment at the surface (which has been removed as part of the well kill process). Such an operation usually requires a drilling rig. 

While this data does not strictly cover the well plugging and abandonment work required for a decommissioning process, well plugging and abandonment includes removing casing and other downhole equipment, which suggests similarities with wireline operations. The risks of well killing are also comparable to those of a well plugging and abandonment procedure: in the latter, cement is injected to plug the well, while in a well kill heavy fluid is injected into the well. 

The factors in the WOAD database relating to well workover activities are: (i) Schedule quarter, (ii) Human causes, (iii) Initiating event, (iv) Evacuation and (v) Severity of accident. 

These five factors, described in Table 2, will be analysed to understand their dependency relationships. 


3.2 Results of well workover data analysis from 
two-variable and three-variable dependency 
analysis 


Dependency analysis between two variables is performed for all possible combinations in order to generate an initial list of dependency relationships. The analysis is based on Pearson’s χ² test, which was elaborated in Section 2.2. From the chi-squared value and the corresponding number of degrees of freedom, the P-value can be obtained. The confidence level then plays an important role in the significance test of the hypothesis of independence. The most commonly used confidence level is 95%, i.e. a significance level of 0.05. For this data set (see Table 3), the confidence level is set at 95%: if the P-value obtained from the dependency analysis is less than 0.05, this implies a dependency between the two factors being investigated. 

Table 2. Table of factors to be analysed in the Bayesian belief networks (DNV GL, 2017). 

| Factors | Description |
|---|---|
| Schedule Quarter | Each work schedule lasts 12 hours and is divided into quarters of 3 hours each. |
| Human Causes | In investigation reports that have concluded the cause of the incident to be human related, four major causes have been identified based on the incident reports: (i) unsafe act carried out without any procedures, (ii) procedures deemed to be unsafe, (iii) improper design and (iv) unsafe act carried out against procedures. |
| Initiating Event | The initiating events were pre-defined in the WOAD database and are identified as follows: (i) falling load or dropped object, (ii) release of fluid or gas, (iii) well problem without blow out, (iv) blow out, (v) out of position, (vi) fire. |
| Evacuation | Indicates whether the emergency procedure of leaving the facility is (i) not required, (ii) required and successful or (iii) required and not successful. |
| Severity of Accident | There are four levels of severity, classified as follows. Accident (A): has recorded fatalities and severe injuries. Incident (I): refers to a low degree of damage where repairs/replacements are required, or to events causing minor injuries to personnel or health injuries. Near-Miss (N): refers to events that might have developed into an accidental situation; no damage and no repairs required. Unsignificant (U): refers to events with minor consequences; no damage, no repairs required. Other inclusions are small spills of crude oil and chemicals, and very minor personnel injuries, i.e. “lost time incidents”. |


Table 3. Snapshot of ‘cleaned’ data to be processed in the software R for the dependency analysis between 5 factors: (i) Schedule quarter (ii) Human causes (iii) Initiating event (iv) Evacuation and (v) Severity of accident (not all factors are shown below). 

| No. | Schedule_qtr | Accident_Category | Human_Cause |
|---|---|---|---|
| 1 | 3 | Incident | Unsafe act/No Procedure |
| 2 | 4 | Near_miss | Improper design |
| 99 | 5 | Incident | Unsafe Procedure |
| 100 | 2 | Incident | Unsafe Procedure |


For the five factors, two-way frequency tables were extracted from the data source for each of the ten pairwise combinations, and Pearson’s χ² test was run in R to calculate the P-value. Table 4 lists only the results where the P-value was found to be less than 0.05, indicating a “dependent” relationship. The remaining (independent) results have been omitted since they do not affect how the BBN can be structured. The dependent pairs of variables thus provide the background for the three-variable independency analyses.
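The pairwise step can be illustrated with a short sketch that enumerates the ten factor pairs and builds a two-way frequency table for each. The column names follow Table 3 where shown there; the remaining names and all record values are hypothetical and serve only as an illustration, not as the study's data or code.

```python
from collections import Counter
from itertools import combinations

# Factor names: three follow Table 3, the other two are assumed labels.
FACTORS = ["Schedule_qtr", "Human_Cause", "Initiating_Event",
           "Evacuation", "Accident_Category"]

# Hypothetical records in the style of Table 3 (illustration only).
records = [
    {"Schedule_qtr": 3, "Human_Cause": "Unsafe act/No Procedure",
     "Initiating_Event": "Falling load", "Evacuation": "Not required",
     "Accident_Category": "Incident"},
    {"Schedule_qtr": 4, "Human_Cause": "Improper design",
     "Initiating_Event": "Release of fluid or gas", "Evacuation": "Not required",
     "Accident_Category": "Near_miss"},
    {"Schedule_qtr": 2, "Human_Cause": "Unsafe Procedure",
     "Initiating_Event": "Fire", "Evacuation": "Required and successful",
     "Accident_Category": "Accident"},
    {"Schedule_qtr": 2, "Human_Cause": "Unsafe Procedure",
     "Initiating_Event": "Falling load", "Evacuation": "Not required",
     "Accident_Category": "Incident"},
]


def crosstab(rows, factor_a, factor_b):
    """Count co-occurrences of the states of two factors.

    Returns (states of factor_a, states of factor_b, table of counts).
    """
    counts = Counter((r[factor_a], r[factor_b]) for r in rows)
    a_states = sorted({r[factor_a] for r in rows})
    b_states = sorted({r[factor_b] for r in rows})
    table = [[counts[(a, b)] for b in b_states] for a in a_states]
    return a_states, b_states, table


pairs = list(combinations(FACTORS, 2))  # 5 choose 2 = 10 pairs
tables = {pair: crosstab(records, *pair) for pair in pairs}

for pair in pairs:
    a_states, b_states, table = tables[pair]
    print(pair, f"{len(a_states)} x {len(b_states)} table")
```

Each resulting table would then be passed to a chi-squared test of independence, and only the pairs with a P-value below 0.05 would be retained.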

The modelling of a three-variable dependency begins with a saturated model for (A, B, C), which consists of the individual effects of each factor alone, the second-order interactions AB, BC and AC, and the highest-order interaction ABC. Backward elimination is initiated from the saturated model by removing the highest-order interaction. The model with the highest-order interaction removed is compared with the saturated model by considering the differences in the ‘likelihood’, the ‘degrees of freedom’ and the ‘P-value’. The ‘P-value’ indicates the difference between the models: a value < 0.001 represents a highly significant ‘P-value difference’, meaning that removing the higher-order interaction has made the fit to the data significantly worse (Field et al., 2012). In other words, the higher-order interaction is then a significant factor in ensuring a good fit of the model.

If the removal of the three-way interaction shows a ‘P-value delta’ greater than 0.001, the reduced combinations of two-way interactions are then compared against the next-highest model, i.e. the model containing all the two-way interactions.

Based on the results from the two-way depend- 
ency analysis, the following three-way dependency 
analyses have been grouped (see Table 5). 

From the analysis in R (R Core Team, 2017), summarised in Table 6, it can be observed that the model considering the interactions (Human Causes: Initiating Event + Initiating Event: Evacuation)
Table 4. Dependency relationships between two variables.

Combinations                           χ² value   DoF*   P-value
Accident Severity, Initiating Event    47.98      21     0.000692
Accident Severity, Evacuation          18.748     6      0.00461
Initiating Event, Human Causes         41.47      21     0.004903
Initiating Event, Evacuation           42.131     14     0.0001178
Initiating Event, Schedule Quarter     71.101     28     0.00001304
Human Causes, Evacuation               14.332     6      0.02614

*DoF = Degrees of Freedom.


Table 5. Proposed three-way dependency analysis.

Results from two-way dependency analysis:
• Initiating Event - Human Causes
• Initiating Event - Evacuation
• Initiating Event - Schedule Quarter
• Accident Severity - Initiating Event
• Accident Severity - Evacuation

Proposed three-way dependency analysis:
1. Initiating Event, Human Causes, Evacuation
2. Initiating Event, Human Causes, Schedule Quarter
3. Initiating Event, Evacuation, Schedule Quarter
4. Human Causes, Evacuation, Schedule Quarter
5. Initiating Events, Number of Chains of Events, Nature of Events


Table 6. Summary table of conditional independency analysis.

Model                                                      DoF   Likelihood Ratio   Deviance (diff)   Diff in DoF   P-Value (diff)
Human_Cause (H) * Initiating_Event (IE) * Evacuation (E)   0
H:IE + H:E + IE:E                                          42    0.332754           0.332754          42            1
H:IE + H:E                                                 56    24.68428           24.35152          14            0.0415
H:IE + IE:E                                                48    1.549424           1.216669          6             0.9760
H:E + IE:E                                                 63    21.71695           21.38419          21            0.4357
H + IE + E                                                 83    67.52035           67.18759          41            0.0060


has a P-value delta of 0.9760, which is much higher than the P-value delta of the remaining models. Dropping the interaction (Human Causes: Evacuation) still produces a good-enough fit for the model, based on the P-value delta of 0.9760 compared to the saturated model (Human Causes: Initiating Event + Human Causes: Evacuation + Initiating Event: Evacuation). The model (Human Causes: Initiating Event + Initiating Event: Evacuation) thus suggests a conditionally independent relationship between Human Causes and Evacuation given the occurrence of the Initiating Event, which can be noted as I(Human Causes, Evacuation | Initiating Event).
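The comparison of nested models can be sketched numerically. The sketch below assumes that the ‘P-Value (diff)’ column of Table 6 is the upper-tail chi-squared probability of the deviance difference on the difference in degrees of freedom, which is the usual likelihood-ratio comparison of nested log-linear models; the helper function is an illustrative stand-in for the corresponding R routine, not the study's own code. Under this assumption, the rows with an even difference in degrees of freedom reproduce the published values, e.g. 0.0415 and 0.9760.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability P(X >= x) for a chi-squared variable with `dof` degrees of freedom."""
    if x <= 0:
        return 1.0
    a, h = dof / 2.0, x / 2.0
    # Series expansion of the regularised lower incomplete gamma function P(a, h).
    term = 1.0 / a
    total = term
    ap = a
    for _ in range(10_000):
        ap += 1.0
        term *= h / ap
        total += term
        if abs(term) < abs(total) * 1e-15:
            break
    lower = total * math.exp(-h + a * math.log(h) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


# (model, deviance difference, difference in DoF) as listed in Table 6.
comparisons = [
    ("H:IE + H:E + IE:E", 0.332754, 42),
    ("H:IE + H:E", 24.35152, 14),
    ("H:IE + IE:E", 1.216669, 6),
    ("H:E + IE:E", 21.38419, 21),
    ("H + IE + E", 67.18759, 41),
]

p_values = {model: chi2_sf(dev, ddof) for model, dev, ddof in comparisons}

for model, p in p_values.items():
    print(f"{model:<20} P-value (diff) = {p:.4f}")
```

A large P-value for a reduced model indicates that the dropped interaction did not significantly worsen the fit, which is the basis for reading H:IE + IE:E as conditional independence of Human Causes and Evacuation given the Initiating Event.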

Similar operations have been performed for the remaining three-way combinations, numbered 2 to 5 in Table 5. As an example, the second three-way combination in Table 5 (Initiating Event, IE; Human Causes, H; Schedule Quarter, S) is investigated in the same manner. The saturated model H*IE*S consists of the terms (H + IE + S + H:IE + H:S + IE:S + H:IE:S). The performance of the (H:IE + H:S + IE:S) model, with the higher-order interaction term H:IE:S removed, is compared first, followed by progressively simpler models down to the simplest independent model, H + IE + S. The best-fitting model, not necessarily the most complex, is then chosen based on the P-value differences, in a process similar to that documented in Table 6 for the three-way dependency analysis of Human Causes, Initiating Event and Evacuation. Having investigated the relationships between the factors and created the nodes, the next step is to put together the node network (see Figure 1) and its corresponding joint probability distribution through Bayesian Belief Networks (BBNs).

In order to analyse the data in a BBN, the data 
obtained from the database needs to be trans- 
formed into probabilities and/or conditional 


Figure 1. Snapshot of proposed BBN structure using GeNIe (BayesFusion, 2018).


probabilities. For any node, the probabilities can be represented by the Dirichlet density function, which can be viewed as the ratio of the count of one state to the total count over all states of that event. If an event has a parent event, the conditional probability distribution should be used. The resulting BBN can then represent the relative frequencies of the occurrence of events, as in Figure 2.
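The conversion from records to conditional probabilities can be sketched as follows. With `alpha = 0` the function returns plain relative frequencies; with `alpha > 0` it returns the posterior mean under a symmetric Dirichlet prior, which is one common reading of the Dirichlet representation mentioned above. The function name, the `alpha` parameter and the records are assumptions for illustration and do not come from the paper or from GeNIe.

```python
from collections import Counter, defaultdict


def conditional_probability_table(records, child, parent, alpha=0.0):
    """Estimate P(child | parent) from a list of record dictionaries.

    alpha is a hypothetical pseudo-count added to every child state
    (symmetric Dirichlet prior); alpha = 0 gives relative frequencies.
    """
    child_states = sorted({r[child] for r in records})
    counts = defaultdict(Counter)
    for r in records:
        counts[r[parent]][r[child]] += 1
    cpt = {}
    for parent_state, state_counts in counts.items():
        total = sum(state_counts.values()) + alpha * len(child_states)
        cpt[parent_state] = {
            s: (state_counts[s] + alpha) / total for s in child_states
        }
    return cpt


# Hypothetical records (illustration only, not WOAD data).
records = [
    {"Initiating_Event": "Fire", "Evacuation": "Required_successful"},
    {"Initiating_Event": "Fire", "Evacuation": "Required_successful"},
    {"Initiating_Event": "Fire", "Evacuation": "Required_successful"},
    {"Initiating_Event": "Fire", "Evacuation": "Not_required"},
    {"Initiating_Event": "Blow_out", "Evacuation": "Required_successful"},
]

frequencies = conditional_probability_table(records, "Evacuation", "Initiating_Event")
smoothed = conditional_probability_table(records, "Evacuation", "Initiating_Event", alpha=1.0)
print(frequencies["Fire"])
print(smoothed["Blow_out"])
```

The pseudo-count avoids zero probabilities for parent states with very few records, such as the single hypothetical "Blow_out" record here.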


4 EXPERT ELICITATION TO BBN 
STRUCTURE AND CONDITIONAL 
PROBABILITY DATA 


4.1 Risk profile categorised by severity 
of accidents 


The BBN also allows for a sensitivity analysis to identify the types of initiating events associated with the ‘Incident’ level of severity (see Figure 3). This shows that “Falling Load/Dropped Objects” make up the majority of the initiating events, with “Release of fluids/gas” the next most common, and that “Well problems leading to no blow outs” is the least common initiating event. It is also interesting to
Figure 2. Snapshot of BBN with distribution of respective factors based on the data from WOAD.

Figure 3. Risk profile when “Incident” in the Severity node is reflected as 100% (observed to have happened).
identify that unsafe acts stemming from a lack of procedures are the most commonly documented human cause in the investigation reports.

In terms of the “Accident” level of severity (see Figure 4), “Fire” makes up the majority of the initiating events. The next most common is a “Blow-out”, followed by “Out of position” and lastly “Release of fluid or gas”. It is also interesting to note that unsafe procedures are the most commonly documented human cause in the investigation reports. The documented human causes at the different levels of severity reflect a trend at the procedural level: at the “Incident” level, most of the time no procedures were in place, while at the “Accident” level, procedures were in place but needed to be improved. In both cases, human action against procedures makes up a small percentage.
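Setting a severity state to 100% and reading off the updated distribution of initiating events amounts to applying Bayes' rule. The sketch below shows this for a two-node fragment (Initiating Event → Severity). All probabilities are hypothetical and the fragment is a simplification of the full network; it is not the GeNIe model used in the paper.

```python
def posterior_given_evidence(prior, likelihood, observed):
    """Return P(cause | effect = observed) using Bayes' rule.

    prior:      dict mapping cause state to P(cause)
    likelihood: dict mapping cause state to dict of P(effect | cause)
    """
    joint = {state: prior[state] * likelihood[state][observed] for state in prior}
    evidence = sum(joint.values())
    return {state: value / evidence for state, value in joint.items()}


# Hypothetical prior over initiating events (illustration only).
prior = {"Falling_load": 0.5, "Fire": 0.2, "Blow_out": 0.3}

# Hypothetical P(Severity | Initiating Event).
likelihood = {
    "Falling_load": {"Incident": 0.80, "Accident": 0.20},
    "Fire": {"Incident": 0.25, "Accident": 0.75},
    "Blow_out": {"Incident": 0.50, "Accident": 0.50},
}

profile_accident = posterior_given_evidence(prior, likelihood, "Accident")
profile_incident = posterior_given_evidence(prior, likelihood, "Incident")
print(profile_accident)
print(profile_incident)
```

Comparing the two posterior profiles shows how the dominant initiating event can differ between severity levels even though the prior is the same.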


4.2 Combining human factors, location specific 
risks and equipment reliability in the model 


The Petro-HRA guidelines published in 2017 (Taylor, 2017) adopted the SPAR-H methodology for assessing human error probabilities (HEPs) in the petroleum industry, based on a nominal HEP that is adjusted by performance shaping factors. The diversity of activities in a petroleum facility makes it difficult to list the performance shaping factors, or to define what constitutes a “missing/misleading” human-machine interface design, especially since decommissioning activities cover a wide range, from well P&A to structural removal. Strand & Lundteigen (2016) further defined the crucial or “good” factors in a human-machine interface for drilling operations, such as the ability to interpret wellbore flowrates, in situ pressures along the wellbores, etc., with each factor bearing a different weight, indicating that some factors have a greater effect than

Figure 4. Risk profile when “Accident” in the Severity node is reflected as 100% (observed to have happened).
others. This ranking and its weight ratios, as proposed by Strand & Lundteigen (2016), also reflect expert judgement as a source of information. Expert judgement is the second mechanism proposed in this paper, the first being to obtain information from the database. It is proposed that the relationships learnt and the relative frequencies obtained from the database can be used to supplement the HRA model by Strand & Lundteigen (2016).
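The idea of a nominal HEP adjusted by performance shaping factors can be illustrated with a minimal sketch. It assumes a purely multiplicative form, HEP = nominal HEP × product of PSF multipliers, capped at 1. The nominal value, the PSF names and the multipliers below are hypothetical and are not taken from Petro-HRA, SPAR-H or Strand & Lundteigen (2016).

```python
import math


def human_error_probability(nominal_hep, psf_multipliers):
    """Adjust a nominal HEP by PSF multipliers (assumed multiplicative form).

    A multiplier above 1 degrades performance, below 1 improves it,
    and exactly 1 is nominal. The result is capped at 1.0.
    """
    adjusted = nominal_hep * math.prod(psf_multipliers.values())
    return min(1.0, adjusted)


# Hypothetical nominal HEP and PSF multipliers (illustration only).
nominal = 0.01
psfs = {"Procedures_PSF": 5.0, "HMI_PSF": 2.0, "Training_PSF": 1.0}

hep = human_error_probability(nominal, psfs)
print(f"Adjusted HEP: {hep:.3f}")

# Very poor PSFs would push the product above 1, so the cap applies.
capped = human_error_probability(nominal, {"Procedures_PSF": 50.0, "HMI_PSF": 10.0})
print(f"Capped HEP: {capped:.1f}")
```

In the proposed BBN, the relative frequencies learnt from the database could inform how often each PSF state, such as procedures being unavailable or incomplete, is expected to occur.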

The proposed BBN (Figure 5) consists of several parts. The nodes in light blue contain information extracted from the data learning part described in Section 3.2. The “Severity of Accident” node has been expanded to include more consequences, including the state of safe operations, which is not originally in the accident database as it only records accidents. The green node, “Kick Frequency during drilling operations”, also represents learning from information, though not from an accident database but from the records of the operating history of the facility. The motivation behind such an information node is to incorporate location-specific risks. A plugging and abandonment activity will also be susceptible to kicks from the well, often due to the reservoir profile of the well; in this case, location-specific historical data would help to assess the risks more dynamically.

The relative frequencies for the “Human Cause” node have been removed and replaced with a “Procedures_PSF” node, in which PSF stands for Performance Shaping Factor. The data learning part indicated that the relative frequencies for “Not_Available” and “Incomplete” would be 0.52 and 0.37, and these could serve as the initial conditional probabilities.

The next part of the BBN structure is indicated in yellow and comprises the safety barriers related to the identified and shortlisted Initiating Event (a well kick), namely “Kick Detection” and the action to “Kill Well”. This information can be obtained from failure probability data in reliability databases, as in the approach of Bhandari et al. (Bhandari, Abbassi, Garaniya, & Khan, 2015), where the “Kick Detection” failure probability was 0.01 and Well Kill operations had a failure probability of 0.25.

The third part of the BBN structure is indicated in pink and refers to human factors that may influence how well a kick can be detected, or how successfully the killing of a well can be conducted. The human factors were identified as specific to a well drilling operation by Strand & Lundteigen (2016), such as how the presence of certain features would set the Human Machine Interface (HMI) PSF to nominal or good. Strand & Lundteigen based their human factor framework on the BORA-Release methodology (Sklet, Vinnem, & Aven, 2006), in which weights are assigned to each item contributing to a PSF. The weights in the nodes leading to the HMI PSF (the BORA-Release methodology) are derived from expert judgement. With expert judgement as input for the pink nodes, and arguably for the data gaps in the orange output node of interest, the risk could be computed from a combination of historical data (with an updating mechanism each time the database is updated) and expert judgement, the latter applied in the human factors part of the model or used to fill data gaps in other parts of the model.


4.3 Expert judgment and elicitation 
of information in nodes of BBN 


A literature review conducted by Mkrtchyan et al. (Mkrtchyan, Podofillini, & Dang, 2016) identified eleven methods for assessing conditional probability information. The methods vary in whether the probability elicitation is direct or indirect, and in whether they allow aggregation of multiple experts.

No expert judgment and elicitation have yet been carried out on the proposed BBN model. However, the method chosen should be one that is used to model human reliability, as one of the nodes is a human factor node. As pointed out by Mkrtchyan et al. (Mkrtchyan, Podofillini, & Dang, 2015), the methodology for building the conditional probability information needs to meet two criteria of allowing combination effects, such as

Figure 5. Proposed BBN with structure/information learned from database, and expert judgement input.
error-forcing effects, where the two factors under consideration work in the same direction, or, at the opposite end, compensation effects, where a poor factor is combined with a good factor.


5 CONCLUSION 


From the initial modelling of the BBNs derived from accident data, the results show potential in using industry accident data to obtain an overall risk profile of the industry. The model provided insights into the “Accident” and “Incident” levels of severity, and the corresponding profiles of the initiating events and the human causes reported in investigations. This forms the base structure for adapting the initial BBN model to also include barrier performance, such as the performance of kick detection, and potentially human factors analysis for work with high operator involvement, such as well plugging and abandonment and heavy lifting. This analysis also meets one aspect of the QRA framework requirements by demonstrating how the performance of barriers changes with different conditions. The model indicates the sources of historical data and expert judgement, and is able to keep track of how the data used in the model affect the results. Existing risk assessment methods are often based on generic failure statistics alone, which may not reflect the existing risk picture of a location. A model combining generic failure statistics, location-specific historical data and expert judgement of a particular work task could provide a more up-to-date risk picture.


6 FUTURE WORK 


The scope for future work is to expand the model to include the effects of expert judgment, a sensitivity analysis of how robust the model is, and information learning from other databases or information sources, and subsequently to expand the analysis to other decommissioning activities.

ABSTRACT: The two urban avalanches in Longyearbyen, Svalbard, in 2015 and 2017 were unexpected and fast developing, with trying conditions and consequences both for the people buried by the snow and for the people coming to their rescue. The fatality rate increases rapidly as the minutes pass when people are buried by snow. Thus, the successful rescue of people buried by an avalanche depends on a prompt response. The aim of this paper is to discuss the role of the local population in urban avalanche search and rescue. Empirical data was collected from 8 months of fieldwork in Longyearbyen, including interviews with representatives of public and private organisations and the local population, as well as literature surveys on the population’s involvement in acute crises, reports and newspaper articles. People are often present when crises strike. Many of them have the knowledge, training and equipment to be a valuable resource in avalanche search and rescue.


1 INTRODUCTION 

On 19 December 2015, an avalanche hit urban areas in Longyearbyen, a small town situated on the Norwegian archipelago of Svalbard in the Arctic. The avalanche resulted in two fatalities and destroyed eleven houses below the mountain Sukkertoppen. Survivors and neighbours immediately commenced search and rescue activities. Over 150 unorganized volunteers stood shoulder to shoulder with the professional emergency responders in a joint effort to search for fellow citizens buried by the snow. Seven people were excavated and rescued (Indreiten and Svarstad 2016). Another urban avalanche hit Longyearbyen on 21 February 2017, in almost the same area. This time there were no fatalities. However, the snow masses caused major damage to an apartment building. The people who lived there assisted each other in the evacuation. Once again, a large number of unorganized volunteers stood in line, ready to come to the rescue of friends and neighbours possibly buried by the snow. The local police, in close cooperation with the local Red Cross Society, quickly established that no one was buried by the snow. Thus, further assistance from the unorganized volunteers was not required this time (Tengesdal 2017).

In this paper we aim to discuss the role of the population in urban avalanche search and rescue. We use the term «population» synonymously with the term «unorganized volunteers». This includes the efforts of survivors and of people who happen to be present in or near the avalanche area. Organized volunteers, such as members of the Red Cross Society, are not included in this group of volunteers.

We will start with a brief introduction to some 
important avalanche characteristics. Then we 
will present the conceptual frameworks, with an 
emphasis on crisis characteristics, crisis manage- 
ment and the population’s behaviour and capacities 
in crises. After a brief methodological presentation 
we will present our empirical findings from the two 
urban avalanches. The findings will be discussed 
in relation to the conceptual frameworks before 
we conclude with some remarks on the popula- 
tion’s contribution in search and rescue in urban 
avalanches. 


2 AN INTRODUCTION TO AVALANCHES 


From a historical perspective, the risk of avalanches was mainly a problem for people in settlements and along roads and railways located in avalanche-prone terrain. This picture started to change in the 1970s, as a result of a growing interest in outdoor activities in steep terrain. Nowadays, the majority of avalanche fatalities are related to winter recreation activities (McClung and Schaerer 2006). However, avalanches in countries like Norway, Iceland, USA, Switzerland, Austria and France clearly demonstrate that avalanches may also represent a major risk for people, buildings and infrastructure in urban areas.

Getting caught in an avalanche is associated with great danger. The probability of survival is approximately 90 per cent if a victim completely buried by the snow is excavated within 15 minutes, unless the victim has already died from trauma (killed by debris in the snow). Between 15 and 45 minutes, the chances of survival decrease rapidly due to oxygen deficiency. After 45 minutes, some 75 per cent of those excavated die, mostly due to hypothermia. The level of moisture in the snow and the victim’s clothing are both crucial factors for survival after 90 minutes. A victim excavated at this point in time may die due to the reflow syndrome, a fatal condition caused by hypothermia and cold blood flowing from the extremities to the heart (Landrø 2007).

It is quite common to distinguish between two main types of avalanches: loose snow avalanches and slab avalanches. A loose snow avalanche is a chain reaction initiated at a specific point, which expands both in width and depth as it continues downwards (Landrø 2007). A loose snow avalanche starts when a small layer of surface snow loosens, whereas a slab avalanche starts when a thin, weak layer of snow at a deeper level fails. A slab avalanche is further characterized by a large and coherent mass of snow that loosens and splits into blocks as it moves downwards (McClung and Schaerer 2006). This type of avalanche reaches high speed in a short period of time, and its width can reach several hundred meters within seconds (Landrø 2007). A loose snow avalanche has a pear-shaped pattern, while a slab avalanche usually has a rectangular shape (McClung and Schaerer 2006). Ninety-nine per cent of avalanche accidents are caused by slab avalanches (Landrø 2007). The 2015 and 2017 urban avalanches in Longyearbyen were slab avalanches.

As already discussed, the chances of survival after 15 minutes are relatively poor if a victim is completely buried by the snow in an avalanche. Therefore, the best avalanche rescue strategy is to avoid getting caught in an avalanche in the first place (Fredston and Fesler 1986). In most situations, successful avalanche rescue depends on a prompt response from people nearby (Kruke 2016). Statistically, professional emergency responders arrive in the avalanche area too late to save lives. The mobilisation and deployment of professional emergency responders might take hours. Consequently, people moving in avalanche-prone terrain should have the basic knowledge and training both to avoid getting caught in an avalanche and to perform buddy search and rescue if needed (Brattlien 2017, Landrø 2007, McClung and Schaerer 2006). Basic avalanche equipment, such as avalanche beacons, probes and shovels, is crucial to perform a speedy and reliable search and rescue operation (Brattlien 2017).


3 CONCEPTUAL FRAMEWORKS 


After this brief introduction to some central ava- 
lanche characteristics, we now turn to the con- 
ceptual frameworks, with an emphasis on crisis 
characteristics, crisis management and the popula- 
tion’s behaviour and capacities in crises. 


3.1 Crisis characteristics


Rosenthal and colleagues define a crisis as «a serious threat to the basic structures or the fundamental values and norms of a system, which under time pressure and highly uncertain circumstances necessitates making critical decisions» (1989: 10). This definition highlights crisis characteristics such as threat, uncertainty, time pressure and the need for critical decision-making. In addition to these features, we need to look further into some defining characteristics of crises. ’t Hart and Boin (2001) and Gundel (2005) present two different approaches to the classification of crises. ’t Hart and Boin (2001) classify crises based on their speed of development and termination patterns, while Gundel (2005) differentiates between crises based on the degree to which they are predictable and influenceable. It is fair to assume that unexpected and fast developing crises will be particularly difficult for professional responders to manage in the initial phase. Thus, the population present will often be left to manage the initial acute phase of these crises themselves.

It is quite common to distinguish between different phases of a crisis. A number of researchers


Figure 1. Crisis phases as a circular process (Kruke 2015).

Table 1. Some fundamental differences between the military model and the problem-solving model (Dynes 1993).

The military model                        The problem-solving model

The initial response in the incident      The initial response in the incident
area is characterized by chaos            area is characterized by a degree of
                                          continuity

Command systems are needed to             The response must be coordinated to
re-establish control                      reach an overall degree of cooperation
                                          between the actors present

Existing social structures have           Existing social structures perform
limited capacities                        crisis management

Centralized approach                      Decentralized approach


have suggested ways of doing this, e.g. Turner (1976), Kruke (2015) and Olson (2000). The model in Figure 1 outlines three basic phases, which are usually inherent in most phase descriptions in the crisis literature (Kruke 2015).

We will put emphasis on the acute phase of the 
crisis in this paper. This is the phase where lives 
might be at stake and responders come to the aid of 
people in acute need. This phase, when the popula- 
tion is conducting the response, prior to the arrival 
of the professional emergency responders, is also 
termed the golden hour (Helsloot and Ruitenberg, 
2004). It is important to notice that the response, 
or the acute crisis management, and to what extent 
it succeeds, has a strong connection with the pre- 
vention and preparedness activities in the pre-crisis 
phase (Engen, Kruke et al. 2016; Kruke 2015). 


3.2 Crisis management 


Dynes (1993) argues that the dominant approach to crisis planning and management is based on a centralized command and control model, derived from military analogies. He claims that this model is based on inadequate assumptions about social behaviour in crises (ibidem). The military model is built on the notion that crises create social chaos. Hence, command and control must be established by professional emergency responders to eliminate such chaos (Dynes 1993). In this perspective, it is assumed that much of the civilian population becomes passive, helpless and incapable of making reasonable decisions in crises (Dynes 1994a). Dynes (1993) suggests a decentralized problem-solving model as an alternative approach, derived from research on social behaviour in crises. The problem-solving model suggests that the focus should be on continuity rather than chaos, coordination rather than command, and cooperation rather than control. The model takes the stand that civilians are fully capable of making reasonable decisions in a crisis, and that they will not be stunned, passive or irresponsible (Dynes 1994a). The model assumes that a crisis represents a set of problems that also have to be dealt with by the resources already present in the local community (Dynes 1993). Problem-solving capacities are inherent in all individuals and social structures. Thus, such capacities must be utilized effectively when a crisis strikes (Dynes 1994a).

The two models are based on fundamentally different views of the population’s behaviour and capacities in crises.


3.3 The population's behaviour and capacities


Panic and helplessness are two persistent myths about people’s behaviour in acute crises. What is panic? The tabloid newspapers, and the public discourse, often label the reactions of victims as panic. Many researchers have contributed to our understanding of the degree to which we panic in crisis situations, for example Fritz and Marks (1954), Quarantelli (1954), Helsloot and Ruitenberg (2004), Ripley (2008) and Kruke (2015). However, there is no complete consensus about what panic really is. The term is used to describe many human reactions in stressful situations, such as excessive alertness and/or uncontrolled fear that leads to unwise action or loss of judgment. A common use of the term is that panic is a very excited emotional state of acute fear that results in uncontrolled escape (Quarantelli 1953). Panic can also be characterized as a form of irrational behaviour (Fritz and Marks 1954, Quarantelli 1999). There are some conditions that may lead to panic behaviour (Fritz and Marks 1954, Perry and Lindell 2003):


— Understanding of an immediate and serious 
threat 
— Flight options are limited and disappearing. 


Even though there are examples of panic in acute and life-threatening situations, there are also examples of such situations where panic does not occur (Quarantelli 1954, Dynes 1993, Helsloot and Ruitenberg 2004, Ripley 2008, Kruke 2015). This research concludes that citizens provide a lot of assistance, even in situations where they put their own lives at stake, trying to rescue fellow citizens.

Some people become helpless in acute crises, mainly due to shock leading to paralysis and passivity (Helsloot and Ruitenberg 2004, Kruke 2015) or a disaster or crisis syndrome (Perry and Lindell 1978, Tierney, Lindell et al. 2001), described as a dull state of being. Passivity may also result in a collective disclaiming of responsibility, called the «bystander effect» (spectator effect) (Darley and Latané 1968), where no one takes responsibility immediately after an event because everyone thinks someone else will do it.

Another strand of research is related to altruism. It takes the stand that helplessness and panic behaviour are myths, and that most people both want to contribute and do contribute to saving lives in the midst of a crisis. Altruism may be understood as a moral commitment people have to serve others and to put the interests of others before their own (Comte 1973), as a form of solidarity during crises (Tierney, Lindell et al. 2001). Altruism can be manifested as an individual, collective or situational commitment (Dynes 1994b). Society’s collective and institutional resources may in some crises be inadequate for a prompt and reliable response (ibidem). To cope with these unexpected, acute and uncertain circumstances, individual altruism may result in responses supplementing traditional response structures (ibidem).


4 METHODS 


This paper mainly draws on empirical findings 
from 8 months of fieldwork in Longyearbyen, 
from mid-September 2016 to end of May 2017. 
The primary data collection methods were semi- 
structured interviews, field conversations, observa- 
tions and participant observations. Semi-structured 
interviews were conducted with the local popula- 
tion and response actors in Longyearbyen, includ- 
ing: The Police, Longyearbyen Red Cross Society, 
Longyearbyen Hospital, Lufttransport AS¹, Longyearbyen
Fire and Rescue Agency, and the Longyearbyen
Community Council. Additionally, 40
field conversations were conducted with the local 
population. Empirical data was further collected 
from three public meetings in conjunction with 
urban avalanches in Longyearbyen. Furthermore,
one of the authors took part in the local population’s
response to the urban avalanche on 21 February
2017. Empirical data was also collected from
document analysis, primarily public documents, 
newspaper articles, and avalanche literature and 
statistics. 


5 EMPIRICAL FINDINGS 
5.1 The urban avalanche in 2015 and the local
population's reactions in the acute phase


A slab avalanche hit urban areas in Longyearbyen
on Saturday 19 December 2015 at 10.23 am (DSB
2016). According to an informant from the Red
Cross Society, survivors and neighbours immediately
commenced search and rescue activities.
Professional emergency responders were also in
the avalanche area within minutes. Shortly after
the avalanche, a private citizen posted a message
on a local community Facebook group to encourage
people to pick up their shovels and meet in the
avalanche area. The Longyearbyen Community
Council posted a similar message on their public
Facebook page. Within a short period of time, a large
number of unorganized volunteers lined up to participate
in the excavation efforts. There are many
examples of this initial response.

¹Lufttransport AS is a private company operating the
Governor of Svalbard’s two search and rescue helicopters.

Five people sat in their kitchen eating breakfast 
when the avalanche hit their house. All of them 
were buried by the snow. One of them managed 
to get out of the snow masses by himself. He then 
managed to locate and excavate his wife and his 
eight-week-old daughter. He continued to search
for the two missing persons, digging in the snow
masses with a wok pan while wearing a sweater,
sweatpants and socks. This was one of the first
sights that met the police officers upon their
arrival in the avalanche area.

Another informant lived close by the houses that 
were hit by the avalanche. He noticed that the lights 
indoors and outdoors started to flicker. Suddenly, 
everything went dark. He immediately understood 
what had happened when his wife shouted that 
several houses had collapsed. He grabbed his head- 
light and shovel, and ran out. Meanwhile, his wife 
called the police and went from house to house to 
alert the neighbours. She also brought the children 
in the area to safety. The informant came home 
with two survivors and provided them with warm 
clothes, blankets and food. As soon as medical 
personnel arrived, he went out again to continue 
search and rescue efforts. 

Yet another informant was on his way to work
when he noticed blue lights and sirens near the
town centre. He met some people outside the shopping
mall who told him about the avalanche, and
that they were on their way to offer their assistance. 
He instantly joined them in the excavation efforts. 

None of the informants from the professional 
emergency responders observed people affected 
by panic or shock. However, a small number of 
the informants noticed a few passive spectators. A 
doctor working at the hospital told about a man 
who stood nearby when the avalanche hit. He had 
some trouble coping with the situation and seemed 
to be rather confused. One of the informants from 
the local population described an acquaintance 
who allegedly was struck by panic. As she stood 
on the stairs outside her home smoking a cigarette 
a house started to move and hit the house next to 
her. According to the informant, the woman then
panicked and ran away, and was taken care of by
others later on.

A number of informants told about people who 
felt helpless as a consequence of not being able to 
participate in the search and rescue operations in 
the acute phase of the crisis. One informant said 
that he suffered for months because of not being 
involved in the response. He had no choice but 
to stay at work when the avalanche hit. Another 
informant told about some friends unable to get 
out of their house, due to the snow masses. It
made them feel extremely indignant, as they had 
a strong desire to help their fellow citizens in the 
avalanche area. 


5.2 The urban avalanche in 2017 


In the morning hours of Tuesday 21 February 2017,
the Longyearbyen Community Council informed
the public on their website about an increased avalanche
hazard. The message gave reassurance that the
settlement was not considered to be at risk. A couple
of hours later, a slab avalanche hit the settlement
and destroyed an apartment building (Tengesdal
2017). The people at home managed to assist each
other in evacuating through the windows.
Since this avalanche also hit near the town centre,
the professional emergency responders were
not far away. The police quickly confirmed that
none of the people living in the area were missing
or seriously injured. They still decided to maintain
the search, in case other citizens had been in
the area (Malmo 2017).


5.3 The professional emergency responders’
perspectives on the local population’s
contribution


All of the informants interviewed characterized 
the urban avalanche in 2015 as a totally unexpected 
event. However, a couple discussed the bad weather 
conditions the previous night and the resulting 
huge snowdrifts on Sukkertoppen, just uphill from 
their house. While looking up the hill through the 
window, they agreed that this could not possibly 
be good. Soon after, the avalanche hit their house 
(Palm 2016). Both of them survived the avalanche. 

The informants described the avalanche area 
as rather chaotic. Many referred to houses that 
were moved and destroyed. However, the inform- 
ants underlined that the avalanche area was quiet 
and that the level of stress was low. A professional 
rescue worker from Lufttransport AS said: «What 
you see is chaos, but there is no panic». All of the 
informants expressed that the local population was 
a crucial resource in the search and rescue opera- 
tion. An informant from the Red Cross Society 
said that «the help we got from the unorganized
volunteers made it possible to quickly locate and
excavate the victims». A police officer further 
explained why the local population was an impor- 
tant resource in this response operation: «Time is 
of the essence when it comes to avalanches. The 
chances of survival decrease rapidly. The impor- 
tant thing in this situation was to gather a lot of 
people who could dig». 

An informant from the Red Cross Society 
explained that intelligence is of the utmost importance
to be able to locate victims, both in backcountry
and urban avalanches. In line with this,
Genswein and Harvey (2002) point out that intel- 
ligence in a backcountry avalanche is gathered by 
looking for tracks into the avalanche area, an initial 
primary search for people or their belongings, and 
a more thorough systematic search in the avalanche 
area. In an urban avalanche, the snow masses 
are so polluted with debris that the conventional 
search strategies are no longer effective. Instead, 
intelligence is gathered by talking to survivors and 
neighbours. The professional rescue workers from 
Lufttransport AS underlined that they were totally 
dependent on the local population during the initial 
search and rescue phase. They especially pointed 
to the population’s efforts in intelligence gathering 
and excavation of people buried by the snow. One 
of them claimed that «without the information 
and knowledge provided by the local population, 
it would have been like searching for the needle in 
the famous haystack». The local population also 
delivered different kinds of rescue equipment to 
the rescuers in the avalanche area, such as shovels, 
probes, jerven bags², chainsaws etc. In addition,
people offered warm clothes, food and drinks to 
the rescue workers. None of this assistance was 
ever requested, according to an informant from the 
Longyearbyen Fire and Rescue Agency. 

One of the informants from the Red Cross Soci- 
ety said that the professional emergency respond- 
ers in Longyearbyen do not have the capacity to 
manage a crisis of this scale on their own. Several 
informants also mentioned that rapid assistance 
from mainland Norway in the acute phase of a fast 
developing crisis is not likely, due to the geographi- 
cal distance. Tough weather conditions may delay 
assistance from outside even longer. This makes the 
local population a crucial resource in a crisis like the 
urban avalanche in 2015. Almost all of the informants
reported that the local population has proved
to be a valuable resource in countless backcountry
avalanches near Longyearbyen. People often
come to the rescue on their snowmobiles, loaded 
with avalanche equipment. Depending on the situation,
they initiate search and rescue operations
themselves, or offer their assistance to the professional
responders.

²A jerven bag is a protective poncho and tarpaulin.

The vast majority of the informants from the 
professional emergency organisations expect the 
local population to be a resource in avalanche 
search and rescue. They pointed out that many 
of the citizens have the knowledge, training and 
equipment to be a resource. They also pointed to 
the fact that most of the local population is young 
and employable, and that they typically have a 
great interest in outdoor activities. The children in 
one of the kindergartens actually conduct search 
training when they play with avalanche beacons 
searching for candy buried together with an ava- 
lanche transmitter. 

Some informants added that Longyearbyen is a
rather small town where people know their neighbours,
so everyone is affected when an urban avalanche
strikes the town. One of the rescue workers
from Lufttransport AS referred to the message on 
Facebook posted by a citizen (mentioned earlier), 
and added that «the local population know their 
role in an event like this. Most of us have avalanche 
beacons, probes and shovels. Meet up!» A doctor
working at the hospital said that «history has
shown that the local population makes a difference
when crises strike in Svalbard. It is a resource
we can count on».

The authorities did not request any assistance 
from the local population after the avalanche in 
2017. Nevertheless, a large number of unorganized 
volunteers, men and women, youths and adults, 
stood in line near the avalanche area with avalanche
beacons, probes and shovels, ready to assist
if needed. Everyone was calm and focused,
and there was no sign of irrational behaviour.
After a relatively short period of time, the police
told the people waiting to stand down, as no one
was missing and the risk of a second avalanche
was considered to be high (Tengesdal 2017).


6 DISCUSSION 


6.1 Unexpected and fast developing crises 


The urban avalanches took most citizens and 
authorities in Longyearbyen completely by sur- 
prise, even though some of the citizens were wor- 
ried due to the snow drifts uphill on Sukkertoppen 
in 2015. Some people were also worried in 2017,
but they were reassured by the Longyearbyen
Community Council that urban areas were safe.
However, it is important to note that the exact
location and time of an avalanche cannot be predicted
(Brattlien 2017). Once the avalanches were
triggered (by natural causes), it was only a matter of
seconds before they hit urban areas in Longyearbyen.
Thus, with reference to the crisis typologies
presented by ’t Hart and Boin (2001) and Gundel
(2005), it can be argued that the urban avalanches
in Longyearbyen in 2015 and 2017 were unexpected
and fast developing crises. It may also be
noted that the avalanche in 2015 has some legal
aftereffects influencing the termination of the
event³.


6.2 The local population's response 


There are some factors that point to the need for the 
local population’s response in avalanches. First of 
all, time is of the essence in such crises. The statis- 
tics tell us that the chances of survival following an 
avalanche decrease rapidly after just 15 minutes, if 
a victim is completely buried by the snow (Landrø 
2007). Consequently, the time pressure that
Rosenthal et al. (1989) stress in their crisis definition
is extreme. In general, it will often take some
time before the professional emergency responders 
arrive at the scene in an acute crisis (Kruke 2012, 
2015). Statistically, they arrive too late to save lives 
(McClung and Schaerer 2006, Landrø 2007). This 
means that the victims themselves, and the people 
who are randomly present, are the ones who need 
to take the first shift pending the arrival of the 
professional emergency responders, in the golden 
hour (Helsloot and Ruitenberg 2004). This golden 
hour is crucial when the difference between life and 
death is a race against time. As we have discussed, 
it was the victims and the people randomly present 
who initiated search and rescue in the urban ava- 
lanche in 2015. Also, in the urban avalanche in 
2017, the victims themselves helped each other to 
safety. In other words, it was the victims and the 
people randomly present who took the responsibil- 
ity to manage the initial acute phase of these crises. 
Therefore, they can be considered as first respond- 
ers (Kruke 2015). 

Another factor is related to the professional
emergency responders’ lack of capacity. The
golden hour in these crises does not last more than a
few minutes. The urban avalanche in 2015 clearly
demonstrated that the professional emergency 
responders in Longyearbyen lack the capacity to 
manage a crisis of this magnitude on their own. 
Therefore, altruistic behaviour is necessary in 
such situations as a supplement to the traditional 
response structures (Dynes 1994b). The profes- 
sional emergency responders were totally depend- 
ent on the continuity in operations (Dynes 1993) 
provided by the local population in the acute phase 
of the avalanches. The inclusion of the population
made it possible to quickly evacuate people, and to
locate and rescue people buried by the snow. The
local population has proved to be an important
resource in many search and rescue operations following
avalanches in Longyearbyen and surrounding
areas. They often initiate search and rescue
activities themselves, or offer their assistance to
the professional emergency responders. The local
authorities did not encourage the population to
participate in the response to the urban avalanche in
2017, as they had in 2015. Still, approximately 150 unorganized
volunteers met up next to the avalanche area
(Tengesdal 2017). The fact that they met up
unrequested strengthens the view that the local population
is an asset in avalanche search and rescue.

³The parents of a little girl who died in the avalanche may
go to court to get compensation due to accusations of
inadequate urban planning.


6.3 The local population's behaviour 


An informant from the local population told about 
a woman who allegedly panicked and ran away when 
a house crashed into the house next to her. Can 
this behaviour actually be characterized as panic? 
Quarantelli (1953) describes panic as an excessive 
alertness or fear that leads to an unwise action. 
Panic can further be regarded as a form of irrational 
behaviour (Fritz and Marks 1954, Quarantelli 
1999). In this specific situation, it is fair to assume 
that her decision to run away from houses in motion 
is a rather wise and rational modus operandi. She 
brought herself to safety. Generally, it is possible 
that panic occurs in acute crises, but we know that 
this form of irrational behaviour is rare (Helsloot 
and Ruitenberg 2004). The rest of the informants 
did not witness any examples of panic among the 
local population following the urban avalanches in 
2015 and 2017. The doctor’s story about the man 
who stood nearby when the avalanche hit in 2015 
is a possible example of shock. He had some dif- 
ficulties coping with the situation and seemed to 
be rather confused. It is not unlikely that he was 
paralysed (Helsloot and Ruitenberg 2004, Kruke 
2015) as a result of watching the enormous slab 
avalanche hitting the settlement. 

There are several examples of people who actu- 
ally felt helpless, as a consequence of not being 
able to participate in the search and rescue opera- 
tion in 2015. The fact that people felt helpless as a 
result of not being able to participate in the acute 
response, can be regarded as a considerable con- 
trast to the widespread assumption that we turn 
into helpless human beings when crises strike. 
This study clearly shows that the population has 
a strong desire to help others in need. This may be 
rooted in a moral commitment (Comte 1973) and 
a form of solidarity (Tierney, Lindell et al. 2001). 
The spectator effect (Darley and Latané 1968) was
minimal in the urban avalanches in 2015 and 2017.
The informants have very few examples of passive
spectators in the avalanche area in 2015. There
were also very few passive spectators observed in
2017 (Tengesdal 2017).


6.4 The military model versus the 
problem-solving model 


The responses to the 2015 avalanche indicate that 
many of the citizens in Longyearbyen possess 
both the knowledge and training to be a valuable 
resource in avalanche search and rescue. The empir- 
ical findings also show that the professional emer- 
gency responders are aware of the local population’s 
capacities, and that they actually count on the local 
population to be a resource in avalanche rescue
when needed. This is largely in line with the problem- 
solving model presented by Dynes (1993) stressing 
the need for effective utilization of existing problem- 
solving capacities in the community (Dynes 1994a). 
The population quickly initiated search and rescue 
operations following the avalanche in 2015, display- 
ing a sort of continuity (Dynes 1993) witnessed by 
the professional emergency responders upon their 
arrival in the avalanche area. The local authorities 
actively encouraged the population to participate 
in the excavation efforts in the urban avalanche in 
2015 and coordinated the joint efforts between the 
actors present. Clearly, the authorities and the professional
emergency responders did not expect social
chaos, which is an inherent assumption in the mili- 
tary model (Dynes 1993). Instead, they expected the 
population to be a resource, as the problem-solving 
model strongly suggests (ibidem). 

Instead of establishing command and control 
structures, the authorities and the professional 
emergency responders focused on coordina- 
tion and cooperation among public and private 
response structures in the avalanche area in 2015. 
It is fair to assume that the public-private coopera- 
tion was decisive for the outcome of the avalanche 
in 2015. Victims and neighbours immediately ini- 
tiated search and rescue, and a large number of 
unorganized volunteers came to assist the profes- 
sional responders. As a result, the victims were 
quickly located and excavated. 


6.5 Characteristics of the local population 


Can we expect the population to be a resource in 
urban avalanche search and rescue elsewhere, or 
is there something unique about Longyearbyen? 
Many social researchers take the stand that most 
citizens are rational actors who often provide 
lifesaving assistance in acute crises and demand- 
ing situations where lives are at stake (Quarantelli 
1954, Dynes 1993, Helsloot and Ruitenberg 2004, 
Ripley 2008, Kruke 2015). Therefore, it is fair to
assume that citizens elsewhere would also have
done their best to rescue their neighbours. Even
so, there seem to be some aspects that particularly 
make the population in Longyearbyen a resource 
in avalanche search and rescue. 

Firstly, the ones who come to the rescue possess 
the essential equipment that is needed to perform 
a speedy and reliable avalanche search and rescue 
operation. Secondly, many of them have both the 
knowledge and training to be a valuable resource. 
Thirdly, the population is young and employable. 
Fourthly, many citizens in Longyearbyen have a 
great interest in outdoor activities. Consequently, 
many of the citizens are also physically fit for an 
avalanche search and rescue operation. This represents
a set of characteristics that is presumably
quite distinctive of the population in Longyearbyen.
In addition, it is likely that the remote location
influences the population’s expectation that they
must help each other. Rescue resources from outside the
archipelago cannot be expected in the acute phase of an
avalanche. The citizens know that they are the ones 
who must provide the assistance. The sense of unity 
among citizens is strong, making the whole com- 
munity affected when an urban avalanche strikes. 
Although these characteristics of the population 
in Longyearbyen make them a valuable response 
resource in an unexpected and fast developing urban 
avalanche, it can hardly be argued that they apply
exclusively to the population in Longyearbyen.


7 CONCLUSIONS 


The aim of this paper has been to discuss the 
role of the local population in urban avalanche 
search and rescue. The cooperation between the 
professional emergency responders and the local 
population was decisive for the outcome of the 
unexpected and fast developing urban avalanche 
in 2015. The urban avalanches in 2015 and 2017 
show that the local population are the ones who 
are present when crises strike, and they perform 
search and rescue in the acute phase. The imme- 
diate response from survivors and neighbours, 
and the inclusion of unorganized volunteers that 
came to the rescue, made it possible to locate and 
rescue seven victims buried by the snow. This was
possible because of the altruistic behaviour of the local
population, because the citizens possessed the required
equipment, knowledge and training, and because the
local authorities and the professional emergency
responders understood the value of including the
population in the search and rescue operation. The
responses to the avalanches in 2015 and 2017 show 
that the widespread assumptions about panic and 
helpless behaviour are mostly myths. The local 
population are first responders and have a strong 
desire to help their fellow citizens in need. 

ABSTRACT: A paradigm shift is presently underway in the shipping industry, promising safer, greener
and more efficient ship traffic with unmanned, autonomous vessels. In this article, we will look at some
of these promises. The expressions “autonomous” and “unmanned” are often used interchangeably. We
will therefore start out by suggesting a taxonomy of automation and manning of these ships. We will then
go on to examine the promise of safety. A hypothesis of increased safety is often put forward, and
we know from various studies that the share of maritime accidents that involve what is called “human
error” ranges from some 70 to 90 percent. If we replace the human with automation, can we then reduce the
number of accidents? And is there a potential for new types of accidents to appear? Risk assessment will
be a valuable tool, but will only reach as far as the “known unknowns”.


1 INTRODUCTION 

The shipping industry is about to enter a new
epoch. The story started in the 1800s when mechanized
power was introduced and vessels moved
from propulsion by sail to propulsion by steam.
The next stage came in the early 1900s when the
diesel engine enabled more efficient and reliable
ship services, analogous to the introduction of
mass production on shore. In the 1970s,
computerized control of ships was introduced. Now we
are about to go a step further, where cyber-physical
systems and autonomy, as part of “Shipping 4.0”
(Rødseth 2017), will form a new centre of gravity.


1.1 The first autonomous ship accident 


We will start this article with a fictional illustration: It
was unusually warm for the end of October.
The water in the strait was completely calm and mirrored
the sky and the setting afternoon sun. In the
Vessel Traffic Service (VTS) tower under the bridge
the operator followed a lone kayak with his binoculars.
It seemed like the kayaker was a child and not
very proficient in his or her paddling, and the kayak
only slowly worked its way across the sound. The
timing for crossing was not the best, the operator
thought. He had an outbound oil tanker due in a
few minutes and the autonomous Yara shuttle was
to pass in the other direction soon after. The tanker
was already approaching from the far side of the
bridge, sounding her horn to let the kayaker know
she was approaching the 200 meter wide strait,
something that probably did not make the situation
better for the child in the kayak, the VTS operator
thought. From the other side the autonomous shuttle
was visible, inbound on a westerly course at
6 knots. He expected her to slow down any minute
as her sensors detected the kayak in the sound.

Suddenly two water scooters appeared from nowhere,
criss-crossing over the strait and around
the kayak at some thirty or forty knots. The VTS 
operator could hear the roar from their engines 
all the way into the VTS tower. The surplus water 
shot up like a fountain from the back of the 
scooters and their wakes brought the water into 
turmoil around the kayak. In his binoculars, the 
VTS operator saw the child in the kayak letting 
go of his paddle and waving his arms to signal the 
scooters. Suddenly the kayak flipped over and the 
boy disappeared into the water. The scooters shot 
off towards the far side and the operator could 
see the head of the boy reappear on the surface 
beside the overturned kayak. He was right in the 
way of the tanker. The operator quickly grabbed 
the VHF radio and called the tanker.


“Tarnfjord, Tarnfjord this is Brevik VTS on 
channel 16. Have you seen the overturned kayak 
ahead of you?” 

“Brevik VTS, this is Tarnfjord. Roger that. We
are slowing down and holding to port. We should
manage to avoid the kayak. But we cannot reverse.
And we will have a close call with Yara.”

“OK, Tarnfjord, thank you for that,” the VTS 
operator replied, and continued immediately to 
call the shuttle, “Yara remote control, Yara remote 
control, are you following what is happening in the 
Brevik strait?” 

He turned and looked at the shuttle and could see 
that she had not slowed down as he had expected. 
Both of the ships were now only a few hundred 
meters from the overturned kayak under the bridge. 

“Yara remote control, Yara remote control, this 
is Brevik VTS on channel 16. Please respond Yara.” 

He took up his binoculars and saw that the 
tanker was slowly turning. The shuttle was now 
only some 100 meters from the overturned kayak 
and the turning tanker and still showed no sign of 
slowing down. 

The radio crackled. “Brevik VTS, this is Yara. 
Did you call me? I had a coffee break.” 

“Thank you, Yara,” the operator quickly
replied. “Stop immediately; can’t you see the kayak
in front of you?”

“No, the sun is completely blinding both my
cameras and on the radar I only see the bridge,” the
remote operator answered, and then he shouted,
“What the hell is the tanker doing!”

We will not know how this incident ended, as it
is pure fiction and the Yara shuttle will not begin
operating in the Brevik strait in southern Norway until
2021 (she will be manned in 2019, remotely controlled
in 2020, before attempting to go autonomous in
2021). Nevertheless, the situation is plausible.
Kayaks, scooters and other leisure craft will
be close companions to autonomous ships in Scandinavian
waters in summertime. Cameras and radars
can be deceived, as was shown in the Tesla car accident
in 2017 (Lambert 2017; NTSB 2017). Bridges
may obscure radar detection of objects underneath.
Objects coming and going like the two scooters
may confuse the artificial intelligence of collision
avoidance systems, and LIDAR (Light Imaging,
Detection, And Ranging) is only useful at close
range, closer than the stopping distance. Finally, the
human backup may have gone for a cup of coffee.

The fictional incident above is, maybe unfairly,
attributed to the planned autonomous Yara
Birkeland container feeder (Kongsberg Maritime
2017). This unmanned, autonomous vessel, carrying
120 containers on a fully electric propulsion
system, will replace some 20 000 trucks carrying the
same number of containers on the road today.
There is an economic as well as environmental
gain to be made. Doing this autonomously and
unmanned will be a challenge. So let us start by
looking at that.


1.2 Ambiguity in definitions 


The concepts of unmanned and autonomous 
when used on ships are ambiguous. The ship 
bridge may be unmanned, perhaps in periods, 
but crew may still be on board, ready to take con- 
trol when needed. A ship can also be remotely 
controlled from a shore station via highly redun- 
dant and high capacity communication links. Is 
this ship unmanned or autonomous? A dynamic 
positioning (DP) system on a ship will automati- 
cally control the position and perhaps the heading 
of the ship, but most DP systems will rely on an 
operator to handle any errors, e.g. in sensors, that 
occur during the operation. Is the DP automatic 
or autonomous? 

Furthermore, to which ship functions do
“unmanned” or “autonomous” apply? Rødseth &
Tjora (2017) identify eight main functional groups,
including navigation, engine control,
cargo monitoring and onboard safety functions.
In the following text, we will refer to typical
bridge functions, but in a truly autonomous ship, 
all shipboard functions must be automated to 
some degree and the degree of autonomy may be 
different for each function. 

Finally, the degree of autonomy will be different 
during the ship’s voyage. Tighter supervision and 
perhaps continuous remote control will be necessary 
during berthing while a high degree of autonomy is 
normally desired during the deep-sea passage. 

This ambiguity is reflected in many existing definitions
of “autonomy levels”. Vagia et al. (2016)
examine 12 different “levels of autonomy”,
and even more have become available as autonomy
levels have been extended to ships (Rødseth
& Nordahl 2017). One reason for the numerous
definitions is that autonomy must be defined along
several axes and with a strong focus on the operational
profile at hand. The idea of autonomy is
very context dependent.


1.3 Three axes of autonomy


For ships, we propose to characterize autonomy 
along three axes (Rødseth & Nordahl 2017). 

One axis is the complexity of the intended
operation. Is the ship operating in sheltered or
open seas? What are the likely weather or visibility
impacts? How much other traffic is there? How complex
are the sailing routes in terms of shallows, turns
and obstacles? We propose to capture
the complexity in the operational design domain
(ODD) as explained in the next section.


Table 1. List of autonomous ship operation types.

                      Continuously manned   Unmanned bridge,      Unmanned bridge,
                      bridge                crew on board         no crew on board

Operator controlled   Direct control        Remote control        Remote control
Automatic             Automatic control     Automatic control     Automatic control
Partly autonomous     Partly autonomous     Partly autonomous     Partly autonomous
Constrained autonomy  -                     Constrained autonomy  Constrained autonomy
Full autonomy         -                     -                     Full autonomy


The second axis is the manning level. The ship can 
have a continuously manned bridge, but still have 
a high degree of autonomy in automated object 
detection and collision avoidance. One can foresee 
ships with enough autonomy to allow the crew to go 
to bed at night, when sailing in open waters and fair 
weather. Ships can also be remotely controlled, with 
hardly any “real” autonomy at all. On the other end 
of the axis, one may see ships with no crew and no 
remote monitoring at all: they are fully autono- 
mous. The manning level is dealt with in Table 1. 

The third axis is the operational autonomy, 
how the necessary operations to satisfy require- 
ments of the ODD are divided between human 
and machines. We propose to capture this aspect 
by dividing the Dynamic Navigation Tasks (DNT)
into two parts: One part that requires human inter- 
vention to be executed (Operator Exclusive DNT) 
and one that can be handled by the automation 
systems (Control System DNT). 


1.4 A proposed taxonomy 


To simplify the definition of autonomous and 
unmanned, we will start with a concept borrowed 
from the US car industry and its definition of ter- 
minology for autonomous cars (SAE 2016). This 
is called the “Operational Design Domain” (ODD),
which is the set of operational conditions that limit when
and where a specific autonomous car can be used.
The corresponding capabilities of the car and its
control systems are the “Dynamic Driving Task”
(DDT). The concept also includes the “DDT Fallback”,
which is the procedures and safety guards that
are built into the vehicle and control systems for 
handling situations when the ODD is exceeded. The 
DDT Fallback will bring the system to a “minimal 
risk condition” (SAE 2016). For a ship, we suggest 
renaming DDT to the “Dynamic Navigation Task” 
(DNT). 

Most autonomous or unmanned ships are 
expected to have a “backup” operator somewhere 
on board or on shore, so that situations that can- 
not be handled by automatic functions can be 
safely handed over to the operator. This can be 
illustrated by dividing the DNT into two regions: 


Figure 1. The operational design domain and dynamic
navigation task.


The “Operator Exclusive DNT” where the opera- 
tor is needed to resolve problems that the auto- 
mation cannot handle and the “Control System 
DNT” which represents the unassisted capabilities 
of the automatic systems. The complete concept is 
illustrated in Figure 1. 
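
The division between the ODD, the two DNT regions and the DNT Fallback can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a definition taken from (SAE 2016) or (Rødseth & Nordahl 2017): the two ODD limits (wind speed and visibility), the task names and the function `assign_handler` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Handler(Enum):
    CONTROL_SYSTEM = "Control System DNT"
    OPERATOR = "Operator Exclusive DNT"
    FALLBACK = "DNT Fallback"


@dataclass(frozen=True)
class Conditions:
    """Current external conditions (hypothetical, simplified)."""
    wind_speed_ms: float
    visibility_nm: float


@dataclass(frozen=True)
class OperationalDesignDomain:
    """Limits within which the ship is designed to operate (hypothetical)."""
    max_wind_speed_ms: float
    min_visibility_nm: float

    def contains(self, c: Conditions) -> bool:
        return (c.wind_speed_ms <= self.max_wind_speed_ms
                and c.visibility_nm >= self.min_visibility_nm)


def assign_handler(odd: OperationalDesignDomain, conditions: Conditions,
                   task: str, automated_tasks: set) -> Handler:
    """Decide who handles a task under the given conditions."""
    if not odd.contains(conditions):
        # ODD exceeded: go to the minimal risk condition.
        return Handler.FALLBACK
    if task in automated_tasks:
        # Within the unassisted capabilities of the automation.
        return Handler.CONTROL_SYSTEM
    # Inside the ODD, but the automation cannot resolve it alone.
    return Handler.OPERATOR
```

The order of the checks mirrors the text: the ODD is tested first, and only inside the ODD is the task split between automation and operator.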

A proposed set of definitions for autonomous 
merchant ships (Rødseth & Nordahl 2017) indicates
that four distinct levels of autonomy may be needed 
and are probably sufficient. These levels are defined 
independently of the human operator being located 
on board the ship or in a remote location: 


1. Operator controlled (AL0-1): The DNT is fully
handled by the operator. Systems may provide 
decision support or very limited automatic con- 
trol, e.g. as in an auto pilot or track pilot. This 
is the current situation on today’s ships. 

2. Automatic (AL2): The ship systems can operate 
without human intervention for a very specific 
function, typically as a DP system works today. 
An operator is required to handle all devia- 
tions from expected operational parameters. 
This autonomy level is probably appropriate for 
automatic berthing or other situations where 
very accurate control is needed and where less 
deterministic and autonomous problem han- 
dling is unwanted. 

3. Partly autonomous (AL3): The ship can perform 
certain tasks in the DNT autonomously, e.g. 
transiting open sea in fair weather. This can, e.g. 
be used to have a periodically unmanned bridge. 

4. Constrained autonomous (AL4): The ship can
operate autonomously within most or all of the
DNT, but it has clear limits to what actions it
can take by itself, e.g. maximum speed and track 
deviations. If the ship needs to exceed these lim- 
its, e.g. due to anti-collision manoeuvres, the 
operator has to be called to change limits or to 
remotely control it until constrained operations 
can resume. 

5. Fully autonomous (AL5): The ship systems can perform
all DNT tasks without human intervention.
There are no operational limits beyond
those defined by the ODD.


Constrained autonomy is the most likely type 
of autonomy for fully unmanned ships with shore 
supervision. It enables the ship to solve all “stand- 
ard” problems by itself while reducing system 
complexity by having an operator available for the 
more complex situations. It also gives a high degree 
of operational determinism due to the operational 
envelope it cannot exceed without human accept- 
ance. Fully autonomous is the necessary level for 
autonomous ships that have no remote supervisor. 
This will in many cases require very complex control
systems and is not a very likely level for ships in
the near future.
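
As a minimal sketch of the operational envelope idea behind constrained autonomy, the check below hands a manoeuvre to the operator whenever it would exceed the agreed limits. The two limits follow the examples given in the text (maximum speed and track deviation); the class names, field names and return strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationalEnvelope:
    """Limits the ship may not exceed without operator acceptance."""
    max_speed_kn: float
    max_track_deviation_m: float


@dataclass(frozen=True)
class PlannedManoeuvre:
    """What the control system would like to do next."""
    speed_kn: float
    track_deviation_m: float


def requires_operator(envelope: OperationalEnvelope,
                      manoeuvre: PlannedManoeuvre) -> bool:
    """True if the manoeuvre exceeds any limit of the envelope."""
    return (manoeuvre.speed_kn > envelope.max_speed_kn
            or manoeuvre.track_deviation_m > envelope.max_track_deviation_m)


def next_action(envelope: OperationalEnvelope,
                manoeuvre: PlannedManoeuvre) -> str:
    """Execute inside the envelope, otherwise call the operator."""
    if requires_operator(envelope, manoeuvre):
        return "call_operator"
    return "execute_autonomously"
```

A fully autonomous ship would, in this picture, have no such envelope check beyond the ODD itself.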

The levels can be characterized by having dif- 
ferent ratios between the operator exclusive DNT 
(black) and the control system DNT (grey), as 
illustrated in Figure 2. One may validly argue 
that the levels between automatic and constrained 
autonomy should be the same class as they both 
have operator and control system DNTs. However, 
it is useful to differentiate between them since they
are likely to be used in different contexts during the
voyage.

Dependent on autonomy level and the opera- 
tor being available on the ship or on shore, one 
can define the matrix in Table 1. The shaded
cells represent operations where one will require a 
manned shore control center to handle deviations 
from operator DNT fast enough. The empty cells
represent types that are not very relevant, although
possible.

Figure 2. Five levels of autonomy: 1. Operator controlled (AL0-1);
2. Automatic (AL2); 3. Partly autonomous (AL3); 4. Constrained
autonomous (AL4); 5. Fully autonomous (AL5).

The level of autonomy will vary over the ship’s 
different functions such as engine control, cargo 
monitoring and navigation functions. It will also 
vary during the ship’s voyage. This may be a result
of, e.g. using an unmanned bridge during night 
and open sea passage or by having different modes 
in different phases of the voyage, e.g. using remote 
control during port approach and automatic con- 
trol during berthing. 


2 AUTOMATION 


Going back to the concept of ODD and DNT, 
one may argue that most incidents occurring 
with automated systems may be of the following
types:


1. Errors in control system DNT (CS-DNT): 
These are purely technical errors that occur in 
the automation systems and associated sensors. 
They may be caused by technical system malfunctions
or by errors in system design or
configuration.

2. Errors in operator exclusive DNT (OE-DNT):
These are human operational errors that may 
have been caused by, e.g. fatigue or low situa- 
tion awareness which, in turn, may have been 
caused by bad technical systems. However, the 
incident is directly attributed to a human opera- 
tional error. 

3. Transition from CS-DNT to OE-DNT: This is a
critical issue as the transition has both a timing
aspect, in that it must be fast enough, and a situation
awareness aspect, in that the human must understand
the background for the transition to make
the correct decisions.

4. Operator intervention in CS-DNT: There are
also examples of incidents that have been 
caused by operators intervening in automated 
processes when they should have left the auto- 
mation system alone. 

5. Transition from OE-DNT to CS-DNT: This is
probably a less common type, but it may be
challenging to make sure that the automatic
control system is activated at the right time and
with the right parameter settings.

6. Transition to DNT Fallback: When to activate 
the DNT Fallback is also a critical issue. The 
DNT Fallback is not necessarily a “fail to safe” 
control as ships do not have a generally safe 
state. It is a “minimal risk condition” (SAE 
2016). Thus, there is an inherent risk in going 
from OE-DNT or CS-DNT to DNT Fallback 
and it is a challenge to define the proper condi- 
tions for doing so, particularly when a human is 
in the control loop. 
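
The six incident types listed above can be read as a small classification over which part of the DNT was in control before and after the incident. The sketch below is one possible encoding and not part of the original taxonomy: the `Mode` and `IncidentType` names and the `operator_intervened` flag are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    CS_DNT = "control system DNT"
    OE_DNT = "operator exclusive DNT"
    FALLBACK = "DNT Fallback"


class IncidentType(Enum):
    CS_DNT_ERROR = 1           # technical error in the automation
    OE_DNT_ERROR = 2           # human operational error
    CS_TO_OE_TRANSITION = 3    # handover from automation to operator
    OPERATOR_INTERVENTION = 4  # operator interferes with the automation
    OE_TO_CS_TRANSITION = 5    # handover from operator to automation
    FALLBACK_TRANSITION = 6    # transition to the minimal risk condition


def classify(before: Mode, after: Mode,
             operator_intervened: bool = False) -> IncidentType:
    """Map the control mode before/after an incident to one of the six types."""
    if after is Mode.FALLBACK and before is not Mode.FALLBACK:
        return IncidentType.FALLBACK_TRANSITION
    if before is Mode.CS_DNT and after is Mode.OE_DNT:
        return IncidentType.CS_TO_OE_TRANSITION
    if before is Mode.OE_DNT and after is Mode.CS_DNT:
        return IncidentType.OE_TO_CS_TRANSITION
    if before is Mode.CS_DNT and after is Mode.CS_DNT:
        if operator_intervened:
            return IncidentType.OPERATOR_INTERVENTION
        return IncidentType.CS_DNT_ERROR
    if before is Mode.OE_DNT and after is Mode.OE_DNT:
        return IncidentType.OE_DNT_ERROR
    raise ValueError("combination not covered by the six incident types")
```

Such an encoding could, for example, be used to tag incident reports consistently, so that transition-related incidents can be counted separately from pure technical or human errors.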


While this classification seems most relevant for 
autonomous ships, it is also applicable to manned 
ships with automation or decision support compo- 
nents. In particular, the transitions between automatic 
and human control in current automated systems will 
be a good indication of how this problem will develop 
when more autonomy is added in the system. 

In the following, we will discuss known benefits 
and shortcomings of today’s manned operation 
with automation and see how that can be applied 
to autonomous ships. 


3 SAFETY, HUMANS AND AUTOMATION 


If autonomous unmanned ships are to become 
a success they have to prove successful in several 
areas, and safety is one of them. Thus, the first
thing we might ask is: how safe is manned
shipping?


3.1 At least as safe as manned shipping 


In a study by Oxford University on British data 
from 1976 to 1995, the seafaring job is ranked as 
the second most dangerous occupation in Britain— 
after being a fisher (Roberts 2002). This is however 
not usually because ships are sinking, but because
of occupational hazards like slips, trips, and falls
on a moving platform full of heavy gear in a
hazardous environment. In this sense, we might
conclude that simply removing humans from this
hazardous environment already has a safety benefit.

However, if by safety we mean the safety of
the ship, we can say that shipping is very safe and
is becoming safer every year. As background, we
can note that in the three years between 1833 and
1835, on average 563 ships per year were reported
wrecked or lost in the United Kingdom alone
(Crosbie 2006). The total number of tankers, bulk
carriers, containerships and multi-purpose ships
(over 100 Gross Tons) in the world fleet has risen
from about 12,000 in 1996 to some 33,000 in 2016
(Clarkson 2017). During the same time, the number
of ships totally lost per year worldwide (ships over
500 Gross Tons) declined from 225 in 1980, to 150
in 1996 and 33 in 2016 (total losses as reported in
Lloyds List, IUMI 2016).

If we look at ship accidents broken down into
different causes, we can see that between 2012 and
2016, 50% of the ships totally lost were lost because
of weather. Some 20% grounded, 10% were lost
because of fire or explosion, 5% by collision, and
10% by machinery failure (total losses, all vessel
types over 500 Gross Tons, IUMI 2017).

As we can note from the above, there is no mention
of any losses due to “human error”. This
is because the statistics often choose a single, simple
cause of the accident, but if we drill down looking
for a root cause we often find “human error”
on one level or another in almost all cases. Dhillon
(2007) compiled the following statistics:

A study of 6091 major accident claims associ- 
ated with all classes of commercial ships, revealed 
that 62% of the claims were attributable to “human 
error”. 

“Human error” contributes to 84-88% of tanker 
accidents. 

“Human error” contributes to 79% of towing 
vessel groundings. 

Over 80% of marine accidents are caused or 
influenced by human and organization factors. 

“Human error” contributes to 89-96% of ship 
collisions. 

A Dutch study of 100 marine casualties found 
that “human error” contributed to 96 of the 100 
accidents. (For detailed references see Dhillon 
2007, p. 2) 

Let us illustrate how “human error” can be a 
part of almost all accidents. Let us briefly look at 
the recent collision accident between the general 
cargo ship Daroja and the oil bunker barge Erin 
Wood that took place in Scottish waters in 2015 
(MAIB 2016). In August 2015 the two vessels col- 
lided off the east coast of Scotland. It was a nice 
summer afternoon with light wind and no sea state. 
The two vessels were both north bound but on
crossing courses which brought them closer and
closer together for almost two hours without either
of the two bridge officers apparently noticing
the other ship until too late. Visibility was excellent,
and radar and AIS tracking were available on both
bridges. The UK Marine Accident Investigation
Branch concluded that “Daroja and Erin Wood collided
because a proper lookout was not being kept
on either vessel” (MAIB 2016, p. 40). This accident
would appear in the aforementioned statistics as 
a “collision”, but the underlying root cause was 
“improper lookout”, which would classify it as 
“human error”. 

A variety of taxonomies for “human error” have
been proposed. One example is the simple dichotomy
between “errors of omission” and “errors of
commission” (Wickens et al., 2013). “Errors of
omission” mean not doing anything when something
should have been done, as with the watch keepers
above. “Errors of commission”, on the other hand,
mean doing the wrong thing.

A more elaborate taxonomy developed by
Norman (1988) and Reason (1990) involves “mis- 
takes,” “slips” and “lapses.” 

“Mistakes,” are when the operator has not fully 
understood the situation and acts intentionally. 

“Slips,” on the other hand, are when the intention
is right but the action is carried out wrongly.
Maybe the wrong button is pressed although
the intention was to press the right one. Because 
humans monitor their own actions, slips are often 
noticed and corrected before any harm has been 
done. 

“Lapses,” finally, are a failure to take any
action at all, i.e. an error of omission. Often they are
lapses of memory, forgetfulness. Humans forget,
we become distracted or think about other things.
This is all part of the human condition. Maybe the
two watch keepers in the accident above were thinking
about other things and forgot to monitor their
systems and look out of the window? “Lapses” are
sometimes easy to prevent by technical solutions 
like automation. 

One may ask how come there was no warning 
issued to make the two watch officers aware of the 
pending danger. Radar systems on both ships as 
well as the AIS tracks in the electronic chart sys- 
tems could theoretically extrapolate the courses 
of the vessels to a collision point. In addition, 
systems on land that gather AIS data could have 
made the same calculation. Why is it that available
data is not used to the benefit of safety when possible?
Why was there no warning and why did the systems
not automatically make a small course or
speed change to stay out of the close-quarters situation?
It is because automation is a controversial
issue. Warnings are often turned off by operators
because of the many false alarms.
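
The extrapolation of two tracks to a collision point that is referred to above is commonly done as a closest point of approach (CPA) calculation. The sketch below uses the standard constant-course, constant-speed formula on a flat local chart; it is not taken from the MAIB report or from any particular radar or AIS product, and the function names and the warning thresholds are hypothetical.

```python
import math


def cpa(pos_a, vel_a, pos_b, vel_b):
    """Return (tcpa_hours, dcpa_nm) for two vessels on constant course and speed.

    Positions are (east, north) in nautical miles on a flat local chart;
    velocities are (east, north) in knots.
    """
    # Position and velocity of vessel B relative to vessel A.
    rx, ry = pos_b[0] - pos_a[0], pos_b[1] - pos_a[1]
    vx, vy = vel_b[0] - vel_a[0], vel_b[1] - vel_a[1]
    speed_sq = vx * vx + vy * vy
    if speed_sq == 0.0:
        # Same course and speed: the distance never changes.
        tcpa = 0.0
    else:
        # Time at which the relative distance is smallest; a negative
        # value means the closest point is already in the past.
        tcpa = max(0.0, -(rx * vx + ry * vy) / speed_sq)
    dcpa = math.hypot(rx + vx * tcpa, ry + vy * tcpa)
    return tcpa, dcpa


def collision_warning(tcpa, dcpa, max_dcpa_nm=0.5, max_tcpa_h=0.5):
    """Warn when the vessels will pass too close, too soon (hypothetical limits)."""
    return dcpa < max_dcpa_nm and tcpa < max_tcpa_h
```

The choice of thresholds is exactly where the false alarm problem mentioned above arises: limits that are too generous produce frequent warnings in dense traffic, which invites operators to switch them off.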


3.2 Why automation can make ships safer 


A large part of the robustness of the shipping 
industry demonstrated by the constant decline in 
shipping accidents has to do with automation. The
error prone and difficult position fixing, previously
done by manual methods like dead reckoning, or
sun heights and bearings to landmarks when sun,
stars and land were in sight, has now been replaced
by satellite based navigation systems with very
high reliability. Manual steering, which in the old days
caused large course errors, has been replaced by
auto pilots or even track pilots which can follow
a pre-programmed path with an accuracy of a few
meters, or even centimetres when augmentation
systems are used. These are just a few areas of
marine automation.

The reason automation is safer is that it
addresses human shortcomings like:

Fatigue: Humans are day animals. We are 
designed to be active by day and sleep by night. 
Our whole cognitive system is designed for work 
by day. Even if augmented by technical means, our 
decision making is crippled during night, even if 
we are accustomed to shift work by night. A larger
share of accidents happens at night (e.g.
Wagstaff & Sigstad Lie 2011).

Attention span: The ability to focus and sustain
attention on a task is crucial for the achievement
of one’s goals. Although attention span is a complex
concept and measures depend on a lot of different
things, most researchers agree that the time
span over which humans can concentrate on a task
without being distracted is limited, e.g. 10-20 minutes
in healthy teenagers and adults (Wilson &
Korn 2017).

Information overload: Overload can be of many 
kinds. Too much to do, and too little time to do 
it. Too much information that needs to be con- 
sidered presented in an unintegrated way at the 
same time. It boils down to limits of the human 
working memory. Miller in 1956 famously stated
that humans could handle at most 5-9 information
chunks at one time. But underload can
also be a problem. During a conference in 2014 a
British maritime accident investigator mentioned a
new type of boredom-induced accident. Evidence
for the so-called Yerkes-Dodson law (first demonstrated
on mice in 1908) shows that human performance
describes an inverted U-shaped curve when plotted
against arousal (or stress): low arousal may
lead to low performance, and elevated arousal
leads to higher performance up to a certain point, after
which performance declines with higher stress (cognitive
tunnelling).

Normality bias: This is a form of denial that 70% of
humans revert to when facing disaster, as
a result of which they underestimate the possibility
of the disaster actually happening and its potential
results (Omer & Alon 1994).

We could go on stating human shortcomings in
this way for many pages; however, we think the point
is made: automation can make ships safer.


3.3 Why automation can make ships less safe


In the everlasting striving to make life easier,
humans have automated tasks that are tedious,
dangerous, dirty, boring, etc. However, a paradox
in automation is that it has often been the easiest
tasks that have been possible to automate. In
complex and ambiguous situations, the human has
had to step in to resolve the ambiguity and finish
the task.

Automation needs to be programmed and can
therefore only solve simple or complicated problems.
By “complicated”, we here mean that there
is a finite solution space that can be parsed by
computers. In reality, many real world problems
are complex in the sense that they have an infinite
solution space due to many unknown factors and
interrelationships. For such problems, it is not even
theoretically possible to program solutions to all
possible situations (possibly leaving machine or deep
learning aside).

The dynamic maritime environment with sea 
and current, weather, topography, manned and 
autonomous ships is such a complex environment 
and will for a very long time need a human to step 
in and resolve problems out of the range of automation.
As we have seen above, there are relatively
good statistics on “human error”; however, there
are almost no statistics on “human recoveries”,
where humans have stepped in and saved a situation
caused by e.g. technical malfunction.

An illustration of such a recovery can be taken
from an incident in 1991.

In this incident a product tanker loaded with 
20 000 metric tons of gasoil was under way 
through the narrows of a winding Scandinavian 
archipelago. In a bend in the fairway she had a 
routine meeting with one of the large ferries serving
the area. The ferry had almost 1000 passengers
and crew onboard. As the tanker applied
starboard rudder to negotiate the bend in the fair- 
way, the captain noticed that the rudder instead 
turned to port and a port turn was commenced a 
few hundred meters in front of the oncoming ferry. 
The captain immediately reversed the engine, but 
realizing that he would not be able to prevent the 
turn, he called the ferry on the VHF saying they 
had a breakdown on the steering engine and asked 
for “green-to-green” (starboard side to starboard 
side) meeting. The ferry responded promptly, but
by making a starboard 360 degree turn, and the
ships passed each other on parallel courses with
20-30 meters between them. The accident investigation
board calculated that if the action from the ferry
had been delayed 30-60 seconds, a collision with
the ferry running into the amidships section of the
tanker at a right angle would have been impossible
to avoid (SHK 1992). The consequences can only
be imagined.

The accident investigation concludes that it 
was the decisive actions by the captains of the two 
ships that avoided a possible catastrophe. One may 
wonder what would have happened if one or both 
of the ships had been autonomous. Remember
also the pilot of the airliner that landed on the Hudson
River in 2009, and who, by acting against protocol
and procedures, miraculously saved the lives of
passengers onboard (NTSB 2010). So, on one hand
we have incidents due to human error that can be
avoided with automation; on the other hand we
have incidents that are now avoided by humans,
but will happen when no humans are onboard.
But new technology also opens the way for new types of
accidents.

These relationships are described in Figure 3. 

Automation of human processes (middle circle,
Figure 3) is expected to significantly reduce
the number of incidents happening in shipping
today, but one must also assume that a number of
potential incidents are averted by the crew's
actions and it is not clear if improved automation
can match these numbers. Finally, one must also
assume that some new types of incidents will occur
as a result of the introduction of new technology
(far left). The net result is the remaining grey areas
and the question is if this will be low enough for
societal acceptance of the new ship types.

Figure 3. Remaining incidents in the autonomous ship
after automating human processes.

Thus, while the assumption is that the net
result of automation will be fewer accidents and
incidents, this remains to be shown. Within the commercial
aviation industry, automation has improved
safety (e.g. Billings 1997; Pritchett 2009; Wiener
1988). Can we assume that the same is true for the
shipping domain? One way of dealing with this is 
through risk analysis. 


3.4 Risk analysis 


Risk analysis can be “broadly defined to include 
risk assessment, risk characterization, risk com- 
munication, risk management, and policy relating 
to risk, and risks of concern to individuals, to pub- 
lic- and private-sector organizations, and to society 
at a local, regional, national, or global level” (SRA 
2012). In this paper’s context, we look at risk anal- 
ysis as risk assessment where risk is defined as the 
combination of the frequency and the severity of 
the outcome of an accident (IMO 2002). 
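
As a small worked example of this definition, the sketch below combines frequency and severity by multiplication and sums over accident scenarios. Reading "combination" as a product is an assumption made here for illustration (it is a common convention, but the definition above does not prescribe it), and the scenario figures in the usage note are invented.

```python
def scenario_risk(frequency_per_year: float, severity: float) -> float:
    """Risk of one scenario as frequency times severity (assumed convention)."""
    return frequency_per_year * severity


def total_risk(scenarios) -> float:
    """Sum the risk over (frequency_per_year, severity) pairs."""
    return sum(scenario_risk(f, s) for f, s in scenarios)
```

With two hypothetical scenarios, one occurring once per 100 ship-years with a severity of 100 cost units and one occurring once per 1000 ship-years with a severity of 5000 units, `total_risk([(0.01, 100.0), (0.001, 5000.0)])` gives about 6 units per ship-year, and the rare but severe scenario dominates the total.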

The expected frequency of accidents must often
be derived from an assumed accident probability,
as statistically significant data on frequencies
are impossible to find. Obviously, this particularly
applies to new technology or ship types such as
autonomous ships. The probabilities are difficult
to determine in themselves and, in addition, the
strength of knowledge used to establish the probabilities
needs to be addressed. In autonomous systems
the strength of knowledge is generally low
due to lack of experience and the complexity of
the autonomous marine system.

The prevalent strategy for dealing with the increased
(socio-technical) complexity, lack of coherence, and speed
of change in contemporary systems, science and
the discipline of risk management is to incorporate
uncertainty, ambiguity, and the knowledge dimension
per se in the risk measure (Paltrinieri et al.
2016). This is done through risk analysis of potential
accident scenarios that we eventually are aware
of and can manage. This is emergent research
and there is not much hard knowledge in the area,
although some papers have been published, e.g.
(Utne et al. 2017) and (Rødseth & Tjora 2014).

The second paper is mainly a preliminary haz- 
ard identification (HazId) study based on use cases 
and ship function breakdowns. It suggests a framework
for doing HazId in the unknown environment
of the autonomous ship based on assumptions on
what can happen and how this influences the
different functions the ship systems have to provide.
The first paper argues for a more holistic
approach to risk management, including dynamic 
risk assessments during the autonomous voyage. 

This paper will not go further into this area, but 
it is important to point out that determining the 
complete risk level for the autonomous ship will 
be very challenging. As was illustrated in Figure 3, 
there are more new issues that have to be taken into 
consideration and for at least two of these we do 
not have any statistics that can be used in estimates 
of probabilities. Although, e.g. HazId may be able 
to identify the hazards and accident consequences, 
we are still left with very uncertain probabili- 
ties and the limitation to the known knowns and 
known unknowns. 

Within safety science, the concept of “human
error” has seldom been used since the 1990s, as it has been
seen that “human error” is not a cause but a result
of other factors such as poor design, poor planning,
poor procedures, etc. (Dekker 2006). Instead,
the concept of “human variability” from Resilience
Theory is often used (Hollnagel, Woods & Leveson
2006). Human variability might sometimes
lead to “human errors” but maybe more often to
“miraculous recovery”. Positive actions and successful
recoveries are usually not recorded, as mentioned
in Leveson (1995, p. 94), where a U.S. Air
Force study showed 659 crew recoveries in 681 in-flight
emergencies, with only 10 pilot errors.


4 CONCLUSION 


It seems to be generally accepted that automation 
has the potential to decrease accidents that are due 
to human variability. 

However, automation has the potential of cre- 
ating accidents in itself, e.g. through transitions 
between automatic and manual control and the 
human having to rapidly assess the situation and
make the right decisions.

Automation also sometimes creates problems
by reducing the workload of the human, inducing
boredom and thereby further increasing the time
needed to make a correct assessment.

With constrained autonomy being the most 
likely form of ship autonomy, one needs to inves- 
tigate if these issues actually can increase the 
probability of some accident types compared to 
conventional manned ships. 

Also, autonomy will create new types of accidents,
as suggested by the illustration in the beginning
of the paper. This is partly due to accidents
that were previously averted by the human crew and
partly due to introduction of new technology and
corresponding new accident types. These types of 
accidents are very challenging to include in the risk 
analysis as we lack statistical evidence for their 
probability. 

To address the new risk picture, one probably
needs new types, and extensive use, of human-centred
risk analysis. Also, one needs to consider the
development and use of dynamic risk assessment 
systems during autonomous voyages, as well as 
other real time tools that can be used on the ship 
or in the shore control centre.


ABSTRACT: The United States has a fleet of light water reactors that continue to operate using paper-based
procedures. Here we review existing implementations of computerized procedure systems, the
potential benefits, as well as their caveats. We also review U.S. regulatory requirements for computerized
procedure systems in an effort to identify barriers to adoption. We conjecture that process control procedures,
especially the formalized procedures used by U.S. utilities, can be viewed as E-Type Systems from
a software engineering evolution perspective. After presenting this argument, we discuss corollaries from
treating procedures as E-Type Systems. Based on these analyses, a strategy was formulated to aid utilities
in transitioning to computerized procedure systems. Lastly, missing tooling to aid utilities in cost-effective,
timely transitions is discussed.


1 INTRODUCTION 


The U.S. Department of Energy’s (DOE) Gateway 
for Accelerated Innovation in Nuclear (GAIN) program
connects nuclear technology industry partners
with DOE expertise and resources. As part of GAIN,
Idaho National Laboratory is teaming with the simu- 
lator vendor GSE Systems to develop a Computer 
Based Procedure (CBP) Engine for Nuclear Power 
Plant (NPP) Main Control Room (MCR) opera- 
tions. This platform would enable critical research 
related to CBPs with full-scope simulators specifically 
to support paper-to-digital procedure transitions for
U.S. utilities seeking to realize the benefits of CBPs.
The aging U.S. fleet of light water nuclear power 
plants (NPPs) is of paramount importance to the 
nation’s overall base energy production capabilities. 
Each nuclear power plant has been operating for 
multiple decades and has already undergone numer- 
ous engineering changes. Approximately 74 operat- 
ing plants have filed license extensions to operate 
beyond their original 40 year operating license. To 
operate beyond 40 years, many plants are undergoing
modernization efforts to ensure the plants can
be operated in a safe and cost effective manner.
These changes result in frequent amendments to 
procedures. The procedures have evolved substan- 
tially from when the plant was first commissioned. 
The U.S. nuclear industry, in addition to being 
highly proceduralized with operations and mainte- 
nance in and out of the control room, is also one 
of the most stringently regulated energy industries. 
Plants were originally commissioned with paper
procedures which, despite their shortcomings, have
stood the test of time and are still in use today. It
is well documented and often cited that human
reliability and error are the leading cause of plant
events. When the contributors to human error are
examined, a U.S. Nuclear Regulatory Commission 
report (1995) reveals that procedure problems have 
been cited as contributing to 69% of event reports. 
Following paper procedures is challenging because: 


• Operators must identify the correct procedure
and have the most up-to-date version of the procedure.
The use of paper procedures can result in
an excess number of procedures, often with poor
classification schemes (O’Hara et al., 2000).

• Operators must determine the correct path
through the procedure and complete the procedure
without skipping steps or following steps
out of order. Operators must manually keep
their place in procedures (Fink et al., 2009).

• Procedure following imposes high memory
demands on operators because of the need to
track multiple pieces of information tied to disparate
pieces of equipment.

• In emergency events, each operator might be
performing multiple nested procedures simultaneously.
It is in these stressful environments that
humans have a high potential for decision making
errors, such as making incorrect conclusions
(Converse, 1995; Kim, et al. 2011).

• Operators must determine plant state from control
board indications or approved (qualified) plant
computer systems. This requires frequent searching
for information or requests for information
from reactor operators (ROs). ROs must determine
the correct instrumentation and control (I&C), and
correctly perceive and report the correct value. Senior
reactor operators (SROs) often are completely
reliant on the RO for critical pieces of information.

• Procedures also require continuous actions to
monitor plant variables, and operators must
take action if a specified threshold is reached.
These continuous actions divide the operator’s
attention across the board.

• Decisions must be based on the current plant state.
The plant state changes as operators progress
through a procedure. Normal operating conditions
can be different depending on the plant state.

• Procedure following requires engaging in critical
actions within specified time intervals. Operators
often must keep track of several time intervals
simultaneously.


Computer-based systems for nuclear power have
precedent, with over 30 years of research and multiple
use cases. Several systems have shown that computerized
procedures can aid operators with many of
these challenges, as illustrated in the following section.


2 RESEARCH AND IMPLEMENTATIONS 
OF COMPUTERIZED PROCEDURE 
SYSTEMS 


2.1 Korea advanced reactors


KOPEC Nuclear Power operations for Shin Kori
Units 3 and 4, which are APR1400 units (Unit 4 is
not yet online), rely heavily on computerized procedure
systems in a completely different paradigm
from conventional operations (Hong, Lee, and
Hwang, 2009). With conventional (U.S.) operations
the operator follows a paper procedure and
references the boards to assess the current state
of the plant. Computerized procedures allow the
relevant information to be provided within the
context of the procedure. However, with individual
workstations displaying CBPs, operators communicate
less with each other, leading to poorer team
situation awareness (Kim, et al. 2011). Operations
become much more centered on the computerized
procedure system, and many of the inefficiencies
associated with seeking information and verbally
communicating between operators are removed.

With traditional procedures, information is coupled
primarily through the reactor operators. With
CBPs the communication between operators is
loosely coupled.

The Korea Electric Power Research Institute
has developed a computerized procedure system
that, in addition to displaying and executing procedures,
has functionality for writing and editing
procedures and for assessing the validity of logic and
plant parameters through a testing framework that
integrates with an engineering framework (Hong,
Lee, and Hwang, 2009).


2.2 The OECD Halden reactor project


The Halden Man-Machine Laboratory 
(HAMMLAB) of the OECD Halden Reactor 
Project (HRP) has conducted pioneering research 
related to computerized systems for nuclear control. 
In 1985, work on developing computerized proce- 
dures began, which culminated in the Computerized 
Procedure Manual (COPMA) and later COPMA-II. 
An experimental study (Converse, 1995) found that
operators committed half as many errors with
COPMA-II as with paper procedures.
This is even more significant when considering that 
the participants had only a few hours of training and 
experience using COPMA-II, although it should be 
noted that crews responded faster with paper proce- 
dures. In 2000, the third iteration of COPMA, based 
on web technologies, was released (IFE, 2017). 

While the OECD Halden Reactor Project is perhaps
best known for ecological interface design and
information rich displays, much effort has focused 
on how task-based displays can be designed to con- 
vey the most important procedure relevant informa- 
tion, as well as identifying how task-based displays 
can integrate procedures with process displays 
(Braseth, et al., 2009). This philosophy is advanta- 
geous because the layout of the information is kept 
in a standard format and the procedure is simul- 
taneously visible to all operators, resulting in an 
improved shared situation awareness. 


2.3 Westinghouse


Westinghouse Electric Corporation began developing
CBPs for commercial applications in the early
1990s. Their COMPRO system was highlighted as
having a graphical user interface and using relational
database technologies. COMPRO has been deployed
in Beznau, Switzerland, and Temelin, Czech Republic.
Westinghouse’s AP1000 will incorporate a computerized
procedure system that will also provide
a diverse set of procedure views including graphical
flowcharts, a textual view and a dynamic logical
view (Lipner, Mundy, and Franusich, 2006) while retaining
much of the COMPRO “DNA” in philosophy.

In the early 1990s Westinghouse collaborated
with Army/NASA Ames Research Center to apply
the Man-machine Integrated Design and Analysis
System (MIDAS), a tool for computerizing the
cockpits of advanced commercial and military aircraft,
to a computer-based procedure system for
main control room operations. This collaboration
resulted in an impressive proof-of-concept demonstration
of how CBPs could substantially reduce
operator memory demands and improve situational
awareness (Hoecker, et al. 1994). The caveat
is that using MIDAS to translate PBPs
to CBPs is a fairly significant refactoring requiring
detailed and extensive collaboration between operations
and human factors practitioners.


2.4 EdF N4 series reactors 


The Électricité de France (EdF) N4 series of reac- 
tors in France use CBPs. Dien, Montmayeul, and 
Beltranda (1991) describe the philosophy of devel- 
oping the N4’s original computerized procedures 
from a human factors perspective. They are care- 
ful to note the computerized procedures are not 
intended to guide the operator actions and the 
operators should always have the final decision 
on the actions that are performed in the plant. In
operations, they recognized the necessity for operators
to compensate for inadequacies in procedures,
as well as to handle conflicts or ambiguities
between procedures. Procedures can also
result in operators looping through a procedure 
multiple times before finding the correct indication; 
the EdF approach recognizes that human opera- 
tors may be able to implement more economical or 
practical control strategies than those prescribed by 
procedures. 


3 U.S. ADOPTION OF COMPUTERIZED 
PROCEDURES 


Thus far, we have reviewed existing implementa- 
tions of computer-based procedures and pro- 
cedural systems. Despite several decades of 
commercial implementations and several new systems
in modernized plants, adoption of CBPs in
U.S. main control rooms is almost non-existent,
even though procedure-related problems have
been implicated in 69% of event reports.

Le Blanc, Oxstrand, and Waicosky (2012) surveyed
U.S. nuclear utilities about their plans for
implementing CBPs in the field, and the perceived
barriers to implementation. Utilities are
hesitant to be first. Being first in the industry 
usually entails higher costs. Utilities also need to 
justify the capital investment, and to do so CBPs 
need to be demonstrably better than paper-based 
counterparts. 

While there are mixed results, the available evi- 
dence suggests that tangible benefits exist: 


• Lin et al. (2016) found with a full-scope advanced
  Main Control Room simulator that operators
  with CBPs had significantly better task monitoring
  awareness and higher task performance
  while completing emergency response
  procedures. Their results suggest that reduced
  communication between team members is a side
  effect of having higher situation awareness. Team
  members were observed to make verbal inquiries
  when their situation awareness was low.

• CBPs offload some of the complexity of procedure
  following, leading to fewer procedure deviations.

• CBPs can be organized by task so that switching
  between procedures is seamless (Le Blanc,
  Oxstrand, and Joe, 2015).

• Some of the work performed by operators can
  be offloaded to a CBP system, e.g. continuous
  monitoring of a plant parameter.

• Recovery from human error is faster with CBPs
  compared to paper (O’Hara, 2000).

• CBPs are even more beneficial with multiple
  failures.

• Dynamic presentation of information is possible.
  The information does not need to adhere to the
  standard one-sensor, one-indicator format. Data
  can be tailored by clustering and aggregation,
  through summaries, and graphical presentations
  (Dien, Montmayeul, and Beltranda, 2009).

• Tasks that would normally require multiple procedures
  can be combined and ordered in a logical
  sequence.

• Visual representations of procedural flow paths can
  result in better awareness of the plant’s current state
  and where the procedure is headed (NRC, 1995).
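
One of the benefits listed above, offloading the continuous monitoring of a plant parameter to the CBP system, can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the tag names, the threshold values, and the instruction wording are invented for illustration and are not real plant setpoints or any vendor's interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ContinuousAction:
    """One continuous-action step: watch a parameter, act when a limit is crossed."""
    parameter: str                      # hypothetical plant tag name
    triggered: Callable[[float], bool]  # threshold test taken from the procedure step
    instruction: str                    # what the operator is prompted to do


def check_continuous_actions(actions: List[ContinuousAction],
                             plant_data: Dict[str, float]) -> List[str]:
    """Return the prompts whose thresholds are met by the current plant data."""
    alerts = []
    for action in actions:
        value = plant_data.get(action.parameter)
        if value is None:
            # Missing data is surfaced to the operator rather than silently ignored.
            alerts.append(f"NO DATA for {action.parameter}: verify on control boards")
        elif action.triggered(value):
            alerts.append(f"{action.instruction} ({action.parameter} = {value})")
    return alerts


# Illustrative values only; these are not real setpoints.
actions = [
    ContinuousAction("sg_level_pct", lambda v: v < 25.0, "START auxiliary feedwater"),
    ContinuousAction("rcs_pressure_psig", lambda v: v > 2335.0, "OPEN pressurizer spray"),
]
snapshot = {"sg_level_pct": 22.5, "rcs_pressure_psig": 2235.0}
alerts = check_continuous_actions(actions, snapshot)
print(alerts)
```

The sketch only prompts; the decision to act stays with the operator. It also reports missing data explicitly, since a CBP that depends on plant process data may otherwise give misleading guidance when that data is unavailable.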


Why then, is there not greater adoption of CBPs, 
and how can adoption be encouraged? This ques- 
tion is difficult to answer with the available infor- 
mation, but we can provide some conjecture on the 
issue. Procedures, to state the obvious, tie to every 
sub-system of the plant. Undertaking an adop- 
tion of a CBP system is a large commitment from 
an organizational perspective. The benefits of a
successful transition from paper to CBPs are clear,
but plants also run the risk of disrupting the func- 
tional aspects of current procedures and proce- 
dural systems. The existing methodologies require 
completely disassembling and refactoring existing 
procedures into computerized procedure systems. 
This might work for new plants and new reactor 
designs, but could be seen as risky for the currently 
operating Generation II plants. Procedures are liv- 
ing documents that have been continually main- 
tained over their multi-decade lifespans. Plants are 
modernizing control systems, and procedures must 
be amended to stay accurate with plant systems. 
They contain human years of operating experience 
and implicit functional characteristics that could be 
lost if haphazardly transferred to computer-based 
procedure systems. 


4 SYSTEMS OF PROCEDURES 


Here we suggest that process control procedures can 
be viewed from a software engineering perspective. 
The logic of a procedure is analogous to compu- 
ter code. Instead of the code being compiled by a 
computer program, the code is interpreted and exe- 
cuted by a human. Lehman (1980) wrote a seminal 
article on software engineering evolution. Lehman 
describes three classes of software programs. S-Type 
Systems are programs whose function is defined by 
a specification. Programs in this category are provably
correct if they meet the specification. The logic
blocks within a conventional distributed control system
(DCS) are a prime example of S-Type Systems.

The second class of programs is P-Type Systems,
which focus on solutions to real-world problems.
P-Type Systems incorporate feedback loops com- 
paring the measured state of the world to desired 
outcomes. Control systems are P-Type Systems. A 
controller may monitor a single or multiple param- 
eters and produce one or more control signals to 
change the state of the system. A vendor will pro- 
vide a detailed specification of a DCS’s inputs and 
outputs along with descriptions of how the inputs 
should map to the outputs. The functionality of the 
DCS can be verified over the range of anticipated 
operating conditions. However, system control and 
safety outside of normal operating conditions is 
intractable. Efforts can be made to make systems 
more robust through strict quality control, redun- 
dancy, incorporation of passive safety, and more 
resilient through advanced control schemes and 
artificial intelligence, but ultimately closed form 
solutions for every possible contingency will not 
exist. Arguably this is why it is still a good idea 
to have highly trained and knowledgeable human 
operators in the control room and human eyes and 
ears about the plant. Humans can be resilient when 
hardware and control systems are not. 

Lastly, Lehman describes E-Type Systems or 
embedded programs. Embedded programs are 
named so because they are part of the overall sys- 
tem. E-Type Systems are described as mechanizing 
a human or societal activity. Human decision mak- 
ing inherently involves some level of prediction as 
well as uncertainty, intuition, opinion, and judgement.
The validity of E-Type Systems lies in the
human domain: it rests on human assessments of
their effectiveness for the intended application.

Here we suggest that the procedures used to 
control nuclear power plant processes are E-Type 
Systems and as such should be treated distinctly 
from the S and P countertypes. First, the human
operators, maintenance workers, and engineers are
the embedded aspect of the energy production process.
A plant on its own would be incapable of producing
power, and as such it comprises a sociotechnical
system. Even if the plant configuration remained
static, the constraints, demands, and requirements
are constantly evolving. For example, plants incur
high downtime costs and are always striving to
increase their capacity factor. Second, the proliferation
of renewables presents load shaping
requirements for plants. Third, an aging workforce
presents challenges to preserving worker knowl- 
edge and skills needed to maintain legacy systems. 
Fourth, consider events like Fukushima Daiichi, or 
Ukrainian cyber-events and the influence they have 
on operations, upgrades, and maintenance. 

The distinction between S/P-Type Systems and
E-Type Systems is important because the two classes impose
distinct requirements on their respective life-cycles.
A control block that meets its functional specifica- 
tion can operate for decades with virtually no main- 
tenance. E-Type Systems on the other hand are 
inherently change prone. Operations and procedures 
change because of the drivers that from a pure con- 
trol systems perspective are purely exogenous. Proce- 
dures need to be adaptable and computer procedure 
systems need to support the authoring and continued 
maintenance of procedures as much as the actual running
of procedures. If E-Type Systems do not change, Lehman
suggests, their utility diminishes over time because
their environment is evolving. To remain useful, they
must change, and in doing so they tend to become more
complex; Lehman postulates a law of “Continuing
Change” to describe this phenomenon.

Unfortunately, continual change is also associ- 
ated with increased complexity (2nd law). Anecdo- 
tally, paper based procedures are becoming more 
complex to fill administrative gaps and remain 
technically accurate as systems change. Lehman’s 
6th law of continuing growth describes how the 
functional content of an E-type system must con- 
tinually increase to maintain user satisfaction. As 
an example, procedures have become embedded 
with notable user experience, which is a function 
never originally intended for procedures. While 
the increased complexity captures relevant process 
related information it also increases the cognitive 
difficulty of carrying out procedures. 

From this perspective, computer-based procedures
could be vital in managing the increasing
complexity associated with E-Type Systems.
Lehman’s 7th law predicts declining quality if an 
E-Type system is not rigorously maintained and 
adapted to changing operational environments. 

If procedures are more akin to E-Type systems
than P-Type systems, then maintaining the ability
to adapt and continuously modify procedures as
they evolve is of paramount importance.
By the conclusion of this document we formulate a 
strategy to aid utilities in transitioning from paper 
to CBP systems. The previous survey of computer- 
ized procedures suggests that the technology exists, 
and it is likely that the hurdles to adoption are not 
technologically based. In the context of moderniza- 
tion, computerized procedure systems are likely not 
expensive relative to costs associated with modern- 
izing instrumentation and controls. If not technical 
or financial, then perhaps regulatory considera- 
tions are the barrier to entry. In the next section we 
briefly review regulatory considerations for CBPs. 


5 U.S. NUCLEAR REGULATORY
COMMISSION GUIDANCE


Early work by the U.S. Nuclear Regulatory Commission
(NRC) concluded that CBPs were a desirable
goal but the implementation and adoption needed
to justify the use of CBPs over paper procedures
(O’Hara et al., 2000).

NUREG/CR-6634 provides a detailed techni- 
cal basis for CBP systems. The technical basis was 
developed by deconstructing the hierarchical influ- 
ence of human behavior on plant performance in 
order to identify how human errors cause unsafe 
circumstances. NUREG/CR-6634 expresses some 
valid concern regarding whether CBPs could lead 
to operators becoming disoriented amongst tabs, or 
losing situation awareness due to keyhole attention
effects, or out-of-the-loop loss of situation awareness.
Automation of control actions can also be
detrimental because it increases task complexity 
and can add confusion to diagnosing the root cause 
of a disturbance (Andresen, et al. 2003). CBPs can 
use plant process data to guide operators; however, 
if these systems are disrupted, the CBP may not 
function, or even provide incorrect assessments and 
guidance. Operators may become too trusting of
the guidance provided by a CBP and complacently
follow CBP directions.

The Electric Power Research Institute (EPRI) has 
guidance for developing Utility Requirements Docu- 
ments. According to O’Hara (2000), the guidance is 
based on EdF’s CBP experience. EPRI’s high level 
guidance covers human factors criteria describing 
the allowable presentation formats for procedures, as 
well as stipulating operators should have final author- 
ity in regard to control actions. It also has technical 
requirements for CBPs to verify operator decisions, 
provide logging capabilities, and redundancy in 
the case of loss of CBPs. The EPRI requirements 
include a validation of each operating procedure 
using the plant’s simulator and performance model. 
Plants would also need to make sure that alternative 
procedures are available in the event of loss of CBPs. 
Utility requirement documents would need to vali- 
date and verify CBPs with plant simulations. 

The NRC does not consider paper procedures to 
be a Human System Interface (HSI); however, computerized
procedures would be displayed through an
HSI and would need to meet the human factors requirements
of NUREG-0700 as well as NUREG/CR-6634.
For first adopters, the CBP would be
considered an unproven HSI technology until there
have been at least three years of documented satisfactory
service in a light water reactor (O’Hara 2000).


6 CATEGORIES OF COMPUTER BASED 
PROCEDURES 


IEEE Standard 1786-2011 describes three types
of CBPs along a continuum of increased automa- 
tion. The least automated are Type 1 and are some- 
times referred to as E-procedures in the literature. 
These are electronic versions of standard PBPs.
They provide the ability to view the procedure in
an electronic format and provide navigation links 
between procedures. The second category is Type 2 
procedures. These link to plant data in real time to 
provide process relevant information embedded in 
the procedure, perform logical assertions based on 
the procedure and available data, and provide links 
to process displays and soft-controls that reside in 
separate systems. Lastly, Type 3 procedures have 
embedded soft controls and procedure-based auto- 
mation that carries out a full procedure or several 
sub-steps of a procedure. 
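
The three categories can be summarized as a small decision rule. The Python sketch below is only an illustration of the continuum described above; the function and parameter names are hypothetical and do not come from the IEEE standard.

```python
from enum import IntEnum


class CBPType(IntEnum):
    """CBP categories, ordered by increasing automation."""
    TYPE_1 = 1  # electronic version of the paper procedure with navigation links
    TYPE_2 = 2  # adds live plant data, logic evaluation, links to external soft controls
    TYPE_3 = 3  # adds embedded soft controls and procedure-based automation


def classify_cbp(uses_live_plant_data: bool,
                 has_embedded_controls_or_automation: bool) -> CBPType:
    """Assign a category from two capability flags (a simplification)."""
    if has_embedded_controls_or_automation:
        return CBPType.TYPE_3
    if uses_live_plant_data:
        return CBPType.TYPE_2
    return CBPType.TYPE_1


print(classify_cbp(uses_live_plant_data=True,
                   has_embedded_controls_or_automation=False).name)
```

Using an ordered enumeration makes the "continuum" explicit: comparisons such as `CBPType.TYPE_2 < CBPType.TYPE_3` express that one category is less automated than another.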


7 STRATEGY FOR TRANSITIONING
PAPER PROCEDURES TO
COMPUTERIZED PROCEDURE
SYSTEMS


Based on the identified constraints and evolving 
nature of procedures it may be strategically advan- 
tageous to transition to a computer based format 
that is similar in organization and layout to existing 
procedures. The goal is to provide a process to tran- 
sition the existing corpus of PBPs into a computer- 
ized support system without substantively changing 
the organization, or format and layout of the proce- 
dures. The strategy is to avoid major refactoring that 
is costly, time consuming, and also has the potential 
of losing critical information contained within exist- 
ing PBPs. Plants have finite human resources to put 
towards the numerous modernization efforts. Mod- 
ernization schedules can easily run a decade or more 
into the future. The industry has a need for innova- 
tive solutions that can help change occur faster, and 
our thinking is centered on what can be done in a 
time effective manner, while offering a high value 
proposition, and satisfying or exceeding regulatory 
constraints. Such a strategy would avoid having
to completely disassemble, reorganize, and refactor
existing procedures, as well as ease the process of
transcribing to computerized representations. Maintaining
resemblances to paper counterparts also
eases training and verification and validation activi- 
ties necessary for regulation. Therefore, a solution 
should maintain a textual presentation in the con- 
ventional two column format. 

Metaphorically this approach would be 
described as not shooting for the moon, but rather 
planning a course towards the moon. Based on our 
proposition that procedures are E-Type Systems it 
is vitally important that utilities maintain the abil- 
ity to quickly and adeptly author and edit proce- 
dures without continued support from a vendor. 

Strategically translating procedures should be 
thought of as an editing process instead of a com- 
plete refactoring of how procedures are performed 
and constructed. The editing process would likely 
be iterative, and would contain the following steps: 


1. Multidisciplinary review of the existing procedures
   to identify and remedy shortcomings in
   the original procedure to prevent them from getting
   carried forward to a computerized format.

2. Provide Type 2 computerized procedure
   enhancements like live plant parameters, monitoring
   of continuous actions (Choi and Park,
   2012), decision support aids for integrating
   multiple variables, and real-time trends for predefined
   sets of parameters.

3. Software engineering routinely uses code quality
   metrics to assess portions of code that may
   be ambiguous or difficult to interpret. Once a
   plant’s procedures are represented in a form
   interpretable by a computerized procedure system
   it becomes possible to develop automated
   routines for finding indicators of poor quality
   procedures based on step complexity (Park,
   Kim, and Ha, 2003) or human reliability analysis
   performance shaping factors and unique
   failure modes relevant to computerized procedures
   (Boring et al., 2011).

4. Technical review of information in the procedure. 

5. Plant simulator validation of the procedure. 
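
The automated quality check mentioned in step 3 can be sketched in Python in the spirit of a code linter. This is a deliberately crude heuristic invented for illustration: it is not the step complexity measure of Park, Kim, and Ha (2003), and the verb list, logic-term list, and thresholds are all assumptions. It also assumes the common convention that action verbs and logic terms are written in capitals.

```python
import re
from typing import Dict, List

# Hypothetical vocabularies; a real plant would use its own writer's guide.
ACTION_VERBS = {"CHECK", "VERIFY", "OPEN", "CLOSE", "START", "STOP", "ENSURE"}
LOGIC_TERMS = {"IF", "AND", "OR", "NOT", "WHEN", "THEN"}


def step_indicators(step_text: str) -> Dict[str, int]:
    """Count capitalized action verbs and logic terms in one procedure step."""
    words = re.findall(r"[A-Za-z]+", step_text)
    return {
        "actions": sum(w in ACTION_VERBS for w in words),
        "logic_terms": sum(w in LOGIC_TERMS for w in words),
        "words": len(words),
    }


def flag_complex_steps(steps: List[str],
                       max_actions: int = 1,
                       max_logic_terms: int = 2) -> List[int]:
    """Return 1-based numbers of steps that exceed either assumed threshold."""
    flagged = []
    for number, text in enumerate(steps, start=1):
        indicators = step_indicators(text)
        if (indicators["actions"] > max_actions
                or indicators["logic_terms"] > max_logic_terms):
            flagged.append(number)
    return flagged


steps = [
    "VERIFY reactor trip breakers open",
    "IF SG level is low AND feedwater flow is NOT available, "
    "THEN START auxiliary feedwater pump AND OPEN discharge valve",
    "CHECK pressurizer pressure stable",
]
print(flag_complex_steps(steps))  # [2]
```

The second step is flagged because it bundles two actions behind compound logic. Flagged steps would be candidates for the multidisciplinary review in step 1, not automatic rewrites.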


None of the procedure authoring systems are 
intended to ease the translation from existing 
paper implementations. Here we have identified a 
gap in existing technology capabilities that we aim 
to fill. We envision a what-you-see-is-what-you- 
mean (WYSIWYM) authoring system based on 
Markdown and Jinja templating. Jinja templating
allows context processors to be added to a
backend that would, at rendering time, insert live values
and handle plant-dependent logical branching
as procedures are rendered on the display. Most
importantly, the representation in the procedure 
(template) would have simple syntax expressions. 
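
The idea of a context processor supplying live values at rendering time can be sketched as follows. Jinja is a third-party package, so to keep the example self-contained this sketch substitutes Python's standard-library `string.Template` as a stand-in; in Jinja the placeholders would be written as `{{ name }}` expressions and the branch could live in the template itself. The step wording, tag name, and limit are hypothetical and are not taken from the prototype described here.

```python
from string import Template
from typing import Dict

# The procedure author writes only simple placeholders in the step text.
STEP_TEMPLATE = Template(
    "4. CHECK SG level greater than ${sg_low_limit}% "
    "[current: ${sg_level}%]\n"
    "   ${sg_level_branch}"
)


def plant_context(plant_data: Dict[str, float]) -> Dict[str, str]:
    """Stand-in context processor: turn live data into template values."""
    sg_low_limit = 25.0  # illustrative limit, not a real setpoint
    level = plant_data["sg_level_pct"]
    if level > sg_low_limit:
        branch = "GO TO Step 5"
    else:
        branch = "Response not obtained: START auxiliary feedwater"
    return {
        "sg_low_limit": f"{sg_low_limit:.1f}",
        "sg_level": f"{level:.1f}",
        "sg_level_branch": branch,
    }


rendered = STEP_TEMPLATE.substitute(plant_context({"sg_level_pct": 22.5}))
print(rendered)
```

Keeping the logic in the context processor and only named placeholders in the step text is one way to keep the authored procedure close to its paper layout, which is the property the authoring approach is aiming for.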

We are currently developing a prototype compu- 
terized procedure system intended to conduct the 
full-scope simulator research needed to validate 
the engine and the strategy. The goal would be to 
transition the research framework to a commercial 
platform for computerized procedures.


ABSTRACT: Goals-Operators-Methods-Selection rules (GOMS) was originally developed as a task
analytic tool for modeling behavioral primitives in human users of human-computer interfaces. GOMS 
was recently adapted for Human Reliability Analysis (HRA), producing the GOMS-HRA method. The 
GOMS-HRA method provides a taxonomy of task level primitives in human activities that correspond 
to human error probabilities and task timing. The GOMS-HRA method has been used in computation- 
based HRA (CoBHRA), due to its calibration to the subtask level of human performance, the optimal 
decomposition level for dynamic risk modeling. GOMS-HRA has also been linked to procedures, and it 
is possible to map procedure steps (called procedure level primitives) to task level primitives. This paper 
introduces another important development to the GOMS-HRA framework—the task level errors, which 
represent the use of the GOMS-HRA taxonomy for predicting human error types. While many HRA 
methods map task types or task primitives to error rates, the prediction of error is often a generic error 
type. In reality, each task level primitive has predilections to certain types of errors. To model human error 
dynamically requires the determination of the types of errors that can occur. 


1 A FRAMEWORK FOR DYNAMIC
HUMAN RELIABILITY ANALYSIS


1.1 Human Unimodel for Nuclear Technology to
Enhance Reliability (HUNTER)


Researchers at Idaho National Laboratory set out
to create a simplified implementation of dynamic
Human Reliability Analysis (HRA), also known as
computation-based HRA (CoBHRA) due to consideration
of modeling beyond temporal dynamics.
The goal of this effort is to demonstrate proofs
of concept of modeling HRA dynamically by
adapting existing static HRA approaches to make
them dynamic, rather than develop an entirely
new dynamic HRA method. Because of the complexity
of modeling human behavior and cognition,
creating a new dynamic HRA method could
quickly approach the complexity of artificial intelligence
production systems. Thus, this simplified
CoBHRA approach should borrow from current
simplified, worksheet-based approaches to HRA,
seeking to find ways to automate the analysis tasks.
Further, the approach should integrate with other
modeling tools like thermal hydraulic codes that
are used for nuclear power plant simulations. This
new HRA framework came to be called the Human
Unimodel for Nuclear Technology to Enhance
Reliability (HUNTER; Boring et al., 2016). While
the initial implementation of HUNTER is focused
on adapting static approaches to HRA, the framework
is flexible and can in the future readily incorporate
other models including dynamic HRA
methods that have been developed in parallel at
other institutions.

Central to the first implementation of HUNTER
is quantification of the human error probability
(HEP). Quantification in many HRA methods
occurs in three stages (Swain and Guttman, 1983):

• Nominal HEP: the generic or default error rate
  for particular task types

• Basic HEP: the nominal HEP modified by contextual
  factors that increase or decrease errors

• Conditional HEP: the basic HEP modified to
  account for dependence and recovery

Some HRA methods like the Technique for
Human Error Rate Prediction (THERP; Swain
and Guttman, 1983) or Human Error Assessment
and Reduction Technique (HEART; Williams and
Bell, 2017) possess a large number of generic task
types with a wide range of nominal HEPs, while
simplified methods like the Standardized Plant
Analysis Risk-Human (SPAR-H; Gertman et al.,
2005) contain only two nominal task types. In most
HRA methods, the basic HEP is calculated using
performance shaping factors (PSFs), which simply
serve as multipliers on the nominal HEP. Finally,
for the conditional HEP, dependence suggests that 
error begets error, and a sequence of human tasks 
may have increased overall error rates once an initial 
human error occurs. In contrast, recovery restores 
the success path and serves to decrease the HEP. 
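
The three quantification stages can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the HUNTER implementation: the PSF multipliers in the example are invented, the cap at 1.0 is a simplification, recovery is not modeled, and the dependence adjustment uses the standard THERP dependence equations.

```python
from math import prod
from typing import Callable, Dict, Iterable


def basic_hep(nominal_hep: float, psf_multipliers: Iterable[float]) -> float:
    """Basic HEP: nominal HEP times the product of PSF multipliers.

    Capping at 1.0 is an assumption made here so the result stays a probability.
    """
    return min(1.0, nominal_hep * prod(psf_multipliers))


# THERP dependence equations, applied to the basic HEP p of the later task.
DEPENDENCE: Dict[str, Callable[[float], float]] = {
    "zero": lambda p: p,
    "low": lambda p: (1 + 19 * p) / 20,
    "moderate": lambda p: (1 + 6 * p) / 7,
    "high": lambda p: (1 + p) / 2,
    "complete": lambda p: 1.0,
}


def conditional_hep(basic: float, dependence_level: str) -> float:
    """Conditional HEP given that the preceding task failed (no recovery modeled)."""
    return DEPENDENCE[dependence_level](basic)


nominal = 0.001                          # a nominal HEP for some task type
basic = basic_hep(nominal, [2.0, 5.0])   # two hypothetical negative PSFs
conditional = conditional_hep(basic, "low")
print(nominal, basic, conditional)       # roughly 0.001, 0.01, 0.0595
```

Even low dependence raises the error probability by several times here, which shows why dependence after an initial error can dominate the overall result for tasks with small basic HEPs.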


1.2 Stages of HUNTER Quantification 


1.2.1 Nominal human error probability

To address nominal HEPs, HUNTER presently 
uses an approach called GOMS-HRA (Boring 
and Rasmussen, 2016). GOMS stands for Goals-
Operators-Methods-Selection rules, and serves as a
task analytic approach to evaluate human-system 
interactions (Card et al., 1983). Despite many 
implementations of GOMS for user interface appli- 
cations, GOMS has not historically been used for 
purposes of HRA. GOMS-HRA aligns its “opera- 
tors” (i.e., operations or tasks) to an extended ver- 
sion of the Systematic Human Error Reduction 
and Prediction Approach (SHERPA) taxonomy 
(Stanton et al., 2013; Boring and Rasmussen, 
2016), as depicted in Table 1. Here the GOMS 
operators are called Task Level Primitives (TLPs), 
and encompass the spectrum of operations in and 
beyond the control room. The TLPs are calibrated 
to scenarios in THERP to produce nominal HEPs 
as shown in Table 2. 

Note that the GOMS-HRA TLPs have recently 
been mapped to Procedure Level Primitives (PLPs; 
Boring et al., 2017; Ulrich et al., 2017). PLPs rep- 
resent simple procedure steps derived from a 


Table 1. GOMS operators used to define task level primitives.

Primitive  Description
Ac         Performing required physical actions on the control boards
Af         Performing required physical actions in the field
Cc         Looking for required information on the control boards
Cf         Looking for required information in the field
Rc         Obtaining required information on the control boards
Rf         Obtaining required information in the field
Ip         Producing verbal or written instructions
Ir         Receiving verbal or written instructions
Sc         Selecting or setting a value on the control boards
Sf         Selecting or setting a value in the field
Dp         Making a decision based on procedures
Dw         Making a decision without available procedures
W          Waiting


Table 2. HEPs associated with each task level primitive.

Operator  Nominal HEP  THERP Source*
Ac        0.001        20-12 (3)
Af        0.008        20-13 (4)
Cc        0.001        20-9 (3)
Cf        0.01         20-14 (4)
Rc        0.001        20-9 (3)
Rf        0.01         20-14 (4)
Ip        0.003        20-5 (1)
Ir        0.001        20-8 (1)
Sc        0.001        20-12 (9)
Sf        0.008        20-13 (4)
Dp        0.001        20-3 (4)
Dw        0.01         20-1 (4)
W         n/a          n/a

*Corresponds to THERP Table values from Chapter 20.


standardized list of operator commands. A common 
procedure step like CHECK VALVE readily maps to 
Cc, a single TLP. Other procedure steps may require
mapping to multiple TLPs. The PLPs provide a 
lookup table of common procedure steps to their 
respective TLPs. This lookup table allows easy anal- 
ysis reuse of commonly occurring operator actions. 
Moreover, it allows quick extraction of operator 
actions into an HRA model format, provided there 
is good procedure following by the crews. 
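
The PLP lookup described above can be sketched in Python as a pair of dictionaries. The nominal HEPs are the Table 2 values, with the primitive labels spelled in plain text as Ac, Af, Cc, and so on (an assumed rendering of the subscripted names). Only the CHECK to Cc mapping comes from the text; the READ and OPEN mappings are hypothetical, as is the choice to combine multiple TLPs by assuming independent failures.

```python
from math import prod
from typing import Dict, List

# Nominal HEPs per task level primitive (Table 2); W has no HEP and is omitted.
NOMINAL_HEP: Dict[str, float] = {
    "Ac": 0.001, "Af": 0.008, "Cc": 0.001, "Cf": 0.01,
    "Rc": 0.001, "Rf": 0.01, "Ip": 0.003, "Ir": 0.001,
    "Sc": 0.001, "Sf": 0.008, "Dp": 0.001, "Dw": 0.01,
}

# Procedure level primitive -> task level primitives.
PLP_TO_TLPS: Dict[str, List[str]] = {
    "CHECK": ["Cc"],        # from the text: CHECK VALVE maps to a single TLP
    "READ": ["Cc", "Rc"],   # hypothetical mapping for illustration
    "OPEN": ["Cc", "Ac"],   # hypothetical mapping for illustration
}


def tlps_for_step(step: str) -> List[str]:
    """Look up the TLPs for a step using its leading command verb."""
    verb = step.split()[0].upper()
    return PLP_TO_TLPS[verb]


def step_nominal_hep(step: str) -> float:
    """Probability that at least one TLP fails, assuming independent failures."""
    return 1.0 - prod(1.0 - NOMINAL_HEP[tlp] for tlp in tlps_for_step(step))


print(tlps_for_step("CHECK VALVE"))
print(step_nominal_hep("OPEN discharge valve"))
```

Because the table is keyed on the command verb, the same analysis is reused wherever that verb occurs, which is the reuse benefit noted above. An unrecognized verb raises a `KeyError`, making gaps in the lookup table visible rather than silently assigning a default.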


1.2.2 Basic human error probability 

SPAR-H (Gertman et al., 2005) is a simplified HRA 
method that features eight PSFs that serve as multi- 
pliers on the nominal HEP. The eight PSFs have dif- 
ferent influence levels and corresponding multipliers 
that are selected by the analyst according to whether 
the PSF has a positive effect (i.e., decreasing the HEP) 
or negative effect (i.e., increasing the HEP). In the 
HUNTER implementation, SPAR-H PSFs are auto- 
calculated (Rasmussen and Boring, 2016; Boring 
et al., 2017) where possible from available plant 
parameters. This process works for so-called exter- 
nal PSFs, which are situational factors that influence 
the operators’ performance. Internal PSFs—factors 
intrinsic to the operators—must still be assigned by 
the analyst or coded to change in particular contexts 
(e.g., the PSF for Fitness for Duty, which represents 
fatigue, may degrade after a long-duration event but 
not usually in response to particular plant param- 
eters). The auto-calculated SPAR-H PSFs modify 
the basic HEPs from GOMS-HRA in the current 
implementation of HUNTER. 
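
As a rough sketch of how PSF multipliers act on a nominal HEP, the snippet below multiplies a Table 2 value by a set of multipliers. The function name, the PSF labels, and the multiplier values are assumptions chosen for illustration, and capping the result at 1.0 is a simplification; SPAR-H documents its own correction for cases with several negative PSFs, which is not reproduced here.

```python
from math import prod

# Nominal HEPs per task level primitive, as listed in Table 2 (W has no HEP).
NOMINAL_HEP = {
    "Ac": 0.001, "Af": 0.008, "Cc": 0.001, "Cf": 0.01,
    "Rc": 0.001, "Rf": 0.01, "Ip": 0.003, "Ir": 0.001,
    "Sc": 0.001, "Sf": 0.008, "Dp": 0.001, "Dw": 0.01,
}


def basic_hep(tlp: str, psf_multipliers: dict[str, float]) -> float:
    """Scale the nominal HEP of a TLP by the product of its PSF multipliers."""
    hep = NOMINAL_HEP[tlp] * prod(psf_multipliers.values())
    return min(hep, 1.0)  # simplification: a probability cannot exceed 1


# Illustrative multipliers only; a multiplier above 1 degrades performance,
# a multiplier below 1 improves it, and 1 leaves the nominal HEP unchanged.
example = basic_hep("Cc", {"stress": 2.0, "complexity": 2.0, "fitness_for_duty": 1.0})
```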


1.2.3 Conditional human error probability 

HUNTER does not yet include an explicit model 
for conditional HEPs beyond the dependence 


model that THERP and SPAR-H share in com- 
mon. It is important to note that dynamic HRA 
modeling occurs at the subtask level. The human 
failure events (HFEs) used in static HRA represent 
hardware or process failures that may be influ- 
enced by human actions or inactions. These HFEs 
are often defined at the hardware component or 
system level, which does not meaningfully specify 
the spectrum of human activities associated with 
that hardware. Thus, an HFE may encompass a 
single step of a procedure or may require a whole 
procedure. The HFE level of modeling may require 
considerable task aggregation and expertise by 
analysts. In creating a virtual operator model in 
HUNTER, the focus necessarily falls not on the 
hardware but on the tasks the virtual operator must 
undertake, which will potentially have an effect on 
hardware process performance. In HUNTER and 
other dynamic HRA implementations, human 
activities are modeled at the subtask level, corre- 
sponding to activities surrounding each procedure 
step or equivalent for those activities that don’t fea- 
ture formal written procedures. This level of analy- 
sis is common in task analysis approaches used in 
human factors engineering, because it represents 
the most fundamental conscious level of action or 
cognition of the human (Rasmussen and Laumann, 
2017). 

Dependence in HRA is the notion that error 
begets error—that once an error has been commit- 
ted, it may prime subsequent errors unless there is 
a clear break in activities. Contemporary thinking 
frames dependence in terms of mindset (Whaley 
et al., 2007), which is to say once a human error 
occurs (e.g., misdiagnosis of a plant upset), that 
error will propagate until something breaks the 
chain of operations (e.g., realization of the misdi- 
agnosis). Dependence occurs because the HEP of 
downstream human activities is related to the first 
human error and will generally be larger than the cal- 
culated basic HEP. Most human error dependence 
modeling occurs at the task level—between HFEs— 
which may not readily translate into how depend- 
ence should be treated for subtasks—within HFEs. 
Work is ongoing in HUNTER (e.g., Boring, 2015) to 
develop better subtask dependence modeling. 
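
For reference, the THERP dependence equations can be written as a short function. This is a minimal sketch: the function name and the level labels are assumptions, and the equations are the standard THERP ones for zero, low, moderate, high, and complete dependence. They were formulated for dependence between tasks, so applying them unchanged between subtasks is exactly the open question noted above.

```python
def conditional_hep(basic_hep: float, level: str) -> float:
    """Conditional HEP of a task given failure of the preceding task (THERP)."""
    if not 0.0 <= basic_hep <= 1.0:
        raise ValueError("basic_hep must be a probability")
    formulas = {
        "zero": lambda p: p,
        "low": lambda p: (1 + 19 * p) / 20,
        "moderate": lambda p: (1 + 6 * p) / 7,
        "high": lambda p: (1 + p) / 2,
        "complete": lambda p: 1.0,
    }
    return formulas[level](basic_hep)
```

With a basic HEP of 0.001, high dependence raises the conditional HEP to roughly 0.5, which illustrates why the conditional value is generally larger than the calculated basic HEP.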


2 A COGNITIVE MODEL OF TASK 
LEVEL PRIMITIVES 


The GOMS-HRA TLPs represent a continuum of 
human cognition. Classical cognitive psychology, 
heavily influenced by information processing the- 
ory associated with computer technology, stressed 
three phases of cognition (Proctor and Vu, 2006): 


Input → process → output, 


which corresponds roughly to the following human 
activities: 


Sensation/perception → cognition → behavior. 


This framework to understand human cogni- 
tive functioning is echoed in several contemporary 
models. For example, Endsley’s (2005) situation 
awareness (SA) model divides the input and 
cognition phases into three separate levels: perception 
(Level 1 SA), comprehension (Level 2 SA), and 
projection (Level 3 SA), which culminate in a 
decision and corresponding action: 


perception → comprehension → projection 
→ decision → action. 


It should be noted that the modality of informa- 
tion and action can incorporate communication, 
whereby communication is a perceptual input or 
expressed action. 

Cacciabue and Hollnagel (1995) make a distinc- 
tion between microcognition and macrocognition. 
This distinction originally referred to the degree 
of context involved in decisions. Simplified 
cognition (microcognition) in carefully controlled 
laboratory experiments is stripped of the context 
required in real-world complex cognition 
(macrocognition). While this distinction remains 
useful for framing the validity of human factors 
research, the concept of macrocognition has been 
generalized to represent high-level cognition like 
decision making in a real-world context. The dis- 
tinction is helpful in understanding the focus of 
the GOMS-HRA task-level and procedure-level 
primitives. TLPs are microcognitive in that they 
represent the most basic level of human activities. 
PLPs are aggregated at a higher level to represent 
a series of cognitive functions and actions corre- 
sponding to a procedure step. It might be argued 
that the HFEs used as the unit of analysis in most 
HRAs represent a further macrocognitive group- 
ing of multiple steps. 

Recent work (Whaley et al., 2016) funded by the 
U.S. Nuclear Regulatory Commission (NRC) has 
identified several macrocognitive functions as the 
basis of the Integrated Decision-tree Human Event 
Analysis System (IDHEAS) HRA method (Xing 
et al., 2016). These macrocognitive functions are 
derived from research by Patterson and Hoffman 
(2012) and build on earlier functions identified by 
Klein et al. (2003). The IDHEAS macrocognitive 
functions include: 


detection and noticing → understanding and 
sensemaking → decision making → action 
→ teamwork, 


whereby teamwork may be seen as an overarching 
function rather than as a discrete tail-end activity. 


Table 3. Failures according to macrocognitive functions in Whaley et al. (2016).

Detecting/Noticing
  • Cues/information not attended to
  • Cues/information not perceived
  • Cues/information misperceived

Understanding/Sensemaking
  • Incorrect data
  • Incorrect frame
  • Incorrect integration of data, frames, or data with a frame

Decision Making
  • Incorrect goals or priorities
  • Incorrect pattern matching
  • Incorrect mental simulation or evaluation of options

Action
  • Failure to execute desired action
  • Execute desired action incorrectly

Teamwork
  • Failure of team communication
  • Failure in leadership/supervision


[Figure: diagram linking the operators Receive Instructions (Ir), Retrieve Information (R), Check Something (C), Make Decision (D), Take Action (A), Produce Instructions (Ip), Select Something (S), and Wait (W).]

Figure 1. Cognitive model of GOMS-HRA task level primitives. 


The IDHEAS macrocognitive functions are 
linked to proximate causes, which are error mani- 
festations. The IDHEAS proximate causes are 
listed in Table 3. 

Each of these proximate causes may be trig- 
gered for different reasons or cognitive mecha- 
nisms. For example, under the detecting/noticing 
macrocognitive function, if cues or information 
are not perceived, this could be caused by low sali- 
ence of the cues or information, inability to main- 
tain vigilance, inattention, mismatches between 
expected and actual cues (including biases toward 
particular information), or overloads of working 
memory. In turn, these cognitive mechanisms are 
influenced by PSFs (called performance influenc- 
ing factors in IDHEAS). 

The TLPs in GOMS-HRA as depicted in Table 1 
are not ordered in a particular cognitive flow. 
Figure 1 represents a cognitive model of the TLPs. 
The distinction between control room and field 
operations (as depicted with subscripted C and F 
in Table 1) is not included in Figure 1, because the 
location of operations affects the context and cor- 
responding error rates but not the function of the 
underlying cognition. The basic cognitive frame- 
work of GOMS-HRA mirrors a simple information 
processing model, with core functions related to 
sensation and perception (represented by R), 
cognition (represented by D), and action 
(represented by A). As a pragmatic model based on 
observable operational functions, GOMS-HRA precludes 
a detailed delineation between detecting/noticing 
and understanding/sensemaking. In fact, the original 
SHERPA taxonomy (Stanton et al., 2013), upon 
which the TLPs are based, does not include the rela- 
tively unobservable decision making (D) operator. 


Table 4. Crosswalk of IDHEAS macrocognitive functions with GOMS-HRA task level primitives. 

IDHEAS macrocognitive        GOMS-HRA task level primitive
function                     A     C     R     I          S     D     W
Detecting/Noticing                 ✓     ✓     ✓ (Ir)
Understanding/Sensemaking          ✓     ✓     ✓ (Ir)           ✓
Decision Making                                                 ✓
Action                       ✓     ✓           ✓ (Ip)     ✓
Teamwork                                       ✓ (Ip)

GOMS-HRA includes several common operational 
functions that impact sensation and perception 
on the input side and actions on the output 
side. Communication in the form of verbal or written 
instructions (I) is both a cognitive input (Ir) and 
output (Ip). Checking information (C) is an active 
instance of looking for information that encompasses 
both an overt action (A) and a retrieval (R). 
Selecting or setting a value (S) and producing a 
verbal or written instruction (Ip) are both special 
instances of an action (A). Waiting (W) in GOMS-HRA 
is an overriding category that is used to indicate 
delays in tasking or periods of inactivity due to 
monitoring activities. Waiting does not inherently 
consist of a cognitive function, but it has 
implications across all GOMS-HRA operators. 

The TLPs can be mapped to the IDHEAS macrocognitive 
functions as depicted in Table 4. Several of the 
TLPs span multiple macrocognitive functions but 
demonstrate a good match between concepts. 


3 TASK LEVEL ERRORS 


While it is helpful to automate many of the analytic 
processes in dynamic HRA as described earlier in this 
paper and in Rasmussen et al. (2017), such a model 
falls short in a key area: anticipating the types of 
errors that will likely occur for each TLP. The HUNTER 
framework outlined demonstrates a way to extract TLPs 
from procedure steps and a method to modify their 
influence using auto-calculated PSFs. Each TLP has 
particular error proclivities, which might be described 
as deviation paths from the normal path (U.S. Nuclear 
Regulatory Commission, 2000). Based on the proximate 
causes identified in the IDHEAS framework (see Table 3), 
here we map particular errors to the TLPs. These are 
called task level errors (TLEs). To avoid possible 
terminological confusion due to unique terms introduced 
through GOMS-HRA, key terms are summarized in Table 5. 


Table 5. Definition of key terms in GOMS-HRA. 

Term                        Abbreviation   Definition
Task Level Primitive        TLP            A basic human operation occurring at the
                                           subtask level. Multiple operations are
                                           typically required to achieve specific
                                           actions and goals.
Procedure Level Primitive   PLP            A human activity occurring at the procedure
                                           step level. Often, multiple task level
                                           primitives will be required to achieve a
                                           procedure level primitive activity.
Task Level Error            TLE            A nominal human error associated with a task
                                           level primitive. Each task level primitive
                                           is associated with multiple possible task
                                           level errors.

A mapping of TLPs to TLEs is found in Table 6. 
As noted, the TLEs are derived from the proximate 
causes in Whaley et al. (2016). However, the map- 
ping is not one-to-one. As depicted in Figure 1, 
some TLPs span multiple macrocognitive functions. 
Checking (C), for example, includes both an active 
operation of looking for information and the sub- 
sequent information input when the information 
is found. GOMS-HRA does not explicitly model 
teamwork, but much of the function of teamwork 
is covered in the Information Production (Ip) and 
Information Received (Ir) operations, which also 
cover both written and spoken communications. 
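
The TLP-to-TLE mapping of Table 6 lends itself to a plain lookup structure. The sketch below is one possible encoding, not part of HUNTER: the names `TASK_LEVEL_ERRORS` and `candidate_errors` are hypothetical, and dropping the location or procedure subscript (for example Cc to C) is an assumption based on Table 6 listing errors per operator family.

```python
# Task level errors per operator family, transcribed from Table 6.
TASK_LEVEL_ERRORS: dict[str, dict[str, str]] = {
    "A": {"TLE-A1": "Failure to execute desired action",
          "TLE-A2": "Execute desired action incorrectly"},
    "C": {"TLE-C1": "Wrong information checked",
          "TLE-C2": "Information missed",
          "TLE-C3": "Information misinterpreted",
          "TLE-C4": "Failure to check information"},
    "R": {"TLE-R1": "Information not attended to",
          "TLE-R2": "Information not perceived",
          "TLE-R3": "Information misinterpreted"},
    "Ip": {"TLE-IP1": "Failure to produce desired communication",
           "TLE-IP2": "Failure to produce correct communication"},
    "Ir": {"TLE-IR1": "Communication not attended to",
           "TLE-IR2": "Communication not perceived",
           "TLE-IR3": "Communication misinterpreted"},
    "S": {"TLE-S1": "Failure to select",
          "TLE-S2": "Selection made incorrectly"},
    "D": {"TLE-D1": "Incorrect goals or priorities",
          "TLE-D2": "Incorrect use of information",
          "TLE-D3": "Incorrect mental model"},
    "W": {"TLE-W1": "Incorrect inaction",
          "TLE-W2": "Waiting too long",
          "TLE-W3": "Waiting too short"},
}


def candidate_errors(tlp: str) -> dict[str, str]:
    """Return the TLEs for a TLP such as 'Cc', 'Ip', or 'W'.

    Ip and Ir have their own entries; for the other operators the subscript
    is dropped, so 'Cc' and 'Cf' both map to the 'C' family.
    """
    key = tlp if tlp in TASK_LEVEL_ERRORS else tlp[0]
    return TASK_LEVEL_ERRORS[key]
```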

Table 6. Task level errors in GOMS-HRA. 

TLP   Task level errors
A     TLE-A1: Failure to execute desired action
      TLE-A2: Execute desired action incorrectly
C     TLE-C1: Wrong information checked
      TLE-C2: Information missed
      TLE-C3: Information misinterpreted
      TLE-C4: Failure to check information
R     TLE-R1: Information not attended to
      TLE-R2: Information not perceived
      TLE-R3: Information misinterpreted
Ip    TLE-IP1: Failure to produce desired communication
      TLE-IP2: Failure to produce correct communication
Ir    TLE-IR1: Communication not attended to
      TLE-IR2: Communication not perceived
      TLE-IR3: Communication misinterpreted
S     TLE-S1: Failure to select
      TLE-S2: Selection made incorrectly
D     TLE-D1: Incorrect goals or priorities
      TLE-D2: Incorrect use of information
      TLE-D3: Incorrect mental model
W     TLE-W1: Incorrect inaction
      TLE-W2: Waiting too long
      TLE-W3: Waiting too short


The TLEs do not distinguish between errors 
of omission and errors of commission. An error 
of commission occurs when a person performs 
an unintended action. For example, an operator 
might accidentally activate a pump. In a plant, such 
actions can be particularly troublesome, because 
they may shift the plant beyond its standard operat- 
ing range and may confound the normal procedural 
sequence by triggering anomalies beyond hardware 
failures. As noted in Whaley et al. (2016), errors 
of commission manifest at the action phase. The 
thought by an operator to activate a pump (perhaps 
due to an incorrect situation assessment) is caused 
by faulty decision making, but it will have no effect 
until that decision is put into action. As such, the 
TLEs that are mapped to Action—namely A, Ip, C, 
and S—include provision for errors of commission. 
One critique against the macrocognitive functions 
like those found in IDHEAS is that they focus pri- 
marily on cognition to the exclusion of behavio- 
ral or activity functions. The GOMS-HRA TLPs 
attempt to redress some of this shortcoming, which 
is also reflected in an expanded series of behavior 
operations as depicted in Figure 1. 


4 CURRENT AND FUTURE 
APPLICATION OF TASK LEVEL 
ERRORS 


The task level errors in GOMS-HRA provide 
two important additions to HRA: (1) The human 
error probability will vary depending on the type 
of error. Although the same overt failure may 
manifest from different causes, it is not the same 
error and should not be treated as a single prob- 
ability value. This paper introduces a crosswalk of 
task level primitives to task level errors that can 
provide a more precise measure of human error 
and accompanying probabilities. (2) The task level 
errors bound the types of decision errors that can 
occur. This part of GOMS-HRA benefits decision 
trees, where branch points in operator decisions 
can have markedly different outcomes on event 
progression. Task level errors related to decisions 
can guide the likelihood of particular decision 
paths. Work still remains to specify such decision 
making and branching in HUNTER. 

TLEs present an opportunity to model differ- 
ent errors that may occur for each subtask that is 
modeled in GOMS-HRA. Each TLE is an oppor- 
tunity for error. As such, a human failure event is 
not the occurrence of a single error probability but 
rather the composite of multiple error types across 
multiple subtasks. The aggregation of these sub- 
tasks and co-likely error types presents a challenge 
for HRA aggregation. One way to link subtasks 
is through dependence. Dynamic dependence is 
far more complicated than the simple dependence 
used in static HRA methods, and the approach is 
not yet mature (Boring, 2015). Likewise, calculat- 
ing dynamic HEPs can be orders of magnitude 
more complex than their static HRA counterparts. 
To date, HUNTER has not resolved the subtask- 
by-error type aggregation issue. There remains 
work to be done to determine suitable equations 
and then to calibrate them to HEPs for HFEs that 
have been generated by human analysts. In bring- 
ing subtask aggregation into HUNTER, there is 
the promise for greater consistency in the level 
of human task decomposition in HRA. Previous 
research (Poucet, 1989) has shown that not con- 
trolling for the level of decomposition results in 
spurious estimates in HRA. GOMS-HRA, as an 
implementation of HUNTER, aims to map the 
layers of decomposition to the appropriate HEP, 
enabling true dynamic autocalculation of human 
error. 


5 DISCLAIMER 


The opinions expressed in this paper are entirely 
those of the author and do not represent official 
position. This work of authorship was prepared as 
an account of work sponsored by Idaho National 
Laboratory, an agency of the United States Gov- 
ernment. Neither the United States Government, 
nor any agency thereof, nor any of their employ- 
ees makes any warranty, express or implied, or 


assumes any legal liability or responsibility for the 
accuracy, completeness, or usefulness of any infor- 
mation, apparatus, product, or process disclosed, 
or represents that its use would not infringe pri- 
vately-owned rights. Idaho National Laboratory 
is a multi-program laboratory operated by Bat- 
telle Energy Alliance LLC, for the United States 
Department of Energy under Contract DE-AC07-05ID14517. 
This research was funded through the 
Laboratory Directed Research and Development 
program at Idaho National Laboratory. 


ABSTRACT: A computational modeling approach to human performance assessment in nuclear power 
plants is proposed. The approach takes into consideration three main processes involved in a human’s 
cognitive process: information perception, reasoning and response. Both the saliency and activation levels 
of a sensor signal in the control room are used to determine whether the signal can be actually perceived 
by the operator. The corresponding nodes in the operator’s knowledge base are activated by the perceived 
signals and the activation is propagated in the knowledge base to reach diagnoses and develop responses. 
Uncertainties in these processes are dealt with using sampling methods. The interplay between an opera- 
tor’s stress and fatigue levels and the cognitive process are considered. A case study derived from the Three 
Mile Island accident is conducted to demonstrate the proposed approach. 


1 INTRODUCTION 

In complex technological systems where opera- 
tor intervention is needed, such as nuclear power 
plants, chemical plants, and air traffic control 
centers, operator performance plays a central role 
in maintaining the normal operation of these sys- 
tems and in mitigating consequences in case of 
incidents. However, high operator performance is 
challenged by a number of factors, for instance 
the increasing complexity of modern technological 
systems and stricter requirements on the operator’s 
knowledge of those systems. As a result, human 
performance assessment becomes necessary and 
has drawn increasing attention since the 1970s. 
The assessment not only serves as an integral part 
of the overall risk assessment of the systems (e.g. 
probabilistic risk assessment for nuclear power 
plants), but also provides insights into the ways of 
improving human performance (e.g. optimization 
of the human-machine interface). 

To assess human performance, conventional 
methods, for instance the Technique for Human 
Error Rate Prediction (THERP) (Swain and 
Guttmann, 1983) and the Standardized Plant 
Analysis Risk-Human Reliability Analysis (SPAR- 
H) (Gertman et al., 2005), use performance 
shaping factors (PSFs), for instance stress and 
experience, to represent the context that influ- 
ences human performance and to adjust nominal 
human error probabilities. The number of PSFs 
varies between different methods, ranging from 
just one single factor (e.g. time available) up to 50 
or more in some current methods (Boring, 2010). 
In these methods, expert judgment is extensively 
used to assign the levels of PSFs and to determine 
the relationships between PSFs and human error 
probabilities. These lead to variations in the results 
of different methods and even of the same method 
but conducted by different individuals. Besides, 
these methods deal with human performance 
assessment at the macroscopic level, which means 
that they cannot provide the users detailed infor- 
mation about how an operator commits an error. 
To address this limitation, novel methods based 
on a representation of human cognition were 
introduced (Chang and Mosleh, 2007a; Li, 2013; 
Smidts et al., 1997; Zhao and Smidts, 2017). These 
methods explicitly take human cognitive processes 
into consideration, for instance information perception, 
reasoning and decision making. The most prominent 
advantage of this type of method is that it can 
provide a more detailed interpretation of a human error. 

In this paper, we further extend the capability 
of existing methods based on human cognition for 
human reliability assessment by proposing a com- 
putational cognitive modeling approach to human 
performance assessment in the context of nuclear 
power plants. In this approach, three main proc- 
esses of human cognition, information percep- 
tion, reasoning and response (Endsley, 1995), are 
modeled explicitly. The main contributions of our 
research include: 1) development of an approach 
to modeling information perception based on the 
saliency of and the attention paid by the operator 


to a piece of information; 2) application of acti- 
vation propagation in modeling the reasoning and 
response processes of an operator; 3) development 
of an approach to modeling the interplay between 
two significant PSFs, stress and fatigue, and the 
cognitive process; 4) development of an approach 
to modeling the uncertainties in the cognitive 
process based on sampling techniques; 5) integra- 
tion of the pieces of the cognitive process into a 
framework. The proposed method exhibits several 
advantages compared with conventional methods 
of human performance assessment. In addition to 
the capability of providing microscopic interpreta- 
tions of human errors, it is able to model the effect 
of task dependence on human performance, and 
to model the uncertainties in the human cogni- 
tive process based on simulation. The proposed 
approach is illustrated by the operator error in 
identifying the failure in the Pilot-Operated Relief 
Valve (PORV) in the Three Mile Island accident. 

The paper is organized as follows. Section 2 
provides a detailed introduction of the approach. 
In Section 3, the case study is introduced and the 
results are discussed. The paper is concluded in 
Section 4. 


2 METHODOLOGY 


2.1 Framework 


The framework of the approach is shown in 
Figure 1. It consists of three main modules: infor- 
mation perception, reasoning and response, and 
PSFs. 

In a nuclear power plant, an operator is faced 
with a large number of sensor signals in the con- 
trol room and uses them to infer the dynamics 
and state of the plant. The first module, that is 
the module of information perception, is used to 
determine which signals can be perceived by the 
operator. The process of information perception 
is influenced not only by the signals themselves, 
but also by certain PSFs and the operator’s mental 
state, as shown by the dashed lines in Figure 1. 

The signals that are actually perceived then enter 
the operator’s knowledge base and initiate the rea- 
soning process. The knowledge base is represented 
as a network of nodes representing for instance 
specific systems or phenomena. The reasoning 
process is modeled by activating the nodes in the 
knowledge base corresponding to the perceived 
signals, as illustrated by the light grey node in 
Figure 1, and propagating the activation through 
the knowledge base. At the end of this proc- 
ess, the mental state of the operator is updated, 
for instance a system failure may be identified as 
illustrated by the medium grey node in Figure 1. 

Figure 1. The framework of the approach. 


An updated evaluation of the plant state will 
direct the operator to take corresponding actions, 
as illustrated by the black node in Figure 1. The 
developed actions are then executed, which in turn 
influences the plant’s dynamics and state. System 
dynamics, information perception, reasoning, and 
response form a cycle, which represents the inter- 
action between the nuclear power plant and the 
cognitive process of an operator. 

Note that there are uncertainties 
in the information perception, reasoning, and 
response processes as to whether a sensor signal 
can be perceived and whether a node in the knowl- 
edge base can be retrieved. These uncertainties are 
modeled by simulating the processes with sampling 
methods. 

As shown in Figure 1, certain PSFs (i.e. stress 
and fatigue in our model) influence the cognitive 
process and the PSFs themselves are influenced by 
the reasoning and response processes. The inter- 
play between them is modeled in the proposed 
approach. 

The elements in the framework are introduced 
in further detail in the following sections. 


2.2 Information perception 


In a nuclear power plant, the data from the sen- 
sors plays a central role in the operator’s situation 
awareness. An operator is usually faced with a 
large amount of information from the sensors (e.g. 
pressure, temperature, flow, water level). However, 
a human’s capacity to perceive surrounding 
information is limited, so only a subset of the 
available information can be perceived by the 
operator (Endsley et al., 2003). 
There are also uncertainties in information percep- 
tion as to which signal or signals can be perceived 
by the operator (Chang and Mosleh, 2007b). These 


two factors mean that some information 
important for situation awareness may be missed 
by the operator, which may then result in human 
errors. As a result, information perception is an 
important element in the cognitive model. 

The process of information perception is influ- 
enced by several factors. Research in the field of 
cognitive science shows that the perception of a 
piece of information depends on two factors: the 
saliency of the signal and the attention paid by the 
human to the signal (Corbetta and Shulman, 2002; 
Rybak et al., 1998; Theeuwes, 2010; Wolfe, 1994). 
The first factor corresponds to the stimulus-driven 
process of information perception, and the second 
factor corresponds to the goal-directed process. 
This finding is used as the basis of modeling the 
process of information perception in our research. 

To be specific, the saliency of a signal is rep- 
resented as the average of the contributions from 
three sources: the signal’s variation, whether the 
signal is an alarm, and the signal’s importance. 
The variation measures the change of a signal with 
time. The more significant the change, the more 
likely it is that the signal successfully draws the 
operator’s attention. Whether a signal is an alarm 
or not indicates whether the signal has exceeded 
a specified threshold. An alarm is more likely to 
attract the operator’s attention compared with a 
regular signal. The importance here refers to the 
accumulated emphasis on a signal through train- 
ing or practice. For instance, the primary system 
pressure in a Pressurized Water Reactor (PWR) is 
usually emphasized more often compared to the 
position of a non-safety related valve during the 
training of an operator. As a result, the importance 
of the first signal is usually higher in the operator’s 
mind and more easily draws the attention of an 
operator. The attention paid by an operator to a 
signal is represented by the activation level of the 
node that corresponds to the signal in the opera- 
tor’s knowledge base. As will be introduced in 
Section 2.3, the activation levels of the nodes in 
the operator’s knowledge base are updated as the 
reasoning and response processes are ongoing. A 
node with a high activation level means that the 
node relates more to the operator’s current cogni- 
tive activity and the operator pays more attention 
to the node. Finally, the overall activation level of 
a signal can be calculated with Equation 1 below: 


$$A = \alpha S + \beta G = \alpha \cdot \frac{1}{3}\,(V + \mathit{Al} + I) + \beta G \qquad (1)$$


where A is the overall activation level, S is the 
saliency, G is the goal-related contribution, V is the 
variation, Al represents whether the signal is an 
alarm or not, I is the importance, α is the weight 
for the stimulus-driven contribution, and β is the 
weight for the goal-directed contribution. 
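
A minimal numerical sketch of Equation 1 follows. The scaling of V, I, and G to the range 0 to 1, the encoding of the alarm indicator as 0 or 1, and the default weights of 0.5 are all assumptions made for illustration; the source does not state them.

```python
def overall_activation(variation: float, is_alarm: bool, importance: float,
                       goal: float, alpha: float = 0.5, beta: float = 0.5) -> float:
    """Overall activation A = alpha * S + beta * G, with S the mean of V, Al, and I."""
    alarm = 1.0 if is_alarm else 0.0                    # Al encoded as 0 or 1 (assumption)
    saliency = (variation + alarm + importance) / 3.0   # average of the three sources
    return alpha * saliency + beta * goal


# A rapidly changing, alarmed, fairly important signal that is only weakly
# related to the operator's current goal.
a = overall_activation(variation=0.9, is_alarm=True, importance=0.6, goal=0.2)
```

Raising `alpha` relative to `beta` makes perception more stimulus-driven, while raising `beta` favours signals tied to the operator's current line of reasoning.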

With the overall activation level of a signal, the 
probability that the signal is perceived is calculated 
with Equation 2 below (Anderson et al., 2004): 


$$p = \frac{1}{1 + e^{(T - A)/s}} \qquad (2)$$


where p is the probability that the signal is 
perceived, T is the activation level threshold, and 
s is the noise in information perception. Both 
parameters are greater than 0. 

From Equation 2, the higher A is, the higher the 
probability p is. When A is greater than T, p is 
greater than 0.5, and in this case the smaller s is, 
the higher p is, which means that the operator’s 
attention is more focused. In contrast, when A is 
smaller than T, p is smaller than 0.5, and in this 
case the higher s is, the higher p is, which means 
that the operator’s attention is more dispersed. 
Based on the probability, sampling methods can be 
used to determine whether the signal is actually 
perceived. 
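
The perception probability and the sampling step can be sketched as below, assuming the logistic form p = 1 / (1 + exp((T - A) / s)) that matches the behaviour described above. The function names and the use of a single Bernoulli draw per signal are assumptions; the source only states that sampling methods are used.

```python
import math
import random


def perception_probability(activation: float, threshold: float, noise: float) -> float:
    """Probability that a signal with overall activation A is perceived."""
    if threshold <= 0 or noise <= 0:
        raise ValueError("threshold and noise must be greater than 0")
    return 1.0 / (1.0 + math.exp((threshold - activation) / noise))


def is_perceived(activation: float, threshold: float, noise: float,
                 rng: random.Random) -> bool:
    """Draw one Bernoulli sample to decide whether the signal is perceived."""
    return rng.random() < perception_probability(activation, threshold, noise)


rng = random.Random(0)  # fixed seed so that a simulation run can be repeated
samples = [is_perceived(0.8, 0.5, 0.2, rng) for _ in range(1000)]
```

Repeating the draw over many simulated runs gives the fraction of runs in which the signal is perceived, which approaches the probability from Equation 2 as the number of runs grows.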


2.3 An operator’s knowledge base 


The reasoning and response processes are imple- 
mented in the operator’s knowledge base. Before 
introducing these processes, a brief description of 
the representation of the knowledge base is pro- 
vided in this section. 

In our method, an operator’s knowledge base is 
represented as a network (Zhao et al., 2017; Zhao 
and Smidts, 2017). In the network, the nodes are 
used to represent their counterparts in the physical 
world. They are classified into four categories: sys- 
tem, process, action, and state. System nodes rep- 
resent systems or components in a nuclear power 
plant, for instance the reactor coolant piping and 
the relief valve in the pressurizer in a PWR. Proc- 
ess nodes represent the physical variables or phe- 
nomena in a nuclear power plant, for instance the 
pressure level in the primary system and hydrogen 
ignition in the containment. Action nodes repre- 
sent the possible operator actions during the pro- 
gression of an incident. State nodes here are used 
to represent the possible states of system and proc- 
ess nodes, for instance “open” and “closed” states 
for a relief valve and “high” and “normal” states 
for primary system pressure level. 

In the network, the relationships between the 
nodes are modeled with three types of links: affili- 
ation, AND gate, and OR gate. Affiliation links 
are used to represent the relationships between 
state nodes and the corresponding system or proc- 
ess nodes. There are no directions on affiliation 
links. AND and OR gates are used to represent the 
cause-effect relationships between state nodes, and 
between state nodes and action nodes. The second 
class of cause-effect relationships is meant to model 
the fact that operator actions are guided by spe- 
cific states of systems or process variables and the 
actions influence the states in turn. For instance, 
when the operator realizes that the reactor coolant 
piping is in the “break” state, he/she will take cor- 
responding actions such as “shut down reactor”. 

In addition to the nodes and links in the net- 
work, three other elements are tied to some nodes. 
Specifically, beliefs, which are greater than 0, are 
tied to state nodes to represent the operator’s belief 
in each state. Activation levels, which range from 0
to an upper limit $A_{up}$, are tied to all the nodes to
represent the operator’s attention on the nodes.
Observability is tied to state nodes to indicate 
whether a state is observable or not to the opera- 
tor. Observability has two values: 1 if the node is 
observable, and 0 otherwise. 
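
The structure described above can be sketched as a small data model. This is only an illustrative sketch, not the authors' implementation: the class names, the field names and the `owner:state` key format are assumptions introduced here.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeKind(Enum):
    SYSTEM = "system"
    PROCESS = "process"
    ACTION = "action"
    STATE = "state"


@dataclass
class Node:
    name: str
    kind: NodeKind
    activation: float = 0.0  # operator's attention, from 0 to the upper limit
    belief: float = 1.0      # used for state nodes only, must stay above 0
    observable: int = 0      # used for state nodes only: 1 or 0


@dataclass
class Gate:
    kind: str           # "AND" or "OR"
    causes: list[str]   # names of the causal state or action nodes
    effect: str         # name of the effect state or action node


@dataclass
class KnowledgeBase:
    nodes: dict[str, Node] = field(default_factory=dict)
    # Undirected affiliation links, stored as state name -> owner name.
    affiliation: dict[str, str] = field(default_factory=dict)
    gates: list[Gate] = field(default_factory=list)

    def add_state(self, owner, state, observable=0, belief=1.0):
        """Create a state node and affiliate it with a system or process node."""
        key = f"{owner}:{state}"
        self.nodes[key] = Node(key, NodeKind.STATE,
                               belief=belief, observable=observable)
        self.affiliation[key] = owner
        return key

    def states_of(self, owner):
        """Return the names of all state nodes affiliated with `owner`."""
        return [s for s, o in self.affiliation.items() if o == owner]
```

For the relief valve mentioned above, `add_state("relief valve", "open", observable=1)` and `add_state("relief valve", "closed", observable=1)` would create its two state nodes.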

An example knowledge base is shown in 
Figure 2. 


2.4 Reasoning and response 


Information perception plays an important role 
in the operator’s situation awareness and response 
development. However, perception of the right 
information does not guarantee a successful diag- 
nosis of the situation and an appropriate response. 

Errors may be introduced in the reasoning and 
response processes. The operator’s thought may 
be stuck at one point in either process. This means 
that the operator cannot identify the failure of a 
system, or cannot develop the appropriate response 
even if the operator has identified the failure. 

To cover these possible errors, a method of
modeling the reasoning and response processes
based on activation propagation (Anderson, 1983;
Anderson et al., 2004) and updating of belief
is proposed. Based on the updated beliefs in the
nodes, diagnoses will be reached and responses will
be developed. The modeling method is illustrated
in Figure 3 with the same example as in Figure 2.

Figure 2. An example knowledge base.

In the first step (see s1 in Figure 3), the node in
the operator’s knowledge base corresponding to
the perceived signal is activated, and the activation
level of this node is raised to the upper limit $A_{up}$.
In this example, low pressure in the primary system is
observed, and the activated node is marked in light
grey in Figure 3. Note that further inference is made
only if the perceived information is inconsistent with
the operator’s belief. This is modeled as follows.
First, for a perceived node, the ratio of the belief in
the node to the sum of the beliefs in all the state nodes
of the corresponding process node (i.e. “primary
system pressure” in this example) is calculated. This
ratio is then compared with a threshold (e.g. 0.8)
to determine whether the perceived information is
consistent with the operator’s belief. If the
ratio is greater than the threshold, the reasoning
process is not implemented because the perceived
information is normal, at least in the operator’s mind.
Otherwise, further inference is made, and the belief in
the perceived node is raised by a predefined value $\Delta B$.
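
The consistency check in this first step can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the belief increment of 0.5 is an illustrative value that the text does not specify.

```python
def belief_ratio(beliefs, perceived):
    """Belief in the perceived state divided by the sum over all states."""
    return beliefs[perceived] / sum(beliefs.values())


def perceive(beliefs, perceived, threshold=0.8, delta_b=0.5):
    """Return (needs_inference, updated_beliefs) for one perceived state.

    `beliefs` maps each state of one process node to the operator's belief.
    `delta_b` is an assumed value for the predefined belief increment.
    """
    updated = dict(beliefs)
    if belief_ratio(beliefs, perceived) > threshold:
        # The signal matches what the operator already believes: no reasoning.
        return False, updated
    # The signal is inconsistent with the belief: raise it and reason further.
    updated[perceived] += delta_b
    return True, updated
```

With equal initial beliefs in three states, the ratio for any perceived state is 1/3, which is below 0.8, so further inference is triggered.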

Figure 3. Illustration of the modeling of the reasoning
and response processes.

In the second step (see s2 in Figure 3), the activation
of the perceived node is propagated from the bottom
(i.e. the perceived node) up to the causal state
nodes to infer the possible system failures that gave
rise to the erroneous situation. In the example in
Figure 3, the activation propagates from the “low”
state of “primary system pressure” to two causal
state nodes, “standby” of “injection system” and
“break” of “reactor coolant system”, through the
AND gate. In this case, activation propagation and
belief updating are carried out as follows. First,
Equation 2 is used to calculate the probability of a
causal node being retrieved, and sampling methods are
applied to determine whether the causal node is
actually retrieved. A in Equation 2 is now the
activation level of a causal node, for instance
“standby” or “break” in this example. Then, for each
retrieved causal node, its activation level is raised to
the upper limit $A_{up}$, and its belief level is raised
by $\Delta B$. If no node is retrieved, the reasoning
process stops at this point, which means that the
operator’s thinking process is stuck at the perceived
node. For the cases of an OR gate and of multiple gates
between the perceived node and its causal nodes,
similar procedures are followed. Usually, there are
several layers of nodes between the perceived node and
system failure nodes rather than direct links as in the
example in Figure 3. In this case, activation
propagation and belief updating can be carried out step
by step from the bottom up following the same
procedure. The result of this step is also the result of
the reasoning process. It outputs the diagnosis of
possible system failures (e.g. “break” in medium grey
in Figure 3).
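
The sampling in this second step can be sketched as below. Equation 2 is not reproduced in this part of the paper, so the sketch assumes the logistic retrieval probability commonly used with activation-based models, with a retrieval threshold `tau` and a noise parameter `s`. That form, and all function and parameter names, are assumptions made for illustration.

```python
import math
import random


def retrieval_probability(activation, tau, s):
    """Assumed form of Equation 2: logistic in (activation - tau) / s."""
    return 1.0 / (1.0 + math.exp(-(activation - tau) / s))


def propagate_bottom_up(causal_nodes, activation, belief,
                        a_up, delta_b, tau, s, rng):
    """Sample which causal nodes are retrieved and update them in place.

    Returns the list of retrieved nodes. An empty list means the operator's
    reasoning is stuck at the perceived node.
    """
    retrieved = []
    for node in causal_nodes:
        if rng.random() < retrieval_probability(activation[node], tau, s):
            activation[node] = a_up   # raise attention to the upper limit
            belief[node] += delta_b   # raise belief by the fixed increment
            retrieved.append(node)
    return retrieved
```

Under this assumed form, an activation of 0.5 with a threshold of 0.4 and a noise parameter of 1 gives a retrieval probability of roughly 0.52, so each causal node is retrieved in about half of the samples.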

In the third step (see s3 in Figure 3), if the ratio
of the belief in the failure state of a system to the
sum of the beliefs in all the states of that system is
higher than a threshold $Th_a$, the operator will take
corresponding actions, for instance “injection action”
in the example in Figure 3. This step is also subject
to errors of omission, so Equation 2 and sampling
methods are used again to determine whether the
operator will develop the response. In this calculation,
A in Equation 2 is the activation level of an
action node, for instance “injection action” in this
example. If the action node is retrieved, its activation
level is raised to the upper limit $A_{up}$. An action
will affect the state of the corresponding system
(e.g. “operational” of “injection system”). The
activation level of the affected state is then raised to
$A_{up}$, and the belief in this node is raised by $\Delta B$.

Situation awareness and response development
have been modeled in the first three steps. In the
last step (see s4 in Figure 3), activation propagation
is implemented from the top (i.e. identified system
failures and executed actions) down to process
nodes. This represents the fact that the operator
will pay more attention to the system response
related to the executed actions and the identified
system failures, and those signals that represent the
system response are more likely to be perceived in
the next time step. Specifically, the process starts
from the activated state nodes (i.e. “break” and
“operational” in this example), and ends at the
last state node (e.g. “normal” in this example). It
follows a slightly different procedure from that of
the bottom-up process. For the example in Figure 3,
as both “operational” and “break” have been activated,
their activation is propagated to the OR gate through
the AND gate. In addition to propagating the activation
to the effect node (i.e. “normal”), activation
propagation to the other causal nodes of the effect
node is also implemented, because all these nodes will
draw the operator’s attention. Equation 2 and sampling
methods are used to determine whether a node can be
retrieved or not. For the retrieved nodes, their
activation levels are raised to $A_{up}$. The change in
the process variables of the nuclear power plant is
usually delayed for some time after operator actions are
taken, so the operator’s beliefs in the states of these
process variables are not updated. As a result, belief
updating is not implemented in the top-down process.

In addition to activation increase resulting from 
the propagation from other nodes, the activation 
levels of the nodes in the knowledge base are also 
subject to decay (Endsley et al., 2003), which is cal- 
culated with Equation 3: 
$$A(t + \Delta t) = A(t)\, e^{-\lambda \Delta t} \qquad (3)$$
where $\Delta t$ is the time duration between two discrete
time steps, and $\lambda$ is the decay constant.
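
As a small numerical illustration of the decay step, the sketch below assumes that Equation 3 is an exponential decay with a decay constant. The function name and the example values are hypothetical.

```python
import math


def decay(activation, decay_constant, dt):
    """Activation after one time step of length dt, assuming exponential decay."""
    return activation * math.exp(-decay_constant * dt)


def decay_all(activations, decay_constant, dt):
    """Apply the same decay to every node in a name -> activation mapping."""
    return {name: decay(a, decay_constant, dt)
            for name, a in activations.items()}
```

With a decay constant of ln 2 and a unit time step, every activation level halves per step, which makes the decay constant easy to interpret as a half-life.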


2.5 Integration of stress and fatigue into the model


All the processes of information perception, 
reasoning and response are influenced by cer- 
tain PSFs. On the other hand, the PSFs are also 
influenced by the operator’s mental state, which is 
updated by the three processes. In our research, the
interplay between two significant PSFs (i.e. stress
and fatigue) and the three processes is modeled.

With respect to the effects of stress and fatigue 
on human performance, the major effects of the 
stress level can be summarized as: 1) people under 
high stress tend to focus on what they are focused 
on and ignore the other information (Endsley, 
1995); 2) people under high stress tend to make 
premature decisions (Endsley, 1995); and 3) high 
stress influences human performance through 
decrements in working memory retrieval (Weerda 
et al., 2010). These effects are modeled by adjusting
the parameters in the model accordingly, following
Equations 4 through 7 below:

$$\frac{\alpha}{\beta} = \frac{\alpha^{*}}{\beta^{*}} \times \mathit{stress} \qquad (4)$$

$$s = s^{*} \times \frac{1}{\mathit{stress}} \qquad (5)$$

$$Th_a = Th_a^{*} \times \frac{1}{\mathit{stress}} \qquad (6)$$

$$\lambda = \lambda^{*} \times \mathit{stress} \qquad (7)$$

where $\alpha$ and $\beta$ are the weights in Equation 1, $s$ is the
noise parameter in Equation 2, $Th_a$ is the threshold
for action development, $\lambda$ is the decay constant in
Equation 3, and stress is the stress level. The parameters
with a star superscript denote the parameters
before they are adjusted by stress. Equations 4 and 5,
Equation 6, and Equation 7 correspond to the three
effects respectively.

The major effects of the fatigue level can be 
summarized as: 1) people with high fatigue are eas- 
ily distracted by irrelevant subjects (Boksem et al., 
2005); and 2) people with high fatigue tend to com- 
mit errors more easily because of the decrement of 
the activation level (Li, 2013). These effects are 
modeled by adjusting the parameters following 
Equations 8 through 10 below: 


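$$\frac{\alpha}{\beta} = \frac{\alpha^{*}}{\beta^{*}} \times \frac{1}{\mathit{fatigue}} \qquad (8)$$

$$s = s^{*} \times \mathit{fatigue} \qquad (9)$$

$$\lambda = \lambda^{*} \times \mathit{fatigue} \qquad (10)$$

where $\alpha$, $\beta$, $s$, and $\lambda$ are the same parameters as
in Equations 4, 5, and 7, and fatigue is the fatigue
level of the operator. Equations 8 and 9, and Equation 10,
correspond to the two effects respectively.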
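
The parameter adjustments for stress and fatigue can be sketched as two small functions. The sketch assumes the reading of Equations 4 through 10 in which the weight ratio and the decay constant are multiplied by the stress level, the noise parameter and the action threshold are divided by it, and fatigue divides the weight ratio while multiplying the noise parameter and the decay constant. The class and field names are hypothetical, and how the two adjustments combine when both apply is not specified in the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Params:
    alpha_over_beta: float  # ratio of the two weights in Equation 1
    s: float                # noise parameter in Equation 2
    th_a: float             # threshold for action development
    lam: float              # decay constant in Equation 3


def adjust_for_stress(base, stress):
    """Assumed reading of Equations 4 to 7; `base` holds the starred values."""
    if stress <= 0:
        raise ValueError("stress must be positive")
    return replace(base,
                   alpha_over_beta=base.alpha_over_beta * stress,
                   s=base.s / stress,
                   th_a=base.th_a / stress,
                   lam=base.lam * stress)


def adjust_for_fatigue(base, fatigue):
    """Assumed reading of Equations 8 to 10; the action threshold is unchanged."""
    if fatigue <= 0:
        raise ValueError("fatigue must be positive")
    return replace(base,
                   alpha_over_beta=base.alpha_over_beta / fatigue,
                   s=base.s * fatigue,
                   lam=base.lam * fatigue)
```

Under this reading, a higher stress level lowers the action threshold (earlier, possibly premature decisions) and raises the decay constant (faster loss of activation).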

As to the effects of the cognitive process on 
PSFs, the stress level is updated every time step 
with Equation 11: 


$$\mathit{stress} = \frac{1}{N_s} \sum_{i=1}^{N_s} R_i \qquad (11)$$

where $N_s$ is the number of system and process nodes
related to negative consequences, for instance
“reactor coolant piping” in Figure 3, and $R_i$ is the
ratio of the belief in the state node corresponding
to negative consequences, for instance “break” in
Figure 3, to the sum of the beliefs in all the
state nodes of system or process node $i$.
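
The stress update can be illustrated with a short function that averages the belief ratios over the nodes related to negative consequences. The function name, the argument layout and the “intact” state in the usage example are assumptions for illustration.

```python
def stress_level(beliefs_by_node, negative_state):
    """Mean, over the relevant nodes, of the belief ratio of the negative state.

    beliefs_by_node: node name -> {state name -> belief}
    negative_state:  node name -> name of the state tied to negative consequences
    """
    ratios = []
    for node, beliefs in beliefs_by_node.items():
        bad = negative_state[node]
        ratios.append(beliefs[bad] / sum(beliefs.values()))
    return sum(ratios) / len(ratios)
```

For a single node “reactor coolant piping” with beliefs of 3 in “break” and 1 in a hypothetical “intact” state, the stress level is 0.75.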

The fatigue level is updated every time step with 
Equation 12: 


$$\mathit{fatigue} = \omega_1 \frac{1}{N_t N} \sum_{j=1}^{N_t} \sum_{i=1}^{N} A_i(t - j\Delta t) + \omega_2 \min\left(\frac{T}{T^{*}}, 1\right) \qquad (12)$$

where $N_t$ is the number of past time steps from the
current time point, $N$ is the number of nodes in the
knowledge base, $A_i$ is the activation level of node $i$,
$\Delta t$ is the time duration between two time steps,
$T$ is the time the operator has spent
on the accident management, $T^{*}$ is the threshold
for the time period that leads to the exhaustion of
an operator, and $\omega_1$ and $\omega_2$ are weights. This
calculation represents the fact that the fatigue level
of an operator has two sources: the attention the
operator has paid to the tasks, and the length of time
the operator has been involved in the tasks.
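
A sketch of the fatigue update is given below. It assumes that Equation 12 is a weighted sum of two terms: the mean activation over all nodes and recent time steps, and the time on task relative to the exhaustion threshold, capped at 1. The function and argument names are hypothetical.

```python
def fatigue_level(history, time_on_task, t_star, w1, w2):
    """Weighted sum of mean past activation and capped relative time on task.

    history: one list of node activation levels per past time step,
             all of the same length.
    """
    n_steps = len(history)
    n_nodes = len(history[0])
    mean_activation = sum(sum(step) for step in history) / (n_steps * n_nodes)
    time_term = min(time_on_task / t_star, 1.0)
    return w1 * mean_activation + w2 * time_term
```

The cap on the second term means that, once the time on task exceeds the exhaustion threshold, only the attention term can still change the fatigue level.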


3 CASE STUDY 


A simple case study is conducted with the pro- 
posed computational modeling approach. Results 
are obtained and discussed. 


3.1 Case introduction 


The case studied is derived from the Three 
Mile Island accident. To be specific, it refers to 
the operators’ failing to identify the failure in the 
Pilot-Operated Relief Valve (PORV) at the top of 
the pressurizer which caused the valve to stick open 
during the accident and aggravated the situation
(Rogovin, 1979). During the accident progression, 
there were several instrument readings relevant to 
this failure available to the operator in the control 
room. Some were misleading but the others pro- 
vided useful evidence of this failure. Unfortunately, 
the operator paid too much attention to the mis- 
leading readings and ignored the useful readings for 
over 2 hours. The misleading instrument readings 
include high water level in the pressurizer which 
in fact was caused by steam voids and gas in the 
reactor coolant system, and closed indicator of the 
PORV which was only circumstantial evidence of 
the actual state of the PORV. Useful instrument 
readings include low pressure of the reactor coolant
system because of the release of coolant from
the stuck-open PORV, and high temperature downstream
of the PORV, which was also caused by the
release of high-temperature coolant from the PORV.

To provide a clear illustration of the essence of the 
proposed approach, the case study is simplified in the 
following ways. Firstly, the task in the case study is 
defined as to identify the failure in the PORV mecha- 
nism. This means we are just focused on the informa- 
tion perception and reasoning processes. The actions 
following this diagnosis are not considered. Secondly, 
only one time step is considered, and the dynamical 
environment around the operator which needs to be 
represented by multiple time steps is not considered. 
This means that the information perception and rea- 
soning processes are implemented one time. Thirdly, 
the operator’s knowledge base about this situation is 
simplified as in Figure 4, which only consists of the 
aforementioned relevant instrument readings and the 
PORV. Six nodes without any notation are added in
Figure 4 to represent the distracting effects of irrelevant 
readings. It is assumed that perception of these nodes 
has no contribution to the completion of the task. 

Figure 4. The knowledge base in the PORV case study.

In the case study, “primary system pressure”,
“PORV position”, “temperature downstream of
PORV”, “pressurizer water level”, and the six
irrelevant nodes are observable. The instrument readings
indicate that the states of the relevant nodes are
“low”, “closed”, “high”, and “high” respectively.
Normally, the task is seen as a success only when the
ratio of the belief in the “failed” state to the sum of
the beliefs in all the state nodes of “PORV mechanism”
is above the action threshold $Th_a$. However, the
cognitive process is implemented for only one time step
in the case study. This means that the belief in the
“failed” state may be increased but may not rise above
$Th_a$. As a result, the success criterion is relaxed to
be the activation of the “failed” state: once
the “failed” state is reached in the reasoning process,
the task is considered successful. The values
of the parameters are as follows: $\tau$, the initial value
of A for each node, and s in Equation 2 are 0.4, 0.5,
and 1 respectively. The initial belief in each state is
1 and the default belief ratio threshold is 0.8. These
values are used only to illustrate the calculation, and
do not necessarily reflect the real situation.


3.2 Results 


The information perception and reasoning processes 
for the case study are simulated 1000 times. The per- 
ceived nodes and activation propagation paths in 
the knowledge base are recorded at each time. The 
results are analyzed from the following two aspects. 

First, the probability of success is calculated
with Equation 13:

$$P = \frac{N_{suc}}{N_{sim}} \qquad (13)$$

where $N_{suc}$ is the number of successes, and $N_{sim}$ is the
number of simulations, which is 1000 in the case
study. The simulation results show that the probability
of success is 0.39.
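
The estimate is a plain Monte Carlo frequency, so its sampling uncertainty can be gauged with the binomial standard error. The sketch below is illustrative only: `simulate_once` stands for one run of the perception and reasoning model and is a hypothetical callable, not part of the paper.

```python
import math
import random


def estimate_success(simulate_once, n_runs, seed=0):
    """Return (success probability, binomial standard error) over n_runs.

    simulate_once: callable taking a random.Random and returning True on success.
    """
    rng = random.Random(seed)
    successes = sum(1 for _ in range(n_runs) if simulate_once(rng))
    p = successes / n_runs
    std_err = math.sqrt(p * (1.0 - p) / n_runs)
    return p, std_err
```

For a success probability of 0.39 estimated from 1000 runs, the binomial standard error is about 0.015, so the second decimal of the estimate is subject to sampling noise.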

Second, statistics on the combinations of perceiving
useful or useless information and success
or failure of the task are obtained. Here perceiving
useful information means that at least one of the
two useful signals is perceived. Perceiving useless
information means that neither of the two useful
signals is perceived. Useless information refers to the
misleading signals and the distracting instrument
readings. The probabilities of success conditioned on
perceiving useful information, success conditioned
on perceiving useless information, failure conditioned
on perceiving useful information, and failure
conditioned on perceiving useless information
are 0.91, 0.09, 0.83, and 0.17 respectively. From the
result, we can see that perceiving useful information
does not guarantee the success of the task, and
perceiving useless information does not necessarily
lead to the failure of the task. But perceiving useful
information does help increase the success probability,
as can be seen from the comparison between
0.91 and 0.83. The small increase in the probability
of success also suggests that the failure of the task
is dominated by the unsuccessful retrieval of
information in the reasoning process and by the high
complexity of the situation under study.
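
The four reported values sum to one in pairs within each outcome (0.91 + 0.09 and 0.83 + 0.17). One possible reading, which is an assumption and is not stated in the text, is that they are probabilities of the perceived information category given the outcome. Under that reading, Bayes' rule combined with the overall success probability of 0.39 gives the success probability for each category, as sketched below with a hypothetical helper function.

```python
def posterior_success(p_success, p_cat_given_success, p_cat_given_failure):
    """P(success | category) by Bayes' rule from outcome-conditioned values."""
    numerator = p_success * p_cat_given_success
    denominator = numerator + (1.0 - p_success) * p_cat_given_failure
    return numerator / denominator


# Values taken from the case study, interpreted as described above.
p_given_useful = posterior_success(0.39, 0.91, 0.83)
p_given_useless = posterior_success(0.39, 0.09, 0.17)
```

Under this reading the success probability is roughly 0.41 when useful information is perceived and roughly 0.25 when it is not, which is in line with the statement that useful information helps but does not guarantee success.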

Two examples, one for success and another for 
failure, are explained to illustrate the reasoning 
process. In the example for success, “closed” state 
of “PORV state”, “high” state of “temperature 
downstream of PORV”, and the six irrelevant sig- 
nals are perceived by the operator. However, except 
for the “closed” state, the activation propagation 
from the other signals is stuck at the initial points. 
The “closed” state leads to the activation of “failed” 
state of “PORV mechanism” and “normal” state of 
“pressurizer pressure”. In the example for failure, 
“high” of “pressurizer water level”, “low” of “pri- 
mary system pressure”, and two irrelevant signals 
are perceived. Perception of “high” of “pressurizer 
water level” leads to the activation of “closed” of 
“PORV state”, then to the activation of “normal”
of “primary system pressure” and “normal” of 
“PORV mechanism”. The activation propagation 
from all the other signals is stuck at the initial points. 


3.3 Discussion 


From the case study, it is easy to see that the pro- 
posed computational cognitive modeling approach 
is able to provide the probability of success or failure 
in an easy way from the simulation results. In the cal- 
culation, no expert opinion is needed. Another more 
prominent strength of the proposed approach is that 
the simulation results provide the basis for detailed 
investigation of an operator’s cognitive process in 
a specific situation from different perspectives, for 
instance the relation between failure in the task and 
perceived signals in the case study. 

However, significant improvements are still 
needed to enable the proposed approach for 
practical applications and to further enhance the 
capabilities of the approach. First, a better way 
of representing the knowledge base needs to be 
developed to cover more complex structures of an 
operator’s knowledge in one field. Second, a systematic
method of determining the parameters in
the cognitive model through experimental results
or the results in existing research is needed. Third,
the proposed approach needs to be validated by
comparison with experimental results.


4 CONCLUSIONS 


In this paper, a computational cognitive modeling 
approach to human performance assessment in 
nuclear power plants is proposed. The three main 
processes in a human’s cognitive process, that is infor- 
mation perception, reasoning and response devel- 
opment, are modeled. The interplay between two 
significant PSFs (i.e. stress and fatigue) and the cog- 
nitive process is considered. Sampling methods are 
used to simulate the possible outcomes of an opera- 
tor’s cognitive process. The simulation results enable 
the probability of success or failure of an operator 
performing a specific task to be calculated in an 
easy way. In addition, the simulation results provide 
the basis for detailed investigation of the cognitive 
process. With the proposed approach, the effect of 
task dependence on human performance is consid- 
ered implicitly based on the use of activation levels 
for the nodes. A case study derived from the Three 
Mile Island accident is conducted with the proposed 
approach. Although the case study is simplified and 
the results are preliminary, it demonstrates the power 
of the approach. Future research for further improve- 
ments of the proposed approach is pointed out. 


ABSTRACT: We discuss potential assurance frameworks for autonomous navigation systems in the
maritime industry, with emphasis on testing and verification of the system’s perception performance and
capacities. Ongoing research in this field has revealed profound challenges related to artificial situation
awareness and machine perception specific to the marine environment. The lack of a clear and transparent
framework and methodologies to assure the safety associated with the usage of such solutions has been
identified as a key barrier to the implementation of autonomous navigation solutions at scale. Because
the machine perception and situational awareness algorithms are expected to be partly or fully based on
machine learning algorithms, including deep learning, whose functional reasoning is challenging or even
impossible to understand and predict, the verification of such systems is fundamentally different from a
traditional verification process based on physical understanding and theory. We review several methods
for testing autonomous navigation systems, proposed and used mainly in the automotive industry, and
discuss how these methods can be adapted, combined and applied to form a framework for assurance of
autonomy in the maritime industry.


1 INTRODUCTION 


Autonomous transport on land, in the air and 
at sea has been coined the technology trend with 
the highest potential to disrupt the transport sec- 
tor in the future. It has the potential for making 
transport solutions more cost effective, safe and 
environmentally friendly, but also to disrupt entire 
business models and value chains associated with 
the mode of transport. Given the disruptive poten- 
tial of this technology trend, increasing research 
efforts are being invested to realize the technologi- 
cal solutions. 


1.1 The autonomy revolution 


Technologies and methods for autonomous systems
are a very active area of research both in
industry and in academia. However, the majority
of the research being done for autonomous vehicle
navigation is focused on the automotive
industry. The amount of test data for such vehicles
is becoming abundant and is considered an important
contributor to the current state of the art in
the research field. Major advances in object detection,
classification and image analysis have been
made in recent years, with extensive use of artificial
intelligence related technologies such as feature
extraction, artificial neural networks, deep learning
models such as Convolutional Neural Networks
(CNNs), and gradient-based and derivative-based
matching approaches (see for example Hofmann
2013, Rout 2013, MathWorks 2017c). Research
is needed to identify if and how the algorithms,
methods and sensors developed for the automotive
industry can be utilized in the maritime domain.


1.2 Opportunities in the maritime industry 


Several studies have shown that human error 
contributes to a majority of marine casualties 
(Rothblum 2000, Harrald et al. 1998). However, 
automated systems and autonomy can also introduce
new challenges, and existing challenges might
be amplified (Lützhöft & Dekker 2002). Nevertheless,
we expect that if the interaction between
humans and machines is treated carefully, with
thorough testing and verification, autonomy can
contribute significantly to increased safety in many
maritime operations.

Unmanned ships will enable optimization
of energy efficiency due to changes in design
constraints and freeing of space previously used
to accommodate crew. In addition, more hydrodynamic
and aerodynamic designs may in turn lead
to less fuel consumption and reduced emissions.
Furthermore, autonomous ships might be able to
compete with road transportation and contribute
to reduced emissions from road transportation as
well as reduced road wear and tear.

If autonomous ships are successfully implemented,
they will most probably enable fundamentally
new types of ship transportation operations,
such as single container shipment
(Woodgate 2017); extremely slow speed transpor- 
tation with very low emissions (Tvete 2017); con- 
tainer feeder to replace road transport (Kongsberg 
2017); and unmanned patrol ships (Fingas 2017). 

Several demonstrators have already proven that 
it is feasible for a transport solution to be oper- 
ated by sensors and software either partially or 
fully based on deep learning algorithms (Huval 
et al. 2015, Ackerman 2017). The company Drive.ai,
among others, has an ambition to use deep
learning fully from sensory input to decision making,
while others usually use deep learning in parts
of the system, e.g. situational awareness, while 
relying on traditional control system logic in other 
parts of the system (Huval et al. 2015, Ackerman 
2017, Muoio 2016). Nevertheless, the solutions are 
yet to be deployed at scale. One of the reasons for 
the lack of deployments is that the solutions are 
still not proven to be sufficiently safe. 


1.3 Early rule development as an enabler for innovation


A key element required to keep the autonomous
system safe is the ability of the system to achieve
situational awareness. Situational awareness algorithms
are usually partly or fully based on machine
learning algorithms whose functional reasoning is
challenging or even impossible to understand and
predict. Hence, the verification of such a system is
fundamentally different from a traditional verifica- 
tion process based on physical understanding and 
theory. The machine learning algorithms are data 
driven, and completely dependent on the quality of 
the training data. Therefore, verification will likely 
be carried out by a combination of testing, simula- 
tions and benchmarking against real and synthetic 
data sets. Furthermore, adaptive methods, where 
data are automatically collected and used to retrain 
the system, will also be considered. 

For a manned system, awareness is achieved by
the human operator by using his or her senses and
perceptive abilities to interpret instrument signals
and input from surroundings. An unmanned ship
should use a priori information, such as maps,
combined with sensor readings to make observations
relative to the environment, and use software
to perceive the situation based on this input.
This digital perception will be used as input to a
decision-making algorithm, which in turn controls
the actuators of the vessel that carry out
the decision made. For the autonomous system to
make safe decisions, the situational awareness must
be sufficiently accurate for all feasible situations
and conditions which the vessel may encounter.

System functional and performance require- 
ments necessary to obtain a required safety level 
of an automated situational awareness system 
should be established as early as possible, as this 
will offer the technology providers a standard to 
be met by their solutions. If requirements are not 
set before or early in the technology development 
phase, developers risk spending significant efforts 
and money on developing solutions which in the 
end may not meet the safety standard. However, 
establishing such a standard is difficult when no 
solutions exist to evaluate the standard against. 

In addition to a standard for required system 
and component performance, tools are needed for 
verifying that the technology meets the require- 
ments set in the standard. For a situational aware- 
ness system, this will include verifying that the 
sensors adequately detect objects affecting the 
safety of the vessel and its surroundings under 
various conditions, and that the perception algo- 
rithm can use this information together with other 
a priori information to adequately understand the 
situation. 


1.4 Focus of this study 


In this paper, we discuss rules and regulations 
related to autonomous navigation systems in a 
maritime context, with focus on autonomous per- 
ception and situational awareness. However, we 
believe that a framework for approval developed 
for autonomous applications will also be appli- 
cable to other systems that are based on machine 
learning algorithms and artificial intelligence. 

The remainder of the paper is structured as follows.
In section 2, we propose and describe a range
of recommended practices and tools that can be
applied to test and validate the ability, performance
and robustness of safety critical systems whose
decisions are based on data-driven methods. These
practices and tools originate partly from traditional
statistical analysis and are suggested and applied
for testing and assurance of autonomy in the automotive
industry. In section 3, we discuss challenges
related to machine perception that are unique or
particularly pronounced in the maritime domain,
and suggest how the recommended practices and
tools should be used and possibly adapted to suit
the maritime domain. Furthermore, we present a
possible scope for an assurance framework, and discuss
potential implications of autonomy such as
operation-dependent requirements.
We also describe the IMO guidelines for approval
of alternatives and equivalents. We conclude in
section 4.


2 TOOLS AND RECOMMENDED PRACTICES FOR ASSURANCE


In the following, we propose and describe a range
of recommended practices and tools that can
all be applied to test and validate the ability,
performance and robustness of safety critical systems
whose decisions are based on data-driven methods.
See Figure 1 for an overview of the methods.


2.1 Confusion matrix 


It is not obvious how to measure and evaluate the
performance of autonomous navigation systems.
If two systems provide divergent predictions or
decisions, it is difficult to define and quantify which
reaction was most correct. In classification problems,
the results are often presented in a confusion
matrix, where the predicted class is compared
with the actual or true class. With two classes,
for example object detection with two objects, it
is straightforward to define the confusion matrix,
which contains the number of


— True Positives (TP), hits

— True Negatives (TN), correct rejections

— False Positives (FP), false alarms, Type I errors
— False Negatives (FN), misses, Type II errors


Figure 1. Proposed components in an assurance framework
for safety critical systems: confusion matrix;
cross-validation; third party testing and unrehearsed
experiments; cost function-oriented quantitative
evaluation methods; elimination of uneventful data;
exploring corner cases; investigating sensor redundancy.

When more classes are needed, defining the criteria
for performance evaluation becomes more
challenging. For example, how should we quantify
the performance of a perception system which correctly
detects a vessel, but misclassifies it as a ferry?
And how should this be compared to misclassifying
it as a kayak? Or what if the system is not even
able to recognize it as an object?
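
One common way to carry the two-class counts over to several classes is to build the full matrix of (true, predicted) pairs and then read off one-versus-rest counts per class. The sketch below illustrates this. The function names, the example labels and the use of a "none" label for a missed detection are assumptions for illustration, and the sketch does not answer the question of how to weight different kinds of misclassification.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count every (true class, predicted class) pair."""
    return Counter(zip(true_labels, predicted_labels))


def per_class_counts(matrix, cls):
    """One-versus-rest TP, FP, FN and TN for a single class."""
    tp = matrix[(cls, cls)]
    fp = sum(n for (t, p), n in matrix.items() if p == cls and t != cls)
    fn = sum(n for (t, p), n in matrix.items() if t == cls and p != cls)
    tn = sum(matrix.values()) - tp - fp - fn
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical labels: "none" marks an object the system failed to detect.
truth = ["cargo vessel", "cargo vessel", "kayak", "ferry"]
predicted = ["ferry", "cargo vessel", "none", "ferry"]
matrix = confusion_matrix(truth, predicted)
ferry_counts = per_class_counts(matrix, "ferry")
kayak_counts = per_class_counts(matrix, "kayak")
```

In this toy example the cargo vessel misclassified as a ferry appears as a false positive for the ferry class, while the undetected kayak appears as a false negative for the kayak class. The two errors are counted in the same way, although their safety consequences may differ.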

To be able to make the above-mentioned com- 
parison, we are fully dependent on a correctly 
labelled dataset of the ground truth. An autono- 
mous ship will likely be equipped with multiple 
sensors, including multiple daylight and IR cam- 
eras in various directions, radars with different 
settings, in addition to Automatic Identification 
System (AIS) and other satellite data, etc. The 
labelling process should take all these sources into 
account when labelling the data. For example, if an 
object is not visible in the video stream due to thick 
fog or other difficult weather conditions, but we 
know the object’s position from AIS data or other
sources, the object will be labelled in the ground 
truth data set. All relevant information should be 
correctly labelled, in all datasets. 

Data collection, and especially data annotation 
or labelling, are surprisingly time consuming and 
costly tasks for vehicle classification (Schöning 
et al. 2015, Chen and Ellis 2014). However, various
tools and methods for semi-automatic ground
truth labelling of video streams, designed for the
automotive industry, are available (see for example
MathWorks 2017b, MathWorks 2017a, Cuevas et al. 2015,
Lopez-Villa et al. 2015, Schöning et al. 2015).
Another approach is to crowdsource the data annotation,
as Mighty AI has done in the automotive industry: they
have developed a mobile app in which users annotate
images manually and get paid for it, and Mighty AI
sells the annotated datasets [https://mty.ai/]. The
available solutions should be explored and, if necessary,
adapted for use in a maritime context.


2.2 Cross-validation


It is well known that when we evaluate predictions 
from a statistical model on the dataset used to train 
the model, our accuracy estimates tend to be over- 
optimistic (Arlot & Celisse 2010). To build robust 
and accurate models we ideally want to use all data 
available. The same applies to testing; we want to 
test our models in all situations, not only on a subset.
Cross-validation introduces various methods
of repeatedly splitting the data $D$ into two mutually
exclusive parts $D_t$ and $D_v$, where one part, $D_t$, is
used to train the model, and the other, $D_v$, is reserved
for validation.

A range of different splitting techniques can 
be applied, providing different cross-validation 
estimates. See for example Arlot & Celisse 2010, 
Kohavi 1995 for a brief overview of the most com- 
mon splitting techniques. 

One of the most widely used splitting techniques
is K-fold cross-validation, which in its
standard form splits the original dataset $D$ into $K$
subsets (folds), $D_1, \ldots, D_K$, as described in (Arlot
and Celisse 2010, Brandsæter and Vanem 2016).
For each $k \in \{1, 2, \ldots, K\}$ the models are trained on
$D_{-k} = D \setminus D_k$ and tested on $D_k$. To make sure that the
results are not strongly dependent on how the folds
are selected, we repeatedly run the K-fold
cross-validation with new selections. The sets are often
chosen to be mutually exclusive with equal size.
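
The repeated K-fold procedure described above can be sketched as follows. This is an illustration under simplifying assumptions, not the implementation used in the cited works: the helper names, the toy "model" (the training mean) and the score (mean absolute error) are all hypothetical.

```python
import random

def k_fold_splits(n_samples, k, seed):
    """Yield (train_idx, test_idx) for K mutually exclusive folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]          # the folds D_1, ..., D_K
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]   # D without fold k
        yield train_idx, test_idx

def repeated_cv(data, k, repeats, fit, score):
    """Average the test score over `repeats` runs of K-fold cross-validation."""
    scores = []
    for r in range(repeats):
        # A new seed per repetition gives a new selection of folds.
        for train_idx, test_idx in k_fold_splits(len(data), k, seed=r):
            model = fit([data[i] for i in train_idx])
            scores.append(score(model, [data[i] for i in test_idx]))
    return sum(scores) / len(scores)

# Toy usage: the "model" is the training mean, the score is the mean absolute error.
data = [float(v) for v in range(20)]
fit = lambda train: sum(train) / len(train)
score = lambda model, test: sum(abs(v - model) for v in test) / len(test)
cv_error = repeated_cv(data, k=5, repeats=3, fit=fit, score=score)
```

Every sample is used for testing exactly once per repetition, and never in the same fold in which it was used for training.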


2.3 Extensive testing 


The standard approach for assurance of autonomous
navigation in the automotive industry is
extensive testing (Pei et al. 2017a, Zhao and Peng
2017, Waymo 2016, Fei-Fei 2010), where large
amounts of real-world data from ordinary operation
are gathered and manually labelled, and data on
driver performance, behaviour, environment, driving
context and other factors associated
with critical incidents, near misses and crashes are
analysed and used in evaluating the system
performance (Zhao and Peng 2017).

Simulated real-world data is also sometimes
used to massively increase the amount of data
(Madrigal 2017, Zhao and Peng 2017), but usually
this is completely unguided, and due to the large
input space of real-world scenarios, none of these
approaches can hope to cover more than a tiny
fraction (if any at all) of all possible corner cases
(Pei et al. 2017a). Here, a corner case is defined as
an unusual, but far from impossible, scenario. In
particular, each individual parameter, such as
temperature, fog, daylight, driving speed or number
of other vehicles involved, may be well within the
normal range for that parameter while the combined
scenario is still unusual. As an example, again
from the automotive industry, a Tesla in autopilot
recently crashed into a trailer because the autopilot
system failed to recognize the trailer as an obstacle
due to its white color against a brightly lit sky and
the high ride height (Lambert 2016). Such corner
cases were not part of Waymo's (Google's) or Tesla's
test set (Pei et al. 2017a) and thus never showed up
during testing.


2.4 Third party testing and unrehearsed 
experiments 


In 2003 the Defense Advanced Research Projects 
Agency (DARPA) announced the first Grand 
Challenge with the goal of developing vehicles 
capable of autonomously navigating desert trails 
and roads at high speeds. Krotkov et al. 2007
describe the conduct of six evaluation experiments for
the DARPA PerceptOR program. Key distinctions
of the testing methodology include conduct
of the experiments by an independent third
party, and the use of unrehearsed experiments that
provide little advance knowledge of and access to
the test courses. The article also presents quantified,
objective performance metrics for the systems
evaluated. Furthermore, it includes blind experiments
that do not allow the system operators to see
the test courses until all tests are completed.

The test environment and the test content
are described in detail; however, the evaluation
approach is not thoroughly discussed (Sun et al.
2011).


2.5 Cost function-oriented quantitative evaluation 
methods 


Wei and Dolan 2009 claim that most teams in the
2007 DARPA Urban Challenge preferred to avoid
difficult manoeuvres in high-density traffic by
stopping and waiting for a clear opening instead
of interacting with the traffic and operating the
vehicle as human drivers would. To counter this,
researchers at Beijing Institute of Technology propose
a design method for a scientific and comprehensive
test and evaluation system for autonomous ground
vehicle competitions, to better guide and regulate the
development of autonomous ground vehicles. The
evaluation method proposed by Sun et al. 2014,
Sun et al. 2011 aims to evaluate the quality of
completion with a cost function-oriented quantitative
evaluation method. This evaluation method
can presumably evaluate the overall technical
performance and individual technical performance
of autonomous ground vehicles. A complete test
system that includes the test contents, the test
environment, and the test methods to meet the
demands of testing for autonomous ground vehicles
is developed, and a fuzzy evaluation method
is combined with an analytic hierarchy process to
solve fuzzy and hard-to-quantify problems (Sun
et al. 2014).


2.6 Elimination of uneventful data 


Recently, a new approach to testing autonomous
cars was proposed by researchers affiliated with
the University of Michigan's Mcity connected and
automated vehicle center. Zhao and Peng 2017
present an accelerated evaluation process which
aims to eliminate the uneventful driving activity
and retain only the potentially dangerous driving
situations where an automated vehicle needs
to respond, creating a faster, less expensive testing
program. It is claimed that this approach can
reduce the amount of testing needed by a factor of
300 to 100,000.

Four methodologies that form the basis of the 
accelerated evaluation process are listed (Zhao and 
Peng 2017): 


1. Evaluate how frequently a significant driving
event happens on the road, and strip out
the more common, uneventful safe driving
situations.

2. Use importance sampling to statistically
increase the number of critical driving events
in a way that still accurately reflects real-world
driving situations.

3. Construct a formula that accurately distils those
critical events, test the formula, and apply it to
further reduce the amount of testing required.

4. Analyse interactions between human-driven
vehicles and robotic vehicles and optimize the
random occurrences of significant driving
events in the most complex scenarios.
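
To illustrate the importance sampling idea in the list above, the following minimal sketch estimates the probability of a rare event. It is a toy example under stated assumptions, not the model of Zhao and Peng 2017: the "critical event" is a standard normal variable exceeding a threshold of 4, and the shifted proposal distribution N(4, 1) is chosen here purely for illustration.

```python
import math
import random

def naive_estimate(threshold, n, rng):
    """Plain Monte Carlo estimate of P(X > threshold) for X ~ N(0, 1)."""
    return sum(rng.gauss(0.0, 1.0) > threshold for _ in range(n)) / n

def importance_estimate(threshold, n, rng):
    """Sample from N(threshold, 1) so that critical events are common,
    then reweight each sample by the density ratio p(x) / q(x)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # p(x) / q(x) for p = N(0, 1) and q = N(threshold, 1)
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n

rng = random.Random(0)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))   # about 3.2e-5
is_est = importance_estimate(4.0, 20_000, rng)
mc_est = naive_estimate(4.0, 20_000, rng)
```

With 20,000 samples the naive estimator expects to see fewer than one critical event, whereas roughly half of the importance samples are critical; the reweighting keeps the estimate consistent with the original distribution.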


2.7 Exploring corner cases in deep learning 
systems 


Pei et al. 2017a, Tian et al. 2017 and Pei et al. 2017b,
researchers at Columbia University,
Lehigh University and University of Virginia, propose a
method for automated whitebox testing of deep
learning systems. Deep Learning (DL)
has made tremendous progress, achieving or sur- 
passing human-level performance for a diverse set 
of tasks including image classification (He et al. 
2016, Simonyan and Zisserman 2014), which has 
led to widespread adoption and deployment of DL 
in security- and safety-critical systems such as self- 
driving cars (Bojarski et al. 2016). Unfortunately, 
DL systems, despite their impressive capabilities, 
often demonstrate unexpected or incorrect behav- 
iours in corner cases for several reasons such as 
biased training data, overfitting, and underfitting 
of the models (Pei et al. 2017a). 

The proposed method aims to identify erroneous
behaviours of a DL system without manual
labelling/checking, by maximizing a joint
objective function that combines a metric called
neuron coverage with differential behaviour between
multiple tested methods. The objective function is
maximized by changing the input variable $x$ under
some physical constraints. For example, an input
image can be rotated or scaled differently, brightness
and contrast can be changed, and rain and fog
can be added to the input image.

By differential behaviour we mean that when
different deep neural networks (DNNs) are tested,
the same input is classified into different
classes by the different DNNs. The aim is to
maximize the probability that a randomly selected
DNN provides an output that differs from the output
of the other DNNs. Suppose we have $N$ different
DNNs; then each DNN has its own function
model $F_k : x \to y$ for $k \in \{1, \ldots, N\}$, where $x$ and $y$ are
the input and output values respectively. If $F_k(x)[c]$
is the probability that the output of the $k$-th
neural network is class $c$, and the $j$-th neural network is
randomly chosen, the objective function (which
will be maximized) is formulated as

$$\mathrm{obj}_1(x) = \sum_{k \neq j} F_k(x)[c] - \lambda_1 \cdot F_j(x)[c] \qquad (1)$$

where $\lambda_1$ is a parameter to balance the objective
terms between the DNNs that maintain the same
class outputs as before ($F_{k \neq j}$) and the DNN that
produces different class outputs ($F_j$).
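
As a small worked illustration of Eq. (1), the sketch below evaluates the objective for one input, given the class probabilities produced by three models. The function name and the probability values are hypothetical, and the models are indexed from zero for convenience.

```python
def differential_objective(probs, j, c, lam1):
    """Eq. (1): sum over k != j of F_k(x)[c], minus lam1 * F_j(x)[c].

    probs[k][c] is the class-c probability from model k for one input x.
    """
    others = sum(p[c] for k, p in enumerate(probs) if k != j)
    return others - lam1 * probs[j][c]

# Hypothetical class probabilities from three models for the same input.
probs = [
    [0.7, 0.2, 0.1],   # model 0
    [0.6, 0.3, 0.1],   # model 1
    [0.1, 0.8, 0.1],   # model 2 disagrees about class 0
]
agree = differential_objective(probs, j=1, c=0, lam1=1.0)      # 0.7 + 0.1 - 0.6
disagree = differential_objective(probs, j=2, c=0, lam1=1.0)   # 0.7 + 0.6 - 0.1
```

The objective is larger when the selected model assigns a low probability to a class that the remaining models still favour, which is the situation the search tries to provoke.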

Neuron coverage is a measure of how many rules
in a DNN are exercised by a set of inputs. The neuron
coverage of a set of test inputs is defined as the
ratio of the number of unique neurons activated
by the test inputs to the total number of neurons
in the DNN.

To maximize the neuron coverage, Pei et al.
2017a and Tian et al. 2017 propose to iteratively
pick inactivated neurons and modify the input
such that the output of that neuron goes above a
predefined threshold $t$. Hence, for a given neuron $n$,
we maximize the following function

$$\mathrm{obj}_2(x) = G_n(x) \quad \text{such that} \quad G_n(x) > t \qquad (2)$$

where $G_n$ is the output value of neuron $n$.
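
The coverage ratio defined above can be computed directly once the neuron outputs have been recorded. The following sketch assumes the outputs are available as a table with one row per test input and one column per neuron; the function name and the numbers are hypothetical.

```python
def neuron_coverage(activations, threshold):
    """Fraction of neurons whose output exceeds `threshold` for at least one test input.

    activations[i][n] is the output of neuron n for test input i.
    """
    n_neurons = len(activations[0])
    activated = {
        n
        for row in activations
        for n, value in enumerate(row)
        if value > threshold
    }
    return len(activated) / n_neurons

# Hypothetical outputs of four neurons for three test inputs.
activations = [
    [0.9, 0.1, 0.0, 0.2],
    [0.8, 0.6, 0.0, 0.1],
    [0.1, 0.7, 0.0, 0.3],
]
coverage = neuron_coverage(activations, threshold=0.5)   # neurons 0 and 1 -> 0.5
```

In this example neurons 2 and 3 are never activated, so they would be candidates for the targeted input modification described above.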

The neuron coverage and the differential behaviour
are jointly maximized by slightly changing the
input values using gradient ascent. The joint objective
function is

$$\mathrm{obj}_{\mathrm{joint}}(x) = \sum_{k \neq j} F_k(x)[c] - \lambda_1 \cdot F_j(x)[c] + G_n(x) \qquad (3)$$


By changing the input variables x to maximize 
this joint objective function, the paper claims that 
the method finds thousands of erroneous behav- 
iours in fifteen state-of-the-art DNNs trained on 
five real-world datasets. Hence, new corner cases 
are explored and different types of erroneous 
behaviours are uncovered. In addition, test inputs 
generated by the proposed method can be used 
to retrain the corresponding deep learning model 
to improve classification accuracy, and also iden- 
tify potentially polluted training data (Pei et al. 
2017a). 
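
The search over inputs can be sketched as a constrained gradient ascent on the joint objective. The example below is a toy illustration only and is not the implementation of Pei et al. 2017a: the three "DNNs" are small linear-softmax classifiers with made-up weights, the neuron output is a hypothetical tanh unit, the gradient is approximated by finite differences rather than backpropagation, and the physical constraint is reduced to keeping each input value in [0, 1].

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def make_model(weights):
    """Toy linear-softmax classifier standing in for a DNN F_k (hypothetical)."""
    def model(x):
        return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])
    return model

def joint_objective(x, models, j, c, lam1, neuron):
    """Sum over k != j of F_k(x)[c], minus lam1 * F_j(x)[c], plus the neuron output."""
    others = sum(m(x)[c] for k, m in enumerate(models) if k != j)
    return others - lam1 * models[j](x)[c] + neuron(x)

def numerical_gradient(f, x, eps=1e-5):
    """Central finite differences; a stand-in for backpropagated gradients."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def gradient_ascent(f, x, step=0.1, iters=50, low=0.0, high=1.0):
    """Move x uphill, clipping to [low, high] as a simple input constraint."""
    for _ in range(iters):
        grad = numerical_gradient(f, x)
        x = [min(high, max(low, xi + step * gi)) for xi, gi in zip(x, grad)]
    return x

models = [
    make_model([[1.0, 0.0], [0.0, 1.0]]),
    make_model([[1.0, 0.2], [0.1, 1.0]]),
    make_model([[0.5, 1.5], [1.0, 0.2]]),
]
neuron = lambda x: math.tanh(0.5 * x[0] + 0.5 * x[1])   # hypothetical neuron output

def f(x):
    return joint_objective(x, models, j=2, c=0, lam1=1.0, neuron=neuron)

x0 = [0.5, 0.5]
x_star = gradient_ascent(f, x0)
```

Starting from `x0`, the ascent moves the input towards a region where the third model disagrees with the other two while the chosen neuron becomes more active; in a real system the candidate input would then be checked for an actual difference in predicted class.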


2.8 Demonstrate need for sensor redundancy 


Inspired by Pei et al. 2017a, as introduced above, we
propose to invoke differential behaviour by repeatedly
excluding one information source from the
sensor fusion machinery. In addition to revealing
differential behaviour, we believe this method will
be useful to demonstrate the importance of sensor
redundancy. If differential behaviour often occurs
when a specific information source is removed
from the set of explanatory variables, it can indicate
that redundancy of this information is needed
to achieve adequate robustness.

Figure 2. Illustrating how we invoke differential behaviour
by repeatedly excluding one information source
from the sensor fusion machinery.

To illustrate the idea, we consider a method
which fuses four information sources: $S_1$, a daylight
camera; $S_2$, an IR camera; $S_3$, a radar; and
$S_4$, AIS (Automatic Identification System). $F_0$ is
the standard method which uses all information
sources, while the methods $F_k$ for $k > 0$ cannot
use the information from information source $S_k$.
The goal is to change the input variable $x$ in a way
that invokes differential behaviour as illustrated in
Figure 2, where the output of method $F_j$, which
does not take information source $S_j$ into account,
diverges from the other methods.

In the same way as in section 2.7, we let $F_k(x)[c]$
be the probability that the output of the $k$-th
method is class $c$. Now the $k$-th method is the method
where information source $k$ is excluded as an
explanatory variable. In addition, we propose to
include method $F_0$, which includes all variables.
Now the objective function (which will be maximized)
is formulated as

$$\mathrm{obj}_1(x) = \sum_{k \neq j} F_k(x)[c] - \lambda_1 \cdot F_j(x)[c] \qquad (4)$$

where $j$ is randomly chosen, and $\lambda_1$ is a parameter
to balance the objective terms between the methods
that maintain the same class outputs as before
($F_{k \neq j}$) and the method that produces different class
outputs ($F_j$).
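
The leave-one-source-out comparison behind this proposal can be illustrated with a deliberately simple stand-in for the fusion machinery. In the sketch below, fusion is assumed to be a plain average of per-source class probabilities, which is an assumption made only for illustration; the source names and probability values are hypothetical.

```python
def fuse(scores, excluded=None):
    """Average per-source class probabilities, skipping the excluded source."""
    used = [p for name, p in scores.items() if name != excluded]
    n_classes = len(used[0])
    return [sum(p[c] for p in used) / len(used) for c in range(n_classes)]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def divergent_sources(scores):
    """Sources whose removal changes the class predicted with all sources included."""
    baseline = argmax(fuse(scores))
    return [name for name in scores if argmax(fuse(scores, excluded=name)) != baseline]

# Hypothetical class probabilities [vessel, no vessel] for one scene in thick fog.
scores = {
    "daylight_camera": [0.2, 0.8],
    "ir_camera": [0.3, 0.7],
    "radar": [0.7, 0.3],
    "ais": [0.95, 0.05],
}
critical = divergent_sources(scores)
```

In this made-up scene the fused decision flips when either the radar or the AIS input is removed, which would point to those sources as the ones where redundancy matters most.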


3 ASSURANCE IN THE MARITIME 
DOMAIN 


Systems whose safety depends on
the accuracy and reliability of data-driven
models need to be thoroughly tested. In this section,
we present challenges related to machine perception
that are unique or particularly pronounced
in the maritime domain. We describe potential
requirements, and discuss how the recommended
practices and tools should be used and possibly
adapted to form a framework for assurance in the
maritime domain.


3.1 Important technical challenges 
in the maritime domain 


One of the major differences, relevant for autono- 
mous navigation, between the automotive and the 
maritime industry is machine perception. Machine 
perception, also referred to as artificial or digital 
perception, is the process where information from
sensing, maps, satellite data and the vessel condition
is transformed into situation awareness (see
Fig. 3).

The requirements of a machine perception sys- 
tem in the maritime industry will most likely con- 
cern both what should be detected, such as object 
types, sizes, distances, reflectivities, etc.; and what
should be classified, such as ship types, number of 
ships, seamarks, complexity, etc. The requirements 
should be evaluated under various external condi- 
tions such as weather and daylight. 

Several technical challenges, particularly 
prominent in the maritime domain, remain open 
as described by for example (Prasad et al. 2016, 
Prasad et al. 2017): 


— Vessel movements' effect on sensors: For sensors
like video cameras which are mounted on board
ships, the unpredictable motion of the ship
complicates the object detection.

— Background subtraction: The water background
is dynamic due to waves. Hence, background
learning methods which recognize background
when a pixel stays constant for at least some
time fail. Also, waves and foam are often
misinterpreted as foreground objects when using
standard methods.

— Weather and illumination conditions: The maritime
environment is exposed to a variety of different
weather and illumination conditions such
as fog, rainfall, clouds, bright sunlight, twilight
and night. The different solar angles pose significant
challenges with speckle and glint, which
make it difficult to distinguish background and
foreground.

— Insufficient training data from the maritime
domain: Very limited work has been carried out
to develop object classification algorithms for
objects relevant to the maritime environment.
The objects of interest include other ships,
leisure boats, kayaks, landmarks, buoys,
icebergs, etc.

— Uneventful sailing: On ocean-going ships especially,
a very large fraction of the collected data
from a voyage is uneventful; hence a very large
portion of the corner cases is left unproven.

Figure 3. Key components in autonomous navigation
in the maritime industry: sensing, perception and
sensor fusion, and situation awareness.


3.2 Operational specific requirements 


Operational specific requirements are not con- 
sidered in current class rules. We foresee that this 
might change in the future, especially for ships with 
autonomous navigation systems, as the operation 
will be embedded into the technology rather than 
being the responsibility of the human operator. For 
example, if the ship's perception is limited due to
fog, the permitted speed might be lowered, or the 
ship might be denied access to specific geographi- 
cal areas, until the weather conditions improve. 
This decision might also be based on ship type, 
cargo, manoeuvring capabilities, etc. 


3.3 Triple modular redundancy 


The tools presented in 2.7 and 2.8 above both
search for differential behaviour from multiple
algorithms or sensor selections, using a majority
organ (or voting circuit). This concept was first
described by Von Neumann 1956. Today, the concept
is often referred to as triple modular redundancy
and is perhaps most widely used in space
and aeronautics applications (Wu et al. 2017, Yeh
1996), where reliability requirements are sometimes
very high. Using the majority vote out of three
(or more) methods ensures that a single failure will
not cause a system failure. We believe this concept
is highly relevant for autonomous navigation, as
well as other black box AI algorithms, and believe
its use should be required, in some form, to
ensure sufficient system reliability and robustness.
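
A majority voter over three (or more) redundant outputs can be sketched in a few lines. The function name and the class labels are hypothetical, and returning `None` when no strict majority exists is a design choice made here for illustration; a real system would need a defined fallback for that case.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the output shared by a strict majority of modules, or None if there is none."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

# One of three redundant classifiers fails; the voter masks the single failure.
masked = majority_vote(["ferry", "ferry", "kayak"])
# All three disagree; no majority exists and the disagreement must be handled elsewhere.
unresolved = majority_vote(["ferry", "kayak", "buoy"])
```

The voter masks any single faulty module, but it cannot help when the modules share a common failure cause, for example when all of them were trained on the same biased data.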


3.4 Approval of alternatives and equivalents 


According to the International Maritime Organization's
guidelines for the approval of alternatives
and equivalents (IMO Maritime Safety Committee
2013), the approval of an alternative and/or
equivalent design can be performed by comparing
the alternative design to existing designs to
demonstrate that the design has an equivalent level
of safety. Hence, the approval of autonomous systems
used in shipping, including everything from
smaller automated tasks to fully autonomously
navigated ships, will be based on the equivalence
principle: The autonomous functionality must
make the operation safer or at least as safe as the
conventional operation.


3.5 Automatic assessment of human perception 
ability 


To enable the comparison of human and autono- 
mous perception, evaluation metrics and measures 
should be defined, and performance thresholds 
should be identified. To achieve this, the human 
perception in representative real ship operations 
has to be studied. Research in the field of human
error has shown that a large number of investigated
maritime accidents are related to loss of situation
awareness (Grech et al. 2002).

It should be noted that the perception ability
as such is not necessarily the ambition. We know that
the perception performance can be influenced by
many factors, such as stress, distractions,
monotony and boredom (Horrey et al. 2017,
Brodsky and Slor 2013, Schwebel et al. 2012), but
our aim is to measure the perception performance
in practice.

Simulation tools might be applied to provide 
more extensive data sets, to complement the data 
collected from real operation. In addition to 
increasing the data set, the simulation tool offers 
the possibilities to create controlled situations, 
and explore changes in weather, rotated objects, 
etc. as well as the possibility to explore potentially 
dangerous situations. Another advantage with the 
simulated data is that it is pre-labelled, and one will 
therefore avoid spending time and effort to estab- 
lish the ground truth on the simulated datasets. 


4 CONCLUSIONS 


A framework and tentative guidelines for assurance
of autonomous systems in the maritime industry
are proposed and discussed, with additional focus
on the perception and situation awareness
functionality. Because vital parts of the autonomous
systems, such as the machine perception and
situational awareness algorithms, are expected to be
partly or fully based on machine learning algorithms,
including deep learning, whose functional
reasoning is challenging or even impossible to
understand and predict, we believe the assurance
of such systems is fundamentally different from
a traditional assurance process based on physical
understanding and theory. Hence, we believe
new guidelines, frameworks and methodologies are
needed.

We propose and describe a range of recommended
practices and tools that can be applied
to test and validate the ability, performance and
robustness of safety critical systems whose decisions
are based on data-driven methods. We discuss
challenges related to machine perception that
are unique or particularly pronounced in the maritime
domain, and suggest how the recommended
practices and tools should be used and possibly
adapted to constitute an assurance framework for
autonomous navigation in the maritime domain.
Furthermore, we discuss potential implications
of autonomy, such as operation-dependent
requirements. We also discuss the assurance
framework for autonomous systems relative
to the IMO guidelines for approval of alternatives
and equivalents.

ABSTRACT: The results of the current research are based on two questionnaire studies of risk perception
and worry about being involved in an accident when cycling. The data for Study 1 were collected through 
a questionnaire survey (n = 291) distributed in collaboration with the Norwegian Cyclists’ Association. 
Study 2 was based on survey data collected in a representative sample of the Norwegian public (n = 2000). 
The results of the first study showed significant associations between risk perception, worry, and risk 
tolerance on the one hand and cycling frequency during winter on the other hand. Risk perception was 
not as important for cycling frequency during the other seasons of the year. Study 2 showed that previous 
accident experience was associated with risk perception and worry. 


1 INTRODUCTION 

The risk of being involved in an accident is greater 
among cyclists compared with users of other travel 
modes (Høye et al., 2012). However, cycling is a 
more pro-environmental travel mode than motor- 
ized travel modes. Hence, Norwegian transport 
policy and the Norwegian authorities give high 
priority to increasing the number of cyclists and 
frequency of trips made by bicycle, especially in 
urban areas. Therefore, it is important to examine 
factors that may be associated with the frequency 
of bicycle use. Risk exposure may be a particu- 
lar challenge during harsh winter conditions in 
Norway, and due to the risk factors, priority should 
be given to examining risk perception among 
cyclists. Consequently, the aim of the study on which 
this paper is based was to investigate how urban 
work travel cyclists perceive risk when cycling, and 
to investigate the association between perceived risk 
and the decision to cycle. An additional objective 
was to investigate whether experiences of accidents 
are associated with cyclists’ perceptions of risk. 


2 THEORETICAL FRAMEWORK 


2.1 Risk perception and worry 


In accordance with Breakwell (2007), studies carried
out in recent years have used assessments of
probability and judgments of the severity of consequences
as indicators of perceived risk (Moen
& Rundmo, 2006, Nordfjaern et al., 2014,
Roche-Cerasi et al., 2013, Rundmo & Moen, 2006).
Consequently, in the current study perception of
risk was measured using the two factors ‘assessments
of probability’ and ‘judgments of the
severity of consequences’ of an accident occurring
when cycling. According to Sjöberg, Moen, and
Rundmo (2004), risk perception can be defined as
people’s cognitive assessment of risk on these two
dimensions.

Cognitive models have dominated risk percep- 
tion and decision-making research, but recently 
affective processes have received increased 
attention (Breakwell, 2007). The risk-as-feeling 
approach highlights the role of anticipatory and 
anticipated emotions for individuals’ perception of
risk (Loewenstein et al., 2001). Anticipatory emo- 
tions are immediate visceral reactions to risk, such 
as worry, fear, anxiety, and dread. Anticipated 
emotions are those that the individual expects to 
feel as a consequence of their decision. There are 
two types of anticipatory emotions: integral emo- 
tions and incidental emotions. Integral emotions 
are caused by the decision problem itself, whereas 
incidental emotions are caused by other factors, 
such as mood (Loewenstein & Lerner, 2003). In 
this paper, we conceive worry as an anticipatory 
emotion and integral to the decision problem, 
which implies that worry is defined as a feeling that 
emerges as a reaction to the individual’s cognitive 
assessment of risk. 

Risk perception and worry are primarily of inter- 
est because they may be related to people’s behav- 
ioural choices. According to the risk-as-feeling 
approach, behaviour is influenced by the interplay 
between cognitive evaluations of risk and feel- 
ings. Further, emotions often produce behavioural 
responses that differ from the individual’s cognitive 
assessment of the best course of action. According
to Loewenstein et al. (2001), it is apparent that
when such differences occur, behaviour is driven by
emotional reactions, not cognitive assessments.


2.2 Risk tolerance and priority given to safety 


It is not only important to study risk perception 
but also how risk is tolerated by individuals. Indi- 
viduals may differ in their thresholds for the level 
of risk they find acceptable. In his classic study, 
Starr (1969) aimed to answer the question ‘How 
safe is safe enough?’ He measured the level of 
risk that individuals found acceptable for differ- 
ent activities, and found that the risks involved in 
activities that were voluntary and perceived as ben- 
eficial were tolerated more than the risks involved 
in other activities. In a more recent study by Fis- 
chhoff et al. (1978), respondents were asked to 
judge the acceptable level of risk associated with 
different activities or technologies. The researchers 
found that risk was less tolerated when the activi- 
ties were associated with dread. Fischhoff et al. 
(1978) also found that higher risk levels were tol- 
erated in voluntary activities with well-known and 
immediate consequences. 

To date, few studies have measured risk toler- 
ance among cyclists. Parkin et al. (2007) developed 
a model based on a risk threshold and provided 
a measure of acceptability for different cycling 
routes. The model shows how different infrastruc- 
ture reduces the perceived risk and makes the route 
acceptable to cyclists. In Parkin et al.’s (2007) study, 
the model showed that young and old people con- 
sidered cycling less acceptable than did people in 
the age group 35–44 years, and that males considered
cycling more acceptable than did females.

The terms ‘acceptance’ and ‘tolerance’ are often 
used synonymously. However, Sjöberg (1999) 
argues that risk tolerance and risk acceptance are 
two separate concepts. According to Sjöberg (1999,
p. 131), ‘risks are not accepted but tolerated’. One 
may be aware of the risk and choose to tolerate it, 
even if one does not accept it. It is more accurate 
to talk about accepting, for example, cycling as an 
activity that generates risk, but the risk in itself is 
not accepted but tolerated. In our study, we inves- 
tigated risk tolerance. 

Several studies have investigated demands for 
safety priority or risk mitigation related to trans- 
port mode use (Nordfjaern & Rundmo, 2010, 
Rundmo & Moen, 2006, Simsekoglu et al., 2015, 
Sjöberg, 1999), and some of these studies have 
included cycling (Nordfjaern & Rundmo, 2010). 
To date, no studies have investigated how demands 
for safety priority relate to transport mode use. 
Nordfjaern and Rundmo (2010) found that in 
Norway, the demand for risk mitigation and the 
priorities related to transport safety increased
significantly between 2004 and 2008. Demands for
risk mitigation can be defined as public demands
for decision-makers to reduce specific sources of 
transport risks (Moen & Rundmo, 2004). In this 
paper, we define the demand for safety prior- 
ity as demands from the public directed towards 
decision-makers to prioritize traffic safety for 
cyclists. In the following, we examine how both 
risk tolerance and priority given to safety are asso- 
ciated with cycling. 


2.3 Risk perception and worry among cyclists 


The study of risk perception among cyclists has 
received little attention to date (Chaurand & 
Delhomme, 2013). Much of the research in this 
field has been conducted mainly to aid engineers 
and planners wanting to design, improve, and pri- 
oritize roads and intersections to cater for cyclists. 
In the majority of these studies, cyclists were asked 
to rate their overall risk perception of a route 
shown in video clips or simulations, or described in 
surveys. In one recent study, mental mapping was 
used to study cyclists’ risk perceptions by drawing 
their regular route with different colours accord- 
ing to their perception of safety and risk (Manton 
et al., 2016). Common to the studies of risk per- 
ception among cyclists is that the focus is on the 
perception of either the road infrastructure or the 
traffic (Lawson et al., 2013). Some researchers have 
investigated cyclists’ risk perception using specific 
roads (Llorca et al., 2017) or crossings and rounda- 
bouts (Moller & Hels, 2008). 

Several previous studies have examined risk 
perception in relation to travel mode use. Moen 
& Rundmo (2006) and Oltedal & Rundmo (2007) 
included cycling with other travel modes when 
investigating perceptions of risk. Both pairs of 
researchers found that the probability of being 
involved in an accident was higher when cycling 
compared with when using other travel modes. 
However, the severity of the consequences when 
cycling was judged as small. Another interesting 
finding was that respondents reported they were 
more worried about experiencing an accident when 
cycling than when they used other travel modes. 

Further, researchers have found gender differ- 
ences in perceptions of risk and worry among 
cyclists. Lawson et al. (2013) studied gender differ- 
ences in the perception of safety among cyclists, 
and found that both males and females more often 
described cycling as less safe than driving, and that 
older women perceived cycling as less safe than did 
younger women. Manton et al. (2016) found that
females more often perceived their regular cycling
routes as dangerous than did males.
Moen & Rundmo (2006) found that, in
contrast to women, men scored lower on their
perceptions of probability, on expected consequences,
and on worry. They also found that individuals below
the age of 25 years regarded the consequences as
less serious and were less worried when travelling
by private modes of transport, including bicycles,
than were individuals aged 25 years or over. The
same age group (i.e. individuals under 25 years)
perceived the probability of being involved in an
accident as highest. Individuals with a university degree
perceived the risk as lower than did others and were 
less worried about being involved in an accident. 
Other studied predictors of risk perception and 
worry are age (Hermand et al., 1999, Sjöberg, 1998) 
and level of education (Sjöberg, 2000). 


2.4 Aims of the study 


The specific aims of the current study were as 
follows: 


1. To examine differences in work travel cyclists’ 
risk perception, worry, risk tolerance, and pri- 
ority given to safety when cycling in winter and 
summer conditions 

2. To examine whether risk perception, worry, 
risk tolerance, and priority given to safety were 
associated with the cycling frequency during 
winter 

3. To compare the role of risk perception, worry, 
risk tolerance, and priority given to safety, for 
cycling frequency during all four seasons 

4. To examine the association between accident 
experience, and risk perception and worry. 


3 DATA AND METHODS 


3.1 Samples 


The results of the current study are based on two 
questionnaire studies about risk perception and 
travel behaviour among Norwegian cyclists. The 
data collection for both studies was carried out in 
spring 2017. 


3.1.1 Study 1 

The data for Study 1 were collected through a self- 
completion questionnaire survey (n = 291) car- 
ried out among work travel cyclists in Trondheim 
Municipality, Norway. The survey was distributed 
in collaboration with the Norwegian Cyclists’ Asso- 
ciation (Syklistenes Landsforening). All respond- 
ents cycled on a daily basis and were members of 
an Internet-based group for cyclists. We invited 
all 2240 members to participate in the study, and 
the response rate was 13 percent. Respondents 
who never used their bicycle for travelling to work 
or university, or during their workday, were not 
defined as work travel cyclists and were excluded 
from the sample (28 respondents), leaving 263 
respondents for the analyses (the n reported in 
Tables 1-4). 
The percentages of females and males in the 
sample were 35 percent and 65 percent respectively. 
They were all in the age range 20-67 years 
(mean = 42.63, standard deviation = 11.30). A total 
of 69 percent of the respondents reported they had 
more than three years of university education, 19 
percent had three years or less of university edu- 
cation, and 12 percent had received their highest 
level of education at upper secondary school. A 
total of 91 percent reported their main occupation 
as employed, and the remaining 9 percent were 
students. The respondents reported that they used 
their bicycle to travel to and/or from work (94%), 
for their work (13%) and/or to and/or from univer- 
sity (10%). A total of 3 percent of the respondents 
reported that they did not have a driving license. 

Figure 1 shows how often the work travel cyclists 
used their bicycle during the different seasons. The 
cyclists reported that they cycled less during winter 
than in the other seasons. During 
ter, 48 percent cycled five or more times per week, 
compared with 67 percent during autumn, 70 per- 
cent during spring, and 76 percent during summer. 
A total of 8 percent reported that they never cycled 
during winter. None of the respondents reported 
that they never cycled during the other seasons. 


3.1.2 Study 2 


Study 2 was based on a telephone questionnaire 
survey (n = 2000) carried out among a representative 
sample of the Norwegian population in the age 
range 15-88 years (mean = 45.38, standard deviation 
= 17.56). The response rate was 27 percent. In 
the sample, 57 percent were males and 43 percent 
were females. A total of 9 percent of the respond- 
ents had primary or secondary school education 
as their highest completed education level. A total 
of 34 percent had upper secondary school as their 
highest completed education level. A high pro- 
portion of the sample (57%) had a higher educa- 
tion level from college or university. A total of 62 
percent reported that they were employed or self- 
employed and 10 percent were students. A total of 
10 percent of the respondents reported that they did 
not have a driving license. In the sample, 8 percent 
(n = 164) had experienced an accident as a cyclist 
during the last two years. Two-thirds (66%) of the 
accidents were single accidents (i.e. accidents not 
involving other road users), and 28 percent of the 
cyclists involved in an accident needed medical 
treatment afterwards. 


3.2 Questionnaires and measurement instruments 


In both Study 1 and Study 2, the respondents were 
asked to evaluate the probability of an adverse 
event when cycling and the severity of its conse- 
quences, and their degree of worry about being 
in an adverse event when cycling in winter and 
summer conditions. The questionnaires also con- 
tained questions about age, gender, highest level 
of completed education, employment status, and 
possession of a driving licence. In addition, the 
questionnaire used in Study 1 contained questions 
about risk tolerance, safety priority, and cycling 
frequency during the four seasons. The survey in 
Study 2 contained questions about accident experi- 
ences as a cyclist during the last two years. 

To measure risk perception, the respondents 
were asked to assess their probability of experienc- 
ing an accident or injury, and to judge the severity 
of the consequences if such an event were to occur. 
The scale for measuring the probability assess- 
ments was a five-point evaluation scale ranging 
from ‘not at all probable’ to ‘very probable’. For 
the judgement of severity of the consequences, the 
scale ranged from ‘not at all serious’ to ‘very seri- 
ous’. The respondents were further asked to rate 
how worried they were about being involved in an 
accident when cycling, and the measurement scale 
ranged from ‘not at all worried’ to ‘very worried’. 


3.2.1 Study 1 

To measure risk tolerance, the respondents were 
asked the following question: ‘To what extent do 
you tolerate being exposed to risk when cycling?’ 
The five-point evaluation scale ranged from ‘tolerate 
the risk absolutely’ to ‘do not tolerate any risk’. 
To measure priority given to safety, the respondents 
were asked the following question: ‘How 
important do you think it is that the authorities 
prioritize measures to improve safety for cyclists?’ 
The five-point scale ranged from ‘not at all important’ 
to ‘very important’. The respondents were 
asked how often they cycled each season (winter, 
spring, summer, and autumn). For this measurement, 
a six-point evaluation scale was applied: 5 
or more times per week, 3-4 times per week, 1-2 
times per week, Monthly, Rarely, and Never. This 
measure has been found appropriate in previous 
studies of bicycle use in Norway (Kummeneje & 
Tretvik, 2015, Tretvik, 2015). 
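
As an illustration of how the six-point frequency scale could be prepared for analysis, the sketch below maps the response categories to ordinal codes. The numeric codes (1 to 6) and the helper names are assumptions made for this example; the paper does not report how the categories were scored.

```python
# Hypothetical ordinal coding of the six-point cycling-frequency scale.
# The codes are an assumption for illustration only.
FREQUENCY_SCALE = [
    "Never",
    "Rarely",
    "Monthly",
    "1-2 times per week",
    "3-4 times per week",
    "5 or more times per week",
]
FREQUENCY_CODE = {label: code for code, label in enumerate(FREQUENCY_SCALE, start=1)}


def encode_frequency(responses):
    """Map response labels to ordinal codes (1 = Never ... 6 = 5 or more times per week)."""
    return [FREQUENCY_CODE[label] for label in responses]


def share_at_or_above(responses, label):
    """Proportion of respondents who cycle at least as often as `label`."""
    threshold = FREQUENCY_CODE[label]
    codes = encode_frequency(responses)
    return sum(code >= threshold for code in codes) / len(codes)


# Made-up answers from five hypothetical respondents for one season.
winter_answers = [
    "Never",
    "5 or more times per week",
    "3-4 times per week",
    "5 or more times per week",
    "Rarely",
]
print(encode_frequency(winter_answers))
print(share_at_or_above(winter_answers, "5 or more times per week"))
```

Shares such as the proportion cycling five or more times per week in a given season can then be computed from the coded answers.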


3.2.2 Study 2 

To measure cyclists’ experiences of accidents, the 
respondents were asked whether they had been 
involved in an accident, including a single acci- 
dent, as a cyclist during the last two years. If 
they reported that they had been in an accident, 
they were further asked if other road users were 
involved and whether they had needed medical 
help after the accident. 


3.3 Statistical procedures 


3.3.1 Study 1 

Paired sample t-tests were used to investigate dif- 
ferences in risk perception (probability and con- 
sequences of being in an accident), worry, risk 
tolerance, and priority given to safety, between 
cycling in winter and summer conditions. A hier- 
archical regression analysis was used to predict the 
amount of cycling done in all seasons. 
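
For readers unfamiliar with the procedure, a paired-samples t-test compares two ratings given by the same respondent. The sketch below is a minimal standard-library version; the function name and the six pairs of ratings are invented for illustration and are not the study data. It returns the t statistic, the degrees of freedom, and the standardized mean difference, and leaves out the p-value because the standard library has no t distribution.

```python
import math
import statistics


def paired_t_test(first, second):
    """Return (t, df, d_z) for a paired-samples t-test on two equal-length lists."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD of the differences (n - 1 denominator)
    t = mean_diff / (sd_diff / math.sqrt(n))
    d_z = mean_diff / sd_diff  # standardized mean difference for paired data
    return t, n - 1, d_z


# Made-up five-point ratings from six hypothetical respondents.
winter = [3, 4, 2, 5, 3, 4]
summer = [2, 3, 2, 4, 3, 3]
t, df, d_z = paired_t_test(winter, summer)
print(f"t({df}) = {t:.3f}, d_z = {d_z:.3f}")
```

A positive t here means that the first set of ratings (winter) is higher on average than the second (summer).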


3.3.2 Study 2 

MANCOVAs were used to investigate differences 
in risk perception (probability and consequences of 
being in an accident) and worry between respond- 
ents who had experienced an accident while cycling 
during the last two years and respondents who did 
not have the same experience. The results were 
controlled for gender, age, and education level. 


4 RESULTS 
4.1 Risk perception relating to cycling in winter 
and summer conditions 


The paired-samples t-tests showed significant 
differences in the respondents’ assessment of risk 
during winter and summer cycling conditions. This 
was the case for the subjective assessments of the 
probability of an accident (t = 5.722, p < .001) and 
for how worried they were about being involved 
in an accident (t = 6.597, p < .001). The respondents 
perceived greater risks for cycling in winter 
conditions compared with cycling in summer 
conditions. However, it is interesting to note that 
there were no significant seasonal differences in 
the respondents’ judgements of the severity of the 
consequences if an accident were to occur. Further, 
there were significant differences in risk tolerance 
(t = 3.320, p < .001). With regard to risk 
tolerance, the larger the mean value was, the less 
the tolerance for risk was. Table 1 shows that the 
respondents tolerated less risk when cycling in 
winter than when cycling in summer conditions. 
Further, the results revealed that there were no 
significant differences in priority given to safety 
when cycling in winter conditions compared with 
when cycling in summer conditions (Table 1). 


Table 1. Differences in risk perception, worry, risk 
tolerance, and priority given to safety when cycling in 
winter and summer conditions (n = 263). 

                   Winter cycling     Summer cycling 
                   conditions         conditions         t-value 
                   Mean    SD         Mean    SD         (Sig. 2-tailed) 
Probability        2.94    1.090      2.64    1.014      5.722*** 
Consequence        3.21    .967       3.24    .944       -.724 
Worry              2.60    1.212      2.23    1.015      6.597*** 
Risk tolerance     2.62    1.083      2.47    1.025      3.320*** 
Safety priority    4.51    .744       4.56    .622       1.643 

*p < .05, **p < .01, ***p < .001. 


The standard deviations for all variables were 
relatively high and there were variations in the 
respondents’ perceptions of risk. This was the case 
for both cycling in summer conditions and cycling 
in winter conditions. 


4.2 Predictors for cycling frequency during winter 


Table 2 shows the results of a hierarchical multiple 
regression analysis that was used to predict cycling 
frequency during the winter season. The independent 
variables were entered in five blocks: demographics, 
risk tolerance, priority given to safety, 
risk perception, and worry. Table 2 shows only 
the two final steps of the analysis. Demographics 
were entered as controlling variables in the analysis. 
Gender and age were found to be significant 
predictor variables. Female respondents cycled less 
than male respondents during winter, and the older 
the cyclists were, the more they cycled during winter. 
Educational level did not seem to be associated 
with whether the respondents cycled during winter. 


Table 2. Dimensions of cycling frequency during 
winter, showing only the two final steps of the analysis 
(n = 263). 

                                        Standardized beta coefficient 
                                        Model 4       Model 5 
Block 1: Demographics 
  Gender (male = 0, female = 1)         -.14*         -.13* 
  Age                                    .14           .12* 
  Education level                        .08           .11 
Block 2: Risk tolerance 
  Risk tolerance, winter conditions     -.49***       -.39*** 
  Risk tolerance, summer conditions      .28           .21 
Block 3: Safety priority 
  Safety priority, winter conditions     .15           .11 
  Safety priority, summer conditions     .04           .08 
Block 4: Risk perception 
  Probability, winter conditions        -.16*         -.03 
  Probability, summer conditions         .09           .04 
  Consequence, winter conditions        -.23*         -.18 
  Consequence, summer conditions         .16           .15 
Block 5: Worry 
  Worry, winter conditions                            -.24* 
  Worry, summer conditions                             .12 
Adjusted R²                              .30           .33 
F change                                4.216**       6.031** 

*p < .05, **p < .01, ***p < .001. 


In total, the predictor variables explained 
an acceptable percentage of variance (adjusted 
R² = .33). The results showed that risk tolerance, 
when cycling in both winter and summer conditions, 
was the most important predictor of 
cycling frequency during the winter. The more the 
cyclists tolerated exposure to risk when cycling in 
winter conditions, the more they cycled during the 
winter. By contrast, the less the cyclists tolerated 
exposure to risk when cycling in summer conditions, 
the more they cycled during the winter. The 
inclusion of priority given to safety significantly 
improved the explained variance, but neither of 
the two variables was individually a significant 
predictor of bicycle use. When risk perception 
was included in the model as the fourth block of 
variables, the explained variance improved significantly 
(Table 2). Both the judgements of probability 
and the severity of consequences when cycling 
in winter conditions were found to be important 
predictors of cycling frequency during winter. The 
higher the cyclists perceived the probability of 
being involved in an accident and the more serious 
the consequence of an accident was perceived to be, 
the less they cycled during winter. The perceived 
severity of consequences was more strongly associated 
with cycling frequency than the probability 
assessment. Additionally, worry related to cycling 
in winter conditions was found to be an important 
predictor of cycling frequency. The more worried 
the cyclists were, the less they used their bicycle 
during the winter. 

When worry was included in the model as the 
last step in the analysis, the risk perception predic- 
tors (probability and severity of consequences) lost 
prediction power. This may indicate that risk per- 
ception has an indirect effect on cycling frequency 
during winter. 


4.3 Predictors of cycling frequency during all 
seasons of the year 


Our next step was to predict and compare seasonal 
differences in cycling. A total of four hierarchical 
multiple regression analyses were carried out, and 
the results are summarized in Table 3 and Table 4. 
The same group of predictors as used previously 
to predict cycling frequency during the winter was 
used to predict cycling frequency during all the 
seasons of the year. 


Table 3. Predictors of cycling frequency in winter and 
summer (n = 263). 

                            Winter                  Summer 
                            Adj. R²   F change      Adj. R²   F change 
Block 1: Demographics       .10       9.938***      .02       2.567 
Block 2: Risk tolerance     .25       24.964***     .03       2.731 
Block 3: Safety priority    .26       3.663*        .03       .580 
Block 4: Risk perception    .30       4.216**       .03       .846 
Block 5: Worry              .33       6.031**       .03       1.824 

*p < .05, **p < .01, ***p < .001. 


Table 4. Predictors of cycling frequency in spring and 
autumn (n = 263). 

                            Spring                  Autumn 
                            Adj. R²   F change      Adj. R²   F change 
Block 1: Demographics       .00       1.043         .02       2.244 
Block 2: Risk tolerance     .06       8.864***      .05       24.714* 
Block 3: Safety priority    .07       2.322         .04       .594 
Block 4: Risk perception    .10       2.514*        .10       5.5154*** 
Block 5: Worry              .11       3.342*        .11       1.117 

*p < .05, **p < .01, ***p < .001. 

The model explained the largest amount of 
the variance in cycling frequency during winter 
(adjusted R² = .33). The model was least successful 
in explaining cycling frequency during summer 
(adjusted R² = .03). Therefore, we did not find the 
model a good fit for predicting cycling frequency 
during summer. 

The model explained an identical amount of variance 
in cycling frequency during spring (adjusted 
R² = .11) and autumn (adjusted R² = .11). For both 
seasons, risk tolerance and risk perception were 
significantly associated with frequency of cycling. 
In addition, worry was significantly associated 
with frequency of cycling during spring but not 
autumn. 
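
The two quantities reported for each block, the adjusted R² and the F change, follow from standard formulas, sketched below. The predictor counts (11 after the risk perception block, 13 after the worry block) are read off Table 2. The unadjusted R² values in the example are hypothetical, because the paper reports adjusted values only, and the function names are illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k predictors fitted to n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def f_change(r2_reduced, r2_full, n, k_full, k_added):
    """F statistic for the R-squared gained by adding k_added predictors in a new block."""
    numerator = (r2_full - r2_reduced) / k_added
    denominator = (1.0 - r2_full) / (n - k_full - 1)
    return numerator / denominator


# Hypothetical unadjusted R-squared values before and after the final block.
n = 263
r2_block4, r2_block5 = 0.30, 0.34
print(round(adjusted_r2(r2_block5, n, k=13), 3))
print(round(f_change(r2_block4, r2_block5, n, k_full=13, k_added=2), 3))
```

The F change has k_added and n - k_full - 1 degrees of freedom, so a block of two predictors can raise the explained variance significantly even when neither predictor is individually significant.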

To summarize, the results showed that risk per- 
ception was significantly associated with cycling 
frequency during winter. However, perceived risk 
was not strongly associated with cycling frequency 
during the other seasons. Additionally, worry was 
found significant for cycling frequency during win- 
ter and spring, but not summer and autumn. 


4.4 The role of accident experience in risk 
perception and worry 


In Study 2, we conducted a MANCOVA to examine 
whether the group of individuals that had 
experienced a bicycle accident during the last two 
years perceived the risk differently from the other 
individuals in the sample (Table 5), including both 
their perceived probability of being in an accident 
and the severity of consequences if an accident 
should occur. In addition, we wanted to investigate 
differences between these groups with respect 
to worry about being involved in an accident. The 
results were controlled for differences in gender, 
age, and education level. The results showed 
differences between the two groups for cycling in 
summer conditions (Wilks’ λ = .990, F = 6.792, 
p < .001) and cycling in winter conditions (Wilks’ 
λ = .995, F = 3.232, p < .05). In the following, we 
only present the results for differences in risk 
perception and worry for cycling in summer 
conditions, because most cycling accidents happen 
during summer. 


Table 5. The role of accident experience in risk 
perception and worry when cycling in summer conditions 
(n = 2000). 

                Accident            No accidents 
                experience          last two years 
                Mean    SD          Mean    SD         F (Sig.) 
Probability     2.44    1.175       2.16    1.053      10.238*** 
Consequence     3.28    1.222       3.27    1.253      .084 
Worry           2.22    1.191       1.95    1.081      14.233*** 

*p < .05, **p < .01, ***p < .001. 

As shown in Table 5, individuals that had experi- 
enced an accident as a cyclist perceived their prob- 
ability of being in an accident as higher than did 
the other individuals. They also tended to be more 
worried about being involved in an accident when 
cycling. There were no differences in the perceived 
severity of consequences between the two groups. 
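
With only two groups, Wilks’ λ converts exactly to an F statistic, which allows a rough consistency check of the multivariate results. The sketch below assumes three dependent variables (probability, consequence, worry), three covariates (gender, age, education level) and that all 2000 respondents entered the analysis; the last point in particular is an assumption, and the function name is illustrative.

```python
def wilks_to_f(wilks_lambda, n_total, n_groups, n_covariates, n_dependents):
    """Exact F for Wilks' lambda when the effect has one degree of freedom (two groups)."""
    if n_groups != 2:
        raise ValueError("this exact conversion only holds for a two-group effect")
    df_error = n_total - n_groups - n_covariates
    df1 = n_dependents
    df2 = df_error - n_dependents + 1
    f_value = (1.0 - wilks_lambda) / wilks_lambda * df2 / df1
    return f_value, df1, df2


# Rounded lambda values as reported in the text; N = 2000 is assumed.
for season, lam in (("summer", 0.990), ("winter", 0.995)):
    f_value, df1, df2 = wilks_to_f(lam, n_total=2000, n_groups=2,
                                   n_covariates=3, n_dependents=3)
    print(f"{season}: F({df1}, {df2}) = {f_value:.2f}")
```

Under these assumptions the conversion gives F values of roughly 6.7 and 3.3, close to the reported 6.792 and 3.232. Small gaps of this size are to be expected when λ is rounded to three decimals and some respondents have missing values.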

In addition, we conducted two separate 
MANCOVAs. One tested whether the experience 
of an accident that involved other road-users 
would influence the perception of risk or worry, 
and the other tested whether the need for medical 
help after an accident influenced the perception 
of risk or worry. Neither test showed a significant 
effect. 


5 DISCUSSION 


The results showed seasonal differences in work 
travel cyclists’ perceptions of risk, worry, and risk 
tolerance when cycling. The risk of being involved 
in an accident was perceived as higher and the 
cyclists tended to be more worried about such inci- 
dents and tolerated less risk when cycling in winter 
conditions compared with cycling in summer conditions. 
Primarily, the probability of being involved 
in an accident was perceived as higher when cycling 
in winter conditions. There were no differences in 
the perceived severity of consequences. With dark- 
ness and icy and snowy roads in Norway in winter, 
it is natural that cycling in winter may be per- 
ceived as a bigger challenge than in summer, and 
the probability of being involved in an accident in 
winter was perceived as increased. One reason why 
the consequences were not perceived as increased 
might have been that the type of accidents the 
cyclists imagined they could be involved in did not 
differ with winter and summer conditions. 

We found that, on average, the respondents 
cycled less during the winter season. Further, the 
results showed that risk perception, worry, and 
risk tolerance influenced cycling frequency during 
the winter season. The assessment of risk was less 
important for cycling frequency during the other 
seasons of the year. One explanation for this is that 
risk in general is perceived as very low when cycling 
in summer conditions, and that a risk has to exceed 
a cyclist’s tolerance threshold before it influences 
their behaviour. 

In common with other researchers (Hermand 
et al., 1999, Lawson et al., 2013, Manton et al., 
2016, Moen & Rundmo, 2006, Parkin et al., 2007, 
Sjöberg, 1998, 2000), we found that demographic 
variables were associated with risk perception, 
risk tolerance, and worry. Compared with males, 
females tended to tolerate risk less, and they were 
more worried about involvement in an accident 
and perceived the risk as higher. 

The results of our study contribute to an under- 
standing of why work travel cyclists in Norway 
cycle less during the winter than in other seasons 
of the year. From a pro-environmental perspec- 
tive, it is important that people who use bicycles 
for their daily commutes and work-related travel 
do not change to motorized modes of travel during 
the winter season. Campaigns that aim to increase 
the number of cyclists may be ineffective if they 
do not take into account that the risk of being 
involved in an accident is perceived differently dur- 
ing the different seasons of the year. 

The results from Study 2 showed that accident 
experience can influence risk perception and worry 
when cycling. Primarily, the probability of being 
in an accident was perceived as higher by those who 
had experienced an accident than by persons 
who had not experienced an accident when cycling. 
The perceived severity of consequences did not dif- 
fer between individuals who had experienced an 
accident and those who had not, and in general the 
consequences were perceived as serious. This may 
have been due to the fact that the cycling accidents 
reported in the media are often the most serious ones. 

The response rate in Study 1 was relatively 
low and can be regarded as a limitation of our 
study. However, a low response rate does not 
necessarily constitute a methodological problem. Rather, 
it is a problem only if the sample is not 
representative of the target population (Krosnick, 
1999). We did not have any information about the 
non-respondents, and we cannot draw any conclusions 
about whether the respondents and the 
non-respondents differed significantly. However, the 
target group of the study was work travel cyclists 
who cycled on a daily basis. We assume that those 
who cycled often also visited the web page from 
which the respondents were recruited more often 
than other cyclists. 


6 CONCLUSION 


Study 1 showed seasonal differences in how work 
travel cyclists perceived the risk and how worried 
they were about being involved in an accident as 
well as how they tolerated being exposed to the 
risk. As expected, cyclists perceived the risk as 
higher, were less tolerant of risk, and were more 
worried about being involved in an accident when 
cycling in winter conditions compared with cycling 
in summer conditions. Further, the results showed 


that risk perception, worry, and risk tolerance 
influenced cycling frequency during winter. The 
poor prediction of the model for cycling frequency 
during spring, summer, and autumn may have been 
due to the fact that the majority of the respond- 
ents perceived the risk as low during these seasons. 
Lastly, Study 2 showed that previous experience of 
an accident was associated with risk perception and 
worry. 


ABSTRACT: This paper addresses Operational Barrier Elements (OBE) to strengthen the management 
of Major Accident Hazards (MAH) in a major oil and gas field development project. OBEs are defined 
as safety critical tasks that contribute to preventing and mitigating major accident hazards. In total, 87 OBEs were 
identified, distributed across four platforms and 15 safety performance standards. The majority of the 
OBEs are related to gas detection, emergency shut-down, active fire protection, and well integrity. The 
primary output of the activity is Performance Requirements (PR) for each OBE, and recommendations 
for strengthening human performance through Performance Influencing Factors (PIF). OBEs were 
integrated in the field-wide safety strategies, and the Operational Readiness team incorporated the OBEs in 
System descriptions and Operational Procedures (SO). Even though no critical deficiencies in design or 
barriers were identified, systematic integration of OBEs in engineering and operations will strengthen the 
field’s robustness to withstand major accident hazards in operations. 


1 INTRODUCTION 


1.1 Background 


In recent years there has been a renewed focus on 
the concept of safety barriers, including safety 
critical tasks, and their role in preventing and mitigating 
major accident hazards on the Norwegian 
Continental Shelf. Partly, this is due to the Macondo 
accident, where human and organizational factors 
contributed to escalation of the accident 
(CSB, 2016). In volume 3 of the Macondo well 
incident investigation report, CSB asserts that the 
safety barrier concept must extend beyond physical 
safeguards, and that solutions to technical 
failures cannot prevent future incidents without 
giving equal attention to failures of less visible, 
non-physical barriers and support systems (CSB, 
2016, p. 19). 

The Norwegian Petroleum Safety Authority 
(PSA) reinforces these conclusions by describing the 
principles for barrier management in the petroleum 
industry (PSA, 2013; 2017). Here, safety barrier 
management is defined as the systematic effort to ensure 
that barriers are in place to provide protection against 
hazards and accident situations (PSA, 2017). 

Traditionally, major accident risk on offshore 
petroleum installations is reduced by striving for 
inherently safe design, automation systems and 
fail-safe principles. PSA (2017), however, states 
that the design of safe and robust installations is 
not sufficient to protect against failures and hazards, 
and that accidents will continue to happen. It 
is postulated that most major accidents are 
significantly influenced by human actions, both in 
cause and consequence. Hence, since the safety of 
petroleum installations is dependent on reliable 
human performance, safety barrier management 
must incorporate operational and organizational 
aspects, in addition to the merely technical ones 
(PSA, 2017). 

Barriers and control mechanisms are often 
visualized as a bow-tie diagram, with a top event 
in the center of the model, the hazards and 
preventive barriers on the left-hand side, and 
barriers for mitigation of the event on the right-hand 
side (Ruijter & Guldenmund, 2014). Leva, 
Angel, Plot & Gattuso (2013) apply the bow-tie model 
to assess a scenario where human actions 
are the main barrier to accidental conditions 
(overfilling a storage vessel). 

The concept of operational barriers in the 
petroleum industry is addressed in several recent 
publications. In addition to the PSA, the Norwegian 
Shipowner Association’s (NSA) report on 
good practices for barrier management for the 
rig industry (NSA, 2014) provides definitions and a 
framework that are in line with the methodology 
outlined in this paper. Similarly, Hauge & Øien 
(2016) provide practical guidelines for barrier 
management in the petroleum industry, in accordance 
with the recommendations in PSA (2013). 

In a recent paper on human factors in bar- 
rier management, McLeod (2017) describes three 
generic types of barriers against threats, and their 
order of importance, or expected strength: 


• Engineered controls, which are physical, technical 
  and automated barriers 
• Organizational controls, including governing systems 
  and procedures, team organization etc. 
• Human controls, which are the individual factors 
  such as training, competence etc. 


According to McLeod (2017), human per- 
formance is central in maintaining and ensuring 
the integrity of barriers. Instead of considering 
human error as a hazard, the focus should be on 
improving the resilience of barriers, to be able to 
withstand human errors that degrade the barriers. 
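
To make the barrier concepts in this section concrete, the sketch below represents a bow-tie as a small data structure whose barriers are tagged with McLeod's three control types. The class names are assumptions made for this example, and the barriers listed for the vessel-overfilling scenario are invented for illustration; they are not taken from Leva et al. (2013) or from the project described in this paper.

```python
from dataclasses import dataclass, field
from enum import Enum


class ControlType(Enum):
    ENGINEERED = "engineered"
    ORGANIZATIONAL = "organizational"
    HUMAN = "human"


@dataclass
class Barrier:
    name: str
    control_type: ControlType


@dataclass
class BowTie:
    top_event: str
    hazards: list = field(default_factory=list)
    preventive: list = field(default_factory=list)  # left-hand side of the diagram
    mitigating: list = field(default_factory=list)  # right-hand side of the diagram

    def barriers_of_type(self, control_type):
        """All barriers of one control type, on either side of the top event."""
        return [b for b in self.preventive + self.mitigating
                if b.control_type is control_type]


# Hypothetical example content.
bow_tie = BowTie(
    top_event="Overfilling of storage vessel",
    hazards=["Transfer of flammable liquid"],
    preventive=[
        Barrier("High-level alarm", ControlType.ENGINEERED),
        Barrier("Filling procedure", ControlType.ORGANIZATIONAL),
        Barrier("Operator stops transfer", ControlType.HUMAN),
    ],
    mitigating=[Barrier("Spill containment", ControlType.ENGINEERED)],
)
print([b.name for b in bow_tie.barriers_of_type(ControlType.HUMAN)])
```

Grouping barriers by control type in this way shows at a glance how much of the protection around a top event rests on human action.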


1.2 Safety critical tasks 


Task analysis is the study of what a person is required 
to do, in terms of actions and mental processes, to 
achieve a goal (Kirwan & Ainsworth, 1992). For 
Safety Critical Task Analysis (SCTA), the Energy 
Institute (EI) extends this to tasks that contribute 
to MAHs in positive or negative ways, including 
initiating events; prevention and detection; control 
and mitigation; and emergency response. EI lists 
the following steps of SCTA: 


1. Identify main MAHs 

2. Identify safety critical tasks 

3. Understand tasks 

4. Represent critical tasks 

5. Identify human failures and PIF 

6. Determine safety measures 


According to OGP (2011), a safety critical 
task inventory should be established for projects 
involving major accident hazard potential, e.g. to 
summarize all human tasks which are identified as 
safety barriers. 

Human performance of safety critical tasks is 
shaped by a range of factors, which, depending 
on the operational context, can drive performance 
in both a negative and a positive direction. Systematic 
management of such PIFs is key to ensuring 
that Operational Barrier Elements (OBE) are 
effective. 

Different methodologies apply different sets of 
PIFs. As a basis for establishing a standard framework 
of PIFs, various methods were reviewed. 
PIFs can generally be classified into four main 
categories, relating to the individual, the environment, 
the organization, and the task. According to 
Boring & Blackman (2007, p. 177), a PIF is an aspect 
of the human’s individual characteristics, environment, 
organization, or task that specifically decrements 
or improves human performance, thus respectively 
increasing or decreasing the “likelihood of human 
error”. 

In the context of operational barriers or safety 
critical tasks, PIFs are the factors which have a 
significant effect on barrier element performance. 
In this context, the definition refers to individual, 
workplace or other contextual factors which have a 
significant effect on the performance of an operator 
or a crew of operators. A task-based approach to OBE 
enables application of Human Reliability Analysis 
(HRA) to assess the qualitative or quantitative 
reliability of Safety Critical Tasks. According to Bye 
et al. (2017), HRA can be applied to assess and 
understand human actions as barriers in major 
accidents. 

Quantification of Human Error Probability 
can provide valuable input to decision making, 
i.e. whether to implement design measures such 
as technical systems to eliminate risk, or whether 
the risk can be considered acceptable, and operational 
measures are sufficient. Petro-HRA applies 
a list of eight Performance Shaping Factors (PSF), 
which are modified for a petroleum context based 
on SPAR-H (U.S. Nuclear Regulatory Commis- 
sion, 2005). According to Bye et al. (2017) only the 
PSFs that have been shown in general psychologi- 
cal literature and in other HRA methods to have 
a substantial effect on human performance when 
performing control room tasks are included in 
Petro-HRA. This includes: 


• Task factors: task complexity, time, procedures 
• Individual factors: threat stress, experience/training 
• Organizational factors: teamwork, attitudes 
• Environmental factors: physical working environment 
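
The grouping above can be written down as a simple lookup structure, which also confirms that the four categories together contain the eight factors mentioned in the text. The dictionary and helper names are assumptions made for this sketch and are not part of Petro-HRA.

```python
# Factor names are taken from the list above; the structure itself is illustrative.
PIF_CATEGORIES = {
    "Task factors": ["task complexity", "time", "procedures"],
    "Individual factors": ["threat stress", "experience/training"],
    "Organizational factors": ["teamwork", "attitudes"],
    "Environmental factors": ["physical working environment"],
}


def category_of(pif):
    """Return the category that a performance influencing factor belongs to."""
    for category, factors in PIF_CATEGORIES.items():
        if pif in factors:
            return category
    raise KeyError(f"unknown PIF: {pif}")


total = sum(len(factors) for factors in PIF_CATEGORIES.values())
print(total, category_of("threat stress"))
```

A structure of this kind can serve as a checklist when each safety critical task is assessed against every factor in turn.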


In the framework described in this paper, the 
PIFs defined in Bye et al. (2017) were selected for 
assessment of the safety critical tasks. In engineering, 
the first step in strengthening task performance 
is task design, e.g. removing the task 
entirely by automating a manual task, increasing the 
level of automation or introducing additional technical 
barrier elements. Another measure to improve 
human reliability is adapting the physical configuration, 
conditions and surroundings to human 
cognition and anthropometrics, e.g. through 
human-centered design (ISO 11064-1). 


1.3 Definitions 


Requirements for operational barriers are outlined 
in Statoil governing documentation, and in chapter 5 
of the PSA management regulations. Safety barriers 
shall be effective in all situations, and their role in 
maintaining the intended safety functions must be 
well known across the organization (PSA, 2017). 

This paper applies definitions provided in the 
Statoil internal report Definitions and guide- 
lines for non-technical barriers (Gould, Sklet & 
Ludvigsen, 2015), which aimed to standardize the 
definition and application of operational barrier 
elements in safety barrier management in Statoil. 

In this report, an operational barrier element (OBE) is
defined as a safety-critical task performed by a
person, or team of personnel, which realizes one or
several barrier functions.

A technical barrier element is defined as engi- 
neered equipment, systems and structures which 
realize one or several barrier functions. 

A barrier function consists of the range of tech- 
nical systems, structures, personnel and tasks, 
which are required to fulfill (“realize”) the barrier 
function. These are referred to as barrier elements, 
and can be defined as technical or operational 
measures which alone or together realize one or 
several barrier functions. 

Performance Requirements (PRs) shall be estab- 
lished both for technical and operational barrier 
elements. Performance requirements for opera- 
tional barrier elements include a description of 
how the task should be conducted, who is respon- 
sible for performing a task (the formal role), and if 
possible, when (time criterion). 

Safety Critical Tasks (SCT) are the physical 
actions or activities by which human performance 
contributes positively or negatively to major acci- 
dent risk, through either: 


— Initiation of events; 
— Detection and prevention; 
— Control and mitigation; or, 
— Emergency response 


A similar definition is provided by the Energy 
Institute’s report on Safety Critical Task analysis 
(2011). However, the SCT term is broader and cov- 
ers a wider range of tasks than just OBEs. Some 
tasks can be critical because of their indirect influ- 
ence on barrier performance. This typically refers 
to inspection, testing and maintenance of technical 
barrier elements. 

This paper aims to capture SCTs that can be
defined as OBEs, i.e. tasks that have a direct,
real-time role in realizing a barrier function related to
preventing or mitigating major accident hazards in
an accident sequence. The work aimed
to strengthen barrier performance by systematically
addressing operational elements throughout engineering,
to enable barrier management in operations.

OBEs include tasks required for realization of
on-demand barrier functions, where human performance
directly contributes to the availability or
integrity of a barrier function, such as maintaining the
primary barrier during well operations, crane operations,
and containment during maintenance.

Barrier functions may be gas detection, Emer- 
gency Shut Down (ESD), blowdown, etc., where 
human action may prevent an accident, or where 
an omission or error of commission may directly 
contribute to the unavailability or degradation of 
a barrier function. 

In this context, major accidents are accidents
leading to multiple fatalities, major environmental
harm or loss of an asset. The term “hazard” refers to a
potential source of harm. The term “defined situations
of hazard and accident” (DSHA) refers to a
selection of hazardous and accidental events that
will be used for the dimensioning of the emergency
preparedness for the activity.


1.4 Limitations 


Process disturbances or minor deviations during 
normal operations are excluded, because they are 
controlled by inherent safety measures or by con- 
trol mechanisms that prevent escalation. 

Tasks which have an indirect influence on the
barrier function, and for which routine work and
operational safeguards are an established part of
planned maintenance or safe work processes, are
excluded. Examples include:


— Routine inspections to check condition 

— Maintenance, calibration and testing 

— Inhibitions of safety systems as part of planned 
operations 

— Work permits and safe job analysis 

— Purely administrative tasks, such as handovers, 
issuing work orders etc. 


The framework does not address barriers that
protect against risks related to working environment,
personal injury, security, or production
regularity. Topics such as strategic management of
change, organizational learning, and safety culture,
although considered important, are also excluded from
the framework described in the paper. This is in line
with the guidelines stipulated in Gould et al. (2015).


2 CASE STUDY 


2.1 Project description 


The project concerns one of the five largest oil and gas
fields on the NCS. The resources are estimated to
be between 2.0 and 3.0 billion barrels of oil equivalents,
and peak production will reach 660,000 barrels
daily. The field will be operated by electrical power
generated onshore, reducing offshore emissions of
climate gases by 80-90% compared to installations
utilizing gas turbines.

The field development consists of four interconnected
topside platforms: a Living Quarter
with more than 500 beds, a Drilling Platform
with utility module and integrated drilling facilities,
a Process Platform including a three-stage separation
process and gas compression equipment, and a
Riser Platform with oil and gas export risers.

The engineering of the different platforms is 
performed by different engineering teams at sepa- 
rate locations. The field has several safety critical 
interfaces, such as facility for converting power 
from shore, gas export pipeline to an onshore plant 
and oil export pipeline to an onshore oil refinery. 

The entire field will be controlled from the Control
Centre, including the Central Control Room and
Emergency Control Centre located at the Living
Quarter. The size, complexity and value of the
field place high demands on safety and productivity.
Production start for Phase One is planned for
late 2019.


2.2 Scope 


This paper describes the effort put into opera- 
tionalizing the theoretical framework presented 
above through identifying, analyzing and defin- 
ing operational barriers in a major oil & gas field 
development project. The goal of mapping and 
assessing OBEs was to define the factors that 
shape human performance, and to strengthen 
these factors through engineering and operational 
readiness. This will enable a systematic framework 
for management of operational safety barriers in 
operation. 


3 METHODOLOGY


3.1 Mapping and assessment of operational
barrier elements


The methodology for mapping and assessment of
OBEs is described in a project-specific Terms of Reference
(ToR) document. The purpose of the ToR
is to establish guiding principles for the work, and
to ensure standardization across different platforms
and contractors. The method is divided into two
main steps, summarized in the sections below.


3.2 Identification and assessment of safety 
critical tasks 


Data required for identification and assessment of
Safety Critical Tasks (SCT) were primarily collected
by means of document reviews and workshops
with Operations and technical disciplines (Energy
Institute, 2011).

First, Safety Critical Tasks were identified by 
reviewing project information and documentation, 
and attending workshops (HAZOPs, LOPAs, etc). 
Experience from other installations and projects 
was also gathered as input to the SCTA inventory. 

A set of meetings and workshops was arranged
to collect and analyze the data and to identify
and screen tasks. The meetings followed the activities
described in the project Terms of Reference,
and mostly involved representatives from Statoil
Operations and relevant technical disciplines on
each platform.

Second, the tasks in the SCTA inventory were screened in
terms of risk and relevance, and of whether they could be
considered critical for realizing a barrier function.
Screening criteria consisted of:


- Barrier functions that depend on the task
  performance
- The consequence of human failure
- The severity of the MAH
- Task familiarity and complexity
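
The screening logic can be sketched as a simple filter over the task inventory. The paper does not state how the four criteria were scored or combined, so the three-point scales, the additive score and the threshold below are purely hypothetical assumptions, as are the ratings given to the two example tasks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskScreening:
    task: str
    barrier_function_depends_on_task: bool  # criterion 1
    consequence_of_failure: int             # criterion 2, assumed scale 1 (low) to 3 (high)
    mah_severity: int                       # criterion 3, assumed scale 1 (low) to 3 (high)
    unfamiliar_or_complex: bool             # criterion 4


def is_proposed_obe(t: TaskScreening, threshold: int = 5) -> bool:
    """Hypothetical rule: a task with no direct barrier role is screened out;
    otherwise it is kept if the summed ratings reach the threshold."""
    if not t.barrier_function_depends_on_task:
        return False
    score = (t.consequence_of_failure + t.mah_severity
             + (1 if t.unfamiliar_or_complex else 0))
    return score >= threshold


# Illustrative ratings only; they are not the project's assessments.
inventory = [
    TaskScreening("Manual cancellation of emergency depressurization",
                  True, 3, 3, True),
    TaskScreening("Routine inspection to check condition",
                  False, 2, 2, False),
]
proposed = [t.task for t in inventory if is_proposed_obe(t)]
print(proposed)
```

The first criterion acts as a gate because the framework only admits tasks with a direct role in realizing a barrier function; the remaining criteria rank what is left.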


The outcome of the screening was a list of pro- 
posed Operational Barrier Elements. The aim of 
the screening analysis was to reduce the number 
of tasks such that further work would be limited 
to the ones significantly contributing to the risk 
picture. For the tasks identified as highly critical 
against a set of pre-defined criteria, more detailed 
task and failure analysis was conducted: 


- OBE task analysis: provide task description and
  assessment of Performance Influencing Factors
- OBE failure analysis: identify and
  mitigate risk associated with human failures
- Assessment of Performance Influencing Factors
- Definition of design requirements


Although the definition of major accident refers 
to safety, environment and asset losses, only OBEs 
which have a role in managing MAHs with an 
impact to safety and environment were considered. 

A set of criticality levels was used to identify
tasks where the barrier function itself is automatic,
but where the operator(s) have a back-up role in
monitoring and correcting system performance
(if necessary).

For example, the task “Manual cancellation of
Emergency Depressurization DP in case of gas
leak in the flare system/KO drum” was considered
highly critical. One of the concerns about this
task was whether it was applicable to all, or just a
specific set of, scenarios involving gas leaks in the
flare system or Knock-Out (KO) drum. Discussions
focused on whether the Control Room Operator
would be able to perform this task safely and reliably,
and whether there would be enough time available for
decision-making and action. To address
these questions in a systematic manner, a set of different
scenarios was assessed.


3.3 Definition of performance requirements and 
performance influencing factors 


This step consisted of defining performance
requirements for each OBE, and providing recommendations
on how to improve design and
operations to optimize relevant PIFs.

PSA (2017) states that definition of performance
requirements is an essential part of establishing
effective barrier management. Establishing
PRs for OBEs is an integral part of this. Performance
Requirements (PRs) are verifiable requirements
related to barrier elements to ensure that the
intended barrier function is effective (PSA, 2017).
For each OBE, the following performance requirements
are defined:


1. Coarse description of how the task should be
   conducted
2. Who is responsible for the task (formal role)
3. When the task needs to be conducted
4. If feasible: time available for task performance
5. If feasible: criterion for successful task
   performance


PRs were established as part of a series of OBE
task analysis workshops with relevant technical disciplines
and operations. The agenda of the workshops
was structured according to a standard list of
Safety Performance Standards (PS), and the OBEs
related to these. Participants in the workshops
included process technicians (end users), technical
safety, those responsible for process and instrument
systems, and platform management. The information from
the OBE analysis was structured according to the
following parameters:


- Unique ID
- Reference to relevant safety strategy
- Platform and Areas
- Major Accident Hazard
- Barrier Function
- Performance Standard
- Operational Barrier Element
- Performance Requirement


The main output from this stage included a draft
Performance Requirement for each OBE, and
design and operational recommendations proposed
for strengthening the PIFs.
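
The record structure described in this section can be sketched as two small data types, one for the five performance requirement elements and one for the eight OBE parameters. All class and field names are hypothetical. The example values are taken from Example B (Table 2) where the paper states them; the identifier, safety strategy reference and platform are not given in the paper and are marked as such.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PerformanceRequirement:
    description: str                         # 1. how the task should be conducted
    responsible_role: str                    # 2. formal role
    when: str                                # 3. when the task needs to be conducted
    time_available: Optional[str] = None     # 4. if feasible
    success_criterion: Optional[str] = None  # 5. if feasible


@dataclass
class OBERecord:
    unique_id: str
    safety_strategy_ref: str
    platform_and_areas: str
    major_accident_hazard: str
    barrier_function: str
    performance_standard: str
    operational_barrier_element: str
    performance_requirement: PerformanceRequirement


def open_items(pr: PerformanceRequirement) -> list[str]:
    """Names of the optional PR elements that have not been filled in yet."""
    return [f.name for f in fields(pr) if getattr(pr, f.name) is None]


record = OBERecord(
    unique_id="EXAMPLE-B",                        # hypothetical identifier
    safety_strategy_ref="(not stated in paper)",  # placeholder
    platform_and_areas="(not stated in paper)",   # placeholder
    major_accident_hazard="DSHA 01: Hydrocarbon Leaks",
    barrier_function="Prevent ignition of H2 gas",
    performance_standard="PS 3 - Gas detection",
    operational_barrier_element="Respond to H2 alarm from battery rooms",
    performance_requirement=PerformanceRequirement(
        description="Electrically isolate the area using the MEI function "
                    "on the CAP panel",
        responsible_role="CRO",
        when="Upon H2 gas detection in battery rooms",
    ),
)
print(open_items(record.performance_requirement))
```

Keeping the two "if feasible" elements optional mirrors the wording of the list above and makes it easy to report which requirements still lack a time or success criterion.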


3.4 Quality assessment and alignment meetings 


Since the OBE framework adopted by the project
was new, alignment with key stakeholders in the
project was necessary. End-user representatives
participated during all phases of the work. However,
alignment meetings with engineering and
operational management were required to ensure
mutual understanding of purpose, scope and
method, and to agree upon the quality of the results and
how the results should best be applied by operational
readiness.


4 RESULTS 


The results are presented in four parts:
(1) Operational Barrier Elements and
Performance Requirements; (2) inclusion in
safety barrier strategies; (3) implementation of
recommendations in design; and (4) incorporation
of results in Operational Readiness.


4.1 Operational barrier elements and 
performance requirements 


A total of 87 OBEs were identified for the field development
project, distributed across 15 safety performance
standards (see Figure 1 below). Almost half
of the OBEs were categorized as generic, indicating
that they are applicable to the entire field, not
to specific platforms. The drilling platform has
the highest number of OBEs (22), followed
by the riser platform (11), process platform (5),
and Living Quarter (4) (see Figure 1 below).
Figure 2 illustrates how OBEs are distributed
across performance standards. The PSs with the
highest number of OBEs are Emergency Shutdown
(17), Well integrity in drilling (15), Active Fire protection
(11), Gas Detection (9), Emergency depressurization
(7) and Offshore Cranes (6).

Two examples of how the results are structured
are provided in Tables 1 and 2. These data are extracted from
the OBE master database.


Figure 1. Distribution of operational barrier elements
per platform. [Legend: Generic, Process platform,
Riser platform, Living quarter, Drilling platform]


Figure 2. Distribution of operational barrier elements
per safety performance standard.


Table 1. Example A: Respond to hydrocarbon leakage
from oil export riser.

MAH               DSHA 01: Hydrocarbon Leaks
PS                P1 - Containment
Barrier function  Reduce/limit hydrocarbon leaks from
                  risers or pipelines
OBE               Respond to hydrocarbon leakage from oil
                  export riser (air gap area) or pipeline.
PR                Onshore site has a dedicated system for
                  leak detection on the oil export
                  pipeline going from JS. Onshore CCR
                  is responsible for notifying the JS CCR
                  in case a leak is suspected. JS CCR
                  shall investigate and confirm that a leak
                  is occurring. In case of a confirmed
                  leak, JS CCR shall manually shut
                  down the oil export pumps, close the
                  riser ESDV, and confirm that pressure
                  has dropped. Onshore CCR shall be
                  informed of all mitigating actions.
Design measures   HMI system 21: Trykktransmitter olje
                  eksport [pressure transmitter, oil export]
                  <tag>
                  Priority 1 alarm
                  Text: “Olje eksport fra pumper A/B/C til
                  rørledningen” [oil export from pumps
                  A/B/C to the pipeline]
SO                Included in system description for
                  system 30


Table 2. Example B: Respond to H2 alarm from battery
room.

MAH               DSHA 01: Hydrocarbon Leaks
PS                PS 3 - Gas detection
Barrier function  Prevent ignition of H2 gas
OBE               Respond to H2 alarm from battery rooms.
PR                Upon H2 gas detection in battery rooms,
                  the CRO shall electrically isolate the
                  area using the MEI function on the
                  CAP panel. The CRO shall communicate
                  with the Electrical team to ensure the
                  affected room remains unoccupied until
                  the H2 gas has been dispersed via the
                  HVAC and the H2 alarm has been
                  cleared.
Design measures   New HMI symbol H2 gas
                  Priority 1 alarm
                  Alarm text: “Informer over PA Følg
                  instruksen i DFU” [inform via PA; follow
                  the instruction in the DFU]
SO                Included in system description for systems
                  84 and 85
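
As a quick arithmetic check on the OBE counts reported in Section 4.1, the sketch below derives the number of generic OBEs and the share covered by the six largest performance standards. It assumes that every OBE not attributed to a specific platform is generic; the paper does not state the generic count explicitly.

```python
TOTAL_OBES = 87

# Platform-specific counts as reported in the text.
per_platform = {
    "Drilling platform": 22,
    "Riser platform": 11,
    "Process platform": 5,
    "Living Quarter": 4,
}

# Counts for the six performance standards with the most OBEs.
top_performance_standards = {
    "Emergency Shutdown": 17,
    "Well integrity in drilling": 15,
    "Active fire protection": 11,
    "Gas detection": 9,
    "Emergency depressurization": 7,
    "Offshore cranes": 6,
}

platform_specific = sum(per_platform.values())
generic = TOTAL_OBES - platform_specific  # assumption: remainder is generic
generic_share = generic / TOTAL_OBES

top_six = sum(top_performance_standards.values())
top_six_share = top_six / TOTAL_OBES

print(f"platform-specific: {platform_specific}, generic: {generic} "
      f"({generic_share:.0%})")
print(f"top six PSs: {top_six} of {TOTAL_OBES} ({top_six_share:.0%})")
```

Under this assumption roughly half of the OBEs are generic, and the six named performance standards account for about three quarters of all OBEs.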


4.2 Inclusion of OBE in safety barrier strategies


During engineering, five separate safety strategies
were established, i.e. one for each facility/platform,
and one for the entire field. The safety strategies
have a standard organization, including common
definitions of barrier terminology. One purpose of
the safety strategies is to outline the role of the safety
barriers to be implemented to manage the risks that
have been identified for the project. The strategy
is used by Operations to understand how risks of
major accidents are managed on the field and for
each specific area. OBEs and associated Performance
Requirements are written into the project
safety strategies established during engineering.
The OBEs are related to the relevant safety barrier
functions and technical barrier elements.


4.3 Implementation of recommendation in design 


To ensure traceability and inclusion in design,
the main recommendations were entered in the action
follow-up system to be formally handled as part of
the engineering quality system.

A total of 81 design measures were implemented.
It should be noted that these measures
are in addition to the Human Factors Engineering
process performed in the project (ISO 11064).
Most of the findings were related to enabling situation
awareness by means of alarm priority and
alarm descriptions, operator station HMI and
Large Screen Display design. In total, 45 measures
were included on HMI (large screen and operator
displays), and 38 measures were implemented
in the alarm system configuration (alarm priority,
alarm text or operator guidance, etc.). This
included input to HMI for the platform crane cabin
and driller's cabin.

The PSs that have the highest number of measures
are ESD (23), active fire protection (15), and
gas detection (13); see Figure 4 below.


4.4 Operational readiness 


An essential goal of mapping and assessing
OBEs was that the results should be adopted by
operational readiness for use in operations. The
outcome of the mapping and assessment of OBEs
included a set of recommendations for improvement
of human performance through operational
readiness.


Figure 3. Record of recommendations implemented in
design and operational readiness. [Categories:
Human-Machine interface, Alarm system, System
description/procedure, Emergency preparedness]


Figure 4. Record of recommendations implemented in
HMI and alarm system per performance standard.
[Legend: Human-Machine interface]


Figure 5. Record of recommendations implemented in
system descriptions & procedures, and emergency
preparedness plan. [Legend: System description/procedure,
Emergency preparedness]

When the results were finalized, the information
was documented in a master OBE database, with
additional parameters:


- Reference to existing work process or procedure
  for a generic OBE
- Actions
- Status of action, i.e. implemented in SO, design
  etc.


The master database is shared by engineering
and operations. The master list
is maintained by engineering, and its results are
implemented in operational documentation by Operational
Readiness.

This includes measures such as inclusion of OBEs
in system descriptions and operational procedures,
referred to as SO documents. These documents
are organized according to system number, so the
results were converted from PSs to systems.
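
The conversion from performance standards to system numbers amounts to regrouping the database records. A minimal sketch is given below, using the two examples from Tables 1 and 2; the record layout and the function name are hypothetical, and the actual database format is not described in the paper.

```python
from collections import defaultdict

# Records keyed by performance standard, with the systems named in the
# SO rows of Tables 1 and 2. The dictionary layout is an assumption.
records = [
    {"obe": "Respond to hydrocarbon leakage from oil export riser",
     "ps": "Containment", "systems": [30]},
    {"obe": "Respond to H2 alarm from battery rooms",
     "ps": "Gas detection", "systems": [84, 85]},
]


def group_by_system(obe_records: list[dict]) -> dict[int, list[str]]:
    """Re-index OBEs from performance standard to system number.

    An OBE that touches several systems appears under each of them."""
    by_system: dict[int, list[str]] = defaultdict(list)
    for rec in obe_records:
        for system in rec["systems"]:
            by_system[system].append(rec["obe"])
    return dict(sorted(by_system.items()))


for system, obes in group_by_system(records).items():
    print(system, obes)
```

Because one OBE may be written into several system descriptions, counts per system need not add up to the number of OBEs.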

In total, 64 measures were implemented in Sys- 
tem Descriptions and Operational Procedures 
(SO). 

The results in the master database were reviewed
as part of developing the Emergency Preparedness
Plan. Four elements from the database were used
as direct input to these plans. One of the purposes
of the SO documentation is to provide a basis
for operator training, which is the next step of the
project.


5 CONCLUSION 


This paper described the systematic integration of
operational barrier elements in design and barrier
management to strengthen an oil and gas field’s
robustness against major accident hazards.
The results show how the barrier framework was
successfully applied in an engineering context. They
indicate how the project captured
the criticality of the human role in preventing and
mitigating major accident hazards: ensuring
efficient detection, initiating or controlling effective
emergency shutdown, and realizing sufficient active
fire-fighting means. The human role as a continuous
operational barrier in ensuring sufficient
well integrity in drilling and safe operation of
offshore cranes is also assessed and described.

An essential part of the control room operator
role is to monitor and verify that safety and automatic
systems perform on demand, and to act accordingly
if these systems fail. It was decided to include
the most critical of these control tasks, because
they provide valuable input to system design, procedures
and other PIFs. Including them also gives an improved
understanding of the interaction between technical
and operational barrier elements across different
accident scenarios.

Had OBEs been limited to cases where the barrier
function is entirely dependent on a direct, physical
human action, we would not have captured the actual
human role in protecting against and mitigating
MAHs, or the dependency between technical and
operational barrier elements during an accident.

Applying this framework early in detailed engineering
made it possible to provide input to design.
Throughout engineering, adjustments have been
made to the HMI and alarm system, but also to the physical
design of the facility, either to remove or reduce human
intervention or to optimize human reliability. Furthermore,
applying the framework in engineering
allowed a systematic focus on safety critical tasks
in developing procedures, emergency preparedness
plans and training. Inclusion of OBEs in safety strategies
helps to improve risk awareness in operations,
as OBEs and PRs shall be known by the management
and the roles responsible for realizing the OBEs.

A central concept in human factors engineering
is the use of iterative design cycles. Likewise,
it is beneficial to allow for several iterations when
working with operational barriers. For example,
the project would have benefited from starting
earlier than detailed design (e.g. in Front-End Engineering
and Design), although at the cost of fewer available
details to work with. The OBE work for this
project started in Detailed Design and went through several
iterations to reach sufficient maturity before
the development of procedures and training started.

So far, the project has focused on identifying
and defining OBEs and their corresponding
PRs, and on facilitating operator reliability through design
adjustments, procedures and training. The next
steps in the barrier framework are to set up assurance
and verification activities to ensure that barrier
functions remain intact under daily operations and
that the barriers perform as intended (NSA, 2014).

First, the framework described in this paper demonstrated
the benefits of including assessment of
safety critical tasks and PIFs in an engineering
context. Second, the performance requirements
for OBEs have provided operational readiness
with risk-informed input to the development of system
descriptions, procedures and training programs
for the operational phase. Third, the framework
provides a better understanding of the mutual
dependency between operational barrier elements
and technical barrier elements in realizing barrier
functions, thus contributing to strengthening the field’s
safety barrier management.


Naturalistic decision making in process control: The guidance-expertise
model and the model of resilience in situation


S. Massaiu 
Institute for Energy Technology, Halden, Norway 


ABSTRACT: The defining elements of naturalistic decision-making, such as proficient decision makers, 
ill-defined goals, uncertainty, high stakes, tools, and teamwork, are clearly present in process control. 
However, the domain is still heavily anchored in normative approaches for design, analysis and evaluation 
of human-technology systems that make unrealistic assumptions about the operators. The paper presents 
two naturalistic decision making models for process control developed in the nuclear power production 
sector and based on extensive observations of control room emergency operation in high-fidelity, human- 
in-the-loop simulators: the Guidance-Expertise Model (GEM) and the Model of Resilience in Situation 
(MRS). Unlike better-known naturalistic decision-making models, the GEM and MRS models recognize 
the central role that operating procedures and other organizational prescriptions play in process control 
decision making, elaborating on aspects that so far have received little attention in the naturalistic decision 
making community. 


1 INTRODUCTION


1.1 Process control and naturalistic decision
making


Control center operators in process industries
supervise systems that are extensively automated
and intervene in case of malfunctions. Control
centers are technological environments in which
the operators’ interactions with the system are
mediated by human-machine interfaces and supported
by decision aids ranging from pen and
paper to intelligent expert systems. The single
most important decision support aid used by
control center operators is the
operating procedures. In the nuclear energy sector,
for instance, when an emergency arises the
control room operators respond by immediately
opening the emergency operating procedures and
implementing these until the reactor is brought
to a safe state. In this environment understanding
decision-making requires understanding how the
operators interact with the procedures. The second
distinguishing aspect of decision-making in process
control industries is the collective nature of the
decisions. Most of the time the decisions are made
by groups, often called operating crews. Even when
single operators are the decision makers, the high
level of proceduralization of their work implies
some sort of deferred relation with other actors like
procedure designers, trainers, or management, who
set expectations on the decisions and behavior of
the front-line operators. The most critical decisions
in process control centers occur during incidents
and accidents. In such conditions the decision landscape
typically includes elements of time pressure,
stress, ill-defined or conflicting goals, uncertainty
about conditions, and high stakes. In other words,
process control is a good illustration of all defining
elements of naturalistic decision-making (Lipshitz,
Klein, Orasanu, & Salas, 2001). At the same time
the central role of operating procedures (and other
organizational prescriptions) for process control is
an aspect that sets it apart from other, more studied,
naturalistic settings and makes widely-known
naturalistic decision making models less readily
applicable to the sector.


1.2 The prevalence of normative approaches
in process control


Although process control decision-making is
clearly a naturalistic setting, normative approaches
still dominate system analysis, evaluation and
design in the sector. This is likely a legacy from
when these industries viewed their technical core
areas as sufficient for designing productive and
safe systems. The nuclear industry, for example,
did not pay attention to human and organizational
factors before the Three Mile Island and Chernobyl
accidents, assuming that reactor physics,
thermodynamics and other technical factors were
solely responsible for designing safe plants (Moray
& Huey, 1988). In normative approaches the operators
are assigned the role of executors of procedures.
As the procedures incorporate the rational
benchmark for how to behave in different situations,
they define what the operators are required
to do in terms of actions, and even cognitive proc- 
esses, in order to achieve the system’s goals. The sit- 
uations for which the procedures are made are seen 
as predictable conditions in which there are limited 
ways of performing the tasks correctly. Although 
these assumptions may have been appropriate for 
workers of the first industrial revolution, they are 
inadequate for automated and computerized pro- 
duction processes (Vicente, 1999). These are char- 
acterized by external disturbances (unanticipated 
faults, automation failures) and other forms of 
uncertainty (degraded indicators) for which the 
procedures do not apply and in which the opera- 
tors are required to adapt to moment-by-moment 
changes in conditions by generating appropriate 
responses based on their conceptual understanding
of the work domain. In such cases there are
no standards of correct performance and no predefined
correct decisions, except after the fact.
Research from anthropology, activity theory and
naturalistic decision making has shown that “workers’
actions frequently do not—and indeed, should
not—always follow these normative prescriptions”
(ibid., p. 62). From the point of view of this paper the
main problem of relying on normative approaches
to decision making in process control is that the
assumptions they make about the operators are not
realistic and therefore of limited use for the design,
analysis and evaluation of systems in which human
and organizational factors play a critical role.


2 NATURALISTIC MODELS OF PROCESS 
CONTROL DECISION-MAKING 


This section describes two naturalistic decision 
making models for process control work. They are 
both based on extensive observations of nuclear 
power plant control room emergency operation in 
high-fidelity, human-in-the-loop simulators. This 
section provides a description of the models’ theo- 
retical foils, concepts, purposes and applications. 
Their limitations will be discussed also taking into 
account their level of maturity and intended use. 


2.1 The guidance-expertise model of procedure
following


The Guidance-Expertise Model of Procedure
Following (GEM) is a methodology to describe
and predict nuclear power plant operating crews’
behavior in emergency operation (Massaiu,
Hildebrandt, & Bone, 2011; Massaiu & Bones,
2011). The model, based on the framework
of cognitive system engineering (Rasmussen,
Pejtersen-Mark, & Goodstein, 1994), assumes
that macrocognitive processes, or control modes,
determine decisions in proceduralized operational
environments (Figure 1). In cognitive system
engineering, control is defined as the way the next
action at any given point in time is chosen. Cognitive
control refers to the organization of cognitive
functions (e.g. monitoring, evaluating) and processes
(e.g. heuristics, externalization of memory
storage) in a situation (Hollnagel, 1998). Control
modes in the language of activity theory are different
orientations towards the object of activity
that result in different ways of acting (Kaptelinin
& Nardi, 2006). Basically, the control modes are
different ways of thinking in a situation that determine
behavior.


Figure 1. Conceptual diagram for the Guidance-Expertise
Model of procedure following (GEM).


The GEM model borrows the language of ‘situ- 
ated course-of-action’ theory (Theureau, Jeffroy, & 
Vermersch, 2000) and defines three control modes: 
Narrowing, Holistic View and Persisting Narrow- 
ing. During narrowing the operator’s horizon is lim- 
ited by what is referenced by the procedures. When 
the operators read the procedure in this mode their 
attention is focused, the situations are classified 
by their structural-mechanistic features, and are 
mapped into pre-planned methods of action. Dur- 
ing narrowing, information is not actively searched 
for except for what is required by the procedure. 
The pre-planned methods for actions are incorpo- 
rated in the emergency procedures as well as previ- 
ously trained. Narrowing in a broader sense is the 
cognitive control mode in which the operators let 
the procedures guide them and do not try to figure 
out ad-hoc plans for dealing with the situation. 

Holistic view occurs when the operators interpret 
the procedure steps relative to the situation dynam- 
ics, which include the activities of automated sys- 
tems and of other people. As such it is analogous to 
the concepts of situation awareness and sense-mak- 
ing. In holistic view information is actively searched 


for in the control panels and through communica- 
tion with other crewmembers to create an inter- 
pretation of the situation in a functional way, by 
considering the process as an integral whole. Holis- 
tic view takes into account larger time windows: the 
present is explained as the effect of previous events 
(diagnosis), the future evolution is represented to 
evaluate if the planned course of action is appro- 
priate (Lipshitz et al., 2001). Holistic view includes 
metacognition (thinking about thinking) which will 
manifest itself in activities such as reconsidering the 
course of actions, determining whether to act out- 
side the procedure while following them, redirecting 
procedure paths, detecting strains in teamwork and 
making adjustments. 

Switching from one control mode to the other 
can be challenging in several ways, as the two modes 
require different cognitive effort and different con- 
figurations of cognitive functions, including how 
much is memorized and consciously represented. 
Changing from holistic view to narrowing implies 
difficulties in establishing the necessary local focus 
and attention as well as the right procedures pro- 
gression pace. It can also challenge the capacity to 
re-construct an uncompleted course of action from 
the exact point it was left. Low-level errors, like 
slips and lapses, might occur as a result. Yet, most 
performance difficulties occur when the required 
control mode is holistic view but the crew struggle 
in establishing it, thereby continuing in narrowing 
mode. A typical example is when the crew starts 
engaging in problem solving behavior (e.g., discus- 
sion of transfer points, evaluation of novel events) 
but ends up reverting to literal procedural adherence.
In such cases sustained periods of narrowing
impede the achievement of a holistic view (i.e., a
level of situation awareness adequate to develop an
autonomous plan of action). This state is termed
‘persistent narrowing’ and is considered a third control
mode. The longer the narrowing continues, the
higher the risk of losing global situation under- 
standing. For self-paced processes, like emergency 
operation in nuclear power plants, the more the crew 
lags behind the process the more it will be pushed 
into a reactive mode and the more difficult it will be 
to achieve holistic view. Persistent narrowing ends 
when the crew is able to constructively generate a 
solution strategy, even if this is not an effective one. 
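
As an illustration only, the switching between the three control modes described above can be sketched as a small state model. The two boolean inputs and the function name are assumptions introduced here for illustration; the GEM model itself derives such regularities empirically and does not prescribe this logic.

```python
from enum import Enum


class ControlMode(Enum):
    NARROWING = "narrowing"
    HOLISTIC_VIEW = "holistic view"
    PERSISTENT_NARROWING = "persistent narrowing"


def next_mode(current, holistic_required, strategy_generated):
    """Return the next control mode under simplified, hypothetical rules.

    holistic_required: the situation calls for holistic view.
    strategy_generated: the crew has constructively generated a solution
    strategy (effective or not).
    """
    if not holistic_required:
        # The procedure can guide the crew: narrowing is adequate.
        return ControlMode.NARROWING
    if current is ControlMode.HOLISTIC_VIEW or strategy_generated:
        # Holistic view is kept, or reached by generating a strategy,
        # which also ends persistent narrowing.
        return ControlMode.HOLISTIC_VIEW
    # Holistic view is required but the crew keeps narrowing.
    return ControlMode.PERSISTENT_NARROWING
```

In this sketch a crew that needs holistic view but has not yet generated a strategy is classed as being in persistent narrowing, and it leaves that mode as soon as a strategy is generated.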

In order to make predictions regarding crew 
behavior in emergency situations, the GEM model 
relates the control modes to aspects of the situa- 
tion on one side and to aspects of crew expertise 
on the other. Regularities among situations, control 
modes and expertise for specific operational set- 
tings are derived from empirical human-in-the-loop 
simulations. The model outcome behaviors are task- 
independent behaviors that nuclear power plant 
operating crews exhibit in emergencies (Table 1). 
Table 1. The 16 outcome behaviors included in the
GEM model are task-independent behaviors that nuclear
power plant operating crews exhibit in emergencies.

#   Outcome behaviour
1   Slow progression by meticulous procedure following
2   Slow reaction to recently discovered information
3   No reaction to important information received
4   No/slow action to unexpected event
5   Literal step following rather than purpose
6   Incorrect procedural transition
7   Cue explained away
8   Notes/warning/foldout pages ignored
9   No priority between concurrent goals
10  Priority given to minor goal/most recent deviation
11  Autonomous decision avoided
12  Successful step execution
13  Incomplete step execution
14  Inference to previous condition not made
15  Sub-step skipped
16  Stuck in procedure step

According to the model, two formal features
of the emergency procedures are the most important
environmental aspects to consider. Procedural
features are identified as being either ‘loose’ or
‘detailed’, with the understanding that this is a
matter of degree. Broadly speaking, procedures that
provide meticulous step-by-step direction are classed
as ‘detailed’; otherwise they are classed as ‘loose’
(e.g., evaluation of trends,
adjustment and control actions, navigational
decisions).

Empirically observed regularities between procedural
features and control modes should help
predict the crews’ procedure progression in
accident conditions. According to the model the
procedures-behavior pairings are in fact deter- 
mined by the control modes, which in turn are 
determined by the crew expertise. In GEM exper- 
tise is measured by teamwork indicators through 
a classification scheme based on both generic 
and nuclear-specific process control teamwork 
literature (Braarud & Johansson, 2010; Klein, 
1999; Norros, 2004; Salas et al., 2005; O’Connor 
et al., 2008). The model considers 8 teamwork 
dimensions: monitoring progress, communicating 
intents, communicating interpretations, looking 
for same cues, reconciling viewpoints, adapting, 
backing-up, team monitoring & flexibility. 

The GEM model thus has three elements:
(1) structural features of the procedures, (2) con- 
trol modes, and (3) the crew expertise. 


2.1.1 Application of the GEM model 
The GEM methodology was tested by Massaiu
and Bones (2011) in a retrospective analysis of
74 critical decisions made by four Nuclear Power Plant
(NPP) operating crews in simulated Steam Generator
Tube Rupture (SGTR) events. The four crews that
exhibited the most operational difficulties (the model’s
outcome behaviors) were selected to test whether the
model could describe the observed performances and help
estimate the likelihood of the outcome behaviors
given observable aspects of the context of action
(i.e., the procedural features) and the crews’ cognitive
control modes.

The analysis showed different patterns of out- 
come behaviors, control modes and procedural 
conditions as well as different patterns between the 
crew expertise (as measured by indicators of team- 
work proficiency) and observed control modes. For 
instance, almost all the times a crew transitioned to 
a wrong procedure (outcome behavior 6 in Table 1) 
it was with ‘loose’ procedural guidance and in ‘per- 
sistent narrowing’ control mode. Different team- 
work characteristics were associated with the three 
control modes. The preliminary results showed 
that nearly all instances of positive teamwork were 
observed under the ‘holistic view’ control mode, 
and that all negative teamwork dimensions were 
observed when the crews exhibited ‘persistent narrowing’.
No positive teamwork indicators and some
negative indicators were observed when the crews
were in ‘narrowing’ mode (the negative indicators to a
lesser degree than in ‘persistent narrowing’).
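
A minimal sketch of how such patterns could be tabulated is given below, assuming each critical decision has been coded as a triple of procedural feature, control mode and outcome behavior number (Table 1). The records are invented for illustration and are not the data analysed by Massaiu and Bones (2011); the function name is likewise hypothetical.

```python
from collections import Counter

# Hypothetical coded decisions:
# (procedural feature, control mode, outcome behavior number from Table 1)
decisions = [
    ("loose", "persistent narrowing", 6),
    ("loose", "persistent narrowing", 6),
    ("loose", "persistent narrowing", 16),
    ("detailed", "narrowing", 1),
    ("detailed", "narrowing", 12),
    ("loose", "holistic view", 12),
]


def outcome_frequencies(records, feature, mode):
    """Relative frequency of each outcome behavior in one context.

    The context is a (procedural feature, control mode) pair. An empty
    dict is returned when the context was never observed.
    """
    outcomes = [o for f, m, o in records if f == feature and m == mode]
    total = len(outcomes)
    if total == 0:
        return {}
    return {o: n / total for o, n in Counter(outcomes).items()}


freqs = outcome_frequencies(decisions, "loose", "persistent narrowing")
```

With a data set as small as the one tested, such relative frequencies describe the observed cases only and should not be read as probability estimates.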


2.1.2 Evaluation of the GEM methodology 

The GEM model can be used as a classification system
for the analysis of observed team decision-making
and behavior. Its main benefit is the possibility of
evaluating the likelihood of outcome behaviors
given observable features of the environment and
measured team expertise. The intended use of
the methodology is for cognitive simulation and
predictive task analysis. However, the methodology
has been tested on a small data set only and a
number of challenges remain to be solved. These
are the main ones:


1. The set of outcome behaviors is not a complete
   and non-overlapping set.
2. Further aspects of the guidance system should
   be included (e.g., the crew operating policies
   and training).
3. The model is currently limited to two environmental
   features: the procedural features and the
   crew expertise (which determines the control
   mode). Although these are recognized as the
   most important factors for crew performance
   in emergency, the inclusion of other structural
   features of task and environment in the model
   is likely necessary for improving its predictive
   accuracy.


2.1.3 The model of resilience in situation

The Model of Resilience in Situation (MRS) underlies the
human reliability method MERMOS (Le Bot, Cara,
& Bieder, 1999). Its primary application
has been in predictive risk analysis,
but it has also proved a valid tool for retrospective
accident analysis in the nuclear (Le Bot, 2004) and
medical (Le Bot, 2008) fields. Recently it has been
proposed as a way to analyze human and organizational
factors from a High Reliability Organizations
perspective, to support the design of risk-critical
systems (Le Bot & Pesme, 2014).

The MRS explains how operating teams in 
emergency organizations make decisions during 
the course of an accident. The model is based on 
Jean-Daniel Reynaud’s theory of social regula- 
tion (Reynaud, 1989), a sociological theory that 
understands social relations (particularly in the 
working environment) as social regulations, that is, 
the social production of formal and informal rules 
governing the behavior of groups. 

In the MRS the object of analysis is the Emer- 
gency Operating System (EOS), the ensemble of 
control room operating crew, the human-machine 
interface, and the operating procedures. The MRS 
is about team decision-making mediated by tech- 
nology and procedures and thus consistent with 
Edward Hutchins’ distributed cognition paradigm 
(Hutchins, 1995; Hutchins & Klausen, 1998) in 
assuming that cognitive resources are not only in 
the operators’ heads but also in the procedures and 
the interface. 

The MRS can be seen as constituted by three 
interrelated building blocks: (1) a description of 
the dynamics of emergency operation (Figure 2); 
(2) the functions that the EOS fulfills during emer- 
gency (Figure 3), and (3) the stable characteristics, 
or features, of the emergency operating system 
(see Table 3 for an example) (Massaiu & Braarud, 
2013). 

The dynamics of emergency operation are 
described by cycles of stability, ruptures, and 
new stability phases (Figure 2). During a stable 

Figure 2. The dynamics of emergency operation
according to the model of resilience in situation.

Figure 3. The Model of Resilience in Situation (MRS)
identifies five functions of the Emergency Operating Sys- 
tem (EOS): ‘Execution’ and ‘Control’ ensure the system’s 
robustness, ‘Verification’ and ‘Reconfiguration’ the sys- 
tem’s adaptation, while ‘Information selection and sharing’ 
is a cross-cutting function enabling the other functions. 


Table 2. The MRS model specifies ‘sub-functions’ and
‘details’ for the main functions of the model depicted in
Figure 3. Here are the sub-functions and details for ‘Control’
(of rule execution).

Sub-function
  Understand goals and priorities
  Allocate resources (cognitive, material, human)
  Continuous monitoring of expected plant responses
  Small deviations detections and adjustments
  Recovery (of individual errors)
  Concentrate on current plan
  Resist external demands (for resources)
  Reach plan goals
  Manage dynamics (e.g. concurrent goals)

Detail
  Understand timing of tasks (when to do, when to get info, time lags, urgency)
  Distribute tasks/Position operators in CR
  Understand task allocation
  Keep focus on task and process
  Team monitoring, communicate significant actions
  Consult and peer check before performing significant actions
    (feed forward control to avoid need for recovery)
  Avoid distractions: Do not respond to all incoming information/requests
  Attention on procedure following, read notes, read foldout, referenced parameters
  Keep focus on priorities
  Ensure goals achieved
  Completing pending procedures/steps
  Manage multiple/parallel tasks (procedures)
  Manage interruptions and deferred tasks (including continuous EOPs’ steps
    and conditions)

Table 3. The emergency operating system ‘Features’ are
the structural elements that determine the system’s capacity
to perform its functions (Figure 3). Here are the EOS sub-features
and indicators for the category ‘Procedures’.

Sub-feature                     Indicators
Monitoring/re-evaluation loops  Re-evaluate procedure appropriateness
                                Re-evaluate procedure optimality
                                Continuously/periodically re-evaluate priorities
Writable/bookmarks              To aid memory
Redundant information sources   Look for extra information to validate itself
                                Look for extra information to assess reliability of cues
Overview/status trees           Counter fixation on current plan
                                Takes into account simultaneous influences
                                Easy to look ahead/browse



operating phase, called the stabilization phase, 
the system follows the effective rules that it has set 
itself, typically the operating procedures, allow- 
ing the attainment of its objectives and avoiding 
the continual demands made by the dense flow of 
information (for instance, several hundred alarms 
in a nuclear power plant control room). However, 
this organizational inertia, protecting the actors 
from unexpected demands, must be counterbal- 
anced by permanent redundant verification (or 
monitoring), i.e., constantly checking that the 
rules applied are appropriate to the situation (for 
example, that the procedure in effect is adequate). 
A rupture occurs when the active rules become
inappropriate and the operating system has to be
reconfigured so that it has new effective rules. This
can happen for two reasons: (1) the objectives
have been reached in compliance with the applied
rules; or (2) the rules are no longer adequate, due
either to (2a) errors in rule implementation that necessitate
a reconfiguration (re-planning) that is more than
mere error recovery, or to (2b) the team recognizing
that conditions existed, or have newly arisen,
for which the rules in effect are not adequate. In
these cases, verifying that the rules are inadequate
should trigger a “rupture” of the operation
so that the system reconfigures itself with new
effective rules (Figure 2).
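
The cycle of stability and rupture described above can be sketched, purely for illustration, as a two-state machine. The phase names follow the text; the function name and its boolean triggers are assumptions introduced here and are not part of the MRS as published.

```python
from enum import Enum


class Phase(Enum):
    STABILITY = "stability"
    RUPTURE = "rupture"


def step(phase, objectives_reached=False, rules_inadequate=False,
         new_rules_ready=False):
    """Advance the emergency operating system by one (hypothetical) step.

    In a stability phase the system follows its effective rules until the
    objectives are reached or verification finds the rules inadequate.
    In a rupture phase the system reconfigures itself until new effective
    rules are available.
    """
    if phase is Phase.STABILITY:
        if objectives_reached or rules_inadequate:
            return Phase.RUPTURE
        return Phase.STABILITY
    # Rupture: stay here until reconfiguration yields new effective rules.
    return Phase.STABILITY if new_rules_ready else Phase.RUPTURE
```

The sketch makes one point of the model explicit: a rupture is triggered by verification, so a system that never checks whether its rules are adequate never leaves the stability phase for that reason.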

It should be noted that during emergencies at 
nuclear power plants the rupture phases may last 
minutes while the stability phases may last hours. 

The second building block of the MRS model 
is the description of the functions of the operating 
system (Figure 3). 

There are two main functions that define
the resilience of the EOS: “Robustness” and
“Adaptation”. Adaptation is accomplished by the
functions, described above, of verification (i.e., verifying
that the plans are good for the situation) and
reconfiguration (the capacity to produce in a timely manner
plans that fit changed conditions). Robustness is
defined by Execution and Control. Execution is
defined as “acting on the process given the effective
operating rules”. It includes object discrimination
(selecting the right control out of similar ones
and the right mode in multi-mode displays) and
situation discrimination (acting differently in different
plant operating modes). Control consists
of permanent monitoring of the consistency between
actions and effective operating rules (are the rules
well applied?). It concerns the execution of
the rule in effect: it is the function that ensures that
the rule is being implemented as intended. Effective
control requires continuous monitoring of
process and staff, detection of deviations, rapid
adjustments, and management of concurrent
demands and interruptions.

In order to perform these functions the system
also has to select and share information from
the environment. Information Selection is then
defined as a “common function” needed by Con- 
trol, Verification and Reconfiguration. Teamwork 
is treated as a set of processes (e.g., cooperation, 
team situation understanding) used in performing 
EOS functions. Therefore, teamwork functions are 
not represented as independent functions but are 
‘built-in’ in the other functions. 

The third building block of the MRS model is 
constituted by the “emergency operating system 
features”, i.e., stable characteristics of the system 
that allow it to perform its functions. Features are 
identified for the personnel (e.g., staffing, train- 
ing, safety culture), the human-system interface 
(e.g., displays, alarm logs) and the procedures 
(e.g., symptom based) elements of the system. The
EOS features determine the system’s capacity to
perform its functions. Different configurations of
personnel, HSI, and procedures will produce different
capabilities with regard to the various EOS
functions. For instance, an operating crew with
an authoritarian line of command will likely facilitate
the execution and control functions, but might
hinder effective reconfiguration. The MRS model
organizes the features under the following categories:
Team, Prescriptions, Formal communications,
Human-Machine Interface (HMI), Training, and
Procedures (see for instance the features for Procedures
in Table 3). These six categories include
sub-features, that is, specific indicators that evaluate
their contribution to the fulfillment of the EOS
functions. For example, the HMI feature includes
the “Control Room Layout” sub-feature to evaluate
whether the HMI provides “visibility of other
operators” and “visibility of others’ actions” (i.e.
“does the control room layout allow the operators
to see each other and their actions?”). Another
example is the Team feature’s sub-feature “Supervisory
function”, which evaluates, among others, the
system capability of “Monitoring others’ actions”
and “Searching redundant information” (i.e. “does
the supervisor monitor operators’ actions and
search for redundant information?”).

The MRS model specifies the influences of the 
EOS features on the EOS functions. The result is 
a complete matrix of influences from the features’ 
indicators to the sub-functions’ details. 
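
One way to picture such a matrix is as a mapping from feature indicators to the sub-function details they influence. The sketch below is an assumption made for illustration: the indicator and detail names are taken from the examples and tables in this paper, but the pairings and the value 1 (meaning "supports") are invented and do not reproduce the actual MRS matrix.

```python
# Hypothetical excerpt of an influence matrix:
# feature indicator -> {sub-function detail: influence}, where 1 = supports.
influences = {
    "Visibility of others' actions": {
        "Team monitoring, communicate significant actions": 1,
    },
    "Visibility of other operators": {
        "Team monitoring, communicate significant actions": 1,
        "Distribute tasks/Position operators in CR": 1,
    },
    "To aid memory": {
        "Manage interruptions and deferred tasks": 1,
    },
}


def indicators_for(detail, matrix):
    """List the indicators with a non-zero influence on one detail."""
    return sorted(
        indicator
        for indicator, row in matrix.items()
        if row.get(detail, 0) != 0
    )


supporting = indicators_for(
    "Team monitoring, communicate significant actions", influences
)
```

Reading the matrix by column, as `indicators_for` does, answers the design question of which system features would need to change to strengthen a given function.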


2.1.4 Evaluation of the MRS methodology 
The Model of Resilience in Situation is the theo- 
retical backbone of the human reliability method 
MERMOS, and in this form it has been applied 
in the French nuclear industry for more than a 
decade. The decision-making model presented in 
this paper has received a more limited application 
and testing, but it nonetheless has proved capable 
of capturing the essentials aspects of the decision- 
making processes followed by nuclear control 
room crews responding to simulated accidents in 
full scope simulators. These were detailed, minute- 
by-minute analyses of teams of professional opera- 
tors performing in realistic conditions. Compared 
to other naturalistic decision-making models the 
MRS model treats decision processes that span 
over relatively long time windows and include 
several decision points, as it is necessary for deal- 
ing with emergency operation in nuclear power 
plants. A second innovative aspect, also strongly 
dependent on the model’s domain of origin, is 
the importance reserved to the technology and 
the organizational environment in which the deci- 
sion makers operate. These are furthermore teams 
rather than individuals so that teamwork aspects 
become central. Finally, the model lends itself to 
predictive applications (trough the MERMOS 
human reliability method), retrospective accident 
analysis, for verification and validation purposes 
(by providing an overall framework that can be 
used as basis for performance-based evaluation of 
human-machine systems), and as an observation 
protocol for on-line recording and classification. 
The main limitation of the MRS model is 
that, beside the applications for human reliability 
which is at an industrial stage, the methodology 
needs further refinement and testing (Massaiu & 
Braarud, 2012). 


3 CONCLUSION 


Normative approaches are still the preferred option 
for analysis, design and evaluation of human- 
technology systems in process control industries. 


This is partly due to the field’s strong technical
tradition, one that assumes that the core technical
disciplines are sufficient for achieving safe and
productive systems, but also to field specificities
(like the prominence of operating procedures) that
make well-known naturalistic decision making
approaches less readily applicable.

This paper has presented two decision-making 
models specifically developed in process control 
settings. The models are informed by extensive
empirical observations and have been tested and
implemented to varying degrees for different applications.
The two models contribute to the naturalistic
decision making discipline at large by tackling
the little-studied aspect of team decision
making with operating procedures.


REFERENCES 


Braarud, P. O., & Johansson, B. (2010). Team Cognition 
in a Complex Accident Scenario (No. HWR-955). 
Halden, Norway: OECD Halden Reactor Project. 

Hollnagel, E. (1998). Cognitive Reliability and Error 
Analysis Method. Elsevier Science. 

Hutchins, E. (1995). Cognition in the Wild (New edition). 
The MIT Press. 

Hutchins, E., & Klausen, T. (1998). Distributed Cogni- 
tion in an airline cockpit. In Y. Engeström and D. 
Middleton (Eds.), Cognition and Communication at 
work, Cambridge University Press. 

Kaptelinin, V., & Nardi, B. A. (2006). Acting with technology:
Activity theory and interaction design. MIT Press.

Le Bot, P. (2004). Human reliability data, human 
error and accident models—illustration through 
the Three Mile Island accident analysis. Reliability 
Engineering & System Safety, 83(2), 153-167. 

Le Bot, P. (2008). Analysis of the Scottish case. In 
Remaining Sensitive to the possibility of Failure 
(Vol. 1). Ashgate Publishing. 

Le Bot, P., Cara, F., & Bieder, C. (1999). MERMOS, a 
second generation HRA method: what it does and 
doesn’t do. In Proceedings of the international topical 
meeting on Probabilistic Safety Assessment (PSA’99) 
(Vol. 2, pp. 852-880). Washington DC, USA. 

Le Bot, P., & Pesme, H. (2014). Organising Human and
Organisational Reliability. In 12th Probabilistic Safety
Assessment and Management Conference. Honolulu,
Hawaii.

Lipshitz, R., Klein, G., Orasanu, J., & Salas, E. (2001). 
Taking stock of naturalistic decision making. Journal 
of Behavioral Decision Making, 14(5), 331-352. 

Massaiu, S., & Bones, A. (2011). Emergency procedures 
and crew behavior: A Retrospective test of the 
Guidance-Expertise Model (No. HWR-995). Halden, 
Norway: Halden Reactor Project. 

Massaiu, S., & Braarud, P. Ø. (2012). Emergency 
Operating Systems Profiling: Proposals for Developing 
the Model of Resilience in Situation and for Classify- 
ing EOS Features (Internal report No. IFE/HR/F- 
2012/1541). Halden, Norway: Institute for Energy 
Technology. 

Massaiu, S., & Braarud, P. Ø. (2013). Including Organ- 
izational and Teamwork Factors in HRA: the EOS 
Approach. Presented at the EHPG 2013, Store- 
fjell Resort Hotel, Gol, Norway: Halden Reactor 
Project. 

Massaiu, S., Hildebrandt, M., & Bone, A. (2011). The 
guidance-expertise model: Modeling team decision 
making with emergency procedures. Presented at the 
International Conference on Naturalistic Decision 
Making, 10 (NDM 2011), Orlando. 

Moray, N. P., & Huey, B. M. (1988). Human factors
research and nuclear safety. National Academies. 

Norros, L. (2004). Acting under uncertainty: The Core- 
Task Analysis in ecological study of work. Helsinki, 
Finland: VTT Technical Research Centre of Finland. 

O’Connor, P., O’Dea, A., Flin, R., & Belton, S. (2008). 
Identifying the team skills required by nuclear power 
plant operations personnel. International Journal of 
Industrial Ergonomics, 38(11-12), 1028-1037. 

Rasmussen, J., Pejtersen-Mark, A., & Goodstein, L. P. 
(1994). Cognitive systems engineering. Wiley. 

Reynaud, J.-D. (1989). Les règles du jeu: l’action collective 
et la régulation sociale. Colin. 

Salas, E., Sims, D. E., & Burke, C. S. (2005). Is there a 
“Big Five” in Teamwork? Small Group Research, 
36(5), 555. 

Theureau, J., Jeffroy, F., & Vermersch, P. (2000). Controlling
a nuclear reactor in accidental situations with
symptom-based computerized procedures: a semiological
& phenomenological analysis. CSEPC 2000
Proceedings, 22-25.

Vicente, K. J. (1999). Cognitive work analysis: toward 
safe, productive, and healthy computer-based work. 
Routledge. 


Safety and Reliability - Safe Societies in a Changing World — Haugen et al. (Eds) 
© 2018 Taylor & Francis Group, London, ISBN 978-0-8153-8682-7 


Sensemaking and resilience in safety-critical situations: 
A literature review 


S.S. Kilskar 
SINTEF, Trondheim, Norway 


B.-E. Danielsen 
CIRiS NTNU Samfunnsforskning, Trondheim, Norway 
NTNU, Trondheim, Norway 


S.O. Johnsen 
SINTEF, Trondheim, Norway 
NTNU, Trondheim, Norway 


ABSTRACT: Recent accidents and near-accidents, such as the capsizing of the anchor handling vessel 
Bourbon Dolphin in 2007 and the unintended list of the drilling rig Scarabeo 8 in 2012, underline the 
need for addressing sensemaking in safety-critical situations within the maritime domain. This paper 
is a literature review to answer the research question: What are the characteristics of sensemaking and 
resilience in safety-critical situations? The aim was to establish more knowledge on sensemaking in safety- 
critical situations and the relationship between sensemaking and resilience. The majority of authors 
provide definitions based on Weick’s work on sensemaking, describing sensemaking as a social process, 
involving the extracting of cues and enactment to create meaning to events retrospectively. Few authors 
provide descriptions that characterise sensemaking in safety-critical situations. There is a lack of literature 
regarding sensemaking in safety-critical situations in the maritime domain that addresses the issues of
training and human-machine interactions.


1 INTRODUCTION AND OBJECTIVE 


1.1 Background 


This literature review is part of a research project 
(SMACS) that addresses the issue of sensemak- 
ing in safety-critical situations within the mari- 
time domain. The aim of sensemaking processes 
in an organisation is to provide meaning to an 
event or situation in a given context. In such situ- 
ations, sensemaking can be a source of resilience, 
in that it enables a person or a crew to “bounce 
back” when put under stress. Hence, the review 
not only focuses on sensemaking in safety-critical 
situations, but also on how the literature describes
the relationship between sensemaking and resilience.
This paper describes the search strategy and
presents the results from the literature review. 


1.2 Purpose and research question 


The purpose of the review was to answer the 
research question: What are the characteristics of 
sensemaking and resilience in safety-critical situa- 
tions? This was done by establishing a knowledge 
base on sensemaking in safety-critical situations, 
and by exploring the relationship between sensemaking
and resilience. In addition, we wanted to
examine whether this literature addresses sense- 
making in relation to training or Human-Machine 
Interaction (HMI). This review is not specific to 
the maritime domain, but it was of interest to be 
able to later narrow it down to maritime operations. 


1.3 Concepts and definitions


Sensemaking and resilience are two central concepts 
in this study, both of which have been approached 
within different theoretical frameworks. In the fol- 
lowing, we provide definitions and explanations 
for these terms, as well as for our use of the term 
safety-critical. 


1.3.1 Sensemaking 
Sensemaking has been of interest in the on-going 
research project, since the concept supports the 
idea that human actors in safety-critical operations
and their actions are dependent on the whole
socio-technical system, consisting of organisational,
technological and human factors.

The concept of sensemaking started to emerge in
organisational literature in the late 1960s (Maitlis &
Christianson, 2014), but was made prominent by
Karl E. Weick in 1995 with his seminal book Sensemaking
in Organizations. In this work, Weick summarised
the sensemaking research up to that point
and presented seven key properties of sensemaking:
1) grounded in identity construction, 2) retrospective,
3) enactive, 4) social, 5) ongoing, 6) focused on and 
by extracted cues, and 7) driven by plausibility rather 
than accuracy. Sensemaking has since been the sub- 
ject of considerable research and there is an extensive 
variation in how the term is defined in the organisa- 
tional literature (Maitlis & Christianson, 2014). 
Weick et al. (2005) describe sensemaking as “a 
sequence in which people concerned with identity 
in the social context of other actors engage ongo- 
ing circumstances from which they extract cues 
and make plausible sense retrospectively, while 
enacting more or less order into those ongoing 
circumstances” (p. 409). Maitlis & Christianson 
(2014) developed a definition of sensemaking 
rooted in recurrent themes found in their literature 
review: “A process, prompted by violated expec- 
tations, that involves attending to and bracketing 
cues in the environment, creating intersubjective 
meaning through cycles of interpretation and 
action, and thereby enacting a more ordered envi- 
ronment from which further cues can be drawn” 
(p. 67). There are several factors that can influence 
sensemaking. Sandberg & Tsoukas (2015) found 
from their literature review that context, language, 
identity, cognitive frameworks, emotion, politics 
and technology constitute the main factors. 
Sensemaking is thus a process triggered by 
ambiguous events that interrupt an ongoing activ- 
ity and make individuals question what is going 
on. Individuals will extract cues from the environ- 
ment that are interpreted and they act on those 
interpretations and revise them through the conse- 
quences of their actions. This is an ongoing cycle 
and according to Weick (1995) sensemaking never 
starts or stops as people are always in the middle 
of things. The events that trigger sensemaking can 
range from unplanned to planned events and from 
minor to major events (Sandberg & Tsoukas, 2015). 
Sensemaking has been described as an individ- 
ual cognitive process that has to do with interpreta- 
tion and development of mental models (Elsbach 
et al., 2005). However, Weick (1995) described 
sensemaking as a social process where people 
actively shape each other’s meanings, and argued 
that even individual sensemaking is influenced by 
the actual, imagined or implied presence of others. 
The concept has traditionally been seen as a retro- 
spective activity that occurs as people look back on 
action that has already taken place (Maitlis & Chris- 
tianson, 2014). Weick (1995) argued that people can 
know what they are doing only after they have done 
it. The notion of prospective sensemaking (i.e. consideration
of the impact of actions) has long been a
part of the literature (Gioia et al., 1994). In recent
years prospective or future-oriented sensemaking 
has gained more attention (e.g. Gephart et al., 2010; 
Rosness, Evjemo, Haavik & Wærø, 2016). 


1.3.2 Resilience 

The commonly used definition of safety has been
“freedom from unacceptable risk”. In resilience
engineering, safety is instead defined as the ability
to succeed under varying conditions (Hollnagel et al.,
2011). Resilience engineering is concerned with
understanding the normal functioning of
socio-technical systems and how they perform under
varying conditions. Thus, performance variability
is not a threat that should be avoided by the use
of constraining means; in complex socio-technical
systems variability is considered normal and
necessary. In this view it is equally important to study
things that go right as things that go wrong, with
the aim of reinforcing the variability that leads to
positive outcomes (Hollnagel et al., 2011). The same
authors define resilience as “the intrinsic ability of a
system to adjust its functioning prior to, during, or
following changes and disturbances, so that it can
sustain required operations under both expected
and unexpected conditions” (p. 275).

According to Hollnagel et al. (2011), there are
four cornerstones that characterise resilient
systems: the ability to 1) respond to events,
2) monitor ongoing developments, 3) anticipate future
threats and opportunities, and 4) learn from
past failures and successes alike. 


1.3.3 Safety-critical 

In this paper, we use the terms safety-critical
situation and safety-critical operation to denote
situations or operations that, if they go wrong, have a
large potential for causing harm to people, property
or the environment.


2 METHODOLOGY 


A literature search was conducted to establish a 
knowledge base on sensemaking in safety-critical 
situations, as well as the relationship between 
sensemaking and resilience. Literature was 
obtained through Boolean searches of the follow- 
ing interdisciplinary databases: Scopus, Web of 
Science, Google Scholar and Oria. Based on the 
objective of the study, the keywords sensemaking, 
resilience and safety-critical were selected as the 
most relevant. In addition, some of the searches in 
the abstract databases, Scopus and Web of Science, 
included the keyword maritime. To capture varia- 
tions in these keywords, more specific search terms 
were used as shown in Table 1. Different combina- 
tions of the terms in Table 1 were used due to dif- 
ferent search approaches in the various databases. 


Table 1. Search terms.

Keyword          Search terms
Sensemaking      sensemaking; sense-making; sense making
Resilience       resilience; resiliency; resilient
Safety-critical  safety-critical; safety-critical situation(s);
                 safety-critical operation(s); safety critical;
                 safety critical situation(s); safety critical
                 operation(s); high-risk; high-risk situation(s);
                 high-risk operation(s); high risk; high risk
                 situation(s); high risk operation(s); hazardous;
                 hazardous situation(s); hazardous operation(s)
Maritime         maritime; at sea; boat; vessel; offshore


Broad search terms (e.g. “high-risk”) were used
when searching the abstract databases, Scopus
and Web of Science, whereas searches in Google
Scholar and Oria were conducted using more
specific terms (e.g. “high-risk situation”).
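
The combination of search terms can be illustrated with a minimal sketch. It assumes that variants of one keyword are joined with OR and that keyword groups are joined with AND, two groups at a time; the paper does not give the exact query syntax, which differs between databases. The term lists are abridged from Table 1, and the function names are hypothetical.

```python
from itertools import combinations

# Search terms per keyword, abridged from Table 1 (illustrative subset).
SEARCH_TERMS = {
    "sensemaking": ["sensemaking", "sense-making", "sense making"],
    "resilience": ["resilience", "resiliency", "resilient"],
    "safety-critical": ["safety-critical", "high-risk", "hazardous"],
}


def or_group(terms):
    """Join the variants of one keyword with OR, quoting each phrase."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_queries(search_terms, keywords_per_query=2):
    """Join keyword groups with AND, taking the groups n at a time."""
    queries = []
    for combo in combinations(search_terms, keywords_per_query):
        queries.append(" AND ".join(or_group(search_terms[k]) for k in combo))
    return queries


if __name__ == "__main__":
    for query in build_queries(SEARCH_TERMS):
        print(query)
```

Calling `build_queries(SEARCH_TERMS, 3)` instead yields a single query covering all three keywords, which corresponds to the general Google Scholar searches described in the text.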

To avoid an excessive amount of search results, 
general searches in Google Scholar covered all 
three keywords of sensemaking, resilience and 
safety-critical. When searching the abstract data- 
bases, and when using the “all in title” function 
in Google Scholar, search terms related to two of 
the three keywords were used. To be included, the 
documents either had to address sensemaking in 
the context of safety-critical situations or opera- 
tions, or discuss a relationship between sensemak- 
ing and resilience. For this reason, documents 
discussing sensemaking in other contexts were
excluded. So were documents that primarily
discuss resilience and that do not relate to the notion
of sensemaking. We did not include books in this
literature review, and a few papers were excluded
because we requested, but did not receive, the full text.


Table 2. Overview of the literature.

Author(s)                 Year  Publication type     Topic
Weick                     1993  Journal paper        Disruptions of sensemaking
Gephart                   1997  Journal paper        Quantitative sensemaking during crises
Beunza & Stark            2004  Working paper        Organisational resilience in a Wall Street Trading Room After 9/11
Furniss et al.            2009  Workshop paper       Reflection in the control room during safety-critical work
Baran & Scott             2010  Journal paper        A grounded theory of leadership and sensemaking
Bergström                 2012  PhD Thesis           Aspects of organisational resilience in escalating situations
Grøtan & Størseth         2012  Conference paper     Integration of organisational resilience into safety management
Hayes                     2012  Journal paper        Operator competence and capacity in complex hazardous activities
Lundberg et al.           2012  Journal paper        Resilience in sensemaking and control of emergency response
Sanne                     2012  Journal paper        Learning from adverse events in the nuclear power industry
Hutter & Kuhlicke         2013  Journal paper        Understanding resilience in the context of planning and institutions
Rankin                    2013  Licentiate’s thesis  Adaptive performance and resilience in high-risk work
Rantatalo                 2013  Thesis               Sensemaking and organising in the policing of high-risk situations
Rankin et al.             2014  Journal paper        A framework for analysing adaptations in high-risk work
Busby & Collins           2014  Journal paper        Sensemaking about risk control in offshore hydrocarbons production
Haavik                    2014  Journal paper        The nature of sociotechnical work in safety-critical operations
Norros et al.             2014  Journal paper        Operators’ orientations to procedure guidance in NPP process control
Saleh et al.              2014  Journal paper        Safety diagnosability and observability of hazards in design
van den Heuvel et al.     2014  Journal paper        Police strategies for resilient decision-making and action implementation
Barton et al.             2015  Journal paper        Contextualised engagement in wildland firefighting
Dahlberg                  2015  Journal paper        Exploration of resilience and complexity
Grøtan & van der Vorm     2015  Symposium paper      Conceptual approach to operational and managerial training of resilience
Hunte et al.              2015  Symposium paper      Dialogic sensemaking as a resource for safety and resilience
van der Beek & Schraagen  2015  Journal paper        Adaptability and performance in teams to enhance resilience
Danielsson                2016  Journal paper        Cross-sectorial collaboration in a potentially dangerous situation
Jahn                      2016  Journal paper        Adapting safety rules in high reliability contexts
Hoffman & Hancock         2017  Journal paper        How to measure resilience
Landman et al.            2017  Journal paper        A conceptual model for pilot’s ability to deal with unexpected events
Lofquist et al.           2017  Journal paper        Why different sub-cultures interpret safety rule gaps in different ways
Siegel & Schraagen        2017  Journal paper        Making resilience-related knowledge explicit through team reflection
Takeda et al.             2017  Journal paper        Developing resilience in disaster management promoting sensemaking
Teo et al.                2017  Journal paper        How leaders utilise relationships to activate resilience during crisis
Favaro & Saleh            2018  Journal paper        Temporal logic for safety supervisory control and hazard monitoring

Going through the identified documents, we 
found some key references that we included in 
this review as background (i.e. not included in 
Table 2). 


3 FINDINGS FROM THE LITERATURE 
REVIEW 


The literature search resulted in 33 documents that 
were included in the review. See Table 2 for the 
complete, chronologically listed, literature over- 
view. As can be seen from the table, the reviewed 
literature includes 25 articles published in peer- 
reviewed scientific journals, four papers presented 
at international conferences, workshops or sympo- 
siums, three theses and one working paper. 
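
The counts above can be cross-checked against Table 2 with a short sketch. The rows are transcribed from the table; the grouping of publication types into four categories is an assumption that mirrors the wording of the text, and all names in the code are hypothetical.

```python
from collections import Counter

# (author, year, publication type), transcribed from Table 2.
TABLE_2 = [
    ("Weick", 1993, "Journal paper"), ("Gephart", 1997, "Journal paper"),
    ("Beunza & Stark", 2004, "Working paper"),
    ("Furniss et al.", 2009, "Workshop paper"),
    ("Baran & Scott", 2010, "Journal paper"),
    ("Bergström", 2012, "PhD Thesis"),
    ("Grøtan & Størseth", 2012, "Conference paper"),
    ("Hayes", 2012, "Journal paper"), ("Lundberg et al.", 2012, "Journal paper"),
    ("Sanne", 2012, "Journal paper"),
    ("Hutter & Kuhlicke", 2013, "Journal paper"),
    ("Rankin", 2013, "Licentiate's thesis"), ("Rantatalo", 2013, "Thesis"),
    ("Rankin et al.", 2014, "Journal paper"),
    ("Busby & Collins", 2014, "Journal paper"),
    ("Haavik", 2014, "Journal paper"), ("Norros et al.", 2014, "Journal paper"),
    ("Saleh et al.", 2014, "Journal paper"),
    ("van den Heuvel et al.", 2014, "Journal paper"),
    ("Barton et al.", 2015, "Journal paper"), ("Dahlberg", 2015, "Journal paper"),
    ("Grøtan & van der Vorm", 2015, "Symposium paper"),
    ("Hunte et al.", 2015, "Symposium paper"),
    ("van der Beek & Schraagen", 2015, "Journal paper"),
    ("Danielsson", 2016, "Journal paper"), ("Jahn", 2016, "Journal paper"),
    ("Hoffman & Hancock", 2017, "Journal paper"),
    ("Landman et al.", 2017, "Journal paper"),
    ("Lofquist et al.", 2017, "Journal paper"),
    ("Siegel & Schraagen", 2017, "Journal paper"),
    ("Takeda et al.", 2017, "Journal paper"), ("Teo et al.", 2017, "Journal paper"),
    ("Favaro & Saleh", 2018, "Journal paper"),
]

# Assumed mapping from the table's publication types to the text's categories.
GROUPS = {
    "Journal paper": "journal",
    "Conference paper": "conference/workshop/symposium",
    "Workshop paper": "conference/workshop/symposium",
    "Symposium paper": "conference/workshop/symposium",
    "PhD Thesis": "thesis",
    "Licentiate's thesis": "thesis",
    "Thesis": "thesis",
    "Working paper": "working paper",
}


def tally(rows):
    """Count documents per publication category."""
    return Counter(GROUPS[pub_type] for _, _, pub_type in rows)


def published_between(rows, first, last):
    """Return the rows whose year lies in the closed interval [first, last]."""
    return [row for row in rows if first <= row[1] <= last]


if __name__ == "__main__":
    print(len(TABLE_2), dict(tally(TABLE_2)))
    print(len(published_between(TABLE_2, 2009, 2018)))
```

Run on the transcribed rows, the tally gives 25 journal papers, four conference, workshop or symposium papers, three theses and one working paper, and 30 of the 33 documents fall in the period 2009 to 2018.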

No inclusion criteria were applied regarding 
publication year. The results clearly indicate that 
the use of the term sensemaking in the context 
of safety-critical situations, or in relation to the 
term resilience, is relatively recent. Apart from
the papers by Weick (1993), Gephart (1997) and
Beunza & Stark (2004), all of the included
literature was published in the ten-year period from
2009 to 2018.

In addition to the 33 publications in Table 2, we
have also included frequently cited key research,
such as Weick (1995) and Endsley et al. (2003).

The following chapters describe how this
literature uses the term sensemaking; how it
characterises sensemaking in the context of safety-critical
situations; and how it describes the relationship
between sensemaking and resilience. In addition,
we describe the few issues we have found of
sensemaking in relation to training, human-machine
interface, the maritime domain and
design/development.


3.1 The use of the term ‘sensemaking’


As outlined in chapter 1.3.1, the concept of sense- 
making does not have one single definition. Weick 
(1995) stated that “(...) people can make sense of 
everything. This makes life easy for people who 
study sensemaking in the sense that their phenom- 
enon is everywhere” (p. 49). For this reason, we 
started the review by taking a closer look at how 
the various authors of the included literature use 
the term. 

In 1993, Weick reanalysed the Mann Gulch fire 
disaster in Montana in which 13 firefighters died. 
Here he provided analyses of sensemaking as a
generic phenomenon, explaining that “the basic
idea of sensemaking is that reality is an ongoing 
accomplishment that emerges from efforts to cre- 
ate order and make retrospective sense of what 
occurs” (p. 635). He further uses the example of 
Mann Gulch to argue that sensemaking is about 
contextual rationality and that it is “built out of 
vague questions, muddy answers and negotiated 
agreements that attempt to reduce confusion” 
(Weick, 1993, p. 636). 

As described by Maitlis & Christianson (2014),
the term sensemaking is often used without any
associated definition from the literature, and when
definitions are provided there are a variety of
meanings ascribed to it. Correspondingly, in our
study we found that several authors include
sensemaking as a general notion and do not provide any
associated definition (e.g. Favaro & Saleh, 2018;
Saleh et al., 2014; Sanne, 2012). Some provide
references to the work of others, but without
reproducing the actual definition (e.g. Grøtan & van der
Vorm, 2015; Jahn, 2016).

However, most of the papers in this review
provide definitions or references based on Weick’s
work on sensemaking, describing sensemaking
as a social process that involves extracting
cues and enactment to give meaning to events.
These include, among others, the work of Baran &
Scott (2010), Danielsson (2016), Hayes (2012) and
Takeda, Jones & Helms (2017).

Some describe sensemaking as a more cognitive
process and refer to Klein’s macro-cognitive
data-frame model (Hoffman & Hancock, 2017; Siegel &
Schraagen, 2017), whereas others refer to the work
of both Weick and Klein (e.g. Landman et al., 2017;
Norros et al., 2014; Rankin et al., 2014).

Some authors describe sensemaking as a process
building and supporting situational awareness
(Lundberg et al., 2012; van den Heuvel et al., 2014).
In this view, situational awareness is a tactical,
short-term issue, while sensemaking is a broader,
strategic, long-range concept that creates and
supports understanding. In his paper on sensework,
Haavik (2014) uses the definitions of Weick to explain
how sensemaking is a theoretical, generic
framework that addresses mental processes and aspects
of work. Alternatively, a few papers focus on the
Cynefin sensemaking framework described by
Kurtz & Snowden (2003) (Dahlberg, 2015;
Grøtan & Størseth, 2012).

The concept of sensemaking has traditionally, 
and in accordance with the work of Karl E. Weick, 
been described as retrospective in the sense that 
we make sense of our actions and experience after 
they have occurred. Most of the literature in this 
review uses the notion of sensemaking accordingly, 
often referring to Weick when doing so (e.g. Baran 
& Scott, 2010; Rantatalo, 2013; Teo et al., 2017). 


However, a few of the authors use the term in a
more future-oriented sense. As an example, Barton
et al. (2015) introduce the term proactive leader
sensemaking, arguing that leaders in particular
play a critical role in creating and maintaining a
context for actively managing uncertain situations.


3.2 Characteristics of sensemaking in the context 
of safety-critical situations 


There are several factors that can influence
sensemaking: context, language, identity, cognitive
frameworks, emotion, politics and technology (Sandberg
& Tsoukas, 2015). Thus, in the context of a
safety-critical situation there might be characteristics of
sensemaking other than or more prominent than
the characteristics of everyday sensemaking. For
instance, it might be expected that strong negative
emotions like stress and fear would influence
sensemaking in such circumstances. However,
the literature found in this review did not discuss
these characteristics explicitly.

In his analysis of the Mann Gulch fire disaster,
Weick (1993) describes how the disaster “was produced
by the interrelated collapse of sensemaking and
structure”. The smokejumpers expected to find a
fire that they would have under control by the
next morning. This positive illusion prevented
them from making sense of the cues in their
environment that contradicted this expectation. Weick
describes unclear roles, identity issues and in the
end the intense emotion of panic that led to the
disintegration of the group and to the primitive
tendency to flee. Unfortunately, this response
was too simple to match the complexity of the fire,
and 13 men lost their lives.

After completing a field study of 80 interviews,
Busby & Collins (2014) categorised the many ways
of acting through which informants made sense of
the risk control task. The authors provide
explanations of each of their 32 categories, but
elaborate on the five most commonly used. These are
1) being circumscribed (constrained, realistic,
moderate), 2) being engaged (closely involved,
concerned), 3) being resolute (rapid, and consistent in
acting), 4) being socialised (social outcomes and
systems of social obligation), and 5) being
solicitous (seeks opinion and external references). The
authors use their qualitative findings to suggest
that the sensemaking of organisational members
is simultaneously optimistic and pessimistic about
the capacities of social organisation to manage
risk. This balance is, however, not a matter of
individual sensemaking, nor is it a deliberate choice.

Lundberg et al. (2012) explored a model for
describing and studying resilience in the
management of safety-critical/irregular events. The model
was based on changes in the ongoing process, the
actors’ sensemaking and control functions, and the
technology used for sensemaking and control.
The model helped to identify resilience-building
processes and sources of resilience in emergency
responses.

Several other authors describe the character- 
istics of sensemaking by describing it in terms of 
how it relates to the concept of resilience. This is 
the topic of the following chapter. 


3.3 The link between sensemaking and resilience


As described in the introduction, one goal was 
to establish a knowledge base on the relationship 
between the two concepts of sensemaking and 
resilience. 

In his reanalysis of the Mann Gulch fire
disaster, Weick (1993) states that the disaster was
produced by the interrelated collapse of sensemaking
and structure. He mentions the importance of
nonstop talk as a critical source of coordination
in complex systems, and proposes four potential
sources of resilience that “make groups less
vulnerable to disruptions of sensemaking” (p. 628).
These include improvisation and bricolage, virtual
role systems, the attitude of wisdom, and norms
of respectful interaction. Thus, from this
perspective, resilience is core to maintaining sensemaking
in critical situations.

Other authors argue that sensemaking is an
important source of resilience. Through
their review of disaster management literature,
along with illustrative examples from global
disasters, Takeda et al. (2017) highlight the importance
of resilience in disaster management. They argue
that heedful interrelating and sensemaking are two
of the central tenets of resilience research and that
greater attention to resilience in the disaster
management process could be achieved through a focus
on the development of sensemaking and heedful
interrelating. Takeda et al. (2017) conclude that
future research is needed to further understand
resilience and sensemaking.

Rankin (2013) draws lines between Hollnagel’s
four central abilities that characterise a resilient
system and the sensemaking capabilities of
seeking information, ascribing meaning and action as
described by Grøtan et al. (2008). In a paper from
the following year, Rankin and her co-workers present
a framework for analysing adaptations in high-risk
work (Rankin et al., 2014). Here they explain how
sensemaking variety “includes the ability to
process information and revise it as the world changes,
given contextual constraints and the
experience and knowledge of the individuals involved”
(p. 84). Thus, sensemaking is important for
adaptive behaviour, which in turn is a prerequisite
for resilience. They focus on the importance of
observing sharp-end adaptations as critical to
identifying system brittleness and resilience.

Others suggest that sensemaking plays an
important role in accomplishing tasks that
facilitate organisational resilience, especially when the
sensemaking is carried out by leaders (Teo et al., 
2017). According to Hunte et al. (2015), “shared 
(social) sensemaking creates and nourishes com- 
mon awareness and understanding of the ‘operat- 
ing point’, and in so doing facilitates coordination 
and safer performance. This is an essential condi- 
tion for the emergence of safety and resilience” 
(p. 1). Similarly, van der Beek & Schraagen (2015) 
list sensemaking, or situation assessment, as one of 
several team resilience abilities. However, they do 
not provide a thorough discussion on sensemaking 
as such. 

The sensemaking perspective is also used in 
the literature as a means to analyse or explain 
resilience. According to Bergström (2012) “the 
development of a theoretical framework for ana- 
lysing organisational resilience in escalating situa- 
tions needs to relate to the explanatory potential 
of sensemaking theory” (p. 8). To enhance a 
dynamic understanding of resilience, Hutter & 
Kuhlicke (2013) analyse its elusive character from 
a sensemaking perspective. In their paper, resil- 
ience is understood as a “content of sensemaking 
processes in the context of a crisis” (p. 294). The 
authors state that the work of Weick in ‘sensemak- 
ing in organisations’ on the one hand and ‘resil- 
ience’ on the other is only loosely coupled, and 
they connect the two a bit more explicitly for plan- 
ning research about resilience. To understand how 
groups, organisations and networks make sense of 
resilience in the context of a crisis, one should con- 
sider the four processes of committing to resilience, 
expecting resilience, arguing about resilience and 
manipulating with resilience. These are referred to 
as sensemaking processes in planning research and 
practice (Hutter & Kuhlicke, 2013). 

In their paper, Lundberg et al. (2012) study
resilience in the context of sensemaking and control in
emergency management of irregular emergencies,
and propose an emergency management
analysis model. The model unifies and complements
existing models by explicitly modelling resilience
factors and the variety of the actors’ sensemaking
and control functions and technologies. Siegel &
Schraagen (2017) also use sensemaking theory to
describe aspects of resilience. In their
article on team reflection, they attempt to
use reflection and the data-frame theory of
sensemaking to show the relationship between
knowledge and resilience.

Finally, Hoffman & Hancock (2017) aim
to promote a discussion on how to measure
resilience. They explain how sensemaking provides
information to the work system about whether and
when the system needs to change its
understanding of problem situations, and further argue that
this means that “adaptive and resilient
sensemaking requires mechanisms for recognizing anomalies
and situations that mandate change” (p. 571).


3.4 Sensemaking as a basis for change, 
innovation, creativity and design 


The literature has indicated how sensemaking
supports innovation, design and creativity.
Discussions of sensemaking have often been limited to
an organisational context and seldom address issues
such as system design. Saleh et al. (2014) point out that
safety science seems to have drifted from the
engineering and design side of system safety towards
organisational and social sciences or refinement
of probabilistic models. To improve safety and
resilience in safety-critical operations, we must
have a broad-based approach that involves the
socio-technical system and also considers how cues and
prospective sensemaking can be enabled from the
design phase onwards.

It is also mentioned that sensemaking is a key
process for learning in organisations, teams and
individuals (Maitlis & Christianson, 2014); one
challenge is to use new information rather than
engage in sensemaking based on prior beliefs.
Sensemaking is also concerned with new
meanings that can underpin new ways of organising,
understanding and design. When the sensemaking of
organisational members is affected, the
participants are motivated to change their own roles and
practices. This has especially been supported by
looking at interpretations and actions through the
process of action research (Greenwood & Levin,
2006).


3.5 Topics partly covered in the literature review 


Accident reports have shown that insufficient
training and poor HMI may impair sensemaking
processes and thus lead to incidents and accidents.
After the incident at Scarabeo 8, the investigation
report attributed the incident to insufficient
training of control room personnel and weaknesses in
the control room’s human-machine interface (Ptil,
2012). As discussed by Endsley et al. (2003), the
human-machine interface is a key factor shaping operator
performance, via concepts like sensemaking and
situation awareness. We thus expected some of the
reviewed literature to address sensemaking in relation
to training or HMI. However, we did not find much
relevant literature through our review. In the
following, we summarise our findings related
to training, HMI and sensemaking in safety-critical
situations within the maritime domain.


Not many of the reviewed documents look at
ways of training to improve sensemaking.
However, Takeda et al. (2017) focus on building
capacity for individual actors to interrelate in a heedful
manner. Saleh et al. (2014) mention that
the ability to diagnose hazardous states provides
one way to improve operators’ sensemaking and
situational awareness after an adverse event. This
ability is synergetic with organisational factors in
support of accident prevention, particularly safety
training, which can be shaped by including off-nominal
conditions. Rantatalo (2013) describes
observations of joint police management training in
a full-scale simulated scenario, arguing that “from
sensemaking and organisational reliability perspectives,
high-strain situations like that described above
offer a possibility to observe interaction patterns
during incident management in a realistic setting”
(p. 55). One of the conclusions drawn in the paper
by Landman et al. (2017) about dealing with
unexpected events on the flight deck is that
interventions should focus on “increasing pilot reframing
skills (e.g. through the use of unpredictability in
training scenarios)” (p. 1161). The authors propose
a conceptual model for explaining pilot
performance in surprising and startling situations; a model
that can be used to design experiments and
training simulations. Several of the reviewed
papers do, however, discuss training that is aimed at
enhancing resilience (Bergström, 2012; Grøtan & van der
Vorm, 2015; van der Beek & Schraagen, 2015).

Siegel & Schraagen (2017) describe processes
and HMI tools to make boundaries explicit in rail- 
way operations. In the maritime sector the ability 
to handle demanding operations safely is increas- 
ingly dependent on ICT-based control systems, e.g. 
dynamic positioning and ballasting. Hence, the 
impact of HMI on sensemaking is an important 
topic for our project. However, few of the docu- 
ments covered by this literature review address this 
issue. Relevant, but brief, discussions on such inter- 
action are made in the papers by Dahlberg (2015), 
Landman et al. (2017) and Sanne (2012). 

Furthermore, almost none of the identified 
publications on sensemaking in safety-critical situ- 
ations are related to maritime operations. A couple 
of the papers discuss sensemaking in the context of 
offshore oil and gas production (Busby & Collins, 
2014; Hayes, 2012). The doctoral thesis by Berg- 
ström (2012) is the only publication in our review 
that is related to navigation and shipping, although 
not specifically concerning sensemaking. 


3.6 Limitation of the review 


Our findings should be considered in light of the
limitations of the search terms. In addition to the
search terms related to the keyword safety-critical,
we could have included surprising, emergency, etc.
Also, we could have obtained interesting
findings had the review included papers on situational
awareness. However, we chose to keep this review
focused on the terms of specific interest for our
project.


4 DISCUSSION AND CONCLUSIONS 


The current literature review aimed at describing
how the selected literature uses the term
sensemaking; how it characterises sensemaking in the
context of safety-critical situations; and how it 
describes the relationship between sensemaking 
and resilience. 

We found that several authors use the term
sensemaking without providing a definition, and
those who do mostly refer to Weick’s work, describing
sensemaking as a social process that involves
extracting cues and enactment to give meaning to events
retrospectively.

Sensemaking and resilience were found to be
described as related in the reviewed literature.
Sensemaking creates the context for being
resilient; at the same time sources of resilience, such
as redundancy (i.e. redundant cues), help to make
sense of the situation. Lundberg et al. (2012)
have suggested a model for resilient
sensemaking, exploring changes in the ongoing process,
the actors’ sensemaking and control functions, and
the technology used for sensemaking and control.
However, little is written on the issue of
sensemaking in safety-critical situations that also concerns
aspects of training, human-machine interaction
or the maritime domain; thus, we see a need to
increase our knowledge in these areas through
observation studies and targeted literature reviews.

Sensemaking is seen as a long-term strategic
process, creating understanding. There has been a 
discussion whether sensemaking is something that 
happens inside the individuals’ heads or if it is a 
social construct. In our further work, we would 
like to explore sensemaking as a social construct, 
impacted by organisations, technology and human 
factors. Also, sensemaking has been described as 
both a retrospective and a prospective process. We 
would like to build on the research of prospective 
sensemaking to understand how to build resilience 
through future actions. 

Sensemaking is accomplished through
perceiving cues and creating meaning/learning (through
interpretations and actions), as discussed in
Maitlis & Christianson (2014); thus it is
dependent on responsibilities, procedures, training and
technology (such as human-machine interaction).
However, unexpected or safety-critical situations
do not necessarily trigger sensemaking; it happens
when there is a discrepancy from what one expects.
These expectations are influenced by the amount of
experience and training and by the degree of
questioning attitude, i.e. they are in line with existing
group norms or organisational culture.

The detection of discrepancies must also be
supported by design, i.e. by having redundant systems
that can reveal discrepancies, and by training to
ensure a questioning attitude. This is in line with
sensemaking in High Reliability Organisations (HROs),
where practices such as “preoccupation with failure”,
“reluctance to simplify” and “sensitivity to operations”
support the exploration of cues and interpretations
(Weick & Sutcliffe, 2011).

Further work in the project will focus on how
to strengthen the loop of perceiving cues,
creating interpretations, ascribing meaning to events
and taking actions. It will also address how to improve
the design of interfaces to automation (i.e. HMI and
procedures) and how to train to facilitate sensemaking
and resilience. Through this work, we aim to contribute
to improving the ability to handle safety-critical
situations in demanding maritime operations.


ABSTRACT: This paper presents the initial framework adopted to assess human error in assembly
tasks at a large manufacturing company in Ireland. The model to characterize and predict human error
presented in this paper is linked conceptually to the model introduced by Rasch (1980), where the
probability of a specified outcome is modelled as a logistic function of the difference between the person
capacity and item difficulty. The model needs to be modified to take into account an outcome that is not
dichotomous and to feed into the interaction between two macro factors: (a) Task complexity, which
summarises all factors contributing to the physical and mental workload requirements for execution of a
given operative task; and (b) Human capability, which considers the skills, training and experience of the
people facing the tasks, representing a synthesis of their physical and cognitive abilities to verify whether
or not they match the task requirements. Task complexity can be evaluated as a mathematical construct
considering the compound effects of Mental Workload Demands and Physical Workload Demands
associated with an operator task. Similarly, operator capability can be estimated on the basis of the operators’
set of cognitive capabilities and physical conditions. A linear regression model was fitted in R to the
collected dataset. The estimates of task complexity and operator skills were then used to estimate human
performance in a Poisson regression model. The preliminary results suggest that both elements are significant
in predicting error occurrence.
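
As a point of reference, the standard dichotomous Rasch model mentioned above can be sketched in a few lines. This is only the textbook form, in which the probability of a specified outcome is the logistic function of person capacity minus item difficulty; the modified, non-dichotomous model used in this paper is not reproduced here, and the function and parameter names are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Dichotomous Rasch model: logistic function of ability minus difficulty.

    Both arguments are on the same logit scale; the names are illustrative.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


if __name__ == "__main__":
    # When capacity equals difficulty, both outcomes are equally likely.
    print(rasch_probability(0.0, 0.0))
    # A more capable person facing the same item succeeds more often.
    print(rasch_probability(1.5, 0.0) > rasch_probability(0.5, 0.0))
```

In this form only the difference between the two parameters matters, which is why task complexity and human capability can be placed on a common scale.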


1 INTRODUCTION


1.1 Scope of work and background


This paper presents the initial framework adopted to assess human error in assembly tasks at a large
manufacturing company in Ireland [1].

The aim of this study was to carry out an observational, empirical study of the existing human errors in
the dispatching department, to find a way to model the issue and, if possible, to propose approaches to
reduce and eliminate errors and variations in the end product. The company dispatches technology goods to
national and international customers, and the focus of the project was the assembly of goods for dispatch.
Operators prepare the goods at workstations along conveyor lines; however, inefficiencies and inaccuracies
relating to human performance were identified at these conveyors. Two primary workstations in the
dispatching unit were selected for inclusion based on their recorded error rates. Conditions at
workstations vary and fluctuate, including the complexity and number of the activities, environmental
conditions and the quality of the product, which may increase the probability of making mistakes. An
understanding of both human nature (characteristics, feelings and behavioural traits) and the impact of
the features of the workstation on human nature (typology of activities, working load, anxiety induced,
environmental factors etc.) was required to holistically determine the performance shaping factors for the
workstations under examination. The focus is on the role of the operator’s capability to complete tasks
and the means to reduce human errors whilst retaining product quality. Changes were proposed for the
assembly lines at the dispatching stations, including changes in the procedures and training to employ an
understanding of human performance and improvements to safety, with an overall beneficial impact on both
productivity and quality.

The researcher conducted a task analysis of the critical activities completed by operators when packing
out the variety of product units at two primary workstations. Questionnaires were prepared examining the
skills requirements, skills rating of operators, mental workload requirements, physical workload
requirements, perceived task complexity and motivation. Finally, an applied model, the Task Execution
Reliability Model (TERM), was used to identify the main factors affecting human performance in this
setting.
Four methods were used to inform the research:


1. Firstly, an examination of performance shaping factors in the literature to inform a set of specific
questionnaires.

2. Secondly, the collection and analysis of the data from the questionnaires completed by operators,
technicians, supervisors, group leaders and process engineers in the manufacturing facility familiar with
the work undertaken at the workstations under examination.

3. Thirdly, focus group sessions discussing possible participatory redesign of the process and procedures
at the workstations.

4. Finally, the use of the questionnaire data to predict task complexity and error occurrences using two
different types of regression models.


2 MODELLING HUMAN ERROR 


2.1 Human error in manufacturing 


Human nature can be shaped and driven by factors including individual characteristics, personal issues,
and physical and psychological conditions (Tooby & Cosmides, 1990). These factors interact with each other
and may determine the output and productivity of the individual. Human performance is unavoidably
susceptible to human error, as humans are not infallible and the occurrence of errors must be expected
(Karl & Karl, 2012). Humans are often capable of recognising and rectifying such errors before any serious
or critical consequences occur (Sheridan, 2008). With this in mind, human performance can be understood as
the product of the balance between task complexity and capability (Morgeson et al., 2010).

When the capabilities and limitations of humans are understood, acknowledged and incorporated, Harris
(2006) argues that the benefits can include increased efficiency and improved safety performance.
Individual employees’ competencies may be challenged by fatigue, stressors and unpredictability, whilst
they may benefit from skills, training and a clear comprehension of the task (Miller & Parasuraman, 2007;
Jo et al., 2012; Kostina et al., 2012). The capabilities of the operator and the physical skills required
for the task must be taken into consideration when reviewing tasks and the errors associated with them
(Harris, 2006). A balance between physical and mental workload ought to be reached to reduce human errors
among competent operators (Miller & Parasuraman, 2007).


2.2 The TERM model: Task Execution Reliability Model


The model used is linked conceptually to the model introduced by Rasch (1980) to analyse the correct or
incorrect execution of a task as a function of the trade-off between (a) the respondent’s abilities,
attitudes or personality traits and (b) the item difficulty. In the Rasch model, the probability of a
specified outcome (e.g. right/wrong results) is modelled as a logistic function of the difference between
the person ability and the item difficulty parameters.

The mathematical form of the model is provided in equation (1).

Let X_{ni} be a dichotomous random variable with binary values where, for example, X_{ni} = 1 denotes a
correct response and X_{ni} = 0 an incorrect response to a given assessment item. In the Rasch model for
dichotomous data, the probability of the outcome is given by:


\Pr(X_{ni} = 1) = \frac{e^{\beta_n - \delta_i}}{1 + e^{\beta_n - \delta_i}}    (1)


where \beta_n is the ability of person n and \delta_i is the difficulty of item i.

The model needs to be radically enhanced to take into account an assessment of performance that is not
dichotomous and to feed into the interaction between two macro factors:


• Task Complexity (TC): summarising all factors contributing to physical and mental workload requirements
  for execution of a given operative task.

• Human Capability (HC): summarising the skills, training and experience of the people facing the tasks,
  representing a synthesis of their physical and cognitive abilities to verify whether or not they match
  the task requirements.
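
As a minimal sketch of the dichotomous Rasch relation described above, the snippet below evaluates the logistic function of the difference between a person's ability and an item's difficulty. The function name and the numerical values are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Pr(X = 1) in the dichotomous Rasch model.

    Both arguments are on the same logit scale; the result is the logistic
    function of (ability - difficulty).
    """
    logit = ability - difficulty
    # 1 / (1 + e^-z) is algebraically identical to e^z / (1 + e^z).
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical ability/difficulty pairs (assumptions, not study data).
for ability, difficulty in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]:
    p = rasch_probability(ability, difficulty)
    print(f"ability={ability:+.1f} difficulty={difficulty:+.1f} Pr(X=1)={p:.3f}")
```

When ability equals difficulty the probability is 0.5, and it moves towards 1 as ability exceeds difficulty, which is the trade-off the TERM model carries over to task complexity and human capability.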


Task complexity can be evaluated as a mathematical construct that also considers the compound effects of
two main factors: “Mental Workload Demands” (MW) and, where relevant, “Physical Workload Demands” (PW),
both associated with an operator task. Recent sensorised EEG experimental studies have shown that the
simultaneous execution of tasks, whether physical or cognitive, tends to increase cognitive demands on the
human brain (Mijović, 2017).

Similarly, operator capability should be estimated on the basis of the operators’ set of cognitive
capabilities and physical conditions. A regression model was fitted in R to the collected dataset. The
model and the preliminary results are discussed in Section 3 of the present paper.


3 THE CASE STUDY: SUMMARY OF THE 
DATA COLLECTED AND THE TERM 
MODEL AS APPLIED 


3.1 The case study and the data collection plan


The setting and focus of this study is a large electronic manufacturing facility in the south of Ireland,
which prepares and distributes technology goods to both national and global customers. In the dispatching
unit of the facility, operators are provided with workstations and conveyors to prepare the products for
dispatch and shipment through pack-out procedures. The aim of this study was to carry out an
observational, empirical study of the existing human errors in the dispatching department of the facility;
in a subsequent phase, the study also led to the identification of suitable approaches to reduce and/or
eliminate such errors.

Two primary workstations were the focus of the assessment of the project, namely the conveyor line and
another packaging workstation called the POD cell. To examine these workstations, an overview of the
existing error rate at the conveyor line was required, to be used as a benchmark against other
workstations in the facility and to identify any possible improvements. As a means of comparison, the
error rates for nine control workstations from within the manufacturing facility were acquired to
facilitate data analysis and interpretation.

Error rates for both the control and non-control workstations were calculated in the same manner. Records
were filtered from 1st December 2016 to 31st March 2017 for all workstations to retrieve the information
for the calculations. This four-month timeframe was deemed adequate due to the large number of products
passing through the workstations. We considered only errors classified as stemming from a human-related
cause.

The human error rates were calculated using the following formula:


Human error rate = Number of human errors / Opportunities for error    (2)


where the number of human errors is the number of errors recorded or captured due to a human cause, and
the opportunities for error are the total output at the workstation, i.e. the number of processed units.
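
The calculation in formula (2) can be sketched as follows. The function name and the per-1000 scaling argument are assumptions made for illustration; the input counts are those reported in Table 1 for Control G, H and I.

```python
def human_error_rate(errors: int, opportunities: int, per: int = 1000) -> float:
    """Human errors per `per` processed units (formula 2, scaled)."""
    if opportunities <= 0:
        raise ValueError("opportunities must be a positive number of processed units")
    return errors / opportunities * per


# (number of human errors, total output) as listed in Table 1.
workstations = {
    "Control G": (368, 5402),
    "Control H": (107, 5402),
    "Control I": (133, 5402),
}

rates = {
    name: round(human_error_rate(errors, output), 1)
    for name, (errors, output) in workstations.items()
}
print(rates)
```

Rounded to one decimal place, these rates agree with the errors per 1000 pieces listed for the same workstations in Table 1.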

For the pod and the conveyor, the number of human errors was obtained from data on MWD (missing, wrong or
damaged) goods collected within the four-month period from 01/12/16 to 31/03/17. MWDs originate from
customer complaints or returned goods following disparities from the sales orders or damaged goods. MWDs
can be slow to capture, due to the possible time lapse between the shipment of an order, the start of use
of the product by the customer and the identification of an error. MWDs may be reported some months after
a product was shipped; however, given the timeframe selected, it was deemed that by the completion of the
project the number of MWDs recorded for that timeframe would be sufficient. The opportunity for error was
derived from the total output at the workstations within the four-month period from the beginning of
December 2016 to the end of March 2017.

For the control workstations, the numbers of human errors were retrieved from an online software platform
for the four-month period outlined above. The platform is used to record both the total output at the
workstations and the number of errors recorded. It records errors with varying root causes through a
classification system, and many of these causes are not of a human nature. Twenty-seven classifications
were deemed suitable for inclusion in the human errors recorded.

In the control workstations, when an error occurs, the operator or technician is required to submit an
error report at that time detailing the source of the error, i.e. human, equipment or technical. The
process cannot continue until an error report has been submitted. For this reason, the error reports
recorded in the system can be regarded as representative of the total number of errors occurring during
the timeframe. When an error is recorded, users are prompted to categorise it under a variety of
descriptions. The categories can include aspects of technology or equipment failure, and not all were
relevant for inclusion in the error rate calculation.


3.2 The observation and questionnaire protocol 
used for the wider case study 


Members of staff who work closely with the work- 
stations involved in the project and the control 
workstations were invited to complete question- 
naires to assess their opinions relating to: 


The importance of skills at different workstations 
Skills rating of individual operators 

Job satisfaction/motivation 

Mental workload requirements 

Physical workload requirements 

Perceived task complexity 


Two questionnaires were prepared: one for supervisors, group leaders and process engineers, and a second
for operators and technicians. The questionnaires were split in this way in order to capture observable
variables from the supervisors/management and the individual subjective opinions of the operators. The
questionnaires differed in the type and volume of questions, as the supervisor/group leader questionnaire
asked two different things:


— Participants in a supervisory role were asked to rate the skills of operators under their supervision

— All participants were asked to rate the skills required to complete work at the workstations


The questionnaires were completed by the 
employees of all eleven workstations and their 
supervisors leading to a total of 149 employees 
completing the questionnaire (100% response rate). 

Participants were asked to rate their answers 
on a 10-point Likert Scale, with one meaning low 
and ten meaning high. Questionnaires were used to 
measure the mental and physical workload, worker 
skills, job satisfaction (motivation) and the per- 
ceived task complexity for operators, supervisors, 
group leaders and process engineers. As different 
duties and tasks require certain skills (e.g. manual 
skills, memory), practical training and underpin- 
ning knowledge, the questionnaire was designed to 
capture information relating to the following areas: 


Mental Workload Requirements

— Need to cope with pace
— Variance of product
— Recognition requirements
— Load due to quality of coordination
— Requirement for training/experience
— Requirements for human machine interface (HMI)


Physical Workload Requirements

— Ergonomic score (REBA Assessment)
— Dexterity requirements/manual skills
— Adherence to procedure
— Reliance on automation


Job Satisfaction/Motivation

— Motivation e.g. satisfaction, meaningfulness


Worker Skills

— Memory
— Decision-making
— Recognition
— Coordination/communication, teamwork
— Coping with pace
— Experience
— Dexterity/manual skills
— Physical resilience
— Adherence to procedure


Perceived Task Complexity

— How mentally demanding are the tasks
— How physically demanding are the tasks
— How complex is this task


The error rate for all eleven workstations has been calculated and is outlined in Table 1.


Table 1. Error rate dataset collected for each workstation.

Workstation No  Workstation  No of human errors  Opportunity for errors    No of operators  Error per 1000pc
                                                 (i.e. total output)
1               Pod          0                   TAT                       19               0.01
2               Conveyor     14                  8,913                     19               1.5
3               Control A                        12,055                    19               0.2
4               Control B    1                   1,359                     19               0.7
5               Control C    44                  221                       2                203.6
6               Control D    93                  221                       2                425
7               Control E    28                  3,971                     5                7
8               Control F    81                  3,971                     7                20.3
9               Control G    368                 5,402                     >                68.1
10              Control H    107                 5,402                     0.019            19.8
11              Control I    133                 5,402                     0.0246           24.6


Data collection involved a rich integration of data from many sources, acquired observationally or through
documented information. There were four primary sources of data:


1. The questionnaires outlined above. The data collected facilitate the assessment of the relationship
between the task complexity (mental workload requirements, physical workload requirements) and the worker
capability (cognitive skills, physical skills).

2. Focus groups were conducted to understand the process and procedures at the workstations and the
aspects of the workstations that would benefit from redesign.

3. Key Performance Indicators (KPIs) were gathered for information relating to:

— The actual time at the workstation (productivity KPI)
— The number of quality issues due to human error (quality KPI)

Error rates for the workstations were formulated to provide insight into the rate of human error and its
resulting quality effects on the workstation end products.

4. Videos/Pictures

In order to capture and assess information regarding the routine activities and work patterns of staff in
the facility, video recordings and photographs were taken as an observational method of data collection.
The videos were used to:


— Measure the amount of time the entire task took to complete
— Measure the amount of time an aspect of the task took to complete, e.g. closing with sellotape
— Compare the procedure completed to the actual projected procedure for the completion of actions
— Perform task analysis using Video TimerPro software to break down the tasks the operator is required to
  complete

The photographs and video recordings were used to:

— Provide a basis for the Ergonomic Risk Assessment method used
— Compare comparable tasks completed at alternate workstations


For the first part of the regression model, an assessment of task complexity was conducted. The data
gathered were evaluated on the basis of task complexity with a linear regression model. In order to
complete this evaluation, a task complexity index was applied, namely:


Task Complexity index = a (memory req.) + b (recognition req.) + c (coordination req.) + d (cope with
pace req.) + e (experience req.) + f (resilience req.) + g (adherence to procedure req.)
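
The index above is a weighted sum of the seven requirement ratings. The sketch below shows the arithmetic only: the weights a to g are hypothetical values chosen for illustration (they are not the fitted coefficients of the study), and the ratings are assumed to be on the 10-point questionnaire scale.

```python
# Hypothetical weights a..g (assumptions for illustration, not fitted values).
WEIGHTS = {
    "memory": 0.19,
    "recognition": 0.18,
    "coordination": 0.11,
    "cope_with_pace": 0.14,
    "experience": 0.14,
    "resilience": 0.07,
    "adherence_to_procedure": 0.20,
}


def task_complexity_index(ratings: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of requirement ratings (each rated from 1 to 10)."""
    missing = set(weights) - set(ratings)
    if missing:
        raise KeyError(f"missing ratings for: {sorted(missing)}")
    return sum(weights[name] * ratings[name] for name in weights)


# Hypothetical workstation rated 7 on every requirement.
example_ratings = {name: 7 for name in WEIGHTS}
print(round(task_complexity_index(example_ratings), 2))
```

In the study the weights are estimated by linear regression against the perceived task complexity, so the values used here only stand in for that fitting step.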


The correlation matrix for the elements used in the regression to evaluate task complexity, obtained in
the statistical software R, is shown in Figure 1.

Figure 2 reports the preliminary results of the linear regression model used to predict task complexity
in R.

The model indicates that the parameters used to estimate task complexity in the linear regression are
largely significant. They predict task complexity with a residual standard error of 0.2991 on 36 degrees
of freedom. The adjusted R-squared obtained is 0.93996 and the F-statistic on 7 and 36 degrees of freedom
is 96.52, with a p-value of 2.2e-16. Therefore the linear regression model to estimate task complexity
seems to deliver significant results.


Figure 1. Correlation matrix evaluated for the elements used in the regression to evaluate task
complexity.


Coefficients:
                                Estimate Std. Error t value Pr(>|t|)
(Intercept)                     -0.62930    0.36179  -1.739 0.05509  .
taskreq$Memory.Capacity          0.19135    0.04300   4.454 7.63e-05 ***
taskreq$Recognition              0.18180    0.02609   6.938 3.96e-08 ***
taskreq$Coord.Comm               0.10539    0.03672   2.876 0.006733 **
taskreq$Cope.with.Pace           0.14421    0.04006   3.600 0.000951 ***
taskreq$Experience               0.13699    0.04973   2.754 0.009162 **
taskreq$Physical.Resilience      0.07111    0.04113   1.729 0.092341 .
taskreq$Adherence_to_procedure   0.19745    0.03367   5.865 1.05e-06 ***


Figure 2. Results obtained from R to evaluate the relevance of the coefficients used to estimate task
complexity.

For the second part of the model, the error occurrence at each workstation was estimated considering task
complexity and operator capability. The Rasch model could not be used with the dataset gathered, as it
requires a binary success or failure outcome for each individual task, a type of data that could not be
collected. For this reason, a generalised linear regression with a Poisson model was used, still based on
the assumption that human performance can be represented as directly dependent on the two macro factors of
task complexity and human capability (see formula 3).


\eta_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2, \qquad \lambda_i = e^{\eta_i}, \qquad \log \lambda_i = \eta_i    (3)

where \lambda_i is the amount of error recorded, x_1 is task complexity and x_2 is operator skill
level/capacity. The results obtained in R suggest that both elements are significant in predicting error
occurrence, as shown in Figure 3.
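
To illustrate how a fitted Poisson model with a log link turns task complexity and skill level into an expected error count, the sketch below uses the coefficient estimates reported in Figure 3 and, as an example input, the values listed for Control D in Table 2 (task complexity 9, average skills 7.23). The function and constant names are assumptions made for this illustration.

```python
import math

# Coefficient estimates as reported in Figure 3 (Poisson GLM, log link).
BETA_INTERCEPT = -2.41163
BETA_SKILLS = -0.47110
BETA_COMPLEXITY = 1.57751


def expected_errors(task_complexity: float, average_skills: float) -> float:
    """Expected errors per 10,000 parts: exp(linear predictor)."""
    eta = (
        BETA_INTERCEPT
        + BETA_SKILLS * average_skills
        + BETA_COMPLEXITY * task_complexity
    )
    return math.exp(eta)


# Control D in Table 2: task complexity 9, average skills 7.23.
print(round(expected_errors(task_complexity=9.0, average_skills=7.23)))
```

For this input the prediction is of the same order as the 4250 errors per 10,000 parts recorded for Control D. Because the skills coefficient is negative and the complexity coefficient positive, higher skills lower the expected count and higher complexity raises it.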

The likelihood ratio test results confirmed the significance of the parameters chosen for estimating the
error rate with this model, as shown in Figure 4.
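
The arithmetic behind a likelihood ratio test can be sketched as follows. The statistic is twice the difference between the log-likelihoods of the full and the intercept-only model, compared against a chi-square distribution; for two degrees of freedom the chi-square tail probability has the closed form exp(-x/2). The log-likelihood values used below (33.667 and 28.272) are an assumed reading of the output shown in Figure 4, and the function name is hypothetical.

```python
import math


def likelihood_ratio_test_df2(loglik_full: float, loglik_null: float):
    """Likelihood ratio statistic and p-value for a 2-parameter difference.

    For 2 degrees of freedom the chi-square survival function is exp(-x / 2),
    so no statistics library is needed.
    """
    chisq = 2.0 * (loglik_full - loglik_null)
    p_value = math.exp(-chisq / 2.0)
    return chisq, p_value


# Assumed reading of the log-likelihoods in Figure 4.
chisq, p_value = likelihood_ratio_test_df2(33.667, 28.272)
print(f"Chisq = {chisq:.2f}, p = {p_value:.6f}")
```

Under this reading the p-value reproduces the 0.004539 reported in Figure 4.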

However, the data set is limited and the estimates of skill rating were gathered using a subjective
rating. The model could therefore be improved if a more extensive data collection campaign and a more
objective estimation of skill rating were to be achieved.

Figure 5 provides a graphical representation of the expected error rate calculated with respect to task
complexity.


Table 2. Summary of data collected and revised for each workstation used in the regression model.

Id  Workstation  Average skills recorded  Task complexity  Errors_on_10000 parts
1   Pod          6.45                     7.4              1
2   Conveyor     6.45                     7.28             15
3   Control A    6.45                     6.8              2
4   Control B    6.45                     6.8              7
5   Control C    7.18                     8                2036
6   Control D    7.23                     9                4250
7   Control E    7.09                     7.57             70
8   Control F    7.86                     4.33             203
9   Control G    5.33                     6.33             681
10  Control H    6.27                     6.4              190
11  Control I    7.83                     6.88             246


Call:
glm(formula = errorate$errors_on_10000parts ~ errorate$average.skills +
    (errorate$task.complexity), family = poisson, data = errorate)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-31.533  -21.450   -1.975    5.522   31.609

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)               -2.41163    0.19002  -12.69   <2e-16 ***
errorate$average.skills   -0.47110    0.03457  -13.63   <2e-16 ***
errorate$task.complexity   1.57751    0.01710   92.24   <2e-16 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Figure 3. Results of the analysis run in R for the generalised Poisson linear model.


Model 1: errorate$error.rate ~ errorate$average.skills + errorate$task.complexity
Model 2: errorate$error.rate ~ 1
  #Df  LogLik  Df  Chisq  Pr(>Chisq)
1   4  33.667
2   2  28.272  -2  10.79  0.004539 **


Figure 4. Results of the analysis run in R for the likelihood ratio test for the generalised Poisson
linear model.


Figure 5. Plot of the expected error rate calculated with respect to task complexity.


4 CONCLUSIONS AND WAY FORWARD 


Following this study, a focus group and some observational studies were performed, suggesting that a
reorganisation of work practices between the original conveyor line and the new pod cell design served to
improve overall human performance in the facility. This was demonstrated by the reduction in the number of
human errors reported for the workstations during the four-month timeframe of the project.

The data formed the basis of an empirically based, cross-verified model of human performance that can be
used to provide objective feedback to users, increasing their awareness of risks related to their own
human characteristics, and to inform the design of safety-critical systems and current approaches to
vocational training. For the manufacturing facility involved in the project, further developments may
include engaging operators in all elements of a process, induction testing to match operators’
capabilities to the tasks most suited to them, and orientation of workstations to support operators,
taking human error and ergonomics principles into account.

Prior to any intervention or examination of human performance, human error in the manufacturing facility
contributed to a large number of errors, resulting in financial costs and productivity losses for the
organisation. The reorientation of work practices at workstations, considering the role of human error
and ergonomic principles, has allowed for a reduction in the incidence of human-related errors across the
workstations examined.

The results may be limited by the four-month timeframe for which human errors were considered. However,
the results show that task complexity can be significantly predicted from the variables observed in the
case study.

The TERM model used (the Poisson generalised linear regression) also suggests that both task complexity
and operator skill are valid predictors of error occurrence at a workstation. It is also possible that, as
task complexity increases, a corresponding linear increase in worker skills and capability cannot
sufficiently compensate for the increased complexity.


ABSTRACT: The paper deals with the quantification of probabilities for human failures in the radiotherapy
domain. The probabilities are used as input for the development of a Human Reliability Analysis (HRA)
method specific for radiotherapy. Quantification is based on expert judgment, in view of the lack of
relevant data. A Bayesian aggregation model is used to aggregate the judgments collected during
elicitation sessions with domain experts. A qualitative scale is first used; then the judgments are
interpreted as information on the order of magnitude of the error likelihood and aggregated under the
Bayesian scheme. Beyond the specific domain of interest, this work is relevant for novel HRA applications
outside typical domains, for which the need to incorporate expert judgment in traceable and defendable
ways is key.


1 INTRODUCTION 

Human failures are important contributors to near 
misses, incidents, and accidents in radiotherapy 
(WHO 2008), as in many other domains. Efforts are 
undertaken to systematically address the potential 
for failures and continuously improve the patient 
treatment process, e.g. Huq et al. (2016). In this 
context, the Risk and Human Reliability research 
group at the Paul Scherrer Institute (Switzerland), 
in collaboration with the institute’s Center for 
Proton Therapy, is developing a method to sup- 
port Human Reliability Analysis (HRA), specific 
for external beam radiotherapy. Previous work by the authors identified the personnel tasks critical to
patient safety and the factors possibly influencing them (Pandya et al. 2017). Current work addresses the
quantification of the corresponding human failure probabilities.

In particular, the present paper focuses on the quantification of the failure probabilities for
representative tasks, given a set of Performance Influencing Factors (PIFs). Given the shortage of
directly usable experience data, the quantification resorts to expert judgment. The paper presents the
application of a Bayesian aggregation model (Podofillini and Dang, 2013) to the judgments collected during
elicitation sessions with domain experts. To avoid direct elicitation of probability values, the experts
are asked to provide their judgments on a qualitative scale. The judgments are then interpreted as
information on the order of magnitude of the error likelihood and aggregated under the Bayesian scheme.
The paper presents the results of the aggregation. The application shows the ability of the aggregation
approach to formally represent the variability of the experts’ estimates.

Beyond the radiotherapy domain, the work presented in this paper is relevant to the various recent efforts
to extend HRA methods beyond their most typical application, i.e. nuclear power plant operation. Lack of
relevant data is a major issue for these novel applications (Bye et al. 2017, NASA 2012, Gibson 2012,
Mkrtchyan et al. 2015, NUREG 2016), and methods to elicit expert judgment in a formal and defendable way
are needed, along with specific data collection initiatives.

The paper is organized as follows. Section 2 provides the background on the HRA method under development,
for which probability values are sought in this paper. Section 3 presents the design of the elicitation
sessions and the concepts underlying the Bayesian approach for processing and aggregation of the
judgments. Section 4 presents the application to two Decision Trees (DTs) that are part of the framework
of the HRA method under development. Concluding remarks close the paper.


2 BACKGROUND INFORMATION 


The framework for the HRA method consists of eighteen decision trees (DTs), one for each failure mode
corresponding to a different Generic Task Type (GTT, Table 1). The concept behind the GTTs is taken from
the Human Error Assessment and Reduction Technique (HEART, Williams 2017), and is intended to define a set
of task types, each with similar characteristics in terms of the factors influencing performance and the
corresponding failure probabilities. The definition of the GTTs and of the influencing factors is based on
the GTT-Performance Influencing Factor (PIF) structures developed in Pandya et al. (2017): these
structures link each GTT to the set of PIFs that influence the failure probability. They have been
developed via a systematic and traceable process which, for each GTT, progressively identifies the
involved cognitive functions, their failure modes and causes, failure mechanisms and PIFs. The DT
framework is well suited to represent the cause-based influences on failures identified by the

Table 1. Set of Generic Task Types (GTTs) and corre- 
sponding failure modes identified in Pandya et al. (2017). 
DTs are developed for each failure mode of the GTTs. 


# GTT Failure mode 


Patient information 
incorrectly matched 

Identification check not 
performed (decision based) 

Failure to execute desired 
action 

Deviation from requirement 
not recognized 

Inappropriate understanding 
of underlying principles 

Check not performed 
(decision based) 

Execute desired action 
incorrectly 

Failure to execute desired 
action 

Coordination failure 

Misinterpretation of data 

Execute desired action 
incorrectly 

Mismatch or inconsistency 
not recognized 

Execute desired action 


1 Identification of 
patient or patient 
related items 


2 Quality Check 


3 Complex 
interaction with 
software or tool 


4 Simple interaction 


with software incorrectly 
or tool Failure to execute desired 
action 

5 Iterative Misinterpretation of 
determination information 
of optimum Inappropriate decision on 
parameters strategy selection 

6 Verbal Communication failure 
communication Not communicated 


(decision based) 


GTT-PIF structures: the DTs identify the causes possibly leading to the GTT failure. As in other HRA
methods (e.g. NUREG 2016, Moieni et al. 1994), each decision tree addresses a failure mode, with branching
points representing the effects of PIFs. Two examples of DTs are presented in Figure 1. The decision trees
are built from eight branching points, e.g. “Problematic interface”, “Information content unclear”, “Low
vigilance due to expectations”. As shown in Figure 1, each DT includes a subset of the eight branching
points, three or four in most cases. The same branching point heading may appear across different DTs,
e.g. “Problematic interface” in Figure 1; however, the influence of the branch on the failure probability
is not necessarily the same. This aspect is revisited in the results in Section 4. Each branching point is
specified in terms of negative conditions: if any of the negative conditions is verified, then the lower
branch applies. Example negative conditions are given in Table 2. The development of the DTs from the
GTT-PIF framework and the negative conditions for each branching point are outside the scope of the
present paper and will be presented in a separate publication (Pandya et al., working paper).

To assess the failure probability of a specific radiotherapy task, an analyst would select the applicable
DTs based on the relevant type of task and failure mode. Then, for each branching point, the analyst would
select the appropriate branch based on the negative conditions proposed for that branching point, as in
other HRA methods involving DTs, e.g. NUREG (2016), Moieni et al. (1994).

The present paper focuses on the quantification of the DTs, i.e. on the assessment of the failure
probability corresponding to each path defined by the combination of the branching points.
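
The branch selection rule described above (the lower branch applies if any negative condition of a branching point is verified) can be sketched as follows. The function names, the ordering of the branching points and the binary numbering of the paths are assumptions made for this illustration; the numbering is chosen only because it maps single-branch effects to paths 1, 2 and 4, and it is not the authors' implementation.

```python
from typing import Dict, List


def select_path(branch_points: List[str],
                negative_conditions: Dict[str, List[bool]]) -> List[str]:
    """Return 'lower' for a branching point if any of its negative conditions holds."""
    path = []
    for name in branch_points:
        verified = any(negative_conditions.get(name, []))
        path.append("lower" if verified else "upper")
    return path


def path_index(path: List[str]) -> int:
    """Binary path number: 0 is the nominal path (all upper branches)."""
    index = 0
    for branch in path:
        index = index * 2 + (1 if branch == "lower" else 0)
    return index


# Hypothetical ordering of three branching points and analyst assessments.
branch_points = [
    "Problematic interface",
    "Distractions, interruptions and excessive workload",
    "Low vigilance due to expectation",
]
assessment = {
    "Problematic interface": [False, False],
    "Distractions, interruptions and excessive workload": [False, True],
    "Low vigilance due to expectation": [False],
}

path = select_path(branch_points, assessment)
print(path, path_index(path))
```

With three binary branching points this numbering gives eight paths, the nominal one plus seven others, and the paths with a single lower branch receive the indices 1, 2 and 4.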


3 EXPERT JUDGMENT ELICITATION 
AND AGGREGATION APPROACH 


3.1 Expert judgment elicitation 


As mentioned in the Introduction, due to the lack 
of relevant data, quantification is made via expert 
judgment. In particular, the expert elicitation ses- 
sions were designed with two aims. First, to sup- 
port the identification of the negative conditions 
underlying each branch point. Second, to assess 
the impact of each branching point on the failure 
probability. Only the effects of single branch points 
were addressed (i.e. determining failure probabili- 
ties 1, 2, and 4 in Figure 1, top part). The combina- 
tion effects will be addressed in future work. 
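
To make the scope of the elicitation concrete, the sketch below enumerates the paths of a decision tree with three binary branching points. It assumes, purely for illustration, that paths are numbered by reading the branch outcomes as a binary number (0 = upper branch, 1 = lower branch, i.e. a negative condition applies); this numbering convention and the function names are assumptions, not part of the method. Under that assumption, the paths on which a single branching point is negative are those with indices 1, 2 and 4.

```python
from itertools import product


def enumerate_paths(n_branch_points):
    """Map each path index to its tuple of branch outcomes (0 or 1).

    1 means the lower branch is taken (a negative condition applies).
    The index reads the outcomes as a binary number; this is an assumed
    numbering convention used only for illustration.
    """
    paths = {}
    for outcomes in product((0, 1), repeat=n_branch_points):
        index = int("".join(str(flag) for flag in outcomes), 2)
        paths[index] = outcomes
    return paths


def single_branch_paths(n_branch_points):
    """Indices of paths on which exactly one branching point is negative."""
    paths = enumerate_paths(n_branch_points)
    return sorted(i for i, flags in paths.items() if sum(flags) == 1)


if __name__ == "__main__":
    print(len(enumerate_paths(3)))   # total number of paths, nominal included
    print(single_branch_paths(3))    # paths with a single negative branch
```

With three branching points there are eight paths in total (the nominal one plus seven others), so the single-branch elicitation covers three of the seven non-nominal paths.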

Figure 1. Two examples of decision trees. Above: GTT “Identification of
patient and patient related items”, failure mode “Patient information
incorrectly matched”, with branching points “Distractions/Interruptions
and excessive workload”, “Problematic interface” and “Low vigilance due
to expectation”. Below: GTT “Complex interaction with software or tool”,
failure mode “Misinterpretation of data”, with branching points
“Problematic interface”, “Tumor complexity”, “Lack of training or
experience”, “Resource unavailable” and “Time pressure”. The paths of
each tree end in a nominal failure probability and numbered failure
probabilities. The focus of the present paper is on quantification of the
failure probabilities at each tree branch (only single branch effects
elicited).


Table 2. Examples of negative conditions for two branching points in two
different DTs.

| DT | Branch point | Negative conditions |
|---|---|---|
| GTT: Identification of patient or patient related items; Failure mode: Patient information incorrectly matched | Problematic interface | The written values look alike (e.g. 111, 117); The value on the label or file is not easily readable; There is no ID number on the patient item |
| GTT: Complex interaction with software or tool; Failure mode: Misinterpretation of data | Lack of adequate training or experience | Lack of familiarity with the tumor case; Lack of training or experience on treating special tumor locations (e.g. close to multiple artefacts); Lack of experience or training to distinguish healthy and tumor tissues |


Six failure scenarios were developed for the elicitation; Table 3 gives
two examples. The idea is to elicit the impact of the branching point on
these failure scenarios, which would then be representative of the
overall GTT. The selection aimed at addressing the largest set of GTT
failure modes, as well as prioritizing failure scenarios with the most
critical consequences on patient safety. As shown for the examples in
Table 3, each failure scenario is associated with a different GTT. As
again shown in Table 3 and by the negative conditions in Table 2, the
elicitation of the branching point impact was conducted on specific tasks
and situations. This was done to help experts contextualize their
judgments to the real tasks they perform and link their assessments to
their daily experience. Alternatively, judgments could have been elicited
directly for the GTTs and branching point categories. The former approach
was chosen so that experts would not need to deal with abstract
categories such as GTTs and the branch point labels. The focus of this
paper is on the part of the elicitation session aimed at eliciting the
impact of each branching point on the failure probability. The details of
the overall elicitation design and its results will be presented in a
different paper (Pandya et al., working paper).

The elicitation directly addressed only part of the DTs required for
quantification, i.e. six out of the eighteen from Table 1. However, some
branching points may be thought of as having very similar impact across
different DTs; therefore the results from the elicitation for one DT may
be used for others. It was assessed that the selected tasks may allow
quantifying about two thirds of the whole set of branching points (recall
that only single branch point effects are considered here). Future work
may address the quantification of the remaining DTs and develop an
approach to address multiple branch points as well.

Twelve experts were interviewed: medical physicists, medical doctors,
dosimetrists and radiation technologists. Each expert dealt with tasks
that are part of his/her daily job. At most three tasks were elicited per
expert. Each expert took part in the exercise alone. For each of the
assigned tasks and each of the negative conditions corresponding to the
branching points, the experts were asked to assess the impact of the
negative condition on the failure probability when performing the task.
The impact is elicited on a qualitative scale (Table 4), to avoid the
known shortcomings of directly eliciting probability values, see e.g.
Meyer and Booker (2001), Tversky and Kahneman (1974).


3.2 Aggregation of expert assessments 


The expert assessments are processed as follows. The qualitative scale in
Table 4 is first anchored to quantitative values, as shown in Table 5.
The basis for the anchoring is the scale presented as part of the
ATHEANA human reliability analysis method (NUREG 2007). Note that the
values on the scale are to be interpreted as reference orders of
magnitude for the failure probability value. The value representative for
low impact (1e-3) is confirmed in the studies by Wahi et al. (2008) and
Salinas et al. (2013), from which it can be inferred that nominal error
rates in patient identification and data entry in healthcare lie around
1e-3 and 3e-3. These are interpreted as lower bounds for error
probabilities for the sector. Low impact of the branching point is not
expected to change the order of magnitude of the probability, so that the
reference lower bound value still remains in the same order of magnitude.


Table 3. Example of failure situations used to elicit the impact of the
branch points on the failure probability.

| Failure situation | Failure mode | Generic task type |
|---|---|---|
| Failure to identify correct ID from control document on the bite-block, couch or file etc. such that incorrect item is picked up | Patient information incorrectly matched | Identification of patient or patient related items |
| Draw suboptimal (incorrect or incomplete) contours around volumes of interest for every slice due to misunderstanding of the data | Misinterpretation of data | Complex interaction with software or tool |


Table 4. Qualitative scale used to elicit impact of negative conditions
on the personnel tasks.

| Impact | Descriptor | Meaning |
|---|---|---|
| Low impact | Failure is not expected to happen, although I see how it could happen. | Given the negative condition, the desired task is still so easy that it is inconceivable that any personnel would fail if they were to experience this condition. |
| Moderate impact | Failures happen occasionally/sometimes with such conditions | Given the negative condition, the desired task becomes moderately difficult, so that it is possible that personnel would occasionally/sometimes fail if they were to experience this condition. |
| High impact | Failures happen often with such conditions | Given the negative condition, the desired task becomes highly difficult, so that it is expected that personnel would often fail if they were to experience this condition. |
| Extreme impact | Failure is almost unavoidable | Failure is almost unavoidable. Almost all personnel would not be able to perform the desired task. |


Table 5. Anchoring of the qualitative impact scale to probability values
(adapted from NUREG 2007).

| Impact | Order of magnitude of failure probability |
|---|---|
| Low | 1e-3 |
| Moderate | 1e-2 |
| High | 1e-1 |
| Extreme | 1 |

The scale allows converting each assessment by 
the experts into a statement on the order of mag- 
nitude where the probability value would lie. It is 
interpreted as evidence of the relevant order of 
magnitude, and used to update the belief on that 
quantity in a Bayesian framework. 

The process for aggregation of the judgments comprises two steps. First,
for each negative condition, the judgments by the experts are aggregated:
a distribution of the applicable probability for each condition is
obtained. Then, these distributions are themselves aggregated into the
final distribution of the corresponding branch. The aggregation approach
is based on the Bayesian model presented in Podofillini & Dang (2013).
The model represents the human error probability as an inherently
variable quantity, resulting from the inherent variability of people's
performance as well as of the specific manifestations of the type of
tasks and of the influencing factors. More specifically, the combination
of GTTs and branch point conditions envelops specific tasks and specific
performance conditions that are assumed to be characterized by inherently
different failure probability values. The elicitation carried out in this
work addressed specific manifestations of the combination (see Table 3
and Table 2): the Bayesian model is intended to consider the expert
assessments on these manifestations (a specific task affected by specific
negative performance conditions) and determine the original variability
distribution of interest. Mathematically, the failure probability is
assumed to be lognormally distributed, with unknown median to be
determined based on the expert input. The error factor (square root of
the ratio of the 95th to the 5th percentile) is assumed to be known,
equal to 3. The latter assumption of known error factor is not a
requirement of the approach, but largely simplifies the calculations and
decreases the amount of data required to be elicited. It is also a
typically used and accepted value in HRA. The prior distribution of the
median is assumed uniform over the four orders of magnitude in Table 5
(all impact levels are equally likely).
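
As a small worked illustration of the lognormal assumption, the sketch below converts a median and an error factor into the 5th and 95th percentiles. It relies only on the standard property that, for a lognormal distribution, the 95th percentile equals the median times the error factor and the 5th percentile equals the median divided by it. The function names are illustrative and the example median is hypothetical; note also that for medians close to 1 the upper percentile of an untruncated lognormal exceeds 1, which this sketch does not handle.

```python
import math
from statistics import NormalDist

# Standard normal quantile at 95% (about 1.645).
Z95 = NormalDist().inv_cdf(0.95)


def lognormal_sigma(error_factor):
    """Log-scale standard deviation implied by EF = sqrt(p95 / p05)."""
    return math.log(error_factor) / Z95


def lognormal_percentile(median, error_factor, p):
    """p-th quantile (0 < p < 1) of a lognormal with given median and EF."""
    sigma = lognormal_sigma(error_factor)
    return median * math.exp(sigma * NormalDist().inv_cdf(p))


if __name__ == "__main__":
    median, ef = 1e-2, 3.0   # hypothetical median; EF = 3 as assumed in the text
    print(lognormal_percentile(median, ef, 0.05))
    print(lognormal_percentile(median, ef, 0.95))
```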

For the first part of the aggregation process, the expert assessments are
used to update the degree of belief on the correct order of magnitude for
the median of the probability distribution. The model also requires
assumptions on the confidence that the experts would be able to provide
the correct value of the probability. The confidence is expressed in
terms of a conditional probability that, given that the real order of
magnitude of the probability is one of the four in Table 5, the experts
would assess the correct one or be off by one or more orders of
magnitude. This conditional probability can be defined to model biases
and dependence across experts, provided that adequate information on the
distribution is available (these are not modeled in the present work). In
this work, it is assumed that experts have about 80% probability of
providing the correct order of magnitude, 10% of being one order of
magnitude off, and 5% of being two or more orders of magnitude off. The
exact values of these probabilities depend on the position of the level
with respect to the lower and upper bounds of the scale, since they are
normalized to form a probability distribution. These values have been
assumed by the authors of the paper; they appear to represent reasonable
assumptions on the ability of the experts to provide correct estimates in
this context. It is in any case important to mention that when more than
a few experts are available (say five or more), the specific assumptions
on the confidence in each expert no longer play a significant role in the
final probability distribution. The output of this step is a distribution
of the degree of belief on which of the levels in Table 5 represents the
real value of the median of the probability distribution, for each
negative condition possibly affecting each branch point.
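
The first aggregation step can be sketched as a discrete Bayesian update over the four levels of Table 5. The code below is a simplified illustration, not the model of Podofillini & Dang (2013): it assumes that expert statements are conditionally independent given the true level, it ignores the lognormal variability layer, and it normalizes the 80% / 10% / 5% weights per true level, which is one possible reading of the normalization described above. All names and the example judgments are hypothetical.

```python
LEVELS = ["low", "moderate", "high", "extreme"]   # Table 5: 1e-3, 1e-2, 1e-1, 1
WEIGHTS = {0: 0.80, 1: 0.10}   # weight by distance in orders of magnitude
FAR_WEIGHT = 0.05              # two or more orders of magnitude off


def likelihood_row(true_idx):
    """P(expert states level j | true level), normalized over j (assumption)."""
    raw = [WEIGHTS.get(abs(true_idx - j), FAR_WEIGHT) for j in range(len(LEVELS))]
    total = sum(raw)
    return [w / total for w in raw]


def posterior(judgments, prior=None):
    """Degree of belief over LEVELS after a list of expert judgments."""
    n = len(LEVELS)
    belief = list(prior) if prior is not None else [1.0 / n] * n   # uniform prior
    for judgment in judgments:
        j = LEVELS.index(judgment)
        belief = [b * likelihood_row(i)[j] for i, b in enumerate(belief)]
        total = sum(belief)
        belief = [b / total for b in belief]
    return belief


if __name__ == "__main__":
    # Hypothetical judgments from three experts for one negative condition.
    for level, b in zip(LEVELS, posterior(["low", "low", "moderate"])):
        print(f"{level:9s} {b:.3f}")
```

In this simplified form, agreement among experts sharpens the belief on one level, while disagreement spreads it over several levels.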

The second part of the aggregation entails combining the degrees of
belief obtained for each negative condition underlying each branch point.
As the negative conditions are assumed equally likely, the final
distribution is simply obtained as the average distribution across the
negative conditions. In particular, for each of the levels in Table 5,
the final degree of belief is the average degree of belief across the
corresponding negative conditions. Applications of the approach are
presented in the next section.
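
The second step amounts to an equal-weight average of the per-condition distributions. A minimal sketch follows; the function name and the numbers in the example are hypothetical and serve only to show the arithmetic.

```python
def average_beliefs(distributions):
    """Equal-weight average of degree-of-belief vectors, level by level."""
    if not distributions:
        raise ValueError("at least one distribution is required")
    n_levels = len(distributions[0])
    if any(len(d) != n_levels for d in distributions):
        raise ValueError("all distributions must cover the same levels")
    k = len(distributions)
    return [sum(d[i] for d in distributions) / k for i in range(n_levels)]


if __name__ == "__main__":
    # Hypothetical beliefs over (low, moderate, high, extreme) for three conditions.
    per_condition = [
        [0.70, 0.30, 0.00, 0.00],
        [0.90, 0.10, 0.00, 0.00],
        [0.30, 0.40, 0.30, 0.00],
    ]
    print(average_beliefs(per_condition))
```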


4 AGGREGATION OF EXPERT 
ASSESSMENT: RESULTS AND 
DISCUSSION 


This paper presents the results obtained for two DTs:


GTT “Identification of patient and patient related items”, failure mode
“Patient information incorrectly matched”;

GTT “Complex interaction with software or tool”, failure mode
“Misinterpretation of data”.


These are the two DTs shown in Figure 1. The results obtained from the
whole elicitation are planned to be presented in Pandya et al. (working
paper).

The first part of this Section presents an overview of the aggregated
results for the two DTs. The aim is to discuss how the quantitative
results relate to the justification provided by the experts on their
assessments. In other words, the goal is to check whether the different
values of error probability are reflected in corresponding differences in
the assessments by the experts. The second part of the Section provides
details on how the expert assessments are aggregated.

Figure 2 shows the aggregated results from the expert assessments. The
largest impact on failure probability corresponds to the branching point
“Lack of training or experience” affecting GTT “Complex interaction with
software/tool”. This is a complex task, related to defining an optimal
therapy plan and requiring knowledge and expertise. As shown by the
expert assessments, inadequacies in this respect can have high impact on
the failure probability. Indeed, the resulting median probability is
around 0.01, corresponding to the “high” impact level on the adopted
scale. Branching point “Time pressure” was also assessed among the most
influential ones: it was felt that the need to complete the task with
urgency would highly impact the quality of the therapy plan. On the other
hand, two branching points were assessed to have generally low impact. In
particular, interface issues (branching point “Problematic interface”)
were not felt to affect performance much when identifying patients:
identification of patients is made with diverse means; besides checking
the patient ID, identification is checked verbally (calling the patient
name) and by the patient picture. Additionally, the interface to deal
with is extremely simple, so that there is little possibility for
confusion. Also, the complexity in shape, size and location of the tumor
was not felt to increase much the probability of errors in the
development of the therapy plan. Typically, complex tumor cases are
discussed in larger groups and the treatment details are thoroughly
defined.

It is interesting to see in Figure 2 how the same type of branching point
may affect GTTs differently. For example, the assessed probability for
“Problematic interface” affecting GTT “Complex interaction with
software/tool” is more than one order of magnitude larger than when
affecting GTT “Identification of patient or patient-related items”.
Again, this is the result of the expert opinions on the importance of the
respective influences. The reasoning underlying the low impact, according
to the experts, of the branch point on the latter GTT has already been
discussed. On the other hand, the former GTT involves complex
interactions with multiple software interfaces, dialog boxes, figures,
etc.: interface issues for this task were considered to have important
effects on the failure probability.

The length of the error bars reflects differences 
in the expert assessments both for each negative 
condition and across the different conditions. 
The larger the bar, the larger the differences. This 
aspect will be returned to later in this Section. 

Figures 3 and 4 show how the expert assessments are progressively
processed to obtain the final probability distributions presented in
Figure 2. The left side of the figures gives the assessments of the
experts provided on the scale for each of the negative conditions. The
middle shows the distribution results aggregated across the experts for
each negative condition. The right side gives the final distribution,
aggregating across the conditions. Specifically, Figure 3 addresses the
effect of branching point “Problematic interface” on GTT “Identification
of patient or patient-related items”. It is interesting to see in the
figure how different assessments from the experts result in different
distributions. For the first negative condition, three experts provided
the assessment of “low” (thus corresponding to an error probability of
about 0.001) and two of “moderate”. Correspondingly, the aggregated
distribution in the middle of Figure 3 presents larger degree of belief
for the latter level on the scale compared to the former. Degrees of
belief for the other levels are in practice negligible. For the second
condition, there is strong agreement for low impact: the peak in the
degree of belief for this value is accordingly higher than in the
previous case. A very different situation is present for the last
condition, where the three experts provided three different assessments.
The resulting spread in the degree of belief for this condition is
evident.


Figure 2. Aggregated results for the two considered GTTs (“Identification
of patient or patient related items” and “Complex interaction with
software/tool”); vertical axis: human error probability. Symbols identify
the median failure probability affected by each branch point (acting one
at a time, presented on the right of the figure); error bars show the 5th
and the 95th percentiles. Branch points in the legend: Problematic
interface, Expectations, Distractions/Interruptions and excessive
workload, Tumor complexity, Lack of training or experience, Time
pressure, Resources unavailable.

Another interesting case is presented in Figure 4, related to the effect
of “Lack of adequate training or experience” for the GTT “Complex
interaction with software/tool”. There is general consistency across the
experts on the effect of each condition, as reflected in the
distributions in the middle part of the figure. The large span in the
expert assessments, from “low” to “extreme”, then results in the large
spread in the final aggregated distribution on the right.


Figure 3. Processing and aggregation of judgments (GTT “Identification of
patient and patient related items”, failure mode “Patient information
incorrectly matched”). Left: judgements from experts; middle:
expert-aggregated posterior distribution of median HEP for each
condition; right: posterior probability distribution of median HEP for
the branch point. Legend: L-Low, M-Moderate, H-High, E-Extreme;
1) Expert did not provide a judgment.


Figure 4. Processing and aggregation of judgments (GTT “Complex
interaction with software or tool”, failure mode “Misinterpretation of
data”). Left: judgements from experts; middle: expert-aggregated
posterior distribution of median HEP for each condition; right: posterior
probability distribution of median HEP for the branch point. Legend:
L-Low, M-Moderate, H-High, E-Extreme; 1) Expert did not provide a
judgment.


5 CONCLUSIONS


The paper has presented the work performed to quantify the human failure
probabilities to be used as input to a novel HRA method. Elicitation
sessions were designed with the following two main features. First,
information on probabilities is asked of experts on a qualitative scale,
with the goal of getting evidence on the order of magnitude for the
probability. Second, specific situations are presented to the expert,
i.e. specific failure scenarios influenced by specific negative
conditions. The latter feature was incorporated so that the experts would
not need to deal with abstract categories such as task types and
influencing factors.

The expert statements are processed and aggregated to determine degrees
of belief on the correct values of the failure probability. The latter is
assumed to be an inherently variable quantity, so that the main parameter
of its distribution is the subject of the elicitation. The Bayesian model
used to aggregate the assessments was found to represent well the
differences in the experts' statements, providing a credible approach to
process the expert input.

As a next step of the work, comparison of the obtained values with values
from existing HRA methods is envisioned. Although HRA methods are
sector-specific, some of the underlying data can be thought of as
general, e.g. data regarding dealing with indicators or simple execution
tasks. This comparison may provide some validation of the elicitation
process.

From a broader perspective, future work will apply the developed HRA
method to hypothetical accident scenarios at the institute’s Center for
Proton Therapy.


ABSTRACT: This paper proposes an integrated model for maintenance scheduling of parallel systems
whose failures are detected by inspections. A common characteristic of such systems is that the system
failures are detected only by inspections and the failure of a component may not cause its system to fail.
As such, the failure may not be immediately detected and the random (disruption) time at which the
number of failed components reaches a certain predefined number d may therefore be unknown. For
such systems, scheduling a maintenance policy is a difficult task, which is tackled in this paper. The main
issue considered here is to obtain an estimate of the disruption time on the basis of inspection point process
observations within the framework of the filtering theorem. The paper develops a unified cost structure to jointly
optimise inspection frequency and replacement time for the system when the lifetime distribution of a
component follows the Weibull distribution. Numerical results are provided to show the application of
the proposed model. In addition, a sensitivity analysis is performed to examine the effect of maintenance
parameters on the model.


1 INTRODUCTION

This paper proposes an approach to the joint determination of optimal
inspection and replacement policies for m-component parallel systems
subject to non-self announcing failures. The interest in such systems is
natural as they provide a redundant approach to improving system
reliability and availability. Nuclear reactor safety systems, emergency
core cooling systems, fire detectors, and protective devices are good
examples of parallel systems used in the real world. A common feature of
such systems is that the failure of a component may not cause its system
to fail and the system failures are detected only by inspections. As
such, the failure may not be immediately detected and can be regarded as
a hidden failure. Consequently, the number of failed components may be
unknown at a given time. This raises a thought-provoking question: how
can the failure time of $d$ ($d = 1, 2, \ldots, m-1$) components (called
the disruption time), the reliability and the maintenance cost be
estimated and analyzed to handle such a lack of data while providing
effective decision making? This paper attempts to answer this question.
The approach depends on the identification of the disruption time based
on the inspection point process observations. The estimated disruption
time contributes to scheduling a maintenance policy for an m-component
parallel system. More generally, given partial information, the approach
explored here can deal with two basic problems: how to inspect, and when
to stop operating the system and carry out a replacement, in order to
detect the system failure and minimize some maintenance cost.

Although some maintenance models consider joint inspection and
maintenance policies for systems whose states are detected only by
inspections (He, Maillart, & Prokopyev 2015, Tambe, Mohite, & Kulkarni
2013, Tsaia, Sheu, & Zhange 2017), only a few studies handle such a lack
of data while providing effective decision making. For instance, using
the filtering theorem argument, Ahmadi & Wu (2017) propose a novel
technique to estimate the disruption time, aiming at revealing the true
state of the system. The technique also helps to deliver a warning, or an
alert, on approaching a disruption. This alert is the signal that an
inspection or a preventive maintenance action has to be performed. As
such, the proposed technique augments the detection of failure. Further
advantages of their model that enhance its applicability include modeling
inspection through the modulated Poisson process and introducing a
state-dependent cost structure. Both modeling approaches make the model
superior to those (Berrade & Scarf 2012, He, Maillart, & Prokopyev 2015,
Liu, Wang, Peng, & Zhao 2015, Bjarnason & Teghipour 2016, Babishin &
Teghipour 2016, Tsaia, Sheu, & Zhange 2017) in which the inspection
frequency and/or cost parameters do not respond to the variation of the
system state. The structure also allows aperiodic inspection, which is
often more useful and realistic than the periodic policy, since it is
more adaptive to deteriorating systems and typically leads to policies
with lower costs.

The approach in this paper differs through the 
use of a general degradation model motivated by 
the extension of the model in the earlier paper by 
Ahmadi & Wu (2017). Furthermore, it encom- 
passes and examines some characteristics which 
have not been addressed or studied in isolation. 


2 ASSUMPTION AND MODEL 
DEVELOPMENT 


2.1 Model assumptions 


The following assumptions are made. (a) The parallel system consists of
$m$ components. (b) The failure of the system can only be detected by
inspections (“hidden” or “non-self announcing” failures). (c) The
inspection intensity process is assumed to follow a modulated Poisson
process. (d) The only available information is given by the inspection
point process observations (observation filtration). (e) The system is
replaced at periodic times $\{t_r, 2t_r, \ldots\}$. (f) Inspections do
not impact on the failure characteristics of the system.


2.2 Model development 


2.2.1 Modelling degradation 

Here we describe a stochastic model of degradation. For this discussion,
we assume that all random variables are defined on a complete probability
space $(\Omega, \mathcal{F}, P)$. Consider a multicomponent parallel
system consisting of $m$ components whose lifetimes are independent and
identically distributed random variables. The system is subject to random
failure which can only be detected through inspection. The system state
is characterized by two unobservable states: a normal state and a
degraded state. The transition time from the normal state to the degraded
state, called the “disruption time”, is defined as the first time the
total number of failed components reaches a predetermined threshold $d$:

$$\tau_d = \inf\{t : Y(t) = d\}, \quad d = 1, 2, \ldots, m-1, \tag{1}$$

where $\tau_d$ is the disruption time and $Y(t)$ is a stochastic process
counting the total number of failed components up to time $t$. Denote

$$X_d(t) = \begin{cases} 0, & \text{if } \tau_d > t, \\ 1, & \text{if } \tau_d \le t. \end{cases} \tag{2}$$

From equations (1) and (2), one can see that the disruption time at which
the system state, $X_d(t)$, jumps from 0 to 1 depends on the threshold
value $d$: a smaller value of $d$ may result in earlier disruption.
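
Equations (1) and (2) can be illustrated with a short simulation: since $Y(t)$ counts failed components, the disruption time is the $d$-th smallest of the $m$ component lifetimes. The sketch below assumes Weibull lifetimes, as stated in the abstract; the function names and the parameter values are hypothetical.

```python
import random


def sample_lifetimes(m, scale, shape, rng):
    """Draw m i.i.d. Weibull component lifetimes."""
    return [rng.weibullvariate(scale, shape) for _ in range(m)]


def disruption_time(lifetimes, d):
    """First time at which d components have failed, cf. equation (1)."""
    if not 1 <= d <= len(lifetimes) - 1:
        raise ValueError("d must satisfy 1 <= d <= m - 1")
    return sorted(lifetimes)[d - 1]


def state(tau_d, t):
    """State indicator of equation (2): 0 before disruption, 1 from tau_d on."""
    return 1 if tau_d <= t else 0


if __name__ == "__main__":
    rng = random.Random(0)
    lifetimes = sample_lifetimes(m=5, scale=10.0, shape=2.0, rng=rng)
    for d in (1, 2, 3, 4):
        print(d, round(disruption_time(lifetimes, d), 3))
```

For a fixed set of lifetimes the printed disruption times are non-decreasing in $d$, which is the monotonicity noted after equation (2).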


2.2.2 Modelling inspections 

The inspection modeling approach to detect the system failure is similar
to that of Ahmadi & Newby (2011) and Ahmadi & Wu (2017), assuming that
inter-inspection times conform to a modulated Poisson process.
Specifically, let $N(t)$ be a modulated Poisson process such that $N(t)$,
with the associated time points of inspections $T_1 < T_2 < \cdots$,
counts the total number of arrivals up to time $t$. In other words,
according to Aven & Jensen (1998), $N(t)$ admits a smooth semi-martingale
(SSM) representation with the $\mathbb{F}$-intensity $\lambda_t$ and the
$\mathbb{F}$-martingale $M_t$:

$$N(t) = \int_0^t \lambda_s \, ds + M_t,$$

where $t \in \mathbb{R}^+$, $M \in \mathcal{M}$ and $\mathcal{M}$ denotes
the class of martingales adapted to the filtration $\mathbb{F}$. As
noted, the inspection intensity process $\lambda_t$ is modulated by the
stochastic process $X_d(t)$ such that

$$\lambda_t = \lambda_{X_d(t)} = \lambda_0 + (\lambda_1 - \lambda_0) X_d(t), \quad \forall t \ge 0, \tag{3}$$

where $\lambda_i$, satisfying

$$\lambda_0 < \lambda_1 < \infty, \tag{4}$$

denotes the rate of arrivals when the state of $X_d(t)$ is $i$
($i = 0, 1$). Here, $X_d(t)$ influences and modulates the arrival rate of
the Poisson process. The reader is referred to Ozekici (1996) for more
details about the modulated Poisson process. Equations (1)-(3) indicate
that changes in the threshold value $d$ induce changes in both the
disruption time and the inspection intensity: as $d$ decreases, the state
process $X_d(t)$ jumps sooner from the normal state to the degraded
state, which implies an increase in the number of inspections. This
modeling technique not only allows the inspection intensity to respond to
the variation of the system state, but also ensures (see equation (4))
that the system, upon approaching disruption, is inspected more
frequently, which results in more certain detection of failure. Since the
model including $N(t)$ is driven by the unobservable state process
$X_d(t)$, the first main problem investigated here is to get an estimate
of the unobservable state $X_d(t)$ given the inspection point process
observations up to time $t$,
$\mathcal{F}_t^N = \sigma\{N(s) : 0 \le s \le t\}$, that is

$$\phi_d(t) = E[X_d(t) \mid \mathcal{F}_t^N] = P[\tau_d \le t \mid \mathcal{F}_t^N]. \tag{5}$$
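
To give a feel for the observations on which the estimate (5) is based, the following sketch simulates inspection times from the modulated Poisson process of equation (3) for a known disruption time: arrivals occur at rate $\lambda_0$ before $\tau_d$ and at rate $\lambda_1$ afterwards. The function name and all numerical values are hypothetical; in the model itself $\tau_d$ is not observed, so this is only a forward simulation, not the filtering solution.

```python
import random


def simulate_inspections(tau_d, lam0, lam1, horizon, rng):
    """Inspection times on [0, horizon): rate lam0 before tau_d, lam1 after."""
    if not 0.0 < lam0 < lam1:
        raise ValueError("expected 0 < lam0 < lam1, cf. condition (4)")
    times = []
    switch = min(tau_d, horizon)
    # Normal state: exponential gaps with rate lam0.
    t = rng.expovariate(lam0)
    while t < switch:
        times.append(t)
        t += rng.expovariate(lam0)
    # Degraded state: by memorylessness the clock restarts at tau_d with rate lam1.
    t = tau_d + rng.expovariate(lam1)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(lam1)
    return times


if __name__ == "__main__":
    rng = random.Random(1)
    times = simulate_inspections(tau_d=50.0, lam0=0.1, lam1=10.0, horizon=100.0, rng=rng)
    before = sum(1 for t in times if t < 50.0)
    print(before, len(times) - before)   # inspections before / after disruption
```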


The measure (5) not only contributes to detecting the disruption time
$\tau_d$ at which the state process $X_d(t)$ jumps from the normal state
to the degraded state (or, the estimate of components lifetime), but also
allows the estimation of the intensity process $\lambda_{X_d(t)}$ by
projection on the observed history. That is,

$$\hat{\lambda}_t = E[\lambda_{X_d(t)} \mid \mathcal{F}_t^N] = E[\lambda_0 + (\lambda_1 - \lambda_0) X_d(t) \mid \mathcal{F}_t^N].$$

The function $\hat{\lambda}_t$, reflecting the reaction of the
maintenance crew, is called the inspection alert function. Indeed, given
that the probability of the disruption detection at time $t$ is
$\phi_d(t)$, the inspection intensity that is at the discretion of the
maintenance crew should be
$\hat{\lambda}_t = \lambda_0 + (\lambda_1 - \lambda_0)\phi_d(t)$. The
situation may be regarded as a case of condition-based maintenance in the
sense that the function (5) provides some information (alert) regarding
the state of the disruption. Through the inspection alert function
$\hat{\lambda}_t$, the crew will use the information to perform
inspections in order to detect system failures. Intuitively, this alert
is the signal that an inspection has to be performed.


2.3 Modelling disruption


This section aims to provide a solution to the disruption alert function $\phi_d(t)$ within the framework of the filtering theorem (Bremaud 1981). The solution technique used here is similar to that of the models proposed by Ahmadi & Newby (2011).

2.3.1 Complete information-based disruption model

We have to detect the random time $\tau_d$ at which the inspection intensity $\lambda_t$ increases: $\lambda_0 \to \lambda_1$.


For this, let $F(t) = \int_0^t f(u)\,du$ be the cumulative distribution of the components' lifetime. Using the semi-martingale argument (Aven & Jensen 1998), one can note that the increasing right-continuous state process $X_d(t) = I(\tau_d \le t)$ with the associated maintenance parameter d admits the following semi-martingale representation:

$$X_d(t) = \int_0^t q_d(s)\,(1 - X_d(s))\,ds + m_t,$$

where $m_t$ is an $\mathcal{F}_t$-martingale and

$$q_d(t) = \begin{cases} \dfrac{f_d(t)}{1 - F_d(t)}, & \text{if } F_d(t) < 1; \\ 0, & \text{otherwise}, \end{cases}$$

denotes the transition rate of the state process X(t) from the normal state (0) to the degraded state (1) with the disruption distribution function

$$F_d(t) = P(\tau_d \le t) = \sum_{i=d}^{m} \binom{m}{i} F(t)^i \,(1 - F(t))^{m-i}.$$

Using the fact that the density function of the disruption time is

$$f_d(t) = \frac{m!}{(d-1)!\,(m-d)!}\, F(t)^{d-1}\,(1 - F(t))^{m-d} f(t),$$

the transition rate of the state process X(t) can be formulated as

$$q_d(t) = \frac{m!}{(d-1)!\,(m-d)!} \cdot \frac{q_0(t)\,(\exp(Q_0(t)) - 1)^{d-1}}{\sum_{i=0}^{d-1} \binom{m}{i} (\exp(Q_0(t)) - 1)^{i}}, \tag{6}$$

where $q_0(t) = f(t)/(1 - F(t))$ denotes the failure rate of the system and $Q_0(t) = \int_0^t q_0(s)\,ds$.


2.3.2 Partial information-based disruption model

Since the disruption time at which the process X(t) jumps from 0 to 1 is not directly observed, we use the filtering theorem framework (Bremaud 1981) to get an estimate of the state process $X_d(t) = I(\tau_d \le t)$ based on the observed history $\mathcal{F}_t^N$. More precisely, the problem is that of computing $\phi_d(t)$ given in (5) and getting an $\mathcal{F}_t^N$-adapted estimate of the inspection intensity process

$$\lambda_t = \lambda_{X_d(t)} = \lambda_0 + (\lambda_1 - \lambda_0) X_d(t), \tag{7}$$

that is to say, before the disruption time $\tau_d$, $\lambda_t = \lambda_0$, and after $\tau_d$, $\lambda_t = \lambda_1$.


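By projection on the observed history $\mathcal{F}_t^N$ one can show that

$$\phi_d(t) = \int_0^t \left[q_d(s) + (\lambda_1 - \lambda_0)\phi_d(s)\right](1 - \phi_d(s))\,ds + \sum_{n \ge 1} \frac{(\lambda_1 - \lambda_0)\,\phi_d(T_n-)\,(1 - \phi_d(T_n-))}{\lambda_0 + (\lambda_1 - \lambda_0)\,\phi_d(T_n-)}\, I(T_n \le t), \tag{8}$$

or, equivalently, between the jumps ($t \in [T_n, T_{n+1})$, $n \ge 0$),

$$\phi_d(t,n) = \phi_d(t)\, I(T_n \le t < T_{n+1}) = \phi_d(T_n) + \int_{T_n}^{t} \left(q_d(s) + \lambda\phi_d(s,n)\right)(1 - \phi_d(s,n))\,ds;$$

at the jumps,

$$\phi_d(T_n) = \frac{\lambda_1\,\phi_d(T_n-)}{\lambda_0 + \lambda\,\phi_d(T_n-)}, \tag{9}$$

where $\phi_d(T_n-)$ denotes the left limit of $\phi_d(\cdot)$ at time $T_n$, $\bar\phi(\cdot) = 1 - \phi(\cdot)$ and $\lambda = \lambda_1 - \lambda_0$. From (8) one can see that for $t \in [T_n, T_{n+1})$:
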
i. the sojourn time distribution in the normal state, $\bar\phi_d(t,n)$, is a decreasing function of the transition rate $q_d$ of the state process X(t) and of $\lambda$. The latter can be regarded as a measure whose scale reflects the reaction intensity of the maintenance crew to perform inspections upon switching of X(t) from the normal state to the degraded state.

ii. the disruption time distribution $\phi_d(t,n)$ satisfies the following differential equation with the initial condition (9):

$$\frac{d}{dt}\phi_d(t,n) = \left(q_d(t) + \lambda\phi_d(t,n)\right)(1 - \phi_d(t,n)) > 0, \tag{10}$$

where the positive derivative implies increasing trajectories of $\phi_d(t,n)$ between the jumps.

iii. Let $\lambda = 0$. This assumption intuitively implies that the maintenance crew (inspection intensity) does not respond to the variation of the state process $X_d(t): 0 \to 1$. From (ii) an explicit solution to the sojourn time distribution of X(t) can be given by

$$\bar\phi_d(t,n) = \exp\left(-\int_0^t q_d(s)\,ds\right).$$


From the above argument one can see that by projection on the observed history $\mathcal{F}_t^N$ for $t \in [T_n, T_{n+1})$ ($n \ge 0$) and the use of the disruption time distribution $\phi_d(t,n)$, an $\mathcal{F}_t^N$-adapted estimate of the inspection intensity (7) can be given by

$$\hat\lambda_t = E[\lambda_t \mid \mathcal{F}_t^N] = \lambda_0 + \lambda\phi_d(t,n). \tag{11}$$

To gain a better understanding of the impact of the disruption alert function $\phi_d(t,n)$ and the inspection alert function $\hat\lambda_t$ on both the performance of inspections and failures, let the measure of alertness on approaching a disruption, $\phi_d(t,n)$, tend to 1. From (11) we get $\hat\lambda_t \to \lambda_1$ ($\lambda_1 > \lambda_0$), which indicates more frequent inspections to prevent the system failure.
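The filtering recursion above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it integrates the differential equation (10) between inspections with a classical Runge-Kutta step and applies a jump update at an inspection time. The transition rate `q1` (taken as 1.5t for illustration), the function names and the form of the jump update used in `jump_update` are assumptions of this sketch.

```python
import math


def q1(t):
    """Assumed transition rate for illustration (q(t) = 1.5 t)."""
    return 1.5 * t


def drift(t, phi, q, lam):
    """Right-hand side of equation (10): (q(t) + lam * phi) * (1 - phi)."""
    return (q(t) + lam * phi) * (1.0 - phi)


def integrate_phi(phi0, t0, t1, q, lam, steps=1000):
    """Integrate equation (10) from t0 to t1 with the classical RK4 scheme."""
    h = (t1 - t0) / steps
    t, phi = t0, phi0
    for _ in range(steps):
        k1 = drift(t, phi, q, lam)
        k2 = drift(t + h / 2, phi + h * k1 / 2, q, lam)
        k3 = drift(t + h / 2, phi + h * k2 / 2, q, lam)
        k4 = drift(t + h, phi + h * k3, q, lam)
        phi += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return phi


def jump_update(phi_left, lam0, lam1):
    """Assumed update at an inspection time: lam1*phi / (lam0 + (lam1-lam0)*phi)."""
    return lam1 * phi_left / (lam0 + (lam1 - lam0) * phi_left)


if __name__ == "__main__":
    lam0, lam1 = 2.0, 3.0
    phi_before = integrate_phi(0.0, 0.0, 0.5, q1, lam1 - lam0)
    phi_after = jump_update(phi_before, lam0, lam1)
    print(phi_before, phi_after, lam0 + (lam1 - lam0) * phi_after)
```

With lam = 0 the sketch reduces to the explicit solution in item (iii), which gives a convenient sanity check for the integrator.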
2.4 Mean time to disruption


The measure $\phi_d(t,n)$ can contribute to the determination of the mean time to disruption $m_d$:

$$m_d = \int_0^\infty \bar\phi_d(t,n)\,dt.$$

The mean time to disruption provides some information regarding the disruption time at which X(t) jumps from the normal state to the degraded state. The crew will use this insight to perform maintenance in order to detect component failures in a timely manner.

From equations (8) and (9) one can note that the solution of the differential equation (10) based on the initial condition (9) depends on the estimation of the inspection times $T_n$ ($n \ge 1$). In Section 2.5 we devise an inspection scheduling function based on the $\mathcal{F}_t^N$-adapted inspection intensity. We will see that this scheduling function, which incorporates the disruption time distribution $\phi_d(t,n)$, enables us (i) to provide a systematic approach to the determination of non-periodic inspection times and (ii) to estimate the disruption time distribution $\phi_d(t,n)$ over the inter-inspection intervals $[T_n, T_{n+1})$ ($n \ge 1$).


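2.5 Scheduling inspections

Let $g_n(v)$ for $v \in [0, V_{n+1})$ be the distribution function of the (n + 1)th inter-inspection time $V_{n+1} = T_{n+1} - T_n$ ($n \ge 0$) adapted to the observed information $\mathcal{F}_t^N$. Then for $t \in [T_n, T_{n+1})$ we have

$$\bar g_n(v) = P[V_{n+1} \ge v \mid \mathcal{F}_{T_n}^N] = \exp\left(-\int_{T_n}^{T_n + v} \hat\lambda_s\,ds\right),$$

where $\bar g_n(v) = 1 - g_n(v)$. Since $\hat\lambda(t,n) = \lambda_0 + (\lambda_1 - \lambda_0)\phi_d(t,n)$, $t \in [T_n, T_{n+1})$, $\bar g_n(v)$ can be expressed as

$$\bar g_n(v) = \exp(-\lambda_0 v) \times \exp\left(-(\lambda_1 - \lambda_0)\int_{T_n}^{T_n + v} \phi_d(s,n)\,ds\right). \tag{12}$$

If $\mu_n = E[V_{n+1} \mid \mathcal{F}_{T_n}^N]$ (n = 0, 1, 2, ...) denotes the (n + 1)th expected time between inspections, then using the inter-inspection time distribution (12), an $\mathcal{F}_t^N$-adapted estimate of the inter-inspection times can be given by

$$\hat\mu_n = \int_0^\infty \bar g_n(v)\,dv. \tag{13}$$

This provides a sequence of inspection times $\eta_n = \sum_{i=0}^{n-1} \hat\mu_i$ (n = 1, 2, ...).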
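As a rough illustration of how (12) and (13) can be evaluated, the sketch below approximates the expected inter-inspection time by trapezoidal quadrature for a user-supplied alert function. The function name `expected_gap`, the truncation point `v_max` and the constant alert functions used in the example are assumptions made for this sketch only.

```python
import math


def expected_gap(phi, t_n, lam0, lam1, v_max=20.0, steps=20000):
    """Approximate mu_n = int_0^inf g_bar_n(v) dv, with g_bar_n as in (12).

    phi  : callable, alert function s -> phi(s) on [t_n, t_n + v_max]
    t_n  : time of the last inspection
    The infinite integral is truncated at v_max (an assumption of the sketch).
    """
    h = v_max / steps
    cum_phi = 0.0            # running integral of phi over [t_n, t_n + v]
    prev_phi = phi(t_n)
    prev_g = 1.0             # g_bar_n(0) = 1
    total = 0.0
    for i in range(1, steps + 1):
        v = i * h
        cur_phi = phi(t_n + v)
        cum_phi += 0.5 * h * (prev_phi + cur_phi)        # trapezoid for phi
        g = math.exp(-lam0 * v - (lam1 - lam0) * cum_phi)
        total += 0.5 * h * (prev_g + g)                  # trapezoid for g_bar
        prev_phi, prev_g = cur_phi, g
    return total


if __name__ == "__main__":
    # No alert at all: the gap is exponential with rate lam0.
    print(expected_gap(lambda s: 0.0, 0.0, 2.0, 3.0))
    # Full alert: the gap is exponential with rate lam1.
    print(expected_gap(lambda s: 1.0, 0.0, 2.0, 3.0))
```

The two limiting cases bracket the expected gap between 1/lam1 and 1/lam0, in line with the interpretation of the alert function as interpolating between the two inspection intensities.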


2.6 Example: Examining the model


To examine the response of the model to the threshold values d = 1, 2, consider a 3-component parallel system whose components' lifetime conforms to a Weibull distribution with the failure rate

$$q_0(t) = \frac{\beta}{\alpha}\left(\frac{t}{\alpha}\right)^{\beta - 1}. \tag{14}$$

In addition let $(\alpha, \beta) = (2, 2)$ and $(\lambda_0, \lambda_1) = (2, 3)$. Given known parameters, by using equation (6), an expression for the transition rate $q_d$ corresponding to d = 1, 2 can be given by

$$q_1(t) = 1.5t, \qquad q_2(t) = \frac{3t\,(\exp(0.25t^2) - 1)}{3\exp(0.25t^2) - 2}. \tag{15}$$

Figure 1. Inspection intensity $\hat\lambda_t$ for different d ∈ {1, 2}.

Figure 1 reveals that the inspection intensity $\hat\lambda_t$, expressed in terms of the sojourn time distribution $\bar\phi_d(t,n)$ emerging as the solution of the differential equation (10), strongly depends on the threshold value d: as d decreases to 1, inspections occur more frequently. In addition, Figure 1 indicates that the inspection intensity, as a measure of alertness, reflects the reaction of the maintenance crew, in the sense that it delivers an alert regarding a possibly approaching disruption and this alert is used to schedule inspections. Thus, the alert is a signal that an inspection has to be performed to detect a failure.
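The closed-form rates of this example can be cross-checked numerically. The sketch below assumes that the disruption time for threshold d is the d-th failure among m = 3 independent components with Weibull lifetime distribution $F(t) = 1 - \exp(-(t/\alpha)^\beta)$, and computes the hazard of that order statistic directly. All function names are illustrative.

```python
import math
from math import comb


def weibull_cdf(t, alpha=2.0, beta=2.0):
    """Component lifetime CDF with failure rate (beta/alpha) * (t/alpha)**(beta-1)."""
    return 1.0 - math.exp(-((t / alpha) ** beta))


def weibull_pdf(t, alpha=2.0, beta=2.0):
    """Component lifetime density: failure rate times survival function."""
    rate = (beta / alpha) * (t / alpha) ** (beta - 1)
    return rate * math.exp(-((t / alpha) ** beta))


def transition_rate(t, d, m=3, alpha=2.0, beta=2.0):
    """Hazard rate of the d-th failure among m i.i.d. components."""
    F = weibull_cdf(t, alpha, beta)
    f = weibull_pdf(t, alpha, beta)
    # Density of the d-th order statistic.
    f_d = d * comb(m, d) * F ** (d - 1) * (1.0 - F) ** (m - d) * f
    # Survival function: fewer than d components have failed by time t.
    s_d = sum(comb(m, i) * F ** i * (1.0 - F) ** (m - i) for i in range(d))
    return f_d / s_d


def q1_closed(t):
    """Closed form for d = 1 in the example."""
    return 1.5 * t


def q2_closed(t):
    """Closed form for d = 2 in the example."""
    e = math.exp(0.25 * t * t)
    return 3.0 * t * (e - 1.0) / (3.0 * e - 2.0)


if __name__ == "__main__":
    for t in (0.5, 1.0, 2.0):
        print(t, transition_rate(t, 1), q1_closed(t),
              transition_rate(t, 2), q2_closed(t))
```

Under these assumptions the order-statistic hazard and the closed-form expressions agree to numerical precision for both thresholds.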


3 COST MODEL 


This section aims to jointly determine an opti- 
mal replacement policy and an optimal inspec- 
tion frequency with respect to a state-dependent 
cost-reward model. The cost structure used here is 
similar to that of the model proposed by Ahmadi 
& Wu (2017). 
3.1 Average cycle costs


A cycle is comprised of a sequence of inspections that ends with a replacement scheduled at periodic times $\{t_r, 2t_r, \dots\}$. Repair and maintenance actions in a cycle incur costs that include: (i) inspection costs to detect the system failures, (ii) a penalty cost associated with undetected failures and (iii) a periodic replacement cost incurred after every $t_r$ units of time. More precisely, each inspection at time t incurs a time-dependent cost $c_t = ct$ (c > 0). Undetected failures within inter-inspection times generate a state-dependent penalty cost per unit time

$$C_{X_d(t)} = C_0 + (C_1 - C_0)\,X_d(t),$$

($C_1 > C_0$). This implies that as the system state shifts to the more degraded state, $X(t): 0 \to 1$, the penalty cost increases from $C_0$ to $C_1$. In addition, we assume that the replacement of the system in different states incurs different costs. In other words, the replacement of the system at time t is measured by a state-dependent cost

$$k_{X_d(t)} = k_0 + (k_1 - k_0)\,X_d(t),$$

($k_1 > k_0$). As the state process X(t) moves from the normal state ($X_d(t) = 0$) to the degraded state ($X_d(t) = 1$), the replacement cost increases from $k_0$ to $k_1$. With respect to the above cost structure, the total cost up to the periodic replacement time $t_r$, denoted by $C_d(t_r)$, can be expressed as

$$C_d(t_r) = \int_0^{t_r} c_s\,dN(s) + \int_0^{t_r} C_{X_d(s)}\,ds + k_{X_d(t_r)},$$

where the three terms are the inspection cost, the penalty cost and the replacement cost, respectively.

The last cost is incurred as the system is found in the state $X(t_r)$ at the replacement time $t_r$. Since $\lambda_t$ is an $\mathcal{F}_t$-intensity of N(t) and both $C_{X_d(t)}$ and $k_{X_d(t)}$ are $\mathcal{F}_t$-measurable, an $\mathcal{F}_t$-adapted total cost $C_d(t,n)$, $t \in [T_n, T_{n+1})$ ($n \ge 0$), can be expressed as

$$C_d(t,n) = E[C_d(t) \mid \mathcal{F}_t] = k_0 + \int_0^t c_d(s,n)\,ds,$$

where

$$c_d(t,n) = \left[C_0(t) + (C_1(t) - C_0(t))\,X_d(t)\right] + (k_1 - k_0)\,dX_d(t), \tag{16}$$

is the total cost rate and $C_i(t) = C_i + \lambda_i c_t$ for i = 0, 1.


3.2 Partial information-based cost model


Since the term $c_d(t,n)$ depends on the state indicator $X_d(t)$, which is not directly observed, a projection on the observation filtration $\mathcal{F}_t^N$ is needed. As described in Section 2.3.2, such a projection from the $\mathcal{F}$-level to the $\mathcal{F}^N$-level leads to the following expression:

$$\hat c_d(t,n) = \left[C_0(t) + (C_1(t) - C_0(t))\,\phi_d(t,n)\right] + (k_1 - k_0)\,\frac{d}{dt}\phi_d(t,n). \tag{17}$$

Thus, an $\mathcal{F}_t^N$-adapted estimate of the expected total cost can be given by

$$\hat C_d(t,n) = k_0 + \int_0^t \hat c_d(s,n)\,ds. \tag{18}$$

The integrand $\hat c_d(s,n)$ is the conditional expectation of the cost rate at time s given the observations up to time s. In addition, by plugging the derivative of $\phi_d(t,n)$ (see equation (10)) into equation (17), an $\mathcal{F}_t^N$-adapted estimate of the cost rate in terms of the sojourn time distribution $\bar\phi_d(t,n)$ can be given by

$$\hat c_d(t,n) = C_1(t) - (C_{10} + \lambda c_t)\,\bar\phi_d(t,n) + k_{10}\left(q_d(t) + \lambda\,(1 - \bar\phi_d(t,n))\right)\bar\phi_d(t,n),$$

where $k_{10} = k_1 - k_0$ and $C_{10} = C_1 - C_0$.

Continuing the example, we illustrate the evolution of the maintenance cost rate for both alert parameters d = 1, 2 (see Figure 2). One can note that a lower threshold value d causes an earlier alert on approaching a disruption, which makes inspections more frequent, increases the penalty cost and so incurs more maintenance cost.

Figure 2. $\mathcal{F}^N$-adapted cost rate for different d ∈ {1, 2}. Blue and black colors correspond to threshold values d = 1 and d = 2, respectively.

3.2.1 Long-run average cost rate

The main objective is to minimize the long-run average cost rate by optimizing the periodic replacement time $t_r$. To this end, let $\gamma_d(t_r, n)$ be the long-run average cost per unit time. Since the sequence of replacement times $\{t_r, 2t_r, \dots\}$ forms a regenerative process, the time between two consecutive replacements is a renewal cycle. Therefore, by the renewal reward theorem (Ross 1970), the long-run average cost rate is the average cost per cycle divided by the cycle length $t_r$, given by

$$\gamma_d(t_r, n) = \frac{\hat C_d(t_r, n)}{t_r},$$

with the expression for $\hat C_d(t_r, n)$ provided in (18). We set out to solve the optimization problem

$$t_r^* = \arg\min_{t_r \in \mathbb{R}_+} \{\gamma_d(t_r, n)\},$$

along with determining the optimal inspection frequency $n^*$ subject to the optimal replacement time $t_r^*$.
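The renewal-reward criterion lends itself to a simple numerical search. The sketch below is only an illustration of the idea of minimizing (cost per cycle) / (cycle length) over a grid of candidate replacement times. The helper `argmin_average_cost` and the quadratic `toy_cost` are hypothetical and do not reproduce the cost model (18).

```python
def argmin_average_cost(total_cost, t_min, t_max, steps=10000):
    """Grid search for the replacement time minimizing total_cost(t) / t.

    total_cost : callable giving the expected cost accumulated over a cycle
                 of length t (a stand-in for the expression in (18))
    Returns the pair (best replacement time, best average cost rate).
    """
    best_t, best_rate = None, float("inf")
    for i in range(steps + 1):
        t = t_min + (t_max - t_min) * i / steps
        rate = total_cost(t) / t      # renewal reward: cycle cost / cycle length
        if rate < best_rate:
            best_t, best_rate = t, rate
    return best_t, best_rate


def toy_cost(t):
    """Hypothetical cycle cost: fixed replacement cost 4 plus a running cost t**2."""
    return 4.0 + t * t


if __name__ == "__main__":
    t_star, rate_star = argmin_average_cost(toy_cost, 0.1, 5.0, steps=4900)
    print(t_star, rate_star)
```

For the toy cost the average rate is 4/t + t, whose minimizer can be found by hand, which makes it a convenient check before plugging in a model-based cycle cost.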


4 NUMERICAL RESULTS 


The aim of the proposed model is to derive a cost-optimal inspection and maintenance policy for a three-component parallel system subject to non-self-announcing failures. For this, let the components' lifetime be modelled by the Weibull distribution with the failure rate (14). The main numerical results are based on the known degradation parameters $(\alpha, \beta) = (2, 2)$ and the inspection parameters $(\lambda_0, \lambda_1) = (2, 3)$. The cost parameters are chosen as $(C_0, C_1) = (1, 3)$, $(k_0, k_1) = (3, 6)$ and c = 2.


4.1 Optimal maintenance policy 


To investigate the effect of the alert parameter d on the model, optimal solutions for different threshold values d ∈ {1, 2} are obtained (see Table 1 and Figure 2). As noted above, the alert parameter d reflects the attitude of the decision maker towards inspection: changes in d induce changes in the inspection intensity upon the disruption time at which the dth component of the system experiences failure. Given inspection parameters $(\lambda_0, \lambda_1) = (2, 3)$, the results reveal an insignificant impact of d on the optimal replacement time. On the other hand, the optimal maintenance cost and the optimal frequency of inspection change noticeably as the threshold varies (from 8.7575 to 6.7237 and from 4 to 1 inspections, respectively; see Table 1).

Table 1. Optimal parameters and mean time to disruption $m_d$ for different d ∈ {1, 2} given $\lambda_1 = 3$.

d   n*   m_1      t_r*     m_2      γ_d(t_r*, n*)
1   4    0.6316   0.8553   1.1593   8.7575
2   1    0.6316   0.9156   1.1593   6.7237

Table 2. Expected inspection times for different d ∈ {1, 2} given $\lambda_1 = 3$.

d   η_1     η_2    η_3     η_4     η_5     η_6
1   0.414   0.6    0.736   0.853   0.936   1.07
2   0.62    0.93   1.115   1.254   1.373   1.484

Subject to the results given in Table 1 and Table 2, the sequences of inspections and maintenance actions corresponding to the threshold values d = 1, 2 are respectively scheduled as follows:

• d = 1: starting from the normal state, inspections with intensity $\lambda_0 = 2$ are scheduled at the first two inspection times of Table 2 (0.414 and 0.6). Upon failure of the first component at the (disruption) time $m_1 = 0.6316$, inspections are considered more often ($\lambda_1 > \lambda_0$). Following disruption, inspections to detect the system failures are scheduled at the next two inspection times (0.736 and 0.853). To restrain the frequency of inspections and avoid a costly strategy, the model suggests the replacement of the system at $t_r^* = 0.8553$, before the disruption time $m_2 = 1.1593$ at which the second component of the system experiences failure. This process incurs the optimal cost $\gamma_1(t_r^*, n^*) = 8.7575$.

• d = 2: starting from the normal state, the first inspection is scheduled at 0.62, just before the first disruption at $m_1 = 0.6316$, and a planned replacement is recommended during the inter-disruption interval [0.6316, 1.1593] at $t_r^* = 0.9156$. This incurs the optimal cost $\gamma_2(t_r^*, n^*) = 6.7237$.


4.2 Attitude to maintenance 


The inspection model adapts itself to the decision maker's attitude to maintenance (the value of d) by moving the inspection parameter $\lambda_0 \to \lambda_1$. Increasing the threshold value d: 1 → 2 affects both the inspection intensity and the transition rate $q_d(t)$ of the state process from the normal state to the more degraded state (see equation (15)), which gives a higher value for the mean time to disruption $m_d$: 0.6316 → 1.1593. The higher threshold d reduces the frequency of inspection and the penalty cost, which reduces the overall maintenance cost. As illustrated by Table 1 and Figure 3, the threshold value d = 2 leads to a more effective strategy, as the disruption time $m_2 = 1.1593$ (failure time of the second component) is preceded by the replacement time $t_r^* = 0.9156$. On the other hand, the model based on the maintenance threshold d = 1 results in a costly strategy, which favors not only an early replacement but also more frequent inspections (see Table 1 and Table 2).

Table 3. Optimal parameters and mean time to disruption $m_d$ for different d ∈ {1, 2} given $\lambda_1 = 4$.

d   n*   m_1      t_r*     m_2      γ_d(t_r*, n*)
1   3    0.5978   0.7145   1.0808   9.4972
2   1    0.5978   0.8609   1.0808   6.8923


4.3 Level of maintenance 


The response of the model to the maintenance level is examined with inspection parameter $\lambda_1 \in \{3, 4\}$. Given $\lambda_1 = 4$, the optimal parameters, the mean time to disruption and the inspection times for both threshold values d ∈ {1, 2} are shown in Table 3 and Table 4. Also, the evolution of the average cost rate is illustrated by Figure 4. In both cases the optimal cost increases with $\lambda_1$, implying an increase in the amount of maintenance undertaken on the system. On the other hand, with increasing $\lambda_1$ (or $\lambda$), an earlier alert on a possibly approaching disruption is delivered. As discussed in Section 2.3.2, part (i), this causes a reduction in the sojourn time of X(t) in the normal state and hence a decreasing disruption time (see Table 3). In addition, given d = 1, with increasing inspection intensity $\lambda_1$: 3 → 4, inspections will be considered more often, but $t_r^*$ decreases to restrain the frequency of inspections. In the case d = 2, however, due to the insignificant effect of $\lambda_1$ on the inspection times, the optimal solutions $(n^*, t_r^*)$ and the resulting optimal average cost are not particularly sensitive to the inspection parameter $\lambda_1$.

Table 4. Expected inspection times for different d ∈ {1, 2} given $\lambda_1 = 4$.

d   η_1     η_2     η_3     η_4     η_5     η_6
1   0.405   0.583   0.713   0.827   0.935   1.041
2   0.61    0.904   1.076   1.207   1.322   1.431

Figure 4. Average cost rate for different d ∈ {1, 2} given $\lambda_1 = 4$. Blue and black colors correspond to threshold values d = 1 and d = 2, respectively.


5 CONCLUSIONS 


The main issue discussed here is the detection of the random time of change of the unobservable state (disruption time) based on partial information for systems subject to hidden failures. The approach to estimating the disruption time distribution, which rests on an associated maintenance alert parameter d, helps in two ways. Firstly, it delivers an alert or signal on approaching component failures and makes it possible to control the inspection frequency so that system failures are detected in a timely manner. Secondly, it provides insight for performing a replacement in order to avoid typically more costly system failures. The other problem investigated here is the construction and estimation of a unified cost model based on the available information. The estimated cost model, as a measure of policy, contributes to the joint determination of an optimal inspection frequency and an optimal replacement time for systems with partial information. The main example considered is a Weibull degradation model for a three-component parallel safety system with the alert parameter d. The response of the model to both the threshold and the inspection parameter is examined. Although the numerical results did not reveal a strong influence of the threshold value d on the optimal replacement time, its impact on other parameters, including inspection times, was more significant. Also, the results provide sensible and realistic inspection policies for such systems and give insight into the effect of applying various inspection policies.
ABSTRACT: A system is considered, which is deteriorating over time according to a non-homogeneous
gamma process. The point of the presentation is to propose and compare two models of imperfect repairs 
for the system. For sake of simplicity, only periodic (and instantaneous) repairs are here envisioned. The 
first model, called the Arithmetic Reduction of Deterioration of order 1 (ARD1), assumes that a repair 
removes a given proportion of the degradation accumulated by the system from the last maintenance 
action. The second model, called Arithmetic Reduction of Age of order 1 (ARA1), refers to the virtual 
age models proposed by Kijima (1989) and further studied by Doyen & Gaudoin (2004) in the context 
of recurrent events: the ARA1 model assumes that a repair reduces the age accumulated by the system 
since the last maintenance action, in a given proportion. An ARD1 repair hence lowers the deterioration
level, without rejuvenating the system. On the contrary, by an ARA1 repair, the system is put back to the
exact situation where it was some time before, which entails the lowering of both its deterioration level 
and (virtual) age. The two models may hence correspond to different maintenance actions in an applica- 
tive context. This presentation focuses on the comparison between the two models, from a probabilistic 
point of view (moments and stochastic ordering). An application in a maintenance optimization context 
is also provided, for illustration purpose. A specific case is analyzed, where the two repair models provide 
identical expected deterioration levels at maintenance times (“equivalent” case). The comparison results 
can help in understanding which of the two models is better suited to a given applicative context.


1 INTRODUCTION

Many systems suffer a physical degradation before they fail. This degradation is a complex process as it depends on many factors (material, stress loads, temperature, ...). To mitigate the effect of the system degradation and to extend the system lifetime, a large volume of maintenance models have been proposed in the literature, with different maintenance actions. Most of these models are limited to perfect repairs (Huynh et al. 2014, Caballé et al. 2015, Hong et al. 2014 among others). However, imperfect maintenance actions describe more realistic situations than perfect repairs. Some advances have been made to include imperfect repairs in a degrading system. Alaswad & Xiang (2017) classified the impact of the maintenance actions over the maintained system into three types. The first type assumes that the maintenance actions return the system to a previous stage of deterioration. In the second type, the imperfect maintenance reduces the degradation level of the maintained system (Castro & Mercier 2016). The third approach is based on the idea that the maintenance action changes the rate of degradation of the system (Zhang et al. 2015). However, as Zhang et al. (2015) claimed, the issue of treating imperfect maintenance in the context of degrading systems remains widely open nowadays.

Following the spirit shown in (Mercier & Castro 2013) and (Castro & Mercier 2016), two models of imperfect repair are here proposed and compared for a system accumulating deterioration over time. The first model, called the Arithmetic Reduction of Deterioration of order 1 (ARD1), assumes that the repair removes the proportion ρ of the degradation accumulated by the system from the last maintenance action (with ρ ∈ (0, 1)). The second model, called Arithmetic Reduction of Age of order 1 (ARA1), refers to the virtual age models proposed by Kijima (1989) and further studied by Doyen & Gaudoin (2004) in the context of recurrent events: the ARA1 model assumes that the repair removes the proportion ρ of the age accumulated by the system since the last maintenance action. An ARD1 repair hence lowers the deterioration level, without rejuvenating the system. On the contrary, by an ARA1 repair, the system is put back to the exact situation where it was some time before, which entails the lowering of both its deterioration level and (virtual) age. The two models may hence correspond to different maintenance actions in an applicative context.

For a better understanding of the differences 
between the two models, this paper focuses on 
their comparison, from a probabilistic point of 
view. Assuming that the degradation of the system 
is modeled by a non homogeneous gamma proc- 
ess, stochastic comparisons of both location and 
spread of the two resulting processes are given. 
Moreover, a specific case is analyzed, where the 
two models provide identical expected deterio- 
ration levels at repair times (“equivalent” case). 
Finally, an illustration of the two models is pro- 
vided, by including them in a global maintenance 
policy, with replacement of the system when it is 
too deteriorated. 

The paper is organized as follows: The two 
models of imperfect repair are described in Sec- 
tion 2. The comparison results are given in Section 
3 (including the “equivalent” case). Section 4 deals 
with the application to the global maintenance 
strategy and concluding remarks are provided in 
Section 5, together with possible extensions. 


2 THE TWO MODELS OF IMPERFECT 
REPAIRS 


2.1 The intrinsic deterioration and notations 


For a, b > 0, let us first recall that the gamma distribution Γ(a, b) with parameters (a, b) admits the following p.d.f. (probability density function):

$$f_{a,b}(x) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-bx}, \quad \forall x > 0.$$

The corresponding mean and variance are $a/b$ and $a/b^2$, respectively.

A system is considered with degradation modeled by a non homogeneous gamma process $(X_t)_{t \ge 0}$ with parameters A(·) and b, where $A(\cdot): \mathbb{R}_+ \to \mathbb{R}_+$ is continuous and non-decreasing with A(0) = 0, and b > 0. We recall that $(X_t)_{t \ge 0}$ is a process with independent increments such that $X_0 = 0$ almost surely and such that each increment $X_{t+s} - X_t$ is gamma distributed $\Gamma(A(t+s) - A(t), b)$ for all s, t > 0.

The system is periodically and instantaneously repaired every T units of time. For modeling purposes, we set $X^{(i)}$, $i \in \mathbb{N}^*$, to be i.i.d. copies of $X = (X_t)_{t \ge 0}$, where $X^{(i)}$ describes the evolution of the deterioration level between the i-th and (i + 1)-th maintenance actions. For each imperfect repair model, the maintenance efficiency is measured by a Euclidean parameter ρ ∈ (0, 1).

2.2 First model: Arithmetic Reduction of 
Deterioration of order 1 (ARD1) 


In this model, the maintenance action instantaneously removes the proportion ρ of the degradation accumulated by the system from the last maintenance action (or from the origin). We let $(Y_t)_{t \ge 0}$ be the process that describes the degradation level of the maintained system under this model of repair.

The ARD1 model is sketched in Figure 1 for ρ = 0.5. It is developed as follows: At the beginning, the system deteriorates according to $X^{(1)}$. It is first maintained at time T, where the proportion ρ of the accumulated deterioration since the origin is removed. This provides:

$$Y_t = X_t^{(1)} \text{ for } t < T, \qquad Y_T = (1 - \rho)\,X_T^{(1)}.$$

Figure 1. Illustration of the ARD1 policy for ρ = 0.5.

Between T and 2T, the system deteriorates according to $X^{(2)}$. The age of the system is unchanged at time T and we have

$$Y_t = Y_T + \left(X_t^{(2)} - X_T^{(2)}\right)$$

for all T ≤ t < 2T. At the second maintenance time 2T, the proportion ρ of the degradation accumulated between T and 2T is removed, which provides:

$$Y_{2T} = Y_T + (1 - \rho)\left(X_{2T}^{(2)} - X_T^{(2)}\right).$$

More generally, we have:

$$Y_t = Y_{nT} + \left(X_t^{(n+1)} - X_{nT}^{(n+1)}\right)$$

for all nT ≤ t < (n + 1)T, where $X_t^{(n+1)} - X_{nT}^{(n+1)}$ is gamma distributed $\Gamma(A(t) - A(nT), b)$ and

$$Y_{(n+1)T} = Y_{nT} + (1 - \rho)\left(X_{(n+1)T}^{(n+1)} - X_{nT}^{(n+1)}\right).$$

This provides

$$Y_{nT} = (1 - \rho)\sum_{i=1}^{n}\left(X_{iT}^{(i)} - X_{(i-1)T}^{(i)}\right)$$

with $\Gamma\left(A(nT), \frac{b}{1-\rho}\right)$ as distribution.


When $\rho \to 1^-$, then $Y_{nT} \to 0^+$ and the system is renewed at time nT (As Good As New repair: AGAN). When $\rho \to 0^+$, the repair is ineffective and it is As Bad As Old (ABAO).

Except for the case $\rho \to 0^+$, if t mod T ≠ 0, $Y_t$ is the sum of two independent and gamma distributed random variables (r.v.s) with different scale parameters, and it is not gamma distributed. Its expectation and variance are given by:

$$E(Y_t) = \frac{A(t) - \rho A(nT)}{b}, \tag{1}$$

$$\mathrm{var}(Y_t) = \frac{A(t) - \rho(2 - \rho) A(nT)}{b^2} \tag{2}$$

for nT ≤ t < (n + 1)T.

It is easy to check that $E(Y_t)$ and $\mathrm{var}(Y_t)$ are decreasing with respect to ρ, that is, the more efficient the repair, the smaller the expectation and variance of the deterioration level. In this way, the expectation and variance are minimal when $\rho \to 1^-$, that is when the repair is AGAN, and maximal when $\rho \to 0^+$, that is when the repair is ABAO.


2.3. Second model: Arithmetic Reduction 
of (virtual) Age of order 1 (ARA1) 


As mentioned in the introduction, the ARA1 model is based on the notion of virtual age, which is reduced by each maintenance action: each repair removes the proportion ρ of the age accumulated by the system since the last maintenance action (or from the origin). This means that, at each maintenance action, the system goes back into its past: the deterioration level is hence reduced (just as for the ARD1 model) but the system is also rejuvenated at the same time (it becomes younger). In case of an increasing deterioration rate (A(·) convex), the rate of deterioration is hence reduced by the repair together with the deterioration level. We let $(Z_t)_{t \ge 0}$ be the process that describes the degradation level of the maintained system under this model of repair.

The ARA1 model is sketched in Figure 2 for ρ = 0.5. It is developed as follows: At the first maintenance time T, the (virtual) age of the system is suddenly reduced by ρT, so that it becomes T - ρT = (1 - ρ)T. This provides:

$$Z_t = X_t^{(1)} \text{ for } t < T, \qquad Z_T = X_{(1-\rho)T}^{(1)}.$$

For T ≤ t < 2T, the age of the system is (1 - ρ)T + (t - T) = t - ρT. We get:

$$Z_t = Z_T + \left(X_{t - \rho T}^{(2)} - X_{(1-\rho)T}^{(2)}\right).$$

At time 2T (just before the repair), the age of the system is 2T - ρT, which is reduced by ρT at time 2T. The age hence is 2(1 - ρ)T at time 2T. This provides:

$$Z_{2T} = Z_T + \left(X_{2(1-\rho)T}^{(2)} - X_{(1-\rho)T}^{(2)}\right).$$

More generally, for nT ≤ t < (n + 1)T, the virtual age of the system at time t is t - ρnT (which is just the same as for an ARA1 model for recurrent events, see (Doyen & Gaudoin 2004)). We obtain:

$$Z_t = Z_{nT} + \left(X_{t - \rho nT}^{(n+1)} - X_{(1-\rho)nT}^{(n+1)}\right)$$

for all nT ≤ t < (n + 1)T, where $X_{t - \rho nT}^{(n+1)} - X_{(1-\rho)nT}^{(n+1)}$ is gamma distributed $\Gamma(A(t - \rho nT) - A((1 - \rho)nT), b)$ and

$$Z_{(n+1)T} = Z_{nT} + \left(X_{(1-\rho)(n+1)T}^{(n+1)} - X_{(1-\rho)nT}^{(n+1)}\right).$$

Figure 2. Illustration of the ARA1 policy for ρ = 0.5.

Hence

$$Z_{nT} = \sum_{i=1}^{n}\left(X_{(1-\rho)iT}^{(i)} - X_{(1-\rho)(i-1)T}^{(i)}\right)$$

and it is gamma distributed $\Gamma(A((1 - \rho)nT), b)$. Here, $Z_t$ is the sum of two independent gamma distributed r.v.s which share the same scale parameter b and it is gamma distributed $\Gamma(A(t - \rho nT), b)$. Also:

$$E(Z_t) = \frac{A(t - \rho nT)}{b}, \qquad \mathrm{var}(Z_t) = \frac{A(t - \rho nT)}{b^2}$$

for all nT ≤ t < (n + 1)T.

Here again, it is easy to check that $E(Z_t)$ and $\mathrm{var}(Z_t)$ are decreasing with respect to ρ. Also, the cases $\rho \to 1^-$ and $\rho \to 0^+$ correspond to AGAN and ABAO repairs, respectively.
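To make the two constructions concrete, the sketch below evaluates the mean and variance of the deterioration level under each repair model and simulates the ARD1 level at a fixed time by summing its two independent gamma components. The shape function A(t) = t^2, the parameter values and all function names are assumptions chosen for illustration. Note that Python's `random.gammavariate` takes a scale parameter, so the rate b is passed as 1/b.

```python
import math
import random


def shape(t):
    """Assumed shape function A(t) = t**2 (illustrative choice only)."""
    return t * t


def ard1_moments(t, rho, T=1.0, b=1.0, A=shape):
    """Mean and variance of Y_t under ARD1 for nT <= t < (n+1)T."""
    n = math.floor(t / T)
    a_n = A(n * T)
    mean = (A(t) - rho * a_n) / b
    var = (A(t) - rho * (2.0 - rho) * a_n) / b ** 2
    return mean, var


def ara1_moments(t, rho, T=1.0, b=1.0, A=shape):
    """Mean and variance of Z_t under ARA1: Z_t ~ Gamma(A(t - rho*n*T), b)."""
    n = math.floor(t / T)
    a = A(t - rho * n * T)
    return a / b, a / b ** 2


def simulate_ard1(t, rho, T=1.0, b=1.0, A=shape, rng=random):
    """Draw one value of Y_t as (1-rho)*Gamma(A(nT), b) + Gamma(A(t)-A(nT), b)."""
    n = math.floor(t / T)
    a_past = A(n * T)
    a_cur = A(t) - a_past
    level = 0.0
    if a_past > 0:                       # deterioration kept after the repairs
        level += (1.0 - rho) * rng.gammavariate(a_past, 1.0 / b)
    if a_cur > 0:                        # deterioration since the last repair
        level += rng.gammavariate(a_cur, 1.0 / b)
    return level


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [simulate_ard1(2.5, 0.5, rng=rng) for _ in range(20000)]
    print(sum(draws) / len(draws), ard1_moments(2.5, 0.5))
    print(ara1_moments(2.5, 0.5))
```

The empirical mean of the simulated ARD1 levels should be close to the closed-form expectation, which gives a quick consistency check of the decomposition into two gamma components.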


3 COMPARISON RESULTS 


3.1 Comparison of the moments 


We now come to the main object of the paper, which is the comparison between the two models of imperfect repairs. Note that, in an applicative context, there is no reason why the estimated repair efficiency should be the same when the impact of the maintenance is modeled by an ARD1 or ARA1 model. Considering two different efficiency parameters $\rho_i$, i = 1, 2, our point here is to compare $Y^{(\rho_1)}$ and $Z^{(\rho_2)}$, where the exponent $(\rho_i)$ refers to $\rho_i$ for i = 1, 2. We begin with the comparison of their respective means and variances.

Due to the reduced size of the paper, results on 
the comparison of moments are provided only in 
case of a power-law shape function for the gamma 
process. The interested reader will find general 
results in an extended version of the paper (Mercier 
and Castro, submitted), as well as proofs for all the 
results of the paper. 

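Proposition 1. Let $A(t) = \alpha t^\beta$ with α, β > 0. Assume β < 1.

1. We have

$$E\left(Z_t^{(\rho_2)}\right) \ge E\left(Y_t^{(\rho_1)}\right), \quad \forall t > 0$$

if and only if $(1 - \rho_2)^\beta \ge 1 - \rho_1$.

2. We have

$$\mathrm{Var}\left(Z_t^{(\rho_2)}\right) \ge \mathrm{Var}\left(Y_t^{(\rho_1)}\right), \quad \forall t > 0$$

if and only if $(1 - \rho_2)^\beta \ge (1 - \rho_1)^2$.

If β ≥ 1, all results are valid with reversed inequalities.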

For illustration purposes, we consider $A(t) = t^2$, b = 1, T = 1 and $\rho_2 = 0.5$. As a first case, we take $\rho_1 = 0.3$ so that the conditions of the previous proposition are fulfilled for both expectation and variance. As expected, Figure 3 shows that $E(Y_t^{(\rho_1)})$ and $E(Z_t^{(\rho_2)})$ are ordered in the same way on the whole real line, and the same holds for the variance. As a second case, we take $\rho_1 = 0.95$ so that the conditions are fulfilled neither for the expectation nor for the variance. Figure 4 shows that $E(Y_t^{(\rho_1)})$ and $E(Z_t^{(\rho_2)})$ are not ordered in the same way on the whole real line, and the same holds for the variance.

Figure 3. Mean and variance of $Y_t^{(\rho_1)}$ and $Z_t^{(\rho_2)}$ with respect to t for $A(t) = t^2$, b = 1, $\rho_1 = 0.3$, $\rho_2 = 0.5$, T = 1.

Figure 4. Mean and variance of $Y_t^{(\rho_1)}$ and $Z_t^{(\rho_2)}$ with respect to t for $A(t) = t^2$, b = 1, $\rho_1 = 0.95$, $\rho_2 = 0.5$, T = 1.


3.2 Technical reminders 


Before providing stochastic comparison results between $Y_t^{(\rho_1)}$
and $Z_t^{(\rho_2)}$, we recall here a few definitions and results from
the literature.

As a first step, let us recall that, given two non-negative random
variables X and Y with probability density functions $f_X$ and $f_Y$ and
survival functions $\bar{F}_X$ and $\bar{F}_Y$, respectively:

1. X is said to be smaller than Y in the usual stochastic order
   ($X \le_{st} Y$) if $\bar{F}_X \le \bar{F}_Y$;

2. X is said to be smaller than Y in the likelihood ratio order
   ($X \le_{lr} Y$) if $f_Y / f_X$ is non-decreasing;

3. X is said to be smaller than Y in the convex (concave) order
   ($X \le_{cx (cv)} Y$) if $E(\varphi(X)) \le E(\varphi(Y))$ for all
   convex (concave) functions $\varphi$ (provided the expectations exist);

4. X is said to be smaller than Y in the increasing convex (concave)
   order ($X \le_{icx (icv)} Y$) if $E(\varphi(X)) \le E(\varphi(Y))$ for
   all increasing convex (concave) functions $\varphi$ (provided the
   expectations exist).


The usual stochastic order and the likelihood ratio order compare the
locations of random variables, whereas the increasing convex (concave)
orders also compare their variability: $X \le_{icx} (\le_{icv}) Y$
roughly means that $E(X) \le E(Y)$ (location condition) plus the fact
that X is less (more) "variable" than Y, in a stochastic sense. Also,
$X \le_{cx} (\le_{cv}) Y$ is equivalent to $X \le_{icx} (\le_{icv}) Y$
plus $E(X) = E(Y)$.

The likelihood ratio order is known to imply the usual stochastic order,
which itself implies both increasing convex and concave orders, see
Müller & Stoyan 2002 or Shaked & Shanthikumar 2007 for more details.

We finally review known results on the com- 
parison of gamma distributions in the follow- 
ing lemma, see (Müller & Stoyan 2002, p. 62) for 
instance. 

Lemma 1. Let X and Y be gamma distributed random variables with
parameters $(a_1, b_1)$ and $(a_2, b_2)$, respectively, where
$a_i, b_i > 0$ for $i = 1, 2$. Then:

1. If $a_1 \le a_2$ and $b_1 \ge b_2$, then $X \le_{lr} Y$;

2. If $a_1 \ge a_2$ and $a_1/b_1 \le a_2/b_2$, then $X \le_{icx} Y$;

3. If $a_1 \le a_2$, $b_1 \le b_2$ and $a_1/b_1 \le a_2/b_2$, then
   $X \le_{icv} Y$.

The previous lemma is the basis for deriving the stochastic comparison
results between the different imperfect repair models given in the next
subsection.
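The sufficient conditions of Lemma 1 translate directly into a small checker. The following sketch is illustrative only: the function name `gamma_orders` is hypothetical, the parameters are read as shape $a_i$ and rate $b_i$ (so that the mean is $a_i/b_i$), and the implications from the likelihood ratio order to the weaker orders are added as recalled in the text above.

```python
def gamma_orders(a1, b1, a2, b2):
    """Return the orders 'X <= Y' guaranteed by Lemma 1.

    X ~ Gamma(shape a1, rate b1), Y ~ Gamma(shape a2, rate b2).
    Only sufficient conditions are tested: an order missing from the
    result is not proved to fail, it is just not covered by the lemma.
    """
    orders = set()
    if a1 <= a2 and b1 >= b2:
        # The likelihood ratio order implies the usual stochastic order,
        # which implies both increasing convex and concave orders.
        orders.update({"lr", "st", "icx", "icv"})
    if a1 >= a2 and a1 / b1 <= a2 / b2:
        orders.add("icx")
    if a1 <= a2 and b1 <= b2 and a1 / b1 <= a2 / b2:
        orders.add("icv")
    return orders


if __name__ == "__main__":
    print(sorted(gamma_orders(1.0, 2.0, 2.0, 1.0)))
    print(sorted(gamma_orders(3.0, 3.0, 2.0, 1.0)))
    print(sorted(gamma_orders(1.0, 1.0, 2.0, 1.5)))
```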


3.3 Stochastic comparison results 


As a first step, the influence of the efficiency 
parameter p on the deterioration level is studied 
for each imperfect repair model. 


Proposition 2. We have:

1. $Y_{nT}$ decreases with respect to $\rho$ in the sense of the
   likelihood ratio order: if $\rho_1 \le \rho_2$, then
   $Y_{nT}^{(\rho_2)} \le_{lr} Y_{nT}^{(\rho_1)}$;

2. $Y_t$ decreases with respect to $\rho$ in the sense of both increasing
   convex and concave orders: if $\rho_1 \le \rho_2$, then
   $Y_t^{(\rho_2)} \le_{icx} Y_t^{(\rho_1)}$ and
   $Y_t^{(\rho_2)} \le_{icv} Y_t^{(\rho_1)}$;

3. $Z_t$ decreases with respect to $\rho$ for the likelihood ratio order
   (and hence also for both increasing convex and concave orders): if
   $\rho_1 \le \rho_2$, then $Z_t^{(\rho_2)} \le_{lr} Z_t^{(\rho_1)}$.


In each case, we can see that, as expected, the more efficient the
maintenance action is (namely the larger $\rho$ is), the smaller the
deterioration level is. Note however that, since the likelihood ratio
order is stronger than both increasing convex and concave orders, the
results for the ARA1 model are stronger than for the ARD1 one.
(Counter-examples can be found, which show that $Y_t^{(\rho_1)}$ and
$Y_t^{(\rho_2)}$ are not comparable for the likelihood ratio order in a
general setting.)

We now come to the stochastic comparison 
between the two models. 


Theorem 1. If $A(\cdot)$ is concave and

\[ A((1-\rho_2)t) \ge (1-\rho_1) A(t) \quad \text{for all } t \ge 0 \quad (4) \]

(which is true if $\rho_1 \ge \rho_2$), then
$Y_t^{(\rho_1)} \le_{icx} Z_t^{(\rho_2)}$ for all $t \ge 0$.

If $A(\cdot)$ is convex with a reversed inequality in (4), then
$Z_t^{(\rho_2)} \le_{icv} Y_t^{(\rho_1)}$ for all $t \ge 0$.

As a specific case, if $\rho_1 = \rho_2 = \rho$, then
$Y_t^{(\rho)} \le_{icx} Z_t^{(\rho)}$ for all $t \ge 0$ if $A(\cdot)$ is
concave, and $Z_t^{(\rho)} \le_{icv} Y_t^{(\rho)}$ for all $t \ge 0$ if
$A(\cdot)$ is convex. The concavity/convexity of the shape function
$A(\cdot)$ of the underlying gamma process hence strongly influences the
comparison results between the two models: when $A(\cdot)$ is convex, the
rejuvenation included in an ARA1 model leads to a lower deterioration
level than for an ARD1 model.

Specific results are now summarized in the clas- 
sical case of a homogeneous gamma process for 
both moments and stochastic comparisons. 

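Corollary 1. Assume that $A(t) = \alpha t$ for all $t \ge 0$, where
$\alpha > 0$. We have the following results:

1. If $\rho_1 \ge \rho_2$, then $Y_t^{(\rho_1)} \le_{icx} Z_t^{(\rho_2)}$
   and hence $E(Y_t^{(\rho_1)}) \le E(Z_t^{(\rho_2)})$ for all $t \ge 0$;

2. If $\rho_1 \le \rho_2$, then $Z_t^{(\rho_2)} \le_{icv} Y_t^{(\rho_1)}$
   and hence $E(Z_t^{(\rho_2)}) \le E(Y_t^{(\rho_1)})$ for all $t \ge 0$;

3. If $\rho_1 = \rho_2$, then $Y_t^{(\rho_1)} \le_{icx} Z_t^{(\rho_2)}$
   and $Z_t^{(\rho_2)} \le_{icv} Y_t^{(\rho_1)}$, and hence
   $E(Z_t^{(\rho_2)}) = E(Y_t^{(\rho_1)})$ for all $t \ge 0$;

4. $Var(Z_t^{(\rho_2)}) \ge Var(Y_t^{(\rho_1)})$ for all $t > 0$ if and
   only if $1-\rho_2 \ge (1-\rho_1)^2$.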


3.4 Mostly equivalent imperfect repair models 


In an applied context, the parameters $\rho_1$ and $\rho_2$ for the ARD1
and ARA1 models, respectively, will be estimated from feedback data,
which will typically be gathered at maintenance times $iT$, $i \ge 1$. As
a consequence, we can expect that the estimated parameters $\hat\rho_1$
and $\hat\rho_2$ should be such that the corresponding expected
deterioration levels are very close at maintenance times, namely such
that $E(Y_{iT}^{(\hat\rho_1)}) \simeq E(Z_{iT}^{(\hat\rho_2)})$ for all
$i \in \mathbb{N}^*$. Equivalently, the estimated parameters should be
such that

\[ \frac{(1-\hat\rho_1) A(iT)}{b} \simeq \frac{A((1-\hat\rho_2) iT)}{b} \quad \text{for all } i \ge 1. \]

There hence is a specific interest for applications in comparing the
ARD1 and ARA1 models under the condition

\[ (1-\rho_1) A(iT) = A((1-\rho_2) iT) \quad \text{for all } i \ge 1, \quad (5) \]

which will lead to equivalent deterioration levels at maintenance times.
However, the previous requirement (5) does not seem to have a solution
for a general shape function $A(\cdot)$. We hence restrict the study to
the power-law case $A(t) = \alpha t^{\beta}$ (with $\alpha, \beta > 0$),
for which (5) is just equivalent to

\[ 1-\rho_1 = (1-\rho_2)^{\beta}. \]

A homogeneous gamma process corresponds to a power-law shape function
with $\beta = 1$. Then the equivalent case just means that the two models
share the same efficiency parameter ($\rho_1 = \rho_2$).

In case of a general power-law shape func- 
tion, note that the “equivalent” case has a similar 
spirit to that detailed in Doyen & Gaudoin (2004, 
Property 4), where the authors match the minimal 
wear intensities of two imperfect repair models for 
recurrent events, based on the reduction of either 
virtual age or failure intensity. 

The results for the equivalent case are summa- 
rized in the following proposition. 


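Proposition 3. Assume that $A(t) = \alpha t^{\beta}$ (with
$\alpha, \beta > 0$) and that $1-\rho_1 = (1-\rho_2)^{\beta}$.

1. $Z_{nT}^{(\rho_2)} \le_{cv} Y_{nT}^{(\rho_1)}$ and
   $Y_{nT}^{(\rho_1)} \le_{cx} Z_{nT}^{(\rho_2)}$ (which both entail that
   $Var(Z_{nT}^{(\rho_2)}) \ge Var(Y_{nT}^{(\rho_1)})$) for all $n \ge 1$.

2. If $\beta \le (\ge)\, 1$, then
   $Y_t^{(\rho_1)} \le_{icx} (\ge_{icv})\, Z_t^{(\rho_2)}$ (which entails
   that $E(Y_t^{(\rho_1)}) \le (\ge)\, E(Z_t^{(\rho_2)})$) for all
   $t \ge 0$.

3. If $\beta \le 1$, then $Var(Z_t^{(\rho_2)}) \ge Var(Y_t^{(\rho_1)})$
   for all $t \ge 0$.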


4 APPLICATION 


To illustrate the previous results, the system is now supposed to provide
some reward per unit time, which decreases when the deterioration level
of the system increases. Based on classical functions used in the
insurance literature (Rolski et al. 1998), we assume that the reward
function is of the shape

\[ g(x) = (b_1 - k_1 e^{\alpha_1 x}) \mathbf{1}_{[0,c]}(x) + (b_2 - k_2 e^{\alpha_2 x}) \mathbf{1}_{(c,\infty)}(x) \quad (6) \]

with $b_1, b_2, \alpha_1, \alpha_2, k_1, k_2, c > 0$. The reward function
is supposed to be continuous and positive on $(0, c)$, which implies that

\[ b_1 - k_1 e^{\alpha_1 c} = b_2 - k_2 e^{\alpha_2 c} > 0. \quad (7) \]

Also, we assume that $\alpha_1 \le \alpha_2$ and $k_1 \le k_2$, so that
level c appears as a critical level, from which the system performs less
well.

With the previous assumptions, it is easy to check that g is a concave
function and that $g(x) > 0$ if and only if
$x < L = \ln(b_2/k_2)/\alpha_2$. Level L hence appears as a critical
threshold, from where the unitary reward becomes negative.

An example of reward function is plotted in Figure 5 with parameters
$\alpha_1 = 0.1$, $b_1 = 11$ monetary units (m.u.), $\alpha_2 = 0.25$,
$k_1 = 1$ (m.u.), $k_2 = 1$ (m.u.), $c = 4$, and $b_2$ obtained through
(7). With this dataset $L = 10.01$.
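As a quick numerical check, the continuity condition (7) and the critical level can be evaluated directly. The sketch below assumes the piecewise exponential form of the reward function, with $b_2$ solved from the continuity condition at $c$ and $L = \ln(b_2/k_2)/\alpha_2$; the function names `reward_params` and `reward` are hypothetical.

```python
import math


def reward_params(b1, k1, alpha1, k2, alpha2, c):
    """Return (b2, L): b2 from continuity of g at c, L the zero of g."""
    # Continuity at c: b1 - k1*exp(alpha1*c) = b2 - k2*exp(alpha2*c).
    b2 = b1 - k1 * math.exp(alpha1 * c) + k2 * math.exp(alpha2 * c)
    # g(x) = 0 on the second branch: b2 = k2*exp(alpha2*L).
    L = math.log(b2 / k2) / alpha2
    return b2, L


def reward(x, b1, k1, alpha1, b2, k2, alpha2, c):
    """Piecewise reward per unit time at deterioration level x >= 0."""
    if x <= c:
        return b1 - k1 * math.exp(alpha1 * x)
    return b2 - k2 * math.exp(alpha2 * x)


if __name__ == "__main__":
    # Parameters quoted for the plotted example.
    b2, L = reward_params(b1=11.0, k1=1.0, alpha1=0.1, k2=1.0, alpha2=0.25, c=4.0)
    print(round(b2, 4), round(L, 2))
```

With the parameters of the plotted example, this reproduces the critical level quoted in the text to two decimals.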

The system is assumed to be preventively repaired every T units of time
according to an ARD1 or ARA1 model, with efficiency parameters $\rho_1$
and $\rho_2$, respectively. The accumulated reward on a time interval
$[0, t]$ hence is

\[ R_{ARD}^{(\rho_1)}(t) = E\left( \int_0^t g(Y_s^{(\rho_1)})\, ds \right) = \int_0^t E\left( g(Y_s^{(\rho_1)}) \right) ds \]

with a similar expression for the accumulated reward
$R_{ARA}^{(\rho_2)}(t)$ in the ARA1 case.

Considering for instance the equivalent case ($A(t) = \alpha t^{\beta}$
and $1-\rho_1 = (1-\rho_2)^{\beta}$) with $\beta \le 1$, we easily derive
from point 2 in Proposition 3 that

\[ E\left( g(Y_s^{(\rho_1)}) \right) \ge E\left( g(Z_s^{(\rho_2)}) \right) \]

for all $s \in [0, t]$, using the fact that $-g$ is an increasing convex
function. This immediately entails that

\[ R_{ARD}^{(\rho_1)}(t) \ge R_{ARA}^{(\rho_2)}(t) \quad \text{for all } t \ge 0. \]

As a consequence, even if the two models of the “equivalent” case provide
similar expected levels at maintenance times, they do not provide the
same accumulated rewards on a given time interval. They may hence lead to
different decisions in an applied context, for instance when optimizing
maintenance strategies.

Figure 5. The reward function g for $\alpha_1 = 0.1$, $b_1 = 11$ m.u.,
$\alpha_2 = 0.25$, $k_1 = 1$ m.u., $k_2 = 1$ m.u., $c = 4$.

To better illustrate this difference, we now con- 
sider a preventive maintenance strategy, which we 
next optimize. The assumptions are the following: 


• The unitary reward of the system per unit time is given by the reward
  function g provided in (6). We recall that
  $L = \ln(b_2/k_2)/\alpha_2$ stands for a “critical” level, from where
  the reward becomes negative.

• The system is preventively repaired every T units of time according to
  an ARD1 or ARA1 model (respective efficiency parameters: $\rho_1$ and
  $\rho_2$), up to the first maintenance time KT where the deterioration
  level is observed to be beyond a given preventive threshold M (with
  M < L). In that case, instead of an imperfect repair (ARA1 or ARD1), a
  replacement is performed at time KT, with cost $C_c$ m.u. when the
  level is beyond L (corrective replacement) and $C_p$ m.u. when it is
  between M and L (preventive replacement).


The successive (corrective or preventive) replacements of the system
appear as the points of a renewal process, and the long-time (operating)
profit rate per unit time is given by

\[ C_{ARD}(T, M) = \frac{1}{T\, E(K)} \left[ E\left( \int_0^{KT} g(Y_s^{(\rho_1)})\, ds \right) - C_r (E(K) - 1) - C_p\, P(M < Y_{KT}^{(\rho_1)} < L) - C_c\, P(L \le Y_{KT}^{(\rho_1)}) \right] \]

for the ARD1 model, where $C_r$ is the cost of an imperfect repair, with
a similar expression for the ARA1 model ($C_{ARA}(T, M)$). Due to the
complexity of the model, there is no hope here of finding conditions that
ensure the dominance of one of the two functions $C_{ARD}(T, M)$ or
$C_{ARA}(T, M)$ over the other. The comparison is hence made on a
numerical example.

The degradation is modeled by a homogeneous gamma process with parameters
$A(t) = 1.3t$ and $b = 0.8$. The parameters of the reward function g are
$\alpha_1 = 0.4$, $\alpha_2 = 0.5$, $b_1 = 800$ m.u., $k_1 = 1.05$ m.u.,
$k_2 = 1.07$ m.u., $c = 8$, which implies $b_2 = 832.66$ m.u. and
$L = 13.31$. The repair efficiencies of the ARD1/ARA1 repairs are
$\rho_1 = \rho_2 = 0.9$, which corresponds to an equivalent case
(homogeneous gamma process and $\rho_1 = \rho_2$). Their common cost is
$C_r = 200$ m.u. The cost of a preventive replacement is $C_p = 1000$
m.u., whereas it is $C_c = 1300$ m.u. for a corrective one. Figures 6a
and 6b show the profit rates for the maintained system under ARD1 and
ARA1 repairs, respectively. These figures have been computed considering
a grid of 10 points for T from 1.14 to 4, a grid of 13 points for M from
1 to L, and 10000 simulations for each pair of points. For both ARD1 and
ARA1 cases, the optimal maintenance strategy corresponds to
$T^{opt} = 3.05$. However, the corresponding optimal maintenance levels
are $M_{ARD}^{opt} = 9.21$ and $M_{ARA}^{opt} = 10.24$, and the optimal
unitary rewards are $C_{ARD}^{opt} = 673.94$ and $C_{ARA}^{opt} = 684.34$
m.u. per unit time, respectively. There hence is a difference of about
10% between the two optimal preventive levels, which may entail
inappropriate decisions in an applied context when one imperfect repair
model is used whereas the effective behavior of the maintained system is
closer to the other one.

Figure 6. Operational profit rates, $\rho_1 = \rho_2 = 0.9$: (a) ARD1
model; (b) ARA1 model.


5 CONCLUSIONS AND FURTHER 
EXTENSIONS 


Two imperfect repair models for a degrading system have been proposed and
compared in this paper. One is based on the reduction of the
deterioration level alone at maintenance times (ARD1), the other on the
reduction of both deterioration level and age (ARA1). They hence
correspond to different maintenance actions in an applied context. The
comparison of both location and spread has been made in terms of moments
and stochastic ordering of the two resulting processes. It has been seen
that the concavity/convexity of the shape function of the underlying
gamma wear process plays a central role in the comparison results. This
corresponds to intuition, as a convex shape function (for instance)
induces an increasing rate of deterioration over time. Hence,
rejuvenating the system as in an ARA1 action will decrease the rate of
deterioration together with the deterioration level.

As for the stochastic comparison results, the paper focuses on the
likelihood ratio order, which is well known in reliability theory, but
also on the (increasing/decreasing) convex/concave orders, which come
from the insurance literature and seem a little less common in papers
devoted to reliability theory. Clearly, other stochastic orders might be
considered, such as the Laplace transform order or the excess
wealth order, for instance. Other questions of inter-
est concern the comparison of remaining lifetimes, 
considering the system as failed (or too degraded) 
when its deterioration level is beyond a fixed failure 
(critical) threshold. From a theoretical point of view, 
this seems a difficult issue in a general setting. One 
could then look at partial results on specific models. 

Furthermore, the paper made an attempt to compare the two types of
imperfect repair by including them in a global maintenance policy.
Clearly, this subject requires further investiga- 
tion for a better understanding of the practical 
consequences on the optimal policy of choosing 
one model of imperfect repair or the other, and 
typically, this would require a numerical study at 
a larger scale. Of course, it would also be of inter- 
est to consider other types of global maintenance 
policy (among the numerous ones developed in the 
literature, e.g. see van Noortwijk (2009)), including 
one model of imperfect repair or the other. 

Finally, a gamma deterioration model has been 
assumed in this paper. The stochastic comparison 
results between the two imperfect repair models 
deeply rely on the comparison properties of the 
gamma distribution, as summed up in Lemma 1. 
As noted by the referee, a question of interest 
would be to revisit the comparison between the 
two imperfect repair models considering other 
deterioration models (such as Wiener processes 
with trend or other subordinators).

ABSTRACT: For both the manufacturer and the user of industrial vehicles, simultaneously optimizing
the maintenance and mission schedules at the fleet level has become a necessity to improve profitability.
Much research has been carried out to optimize either the preventive maintenance schedule or the
production planning. Integrating both activities in the same schedule has become a new hotspot. Little
research has addressed the scheduling of both maintenance operations and missions for a fleet of systems
which deteriorate over time in a comprehensive predictive approach. As a first step towards this objective,
a single vehicle is considered and we propose a method to jointly optimize its predictive maintenance and
its mission planning using the remaining useful life and deterioration information. The vehicle has a set
of missions to complete. The aim is to group the missions in blocks and to insert maintenance operations
between these blocks. However, the vehicle deteriorates over time and each mission affects the
deterioration differently according to its severity. A stochastic process is used to model the deterioration
phenomenon and its parameters are changed for each mission. To obtain the best arrangement of missions
and maintenance operations, a genetic algorithm is used. A criterion to minimize the maintenance cost is
defined, which takes into account the preventive and corrective maintenance costs associated with the
mission failure probabilities. The genetic algorithm makes it possible to approach the optimal solution in
a reasonable time and to avoid considering every possible arrangement.


1 INTRODUCTION

Many researchers have developed methods to optimize the preventive
maintenance schedule for multi-component systems (Bouvard et al. 2011)
without considering production constraints but taking into account the
system availability over a fixed time period (Lesobre 2015). From these
previous achievements, a new problem has arisen, which aims at jointly
scheduling production and maintenance. Two strategies to solve this
problem can be identified.

The sequential strategy consists in scheduling one of the two activities,
maintenance or production, and using this schedule as an additional
unavailability constraint to solve the joint scheduling. Benbouzid et al.
(2003) develop a sequential process to schedule production activities and
maintenance in flowshop workshops. The production schedule is generated
using heuristic methods. Then, by considering the order of the production
tasks as a constraint, the periodic maintenance operations are added to
the initial schedule. The period is determined to reach a trade-off
between the maintenance cost and the risk of reducing the machine
availability. The same procedure is applied to integrate preventive
maintenance in the production scheduling for a single machine, in order
to minimize the total expected weighted completion time of all the jobs
(Cassady & Kutanoglu 2005). The exact method, for small problems,
considers all the possible job sequences. Then, for each sequence, all
the feasible sets of Preventive Maintenance (PM) decisions are tested.
The combination of job sequence and PM decisions minimizing the objective
function is the optimal solution. For larger problems, a two-step
heuristic method is developed. The first step selects the job sequence
having the shortest total weighted completion time. The weight is an
importance factor for each job. The second step identifies the PM
decisions minimizing the total expected weighted completion time for the
job sequence established in the first step.

The integrated strategy aims at scheduling maintenance and production
activities simultaneously. Yalaoui et al. (2014) use a linear programming
model to minimize the total production cost. The production is divided in
cycles and a preventive maintenance is completed at the beginning of each
cycle to restore the capacities of the production lines. The maintenance
cost of each cycle depends on the preventive and corrective maintenance
costs. The corrective one is estimated from the failure rate. This exact
method is satisfactory for small problems. The following research works
apply heuristic methods to solve the integrated scheduling problem. Feng
et al. (2016) apply a genetic algorithm to minimize the costs induced by
job tardiness and by corrective and preventive maintenance actions. Ladj
et al. (2016) suggest an approach hybridizing genetic algorithms and
artificial immune systems. A deterministic deterioration model is
considered, as each job deteriorates the machine by a fixed value, and it
is assumed that no accidental failure occurs during the time horizon. The
maintenance actions are triggered when the deterioration level oversteps
a failure threshold. These approaches are single-objective optimization
methods, but multi-objective methods exist. Da et al. (2016) choose to
minimize both the maintenance cost rate and the makespan.

A comparison between the sequential approach and a new integrated
strategy to minimize the manufacturing system cost for a single-unit
system has been drawn by Li et al. (2010). The maintenance operations are
imperfect and an improvement factor is defined to quantify the
effectiveness of each operation. The integrated approach uses a heuristic
method to find the optimal schedule and saves about 12% of the
manufacturing costs with respect to the sequential approach. The terms
sequential and integrated only refer to the way of solving the scheduling
problem. In both cases, the result is a single simultaneous schedule for
production and maintenance.

In the previous contributions, the deterioration 
modelling considers either the machine age or a 
deterministic model where each job deterioration 
is exactly known. Li et al. (2010) do not consider 
the failure risk or any reliability based constraints 
to obtain the joint schedule. 

The objective of this paper is to develop a new integrated strategy to
schedule both missions and maintenance for a deteriorating vehicle using
deterioration information. Firstly, the joint scheduling problem is
defined. Then, the approach adopted to solve it is explained. The models
for the evolution of the vehicle deterioration and for the impact of the
missions on the deterioration are developed and the optimization method
is presented. Finally, performance and sensitivity studies are presented
on application examples to illustrate the behaviour of the algorithm.


2 PROBLEM FORMULATION 


A single vehicle has a set of missions to complete. It is modelled as a
single-unit system which deteriorates over time according to its
activity. The vehicle operates in missions with different severity
levels, characterized not only by their duration but also by several
environment parameters such as road condition and topography. The vehicle
usage modelling is adapted accordingly and takes into account the mission
severity level.

The objective of the present work is to jointly optimize the vehicle
predictive maintenance and its mission planning using the deterioration
information (Figure 1). The preventive maintenance operations have to be
well planned to prevent immobilizing failures, maximize the vehicle
availability and not disturb the progress of the missions. According to
these conditions, the schedule is defined as a series of mission blocks
interrupted by maintenance operations that restore the vehicle to an As
Good As New (AGAN) state.

Figure 1. Framework to schedule the three missions $M_1$, $M_2$ and $M_3$
and the maintenance actions MA according to the health state. Their
durations and deterioration severities are respectively defined by the
lengths and colors of the boxes.

The schedule optimization is based on the total maintenance cost and aims
at maximizing the vehicle activity between two consecutive preventive
maintenance operations while considering its health state. The total
maintenance cost is then based on the preventive maintenance costs
$C_{pm}$, for which each operation costs $C_p$, and the costs $C_{cm}$
associated with corrective maintenance. To estimate $C_{cm}$, the mission
failure risks and the corrective maintenance cost $C_c$ of each operation
are considered.


3 RESOLUTION APPROACH 


3.1 Vehicle deterioration model 


3.1.1 Deterioration evolution 

The vehicle is characterized by a global health indicator. Its
deterioration evolution is modelled by a stationary Gamma process. The
choice of this stochastic continuous deterioration process is motivated
by its ability to represent observable and gradual deterioration
phenomena in industrial systems (Lesobre 2015, Van Noortwijk 2009). A
Gamma process $X(t)$, $t \ge 0$, is defined by its shape and scale
parameters, respectively denoted $\alpha$ and $\beta$. A failure occurs
when the cumulated deterioration $X(t)$ exceeds the failure threshold L.
The distribution function F followed by the time to failure T is then
given by Eq. 1. The failure probability for a mission whose duration is
equal to $t_m$ is equal to $F(t_m)$.

\[ F(t) = P(T < t) = \frac{\Gamma(\alpha t, L\beta)}{\Gamma(\alpha t)}, \quad (1) \]

where $\Gamma(\cdot)$ is the Gamma function and $\Gamma(\alpha t, L\beta)$
is the incomplete Gamma function defined by

\[ \Gamma(\alpha t, L\beta) = \int_{L\beta}^{\infty} u^{\alpha t - 1} e^{-u}\, du. \]
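The failure probability of Eq. 1 is a regularized upper incomplete Gamma function, which can be evaluated with the standard library alone. The sketch below is a minimal illustration, assuming that $\beta$ acts as a rate (so that $X(t)$ has mean $\alpha t/\beta$) and using the classical power series of the lower incomplete Gamma function; the names `upper_regularized_gamma` and `failure_probability` are hypothetical, and the series is only meant for moderate argument values.

```python
import math


def upper_regularized_gamma(a, x, tol=1e-12, max_terms=100_000):
    """Q(a, x) = Gamma(a, x) / Gamma(a) for a > 0, x >= 0.

    Uses P(a, x) = sum_{n>=0} x**(a+n) * exp(-x) / Gamma(a+n+1) and
    returns 1 - P. Each term is computed in log space to avoid overflow.
    """
    if x <= 0.0:
        return 1.0
    log_x = math.log(x)
    total = 0.0
    for n in range(max_terms):
        term = math.exp((a + n) * log_x - x - math.lgamma(a + n + 1.0))
        total += term
        # Terms decrease once a + n > x; stop when they become negligible.
        if a + n > x and term <= tol * total:
            break
    return min(1.0, max(0.0, 1.0 - total))


def failure_probability(t, alpha, beta, L):
    """F(t) = P(X(t) >= L) for a stationary Gamma process (Eq. 1)."""
    if t <= 0.0:
        return 0.0  # no deterioration has accumulated yet
    return upper_regularized_gamma(alpha * t, L * beta)


if __name__ == "__main__":
    # Hypothetical parameters: alpha = 1, beta = 2, threshold L = 1.5.
    for t in (0.5, 1.0, 2.0, 5.0):
        print(t, failure_probability(t, alpha=1.0, beta=2.0, L=1.5))
```

For a shape $\alpha t = 1$ the deterioration is exponentially distributed, so the result can be compared with $e^{-L\beta}$.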


3.1.2 Missions impact on the deterioration 

The missions correspond to the deliveries the truck has to complete. They
are described by their durations and they affect the evolution of the
vehicle deterioration. Indeed, the vehicle evolves in a dynamic
environment which influences its deterioration. The environment evolution
results from changes in the mission characteristics during the vehicle
lifetime. The deterioration-threshold failure model used in this work
allows one to integrate the mission characteristics through the
modification of the deterioration parameters. It is thus assumed that
these changes have a time-related impact on the deterioration process,
modelled by a change in the deterioration speed. The deterioration is
then modelled by a Gamma process with varying parameters. Each mission is
associated with a pair of parameters.

Thanks to the deterioration model, the failure probabilities can be
estimated based on the distribution function followed by the time to
failure. An optimization method to jointly schedule missions and
maintenance can then be developed.


3.2 Optimization method for joint scheduling 


A genetic algorithm based approach is developed to find the optimal joint
schedule for maintenance and missions, minimizing the total maintenance
cost. The definition of the optimization criterion is one of the major
points to discuss.


3.2.1 Criteria definition 

The schedule is composed of mission blocks sepa- 
rated by preventive maintenance operations. The 
optimization criterion is then based on the mainte- 
nance cost associated with the schedule. This crite- 
rion relies on two elements: 


• The preventive maintenance cost $C_p$ corresponding to the maintenance
  operations scheduled at each block end.

• The corrective maintenance cost $C_c$ related to the failure occurring
  within each mission block.

An optimal balance is to be found between preventive and corrective
maintenance, i.e. between the number of blocks and the filling of the
blocks. To determine the possible number of missions in the blocks, the
deterioration process is used to estimate the failure probability for a
block. This failure probability considers the parameters characterizing
the missions in the block.

For p missions in a block, p environment changes, representing the
mission severity levels, occur. Figure 2 presents an example with 3
missions and 3 environment changes. The only known information is that
the deterioration level is equal to 0 at the block beginning. No
information on its deterioration evolution is available while the vehicle
is completing the schedule.

Figure 2. Deterioration evolution in a block with 3 missions
(deterioration X(t) against time in hours, one environment per mission,
with the failure threshold L).

The cumulated deterioration between $t_0$ and $t_1 + t_2 + t_3$
corresponds to the sum of the deterioration increments between $t_0$ and
$t_1$, $t_1$ and $t_1 + t_2$, and between $t_1 + t_2$ and
$t_1 + t_2 + t_3$. The probability densities of the increments of the
missions are respectively denoted $f_1$, $f_2$ and $f_3$. The density of
the deterioration increment for the block is the convolution of the
increment densities of the missions in the block. The more numerous the
environment changes are, the more complex the reliability expression is
to compute (Khoury 2012). Based on the Gamma process properties and the
environment changes, an approximation for the block deterioration process
can be given by an equivalent Gamma process $Ga(\alpha_e, \beta_e)$. Its
average value and variance are respectively weighted mean values of the
average values and variances related to the Gamma processes of the
missions. The different weights are the proportions of the block duration
spent in each environment. For Figure 2, the equivalent Gamma process
parameters are defined as follows:

\[ \frac{\alpha_e}{\beta_e} = \sum_{i=1}^{p} \frac{t_i}{T} \frac{\alpha_i}{\beta_i}, \qquad \frac{\alpha_e}{\beta_e^2} = \sum_{i=1}^{p} \frac{t_i}{T} \frac{\alpha_i}{\beta_i^2}, \quad (2) \]

where $p = 3$ and $T = \sum_{i=1}^{p} t_i$. Note that if all the Gamma
processes related to the missions have the same scale parameter $\beta$
and the same duration, the resulting convolution density is exactly the
one for a Gamma process $Ga(\alpha_e, \beta)$.
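The moment matching described above can be sketched in a few lines. The code below assumes that the mean and the variance per unit time of a Gamma process with parameters $(\alpha_i, \beta_i)$ are $\alpha_i/\beta_i$ and $\alpha_i/\beta_i^2$, and that the weights are the fractions of the block duration spent in each mission; the function name `equivalent_gamma` and the numerical values are hypothetical.

```python
def equivalent_gamma(missions):
    """Equivalent Gamma process parameters for a block of missions.

    missions: list of (alpha_i, beta_i, t_i) tuples, where t_i is the
    mission duration. Returns (alpha_e, beta_e) such that the mean and
    variance rates of Ga(alpha_e, beta_e) equal the duration-weighted
    means of the mission mean and variance rates.
    """
    total_time = sum(t for _, _, t in missions)
    mean_rate = sum(t * a / b for a, b, t in missions) / total_time
    var_rate = sum(t * a / b ** 2 for a, b, t in missions) / total_time
    beta_e = mean_rate / var_rate      # since mean/var = beta
    alpha_e = mean_rate * beta_e       # since mean = alpha/beta
    return alpha_e, beta_e


if __name__ == "__main__":
    # Three hypothetical missions: (shape, scale parameter, duration in h).
    block = [(1.0, 2.0, 3.0), (1.5, 2.5, 2.0), (0.8, 1.5, 4.0)]
    print(equivalent_gamma(block))
```

When all missions share the same $\beta$, the function returns that $\beta$ and the duration-weighted mean of the $\alpha_i$, which matches the exact convolution result mentioned in the text.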

The block failure probability is obtained based on the equivalent Gamma
process. It is the probability that the cumulated deterioration exceeds
the failure threshold L, knowing that $X(t_0)$, the deterioration at
$t_0$, is equal to 0.

\[ P(X(T) - X(t_0) \ge L) = \frac{\Gamma(\alpha_e (T - t_0), L\beta_e)}{\Gamma(\alpha_e (T - t_0))} \quad (3) \]


The criterion $C_1$ (Eq. 4) is defined to estimate the maintenance costs
for a joint schedule z composed of $N_b$ blocks. It considers only one
failure per block.

\[ C_1(z) = \sum_{k=1}^{N_b} \left( C_p + C_c P_f(k) \right), \quad (4) \]

where $P_f(k)$ is the probability of having one failure in block k, as
explained in Eq. 3. Variations can then be defined to accept multiple
failures in the blocks. We define another criterion $C_2$ which considers
that, in a block composed of m missions, no more than m failures can
occur: at most one failure per mission can happen. Considering multiple
failures in a block corresponds to estimating the expected number of
failures in the block.


The replacement process is such that once the deterioration exceeds the
failure threshold L, a corrective maintenance is completed and the
deterioration level is back to 0. Based on the equivalent Gamma process,
the deterioration evolution in the block can be characterized. The
probability of having two failures in the block defined in Figure 2 can
then be estimated by $P(X(T) - X(t_0) \ge 2L)$. The principle is the same
as the replacement process but without the reset of the deterioration
level after a failure. The criterion $C_2$ for a joint schedule z
composed of $N_b$ blocks is:

\[ C_2(z) = \sum_{k=1}^{N_b} \left( C_p + C_c \sum_{i=1}^{N_m(k)} P_k(D \ge iL) \right), \quad (5) \]

where $N_m(k)$ is the number of missions in block k, D the deterioration
level and $P_k(D \ge iL)$ the probability of exceeding the failure
threshold L for the i-th time in block k.

The two criteria $C_1$ and $C_2$ are used as optimization criteria for
the genetic algorithm developed to solve the joint scheduling problem.
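Both criteria are plain sums over the blocks of a schedule, which the following sketch makes explicit. It takes the failure probabilities as inputs instead of computing them from the deterioration model; the function names and the cost and probability values are hypothetical.

```python
def criterion_c1(block_failure_probs, c_p, c_c):
    """C1: one preventive cost per block plus the expected cost of at
    most one failure per block.

    block_failure_probs[k] is the failure probability of block k.
    """
    return sum(c_p + c_c * p_f for p_f in block_failure_probs)


def criterion_c2(block_exceedance_probs, c_p, c_c):
    """C2: same structure, but up to one failure per mission in a block.

    block_exceedance_probs[k][i-1] is the probability that the
    deterioration accumulated in block k exceeds i times the failure
    threshold, for i = 1..number of missions in block k.
    """
    return sum(c_p + c_c * sum(probs) for probs in block_exceedance_probs)


if __name__ == "__main__":
    # Hypothetical schedule with two blocks (2 missions, then 1 mission).
    c_p, c_c = 100.0, 1000.0
    print(criterion_c1([0.10, 0.20], c_p, c_c))
    print(criterion_c2([[0.10, 0.01], [0.20]], c_p, c_c))
```

Since the first exceedance probability of each block is its failure probability, the second criterion is never smaller than the first for the same schedule.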


3.2.2 A genetic algorithm based method 
A genetic algorithm (GA) is developed to solve the joint scheduling
problem for missions and maintenance using a maintenance cost based
criterion defined in section 3.2.1. The choice of this heuristic method
is justified by its adaptability regarding the definitions of the
individuals and operators, as well as its ability to reach a good
solution in a satisfactory computation time. The genetic algorithm
principle is based on the hybrid genetic-immune algorithm proposed by
Ladj et al. (2016). The operators are adapted to be applied in a random
deterioration case, as Ladj et al. (2016) develop the algorithm for a
deterministic deterioration case. Its general principle is illustrated in
Figure 3. The different stages are explained in the following paragraphs.

A parameter is defined to condition the block filling. As the
deterioration evolution is stochastic, the exact deterioration level
reached at the end of a mission block cannot be known. However, a failure
probability can be estimated according to the missions in the block. This
parameter acts on the maximum admissible failure probability for a block.
It is preferable to dispatch a vehicle on a mission block whose failure
probability is not too high, so as to avoid corrective maintenance as far
as possible. This parameter is denoted $P_{\max}$ and its value is
between 0 and 1. The closer to 1 it is, the more flexible the genetic
algorithm is to generate individuals. Acting on this parameter guides the
genetic algorithm towards the best joint schedule.

Individual representation: The individuals correspond to candidate schedules for both
maintenance and missions. In the GA, the genotype is obtained by sequencing the
mission set into different blocks. In a block, the missions are assumed to be
completed one after the other. A preventive maintenance operation occurs at the end
of each block. For instance, for a problem with n = 6 missions to schedule, a
possible candidate schedule is $\pi$ = {(6,2)(5,3,4)(1)}.

Figure 3. Genetic algorithm principle.
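The block representation described above can be sketched as a list of tuples, one tuple per block. The helper name `is_valid_schedule` is hypothetical and only checks that every mission is scheduled exactly once; it says nothing about the block filling condition.

```python
def is_valid_schedule(schedule, n_missions):
    """True when every mission 1..n_missions appears exactly once in the blocks."""
    missions = [m for block in schedule for m in block]
    return sorted(missions) == list(range(1, n_missions + 1))


# Candidate schedule for n = 6 missions: three blocks, hence three
# preventive maintenance operations (one at the end of each block).
schedule = [(6, 2), (5, 3, 4), (1,)]
n_preventive = len(schedule)
```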

Initial population: The initial population is composed of $N_{pop}$ individuals. 60%
of them are randomly generated: each mission is randomly put in a block while
respecting the block filling condition. Special techniques are applied to generate
the remaining individuals. 20% of the population is generated using the First Fit
(FF) method (Coffman et al. 1996). This method is applied to random mission
sequences to form candidate solutions by assigning the missions to the blocks. The
last 20% is generated using two heuristic methods called First Fit Decreasing (FFD)
and Best Fit Decreasing (BFD) (Coffman et al. 1996). Then, block permutations are
performed on these solutions to generate other individuals. Note that the individuals
generated by these techniques still respect the block filling condition.
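As an illustration of the First Fit idea under a block filling condition, the sketch below places each mission in the first block whose failure probability stays below a limit. The block failure probability used here is a deliberately simple stand-in (independent mission failures), which is an assumption of this example; the paper computes it from an equivalent deterioration process per block. The function names and the limit 0.012 are hypothetical.

```python
import random


def block_failure_prob(block, p_fail):
    # ASSUMPTION: independent mission failures, used only as a simple stand-in
    # for the equivalent deterioration process of the paper.
    survive = 1.0
    for mission in block:
        survive *= 1.0 - p_fail[mission]
    return 1.0 - survive


def first_fit(sequence, p_fail, p_max):
    """Put each mission in the first block that still satisfies p_max."""
    blocks = []
    for mission in sequence:
        for block in blocks:
            if block_failure_prob(block + [mission], p_fail) <= p_max:
                block.append(mission)
                break
        else:
            # No existing block can take the mission: open a new one.
            blocks.append([mission])
    return blocks


# Mission failure probabilities of dataset A; the sequence is drawn at random.
p_fail = {1: 0.002, 2: 0.009, 3: 0.004, 4: 0.002, 5: 0.002, 6: 0.01}
sequence = random.sample(sorted(p_fail), k=len(p_fail))
candidate = first_fit(sequence, p_fail, p_max=0.012)
```

First Fit Decreasing would apply the same routine to a sequence sorted by decreasing mission severity instead of a random one.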

Evaluation: The evaluation stage consists in applying the fitness function Fit to
evaluate the quality of an individual in the population. The fitness function is
related to the maintenance-cost-based criteria defined in section 3.2.1. Two fitness
functions for an individual $\pi$ are defined in Eq. 6. The best individuals are the
ones for which the maintenance costs are the lowest.

$$Fit_1(\pi) = \frac{1}{C_1(\pi)} \quad \text{and} \quad Fit_2(\pi) = \frac{1}{C_2(\pi)} \tag{6}$$


Selection: The selection operator is a 2-tournament selection operator (Michalewicz
1996). It randomly chooses two individuals in the current population and selects the
fittest one. The selection stage is over when the number of selected individuals
reaches $N_{pop}$.
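A 2-tournament selection can be sketched in a few lines. This is a generic version under the assumption that fitness is to be maximized; the function name and the `rng` argument are choices made for this example only.

```python
import random


def tournament_selection(population, fitness, n_select, rng=random):
    """Binary tournament: draw two distinct individuals, keep the fitter one."""
    selected = []
    while len(selected) < n_select:
        first, second = rng.sample(population, 2)
        selected.append(first if fitness(first) >= fitness(second) else second)
    return selected
```

Because the two contestants are drawn without replacement, the least fit individual of the population can never win a tournament, which gives a mild selection pressure while keeping diversity.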

Crossover: The crossover operator crosses two parent individuals to obtain new
offspring individuals. A crossover probability $P_c$ defines whether or not the
parents will procreate. The principle is to randomly select a pair of parents. The
blocks of both parents are listed and a random selection is made to copy
non-overlapping blocks to both offspring individuals, so that no mission is
duplicated. Then, for each offspring, the remaining missions are randomly added to
blocks, either to an existing one while respecting the block filling condition or to
a new one. This principle is inspired by the crossover operator developed by
Rohlfshagen and Bullinaria (2010).

Mutation: The mutation operator is based on swap mutation (Michalewicz 1996) and
consists in exchanging two randomly selected missions from two different blocks while
respecting the block filling condition. This swapping occurs with a probability $P_m$.
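The swap mutation can be sketched as follows. The sketch assumes non-empty blocks and, for brevity, leaves out the block filling check that the operator described above also applies; the function name is hypothetical.

```python
import random


def swap_mutation(schedule, p_m, rng=random):
    """With probability p_m, exchange two missions taken from two distinct blocks."""
    mutant = [list(block) for block in schedule]
    if len(mutant) < 2 or rng.random() >= p_m:
        return mutant  # no mutation: return an unchanged copy
    i, j = rng.sample(range(len(mutant)), 2)
    a = rng.randrange(len(mutant[i]))
    b = rng.randrange(len(mutant[j]))
    mutant[i][a], mutant[j][b] = mutant[j][b], mutant[i][a]
    # NOTE: a full operator would re-check the block filling condition here
    # and reject the swap if a block exceeds the admissible failure probability.
    return mutant
```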

Population dispersion: This stage periodically evaluates the dispersion of the total
population, which is of great importance to avoid converging towards a local optimum.
The total population includes the parents and their children with their possible
mutations and contains $2N_{pop}$ individuals. As explained by Ladj et al. (2016),
diversification and intensification are two major issues when building effective
search algorithms. Diversification refers to the capacity to explore different regions
of the search space, while intensification refers to the ability to generate highly
fit solutions in those regions. A balance between these two notions has to be found.
When the iteration number $n_i$ is a multiple of the iteration period $i_p$, the
population dispersion is estimated through the Coefficient of Variation (CV), also
known as the Relative Standard Deviation (RSD) (Ladj et al. 2016). CV considers the
fitness value of each schedule $\pi_i$ in the total population Pop (Eq. 7). According
to its value, different decisions are adopted to enhance either the diversification
or the intensification.

$$CV = \frac{\sigma}{\mu} \times 100\% \tag{7}$$

with

$$\mu = \frac{1}{2N_{pop}} \sum_{\pi_i \in Pop} Fit(\pi_i) \quad \text{and} \quad \sigma = \sqrt{\frac{1}{2N_{pop}} \sum_{\pi_i \in Pop} \left( Fit(\pi_i) - \mu \right)^2}$$
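The dispersion measure is straightforward to compute. The sketch below uses the population standard deviation, in line with the division by the total population size, and maps the result to the two corrective stages described next. The threshold defaults (10 and 60) are the values listed among the algorithm parameters; the function names are hypothetical.

```python
from statistics import fmean, pstdev


def coefficient_of_variation(fitness_values):
    """CV in percent: population standard deviation over the mean fitness."""
    mu = fmean(fitness_values)
    sigma = pstdev(fitness_values, mu)
    return sigma / mu * 100.0


def dispersion_action(cv, eps_min=10.0, eps_max=60.0):
    """Choose the stage triggered by the current dispersion level."""
    if cv < eps_min:
        return "receptor editing"  # population too concentrated
    if cv > eps_max:
        return "generation of new good solutions"  # population too scattered
    return "none"
```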


Receptor editing: When the population dispersion CV is lower than a minimum
dispersion threshold $\varepsilon_{min}$, the individuals are very similar and focused
on a limited region of the search space. The receptor editing operator aims at
reducing the risk of premature convergence (Ladj et al. 2016). A part of the least
fit individuals ($\alpha$% of the population) is eliminated and replaced by random
new individuals in order to explore new regions of the search space.

Generation of new good solutions: When the population dispersion CV is higher than a
maximum dispersion threshold $\varepsilon_{max}$, the individuals cover distinct
regions of the search space. To promote the most promising regions, a part of the
least fit individuals ($\alpha$% of the population) is replaced by mutations of the
fittest individuals.

Replacement: The individuals of the next generation are selected among the total
population formed by the parents and the children. A part of the least fit
individuals ($\beta$%) is directly added to the next generation, while the fittest
members among parents and children complete the next generation until it contains
$N_{pop}$ individuals.

Termination condition: The genetic algorithm terminates after $i_{max}$ iterations
and returns a joint schedule for missions and maintenance operations that minimizes
the corrective and preventive maintenance costs. The results are illustrated through
application examples.


4 APPLICATION EXAMPLES 


4.1 Simulated data 


Two datasets of n = 6 missions represent two scenarios (Tables 1 & 2). The difference
between them lies in the range of the mission failure probabilities. For dataset A,
the maximum mission failure probability is 1%, while for dataset B it is 19%.

Dataset A better highlights the influence of changes in the mission variances, while
dataset B better illustrates the sensitivity studies on the ratio between the
preventive and corrective maintenance costs and on the effect of $P_{f_{\max}}$.


Table 1. Dataset A.

Missions   Durations (h)   α_m    β_m   Failure probabilities
1          21              0.13   0.1   0.002
2          21              0.18   0.1   0.009
3          8               0.4    0.1   0.004
4          8               0.33   0.1   0.002
5          2               1.33   0.1   0.002
6          3               1.32   0.1   0.01


Table 2. Dataset B.

Missions   Durations (h)   α_m    β_m   Failure probabilities
1          9               0.85   0.1   0.189
2          4               1.85   0.1   0.163
3          3               1.34   0.1   0.011
4          6               0.93   0.1   0.048
5          6               0.90   0.1   0.041
6          2               3.81   0.1   0.184

Table 3. Parameters definition.

Parameter   Value   Parameter   Value
C_p         1000    C_c         3000
L           100%    N_pop       30
P_c         0.7     P_m         0.1
i_p         4       i_max       100
α           20%     β           20%
ε_min       10      ε_max       60

Table 4. Performance results.

                Dataset A           Dataset B
Criterion       C_1      C_2        C_1      C_2
C_m             3787.2   3788.3     7898.4   7905
GA: T_c (s)     2.34     2.90       3.02     3.24
EM: T_c (s)     7.79     7.16       3.47     3.48


The preventive and corrective maintenance costs $C_p$ and $C_c$, the failure
threshold $L$ and the genetic algorithm parameters have to be initialized (Table 3).
The maximum failure probability $P_{f_{\max}}$ is fixed at 0.95 for all the studies
except the one on the variance changes (4.3.3), for which it is equal to 0.1.


4.2 Performances analysis 


The performance analysis is carried out for both datasets and both criteria defined
in section 3.2.1, on a computer with an Intel® Xeon® CPU E3-1240 v5 @ 3.50 GHz and
16.0 GB RAM. For each dataset, 1000 realizations are generated to compare the
maintenance cost and the computation time obtained with the Genetic Algorithm (GA)
and with an Exact Method (EM). The exact method considers all the possible schedules
respecting the block filling condition, together with their associated criterion
values, to find the schedule minimizing the criterion. The maintenance cost and the
computation time are respectively denoted $C_m$ and $T_c$. The results are reported
in Table 4.
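The exhaustive search can be pictured as an enumeration of all groupings of the missions into blocks. The sketch below enumerates set partitions; treating blocks as unordered and missions as unordered within a block is an assumption of this example, motivated by the fact that the criteria depend only on the content of each block. The function name is hypothetical.

```python
def set_partitions(items):
    """Yield every way to split items into unordered, non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or open a new block for it.
        yield [[first]] + partition


# Upper bound on the number of candidate schedules for n = 6 missions,
# before the block filling condition removes the infeasible ones.
n_candidates = sum(1 for _ in set_partitions([1, 2, 3, 4, 5, 6]))
```

For six missions this gives 203 groupings. The count follows the Bell numbers and already reaches 21147 for nine missions, which illustrates why exhaustive evaluation quickly becomes expensive when each candidate requires failure probability computations.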


Dataset A: The optimal joint schedule obtained with the exact method is the same for
both criteria. This schedule is $\pi_{opt}$ = {(1,3)(2,5)(4,6)}. The computation time
gains with the genetic algorithm are quite significant: 70% for the criterion $C_1$
and 62.6% for the criterion $C_2$.

Dataset B: The optimal joint schedule obtained with the exact method differs
according to the chosen criterion. For the criterion $C_1$, the optimal schedule is
$\pi_{opt}$ = {(1)...}, while for the criterion $C_2$ it is $\pi_{opt}$ = {...(5)(6)}.
The computation time gains are lower than with dataset A: they are respectively 13.1%
with $C_1$ and 7% with $C_2$. This difference comes from the number of feasible
schedules when respecting the block filling condition. Indeed, the numbers of
feasible schedules for datasets A and B are respectively 199 and 80, which explains
why the exact method is faster for dataset B.

For all the 1000 realizations with both datasets and criteria, the genetic algorithm
converges towards the same schedule as the one obtained with the exact method. The
only differences that can occur are permutations between the blocks or permutations
of missions within the same block. As the criteria are computed from the equivalent
deterioration process of each block, the mission order has no influence on the block
failure probabilities. The maintenance costs for the criterion $C_2$ are slightly
higher because multiple failures are considered in the blocks, which increases the
associated corrective maintenance cost.

The genetic algorithm computation time is higher for the criterion $C_2$ than for the
criterion $C_1$. For datasets A and B, the increases are about 24% and 7.4%. As the
criterion $C_2$ considers the probabilities of multiple failures, more calculations
are needed when computing the fitness values of the candidate schedules.

The results show the benefit of using a genetic algorithm based method instead of an
exact method to converge towards an optimal joint schedule for missions and
maintenance while saving computation time. Having shown the genetic algorithm
performance, it is essential to study its sensitivity to several parameters in order
to analyse its behaviour.


4.3 Sensitivity study 


This section studies the influence of some param- 
eters on the genetic algorithm behaviour, on the 
obtained joint schedule and on its associated per- 
formance. For each study, the dataset is selected to 
illustrate at best its behaviour. 


4.3.1 Impact of the ratio R

The first study addresses the impact of $R$, the ratio between the preventive and
corrective maintenance costs, respectively denoted $C_p$ and $C_c$. It is carried out
with dataset B. $R$ varies between 0.1 and 1 with a step of 0.1, and the preventive
maintenance cost $C_p$ takes the value defined in Table 3.

For both criteria, the number of blocks in the schedule decreases when $R$ increases,
i.e. when the corrective maintenance cost decreases (Table 5). When the corrective
cost is very high with respect to the preventive one, the best schedule favours a
high number of blocks, because grouping missions increases the failure risk in each
block and leads to a significant inflation of the maintenance cost. On the contrary,
when the two maintenance costs are similar, grouping the missions in a few blocks is
less expensive.

Note that the reduction of the number of blocks is faster for the criterion $C_1$.
When $R$ is equal to 0.5, the optimal schedule for the criterion $C_1$ is composed of
4 blocks, while the one for the criterion $C_2$ has 5 blocks.


4.3.2 Effect of the variations of $P_{f_{\max}}$
This section studies the effect of the maximum admissible failure probability for a
block, $P_{f_{\max}}$, on the optimal schedule computation. This probability aims at
guiding the genetic algorithm to converge faster towards the optimal solution. The
smaller the probability is, the stricter the constraint on the block filling is.

This part is illustrated for dataset B with the criterion $C_1$. Table 6 shows that
when $P_{f_{\max}}$ increases, the number of blocks decreases. For most
$P_{f_{\max}}$ values, the optimal schedule is composed of 5 blocks. When
$P_{f_{\max}}$ is equal to 1, the optimal schedule groups all the missions in the
same block because it is less expensive for the maintenance.


Table 5. Optimal schedule when R varies (Criterion C_1).

R     Maintenance cost   Optimal schedule
0.1   12350              {(2)(6)(4)(1)(5)(3)}
0.2   9175               {(6)(2)(3)(1)(4)(5)}
0.3   8116.7             [illegible]
0.4   7415.4             {(4)(1)(6)(5,3)(2)}
0.5   6901.5             {(2)(1)(6)(5,4,3)}
0.6   6345.4             {(2)(6,1)(3,5,4)}
0.7   5867.4             {(5,3,4)(1,6)(2)}
0.8   5509               {(1,6)(5,4,3)(2)}
0.9   5230.2             {(1,6)(2)(5,3,4)}
1     5007.2             {(5,3,4)(2)(1,6)}


Table 6. Optimal schedule when P_fmax varies (Criterion C_1).

P_fmax      Maintenance cost   Optimal schedule
0.2         7905               {(3)(4)(5)(6)(1)(2)}
0.3         7905               [illegible]
[0.4;0.9]   7898.4             {(4)(2)(1)(3,5)(6)}
1           4000               {(1,5,6,4,3,2)}

As expected, when the block filling constraint is relaxed, the optimal schedule is
composed of fewer blocks. Note that when $P_{f_{\max}}$ is equal to 1, the optimal
schedule does not necessarily consist of a single block. For dataset A, it is
composed of 3 blocks for both criteria.


4.3.3 Influence of missions variance changes
In this part, the influence of changes in the mission variances is studied to
evaluate the genetic algorithm behaviour. For this specific study, the block filling
condition $P_{f_{\max}}$ is fixed at 0.1, to make sure that only the effects of the
variance changes are illustrated.

The initial dataset is dataset A. As $P_{f_{\max}}$ is reduced with respect to
section 4.2, the optimal joint schedule differs. With both criteria, it is
$\pi_{opt}$ = {(2)(6)(5,1)(4,3)}. Based on the missions of this dataset, three
datasets are generated. The mission durations are identical, but the parameters of
the deterioration processes are modified so that the expected value of the increments
for each mission remains the same from one dataset to another. For each mission, the
variance of the deterioration increment is either increased or decreased with respect
to the initial variance (Tables 7 & 8). P denotes the mission failure probability.


Table 7. Datasets when increasing the variance by 2 or 5.

            Variance × 2           Variance × 5
Missions    α_m    β_m    P        α_m    β_m    P
1           0.07   0.05   0.02     0.03   0.02   0.05
2           0.09   0.05   0.04     0.04   0.02   0.09
3           0.20   0.05   0.02     0.08   0.02   0.07
4           0.16   0.05   0.01     0.07   0.02   0.05
5           0.66   0.05   0.01     0.27   0.02   0.05
6           0.66   0.05   0.04     0.26   0.02   0.09

Table 8. Dataset when decreasing the variance by 2.

            Variance / 2
Missions    α_m    β_m    P
1           0.27   0.2    4.2 × 10^-?
2           0.37   0.2    5.8 × 10^-?
3           0.79   0.2    1.1 × 10^-?
4           0.66   0.2    2.5 × 10^-5
5           2.65   0.2    2.7 × 10^-?
6           2.64   0.2    7.2 × 10^-4

When the variances are increased by a factor of 2 and 5, the optimal joint schedules
are respectively {(5,4)(6)(1)(3)(2)} and {(5)(6)(1)(2)(4)(3)}. Increasing the
variances increases the number of blocks in the optimal joint schedule. Indeed, it
also increases the failure probability of each mission, so that it becomes harder to
group the missions into blocks. On the contrary, when the variances are reduced by a
factor of 2, the optimal joint schedule is {(4,6)(3,1)(2,5)}: the number of blocks is
reduced. If the variances continue to decrease, the optimal schedule remains the
same, because reducing the number of blocks further would lead to failure
probabilities exceeding $P_{f_{\max}}$. These results are explained by the fact that
the failure uncertainty increases with the variance.


4.4 Larger size problems 


The larger the number of missions to schedule, the harder it becomes to compare the
genetic algorithm results with those of the exact method, owing to the exact method
computation time: the number of feasible schedules becomes too large. When
considering n = 9 missions with $P_{f_{\max}}$ equal to 0.95, the exact method needs
more than 11 days to reach the optimal schedule. With the genetic algorithm, the
average computation times for the criteria $C_1$ and $C_2$ are respectively 2.63 s
and 3.77 s. Note that when using $C_1$, 16% of the 1000 realizations do not reach the
optimal solution, and the maintenance cost deviation represents 1.45% of the optimal
maintenance cost. Considering several failures per block gives a better maintenance
cost estimation and a more coherent cost surface from one iteration to another. The
convergence is then improved and, when using $C_2$, all the schedules are identical
to the optimal one.

To improve the genetic algorithm convergence, the population size $N_{pop}$ and the
maximum iteration number $i_{max}$ can be increased, although convergence is still
not guaranteed. This comes at the expense of the computation time, which grows with
both the population size and the iteration number.


5 CONCLUSION 


A static method based on a genetic algorithm is proposed to schedule missions and
preventive maintenance for a vehicle by optimizing a maintenance-cost-based
criterion. Two criteria consider either a single failure or multiple failures in the
blocks and include the vehicle deterioration model, which evolves according to the
severity of the different missions. The performance and sensitivity of the method are
described through application examples. The genetic algorithm converges towards the
optimal schedule in a satisfactory computation time.

The obtained results are promising and open up perspectives for improvement. Dynamic
information, such as the deterioration level or the failure occurrences, could be
integrated to update the schedule and reduce the maintenance costs even more.
New missions could be added during the schedule completion. These different points
will be further investigated to evolve towards a dynamic scheduling method.

A concept for a holistic risk-based operation and maintenance strategy
for wind turbines

ABSTRACT: Operational wind turbines are highly loaded structures. In particular, the
high number of load cycles, up to 10^9, leads to the conclusion that these renewable
energy structures need a suitable structural health monitoring and management
strategy. In times of energy transition from conventional to renewable power plants,
safe and reliable operation of these assets is a prerequisite. This paper introduces
an approach for determining risk-based operation and maintenance strategies for wind
turbines. The approach combines Bayesian decision analysis with concepts known from
structural reliability analysis. Information from sensors can be used in
probabilistic models to update the incomplete knowledge of the state of nature.
Sensors are a cost-effective risk-reduction measure and can be used for the
quantification and reduction of uncertainties. The combination of the described
methods is therefore able to optimize maintenance and inspection activities according
to the associated costs.


1 INTRODUCTION 


A design lifetime of 20 years is generally assumed for modern wind turbines. The
extension of service life beyond those 20 years has to be justifiable from a
technical, safety, and economic point of view. Especially important are all
load-transferring components that are relevant for the structural integrity of the
wind turbine and the control and protection system.

The supporting structure of a wind energy system is a large capital expenditure
(CAPEX), on average between 20 and 30% of the overall capital expenditure. During the
service life of wind energy systems, the supporting structure is particularly
important and at risk. While drive train, generator and rotor blade systems are
already monitored for structural health, this is not the case for the supporting
structure.

As shown in the failure statistic in Figure 1, failures in the supporting structure
cause the highest and most expensive downtime events. From both a technological and
an economic point of view, it is worthwhile to use the supporting structure of a wind
turbine system as long as safety and reliability levels can be fulfilled at
reasonable cost (Geiss 2014).

Figure 1. Wind turbine damage statistic (WInD-Pool 2017).

As shown in Figure 2, about 1,000 wind turbine systems have already exceeded their
20-year design service life. The share of the German wind energy installed capacity
that is already in the second half of its service life is 41%.

The described situation stresses the need for new maintenance strategies and life
cycle assessments of onshore wind turbines. Especially challenging for existing wind
turbine structures is that the original design documents often cannot be used as a
reference for a remaining useful lifetime analysis.

According to the DNV GL guideline (DNV GL AS 2016), there exist four methods of
lifetime extension analysis:

1. Lifetime extension inspection
2. Simplified approach
3. Detailed approach
4. Probabilistic approach

The assessment shall always be based on a combination of an analytical part and a
practical part. The analytical part incorporates an assessment based on new or
additional calculations for the wind turbine, considering the site-specific
conditions. The lifetime calculation of the analytical part should be supplemented
with relevant field experience of the wind turbine model concerning weak points,
known failures or retrofits. Figure 3 gives an overview of the different life cycle
phases of a wind turbine system.
The probabilistic approach integrates the use of stochastic methods in the assessment
of structural integrity. Rather than the deterministic values used in the simplified
and detailed approaches, the probabilistic approach uses appropriate probability
distributions to characterize the uncertainty in models and model inputs. The
stochastic parameters in the limit state formulations (such as probability
distribution types, expected values, coefficients of variation, or correlation
coefficients) have to be justified so that they do not introduce errors into the
analyses. According to DNV GL, a structural reliability analysis (SRA) has to be
carried out in the following steps:


1. Selection of a target reliability level 

2. Identification of failure modes in the system 

3. Development of limit state functions (g-functions)
for each failure mode based on engineering
theory

4. Quantification of the deterministic and sto- 
chastic variables within the limit state function 
and their correlations 

5. Use of appropriate methods (e.g. first order 
reliability method—FORM) to compute reli- 
ability indices or probabilities of failure for the 
structural components 

6. Comparison of the computed component 
reliability with target reliability level for each 
component 

7. Analysis of results using sensitivity analysis 


The aleatory and epistemic uncertainties in mathematical models and input parameters
are to be described by appropriate probability distributions, and the nature of the
uncertainty has to be clarified. Any type of measurement can be used to update the
probability models. If aero-elastic models or component resistance are not available,
generic load and resistance models can be used. Based on the probabilistic approach,
risk-based inspection methods may be developed.

Figure 2. Wind turbine age structure (Ziegler et al. 2018).


The two most relevant deterioration mechanisms for a wind turbine supporting
structure are corrosion and fatigue crack growth. The current research approach
focuses on fatigue crack growth as the most critical deterioration mechanism of the
structural system.

Theoretical models exist that describe the specific fatigue mechanisms causing wear
of a structural element. However, these models carry inherent model uncertainties,
which can be reduced by in-field inspection data. Non-destructive testing (NDT)
methods are therefore effective risk mitigation measures for existing structures.


2 DETERIORATION MODELLING 


All deterioration mechanisms are time-dependent and consequently all reliability
problems in fatigue are time-dependent (JCSS 2002). A failure event of a
deteriorating structure can be modelled as a first passage problem. The limit state
function is then also a function of time. Failure occurs when the limit state
function becomes negative for the first time, given that it was positive at $t = 0$.
In that case, the probability of failure between time $t = 0$ and time $t = T$ can be
expressed by the following equation:

$$p_f(T) = 1 - P\big( g(X(t)) > 0, \ \forall t \in [0, T] \big) \tag{1}$$


For most deterioration processes the problem is simplified by the fact that damage is
monotonically increasing with time. If the modelled deterioration problem has a fixed
damage limit, i.e. failure occurs when damage reaches a constant limit, the
deterioration problem can be solved as a time-independent problem. The time variable
$t$ is then a simple parameter of the model and deterioration is treated as a
monotonically increasing process.

For example, if failure has not occurred at time $t_1$, failure has also not occurred
at any time $t < t_1$. Several definitions are possible for the failure rate of the
modelled system. In this case, the annual failure probability of the modelled system
is best fitting. This makes it possible to evaluate the reliability problem at fixed
time intervals:

$$t_i = t_{i-1} + 1\,\text{yr} \tag{2}$$

Figure 3. Life cycle phases of wind turbines.

Consequently, the annual probability of failure
in year $t_i$ can be expressed as follows:


(3) 
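Under the yearly time steps of Eq. (2), one common way to write an annual failure probability is as the increment of the cumulative failure probability over one year. The expressions below are offered as an assumption based on that convention, not as a transcription of the authors' Eq. (3); the second form is the variant conditional on survival up to the previous year.

```latex
\Delta p_f(t_i) = p_f(t_i) - p_f(t_{i-1})
\qquad \text{or, conditional on survival,} \qquad
\Delta p_f(t_i \mid \text{no failure before } t_{i-1})
  = \frac{p_f(t_i) - p_f(t_{i-1})}{1 - p_f(t_{i-1})}
```

For small failure probabilities the two forms are numerically very close, since the denominator stays near 1.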


This is typically the case for an SN fatigue modelling problem: failure is defined to
occur when the accumulated damage reaches $\Delta$, commonly defined as $\Delta = 1$.

In practical engineering applications, the probabilities are computed with Monte
Carlo simulations and FORM algorithms, e.g. (Strurel 2017).
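A crude Monte Carlo estimate of such a failure probability can be sketched as follows. The example assumes that the accumulated damage is a known number and that the damage limit is lognormally distributed with mean 1 and coefficient of variation 0.3, as in the generic SN model quoted later in the paper; the function names and the sample size are choices made for this illustration. The closed-form value is included only as a cross-check.

```python
import math
import random
from statistics import NormalDist


def lognormal_params(mean, cov):
    """Parameters (mu, sigma) of ln(X) for a lognormal X with given mean and COV."""
    sigma = math.sqrt(math.log(1.0 + cov ** 2))
    mu = math.log(mean) - 0.5 * sigma ** 2
    return mu, sigma


def pf_monte_carlo(damage, n_samples=200_000, seed=1):
    """Estimate P(limit - damage <= 0) by sampling the random damage limit."""
    rng = random.Random(seed)
    mu, sigma = lognormal_params(1.0, 0.3)
    failures = sum(rng.lognormvariate(mu, sigma) <= damage for _ in range(n_samples))
    return failures / n_samples


def pf_exact(damage):
    """Closed-form value of the same probability, for comparison."""
    mu, sigma = lognormal_params(1.0, 0.3)
    return NormalDist().cdf((math.log(damage) - mu) / sigma)
```

In a realistic analysis the damage itself is random as well, which is where FORM or more elaborate sampling schemes become useful.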


3 STEEL FATIGUE PROCESS 


Generally, fatigue arises at points of local stress concentration. These so-called
hot spots can be welds, cut-outs and a wide variety of mechanical connection types,
e.g. bolts. Due to inhomogeneities, welds are especially relevant as local hot spots
and stress concentrations. State-of-the-art engineering theory uses specific steel
fatigue models to describe the fatigue process caused by fluctuating stresses.
Fundamentally, the steel fatigue models are subdivided into S-N models, which are
based on experiments, and fracture mechanics models.

A classic formulation is the Basquin equation, which describes a linear relationship
between $\ln N_F$ and $\ln \Delta S$:

$$N_F = C_1 \, \Delta S^{-m_1} \tag{4}$$

$C_1$ and $m_1$ are material parameters determined by experiments.

The theory of linear damage accumulation goes back to Palmgren (Palmgren 1924). The
Palmgren-Miner law is based on the assumption of linear and interaction-free damage
accumulation. The damage accumulated after N cycles is independent of the order in
which the stress cycles occur. The damage increment for a cycle $i$ with stress range
$\Delta S_i$ is:

$$\Delta D_i = \frac{1}{N_{F,i}} \tag{5}$$

$N_{F,i}$ is the number of cycles to failure for $\Delta S_i$, as given by the
associated SN curve. The total accumulated damage after N cycles can be described as
follows:

$$D_{tot} = \sum_{i=1}^{N} \Delta D_i \tag{6}$$

Fatigue failure is reached when $D_{tot}$ reaches a specific value, here expressed as
$\Delta$. Generally, $\Delta$ is modelled with the mean value 1. Subsequently, the SN
fatigue limit state function can be described as:

$$g_{SN} = \Delta - D_{tot} \tag{7}$$

Figure 4. Stress and resistance of a structural system in time.
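The Basquin relation and the Palmgren-Miner sum combine into a short damage calculation. In the sketch below, the default values of the material parameters are the mean of $C_1$ and the slope $m_1 = 3$ from the generic SN model table of the paper; the load spectrum is invented for illustration and the function names are hypothetical.

```python
def cycles_to_failure(delta_s, c1=4.48e12, m1=3.0):
    """Basquin relation: N_F = C_1 * dS^(-m_1), with dS in N/mm^2."""
    return c1 * delta_s ** (-m1)


def miner_damage(load_spectrum, c1=4.48e12, m1=3.0):
    """Palmgren-Miner sum of n_i / N_F,i over (stress range, cycle count) pairs."""
    return sum(n / cycles_to_failure(ds, c1, m1) for ds, n in load_spectrum)


def g_sn(load_spectrum, delta=1.0):
    """SN limit state: positive while the damage limit is not reached."""
    return delta - miner_damage(load_spectrum)


# HYPOTHETICAL load spectrum: (stress range in N/mm^2, number of cycles).
spectrum = [(100.0, 2.24e6), (50.0, 8.96e6)]
damage = miner_damage(spectrum)
margin = g_sn(spectrum)
```

With this spectrum the accumulated damage is 0.75, so the limit state value is 0.25 and failure is not yet reached. Because the sum is order-independent, reversing the spectrum gives the same result.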


The particular stress ranges and numbers of stress cycles are commonly
application-specific. The distribution of these two parameters can often be
approximated by a Weibull or Rayleigh distribution. The Weibull distribution is a
good description for natural processes related to the dynamic response of elastic
systems, e.g. marine structures (Almar-Naess 1984) or wind turbines (Ziegler,
Muskulus 2016a).

The described empirical fatigue design laws and models clearly carry many
uncertainties. The three general uncertainties in the empirical fatigue design
approach are:


• Uncertainty of the fatigue model (SN curve)

• Uncertainty of the fatigue resistance (uncertainty on the applied SN curve)

• Uncertainty about the loading and monitoring sensors


An applicable limit state function for the fatigue design state of a steel structure
can be found in (Straub 2014) and is expressed as follows:


$$g_{SN} = \Delta - \nu \, T \, \mathrm{E}[\Delta D_i] \tag{8}$$


The fracture mechanics approach contrasts with the SN-model approach. The theory
describes the initiation, gradual development and propagation of small cracks in the
microstructure due to cyclic loads. In practice, the crack propagation mechanism is
often combined with corrosion mechanisms.


Table 1. Generic steel SN-fatigue model (Straub 2014). 
Param. Dimension Distrib. Mean COV 
A - LogN 1 0,3 
v yr! Det 10’ 
To yr Det 40 
kas Nmm~ Det 7,448 
B; — LogN 1 0,25 
Rig - Det 0,9 
Ci (Nmm>?)™ Lognormal 4,48e12 0,51 
m = Det 3 
m, - Det 5 
N, — Det 107 
0 =$ Det oo 
d mm Det 16 
(FDF=2). 


The fracture mechanics fatigue models introduce a stress intensity factor, which is
derived from the load intensity and the crack length and which predominantly controls
crack propagation. Fatigue crack growth models are usually based on linear elastic
fracture mechanics theory.

The evolution or lifetime of a crack can be divided into three stages:


• Initiation: number of cycles spent in that phase, $N_I$

• Propagation: number of cycles spent in that phase, $N_P$

• Failure: total number of cycles $N_F$, analogous to the SN approach


$$N_F = N_I + N_P \tag{9}$$


For the purpose of deterioration control, an innovative concept for wind turbine
structures is to combine the SN-model approach with the fracture mechanics approach
into a hybrid probabilistic model, which describes the fatigue crack dimension at any
time during the service life of a wind turbine.

The SN-model approach represents all the design assumptions that have been made to
design the structural element to resist fatigue damage. The fracture mechanics
approach, on the other hand, is suitable for in-service deterioration control of
structural elements: crack width and crack length can be measured by non-destructive
testing methods, e.g. ultrasound or magnaflux.

In every model calibration, a measure of how well a particular calibration fits has
to be introduced. The SN model gives information on whether a hot spot has failed or
survived, whereas the FM model gives the crack dimension after any number of cycles.
Therefore, the FM model has to be calibrated to the SN model so that the critical
crack size is reached in the FM model after the number of cycles to failure.

The parameters to which the model should be fitted are random. A few calibration
algorithms already exist as state of the art in engineering theory.

Considering the described facts, Straub (Straub 2014) developed a hybrid solution for a calibration of the SN-model to the FM model. The probability distribution $F_{N_F}(N)$ of the number of cycles to failure is equivalent to the probability of failure $p_F$ as a function of the number of cycles:

$p_F(N) = F_{N_F}(N)$ (10)

The actual calibration is performed by a least-squares fitting in $\beta$-space:

$\beta = -\Phi^{-1}(p_F)$ (11)


A minimization with respect to the parameters of the fracture mechanics model $x_1, \dots, x_N$ is carried out:

$\min_{x_1, \dots, x_N} \sum_{t} \left( \beta_{SN}(t) - \beta_{FM}(t; x_1, \dots, x_N) \right)^2$ (12)

with

• $\beta_{SN}(t)$ as the reliability index at time $t$ using the SN-model

• $\beta_{FM}(t; x_1, \dots, x_N)$ as the reliability index at time $t$ using the FM model


The evaluation of reliability indexes is performed by a FORM or SORM algorithm. The choice of parameters to be calibrated depends on the applied FM-model. Generally, two parameters have to be calibrated: first, the crack growth rate parameter, which has a large influence on crack growth, and second, the parameter for which the least information is available but which still influences the structural reliability.
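The calibration idea can be illustrated with a minimal sketch. It assumes a purely hypothetical one-parameter function `toy_beta_fm` in place of a real FM-model (which would be evaluated by FORM or SORM), a synthetic SN target curve, and a coarse grid search instead of a proper optimizer; none of these choices come from the cited work.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sigma 1)


def reliability_index(p_f):
    """Transform a failure probability into beta-space: beta = -Phi^{-1}(p_F)."""
    return -STD_NORMAL.inv_cdf(p_f)


def failure_probability(beta):
    """Inverse transform: p_F = Phi(-beta)."""
    return STD_NORMAL.cdf(-beta)


def toy_beta_fm(t, x):
    """HYPOTHETICAL stand-in for the FM-model reliability index in year t.

    The parameter x only controls how fast the index drops with time.
    """
    return 6.0 - x * t ** 0.5


def sum_of_squares(beta_sn, years, x):
    """Least-squares misfit between the SN and FM reliability indexes."""
    return sum((b - toy_beta_fm(t, x)) ** 2 for b, t in zip(beta_sn, years))


def calibrate(beta_sn, years, candidates):
    """Return the candidate parameter value with the smallest misfit."""
    return min(candidates, key=lambda x: sum_of_squares(beta_sn, years, x))


years = list(range(1, 21))
beta_sn = [toy_beta_fm(t, 0.5) for t in years]  # synthetic "SN" target curve
candidates = [k / 10 for k in range(1, 11)]     # coarse grid 0.1 ... 1.0
best_x = calibrate(beta_sn, years, candidates)
```

Fitting in $\beta$-space rather than on the failure probabilities themselves keeps very small probabilities from being swamped by larger ones in the squared error.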

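A solution for the one-dimensional crack growth model can be expressed as follows:

$C_P \, \Delta S^{m_{FM}} (N - N_I) = \int_{a_0}^{a} \frac{da}{Y(a)^{m_{FM}} (\pi a)^{m_{FM}/2}}$ (13)

It follows, for a constant geometry factor $Y$:

$a(N) = \left[ \left(1 - \frac{m_{FM}}{2}\right) C_P \left( Y \pi^{1/2} \Delta S \right)^{m_{FM}} (N - N_I) + a_0^{(1 - m_{FM}/2)} \right]^{(1 - m_{FM}/2)^{-1}}, \quad m_{FM} \neq 2$ (14)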
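As a hedged illustration of the one-dimensional crack growth solution, the sketch below implements the textbook closed form of Paris' law for a constant geometry factor and cross-checks it against a numerical integration. The function names, the units (mm, N/mm²) and all numerical values are assumptions chosen for illustration only.

```python
import math


def crack_depth(n, a0, c_p, m, delta_s, y=1.0, n_i=0.0):
    """Closed-form crack depth a(N) for da/dN = C_P * (Y * dS * sqrt(pi*a))**m.

    Assumes a constant geometry factor Y and m != 2.
    """
    if m == 2:
        raise ValueError("this closed form is only valid for m != 2")
    if n <= n_i:
        return a0  # still in the initiation phase, no propagation yet
    k = 1.0 - m / 2.0
    base = a0 ** k + k * c_p * (y * math.sqrt(math.pi) * delta_s) ** m * (n - n_i)
    if base <= 0.0:
        return math.inf  # crack has grown without bound before n cycles
    return base ** (1.0 / k)


def crack_depth_numeric(n, a0, c_p, m, delta_s, y=1.0, n_i=0.0, steps=1000):
    """Same quantity by fourth-order Runge-Kutta integration, for cross-checking."""
    def rate(a):
        return c_p * (y * delta_s * math.sqrt(math.pi * a)) ** m

    if n <= n_i:
        return a0
    h = (n - n_i) / steps
    a = a0
    for _ in range(steps):
        k1 = rate(a)
        k2 = rate(a + 0.5 * h * k1)
        k3 = rate(a + 0.5 * h * k2)
        k4 = rate(a + h * k3)
        a += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return a


# Illustrative values only: a0 in mm, stress range in N/mm^2.
a_closed = crack_depth(1e6, a0=1.0, c_p=1e-13, m=3.0, delta_s=50.0)
a_numeric = crack_depth_numeric(1e6, a0=1.0, c_p=1e-13, m=3.0, delta_s=50.0)
```

The closed form is what makes repeated reliability evaluations affordable, since no cycle-by-cycle integration is needed.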
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Ziegler (Ziegler, Muskulus. 2016b) is also con- 
ducting research in analyzing the application of 
risk-based inspection methods for deterioration 
control of offshore supporting structures. 


4 CONCRETE FATIGUE PROCESS 


A large number of hybrid supporting structures currently exist onshore, which combine a pre-stressed concrete part as the lower tower section with a conical steel part as the tower top. A holistic research approach therefore also has to consider fatigue performance indicators of concrete elements that can be measured by NDT techniques in the field.

Thiele (Thiele 2016) conducted a study on suitable fatigue performance and damage indicators for concrete specimens in the framework of his doctoral thesis. A main finding relevant to the research here is that the E-modulus tends to be suitable as a fatigue performance indicator for concrete structures. Besides the E-modulus, Thiele investigated deformations, cycle counts, and micro cracks as potential fatigue performance indicators. Ultrasound amplitudes showed a high sensitivity to the damage evolution in the concrete specimens. In both cases, scatter in the measurement campaign was relatively low compared to other measurement principles, e.g. acoustic emission measurements.

Ultrasound measurement is advantageous as an in-service inspection technique, since measurements can be carried out while the structural element is in service, without a special load slope or any other special preparations. However, it must be assured that the coupling of the ultrasound sensors is appropriate.

In a current research project called MISTRALWIND at the Chair of Non-Destructive Testing at the Technical University of Munich, more sophisticated experiments considering those findings will be run in January 2018 (Geiss, et al. 2017).

Figure 5. Variation coefficient of several physical fatigue damage indicators over the relative life time of a concrete specimen (Thiele 2016).


5 RISK-BASED INSPECTION PLANNING 


The concept of risk-based inspection planning has the primary goal of quantifying the effect of inspections on the risk condition of a component and thus enables cost-optimal inspection planning. Given that design deterioration laws represent imperfect knowledge of the component’s in-service deterioration process, risk-based inspection planning is a suitable method for deterioration control. Madsen showed one of the first applications of the concept (Madsen, Krenk, Link 1986).

Uncertainty quantification has two main objectives: first, the quantitative characterization of uncertainties, and second, the reduction of uncertainties. Inspections can reduce uncertainties and update the incomplete knowledge of the structural state, which can be described as an epistemic uncertainty. In many applications in structural asset management, inspections can be a cost-effective risk reduction measure. Empirical statistics for structural systems are rarely available since every structure is more or less unique. Therefore, professional experience with structural failures is scarce. A first approach in this direction is represented by the WInD-Pool project (WIndPool 2017). Qualitative estimations of the probability of failure are not suitable. Ultimately, risk-based inspection planning has to seek the equilibrium between a desired level of reliability and the cost-optimal allocation of maintenance activities in a holistic life cycle asset management framework.


6 RISK-BASED INSPECTION 
ALGORITHM 


In many practical situations, the conditional probability of specific events is of interest, meaning the probability of occurrence of event $E_2$ given the occurrence of event $E_1$. This classical probabilistic dependency is generally handled with Bayes’ rule (Faber 2007):

$P(E_2 | E_1) = \frac{P(E_1 \cap E_2)}{P(E_1)} = \frac{P(E_1 | E_2) P(E_2)}{P(E_1)}$ (15)

From this basic equation, one can derive basic and important definitions for the RBI framework:

• $P(E_1 | E_2)$ is the likelihood, a measure for the amount of information on $E_2$ gained by knowledge of $E_1$. The likelihood measure is typically used to describe the quality of an inspection.

• $P(E_2 | E_1)$ is known as the posterior probability of occurrence of $E_2$, or its updated occurrence probability.

• $P(E_2)$ is the prior probability of event $E_2$, prior to the knowledge of $E_1$.
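A small numerical sketch may help to fix the roles of prior, likelihood and posterior. It assumes that $E_2$ is the event "a defect is present" and $E_1$ the event "the inspection gives an indication"; the probabilities used are invented for illustration.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Bayes' rule for a binary event E2 updated with an observation E1.

    prior                -- P(E2)
    likelihood           -- P(E1 | E2)
    likelihood_given_not -- P(E1 | not E2)
    """
    # Total probability of the observation E1
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence


# Illustrative numbers (assumptions, not taken from the paper):
p_defect = 0.05            # prior probability that a defect is present
p_ind_if_defect = 0.90     # likelihood of an indication given a defect
p_ind_if_no_defect = 0.02  # likelihood of a false indication

# Updated defect probability after an indication
p_after_indication = posterior(p_defect, p_ind_if_defect, p_ind_if_no_defect)

# Updated defect probability after an inspection without indication
p_after_no_indication = posterior(
    p_defect, 1.0 - p_ind_if_defect, 1.0 - p_ind_if_no_defect
)
```

An indication raises the defect probability well above the prior, while an inspection without indication lowers it; how strongly depends on the inspection quality expressed by the likelihood.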


In the probabilistic RBI framework, different inspection outcomes or results are possible. Those different inspection results will trigger different maintenance actions, which are also described by limit state functions.

As relevant inspection outcomes in the RBI framework, the following events can be defined:

• Event of indication of a defect, $I$

• Event of detection of a defect, $D$

• Event of false indication, $FI$

• Event of a defect measurement with a measured size $S_m$


7 INSPECTION MODELLING 


The number of inspection performance models 
today is limited, because round-robin tests for empir- 
ical models are expensive and empirical models for 
one application cannot be transferred to other appli- 
cations. Straub and Faber (Straub, Faber. 2002a, 
Straub, Faber. 2002b, Straub, Faber. 2003) devel- 
oped quantitative models for a risk-based inspection 
framework using Bayesian updating techniques. 

The performance of a non-destructive inspection depends on many parameters, including:

• Defect size and geometry

• Defect orientation

• Environmental condition

• Sensitivity of the inspection method

• Sensitivity of the sensors

• Sensor placement

• Accessibility of the structure tested

• Inspector performance
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Figure 6. Probability of detection vs. defect size. 
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The probability of detection (POD) is the mean 
rate of success when the specific inspection technique is performed under various sources of uncertainty. One- and two-dimensional POD models are typically used for the analysis.

Because the POD-function is a monotonically increasing function, the probability of detecting a crack smaller than or equal to the size s is POD(s). POD(s) becomes 1 for very large crack sizes.

The classical approach of the POD formulation has some shortcomings. The POD-function must be a distribution function, which is not always the case in reality. Furthermore, it is difficult to integrate two-dimensional POD-functions.
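To make the shape of a one-dimensional POD model concrete, the following sketch uses an exponential POD curve. This functional form, the parameter `lam` and the way a false-indication probability `pfi` is combined with the POD are illustrative assumptions, not the model used in the paper.

```python
import math


def pod_exponential(s, lam):
    """ASSUMED one-dimensional POD model: POD(s) = 1 - exp(-s / lam).

    s   -- defect size (e.g. crack depth in mm)
    lam -- hypothetical scale parameter; larger values mean a less
           sensitive inspection technique
    """
    if s <= 0.0:
        return 0.0
    return 1.0 - math.exp(-s / lam)


def probability_of_indication(s, lam, pfi):
    """Probability of an indication: detection, or a false indication otherwise.

    Assumes that false indications occur independently of detection.
    """
    pod = pod_exponential(s, lam)
    return pod + (1.0 - pod) * pfi


sizes = [0.5, 1.0, 2.0, 5.0, 10.0]  # illustrative defect sizes
pod_curve = [pod_exponential(s, lam=2.0) for s in sizes]
poi_curve = [probability_of_indication(s, lam=2.0, pfi=0.05) for s in sizes]
```

The curve starts at zero, increases monotonically and approaches 1 for large defects, which is the behaviour described above.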

Uncertainties inherent in the inspection performance models are:

• Variability due to scatter in the response signal (aleatory uncertainty)

• Statistical uncertainty due to the limited set of trials in experimentally determined POD/POI models (epistemic)

• Model uncertainty due to the empirical nature of the parametric model (epistemic)


Straub and Faber (see citations above) analyzed the aleatory and epistemic uncertainty sources in such models in more detail. Jüngert and Kurz (Kurz, et al. 2011) conducted a research study concerning the POD performance of ultrasound NDT tests. POD performance tests represent a considerable bottleneck in bringing probabilistic approaches into practical and reliable applications.


8 DECISION ANALYSIS 


The ultimate goal in the decision analysis process is the identification of optimal decisions on maintenance actions for deteriorating structures. The decision environment is subject to uncertainty in the following aspects:

• Uncertainty on the state of the system; state of deterioration

• Uncertainty on the performance of the inspection; probability of detection (POD)

• Uncertainty on the performance of repair actions

• Uncertainty on the consequences of failures


One method of evaluating such decision problems is through the deployment of so-called decision trees. Each path of the decision tree is assigned a utility value and its probability characteristics. Considering the point in time and the informational characteristics of the decision problem, three basic situations can be defined.

1. The prior analysis represents a decision analysis with given information. At this stage, the utility function and the probabilities of the various states of nature corresponding to the different consequences have been defined. The decision analysis is reduced to the computation of the expected utilities and finding the optimal point of the optimization problem.

2. The posterior decision analysis represents a decision analysis problem with additional information on the state of nature. If additional information becomes available, e.g. through inspections, the probability structure in the decision problem can be updated. The probability update is carried out using Bayes’ rule.

3. The pre-posterior analysis represents a decision analysis situation dealing with unknown information. The decision maker has the possibility to buy additional information through an experiment. If the cost of this information is small in comparison to the potential value of information, the experiment should be performed. If several experiments are potentially suitable, the decision maker has to choose the experiment yielding the overall largest utility for the decision problem.
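The difference between a prior and a pre-posterior analysis can be sketched for the simplest possible case. The example assumes two states (damaged or intact), two actions (repair or do nothing), a perfect repair, utility equal to negative cost, and an inspection characterised only by a detection probability `pod` and a false-indication probability `pfi`; all numbers are invented.

```python
def best_action_cost(p_damage, c_fail, c_repair):
    """Prior analysis: expected cost of the cheaper action (repair or do nothing)."""
    return min(c_repair, p_damage * c_fail)


def preposterior_cost(p_damage, pod, pfi, c_fail, c_repair, c_insp):
    """Pre-posterior analysis: expected cost if an inspection is bought first.

    Each possible inspection outcome is weighted by its probability, and the
    best action is chosen given the posterior damage probability (Bayes' rule).
    """
    p_ind = pod * p_damage + pfi * (1.0 - p_damage)  # P(indication)
    p_no_ind = 1.0 - p_ind                           # P(no indication)
    cost = c_insp
    if p_ind > 0.0:
        post = pod * p_damage / p_ind
        cost += p_ind * best_action_cost(post, c_fail, c_repair)
    if p_no_ind > 0.0:
        post = (1.0 - pod) * p_damage / p_no_ind
        cost += p_no_ind * best_action_cost(post, c_fail, c_repair)
    return cost


# Illustrative numbers in arbitrary monetary units
cost_without_inspection = best_action_cost(0.1, c_fail=100.0, c_repair=20.0)
cost_with_inspection = preposterior_cost(
    0.1, pod=0.9, pfi=0.05, c_fail=100.0, c_repair=20.0, c_insp=1.0
)
inspection_is_worthwhile = cost_with_inspection < cost_without_inspection
```

In this setting the inspection should be performed whenever its cost is smaller than the reduction in expected cost that the additional information makes possible.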


As a first step, a set of possible events $E_1$ to $E_n$ is defined. Those events can be compared with different outcomes of a game. Events with a large index are preferred over events with a lower index. The decision maker can choose between different actions, also called lotteries. Each action leads to probabilities of occurrence for the different events. For example, action $a$ leads to event $E_1$ with probability $p_1^{(a)}$ and to event $E_2$ with probability $p_2^{(a)}$. Action $b$ leads to event $E_1$ with probability $p_1^{(b)}$, and so forth.

All probabilities of occurrence have to fulfill the basic condition $p_1^{(a)} + p_2^{(a)} + \dots + p_n^{(a)} = 1$.

The utility index is used to express specific preferences of the decision maker, in such a way that one decision is preferred to another if the expected utility of the former is larger than the expected utility of the latter. A utility index u can be assigned to the different basic events $E_1$ to $E_n$.

The final optimization criterion is the expected cost criterion. The utility index u is assigned linearly to monetary value for the considered range of events. The indirect costs associated with the failure events and with repair and inspection are included in the probabilistic modelling. Thus, all consequences of an event have to be expressed in monetary terms, which can also introduce a weakness in the approach due to the lack of reliable information on monetary consequences.

The pre-posterior decision analysis aims to identify the optimal decision on possible inspection actions, enabling optimal maintenance planning.

Concerning the inspection actions, the following has to be determined:

• Where to inspect: location of inspection

• What to inspect: indicator of the system state

• How to inspect: kind of inspection technique

• When to inspect: time of inspection


The expected costs of an inspection strategy have to be determined. Initially, the number of decision tree branches to consider is evaluated:


n, = në" +5 ni i 


(16) 


9 INSPECTION COST MODEL 


The inspection cost model integrates the following cost parameters:

• Expected costs of failure $C_F$

• Cost of inspection $C_I$ as a function of the inspection technique $e$ at time $t$

• Cost of repair $C_R$

• Interest rate $r$

The total expected cost during the service life period $T_{SL}$ is computed as the sum of the expected failure costs, the expected inspection costs and the expected repair costs:

$E[C_T(e, d, T_{SL})] = E[C_F(e, d, T_{SL})] + E[C_I(e, d, T_{SL})] + E[C_R(e, d, T_{SL})]$ (17)
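The structure of this cost sum can be sketched as follows. The sketch is a deliberate simplification: it discounts each contribution with the interest rate r, takes the annual failure probabilities and the repair probability as given inputs, and ignores the conditioning on survival up to each year that a full formulation would include. All names and numbers are illustrative assumptions.

```python
def discount(t, r):
    """Present-value factor for a cost incurred in year t at interest rate r."""
    return 1.0 / (1.0 + r) ** t


def expected_total_cost(annual_pf, inspection_years, p_repair, c_f, c_i, c_r, r):
    """Simplified expected total cost: failure + inspection + repair.

    annual_pf        -- annual failure probabilities for years 1..T_SL
    inspection_years -- years in which an inspection is performed
    p_repair         -- assumed probability that an inspection leads to a repair
    """
    e_failure = sum(
        pf * c_f * discount(t, r) for t, pf in enumerate(annual_pf, start=1)
    )
    e_inspection = sum(c_i * discount(t, r) for t in inspection_years)
    e_repair = sum(p_repair * c_r * discount(t, r) for t in inspection_years)
    return e_failure + e_inspection + e_repair


# Ten-year service life with one inspection in year 5 (illustrative values)
undiscounted = expected_total_cost(
    [0.001] * 10, [5], p_repair=0.1, c_f=1000.0, c_i=1.0, c_r=10.0, r=0.0
)
discounted = expected_total_cost(
    [0.001] * 10, [5], p_repair=0.1, c_f=1000.0, c_i=1.0, c_r=10.0, r=0.05
)
```

With a positive interest rate, costs that occur late in the service life weigh less, which tends to favour postponing inspections and repairs.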


The decision rule has a significant influence on the cost function. A short compilation of possible maintenance decision rules can look like:

1. Repair all defects indicated at the inspection

2. After an indication, perform a measurement and repair only cracks deeper than $a_R$

3. Etc.

For the optimization procedure, a maximum annual probability of failure $\Delta p_F^{max}$ is allowed:

$\min_{e,d} E[C_T(e, d, T_{SL})]$ s.t. $\Delta p_F(e, d, t) \le \Delta p_F^{max}, \quad t = 0, \dots, T_{SL}$ (18)


However, this optimization procedure has two major restrictions:

• The minimum analysis period is one year

• The calculation for the total service life is prohibitive

To overcome the latter restriction, two simplification approaches are possible: the constant threshold approach and the equidistant inspection time approach (Straub 2004).


Applying the constant threshold approach, the optimization parameter is the threshold on the annual probability of failure $\Delta p_F^T$, and an inspection is always performed in the year before $\Delta p_F^T$ is exceeded. The new optimization problem follows as:

$\min_{e, \Delta p_F^T} E[C_T(e, \Delta p_F^T, T_{SL})]$ s.t. $\Delta p_F^T \le \Delta p_F^{max}$ (19)
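How a constant threshold translates into inspection times can be shown with a toy sketch. It assumes, purely for illustration, that the annual failure probability depends only on the number of years since the last inspection and grows quadratically; in a real analysis this quantity would come from the updated reliability model after each inspection.

```python
def plan_inspections(annual_pf, threshold, service_life):
    """Return inspection years under the constant threshold approach.

    annual_pf    -- HYPOTHETICAL function giving the annual failure probability
                    as a function of the years since the last inspection
    threshold    -- maximum tolerated annual failure probability
    service_life -- service life in years
    """
    inspections = []
    last = 0  # year of the last inspection (0 = start of service)
    for year in range(1, service_life + 1):
        if annual_pf(year - last) > threshold:
            inspection_year = year - 1  # inspect in the year before exceedance
            if inspection_year <= last:
                raise ValueError("threshold cannot be met even with yearly inspections")
            inspections.append(inspection_year)
            last = inspection_year
    return inspections


def toy_annual_pf(years_since_inspection):
    """Toy deterioration law, an assumption made only for this example."""
    return 1e-5 * years_since_inspection ** 2


schedule = plan_inspections(toy_annual_pf, threshold=1e-4, service_life=20)
```

Lowering the threshold shortens the inspection intervals and raises the inspection costs, which is why the threshold itself becomes the optimization parameter.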


In that way, an RBI-based approach can be deployed as a holistic and effective structural health monitoring and management strategy, which combines design information with realistic in situ performance measurements using non-destructive testing techniques.


10 OUTLOOK 


The author’s future research will focus on integrating the RBI-approach into the overall asset management strategy of wind turbine systems as well as on developing, validating and maturing the approach in in situ and laboratory tests. Laboratory experiments will be undertaken to broaden the understanding of applicable fatigue performance indicators of concrete structures under fatigue loading, which can be measured with ultrasound-based NDT techniques. Furthermore, validated methods and concepts can be extended to other civil engineering structures that have a high demand for cost-optimal allocation of maintenance activities, such as bridges. Last but not least, it has to be considered how far such models are capable of integrating combined effects of deterioration mechanisms, e.g. corrosion and fatigue deterioration and their interaction.
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Optimising the maintenance strategy for a multi-AGV system using 
genetic algorithms 
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ABSTRACT: Automated Guided Vehicles (AGVs) are playing increasingly vital roles in a variety of 
applications in modern society, such as intelligent transportation in warehouses and material distribution 
in automated production lines. They improve production efficiency, save labour cost, and bring significant economic benefit to end users. However, realising these potential benefits is highly dependent on the reliability and availability of the AGVs. In other words, an effective maintenance strategy is critical in the application of AGVs. The research activity reported in this paper aims to realise an effective maintenance strategy for a multi-AGV system using Genetic Algorithms (GA). To facilitate the research, an automated material distribution system consisting of three AGVs is considered in this paper for methodology development. The movement of every AGV in the multi-AGV system, and the corrective and periodic preventive maintenance of failed AGVs, are modelled using Coloured Petri Nets (CPNs). Then, a GA is adopted for optimising the maintenance and the associated design and operation of the multi-AGV system. The research shows that both the location of the maintenance site and the maintenance strategies adopted for AGV maintenance have significant influences on the efficiency, cost, and productivity of a multi-AGV system.


1 INTRODUCTION

AGVs are increasingly used in modern society owing to their high efficiency, accuracy, low cost and therefore significant economic benefit (Tuan, 2006).

However, with the emerging trend for AGVs designed for more complex tasks, they have become larger and larger in size, and their reliability and maintenance issues have received increasing concern in recent years. To the author’s best knowledge, so far there has not been sufficient research conducted in this area except for a few preliminary studies (Vis, 2006). For example, three major hazards, i.e. collision, tilting over and falling, have been identified during the operation of AGVs (Trenkle, 2013), and a combined Markovian model and neural network were applied to maximise the reliability of AGVs and minimise their repair cost at the same time (Fazlollahtabar, 2013). Little research has been conducted to deal with the maintenance of failed AGVs, except for a control method for enhancing the failure control management of both loaded and unloaded AGVs in an underground transportation system (Ebben, 2001). For this reason, the purpose of this research is to fill this technology gap by developing an optimal maintenance strategy for a typical multi-AGV system. At present, preventive and corrective maintenance are two basic strategies that are popularly adopted in engineering practice (Smith et al., 1973). In the past decades, a number of research studies have been conducted to optimise the maintenance strategies dedicated to various kinds of industrial applications. For example, a simulation was carried out to evaluate the performance of manufacturing production lines with different maintenance policies (Lei, 2010), and the maintenance cost and availability of an aircraft system were optimized using a mathematical replacement model (Fornlöf, 2016).

In this paper, both preventive and corrective maintenance strategies dedicated to a multi-AGV system are studied by the combined use of a Coloured Petri nets (CPN) simulation model and a specifically designed Genetic Algorithm (GA) model.

The remaining part of the paper is organised as follows. A brief description of the multi-AGV system considered in the paper is given first in Section 2; the potential of the CPN in describing the paths, routing and maintenance issues of the AGVs is explored in Section 3; the maintenance strategy of the multi-AGV system is optimised with the aid of GA in Section 4; and the paper ends with a few key research conclusions in Section 5.
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Table 1. Presumed time duration of every phase. 


Phase   Phase length (hours)
1       0.02
2       0.2
3       0.02
4       0.2
5       0.02
6       0.2


2 CONFIGURATION OF THE MULTI-AGV 
SYSTEM 


The AGV transport system described in (Yan, 2017) is also considered in this research. However, instead of considering a single AGV, a more complicated transport system consisting of three AGVs will be investigated in this paper. This will allow the investigation of the interactions between different AGVs and the impact of the failure of one or more AGVs on the operation of the others in the same transport system. In addition, it is worth noting that in this research the subsystems of the individual AGVs are assumed to fail 12 times every year, as cited in (Yan, 2017). The mission of the AGVs is divided into six phases, namely (1) mission allocation and route optimization, (2) dispatch to station, (3) loading of item, (4) travelling to storage, (5) unloading and (6) travelling back to base. In order to facilitate the research, the time duration of every phase is presumed and listed in Table 1 for demonstration purposes. The durations would be different when the AGV is requested to perform different types of missions.

In the model, it is assumed that an AGV will be taken away from the system as soon as it fails, to prevent deadlock and conflicts, so that the downtime of the system due to AGV failures can be minimised. To meet this need, it is essential to optimise the location of the maintenance site in the system to enable the recycle vehicle (the vehicle collecting the failed AGV) to reach and recycle the failed AGV in the shortest time.


3 SIMULATION MODELLING 


3.1 Coloured Petri Nets (CPNs) 


Owing to its efficiency and cost-effectiveness, modelling has been identified as one of the most important approaches to improve the design and operation of a system. In particular, a Petri net (PN) is regarded as one of the most economic and effective tools to model AGV systems (Wu, 1999; Nishi, 2010). The concept of a PN, which is a directed bipartite graph, was developed by Petri (1962). It consists of four types of symbols, i.e. circles, rectangles, arrows and tokens, as shown in Figure 1. Circles represent the places, which may be conditions or states (e.g. mission failure, phase failure, or component failure); rectangles represent the transitions, or more abstractly the actions or events that cause a change of condition or state; arrows connect places and transitions; tokens are small marks that give the PN its dynamic properties. They move via transitions if the enabling condition is satisfied. A PN provides an intuitive graphical representation of a system and allows a flexible description of events.

Figure 1 shows an example of how tokens move through a net. In Figure 1a, two input places and one output place are connected to a timed transition with a time delay t. The input places have arcs with weights 2 and 1, respectively. Once the transition is enabled, and after the time delay t associated with the transition has elapsed, the number of tokens given by the arc weight is taken out of each corresponding input place to fire the transition. In the example, as Figure 1b shows, one more token will appear in the output place.
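The enabling and firing rule described for Figure 1 can be sketched in a few lines. The sketch ignores the time delay t and token colours, and the place names `P1`, `P2` and `P3` as well as the initial marking are assumptions introduced for illustration.

```python
def is_enabled(marking, inputs):
    """A transition is enabled if every input place holds at least
    as many tokens as the weight of its arc."""
    return all(marking.get(place, 0) >= weight for place, weight in inputs.items())


def fire(marking, inputs, outputs):
    """Fire a transition and return the new marking (the old one is not modified)."""
    if not is_enabled(marking, inputs):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place, weight in inputs.items():
        new_marking[place] = new_marking.get(place, 0) - weight  # consume tokens
    for place, weight in outputs.items():
        new_marking[place] = new_marking.get(place, 0) + weight  # produce tokens
    return new_marking


# Two input places with arc weights 2 and 1, and one output place
marking_before = {"P1": 2, "P2": 1, "P3": 0}
input_arcs = {"P1": 2, "P2": 1}
output_arcs = {"P3": 1}

marking_after = fire(marking_before, input_arcs, output_arcs)
```

In a timed or coloured net the same rule applies, with the firing postponed by the delay of the transition and the tokens carrying additional data such as the identity of an AGV.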

However, conventional PN methods are found to be inefficient in describing complex systems or systems that are designed to carry out complex tasks or missions (Jensen, 2015). To address this issue, a more advanced PN method, namely Coloured Petri nets (CPN), was proposed by Rene (1994). In comparison with the conventional PN, each individual token in the CPN is assigned a specific colour, which either represents a different identity or carries different information. Therefore, the tokens are more informative than those in the conventional PN.


3.2 System modelling 


In a multi-AGV system, every AGV needs to be distinguishable, as the AGVs may be located at different positions in the transport system and may fail at different times. In view of the powerful capability of CPNs in describing this kind of complex situation (Wu, 2002; Aized, 2009), CPN is employed in the following research.
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Figure 1. Enabling and switching of transition, 
(a) before enabling transition, (b) after enabling transition. 


To correctly describe the operation and maintenance activities in a multi-AGV system, three types of CPN models are purposely developed and detailed below:

1. Path Petri nets (PPN), for describing the layout configuration of the system;

2. Corrective maintenance Petri nets (CMPN), for defining the corrective maintenance of failed AGVs in the system;

3. Periodic maintenance Petri nets (PMPN), for defining the periodic maintenance of all AGVs in the system.

Herein, the CMPN and PMPN share the AGV failure information and feed their responses into the PPN.


3.2.1 PPN 

A three-AGV dispatching system is considered in this paper. It consists of one AGV base, one pickup station, one storage site, one maintenance site, and a number of transport paths. The base is for storing and recharging the AGVs; the pickup station is the place where items are collected; and the storage is the destination for unloading the items. All these places are assumed to have sufficient space for parking multiple AGVs. To demonstrate the significant influence of the layout configuration on the efficiency of recycling failed AGVs from a multi-AGV system, three different layout configurations are considered, as shown in Figure 2, where MS indicates the location of the maintenance site.

From Figure 2, it is seen that the different layout configurations are distinguished by the location of the maintenance site and the extra paths for recycling failed AGVs. For example, in Figure 2a the maintenance site shares the same space with the base. In Figure 2b the maintenance site is located between the base and the storage; in addition, an extra path between the pickup station and the maintenance site is designed to prevent deadlock caused by the breakdown of AGVs. In Figure 2c the maintenance site is situated at the centre of the system; accordingly, three extra paths are designed to assure its accessibility to AGVs that could fail anywhere in the system. Based on the aforementioned designs, the PPN models for these three layout configurations can be readily constructed by defining the movement directions of the AGVs. For example, the PPN for the configuration in Figure 2b is shown in Figure 3, where only one direction of movement is enabled, and the dotted arrows represent the information flows coming from other CPNs. The tokens in the figure represent AGVs. Once the required action is completed, a token from another CPN enables the corresponding transition.
Then, the AGV token with the same colour can move to the place of the next station.

Figure 2. Three layout configurations: (a) Configuration 1, (b) Configuration 2, (c) Configuration 3.

Figure 3. PPN for the configuration in Figure 2b.


3.2.2 Corrective Maintenance Petri Nets (CMPN)

Once the failed AGVs are recycled and towed back to the maintenance site, the corrective maintenance will be implemented immediately if maintenance engineers are available to work on the failed AGV. If the maintenance engineers are unavailable, the failed AGVs will have to queue. On completing the corrective maintenance, the recovered AGV is assumed to be in perfect condition, as a brand new one is. In the meantime, the maintenance engineer who undertook the repair of this AGV will be released and become available to undertake the repair of other failed AGVs. In the model, a normal distribution function is employed to describe the repair time of the failed AGVs. The developed CMPN model is shown in Figure 4. Once a token exists in both the ‘Failed AGVs recycled’ and ‘Available engineers’ places, a token will be produced in the ‘Under repair’ place. Following the repair process, the AGV will be back in the healthy state. This will be indicated by a token produced in the ‘Up’ place.


3.2.3 Periodic Maintenance Petri Nets (PMPN) 
A PMPN model that considers periodic maintenance has been developed and is shown in Figure 5. Likewise, in this model the recovered AGVs are assumed to be in perfect health condition, as a new one is.

It is worth noting that in the model shown in Figure 5 the three transitions with different colours indicate the failure times of the three AGVs in the system. For simplification, a simple corrective maintenance policy is adopted in this research, i.e. all AGVs in the system will receive periodic maintenance regardless of their actual health condition. Moreover, the corrective maintenance will last only 2 days regardless of the actual condition of the AGVs. For example, in Figure 5 it is assumed that there are m AGVs in healthy condition and n AGVs in faulty condition. The healthy AGVs are in the ‘Up’ place and the faulty AGVs are in the ‘Failed AGVs recycled’ place. Regardless of their actual health status, all AGVs will receive periodic maintenance. Therefore, there will be m+n tokens in the ‘Periodic Maintenance’ place. Accordingly, none of the AGVs in the system will start to work until the 2-day period of corrective maintenance expires. On expiry of this period, all AGVs in the system are assumed to be in perfect health condition, as a brand new one is.

Figure 5. PMPN model.


3.3 Simulation results


By integrating the above CPN models, a more comprehensive model can be readily obtained, which considers not only the particularities of the layout configuration but also the maintenance processes of the AGVs. In order to verify the model and investigate the influences of different maintenance strategies on the operational performance of a multi-AGV system, an algorithm has been developed to simulate the comprehensive model. Its input variables include the failure rate and repair rate of all AGVs, the time taken to perform periodic maintenance, and the phase lengths that are required by the AGVs to deliver assigned tasks.

Firstly, the influence of different layout configurations on the recycle time of failed AGVs is investigated. In the layout configurations described in Figure 2b and c, separate maintenance sites are designed. Such a design significantly reduces the risk of conflict and deadlock and therefore improves the efficiency of the recycle process, although at the cost of the extra space and extra routes needed to enable the operation of such a design. The simulations considering all three types of layout configurations are performed and the corresponding recycle time calculation results are listed in Table 2. From the table, it is found that the recycle time is at its minimum when the maintenance site is placed in the centre (see Figure 2c).

Subsequently, the influence of different maintenance strategies on the performance of the multi-AGV system is investigated. Assuming that the operation time of the system is 10 hours per day, the corresponding simulation results obtained for the layout configuration illustrated in Figure 2b are listed in Table 3. In the table, the number of missions completed is employed as a criterion for performance assessment.

From Table 3, it is found that if without apply- 
ing any maintenance strategy within the period of 


Table 2. Recycle time. 

Location        Recycle time    Extra space    Length of extra route 
indicated by    (hours)         (unit)         required (unit) 
Figure 2a       0.132           0              0 
Figure 2b       0.128           1              3/2 
Figure 2c       0.101           it             33/4 

Table 3. Number of completed missions. 

T            P        N1       N2 
7 days       0.03     11518    11840 
1 month      3.93     12840    15264 
3 months     36.32    9372     15972 
6 months     77.34    6084     16142 
12 months    98.06    3280     16234 


Note: T = time interval of periodic maintenance; 
P = percentage of AGVs failed within the time interval if 
there is no maintenance (%); N1 = number of missions 
completed per year with periodic but without corrective 
maintenance; N2 = number of missions completed 
per year with both periodic and corrective 
maintenance. 


12 months, 98% of the AGVs will fail after completing 
3280 missions. This highlights the added 
value and the necessity of conducting appropriate 
maintenance on the AGVs during their service life. 
Moreover, the values of N2 are larger than the 
corresponding values of N1 in every row of Table 3, 
which shows that corrective maintenance enhances 
the performance of the multi-AGV system, i.e. 
corrective maintenance helps to keep the system 
efficient in the long term, although it incurs extra 
financial and labour costs. 


4 OPTIMISATION OF MAINTENANCE 
STRATEGY 


4.1 Genetic algorithm 


The results obtained from the CPN simulations can 
be used as factors for optimising the maintenance 
strategy of the AGV system. Since the resultant 
optimal maintenance strategy is desired to lead to 
a cost effective and time efficient operation of the 
multi-AGV system, the optimisation considered in 
this research becomes a typical multi-objective 
optimisation problem. In this paper, a Genetic Algorithm 
(GA) is employed to carry out the optimisation. 
The GA is regarded as one of the most popular tools 
for solving this kind of multi-objective optimisation 
problem, owing to its capability of conducting 
optimisation in a global range regardless of initial 
conditions and other derivative factors. Inspired by 
the biological evolution of living species, the GA was 
first introduced by John Holland in the 1970s 
(Holland, 1975). GA has been widely applied to the 
scheduling and dispatching problems of AGV systems. 
For example, Reddy and Rao (2006) applied GA to 
minimise the makespan, mean flow time and mean 
tardiness at the same time. A GA based simulation 
approach was proposed to find the optimal dispatching 
rules in complex environments (Chang et al., 2013). 

To implement the GA optimisation, an initial 
population of individuals (also known as chromo- 
somes consisting of genes) will be generated. The 
fitness of each chromosome is evaluated subject to 
the predefined objective functions. By selecting pairs 
of parents in the population, new chromosomes or 
children can be generated. This is known as crosso- 
ver. The chromosomes with the higher fitness are 
more likely to be selected so that their genes can 
be passed on with higher probability. A mutation 
might also be involved to prevent early convergence 
of the solution. Through repeating such a process, 
the chromosomes with larger fitness values can be 
obtained until an optimal solution is reached. 


4.2 Fitness functions 


Following this idea, a GA program is developed 
in the research to optimise the multi-AGV system. 
The flowchart of the GA program is shown in 
Figure 6. The parameters used in the calculation 
are listed in Table 4. The following two objec- 
tive functions are defined to optimise the system 
design: 

Objective function 1: The maximum number of 
missions completed within a given time 


Mission = max(N,, WNL N, JT) (1) 


Objective function 2: The minimum cost for 
completing the missions 


Gosi min Ne Ny CEN y Ct N Cit 9 
Cre, a Nn . C, + L, . G + Cin ( ) 
where N, = 365/T is based on the assumption that 
there are 365 days in a year. 
Based on the aforementioned two objective 
functions, a fitness function is developed as 


$$\text{fitness} = \frac{\text{Mission}}{\text{Cost}} \qquad (3)$$


The maintenance strategy is optimised subject to: 


$$\text{number of missions completed} \ge 10000 \qquad (4)$$
$$\text{probability of all AGVs failed} \le 0.1 \qquad (5)$$


where equation (4) means that the number of missions 
required to be completed within one year is not 
less than 10,000, and equation (5) means that the 
probability that all AGVs fail is not larger than 10%. 


Figure 6. Flowchart of the GA based optimisation 
program. The steps shown in the flowchart are: define the 
objective and associated fitness functions; initialise the 
program by setting the population scale = 2000, the maximum 
iteration time = 1000, the probability of crossover = 0.7 and 
the probability of mutation = 0.02; generate the original 
population; calculate the fitness function of each individual 
and replace the poorest individual with the best individual; 
select individuals based on a predefined exponential 
distribution function and conduct crossover calculation; 
select individuals in a random way and conduct mutation 
calculation; obtain the new population; calculate the fitness 
function of each individual in the new population; check 
whether the maximum iteration time or the saturated fitness 
value has been reached. 


4.3 Selection 


In the program, an exponential function is spe- 
cifically defined for simulating the ‘survival of the 


Table 4. Parameters used in GA program. 

Parameters                                   Symbol   Value 
Number of AGVs                               N,       3 
Operation cost of an AGV to complete         C,       8 
  a single mission 
Business costs of maintenance site           Ca       10000 (with corrective maintenance) 
  per year                                            5000 (without corrective maintenance) 
Land cost for maintenance site per year      Gr       1000 (share site with AGV base) 
                                                      5000 (separate site) 
Number of missions completed per year        N,,      See the values of N1 and N2 in Table 3 
Time interval of periodic maintenance        T        See the values of T in Table 3 
Periodic maintenance cost per AGV            C,       400 
Recycle time                                 Te       See the values of recycle time in Table 2 
Average time to complete a mission           p        0.66 
No. of maintenance engineers on site         N,       1 
Cost of one engineer in a year               C,       25000 
Total number of failures occurring in the    N,       14 (results obtained using PN) 
  system with corrective maintenance 
  per year 
Average cost for conducting corrective       C,       200 
  maintenance of an AGV failure 
Extra route length                           L,       See the values of length of extra route 
                                                      required in Table 2 
Cost of per unit length extra route          C:       1000 


fittest’ principle in natural evolutionary process. 
For the i-th individual, its probability $P_i$ of being 
selected for participating in the GA crossover 
calculation can be expressed as: 


$$P_i = \frac{e^{w (f_i - f_{\min})}}{\sum_{j=1}^{N} e^{w (f_j - f_{\min})}} \times 100\% \qquad (6)$$


where $N$ denotes the size of the population; $f_i$ is 
the fitness of the i-th individual; $f_{\min}$ is the fitness of 
the poorest individual; and $w$ is a constant for 
controlling the efficiency of population evolution. 
It is worth noting that the larger the value of $w$, 
the more efficient the evolution tends to be. However, 
too large a value of $w$ risks failing to obtain the 
globally optimal results. In this research, $w$ is taken 
to be 100. 
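
The selection rule of equation (6) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `selection_probabilities` and the three fitness values are hypothetical and are not taken from the paper's GA program; the probabilities are returned as fractions rather than percentages.

```python
import math


def selection_probabilities(fitness, w=100.0):
    """Selection probability of each individual, as a fraction in [0, 1].

    Follows the exponential form of equation (6): the weight of individual i
    is exp(w * (f_i - f_min)), normalised over the whole population.
    """
    f_min = min(fitness)  # fitness of the poorest individual
    weights = [math.exp(w * (f - f_min)) for f in fitness]
    total = sum(weights)
    return [x / total for x in weights]


# Hypothetical fitness values, chosen only for illustration.
fitness = [0.084, 0.090, 0.098]
probs = selection_probabilities(fitness, w=100.0)
```

With w = 100, a fitness gap of 0.014 between the best and the poorest individual gives a weight ratio of exp(1.4), so fitter individuals are favoured without the poorest ones being excluded outright. A much larger w would concentrate almost all probability on the current best individual, which matches the warning about losing the global optimum.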


4.4 Coding 


It should be noted that there are three major fac- 
tors, namely the period of periodic maintenance, 
the system configurations and the adoption of 
corrective maintenance. Their values are obtained 
from the CPN simulations. Hence the variation 
ranges of these factors need to be controlled by 
constraints. These three parameters are coded into 
binary numbers and then connected together to 
create a single ‘chromosome’. 


4.5 Crossover operator and mutation operation 


The crossover operation is applied to two ran- 
domly selected chromosomes with the crossover 
rate of 0.7. A one-point crossover is adopted with 
an illustrative example shown in Figure 7. 

The alternating position can be chosen at any 
point within the chromosomes. By combining two 
sets of genes from both parent chromosomes, an 
offspring chromosome can be produced. The mutation 
operation, illustrated in Figure 8, maintains the 
genetic diversity of the population and prevents 
the solutions from being trapped in a local optimum. 
It is implemented with a fixed mutation rate of 0.02. 
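
The two operators described above can be sketched in Python on binary chromosomes. This is a minimal sketch, not the paper's program: the function names, the six-bit example chromosomes and the per-bit form of the mutation are assumptions made for illustration (the paper only states that individuals are selected at random for mutation, with a rate of 0.02).

```python
import random


def one_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length bit strings at index `point`."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate` (assumed scheme)."""
    flipped = {"0": "1", "1": "0"}
    return "".join(flipped[b] if rng.random() < rate else b for b in chromosome)


rng = random.Random(0)  # seeded so the sketch is reproducible
# Hypothetical parent chromosomes; the crossover point is fixed for clarity.
child_a, child_b = one_point_crossover("110100", "001011", point=2)
mutant = mutate(child_a, rate=0.02, rng=rng)
```

In a full GA loop the crossover point would be drawn at random and the crossover itself would be applied to a selected pair with probability 0.7, as stated in the text.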


4.6 GA results and discussion 


Using the developed GA program the location of 
maintenance site and the maintenance strategies of 
the multi-AGV system are optimised through inte- 
grating the two objective functions into one fitness 
function, i.e. the unit mission cost shown in equa- 
tion (3). To illustrate the effectiveness of the GA 
optimisation, a numerical example has been taken. 
By applying the parameters defined in Table 4 to 
the program, the population starts to evolve grad- 
ually. The resultant variation tendency of average 
fitness against the number of evolution times is 
shown in Figure 9. 

From Figure 9, it is found that after the 
population is evolved for about 200 times, the 
average fitness reaches a saturated value. That 
means the optimal design of the multi-AGV sys- 
tem is achieved through 200 times of evolution 
calculations. The optimised results are listed in 
Table 5. 


Figure 7. One-point crossover operator. 

Figure 8. Mutation operator. 

Figure 9. Evolution of GA population (average fitness, 
on a scale from about 0.084 to 0.1, against the number of 
evolution times, from 0 to 200). 

Table 5. Optimal results obtained from GA. 

With corrective maintenance?             Yes 
Location of maintenance site             In the AGV base 
Time interval of periodic maintenance    12 months 
Total cost (£)                           164872 
Mission completed per year               16231 


From Table 5, it is found that: 


Corrective maintenance is indeed essential 
for maintaining the long-term high efficiency of 
a multi-AGV system; 

Arranging the maintenance site to share the same 
place as the AGV base saves the cost of land and 
therefore results in the minimum unit mission cost; 

The positive influence of periodic maintenance 
on the performance of the system cannot be 
demonstrated if the ageing of the AGVs is not 
taken into account in the optimisation. 
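
As a quick arithmetic check of these results, the fitness of equation (3) can be evaluated from the two totals in Table 5. The computation below uses only the values in the table; the symbol names are written out in words because the paper does not give symbols for them.

```latex
\[
\text{fitness} = \frac{\text{Mission}}{\text{Cost}}
              = \frac{16231}{164872} \approx 0.0984,
\qquad
\frac{\text{Cost}}{\text{Mission}} = \frac{164872}{16231} \approx 10.16 .
\]
```

The first value lies within the fitness scale of Figure 9, and the second is the corresponding cost per completed mission in pounds.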


5 CONCLUSIONS 


In order to develop a feasible and efficient approach 
to optimising the design, operation, and mainte- 
nance of a multi-AGV system, the CPN simulation 
models and the GA-based optimisation approach 
are developed in this research. From the research 
results described above, the following conclusions 
can be drawn: 


1. The combined use of CPN and GA has been 
demonstrated to be an effective approach to assessing 
the performance of multi-AGV systems; 

2. This hybrid approach enables the prediction 
of the optimal time interval of periodic maintenance 
and the assessment of the influence of 
corrective maintenance on system efficiency; 

3. The optimisation of the location of the maintenance 
site and of the maintenance strategies can be 
converted into a simple single-objective 
optimisation problem with the unit mission cost 
as the fitness function; 

4. Corrective maintenance is an effective measure 
to maintain the long-term high efficiency of 
the system, although it may lead to extra 
maintenance costs. 


Future work of this research will focus on dealing 
with more complex AGV systems. 

ABSTRACT: Huge rotary machines are commonly used in oil and gas processing plants for separation, 
compression and boosting. Their reliability is of high importance to avoid operation downtime and 
production loss. In this paper, we present a modelling methodology, based on the AltaRica 3.0 modelling 
language and stochastic simulation, to assess the average production level of a compressor drive system. This 
system consists of six trains, each of which contributes one sixth of the total production capacity. 
It runs under two operation modes (full and reduced capacity) corresponding to seasonal demand 
periods (winter and summer). The problem at stake is to design a model at system level that captures the 
various degradation processes, monitoring policies, and maintenance rules involved in the system under 
study. The aging of units is represented by means of multiple degradation levels. Given the unit information 
provided by monitoring and inspection, preventive and corrective maintenance interventions are decided 
locally for each unit. Performance indicators such as the cumulative production and production loss over 
a certain mission time can then be assessed. This paper contributes to the development of engineering 
models for maintenance assessment based on a framework and patterns designed for architecting 
typical oil and gas systems. 


1 INTRODUCTION 

Huge rotary machines are commonly used in oil 
and gas processing plants for separation, compression 
and boosting. Ensuring a high reliability of 
these machines is of primary importance to avoid 
operation downtime and production loss. As of 
today, this high reliability is achieved thanks to 
robust-by-design machines, and by rigorous preventive 
maintenance policies (DNV-GL-RP-002 
2014, API-RP-7L 2006). Nevertheless, rigorous 
preventive maintenance comes with a high cost. 
This is the reason why the oil and gas industry is 
currently moving to the so-called condition-based 
approach (Gustavsson and Eriksen 2005, Markeset 
et al. 2013). In order to deploy a condition-based 
maintenance policy, one needs to assess its 
potential benefit and risk over more traditional 
approaches. 

In this paper, we present the results of a preliminary 
study we made on a compressor drive system. 
This system consists of six trains, where each of 
them contributes to one sixth of the total production 
capacity. The system runs under two operation 
modes (full and reduced capacity) corresponding 
to two distinct seasonal demand periods (winter 
and summer). Maintenance interventions have to 
be scheduled during the low demand season where 
some of the trains can be stopped while still fulfilling 
the demand. 

We present here the modelling methodology we 
used, together with modelling patterns and simulation 
experiments. This methodology relies on the 
AltaRica 3.0 language (Rauzy 2008, Prosvirnova 
2014) and stochastic simulation. It aims at 
developing models that make it possible to answer 
questions like “Will the system survive in the coming 
winter without loss of production? How much 
can we gain/lose by changing inspection interval of 
this component or group of components?” and so 
on. The key performance indicator is the expected 
production loss due to failures and maintenance 
operations over a given time period. 

The object-oriented language AltaRica 3.0 
makes it possible to handle the various modelling 
challenges at stake: It provides mechanisms to 
represent faithfully the various degradation processes, 
monitoring policies, and maintenance rules 
involved in the system under study. It makes it possible 
to represent implicitly very large state spaces; 
It facilitates information propagation through the 
network of components and therefore the calculation 
of key performance indicators; It supports the 
reproduction of basic patterns in order to develop 
models of large systems by assembling seamlessly 
models of components; ... 

The main purpose of this paper is to present this 
modelling framework and to illustrate its applica- 
tion. We report here the results of a number of 
experiments we performed on the model in order 
to run what-if scenarios. We show the various 
possibilities to refine and optimize inspection/maintenance 
policies as well as to assess their impact on 
the production. 

The use of this study is not limited to onshore 
installations with given maintenance problems. 
It also provides a tool to improve knowledge and 
decision making for future subsea installations, 
as there will be more and more seabed compression 
with equivalent and even more complex maintenance 
problems. 

The remainder of this paper is organized as follows. 
Section 2 presents the state of the art of Preventive 
Maintenance (PM) in compressor drive systems. 
Section 3 describes the use case of the compressor 
train system and its assumptions. Section 4 introduces 
the modelling methodology and the design of 
components and system. Section 5 shows numerical 
experiments for assessing various inspection 
and maintenance policies. Section 6 concludes the 
work and discusses future research. 


2 STATE OF THE ART 


Compression is a common practice in O&G 
processing technology. Owing to the complex 
configuration of a compressor drive system, the control 
over such a system turns out to be of paramount 
importance. This section is thus dedicated to a 
state-of-the-art review of compressor drive systems, 
which are normally composed of a VSD, pump, 
compressor, valves and so on. Findings from, for 
instance, (Andersen et al. 2006, Eriksson and Staver 
2010, Eriksson and Konstantinos 2014) and interviews 
with industrial partners are summarized as a 
preparation for our model. 


2.1 Health indicator and degradation modelling 


A typical compressor drive system is arranged 
with multiple trains in parallel where each train 
consists of a number of components (e.g. VSD, 
gear, compressor, valve etc.). Each of them is subject 
to different aging processes (e.g. fouling, wear, 
harmonics, unbalance) which could be revealed in 
several condition monitoring sources as described 
previously. An option to aggregate these sources 
into a single health indicator is the Technical 
Condition Index (TCI) (Nystad 2008, Nystad & 
Rasmussen 2010). It is defined to represent the 
degree of degradation with respect to production 
availability. The overall TCI value of a compressor, 
for instance, is aggregated from the bottom-level 
TCI for each degradation symptom revealed by 
its monitoring source. Its aging measurements 
include, but are not limited to, vibration, seal wear, 
bearing temperature, compressor efficiency and so 
on. Each of them is assigned a weight based on 
expert judgement about its criticality, and it is further 
merged upwards in the hierarchy into the total 
TCI of the compressor. Besides, previous maintenance 
actions and their impact (mainly missing 
or incomplete working logs) can be included in 
the model as left truncated censoring data. Given 
all these inputs, a parametric hazard regression 
model (the Weibull PHM model) is applied in Nystad & 
Rasmussen (2010) to estimate the parameters of the 
degradation model by the maximum likelihood method. 
Once the parameters are available, the expected 
Remaining Useful Lifetime (RUL) given a specific 
time point is tractable with certain confidence. 

In this paper, we rely on an alternative solution 
proposed by Moholt (2016). The health indicator is 
a discrete variable with a finite number of possible 
values from new to fail. It is used by guidelines to 
assess the health of equipment in a more qualitative 
way. Such a simplified model is aligned with 
the amount of data that are currently available for 
the case under study in this paper. Moreover, it is 
more adapted to our modelling framework based 
on discrete states. 


2.2 Intervention and maintenance modelling 


Maintenance programs are often established by 
project teams within each company. They propose 
alternative solutions, demonstrate the advantages and 
disadvantages of the maintenance strategies and 
discuss an optimal solution with manufacturers. 
Most components are under calendar-based maintenance 
together with a condition-based maintenance 
program. 

For calendar-based maintenance, the frequency, 
content, duration and required preparations for 
maintenance vary in the user manual by equipment 
type, size and application. The components 
are analysed in the design phase and assigned 
maintenance levels (L1-L4) according to their criticality 
(ABB 2013). Periodic maintenance is implemented 
according to the plan. 

For condition-based maintenance, a short-term 
preventive maintenance plan is scheduled for 
the next low demand season based on the input 
from condition monitoring and inspection. Such 
decisions are mainly based on experience and 
expert judgement. Notice that for both maintenance 
interventions, there is no possibility to react 
immediately to any degradation or failure in the 
system, owing to the time needed to prepare 
maintenance kits. 

A risk based simulation approach (RBI) has been 
applied to develop the maintenance strategy for subsea 
systems in the Ormen Lange field (Gustavsson & Eriksen 
2005). The lifetime scenario of the system is represented 
by discrete degradation/failure events 
that happen to each component. The cost model quantifies 
the intervention cost and the production loss due to 
unexpected failures over the mission time. Then a 
variety of maintenance strategies are fed into the 
model to calculate the desired performance indicators 
(e.g. average lost production, average repair cost) 
and help to assess the impact of different maintenance 
strategies on the subsea systems. However, 
it is not clear how this model is implemented technically 
and whether the modelling language and 
simulation tool are open to external users. 

In our paper, we illustrate our method to model 
maintenance planning on a system-size compressor 
drive system. The model presented in this paper 
is authored in the integrated modelling environment 
AltaRicaWizard in the framework of the OpenAltaRica 
project. We show the possibilities to develop 
various maintenance strategies in the model, and 
demonstrate the loss/gain of these alternatives. 


3 USE CASE: COMPRESSOR TRAIN 


3.1 System description 


We focus on 6 electrical trains that are used to 
compress the gas. Each of them contributes one 
sixth of the total production capacity. Each compressor 
drive system consists of, from the left: a 
Variable Speed Drive (VSD), a Motor (M), a Gear 
box (G) and the compressor (C), see Figure 1. 
During winter (6 months), full capacity is 
required and all the 6 trains are supposed to be used. 
During the summer, only part of the total capacity 
is required, for instance 1/2. Then only 3 trains are 
needed. The switch from full capacity to reduced 
capacity and back is operated once a year 
(e.g. 01/10 and 01/04). At system level, once one 
train has failed, the failure is revealed naturally by 
the production capacity. Similarly, at train level, if 
one of the components is degrading, it reduces the 
production capacity immediately. 


Figure 1. Diagram of one compressor train. 


3.2 Assumptions of each unit 


3.2.1 VSD and gear 
VSD and Gear are much more reliable than motor 
and compressor. They have only binary states: 
Working (W) and Failed (F) (Figure 2). The failure 
time follows exponential law with low failure rate. 

Both components are continuously monitored. 
In addition, they are easy to repair so the main- 
tenance time is short. Only corrective but no pre- 
ventive maintenance is planned on these units. The 
intervention can occur at any time during opera- 
tion of the system. 

The production of the unit depends on its working 
state and on the production flow plugged into it. 


3.2.2 Compressor 
The compressor is subject to indirect continuous 
monitoring of the degradation process. It has four 
states: Working (W), Degraded 1 (D1), Degraded 
2 (D2) and Failed (F). The time to change between 
these states follows exponential laws, with a 
separate rate for each transition (Figure 3). In W 
state, the compressor runs at full capacity 100%; in 
D1 state, the compressor has fouling and it consumes 
more power to maintain the full production 
capacity; in D2 state, fouling is accumulated and 
compressor capacity decreases below an acceptable 
threshold 80%; in F state, the compressor cannot 
operate any longer and its capacity sinks to 0%. 

When the compressor reaches its D1 state, a preventive 
minor maintenance (e.g. cleaning) is arranged; 
when it reaches D2/F state, a spare part is ordered 
and the preventive/corrective major maintenance 
(e.g. replacement) will be implemented. The duration 
of the maintenance is denoted by δ and the delay 
time to prepare the corresponding intervention 
by ρ, each with a subscript distinguishing minor 
from major maintenance. 

The actual production of the compressor 
depends on its degradation level, the operation 
phase and also the input production passing to it. 


Figure 2. Automaton for VSD and Gear. States: WORKING and 
FAILED; events: failure and repair; flow rule: outFlow := if 
(inFlow=100 and _state=WORKING) then 100 else 0. 


Figure 3. Automaton for compressor. States shown include 
DEGRADED 1, DEGRADED 2 and FAILED; the failure event has 
rate λf; flow rule: outFlow := if (inFlow=100 and 
_state=WORKING and OPERATION) then 100 else if (inFlow=100 
and _state=DEGRADED 1 or DEGRADED 2 and OPERATION) then 80 
else 0. 


Figure 4. Automaton for motor. States shown include HIDDEN 
DEGRADED and REVEALED DEGRADED; the degradation event has 
rate λd; flow rule: outFlow := if (inFlow=100 and 
_state=HIDDEN DEGRADED or REVEALED DEGRADED and OPERATION) 
then 100 else 0. 


3.2.3 Motor 
The motor is under periodic inspection. It has five states: 
Working (W), Hidden Degraded (HD), Revealed 
Degraded (RD), Hidden Failed (HF), Revealed 
Failed (RF) (Figure 4). A periodic inspection is perfect 
and it reveals all the hidden degradations and 
failures. The inspection occurs every T and it lasts for 
A units of time. After detection of degradation (RD), the 
condition can deteriorate further to a failure (RF). 

There could be a discrepancy between the observed 
state and the actual state of the unit. However, the 
maintenance planning is based on the observed 
state of the unit. In W/HD/HF, no action is taken; 
in RD/RF, a spare unit is ordered and the proactive/corrective 
maintenance is implemented, which 
lasts for δ. The delay to prepare the intervention 
is ρ. 

The actual production of the motor depends 
on its working state, the operation phase together 
with the upstream production passing to it. 



As a first study of the compressor drive system, 
we simplify the model by assuming that we have 
enough maintenance teams. We consider neither 
system-level diagnosis nor maintenance planning. 
For the mentioned units, any maintenance 
or inspection stops the production of the train 
concerned, but does not interfere with the behaviour of the 
units on the same line or of the other trains. All the 
components are independent of each other and 
interventions are decided locally. 


4 MODELLING METHODOLOGY 


The modelling formalism that we use here is AltaRica 
3.0 and its underlying mathematical framework, 
Guarded Transitions Systems (GTS). For a formal 
presentation see (Rauzy 2008, Prosvirnova 2014). 
The formalism can handle complex systems with four 
main Modules as explained in (Zhang et al. 2017): 


1. description of individual behaviours of units 

2. description of the actual state of the system 

3. diagnosis on the state of the system 

4. description of maintenance planning and 
actions 


In our case, we simplify the situation by assuming 
that components are independent and that maintenance 
decisions are made locally. Therefore, there 
is neither system-level diagnosis nor decision making. 
We only use part of the framework (Modules 
1 and 2), while Module 4 can be further embedded into 
Module 1. The model consists of two parts, modelling 
units and modelling system, which are explained as 
follows: 


4.1 Modelling units 


The first part describes the behaviour of each unit. 
The purpose is to translate the finite state automata 
of Figures 2, 3 and 4 into the AltaRica language. The GTS 
of each unit describes indigenous events that happen 
locally, for example degradation and 
failure. Meanwhile, it can embed foreign interventions 
that are imposed on top when the local condition 
or clock reaches a certain triggering point, for instance 
start maintenance and start inspection. To clarify 
the mechanism, we use the motor (Figure 4) as an 
example. The AltaRica code implementing the 
GTS for the motor is given in Figure 12. 

The class of a generic inspected unit is designed 
and named Motor (line 5). The motor can be in 
one of the following states: Working (W), Hidden 
Degraded (HD), Revealed Degraded (RD), 
Hidden Failed (HF) and Revealed Failed (RF) 
(line 1). It experiences three alternating stages in 
its life cycle: Operation (OP), Inspection (INSP) 
and Maintenance (MAINT) (line 2). In addition, 
it runs under a winter and a summer profile (line 3). 
Lines 10-16 assign names to events and declare 
their duration with the keyword delay. 

The definition of a transition (line 20-33) 
starts with the name assigned to it, followed by 
several pre-conditions of the transition, which are 
connected by the logic operators and, or. After a 
transition sign comes the final effect of the event, e.g. 
state or phase values are modified. Under such a 
mechanism, the fireable events and their transitions 
mimic the behaviour of the component as 
time elapses. 


Figure 5. A map of maintenance optimization scenarios 
(inputs to the maintenance optimization model: inspection 
interval, season profile, maintenance delay, failure rate, 
time horizon). 


According to the assumptions, the motor degrades 
and fails only in the operation phase. Therefore the 
state changes are preconditioned with _phase==OP 
(line 20-22). The initiation of an inspection follows 
a fixed interval. Upon completion, a hidden degradation 
or failure is detected and meanwhile a clock 
starts to count the time for maintenance preparation 
(line 26-29). The counting, however, does not 
stop further degradation of the motor, so the event 
prepareMaint is defined by the state indicator clock 
independently of the motor _state (line 30). Once the 
preparation is complete, all the maintenance resources 
are ready for use but we have to check that it is 
the low demand season (e.g. summer) to launch 
an actual campaign. If _season==SUMMER, the 
maintenance starts immediately and its completion 
brings the state back to working and resets the clock 
again (line 31-33). 
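
The motor behaviour described above can be sketched as a small transition table in Python. This is an illustration only and not the AltaRica code of Figure 12: the transition set is an assumption read from the prose (for example the move from Hidden Degraded to Hidden Failed), the event names are hypothetical, and the clocks, delays and the inspection and maintenance phases are omitted.

```python
# Assumed (state, event) -> next state moves, read from the motor description.
TRANSITIONS = {
    ("W", "degradation"): "HD",   # hidden until the next inspection
    ("HD", "failure"): "HF",
    ("RD", "failure"): "RF",      # degradation continues after detection
    ("HD", "inspection"): "RD",   # a perfect inspection reveals hidden states
    ("HF", "inspection"): "RF",
    ("RD", "maintenance"): "W",
    ("RF", "maintenance"): "W",
}


def fire(state, event, phase="OP", season="SUMMER"):
    """Return the next motor state, or the same state if the event cannot fire."""
    # Degradation and failure only happen in the operation phase.
    if event in ("degradation", "failure") and phase != "OP":
        return state
    # A maintenance campaign is only launched in the low demand season.
    if event == "maintenance" and season != "SUMMER":
        return state
    return TRANSITIONS.get((state, event), state)


def run(events, state="W"):
    """Apply a sequence of events in the operation phase during summer."""
    for event in events:
        state = fire(state, event)
    return state


final_state = run(["degradation", "inspection", "failure", "maintenance"])
```

The example sequence walks through W, HD, RD, RF and back to W, which illustrates that detection of a degradation does not stop the unit from failing before the maintenance starts.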

The assertion part (line 35-36) describes how to 
update flow variables after each transition firing. 
The production of the motor is 100% when it is 
in operation and not failed, otherwise it produces 
nothing. 

In summary, classes of generic automata can be 
designed for each unit (VSD, Gear, Motor, Com- 
pressor) and season demand following the same 
routine. 


4.2 Modelling system 


The code in Section 4.1 actually defines several 
classes of continuously monitored units (VSD, Gear, 
Compressor) and periodically inspected units 
(Motor) that can be reused many times. Instantiations 
of these classes provide the basic elements 
for constructing a class or prototype of a bigger 
system. For instance, one train of a CDS can be 
described as in Figure 10. 

The code declares a class Train which consists 
of two instances of the basic RepairableUnit, one 
instance of COMPRESSOR and one instance of MOTOR 
(line 2-5). Similarly, for a complete CDS with 6 identical 
trains, the class Train can be instantiated 6 times 
(Train T1, Train T2, ...) to build the real 
structure. The production of one train is determined 
by the input and output flows passing through the series 
of units. Namely, the input of a unit is plugged, by 
the assignment sign :=, into the output of the preceding 
unit (line 9-13). Therefore, the production of one train 
depends on the upstream production and on the 
availability of each unit in series. 
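
The series connection of the units can be sketched in Python as a chain of flow rules. This is a simplified illustration, not the AltaRica code of Figure 10: the function names are hypothetical, every unit is treated as a two-state unit following the rule of Figure 2, and the reduced output of a degraded compressor is left out.

```python
def unit_out_flow(in_flow, working):
    """Figure 2 style rule: pass 100 only if the input is 100 and the unit works."""
    return 100 if (in_flow == 100 and working) else 0


def train_out_flow(unit_states, in_flow=100):
    """Chain units in series: each unit's input is the previous unit's output."""
    flow = in_flow
    for working in unit_states:
        flow = unit_out_flow(flow, working)
    return flow


# Unit order assumed from the system description: VSD, Motor, Gear, Compressor.
all_working = train_out_flow([True, True, True, True])
gear_failed = train_out_flow([True, True, False, True])
```

Because the units are in series, a single failed unit anywhere in the chain brings the output of the whole train to zero.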


4.3 Performance indicators 


In practice, maintenance has a cost, just as system shutdowns due to failures do. It is, however, very difficult to get realistic figures for these costs, because they depend on too many factors. A more directly observable quantity is the discrepancy between the expected production and the actual production of the system over the mission time. Here, we consider the average production and the production loss of the system (per unit time) as relevant performance indicators. The GTS representation of these indicators is shown in Figure 11.

The observers are defined as real numbers. The potential capacity of the CDS is the normalized sum of the outflows of the trains (lines 8-9). It is then compared with the actual demand according to the season. If the potential capacity satisfies the demand, there is no production loss; otherwise, the production loss equals the demand minus the potential capacity of the CDS.
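
The same indicators can be sketched in a few lines of Python. This is only a restatement of the assertion in Figure 11 under the assumption that train outflows and demand are expressed on the same 0-100 scale; the function name and the sample demand values are illustrative.

```python
from typing import Sequence, Tuple


def plant_indicators(train_outflows: Sequence[float], demand: float) -> Tuple[float, float, float]:
    """Return (capacity, production, production loss) for one instant."""
    # Potential capacity: normalized sum of the train outflows.
    capacity = sum(train_outflows) / len(train_outflows)
    # Production is capped by whichever is smaller, demand or capacity.
    production = demand if demand < capacity else capacity
    # Loss is zero whenever the capacity satisfies the demand.
    loss = demand - production
    return capacity, production, loss


# Six healthy trains in a low-demand season (illustrative demand of 50).
print(plant_indicators([100.0] * 6, demand=50.0))
# Three trains down in a high-demand season (illustrative demand of 100).
print(plant_indicators([100.0, 100.0, 100.0, 0.0, 0.0, 0.0], demand=100.0))
```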


5 NUMERICAL EXPERIMENT 


The modelling methodology described above is applied to the CDS use case. A set of illustrative data (Table 1) is fed into the model. The mission time is 87600 h (10 years). We set the maximum production capacity for all six working trains to 100 per hour. Over 10 years this adds up to 8.76e+6, and with the nominal seasonal production profile in summer (*0.5) we get 4.38e+6. We assume that, at the beginning of the simulation, the production capacity of each unit is 100% per hour and the operation starts in winter.
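
The reference figures quoted above follow from simple arithmetic, reproduced here as a short Python check. The constant names are illustrative; the values are the ones stated in the text.

```python
HOURS_PER_YEAR = 8760       # 365 days x 24 h
MISSION_YEARS = 10
MAX_RATE_PER_HOUR = 100.0   # all six trains working
SUMMER_FACTOR = 0.5         # nominal summer production relative to the maximum

mission_time_h = HOURS_PER_YEAR * MISSION_YEARS
max_production = MAX_RATE_PER_HOUR * mission_time_h
summer_level_production = SUMMER_FACTOR * max_production

print(mission_time_h, max_production, summer_level_production)
```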

The GTS model is thus able to run a variety of what-if scenarios, and the results can be assessed by the stochastic simulator embedded in AltaRicaWizard. The map in Figure 5 shows possible experiments that we can simulate with the model. Given the scientific focus of this paper, only selected experiments are discussed in the following subsections.


Table 1. Input parameters for each unit. 
Component VSD Motor Gear Compressor 
Xa 1.0e- 6 1.0e -6 
Kp 1.0e -5 
hy 10e-7 10e-5 10e-7 1.5e-4 
Dn 4380 2190 
Pmj 4380 

2 6 365 6 182.5 

2 365 
T 730 
A 12 


5.1 Inspection interval


Figure 6 plots the accumulated production and loss, obtained from 1 million realizations of the process, against different inspection interval values 730 < T < 11680 (1 to 16 months). When the inspection interval is shorter than 4 months, the production loss increases dramatically. This is because, with too frequent inspections, operations are dominated by unnecessary shutdowns to check components that are still functioning. Beyond this point, the total production approaches a stable high level and reaches its peak when T is around 12 months. Notice that the curve is almost flat when T is between 4 and 16 months. This implies that the total production may not be very sensitive to the inspection interval in this range, given our assumptions and input parameters.


5.2 Time horizon 


Figure 7 shows Monte Carlo simulation results for the accumulated production and production loss from 1 to 10 years, with 1 million realizations for each year. As time elapses, the production and the loss increase almost linearly. The figure quantifies the overall production and the revenue reduction over a given mission time and thus gives decision makers a look-ahead horizon of the production profile of the whole CDS.

Figure 6. Production and loss versus inspection interval T.


5.3 Maintenance delay


Figure 8 presents the consequence of reducing the minor PM preparation time (delay). Here, we assume that the delay of major preventive/corrective maintenance interventions remains at its original value ρ_maj and that the changes only apply to the delay ρ_min of inspected components. From the figure we see that, as the minor preventive maintenance delay decreases from 10 years to 1 year, the total production increases by around 0.045%. The production loss with minor PM (ρ_min < 87600) and without it (ρ_min = 87600) does not differ much. The link between instant reaction and increased production is obvious, but the gain from having maintenance resources immediately ready is questionable. This is relevant for deciding on an optimal spare parts strategy, where the cost of contracting maintenance services and storing spare parts has to be weighed against the benefit in income.


5.4 Season profile


Figure 8 presents the consequence of extending the low-demand period (i.e. summer) from 1 to 12 months. This period corresponds to the window in which maintenance interventions are allowed. As the window extends, the system spends more time in reduced operation mode, and thus the total production decreases. Yet the production loss decreases rather than increases, because the potential loss is compensated by the relaxed production requirement in the extended summer period. In this way, we quantify the effect of a changing season profile on the accumulated production of the system.


6 CONCLUSIONS 


In this paper, we present a preventive maintenance model of a compressor drive system. The model relies on the modelling formalism AltaRica 3.0 and its mathematical framework, guarded transition systems. They provide mechanisms to represent the states and transitions of local units, enable the composition of multi-unit systems, and allow information flows (e.g. production, observed condition) to circulate through them. We illustrate how our modelling methodology handles the various modelling challenges in our CDS case, such as multiple unit types, a huge state space, monitoring combined with inspection policies, and multi-level maintenance actions. We impose different maintenance and inspection policies on top of the model and perform numerical experiments using Monte Carlo simulation. The simulator, as a decision support tool, demonstrates how the health of the components in the system affects the kind of intervention decisions that need to be made now or soon. The model calculates the accumulated production and its loss over a given mission time. It can give practitioners motivation for, for instance, intensifying or relaxing work on condition monitoring so that interventions become efficient and are carried out only when necessary.

However, this work is very preliminary and 
there are several directions that can be further 
investigated. 

Direction 1. Degradation profile. We assume that component deterioration follows exponential laws, each with its respective degradation or failure rate lambda. However, if plant data are available and tractable for a more accurate estimation of component degradation, we may introduce multiple degradation states or fit the data with other laws (e.g. Weibull, empirical distribution). In addition, if the operation mode changes, we may need to consider different degradation profiles (e.g. failure rates) under normal, increased and decreased production.
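
As a rough illustration of what replacing the exponential law could look like, the sketch below samples times to degradation from a Weibull law using Python's standard library. It is not the simulator used in the paper: the rate value, the shape values and the function names are assumptions chosen for illustration. With shape 1 the Weibull law reduces to the exponential law with mean 1/rate.

```python
import math
import random
from typing import List


def sample_times_to_degradation(n: int, rate: float, shape: float = 1.0, seed: int = 0) -> List[float]:
    """Draw n times to degradation from a Weibull law with scale 1/rate.

    shape == 1.0 gives the exponential law assumed in the current model;
    shape > 1.0 models wear-out (hazard increasing with age).
    """
    rng = random.Random(seed)
    scale = 1.0 / rate
    return [rng.weibullvariate(scale, shape) for _ in range(n)]


def weibull_mean(rate: float, shape: float) -> float:
    """Theoretical mean of the Weibull law with scale 1/rate."""
    return (1.0 / rate) * math.gamma(1.0 + 1.0 / shape)


if __name__ == "__main__":
    illustrative_rate = 1.0e-5  # per hour, hypothetical value
    for shape in (1.0, 2.0):
        draws = sample_times_to_degradation(20000, illustrative_rate, shape)
        print(shape, sum(draws) / len(draws), weibull_mean(illustrative_rate, shape))
```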

Direction 2. Season profile. The current season profile is simplified, with only two modes and a constant production requirement for each mode. However, the production requirement could fluctuate around a mean within each season. For instance, in summer the production demand is 52% for 20 days, 65% for 40 days, then 43% for 7 days, and so on. Such fluctuations in the seasonal demand profile can be introduced if a general abstraction of time-dependent plant data is available.
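
One possible way to encode such a fluctuating profile is a piecewise-constant table of (duration, demand) pairs. The Python sketch below uses the three summer segments quoted above; the data structure and function names are assumptions made for illustration, not part of the AltaRica model.

```python
from typing import List, Tuple

# (duration in days, demand in % of full capacity), from the example in the text.
Profile = List[Tuple[int, float]]
SUMMER_PROFILE: Profile = [(20, 52.0), (40, 65.0), (7, 43.0)]


def demand_on_day(day: int, profile: Profile) -> float:
    """Return the demand applicable on a given day (day 0 is the first day)."""
    if day < 0:
        raise ValueError("day must be non-negative")
    elapsed = 0
    for duration, demand in profile:
        elapsed += duration
        if day < elapsed:
            return demand
    raise ValueError("day lies beyond the end of the profile")


def mean_demand(profile: Profile) -> float:
    """Duration-weighted mean demand over the whole profile."""
    total_days = sum(duration for duration, _ in profile)
    return sum(duration * demand for duration, demand in profile) / total_days


print(demand_on_day(25, SUMMER_PROFILE), mean_demand(SUMMER_PROFILE))
```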

Direction 3. Monitoring policy. In practice, the inspection dates cannot be freely fixed, because they depend on many factors external to the system, such as the availability of the maintenance crew. Inspection may be destructive to the component. Certain inspections take some predictable time, and the system may be partly or totally shut down during them, whereas other inspections do not interrupt operation. In addition, as we have 6 trains in parallel, the inspection can be conducted for all trains at the same time or only for certain trains at a time. These situations can be introduced into the model to bring it closer to practice in the O&G industry.

Direction 4. Maintenance policy. It is important to sustain production throughout the mission time, especially in the summer period when the window is open for interventions. Therefore, when several degradations or failures occur in a system, we may need to decide when to react and which ones to address first. For instance, we can decide to maintain when we lose 3 compressors, or 2 compressors plus 1 motor, or wait for more failures, and so on. When there are 1 working, 3 degraded and 2 failed trains, we can decide to repair the 2 failed trains first and the 3 degraded ones later to ensure the highest possible production capacity. System-level maintenance planning can be considered in these cases.


ANNEX


class Train
  RepairableUnit VSD;
  RepairableUnit Gear;
  COMPRESSOR Compressor;
  MOTOR Motor;
  Real inflow (reset = 100);
  Real outflow (reset = 100);
  assertion
    VSD.inflow := inflow;
    Motor.inflow := VSD.outflow;
    Gear.inflow := Motor.outflow;
    Compressor.inflow := Gear.outflow;
    outflow := Compressor.outflow;
end


Figure 10. The AltaRica 3.0 code implementing the 
GTS pictured Figure 1. 


 1  block Plant
 2    ...
 3    Real capacity (reset = 100);
 4    observer Real Production = production;
 5    observer Real ProductionLoss = C.demand - production;
 6    assertion
 7      ...
 8      capacity := (T1.outflow + T2.outflow + T3.outflow + T4.outflow +
 9                   T5.outflow + T6.outflow)/6.0;
10      production := if C.demand<capacity then C.demand else capacity;


Figure 11. The AltaRica 3.0 code implementing the GTS for performance indicators. 


 1  domain State {W, HD, RD, HF, RF}
 2  domain Phase {OP, MAINT, INSP}
 3  domain Season {WINTER, SUMMER}
 4  domain Clock {STB, CALL, READY}
 5  class MOTOR
 6    State _state (init = W);
 7    Phase _phase (init = OP);
 8    Season _season (reset = WINTER);
 9    Clock _clock (init = STB);
10    event degradation (delay = exponential(lambda_d));
11    event failure (delay = exponential(lambda_f));
12    event startInsp (delay = Dirac(tau), policy = MEMORY);
13    event completeInsp (delay = Dirac(Delta));
14    event prepareMaint (delay = Dirac(rho));
15    event startMaint (delay = Dirac(0));
16    event completeMaint (delay = Dirac(delta));
17    Real inflow (reset = 100);
18    Real outflow (reset = 100);
19    transition
20      degradation: _state==W and _phase==OP -> _state := HD;
21      failure: _state==HD and _phase==OP -> _state := HF;
22      failure: _state==RD and _phase==OP -> _state := RF;
23      startInsp: _phase==OP -> _phase := INSP;
24      completeInsp: (_state==W or _state==RD or _state==RF) and
25        _phase==INSP -> _phase := OP;
26      completeInsp: _state==HD and _phase==INSP -> {_phase := OP; _state := RD;
27        _clock := CALL;}
28      completeInsp: _state==HF and _phase==INSP -> {_phase := OP; _state := RF;
29        _clock := CALL;}
30      prepareMaint: _clock==CALL -> _clock := READY;
31      startMaint: (_state==RD or _state==RF) and _phase!=MAINT and
32        _clock==READY and _season==SUMMER -> _phase := MAINT;
33      completeMaint: _phase==MAINT -> {_state := W; _phase := OP; _clock := STB;}
34    assertion
35      outflow := if inflow==100 and ((_state==W or _state==HD or _state==RD) and
36        _phase==OP) then 100 else 0;
37  end


Figure 12. The AltaRica 3.0 code implementing the GTS pictured Figure 4. 


A maintenance time estimated method based on virtual reality

ABSTRACT: The traditional method of maintenance time estimation is based on time regularities obtained from large amounts of statistical data. However, this method requires physical prototypes, which are difficult to obtain during the design and development stage. Moreover, to obtain accurate data, a large number of experiments need to be conducted to calculate the maintenance time and eliminate errors. This paper proposes a method that uses a maintenance platform based on virtual reality to avoid the need for a physical prototype. In this method, based on maintenance simulation analysis in a virtual reality environment, Methods Time Measurement (MTM) and compensation-based time prediction are integrated. Then, according to the difference between the actual repair time and the time estimated from the virtual simulation, a time compensation model is built that considers proficiency, fatigue and the maintenance environment. Finally, a real case is used to validate the model. The method can support the evaluation of quantitative maintenance factors in the early design stage of a product.


1 INTRODUCTION

Maintenance time is an important quantitative parameter in the analysis of maintenance. However, it is usually hard to obtain in the development and manufacturing stages. Traditional maintenance time estimation mainly relies on the time-accumulation method (Griswold 1970) or on comparative time analysis of similar products (Pliska et al. 1987), both of which are based on massive data. Physical prototypes are necessary in traditional methods, but in the early stage of development and manufacturing it is hard to meet this requirement. Moreover, to obtain accurate data, a large number of experiments need to be conducted to calculate the maintenance time and eliminate errors.

Given these issues, the main problems that traditional methods face are the lack of physical prototypes and the large number of experiments. The application of virtual reality in maintenance can address these problems. Virtual prototypes are used in virtual maintenance, which avoids the use of physical prototypes. In addition, the analyst can build a virtual simulation that contains all the motions of a maintenance task, which reduces the number of experiments.

The application of virtual maintenance in the early stage of development and manufacturing has increased during the last few decades. Real-time immersive virtual environments, such as the CAVE (Cruz-Neira et al. 1993) and the Workbench (Cutler et al. 1997), have been used to evaluate the maintainability of virtual prototypes. A novel assembly optimization framework based on a genetic algorithm was proposed in 2009 (Christiand et al. 2009).

Apart from the use of virtual environments, many studies analyze virtual simulations that consist of a series of virtual human motions. Many methods have been proposed based on these motions, including work force, methods time measurement, modular arrangement of predetermined time standard and so on (Genaidy et al. 1989, Kanai et al. 1996, Genaidy et al. 1990, Laurig et al. 1985, Dossett 1992, Laringa et al. 2002, Fischer et al. 1991, Wygant et al. 1993, Hoffmann et al. 1993). These methods classify different types of human motion and give the relevant motion-time principles. Apart from those methods, maintenance time has been estimated from the system maintenance work procedure through Monte Carlo simulation (Liu et al. 2014). Similarly, maintenance time prediction through a structural complexity metric model was proposed in 2014 (Owensby et al. 2014). The average repair time can also be estimated from the failure rate (Shen et al. 2017). In this paper, the author uses methods time measurement as the basic method to estimate maintenance time and adds some compensation principles.

The rest of the paper is organized as follows. Section 2 gives an overview of methods time measurement. Section 3 discusses the details of the compensation principle. Section 4 presents a case study of the implementation of the methodology. Section 5 concludes the study.


2 OVERVIEW OF METHODS TIME MEASUREMENT


The maintenance process consists of a series of human motions. In methods time measurement, these motions are divided into three parts: human walk, posture adjustment and hand operations (Geng et al. 2014). The following subsections discuss each part in detail.


2.1 Human walk 


Human walk simply means the walking motion in virtual maintenance, such as approaching the object. The time of human walk is determined by the walking distance and, if the operator needs to carry something, by the weight of the object in hand.


2.2 Posture adjustment 


Posture adjustment contains 5 parts: leg adjustment, position, pick, place and collision. Leg adjustment is the movement of the legs, while pick and place are motions of the arm and hand. Position combines the legs and upper limbs, while collision involves the movement of all limbs.


2.3 Hand operations


In this part, hand operations contain only hand motions with tools; bare-handed operation belongs to posture adjustment. Hand operations comprise two categories: connector operation and maintenance tool operation.


3 TIME COMPENSATION 
METHODOLOGY 


As shown in Figure 1, the overall framework consists of 2 parts: human motion time and compensation time. Human motion time is estimated by the methods time measurement discussed in Section 2, while compensation time is discussed in this section.

In the study of field maintenance work, the author has found that the proficiency and the fatigue of the operator have a great impact on maintenance time. Other qualitative maintainability parameters, such as visual accessibility, do have some influence on maintenance time, but they are not as important as proficiency and fatigue. Therefore, the author takes proficiency and fatigue as the two main compensation factors, while the other qualitative maintainability parameters are integrated into a maintenance environmental factor.


$$T = T_0 \times \alpha_i \times (1 + \beta_j) \times \gamma_k \qquad (1)$$


where $T$ = final estimated time; $T_0$ = human motion time measured by methods time measurement; $\alpha_i$ = proficiency ratio; $\beta_j$ = fatigue rate; $\gamma_k$ = maintenance environmental ratio.
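
Equation (1) can be written as a small Python function. The sketch below is only an illustration of how the three compensation factors combine with the human motion time; the numerical ratios in the lookup tables are hypothetical placeholders, since the paper gives the ratios only symbolically, and all names are assumptions.

```python
def estimated_maintenance_time(motion_time: float,
                               proficiency_ratio: float,
                               fatigue_rate: float,
                               environment_ratio: float) -> float:
    """Apply Equation (1): motion time x proficiency x (1 + fatigue) x environment."""
    if min(motion_time, proficiency_ratio, fatigue_rate, environment_ratio) < 0:
        raise ValueError("all inputs must be non-negative")
    return motion_time * proficiency_ratio * (1.0 + fatigue_rate) * environment_ratio


# Hypothetical level-to-value tables (NOT taken from the paper).
PROFICIENCY_RATIO = {1: 1.0, 2: 1.2, 3: 1.5}
FATIGUE_RATE = {1: 0.0, 2: 0.1, 3: 0.3}
ENVIRONMENT_RATIO = {1: 1.0, 2: 1.1, 3: 1.3}

# Example: 60 s of motion time, proficiency level 2, fatigue level 2, environment level 1.
print(estimated_maintenance_time(60.0,
                                 PROFICIENCY_RATIO[2],
                                 FATIGUE_RATE[2],
                                 ENVIRONMENT_RATIO[1]))
```

With all factors at their neutral values (ratio 1, fatigue rate 0), the estimate reduces to the human motion time itself.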


3.1 Proficiency 


Proficiency refers to the degree to which the operator has mastered the maintenance work. High proficiency means that the operator is familiar with the maintenance process, the tools used, the objects to operate on, where to put disassembled items and other maintenance matters. An operator with high proficiency carries out the work in order and finishes the steps one by one. A less proficient maintainer needs more time to consult the maintenance manual to see what to do next, which wastes a lot of time.

Figure 1. Framework of the time compensation methodology.


This paper classifies proficiency into three levels. The rules describing the degree of proficiency at each level are given below:

Level 1: The operator knows the whole maintenance process, including steps, tools and objects, and has mastered the operation.

Level 2: The operator knows the whole maintenance process but has never carried it out, or has done so only once before. He may occasionally need to consult the maintenance manual when he forgets what to do next.

Level 3: The operator does not know how to maintain the object and needs to consult the maintenance manual while operating.


The relationship between proficiency level and 
its ratio is shown in Table 1. 


3.2 Fatigue 


Fatigue is common, especially in long maintenance tasks. After working for a long time, operators feel tired, which reduces mental concentration, and lower concentration increases maintenance time. Fatigue also reduces the speed of movement, including hand movement, walking and turning.

The detailed rules are as follows:


Level 1: The operator does not feel tired and is in the same condition as at the beginning. His motions are therefore not affected.

Level 2: The operator feels a little tired, and his motions are affected by fatigue. He may take a short break or slow down his operation, and his efficiency may drop.

Level 3: The operator feels tired. He needs to take a break; otherwise he may make mistakes or something dangerous may occur.


The relationship between fatigue level and its 
rate is shown in Table 2. 


Table 1. Proficiency ratio.

Proficiency level    Ratio
Level 1              $\alpha_1$
Level 2              $\alpha_2$
Level 3              $\alpha_3$


Table 2. Fatigue rate.

Fatigue level        Rate
Level 1              $\beta_1$
Level 2              $\beta_2$
Level 3              $\beta_3$


3.3 Maintenance environment


Due to the limits of the virtual maintenance environment, it is hard to reproduce the actual working environment in terms of temperature, humidity and illumination. There are, however, other maintenance environmental factors that influence maintenance time.

Visual accessibility is the extent of what can be seen from the operator's current location. Good visual accessibility means that, during maintenance, the operator can directly see the working area, objects and tools. Poor visibility can increase maintenance time because the maintainer needs more time to search for tools or other items.

Operating space refers to the area in which an operator reaches objects and gets tools. A good operating space provides the maintainer with a comfortable working place, while a poor operating space increases human motion time.

These two maintenance environmental factors can be examined in virtual maintenance, as shown in Figure 2. The upper panel shows the operator's visual accessibility, while the other presents the operating space.

Figure 2. Visual accessibility and operating space in virtual maintenance (labelled regions: good vision area, maximum vision area, invisible area).


Table 3. Maintenance environmental ratio.

Maintenance environmental level    Ratio
Level 1                            $\gamma_1$
Level 2                            $\gamma_2$
Level 3                            $\gamma_3$


In this paper, the author presents three rules for the maintenance environment:


Level 1: The operator is in a natural posture, which means he can easily get or touch what he wants to operate, and all maintenance objects and tools are directly in his sight.

Level 2: Either the operator is in a narrow space, so that he cannot easily get or touch what he wants, or he does not have good visibility and some objects are out of his sight, which means he needs to turn to a certain angle to touch the object or the tool.

Level 3: The operator is in a narrow space and most or all maintenance objects and tools are out of his view.

The relationship between maintenance environmental level and its ratio is shown in Table 3.


3.4 Time estimation


Before the final calculation, the analyst should check the integrity of the virtual simulation. If there is any simplification, the analyst should take it into account in the calculation. After checking, the analyst chooses the ratios according to the situation, and the resulting value is the outcome of the time estimation.


4 CASE STUDY


Using the methodology above, the author takes the maintenance of an electric fan as an example. The virtual model of the fan is shown in Figure 3, and its maintenance process is presented in Table 4. The author uses DELMIA (Digital Enterprise Lean Manufacturing Interactive Application) as the virtual maintenance platform, where the simulation follows exactly the steps shown in Table 4. In the simulation, time is estimated by PTS (Predetermined Time Standards). The actual time was collected through several surveys.

Table 4 shows the detailed process of this maintenance work. The time of each task is listed in Table 5, which includes data from methods time measurement, the proposed method and the actual work. The methods time measurement figures consider only human motion, while the proposed method uses not only human motion but also the time compensation rules.


Figure 3. Virtual model of a fan.


Table 4. Maintenance process.

Number  Detailed task                                             Component of motion
1       Pick up the tool                                          Pick
2       Remove 2 nuts on the two sides                            Pick, screw x 2 and place
3       Put down the tool and take out the body part of the fan   Pick, place
4       Pick up another tool                                      Pick
5       Remove 3 nuts on the upper side                           Arm raise, ...
6       Put down the tool                                         Place
7       Remove the front cover of the body part                   Pick, place


Table 5. Maintenance time calculation.

Motion      Methods time measurement   Proposed method   Actual time
Pick        1.1                        1.1               2.5
zed l T Fi a
one L 4 1 4 1
aie 2 1 2 6 18. 6 18
Place       1.4                        1.4               1.3
Pick        1.1                        1.1               2.7
Arm raise   0.5                        0.5               0.9
Screw 3     12.2                       18.6              18.3
goa 4 122 170 14.2
Screws 12.8 17.0 23
Arm fall    0.5                        0.5               2.1
Place       1.4                        1.4               2.5
Pick        1.1                        1.1               1.7
Place       1.4                        1.4               1.6
Total       73.2                       100.8             108.7


Figure 4. Comparison between two methods.

Figure 5. Accumulated time comparison.

In Figure 4 and Figure 5, the orange line stands for methods time measurement, the blue line represents the proposed method, and the purple line shows the actual result. Compared with methods time measurement (the orange line), the time estimated by the proposed method is clearly closer to practice. The result also shows that proficiency is the main factor influencing maintenance time, which agrees with the survey results the author obtained in field maintenance.

In real maintenance, operators are often unfamiliar with the maintenance process because the failure rate of the equipment is extremely low, especially in the field of aeronautics and astronautics. Therefore, proficiency is the key factor in time estimation.


5 CONCLUSION 


This paper presents a methodology for maintenance time estimation based on virtual reality, in which methods time measurement and a time compensation model are discussed. The advantages of this methodology are that it provides a way to estimate time without a physical prototype and that it is more accurate than methods time measurement alone. By analyzing the comparison results, the author finds that proficiency is much more important in maintenance time estimation. From the field survey, the author also finds that qualitative maintainability parameters may not play an important role in time prediction: they do have some influence, but less than the author had expected. This observation reflects only the author's opinion.

Future work will focus on the importance of qualitative maintainability and on the simplification of the proposed method. Moreover, with the development of virtual reality, more methods will be developed based on methods time measurement and the proposed method, which may be simpler and more useful in real life.


ABSTRACT: Original Equipment Manufacturers (OEMs) nowadays face the need to establish an optimized maintenance plan from the design stage of their assets. To date, the production-centered business model has limited their after-sales maintenance strategy and, accordingly, their knowledge of the assets' operational behaviour. Furthermore, the added difficulty of assets operating in different contexts (which increases the variability of their behavioural patterns) contributes to the misalignment of the maintenance plan with the assets' actual needs. The purpose of this paper is therefore to propose a methodology based on the proportional hazards model for assessing the behaviour of the assets and the influence of the different operational context variables on their reliability. This methodology aims to provide supporting information for a better customization of maintenance plans at the offer stage. The proposed methodology has been verified and validated through a real case study with data provided by a leading company in the railway sector.


1 INTRODUCTION

Traditionally, Original Equipment Manufacturers (OEMs) have focused their efforts on asset production and sale. However, the highly demanding market has imposed the need to maintain their competitiveness through the after-sales service, which might become a profitable source of income. Since the business approach of the OEMs has shifted to this new paradigm, a growing body of publications recognizing the importance of optimizing maintenance plans can be identified in the literature.

Nevertheless, the OEMs have not achieved a desirable optimization level in their after-sales service. To date, because of inefficient maintenance plans, the after-sales service has incurred avoidable costs and excessive resource consumption. This becomes a major issue when the maintenance strategy is designed at the offer stage of the asset, where the customization of maintenance actions and frequencies is difficult due to the lack of available information to characterize the asset failure mechanisms. Furthermore, the operational environment of products can change, requiring reliable operation in unfamiliar circumstances or under uncertain behaviour of the asset's operators (de Rocquigny et al. 2008).

Nowadays, to face this problem, the OEMs follow a holistic approach supported by information provided by the suppliers and by information regarding similar assets' patterns. The resulting maintenance plans are not sufficiently customized, as they do not properly integrate information regarding the operational context in which the assets will perform. Thus, in order to avoid the misalignment of expected and actual maintenance needs, it is crucial to develop capabilities and tools to assess the influence that the different operational context variables have on the behaviour of the asset. To this end, a novel methodology is developed and presented in this paper, together with its application to a real case study in the railway sector.

The Proportional Hazards Model (PHM) plays a key role in the proposed methodology, as it is a valid approach to relate survival analysis to the impact of operational context variables. The main advantage of the model is the possibility of considering the impact of more than one variable simultaneously (Tang et al. 2014). The PHM was first introduced by Cox (1972) in a seminal paper published by the Royal Statistical Society. It has been proposed not only for medical studies but also for reliability analysis, by Cox himself and by other authors in the literature (Bendell et al. 1986, Wightman & Bendell 1986, Lawless 1983).
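
As a brief reminder of the model form, the standard Cox formulation multiplies a baseline hazard by an exponential function of the covariates. The Python sketch below evaluates that expression and the resulting hazard ratio between two operating contexts. The coefficient and covariate values are hypothetical, and the function names are illustrative; the sketch does not reproduce the estimation procedure of the paper.

```python
import math
from typing import Sequence


def proportional_hazard(baseline_hazard: float,
                        coefficients: Sequence[float],
                        covariates: Sequence[float]) -> float:
    """Cox form: baseline hazard times exp(sum of coefficient x covariate)."""
    if len(coefficients) != len(covariates):
        raise ValueError("coefficients and covariates must have the same length")
    linear_predictor = sum(b * z for b, z in zip(coefficients, covariates))
    return baseline_hazard * math.exp(linear_predictor)


def hazard_ratio(coefficients: Sequence[float],
                 covariates_a: Sequence[float],
                 covariates_b: Sequence[float]) -> float:
    """Ratio of hazards between two contexts; the baseline hazard cancels out."""
    return (proportional_hazard(1.0, coefficients, covariates_a)
            / proportional_hazard(1.0, coefficients, covariates_b))


# Hypothetical example: two context variables with made-up coefficients.
beta = [0.5, -0.2]
print(hazard_ratio(beta, [1.0, 0.0], [0.0, 0.0]))
```

Because the baseline hazard cancels in the ratio, the effect of several context variables can be compared simultaneously without specifying the baseline.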

A critical review of the early literature on this methodology can be found in Bendell (1985), where the complexities of the model are enumerated along with the proper way of applying it to reliability studies. Baxter et al. (1988) presented it as a powerful tool for examining reliability data sets in which the failure data may be inhomogeneous due to the presence of risk factors. Various examples of its application can be found in the literature. It is used as an exploratory tool in Bendell et al. (1986) and in Wightman & Bendell (1986), but also as a reliability prediction model for, among others, rail diesel engines (Jardine et al. 1989), marine and aircraft turbines (Jardine et al. 1987), traction transformers (Lin et al. 2016), vitrified clay pipes (Xie et al. 2017), electromagnetic relays (Li et al. 2015) and mobile handsets (Tiwari & Roy 2013).

As stated by Li et al. (2010), the PHM lays a mathematical foundation
for predicting the occurrence of failures and developing an optimal
maintenance policy. Thus, in this paper the application of the PHM to
a real case study in the railway sector is presented and discussed,
and the results obtained have been validated by a leading company in
the sector. More precisely, several components of the Heating,
Ventilating and Air Conditioning (HVAC) system have been analyzed
through the study of a database that gathers failure information
from different operational contexts. The intention is to provide a
case study in the application of the PHM and its potential for
maintenance plan design at the beginning of the asset's lifecycle,
when no information to characterize its behaviour is available.

The structure of this paper is as follows. Firstly, section 2
presents the HVAC system and the aspects to be taken into account
regarding its maintenance, and discusses the failure data to be
analyzed. Sections 3 and 4 explain, respectively, the Cox model with
its main equations and the proposed methodology. Once the main
concepts have been introduced, section 5 applies the methodology to a
real case study, where several components of the HVAC system are
analyzed together with their individual resulting PHMs. Finally,
section 6 presents the conclusions and benefits resulting from the
application of the proposed methodology to the real case study.


2 HVAC SYSTEM 


The railway sector is a growing industry that provides good mobility
for a reasonable price. The complexity of transportation systems is
increasing, as is their number of users; hence it is necessary both
to improve their availability and to guarantee high levels of
security and comfort whilst ensuring reasonable maintenance costs
(Foulliaron et al. 2014). There is a trend towards cost-efficient and
performance-based systems, which is highly influenced by political,
economic and environmental motivations (Umiliacchi et al. 2011). One
of the main pillars of this trend is the asset management discipline,
and the importance of a proper asset management strategy is supported
by the vast amount of literature focused on optimizing the
maintenance plans for rolling stock.

As previously stated, in order to optimize the maintenance actions,
customization to the operational context is required. This paper
discusses the application of the PHM to the HVAC system in the
railway sector, which has been chosen because it plays a key role in
the comfort of the passengers. The HVAC system in a train, metro or
tramway provides an air flow in order to maintain proper and
comfortable environmental conditions in the passenger room. The
functionalities of the HVAC system cover several comfort-related
aspects: (1) temperature adjustment, (2) refreshing of room air,
(3) filtration of the outside air, (4) ensuring proper noise levels
and (5) avoiding pressure waves (in high-speed trains).

To regulate the inside comfort environment, it is necessary to take
into account that the setpoint temperature depends on the outside
temperature, in order to avoid excessive thermal shock. However, this
is not the only parameter to be considered when designing an HVAC
system: the indoor relative humidity level should be maintained below
60% on average, and it is important to note that passengers are heat
and humidity sources. Accordingly, the HVAC system should be designed
to maintain desirable levels of inside temperature and humidity, and
the technical solution adopted should cope with the outside
conditions as well as with the maximum occupancy of the coach.

As stated in Bendell (1985), the proportional hazards modeling
techniques for reliability analysis are useful when the analyzed data
correspond to failures of non-repairable items whose times to failure
are independent and identically distributed. Following these
considerations, three different components of the HVAC system have
been selected for the case study: the control panel, the electronic
control (a subcomponent of the control panel) and the compressor
unit.

The database has been provided by a leading company in the sector. It
includes the work orders of maintenance services for HVAC systems in
rail projects operating in different European cities. The full study
considered the HVAC systems running from 1st January 2015 to 31st
August 2017, since the quality of the data is more consistent in this
period. From the record of maintenance work orders, the Times Between
Failures (TBFs) have been calculated, and then the information
regarding the operational context has been associated with each of
the TBFs.


3 PROPORTIONAL HAZARDS MODEL 


The PHM assumes that the hazard function of an asset decomposes into
the product of a baseline hazard function and an exponential term
incorporating the effects of the explanatory operational context
variables (covariates) (Cox 1972). The most widespread form of the
PHM is given by Equation 1,

$$h(t, X_1, \ldots, X_k) = h_0(t) \exp\left(\sum_{i=1}^{k} \beta_i X_i\right) \qquad (1)$$

where $h(t, X_1, \ldots, X_k)$ represents the hazard rate at time $t$
for an asset with covariates $(X_1, \ldots, X_k)$;
$X = (X_1, \ldots, X_k)$ is the vector containing the values of the
explanatory variables, where each $X_i$ ($\forall i = 1, 2, \ldots, k$)
can be either a natural variable or an indicator variable; $h_0(t)$
represents the baseline hazard function, i.e. the hazard function for
the null vector $X = 0 \in \mathbb{R}^k$, which can be parametric,
following a certain distribution, or of unspecified form; and
$\beta = (\beta_1, \ldots, \beta_k)$ is the vector of the parameters
of the model, which describe the effects of each of the covariates.

Given two identical assets operating in two different operational
contexts, $X = (X_1, \ldots, X_k)$ and $X' = (X'_1, \ldots, X'_k)$,
with $X'$ being the one with the higher risk, the Hazard Ratio (HR)
between them is defined as:

$$HR = \frac{h(t, X')}{h(t, X)} = \frac{h_0(t) \exp\left(\sum_{i=1}^{k} \beta_i X'_i\right)}{h_0(t) \exp\left(\sum_{i=1}^{k} \beta_i X_i\right)} \qquad (2)$$

$$HR = \exp\left(\sum_{i=1}^{k} \beta_i (X'_i - X_i)\right) \qquad (3)$$

It can be observed in Equation 3 that the Hazard Ratio is independent
of time $t$; this is the property that gives the Proportional Hazards
Model its name.
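
As a minimal numerical sketch of this property (not part of the
original study), the Python snippet below evaluates a Cox-type hazard
for two operating contexts at several times and checks that their
ratio does not change with time. The coefficients, covariate values
and baseline function are hypothetical and chosen only for
illustration.

```python
import math


def hazard(t, x, beta, baseline):
    """Cox-type hazard: baseline(t) * exp(sum_i beta_i * x_i)."""
    return baseline(t) * math.exp(sum(b * xi for b, xi in zip(beta, x)))


def hazard_ratio(x_ref, x_new, beta):
    """Hazard ratio of x_new relative to x_ref; the baseline cancels out."""
    return math.exp(sum(b * (xn - xr) for b, xn, xr in zip(beta, x_new, x_ref)))


# Hypothetical values, for illustration only.
beta = [0.05, 0.13]
x_ref = [60.0, 15.0]
x_new = [70.0, 18.0]


def baseline(t):
    # Any positive function of time works here.
    return 0.002 * t ** 0.5


# The ratio of the two hazards is the same at every time point.
ratios = [
    hazard(t, x_new, beta, baseline) / hazard(t, x_ref, beta, baseline)
    for t in (10.0, 100.0, 1000.0)
]
print(ratios, hazard_ratio(x_ref, x_new, beta))
```

Because the baseline appears in both the numerator and the
denominator, any choice of baseline function gives the same ratio.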

A special case of the HR arises when comparing an asset operating in
an environment where the covariates take their mean values,
$\bar{X} = (\bar{X}_1, \ldots, \bar{X}_k)$, with an asset operating
in a context $X_i = (X_{i1}, \ldots, X_{ik})$. Given this HR, the
hazard function of the asset operating in $X_i$ can be expressed by
Equation 4,

$$h(t, X_i) = \bar{h}_0(t) \exp\left(\sum_{j=1}^{k} \beta_j (X_{ij} - \bar{X}_j)\right) \qquad (4)$$

where the hazard function is decomposed into a baseline hazard
function at the mean values of the covariates ($\bar{h}_0(t)$), and
an exponential part where the deviations from the mean value of each
covariate are taken into account, instead of the covariate values
themselves.

The baseline hazard function at the mean values of the covariates can
be estimated in the model, and it can be either a function of
unspecified form or a function following a certain distribution. In
this paper it has been fitted to a two-parameter Weibull
distribution, so the resulting model is expressed by Equation 5,

$$h(t, X_i) = \frac{\gamma}{\alpha} \left(\frac{t}{\alpha}\right)^{\gamma - 1} \exp\left(\sum_{j=1}^{k} \beta_j (X_{ij} - \bar{X}_j)\right) \qquad (5)$$

where:

$\alpha$ is the characteristic life (scale parameter).
$\gamma$ is the shape parameter.
$t$ is the time.
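
To make the structure of the Weibull PHM concrete, the following
Python sketch evaluates a two-parameter Weibull baseline hazard
multiplied by the exponential term built from the deviations of the
covariates from their means. The function name and all numerical
values are hypothetical; this is an illustration of the formula, not
the authors' implementation.

```python
import math


def weibull_phm_hazard(t, x, x_mean, beta, alpha, gamma):
    """Hazard at time t for an asset with covariates x.

    alpha: Weibull scale parameter (characteristic life)
    gamma: Weibull shape parameter
    beta:  regression coefficients, one per covariate
    """
    baseline = (gamma / alpha) * (t / alpha) ** (gamma - 1.0)
    deviation = sum(b * (xi - m) for b, xi, m in zip(beta, x, x_mean))
    return baseline * math.exp(deviation)


# Hypothetical parameters, for illustration only.
alpha, gamma = 500.0, 1.5
beta = [0.05, -0.04]
x_mean = [70.0, 5.0]

# At the mean operating context the exponential term equals 1,
# so the hazard reduces to the Weibull baseline.
h_mean = weibull_phm_hazard(100.0, x_mean, x_mean, beta, alpha, gamma)

# One unit above the mean in the first covariate scales the hazard by exp(0.05).
h_shifted = weibull_phm_hazard(100.0, [71.0, 5.0], x_mean, beta, alpha, gamma)
print(h_mean, h_shifted / h_mean)
```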


4 METHODOLOGY & PROPORTIONAL 
HAZARDS MODEL 


The proposed methodology for assessing the impact of operational
context variables is shown in the flowchart of Figure 1. It consists
of two modules: the first is oriented to data treatment and
preparation, and the second mainly comprises the fit and tests of the
PHM. The next paragraphs present a detailed explanation of both.


4.1 Module 1. Data treatment and preparation


In this first module, the work is directed towards two focuses of
interest: the identification of the operational context variables
and the treatment of the maintenance data. On the one hand, to
identify the variables that affect the reliability of the asset, it
is essential to cooperate with maintenance workers in order to
translate their know-how into valuable feedback that helps to
integrate every environmental aspect influencing the failure
mechanisms. On the other hand, following the general framework for a
reliability improvement study proposed by Louit et al. (2009), the
work orders are treated to calculate the TBFs, taking into
consideration issues such as scarce data, censoring (only right
censoring is contemplated) or data pooling. Likewise, it is important
to check for and minimize potential human errors and noise, such as
work orders entered twice or work orders not related to failures.


Figure 1. Flowchart of the proposed methodology. (Module 1:
identification of operational context variables; data treatment;
association of TBFs with the corresponding values of the variables.
Module 2: fit PHM to the data; backward approach until all variables
are statistically significant; proportional hazards test and
exclusion of variables that fail it; influential observations test
and treatment of influential observations; fit of the baseline hazard
function at the mean values of the variables.)

Once the TBFs have been calculated and the variables that might
influence the reliability have been identified, the two must be
linked. To this aim, it is necessary to define a way of associating a
value of the variables with every TBF; nonetheless, the procedure to
follow for each variable should be designed ad hoc, depending on the
available information and the characteristics of the variable.
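
The data treatment step can be sketched as follows. This is a
simplified, hypothetical illustration in Python rather than the
procedure used in the study: it assumes that the asset enters service
at the start of the observation window, removes work orders entered
twice on the same date, and marks the interval between the last
failure and the end of the window as right-censored. The failure
dates are invented; only the observation window matches the one
stated in section 2.

```python
from datetime import date


def time_between_failures(failure_dates, start, end):
    """Return (duration_in_days, observed) pairs for one item.

    observed is True for intervals that end in a failure and False for
    the final right-censored interval up to the end of the window.
    """
    # set() drops duplicated work orders; the filter keeps the window only.
    events = sorted(d for d in set(failure_dates) if start <= d <= end)
    records = []
    previous = start
    for d in events:
        records.append(((d - previous).days, True))
        previous = d
    if previous < end:
        records.append(((end - previous).days, False))
    return records


# Hypothetical work-order dates, with one duplicated entry.
orders = [date(2015, 3, 1), date(2015, 3, 1), date(2016, 1, 10)]
records = time_between_failures(orders, date(2015, 1, 1), date(2017, 8, 31))
print(records)
```

Each resulting record would then be associated with the values of the
context variables that apply to the corresponding project and period.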


4.2 Module 2. Valid PHM fit 


With the data set containing the failure times associated with the
appropriate context variable values, the next step is to seek a valid
PHM fit using the method of partial likelihood. In this paper's case
study the fit has been performed with the Cox Proportional Hazards
regression in R, but there are many software solutions for this
regression. The way to proceed is to try to fit a PHM that includes
all the identified covariates; however, it is very likely that many
of them will not prove influential with enough statistical
significance. Significance levels have to be set arbitrarily for the
test of the null hypothesis $\beta_i = 0$
($\forall i = 1, 2, \ldots, k$); in the case study a p-value < 0.1
was considered significant.

Moreover, there is also the possibility of having identified
collinear variables, i.e. covariates which can be linearly predicted
from any of the others. When a covariate appears in the results as
either not statistically significant or collinear, it is eliminated
following the backward approach proposed by Bendell et al. (1986),
until a PHM is reached in which each variable is influential with a
p-value lower than the selected one and no collinearity is detected.
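
The backward approach can be sketched generically as the loop below.
It is a minimal Python illustration under stated assumptions, not the
procedure of Bendell et al. (1986) nor the R code used in the study:
the least significant covariate is removed one at a time, and
`fit_p_values` is a hypothetical callable standing in for a real Cox
regression that returns one p-value per covariate. The toy p-values
are invented, and in a real analysis they would change after every
refit.

```python
def backward_elimination(variables, fit_p_values, threshold=0.1):
    """Drop the least significant covariate until all pass the threshold."""
    selected = list(variables)
    while selected:
        p_values = fit_p_values(selected)  # refit with the current covariates
        worst = max(selected, key=lambda v: p_values[v])
        if p_values[worst] < threshold:
            break  # every remaining covariate is significant
        selected.remove(worst)
    return selected


def toy_fit(selected):
    """Hypothetical stand-in for a Cox regression: fixed, invented p-values."""
    table = {"humidity": 0.03, "min_temp": 0.001, "max_temp": 0.40, "mean_temp": 0.25}
    return {v: table[v] for v in selected}


kept = backward_elimination(["humidity", "min_temp", "max_temp", "mean_temp"], toy_fit)
print(kept)
```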

Once the model contains the proper covariates, it is necessary to
check the proportional hazards hypothesis under which the model has
been fitted. To this aim, a test and a graphical analysis are
performed for each significant variable, based on the Schoenfeld
residuals obtained from the regression. If a covariate performs well
in neither the analytical test nor the graphical one, it needs to be
excluded from the PHM and the regression performed again with the
remaining covariates.

When the proportional hazards hypothesis has been tested and
validated, the next phase is to spot any influential observations
that may have biased the estimation of the regression parameters. If
any is identified, it is examined to decide how it should be treated
before proceeding with the fit of a more accurate PHM.

Since by default the baseline hazard function at the mean values of
the covariates is a distribution-free function estimated in the
regression, the last step of the methodology consists of fitting it
to a suitable probability distribution. The resulting Proportional
Hazards Model is then a completely parametric function (see
Equation 5) that can be decomposed into two terms:


• A baseline hazard function at the mean values of the covariates,
  following a certain distribution (in the case study, a
  two-parameter Weibull). It therefore depends on the parameters of
  the chosen distribution and on the time (t).

• An exponential function that integrates the information regarding
  the operational context, since the deviation from the mean of each
  influential variable is multiplied by its corresponding estimated
  parameter.


5 REAL CASE STUDY RESULTS & 
DISCUSSION 


The system under study is the HVAC system integrated into different
projects of the railway sector; however, the PHM is not applied to
the system as a whole but to selected subsystems. Several criteria
have been followed to select the subsystems, but firstly a Failure
Mode and Effect Analysis has been performed to fully understand the
failure mechanisms of the HVAC system. Having formalized the
knowledge about HVAC failure modes, an approach based on the
CIB-framework proposed by Waeyenbergh and Pintelon (2004) is applied.
To identify both the Most Important Systems (MIS) and the Most
Critical Components (MCCs), the know-how of the company workers is
documented and implemented in the Maintenance Cost Matrix in
Figure 2.

Every MIS, at different indenture levels, is included in the matrix,
where the vertical axis represents Corrective Costs (CC) and the
horizontal axis corresponds to Preventive Costs (PC). This cost
classification is based on a qualitative Pareto approach, since for
each component both costs have been calculated with Equations 6
and 7.


Figure 2. Maintenance cost matrix. (Vertical axis: Corrective Costs.
Items are marked by indenture level 2 or 3, where indenture level 3
corresponds to the smallest maintainable item; a separate region
groups the items that are critical by security, comfort or
environment.)


$$CC = PI_{CorrectiveCost} \times PI_{CorrectiveFrequency} \qquad (6)$$

$$PC = PI_{PreventiveCost} \times PI_{PreventiveFrequency} \qquad (7)$$


where PI is the Pareto Index, being: 1 for low, 2 for
medium and 3 for high.
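
As a small worked example of this scoring, the Python sketch below
multiplies the Pareto Index of a cost by the Pareto Index of the
corresponding frequency. The function name and the example ratings
are hypothetical; only the 1/2/3 scale comes from the text.

```python
# Pareto Index scale stated in the text: 1 for low, 2 for medium, 3 for high.
PARETO_INDEX = {"low": 1, "medium": 2, "high": 3}


def cost_score(cost_level, frequency_level):
    """Product of the cost index and the frequency index (ranges from 1 to 9)."""
    return PARETO_INDEX[cost_level] * PARETO_INDEX[frequency_level]


# Hypothetical component: high corrective cost, medium failure frequency,
# low preventive cost, high preventive frequency.
cc = cost_score("high", "medium")
pc = cost_score("low", "high")
print(cc, pc)
```

The two scores give the position of a component on the vertical and
horizontal axes of the matrix.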

In the matrix of Figure 2, four MIS can be seen in the red area,
which represents high maintenance costs. It also shows two MIS whose
criticality is due to security, comfort or environmental issues.
These six MIS are the ones categorized as MCCs, and thus the ones
that have been analyzed with the PHM. The identified MCCs are shown
in Table 1, with their reference, description and indenture level.

Once the items to be analyzed had been defined, their corresponding
failure data were extracted from the database and treated as proposed
in the methodology (see subsection 4.1).

The next step is to identify the potentially influential variables
that are going to be integrated into the analysis. In cooperation
with the company, 11 covariates have been selected as possible risk
factors; some of them are defined by the project specifications and
the others are exogenous uncertainty variables. Table 2 explains the
criteria for selecting every variable.

When applying the proposed methodology, only three out of the six
MCCs provide a valid PHM: the control panel (HTS01), the control
electronics (HTS0101) and the compressor module (HTS04). The
evaporator module (HTS02) can be fitted with a PHM in which the
maximum temperature and the relative humidity influence the hazard
function; however, these covariates do not pass the proportional
hazards test.

For neither HTS0403 nor HTS0201 can a valid PHM be fitted; this is
due to data scarcity. Both were selected because their preventive
maintenance cost was high (bottom-right position in the matrix). In
fact, it is this over-maintenance that causes the absence of
failures, and therefore the data scarcity.

In every case, the integration of the same 7 variables was
problematic due to collinearity; these variables correspond to $X_1$,
$X_2$, $X_3$, $X_4$, $X_6$, $X_7$ and $X_8$. Most of them are project
specifications, and therefore the value of one of them determines the
value of the others, since each value corresponds to a certain
project. As a result of this behaviour of the variables, it is
impossible to distinguish their individual effects in the regression.

Due to this problem, it has only been possible to fit into the models
the uncertainty variables corresponding to the relative humidity and
the temperature, which are $X_5$, $X_9$, $X_{10}$ and $X_{11}$.


Table 1. Identified MCCs.

Reference   Indenture Lv.   Description

HTS01       1               Control panel
HTS0101     2               Control electronics
HTS02       1               Evaporator module
HTS0201     2               Air filter
HTS04       1               Compressor module
HTS0403     2               Compressor


Table 2. Covariates.

Variable                          X_j     Selection criteria

Number of passengers              X_1     People are a source of humidity and heat, so the
                                          company believes that the annual average number of
                                          passengers of every project has to be included,
                                          since it might affect the reliability.
Refrigerant                       X_2     The kind of refrigerant used by the HVAC system of
                                          every project modifies its working pressure.
Number of stops                   X_3     The larger the number of stops in a project, the
                                          more the HVAC system is exposed to heat loss or
                                          gain, and therefore the more it has to work.
Intensity of use                  X_4     For every project the contractual number of
                                          available trains in service is different, as is the
                                          fleet that provides that availability; therefore
                                          the HVAC system workload differs between projects.
Relative humidity                 X_5     As one of the functions of the HVAC system is to
                                          provide comfortable room humidity levels, the
                                          outside atmospheric humidity conditions its
                                          workload.
Maximum distance between stops    X_6     The distance between stops is equivalent to the
Minimum distance between stops    X_7     time the HVAC system is working and to its
Average distance between stops    X_8     capability to reach a steady working regime. To
                                          incorporate the deviation of this variable, the
                                          maximum and the minimum distances for every project
                                          are also taken into account.
Maximum temperature               X_9     As the main function of the HVAC system is to
Mean temperature                  X_10    provide a comfortable room temperature, the outside
Minimum temperature               X_11    temperatures condition its workload. To incorporate
                                          the deviation of this variable, the maximum and the
                                          minimum temperatures are also taken into account.


5.1 HTS01: Control panel


In the analysis of this MCC, two covariates are found to be
statistically significant, with a default p-value of 0.01 for the
hypothesis tests. These two covariates are the relative humidity and
the minimum temperature ($X_5$ and $X_{11}$), with coefficients
$\beta_5 = -0.008$ and $\beta_{11} = -0.041$. Both coefficients are
lower than zero, meaning that the higher the values of the variables,
the lower the asset's failure rate.

Given the resulting values of the coefficients, a unit increase in
the relative humidity (at a constant minimum temperature) yields a
0.8% lower failure rate, and a unit increase in the minimum
temperature (at a constant relative humidity) yields a 4.02% lower
failure rate.
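
These percentages follow from the coefficients: in a proportional
hazards model a unit increase in one covariate, with the others held
constant, multiplies the hazard by the exponential of its
coefficient. The short Python check below reproduces the figures
quoted for HTS01; the function name is an illustrative choice, and
the coefficients are the ones reported above.

```python
import math


def percent_change_per_unit(beta):
    """Percentage change in the hazard for a unit increase in one covariate."""
    return 100.0 * (math.exp(beta) - 1.0)


# Coefficients reported for HTS01 (relative humidity, minimum temperature).
humidity_change = percent_change_per_unit(-0.008)
min_temp_change = percent_change_per_unit(-0.041)
print(round(humidity_change, 2), round(min_temp_change, 2))
```

A negative result indicates a lower failure rate and a positive one a
higher failure rate.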

With a valid PHM, the fit of the baseline function at the mean values
of the two covariates has been performed. A Weibull distribution has
been fitted to the non-parametric baseline function; it can be seen
in Figure 3, with the 95% confidence intervals for both functions.
These results have been checked with the collaborating company, and
there is one major conclusion about the component and its resulting
PHM.


Figure 3. HTS01 survival baseline fit (Weibull, 95% CI). (Axes:
reliability versus time in days.)

It is worth highlighting that the higher the minimum temperature is,
the better the component performs. After discussion with the HVAC
technicians, a likely explanation for this phenomenon has been found.
The technicians point out that there is no forced ventilation for the
electronic components; thus their temperature is not controlled and
the outside temperature has more influence on their behaviour. It has
also long been noticed that the failure rate of the control panel
increases when there are large temperature fluctuations within the
day. Thus it is believed that the modeled positive impact of an
increase in the minimum temperature reflects the observed negative
impact of temperature fluctuation. This discussion has led to future
joint work.


5.2 HTS0101: Control electronics


This component is a subsystem of the HTS01 previously studied, and it
is reasonable to think that its resulting PHM will share some
features with the one from the HTS01 analysis. As expected, they do
share properties: the minimum temperature is the only variable that
allows a valid Cox model to be fitted, and therefore a
single-covariate PHM is obtained. The coefficient associated with
this variable is $\beta_{11} = -0.045$, and it yields a 4.44% lower
failure rate per unit increase of the minimum temperature.

It can be seen that the effect of the minimum temperature is defined
by a negative coefficient, meaning that the higher the minimum
temperature, the better the reliability of the asset. The technicians
have validated this result through the same reasoning applied to
HTS01. The fit of its unspecified baseline function, at the mean
value of the covariate, to a Weibull distribution can be seen in
Figure 4.

It is important to notice that the effect of the minimum temperature
is almost equal in HTS01 and HTS0101. Not only is this coefficient
similar, but the fits to Weibull distributions also result in very
similar shape parameters. It is also important to mention that the
scale parameter of HTS01 is smaller, which is reasonable since HTS01
includes HTS0101, whose scale parameter is over 65% higher.

Therefore the control panel and the control electronics are both
affected by the same risk factor, and their behaviour is also similar
because of their shape parameters. However, the reliability of HTS01
is lower because it depends on the failure rate of HTS0101 and also
on the failures of other components arranged in series.


Figure 4. HTS0101 survival baseline fit (Weibull, 95% CI).


5.3 HTS04: Compressor module


The amount of data available for this MCC was not very large; in
spite of this inconvenience, a proper Proportional Hazards Model has
been found to describe the behaviour of the component with enough
statistical significance. The influential variables for this MCC are
the relative humidity and the mean temperature, with coefficients
$\beta_5 = 0.0547$ and $\beta_{10} = 0.1316$ respectively. In this
case, both covariates are associated with positive coefficients,
meaning that the higher the values of the covariates, the higher the
risk of failure. More specifically, if the relative humidity
increases by one unit (at a constant mean temperature), the hazard
function increases by 5.62%; and if the unit increase happens in the
mean temperature (at a constant relative humidity), the hazard
function increases by 14.07%. Once the PHM has been fitted and
validated, the hazard function at the mean values of the covariates
is fitted as proposed in the methodology. It can be seen in Figure 5,
and it is important to notice that the fitted curve is not as
accurate as for the previous MCCs because of the data scarcity.

The major effect of the mean temperature is remarkable: it reveals
that this is an important risk factor to take into account when
designing the maintenance strategy of this critical component. The
collaborating company validates these results, and it has been
dealing with a problem that backs up the results obtained here.


Figure 5. HTS04 survival baseline fit (Weibull, 95% CI).


6 BENEFITS & CONCLUDING REMARKS


In this paper, a failure rate model for different components of the
HVAC system is proposed, considering different risk factors of the
operational context. The main contribution of this paper is the PHM
approach following an established methodology that sets out a number
of steps for a proper application of the Cox model. It has been
demonstrated that, as expected, the temperature and the relative
humidity influence the behaviour of the asset.

In the last step of the methodology, by fitting 
the baseline hazard function in the mean values 
of the covariates, a Weibull PHM is obtained. The 
obtained failure rate model consists of a completely 
parametric equation (see Equation 5). The para- 
metric equation has three important advantages: 


• In the offer stage, it makes it possible to know the asset
  reliability before the asset starts to operate, and therefore to
  develop an optimized maintenance plan that better fits the reality
  of the asset operation, improving the efficiency and effectiveness
  of the maintenance activities. The customized reliability enables a
  more accurate Life Cycle Cost (LCC) analysis that may be a
  competitive advantage in tender processes, putting the company a
  step ahead of its competitors.

• From the point of view of the after-sales service, one of the most
  significant advantages is the improved management of a fleet of
  assets in which every asset works in different operational
  conditions and shows a different failure mechanism. This
  improvement in management will result in higher availability rates,
  and accordingly in revenue from the decreased costs of
  non-availability and in savings derived from the enhanced
  customization of maintenance frequencies. It is remarkable that not
  only the OEM benefits from the new reliability model, but the asset
  users as well.

• Depending on the arrangement of the components of the asset, it is
  possible that in certain operational contexts the criticality of
  the components will increase due to a higher failure rate caused by
  one or more variables of the context. The proposed model makes it
  possible to foresee this increase in criticality and to anticipate
  its consequences by identifying the components on which the
  maintenance service should focus its efforts.


As future work, it is necessary to study a way of modeling the impact
of project specifications, since it is not possible to integrate
their information in the proposed PHM due to collinearity problems.

It is interesting to remark that some uncertainty variables, such as
the number of passengers, could be integrated into the model if a
more detailed record of the variable were available.

The work presented here can serve as a basis for future predictions
of HVAC system reliability, and also as a tool to integrate external
information into the asset managers' decision-making process. The
authors reckon that with further work it will be possible to
extrapolate the methodology proposed here to reliability modeling for
other assets, and that it will enable more efficient asset management
through a better customized and more accurate maintenance plan design
in the offer stage.


Opportunistic maintenance strategy for a train fleet under safety
constraints and inter-system dependencies


H. Ghamlouch & A. Grall 
ICD/LM2S— UMR CNRS 6281, Université de Technologie de Troyes, Troyes, France 


ABSTRACT: The aim of this work is to propose a modeling framework to call into question the
maintenance regulations of a train fleet and improve an enforced maintenance policy without disrupting
its main structure. The considered fleet consists of identical trains. A fixed periodic preventive
maintenance plan is imposed by the manufacturers for safety and operational quality requirements. Each
overhaul includes a set of pre-defined tasks such as inspection, minimal repair activities and preventive
replacement of some components. During inspection, if the degradation level of a component exceeds a
primary replacement threshold, it is replaced as a precaution. If at least one component fails
when the train is in operation, it is immediately replaced. In this paper we propose the introduction of
opportunistic maintenance activities to improve the periodic maintenance strategy while operating within
the constraints of the imposed planning. Opportunistic maintenance can include early revisions or advanced
preventive replacements of some components (with respect to the scheduled preventive replacement or their
primary replacement threshold). Dynamic maintenance tasks can additionally be applied during corrective
activities or pre-scheduled revisions. Dependence matrices are used in order to consider stochastic
dependencies between the system's components. Economic dependence is basically derived from a tree-like
set-up model. A specific replacement threshold dedicated to opportunistic preventive replacement
as well as a flexibility degree of revision dates are introduced. Multiple constraints are also considered
for optimization: availability, work hours and number of simultaneously revised systems. A comparison
between the periodic maintenance strategy with and without opportunistic tasks is presented.


1 INTRODUCTION

In the transportation industry, production, operation and maintenance
costs, along with system reliability and safety, can be decisive for
business success and continuity. In this context, the management
tools adopted by manufacturing and operating companies are
continuously enhanced in order to increase system reliability as well
as passenger safety and comfort. These tools especially make it
possible to determine operation scheduling methods and maintenance
strategies.

The optimization of maintenance policies in manufacturing
organizations is not a recent subject for researchers (Barlow and
Hunter 1960). A first review of issues and results concerning
maintenance scheduling problems, and of how crucial the latter can be
for the productivity and competitiveness of both manufacturing and
service organizations, was given by Paz and Leigh (1994). Ever since,
the development and optimization of maintenance strategies have been
considered in different domains, for instance: energy production
systems (Sikos and Klemes 2010, Shafiee 2015), transportation
(Fritzsche et al. 2014, Liden 2015), networks (Morcous and Lounis
2005), infrastructures (Rios-Mercado and Borraz-Sanchez 2015), etc.

Maintenance strategies are usually classified in two categories:
1) corrective maintenance, where maintenance is carried out when the
system fails, and 2) preventive maintenance, where maintenance is
carried out as a precaution in order to prevent system failure. The
latter is usually adopted for systems with severe failure
consequences. In a preventive maintenance strategy the system's
components are maintained or replaced according to specific rules
that usually consider the system's age, deterioration level,
operation time, etc. Depending on the nature of the system, different
criteria can be considered for maintenance policy assessment and
optimization, such as mean cost rate, availability, downtime or
combined economic and reliability criteria. A detailed survey of
maintenance policies and deterioration modeling as well as
optimization criteria is given by Wang (2002). A recent review of
past and current research on optimal maintenance policy selection
issues in the manufacturing domain is given by Ding and Kamarudding
(2011). Maintenance mathematical models and numerical algorithms for
optimization were reviewed by Vasili et al. (2011) and Alrabghi and
Tiwari (2015), respectively.

Most maintenance optimization studies in the railway field have
focused on railway infrastructure maintenance (Liden 2015, Bianchini
Ciampoli et al. 2017). The latter can consume very large budgets and
poses numerous challenging planning problems. Besides the railway
infrastructure, the main part of a railway system is the rolling
stock. Time spent on the maintenance of rolling stock may disrupt the
operation and can cause financial losses as well as customer
dissatisfaction. A major challenge for train operators is to ensure
the required train service with limited rolling stock units. An
operator should also respect the requirements of the train
manufacturer and the security norms. This can be achieved if the
correct maintenance strategy is adopted.

Despite the increasing efficiency of sensors and online monitoring devices, most of the
maintenance strategies recommended by rolling stock manufacturers and adopted by railway
operators are based on traditional periodic scheduling of maintenance operations
(Eisenberger and Fink 2017). Classical periodic preventive maintenance strategies often
lead to incorrect maintenance work and unnecessary maintenance tasks, and often revert to
corrective or breakdown maintenance (Rezvanizaniani et al. 2008). Different levels of
inspection are defined and scheduled periodically according to a specified operation time
or number of kilometers (Sriskandarajah et al. 1998). Significant improvements in operation
and maintenance cost could be obtained by implementing condition-based or predictive
maintenance strategies, which make it possible to anticipate system failure (Giacco et al.
2014). In complex systems the interaction between different components plays a major role
in system deterioration and failure prediction and should be considered when setting the
decision rules. Dependencies can offer more opportunities to group maintenance activities
and reduce maintenance costs. Maintenance optimization for multi-component systems with
internal dependencies is an important subject for maintenance optimization scholars
(Nicolai and Dekker 2007). The main aim of this paper is to investigate the possibility of
taking advantage of opportunities to improve a traditional maintenance policy based on
regulatory obligations.

This paper considers the constraints imposed by manufacturer and security instructions,
which govern the existing maintenance planning and cannot be modified easily. An
opportunistic maintenance strategy is proposed for tram trains considering system
dependencies. This work is part of the DIADEM project (Dynamic Diagnosis and Predictive
Maintenance for Train Onboard Systems), which handles the case of Rennes’ tram fleet. The
company KEOLIS, which operates Rennes’ tram fleet, currently adopts a systematic preventive
maintenance strategy that is scheduled according to manufacturer instructions and security
norms. The tram fleet system and the existing maintenance strategy are detailed in
section 3. The approach proposed to add flexibility is presented in section 4. First the
system and maintenance model are presented with strict maintenance rules. Then two new
parameters are introduced: 1) the flexibility degree, which allows the operator to modify
the scheduled systematic replacement activities, and 2) the secondary replacement
threshold, which allows the operator to anticipate replacement activities when it is
beneficial. Finally, empirical results from numerical simulation are given and discussed.


2 COMPLEX SYSTEMS AND 
DEPENDENCIES 


In order to manage system health and maintenance scheduling for multi-component systems, a
good understanding of component interactions and dependencies is essential. According to
Thomas (1986), component dependencies can be classified as economic, structural or
stochastic. Economic dependence implies that grouping maintenance activities changes the
maintenance cost. Stochastic dependence means that the conditions (or status) of two or
more components are related in either a deterministic or a probabilistic way. In the case
of structural dependence, the maintenance of one component requires the maintenance, or at
least the dismantling, of others. This section presents a brief review of system
dependencies and their modeling in maintenance optimization.


2.1 Economic dependence 


Economic dependence means that the cost of a 
joint maintenance on several components is not 
equal to the sum of their individual maintenance 
costs. 

In this paper we consider positive economic dependence, which usually refers to economies
of scale. It takes into account the reduction of setup activities and costs when several
components are maintained simultaneously (Papadakis and Kleindorfer 2005). Economies of
scale have been introduced into maintenance optimization in multiple forms. A single setup
model is considered by Schouten et al. (1998) in order to evaluate preventive and
opportunistic maintenance policies for traffic control lights. “Single setup” means that
for the simultaneous maintenance of two or more components a specific setup cost is paid
only once. The single setup model is commonly used in maintenance optimization, see for
example (Castanier et al. 2005), (Scarf and Deara 2003) and (Budai et al. 2004). Different
setup activities can also be considered, leading to multiple setup models. In this case
different components may or may not share one or more setup activities. This model has been
used by van Dijkhuizen (2000), where a tree-like setup structure was developed in order to
reduce the maintenance cost of an industrial production system. Another form of positive
dependence is the downtime opportunity. Some component failures can be considered as
opportunities for preventive maintenance, e.g. because they cause a system shutdown. The
downtime required for corrective maintenance allows preventive maintenance and/or
inspection to be performed in the same time window, which reduces the total downtime costs.
This model is mainly used in the transportation field because of the nature of system
operation and maintenance (Sheu and Jhang 1997, Higgins 1998, Sriskandarajah et al. 1998).
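
As a minimal sketch of the single setup model described above, the cost of a maintenance
session can be written as one shared setup cost plus the individual replacement costs. The
function name and the cost figures below are hypothetical and only illustrate the economy
of scale; they are not taken from the cited studies.

```python
def single_setup_cost(setup_cost, replacement_costs):
    """Cost of one maintenance session under the single setup model.

    The setup cost is paid once per session, however many components
    are maintained; an empty session costs nothing.
    """
    if not replacement_costs:
        return 0.0
    return setup_cost + sum(replacement_costs)


# Hypothetical figures: setup cost of 100, two components costing 40 and 60.
separate = single_setup_cost(100.0, [40.0]) + single_setup_cost(100.0, [60.0])
joint = single_setup_cost(100.0, [40.0, 60.0])
saving = separate - joint  # exactly one setup cost is saved by grouping
```

Under these assumptions the saving from grouping equals the setup cost multiplied by the
number of sessions avoided.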


2.2 Stochastic dependence 


Stochastic dependence occurs as the result of physical or operational interaction between
components. The status of one component may influence the deterioration level or the
lifetime of others. Stochastic dependence can be classified into three different types
(Murthy & Nguyen 1985). The first type (type I) is when the failure of component i may
induce the complete failure of component j with a certain probability p. In this case the
failure of component i is called a natural failure, while the failure of component j is
considered an induced failure. This model has been considered for maintenance e.g. in
(Scarf & Deara 2003) and (Jhang & Sheu 2000). Type II is defined when the failure of
component i can induce a failure of component j with probability p, whereas the failure of
component j directly induces a failure of component i. In the case of deteriorating
components, the failure of component i acts as a shock to component j, increasing its
deterioration level by a random amount. Component j fails when the total deterioration
level exceeds a failure level. This model has been used e.g. in the context of maintenance
optimization in Satow & Osaki (2003). The third type of stochastic dependence considers
that the failure of each component acts like a shock on the other components, inducing a
random increase in their failure rate or deterioration level. It was first used by
Ozekici (1988).
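
To make the type I interaction concrete, the following sketch counts how many natural
failures of component i propagate to component j when each one induces a failure with
probability p. The function name, the seed and the numerical values are illustrative
assumptions, not part of the cited models.

```python
import random


def count_induced_failures(p, natural_failures, rng):
    """Type I interaction: each natural failure of component i induces
    a failure of component j with probability p."""
    return sum(1 for _ in range(natural_failures) if rng.random() < p)


rng = random.Random(0)  # fixed seed so that the run is reproducible
never = count_induced_failures(0.0, 1000, rng)      # p = 0: nothing propagates
always = count_induced_failures(1.0, 1000, rng)     # p = 1: every failure propagates
sometimes = count_induced_failures(0.3, 1000, rng)  # roughly 30% propagate
```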


2.3 Structural dependence 


Structural dependence between multiple components implies that the maintenance of these
components should always be applied simultaneously. Related components cannot be maintained
individually. For systems with structural dependence both opportunistic and preventive
maintenance policies seem to be advantageous (Nicolai & Dekker 2007).


3 DIADEM PROJECT AND THE CASE OF
RENNES’ TRAM TRAIN


The DIADEM project focuses on the development of diagnostic and prognostic tools for
sensitive components of railway rolling stock. The final aim is to develop new
decision-making tools for maintenance management. This paper focuses on one collateral
objective, which is to question a current maintenance policy and to propose improvements
while preserving the same rules. A fleet of 12 identical trains is considered. The trains
are operated in rotation in order to ensure the requested service availability of P = 8 out
of N = 12 trains. During the stand-by phase the non-operating trains can be held in the
main station in order to cover any additional load, for example in the case of a train
failure, or are brought to the workshop in order to undergo maintenance activities.

The operation of the different units (trains) is controlled by a special management tool
which assigns daily services to the train fleet according to three activity profiles
(intense, medium or low activity). Note that different operation modes for different trains
imply that some trains deteriorate faster than others. Thus, preventive maintenance dates
and failure probabilities are spread over time, which prevents overcrowding of the
maintenance workshop.


3.1 The applied maintenance strategy


The maintenance strategy currently adopted by the operator of Rennes’ tram fleet is
described hereafter. A sequence of periodic revisions is imposed by the manufacturer for
safety and operational quality requirements.

Additionally, an online monitoring system is available on each train and connected to a
control room in the workshop. The monitoring system notifies the operator of any
malfunction in any of the train’s monitored equipment. Regardless of how serious or
dangerous the malfunction is, the operation of the faulty train is immediately interrupted.
The train is temporarily replaced and directed to the workshop for corrective maintenance.
Revisions and preventive maintenance activities are not allowed during the corrective
maintenance period. However short the time remaining before the expiry of the next revision
period, the faulty train must be returned to operation directly after corrective
maintenance has been applied.


3.2 System components and subsystems 


The considered tram system consists of 12 identical trains. Each train is composed of two
cars connected by a coupling system. Following the operation and maintenance manuals
provided by the manufacturer, the train subsystems are defined according to their
operational function. Thus a single subsystem may include several components that are
physically distributed in one or both cars.

Nine types of embedded subsystems are considered. Each subsystem consists of multiple
components which can be physically distributed in different locations of the train.
Sixteen different types of components are recorded. The list of components and subsystems
under consideration, together with their quantities and other characteristics, is presented
in Table 1.

The lifetime laws of the components are derived from both the manufacturer’s maintenance
manuals and the results of the first step of the DIADEM project, in which historical data
on train maintenance activities and failures were collected. For each component type a set
of useful information is available, including the replacement cost and the parameters of
the lifetime distribution (assumed to be a Weibull distribution).
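
For reference, a Weibull lifetime model has the survival function
R(t) = exp(-(t/scale)^shape). The sketch below evaluates it and draws a random lifetime by
inverse transform sampling. The parameter names `shape` and `scale` and the numerical
values are illustrative assumptions, not the parameters estimated in the DIADEM project.

```python
import math
import random


def weibull_reliability(t, shape, scale):
    """Probability that a component survives beyond time t."""
    return math.exp(-((t / scale) ** shape))


def sample_weibull_lifetime(shape, scale, rng):
    """Draw one lifetime by inverting the Weibull distribution function."""
    u = rng.random()  # uniform on [0, 1)
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


rng = random.Random(42)  # fixed seed for reproducibility
lifetime = sample_weibull_lifetime(shape=2.0, scale=1000.0, rng=rng)
r_at_scale = weibull_reliability(1000.0, shape=2.0, scale=1000.0)  # equals exp(-1)
```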

In the next section a description of inter-system 
dependencies and their modeling is presented. 


4 THE PROPOSED OPPORTUNISTIC 
MAINTENANCE APPROACH 


As explained in the previous section, the adopted maintenance strategy is based on periodic
preventive maintenance activities that are strictly scheduled and coupled with corrective
maintenance activities in case of failure. Maintenance activities are performed in the
company’s maintenance workshop, which has limited capacity in terms of space and manpower.

According to the company records, and under the current maintenance and operation policies,
the tram system ensures a global availability of 0.98. Nevertheless, the current
maintenance strategy can be improved in order to reduce operation costs and to meet future
requirements on transportation system capacity and availability.

In this paper we show that considering inter-system dependencies and adding some
flexibility to the preventive maintenance schedule may reduce maintenance costs without
harming system reliability and availability. Our proposed approach relies on the addition
of opportunistic maintenance actions to the mandatory mixed maintenance strategy.
Opportunistic maintenance requires that earlier revisions or anticipated preventive
replacements be permitted under specific rules.

In order to ensure the continuity of the transportation service at the required frequency,
8 trains out of 12 must be operating simultaneously at any time. Non-operating trains are
put on stand-by or sent to the workshop for maintenance operations. Note that the
operator’s workshop has exactly n = 4 tracks where non-operating trains can be received.
To date, the limited financial, technical and human resources of the operating company have
not obstructed the tram service or train availability.


4.1 The modeling of system operation and 
dependencies 


Train operation scheduling is delegated to dedicated software that uses a selection
algorithm designed for this purpose. According to this algorithm, the running schedule of
the different trains is established so that the dates of systematic maintenance
interventions are spread out, in order to avoid overcrowding of the maintenance workshop.

Because of their physical locations and functional interactions, different components from
different subsystems may present several types of dependencies. Since components with
structural dependence are considered as a single part, we are only interested in the
modeling of stochastic and economic dependencies.


Table 1. List of trains’ subsystems and components.

N  Subsystem               Components                                  Number  Additional information
1  Traction                Inverter, Motor                             1
2  Bearings                Wheels, Guide frame, Sensors Antennas       4       2 in each car
3  Brake system            Pads and discs, Control Unit                8       4 in each car
4  Doors                   Body, Motor                                         4 in each car
5  Pneumatic system        Compressed air generator, Consumer          1
6  Low voltage system      Batteries, Converter                        1       Distributed between 2 cars
7  Coupling system         -                                           1       Connecting the 2 cars
8  Body                    -                                           1
9  Automatic control unit  -                                           1       Situated in car A


Economic dependence: Setup levels tree 

Economic dependence is the natural result of the physical installation of the different
components. Based on the system and component descriptions given in the manufacturer’s
manual as well as the technical experience of the maintenance team, the setup level of each
component is deduced as follows.

We first consider that each train is composed of two main blocks, car A and car B. Each
car is then divided into smaller blocks that include different components and equipment
from different subsystems. Block definition and component grouping are done according to
position inside the car. Thus, components of different subsystems may belong to the same
block, while the components of one subsystem do not necessarily belong to a single block.
One simple example is the position of the pads and discs of the braking system, which are
distributed into 4 blocks along the two cars. The pads and discs of one car, however, are
installed on either side of its bearing system.

The division of system components into blocks facilitates the definition of the setup
activities required to reach each component. Components requiring the same setup activities
in order to be reached and maintained are said to be on the same setup level. Different
setup levels can be related in a sequential order. However, the division of setup levels is
not straightforward. Setup levels are organized according to a tree-like model. The root of
the setup tree is the simple activity of bringing the train into the workshop. The
disassembling activities of the different blocks and components are then placed at their
correct position inside the setup levels tree. For each setup level a basic setup time and
cost are then assigned. Information about the setup levels required by each component is
also included in a setup information table. During revision or corrective maintenance
activities, the setup activities of all associated setup levels must be applied in order to
reach a specified component.

The effect of economic dependence appears when two components having a shared setup level
or belonging to the same block are jointly maintained. In this case, to reach both
components during the maintenance session, the common setup activities need to be applied
only once. Thus, the setup cost and the downtime of the train concerned can be reduced.
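
The setup levels tree can be sketched as a parent map in which each component is attached
to one setup level and inherits all levels up to the root. The level names, costs and
component assignments below are hypothetical; only the principle (each shared level is
paid once per session) follows the description above.

```python
def required_levels(level, parent):
    """All setup levels from `level` up to the root of the setup tree."""
    levels = set()
    while level is not None:
        levels.add(level)
        level = parent[level]
    return levels


def joint_setup_cost(components, component_level, parent, level_cost):
    """Setup cost of one session: each shared level is paid only once."""
    needed = set()
    for c in components:
        needed |= required_levels(component_level[c], parent)
    return sum(level_cost[lv] for lv in needed)


# Hypothetical tree: workshop (root) -> car A -> two blocks.
parent = {"workshop": None, "car_A": "workshop",
          "block_A1": "car_A", "block_A2": "car_A"}
level_cost = {"workshop": 50.0, "car_A": 30.0, "block_A1": 20.0, "block_A2": 10.0}
component_level = {"brake_pads": "block_A1", "door_motor": "block_A2"}

separate = (joint_setup_cost(["brake_pads"], component_level, parent, level_cost)
            + joint_setup_cost(["door_motor"], component_level, parent, level_cost))
joint = joint_setup_cost(["brake_pads", "door_motor"],
                         component_level, parent, level_cost)
```

With these illustrative numbers the two separate sessions cost 190 in setup, against 110
for a joint session, because the workshop and car A levels are paid only once.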


Stochastic dependence matrix 
Depending on their physical or operational relationship, stochastic dependences between
components may or may not exist. The dependence considered here is as follows: after the
failure of one component, a shock occurs on the health condition of the dependent
components. The shock is modeled as a random amount of additional damage which follows a
normal distribution. A stochastic dependence matrix is therefore proposed to gather the
cross dependences between the components of the system. The element (i,j) of the
stochastic matrix represents the effect of the failure of a component of type i on a
component of type j. This element consists of four parameters:


• The first parameter is a boolean variable that determines whether or not the failure of
  a component of type i affects the status of components of type j.
• The second parameter is also a boolean variable; it specifies whether the stochastic
  relation between components of types i and j is conditional on both components belonging
  to the same block or car.
• Parameters 3 and 4 are the mean and variance of the normal distribution that quantifies
  the additional damage caused by the failure of a component of type i to the deterioration
  level of a component of type j.


Note that this definition of the stochastic matrix allows different types of stochastic
dependence to be represented for different components simply by changing the parameters of
its elements. The parameters of the stochastic matrix in our case study are estimated from
the component interactions described in the manufacturer’s manual as well as from
historical records of component failures.
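
One element of the stochastic dependence matrix, with its four parameters, could be
represented as follows. This is only a sketch: the class and function names are
hypothetical, and truncating the normal draw at zero is an added assumption so that a shock
never reduces the deterioration level.

```python
import random
from dataclasses import dataclass


@dataclass
class Dependence:
    affects: bool          # parameter 1: does a type-i failure affect type j?
    same_block_only: bool  # parameter 2: only if both are in the same block or car
    mean: float            # parameter 3: mean of the additional damage
    variance: float        # parameter 4: variance of the additional damage


def shock_damage(dep, same_block, rng):
    """Additional damage on a type-j component after a type-i failure."""
    if not dep.affects or (dep.same_block_only and not same_block):
        return 0.0
    # Assumption: damage is truncated at zero (never negative).
    return max(0.0, rng.gauss(dep.mean, dep.variance ** 0.5))


# Hypothetical element (i, j) of the matrix.
dep = Dependence(affects=True, same_block_only=True, mean=5.0, variance=1.0)
rng = random.Random(1)
damage_far = shock_damage(dep, same_block=False, rng=rng)  # condition not met
damage_near = shock_damage(dep, same_block=True, rng=rng)  # random damage
```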


4.2 Flexibility degree and secondary replacement 
parameters 


As described in section 3.1, the currently adopted maintenance strategy is a combination
of preventive and corrective strategies. Preventive maintenance activities are grouped into
three types of systematic revisions, which are periodically scheduled for each train
according to its running distance. At each revision the train is directed to the
maintenance workshop, where a list of inspections, cleaning operations, minimal maintenance
activities and component replacements is applied. A different list of maintenance tasks is
defined for each type of revision. Maintenance tasks are carefully selected according to
the manufacturer’s specifications, component lifetime estimates and feedback from the
experience of the workshop technicians. During a revision, a non-failed component should be
replaced if the inspection reveals that its deterioration level exceeds a predefined
replacement threshold (which we will call the primary threshold). Each revision type
requires a minimal revision cost and a minimal downtime of the system, which are both
recorded together with the list of maintenance tasks. The total cost of a revision is the
sum of the minimal revision cost, the setup costs and the cost of any parts replaced.

According to the current maintenance strategy, corrective maintenance is applied to the
faulty train immediately after the detection of faulty behavior. However, revisions and
preventive maintenance activities are not allowed during the corrective maintenance period.
The total corrective maintenance cost therefore includes setup costs, replacement costs and
any penalty costs in the case of availability loss.


Opportunistic maintenance 

An opportunistic maintenance policy is proposed in order to take better advantage of the
corrective downtime and setup of the system. Early revisions and preventive activities are
allowed according to specified rules. This is done by adding two new parameters to the
maintenance decision rule:


• Flexibility degree: it is applied to revision periods and allows the operator to perform
  early revisions. When a faulty train is directed to the workshop for corrective
  maintenance, the technical team is required to perform the next planned revision if the
  expiry date of the revision period is close enough.
• Secondary replacement threshold: during the corrective maintenance of a given component,
  the technical team is allowed to inspect and replace other components on the same setup
  level if their deterioration level exceeds a secondary replacement threshold. Applying
  this rule may save setup costs and also prevents future failures of the system.


In the next sections, the addition of these two parameters and the application of the two
additional rules are described and discussed.
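
The two opportunistic rules can be sketched as simple predicates. This illustration
assumes that the flexibility degree is expressed as a fraction of the revision period and
the secondary threshold as a fraction of the primary threshold; the function and parameter
names, and the numerical values, are hypothetical.

```python
def early_revision_allowed(distance_since_revision, revision_period, flexibility):
    """Flexibility degree rule: during corrective maintenance, perform the
    next planned revision if it is due within `flexibility` (a fraction of
    the revision period)."""
    remaining = revision_period - distance_since_revision
    return remaining <= flexibility * revision_period


def opportunistic_replacement(deterioration, primary_threshold, secondary_ratio):
    """Secondary threshold rule: replace a component on the same setup level
    if its deterioration exceeds the secondary threshold, given as a fraction
    of the primary threshold."""
    return deterioration > secondary_ratio * primary_threshold


# Hypothetical train: revision every 10000 km, 5% flexibility degree.
close_enough = early_revision_allowed(9600.0, 10000.0, 0.05)  # 400 km remaining
too_early = early_revision_allowed(9000.0, 10000.0, 0.05)     # 1000 km remaining

# Hypothetical component: primary threshold 1.0, secondary threshold at 85%.
replace_now = opportunistic_replacement(0.90, 1.0, 0.85)
keep = opportunistic_replacement(0.80, 1.0, 0.85)
```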


4.3 Simulated operation and maintenance: An 
empirical example 


In this step we test the new opportunistic maintenance approach for the case of Rennes’
tram system using numerical simulation. Simulations are based on the following system
description: (i) the train fleet consists of 12 trains; (ii) the maintenance workshop has
4 maintenance tracks; (iii) 8 trains must be in service continuously in order to ensure the
required load. As explained previously, three types of periodic revisions are defined.
Trains are directed to the workshop for revision when their running distance reaches one of
the periodic revision distances. The running and maintenance of the different trains are
scheduled by specific software in order to ensure the continuity of the transport service
and the correct scheduling of required revisions. If the load is not handled because too
few trains are available, a penalty charge is imposed. Penalty costs are calculated per day
according to the number of missing trains.
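
A minimal sketch of the daily penalty rule is given below. It assumes a penalty that is
linear in the number of missing trains; the function name and the unit penalty are
hypothetical, since the actual penalty values are not given here.

```python
def daily_penalty(available_trains, required_trains, penalty_per_train):
    """Penalty for one day, proportional to the number of missing trains."""
    missing = max(0, required_trains - available_trains)
    return missing * penalty_per_train


# Hypothetical day: 8 trains required, only 6 available, unit penalty of 1000.
shortfall_cost = daily_penalty(6, 8, 1000.0)
no_cost = daily_penalty(9, 8, 1000.0)  # enough trains: no penalty
```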

Different data and parameters are required, especially about dependencies and revisions.


• About the maintenance baseline: for each type of revision, a revision periodicity, the
  maintenance operations as well as a minimal cost and duration are imposed.
• About train states: information about the last revision of each type, the running
  distance, the total time spent in the workshop, and the time and cost spent on all
  maintenance activities are registered, as well as all historical information about
  failures.
• About components: the type of the component, the block and car it belongs to and its
  deterioration level are recorded.
• About dependencies: the stochastic dependence matrix is given.
• About maintenance costs: the setup matrix includes information about the duration and
  cost of each setup level.


Two cases, corresponding to two maintenance strategies, are compared. The first case
represents the imposed maintenance policy as it is currently adopted by the train operator.
The second one takes advantage of opportunistic maintenance as described in section 4.2 by
introducing the flexibility degree and the secondary replacement threshold. In both cases
train operation and revisions are scheduled using the same method. When the monitoring
system detects a malfunction on one of the trains, the operation of this train is
interrupted and the train is directed to the workshop for corrective maintenance. The
component deterioration and lifetime probabilistic rules are also the same.
The two algorithms differ in the following:


First results 

The simulation of train operation and maintenance is run over 60 years. The empirical
results regarding preventive and corrective maintenance as well as the total setup costs of
the train fleet are given in Figure 1. The first column represents the initial maintenance
policy without opportunistic maintenance. The remaining columns present the cases where
opportunistic maintenance actions are introduced with different flexibility degrees and
secondary replacement thresholds. Note that the secondary replacement threshold is
expressed in % of the primary replacement threshold. The results show the advantage of
opportunistic maintenance policies in a mixed maintenance strategy for the case of Rennes’
tram fleet. The major advantage is due to savings on setup costs, which lead to a smaller
operation cost per train.

The best result is found with no flexibility degree on revision periods but with a
secondary replacement threshold of 85% of the primary replacement threshold. The secondary
replacement threshold determines when non-defective components can be replaced during the
corrective maintenance of other components. This preventive maintenance activity adds no
preventive setup cost, because it benefits from the corrective setup activities to reach
all related components.


Figure 1. Operational information and costs for one train simulated with the mixed and
opportunistic maintenance strategies, using different secondary thresholds and flexibility
degrees.

A very close result is found for opportunistic maintenance with a 5% flexibility degree and
secondary replacement thresholds of 80% and 85%. This is also due to savings on setup and
failure costs. Looking at the unavailability durations and the average failure rate of
trains, we also notice that the addition of the opportunistic maintenance policy does not
harm system availability. However, the results suggest that the relation between economic
gains and the additional decision parameters (flexibility degree and secondary threshold)
is neither linear nor straightforward. Numerical optimization methods should be implemented
for each specific case in order to choose the best flexibility parameters.


5 CONCLUSION 


Maintenance strategy optimization is a major concern of operating companies in the
transportation industry. For security and safety reasons, there is very little flexibility
in maintenance planning for the considered fleet, which is derived from Rennes’ tram train
fleet. Additional constraints arise from high availability requirements and from limited
maintenance facilities and labor. Trains are maintained according to a strictly scheduled
preventive maintenance plan which involves three types of periodic preventive revisions.
This paper considers new degrees of freedom in the maintenance policy, which are related to
additional opportunistic maintenance actions based on dependencies. A dependence matrix and
a tree-like setup model are considered in order to characterize stochastic and economic
dependencies between the different components of trains. Two “flexibility parameters” are
introduced in the maintenance decision rule. The first one is the flexibility degree, which
allows the operator to perform earlier revisions. The second one, the secondary replacement
threshold, allows the operator to perform preventive replacement of non-defective parts
during corrective maintenance of other components.

The effect of introducing flexibility parameters into the maintenance strategy has been
tested numerically. The results show the interest of opportunistic maintenance and suggest
that it could be very advantageous in terms of maintenance cost savings without harming
system availability. However, the choice of flexibility parameters is not obvious. Advanced
numerical methods should be used in order to optimize the decision variables. The effects
of different failures on system security and passenger safety have not been considered in
this study. A fixed penalty cost is used to quantify the unavailability cost caused by the
failure of any of the train’s components. Further studies may focus on the types and
effects of system failures in order to take them into consideration in the maintenance
optimization process.


ABSTRACT: The quality and reliability of road infrastructure and its equipment play a major role
in road safety. This is especially true for autonomous car traffic guided mainly by a GPS system that is,
unfortunately, neither precise nor reliable. In order to improve the guidance systems, one option could be
to equip the vehicle with a camera reading road markings. Such a solution requires maintenance strategies
guaranteeing the markings’ perceptibility to the human eye or the autonomous car camera. Currently, the
retroreflection luminance of markings is measured to evaluate marking degradation. An important
remaining step is a lifetime analysis depending on the inspection strategy. Since the exact failure time
is generally not observed, feedback databases contain many censored data: left-censored data,
corresponding to a marking failing before the first inspection; interval-censored data, corresponding to
markings failing between two inspections; and right-censored data, corresponding to a marking that has
not failed when observation ends. In the literature, a Weibull analysis was proposed to estimate the
markings’ reliability distributions using maximum likelihood through the Newton-Raphson method. With
censored data, this approach cannot be computed without introducing strong bias in the reliability
estimation. For generic interval-censored data, Pradhan and Kundu proposed an alternative based on the
EM algorithm. In our study an extension of the EM algorithm that handles left and right censoring is
proposed. As a result, the algorithm is applicable to all kinds of observations, whatever the nature of
the censoring. After introducing this EM extension, the paper focuses on the fact that computations are
simpler than with the Newton-Raphson method and that censored data are directly estimated. The case of
the markings on French National Road 4 is considered to illustrate the proposed approach. Moreover,
since the proposed algorithm is generic, its application is of course not limited to our road marking
case study.


1 INTRODUCTION 

The quality and reliability of road infrastructure and its equipment 
play a major role in road safety. This is especially true for autonomous 
car traffic. Currently, autonomous vehicles are guided mainly by a GPS 
system. Unfortunately, GPS systems are neither precise nor reliable. For 
example, GPS signals may be unavailable in urban canyons or tunnels. In 
order to improve autonomous vehicle guidance systems, one option would be 
to equip the vehicle with a camera able to “read” road markings. However, 
this solution requires a maintenance strategy guaranteeing that road 
markings remain perceptible to a human eye or an autonomous car camera. 
According to both the AFNOR rules (AFNOR 2009) and the available 
inspection devices, the retroreflection luminance of markings is the only 
measure used for evaluating marking degradation. 

A retroreflective marking reflects light from a vehicle headlight back 
in the direction of the driver. For waterborne markings, the 
retroreflective property is guaranteed by glass spheres mixed into the 
paint during application. The retroreflection luminance is measured in 
millicandela per square meter and per lux (mcd/m²/lx). A minimum threshold 
of 150 mcd/m²/lx is required for a new marking (AFNOR 2009). 

Several decay models for retroreflective markings exist in the current 
literature; they mainly calculate retroreflective luminance with a 
regression model. For example, Lu (1995) proposed an exponential 
regression model as a function of the age of markings, Abboud & Bowman 
(2002) developed exponential models as a function of the Annual Average 
Daily Traffic (AADT) and the age of markings, Sarasua, Clarke, & Davis 
(2003) calculated the difference in reflectivity over time, and Sitzabee 
et al. (2009) proposed the most complete multilinear decay model as a 
function of time, the initial retroreflection, the AADT, the lateral 
location of markings and the marking color. All decay models have a common 
weakness: they are difficult to apply directly to a given road network. 
For example, consider a single road. The road is not systematically 
maintained in its entirety. For safety reasons, road managers maintain 
only specific areas at a time. This is especially true for road surface 
maintenance. 

In a previous study, a clustering approach able 
to segment a road network according to past 
inspections was proposed (Redondin et al. 2017). 
Each cluster was interpreted with respect to a 
specific area of the road network and admits its own 
retroreflective luminance evolution over time. As a 
result, each cluster has its own optimum 
maintenance strategy. An important remaining 
step is a lifetime analysis. This work could 
confirm the need for one maintenance strategy 
per cluster and is the first step towards developing any 
maintenance model. 

Currently, road markings are inspected monthly or yearly 
with a retroreflectometer. Such a periodic approach 
cannot determine the exact failure time of any 
marking. Moreover, three kinds of censoring are 
observed in the feedback database: left censoring, 
corresponding to a marking failing before the first 
inspection; interval censoring, corresponding to 
markings failing between two inspections; and right 
censoring, corresponding to a marking that never fails. 

A Weibull analysis was proposed by Sathyanarayanan 
et al. (2008), which consists of estimating the 
reliability function of the markings using a 
Weibull distribution. Its parameters are estimated 
by the Maximum Likelihood Estimator (MLE). In the 
presence of censored data, the likelihood function 
depends on the probability density, the cumulative 
distribution and the reliability function of a Weibull 
distribution. For interval-censored data, Pradhan and 
Kundu (2014) found instances where the Newton-Raphson 
method cannot be computed, and proposed an alternative 
based on the EM algorithm (McLachlan & Krishnan 2008). 

Section 2 proposes an extension of the EM 
algorithm that handles left and right censoring. The 
proposed algorithm is applicable to all observation 
vectors, independently of the nature of the 
censoring. After introducing this EM extension, this 
paper focuses on computations that are simpler than 
the Newton-Raphson method (Bain & Englehardt 1975). 
Moreover, censored data are directly estimated. The 
proposed approach could also be applied to other case 
studies and is not restricted to the road markings 
considered in this paper. 

The French National Road 4 inspection 
database is considered to illustrate the proposed 
approach and its use for analysing maintenance 
cycles. Two applications are proposed. The first 
takes the whole maintenance cycle into account, and 
the EM algorithm produces one global Weibull analysis. 
The second first segments the cycle using a clustering 
approach (Redondin et al. 2017); each cluster then 
admits its own Weibull analysis, which takes different 
decay profiles into account. 


2 IMPROVED EM ALGORITHM 


2.1 Censored Weibull distribution 


In this paper, a Weibull distribution $W(\alpha, \beta)$ is 
defined by its associated probability density function (1), 
where $\alpha > 0$ and $\beta > 0$ are respectively the 
associated scale and shape parameters. 

$$f(t) = \begin{cases} \dfrac{\beta}{\alpha}\left(\dfrac{t}{\alpha}\right)^{\beta-1} e^{-(t/\alpha)^{\beta}} & \text{if } t \ge 0 \\ 0 & \text{if } t < 0 \end{cases} \tag{1}$$

The observed data $(t_1, \ldots, t_n)$ are assumed to be independent 
and identically distributed. In this paper, an observed failure is a 
failure established according to a given inspection strategy. 
$\forall i = 1, \ldots, n$, $t_i$ is assumed to be the first observed 
failure time of item $i$. Three censoring cases are considered. 


• If a failure has occurred before the first inspection, then it is called left-censored. 
• If a failure has occurred between two inspections, then it is called interval-censored. 
• If a failure hasn’t been observed, then it is called right-censored. 


If $t_i$ is interval-censored, then its associated 
interval is $[l_i, r_i]$. The interval is clearly defined 
according to the current inspection strategy. A censor 
detector $\delta_i$ is associated with each observation as 
follows, $\forall i = 1, \ldots, n$: 

$$\delta_i = \begin{cases} 0 & \text{if } t_i \text{ is uncensored} \\ 1 & \text{if } t_i \text{ is left-censored} \\ 2 & \text{if } t_i \text{ is interval-censored} \\ 3 & \text{if } t_i \text{ is right-censored} \end{cases} \tag{2}$$

From now on, the observed data are subdivided 
according to the censor detector. 


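$$T = \{\, t \in ((t_1,\delta_1), \ldots, (t_n,\delta_n)) \;/\; \delta = 0 \,\} \tag{3}$$
$$X = \{\, x \in ((t_1,\delta_1), \ldots, (t_n,\delta_n)) \;/\; \delta = 1 \,\} \tag{4}$$
$$Y = \{\, y \in ((t_1,\delta_1), \ldots, (t_n,\delta_n)) \;/\; \delta = 2 \,\} \tag{5}$$
$$Z = \{\, z \in ((t_1,\delta_1), \ldots, (t_n,\delta_n)) \;/\; \delta = 3 \,\} \tag{6}$$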


In the case of an uncensored Weibull distribution, 
the MLE is the couple $(\alpha, \beta)$ which maximizes the 
associated likelihood function (7). This couple also 
simultaneously cancels the two partial derivatives of the 
log-likelihood function. For the specific case of a 
Weibull distribution, this system of non-linear 
equations is usually solved by a Newton-Raphson 
approach (Bain & Englehardt 1975). 

$$L(\alpha, \beta) = \prod_{i=1}^{n} f(t_i) \tag{7}$$


The likelihood function associated with a censored 
Weibull distribution (8) depends on the reliability 
function $S$ for the interval and right censoring cases, 
and on the cumulative distribution function 
$F = 1 - S$ for the left censoring case. 

$$L(\alpha, \beta) = \prod_{t \in T} f(t) \prod_{x \in X} F(x) \prod_{y \in Y} \left[ S(l_y) - S(r_y) \right] \prod_{z \in Z} S(z) \tag{8}$$

Reminder: let $W$ follow a Weibull distribution 
$W(\alpha, \beta)$. The reliability function (9), which is the 
function $S$ of (8), is the probability that the time of 
failure is later than some specified time $t > 0$. 

$$R(t) = P(W > t) = e^{-(t/\alpha)^{\beta}} \tag{9}$$
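
As an illustration of how the censored likelihood is assembled, the following is a minimal Python sketch of the density, the reliability function and the logarithm of (8), in the scale/shape parameterization of (1). The function names and the keyword arguments (`exact`, `left`, `interval`, `right`, standing for the sets T, X, Y, Z) are assumptions made for this example and do not come from the paper.

```python
import math

def weibull_pdf(t, alpha, beta):
    """Density (1) with scale alpha and shape beta (t = 0 is treated as 0 for simplicity)."""
    if t <= 0:
        return 0.0
    return (beta / alpha) * (t / alpha) ** (beta - 1) * math.exp(-((t / alpha) ** beta))

def reliability(t, alpha, beta):
    """Reliability function (9): P(W > t)."""
    return math.exp(-((t / alpha) ** beta)) if t > 0 else 1.0

def censored_log_likelihood(alpha, beta, exact=(), left=(), interval=(), right=()):
    """Logarithm of the censored likelihood (8)."""
    ll = sum(math.log(weibull_pdf(t, alpha, beta)) for t in exact)
    # Left-censored: the failure happened before x, so F(x) = 1 - S(x).
    ll += sum(math.log(1.0 - reliability(x, alpha, beta)) for x in left)
    # Interval-censored: the failure happened between l and r.
    ll += sum(math.log(reliability(l, alpha, beta) - reliability(r, alpha, beta))
              for l, r in interval)
    # Right-censored: no failure observed up to z.
    ll += sum(math.log(reliability(z, alpha, beta)) for z in right)
    return ll
```

Called with the counts of the NR4 case study (2 left-censored observations at 6 months, 13, 39 and 2 observations in [6, 18], [18, 30] and [30, 42], and 17 right-censored observations at 42 months), the function can be used to compare candidate parameter pairs.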


For the road markings study, Sathyanarayanan 
et al. (Sathyanarayanan, Shankar, & Donnell 2008) 
chose this classic approach for a Weibull analysis. 
In a breast cancer case study restricted to 
interval-censored data, Pradhan and Kundu (Pradhan 
& Kundu 2014) showed several examples 
where this approach cannot be computed. Specifically, 
the Newton-Raphson approach isn’t able to estimate 
$(\alpha, \beta)$. They proposed an alternative based on an 
EM algorithm. This paper proposes an extension of that 
algorithm which also handles left and right censoring. 

The EM algorithm treats censored data as 
missing data to be estimated, and iterates over 
two steps: 

1. The Expectation Step estimates the censored data 
according to a given Weibull distribution $W(\alpha, \beta)$. 

2. The Maximization Step estimates a Weibull 
distribution $W(\alpha, \beta)$ by the MLE from 
both the uncensored and the completed data. 

The algorithm iterates until the distribution 
converges. This point is detailed in section 2.4. 
First, sections 2.2 and 2.3 present the extended 
algorithm through one given iteration. 


2.2 Improved expectation step 


Let $W$ follow a Weibull distribution $W(\alpha, \beta)$. 
To simplify the EM formalism, the substitution 
$\alpha \leftarrow 1/\alpha^{\beta}$ is suggested by Pradhan and Kundu 
(Pradhan & Kundu 2014). Accordingly, the 
density function is given by (10) in this section. 

$$f(t) = \begin{cases} \alpha \beta \, t^{\beta-1} e^{-\alpha t^{\beta}} & \text{if } t > 0 \\ 0 & \text{if } t \le 0 \end{cases} \tag{10}$$

The Expectation Step calculates the estimated 
likelihood function, defined as the conditional 
expectation of the likelihood function (7) given the 
uncensored data (11). 

$$L_c(\alpha, \beta) = E\left[ L(\alpha, \beta) \mid T \right] \tag{11}$$

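This conditional expectation leads to 
three estimators, $\forall x \in X, y \in Y, z \in Z$. Each one is 
a conditional expectation adapted to one specific type of 
censoring: (12) and (14) respectively estimate the left- 
and right-censored data, and (13) is proposed by Pradhan 
and Kundu (Pradhan & Kundu 2014) for the 
interval-censored case, where $[l, r]$ is the interval associated with $y$. 

$$\hat{x} = E[W \mid W < x] = \frac{\alpha \beta \int_0^{x} t^{\beta} e^{-\alpha t^{\beta}} \, dt}{1 - e^{-\alpha x^{\beta}}} \tag{12}$$

$$\hat{y} = E[W \mid l < W < r] = \frac{\alpha \beta \int_l^{r} t^{\beta} e^{-\alpha t^{\beta}} \, dt}{e^{-\alpha l^{\beta}} - e^{-\alpha r^{\beta}}} \tag{13}$$

$$\hat{z} = E[W \mid W > z] = \frac{\alpha \beta \int_z^{\infty} t^{\beta} e^{-\alpha t^{\beta}} \, dt}{e^{-\alpha z^{\beta}}} \tag{14}$$

Finally, the completed likelihood function associated 
with a censored Weibull distribution is defined 
by (15). 

$$L_c(\alpha, \beta) = \prod_{t \in T} f(t) \prod_{x \in X} f(\hat{x}) \prod_{y \in Y} f(\hat{y}) \prod_{z \in Z} f(\hat{z}) \tag{15}$$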
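
The three estimators (12)-(14) are the same conditional mean taken over different ranges, which the sketch below makes explicit. It is a hedged illustration only: the helper names are assumptions, the integral is approximated with a composite Simpson rule rather than whatever the authors used, and the infinite upper bound of (14) is truncated where the survival probability is negligible. The parameter `a` is the substituted parameter of density (10), that is, the scale raised to the power minus the shape.

```python
import math

def simpson(func, lo, hi, n=2000):
    """Composite Simpson rule on [lo, hi] with an even number n of panels."""
    h = (hi - lo) / n
    total = func(lo) + func(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(lo + i * h)
    return total * h / 3.0

def conditional_mean(lo, hi, a, b):
    """E[W | lo < W < hi] for the density a*b*t**(b-1)*exp(-a*t**b) of (10)."""
    survival = lambda t: math.exp(-a * t ** b)
    if math.isinf(hi):
        # Truncation point chosen so that the neglected tail mass is about exp(-36).
        hi = max((36.0 / a) ** (1.0 / b), 2.0 * lo)
    numerator = simpson(lambda t: a * b * t ** b * math.exp(-a * t ** b), lo, hi)
    return numerator / (survival(lo) - survival(hi))

# Parameters of the global analysis reported later in the paper, W(31.65, 2.20).
a, b = 31.65 ** -2.20, 2.20
x_hat = conditional_mean(0.0, 6.0, a, b)        # left-censored at 6 months, as in (12)
y_hat = conditional_mean(18.0, 30.0, a, b)      # interval-censored in [18, 30], as in (13)
z_hat = conditional_mean(42.0, math.inf, a, b)  # right-censored at 42 months, as in (14)
```

With these parameters the three values come out close to the estimated failure times listed for the global analysis in Table 1 (4.12, 24.01 and 50.41 months).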


2.3 Improved maximization step 


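According to the MLE formalism, the couple 
$(\alpha, \beta)$ simultaneously cancels the two partial 
derivatives of the completed log-likelihood function 
(still denoted $L_c$), where $n$ is the total number of observations. 

$$\partial_\alpha L_c = \frac{n}{\alpha} - \sum_{t \in T} t^{\beta} - \sum_{x \in X} \hat{x}^{\beta} - \sum_{y \in Y} \hat{y}^{\beta} - \sum_{z \in Z} \hat{z}^{\beta} \tag{16}$$

$$\partial_\beta L_c = \frac{n}{\beta} + \sum_{t \in T} \ln t + \sum_{x \in X} \ln \hat{x} + \sum_{y \in Y} \ln \hat{y} + \sum_{z \in Z} \ln \hat{z} - \alpha \left( \sum_{t \in T} t^{\beta} \ln t + \sum_{x \in X} \hat{x}^{\beta} \ln \hat{x} + \sum_{y \in Y} \hat{y}^{\beta} \ln \hat{y} + \sum_{z \in Z} \hat{z}^{\beta} \ln \hat{z} \right) \tag{17}$$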


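This system of non-linear equations doesn’t 
admit an obvious solution. Restricted to 
interval-censored data, Pradhan and Kundu (Pradhan & 
Kundu 2014) solved it by a fixed point approach. This 
choice follows from an estimation of $\alpha$ given by the 
equation $\partial_\alpha L_c = 0$. Handling left and right 
censoring as well, the extended estimator is: 

$$\hat{\alpha} = \frac{n}{\sum_{t \in T} t^{\beta} + \sum_{x \in X} \hat{x}^{\beta} + \sum_{y \in Y} \hat{y}^{\beta} + \sum_{z \in Z} \hat{z}^{\beta}} \tag{18}$$

This estimator (18) depends only 
on $\beta$: an estimation of $\beta$ is enough. Furthermore, 
replacing $\alpha$ by (18) in (17), a function $g(\beta)$ such 
that $\partial_\beta L_c = 0$ if and only if $g(\beta) = \beta$ is 
extracted (19). To simplify, let $S = X \cup Y \cup Z$ denote the 
set of censored observations, each replaced by its estimate $\hat{s}$. 

$$g(\beta) = \left[ \frac{\sum_{t \in T} t^{\beta} \ln t + \sum_{s \in S} \hat{s}^{\beta} \ln \hat{s}}{\sum_{t \in T} t^{\beta} + \sum_{s \in S} \hat{s}^{\beta}} - \frac{1}{n} \left( \sum_{t \in T} \ln t + \sum_{s \in S} \ln \hat{s} \right) \right]^{-1} \tag{19}$$

The fixed point of $g$ is the point $\beta$ such that $g(\beta) = \beta$, 
also defined as the limit of the 
sequence (20). The initial value is arbitrary. This 
point is detailed in section 2.4. 

$$(u_n)_{n \ge 0}: \quad u_0 > 0 \ \text{arbitrary}, \qquad u_n = g(u_{n-1}) \quad \forall n \ge 1 \tag{20}$$

To conclude, $\alpha$ is finally deduced from (18). 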
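
The maximization step can be sketched in a few lines of Python once the censored observations have been replaced by their estimates. The sketch assumes the reading of (18)-(20) in which the shape is the fixed point of $g$ and the substituted parameter follows from the sum of powers; the names `g` and `m_step`, the starting value and the tolerance are choices made for this illustration. The sample used here is the completed data set implied by Table 1 of the case study (counts and estimated failure times).

```python
import math

def g(beta, data):
    """Function (19) on the completed sample (uncensored plus estimated failure times)."""
    powers = [s ** beta for s in data]
    logs = [math.log(s) for s in data]
    weighted_log = sum(p * l for p, l in zip(powers, logs)) / sum(powers)
    return 1.0 / (weighted_log - sum(logs) / len(data))

def m_step(data, beta0=1.0, tol=1e-8, max_iter=1000):
    """Iterate u_n = g(u_{n-1}) as in (20), then deduce alpha from (18)."""
    beta = beta0
    for _ in range(max_iter):
        new_beta = g(beta, data)
        done = abs(new_beta - beta) < tol
        beta = new_beta
        if done:
            break
    alpha = len(data) / sum(s ** beta for s in data)
    return alpha, beta

completed = [4.12] * 2 + [12.94] * 13 + [24.01] * 39 + [35.44] * 2 + [50.41] * 17
alpha, beta = m_step(completed)
scale = alpha ** (-1.0 / beta)  # back to the scale parameter of density (1)
```

On this sample the iteration settles near a shape of 2.2 and a scale of about 31.6, which is consistent with the global distribution W(31.65, 2.20) reported in section 3.2.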


2.4 Iteration process and convergence 


The Weibull distribution estimated at algorithm 
iteration $k \ge 0$ is denoted $W(\alpha_k, \beta_k)$. The 
initial distribution $W(\alpha_0, \beta_0)$ is arbitrary. In this 
paper, $\alpha_0$ and $\beta_0$ are the MLE where the 
censoring is not taken into account. 

Sections 2.2 and 2.3 show that one algorithm 
iteration reduces to an estimation of $\beta$ 
given by the limit of (20), which depends 
on the function $g$ (19) conditioned on the estimated 
data (12-14). As $\alpha_k$ is directly defined by both 
$\beta_k$ and (18), the sequence $(\beta_k)_{k \ge 0}$ is clearly 
defined as: 

$$(\beta_k)_{k \ge 0}: \quad \beta_0 > 0, \qquad \beta_k = \lim_{n \to \infty} u_n \quad \text{with } u_0 = \beta_{k-1}, \ u_n = g(u_{n-1}) \tag{21}$$

$|u_n - u_{n-1}| < 10^{\ldots}$ is the selected stopping 
criterion. Finally, the EM algorithm convergence is 
defined both by the convergence of the sequence 
$(\beta_k)_{k \ge 0}$ and (18): 

$$\beta = \lim_{k \to \infty} \beta_k, \qquad \alpha = \frac{n}{\sum_{t \in T} t^{\beta} + \sum_{x \in X} \hat{x}^{\beta} + \sum_{y \in Y} \hat{y}^{\beta} + \sum_{z \in Z} \hat{z}^{\beta}} \tag{22}$$

Again, the stopping criterion is $|\beta_k - \beta_{k-1}| < 10^{\ldots}$. 

To conclude, the substitution $\alpha \leftarrow 1/\alpha^{1/\beta}$ returns 
the current Weibull distribution (1). 
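
Putting the two steps together, one possible reading of the whole iteration is sketched below in Python. This is an illustration under stated assumptions, not the authors' implementation: the integrals are approximated with a Simpson rule, the starting point is passed in by the caller (the paper starts from the MLE that ignores censoring), and the tolerance `1e-4` is an arbitrary choice for this example. All function names are hypothetical.

```python
import math

def simpson(func, lo, hi, n=400):
    """Composite Simpson rule with an even number n of panels."""
    h = (hi - lo) / n
    total = func(lo) + func(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(lo + i * h)
    return total * h / 3.0

def conditional_mean(lo, hi, a, b):
    """E[W | lo < W < hi] under density (10), as in (12)-(14)."""
    if math.isinf(hi):
        hi = max((36.0 / a) ** (1.0 / b), 2.0 * lo)  # negligible tail beyond this point
    numerator = simpson(lambda t: a * b * t ** b * math.exp(-a * t ** b), lo, hi)
    return numerator / (math.exp(-a * lo ** b) - math.exp(-a * hi ** b))

def g(beta, data):
    """Function (19) on the completed sample."""
    powers = [s ** beta for s in data]
    logs = [math.log(s) for s in data]
    return 1.0 / (sum(p * l for p, l in zip(powers, logs)) / sum(powers)
                  - sum(logs) / len(data))

def m_step(data, beta, tol=1e-8, max_iter=1000):
    """Fixed point (20) for beta, then alpha from (18)."""
    for _ in range(max_iter):
        new_beta = g(beta, data)
        done = abs(new_beta - beta) < tol
        beta = new_beta
        if done:
            break
    return len(data) / sum(s ** beta for s in data), beta

def em_weibull(exact, left, interval, right, scale0, shape0, tol=1e-4, max_iter=200):
    """Return (scale, shape, iterations) for censored Weibull data."""
    a, b = scale0 ** (-shape0), shape0  # substitution used in section 2.2
    bounds = ([(0.0, x) for x in left] + [(l, r) for l, r in interval]
              + [(z, math.inf) for z in right])
    for iteration in range(1, max_iter + 1):
        # E-step: one conditional mean per distinct censoring range.
        estimate = {bd: conditional_mean(bd[0], bd[1], a, b) for bd in set(bounds)}
        completed = list(exact) + [estimate[bd] for bd in bounds]
        # M-step on uncensored plus completed data.
        a, new_b = m_step(completed, b)
        converged = abs(new_b - b) < tol
        b = new_b
        if converged:
            break
    return a ** (-1.0 / b), b, iteration  # back-substitution to the scale of (1)
```

A call such as `em_weibull([], [6.0] * 2, [(6.0, 18.0)] * 13 + [(18.0, 30.0)] * 39 + [(30.0, 42.0)] * 2, [42.0] * 17, 30.0, 2.0)` uses the observation counts of the NR4 case study; the number of iterations depends on the starting point and on the tolerance, so it need not match the count reported in the paper.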


3 MAINTENANCE CYCLE 2008-2012 


3.1 Presentation of the French National Road 4 


The broken centerline of the French National 
Road 4 (NR4) is considered to illustrate the pro- 
posed approach. The NR4 runs between Paris and 
Strasbourg. Since 2007, the section of this road 
between Courgivaux and Vauclerc (~102 km) has 
been managed by the DIR Est, which inspects 
this section of the road once a year. Inspections 
are organized in September in collaboration with 
CEREMA Est. The selected retroreflectometer is 
an Ecodyn. 

The selected maintenance cycle is composed 
of 73 measures annotated with a PR. To simplify, 
each measure is interpreted as one marking. Markings 
were laid in March 2008 and replaced in March 
2012. The marking material is assumed to be 
the same throughout. The cycle is localized around three cities: 
Courgivaux, Sommesous and Vitry-le-François. 
The direction heading toward Vauclerc is chosen. 

Four inspections are available: at 6, 18, 30 and 
42 months (after March 2008). The retroreflection 
luminance of a given marking $i$ at the 
inspection time $t$ is denoted $RL_t(i) \in \mathbb{N}^*$. If 
$RL_t(i) < 150$ mcd/m²/lx, then the marking $i$ has 
failed at $t$. Let $\tau_i$ be the first time when a 
given marking $i$ is observed failing. 

$$\tau_i = \min \{\, t \in \{6, 18, 30, 42\} \;/\; RL_t(i) < 150 \,\} \tag{23}$$

According to $\tau_i$, $t_i$ is the first time when the marking 
$i = 1, \ldots, 73$ is observed failing. If a marking $i$ 
hasn’t been observed failing, then the marking is 
right-censored and $\tau_i = \emptyset$. In this situation, $t_i = 42+$ 
tentatively. 

$$t_i = \begin{cases} \tau_i & \text{if } \tau_i \ne \emptyset \\ 42+ & \text{else} \end{cases} \tag{24}$$

The adapted censor detector is deduced (25). 
From now on, if $t_i = 42+$ then $t_i = 42$. This 
formalism assumes three censoring intervals ([6,18], 
[18,30] and [30,42]) and no $t_i$ is an uncensored 
observation. Finally, the observed data are 
defined as $((t_1, \delta_1), \ldots, (t_{73}, \delta_{73}))$. 

$$\delta_i = \begin{cases} 1 & \text{if } t_i = 6 \\ 2 & \text{if } t_i = 18, 30, 42 \\ 3 & \text{if } t_i = 42+ \end{cases} \tag{25}$$
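
To make the encoding (23)-(25) concrete, here is a small Python sketch that turns the four luminance readings of one marking into the pair $(t_i, \delta_i)$ and the corresponding censoring range. The function names and the example readings are hypothetical; only the inspection times and the 150 mcd/m²/lx threshold come from the text.

```python
import math

INSPECTIONS = (6, 18, 30, 42)   # months after March 2008
THRESHOLD = 150                 # mcd/m²/lx

def classify(readings):
    """Return (t_i, delta_i) from the readings taken at the four inspections."""
    failing = [t for t, rl in zip(INSPECTIONS, readings) if rl < THRESHOLD]
    if not failing:
        # Never observed failing: right-censored, recorded at the last inspection (42+ -> 42).
        return INSPECTIONS[-1], 3
    tau = min(failing)
    # Failing at the very first inspection means the failure happened before it.
    return tau, 1 if tau == INSPECTIONS[0] else 2

def censoring_range(t, delta):
    """Range in which the true failure time lies, given (t_i, delta_i)."""
    if delta == 1:
        return (0, t)
    if delta == 3:
        return (t, math.inf)
    previous = INSPECTIONS[INSPECTIONS.index(t) - 1]
    return (previous, t)

# Hypothetical readings for three markings.
print(classify([140, 120, 100, 90]))    # already below the threshold at 6 months
print(classify([200, 160, 140, 130]))   # first below the threshold at 30 months
print(classify([300, 250, 200, 180]))   # never below the threshold
```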


According to the inspection campaigns, the monitoring 
of the retroreflective luminance is presented 
in Figure 1. All censoring cases are present: 
two failures are observed at the 6th month 
(September 2008), a group of markings had not 
failed by the 42nd month (September 2011), 
and the majority of failures are interval-censored. At 
least two decay profiles are also visible. 

Two approaches are possible: a global 
and a clustering Weibull analysis. The global 
analysis consists in selecting the whole monitoring 
and estimating one global Weibull distribution. 
This approach is useful when the observed data set is 
small, which can happen for two reasons. 
Firstly, current retroreflectometers such as the Ecodyn 
generally produce one average measure every 
100 m, which considerably reduces the size of the data 
set. Secondly, the current maintenance strategy may 
concern only one specific area. The clustering 
analysis consists in first segmenting the whole 
monitoring and then estimating one Weibull distribution per 
cluster. This approach is useful for producing 
one specific maintenance strategy per cluster over 
time, but the observed data set must be large and must 
contain a variety of censoring cases. 

Figure 1. Maintenance cycle on the NR4 broken centerlines 
between September 2008 and September 2011 (retroreflection 
luminance in mcd/m²/lx against time in months, with the 
minimum retroreflection threshold marked). 


3.2 Global Weibull analysis 


Table 1 presents the Weibull analysis. The 
first column indicates the observed data together 
with the censor indicator. For example, (6,1) 
corresponds to failures observed at the 6th 
month (September 2008), which is the left-censored 
case. The second column indicates the number of 
observations: for example, there are 2 left-censored 
data and they correspond to 3% of the observed 
data. The last column indicates the failure time 
(in months) estimated by the EM algorithm. For 
example, according to the Weibull distribution 
W(31.65, 2.20), left-censored data are estimated at 
4.12 months (July 2008). 

Table 1 confirms the observations made on the 
monitoring. Interval-censored data are the dominant 
case (74%). In particular, most failures 
occurred between 18 and 30 months 
(53%). Right-censored data are the second most 
frequent case (23%). Only two markings are 
left-censored. 

The EM algorithm converges after 12 iterations 
to the Weibull distribution W(31.65, 2.20). 
According to this distribution, several estimated 
failure times are proposed. Interval-censored failures 
are estimated at 12.94, 24.01 and 35.44 months (March 
2009, March 2010 and February 2011). These estimates 
lie close to the midpoint of each interval. Left- and 
right-censored failures are respectively estimated at 
4.12 months (July 2008) and 50.41 months (May 
2012). The maintenance campaign in March 2012 
is therefore warranted. 

Finally, the EM algorithm is able to produce one 
Weibull analysis adapted to the whole monitoring. 


Table 1. Weibull analysis of the 2008-2012 maintenance cycle. 

Observed failure    Numbers     Estimated failure (months), W(31.65, 2.20) 
(6, 1)               2 (3%)      4.12 
([6, 18], 2)        13 (18%)    12.94 
([18, 30], 2)       39 (53%)    24.01 
([30, 42], 2)        2 (3%)     35.44 
(42, 3)             17 (23%)    50.41 
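
The calendar dates quoted next to the estimated failure times can be reproduced with a simple conversion. The sketch below assumes that the number of completed months is counted from the laying date (March 2008); this convention is inferred from the dates quoted in the text and is not stated by the authors, and the function name is hypothetical.

```python
MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

def failure_month(months_after_laying, start_year=2008, start_month=3):
    """Calendar month reached after the given number of months since laying."""
    whole_months = int(months_after_laying)      # completed months only
    index = (start_month - 1) + whole_months     # months counted from January of start_year
    return f"{MONTH_NAMES[index % 12]} {start_year + index // 12}"

for months in (4.12, 12.94, 24.01, 35.44, 50.41):
    print(months, "->", failure_month(months))
```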


3.3. Clustering Weibull analysis 


Based on an Agglomerative Hierarchical Clustering, 
the clustering process proposed at the previous 
ESREL conference can be restricted to the maintenance 
cycle (Redondin, Bouillaut, Daucher, & Faul 
2017). Two clusters are proposed: Sommesous and 
other cities. Figure 2 distinguishes the clusters on the 
monitoring. Sommesous (red) represents markings 
which show a strong retroreflection luminance, a 
fast decay between 6 and 30 months and finally a 
stationary decay until the 42nd month (September 2011). 
Other cities (blue) present a fast decay between 6 and 
18 months and a slow decay until the 42nd month. 

Table 2 presents the Weibull analysis by 
cluster. The structure is similar to Table 1, except for 
an added first column which indicates both the cluster 
and the estimated Weibull distribution. The 
EM algorithm proposed the Sommesous Weibull 
distribution W(41.35, 3.5) and the Other Cities Weibull 
distribution W(18.83, 6.61) after 19 
and 38 iterations respectively. 

Figure 2. Maintenance cycle 2008-2012 by clusters. 

Table 2. Weibull analysis of the maintenance cycle 2008-2012 by clusters. 

Cluster                         Observed failure    Numbers     Estimated failure (months) 
Sommesous, W(41.35, 3.5)        ([18, 30], 2)       18 (25%)    24.95 
                                ([30, 42], 2)        2 (3%)     35.11 
                                (42, 3)             17 (23%)    49.84 
Other Cities, W(18.83, 6.61)    (6, 1)               2 (3%)      5.21 
                                ([6, 18], 2)        13 (18%)    15.23 
                                ([18, 30], 2)       21 (29%)    20.14 

Sommesous markings failed mainly between 
18 and 30 months and account for all right-censored 
data. Their failures are respectively estimated at 
24.95 months (March 2010) and 49.84 months 
(April 2012). Other markings failed mainly 
between 6 and 18 months or between 18 and 30 months and 
account for all left-censored data. Their failures are 
respectively estimated at 15.23 months (June 2009), 
20.14 months (November 2009) and 5.21 months 
(August 2008). 

The clustering process clearly separated the 
monitoring into two decay profiles. Sommesous 
markings had a strong retroreflectivity and failed 
either before the 30th month or after the 42nd month, 
while Other markings had a weaker retroreflectivity 
and failed before the 30th month. Finally, the 
EM algorithm is also able to produce one specific 
Weibull analysis per cluster. 


3.4 Comparison 


The failure times estimated by the global and the 
clustering models are equivalent to within about one month. 
The main difference concerns markings that failed 
between 18 and 30 months in Other Cities: the global 
model estimated the failure in March 2010 and the 
clustering model estimated it in November 2009. The 
difference is due both to the EM algorithm computation 
and to the current inspection strategy. First, 
the September 2010 inspection observed that 53% 
of markings had failed since September 2009, and the 
associated interval-censored data is ([18, 30], 2) for 
all these markings. Second, the global approach ignores 
the different decay profiles. Therefore in this case, the 
EM algorithm produced one global estimation. 

Figure 3 compares the reliability functions 
of the global model and of the two clusters. 
The global reliability function (green) underestimates 
Sommesous over time (−0.12 on average). 
Other Cities is slightly underestimated over the first 
14 months (−0.03 on average) and overestimated 
over the following months (+0.12 on average), in particular 
after the 23rd month. Finally, the global reliability 
function is an average estimation. 
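
The comparison can be explored numerically from the three fitted distributions alone. The Python sketch below evaluates the reliability function (9) for each model and averages the gap between the global model and a cluster over a set of months. The monthly grid is an assumption of this example, so the averages it gives need not coincide exactly with the figures quoted above, which depend on how the authors sampled the curves.

```python
import math

# (scale, shape) of the three Weibull distributions reported in sections 3.2 and 3.3.
MODELS = {
    "global": (31.65, 2.20),
    "Sommesous": (41.35, 3.5),
    "Other Cities": (18.83, 6.61),
}

def reliability(t, scale, shape):
    """Reliability function (9): probability that a marking survives beyond t months."""
    return math.exp(-((t / scale) ** shape))

def mean_gap(cluster, months):
    """Average of R_global(t) - R_cluster(t) over the given months."""
    months = list(months)
    gaps = [reliability(t, *MODELS["global"]) - reliability(t, *MODELS[cluster])
            for t in months]
    return sum(gaps) / len(gaps)

print(mean_gap("Sommesous", range(1, 43)))      # negative: Sommesous is underestimated
print(mean_gap("Other Cities", range(1, 15)))   # negative over the first 14 months
print(mean_gap("Other Cities", range(15, 43)))  # positive afterwards
```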

The global model is a compromise estimation 
between different observations: two left-censored 
observations, a large number of failures between 
18 and 30 months, a quarter of the observations 
right-censored, and so on. The clustering model first 
suggests several clusters according to the decay profile, 
and each one admits its own Weibull distribution. 
Therefore the Sommesous Weibull distribution 
is estimated from two facts: no failures 
before the 18th month, and all the right-censored 
observations. 

The clustering model has two main disadvantages. 
First, the clustering process tends to 
isolate markings which share one specific type of censoring. 
For example, all right-censored observations could 
be gathered into a third cluster extracted from 
Sommesous. In this situation, the MLE is based 
on the single observation (42, 3) and cannot be 
computed. The second problem is that each cluster requires 
its own Weibull analysis. The clustering model needs a total 
of 57 EM algorithm iterations, whereas the global 
model needs only 12. 

Figure 3. Associated reliability functions for the 
2008-2012 maintenance cycle. 

However, the global reliability function leads 
to a poor compromise: the reliability of Sommesous and Other 
markings is respectively underestimated 
and overestimated. 

Finally, the clustering model gives a more 
reliable Weibull analysis, but the clustering process 
and the number of EM algorithm iterations should 
be monitored. 


4 CONCLUSIONS 


The current literature presents a first Weibull analysis 
for road markings (Sathyanarayanan, Shankar, 
& Donnell 2008). Censored Weibull distributions 
are estimated by the current approach based on the 
MLE computed by a Newton-Raphson approach. 
Restricted to interval-censored data, Pradhan and 
Kundu (Pradhan & Kundu 2014) discovered several 
limits to this method and proposed a first alternative 
based on an EM algorithm. 

The introduced extended EM algorithm is 
a credible alternative to the MLE of a censored 
Weibull distribution. This is especially true for 
the management of censored data. Indeed, the EM 
algorithm replaces censored data by failure times 
that are estimated iteratively together with the Weibull 
distribution. Furthermore, the EM algorithm is 
complemented by a fixed point approach. This method 
is simpler than Newton-Raphson: the 
fixed point depends only on uncensored data and 
estimated data and doesn’t need any upstream 
verification. 

The introduced algorithm isn’t limited to our case 
study. Pradhan and Kundu themselves presented a 
breast cancer case study, for example. The extension 
accepts a greater variety of situations: uncensored 
and right-censored data only, no uncensored data, no 
interval-censored data. As soon as the decay 
monitoring over time and the maximum decay 
level are both clear, the EM algorithm is able to 
produce and rank all censored data. 

The EM algorithm is able to estimate several 
Weibull distributions in our road markings study. 
Two approaches are proposed. The global model is 
useful when the observed data set is 
small. However, the estimated Weibull distribution 
is an average compromise between different 
decay profiles. In the NR4 case, the global reliability 
function overestimates Other markings and 
underestimates Sommesous markings. The clustering 
analysis first segments the whole monitoring and 
estimates one Weibull distribution per cluster. This 
approach is useful if the observed data set is large 
and can distinguish different decay profiles. In 
the NR4 case, the resulting reliability functions are more accurate. 

This paper shows that the clustering approach is 
more reliable than the global model. However, this 
approach requires the clustering process to be 
monitored. Indeed, the current process tends to isolate one 
specific type of censoring per cluster. Consequently, the MLE 
could be based on only one observed value, in which case it 
cannot be computed. An alternative based 
on a mixture model, also estimated by an EM 
algorithm, is currently being investigated. This approach could 
directly estimate the optimum segmentation and 
associate it with one mixture Weibull distribution. 

Finally, the Weibull analysis completed by an 
EM approach is suited to a road markings study. 
Furthermore, for given sections of the road 
network, reliability functions can directly indicate, 
for example, whether a section ages prematurely. 
These facts lead to the development of a probabilistic 
opportunistic maintenance model adapted to a 
whole road network or a segmented road network. 

ABSTRACT: In recent years, many authors have proposed alternative approaches to modelling the 
effect of ageing and test and maintenance activities. Some authors have proposed approaches to 
modelling the unavailability of safety-related components associated with standby-related failures that explicitly 
address all aspects of the effect of ageing, maintenance effectiveness and test efficiency. Recently, other 
authors have proposed a new reliability model for the demand failure probability that explicitly addresses 
all aspects of the effect of demand-induced stress, maintenance effectiveness and test efficiency. In this 
context, the paper presents a whole time-dependent unavailability model for a safety component, 
considering time-dependent unreliability contributions for each failure mode, addressing ageing, degradation stress 
by demand and effectiveness of maintenance and testing, as well as the downtime contributions related to 
maintenance and testing activities. An application case of a motor-operated valve of a pressurized water 
reactor nuclear power plant is included. A set of sensitivity cases is presented to show how this model 
can be used, for example, in the planning of maintenance and surveillance test activities with the aim of 
minimizing equipment unavailability. 


1 INTRODUCTION 

The safety of Nuclear Power Plants (NPPs) depends on the availability 
of safety-related components that are normally on standby and only 
operate in the case of a true demand. These components typically have 
two main types of failure modes that contribute to the probability of 
failure on demand: demand-caused and standby-related failures. Both are 
generally associated with constant values in standard Probabilistic Risk 
Assessment (PRA) models. However, both failure modes are often affected 
by degradation such as demand-related stress and ageing, which cause the 
component to degrade with chronological time and ultimately to fail. 
Maintenance and test activities are performed to control degradation and 
the unreliability and unavailability of such components, although this 
has both positive and negative effects. 

Initial studies reported in Kim et al. (1994) already provided a 
well-organized foundation to account for ageing as well as positive and 
adverse effects of testing components for both demand-caused and 
standby-related failure modes. However, this model does not take into 
account the positive effect of maintenance activities as a function of 
their effectiveness in managing component degradation due to 
demand-induced stress and ageing. 

In recent years, many authors have proposed alternative approaches to 
modelling the effect of ageing and test and maintenance activities 
(Kančev et al. 2016, Shin et al. 2015, Volkanovski 2012). Martón et al. 
(2015) propose an approach to modelling the unavailability of 
safety-related components associated with standby-related failures that 
explicitly addresses all aspects of the effect of ageing, maintenance 
effectiveness and test efficiency. Recently, Martorell et al. (2017) 
have proposed a new reliability model for the demand failure probability 
that explicitly addresses all aspects of the effect of demand-induced 
stress, maintenance effectiveness and test efficiency. 

This paper presents a whole time-dependent unavailability model for a 
safety component, considering the time-dependent unreliability 
contributions developed by Martón et al. (2015) and Martorell et al. 
(2017) for each failure mode as well as the downtime contributions 
related to maintenance and testing activities. 

An application case of a motor-operated valve of a nuclear power plant 
is included. A set of sensitivity cases is presented to show how 
this model can be used, for example, in the 
planning of maintenance and surveillance test 
activities with the aim of minimizing equipment 
unavailability. 


2 UNRELIABILITY MODELS UNDER 
IMPERFECT MAINTENANCE 


In this paper, the models of demand failure 
probability and standby failure rate, proposed by 
Martorell et al. (2017) and Martón et al. (2015) 
respectively, have been combined to develop a whole 
time-dependent reliability model taking into 
account the related preventive maintenance and 
testing activities performed. 

On the one hand, the demand failure probability 
of a component normally in standby depends 
on the number of demands on the component, 
which depends on the planned and 
unplanned surveillance and functional tests, operational 
demands and tests after corrective actions. 
So, surveillance testing not only introduces a 
positive effect, but also an adverse one, which is 
usually compensated by performing maintenance 
activities to eliminate or reduce the accumulated 
degradation. 

On the other hand, the standby failure rate of a component 
depends on its age, which is a function of the chronological 
time elapsed since its installation and the effectiveness 
of the maintenance activities performed on it. 

In both models, in order to introduce the effect of 
preventive maintenance, two imperfect maintenance 
models are considered: Proportional Age Reduction 
(PAR) and Proportional Age Setback (PAS). 

In the PAR approach, each maintenance 
activity is assumed to reduce proportionally, by a 
factor $\varepsilon$, only the component degradation gained 
since the previous maintenance, while the rest 
remains unaffected, where $\varepsilon$ represents the 
maintenance effectiveness that ranges in the interval 
[0,1]. In the PAS approach, in contrast, each 
maintenance activity is assumed to reduce 
proportionally, by a factor $\varepsilon$, the degradation that 
the component has immediately before it enters 
maintenance. 


2.1 Demand failure probability addressing 
maintenance effectiveness 


The time-dependent demand failure probability can be formulated
in terms of a time-dependent degradation function, $f(t)$, as
follows:

$$\rho(t) = \rho_0 + \rho_0 \, f(t) \qquad (1)$$

where $\rho_0$ is the residual demand failure probability.
Assuming that the degradation factor is the same for all types
of demands and is equal to $p_1$, $f(t)$ can be formulated as
follows:

$$f(t) = p_1 \, n(t) \qquad (2)$$

where $n(t)$ is the number of demands performed up to time $t$.
When only test-induced stress is considered,
$n(t) = \lfloor t/T \rfloor$, where $T$ represents the test
interval and $\lfloor x \rfloor$ the floor function, which gives
the largest integer less than or equal to $x$.
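
As a worked illustration, the sketch below evaluates the demand failure probability when only test-induced stress is considered. It assumes that Equations 1 and 2 combine as rho(t) = rho0 + rho0 * p1 * floor(t/T). The parameter values are those reported for the case study in Table 2, while the test interval of 720 h and the function names are hypothetical.

```python
import math


def demands_up_to(t, T):
    """Number of tests performed up to time t with test interval T."""
    return math.floor(t / T)


def demand_failure_probability(t, rho0, p1, T):
    """rho(t) = rho0 + rho0 * p1 * n(t), with n(t) = floor(t / T)."""
    return rho0 + rho0 * p1 * demands_up_to(t, T)


RHO0 = 6.420e-3   # residual demand failure probability (Table 2)
P1 = 5.415e-3     # degradation factor for demand failures (Table 2)
T_TEST = 720.0    # hypothetical test interval in hours

rho_start = demand_failure_probability(0.0, RHO0, P1, T_TEST)
rho_later = demand_failure_probability(2.5 * T_TEST, RHO0, P1, T_TEST)
```

The probability is a step function of time: it stays at the residual value until the first test and increases by rho0 * p1 at each test.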

A time-dependent demand failure probability model that addresses
the demand-induced stress and the effect of $m-1$ maintenance
activities can be formulated for period $m$ as follows:

$$\rho_m(t) = \rho_0 + \rho_0 \, f_m(t) \qquad (3)$$

where $\rho_0$ is the residual demand failure probability and
$f_m(t)$ is the degradation function in period $m$.

This equation can be particularized for $t = t_m = m \cdot M$,
immediately after performing maintenance number $m$, and for the
PAR model, to obtain the time-dependent demand failure
probability immediately after maintenance $m$ (P. Martorell
et al., 2017):

$$\rho_m^+(t = t_m) = \rho_0 + \rho_0 \, p_1 \, \frac{M}{T} \, m \, (1 - \varepsilon_D) \qquad (4)$$

where $p_1$ is the degradation factor associated with demand
failures, $T$ is the test interval, $M$ is the preventive
maintenance interval, $m$ is the number of maintenances
performed and $\varepsilon_D$ is the preventive maintenance
effectiveness associated with demand failures.

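Analogously, the time-dependent demand failure probability
immediately after maintenance $m$ for the PAS model can be
expressed as follows (P. Martorell et al., 2017):

$$\rho_m^+(t = t_m) = \rho_0 + \rho_0 \, p_1 \, \frac{M}{T} \sum_{k=1}^{m} (1 - \varepsilon_D)^k \qquad (5)$$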


2.2 Standby failure rate addressing maintenance 
effectiveness 


In this paper, the linear ageing failure rate model is used to
describe standby-related failures. This model assumes that the
failure rate increases linearly with component age, starting
from an initial value inherent to the design, and can be
expressed as (Martorell et al., 1999):

$$\lambda(t) = \lambda_0 + \alpha \, w(t) \qquad (6)$$

where $t$ represents the chronological time, $\alpha$ is the
linear ageing factor and $w(t)$ is the component's age after
maintenance $m+1$. In Equation 6, the inherent failure rate of
the component by design is given by the term $\lambda_0$, which
represents random failures, while the second term,
$\alpha \, w(t)$, represents the degradation of the equipment
failure rate due to ageing, which is counterbalanced by the
effectiveness of the maintenance policy.

This equation can be particularized for $t = t_m = m \cdot M$,
immediately after performing maintenance number $m$, and for the
PAR model, to obtain the time-dependent standby failure rate
immediately after maintenance $m$ (Marton et al., 2015):

$$\lambda_m^+(t = t_m) = \lambda_0 + \alpha \, M \, (1 - \varepsilon_S) \, m \qquad (7)$$

where $\varepsilon_S$ is the preventive maintenance
effectiveness associated with standby failures.

Analogously, the time-dependent standby failure rate immediately
after maintenance $m$ for the PAS model can be expressed as
follows (Marton et al., 2015):

$$\lambda_m^+(t = t_m) = \lambda_0 + \alpha \, M \sum_{k=1}^{m} (1 - \varepsilon_S)^k \qquad (8)$$
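
The two expressions for the standby failure rate can be compared numerically with the short sketch below. It assumes the PAR form lambda0 + alpha * M * (1 - eps) * m and a PAS form in which the setback compounds geometrically over successive maintenances. The parameter values are those of Table 1 in the case study, and the one-year maintenance interval and the function names are hypothetical.

```python
def standby_rate_par(m, lam0, alpha, M, eps):
    """Failure rate right after maintenance m under PAR."""
    return lam0 + alpha * M * (1.0 - eps) * m


def standby_rate_pas(m, lam0, alpha, M, eps):
    """Failure rate right after maintenance m under PAS (geometric sum)."""
    return lam0 + alpha * M * sum((1.0 - eps) ** k for k in range(1, m + 1))


LAM0 = 5.860e-6    # residual standby failure rate, h^-1 (Table 1)
ALPHA = 3.424e-10  # linear ageing factor, h^-2 (Table 1)
EPS_S = 0.716      # maintenance effectiveness (Table 1)
M_PM = 8760.0      # hypothetical maintenance interval (one year, in hours)

# Limit of the PAS expression when the number of maintenances grows.
pas_limit = LAM0 + ALPHA * M_PM * (1.0 - EPS_S) / EPS_S
```

Under these assumptions the PAS failure rate after maintenance stabilizes at a finite value, which is what allows an average unavailability to be written in closed form over the renewal period.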


3 UNAVAILABILITY MODELLING 


The average unreliability contribution to the unavailability of
a component normally in standby over its renewal period can be
formulated as follows (Marton et al. 2015, Martorell et al.
2017):

$$u_R = u_{R,S} + u_{R,D} \qquad (9)$$

where $u_{R,S}$ is the standby-related unreliability
contribution and $u_{R,D}$ is the demand-caused unreliability
contribution.

On the one hand, adopting the PAS model to represent the
behavior of the imperfect maintenance for the standby-related
failures of the component, according to the results in the
previous section, $u_{R,S}$ is given by (Marton et al., 2015):

$$u_{R,S} \approx \frac{1}{2} \left[ \lambda_0 + \frac{1}{2} \, \alpha \, M \, \frac{2 - \varepsilon_S}{\varepsilon_S} \right] T \qquad (10)$$

On the other hand, adopting the PAS model to represent the
behavior of the imperfect maintenance for the demand-caused
failures of the component, according to the results in the
previous section, $u_{R,D}$ is given by (P. Martorell et al.,
2017):

$$u_{R,D} = \rho_0 + \frac{1}{2} \, \rho_0 \, p_1 \, \frac{M}{T} \, \frac{2 - \varepsilon_D}{\varepsilon_D} \qquad (11)$$

In accordance with Marton (2015), the averaged unavailability
of a component is the sum of the average unreliability
contributions and the unavailability contributions due to
detected downtimes for performing testing and maintenance
activities with the plant at power, which can be formulated as
follows:

$$u = u_R + u_T + u_M + u_C + u_O \qquad (12)$$

where $u_T$ represents the unavailability contribution due to
testing, $u_M$ is the unavailability contribution due to
performing preventive maintenance, $u_C$ is the unavailability
contribution due to performing corrective maintenance
conditional on detecting a failure during a previous test, and
$u_O$ is the contribution due to replacement of the equipment,
if any.

For the sake of simplicity, the last two contributions, $u_C$
and $u_O$, are not included in the total average unavailability,
since both are negligible compared with the downtime effect of
preventive maintenance and testing activities.

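Thus, the downtime contributions considered in this paper can
be evaluated using the following equations (Marton et al. 2015):

$$u_T = \frac{\tau}{T} \qquad (13)$$

$$u_M = \frac{\sigma}{M} \qquad (14)$$

where $\tau$ is the downtime for testing and $\sigma$ is the
downtime for preventive maintenance. Thus, the averaged
unavailability of the component is given by:

$$u = u_{R,S} + u_{R,D} + u_T + u_M \qquad (15)$$

Substituting Equations 10, 11, 13 and 14 into Equation 15
yields the following formulation of the total average
unavailability of a component:

$$u = \frac{1}{2} \left[ \lambda_0 + \frac{1}{2} \, \alpha \, M \, \frac{2 - \varepsilon_S}{\varepsilon_S} \right] T + \rho_0 + \frac{1}{2} \, \rho_0 \, p_1 \, \frac{M}{T} \, \frac{2 - \varepsilon_D}{\varepsilon_D} + \frac{\tau}{T} + \frac{\sigma}{M} \qquad (16)$$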
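
A minimal Python sketch of this unavailability model is given below. It assumes that the PAS contributions take the forms u_RS = (1/2)[lambda0 + (1/2) alpha M (2 - eps_S)/eps_S] T and u_RD = rho0 + (1/2) rho0 p1 (M/T)(2 - eps_D)/eps_D, plus the downtime terms tau/T and sigma/M. The reliability parameters are those of Tables 1 and 2 in the case study. The downtimes tau and sigma, the intervals M and T and the function name are hypothetical, since these values are not reported in this paper.

```python
def unavailability_pas(M, T, lam0, alpha, eps_s, rho0, p1, eps_d, tau, sigma):
    """Contributions to the average unavailability under the PAS model."""
    u_rs = 0.5 * (lam0 + 0.5 * alpha * M * (2.0 - eps_s) / eps_s) * T
    u_rd = rho0 + 0.5 * rho0 * p1 * (M / T) * (2.0 - eps_d) / eps_d
    u_t = tau / T      # downtime due to testing
    u_m = sigma / M    # downtime due to preventive maintenance
    return {"u_RS": u_rs, "u_RD": u_rd, "u_T": u_t, "u_M": u_m,
            "u": u_rs + u_rd + u_t + u_m}


# Reliability parameters reported in Tables 1 and 2 of the case study.
params = dict(lam0=5.860e-6, alpha=3.424e-10, eps_s=0.716,
              rho0=6.420e-3, p1=5.415e-3, eps_d=0.886)

# Hypothetical intervals and downtimes, all in hours.
result = unavailability_pas(M=8760.0, T=720.0, tau=1.0, sigma=4.0, **params)
```

Evaluating the function over a grid of M and T values would show the trade-off discussed in the case study: the standby term grows with T and M, whereas the demand and downtime terms grow as T decreases.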


4 CASE STUDY 


In this section, an example application of the model is
presented, focusing on a Motor-Operated Valve (MOV) of a
nuclear power plant.

Table 1 and Table 2 show the reliability parameters estimated
from the plant operational data, i.e. historical failure,
maintenance and test data. The imperfect maintenance model that
best fits the plant data is the PAS model, for both
demand-caused failures and standby-related failures. These
results are taken from Martorell et al. (2017).

The estimates obtained are used to predict the performance of
the MOV as a function of the test and maintenance intervals. In
particular, the MOV average unreliability contribution of each
failure mode and the total MOV unavailability are computed and
plotted as a function of the maintenance and test intervals for
a 10-year horizon.

Figure 1 shows the evolution of $u_{R,S}$ and $u_{R,D}$ as a
function of the test interval for different preventive
maintenance intervals over a 10-year renewal period. It can be
seen that $u_{R,S}$ increases significantly for high $T$ and $M$
values. Nevertheless, the effect of maintenance is positive for
both unreliability contributions. Moreover, increasing the test
frequency between maintenances, i.e. adopting low $T$ values,
has a strongly negative effect on $u_{R,D}$ when $T$ is very
low.


Table 1. Parameters of the reliability model of standby-related
failures under the PAS model.

Parameter          Value        Units
$\lambda_0$        5.860E-06    h^-1
$\alpha$           3.424E-10    h^-2
$\varepsilon_S$    0.716        -


Table 2. Parameters of the reliability model of demand-caused
failures under the PAS model.

Parameter          Value        Units
$\rho_0$           6.420E-03    -
$p_1$              5.415E-03    -
$\varepsilon_D$    0.886        -


Figure 1. $u_{R,S}$ and $u_{R,D}$ as a function of the test
interval for different maintenance periods.

Figure 2. $u_{R,S}$ and $u_{R,D} + u_T + u_M$ as a function of
the test interval for different maintenance periods.

Figure 3. Unavailability for different maintenance and test
intervals under the PAS model.

Figure 2 shows the evolution of $u_{R,S}$ versus
$u_{R,D} + u_T + u_M$ as a function of the test interval,
considering different preventive maintenance intervals over a
10-year renewal period. The term $u_{R,S}$ quantifies the
benefit of test and maintenance activities on the total
component unavailability, while the sum of contributions
$u_{R,D} + u_T + u_M$ represents their negative effect.

The last study analyzes the total average unavailability of the
component as a function of the pair {M, T} over a 10-year
horizon, which is shown in Figure 3. The highest values of $u$
are reached with the longest maintenance and test intervals.
The main contributor to the total unavailability $u$ (see
Equation 15) is the standby-related unreliability contribution
given by Equation 10, as can be seen in Figure 2. This explains
the direct and proportional dependence of $u$ on $T$ and $M$.
Nevertheless, the sum of the demand unreliability contribution
and the downtime effects considered, i.e. the downtime due to
preventive maintenance and testing activities, becomes more
relevant for very low $T$ values. This can also be seen in
Figure 2.


5 CONCLUDING REMARKS 


This paper presents a complete time-dependent unavailability
model for a safety component, which considers the
time-dependent unreliability contributions of demand-caused
failures and standby-related failures as well as the downtime
contributions related to maintenance and testing activities.
Regarding imperfect maintenance effects, two models are
considered: Proportional Age Reduction (PAR) and Proportional
Age Setback (PAS).

A practical and realistic case study is included, based on the
parameters estimated for a typical motor-operated valve in a
nuclear power plant. Equipment RAM is quantified using the
best-fitting model in order to show the impact of such an
estimation in a testing and maintenance planning context.

Thus, the results of a complete time-dependent unavailability
model may help to plan the test and maintenance program more
efficiently, providing an appropriate balance among the
different contributions to the unavailability of the MOV, with
the aim of minimizing its unavailability while ensuring a low
level of unreliability.


ABSTRACT: In the literature on reliability and maintainability analysis of repairable equipment,
a great number of studies have been published about modelling and analyzing failure times.
However, there are not many studies that analyze the behavior of such equipment when failures
happen at demand. In this case, the models are completely different, since the failure distribution
is discrete. In this work, the model is built from the real behavior of the equipment without
assuming any a priori probability distribution. From this model, the discrete probability function
and the cumulative distribution function, needed for the subsequent parameter estimation, are
obtained. By combining these functions with imperfect maintenance models, the likelihood function
for reliability analysis is constructed to jointly estimate its parameters, that is, the probability of
failure at demand and the maintenance effectiveness, together with their variability. Then, an
application case applies the estimation procedure to a database of safety equipment of a Nuclear
Power Plant, where the estimates obtained, together with their variability, can be used to plan
optimal test and maintenance intervals.


1 INTRODUCTION

Since the end of the last century, studies on improving safety
in industrial plants in general, and in nuclear power plants in
particular, have acquired special relevance as a way to optimize
both the safety of these plants and the resources spent for this
purpose. In this context, analyzing reliability through a
failure model that adequately represents the behavior of safety
equipment is an essential task to which many authors have
devoted themselves (Martorell et al. 1999, Busacca et al. 2001,
Nakamura et al. 2004). The first problem to solve in this kind
of study is modeling the behavior of the equipment and,
consequently, estimating the parameters of such models for
later use, fundamentally in optimizing some objectives (Shin
et al. 1996, Mullor et al. 2006). The models used initially were
excessively simple: an exponential distribution with null or
perfect maintenance, i.e. the classic Bad As Old (BAO) and Good
As New (GAN) models. To improve the fit of the model to
reality, more complex models have since been considered, in
terms of both failure distribution and maintenance. Weibull,
Gamma and linear failure distributions, among others, are
currently used since they allow a better fit to reality. At the
same time, imperfect maintenance models such as Proportional
Age-Setback (PAS) or Proportional Age Reduction (PAR), which
better reproduce the influence of maintenance activities on the
component ageing process, are combined with these failure
distributions to obtain models better suited to the objectives.

Even though the combination of these models has proved very
useful in several applications (Lapa et al. 2000, Mullor et al.
2007), the standby condition of most of the safety equipment to
which this methodology is applied requires periodic
verification of its availability, to ensure a high probability
of correct operation when it is called on to perform its task.
This verification creates another problem: each verification of
the state of a safety component degrades that state because of
the stress generated by the test. Therefore, in addition to the
age-dependent failure model for standby equipment, the
probability of failure at demand generated by checking its
status must also be analyzed. The demand-induced stress has
been modeled by stochastic degradation jumps (Yang et al. 2017),
considering that random shocks occur according to a
Non-Homogeneous Poisson Process, or by addressing the effects of
test strategies on the probability of failure at demand for
safety instrumented systems (Torres-Echeverria et al. 2009).

However, most of these models are not close enough to the true
behavior of safety equipment in terms of test-induced
degradation. This paper therefore focuses on modeling the
probability of failures related to the influence of tests and
maintenance activities without assuming any a priori
distribution, that is, constructing the probability function and
the cumulative distribution function of a discrete probability
distribution directly from observed data. Combined with the
imperfect maintenance models PAS and PAR, these functions
provide the elements needed to construct the likelihood
function, which allows us to jointly estimate the probability
of failure due to tests and the effectiveness of each
maintenance activity through the well-known Maximum Likelihood
Estimation (MLE) method, with the likelihood maximized by the
Nelder-Mead Simplex method (Nelder & Mead 1965).

Finally, a practical and realistic case study of a typical
motor-operated valve in a nuclear power plant is presented.
Estimates of the probability of failure at demand and of the
maintenance effectiveness are obtained with the new discrete
model by adapting its expressions to the MLE functions.

The rest of this paper is organized as follows: Section 2
presents the discrete failure at demand model. Section 3
describes the parameter estimation method for this kind of
reliability model. The methodology is applied in Section 4 to
the observed data from a set of High-Pressure Injection System
(HPIS) motor-operated valves of a nuclear power plant (NPP).
Finally, Sections 5 and 6 present the conclusions and
references.


2 DEMAND-CAUSED FAILURE MODEL 


The construction of this model is simple in essence, but
somewhat more involved in its formalization. We start from a
null initial failure probability, which means that the
probability of failure at demand is zero until the first test.
When this first test is performed, the probability of demand
failure becomes $p$ and remains constant until the next test, at
which it increases again by $p$, becoming $2p$ from that moment,
and so on until the test immediately before the first
maintenance, at which the failure probability reached is

$$p_1^- = k\,p \qquad (1)$$

where $k$ is the number of tests performed in the first
maintenance interval. The next action is the first maintenance,
which reduces the previous failure probability, depending on
its effectiveness $\varepsilon$, to

$$p_1^+ = p_1^-\,(1-\varepsilon) = k\,p\,(1-\varepsilon) \qquad (2)$$

and the same process is repeated from this point. Starting now
from $p_1^+$, each test increases the failure probability by a
constant $p$ until reaching, at the test immediately before the
second maintenance, a failure probability of

$$p_2^- = p_1^+ + k\,p = k\,p\,(1-\varepsilon) + k\,p \qquad (3)$$

At this instant, the reduction of the failure probability after
the second and subsequent maintenances depends on the imperfect
maintenance model applied.

In the case of the PAR model, where the reduction affects only
the failure probability accumulated in the last maintenance
interval, we obtain

$$p_2^+ = k\,p\,(1-\varepsilon) + k\,p\,(1-\varepsilon) = 2\,k\,p\,(1-\varepsilon) \qquad (4)$$

In the case of the PAS model, the reduction affects the total
failure probability reached before performing the maintenance,
so after that maintenance this probability becomes

$$p_2^+ = \left(k\,p\,(1-\varepsilon) + k\,p\right)(1-\varepsilon) \qquad (5)$$

Apart from simplifications or alternative representations for
subsequent treatment and applications, the above describes the
evolution of the failure probability at any instant of the
useful life of the analyzed component, which is represented in
Figure 1.

Next, the probability functions that determine the likelihood
functions are constructed. These allow us to estimate the
objective parameters under models that actually represent the
behavior of the equipment in terms of the occurrence of
failures at demand.
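
The step-by-step evolution described above can be reproduced with a short simulation. The following is a minimal sketch, assuming k tests per maintenance interval, a constant increment p per test and a constant effectiveness eps. The function name and the numerical values are hypothetical.

```python
def failure_probability_path(p, k, eps, n_maint, model):
    """Return (event, probability) pairs, test by test and maintenance by maintenance."""
    path = []
    prob = 0.0  # null initial failure probability
    for m in range(1, n_maint + 1):
        start = prob  # level right after the previous maintenance
        for j in range(1, k + 1):
            prob += p  # each test adds a constant p
            path.append((f"test {j} of interval {m}", prob))
        gained = prob - start
        if model == "PAR":
            # Only the increase of the last interval is reduced.
            prob = start + gained * (1.0 - eps)
        elif model == "PAS":
            # The whole accumulated probability is reduced.
            prob = prob * (1.0 - eps)
        else:
            raise ValueError("model must be 'PAR' or 'PAS'")
        path.append((f"maintenance {m}", prob))
    return path


# Hypothetical values: p = 0.01, three tests per interval, effectiveness 0.5.
par_path = failure_probability_path(0.01, 3, 0.5, 2, "PAR")
pas_path = failure_probability_path(0.01, 3, 0.5, 2, "PAS")
```

With these values the probability after the second maintenance is 2kp(1 - eps) = 0.03 under PAR and (kp(1 - eps) + kp)(1 - eps) = 0.0225 under PAS, in agreement with the expressions for the second maintenance.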


2.1 Proportional age reduction model 


If a PAR model is considered, in which maintenance only reduces
the increase in probability accumulated since the previous
maintenance, the failure probabilities before and after any
maintenance $m$ are

$$p_m^- = (m-1)\,k\,p\,(1-\varepsilon) + k\,p, \qquad p_m^+ = m\,k\,p\,(1-\varepsilon) \qquad (6)$$

Figure 1. Evolution of the failure probability vs time.

For the construction of the likelihood function we only need
the probability function, $p(t)$, at failure times and the
cumulative distribution function, $F(t)$, at failure and
maintenance times. From the previous expressions, and denoting
by $k_{j,m}$ the number of the test, within maintenance
interval $m$, at which failure $j$ is observed, the probability
function at that failure is

$$p(t_{j,m}) = (m-1)\,k\,p\,(1-\varepsilon) + k_{j,m}\,p \qquad (7)$$

Adding the probability functions over the tests and
maintenances up to maintenance $m$, we obtain the cumulative
distribution function at each maintenance:

$$F(\tau_m) = \sum_{i=1}^{m} \sum_{l=1}^{k} \left[ (i-1)\,k\,p\,(1-\varepsilon) + l\,p \right] = \frac{m(m-1)}{2}\,k^2\,p\,(1-\varepsilon) + m\,\frac{k(k+1)}{2}\,p \qquad (8)$$

Finally, the cumulative distribution function at each failure is
obtained by adding, to the cumulative distribution function at
the previous maintenance, the probability function at each test
up to the failure, within the maintenance interval in which the
failure is observed:

$$F(t_{j,m}) = F(\tau_{m-1}) + k_{j,m}\,(m-1)\,k\,p\,(1-\varepsilon) + \frac{k_{j,m}\,(k_{j,m}+1)}{2}\,p \qquad (9)$$
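
The PAR expressions can be checked numerically with the sketch below. It assumes that the cumulative function at a maintenance is the plain sum of the probability function over all tests performed up to that maintenance, and that the cumulative function at a failure adds the tests of the current interval up to the failure. The function names are hypothetical. The values of k, p and eps are those used or estimated later in the application case.

```python
def par_prob(m, j, k, p, eps):
    """Probability function at test j of maintenance interval m (PAR)."""
    return (m - 1) * k * p * (1.0 - eps) + j * p


def par_cdf_maintenance(m, k, p, eps):
    """Sum of par_prob over all k tests of intervals 1..m."""
    return sum(par_prob(i, l, k, p, eps)
               for i in range(1, m + 1) for l in range(1, k + 1))


def par_cdf_failure(m, j, k, p, eps):
    """Value at the previous maintenance plus tests 1..j of interval m."""
    return (par_cdf_maintenance(m - 1, k, p, eps)
            + j * (m - 1) * k * p * (1.0 - eps)
            + j * (j + 1) / 2.0 * p)


K, P, EPS = 20, 4.63e-4, 0.9824

# Closed form of the double sum for m = 3 maintenances.
closed_form = 3 * 2 / 2 * K * K * P * (1.0 - EPS) + 3 * K * (K + 1) / 2 * P

# Brute-force sum up to the fifth test of the third interval.
brute_force = (sum(par_prob(i, l, K, P, EPS) for i in (1, 2) for l in range(1, K + 1))
               + sum(par_prob(3, l, K, P, EPS) for l in range(1, 6)))
```

Comparing the closed forms with brute-force sums in this way is a cheap safeguard against indexing mistakes before the expressions are used in a likelihood function.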


2.2 Proportional age-setback model 


If the imperfect maintenance model is PAS, which reduces the
failure probability that the component has accumulated up to
the time just before the maintenance activity is performed, the
situation at the first maintenance is obviously the same, but
this is not the case for the following ones. The failure
probability before the second maintenance is still the same as
in the PAR model, but after it the probability becomes

$$p_2^+ = \left(k\,p\,(1-\varepsilon) + k\,p\right)(1-\varepsilon) = k\,p\left((1-\varepsilon) + (1-\varepsilon)^2\right) \qquad (10)$$

and, generalizing the process again, the failure probabilities
before and after the $m$-th maintenance are given by

$$p_m^- = k\,p\left(1 + (1-\varepsilon) + (1-\varepsilon)^2 + \ldots + (1-\varepsilon)^{m-1}\right) = k\,p \sum_{i=0}^{m-1} (1-\varepsilon)^i$$

$$p_m^+ = k\,p\left((1-\varepsilon) + (1-\varepsilon)^2 + \ldots + (1-\varepsilon)^m\right) = k\,p \sum_{i=1}^{m} (1-\varepsilon)^i \qquad (11)$$


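As before, what we need are the failure probabilities and their
cumulative sums, i.e. the distribution function. Proceeding as
in the previous model, we obtain the equivalent expressions

$$p(t_{j,m}) = k\,p \sum_{i=1}^{m-1} (1-\varepsilon)^i + k_{j,m}\,p \qquad (12)$$

$$F(\tau_m) = k^2\,p \sum_{i=1}^{m-1} (m-i)\,(1-\varepsilon)^i + \frac{m\,k\,(k+1)}{2}\,p \qquad (13)$$

$$F(t_{j,m}) = F(\tau_{m-1}) + k_{j,m}\,k\,p \sum_{i=1}^{m-1} (1-\varepsilon)^i + \frac{k_{j,m}\,(k_{j,m}+1)}{2}\,p \qquad (14)$$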


It should be noted that, unlike continuous failure distribution
models, these expressions, and their subsequent contribution to
the likelihood function, do not depend on the instant at which
the failure is observed but on the number of tests and
maintenances, and on the type of maintenance, carried out up to
the observed failure. Finally, although these expressions look
complex, once $k$, $k_{j,m}$ and $m$ are replaced by their
values in a particular problem, as in our application case,
they become much clearer and are straightforward to implement.
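
As with the PAR model, the PAS quantities can be computed by direct summation. The sketch below assumes that the probability level after each maintenance follows the geometric sum kp[(1 - eps) + ... + (1 - eps)^m] and that the cumulative function is the plain sum of the probability function over all tests performed so far. The function names and the numerical values are hypothetical.

```python
def pas_level_after(m, k, p, eps):
    """Failure probability right after maintenance m under PAS."""
    return k * p * sum((1.0 - eps) ** i for i in range(1, m + 1))


def pas_prob(m, j, k, p, eps):
    """Probability function at test j of maintenance interval m (PAS)."""
    return pas_level_after(m - 1, k, p, eps) + j * p


def pas_cdf(m, j, k, p, eps):
    """Sum of pas_prob over intervals 1..m-1 and tests 1..j of interval m."""
    total = 0.0
    for i in range(1, m):
        for l in range(1, k + 1):
            total += pas_prob(i, l, k, p, eps)
    for l in range(1, j + 1):
        total += pas_prob(m, l, k, p, eps)
    return total


K, P, EPS = 20, 4.19e-4, 0.9056

# Weighted geometric sum for the value at the fourth maintenance.
closed_form = (K * K * P * sum((4 - i) * (1.0 - EPS) ** i for i in range(1, 4))
               + 4 * K * (K + 1) / 2 * P)
```

The direct summation is slower than a closed form but makes the bookkeeping explicit, which helps when the number of tests differs between intervals.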


3 ESTIMATION PROCEDURE 


The construction of the likelihood function for the proposed
reliability models under imperfect maintenance is presented
below. For a given model and a set of observed data, the
likelihood function, $L$, is the product of the probabilities
that the observed data occur, as a function of the parameters
to be estimated:

$$L(\xi \mid \text{observed data}) = \prod_{\text{events}} P(\text{event} \mid \xi) \qquad (15)$$

This expression can be applied to the age-dependent failure
models formulated previously by taking the probability function
as the probability of failures and the reliability function to
model the probabilities after maintenances, both normalized on
the time interval in which the events are observed. This gives
the general expression of the likelihood function for
reliability models with imperfect maintenance:

$$L(\xi \mid \text{observed data}) = \prod_{\text{failures}} \frac{f(t_i)}{R(t_i)} \prod_{\text{maintenances}} R(\tau_i) \qquad (16)$$

which, applied to discrete models, becomes

$$L(\xi \mid \text{observed data}) = \prod_{\text{failures}} \frac{p(t_i)}{1 - F(t_i)} \prod_{\text{maintenances}} \left(1 - F(\tau_i)\right) \qquad (17)$$

From these expressions, the maximum likelihood estimation (MLE)
method provides estimates of the parameters of the reliability
and maintenance models. The maximum likelihood estimates of
these parameters are the values that maximize the likelihood
function, that is, that maximize the probability that the
observed events occur. Although direct methods can sometimes be
applied to obtain the solution, this is not usually the case in
reliability models. In our applications, we use the Nelder-Mead
Simplex method to maximize the likelihood function of each
model.

Let $r_m$ be the number of failures observed in maintenance
interval $m$, which occur at instants
$t_{m,1}, \ldots, t_{m,r_m}$, and let $\tau_m$ be the instant
when maintenance $m$ is performed, with
$m \in \{1, 2, \ldots, M\}$. The likelihood function under
imperfect preventive maintenance is given by

$$L(\xi) = \prod_{m=1}^{M} \left[ \prod_{j=1}^{r_m} h_m(t_{m,j}) \right] R_m(\tau_m) \cdot R(t^*) \qquad (18)$$

where $\xi$ is the vector of unknown parameters, $M$ is the
number of maintenances performed in the observation period
$t^*$, $h_m(t)$ and $R_m(t)$ are the failure rate and the
reliability function in maintenance interval $m$, and $R(t^*)$
is the reliability function at the censoring time $t^*$. Since
the logarithm is an increasing function, the likelihood
function and its logarithm reach their maximum at the same
value of the parameter vector. So, in order to simplify the
computation, we use the log-likelihood function, given by

$$\log L(\xi) = \sum_{m=1}^{M} \sum_{j=1}^{r_m} \log\left(h_m(t_{m,j})\right) - \sum_{m=1}^{M} H_m(\tau_m) - H(t^*) \qquad (19)$$

This expression is useful when the probability distribution of
failures provides the failure rate and the cumulative failure
rate, $H(t)$, in closed form, as is the case for continuous
distributions. However, for the discrete model presented above,
the failure rate and cumulative failure rate functions are not
available in closed form, so we construct the likelihood
function, and its logarithm, from the probability function and
the cumulative distribution function or, equivalently, one
minus it, the reliability function $R(t)$, at each failure and
maintenance time:

$$\log L(\xi) = \sum_{m=1}^{M} \sum_{j=1}^{r_m} \log\left( \frac{p_m(t_{m,j})}{R_m(t_{m,j})} \right) + \sum_{m=1}^{M} \log\left(R_m(\tau_m)\right) + \log\left(R(t^*)\right) \qquad (20)$$


Again, by replacing the probability and cumulative distribution
functions obtained in the previous section for the PAR and PAS
maintenance models, the corresponding log-likelihood functions
are obtained.
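
A possible implementation of the discrete log-likelihood for the PAR model is sketched below. It is only an illustration under stated assumptions: the reliability is taken as one minus the cumulative sum of the probability function over all tests performed since the start of the observation, failures are given as hypothetical (interval, test) pairs, and the censoring time is expressed as the number of tests performed after the last maintenance. The function names are hypothetical, and the paper may normalize these quantities differently.

```python
import math


def par_prob(m, j, k, p, eps):
    """Probability function at test j of maintenance interval m (PAR)."""
    return (m - 1) * k * p * (1.0 - eps) + j * p


def par_cdf(m, j, k, p, eps):
    """Cumulative sum of par_prob up to test j of interval m."""
    total = 0.0
    for i in range(1, m):
        for l in range(1, k + 1):
            total += par_prob(i, l, k, p, eps)
    for l in range(1, j + 1):
        total += par_prob(m, l, k, p, eps)
    return total


def log_likelihood_par(p, eps, k, n_maint, failures, censor_tests):
    """Sum of log(p/R) at failures plus log(R) at maintenances and censoring."""
    if p <= 0.0:
        return -math.inf
    terms = [(par_prob(m, j, k, p, eps), 1.0 - par_cdf(m, j, k, p, eps))
             for m, j in failures]
    reliabilities = [1.0 - par_cdf(m, k, k, p, eps) for m in range(1, n_maint + 1)]
    reliabilities.append(1.0 - par_cdf(n_maint + 1, censor_tests, k, p, eps))
    if any(r <= 0.0 for _, r in terms) or any(r <= 0.0 for r in reliabilities):
        return -math.inf  # parameters for which the reliability is not positive
    value = sum(math.log(prob / r) for prob, r in terms)
    value += sum(math.log(r) for r in reliabilities)
    return value
```

Returning minus infinity where the reliability is not positive keeps a derivative-free optimizer away from parameter values for which the model is not defined.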

Maximizing these functions with the Nelder-Mead simplex method
provides the maximum likelihood estimates of the parameters of
each age-dependent failure model. In addition to the vector of
point estimates, the optimization process provides information
about their variability through the Fisher information matrix,
which is defined as the negative of the matrix of second
partial derivatives. So, if the parameter vector has dimension
$n$, the information matrix is given by

$$I(\xi) = - \begin{pmatrix} \dfrac{\partial^2 \log L(\xi)}{\partial \xi_1^2} & \dfrac{\partial^2 \log L(\xi)}{\partial \xi_1 \partial \xi_2} & \cdots & \dfrac{\partial^2 \log L(\xi)}{\partial \xi_1 \partial \xi_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 \log L(\xi)}{\partial \xi_n \partial \xi_1} & \dfrac{\partial^2 \log L(\xi)}{\partial \xi_n \partial \xi_2} & \cdots & \dfrac{\partial^2 \log L(\xi)}{\partial \xi_n^2} \end{pmatrix} \qquad (21)$$

Then, for each model, the variance-covariance matrix is
obtained as the inverse of the information matrix divided by
the sample size.
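
For a two-parameter model, the information matrix and its inverse can be approximated numerically. The sketch below is an illustration only: it uses central finite differences, which is a choice made here and not necessarily the procedure followed in the paper, and it leaves any scaling by the sample size to the caller. The function names and the quadratic test function are hypothetical.

```python
def hessian_2d(f, x, y, hx, hy):
    """Central finite-difference Hessian of f at (x, y)."""
    fxx = (f(x + hx, y) - 2.0 * f(x, y) + f(x - hx, y)) / hx ** 2
    fyy = (f(x, y + hy) - 2.0 * f(x, y) + f(x, y - hy)) / hy ** 2
    fxy = (f(x + hx, y + hy) - f(x + hx, y - hy)
           - f(x - hx, y + hy) + f(x - hx, y - hy)) / (4.0 * hx * hy)
    return [[fxx, fxy], [fxy, fyy]]


def covariance_from_loglik(loglik, x, y, hx, hy):
    """Inverse of the negative Hessian of the log-likelihood at (x, y)."""
    h = hessian_2d(loglik, x, y, hx, hy)
    info = [[-h[0][0], -h[0][1]], [-h[1][0], -h[1][1]]]
    det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
    return [[info[1][1] / det, -info[0][1] / det],
            [-info[1][0] / det, info[0][0] / det]]


def quadratic_loglik(x, y):
    """Toy log-likelihood whose information matrix is [[4, 1], [1, 2]]."""
    return -0.5 * (4.0 * x * x + 2.0 * x * y + 2.0 * y * y)


cov = covariance_from_loglik(quadratic_loglik, 0.0, 0.0, 1e-3, 1e-3)
```

For the toy function the exact inverse is [[2/7, -1/7], [-1/7, 4/7]], which gives a simple check of the finite-difference step sizes before applying the routine to a real log-likelihood.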

To conclude this section, note that a drawback of applying the
Nelder-Mead simplex method to the optimization of our
log-likelihood functions is that it is an unconstrained
optimization method. The parameters in our models have a
bounded domain, and these constraints must be introduced in
some way in the optimization process. Fortunately, they can be
included through transformations of the parameters themselves.
The constraints are $0 < p, \varepsilon < 1$, so in the
optimization process the parameters are defined as
$p = \exp(x_1)/(1 + \exp(x_1))$ and
$\varepsilon = \exp(x_2)/(1 + \exp(x_2))$, which ensures that
the solutions lie within the domain of the decision variables.
In addition, since the method only provides a local optimum,
the search must be repeated several times from different,
randomly selected initial points to increase the chance of
finding the global optimum.
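
The transformation and the multi-start strategy can be sketched as follows. To keep the example self-contained, a simple compass (pattern) search is used as a stand-in for the Nelder-Mead simplex method. The optimizer, the function names, the number of restarts and the toy objective are all assumptions made for illustration and are not part of the paper.

```python
import math
import random


def to_unit_interval(x):
    """Logistic map exp(x) / (1 + exp(x)), written to avoid overflow."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pattern_search(f, x0, step=1.0, tol=1e-6, max_iter=10000):
    """Derivative-free maximizer (compass search), used here instead of Nelder-Mead."""
    x = list(x0)
    best = f(x)
    for _ in range(max_iter):
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = list(x)
                trial[i] += delta
                value = f(trial)
                if value > best:
                    x, best, improved = trial, value, True
        if not improved:
            step *= 0.5
            if step < tol:
                break
    return x, best


def multistart_mle(loglik, n_starts=20, seed=0):
    """Maximize loglik(p, eps) over (0, 1) x (0, 1) from random starting points."""
    rng = random.Random(seed)
    best_x, best_val = None, -math.inf

    def objective(z):
        return loglik(to_unit_interval(z[0]), to_unit_interval(z[1]))

    for _ in range(n_starts):
        x0 = [rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)]
        x, val = pattern_search(objective, x0)
        if best_x is None or val > best_val:
            best_x, best_val = x, val
    return to_unit_interval(best_x[0]), to_unit_interval(best_x[1]), best_val


def toy_loglik(p, eps):
    """Hypothetical objective with its maximum at p = 0.3 and eps = 0.8."""
    return -(p - 0.3) ** 2 - (eps - 0.8) ** 2


p_hat, eps_hat, value = multistart_mle(toy_loglik)
```

Because the search runs on the unconstrained variables, every candidate maps back to a valid pair of parameters, so no explicit bound handling is needed in the optimizer.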


4 APPLICATION CASE 


The application case presented below concerns the analysis of
data collected from a High-Pressure Injection System (HPIS)
consisting of a set of motor-operated valves. These valves are
normally in standby, so they must be checked periodically,
which causes some deterioration that increases the probability
of failure at demand. Given the set of available data
(failures, tests and maintenance activities), the objective of
this application is the joint estimation of the maintenance
effectiveness, $\varepsilon$, and the probability of failure at
demand, $p$, under the imperfect maintenance models PAR and
PAS.

The available database covers the monitoring of the described
equipment over 5100 days, during which 195 tests and 9
maintenances were performed and 3 failures were observed. Since
no further information is available, which is usually the case
in practice, we assume that both maintenance and testing are
uniformly distributed in time, so that all intervals between
tests and between maintenances have the same length. This gives
tests every 25 days and maintenances every 525 days, i.e. 9
maintenance activities in 14 years and 20 tests in each
maintenance interval.
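
The schedule figures quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes that the time slot coinciding with each maintenance is not counted as a test, which is an interpretation made here to reconcile 21 slots of 25 days per interval with the 20 tests stated. The variable names are hypothetical.

```python
observation_days = 5100
test_interval = 25          # days between tests
maintenance_interval = 525  # days between maintenances

n_maintenances = observation_days // maintenance_interval
# Assumption: the slot that coincides with the maintenance is not a test.
tests_per_interval = maintenance_interval // test_interval - 1
days_after_last = observation_days - n_maintenances * maintenance_interval
tests_after_last = days_after_last // test_interval
total_tests = n_maintenances * tests_per_interval + tests_after_last
observation_years = observation_days / 365.25
```

Under this interpretation the totals agree with the database: 9 maintenances, 20 tests per interval and 195 tests over roughly 14 years.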

Once the logarithm of the likelihood function presented in
Section 3 has been constructed, i.e. the sum of the logarithm
of the probability function divided by the reliability function
at failure times and the logarithm of the reliability function
at maintenance and censoring times, Maximum Likelihood
Estimation is performed with the Nelder-Mead Simplex method,
yielding the parameter estimates shown in Table 1, together
with their variability.


Table 1. Parameter estimation results.

Parameter              PAR model    PAS model
$\hat{p}$              4.63E-4      4.19E-4
$\hat{\varepsilon}$    0.9824       0.9056
$L$                    0.024        0.005


The variance-covariance matrices for the PAR and PAS models are
given, respectively, by:

$$\hat{\Sigma}_{PAR} = \begin{pmatrix} 8.24 \times 10^{-10} & 1.04 \times 10^{-7} \\ 1.04 \times 10^{-7} & 9.48 \times 10^{-5} \end{pmatrix}$$

$$\hat{\Sigma}_{PAS} = \begin{pmatrix} 1.23 \times 10^{-9} & 1.31 \times 10^{-6} \\ 1.31 \times 10^{-6} & 2.81 \times 10^{-3} \end{pmatrix}$$


The results of this application, together with those obtained
by applying these models to other databases, lead to some
interesting conclusions. Firstly, focusing on the case
presented, we obtain estimates of both the failure probability
at demand and the maintenance effectiveness. The results do not
differ much between the two models: since the maintenance
effectiveness is close to one in both cases, the two models
behave very similarly and provide very close estimates.
However, we can conclude that the model that better represents
the behavior of this equipment is PAR, since the value of its
likelihood function at the optimum is greater than that of PAS.
As both models have the same number of parameters, this means
that the probability of occurrence of the observed data is
greater under the PAR model than under the PAS model.
Additionally, the variance-covariance matrix obtained together
with the point estimates provides, besides the covariance, the
standard deviation of each parameter estimate as the square
root of the corresponding element of its main diagonal. With
these we can construct confidence intervals for each parameter,
which give very useful information about their variability for
further applications:

PAR: $CI_{95\%}(p) = (4.07\mathrm{E}{-4},\ 5.19\mathrm{E}{-4})$, $CI_{95\%}(\varepsilon) = (0.9633,\ 1)$

PAS: $CI_{95\%}(p) = (3.50\mathrm{E}{-4},\ 4.88\mathrm{E}{-4})$, $CI_{95\%}(\varepsilon) = (0.8018,\ 1)$
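
The reported intervals are consistent with a normal approximation around each point estimate. The sketch below assumes a two-sided 95% interval with quantile 1.96, truncated to the parameter domain [0, 1]. The paper does not state the exact construction, so this is an interpretation. The diagonal entries of the variance-covariance matrices are read here as 8.24E-10 and 9.48E-5 for PAR and 1.23E-9 and 2.81E-3 for PAS.

```python
import math


def confidence_interval(estimate, variance, z=1.96, lower=0.0, upper=1.0):
    """Normal-approximation interval, truncated to the parameter domain."""
    half_width = z * math.sqrt(variance)
    return max(lower, estimate - half_width), min(upper, estimate + half_width)


# Point estimates (Table 1) and variances (matrix diagonals).
par_p = confidence_interval(4.63e-4, 8.24e-10)
par_eps = confidence_interval(0.9824, 9.48e-5)
pas_p = confidence_interval(4.19e-4, 1.23e-9)
pas_eps = confidence_interval(0.9056, 2.81e-3)
```

Under this assumption the computed bounds agree with the reported intervals to the precision shown, and the upper bounds for the effectiveness are truncated at 1.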


We also note that the confidence intervals are narrower in the
PAR model, that is, its estimates show less variability, which
is consistent with the previous selection of PAR as the most
appropriate model. Finally, it must be pointed out that, in
general, the estimated maintenance effectiveness is greater in
the PAR model than in the PAS model, as happens in our
application: since PAS maintenance acts on the whole
accumulated failure probability rather than only on the
increase of the last interval, it reaches the same state as PAR
with a lower effectiveness.


5 CONCLUSIONS 


This paper presents the construction of a model that
realistically represents the behavior of equipment that
requires periodic testing and deteriorates with each test. This
model allows us to jointly estimate, via Maximum Likelihood
Estimation, both the failure probability at demand and the
maintenance effectiveness when imperfect maintenance is
incorporated into the age-dependent model, as occurs in
practice.

The results obtained in the application case, which agree with
those usually found in this context, support the usefulness of
this model and broaden the scope for optimizing the efforts
aimed at improving the reliability of this kind of equipment.


ABSTRACT: Planning maintenance over a large, distributed and ageing power infrastructure is a
challenge. Interruptions in power supply cost large amounts of money in terms of undelivered
energy, and maintaining the infrastructure so that it delivers at all times is therefore imperative.
This paper presents a novel method developed to model the risk of power line infrastructure. The
user is presented with a visual risk picture and the associated expected costs based on the user's
maintenance strategy of choice. Visual risk presentation overlaid on map data allows the
decision-maker to understand the risk picture and make appropriate investment decisions based on
it. The developed model is currently being used in-house at various grid operators in Norway.
Users can test the effect of different maintenance strategies on the risk picture.


1 INTRODUCTION 


1.1 Background 


Every year there are thousands of planned and 
unplanned interruptions on the electrical power grid 
in Norway. These disconnections result in undelivered
energy that costs society millions each year.
The year 2016 saw a total of 25,777 incidents total- 
ing a loss of 8,239 MWh undelivered energy (Stat- 
nett 2017). Assuming an average interruption of 1.3 
hours (Kjolle 2011), the cost of undelivered energy 
is approximately 30.7 NOK per kWh. This equals 
253.2 million NOK, or over 32 million USD. 
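The arithmetic behind this estimate can be checked with a short sketch. It uses only the rounded figures quoted above; with 30.7 NOK per kWh the product comes to roughly 252.9 million NOK, so the 253.2 million NOK stated in the text presumably reflects an unrounded per-kWh cost. The variable names are illustrative.

```python
# Figures quoted in the text (Statnett 2017; Kjolle 2011).
undelivered_mwh = 8_239       # undelivered energy in 2016, MWh
cost_nok_per_kwh = 30.7       # approximate cost of undelivered energy, NOK/kWh

# Convert MWh to kWh and multiply by the unit cost.
undelivered_kwh = undelivered_mwh * 1_000
total_cost_nok = undelivered_kwh * cost_nok_per_kwh
total_cost_mnok = total_cost_nok / 1e6

print(f"{total_cost_mnok:.1f} million NOK")
```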

Norway as a country is characterized by large
energy differences from region to region. Some
areas are under-producers and others over-producers
(Fornybar.no). In general, transporting energy
over large distances is imperative in ensuring a
robust power distribution network in the country.

Planning maintenance of such a large power 
distribution infrastructure is a challenge. A large 
portion of the infrastructure is aging, pushing for 
significant re-investment costs from year to year. 
In such a situation, the challenge is identifying 
where efforts should be focused and where money 
should be invested. 

Furthermore, expanding infrastructure within
renewables creates challenges with respect to
resource availability and utilization. This places
more pressure on power grid companies to improve
strategies while simultaneously reducing costs.


1.2 Objective 


Paragraph §2-13 in the regulations for electrical
power distribution (FEF 2006) states that the
power distribution grid owner shall have an overview
of their infrastructure such that they can
decide where to focus maintenance efforts and
which areas require more focus/investment than
others. This is with regard to reputation, Health,
Safety and Environment (HSE), technical condition
and of course reliability of power supply.
RENblad (Norwegian: Rasjonell Elektrisk Virksomhet)
claims that today’s choice of maintenance
strategy often results in components being
replaced much earlier than necessary. Therefore,
risk-based maintenance is a prerequisite for
better maintenance decisions. This is the objective
of the approach and model developed.

This paper presents the simulation model developed
and results from a case study. Furthermore,
preliminary feedback from pilot users is also discussed.
The developed risk model quantifies the
technical, HSE and economic-risk condition of a
given power line and puts this in a financial context.
The approach simulates line degradation over multiple
years and models the effect of different maintenance
strategies. The result is a risk picture and the
cost associated with a chosen maintenance strategy.


2 STATE-OF-THE-ART 


2.1 Maintenance practice in the industry 


Moving from a reactive to a proactive maintenance 
approach is a demanding change for a power com- 
pany. Grid owners systematically document results 
from physical inspections on their power lines from 
year to year. Certain operators perform limited 
visual inspections from the ground level yearly. 
More detailed inspections, examining each utility
pole for internal damage, rotting etc., are performed
at 10-year intervals. The data collected
from these inspections is used to a very limited 
extent in planning maintenance. Deciding on a 
maintenance plan purely based on “age of infra- 
structure” and “total number of findings from 
inspections” is not optimal. Such a strategy is not 
risk-based and can result in significant investments 
made on infrastructure that has a very limited risk 
connected to power supply reliability. 

Maintenance and reinvestment are necessary to
maintain the functionality of the power distribution
grid. As components age and are exposed to
external factors such as wind, lightning, vegetation,
etc., their functionality may weaken or fail.
Furthermore, parts of the power grid are so old that
the lines need reinvestment. Choosing between
maintenance and reinvestment is a fine balance
between economy and risk: “When is it economically
beneficial to replace/reinvest in infrastructure
versus performing periodic maintenance on the same?”
Recent trends (Nordgård and Samdal 2009) point
towards developing strategies that balance cost
effectiveness against other risks, such as economy, HSE,
reputation, supply reliability etc., thus bringing
about the advent of risk-based maintenance.

A well-known technique for the same is Reliabil- 
ity Centered Maintenance (RCM) (IEC 60300-3- 
11). RCM is primarily a qualitative approach which 
involves a systematic consideration of system func- 
tions, where they may fail and a consideration of 
safety vs. economy to decide on an appropriate 
maintenance strategy, e.g. preventive maintenance 
vs. corrective run-to-failure maintenance (Moubray 
1997). The outcome of applying RCM on a set 
of systems including a number of components, 
on which reliability measurements can be done, 
would be a maintenance program pointing to a set 
of components or sub-systems to be maintained at 
different timeslots. 


In Norway, the RENblad is a set of guidelines
followed by most electric power
distribution companies. These guidelines cover
standardized material and methods for the power
distribution business. One such guidebook is the
“Distribution Network - Maintenance Strategy”
(Norwegian: “Distribusjonsnett - Vedlikeholdsstrategi”).
This guidebook focuses on
“condition-based maintenance management”.


2.2 Available decision-support methods and tools 


The available decision-support methods and tools
vary with the size and ambitions of the power
companies. The project “Smarter Asset Management
using Big dAta” (SAMBA) is a 3-year industry
innovation project headed by Statnett, partially
funded by the Norwegian Research Council and
involving several partners such as SINTEF Energy
Research, GE Grid Solutions, ABB and IBM.
Their ambition is to add value by activating more
online, automatically collected condition and system
operation data. These data will be combined
with maintenance data collected on site, and then
be used for estimating an asset’s condition, probability
of failure and remaining lifetime. Some research
questions are:


• What is the condition of the components now?
• How will the condition of a component develop over time?
• Do I do the right maintenance at the right time?
• When should a component be replaced?
• What is the consequence of a component break?


The SAMBA approach is described by the following
figure:

Figure 1. The SAMBA approach (Downloaded from
Statnett: www.statnett.no/Global/Smarte%20Nett.pdf).

Connected Drone is another project approved 
by the Norwegian Research Council and The Nor- 
wegian Water Resources and Energy Directorate 
(NVE). The goal of the project is to create an effi- 
cient and secure system for inspection of power 
lines using drones. By utilizing deep learning to 
analyze data obtained from sensors mounted on 
drones, the project seeks to give power grid utili- 
ties insights into the state of their infrastructure. 
The system will be delivered as an integrated, total 
solution and is specially adapted to the monitoring 
and control of power grids. The project is managed 
and led by eSmart Systems in close collaboration 
with 12 network companies and several technology
partners. The first commercially available
product, Connected Drone and The Intelligent
Assistant, is capable of analyzing massive amounts
of image data.

At the other end of the scale are the small grid
companies, where the maintenance strategy is based
on many years of accumulated experience. Through
their human capital, these companies have learned
what causes degradation of their specific grid, and
they know the status of their components.

In between is the large group of power
companies that do their inspections, record and
assess the data in Excel sheets, and perform
both reactive repairs and proactive maintenance.
One often-used strategy seems to be to inspect
10% of the grid each year (the guidelines require a
full inspection at least every 10 years). Thereafter
they repair or replace those components
that fail according to the maintenance strategy
(RENblad).


2.3 Risk modelling and decision support 


Operational safety has over the years received more
and more attention in many industries. Within the
nuclear field and in the offshore industry, large
accidents such as Three Mile Island (US, 1979), Piper
Alpha (UK, 1988) and Texas City (US, 2005) have
contributed to this attention. One reason for this is
that the focus in the future for the offshore industry
is on operation of existing installations, extending
the operational life of some of these installations
as well as tying in new subsea installations.

As a way forward to understand, model and
predict large accidents, many risk models or barrier
models have been developed, for example the BORA
(Vinnem et al. 2003), OTS (Sklet et al. 2010) and
Risk OMT (Gran et al. 2012) models, or through
master’s thesis work such as “Activity-based Modelling
for Operational Decision-support” (Edwin
2015). Common to these models and works is
that they build a model consisting of target objects
and background indicators.


3 WISELINE MODEL 


3.1 A pole and a line 


The basic element of the Wiseline model is an
electric pole. The pole itself can be constructed
of timber or steel (as shown in Figure 2) and
consists of a number of components whose purpose
is to stabilize the pole, carry the power line, or
provide a barrier against different unwanted incidents.
One such incident is lightning (as shown in Figure 2),
hence the need for insulators to prevent high current
from spreading. For a timber pole, the top-hat is a
component that protects the top of the pole from water
penetration and consequent decay.

At a specific geographic spot, there might be a 
single pole, or a number of poles. The number of 
poles at a single location depends on for example 
the need to stabilize the line when the line shifts 
direction. Another reason for more poles at a sin- 
gle location is the number of power lines the line 
carries. In Wiseline we call this location a “pole 
site”. 

Finally, a row of such pole sites forms a line. 
Typically, a line runs from one switch to another 
switch, i.e. representing a part of the line which can 
be isolated during for example maintenance. 


Figure 2. An electric pole (Photo: wiseline.no).


3.2 Inspection data 


Data on a set of lines are collected
through inspection. A grid owner is required to
do a complete inspection of every pole every 10
years. This involves an inspection of all parts of a
pole, both at ground level and in the air. This is
typically performed by professional inspectors using
advanced tools and methods.

In between, at more frequent intervals, simpler
inspections are recommended and performed.
Here the poles are only inspected from a distance,
and the observer can for example fly along the line
in a helicopter, or travel along it by snowmobile,
small motor vehicle or on foot. These methods are now
being replaced by the use of drones with cameras. This
allows the observer to analyse the photos afterwards,
or to apply intelligent computer image
processing methods to automatically register any
abnormalities. Finally, inspections are also performed
every time there is an outage. In these
cases, both an inspection and a repair must
take place.

There are in total about 30-40 inspection points
for a given pole site. Some components have only
one inspection point, e.g. whether the top hat is intact,
damaged or missing, while others have
several inspection points, e.g. for a timber pole the
following inspection points are recorded: external
rot, internal rot, woodpecker holes, pole damage
etc. The format of inspection data varies depending
on the company that has performed the inspection.
Each inspection is mostly recorded in terms
of a textual description fitting the damage
categories listed in the RENblad.

Deviations per indicator are associated with a
grade running from 1 to 5 (or 0 to 4), where 1 is
“no finding”, also equivalent to “as good as new”,
and 5 corresponds to “non-functioning”. For this
exercise of translating inspection data to grade
scales, input from the Energi Norge handbook
(Energi Norge 2011) is used. For instance, if a
“top-hat” at the top of the utility pole is missing,
it translates to a 5 on the grade scale. On the
other hand, the grade for a slightly bent
or misoriented pole may range from
2 to 5 depending on the degree of damage. The
inspection records of such grades are used as
input in the Wiseline model for each pole.


3.3 Risk model 


In the Wiseline model each grade is translated to
a value between 0.0 and 1.0, where 1 is the best. A
grade 1 is always translated to the value 1.0. However,
a grade 5 is translated differently depending
upon the role of the component. If the component
is critical it may be given a value 0.01, while if it
is not critical it may be given a value 0.9. This can
be illustrated by the component “warning sign”. It
provides no barrier in terms of delivering power,
therefore a missing sign (with grade 5) will in this
case be translated to 0.9. On the other hand, it
provides a barrier in terms of HSE, and there a missing
sign will be translated to 0.01. Note that the numbers
0.9 and 0.01 are only illustrative here.
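The translation described above can be sketched as follows. This is a minimal illustration, not the Wiseline implementation: the function name, the criticality labels and, in particular, the linear interpolation for grades 2 to 4 are assumptions, since the text only fixes the end points (grade 1 maps to 1.0, grade 5 maps to an owner-defined value such as 0.9 or 0.01).

```python
# Illustrative end points only: the text stresses that 0.9 and 0.01 are examples.
GRADE5_VALUE = {"critical": 0.01, "non_critical": 0.9}


def grade_to_value(grade: int, criticality: str) -> float:
    """Map an inspection grade (1..5) to a value in [0, 1], where 1.0 is best.

    Grade 1 always maps to 1.0 and grade 5 to the criticality-specific value.
    Grades 2-4 are linearly interpolated, which is an assumption of this sketch.
    """
    if grade not in (1, 2, 3, 4, 5):
        raise ValueError("grade must be an integer from 1 to 5")
    worst = GRADE5_VALUE[criticality]
    return 1.0 - (grade - 1) / 4 * (1.0 - worst)


# A missing warning sign (grade 5) under two different top-events.
sign_power_supply = grade_to_value(5, "non_critical")
sign_hse = grade_to_value(5, "critical")
```

In this sketch the same missing sign barely affects the power supply value but drives the HSE value close to zero, which mirrors the warning-sign example in the text.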

In real cases, the values to be used in the translation
are set by the grid owner to reflect their
interpretation of the importance of the inspections. The
risk model is developed for three top-events: “technical
condition”, “HSE” and “power supply ability”.
Depending on the top-event being modelled, the
various components and their corresponding indicators
are weighted differently in the underlying risk
model. For instance, the “power transmission”
sub-function will be weighted with a higher importance
in the “power supply ability” model than
in the “HSE” model. Similarly, the high-voltage
signage on utility poles will be weighted higher in
the HSE model than in the other two risk
models. Criticality tables are established for the
various components, as shown in Table 1.

The values within each group are combined
by a weighting function. Thereafter the
groups are combined into a risk value for the pole, and
the risk values of the poles are combined into a risk value
for the pole site (see illustration in Fig. 3). Finally, the
risk values for pole sites are combined into a risk value
for the line. This combination can be viewed as
a factor model as applied in various risk models. In
the current model the poles are handled independently,
but further work on the model will
also take into account that the state of one pole
influences the state of its neighboring poles.
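The bottom-up combination described above can be sketched as below. The text does not specify the weighting function, so a weighted arithmetic mean is assumed at the component and group levels, and a plain mean at the pole-site and line levels; all names, weights and values are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean; stands in for the unspecified weighting function."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total


def pole_risk(groups):
    """Combine component values into group values, then groups into a pole value.

    groups maps a group name to (group_weight, [(component_value, weight), ...]).
    """
    group_values, group_weights = [], []
    for group_weight, components in groups.values():
        values = [value for value, _ in components]
        weights = [weight for _, weight in components]
        group_values.append(weighted_mean(values, weights))
        group_weights.append(group_weight)
    return weighted_mean(group_values, group_weights)


def mean(values):
    """Unweighted mean, assumed here for pole sites and lines."""
    return sum(values) / len(values)


# Hypothetical pole: two structural components and one transmission component.
pole_a = pole_risk({
    "structural": (2.0, [(1.0, 1.0), (0.5, 1.0)]),
    "transmission": (1.0, [(0.9, 1.0)]),
})
pole_b = 1.0                      # a pole with no findings
site_value = mean([pole_a, pole_b])
line_value = mean([site_value, 1.0, 1.0])
```

Poles are treated independently here, in line with the current model; a dependency between neighboring poles would need a different combination rule.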


3.4 A set of risk values 


The outcome of applying the Wiseline model is a
set of risk values for each pole and for each line,
as shown in Table 2 for a demo line. The set of risk
values consists of one risk value for the “technical
condition” (where all components are treated equally
in the combination), one for the “ability to deliver
power” and one for HSE. For some clients
that classify their poles and lines,
the set is extended with a fourth risk value related
to the economic consequence.

Table 1. Example: Component criticality specification.

Function/Sub function  Component        Technical condition  HSE             Power supply ability
Earthing/Protection    Voltage Signage  Medium               Important       Insignificant
Structural             Common Brace     Medium               Very Important  Very Important
Transmission           Insulator        Medium               Important       Important

Figure 3. Combining the translated inspection values on
components up to a risk value for a pole and a pole site.

Table 2. Risk values for a demo-line.

Risk factor      Risk value
Technical state  0.98
Power ability    0.99
HSE              0.96

The same is done for each pole. This can be used
to visualize for example the HSE risk value along
a line, as shown in Figure 4. Here the translated
component values for some components are also
included, as well as a red and a yellow line representing
different acceptance levels. The two lines at the
bottom represent height above sea level and vegetation
at the spots (here flat and equal vegetation all along).


3.5 Predicting the future 


The next step in the Wiseline method is to predict
the future. This is done by allowing each
component to degrade from its current inspected
state according to the reliability figure that applies to
that type of component. Monte Carlo simulation
is used to simulate the degradation of component
condition over time and calculate the associated
costs connected to the chosen maintenance strategy.

The developed risk model provides the user with
significant flexibility in choosing a maintenance
strategy. Examples of strategies include:

1. No maintenance. Run-to-failure.
2. Preventive maintenance (fix when component
   condition >= indicator score ‘x’).
3. Overhaul selected utility poles based on
   specified criteria at t = ‘x’ and run-to-failure
   thereafter.

Figure 4. The risk value for HSE for the poles along the demo line. The red and yellow lines represent different
acceptance levels.

In the current Wiseline model the same reliability
model is applied to a component independent
of its age, type, producer, location etc. In the
future this can be replaced by more specific
models. One such model is under development by
Sintef, power grid companies and inspection companies
to model how fast a timber pole will rot,
and thereby also provide an estimate of the remaining
lifetime. In another project, Statnett and Sintef
(Solheim et al. 2016) aim to establish
wind-dependent failure rates for overhead transmission
lines using reanalysis data and a Bayesian updating
scheme.


3.6 Estimating the cost of a maintenance strategy


The final step in the Wiseline method is to associate
each maintenance strategy with an investment
cost. This investment cost includes both the cost
of the component repaired or replaced and the
labor cost. These costs are specific to each grid
owner.

The result is that decisions can be made by
balancing the costs against the acceptable risk
for the different top events. In Figure 5 the cost
per year for 5 different maintenance strategies is
shown for a demo case. Visual risk presentation
overlaid on map data allows the decision-maker to
understand the risk picture and make appropriate
investment decisions based on the same. A demonstration
of this is shown in Figure 6 for the client
Vesteralskraft.

Figure 5. The cost per year for 5 different maintenance strategies.
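The degradation and costing steps of Sections 3.5 and 3.6 can be sketched with a small Monte Carlo simulation. This is an illustration under stated assumptions, not the Wiseline implementation: each component is assumed to worsen by one grade per year with a fixed probability `p_degrade` (the same for all components, in the spirit of the single reliability model mentioned above), and a fix costs a flat `cost_per_fix` and resets the component to grade 1. All names and numbers are hypothetical.

```python
import random


def simulate_cost(initial_grades, years, p_degrade, fix_at_grade,
                  cost_per_fix, runs=2000, seed=1):
    """Average yearly maintenance cost of a 'fix when grade >= fix_at_grade' strategy.

    fix_at_grade=None means run-to-failure, i.e. no maintenance is ever done.
    """
    rng = random.Random(seed)          # fixed seed keeps the sketch reproducible
    total_cost = 0.0
    for _ in range(runs):
        grades = list(initial_grades)
        for _ in range(years):
            for i, grade in enumerate(grades):
                # Degrade by one grade with probability p_degrade (5 is the worst).
                if grade < 5 and rng.random() < p_degrade:
                    grade += 1
                # Apply the maintenance strategy: repair resets the grade to 1.
                if fix_at_grade is not None and grade >= fix_at_grade:
                    total_cost += cost_per_fix
                    grade = 1
                grades[i] = grade
    return total_cost / (runs * years)


inspected = [1, 2, 3]                  # hypothetical grades from the last inspection
cost_none = simulate_cost(inspected, 10, 0.3, None, 1000.0)
cost_vg4 = simulate_cost(inspected, 10, 0.3, 4, 1000.0)
cost_vg3 = simulate_cost(inspected, 10, 0.3, 3, 1000.0)
```

Comparing `cost_vg3` and `cost_vg4` against the risk values each strategy produces is the kind of trade-off the cost-per-year comparison in Figure 5 is meant to support.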


4 CASE STUDY 


To demonstrate the functionality of the developed
risk model, anonymized inspection data from
power lines belonging to different Norwegian electric
power distribution owners were used. The input
data included inspection
points from a total of more than 300 power lines,
where each line included from 10 up to 300 utility
poles. These include distribution
lines from 230 V to 22 kV. As discussed in Section 3,
input inspection data was first mapped to
the indicator set in the risk model. As the inspection
data were collected in different years, the next
step was to degrade each component from its
inspected state, according to the reliability figure
that applies to that type of component, up to the
predicted state in 2017.

The Wiseline model was then run using the
2017 data as the starting point for the simulations.
An outline of the results is shown in
Table 3. Here the results for HSE are shown, first
the calculated 2017 value, then the predicted
risk value applying two different maintenance
strategies: maintenance when grade 4 is predicted
(VG4) and maintenance when grade 3 is predicted
(VG3). The cost estimates are left out for
simplicity. However, simply by looking at the
lines in the table, one sees that there is a low
correlation between the number of deviations and the
predicted risk values. Furthermore, one also sees
that the risk value does not correlate with the age
of the poles either.


Figure 6. The risk value displayed on a map (Vesteralskraft). 


Table 3. Outline of the results for the case study.

Line         kV    # Poles  Inspected  Pole year  Pole year  Deviations  HSE     HSE (2027)  HSE (2027)
                                       (min)      (max)                  (2017)  VG4         VG3
Wiseline003  22    11       Aug. 14    1963       1964       8           94.59   98.71       98.23
Wiseline006  22    6        Aug. 14    1962       1985       9           94.98   99.14       99.49
Wiseline192  22    8        Jun. 16    1990       1992       4           96.34   99.81       96.71
Wiseline002  22    34       Aug. 14    1955       1994       22          96.53   99.35       99.82
Wiseline045  0.23  19       Aug. 14    1950       2000       11          96.59   99.03       96.78
Wiseline025  0.23  12       Aug. 14    1954       1978       2           97.30   99.11       99.65
Wiseline026  0.23  14       Aug. 14    ?          1968       9           97.42   99.23       99.58
Wiseline337  0.23  27       Aug. 11    1900       1959       38          97.48   98.81       99.01
Wiseline339  0.23  15       Jun. 11    1900       1990       38          97.51   98.83       99.63
Wiseline097  22    17       Oct. 15    1955       2008       17          97.55   98.77       98.24
Wiseline141  22    1        Jun. 16    1957       1957       3           97.58   95.88       99.76
Wiseline336  22    36       Aug. 11    1974       1974       32          97.65   98.37       99.56
Wiseline046  0.23  16       Aug. 14    ?          2010       1           97.71   99.02       99.30
Wiseline159  22    1        Jun. 16    1960       1960       2           97.79   97.79       99.34
Wiseline301  0.23  34       Nov. 13    1900       2004       80          97.83   97.98       99.49
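The claim that the number of deviations correlates only weakly with the risk values can be checked directly on the rows of Table 3. The sketch below computes a plain Pearson correlation coefficient for the 15 lines shown; it covers only this excerpt of the case study, not the full data set of more than 300 lines, and the choice of Pearson correlation is an assumption of this illustration.

```python
import math

# Columns copied from Table 3 (same row order).
deviations = [8, 9, 4, 22, 11, 2, 9, 38, 38, 17, 3, 32, 1, 2, 80]
hse_2017 = [94.59, 94.98, 96.34, 96.53, 96.59, 97.30, 97.42, 97.48,
            97.51, 97.55, 97.58, 97.65, 97.71, 97.79, 97.83]
hse_2027_vg4 = [98.71, 99.14, 99.81, 99.35, 99.03, 99.11, 99.23, 98.81,
                98.83, 98.77, 95.88, 98.37, 99.02, 97.79, 97.98]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r_2017 = pearson(deviations, hse_2017)
r_2027_vg4 = pearson(deviations, hse_2027_vg4)
print(f"deviations vs HSE (2017):     r = {r_2017:.2f}")
print(f"deviations vs HSE (2027) VG4: r = {r_2027_vg4:.2f}")
```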


In two client cases the degradation has been
run up to 2027. At this stage, the end-user can
select appropriate risk acceptance criteria for each
of the risk models. By looking at the lines that do
not meet the risk acceptance criteria in 2027, the
clients get a proposal for which lines to address
first. For the same lines the associated cost of
maintenance is calculated, so that a risk-based approach
can be applied to the decision.
5 DISCUSSION


5.1 Application for decision-support 


As currently implemented, the Wiseline model
applies a number of simplifications in the modelling.
One such simplification is to apply the same
reliability model independent of both geography
and age of a component. Furthermore, the risk
model applies a limited set of component criticality
values. In the next versions of the Wiseline model
the intention is to extend the different parts, and
also to include the effect of causes of deviation,
such as lightning, wind, trees, woodpeckers etc. In
its current state, the Wiseline model runs in Excel,
with data transferred from a client
into the model, and from the model into result
visualization. The plan is to address these aspects
through a development and innovation project
together with partners from the beginning of 2018.


5.2 Limitations 


Two limitations of the model are of course the
quality of the inspection data and errors in the
translation of the inspection data into the input data.
The more manual steps there are in this process,
the more influence they can have. On the other hand,
when the inspection data is professionally collected
and exported directly from an inspection database,
like the database of NordiConsult, this problem is
minimized.

Another limitation lies in the interpretation
of the risk factors. The technical state, where all
indicators are treated equally, is a construct. The
HSE factor should probably be split into
inspector/skilled laborer and third party (animals and
general public). Finally, the “ability to deliver
power” is not the same as “causing an outage” and
does not say much about how long an outage will
be. Thereby the Wiseline model does not say much
about expected costs due to outages.


6 CONCLUSION 


Planning maintenance over a large, distributed and 
ageing power infrastructure is a challenge. This 
paper has presented a novel method developed to 
model the risk of power line infrastructure. The 
model uses inspection data, reliability models for 
the aging of components and associated main- 
tenance costs to present a risk picture and the 
associated expected costs based on the user’s main- 
tenance strategy of choice. Visual risk presentation 
overlaid on map data allows the decision-maker to 
understand the risk picture and make appropriate 
investment decisions based on the same. The devel- 
oped model is currently being used in-house at 
various grid operators in Norway. Here the Wiseline
model has demonstrated that there is no single
maintenance strategy that fits all cases. What the
Wiseline model does is enable the grid operators
to make risk-based decisions.
The real-time reliability evaluation and sequential inspection decision
based on Wiener process
ABSTRACT: Due to the complexity and high reliability requirements of aerospace equipment, it is a
significant challenge to establish efficient maintenance management when little or no failure data is
available from the actual operation or storage environment. This paper investigates real-time
reliability and sequential inspection policy for aerospace equipment based on a Wiener process-based
degradation model. Firstly, the stochastic characteristics of the equipment’s degradation are described by
the Wiener process. The product-to-product variability and the temporal uncertainty of the degradation
are characterized by random drift parameters. Secondly, the reliability distribution is
obtained in closed form using first-hitting-time theory. An adaptive method is proposed to estimate
the unknown parameters with the Rauch-Tung-Striebel (RTS) optimal smoothing algorithm and the
Expectation Maximization (EM) algorithm. Once new degradation information is available, the parameters
are updated with the Bayesian equation. Moreover, historical information on products of the same type
can be integrated in the selection of the initial parameters to ensure convergence of the iterative updating.
Thirdly, a sequential inspection model is discussed to determine the optimal intervals that satisfy the
requirements for the real-time reliability at a given time. Finally, an example of fatigue crack growth in an
aerospace aluminum alloy is given to illustrate the validity of the proposed method.


1 INTRODUCTION

The majority of aviation products are high-reliability
products, such as aircraft engines, satellites
and gyroscopes, which need to maintain a certain
degree of reliability during long storage or operation.
Due to a variety of factors (such as
temperature stress, fatigue, corrosion, etc.), their
internal material properties change, including
the corrosion of mechanical parts, the aging of rubber
materials and so on. This reduces the
reliability of the product. Therefore, to improve
reliability and minimize the number of inspections or
maintenance actions, it is of vital significance to optimize
the inspection and maintenance policy.

The Wiener process-based degradation model is
a statistical data-driven method that has been extensively
utilized in modeling degradation paths, both
academically and practically. Its biggest advantage
is that the distribution of failure time can be
formulated as an inverse Gaussian distribution.
Ray (2002) and Ye et al. (2015) successfully applied
the Wiener process to fatigue crack growth analysis.
Gebraeel et al. (2005) described the degradation
path of rotating element bearings by a Wiener process.
Aiming at the effect of the non-stationary feature
and the delay of state detection caused by discrete
inspection on long-run operation cost, Zhang
(2016) proposed an optimal inspection-based
maintenance policy for three-state mechanical
components subject to competing failure modes.
A maintenance optimization model for
mission-oriented systems based on Wiener degradation was
proposed by Guo (2013). These all demonstrate
that the Wiener process model can achieve satisfactory
results when large sample data are used. However, most
aviation products are costly and highly reliable, so
it is almost impossible to obtain enough failure life
data through life tests or accelerated
life tests, and even the “zero failure” phenomenon may
occur. Therefore, we consider establishing the
degradation model based on the Wiener process
using small sample data, so as to be more
suitable for aviation products.

Besides, in order to ensure the normal operation
of products and preventive maintenance, variation
in reliability under different conditions should be
considered. Real-time reliability evaluation is ideal
because the factors that affect reliability change
continuously with time for dynamic systems. Yan
(2016) developed a two-phase Wiener degradation
model to evaluate the real-time reliability of
devices. Xu et al. (2008) proposed a real-time
reliability prediction method for a dynamic system
that suffers from a hidden degradation process.
Wang et al. (2014) investigated the issue of real-time
reliability evaluation based on a general Wiener
process-based degradation model. Faghih-Roohi et
al. (2014) developed a dynamic model for availability
assessment of multi-state weighted k-out-of-n
systems using the universal generating function and
Markov process. Each method has its advantages
and disadvantages. For example, Markov models
have state space explosion problems (Tanrioven
2004). The major limitation of stochastic simulation
is that the monitoring value has to be a scalar,
whereas many monitoring values usually
exist in real situations. Hidden degradation
process identification is unsuitable for situations in
which the characteristics of the degradation processes
are time varying, or the path function of the
fault process is nonlinear. Therefore, in the model
of this paper the temporal uncertainty of the degradation
is characterized by random drift parameters,
defined through a random walk
model, which improves the degradation model.

It is a significant challenge for engineers to
define the appropriate inspection interval given
the uncertainty in product deterioration and
environmental change. Fewer inspections lead to
lower reliability, and frequent inspections lead to
higher cost, so the optimal inspection policy should
trade off reliability against
operation cost. For long-storage products, Feng
(2011) proposed a sequential inspection method
based on the Weibull distribution, in which the sequential
inspection time is determined by the storage
reliability requirement. Cui (2004) developed a sequential
inspection policy for multiple systems under an
availability requirement. Considering the optimization
of the alarm threshold, a sequential inspection
scheme was determined by Jiang (2010). For systems
subject to stochastic deterioration, Zhu (2017)
proposed a sequential condition-based maintenance
policy. The above sequential inspection methods
optimize the maintenance and inspection decision
under constraints on the degradation
threshold, availability, cost and so on. But for
the aviation products studied in this paper, it is more
important to assess real-time reliability.
Therefore we focus on improving the related sequential
inspection and maintenance policy by using the
real-time reliability model, which will ensure the
high reliability of the aviation products.

The aim of this paper is to find the optimal maintenance policy for
aviation products. The proposed model makes full use of the
mathematical properties of the Wiener process, and the uncertainty of
the stochastic model parameter estimation is also taken into
consideration. To ensure the convergence of the parameters during
iterative updating, historical information from products of the same
type is used in the selection of the initial parameters. Moreover, a
variety of algorithms are used to iteratively update the model
parameters, which realizes the real-time updating of the parameter
estimates and the adaptive prediction of remaining life. Combined with
the real-time reliability requirement, the sequential inspection policy
is determined for the aviation product, thus effectively reducing the
number of inspections and maintenance actions.

The remainder of this paper is organized as follows. Section 2
describes the model assumptions and maintenance policy. In Section 3,
the real-time reliability model of the product is established and the
unknown parameters are estimated. Section 4 presents an optimal model
for sequential inspection. In Section 5, a numerical example of fatigue
crack growth in an aerospace aluminum alloy is given to illustrate the
validity of the proposed method. Conclusions and some directions for
future work are given in Section 6.


2 MODEL DESCRIPTION 


2.1 Model assumptions


1. It is assumed that the performance of the product is recovered as
   new after replacement.

2. According to the actual requirement of engineering, $R_c$ is
   generally set to be a constant.

3. Failures are observed only by inspection, and if the degradation
   data of an inspection exceed the preventive maintenance threshold
   $w_0$, maintenance should be carried out and the component will be
   replaced.

4. The inspection time and replacement time are negligible.


Nomenclature
$\mu$                 drift coefficient
$\mu_0$               initial drift coefficient
$\sigma$              diffusion coefficient
$w_0$                 preventive maintenance threshold
$w$                   failure threshold
$Q$                   the random walk parameter of $\mu$
$a_0$                 the mean of normal distribution about $\mu_0$
$D_0$                 the variance of normal distribution about $\mu_0$
$D_{ik}$              the updated variance of $\mu$
$D_{k|k-1}$           the conditional variance of $\mu$
$\theta$              set of the model parameters $(a_0, D_0, Q_0, \sigma_0^2)$
$R_c$                 requirement of the real-time reliability for the product (constant)
$\Delta t_{ik}$       inspection interval
$t_i$                 time of the i-th maintenance
$t_{ik}$              time of k-th inspection after i-th maintenance
$x_{ik}$              degradation data of k-th inspection after i-th maintenance
$X_{i(0:k)}$          set of degradation data after i-th maintenance
$l_{ik}$              remaining life of the product at the time $t_{ik}$
$p(\mu_k \mid X_{0:k})$                       the conditional PDF of $\mu_k$ for the given $X_{0:k}$
$f_{L_k|X_{0:k}}(l_k \mid X_{0:k})$           the PDF of the remaining life when the inspection data is $X_{0:k}$
$f_{L_{ik}|X_{i(0:k)}}(l_{ik} \mid X_{i(0:k)})$   the PDF of the remaining life for the product when the inspection data is $X_{i(0:k)}$
$R(t \mid x_k)$                               real-time reliability of the product at time $t$
$R(l_k \mid X_{0:k})$                         real-time reliability of the remaining life for the product


Figure 1. Inspection and maintenance cycle diagram for the product.


2.2 Deterioration and maintenance process 


In this paper, a metal material of an important component of an
aviation product is taken as the research object. The performance
degradation processes of many aviation products are random and
uncertain. The Wiener process describes the temporal uncertainty of the
degradation process well, and it also makes it easier to handle data
measurement error. Due to the long service life and high maintenance
cost of aviation products, periodic inspection or real-time monitoring
is generally required. If the performance degradation is found to
exceed the preventive maintenance threshold $w_0$, the component will
be replaced immediately. The inspection and maintenance cycle is shown
in Figure 1.

The Wiener process is used to establish the degradation model of
aviation products. $R(t \mid x_k)$ is defined as the real-time
reliability at time $t$. According to the definition of real-time
reliability, the value of $R(t \mid x_k)$ gradually decreases from 1
after each inspection as time goes on. The procedure can be divided
into three steps based on the sequential inspection policy. Firstly,
the real-time reliability model is established based on the Wiener
process. Then, the model parameters are estimated from the degradation
data $X_{i(0:k)}$, so that the PDF
$f_{L_k|X_{0:k}}(l_k \mid X_{0:k})$ of the remaining life of the
product can be obtained and the expression $R(l_k \mid X_{0:k})$ of
the real-time reliability can be derived. Finally, the inspection
interval $\Delta t_{ik}$ is determined to satisfy the requirement
$R_c$, so the next inspection time is
$t_{i(k+1)} = t_{ik} + \Delta t_{ik}$.


3 REAL-TIME RELIABILITY 


If the aviation product is known to have not failed 
at the current moment, a conditional probability 
can be used to express its reliability level. When 
new degradation data is obtained, the real-time 
reliability is further updated. 


3.1 The model of real-time reliability 


For the actual operation or storage of aviation products, the
degradation process is described by the Wiener process, so using the
Markov property of $\{B(t), t \ge 0\}$ we obtain:

$$X(t) = x_k + \mu (t - t_k) + \sigma B(t - t_k) \tag{1}$$

It is assumed that $x_k < w$ at the time $t_k$; otherwise, the device
is considered to have failed. In this study, the definition of product
life is based on the first-hitting time theory for random processes
proposed by Lee (2007), so the PDF of the product life can be obtained:

$$f_{T|x_k}(t \mid x_k) = \frac{w - x_k}{\sqrt{2\pi (t - t_k)^3 \sigma^2}} \exp\left(-\frac{(w - x_k - \mu (t - t_k))^2}{2\sigma^2 (t - t_k)}\right) \tag{2}$$

Due to the small sample of degradation data for the aviation product,
it is important to integrate all the data from each inspection into the
degradation modeling. In this study, $\mu$ is defined as a random walk
model, which can be updated as new inspection data is obtained, that is
$\mu_k = \mu_{k-1} + \eta$ with $\eta \sim N(0, Q)$. Then, based on the
state-space model framework, the degradation process of the aviation
product can be described by a linear state-space model:

$$\mu_k = \mu_{k-1} + \eta \tag{3}$$

$$x_k = x_{k-1} + \mu_{k-1}(t_k - t_{k-1}) + \sigma \varepsilon_k \tag{4}$$

where $t_0 = 0$, $x_0 = 0$, $\varepsilon_k \sim N(0, t_k - t_{k-1})$,
and $t_k - t_{k-1}$ is the variance of $\varepsilon_k$ by the
properties of Brownian motion.

It is assumed that the initial drift coefficient follows
$\mu_0 \sim N(a_0, D_0)$, and the drift coefficient $\mu_k$ can be
regarded as a hidden "state" estimated from the inspection data
$X_{0:k} = \{x_0, x_1, x_2, \ldots, x_k\}$. Therefore, the above
state-space model establishes the relationship between the drift
coefficient and all inspection data from the beginning to the present
time $t_k$. In equation (3), once new degradation data is obtained, the
probability distribution of $\mu_k$ can be calculated from all the
inspection data by a recursive filtering method. The posterior mean of
$\mu_k$ is defined as $\hat{\mu}_k = E(\mu_k \mid X_{0:k})$, and the
corresponding variance as $D_{k|k} = \mathrm{var}(\mu_k \mid X_{0:k})$.
To obtain $\hat{\mu}_k$ and $D_{k|k}$, we need the PDF of $\mu_k$ for a
given $X_{0:k}$, denoted by $p(\mu_k \mid X_{0:k})$, which can be
recursively calculated from $p(\mu_{k-1} \mid X_{0:k-1})$ using Bayes'
rule. According to stochastic filter theory and the strong tracking
filtering technique, the conditional PDF of $\mu_k$ can be obtained:

$$p(\mu_k \mid X_{0:k}) = \frac{1}{\sqrt{2\pi D_{k|k}}} \exp\left(-\frac{(\mu_k - \hat{\mu}_k)^2}{2 D_{k|k}}\right) \tag{5}$$

According to the first-hitting time theory of random processes, we can
use equation (6) to measure the real-time reliability of the product:

$$R(t \mid X_{0:k}, \theta_k) = P\{X(\tau) < w,\ \forall \tau \in [t_k, t] \mid \theta_k, X_{0:k},\ x(t_j) < w,\ j = 1, 2, \ldots, k\} \tag{6}$$

where $\theta_k = (a_k, D_k, Q_k, \sigma_k^2)$. In order to get the
expression of $R(t \mid X_{0:k}, \theta_k)$, the remaining life of the
product is first defined as:

$$L_k = \inf\{l_k : X(l_k + t_k) \ge w \mid X_{0:k},\ x(t_j) < w,\ j = 1, 2, \ldots, k\} \tag{7}$$

So $R(t \mid X_{0:k}, \theta_k)$ can be expressed as follows:

$$R(t \mid X_{0:k}, \theta_k) = P(L_k > t - t_k) \tag{8}$$


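According to the literature (Si (2016), Peng (2009)), and combined
with the state-space model of equations (3) and (4), the PDF of the
remaining life $L_k$ can be obtained as follows:

$$f_{L_k|X_{0:k}}(l_k \mid X_{0:k}) = \frac{w - x_k}{\sqrt{2\pi l_k^3 (D_{k|k} l_k + \sigma^2)}} \exp\left(-\frac{(w - x_k - \hat{\mu}_k l_k)^2}{2 l_k (D_{k|k} l_k + \sigma^2)}\right), \quad l_k > 0 \tag{9}$$

Then, when $t - t_k = l_k$, the real-time reliability of the remaining
life $l_k$ can be obtained:

$$R(l_k \mid X_{0:k}) = \int_{l_k}^{+\infty} f_{L_k|X_{0:k}}(l \mid X_{0:k})\, dl = \Phi\left(\frac{w - x_k - \hat{\mu}_k l_k}{\sqrt{D_{k|k} l_k^2 + \sigma^2 l_k}}\right) - \exp\left(\frac{2\hat{\mu}_k (w - x_k)}{\sigma^2} + \frac{2 D_{k|k} (w - x_k)^2}{\sigma^4}\right) \Phi\left(-\frac{2 D_{k|k} (w - x_k) l_k + \sigma^2 (\hat{\mu}_k l_k + w - x_k)}{\sigma^2 \sqrt{D_{k|k} l_k^2 + \sigma^2 l_k}}\right) \tag{10}$$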
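The closed form of the real-time reliability can be cross-checked numerically against the remaining-life PDF. The sketch below evaluates both and compares the closed form with one minus a trapezoidal integral of the PDF from 0; this equals the tail integral only under the assumption that the threshold is eventually reached, which holds to very good approximation when the drift is strongly positive. Function names and the sample parameter values are illustrative assumptions.

```python
import math


def norm_cdf(z: float) -> float:
    # Standard normal CDF via erfc, which stays accurate in the lower tail.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def remaining_life_pdf(l, x_k, w, mu_hat, d_var, sigma2):
    """PDF of the remaining life at l > 0 (equation (9))."""
    gap = w - x_k
    spread = d_var * l + sigma2
    return (gap / math.sqrt(2.0 * math.pi * l ** 3 * spread)
            * math.exp(-(gap - mu_hat * l) ** 2 / (2.0 * l * spread)))


def reliability(l, x_k, w, mu_hat, d_var, sigma2):
    """Closed-form real-time reliability at remaining life l (equation (10)).

    Note: math.exp overflows if the exponent exceeds about 709.
    """
    gap = w - x_k
    s = math.sqrt(d_var * l * l + sigma2 * l)
    exponent = 2.0 * mu_hat * gap / sigma2 + 2.0 * d_var * gap ** 2 / sigma2 ** 2
    z = -(2.0 * d_var * gap * l + sigma2 * (mu_hat * l + gap)) / (sigma2 * s)
    return norm_cdf((gap - mu_hat * l) / s) - math.exp(exponent) * norm_cdf(z)


def reliability_by_quadrature(l, x_k, w, mu_hat, d_var, sigma2, n=20000):
    """1 - integral of the PDF over (0, l], trapezoidal rule (PDF -> 0 at 0)."""
    h = l / n
    total = 0.5 * remaining_life_pdf(l, x_k, w, mu_hat, d_var, sigma2)
    for i in range(1, n):
        total += remaining_life_pdf(i * h, x_k, w, mu_hat, d_var, sigma2)
    return 1.0 - h * total
```

Agreement between `reliability` and `reliability_by_quadrature` at a few values of `l` is a quick sanity check that the PDF and the reliability expression are mutually consistent.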


The parameters $a_0$, $D_0$, $Q$ and $\sigma^2$ of the random
degradation model in equations (8), (9) and (10) are unknown and need
to be estimated from the inspection data $X_{0:k}$. Section 3.2
therefore describes the adaptive estimation method for the unknown
parameters.


3.2 The adaptive estimation method of parameters 


In order to sequentially evaluate the unknown parameters, it is
necessary to use all the inspection data of the product from the start
of operation or storage to the current time. Once new inspection data
is acquired, the previous parameters can be updated recursively. The
unknown parameter vector is denoted as
$\theta = (a_0, D_0, Q, \sigma^2)^T$. When new degradation data $x_k$
is obtained, the maximum likelihood estimation method is used to
estimate $\theta$. The log-likelihood function based on $X_{0:k}$ can
be expressed as:

$$L_k(\theta) = \log[p(X_{0:k} \mid \theta)] \tag{11}$$

where $p(X_{0:k} \mid \theta)$ is the joint PDF of the degradation data
$X_{0:k}$. The maximum likelihood estimate $\hat{\theta}_k$ of $\theta$
is then obtained by maximizing the likelihood function:

$$\hat{\theta}_k = \arg\max_{\theta} L_k(\theta) \tag{12}$$

The EM algorithm can be used to approximate the maximum likelihood
estimate of the parameter by maximizing the joint likelihood function
$p(X_{0:k}, U_k \mid \theta)$, where
$U_k = (\mu_0, \mu_1, \ldots, \mu_k)$. According to the relationship
between $p(U_k \mid X_{0:k}, \theta)$ and
$p(X_{0:k}, U_k \mid \theta)$, $L_k(\theta)$ can be divided into two
parts, namely:

$$L_k(\theta) = \ell_k(\theta) - \log p(U_k \mid X_{0:k}, \theta) \tag{13}$$

where

$$\ell_k(\theta) = \log p(X_{0:k}, U_k \mid \theta) \tag{14}$$

According to the adaptive estimation method proposed by Si (2016), the
estimation of the unknown model parameters can be divided into the
following steps:


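Step 1: The joint log-likelihood function in the EM algorithm can be
expressed as:

$$\ell_k(\theta) = \log p(\mu_0 \mid \theta) + \log \prod_{j=1}^{k} p(\mu_j \mid \mu_{j-1}, \theta) + \log \prod_{j=1}^{k} p(x_j \mid \mu_{j-1}, \theta) \tag{15}$$

As a result, the conditional expectation
$E[\ell_k(\theta \mid \hat{\theta}_k^{(i)})]$ of $\ell_k(\theta)$ is,
omitting additive terms that do not depend on $\theta$:

$$E[\ell_k(\theta \mid \hat{\theta}_k^{(i)})] = E_{U_k|X_{0:k},\hat{\theta}_k^{(i)}}[\ell_k(\theta)] = E_{U_k|X_{0:k},\hat{\theta}_k^{(i)}}\left\{ -\frac{1}{2}\left[\log D_0 + \frac{(\mu_0 - a_0)^2}{D_0}\right] - \frac{1}{2}\sum_{j=1}^{k}\left[\log Q + \frac{(\mu_j - \mu_{j-1})^2}{Q}\right] - \frac{1}{2}\sum_{j=1}^{k}\left[\log \sigma^2 + \frac{(x_j - x_{j-1} - \mu_{j-1}(t_j - t_{j-1}))^2}{\sigma^2 (t_j - t_{j-1})}\right] \right\} \tag{16}$$

Step 2: The Rauch-Tung-Striebel (RTS) optimal smoothing algorithm is
used to calculate the conditional expectations
$E_{U_k|X_{0:k},\hat{\theta}_k^{(i)}}(\mu_j)$,
$E_{U_k|X_{0:k},\hat{\theta}_k^{(i)}}(\mu_j^2)$ and
$E_{U_k|X_{0:k},\hat{\theta}_k^{(i)}}(\mu_j \mu_{j-1})$.

Step 3: Substituting these three conditional expectations into
equation (16), the specific expression of
$E[\ell_k(\theta \mid \hat{\theta}_k^{(i)})]$ can be obtained. This
completes the E step of the EM algorithm.

Step 4: In the M step that follows, according to the literature
(Si (2016)), the parameter $\hat{\theta}_k^{(i+1)}$ estimated in the
(i+1)-th iteration is the unique global optimal solution of equation
(12), namely
$\hat{\theta}_k^{(i+1)} = (a_{0,k}^{(i+1)}, D_{0,k}^{(i+1)}, Q_k^{(i+1)}, (\sigma_k^2)^{(i+1)})^T$.
The detailed proof and the convergence analysis of the algorithm can be
found in the literature (Si (2016)).

Thus the parameter estimate
$\hat{\theta}_k = (\hat{a}_{0,k}, \hat{D}_{0,k}, \hat{Q}_k, \hat{\sigma}_k^2)^T$
is obtained, which will be used as the model parameter of the real-time
reliability at the next inspection time.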


4 DETERMINATION OF SEQUENTIAL 
INSPECTION INTERVALS 


In order to make the model parameter estimation converge faster and
obtain higher accuracy, we can use the historical data of similar
aviation products to estimate approximate initial parameters
$\theta_0 = (a_0, D_0, Q_0, \sigma_0^2)^T$. For an important component
of an aviation product, it is generally necessary to set the required
real-time reliability $R_c$ (a constant) according to the task
requirement, so as to ensure the reliability of the product before
inspection or maintenance.

4.1 Determination of the initial inspection times
$t_{01}$ and $t_{02}$ for the new product


In order to estimate the parameter $\hat{\theta}_k$ more accurately and
make the parameter estimation converge more quickly, at least two state
observations of the product are required. Conservatively, considering
product safety and reliability, the interval before the first
inspection is set equal to the interval between the first and second
inspections, i.e. $t_{01} = t_{02}/2$. Since the real-time reliability
should be guaranteed before the next inspection, the following
condition, based on equation (10), must hold:

$$R(l_k \mid X_{0:k}) = \Phi\left(\frac{w - x_k - \hat{\mu}_k l_k}{\sqrt{D_{k|k} l_k^2 + \sigma^2 l_k}}\right) - \exp\left(\frac{2\hat{\mu}_k (w - x_k)}{\sigma^2} + \frac{2 D_{k|k} (w - x_k)^2}{\sigma^4}\right) \Phi\left(-\frac{2 D_{k|k} (w - x_k) l_k + \sigma^2 (\hat{\mu}_k l_k + w - x_k)}{\sigma^2 \sqrt{D_{k|k} l_k^2 + \sigma^2 l_k}}\right) \ge R_c \tag{17}$$


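Thus, the second inspection time $t_{02}$ satisfies the following
formula:

$$\Phi\left(\frac{w - x_0 - \hat{\mu}_0 t_{02}}{\sqrt{D_0 t_{02}^2 + \sigma_0^2 t_{02}}}\right) - \exp\left(\frac{2\hat{\mu}_0 (w - x_0)}{\sigma_0^2} + \frac{2 D_0 (w - x_0)^2}{\sigma_0^4}\right) \Phi\left(-\frac{2 D_0 (w - x_0) t_{02} + \sigma_0^2 (\hat{\mu}_0 t_{02} + w - x_0)}{\sigma_0^2 \sqrt{D_0 t_{02}^2 + \sigma_0^2 t_{02}}}\right) \ge R_c \tag{18}$$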


Based on the monotonicity of the distribution function, the solution
$\hat{t}_{02}$ of this single-variable equation is easily obtained
using MATLAB.

According to engineering practice, the second inspection time can be
set as $t_{02} = \hat{t}_{02}$, so the first inspection time is
$t_{01} = t_{02}/2 = \hat{t}_{02}/2$. The product can then be inspected
at $t_{01}$ and $t_{02}$ to obtain new degradation data $x_{01}$ and
$x_{02}$.
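Because the reliability decreases monotonically with the interval length, the largest admissible interval can be found by bisection. The sketch below is a Python stand-in for the MATLAB root search mentioned in the text, not the authors' code: it re-implements the closed-form reliability of equation (10) and searches for the interval at which it meets the requirement. The function names, the bracket `[lo, hi]` and the tolerance are assumptions chosen for illustration.

```python
import math


def reliability(l, gap, mu_hat, d_var, sigma2):
    """Closed-form real-time reliability of equation (10); gap = w - x_k."""
    def phi(z):
        return 0.5 * math.erfc(-z / math.sqrt(2.0))

    s = math.sqrt(d_var * l * l + sigma2 * l)
    exponent = 2.0 * mu_hat * gap / sigma2 + 2.0 * d_var * gap ** 2 / sigma2 ** 2
    z = -(2.0 * d_var * gap * l + sigma2 * (mu_hat * l + gap)) / (sigma2 * s)
    return phi((gap - mu_hat * l) / s) - math.exp(exponent) * phi(z)


def max_interval(r_req, gap, mu_hat, d_var, sigma2,
                 lo=1e-9, hi=10.0, tol=1e-10):
    """Largest interval l with reliability(l) >= r_req, by bisection.

    Assumes the reliability is decreasing in l and that [lo, hi]
    brackets the crossing point.
    """
    if (reliability(lo, gap, mu_hat, d_var, sigma2) < r_req
            or reliability(hi, gap, mu_hat, d_var, sigma2) >= r_req):
        raise ValueError("requirement is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if reliability(mid, gap, mu_hat, d_var, sigma2) >= r_req:
            lo = mid
        else:
            hi = mid
    return lo
```

A call such as `max_interval(0.9, 1.6 - 0.9, 6.5, 0.5, 0.1)` returns the interval for a given set of initial parameters; halving it gives the first inspection time under the conservative rule described above.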


4.2 Determination of the inspection time
$t_{i(k+1)}$ $(i \ge 0, k \ge 0)$


After obtaining the product degradation data $x_{01}$ and $x_{02}$,
the new parameter $\hat{\theta}_{01}$ can be determined by using the
adaptive estimation method to integrate $x_{01}$ and $\theta_0$.
Similarly, the parameter $\hat{\theta}_{02}$ is adaptively estimated by
integrating the degradation data $x_{01}$, $x_{02}$ and the parameter
$\hat{\theta}_{01}$. The inspection interval $\Delta t_{02}$ can then
be determined according to the requirement $R_c$, so the third
inspection time is $t_{03} = t_{02} + \Delta t_{02}$.

If the degradation data of the next inspection still does not exceed
the preventive maintenance threshold $w_0$, the new inspection data
$x_{0k}$ and the parameter $\hat{\theta}_{0(k-1)}$ are integrated to
estimate the parameter $\hat{\theta}_{0k}$ recursively, and the
interval after each inspection can then be determined. Similarly, the
next inspection time meets the following condition:

$$t_{0(k+1)} = t_{0k} + \Delta t_{0k} \tag{19}$$


Furthermore, if the degradation data $x_{ik}$ exceeds the preventive
maintenance threshold $w_0$ at an inspection, this important component
is replaced. All the degradation data $X_{(i-1)(0:k)}$ of the
inspections after the (i-1)-th ($i \ge 1$) maintenance is then used to
estimate the parameter $\hat{\theta}_i$ adaptively, and
$\hat{\theta}_i$ is taken as the initial parameter of the model after
the i-th maintenance. Similarly, all the inspection data $X_{i(0:k)}$
is used to estimate the model parameters $\hat{\theta}_{ik}$
adaptively, so that the interval $\Delta t_{ik}$ can be determined.
The interval $\Delta t_{ik}$ satisfies the following condition:

$$\Phi\left(\frac{w - x_{ik} - \hat{\mu}_{ik} \Delta t_{ik}}{\sqrt{D_{ik} \Delta t_{ik}^2 + \sigma_{ik}^2 \Delta t_{ik}}}\right) - \exp\left(\frac{2\hat{\mu}_{ik} (w - x_{ik})}{\sigma_{ik}^2} + \frac{2 D_{ik} (w - x_{ik})^2}{\sigma_{ik}^4}\right) \Phi\left(-\frac{2 D_{ik} (w - x_{ik}) \Delta t_{ik} + \sigma_{ik}^2 (\hat{\mu}_{ik} \Delta t_{ik} + w - x_{ik})}{\sigma_{ik}^2 \sqrt{D_{ik} \Delta t_{ik}^2 + \sigma_{ik}^2 \Delta t_{ik}}}\right) \ge R_c \tag{20}$$

Similarly, $\Delta t_{ik}$ can be easily obtained by using MATLAB, so
that:

$$t_{i(k+1)} = t_{ik} + \Delta t_{ik} \tag{21}$$


5 NUMERICAL EXAMPLE 


In this section, a certain type of aluminum alloy material (Meeker
(1998)) commonly used in aviation products is taken as an example. The
quality of this material is generally evaluated by the length of its
fatigue crack. The initial crack length of the material is 0.9 inches,
the failure threshold is set to $w = 1.6$ inches, and the preventive
maintenance threshold is set to $w_0 = 1.58$ inches.


5.1 Evaluation of the real-time reliability model 


In the experiment, the approximate initial parameter
$\theta_0 = (a_0, D_0, Q_0, \sigma_0^2)^T = (6.5, 0.5, 0.5, 0.1)^T$
is first obtained by integrating the fatigue crack growth data of
products of the same type. In order to verify the effectiveness of the
model proposed in this paper, we select the crack length information
corresponding to time $t_{0k}$ as the degradation data. For each
inspection time, the degradation information $X_{0:k}$ before the
inspection time is taken as the historical degradation information, so
that the parameters of the degradation model can be updated by the
proposed method. The future crack length of the material is then
predicted step by step by the formula
$x_{k+1} = x_k + \hat{\mu}_k (t_{k+1} - t_k)$. Finally, the result is
compared with the real curve of fatigue crack growth for the aluminum
alloy. The comparison plot is shown in Figure 2.

Figure 3. PDF of predicted remaining life at seven different
inspection points.

According to the result of the experiment, the theoretical failure
time at which the crack length reaches the failure threshold is about
0.13343 million cycles, and the mean square error between the
predicted value and the true value is MSE = 6.5575e-05.
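The one-step prediction and the error measure used in this evaluation are simple enough to state in code. The following is a minimal sketch; the function names are hypothetical and any numbers passed to them are placeholders, not the paper's crack growth data.

```python
def predict_next(x_k, mu_hat_k, t_k, t_next):
    """One-step prediction x_{k+1} = x_k + mu_hat_k * (t_{k+1} - t_k)."""
    return x_k + mu_hat_k * (t_next - t_k)


def mean_squared_error(predicted, observed):
    """Mean of the squared differences between two equal-length sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)
```

In use, each prediction would be made with the drift estimate available at the previous inspection and then compared with the crack length actually measured at the next one.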

Figure 3 gives a PDF plot of the predicted remaining life at seven
different inspection points. The curves gradually shift left and
become sharper as data accumulate, which means that the uncertainty of
the remaining life prediction decreases as more data are used to
estimate the parameters of the model. According to Fig. 2 and Fig. 3,
the model proposed in this paper fits the real degradation path very
well.


5.2 Determination of sequential inspection intervals 


Table 1. The calculation results of the sequential inspection interval
($R_c = 0.9$).

The new product
Serial number               Initial state ($t_0 = 0$)   t01      t02      t03      t04      t05      t06
$t_{ik}$/millions of cycles   0.0000                    0.0429   0.0858   0.1171   0.1274   0.1309   0.1322
$x_{ik}$/inches               0.9000                    1.0647   1.2970   1.4851   1.5541   1.5792   1.5825
$a_{ik}$                      6.5000                    6.3122   6.2132   6.0897   5.9436   5.8073   5.7091
$D_{ik}$                      0.5000                    0.4647   0.4403   0.4070   0.3668   0.3302   0.3099
$Q_{ik}$                      0.5000                    4.2608   2.9779   2.0715   1.5555   3.9689   12.0634
$(\sigma^2)_{ik}$             0.1000                    0.2829   0.1780   0.1211   0.0809   0.0594   0.0575
$\Delta t_{ik}$/millions of cycles  0.0429              0.0429   0.0312   0.0103   0.0035   0.0013   -
Remarks: The degradation data exceeds the preventive maintenance
threshold 1.58 inches at the sixth inspection, which needs to be
replaced.

After one maintenance
Serial number               Replace as new ($t_1$)   t11      t12      t13      t14
$t_{ik}$/millions of cycles   0.1322                 0.2328   0.2575   0.2629   0.2645
$x_{ik}$/inches               0.9000                 1.3860   1.5417   1.5783   1.5885
$a_{ik}$                      5.7091                 5.6880   5.6158   5.5307   5.5032
$D_{ik}$                      0.3099                 0.3025   0.2685   0.2316   0.2211
$Q_{ik}$                      12.0634                1.4950   1.1045   4.4839   9.9541
$(\sigma^2)_{ik}$             0.0575                 0.1048   0.0659   0.0426   0.0357
$\Delta t_{ik}$/millions of cycles  0.1007           0.0246   0.0054   0.0016   -
Remarks: The degradation data exceeds the preventive maintenance
threshold at the fourth inspection, which needs to be replaced.

After two maintenances
Serial number               Replace as new ($t_2$)   t21      t22      t23
$t_{ik}$/millions of cycles   0.2645                 0.3725   0.3941   0.3960
$x_{ik}$/inches               0.9000                 1.4294   1.5744   1.5814
$a_{ik}$                      5.5032                 5.4906   5.4340   5.9370
$D_{ik}$                      0.2211                 0.2164   0.1867   0.1671
$Q_{ik}$                      9.9541                 0.8480   1.1483   10.9867
$(\sigma^2)_{ik}$             0.0357                 0.0609   0.0539   0.0386
$\Delta t_{ik}$/millions of cycles  0.1080           0.0216   0.0019   -
Remarks: The degradation data exceeds the preventive maintenance
threshold at the third inspection, which needs to be replaced.


The calculation process of the sequential inspection interval is as
follows. Firstly, $R_c$ is set to 0.9, and $t_{02} = 0.0858$ is
obtained using the method proposed in Section 4.1, so
$t_{01} = t_{02}/2 = 0.0429$. Secondly, for $i = 0$ and $k = 1, 2$,
$\hat{\theta}_{01} = (a_{01}, D_{01}, Q_{01}, \sigma_{01}^2)^T$ and
$\hat{\theta}_{02} = (a_{02}, D_{02}, Q_{02}, \sigma_{02}^2)^T$ are
estimated by the adaptive estimation method. Thirdly, the inspection
interval $\Delta t_{0k}$ is determined according to the real-time
reliability requirement, i.e. $t_{0(k+1)} = t_{0k} + \Delta t_{0k}$:
when $i = 0, k = 2$, $t_{03}$ is obtained; when $i = 0, k = 3$,
$t_{04}$ is obtained. If the crack length exceeds the preventive
maintenance threshold, the product is replaced immediately. In the
second operating phase, the last parameter $\hat{\theta}_{06}$ of the
first phase is taken as the initial parameter, the parameters are
recursively updated, and the time of each inspection and maintenance is
determined; the subsequent inspection times are obtained in the same
way. The inspection times $t_{ik}$ determined by the proposed
sequential method are shown in the second row of Table 1, and the third
row shows the degradation data $x_{ik}$ of each inspection. Finally,
the following results can be obtained from the numerical example:


1. If the aluminum alloy material is not inspected, the theoretical
   service life of the material is about 0.10837 million cycles when
   $R_c = 0.9$. If the sequential inspection method is used, however,
   the effective service life of the material is about 0.132 million
   cycles, which is close to the theoretical failure time of 0.13343
   million cycles. The service life of aviation products is thus
   extended effectively.

2. As can be seen from Table 1, the sequential inspection intervals
   are not equally spaced. The number of inspections is reduced from 6
   to 4 after one maintenance, and to 3 after two maintenances. As
   data accumulate, the number of inspections can be reduced further.
   Therefore, the sequential inspection method proposed in this paper
   can effectively reduce the number of inspections and improve the
   efficiency of inspection and maintenance while ensuring the
   reliability of aviation products.

3. The model parameters converge very quickly in the case of small
   sample data, and the prediction accuracy also meets the actual
   requirements.

4. If $R_c = 0.8$, the theoretical service life of the material is
   about 0.11482 million cycles. With the other conditions unchanged
   and the same calculation procedure as above, we obtain the
   following inspection times: the initial new product {0.0463,
   0.0926, 0.1230, 0.1304, 0.1322}; after one maintenance {0.2401,
   0.2605, 0.2656}; after two maintenances {0.3770, 0.3966}. According
   to the calculation results, the effective service life of the
   material is about 0.133 million cycles. Moreover, the number of
   inspections required for the new product is reduced from 6 to 5,
   and from 4 to 3 for the product after one maintenance. This shows
   that the sequential inspection times, the number of inspections and
   the effective service life are closely related to the requirement
   of real-time reliability.
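The tabulated schedule for the new product can be checked against the recursion $t_{i(k+1)} = t_{ik} + \Delta t_{ik}$ and against the replacement rule. The short sketch below does this with the first-phase values of Table 1; the helper names are hypothetical, and the small residuals it reports are attributed here to rounding in the table, which is an assumption.

```python
# First operating phase of Table 1 (new product), times in millions of cycles.
inspection_times = [0.0000, 0.0429, 0.0858, 0.1171, 0.1274, 0.1309, 0.1322]
intervals = [0.0429, 0.0429, 0.0312, 0.0103, 0.0035, 0.0013]
crack_lengths = [0.9000, 1.0647, 1.2970, 1.4851, 1.5541, 1.5792, 1.5825]
PM_THRESHOLD = 1.58  # preventive maintenance threshold w_0, in inches


def schedule_residuals(times, deltas):
    """Residuals of t_{k+1} = t_k + dt_k for each tabulated inspection."""
    return [times[k + 1] - (times[k] + deltas[k]) for k in range(len(deltas))]


def first_exceedance(lengths, threshold):
    """Index of the first inspection whose reading exceeds the threshold."""
    for k, x in enumerate(lengths):
        if x > threshold:
            return k
    return None
```

Here `first_exceedance(crack_lengths, PM_THRESHOLD)` identifies the inspection at which replacement is triggered, matching the remark in Table 1 that the threshold is exceeded at the sixth inspection.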


6 CONCLUSIONS 


In this paper, a sequential inspection method for aviation products
based on the Wiener process and a real-time reliability requirement
has been proposed. In the actual operation of aviation products, the
mechanism of performance degradation is complex and the degradation
process is random. The sequential inspection method adaptively updates
the relevant parameters of the degradation model in real time, so that
the time of each inspection and maintenance can be determined more
accurately. The method not only ensures the reliability of the
product, but also avoids the over-inspection or under-inspection that
may be caused by traditional equal-interval periodic inspection. The
sequential inspection method also requires fewer inspection data.
Moreover, the relevant parameters converge faster when the historical
information of similar products is integrated, and the model has good
robustness. The reliability requirement and the prediction accuracy of
the degradation information for aviation products are thereby ensured,
which makes the method well suited to determining the inspection and
maintenance intervals of new small-sample products.

The research in this paper assumes perfect maintenance of the product.
In some cases, the performance of the product as a whole is affected
by other non-replaceable parts and by environmental factors of the
system, so that the performance cannot be restored to as-new after
maintenance. Further study will therefore consider the inspection
method for the product under imperfect maintenance.


ABSTRACT: For repairable products sold with a two-dimensional
warranty, burn-in and Preventive Maintenance (PM) actions incur extra
costs, but both can be effective ways to reduce the number of warranty
claims and the servicing cost through early detection of defects and
reliability improvement. A longer burn-in or a higher burn-in usage
rate can remove more defects, but it accelerates product degradation
and incurs more wear-out failures, which in turn need to be alleviated
by PMs in the warranty period. This article proposes a new warranty
performance-based model to minimize the warranty claims and find the
optimal burn-in and PM decisions for products under two-dimensional
warranty. More specifically, the proposed model subsumes the
one-dimensional warranty as a special case, allows different failure
modes, i.e. defects and wear-out failures, and takes usage
heterogeneity into consideration. We find that it is always reasonable
to carry out a burn-in on products. If wear-out failures are more
(less) sensitive to product usage rate than defects, the burn-in
duration should be extended (increased with the upper limit of PM
degree); the optimal burn-in usage rate should be increased with the
upper limit of PM degree (conducted at conditions as harsh as
possible) and decrease with the upper limit of detecting degree; the
optimal PM should always be set at its upper limit.


1 INTRODUCTION


Burn-in is an important production process in which products are
operated under the actual working stress for a period. Weak items and
premature failures (e.g. from manufacturing and assembly errors) can
thereby be revealed, so that the general product reliability is
improved (Chan and Meeker, 1999; Blischke et al., 2011).

Products are sold with warranties. Currently, most burn-in models
focus on one-dimensional warranties. Leemis and Beneke (1990)
highlighted in their review paper that some researchers had
investigated relations between burn-in and warranty policies.
According to Nguyen and Murthy (1982), the optimal burn-in duration is
determined by the trade-off between the reduction in the warranty cost
and the increase in the cost of burn-in. Sheu and Chien (2005)
extended the above models by studying two types of failures.
Similarly, Kar and Nachlas (1997) and Wu et al. (2007) proposed models
to determine the optimal burn-in duration for products sold with
warranties. They considered different types of warranty policies.

After burn-in, preventive maintenance (PM) is an effective way to
reduce warranty claims. Unlike many warranty policies, which require
restoring a failed product without any schedule, PM is commonly
scheduled, aiming to control the wear-out degradation and reduce the
likelihood of failures. Shafiee et al. (2011, 2013) have paid
attention to the effects of PMs on the failure rate and the optimal
burn-in duration.

A warranty can be one-dimensional, meaning that only calendar time is
considered in determining the warranty period, or two-dimensional,
where usage is also evaluated. However, all the above burn-in models
are limited to products under one-dimensional warranty. Ye et al.
(2013) deal with burn-in under two-dimensional warranties, but their
work does not consider PM in the burn-in decisions.

Therefore, this paper focuses on the burn-in and PM decisions for
products under two-dimensional warranty. Firstly, since the optimal
burn-in decisions should be made by a trade-off between the increase
in wear-out failures and the decrease in defects during warranty, we
ask whether a burn-in should be applied to such products. If the
answer is yes, the next question is how to set the burn-in duration
and burn-in usage rate. Furthermore, we explore how PM influences the
burn-in decision.

The rest of this paper is organized as follows. 
The modeling assumptions are listed in Section 2. 
In Section 3, we investigate how to model failures 
under PM actions and introduce the warranty per- 
formance-based model to minimize the warranty 
claims and help make optimal decisions. Then 
numerical examples are presented to illustrate the 
applicability of the proposed model in Section 4. 
Conclusions and some extensions are given in 
Section 5. 


2 MODEL ASSUMPTIONS 


Firstly, all the item failures during the warranty 
coverage are statistically independent and mini- 
mally repaired. Each item is assumed to be repair- 
able and sold with a non-renewing free repair 
warranty policy. In addition, each failure will 
incur a warranty claim, and all warranty claims 
are valid. 

We assume that there are two types of failures, 
the wear-out failures and the defects. The wear- 
out failures, also known as normal failures, occur 
over time, while the defects can be revealed in 
the early period of lifetime. The two failures are 
independent. 

The effects of usage rate on both types of failures are modeled using
the Accelerated Failure Time (AFT) model (Baik and Murthy 2008;
Shahanaghi et al. 2013). We also assume that the failure rate
functions are convex, so that the failure rate curve of the products
is as close as possible to the bathtub curve.

After burn-in, the product is released to the 
market with a non-renewing free repair warranty 
(FRW). 

Like most related papers (Iskandar et al. 2005; Wang et al. 2015), we
confine the two-dimensional warranty region to a rectangle in which
the horizontal axis represents age and the vertical axis represents
usage. The warranty region is the rectangle
$[0, W_0) \times [0, U_0)$. The warranty expires when the item reaches
either the age or usage limit.
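Under the constant usage rate assumed for each customer, the age at which a rectangular warranty expires follows directly from the two limits. The helper below is a small illustrative sketch; the function name and the sample limits are assumptions, not values from the paper.

```python
def warranty_expiry_age(age_limit, usage_limit, usage_rate):
    """Age at which a rectangular two-dimensional warranty expires.

    With a constant usage rate r, accumulated usage at age t is r * t,
    so the usage limit is reached at age usage_limit / r and the
    warranty ends at whichever limit is hit first.
    """
    if usage_rate <= 0:
        raise ValueError("usage_rate must be positive")
    return min(age_limit, usage_limit / usage_rate)
```

For example, with an age limit of 3 years and a usage limit of 60000 units, a customer using 30000 units per year leaves the warranty after 2 years, while a customer using 10000 units per year is covered for the full 3 years.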

The time the manufacturer needs to carry out repairs or PM actions is
negligible and assumed to be zero. All PMs are assumed to be performed
at a fixed time interval t. In addition, defects can be detected
during PMs.


3 MODELING AND ANALYSIS


In this section, we first investigate how to model wear-out failures
and defects. Then a performance-based model is introduced to minimize
the warranty claims and help make optimal decisions.


3.1 Modeling of failures


The usage rate over time is constant for one customer,
but varies across customers. Let $R$ be the random usage
rate, and let $G(r)$ and $g(r)$ denote the cumulative
distribution function (CDF) and probability density
function (PDF) of $R$, respectively.
Under a two-dimensional warranty, we model wear-out
with a counting process characterized by an intensity
function dependent on both age and usage. We introduce
$v^w(t \mid r)$ as the corresponding virtual age under
usage rate $r$. Considering that the item is designed for
a certain nominal usage rate $r_0$, according to the AFT
model, if the usage time is $t$ and the specific usage
rate is $r$, then

$$v^w(t \mid r) = t \left(\frac{r}{r_0}\right)^{\gamma},$$

where $\gamma$ is the acceleration coefficient due to wear-out.

Conditional on a specific customer $R = r$, the
wear-out process is a non-homogeneous Poisson
process (NHPP) with a non-decreasing rate of
occurrence of failures $\lambda^w(v^w)$. Note that $v^w$ is a
function of $t$, so the actual wear-out failure rate of
an item for any customer with usage rate $r$ is
$\lambda^w(v^w(t \mid r))$ according to the AFT model.

The number of defects is modeled by a random
variable $M$ with a probability mass function $z(\cdot)$.
All defects are assumed to be independent and
identically distributed. For any specific defect, the
virtual age is

$$v^d(t \mid r) = t \left(\frac{r}{r_0}\right)^{\eta},$$

where $\eta$ is the acceleration coefficient due to defects,
and its time to first failure follows a CDF $F^d(\cdot)$.
Therefore, the number of defects can be described by a
binomial process with $M$ trials, each with success
probability given by $F^d(\cdot)$.
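
The AFT scaling described in this subsection can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the Weibull baseline rate is borrowed from the numerical example later in the paper.

```python
def virtual_age(t, r, r0, coeff):
    """AFT virtual age t * (r / r0) ** coeff (coeff is gamma or eta)."""
    return t * (r / r0) ** coeff


def weibull_rate(v, alpha, beta):
    """Assumed Weibull baseline failure rate evaluated at virtual age v."""
    return (beta / alpha) * (v / alpha) ** (beta - 1)


def wearout_rate(t, r, r0, gamma, alpha, beta):
    """Wear-out failure rate seen by a customer with usage rate r."""
    return weibull_rate(virtual_age(t, r, r0, gamma), alpha, beta)
```

At the nominal usage rate the virtual age equals the calendar age, and a heavier user sees a higher wear-out rate at the same calendar age whenever the baseline rate is increasing.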


3.2 Failures within burn-in 


The burn-in duration and usage rate must be bounded,
since a longer duration and a higher rate reduce defects
on the one hand but intensify wear-out on the other. We
use $t_u$ and $t_l$ as the upper and lower bounds of the
burn-in duration $t_b$, and $r_u$ and $r_l$ as the upper
and lower bounds of the burn-in usage rate $r_b$.
For an item arriving at the end of burn-in, we
let $v_b^w$ and $v_b^d$ be the virtual ages of the wear-out
failure process and of potential defects, respectively.
The time lag between the end of burn-in and the start of
operation is ignored. Therefore, after burn-in under usage
rate $r_b$ and duration $t_b$, we obtain

$$v_b^w = t_b \left(\frac{r_b}{r_0}\right)^{\gamma} \quad \text{and} \quad v_b^d = t_b \left(\frac{r_b}{r_0}\right)^{\eta}.$$

The expected number of wear-out failures within
burn-in is given by

$$E^w[N_b(t_b, r_b)] = \int_0^{t_b} \lambda^w\left(v^w(t \mid r_b)\right) dt.$$


For any specific defect, the probability that it is
detected within burn-in is $F^d(v_b^d)$. Conditional
on $M$, the number of defects that occur during burn-in
follows a binomial distribution, with mean value

$$E^d[N_b(t_b, r_b)] = E(M)\, F^d(v_b^d).$$


Therefore, the expected number of failures
within burn-in is given by

$$E[N_b(t_b, r_b)] = E^w[N_b(t_b, r_b)] + E^d[N_b(t_b, r_b)].$$
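
As a rough numerical sketch of the burn-in quantities above, the snippet below integrates an assumed Weibull wear-out rate over the burn-in period and adds the expected number of defects that surface. The Weibull and exponential forms, the use of $E(M)$ for the defect mean, and every name in the code are illustrative assumptions rather than the authors' implementation.

```python
import math


def integrate(f, a, b, n=2000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def expected_burn_in_failures(t_b, r_b, r0, gamma, eta,
                              alpha, beta, mean_defects, defect_rate):
    """Return (wear-out, defect, total) expected failures within burn-in."""
    a_w = (r_b / r0) ** gamma  # AFT scaling of wear-out age
    a_d = (r_b / r0) ** eta    # AFT scaling of defect age
    # Assumed Weibull wear-out rate evaluated at the virtual age t * a_w.
    wearout = integrate(
        lambda t: (beta / alpha) * (t * a_w / alpha) ** (beta - 1), 0.0, t_b)
    # Assumed exponential defect lifetime evaluated at virtual age t_b * a_d.
    defects = mean_defects * (1.0 - math.exp(-defect_rate * t_b * a_d))
    return wearout, defects, wearout + defects
```

With `r_b` equal to `r0` both scalings equal one, so the wear-out term reduces to the plain cumulative Weibull intensity, which gives a quick sanity check.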


3.3 Failures within warranty period under PM 


Different usage rates mean different usage patterns,
which cause warranties to terminate at either the age
limit or the usage limit. For any specific usage rate,
only two cases exist, as illustrated in Fig. 1.

From Fig. 1, for a specific usage rate $r$, define
$r_1 = U_0 / W_0$. If $r > r_1$, the warranty ends at
time $U_0 / r$. If $r \le r_1$, the warranty ends at
time $W_0$. Therefore, we need to investigate these two
cases.


Figure 1. Warranty period under PM.

Within the warranty coverage, although minimal
repair is free of charge, the customer still suffers
losses caused by failures. Besides, as the wear-out
failure rate of the product becomes higher, the
manufacturer tends to incur higher repair costs.
Therefore, PMs are needed both to improve product
reliability and to reduce repair costs.

We assume that periodic imperfect PMs are implemented
at a fixed interval $\tau$, and that their effectiveness
is described by a maintenance degree $\delta_w$ for
wear-out failures and a detection degree $\delta_d$ for
potential defects, both of which lie between 0 and 1.
$\delta_w = 1$ means that the wear-out failure rate of
the product is restored to its original level by a
perfect PM, so that the product is "as good as new",
while $\delta_w = 0$ means that no action is carried out.
Likewise, $\delta_d = 1$ means that any defect is found
during PM with probability 1, while $\delta_d = 0$ means
that PM actions cannot find any defects.


3.3.1 Wear-out failures 

For wear-out failures, there are at least two classes
of imperfect PM models in the literature; readers can
refer to Doyen and Gaudoin (2004). We use the class in
which PM actions rejuvenate the product by reducing its
virtual age (Doyen and Gaudoin, 2004). Let
$\lambda_k^w(t \mid r)$ be the conditional wear-out failure
rate of the item after the kth PM action, and $v_k^w$ be
the virtual age of the wear-out failure process after the
kth PM action.


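We already know that $v_0^w = v_b^w = t_b (r_b / r_0)^{\gamma}$, so the
virtual age of the wear-out failure process after the 1st
PM action is

$$v_1^w = \left[ v_0^w + \tau \left(\frac{r}{r_0}\right)^{\gamma} \right] (1 - \delta_w),$$

where $r$ is the specific usage rate of a certain customer.
Then we have

$$v_k^w = \left[ v_{k-1}^w + \tau \left(\frac{r}{r_0}\right)^{\gamma} \right] (1 - \delta_w).$$

When $\delta_w = 1$, $v_k^w$ is always returned to 0. When
$\delta_w \ne 1$, dividing both sides of the equation by
$(1 - \delta_w)^k$ gives

$$\frac{v_k^w}{(1 - \delta_w)^k} = \frac{v_{k-1}^w}{(1 - \delta_w)^{k-1}} + \tau \left(\frac{r}{r_0}\right)^{\gamma} (1 - \delta_w)^{1-k}.$$

Let $S_k = v_k^w / (1 - \delta_w)^k$; then
$S_k = S_{k-1} + \tau (r / r_0)^{\gamma} (1 - \delta_w)^{1-k}$.
Adding up these equations from $S_1$ to $S_k$ gives

$$S_k = S_0 + \tau \left(\frac{r}{r_0}\right)^{\gamma} \sum_{i=1}^{k} (1 - \delta_w)^{1-i}.$$

Substituting $S_k = v_k^w / (1 - \delta_w)^k$ back, we obtain
the general term of $v_k^w$ as follows,

$$v_k^w = (1 - \delta_w)^k \, t_b \left(\frac{r_b}{r_0}\right)^{\gamma} + \tau \left(\frac{r}{r_0}\right)^{\gamma} \sum_{i=1}^{k} (1 - \delta_w)^{i}.$$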


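Then the wear-out failure rate $\lambda_k^w(t \mid r)$
of the item after the kth PM action is

$$\lambda_k^w(t \mid r) = \lambda^w \left[ v_k^w + t \left(\frac{r}{r_0}\right)^{\gamma} \right].$$

Note that the argument of $\lambda^w$ is still a function of $t$.
Specifically, when $\delta_w = 0$, $\lambda_k^w(t \mid r)$ is given by

$$\lambda_k^w(t \mid r) = \lambda^w \left[ t_b \left(\frac{r_b}{r_0}\right)^{\gamma} + (k\tau + t) \left(\frac{r}{r_0}\right)^{\gamma} \right].$$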
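
The virtual-age recursion for imperfect PM can be checked numerically against its summed form. The sketch below is illustrative only; the function names are hypothetical, and `step` stands for the per-interval age increment $\tau (r/r_0)^{\gamma}$.

```python
def virtual_age_recursive(k, v0, step, delta):
    """Apply v_k = (v_{k-1} + step) * (1 - delta) k times, starting at v0."""
    v = v0
    for _ in range(k):
        v = (v + step) * (1.0 - delta)
    return v


def virtual_age_closed_form(k, v0, step, delta):
    """Summed form: (1-delta)^k * v0 + step * sum_{i=1..k} (1-delta)^i."""
    q = 1.0 - delta
    return q ** k * v0 + step * sum(q ** i for i in range(1, k + 1))
```

Both functions return `v0 + k * step` when `delta` is 0 (no rejuvenation) and 0 when `delta` is 1 and at least one PM has been performed, matching the two limiting cases described in the text.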


For the case $r > r_1$, the warranty for the product
ceases when the product usage reaches $U_0$, which
corresponds to the age $U_0 / r$. The number of PMs
performed within this warranty region is

$$n_{r > r_1} = \max \{ j \in \mathbb{N} : j \tau \le U_0 / r \}.$$

Then the expected number of wear-out failures during
this period is given by

$$E[N^w(t_b, r_b \mid r > r_1)] = \sum_{k=0}^{n_{r > r_1} - 1} \int_0^{\tau} \lambda_k^w(t \mid r) \, dt + \int_0^{U_0 / r - n_{r > r_1} \tau} \lambda_{n_{r > r_1}}^w(t \mid r) \, dt. \quad (3)$$

Here the first term is the expected number of wear-out
failures before the last PM within the warranty. The
second term is the expected number of wear-out failures
between the last PM and the end of the warranty. Since
the terms in later expressions have similar meanings, we
do not repeat the explanation.

For the case $r \le r_1$, the warranty for the product
terminates when the product age reaches $W_0$. Similarly,
the number of PM actions performed within this warranty
region is

$$n_{r \le r_1} = \max \{ j \in \mathbb{N} : j \tau \le W_0 \},$$

and the expected number of wear-out failures during this
period is given by

$$E[N^w(t_b, r_b \mid r \le r_1)] = \sum_{k=0}^{n_{r \le r_1} - 1} \int_0^{\tau} \lambda_k^w(t \mid r) \, dt + \int_0^{W_0 - n_{r \le r_1} \tau} \lambda_{n_{r \le r_1}}^w(t \mid r) \, dt. \quad (4)$$

We use $E[N^w(t_b, r_b \mid r)]$ to denote the expected
number of wear-out failures within the warranty period
of a specific customer, which means that
$E[N^w(t_b, r_b \mid r)]$ is either
$E[N^w(t_b, r_b \mid r > r_1)]$ or
$E[N^w(t_b, r_b \mid r \le r_1)]$ according to the usage rate $r$.

Also, we use $E(n \mid r)$ to denote the number
of PMs within the warranty region, which means that
$E(n \mid r)$ is either $n_{r > r_1}$ or $n_{r \le r_1}$
according to the usage rate $r$. These definitions will be
useful in the subsequent analysis.
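
The per-customer calculation for wear-out failures can be sketched as below. This is a hedged illustration rather than the authors' code: it assumes a Weibull cumulative intensity $(v/\alpha)^{\beta}$, so that each integral has a closed form, and all function and parameter names are hypothetical.

```python
import math


def pm_count(r, W0, U0, tau):
    """Number of PMs inside the warranty and the warranty end time."""
    horizon = min(W0, U0 / r)  # age limit or usage limit, whichever is first
    return int(math.floor(horizon / tau + 1e-12)), horizon


def expected_wearout_failures(r, v0, W0, U0, tau, r0, gamma,
                              delta_w, alpha, beta):
    """Expected wear-out failures in warranty for one usage rate r."""
    a = (r / r0) ** gamma                 # AFT scaling of calendar time
    n, horizon = pm_count(r, W0, U0, tau)

    def cum(v):
        return (v / alpha) ** beta        # assumed Weibull cumulative intensity

    total, v = 0.0, v0
    for k in range(n + 1):
        # Full PM intervals first, then the remainder after the last PM.
        length = tau if k < n else horizon - n * tau
        total += (cum(v + length * a) - cum(v)) / a
        v = (v + tau * a) * (1.0 - delta_w)  # imperfect-PM rejuvenation
    return total
```

With `delta_w = 0` the sum telescopes to the cumulative intensity over the whole warranty, and with `delta_w = 1` every interval starts from virtual age zero.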


3.3.2 Defects 

Consider a specific remaining defect. The virtual
age of this defect after burn-in is $t_b (r_b / r_0)^{\eta}$.
After burn-in, the remaining defects can either be
found during operation or detected by PM actions.
The distribution of the time to find this remaining
defect, conditional on a specific $r$ and before the 1st PM
action, is given by


Suppose that a total of $K$ defects remain after burn-in.
Given $M$, $K$ follows a binomial distribution,

$$K \mid M \sim \mathrm{bi}\left(M, 1 - F^d(v_b^d)\right).$$

Conditional on $K$, the number of remaining
defects before the 1st PM action follows the binomial
distribution $\mathrm{bi}(K, G_0^d(\tau \mid r))$. Then we can
deduce the probability that a remaining defect is
found between the end of burn-in and the 1st PM
action as follows,


renfe) 


V 1 (6) 
AEA 
h 
A defect may be detected during the 1st PM.
Using a deduction similar to that of $F_0^d(t \mid r)$, the
probability that a remaining defect is detected in the
1st PM with detection degree $\delta_d$ is given by,


F h, 


‘eau 


Ñ I 


mfi rhe) Gi (An), 
ep 


The remaining cases follow in the same manner. The
probabilities that a remaining defect is found between
the kth and (k + 1)th PM actions, and during the kth PM
action, respectively, are given by,


Fé (dr) = r(.(2) Hoe =] | 
-»(o(2) vee) | 


(7) 


(9) 


Therefore, considering the two warranty-region cases
in Fig. 1 and taking the expectation with respect to the
number of remaining defects $K$, we obtain the expected
numbers of remaining defects found during operation
and during PMs,


nran T 


È [elu 


E| Ni ( (tnlr >n) )]= Fi (dr) )] 


7 7 
+E(M) eha Mo | (10) 
Ny h 
ry ry 
F4 b ee g y" 
F g8 tiis (=| | (-5,) 
npe 7! 
E NS (tnlr <n)]= > [ £(M) Fi (ar) | 
k=0 
7 2 
+E(M) (2) emf) | (11) 
Ny h 
a 7 a 7 
h h 
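$$E[N^d_{PM}(t_b, r_b \mid r > r_1)] = \sum_{k=1}^{n_{r > r_1}} \left[ E(M) \, P_k^d(\tau \mid r) \right] \quad (12)$$

$$E[N^d_{PM}(t_b, r_b \mid r \le r_1)] = \sum_{k=1}^{n_{r \le r_1}} \left[ E(M) \, P_k^d(\tau \mid r) \right] \quad (13)$$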


We use $E[N^d(t_b, r_b \mid r)]$ to denote the expected
number of defects within the warranty period of any
specific customer, which means that $E[N^d(t_b, r_b \mid r)]$
is either $E[N^d(t_b, r_b \mid r > r_1)]$ or
$E[N^d(t_b, r_b \mid r \le r_1)]$ according to the usage rate $r$.
Also, we use $E[N^d_{PM}(t_b, r_b \mid r)]$ to denote the
expected number of defects found during PM
actions, which means that $E[N^d_{PM}(t_b, r_b \mid r)]$ is
either $E[N^d_{PM}(t_b, r_b \mid r > r_1)]$ or
$E[N^d_{PM}(t_b, r_b \mid r \le r_1)]$
according to the usage rate $r$.


3.4 Performance-based model 


The warranty performance-based model involves
simultaneously selecting the burn-in duration $t_b$ and
usage rate $r_b$ to minimize the expected failures
during the warranty period. For a customer with
specific usage rate $r$, the expected warranty claims
per unit are as follows,

$$E[N(t_b, r_b \mid r)] = E[N^w(t_b, r_b \mid r)] + E[N^d(t_b, r_b \mid r)] \quad (14)$$


As mentioned earlier, the usage rate is constant for
one customer but varies across customers. Since $R$ is
the random usage rate, with $G(r)$ and $g(r)$ its CDF and
PDF respectively, the final expected number of failures
is obtained by un-conditioning, namely by taking the
expectation of $E[N(t_b, r_b \mid r)]$ with respect to the
usage rate,

$$E[N(t_b, r_b)] = \int E[N(t_b, r_b \mid r)] \, dG(r) \quad (15)$$


Then the performance-based model is given by,

$$[t_b^*, r_b^*] = \arg\min E[N(t_b, r_b)]$$

$$\text{s.t.} \quad t_b \in [t_l, t_u], \quad r_b \in [r_l, r_u],$$

where $t_b^*$ and $r_b^*$ are the optimal burn-in duration
and burn-in usage rate.
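
The un-conditioning step and the bounded minimization can be sketched generically in Python. The helper names below are hypothetical, the usage rate is assumed to be uniformly distributed (as in the numerical example), and the objective passed to the grid search stands in for the expected warranty claims, whose full expression is not reproduced here.

```python
def expected_over_usage(claims_given_r, r_min, r_max, n=400):
    """Midpoint-rule mean of claims_given_r(r) for R ~ Uniform(r_min, r_max)."""
    h = (r_max - r_min) / n
    return sum(claims_given_r(r_min + (i + 0.5) * h) for i in range(n)) / n


def grid_search(objective, t_bounds, r_bounds, steps=20):
    """Return (best value, t_b, r_b) over a regular grid inside the bounds."""
    best = None
    for i in range(steps + 1):
        t_b = t_bounds[0] + (t_bounds[1] - t_bounds[0]) * i / steps
        for j in range(steps + 1):
            r_b = r_bounds[0] + (r_bounds[1] - r_bounds[0]) * j / steps
            value = objective(t_b, r_b)
            if best is None or value < best[0]:
                best = (value, t_b, r_b)
    return best
```

A grid search is used here because it needs no derivative information and respects the box constraints by construction; a finer grid or a local refinement could follow if higher precision is required.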

In this performance-based model, the larger the
maintenance degree $\delta_w$ and the detection degree
$\delta_d$ are, the smaller the expected number of failures
within the warranty period is, which means that if
they were decision variables, they would always be
set at their maximum. In practice, some PM actions are
paid for by customers, so manufacturers can ignore
the maintenance cost and only consider their own
limitations in implementing the PM actions. Therefore,
we do not treat the maintenance degree $\delta_w$ and the
detection degree $\delta_d$ as decision variables in this
performance-based model.

Proposition 1 Suppose that $[t_b^*, r_b^*]$ is the
optimal burn-in setting for the performance-based
model. When the acceleration coefficients of
wear-out failures and defects satisfy
$\gamma > \eta$, (i) the optimal burn-in duration $t_b^*$ is
always $t_u$, (ii) the optimal burn-in usage rate $r_b^*$ is
increasing with the maintenance degree $\delta_w$, and
(iii) decreasing with the detection degree $\delta_d$.

Proof of Proposition 1 Suppose that $[t_b^*, r_b^*]$
is the optimal burn-in setting for the performance-based
model. We use contradiction to prove that
$t_b^* = t_u$ if $\gamma > \eta$.

Suppose $t_b^* < t_u$. Because the virtual age
$t_b (r_b / r_0)^{\eta}$ is increasing in both $t_b$ and $r_b$,
we can always find $r_b'$ with $r_b' < r_b^*$ such that,

$$t_u \left(\frac{r_b'}{r_0}\right)^{\eta} = t_b^* \left(\frac{r_b^*}{r_0}\right)^{\eta}. \quad (16)$$

This means that the virtual age of defects is the
same under the settings $[t_b^*, r_b^*]$ and $[t_u, r_b']$.
Therefore, as $E[N^d(t_b, r_b)]$ is the expected
number of defects within warranty coverage, the
expected numbers are the same in these two cases, namely,

$$E[N^d(t_u, r_b')] = E[N^d(t_b^*, r_b^*)].$$

Based on (16), we have,

$$\frac{t_u}{t_b^*} = \left(\frac{r_b^*}{r_b'}\right)^{\eta}.$$

Because $\gamma > \eta$ and $r_b' < r_b^*$, we have,

$$t_u \left(\frac{r_b'}{r_0}\right)^{\gamma} < t_b^* \left(\frac{r_b^*}{r_0}\right)^{\gamma}.$$

Therefore, the virtual age of wear-out failures in the
case $[t_b^*, r_b^*]$ is larger than that in the case
$[t_u, r_b']$. As $E[N^w(t_b, r_b)]$ is the expected number
of wear-out failures during operation and the failure
rate is increasing, we have,

$$E[N^w(t_u, r_b')] < E[N^w(t_b^*, r_b^*)].$$

Because $E[N(t_b, r_b)] = E[N^w(t_b, r_b)] + E[N^d(t_b, r_b)]$,
it follows that

$$E[N(t_u, r_b')] < E[N(t_b^*, r_b^*)].$$

Therefore, the setting $[t_u, r_b']$ is better than
$[t_b^*, r_b^*]$, which contradicts our assumption. Hence
$t_b^* = t_u$. This completes the proof of (i).
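
The key step of part (i) can be checked numerically: choosing the usage rate that keeps the defect virtual age unchanged at the longer duration lowers the wear-out virtual age whenever $\gamma > \eta$. The function below is a hypothetical helper written for this check, and the numbers in the usage note are arbitrary illustrative values.

```python
def matched_rate(t_star, r_star, t_upper, eta):
    """Usage rate r' with t_upper * r'**eta == t_star * r_star**eta.

    The nominal rate r0 cancels from both sides of the matching condition.
    """
    return r_star * (t_star / t_upper) ** (1.0 / eta)


def burn_in_virtual_age(t_b, r_b, r0, coeff):
    """Virtual age at the end of burn-in for acceleration coefficient coeff."""
    return t_b * (r_b / r0) ** coeff
```

For example, with `gamma = 0.8`, `eta = 0.4`, `r0 = 2.0`, a candidate setting `(0.05, 4.0)` and upper bound `t_upper = 0.1`, the matched rate is below 4.0, the defect virtual ages coincide, and the wear-out virtual age at the upper bound is smaller.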

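Now we fix $t_b$ at the constant $t_u$ and give the
proof of (ii).

When $r_b = r_b^*$, the first-order optimality condition
gives,

$$\frac{dE[N^w(t_u, r_b)]}{dr_b} = -\frac{dE[N^d(t_u, r_b)]}{dr_b}.$$

Since $\delta_w$ appears in $E[N^w(t_u, r_b)]$, we let

$$Z^w(\delta_w, r_b) = \frac{dE[N^w(t_u, r_b)]}{dr_b} \quad \text{and} \quad Z^d(\delta_w, r_b) = -\frac{dE[N^d(t_u, r_b)]}{dr_b}.$$

Suppose that when $\delta_w = \delta'$, the optimal burn-in
usage rate is $r_b = r_b'$, and that when $\delta_w = \delta''$
it is $r_b''$. We use contradiction to prove (ii). When
$\delta' < \delta''$, suppose $r_b' > r_b''$.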

Lemma 1 If $f(x, y)$ is convex in $x$ for each
$y \in A$, and $\varphi(y) \ge 0$ for each $y \in A$, then the
function $\psi$ defined as,

$$\psi(x) = \int_A \varphi(y) f(x, y) \, dy$$

is convex in $x$ (Boyd and Vandenberghe 2004).
Based on (3), the term $\sum_{k} \int_0^{\tau} \lambda_k^w(t \mid r) \, dt$
conforms to Lemma 1, since $\lambda_k^w(t \mid r)$ is a convex
function by assumption. Therefore, with $t_b = t_u$ held
constant, $\sum_{k} \int_0^{\tau} \lambda_k^w(t \mid r) \, dt$ is a
convex function of $r_b$. Then, based on (14) and (15), the term

$$\int \left[ \int_0^{U_0 / r - n_{r > r_1} \tau} \lambda_{n_{r > r_1}}^w(t \mid r) \, dt \right] g(r) \, dr$$

also conforms to Lemma 1, since $g(r) \ge 0$ and the inner
integral is a convex function of $r_b$. Therefore, the whole
term is convex in $r_b$. The other terms can all be deduced
in this way.
Therefore, we can find that 

az" (in ae] Nv" (pon dS 

aer al Ey aa Seay, ay 
o% ony On, 


means for specific 5”, Z"(5",1) < Z" (8r) and 


ZF helo not eee 
Note that ô also exists in Z“(6,r,), we 
azd (62) 

can easily achieve >0, then we have 
Z! (8,1) < Z (8,1), ic ð <6”. In addition, 
for the expected numbers of wear-out failures 
E| N” (t,.%) | within warranty coverage, when we 
set 4,={, as a constant, obviously for achieving 
the same expected numbers of wear-out failures 
within warranty coverage, the deeper the mainte- 


nance degree dis, the larger the burn-in usage rate 
is, which conforms to the property of strictly sub- 
modular function. Therefore, we can easily find 
that en) = eel 
ematical deduction of this term is very similar with 
the proof of (i), so we omit it. 

Then we can find that for specific r’, 
ZY (8, i) >Z” (8, i) and Z’ (8, A < Z" (Er), 
if ð <ó”. Since rf is the optimal value, we have 
Zi (8,1) =Z" (8,1), then, 


<0. The rigorous math- 


af 


Zi (5”, 1) SZ (8r) =Z" (8r) >Z" (P 


) 

Therefore, based on Z" (5",1/) < Z" (6,7) and 

Z! (8,1) 2 Z4(5",1,), we have, 

Zí (8, 1) >Z4 (o”, r) >Z (8r) 
=Z” (8, r) >Z" (o”, Ares (o”, 


b 


"i 


7) 

It means Z’ (8r) is strictly bigger than 
ZY (0.7), which obviously contradicts the prin- 
ciple that Z4(5”,7/) should be equal to Z” (6”,7), 
if 7,” is the optimal value. Therefore, r” cannot be 
the optimal value of 7, if 1/27," Then we have 
the conclusion of (ii). 

The proof of (iii) follows in the same way as the
proof of (ii).

This completes the proof of Proposition 1.

Proposition 2 Suppose that $[t_b^*, r_b^*]$ is the
optimal burn-in setting for the performance-based
model. When the acceleration coefficients of wear-out
failures and defects satisfy $\gamma < \eta$, then
(i) the optimal burn-in usage rate $r_b^*$ is always $r_u$,
(ii) the optimal burn-in duration $t_b^*$ is increasing
with the maintenance degree $\delta_w$, and
(iii) decreasing with the detection degree $\delta_d$.

Proposition 2 can be proved in a way similar to
Proposition 1. These propositions indicate that, for a
product whose wear-out failures are more sensitive to
the usage rate than its defects, the burn-in duration
should be as long as possible. From the proof of
Proposition 1, this is because when the burn-in duration
increases while the virtual age of defect-related failures
is kept the same under the two burn-in settings, the
virtual age of wear-out failures decreases, which leads to
fewer failures during warranty. When the burn-in duration
is fixed at its upper limit, the optimal burn-in usage rate
increases with the maintenance degree, meaning that
manufacturers who implement deeper maintenance should set
harsher burn-in conditions. Meanwhile, if they are able
to detect more potential defects during PMs, they
should set milder burn-in conditions. Conversely,
if defects are more sensitive to the usage rate than
wear-out failures, the burn-in should be conducted
under conditions as harsh as possible, and the optimal
burn-in duration increases with the maintenance
degree and decreases with the detection degree.
These two converse propositions cover all the possible
cases, so that manufacturers can easily select
burn-in decisions according to the features of any
specific product or component. In addition, based
on these propositions, Proposition 3 below can
be easily established:

Proposition 3 In the performance-based model,
implementing burn-in is always better than
implementing no burn-in.


4 NUMERICAL EXAMPLES 


In this section, numerical examples are presented
to illustrate the applicability of the proposed
model.

For the performance-based model, we use an
example to show how to reduce warranty claims
by choosing the optimal burn-in duration $t_b$
and usage rate $r_b$.

Suppose that the product under consideration
is an automobile component covered by a
two-dimensional FRW. The following settings are
adopted in our performance-based model.

We calculate the expected failures within a
rectangular warranty region of 3 years and
100,000 km of usage. Therefore, we let
$W_0 = 3$ (years) and $U_0 = 10$ ($10^4$ km).

We assume that the nominal usage rate is
$r_0 = 2 \times 10^4$ km/y. The customer usage rate
follows uniform(0.4, 4.2), which means the lowest
usage rate is $0.4 \times 10^4$ km/y and the highest usage
rate is $4.2 \times 10^4$ km/y.
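
As a small worked illustration of these settings, the snippet below computes when the warranty ends for a given usage rate and which share of customers reaches the usage limit before the age limit. It assumes the uniform(0.4, 4.2) usage-rate distribution stated above, in units of $10^4$ km/y; the function names are hypothetical.

```python
def warranty_end(r, W0=3.0, U0=10.0):
    """Age at which the warranty expires for usage rate r."""
    return min(W0, U0 / r)


def share_hitting_usage_limit(r_min=0.4, r_max=4.2, W0=3.0, U0=10.0):
    """P(R > U0 / W0) for R ~ Uniform(r_min, r_max)."""
    threshold = U0 / W0
    if threshold >= r_max:
        return 0.0
    return (r_max - max(threshold, r_min)) / (r_max - r_min)
```

Under these assumptions a customer at the nominal rate of 2 keeps the full 3 years of coverage, while a customer at rate 4 exhausts the usage limit after 2.5 years.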

Wear-out failures follow a Weibull distribution
with increasing failure rate, which is
$\lambda^w(t) = \frac{\beta_w}{\alpha_w} \left(\frac{t}{\alpha_w}\right)^{\beta_w - 1}$,
where $\beta_w > 2$ (as we assume the failure rate functions
of the two types of failures are convex). The number of
defects $M$ follows Poisson(4). The failure time of each
defect follows exp(6).

Other parameters for this model are given in
Table 1.

Under different maintenance degrees and detection
degrees of PM actions, we perform a grid search
to minimize the objective of the performance-based
model in the numerical examples.


Figure 1a. Optimal usage rate under different PM and detect degree.
Figure 1b. Optimal burn-in time under different PM and detect degree.
Figure 1c. Minimal warranty claims under different PM and detect degree.


$\gamma = 0.4$, $\eta = 0.8$


Figure 2a. Optimal usage rate under different PM and detect degree.
From Fig. 1(a, b, c), we can see that if $\gamma > \eta$,
then the optimal burn-in duration is always set to
the upper bound $t_u$, while the optimal burn-in usage
rate increases from $r_l$ to $r_u$ with the degree of PM
actions and decreases with the detection degree.
Conversely, Fig. 2(a, b, c) shows that if $\gamma < \eta$, then the
optimal burn-in usage rate stays at the upper bound
$r_u$, while the optimal burn-in duration increases from
$t_l$ to $t_u$ with the degree of PMs and decreases with
the detection degree. These curves conform to
Propositions 1 and 2. In addition, we also find that as
the maintenance degree or the detection degree increases,
the minimal warranty claims in all examples decrease,
verifying that the larger the maintenance degree or the
detection degree is, the better the warranty contract
performs. Therefore, manufacturers only have to consider
their own limited capabilities to implement PM actions.


Figure 2b. Optimal burn-in time under different PM and detect degree.
Figure 2c. Minimal warranty claims under different PM and detect degree.


5 CONCLUSION


In this article, we have studied optimal burn-in
strategies under PM actions for products with
two-dimensional warranties.

To help manufacturers improve warranty contract
performance, we have investigated a performance-based
model that supports optimal decisions on burn-in
under PMs. In our model, the burn-in duration and
burn-in usage rate are the decision variables, chosen
to minimize the expected warranty claims within the
warranty coverage. Numerical examples demonstrate the
validity of the model. These findings could help
manufacturers improve warranty performance in
burn-in and warranty management.

Evaluation method of maintenance operation space based
on virtual reality


Pengyan Liu, Dong Zhou, Ziyue Guo, Juan Wu & Yuan Li 
Beihang University, Beijing, P.R. China 


ABSTRACT: To ensure excellent maintainability of a product, the operation space of the
maintenance personnel must be fully considered at design time. In the traditional evaluation
method, experts give qualitative evaluation results and suggestions based on the virtual
maintenance simulation animation and the digital prototype, compared against the rules of
maintainability design. The evaluation result usually depends on the level of the experts and
is subjective. There is still a lack of widely used quantitative evaluation methods. This paper
analyzes the main postures of maintenance personnel in maintenance activities, evaluates their
comfort level, and then evaluates the operation space. First, through an analysis of the
characteristics of maintenance activities, the main factors of human comfort evaluation are
identified. Based on the artificial potential field theory commonly used in robot obstacle
avoidance path planning, a reachability potential field that describes the upper limb posture
of maintenance personnel is established and a criterion for evaluating upper limb comfort is
proposed. Then, using a virtual maintenance simulation platform, the comfort of the maintenance
staff's hand posture is analyzed through a comprehensive analysis of tool rotation angle,
operation space ratio and hand swept volume. After that, the two parts are combined to
comprehensively evaluate the comfort of the human body posture and give the evaluation
criteria, and a new operation space evaluation method for the key steps in the maintenance
operation process is established. Finally, the maintenance and disassembly process of an
engine is taken as an example to verify the validity of this method.


1 INTRODUCTION

With the rapid development of industrial technology and the ever-increasing demands of
manufacturers on machining accuracy, product maintenance activities have become more and
more complicated. At present, most design manufacturers have realized that ensuring high
reliability and maintainability at the same time guarantees maximum availability throughout
the product life cycle and achieves the greatest economic benefits (Sobral & Guedes, 2016).
The quality of maintainability determines the procedures and duration of maintenance
activities. In addition, excellent maintainability reduces maintenance errors and avoids
safety accidents. The maintenance reachability of the product directly affects the workload
and the posture of maintenance activities. Poor maintenance reachability leads to fatigue
for service personnel, increases the probability of occupational disease and further
increases the risk of maintenance errors. Stader, for example, confirmed this by conducting
interviews and ergonomic analyses of general aircraft maintenance tasks (Stader, 2013).

Since the maintenance reachability of a product is an inherent quality characteristic of the
product itself, it is important to improve it starting from the design phase. The
maintenance reachability of the product describes how easily the systems, equipment and
parts of the machine can be seen, touched, inspected, adjusted, disassembled or otherwise
maintained when maintenance work is carried out (Krause & Jager, 1988). Maintenance
reachability includes three evaluation indexes: sight line reachability, physical
reachability and operation space (Zeng, 2007).

This paper mainly studies the evaluation method of the maintenance operation space, which is
the actual space available to the maintenance personnel for repairing the faulty unit of the
product. At the design stage, enough maintenance operation space must be reserved to avoid
collisions in the maintenance process and to ensure the convenience, safety and comfort of
maintenance personnel (Zhou et al, 2011). Designing a suitable maintenance operation space
enables the maintenance personnel to observe and maintain the equipment more conveniently.
Even if a certain working posture has to be held for a long time, the fatigue and discomfort
of maintenance personnel can be reduced as much as possible and their safety, health and
comfort protected, so as to improve the overall maintenance efficiency and reduce the
probability of maintenance errors caused by fatigue. At present, the judgement of
maintenance operation space in the design stage is mainly based on expert experience and the
visual effect of virtual maintenance animation (Peng, 2010). Evaluation results usually rely
on the level of the experts and are strongly subjective. There is still a lack of a widely
used quantitative assessment method. Therefore, this paper presents a calculation method for
the quantitative evaluation of maintenance operation space during the design phase.

The rest of this paper is organized as follows:
Section 2 proposes the concept of the reachability
potential field. Section 3, combined with the
analysis of hand activities, puts forward the
evaluation criteria for maintenance space.
Section 4 is a case study. Section 5 concludes.


2 REACHABILITY POTENTIAL FIELD

2.1 The proposed process of reachability potential field

In 1986, Khatib first proposed the artificial potential
field method (Weerakoon, 2015) and applied it
in the field of robot obstacle avoidance. The basic
idea of this method is to construct a repulsive
potential field around the obstacle and a
gravitational potential field around the target
point, similar to the electromagnetic field in physics,
as shown in Figure 1. The controlled object is
subject to repulsion and gravitation in the composite
field composed of the foregoing two kinds of potential
fields. The combined forces of repulsion and
gravitation direct the movement of the controlled
object and search for a collision-free obstacle
avoidance path. In the maintenance process, maintenance
personnel in a narrow operation space bypass the
obstacles and reach the location of the maintenance
object. The process is similar to robotic obstacle
avoidance, so the concept of the artificial potential
field is introduced into the evaluation system of
maintenance operation space, and the concept of the
reachability potential field is proposed.

Figure 1. [panel labels: obstacle, object]


2.2 Gravitational potential field 


Since the maintenance staff mainly rely on the
upper limb to complete the maintenance action,
the gravitational potential field is defined to depend
mainly on the distance between the maintenance staff's
joint and the object to be repaired. The greater the
distance, the greater the potential energy value of the
joint; the smaller the distance, the smaller the potential
energy value of the joint. The function of the
gravitational potential field is:

$$U_g(q) = \begin{cases} \frac{1}{2} \eta \, \rho^2(q, q_g), & 0 \le \rho(q, q_g) \le \rho_0 \\ 0, & \rho(q, q_g) > \rho_0 \end{cases} \quad (1)$$

where $\eta$ is the proportional gain coefficient and
$\rho(q, q_g)$ is a vector whose magnitude is the Euclidean
distance $|q - q_g|$ between the joint position $q$ and the
position $q_g$ of the object to be repaired, directed from
the position of the joint to the position of the object to
be repaired. When maintenance personnel are far away from
the object to be repaired, it makes no sense to consider
the comfort of the operation space during maintenance;
one can directly consider whether the maintenance object
can be reached. To simplify the formula, this paper only
considers the situation in which the distance between the
maintenance staff and the object to be repaired is small.
As the maintenance action is mainly completed by the human
upper limb, define $\rho_0$ as the maximum length from the
serviceman's shoulder joint to the tip of the
ipsilateral middle finger, as shown in Figure 2.

Define that when the joint of the maintenance personnel
is exactly one arm's length away from the object to be
repaired, the reachability gravitational potential
of the joint is 1; and when the joint of the maintenance
personnel completely wraps the object to be
repaired, the reachability gravitational potential of
the joint is 0. That is, when $|q - q_g| = \rho_0$, $U_g(q) = 1$;
and when $|q - q_g| = 0$, $U_g(q) = 0$.

From this, $\eta = 2 / \rho_0^2$ can be obtained.
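
A minimal Python sketch of the gravitational (attractive) reachability potential follows. It assumes the gain $\eta = 2/\rho_0^2$ implied by the boundary conditions $U_g = 1$ at full arm's length and $U_g = 0$ at zero distance; the function name and the coordinate-tuple inputs are hypothetical choices made for illustration.

```python
import math


def attractive_potential(joint, target, rho0):
    """Reachability gravitational potential of a joint toward the target.

    joint and target are coordinate tuples; rho0 is the maximum reach
    (shoulder joint to middle fingertip). Returns 0 beyond reach, as in
    the piecewise definition.
    """
    rho = math.dist(joint, target)  # Euclidean distance |q - q_g|
    if rho > rho0:
        return 0.0
    eta = 2.0 / rho0 ** 2           # gain that makes the potential 1 at rho0
    return 0.5 * eta * rho ** 2
```

With this gain the potential simplifies to the squared distance ratio, so a joint at half the maximum reach has a potential of 0.25.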


2 


Figure 2. Shoulder joint to the ipsilateral middle finger 
fingertips maximum length. 


2.3 Repulsion potential field 


The analysis and verification are meaningful only
when the maintenance operation space is limited.
Therefore, obstructions are assumed to always exist in
the vicinity of the object to be repaired.
The factor that determines the repulsive force field
of an obstacle is the distance between a certain joint
of the maintenance staff and the obstacle. When
the joint is beyond the influence range of the
obstacle, its potential energy value is zero. After the
joint enters the influence range of the obstacle, the
greater the distance between the two, the smaller
the potential energy value of the joint, and
the smaller the distance, the greater the potential
energy value of the joint. The repulsive
potential field function is:


i a -+), 0< A(44,)< p, 
U.(a)=32 \ olaa) 2 


0, A 4.4,) =A 
(2) 


where $k$ is a positive proportional coefficient.
$\rho(q, q_o)$ is a vector whose magnitude is the distance
between the joint and the obstacle, pointing in the
direction from the obstacle to the joint. $\rho_0$ is a
constant that represents the maximum distance at which
the obstacle affects the joint. Maintenance activities
are affected only when the obstacle is within reach of
the upper limbs of the maintenance personnel. Therefore,
the maximum influence distance of the obstacle is defined
as the maximum length $\rho_0$ from the maintenance staff's
shoulder to the tip of the ipsilateral middle finger.

Define that when the maintenance personnel just
enter the edge of the obstacle's range, the joint
repulsion potential energy is 0, and when the joint
completely wraps the obstacle, the joint repulsion
potential energy is 1. That is, when $|q - q_o| = \rho_0$,
$U_r(q) = 0$; and when $|q - q_o| = 0$, $U_r(q) = 1$.

k =2,, can be obtained. 


3 MAINTENANCE POSTURE COMFORT EVALUATION


3.1 Upper limb comfort evaluation 


The human upper limb mainly includes four parts:
the knuckles, wrist, elbow and shoulder joints.
Knuckle and wrist joint comfort is discussed
in Section 3.2. This part mainly analyzes the
reachability potential energy and relative comfort of the
human elbow and shoulder joints.

Upper limb comfort is mainly determined by two factors: how far the upper limb is stretched, and whether there are obstacles around it. The stretch of the upper limb is mainly determined by the magnitude of the gravitational potential field, and the impact of obstacles by the magnitude of the repulsive potential field. Since each obstacle affects the maintenance activity independently, with no counteraction between obstacles, it makes no sense to superimpose their repulsive potential fields. In the calculation, engineers only need to examine the distance from the joint to the most obstructive obstacle, namely the one producing the maximum repulsive potential field.

Let $\omega$ be the evaluation value of upper limb comfort obtained from the theory of the reachability potential field:

$$\omega = \omega_1 \times \omega_2 \tag{3}$$


The reachability gravitational potential experienced by the shoulder joint measures the stretch of the human upper limb during maintenance activities. According to ergonomics (Xie & Huang, 2009), people tend to feel tired when they bend over. With the upper body upright, the optimal range and the normal range of human arm activity are 0.59 and 0.78 times the maximum range, respectively.

When $\rho(q, q_g) \le 0.59\rho_0$, that is, $0 < U_{att}(q) \le 0.35$, the stretching posture of the upper limb is the most comfortable, and $\omega_1 = 1$.

When $0.59\rho_0 < \rho(q, q_g) \le 0.78\rho_0$, that is, $0.35 < U_{att}(q) \le 0.61$, the extension posture of the upper limb is normal, and $\omega_1 = 0.61$.

When $0.78\rho_0 < \rho(q, q_g) \le \rho_0$, that is, $0.61 < U_{att}(q) \le 1$, the posture of the upper limb is the worst, and $\omega_1 = 0.35$.

The reachability gravitational potential energy of the elbow joint and the repulsive potential energy caused by the nearest obstacle are compared to measure the influence of the obstacles around the upper limb during maintenance activities.

When the maintenance posture is determined, the reachability gravitational potential energy of the elbow is

$$U_{att}(q) = \left( \frac{\rho(q, q_g)}{\rho_0} \right)^{2} \tag{4}$$

where $\rho(q, q_g)$ is the longest distance from the elbow to the fingertip in this maintenance posture.

The reachability repulsive potential energy of the elbow is

$$U_{rep}(q) = \left( \frac{\rho_0}{\rho(q, q_o)} - 1 \right)^{2} \tag{5}$$

where $\rho(q, q_o)$ is the shortest distance between the elbow joint and the nearest obstacle in this maintenance posture.

When $U_{att}(q) > U_{rep}(q)$, the elbow joint is considered to be subject to a larger gravitational potential field and a smaller repulsive potential field, and the maintenance posture is comfortable. Define $\omega_2 = 1$.

When $U_{att}(q) < U_{rep}(q)$, the elbow joint is considered to be subject to a small gravitational potential field and a large repulsive potential field, that is, it is obstructed by obstacles during the maintenance activities. The maintenance posture is uncomfortable. Define $\omega_2 = 0.5$.

The evaluation index of the upper limb comfort of the maintenance personnel can then be obtained from Equation 3.
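The upper limb evaluation described above can be sketched in a few lines. This is a minimal illustration, assuming $U_{att} = (\rho(q, q_g)/\rho_0)^2$ and $U_{rep} = (\rho_0/\rho(q, q_o) - 1)^2$, the range limits 0.59 and 0.78 for $\omega_1$, and treating the undefined tie $U_{att} = U_{rep}$ as uncomfortable; all function names are hypothetical.

```python
def attractive_potential(rho_goal: float, rho_0: float) -> float:
    """Assumed form of the gravitational (attractive) potential."""
    return (rho_goal / rho_0) ** 2


def repulsive_potential(rho_obs: float, rho_0: float) -> float:
    """Assumed form of the repulsive potential; zero outside the range rho_0."""
    if rho_obs > rho_0:
        return 0.0
    return (rho_0 / rho_obs - 1.0) ** 2


def omega_1(rho_goal: float, rho_0: float) -> float:
    """Stretch score from the ratio of reach distance to maximum reach."""
    ratio = rho_goal / rho_0
    if ratio <= 0.59:
        return 1.0   # optimal range
    if ratio <= 0.78:
        return 0.61  # normal range
    return 0.35      # worst range


def omega_2(u_att: float, u_rep: float) -> float:
    """Obstacle score; a tie is treated as uncomfortable (an assumption)."""
    return 1.0 if u_att > u_rep else 0.5


def upper_limb_comfort(rho_goal: float, rho_obs: float, rho_0: float) -> float:
    u_att = attractive_potential(rho_goal, rho_0)
    u_rep = repulsive_potential(rho_obs, rho_0)
    return omega_1(rho_goal, rho_0) * omega_2(u_att, u_rep)


# Distances in mm for the three nuts of the case study.
for rho_goal, rho_obs in [(569.0, 280.0), (604.0, 263.0), (676.0, 255.0)]:
    print(upper_limb_comfort(rho_goal, rho_obs, 733.0))
```

For the three nuts of the case study, the sketch gives the products of the $\omega_1$ and $\omega_2$ values listed in Table 3.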


3.2 Hand swept comfort index


Since most maintenance operations are completed by the hands of the maintenance staff, this part of the method focuses on hand comfort. Maintenance operations are divided into three basic maintenance activity units (MAUs): screw, twist and translate (Zhou et al, 2011). The specific actions are shown in Figure 3, Figure 4 and Figure 5. Screw is the hand action of tightening or loosening a screw with a screwdriver. Twist is the hand action of installing or removing a nut with a wrench. Translate is the parallel movement of the hand without any posture change.

Figure 3. Screw sweep volume.

Figure 4. Twist sweep volume.

Figure 5. Translate sweep volume.

DELMIA (Digital Enterprise Lean Manufacturing Interactive Application) is a digital manufacturing application made by the French company Dassault Systèmes. In the product design stage, designers can use it to create maintenance simulation animations, perform virtual maintenance and verify the maintainability of the product. With DELMIA, we can establish the swept volume for each maintenance activity unit. There are two types of swept volume: the free swept volume ($V_{fsv}$) and the constraint swept volume ($V_{csv}$).

Under completely unconstrained conditions, the free swept volume is the maximum volume swept by the hand when the maintenance task is performed in a comfortable posture. The maximum distance or maximum angle through which the human hand can move comfortably can be obtained from ergonomic data. The constraint swept volume is the volume swept by the actual hand movements of the maintenance personnel in the virtual maintenance animation, subject to the constraints of the machine parts; it can be obtained with DELMIA.

The constraint swept volume is less than or equal to the free swept volume. According to ergonomic data, the maximum wrist rotation angle when screwing is 180°, so the volume swept by the wrist over a 180° screwing motion is defined as the screw free swept volume. When the hand is unconstrained, the twist angle is generally 120°, so the volume swept over a 120° twist is defined as the twist free swept volume.

The sweep comfort index $P_{sv}$ is defined in Equation 6. Quantitative evaluation criteria for the operation space based on $P_{sv}$ are established from ergonomic data (Zhou et al, 2011). The evaluation criteria for the different maintenance activity units are shown in Table 1.

$$P_{sv} = V_{csv} / V_{fsv} \tag{6}$$
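The sweep comfort index and its rating can be illustrated with a short sketch. It assumes the index is the ratio of the constraint swept volume to the free swept volume and uses the thresholds of Table 1; the dictionary layout and function names are illustrative only.

```python
# (good_above, bad_below) thresholds per maintenance activity unit, from Table 1.
CRITERIA = {
    "screw": (0.8, 0.5),
    "twist": (0.75, 0.25),
    "translate": (0.9, 0.7),
}


def sweep_comfort_index(v_csv: float, v_fsv: float) -> float:
    """Ratio of constraint swept volume to free swept volume."""
    if v_fsv <= 0:
        raise ValueError("free swept volume must be positive")
    return v_csv / v_fsv


def rate_hand_comfort(unit: str, p: float) -> str:
    """Map a sweep comfort index to Good / Normal / Bad for one activity unit."""
    good_above, bad_below = CRITERIA[unit]
    if p > good_above:
        return "Good"
    if p < bad_below:
        return "Bad"
    return "Normal"


# A twist limited to half of the free swept volume.
p = sweep_comfort_index(0.576, 1.152)
print(p, rate_hand_comfort("twist", p))
```

Values that fall exactly on a threshold are rated Normal here, following the closed ranges given in Table 1.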


3.3 The total comfort evaluation


The hand movements of the maintenance staff are important in themselves, and the comfort of the upper limbs also significantly affects overall comfort. Furthermore, a complete set of maintenance actions is a combination of a series of operations. Therefore, in order to combine the comfort of the upper limb with the comfort of the hand, the total comfort is defined as

$$s = \sum_{i} \omega_i P_i \tag{7}$$

where $s$ is the total maintenance operation space score, the sum runs over the maintenance activity units that make up the set of maintenance actions, $\omega_i$ is the upper limb comfort score for action $i$, and $P_i$ is the hand swept volume ratio for that action. The higher the value of $s$, the higher the comfort level.


3.4 Evaluation criteria 


Suppose a set of maintenance actions consists of x screwing actions, y twisting actions and z translation actions. According to the criteria in Table 1, the maintenance operation space involved in the whole set of actions is defined as good when the evaluation of every maintenance activity unit is good, and as bad when the evaluation of every maintenance activity unit is bad. When the evaluations of the maintenance activity units are a mix of good, normal and bad, the operation space is defined as normal. The resulting evaluation criteria for the operation space are shown in Table 2.

Table 1. The evaluation criteria of hand.

Maintenance activity unit    Evaluation    Criteria $P_{sv}$
Screw                        Good          > 0.8
                             Normal        0.5-0.8
                             Bad           < 0.5
Twist                        Good          > 0.75
                             Normal        0.25-0.75
                             Bad           < 0.25
Translate                    Good          > 0.9
                             Normal        0.7-0.9
                             Bad           < 0.7

Table 2. The evaluation criteria of operation space.

Evaluation    Criteria s
Good          s ≥ 0.8x + 0.75y + 0.9z
Normal        0.5x + 0.25y + 0.7z ≤ s < 0.8x + 0.75y + 0.9z
Bad           s < 0.5x + 0.25y + 0.7z
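The criteria of Table 2 amount to comparing the score with two thresholds that depend on the number of actions of each type. The sketch below is one possible reading, assuming the total score is the sum over actions of the upper limb score times the hand swept volume ratio; function and argument names are hypothetical.

```python
def total_score(actions):
    """Sum of omega_i * P_i over (omega_i, P_i) pairs (assumed form of the score)."""
    return sum(omega * p for omega, p in actions)


def operation_space_level(s: float, x: int, y: int, z: int) -> str:
    """Rate a score s for x screw, y twist and z translate actions (Table 2)."""
    good_from = 0.8 * x + 0.75 * y + 0.9 * z
    bad_below = 0.5 * x + 0.25 * y + 0.7 * z
    if s >= good_from:
        return "Good"
    if s < bad_below:
        return "Bad"
    return "Normal"


# A single twist action with upper limb score 0.305 and swept volume ratio 1.
s = total_score([(0.305, 1.0)])
print(s, operation_space_level(s, x=0, y=1, z=0))
```

For a single twist action the thresholds reduce to 0.75 and 0.25, so a score of 0.305 is rated Normal and a score of 0.175 is rated Bad.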


3.5 The scope of application 


The operation space determines the physical posture of the maintenance staff. By evaluating the comfort of that posture, the design of the operation space can be evaluated quantitatively. The evaluation is not based on the entire maintenance process, but on its key actions. Throughout the maintenance process, visual inspection and qualitative analysis by engineers should confirm that most operating postures are comfortable. Quantitative analysis, which can help improve the design, is only necessary for the key steps about which engineers have doubts.


4 APPLICATION EXAMPLES 


This case study quantitatively evaluates the maintenance operation space for disassembling the hexagonal nuts of the APU of a certain type of passenger aircraft. The purpose is to apply the proposed method and demonstrate its flexibility and effectiveness. A screenshot of the APU virtual maintenance animation is shown in Figure 6. There are six orange hex nuts on the APU. Nuts 1, 2 and 3 are taken as typical cases; their positions are shown in Figure 7.

When the No. 1 nut is disassembled, the right elbow is closest to the No. 4 thick round pipe. When the No. 2 and No. 3 nuts are disassembled, the right elbow joint is closest to the No. 5 thin round pipe. There is a large space around nuts 1 and 2, where maintenance personnel can reach a 120° hand rotation. The space around the No. 3 nut is relatively narrow, and maintenance personnel can only reach a 60° hand rotation. Applying Equation 4, Equation 5 and Equation 7 gives the results in Table 3.

Figure 6. Virtual maintenance process screenshots.

Figure 7. APU nut position diagram.

Table 3. Case study results.

                          No. 1 nut    No. 2 nut    No. 3 nut
$V_{csv}$                 1.152        1.152        0.576
$V_{fsv}$                 1.152        1.152        1.152
$P_{sv}$                  1            1            0.5
$\rho_0$ (mm)             733          733          733
$\rho(q, q_g)$ (mm)       569          604          676
$\rho(q, q_o)$ (mm)       280          263          255
$U_{att}(q)$              0.6025826    0.6789940    0.8505218
$U_{rep}(q)$              2.6174617    3.193627     3.513787
$\omega_1$                0.61         0.35         0.35
$\omega_2$                0.5          0.5          0.5
$s$                       0.305        0.175        0.175
Level                     Normal       Bad          Bad


5 CONCLUSION 


When the comfort of the operation space is evaluated during the design phase, the traditional method is to use DELMIA, Jack or other simulation software to produce a maintenance simulation animation, and to derive evaluation results and suggestions for improvement from the animation. This paper presents a new method for the quantitative evaluation of the maintenance operation space in Section 2 and Section 3. Maintenance activities are mainly concentrated in the human upper limbs, so the evaluation of the operation space should take both hand movements and arm movements into account. Drawing on the concept of the artificial potential field used in research on robot obstacle avoidance, the concept of reachability potential energy is proposed and used to analyze and evaluate the upper limb posture of maintenance personnel during maintenance activities. This is combined with earlier work that uses the swept volume to evaluate the hand operation space, giving an overall evaluation score for the maintenance posture. The score is then compared with the evaluation criteria proposed in Section 3.4. Finally, whether the operation space is acceptable is decided by assessing the comfort of the maintenance posture.

In the evaluation of human upper limb comfort, this article considers only the single nearest obstacle. In subsequent studies, the effects of multiple obstacles acting simultaneously on the upper limbs can be considered.

ABSTRACT: For a park (system of systems) consisting of a set of identical systems, the mission of each time unit is shared by all the surviving systems, which may be overexploited to achieve the global park objective. This overexploitation stresses each individual system and accelerates its degradation, which increases the probability that the system fails before the next planned maintenance. Alternatively, a system can be subject to operational constraints, such as reduced exploitation because of excessive degradation, and such a constraint could affect the overall objective. In this study, we analyze the problem of allocating maintenance resources on a park of n identical systems to ensure a given production goal over two successive maintenance periods. Each system degrades due to cumulative load and can be totally or partially renewed only during planned maintenance. We propose a simulation-based model to assess the profit of the whole park over a given time horizon for different maintenance allocation policies under the assumptions described above.


1 INTRODUCTION

For a park consisting of n identical systems, referred to as a "system of systems" in this study, the load of each mission is shared by the surviving systems. During the mission phase, random system failures keep changing the load carried by the surviving systems (Park 2010, Ye et al. 2014, Zhang et al. 2017). Meanwhile, the planning and completion of maintenance are affected by the random system failures and by realistic conditions such as the maintenance resources, the maintenance windows and the production plan (Gundegjerde et al. 2015, Abdollahzadeh et al. 2016). The study of system degradation has been extended from two-state to multi-state models (Peng et al. 2017, Dao & Zuo 2017b, Dao & Zuo 2017a), in which the performance levels of each system correspond to different stages of system degradation. For a system of systems, the mission load is usually achieved by redistributing the remaining load over the surviving systems, while the total cost of maintaining the system state and performance should be as low as possible. Taking an offshore wind farm as an example, the electricity production of the whole wind farm should be in accordance with the contract (Hawker & McMillan 2015). The manager of the wind farm has to control and distribute the production demand according to the contract and the actual state of each wind turbine. The maintenance resources of the wind farm, such as the maintenance time, the maintenance team and spare parts, are limited by economic concerns and the offshore environment, which makes the maintenance policy of an offshore farm more complex (Martin et al. 2016). Hence, how to balance the control strategy and the maintenance policy is a common challenge for systems of systems (Irawan et al. 2017, Santos et al. 2015).

Ye et al. (2014) proposed a load-sharing industrial system where the operator allocates load to balance the degradation condition of all parallel components in order to achieve system performance, and where a simple replacement policy is applied according to the cumulative work load. Zhu et al. (2011) studied cost-based selective maintenance decision-making on a machine line, selecting the optimal machine group under a limited maintenance duration. Dao & Zuo (2017b) addressed a selective maintenance problem for multi-state systems where each component can be in one of multiple working levels and several maintenance actions are optional. The dependency of the system is represented by multiple hierarchical levels and dependence groups. Further, Dao and Zuo (2017a) studied the selective maintenance problem for multi-state series systems operating under variable loading conditions in the next mission, with the aim of maximizing the expected system reliability in the next mission within the available resources. Santos et al. (2015) presented a simulation method to study how variations in failure and repair models, vessel logistic times, weather windows and waiting times affect wind turbine performance, and hence to identify the factors which most influence that performance.

Based on industrial practice in offshore wind farms and on maintenance practices for complex systems, the main goal of the present paper is to focus on a specific case of system of systems and to propose simulation-based optimal maintenance resource allocation rules in the context of an offshore wind farm. Control rules based on the system condition and on the remaining production of the mission are also included in this study. Under different assumptions for the assessment model, we conduct a series of numerical simulations based on the degradation model and on the control and maintenance decision rules, using the proposed algorithms. Our study contributes the following: (1) The study is based on the framework of "system of systems" under random working conditions. (2) The control rules which regulate the system degradation and production rate are considered. (3) The maintenance crew transfer time from one site to another is introduced, which is a practical issue for offshore wind farms. The remainder of this paper is organized as follows: Section 2 describes the degradation model of a single wind turbine based on cumulative load and the performance of the wind farm; the assumptions of the maintenance policy are introduced as well. Section 3 describes the production decision rule and integrates the maintenance decision, considering both the production and the maintenance resource allocation. The simulation algorithm and numerical results are given in Section 4. The contributions, limits and future work of this paper are discussed in the conclusion.


2 PERFORMANCE MODEL 


The objective of this section is to present the performance model of the wind farm with n Wind Turbines (WTs) from period to period. The performance is defined by an average cumulated energy objective $P_c$ to be produced over a period T. Each wind turbine contributes to reaching this production objective and is subject to degradation. The degradation model for one WT is presented in the next paragraph.


2.1 Performance model for one wind turbine 


We assume that a WT is subject to a continuous random cumulative degradation $X(t)$ from 0 to a failure threshold $x_F$. $X(t)$ is a function of the cumulative load of the WT. The instantaneous load at time t is a function of the rotation speed of the blades given a wind speed, $w_t$, and a WT rotation control parameter, $\rho_t$. Rotation can be controlled by both the pitch control and a brake. Braking should increase degradation. Finally, the degradation rate is given by:

$$\lim_{dt \to 0} \frac{X(t+dt) - X(t)}{dt} = \frac{\alpha(\rho_t, w_t)}{\beta} = \frac{\alpha_1 \rho_t w_t + \alpha_2 (1 - \rho_t)}{\beta} \tag{1}$$


Note that the blade rotation speed is assumed to be directly proportional to the wind speed, $\omega = a\rho \cdot w$, and that the degradation rate is constant if $w_t = w$ and $\rho_t = \rho$. Another constant can be introduced directly to model degradation in idle states, specifically those due to a lack of wind.

Under such assumptions, the degradation over a time period $(0, t_k)$ can be modeled by a non-stationary Gamma process whose degradation increments over $(t_i, t_{i+1})$, $i \in \{1, \dots, k\}$, are gamma distributed. Some assumptions on $\alpha_1$ and $\alpha_2$ can be made to ensure certain empirical degradation properties.
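A minimal sketch of such a piecewise Gamma degradation path is given below. It assumes that, over an interval with constant wind $w$ and control $\rho$, the increment is gamma distributed with shape $(\alpha_1 \rho w + \alpha_2 (1 - \rho)) \cdot \Delta t$ and rate $\beta$ (so scale $1/\beta$); this parametrization, the default values and the function names are assumptions made for illustration.

```python
import random


def shape_rate(rho: float, w: float, alpha_1: float = 0.02, alpha_2: float = 0.006) -> float:
    """Assumed shape of the gamma increment per unit of time."""
    return alpha_1 * rho * w + alpha_2 * (1.0 - rho)


def simulate_degradation(intervals, beta: float = 1.25, x_0: float = 0.0, rng=None):
    """Simulate cumulative degradation at the end of each interval.

    intervals: list of (duration, rho, w) with constant conditions inside each.
    Returns the degradation path, starting with x_0.
    """
    rng = rng or random.Random()
    x = x_0
    path = [x]
    for duration, rho, w in intervals:
        shape = shape_rate(rho, w) * duration
        if shape > 0:
            # random.gammavariate takes (shape, scale); scale = 1 / rate.
            x += rng.gammavariate(shape, 1.0 / beta)
        path.append(x)
    return path


# Five intervals of 10 time units at full rotation and wind speed 2 (illustrative).
print(simulate_degradation([(10.0, 1.0, 2.0)] * 5, rng=random.Random(0)))
```

Because gamma increments are non-negative, the simulated path is non-decreasing, which is the property that makes the Gamma process a common choice for cumulative wear.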

The failure time $t_F = \min\{t > 0 \mid X_t \ge x_F\}$ leads to an immediate stop of the WT.

We assume a constant production rate of one WT per unit of time, given by $f_p(\omega_t) = f_p(\rho_t, w_t)$, where $\omega_t$ is the instantaneous rotation speed as a function of $\rho_t$ and the wind $w_t$. The production rate of a WT has to be re-evaluated at each change time, that is, at each wind transition or change in the value of $\rho$.

2.2 Maintenance assumptions 


A periodic maintenance policy is implemented every period T. No maintenance is allowed outside this maintenance period (block replacement model). The maintenance consists in visiting a certain number of WTs and renewing them as much as possible. Other types of maintenance, such as minimal maintenance, are not considered in this model. The total maintenance capacity is assumed to be fixed, so not all of the failed or degraded WTs can be restored to a good state. The maintenance capacity is related to a limited number of maintenance resources and is defined by a given maintenance duration $D_M$ and a given renewal efficiency rate per unit of time $e_m$. Hence, the maximum allowed degradation reduction is $D_M \cdot e_m$. Moreover, a maintenance crew transfer time from one WT to another is introduced. This crew transfer time is assumed to be constant and is denoted $d_{ij}$. Finally, if $n_i$ of the n WTs are visited during a maintenance period, the maximum maintenance capacity is $(D_M - n_i d_{ij}) \cdot e_m$.
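The trade-off between visiting more WTs and keeping repair capacity can be made concrete with a small sketch. The default values below are the maintenance data used later in the numerical experiments (duration 10, efficiency 4, transfer time 1); clamping the capacity at zero and the function name are illustrative assumptions.

```python
def maintenance_capacity(n_visited: int, d_m: float = 10.0, e_m: float = 4.0,
                         d_transfer: float = 1.0) -> float:
    """Maximum degradation that can be removed when n_visited WTs are visited.

    Each visit costs one crew transfer time; the remaining duration is
    converted to degradation points at the efficiency rate e_m.
    """
    available_time = d_m - n_visited * d_transfer
    return max(available_time, 0.0) * e_m  # clamp: capacity cannot be negative


for n in range(1, 6):
    print(n, maintenance_capacity(n))
```

Each additional visit removes one transfer time from the repair budget, so the capacity decreases linearly with the number of visited WTs.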


Figure 1. Sketch of the wind farm over a period.


2.3 Illustration of the performance model 


Figure 1 sketches the evolution of the production rate, $p_t$, and the WT degradation, $X(t)$, given the wind behavior, $w_t$, and a constant rotation control parameter $\rho$ over the whole period T. Here, the wind is modelled by a Continuous Time Markov Chain. Such a discrete model is motivated by two observations. First, wind speed forecasts are not very precise for a middle-term time horizon, and only average values are necessary. Second, estimating the precise effect of the wind on degradation is a challenge, so this effect should be estimated for average wind levels. Note that the associated transition rates and probabilities can be functions of the seasons. For the sake of simplicity, no such variations are considered in this paper. When the wind increases, both the production rate $p_t$, which is the sum over the functioning WTs, and the WT degradation rates increase. When a WT fails and stops, the others continue to produce, but the production rate of the farm decreases until the next maintenance. In this example, the 3 WTs are repaired and the sum of the repair levels is therefore $\Delta Y = (D_M - 3 d_{ij}) \cdot e_m$. All of the times $t_i$ marked in the figure define a change in the wind farm environment.
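A wind model of this kind can be sketched as a small continuous-time Markov chain simulator. The wind levels and transition rates below are purely hypothetical placeholders, since the paper does not list them here; the function name is also an illustrative choice.

```python
import random


def simulate_wind_ctmc(levels, rates, horizon, start=0, rng=None):
    """Simulate a wind trajectory as a list of (change_time, wind_speed).

    levels: wind speed attached to each state.
    rates[i][j]: transition rate from state i to state j (i != j).
    """
    rng = rng or random.Random()
    t, state = 0.0, start
    trajectory = [(0.0, levels[state])]
    while True:
        targets = [(j, r) for j, r in enumerate(rates[state]) if j != state and r > 0]
        total_rate = sum(r for _, r in targets)
        if total_rate == 0:
            break  # absorbing state: no further wind change
        t += rng.expovariate(total_rate)  # exponential sojourn time
        if t >= horizon:
            break
        state = rng.choices([j for j, _ in targets],
                            weights=[r for _, r in targets])[0]
        trajectory.append((t, levels[state]))
    return trajectory


# Hypothetical three-level wind model over a period of 200 time units.
levels = [0.0, 2.0, 4.0]
rates = [[0.0, 0.05, 0.0],
         [0.04, 0.0, 0.04],
         [0.0, 0.05, 0.0]]
print(simulate_wind_ctmc(levels, rates, horizon=200.0, rng=random.Random(42)))
```

Each entry of the returned trajectory is one of the change times at which production and degradation rates would need to be re-evaluated.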


3 DECISION PROCESS 


The decision rules are twofold: production and maintenance, which should be combined to ensure the production objective.


3.1 Production decision rule 


We focus here on the production decision rules, which can be captured in the blade rotation speed control parameter $\rho$ (controlled by the pitch). At time t, these decision rules are a function of the production objective $P_c$ over the time period T, the cumulative production at time t, $t < T$, $P(t)$, the current wind speed $w_t$, the wind forecasts over the end of the time period, $Ew(t)$, the current state vector of the WTs, $X(t) = (X_1(t), X_2(t), \dots, X_n(t))$, and finally the forecast of the production from t to T, $EP(t; X(t), w_t, Ew(t), \rho_t)$.

For the sake of simplicity, the decision on $\rho$ at time t is defined so as to satisfy the production objective given a constant forecast, i.e. without considering the future state of the other WTs on the farm. Because no additional information arrives between two successive changes, the $\rho_t$ decision is considered constant between them, and the decision is updated only when a change occurs. If a WT fails, the production rate of all the remaining WTs has to be increased, if possible, to ensure the same yield, and so $\rho_t$ should be greater.

Let $n_F$ be the number of failed WTs. We can approximate the expected energy production of the $n - n_F$ surviving WTs, given the $\rho_t$ variable, the wind forecasts and the vector $x(t) = (x_1(t), \dots, x_n(t))$ of the known WT states, by:


n 


EP(t, X(t), W, Ew(t), 2) = Èl, (exp) X 


f Ew (2) 
Pets), Eo.) a eB) 


T-t 


where $\bar{F}(\cdot \mid x, w_t)$ is the conditional survival distribution and $f_p(\cdot, \cdot)$ the WT production rate function per unit of time. Note that $\rho_t$ is the same for all surviving WTs and is considered unchanged until the end of the period. Then, if the production criterion is $P_c$, $\rho_t$ should be chosen to ensure the missing production $P_c - P(t)$, and thus $\rho_t$ is the solution of:

$$EP(t; X(t), w_t, Ew(t), \rho_t) = P_c - P(t) \tag{3}$$

If there is no solution, then $\rho_t = 1$, which ensures the maximum potential production.
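The production decision rule can be sketched as a one-dimensional root search. The example below is a simplified illustration: it assumes the expected production is non-decreasing in $\rho$ and, for the usage example, replaces the expected production by a linear function that ignores the survival probabilities; the function names and numerical values are hypothetical.

```python
def solve_rotation_control(expected_production, missing_production, tol=1e-6):
    """Smallest rho in [0, 1] whose expected production covers the missing production.

    expected_production: callable rho -> expected energy until the end of the
    period, assumed non-decreasing in rho.
    Returns 1.0 when even full rotation cannot reach the objective.
    """
    if expected_production(1.0) < missing_production:
        return 1.0  # no solution: produce as much as possible
    lo, hi = 0.0, 1.0
    while hi - lo > tol:  # bisection on a monotone function
        mid = 0.5 * (lo + hi)
        if expected_production(mid) >= missing_production:
            hi = mid
        else:
            lo = mid
    return hi


# Simplified linear forecast: 4 surviving WTs, yield factor 10, mean wind 2,
# 100 time units left (survival probabilities are ignored in this sketch).
def linear_forecast(rho):
    return 4 * 10.0 * rho * 2.0 * 100.0


print(round(solve_rotation_control(linear_forecast, 6000.0), 4))
print(solve_rotation_control(linear_forecast, 9000.0))
```

Bisection is used only because it needs nothing more than monotonicity; with a linear production function the solution could of course be written in closed form.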


3.2 The decision variable as a function 
of the maintenance 


$\rho_t$ should be defined to ensure the production objectives of both the current and the future periods. If the maintenance resources are limited, it will not be possible to restore all the WTs to a perfect state. Denote by $X_M = D_M \cdot e_m$ the maximum cumulative degradation level the maintenance crew can remove in a fixed maintenance period. Some maintenance allocation rules are discussed later in the paper.

Let $X_{\min}$ be the minimum amount of degradation of the whole set of WTs that ensures the $P_c$ production over a complete period. In this paper, we define the farm state $Y(t) = \sum_{i=1}^{n} x_i(t)$. At time t, $\rho_t$ should be updated to respect:

$$\Pr\big(Y(\tau) < X_M + X_{\min} \mid x(t), Ew(t), \rho_t\big) < \varepsilon_m \tag{4}$$


where $\varepsilon_m$ is a decision parameter which represents the risk aversion of the decision-maker with respect to missing the production objective in a period. Numerical experiments should be conducted to highlight the influence of this parameter as a function of the maintenance capacity $X_M$. Note that if there is no solution, the current farm state $Y(t)$ is already greater than $X_M + X_{\min}$. Several decision alternatives can then be designed:

1. Increase the maintenance resources, $X_M$. Penalty costs should be added. This could be done with extra time (which shortens the next production period) or extra resources.

2. Shorten the time to the next maintenance so that the maintenance specification is respected at the next maintenance time.

3. Evaluate this maintenance specification at any time rather than only at operational changes.

4. Define a more flexible control rule for each WT as a function of its individual degradation state. This would be closer to the load repartition problem.


3.3 Maintenance allocation decision rules 


At the end of the production cycle, the overall farm state is $Y(t)$. A maintenance is specified by:

• $D_M$, its duration;

• $d_{ij}$, the time for a crew to move from one WT to another;

• $e_m$, the efficiency of the maintenance per unit of time, i.e. the number of degradation points the maintenance can rehabilitate in one time unit (this degradation point is a function of the production point);

• costs and penalty costs, which could be integrated.


If $\delta_i$ denotes the indicator of visiting WT $i$ and $t_i$ the maintenance time spent on it, the maintenance problem is then defined as the following optimization problem:

$$\max_{(\delta_i, t_i)} \sum_{i=1}^{n} (\delta_i + t_i) \tag{5}$$

subject to the following constraints:

$$e_m t_i \le x_i(\tau), \quad \forall i \in \{1, \dots, n\} \tag{6}$$

$$\sum_{i=1}^{n} \big(x_i(\tau) - e_m t_i\big) \le X_{\min}$$

$$\sum_{i=1}^{n} (\delta_i d_{ij} + t_i) \le D_M + d_{ij}$$

This optimization problem is a mixed-integer linear problem and is not treated in this paper. Two different allocation rules are discussed in the next section.


4 NUMERICAL EXPERIMENTS 


In this section, we compare two maintenance resource allocation strategies. The main question is whether the maintenance should favor the number of WTs in operation or a limited number of operating WTs in a better state. Before presenting and analyzing the results, we briefly introduce the simulation model, with its additional assumptions, and the data.


4.1 Simulation model and data 


We analyze the performance of the model with a Monte Carlo approach that simulates the production of the wind farm over a period T. To ensure that the simulation reaches steady state, a long-term time horizon with thousands of T periods is considered, and average quantities are then analyzed to draw conclusions. For the first period of this long-term horizon, all of the n = 5 wind turbines are new, $x(0) = (0, 0, 0, 0, 0)$.

The decision framework requires the definition of $P_c$, the production objective for the wind farm over T. In the wind turbine industry, a WT can be considered available approximately 70% of its life cycle. We define $P_c$ as 0.7 times the maximum expected total production of n operating WTs over the whole production cycle, given an average wind speed, $Ew(0)$, without any rotation control, $\rho = 1$. We have:

$$P_c = 0.7 \cdot n \cdot \left[ f_p\big(1, Ew(0)/T\big) \cdot T \right] \tag{7}$$


The evaluation of $X_{\min}$, the minimum farm state at the beginning of the period for ensuring the production with a given probability $\varepsilon_m$, is the solution of the following equation:

$$\Pr\big(P_T(X_{\min}) \ge P_c\big) > \varepsilon_m \tag{8}$$

where $P_T(X_{\min})$ is the expected production over T, which is defined as a Gamma-distributed variable. The shape function of this Gamma variable is a function of the wind forecast, W, and an average rotation control level $\rho_m$ over the period. In this paper, the wind is modeled by a CTMC and the forecast can be evaluated directly. $\rho_m$ could be optimized from the simulation model; here, we consider $\rho_m$ as a decision parameter.

The simulation algorithm is iterative from 0 to T and can be summarized by the following steps, at each changing time $t \in (0, T)$:


1. Simulate the current wind condition (speed and next change time) and evaluate the forecast over the remaining time to the end of the period;

2. Find $\rho_t$, the minimum of the solutions of Equations (3) and (4);

3. Simulate the WT degradation. For each of the WTs in operation:

a. Simulate the remaining time to failure as a function of $x(t)$, the current degradation level, the wind estimates and $\rho_t$;

b. Identify the change time, which is the minimum of the wind change time, the remaining times to failure and the end of the period;

c. Simulate the corresponding degradation for every WT in operation.


The data used in our numerical experiments are fixed and presented in Table 1.

Recall that $\rho_m$ and $\varepsilon_m$ are two decision parameters. Their relevance is analyzed in the next paragraphs. Given these data, the production objective $P_c$ equals 14312 units and the Mean Time to Failure, in units of time, for each WT without any maintenance is:

$$MTTF = \int_0^{\infty} \Pr\big(X_i(t) < x_F\big)\, dt = 487 \tag{9}$$


4.2 Policy 1: Maximization of the degradation 
rehabilitation levels 


Maximizing the degradation rehabilitation levels is here equivalent to restricting the crew moves between WTs. The allocation rule for Policy 1 is:


Table 1. Data for the numerical experiments.

General                  $T = 200$          WT yield factor $r_p = 10$
Degradation              $\alpha_1 = 0.02$  $\alpha_2 = 0.006$  $\beta = 1.25$  $x_F = 20$
Maintenance operation    $D_M = 10$         $e_m = 4$           $d_{ij} = 1$
Constraints              Production: $P_c = 0.7\,P_0$           Maintenance: $\varepsilon_m$

Algorithm 1 Policy 1
Require: $x(\tau) = (x_1(\tau), \dots, x_n(\tau))$; $D_M$; $e_m$; $d_{ij}$
Sort the WTs from the most to the least degraded;
$t_r \leftarrow 0$
while $t_r < D_M$ do
    Renew the most degraded remaining WT to the best possible condition
    Update the repair time $t_r$ (maintenance + transport)
end while
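One possible runnable reading of Algorithm 1 is sketched below. It assumes that one crew transfer time is charged for every visited WT (consistent with the capacity expression of Section 2.2), that repairing one degradation point takes $1/e_m$ time units, and that WTs with no degradation are not visited; the function name and the sample state are illustrative.

```python
def policy_1(x, d_m, e_m, d_transfer):
    """Greedy allocation: renew the most degraded WTs first, as far as time allows.

    x: degradation levels before maintenance.
    Returns the degradation levels after maintenance (same ordering as x).
    """
    x_after = list(x)
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=True)
    t_r = 0.0  # elapsed repair time (maintenance + transport)
    for i in order:
        if x_after[i] <= 0:
            break  # remaining WTs are as good as new
        t_r += d_transfer  # assumption: one transfer per visited WT
        remaining = d_m - t_r
        if remaining <= 0:
            break  # no time left to repair after travelling
        repair_time = min(x_after[i] / e_m, remaining)
        x_after[i] -= repair_time * e_m
        t_r += repair_time
    return x_after


# Five WTs, one failed at the threshold 20; duration 10, efficiency 4, transfer 1.
print(policy_1([20.0, 12.0, 5.0, 3.0, 0.0], d_m=10.0, e_m=4.0, d_transfer=1.0))
```

In this example the two most degraded WTs are fully renewed and the maintenance duration is exhausted before a third WT can be reached, which illustrates how Policy 1 concentrates the effort on few WTs.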


Estimates of different performance indicators over 1000 runs are presented in Table 2: $P(\tau)$, the energy produced in a cycle; $Y_\tau$, the wind farm level after the maintenance; $\bar{\rho}$, the mean of the rotation control over a cycle; and $\rho == \rho_m$, the percentage of time in a cycle during which $\rho$ is defined by the maintenance constraint.

These results are obtained for different values of the decision parameters $\rho_m$ and $\varepsilon_m$. From these experiments, we can see that the $\rho_m$ parameter has little effect on the proposed indicators. This can be partially explained by the choice of linear functions for $\omega(\cdot, \cdot)$ and $f_p(\cdot, \cdot)$. Relaxing the maintenance constraint $\varepsilon_m$ leads to some intuitive results: more energy is produced on average, because the rotation is mainly controlled by the production objective, but the wind farm is more degraded on average. Another observation is that, because the maintenance resources are very restricted here, some of the 5 WTs remain failed after the maintenance.


4.3 Policy 2: Maximization of the operating WTs 
number 


We conduct the same experiments when the objective of the maintenance is to maximize the number of available WTs at the beginning of the production cycle. The Policy 2 algorithm now finds the threshold $x_s$ that satisfies $x_i(0) > x_s$, $\forall i \in \{1, \ldots, n\}$, and $Y(\tau) - Y(0) = (D_M - (n_M - 1) \cdot d_{ij}) \cdot e_M$, where $n_M$ is the number of maintained wind turbines. Table 3 presents the numerical results.


Table 2. Performance indicators for Policy 1.
Parameters                      Performance indicators
$p_h$   $\varepsilon_n$   $P(\tau)$   $Y_0$   $\bar{\rho}$   $\rho = \rho_n$
0.9     0.95              14362       34.2    82%            94%
0.9     0.9               14555       32.9    79%            85%
0.9     0.8               14713       32.5    80%            60%
0.9     0.6               15155       37.5    84%            39%
0.9     0.4               15131       36.9    85%            18%
0.8     0.8               14560       35.8    80%            73%
0.6     0.8               14633       35.0    82%            76%

Table 3. Performance indicators for Policy 2.
Parameters                      Performance indicators
$p_h$   $\varepsilon_n$   $P(\tau)$   $Y_0$   $\bar{\rho}$   $\rho = \rho_n$
0.9     0.95              13545       74.5    96%            99%
0.9     0.9               13394       74.5    96%            99%
0.9     0.8               13422       73.9    96%            99%
0.9     0.6               13590       74.6    96%            98%
0.9     0.4               13470       74.0    96%            99%
0.8     0.8               13438       74.4    96%            99%


In this case, the experiments do not reveal the impact of the decision variables. In fact, the maintenance resource is too low for such a policy, and the losses due to transporting the maintenance crew from one wind turbine to another are too high. After maintenance, the wind farm state $Y_0$ remains very degraded. The mean $\bar{\rho}$ remains very high to ensure the production, given that many WTs would fail before the end of the cycle. The maintenance constraint cannot reduce this number of failures. Extra maintenance resource is required.

Finally, from these two numerical analyses, we can conclude that Policy 1 dominates Policy 2 under the proposed assumptions.


5 CONCLUSION AND PERSPECTIVES 


In this paper, we have proposed a decision-making framework for the production and maintenance management of a wind farm, which is defined as a system of systems. Based on this problem, we have introduced the problem of maintenance allocation when the resources are shared and limited. Two maintenance allocation policies have been presented. We have also proposed a rule for managing the production rate to guarantee both the production objective and the degradation level according to the maintenance resource. This is a contribution to the maintenance optimization topic, especially for the study of imperfect block replacement policies.

This paper should be seen as preliminary work. We have deliberately chosen some simple assumptions that seem realistic to us in an industrial context. Nevertheless, some of them could be seen as too restrictive (e.g., the linearity of the production rate function or a constant degradation rate for given operating conditions) and others should be made more explicit (e.g., what the degradation level means for a complex system such as a wind turbine). At this stage, the current problem could be modeled as a Markov Decision Problem. Such a model could make it possible to identify some structural properties of the optimal control rules. Other perspectives can be drawn for, e.g., the production rate rule (rotation control $\rho$). In this paper, the updates of the $\rho$ value are driven by stationary states in terms of wind forecasts and in $\rho$ (no change from now to the end of the production cycle). To increase profitability at a minimum of risk, variability and uncertainty should be integrated in the decision framework.
ABSTRACT: A relevant issue in manufacturing and production seems to be “silo” organisations and “silo” planning, with a lack of coordination between departments. Integrated Planning (IPL) is a concept that aims to cope with this “silo” problem. With the ground-breaking potential of Industry 4.0, it should be expected that the development and implementation of IPL in companies will speed up. To manage IPL, sound Key Performance Indicators (KPIs) must be implemented and established in the company. A promising indicator for IPL is Maintenance Backlog (MB). A strength of this indicator is its capability to be modelled with Risk OMT (Risk modelling - Integration of Organisational, human and Technical factors). It remains to investigate how MB can be modelled in a Quantitative Risk Analysis (QRA). The main objective of this article is to develop a model of MB in QRA. In particular, the article demonstrates a case study of a production system where both Fault Tree Analysis (FTA) and Event Tree Analysis (ETA) are modelled. The article discusses the demonstration results and evaluates how potentials in Industry 4.0 can support QRA.


1 INTRODUCTION

The Oil & Gas (O&G) industry has experienced challenges with the demand for increasing oil production, lowering operating costs and life extension (Ramstad et al., 2010). These challenges have, among other efforts, resulted in the concept Integrated Operations (IO), which is described as a new way of doing business (Rosendahl and Hepsø, 2013). This concept has further resulted in the Center for Integrated Operations in the petroleum industry (IO Center), where one important issue is to go from “silo organisations” towards integrated operation of all the relevant organisations (IO Center, 2012). Transferring the IO principle into the planning domain leads us to the concept Integrated Planning (IPL) (Ramstad et al., 2010). In the planning domain the “silo” problem is also present.

The problem with “silo” planning in the O&G industry is the lack of coordination across domains and organisations (Rosendahl and Hepsø, 2013). In particular, lack of IPL results in limited resources, system failures and unscheduled maintenance (Wahl and Sleire, 2009). It is also argued that other disciplines such as drilling may affect maintenance (Sleire and Wahl, 2008). According to Øien and Schjølberg (2009) and Meland et al. (2009), the maintenance backlog (MB) is represented in the ageing phase of an O&G facility and should be controlled at that stage. In addition, the Petroleum Safety Authority (PSA) Norway measures MB systematically for oil companies on the Norwegian Continental Shelf (Petroleumstilsynet, 2012). In fact, according to PSA, maintenance critical backlog is regarded as a potential for major accidents.

A case study of indicators related to IPL that are applied in the O&G industry has been performed (Wahl and Sleire, 2009). Plan attainment was one type of indicator and can be related to maintenance backlog. The study concludes that more research is needed on improving indicators for plan performance (Wahl and Sleire, 2009). However, it is not clear in this case study how these indicators are modelled in a Quantitative Risk Analysis (QRA).

In risk modelling, the Risk OMT (Risk modelling - Integration of Organisational, human and Technical factors) model has been developed by Vinnem et al. (2012) and evaluated through a case study (Gran et al., 2012). In this model, a Bayesian belief network is applied to structure two levels of Risk Influencing Factors (RIFs) connected to the basic events in QRA. Also, the principles for updating the risk picture with a QRA basis have been demonstrated (Vatn, 2014). The Risk OMT seems to be a promising model for a dynamic risk barometer based on indicators (Paltrinieri et al., 2017, Paltrinieri et al., 2014).

Due to different views of the term “maintenance backlog”, how it is modelled and its relevance for IPL, a novel model for MB of physical assets has recently been developed and structured in a framework for IPL (Rødseth and Schjølberg, 2017). Furthermore, the Risk OMT was tested for MB in a reliability model, demonstrating that the risk aspect is included for MB. In the Risk OMT model proposed by Rødseth and Schjølberg (2017), the RIF structure adjusted the level of MB after evaluating the RIF of people, such as the skills of the craft technicians, and the RIF of the tools they use in maintenance planning. It remains, however, to investigate how MB can be evaluated as a RIF itself and connected to the QRA.

With the potentials within Industry 4.0 it would 
be expected that enterprises establish Cyber Physi- 
cal Systems (CPS) where the physical world and 
the virtual world are converging (Kagermann 
et al., 2013, Monostori, 2014). 

To implement the potentials from Industry 4.0 
an architecture for CPS should be established 
in the organisation (Lee et al., 2015). Neverthe- 
less, more effort is needed to investigate more in 
detail how Risk OMT can be adapted to such an 
architecture. 

The main objective of this article is to develop a model of MB in QRA. To achieve this main objective, the following sub-objectives have been outlined:

1. Develop a general model that connects MB with QRA.
2. Test the model with a case example.
3. Propose how the model can be improved with support from the potentials in Industry 4.0.


The remainder of this article is structured as follows: Section 2 presents the CPS as a potential in Industry 4.0, and Section 3 presents an example case with a corresponding risk model developed in Section 4. The results from the example case are presented in Section 5. Further, Section 6 provides a discussion of how the results can be related to CPS, and concluding remarks are made in Section 7.


2 CPS AS A POTENTIAL IN INDUSTRY 4.0 


With the trend of digitalizing manufacturing, 
Industry 4.0 offers several promising technologies. 
An essential element in Industry 4.0 is convergence 
of the physical world and the virtual world repre- 
sented in CPS (Kagermann et al., 2013). This ena- 
bles network resources, information, physical assets 
and people to create Internet of Things (IoT) and 
Internet of Services. 

Maintenance clearly has a position in Industry 4.0, where both predictive and remote maintenance provide value creation in enterprises in terms of improved asset utilization and reduced maintenance costs (McKinsey & Company, 2015). For maintenance, the 5C architecture seems to be promising as a CPS architecture (Lee et al., 2015, Lee et al., 2017). This architecture has also been tested for manufacturing (Lee et al., 2017) and process industry (Rødseth et al., 2016). The 5C architecture forms a pyramid with the following levels:
- Level 1: Connection. Data collection from e.g. sensors connected to machines.
- Level 2: Conversion. Data converted into useful information, e.g. calculations of vibration data.
- Level 3: Cyber. The information is connected to the internet, where advanced analytics can take place in terms of e.g. fleet analytics.
- Level 4: Cognition. To support a decision, visual interfaces such as dashboards and key performance indicators are necessary.
- Level 5: Configuration. Decision-making is supported through e.g. “digital advices” for conducting maintenance activities.


This architecture has also been proposed to 
be a sound structure for the maintenance model 
Deep Digital Maintenance (DDM) (Rødseth et al., 
2017). DDM comprises the module planning 
where it remains to elaborate how MB can improve 
the planning function for DDM. 


3 DESCRIPTION OF SYSTEM 
AND AN EXAMPLE CASE 


In this article, a heat exchanger with a barrier system, outlined in Figure 1, is used as an example. The heat exchanger receives gas from two processes, process 1 and process 2. If there is a leakage from the heat exchanger, both valve 1 and valve 2 must close in order to avoid further leakages. Figure 2 further describes the heat exchanger system in detail (Utne et al., 2012).

Figure 1. Illustration of a heat exchanger with a barrier system.

Figure 2. Illustration of a heat exchanger, adapted from (Utne et al., 2012).

This heat exchanger is a single-pass tube, baffled
single-pass shell, shell-and-tube heat exchanger 
designed to provide counter flow conditions. In 
addition, this type of heat exchanger is one of 
the most commonly used in offshore oil and gas 
processing plants. In the case study we will con- 
sider an example case of the tubes in the heat 
exchanger where the model developed is tested 
with example data. The tubes constitute a bundle, 
meaning that no single tube can be solely replaced. 
If there is leakage from a tube it will be plugged. 
However, this will decrease the efficiency of the 
heat exchanger and at some point the whole bun- 
dle will be changed. This will then set the efficiency 
back to 100% of the design efficiency. 

For the heat exchanger it is assumed that leak- 
age only occurs from tubes and is located in the 
shell section. For the barrier system, only valve 1 is 
of interest. For the heat exchanger and the valves 
there exists a specific maintenance programme 
coordinated by the maintenance management. 


4 RISK MODELLING 


The core of the Risk OMT is modelling RIFs and how these affect the operational barriers. In this paper the RIFs are identified in the tasks in the maintenance programme and affect both the barriers (valves) and the production facility (heat exchanger). A RIF is defined by Øien (2001) as “an aspect (event/condition) of a system or an activity that affects the risk level of this system/activity”. Further, a RIF is a theoretical variable that can be measured.

The risk picture is illustrated in Figure 3. The 
initiating event in the QRA is leakage from shell 
and tube heat exchanger. This leakage is due to 
either leakage from front-end head section, rear- 
end head section or shell section. 

In this article the shell section, denoted as basic event 3, is further studied. It is assumed that the frequency of leakage from the front-end head section and the rear-end head section is negligible. When the initiating event occurs, the barrier system shall shut down the gas supply. Both valve 1 and valve 2 must function in order to avoid a safety critical event.

Figure 3. The total risk picture in QRA. The initiating event is leakage from the shell and tube heat exchanger. If the barrier system functions and shuts down the supply of gas to the heat exchanger, the impact is an economic critical event but no safety critical event; if the barrier system fails, the impact is a safety critical and economic critical event.

The worst scenario in the QRA occurs with 
an impact of both a safety critical and economic 
critical consequence shown with the dotted circle 
in Figure 3. For this scenario the annual expected 
frequency is of interest. 

Figure 4 presents the FTA for the initiating 
event with leakage from the shell and tube heat 
exchanger, while Figure 5 presents the barrier sys- 
tem structured with FTA. The barrier system com- 
prises two shutdown valves where both must fail. 
Figure 4. FTA of initiating event.

Figure 6. RIF structure that connects to QRA.


Figure 6 presents the RIF structure which is connected to basic event 3 in the FTA from Figure 4. The same design of the RIF structure is also connected to basic event 4 in the FTA from Figure 5. For each RIF, a score $S$ from A to F is observed and a variance $V_S$ of the score is evaluated.

The variance of the score reflects how accurately the observation represents the “real” RIF. The RIF
structure consists of two levels where there is a 
structural dependency Vp between level 2 parent 
RIF and level 1 child RIFs. At level 1 the RIFs are 
weighted with weight w based on expert judgment. 
The RIF structure is further described in this sec- 
tion with elaboration of Risk OMT. 


4.1 Level 2 RIF


Maintenance management is defined as (CEN, 
2010) “all activities of the management that deter- 
mine the maintenance objectives, strategies and 
responsibilities, and implementation of them by such 
means as maintenance planning, maintenance con- 
trol, and the improvement of maintenance activities 
and economics”. This level is on the organisational 
level and affects the operative level 1. 


4.2 Level 1 RIFs


4.2.1 Maintenance backlog
Maintenance backlog is defined as (Rødseth and Schjølberg, 2017) “the amount of unfulfilled demands at a given point of time in explicit reference to predefined standards to be achieved. The demands comprise both demands for the technical condition itself and demand in meeting the planned due dates in the work orders. Furthermore, maintenance backlog can be expressed in functional (non-monetary) or monetary terms and it refers to single components, sub-assets or to the whole asset”.

Table 1. Score and evaluation criteria.
Score   Evaluation criteria
A       «Best case» score
B
C       «Normal case» score
D
E
F       «Worst case» score

It is proposed by Rødseth and Schjølberg (2017) that maintenance backlog can have a financial perspective that is based on both work orders and the technical condition of the facility, following the understanding of maintenance backlog from petroleum authorities (Petroleumstilsynet, 2016) and road authorities (Weninger-Vycudil et al., 2009) respectively. When providing a score of maintenance backlog, both the deviation of expected work completed from the work orders and the technical condition should be evaluated.

Table 1 presents the score and evaluation criteria 
for maintenance backlog. 


4.2.2 Maintenance quality 

Maintenance quality is an aggregation of all factors that affect the quality of the service provided in the maintenance task. A comprehensive list of factors that can be related to maintenance quality has been outlined (Rødseth and Schjølberg, 2017). The list of relevant factors for maintenance quality can also be based on the findings from audits (Øien et al., 2010):

1. Classification
2. Documentation
3. Use of classification
4. Competence
5. Maintenance efficiency evaluation

The evaluation of the score, variance and weight of this RIF is based on questionnaires and interviews for assessing the “soft” issues like competence and maintenance efficiency evaluation. For the “harder” issues like classification, the score is based on direct observations in the organisation.


4.3 Standard calculation of two level 1 RIFs 
and one level 2 RIF 


When calculating the basic event with the RIF structure presented in Figure 6, a mathematical approach is needed. This approach has also been presented by Rødseth and Schjølberg (2017) and is included here for completeness. Further details and foundations for this approach and the formulas are elaborated by Vatn (2013). The approach comprises 6 stages for calculating the basic event $q_i$ and can be summarized as follows:


1. Perform an expert judgement and evaluate the score of each RIF in the range A-F, the variances, the weights w, and the values for q.

2. Map the scores into values in the interval [0, 1] as follows: A = 1/12, B = 3/12, C = 5/12, D = 7/12, E = 9/12, F = 11/12. The score vector is then [1/12, 3/12, 5/12, 7/12, 9/12, 11/12].

3. Calculate the posterior distribution of the parent RIF based on the following formulas, where Jeffreys prior is used with the prior parameters $\alpha_0 = \beta_0 = 0.5$:


0 *(1- 

ak HS yy 
+s(l- s7} 

g-Ê s( Vg 


4. Calculate the prior distribution parameters $\alpha$ and $\beta$ of the child RIFs based on the following equations:


(1) 


(2) 


a[n o 
_ PA 
=y) K 


5. Calculate the weighted sum for level 1 RIFs and the expected probability for each possible combination, i. All the combinations are collected in a list.

6. Apply the law of total probability and calculate the basic event with the following formula for $q_i$:


In 


Lp 
A ) * pa(r| P = p) |* p(p) 


1=>}, Eal 


(5) 
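The score mapping and the weighted sum of level 1 RIFs from the stages above can be sketched as follows. This is a simplified illustration only: it treats the observed scores as point values and leaves out the variances and the Bayesian updating of the parent and child RIFs. The names `SCORE_VALUES` and `weighted_level1_score` are hypothetical; the example uses scores D and B with weights 0.3 and 0.7.

```python
from fractions import Fraction
from typing import Sequence

# Scores A-F mapped to the midpoints 1/12, 3/12, ..., 11/12 of six equal bins on [0, 1].
SCORE_VALUES = {letter: Fraction(2 * k + 1, 12) for k, letter in enumerate("ABCDEF")}


def weighted_level1_score(scores: Sequence[str], weights: Sequence[float]) -> float:
    """Weighted sum of level 1 RIF scores, using the observed scores as point values."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    return sum(w * float(SCORE_VALUES[s]) for s, w in zip(scores, weights))


if __name__ == "__main__":
    print(float(SCORE_VALUES["D"]))                       # value assigned to score D
    print(weighted_level1_score(["D", "B"], [0.3, 0.7]))  # two level 1 RIFs
```

In the full approach this weighted sum is evaluated for every combination of RIF values rather than for a single pair of observed scores.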


4.4 Modelling the basic events 


For the example case we assume that we have two independent RIF structures that affect the initiating event and the barrier system in the QRA. In this case example we have two outsourced maintenance organisations that are specialised in maintenance for heat exchangers and valves. Further, it is assumed that these organisations are independent of each other.

When $\lambda_3$ for basic event 3 is calculated, the frequency can also be calculated. Since the heat exchanger is regarded as a repairable unit, the following formula is used:

$$q_i = \lambda_i \cdot \mathrm{MTTR}_i \qquad (6)$$

From basic event 3, the frequency of the initiating event, IE, is of interest. Since we have one OR gate and neglect basic events 1 and 2, the frequency of the initiating event can be calculated as follows:

$$\lambda_{IE} = \frac{q_1}{\mathrm{MTTR}_1} + \frac{q_2}{\mathrm{MTTR}_2} + \frac{q_3}{\mathrm{MTTR}_3} \approx \frac{q_3}{\mathrm{MTTR}_3} \qquad (7)$$

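The relations between failure rate, repair time and initiating event frequency can be sketched numerically as below. The function names and the input values are hypothetical and chosen only for illustration; the conversion of 8760 hours per year used to express the scenario frequency per year is an assumption made here, not a value stated in the text.

```python
from typing import Sequence


def unavailability(failure_rate_per_hour: float, mttr_hours: float) -> float:
    """Basic event probability of a repairable unit: q = lambda * MTTR."""
    return failure_rate_per_hour * mttr_hours


def initiating_event_frequency(q: Sequence[float], mttr_hours: Sequence[float]) -> float:
    """OR gate over basic events: sum of q_i / MTTR_i (per hour)."""
    return sum(qi / mi for qi, mi in zip(q, mttr_hours))


def annual_scenario_frequency(
    lambda_ie_per_hour: float,
    q_barrier: float,
    hours_per_year: float = 8760.0,  # assumed conversion factor
) -> float:
    """Frequency of the initiating event combined with barrier failure, per year."""
    return lambda_ie_per_hour * q_barrier * hours_per_year


if __name__ == "__main__":
    # Hypothetical inputs: only basic event 3 is kept, the others are neglected.
    q3 = unavailability(2.0e-6, 3.0)
    lam_ie = initiating_event_frequency([q3], [3.0])
    print(q3, lam_ie, annual_scenario_frequency(lam_ie, 0.003))
```

With only one basic event retained, the initiating event frequency reduces to the failure rate of that event, which is the approximation made when basic events 1 and 2 are neglected.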
5 RESULT 
5.1 Input data 


The input data are shown in Table 2 and Table 3, partly based on data from the OREDA handbook (Sintef and Oreda, 2009). The failure rates for the heat exchanger are collected from the OREDA database. The high and low values for $\lambda_3$ and $q_4$ are in accordance with the Risk OMT model.


5.2 Result data 


The result from the example case is structured in 
Table 4. For the scenario, the frequency is calculated. 


Table 2. Input data for basic event 3.
Parameter                                                     Value
$S_{1,1,3}$                                                   D = 0.58333
$S_{1,2,3}$                                                   B = 0.25000
$S_{2,3}$                                                     C = 0.41667
$w_{1,1,3}$                                                   0.3
$w_{1,2,3}$                                                   0.7
$VS_{1,1,3}$                                                  0.01
$VS_{1,2,3}$                                                  0.04
$VS_{2,3}$                                                    0.04
$VP_3$                                                        0.0025
$\mathrm{MTTR}_3$ (hours)                                     3.0
$\lambda_{L,3}$ (/hours) from (Sintef and Oreda, 2009)        $0.39 \cdot 10^{-6}$
$\lambda_{H,3}$ (/hours) from (Sintef and Oreda, 2009)        $23.87 \cdot 10^{-6}$



Table 3. Input data for basic event 4.
Parameter          Value
$S_{1,1,4}$        C = 0.41667
$S_{1,2,4}$        C = 0.41667
$S_{2,4}$          C = 0.41667
$w_{1,1,4}$        0.3
$w_{1,2,4}$        0.7
$VS_{1,1,4}$       0.01
$VS_{1,2,4}$       0.04
$VS_{2,4}$         0.04
$VP_4$             0.0025

qn 4 (/hours) 10° 

q, 4 (/hours) 107 
Table 4. Changes in QRA.
Initiating event (/hours)                  Barrier system (/hours)   Frequency in QRA (/year)
$\lambda_{IE} = 2.9500 \cdot 10^{-6}$      $q_4 = 0.0030$            $F = \lambda_{IE} \cdot q_4 = 7.75 \cdot 10^{-5}$


6 ADAPTING RISK OMT WITH CPS 


The objective of this article was to develop a model 
of MB in QRA. With a model from Risk OMT, a 
structure of RIF with two levels was developed and 
connected to a case example of a heat exchanger. 
Improved maintenance quality can to some extent 
compensate for poor maintenance backlog. Still 
the organisation should strive to reduce MB in 
order to improve the overall risk picture of QRA 
of the plant. 

To improve the implementation and application of the model of MB, an essential evaluation would be to what degree this model could be automated. The motivation for this automation is the huge number of maintenance-significant technical objects in a plant. Obviously, some evaluations, such as maintenance quality, should still be performed manually by experts, but the evaluation of MB should have the potential to be conducted more automatically.

Figure 7 proposes how a CPS architecture could be constructed, inspired by the 5C model from (Lee et al., 2015). At level 1, relevant data for maintenance backlog is captured in real time from both the computerized maintenance management system (CMMS) and condition monitoring systems. At level 2, the values for MB are calculated from both CMMS and condition monitoring in monetary terms in accordance with (Rødseth and Schjølberg, 2017). At level 3, the score S is automatically evaluated by comparison with earlier measurements of MB. At level 4, the QRA is visualized as a risk picture. Finally, at level 5, a digital advice is provided to propose reduction of maintenance backlog if necessary.

Figure 7. CPS architecture proposed for Risk OMT, inspired by the 5C model from (Lee et al., 2015).


7 CONCLUDING REMARKS 


It is concluded in this article that the proposed 
model of MB in QRA should be further devel- 
oped, in particular with respect to decision criteria 
for providing the score. In addition, a CPS archi- 
tecture for industry should be established with 
respect to Risk OMT modelling and include it in 
the maintenance model DDM. Further studies of 
this model should also be performed not only in 
the O&G industry, but also in industry branches 
such as the maritime industry, manufacturing, 
process industry and the railway industry. 
Reliability-based maintenance optimization for the leased equipment
with deterioration depending on age and usage
ABSTRACT: In this paper, we first consider a two-dimensional lease region composed of an age limit and a usage limit, and then, from the leaser’s perspective, develop maintenance policies so that the total cost in the lease region can be reduced. The deterioration process of the equipment depends simultaneously on age and usage. When reliability drops to a threshold, called the reliability threshold, Preventive Maintenance (PM) is carried out to improve reliability. This PM action is repeated until the associated remaining lease limits terminate. When the lessees’ usage rates belong to a single usage rate class, a unified maintenance policy is used to maintain reliability of the equipment. When the lessees’ usage rates are classified, customized PM integrated with the usage rate class is used to maintain reliability. We derive the leaser’s total cost models for a benchmark lease contract with the unified maintenance policy and for the proposed lease contract with customized PM. Lease contract selection is performed by means of the Root Mean Square (RMS). An example experiment is provided to verify the proposed methodologies. As shown in the example experiment, optimal values can be found by minimizing the total cost, and the proposed lease contract is beneficial for both the leaser and the lessee.


1 INTRODUCTION

Due to limited ability to maintain reliability and a lack of capital, some users (called lessees) buy only the right to use an equipment rather than buying it from its owner (called the leaser). The lessee uses the equipment to accomplish tasks or missions by paying rent to the leaser, and the leaser is often responsible for maintaining the reliability of the equipment during a period called the lease period. In the lease period, the reliability of the equipment not only affects the leaser’s maintenance cost but also influences the lessee’s task progress. In particular, the lessee may incur some loss if a task is not accomplished on time because recovering from a failure of the equipment takes longer. In this setting, the lessee usually requires a penalty cost from the leaser so that her loss is compensated. To reduce penalty cost and maintenance cost in the lease period, the leaser usually performs preventive maintenance (PM) in the lease period.

To save maintenance cost and penalty cost, various PM policies have been proposed by academic researchers and industry practitioners. Jaturonnatee et al. (2006) developed a sequential PM scheme using the reduction in failure rate. Yeh et al. (2009) proposed more effective PM policies than the PM policy in Jaturonnatee et al. (2006). Other PM policies related to leased equipment have been offered in Pongpech & Murthy (2006), Zhou et al. (2014), Mabrouk et al. (2016), Hajej et al. (2016) and Hung et al. (2017).

Although the above-mentioned literature studied various types of PM policies related to the equipment, their implicit assumption is that aging of the equipment is uniquely characterized by the time (age)-dependent random distribution. In real life, the deterioration of some equipment is affected not just by age but also by accumulative usage. For example, the accumulative mileage of an automobile can increase the probability of a failure. Considering this fact, Iskandar & Husniah (2017) recently developed a PM policy for equipment with a two-dimensional lease region. However, this work ignored the fact that the usage rates of individual lessees are not all the same. Given the differences in usage rate, in practice lessees prefer that the leaser perform customized PM that takes the usage rate class into account in the two-dimensional lease region. Besides, PM in this work was performed periodically. For equipment with deterioration depending on age and usage, it is well known that its reliability declines sharply and nonlinearly with respect to age and accumulative usage, or that its deterioration accelerates with respect to age and accumulative usage. Periodic PM actions cannot effectively improve, in real time, the sharply and nonlinearly declining reliability, since their interval is constant.

In this paper, we consider a two-dimensional lease region with customized PM, which is used as a lease limit of the equipment with deterioration depending on age and usage. Because the leaser can’t obtain the usage rates of all individual lessees when the lease contract takes effect, customized PM can’t be used to maintain reliability of the equipment from the start. In order to use customized PM in the two-dimensional lease region with the unknown usage rates, we divide the two-dimensional lease region into multiple disjoint regions and then propose a flexible lease contract with customized PM. According to repair records and/or electronic monitoring records in the first two-dimensional lease region, the leaser can obtain the usage rates of all individual lessees. This makes it convenient to use customized PM in the remaining lease limits. Besides, since reliability declines sharply with respect to age and accumulative usage, the customized PM considered in this paper is based on a reliability threshold, which is a decision variable. That is, PM is performed when reliability declines to the reliability threshold. As is well known, lease termination for a weighty (heavy-usage) lessee is decided by the usage limit, and lease termination for a light lessee is decided by the age limit. To measure the effect of customized PM, we assume that customized PM for each weighty lessee class reduces age and customized PM for each light lessee class reduces usage. Moreover, to illustrate the performance of the proposed lease contract, we compare the total cost models under different lease contracts by means of the Root Mean Square (RMS) technique.

The structure of this paper is as follows. In Section 2, a benchmark lease contract and the lease contract proposed in this paper are defined. In Section 3, we derive the leaser’s total cost models for the two lease contracts. Section 4 presents a numerical experiment to illustrate the considered lease contracts. Finally, Section 5 concludes this paper.


2 MODEL FORMULATION 


2.1 Assumptions 
The following assumptions are made in this paper. 


• The usage rate of an individual lessee is fixed through the whole life cycle;

• Usage rates after the first two-dimensional lease region are obtained and they are classified into $m$ ($m > 1$) classes.


2.2 Lease contract definition 


2.2.1. Benchmark lease contract 

In real life, minimal repair is frequently used to remove failures. In this
paper, we call the two-dimensional lease contract in which all failures are
removed by minimal repair the benchmark lease contract. Under the benchmark lease
contract, since minimal repair is the only method used to keep the equipment
running, we call minimal repair the unified maintenance policy. Because the usage
rate in the two-dimensional lease region Ω is unknown, age and accumulative usage
at the termination of the two-dimensional lease region Ω are respectively
expressed as


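$$T_{\Omega} = \begin{cases} \omega, & \text{if } r \le r^* \\ u/r, & \text{if } r > r^* \end{cases} \quad \text{and} \quad X_{\Omega} = \begin{cases} r\omega, & \text{if } r \le r^* \\ u, & \text{if } r > r^* \end{cases}$$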


2.2.2 The proposed lease contract 
Let ω be the maximal time limit, u be the maximal usage limit and r* be the
average usage rate. Then r* = u/ω and the two-dimensional lease region can be
defined by the set Ω = (ω, u). For the sake of analytical tractability, we define
the first two-dimensional lease region as
Ω_1 = {(ω_1, u_1) | 0 ≤ u_1 < ω r_min and 0 ≤ ω_1 < u/r_max},
where r_min is the minimal usage rate; r_max is the maximal usage rate; and
u_1 = r* ω_1. Obviously, the first two-dimensional lease region Ω_1 is a part of
the two-dimensional lease region Ω, as depicted in Figure 1.
As shown in Figure 1, the two-dimensional lease region Ω is divided into three
regions. According to repair records in the first two-dimensional lease region
Ω_1, the leaser can infer the usage rate of each individual lessee and then
classify the lessees by usage rate. So the leaser obtains the following
information once equipment with an unknown usage rate class has gone through the
first two-dimensional lease region Ω_1: a) the usage rate class; and b) the
remaining lease limits, namely the remaining time limit ω − ω_1 > 0 and the
remaining usage limit u − u_1 > 0.


Figure 1. Description of lease region.


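Further, age and accumulative usage at the termination of the first
two-dimensional lease region Ω_1 are respectively given by

$$T_{\Omega_1} = \begin{cases} \omega_1, & \text{if } r \le r^* \\ u_1/r, & \text{if } r > r^* \end{cases} \quad \text{and} \quad X_{\Omega_1} = \begin{cases} r\omega_1, & \text{if } r \le r^* \\ u_1, & \text{if } r > r^* \end{cases}$$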
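The termination rules for a lease region can be sketched in a few lines of Python. This is only an illustration of the piecewise definitions of age and accumulative usage at termination; the function name `termination_state` and its argument names are hypothetical, and the numbers are the time and usage limits quoted in the numerical experiment of Section 4.

```python
def termination_state(r, time_limit, usage_limit):
    """Return (age, accumulative usage) when a two-dimensional lease region ends.

    r            -- usage rate of the lessee (usage per unit time)
    time_limit   -- maximal time limit of the region
    usage_limit  -- maximal usage limit of the region
    """
    r_star = usage_limit / time_limit  # average usage rate r* = u / omega
    if r <= r_star:
        # Light lessee: the time limit is reached first.
        return time_limit, r * time_limit
    # Weighty lessee: the usage limit is reached first.
    return usage_limit / r, usage_limit


# Whole region (2 years, 4 x 10^4 km) and one candidate first region.
omega, u = 2.0, 4.0
omega_1 = 0.45
u_1 = (u / omega) * omega_1  # u_1 = r* * omega_1

print(termination_state(1.0, omega, u))      # a light lessee in the whole region
print(termination_state(4.0, omega_1, u_1))  # a weighty lessee in the first region
```

A lessee with r below r* always ends the region at the time limit, which is why the light-lessee cost model later works with the remaining time limit only.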
Reliability-centered (or reliability-based) PM policy is a maintenance policy
that is frequently applied in the reliability maintenance field (see Niu et al.,
2010, Zhou et al., 2007, Lin et al., 2015). From the leaser’s perspective, here
we propose a two-dimensional lease contract with customized PM by integrating
reliability-centered PM policy with usage rate classes. Under this lease
contract, when reliability in the remaining lease limits declines to a threshold,
called the reliability threshold, PM is performed to improve reliability. This PM
action is repeated until the associated remaining lease limit expires. That is,
customized PM is applied to keep reliability no lower than the reliability
threshold in the remaining lease limits. For given remaining lease limits, a
smaller reliability threshold results in a lower number of PMs, and vice versa.
Under the proposed lease contract, therefore, both the reliability threshold and
the number of PMs are considered as decision variables.
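The link between the reliability threshold and the number of PMs can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the paper's model: the usage rate is held fixed (no averaging over G(r)), the failure rate is a hypothetical linear function lam(x) = a + b*x, each PM removes a fixed fraction of the age accumulated since the previous PM, and all parameter values are made up for illustration.

```python
import math


def pm_interval(virtual_age, threshold, a=0.5, b=1.0):
    """Time T after which reliability since the last PM drops to `threshold`.

    Assumes a hypothetical failure rate lam(x) = a + b * x, so T solves
    a*T + b*(virtual_age*T + T**2 / 2) = -ln(threshold).
    """
    target = -math.log(threshold)
    lin = a + b * virtual_age
    return (-lin + math.sqrt(lin * lin + 2.0 * b * target)) / b


def count_pms(threshold, start_age, remaining_time, age_reduction=0.5):
    """Count the PMs triggered by the reliability threshold in the remaining time limit."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    elapsed, virtual_age, n = 0.0, start_age, 0
    while True:
        t = pm_interval(virtual_age, threshold)
        if elapsed + t > remaining_time:
            return n  # the lease expires before the next PM is due
        elapsed += t
        # The PM removes a fraction `age_reduction` of the age gained in this interval.
        virtual_age += (1.0 - age_reduction) * t
        n += 1


for threshold in (0.9, 0.7, 0.5):
    print(threshold, count_pms(threshold, start_age=0.45, remaining_time=1.55))
```

Under these assumptions a higher threshold is reached sooner, so more PMs fit into the same remaining time limit, which is the monotonic relation stated above.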


3 MODEL ANALYSIS 


The failure rate (function), as an increasing function, is generally used to
characterize aging. In this paper, we assume that aging is modeled by a failure
rate λ(x|r), where r is a random variable following the distribution G(r). The
total cost under each lease contract is the sum of the expected maintenance cost,
the expected penalty cost and the expected depreciation cost, i.e.,

The total cost = maintenance cost + penalty cost + depreciation cost. (1)

By (1), we next derive the associated total cost for the case in which c_m is the
minimal repair cost.


3.1 Cost model under the unified maintenance 


Under the unified maintenance policy, the equipment is not classified. That is,
all equipment belongs to the same usage rate class r ∈ [r_min, r_max]. We call
this the single usage rate class in this paper.


3.1.1 The expected maintenance cost 
The expected maintenance cost of single usage rate class can be obtained as

$$MC_s = c_m\left[\int_{r_{\min}}^{r^*}\!\int_0^{\omega}\lambda(x|r)\,dx\,dG(r) + \int_{r^*}^{r_{\max}}\!\int_0^{u/r}\lambda(x|r)\,dx\,dG(r)\right]. \quad (2)$$


3.1.2 The expected penalty cost 

From the lessee’s perspective, essentially, recovery from an equipment failure
causes revenue loss. Hamidi et al. (2016) formulated the revenue rate as the
product of an increasing function in r and a decreasing function in t. Similarly,
we also use this product to describe the revenue rate of the single usage rate
class, as


R max 


R, |r) == (3) 


t 

7 rl T} 
where R_max is the potential revenue rate that can be generated by the new
equipment if its capacity is fully used (i.e., R_s(0 | r_max)); and L is the life
cycle of the equipment.

If r ≤ r*, then the expected revenue rate of single usage rate class in the
interval (0, ω] is given by
$R^{l} = \int_{r_{\min}}^{r^*}\int_0^{\omega} R_s(t|r)\,dt\,dG(r)/\omega$.
If r > r*, then the expected revenue rate of single usage rate class in the
interval (0, u/r] is given by
$R^{u} = \int_{r^*}^{r_{\max}} r\int_0^{u/r} R_s(t|r)\,dt\,dG(r)/u$.
Further, the expected penalty cost of single usage rate class in the
two-dimensional lease region Ω can be expressed as

$$PC_s = H(\tau)\left\{R^{l}\int_{r_{\min}}^{r^*}\!\int_0^{\omega}\lambda(x|r)\,dx\,dG(r) + R^{u}\int_{r^*}^{r_{\max}}\!\int_0^{u/r}\lambda(x|r)\,dx\,dG(r)\right\}, \quad (4)$$

where the function H(τ) is the expected excess of recovery time to failure.


3.1.3. The expected depreciation cost 

The equipment in the two-dimensional lease region Ω deteriorates faster with age
and accumulative usage, which makes the equipment depreciate. In order to measure
the depreciation of the equipment at the termination of the two-dimensional lease
region Ω, we offer a measure obtained by modifying the residual value model in
Hamidi et al. (2016), as


DC (r,t) = 8X0) + BTA) (5) 


where G >0;6,>0;6,=(V, -V,)/@ - Gr; Vis 
purchase price; and V, is residual value at the end 
of the life cycle L when the equipment is used at 
the maximum usage rate. 

If r<r*, then the expected depreciation 
cost of single usage rate class at the two-dimen- 
sional lease region Q termination is given by 


DCL = 4a |" r-dG(r)+@. If r>r*, then the 
expected depreciation cost of single usage rate 
class at the two-dimensional lease region Q termi- 


nation is given by pcy = gu +f." (u/r)dG(r). 
Thus, the expected depreciation cost of single 
usage rate class at the two-dimensional lease region 
Q termination is given by 

DC; =DCr+DC 

= safe raj” a racte}+ 


A (ulrpdG(r) +a}, ©) 


fmin 


3.1.4 The total cost model 
By (1), finally, the total cost of single usage rate 
class in the lease Q can be obtained as 
TC; = MC; + PC} + DC} 

= («. +H) R a i Axlr)dxdG(r)+ 
aia Í Mx|r)d xdG(r) + 
6, (e + a |" rac()] + 


A (ulr? dG() +a"), 


c, + H(2)I 
(7) 


‘min 


3.2 Cost model under the customized PM 


As indicated in the definition of the proposed lease contract, customized PM is
used to maintain equipment reliability in the remaining lease limits. Next, we
derive the expected total costs in the remaining lease limits. The expected total
cost of the light lessee is derived in this subsection, and the expected total
cost of the weighty lessee is presented in Appendix A.

We use the reduction in age as the PM effect. As is well known, the relationship
between reliability and failure rate is a one-to-one mapping. Let R_i be the
reliability threshold corresponding to the i-th usage rate class (similarly
hereinafter) r ∈ (r_i, r_{i+1}], then R_i can be expressed as


R= 


Ta Wheat 
exp) —| i vá 
4 Wir 


(T+ Si+ xinds] anh, (8) 


where H} i(- InR,) is inverse function a 
Ti un 

mT |" Ja HN AT + Si+ x|r)dx) dG(r); 

is a time” interval between 


(k-1)" _, PM and , the kh 
Susy errs, = Ti 1> a> æ 2-2 0; 


£ Si Lf =0 J? 
and Tj = Si ET 0. 
Since the lease of a light lessee terminates only when age reaches the age limit,
customized PM for a light lessee is confined to the remaining time limit
ω − ω_1. Next, we derive the total cost in the remaining time limit ω − ω_1.


3.2.1 The expected maintenance cost 

Since each failure between two successive PMs is removed by minimal repair, the
expected maintenance cost in the k-th PM interval (0, T_k^i] is the minimal
repair cost in that interval. So, it can be calculated as


. . Nig ri tT ; 
CiTD=c,f : ( Í m | NOt Sat xia] 


9 
dG(0). ©) 

After the last PM, i.e., the n, minimal repair 
cost of the i rate class is given by 


Ci =c, 1N ma 'H@, + Si +apax jacon, (10) 


where n, ={n, >O|T <O-@}. 

Let the increasing function C Ch (ATi) be PM 
cost resulted from the reduction AiTj, where 
Ē=1- a. The expected maintenance cost in 
the remaining time limit @- D, is equal to the 
sum of minimal repair cost HA 'Ci (Ti) resulted 
from the first n,-1 PM intervals, the minimal 
repair cost C} after the n PM and the PM cost 

K CATI) resulted from n, PMs. And it can 

@ mathematically expressed as 


m-l 


MC,(R,,n,)= x CiTD Ci + 3 Ci (ATD 
k=l 
oa (naas a+ xlr) ax] 
"j -a ! } m 
dawnt] | J TAa Si + xax | aco) 
+ È CTD 
k=1 


3.2.2 The expected penalty cost 
We modify the revenue rate in (3) as 


Rel) =A- 


i+1 


(12) 


where R_{i+1} is the potential revenue rate that can be generated by the new
equipment with the usage rate r ∈ (r_i, r_{i+1}] if its capacity is fully used
(i.e., R_i(0 | r_{i+1})).


The expected revenue rate in 
the k* PM interval (0,7;/] is given 
by R; = (ff Ra + SL + x|ndxdG@r)/ Ty. 
And the expected revenue rate after the n™® PM is 
givenby Ri, = f “J ERIO- G+ Si + x |r) dxdG(r) 
K@-a-—Ti). Further, the expected penalty 
cost in the remaining time limit Ø- Ø can be 
expressed as 


PC (R,n)= HW) 
aan Tiel Tit 
ERS, m+ Si+ xr)dx] a60) 
k=1 f a 


Tis 


(13) 


+R f ( fo" nats. + as) TI l 


m 
ý 


3.2.3. The expected depreciation cost 
When the two-dimensional lease region Q ter- 
minates, accumulative usage and age of the 
equipment _is respectively ¥,(t)=r@ and 
T (t)= @- £ ATi In this setting, by (5), the 
expected depreciation cost is given by 


DCP(R,n,)= 80 |” PAG(r)+ 80- > ATi. 
' k=l (14) 


Similarly, the expected depreciation cost at the 
first two-dimensional lease region Q, termination 
is given by 


DCP = Ga?) PAG(r) + 8.0. (15) 


Thus, the expected depreciation cost at the 
remaining time limit @-@, termination is 
obtained as 


DC(R,,n,)= DC2(R,,n,)— DCP 
= 8( T- o?) | "Paco 


D cao 16 
a (o-} anya). ~ 
k=1 


3.2.4 The total cost model 
By (1), finally, the total cost in the remaining time limit ω − ω_1 can be
obtained as


TC,(R,,n,)= MC,(R,,n,)+ PC(R,,n,)+ DC (R,,n;) 
= (6+ GOR, ) f SZI a+ 8, + 
dxdG(r)+ Y, CBT!) + 4(@- a) 
kl (17) 
Ti+] ki oer 
f raco) a (o- >. AT) - a: | 
j k=1 


Our target is to seek the optimal reliability threshold R_i* and the optimal
number n_i* of PMs so that the total cost in (17) is minimized. As mentioned
earlier, for a given remaining time limit ω − ω_1, if the reliability threshold
R_i is smaller, then the number n_i of PMs is lower; if the reliability threshold
R_i is bigger, then the number n_i of PMs is greater. This monotonic relationship
means that once the optimal reliability threshold R_i* is determined, the optimal
number n_i* of PMs is uniquely determined. Since the expressions of G(r) and
λ(ω_1 + S_{k−1}^i + x | r) are not specified, an analytical solution is difficult
to obtain, and so the optimal values (R_i*, n_i*) need to be searched numerically
over a suitably chosen grid of reliability thresholds in the remaining lease
limits.


3.3. Improvement evaluation and contract selection 


How to select a lease contract is an important problem for the leaser. In
practice, a key approach to lease contract selection is to compare the total
costs under different lease contracts. In this subsection, we consider a
qualitative analysis method to select the lease contract.

As stated in the assumptions, usage rates are classified into m classes. In the
remaining lease limits (including the remaining time limit and the remaining
usage limit), the minimized total cost of the m usage rate classes can be
obtained as


$$TC(\mathbf{R}^*,\mathbf{N}^*) = \sum_{i=1}^{m} TC_i(R_i^*, n_i^*), \quad (18)$$

where R* is the vector of the optimal reliability thresholds, given by
R* = {R_1*, R_2*, ..., R_m*}; and N* is the vector of the optimal numbers of PMs,
given by N* = {n_1*, n_2*, ..., n_m*}.

The total cost in (7) is the total cost of single usage rate class in the
two-dimensional lease region Ω, whereas the total cost in (18) is the minimized
total cost of m usage rate classes in the remaining lease limits. They are
therefore not equivalent and cannot be compared directly. In order to compare
them, we must transform the minimized total cost of m usage rate classes into the
total cost of a single usage rate class. By root mean square (RMS), the total
cost of m usage rate classes in the remaining lease limits, written here as
TC_s^r(R*, N*), can be approximately transformed as

$$TC_s^{r}(\mathbf{R}^*,\mathbf{N}^*) \approx \left[\frac{1}{m}\sum_{i=1}^{m} TC_i(R_i^*, n_i^*)^2\right]^{1/2}. \quad (19)$$
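The RMS transformation in (19) can be sketched in Python as follows. The function name `rms_cost` is hypothetical, and the two input values are an assumption made for illustration: they are the smallest total costs listed in Tables 2 and 3 of Section 4, taken here as the per-class minimized costs.

```python
import math


def rms_cost(class_costs):
    """Root mean square of the minimized total costs of the m usage rate classes."""
    m = len(class_costs)
    if m == 0:
        raise ValueError("at least one usage rate class is required")
    return math.sqrt(sum(cost * cost for cost in class_costs) / m)


# Smallest total costs in Table 2 (light lessee) and Table 3 (weighty lessee).
class_minima = [229.0, 199.2240]
print(round(rms_cost(class_minima), 2))
```

Because RMS is never smaller than the arithmetic mean, this transformation weights the more expensive class slightly more heavily than a plain average would.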


Besides, by (7), the total cost of single usage rate 
class in the first two-dimensional lease region Q, 
can be obtained as 


TC; = («. + HOAR" i me UMx|r)dxdG(r) 
de, + now f” j7 xlr)d xdG(r)+ (20) 


g (u +a | rdG(r) 


— 


a 


where R"(= frf R(e|[rdtdG(r)/u,) is the 
expected revenùe rate of single usage rate class in the 
interval (0,u/ 7}; R= J [E R(e|r)dedG(r)/ a) 
is the expected revenue rate of single usage rate class 
in (0,@,]. 

Under the proposed lease contract, further, the total cost of single usage rate
class in the whole two-dimensional lease region Ω, written here as
TC_s^p(R*, N*), is the sum of the first-region cost TC_s^1 in (20) and the
remaining-limit cost TC_s^r(R*, N*) in (19):

$$TC_s^{p}(\mathbf{R}^*,\mathbf{N}^*) = TC_s^{1} + TC_s^{r}(\mathbf{R}^*,\mathbf{N}^*). \quad (21)$$

Clearly, the total cost in (21) is the total cost of single usage rate class in
the two-dimensional lease region, and it is comparable with the total cost TC_s
in (7). This equivalence makes it possible to select the lease contract by
comparing the total costs. We neglect the error produced by RMS and consider two
performance measures, as follows.

If TC_s > TC_s^p(R*, N*), the leaser should select the proposed lease contract
with optimal customized PM; if TC_s < TC_s^p(R*, N*), the leaser should select
the benchmark lease contract to maintain reliability of the equipment in the
two-dimensional lease region Ω. In these two situations, the relative improvement
of the total cost can be measured respectively as


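$$E = \frac{TC_s - TC_s^{p}(\mathbf{R}^*,\mathbf{N}^*)}{TC_s}\times 100\% \quad \text{or} \quad \frac{TC_s^{p}(\mathbf{R}^*,\mathbf{N}^*) - TC_s}{TC_s^{p}(\mathbf{R}^*,\mathbf{N}^*)}\times 100\%. \quad (22)$$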
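The selection rule and the relative improvement measure can be sketched together in Python. The function name `select_contract` and the input values are hypothetical; the sketch only assumes that the two total costs have already been brought to the same single-class basis, as described above.

```python
def select_contract(tc_benchmark, tc_proposed):
    """Pick the cheaper contract and report the relative improvement in percent.

    tc_benchmark -- total cost under the benchmark lease contract
    tc_proposed  -- total cost under the proposed lease contract
    """
    if tc_benchmark > tc_proposed:
        gain = 100.0 * (tc_benchmark - tc_proposed) / tc_benchmark
        return "proposed", gain
    if tc_benchmark < tc_proposed:
        gain = 100.0 * (tc_proposed - tc_benchmark) / tc_proposed
        return "benchmark", gain
    return "either", 0.0  # equal costs: no improvement either way


# Illustrative (made-up) total costs.
contract, improvement = select_contract(1000.0, 800.0)
print(contract, improvement)
```

The improvement is always expressed relative to the more expensive contract, so it stays between 0% and 100%.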


4 NUMERICAL EXPERIMENT 


The lease region considered here has both a time limit and a usage limit, which
is consistent with vehicle warranty in real life. Huang et al. (2017) recently
utilized data collected from a car dealer to demonstrate the effectiveness of the
policy they proposed. Similarly, the intensity function is used here as a failure
rate that reflects the interdependence between age and usage, and it is given
by (xr) =0.14+0.4r+(1.5x4+1.5r)x. Also, we 
assume that the random usage rate r follows the uniform distribution G(r) with
the maximum r_max = 8 and the minimum r_min = 0.5. Shang et al.
(2016) used a quadratic function to model the PM cost function regarding
maintenance degree. We use a power function to model the PM cost, given by
C_p(ε) = c ε^p, where c, p > 0 and ε is the reduction in age or usage due to PM.

We specify ω = 2 (years) and u = 4 (10^4 kilometers), so the average usage rate
is r* = 2 and the two-dimensional lease region is given by Ω = (2, 4). The first
two-dimensional lease region takes one of the following three cases:
Ω_1 = (0.3, 0.6), (0.4, 0.8) or (0.45, 0.9). We classify all lessees into two
classes, light lessee and weighty lessee. In practice, revenue rate increases
with usage rate. Due to this fact, we specify the potential revenue rate as
$12500. Some parameters are constant throughout the whole numerical experiment,
and they are listed in Table 1. In order to analyze sensitivity, we present some
figures that describe how the results vary with the parameters.

For the light lessee, since reliability at the termination of the first
two-dimensional lease region Ω_1 (= (0.45, 0.9)) is equal to 0.8875, we take a
step length of 0.1 and obtain Table 2. As indicated in Table 2, optimal values
(i.e., the optimal reliability threshold and the optimal number of PMs) exist for
the light lessee. Their existence is also shown in Figure 2. For the weighty
lessee, reliability at the termination of the first two-dimensional lease region
Ω_1 is equal to 0.6553. Since it is less than the reliability 0.8875 of the light
lessee, we take a smaller step length of 0.05 and obtain Table 3. From Table 3,
we can see that optimal values also exist for the weighty lessee, as shown in
Figure 3. In order to illustrate the performance of the considered
two-dimensional lease contracts, we plot Figure 4, where the step length is 0.05.
From Figure 4, we can obtain the following results.


a. The total cost under the proposed lease (PL) contract increases and tends to
the total cost under the benchmark lease (BL) contract as the area of the first
two-dimensional lease region gets bigger.

b. The relative improvement of the total cost under the PL contract decreases as
the area of the first two-dimensional lease region gets bigger.


A bigger area of the first two-dimensional lease region Ω_1 leaves smaller
remaining lease limits. Further, smaller remaining lease limits lead to a lower
PM count. Furthermore, the

Table 1. Parameter settings. 
aB=p L e o Hw V, V, 6, 0, 
0.5 10 100 25 2x10* 2x10* 2x10 1 0.8 


Table 2. The existence of optimal value for light lessee.
Reliability threshold   0.8875   0.7875   0.6875   0.5875   0.4875   0.3875   0.2875   0.1875
Number of PMs (N)       1        1        1        1        1        2        5        17
Total cost (TC)         2137.6   828.3    519      389.4    318.5    229      284.1    398.9
Table 3. The existence of optimal value for weighty lessee.
Reliability threshold   0.6553     0.6053     0.5553     0.5053     0.4553     0.4053     0.3053
Number of PMs (N)       1          2          3          4          6          10         21
Total cost (TC)         289.8022   199.2240   221.4222   248.8676   290.5435   355.2092   469.8107
Figure 2. Total cost versus decision variables for light lessee. [Axes: reliability threshold, number of PMs, total cost.]

Figure 3. Total cost versus decision variables for weighty lessee. [Axes: reliability threshold, number of PMs, total cost.]

Figure 4. Total cost versus the first two-dimension. [Horizontal axis: the first two-dimensional lease region, (0.3,0.6), (0.4,0.8), (0.45,0.9); series: total cost under BL, total cost under PL, improvement percentage.]

lower PM count leads to a smaller reduction in minimal repair cost and a greater
depreciation cost, so the result in a) follows. Since the total costs under the
two types of contracts tend to coincide when the area of the first region Ω_1
approaches the area of the two-dimensional lease region Ω, the relative
improvement of the total cost under the PL contract decreases, which gives the
result in b). These changes illustrate that the performance of the PL contract is
superior to that of the BL contract. Compared with the BL contract, therefore,
the leaser should use the PL contract as a reliability guarantee in the
two-dimensional lease region, which is more beneficial since the leaser incurs a
lower total cost. Additionally, the lessee may also be inclined to the PL
contract, since PMs in the PL contract keep reliability higher, which slows down
deterioration, indirectly reduces the number of failures and increases the
revenue rate.
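The optimal values can be read off Tables 2 and 3 with a simple search over the tabulated grid, sketched below in Python. The rows are copied from the two tables; the names `LIGHT`, `WEIGHTY` and `best_row` are hypothetical, and the sketch assumes the optimum lies on the tabulated grid.

```python
# Rows are (reliability threshold, number of PMs, total cost).
LIGHT = [
    (0.8875, 1, 2137.6), (0.7875, 1, 828.3), (0.6875, 1, 519.0),
    (0.5875, 1, 389.4), (0.4875, 1, 318.5), (0.3875, 2, 229.0),
    (0.2875, 5, 284.1), (0.1875, 17, 398.9),
]
WEIGHTY = [
    (0.6553, 1, 289.8022), (0.6053, 2, 199.2240), (0.5553, 3, 221.4222),
    (0.5053, 4, 248.8676), (0.4553, 6, 290.5435), (0.4053, 10, 355.2092),
    (0.3053, 21, 469.8107),
]


def best_row(rows):
    """Return the grid point with the smallest total cost."""
    return min(rows, key=lambda row: row[2])


for name, rows in (("light", LIGHT), ("weighty", WEIGHTY)):
    threshold, n_pm, cost = best_row(rows)
    print(f"{name}: threshold={threshold}, PMs={n_pm}, total cost={cost}")
```

In both tables the total cost first falls and then rises along the grid, so the grid minimum is an interior point rather than an endpoint.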
5 CONCLUSIONS


In this paper, the two-dimensional lease contract in which all failures in the
two-dimensional lease region are removed by minimal repair was taken as the
benchmark lease contract. From the leaser’s perspective, we proposed a
two-dimensional lease contract with customized PM by dividing the two-dimensional
lease region. We specified that minimal repair is used to remove failures in the
first two-dimensional lease region and that customized PM integrated with usage
rate classes is adopted to improve reliability in the remaining lease limits, so
that maintenance cost, penalty cost and depreciation cost can be reduced
simultaneously. From the viewpoint of generality, the proposed two-dimensional
lease contract reduces to the benchmark lease contract by adjusting parameters.
From the viewpoint of management, the proposed lease contract is beneficial for
both the leaser and the lessee, since the total cost for the leaser and the
failure count for the lessee are lower than under the benchmark lease contract.
These points have been illustrated by the numerical results.

The main contribution of this paper is that the proposed two-dimensional contract
with divided regions gives the leaser a way to use customized PM to maintain
reliability when the usage rate classes in the two-dimensional lease region
cannot be known at the time the lease contract takes effect. Although it was
considered from the viewpoint of the two-dimensional lease contract, this method
can also be applied to maintain reliability in a two-dimensional warranty region.
APPENDIX A


For the weighty lessee, the lease expires only when usage reaches the usage limit
u, so we use the reduction in usage to measure the effect of customized PM. The
reliability threshold R_i associated with the i-th usage rate class
r ∈ (r_i, r_{i+1}] can be expressed as


R= exp] ate (ut Zi let xax] dG(r)}, 


(23) 


where Jj(—InR,) is inverse function of 


ta (Uj _y+ui yir 

m (ui = f . (J Sa y(u + Zi) /r+ xax] dG(r); 

(u = Jį(-lnR,)) is an usage interval 

between the (k- pr PM and the k* PM; 
k-1 k—1 

Zi =J ajuj; Uj; => ui; and ui =U; =Zį=0. 
1=0 1=0 

The expected maintenance cost 


The expected maintenance cost in the k PM inter- 
val (0, ut /r] can be calculated as 


Ct (ut /r)=c, 
f “(J pane (u+ Zi) /r+ xpd] dG(r). (24) 


Ur 


After the n, PM, the expected maintenance cost 
is 


cœ =c, A CETA: ndx) dG(r), 


a, Ir 
(25) 


where n,={n, >O|U), S u— u}. 
The expected maintenance cost in the remaining 
usage limit u— u, is expressed as 


ni-l 
MC(R,n,) = ¥Ck(uk [r)+C! + 


k=l 


XC Am) 
k=l 


n,-1 (Uj_ytuj Vr 
ye (Fe a+ Zi yr ands] 


ui lr 


e `" dG(r) 


Tat (u~ )/r 
+ él, (i: „ a +Z,)/r+xļr)d x] (26) 


aG) Ý (Cu). 


The expected penalty cost 


The expected revenue rate in the ķk* 
PM interval (0, uk /r] is given by 
Ri = f rf OOR (y+ Zi) /r+ x| PdxdG(r)/ui. 


UL lr p 
And the expected revenue rate after the n_i-th PM is given by


Tat (um Ir A 
Ri, = f , rf Ak R‘, ((u- u- Zi, )/r+ x |r)dxdG(r) 


/(u − u_1 − Z_{n_i}^i).

Thus, the expected penalty cost in the remaining usage limit u − u_1 can be
expressed as


PC(R,,n,) = H) 


i ma f Wh tae 27 
( ` Rf ý rf E 7 , R (u, + Zi) /r+ x |rdxdG(r) (27) 


+f = rf a Ri(u-u,— Z) )/r+x| rasan) $ 


The expected depreciation cost 


Accumulative usage and age at the lease region Q 
termination are respectively Xo (t)= u- 2 2 B ui 
and T,(t)=u/r. By (5), so the expected deéprecia- 
tion cost is 


DCER, n) = G(u- >. Bul? + 6f” (wl rPaGo). 
kel ' 


(28) 


Similarly, the expected depreciation cost at the 
lease region Q, termination is 


DCP" = Gu? +0, f” (u/rPdG(r). (29) 


Thus, the expected depreciation cost at the 
remaining usage limit u— u, termination is 


DC(R,,n,) = DC2(R,,n,)— DC. (30) 
The total cost model 
By (1), the total cost in the remaining usage limit u − u_1 can be obtained as


TC, (R,,n,)= MC(R,,n,)+ 
PC (R,,n,)+ DC(R,,n,) = 


5 ( c, + HOR, ) 


naf p Ohar l 
J Gi (J. Mut Zier xindx} G(r) 


i 
healt 


+ ( c, + HOR, ) 


i ( fe yt Zr wae) +dG(r) (31) 


n (C Bui ))+ G, ( T u} - “| 4 
af = (u/ry— (u, /ry) dG(r). 


Safety and Reliability - Safe Societies in a Changing World — Haugen et al. (Eds) 
© 2018 Taylor & Francis Group, London, ISBN 978-0-8153-8682-7 


A fuzzy evaluation method based on fuzzy consistency matrix for
evaluating maintenance design program: Case study on heavy vehicle systems


X.J. Yi 
China North Vehicle Research Institute, Beijing, China 


Y.H. Lai 


College of Mechanical and Electrical Engineering, Beijing University of Chemical Technology, Beijing, China 


P. Hou & H.N. Mu 
Beijing Institute of Technology, Beijing, China 


ABSTRACT: This paper takes heavy vehicle systems as a case study to propose a
fuzzy evaluation method based on the fuzzy consistency matrix for evaluating
maintenance design programs, and formulates its process, which consists of
determining the evaluation indexes, developing the hierarchical structure model,
constructing the fuzzy complementary matrix, performing the single hierarchical
arrangement, performing the total weight ordering, and evaluating the level of
the maintenance design program. In order to illustrate that the proposed
evaluation method is reasonable and feasible, its weights are compared with the
results of other methods: the analytic hierarchy process, the fuzzy analytic
hierarchy process based on consistency matrix, and the fuzzy analytic hierarchy
process based on triangular fuzzy numbers. Furthermore, the evaluation result is
compared with that of the evaluation method based on the cloud model. All in all,
this paper proposes a new evaluation method for evaluating maintenance design
programs of repairable systems.


1 INTRODUCTION 

The availability of a product is the key factor determining the stability of
product performance, and it is determined by reliability and maintainability [1].
By carrying out the reliability design of the product, its failure rate can be
reduced. Similarly, maintenance design [2-3] in the product design stage is
needed to achieve economic, convenient and effective maintenance. At present,
designers pay increasing attention to maintainability design. The key factors
affecting the maintenance level are as follows: simplification, standardization
and interchangeability, detection, repairability, mistake proofing, safety,
accessibility, human element engineering, and maintenance time. An important part
of maintenance design is the evaluation of the maintenance design scheme with
comprehensive consideration of the above factors [4-5]. At the same time, the
evaluation results can provide the basis for analyzing the maintainability of the
product and improving its maintenance level. For this kind of evaluation, the
analytic hierarchy process, the cloud model algorithm and so on are often used.
There are two outstanding issues in the evaluation of maintenance design schemes
by such evaluation methods: (1) The results obtained by these evaluation
techniques can be affected by the subjective factors of experts; only when the
influence of the subjective factors is weakened can the evaluation results become
more reasonable. (2) The expert scoring matrix is widely used in this kind of
evaluation method, so it is necessary to keep the evaluation intention of the
experts unchanged.

Thus, based on the above problems, taking heavy 
vehicles as the research object, a fuzzy evaluation 
method based on consistency matrix for evaluat- 
ing maintenance design program is presented. The 
main contributions of this paper are as follows: 


1. As an appropriate scale [6-7] can soften the impact of subjective factors, the
following scales are compared: the 1~9 scale, the deformation scales 9/9~9/1 and
10/10~18/2, and the exponential scales 9^(0/9)~9^(8/9), 2^(0/2)~2^(8/2) and
e^(0/4)~e^(8/4). The results can ease the selection of a scale.

2. A fuzzy evaluation method based on consistency matrix for evaluating
maintenance design program is proposed, and the corresponding evaluation process
is also developed. In this way, the problem of consistency between expert
opinions and the scoring matrix is solved.


The remainder of the paper is organized as fol- 
lows. A fuzzy evaluation method based on consist- 
ency matrix for evaluating maintenance design 
program is proposed in section 2. Section 3 illus- 
trates a practical example, which is a heavy vehicle, 
based on the proposed method. Section 4 conducts 
result analysis with the method based on the cloud 
model. Section 5 provides some conclusions on the 
findings of the research. 


2 THE IMPROVED FUZZY EVALUATION
METHOD BASED ON CONSISTENCY
MATRIX


For proactive maintainability design, maintenance evaluation is necessary, and an
improved evaluation method based on consistency matrix is proposed. The method
combines evaluation with the maintainability design content and weakens the
subjective influence of experts. At the same time, a simple and reliable
maintainability evaluation process is given, which provides design personnel with
a basis for carrying out maintainability design work proactively.


2.1 Scale analysis 


A scale is the tool with which experts evaluate, and the existing scales can be
divided into three categories: three scales, five scales and nine scales. The
nine scales have the highest accuracy, a strong psychological foundation, and the
widest application. The nine scales are mainly: the 1~9 scale, the deformation
scales 9/9~9/1 and 10/10~18/2, and the exponential scales 9^(0/9)~9^(8/9),
2^(0/2)~2^(8/2) and e^(0/4)~e^(8/4). In addition, the exponential scales have
higher requirements on readability, availability, and normalization. Since there
are many kinds of nine scales with large differences between them, a scale should
be chosen only after comparison.

For convenience, we define the natural language 
description and the corresponding scale level as 
shown in Table 1. 

The relationship between the scale value and 
the natural language description can be connected, 
and the result is shown as in Table 2. 

The relative characteristics of different scales 
are shown in Table 3. 

From Table 3, some conclusions can be obtained:


1. All of the scales can keep the rank property 
under a single criterion, but the rank property 
under multiple criteria is unclear. 

2. From the uniformity, memory, and perception, 
1~9 scale is the best, but its consistency is just 
ordinary. Therefore, 1~9 scale is only applicable 
Table 1. The corresponding relationship between scale level and natural language description.

Scale level                    1           3            5             7             9           2, 4, 6, 8
Natural language description   E (equal)   S (slight)   O (obvious)   I (intense)   U (ultra)   intermediate value
Table 2. The corresponding relationship between scale value and natural language description.

Scale              E             S                 O                 I                 U             Formula
1~9                1             3                 5                 7                 9             k
1~√9               1             √3                √5                √7                3             √k
1~9²               1             9                 25                49                81            k²
9/9~9/1            9/9 (1)       9/7 (1.286)       9/5 (1.8)         9/3 (3)           9/1 (9)       9/(10−k)
10/10~18/2         10/10 (1)     12/8 (1.5)        14/6 (2.33)       16/4 (4)          18/2 (9)      (9+k)/(11−k)
9^(0/9)~9^(8/9)    9^(0/9) (1)   9^(1/9) (1.277)   9^(3/9) (2.08)    9^(6/9) (4.327)   9^(9/9) (9)   (illegible)
2^(0/2)~2^(8/2)    2^(0/2) (1)   2^(2/2) (2)       2^(4/2) (4)       2^(6/2) (8)       2^(8/2) (16)  2^((k−1)/2)
e^(0/4)~e^(8/4)    e^(0/4) (1)   e^(2/4) (1.649)   e^(4/4) (2.718)   e^(6/4) (4.482)   e^(8/4) (7.39) e^((k−1)/4)

in cases where high evaluation accuracy is not required. The consistency of the
1~9² scale is not qualified and its other indexes


Table 3. The relative properties of different scales.

Characteristics                      1~9         1~√9        1~9²            9/9~9/1
Retentivity (Single criterion) [8]   Correct     Correct     Correct         Correct
Consistency                          Ordinary    Ordinary    Nonconformity   Ordinary
Uniformity                           Excellent   Excellent   Poor            Ordinary
Memory                               Excellent   Ordinary    Ordinary        Poor
Perception                           Excellent   Poor        Ordinary        Poor
Weight fitting [8]                   Ordinary    Good        Ordinary        Ordinary
Total evaluation                     Good        Ordinary    Poor            Good

Characteristics                      10/10~18/2  9^(0/9)~9^(8/9)  2^(0/2)~2^(8/2)  e^(0/4)~e^(8/4)
Retentivity (Single criterion) [8]   Correct     Correct          Correct          Correct
Consistency                          Ordinary    Excellent        Excellent        Excellent
Uniformity                           Ordinary    Good             Ordinary         Good
Memory                               Poor        Poor             Poor             Poor
Perception                           Poor        Poor             Poor             Poor
Weight fitting [8]                   Good        Excellent        Excellent        Excellent
Total evaluation                     Good        Excellent        Good             Excellent

are not good, so it is not recommended. The
perception of the 1~√9 scale is poor and its
calculation is troublesome, so it can be used when
calculating conditions permit. The uniformity
of the 9/9~9/1 scale and the 10/10~18/2 scale is poor,
and they may be considered where appropriate.

3. Although the memory and perception of the exponential
scales 9^(0/9)~9^(8/9), 2^(0/2)~2^(8/2) and e^(0/4)~e^(8/4) are poor,
their consistency and weight fitting are good. They
are suitable when high accuracy is required,
and can complement the 1~9 scale.


2.2 Improved evaluation method based on 
consistency matrix 


1. Determining the indexes of the object in an 
evaluation 

As the evaluation object has multiple attributes, it
is necessary to determine the indexes that guide the
evaluation. The evaluation result must
be obtained with these indexes under the same
conditions.

2. Establishment of hierarchical model 

The evaluation target is difficult to evaluate
directly, so the object is usually decomposed into
different parts. These parts indirectly represent
the object, and their evaluation results indirectly
represent the evaluation result of the object.
According to the relationships between the parts,
including the subordinate relations, a multi-level
analysis model can be obtained. Factors with
similar character are placed in the same level, i.e.
they are in a parallel relationship. Factors that are
part of, or dominated by, another factor are placed
in the lower layer. The hierarchical model is shown
in Figure 1.

Figure 1. Hierarchical structure diagram.

Although there is no limit on the number of elements
per layer in a rule layer, it is generally not
more than 9. Psychological research shows that
people can make reasonable judgments when the
number of objects is within nine. When there are
more than 9 elements in the same level, the number
of elements per layer can be reduced by
increasing the number of layers.

3. Establishment of consistency matrix 

When experts give an estimation scale value for
every comparison, it is best to choose a memorable
and perceptive scale, for example the 1~9 scale.
Meanwhile, the role of the scale is to translate
the qualitative results of experts into quantitative
values, and the retentivity and consistency of the scale
must meet the requirements so that the quantitative
result does not deviate from the expert's intention. Last
but not least, taking into account the evaluation
habits of experts, the uniformity and weight fitting
indexes are also important. Therefore, experts give
their estimates using the 1~9 scale, and these estimates
are translated correspondingly into the e^(0/4)~e^(8/4) scale to
form the consistency matrix.

4. Hierarchical ranking 

For every layer, the weight values of all elements can be
calculated from the consistency matrix. After
the eigenvector corresponding to the maximum
eigenvalue of the consistency matrix is obtained, the
weight values of the elements in the same layer are
obtained by normalizing this eigenvector.
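
Steps 3 and 4 can be sketched in a few lines of Python. The sketch assumes the conversion k -> e^((k-1)/4) between the 1~9 scale and the exponential scale (a reciprocal judgment 1/k becomes 1/e^((k-1)/4)), which matches the values listed for the e^(0/4)~e^(8/4) scale in Table 2. Power iteration is used here as one common way to approximate the principal eigenvector; the paper does not state which numerical method was used. The 3x3 judgment matrix and the function names are hypothetical.

```python
import math

def to_exponential_scale(judgment):
    """Map a 1~9 judgment k to e^((k-1)/4); a reciprocal 1/k maps to 1/e^((k-1)/4)."""
    if judgment >= 1:
        return math.exp((judgment - 1) / 4)
    k = round(1 / judgment)  # recover k from the reciprocal judgment 1/k
    return 1 / math.exp((k - 1) / 4)

def principal_weights(matrix, iterations=200):
    """Approximate the normalized principal eigenvector by power iteration."""
    n = len(matrix)
    w = [1 / n] * n
    for _ in range(iterations):
        v = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(v)
        w = [x / total for x in v]
    return w

# Hypothetical expert judgments on the 1~9 scale (reciprocal matrix).
judgments = [
    [1,     2,     5],
    [1 / 2, 1,     3],
    [1 / 5, 1 / 3, 1],
]
consistency_matrix = [[to_exponential_scale(a) for a in row] for row in judgments]
weights = principal_weights(consistency_matrix)
print([round(w, 3) for w in weights])
```

The resulting weights sum to one and preserve the ranking implied by the judgments (first element most important, third least).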

5. Total ranking 

By step (4), the weight values of the elements in
each layer can be obtained. The following
assumptions are given: the weight of the m elements
in the (k-1)th layer relative to the object is
$w^{(k-1)} = (w_1^{(k-1)}, w_2^{(k-1)}, \ldots, w_m^{(k-1)})^T$; the weight of the n
elements in the kth layer subject to the jth element in the (k-1)th
layer is $p_j^{(k)} = (p_{1j}^{(k)}, p_{2j}^{(k)}, \ldots, p_{nj}^{(k)})^T$, and the weight
of elements in the kth layer unrelated to the jth element
in the (k-1)th layer is zero; $P^{(k)} = (p_1^{(k)}, p_2^{(k)}, \ldots, p_m^{(k)})$
represents the weight of all elements in the kth layer
subject to the (k-1)th layer. The calculation formulas are
shown in Eqn. (1) and (2).

$$w^{(k)} = (w_1^{(k)}, w_2^{(k)}, \ldots, w_n^{(k)})^T = P^{(k)} w^{(k-1)} \qquad (1)$$

$$w_i^{(k)} = \sum_{j=1}^{m} p_{ij}^{(k)} w_j^{(k-1)}, \quad i = 1, 2, \ldots, n \qquad (2)$$


All the weight values are combined from top to
bottom to get the total weight.
The total weight represents the priority of all
the schemes.
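
The total ranking can be illustrated with a small numerical sketch. It assumes the matrix form w^(k) = P^(k) w^(k-1), where column j of P holds the weights of the layer-k elements under the jth element of layer k-1, and zero marks an element unrelated to that parent. The numbers and the function name are hypothetical.

```python
def total_weights(P, w_prev):
    """Return w^(k) = P^(k) w^(k-1); P has n rows (layer k) and m columns (layer k-1)."""
    return [sum(p_ij * w_j for p_ij, w_j in zip(row, w_prev)) for row in P]

# Hypothetical example: two elements in layer k-1, three elements in layer k.
w_prev = [0.6, 0.4]
P = [
    [0.7, 0.0],  # element 1 belongs only to parent 1
    [0.3, 0.5],  # element 2 is shared by both parents
    [0.0, 0.5],  # element 3 belongs only to parent 2
]
w_k = total_weights(P, w_prev)
print([round(w, 2) for w in w_k])
```

Because every column of P and the vector w_prev each sum to one, the total weights also sum to one.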

6. Set up a set of fuzzy comments 

For the evaluation of maintainability design content,
a set of fuzzy comments is needed for designers.
The set of fuzzy comments is usually divided
into five levels, V = {v1, v2, v3, v4, v5} = {excellent,
good, ordinary, poor, bad}.

7. Establishment of fuzzy evaluation matrix
Through a questionnaire survey of designers, the fuzzy
evaluation matrix is constructed for calculating the
evaluation levels of units. Based on step (6), the
fuzzy evaluation matrix E of all factors in the lowest
level is given in Eqn. (3).

$$E = (e_{ij})_{n \times m} = \begin{pmatrix} e_{11} & e_{12} & \cdots & e_{1m} \\ e_{21} & e_{22} & \cdots & e_{2m} \\ \vdots & \vdots & & \vdots \\ e_{n1} & e_{n2} & \cdots & e_{nm} \end{pmatrix} \qquad (3)$$

where $e_{ij}$ represents the fuzzy evaluation of the ith factor
with respect to the jth evaluation level, given
by designers.

Figure 2. The procedure of proposed fuzzy evaluation
method based on consistency matrix. (Flowchart blocks: set of comments; evaluation of maintainability design content; evaluation of maintenance indexes; consistency matrix; fuzzy evaluation matrix of maintainability design content; weight of maintainability indexes; comprehensive weighted evaluation by weighted average; principle of maximum membership; final evaluation result of maintainability.)


Combined with the weight values above, the comprehensive
weighted evaluation [9], i.e. the membership matrix, can be
calculated from bottom to top, as shown
in Eqn. (4).

$$B = f(w^{(k)}, E) = (w_1^{(k)}, w_2^{(k)}, \ldots, w_n^{(k)}) \begin{pmatrix} e_{11} & e_{12} & \cdots & e_{1m} \\ e_{21} & e_{22} & \cdots & e_{2m} \\ \vdots & \vdots & & \vdots \\ e_{n1} & e_{n2} & \cdots & e_{nm} \end{pmatrix} = (b_1, b_2, \ldots, b_m) \qquad (4)$$

where $b_j$ indicates the membership of the evaluated
object in the evaluation level $v_j$.
8. Final evaluation result
Based on the above analysis results, the final evaluation
result of the system is obtained by the
principle of maximum membership. Meanwhile,
the weak links can be found, providing a basis for
the corresponding correction strategies.

The procedure of proposed fuzzy evaluation 
method based on consistency matrix is shown in 
Figure 2. 


3 CASE STUDY 


Taking heavy vehicles as an example, the maintenance
evaluation is conducted with the proposed evaluation
method. The specific implementation steps
are as follows:


1. Selection of evaluation indexes for heavy vehicles
The key factors affecting the maintenance level are
as follows: simplification (A), standardization and
interchangeability (B), detection (C), repairability
(D), mistake proofing (E), safety (F), accessibility
(G), human element engineering (H), maintenance
time (I). All the above evaluation indexes are
accepted by the relevant personnel and departments.
The first seven indexes are qualitative properties,
and the last two indexes are quantitative properties.
The corresponding detailed maintainability
design content can be listed simply: A1, A2, A3,
B1, B2, B3, B4, C1, C2, C3, D1, D2, D3, E1, E2,
E3, F1, F2, F3, G1, G2, G3, H1, H2, H3, I1, I2, I3.
2. Establishment of hierarchical model
The object is the maintainability of the heavy vehicle
system, and the maintainability indexes are listed
in the rule layer. The lowest layer is the corresponding
detailed maintainability design content. The
hierarchical model is shown in Figure 3.
3. Consistency matrix
Several experts evaluate the elements in rule layer 1
using the 1~9 scale. The evaluation
of the first expert is shown in Table 4.

According to the proposed method and the evaluation
results of the experts, the corresponding consistency
matrix can be obtained, as shown in Table 5.
4. Order ranking
According to Table 5, the weights of the maintainability
indexes can be obtained, as shown in
Table 6.

For every layer, the weight values of all elements
can be calculated from the consistency
matrix. For the same elements, different weights can

Figure 3. The hierarchical model of maintainability of
heavy vehicle system.


Table 4. The evaluation of the first expert.

     1     2     3     4     5     6     7     8     9
1    1     1     1/2   1     5     1     1     1     1
2    1     1     1/2   1     5     1     1     1     1
3    2     2     1     1     8     2     2     1     1
4    1     1     1     1     6     1     2     1     1
5    1/5   1/5   1/8   1/6   1     1/5   1/4   1/7   1/6
6    1     1     1/2   1     5     1     1     1     1
7    1     1     1/2   1/2   4     1     1     1/2   1/2
8    1     1     1     1     7     1     2     1     1
9    1     1     1     1     6     1     2     1     1


be obtained from different experts. Averaging the
weights over the experts gives the final weights of
the maintainability indexes, shown in Table 7.

5. Total ranking

All the weight values are combined from top to
bottom to get the total weight. The
total weight is shown in Table 8.

6. Set up a set of fuzzy comments

For the evaluation of maintainability design content,
the set of fuzzy comments V = {v1, v2, v3, v4,
v5} = {excellent, good, ordinary, poor, bad} is used.
7. Establishment of fuzzy evaluation matrix
Through a questionnaire survey of designers, the
fuzzy evaluation matrix E is obtained, as shown in Table 9.
8. Final evaluation result

Based on the above analysis results and Eqn. (4),
the membership results of all elements in rule layer
1 and of the system are shown in Table 10. Based on the
principle of maximum membership, the final evaluation
result is also shown in Table 10.

We can see from Table 10 that the membership
of the system in the ordinary level is as high
as 46.3%. This indicates that the maintainability
of the heavy vehicle is still insufficient and needs to be
improved urgently. At the same time, according to


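Table 5. Consistency matrix

     1           2           3           4           5          6           7           8           9
1    1           e^(0/4)     1/e^(1/4)   e^(0/4)     e^(4/4)    e^(0/4)     e^(0/4)     e^(0/4)     e^(0/4)
2    e^(0/4)     1           1/e^(1/4)   e^(0/4)     e^(4/4)    e^(0/4)     e^(0/4)     e^(0/4)     e^(0/4)
3    e^(1/4)     e^(1/4)     1           e^(0/4)     e^(7/4)    e^(1/4)     e^(1/4)     e^(0/4)     e^(0/4)
4    e^(0/4)     e^(0/4)     e^(0/4)     1           e^(5/4)    e^(0/4)     e^(1/4)     e^(0/4)     e^(0/4)
5    1/e^(4/4)   1/e^(4/4)   1/e^(7/4)   1/e^(5/4)   1          1/e^(4/4)   1/e^(3/4)   1/e^(6/4)   1/e^(5/4)
6    e^(0/4)     e^(0/4)     1/e^(1/4)   e^(0/4)     e^(4/4)    1           e^(0/4)     e^(0/4)     e^(0/4)
7    e^(0/4)     e^(0/4)     1/e^(1/4)   1/e^(1/4)   e^(3/4)    e^(0/4)     1           1/e^(1/4)   1/e^(1/4)
8    e^(0/4)     e^(0/4)     e^(0/4)     e^(0/4)     e^(6/4)    e^(0/4)     e^(1/4)     1           e^(0/4)
9    e^(0/4)     e^(0/4)     e^(0/4)     e^(0/4)     e^(5/4)    e^(0/4)     e^(1/4)     e^(0/4)     1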


Table 6. Weight of maintainability indexes.

W_A     W_B     W_C     W_D     W_E     W_F     W_G     W_H     W_I
0.114   0.114   0.144   0.124   0.037   0.114   0.102   0.128   0.124

Table 7. Final weight of maintainability indexes.

W_A     W_B     W_C     W_D     W_E     W_F     W_G     W_H     W_I
0.1     0.095   0.129   0.132   0.065   0.103   0.09    0.141   0.143

Table 8. The total weight of all elements.

Rule layer 1                              Weight   Rule layer 2   Weight   Total weight
Simplification                            0.145    A1             0.347    0.035
                                                   A2             0.332    0.033
                                                   A3             0.321    0.032
Standardization and interchangeability    0.133    B1             0.249    0.024
                                                   B2             0.243    0.023
                                                   B3             0.258    0.025
                                                   B4             0.249    0.024
Detection                                 0.091    C1             0.391    0.051
                                                   C2             0.304    0.039
                                                   C3             0.304    0.039
Repairability                             0.074    D1             0.333    0.044
                                                   D2             0.333    0.044
                                                   D3             0.333    0.044
Mistake proofing                          0.146    E1             0.347    0.023
                                                   E2             0.332    0.022
                                                   E3             0.321    0.021
Safety                                    0.145    F1             0.392    0.04
                                                   F2             0.392    0.04
                                                   F3             0.216    0.022
Accessibility                             0.063    G1             0.43     0.039
                                                   G2             0.447    0.04
                                                   G3             0.123    0.011
Human element engineering                 0.121    H1             0.558    0.079
                                                   H2             0.442    0.062
Maintenance time                          0.083    I1             0.36     0.052
                                                   I2             0.64     0.092

Table 9. Fuzzy evaluation matrix E.

Rule layer 2   Excellent   Good   Ordinary   Poor   Bad
A1             1           0      0          0      0
A2             1           0      0          0      0
A3             0.8         0.2    0          0      0
B1             0.8         0.2    0          0      0
B2             0.8         0.2    0          0      0
B3             1           0      0          0      0
B4             1           0      0          0      0
C1             0           0.2    0.6        0.2    0
C2             0           0      0.8        0.2    0
C3             0           0      0.8        0.2    0
D1             0           0      0.8        0.2    0
D2             0           0      0.6        0.4    0
D3             0           0      0.6        0.4    0
E1             1           0      0          0      0
E2             1           0      0          0      0
E3             0.8         0.2    0          0      0
F1             1           0      0          0      0
F2             1           0      0          0      0
F3             0           0      1          0      0
G1             0           0      1          0      0
G2             0           0      1          0      0
G3             0           0      0          0.4    0.6
H1             0.4         0.4    0.2        0      0
H2             0           0      1          0      0
I1             0           0      0.2        0.8    0
I2             0           0      1          0      0


Table 10. The membership and evaluation level of elements in rule layer 1 and object layer.

Rule layer 1                              Excellent   Good    Ordinary   Poor    Bad     Evaluation level
Simplification                            0.094       0.006   0          0       0       Excellent
Standardization and interchangeability    0.086       0.009   0          0       0       Excellent
Detection                                 0           0.01    0.093      0.026   0       Ordinary
Repairability                             0           0       0.088      0.044   0       Ordinary
Mistake proofing                          0.061       0.004   0          0       0       Excellent
Safety                                    0.081       0       0.022      0       0       Excellent
Accessibility                             0           0       0.079      0.004   0.007   Ordinary
Human element engineering                 0.032       0.032   0.078      0       0       Ordinary
Maintenance time                          0           0       0.102      0.041   0       Ordinary
System                                    0.353       0.062   0.463      0.116   0.007   Ordinary


the specific evaluation level of each index in rule 
layer 1, the corresponding maintainability design 
content can be traced. 
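
The system row of Table 10 can be checked against Tables 8 and 9 with a short calculation. The sketch below is an illustration, not the authors' implementation: it applies Eqn. (4) as a weighted sum of the rows of the fuzzy evaluation matrix, using the total weights of Table 8, and then applies the principle of maximum membership. The variable and function names are hypothetical, and the last row of Tables 8 and 9 is assumed to be element I2.

```python
LEVELS = ["excellent", "good", "ordinary", "poor", "bad"]

# (element, total weight from Table 8, membership row from Table 9)
ROWS = [
    ("A1", 0.035, [1, 0, 0, 0, 0]), ("A2", 0.033, [1, 0, 0, 0, 0]),
    ("A3", 0.032, [0.8, 0.2, 0, 0, 0]),
    ("B1", 0.024, [0.8, 0.2, 0, 0, 0]), ("B2", 0.023, [0.8, 0.2, 0, 0, 0]),
    ("B3", 0.025, [1, 0, 0, 0, 0]), ("B4", 0.024, [1, 0, 0, 0, 0]),
    ("C1", 0.051, [0, 0.2, 0.6, 0.2, 0]), ("C2", 0.039, [0, 0, 0.8, 0.2, 0]),
    ("C3", 0.039, [0, 0, 0.8, 0.2, 0]),
    ("D1", 0.044, [0, 0, 0.8, 0.2, 0]), ("D2", 0.044, [0, 0, 0.6, 0.4, 0]),
    ("D3", 0.044, [0, 0, 0.6, 0.4, 0]),
    ("E1", 0.023, [1, 0, 0, 0, 0]), ("E2", 0.022, [1, 0, 0, 0, 0]),
    ("E3", 0.021, [0.8, 0.2, 0, 0, 0]),
    ("F1", 0.04, [1, 0, 0, 0, 0]), ("F2", 0.04, [1, 0, 0, 0, 0]),
    ("F3", 0.022, [0, 0, 1, 0, 0]),
    ("G1", 0.039, [0, 0, 1, 0, 0]), ("G2", 0.04, [0, 0, 1, 0, 0]),
    ("G3", 0.011, [0, 0, 0, 0.4, 0.6]),
    ("H1", 0.079, [0.4, 0.4, 0.2, 0, 0]), ("H2", 0.062, [0, 0, 1, 0, 0]),
    ("I1", 0.052, [0, 0, 0.2, 0.8, 0]), ("I2", 0.092, [0, 0, 1, 0, 0]),
]

def membership(rows):
    """Weighted sum of fuzzy evaluation rows: b_j = sum_i w_i * e_ij."""
    return [sum(w * e[j] for _, w, e in rows) for j in range(len(LEVELS))]

system = membership(ROWS)
level = LEVELS[system.index(max(system))]  # principle of maximum membership
simplification = membership(ROWS[:3])      # rows A1 to A3 only

print([round(b, 3) for b in system], level)
print([round(b, 3) for b in simplification])
```

The computed memberships agree with Table 10 to within rounding of the tabulated weights, and the maximum membership falls on the ordinary level.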


4 RESULT ANALYSIS 


In order to illustrate the accuracy and applicability
of the proposed method, comparisons with other
methods were carried out.

1. Comparison about weight
The result obtained by the proposed method was compared
with the results of other methods: the analytic
hierarchy process (AHP) [7], the fuzzy analytic hierarchy
process based on consistency matrix (FAHP)
[10], and the fuzzy analytic hierarchy process based on
triangular fuzzy numbers (SFAHP) [11]. All the
results are listed in Table 11.

As can be seen from Table 11, the weights
obtained by AHP and the proposed method are very


Table 11. Weight comparison.

                  W_A     W_B     W_C     W_D     W_E     W_F     W_G     W_H     W_I
AHP               0.1     0.095   0.125   0.136   0.066   0.103   0.088   0.141   0.147
Proposed method   0.1     0.095   0.129   0.132   0.065   0.103   0.09    0.141   0.143
FAHP              0.112   0.111   0.112   0.112   0.106   0.112   0.111   0.113   0.112
SFAHP             0.108   0.106   0.119   0.121   0.085   0.11    0.102   0.124   0.125

Table 12. Evaluation results between the proposed
method and cloud model method.

                                          Cloud model method   Proposed method
Simplification                            Excellent            Excellent
Standardization and interchangeability    Excellent            Excellent
Detection                                 Good                 Ordinary
Repairability                             Good                 Ordinary
Mistake proofing                          Ordinary             Excellent
Safety                                    Excellent            Excellent
Accessibility                             Good                 Ordinary
Human element engineering                 Good                 Ordinary
Maintenance time                          Good                 Ordinary
System                                    Good                 Ordinary


close, but the range of the result obtained by the proposed
method is slightly smaller than that of AHP. The
main reason is that, although the judgment matrix
of the 1~9 scale is inconsistent, the error is not very large
due to its high precision. The weights obtained
by FAHP are all close to one another, which results from the poor
precision of the 0.1~0.9 scale. The result obtained by
SFAHP is not good, and the procedure of SFAHP is
more complicated than that of the other methods.
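
The spread of the weights can be quantified directly from Table 11. The following sketch takes the range (largest minus smallest weight) for each method as a simple measure of spread; this measure and the variable names are choices made for illustration.

```python
# Weights W_A ... W_I as listed in Table 11.
weights = {
    "AHP":             [0.1, 0.095, 0.125, 0.136, 0.066, 0.103, 0.088, 0.141, 0.147],
    "Proposed method": [0.1, 0.095, 0.129, 0.132, 0.065, 0.103, 0.09, 0.141, 0.143],
    "FAHP":            [0.112, 0.111, 0.112, 0.112, 0.106, 0.112, 0.111, 0.113, 0.112],
    "SFAHP":           [0.108, 0.106, 0.119, 0.121, 0.085, 0.11, 0.102, 0.124, 0.125],
}

# Range of the weights per method, rounded to the precision of the table.
ranges = {name: round(max(w) - min(w), 3) for name, w in weights.items()}
for name, r in ranges.items():
    print(f"{name}: {r}")
```

The ranges are 0.081 for AHP and 0.078 for the proposed method, against 0.007 for FAHP, whose weights barely distinguish the indexes.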

2. Comparison of evaluation results

The evaluation results obtained by the proposed
method are compared with those of the cloud model method
[12] based on the golden partition evaluator and normal
similarity [13-14]; all the results are listed in
Table 12.

According to Table 12, the qualitative evaluation
results obtained by the proposed method are more
conservative than those of the cloud model method. As
the use of the golden partition evaluator and normal
similarity is complex, the proposed method is
more suitable for the engineering application of
maintenance evaluation.


5 CONCLUSION 


Maintainability initiative design is an important way
to improve the maintainability of equipment, and
maintenance evaluation can effectively capture the
maintenance design content and the maintenance
level. For the engineering application of maintenance
evaluation, an appropriate scale that weakens the influence
of subjective factors can be selected after the
different scales have been compared. According to the
grades that experts give, the consistency matrix is
constructed to solve the problem of the consistency
of expert opinion, and the related method and
procedure are optimized. By using the AHP method,
the expert experience can be translated into weight
values. In addition, the fuzzy comprehensive evaluation
method translates the designers' understanding
of the maintainability design content into
values. Combining both, the fuzzy evaluation of
the system can be obtained. Finally, a reasonable
and simple evaluation process is given to make practical
application more convenient.

The proposed method is also applied
to heavy vehicles to study the maintenance evaluation.
To verify the accuracy of the proposed method,
two comparisons are conducted. The first compares
the weight values obtained by different methods:
the proposed method, AHP, the FAHP based on
fuzzy consistency matrix using the 0.1~0.9 scale, and
the FAHP based on triangular fuzzy numbers. The
result shows that the proposed method has higher
accuracy. The second compares the evaluation
result of the proposed method
with that of the cloud model algorithm. Its result
supports the applicability and
accuracy of the proposed method applied to heavy
vehicles. The efficiency of the evaluation can also
be improved. The proposed method effectively combines
expert experience and maintainability design
content to provide a basis for the follow-up maintenance
initiative design and improvement.

ABSTRACT: Most asset deterioration tools are designed for a specific application; as a consequence, a
small change in the specification may require a complete change of the tool. Inspired by the model-based
approach of separating problem specification from analysis technique, we propose a model-based asset
deterioration assessment framework using probabilistic relational models. The probabilistic relational
models express abstract probabilistic dependencies that cover a range of deterioration modelling assumptions.
An expert in the domain of asset deterioration can then use their knowledge of the factors that affect
deterioration to customise the abstract models to a specific application, without requiring a detailed
understanding of the underlying computational framework. We illustrate the use of the framework with multiple
variants of deterioration models.


1 INTRODUCTION 

Traditionally, inspection and maintenance of 
infrastructure has followed a fixed time interval. 
One idea to make inspection more cost-effective 
is to use a statistical model to predict the rate of 
asset deterioration and use the predictions to plan 
detailed inspections or maintenance. A range of
deterioration models has been developed in different
fields, from railway track (Guler et al., 2011) to
bridges (Sobanjo, 2011). Despite having common
objectives such as condition prediction and using
the prediction to support maintenance planning,
the models differ in many ways. For example, the
deterioration distribution and the grading system
for asset condition may differ depending on the
asset type or the standards set by inspection
agencies. Our aim is to build a unified framework that
is general enough to encode a wide variety of dete- 
rioration models. 

The approach of providing unified tools, so-called
model-based system engineering, has been
advocated in both industry and academia. It
aims to provide descriptive modelling of systems, 
common to different system analysis techniques. 
Importantly, this approach aims to enable decision 
makers to use analyses without a detailed knowl- 
edge of the underlying mathematical models. We 
review previous work in Section 2 and consider 
how it applies to deterioration models. 

In Section 3, we describe a framework for maintenance
domain experts, which does not require
a detailed understanding of the underlying deterioration
models. Our framework extends standard
hierarchical Bayesian models with relational
schema, allowing model variants to be expressed
using domain concepts. In Section 4, we illustrate
the use of the framework with a variety of deterioration
models.


2 MODEL-BASED APPROACH 


The emerging field of model-based system engi- 
neering focuses on bridging the gap between prob- 
lem specification and modelling (Estefan, 2007), 
with an agile modelling formalism to meet differ- 
ent modelling requirements without changing the 
entire tool (Prosvirnova et al., 2013). The model- 
based approach formalises the system develop- 
ment process using a unified language (e.g. SysML 
language) to provide a platform integrating dif- 
ferent modelling approaches and system analysis 
methods. This formalism has been extended to
the safety and reliability domain as so-called
model-based safety assessment (MBSA) (see Lisagor et al.
(2011)). MBSA aims to unify classical safety and
reliability modelling methods (e.g. fault trees and
stochastic processes), and to generate an integrated
structure for a range of safety and reliability analyses
(e.g. fault tree analysis and system diagnosis).
For example, the AltaRica modelling language
(Arnold et al., 1999) separates system specification
from analysis with a range of reusable techniques.


The MBSA concept has been applied in deterioration
assessment in recent years, with the modelling
of stochastic processes and Markov chains for
complex systems developed in the AltaRica 3.0 project
(Prosvirnova et al., 2013). However, to our knowledge,
the current practice of MBSA does not yet
encompass learning from data, which is a component
of deterioration modelling since we
wish to learn deterioration rates from inspection
data. Fortunately, advances in machine learning
provide a promising perspective for tackling these
problems.

Previous studies have shown the power of a hierarchical
Bayesian network (BN) based approach to
learn asset deterioration rates from data, and how
it can be adapted when there is insufficient data,
either with expert knowledge (Frangopol et al.,
2004, Zhang and Marsh, 2018) or by learning
from similar groups (Memarzadeh et al., 2016,
Zhang and Marsh, 2018). In the work of Zhang
and Marsh (2018), six generic BN models for asset
deterioration were developed, which both make it
possible to adopt different deterioration
models and enable us to include alternative
data and unused expert knowledge. These model
variants cannot yet be presented to an asset deterioration
domain expert in a unified framework:
adapting the underlying concepts to a particular
context requires a deep understanding of their
implementation as BNs.

In model-based machine learning (MBML) (see 
Bishop (2013) and Ghahramani (2015)), models 
and problem specifications are defined in a com- 
pact language, while inference or machine learn- 
ing algorithm codes are generated automatically. 
Bayesian networks are such a language, though
they lack structure. More recently, probabilistic
programming languages such as Figaro (Pfeffer,
2009) have been developed, which could also be
used in our framework.

Model-based approaches, both in MBML and
MBSA, often use the object-oriented paradigm to
provide a library of generalized models for reuse.
This is not provided by traditional BNs, which have a
fixed set of variables and relationships. This issue
has been widely researched for BNs, with proposals
including idioms in Neil et al. (2000) and
fragments in Laskey and Mahoney (2000). Probabilistic
relational models (PRMs), developed by
Koller (1999), combine relational structure with
probabilistic graphical models (i.e. BNs). A PRM
combines probabilistic dependencies with a relational
schema that describes the entities in the
problem domain. This representation provides a
separation between model library and structure
relationships.

Therefore, we propose to develop a model-based
framework for asset deterioration assessment in
the spirit of MBSA. The framework separates
reusable low-level models from modelling choices
and asset descriptions. The framework is encoded
with a PRM representation of a hierarchical Bayesian
network: each of a range of generalised models
for asset deterioration is represented by its
probabilistic dependencies, and the problem specification
of the target domain is represented as the
relational schema.


3 MODEL-BASED ASSET DETERIORATION ASSESSMENT FRAMEWORK

3.1 Asset deterioration model using hierarchical BNs


3.1.1 A simple deterioration model 

For a system that is either working or failed, given 
historical data on the times that it remained in 
working condition, we can estimate the distribu- 
tion of time for its transition to the failed state and 
so predict its likelihood of failing. A basic deterio- 
ration BN model, from Zhang and Marsh (2016), 
is shown in Figure 1. This is a hierarchical BN 
model that both learns from data and can be used 
for decision support. 

However, this specific model can only be used
to describe a type of asset with two states and
deterioration that follows a one-parameter distribution.
This is not usually the case in asset deterioration: for
example, a 4-point grading system is used to describe
bridge condition, and a two-parameter Weibull
distribution is used to fit the bridge transition distribution
in Sobanjo (2011). So instead, the model has to
be adapted: the variables are similar but the number
of them and the links between them must change.


Figure 1. A simple deterioration model.

Figure 2. Effects of prior knowledge and data quantity
in distribution training.


The structure of the model may also need
to change when we have limited domain data.
Although the parameters of the transition distribution
are learnt from the historical transition
data, some prior knowledge of the transition distribution
is also required. In Figure 2, two different
prior probability distributions have been used:
i) an uninformative prior and ii) an informative
prior, available when there is good knowledge
of the deterioration. Figure 2(a) shows
that with a good prior, we can provide a good estimate
of the true distribution with only a little data,
while Figure 2(b) shows that a larger dataset will give
a correct estimate of the parameter even when the prior
is weak.
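
The interplay between prior and data can be illustrated with a conjugate update. The sketch below is a stand-in for the BN model, not the model itself: it assumes exponentially distributed sojourn times with an unknown rate and a Gamma(shape, rate) prior on that rate, for which the posterior is Gamma(shape + n, rate + sum of observed times). All numbers and names are hypothetical.

```python
def posterior_rate_mean(prior_shape, prior_rate, sojourn_times):
    """Posterior mean of an exponential rate under a Gamma(shape, rate) prior."""
    shape = prior_shape + len(sojourn_times)
    rate = prior_rate + sum(sojourn_times)
    return shape / rate

informative = (10.0, 100.0)  # prior mean of 0.1 transitions per year
weak = (0.01, 0.01)          # nearly uninformative prior

few = [2.0, 3.0, 4.0]            # hypothetical small, unrepresentative sample
many = [5.0, 10.0, 15.0] * 100   # hypothetical large sample with mean 10 years

for label, data in (("few", few), ("many", many)):
    print(label,
          round(posterior_rate_mean(*informative, data), 3),
          round(posterior_rate_mean(*weak, data), 3))
```

With only three observations the informative prior keeps the estimate near 0.1 while the weak prior follows the small sample; with 300 observations both priors give essentially the same estimate.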

However, when failure data is scarce (which
is the usual case for slowly deteriorating assets, for
example, bridges) or knowledge is poor for a
particular asset class (which is also usual for new
assets), we can combine data from asset types
that, though not identical, are similar and so are
believed to have a similar deterioration rate
(Morcous, 2011). This kind of approximation
is necessary, especially for asset types that are
inspected infrequently so that the deterioration
dataset is not large enough for each asset type.
Two techniques are proposed in Zhang and Marsh
(2016): one is to add another layer of parameters
(hyper-parameters) to form a hierarchical BN that
can group or pool data from different asset types
sufficiently to overcome an uninformative prior;
the other technique is to use influencing factors
to adjust the transition distribution of a specific
asset from the distribution learned from a pool of
similar assets (i.e. assets of the same type). Both
methods may change the BN’s structure depending
on how assets are assigned to groups and what
other factors influence the transition times. We
refer to these (and related) issues as ‘modelling
assumptions’.

Figure 3. Stages of the model-based asset deterioration assessment framework.


3.1.2 A framework for expressing modelling 
assumptions 

To address the problem of many variants of the
deterioration models, we provide a framework to
help domain experts express modelling assumptions.
The stages of the model-based asset deterioration
assessment framework are illustrated in
Figure 3:

• The model library encodes the possible dependencies
of probabilistic models in the problem
domain.

• Model selection uses modelling assumptions to
determine what models, knowledge and data
are included in the problem model.

• The relational database includes the configuration
and failure data; its schema is derived from
the modelling assumptions.

• Instantiation and inference, performed automatically,
are used to evaluate queries on the
model for domain decision support.

The following sections describe each aspect of
the framework.


3.2 Model library: Abstract PRM 


The generic models in the model library are rep- 
resented as abstract probabilistic relational models 
(PRMs). Figure 4 shows an example developed 
from Zhang and Marsh (2018). 

In Figure 4, a square (called a class) defines
a group of identical objects that share the same
set of variables or probabilistic models. An oval
defines a variable, and a directed edge defines a
dependency between variables. Aggregation is defined
by a bold arrow. N represents a fixed multiple
relationship, and * represents a multiple relationship
of uncertain degree. The classes are as
follows:


Class        Purpose

Asset        We wish to predict the state of a specific
             asset, conditioned on its previous inspection
             and the deterioration data of similar
             assets. This prediction will inform decisions
             about maintenance and inspection.

Transition   Objects of this class represent transitions in
             a Markov chain, where the conditional
             probability of moving into the future state
             S_{t+1} at time t+1 given the present state
             S_t at time t follows a distribution with
             parameters learnt from data.

Data         Data is gathered from inspections, each
             giving information about the current
             state of an asset. Different types of data
             are used: for example, it is common to
             have only censored data giving a time
             after which a transition occurred. This is
             modelled as constraints on the transition
             time.

Group        Assets of the same type form a group. A
             group is represented in the model by the
             parameters of the distribution (for each
             transition) learnt from historical data for
             this type of asset. Since the parameters
             are learnt, their values are uncertain and
             the model includes them as probabilistic
             variables.

Parameter    The population as a whole also has distribution
             parameters. The similarity of each
             group of assets to the population as a
             whole is judged and this establishes a way
             to learn a group’s distribution parameters
             from data of other closely related groups.

Factor       The idea of assets of the ‘same type’ is
             defined in relation to properties of the
             asset that influence deterioration rate.
             However, if all relevant factors are
             used to distinguish groups then there is
             likely to be insufficient historical data
             to estimate the transition distribution
             parameters. Therefore, groups need to
             be defined by the factors that are most
             important (and are known for all assets).
             Other factors (for example, loading and
             environment condition) can be used to
             estimate a target asset’s distribution.

We describe this model as ‘abstract’ as it represents
a number of different structures. In particular,
the following issues need to be resolved to give a
specific model:


• The distribution (Weibull or exponential) used
for the transition and the number of parameters
needed.

• The population priors.

• The number of states of deterioration and
therefore the number of transitions between
states.

• The number of asset groups (or types) and the
factors (e.g. material used in construction) used
to define membership of a group.

• The similarity of each group to the population
as a whole.

• The remaining factors that adjust the transition
distribution, possibly varying by group, and the
weighting used to aggregate the effect of these
factors.

• The types of data available.

Figure 4. Probabilistic dependencies represented as an
abstract PRM.

Figure 5. Process to customise the abstract PRM.


Although these issues are uncertain, they are
not part of the probabilistic reasoning. Instead,
they are decisions made by domain experts
to apply the generic model to a specific situation.
The next two subsections describe how this
is done: the first covers ‘modelling assumptions’
and the second covers asset classification. Together,
as shown in Figure 5, these processes turn the
abstract probabilistic model into a model that
can be run.


3.3 Customising the abstract PRM with
modelling assumptions


This aspect of the customisation covers four issues: 
a) the choice of transition distribution; b) the 
number of deterioration states; c) prior distribu- 
tions and d) available inspection data. 


3.3.1 Transition distribution and parameters 
Different distributions can be used to estimate transition
times, chosen by their goodness of fit. Goodness of fit is
usually assessed by visual inspection and by statistics or
hypothesis tests, such as the coefficient of determination (R²)
and the Anderson-Darling (AD) test (Mendenhall et al., 2012).
A range of studies has sought the best-fitting distribution
for asset state sojourn times. For example, the exponential
distribution has been used for railway track (Guler et al., 2011)
and the Weibull distribution for bridges (Sobanjo, 2011).
The number of parameters in the distribution’s survival function
fixes the number of instances of the Parameter class
for each Transition. An example is shown on the
left side of Figure 6: there are two instances of the
Parameter class when the guideline indicates that a Weibull
distribution is normally adopted in practice.


3.3.2 Deterioration states used for grading 

Each asset is usually rated with a state representing
its functionality. For example, a four-point grading system
is used in Sobanjo (2011). Grading systems are
normally adopted from industry standards and are
often used to identify and prioritise maintenance
actions. They vary between infrastructure types,
countries and, sometimes, inspection agencies. An
n-state grading system results in n − 1 transitions,
each represented by an instance of the Transition class.
An example is shown on the right side of Figure 6:
there are two instances of the Transition class since the
asset is rated by a three-state grading system.
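
The two counting rules in Sections 3.3.1 and 3.3.2 (one Parameter instance per distribution parameter, and n − 1 Transition instances for n states) can be sketched in a few lines. This is only an illustrative sketch: the names `PARAMETERS`, `Transition` and `build_transitions` are hypothetical and are not part of the framework described in the paper.

```python
from dataclasses import dataclass, field

# Parameter names per distribution (standard parameterisations; the
# names themselves are illustrative assumptions).
PARAMETERS = {"exponential": ["rate"], "weibull": ["shape", "scale"]}


@dataclass
class Transition:
    from_state: int
    to_state: int
    parameters: list = field(default_factory=list)


def build_transitions(distribution: str, n_states: int) -> list:
    """Create n_states - 1 transitions, each carrying one parameter
    name per parameter of the chosen distribution."""
    if n_states < 2:
        raise ValueError("a grading system needs at least two states")
    names = PARAMETERS[distribution]
    return [Transition(s, s + 1, list(names)) for s in range(1, n_states)]


# Three-state grading with a Weibull transition distribution, as in Figure 6.
transitions = build_transitions("weibull", n_states=3)
```

With these settings the sketch yields two transitions with two parameters each, matching the two customisations illustrated in Figure 6.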


3.3.3 Asset deterioration characteristics: Prior 

Classical statistical methods, such as maximum
likelihood estimation or the least squares method, can be
used to estimate priors if the data are sufficient. An
alternative source is the expertise of experienced
engineers (Welte and Eggen, 2008), from whom a
prior range can be elicited. In addition, each group
of assets is characterised by its degree of
similarity to the overall population. These group
parameters are modelled using a truncated error
distribution, with a mean inherited from the value
of the learnt hyper-parameters and a variance
defined by the elicited degree of similarity.

[Figure 6 panel labels: Weibull Distribution; Three-States Grading]

Figure 6. Example customisations of transition distribution
and grading system.


3.3.4 Inspection data: Inferring state sojourn time 
Continuous monitoring can provide exact transition
times but periodic inspection is more common.
In periodic inspection, the state of an asset is
only known at the inspection times. Therefore, the
time an asset stayed in a state before deteriorating
to another state is only bounded by the inspection
results. Fortunately, this type of censored data
can be modelled by different variants of the Data
class. For example, periodic inspection requires
three types: left, interval and right censored,
representing the state transition happening before,
between and after the inspections respectively.
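
The text does not write out how each variant of the Data class enters the likelihood. As a hedged reminder, the standard likelihood contributions for censored sojourn times are given below; the symbols T, F, f, R, a, b and θ are notation introduced here, and time is assumed to be measured from entry into the state.

```latex
% T: sojourn time; F, f and R = 1 - F: its CDF, density and survival function
% \theta: parameters of the chosen transition distribution
L_{\text{exact}}(\theta)    = f(t \mid \theta)                          % transition observed at time t
L_{\text{left}}(\theta)     = F(b \mid \theta)                          % transition before the inspection at b
L_{\text{interval}}(\theta) = F(b \mid \theta) - F(a \mid \theta)       % transition between inspections at a < b
L_{\text{right}}(\theta)    = R(a \mid \theta) = 1 - F(a \mid \theta)   % no transition by the inspection at a
```

The exact case corresponds to continuous monitoring; the other three correspond to the left, interval and right censored variants named above.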


3.4 Asset classification 


Asset deterioration rate may be influenced by
many contributing factors, such as age, loading
and environment (Fu and Devaraj, 2008; Wellalage
et al., 2014). By classifying assets into groups with the
same factor levels, we can expect each group to show a
similar pattern of deterioration (Veshosky et al.,
1994). Our framework provides two ways to adjust
deterioration based on such factors:


1. Grouping: some factors are chosen to define
   groups of assets so that historical data can be
   pooled and used to learn asset deterioration
   parameters. The number of groups fixes the
   number of instances of the Group class in the
   abstract probabilistic model.

2. Adjusting: other factors are used to adjust the
   deterioration rates learnt from data. A range
   of studies has sought to identify these
   impact factors; for example, closeness to the
   coast, galvanic response level and structure type
   are identified in Yianni et al. (2016) as the key
   contributing factors for railway bridge deterioration.
   The number of these factors fixes the
   number of instances of the Factor class. The
   importance of each factor in the aggregated
   effect is modelled by a weight, and this is used to
   shift the effect of the learnt parameters.


3.5 Model instantiation 


The final stage of the process shown in Figure 5 is
the instantiation of the BN. This step is performed by
the framework, as the domain expert has expressed
all the information needed in the steps described in
Sections 3.3 and 3.4.

As Figure 3 shows, the customisation process
also defines the schema of the database that holds
both configuration parameters and the records of
assets and inspection results. The class of each
asset is either defined directly or inferred from the
values of factors held in the database.

Suppose that the investigated asset is x, where x
belongs to group g, which is similar to group h. We
suppose that some time has elapsed since x was last
inspected and we wish to estimate its current state.
The following steps are involved in instantiating the
BN:


1. Create transition variables for each deterioration
   stage of asset classes g and h.

2. Create hyper-parameters for each transition
   of the group and at the population level. The group
   parameters approximate the population ones,
   using the similarity degree defined for the group.

3. Create the variables for the state of x and its
   adjustment by the factors (the ones that do not
   define the grouping). The values of the factor
   variables are extracted from the database and
   used as observations in the probabilistic calculation,
   but note that the model still operates if any
   of the values are missing (provided that priors
   have been provided).

4. Create variables for all the inspection data,
   taken from the database, for both groups g and
   h. Starting from the current state, the inspection
   reports show either that one transition has
   occurred since the previous inspection or that
   no transition has occurred. By selecting the
   appropriate variant of data object, both of these
   observations can constrain the transition time.
   It is even possible for more than one transition
   to have occurred between inspections.


For illustration, we assumed that it was known
that data for group h needed to be included alongside
data for assets in group g, the group to which x
belongs. Two steps are involved in automating this.
Firstly, we need to determine whether the number
of observations in group g is sufficient to give a
good estimate of the deterioration rate. We could
evaluate this either from the variance of the learnt
parameters or from a threshold on the absolute
number of observations. The second step is to find the group
that is closest in characteristics to g, perhaps by the
proportion of the values of the group-defining
factors that the two groups share.
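
The two automation steps could be sketched as follows. This is a minimal illustration only: the function names, the factor names and values, and the threshold of 30 observations are all hypothetical choices, and the threshold-on-count test is just one of the two options mentioned in the text.

```python
def needs_support(n_observations: int, threshold: int = 30) -> bool:
    """Step 1: threshold test on the absolute number of observations.
    The default of 30 is an arbitrary placeholder, not a value from the paper."""
    return n_observations < threshold


def shared_proportion(a: dict, b: dict) -> float:
    """Proportion of group-defining factors on which two groups agree."""
    keys = a.keys() & b.keys()
    if not keys:
        return 0.0
    return sum(a[k] == b[k] for k in keys) / len(keys)


def closest_group(target: str, groups: dict) -> str:
    """Step 2: name of the group sharing the most factor values with `target`."""
    others = [name for name in groups if name != target]
    return max(others, key=lambda name: shared_proportion(groups[target], groups[name]))


# Hypothetical group definitions (factor -> level).
groups = {
    "g": {"material": "steel", "span": "short", "traffic": "heavy"},
    "h": {"material": "steel", "span": "short", "traffic": "light"},
    "k": {"material": "masonry", "span": "long", "traffic": "light"},
}

# Group g has only 12 observations here, so a supporting group is sought.
support = closest_group("g", groups) if needs_support(12) else None
```

In this toy setting group h shares two of the three factor values with g and is therefore selected as the supporting group.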


3.6 Inference and refinement 


Inference on the ground BN is performed automatically
in the model-based machine learning
framework. As suggested previously, probabilistic
programming languages with an extensive list of
inference algorithms can be adopted. For example,
when dealing with a hybrid Bayesian network that
contains both discrete and continuous variables,
Gibbs sampling can be used.

The query, to predict the unknown state of
asset x, has to be expressed in domain terms
and translated to a query on a BN variable. The
result is a probability distribution over the possible
states, and further ‘decision rules’ or guidelines
are needed to determine an action. For example,
even a small probability of the worst state of deterioration
may determine that an urgent inspection is
required.
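
A decision rule of this kind might look like the sketch below. The state names, the action labels and the 5% threshold are hypothetical policy choices made for illustration; the paper does not prescribe them.

```python
def choose_action(state_probs: dict, worst_state: str, threshold: float = 0.05) -> str:
    """Map a posterior distribution over states to an action.
    `threshold` is a hypothetical policy value, not taken from the paper."""
    total = sum(state_probs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    if state_probs[worst_state] >= threshold:
        return "urgent inspection"
    return "routine inspection"


# Hypothetical query result for asset x.
posterior = {"good": 0.70, "fair": 0.22, "poor": 0.08}
action = choose_action(posterior, worst_state="poor")
```

Here the worst state has a probability of only 0.08, yet this exceeds the chosen threshold, so the rule calls for an urgent inspection.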

Model evaluation can be made by comparing
different variants of the model, with
metrics such as predictive accuracy or computational
speed. Further refinement of data sources (e.g.
from other source domains), expert knowledge (e.g.
different experts or different types of expertise),
and variations in the models (e.g. different groupings
of assets) are possible. This process repeats
until an acceptable level of performance is reached
or the available resources are exhausted. As a future study,
building on the success of automatic inference software
(Bishop, 2013), the refinement process could be
automated, with a defined threshold, in a
model-based machine learning framework.


4 REPRESENTING ASSET
DETERIORATION MODELS WITH
DIFFERENT STRUCTURES


Customised instantiations of this framework for
practical uses have been developed, for example
for general maintenance problems in Zhang and
Marsh (2016) and for rail bridge deterioration in
Zhang and Marsh (2018). Since the focus of this
paper is to show how the model-based asset deterioration
framework deals with the different modelling
assumptions that may arise in practice, we
present a series of instantiation variants of a basic
deterioration model, which is sufficient to convey
the idea.


4.1 The simple deterioration model 


The basic deterioration model is shown in Figure 1,
and Table 1 shows its customisation settings.
Note that since the focus of this section is the
model structure, we only show the settings that
influence the shape of the model structure.
From the practice guideline, we decide that there is
only one transition (because the asset has only binary
states: working or failed), one group (the entire
population has only one subgroup, so the prior of
the hyper-parameter becomes the prior of the parameter),
and one parameter (because the distribution
is exponential). The expression of the transition follows
the distribution parameter in the Parameter class,
while the parameter’s prior is given by experts, here
a uniform distribution between 0 and 0.05. Four
sojourn time data are inferred from the inspection
data, and three of them are censored.

Table 1. The setting of the model in Figure 1.

                Transition distribution   Grading system   Prior                  Sojourn time
Customisation   Exponential (λ)           Binary states    λ ~ Uniform(0, 0.05)   15 < sojourn < 35
                                                                                  12 < sojourn < 24
                                                                                  12 < sojourn < 36
                                                                                  sojourn = 24

Figure 7. Deterioration follows a two-parameter distribution.
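
To make the Table 1 settings concrete, the sketch below approximates the posterior of the exponential rate λ on a grid. It is not the inference engine of the framework (which is not shown in the paper); it only illustrates, under the assumption that the three bounded sojourn times are interval censored and the fourth is exact, how the Uniform(0, 0.05) prior and the four observations combine. All variable names are introduced here for illustration.

```python
import numpy as np

# Sojourn observations from Table 1 as (lower, upper) bounds;
# lower == upper marks an exactly observed sojourn time.
observations = [(15.0, 35.0), (12.0, 24.0), (12.0, 36.0), (24.0, 24.0)]


def log_likelihood(lam: float, obs) -> float:
    """Exponential log-likelihood with exact and interval-censored data."""
    total = 0.0
    for lo, hi in obs:
        if lo == hi:
            # Exact observation: log density, log(lam) - lam * t.
            total += np.log(lam) - lam * lo
        else:
            # Interval censoring: log(F(hi) - F(lo)) = log(R(lo) - R(hi)).
            total += np.log(np.exp(-lam * lo) - np.exp(-lam * hi))
    return total


# Uniform(0, 0.05) prior: constant on the grid, so the posterior is
# proportional to the likelihood evaluated at each grid point.
grid = np.linspace(1e-4, 0.05, 2000)
log_post = np.array([log_likelihood(lam, observations) for lam in grid])
post = np.exp(log_post - log_post.max())
post /= post.sum()  # normalised weights over the grid points

posterior_mean = float(np.sum(grid * post))
```

The resulting `post` array is a discrete approximation of the posterior over λ, from which summaries such as `posterior_mean` can be read off.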

Starting from this model, a range of variants can
be derived to represent different assets whose
underlying modelling assumptions vary.


4.2 Variant 1: Deterioration that follows a two- 
parameter distribution 


Extending the basic model, Figure 7 presents
a deterioration model of an asset that follows a
two-parameter distribution. Studies have found that
transition probabilities between states of some
assets are better fitted by distributions with two
or even more parameters. For example, two-parameter
Weibull distributions for bridge deterioration
are suggested in Ng and Moses (1998) and
Sobanjo (2011).

This is achieved by the one-to-many relationship
encoded in the relational database. Each
instance of the Transition class is linked to two identical
instances of the Parameter class.


4.3 Variant 2: Deterioration under multi-states 


An asset may degrade through several stages representing
the decrease of its functionality. For safety and
reliability reasons, assets can be rated from perfectly
working to completely failed, with
several intermediate states. Extending the last
two subsections, Figure 8 shows two deterioration
models for a multi-state asset.

Figure 8. Deterioration under multi-states: (a) Markov
chain based; (b) semi-Markov chain based.

Degradation of a multi-state asset is modelled
in the form of a Markov chain (Figure 8 (a))
by a sequence of states (represented by the transition
nodes) representing the condition of an asset
over time. Markov chain deterioration models are
widely accepted for modelling the life-cycle
performance of most assets (Frangopol et al., 2004), but they
rest on the assumption that the transition probabilities
between states follow a stationary
transition rate, which does not change over time.
This property implies that the sojourn time follows an
exponential distribution regardless of how long the asset has
been in the current state (Ng and Moses, 1998).
This modelling is performed by the one-to-many
relationship in the relational database: one
instance of the Asset class with two instances of the
Transition class.

This restriction can be relaxed by a semi-Markov
model (Figure 8 (b)), which allows the
transition probabilities to follow non-stationary
distributions depending on the current state and the
next state. This extension enables the modelling
of multi-state deterioration that follows
multi-parameter distributions. This modelling is
performed by the many-to-many relationship: one
instance of the Asset class with two instances of the
Transition class, and each instance of the Transition class
linked to two instances of the Parameter class.
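
The difference between the two variants can be checked numerically. The sketch below compares the probability of surviving a further 10 time units for a new asset and for one that has already spent 30 time units in the state; the rate, shape and scale values are hypothetical and chosen only for illustration.

```python
import math


def exp_survival(t: float, rate: float) -> float:
    """Survival function of the exponential distribution."""
    return math.exp(-rate * t)


def weibull_survival(t: float, shape: float, scale: float) -> float:
    """Survival function of the two-parameter Weibull distribution."""
    return math.exp(-((t / scale) ** shape))


def conditional_survival(survival, elapsed: float, extra: float, **params) -> float:
    """P(T > elapsed + extra | T > elapsed) for a given survival function."""
    return survival(elapsed + extra, **params) / survival(elapsed, **params)


rate, shape, scale = 0.05, 2.0, 20.0  # hypothetical parameter values

# Exponential sojourn time (Markov chain): time already spent is irrelevant.
exp_new = conditional_survival(exp_survival, 0.0, 10.0, rate=rate)
exp_old = conditional_survival(exp_survival, 30.0, 10.0, rate=rate)

# Weibull sojourn time with shape > 1 (semi-Markov): ageing matters.
wei_new = conditional_survival(weibull_survival, 0.0, 10.0, shape=shape, scale=scale)
wei_old = conditional_survival(weibull_survival, 30.0, 10.0, shape=shape, scale=scale)
```

The two exponential values coincide (the memoryless property behind the stationary transition rate), whereas under the Weibull model the asset that has been in the state longer is less likely to survive the next 10 time units.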


4.4 Variant 3: Learning from similar assets 


Assets classified into different groups may share
similar deterioration rates, which gives the potential
to learn from other groups. Two types of learning from
similar assets are presented:

1. Figure 9 shows an example of pooling all the
   available data to learn a universal distribution
   in the domain, and distinguishing a specific asset
   by defining the influence of aggregated external
   factors on deterioration rates. A suggested use
   of this form is when the entire
   population has little data. This is achieved by
   the instantiation of the Factor class.

2. Figure 10 shows an example of pooling the available
   data within each group to learn
   the group’s own distribution, governed by a
   shared hyper-parameter introduced by the hierarchical
   BN. This hyper-parameter helps
   transfer learning to the weakly learnt group
   (typically the target group with little data) from
   strongly learnt groups (source groups with lots
   of data). A suggested use of this form is
   when the groups are highly correlated.
   This is achieved by the introduction
   of the class hierarchy by adopting layer 3 data
   sources.

Figure 9. Distinguish an asset’s deterioration from
factors.

Figure 10. Learning from other groups hierarchically.
5 CONCLUSION AND FUTURE STUDY 


We have argued the need for a generalised
asset deterioration framework encoded by probabilistic
relational models, which can be adapted
to model assets with different modelling assumptions.
The emerging field of model-based approaches
gives us a suitable formalism for separating specifications
from analysis techniques, and we have
applied this to asset deterioration.

We also used several variants of the deterioration
model to demonstrate how it can be adapted
to a variety of applications with different modelling
assumptions. In the future, we hope to extend
the number of models in the model library, and
extend the applications to more safety and reliability
related problems.
Industry 4.0 and real-time synchronization of operation and
maintenance

Department of Mechanical and Industrial Engineering, NTNU - Norwegian University of Science
and Technology, Trondheim, Norway


ABSTRACT: Industry 4.0 represents a trend in manufacturing which includes cyber-physical systems, 
the Internet of things, cloud computing and cognitive computing. Cyber-Physical Systems (CPS) refers to 
smart systems that include engineered interacting networks of physical and computational components. 
The term digital twin refers to a digital replica of physical assets, processes and systems that can be used 
in real time for control and decision purposes. The digital twin representation is seen as a prerequisite 
for effective synchronization of operation and maintenance within the manufacturing industry as well 
as in other industries. The relation between production plans and activities and actual production can to
some extent be described by deterministic models. The relation between maintenance plans and activities and the
production system availability, on the other hand, requires a probabilistic representation. The term stochastic
digital twin is therefore introduced. An ambition of Industry 4.0 is to support real-time processing when- 
ever possible. This paper discusses elements of Industry 4.0. A case study is provided to demonstrate these 
terms and challenges to the mathematical modelling required for optimal synchronization of operation 
and maintenance. 


1 INTRODUCTION


1.1 Background

Nowadays Industry 4.0 and digitalization are frequently
used terms for the changes that are taking
place in industry, civil engineering, transportation,
public services and so on. The “4.0” refers to the fourth
industrial revolution and points to the opportunities that
communication over the internet gives with respect
to real-time control of processes at almost every level.
Industry 4.0 and related concepts such as cyber-physical
systems, the internet of things, cloud computing and digital
twins give new opportunities for both production
and maintenance but, even more importantly, for the
synchronization and coordination of the two.


1.2 Objective

The objective of this paper is to clarify basic terms
and elaborate on basic elements of Industry 4.0 in
relation to real-time synchronization of operation
and maintenance. A case study from the railway
sector is used to exemplify the concepts.


2 DEFINITIONS AND CONCEPTS

Industry 4.0 is a collective term particularly used
in manufacturing to emphasize technologies and
concepts of value chain organizations. Further,
the terms Cyber-Physical Systems, the Internet of
Things, Cloud computing and the Digital Twin are
often used in relation to Industry 4.0. Although
the term originates from the manufacturing industry,
the elements of Industry 4.0 are relevant for
most businesses.

The current usage of the term Industry 4.0 has
been criticized as essentially meaningless. The 4.0
points to the fourth industrial revolution under the
premise that digitalization is the really new thing.
But why digitalization and not nanotechnology?
Further, the content of Industry 4.0 also seems to
vary from industry to industry, and from author
to author. From a scientific point of view it might
therefore be better to avoid a precise definition and
rather focus on Industry 4.0 elements.

The aim of this paper is to shed light on Industry 4.0
elements that are relevant for the interaction
between production and maintenance. Here
production has a very broad meaning: it could
cover manufacturing, logistics, transportation systems,
hospitals, power supply and so on.

The Internet of Things (IoT) is the network of
physical devices, production facilities, cars, airplanes
and, in general, items embedded with electronics,
software, sensors, actuators, and network
connectivity which enable these objects to connect
and exchange data. Each “thing” is able to interoperate
within the existing Internet infrastructure.
The IoT allows objects to be sensed or controlled
remotely across existing network infrastructure,
creating opportunities for more direct integration
of the physical world into computer-based systems.
When IoT is augmented with sensors and actuators,
the technology becomes an instance of the more
general class of Cyber-Physical Systems (CPS).

The term cyber-physical systems (CPS) refers to smart
systems that include engineered interacting networks
of physical and computational components.

Cloud computing is an information technology
paradigm that enables access to shared pools of
configurable system resources. Companies can
focus on their core businesses instead of expending
resources on computer infrastructure and maintenance.
Downsides of such a strategy could be
unexpected operating expenses, if administrators
are not familiar with cloud-pricing models, as well as
vulnerabilities and security issues. In some presentations
the term Internet of Services (IoS) is used
rather than cloud computing.

The term digital twin refers to a digital replica 
of physical assets, processes and systems that can 
be used in real-time for control and decision pur- 
poses. The digital twin representation is seen as a 
prerequisite for effective synchronization of opera- 
tion and maintenance within the manufacturing 
industry as well as in other industries. The relation
between production plans and activities and
actual production can to some extent be described
by deterministic models. The relations between
maintenance plans and activities and the production
system availability, on the other hand, require
probabilistic representations.

A stochastic digital twin is a computerized model 
of the stochastic behaviour of a system where the 
model is updated in real time based on sensor infor- 
mation and other information accessed via the 
internet and the use of cloud computing resources. 

To be useful a digital twin needs “what-if” 
capabilities. This means that the decision makers, 
i.e., humans or computers, shall be able to “ask” 
the digital twin what will be the consequences of 
various decisions. For a stochastic digital twin this 
means that the “answer” is given as a set of prob- 
ability statements. 

A real-time model is a model where it is possible
to obtain values of system performance and system
states in real time. By real time we mean that
data referring to a system are analysed and updated
at the rate at which they are received. As for the digital
twin, a real-time model typically connects to the
“real world” via the IoT, although other means of
communication are also possible. A real-time model
is also referred to as an on-line model.

A test model is a mathematical model describing
relations between future and current values of the
variables of interest, but where we are not able to
monitor system performance and system states in
real time. Such a model is often referred to as an
off-line model. A test model is still valid for
establishing decision rules to be used in real time.

Most methods and models used in production
planning and optimization, as well as in maintenance
planning and optimization, are off-line
models. These models can be used for establishing
optimal strategies, but they cannot give real-time
decision support. A real-time model is often used to
describe a limited part of a system, whereas a digital
twin aims at giving a complete digitalized representation
of the system and decision processes.

A real-time decision support system is a system
where relevant data are collected and processed into
relevant information in real time. This means that
the raw data stream is automatically collected and
processed into information. The information is further
interpreted in such a way that it gives meaningful
decision support.

A real-time execution system is a system which
implements algorithms to determine optimal decisions
in real time, and then executes these decisions.
An example of a real-time execution system is an
Automated Replenishment Program (ARP). The
aim is to provide automated replenishment of products
based on real-time demand information to the
production, warehouse and distribution processes
in the supply chain. This corresponds to real-time
control in control theory. Similarly, for maintenance
a real-time execution system will automatically issue
a work order with task descriptions and a due date.

Predictive maintenance builds on the idea of utilizing
the condition of a component and the expected future
loads in order to judge the correct time
for “hard” maintenance such as overhaul, replacement
of worn parts, calibration and so on. Sensor
technology is usually used to capture the condi- 
tion of components or a system, and the term 
‘condition monitoring’ is often used to describe 
the collection and analysis of state data relevant 
for predictive maintenance. It should be noted that 
manual inspection and use of “human sensors” to 
capture noise, smell, vibration could also be treated 
as condition monitoring. 


3 THE DIGITAL TWINS 


This section presents principal elements of the dig- 
ital twins for maintenance and production. 


3.1 Maintenance 


To a large extent the Computerized Maintenance
Management System (CMMS) could be seen as
a digital twin for maintenance. The principal content
of the CMMS is the asset register covering
all components and the Preventive Maintenance (PM)
program covering the type of maintenance and the
plan for maintenance. The CMMS will also contain
required spare parts, resources and tools for
conducting maintenance and so on. But there is relevant
information not found in the CMMS which
is essential for the stochastic digital twin to be
developed. First of all, the CMMS has no inherent
mathematical models to be used for degradation
development and time to failure. Further, information
regarding component condition is often
not part of the CMMS, and needs to be obtained
from stand-alone systems operated in parallel to
the CMMS. Further, the CMMS is not connected
to the supervisory control and data acquisition
(SCADA) system and other systems giving information
regarding process parameters and future
loads from production and the environment.

It is beyond the scope of this presentation to 
write out the details regarding the content of a sto- 
chastic digital twin for maintenance. For illustrative 
purposes and for use in the case study presented 
later a very simple digital twin is presented in the 
following. Although we in many situations can do 
much better, the classical failure rate function is 
used as a basis. The situation relates to so-called 
delay time models (Christer 1987), often referred 
to as PF-interval models. The situation is as fol- 
lows: A component is put into service at time t = 0. 
Then after a random time the component enters 
a degraded state. This state is often referred to as 
a potential failure. It is assumed that a condition
monitoring activity can reveal such a potential failure
with some detection probability, say 1 − q. If no
action is taken the component will fail after another
random time $T_{PF}$. A Cox proportional hazard rate
function (Cox 1972) is used as a basis for formulating
the failure rate function, $z(t) = f(t)/R(t)$, for
$T_{PF}$ ($t$ is the running time after the potential failure has
occurred). $T_{PF}$ is often referred to as the PF-interval,
and the corresponding failure rate function is:


\[ z(t \mid y, x(t)) = z_0(t)\, e^{\beta_1 y + \beta_2 x(t)} \qquad (1) \]


where $z_0(t)$ is a baseline failure rate function, typically
of the form $z_0(t) = \alpha \lambda^{\alpha} t^{\alpha - 1}$ in the Weibull case.
$y$ is a vector of state variables at the point of time
the potential failure is observed, and $x(t)$ is the
average load profile $t$ time units ahead. $\beta_1$ and $\beta_2$
are regression coefficient vectors established,
for example, by statistical analysis of data.

The failure rate function in eq. (1) is a classical
model and it could be questioned whether this
model represents a digital twin. A prerequisite for
being at least a part of a digital twin is that y can
be accessed from sensor readings and communicated
via the internet. Further, x(t) needs to be
accessed in real time from enterprise resource planning
(ERP) systems and other systems for future
production plans.


If eq. (1) is part of the stochastic digital twin we
may now “ask” for the probability of failure if we
wait, for example, $t$ time units before the potential
failure is fixed:


\[ F(t \mid y, x(t)) = 1 - \exp\left( -\int_0^t z(u \mid y, x(u))\, du \right) \qquad (2) \]
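
A “what-if” query of this kind could be evaluated numerically as in the sketch below. It assumes a Weibull baseline, scalar-product covariate effects as in eq. (1), and, as a simplifying assumption not made in the paper, a load profile x that is constant over the waiting time. All numerical values and function names are hypothetical.

```python
import math


def weibull_baseline(u: float, alpha: float, lam: float) -> float:
    """Baseline failure rate z_0(u) = alpha * lam**alpha * u**(alpha - 1)."""
    return alpha * lam**alpha * u ** (alpha - 1)


def failure_probability(t, y, x, beta1, beta2, alpha, lam, steps=2000):
    """F(t | y, x) = 1 - exp(-integral_0^t z(u | y, x) du), trapezoidal rule.
    Assumes the load covariates x are constant over [0, t]."""
    # Proportional hazards factor exp(beta1 . y + beta2 . x).
    risk = math.exp(sum(b * v for b, v in zip(beta1, y))
                    + sum(b * v for b, v in zip(beta2, x)))
    h = t / steps
    z = [weibull_baseline(i * h, alpha, lam) * risk for i in range(steps + 1)]
    integral = h * (sum(z) - 0.5 * (z[0] + z[-1]))
    return 1.0 - math.exp(-integral)


# Hypothetical state reading, coefficients and Weibull parameters.
params = dict(y=[1.2], beta1=[0.8], beta2=[0.5], alpha=2.0, lam=0.02)

# What if we wait 10 time units at full load, or at half load?
p_full_load = failure_probability(10.0, x=[1.0], **params)
p_reduced_load = failure_probability(10.0, x=[0.5], **params)
```

Comparing `p_full_load` and `p_reduced_load` is the kind of probability statement a stochastic digital twin would return when asked about the consequence of postponing a repair under different production loads.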


Only a few aspects of the “maintenance twin”
are elaborated here. For real applications it will be
an enormous amount of work to structure the raw
data, information, knowledge, models and so on to
obtain a digitalized stochastic maintenance twin.


3.2 Production 


There are so many aspects to deal with when it comes 
to production and logistic optimization that we will 
not even make an attempt to cover these in this pres- 
entation. However, with respect to maintenance there 
are some important aspects that we will emphasize 
when setting up the digital twin for production. 


3.2.1 Objective function 

Operations Research (OR) is the systematic 
approach to optimize production under various 
constraints (Phillips, Ravindran, & Solberg 1976). 
The objective function, Z, is typically the quantity 
to maximize or minimize with respect to some vec- 
tor of decision variables, say x, i.e., Z = Z(x). 


3.2.2 Constraints and conditions


Usually there are constraints to take into account
in the optimization, for example a set of functions,
say g(x), that should all be positive. In addition to these
constraints we also have to optimize Z = Z(x) subject
to S, where $S = [s_1, s_2, \ldots, s_n]$ is the state vector
of the components in the system. For example, $s_i = 1$
could represent that component i is functioning,
and $s_i = 0$ that it is in a fault state.
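
As a toy illustration of optimizing Z subject to constraints g(x) > 0 and a component state vector S, the sketch below searches exhaustively over a small set of load levels. The two-machine setting, the capacity of 1.8, the load levels and the rule that a faulty component carries no load are all hypothetical assumptions; a real production twin would use a proper OR solver.

```python
from itertools import product


def optimise(objective, constraints, state, levels):
    """Exhaustively search load levels for the x that maximises Z(x, S)
    subject to g(x) > 0 for every constraint g."""
    best_x, best_z = None, float("-inf")
    for x in product(levels, repeat=len(state)):
        # Assumption: a component in a fault state (s_i = 0) cannot carry load.
        if any(xi > 0 and si == 0 for xi, si in zip(x, state)):
            continue
        if not all(g(x) > 0 for g in constraints):
            continue
        z = objective(x, state)
        if z > best_z:
            best_x, best_z = x, z
    return best_x, best_z


# Hypothetical two-machine example: output is the total load, and the
# combined load must stay below a capacity of 1.8.
objective = lambda x, s: sum(x)
constraints = [lambda x: 1.8 - sum(x)]
levels = [0.0, 0.5, 1.0]

x_ok, z_ok = optimise(objective, constraints, state=[1, 1], levels=levels)
x_fault, z_fault = optimise(objective, constraints, state=[1, 0], levels=levels)
```

Re-running the search when the state vector changes from [1, 1] to [1, 0] shows how a component fault changes both the feasible decisions and the achievable objective value.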

It should be emphasized that both the objective 
function and the constraints and conditions are 
changing all the time. It is therefore required to have 
real-time access via the internet to the “physical” 
plant, existing orders, inventory levels and so on. 


3.2.3 Maintenance interaction


The digital twin for production will also be a stochastic
twin due to the probabilistic nature of production
optimization. In classical OR examples,
variability in supply and demand is the main
source of uncertainty. However, we will focus
on the relation to maintenance. Important aspects
that need to be structured as part of the digital
twin for production are:


• Slots for maintenance, i.e., possible opportunities
  for doing maintenance
• Specifications of how utilizing possible slots will
  affect the objective function Z = Z(x) and the
  constraints g(x)
• Specification of possible “relaxes” in production,
  for example avoiding running a component at full
  load if a “potential failure” has been revealed
• Specifications of how such “relaxes” will affect
  the objective function Z = Z(x) and the constraints
  g(x).


Note that the objective function Z = Z(x) in 
traditional OR does not include maintenance. 
Since the objective of this paper is to investigate 
synchronization and coordination of activities in 
the production and maintenance departments, the 
objective function should cover both departments. 


4 CASE STUDY 


4.1 Introduction 


A railway example is used to demonstrate challenges
in synchronization and coordination of
activities in the production and maintenance
departments. Only a few aspects are dealt with,
and issues related to actually establishing the stochastic
digital twins and having them play together are
not addressed in this presentation. One aspect of
“digitalization” within maintenance is related to
increased use of predictive maintenance. Turnouts
(switches) are important components in the
railway infrastructure, and failure of a turnout will
usually cause large problems for train circulation,
and delays are expected. Although various condition
monitoring techniques exist for turnouts,
they have not been implemented on the Norwegian
railway network due to high cost. In Norway, Bane
NOR is a state-owned company responsible for
the Norwegian national railway infrastructure. In
recent years Bane NOR has been running a test
project on a simplified predictive maintenance
strategy for turnouts based on measuring only
power as a function of time when the traction motor
is activated to change the position of the turnout.
The time required for changing the position of the
turnout varies from one up to 20 seconds. The
idea is that the power as a function of time for each
individual turnout is a “signature” for that turnout,
and deviation from that signature could be seen as
a potential failure as discussed in Section 3.1. The
main advantage of this system is that the information
is available more or less “free of charge”. The
challenge is to use it in an efficient way.

The system has been piloted over a period of
almost 3 years. As part of the pilot project, data
have been analysed for one of the turnouts. In a
follow-up project it is planned to conduct more
comprehensive analyses. During the test period
11 failures were observed. For 3 of these failures
a potential failure was not observed at all. Thus
the reliability of the condition monitoring system
is only some 70% (8 of 11 failures). The analysis was
conducted by visual inspection of the power/time curve
for all movements of the turnout for a period of 10 days
prior to the failure. More comprehensive analysis
could obviously give a higher reliability.


4.2 The PF-interval model 


The average PF-interval, i.e., the estimate of $E(T_{PF})$,
was found to be 80 hours. However, 2 failures had
an observed PF-interval of less than 3 hours. To
estimate the parameters in the failure rate function
in eq. (1), assuming a Weibull distribution and
ignoring covariates y and x(t), we may use the
following procedure:


1. Let $x_p$ be such that $F(x_p) = p$, where both $x_p$ and
   $p$ are known. It can be shown that the following
   iterative scheme may be used to estimate $\alpha$:

   $$\alpha_{i+1} = \frac{\ln(-\ln(1 - p))}{\ln(\lambda_i x_p)}$$

   where $\lambda_i$ is the value obtained in step 2 from $\alpha_i$.

2. For the parameter $\lambda$ we set
   $\lambda = \Gamma(1 + 1/\alpha) / E(T_{PF})$.


In the example we had $p = 2/(11 - 3)$, $x_p = 3$ hours
and $E(T_{PF}) = 80$ hours. Applying the
procedure gives $\alpha = 0.49$ and $\lambda = 0.026$.
It should be noted that $\alpha < 1$ means that the
PF-interval is not very consistent. The reason for the
low value of $\alpha$ is that we are mixing several failure
mechanisms. There are three main failure mechanisms
with quite different characteristics here, i.e.,
snow and ice with a short PF-interval, lack of
lubrication with a medium PF-interval, and mechanical
failure with a rather long PF-interval. This means
that we need to apply the procedure above to each
failure mechanism separately. From the failure
statistics obtained in the pilot project we do not
have a sufficient number of observations to do so.
For the case study we will proceed by assuming that
the failure mechanism is related to lack of lubrication,
and without any statistical support we set $\alpha = 2$
and $\lambda = 0.0246$, corresponding to $E(T_{PF}) = 36$ hours.
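
The two-step procedure of this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the Weibull distribution is taken as $F(t) = 1 - \exp(-(\lambda t)^\alpha)$, so that $E(T_{PF}) = \Gamma(1 + 1/\alpha)/\lambda$; the function name `fit_weibull`, the starting value and the damping of the iteration are illustrative choices and not part of the original procedure.

```python
import math

def fit_weibull(x_p, p, mean_pf, alpha0=1.0, damping=0.5, tol=1e-10, max_iter=500):
    """Estimate (alpha, lam) such that F(x_p) = p and E(T) = mean_pf.

    Assumes F(t) = 1 - exp(-(lam * t) ** alpha), hence
    E(T) = gamma(1 + 1/alpha) / lam.
    """
    target = math.log(-math.log(1.0 - p))  # ln(-ln(1 - p))
    alpha = alpha0
    for _ in range(max_iter):
        lam = math.gamma(1.0 + 1.0 / alpha) / mean_pf  # step 2
        alpha_new = target / math.log(lam * x_p)       # step 1
        # Damped update (an added assumption): averaging the old and new
        # values suppresses the oscillation of the plain iteration.
        alpha_next = (1.0 - damping) * alpha + damping * alpha_new
        converged = abs(alpha_next - alpha) < tol
        alpha = alpha_next
        if converged:
            break
    lam = math.gamma(1.0 + 1.0 / alpha) / mean_pf
    return alpha, lam

# Pilot data from the case study: p = 2/(11 - 3), x_p = 3 h, E(T_PF) = 80 h.
alpha, lam = fit_weibull(x_p=3.0, p=2 / (11 - 3), mean_pf=80.0)
print(round(alpha, 2), round(lam, 3))
```

With the pilot data the sketch should return values close to the $\alpha = 0.49$ and $\lambda = 0.026$ quoted above.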


4.3 The initial cost model 


The operational hindrance cost of executing a 
“hard maintenance” task, and the cost of a failure 
depends on the position of the turnout, the time of 
the day, the traffic and so on. Therefore an exam- 
ple situation is presented in the following. 

The location of the turnout is assumed to be on 
a part of the line where access only can be made by 
means of a work train. We assume a single track 
line where access by the work train will disturb the 
circulation. Investigating today's timetable,
four opportunity windows have been identified.
They are shown in column 1 in Table 1. The first
three of these will, however, cause delays in circulation.
The expected delay in minutes for each window
is shown in column 2 in Table 1. On average there
are 150 passengers per train, and a delay cost of
3 NOK per passenger per minute is used by Bane
NOR. In addition to the delay cost there is a fixed
cost of NOK 5 000 for ordering the work train and
associated personnel cost. If the failure cannot be
“caught” in due time, the expected total delay is
3 hours. The cost equation to minimize is:


$$C(t) = c_{PM}(t) + c_U F(t) \qquad (3)$$


where $c_U = 3 \cdot 150 \cdot 60 \cdot 3 = 81\,000$ NOK and $F(t)$ is given
by eq. (2). Table 1 shows that in this situation one
should utilize the last maintenance window, since
the circulation is not affected and the probability
of failure is still rather low. It can be shown that if
the failure mechanism is ice and snow, and assuming
$E(T_{PF}) = 10$ hours and $\alpha = 2$, the risk is much
higher, and one should rather use the first opportunity.
Note that the optimization here is seen from
the maintenance department, i.e., the only “production”
related cost is the increased $c_{PM}$ cost from
rushing the maintenance.


Table 1. Optimization results (costs in NOK).

t (hours)   Delay (min)   C_PM     C_F      C_Tot
3           30            18 500   441      18 941
5           15            11 750   1 218    12 968
7           10            9 500    2 370    11 870
9           0             5 000    3 880    8 880


Table 2. Optimization results with covariates (costs in NOK).

t (hours)   Delay (min)   C_PM     C_F      C_Tot
3           30            18 500   822      19 322
5           15            11 750   3 422    15 172
7           10            9 500    9 812    19 312
9           0             5 000    22 562   27 562
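
As a check on Table 1, the cost equation (3) can be evaluated for the four opportunity windows. The sketch below assumes the Weibull form $F(t) = 1 - \exp(-(\lambda t)^\alpha)$ with $\alpha = 2$ and $\lambda = \Gamma(1.5)/36$ (about 0.0246); all names are illustrative.

```python
import math

ALPHA = 2.0
LAM = math.gamma(1.0 + 1.0 / ALPHA) / 36.0  # E(T_PF) = 36 hours
C_U = 3 * 150 * 60 * 3          # NOK: 3 NOK/min/passenger, 150 passengers, 3 h delay
FIXED = 5000                    # NOK: work train and personnel
COST_PER_DELAY_MIN = 150 * 3    # NOK per minute of train delay

def failure_prob(t):
    """Assumed Weibull CDF of the PF-interval."""
    return 1.0 - math.exp(-((LAM * t) ** ALPHA))

def total_cost(t, delay_min):
    """C(t) = c_PM(t) + c_U * F(t), eq. (3)."""
    c_pm = FIXED + COST_PER_DELAY_MIN * delay_min
    return c_pm + C_U * failure_prob(t)

windows = {3: 30, 5: 15, 7: 10, 9: 0}  # t (hours) -> expected delay (minutes)
costs = {t: total_cost(t, d) for t, d in windows.items()}
best_t = min(costs, key=costs.get)
for t in sorted(costs):
    print(t, round(costs[t]))
print("best window:", best_t)
```

Under these assumptions the rounded totals agree with the last column of Table 1, and the window at t = 9 hours has the lowest cost.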


4.4 The refined cost model 


So far the covariates y and x(t) have been ignored.
We now introduce two variables: y, which is a measure
of degradation at the point of detection of the
potential failure, and x, the number of times per
hour the turnout will be operated. The proposed
Cox proportional hazard model reads:


$$z(t \mid y, x) = z_0(t)\, e^{\beta_y y + \beta_x x} \qquad (4)$$


For simplicity we have assumed that the number
of train passages per hour is constant over the day.
From the case study we do not have sufficient data
to estimate $\beta_y$ and $\beta_x$. We will therefore proceed
with illustrative values for these parameters. That
is, for the example we proceed with $\beta_y = \ln 2 = 0.69$
and $\beta_x = 0.1 \ln 2 = 0.069$.

Now, assume that at the time of the potential
failure we assess y = 0.15 by analysing the power
vs time curve from the condition monitoring
system, and further that from the timetable we find
x = 2. Table 2 shows the result when the covariates
are taken into account. Compared to the
original situation we have to advance the point of
time for doing hard maintenance, i.e., lubrication
and required adjustments. The number of times
per hour we operate the turnout, x, is a decision
variable seen from operation. Since the failure rate
function is increasing with increasing value of x,
we should investigate whether it pays off to reduce
x. Now, assume that we can completely remove the
need for operating the turnout by changing the
station where trains are crossing. This corresponds to
setting x = 0 in the model. Rerunning the model shows
that the optimal value of t is then t = 9. The cost is
reduced from 15 172 to 12 249, i.e., a total
saving of approximately NOK 3 000. However, if this causes
total delays of more than 7 minutes, the delay cost
will be higher than the savings. For the railway case
it seems unrealistic that changing the crossings for
the actual station for an entire day will not cause
more than 7 minutes of total delay.
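
The break-even delay quoted above follows from simple arithmetic, sketched here with the passenger figures of Section 4.3 (150 passengers per train, 3 NOK per passenger per minute). Variable names are illustrative.

```python
# Best total cost with x = 2 (Table 2, t = 5) versus x = 0 (t = 9), in NOK.
saving = 15172 - 12249
cost_per_delay_min = 150 * 3  # NOK per minute of train delay
break_even_min = saving / cost_per_delay_min
print(saving, round(break_even_min, 1))
```

The exact saving is slightly below NOK 3 000, so the break-even delay is between 6 and 7 minutes, consistent with the rounded figure of 7 minutes used in the text.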

A first attempt to formalize such a “relax” strategy
is to add an extra cost term, $c_P(x)$, to the objective
function:


$$C(t, x) = c_{PM}(t) + c_U F(t \mid y, x) + c_P(x) \qquad (5)$$


The joint optimization of t and x is not pursued
further in this presentation.


5 DISCUSSION 


The objective of this paper has been to investigate 
“Industry 4.0 solutions” to facilitate synchroni- 
zation and coordination of operation and main- 
tenance. By a “paper exercise” it is rather easy to 
demonstrate how this can be done, and potential 
savings. This section discusses challenges when 
such ideas are to be implemented for real systems. 


5.1 Slots for maintenance and consequences for 
the production model 


In order to synchronize and coordinate production
and maintenance it is essential that the digital twin
on request can provide time slots for maintenance
and evaluate the production consequences of each
possible slot. In the example we assumed that
“someone” could establish the time slots at 3, 5, 7 and
9 hours. Here, “someone” could be a train manager
at the Train Control Centre (TCC). But this is not
really a part of the “digital twin” for production.
To develop a digital twin, all production plans,
cost optimization functions etc. need to be implemented
in a computerized system supported by
algorithms to both find possible slots and do the
calculations to evaluate the consequences. For the
railway example we are far from realizing such systems.
To the author’s knowledge the situation is the
same in most Norwegian industries.

The way forward is therefore to develop simplified
production models. For example, in Norway a
simplified circulation model has been developed for
use by the TCC personnel to assist planning upon
traffic deviations. The model acts like a
“what-if” tool that can simulate the consequences
if a crossing is moved to station A rather than taking
place at the scheduled station B. It is hard to see
significant achievements here unless modelling competence
within the companies is significantly increased.
A vision behind “cloud computing” is that ready-to-use
models could just be plugged in whenever
needed. But this is still a vision.


5.2 Specification of possible “relaxes” in 
production 


In the example, and in many real case situations 
a mitigating measure upon a component degrada- 
tion is to reduce the load on that component to 
increase residual life. An even more realistic exam- 
ple than the railway example is maintenance and 
operations of wind farms. A wind farm can be dif- 
ficult to access in periods of the year due to harsh 
weather conditions. Upon a potential failure of for 
example the main bearing of the turbine it may 
be better to close down the turbine in situations 
with high wind loads. Although this will reduce the 
power produced for some hours, it might prevent a 
failure which would have made the turbine unavail- 
able for weeks and even months. 

The digital twin for production therefore needs
to respond, upon request, with the possible
“relaxes” that could be made in production and that will
have a positive impact on the residual life of a component.
In addition to responding with what can be done, the
digital twin also needs to specify the consequences,
for example by quantifying the reduced production.


5.3 Maintenance models


The literature in the field of maintenance optimisation
produces a huge number of models every year.
Very few of these models are used in practice.
One reason for this is that it is hard to get access
to statistical data for estimation of model parameters.
In our example we need to estimate $\alpha$, $\lambda$, $\beta_y$
and $\beta_x$. Further, this has to be done for all failure
mechanisms. We can easily imagine an enormous
workload. Next comes the question of whether the
Cox proportional hazard model is the appropriate
model to use. It is rather simple, but it does not
really take into account the physical aspects of the
phenomenon causing a failure.

The prospects for the maintenance twin are therefore
also not that promising. Again, starting with a set of
rather simplified models seems a natural first step.


5.4 Machine-learning 


Machine learning is a field of computer science
that gives computers the ability to learn without
being explicitly programmed. Machine learning
is quite different from the model-based approach
advocated here. A strength of machine learning is
its efficiency in producing a huge amount of results
without the explicit need to do all the “hard work”.
From a model-based perspective most
of us are reluctant to just “let the computer work
out the answers”. However, for sub-problems like
establishing a failure model, machine
learning approaches are more acceptable.


5.5 Real-time execution models 


An objective of Industry 4.0 solutions is to have 
automated decision processes. For simple situa- 
tions such as replenishment in retail we see auto- 
mated replenishment policies. However for mixed 
problems as discussed here it is a long way to go to 
get trust in real-time execution models. 


6 CONCLUSIONS 


This paper has discussed steps in synchronization 
of operation and maintenance. An example was 
provided to illustrate some of the challenges and 
opportunities this will give. With idealized exam- 
ples and simplified assumptions we are able to 
carry out relevant modelling. Still, these models
are test (off-line) models, and integration into
real-time (on-line) models requires significant effort. To
succeed it is recommended to start with a relatively
small set of standardized models for critical processes
in the value chain of the company.


ABSTRACT: Many papers are dedicated to maintenance strategies of deteriorating systems and almost
all of them share a common assumption that the parameters of the degradation process are known. In
this paper, we deal with a dynamic and aperiodic condition-based maintenance of single-unit systems
with unknown deterioration parameters. The deterioration is assumed to be governed by
an Inverse Gaussian (IG) process. The time interval between two successive inspections is scheduled based
on the Remaining Useful Life (RUL) of the system. The Bayes method is employed to use the available
information from degradation paths and to update the information about the parameters over time. The
proposed maintenance decision rule aims to avoid too frequent and costly inspections by implementing
aperiodic planning. The decision process is dynamically improved with the successive Bayesian updates
of the degradation parameters. The ability of the proposed modeling framework to drive the Bayesian
update while controlling the number of inspections is analyzed through numerical experiments. The global
maintenance cost is considered over a finite time horizon.


1 INTRODUCTION


Maintenance decision making is of prime importance
to improve the global performance of industrial
systems and structures during their useful life.
Different strategies for maintenance can be considered,
see Wang (2002). Their choice depends mainly
on the system failure process and on the associated
monitoring system. This paper is related to
systems or structures which gradually deteriorate
from an initial “new” state to failure and whose
degradation level can be perfectly observed through
inspections. In this general context, predictive and
Condition-Based Maintenance (CBM) strategies
are understandably relevant candidates. Indeed,
they allow decisions about maintenance actions
and the inspection scheduling to be adapted to the
current system state and possibly to a remaining
lifetime estimation including prospective usage. The
global design of a predictive or condition-based
maintenance policy requires mathematical modeling which
brings together a deterioration model, a maintenance
decision rule and a cost function for policy
assessment and optimization.

Stochastic processes are natural choices to
model deterioration over time. The Wiener process,
the gamma process, and the Inverse Gaussian (IG)
process are commonly used, see Alaswad and Xiang
(2017). The last two processes have the particularity
of exhibiting monotone evolutions of the degradation
indicator. The most popular stochastic process employed
in the maintenance literature is the gamma process; see
van Noortwijk (2009) for a review of this subject.
The IG process is an alternative stochastic process
introduced to the reliability literature by Wang and
Xu (2010). Ye and Chen (2014) investigated the
properties of the IG process in detail and described its
advantages: a clear physical interpretation, flexibility
in incorporating random effects and covariates
or even prior information, and the existence of
explicit formulas for important related functions.
Although the IG process is increasingly employed
(see e.g. Pan et al. (2016), Peng et al. (2014), Peng
(2015), and Ye et al. (2014)), the literature on CBM
policies is scarce. Chen et al. (2015) investigated the
optimal CBM policy with periodic inspections when
the system degradation follows an IG process with a
random-drift model.

In this paper, we discuss the condition-based
maintenance of a single unit system whose deg- 
radation follows an IG process. The considered 
system has degradation parameters that can be dif- 
ferent from one unit to the other among a popu- 
lation. Hence it is considered that parameters of 
the model are unknown but a prior information 
like expert judgment is available. As an illustra- 
tive example, one can consider the degradation of 
roads subject to longitudinal cracking. For a road 
section, the degradation level corresponds to the 
ratio of the fissured length on the total length of 
the section. A maintenance action is a complete 
resurfacing of the road section. The dynamics of 
crack propagation depends on several parameters 
which may represent features of the road founda- 
tion and are unknown. As a consequence, there 
are some differences from a section to the others 
and the specific parameters for a given section 
are unreachable. An adaptive Bayesian method is 
employed to update the information about param- 
eters after inspections as the degradation state of 
the system is measured. In order to reduce unnec- 
essary inspections and to control maintenance 
actions, an aperiodic maintenance policy is con- 
sidered. The time interval between two successive 
inspections depends on the system current degra- 
dation level and on degradation prediction. The 
knowledge about the degradation process can be 
improved on-line through the adaptive Bayesian 
method, simultaneously. The aim is to introduce 
this on-line improvement structure within the 
maintenance decision rule: the next inspection is 
planned based on the remaining useful life (RUL) 
of the system. At each inspection, the RUL is esti- 
mated according to the last update of degradation 
parameters. 

The aim of the paper is twofold. First, the per- 
formance of the Bayesian update process has to 
be checked especially because of the inspection 
schedule which is driven by the aperiodic mainte- 
nance decision rule. Hence the times for updates 
are nonperiodic and scarce. Secondly, the effect of 
on-line updates of RUL prediction on the global 
maintenance cost is assessed and analyzed for dif- 
ferent alternatives. 

The remainder of the paper is organized 
as follows. The next section is devoted to the 
description of the stochastic process for degra- 
dation modeling and RUL prediction. Section 3 
is related to the proposed maintenance policy 
including the procedure for on-line adaptation of 
the decision based on Bayesian update and the 
cost considered for assessment. Some numerical 
simulations are proposed in Section 4 to illustrate 
the behavior of the considered policy. Section 5 
concludes and shows further possible extensions 
of this paper. 


2 MODEL DESCRIPTION FOR RUL 
CALCULATION 


Consider a single component system which is subjected
to degradation. Let $X_t$ denote the degradation
state of the system at time $t$. In the absence
of repair or replacement actions, the evolution of
the system degradation is assumed to be strictly
increasing. Then $X_t$ can be modeled by an increasing
stochastic process. Moreover, the following
assumptions are considered:


• The initial state $X_0$ is 0.

• The system is failed if its degradation crosses a
  critical threshold level $L$.

• The system failure is not self-announcing and if
  it fails, it remains failed until the next inspection.
  This down time imposes some extra cost.


2.1 Stochastic degradation process 


The IG process is a stochastic process with independent,
non-negative increments such that for
each $t > s \ge 0$, the increment
$Y = X_t - X_s \sim IG(\mu(t - s), \lambda(t - s)^2)$
has the following probability density function (pdf):


$$f_Y(y) = \sqrt{\frac{\lambda (t - s)^2}{2 \pi y^3}} \exp\left[ -\frac{\lambda \left( y - \mu (t - s) \right)^2}{2 \mu^2 y} \right], \quad y > 0, \qquad (1)$$


where $\mu, \lambda > 0$. Then, the mean and the variance of
$X_t$ are $\mu t$ and $\mu^3 t / \lambda$, respectively.

Known parameters are a common assumption
in the maintenance literature. However, in practice,
the model parameters are unknown and must
be estimated. Here, we use the Bayesian approach
to overcome this difficulty, and we consider that
all information in hand, such as expert opinions, is
reflected in the prior information. The assumption
of conjugate priors is a good option to
reduce the complexity of finding the posterior
distribution. To this end, let $\lambda$ have the gamma
density function,


$$f(\lambda) = \frac{\lambda^{\alpha - 1}}{\Gamma(\alpha)\, \beta^{\alpha}} \exp\left( -\frac{\lambda}{\beta} \right), \qquad (2)$$


and let $\delta = 1/\mu$ have the conditional normal density
function with mean $\xi$ and variance $\sigma^2 / \lambda$,


$$f(\delta \mid \lambda) = \sqrt{\frac{\lambda}{2 \pi \sigma^2}} \exp\left[ -\frac{\lambda (\delta - \xi)^2}{2 \sigma^2} \right]. \qquad (3)$$


Then, the joint prior distribution of $(\delta, \lambda)$ is
given by $f(\delta, \lambda) = f(\delta \mid \lambda) f(\lambda)$. Furthermore, to avoid
the probability of getting non-feasible degradation
slopes, it is supposed that $P(\delta < 0)$ is negligible.




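In other words, the degradation model can be
rewritten as:


$$\begin{aligned}
X_t \mid \mu, \lambda &\sim IG(\mu t, \lambda t^2), \\
\delta \mid \lambda &\sim N(\xi, \sigma^2 / \lambda), \\
\lambda &\sim \Gamma(\alpha, \beta).
\end{aligned} \qquad (4)$$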
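
To make the degradation model concrete, the following sketch simulates one path of an IG process with known $\mu$ and $\lambda$, using increments $IG(\mu \Delta t, \lambda \Delta t^2)$. It assumes the transformation method of Michael, Schucany and Haas for inverse Gaussian sampling, which is a standard technique and is not taken from this paper; the function names are illustrative.

```python
import math
import random

def sample_inverse_gaussian(mean, shape, rng):
    """Draw one IG(mean, shape) variate (Michael-Schucany-Haas method)."""
    nu = rng.gauss(0.0, 1.0)
    y = nu * nu
    x = (mean + (mean * mean * y) / (2.0 * shape)
         - (mean / (2.0 * shape)) * math.sqrt(4.0 * mean * shape * y + mean * mean * y * y))
    # Accept the smaller root with probability mean / (mean + x).
    if rng.random() <= mean / (mean + x):
        return x
    return mean * mean / x

def simulate_ig_path(mu, lam, times, rng):
    """Return X_t at the given increasing times, starting from X_0 = 0."""
    path, level, previous = [], 0.0, 0.0
    for t in times:
        dt = t - previous
        # Increment over dt is IG(mu * dt, lam * dt**2), hence positive.
        level += sample_inverse_gaussian(mu * dt, lam * dt * dt, rng)
        path.append(level)
        previous = t
    return path

rng = random.Random(1)
path = simulate_ig_path(mu=1.0, lam=1.0, times=[1, 2, 3, 4, 5], rng=rng)
print(path)
```

Every increment is positive, so the simulated path is strictly increasing, as required for a monotone degradation indicator.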


2.2 Degradation model update 


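Assuming the priors mentioned above, the posterior
distribution of $(\delta, \lambda)$ can be obtained as
soon as new observations are available. Let
$X_{t_1}, \ldots, X_{t_n}$ be new observations of the system's
degradation state at times $t_1, t_2, \ldots, t_n$. It can
be shown that the joint posterior distribution of
$(\delta, \lambda)$ is:


$$f(\delta, \lambda \mid Data) = f(\delta \mid \lambda, Data)\, f(\lambda \mid Data),$$


where

$$f(\delta \mid \lambda, Data) = \sqrt{\frac{\lambda}{2 \pi \tilde{\sigma}^2}} \exp\left[ -\frac{\lambda (\delta - \tilde{\xi})^2}{2 \tilde{\sigma}^2} \right], \qquad
f(\lambda \mid Data) = \frac{\lambda^{\tilde{\alpha} - 1}}{\Gamma(\tilde{\alpha})\, \tilde{\beta}^{\tilde{\alpha}}} \exp\left( -\frac{\lambda}{\tilde{\beta}} \right).$$


The updated hyperparameters are:


$$\tilde{\alpha} = \alpha + n/2, \quad \tilde{\beta} = (1/\beta + 1/D)^{-1}, \quad \tilde{\xi} = B/A \quad \text{and} \quad \tilde{\sigma}^2 = A^{-1},$$

where

$$A = \sum_{i=1}^{n} \Delta x_i + \frac{1}{\sigma^2}, \quad
B = \sum_{i=1}^{n} \Delta t_i + \frac{\xi}{\sigma^2}, \quad
C = \sum_{i=1}^{n} \frac{\Delta t_i^2}{\Delta x_i} + \frac{\xi^2}{\sigma^2}, \quad
D = \left[ \frac{1}{2} \left( C - \frac{B^2}{A} \right) \right]^{-1},$$


and $\Delta x_i = X_{t_i} - X_{t_{i-1}}$, $\Delta t_i = t_i - t_{i-1}$, where $X_{t_0}$ is
considered as the degradation level at the current
time $t_0$. Then, new observations can be employed
to update the information about $\delta$ and $\lambda$ through
the Bayesian method. The update can be performed at each
inspection (n = 1) or at the end of each cycle (n is then
the number of inspections in the cycle).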
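
The update of the hyperparameters can be sketched as below. The expressions are those obtained by combining the IG likelihood of the increments with the normal-gamma prior of Section 2.1, taking $\beta$ as the scale of the gamma density; they were re-derived for this sketch under that assumption and should be checked against the original formulas. Function and variable names are illustrative.

```python
def update_hyperparameters(alpha, beta, xi, sigma2, dx, dt):
    """Conjugate update of (alpha, beta, xi, sigma2) from observed increments.

    dx[i] and dt[i] are the degradation and time increments between
    consecutive inspections; beta is the scale of the gamma prior on lambda,
    and sigma2 is the prior variance factor of delta = 1/mu.
    """
    n = len(dx)
    a = sum(dx) + 1.0 / sigma2
    b = sum(dt) + xi / sigma2
    c = sum(t * t / x for x, t in zip(dx, dt)) + xi * xi / sigma2
    alpha_new = alpha + n / 2.0
    # 1/D = (C - B^2/A) / 2 is non-negative by the Cauchy-Schwarz inequality.
    beta_new = 1.0 / (1.0 / beta + 0.5 * (c - b * b / a))
    xi_new = b / a
    sigma2_new = 1.0 / a
    return alpha_new, beta_new, xi_new, sigma2_new

# One inspection (n = 1): degradation increment 2.0 over a time increment 1.0.
posterior = update_hyperparameters(1.5, 1.0, 1.0, 1.0, dx=[2.0], dt=[1.0])
print(posterior)
```

Calling the function once per inspection with single-element lists corresponds to the update at each inspection; passing all increments of a renewal cycle at once corresponds to the update at each replacement.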


2.3 Remaining useful life


Herein, the objective is to determine the distribution
of the remaining useful life (RUL) of the
system. At a given time $t$, and knowing the
current state, the RUL of the system can simply be
computed as the first passage time at which the
degradation crosses the threshold $L$. Hence, we define
the RUL, $R$, of a system at time $t$ as:


$$R = \inf\{ r > 0 : X_{t+r} \ge L \mid X_t \},$$


where $X_t$ is the observed degradation at time $t$.
Assuming the parameters of the IG process are
known, the cumulative distribution function (CDF)
of $R$ given the parameters can be expressed as:


$$F_{R \mid \delta, \lambda}(r \mid \delta, \lambda, X_t) = \Phi\left[ \sqrt{\frac{\lambda}{L - X_t}} \left( r - \delta (L - X_t) \right) \right] - \exp(2 r \lambda \delta)\, \Phi\left[ -\sqrt{\frac{\lambda}{L - X_t}} \left( r + \delta (L - X_t) \right) \right], \qquad (5)$$


where $\Phi$ is the CDF of the standard normal distribution.
Then, in case of unknown degradation
parameters and knowing the joint distribution of
$(\delta, \lambda)$, the CDF of $R$ can be obtained by:


$$F_R(r \mid X_t) = \int\!\!\int F_{R \mid \delta, \lambda}(r \mid \delta, \lambda, X_t)\, f(\delta \mid \lambda)\, f(\lambda)\, d\delta\, d\lambda,$$


where $F_{R \mid \delta, \lambda}(r \mid \delta, \lambda, X_t)$ is given in (5). With
a calculation similar to that given in Peng (2015), we
have:


[Equation (6), the closed-form expression of $F_R(r \mid X_t)$, is not recoverable from the extracted text.]

3 MAINTENANCE POLICY 


3.1 Maintenance policy structure


Let $\{T_n\}_{n \in \mathbb{N}}$ be the aperiodic inspection times
($T_0 = 0$). At each inspection, one must decide about
the required maintenance action. This decision is
based on the knowledge of the system condition
after the inspection. Here, we assume that
the maintenance actions are performed in a negligible
time and $T_n^-$ refers to the time just before
the maintenance date. To control the system failure
occurrence, a preventive threshold $M < L$ is chosen.
The possible scenarios which can arise are:


• If $X_{T_n^-} \ge L$, the system is failed and it is correctively
  replaced with the cost of $C_c$.

• If $M \le X_{T_n^-} < L$, the system is not failed but it is
  too deteriorated and cannot function appropriately.
  Hence, a preventive replacement with the
  cost of $C_p$ is performed.

• If $X_{T_n^-} < M$, the system is still properly functioning.
  Then, there is no need for replacement
  and the system is left as it is.


Both preventive and corrective replacements are
perfect and reset the system to “as good as new”
condition. Then, the system condition after the
inspection, $X_{T_n}$, would be:


$$X_{T_n} = \begin{cases}
0 & \text{if } X_{T_n^-} \ge L \\
0 & \text{if } M \le X_{T_n^-} < L \\
X_{T_n^-} & \text{if } X_{T_n^-} < M
\end{cases}$$

In all the above cases, the RUL-based inspection
scheduling is carried out. Therefore, the time for the next
inspection $T_{n+1}$ is determined from time $T_n$ by:


$$T_{n+1} = T_n + \Delta T_n \quad \text{with} \quad \Delta T_n = \tau_p(X_{T_n}),$$

where $\tau_p(x)$ is the p-quantile of the RUL distribution
given in Equation (6).

The maintenance decision rule is dynamically 
updated through the Bayesian method. The time 
between two successive inspections is derived from 
the RUL assessment which depends on the sto- 
chastic degradation process hence on the hyperpa- 
rameters. Different strategies can be considered for 
the update frequency. Typically it can be at each 
inspection time or at each replacement time. The 
general flowchart describing the evolution of the 
maintained system state submitted to the proposed 
aperiodic maintenance policy with the Bayesian 
update is given hereafter. 


[Flowchart of the maintained system under the aperiodic
policy with Bayesian update. Recoverable steps:
1. Initiate $\theta^{(0)}$ and set $X_0 = 0$, $T_0 = 0$, $i = 1$.
2. Calculate $\Delta T_{i-1} = \tau_p(X_{i-1})$.
3. Monitor the system at $T_i = T_{i-1} + \Delta T_{i-1}$.
4. If required: employ the Bayesian approach with new
   data to update $\theta^{(i-1)}$.
5. Decision: has the system failed or is it too worn
   ($X_i > M$)? Yes / No.]

$\theta^{(i)}$ is the vector of hyperparameters at iteration
$i$ and $\tau_p(X_{i-1})$ is the p-quantile of the RUL
distribution obtained with $\theta^{(i-1)}$.


3.2 Maintenance cost 


The inspections are planned discretely and each of
them incurs a cost $C_i$. At each inspection, the decision
of replacement is checked and a preventive
or corrective action is performed, if needed, with
costs $C_p$ and $C_c$ respectively. As the maintenance
performed on a more deteriorated system is more
complex and hence more costly, $C_c > C_p$. Moreover,
since the failure can only be detected through
inspections, there is a system downtime after failure,
and an additional cost is incurred from the
failure time until the next replacement time at a
cost rate $C_d$. The cumulative maintenance cost is:


$$C(t) = C_i N_i(t) + C_p N_p(t) + C_c N_c(t) + C_d\, d(t),$$


where $N_i(t)$, $N_p(t)$, and $N_c(t)$ are respectively
the number of inspections, the number of preventive
replacements, and the number of corrective
replacements in $[0, t]$. Also, $d(t)$ is the total time spent
in a failed state in $[0, t]$.

Here, we use the expected cost function over a
finite time horizon $T_{end}$ as the objective function to
assess and optimize the maintenance policy.
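
The cumulative cost is a linear combination of event counts and downtime, as the following sketch shows. The unit costs are those used in the simulation study of Section 4, and the counts are the rounded means reported there for the “perfect” case; the function name and argument names are illustrative.

```python
def cumulative_cost(n_insp, n_prev, n_corr, downtime,
                    c_i=0.2, c_p=4.0, c_c=10.0, c_d=4.0):
    """C(t) = C_i N_i(t) + C_p N_p(t) + C_c N_c(t) + C_d d(t)."""
    return c_i * n_insp + c_p * n_prev + c_c * n_corr + c_d * downtime

# Rounded mean counts of the "perfect" case over [0, T_end] with T_end = 100.
cost = cumulative_cost(n_insp=32.8, n_prev=11.5, n_corr=1.3, downtime=0.85)
print(round(cost, 2))
```

Since the expectation is linear, inserting mean counts gives the expected cost. Because the tabulated means are rounded, the result (about 68.96) only approximates the reported mean cost of 69.26.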


4 SIMULATION STUDY 


In order to illustrate the behavior of the aperiodic
maintenance policy, let us consider the case of a system
which deteriorates according to an IG process
with fixed parameters $\mu_{real} = 1$ and $\lambda_{real} = 1$. The
prior information about the system is given by
the values of the hyperparameters $\alpha = 1.5$,
$\beta = \ldots$, $\xi = 1$ and $\sigma = \ldots$. The limit threshold
is $L = 9$. The Bayesian update introduced before
helps us to get better information about the model
parameters $(\delta, \lambda)$ in comparison to the initial
information, which may be wrong or partly wrong in
some cases. Figure 1 shows the mean and variance
of the evolving distribution of $(\delta, \lambda)$ over time with
the update at each inspection. It is clear that after
a while the means tend to their correct values, $\delta_{real}$
and $\lambda_{real}$, while their variances vanish to zero.


Figure 1. The mean (on the left) and the variance (on
the right) of the distribution of $\delta$ (top) and $\lambda$ (bottom)
over time.

The maintenance cost is the expected cumulative
maintenance cost over a finite horizon with
$T_{end} = 100$, where $C_i = 0.2$, $C_p = 4$, $C_c = 10$ and
$C_d = 4$. It is estimated by a Monte Carlo simulation
with 5000 samples over the considered finite
horizon. Four different versions of the proposed
maintenance policy are considered, corresponding
to four configurations of the model used for RUL
prediction. The p-quantile of the RUL, hence the
time for the next inspection, is obtained either
from Equation (6) or as the first passage time of
the IG process used for degradation simulation.
This last case is hereafter referred to as the “perfect”
case because it assumes that the model used for
prognosis is exactly the model used to describe the
degradation of the system. Equation (6) describes the
case of unknown degradation parameters. The three
options considered differ in how the hyperparameters
are handled. In the first one, the values of the
hyperparameters are fixed to the ones given by experts.
This case is called the “no update” case. The second
and third cases consider updates of the hyperparameters
$\alpha$, $\beta$, $\xi$ and $\sigma$, respectively
at each inspection time (case “inspection”) and at
each replacement time (case “cycle”).

As an example, specific values of the decision
parameters $p$ and $M$ are chosen, which are the
optimal decision parameters obtained under the
assumption of known degradation parameters (case
“perfect”). The values obtained by Monte Carlo
simulation are $p_{know} = 0.03$ and $M_{know} = 6.5$.
Table 1 gives different results obtained with these
values of decision parameters for the four different
configurations of the maintenance decision rule over
the finite time horizon $[0, T_{end}]$. The given quantities
are the expected cumulative maintenance cost,
its variance, the mean numbers of maintenance
actions (inspections, corrective and preventive
replacements), the mean unavailability duration
and the mean number of renewal cycles.


Table 1. Mean values for comparison of different maintenance
decision rule options applied to degradation
data generated from an IG process with fixed parameters
$\mu_{real} = 1$ and $\lambda_{real} = 1$.


Case            Perfect   Insp.   Cycle   No update
Cost mean       69.26     72.20   74.78   97.416
Cost var.       11.41     11.86   12.25   7.62
Inspections     32.8      44.7    62.3    193.3
Prev. actions   11.5      11.4    11.7    12.8
Corr. actions   1.3       1.38    1.23    0.7
Unav. dur.      0.85      0.91    0.76    0.12
Nb of cycles    13.8      13.8    13.9    14.5


The variations of the cost for the considered poli- 
cies are in line with logical thinking. The considered 
decision parameters correspond to the minimal cost 
for the highest possible level of information i.e. for 
known real deterioration parameters. The optimal 
cost is used as a reference. For the lowest level of 
information i.e. for RUL assessment based on initial 
hyperparameters values without any update, the cost 
increase with respect to the reference is about 40%. 
The introduction of the Bayesian update leads to 
maintenance costs which are close to the best case. 
The increase is close to 4.2% and 7.5% if the RUL
prediction for inspection scheduling is respectively
updated at each inspection time (i.e. around 44 times
on $[0, T_{end}]$) or at each replacement (i.e. around 14
times). The main differences between the costs of
the four considered options are due to the number of 
inspections. In the absence of updates, the probabil- 
ity law of the RUL is spread out and the p-quantile 
causes small inter-inspection times. With updates, 
the pdf becomes sharper and it leads to increasing 
periods of time between successive inspections. 

Table 1 contains results for one specific value
of the decision parameters (p, M). In order to illustrate
the performance of the proposed maintenance
policy for different settings, the expected cumulative
maintenance cost as a function of the two
decision parameters is depicted in Fig. 2. The two
cases with the Bayesian update are investigated as
well as the case with fixed and known parameters.
It can be seen that the shapes of the different cost
surfaces are close to each other. It means that the
performance evolves in the same way when the
decision parameters are modified. For example,
one can see that the sensitivity to the parameter M
is low and that too small values of p can produce
a sudden cost increase due to the increase of the
number of cycles. In order to show how the RUL
pdf becomes more peaked with each update, the RUL
pdfs obtained with successive updates at each
inspection are depicted in Figure 3.

Figure 2. Cost as a function of the maintenance
decision parameters p and M for 3 different decision rules
with C,=0.2, C,=4, C,= 10 and C, =4 and Tana = 100.
Panels: (a) With RUL update at each replacement time;
(c) With RUL based on exact degradation parameters.

Figure 3. The pdf of the RUL updated at each inspection.

The obtained numerical results show the
efficiency of the Bayesian update procedure as
a way to adapt the decision rule of the aperiodic 
maintenance policy. It allows implementing an 
auto-adaptive maintenance policy with a small 
number of decision parameters. 


5 CONCLUSIONS 


A predictive maintenance policy for deteriorating
systems with unknown parameters is proposed in
this paper. It includes a 2-parameter maintenance
decision rule based on the Bayesian update, which
makes it possible to decide jointly when to inspect
the system and whether to replace it at inspection
time. The whole modeling framework, based on an
IG process for degradation modeling and a
cumulative maintenance cost on a finite horizon,
is described. Some numerical results are given to
illustrate the behavior of the maintenance policy.
The inspection times are driven by the maintenance
decision rule and used for hyperparameter
updates. Two different versions are considered:
with the update at each inspection time and with
the update at each system replacement time.
Whatever the version, the shape of the cost function
is shown to be close to that of the case of known
deterioration parameters. This allows possible
choices of decision parameters to be considered
without unexpected behavior of the cost value. With
the Bayesian update procedure, the fast convergence
of the mean degradation parameters to their exact
values makes the proposed maintenance policy
promising. The problem of the optimal choice of the
decision parameters will be considered in future work.


ABSTRACT: A hybrid maintenance policy that combines inspection in early life with opportunistic
replacement in later life is developed for a one-component system with a heterogeneous lifetime.
Inspections and replacements use opportunities that arise periodically, for example due to visits of a
maintenance vessel to a turbine in an offshore windfarm. Components may be weak or strong, and the inspection
phase of the policy operates like a burn-in period. Inspections are modelled using the delay time concept.
The policy mimics reality whereby new systems are carefully maintained and older systems are replaced
at events determined by operational constraints. We determine the cost-rate and reliability of the hybrid
policy. Using a numerical example we show that the inspection phase offers benefits when the delay time
is sufficiently large and/or early failure is a significant possibility. The policy would be relatively easy to
implement in practice.


1 INTRODUCTION 

Maintenance management of technical systems 
uses preventive and corrective replacement, inspec- 
tion, repair and such like (De Almeida et al. 2015; 
Lee & Cha, 2016) to increase system reliability and 
availability and to reduce total cost of ownership of 
industrial assets (Berrade et al. 2013; Xia et al. 2015; 
Zheng, et al. 2016). For some systems, stoppages 
may provide opportunities for the execution of pre- 
ventive maintenance with less disruption and at a 
lower cost than scheduled preventive maintenance. 
An example is the loss of the cold water supply to a
soft-drinks production line caused by pump failure
(Wang et al. 2000), whereby preventive maintenance
of bottling and packing sub-systems may be carried
out ahead of schedule. Models of opportunistic
maintenance policies develop this idea in
theory (e.g. Dekker & Smeitink, 1991; Zheng, 1995;
Tan & Kramer, 1996; Mohamed-Salah et al. 1999;
Budai et al. 2006; Laggoune et al. 2010; Xia et al.
2017b, c; Zhang & Zeng, 2017) and for practice (e.g.
Ding and Tian, 2011; Shafiee et al., 2015; Yildirim
et al. 2017; Hu & Zhang, 2014; Nilsson et al.
2009; Cavalcante & Lopes, 2015; Xia et al. 2017a;
Garambaki et al. 2016). Typically, opportunities
arise from economic and structural interdependencies
among components or parts (Dekker &
Smeitink, 1991). Grouped maintenance policies
(Wildeman et al. 1997) also exploit such dependencies
to maintain groups of parts (Vu et al. 2015;
Peng & Zhu, 2017), but opportunistic maintenance
is different because it aims to maintain a part or
parts when another part of the system causes a
stop.

In this paper, we consider opportunities in a
different way. We suppose that the opportunities arise
deterministically and periodically, as they might 
arise when shutdowns of a plant of which the sys- 
tem of interest is a part are seasonal or when main- 
tenance resources are available only occasionally, 
as in the case of visits of a maintenance vessel to a 
turbine in an offshore windfarm. 

We further suppose that the system is subject to 
a two-stage failure process according to the delay 
time model (Christer, 1999), whereby a defective 
but operational state precedes the failed state. Then, 
opportunities may be utilized for inspection, and for
replacement if required. Additionally, we model
component heterogeneity that may arise, for example, in
the context of variable maintenance quality (Scarf 
& Cavalcante, 2012), supplier selection (Berrade 
et al. 2012), reliability (Castet and Saleh, 2010), and 
analysis of failure warranty data (Attardi et al., 
2005; Lee et al., 2016). This heterogeneity means 
inspection in early life has an important role that is 
similar to operational “burn-in” (Zhang et al. 2014). 

Few papers consider this connection between 
opportunistic maintenance and inspection (Wang & 
Christer, 2003; Berrade et al. 2017). Cavalcante et al. 
(2018) exploit this gap in the literature. In particular, 
they develop a model that generalizes hybrid inspec- 
tion and replacement (Scarf et al. 2009, Scarf & 
Cavalcante, 2010) and opportunistic maintenance, 


and in which a system is periodically inspected in the 
first phase of its life and replaced at opportunities 
during the second phase. In the policy they consider, 
opportunities arise at random, due to system stop- 
pages that are caused by the failure of other struc- 
turally connected systems. In our paper here, we take 
a different approach and suppose that opportunities 
are deterministic and periodic. This makes the policy 
simpler to study (fewer decision variables) and easier 
to implement in practice (akin to block replacement 
rather than age replacement, Barlow & Proschan, 
1966), but nonetheless applicable to maintenance of 
windfarms (e.g. Shafiee, 2015), transportation sys- 
tems (e.g. Corman et al. 2017), and manufacturing 
systems (e.g. Zahedi et al. 2017). 

The structure of the paper is as follows. We 
begin next with the description of the system and 
the policy. Then in section 3 we develop the cost 
rate. Section 4 briefly develops the system reli- 
ability. This is followed by a numerical example to 
illustrate the policy. We conclude with a discussion. 


2 THE POLICY 


2.1 Description of the system 


We consider a single component which when in its
socket performs an operational function (Ascher
and Feingold, 1984). The component is in one of
three states: good, defective or failed. The time in
the good state, X, the time to defect arrival, has a
mixture distribution $F_X(t) = pF_1(t) + (1-p)F_2(t)$,
with mixing parameter p. Thus components arise
from a mixed population of “weak” and “strong”
sub-populations. $F_1$ and $F_2$ are in general
increasing failure rate distributions, and in our particular
example they are Weibull distributions with
characteristic lives $\eta_1, \eta_2$ and shape parameters
$\beta_1, \beta_2 > 1$. The corresponding density and
reliability (survival) functions are denoted by $f_X$
and $\bar{F}_X$ respectively.
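
The mixture above can be sketched numerically as follows. This is a minimal illustration only: the function names are hypothetical, shape parameters of at least 1 are assumed, and the parameter values in the usage lines are the base-case values quoted in the numerical example, with sub-population 1 taken to be the weak one.

```python
import math

def weibull_cdf(t, eta, beta):
    """Weibull distribution function (characteristic life eta, shape beta)."""
    return 1.0 - math.exp(-((t / eta) ** beta)) if t > 0 else 0.0

def weibull_pdf(t, eta, beta):
    """Weibull density; assumes beta >= 1 so that the value at t = 0 is finite."""
    if t < 0:
        return 0.0
    return (beta / eta) * (t / eta) ** (beta - 1) * math.exp(-((t / eta) ** beta))

def mixture_cdf(t, p, eta1, beta1, eta2, beta2):
    """F_X(t) = p F_1(t) + (1 - p) F_2(t), with p the weak proportion."""
    return p * weibull_cdf(t, eta1, beta1) + (1 - p) * weibull_cdf(t, eta2, beta2)

def mixture_pdf(t, p, eta1, beta1, eta2, beta2):
    """f_X(t), the density of the time to defect arrival."""
    return p * weibull_pdf(t, eta1, beta1) + (1 - p) * weibull_pdf(t, eta2, beta2)

def mixture_sf(t, p, eta1, beta1, eta2, beta2):
    """Survival function of X, i.e. 1 - F_X(t)."""
    return 1.0 - mixture_cdf(t, p, eta1, beta1, eta2, beta2)

# Base-case values from the numerical example (assumed reading of the text).
base = dict(p=0.1, eta1=2.0, beta1=2.0, eta2=10.0, beta2=5.0)
print(mixture_cdf(2.0, **base))  # probability that a defect has arrived by age 2
```

With these values nearly all of the early defect probability comes from the weak sub-population, which is the effect the inspection phase is meant to address.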

The component in its socket constitutes a one- 
component system. This system is a non-critical 
system so that, on failure, the system stoppage is 
revealed only at a subsequent opportunity, whereat 
the component is replaced and the system returns 
to operation. Inspection determines whether the 
system is good or defective. When the system 
is defective it continues to operate albeit with a 
degraded function (e.g. a noisy bearing). Inspec- 
tions are perfect in that an inspection reveals the 
true system state. At a positive inspection (system 
is defective), the component is replaced. Compo- 
nent replacement corresponds to system renewal. 

Opportunities arise deterministically every S
time units, as a result of, for example, seasonal
shutdowns or periodic availability of maintenance
resources.

The sojourn in the defective state, H, the delay
time, has density $f_H(h)$, distribution function $F_H(h)$,
and reliability (survival) function $\bar{F}_H(h)$. X, H
and the opportunities are mutually statistically
independent.


2.2 Description of the policy 


Opportunities arise every S time units. Inspections 
are scheduled at each of the first K opportunities 
from when the system is new and replacement is 
scheduled at the Mth opportunity ($M \geq K$). On
failure, the system is replaced at the first oppor- 
tunity that follows the failure. Maintenance inter- 
ventions (inspections and replacements) cannot 
occur at instants other than opportunities. In 
this way, replacements are always synchronized 
with opportunities, and so renewal occurs only 
at an opportunity. The policy has two decision 
variables: K and M. The cost parameters are as 
follows: 


- the cost of an inspection is $C_I$;

- the cost of a replacement of a defective component
is $C_R$;

- the cost of a preventive replacement of a component
at age MS is also $C_R$;

- the cost of a replacement of a failed component
is $C_F$ ($C_I < C_R < C_F$).


The innovation of this policy is the early-life 
inspection phase [0, KS], whereby a component 
is replaced if it is found to be defective. In this 
way, great care is taken of the system during 
early life. 

We might study a more general policy in which 
inspections are scheduled every N-th opportunity, 
that is, at times NS, 2 NS, ..., KNS (from new). This 
policy may be appropriate when opportunities 
arise very frequently (e.g. at shutdowns or under 
reduced timetables of transportation systems that 
occur on a weekly basis). However, in this paper, 
we focus on the simpler two-variable policy. 

We might also study a policy in which the cost 
of failure is not fixed, but instead related to the 
unavailability of the system (from the point of fail- 
ure until the replacement at the subsequent oppor- 
tunity). This is perhaps a more realistic policy for 
the case in which the unavailability cost is large 
relative to the cost of remedial work on a failed 
system. For a turbine in a windfarm, however, una- 
vailability cost is not large (relatively), because the 
cost of lost power generation will be small relative 
to the cost of damage to the turbine as a result of 
failure. Furthermore, unavailability costs across 
a large installation are typically factored into the 
wind power capacity model, so that operators 
should focus on direct maintenance costs (Shafiee 
and Sorensen, 2017). 


3 CALCULATION OF THE COST-RATE 


The cost-rate is determined by calculating the 
probabilities of the three types of renewal scenario 
that arise: related to failure; related to preven- 
tive replacement; and related to replacement of a 
defective component at inspection. 

A failure can arise as a result of a defect that 
itself arises either in early-life during the inspec- 
tion phase or in later-life during the replacement 
phase. The costs are different in each case, so we 
develop the calculations separately. 

In the first case, a defect arises in the interval
[(i-1)S, iS) and the component fails before time
(age) iS. This occurs with probability

$$P_{F,i} = \int_{(i-1)S}^{iS} F_H(iS - x) f_X(x)\,dx,$$

for i = 1, ..., K. The cost associated with this event
is $C_F + (i-1)C_I$ and the length of the renewal cycle
is iS, noting that replacement has to wait until the
next opportunity and that the cost of unavailability
is ignored.

In the second case, a defect arises after
KS and the component fails in the interval
((i-1)S, iS), for i = K+1, ..., M. For
i = K+2, ..., M (M > K+1), this occurs with probability

$$P_{F,i} = \int_{(i-1)S}^{iS} F_H(iS - x) f_X(x)\,dx + \sum_{j=K+1}^{i-1} \int_{(j-1)S}^{jS} \left[ F_H(iS - x) - F_H((i-1)S - x) \right] f_X(x)\,dx,$$

and for i = K+1 we have

$$P_{F,K+1} = \int_{KS}^{(K+1)S} F_H((K+1)S - x) f_X(x)\,dx.$$

The cost associated with this event is $C_F + KC_I$
and the length of the renewal cycle is iS.

For renewal related to preventive replacement at
MS, this can occur only if either no defect arises
before MS or a defect arises before MS but after
KS and it survives to MS. In either case the cost
of K inspections and the cost of the preventive
replacement are incurred. Thus the probability of
preventive replacement at MS is

$$P_M = \int_{KS}^{MS} \bar{F}_H(MS - x) f_X(x)\,dx + \bar{F}_X(MS),$$

and the associated cost of the renewal cycle is
$C_R + KC_I$ and the length of the renewal cycle is MS.

The final renewal scenario is replacement of a
defective component at inspection. This can only
occur in the early-life phase, and the defect must
arise in an interval between opportunities and
survive (not fail) to the end of the interval. The
probability of renewal at the i-th inspection, for
i = 1, ..., K, is

$$P_{D,i} = \int_{(i-1)S}^{iS} \bar{F}_H(iS - x) f_X(x)\,dx,$$

and the associated cost of the renewal cycle is
$C_R + iC_I$ and the length of the renewal cycle is iS.

Thus, associated with each type of renewal
event is a cost, and the expected cost of a renewal
cycle is the sum of the products of the cost of each
renewal event and the respective probability of
each renewal event. Therefore

$$E\{U(K,M)\} = (C_R + KC_I) P_M + \sum_{i=1}^{K} \left\{ (C_F + (i-1)C_I) P_{F,i} + (C_R + iC_I) P_{D,i} \right\} + (C_F + KC_I) \sum_{i=K+1}^{M} P_{F,i}.$$

The expected cycle length is found similarly:

$$E\{V(K,M)\} = MS\,P_M + \sum_{i=1}^{K} iS \left( P_{F,i} + P_{D,i} \right) + \sum_{i=K+1}^{M} iS\,P_{F,i}.$$

Then we use the renewal-reward theorem to
define the cost-rate (the long run average cost per
unit time) $C_\infty(K,M) = E(U)/E(V)$. This is the
objective function we use to determine the optimum
values of the decision variables K and M.
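
The calculation in this section can be sketched numerically as below. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the delay time H is assumed to be exponential with rate lam (the text only gives its mean), the integrals are evaluated with a simple Simpson rule, and the default parameters are the base-case values of the numerical example.

```python
import math

def simpson(g, a, b, n=200):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = g(a) + g(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * g(a + k * h)
    return total * h / 3

def weibull_pdf(t, eta, beta):
    return (beta / eta) * (t / eta) ** (beta - 1) * math.exp(-((t / eta) ** beta))

def weibull_sf(t, eta, beta):
    return math.exp(-((t / eta) ** beta))

def make_model(eta1=2.0, beta1=2.0, eta2=10.0, beta2=5.0, p=0.1, lam=1.0):
    """Return f_X, survival of X, and F_H (exponential delay time: an assumption)."""
    def f_x(x):
        return p * weibull_pdf(x, eta1, beta1) + (1 - p) * weibull_pdf(x, eta2, beta2)
    def sf_x(x):
        return p * weibull_sf(x, eta1, beta1) + (1 - p) * weibull_sf(x, eta2, beta2)
    def F_h(h):
        return 1.0 - math.exp(-lam * h) if h > 0 else 0.0
    return f_x, sf_x, F_h

def renewal_probabilities(K, M, S, f_x, sf_x, F_h):
    """Probabilities of failure renewal at iS, defect renewal at iS, and renewal at MS."""
    def integral(a, b, weight):
        return simpson(lambda x: weight(x) * f_x(x), a, b)
    p_fail, p_def = {}, {}
    for i in range(1, K + 1):            # inspection phase
        a, b = (i - 1) * S, i * S
        p_fail[i] = integral(a, b, lambda x: F_h(b - x))
        p_def[i] = integral(a, b, lambda x: 1.0 - F_h(b - x))
    for i in range(K + 1, M + 1):        # replacement phase
        total = integral((i - 1) * S, i * S, lambda x: F_h(i * S - x))
        for j in range(K + 1, i):        # defect arose in an earlier interval
            total += integral((j - 1) * S, j * S,
                              lambda x: F_h(i * S - x) - F_h((i - 1) * S - x))
        p_fail[i] = total
    p_prev = integral(K * S, M * S, lambda x: 1.0 - F_h(M * S - x)) + sf_x(M * S)
    return p_fail, p_def, p_prev

def cost_rate(K, M, S, c_i=0.08, c_r=1.0, c_f=8.0, **model_params):
    """Long-run cost per unit time E(U)/E(V) by the renewal-reward theorem."""
    f_x, sf_x, F_h = make_model(**model_params)
    p_fail, p_def, p_prev = renewal_probabilities(K, M, S, f_x, sf_x, F_h)
    cost = (c_r + K * c_i) * p_prev
    length = M * S * p_prev
    for i in range(1, K + 1):
        cost += (c_f + (i - 1) * c_i) * p_fail[i] + (c_r + i * c_i) * p_def[i]
        length += i * S * (p_fail[i] + p_def[i])
    for i in range(K + 1, M + 1):
        cost += (c_f + K * c_i) * p_fail[i]
        length += i * S * p_fail[i]
    return cost / length

print(cost_rate(K=3, M=6, S=1.0))
```

A useful check on any such implementation is that the renewal probabilities sum to one. Under the exponential delay-time assumption, the case without weak components and without inspection (p = 0, K = 0, M = 6, S = 1) should land close to the cost-rate reported for that row in Table 1.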


4 RELIABILITY 


We can also quantify the reliability of the
maintained system (Lewis, 1987; Scarf et al. 2005) using
the “long-run time between failures”:

$$\mu(K,M) = \frac{\text{expected cycle length}}{\Pr(\text{renewal on failure})} = \frac{E\{V(K,M)\}}{\sum_{i=1}^{K} P_{F,i} + \sum_{i=K+1}^{M} P_{F,i}}.$$

We denote this by $\mu(K,M)$. We might also
describe its inverse as the “failure rate”, a term
often used in reliability engineering but commonly
misunderstood (Ascher & Feingold, 1984; Gaens,
2017). The system availability might also be
calculated, but we omit this for brevity.
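
As a small illustration of this definition, the sketch below computes the long-run time between failures and its inverse from an expected cycle length and a list of failure-renewal probabilities. The function name and the numbers are hypothetical and are not taken from the paper.

```python
def long_run_time_between_failures(expected_cycle_length, failure_probs):
    """Expected cycle length divided by the probability of renewal on failure."""
    p_failure = sum(failure_probs)
    if p_failure <= 0:
        return float("inf")  # no failure renewals are possible
    return expected_cycle_length / p_failure

# Hypothetical illustrative values: a 5-year expected cycle and three
# failure-renewal probabilities summing to 0.06.
mu = long_run_time_between_failures(5.0, [0.01, 0.02, 0.03])
failure_rate = 1.0 / mu
print(mu, failure_rate)
```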


5 NUMERICAL EXAMPLE 


We set the time unit to one year and use an arbitrary
cost unit, so that all costs are multiples of $C_R$. The
values of parameters (defined in sections 2.1 and
2.2) are typical of wind turbines (e.g. Faulstich
et al. 2011). In the base case (row 2 of
Table 1), we set $\eta_1 = 2$, $\beta_1 = 2$, and
$\eta_2 = 10$, $\beta_2 = 5$ so that the populations of
weak and strong components are reasonably well
separated. The value of the mixing parameter
(p = 0.1) is not untypical of mechanical systems
(e.g. Gales, 2015). The mean delay time and the
time between opportunities are both chosen to be
the same (one year), so that we might expect
interesting effects to be observed. Some results are
shown in Table 1 and Figure 1. Figure 1 suggests
that the cost-rate is sensitive to both M, which
determines the age at replacement, and K, which
determines the length of the inspection phase. We
can further see that the cost-rate increases sharply
as inspection reduces (K = 2 vs K = 3), so the
benefit of the inspection phase is clear. This benefit
is absent when there are no weak components
(p = 0 in Table 1). A final point to make about this
figure is that, if one is unsure about how much
inspection to carry out, more (K > 3) is better than
less.


Table 1. Optimal policy for various $\eta_1$, $\beta_1$, p, $\lambda$ and
S. Other parameters fixed at $\eta_2 = 10$, $\beta_2 = 5$, $C_I = 0.08$,
$C_R = 1$, $C_F = 8$. Base case is row 2.

η_1   β_1   p     λ     S     K    M    C_∞     1/μ
1     2     0.1   1     1     2    6    0.301   0.0132
2     2     0.1   1     1     3    6    0.313   0.0134
?     2     0.1   1     1     6    7    0.307   0.0123
2     1     0.1   1     1     2    6    0.313   0.0155
2     ?     0.1   1     1     3    6    0.308   0.0126
2     2     0.0   1     1     0    6    0.213   0.0066
2     2     0.2   1     1     6    7    0.372   0.0192
2     2     0.1   0.5   1     3    7    0.276   0.0123
2     2     0.1   2     1     2    6    0.351   0.0208
2     2     0.1   ∞     1     0    6    0.387   0.0298
2     2     0.1   1     0.5   6    12   0.339   0.0111
2     2     0.1   1     2     2    3    0.312   0.0155


Figure 1. In the base case, cost-rate versus M for
various values of K (K = 2, 3, 4, 5).

The effect of heterogeneity (through p) is large, 
and as p increases we see in Table 1 that the inspec- 
tion phase extends. Furthermore, there is a three- 
fold increase in the failure rate between p = 0 and 
p = 0.2. Presumably the increase would be greater 
without inspection, and this is responsible for the 
high cost-rate of the pure inspection policy in row 
3 of Table 2. This is confirmed by the very high 
failure rate for the pure inspection policy in row 3 
of Table 3. 

The mean delay time $1/\lambda$ also has a large influence
on the policy. As the mean delay time decreases,
the usual response in an inspection policy is to
carry out more frequent inspection. However, in
the policy here, this frequency is fixed and only the
length of the inspection phase can be increased.
Thus, the opportunity to prevent failures through
inspection and consequently the replacement of
defective components is limited by the frequency
of opportunities (1/S). Nonetheless, inspection at
these opportunities is still beneficial.


Table 2. Policy cost-rate comparison. $C_I = 0.08$, $C_R = 1$,
$C_F = 8$.

                  Cost-rate, C_∞
                  K, M     Pure         Pure          Whole life
                  policy   inspection   replacement   inspection
p     λ     S              (M = ∞)      (K = 0)       (K = M)
0     1     1     0.213    0.446        0.213         0.276
0.1   1     1     0.313    0.477        0.338         0.326
0.2   1     1     0.372    0.512        0.465         0.381
0.1   0.5   1     0.276    0.357        0.304         0.287
0.1   2     1     0.351    0.631        0.359         0.372
0.1   ∞     1     0.387    0.965        0.388         0.466
0.1   1     0.5   0.339    0.445        0.339         0.369
0.1   1     2     0.312    0.562        0.335         0.324


Table 3. Policy failure-rate comparison. $C_I = 0.08$,
$C_R = 1$, $C_F = 8$.

                  Failure rate, 1/μ
                  K, M     Pure         Pure          Whole life
                  policy   inspection   replacement   inspection
p     λ     S              (M = ∞)      (K = 0)       (K = M)
0     1     1     0.0066   0.0379       0.0066        0.0041
0.1   1     1     0.0134   0.0411       0.0233        0.0129
0.2   1     1     0.0192   0.0448       0.0434        0.0192
0.1   0.5   1     0.0123   0.0238       0.0219        0.0098
0.1   2     1     0.0208   0.0635       0.0259        0.0163
0.1   ∞     1     0.0298   0.1117       0.0299        0.0299
0.1   1     0.5   0.0111   0.0245       0.0250        0.0088
0.1   1     2     0.0155   0.0601       0.0231        0.0155

Although S is not a decision variable in our
formulation of the model, we should study its effect,
which is quite large for both the cost-rate and the
failure rate. There is a 40% increase in the failure
rate between S=0.5 and S=2. Notice however that 
the length of the inspection phase (KS) changes 
little and the age at replacement (MS) does not 
change at all. The increased cost-rate when S = 0.5 
is explained by the increased cost of inspection. 
Thus, the statistical properties of the weak sub- 
population determine broadly the length of the 
inspection phase and then the inspection regime 
follows from this and the frequency of opportuni- 
ties. Obviously the frequency of opportunities lim- 
its how often inspections can be carried out. 
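
The arithmetic behind these statements can be checked directly from the three rows of Table 1 in which only S varies. The sketch below uses the values as read from that table (p = 0.1, mean delay time of one year); the base-case K = 3 is our reading of a poorly legible table cell and should be treated as an assumption.

```python
# (S, K, M, failure rate 1/mu) as read from Table 1; K = 3 for S = 1 is assumed.
rows = [(0.5, 6, 12, 0.0111), (1.0, 3, 6, 0.0134), (2.0, 2, 3, 0.0155)]

for s, k, m, rate in rows:
    print(f"S={s}: inspection phase KS={k * s}, replacement age MS={m * s}")

# Relative increase in the failure rate between S = 0.5 and S = 2.
increase = rows[-1][3] / rows[0][3] - 1.0
print(f"failure-rate increase: {increase:.0%}")
```

The inspection phase KS stays between 3 and 4 years and the replacement age MS stays at 6 years, while the failure rate rises by roughly 40%.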

The effect of $\eta_1$ is curious (when the life of the
weak components is shortest, the cost rate is not
the largest) and we have no explanation for this.

Focusing more broadly on Tables 2 and 3 and 
the policy comparisons (noting that each policy 
reported is the respective optimum policy in its 
class), we can see the use of an inappropriate 
policy has a more profound effect on cost and reli- 
ability than do the failure characteristics of the 
system. Pure replacement is always a high-cost,
low-reliability policy. Whole life inspection gives the
highest reliability and this is to be expected; in the 
circumstances, additional inspections only have a 
cost-disadvantage. Pure replacement and whole 
life inspection are the extremes of the K, M policy, 
and again Tables 2 and 3 suggest that if a main- 
tainer is unsure about the appropriate frequency of 
inspection then more is better than less. 


6 DISCUSSION 


We study a hybrid inspection and replacement 
policy in which the opportunities to carry out 
inspections and replacement are periodic but lim- 
ited and somewhat out of the control of the main- 
tenance decision-maker. We consider in particular 
the effect of heterogeneity of component quality, 
so that early-life defects are possible. The model is 
related to the hybrid inspection and replacement 
policy proposed by Scarf et al. (2009) and to the 
general hybrid opportunistic policy of Cavalcante 
et al. (2017) in which opportunities are also limited 
but occurring at random. The policy proposed in 
the paper here is natural in the context of the main- 
tenance of an offshore wind-farm. 

The calculation of the long-run average cost
per unit time (cost-rate) for the policy is presented.
We also calculate a measure of the
operational reliability of the system. We illustrate
the policies with a numerical example. We compare,
in terms of cost-rate, the proposed policy
with a number of policies that are special cases
including pure inspection and pure replacement.
Our results indicate a number of points.
Component heterogeneity drives the length of the
inspection phase, and then the frequency of
inspection during this phase is determined not by
the decision-maker but by the frequency of
opportunities. Similarly the age limit for replacement is
driven by the system failure characteristics and the
costs rather than the frequency of opportunities.
The inspection phase of the hybrid policy is most
beneficial when the quality of replacements is
poorest. Finally, if the maintainer is in
doubt about the length of the inspection phase,
then more inspection is better than less, provided
that inspections do not induce defects (Scarf and
Cavalcante, 2012).

The fixed frequency of opportunities for
maintenance can simplify maintenance planning, and
the policy we propose provides a simple framework
in which to mitigate the effects of variable quality
in the execution of maintenance interventions.
An interesting question relates to the effectiveness
of continuous condition monitoring in the
circumstances when the maintenance frequency is fixed.
In a simple, single component system, alarms are
not useful if the system cannot be accessed.
Nonetheless, if failure can be prevented through remote
shutdown, then we would expect continuous
monitoring to be beneficial.


ABSTRACT: Signal monitoring is one of the basic tasks of satellite system maintenance.
Currently, civil aviation develops navigation solutions based primarily on satellites,
regarding them as future-oriented. This activity is coordinated by the ICAO (International
Civil Aviation Organization), which oversees the operations of the Global Navigation Satellite System
(GNSS). The analysis of satellite system errors is a major aspect limiting the operational functioning of
such systems in air transport. From the point of view of this study, the tropospheric and ionospheric
errors deserve special attention. It turns out that the time of year and even time of day can have a
significant impact on the quality of the satellite signal and, therefore, on the operational safety of aircraft.
Relationships between selected external factors (temperature, pressure, cloudiness, precipitation, air
humidity) and their effect on signal interference will be tested using fuzzy reasoning.


1 INTRODUCTION 


The high safety level in aviation is placed on top
of the pyramid of industrial challenges for modern
operators and service providers. The rationale
for the selection of the research problem is the fact 
that the satellite systems are considered to be the 
future of navigation and surveillance in aviation. 
Failure to meet the requirements set out for satellite 
signals prevents their operational use (Siergiejczyk 
& Krzykowska 2014). The satellite systems play a 
significant role in programmes relating to the devel- 
opment of the aviation technology, including the 
SESAR programme (Single European Sky ATM 
Research), which is a technological component 
of the SES (Single European Sky) project imple- 
mented in the EU (Kierzkowski & Kisiel 2016). The 
conditions set out for the use of satellite systems 
in, for example, air traffic operations are, therefore, 
associated with four defined, main signal parame- 
ters: accuracy, availability, continuity and integrity. 


2 SATELLITE SIGNAL PARAMETERS 


The satellite signal used in aviation is subject to
particularly stringent functional requirements
(Siergiejczyk et al. 2015). Therefore, it is crucial to satisfy
them. The aforementioned requirements are deter- 
mined with navigational parameters (International 
Civil Aviation Organization 2006). They include: 


• accuracy - defined by the error in a determined
position; in GNSS it is the difference between
the determined and actual position; with a
probability of at least 95%, the measurement
error should be within the specified accuracy;

• integrity - a measure of confidence in the
validity of information provided by a system;
it covers the capability of a system to deliver
appropriate warnings (alarms) to a user within
a predetermined time, including information
on when not to use the system;

• continuity - the ability of a system to perform
its assumed function without unplanned
interruptions during an executed flight operation;

• availability - the percentage of time during
which a satellite system can be used for
navigation, and during which reliable information
is passed on to the crew, a control system
or other aircraft flight management systems.


The requirements in relation to accuracy indi- 
cate that in a large set of independent samples, at 
least 95% should meet specified conditions (stated 
in metres, per each satellite system type). Such accu- 
racy must be satisfied in relation to the worst geom- 
etry of a satellite constellation, for which the system 
is to be available. It should be noted that position 
errors, in the case of, e.g. a GPS system, consist of 
satellite clock and ephemeris errors. They do not 
include ionospheric and tropospheric delays, multi- 
path errors or self-receiver noise. The latter are in 
each case included in the standards regarding receiv- 
ers (International Civil Aviation Organization 2006). 
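
The 95% accuracy criterion described above can be sketched as a simple check on a set of measured position errors. The function names, the nearest-rank percentile convention, the sample and the limit values are all illustrative assumptions, not values taken from the ICAO standard.

```python
def percentile_95(errors_m):
    """95th percentile of position errors (metres), nearest-rank method."""
    if not errors_m:
        raise ValueError("at least one error sample is required")
    ordered = sorted(errors_m)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]

def meets_accuracy(errors_m, limit_m):
    """True if at least 95% of the samples are within the specified limit."""
    return percentile_95(errors_m) <= limit_m

# Hypothetical sample of 20 horizontal position errors: 0.5, 1.0, ..., 10.0 m.
sample = [0.5 * k for k in range(1, 21)]
print(percentile_95(sample))
print(meets_accuracy(sample, limit_m=16.0), meets_accuracy(sample, limit_m=9.0))
```

In practice the samples would need to be independent and collected under the worst satellite geometry for which the service is declared available, as the text notes.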

In the context of integrity, in order to determine
whether a location error is acceptable, an
alarm limit is specified, which reflects the
maximum permissible position error that will not
undermine the executed flight operation. It should
be noted that a satellite navigation signal is
simultaneously transmitted to many objects
(aircraft) over a large area, often one or more
continents. Therefore, the impact of losing
integrity of a satellite system on an air traffic
management system will be much more significant
than in the case of conventional navigation
methods. Hence the stringent requirements regarding
the parameters. Information about the loss of
signal integrity (or exceeding the permissible values
of other parameters), delivered sufficiently early,
should result in abandoning the use of satellite
navigation or discontinuing the operation (in
the case of a take-off or landing). Furthermore, a
unique feature of GNSS navigation is that
navigation capabilities change over time,
depending on the changing satellite constellation.
The impact of changes in the space segment may be
increased by an additional fault in the ground
segment, e.g. damage to one of the components
(International Civil Aviation Organization 2006).

In the case of en-route, approach and landing 
operations—the continuity of the service is associ- 
ated with the capability of a navigation system to 
deliver output data with a specified integrity and 
accuracy over the course of the operation, assum- 
ing that the data were available at the beginning of 
the operation. Due to the fact that the length of 
individual operations is variable, the requirement 
regarding the continuity is defined as a range of 
signal discontinuity probability values per hour. 
The lower value of the range is the minimum continuity
value, at which a system may be used in areas with
low traffic and a complex airspace structure (these
are areas with a low number of navigation system
failures per number of aircraft). The upper value
enables application in areas with heavy traffic
and a complex airspace structure (these are areas
with a higher number of navigation system failures
per number of aircraft). It is worth noting that
flight planning may not be approved if it is based 
solely on GNSS navigation, in which a signal is 
burdened with a high risk of a continuity loss at 
the time of planning the executed operation (Inter- 
national Civil Aviation Organization 2006). 
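
Because the continuity requirement is stated per hour while operations vary in length, a per-hour value has to be scaled to the duration of the operation. A minimal sketch, assuming a constant hourly probability of continuity loss that is independent over time (an assumption made here for illustration, with hypothetical numbers):

```python
def prob_no_interruption(loss_prob_per_hour, duration_h):
    """Probability that continuity is maintained over duration_h hours."""
    return (1.0 - loss_prob_per_hour) ** duration_h

# Hypothetical values: hourly loss probability of 1e-4, 15-minute approach.
p_ok = prob_no_interruption(1e-4, 0.25)
print(p_ok, 1.0 - p_ok)  # continuity probability and continuity risk
```

For small hourly probabilities the risk over the operation is approximately the hourly value multiplied by the duration in hours.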

Defining the requirements concerning GNSS 
availability should be considered in terms of the 
expected level of the provided service. Certain 
requirements will be set out for a system, which is 
to replace the existing navigation infrastructure, 
and different ones for a system supporting the cur- 
rent infrastructure (International Civil Aviation 
Organization 2006). Basically, the determination 
of a GNSS signal availability criterion for given 
operations or areas, should be based on: 


a. traffic intensity and complexity;
b. the presence of back-up navigation aids;
c. coverage of the area with primary and secondary
surveillance;
d. guidance procedures to another airport;
e. the navigation system used at a back-up airport;
f. duration of interruptions in signal availability;
g. geographical range of interruptions.

In addition, according to the International Civil
Aviation Organization, GNSS availability should
be determined through engineering, analysis and
modelling, and not by measurement alone. The
signal availability model should take into account,
among other things, ionospheric and tropospheric
errors as well as receiver faults, which it uses to
determine integrity via the calculated HPL
(Horizontal Protection Level) and VPL (Vertical
Protection Level) indicators (Januszewski 2012,
2013).


3 LITERATURE STATE OF THE ART 


The research problem addressed in this paper is not
only the analysis of satellite signal parameters, but
also the search for external factors that interfere
with that signal. Weather conditions are certainly
among them. The matter is already known in the
domestic subject literature. R. Zieliński discusses
thermal noise and its presence in Earth-satellite and
satellite-Earth links (Zielinski 2009). A ground
receiving antenna picks up noise through the sky
luminance temperature (sky radiation), whereas for a
satellite antenna the noise source is the Earth’s
surface with a defined thermodynamic temperature.
Attention was also paid to additional losses arising
from precipitation. The signal attenuation they cause
depends on the precipitation intensity, most often
expressed in mm/h. The European Broadcasting
Union compiled measurement results showing signal
attenuation as a function of precipitation intensity
at a frequency of 11.5 GHz. These statistics provide
attenuation distribution function values, expressed
in dB, for 99% and 99.9% of the time for the worst
month in Europe. The relationship between
meteorological conditions and GNSS position
determination accuracy has also been covered in
renowned scientific journals (Wilgan et al. 2015).
However, in most cases the impact of weather on
satellite signal propagation has been described in
the literature in terms of tropospheric errors. The
following factors are distinguished: for dry air,
density, pressure and temperature; for humid air,
the humidity caused by clouds, rain and fog.

It should be noted that the analysis of external
factors, including meteorological ones, is also of
great significance in other fields of air traffic.
Existing studies cover, inter alia, the impact of
meteorological conditions on aircraft landing
operations and the modelling of external factors in
the context of air traffic. These publications
demonstrate the great importance of the various
factors present in air traffic. It seems crucial that
meteorological factors should not be underestimated
in any field of air traffic (Rychlicki & Miszkiewicz
2013).

For a received satellite signal, an appropriate
correction should be applied that takes into account
its passage through the different layers of the
atmosphere. For example, to determine the correction
for the passage of the signal through the ionosphere,
the vertical component of the electron density,
TEC [el/m²], is used. TEC (Total Electron Content)
is the total number of electrons between two points
along a column with a cross-section of 1 m².
The study of the TEC variable, its modelling and its
forecasting have become a popular topic in
contemporary science in the context of satellite
technology development (Rius et al. 1994). Because
the TEC value has been shown to have a significant
impact on the quality of the received satellite
signal, much attention has been paid to TEC
prediction, including via artificial neural networks
(Paul & Sur 2013).
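
As a rough illustration of why TEC matters, the range error caused by the
ionosphere can be estimated from the TEC along the signal path. The sketch
below assumes the standard first-order approximation (delay in metres =
40.3 · TEC / f², with TEC in el/m² and f in Hz); this formula, the function
name and the example TEC value are not taken from the paper and serve only
as an illustration.

```python
# GPS L1 carrier frequency in Hz; EGNOS also broadcasts on L1.
GPS_L1_HZ = 1575.42e6

# One TEC unit (TECU) corresponds to 1e16 electrons per square metre.
TECU = 1e16


def ionospheric_delay_m(tec_el_per_m2: float, freq_hz: float = GPS_L1_HZ) -> float:
    """First-order ionospheric range delay in metres (assumed approximation)."""
    return 40.3 * tec_el_per_m2 / freq_hz ** 2


# Hypothetical example: a TEC of 20 TECU along the signal path.
delay_20_tecu = ionospheric_delay_m(20 * TECU)
```

Under this approximation the delay grows linearly with TEC, which is why
TEC forecasts translate directly into range corrections.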


4 RESEARCH METHODOLOGY 


The use of fuzzy sets for controlling and modelling
processes has a long history. Originally, the
creators of fuzzy logic envisaged applications in
fields such as economics or psychology, where human
perception plays a crucial role and phenomena can
only be described imprecisely, that is, in a fuzzy
manner. However, as early as the 1970s it was
recognised that the tool could also be used to
control processes. Nowadays, most of its applications
concern control, quite often of technical systems
(Zurek & Grzesik 2015, Robinson et al. 2005,
Skorupski & Uchronski 2016, Stańczyk & Stelmach
2014, Losurdo et al. 2017).

In principle, the literature defines two approaches
to fuzzy control: descriptive and prescriptive
(Zadeh 2008). The first is based on the expertise of
an operator who knows from experience how to control
a process. It is a traditional approach that is not
based on a model. The second approach assumes the
existence of a stochastic or deterministic model and
defines how to control it optimally. The traditional
approach is closer to the studies in this paper,
because no model of the process is known that
defines the output (satellite signal interference)
as a function of the input (weather conditions).
This means that the process is a so-called black
box. However, knowledge of how to control the
process correctly, i.e., which control to choose for
a current output, is available (Kacprzyk 2001).

Constructing a model and conducting the
aforementioned tests requires a sufficient amount of
data (Siergiejczyk & Krzykowska 2017). The satellite
signal analysis was based on EGNOS system data. A
total of 181 measurements for the PRN120 satellite,
conducted in the first half of 2014 at a station in
Warsaw for an APV-I vertical guidance approach
operation, were used to construct the model. This
period, area and satellite were selected because of
the completeness of the data, which is essential in
such studies for the reliability of the results. By
agreement with the Institute of Meteorology and
Water Management in Warsaw, a list of weather
condition measurements (cloud cover, humidity,
precipitation, pressure and temperature) was
obtained for the same period and area. The Space
Research Centre of the Polish Academy of Sciences
publishes data on solar activity, including the
number of sunspots, the number of observations and
standard deviations. These data were also used in
the research.

The article presents a model of satellite signal
accuracy. The following input data were identified:

• cloud cover;
• air humidity;
• precipitation;
• average air temperature;
• atmospheric pressure;
• solar activity expressed as the number of
  sunspots (daily sunspot number).


Cloud cover, humidity and precipitation form the
first group of input data, which determines the wet
part of the troposphere, associated with water
vapour content. Two other factors, temperature and
pressure, belong to the dry part of the troposphere
and form the second data group. According to the
literature (Woloszyn 2009), the dry component causes
almost 90% of the tropospheric error (delay). The
last factor, solar activity, forms the third group
of input data, associated with the ionosphere and
the ionospheric delay (Siergiejczyk & Krzykowska
2017).

The output variable also had to be defined. It is
the accuracy error, determined as the average of
HNSE (Horizontal Navigation System Error) and VNSE
(Vertical Navigation System Error).

The next tables summarise the linguistic values of
all variables and the numerical values assigned to
them, which represent full membership of a fuzzy
set. For the input variables, this classification
was based on the literature of the subject
(Woloszyn 2009) (Moszkowicz & Tuszynska 2006). The
output variable ranges were defined differently,
using expert knowledge and the applicable
requirements for the use of a satellite signal in
civil aviation imposed by ICAO (International Civil
Aviation Organization 2006). The input and output
variables were assigned Gaussian membership
functions according to the ranges given in Table 2.
A total of 126 rules were set out.

Table 1. Linguistic values of input and output variables.

I/O data  Linguistic variable  Linguistic values
I1        Cloud cover          small, average, big
I2        Humidity             v. low, low, average, high, v. high
I3        Precipitation        small, average, big
I4        Temperature          low, average, high
I5        Pressure             low, average, high
I6        Solar activity       v. low, low, moderate, high, v. high
O         Accuracy error       small, average, big

Table 2. The function of linguistic variable membership.

I/O data  Measure                            Measurement value range  Numerical values
I1        cloud cover (okta scale)           0-8                      0, 4, 8
I2        %                                  42-100                   42, 57, 72, 86, 100
I3        mm water head                      0-25                     0, 12.5, 25
I4        average air temperature in °C      (-15)-24                 -15, 4.5, 24
I5        hPa                                985-1024                 985, 1005, 1024
I6        sunspot number                     40-222                   40, [illegible], 150, 222
O         average HNSE and VNSE value        5-8                      5, 6.5, 8, 9.5
          in metres


5 RESEARCH RESULTS


The following visualizations present selected
configurations of atmospheric factors with the most
significant impact on the accuracy of a satellite
signal.

Figure 1. Graphic representation of the impact of
temperature and solar activity on the accuracy error.

Figure 2. Graphic representation of the impact of
cloud cover and precipitation on the accuracy error.

In the presented model, solar activity (I6) was
more important than temperature (I4); nonetheless,
the impact of low temperature on the increase of
the accuracy error is noticeable. The configuration
of precipitation (I3) with cloud cover (I1) and
their impact on the satellite signal is interesting.
Figure 2 shows an increase of the accuracy error in
conditions of big and small cloud cover and rather
low precipitation. This leads to a very important
conclusion: we are able to indicate a day on which
it was not the weather that increased the signal
accuracy error, because pressure changes (I5)
remained insignificant, whereas humidity (I2)
contributed to the error only to a small extent.
Such was the situation, for example, on 14 March
2014, when the applied rule (no. 78) had the
following form:


If the cloud cover is average and humidity is 
very low and precipitation is low and temperature 
is average and pressure is average and solar activity 
is moderate then the accuracy error is high. 
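
The way such a rule is evaluated can be sketched as follows. This is a
minimal illustration, not the authors' implementation: the Gaussian
membership functions are centred on the numerical values of Table 2 where
they are legible, every spread (sigma) is an assumed value, the centre for
"moderate" solar activity is a placeholder because it cannot be read from
the source table, and the fuzzy AND is assumed to be the minimum operator.

```python
import math


def gaussian(x: float, centre: float, sigma: float) -> float:
    """Gaussian membership degree of x in a fuzzy set centred at `centre`."""
    return math.exp(-((x - centre) ** 2) / (2.0 * sigma ** 2))


# (centre, sigma) of the fuzzy sets named in the antecedent of rule no. 78.
# Centres follow Table 2 where legible; every sigma is an assumed value.
RULE_78 = {
    "cloud_cover": (4.0, 2.0),        # "average", oktas
    "humidity": (42.0, 7.5),          # "very low", %
    "precipitation": (0.0, 6.0),      # "low", mm
    "temperature": (4.5, 10.0),       # "average", deg C
    "pressure": (1005.0, 10.0),       # "average", hPa
    "solar_activity": (131.0, 25.0),  # "moderate"; centre is a placeholder
}


def firing_strength(observation: dict) -> float:
    """Degree to which the rule applies, using min as the fuzzy AND (assumed)."""
    return min(
        gaussian(observation[name], centre, sigma)
        for name, (centre, sigma) in RULE_78.items()
    )


# Hypothetical observation, not a measurement from the study.
day = {
    "cloud_cover": 4,
    "humidity": 45,
    "precipitation": 0.5,
    "temperature": 6.0,
    "pressure": 1003,
    "solar_activity": 120,
}
strength = firing_strength(day)
```

With the minimum operator, the firing strength is limited by the antecedent
that matches worst, so a single poorly matching input is enough to weaken
the rule.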

In most other cases, when the accuracy error 
reached a linguistic value of “high”, solar activity 
would assume a linguistic value of “high” or “very 
high”. The presented example, in some ways, con- 
firms the effectiveness of the model and the tool. 


6 SUMMARY 


The described cases are undoubtedly a major
advantage of the model. In many situations, knowing
that the source of an error should not be sought
among weather conditions is the more important
result, especially when the factors have no
clear-cut impact on signal interference. Overall, it
can be concluded that the tool and the model
implemented in it constitute a clear basis for
evaluating signal quality as a function of
tropospheric and ionospheric factors.


ABSTRACT: In this paper, a framework for predicting the remaining lifetime of existing waterways
infrastructures, based on stochastic modelling of deterioration processes and Bayesian analysis, is
presented. The application of Bayes’ theorem is motivated by the availability of expert knowledge and by
the collection of both qualitative and quantitative data from the structure. An original method is proposed
to derive the prior statistical parameters of the gamma distribution describing the stochastic deterioration
process, based on the assumption that the lifetime distribution can be approximated by the
Birnbaum-Saunders statistical model. A suitable Bayesian Network is finally implemented to improve the
classification of the structure with respect to its proneness to damage. The outcome of the research is
intended to assist the owners of large infrastructure networks in planning and prioritizing maintenance
interventions.


1 INTRODUCTION 

In the last decade, the management of
infrastructure assets has become an increasingly
important engineering task. As stated in ISO 55000
(2014), accurate predictions of lifetime
distributions are required to develop a strategic
asset management plan, and a life-cycle management
approach is fundamental to realizing value from the
asset. The German Federal Waterways Engineering and
Research Institute (BAW) has developed a management
system based on fixed inspection intervals, which in
principle allows repair or strengthening
interventions to be optimized; these are planned and
carried out according to the outcomes of the
inspections. Although plenty of data on the
condition of structures are available at the BAW,
further steps are needed to extract information
about asset lifetime (Haider 2012). Transforming
data into useful information is not always
straightforward, and several approaches can be found
in the literature: for example, Trappey (2012)
evaluates the asset lifetime using logistic
regression; Lim & Mba (2013) estimate the remaining
useful life with a Kalman filter, while Tse & Shen
(2013) pursue the same goal using support vector
machines.

Another way to reach this goal is through a more
accurate lifetime prediction, which can be obtained
by modelling degradation phenomena as stochastic
processes (Riascos-Ochoa 2016); the parameters of
the process can also be updated by applying Bayes’
theorem whenever new information about the state of
damage becomes available (Ang & Tang 2007). However,
this approach, which has already been adopted by
Bousquet (2014) and Haowei (2015), has the
disadvantage that only quantitative data can be
considered in the updating, and only the uncertainty
directly affecting the model parameters can be
reduced.

This paper therefore proposes a new procedure that
allows prior knowledge to be updated using both
quantitative data and qualitative information
through the definition and implementation of a
Bayesian Network. The Bayesian Network also makes it
possible to identify the stochastic process that
best describes the degradation phenomena affecting
the structures; Bayes’ theorem is then applied to
reduce the uncertainty affecting the parameters of
the identified stochastic process.

The paper is organized as follows: Chapter 2
presents the current approach to asset management
implemented by the BAW, focusing on its advantages
and drawbacks; Chapter 3 examines the information
available for lifetime prediction; Chapter 4
presents the new approach to lifetime evaluation and
briefly reviews the theoretical background of gamma
processes, Bayesian analysis and Bayesian Networks;
Chapter 5 applies the new approach to a real case
study, paying particular attention to the
elicitation of the prior distribution and the
Bayesian Network; Chapter 6 draws conclusions and
gives an outlook.


2 CURRENT APPROACH TO ASSET 
MANAGEMENT 


The BAW has developed tools for the management of a
very large number of waterways infrastructures. The
portfolio comprises several types of construction
of different ages, such as locks, weirs, culverts,
canal bridges and lighthouses. A large part of the
portfolio consists of massive structures that are
more than 100 years old and were designed according
to empirical methods with limited and simplified
static calculations. These structures are
characterized by very thick sections in which a
multi-axial stress state develops. They therefore
exhibit significant “reliability reserves” and
considerable plastic reserves, as demonstrated by
the low number of recorded collapses, which were
mainly characterized by ductile failure modes.
Despite their satisfactory performance at the
ultimate limit states, these structures often fail
to meet serviceability requirements such as limits
on crack width or deformation (BAW 2015).
Furthermore, several other time-dependent factors
may affect their service life, including climate
change (Orcesi 2016) and obsolescence (Langston
2011). Above all, however, the main degradation
phenomena affecting this kind of structure are
spalling and corrosion. To manage maintenance
interventions, the BAW developed an ad hoc
maintenance management system called EMS-WSV
(Bodefeld & Kloé 2012), which is briefly summarized
below.

A database program called WSVPruf is used to store
the data collected on each structure during
execution, inspection and maintenance. All damage is
rated on an increasing scale from level 1 to level 4
(1: good condition; 4: critical condition) and
recorded by the program in a standard format.
Another program, called ‘Zustandsprognose’,
forecasts future deterioration stages of the
structure for the next 20-30 years. Here, different
approaches have been considered:


• Survival functions are applied to describe the
  deterioration of components where no evident
  damage is detected at the actual inspection time;
• A method based on discrete-time Markov processes
  is implemented in order to forecast the
  deterioration of detected damages. The parameters
  of the transition matrix are also determined
  according to survival functions;
• In some cases, physical equations have been used
  in order to validate Markov Chains.


Once the survival functions and Markov chains have
been defined, the evolution of the damage level over
time can be determined in an almost deterministic
way, as a single process. The remaining lifetime is
likewise expressed deterministically: it is the time
interval required for the damage level to reach the
critical condition, corresponding to level 4.
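
The discrete-time Markov approach described above can be sketched as
follows. The transition probabilities are invented for illustration (the
source does not report them) and one step is assumed to correspond to one
inspection interval; the code simply propagates the probability of being
in each damage level from 1 to 4.

```python
# Hypothetical one-step transition matrix between damage levels 1..4.
# Row i gives the probabilities of moving from level i+1 to each level;
# level 4 (critical condition) is treated as absorbing.
P = [
    [0.80, 0.20, 0.00, 0.00],
    [0.00, 0.75, 0.25, 0.00],
    [0.00, 0.00, 0.70, 0.30],
    [0.00, 0.00, 0.00, 1.00],
]


def step(state, matrix):
    """One Markov step: new_state[j] = sum_i state[i] * matrix[i][j]."""
    n = len(state)
    return [sum(state[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def forecast(state, matrix, steps):
    """Return the state distribution after each of `steps` transitions."""
    history = []
    for _ in range(steps):
        state = step(state, matrix)
        history.append(state)
    return history


# A damage currently rated level 2, forecast over five steps.
start = [0.0, 1.0, 0.0, 0.0]
history = forecast(start, P, steps=5)
prob_critical = [dist[3] for dist in history]
```

In this sketch `prob_critical` holds, for each step, the probability that
the damage has reached level 4, which is the quantity from which a
remaining lifetime can be read off.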

The above-mentioned methods have advantages and
drawbacks. On the one hand, physical equations are
usually treated as deterministic; they model only
some deterioration processes, and they require a
huge amount of data to determine the main parameters
of the physical laws. On the other hand, survival
functions and Markov chains are powerful and
flexible methods that can easily be adapted to
different deterioration processes. In both cases,
however, it is difficult to take into account the
influence of different factors on future degradation
stages. With Markov chains, the state of the
structure is described by a single variable, and
there are no established methods for defining the
parameters of the transition matrix. It must also be
emphasized that degradation phenomena largely depend
on other factors, such as the environmental
conditions and the quality of the materials, mainly
the quality of the concrete. Although these factors
are crucial for determining the proneness of the
structure to damage and for predicting the remaining
lifetime, they are almost completely disregarded by
the current approach.


3 AVAILABLE INFORMATION FOR 
LIFETIME PREDICTION 


Whether survival functions or Markov chains are
implemented, the parameters of the models should be
determined from real data obtained from inspections,
surveillance or ad hoc investigations. Provided that
enough data exist, the parameters can be obtained
through some form of regression analysis or
statistical investigation. However, since
inspections are usually carried out every six years,
surveillance is executed no later than three years
after each principal inspection, and ad hoc
inspections are required only after accidental
events such as ship impacts or flooding, the
available data are often insufficient.

As also underlined by Haider (2012), asset
life-cycle management is an information-intensive
task that requires generating, processing and
analyzing huge amounts of information. The data
usually consist of both qualitative and quantitative
information describing the static, constructional
and hydro-mechanical condition of the structure;
however, the quantitative data collected during
inspections are mainly obtained with simple
measuring instruments, so they are rather coarse and
affected by large uncertainties.


Another source of information is expert knowledge.
In the present case, this information was previously
collected through a Delphi interview submitted to a
total of 28 experts (BAW 2009). In the Delphi
interview, the experts were asked the following
question: “When does a special damage appear for the
first time in your opinion?”. In other words, they
were asked to elicit the survival functions for a
given degradation process and a specified
degradation level. Three degradation levels were
considered, notionally corresponding to damage
levels 2, 3 and 4, and the asset was subdivided into
three categories: fragile, normal and robust
constructions. A choice sheet with several
predefined time intervals (decades) was provided to
the experts to facilitate the comparison of the
answers.

Other relevant information can also be extracted
from unstructured data and secondary databases of
test results by conducting data analysis with
suitably developed algorithms and software, as shown
by Gao & Koronios (2012) and Croce (2018). This
information can be supplemented by the results of ad
hoc investigations, such as material or chemical
tests carried out on the structure under
consideration or on similar constructions built in
the same period in a given geographical area using
comparable construction techniques.

Finally, data need to be acquired on the climatic
actions that influence some degradation processes,
such as temperature or moisture, and on the effects
of climate change on them.

Aleatoric and epistemic uncertainties are
inevitably associated with the above-mentioned
information; the latter represent a lack of
knowledge, which can be reduced as soon as further
data become available.


4 A NEW APPROACH TO LIFETIME 
PREDICTION 


4.1 


In order to obtain a more realistic lifetime
prediction, an innovative method is proposed in this
paper, in which the deterioration phenomenon is
modelled by an appropriate stochastic process, a
gamma process, while epistemic uncertainties are
reduced by means of Bayes’ theorem.

Although updating the parameters of a stochastic
process with data collected during inspections has
already been suggested (Bousquet 2014, Haowei 2015),
the problem presented here involves some
complications that have not yet been considered. One
complication is that different levels of epistemic
uncertainty affect the lifetime prediction. The
first level concerns the proneness of the structure
to damage: even though three categories of
construction can be defined (fragile, normal and
robust) according to Chapter 3, the “category” to
which the structure belongs is not known a priori.
The second level is the uncertainty affecting the
parameters of the gamma process that describes the
degradation phenomenon within each category.

Moreover, the information derived from the
structure is both qualitative and quantitative,
whereas the degradation phenomenon is mainly
described in quantitative terms, which makes the
problem more complex. Nevertheless, qualitative data
are important for assessing the sensitivity of the
structure to damage and for identifying the
variables that best describe the degradation
process. The question thus becomes: how can
qualitative and quantitative data be combined in
order to reduce the two levels of epistemic
uncertainty identified above?

A possible answer to this question is to resort to 
a Bayesian Network (BN) and to use it as a Naive 
Bayesian Classifier in order to identify the “Prone- 
ness to damage” category of the structure and to 
remove the first level of epistemic uncertainty. Once 
the category has been identified, the parameters of 
the gamma process describing the degradation phe- 
nomenon within that class could be further updated 
considering the data about the damage progression 
collected during the inspection on the structure. 

In the following, the theoretical background to 
the proposed approach is briefly introduced. 


4.2 Gamma process 


A gamma process is suitable for modelling positive
and strictly increasing degradation data, as
suggested by Nicolai (2007) and van Noortwijk
(2009).

A non-stationary gamma process $Y(t)$ with shape
function $\Lambda(t) > 0$ and scale parameter
$\beta$ has the following properties:

1. $Y(0) = 0$ with probability one;
2. $Y(\tau) - Y(t) \sim \mathrm{Ga}(\Lambda(\tau) - \Lambda(t), \beta)$, $t \in [0, \tau)$;
3. $Y(t)$ has independent increments;

where $\mathrm{Ga}(\Lambda, \beta)$ denotes the gamma
distribution with shape $\Lambda$ and scale $\beta$.
Conditions (1) and (2) provide the probability
density function of a gamma process:

$$f_{Y(t)}(y) = \frac{y^{\Lambda(t)-1}\, e^{-y/\beta}}{\beta^{\Lambda(t)}\, \Gamma(\Lambda(t))} \tag{1}$$

where $\Gamma(\cdot)$ is the gamma function.

Thus the expected deterioration at time $t$ can be
expressed by a power law:

$$E(Y(t)) = \Lambda(t)\, \beta = a\, t^{c}\, \beta \tag{2}$$

where $a, \beta > 0$ and $c > 0$ is a parameter
describing the shape of the expected deterioration.
Values of the parameter $c$ for some relevant
deterioration processes are given by van Noortwijk
(2009).

Suppose now that the lifetime $\xi$ is defined as
the time when $Y(t)$ reaches a suitable failure
threshold $D$ (first passage). Then the cumulative
distribution function (CDF) of $\xi$ can be obtained
as:

$$F(t) = \frac{\Gamma(\Lambda(t), D_\beta)}{\Gamma(\Lambda(t))} \tag{3}$$

where $D_\beta = D/\beta$ and $\Gamma(\cdot,\cdot)$ is
the upper incomplete gamma function. Park & Padgett
(2005) showed that the exact probability density
function (pdf) of $\xi$ for a gamma process can be
derived from Equation (3). However, since the
resulting expression is too complex for practical
application, they proposed to approximate the CDF of
$\xi$ with the Birnbaum-Saunders (BS) distribution
(Birnbaum & Saunders 1969), so that the mean
lifetime $\xi_{BS}$ is simply:

$$\xi_{BS} = \left[ \frac{1}{a} \left( D_\beta + \frac{1}{2} \right) \right]^{1/c} \tag{4}$$

4.3 Bayesian updating 


Bayes’ theorem is an updating principle that allows
the calculation of conditional probabilities or
conditional densities.

In the discrete case, it describes the updating of
$p(A_i)$ to $p(A_i \mid B)$ once the event $B$ has
been observed.

In the continuous case, given two random variables
$X$ and $Z$, with conditional distribution of $X$
given $Z$, $f(x \mid z)$, and marginal distribution
$g(z)$, it describes the conditional distribution of
$Z$ given $X$:

$$g(z \mid x) = \frac{f(x \mid z)\, g(z)}{\int f(x \mid z)\, g(z)\, \mathrm{d}z} \tag{5}$$


Depending on the interpretation of probability, 
the meaning of Bayes’ theorem differs significantly. 
If probability (or density) is interpreted in a
frequentist manner, the theorem expresses the
relative frequency of an event given the occurrence of another
event. Conversely, if probability reflects the relative 
plausibility or degree of belief attributed to a cer- 
tain event, Bayes’ theorem forms the mathematical 
basis for adjusting or updating the probability as 
soon as more evidence becomes available. 

The uncertainty about the parameters $\theta$ of a
model can also be described by a probability
distribution, which in most cases has to be
interpreted as a degree of belief. If the model is
represented by a density function, a two-level
hierarchical model is obtained, in which the second
level is the pdf of the statistical parameters of
the first-level probability model. The statistical
parameters of the second-level distribution are
usually called ‘hyperparameters’, and the pdf the
‘hyperdistribution’. The degree of belief can be
updated when new data become available, so that
statements about the parameter can be made once the
posterior distribution has been defined.

The computation of the posterior distribution
involves the calculation of several integrals and is
therefore not straightforward. Simplified approaches
have been sought in order to facilitate the updating
of the prior distribution. The most popular approach
is to resort to so-called conjugate prior
distributions, which have the appealing feature that
the prior and posterior distributions have the same
functional form, so that the updated parameters can
be computed analytically. In the case of the gamma
distribution, the conjugate prior is obtained when
the rate parameter $\theta = 1/\beta$ follows a gamma
distribution, denoted as
$\theta \sim \mathrm{Ga}(\alpha', \nu')$, where
$\nu' = 1/b'$ is the rate parameter and $b'$ is the
scale. Let $x = [x_0, x_1, \ldots, x_n]$ be the
observed degradation data and
$t = [t_0, t_1, \ldots, t_n]$ the corresponding
times; denoting by $\Delta x_i = x_i - x_{i-1}$ and
$\Delta t_i = t_i - t_{i-1}$ the degradation and time
increments, respectively, the posterior shape and
rate parameters for $\theta$ can be computed in the
following way (Ang & Tang 2007):

$$\alpha'' = \alpha' + a\,(t_n^{c} - t_0^{c}) \tag{6}$$

$$\nu'' = \nu' + x_n - x_0 \tag{7}$$

and the posterior mean of $\theta$ can be written as:

$$E(\theta \mid \Delta x) = \frac{\alpha''}{\nu''} \tag{8}$$

from which the scale parameter $\beta$ is easily
obtained.
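
The conjugate update can be sketched in a few lines. This is a minimal
illustration under stated assumptions: the damage increments are taken to
be gamma distributed with shape a·(t_i^c − t_{i−1}^c) and rate θ = 1/β, the
prior on θ is a gamma distribution with a shape and a rate hyperparameter,
and all numerical values, as well as the function and argument names, are
hypothetical.

```python
def update_rate_prior(alpha_prior, nu_prior, times, damage, a, c):
    """Conjugate gamma update for the rate theta = 1/beta.

    Returns the posterior shape, the posterior rate and the posterior
    mean of theta.
    """
    # Sum of the shape increments over all inspection intervals.
    shape_gain = a * (times[-1] ** c - times[0] ** c)
    # Sum of the observed damage increments.
    damage_gain = damage[-1] - damage[0]
    alpha_post = alpha_prior + shape_gain
    nu_post = nu_prior + damage_gain
    return alpha_post, nu_post, alpha_post / nu_post


# Hypothetical inspection record: times in years and a damage measure.
times = [0.0, 6.0, 12.0, 18.0]
damage = [0.0, 2.5, 6.0, 8.5]

alpha_post, nu_post, theta_mean = update_rate_prior(
    alpha_prior=4.0, nu_prior=2.0, times=times, damage=damage, a=1.0, c=1.0
)
# Plug-in estimate of the scale parameter from the posterior mean of theta.
beta_estimate = 1.0 / theta_mean
```

Only the first and last observations enter the update because the
increments telescope; intermediate inspections would matter only if the
shape function parameters were updated as well.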


4.4 Bayesian Network 


A Bayesian Network (BN) is a flexible tool that
allows rigorous processing of both quantitative and
qualitative information (Kjærulff & Madsen 2008). It
is defined as a directed acyclic graph (DAG) that
determines a factorization of a joint probability
distribution: the nodes of the DAG represent the
variables and the directed links describe the
factorization. For each directed link from a node
$X$ to a node $Y$ (where $X$ is the ‘parent’ and $Y$
the so-called ‘child’), a conditional probability
$p(Y \mid X) = z$ is attached to $Y$. The conditional
probability expresses a rule of the following form:
if $X = x$ then $Y = y$ with probability $z$, where
$y$ and $x$ denote the states of $Y$ and $X$,
respectively. Usually $Y$ represents the effect of
$X$, which is typically not observable itself but
whose state can be inferred from the prior pdf
$p(X)$, the conditional pdf $p(Y \mid X)$ and Bayes’
theorem:

$$p(X = x \mid Y = y) = \frac{p(Y = y, X = x)}{\sum_{x} p(Y = y, X = x)} \tag{9}$$


An observation takes the form of hard evidence if
zero probability is assigned to all but one state;
otherwise it is said to provide soft evidence. So
far it has been implied that the nodes of the BN
represent discrete variables, to which probability
tables expressing conditional probabilities are
attached. However, it is also possible to have
continuous variables: in this case it is necessary
to specify a density function for each combination
of states of the parent variables. Whether the
variables are discrete or continuous, BNs can be
applied to solve a wide range of problems, including
classification. In this specific case, they are
called Naive Bayesian Classifiers, because of the
strong (naive) independence assumptions made among
the features that determine the class label.
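
A Naive Bayesian Classifier of this kind can be sketched as follows. The
class names follow the three categories used in the paper (fragile, normal,
robust); the features, their states and all probabilities are hypothetical
and chosen only to show how hard evidence on the features updates the class
probabilities.

```python
# Hypothetical prior over the "proneness to damage" class.
PRIOR = {"fragile": 0.25, "normal": 0.50, "robust": 0.25}

# Hypothetical conditional probability tables p(feature state | class).
CPT = {
    "concrete_quality": {
        "fragile": {"poor": 0.6, "good": 0.4},
        "normal": {"poor": 0.3, "good": 0.7},
        "robust": {"poor": 0.1, "good": 0.9},
    },
    "environment": {
        "fragile": {"aggressive": 0.7, "mild": 0.3},
        "normal": {"aggressive": 0.4, "mild": 0.6},
        "robust": {"aggressive": 0.2, "mild": 0.8},
    },
}


def classify(evidence):
    """Posterior over the classes given hard evidence on some features."""
    joint = {}
    for cls, prior in PRIOR.items():
        p = prior
        for feature, state in evidence.items():
            # Naive assumption: features are independent given the class.
            p *= CPT[feature][cls][state]
        joint[cls] = p
    total = sum(joint.values())  # normalising constant
    return {cls: p / total for cls, p in joint.items()}


posterior = classify({"concrete_quality": "poor", "environment": "aggressive"})
most_likely = max(posterior, key=posterior.get)
```

Features without evidence are simply left out of the product, which is
equivalent to marginalising them under the independence assumption.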


5 CASE STUDY 


In order to clarify the proposed approach, a case
study is developed here. Attention is mainly focused
on the elicitation of the prior gamma distribution
describing the stochastic process; this task is
carried out by exploiting the approximation of the
lifetime distribution with the BS distribution. The
case study illustrates a practical application of
the proposed method to a single damage detected on
an existing lock. Extensions to cases involving
several damages and different deterioration
processes will be discussed in future works.


5.1 Elicit the prior distribution 


A prior distribution for the model parameters is
elicited from the expert knowledge collected
through the Delphi interview described in
Chapter 3. It would be easier if the experts could
be asked to express their opinion in statistical
terms, for example: “Given the degradation
phenomena modelled through a gamma process, what
would be the shape and the scale parameters of the
gamma distribution?”. However, it is unlikely that
experts could give an accurate answer. Except in the
case of symmetric distributions, for which mean value
and standard deviation can be easily elicited, experts
usually think in terms of percentiles. Moreover,
expert knowledge actually concerns expected
lifetimes rather than damage increments; for this
reason, the prior information should be transformed
into a form that leads to an easy identification of the
prior gamma distribution. One way to do this is to
use Equation (4), which links the parameters of
the gamma distribution to the expected lifetime.
Given the expected lifetimes required to reach
certain degradation levels, and the speed of the
process, represented by the value of c, the
parameters of the gamma distribution can be
obtained by solving a system of equations. In
this case, two equations are required to determine
the two parameters: those related to damage levels
2 and 3 are considered, as expert knowledge about
events that happened earlier should be more
reliable than that related to future events. The
uncertainty on the parameters is defined similarly,
considering the uncertainty affecting the expected
lifetimes and assuming a confidence interval. The
procedure is very general and can be applied
to elicit a prior distribution for any degradation
process.

In the specific case, Table 1 summarizes the averages
of the expert predictions in terms of the
time lapses required by the crack width to reach
the levels corresponding to the different degradation
classes, while Table 2 reports the confidence intervals
of the predicted time lapses, corresponding to
probabilities of exceedance of 75% and 25%, respectively.
Given the progress of the degradation process in time,
it is also possible to conclude
that the degradation speed can be assumed
constant, which corresponds to c = 1. For example, in
the case of fragile construction, the system of
equations is:


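$$\begin{cases} \dfrac{0.3}{\alpha\beta} + \dfrac{1}{2\alpha} = 10 \\[2ex] \dfrac{0.7}{\alpha\beta} + \dfrac{1}{2\alpha} = 20 \end{cases}$$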

The values of parameters α and β obtained by
solving all the systems of equations are shown in
Table 3, while simulations of the gamma process
for fragile, normal and robust structures with these
values are shown in Figures 1, 2 and 3, respectively.


Table 1. Expert knowledge results from the Delphi interview
(DL: damage level, CW: crack width, E: expected
lifetime, F: fragile, N: normal, R: robust).

              E (years)
DL   CW     F     N     R
2    0.3    10    25    50
3    0.7    20    50    100
4    1      30    75    150
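
The systems of equations can be solved in closed form. The sketch below assumes that each equation has the form ρ/(αβ) + 1/(2α) = E, where ρ is the crack width of a damage level and E the expected lifetime from Table 1; this is the mean of the BS approximation for c = 1 and is an assumption about Equation (4), not a quotation of it. The function name `gamma_params` is hypothetical.

```python
def gamma_params(level_1, life_1, level_2, life_2):
    """Solve level/(alpha*beta) + 1/(2*alpha) = life at two damage levels.

    Assumed equation form (constant degradation speed, c = 1).
    Returns (alpha, beta): shape per year and scale of the gamma process.
    """
    mean_rate = (level_2 - level_1) / (life_2 - life_1)  # equals alpha * beta
    alpha = 1.0 / (2.0 * (life_1 - level_1 / mean_rate))
    beta = mean_rate / alpha
    return alpha, beta


# Expected lifetimes (years) at damage levels 2 and 3, taken from Table 1.
table_1 = {"F": (10, 20), "N": (25, 50), "R": (50, 100)}
params = {
    label: gamma_params(0.3, e2, 0.7, e3) for label, (e2, e3) in table_1.items()
}
```

Under this assumption the fragile case gives α = 0.2 and β = 0.2 and the normal case gives α = 0.08 and β = 0.2, which agrees with the values reported in Table 3.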

For analytical tractability, it
is then assumed that only β is random and that it
follows a gamma distribution, so that the prior
distribution is conjugate to the likelihood
function and the posterior belongs to the same family,
which simplifies the updating calculation. The prior
distribution can be defined by
considering the uncertainty affecting the expected
lifetime: each time lapse interval corresponds to an
interval for the parameter β. Assuming the same
probabilities of exceedance, it is possible to
elicit the shape a’ and scale b’ (or the rate v’)
parameters of the prior distribution (Table 3).


5.2 Elicit the Bayesian Network 


The Bayesian Network is elicited according to prior
information and the previously defined gamma
process. In the remainder of the paper, only the most
fundamental variables are considered; as soon as
further analyses are carried out, it
will be possible not only to consider the relationships
among a greater number of variables, but also to
elicit the structure of the network and the conditional
probabilities from the data themselves rather than from
expert knowledge. The variables considered
are: the proneness to damage of the structure (also
called damage category, DC, characterized by three
possible states: Fragile (F), Normal (N) and Robust
(R)), the concrete quality (CQ, characterized by three
possible states: Bad (B), Normal (N) and Good (G)), the
damage quantity (DQ, characterized by three possible
states: extended (E), limited (L) and sporadic (S)) and
the crack width at different time steps (CW),
characterized by 10 possible states.


Table 2. Confidence intervals for the lifetime E
predicted by the experts (F: fragile, N: normal, R: robust).

E75%-E25% (years)
F        N        R
8-12     20-30    40-60
16-24    40-60    80-120
24-36    60-90    120-180


Table 3. Elicited parameters of the gamma process and
statistical parameters of the gamma distribution describing
the uncertainty over the scale β.

     α      β     β (lower)   β (upper)   a’     b’     v’
F    0.2    0.2
N    0.08   0.2   0.15        0.25        7.27   0.75   1.33


Figure 1. Simulation of the gamma process for fragile
structures (100 sample paths).

Figure 2. Simulation of the gamma process for normal
structures (100 sample paths).

Figure 3. Simulation of the gamma process for robust
structures (100 sample paths).

Figure 4. Bayesian Network implemented in order to
classify the structure with respect to its proneness to
damage.

The states of each variable, the dependency
relationships and the associated probability tables are
shown in Figure 4. For the sake of simplicity, only
the probability table for t = 10 years is
shown here.

The concrete quality, the damage quantity and
the crack width can be observed, while the damage
category cannot; updating the probabilities of the
damage categories is therefore a key objective. It is
assumed that the observable variables are independent
given the variable to be classified. Although
this assumption facilitates the calculation of posterior
probabilities, it is actually inaccurate, as is often
the case with Naive Bayesian Classifiers. Nonetheless,
it has been demonstrated that the crude independence
model can perform surprisingly well compared with
more complicated alternatives (Hand 2001);
furthermore, precise estimates of the posterior
class probabilities are not required here, since they
are used for comparison purposes, as clarified
in Chapter 5.3. By collecting evidence on
the observable variables, it is possible to
infer the proneness to damage of the structure. It is
important to underline that assuming a parametric
deterioration process simplifies the relationships
among the variables: the deterioration at
instant t can be considered independent of the
deterioration at the previous instant, which avoids
resorting to a much more complicated dynamic
Bayesian Network (Straub 2009, Ramirez & Utne 2015).
Furthermore, the probability table associated with the
variable ‘crack width’ can be easily obtained by
simulating the gamma process. The variable is actually
continuous, and the degradation at each time step t
is represented by a gamma distribution with shape
αt and scale β (see Chapter 4.2). However, BNs that
involve continuous random variables imply some
complications, which are outside the scope of
the present paper and will be discussed in
further studies. Moreover, as can be seen
from Table 4, data about crack widths are actually
given in terms of intervals rather than single values
since, as already stressed, the measuring devices are
rough and measurements are inevitably affected by great
uncertainty. For this reason too, the continuous
variable is discretized so that each state
corresponds to collectable measurements.
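
The probability table of the crack width node can be estimated by sampling the gamma distribution at a given time and counting how often each discrete state is hit. The sketch below is only illustrative: the bin edges (ten states of width 0.1, the last one open-ended) are hypothetical, since the actual discretization is not listed in the text, and the value α = 0.04 for robust structures is not quoted from the paper but follows from applying the same elicitation to Table 1 under the assumed equation form.

```python
import random


def crack_width_table(alpha, beta, t, edges, n_samples=20000, seed=0):
    """Estimate p(CW state | damage category) at time t by simulation.

    With c = 1 the crack width at time t is Gamma(shape=alpha*t, scale=beta).
    `edges` are the upper bounds of all states except the last (open) one.
    """
    rng = random.Random(seed)
    counts = [0] * (len(edges) + 1)
    for _ in range(n_samples):
        width = rng.gammavariate(alpha * t, beta)
        state = sum(width > edge for edge in edges)  # index of the interval
        counts[state] += 1
    return [count / n_samples for count in counts]


# Hypothetical discretization into 10 states (bin width 0.1, last bin open).
edges = [round(0.1 * k, 1) for k in range(1, 10)]
# One column of the table per damage category, at t = 10 years, beta = 0.2.
table_t10 = {
    label: crack_width_table(alpha, 0.2, 10, edges)
    for label, alpha in {"F": 0.2, "N": 0.08, "R": 0.04}.items()
}
```

Each entry of `table_t10` is a column of the conditional probability table p(CW | DC); finer or coarser bins can be substituted to match the resolution of the measuring devices.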


5.3. Implementation of the Bayesian Network and 
updating of the gamma process 


Data referring to three different structures (A, B
and C) and related to a damage event having a
single cause are considered in the case study (Table 4).

First of all, the data are used to classify
the structure by means of the previously defined
BN. If the data indicate that one category has a
very high updated probability (for example higher
than 0.70), then it is possible to directly consider
the gamma distribution corresponding to that
category, and the related uncertainty of the statistical
parameters, for a further Bayesian updating.
However, in several cases the results may not point
clearly to a certain category, because two adjacent
categories have similar updated probabilities; in
this case, it is possible to conservatively consider
only the lowest category, or to collect further data.

Once the prior model has been chosen, it can be
updated by means of the analytical formulas
expressed by Equations (6) and (7).


Table 4. Inspection results on three structures, A, B and C
(CQ: concrete quality, DQ: damage quantity, CW: crack
width).

                   1st inspection     2nd inspection
Str.  Year  CQ   Year  DQ   CW     Year  DQ   CW
A     1995  1    2005  1    0.5    2017  1    0.8
B     1975  2    2005  2    0.5    2017  2    0.7
C     1955  3    2005  3    0.3    2017  3    0.4


Table 5. Updated probabilities for the damage category
(DC) derived from the Bayesian Network, updated
statistical parameters for the gamma process, and prior
and updated lifetime predictions E in years, obtained by
simulating the gamma process.

     F      N      R      a’’    v’’    β’’     E’    E’’
1    0.96   0.03   0.01   9.67   1.62   0.16    28    33
2    0.07   0.72   0.21   8.23   1.52   0.184   65    70
3    0.01   0.09   0.90   7.75   1.42   0.183   135   140


Finally, β’’ can be used to improve the prediction
of the structure lifetime, i.e. the time at which the
limit damage level will be reached; the prediction can be
obtained by simulating the gamma process with
the updated parameter.

The results are listed in Table 5.
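
A lifetime prediction of this kind can be sketched as a Monte Carlo estimate of the first time a simulated gamma process reaches the limit damage level. The snippet below assumes yearly time steps, c = 1, and a limit crack width of 1 (damage level 4 in Table 1); the function name, the number of paths and the seed are hypothetical choices, and the result is an illustration rather than a reproduction of the authors’ simulation.

```python
import random


def simulated_lifetime(alpha, beta, limit, n_paths=2000, dt=1.0, seed=0):
    """Mean first time (years) at which a gamma process reaches `limit`.

    Increments over a step dt are Gamma(shape=alpha*dt, scale=beta) (c = 1).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        level, t = 0.0, 0.0
        while level < limit:
            level += rng.gammavariate(alpha * dt, beta)
            t += dt
        total += t
    return total / n_paths


# Fragile structure: alpha = 0.2, prior scale 0.2 and updated scale 0.16.
prior_life = simulated_lifetime(0.2, 0.2, 1.0)
updated_life = simulated_lifetime(0.2, 0.16, 1.0)
```

A smaller updated scale slows the simulated degradation, so the updated lifetime estimate comes out longer than the prior one, in line with the direction of the change reported for structure 1 in Table 5.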


6 CONCLUSION 


An innovative procedure for lifetime prediction
that exploits the potential of Bayes’ theorem
has been proposed, as depicted in the flowchart
shown in Figure 5. It is based on the following points:


— The degradation phenomenon is modelled as a
stochastic process in which degradation
increments are gamma-distributed.

— Considering background information and
qualitative and quantitative data collected during
inspections, a Bayesian Network is adopted to
classify the proneness to damage of the structure.

— Uncertainties affecting the parameters of the
stochastic process are updated by applying Bayes’
theorem to quantitative data about the degradation
level reached by the structure.


The advantages of the proposed method are:


— Qualitative and quantitative data can be used at
the same time to refine lifetime predictions.

— The adoption of gamma processes to model
degradation phenomena simplifies the structure of
the Bayesian Network and the computation of
the conditional probabilities.


Finally, a case study has been developed
to show a practical application of the proposed
method. Attention has been largely devoted to the
elicitation of the prior distribution for the gamma
process, and real data collected during inspections
of existing structures have been considered. In the
authors’ opinion, this approach is very promising,
although some aspects need further study.
The value of the constant c, which regulates
the speed of degradation, should be determined
more precisely, also considering different
degradation phenomena. An improved Bayesian Network
including continuous variables, representing the
state of damage at different time steps and with
parameters directly derived from measurements,
should be developed. Uncertainties
in the collected data, as well as uncertainties in the
degradation limits, should also be duly considered.
At the final stage, once the proposed procedure
has been improved and suitably validated, its
efficiency should also be verified by comparing
theoretical time lapse predictions with empirical data on a
sufficiently large number of existing structures. The
approach can be implemented to plan the asset
life cycle in a more precise and realistic way, and
to identify the structures on which maintenance
interventions should be performed first.
ABSTRACT: This paper presents a methodology, developed through a case study, to support the selection
of preventive maintenance tasks based on the Reliability Centered Maintenance (RCM) methodology. The case
study was carried out in an electronic goods company that follows the Total Productive Maintenance (TPM)
philosophy. The methodology has six steps and is intended to be applied to all mechanical and
electromechanical equipment of the plant for which some historical intervention data are available.
To validate the methodology, one system of the plant was selected and all the steps of the methodology
were followed. With this application, a reduction of 33% in the number of failures and in production losses was
achieved, as well as an MTBF improvement of about 15% and therefore an increase in machine yield.


1 INTRODUCTION 


1.1 Background 


Companies have sought to improve their
operational efficiency with the aim of offering
high value-added products (Ahuja & Khamba 2008; Eti
et al. 2004; Konecny & Thun 2011). The growth of
mechanization and automation of plants created a
need to improve the reliability, availability and
productivity of production equipment. In this way,
consumer needs can be met and, at the same time,
competitiveness is improved (Bakri et al. 2012; Muchiri
et al. 2011; Soni 2013).

Maintenance management provides the
means to deal with this challenge because it makes it
possible to improve system performance (Kutucuoglu &
Hamali 2001), in contrast to the traditional view that
maintenance actions apply only to repairing
“broken items” (Ahuja & Khamba 2008).

Indeed, Moubray (1997) considers that the
challenges of modern management are: to select the most
appropriate techniques, to deal with all failure
modes, to have cost-effective processes and to
satisfy the expectations of customers and of society as a whole.


1.2 Objectives 


The main goal of this paper is to present a
maintenance planning methodology. The methodology is
based on the Reliability Centered Maintenance (RCM)
process and aims to be applicable to all mechanical
and electromechanical equipment. By using
failure data, it defines the most appropriate
maintenance tasks, as well as their periodicity. In
addition, with the development of the methodology the
following objectives are intended to be achieved:


— Balancing corrective and preventive mainte- 
nance tasks; 

— Improving key performance indicators of the 
machines; 

— Reducing total maintenance cost; 

— Linking the maintenance planning methodol- 
ogy and Total Productive Maintenance (TPM) 
philosophy; 

— Matching qualitative and quantitative mainte- 
nance management approaches. 


1.3. Paper outline 


This paper is structured in six sections. The first
presents the background, the goals and the
structure of the paper. The second provides a short
literature review to frame the subject of
the paper. The third section describes the development
of the critical equipment maintenance planning
methodology and explains all its steps. In section
four, one machine of the case study company’s
plant is used to exemplify the methodology. The
last section presents the conclusions of the paper.


2 LITERATURE REVIEW 


Maintenance management has evolved in recent
decades and its evolution can be divided into three main
generations. The first one corresponds to the period before the
Second World War (Moubray 1997). In this period,
machines were simple and very reliable. Hence,
maintenance was based on corrective actions to
repair the equipment when a failure occurred
(Ahuja & Khamba 2008; Garg & Deshmukh 2006;
Soni 2013). After the Second World War,
equipment became more complex, which triggered the
first preventive maintenance tasks (Alsyouf 2007;
Moubray 1997).

More recently, research in the maintenance
management field has renewed these
concepts, and new techniques have emerged, such as
condition monitoring and failure mode and effects
analysis (FMEA) (Moubray 1997). At the same
time, the TPM and RCM philosophies were developed
(Garg & Deshmukh 2006; Soni 2013).

According to the NP EN 13306 (2007) standard,
maintenance is defined as a “combination of all
technical, administrative and managerial actions
during the life cycle of an item intended to retain it
in, or restore it to, a state in which it can perform
the required function”. The same standard also
defines the types of maintenance intervention, or
strategies. At a first level, maintenance
can be divided into preventive and corrective.
Performing preventive maintenance tasks reduces the
number of failures. As the frequency of interventions
increases, the cost of preventive maintenance
increases while the cost of corrective maintenance
decreases. Thus, a replacement periodicity that
minimizes total maintenance cost should be found
(Figure 1).

Until the 1960s, maintenance was based on
the premise that every part has a defined lifetime
after which a general overhaul is necessary, so
the scheduled maintenance approach was applied
(Siddiqui & Ben-Daya 2009). However, in 1960,
airline companies started to realize that this
mindset was not cost-effective. The solution found was
to study the failure pattern of every component.


[Figure: cost per unit time plotted against maintenance policy (e.g. frequency of overhauls). Curves: total cost of maintenance and time lost; cost of maintenance policy; cost of time lost due to breakdowns. The optimal policy is marked at the minimum of the total cost.]

Figure 1. Maintenance total cost (source: Jardine & 
Tsang, 2013). 
Hence, the Maintenance Steering
Group publications were proposed, which gave rise to
the RCM methodology, named by Stanley Nowlan & Howard
Heap (Siddiqui & Ben-Daya 2009). The Maintenance
Steering Group is a coordination group
that defines maintenance plans for new aeronautics
projects. The main goal of the RCM methodology is to
preserve system functions during the operational life
of the system. To achieve this, RCM uses maintenance
strategies such as time-based maintenance
(TBM), condition-based maintenance (CBM),
run-to-failure (RTF), hidden failures and design
modification (Siddiqui & Ben-Daya 2009). This
methodology is also supported by FMEA, Fault
Tree Analysis and other decision-making tools
(Backlund & Akersten 2003; Eti et al. 2004;
Siddiqui & Ben-Daya 2009; Waeyenbergh & Pintelon
2002). One definition of the RCM process, given
by Ahuja & Khamba (2008), is: “structured,
logical process for developing or optimizing the
maintenance requirements of a physical resource in its
operating context”.

RCM methodology is based on system func- 
tions, failure modes and effects study and its aim 
is to define the system maintenance requirements 
(Ahuja & Khamba 2008; Carretero et al. 2003; 
Mokashi et al. 2002; Siddiqui & Ben-Daya 2009). 
The selected maintenance tasks should be effec- 
tive in reducing the possibility of occurrence of 
failure modes as well as applicable (Mokashi et al. 
2002; Siddiqui & Ben-Daya 2009; Smith 1993). 
In this scope, “applicable” means that the task 
should avoid or detect the failure, while “effec- 
tive” is related to costs (Siddiqui & Ben-Daya 
2009). 

In terms of application, Smith (1993) shows that 
RCM implementation should follow seven steps: 


1. System selection and information collection;

2. System boundaries definition;

3. System description and functional block
diagram;

4. System functions and functional failures;

5. FMEA;

6. Decision tree analysis;

7. Task selection.


Eisinger & Rakowsky (2001) developed a new
version of the decision tree because they believe
that the questions cannot be answered with
total certainty. Therefore, the authors developed a
tree in which the path is decided in terms of
probabilities. A set of eight questions is answered
according to the probability that the answer is
“YES”. After that, a value r_i is calculated using the
traditional method of decision trees, where i is the
index of each option. At the end, the option with
the largest r_i is selected. This procedure was applied to
a fire detection and extinction system.


Waeyenbergh & Pintelon (2002) developed a
framework that combines the TPM and RCM
philosophies with economic considerations, since they claim
that maintenance is now an economic concern.
Hence, each question of the decision tree is
answered first from a technical point of
view and then according to an economic criterion. Finally,
the decision is either to use one of the five possible
maintenance tasks or to search for additional
data.

Rastegari & Mobin (2016) studied the case of
a Swedish gearbox company. These authors set
up two flowcharts with two different points
of view: technical and output-based. The types of
questions are different in each flowchart; however,
the result is the selection of one of the following
tasks: Time-Based Maintenance (TBM), Breakdown
Maintenance (BM), CBM or Improvement
and Restoration.

Dehghanian, Fotuhi-Firuzabad, Aminifar, &
Billinton (2013) recognized the importance of RCM
but also the barriers to the implementation of this
methodology in the energy industry. To improve the
current situation, they developed an RCM version
with a cost-benefit analysis, as well as an output
evaluation to help improve maintenance plans.
Following a set of steps similar to the Smith (1993)
approach, the authors proposed a prioritization
of maintenance tasks through a Benefit-to-Cost
Ratio (BCR) index.

The main benefits of RCM cited in the literature 
are: 


— Maintenance plans quality improvement
(Deshpande & Modak 2002);

— Increase of equipment reliability and availability
(Backlund & Akersten 2003);

— Ensuring safety (Backlund & Akersten 2003);

— Increase of equipment lifetime (Carretero et al.
2003);

— Maintenance costs reduction (Carretero et al.
2003; Waeyenbergh & Pintelon 2002);

— Unplanned downtime reduction (Backlund &
Akersten 2003; Carretero et al. 2003).


On the other hand, the most cited disadvantage
is that the RCM process is qualitative, is focused
only on system reliability and
does not quantify maintenance costs (Braglia
et al. 2013; Waeyenbergh & Pintelon 2002).


3 METHODOLOGY 


The proposed methodology aims to define a
cost-effective maintenance plan based on the machine’s
failure modes. The process follows the RCM
principles and uses some TPM concepts, since the TPM
philosophy is currently adopted by the case study
company. The methodology has six steps and
starts with the selection of the critical system, the
team building and the collection of data about
the selected system. The second step consists of
the system description and the identification of the analysis
boundaries. Afterwards, the system functions and
functional failures are defined. The fourth step uses
a matrix to find the relationship between
components and functional failures as a basis for applying the
machinery FMEA technique. In the next step, a
decision tree is proposed to help find out which
maintenance tasks can be used to mitigate the
critical failure modes identified by FMEA. Finally,
a cost analysis step is proposed that allows
a cost-effective plan to be chosen for the machine.


3.1 System selection, team building and data collection


This step starts with a prioritization of equipment
based on three criteria: Mean Time Between
Failures (MTBF), spare parts cost (C) and total preventive
maintenance time (T). Using the Analytic Hierarchy
Process (AHP) method, weights were assigned
to these criteria with the support of the
case study company. This was done with the
help of Saaty’s relative importance scale (Table 1).

The resulting matrix, which was validated by a
consistency test, is shown in Table 2.


Table 1. Saaty’s relative importance scale.

Intensity    Meaning                               Explanation
1            Equal importance                      Two activities contribute equally to the objective
3            Weak importance of one over another   Experience and judgment slightly favor one activity over another
5            Essential or strong importance        Experience and judgment strongly favor one activity over another
7            Demonstrated importance               An activity is strongly favored and its dominance demonstrated in practice
9            Absolute importance                   The evidence favoring one activity over another is of the highest possible order of affirmation
2, 4, 6, 8   Intermediate values                   When compromise is needed between two definitions


Table 2. Matrix of comparison of criteria.

        MTBF   C     T        Weight
MTBF    1      0.2   0.3333   0.1096
C       5      1     2        0.5813
T       3      0.5   1        0.3092
Sum     9      1.7   3.3333   1
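
The weights in Table 2 can be reproduced with a short calculation. The sketch below assumes the common AHP approximation in which each column of the comparison matrix is normalized and the rows are then averaged, followed by Saaty’s consistency ratio with a random index of 0.58 for a 3x3 matrix; the paper does not state which variant was used, so both choices are assumptions. The entry for C versus MTBF is taken as 5, the reciprocal of 0.2, which also matches the column sum of 9. The function names are hypothetical.

```python
def ahp_weights(matrix):
    """Approximate AHP priorities: normalize each column, then average rows."""
    n = len(matrix)
    col_sums = [sum(row[j] for row in matrix) for j in range(n)]
    return [sum(row[j] / col_sums[j] for j in range(n)) / n for row in matrix]


def consistency_ratio(matrix, weights, random_index=0.58):
    """Saaty's consistency ratio; 0.58 is the random index for a 3x3 matrix."""
    n = len(matrix)
    lambda_max = sum(
        sum(a * w for a, w in zip(row, weights)) / weights[i]
        for i, row in enumerate(matrix)
    ) / n
    return ((lambda_max - n) / (n - 1)) / random_index


# Pairwise comparisons in the order MTBF, C, T (Table 2).
comparison = [
    [1, 1 / 5, 1 / 3],
    [5, 1, 2],
    [3, 1 / 2, 1],
]
weights = ahp_weights(comparison)
cr = consistency_ratio(comparison, weights)
```

The resulting weights agree with the last column of Table 2 to four decimals, and the consistency ratio is well below the usual 0.1 acceptance threshold.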
Table 3. Criteria’s scale.

Level   MTBF (min)           C (€)              T (min)
1       MTBF > 4897          C < 390            T < 333
2       4897 > MTBF > 1825   390 < C < 703      333 < T < 442
3       1825 > MTBF > 976    703 < C < 1483     442 < T < 664
4       976 > MTBF > 766     1483 < C < 5456    664 < T < 976
5       MTBF < 766           C ≥ 5456           T > 976

At this point, another concern must be addressed.
Each criterion has its own unit of measure; thus,
a quantitative scale with five levels was defined,
where 1 is assigned to the best scenario and 5
to the worst. Table 3 presents an example
obtained with data from the case study
company.

The levels shown in Table 3 were obtained by
analyzing a random sample of the plant machines.
For each type of machine, the MTBF, spare parts
cost and preventive maintenance time were
collected over a period of 18 months. The levels were
chosen using the 20th, 40th, 60th and 80th percentiles of
the sample.

After the collection of the values, the machine
ranking (R) is obtained by equation (1), considering
the previously defined weights.


R = 0.1096*MTBF + 0.5813*C + 0.3092*T (1)


The critical equipment is the one that has the 
lowest R value. 

After the selection of the most critical
equipment, the team is built according to the equipment
type. It is also necessary to collect some information
about the equipment, such as the supplier
specifications manual, the defined process rules for production
and event data. The latter comprise corrective,
preventive and improvement actions.


3.2 System description and limits 


The second step combines steps 2 and 3 of the
original RCM procedure. The procedure is simplified
because this methodology analyzes machines
with known operating behavior. Hence, this step
consists of drawing the functional diagram of the
equipment and listing the analysis limits, i.e. the
system parts that will not be considered.

In this scope, a standard structure for all
machines was considered, as shown in Figure 2.

Following the concepts of Figure 2, a “system”
is a complex structure that has several functions.
Each function is associated with a “subsystem” to
facilitate the organization. At the next level are
the “parts”, with which failure modes are associated.
Lastly, the “causal elements”, i.e. the reasons for the
failure modes (causes), are connected to the failure
modes.


3.3 Function analysis and functional failures 


In the third step, the subsystem functions are
listed and each function should be described on the
basis of quantitative requirements, if possible.
This makes it easier to understand the functions
that make the machine useful.

Afterwards, the functional failures of each
function, i.e. the conditions in which a function is
not achieved, are listed. Only some of these
functional failures will progress in the analysis. Hence,
the team should choose the critical functional
failures in terms of department objectives, costs,
maintenance time required or breakdown frequency.


3.4 Failure Mode and Effects Analysis (FMEA) 


In this step, the authors added a preparation phase
before FMEA. Using a matrix with parts in rows
and functional failures in columns, the team
identifies which parts are related to which
functional failures. In this way, only critical parts are
analyzed in the FMEA.

For each of these parts, the team deliberates, based on
experience, on which failure modes can occur.
Each failure mode is associated with an effect at
system level. This effect should be described in two
ways: machine condition and error message. One
example is “Machine stop with error ‘NOK’”, meaning that
after the failure mode occurred, the machine
stopped and the error message was displayed
on the monitor.


[Figure: tree diagram linking a subsystem to causal elements 1 to 4.]

Figure 2. Standard structure of a system. 


Afterwards, the team needs to find the causes. In
this methodology three types of causes are
considered: causal element, operation error and raw
material issue. These three groups are adopted because
the plant’s experience is that an operation
error or a raw material issue can cause a breakdown of
the equipment, so these problems should be
considered and mitigated, since they affect machine
and plant performance.

Detection and prevention actions should be
filled in on the FMEA form. Examples of detection
actions are: error messages, parameter changes
and sensory detection of irregular behavior.
It should be highlighted that not only detection
before failure should be considered but also the
detection modes after failure occurrence.

Regarding prevention actions, redundancies,
tests or other methods that prevent the failure
mode are considered. This field should not be
confused with preventive maintenance actions.

Lastly, the Risk Priority Number (RPN) is
calculated. In this methodology, a critical failure
mode is characterized by an RPN > 100.


3.5 Decision tree 


The proposed decision tree (Figure 3) is based on
the original RCM concepts but also on the autonomous
maintenance and planned maintenance
TPM pillars. Thus, the alternative maintenance
policies considered are: CBM, TBM, Design-Out
Maintenance (DOM) and RTF.

Figure 3. Decision tree.

The application of the decision tree is simple and intuitive. For each failure mode, the first question asked is: “Is there an effective method of monitoring or checking the condition?” This question relates to the CBM policy. If the answer is positive, the policy is taken as an option. However, the analysis continues, since the goal is to find all the applicable maintenance tasks. If the answer is “No”, the analyst goes directly to the second question.

The second question is: “Is the component life span known or can it be calculated, and is there an effective action that can decrease the failure rate?” In other words, applying a time-based maintenance action requires historical failure data from which the life span of the part can be calculated, together with an action that reduces the failure rate. The third question is: “Is the failure type S or E?” Failure modes are classified into one of four types: safety (S), environment (E), operational (O) or none (N). According to RCM, a failure with an impact on safety or the environment should not occur, so for these types the option is a design modification (DOM). An RTF task can be considered for the other two types of failure.

For the CBM, TBM and RTF policies it is possible to involve the operator in the maintenance action. So, after the applicable policies have been selected, the question “Can the task be performed by the operator?” is asked. This reflects the influence of TPM on the methodology: an autonomous maintenance task is more cost-effective, makes maintenance planning easier and improves operator skills and motivation.
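
The three questions of the decision tree can be sketched as a small function. This is only an illustrative reading of the text, not the authors' tool: the function name `applicable_policies` and its boolean arguments are hypothetical, and the sketch assumes that the third question is always asked, which is consistent with the rows reported in Table 7.

```python
def applicable_policies(q1_condition_can_be_monitored: bool,
                        q2_life_known_and_action_effective: bool,
                        q3_failure_type_is_s_or_e: bool) -> list[str]:
    """Return the maintenance policies left open by the three questions."""
    policies = []
    if q1_condition_can_be_monitored:        # Q1 -> condition-based maintenance
        policies.append("CBM")
    if q2_life_known_and_action_effective:   # Q2 -> time-based maintenance
        policies.append("TBM")
    # Q3: safety/environment failures call for a design change, others may run to failure
    policies.append("DOM" if q3_failure_type_is_s_or_e else "RTF")
    return policies


if __name__ == "__main__":
    print(applicable_policies(False, True, False))   # answers N, Y, N
    print(applicable_policies(False, False, False))  # answers N, N, N
```

The operator question (autonomous maintenance) would be asked afterwards for each CBM, TBM or RTF task returned.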


3.6 Economic evaluation and maintenance 
task selection 


The last step of this methodology has three main goals: to characterize the failure data distribution, to define the periodicity of preventive intervention, and to select the most cost-effective maintenance task.

The failure pattern of the component can be estimated by fitting the failure data to a Weibull distribution. This distribution was chosen because its three parameters make it adaptable: shape ($\beta$), scale ($\eta$) and location ($\gamma$) (Jardine & Tsang 2013).

The mean $\mu$ and the standard deviation $\sigma$ can then be estimated by equations (2) and (3).

$\mu = A \cdot \eta + \gamma$  (2)
$\sigma = B \cdot \eta$  (3)


The next goal is to calculate the optimal periodicity of preventive intervention, tp, if TBM is one of the selected policies. To obtain this value, maintenance costs should be balanced so that the total cost is minimized (Jardine & Tsang 2013). Hence, the age-based policy is considered, which involves two types of cost: failure cost (Cf) and preventive cost (Cp) (Jardine & Tsang 2013).

To obtain Cf, the spare part cost, the workforce cost (MDO) and the production loss (PL) cost are accounted for, as in equation (4).

$C_f = C_{\text{spare parts}} + C_{MDO} + C_{PL}$  (4)

On the other hand, Cp is calculated from the spare part cost and the workforce cost, as in equation (5).

$C_p = C_{\text{spare parts}} + C_{MDO}$  (5)

Once the cost values are available, the Glasser graph can be used to find the tp value (Fig. 4).

The Glasser graph allows the z and p values to be read. The variable z is used to obtain tp through equation (6) (Jardine & Tsang 2013).

$t_p = \mu + z \cdot \sigma$  (6)

The value p is the cost of the optimal policy as a percentage of the RTF policy cost (Jardine & Tsang 2013).
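
As a rough illustration of equations (2), (3) and (6), the sketch below computes the mean, the standard deviation and the preventive interval from Weibull parameters. It assumes the standard Weibull moment relations, A = Γ(1 + 1/β) and B = sqrt(Γ(1 + 2/β) − Γ(1 + 1/β)²), which the text does not state explicitly; the function names and the numeric inputs (including the z value, which in practice is read from the Glasser graph) are hypothetical.

```python
import math


def weibull_mean_std(shape: float, scale: float, location: float = 0.0):
    """Mean and standard deviation of a three-parameter Weibull distribution.

    Assumes A = Gamma(1 + 1/shape) and B = sqrt(Gamma(1 + 2/shape) - A**2).
    """
    a = math.gamma(1.0 + 1.0 / shape)
    b = math.sqrt(math.gamma(1.0 + 2.0 / shape) - a ** 2)
    return a * scale + location, b * scale


def preventive_interval(mean: float, std: float, z: float) -> float:
    """Equation (6): tp = mean + z * std, with z read from the Glasser graph."""
    return mean + z * std


if __name__ == "__main__":
    mu, sigma = weibull_mean_std(shape=2.0, scale=500.0)  # hypothetical fit
    print(mu, sigma, preventive_interval(mu, sigma, z=-1.0))  # hypothetical z
```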

Finally, the annual cost of the different options is estimated. The authors chose a one-year period to make the cost comparison easier.


Figure 4. Glasser method for age-replacement policy.

The first option is CBM. Its cost involves the preventive maintenance cost Cp, an inspection cost (Cinspection) and the investment value (Inv), if required.

$C_{CBM} = f_1 \cdot C_p + f_2 \cdot C_{inspection} + (Inv / n)$  (7)

Equation (7) assumes that failure is always predictable; f1 is the number of preventive interventions in a year and f2 is the number of inspection actions in a year. The f1 value can be calculated using a reliability study or estimated based on team experience. The f2 value should be estimated in the same way. The value of Inv is divided by n, the number of years for which the condition monitoring equipment is expected to function well.

For TBM, two distinct formulas were considered.

$C_{TBM} = p \cdot f_3 \cdot C_f$  (8)

Equation (8) considers an optimal scenario where data is available, in which f3 is the inverse of the MTTF. The variable p makes it possible to obtain the TBM cost from the RTF cost. Hence, equation (8) represents the cost of the optimal TBM policy (with periodicity tp).

$C_{TBM} = f_4 \cdot C_p$  (9)

On the other hand, equation (9) is used when there is no data available to estimate tp with the Weibull distribution. Here, f4 is an annual frequency defined by the team based on its experience.

For the RTF policy, the cost is obtained by equation (10).

$C_{RTF} = f_3 \cdot C_f$  (10)

In equation (10), f3 should be expressed on an annual basis.
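
The annual cost comparison can be sketched as follows. The functions are an assumed reading of equations (7) to (10), in which the optimal TBM cost is p times the RTF cost; all function names, argument names and figures are hypothetical, and frequencies are taken per year.

```python
def annual_cost_cbm(f1, c_p, f2, c_inspection, investment=0.0, years=1):
    """Eq. (7): preventive interventions + inspections + annualised investment."""
    return f1 * c_p + f2 * c_inspection + investment / years


def annual_cost_tbm_optimal(p, failures_per_year, c_f):
    """Eq. (8), assumed form: fraction p of the run-to-failure cost."""
    return p * failures_per_year * c_f


def annual_cost_tbm_experience(interventions_per_year, c_p):
    """Eq. (9): frequency set by the team times the preventive cost."""
    return interventions_per_year * c_p


def annual_cost_rtf(failures_per_year, c_f):
    """Eq. (10): failures per year times the failure cost."""
    return failures_per_year * c_f


def cheapest(costs: dict) -> str:
    """Name of the policy with the lowest annual cost."""
    return min(costs, key=costs.get)


if __name__ == "__main__":
    costs = {
        "CBM": annual_cost_cbm(2, 100.0, 12, 10.0, investment=500.0, years=5),
        "TBM": annual_cost_tbm_optimal(0.8, 4, 300.0),
        "RTF": annual_cost_rtf(4, 300.0),
    }
    print(costs, cheapest(costs))
```

DOM is deliberately left out of the comparison, since the text treats it as a separate analysis of benefits against the cheapest task.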

After the costs have been calculated with the previous equations, the maintenance task to apply to the critical failure mode should be chosen. Among the TBM, CBM and RTF policies, the cheapest one is chosen. However, if DOM is an option, the team needs to analyze its benefits in comparison with the cheapest task. For failures with S or E effects, it must be analyzed whether the probability of failure is reduced to a level considered adequate. Therefore, the associated probability of failure (or reliability) must be calculated for each policy, and the costs must be analyzed together with this probability.


4 APPLICATION 


The described methodology was applied at the plant of an electronic goods company.

The critical equipment was chosen among 10 types of machines. Table 4 shows the result of the prioritization (step 1).

In Table 4, it can be seen that GP is the machine with the highest rating. This machine is dispensing equipment whose process consists of dispensing a mixture with thermal conductivity properties.

After the critical equipment was selected, a team of five members with multidisciplinary knowledge was formed. One of the members was the moderator.

As historical failure data, the interventions carried out during 2016 were collected.

The first task of the team was to draw a functional diagram of the GP system. The result was a set of eight subsystems, shown in Figure 5.

The team decided that subsystems 6 and 7 should be excluded from the analysis because of their lower failure rate.

The third step is the study of functions and functional failures. Table 5 shows an application example for the dispensing subsystem.

In this case, both functional failures go to the next step. Following this reasoning, the team identified 22 functional failures and selected 13 of them, since the others are of little importance in terms of preventive or corrective interventions.

Through the relationship matrix, the team crossed functional failure data with machine parts.


Table 4. Application of prioritization method. 


Machine MTBF C T R 
GP 5 5 4 4,6908 
BUR 3 4 5 4,1996 
LB 1 4 2 3,0529 
ICT 3 3 3 3,0000 
AOI 2 3 1 2,2721 
RH + 2 2 2,2192 
AMF 5 1 3 2,0566 
FCT 2 1 4 2,0370 
MMC 1 1 1 1,0000 
PR 1 1 1 1,0000 
Figure 5. System structure.




The result was 12 parts with a strong relationship with system failures. Table 6 presents an example for 2 machine parts.

The FMEA technique was then applied to study the failure modes of these 12 parts and find their causes. Using the RPN rule, 25 critical failure modes were identified. However, 3 of them are related to operation errors and raw material issues; therefore, only 22 failure modes go to the next step, the decision tree.

After running the decision tree 22 times, the team selected all the effective and possible tasks for each critical failure mode. Table 7 presents an excerpt of this step.

Table 8 was obtained by calculating the costs of each task.

As an example, for the failure mode with code 1.1.1.1.1 the TBM task was selected with a frequency of 7 hours, and for 1.1.1.1.3 RTF was the only available option.


Table 5. Functions and functional failures analysis.

Subsystem      Function                          Functional failures                          Y/N
5. Dispensing  5.3 Mix materials A and B by 1:1  5.3.1 Do not mix materials A and B by 1:1    Y
                                                 5.3.2 Do not mix materials A and B at all    Y


Table 6. Relationship between parts and functional 
failures. 

Functional failures 
Parts 1.1.1 1.1.2 1.1.3 2.1.1 2.1.2
Pump A X 
Axis X X X 


Table 7. Decision tree results.

Failure mode  Q1  Q2  Q3  Possible tasks
1.1.1.1.1     N   Y   N   TBM, RTF
1.1.1.1.3     N   N   N   RTF

Table 8. Economic evaluation of tasks.

Failure mode  Options  Annual cost  Frequency
1.1.1.1.1     TBM      8 526 €      7 hours
              RTF      11 392 €     -
1.1.1.1.3     RTF      401 987 €    -


5 CONCLUSION 


This paper presents a methodology for selecting and defining maintenance tasks for critical equipment. The procedure is based on the RCM methodology and uses the AHP method for machine prioritization, the FMEA technique to study the failure modes of parts, and an economic evaluation of maintenance tasks, supported by the Weibull distribution and the Glasser graph, to define the most appropriate maintenance policy. In addition, some TPM concepts are taken into account by the methodology.

In short, the main differences with respect to RCM are the consideration of autonomous maintenance as a possible maintenance task and the economic evaluation of the tasks for this selection.

As a result of the application to the GP equipment of the case study company, the plant achieved a 33% reduction in the number of failures, a 33% reduction in production losses, an MTBF improvement of 15% and an increase in machine yield of 1,8%. Total maintenance cost is also expected to fall in the mid-term.

Therefore, the application of the methodology makes the work of maintenance technicians more effective, since only appropriate and cost-effective preventive maintenance tasks are performed. It also allows operators to be involved in simple maintenance tasks, as recommended by TPM. Since the selection of maintenance tasks is guided by cost, whenever adequate, the application of the methodology leads to a reduction in the overall maintenance cost.


A study of the relationship between sample size and the confidence level of MTTF for products with exponential failure distribution


Y. Wang, H. Cheng & D. Xu 
NAA, Beijing, China 


ABSTRACT: It is common in applications to ignore the influence of sample size and population size when evaluating the confidence lower limit of the Mean Time to Failure (MTTF) for products with an exponential failure distribution. The conventional method assumes that the failure times or lives of different samples are independent, identically distributed random variables. This assumption holds reasonably well when the population size is much larger than the sample size. However, it induces substantial inaccuracies when the sample size is not negligible relative to the population. To address this problem, the influence of the sample size on the confidence limit of MTTF is analyzed for a population of given size. First, the conventional method is reviewed. Then, formulae based on the hypergeometric distribution are derived to reflect the relationships among sample size, population size, confidence level and confidence limit. Finally, an example is presented to illustrate the suitability of the proposed method and to show the computation errors of the conventional method.


1 INTRODUCTION

Normally, the confidence lower limit of the Mean Time to Failure (MTTF) for products with an exponential failure distribution is verified by a reliability verification test in which some samples are picked at random (Jiang et al. 2012, Modarres et al. 1999). Under most conditions, the number of samples is very small compared to the population. It is then reasonable to assume that the failure times of different samples are independent and identically distributed (i.i.d.) random variables.

However, for a complex system or product, the population size is usually not very large. Hence, the sample size is not negligible relative to the population in the reliability verification test plan (Li & Jiang 2006, Yan et al. 2014). The conventional method, which does not consider the influence of sample size, will result in computation error.

In this study, the conventional method ignoring the influence of sample size is reviewed. Then, a method considering the number of samples, based on the hypergeometric distribution (Ronald et al. 2009, Zhao & Mi 2015), is presented. Finally, an example is used to show the suitability of the proposed method. The conventional method is applied to the same example to show the computation error.


2 THE CONVENTIONAL METHOD

For a product whose failure time follows an exponential distribution, the failure rate (Tan et al. 2005) is $\lambda$. For mission time $t$, the reliability $R(t)$ is

$R(t) = \exp(-\lambda t)$  (1)

where exp() denotes the natural exponential function. The failure probability $F(t)$ is

$F(t) = 1 - \exp(-\lambda t)$  (2)

Assume that the population is infinite and that the sample size in the reliability verification test is $n$. For time $t$, the probability of observing $r$ failures is

$P(X = r) = C_n^r F(t)^r R(t)^{n-r}$  (3)

When $n$, $t$ and $r$ are specified in the reliability test plan, the formula to compute the confidence lower limit of $R(t)$ is

$1 - \gamma = P(X \le r) = \sum_{i=0}^{r} C_n^i F(t)^i R(t)^{n-i}$  (4)

where $\gamma$ is the confidence level.

The confidence lower limit of $R(t)$ can be derived from Eq. (4) numerically. Eq. (1), with $\lambda = 1/\mathrm{MTTF}$, then gives the confidence lower limit of MTTF.
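
A minimal sketch of this numerical step is given below. It solves the binomial relation of Eq. (4) for the reliability by bisection and then inverts Eq. (1). The function names and the choice of bisection are assumptions made for illustration; any root finder could be used.

```python
import math


def prob_at_most_r_failures(reliability: float, n: int, r: int) -> float:
    """Right-hand side of Eq. (4): binomial probability of at most r failures."""
    failure = 1.0 - reliability
    return sum(math.comb(n, i) * failure ** i * reliability ** (n - i)
               for i in range(r + 1))


def conventional_mttf_lower_limit(n: int, r: int, t: float, gamma: float,
                                  tol: float = 1e-12) -> float:
    """Solve Eq. (4) for R(t) by bisection, then use R(t) = exp(-t / MTTF)."""
    if not 0 <= r < n:
        raise ValueError("the acceptance number r must satisfy 0 <= r < n")
    target = 1.0 - gamma
    lo, hi = 0.0, 1.0  # P(X <= r) increases from 0 to 1 as R goes from 0 to 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if prob_at_most_r_failures(mid, n, r) < target:
            lo = mid
        else:
            hi = mid
    return -t / math.log(0.5 * (lo + hi))


if __name__ == "__main__":
    # One sample tested for 4000 h, no failures allowed, 80% confidence
    print(conventional_mttf_lower_limit(n=1, r=0, t=4000.0, gamma=0.8))
```

Here `t` is the test time per sample, so that n samples tested for t hours each accumulate n times t hours.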


3 METHOD CONSIDERING 
POPULATION SIZE 


Assume that the population size is $N$, the sample size for the reliability test is $n$, and the number of failures for acceptance in the test plan is $r$. Then the probability of $r$ failures is

$P(X = r) = \frac{C_M^r C_{N-M}^{n-r}}{C_N^n} = \frac{n!\, M!\, (N-M)!\, (N-n)!}{r!\,(n-r)!\,(M-r)!\,(N-M-n+r)!\, N!}$
$= C_n^r \cdot \frac{M(M-1)\cdots(M-r+1)}{N^r} \cdot \frac{(N-M)(N-M-1)\cdots(N-M-n+r+1)}{N^{n-r}} \cdot \frac{N^n}{N(N-1)\cdots(N-n+1)}$  (5)

where $M$ is the (unknown) number of products in the population that cannot pass the test.

The probability that the number of failures is less than or equal to $r$ is

$P(X \le r) = \sum_{i=0}^{r} \frac{C_M^i C_{N-M}^{n-i}}{C_N^n}$  (6)

By substituting Eq. (5) into Eq. (6) and denoting $M/N$ by $F$, the confidence upper limit of the failure probability, with $F = 1 - R$ ($R$ is the confidence lower limit of reliability), we get

$1 - \gamma = P(X \le r) = \sum_{i=0}^{r} C_n^i \left[ F \left(F - \frac{1}{N}\right) \cdots \left(F - \frac{i-1}{N}\right) \right] \left[ R \left(R - \frac{1}{N}\right) \cdots \left(R - \frac{n-i-1}{N}\right) \right] \frac{N^n}{N(N-1)\cdots(N-n+1)}$  (7)

In Eq. (7), the factor in the first bracket has no terms when $r$ is zero. Furthermore, when $n = 1$ and $r = 0$ in the reliability test plan, Eq. (4) and Eq. (7) can both be rewritten as

$1 - \gamma = P(X \le 0) = 1 - F = R$  (8)
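
The finite-population relation can be evaluated numerically in the same way. The sketch below is an assumed implementation of the form 1 − γ = P(X ≤ r) with F = M/N treated as a continuous quantity, solved for R by bisection on the interval [(n − 1)/N, 1]; the function names are hypothetical, and the bracketing has only been checked for zero-failure plans (r = 0).

```python
import math


def finite_population_prob(reliability: float, N: int, n: int, r: int) -> float:
    """P(X <= r) for a sample of n drawn from a population of N, with F = 1 - R."""
    failure = 1.0 - reliability
    scale = 1.0
    for k in range(n):                 # N^n / [N (N-1) ... (N-n+1)]
        scale *= N / (N - k)
    total = 0.0
    for i in range(r + 1):
        f_part = 1.0
        for k in range(i):             # F (F - 1/N) ... (F - (i-1)/N)
            f_part *= failure - k / N
        r_part = 1.0
        for k in range(n - i):         # R (R - 1/N) ... (R - (n-i-1)/N)
            r_part *= reliability - k / N
        total += math.comb(n, i) * f_part * r_part
    return total * scale


def proposed_mttf_lower_limit(N: int, n: int, r: int, t: float, gamma: float,
                              tol: float = 1e-12) -> float:
    """Bisection on R over [(n-1)/N, 1], then MTTF = -t / ln(R)."""
    target = 1.0 - gamma
    lo, hi = (n - 1) / N, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if finite_population_prob(mid, N, n, r) < target:
            lo = mid
        else:
            hi = mid
    return -t / math.log(0.5 * (lo + hi))


if __name__ == "__main__":
    # Population of 28, two samples tested for 2000 h each, 80% confidence
    print(proposed_mttf_lower_limit(N=28, n=2, r=0, t=2000.0, gamma=0.8))
```

With n = 1 and r = 0 the scale factor and both products reduce to R alone, so the sketch returns the same value as the conventional binomial calculation.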


4 EXAMPLE 


For a product or system whose failure time follows an exponential distribution, the population size N is 28. Assume that a reliability test must be performed to assess the confidence lower limit of MTTF, with a confidence level of 80%. To show clearly the differences between the results of the conventional method and of the proposed method, a series of test plans is presented and analyzed.


4.1 Test plan 1 


In this reliability test, we set the parameters of the test plan as follows:

Population size N = 28.
Sample size of the test n = 1.
Acceptance number r = 0.
Accumulated test time T = 4000 h.
Confidence level γ = 80%.

1. Conventional method

For the conventional method, substituting the parameters above into Eq. (4) gives

$1 - \gamma = P(X \le 0) = \sum_{i=0}^{0} C_1^i F(4000)^i R(4000)^{1-i} = R(4000) = 1 - 0.8 = 0.2$  (9)

That is,

$R(4000) = \exp\left(-\frac{4000}{\mathrm{MTTF}}\right) = 0.2$  (10)

From Eq. (10), we can derive the 80% confidence lower limit of MTTF,

MTTF = 2485.34 h  (11)


2. Proposed method

For the proposed method considering the population size, substituting the parameters above into Eq. (7) or Eq. (8) gives

$1 - \gamma = P(X \le 0) = 1 - F = R = R(4000) = 0.2$  (12)

As shown in Eq. (9) and Eq. (12), for this test plan, where the sample size n is 1 and the acceptance number r is zero, the proposed method is equivalent to the conventional method. This is because a single sample is statistically independent whatever the population size.

Hence, for the proposed method considering the population size, the assessed MTTF is 2485.34 h as well.


4.2 Test plan 2 


In this reliability test, we set the parameters of the test plan as follows:

Population size N = 28.
Sample size of the test n = 2.
Acceptance number r = 0.
Accumulated test time T = 4000 h (2000 h for each sample).
Confidence level γ = 80%.

1. Conventional method

For the conventional method, substituting the parameters above into Eq. (4) gives

$1 - \gamma = P(X \le r) = \sum_{i=0}^{0} C_2^i F(2000)^i R(2000)^{2-i} = R^2(2000) = 1 - 0.8 = 0.2$  (13)

That is,

$R^2(2000) = \left[\exp\left(-\frac{2000}{\mathrm{MTTF}}\right)\right]^2 = 0.2$  (14)

From Eq. (14), we can derive the 80% confidence lower limit of MTTF,

MTTF = 2485.34 h  (15)


The result is the same as the one from test plan 1.

2. Proposed method

For the proposed method considering the population size, substituting the parameters above into Eq. (7) gives

$1 - \gamma = P(X \le r) = R(2000) \left[ R(2000) - \frac{1}{28} \right] \frac{28^2}{28 \times 27} = 0.2$  (16)

$\Rightarrow \exp\left(-\frac{2000}{\mathrm{MTTF}}\right) \left[ \exp\left(-\frac{2000}{\mathrm{MTTF}}\right) - \frac{1}{28} \right] \frac{28^2}{28 \times 27} = 0.2$  (17)

From Eq. (17), we can derive the 80% confidence lower limit of MTTF,

MTTF = 2556.72 h  (18)

The result differs from the one obtained with the conventional method. This is because the proposed method considers the interdependency of the two samples, whereas the conventional method does not.

Taking the result of the proposed method as the reference, the relative computation error of the conventional method is

$\left| \frac{2485.34 - 2556.72}{2556.72} \right| \approx 2.8\%$  (19)


4.3 Test plan 3 


In this reliability test, we set the parameters of the test plan as follows:

Population size N = 28.
Sample size of the test n = 10.
Acceptance number r = 0.
Accumulated test time T = 4000 h (400 h for each sample).
Confidence level γ = 80%.

1. Conventional method

For the conventional method, substituting the parameters above into Eq. (4) gives

$1 - \gamma = P(X \le r) = \sum_{i=0}^{0} C_{10}^i F(400)^i R(400)^{10-i} = R^{10}(400) = 1 - 0.8 = 0.2$  (20)

That is,

$R^{10}(400) = \left[\exp\left(-\frac{400}{\mathrm{MTTF}}\right)\right]^{10} = 0.2$  (21)

From Eq. (21), we can get the 80% confidence lower limit of MTTF,

MTTF = 2485.34 h  (22)


The result is the same as the ones from test plan 1 and test plan 2.

2. Proposed method

For the proposed method considering the population size, substituting the parameters above into Eq. (7) gives

$1 - \gamma = P(X \le r) = R(400) \left[ R(400) - \frac{1}{28} \right] \cdots \left[ R(400) - \frac{9}{28} \right] \frac{28^{10}}{28(28-1)\cdots(28-9)} = 0.2$  (23)

That is,

$\exp\left(-\frac{400}{\mathrm{MTTF}}\right) \left[ \exp\left(-\frac{400}{\mathrm{MTTF}}\right) - \frac{1}{28} \right] \cdots \left[ \exp\left(-\frac{400}{\mathrm{MTTF}}\right) - \frac{9}{28} \right] \frac{28^{10}}{28(28-1)\cdots(28-9)} = 0.2$  (24)

From Eq. (24), we can derive the 80% confidence lower limit of MTTF,

MTTF = 3055.20 h  (25)

Taking the result of the proposed method as the reference, the relative computation error of the conventional method is

$\left| \frac{2485.34 - 3055.20}{3055.20} \right| \approx 18.7\%$  (26)


From the results of test plans 1, 2 and 3, we can see that for the conventional method the result stays the same regardless of the sample size when the accumulated test time and the acceptance number of failures are fixed. This is because the conventional method ignores the interdependency of different samples from the same population. It can easily be verified that all the test plans in this example are equivalent under the conventional method; the verification is not presented here.


5 CONCLUSIONS 


In this study, we analyzed the influence of population size on the computation accuracy of the confidence lower limit of MTTF for products with an exponential failure distribution in a reliability verification test, and reached the following conclusions:

1. A method considering the interdependency of different samples is proposed to compute the confidence lower limit of MTTF, based mainly on the hypergeometric distribution.

2. The conventional method gives the same result regardless of how many samples are used in the reliability test when the accumulated test time and the acceptance number of failures are fixed. This is because the conventional method ignores the interdependency of different samples from the same population.

3. According to the proposed method considering the interdependency of different samples and the results in the example, the confidence lower limit of MTTF increases with the sample size when the population size, the accumulated test time and the acceptance number of failures are fixed in the reliability test plan.


Failure Mode Effects & Criticality Analysis (FMECA) using Bayesian Dirichlet-multinomial conjugate pair


W. Baun 
Pratt & Whitney, East Hartford, CT; USA 


ABSTRACT: Failure Mode, Effects & Criticality Analysis (FMECA) is a widely-used tool for system safety and reliability evaluations; one popular approach is outlined by MIL-STD-1629A. MIL-STD-1629A Criticality Analysis requires estimates of the Failure Mode Ratio (α) and the Failure Effect Probabilities (β) for each failure mode. To maximize impact on the design, FMECAs are initiated early in the Product Development Process (PDP), before reliability data is available for the new product. Typically, initial criticality estimates are derived via a combination of failure mode/effect data from similar legacy products and from engineering judgment. Later in the PDP, actual data on the new product becomes available. Such a situation is ideal for employing Bayesian methods. The data from which the Criticality parameters are estimated is Multinomial. The Dirichlet distribution is chosen to represent prior uncertainty in the Criticality parameters, thus taking advantage of the Bayesian conjugacy property for the Dirichlet-Multinomial pairing. This paper describes the application of Bayesian techniques to FMECA Criticality analysis and the structure of a Bayes-enabled FMECA, briefly outlines some expert elicitation approaches for developing prior distributions for α and β, and demonstrates the use of evidence in the form of field data to update those prior estimates.


1 CRITICALITY ANALYSIS INTRODUCTION

Failure Mode, Effects and Criticality Analysis (FMECA) is an inductive tool used to analyze the effects, and quantify the risks, of failure modes within a larger system. FMECA is an extension of Failure Modes and Effects Analysis (FMEA), which adds a quantitative Criticality assessment to the basic FMEA process. The outputs of the criticality analysis enable risk mitigation activities to be undertaken in a systematic, prioritized fashion.

There are different approaches to conducting criticality analysis. This paper focuses on the FMECA approach outlined in MIL-STD-1629A (MIL 1980), which is also covered in EN 60812:2006 (EN 2006). It is worth noting that the methods outlined in this paper are not applicable to the SAE J1739 (SAE 2009) FMEA approach; SAE FMEA focuses on a very different aspect of design risk, and uses a qualitative risk ranking technique. Criticality analysis as performed in MIL-STD-1629 requires estimation of the following factors for every component-failure mode combination in the system being analyzed:

α: Failure mode ratio
β: Failure effects probability factors
λ: Component base failure rate

The failure mode ratio, α, is the probability that a particular component will fail by a certain failure mode. The failure mode ratios for any specific component must sum to 100%. Where a component has only one failure mode, the failure mode ratio would equal 100%. If a component has three equally-likely failure modes, then each failure mode would have a failure mode ratio equal to 33.3%. If one failure mode is more prevalent than others, then that failure mode would have the largest share of the 100% total. The number of failure modes is not limited, being a function of the various ways in which the particular component can fail.

The failure effects probability factors, β, are the probability that a particular component-failure mode will result in effects of a certain severity level. In the case of the MIL-STD-1629A FMECA approach as applied to the aerospace industry, the effects severity level is broken into four distinct categories (CAT):

Table 1. MIL-STD-1629A severity categorizations for aerospace application.

Category  Description   Details
I         Catastrophic  A failure which can cause death or loss of aircraft
II        Critical      A failure which can cause major system damage, mission abort
III       Marginal      A failure which can cause minor system damage, mission delay, loss of availability
IV        Negligible    A failure which will not cause system damage, mission delay or loss of availability

In this categorization system, there would thus be four different β factors for each failure mode, β1, β2, β3 and β4, representing the probabilities that a failure by that component-failure mode combination results in CAT I, II, III, or IV effects, respectively. The sum of the β factors for a given component-failure mode combination must equal 100%. If a particular failure mode can only result in effects of one category, then the failure mode effect probability for that category would be 100%; all other categories would be 0%. Generally, a failure mode will have a most likely set of effects; this category would thus have the highest proportion of the 100% total for the four categories. If it is possible for other effect levels to occur, those categories should be given a non-zero rating which corresponds to the best estimate of their probability of occurrence.

While other effects severity rating systems having 
fewer or more categories are possible, the remain- 
der of this paper will assume that four severity cat- 
egories exist, as given above. The methods outlined 
in this paper are equally applicable to any number 
of severity effects categories. 

The component base failure rate, λ, is the overall failure rate for a particular component, for all failure modes and all outcomes (effects). Typically in FMECA (and other reliability analysis techniques), λ is assumed to be constant, with failure times exponentially distributed. Under that assumption, the base failure rate would be calculated as the total number of failures of that component observed in the fleet (for all failure modes and effects) divided by the total operating hours for that component across the entire fleet. Failure rate in the aerospace application consistent with Table 1 has units of failures per engine flight hour.

In the MIL-STD-1629A approach to FMECA, Criticality is calculated as the product of the three factors above and a fourth factor, time, t:

$C = \lambda \, \alpha \, \beta \, t$  (1)


For the remainder of this paper, the time factor 
will be omitted, under the assumption that the Crit- 
icality factors are calculated on a per-hour basis. 
Because there are four severity categories, there exist 
four criticality numbers for each component-failure 
mode, one for each category of effect severity. 


The units of criticality are also in failures per 
hour, and represent the failure rate per hour for a 
given component-failure mode combination pro- 
ducing effects of a particular severity category. 
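
As a small worked illustration of the criticality product, the sketch below computes one criticality number per severity category for a single component-failure mode combination. The function name, the dictionary layout and the numeric inputs are hypothetical; the only relation taken from the text is that criticality is the product of failure rate, failure mode ratio, failure effect probability and time.

```python
def criticality_numbers(failure_rate: float, mode_ratio: float,
                        effect_probs: dict, hours: float = 1.0) -> dict:
    """Criticality per severity category for one component-failure mode pair.

    effect_probs maps a category label (e.g. "I".."IV") to its effect
    probability; the probabilities are expected to sum to 1.
    """
    if abs(sum(effect_probs.values()) - 1.0) > 1e-9:
        raise ValueError("effect probabilities must sum to 1")
    return {category: failure_rate * mode_ratio * prob * hours
            for category, prob in effect_probs.items()}


if __name__ == "__main__":
    # Hypothetical component: 1e-5 failures per hour, 25% of them by this mode
    result = criticality_numbers(1e-5, 0.25,
                                 {"I": 0.0, "II": 0.1, "III": 0.6, "IV": 0.3})
    for category, value in result.items():
        print(category, value)
```

Leaving `hours` at 1.0 corresponds to the per-hour basis adopted for the remainder of the paper.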

Ideally, estimation of the factors used in criti- 
cality analysis will come from an extensive data 
set of actual field experience. The failure rate esti- 
mate for a given component comes from the total 
number of failures divided by total operating time, 
as discussed above. The failure mode ratio for a 
component-failure mode combination would be 
the number of times a component has failed by 
that particular failure mode divided by the total 
number of failures for that component for all fail- 
ure modes. For a given failure mode, the failure 
effect probabilities would come from the number 
of times that failure mode has produced effects of 
that severity level divided by the total number of 
failures by that mode. 
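
The three ratios described above can be sketched directly from failure counts. The function name `estimate_factors` and the nested-dictionary layout of the field data are assumptions made for illustration, and the counts in the usage example are invented.

```python
def estimate_factors(counts: dict, operating_hours: float):
    """Point estimates of the criticality factors for one component.

    counts maps failure mode -> {severity category -> number of failures}.
    Returns (failure_rate, mode_ratio, effect_prob); an entry is None when
    there are no failures from which to compute the ratio.
    """
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    total = sum(sum(by_cat.values()) for by_cat in counts.values())
    failure_rate = total / operating_hours
    mode_ratio, effect_prob = {}, {}
    for mode, by_cat in counts.items():
        n_mode = sum(by_cat.values())
        mode_ratio[mode] = n_mode / total if total else None
        effect_prob[mode] = ({cat: k / n_mode for cat, k in by_cat.items()}
                             if n_mode else None)
    return failure_rate, mode_ratio, effect_prob


if __name__ == "__main__":
    field_data = {"leak": {"III": 3, "IV": 1},
                  "crack": {"II": 1, "III": 0},
                  "seizure": {"I": 0}}
    print(estimate_factors(field_data, operating_hours=100_000.0))
```

A failure mode that has never been observed yields a mode ratio of zero and no effect probabilities at all, which is exactly the situation in which other sources of estimates are needed.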

In practice, some or all of these factors will need to be estimated from other sources. FMECA is most effective as a risk prioritization and mitigation tool when applied early in the product design process. However, this is typically long before any actual field failure data from which to calculate the criticality factors would be available for the new system being designed. In this situation, the FMECA analyst must turn to similar legacy products as a source of such data. However, there may be components in the FMECA which have never experienced a particular failure mode in the legacy fleet. Furthermore, high severity failure modes (CAT I) are typically (and fortunately) rare events, so there are few or no CAT I failures in the legacy data set from which to calculate a probability. Regardless, estimates must be made in order to utilize the FMECA process to its full potential. In these situations, where limited or no field experience is available, estimates of criticality factors must be made by expert opinion.

Expert opinion is widely used in reliability engineering applications to estimate unknown quantities of interest (Bedford et al. 2006, Hodge et al. 2001, Yadov et al. 2003). Bayes Rule provides a framework for starting with such subjective estimates of the parameter(s) of interest, and then updating those estimates with actual data as it becomes available. The remainder of this paper will discuss a method for the application of Bayes Rule to the estimation of the FMECA Criticality parameters α and β. Section 2 will provide a brief overview of Bayes Rule. Section 3 will outline the specific choice of distributions for modeling the unknown quantities of interest, α and β, for the FMECA application. Section 4 will discuss the process of expert elicitation as it applies in this situation. Section 5 will discuss specific concerns related to prior distribution strength. Section 6 will provide examples of the Bayesian FMECA updating process, showing how evidence can be combined with the prior estimates via Bayes Rule. Section 7 will discuss conclusions and future work.


2 OVERVIEW OF BAYES RULE 


Bayes Rule is a formal mathematical method for updating one’s state of belief about unknown parameter(s) of interest (POI) as new evidence is obtained. It enables initial, often subjective, estimates about those POI to be combined with actual data to produce those updated estimates. A classical formula for Bayes rule is:

$P_1(\lambda \mid E) = \frac{L(E \mid \lambda)\, P_0(\lambda)}{\int L(E \mid \lambda)\, P_0(\lambda)\, d\lambda}$  (2)

where λ is the parameter of interest (POI), the unknown parameter or factor being modeled; $P_0(\lambda)$ is the prior distribution of λ (the “Prior”), which is the uncertain estimate of the value of λ developed before any actual evidence is obtained, modeled as a distribution to represent the initial state of uncertainty regarding the value of λ; E is the evidence; $P_1(\lambda \mid E)$ is the posterior distribution of λ (the “Posterior”), which is the updated estimate of λ after gaining new information in the form of evidence; $L(E \mid \lambda)$ is the likelihood of the evidence, given parameter λ. The denominator is a normalizing factor, sometimes referred to as the “total probability of the evidence”, and is the summation of the likelihood of the evidence over all possible values of the parameter λ.

The prior distribution is the mathematical representation of the initial state of knowledge about the POI. Depending upon that state of knowledge, the prior can be diffuse or concentrated. The prior distribution can be modeled using a wide variety of mathematical distributions to represent uncertainty. Where the POI can take on any value in a range, the initial state of knowledge is represented by a continuous Prior distribution, and the summation term in the denominator is an integral. Typical distributions employed to model continuous Priors include the uniform, lognormal, gamma and beta distributions, as well as many others. The integral is evaluated over the full set of unknown POI, λ, which are the parameters of the distribution used to model the prior. If there is more than one parameter, then the integral becomes a double (or higher) integral. As such, the Bayes equation can quickly become difficult to solve. In many cases, a closed-form solution cannot be found, and numerical methods are required.
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Fortunately, there are a special set of Prior Dis- 
tributions and Likelihood Functions which make 
the calculation process relatively easy. These sets 
are called Conjugate Pairs. For a certain form of 
evidence expected, a specific likelihood function 
is required. Given a particular type of evidence 
(and thus likelihood function), the choice of a 
certain form of Prior distribution will make the 
calculation of the Posterior Distribution via Bayes 
Rule simple, and obviate the need to evaluate any 
integrals. 

For example, if one is interested in estimating
the failure rate, $\lambda$, of a constant failure rate
process, then the evidence by which the failure rate will
be estimated will be of the form n failures in observation
time T, also known as Poisson data. Given
evidence of this form (Poisson) and the resultant
likelihood function, the choice of a Gamma distribution
with parameters $\alpha$ and $\beta$ to model the
Prior distribution of failure rate (the initial estimate
of the unknown POI, $\lambda$) results in a Posterior
distribution which is also Gamma, and which has
parameters $\alpha + n$ and $\beta + T$. Other conjugate pairs exist
for widely modeled situations (Fink 2017), including
the Beta-Binomial conjugate pair for modeling
situations such as “Probability of Failure on
Demand”, where the value can range between 0 and
1 and the evidence will be in the form of k failures
in n trials (Bernoulli process).
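
The two conjugate updates described above can be sketched in a few lines. The following is only an illustration under stated assumptions: the function names and the numerical values (a Gamma(2, 1000 h) prior, 3 failures in 2000 h) are hypothetical and do not come from the paper. It also evaluates Bayes rule numerically on a grid, to show that the closed-form Gamma update and direct integration of the denominator give the same posterior mean.

```python
import math


def gamma_poisson_update(alpha, beta, n_failures, exposure_time):
    """Closed-form conjugate update: Gamma(alpha, beta) prior, Poisson evidence."""
    return alpha + n_failures, beta + exposure_time


def beta_binomial_update(a, b, k_failures, n_trials):
    """Closed-form conjugate update: Beta(a, b) prior, k failures in n trials."""
    return a + k_failures, b + (n_trials - k_failures)


def gamma_pdf(lam, alpha, beta):
    """Gamma density with shape alpha and rate beta (beta in units of time)."""
    return beta ** alpha * lam ** (alpha - 1) * math.exp(-beta * lam) / math.gamma(alpha)


def poisson_likelihood(lam, n_failures, exposure_time):
    """Likelihood of n failures in the exposure time for a constant rate lam."""
    mu = lam * exposure_time
    return mu ** n_failures * math.exp(-mu) / math.factorial(n_failures)


def grid_posterior_mean(alpha, beta, n_failures, exposure_time, lam_max, steps=20000):
    """Posterior mean from Bayes rule evaluated numerically on a midpoint grid."""
    h = lam_max / steps
    numerator = 0.0
    evidence = 0.0  # the normalizing integral in the denominator of Bayes rule
    for i in range(steps):
        lam = (i + 0.5) * h
        weight = poisson_likelihood(lam, n_failures, exposure_time) * gamma_pdf(lam, alpha, beta)
        numerator += lam * weight * h
        evidence += weight * h
    return numerator / evidence


# Hypothetical numbers: prior Gamma(2, 1000 h), then 3 failures in 2000 h.
alpha_post, beta_post = gamma_poisson_update(2.0, 1000.0, 3, 2000.0)
closed_form_mean = alpha_post / beta_post
numerical_mean = grid_posterior_mean(2.0, 1000.0, 3, 2000.0, lam_max=0.02)
print(alpha_post, beta_post, closed_form_mean, numerical_mean)
```

The grid upper bound `lam_max` is chosen by hand so that the posterior mass beyond it is negligible; with a conjugate pair, none of this integration is needed.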


3 DIRICHLET-MULTINOMIAL 
CONJUGATE PAIR 


Another Bayesian conjugate pair is the Dirichlet- 
Multinomial pair (Fink 2017). This pairing is used 
in situations where the evidence will be multinomial 
in nature; the prior distribution of the unknown 
POI will be modeled as a Dirichlet distribution. 
Multinomial data is applicable in situations where 
the data can fall into one of n categories (with 
binomial being the specific case where n = 2). It 
is easy to see that failure mode and failure effects 
data are both cases of multinomial datasets—fail- 
ure modes can fall into one of n categories for a 
given component, where n = the number of failure 
modes for that component. Similarly, under the 
MIL-STD-1629A approach to FMECA, failure 
effects data can fall into one of four severity cat- 
egories. Therefore, because the data for estimating 
FMECA Criticality factors will be of Multinomial 
form, the Dirichlet distribution will be employed 
as the choice of priors for the FMECA critical- 
ity analysis POI in order to take advantage of the 
mathematical benefits of the conjugate pairing. 
As discussed previously, the criticality calculation
for the MIL-STD-1629A FMECA has two parameter
sets (POI) which must be estimated: the failure
mode ratio, $\alpha$, and the failure effects probabilities,
$\beta_1$ through $\beta_4$. For a given component, the failure mode
ratio is the probability that the component fails
by that failure mode (versus other possible failure
modes) when that component fails. Thus, the evidence
which will be used to update the prior estimates
of failure mode ratios using Bayes rule will
be multinomial in nature. Table 2 below shows an
example of multinomial data for failure mode ratio
for one component.

Similarly, the failure effects probabilities for a 
given component-failure mode combination are the 
probabilities that a particular failure mode will result 
in effects in each of the four different categories of 
severity from Table 1. Taking the data for failure 
mode C from Table 2, the failure effects probability 
data is also multinomial, as shown in Table 3. 

The multinomial probability mass function
for the failure mode ratios, $\alpha_1, \dots, \alpha_k$, is as follows:

$$
f(x_1, \dots, x_k;\, n, \alpha_1, \dots, \alpha_k) = \frac{n!}{x_1! \cdots x_k!}\, \alpha_1^{x_1} \cdots \alpha_k^{x_k} \tag{3}
$$

where k is the number of failure modes for that
component, $x_i$ is the number of failures by mode i
(for modes 1 through k), and n is the total number of
failures observed for that component.
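
As a small worked illustration of Equation (3), the sketch below evaluates the multinomial probability of the Table 2 counts when the failure mode ratios are set to the tabulated values. The function name `multinomial_pmf` is a hypothetical helper introduced here, not something defined in the paper.

```python
import math


def multinomial_pmf(counts, probs):
    """Probability of observing `counts` failures per category given `probs`."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    n = sum(counts)
    # Multinomial coefficient n! / (x_1! ... x_k!), kept as an exact integer.
    coefficient = math.factorial(n)
    for x in counts:
        coefficient //= math.factorial(x)
    value = float(coefficient)
    for x, p in zip(counts, probs):
        value *= p ** x
    return value


# Table 2 counts for failure modes A, B, C, evaluated at the tabulated ratios.
likelihood = multinomial_pmf([2, 1, 7], [0.20, 0.10, 0.70])
print(likelihood)
```

Viewed as a function of the ratios for fixed counts, this same expression is the likelihood that enters Bayes rule.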

Similarly, the multinomial probability mass
function for the failure effect probabilities, $\beta_1, \dots, \beta_4$,
is:

$$
f(x_1, \dots, x_4;\, n, \beta_1, \dots, \beta_4) = \frac{n!}{x_1!\, x_2!\, x_3!\, x_4!}\, \beta_1^{x_1} \beta_2^{x_2} \beta_3^{x_3} \beta_4^{x_4} \tag{4}
$$

where $x_i$ is the number of failures for that mode
that resulted in effects of severity level i, and n is the total
number of failures for that component-failure mode.

Table 2. Example failure mode data and calculated failure mode ratios.

Failure mode   # Failures   Failure mode ratio
A              2            20%
B              1            10%
C              7            70%
All            10           100%

Table 3. Example failure effect data and calculated failure effect
probability factors for failure mode C from Table 2.

Category   # Failures   Failure effect probability
I          1            14.3%
II         1            14.3%
III        2            28.6%
IV         3            42.9%
All        7            100%

Because it is the conjugate prior for this form
of evidence, the Dirichlet distribution is used to
model the Prior state of knowledge of the criticality
parameters $\alpha$ and $\beta$.

The Dirichlet distribution for the failure mode
ratio parameters $\alpha_1, \dots, \alpha_k$ for a given component is
written as follows:

$$
f(\alpha_1, \dots, \alpha_k;\, \rho_1, \dots, \rho_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \rho_i\right)}{\prod_{i=1}^{k} \Gamma(\rho_i)} \prod_{i=1}^{k} \alpha_i^{\rho_i - 1} \tag{5}
$$

where $\alpha_1, \dots, \alpha_k$ are the POI (the failure mode ratios
for the k failure modes of that component) and $\rho_1, \dots, \rho_k$
are the parameters of the Dirichlet distribution.

The values $\alpha_1, \dots, \alpha_k$ are not known exactly, but
rather are uncertain quantities, with uncertainty
represented by the Dirichlet distribution. The
expected value of each failure mode ratio,
$\alpha_i$, for the prior distribution is simply the ratio of
the parameter $\rho_i$ for failure mode i to the sum of all
parameters $\rho_1, \dots, \rho_k$ (Agresti 2017):

$$
E(\alpha_i) = \frac{\rho_i}{\sum_{j=1}^{k} \rho_j} \tag{6}
$$


As discussed previously, evidence about the failure
mode ratios will come in the form of the number
of failures which were caused by a particular failure
mode out of a total number of failures for all failure
modes, i.e., Multinomial data. Because of the conjugacy
property of the Multinomial data with the
Dirichlet prior distribution, the Posterior distribution
of the failure mode ratios $\alpha_1, \dots, \alpha_k$ is also a Dirichlet
distribution, with the following parameters:

$$
\rho_i' = \rho_i + x_i \tag{7}
$$

where $\rho_i$ are the initial Dirichlet parameters; $x_i$ is
the number of failures which occurred by that failure
mode; and $\rho_i'$ are the final Dirichlet parameters.
The expected value of each failure mode
ratio, $\alpha_i$, for the posterior distribution is simply
the ratio of the parameter $\rho_i'$ for failure mode i to
the sum of all parameters $\rho_1', \dots, \rho_k'$ (Agresti 2017):

$$
E(\alpha_i \mid x_1, \dots, x_k) = \frac{\rho_i'}{\sum_{j=1}^{k} \rho_j'} \tag{8}
$$

Unlike the failure mode ratio, which can have any
number of parameters, one for each failure mode,
the failure effects probability can have only four in
the MIL-STD-1629A approach to FMECA, since
there are four categories of failure effect severity.
The Dirichlet distribution for the failure effect
probability parameters $\beta_1, \dots, \beta_4$ for a given
component-failure mode is written as follows:

$$
f(\beta_1, \dots, \beta_4;\, \delta_1, \dots, \delta_4) = \frac{\Gamma\left(\sum_{i=1}^{4} \delta_i\right)}{\prod_{i=1}^{4} \Gamma(\delta_i)} \prod_{i=1}^{4} \beta_i^{\delta_i - 1} \tag{9}
$$

where $\beta_1, \dots, \beta_4$ are the POI (the failure effect probability
factors for that failure mode) and $\delta_1, \dots, \delta_4$ are
the parameters of the Dirichlet distribution.

As with failure mode ratio parameters, the
expected values of the prior failure effect probability
factors are (Agresti 2017):

$$
E(\beta_i) = \frac{\delta_i}{\sum_{j=1}^{4} \delta_j} \tag{10}
$$

Similarly, the parameters of the Posterior
Dirichlet distribution are the initial parameters, $\delta_i$,
plus the number of failures which occurred by that
failure mode resulting in effects having severity of
that category, $x_i$:

$$
\delta_i' = \delta_i + x_i \tag{11}
$$

The expected value of each failure effect
probability factor, $\beta_i$, for the posterior distribution
is simply the ratio of the parameter $\delta_i'$ for
failure effect i to the sum of all four parameters
$\delta_1', \dots, \delta_4'$ (Agresti 2017):

$$
E(\beta_i \mid x_1, \dots, x_4) = \frac{\delta_i'}{\sum_{j=1}^{4} \delta_j'} \tag{12}
$$
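
The Dirichlet update and expected-value formulas of this section reduce to element-wise addition and a normalization. The sketch below illustrates this with a hypothetical prior (the values 1.0, 1.0, 2.0 are an assumption made only for the example) combined with the failure mode counts of Table 2; the function names are likewise illustrative.

```python
def dirichlet_expected_values(params):
    """Expected value of each category: parameter divided by the parameter sum."""
    total = sum(params)
    return [p / total for p in params]


def dirichlet_update(prior_params, counts):
    """Posterior Dirichlet parameters: each prior parameter plus its observed count."""
    if len(prior_params) != len(counts):
        raise ValueError("prior_params and counts must have the same length")
    return [p + x for p, x in zip(prior_params, counts)]


# Hypothetical prior for three failure modes, updated with the Table 2 counts.
prior = [1.0, 1.0, 2.0]
posterior = dirichlet_update(prior, [2, 1, 7])

print(dirichlet_expected_values(prior))
print(dirichlet_expected_values(posterior))
```

The same two functions apply unchanged to the failure effect probabilities, where the parameter vector always has four entries, one per severity category.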


4 EXPERT ELICITATION OF PRIOR 
CRITICALITY PARAMETERS 


For a Criticality analysis of a new product which
is in the development process, where no testing
(or failures) have yet occurred, the Dirichlet Prior
distributions of $\alpha_i$ and $\beta_i$ are the initial estimates
of those parameters. By setting up the FMECA
structure with a Bayesian framework, the criticality
estimates may be updated as multinomial data are
gathered later in the design and
verification/validation processes. This allows
updated FMECA-based risk assessments to be
made as additional evidence is accrued over time.

Because of the stage of the process where the 
prior estimates will be needed, development of 
the Dirichlet prior distribution may require the 
elicitation of information from experts. A detailed 
review of the topic of expert elicitation is beyond 
the scope of this paper. Some basics as they apply 
to this situation will be briefly discussed in this sec- 
tion, but the remainder of the paper will assume 
any expert-elicited estimates utilized sound meth- 
ods in the elicitation process. 

In order to develop the Priors for $\alpha_i$ and $\beta_i$,
experts must provide their estimates of each of
the values $\alpha_1, \dots, \alpha_k$ and $\beta_1, \dots, \beta_4$. Once values are
obtained from experts, assuming more than one
expert, their answers (which may differ) must be
aggregated into a single set of estimates. The
approaches which can be taken to this aggregation
process are often categorized as either mathematical
or behavioral approaches (Clemen et al. 1999).
At a high level,
the mathematical approach allows each expert to
independently provide their assessments, and then
the analyst combines them via some form of averaging.
Weighted averages can be used to reflect differing
degrees of expertise. The advantage of this
approach is that it requires less time in formal
joint meetings with experts, whose time may
be of limited availability. The alternative behavioral
approach requires the experts to develop a
jointly agreed-upon estimate through discussion.
This method has the advantage of allowing all of
the experts to share their viewpoints and reasoning
for their initial positions. As the experts share
their unique knowledge and perspectives with each
other, they may collectively arrive at an estimate
which is more comprehensive and representative
than if they had each provided independent assessments.
The drawback to this method is the time it
may take the experts to reach that consensus, and
the difficulty in getting all experts together for the
necessary consensus-reaching discussions.
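
One simple form of the mathematical aggregation approach is a weighted arithmetic average of the experts' estimates (a linear opinion pool). The sketch below assumes this particular averaging rule, and the expert estimates and weights are invented purely for illustration; the paper does not prescribe a specific formula.

```python
def aggregate_expert_estimates(estimates, weights=None):
    """Weighted average of several experts' probability vectors, renormalized."""
    if weights is None:
        weights = [1.0] * len(estimates)
    total_weight = sum(weights)
    k = len(estimates[0])
    combined = [
        sum(w * e[i] for w, e in zip(weights, estimates)) / total_weight
        for i in range(k)
    ]
    # Renormalize so the aggregated ratios still sum to 1.
    norm = sum(combined)
    return [c / norm for c in combined]


# Hypothetical: three experts, three failure modes; the second expert is
# given twice the weight of the others.
experts = [
    [0.10, 0.60, 0.30],
    [0.20, 0.50, 0.30],
    [0.10, 0.70, 0.20],
]
aggregated = aggregate_expert_estimates(experts, weights=[1.0, 2.0, 1.0])
print(aggregated)
```

The aggregated vector fixes only the relative sizes of the Dirichlet parameters; their absolute scale is a separate choice.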

One particular nuance in developing expert estimates
of FMECA Criticality parameters is that
the parameters being estimated are dependent.
For example, when estimating the failure mode
ratios for a component with k distinct failure
modes, only $k - 1$ estimates can be provided independently; the
last estimate is derived by subtracting the sum of
the other estimates from 1, since the sum of the
failure mode ratios must equal 100%. This dependency
can produce some challenges for the elicitation
process. Some techniques for eliciting estimates
for dependent parameters include overfitting, as
described in Zapata-Vazquez et al. (2012).


5 PRIOR DISTRIBUTION STRENGTH 


Once a set of criticality factor estimates has been
developed, the experts must determine whether to
apply those estimates as strong or weak data. The
prior estimates of the Dirichlet parameters, $\rho_i$ and
$\delta_i$, can be assigned any nominal value, as long as
their relative values are consistent with the experts’
opinions. For instance, assuming three failure
modes, the two sets of failure mode ratio
prior distribution parameter estimates in Table 4 are the
same from a relative perspective. Both sets of prior
distribution parameters imply the assumption that
the expected value of Mode 1’s failure mode ratio,
$\alpha_1$, is 10%, Mode 2’s expected value is 65% and
Mode 3’s expected value is 25%. However, in the
first case, the parameters are set as decimal values
representing the percentage, whereas in the second
case, the parameters are set as the integer values
representing the percentage.

These two sets of prior parameters, both representing
the same expected values for the failure
mode ratios, produce prior distributions with very
different strengths. The sum of the $\rho_i$ values in each
case can be thought of as a quasi-number of failures
previously observed. The stronger prior in this
case is approximately the same as having observed
100 past failure events from which initial estimates
of failure mode ratio have been made, whereas
the weak prior is approximately equivalent to having
observed only 1 failure. As actual field data is
gathered and combined with these prior estimates
to update predictions of failure mode ratios, the
stronger prior will require significantly more data
to overcome the initial estimates, if those initial estimates
are proven to be incorrect by the field data.

Table 4. Examples of two Dirichlet parameter sets with
differing strengths for a given component.

Failure mode   $\rho_i$ (weak)   $\rho_i$ (strong)
1              0.10              10
2              0.65              65
3              0.25              25

For example, assume that actual field data is
gathered on the failure modes in Table 4. Assume
ten failures have occurred, and all ten were due to
Mode 1. This is not at all what was expected based
on the prior estimates above (in both cases, the prior
parameters would have predicted that only 1 out of 10
failures would be via Mode 1). Table 5 below compares
the Posterior estimates of the expected value
for the failure mode ratios, using Bayes rule to
combine the prior parameters with the field data.
Note that the results obtained using the weak prior
shifted significantly, and “feel” more correct in light
of the field data. Conversely, the posterior estimates
obtained using the strong prior have only marginally
changed from their prior values, despite the significant
disagreement between the prior estimates and
the field results.

Table 5. Posterior estimates of failure mode ratio
expected values as a function of prior strength.

Failure mode   $\alpha_i$ (weak)   $\alpha_i$ (strong)
1              92%                 18%
2              6%                  59%
3              2%                  23%

The key takeaway here is to be cognizant of
the fact that the choice of absolute values for the
Dirichlet parameters, $\rho_i$, determines the strength of
the prior distribution. The absolute values should
be determined based upon the strength of the prior
information from which those parameter estimates
were developed. Pre-fitting of hypothetical field
data can help the analysis team to see how such
data would affect posterior estimates, and help the
team to select a set of prior values which best represent
the team’s true state of prior knowledge.
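
The pre-fitting exercise described above is easy to automate. The following sketch applies the hypothetical field data of this section (ten failures, all via Mode 1) to the weak and strong parameter sets of Table 4 and prints the resulting posterior expected values; the function name is an illustrative assumption.

```python
def posterior_expected_values(prior_params, counts):
    """Posterior expected ratios after adding observed counts to the prior."""
    posterior = [p + x for p, x in zip(prior_params, counts)]
    total = sum(posterior)
    return [p / total for p in posterior]


weak_prior = [0.10, 0.65, 0.25]    # Table 4, weak parameter set (sum = 1)
strong_prior = [10.0, 65.0, 25.0]  # Table 4, strong parameter set (sum = 100)
field_data = [10, 0, 0]            # ten failures, all via Mode 1

weak_post = posterior_expected_values(weak_prior, field_data)
strong_post = posterior_expected_values(strong_prior, field_data)

for mode, (w, s) in enumerate(zip(weak_post, strong_post), start=1):
    print(f"Mode {mode}: weak {w:.0%}, strong {s:.0%}")
```

Rerunning the same function with other hypothetical data sets shows how quickly a candidate prior would yield to evidence, which is one way to judge whether its strength reflects the team's actual confidence.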

Note that this same behavior is true for the prior
parameters for the failure effect probability factors,
$\beta_1, \dots, \beta_4$, too: the prior distribution strength
increases with the sum of the Dirichlet parameters
$\delta_1, \dots, \delta_4$ for that failure mode.


6 APPLICATION OF EVIDENCE TO 
UPDATE PRIOR CRITICALITY FACTOR 
ESTIMATES 


Data to enable Bayesian update of the prior 
FMECA criticality parameters will be in the form 
of a number of failure events categorized by fail- 
ure modes and their resultant failure effects. The 
utility of the proposed FMECA with Bayesian 
structure will be shown via an example. 

Assume experts have provided prior estimates of
the Failure Mode Ratios and Failure Effect Probabilities
for a particular component. Those prior
estimates have been provided in the form of Dirichlet
distribution parameters, $\rho_i$ and $\delta_i$, and their associated
expected values, $\alpha_i$ and $\beta_i$, calculated from
those parameters, in Tables 6-8. Note that in this
example, the expert-estimated parameters result
in weak prior distributions for both Failure Mode
Ratios and Failure Effects Probabilities, presumably
reflecting the experts’ lack of certainty about those prior
estimates. Such weak priors will be quickly influenced
by any new evidence which is gathered.


Assume that later in the development process
for the component/system in question, the failure
information in Table 9 is obtained.

Combining the prior Dirichlet distribution information
with the new evidence following Bayes Rule
as outlined in Section 3 results in the updated (Posterior)
estimates of the Failure Mode Ratio and
Failure Effect Probability Dirichlet parameters and
expected values for this component in Tables 10-12.


Table 6. Example prior failure mode ratio estimates.

Failure mode   Dirichlet parameters, $\rho_i$   Expected values, $\alpha_i$
1              0.05                             5%
2              0.30                             30%
3              0.65                             65%

Table 7. Example prior failure effect probability factor
Dirichlet parameter estimates, $\delta_i$.

Failure mode   CAT I   CAT II   CAT III   CAT IV
1              0.00    0.20     0.50      0.30
2              0.00    0.00     0.80      0.20
3              0.01    0.01     0.75      0.23

Table 8. Example prior failure effect probability factor
expected value estimates, $\beta_i$.

Failure mode   CAT I   CAT II   CAT III   CAT IV
1              0%      20%      50%       30%
2              0%      0%       80%       20%
3              1%      1%       75%       23%

Table 9. Example failure mode and effects data.

               # Failures   # Failures by failure effect category
Failure mode   by mode      I     II    III   IV
1              0            0     0     0     0
2              2            0     0     2     0
3              1            0     1     0     0

Table 10. Posterior failure mode ratio estimates.

Failure mode   Dirichlet parameters, $\rho_i'$   Expected values, $\alpha_i$
1              0.05                              1.3%
2              2.30                              57.5%
3              1.65                              41.3%

Table 11. Posterior failure effect probability factor
Dirichlet parameter estimates, $\delta_i'$.

Failure mode   CAT I   CAT II   CAT III   CAT IV
1              0.00    0.20     0.50      0.30
2              0.00    0.00     2.80      0.20
3              0.01    1.01     0.75      0.23

Table 12. Posterior failure effect probability factor
expected value estimates, $\beta_i$.

Failure mode   CAT I   CAT II   CAT III   CAT IV
1              0%      20%      50%       30%
2              0%      0%       93.3%     6.7%
3              0.5%    50.5%    37.5%     11.5%



The changes for the Failure Mode Ratio
expected values are shown graphically in Figure 1.
Based on the new evidence, one would conclude
that the probabilities of occurrence of failure modes
1 and 3 are lower than originally estimated (despite
having observed one failure via failure mode 3),
and that the probability of occurrence of failure mode
2 has nearly doubled.

The changes for the Failure Effect Probability 
Factor expected values for Failure Mode 3 are 
shown graphically in Figure 2. Based on the new 
evidence, one would conclude that the probabil- 
ity of failure mode 3 producing CAT II severity 
effects is significantly higher than was originally 
estimated. 

A key item to note in both examples above is
that all categories’ predictions change in response
to new evidence, even if that new evidence didn’t
explicitly include events in that particular category.
This is one of the key benefits of the Bayesian
approach to FMECA updating: new evidence
showing that a failure occurred in one of the categories
updated all of the other categories’ predictions,
too. For example, in the failure effects
probability example, the single piece of new evidence
(occurrence of a single event with CAT II
failure effects) also informed the other three categories;
it was a failure that did NOT result in
effects from those other severity categories, and
thus their predicted values shifted downwards by
varying degrees.

Figure 1. Comparison of prior and posterior failure
mode ratio expected values. [Bar chart of prior and posterior values by failure mode.]

Figure 2. Comparison of prior and posterior failure
effect probability factor estimates. [Bar chart of prior and posterior values for CAT I to CAT IV; vertical axis from 0% to 50%.]

A final benefit of approaching FMECAs in a
Bayesian fashion is that it enables sensitivity and
uncertainty analysis techniques to be applied to
the results. Because the criticality factors are modeled
in a Bayesian framework as uncertain parameters
of interest, one can report FMECA criticality
factors calculated based on the expected value of
those parameters, or one can report criticality factors
calculated based on worst-case values, such
as upper confidence limit values. The confidence
limits for any Dirichlet-distributed parameter are
found by calculating the marginal Beta distribution
for that parameter (Farrow 2017). In the case of
the uncertain estimate of the failure mode ratio for
failure mode 3, $\alpha_3$, from the previous examples,

$$
\alpha_3 \sim \mathrm{Beta}\left(\rho_3,\ \sum_{j=1}^{k} \rho_j - \rho_3\right) \tag{13}
$$

where the $\rho_j$ are the current Dirichlet parameters for that
component (here, the posterior values in Table 10).

The same holds true for the failure effect probability
Dirichlet parameters. Table 13 shows a
comparison of the criticality values calculated for
Failure Mode 3, Failure Effect Category II based
on the data used in the examples above. Note that
the table assumes no uncertainty in the base failure
rate, $\lambda$, and thus the same failure rate is used
for both the nominal criticality value calculation
and the upper confidence limit criticality value calculation.
(In practice, the base failure rate could
also be treated as an uncertain POI, and modeled
in a Bayesian framework.) The upper confidence
bound estimate of the criticality value for this
failure mode effect combination is more than four times
higher than the nominal estimate. The Bayesian
structure of the FMECA enables such uncertainty
analysis, which can serve as a means of prioritizing
future data-gathering efforts, or can allow for
uncertainty-based risk evaluations. For failure
modes with the highest severity effects, risk prioritization
efforts can be based upon upper confidence
limit values of the Criticality estimates.

Table 13. Comparison of criticality estimates based on
use of expected values versus 90% UCL values for failure
mode ratio and failure effect probability estimates for
failure mode 3, failure effect CAT II.

Parameter                                          Expected value   90% UCL
Failure mode ratio, Mode 3, $\alpha_3$             41.3%            94.1%
CAT II failure effect probability, $\beta_3$       50.5%            97.7%
Component base failure rate, $\lambda$             6.2E-8           6.2E-8
CAT II criticality (failures per hour)             1.3E-8           5.7E-8
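
The marginal-Beta calculation can be sketched without external libraries. The quantile routine below is a simple numerical approximation written for this example (a midpoint-rule integration of the unnormalized Beta density), and the nominal criticality is taken as the product of the expected failure mode ratio, the expected failure effect probability and the base failure rate, which is the form that matches the expected-value column of Table 13. All function names are assumptions, and the sketch does not attempt to reproduce the UCL column, whose exact percentile convention is not spelled out in the text.

```python
def dirichlet_marginal(params, index):
    """Marginal of one Dirichlet component: Beta(p_i, sum(p) - p_i)."""
    return params[index], sum(params) - params[index]


def beta_quantile(a, b, q, steps=100000):
    """Approximate q-quantile of Beta(a, b) by midpoint-rule integration."""
    h = 1.0 / steps
    xs = [(i + 0.5) * h for i in range(steps)]
    # Unnormalized density; the normalizing constant cancels in the ratio.
    weights = [x ** (a - 1.0) * (1.0 - x) ** (b - 1.0) for x in xs]
    total = sum(weights)
    running = 0.0
    for x, w in zip(xs, weights):
        running += w
        if running >= q * total:
            return x
    return 1.0


posterior_rho = [0.05, 2.30, 1.65]          # Table 10
a, b = dirichlet_marginal(posterior_rho, 2)  # failure mode 3
alpha3_mean = a / (a + b)
alpha3_upper = beta_quantile(a, b, 0.90)     # 90th percentile of the marginal

base_failure_rate = 6.2e-8                   # Table 13, failures per hour
beta_cat2_mean = 0.505                       # Table 12, mode 3, CAT II
criticality_nominal = alpha3_mean * beta_cat2_mean * base_failure_rate

print(alpha3_mean, alpha3_upper, criticality_nominal)
```

An upper-bound criticality would be obtained the same way, by substituting upper percentiles of the two marginal Beta distributions for the expected values.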


7 CONCLUSIONS AND FUTURE WORK 


FMECA is a widely-used tool in a broad range of 
industries for evaluating and quantifying the risks 
associated with a design. Because FMECA is most 
effective when applied early in the product design 
process, it is frequently the case that the criticality 
analysis must proceed before any actual data from 


which to conduct that analysis is available on the 
system in question. The methods outlined herein 
provide a framework for allowing the synthesis of 
expert-elicited prior estimates of criticality factors 
with later-arriving direct evidence via Bayes Rule 
to produce updated estimates of those factors. 
This approach imparts a broad range of analytical 
benefits, including allowing uncertainty and sensi- 
tivity analyses to be conducted on the quantitative 
outputs of the FMECA. 

Future work in this area will focus on the development
of FMECA-specific techniques for eliciting
expert estimates of criticality factors $\alpha$ and $\beta$,
notably in dealing with the dependency of those
factors, and in addressing the difficulties in estimating
rare event probabilities.

interlockings from railML models

ABSTRACT: The theoretical foundations for formally verifying railway interlocking systems have already
been studied extensively. There exists a lot of work covering the application of methodologies like model
checking in this context. However, some design faults still remain undetected until final on-track evaluation
of the system. This is strongly related to missing automation solutions for real-world models and standards
as well as the high theoretical expertise required. There exist many well-developed tools, each requiring
a different modeling formalism and focusing on a different question or scenario. Without specific experience
in formal system modeling, it is extremely complicated to model such complex systems. In this paper, we
present a methodology for the automatic model generation and verification of railway interlockings in a
tool-independent way. To this end, we define a generic template set of atomic track elements and safety
properties in a formal modeling language with precise semantics. This generic template enables
us to verify the structure of any given track layout. The existing tool support of VECS allows these
specifications to be translated automatically into the input of various model checkers for verification. More importantly,
we present a robust transformation of the upcoming data exchange format for railway interlocking systems,
railML, into the presented specification template. As a consequence, this approach may help to bridge
the gap between formal methods and system design in railway interlockings. We evaluate this approach on
a real-world case study, the train station of Braine l’Alleud. We also show the tool-independent modeling by
automatically translating the specification to different verification engines and comparing their performance.


1 MODEL-BASED VERIFICATION AND
INTERLOCKING SYSTEMS

Designing interlocking systems for large railway
stations is a very complex task. Many different
routes, not only for passengers but also for logistics
and freight traffic, must be combined within a vast
traffic network. Moreover, such railway components,
e.g., the interlocking system structure and the
corresponding route network, are safety-critical
infrastructures. This means that errors in the network
plan, such as an unconnected or dead-end track, an
incorrectly working switch, or wrongly scheduled and
therefore crossing routes, can lead to costly and
dangerous hazards. Verifying such railroad interlocking
systems by applying formal verification
techniques (e.g., model checking (Clarke et al.
1999)) increases their quality considerably.

Even though there exists a vast amount of work
on the formal verification of railway interlocking
systems, techniques such as model checking or deductive
verification are not commonly used in practice. We
think that one major problem is that there exists no
off-the-shelf implementation that can be applied as
a simple add-on to a given interlocking design tool.
In particular, there exist several tools providing
their own domain-specific language or an interface to a
particular modeling tool that can do the verification
task. However, either you need to find the one
tool that applies to your specific modeling environment,
or you must transform your model into
the input language of your verification tool. The
first approach can lead to the conclusion that no
such tool exists yet, and the second is
very error-prone because of the time consumption
and the skill level required of the engineer.

To overcome this problem, we want to define a
structural transformation from an accepted interlocking
modeling standard into a formal representation
of a railway interlocking system and
its routing tables. Over the last ten years, a data
scheme standard enabling the interchangeability
of railroad design data has been developed by a
consortium of leading companies from the interlocking
and signaling domain. It is called Railway
Markup Language (railML) (Nash et al. 2004) and
defines a data scheme based on XML. By using this
standard, it becomes possible to define a standardized
verification approach for general interlocking
system designs and logical routing tables. Since
we think that this standard will be used widely in
industrial applications within the next years, a verification
tool based on it can help to lower the
hurdle of using formal verification techniques for
interlocking systems in practice. For example, it is
imaginable that an algorithm for the automatic
processing of paper-written interlocking schemes
generates railML output, as presented in Klockman
et al. (2018), which is later verified by the
algorithm defined in this paper.

In this paper, we define a transformation from
a given railML description into a formal model
representation which can be used for model-based
verification, i.e., for verifying safety properties and
for additional measures such as probabilistic analysis or
failure injection techniques. To this end, we defined
a set of template automata for essential interlocking
elements and their instantiation with the information
derived from the railML representation.
Further, we present a real-world case study of the
Belgian railway station Braine l’Alleud and a representation
of the German train station of the city
of Leipzig. These give an idea of the capabilities
enabled by the automatic transformation
and the corresponding model checking tools.

We think that the fully automatic transformation
from an accepted railway data scheme will
lower the hurdle for the application of formal
verification techniques within the development of
interlocking systems and even increase the acceptance
of the assessment. This is especially the case
since we try to provide a 1:1 transformation by
keeping as much information of the original data as possible,
which also ensures full and easy-to-understand
traceability from the design to the formal model.

Related Work Of course, the verification of
railway interlocking systems has already been
researched in recent years in several publications,
such as Cimatti et al. (1998a), Haxthausen et al.
(2011), Cappart and Schaus (2016), Cimatti et al.
(1998b), Bonacchi (2013), and Limbrée et al. (2016).
These authors showed the overall applicability of
formal verification, in particular model checking,
in the context of the analysis of railway interlocking
systems. Banci et al. (2004) presented
a first general representation of interlocking components
using Statemate and Harel statecharts
(Harel 1987). Further, the first transformation
of a railway interlocking system from a specific
domain-specific language into a formal modeling
language, Event-B (Abrial 2010), was presented by
Iliasov and Romanovsky (2012) in the context of
the SafeCap project (Iliasov et al. 2013), intending
to improve the time-effectiveness of route plans
without violating the safety properties of the systems.
For this purpose, the B verification engine ProB
(Leuschel and Butler 2003) was used. However, the
main focus of this project was not on the formal
verification but on the optimization of the system.

The remainder of the paper is structured as follows: In
section two we present preliminary background
information about the implemented interlocking
components and railML. Further, we show
the generation algorithm for the formal model
by presenting abstractions for the minimal set of
required elements (switches, tracks, etc.) and the
corresponding safety properties to be verified
(derailment, collision, etc.). Section four presents
the real-world case study and experimental results
of the verification using different model checking
tools. The conclusion and further work are given
in section five.


2 THE BASIC ELEMENTS OF AN 
INTERLOCKING SYSTEM AND 
RAILML 


Before we introduce the approach, we need to clarify the
atomic elements of an interlocking system that must be
implemented for a correct abstraction of such a system.

Figure 1 presents a simple example of a railway
interlocking system’s track layout (Fig. 1a) as well
as the corresponding interlocking table (Fig. 1b)
representing the available routes.

The track layout is divided into connected track
elements like T01, T02, T11, etc. Track T12, for
example, is a dead-end track ending with a bumper.
Open-ended track elements T11 and T21 can connect
the system to other interlockings. Tracks T01 and T02
are connected by switches (the points P01 and P02)
for branching between several track elements.

Points (P01 and P02) can be in normal position 
so that trains will drive straight on, or in reverse 
position so that trains will branch off the current 
track element. Further, the given interlocking table 
supports flank protection, i.e., Switch P02 is also 


(a) Simple track layout example including five track elements.


Route  Direction  Tracks          P01  P02
A11    A→T11      T01, T11        N    N
A12    A→T12      T01, T02, T12   R    R
B21    B→T21      T01, T21        N    N
C21    C→T21      T02, T01, T21   R    R


(b) Interlocking table of the example track layout containing
four simple routes.


Figure 1. A simple interlocking system example with
track layout (a) and the route definitions given by an
interlocking table with the corresponding direction,
track elements, and position of the switches (P01, P02) as
normal (N) or reverse (R).


switched as a safety function. In normal position,
it disconnects track T12 from the main track
T11↔T01↔T21 to prevent wagons from rolling
back onto the main track. Of course, an industrial
real-world interlocking system plan contains more
elements than the ones presented here. However, these
are sufficient to verify the basic system topology
and, on top of that, the logical route planning and
scheduling of the system.

Railway Markup Language If we want to process the
data of a given representation in different steps and
tools, we need to base our transformation on a standard
scheme. As mentioned before, we therefore focus on the
infrastructure definition of railML. In general, railML
contains schemes for Timetable and Rostering,
Infrastructure, Rollingstock, and Interlocking tables and
signal plans. It would also have been interesting to use
the Interlocking scheme, which is meant to support the
definition of interlocking and route tables. Unfortunately,
this scheme is under development at the moment and
therefore not available. Instead, we use a simple CSV
representation of the route table as given in Fig. 1.

In the following, we present an excerpt of the
infrastructure railML representation of the example
in Fig. 1. In particular, these are the elements covered
by route C21 (cf. Fig. 1b), i.e., T12, T02, T01,
and correlated elements like switches and signals.


Figure 2. Excerpt from the railML representation of
the example track layout.

In the following, we give a short explanation of
the listing in Fig. 2, without providing a complete
description of railML. The basic elements of the
infrastructure are tracks with an id. For each
track element we define its trackBegin, its trackEnd,
and the connections to other track elements, or a
bufferStop if it is a blind end. Each trackBegin and
trackEnd has, in addition, a position that is used to
define the direction on the track: from the position
with the lower value to the one with the higher value
the direction is upwards, otherwise downwards. The
connections are realized by references to other
trackBegin or trackEnd elements via their unique
id. Further, we find switch elements within a
track, which can be connected to corresponding
switch elements on other tracks, also via the id of the
elements. Signals can be modeled, too, by using
signal elements with a specific position and a direction
in which they work, e.g., down if they work in
the downwards direction of the track.
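
To make the structure described above more concrete, the following
minimal Python sketch reads a railML-like fragment and derives, for each
track, the tracks referenced at its trackBegin and trackEnd. The XML
fragment is a hypothetical, heavily simplified illustration: only the
element names (track, trackBegin, trackEnd, connection, bufferStop,
signal) are taken from the text, whereas the ids, positions, nesting, and
the helper name neighbours are assumptions and do not reproduce the
actual railML scheme or the listing of Fig. 2.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified excerpt; element names follow the text
# above, while ids, positions, and nesting are assumptions for illustration.
SNIPPET = """
<tracks>
  <track id="T12">
    <trackBegin id="T12b" pos="0"><bufferStop id="BS1"/></trackBegin>
    <trackEnd id="T12e" pos="100"><connection id="c1" ref="c2"/></trackEnd>
    <signal id="C" pos="90" dir="up"/>
  </track>
  <track id="T02">
    <trackBegin id="T02b" pos="0"><connection id="c2" ref="c1"/></trackBegin>
    <trackEnd id="T02e" pos="50"/>
  </track>
</tracks>
"""


def neighbours(xml_text):
    """Map each track id to the tracks referenced at its begin and end."""
    root = ET.fromstring(xml_text)
    # Remember which track owns each connection id.
    owner = {
        conn.get("id"): track.get("id")
        for track in root.iter("track")
        for conn in track.iter("connection")
    }
    result = {}
    for track in root.iter("track"):
        entry = {"before": [], "after": []}
        for tag, key in (("trackBegin", "before"), ("trackEnd", "after")):
            node = track.find(tag)
            if node is None:
                continue
            # A connection points to its counterpart on another track.
            for conn in node.findall("connection"):
                target = owner.get(conn.get("ref"))
                if target is not None:
                    entry[key].append(target)
        result[track.get("id")] = entry
    return result


NEIGHBOURS = neighbours(SNIPPET)
```

In this sketch a bufferStop simply yields no neighbour on that side, which
mirrors the description of a blind end.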

Using this representation, we get an applicable
standard for exchanging interlocking data between
different modeling and verification tools. The
original standard contains, of course, more than
these elements, but they are sufficient for demonstrating
the applicability of the presented approach.


3 A TEMPLATE SYSTEM FOR 
INTERLOCKING SYSTEMS 


The basic idea of our transformation system is 
to provide a set of accepted templates in a formal 
language, e.g., SAML in our prototype, for single 
track elements and to derive the instantiations 
and the links between elements from the railML 
representations. 

The class diagram in Fig. 3 shows the templates
and their relations. All infrastructure elements
(tracks, signals, switches) are controlled via route
automata. Each route holds information about
the track segments it contains and the corresponding
signal states and switch positions. Therefore,
all activities are correlated to the currently active
routes, i.e., which tracks must be blocked, which
signal must be set, or which switch must be set to a
particular position.


Figure 3. Class model of the templates.

Route Scheduling The route scheduling itself
is implemented non-deterministically, i.e., each
non-active route can request to become active at any
time if none of its required track segments is blocked
by another route. If more than one route tries to become
active, we resolve this race condition by choosing the
route with the lower id (e.g., route 1 before route
2). Overall, more than one route can be active at
the same time, but only one additional route can
become active in each time step.
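
The scheduling rule above can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the route table ROUTES
is hypothetical (it is not the table of Fig. 1b), and the callback
wants_to_request stands in for the non-deterministic choice that the
model checker explores exhaustively.

```python
# Hypothetical route table: route id -> set of required track segments.
ROUTES = {
    1: {"T01", "T11"},
    2: {"T02", "T12"},
    3: {"T01", "T21"},
}


def activate_one(routes, active, wants_to_request):
    """Return the id of the single route that becomes active, or None.

    routes: dict mapping route id to its required tracks
    active: set of ids of currently active routes
    wants_to_request: callable(route_id) -> bool, modelling the
        non-deterministic decision of a route to request activation
    """
    # Tracks blocked by the routes that are already active.
    blocked = set()
    for rid in active:
        blocked |= routes[rid]
    # Non-active routes that request activation and are not blocked.
    candidates = [
        rid
        for rid in sorted(routes)
        if rid not in active
        and wants_to_request(rid)
        and not (routes[rid] & blocked)
    ]
    # Race condition: the route with the lowest id wins; at most one
    # additional route becomes active per time step.
    return candidates[0] if candidates else None
```

For example, with no active route and all routes requesting, route 1 is
chosen; with route 1 active, route 3 is blocked by the shared track T01,
so route 2 is chosen.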

Logical Trains At the moment, the train movement
is modeled in a non-deterministic, logical way.
This means a train can leave a specific track segment
or not, without taking into account physical
parameters like train velocity and length of
the track segment. Each train occupies one to
n adjacent track elements to model the movement
from one track element to another and trains of
different size. If required, the formal model can
easily be extended with physical behavior. Further, we
define two trains per route for modeling collisions,
flank protection, and other safety properties.

In the following, we present the internal autom- 
ata of the templates and their basic instantiation 
idea. 


3.1 Track 


In general, a track can be clear, reserved, or occupied.
Since we are verifying the system with two trains, the
track can be occupied by train No. 1 or train No. 2
(occupiedBy1 or occupiedBy2). Model checking, in
combination with non-deterministic behavior, covers
all possible combinations of train positions and
routes, and therefore two trains are sufficient for
analyzing all possible train-to-train situations.

railML relation In the following, we need information
on adjacent tracks. This information is derived
from the railML element track (cf. Fig. 2) and the
underlying connection information (cf. Fig. 2).
To identify which tracks are before and behind a
track element, we use the railML direction conventions,
where the direction is from the lower pos value
to the higher. From this, tracks that are referenced at
trackBegin are preceding tracks and those referenced
at trackEnd are subsequent tracks. The assignment
of tracks to routes and their required direction (normal
or reverse) is extracted from the route table.


Figure 4. Automaton model of tracks.

Behavior A track can be requested by a route. 
This is done indirectly by checking the current 
state of all routes containing the track. 


\[ \mathit{reserveRequest} := \bigvee_{r \in \mathit{requiringRoutes}} r.\mathit{state} = \mathit{commanded} \]


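Tracks become occupied depending on the state of their
neighbor tracks.


\[ \mathit{occupiedContract}_{1} := \mathit{trackLeft}.\mathit{occupiedBy}_{1} \lor \mathit{trackRight}.\mathit{occupiedBy}_{1} \]

\[ \mathit{occupiedContract}_{2} := \mathit{trackLeft}.\mathit{occupiedBy}_{2} \lor \mathit{trackRight}.\mathit{occupiedBy}_{2} \]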


In the initial state of the model all tracks are empty,
so there has to be a possibility to place trains on the
tracks. Therefore, all possible starting tracks have an
additional, non-contradictory formula
occupiedContract0 with


\[ \mathit{occupiedContract}_{0} := \bigvee_{r \in \mathit{requiringRoutes}} r.\mathit{commanded} \]


The transition from occupied to clear is enabled
when the next track is occupied and the previous
track is clear. This ensures that a train will not be
split or removed.

Furthermore, a train is not forced to enter or leave a
track in every step, which models trains of different
lengths.


\[ \mathit{clearContract} := (\mathit{trackLeft}.\mathit{occupied} \land \mathit{trackRight}.\mathit{state} = \mathit{clear}) \lor (\mathit{trackLeft}.\mathit{state} = \mathit{clear} \land \mathit{trackRight}.\mathit{occupied}) \]
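
As a reading aid, the track contracts above can be written as plain Python
predicates. This is a sketch, not the SAML template itself: the class
Track, its string-valued state, and the function names are assumptions
introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Track:
    # One of: "clear", "reserved", "occupiedBy1", "occupiedBy2".
    state: str = "clear"

    @property
    def occupied(self):
        """True if the track is occupied by either train."""
        return self.state.startswith("occupiedBy")


def occupied_contract(left, right, train):
    """A track may become occupied by `train` if a neighbour already is."""
    wanted = f"occupiedBy{train}"
    return left.state == wanted or right.state == wanted


def clear_contract(left, right):
    """A track may be cleared if one neighbour is occupied (the train moved
    on) and the other neighbour is clear (no part of the train is left)."""
    return (left.occupied and right.state == "clear") or (
        left.state == "clear" and right.occupied
    )
```

With one neighbour occupied by train 1 and the other clear, both
occupied_contract(..., train=1) and clear_contract hold; with two occupied
or two clear neighbours, clear_contract does not hold.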


3.2 Signals 


Signals can be in one of two states — stop (red) or 
proceed (green). States which allow the trains to 
proceed with low speed are not modeled. 


Figure 5. Internal automaton for signals.

railML relation For the signals, we need to know
on which track the signal is placed and the direction
of the signal, as well as the next track behind the
signal, to determine when a train has entered a route
and to set the signal accordingly. This information can
be derived from the signal element of the railML
description (cf. Fig. 2) and the parent track element.

Behavior The complete automaton of the signal
template is given in Fig. 5. Its initial state is stop. It
changes to proceed if the following route (behind
this signal) is ready, i.e., the route is accessible.
saveRoute becomes true if one of the routes starting at
the signal is accessible, and the signal state then
changes to proceed.


\[ \mathit{saveRoute} := \bigvee_{r \in \mathit{requiringRoutes}} r.\mathit{proved} \]


If a train has passed the signal, it falls back to stop.
Therefore, the formula trainPassed is connected to the
track behind the signal to detect when a train has passed
it. This is detected by a subsequently occupied track.


\[ \mathit{trainPassed} := \mathit{trackBehind}.\mathit{isOccupied} \]


3.3 Switch 


A switch can be in two states according to its posi- 
tion: normal (straight on) and reverse (branching). 

railML relation Besides the id, we must derive
the information of the connection relations from
the switch, i.e., which tracks are connected via normal
and reverse position. This is done by utilizing
the switch tag information (cf. Fig. 2) and the
references to the adjacent track or switch in
combination with the parent track element.

Behavior A switch automaton has four states, which
combine its position (normal or reverse) with its
ability to change this position. This means a switch
can be locked (no change possible) or unlocked
(changing the position is possible), which prevents
switches from changing while occupied by a train. The
transitions normalRequest and reverseRequest cause the
switch to change its position to the one required by the
active route and lock the switch against any other
requests.


\[ \mathit{normalRequest} := \bigvee_{r \in \mathit{requiringRoutesNormal}} r.\mathit{reserved} \]

\[ \mathit{reverseRequest} := \bigvee_{r \in \mathit{requiringRoutesReverse}} r.\mathit{reserved} \]


Figure 6. Internal automaton for switches.

The transition unlockContract will open the
switch for new commands if both connected tracks
are clear.


\[ \mathit{unlockContract} := \mathit{Track1}.\mathit{isClear} \land \mathit{Track2}.\mathit{isClear} \]
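
The four-state switch automaton can be sketched as a small Python class
whose state is the pair (position, locked). The sketch rests on
assumptions not stated in the text: the initial state (normal, unlocked),
the priority of a normal request over a simultaneous reverse request, and
the names Switch and step are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Switch:
    position: str = "normal"  # "normal" or "reverse"
    locked: bool = False

    def step(self, normal_request, reverse_request, unlock_contract):
        """Advance the automaton by one step."""
        if self.locked:
            # A locked switch ignores requests; it only opens again
            # when both connected tracks are clear.
            if unlock_contract:
                self.locked = False
            return
        # An unlocked switch moves to the requested position and
        # locks itself against any other request.
        if normal_request:
            self.position, self.locked = "normal", True
        elif reverse_request:
            self.position, self.locked = "reverse", True
```

A request arriving while the switch is locked has no effect, which
corresponds to the intention that a switch must not move under a train.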


3.4 Route scheduling 


railML relation As mentioned before, the railML
Interlocking scheme is not finished, i.e., not available,
and therefore we have to process simple CSV
data. This has the same structure as the
route table in Fig. 1b.

Behavior A route controls and observes all trackside
elements necessary for a safe train movement.
It can be in one of three states (idle, commanded,
or occupied). A route is initialized as idle. As
previously mentioned, the routes are commanded
non-deterministically.

If a route can become active, i.e., is in the state
commanded, it causes the corresponding tracks to be
reserved for this route and the switches to change
their position as required. The route then checks
the state of the elements (tracksReserved and
switchesLocked).


\[ \mathit{tracksReserved} := \bigwedge_{t \in \mathit{includedTracks}} t.\mathit{reserved} \]

\[ \mathit{switchesLocked} := \bigwedge_{s \in \mathit{includedSwitches}} s.\mathit{locked} \]


If everything is correct, the start signal is changed
to proceed.


\[ \mathit{signalProceed} := \bigwedge_{s \in \mathit{includedSignals}} s.\mathit{proceed} \]


After that, the route will be occupied. When
the train has passed the complete route, the route will
change its state back to idle.


\[ \mathit{routePassed} := \bigwedge_{t \in \mathit{includedTracks}} t.\mathit{isClear} \]
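
The route life cycle (idle, commanded, occupied, and back to idle) can be
summarized as a transition function. The following Python sketch is an
interpretation of the description above rather than the generated model:
the function name, the string encodings of the element states, and the
exact guard of the transition from commanded to occupied are assumptions.

```python
def next_route_state(state, requested, track_states, switch_locked,
                     signal_states):
    """Return the successor state of a route automaton.

    state: "idle", "commanded", or "occupied"
    requested: True if the route (non-deterministically) requests activation
    track_states: states of the included tracks
    switch_locked: lock flags of the included switches
    signal_states: states of the included signals ("stop" or "proceed")
    """
    tracks_reserved = all(s == "reserved" for s in track_states)
    switches_locked = all(switch_locked)
    signal_proceed = all(s == "proceed" for s in signal_states)
    route_passed = all(s == "clear" for s in track_states)

    if state == "idle" and requested:
        return "commanded"
    if (state == "commanded" and tracks_reserved and switches_locked
            and signal_proceed):
        return "occupied"
    if state == "occupied" and route_passed:
        return "idle"
    return state
```

A commanded route with one track still unreserved stays commanded, and an
occupied route returns to idle only once all of its tracks are clear again.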


3.5 Safety specifications 


In the following, we present the safety
properties we provide for our method. Since the
defined model is built according to a fixed mechanism, we
think it can easily be extended with safety
specifications other than the presented ones.


Figure 7. Internal automaton for routes.


3.5.1 Collision detection 

Two trains collide if they are at the same place at
the same time. As described in 3.1, there are two
or more track objects for one track element in the
track layout to store additional information about
the direction of the trains. Therefore, a collision
occurs if a track is occupied by more than one
train. With this definition, collisions on switches
and head-to-head collisions can be detected.

To detect rear-end collisions, tracks have to recognize
whether a collision can occur in the next step. This
is done by observing the reserved and occupied
state of the track elements. If a track should become
occupied in the next step, occupiedContract_a becomes
true. If it is also already occupied by another train
(track.state = occupiedBy_b), we have found a possible
collision in the next step.


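\[ \mathit{collision}(\mathit{Track}\ t) := (t.\mathit{occupiedContract}_{1} \land t.\mathit{state} = \mathit{occupiedBy}_{2}) \lor (t.\mathit{occupiedContract}_{2} \land t.\mathit{state} = \mathit{occupiedBy}_{1}) \]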
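
Read as code, the collision specification is a simple predicate per track.
The Python sketch below is illustrative only; the function names and the
tuple encoding of a track are assumptions.

```python
def collision(track_state, occupied_contract_1, occupied_contract_2):
    """True if a train is about to enter a track held by the other train."""
    return (occupied_contract_1 and track_state == "occupiedBy2") or (
        occupied_contract_2 and track_state == "occupiedBy1"
    )


def any_collision(tracks):
    """tracks: iterable of (state, occupied_contract_1, occupied_contract_2)."""
    return any(collision(state, c1, c2) for state, c1, c2 in tracks)
```

A model checker would verify that this predicate never becomes true in any
reachable state, instead of evaluating it on a single state as done here.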


3.5.2 Derailment detection

In this model, derailment due to wrong switch positions
is checked. The two cases which can happen
are i) derailment because a train is moving over a
switch while it is changing its position and ii) derailment
because a train is passing a switch which is in
a wrong position.

The specification ensures that the position of
a switch correlates with the track object which is
occupied, i.e., it checks whether the switch is in the
correct position according to the direction of the train.
In the transformation, we derive the correct position
for a switch from the route table and encode
this within the model. For readability, we refer here
to inRightPos, which evaluates to true if the switch has
the required position. As a second aspect, the
specification checks whether the switch is locked every
time a train is on a track linked to this switch.


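\[ \mathit{isDerailed} := \bigvee_{t \in \mathit{Tracks}} \Big( t.\mathit{isOccupied} \land \bigvee_{s \in \mathit{switch}(t)} \lnot \big( s.\mathit{inRightPos} \land s.\mathit{isLocked} \big) \Big) \]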
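
The derailment check described in the text can be sketched as follows. The
Python code follows the prose (a train on a track requires every linked
switch to be in the right position and locked); since the printed formula
is not fully legible, the exact logical form, the function name, and the
data encoding are assumptions.

```python
def is_derailed(tracks):
    """Return True if some occupied track has an unsafe linked switch.

    tracks: iterable of (is_occupied, switches), where switches is a list
        of (in_right_pos, is_locked) pairs for the switches linked to it
    """
    return any(
        is_occupied
        and any(not (in_right_pos and is_locked)
                for in_right_pos, is_locked in switches)
        for is_occupied, switches in tracks
    )
```

An unoccupied track never contributes to the result, regardless of the
state of its switches.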


3.5.3 Flank protection check 

Especially on sidings, parked wagons can
start rolling unintentionally. To prevent accidents
between those wagons and trains, specific switches
which are not on the route of the train can be
commanded to a position in which they route the wagons
to a track other than the one used by the train. Nearly
the same problem is caused by trains which fail to
stop at red signals.

(a) Train flank protected, (b) Train flank not protected,
(c) Flank protection not available.

Figure 8. Verification of flank protection.


To check this safety guideline, a backward search is
performed every time a train is on a switch.
This backward search starts at the unused
branch of the switch and progresses step by step
along the tracks. If it reaches a switch which is
in the position to guide wagons away from the train,
it stops (8a). If the switch is in the other position,
the search proceeds (8b). If the search reaches
a uniting switch, it continues on both branches
(8c). The specification ensures that the backward
search will never reach a station or main track.

This formula is also derived automatically
from the track layout for each route, i.e., for each
route that is active and all adjacent tracks which are
not directly on the route, a function isFlankProtected(r)
is defined. It returns true if the switch is in the
correct position.


\[ \mathit{isFlankProtected}(\mathit{Route}\ r) := \bigwedge_{s \in \mathit{Switch}} s.\mathit{isFlankProtected}(r) \]
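
The backward search described above can be sketched as a graph traversal.
In the generated model the check is encoded as a formula, so the explicit
search below is only an illustration; the layout encoding (node kinds
"track", "switch", "source", the flag "diverting", and the predecessor
lists "prev") and the function name are assumptions.

```python
def is_flank_protected(start, layout):
    """Search backwards from the unused branch of a switch.

    layout: dict mapping a node name to a dict with
        "kind": "track", "switch", or "source" (station or main track)
        "prev": list of nodes reached when walking further backwards
        "diverting": for switches, True if the switch guides wagons
            away from the train
    Returns False if the search can reach a source, True otherwise.
    """
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        info = layout[node]
        if info["kind"] == "source":
            return False  # wagons could roll in from here
        if info["kind"] == "switch" and info.get("diverting", False):
            continue  # protected on this path, stop searching here
        # A uniting switch simply has two predecessors, so the search
        # continues on both branches.
        stack.extend(info["prev"])
    return True


# Hypothetical layouts: a branch B guarded by one switch S, and a branch
# behind a uniting switch U with two guarding switches S1 and S2.
LAYOUT_A = {
    "B": {"kind": "track", "prev": ["S"]},
    "S": {"kind": "switch", "diverting": True, "prev": ["M"]},
    "M": {"kind": "source", "prev": []},
}
LAYOUT_B = {**LAYOUT_A, "S": {"kind": "switch", "diverting": False, "prev": ["M"]}}
LAYOUT_C = {
    "B": {"kind": "track", "prev": ["U"]},
    "U": {"kind": "switch", "diverting": False, "prev": ["S1", "S2"]},
    "S1": {"kind": "switch", "diverting": True, "prev": ["M"]},
    "S2": {"kind": "switch", "diverting": False, "prev": ["M"]},
    "M": {"kind": "source", "prev": []},
}
```

In LAYOUT_C the branch behind S2 is unprotected, so the overall check
fails even though S1 is in a protecting position.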


4 CASE STUDY BRAINE L’ALLEUD


The validation of the presented modeling methods
was carried out on a real-world case study of the
Belgian train station Braine l’Alleud, taken from
Cappart and Schaus (2016). Braine l’Alleud station
consists of four platforms, twelve switches,
twelve signals, and 18 track segments. From the
given layout, 32 different routes are available.
From this layout, we generated a model with
about 100 state machines, containing representations
of the track elements, switches, routes, etc.
In the following, we present the verification times
of different model checking engines for the
specifications defined in Section 3.5. To provide
information about the computation times and capabilities
of different verification methods, tools implementing
diverse qualitative model checking techniques were
chosen. We checked the specifications on a correct
model, i.e., one for which the specifications hold. In
addition, we injected faults to provoke erroneous
behavior. The failures can be categorized as follows:


• a route does not reserve a needed track

• a route requests a wrong position of a switch

• a route does not request any position of a needed
switch


To determine the verification time for the erroneous
models, we also model checked the specifications
and calculated the mean time for each specification,
without distinguishing the kind of error found (i.e.,
whether a switch or a track was not commanded
correctly). We hope that the evaluation of different
tools may help users who are not familiar with
model checking in choosing a proper tool for their
purpose.


4.1 Experiments with the model checking tools 


To check the modeled interlocking system, the
model checking tools iimc, nuXmv, and aigbmc
were used. The tests were performed on a computer
with an Intel Core i7 (3.2 GHz). The target
language of our prototype is the System Analysis
and Modeling Language (SAML) (Güdemann
and Ortmeier 2010). Using SAML, we can verify
the imported qualitative model with several
state-of-the-art model checking tools (nuXmv, iimc,
UPPAAL), for which the SAML IDE VECS
(Verification Environment for Critical Systems)
provides implemented connectors. We used these
connectors for the experiments with the different
model checking tools.

aigbmc This tool is a bounded model checker
built on top of the AIGER distribution provided
by Biere (2007). Being a bounded model checker
makes it necessarily incomplete (if the system
diameter is unknown), but allows for very efficient
counterexample search, as only the base case is
checked and the induction step is not encoded.

iimc This tool, described by Hassan et al. (2012),
is the evolution of the original IC3 method. It uses
different proof engines (BMC, IC3, FAIR), depending
on the type of property to verify. The tool is one
of the most efficient model checkers for sequential
circuits and allows for multi-threaded verification.

nuXmv This tool is the evolution of NuSMV,
described by Cavada et al. (2014). It supports
infinite state spaces via real-valued variables with an
analysis based on the SMT solver MathSAT5 and
also PDR-style verification using SMT solvers
(Cimatti and Griggio 2012).


4.2 Evaluation of the results 


In the following, we present the results of the
verification experiments. Table 1 presents the
verification run time for the different tools and
specifications. For the correct model, the BMC-based
algorithms are missing since BMC is not capable
of proving the correctness of a system but only of
falsifying specifications. One most important fact
is that all tools that terminated for a given
specification also produced correct verification results
for both the correct and the incorrect model. For the
verification of the correct model, iimc performed
best for all specifications. It took between 120 s
for the collision specification and 608 s for the
flank protection, which is a quite acceptable
computation time. In comparison, the nuXmv model
checker had considerably more difficulties, with about
2400 s for the derailment specification (iimc: 120 s).

Table 1. Verification results of the evaluation. The star
* marks the specifications of the defect model.


Spec            iimc IC3   iimc BMC   nuXmv    AIGBMC
collision       120 s      -          1419 s   -
collision*      45 s       2 s        52 s     51 s
derailment      120 s      -          2402 s   -
derailment*     38 s       3 s        33 s     29 s
flank protect   608 s      -          -        -
flank protect*  223 s      -          437 s    -


For the verification of the incorrect model we
also used the BMC-based tools. The results show
that for formulae of medium complexity, such as the
derailment or the collision specification, the BMC-based
algorithms outperformed the inductive ones (2 s for
iimc BMC compared to 45 s for iimc IC3 for the
collision specification). However, they were not able
to complete their analysis for the flank protection.
This behavior is not a surprise. The
reason is that the BMC algorithm is very fast for
simple specifications and short counterexample
paths (10 to 11 steps for derailment and collision).
However, the algorithm is not well suited for
complex specifications and, moreover, with high
bounds, it was not able to compute the results for
the flank protection specification within 24 h. This
is connected to the algorithm itself since for each
step towards the bound, new formulas are added,
increasing the complexity of the underlying SAT
problem. In contrast, BMC is quite fast for
short counterexamples, whereas the computation of
the inductive invariants is quite complex and has no
direct correlation to the number of steps.

In summary, BMC has proven to be a suitable
method for finding bugs in a system, especially
if the error bound is small. Nevertheless, an IC3
implementation is needed for proving the correctness
of the system and, moreover, even for the discovery
of errors at a deep bound, e.g., the presented
flank protection faults. This means iimc would be
the best tool out of our small collection for the
verification of interlocking systems since it contains
both a fast BMC and a fast IC3 engine.


5 CONCLUSION 


In this paper, we presented an approach for reducing
the complexity of the formal verification of
railroad interlocking systems. To this end, we presented
a method for automatically generating
a formal model from a given track layout and the
corresponding route definitions. Moreover, the
declaration rules for basic safety specifications, i.e.,
derailment, collision avoidance, and flank protection,
were presented.

To validate our method, we generated a model
of a real-world case study of the Belgian train
station Braine l’Alleud. To give an insight
into available verification algorithms and tools,
we verified the specifications with four state-of-the-art
verification tools implementing the currently
most effective algorithms, IC3 and Bounded
Model Checking. In addition to the experiments
where the specifications hold, we also examined
the behavior of the verification for the error case
by injecting failures into the model.

The results show, from our point of view, the
applicability of the transformation and the computability
of the verification. This opens new perspectives
for the analysis of interlocking systems since the
presented target modeling language SAML supports
both qualitative and quantitative verification
methods.
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ABSTRACT: Safety and dependability of technical systems are key aspects when it comes to a system 
quality. Critical infrastructure covers many parts such as mains, transportation systems, networks and
buildings. In our paper, we put an emphasis on a water distribution network where high availability,
reliability and safety are very much demanded. Such a network usually consists of several lines with
different hierarchical importance and age. Moreover, various materials are usually used to manufacture water
pipes. The in-field operation of a water distribution system is not very well recorded since the records usually do
not contain all details about failure occurrences. We have interesting water distribution system recordings
covering a wide time span, but, unfortunately, they provide only the numbers of failures during respective
months. In this paper, we apply a selected form of dynamic linear models, namely a modified Kalman filter with
the implementation of a structural break point.


1 INTRODUCTION 

Critical Infrastructure (CI) is a vital part of each 
country. It includes important elements such as 
energy supply networks and distribution systems 
of, e.g. gas, water, etc. Availability, safety and secu- 
rity of CI elements are key aspects for a country to 
function smoothly. Some CI elements and nodes 
are continuously monitored to watch their condi- 
tion. Failures, which might occur time and again, 
are also recorded. Unfortunately, not all the records 
provide every detail about undesirable event occur- 
rence such as failure. 

Our paper includes an approach to analyse water 
distribution system field data related to failure 
occurrence. Our intention is to introduce the way 
of modelling some dependability measures despite 
having only limited records. During our research, 
we have noticed that there is a certain break point
in the displayed data course. It might be explained
by the gradual improvement of the water distribution
network when old lines are being replaced by
new ones. This results in improving some reliability
measures. We would like to describe this process
analytically and predict how it is going to
develop in the future. For this purpose, we use
a dynamic linear model, the Kalman filter with a
structural break point.


1.1 State of the art 


CI is studied from several perspectives—from the 
conceptual ones to specific applications. A detailed 
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review can be found, for example, in Wilt et al. 
(2016). Here, we mention only a few articles con- 
cerning failures of the critical infrastructure. 

Zio (2007) initiated a change of perspective
towards frameworks capturing properties emerging from
a complex system of critical infrastructure such
as water supply, transportation, information and
other networks. Eusgeld et al. (2009) proposed a
methodological framework for the analysis of critical
infrastructures using topology-driven analysis
of vulnerabilities. The authors applied the proposed
methodology to the Swiss high-voltage grid.
Jaskolka and Villasenor (2017) focused on interactions
in a distribution system and analysed dependencies
in the system. Their approach is based on
concurrent Kleene algebra, which offers sequential
and concurrent compositions. Korkali et al. (2017)
pointed out that the failure mechanisms in the models
provided so far differ from reality, where
interconnections can be present. They compared several
models to understand the impact of network
topology. They concluded that understanding both
benefits and risks of interconnections is crucial to
designing robust and resilient systems.

We would like to focus on water distribution 
systems. These systems are studied with respect to 
several factors such as plans for maintenance, reha- 
bilitation and replacement (M/R/R); application 
to a specific water network; methods and models 
for water mains failures. 

Failures in water networks were studied by the Cemagref
group. They presented methods to support
maintenance of the network based on a survival
analysis model (Le Gat & Eisenbeis 2000). Reliability
of the water distribution system was studied
in (Jung et al. 2016). The authors investigated
sixteen study networks to develop linear reliability
models using linear regression. Failures caused by
external influences are studied, e.g., in Otrisal et al.
(2017). A classification of water distribution systems
was given in (Hwang & Lansey 2017). The authors
used graph theory combined with classification to
determine adequate parameters for describing the
water distribution system.

Evaluation of the efficiency of urban water 
infrastructure and determination of the optimal 
time of rehabilitation is given in (Karamouz et al. 
2017). The study is completed by examination of 
the model in a real water distribution network. 

Failures are a common keyword of all the above
mentioned sources. The hazard function (or failure
intensity) has been studied by many authors,
beginning with Ascher (1970), followed by a unified
approach for both repairable and non-repairable items by
Hokstad (1997), and its modifications by Woch
& Vališ (2017). For more complex (non-reconfigurable)
flow networks, one can use a topological
or algebraic approach (Todinov 2012, Hošková-
Mayerová et al. 2013, Ameri et al. 2016).

However, having a limited though interesting
data set, we turn our attention to statistical
methods. Regarding repairable systems, the
non-homogeneous Poisson process is taken into
account; its properties were studied by
MacLean (1974) or Krivtsov (2007). The data set
is in the form of a time series; therefore, the Kalman
filter is a suitable tool to study the series. The Kalman
filter was introduced by Kalman (1960) and has
been used since then, e.g. in the recent study by
Arthur et al. (2018).


2 ANALYSED DATA DESCRIPTION 


We have a data set which contains the failures of
water lines in three different structures: a
magistral/main line, a local line and a branch line to
a house. However, the failure records cover only the
number of failures during single months and,
unfortunately, no further information is
available. In practice it is rather common to keep
records this way. The data set, however, is quite rich
when we consider how long it took to record the
information: it covers a period of fifteen years.

For the purpose of our further analysis, we convert
the failure records into an event occurrence
rate during the period we are interested in. It is
common to use the occurrence rate per day
or per month. If we wanted to assess how
the seasons affected the process, for example, the
data would be sorted by quarters, but this is not
our case since we are working with the number of
failures per month.

Table 1. Example of water distribution system field
failure records.

Sample nr.  Month/year  Nr. of failures  Failure frequency
1           01/2000     6                0.193548
2           02/2000     12               0.413793
3           03/2000     27               0.870968
4           04/2000     23               0.766666
5           05/2000     18               0.580645
6           06/2000     8                0.266667

Figure 1. Water distribution system field failures in the
form of a failure frequency time series with typical courses
applied for non-homogeneous Poisson processes
(horizontal axis: time [months]).

Table 1 shows an example of how the failures are recorded and converted into the event frequency that we develop further. The data converted into failure frequency are illustrated in Figure 1. Since we treat the data as a Non-Homogeneous Poisson Process (NHPP), a common way to assess them is to fit a power-law model and a log-normal model (Rausand & Hoyland 2004, NIST/SEMATECH 2017), as we can see in Figure 1. These parametric courses are complemented by a non-parametric kernel regression estimate.
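
The conversion in Table 1 can be sketched in a few lines. The tabulated values are consistent with dividing each monthly count by the number of days in that month; this reading is an inference from the table, and the function name `failure_frequency` is hypothetical.

```python
import calendar

def failure_frequency(year, month, failures):
    """Failures per day: monthly count divided by the days in that month.

    Assumption: this is how the 'failure frequency' column of Table 1
    was obtained (it reproduces the tabulated values).
    """
    days_in_month = calendar.monthrange(year, month)[1]
    return failures / days_in_month

# The six records shown in Table 1: (year, month, number of failures)
records = [(2000, 1, 6), (2000, 2, 12), (2000, 3, 27),
           (2000, 4, 23), (2000, 5, 18), (2000, 6, 8)]

for year, month, failures in records:
    freq = failure_frequency(year, month, failures)
    print(f"{month:02d}/{year}  {failures:2d}  {freq:.6f}")
```

Because the divisor changes from month to month, the resulting series is equidistant in months but normalized per day, which is what the time series assumptions in the next section rely on.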


3 THEORETICAL FORM OF 
MATHEMATICAL MODEL 


To analyse the data described above, we use time series modelling. To comply with the usual assumptions for time series (e.g. an equidistant time increment), it is necessary to work with a pseudo-measurement, the failure frequency, for which this principle is observed. Another assumption is a certain degree of correlation in the data, which is checked by the tests performed later.

For data assessment, we apply a generalized dynamic time series model based on the linear Kalman filter (LKF) (Bhar 2010, Bain & Crisan 2009). Let the dynamic process $x_t$ follow a transition equation

$x_t = f(x_{t-1}, w_t)$ (1)

and we also assume that we have a measurement $y_t$ such that

$y_t = h(x_t, u_t)$. (2)

In the above equations (1) and (2), $w_t$ and $u_t$ are two mutually uncorrelated sequences of temporally uncorrelated normal random variables with zero mean and covariance matrices $Q_t$ and $R_t$, respectively. Additionally, $w_t$ is uncorrelated with $x_{t-1}$ and $u_t$ is uncorrelated with $x_t$. The prior process estimate is defined as

$\hat{x}_{t|t-1} = E[x_t \mid y_{1:t-1}]$, (3)

which is the estimate of $x_t$ at time $t-1$, just prior to making the measurement at time $t$. Similarly, we define the posterior estimate as

$\hat{x}_{t|t} = E[x_t \mid y_{1:t}]$, (4)

which is the estimate at time $t$ after the measurement at $t$ has taken place. We also have the corresponding estimation errors $e_{t|t-1} = x_t - \hat{x}_{t|t-1}$ and $e_t = x_t - \hat{x}_{t|t}$. These give us the estimates of the error covariances as

$P_{t|t-1} = E[e_{t|t-1} e_{t|t-1}^T]$, $P_{t|t} = E[e_t e_t^T]$. (5)

In order to compute the above mean and covariance, we need the corresponding conditional densities $p(x_t \mid y_{1:t-1})$ and $p(x_t \mid y_{1:t})$. These are determined iteratively via transition and measurement updates. The basic idea is to define the probability density function corresponding to the hidden state $x_t$ given all the measurements made up to that time, i.e. $y_{1:t}$. The transition step is based on the Chapman-Kolmogorov equation

$p(x_t \mid y_{1:t-1}) = \int p(x_t \mid x_{t-1}, y_{1:t-1})\, p(x_{t-1} \mid y_{1:t-1})\, dx_{t-1} = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})\, dx_{t-1}$, (6)

following the Markov property. The measurement update step is based on Bayes rule

$p(x_t \mid y_{1:t}) = \frac{p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})}$ (7)

and $p(y_t \mid y_{1:t-1}) = \int p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1})\, dx_t$. At this point it is instructive to specialize the transition and the measurement equations (1) and (2) for a linear system and state the updating equation in a form amenable for easier implementation.

Let us focus on a linear state space system with a transition equation of the form

$x_t = T_t x_{t-1} + c_t + w_t$ (8)

and the measurement equation

$y_t = Z_t x_t + d_t + u_t$, (9)

where $c_t$ and $d_t$ are possibly time-dependent vectors of compatible dimensions. Similarly, the matrices $T_t$ and $Z_t$ are of dimensions compatible with the length of the state vector $x_t$ and the measurement vector $y_t$, respectively.
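
For reference, the standard Kalman recursions for the linear system (8)-(9) can be written as below. This is the textbook form rather than a transcription from the cited sources; the innovation $v_t$, its covariance $F_t$ and the gain $K_t$ are symbols introduced here for illustration.

```latex
\begin{align*}
\text{prediction:}\quad
  \hat{x}_{t|t-1} &= T_t\,\hat{x}_{t-1|t-1} + c_t, \\
  P_{t|t-1} &= T_t\,P_{t-1|t-1}\,T_t^{\top} + Q_t, \\[4pt]
\text{innovation:}\quad
  v_t &= y_t - Z_t\,\hat{x}_{t|t-1} - d_t, \\
  F_t &= Z_t\,P_{t|t-1}\,Z_t^{\top} + R_t, \\[4pt]
\text{update:}\quad
  K_t &= P_{t|t-1}\,Z_t^{\top} F_t^{-1}, \\
  \hat{x}_{t|t} &= \hat{x}_{t|t-1} + K_t\,v_t, \\
  P_{t|t} &= (I - K_t Z_t)\,P_{t|t-1}.
\end{align*}
```

With Gaussian noise these recursions give exactly the mean and covariance of the densities in (6) and (7), so no integral has to be evaluated numerically.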

For our purposes, however, we apply two forms 
of dynamic local linear models. 


3.1 Dynamic linear local level model 


The first form of the model, called the dynamic Local Level Model (LLM), has the following form:

observed series: $y_t = \mu_t + \varepsilon_t$, $\varepsilon_t \sim NID(0, \sigma_\varepsilon^2)$

latent level: $\mu_t = \mu_{t-1} + \eta_t$, $\eta_t \sim NID(0, \sigma_\eta^2)$.

Looking at the data plot (see Figure 1), we can notice a negative spike around the year 2002 (after approx. 24 months). Recall that the model just described (local level plus noise) assumes that the variances $\sigma_\varepsilon^2$ and $\sigma_\eta^2$ are constant over time. Therefore, one way to improve the accuracy of this model and take the jump in failure level (around March 2002) into account is to assume that the variance did change at that time. The new LLM therefore becomes:

observed series: $y_t = \mu_t + \varepsilon_t$, $\varepsilon_t \sim NID(0, \sigma_\varepsilon^2)$

latent level: $\mu_t = \mu_{t-1} + \eta_t$, $\eta_t \sim NID(0, \sigma_\eta^2(t))$,

where $\sigma_\eta^2(t) = w$ for $t \neq t_{break}$ and $\sigma_\eta^2(t) = w^*$ for $t = t_{break}$.
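
A minimal sketch of the Kalman filter for this local level model is given below, assuming known variances and a known break time. It is written in plain Python for illustration and is not the R code used for the results; the function name and arguments are hypothetical. Passing `None` as an observation skips the update step, which is one simple way to obtain multi-step predictions.

```python
def local_level_filter(y, var_eps, var_eta, m0, p0):
    """Kalman filter for y_t = mu_t + eps_t, mu_t = mu_{t-1} + eta_t.

    var_eta is either a constant or a function t -> variance, which
    allows a larger level variance at the break time.
    An observation equal to None is treated as missing (prediction only).
    Returns the filtered means and variances of the level mu_t.
    """
    means, variances = [], []
    m, p = m0, p0
    for t, obs in enumerate(y):
        q = var_eta(t) if callable(var_eta) else var_eta
        # Prediction step: the level is a random walk.
        m_pred = m
        p_pred = p + q
        if obs is None:
            m, p = m_pred, p_pred
        else:
            # Update step with Kalman gain k.
            k = p_pred / (p_pred + var_eps)
            m = m_pred + k * (obs - m_pred)
            p = (1.0 - k) * p_pred
        means.append(m)
        variances.append(p)
    return means, variances


# Illustrative call: constant variance w except at a hypothetical break t = 1.
w, w_star = 0.0, 4.0
means, variances = local_level_filter(
    [1.0, 3.0, None, None], var_eps=1.0,
    var_eta=lambda t: w_star if t == 1 else w, m0=0.0, p0=1.0)
print(means)
```

The large variance at the break lets the level jump towards the new observation, and the two missing observations at the end leave the level unchanged, which is why a local level model predicts a constant course.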


3.2 Dynamic linear local level model—seasonal 


The other type of applied dynamic linear local model takes seasonality into account. The dynamic Local Level Model (LLM) described in the previous section is extended by seasonality here and has the following form:

observed series: $y_t = \mu_t + \gamma_t + \varepsilon_t$, $\varepsilon_t \sim NID(0, \sigma_\varepsilon^2)$

latent level with jump: $\mu_t = \mu_{t-1} + \eta_t$, $\eta_t \sim NID(0, \sigma_\eta^2(t))$,

where $\sigma_\eta^2(t) = w$ for $t \neq t_{break}$ and $\sigma_\eta^2(t) = w^*$ for $t = t_{break}$

latent seasonality: $\gamma_t = -\sum_{j=1}^{s-1} \gamma_{t-j} + \omega_t$, $\omega_t \sim NID(0, \sigma_\omega^2)$.

By applying this type of model, we intend to trace the potential influence the seasons could have on the course of the observed time series of failure intensity.
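
The seasonal recursion can be illustrated with a short sketch. It assumes the usual dummy-seasonal form in which each new seasonal effect is minus the sum of the previous s-1 effects plus a disturbance; the function name and the example values are hypothetical.

```python
def seasonal_path(initial, steps, shocks=None):
    """Iterate gamma_t = -(gamma_{t-1} + ... + gamma_{t-s+1}) + omega_t.

    initial: the s-1 most recent seasonal effects, oldest first.
    shocks:  optional list of disturbances omega_t (defaults to zero).
    """
    window = list(initial)
    path = []
    for t in range(steps):
        omega = shocks[t] if shocks is not None else 0.0
        gamma = -sum(window) + omega
        path.append(gamma)
        # Slide the window: drop the oldest effect, keep the newest.
        window = window[1:] + [gamma]
    return path


# With s = 4 and no disturbances the pattern repeats every four steps
# and every four consecutive effects sum to zero.
print(seasonal_path([1.0, 2.0, -3.0], 8))
```

Without disturbances the seasonal pattern is fixed; a non-zero variance of the disturbance lets the pattern drift slowly over time.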

Before we actually start modelling with both types of the defined dynamic models, we have to apply the Chow test sequential algorithm (Bai et al., 1997a, 1997b, 1998, 2003) in order to find the structural break points.


3.3 Optimal break-points estimation 


The foundation for estimating breaks in time series 
regression models was given by Bai (1994) and was 
extended to multiple breaks by Bai (1997a) and 
Bai & Perron (1998). We implement the algorithm 
described in (Bai & Perron, 2003) for simultaneous 
estimation of multiple breakpoints. The distribu- 
tion function used for the confidence intervals for 
the breakpoints is given in (Bai 1997b). The ideas 
behind this implementation are described in Zeileis 
et al. (2003). 

In order to determine the optimal number m of break-points, we use sequential tests and compare the criteria based on the Bayes Information Criterion (BIC) and the Residual Sum of Squares (RSS). The principle is based on the RSS calculation: every additional break-point adds regression coefficients to the model, and the more coefficients there are, the higher the penalization that enters the BIC criterion. The results for the candidate numbers of break-points m are put in Table 2.

The selection of the number of break-points m can also be shown graphically, see Figure 2. Based on the results, we select one break-point in our further research.
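
The idea of choosing a break-point by minimizing the RSS can be sketched for the simplest case of a single shift in the mean. This is a simplified illustration, not the Bai & Perron algorithm for several simultaneous breaks; the function names and the `min_segment` parameter are hypothetical.

```python
def rss(values):
    """Residual sum of squares of a segment around its own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_single_break(series, min_segment=2):
    """Return (break index, total RSS) of the best single mean shift.

    The series is split into series[:b] and series[b:]; each segment is
    fitted by its mean and the split with the smallest total RSS wins.
    Returns None if the series is too short for two segments.
    """
    n = len(series)
    best = None
    for b in range(min_segment, n - min_segment + 1):
        total = rss(series[:b]) + rss(series[b:])
        if best is None or total < best[1]:
            best = (b, total)
    return best


# Toy series with a level shift after the fourth observation.
print(best_single_break([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]))
```

Adding further break-points can only lower the RSS, which is why a penalized criterion such as the BIC is needed to decide how many of them are justified.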


Table 2. Reference criteria for the choice of the number of break-points m.

m     0        1       2       3       4       5
RSS   3.964    2.284   2.719   2.658   2.622   2.636
BIC   -164.6   216.3   212.7   206.4   198.5   187.1

Figure 2. Graphical representation of the selection criteria RSS and BIC for the number of break-points m.

Figure 3. Water distribution system field failures: failure frequency/intensity time series with loess course and its confidence intervals.


4 RESULTS OF SELECTED MODELLING 


In this part, we present step by step the results achieved by using the mathematical tools described in sections 3.1 and 3.2. For modelling and simulation, we use R-Studio (R Core Team 2005).

First, we introduce the approximation of the basic data course, the failure intensity, using a non-parametric kernel estimate (the loess function), see Figure 3.

Before we start the actual modelling, we construct a static model which is later used for determining the initial parameters of the dynamic linear models. Although the software could estimate the initial parameters of the dynamic models itself during modelling and simulation, we prefer to supply our own values, and therefore the static model is made first. This static model is shown in Figure 4.


Figure 4. Construction of static linear model with confidence intervals for observed series (failure intensity); failure rate plotted against time [year].

Figure 5. Analysis of residuals of LLM based on Q-Q plot, Auto Correlation Function (ACF) and p-value of Box-Ljung.

Figure 6. Kalman filter of LLM with confidence intervals for observed series (failure intensity).


4.1 Results of pure LLM 


In the following steps, we deal with the results of the pure LLM. First, we analyse and test the residuals for the required normality. The results are shown in a graph, see Figure 5. In the normality check (here a Q-Q plot), the agreement is decent. However, the ACF and the Box-Ljung p-values show that there might be a certain correlation among the data (events, i.e. failures).
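
The two correlation diagnostics named above can be sketched as follows: the sample autocorrelation at a given lag and the Ljung-Box statistic built from it. The function names are hypothetical and the snippet stops at the statistic; the p-value would follow from comparing Q with a chi-square distribution, which is left out here.

```python
def autocorrelation(x, lag):
    """Sample autocorrelation of the series x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return num / denom


def ljung_box_q(x, max_lag):
    """Ljung-Box statistic Q = n (n + 2) * sum_k rho_k^2 / (n - k)."""
    n = len(x)
    total = 0.0
    for k in range(1, max_lag + 1):
        rho = autocorrelation(x, k)
        total += rho ** 2 / (n - k)
    return n * (n + 2) * total


# Toy residual series: a pure trend is strongly autocorrelated.
residuals = [1.0, 2.0, 3.0, 4.0, 5.0]
print(autocorrelation(residuals, 1), ljung_box_q(residuals, 1))
```

A large Q relative to the chi-square quantile (small p-value) indicates that the residuals are still correlated, which is the situation reported for Figure 5.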

In our next step, we construct the Kalman filter for the LLM. The result is shown in Figure 6, which suggests a few potential break-points. The only break-point we have selected is illustrated in Figure 6 by a dashed vertical line.

For illustration, we also introduce the decomposition of this LLM-type Kalman filter and the observed series into single components, see Figure 7.

Next, we construct the Kalman smoother for our analysed data and the LLM. This is shown in Figure 8, where we can also clearly see the break-point illustrated by a vertical dashed line.

The decompositions of the Kalman smoother for our LLM and the observed series are shown in Figure 9.

After that, we compute the Kalman predictor of the LLM for the observed series (failure intensity). The prediction runs 10% of the series length ahead, which is about 18 months. The result is shown in Figure 10.


Figure 7. Kalman filter of LLM for observed series, decomposed.

Figure 8. Kalman smoother with confidence intervals of LLM for observed series (failure intensity).

Figure 9. Kalman smoother of LLM for observed series (failure intensity), decomposed.

Figure 10. Kalman predictor of LLM with confidence intervals for observed series (failure intensity).

Figure 11. Analysis of residuals of seasonal LLM based on Q-Q plot, Auto Correlation Function (ACF) and p-value of Box-Ljung.

Figure 12. Kalman filter with confidence intervals of seasonal LLM for observed series (failure intensity).


4.2 Results of extended LLM—seasonal 


Next, we deal with the results of the LLM extended by seasonality. First, we analyse and test the residuals for the required normality. Again, the results are shown in a graph, see Figure 11. In the normality check (here a Q-Q plot), the agreement is again decent and might be even better than for the previous model. However, the ACF and the Box-Ljung p-values show that there is a certain correlation among the data (events, i.e. failures).

Next, we construct the Kalman filter for the seasonal LLM. The result is shown in Figure 12, which suggests a few potential break-points. The only break-point we have selected is illustrated in Figure 12 by a dashed vertical line.

The effect of seasonality is also clearly visible in the figure.

For illustration, we also decompose the Kalman filter of the seasonal LLM for the observed series into single components, see Figure 13.

In the subsequent stage, we construct the Kalman smoother for our analysed data and the seasonal LLM. This is shown in Figure 14, where the break-point (a vertical dashed line) and the effect of seasonality are clearly seen.

For illustration purposes, we also introduce the decomposition of the Kalman smoother of the seasonal LLM for the observed series, see Figure 15.

After that, we compute the Kalman predictor of the seasonal LLM for the observed series (failure intensity). The prediction again runs 10% of the series length ahead, which is approximately 18 months. The result is shown in Figure 16.

Figure 13. Kalman filter of seasonal LLM for observed series (failure intensity), decomposed.

Figure 14. Kalman smoother of seasonal LLM with confidence intervals for observed series (failure intensity).

Figure 15. Kalman smoother of seasonal LLM for observed series (failure intensity), decomposed.

Figure 16. Kalman predictor of seasonal LLM for observed series (failure intensity).


5 DISCUSSION 


All the numerical values representing the mean courses of the Kalman filter (KF), the Kalman smoother (KS) and the Kalman predictor (KP) from the plain and seasonal LLM are given in Tables 3 and 4. For comparison purposes, we also include the original observed series of failure intensity. The results show that each model has a specific estimation power for the observed time series; therefore, the resulting courses vary slightly. The great advantage of the applied models is their ability to estimate the course and predict a few steps ahead. However, when predicting future development, the plain LLM yields a constant level.

The above estimates show that, if we follow the Kalman predictor, we can expect about four to nine failures on the observed mains during the next few months. We could also provide a calculation for each separate month. Since the actual records are not available yet (NA), we will compare our results with them once they are.
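
The step from a predicted failure frequency back to an expected number of failures can be sketched as below. It assumes that the frequency is a per-day rate, as the values in Table 1 suggest, and uses the seasonal predictor values listed for 2015 in Table 4; the function name is hypothetical.

```python
import calendar

def expected_failures(year, month, frequency_per_day):
    """Expected failures in a month: daily frequency times days in month."""
    return frequency_per_day * calendar.monthrange(year, month)[1]

# Seasonal Kalman predictor (KPseas) values for 01/15 to 06/15 from Table 4.
predicted = [(2015, 1, 0.2142), (2015, 2, 0.2806), (2015, 3, 0.1372),
             (2015, 4, 0.2000), (2015, 5, 0.1785), (2015, 6, 0.1278)]

for year, month, freq in predicted:
    print(f"{month:02d}/{year}: {expected_failures(year, month, freq):.1f}")
```

Under this assumption the predicted monthly counts fall in a range of a few failures per month, of the same order as the figure quoted in the text.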


Table 3. Numerical values of KF, KS and KP of LLM.

Time    Original series   KF       KS       KP (KF/KS)
01/00   0.1935            0.1935   0.4414   NA
02/00   0.4138            0.304    0.4429   NA
03/00   0.8709            0.4949   0.4446   NA
04/00   0.7667            0.5643   0.4437   NA
09/14   0.0               0.1953   0.2149   NA
10/14   0.2258            0.1976   0.2165   NA
11/14   0.3667            0.2102   0.2180   NA
12/14   0.3226            0.2186   0.2186   NA
01/15   NA                                  0.2186
02/15   NA                                  0.2186
03/15   NA                                  0.2186
04/15   NA                                  0.2186
05/15   NA                                  0.2186
06/15   NA                                  0.2186


Table 4. Numerical values of KF, KS and KP of seasonal LLM.

                          KF       KFseas     KS       KSseas
Time    Original series   (KP)     (KPseas)   (KP)     (KPseas)
01/00   0.1935            0.0161   0.1935     0.4729   0.4124
02/00   0.4138            0.2230   0.4137     0.4765   0.4696
03/00   0.8709            0.4444   0.8709     0.4811   0.5346
04/00   0.7667            0.5289   0.7667     0.4800   0.4667
09/14   0.0               0.2080   0.1466     0.2148   0.1521
10/14   0.2258            0.2079   0.2260     0.2157   0.2289
11/14   0.3667            0.2186   0.2964     0.2166   0.2955
12/14   0.3226            0.2163   0.3375     0.2164   0.3375
01/15   NA                0.2163   0.2142     0.2163   0.2142
02/15   NA                0.2163   0.2806     0.2163   0.2806
03/15   NA                0.2163   0.1372     0.2163   0.1372
04/15   NA                0.2163   0.2000     0.2163   0.2000
05/15   NA                0.2163   0.1785     0.2163   0.1785
06/15   NA                0.2163   0.1278     0.2163   0.1278


5.1 Conclusion


The aim of our article is to show possible ways of appropriate modelling even if only short and insufficient records of device failures are available. We wanted to introduce the modelling of a pseudo-measure and also show that we are able to estimate the possible development of the assessed system behaviour, similarly to Woch & Zielinski (2015). The observed and later transformed data record is interesting because it shows that the observed failure intensity decreases in time. This might indicate a change in the system properties which manifests itself as a change in reliability. In this case, it would be advisable to gradually replace or revitalize all water pipes, thereby restoring their original properties.

Moreover, the prediction estimates enable us to expect a certain number of failures on the observed main pipeline section. This can help with planning the operation system and the technical maintenance system.

In our future work, we are going to apply different approaches and different forms of dynamic models based on the observation of the structural changes of the observed pipeline. Some inspiring results which we are going to use for that have already been introduced in Hasilova & Vališ (2018), Pietrucha-Urbanik et al. (2017 a, b), Pilch et al. (2014), Rojek & Studzinski (2014), Romaniuk (2016), Vališ et al. (2015, 2017 a, b), Vališ & Pokora (2015) and Woch et al. (2015).


ABSTRACT: To ensure a high quality of service to the users in the next generation of smart grids, self-healing capabilities are a crucial feature to be introduced. We show how distributed communication protocols can enrich complex networks with self-healing capabilities; an obvious field of application is infrastructural networks distributing a commodity via a flow, like gas, water or electric power. We consider the case where the presence of redundant links allows the connectivity of the system to be recovered. We then introduce a theoretical framework to calculate the fraction of nodes still served for increasing levels of network damage. Such a framework allows us to analyse the interplay between redundancies and topology, a key point in improving the resilience of networked infrastructures to multiple failures.


1 INTRODUCTION 

The functioning of any advanced society relies on networks that distribute commodities like water, gas and energy. The increasing urbanization and the accelerated growth of the size and number of mega-cities (Facchini et al. 2017, Kennedy et al. 2015) require such networks to be not only robust, i.e. able to withstand natural hazards, random failures or intentional attacks, but also resilient, i.e. able to recover quickly from such hazards. Thus, the key feature to be implemented in network systems is resilience (Francis and Bekera 2014, Ganin et al. 2016), i.e. the ability to recover an acceptable level of service in the face of faults, failures, accidents and attacks. In this paper, we describe a simplified model for networks that are able to increase their resilience through self-healing capabilities, considering the effects of both the topology and the redundancy on the capability of network recovery. After introducing the concept of resilience in sec.2, we introduce in sec.3 a percolative model of self-healing reconnection. In sec.4 we describe the analytic approaches needed to solve the problem at the mean-field level; we compare the model predictions with numerical simulations in sec.5.


2 RESILIENCE 


Resilience is a complex property and is best expressed via the function describing the recovery history of a system (see Fig. 1, upper panel). Loosely speaking, there are three main factors characterizing resilience: the initial after-event state $s^*$, the recovery time $\tau$ and the recovery level $s_\infty$. In the usual resilience cycle, an accident happens at time $t^0$ and the quality of service $s$ drops quickly to a value $s^*$ that depends on the robustness (Cohen and Havlin 2010) and on the reliability (Chaturvedi 2016) of the network; after a recovery time $\tau$ that depends on the restoration plan, service is restored up to a level $s_\infty$ that in some cases can be equal to or even higher than the initial one. In this paper we concentrate on the restored quality of service $s_\infty$ and introduce a self-healing percolation model for characterizing this quantity.

Self-healing is a crucial feature for implementing resilience in the smart networks of the future (Quattrociocchi et al. 2014). An example of already existing self-healing infrastructures is telecommunication networks, which rely on self-healing procedures based on routing protocols to restore traffic using redundant links (Bhandari 1998). This is a typical example of a strategy based on the redundancy in the interconnectivity of a system's components to ensure its continuity; for example, when a hole is punched in a leaf, the remaining vessels are capable of sustaining the extra flow necessary to keep the tissues alive (Katifori, Szöllősi, & Magnasco 2010). On the other hand, self-healing in infrastructural networks should instead be thought of as a constrained mechanism in which only a limited amount of resources is available.


Figure 1. Relations between self-healing and resilience. Upper panel: Cartoon of a recovery function describing the resilience of a system; in this case we plot a metric $s$ measuring the quality of service versus the time $t$. In the picture, an accident happens at time $t^0$ and $s$ drops to a low value that depends on the robustness and on the reliability of the network; after a time $\tau$, recovery plans will have restored the service $s$ up to a level $s_\infty$ that in some cases can be equal to or even higher than the initial one. While the order of the restoration events determines the recovery time $\tau$, any optimal restoration plan will achieve the same level of service $s_\infty$ given the same resources. In this paper we concentrate on the magnitude of $s_\infty$ and not on the recovery time. Lower panel: Cartoon of a self-healing process in a distribution network.


Such a strategy is also common in materials science, where new polymeric compounds are capable of self-healing due to the presence of small amounts of healing agents that get released and activated upon cracking (White et al. 2001, Toohey et al. 2007).

While the security of large-scale infrastructural networks, like long-range, high-voltage electric power networks, is based on redundancy, most local (regional, rural and city level) networks are, due to economic constraints, tree-like objects with few redundant links (Quattrociocchi et al. 2014). To describe and characterize a core feature of the resilience of such local networks, we introduce a new percolation problem that models their capability of recovering connectivity upon random failures. Hence, our model for self-healing networks is inspired by local distribution networks like gas, water and medium/low voltage electric power networks. In real networks, cables and pipes, especially in urban areas, will likely follow the topology of the street networks. Also, tree-like networks allow operators to minimize costs (building physical links in the network requires huge investments) and to account easily for consumption. However, a few redundant links must be present in order to recover the connectivity of the networks in case of accidents. Thus, after a link failure, some redundant links can be activated to recover (at least partially) the functionality of the network. In real infrastructures, such a procedure is often implemented manually, while in the smart networks of the future it should function automatically, possibly by embedding distributed algorithms automating the network recovery (Quattrociocchi et al. 2014). In the lower panel of Fig. 1 we present a cartoon of the self-healing process in real networks.


3 MODEL 


In our scenario we consider network systems distributing some utility; for the sake of simplicity, we consider a single node to be the source of the quantity to be distributed on the network. Examples of such networks are water, gas or oil pipelines and electric power distribution. At each instant of time, the topology of the network distributing the utility (the active tree) is assumed to be a tree; this assumption is partially verified in the above-mentioned systems and, in particular, is mostly verified in the case of electric power distribution (Pagani and Aiello 2011). In fact, such a structure meets the needs of the infrastructure managers, i.e., to measure (for billing purposes) in an easy and precise way how much of a given quantity is served to any single node of the network. Finally, as a further simplification we do not take into account the magnitudes of flows (i.e., all links and sources are assumed to have infinite capacity) but focus on maximizing the connectedness of the system in order to serve as many nodes as possible. Notice that in the case of real flows, such an assumption can be unrealistic, since links and nodes in real networks have limits beyond which they become inoperative.

In order to implement our strategy and its self-healing capabilities, we consider the presence of dormant backup links, i.e., a set of links that can be switched on. Nodes are assumed to be able to communicate with their neighbors by means of a suitable distributed interaction protocol with a limited amount of knowledge: in particular, nodes are supposed to possess information only about the state of neighboring nodes connected either via active or via dormant links.

Figure 2. Example of the healing procedure (panels: initial configuration, line failure, recovered network). (Left panel) In the initial state, the source node (filled square, upper left corner) is able to serve all 16 nodes through the links of the active tree. The 4 dashed lines (green online) represent dormant backup links that can be activated upon failure. The redundancy of the system is p = 4/9 as only 4 of the 9 possible backup links are present. The link marked with an X is the one that is going to fail. (Central panel) A single link failure disconnects all the nodes of a sub-tree; in the example, a sub-tree of 6 nodes (red online) is left isolated from the source, i.e., the system has a damage A = 6. (Right panel) By activating a single dormant backup link, the self-healing protocol has been able to recover connectivity for the whole system, in this case bringing the number of served nodes back to its maximum value of 16. The link that has recovered the connectivity is marked with an R. Notice that in real networks, due to the physics of the flows and to the constraints on link and node capacities, not all possible reconnections may be viable; on the contrary, some could lead to cascading behavior.


When either a node or a link failure occurs, all the nodes below the failure disconnect from the active tree and become unserved. Such unserved nodes can now try to reconnect to the active tree by waking up some dormant backup links through the protocol. Such a process reconstructs a new active tree that can restore the flow totally or partially, i.e. heal the system. Fig. 2 presents a graphical sketch of the healing procedure.

In the following, we indicate with T(G) a tree on the graph G rooted in the node s; we indicate with $R \subset G - T$ the set of redundant links and with N the number of nodes. The set of redundant links is also described by its adjacency matrix $R_{ij}$, which takes the value 1 if a redundant link is present between nodes i and j, and 0 otherwise. Moreover, we describe the damage inflicted on the distribution tree by the matrix $q_{ij}$: $q_{ij} = 0$ if link ij is removed, $q_{ij} = 1$ otherwise. The relevant metric for the effectiveness of a redundancy pattern R given an attack q is the fraction of served nodes FoS, i.e. the number of nodes connected to the root normalized by N.
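
The fraction of served nodes for one given realization can be sketched with a breadth-first search from the source over the surviving tree links and the redundant links. This is a minimal illustration that assumes every redundant link may be switched on and ignores capacities; the function name and argument layout are hypothetical.

```python
from collections import deque

def fraction_served(n, source, tree_edges, failed_edges, redundant_edges):
    """Fraction of the n nodes connected to the source after self-healing.

    tree_edges:      links of the active tree T
    failed_edges:    removed tree links (q_ij = 0)
    redundant_edges: dormant backup links (R_ij = 1), assumed intact
    """
    failed = {frozenset(edge) for edge in failed_edges}
    adjacency = {i: set() for i in range(n)}
    for i, j in tree_edges:
        if frozenset((i, j)) not in failed:
            adjacency[i].add(j)
            adjacency[j].add(i)
    for i, j in redundant_edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    # Breadth-first search from the source node.
    served = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in served:
                served.add(neighbour)
                queue.append(neighbour)
    return len(served) / n


# Path 0-1-2-3 with source 0; link (1, 2) fails.
tree = [(0, 1), (1, 2), (2, 3)]
print(fraction_served(4, 0, tree, [(1, 2)], []))        # no backup link
print(fraction_served(4, 0, tree, [(1, 2)], [(0, 3)]))  # one backup link
```

In the toy example a single backup link between the source and the far end of the path restores service to all nodes, which mirrors the situation drawn in Fig. 2.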


4 METHODS 


Let us first notice that a node i is connected to a node j if and only if a message from node i can reach node j. To solve such a problem, we use a cavity-based approach for message passing (Mezard and Montanari 2009).

First of all, let us define the set S(i) as the set of nodes which are sons of node i on the original tree T (i.e. not considering the redundant edges): $S(i) = \{k : k \text{ is a son of } i\}$. Moreover, we call F(i) the unique father of node i. Finally, we define $R(i) = \{j : R_{ij} = 1\}$, the set of nodes connected to i via redundant links; obviously, we have $F(i) \notin R(i)$ and $S(i) \cap R(i) = \emptyset$. Then, let us consider a node i and a node $j \in S(i)$ which is one of the sons of node i on the original tree.

We introduce the following quantities: 


• $d_{i \to j}$ is the probability that node i is connected to the root s when the son-node $j \in S(i)$ is absent from the tree.

• $u_{i \to j}$ is the probability that node i is connected to the root s when the father-node $j = F(i)$ is absent from the tree.

• $r_{i \to j}$ is the probability that node i is connected to the root s when i and j are connected by a redundant edge and j is absent from the tree.


We first derive the recursive equation for $d_{i \to j}$, $j \in S(i)$. When j is absent from the graph, i is connected to the root s if at least one of these possibilities is realized:

1. its father F(i) is connected to the root. The probability that such an event does NOT happen is

$\pi_1 = 1 - q_{iF(i)}\, d_{F(i) \to i}$

2. one of its sons S(i), except j, is connected to the root. The probability that such an event does NOT happen is

$\pi_2 = \prod_{k \in S(i) \setminus j} (1 - q_{ik}\, u_{k \to i})$

3. one of the neighbours connected to i via a redundant link is connected to the root when i is absent. The probability that such an event does NOT happen is

$\pi_3 = \prod_{k=1}^{N} (1 - R_{ik}\, r_{k \to i})$

The total probability that i is connected to the root when its son j is absent is thus given by $1 - \pi_1 \times \pi_2 \times \pi_3$, i.e.

$d_{i \to j} = 1 - (1 - q_{iF(i)}\, d_{F(i) \to i}) \times \prod_{k \in S(i) \setminus j} (1 - q_{ik}\, u_{k \to i}) \prod_{m=1}^{N} (1 - R_{im}\, r_{m \to i})$ (1)

Following the same procedure we can write the equation for $u_{i \to j}$:

$u_{i \to j} = 1 - \prod_{k \in S(i)} (1 - q_{ik}\, u_{k \to i}) \times \prod_{m=1}^{N} (1 - R_{im}\, r_{m \to i})$ (2)

Finally, the equation for $r_{i \to j}$ is:

$r_{i \to j} = 1 - (1 - q_{iF(i)}\, d_{F(i) \to i}) \times \prod_{k \in S(i)} (1 - q_{ik}\, u_{k \to i}) \prod_{m=1, m \neq j}^{N} (1 - R_{im}\, r_{m \to i})$ (3)

Equations (1-3) constitute the self consistent 
equations of the problem. They are valid for any 
given realization of the random tree and redun- 
dant links. Equations (1-3) can be considered as 
describing messages running on the edges of the 
tree and on redundant edges, and they can be 
solved by simple iteration. 
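
The iteration can be sketched as follows. The sketch relies on the observation that Eqs. (1-3) share one structure: the message from i to a neighbour j is one minus the product, over all other neighbours k of i, of the probability that k does not connect i to the root. It therefore stores a single message per directed neighbour pair instead of three named families; this simplification, the zero initialization, the fixed message 1 from the root and all names are assumptions made for illustration.

```python
def served_probabilities(n, source, tree_edges, q, redundant_edges, sweeps=50):
    """Iterate the message-passing equations and return p_i for every node.

    q maps a tree edge (i, j) to q_ij (1 = intact, 0 = removed);
    edges missing from q are taken as intact. Redundant links have R_ij = 1.
    """
    weight = {i: {} for i in range(n)}
    for i, j in tree_edges:
        w = q.get((i, j), q.get((j, i), 1.0))
        weight[i][j] = w
        weight[j][i] = w
    for i, j in redundant_edges:
        weight[i][j] = 1.0
        weight[j][i] = 1.0
    # msg[(i, j)]: probability that i reaches the root when j is absent.
    msg = {(i, j): (1.0 if i == source else 0.0)
           for i in weight for j in weight[i]}
    for _ in range(sweeps):
        new_msg = {}
        for (i, j) in msg:
            if i == source:
                new_msg[(i, j)] = 1.0
                continue
            prod = 1.0
            for k, w in weight[i].items():
                if k != j:
                    prod *= 1.0 - w * msg[(k, i)]
            new_msg[(i, j)] = 1.0 - prod
        msg = new_msg
    probabilities = []
    for i in range(n):
        if i == source:
            probabilities.append(1.0)
            continue
        prod = 1.0
        for k, w in weight[i].items():
            prod *= 1.0 - w * msg[(k, i)]
        probabilities.append(1.0 - prod)
    return probabilities


tree = [(0, 1), (1, 2), (2, 3)]
p = served_probabilities(4, 0, tree, {(1, 2): 0.0}, [(0, 3)])
print(p, sum(p) / len(p))  # per-node probabilities and their average (FoS)
```

For a fixed realization with 0/1 entries the messages are themselves 0 or 1, so the result can be checked against a direct connectivity search; the number of sweeps must be at least of the order of the longest path in the network.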

Once a solution to Eqs. (1-3) has been found, we can compute the total probability $p_i$ that node i is connected to the root

$p_i = 1 - (1 - q_{iF(i)}\, d_{F(i) \to i}) \times \prod_{k \in S(i)} (1 - q_{ik}\, u_{k \to i}) \prod_{m=1}^{N} (1 - R_{im}\, r_{m \to i})$ (4)

Disregarding correlations, we can estimate the average fraction of served nodes FoS as

$FoS = \frac{1}{N} \sum_{i=1}^{N} p_i$ (5)


5 RESULTS 


We now compare the results of our eqs. (1-3) to 
the numerical simulation of our self-healing pro- 
cedure. In particular, we are considering the case 
of networks generated by random trees at which 
a fraction a of random recovery links are added. 
After an initial fraction f of links in the tree is 
deleted at random, recovery links are activated 
whenever they reduce the number of connected 
components. Finally, the FoS is calculated by 
checking the fraction of nodes connected to the 
origin via the surviving links plus the redundant. 
Notice that when averaging eq. (1-3) to solve the 
mean-field equations, we are using (g,)= f and 
(R,,)=a@l(N -1). 

First, we perform simulations on random trees on a complete graph of 1000 nodes. The random trees are generated according to a flat-sampling procedure in the space of possible trees (Broder 1989, Aldous 1990, Wilson 1996). Edges of the tree $T$ are removed at random; the fraction of removed edges is indicated as $f = \frac{1}{|T|} \sum_i q_i$, where $|T| = N - 1$. We parametrize with $\alpha$ the population of redundant links existing between two nodes $i$ and $j$ where the link $(i,j)$ does not belong to $T$; thus, the $R_{ij}$ are random variables on $\{0,1\}$ where $R_{ij} = 1$ with probability $\alpha/(N-1)$. In each simulation, a random tree is generated, a random vertex is assumed to be the source, a fraction $\alpha$ of redundant links is added to the tree and a fraction $f$ of links is erased at random from the tree; then the fraction FoS of sites connected to the root (via links either in $T$ or in $R$) is calculated. The results for the average FoS are presented in Fig. 3.
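The simulation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the uniform random tree is drawn by decoding a random Prüfer sequence (one standard way to sample labeled trees uniformly, used here in place of the cited samplers), each tree edge is removed independently with probability f, and the self-healing step is replaced by its end result, namely the connected component of the root in the union of surviving tree edges and redundant links. The function names are hypothetical.

```python
import random


def random_labeled_tree(n, rng):
    """Uniform random labeled tree on nodes 0..n-1, decoded from a random Pruefer sequence."""
    if n == 1:
        return []
    if n == 2:
        return [(0, 1)]
    prufer = [rng.randrange(n) for _ in range(n - 2)]
    degree = [1] * n
    for v in prufer:
        degree[v] += 1
    edges = []
    for v in prufer:
        # attach the smallest current leaf to the next sequence element
        leaf = min(u for u in range(n) if degree[u] == 1)
        edges.append((leaf, v))
        degree[leaf] -= 1
        degree[v] -= 1
    u, w = [x for x in range(n) if degree[x] == 1]
    edges.append((u, w))
    return edges


def fraction_of_served_nodes(n, f, alpha, rng):
    """One realization of the FoS for failure probability f and redundancy alpha."""
    tree = random_labeled_tree(n, rng)
    root = rng.randrange(n)
    # assumption: each tree edge fails independently with probability f
    surviving = [e for e in tree if rng.random() >= f]
    # each non-tree pair carries a redundant link with probability alpha / (n - 1)
    tree_set = {frozenset(e) for e in tree}
    p = alpha / (n - 1)
    redundant = [(i, j) for i in range(n) for j in range(i + 1, n)
                 if frozenset((i, j)) not in tree_set and rng.random() < p]

    # union-find over surviving and redundant links
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in surviving + redundant:
        parent[find(i)] = find(j)
    r = find(root)
    return sum(find(v) == r for v in range(n)) / n
```

Averaging `fraction_of_served_nodes` over many realizations for each pair (f, alpha) gives the kind of curve that the text compares with the analytical approximation.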

We then consider the average behavior of Eqs. 
(1-3) over the randomness in the model, i.e. over 
the possible realizations of the gj and Cj. Perform- 
ing the average we obtain the following self con- 
sistent equations: 


d=1-[1-(1- En a 

a P(k)[1-(1- f)ujt? (6) 
u=] e” P(k)[1-(1- f)u]t 

(k) 


Figure 3. Comparison of simulations and analytical approximation. Depicted are the curves for the fraction of served nodes FoS (i.e. nodes connected to the origin after self-healing) versus the initial fraction of failed links $f$. Crosses correspond to average values of the FoS obtained by simulating random trees of 1000 nodes for different levels of redundancy $\alpha = 0.2$ (red X), $\alpha = 0.4$ (green X) and $\alpha = 0.6$ (blue X); the size of the symbols is of the order of thrice the error bars. As expected, curves are monotonically decreasing with $f$ and monotonically increasing with $\alpha$. Full lines correspond to the predictions of our analytical approximation, eqs. (7).


where P(k) is the degree distribution (i.e. the probability of having k neighbours) of nodes in the tree which are not leaves. In Fig. 3 we compare the theoretically predicted FoS obtained by averaging eq. (5) with the results of numerical simulations.


6 CONCLUSION 


In this paper we have discussed the recovery of networks under a minimal self-healing procedure that exploits the presence of redundant edges to recover the connectivity of the system. Our scenario is inspired by real-world distribution networks that are, often for economic reasons, tree-like and at the same time are often provided with alternative backup links that can be activated in case of malfunction; this is the case, for example, for low-voltage distribution networks (ENEL 2011).

Our model, albeit schematic, is realistic in the sense that it could readily be implemented with current technologies. In fact, routing protocols represent a vast available source of distributed algorithms able to maintain the connectivity of a system. Therefore, our scheme could be implemented by coupling an ICT network to current infrastructures. Our case is an example in which interdependencies enhance the resilience instead of introducing catastrophic breakdowns (Buldyrev et al. 2010). However, since we assume that link capacities are infinite, our model applies to cases where the network is not stressed. For real flow networks, the physics of the flows and their constraints can forbid some of the possible reconnections that, in the worst cases, could even lead to cascading failures like the ones observed in model power grids (Pahwa et al. 2014). Moreover, we are considering the case of a single source; in the case of multiple sources, leaving the system disconnected (islanding) could even improve its robustness (Mureddu et al. 2016). Notice that our model predicts only the final level of service of a network; timescales are therefore an important element that should be introduced to allow for a characterization of the full resilience curve of the system.

In this paper we have introduced the cavity equations describing our model and compared an analytical approximation for the average values of the connectivity under random failures with numerical simulations. We find a promising agreement of this approximation with the numerical results, opening the field for future investigations.

The first direction to be investigated is the study of our approximation when both the fraction of failures $f$ and the fraction of redundant links $\alpha$ are small. In fact, in real systems the number of concurrent failures is small: this is the rationale behind the N - 1 criterion in engineering. In the same vein, for economic reasons the fraction $\alpha$ of redundant links is also bound to be small: hence, our approximation is valid for small $f$ and $\alpha$, where analytical expressions could be linearly expanded. On the other hand, if failures are not independent but happen in a correlated and perhaps catastrophic way, as in cascading events (Pahwa et al. 2014), the approximation must be enhanced to hold for the entire $f$ range; work in this direction is in progress.

Most importantly, the cavity equations (1-3) can be applied to single networks with a given topology and set of redundant links to calculate the FoS as an alternative to numerical simulations. Having a set of closed equations then makes it easy to analyze scenarios in which the set of redundant links is varied: coupling our approach with the introduction of a cost function for the links is of importance for the design of networks, since it would allow the redundancy to be optimized. As an example, numerical simulations on planar topologies suggest that a very effective strategy to strengthen planar networks is to add long-range links (Quattrociocchi et al. 2014): since such links are overly expensive in networks like electric distribution, the feasibility of such a strategy depends on a cost-benefit analysis of the implementation of physical long-range links in PNJs. A further direction of study would be to consider the effects of more detailed structural characteristics on the dynamics of the system (D’Agostino et al. 2012). However, it is important to remember that in optimizing the system the cost of the links is as important as the increase in resilience of the system. In fact, a simple constraint like keeping the number of redundant links fixed would lead to very unrealistic topologies in which the source is at the center of a star regardless of the length of the links (Quattrociocchi et al. 2014).
A new hybrid Bayesian network approach for modeling reliability

Grettia, Ifsttar, University of Paris Est, France


ABSTRACT: In this paper, a hybrid discrete-continuous Graphical Duration Model is proposed. Since the interest of the Weibull density was demonstrated for reliability analysis, this paper focuses on the use of Weibull densities for modeling sojourn times in each state of the system. This extension of the standard Graphical Duration Model (GDM) requires a specific structure that we call Weibull-Hybrid Graphical Duration Models (W-HGDM). The main contribution of this study lies in the proposal of a specific inference algorithm for such hybrid networks. Finally, the reliability estimates obtained with the standard GDM and with the W-HGDM are compared.


1 INTRODUCTION 

Reliability analysis is an integral part of system design and operation, especially for systems performing critical applications.

A wide range of works about reliability analysis is available in the literature.

Most of the time, the system failure is caused by the failure of one or more components. In this case, it is possible to use statistical distributions such as the exponential distribution to model the lifetime. The Weibull distribution (Weibull 1951) makes it possible to describe each phase of the bathtub curve. In (Bertholon 2001) a new modeling of aging is proposed.

These distributions do not make it possible to focus on the dynamics of degradation. They cannot be applied to a system that is repaired when it fails. They cannot be used to assess the expected number of failures during the warranty period, to maintain a minimum mission reliability, or to determine when to replace or overhaul a system. These methods are often called “classical”.

Dynamic models explicitly take into account 
the temporal aspect modeling the evolution of the 
degradation of the system over time by stochastic 
methods (Cocozza-Thivent 1997). 

Rather than considering the different compo- 
nents of the system, it is also possible to consider 
the whole system. Several methods are commonly 
used to model the different states of a dynamic 
system, in order to analyze its reliability, such as 
Markov Chain, Petri net, or Bayesian Network 
(Demri 2009). 

Recent works have shown the interest of using Bayesian Networks (BN) (Jensen 1996) in the field of reliability. For example, Boudali & Dugan (2005) show how to model the reliability of a complex system using Bayesian networks.

Weber and Jouffe (2003) explain how to use dynamic Bayesian networks (DBN) (Murphy 2002) to study the reliability of a multi-state system that depends on a certain context.

However, DBN assume that the sojourn times in each state are exponentially distributed, whereas most industrial applications exhibit non-Markovian behaviors. In these cases, modeling the degradation as a Markovian process can introduce non-negligible biases.

An original Bayesian Network structure, named Graphical Duration Models (GDM), was therefore proposed in order to fit systems whose sojourn times in each state are not necessarily exponentially distributed (Donat, Leray, Bouillaut, & Aknin 2010). A GDM is characterized by a duration variable, allowing the use of any kind of distribution for modeling the sojourn time in each state of the considered system.

However, the complexity of a GDM is directly related to the size of the discrete space of the sojourn-time variable, which can induce technical problems in terms of storage capacity and computation time. A solution could be to consider a continuous duration variable.

In the theory of Bayesian Networks, there are 
several approaches that contain both continuous 
and discrete variables. 

In hybrid Bayesian networks, where both dis- 
crete and continuous variables appear simulta- 
neously, it is possible to apply inference schemes 
similar to those for discrete variables. The first 
model that allowed exact inference in hybrid net- 
works was based on the Conditional Gaussian 
(CG) distribution (Lauritzen 1992). 


The restriction on discrete variables with continuous parents in CG may also be partially lifted using the logit or probit function, generalized in the multinomial case by the softmax function (Murphy 1999).

Another way to lift the restriction on discrete variables with continuous parents is to use a mixture of exponentials (Koller, Lerner, & Angelov 1999), but the inference is then approximate.

The Mixture of Truncated Exponential model 
has been introduced in (Moral, Rumi, & Salmerón 
2001). The advantage with respect to CG is that 
discrete nodes with continuous parents are allowed 
and inference can be exact (Cobb & Shenoy 2006). 

Given that hybrid approaches thus exist in the 
Bayesian Network framework, such a perspec- 
tive could be envisaged in a Graphical Duration 
Model. 

According to expert feedback, sojourn-time 
variables follow a Weibull distribution in many 
systems (Weibull 1951). 

Our goal is to integrate sojourn-time variables 
following a Weibull distribution in a graphical 
duration model by proposing a new approach. 

First, this paper briefly describes the formalisms of Bayesian Networks and of Graphical Duration Models. We then introduce our proposed formalism, named Weibull-Hybrid Graphical Duration Models (W-HGDM), and the associated inference algorithm.

Then a toy system is introduced. Before some conclusions and prospects, the reliability analysis results obtained from the Graphical Duration Model and the Weibull-Hybrid Graphical Duration Model approaches are compared.


2 INTRODUCTION OF THE FORMALISM 


2.1 Bayesian Network 


Bayesian Networks (Jensen 1996) are mathematical tools relying on probability theory and graph theory. They make it possible to represent uncertain knowledge both qualitatively and quantitatively. Bayesian Networks are Probabilistic Graphical Models that intuitively represent the distribution of a set of random variables $X = (X_1, \ldots, X_N)$. Basically, a BN is defined as a pair $M = (G, (p_n)_{1 \le n \le N})$. $G = (X, E)$ is a Directed Acyclic Graph in which each node $n$ is associated with a random variable $X_n$ that takes its values in a finite set $\mathcal{X}_n$, and in which each directed arc $(i,j) \in E$ represents a dependency between the random variables $X_i$ and $X_j$. $(p_n)_{1 \le n \le N}$ is a set of Conditional Probability Distributions (CPD) such that each $p_n$ denotes the conditional probability distribution associated with the random variable $X_n$ given its parents $X_{pa_n}$, $pa_n$ referring to the sequence of parent indices of the random variable $X_n$ in $G$.

The conditional independence relationships introduced by the arcs of the graph make it possible to factor the joint probability distribution of the set of random variables $X$ as follows:

$$P(X) = P(X_1, \ldots, X_N) = \prod_{n=1}^{N} P(X_n \mid X_{pa_n}) \qquad (1)$$
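As a small illustration of the factorization (1), the sketch below builds a hypothetical three-node chain A -> B -> C with binary variables and made-up CPD values, evaluates the joint probability as a product of CPD entries, and obtains a marginal by brute-force summation. The network, the numbers and the function names are assumptions chosen for the example only; they do not come from the paper.

```python
from itertools import product

# Hypothetical chain A -> B -> C; each CPD maps a tuple of parent values
# to the distribution [P(value = 0), P(value = 1)].
parents = {"A": (), "B": ("A",), "C": ("B",)}
cpd = {
    "A": {(): [0.7, 0.3]},
    "B": {(0,): [0.9, 0.1], (1,): [0.4, 0.6]},
    "C": {(0,): [0.8, 0.2], (1,): [0.5, 0.5]},
}
order = ["A", "B", "C"]


def joint_probability(assignment):
    """P(X) as the product over nodes of P(X_n | parents of X_n)."""
    prob = 1.0
    for name in order:
        pa_values = tuple(assignment[p] for p in parents[name])
        prob *= cpd[name][pa_values][assignment[name]]
    return prob


def marginal(name, value):
    """Marginal P(name = value) by summing the joint over all assignments."""
    total = 0.0
    for values in product([0, 1], repeat=len(order)):
        assignment = dict(zip(order, values))
        if assignment[name] == value:
            total += joint_probability(assignment)
    return total
```

Brute-force summation is exponential in the number of variables; inference algorithms such as junction trees or variable elimination exist precisely to avoid it.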


Besides, tools have been developed to automati- 
cally learn the structure and the parameters of the 
graph and those of the CPD from complete or 
incomplete data or if a priori knowledge is avail- 
able (e.g. expert opinion) (Neapolitan 2003). 

Using BN is particularly interesting because of 
the possibility to propagate knowledge through the 
network. Indeed, various inference algorithms can 
be used to compute marginal probabilities of the 
system variables. One of the most classical infer- 
ence procedures relies on the use of a junction tree 
(Lauritzen & Spiegelhalter 1988). Nevertheless, in 
our experiments, we use the elimination algorithm 
(Dechter 1999). This choice is motivated by the 
simplicity and the efficiency of the method. 


2.2 Dynamic Bayesian Network 


Inspired by the formalism of classical BN, the Dynamic Bayesian Networks (DBN) framework (Murphy 2002) made it possible to unify many approaches to time-series modeling, such as hidden Markov models. A DBN aims to model the probability distribution of a set of random variables $(X_t)_{1 \le t \le T} = (X_{1,t}, \ldots, X_{N,t})_{1 \le t \le T}$. It consists of a pair of Bayesian Networks $(M_1, M_\rightarrow)$. $M_1$ defines the prior distribution $P(X_1) = P(X_{1,1}, \ldots, X_{N,1})$ as in (1). $M_\rightarrow$ defines the transition model, which describes the dependencies between variables in slice $t-1$ and variables in slice $t$, i.e. the distribution of $X_t \mid X_{t-1}$:

$$P(X_t \mid X_{t-1}) = P(X_{1,t}, \ldots, X_{N,t} \mid X_{1,t-1}, \ldots, X_{N,t-1}) = \prod_{n=1}^{N} P(X_{n,t} \mid X_{pa_{n,t}}) \qquad (2)$$

where $pa_{n,t}$ refers to the sequence of parent indices of the random variable $X_{n,t}$ in the graph of $M_\rightarrow$.

A Dynamic Bayesian Network is actually a 
static Bayesian Network, that is repeated several 
times. So DBN inherit the convenient properties 
of static BN, particularly with regard to learning 
and inference. 

The state of the system at the future time $t+1$, $X_{t+1}$, is determined by the system state at the current time $t$, $X_t$, and does not depend on the states at the earlier time instants $1, \ldots, t-1$; moreover, the conditional probability distributions of $M_\rightarrow$ do not depend on the slice $t$. As a consequence, the distribution of the time spent in each state is geometric. Most industrial applications do not exhibit this behavior, so such a modeling can introduce non-negligible biases in the estimations.


2.3 Graphical Duration Model 


A Graphical Duration Model (Donat, Leray, Bouillaut, & Aknin 2010) relies on the two following variables: the system state $X_t$ and the duration variable $S_t$ describing the time spent in any system state (remaining sojourn time) (Fig. 1).

Firstly, the Conditional Probability Distribution associated with the initial system state is defined as follows, over the discrete and finite domain $\Omega_X = \{1, \ldots, N_X\}$:

$$P(X_1 = i) = p_1(i) \qquad (3)$$

The initial sojourn-time CPD gives the distributions for each initial state. This CPD is defined over the discrete and finite domain $\Omega_S = \{1, \ldots, N_S\}$:

$$P(S_1 = k \mid X_1 = i) = F_1(i, k) \qquad (4)$$

Then, it is necessary to define the system state and the sojourn-time transition CPDs. A transition occurs if and only if $S_{t-1} = 1$:

$$P(X_t = j \mid X_{t-1} = i, S_{t-1} = 1) = Q^{sys}(i, j) \qquad (5)$$

where $Q^{sys}$ is an $N_X \times N_X$ matrix, called the static system transition matrix.

A new sojourn time is selected according to the following CPD:

$$P(S_t = k \mid X_t = i, S_{t-1} = 1) = F^{sys}(i, k) \qquad (6)$$

where $F^{sys}$ is an $N_X \times N_S$ matrix.


Figure 1. Discrete duration graphical model.

While there is no transition, the system deterministically remains in the previous state $i$:

$$P(X_t = j \mid X_{t-1} = i, S_{t-1} \ge 2) = I(i, j) \qquad (7)$$

and the sojourn time in the current state is decreased deterministically by one unit:

$$P(S_t = k \mid X_t = i, S_{t-1} = k' \ge 2) = \begin{cases} 1 & \text{if } k = k' - 1 \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$

2.4 Reliability computation 


Let us assume that the set of system states $\Omega_X$ is partitioned into two sets $U$ and $D$, respectively for “up” states and for “down” states (i.e. OK and failure situations).

The discrete-time system reliability is defined as the function $R: \mathbb{N}^* \to [0,1]$ where $R(t)$ represents the probability that the system has always stayed in an up state until moment $t$, i.e.

$$R(t) = P(X_1 \in U, \ldots, X_t \in U).$$

In addition, it is possible to derive some interesting metrics such as the failure rate or the MTTF (Pham 2006) from the reliability definition.

Hence, this issue boils down to an inference problem, i.e. to the computation of $P(X_1 \in U, \ldots, X_t \in U)$.

The system cannot repair itself: the failure state is absorbing. The reliability is then equal to the availability, i.e.

$$R(t) = P(X_t \in U) = 1 - P(X_t \in D).$$

The reliability computation is then reduced to the computation of the marginal distribution of $X_t$.
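For the discrete GDM, this marginal can be obtained by propagating the joint distribution of the state and the remaining sojourn time from one slice to the next. The sketch below is one possible way to do so, written as an illustration rather than as the elimination algorithm used by the authors. The argument names `p1`, `F1`, `Q` and `F` are hypothetical stand-ins for the initial state distribution, the initial sojourn-time CPD, the state transition matrix and the sojourn-time CPD, and the result equals the reliability only when the down states are absorbing.

```python
def gdm_reliability(p1, F1, Q, F, up_states, horizon):
    """Return [R(1), ..., R(horizon)] for a discrete GDM with absorbing down states.

    p1[i]    : P(X_1 = i)
    F1[i][k] : P(S_1 = k + 1 | X_1 = i)
    Q[i][j]  : P(X_t = j | X_{t-1} = i, S_{t-1} = 1)
    F[j][k]  : P(S_t = k + 1 | X_t = j, S_{t-1} = 1)
    """
    n_x, n_s = len(F), len(F[0])
    # joint[i][k] = P(X_t = i, S_t = k + 1)
    joint = [[p1[i] * F1[i][k] for k in range(n_s)] for i in range(n_x)]
    reliability = []
    for _ in range(horizon):
        reliability.append(sum(sum(joint[i]) for i in up_states))
        nxt = [[0.0] * n_s for _ in range(n_x)]
        for i in range(n_x):
            # remaining sojourn time >= 2: stay in state i, count down by one
            for k in range(1, n_s):
                nxt[i][k - 1] += joint[i][k]
            # remaining sojourn time = 1: jump according to Q, redraw sojourn from F
            for j in range(n_x):
                mass = joint[i][0] * Q[i][j]
                if mass:
                    for k in range(n_s):
                        nxt[j][k] += mass * F[j][k]
        joint = nxt
    return reliability
```

Each slice costs on the order of N_X times N_S operations for the countdown plus the transition terms, which makes the dependence of the inference time on N_S explicit.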


2.5 Limitation of GDM 


The larger $N_S$ is, the more accurate the representation of the sojourn-time distribution. On the other hand, choosing too large a value of $N_S$ has immediate consequences on both the space needed to store the Conditional Probability Distributions, i.e. $N_X \cdot N_S$ values, and the time complexity of inference.

This can induce technical problems in terms of storage capacity and computation time. A solution could be to consider a continuous duration variable. With $c$ the number of parameters needed by the chosen distribution, the CPD consists of $N_X \cdot c$ values, so the space needed to store the continuous CPD is smaller than in the discrete case (as long as $c < N_S$).
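To give an order of magnitude, take the sizes used later in the comparison of Section 4.2: three system states, a discretization with $N_S = 150$, and a Weibull distribution, which needs $c = 2$ parameters (scale and shape). Under these assumptions the sojourn-time CPD requires:

```latex
\begin{align*}
\text{discrete GDM:} \quad & N_X \cdot N_S = 3 \times 150 = 450 \ \text{values}, \\
\text{continuous (Weibull):} \quad & N_X \cdot c = 3 \times 2 = 6 \ \text{values}.
\end{align*}
```

The gap widens linearly with $N_S$, since the continuous representation does not depend on the resolution of the time axis.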

Since the interest of the Weibull density was demonstrated for reliability analysis (Weibull 1951), this paper focuses on the use of Weibull densities for modeling sojourn times in each state of the system.


3 WEIBULL-HYBRID GRAPHICAL 
DURATION MODELS 


In the following paragraphs, we propose a new 
formalism to represent the evolution of a dynamic 
system over time, in order to estimate its reliabil- 
ity. This particular model is called Weibull-Hybrid 
Graphical Duration Models (W-HGDM). In the 
following, we will describe its conditional prob- 
ability distributions, and an inference algorithm in 
order to compute the reliability. 


3.1 Generalities 


The collection $(X_t)_{1 \le t \le T}$ represents the system state over a sequence of length $T$. The collection $(S_t)_{1 \le t \le T}$ represents the remaining time before a system state modification. More precisely, the random variable $S_t$ is the remaining sojourn time in the current system state. These variables are called duration variables or sojourn-time variables. A Weibull-Hybrid Graphical Duration Model is illustrated in Figure 2.

Expressions (3), (4), (5), (6), (7) and (8) become 
expressions (9), (10), (12), (13), (14) and (15). 

The Conditional Probability Distribution associated with the initial system state is defined as follows, over the discrete and finite domain $\Omega_X = \{1, \ldots, N_X\}$:

$$P(X_1 = i) = p_1(i) \qquad (9)$$

The initial sojourn-time CPD gives the distribution for each initial state; it is defined over the continuous domain $\Omega_S = \,]0, +\infty[$:

$$f(S_1 = s \mid X_1 = i) = f_{\alpha_i, \beta_i}(s) \qquad (10)$$

with

$$f_{\alpha, \beta}(s) = \frac{\beta}{\alpha} \left(\frac{s}{\alpha}\right)^{\beta - 1} e^{-(s/\alpha)^{\beta}} \qquad (11)$$

where $\alpha$ is the scale parameter and $\beta$ the shape parameter.

Figure 2. Continuous duration graphical model.
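For reference, the standard two-parameter Weibull density with scale $\alpha$ and shape $\beta$, together with the corresponding survival function, can be written in a few lines. This is a plain-Python sketch of the textbook formulas, with hypothetical function names; it is not taken from the authors' MATLAB implementation.

```python
import math


def weibull_pdf(s, alpha, beta):
    """Weibull density with scale alpha and shape beta, defined on ]0, +inf[."""
    if s <= 0:
        return 0.0
    z = s / alpha
    return (beta / alpha) * z ** (beta - 1) * math.exp(-(z ** beta))


def weibull_survival(s, alpha, beta):
    """P(S > s) = exp(-(s / alpha) ** beta): the reliability of a single sojourn."""
    if s <= 0:
        return 1.0
    return math.exp(-((s / alpha) ** beta))
```

With `beta = 1` the density reduces to the exponential distribution of mean `alpha`, which is the Markovian special case; other shape values give the non-exponential sojourn times that motivate the model.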

Then, it is necessary to define the system state and the sojourn-time transition CPDs. A transition occurs if and only if $S_{t-1} \le 1$:

$$P(X_t = j \mid X_{t-1} = i, S_{t-1} \le 1) = Q^{sys}(i, j) \qquad (12)$$

where $Q^{sys}$ is an $N_X \times N_X$ matrix, called the static system transition matrix.

A new sojourn time is selected according to the following CPD:

$$f(S_t = s \mid X_t = i, S_{t-1} \le 1) = f_{\alpha_i, \beta_i}(s) \qquad (13)$$

While there is no transition, the system deterministically remains in the previous state $i$:

$$P(X_t = j \mid X_{t-1} = i, S_{t-1} > 1) = I(i, j) \qquad (14)$$

and the sojourn time in the current state is decreased deterministically by one unit:

$$f(S_t = s \mid X_t = i, S_{t-1} = s' > 1) = \begin{cases} 1 & \text{if } s = s' - 1 \\ 0 & \text{otherwise} \end{cases} \qquad (15)$$

3.2 Inference 


As shown in 2.4, the reliability of a non- 
repairable system is equal to the availability of the 
system: 


R(t) = P(X, cu) 
=1- P(X, € D) 
S1-¥ P(X, € D, X Xp SX) 


pe eat 
Siab 


=1-$ P(X, € D| pa(x,)) 
Xiii 
Sia Sa 


(16) 


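We have to compute successively the following distributions:

• $\Phi_u = P(X_t \mid X_u, S_{u-1})$ for $u \in [2; t-1]$ and $\Phi_1 = P(X_t \mid X_1)$.
• $\Lambda_u = P(X_t \mid X_{u-1}, S_{u-1})$ for $u \in [2; t]$ and $\Lambda_1 = P(X_t)$.

The required distributions are expressed as follows. For $u \in [2; t-1]$:

$$\Phi_u = \int_0^{\infty} P(X_t \mid X_u, S_u = s_u)\, f(S_u = s_u \mid X_u, S_{u-1})\, ds_u = \int_0^{\infty} \Lambda_{u+1} \times f(S_u = s_u \mid X_u, S_{u-1})\, ds_u \qquad (17)$$

and

$$\Lambda_u = \sum_{x \in \Omega_X} P(X_t \mid X_u = x, S_{u-1})\, P(X_u = x \mid X_{u-1}, S_{u-1}) = \sum_{x \in \Omega_X} \Phi_u \times P(X_u = x \mid X_{u-1}, S_{u-1}) \qquad (18)$$

For $u = 1$:

$$\Phi_1 = \int_0^{\infty} P(X_t \mid X_1, S_1 = s_1)\, f(S_1 = s_1 \mid X_1)\, ds_1 = \int_0^{\infty} \Lambda_2 \times f(S_1 = s_1 \mid X_1)\, ds_1 \qquad (19)$$

and

$$\Lambda_1 = \sum_{x \in \Omega_X} P(X_t \mid X_1 = x)\, P(X_1 = x) = \sum_{x \in \Omega_X} \Phi_1 \times P(X_1 = x) \qquad (20)$$

The elements required by the computation (17) are given by (18), (13) and (15); and those required by (19) are given by (18) and (10).

The elements required by the computation (18) are given by (17), (12) and (14); and those required by (20) are given by (17) and (9).

Equations (17), (18), (19) and (20) define an iterative algorithm that yields, at the end, $\Lambda_1 = P(X_t)$, from which the value of the reliability follows:

$$R(t) = 1 - P(X_t \in D) = \sum_{x \in U} P(X_t = x) \qquad (21)$$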


The computation in (17) and (18) can be devel- 
oped as follows: 


In (17): 
P(X, | X,_ q? S41 =8)= (22) 
QW) if s<l 
ow x ow) if l<s<2 
Oov*OWw) if j<s<j+l (23) 
o% if q <s< q +1 
I if g+l<s 


with QO (x,y) Sa, ae (Ore | 
+f Ing s) ds x I(x, A 
: (18): 
PCX, |X ge S-01 = 9) = (24) 
Qe * Ola) if s<l 
ow*Ol) if l<s<2 
Qu * Ql) if j<s<j+l (25) 
Qo if q<s<qtl 
I if q+1<s 


Q% xQ% is actually the natural transition of 
the system between q slices. 


4 ILLUSTRATIONS 


The W-HGDM has been implemented in the MATLAB® environment, using the open source Bayes Net Toolbox (BNT). The objective of this study is the inference part, not the learning part, so toy systems are created: the parameters of the Conditional Probability Distributions are not learned from data but set a priori.


4.1 Test 


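To validate this algorithm, we consider a two-state system, one of the states being “up”, the other being “down”. The reliability can be written as follows:

$$R_{sys}(t) = P(X_1 \in U \cap \ldots \cap X_t \in U) = P(S_1 > t \mid X_1 \in U)\, P(X_1 \in U) = 1 - F_{\alpha, \beta}(t) \qquad (26)$$

where $F_{\alpha, \beta}$ is the Weibull cumulative distribution function and the last equality assumes that the system starts in the up state. The application of the inference algorithm presented in 3.2 to such a model is therefore expected to return the reliability of a Weibull distribution.

Figure 3 shows that the estimated reliability coincides with the reliability given by the Weibull distribution.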


4.2 Comparison with standard GDM 


Let us illustrate our approach using a GDM modeling the behavior of a 3-state production machine, i.e. $\Omega_X = \{1, 2, 3\}$, with $D = \{3\}$. The transition probabilities between system states are given in Table 1. The parameters of the Weibull distributions that characterize the sojourn-time variable when a transition occurs are given in Table 2. The algorithm presented in subsection 3.2 was used to make the computations.

Figure 3. Reliability of a 2-state system and of Weibull distributions, with T = 50 (estimated system reliability and Weibull reliability of the “up” state for $(\alpha, \beta) = (20, 0.8)$, $(20, 1)$ and $(20, 4)$).

Table 1. System transition conditional probability table.

          State 1   State 2   State 3
State 1      0        0.9       0.1
State 2      0        0         1
State 3      0        0         1

Table 2. Sojourn-time distribution for each state.

          α     β
State 1   30    1
State 2   20    1

Figure 4. Comparison between discrete GDM and W-HGDM (3-state system reliability R(t), with T = 50, computed with the standard GDM and with the Weibull-hybrid GDM).

A standard GDM is then built. The system state variable of this standard GDM has the same static system transition matrix as the W-HGDM (Table 1). The distribution of the sojourn-time variable of this standard GDM is constructed by discretizing the continuous Weibull distribution, with the same parameters as the W-HGDM (Table 2), and with $N_S = 150$ (Donat, Bouillaut, Aknin, Leray, & Bondeux 2008).
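The text does not detail the discretization scheme, so the following sketch shows only one plausible choice, stated here as an assumption: the discrete value k collects the Weibull probability mass of the interval ]k-1, k], and the mass beyond $N_S$ is lumped into the last value so that each row of the discrete CPD sums to one. The function names are hypothetical.

```python
import math


def weibull_cdf(s, alpha, beta):
    """Weibull cumulative distribution function with scale alpha and shape beta."""
    return 1.0 - math.exp(-((s / alpha) ** beta)) if s > 0 else 0.0


def discretize_weibull(alpha, beta, n_s):
    """Return [P(S = 1), ..., P(S = n_s)] by binning ]k-1, k] and lumping the tail."""
    probs = [weibull_cdf(k, alpha, beta) - weibull_cdf(k - 1, alpha, beta)
             for k in range(1, n_s + 1)]
    # assumption: the probability of exceeding n_s is assigned to the last bin
    probs[-1] += 1.0 - weibull_cdf(n_s, alpha, beta)
    return probs


# Rows of a discrete sojourn-time CPD for the two "up" states of Table 2.
sojourn_cpd = [discretize_weibull(30, 1, 150), discretize_weibull(20, 1, 150)]
```

With the parameters of Table 2 the lumped tail is small (about exp(-5) for state 1), which is consistent with choosing a value of $N_S$ well above the scale parameters.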

The method presented in 3.2 has been used to 
compute reliability estimations presented in Figure 4. 

We obtain the same results with both methods. 


5 CONCLUSIONS 


In this paper, a new hybrid approach for the duration model in which the sojourn time follows a Weibull distribution, called Weibull-Hybrid Graphical Duration Model, has been proposed. An associated inference algorithm has been developed and introduced, allowing the estimation of the reliability of the system in such models.

The first results presented in this paper underline the feasibility of the proposed approach in terms of inference accuracy. This validates the interest of such a hybrid approach for the modeling of degradation dynamics. Indeed, the expected advantage of the W-HGDM lies mainly in avoiding the discretization of the sojourn variable, which in a GDM can hamper the reliability computation for complexity reasons.
Many things remain to be done and investigated with this new approach. The first perspective is to go further in the algorithm validation, by comparing the complexity and the speed of the standard GDM and the W-HGDM. We also plan to add some nodes to the presented W-HGDM, such as context variables and, of course, maintenance actions. The long-term goal of this study is to improve VirMaLab (Virtual Maintenance Laboratory), a generic decision support tool developed by the GRETTIA (Bouillaut, Aknin, Donat, & Bondeux 2011) in order to evaluate, optimize and compare maintenance strategies for discrete state space systems, by allowing sojourn-time variables to follow a Weibull distribution.
ABSTRACT: The analysis of road network vulnerability is a challenging and important research subject in the field of transportation reliability engineering because of the complex coupling relationships among travelers, vehicles, roads and environment. In view of the deficiencies of existing research in describing travelers’ risk-averse and boundedly rational behavior, both random utility theory and random regret theory are used to describe travelers’ decision-making behaviors. A traffic assignment model expressed as a variational inequality is then constructed, in which travelers’ risk aversion and bounded rationality, as well as the resulting heterogeneity of travelers’ decision-making behaviors, are explicitly taken into account. Subsequently, a vulnerability index of the road network based on accessibility is defined, a vulnerability identification model is built, and a corresponding heuristic algorithm is proposed. The example results show that the consequences of link closures could be misjudged and the vulnerability rankings could be misidentified if the effects of the congestion level of the road network, travelers’ perception errors and degrees of regret aversion, and travelers’ route choice decision criteria are ignored. Therefore, it is necessary to capture the travelers’ behavior characteristics in the process of vulnerability analysis.


1 INTRODUCTION

1.1 Background

Robust road traffic networks have been regarded as one of the preconditions for a high quality of life. However, a traffic network can be vulnerable to various natural or man-made disasters. For example, adverse weather such as heavy snow and flooding could severely degrade the network performance. Although the occurrence probability of these major incidents is low, their consequences could be sufficiently large to generate a major problem that threatens remedial actions. Therefore, it is vital to understand the potential vulnerability of traffic networks under such major incidents, so as to manage their risks.

A key issue in vulnerability analysis is to identify the critical links of a network, i.e. the links whose failure would have the most serious impact on the whole network (Chen et al., 2012, Yin and Xu, 2010). After identifying the critical links, the network robustness can be enhanced by reinforcing these links or by constructing new alternative parallel paths (Matisziw and Murray, 2009).

Modeling travelers’ behavioral responses to link failures is another key issue involved in critical link identification (Chen et al., 2012). When links fail, the high degree of demand uncertainty and/or link capacity degradation inevitably yields travel time variability, and consequently imposes additional disutility on travelers. Many empirical studies have revealed the significant influence of travel time uncertainty on travelers’ route choice behavior (Liu et al., 2004, Shao et al., 2006, Wu and Nie, 2011). Travelers facing travel time uncertainty tend to choose reliable shortest paths, considering not only travel time savings but also the reduction of travel time variability. These risk-averse behaviors under travel time uncertainty have received considerable attention (Shao et al., 2006, Lo et al., 2006, Siu and Lo, 2008). However, most of the previous studies adopt Expected Utility Theory (EUT) and/or Random Utility Theory (RUT) to quantify travelers’ perceptions of network uncertainty, and assume that the travelers are homogeneous. It is well known that both EUT and RUT are based on the assumption that travelers are absolutely rational when making route choice decisions. In reality, however, some travelers’ behaviors may be boundedly rational, and can be influenced by personality, psychological state, environmental elements, etc. The usefulness of EUT or RUT as a descriptive model of choice behaviour has been fiercely debated both inside and outside the transportation domain. It is particularly interesting to note that Regret Theory (RT) (Loomes and Sugden, 1982) and its extension, Random Regret Theory (RRT) (Chorus, 2014), widely considered among the most prominent competitors of both EUT and RUT in the behavioral decision sciences, have been virtually ignored in the route choice domain. RT and RRT postulate that when choosing, people anticipate and try to avoid the situation where a non-chosen alternative outperforms the chosen one (which would cause post-decision regret). To the best of our knowledge, travelers’ risk aversion and bounded rationality, as well as the resulting heterogeneity of travelers’ behavioral responses to link closures, have not yet been considered simultaneously in studies of critical link identification.

In view of the above, this study proposes a method
of road network vulnerability identification that takes
into account travelers’ risk aversion, bounded rationality
and heterogeneity simultaneously under travel
time variations caused by link failures. We assume
that the total travel demand comprises two parts:
completely rational travelers and bounded rational
travelers. A new vulnerability index based on accessibility
is introduced to evaluate the consequences of
a link closure with consideration of these behavioral effects.


1.2 Outline 


The remainder of this paper is organized as follows.
In Section 2, a stochastic mixed user equilibrium
model is built. Section 3 then defines a
vulnerability index based on accessibility for
evaluating the consequences of a link closure with
consideration of travelers’ risk aversion, bounded
rationality and heterogeneity. In Section 4, numerical
examples on the Nguyen and Dupuis
network are provided to demonstrate the proposed
model. Finally, the conclusions are given in Section 5,
together with future research directions.


2 A STOCHASTIC MIXED USER EQUILIBRIUM MODEL

2.1 Distributions of travel times under link closures


Consider a road network represented by a strongly
connected graph $G = (N, A)$, where $N$ and $A$ are
the sets of nodes and links respectively. Let $W$
denote the set of origin-destination (OD) pairs and let $R_w$
represent the set of paths between OD pair $w$,
$w \in W$.

Link failures, together with the high degree of
demand uncertainty and/or link capacity degradation,
will inevitably lead to travel time uncertainty.
The link travel time is therefore assumed to be a random variable.
Let $T_a$ represent the travel time on link $a$. Furthermore,
assume that $T_a$ follows the normal distribution
with mean $t_a$ and variance $(\rho_a t_a)^2$, where
$\rho_a$ represents the variation coefficient of $T_a$. The
mean travel time $t_a$ can be described by the following
BPR (Bureau of Public Roads) function:


$$t_a = t_a^0 \left[ 1 + \beta \left( \frac{v_a}{c_a} \right)^{n} \right] \tag{1}$$

where $t_a^0$ is the free-flow travel time on link $a$,
$v_a$ and $c_a$ are the flow and the capacity on link $a$
respectively, and $\beta$ and $n$ are the constant parameters
of the BPR function.

Let the travel time on path $k$ between OD
pair $w$ be represented as $T_{w,k}$, which can be calculated
from the relationship between links and paths as follows:

$$T_{w,k} = \sum_{a \in A} T_a \delta^{w}_{a,k}, \quad k \in R_w,\ w \in W \tag{2}$$

where $\delta^{w}_{a,k}$ is the link-path incidence variable:
$\delta^{w}_{a,k} = 1$ if link $a$ is on path $k$, and $\delta^{w}_{a,k} = 0$ otherwise.

In order to simplify the problem, it is assumed
that link travel times are independent of each other.
Because a path is composed of several independent
links, according to the Central Limit Theorem, the
path travel time approximately follows the normal
distribution. According to equations (1)
and (2), the mean and standard deviation of $T_{w,k}$
can then be expressed as below:

$$t_{w,k} = \sum_{a \in A} t_a \delta^{w}_{a,k}, \quad k \in R_w,\ w \in W \tag{3}$$

$$\sigma_{w,k} = \sqrt{\sum_{a \in A} (\rho_a t_a)^2 \delta^{w}_{a,k}}, \quad k \in R_w,\ w \in W \tag{4}$$

where $t_{w,k}$ and $\sigma_{w,k}$ are respectively the mean and
standard deviation of $T_{w,k}$.

Empirical research shows that travelers
not only want to save travel time, but also hope to
avoid the risk caused by travel time uncertainty.
Therefore, Lo et al. (2006) put forward the concept of
Travel Time Budget (TTB), which is used to describe
the route choice behavior of travelers avoiding travel
risk. Let $b_{w,k}(\alpha)$ be the TTB of route $k$ between OD
pair $w$, which can be expressed as follows:

$$b_{w,k}(\alpha) = t_{w,k} + \Phi^{-1}(\alpha)\, \sigma_{w,k}, \quad k \in R_w,\ w \in W \tag{5}$$

where $\Phi^{-1}(\cdot)$ is the inverse of the standard
normal cumulative distribution function and $\alpha$ denotes the reliability
parameter reflecting the probability that the actual
trip time is within the TTB.
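As a rough illustration of equations (1) and (3)-(5), the following Python sketch computes the TTB of a single path. It assumes that each link's standard deviation equals the variation coefficient times the mean link travel time, and the three-link path with its flows is purely hypothetical; the function names are not from the paper.

```python
from math import sqrt
from statistics import NormalDist


def bpr_mean_time(t0, flow, capacity, beta=0.15, n=4):
    """Mean link travel time from the BPR function, equation (1)."""
    return t0 * (1.0 + beta * (flow / capacity) ** n)


def path_ttb(link_means, rho, alpha):
    """Travel time budget of a path made of independent links.

    Assumption: each link has standard deviation rho * mean, so the
    link variances add up along the path.
    """
    mean = sum(link_means)
    std = sqrt(sum((rho * t) ** 2 for t in link_means))
    return mean + NormalDist().inv_cdf(alpha) * std


# Hypothetical three-link path (free-flow times in minutes, flows in pcu/h).
means = [
    bpr_mean_time(12, 2000, 2500),
    bpr_mean_time(24, 1500, 2500),
    bpr_mean_time(12, 2500, 2500),
]
ttb = path_ttb(means, rho=0.5, alpha=0.90)
print(f"mean = {sum(means):.2f} min, TTB = {ttb:.2f} min")
```

With a reliability parameter above 0.5 the TTB exceeds the mean path travel time, and the margin grows with the variation coefficient.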


2.2 Travel decision model for completely rational
travelers based on RUT


Considering the travelers’ risk aversion in route
choice decisions, it is assumed that the completely
rational travelers use the TTB as their route choice
criterion. In addition, because travelers may
not have perfect information on the travel time
distributions, their perception errors on
the travel times should also be taken into account.
Therefore, according to RUT, the travel disutility
of the completely rational travelers can be regarded
as a random variable, which can be expressed as
follows:

$$U^{CR}_{w,k} = b_{w,k}(\alpha) + \xi^{CR}_{w,k}, \quad k \in R_w,\ w \in W \tag{6}$$

where $U^{CR}_{w,k}$ and $\xi^{CR}_{w,k}$ respectively represent the perceived
travel disutility and the perception error when
completely rational travelers choose route $k$
between OD pair $w$.

Furthermore, assuming that the perception
errors $\xi^{CR}_{w,k}$ are independently and identically Gumbel
distributed random variables with mean zero,
the probability that a completely rational
traveler chooses route $k$ between OD pair $w$ can
be described as follows:

$$p^{CR}_{w,k} = \frac{\exp\left(-\theta^{CR} b_{w,k}(\alpha)\right)}{\sum_{r \in R_w} \exp\left(-\theta^{CR} b_{w,r}(\alpha)\right)}, \quad k \in R_w,\ w \in W \tag{7}$$

where $\theta^{CR} > 0$ is the perception error parameter of
the completely rational travelers, which
measures the degree of travelers’ perception errors.
A higher $\theta^{CR}$ means smaller perception
errors.
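The logit form of equation (7) can be sketched in a few lines of Python. The function name and the three TTB values below are illustrative assumptions; subtracting the minimum disutility before exponentiating is only a numerical safeguard and does not change the probabilities.

```python
from math import exp


def logit_probabilities(disutilities, theta):
    """Logit route choice probabilities for one OD pair.

    disutilities: one value per route (here, TTBs in minutes).
    theta: perception error parameter; larger values mean smaller errors.
    """
    base = min(disutilities)  # shift for numerical stability
    weights = [exp(-theta * (u - base)) for u in disutilities]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical TTBs of three routes between one OD pair.
ttbs = [30.0, 32.0, 40.0]
print(logit_probabilities(ttbs, theta=1.0))
print(logit_probabilities(ttbs, theta=0.2))
```

A smaller theta spreads the choice probabilities more evenly across routes, which is the sense in which it represents larger perception errors.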


2.3 Travel decision model for bounded rational
travelers based on RRT


In reality, due to the influences of information
conditions, personality, preferences and other
factors, not all travelers’ route choice behaviors
are completely rational, and some travelers show
bounded rationality when choosing a route. In this
paper, RRT is used to describe this phenomenon.
In contrast with RUT, which postulates that a
route’s disutility is a function of its own performance
only, RRT postulates that, in addition, the
performance difference with the competing routes
codetermines a route’s disutility. In other words,
RRT assumes that the traveler is regret averse in
travel decision-making: when the traveler
chooses among all alternatives, the total regret
of the current choice relative to
the other alternatives is considered, and the
alternative with the minimum regret is chosen.

Based on the above analysis, assume that the
bounded rational travelers are risk averse and
regret averse, and take the perceived regret value as
their route choice criterion. According to RRT, the
travel disutility of the bounded rational travelers
can be represented as below:

$$U^{BR}_{w,k} = u^{BR}_{w,k} + \xi^{BR}_{w,k}, \quad k \in R_w,\ w \in W \tag{8}$$

where $U^{BR}_{w,k}$, $u^{BR}_{w,k}$ and $\xi^{BR}_{w,k}$ respectively represent
the perceived travel disutility (the perceived regret
value), the mean of the perceived travel disutility (the mean
of the perceived regret value) and the perception error
when bounded rational travelers choose
route $k$ between OD pair $w$.

Suppose that the bounded rational travelers
use the TTB as the absolute travel disutility. A specific
functional form of $u^{BR}_{w,k}$ that satisfies the RRT
requirements can be stated as follows:


È apl nEw- €(@)), ke R, wew O) 
where $\gamma > 0$ is a regret aversion parameter. When
$\gamma$ increases, regret becomes more and more
important.
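To make the idea of a regret-based disutility concrete, the sketch below uses the log-sum form that is common in the random regret literature. This form is an assumption made for illustration only and is not necessarily the one stated in equation (9); the function name and TTB values are likewise hypothetical.

```python
from math import exp, log


def mean_regrets(ttbs, gamma):
    """Mean regret of every route against all competing routes.

    Assumed form (illustrative): for route k, sum over r != k of
    ln(1 + exp(gamma * (b_k - b_r))), where b is the TTB. Regret is
    large when competing routes have a smaller TTB than route k.
    """
    return [
        sum(log(1.0 + exp(gamma * (bk - br)))
            for r, br in enumerate(ttbs) if r != k)
        for k, bk in enumerate(ttbs)
    ]


# Hypothetical TTBs (minutes) of three routes between one OD pair.
regrets = mean_regrets([30.0, 32.0, 40.0], gamma=0.15)
print([round(x, 3) for x in regrets])
```

Under this assumed form the route with the smallest TTB also has the smallest regret, and a larger gamma widens the gap between routes.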

Similarly, assuming that the perception errors
$\xi^{BR}_{w,k}$ are independently and identically Gumbel
distributed random variables with mean zero,
the probability that a bounded rational traveler
chooses route $k$ between OD pair $w$ can be
described as follows:

$$p^{BR}_{w,k} = \frac{\exp\left(-\theta^{BR} u^{BR}_{w,k}\right)}{\sum_{r \in R_w} \exp\left(-\theta^{BR} u^{BR}_{w,r}\right)}, \quad k \in R_w,\ w \in W \tag{10}$$

where $\theta^{BR} > 0$ is the perception error parameter of
the bounded rational travelers, which
measures the degree of travelers’ perception errors.


2.4 Stochastic mixed user equilibrium model


It is assumed that there are two kinds of travelers
in the road network: completely rational travelers
and bounded rational travelers. The
completely rational travelers use the perceived
travel time budget as their travel disutility and the
bounded rational travelers use the perceived regret
value as their travel disutility. When selecting
routes, both kinds of travelers try to find
the routes with the minimum travel disutility. The
network is said to reach the stochastic mixed
user equilibrium state when no traveler of either type
can decrease his or her travel disutility by unilaterally
changing routes.

According to the principle of stochastic user
equilibrium (Sheffi, 1985; Huang, 1994), the network
equilibrium condition can be expressed as
follows:


$$f^{CR}_{w,k} = q^{CR}_w\, p^{CR}_{w,k}, \quad k \in R_w,\ w \in W \tag{11}$$

$$f^{BR}_{w,k} = q^{BR}_w\, p^{BR}_{w,k}, \quad k \in R_w,\ w \in W \tag{12}$$

where $p^{CR}_{w,k}$ and $p^{BR}_{w,k}$ in formulas (11) and (12) are
determined by formulas (7) and (10) respectively,
and the following flow conservation
constraints are required:

$$q_w = q^{CR}_w + q^{BR}_w = \sum_{k \in R_w} f^{CR}_{w,k} + \sum_{k \in R_w} f^{BR}_{w,k}, \quad w \in W \tag{13}$$

where $q^{CR}_w$, $q^{BR}_w$ and $q_w$ respectively denote the
demand of completely rational travelers, the demand of bounded
rational travelers and the total demand
between OD pair $w$; $f^{CR}_{w,k}$, $f^{BR}_{w,k}$ and $f_{w,k}$ respectively
represent the flow of completely rational travelers, the
flow of bounded rational travelers and the total flow on
route $k$ between OD pair $w$; $v^{CR}_a$, $v^{BR}_a$ and $v_a$ respectively
represent the flow of completely rational travelers,
the flow of bounded rational travelers and the total
flow on link $a$.

The equilibrium conditions (11) and (12)
can be described by the following equivalent
variational inequality (VI) model.

Find $f^{*} \in \Omega$ such that it satisfies the
condition:
where the superscript “*” designates the
solution of the VI problem and $\Omega$ is the feasible set
of route flows satisfying the constraints
given by formulas (13)-(16).

Let $f^{CR}$ and $f^{BR}$ denote the column vectors composed
of $\{f^{CR}_{w,k},\ k \in R_w,\ w \in W\}$ and $\{f^{BR}_{w,k},\ k \in R_w,\ w \in W\}$
respectively, and let $v^{CR}(f)$ and $v^{BR}(f)$ denote the column vectors
composed of $\{b_{w,k}(\alpha) + \frac{1}{\theta^{CR}} \ln f^{CR}_{w,k},\ k \in R_w,\ w \in W\}$ and
$\{u^{BR}_{w,k} + \frac{1}{\theta^{BR}} \ln f^{BR}_{w,k},\ k \in R_w,\ w \in W\}$ respectively. Because
$v^{CR}(f)$ and $v^{BR}(f)$ are continuous with respect to $f^{CR}$
and $f^{BR}$ respectively, and the feasible set $\Omega$ is a bounded
closed convex set, there exists at least one solution of the
VI problem expressed by formula (20) according to the
variational inequality theorem (Facchinei and Pang,
2003).


3 VULNERABILITY IDENTIFICATION
MODEL OF ROAD NETWORK


The identification of key links and their criticality
is an important question in road network vulnerability
evaluation. The key links are the links
whose failure would have a significant impact on
the network vulnerability. In the literature, various
vulnerability indices have been proposed to evaluate
the consequences of link closures. For example,
Jenelius et al. (2006) used the increase of the generalized
cost, weighted by the demand, as a vulnerability
measure of a link closure. Chen et al. (2007)
proposed the utility-based accessibility index to
take account of travelers’ behavioral responses to
the link closure. In this paper, the road network
vulnerability is evaluated by the change of road network
accessibility, and the key links are then identified.

Accessibility can be defined as the convenience
with which travelers can reach a destination from an origin
by a certain mode within a certain
period of time (Taylor et al., 2006, Taylor, 2008).
The accessibility index of a single OD pair can be
defined as follows:


Liga) 
AC, = watt weW (21) 
where $AC_w$ expresses the accessibility between
OD pair $w$.

As shown in formula (21), when the travel time
per unit of OD travel demand is higher, the
accessibility index of the OD pair is lower, which
indicates that travel is relatively less
convenient.

According to formula (21), the accessibility
index of a road network can be defined as
follows:
where $AC(G)$ expresses the accessibility of road
network $G$.

In this paper, the road network vulnerability is 
evaluated by the relative change of road network 
accessibility before and after link failures. The 
road network vulnerability index under the failure 
of link a is defined as follows: 


$$VUL_a = \frac{AC_0(G) - AC_a(G)}{AC_0(G)} \tag{23}$$

where $AC_0(G)$ is the road network accessibility
under normal conditions and $AC_a(G)$ is the road
network accessibility under the failure of road link
$a$ (that is, after removing link $a$ from the road network).
$VUL_a$ thus reflects the change of road network
accessibility caused by the failure of link $a$.

The failure of a single link is the simplest case of
link failures in a road network. It is also the basis
for studying multi-link failures. For simplicity,
this paper uses single link failures
to identify the road network vulnerability.

Based on the above analysis, the specific vulner- 
ability evaluation scheme is given below: 

Step 1: Calculate the road network accessibility
$AC_0(G)$ under normal conditions. The method of successive
averages (MSA) algorithm is used to calculate the
equilibrium route flows and every route’s TTB under
normal conditions. Subsequently, the road network
accessibility under normal conditions is calculated
according to formulas (21) and (22).

Step 2: Remove each link from the road network
in turn, and calculate the road network accessibility $AC_a(G)$
after the removal of link $a$ according
to formulas (21) and (22).

Step 3: Calculate the road network vulnerability
$VUL_a$. According to $AC_0(G)$ and $AC_a(G)$ obtained
from Step 1 and Step 2, the road network vulnerability
index $VUL_a$ after the failure of link $a$
can be calculated in turn by formula (23). If the
removal of link $a$ would leave the road network
no longer connected, then link $a$ is
immediately identified as the most critical link
by setting $VUL_a = \infty$.

Step 4: Identify the critical links. Sort the $VUL_a$
in descending order, and let $rank_a$ denote the order
of $VUL_a$ (namely the critical degree). The
key links are the first $N$ links with
the smallest values of $rank_a$, where $N$ is the
number of key links set in advance.
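Steps 3 and 4 can be sketched as follows in Python. The sketch assumes that the vulnerability index is the relative drop in accessibility with respect to normal conditions and that a disconnecting closure is flagged with an infinite index; the accessibility values and link labels in the example are hypothetical.

```python
import math


def vulnerability_index(ac_normal, ac_after):
    """Relative drop in network accessibility caused by one link closure.

    ac_after is None when the closure disconnects the network, in which
    case the index is set to infinity so the link ranks first.
    """
    if ac_after is None:
        return math.inf
    return (ac_normal - ac_after) / ac_normal


def rank_links(ac_normal, ac_after_by_link, n_key):
    """Return the n_key most critical links as (rank, link, index)."""
    vul = {a: vulnerability_index(ac_normal, ac)
           for a, ac in ac_after_by_link.items()}
    ordered = sorted(vul, key=lambda a: vul[a], reverse=True)
    return [(rank, a, vul[a])
            for rank, a in enumerate(ordered[:n_key], start=1)]


# Hypothetical accessibility values after closing each link.
ranking = rank_links(1.0, {2: 0.9689, 11: 0.9748, 8: 0.9972, 5: None}, n_key=2)
for rank, link, vul in ranking:
    print(rank, link, vul)
```

In a full study the accessibility after each closure would come from re-solving the equilibrium model on the reduced network, which is the expensive part of the procedure.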

The steps of the MSA algorithm are as
follows:

Step 1: Initialization. Set the error parameter $\varepsilon = 0.01$
and the iteration counter $n = 1$; initialize the
route flows. Reasonable initial route flows
of the completely rational travelers are set as
$f^{CR(1)} = \{f^{CR(1)}_{w,k},\ k \in R_w,\ w \in W\}$, and the initial
route flows of the bounded rational travelers are
set as $f^{BR(1)} = \{f^{BR(1)}_{w,k},\ k \in R_w,\ w \in W\}$.

Step 2: Based on the current path flows
$f^{CR(n)}$ and $f^{BR(n)}$, calculate the vector of TTB
$b^{(n)} = \{b^{(n)}_{w,k}(\alpha),\ k \in R_w,\ w \in W\}$ and the vector of
mean regret values $u^{BR(n)} = \{u^{BR(n)}_{w,k},\ k \in R_w,\ w \in W\}$.

Step 3: Seek the iterative directions of the route flows,
$g^{CR(n)}$ for the completely rational travelers and
$g^{BR(n)}$ for the bounded rational travelers. Let
$g^{CR(n)} = \tilde{f}^{CR(n)} - f^{CR(n)}$ and $g^{BR(n)} = \tilde{f}^{BR(n)} - f^{BR(n)}$, where
$\tilde{f}^{CR(n)}$ and $\tilde{f}^{BR(n)}$ are the auxiliary route flows obtained
from formulas (11) and (12) at the current disutilities.

Step 4: Update the flows. Set $f^{CR(n+1)} = f^{CR(n)} + \frac{1}{n} g^{CR(n)}$
and $f^{BR(n+1)} = f^{BR(n)} + \frac{1}{n} g^{BR(n)}$.

Step 5: Check the convergence. If the sum, over all OD pairs
and routes, of the relative changes in the route flows of both
traveler types between successive iterations does not exceed
$\varepsilon$, stop the
iteration; otherwise, set $n = n + 1$ and turn to Step 2.
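The MSA loop can be illustrated on a toy case with one OD pair and two parallel single-link routes. Everything specific in this sketch is an assumption made for illustration: the two-route network, the demands, the log-sum regret form, the link standard deviation taken as the variation coefficient times the mean, and the exact convergence measure.

```python
from math import exp, log
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.90)            # Phi^{-1}(alpha), alpha = 0.90
T0, CAP = [12.0, 24.0], [2500.0, 2500.0]  # free-flow times (min), capacities
RHO, GAMMA = 0.5, 0.15
THETA = {"CR": 1.0, "BR": 0.2}


def ttb(total_flows):
    """TTB of each single-link route: mean + z * rho * mean."""
    means = [t0 * (1 + 0.15 * (v / c) ** 4)
             for t0, v, c in zip(T0, total_flows, CAP)]
    return [t * (1 + Z * RHO) for t in means]


def regret(b):
    """Assumed log-sum regret of each route against the other routes."""
    return [sum(log(1 + exp(GAMMA * (bk - br)))
                for r, br in enumerate(b) if r != k)
            for k, bk in enumerate(b)]


def logit(costs, theta):
    base = min(costs)
    w = [exp(-theta * (c - base)) for c in costs]
    total = sum(w)
    return [x / total for x in w]


def msa(demand, eps=0.01, max_iter=1000):
    # Step 1: split each class's demand evenly over the two routes.
    flows = {cls: [q / 2, q / 2] for cls, q in demand.items()}
    for n in range(1, max_iter + 1):
        total = [sum(flows[cls][k] for cls in flows) for k in range(2)]
        b = ttb(total)                                   # Step 2
        cost = {"CR": b, "BR": regret(b)}
        change = 0.0
        for cls, q in demand.items():
            aux = [q * p for p in logit(cost[cls], THETA[cls])]  # Step 3
            new = [f + (y - f) / n for f, y in zip(flows[cls], aux)]  # Step 4
            change += sum(abs(x - f) / max(f, 1e-9)
                          for x, f in zip(new, flows[cls]))
            flows[cls] = new
        if change <= eps:                                # Step 5
            break
    return flows, n


flows, iterations = msa({"CR": 2000.0, "BR": 2000.0})
print(iterations, flows)
```

Because the step size 1/n shrinks, a stopping rule based on successive flow changes is eventually satisfied; a tighter check would compare the current flows with the auxiliary flows directly.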
4 NUMERICAL STUDIES


In this section, the Nguyen and Dupuis network
(Nguyen & Dupuis, 1984) shown in Figure 1 is used
to demonstrate the proposed model. The network
consists of 13 nodes, 19 links, 25 routes, and 4 OD
pairs. The free-flow travel time and design capacity
of each link are shown in Table 1.
Figure 1. Nguyen and Dupuis network.


Table 1. Link characteristics.

Link  Free-flow travel time/min  Design capacity/pcu·h⁻¹
1     12                         2500
2     12                         2500
3     12                         2500
4     24                         2500
5     12                         2500
6     12                         2500
7     12                         2500
8     12                         2500
9     12                         2500
10    12                         2500
11    12                         2500
12    12                         2500
13    24                         2500
14    12                         2500
15    12                         2500
16    12                         2500
17    12                         2500
18    36                         1500
19    12                         2500


For illustration purposes, the OD demand of
each OD pair is described as follows:

$$q^{CR}_w = \mu\, \eta\, q^0_w, \qquad q^{BR}_w = \mu\, (1 - \eta)\, q^0_w, \quad w \in W \tag{24}$$

where $q^0_w$ represents the potential maximum
demand of OD pair $w$; $\mu$ is the multiplier of OD
demand; and $\eta$ represents the proportion of completely
rational travelers.

Other relevant parameters are set as follows: the parameters
of the BPR function in formula (1) are $\beta = 0.15$,
$n = 4$; the variation coefficient $\rho_a = 0.5$, $a \in A$; the
reliability parameter $\alpha = 0.90$; the regret aversion
parameter $\gamma = 0.15$; the perception error parameters
$\theta^{CR} = 1.0$ and $\theta^{BR} = 0.2$ respectively; the maximum
OD demands of the four OD pairs are 5000, 4000, 4000 and 3000
pcu·h⁻¹ respectively; the multiplier of OD demand
$\mu = 0.5$; the proportion of completely rational travelers
$\eta = 0.5$; the number of key links $N = 6$.

Table 2 shows the results of vulnerability evalua- 
tion for the 6 most critical links in the Nguyen and 
Dupuis network. It can be seen that the closures of 
these critical links decrease the network accessibil- 
ity by 2% or more. The worst case is the closure 
of link 2, which leads to a 3.11% decrease in the 
network accessibility. 

To test the effects of the congestion level on network
vulnerability, the consequences of link closures
under various values of the demand multiplier $\mu$
are given in Table 3. From the table, it can be
seen that as the demand level increases, the network
vulnerability under link closures tends to go up.
For example, when the demand multiplier $\mu = 0.05$,
the vulnerability index $VUL_a$ is equal to 0.34% after
the closure of link 2. However, when $\mu = 0.99$,
$VUL_a$ rises to
5.45% under the closure of link 2. The results are not
surprising: if the network is more congested,
less spare capacity is available
to absorb the traffic rerouted by a closure. In addition,
Table 3 shows that the vulnerability rankings
may vary with the congestion level.
For instance, link 15 is ranked at $Rank_a = 5$ when
$\mu = 0.05$. However, this link is ranked at $Rank_a = 6$
when $\mu = 0.49$, and $Rank_a = 7$ when $\mu = 0.99$.

Table 2. The results of network vulnerability evaluation.

Rank_a  Link  VUL_a
1       2     3.11%
2       11    2.52%
3       14    2.44%
4       13    2.31%
5       19    2.31%
6       15    2.26%


Table 3. The effects of the congestion levels on network
vulnerability.

        μ = 0.05      μ = 0.49      μ = 0.99
Rank_a  Link  VUL_a   Link  VUL_a   Link  VUL_a
1       2     0.34%   2     3.06%   2     5.45%
2       14    0.30%   11    2.48%   11    4.43%
3       7     0.29%   14    2.40%   14    4.26%
4       11    0.26%   13    2.27%   13    4.12%
5       15    0.26%   19    2.27%   19    4.12%
6       13    0.23%   15    2.22%   7     3.77%
7       19    0.23%   7     2.11%   15    3.76%
8       1     0.19%   1     1.76%   1     3.05%
9       3     0.17%   16    1.60%   16    2.83%
10      6     0.15%   3     1.57%   3     2.73%
11      4     0.15%   4     1.33%   4     2.43%
12      5     0.13%   5     1.26%   5     2.28%
13      16    0.12%   6     1.05%   6     1.83%
14      12    0.12%   18    0.92%   18    1.67%
15      17    0.11%   12    0.86%   12    1.50%
16      9     0.09%   17    0.80%   17    1.46%
17      18    0.09%   9     0.57%   9     1.02%
18      10    0.05%   10    0.43%   10    0.82%
19      8     0.01%   8     0.27%   8     0.58%


In order to illustrate the effects of travelers’ perception
errors on network vulnerability, the results of
link closures under various values of $\theta^{CR}$ and
$\theta^{BR}$ are shown in Table 4. It should be pointed out
that a higher $\theta^{CR}$ or $\theta^{BR}$ means smaller perception
errors and vice versa. It can be observed from the
table that travelers’ perception errors have
some impact on network vulnerability evaluation.
Specifically, as the values of $\theta^{CR}$ and $\theta^{BR}$ change,
the vulnerability rankings and the vulnerability
indices may change accordingly. For example,
link 13 is ranked at $Rank_a = 4$ when $\theta^{CR} = 5.0$ and
$\theta^{BR} = 1.0$, whereas this link is ranked at $Rank_a = 7$
when $\theta^{CR} = 0.1$ and $\theta^{BR} = 0.02$.

Table 5 shows the impact of the regret aversion
parameter on network vulnerability for
different values of $\gamma$. It can be seen from Table 5
that the parameter $\gamma$ has a certain impact on
network vulnerability evaluation. The vulnerability
rankings may vary with the value of $\gamma$.
For example, link 15 is ranked at $Rank_a = 4$
when $\gamma = 0.05$, whereas this link is ranked at
$Rank_a = 6$ when $\gamma = 0.30$.

Table 6 depicts the impact of travelers’ route
choice decision criteria on network vulnerability
for different values of $\eta$. As shown
in Table 6, as the composition of traveler types
varies, the vulnerability rankings and the
vulnerability indices may change accordingly. For
instance, when $\eta = 0.99$, which means that the
completely rational travelers are dominant in the
network, link 15 is ranked at $Rank_a = 6$ and the
vulnerability index $VUL_a = 2.19\%$. However, this
link is ranked at $Rank_a = 8$ and the vulnerability
index $VUL_a = 2.06\%$ when $\eta = 0.01$.


Table 4. The effects of the perception errors on network
vulnerability.

        θ^CR = 5.0,    θ^CR = 1.0,    θ^CR = 0.1,
        θ^BR = 1.0     θ^BR = 0.2     θ^BR = 0.02
Rank_a  Link  VUL_a    Link  VUL_a    Link  VUL_a
1       2     3.14%    2     3.11%    2     2.90%
2       11    2.55%    11    2.52%    14    2.55%
3       14    2.39%    14    2.44%    7     2.54%
4       13    2.338%   13    2.31%    11    2.28%
5       19    2.338%   19    2.31%    15    2.09%
6       15    2.19%    15    2.26%    19    2.02%
7       7     244%     7     2.15%    13    2.02%
8       1     1.74%    1     1.79%    1     1.63%
9       16    1.67%    16    1.63%    3     1.43%
10      3     1.56%    3     1.59%    6     1.33%
11      4     13%      4     1.35%    4     1.29%
12      5     1.26%    5     1.28%    5     1.10%
13      6     0.96%    6     1.07%    12    1.04%
14      18    0.94%    18    0.93%    16    1.03%
15      17    0.83%    12    0.88%    17    1.01%
16      12    0.81%    17    0.81%    9     0.82%
17      9     0.57%    9     0.58%    18    0.76%
18      10    0.46%    10    0.44%    10    0.40%
19      8     0.35%    8     0.28%    8     -0.08%


Table 5. The effects of the regret aversion parameter on
network vulnerability.

        γ = 0.05       γ = 0.15       γ = 0.30
Rank_a  Link  VUL_a    Link  VUL_a    Link  VUL_a
1       2     3.12%    2     3.11%    2     3.10%
2       11    2.52%    11    2.52%    11    2.51%
3       14    2.45%    14    2.44%    14    2.43%
4       15    2.31%    13    2.31%    13    2.34%
5       13    2.29%    19    2.31%    19    2.34%
6       19    2.29%    15    2.26%    15    2.20%
7       7     21%      7     2.15%    7     2.13%
8       1     1.83%    1     1.79%    1     1.74%
9       16    1.60%    16    1.63%    16    1.67%
10      3     1.58%    3     1.59%    3     1.62%
11      4     1.33%    4     1.35%    4     1.38%
12      5     1.30%    5     1.28%    5     1.27%
13      6     1.09%    6     1.07%    6     1.04%
14      18    0.94%    18    0.93%    18    0.93%
15      12    0.89%    12    0.88%    12    0.87%
16      17    0.82%    17    0.81%    17    0.80%
17      9     0.59%    9     0.58%    9     0.57%
18      10    0.45%    10    0.44%    10    0.44%
19      8     0.27%    8     0.28%    8     0.29%


Table 6. The effects of the travelers’ route choice decision
criteria on network vulnerability.

        η = 0.01       η = 0.49       η = 0.99
Rank_a  Link  VUL_a    Link  VUL_a    Link  VUL_a
1       2     2.93%    2     3.11%    2     3.13%
2       14    2.72%    11    2.52%    11    2.53%
3       13    2.47%    14    2.45%    14    2.34%
4       19    2.47%    13    2.31%    19    2.32%
5       11    2.37%    19    2.31%    13    2.32%
6       7     2.27%    15    2.26%    15    2.19%
7       3     2.07%    7     2.15%    7     2.12%
8       15    2.06%    1     1.79%    1     1.72%
9       16    1.71%    16    1.63%    16    1.59%
10      1     1.68%    3     1.60%    3     1.45%
11      4     1.63%    4     1.35%    4     1.35%
12      6     1.45%    5     1.28%    5     1.23%
13      5     1.31%    6     1.08%    18    0.94%
14      12    1.13%    18    0.93%    6     0.90%
15      18    0.83%    12    0.88%    17    0.82%
16      17    0.79%    17    0.81%    12    0.79%
17      9     0.64%    9     0.58%    9     0.57%
18      10    0.438%   10    0.44%    10    0.44%
19      8     0.15%    8     0.28%    8     0.31%

The above analysis shows that ignoring the
effects of the congestion level of the road network,
travelers’ perception errors and regret aversion
degrees, as well as travelers’ route choice decision
criteria, could lead to misjudging the consequences of link
closures and misidentifying the most critical links.


5 CONCLUSIONS AND FUTURE
RESEARCH


This study proposed a method of road network
vulnerability identification that takes into account
travelers’ risk aversion and bounded rationality
simultaneously under travel time variations caused
by link failures. It is assumed that there are
two kinds of travelers in the road network:
completely rational travelers and bounded rational
travelers. The completely rational
travelers use the perceived travel time budget as
their travel disutility, while the bounded rational
travelers use the perceived regret value as their travel
disutility. According to the travelers’ postulated
route choice decision criteria, a mixed stochastic
traffic assignment model formulated as a variational
inequality is constructed, a new vulnerability index
of the road network based on accessibility is defined,
a vulnerability identification model is built,
and a corresponding heuristic algorithm is proposed.
Numerical examples on the Nguyen and
Dupuis network show that the consequences
of link closures could be misjudged and
the vulnerability rankings could be misidentified
if the effects of the congestion level of the
road network, travelers’ perception errors and
regret aversion degrees, as well as travelers’ route
choice decision criteria, are ignored. Therefore, it is necessary
to capture travelers’ behavioral characteristics in
vulnerability analysis.

The proposed vulnerability analysis only con- 
siders the scenarios of single link closure, and the 
consideration of multiple link closures is an impor- 
tant extension. Another valuable extension of this 
study is to take into account day-to-day adjust- 
ment processes for modeling travelers’ behavioral 
responses to link closures. 
ABSTRACT: The paper presents an approach to analysing the safety of autonomous driving
by using semi-Markov processes. The approach can be used in the development and assessment of vehicles
implementing autonomous driving. Through a case study, it is indicated that a semi-Markov process
model can capture relevant properties related to the safety of autonomous driving. The case study particularly
investigates whether Level 3 autonomy, in which the driver is responsible for taking over when alerted by the
system, can be made sufficiently safe. The paper also highlights how the current standard ISO26262 is
insufficient for autonomous driving, where the system itself affects the exposure to operational situations.
Therefore, as a complement to ISO26262, it is shown how the proposed approach can be used to
derive top level safety requirements.


1 INTRODUCTION 

Autonomous driving is today one of the, if not 
the, major technological drivers in the area of on- 
and off-road vehicles. d new business opportuni- 
ties. One of the main arguments for, and also main 
arguments against, autonomous driving is safety. 
It is argued that autonomous driving is safer than
human driving, since a computer is always alert,
always takes rational decisions etc. On the other
hand, huge technological obstacles remain on how
to realize autonomous driving in complex traffic
situations, bad weather, and bad road conditions,
where available commercial sensor technologies
have yet to prove their sufficiency (Watzenig & Horn
2016).

Even if autonomous driving can be made
safe, how can it be proven to be safe? In more
technologically mature areas, there are applicable
standards providing systematic methods to argue
for safety, but these are still lacking in the area of
autonomous driving. Current standards, such as
ISO26262 (2011), cover only the case when a human
driver is responsible for the driving.

To assess if autonomous driving is safe or not, 
systematic and formal analyses are needed. There- 
fore, as the main contribution, the present paper 
presents a method for formal analysis of safety 
of autonomous driving by using so called semi- 
Markov processes. 

The use of Markov processes for analysis of
dependability, including safety, is an established
practice. The study of Markov processes has a
long history, but a milestone for its industrial
usage for dependability analysis (Fuiqua 2003) was
the introduction of the standard IEC61508 (2010).
Since IEC61508 is a mother standard for many
other standards, the usage of Markov processes
has spread to many application areas. In particular,
it is a large part of the ISO26262 standard, which
uses continuous Markov chains to assess the reliability
of hardware components. However, the literature
is sparse on the usage of Markov processes for
analysis of safety of autonomous driving. Markov
process models of different kinds have indeed been
used in the context of autonomous driving, but for
the purpose of on-line perception or decision making,
such as path planning, e.g. see (Hoekstra et al.
2013, Katrakazas et al. 2015, Wei et al. 2011). Also,
there are studies on how to model human driving
using Markov models, e.g. see (Mitrovic 2005,
Oliver & Pentland 2000).

For reliability or safety analysis, the most common
type of Markov process used is the discrete- or
continuous-time Markov chain (Fuiqua 2003).
Markov chains are completely “memoryless”,
assuming that the probabilities of transitions are
independent of time, which implies that all transition
times are assumed to be exponentially distributed.
This is a limitation in modelling of real
world systems, which do not always follow exponential
distributions (Limnios 1997). Therefore,
the present paper makes use of continuous-time
so-called semi-Markov processes, introduced by
Levy (1954) and Smith (1955). In semi-Markov
processes, the probabilities of transitions from a
state are allowed to depend on the time spent in
the state. Thus, transition distributions in a
semi-Markov process can be non-exponential.

The paper is focused around a case study, Highway
Pilot, which is one typical autonomous driving
function (Kirschbaum 2015). Autonomous driving
is classified into six levels (SAE 2016), and Highway
Pilot is classified as Level 3 or 4. Level 3 means that
the vehicle drives autonomously but requires the
driver to be ready to take over whenever requested
by the vehicle. Several manufacturers have argued
that Level 3 is not feasible since drivers cannot be
expected to be as alert as needed to manage the
takeover reliably. However, other manufacturers
argue that Level 3 is indeed possible as an intermediate
step before Level 4, in which the driver has no
responsibility. This is an example of a controversy
in the area of autonomous driving. This is also an
example where the semi-Markov analysis method
proposed by the present paper can bring clarity
and help answer questions such as: under what
conditions can Level 3 be sufficiently safe, and is
it really easier to reach safety using Level 3 instead
of Level 4?

The paper is organized as follows. In Sec. 2,
selected parts of the theory of semi-Markov processes
are briefly summarized. Then, in Sec. 3, a
semi-Markov based approach for analysis of safety
is presented. The approach utilizes the steady-state
distribution in combination with an assessment of
safety based upon the decision-theoretic concepts
loss and risk. In Sec. 4 the case study Highway Pilot
is presented, and also modeled by using a
semi-Markov process. Then Sec. 5, based on the model,
analyses the overall safety of Highway Pilot. Sec. 6
performs a sensitivity analysis on the parameters
in the model and addresses the above mentioned
questions related to Level 3 vs 4. Finally, Sec. 7
discusses how the proposed approach can be used in
an engineering context.


2 SEMI-MARKOV PROCESSES 


This section shortly introduces continuous-time 
semi- Markov processes in accordance with literature, 
e.g. (Limnios & Oprisan 2001, Limnios 1997). The 
purpose, and only focus, is to give the background 
theory needed for reading the rest of the paper. 
Consider a system with a finite set of states 
{1,2,....N}. Let X, be a random variable repre- 
senting the state after the n:th transition where 
nE Z.. Let U, be a random variable representing 
the time spent in state X, called the sojourn time. A 
semi-Markov process can then be described using 
a transition function in the form of a so called semi- 
Markov kernel, which is a matrix Q with elements 


$$Q_{ij}(t) = \begin{cases} P(X_{n+1} = j,\ U_n \le t \mid X_n = i) & i \ne j \\ 0 & i = j \end{cases} \qquad (1)$$


Let $M$ be the matrix of transition probabilities
of the embedded Markov chain, i.e.
$M_{ij} = P(X_{n+1} = j \mid X_n = i)$, which can be obtained
from the semi-Markov kernel as
$M_{ij} = \lim_{t \to \infty} Q_{ij}(t)$.


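2.1 Continuous-time Markov chains


A special case of semi-Markov processes is so-called
continuous-time Markov chains, where the sojourn
time $U_n$ follows an exponential distribution, and
the semi-Markov kernel takes the form

$$Q_{ij}(t) = \frac{\lambda_{ij}}{\lambda_i}\left(1 - e^{-\lambda_i t}\right), \quad i \ne j \qquad (2)$$

where $\lambda_i = \sum_{j \ne i} \lambda_{ij}$ and the parameters
$\lambda_{ij}$ are called transition rates. This further implies
that the embedded Markov chain becomes $M_{ij} = 0$ for
$i = j$ and

$$M_{ij} = \frac{\lambda_{ij}}{\lambda_i} \quad \text{for } i \ne j \qquad (3)$$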
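As a small illustration of (2) and (3), the following Python sketch builds the embedded Markov chain and the expected sojourn times from a table of transition rates. The function names `embedded_chain` and `kernel`, the dictionary layout, and the example rates are illustrative assumptions and are not taken from the referenced literature.

```python
import math

def embedded_chain(rates):
    """Return (M, mean_sojourn) for rates[i][j] = lambda_ij with i != j."""
    M, mean_sojourn = {}, {}
    for i, out in rates.items():
        total = sum(out.values())                          # lambda_i
        M[i] = {j: lam / total for j, lam in out.items()}  # equation (3)
        mean_sojourn[i] = 1.0 / total                      # mean of Exp(lambda_i)
    return M, mean_sojourn

def kernel(rates, i, j, t):
    """Semi-Markov kernel Q_ij(t) of a continuous-time Markov chain, equation (2)."""
    total = sum(rates[i].values())
    return rates[i][j] / total * (1.0 - math.exp(-total * t))

# Arbitrary three-state example, rates per hour (hypothetical values).
rates = {"a": {"b": 1.0, "c": 3.0}, "b": {"a": 2.0}, "c": {"a": 0.5}}
M, mean_sojourn = embedded_chain(rates)
```

For large `t`, `kernel(rates, i, j, t)` approaches `M[i][j]`, which mirrors the limit relation between the kernel and the embedded chain.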


2.2 Steady-state distribution 


Let $X(t)$ denote the random variable representing
the state at time $t$. We now present how to compute
the steady-state distribution of the semi-Markov
process, defined by a vector $\pi$ with elements
$\pi_i = \lim_{t \to \infty} P(X(t) = i)$ for $i = 1, \ldots, N$.
To compute $\pi$, we will below use a formula containing
two vectors $v$ and $m$.

Let $v$ be the steady-state distribution of the
embedded Markov chain, which means that it
is a row vector satisfying $vM = v$ and $\sum_i v_i = 1$.
Note that $v$ can be computed for example as
$v = [\mathbf{0}\ \ 1]\,\Psi^{T}(\Psi\Psi^{T})^{-1}$, where
$\Psi = [(M - I)\ \ \mathbf{1}]$, and
$\mathbf{0}$ and $\mathbf{1}$ are vectors of 0's and 1's.

Let $m$ be the column vector of expected
sojourn times of the different states, i.e.
$m_i = E[U_n \mid X_n = i]$. Let $H_i(t)$ be the cumulative
distribution function of the sojourn time of state
$i$, i.e. $H_i(t) = P(U_n \le t \mid X_n = i) = \sum_j Q_{ij}(t)$.
Using one of the standard formulas for expectation,
$m_i$ can then be computed as

$$m_i = \int_0^{\infty} \bigl(1 - H_i(t)\bigr)\,dt \qquad (4)$$


The process is assumed to be recurrent, meaning
that it is possible, i.e. with a non-zero probability,
to reach any state from any state, and the expected
return time to any state is finite. Then the
steady-state distribution of the semi-Markov process can
be computed as

$$\pi = \frac{1}{vm}\,\operatorname{diag}(v)\,m \qquad (5)$$

where $\operatorname{diag}(v)$ is the diagonal matrix with the
elements of $v$ on its diagonal.
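The computation in (5) can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions: it obtains $v$ by solving $vM = v$, $\sum_i v_i = 1$ with plain Gaussian elimination (one balance equation is replaced by the normalisation) rather than through the pseudo-inverse expression mentioned above, and the function names and the two-state example are hypothetical.

```python
def stationary(M):
    """Solve v M = v with sum(v) = 1 by Gaussian elimination (pure Python)."""
    n = len(M)
    # Row j holds the balance equation sum_i v_i (M_ij - delta_ij) = 0.
    A = [[M[i][j] - (1.0 if i == j else 0.0) for i in range(n)] for j in range(n)]
    A[-1] = [1.0] * n                 # replace the last equation by sum(v) = 1
    b = [0.0] * (n - 1) + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    v = [0.0] * n
    for r in reversed(range(n)):      # back substitution
        s = sum(A[r][c] * v[c] for c in range(r + 1, n))
        v[r] = (b[r] - s) / A[r][r]
    return v

def semi_markov_steady_state(M, m):
    """Equation (5): pi = diag(v) m / (v m)."""
    v = stationary(M)
    weights = [vi * mi for vi, mi in zip(v, m)]
    total = sum(weights)
    return [w / total for w in weights]

# Two states that alternate, with expected sojourn times 1 and 3 time units.
M = [[0.0, 1.0], [1.0, 0.0]]
m = [1.0, 3.0]
pi = semi_markov_steady_state(M, m)
```

In this toy example both states are visited equally often, but the process spends three times as long in the second state, so the time-based distribution is weighted accordingly.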


3 ANALYSING SAFETY BY USING SEMI-MARKOV PROCESSES


Inspired by the approach of operational situations 
in ISO26262 (2011), this section proposes a general 
loss- and risk-based framework for the analysis of 
safety. Although general, the intended target is 
here safety of autonomous driving. 


3.1 Loss and risk based measure of safety 


In accordance with decision theory
(Berger 1985), each state $i$ is associated with a loss
denoted $L(i) \in \mathbb{R}_{\geq 0}$, representing the level of
“dangerousness” of being in state $i$. Then the expected
loss, often referred to as risk, can be used as an
overall measure of the dangerousness of the system.
To obtain a measure independent of the initial
condition, we use the limiting expected loss:


$$\lim_{t \to \infty} E[L(X(t))] = \lim_{t \to \infty} \sum_i P(X(t) = i)\,L(i) = \sum_i \lim_{t \to \infty} P(X(t) = i)\,L(i) = \sum_i \pi_i L(i) \qquad (6)$$


The actual value of the loss/dangerousness 
of being in state i can be chosen based on sev- 
eral principles. One approach is the one adopted 
in ISO26262 (2011), where metrics of severity 
and controllability, obtained from standardized 
tables, are combined into what here corresponds 
to loss. However, in the present paper, the danger- 
ousness of being in state i will be assessed using 
the probability of fatality as a result of being in
state i.


3.2 Using probability of fatality as loss 


We consider two extra states, $F$ and $\neg F$, residing
in their own dimension, where $F$ represents
“fatal accident” and is absorbing. Let the random
variable $Y(t) \in \{F, \neg F\}$ represent whether a fatal
accident has occurred or not at time $t$. We assume an
exponential distribution of the transition time from
state $\neg F$ to state $F$, but with a transition rate
$\lambda_i^F$ dependent on state $i$, i.e.

$$F_i(t) = P\bigl(Y(t) = F \mid Y(0) = \neg F,\ \forall \tau \in [0, t]:\ X(\tau) = i\bigr) = 1 - e^{-\lambda_i^F t}$$

While the transitions of $Y$ depend on $X$, we make the
simplifying assumption that transitions of $X$ do not
depend on $Y$.

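In accordance with ISO26262, we will evaluate
safety based upon an assumption of 1 hour of
driving. Therefore we use the loss
$L(i) = F_i(1\,\mathrm{h}) = 1 - e^{-\lambda_i^F \cdot 1\,\mathrm{h}}$,
i.e. the probability of fatality when being in state $i$
for 1 hour. It can then be realized that

$$\lim_{t \to \infty} P\bigl(Y(t + 1\,\mathrm{h}) = F \mid Y(t) = \neg F\bigr) = \sum_i F_i(1\,\mathrm{h}) \lim_{t \to \infty} P(X(t) = i) = \sum_i L(i)\,\pi_i$$

This means that the loss-based measure of
safety (6) in fact corresponds to the limiting
probability of fatality during 1 h of operation.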
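The relation between fatality rates, losses, and the risk measure (6) can be illustrated with a short Python sketch. The three-state steady-state distribution and the fatality rates below are hypothetical numbers chosen only for illustration, and the function names are assumptions.

```python
import math

def loss_from_fatality_rate(rate_per_hour, hours=1.0):
    """L(i) = F_i(1 h) = 1 - exp(-lambda_i^F * 1 h)."""
    return 1.0 - math.exp(-rate_per_hour * hours)

def risk(pi, loss):
    """Limiting expected loss, equation (6): sum_i pi_i * L(i)."""
    return sum(p * l for p, l in zip(pi, loss))

# Hypothetical example: steady-state distribution and fatality rates per hour.
pi = [0.90, 0.08, 0.02]
fatality_rates = [0.0, 1e-4, 1e-2]
loss = [loss_from_fatality_rate(r) for r in fatality_rates]
total_risk = risk(pi, loss)
```

A state with zero fatality rate contributes nothing to the risk, regardless of how much time is spent in it.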


4 MODELLING AUTONOMOUS DRIVING 


This section presents the case study Highway Pilot
(HP). HP is a general and well-known function
(Kirschbaum 2015), but the case study has been
done in collaboration with Scania CV, a global
manufacturer of heavy trucks. After giving an
overview of HP, the approach for modelling and
safety assessment presented in Sec. 2 and 3 is used
to set up a semi-Markov process model of HP.


4.1 Highway pilot


When highway pilot is activated by the driver, the
driving of the vehicle is completely taken over by
the electronic system in the vehicle, and the driver
does not need to be involved in the driving, not
even by monitoring. HP can only be activated when
the vehicle is on a highway, and the vehicle will
continue to follow the highway as long as HP is on.
HP is designed to function in a range of speeds, for
example up to 120 km/h for a passenger car and
90 km/h for heavy trucks.

At the time of writing this paper, there is not
yet a commercially available vehicle with HP. One
reason is that a number of technical challenges
remain. One is the reliability of commercial
radar and camera sensors. In more complicated
road, weather, and traffic situations, their
performance is insufficient. But even under perfect
conditions, these sensors have a relatively low
reliability level, causing them to miss objects or detect
false objects with a non-negligible probability.
Independent of the reason, the situation when the
sensors of the vehicle are fault free but still unable to
obtain a correct interpretation of the surrounding
environment will in the following be referred to as
bad conditions. Bad conditions may be detected by
the system itself, but may also be undetected for a
significant amount of time.


Another reason for the sensors to make an
incorrect interpretation of the surrounding
environment is that faults appear and become active.
In contrast to bad conditions, which will always
disappear after some time, faults do not disappear
unless the system is repaired.

When HP is on and either bad conditions or
faults are detected by the system, the system
cannot immediately deactivate HP. The reason is that
the driver may not be ready to take over driving,
e.g. he/she may be sleeping. Therefore, upon
detection of bad conditions or faults, the system will
enter so-called degraded driving. In degraded
driving, faulty or unreliable sensors may be shut off.
This typically causes perception accuracy or
reliability to be lower. To compensate for this in the
autonomous driving, a safer driving style with
larger safety margins is typically adopted, including
reducing the speed of the vehicle. During degraded
driving, the system also tries to get the attention of
the driver so that he/she takes over the driving.


4.2 States 


The states and transitions of the model are illus- 
trated in Figure 1. Each of the states and their out- 
going transitions are described below. 

State S1 HP is off and the driver manually drives 
the vehicle. When the driver activates HP, there is 
a transition to S2. 

State S2 HP is on and the electronic system of
the vehicle drives the vehicle. The driver can at
any time deactivate HP, which triggers a transition
to S1. Alternatively, at any time, bad conditions or
a fault can occur, which triggers transitions to S3
and S4 respectively.

State S3 HP is on and the electronic system of
the vehicle drives the vehicle. However, undetected
bad conditions are present so there is a greater risk
of incorrect perception, potentially leading to an
incorrect and dangerous control action by the
electronic system. The driver can at any time deactivate
HP, which triggers a transition to S1. Alternatively,
the system may detect the bad conditions, and this
triggers a transition to S5.

State S4 HP is on and the electronic system of
the vehicle drives the vehicle. However, at least
one undetected fault is present, potentially leading
to an incorrect control action by the electronic
system. The driver can at any time deactivate HP,
which triggers a transition to S1. Alternatively, the
system may detect the fault(s), and this triggers a
transition to S5.

State S5 HP is on but the system has detected bad
conditions or a fault, so the driving is in a degraded
mode. A warning is indicated to the driver,
who is thereby urged to take over the driving by
deactivating HP. If the driver deactivates HP, this
triggers a transition to S1. If the driver does not
deactivate HP within a predetermined time limit
of 10 s, the system will initiate emergency stopping,
which means a transition to S6.

State S6 HP is on and the system is performing an
emergency stop. During emergency stopping,
the system controls the brakes and steering of the
vehicle in order to stop the vehicle as quickly as
possible but in a controlled manner, e.g. making sure
that the vehicle stops along the side, and not in the
middle, of the road. The emergency stopping
procedure does not involve the driver, and the driver is
not able to terminate the procedure. When the
vehicle reaches zero speed, there is a transition to S7.

State S7 HP is off and the vehicle is completely
stopped, i.e. having a speed of 0 km/h. When the
driver again starts the vehicle, there is a transition
to S1.

Figure 1. States and transitions of highway pilot. Solid lines are used for transitions whose transition function can be affected by the design, and dashed lines for those that cannot be affected by the design.

It can be noted that ISO26262 (2011) covers
only safety with respect to faults, i.e. the right path
(S2-S4-S5) of the model.


4.3 Parameters of the model


The model contains a number of parameters that
need to be determined:


• The transition rates $\lambda_{ij}$ for transitions respecting
  the Markov assumption, i.e. where the distribution
  of the sojourn time is exponential. These
  implicitly define the transition function $Q_{ij}(t)$
  according to (2).
• Transition functions $Q_{ij}(t)$, for states with
  non-exponential distribution of sojourn times.
• The loss $L(i)$ associated with each state.


Now follows, for each state in the model, an
explanation of the chosen parameter values.
All parameter values were derived by studying
data and statistics from prototype systems and
also from discussions with engineers. It should
however be mentioned that the available data was
limited or in many cases not applicable, due to the
prototype status of the system, but also due to
confidentiality reasons.

A general principle used is that as long as there is 
no clear argumentation for not choosing an expo- 
nential distribution, an exponential distribution is 
chosen. In all cases, transition rates are chosen by 
considering the resulting expected sojourn time, 
probability distribution of transitions to succeed- 
ing states, and for each succeeding state, the prob- 
ability of transition within certain time periods. 

State S1 The transition to S2 is considered to
have a constant rate $\lambda_{12} = 0.7$ (per hour, as for all
rates in this section), which corresponds
to an expected sojourn time of 85 min and a
probability of transition to S2 within 1 h equal to 0.50.
The loss is $L(1) = 0$ since only loss (dangerousness)
associated with usage of HP is considered.

State S2 The transition back to S1 is assumed to
have a constant rate $\lambda_{21} = 1.5$. The
transition to S3 is assumed to have a constant rate
$\lambda_{23} = 0.1$ and the transition to S4 a constant rate
$\lambda_{24} = 10^{-\ldots}$. These three rates result in an
expected sojourn time of 37 min. The probability of a
transition back to S1 within 1 h becomes 0.75, while the
corresponding probabilities of transitions to S3 and S4
become 0.05 and $10^{-\ldots}$ respectively. The loss is
$L(2) = 0$ since S2 corresponds to when HP is working
perfectly without any bad conditions or any active faults.
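The sojourn time and the one-hour transition probabilities quoted for S2 follow from (2) and can be checked with a short Python sketch. The helper name is an assumption, and the very small rate to S4 is left out here as a simplification, since it is negligible next to the other two rates.

```python
import math

def prob_transition_within(rates_out, target, hours):
    """P(next transition goes to `target` and occurs within `hours`), per (2)."""
    total = sum(rates_out.values())
    return rates_out[target] / total * (1.0 - math.exp(-total * hours))

# Rates out of S2 in h^-1 (the rate to S4 is omitted, see the text above).
s2 = {"S1": 1.5, "S3": 0.1}
mean_sojourn_min = 60.0 / sum(s2.values())
p_s1 = prob_transition_within(s2, "S1", 1.0)
p_s3 = prob_transition_within(s2, "S3", 1.0)
```

Rounded to two decimals, `p_s1` and `p_s3` agree with the values 0.75 and 0.05 stated for S2, and `mean_sojourn_min` is 37.5 min.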

State S3 All transitions from S3 are assumed to
occur with constant rates. The rate of the transition
to S1 is assumed to be $\lambda_{31} = 2$, a bit higher than for
the corresponding transition from S2, since when
conditions are bad, the driver is likely to observe this
and therefore deactivate HP. The rate of transition
to S5 is assumed to be $\lambda_{35} = 2$, which corresponds to
a probability of detecting the bad conditions within
one hour equal to 0.49 and within 10 min equal to
0.24. The transition rate back to S2 was also chosen to
be $\lambda_{32} = 2$. All this means that, being in state S3,
there is an equal probability of transitions to S1, S2,
and S5. The expected sojourn time becomes 10 min.
In contrast to the previous states, being in S3 is
associated with a non-zero loss due to possible
incorrect perception caused by bad conditions. The loss
is assumed to be $L(3) = 0.01$, but remember that this
corresponds to the probability of fatality when being in
state S3 for 1 hour, and since the expected sojourn time
is only 10 min, the probability of fatality will be lower.
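To make the last remark concrete, the following Python sketch converts the one-hour loss into the underlying exponential fatality rate and evaluates the probability of fatality for a shorter stay. It assumes, for illustration only, a stay of exactly 10 min in the state; the function name is hypothetical.

```python
import math

def fatality_probability(loss_per_hour, minutes):
    """Probability of fatality for a stay of `minutes`, given L = F(1 h)."""
    # Invert L = 1 - exp(-lambda^F * 1 h) to get the fatality rate in h^-1.
    rate = -math.log(1.0 - loss_per_hour)
    return 1.0 - math.exp(-rate * minutes / 60.0)

p_10min = fatality_probability(0.01, 10.0)   # stay of 10 min with L(3) = 0.01
p_60min = fatality_probability(0.01, 60.0)   # recovers the one-hour loss
```

Under this assumption the probability for a 10 min stay is roughly one sixth of the one-hour loss.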

State S4 All transitions from S4 are assumed
to occur with constant rates. The rate of the
transition to S1 is assumed to be $\lambda_{41} = 1.5$, i.e. equal
to the rate of the corresponding transition from S2,
since an undetected fault is likely not to change
the behavior of the vehicle, and consequently not
to increase the probability of the driver deactivating
HP. The rate of the transition to S5 is assumed to
be $\lambda_{45} = 3$, which is set a little bit higher than the
rate of the corresponding transition from S3. The
rate of transition back to S2 is set to $\lambda_{42} = 0.1$,
which corresponds to an assumption that faults
never disappear, but may become inactive. The
loss of being in state S4 is estimated to be $L(4) = 0.01$,
i.e. the same as for state S3.

State S5 During degraded driving the driver is
alerted to take over driving, so the transition to state S1
is set to a high and constant rate $\lambda_{51} = 60$. A
transition to S6 will occur deterministically after 10 s,
which means that this is not a constant rate and the
sojourn time does not follow an exponential
distribution. This implies that the probability of having
a transition to S1 is $M_{51} = 1 - e^{-60 \cdot 10/3600} = 0.154$,
and the probability of having a transition to S6 is
$M_{56} = 1 - M_{51}$. This further implies, with $t$ in seconds,

$$Q_{51}(t) = \begin{cases} 1 - e^{-\lambda_{51} t} & t < 10 \\ M_{51} & t \ge 10 \end{cases} \qquad Q_{56}(t) = \begin{cases} 0 & t < 10 \\ M_{56} & t \ge 10 \end{cases}$$

The expected sojourn time can be computed
using the formula (4):

$$m_5 = \int_0^{\infty} \bigl(1 - Q_{51}(t) - Q_{56}(t)\bigr)\,dt = \int_0^{10} \bigl(1 - (1 - e^{-\lambda_{51} t})\bigr)\,dt = 9.2\ \mathrm{s}$$

In state S5, the loss is estimated to be
$L(5) = 0.001$, i.e. significantly lower than in state
S3 or S4. The reason is the increased safety
margin resulting from degraded driving.
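The two S5 quantities, the take-over probability $M_{51}$ and the expected sojourn time $m_5$, can be reproduced with the Python sketch below. The variable names are assumptions; the rate of 60 per hour is converted to seconds so that it matches the 10 s timeout, and the integral in (4) is evaluated both in closed form and with a simple midpoint rule as a cross-check.

```python
import math

RATE_S5_S1 = 60.0 / 3600.0   # lambda_51 converted from h^-1 to s^-1
TIMEOUT_S = 10.0             # emergency stopping starts after 10 s

m51 = 1.0 - math.exp(-RATE_S5_S1 * TIMEOUT_S)   # driver takes over in time
m56 = 1.0 - m51                                 # timeout expires first

def survival(t):
    """1 - Q_51(t) - Q_56(t): probability of still being in S5 at time t (s)."""
    return math.exp(-RATE_S5_S1 * t) if t < TIMEOUT_S else 0.0

# Equation (4) in closed form, and a midpoint-rule evaluation of the same integral.
m5_closed = (1.0 - math.exp(-RATE_S5_S1 * TIMEOUT_S)) / RATE_S5_S1
steps = 10000
dt = TIMEOUT_S / steps
m5_numeric = sum(survival((k + 0.5) * dt) for k in range(steps)) * dt
```

Both evaluations give an expected sojourn time of about 9.2 s, and `m51` rounds to 0.154.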

State S6 From S6, the only possible transition is
to S7 and it occurs at the time it takes to stop the
vehicle. Since there are a number of uncertainties,
e.g. road conditions and tire conditions, the time
is assumed to have a probability density function that is
constant in the interval 4 to 10 s, and zero outside
this interval. This means that the expected sojourn
time is 7 s. The loss in S6 is estimated to be the
same as in S5, i.e. $L(6) = 0.001$.

State S7 From S7, the only possible transition
is to the starting state S1. The transition is set to
a fixed rate of $\lambda_{71} = 30$, which corresponds to an
expected sojourn time of 2 min. The loss in S7 is
estimated to be the same as in states S5 and S6, i.e.
$L(7) = 0.001$. On the one hand, the vehicle is standing
still, meaning that it cannot on its own cause an
accident. On the other hand, doing an emergency
stop on a highway may result in stopping in a
dangerous position on the road, thus increasing the
probability of being hit by another vehicle.


5 COMPUTING THE RISK 


To compute the risk, i.e. the limiting expected loss, 
according to (6), we first need to compute the 
steady-state distribution. For this, we will use (5) 
which requires v, the steady-state distribution 
of the embedded Markov-chain M, and m, the 
expected sojourn times. 

For the states S1, S2, S3, S4, and S7, the
corresponding rows in the matrix $M$, defining the
embedded Markov chain, are obtained by inserting
the transition rates given in Section 4.3 into
equation (3). For S6, there is only one possible transition,
so the row in $M$ is given by $M_{67} = 1$ and the other
entries 0. For S5, both $M_{51}$ and $M_{56}$ have already
been derived in Sec. 4.3, and the other entries are 0.

The expected sojourn times have all been given
in Sec. 4.3. The resulting vector is

$$m = [\,85\ \mathrm{min},\ 37\ \mathrm{min},\ 10\ \mathrm{min},\ 13\ \mathrm{min},\ 9.2\ \mathrm{s},\ 7\ \mathrm{s},\ 2\ \mathrm{min}\,]$$

Now using (5) results in the steady-state
distribution shown in Figure 2. Note that since we
are only interested in the safety of HP, we
condition on HP on, i.e. the bar chart shows the value
$\lim_{t \to \infty} P(X(t) = i \mid \text{HP on})$ for each state $i$.

Figure 2. Steady-state distribution normalized for HP on.

The risk can now be computed by using (6), but
with conditional expectation, conditioned on HP on:

$$\lim_{t \to \infty} E[L(X(t)) \mid \text{HP on}] = 0.000165$$

This value is clearly too high. To investigate the
explanation and to find means to reduce the value,
the next section presents a sensitivity analysis with
respect to the parameters in the model.
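The whole computation of this section can be sketched in Python from the parameters of Sec. 4.3. Two points are assumptions of this sketch rather than statements of the paper: the value used for $\lambda_{24}$ is a hypothetical placeholder (its contribution is negligible), and "HP on" is taken to mean every state except manual driving S1, a choice that gives a value close to the reported 0.000165. The linear system $vM = v$ is solved by plain Gaussian elimination.

```python
import math

LAMBDA_24 = 1e-9   # ASSUMPTION: hypothetical placeholder for the S2 -> S4 rate

# Transition rates in h^-1 for the states with exponential sojourn times.
rates = {
    1: {2: 0.7},
    2: {1: 1.5, 3: 0.1, 4: LAMBDA_24},
    3: {1: 2.0, 2: 2.0, 5: 2.0},
    4: {1: 1.5, 2: 0.1, 5: 3.0},
    7: {1: 30.0},
}
loss = {1: 0.0, 2: 0.0, 3: 0.01, 4: 0.01, 5: 0.001, 6: 0.001, 7: 0.001}

N = 7
M = [[0.0] * N for _ in range(N)]   # embedded Markov chain
m = [0.0] * N                       # expected sojourn times in hours
for i, out in rates.items():
    total = sum(out.values())
    m[i - 1] = 1.0 / total
    for j, lam in out.items():
        M[i - 1][j - 1] = lam / total          # equation (3)

# S5: exponential take-over (60 h^-1) racing against a 10 s timeout.
lam51, timeout_h = 60.0, 10.0 / 3600.0
M[4][0] = 1.0 - math.exp(-lam51 * timeout_h)
M[4][5] = 1.0 - M[4][0]
m[4] = M[4][0] / lam51
# S6: stopping time uniform on 4..10 s (mean 7 s), then always S7.
M[5][6] = 1.0
m[5] = 7.0 / 3600.0

def stationary(M):
    """Solve v M = v with sum(v) = 1 by Gaussian elimination."""
    n = len(M)
    A = [[M[i][j] - (1.0 if i == j else 0.0) for i in range(n)] for j in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    v = [0.0] * n
    for r in reversed(range(n)):
        v[r] = (b[r] - sum(A[r][c] * v[c] for c in range(r + 1, n))) / A[r][r]
    return v

v = stationary(M)
weights = [vi * mi for vi, mi in zip(v, m)]
total_weight = sum(weights)
pi = [w / total_weight for w in weights]        # equation (5)

hp_on = [2, 3, 4, 5, 6, 7]   # ASSUMPTION: all states except manual driving S1
mass = sum(pi[i - 1] for i in hp_on)
risk = sum(pi[i - 1] * loss[i] for i in hp_on) / mass
```

Under these assumptions `risk` evaluates to roughly 1.6e-4, in line with the reported value; excluding S7 from the conditioning set gives a slightly lower number.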


6 SENSITIVITY ANALYSIS 


The model contains in total 14 transitions. Of
these, 12 transitions are parameterized by a
transition rate parameter $\lambda_{ij}$. The remaining two are the
transitions from S5 to S6 and S6 to S7. The
transition from S5 to S6 is instead parameterized
by the time of degraded autonomous driving,
and the transition from S6 to S7 is parameterized
directly by the expected sojourn time.

Each parameter is varied logarithmically from
$10^{-\ldots}$ to $10^{\ldots}$ times the original value, and the
resulting new risk is computed. For each parameter,
the risk is plotted in Figure 3. In the figure, it is
seen that some parameters have no effect on the
total risk while others have a significant effect. For
each of the 14 plots, a short interpretation of the
observed behavior is now given.

S1 → S2 and S2 → S1 Since we measure the
safety given that HP is on, activation and
deactivation of HP alone do not affect the safety.

S2 → S3 If bad conditions rarely occur (left
part of the plot), then HP is very safe.

S2 → S4 If the HP failure rate is changed, safety is
not affected, since it is dominated by the effect of
bad conditions.

S3 → S1 If the presence of undetected bad
conditions is eliminated by the driver disabling HP,
HP becomes very safe.

S3 → S5 If detection of bad conditions is made
more efficient, HP becomes safer since the driver will
disable HP faster. But only to a limit, since the driver
reaction time from being alerted until action is not
affected.

S3 → S2 If bad conditions quickly disappear,
HP becomes very safe.

S4 → S1 The driver's rate of disabling HP
when a fault has occurred does not affect HP
safety, since fault occurrence is such a rare event.

S4 → S5 Increased efficiency in detecting faults
does not affect safety, since faults are so rare anyway.

S4 → S2 If faults quickly become inactive, HP
becomes very safe.

S5 → S1 If the driver disables degraded
HP faster, safety is not affected, since degraded driving
is comparably quite safe anyway.

S5 → S6 Activating emergency stopping more quickly,
or more slowly, does not change safety. More
time increases the chance that the driver disables HP,
but also increases the time of dangerous driving.

S5 → S6 Starting the emergency stopping more
quickly after detection of a fault does not affect
safety, since fault detection is so rare anyway.

S6 → S7 Stopping time can in fact not be
affected, but if it could, then an extremely long
stopping time would decrease safety, since time spent
stopping is more dangerous than autonomous driving.

S7 → S1 Time to restart the vehicle after stopping
can in fact not be affected, but if it could, then an
extremely long time before restart would decrease
safety, since time spent stopped is more dangerous
than autonomous driving.

Figure 3. Sensitivity analysis where the probability of fatality is plotted w.r.t. variations in each of the 14 parameters in the model. Solid lines correspond to parameters possible to affect in the design of the HP, and dashed lines to parameters not affected by the design.


6.1 Conclusions of sensitivity analysis


From the sensitivity analysis we can conclude
that the model seems to be able to capture
relevant properties of HP and their effects on safety.
Another important conclusion is that the only way
to reach sufficient safety of HP is to reduce the time
spent in state S3, i.e. to reduce the transition rate from
S2 to S3, i.e. to make sure that bad conditions do
not occur. According to the plot, the new value
of the transition rate $\lambda_{23}$ needs to be $10^{-6}$ times the
original value 0.1, i.e. $\lambda_{23} = 10^{-7}$.

But then, it becomes Level 4 autonomous
driving, i.e. the extra safety gained by demanding
the driver to take over when bad conditions
are detected is only marginal and not significant.
However, to really know that we obtain Level 4
autonomous driving if $\lambda_{23} = 10^{-7}$, a second
sensitivity analysis was performed with $\lambda_{23} = 10^{-7}$
and only the parameter $\lambda_{51}$ varied. Although not
shown in the paper due to space limitation, the
result is that the risk is not dependent on $\lambda_{51}$. This
implies that it truly becomes Level 4 autonomous
driving.
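The kind of sweep used in the sensitivity analysis can be sketched in Python by wrapping the model of Sec. 4.3 in a function of the parameters of interest. The sketch rests on the same assumptions as before: a hypothetical placeholder value for $\lambda_{24}$, "HP on" interpreted as all states except S1, and the driver take-over rate from S5 as the parameter varied in the second analysis. The function names and the chosen scale factors are illustrative.

```python
import math

def stationary(M):
    """Solve v M = v with sum(v) = 1 by Gaussian elimination."""
    n = len(M)
    A = [[M[i][j] - (1.0 if i == j else 0.0) for i in range(n)] for j in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    v = [0.0] * n
    for r in reversed(range(n)):
        v[r] = (b[r] - sum(A[r][c] * v[c] for c in range(r + 1, n))) / A[r][r]
    return v

def hp_risk(scale_23=1.0, lam51=60.0, lam24=1e-9):
    """Risk conditioned on HP on; index k stands for state S(k+1), rates in h^-1."""
    rates = {
        0: {1: 0.7},
        1: {0: 1.5, 2: 0.1 * scale_23, 3: lam24},   # lam24 is a placeholder value
        2: {0: 2.0, 1: 2.0, 4: 2.0},
        3: {0: 1.5, 1: 0.1, 4: 3.0},
        6: {0: 30.0},
    }
    loss = [0.0, 0.0, 0.01, 0.01, 0.001, 0.001, 0.001]
    M = [[0.0] * 7 for _ in range(7)]
    m = [0.0] * 7
    for i, out in rates.items():
        total = sum(out.values())
        m[i] = 1.0 / total
        for j, lam in out.items():
            M[i][j] = lam / total
    timeout_h = 10.0 / 3600.0                       # 10 s timeout in S5
    M[4][0] = 1.0 - math.exp(-lam51 * timeout_h)
    M[4][5] = 1.0 - M[4][0]
    m[4] = M[4][0] / lam51
    M[5][6], m[5] = 1.0, 7.0 / 3600.0               # S6: mean stopping time 7 s
    v = stationary(M)
    w = [vi * mi for vi, mi in zip(v, m)]
    on = range(1, 7)                                # ASSUMPTION: all states except S1
    return sum(w[i] * loss[i] for i in on) / sum(w[i] for i in on)

# Sweep the S2 -> S3 rate over scale factors 1, 0.1, ..., 1e-6.
scales = [10.0 ** k for k in range(0, -7, -1)]
sweep = [(s, hp_risk(scale_23=s)) for s in scales]

# Second analysis: with lambda_23 = 1e-7, vary the driver take-over rate from S5.
r_slow = hp_risk(scale_23=1e-6, lam51=6.0)
r_fast = hp_risk(scale_23=1e-6, lam51=600.0)
```

In this sketch the risk falls by roughly six orders of magnitude when the S2 → S3 rate is scaled by $10^{-6}$, and varying the take-over rate by a factor of ten in either direction changes the result only marginally.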

7 USING THE MODEL IN ENGINEERING


This section discusses how a conceptual model of
a technical function, like the one developed and
analysed in Sections 4, 5, and 6, can be used
generally by engineers to develop safe systems, in
particular within the automotive industry.

We will here view a transition function $Q_{ij}(t)$
for a transition dependent on the design
(exemplified by the solid lines in Fig. 3) as a requirement
on the system. This is a requirement to be taken
as input to the succeeding engineering design
activities. As a requirement it has to be formulated
taking into account what is technically and
commercially viable. In the case study, for example,
the transition function $Q_{56}(t)$ corresponds to the
requirement


“Emergency stopping shall start exactly 10 s after 
bad conditions or fault have been detected.” 


This requirement is in fact unrealistic in the sense 
that it is not so important that emergency stopping 
starts exactly after 10 s. And also, it is technically 
not feasible to build such a system since all com- 
ponents have tolerances. By following the princi- 
ple of how safety requirements are formulated in 
industry with a Safety Integrity Level (SIL), such 
as ASIL in ISO26262, an example of a more real- 
istic requirement is 

$$P(\text{“Emergency stopping starts between 8 and 12 s after bad conditions or fault have been detected.”}) > 1 - 10^{-\ldots}$$


This requirement corresponds in fact to a whole
set of transition functions, namely all $Q_{56}(t)$ where
there is a step somewhere between 8 and 12 s, and the
integral of the corresponding density function is
$< 10^{-\ldots}$ excluding the step.
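How such a set of transition functions translates into a range of model parameters can be sketched in Python. The sketch below is a simplification under stated assumptions: it only considers pure step functions with the step at a given position and ignores the small probability mass allowed outside the step; the function name and the list of candidate step positions are hypothetical.

```python
import math

RATE_S5_S1 = 60.0 / 3600.0   # lambda_51 in s^-1

def s5_parameters(step_s):
    """Return (M_51, M_56, m_5 in seconds) if emergency stopping starts after `step_s`."""
    m51 = 1.0 - math.exp(-RATE_S5_S1 * step_s)   # driver takes over before the step
    m5 = m51 / RATE_S5_S1                        # equation (4) for this step position
    return m51, 1.0 - m51, m5

# Candidate step positions inside the interval allowed by the requirement.
candidates = [8.0, 9.0, 10.0, 11.0, 12.0]
table = {s: s5_parameters(s) for s in candidates}
```

Each entry of `table` gives one admissible combination of $M_{51}$, $M_{56}$, and $m_5$; the nominal 10 s step reproduces the values used in Sec. 4.3.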


In conclusion, the transition function for each
transition depending on the design must match the
actual requirement, and to align with the principles of
how requirements are written in industry, not one
but a set of transition functions must be used. Using
this insight, the following procedure summarizes
how the proposed approach of the paper can
assist the conceptual phase of engineering:


1. Set up a semi-Markov process with states and
   transitions modelling the system considered.
2. Identify the transition function $Q_{ij}(t)$ for the
   transitions not dependent on the design, e.g. transitions
   depending on the operator or other parts of the
   system (exemplified by the dashed ones in Fig. 3).
3. Estimate the loss/dangerousness $L(i)$ of each state
   $i$, for example by estimating the probability of fatality
   as a result of being in the state for a certain time.
4. For each transition depending on the design,
   specify a requirement and a corresponding set of
   transition functions $Q_{ij}(t)$.
5. By using the formula (5), compute the steady-state
   distribution $\pi$ for each valid combination of
   transition functions picked from the sets in step 4,
   possibly by using sampling if the set is too large.
6. For all steady-state distributions computed,
   compute the risk, i.e. the limiting expected loss (6).
   If not acceptable, go back to step 4 and modify,
   possibly by first doing a sensitivity analysis,
   the requirements and corresponding transition
   functions $Q_{ij}(t)$.
7. Design the system according to the obtained
   requirements.


It can be noted that, in relation to the method
of hazard analysis and risk assessment in ISO26262
(2011), the steady-state distribution, and the
steps 1, 2, 4, and 5 leading up to it, replace the
concept of operational situations in ISO26262.
However, while the steady-state distribution is a result
of the chosen requirements, the operational
situations are a fixed input to the requirements elicitation,
and consequently ISO26262 does not acknowledge
the mutual interplay between identification of
requirements and the operational situations.


8 CONCLUSIONS 


The paper has presented an approach to how
safety of autonomous driving can be analysed
using semi-Markov processes. The approach can
be used to complement the often informal
discussion of whether autonomous driving is safe
or not, but more importantly, it can also be
used in the development and assessment of
vehicles implementing autonomous driving.

The case study has indicated that a semi-Markov
process model can capture relevant safety-related
properties of autonomous driving. The case study also
concludes that the extra safety obtained by Level 3
autonomous driving is only marginal and in fact,
to make Level 3 sufficiently safe, it needs to be Level 4.

The paper has highlighted how the current
standard ISO26262 is insufficient for complex
functions like autonomous driving where the
system itself affects the exposure to operational
situations. Instead of using the concept of operational
situations from ISO26262, it has been shown how
the proposed approach of semi-Markov processes
can be used to derive top-level requirements.


ABSTRACT: The probability of failure of mechanical products often conforms to the Weibull distribution.
If the Weibull distribution is used directly to construct the likelihood function, Bayesian estimation of
the reliability involves a large number of integral operations, and it is not easy to select an
appropriate prior distribution. In order to reduce the computational workload and improve the efficiency
of the evaluation, this paper builds on the mature Bayesian method for the exponential distribution: the
Weibull distribution is transformed into an exponential distribution, and the inverse-Gamma distribution
is chosen as the prior distribution. Through a series of derivations and calculations, we obtain an estimate
of the Weibull life parameter, and finally give corresponding examples to verify the feasibility of the method.


1 INTRODUCTION 

Mechanical reliability growth technology is an
important part of mechanical reliability theory. It
applies throughout the whole lifespan of design,
production and usage. Mechanical reliability growth
technology aims to evaluate mechanical product
reliability with synthesized knowledge of different
kinds, such as mechanical engineering, statistics
and management. Over the whole lifespan of the
products, reliability growth tests (RGT), data
analysis and management are effective and economical
methods to enhance reliability.

Reliability growth tests often face the problem of
too little data, especially for some mechanical
products. A reliability test usually takes a
long time and is costly, so the test uses small
samples or even very small samples, which makes it
difficult to assess product reliability. In the process of
assessing small-sample data, the Bayes method can use the
current test information together with prior information
from historical test data and similar products, and
can therefore be well applied to small-sample statistical
inference problems (Zhang 2000, Zhang 2003).
The basic principle of Bayesian theory is to calculate
the posterior distribution by using the prior
distribution and sample information, so as to obtain
the point estimate and confidence interval of
the variable and further derive the estimated values
of other related reliability characteristics.

The specific method generally treats a parameter
as a random variable. The likelihood function and the
prior distribution are constructed, the Bayesian formula
is used to derive the form of the posterior distribution,
and the failure data obtained from the reliability test
are then used as the sample to estimate the probability
density function or distribution function of the relevant
parameters and finally obtain the reliability index. In
this process, the determination of the prior distribution
and the construction of the likelihood function are the key
points, and the difficulty is parameter estimation.
Many scholars at home and abroad have done a lot
of research on parameter estimation.

In engineering practice, the failure probability of
mechanical products usually obeys the Weibull
distribution. The Bayesian parameter estimation of the
Weibull distribution can be derived mathematically
step by step using the most direct method (Shi
1992). This uses a large number of integral calculations,
and both the amount of calculation and the computational
complexity of the whole process are high,
increasing the workload for reliability assessment.
Alternatively, the Weibull distribution is converted to the
extreme value distribution and then calculated using the
Bayesian formula (Liu et al. 2005). Although the
calculation is simplified, the workload is still very large.
In response to this problem, this paper seeks a new
method of evaluation and calculation that avoids a large
number of integral calculations and reduces the
computational workload. In the literature (Mao 1999), an
exponential distribution evaluation method is
introduced. The exponential distribution is a commonly
used form of distribution, while the gamma
distribution is the conjugate prior distribution of the
exponential distribution. It is also suitable for describing
the characteristic life of mechanical and electrical
products. Therefore, for Bayes estimation with the
exponential distribution, choosing the gamma distribution
as the prior distribution greatly simplifies the
calculation, and the gain in efficiency compared to a direct
Bayesian derivation for the Weibull distribution is significant.

Based on this idea, this paper presents a method
for Bayes reliability growth evaluation based on
model transformation. We first convert the Weibull
distribution that the product obeys into an exponential
distribution, then perform the Bayes estimation
of the parameter of the exponential distribution,
and finally map the estimated parameters of the
exponential distribution back to the parameters of
the Weibull distribution to obtain their
estimates. At the end of this article, we use a practical
case to verify the feasibility of the method.


2 MODEL DISTRIBUTION CONVERSION 


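Suppose the life T of a mechanical product follows the Weibull distribution. Its distribution function is:

$$F(t) = 1 - e^{-(t/\eta)^{\beta}}, \quad t > 0,\ \beta, \eta > 0 \qquad (1)$$

Its probability density function is:

$$f(t) = \frac{\beta}{\eta^{\beta}}\, t^{\beta-1}\, e^{-(t/\eta)^{\beta}} \qquad (2)$$

where $t$ is the time, $\beta$ is the shape parameter, and $\eta$ is the characteristic life.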

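The cumulative distribution function is the integral of the probability density up to time $t_s$:

$$F(t_s) = \int_0^{t_s} \frac{\beta}{\eta^{\beta}}\, t^{\beta-1}\, e^{-(t/\eta)^{\beta}}\, dt \qquad (3)$$

Let $z = t^{\beta}$, so that $dz = \beta t^{\beta-1}\, dt$. Then equation (3) can be expressed as:

$$F(t_s) = \int_0^{t_s^{\beta}} \frac{1}{\eta^{\beta}}\, e^{-z/\eta^{\beta}}\, dz \qquad (4)$$

Equation (4) can be further simplified as:

$$F(x) = \int_0^{x} \frac{1}{\theta}\, e^{-z/\theta}\, dz \qquad (5)$$

where $\theta = \eta^{\beta}$ and $x = t_s^{\beta}$.

So equation (5) is the exponential distribution function, whose probability density function at $x$ is:

$$f(x) = \frac{1}{\theta}\, e^{-x/\theta} \qquad (6)$$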


The parameters of this exponential distribution are $\theta = \eta^{\beta}$ and $x = t^{\beta}$. That is to say, if the lifetime $t$ follows the Weibull distribution with shape parameter $\beta$ and characteristic life $\eta$, then after the transformation the test time is expressed through $x$, which follows an exponential distribution with mean life $\theta$.

As can be seen from Equation (6), $x$ is a random variable determined by $t$ and $\beta$, and $1/\theta$ is the parameter of the exponential distribution, which is related to $\eta$ and $\beta$.

If $\beta$ is known, $x$ can be treated as a random variable equivalent to $t$, and the parameter $1/\theta$ depends only on $\eta$. The prior distribution of $\theta$ is then equivalent to the prior distribution of $\eta$, and Bayes theory is used to infer it.

If $\beta$ is unknown, $\theta$ depends on $\eta$ and $\beta$ simultaneously. It can then be inferred with a multi-parameter Bayes method, which is not discussed here.

In general, the shape parameter $\beta$ is often unchanged for products of the same type and batch.
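The transformation can be checked numerically. The following is a minimal sketch, not part of the original derivation: it draws Weibull lifetimes with Python's standard `random` module, raises them to the power $\beta$, and compares the sample mean of $x = t^{\beta}$ with $\theta = \eta^{\beta}$. The function name `transformed_mean`, the sample size and the seed are illustrative assumptions; the values of $\eta$ and $\beta$ are those used later in the case study.

```python
import random


def transformed_mean(eta, beta, n_samples=200_000, seed=0):
    """Sample mean of x = t**beta for Weibull(scale=eta, shape=beta) lifetimes t."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # random.weibullvariate(alpha, beta): alpha is the scale, beta the shape
        t = rng.weibullvariate(eta, beta)
        total += t ** beta
    return total / n_samples


eta, beta = 931.62, 1.3612           # values taken from the case study
theta = eta ** beta                  # mean of the transformed exponential variable
sample_mean = transformed_mean(eta, beta)
print(f"theta = {theta:.1f}, sample mean of t**beta = {sample_mean:.1f}")
```

If the transformation holds, the two printed numbers should agree up to Monte Carlo noise.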


3 PRIOR DISTRIBUTION DETERMINATION 


The basic principle of the Bayes method is to use the sample information to correct the prior information and so obtain an estimate closer to the truth. The formula is:

$$h(\theta \mid x) = \frac{p(x \mid \theta)\, \pi(\theta)}{m(x)}$$

where $h(\theta \mid x)$ is the posterior distribution; $p(x \mid \theta)$ is the likelihood function; $\pi(\theta)$ is the prior distribution; $m(x)$ is the marginal distribution.

Since $m(x)$ does not depend on the parameter $\theta$, it acts only as a normalizing factor when calculating the posterior distribution of $\theta$. Therefore, the posterior distribution is determined by the likelihood function and the prior distribution.

When the value of $\beta$ has been determined from prior information, the prior distribution of the parameter $\theta$ depends only on the characteristic life $\eta$.

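According to engineering experience, the prior distribution of the characteristic life $\eta$ of mechanical products generally follows the inverse gamma distribution, so the posterior and prior distributions of $\theta$ can be written as:

$$h(\theta \mid x) \propto p(x \mid \theta)\, \pi(\theta) \qquad (7)$$

$$\pi(\theta) \sim IGa(a, b) \qquad (8)$$

Its distribution form is:

$$\pi(\theta) = \frac{b^{a}}{\Gamma(a)} \left(\frac{1}{\theta}\right)^{a+1} e^{-b/\theta} \qquad (9)$$

The mean and variance of the distribution are, respectively:

$$E(\theta) = \frac{b}{a-1}, \qquad V(\theta) = \frac{b^{2}}{(a-1)^{2}(a-2)} \qquad (10)$$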


By collecting the corresponding failure information and calculating its mean and variance, the values of the hyperparameters $a$ and $b$ can be obtained from equation (10).
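Solving equation (10) for the hyperparameters gives $a = E^{2}/V + 2$ and $b = E\,(a-1)$. A minimal sketch of this moment matching is given below; the function name `inverse_gamma_hyperparameters` is an illustrative assumption, and the numerical inputs are the mean and variance quoted in the case study of Section 5.

```python
def inverse_gamma_hyperparameters(mean, variance):
    """Match IGa(a, b) to a given mean and variance using equation (10).

    mean = b / (a - 1) and variance = b**2 / ((a - 1)**2 * (a - 2)),
    hence variance = mean**2 / (a - 2).
    """
    if mean <= 0 or variance <= 0:
        raise ValueError("mean and variance must be positive")
    a = mean ** 2 / variance + 2.0
    b = mean * (a - 1.0)
    return a, b


# Mean and variance of eta**beta as quoted in Section 5
a, b = inverse_gamma_hyperparameters(11008.454, 130857196.8)
print(f"a = {a:.3f}, b = {b:.1f}")
```

The unrounded value of $a$ is slightly below 3; Section 5 works with $a = 3$ and $b = 21203$.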


4 RELIABILITY ASSESSMENT 


Assume that the reliability test of a mechanical product is ended by censoring, with sample size $n$ and number of failures $r$. Then the likelihood function based on equation (6) can be expressed as:

$$L(x \mid \theta) = \frac{n!}{(n-r)!} \left[\prod_{i=1}^{r} f(x_i)\right] \left[1 - F(x_r)\right]^{n-r} = \frac{n!}{(n-r)!}\, \theta^{-r} \exp(-X_r/\theta) \propto \theta^{-r} \exp(-X_r/\theta) \qquad (11)$$

where $X_r = \sum_{i=1}^{r} x_i + (n-r)\, x_r$.
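Then the posterior distribution of $\theta$ is:

$$h(\theta \mid x) \propto p(x \mid \theta)\, \pi(\theta) = \theta^{-r} e^{-X_r/\theta} \cdot \frac{b^{a}}{\Gamma(a)} \left(\frac{1}{\theta}\right)^{a+1} e^{-b/\theta} \propto \left(\frac{1}{\theta}\right)^{a+r+1} e^{-(b+X_r)/\theta} \qquad (12)$$

At this point, the posterior mean of $\theta$ is taken as its point estimate:

$$\hat{\theta} = \frac{b + X_r}{a + r - 1} \qquad (13)$$

The posterior distribution of $\theta$ is thus an inverse gamma distribution, and the parameter $1/\theta$ follows a gamma distribution, $\theta^{-1} \sim Ga(a+r,\, b+X_r)$, so:

$$2(b + X_r)\, \theta^{-1} \sim Ga\!\left(a+r, \tfrac{1}{2}\right) = \chi^{2}\big(2(a+r)\big) \qquad (14)$$

From the chi-square distribution, for a confidence level of $1-\alpha$ the interval estimate of $\theta$ is:

$$\left[\frac{2(b + X_r)}{\chi^{2}_{1-\alpha/2}\big(2(a+r)\big)},\ \frac{2(b + X_r)}{\chi^{2}_{\alpha/2}\big(2(a+r)\big)}\right] \qquad (15)$$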

Once the parameters have been estimated with the Bayes method, the corresponding reliability characteristics can also be obtained. The mean time between failures (MTBF) of the product is:

$$MTBF = \eta\, \Gamma\!\left(1 + \frac{1}{\beta}\right) \qquad (16)$$
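The assessment steps of equations (11), (13) and (16) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, and instead of chi-square quantile tables (equation (15)) the interval is approximated by sampling the inverse gamma posterior with the standard library, which is a substitute technique. The inputs are the rounded failure times and prior of Section 5, so the printed values should not be expected to match the figures quoted there digit for digit.

```python
import math
import random


def censored_total(failure_times, n, beta):
    """X_r = sum of x_i over the r failures plus (n - r) * x_r, with x = t**beta."""
    x = sorted(t ** beta for t in failure_times)
    r = len(x)
    return sum(x) + (n - r) * x[-1], r


def posterior_summary(a, b, x_r, r, beta, level=0.90, draws=100_000, seed=0):
    """Posterior mean of theta, a sampled equal-tail interval, eta and MTBF."""
    shape, rate = a + r, b + x_r                 # posterior is IGa(a + r, b + X_r)
    theta_hat = rate / (shape - 1.0)             # equation (13)
    rng = random.Random(seed)
    # 1/theta ~ Gamma(shape, rate); gammavariate takes a scale, i.e. 1/rate
    thetas = sorted(1.0 / rng.gammavariate(shape, 1.0 / rate) for _ in range(draws))
    lower = thetas[int((1.0 - level) / 2.0 * draws)]
    upper = thetas[int((1.0 + level) / 2.0 * draws) - 1]
    eta_hat = theta_hat ** (1.0 / beta)          # back-transform theta = eta**beta
    mtbf = eta_hat * math.gamma(1.0 + 1.0 / beta)  # equation (16)
    return theta_hat, (lower, upper), eta_hat, mtbf


beta = 1.3612
failures = [47.6, 301.2, 432.5, 482.6, 735.0]    # first five ordered times, Table 2
x_r, r = censored_total(failures, n=8, beta=beta)
theta_hat, interval, eta_hat, mtbf = posterior_summary(3.0, 21203.0, x_r, r, beta)
print(x_r, theta_hat, interval, eta_hat, mtbf)
```

Sampling keeps the sketch free of third-party dependencies; with a statistics library available, the chi-square quantiles of equation (15) could be used directly instead.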


5 SIMULATION CASE CALCULATION 


The historical data in this paper are the product failure time data from Zhang et al. (2005), listed in Table 1. The product is a typical mechanical product whose failure intervals follow the Weibull distribution. From the data in Zhang et al. (2005), the shape parameter $\beta = 1.3612$ of the Weibull distribution is calculated by the least squares method. This value is taken as the shape parameter of the failure-interval Weibull distribution for this mechanical product and is assumed to remain unchanged.

From the historical data of the mechanical product in Table 1, with the parameter $\beta$ known, the mean and variance of $\eta_1^{\beta}, \eta_2^{\beta}, \ldots, \eta_n^{\beta}$ are calculated as:

$$E(\eta^{\beta}) = 11008.454, \qquad V(\eta^{\beta}) = 130857196.8$$

From equation (10), the hyperparameters can be calculated as:

$$a = 3, \qquad b = 21203.$$

So the prior distribution for the failure interval of the mechanical product is the inverse gamma distribution $IGa(3, 21203)$.

From the known Weibull shape parameter $\beta$ and the known value of $\eta^{\beta}$, the characteristic life $\eta$ is calculated as 931.62 h.

Using the Monte Carlo simulation method, 8 random numbers following the two-parameter Weibull distribution (shape parameter 1.3612, scale parameter 931.62) are generated and arranged in ascending order $t_1, t_2, \ldots, t_8$ as simulated test data. The test plan assumes a sample of 8 sets of mechanical products from the same batch, with the test censored at the 5th failure. The processed test data are shown in Table 2, where $n$ is the sample size, $T$ is the censoring time, $t_i^{\beta}$ $(i = 1, 2, \ldots, 5)$ is the failure time raised to the power $\beta$, $X_r$ is the sum of the $t_i^{\beta}$ entries in Table 2, and $r$ is the number of failures at censoring.

Setting the confidence level $1-\alpha$ to 90%, equations (13) to (15) give the following results.

The posterior point estimate of $\theta$ is 9264.4.

The interval estimate of $\theta$ is:

$$\left[\frac{129701.7}{\chi^{2}_{1-\alpha/2}(16)},\ \frac{129701.7}{\chi^{2}_{\alpha/2}(16)}\right] = [5509.8,\ 13931.4]$$


Table 1. Fault interval time history data.

No.   1        2        3        4        5        6        7        8        9        10       11       12
η_i   79.45    104.11   143.56   196.16   248.77   250.41   276.71   367.12   395.07   397.81   408.22   446.03
No.   13       14       15       16       17       18       19       20       21       22       23       24
η_i   459.18   552.88   566.03   695.89   698.63   735.34   776.44   814.25   840.55   919.45   932.6    972.05
No.   25       26       27       28       29       30       31       32       33
η_i   986.85   1261.37  1498.08  1590.14  1603.29  1708.49  1894.25  2326.58  2841.1

Table 2. Monte Carlo simulation data.

t_1     t_2     t_3     t_4     t_5     t_6     t_7     t_8     n        T
47.6    301.2   432.5   482.6   735     936.9   1531.6  1644.8  8        735
t_1^β   t_2^β   t_3^β   t_4^β   t_5^β   t_6^β   t_7^β   t_8^β   X_r      r
188.8   2367.1  3873.5  4496.8  7972.4  7972.4  7972.4  7972.4  42815.8  5

Table 3. Data processing and analysis.

                             β        η        MTBF (h)   Sample size (sets)   Test time (h)
Historical experience data   1.3612   922.14   852.06     33                   2841.1
Simulation test data         1.3612   932.18   764.025    8                    1644.8
Bayes method data            1.3612   820.4    758.05     8                    735


Furthermore, the point estimate and interval estimate of $\eta$ are 820.4 and [560.07, 1107.07] respectively, and the estimated value of MTBF is 758.05 h.

Table 3 compares the empirical information and the test data under the different evaluation methods. The least squares method is used to estimate the parameters from the historical data in Zhang et al. (2005), which represent the empirical information. The Bayes method combines the prior information with the test (simulation) information. Its results are similar to those obtained from the simulation data, but the test time is about half.


6 CONCLUSION 


In this paper, we present a method for Bayes reliability growth evaluation based on model transformation, which can effectively address several difficult issues in the reliability evaluation of mechanical products. Its benefits include, but are not limited to, the following:


1. this method avoids the complicated model of the Weibull distribution, reduces the computational complexity of Bayes estimation, simplifies the calculation process and improves the efficiency of the algorithm;
2. this method also saves test time, and greatly reduces the number of samples.


ABSTRACT: A method to measure the contribution of random joint clearances to mechanism output error, termed Error Importance (EI), is presented in this paper. In this method, the 2-order original moment is used to characterize the deviation of mechanism output error and joint clearances from their ideal values, i.e. zero. The 2-order original moment of mechanism output error is then decomposed into a series of terms, which are divided into two categories: individual effects and interaction effects. The total effect of one joint clearance is defined as the sum of its individual (main) effects and the interaction effects that involve it, and its total importance index, i.e. the EI index, is defined as the ratio of its total effect to the 2-order original moment of mechanism output error. The total importance index of a group of joint clearances is defined similarly. The mathematical and physical properties of the EI indices are then discussed, and a Monte Carlo based evaluation method is offered. Finally, the EI indices are applied to the cabin door mechanism of an aircraft landing gear. Simulation results show that the EI indices faithfully reflect the contribution of joint clearances to mechanism output error.


1 INTRODUCTION 


Uncertain joint clearances are inevitable in linkage mechanisms (Han et al. 2002, Zhang et al. 2015), and lead to mechanism output error (Li et al. 2015). The output error caused by joint clearances cannot be compensated with any kind of calibration (Chen et al. 2013). Furthermore, mech[...] (Kumarosvamy et al. 2013), traditional calibration cannot ensure the motion accuracy across the whole workspace. In view of these problems, joint clearances should be controlled below a specified level, so as to ensure the mechanism output accuracy. In order to improve mechanism output accuracy as much as possible under limited resources, it is necessary to study how different joint clearances contribute to mechanism output error.

Joint clearances are random variables that are always positive. Thus the deviation of joint clearances from the ideal value, i.e. zero, is composed of variation and mean shift. Under the combined action of various joint clearances, the deviation of mechanism output from the ideal or design value is also composed of variation and mean shift, as shown in Figure 1. In engineering practice, we care not only about the variation but also about the deviation of mechanism output from the ideal or design value. Thus the contributions of both variation and mean shift should be taken into account in the EI of joint clearances.

Figure 1. Mechanism output error under the joint effect of various joint clearances.

With respect to the problem of identifying the most important joint clearances, Sensitivity Analysis (SA) of random variables is the most related practice. SA is defined as “the study of how the uncertainty in the output of a model (numerical or otherwise) can be apportioned to different sources of uncertainty in the model input” (Saltelli 2002). There are mainly two kinds of SA methods, i.e. Local Sensitivity Analysis (LSA) methods and Global Sensitivity Analysis (GSA) methods (Sobol 2001, Borgonovo et al. 2016, Pianosi et al. 2016). GSA methods are also called uncertainty Importance Measures (IM) in some of the literature (Aven et al. 2010, Borgonovo 2007). SA methods have already been used by many researchers to study the relative importance of input errors. Caro et al. (2009) proposed partial derivative based sensitivity indices for the geometric parameters and actuated variables of 3-RPR planar parallel manipulators. Hanzaki et al. (2009) used partial derivatives to predict how the steering error is affected by manufacturing tolerances, assembly errors, and clearances. GSA methods, e.g. the variance based method, are also used to identify the most important factors that influence the output errors (Cheng et al. 2014, 2015).

However, the current SA methods are not suitable for the problem this paper addresses. From the definition of SA one can see that the present SA methods focus mainly on “uncertainty”, rather than on the deviation from the ideal or design values. Take variance based methods as an example: they use variance to characterize “uncertainty”, implying that the deviation from the “mean values” is what we are interested in. However, the mean value of a joint clearance can never be zero. Thus variance based SA methods will produce unreasonable results if they are used to measure the contribution of joint clearances to mechanism output error. Similar conclusions can also be derived for other SA methods.

With respect to this problem, an importance measure method for joint clearances that takes into consideration the contributions of both variation and mean shift is presented in this paper. In addition, the presented method should satisfy the requirements of being “global, quantitative, and model free” stated by Saltelli (2002), and should be convenient to implement.

The rest of the paper is organized as follows. In section 2, EI indices for one and for a group of joint clearances are proposed based on the decomposition of the 2-order original moment of model output, and a Monte Carlo based evaluation method is offered. In section 3, the mathematical and physical properties of the presented EI indices are discussed. In section 4, an application to the cabin door mechanism of an aircraft landing gear is offered. Section 5 provides a summary and some concluding remarks.


2 IMPORTANCE MEASURE METHOD FOR JOINT CLEARANCES

2.1 2-order original moment decomposition of mechanism output error

As the basis of the IM method, a reasonable index to characterize the deviation of joint clearances and mechanism output from their design or ideal values is necessary. Taguchi et al. (1989) proposed the quadratic quality loss function to measure the deviation of parameters from design values. It takes into consideration both the mean shift and the variation, and has been adopted by many researchers in error or tolerance allocation (Jin et al. 2015). The ideal values of both joint clearances and mechanism output errors are all zero. In this respect, we use the 2-order original moment to measure the deviation of input or output errors from their design values.

Assume a mechanism with N joint clearances, i.e. $X = (X_1, X_2, \ldots, X_N)$, and one output, i.e. Y. The output error depends only on joint clearances. The PDFs of X are $f_{X_1}(x_1), f_{X_2}(x_2), \ldots, f_{X_N}(x_N)$, and that of Y is $f_Y(y)$. The functional relationship between X and Y can be written in the following form:

$$Y = g(X) = g(X_1, X_2, \ldots, X_N) \qquad (1)$$

If there are no joint clearances, the mechanism output error Y will obviously be zero, thus the following equation holds:

$$Y = g(0) = 0 \qquad (2)$$

Assume that the function g(x) is (m+1) times differentiable at the point x = 0. The m-th order Taylor expansion of Equation 1 is:

$$Y = g(0) + \sum_{i_1=1}^{N} \frac{\partial g(0)}{\partial x_{i_1}} x_{i_1} + \frac{1}{2!} \sum_{i_1=1}^{N} \sum_{i_2=1}^{N} \frac{\partial^{2} g(0)}{\partial x_{i_1} \partial x_{i_2}} x_{i_1} x_{i_2} + \cdots + \frac{1}{m!} \sum_{i_1=1}^{N} \cdots \sum_{i_m=1}^{N} \frac{\partial^{m} g(0)}{\partial x_{i_1} \cdots \partial x_{i_m}} x_{i_1} \cdots x_{i_m} + \frac{1}{(m+1)!} \sum_{i_1=1}^{N} \cdots \sum_{i_{m+1}=1}^{N} \frac{\partial^{m+1} g(\theta x)}{\partial x_{i_1} \cdots \partial x_{i_{m+1}}} x_{i_1} \cdots x_{i_{m+1}} \qquad (3)$$

where the last item is the Lagrange remainder, with $0 < \theta < 1$. Because g(0) = 0, the first item on the right side of Equation 3 is always zero. Neglecting the Lagrange remainder, Equation 3 can be simplified into the following form:

$$Y = \sum_{i_1=1}^{N} a_{i_1} x_{i_1} + \sum_{i_1=1}^{N} \sum_{i_2=1}^{N} a_{i_1 i_2} x_{i_1} x_{i_2} + \cdots + \sum_{i_1=1}^{N} \cdots \sum_{i_m=1}^{N} a_{i_1 \cdots i_m} x_{i_1} \cdots x_{i_m} \qquad (4)$$

where the polynomial coefficient, say, $a_{i_1 i_2 \cdots i_s}$, can be computed as follows:

$$a_{i_1 i_2 \cdots i_s} = \frac{1}{s!} \frac{\partial^{s} g(0)}{\partial x_{i_1} \partial x_{i_2} \cdots \partial x_{i_s}}, \quad (s = 1, 2, \ldots, m) \qquad (5)$$

Computing the 2-order original moment of Equation 4, we get:

$$A^{(2)}(Y) = E(Y^{2}) = E\left[\left(\sum_{i_1=1}^{N} a_{i_1} x_{i_1} + \sum_{i_1=1}^{N} \sum_{i_2=1}^{N} a_{i_1 i_2} x_{i_1} x_{i_2} + \cdots + \sum_{i_1=1}^{N} \cdots \sum_{i_m=1}^{N} a_{i_1 \cdots i_m} x_{i_1} \cdots x_{i_m}\right)^{2}\right] \qquad (6)$$

One can see that the right side of Equation 6 is also a polynomial, of order M = 2m, with a lowest order of 2. The equation can be expanded into the sum of a series of items. But for a high order Taylor expansion and a large number of variables, the expansion of Equation 6 becomes very complicated. For convenience, we use $A_{i_1 i_2 \cdots i_q}$ $(q = 2, 3, \ldots, M)$ here to denote the polynomial coefficients, and Equation 6 can be written in the following form:

$$A^{(2)}(Y) = \sum_{i_1=1}^{N} \sum_{i_2=1}^{N} A_{i_1 i_2} E(x_{i_1} x_{i_2}) + \sum_{i_1=1}^{N} \sum_{i_2=1}^{N} \sum_{i_3=1}^{N} A_{i_1 i_2 i_3} E(x_{i_1} x_{i_2} x_{i_3}) + \cdots + \sum_{i_1=1}^{N} \cdots \sum_{i_M=1}^{N} A_{i_1 \cdots i_M} E(x_{i_1} \cdots x_{i_M}) \qquad (7)$$

In fact, the polynomial coefficients, say, $A_{i_1 i_2 \cdots i_q}$, need not be evaluated. Items in Equation 7 can be divided into two categories:

1. The individual effects. The items that include only one joint clearance, e.g. $A_{ii} E(x_i^{2}), A_{iii} E(x_i^{3}), \ldots, A_{i \cdots i} E(x_i^{M})$ $(i = 1, 2, \ldots, N)$. They represent the individual contribution of a joint clearance to the 2-order original moment of mechanism output error.

2. The interaction effects. The items that include at least two different joint clearances, e.g. $A_{i_1 i_2} E(x_{i_1} x_{i_2}), A_{i_1 i_2 i_3} E(x_{i_1} x_{i_2} x_{i_3}), \ldots, A_{i_1 \cdots i_M} E(x_{i_1} \cdots x_{i_M})$ with indices that are not all equal. They represent the interaction effects of the corresponding joint clearances.

The contribution that one joint clearance makes to the mechanism output error should include the corresponding individual effects and interaction effects. Thus we define the “total effect” of $X_i$ as the sum of its main effects and all the interaction effects that include $X_i$. In summary, the individual effect (termed $A_{I}^{(2)}(X_i)$), the interaction effect (termed $A_{In}^{(2)}(X_i)$), and the total effect (termed $A_{T}^{(2)}(X_i)$) of $X_i$ are as follows:

$$A_{I}^{(2)}(X_i) = B_{i}^{(2)} E(x_i^{2}) + B_{i}^{(3)} E(x_i^{3}) + \cdots + B_{i}^{(M)} E(x_i^{M})$$

$$A_{In}^{(2)}(X_i) = \sum_{q=2}^{M} \ \sum_{\substack{(i_1, \ldots, i_q) \ni i \\ \text{not all equal to } i}} B_{i_1 \cdots i_q} E(x_{i_1} \cdots x_{i_q}) \qquad (8)$$

$$A_{T}^{(2)}(X_i) = A_{I}^{(2)}(X_i) + A_{In}^{(2)}(X_i)$$

where the polynomial coefficients, i.e. B, are only used to represent the model structure, and need not be evaluated in the importance measures. In Equation 8, the lower the degree of nonlinearity between X and Y, the smaller the polynomial coefficients of the higher order items, both in the main effects and in the interaction effects.


2.2 Definition of the importance indices

The value of one item divided by the 2-order original moment of Y, say, $A^{(2)}(Y)$, is used to denote the ratio of the item's contribution. With respect to $X_i$, the total importance index, denoted as $EI_i$, is defined as the value of its total effect relative to $A^{(2)}(Y)$:

$$EI_i = \frac{A_{T}^{(2)}(X_i)}{A^{(2)}(Y)} = \frac{A_{I}^{(2)}(X_i) + A_{In}^{(2)}(X_i)}{A^{(2)}(Y)} \qquad (9)$$

With respect to a group of joint clearances, say, $(X_{i_1}, X_{i_2}, \ldots, X_{i_m})$, the total importance index, denoted as $EI_{i_1 i_2 \cdots i_m}$, can be defined in the same manner:

$$EI_{i_1 i_2 \cdots i_m} = \frac{A_{T}^{(2)}(X_{i_1}, X_{i_2}, \ldots, X_{i_m})}{A^{(2)}(Y)} = \frac{A_{I}^{(2)}(X_{i_1}, X_{i_2}, \ldots, X_{i_m}) + A_{In}^{(2)}(X_{i_1}, X_{i_2}, \ldots, X_{i_m})}{A^{(2)}(Y)} \qquad (10)$$

where $A_{I}^{(2)}(X_{i_1}, X_{i_2}, \ldots, X_{i_m})$ denotes the sum of all the main effects of $(X_{i_1}, X_{i_2}, \ldots, X_{i_m})$ and all the interaction effects among $(X_{i_1}, X_{i_2}, \ldots, X_{i_m})$; $A_{In}^{(2)}(X_{i_1}, X_{i_2}, \ldots, X_{i_m})$ denotes the sum of all the interaction effects between $(X_{i_1}, X_{i_2}, \ldots, X_{i_m})$ and the others; $A_{T}^{(2)}(X_{i_1}, X_{i_2}, \ldots, X_{i_m})$ denotes the total effect, defined as the sum of the main effects and the interaction effects.
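As a small worked illustration of these definitions (an added example, assuming a hypothetical two-clearance mechanism with the linear model $Y = X_1 + X_2$ and independent clearances with means $\mu_1, \mu_2$ and standard deviations $\sigma_1, \sigma_2$), the 2-order original moment of the output is

$$A^{(2)}(Y) = E\big[(X_1 + X_2)^{2}\big] = E(X_1^{2}) + E(X_2^{2}) + 2\,E(X_1 X_2) = (\mu_1^{2} + \sigma_1^{2}) + (\mu_2^{2} + \sigma_2^{2}) + 2\mu_1\mu_2 .$$

The individual effect of $X_1$ is $E(X_1^{2}) = \mu_1^{2} + \sigma_1^{2}$, its interaction effect is $2\mu_1\mu_2$, and therefore

$$EI_1 = \frac{\mu_1^{2} + \sigma_1^{2} + 2\mu_1\mu_2}{\mu_1^{2} + \sigma_1^{2} + \mu_2^{2} + \sigma_2^{2} + 2\mu_1\mu_2} .$$

Both the mean $\mu_1$ and the standard deviation $\sigma_1$ enter the index. Because the interaction item $2\mu_1\mu_2$ is counted in the total effect of both $X_1$ and $X_2$, the indices $EI_1$ and $EI_2$ need not sum to one in this example.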


2.3. Monte Carlo based evaluation method 


The derivation of the EI indices is based upon the assumption that the model is m times differentiable. However, this assumption does not always hold in engineering practice. Even when it holds, the computational burden of evaluating the high order partial derivatives is unaffordable, especially for complicated and implicit models. Thus an efficient method is desirable for the evaluation of the presented EI indices.

Looking at the expressions of the main effects and the total effect of $X_i$, we notice that each of their items is a product of $X_i$ with other factors. Hence, if $X_i$ is assigned a value of zero, not only the main effects but all the items that contain $X_i$ become zero. This observation inspires an evaluation method for the EI indices: given that $X_i$ equals zero, evaluate the conditional 2-order original moment of Y, i.e. $A^{(2)}(Y \mid X_i = 0)$. This quantity represents $A^{(2)}(Y)$ minus all the effects (including the main effects and the interaction effects) relevant to $X_i$. Subtracting $A^{(2)}(Y \mid X_i = 0)$ from $A^{(2)}(Y)$ therefore gives the total effect of $X_i$:

$$A_{T}^{(2)}(X_i) = A^{(2)}(Y) - A^{(2)}(Y \mid X_i = 0) \qquad (11)$$

The EI index of $X_i$ can be evaluated as follows:

$$EI_i = \frac{A^{(2)}(Y) - A^{(2)}(Y \mid X_i = 0)}{A^{(2)}(Y)} \qquad (12)$$

Similarly, the EI index of a group of joint clearances can be evaluated as follows:

$$EI_{i_1 i_2 \cdots i_m} = \frac{A^{(2)}(Y) - A^{(2)}(Y \mid X_{i_1}, X_{i_2}, \ldots, X_{i_m} = 0)}{A^{(2)}(Y)} \qquad (13)$$


The EI indices involve the evaluation of the conditional and unconditional 2-order original moments of Y. The most direct and simple method is the Monte Carlo method. The basic procedure is as follows:


1. Take L samples of X randomly based on their PDFs;

2. Run the mechanism model to obtain the corresponding output error values $y_k$ $(k = 1, 2, \ldots, L)$;

3. The 2-order original moment can be evaluated by the following equation:

$$A^{(2)}(Y) \approx \frac{1}{L} \sum_{k=1}^{L} y_k^{2} \qquad (14)$$

The larger the number of samples, the more accurate the simulation results.

4. In the same manner, the conditional 2-order original moment of Y, i.e. $A^{(2)}(Y \mid X_i = 0)$, can also be evaluated given $X_i = 0$.

5. The EI index of $X_i$ can be evaluated based on Equation 12.
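The five steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the clearances are assumed independent, the truncated normal sampler uses simple rejection below zero, `toy_model` is a hypothetical stand-in for the mechanism model, and the distribution parameters are chosen only for demonstration. The same samples are reused for the conditional moments, with the i-th clearance set to zero.

```python
import random


def truncated_normal(rng, mean, std, low=0.0):
    """Rejection-sample a normal variate truncated below at `low`."""
    while True:
        x = rng.gauss(mean, std)
        if x >= low:
            return x


def second_moment(model, samples):
    """Equation 14: average of the squared output errors."""
    return sum(model(x) ** 2 for x in samples) / len(samples)


def ei_indices(model, params, n_samples=20_000, seed=0):
    """Steps 1 to 5: estimate EI_i for every input via Equation 12."""
    rng = random.Random(seed)
    samples = [[truncated_normal(rng, m, s) for m, s in params]
               for _ in range(n_samples)]
    total = second_moment(model, samples)
    indices = []
    for i in range(len(params)):
        # Conditional moment: same samples, i-th clearance forced to zero
        zeroed = [x[:i] + [0.0] + x[i + 1:] for x in samples]
        conditional = second_moment(model, zeroed)
        indices.append((total - conditional) / total)
    return indices


def toy_model(x):
    """Hypothetical output error: one linear item and one interaction item."""
    return 2.0 * x[0] + 0.5 * x[1] * x[2]


params = [(0.14, 0.06), (0.12, 0.04), (0.16, 0.08)]  # (mean, std), illustrative
ei = ei_indices(toy_model, params)
print(ei)
```

Reusing one sample set for both moments reduces the sampling noise in their difference. It relies on the independence assumption, since fixing one clearance at zero must not change the distribution of the others.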


3 PROPERTIES DISCUSSION 


Saltelli (2002) argued that SA should satisfy the requirements of “global, quantitative and model free”. Global means that the technique takes the entire input distribution into consideration. Model free (model independent) means that no assumptions about the functional relationship between the model and its inputs are needed for the SA method to produce accurate results.

With respect to $X_i$, the individual effects are averaged across the whole distribution range of $X_i$ in the presented EI index, so all the possible values of $X_i$ are considered. Similarly, the interaction effects are averaged over the whole distribution range of the joint clearances they involve. Thus the presented EI index reflects the contribution of $X_i$ over the whole distribution range of all the joint clearances. At the same time, the linear effect, the higher order effects, and the interaction effects are all considered. Consequently, the presented EI index possesses the property of “global”. The same conclusion can also be derived for the EI index of a group of joint clearances.

Although partial derivatives are used to derive the importance indices, they need not be computed when evaluating them. In addition, there are no extra assumptions about the nonlinearity or monotonicity of the system model. Thus the indices possess the property of “model free”. The presented EI indices measure the contribution of system input errors to the output error, so they clearly have the property of “quantitative”. Besides, the method presented here can handle a group of factors.


4 APPLICATION TO CABIN DOOR 
MECHANISM OF AIRCRAFT LANDING 
GEAR 


4.1 Problem statement 


The cabin door mechanism is used to cover the landing gear cabin after the aircraft has taken off, and is an important subsystem of the aircraft landing gear, as shown in Figure 2. Kinematic analysis shows that the components in the actuator and linkage mechanism are all under tensile or compressive forces, thus they can be simplified into two-force rods. The cabin door is under a bending moment, and can be simplified into a beam. In addition, the components can be regarded as moving in the same plane. Hence the cabin door mechanism can be simplified into a planar linkage mechanism, as shown in Figure 3.


4.2 Synthesis of joint clearances 


With respect to two-force rods, the joint clearances can be synthesized into the rod length. Multi-body dynamic analysis shows that four of the rods are under pulling force, so the equivalent length of such a rod $R_{ij}$ can be evaluated as:

$$L_{ij} = L_{ij}^{0} + e_i + e_j \qquad (15)$$

where $L_{ij}$ denotes the equivalent length of the rod $R_{ij}$, $L_{ij}^{0}$ denotes its design length, and $e_i$ denotes the clearance of the ith joint. Rod $R_{CE}$ is under press force, thus its equivalent length is:

$$L_{CE} = L_{CE}^{0} - e_C - e_E \qquad (16)$$

where $L_{CE}$ denotes the equivalent length, $L_{CE}^{0}$ is the design length, and $e_C$ and $e_E$ are the clearances of joint C and joint E respectively.

Figure 3. The simplified cabin door mechanism.

Figure 4. The joint center variation of joint G.

Because the cabin door is under a bending moment, the clearance of joint G should be integrated into the variation of the joint center:

$$x_G = e_G \cos\alpha, \qquad y_G = e_G \sin\alpha \qquad (17)$$

where $x_G$ and $y_G$ are the real coordinates of joint G, $e_G$ is the clearance of joint G, and $\alpha$ is the angle between the x axis and the direction of force F, which is the force applied on the fuselage by the cabin door, as illustrated in Figure 4.
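The clearance synthesis of this section amounts to a few arithmetic rules, sketched below. The function names and the numerical inputs are illustrative assumptions (the pairing of lengths and clearances is not taken from the tables), and the joint center offset is measured from the design position of joint G.

```python
import math


def equivalent_length(design_length, clearance_a, clearance_b, in_tension):
    """Equations 15 and 16: clearances lengthen a rod in tension and shorten it in compression."""
    sign = 1.0 if in_tension else -1.0
    return design_length + sign * (clearance_a + clearance_b)


def joint_center_offset(clearance, alpha):
    """Equation 17: offset of the joint center along the force direction (alpha in radians)."""
    return clearance * math.cos(alpha), clearance * math.sin(alpha)


# Illustrative values in mm; not a reproduction of a specific rod in the tables
tension_rod = equivalent_length(510.0, 0.14, 0.12, in_tension=True)
compression_rod = equivalent_length(420.0, 0.16, 0.08, in_tension=False)
x_g, y_g = joint_center_offset(0.06, math.radians(30.0))
print(tension_rod, compression_rod, x_g, y_g)
```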


4.3 Results and discussion 


The positions of joints A, C, D, and G and the design values of the rod lengths are shown in Table 1. The joint clearances are all considered to follow truncated normal distributions, and their distribution parameters are shown in Table 2. In the tables, $x^{0}$ and $y^{0}$ denote the joint center position in the coordinate system illustrated in Figure 3, $L^{0}$ denotes the design value of the rod length, and $e_i$ denotes the value of the ith joint clearance.

The importance indices of the joint clearances are evaluated by the variance based method, the moment independent method, and the presented EI indices respectively. In addition, the partial derivatives (PD) of the output error with respect to the joint clearances at the zero point are also evaluated. Simulation results are shown in Figure 5. From Figure 5 and Table 2 one can see that $e_C$ has the biggest PD value, mean value, and standard deviation. Thus it should contribute the most to the output error of the latch hook. The ranking results of all three methods are consistent with this conclusion. In addition, the three methods produce the same ranking. In order to study how variation of the distribution parameters affects the position error of the latch hook, importance indices are evaluated under different mean values and standard deviations of $e_A$. Simulation results are shown in Figure 6 and Figure 7 respectively.

Table 1. Design value of joints position and rods length.

Part name   Design value/mm   Part name   Design value/mm
x_A^0       800               Liy         510
y_A^0       800               ES          270
x_C^0       510               D           260
y_C^0       310               Le          420
x_D^0       450               De          540
y_D^0       200               Leg         408
x_G^0       0                 Dy          592
y_G^0       0

Table 2. Distribution parameters of joint clearances.

Variable   Mean value/mm   Standard deviation/mm
ey         0.14            0.06
Cain       0.12            0.04
Cp pe      0.12            0.04
ec         0.16            0.08
ep         0.16            0.08
Cr pe      0.08            0.04
Cr pr      0.08            0.04
er         0.08            0.04
eg         0.06            0.03

From Figure 6 and Figure 7 one can see that: (1) by both the moment independent method and the variance based method, the importance index of $e_A$ increases along with the increase of $\sigma_A$, while the importance values of the other joint clearances decline to different degrees; (2) by both the moment independent method and the variance based method, the importance indices of all the joint clearances stay roughly constant as $\mu_A$ increases; (3) by the presented EI indices, the importance value and rank of $e_A$ both increase along with the increase of $\mu_A$ and $\sigma_A$. Clearly the results of the presented EI indices are consistent with common sense, while those of the other two methods are not. This is because the current SA methods, e.g. the moment independent method and the variance based method, focus mainly on “uncertainty”, rather than on the mean shift from the ideal or design values.

In conclusion, the error importance measure is more appropriate when studying the contribution of joint clearances to mechanism output error. Furthermore, under the same modification level of $\mu_A$ and $\sigma_A$, a bigger variation of $EI_A$ is caused by the shift of $\mu_A$. It means that $\mu_A$ has a bigger influence on $EI_A$ than $\sigma_A$. Based on this conclusion, more attention should be paid to decreasing $\mu_A$, so as to decrease the contribution of $e_A$ to the position error of the latch hook.

Figure 5. Importance values of joint clearances by different methods.

Figure 6. Variation of importance indices along with the shift of $\mu_A$.

Figure 7. Variation of importance indices along with the shift of $\sigma_A$.
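The different behaviour of the variance based index and the EI index under a mean shift can be reproduced on a simple closed-form case. The sketch below is an added illustration, not the cabin door model: it assumes a hypothetical linear output $Y = \sum_i c_i X_i$ with independent inputs described only by their means and standard deviations. For this case the first-order variance based index is $c_i^{2}\sigma_i^{2} / \sum_j c_j^{2}\sigma_j^{2}$, while the EI index follows from Equation 12 with $E(Y^{2}) = V(Y) + E(Y)^{2}$.

```python
def variance_index(c, mu, sigma, i):
    """First-order variance based index of input i for a linear model."""
    var_y = sum((cj * sj) ** 2 for cj, sj in zip(c, sigma))
    return (c[i] * sigma[i]) ** 2 / var_y


def ei_index(c, mu, sigma, i):
    """EI index of input i for a linear model with independent inputs."""
    var_y = sum((cj * sj) ** 2 for cj, sj in zip(c, sigma))
    mean_y = sum(cj * mj for cj, mj in zip(c, mu))
    total = var_y + mean_y ** 2                      # E(Y^2)
    var_without = var_y - (c[i] * sigma[i]) ** 2     # variance with X_i fixed at 0
    mean_without = mean_y - c[i] * mu[i]             # mean with X_i fixed at 0
    conditional = var_without + mean_without ** 2    # E(Y^2 | X_i = 0)
    return (total - conditional) / total


c = [1.0, 1.0]                 # hypothetical sensitivity coefficients
sigma = [0.06, 0.04]
mu_low, mu_high = [0.14, 0.12], [0.28, 0.12]   # only the mean of input 0 shifts

s_low, s_high = variance_index(c, mu_low, sigma, 0), variance_index(c, mu_high, sigma, 0)
ei_low, ei_high = ei_index(c, mu_low, sigma, 0), ei_index(c, mu_high, sigma, 0)
print(s_low, s_high, ei_low, ei_high)
```

In this toy case the variance based index does not change when only the mean shifts, whereas the EI index increases, which mirrors observations (2) and (3) above.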


5 CONCLUSIONS 


In this paper, EI indices to measure the contribution of joint clearances to mechanism output error are presented. The 2-order original moment is used to characterize the deviation of joint clearances or mechanism output error from their design values, i.e. zero. The 2-order original moment of output error is decomposed into a series of terms, and the total importance indices for one or a group of joint clearances are defined respectively. The presented importance indices possess the properties of being quantitative, model free, and global. An evaluation method based on Monte Carlo simulation is offered, making the indices convenient to employ.

Finally, the presented EI indices are applied to the cabin door mechanism of an aircraft landing gear. In the application case, the EI indices are used to measure the contribution of joint clearances to the position error of the latch hook. For comparison, the variance based method and the moment independent method are also employed. Simulation results reveal that the presented EI indices reflect the effects of both variation and mean shift of joint clearances, whereas the variance based and moment independent methods reflect mainly the effects of variance. Consequently, the presented EI indices are more appropriate for studying the contribution of joint clearances to mechanism output error.


ABSTRACT: The High-Pressure (HP) turbine blade of an aero-engine is subjected to high temperature and high pressure, and its life is mainly governed by fatigue. In order to model the working condition and locate the critical area, the Finite Element Analysis (FEA) software ANSYS is used. The Chaboche model is employed as the constitutive model to describe the response of the material under cyclic loading in the software. The Fatemi-Socie (FS), Wang-Brown (WB) and Redefined Smith-Watson-Topper (Re-SWT) models, which are based on the critical plane criteria and consider the effects of hardening due to non-proportional cyclic loading and of mean stress on the multiaxial fatigue life of the material, are utilized to estimate the life of the turbine blades. Furthermore, these models are applied to proportional and non-proportional loadings. The analysis results indicate the critical region of the HP turbine blade, which should be strengthened or redesigned to reduce the stress concentration.


1 INTRODUCTION

The relationship between fatigue life and the characteristics of a structure is very difficult to describe accurately in engineering applications because of variable working conditions. Fatigue life prediction therefore plays an important role in fatigue failure analysis. The blade is one of the highest-risk components in the engine. According to the failure data of aircraft components, 70% of the failures are ascribed to the blades (Tao et al., 2000). The blade should have good operating characteristics under different working conditions, and its life may be governed by a series of failure mechanisms, such as fatigue, creep, fracture, yielding, wear, corrosion and erosion. In order to decrease the incidence of turbine blade failure, all aspects that degrade engine performance should be taken into account, such as material properties, loading spectrum, structure and working environment. To identify the mechanical behavior of HP turbine blades in service, it is necessary to use mechanical analysis software to model the blade and its working condition.


2 PREDICTION METHODS

Uniaxial fatigue life prediction models may not be suitable for a turbine blade, which is subjected to multiaxial loadings during flight missions. The methods of multiaxial fatigue life prediction can be divided into three categories based on different criteria: equivalent stress or equivalent strain criteria, the critical plane theory, and energy-based methods. The critical plane theory is one of the most popular and promising theories in multiaxial fatigue life prediction. It is suitable for various loading conditions, but it requires additional effort to obtain LCF (low cycle fatigue) data as preparatory work for the multiaxial fatigue life prediction (Socie and Marquis, 1999).

A classical formulation (Kandil et al., 1982) was proposed to solve multiaxial problems under biaxial loadings, in which the maximum shear strain range $\Delta\gamma_{\max}$ is taken as the main factor leading to failure, and the normal strain range $\Delta\varepsilon_n$ on the same plane is regarded as the secondary damage parameter, given as

$$\frac{\Delta\gamma_{\max}}{2} + s\,\Delta\varepsilon_n = A\,\frac{\sigma'_f}{E}(2N_f)^{b} + B\,\varepsilon'_f(2N_f)^{c}, \qquad A = 1 + \nu_e + s(1-\nu_e), \quad B = 1 + \nu_p + s(1-\nu_p) \tag{1}$$

where $\nu_e$ and $\nu_p$ are the elastic and plastic Poisson's ratios; $\sigma'_f$ is the fatigue strength coefficient; $\varepsilon'_f$ is the fatigue ductility coefficient; $b$ is the fatigue strength exponent; $c$ is the fatigue ductility exponent; $E$ is Young's modulus; $N_f$ is the number of cycles to failure; $s$ is a material parameter that can be obtained from Eq. (2); $\tau'_f$ and $b_\gamma$ are the shear fatigue strength coefficient and exponent, respectively; $\gamma'_f$ and $c_\gamma$ are the shear fatigue ductility coefficient and exponent, respectively; and $G$ is the shear modulus, $G = E/[2(1+\nu_e)]$.


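$$s = \frac{\dfrac{\tau'_f}{G}(2N_f)^{b_\gamma} + \gamma'_f(2N_f)^{c_\gamma} - (1+\nu_e)\dfrac{\sigma'_f}{E}(2N_f)^{b} - (1+\nu_p)\,\varepsilon'_f(2N_f)^{c}}{(1-\nu_e)\dfrac{\sigma'_f}{E}(2N_f)^{b} + (1-\nu_p)\,\varepsilon'_f(2N_f)^{c}} \tag{2}$$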


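In addition, the properties related to the shear strain-life curve can be estimated by (Zhu et al., 2017)

$$\tau'_f = \frac{\sigma'_f}{\sqrt{3}}, \qquad \gamma'_f = \sqrt{3}\,\varepsilon'_f, \qquad b_\gamma = b, \qquad c_\gamma = c \tag{3}$$
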
The mean normal stress $\sigma_{n,\mathrm{mean}}$ on the maximum shear strain plane may reduce the fatigue life. To account for it, a modification of Eq. (1) based on the Morrow model is defined as (Wang & Brown, 1996)

$$\frac{\Delta\gamma_{\max}}{2} + s\,\Delta\varepsilon_n = A\,\frac{\sigma'_f - 2\sigma_{n,\mathrm{mean}}}{E}(2N_f)^{b} + B\,\varepsilon'_f(2N_f)^{c} \tag{4}$$


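Similarly, Fatemi and Socie used the normal stress to replace the normal strain term in Eq. (4), and presented an improvement that covers tension-torsion loadings and non-proportional hardening, shown as (Fatemi and Socie, 1988)

$$\frac{\Delta\gamma_{\max}}{2}\left(1 + k\,\frac{\sigma_{n,\max}}{\sigma_y}\right) = \frac{\tau'_f}{G}(2N_f)^{b_\gamma} + \gamma'_f(2N_f)^{c_\gamma} \tag{5}$$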


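where $\sigma_y$ represents the yield strength, and $k$ is a material constant that can be inferred from uniaxial experimental data, as shown in Eq. (6).

$$k = \left[\frac{\dfrac{\tau'_f}{G}(2N_f)^{b_\gamma} + \gamma'_f(2N_f)^{c_\gamma}}{(1+\nu_e)\dfrac{\sigma'_f}{E}(2N_f)^{b} + (1+\nu_p)\,\varepsilon'_f(2N_f)^{c}} - 1\right]\frac{2\sigma_y}{\sigma'_f(2N_f)^{b}} \tag{6}$$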


In addition, Smith, Watson and Topper established a formulation that includes the mean stress effect under uniaxial cyclic loadings. It can also be applied to multiaxial loads under proportional or non-proportional conditions, and it includes the principal strain range $\Delta\varepsilon_1$ and the maximum stress $\sigma_{n,\max}$ on the principal strain range plane, given as (Smith et al., 1970)

$$\sigma_{n,\max}\,\frac{\Delta\varepsilon_1}{2} = \frac{\sigma'^{\,2}_f}{E}(2N_f)^{2b} + \sigma'_f\,\varepsilon'_f(2N_f)^{b+c} \tag{7}$$


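Similarly, applying the SWT method on the maximum shear strain plane, Eq. (7) can be rewritten as

$$\tau_{n,\max}\,\frac{\Delta\gamma_{\max}}{2} = \frac{\tau'^{\,2}_f}{G}(2N_f)^{2b_\gamma} + \tau'_f\,\gamma'_f(2N_f)^{b_\gamma + c_\gamma} \tag{8}$$

where $\tau_{n,\max}$ is the maximum shear stress on the maximum shear strain plane.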

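A simple assumption is used to decide the critical plane for the SWT theory: if $\tau_{n,\max} \ge \sigma_{n,\max}/\sqrt{3}$, Eq. (8) is applied to predict the fatigue life; otherwise Eq. (7) is chosen. The Redefined-SWT (Re-SWT) model is shown in Eq. (9).

$$\begin{cases} \sigma_{n,\max}\,\dfrac{\Delta\varepsilon_1}{2} = \dfrac{\sigma'^{\,2}_f}{E}(2N_f)^{2b} + \sigma'_f\,\varepsilon'_f(2N_f)^{b+c}, & \tau_{n,\max} < \sigma_{n,\max}/\sqrt{3} \\[2ex] \tau_{n,\max}\,\dfrac{\Delta\gamma_{\max}}{2} = \dfrac{\tau'^{\,2}_f}{G}(2N_f)^{2b_\gamma} + \tau'_f\,\gamma'_f(2N_f)^{b_\gamma + c_\gamma}, & \tau_{n,\max} \ge \sigma_{n,\max}/\sqrt{3} \end{cases} \tag{9}$$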


In general, the WB, FS and SWT models have already gained a certain acceptance in the field of multiaxial fatigue life prediction. Additional uniaxial experimental data (tensile or torsion tests) are required to guarantee the accuracy of the models.
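
Strain-life relations such as the SWT relation give the damage parameter as a function of the life, so the life has to be found numerically. The following is a minimal sketch, not the authors' implementation: it evaluates the right-hand side of the SWT relation, (σ'_f)²/E·(2N_f)^(2b) + σ'_f·ε'_f·(2N_f)^(b+c), with the GH4169 values listed in Table 1 and inverts it by bisection. The function names and the search bounds are assumptions made for illustration.

```python
import math

# GH4169 properties at 650 degC as listed in Table 1 (stresses in MPa).
E = 182_000.0      # Young's modulus
SIGMA_F = 1476.0   # fatigue strength coefficient
EPS_F = 0.162      # fatigue ductility coefficient
B_EXP = -0.086     # fatigue strength exponent
C_EXP = -0.58      # fatigue ductility exponent


def swt_rhs(n_f: float) -> float:
    """Right-hand side of the SWT relation for a life of n_f cycles (MPa)."""
    reversals = 2.0 * n_f
    elastic = SIGMA_F ** 2 / E * reversals ** (2.0 * B_EXP)
    plastic = SIGMA_F * EPS_F * reversals ** (B_EXP + C_EXP)
    return elastic + plastic


def swt_life(damage: float, lo: float = 1.0, hi: float = 1e9) -> float:
    """Solve swt_rhs(N_f) = damage for N_f by bisection in log space.

    `damage` is sigma_max * (strain range / 2) in MPa. Both exponents are
    negative, so swt_rhs decreases monotonically and bisection is enough.
    The bounds lo and hi are illustrative clipping limits.
    """
    if damage >= swt_rhs(lo):
        return lo
    if damage <= swt_rhs(hi):
        return hi
    log_lo, log_hi = math.log(lo), math.log(hi)
    for _ in range(200):
        mid = 0.5 * (log_lo + log_hi)
        if swt_rhs(math.exp(mid)) > damage:
            log_lo = mid  # curve still above the target: life is longer
        else:
            log_hi = mid
    return math.exp(0.5 * (log_lo + log_hi))
```

The same bisection can be reused for the WB and FS relations by swapping in their right-hand sides, since those also decrease monotonically with the life.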


3 MODELLING OF HIGH-PRESSURE 
TURBINE BLADE 


In order to ensure the accuracy of the results and reduce the difficulty of meshing, a simplified three-dimensional model of a 1/n segment of the HP turbine disc (retaining one fir-tree mortise) is created, as illustrated in Figure 1. The blade-disc coupled system is modeled with a nonlinear contact analysis technique to identify the critical region of the fir-tree mortise.

The centrifugal forces and thermal stresses have a significant effect on the static analysis of the turbine blade in LCF. The HP turbine blade is cast from the high-temperature alloy GH4169, which offers good corrosion resistance, high-temperature capability and high strength. The working temperature of the blades is about 600-700°C. The material properties of GH4169 can be found in (Sun et al., 2010; Shi et al., 2001; Yu et al., 2017) and are listed in Table 1, where K' and n' denote the cyclic strength coefficient and cyclic strain hardening exponent, respectively.

The Chaboche model is introduced to model the nonlinear kinematic hardening behavior of GH4169. It considers the effect of viscosity on elasticity (Chaboche, 2008), and it is used to model the isotropic and kinematic hardening effects in the FEA. The Chaboche model with three evolution parts (M = 3) is given as

$$\frac{\Delta\sigma}{2} = \sigma_0 + \sum_{i=1}^{M}\frac{C_i}{\gamma_i}\tanh\!\left(\gamma_i\,\frac{\Delta\varepsilon_p}{2}\right) \tag{10}$$

where the parameters $C_i$ and $\gamma_i$ (i = 1, 2, ..., M) can be obtained from uniaxial test data.


Figure 1. The geometry of blade and blade-disc coupled system.


Table 1. The material properties of GH4169.

T (°C)    E (GPa)   σ'_f (MPa)   ε'_f
650       182       1476         0.162

b         c         K' (MPa)     n'
−0.086    −0.58     1933         0.1483


Table 2. Loading spectrum of aircraft for HP blades.

Statement   Number of cycles n_i   Rotational speed ω (rpm)
S1          1306                   0-18050-0
S2          2006                   9520-18050-9520
S3          24326                  16936-18050-16936


Figure 2. The equivalent plastic strain area of HP turbine blade.

In accordance with the flight mission and an 800-hour field test of HP turbine blades, the loading spectrum can be specified by 3 different statements: S1 (0-max-0), S2 (idle-max-idle), and S3 (cruise-max-cruise).

After this preparatory work, the stress-strain analysis of the HP turbine blade is simulated to identify the stress concentration region, the fir-tree mortise. Cracks initiate there, propagate under continued cyclic loading, and finally lead to rupture or failure, as shown in Figure 2.


4 EXPERIMENTAL VALIDATION 


To assess the applicability and efficiency of the FS, WB and Re-SWT models, multiaxial fatigue experimental data for the turbine blade alloy GH4169 are used. The data comply with the proportional and non-proportional loading paths introduced in (Sun et al., 2010), which were intended for tension-torsion fatigue tests at 650°C. Based on the material properties listed in Table 1, the material parameters s and k are determined by Eqs. (2) and (6), as depicted in Figure 3. The lives predicted by the Re-SWT, WB and FS models for multiaxial loading are compared with the experimental data in Figure 4.


Figure 3. The FS and WB parameters (values of k and s) vs. the fatigue life (life cycles N_f) for GH4169.


Figure 4. Predicted life vs. the experimental life for GH4169 at T = 650°C.

Figure 4 shows good agreement for the WB and FS models: most of the predicted results lie within the ±2 life factor band. The Re-SWT model gives non-conservative results under non-proportional conditions when compared with the experimental lives.


5 APPLICATION 


According to the loading spectrum and the results of the FEA, the lives predicted by the Re-SWT, WB and FS models for the HP turbine blade are summarized in Table 3. The predicted life for S3 tends to infinity, so its damage is taken to be 0. Here, the Miner rule (Miner, 1945), given in Eq. (11), is used to calculate the total damage of 800 hours. Furthermore, the predicted lives in working hours, formulated as Eq. (12), are calculated and listed in Table 4.

$$D_{800} = \frac{n_1}{N_{f1}} + \frac{n_2}{N_{f2}} + \frac{n_3}{N_{f3}} \tag{11}$$

$$T = 800 \times \frac{1}{D_{800}} \tag{12}$$


Table 3. Fatigue life prediction of HP turbine blade.

                  S1       S2        S3
n_i               1306     2006      24326
N_fi (Re-SWT)     10294    28707     >10^7
N_fi (WB)         9833     41686     >10^7
N_fi (FS)         13624    101363    >10^7


Table 4. Working hours of HP turbine blade.

           Re-SWT    WB        FS
D_800      0.1967    0.1820    0.1164
T (h)      4067      4439      6872
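
The Miner summation and the conversion to working hours can be reproduced in a few lines. The sketch below is illustrative only: it uses the Re-SWT column of Table 3 and treats S3 as non-damaging (infinite life), as stated in the text. The function names are assumptions.

```python
def miner_damage(cycles, lives):
    """Linear (Miner) damage sum of n_i / N_fi over all load statements."""
    return sum(n / big_n for n, big_n in zip(cycles, lives))


def predicted_hours(damage, block_hours=800.0):
    """Working hours until the accumulated damage reaches 1."""
    return block_hours / damage


# Cycle counts per 800-hour block (Table 2) and Re-SWT lives (Table 3).
cycles = [1306, 2006, 24326]
lives_re_swt = [10294, 28707, float("inf")]  # S3: life tends to infinity

d_800 = miner_damage(cycles, lives_re_swt)
t_hours = predicted_hours(d_800)
print(f"D_800 = {d_800:.4f}, T = {t_hours:.0f} h")
```

With these inputs the damage and life agree with the Re-SWT column of Table 4 to within rounding.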


6 CONCLUSIONS 


In this paper, a set of experimental data for the blade material GH4169 under proportional and non-proportional loadings is used to evaluate the Re-SWT, WB and FS models. The Re-SWT model is found to be unsuitable for predicting the fatigue life under non-proportional loadings. Furthermore, the three models are used to estimate the fatigue life based on the FEA of the HP turbine blade under severe working conditions. The predictions of the Re-SWT and WB models are more conservative than those of the FS model.

ABSTRACT: Maintenance and logistics optimization was applied to a fleet of drones, operating from three sites with a central stock. The optimization achieved a Life Cycle Cost reduction of 34% while the fleet availability increased. This paper presents the optimization process, methods and results. Similar methods can be applied to a variety of other fleets.


1 INTRODUCTION 


1.1 Motivation 


The drone industry is one of the fastest growing markets today (Forny & van der Meulen 2017). Drone failures pose both safety and financial risks, yet the drone failure rate is much higher than the failure rate of manned aircraft (Bone & Bolkcom 2003). Therefore, a great need exists for logistics and maintenance optimization of drone fleets.

In this paper we present an example for model- 
ling and optimizing the maintenance policy of a 
fleet of drones. 


1.2 Case description 


A fleet of 11 surveillance drones, operating from 
3 different sites is considered (4 drones in site 1, 4 
drones in site 2, and 3 drones in site 3). A central 
stock services the three sites. 

Site surveillance is considered not operational when more than one drone has failed. During such downtime a penalty is paid by the drone operator to the site owner.
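
The operational criterion above is a k-out-of-n condition: a site is up as long as at most one of its drones is down. As a rough illustration only, and not the model used by the optimization software, the sketch below computes the site availability under the simplifying assumptions that drones fail independently and share the same availability. The per-drone availability of 0.95 is a hypothetical input.

```python
from math import comb


def site_availability(n_drones: int, drone_availability: float,
                      max_failed: int = 1) -> float:
    """Probability that at most `max_failed` of `n_drones` are down.

    Assumes independent, identical drones (a simplification).
    """
    q = 1.0 - drone_availability  # probability that one drone is down
    return sum(
        comb(n_drones, k) * q ** k * drone_availability ** (n_drones - k)
        for k in range(max_failed + 1)
    )


# Hypothetical per-drone availability, for illustration only.
a_drone = 0.95
for site, n in (("site 1", 4), ("site 2", 4), ("site 3", 3)):
    print(site, round(site_availability(n, a_drone), 5))
```

Under these assumptions a 3-drone site is more available than a 4-drone site, because fewer drones means fewer ways for two of them to be down at once.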

In order to optimize the fleet logistics and main- 
tenance policy, the fleet operation had to be mod- 
eled. Following is a list of the parameters which 
were used to create a detailed model of the fleet 
behavior: 


Reliability Data

• Component failure distribution
• Component failure modes
• Drone redundancies
• Operation profile


Maintenance Data

• Component repair/discard policy
• Repair time
• Corrective maintenance
• Preventive maintenance
• Inspections


Logistic Data

• Spare parts
• Transportation times
• Procurement time


Financial Data

• Cost of spare parts
• Penalties due to operation agreement
• Corrective maintenance
• Preventive maintenance
• Inspections


Figure 1 presents the fleet breakdown tree. The fleet tree includes three main branches (one for each operation site), and under each branch the drones and their components are described. This study focused on several drone sub-systems: Navigation, GPS, Inertial Measurement Unit (IMU), and the flaps.

The “Reliability Model” column in Fig. 1 
describes the relevant model for each sub-system. 
For example: the GPS sub-system includes two 
redundant GPS units (parallel model). 
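
As a worked illustration of the parallel model, assume two independent, identical GPS units in active redundancy, each with an exponential failure distribution of rate $\lambda$ and no repair during the mission. The symbol $\lambda$ and these assumptions are introduced here for illustration. The sub-system fails only if both units fail, so

```latex
\begin{aligned}
R_{\mathrm{GPS}}(t) &= 1 - \bigl(1 - e^{-\lambda t}\bigr)^{2}
                     = 2e^{-\lambda t} - e^{-2\lambda t}, \\
\mathrm{MTTF}_{\mathrm{GPS}} &= \int_{0}^{\infty} R_{\mathrm{GPS}}(t)\,dt
                     = \frac{2}{\lambda} - \frac{1}{2\lambda}
                     = \frac{3}{2\lambda}.
\end{aligned}
```

Under these assumptions the redundant pair therefore lasts on average 1.5 times as long as a single unit.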

The “Distribution Type” column in Fig. 1 
presents the failure distribution type for each 
component. Electronic components were 
assigned an Exponential failure distribution 
whereas the mechanical gyros and flaps were 
given a Normal distribution that describes their 
ageing behavior.

Figure 1. Fleet breakdown tree.


2 CALCULATION DETAILS 


A commercial software tool (apmOptimizer) was used for the optimization. The software employs a combination of analytic methods (Birolini 1999) for calculating the fleet Life-Cycle Cost (LCC) and identifying cost and failure drivers. The analytic methods include:


• Markov chains for modelling spare parts supply, demand, and spare waiting times.
• Block mean failure rate calculations that account for component failure distributions, reliability models, scheduled maintenance, inspections and the mission profile.


While analytic calculation is not as flexible as Monte-Carlo simulation, the analytic method is much faster. The speed of evaluating each model allowed for a fast optimization of the maintenance and logistic policies using modified Dynamic Programming.

Dynamic Programming algorithms (Cormen 
et al. 2009) are ideal for bottom-up optimization 
of trees where the tree branches are independent. 

However, in the fleet case the branches are not 
completely independent: A central stock services 
the three sites, therefore a failure in one site affects 
spare part availability in the other sites. A modi- 
fied dynamic programming algorithm was used 
to account for the inter-site dependencies. For 
example: The Markov chain model that describes 
the GPS voter spare parts supply and demand 
accounts for all 11 operating units, serviced by a 
single central stock. 
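
To make the idea of a Markov chain for spare parts supply and demand concrete, the sketch below solves a deliberately simple birth-death chain. It is not the model implemented in the commercial software. The state is the number of units currently in resupply, demands arrive at the pooled rate of all operating units, and each outstanding unit is resupplied independently. All rates are hypothetical values chosen for illustration.

```python
import math


def pipeline_distribution(n_units, failure_rate, resupply_rate, max_state=200):
    """Steady-state probabilities of j units being in resupply.

    Birth rate: n_units * failure_rate (pooled demand on the central stock).
    Death rate in state j: j * resupply_rate (independent resupply).
    The chain is truncated at max_state, which is far in the tail here.
    """
    demand = n_units * failure_rate
    weights = [1.0]
    for j in range(1, max_state + 1):
        # Detailed balance: pi_j * (j * mu) = pi_{j-1} * demand
        weights.append(weights[-1] * demand / (j * resupply_rate))
    total = sum(weights)
    return [w / total for w in weights]


def stockout_probability(spares, n_units, failure_rate, resupply_rate):
    """Probability that more units are in resupply than spares are stocked."""
    pi = pipeline_distribution(n_units, failure_rate, resupply_rate)
    return sum(pi[spares + 1:])


# Hypothetical numbers: 11 units, 1e-3 failures/h each, 100 h mean resupply.
for s in range(4):
    print(s, round(stockout_probability(s, 11, 1e-3, 1 / 100), 4))
```

Because the demand rate is pooled over all 11 units, the same central stock level serves every site, which is the coupling between branches described above.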

The optimization goal is to achieve high reli- 
ability and availability while minimizing the LCC. 
Optimization was achieved by using the following 
optimization modules: 


• Optimal LOR: Level Of Repair Analysis Optimization, i.e. Repair/Discard policy. Repair is usually cheaper than buying a new component; however, a long repair time (compared to procurement time) may require a large and expensive safety stock. Therefore, the discard strategy is sometimes advantageous even when repair is cheaper than buying a new component.
• Optimal PM: Preventive Maintenance Optimization. Periodic maintenance is required for components that exhibit an ageing behavior (failure rate increases with time) and cannot be inspected for degradation. In our case scheduled maintenance is relevant to the mechanical gyros.
• Optimal PdM: Predictive Maintenance Optimization, i.e. the inspection schedule. Periodic inspections are used to identify flap degradation. Beyond a degradation threshold, preventive maintenance is used to rejuvenate the flaps.
• Optimal I: Inventory Optimization. Optimal I finds the optimal combination of spare parts that minimizes the fleet LCC. The optimal spare part combination is as cheap as possible while ensuring a low probability of downtime due to stock out.


In order to emulate the case of under-maintained 
fleets, an initial maintenance policy was defined 
with few spare parts, inspections and scheduled 
maintenance events. In each optimization step some 
maintenance actions/spare parts were changed in 
order to find the optimal combination. 


3 RESULTS 


Fleet availability at each site as well as the fleet LCC were calculated at each step of the optimization process. Table 1 presents a summary of fleet availabilities at the various sites and the total LCC at each optimization step.


Table 1. Summary of fleet availabilities at the various sites and the total LCC at each optimization step.

Site   Initial    Optimal-LOR   Optimal-PM   Optimal-PdM   Optimal-I
1      99.15%     99.15%        99.15%       99.4%         99.4%
2      99.15%     99.15%        99.15%       99.4%         99.4%
3      99.36%     99.36%        99.36%       99.5%         99.5%
LCC    $47.84M    $46.9M        $45.58M      $31.96M       $31.5M


It can be seen from Table 1 that the optimizations resulted in increased fleet availability and an LCC reduction of 34%. Optimal-LOR, Optimal-PM, and Optimal-I decreased the LCC but had a small effect on site availabilities. Optimal-PdM had a strong effect on both availability and LCC. This is not surprising since increased availability means a lower downtime financial penalty.
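
The contribution of each optimization step can be read directly from the LCC row of Table 1. The short sketch below simply redoes that arithmetic. The variable names are illustrative.

```python
# LCC after each optimization step, in million USD (Table 1).
lcc_musd = {
    "Initial": 47.84,
    "Optimal-LOR": 46.9,
    "Optimal-PM": 45.58,
    "Optimal-PdM": 31.96,
    "Optimal-I": 31.5,
}

steps = list(lcc_musd.items())
# Saving attributed to each step: LCC before the step minus LCC after it.
savings = {
    steps[i][0]: round(steps[i - 1][1] - steps[i][1], 2)
    for i in range(1, len(steps))
}
total_reduction = (lcc_musd["Initial"] - lcc_musd["Optimal-I"]) / lcc_musd["Initial"]

print(savings)
print(f"total reduction: {total_reduction:.1%}")
```

The predictive maintenance step accounts for most of the total saving, in line with the discussion of Table 1.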


4 CONCLUSIONS 


Fleet operators can reduce operation costs without 
jeopardizing performance by using analytic tools 
such as the apmOptimizer. 

The modelling and optimization methods which 
were used in the example are applicable to any 
fleet, and are therefore relevant to many industries: 
defense, rolling stock, aviation, and mining. 


Furthermore, the method is also good for 
modelling MRI medical machines, industrial 
printers, and other sets of identical machines that 
are operated at different sites and are maintained 
by the OEM. 

Statistical test planning using prior knowledge: advancing the approach of Beyer and Lauster

ABSTRACT: An approach is presented to include prior knowledge in the test planning of product reliability, based on the approach of Beyer and Lauster (Beyer and Lauster, 1990), such that several different sources of prior knowledge can be accounted for. Furthermore, this advanced approach is independent of Beyer and Lauster's prerequisite that the confidence C of the prior knowledge in the form of R_0(t) must be C = 63.2%. A nomogram is presented that is consistent with the pragmatic suggestions of Beyer and Lauster (Beyer and Lauster, 1990) yet extends the applicability of the original version. This paper provides a mathematically sound extension to the original paper: it broadens the applicability to more than one source of prior knowledge, removes the constraint C(R_0(t)) = 63.2%, allows for a partial transfer of prior knowledge, and accounts for accelerated lifetime tests.


1 INTRODUCTION

Physical product testing is an important tool to validate a product's maturity with respect to reliability. It is used to demonstrate the product's reliability. When planning such a validation test, budget, time and accuracy have restrictive effects. Budget may be limited and time to market may be constrained, while the accuracy of the derived conclusion is to be maximized. To meet the requirements of efficient yet effective reliability validation, prior knowledge can be used, which could stem from former development stages, previous product generations or similar products such as benchmarks.


2 APPROACH OF BEYER AND LAUSTER

The approach of Beyer and Lauster is based on Bayes' theorem, which allows information on the product's reliability from current tests to be linked with information from prior knowledge, e.g. tests at a former development stage or previous product generations. In (Beyer and Lauster, 1990) the approach is divided into one part dealing with constant failure rates and one part dealing with non-constant failure rates (i.e. $\beta \neq 1$).


2.1 Test planning with constant failure rate

The failure probability of test samples with constant failure rates is based on the Poisson distribution (Kececioglu, 2002). The confidence level is then calculated by equation (1).

$$C = 1 - e^{-n t_t \lambda_{\max}} \sum_{i=0}^{x} \frac{(n t_t \lambda_{\max})^{i}}{i!} \tag{1}$$

where $n$ stands for the number of samples, $t_t$ for the test time, $\lambda_{\max}$ for the desired maximum failure rate and $x$ for the number of failures during the test.

Prior knowledge is considered through the corresponding failure rate $\lambda_0$ with test time $t_0$ and sample size $n_0$ ("0" connoting the prior knowledge). If the test to which the prior knowledge relates was passed without failed samples, the failure rate's density function is given by (2) with $n_0 t_0 = 1/\lambda_0$.

$$f(\lambda) = \frac{1}{\lambda_0}\, e^{-\lambda/\lambda_0} \tag{2}$$

If a Poisson distribution is assumed to describe the conditional probability of the current test results, the confidence shown in (3) is calculated by means of Bayes' theorem.

$$C = 1 - \sum_{i=0}^{x} \frac{1}{i!}\left[\left(n t_t + \frac{1}{\lambda_0}\right)\lambda_{\max}\right]^{i} e^{-\left(n t_t + \frac{1}{\lambda_0}\right)\lambda_{\max}} \tag{3}$$

Note that the prior knowledge in the form of the constant failure rate $\lambda_0$ needs to satisfy $C_0(\lambda_0) = 63.2\%$, which limits the approach's applicability.


2.2 Test planning with non-constant failure rates

If failed samples are not replaced during a test, which is typically the case in practice, the Poisson distribution can no longer describe the product's reliability. The density of the reliability R is then described by means of a Beta distribution, cf. (4) for no failures.

$$f(R) = n_0\, R^{\,n_0 - 1} \tag{4}$$

After combining the prior knowledge from (4) with the result of the current test encompassing $n$ samples with $x$ failures, a lifetime ratio $L_v$ (test time $t_t$ over the desired lifetime $t_s$) and the shape parameter $\beta$ of the underlying Weibull distribution, the confidence can be calculated by (5).
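
The constant-failure-rate case of Section 2.1 lends itself to a short numerical check. The sketch below is written under stated assumptions and is not code from the original approach: it evaluates the Poisson-based confidence for a test with n samples, test time t and x failures, and optionally adds prior knowledge as an equivalent failure-free exposure of 1/λ0, which is what the Bayesian update with an exponential prior density amounts to. The function and parameter names are illustrative.

```python
import math


def confidence_constant_rate(n, t, lam_max, x=0, lam0=None):
    """Confidence that the failure rate does not exceed lam_max.

    n, t  : number of samples and test time per sample
    x     : number of failures observed in the test
    lam0  : prior failure rate; if given, the prior contributes an
            equivalent failure-free exposure of 1 / lam0
    """
    exposure = n * t
    if lam0 is not None:
        exposure += 1.0 / lam0
    a = exposure * lam_max
    return 1.0 - math.exp(-a) * sum(a ** i / math.factorial(i)
                                    for i in range(x + 1))


# Prior knowledge alone, evaluated at lam_max = lam0, gives 1 - 1/e = 63.2 %.
print(round(confidence_constant_rate(0, 0.0, 1e-4, lam0=1e-4), 4))
# Ten samples tested for 1000 h each, with and without the same prior.
print(round(confidence_constant_rate(10, 1000.0, 1e-4), 4))
print(round(confidence_constant_rate(10, 1000.0, 1e-4, lam0=1e-4), 4))
```

The first call illustrates where the fixed 63.2% confidence of the prior knowledge comes from: a failure-free exposure of exactly 1/λ0 always yields 1 − e⁻¹.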


The basic relation between the confidence level and the reliability in general, i.e. without failures and neglecting lifetime ratios etc., follows (6).

$$C_0 = 1 - R_0^{\,n_0} \tag{6}$$

With $C_0 = 63.2\%$, an expression for $n_0$ results from (6) that no longer contains the fixed $C_0$:

$$n_0 = \frac{1}{\ln(1/R_0)} \tag{7}$$


Figure 1. Nomogram for test planning with prior knowledge in the form of R_0 (C_0 = 63.2%) (Beyer and Lauster, 1990). Curves are shown for testing with replacement and for testing without replacement (lower line for n = 10), plotted over the sample size n.



Equation (7) combined with (5) results in (8):

$$C = 1 - R(t_s)^{\,n L_v^{\beta} + \frac{1}{\ln(1/R_0)}} \sum_{i=0}^{x} \binom{n + \frac{1}{L_v^{\beta}\ln(1/R_0)}}{i} \left(\frac{1 - R(t_s)^{L_v^{\beta}}}{R(t_s)^{L_v^{\beta}}}\right)^{i} \tag{8}$$

Please note that Equation (8) is only valid for $C_0 = 63.2\%$. For $R_0 \to 0$, i.e. for nonexistent prior knowledge, (8) is reduced to (9):

$$C = 1 - R(t_s)^{\,n L_v^{\beta}} \sum_{i=0}^{x} \binom{n}{i} \left(\frac{1 - R(t_s)^{L_v^{\beta}}}{R(t_s)^{L_v^{\beta}}}\right)^{i} \tag{9}$$

Equation (9) cannot be derived directly. Prior knowledge has to be transferred fully. In (Beyer and Lauster, 1990) a nomogram is offered for the practical application of equations (8) and (9) for test planning, cf. Figure 1. (The nomogram is also published and discussed in (Krolo, 2004, pp. 41-47)).


3 EXTENSION OF THE APPROACH 
OF BEYER AND LAUSTER 


The extension of the original approach by Beyer 
and Lauster (cf. Chapter 2) is motivated by the fol- 
lowing aspects: 


— The consideration of prior knowledge in the 
form of R, at arbitrary confidence levels C,. 

— The consideration of several sources of prior 
knowledge. 

— The consideration of accelerated lifetime tests 
by introducing an acceleration factor r in line 
with (Krolo et al., 2002). 

— The consideration of a nuanced incorporation of the existing prior knowledge by introducing a transformation factor Φ.


In order to allow for the fact that the prior knowledge might stem from previous product generations, former development stages or similar products, a transformation factor Φ is suggested in (Krolo, 2004). It makes it possible to transfer less than 100% of the prior knowledge, i.e. 0 ≤ Φ ≤ 1. Ways to quantify the transformation factor, i.e. the degree to which products or product generations are similar, are found in e.g. (Krolo, 2004; Hitziger and Bertsche, 2005; Schweizer et al., 2015).

Ways to quantify the acceleration factor r are found in e.g. (Jakob et al., 2017).

Both the acceleration factor r and the transformation factor Φ can be included in (8). Thus, (10) results, where the presumed confidence of $R_0$ is still $C_0(R_0) = 63.2\%$.

$$C = 1 - R(t_s)^{\,n\, r^{\beta_r} L_v^{\beta} + \frac{\Phi}{\ln(1/R_0)}} \sum_{i=0}^{x} \binom{n + \frac{\Phi}{r^{\beta_r} L_v^{\beta}\ln(1/R_0)}}{i} \left(\frac{1 - R(t_s)^{r^{\beta_r} L_v^{\beta}}}{R(t_s)^{r^{\beta_r} L_v^{\beta}}}\right)^{i} \tag{10}$$

Equation (10) contains a shape parameter $\beta_r$, which accounts for a possibly different shape of the Weibull distribution observed in the results of the accelerated test with the acceleration factor r as compared to the one observed in the field under regular conditions. It can be observed that $\beta_r$ tends to be slightly greater than $\beta$, while it is commonly assumed that $\beta_r = \beta$ (Juskowiak and Bertsche, 2016).

From (6) results (11), similar to (7):

$$n_0 = \frac{\ln(1 - C_0)}{\ln(R_0)} \tag{11}$$

Prior knowledge with individually arbitrary confidence levels $C_{0,i}$ can be combined in the way (12) suggests. A total sample size $n_{0,\mathrm{sum}}$ is thus calculated, stemming from k different sources of prior knowledge. In doing so, individual transformation factors $\Phi_i$ account for different degrees of similarity between the product of interest and the source of prior knowledge.

$$n_{0,\mathrm{sum}} = \sum_{i=1}^{k} \Phi_i\, \frac{\ln(1 - C_{0,i})}{\ln(R_{0,i})} \tag{12}$$

With (12), (10) is amended to (13):

$$C = 1 - R(t_s)^{\,n\, r^{\beta_r} L_v^{\beta} + n_{0,\mathrm{sum}}} \sum_{i=0}^{x} \binom{n + \frac{n_{0,\mathrm{sum}}}{r^{\beta_r} L_v^{\beta}}}{i} \left(\frac{1 - R(t_s)^{r^{\beta_r} L_v^{\beta}}}{R(t_s)^{r^{\beta_r} L_v^{\beta}}}\right)^{i} \tag{13}$$


4 A NOMOGRAM ENHANCING THE TEST PLANNING'S APPLICABILITY


4.1 The nomogram as proposed by (Beyer and Lauster, 1990)


Beyer and Lauster offer in (Beyer and Lauster, 1990) a nomogram which aids decision makers in including prior knowledge in statistical test planning. Figure 1 shows the original version, which accounts for one source of prior knowledge in the form of R_0(t) with confidence C_0 = 63.2%. This prior knowledge is assumed to be fully transferable to the current version of the product, and the current tests must not be accelerated.


4.2 Example #1 — Beyer and Lauster’s approach 


This example is extracted from (Beyer and Lauster, 
1990). It is used in order to demonstrate consist- 
ency with the advanced approach for statistical test 
planning elaborated in this paper. 

The target reliability level of a given product is set to R(t_s = 20000 h) = 90%, i.e. B_10 = 20000 h. This reliability target is to be established with a confidence of C = 85%. From a previous and very similar product generation it is known that R_0 = 90% and that the derived R_0(t) follows a Weibull distribution with a shape parameter of β_0 = 2, with β_0 = β. From Figure 2 it follows that

n · L_v^β = 8.5 (14)

If e.g. n = 5 samples are allowed for, then the lifetime ratio is L_v = 1.3. The failure-free test time per sample therefore amounts to about 26000 h, cf. (13). It becomes clear from (8) that if prior knowledge was neglected, 5 additional samples would be required to demonstrate the target reliability under otherwise identical conditions.


Figure 2. Nomogram for test planning with prior knowledge in the form of R_0 (C_0) according to the advanced approach.


4.3 Development of a nomogram for the 
advanced approach 


A nomogram is developed to aid practical test planning in line with this paper's ambition as described above. It accounts for prior knowledge (R_0 with C_0), which can be partially or fully transferred to the current product for which a certain reliability is to be demonstrated and whose demonstration tests may be run under accelerated test conditions.

Figure 2 illustrates the adapted version of the nomogram. Example #2 is implemented in it. Its different fields are numbered.


Field 1: The acceleration factor r (here, 1 ≤ r ≤ 4) of the current product's test (or the test of the product's current state) is taken to the power of β_r.

Field 2: The lifetime ratio L_v (test time t_t over the desired lifetime t_s) of the current product's test is taken to the power β.

Field 3: The results from field 1 and field 2 get multiplied. The product is found by extending the resulting point in field 1 vertically and that of field 2 horizontally, followed by an extension parallel to the oblique lines in field 3 upon arrival at field 3.

Field 4: The result from field 3 gets multiplied with the sample size n of the current product's test. The sample size can be chosen in the form of one of the oblique lines. The point found in field 3 is horizontally extended to the given sample size line. The point then found in field 4 is extended vertically to serve as input to field 7.

Field 5: The prior knowledge is taken into consideration here. Its reliability level R_0 as well as the corresponding confidence level C_0 are chosen. C_0 is represented by one of the oblique lines. Field 5 calculates what could be considered the sample size of the prior knowledge n_0:

$$n_0 = \frac{\ln(1 - C_0)}{\ln(R_0)} \tag{15}$$

This version of the nomogram deals with one source of prior knowledge, although it can be seen from (12) and (13) that k sources can be dealt with. The prior knowledge's fictional sample size n_0 would then simply be the sum of the individual n_0 found through field 5.


Field 6: The result from field 5 (cf. (13)) gets multiplied with the transformation factor Φ, which is represented by the oblique line corresponding to the degree to which the prior knowledge is trusted to represent the current product (or the current product's state). The point thus found in field 6 is extended horizontally toward field 7.

Field 7: Here, the results from fields 4 and 6 are summed up to what could be considered the total sample size N, containing the sample size n of the current product's test as well as the fictional sample size n_0 calculated previously. The corresponding point in field 7 is where the result from field 6, extended along the corresponding plotted line in field 7 (i.e. the one entered at), intersects with the vertically extended line from the resulting point of field 4.

Field 8: The desired reliability R(t), represented by the corresponding oblique line, is taken to the power of N. C results if no failures were observed during the test of the n specimens, cf. (13).

Field 9: Field 9 accounts for the number of observed failures x during the test of the n specimens. Here, 0 ≤ x ≤ 3. The resulting C can be read off by extending the point of intersection in field 9 horizontally to the right. The upper of each pair of equidistant lines represents the relationship between C and the number of failures with the supposed replacement of the failed units. In case failed units are not replaced, Beyer and Lauster's suggestion is followed, since (13) cannot be illustrated in the same way as (8) (Beyer and Lauster, 1990). A hatched area is introduced which is valid for large n. For large n, C from (13) is approximately equal to C from (8). Although it is suggested in (Beyer and Lauster, 1990) that this assumption holds true for n = 10, the authors recommend it for n > 20 (the deviation then amounting to 2.1%, given the conditions of example #1).


Note that the order of application through the 
described fields depends on the goal of the given 
lifetime test planning or evaluation. If the goal is to 
determine a confidence C which can be assured by 
the combination of a given test with prior knowl- 
edge, the order is one to nine. If the goal is to 
determine how many specimens n need to be tested
under given conditions in order to assure a certain 
reliability R with a specific C,, then the order is 9 
through 4 and 1 through 4. 
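
The arithmetic behind fields 6 to 9 can be sketched in a few lines. The following Python sketch assumes the standard success-run relation C = 1 - R^N for a test without failures and the binomial relation for x observed failures. It is not a transcription of (8) or (13), and all function names and input values are hypothetical.

```python
import math


def total_sample_size(field4_result: float, field5_result: float, phi: float) -> float:
    """Fields 6 and 7: weight the fictional sample size by the transformation
    factor and add it to the result of field 4."""
    return field4_result + phi * field5_result


def confidence_zero_failures(reliability: float, n_total: float) -> float:
    """Field 8: confidence when no failures were observed, C = 1 - R**N."""
    return 1.0 - reliability ** n_total


def confidence_with_failures(reliability: float, n: int, x: int) -> float:
    """Confidence for x observed failures among n specimens.

    Assumed textbook binomial relation; the nomogram itself relies on the
    paper's own equations.
    """
    accepted = sum(
        math.comb(n, i) * (1.0 - reliability) ** i * reliability ** (n - i)
        for i in range(x + 1)
    )
    return 1.0 - accepted


def required_sample_size(reliability: float, confidence: float) -> int:
    """Reverse direction: smallest failure-free sample size assuring R with C."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


# Hypothetical inputs, chosen only to exercise both directions of use.
n_total = total_sample_size(field4_result=20.0, field5_result=10.0, phi=0.8)
print(n_total, confidence_zero_failures(0.9, n_total))
print(required_sample_size(0.9, 0.9))
```

The first two functions follow the forward order (determine C for a given test), while `required_sample_size` corresponds to the reverse order in which the number of specimens is sought.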


4.4 Example #2 


This example adds an acceleration factor r, a
confidence level C, and a transformation factor Φ to
example #1. If r = 2, C, = 90% and Φ = 0.8, then
C = 99.55% results. As compared to the original
Beyer and Lauster approach, where C, = 63.2%, the
resulting confidence was thus increased by ≈ 15%.

The implementation of this example is shown in 
Figure 2. The line laid over the structure described 
in chapter 4.3 constitutes the conditions of exam- 
ple #2 and their exemplary implementation in the 
developed nomogram. 


5 SUMMARY 


An extension to Beyer and Lauster’s approach for
including prior knowledge in statistical test planning
was presented. For both the original and the
extended version of the approach, a nomogram
was developed. It aids the practical implementation
of test plan definition and the analysis of
given test settings in terms of the reliability’s
confidence C(R).

The presented extension is capable of accounting
for more than one source of prior knowledge
with arbitrary confidences and for results from
accelerated tests of arbitrary length. Furthermore, it
allows for an only partial transfer of prior knowledge
to the product’s current state based on an assessment of
their similarity.
ABSTRACT: Component Fault Trees (CFTs) were invented in 2003 as a compositional extension
to fault trees to better reflect the technical architecture of a system in its safety analysis model. Since
then, a lot of research has been contributed regarding semantic extensions, evaluation techniques, and
tighter linking between system and safety models. This paper addresses three main objectives. First, we
summarize the most important contributions and shape a vision of better integrated system modeling
and safety analysis. Second, we push forward standardization and sketch a new evaluation scheme for
quantitative analysis using MDDs. Lastly, an outlook on future improvement ideas is given to make CFTs a
viable technique for loosely coupled systems and Cyber-Physical Systems.


1 INTRODUCTION 

Fault Tree Analysis (FTA) is one of the most prom- 
inent safety and reliability analysis techniques. 
FTA has been applied for over 50 years and was 
standardized by several official or de facto industry 
standards like Vesely et al. (1981) and IEC 61025. 
Standardization in practice has also taken place 
through available tool solutions, such as Fault- 
Tree+, BlockSim, SafetyOffice, medini analyze 
and many more. Its application is recommended or 
even required by various safety standards such as 
IEC 61508 and ISO 26262. 

FTA helps engineers identify causes and
influence factors for a supposed failure (called 
top event) by iteratively seeking backwards in the 
causal chain. Top events can be hazards (situations 
that may provoke accidents) in safety engineering, 
or system failures and unavailability in the domain 
of reliability engineering. Causes and influence 
factors are represented in a tree structure. The 
intermediate failures in the tree structure are joined 
by so-called gates, in particular the AND gate 
(indicating that all influencing factors together are 
necessary to cause the output failure) or the OR 
gate (indicating that any of the influence factors
causes the output failure). Originally, only Boolean
logic connectives such as AND, OR, and N-out- 
of-M (sometimes called voter gate) were provided, 
as well as NOT (which cannot be handled by all 
evaluation algorithms). In later years, dynamic 
extensions like the Priority-AND gate or gates 
formulating constraints, such as the Sequence- 
Enforcing gate have been introduced. However, 
these gates require a closer look into their semantics,
as they no longer express Boolean propositional
logic. The gates are usually represented by
American style logic symbols, as used in circuit 
diagrams, or by European style rectangular sym- 
bols. At the leaves of the tree, the fundamental 
causes (basic events) for failures are represented. 
In the context of propositional logic, basic events 
represent failure conditions or failed states of 
technical components (e.g. “Relay 3 short circuit”). 
For quantitative analysis, failure probabilities 
(instantaneous values or functions over time) are 
assigned to the basic events. This is usually done 
by specifying a probability density function (e.g. 
exponential or Weibull) for the transition to the 
failed state along with its parameters, such as the
failure rate λ. Due to the tree structure, it is not
possible to graphically indicate that a basic event 
may lead to different consequences in the tree (e.g. 
power supply failure influencing multiple parts of 
the system). This information can only be provided 
to evaluation tools by labeling the respective basic 
event as repeated event. 
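
The quantitative meaning of the basic gates can be illustrated with a short sketch. It assumes stochastically independent inputs and an exponential failure model; the function names and numeric values are illustrative and not taken from any FTA tool.

```python
import math
from itertools import combinations


def failure_probability(rate: float, t: float) -> float:
    """Exponential model: probability that the failed state is reached by time t."""
    return 1.0 - math.exp(-rate * t)


def p_and(probs):
    """AND gate: all independent inputs have failed."""
    return math.prod(probs)


def p_or(probs):
    """OR gate: at least one independent input has failed."""
    return 1.0 - math.prod(1.0 - p for p in probs)


def p_k_out_of_n(k, probs):
    """Voter gate: at least k of the independent inputs have failed."""
    n = len(probs)
    total = 0.0
    for m in range(k, n + 1):
        for failed in combinations(range(n), m):
            term = 1.0
            for i, p in enumerate(probs):
                term *= p if i in failed else 1.0 - p
            total += term
    return total


p_relay = failure_probability(rate=1e-4, t=1000.0)  # hypothetical rate and mission time
print(p_and([p_relay, 0.2]), p_or([p_relay, 0.2]), p_k_out_of_n(2, [0.1, 0.1, 0.1]))
```

These formulas hold only if no basic event occurs more than once below a gate. This is why evaluation tools need to know which basic events are repeated events.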

Fault Trees (FTs) can be analyzed in different 
ways. The most common qualitative analysis is the 
identification of “Prime Implicants” or “Minimal 
Cut Sets (MCSs)”, i.e. combinations of basic events 
that cause the top-level failure or hazard. Cut Sets 
with only one element indicate single points of 
failures. The most important quantitative analysis 
is calculating the top-level failure probability (e.g. 
hazard probability or system unavailability). Other 
kinds of quantitative analysis are, for instance, 
importance analyses that unveil the relative 
importance of single failures w.r.t. the top event 
probability and give valuable indications where 
optimization makes sense and where not. 
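
As an illustration of the qualitative analysis, the following sketch derives the minimal cut sets of a small coherent fault tree (AND and OR gates only) by top-down expansion and removal of supersets. The tree representation and the event names are assumptions made for this example.

```python
from itertools import product


def minimize(sets):
    """Keep only cut sets that have no proper subset among the candidates."""
    unique = set(sets)
    minimal = [s for s in unique if not any(other < s for other in unique)]
    return sorted(minimal, key=lambda s: (len(s), sorted(s)))


def cut_sets(node):
    """Return the minimal cut sets of a tree built from basic event names
    (strings) and gates of the form ("AND" | "OR", [children])."""
    if isinstance(node, str):
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        candidates = [cs for sets in child_sets for cs in sets]
    elif gate == "AND":
        candidates = [frozenset().union(*combo) for combo in product(*child_sets)]
    else:
        raise ValueError(f"unsupported gate: {gate}")
    return minimize(candidates)


# Two redundant units that share one power supply (a repeated event).
tree = ("AND", [("OR", ["unit_a", "power_supply"]),
                ("OR", ["unit_b", "power_supply"])])
mcs = cut_sets(tree)
single_points = [next(iter(s)) for s in mcs if len(s) == 1]
print(mcs, single_points)
```

The shared power supply shows up as a cut set with a single element, i.e. as a single point of failure, although both units are redundant.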


1.1 Traditional ways of structuring fault trees 


FTs for real industry systems tend to be quite large, 
often comprising more than 1000 basic events. 
Therefore, it is not possible to present them on a 
single page, so some structuring measures are nec- 
essary. Traditionally, there are two means of struc- 
turing: (1) the transfer symbol and (2) splitting the 
trees into independent subtrees (modules). While 
the first method is purely graphical, the second has 
a semantic connotation, since it is possible to cal- 
culate the probability of failure of a subtree and 
include it in the parent tree as a basic event. This 
reduces computing time and enables IP protec- 
tion by allowing component suppliers to deliver 
the aggregated failure rate instead of having to 
disclose the entire tree structure. Obviously, the 
second method is only feasible if a component’s 
FT is a true subtree without any shared events. 
Multi-instancing of components is possible by 
allowing several instances of a module. Unfortu- 
nately, most available tools do not support this and 
the user has to copy and paste subtrees, which is 
prone to errors. Graphically, both methods appear 
the same, because subtrees are also separated by 
using the transfer symbol. Both methods are not 
formally related to the technical architecture of the 
system, so it is up to the user to divide the FT into 
meaningful parts that correspond to the compo- 
nents of the technical system. 


1.2 Introduction to CFTs 


To overcome these and other restrictions of traditional
FTs, Component Fault Trees (CFTs) were
proposed as a new component concept by Kaiser
et al. (2003). The idea was to cut out parts of FTs
that correspond to technical components and turn
them into reusable units with input and output
failure ports. At the same time, the tree structure
was extended towards a Directed Acyclic Graph
(DAG) structure. This avoids the artificial splitting
of common cause errors into multiple “repeated
events”. Instead, it is possible for more than one
path to start from the same basic event or subgraph.
In addition, it is possible to have more than
one top event, such as an accident when a primary
failure coincides with the failure of a countermeasure,
but only a system unavailability when the same
primary failure occurs while the countermeasure is
working. This saves analysts time, since large parts
of the failure logic can be shared between the two
top events.

Figure 1. Example of a simple CFT.

Fig. 1 shows an exemplary controller system, 
including two redundant CPUs (two instances 
of the same component) and one power supply 
(which would be a repeated event in traditional 
FTA). The controller is unavailable if both CPUs 
are in the state “failed”. The inner FT of the com- 
ponent type CPU is shown on a separate screen; as 
the CPUs are of identical type, they only have to be 
modeled once. The failure of a CPU can be caused 
by some inner basic event “E1” (the repetition of
the ID “E1” in several components is not a prob- 
lem, as each component constitutes its own name 
space). The failure of the CPU can also be caused 
by an external failure cause which is connected 
via an input port. As both causes result in a CPU 
failure, they are joined via a 2-input OR gate. The 
power supply is modeled as a separate component. 
Instead of a single large FT, the model consists of 
small, reusable and easy-to-review components. 
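
The component concept can be sketched as a small data structure. The example below models the component types once, instantiates the CPU twice and qualifies basic events with the instance name, so that each instance has its own name space. The representation, the port name "Pin" and the instance names are assumptions made for illustration, not the notation of any CFT tool.

```python
def expand(expr, instance, bindings):
    """Flatten the failure logic of one component instance.

    expr     -- ("basic", id), ("in", port) or (gate, [sub-expressions])
    instance -- instance name, used as name space for basic events
    bindings -- maps each input failure port to an already expanded expression
    """
    kind = expr[0]
    if kind == "basic":
        return f"{instance}.{expr[1]}"
    if kind == "in":
        return bindings[expr[1]]
    gate, children = expr
    return (gate, [expand(child, instance, bindings) for child in children])


# Component types are modeled once (port and event names are assumptions).
CPU_TYPE = ("OR", [("basic", "E1"), ("in", "Pin")])
SUPPLY_TYPE = ("basic", "E1")

supply_failed = expand(SUPPLY_TYPE, "Sply", {})
cpu1_failed = expand(CPU_TYPE, "CPU1", {"Pin": supply_failed})
cpu2_failed = expand(CPU_TYPE, "CPU2", {"Pin": supply_failed})

# The controller is unavailable if both CPUs are in the state "failed".
controller_failed = ("AND", [cpu1_failed, cpu2_failed])
print(controller_failed)
```

The supply failure is referenced from both CPU instances instead of being copied, which corresponds to the DAG structure of CFTs.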


1.3 Quantitative evaluation using BDDs 


The most important quantitative analysis for 
CFTs, just as for traditional FTs, is the calcula- 
tion of top event probabilities. The result is usually 
expressed as probability for a given time period or 
point in time, or by an equivalent failure rate. For 
traditional FTs, two approaches are commonly
used: collecting all MCSs or prime implicants and
summing up their probabilities, or transforming the
Boolean term represented by the FT into a Binary
Decision Diagram (BDD). In the latter case, the 
event probabilities along each edge are multiplied 
and the final result calculated as the sum of prob- 
abilities of all paths. This BDD-based algorithm 
works also for CFTs. An example is given in Fig. 2 
where the BDD for the main CFT in Fig. 1 is 
shown. 

Fig. 3 shows the integrated BDD after insert- 
ing the fragments for the sub components into the 
main component at the position of the output port 
nodes. In this example, the probability is calculated 
as $P_{\text{system fail}} = 0.1 \cdot 0.2 + 0.9 \cdot 0.3 \cdot 0.3 = 0.101$. Note
that we use fixed probabilities to facilitate under- 
standing, while in most cases in practice failure 
rates are applied. 
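
The path-based calculation can be reproduced with a minimal sketch. The node layout below is an assumption chosen only so that its two paths to terminal 1 match the two products in the expression above; the variable names v1 to v4 are hypothetical and do not stem from Fig. 3.

```python
def bdd_probability(node, p):
    """Probability of reaching terminal 1, computed by Shannon decomposition.

    A node is True, False or a tuple (variable, low_child, high_child);
    p maps each variable to the probability of taking the high edge.
    """
    if node is True:
        return 1.0
    if node is False:
        return 0.0
    variable, low, high = node
    return ((1.0 - p[variable]) * bdd_probability(low, p)
            + p[variable] * bdd_probability(high, p))


def path_probabilities(node, p, weight=1.0):
    """Yield the product of edge probabilities for every path ending in terminal 1."""
    if node is True:
        yield weight
    elif node is not False:
        variable, low, high = node
        yield from path_probabilities(low, p, weight * (1.0 - p[variable]))
        yield from path_probabilities(high, p, weight * p[variable])


p = {"v1": 0.1, "v2": 0.2, "v3": 0.3, "v4": 0.3}
bdd = ("v1",
       ("v3", False, ("v4", False, True)),   # low edge of v1
       ("v2", False, True))                  # high edge of v1
paths = list(path_probabilities(bdd, p))
print(paths, sum(paths), bdd_probability(bdd, p))
```

Both functions give the same result; the recursive form visits each node once per call and can be memoized for large diagrams.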

When nodes representing CFT ports occur in 
the BDD, they have to be replaced by the sub- 
BDD that represents the corresponding FT seg- 
ment with its root at that port. The algorithm for 
CFTs is explained in detail by Förster (2006). 

Kaiser (2005) showed that CFTs not only help 
structure large FTs in a more intuitive way, but
that they also accelerate the quantitative analysis 
significantly. The reason is that the logical structure 
of the FT fragment of a component can be parsed 
and precompiled into a BDD, which makes up a 
significant part of the computation time. While 
optimizing a BDD by finding the lowest number 
of required nodes is NP-complete and thus not 
feasible in practice (see Bollig and Wegener 1996), 
the component concept leads to smaller BDD
fragments that reduce the number of variables 
even if the ordering is not optimal. Furthermore, 
the nodes that belong to internal failure events of 
each component can be combined in one part, and 
the nodes that belong to input ports can be com- 
bined in the other part of the reduced structure of 
the BDD fragment. By a set of examples it could 
be shown empirically that this leads to an accept- 
ably good variable ordering. The reduced BDD 
fragments for each port of each component can 
then be stored for later use. They can be instan- 
tiated many times if multiple instances of a com- 
ponent exist in a system (e.g. train braking system 
with multiple wheel brakes). Kaiser (2005) showed 
an algorithm to store precompiled and optimized 
BDD fragments and insert them into one virtual 
BDD node when instantiating.

The possibility of reducing/ordering and stor- 
ing BDD fragments also helps suppliers to protect 
their intellectual property. In many cases, industrial 
companies refuse to hand over their safety analy- 
sis in full detail to the OEM, because a well-struc- 
tured and well-documented FTA or FMEA allows 
reverse engineering the product to some extent. 
CFTs allow handing over just the signature of a 
component (name and types of failure input and 
output ports) plus a precompiled BDD. The BDD 
contains the full logical terms and the probability 
distribution of the inner failure events, but with the 
variables in a random order and without the names 
of any internal or intermediate failure events. This is
very similar to the principle of information hiding 
in software engineering, where compiled binaries 
and header files with signatures are handed out, 
but not the full program code. 


2 SUBSEQUENT EXTENSIONS 


2.1 Integration with Markov Chains (MCs) 


An important extension was presented by Zocher 
(2005), which allows considering not only stochastically
independent events (which was the hypothesis
of standard FTs) but also mutually exclusive events 
(e.g. failure modes “too high” and “too low”). 
Zocher managed to do so by replacing BDDs with 
Multi-Valued Decision Diagrams (MDDs) in the 
analysis. Instead of the binary logic TRUE/FALSE 
(which is interpreted as “working”/“failed” in
FTA), now several mutually exclusive values are 
possible, e.g. “Working”/“in Failure Mode 1”/“in 
Failure Mode 2” etc. For tool vendors there are 
several software libraries available for handling 
MDDs (e.g. MEDDLY, Sylvan), but in the case 
that only a standard BDD engine is at work, 
Zocher also presents a canonical transformation 
of MDDs to BDDs. Zocher’s main intention was 
to allow embedding MCs as subcomponents into 


CFTs. However, his approach is useful in gen- 
eral, because mutually exclusive failure modes are 
often used and classic FTA cannot handle these 
correctly. 

An example is given in Fig. 4. A Markov Chain 
subcomponent exports two failure states S1 and 
S2 for usage in the FTA, while the other internal 
states are hidden. For better understanding of 
the calculation, the fictitious probabilities of two 
failure states are given (instead of transition rates) 
and they have been chosen very high. On super- 
ordinate level, a CFT OR gate joins both failure 
modes to form the top event. Usually, the resulting
probability according to the OR gate’s formula
would be calculated as $P_1 + P_2 - P_1 \cdot P_2$.

Applying the MDD-based algorithm, the result- 
ing MDD constellation would be identical to that 
described in Fig. 5. After integration and reduction,
the probability is calculated correctly as $P_1 + P_2$
(there is no cut set of two mutually exclusive
conditions), see Fig. 6. Putting an AND gate above 
the two mutually exclusive failure modes would 
correctly produce the output probability zero. 
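
The difference between the two calculations can be checked by treating the subcomponent as one multi-valued variable. The sketch below enumerates its states directly instead of building an MDD; the state names and probabilities are hypothetical.

```python
def probability(predicate, state_probs):
    """Sum the probabilities of all states for which the predicate holds."""
    return sum(p for state, p in state_probs.items() if predicate(state))


# One multi-valued variable: the subcomponent is in exactly one state at a time.
states = {"working": 0.3, "S1": 0.4, "S2": 0.3}

p_or = probability(lambda s: s == "S1" or s == "S2", states)
p_and = probability(lambda s: s == "S1" and s == "S2", states)

# Formula for independent events, applied (incorrectly) to exclusive states.
p_or_independent = states["S1"] + states["S2"] - states["S1"] * states["S2"]

print(p_or, p_and, p_or_independent)
```

The enumeration yields the sum of both state probabilities for the OR gate and zero for the AND gate, whereas the formula for independent events underestimates the OR result.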

Figure 4. MC with several exported failure modes in a CFT.

The same task of integrating MCs with more
than one exported failure mode has been solved by
Adler et al. (2007) under the term “Hybrid CFTs”.
For more information, see Fig. 2 in the cited
paper. Unlike Zocher, they do not use MDDs for
analysis, but refer to another way of representing
multi-state FTs, as proposed by Zang et al. (2003).
However, both approaches ultimately produce the
same quantitative result and the choice is simply a
matter of taste.


2.2 Component-integrated component fault trees 


Domis (2012) enhanced the CFT approach to ena- 
ble traceability to system and architectural design 
models (C2FTs). The traces between CFT artifacts 
and architecture artifacts make it easier to main- 
tain the fault models when changes are made to 
the design. Syntactically, the integration is realized 
by connecting (1) a CFT with a component and 
(2) every input or output failure port of the CFT 
with an architectural port of its connected compo- 
nent. While the syntactical enhancement has pre- 
vailed, we decided to revise the initial naming and 
stick to the term CFT. 


2.3 CFT generation from system models


Many approaches have been proposed for mod- 
eling systems in UML, SysML and Simulink 
and expanding them for the generation of CFTs. 
Kaiser et al. (2015) show the highly automated
derivation of CFTs from SysML models. The integration
of CFTs with UML using a meta-model
has been proposed by Adler et al. (2010). For that 
purpose, a UML profile was created that provides 
all necessary elements. As the UML addresses the 
modeling of software, using this profile makes it 
possible to model CFTs for the designed software. 

The automated derivation of CFT frames from 
Simulink data flow models has been proposed 
and implemented by Ramich (2014) and has been 
validated by a small practical case study by Buono 
et al. (2015). A prototypical tool reads a hierar- 
chy of Simulink models and transforms it into a 
hierarchy of CFT components. The integration is 
merely on topology level (each Simulink component
produces a CFT, each Simulink signal port a
set of CFT ports, CFT edges reflect the Simulink 
connections). The internal failure logic must be 
inserted manually by a safety analyst, as in tradi- 
tional FTA. 


2.4 Enhancement by failure taxonomies 


The introduction of failure type classifications has 
paved the way towards better semantic integration 
of system models and CFTs. Domis and Trapp 
(2008) and Domis (2012) systematically derive 
and integrate CFTs with a data flow based sys- 
tem architecture, providing a canonical procedure 
using interface-focused FMEA (IF-FMEA) and 
a hierarchical failure mode classification scheme. 
Möhrle et al. (2017a) automate the composition of 
CFTs on system level using semantic type annota- 
tions. For that purpose, architectural ports of com- 
ponents are annotated with flow types that classify 
the type of interaction (e.g. digital signals, material 
flows, or energy). For each flow type, safety engi- 
neers create a collection of failure types to classify 
related undesired behavior. These types are used to 
annotate interfacing model elements in CFTs to 
abstract from textual descriptions and provide a 
machine-readable vocabulary. 

Fig. 7 shows a taxonomy including four layers. 
Layer L0 contains the root node failure as the most
abstract failure type. In each subsequent layer, the 
undesired behavior is refined further into mutu- 
ally exclusive sub-items. Taxonomies serve (1) as a 
template when creating CFTs and (2) as a means 
for automating the interconnection of ports when 
composing CFTs. For that purpose, a matching 
algorithm connects ports according to the relation- 
ship of their assigned types. Inconsistencies can be 
detected and located quickly by analysis tools. This 
allows a tight coupling between system model and 
safety analysis model since new CFTs can be gen- 
erated with minimal effort when making changes 
to the architecture. 
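
A type-based matching of failure ports can be sketched as follows. The taxonomy entries are hypothetical, and the matching rule (an output failure port is connected to an input failure port if its type equals or refines the input type) is an assumption; the cited work may define the relationship differently.

```python
# Child -> parent relation of a small, hypothetical failure type taxonomy.
PARENT = {
    "value": "failure",
    "timing": "failure",
    "too_high": "value",
    "too_low": "value",
    "too_late": "timing",
    "too_early": "timing",
}


def ancestors(failure_type):
    """Return the type itself followed by all of its ancestors up to the root."""
    chain = [failure_type]
    while failure_type in PARENT:
        failure_type = PARENT[failure_type]
        chain.append(failure_type)
    return chain


def is_refinement_of(specific, general):
    """True if 'specific' equals 'general' or lies below it in the taxonomy."""
    return general in ancestors(specific)


def match_ports(outputs, inputs):
    """Connect every output failure port to each input port of a compatible type."""
    return [(out_port, in_port)
            for out_port, out_type in outputs.items()
            for in_port, in_type in inputs.items()
            if is_refinement_of(out_type, in_type)]


edges = match_ports({"out_high": "too_high", "out_late": "too_late"},
                    {"in_value": "value", "in_any": "failure"})
print(edges)
```

An output port whose type matches no input port would remain unconnected, which is one kind of inconsistency an analysis tool could report.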

Figure 7. Excerpt of an example failure type taxonomy.

The idea of using failure taxonomies is also
exploited by Kaiser et al. (2015). The system architecture
is specified in SysML and annotated with
semi-formal contracts (assumptions and guarantees)
that specify expectations about the nominal
externally observable behavior of the system
and each of its components. A contract can, for
instance, contain the assertion (guarantee) that 
some value at an output port (e.g. voltage) shall 
always be in the range [0, 100], provided that all 
assumptions are met by the environment. A fail- 
ure is consequently defined as contract violation. 
In the given example, a value lower than 0 would 
constitute the failure mode “too low” (“too high” 
analogously). With contract specification lan- 
guages that also allow expressing temporal behav- 
ior, failure modes like “too late” can be expressed 
as well. Analyzing whether or not an output event 
is generated too late is not done inside the CFT 
analyzer, but must be performed in some discrete-state
or hybrid analysis, model checking or simulation
framework. Another proposed way to infer
the failure propagation logic is to combine fault 
injection with simulation of the system models. 
For the CFT analyzer, the question whether or not 
an assertion is violated is still of Boolean nature. 
The failure mode classification scheme defined 
by Domis and Trapp (2008) has been used here 
as well, but can be refined as appropriate. In the 
same paper, a new graphical notation is used where 
the ports from the architecture model appear in 
the CFT as a rectangular box around the triangular
input and output failure ports (see Fig. 10a). 

However, when switching to the technical view- 
point, potential technical failures like bit-flips, 
wire-breaks, failures of electronic components and 
the like will arise. 
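
The definition of a failure as a contract violation can be made concrete with a small sketch. It assumes a guarantee of the form "value within [lower, upper]" as in the voltage example; the function names are hypothetical.

```python
def classify_violation(value, lower=0.0, upper=100.0):
    """Map an observed output value to a failure mode of the guarantee
    'value is always in [lower, upper]'; None means the guarantee holds."""
    if value < lower:
        return "too low"
    if value > upper:
        return "too high"
    return None


def is_failed(value, lower=0.0, upper=100.0):
    """Boolean view used by a CFT analyzer: is the assertion violated?"""
    return classify_violation(value, lower, upper) is not None


for observed in (-5.0, 42.0, 120.0):
    print(observed, classify_violation(observed), is_failed(observed))
```

The two failure modes are mutually exclusive by construction, since a value cannot be below the lower and above the upper bound at the same time.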


2.5 Exploiting behavioral models to derive CFT 
failure models 


For traditional FTs, several proposals exist to also 
exploit behavioral models of components (e.g. 
State Diagrams) for deriving failure modes and 
failure propagation automatically. Of particular 
interest in the domain of CFTs is the approach by 
Berres and Schumann (2016), which can exploit 
behavioral models to generate CFTs by focusing 
on each component and its behavior individually. 


2.6 CFTs for layered architectures 


Höfig et al. (2015) describe a methodology
called Architecture Layer FailuRE Depend- 
ency (ALFRED) which extends CFTs to better 
maintain the independence of model elements in 
different layers of a vertically decomposed sys- 
tem architecture. The approach uses so-called 
ALFRED connections to allow for CFTs on dif- 
ferent layers of an architecture. These connections 
are used to generate a FT for the entire system and 
over all different architectural layers. This way, for 
every failure dependency relation, all basic events 
that are included in the CFT of the dependency 
element are added to all output failure modes of
the dependent component using OR gates. 
ALFRED connections ease the modeling of 
common cause failures without explicitly modeling 
dependencies using information flow elements such 
as ports. This keeps layers in the model independ- 
ent and supports compositional development. The 
approach can be used to reuse software on differ- 
ent hardware and ease the evaluation of differ- 
ent deployment variants in terms of safety. Using 
ALFRED connections, existing safety analysis 
models can be reused more effectively to support 
early safety assessments and head towards an 
automated safety qualification of future CPS. The
ALFRED approach is already being successfully 
applied for certification tasks in the railway domain. 
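
The rule for generating the overall FT from a dependency relation can be sketched as a simple transformation. The data layout, the function name and all event names are assumptions for illustration; the sketch covers only the rule stated above (basic events of the dependency element are OR-ed onto every output failure mode of the dependent component).

```python
def apply_dependency(dependent_outputs, dependency_basic_events):
    """Add all basic events of the dependency element to every output failure
    mode of the dependent component by means of an OR gate.

    dependent_outputs       -- {output failure mode: failure logic}
    dependency_basic_events -- basic events of the CFT of the dependency element
    """
    events = sorted(dependency_basic_events)
    return {mode: ("OR", [logic, *events])
            for mode, logic in dependent_outputs.items()}


# Hypothetical software component deployed on a hypothetical hardware element.
software_outputs = {
    "cmd_omission": "SW.E1",
    "cmd_wrong": ("OR", ["SW.E2", "SW.E3"]),
}
hardware_events = {"HW.cpu_fault", "HW.ram_fault"}

system_level = apply_dependency(software_outputs, hardware_events)
print(system_level)
```

Because the hardware events are added during generation, the software CFT itself contains no hardware ports and can be combined with a different hardware element without modification.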


2.7 CFTs for product variants 


Möhrle et al. (2017b) extend CFTs further towards
an automated exploration of the system design 
space in terms of safety. For that purpose, muta- 
ble parts of the architecture for which more than 
one design alternative exist are modeled as variants 
rather than concrete components. A variant speci- 
fies a functional interface by its type-annotated 
architectural ports (e.g. integer signals, energy flow). 
The failure logic to be connected to these ports is 
provided for each realizing entity (component or 
subsystem). For instance, multiple sensors may be 
applicable to measure the fill level in an oil tank. 
Hence, a variant is defined for the level measure- 
ment and one realization of the failure logic pro- 
vided for each sensor. By including variants in the 
functional system model, multiple architectures can 
be derived for which CFTs are generated automati- 
cally. This way, FTA becomes a benchmark com- 
paring various design alternatives in terms of safety. 
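
The exploration of design alternatives can be sketched as an enumeration over all combinations of realizations. The variants, their realizations, the failure probabilities and the simple OR structure of the top event are hypothetical; in the approach described above, the failure logic would come from the generated CFTs.

```python
from itertools import product

# Each variant offers several realizations with a (hypothetical) failure probability.
VARIANTS = {
    "level_measurement": {"ultrasonic": 0.02, "float_switch": 0.05},
    "pump_control": {"plc_a": 0.01, "plc_b": 0.03},
}


def top_event_probability(probabilities):
    """Assumed top event: any selected realization fails (independent events)."""
    all_working = 1.0
    for p in probabilities:
        all_working *= 1.0 - p
    return 1.0 - all_working


def explore(variants):
    """Evaluate every derivable architecture and rank it by top event probability."""
    names = list(variants)
    results = []
    for combo in product(*(variants[name].items() for name in names)):
        choice = {name: realization for name, (realization, _) in zip(names, combo)}
        probability = top_event_probability(p for _, p in combo)
        results.append((probability, choice))
    return sorted(results, key=lambda item: item[0])


ranking = explore(VARIANTS)
for probability, choice in ranking:
    print(round(probability, 4), choice)
```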


3 PRACTICAL EXPERIENCE WITH CFTs 


CFTs have been applied in several case studies and 
industrial projects. Below we present a case study 
of a situation display system, which was taken from 
a real world application. In the study, the modeling 
of the system was done by industrial engineers 
with some support from modeling experts. 

The input to the system is sensor data of the 
world outside the system. The GPS receiver also 
obtains information from outside the system, but 
this information comes from a central authority 
and is not acquired by the situation display itself. 
The information is transmitted over two redundant 
channels and then collected and compared in the 
channel interface. Finally, it is transmitted to the 
processing component. During processing, sensor 
and GPS data are combined to obtain information 
of the situation outside the system. This information
is redundant, since the outside world is evaluated
by two independent sources. The system is
shown in the center of Fig. 8. 

Fig. 9 shows a classic FTA of the system for 
two top events. The top event Loss of position data 
implies that no information about the outside situ- 
ation is available. This is the case if both signals 
are lost. The top event Partial loss of position data 
implies that either the GPS or the sensor data is 
not available. In this case, the analysts decided to 
make a worst case assumption by using an OR 
gate instead of the more precise XOR gate. The 
top event Erroneous position data is not depicted 
but it is shaped like the FT for the top event Loss 
of position data with only OR gates and erroneous 
basic events for each component. It models the sit- 
uation where the data displayed is inaccurate. The 
analysts decided to create a pessimistic tree for this 
top event as well, where any failure in one of the 
components can contribute to the top event. 
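
The effect of the worst case assumption can be quantified with a short sketch. It assumes two independent signal losses with hypothetical probabilities.

```python
def p_and(p_a, p_b):
    """Both signals are lost (Loss of position data)."""
    return p_a * p_b


def p_or(p_a, p_b):
    """At least one signal is lost (worst case model of the partial loss)."""
    return p_a + p_b - p_a * p_b


def p_xor(p_a, p_b):
    """Exactly one signal is lost (precise model of the partial loss)."""
    return p_a * (1.0 - p_b) + p_b * (1.0 - p_a)


p_gps, p_sensor = 0.01, 0.02  # hypothetical loss probabilities
print(p_and(p_gps, p_sensor), p_or(p_gps, p_sensor), p_xor(p_gps, p_sensor))
```

The OR result exceeds the XOR result by exactly the probability that both signals are lost, so the OR gate is pessimistic but never optimistic.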

Figure 8. CFT analysis of the situation display system.

Figure 9. Classic FTA of the situation display system.

Fig. 8 shows the CFT analysis that was carried
out for this system. The CFTs that are related to
system components are depicted as breakout boxes.
The CFT for the channel components is redundant
and therefore only depicted once. The triangles
connected to the output port of the processing 
component represent the top events of the system, 
which are the same as for the classic FTA. Lo refers 
to Loss of position data, pLo refers to Partial loss 
of position data and err refers to Erroneous position 
data accordingly. CFTs facilitate the use of inter- 
faces from the system model. For example, the tri- 
angles labeled Lo and Err connected to the right 
input port of the processing component model the 
failure modes loss of and erroneous that propagate 
over the interface from the component channel 
interface to component processing. 

During the case study, some findings were 
observed. Some of them are related to benefits or 
drawbacks of the modeling strategies used, other 
findings can be used as hints for proper modeling. 

Deep trees: It is a common practice during FTA 
to follow the chain of actions backwards through 
the system and document the intermediate failure 
modes. An example is provided in Fig. 9, where the 
FTs are deep instead of forming a flat structure as 
disjunction of all combinations that lead to the top 
event. Since CFTs follow the system structure, we 
can see that CFTs promote deep structures instead 
of flat disjunctions. CFTs therefore should start 
with the top event at the last processing compo- 
nent (actuator). The failure analysis is then done 
backwards or upstream to the participating com- 
ponents until all possible causes are identified, 
ending at the system inputs (sensors). 

Localization: When changes need to be made, 
CFTs facilitate searching for affected parts of the fail- 
ure analysis. Changes that can be narrowed down to 
a single component are constrained to a single CFT.
In the classic FT approach, restoring consistency
after making changes lies in the hands of the analyst.

Redundancy: In classic FTs, homogeneous 
redundancy requires a precise distinction between 
repeated events and redundant, but not identical 
events. In the CFT approach, redundant com- 
ponents are expressed by simply duplicating the 
respective CFT, while common cause failures 
(which have to be identified by common cause 
analysis) stay outside the component and are con- 
nected via ports. In the situation display system, 
the channel components are redundant and the 
duplication of the CFT facilitates the modeling 
and the analysis of the system. 

Organizational structures: As systems grow, 
classic FTs do not provide a proper divide-and- 
conquer strategy to cope with the increasing 
complexity. In CFTs, the entire failure behavior 
(internal) and failure propagation (interfaces) is 
documented at once. Large distributed develop- 
ment teams can take advantage of the clear struc- 
ture and interact with one another using consistent 
interfaces for functional modeling and safety mod- 
eling. Hence, CFTs are a better choice for breaking 
down the complexity of a system in complex
organizational structures. 

Systemic faults: Uncovering systematic faults in 
critical applications is of great importance. If sys- 
tems grow in complexity, failures resulting from 
flawed collaboration become more significant in 
comparison to local component failures. CFTs align 
with component interfaces in system models which 
often match different developer teams and therefore 
foster communication about overlooked failures. 

Simplification and easier maintenance: The case 
study has several top events and therefore requires sev- 
eral FTs in classical FTA. This not only causes extra 
work, but increases the susceptibility to errors and 
impairs maintenance, as changes have to be applied to 
every affected tree. When changing details in a com- 
ponent, potential effects on all top events have to be 
checked. As CFTs filter only the relevant component 
failures for each top event, the analysis is facilitated 
for complex systems where many top events exist. 

Prepared for later reuse: Reusing a component 
together with its CFT in later projects appears 
attractive and enables off-the-shelf components
(“safety element out of context”). However, caution 
is needed as safety issues may arise when reusing 
a previously verified component in a new environ- 
ment. Yet, a reused CFT is a good starting point for 
new analyses. The combination with contract-based 
development, which logs assumptions and guaran- 
tees for each component, can mitigate the risk of 
blind reuse. If components are reused across many 
projects, they and their CFTs go through a matur- 
ing process and the chances of all potential failures 
being detected increase. 


4 CONSOLIDATION TOWARDS “CFT 2.0” 


After years of various new contributions since 
CFTs were initially proposed, it is time to collect 
the ideas and define “CFT 2.0”. This hopefully 
encourages tool builders to support this method. 


4.1 Integration with architectural models 


A first proposal is to discard the term 
“Component-Integrated Component Fault Trees (C2FTs)” 
and to refer to the overall concept as CFTs. CFTs 
can either be used as a safety analysis technique 
alone, or linked to functional or technical static 
architecture models, in particular SysML Internal 
Block Diagrams (IBDs). When used in combina- 
tion with SysML IBDs, we suggest that it shall be 
mandatory that 


1. for each component in the SysML architecture, 
there is exactly one corresponding CFT with the 
same name 

2. the nesting hierarchy in the SysML IBD and the 
CFT is the same 


3. cause-consequence edges in the CFT exist wherever 
and only where corresponding signal or 
service flows exist in the SysML IBD architecture 

4. for each incoming flow port, at least one failure 
input port shall be assigned in the CFT (the 
same applies for outgoing flow ports and output 
ports) 

5. for each required or provided service port (lollipop 
symbol), several failure input and output 
ports can be assigned, since failure consequences 
can be passed from service provider to requester 
and vice versa. Safety engineers are responsible 
for avoiding cycles in the causal chains. 

6. all CFT failure ports are bound to one architectural 
port; graphically represented either 
by drawing the rectangular architectural port 
around its triangular failure ports, or by drawing 
edges between them. 


Integration with other signal-flow-based mod- 
eling techniques should be standardized as well, 
in particular the integration with Simulink, e.g. 
based on the work of Ramich (2014) and Buono 
et al. (2015). Like for SysML IBD flow ports, 
the failure causality flow follows the signal flow. 
Attention must be paid because these models 
often contain cycles in the signal flows due to 
closed-loop control. These cycles must not lead 
to cycles in the causal edges of the FT, which are 
illegal. Avoiding cycles can be achieved by skillful 
modeling (e.g. modeling failure consequences 
on the open-loop signal chain from sensors via 
controller to actuators), but there are also proposals 
for resolving cycles automatically, see Vaurio (2007) 
and Domis (2012). However, these aspects should 
be handled externally and are not included in the 
CFT standard. 


4.2 Mutually exclusive failure modes 


We suggest BDD or MDD based evaluation of 
the resulting failure probabilities as the default 
algorithms for CFTs. Embedding MC compo- 
nents (which are not part of the CFT method) 
is a recommended practice for CFT tools. Using 
the MDD evaluation method, exporting more 
than one failure mode from an MC, as shown in 
Fig. 5, is possible and produces correct results. 
Even without embedding MCs, it should be 
possible to model mutually exclusive failure 
modes like “too low” and “too high” and to notify 
the tool about their special relationship. To 
allow this, we suggest, as an extension to CFTs, 
that several basic events can be surrounded by 
a rectangular frame as shown in Fig. 10a), indicating 
that they form a mutually exclusive group. 
Similarly, failure ports encapsulated by the same 
rectangular architectural port shall be treated as 
mutually exclusive. 

Figure 10b). Multi-instance component connected to a multi-input gate. 


4.3 Multiple instances and variants 
of components 


Another innovation we suggest for CFT2.0 is 
exploiting the multi-instance capabilities in a more 
convenient way for highly-redundant systems by 
introducing a multi-instance symbol and gates with 
multi-input ports. Imagine a train braking system 
with 96 wheel brakes of the same type, where a haz- 
ard is present whenever at least 4 wheel brakes have 
failed. Instead of manually creating 96 individual 
instances and connecting them to an n-out-of-m 
gate with 96 inputs, a tool can offer a 
multi-instance symbol like in Fig. 10b) with a parameter 
(96 in this example) to indicate how many replicas 
are available. Note the input port symbol at the 
gate that indicates a multi-port and the different 
style of line indicating a multi-line. Of course, the 
parameter can easily be changed for later projects, 
so that only one parameter needs to be changed to 
adapt the model for a train with 72 wheel brakes. 
This enables efficient safety analysis for product 
lines. The semantics is straightforward because this 
is merely a matter of graphical representation. 
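
The effect of the replica parameter on the top event can be illustrated with a small calculation. The sketch below is only an illustration under assumptions that the text does not make: all instances fail independently and with the same probability, and the value `P_BRAKE` is a hypothetical number chosen for the example. Under these assumptions the n-out-of-m gate reduces to a binomial tail probability.

```python
from math import comb


def at_least_k_of_n(k: int, n: int, p: float) -> float:
    """Probability that at least k of n components have failed.

    Assumes independent components with identical failure probability p
    (an assumption of this sketch, not a statement from the paper).
    """
    # Binomial tail: sum of P(exactly i failures) for i = k..n.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical per-brake failure probability, for illustration only.
P_BRAKE = 1e-3

# Hazard "at least 4 wheel brakes failed" for the 96-brake train ...
hazard_96 = at_least_k_of_n(4, 96, P_BRAKE)
# ... and for the 72-brake variant: only the replica parameter changes.
hazard_72 = at_least_k_of_n(4, 72, P_BRAKE)
```

Changing the single argument `n` mirrors the idea that one parameter adapts the model to another train configuration.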

To further support the product line approach, 
we suggest that the variant approach from 
Möhrle et al. (2017b) also be part of CFT2.0. Modelers 
can specify variants by modeling the signature 
(number and types of ports), and the concrete 
CFT model is attached later, selected from a model 
library and guided by a variants database. For 
example, if a hydraulic brake is replaced by an elec- 
tric brake, only one more model needs to be added 
and the different solutions can be easily compared. 
However, this aspect is mainly a question of tool 
support and not of the CFT semantics themselves. 


5 CONCLUSION AND FURTHER 
RESEARCH 


By collecting and integrating scientific contribu- 
tions from many researchers over more than a 
decade, we have provided an overview of the state- 
of-the-art in CFTs and made some proposals on 


how to update their specification under the term 
CFT2.0. By referencing larger case studies, we aim 
to encourage tool developers and safety analysts 
to use CFTs in industrial projects and provide 
feedback for further method improvements. We 
are convinced that the CFTs have reached matu- 
rity for industrial use in safety-relevant projects. 
Nevertheless, we are inclined to work on further 
developments. We will further elaborate on a more 
formal integration of CFTs with SysML/UML by 
combining safety analysis with contract-based 
development. With regard to UML/SysML, we 
are working on integrating not only flow ports 
(dataflow oriented interaction) but also service 
ports (lollipop symbol), as this type of interac- 
tion (often asynchronous) is typical not only for 
software systems, but also for open networked 
systems and systems with user interaction. Here 
we will have to deal with the fact that failure 
propagation can occur in both directions over the 
architecture port. We will continue working on 
approaches to derive the internal failure propaga- 
tion and mitigation logic of CFTs, in particular by 
combining failure injection and simulation. More 
traditional analytic approaches such as FMEA 
or HiP-HOPS for finding inner failure modes and 
failure propagation should also be better integrated 
with CFTs. We plan to extend the presented variability 
approach towards runtime variance as well, which is 
important as spontaneously networking CPSs, 
so-called Systems-of-Systems (SoSs), become ubiquitous. 
This would mean that for every legal configuration 
of an SoS (e.g. vehicle platoon or sensor 
network), parametric CFT templates will have to 
be re-combined and checks carried out as to whether 
the desired constellation is safe or reliable 
enough. 
ABSTRACT: With the increase in computing power and the rise of “big data”, many machine learning 
algorithms have been developed or given a new lease of life. Their application has proved to be very 
effective in various fields. However, their implementation in the industrial sector, such as nuclear power 
generation, seems less widespread. This paper presents a comparison of several techniques on a case 
study from the nuclear industry. Their performance on the real dataset is compared and a discussion 
is proposed on their practical use, advantages and disadvantages, precautions for use and relevance in an 
industrial context. 


1 INTRODUCTION 


1.1 Industrial context 


Shutdowns of nuclear power reactors are regularly 
planned for refueling and carrying out maintenance 
operations. The state of the reactor during 
its shutdown can be characterized by an indicator 
which is measured during the shutdown. If this 
indicator is higher than a given fixed threshold, 
it may impact the scheduled planning and lead to 
an extension of the shutdown. In addition to this 
indicator, around twenty parameters describing 
the operation conditions of the reactor just before 
the shutdown are available. 

There currently exists neither physical-based 
model nor computer simulation code to characterize 
the indicator and predict if it will be higher 
than the given threshold. That is why, in order to 
anticipate the required logistical support for maintenance 
during the reactor shutdown, it seems 
relevant to “learn from the data” and try to take 
advantage of the information brought by the operation 
conditions before the shutdown to foresee 
the state of the reactor during its shutdown. The 
use of a “black-box” machine learning algorithm 
seems to be a promising solution in order to build 
the prediction function between the state before 
and the one during the shutdown. 

Indeed, with the increase in computing power and 
the rise of “big data”, many machine learning algorithms 
have been developed or given a new lease of 
life. Their application is proved to be really efficient 
for prediction purpose in various fields: banking, 
insurance, media, information technology, healthcare, 
transportation, sports... However their implementation 
in the industrial sector, such as nuclear power 
generation, seems less widespread and their relevance 
and effectiveness have to be confirmed in this area of 
application where data may be less voluminous. 

Different supervised machine learning techniques 
have been tested on the available data and the main 
objective of this paper is to present a comparison 
of the obtained results. 

The remainder of this article is organized as 
follows. Section 2 describes the fundamental principles 
of the families of machine learning algorithms 
that have been used. Section 3 presents 
how the performance of the predictive models can 
be assessed and compared and it gives the main 
results obtained on the dataset. A general discussion 
is proposed in Section 4 on the practical use of 
the techniques, by exposing their advantages and 
disadvantages, precaution for use and relevance in 
an industrial “relatively small data” context. 


2 PRESENTATION OF THE DIFFERENT 
FAMILIES OF MACHINE LEARNING 
ALGORITHMS 


2.1 Notations and preliminary assumptions 


$Y$ will denote the indicator characterizing the state 
of the reactor during its shutdown. $Y$ is a qualitative 
(or categorical) random variable taking two 
values: AT (for “Above Threshold”) and BT (for 
“Below Threshold”). For more convenience, we 
will represent it by two equivalent numerical variables, 
$Z$ and $\tilde{Z}$, using binary codes: $Z = 1$ if $Y = \mathrm{AT}$ 
and $Z = 0$ if $Y = \mathrm{BT}$, so that $Z = \mathbb{1}_{\{Y = \mathrm{AT}\}}$, with 
$\mathbb{1}_{\{\cdot\}}$ the indicator function, and $\tilde{Z} = 1$ if $Y = \mathrm{AT}$ 
and $\tilde{Z} = -1$ if $Y = \mathrm{BT}$. $X = (X^{(1)}, \ldots, X^{(p)})$ will denote 
the vector made of the $p$ parameters describing 
the operation conditions of the reactor before its 
shutdown. In our case, $p = 25$. Each variable $X^{(j)}$, 
$1 \le j \le p$, can be either deterministic or random 
and either qualitative or quantitative (or continuous). 
If $X^{(j)}$ is qualitative with $k_j$ levels, we will consider 
its numeric version using dummy coding of 
its levels and equivalently introduce $k_j - 1$ binary 
variables, at most one of which is “on” at a time: 
$(\mathbb{1}_{\{X^{(j)} = l\}})_{1 \le l \le k_j - 1}$. If $X^{(j)}$ is quantitative, we will 
assume it is a standardized variable. $n$ will denote 
the number of joined observations $(Y_i, X_i)_{1 \le i \le n}$. 
In our dataset, $n = 89$. We will make the assumptions 
that $(Y_i, X_i)_{1 \le i \le n}$ are independent observations 
of $(Y, X)$ and are not subject to measurement 
uncertainty. 

Regarding the data $(Y_i, X_i)_{1 \le i \le n}$, we seek a 
function, denoted by $f(X)$ (sometimes called a 
“classifier” since $Y$ is a binary variable), for predicting 
$Y$ given the predictor inputs $X$. In the literature, 
this issue is called a “supervised” learning 
problem because of the presence of the two possible 
outcomes for $Y$ (or equivalently for $Z$ or $\tilde{Z}$) 
to guide the learning process and build the classification 
function $f(.)$. Depending on the type of 
algorithm, $f(.)$ can predict either directly a value 
$\hat{Y}_i = f(X_i) \in \{\mathrm{AT}, \mathrm{BT}\}$, $1 \le i \le n$, or an estimated 
probability that $Y_i$ equals AT depending on $X_i$. In 
that case, our decision rule will be to assign value 
AT if the estimated probability is higher than 0.35 
(this conservative value has been tuned to obtain 
the best results in terms of performance indicators; 
see Sections 3.1 and 3.2). 
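
The decision rule for probabilistic classifiers can be written in a few lines. The following is a minimal sketch; the function name `classify` and the constant `THRESHOLD` are illustrative names, and only the value 0.35 comes from the text.

```python
THRESHOLD = 0.35  # cut-off reported in the text


def classify(prob_at: float, threshold: float = THRESHOLD) -> str:
    """Assign class 'AT' when the estimated P(Y = AT | X) exceeds the threshold."""
    return "AT" if prob_at > threshold else "BT"
```

A cut-off below 0.5 makes the rule predict AT more readily, which matches the conservative intent stated in the text.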


2.2 Logistic regression, stepwise, LASSO, Ridge, 
PLS and Sparse PLS 


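From a general point of view, the logistic regression 
model can be used to estimate the probability 
of a binary response based on one or more predictor 
variables. A standard parametrization uses 
the logit function to link the target probability 
$P(Y = \mathrm{AT} \mid X) = P(Z = 1 \mid X)$ with predictors $X$: 

$$\mathrm{logit}(P(Z = 1 \mid X)) = \log\left(\frac{P(Z = 1 \mid X)}{1 - P(Z = 1 \mid X)}\right) = \beta_0 + \sum_{j=1}^{p} \beta_j X^{(j)}$$   (1) 

$$\Leftrightarrow\ P(Z = 1 \mid X) = \frac{\exp\left(\beta_0 + \sum_{j=1}^{p} \beta_j X^{(j)}\right)}{1 + \exp\left(\beta_0 + \sum_{j=1}^{p} \beta_j X^{(j)}\right)}$$   (2) 

with $\beta_0$ a constant called intercept and $(\beta_1, \ldots, \beta_p)$ 
a vector of regression coefficients to be estimated 
from the data. We will denote $\beta = (\beta_0, \ldots, \beta_p)$. 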

One interest of this parametric model is its 
interpretability, since $\exp(\beta_j)$, $1 \le j \le p$, is the 
factor by which the odds $P(Z = 1 \mid X) / (1 - P(Z = 1 \mid X))$ 
are multiplied when the (supposedly) continuous 
variable $X^{(j)}$ increases by 1 unit. 

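Regarding $(Z_i, X_i)_{1 \le i \le n}$, $\beta$ can be estimated by 
$\hat{\beta}^{ML}$ using the Maximum Likelihood (ML) statistical 
inference method: 

$$\hat{\beta}^{ML} = \operatorname*{argmin}_{\beta \in \mathbb{R}^{p+1}} \left(-2\ell(\beta)\right) = \operatorname*{argmin}_{\beta \in \mathbb{R}^{p+1}} \left(-2 \sum_{i=1}^{n} \left[ Z_i \log(P(Z_i = 1 \mid X_i)) + (1 - Z_i) \log(P(Z_i = 0 \mid X_i)) \right]\right)$$   (3) 

which is equivalent to: 

$$\hat{\beta}^{ML} = \operatorname*{argmin}_{\beta_0, \ldots, \beta_p \in \mathbb{R}} \left(-2 \sum_{i=1}^{n} \left[ Z_i \left(\beta_0 + \sum_{j=1}^{p} \beta_j X_i^{(j)}\right) - \log\left(1 + \exp\left(\beta_0 + \sum_{j=1}^{p} \beta_j X_i^{(j)}\right)\right) \right]\right)$$   (4) 

where $\ell(\beta)$ denotes the log-likelihood function. 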
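
The criterion minimized by the ML estimator can be evaluated directly. The sketch below computes twice the negative log-likelihood of a logistic model for a given coefficient vector; the function name `neg2_loglik` and the list-based data layout are assumptions made for illustration, and no optimizer is included.

```python
import math


def neg2_loglik(beta, X, Z):
    """Return -2 * log-likelihood of a logistic regression model.

    beta: [beta_0, beta_1, ..., beta_p] (intercept first)
    X:    list of n rows, each a list of p predictor values
    Z:    list of n labels coded 0/1
    """
    total = 0.0
    for x_i, z_i in zip(X, Z):
        # Linear predictor beta_0 + sum_j beta_j * x_ij.
        eta = beta[0] + sum(b * x for b, x in zip(beta[1:], x_i))
        # Numerically stable evaluation of log(1 + exp(eta)).
        log1pexp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        total += z_i * eta - log1pexp
    return -2.0 * total
```

With all coefficients equal to zero, every observation has probability 0.5, so the function returns $2n\log 2$, which is a convenient sanity check.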

In order to improve the prediction accuracy 
and the interpretability of regression models, it 
may be interesting to alter the model fitting process 
to select only a subset of the provided predictors 
$X$ for use in the final model rather than 
using all of the $X^{(j)}$, $1 \le j \le p$. A standard procedure 
for subset selection is the stepwise-selection 
strategy. The forward-stepwise selection starts 
with the intercept $\beta_0$ and then sequentially adds 
into the model the predictor that most improves 
the fit. The backward-stepwise selection starts 
with the full model including $X$, and sequentially 
deletes the predictor that has the least impact on 
the fit. We used a hybrid stepwise-selection strategy 
that considers both forward and backward 
moves at each step, and selects the “best” of the 
two. The step function uses the Akaike Information 
Criterion (AIC) for weighing the choices, 
which takes proper account of the number of 
parameters to be fitted. At each step, an add or 
drop will be performed that minimizes the AIC 
score: 

$$\mathrm{AIC} = 2(p^* + 1) - 2\ell(\hat{\beta}^{ML})$$   (5) 

with $1 \le p^* \le p$ the number of predictors included 
in the model. 

Because it is a discrete process (variables are 
either retained or discarded), the stepwise procedure 
often exhibits high variance and so may not 
reduce the prediction error of the full 
model. Shrinkage methods are more continuous 
and do not suffer as much from high variability. 
Several shrinkage variants of the logistic regression 
are available. LASSO (for Least Absolute Shrinkage 
and Selection Operator) shrinks the regression 
coefficients $\beta$ by imposing an $L_1$ penalty on their 
size. It forces the sum of the absolute values of the 
regression coefficients to be less than a fixed value, 
which forces certain coefficients to be set to zero, 
effectively choosing a simpler model that does not 
include those coefficients: 

$$\hat{\beta}^{LASSO} = \operatorname*{argmin}_{\beta \in \mathbb{R}^{p+1}} \left(-2\ell(\beta)\right) \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t$$   (6) 

The smaller the value of the tuning parameter 
$t$, the fewer the nonzero components 
in $\hat{\beta}^{LASSO}$, thus leading to what is called 
“sparse” models. Thus LASSO does a kind of 
continuous subset selection. If $t$ is chosen larger 
than $t_0 = \sum_{j=1}^{p} |\hat{\beta}_j^{ML}|$, LASSO estimates are the same 
as the ML estimates. For $t = t_0 / 2$, say, the ML 
coefficients are shrunk by about 50% on average. 
Parameter $t$ has to be carefully chosen in order 
to minimize an estimate of expected prediction 
error. 

A “good” value of $t$ can be obtained by cross-validation. 
Cross-validation is a general process 
which can be applied to many types of issues. It 
consists in dividing the dataset $(Z_i, X_i)_{1 \le i \le n}$ randomly 
into a certain number of parts of (nearly) equal size, say 
10. LASSO regression with a given value for $t$ is 
fitted with nine-tenths of the data (this dataset 
is called the “training dataset”), and the prediction 
error is computed on the remaining one-tenth 
(called the “test observations”). This process is 
done in turn for each one-tenth of the data, and 
the ten prediction error estimates are averaged. 
From this we can obtain an estimated prediction 
error curve as a function of $t$, which makes it possible 
to identify a relevant value for $t$ minimizing the 
estimated prediction error. 
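
The cross-validation loop described above can be sketched generically. In the code below, `fit` and `predict` are placeholder callables supplied by the user (for instance a LASSO fit for one fixed value of the tuning parameter); they, the function names and the seeding scheme are assumptions of this sketch rather than part of the study.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Randomly split the indices 0..n-1 into k folds of (nearly) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_error(fit, predict, X, y, k=10, seed=0):
    """Average misclassification rate over k held-out folds.

    fit(X_train, y_train) -> model and predict(model, x) -> label are
    user-supplied placeholders.
    """
    fold_errors = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        # Fit on the k-1 training folds only.
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        # Evaluate on the fold that was left out.
        wrong = sum(predict(model, X[i]) != y[i] for i in test_idx)
        fold_errors.append(wrong / len(test_idx))
    return sum(fold_errors) / k
```

Calling `cross_validated_error` once per candidate value of the tuning parameter and keeping the value with the smallest result gives the estimated prediction error curve mentioned in the text.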

Ridge is another shrinkage method, but it 
shrinks the regression coefficients $\beta$ by imposing an 
$L_2$ penalty on their size: 

$$\hat{\beta}^{Ridge} = \operatorname*{argmin}_{\beta \in \mathbb{R}^{p+1}} \left(-2\ell(\beta)\right) \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t$$   (7) 

Contrary to LASSO and because of the $L_2$ 
penalty, Ridge shrinks the size of the coefficients, 
but it does not set any of them to zero. 

When there are many correlated predictors in 
a regression model (also called multicollinearity 
among $X$), their coefficients $\beta$ can become poorly 
determined. In this situation the coefficient 
estimates of the regression may change 
erratically in response to small changes in the 
model or the data and thus may suffer from 
instability. By imposing a size constraint on the 
coefficients to be estimated, as in (6) 
or (7), this problem is alleviated. However, another 
solution is to consider $m$ new inputs, $1 \le m \le p$, 
denoted by $T^{(j)}$, $1 \le j \le m$, derived from the original 
predictors $X$ and used in place of the $X^{(j)}$, $1 \le j \le p$, 
in the regression model: 

$$\mathrm{logit}(P(Z = 1 \mid T)) = \log\left(\frac{P(Z = 1 \mid T)}{1 - P(Z = 1 \mid T)}\right) = \beta_0 + \sum_{j=1}^{m} \beta_j T^{(j)}$$   (8) 

Generally, $m$ is chosen to be small and the $T^{(j)}$, 
$1 \le j \le m$, are orthogonal linear combinations 
of the $X^{(j)}$, $1 \le j \le p$, in order to avoid estimation 
instability. 

Partial Least Squares (PLS) constructs a set of 
linear combinations of the inputs for regression, 
by using $Y$ in addition to $X$ for this construction. It 
finds a linear regression model by projecting $Y$ and 
$X$ to a new space. It tries to find the multidimensional 
direction in the $X$-space that explains the 
maximum multidimensional variance direction in 
the $Y$-space. A detailed mathematical description 
of how the $T^{(j)}$, $1 \le j \le m$, are defined can be found 
in (Wold et al. 1983). 

Sparse PLS is a hybrid method, which can be 
viewed as a combination of LASSO and PLS. 
Details on this approach can be found in (Chun 
et al. 2010). 

Stepwise, LASSO, Ridge, PLS and Sparse PLS 
are all variants of the logistic regression and they 
share the same advantage of being interpretable 
models, with fully explicit expressions for the esti- 
mated target probability depending on X. 


2.3 Classification trees, bagging and boosting 


A classification tree is a directed acyclic graph 
consisting of nodes and directed edges. It has three 
types of nodes: a root node that has no incoming 
edges and zero or more outgoing edges; internal 
nodes, each of which has exactly one incoming edge 
and two or more outgoing edges; leaf or terminal 
nodes, each of which has exactly one incoming 


edge and no outgoing edges. The full dataset sits at 
the root node at the top of the tree. Each leaf node 
is assigned a class label corresponding to one value 
of the target variable (Y in our case). The non- 
terminal nodes, which include the root and other 
internal nodes, contain attribute test conditions 
based on the predictors X to separate observa- 
tions that have different characteristics, as shown 
in Figure 1. Stumps (or decision stumps) are trees 
with a single split (the root is directly connected to 
the leaves). 

Algorithms for constructing classification trees 
usually work top-down and recursively, by choos- 
ing an input variable among the p and a split-value 
at each step that best split the set of observations 
into two data subsets. The process is continued 
until some stopping rule is applied. 

A key advantage of the recursive binary tree 
is its interpretability, since it finally gives a set of 
fully understandable decision rules, as illustrated 
in Figure 1. 

Different algorithms use different metrics for 
measuring “best”. These generally measure the 
homogeneity of the target variable within the sub- 
sets. These metrics are applied to each candidate 
subset, and the resulting values are combined to 
provide a measure of the quality of the split. 

How large should we grow the tree? Clearly a 
very large tree might overfit the data, while a small 
tree might not capture the important structure of 
the dataset $(Y_i, X_i)_{1 \le i \le n}$. Tree size is a tuning parameter 
governing the model’s complexity, and a “good” 
tree size can be adaptively chosen from the data. A 
strategy is to grow the full initial tree, denoted by 
$T_0$, stopping the splitting process only when some 
minimum node size is reached (in our case 6). Then 
this large tree is pruned using a process called 
cost-complexity pruning. We define a subtree $T \subset T_0$ 
to be any tree that can be obtained by pruning 
$T_0$, that is collapsing any number of its internal 
(non-terminal) nodes. We index terminal nodes by 
$m$, with node $m$ representing data subset $S_m$. Let 
$|T|$ denote the number of terminal nodes in $T$, $n_m$, 
$1 \le n_m \le n$, the number of observations in subset 
$S_m$ and $\hat{p}_m$ the proportion of observations with 
value $Y = \mathrm{AT}$ in node $m$: 

$$\hat{p}_m = \frac{1}{n_m} \sum_{X_i \in S_m} \mathbb{1}_{\{Y_i = \mathrm{AT}\}}$$ 

We define the cost complexity criterion by: 

$$C_\alpha(T) = \sum_{m=1}^{|T|} n_m Q_m(T) + \alpha |T|$$   (9) 

where $Q_m(T)$ is an “impurity measure”, in our case 
the Gini index defined by $Q_m(T) = 2\hat{p}_m(1 - \hat{p}_m)$. 
We classify the observations in node $m$ to class 
$Y = \mathrm{AT}$ if $\hat{p}_m > 0.5$ (that is to say if value $Y = \mathrm{AT}$ is 
in the majority in node $m$), and to class $Y = \mathrm{BT}$ otherwise. 
The idea is to find, for each $\alpha \ge 0$, the subtree 
$T_\alpha \subseteq T_0$ to minimize $C_\alpha(T)$. The tuning parameter 
$\alpha$ governs the tradeoff between tree size and goodness 
of fit to the data. Large values of $\alpha$ result in 
smaller trees $T_\alpha$, and conversely for smaller values 
of $\alpha$. With $\alpha = 0$, the solution is the full tree $T_0$. 
The determination of a “good” value for $\alpha$ can be 
achieved by cross-validation. 

Figure 1. Illustration of a classification tree. 
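
The cost complexity criterion is easy to evaluate once a tree is described by the labels that fall into each of its terminal nodes. The sketch below assumes that representation (a list of leaves, each a list of "AT"/"BT" labels); the function names are illustrative.

```python
def gini(labels):
    """Gini index 2 * p * (1 - p), with p the proportion of 'AT' labels in the node."""
    p = sum(1 for y in labels if y == "AT") / len(labels)
    return 2.0 * p * (1.0 - p)


def cost_complexity(leaves, alpha):
    """C_alpha(T) = sum over leaves of n_m * Q_m(T), plus alpha * number of leaves.

    leaves: list of terminal nodes, each given as the list of labels it contains.
    """
    impurity = sum(len(leaf) * gini(leaf) for leaf in leaves)
    return impurity + alpha * len(leaves)
```

For a fixed `alpha`, comparing `cost_complexity` across candidate subtrees shows the tradeoff: collapsing two leaves saves `alpha` but usually raises the impurity term.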

In order to avoid overfitting (see Section 3.2 for 
a detailed discussion on this issue), bagging (for 
“bootstrap aggregation”) can be used. It consists 
in drawing randomly with replacement from the 
original dataset $(Y_i, X_i)_{1 \le i \le n}$ a large number $B > 1$ of 
new datasets of same size n. For each new dataset, 
a classification tree is built and the final bagged 
classifier is obtained by selecting the class between 
Y= AT or Y= BT with the most “votes” from the 
B trees. A deeper introduction to bagging can be 
found in (Breiman 1996). 
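
Bagging can be sketched independently of the base learner. In the code below, `fit` and `predict` are placeholder callables standing for any tree-building routine; they and the default `B=100` are assumptions of this illustration, not values from the study.

```python
import random
from collections import Counter


def bagged_classifier(fit, predict, X, y, B=100, seed=0):
    """Train B models on bootstrap samples and return a majority-vote predictor.

    fit(X_sample, y_sample) -> model and predict(model, x) -> label are
    user-supplied placeholders for the base classifier.
    """
    rng = random.Random(seed)
    n = len(y)
    models = []
    for _ in range(B):
        # Bootstrap sample: n indices drawn with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        models.append(fit([X[i] for i in sample], [y[i] for i in sample]))

    def vote(x):
        # Class with the most votes among the B models.
        counts = Counter(predict(model, x) for model in models)
        return counts.most_common(1)[0][0]

    return vote
```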

For large problems, the performance of classification 
trees can be improved using a procedure 
called boosting. If we consider the numerical target 
variable $\tilde{Z}$ introduced in Section 2.1, given the 
predictor variables $X$, a classifier $f(X)$ produces a 
prediction taking one of the two values $-1$ ($Y = \mathrm{BT}$) 
or $1$ ($Y = \mathrm{AT}$). The error (or misclassification) 
rate of $f(.)$ on the dataset is $\mathrm{err}_f = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{\tilde{Z}_i \ne f(X_i)\}}$. 
A weak classifier is one whose error rate is only 
slightly better than random guessing. The purpose 
of boosting is to sequentially apply a weak classification 
algorithm to repeatedly modified versions 
of the data, thereby producing a sequence of weak 
classifiers $f_m(X)$, $1 \le m \le M$. The predictions from 
all of them are then combined through a weighted 
majority vote to produce the final prediction: 

$$f(X) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m f_m(X)\right)$$   (10) 

where $\alpha_1, \ldots, \alpha_M$ weight the contribution of each 
respective $f_m(X)$. Their effect is to give higher 
influence to the more accurate classifiers in the 
sequence. The data modifications at each boosting 
step consist of applying weights $\omega_1, \ldots, \omega_n$ to 
each of the observations $(Y_i, X_i)_{1 \le i \le n}$. Initially all of 
the weights are set to $\omega_i = 1/n$, so that the first step 
simply trains the classifier on the data in the usual 
manner. For each successive iteration $m = 2, \ldots, M$, 
the observation weights are individually modified 
and the classification algorithm is reapplied to the 
weighted observations. At step $m$, those observations 
that were misclassified by the classifier $f_{m-1}(X)$ 
induced at the previous step have their weights 
increased, whereas the weights are decreased for 
those that were classified correctly. Thus as iterations 
proceed, observations that are difficult to 
classify correctly receive ever-increasing influence. 
Each successive classifier is thereby forced to concentrate 
on those training observations that are 
missed by previous ones in the sequence. 

In our study, we used stumps as weak classifiers, 

$$\alpha_m = \log\left(\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}\right) \quad \text{with} \quad \mathrm{err}_m = \frac{\sum_{i=1}^{n} \omega_i \mathbb{1}_{\{\tilde{Z}_i \ne f_m(X_i)\}}}{\sum_{i=1}^{n} \omega_i}$$ 

and for each successive iteration $m = 2, \ldots, M$, the 
observation weights were modified as follows: 

$$\omega_i \leftarrow \omega_i \exp\left(\alpha_m \mathbb{1}_{\{\tilde{Z}_i \ne f_m(X_i)\}}\right), \quad 1 \le i \le n.$$ 

Contrary to standard binary trees, bagging and 
boosting techniques lead to classification rules 
that are not easy to interpret. Indeed, since they 
aggregate results obtained from different trees, it 
is difficult to identify how the inputs concretely 
affect the output. 

2.4 Neural networks 


The name “neural networks” derives from the 
fact that they were first developed as models for 
the human brain. Each unit within the model 
represents a neuron and the connections (links in 
Figure 2) represent synapses. The term neural network 
encompasses a large class of models. Here 
we describe the most widely used neural network, 
often called the “single hidden layer back-propagation 
network” or the “single layer perceptron”. 
The central idea is to extract linear combinations 
of the inputs as derived features, and then model 
the target as a nonlinear function of these features. 
More formally, in our case, a neural network is a 
two-stage classification model. Derived features 
$\sigma(\alpha_{0,1} + X^T\alpha_1)$ and $\sigma(\alpha_{0,2} + X^T\alpha_2)$ are created from linear 
combinations of the inputs $X = (X^{(1)}, \ldots, X^{(p)})$, 
and then output $Z$ is modeled as a function of 
a linear combination of these derived features: 

$$g\left(\theta_0 + \theta_1 \sigma(\alpha_{0,1} + X^T\alpha_1) + \theta_2 \sigma(\alpha_{0,2} + X^T\alpha_2)\right).$$ 

$\sigma(.)$ is called the activation function and is usually 
chosen to be the sigmoid: $\sigma(v) = \frac{1}{1 + \exp(-v)}$; $g(.)$ allows 
a final transformation and is chosen here to be 
the logistic. $\beta = (\alpha_{0,1}, \alpha_1, \alpha_{0,2}, \alpha_2, \theta_0, \theta_1, \theta_2)$ are unknown 
parameters, often called weights, and we seek values 
for them that make the model fit the data “well”, 
for instance by minimizing twice the opposite of the 
log-likelihood function $\ell(\beta)$ defined in Equation 
(3). The generic approach to this optimization problem 
is by gradient descent, called back-propagation 
in this setting. Because of the compositional form 
of the model, the gradient can be easily derived 
using the chain rule for differentiation. This can be 
computed by a forward and backward sweep over 
the network, keeping track only of quantities local 
to each unit. Details on the back-propagation algorithm 
can be found in (Rumelhart et al. 1986). 

Figure 2. Illustration of a neural network (input layer, hidden layer, output layer). 

In our specific case, a neural network can be 
seen as a nonlinear generalization of the logistic 
regression model presented in Section 2.2. 
However, it lacks its interpretability and is often 
perceived as the “ultimate black-box model”. 
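
The forward pass of a single hidden layer network with a logistic output can be sketched in a few lines. The weight layout below (one list per hidden unit with the bias first, and one list for the output unit) and the function names are assumptions made for illustration; training by back-propagation is not shown.

```python
import math


def sigmoid(v):
    """Numerically stable sigmoid 1 / (1 + exp(-v))."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def forward(x, hidden_weights, output_weights):
    """Estimated P(Z = 1 | X = x) for a single hidden layer network.

    hidden_weights: one list [bias, w_1, ..., w_p] per hidden unit
    output_weights: [bias, theta_1, ..., theta_K] for K hidden units
    """
    # Stage 1: derived features, one sigmoid of a linear combination per hidden unit.
    features = [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))
                for w in hidden_weights]
    # Stage 2: logistic transformation of a linear combination of the features.
    v = output_weights[0] + sum(t * h for t, h in zip(output_weights[1:], features))
    return sigmoid(v)
```

With no hidden nonlinearity the model would collapse to the logistic regression of Section 2.2, which is one way to see the “nonlinear generalization” mentioned above.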


2.5 Support vector machine 


In our classification context, a Support Vector 
Machine (SVM) consists of constructing a hyper- 
plane that separates the data into two classes corre- 
sponding to the possible values for Y. Intuitively, if 
we suppose the two classes are linearly separable, a 
good separation is achieved by the hyperplane that 
has the largest distance to the nearest data point of 
any class, as in Figure 3. 
More formally, a hyperplane is defined by: 

$$\left\{X \in \mathbb{R}^p : h(X) = \beta_0 + \sum_{j=1}^{p} \beta_j X^{(j)} = \beta_0 + X^T\beta = 0\right\}$$   (11) 

with $\beta$ a unit vector: $\|\beta\| = 1$. If we consider target 
variable $\tilde{Z}$, the classification rule induced by $h(.)$ 
is $f(X) = \mathrm{sign}(\beta_0 + X^T\beta)$. 

Figure 3. Illustration of an optimal separating hyperplane 
(in green) and a non optimal one (in purple). 

If the two classes are separable, we want 
to find a function $h(X) = \beta_0 + X^T\beta$ with 
$\tilde{Z}_i h(X_i) > 0$, $\forall i = 1, \ldots, n$, that creates the biggest 
margin $M$ between the data for class $-1$ and $1$. The 
following optimization problem formalizes this idea: 

$$\max_{\beta_0 \in \mathbb{R},\ \beta \in \mathbb{R}^p,\ \|\beta\| = 1} M \quad \text{subject to} \quad \tilde{Z}_i(\beta_0 + X_i^T\beta) \ge M,\ 1 \le i \le n.$$   (12) 


If the two classes overlap, one way to deal with 
this overlap is to still maximize margin M, but allow 
for some points to be on the wrong side of the mar- 
gin. Define the slack variables £=(&,...,). One 
way to modify the constraint in Equation (12) is: 


Z(4, 


with &>0,1<i<n, and 2 é< constant. é 
in constraint (13) is the proportional amount by 
which the prediction A(X) is on the wrong side of 
its margin. Hence by bounding the sum X $> 
we bound the total proportional amount by which 
predictions fall on the wrong side of their margin. 
Misclassifications occur when ¢; >1 , so bounding 

ie at a value C say, bounds the total number 
of misclassifications at C. 

Whereas the original problem may be stated in the 
original space, it often happens that the two sets to 
discriminate are not linearly separable in that space. 
For this reason, it is proposed that the original finite 
dimensional space be mapped into a much higher 
dimensional space, presumably making the linear 
separation easier in that space, as illustrated in Fig- 
ure 4. This transformation is called the “kernel trick”. 

To keep the computational load reasonable, the 
mappings used by SVM schemes are designed to 
ensure that dot products may be computed eas- 
ily in terms of the variables in the original space, 
by defining them in terms of a kernel function 
Ø( X,U) selected to suit the problem. The hyper- 
planes in the higher dimensional space (which may 
be nonlinear in the original input space) are defined 
as the set of points whose dot product with a vector 
in that space is constant. The vectors defining the 


+X7f)> M(1-&),Vi=l,....0 (13) 


Figure 4. Illustration of the mapping transformation.

hyperplanes can be chosen to be linear combinations with parameters $\alpha_i$ of images of feature vectors $X_i$, $1 \le i \le n$. With this choice of a hyperplane, the points X in the feature space that are mapped into the hyperplane are defined by the relation $\sum_{i=1}^{n} \alpha_i \varphi(X_i, X) = \text{constant}$. If $\varphi(X, U)$ becomes small as U grows further away from X, each term in the sum measures the degree of closeness of the point X to the corresponding database point $X_i$. In this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. In our case, we tested four different kernel functions: linear, $\varphi(X, U) = X^{\top} U$; Gaussian radial, $\varphi(X, U) = \exp(-\sigma \|X - U\|^2)$, $\sigma \in \mathbb{R}$; inhomogeneous polynomial (of order $d \in \mathbb{N}$), $\varphi(X, U) = (\gamma X^{\top} U + r)^d$, $\gamma, r \in \mathbb{R}$; and hyperbolic tangent, $\varphi(X, U) = \tanh(\kappa X^{\top} U + \theta)$, $\kappa, \theta \in \mathbb{R}$. We chose d = 2; “good” values for the tuning parameters $\sigma$, $\gamma$, r, $\kappa$ and $\theta$, minimizing the estimated prediction error, can be found by cross-validation as presented in Section 2.2. Further elements on SVM can be found in (Burges 1998).
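The four kernel families listed above can be sketched in a few lines of Python. This is only an illustration: the parameter names (`sigma`, `gamma`, `r`, `kappa`, `theta`), their default values and the helper `kernel_sum` are assumptions chosen for readability, not the settings or the code used in the study (which relied on R).

```python
import math

def dot(x, u):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, u))

def linear_kernel(x, u):
    return dot(x, u)

def gaussian_kernel(x, u, sigma=1.0):
    # exp(-sigma * ||x - u||^2): equals 1 when x == u and decays with distance
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, u))
    return math.exp(-sigma * sq_dist)

def polynomial_kernel(x, u, d=2, gamma=1.0, r=1.0):
    # inhomogeneous polynomial kernel of order d
    return (gamma * dot(x, u) + r) ** d

def tanh_kernel(x, u, kappa=1.0, theta=0.0):
    return math.tanh(kappa * dot(x, u) + theta)

def kernel_sum(alphas, points, x, kernel):
    """Weighted sum of kernels between x and the stored points (hypothetical helper)."""
    return sum(a * kernel(p, x) for a, p in zip(alphas, points))
```

With a kernel that decays with distance, such as the Gaussian one, `kernel_sum` gives more weight to the stored points that lie close to `x`, which is the "degree of closeness" reading given in the text.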

As for neural networks, SVM provides a “black- 
box model” which does not allow any interpretable 
description of how the inputs affect the output. 


3 RESULTS 


3.1 Performance indicators


Before showing the results obtained with the different families of machine learning algorithms presented in the previous sections, we must introduce indicators that make it possible to assess and compare the performance of the techniques on the dataset.

These indicators are based on the “confusion matrix” (also called “error matrix”) presented in Table 1. Once the prediction function f(.) has been built using one of the available machine learning algorithms, it is easy to check, for each observation $(Y_i, X_i)_{1 \le i \le n}$, whether the predicted value $\hat{Y}_i = f(X_i)$ is the same as the real $Y_i$ observed in the dataset. Each row of the matrix represents the instances in a class predicted with the model while each column represents the instances in an actual class, with TP + FP + FN + TN = n.

The error rate or misclassification rate on the dataset is defined as:

$$\mathrm{err}_n = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{f(X_i) \ne Y_i\}} = \frac{FP + FN}{TP + FP + FN + TN}$$ (14)


$\mathrm{err}_n$ lies between 0 and 100%, and the lower $\mathrm{err}_n$ is, the better the classifier f(.) fits the data from a general point of view. Since the case when the indicator characterizing the state of the reactor during its shutdown exceeds the fixed threshold is more critical than the other one, we prefer to correctly predict the class Y = AT, and that is why we also introduce the sensitivity, defined as:

$$\mathrm{se}_n = \frac{TP}{TP + FN}$$ (15)

$\mathrm{se}_n$ lies between 0 and 100%, and the higher $\mathrm{se}_n$ is, the better the classifier f(.) is able to predict the observations where $Y_i = AT$ among the n observations $(Y_i, X_i)_{1 \le i \le n}$ of the dataset.


Table 1. Illustration of a confusion matrix.

                               Actual class
                               Y = AT                 Y = BT
Predicted class    Ŷ = AT      True Positive (TP)     False Positive (FP)
                   Ŷ = BT      False Negative (FN)    True Negative (TN)
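As a small worked illustration of Formulas (14) and (15), the sketch below counts the four cells of the confusion matrix and derives the error rate and the sensitivity from them. The function names and the toy labels are hypothetical; only the class labels AT and BT come from the text.

```python
def confusion_counts(actual, predicted, positive="AT"):
    """Return (TP, FP, FN, TN), treating `positive` as the critical class."""
    tp = fp = fn = tn = 0
    for y, y_hat in zip(actual, predicted):
        if y_hat == positive and y == positive:
            tp += 1
        elif y_hat == positive:
            fp += 1
        elif y == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def error_rate(tp, fp, fn, tn):
    # Formula (14): share of misclassified observations
    return (fp + fn) / (tp + fp + fn + tn)

def sensitivity(tp, fn):
    # Formula (15): share of actual AT cases that are predicted as AT
    return tp / (tp + fn)

# Toy labels, invented for the example
actual    = ["AT", "AT", "AT", "BT", "BT", "BT", "BT", "AT"]
predicted = ["AT", "AT", "BT", "BT", "BT", "AT", "BT", "AT"]
tp, fp, fn, tn = confusion_counts(actual, predicted)
print(error_rate(tp, fp, fn, tn), sensitivity(tp, fn))  # 0.25 0.75
```

In this toy case one AT observation is missed and one BT observation is flagged wrongly, so the error rate is 2/8 while the sensitivity is 3/4.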


3.2 Overfitting and bootstrap 


A prediction model f(.) is built using some set of
“training data”, that is to say exemplary situations
for which the desired output is known. Of course,
the goal is that f(.) has good performance on the
training dataset but also performs well on predict- 
ing the output when fed “validation data” (or “test 
data”) that was not encountered during its training. 
“Overfitting” (sometimes called “overtraining”) is 
the production of a prediction model that corre- 
sponds too closely or exactly to a particular set of 
data, and may therefore fail to fit additional new 
data or predict future observations reliably. Over- 
fitting occurs when a model begins to “memorize” 
training data rather than “learning” to generalize 
from a trend. The potential for overfitting depends 
not only on the number of parameters and data but 
also the conformability of the model structure with 
the data shape, and the magnitude of model error 
compared to the expected level of noise or error 
in the data. To lessen the chance of, or amount of, 
overfitting, several techniques are available, among 
which cross-validation, as introduced in Section 2.2, 
or bootstrap. These procedures consist of testing 
the model's ability to generalize by evaluating its 
performance on a set of data not used for training, 
which is assumed to approximate the typical unseen 
data that a model will encounter. 

The basic idea of a bootstrap iteration, indexed by b, $1 \le b \le B$, is to randomly draw without replacement from the original data $(Y_i, X_i)_{1 \le i \le n}$ two complementary subsets: one training subset, of size 1 < N < n (in our case N = n), which will be used to fit the prediction model f(.), and one validation subset, of size n - N, on which f(.) will be applied to assess its performance on data that were not used to train it. On this validation subset, one can evaluate the error rate $\mathrm{err}_b$ and the sensitivity $\mathrm{se}_b$ using Formulas (14) and (15) applied to the n - N validation data. This is done B times (in our case B = 1000), producing B bootstrap training and validation datasets. We refit the model to each of the B bootstrap training datasets and systematically test it on the corresponding bootstrap validation dataset, leading to B values for the error rate and the sensitivity, $(\mathrm{err}_b, \mathrm{se}_b)_{1 \le b \le B}$. From these B values, one can assess the bootstrap average error rate and the bootstrap average sensitivity:


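$$\mathrm{err}_{\mathrm{Boot}} = \frac{1}{B} \sum_{b=1}^{B} \mathrm{err}_b \quad \text{and} \quad \mathrm{se}_{\mathrm{Boot}} = \frac{1}{B} \sum_{b=1}^{B} \mathrm{se}_b$$ (16)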


These two quantities better estimate the “real” performance of classifier f(.) than $\mathrm{err}_n$ and $\mathrm{se}_n$ do, and make it easy to identify a potential overfitting issue if $(\mathrm{err}_{\mathrm{Boot}}, \mathrm{se}_{\mathrm{Boot}})$ is far from $(\mathrm{err}_n, \mathrm{se}_n)$.
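The resampling loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: `fit_midpoint` is a toy one-feature classifier standing in for f(.), the function names are hypothetical, and iterations whose validation subset contains no AT observation are simply left out of the sensitivity average (a choice made here, not one stated in the text).

```python
import random

def fit_midpoint(train):
    """Toy stand-in for f(.): threshold one feature at the midpoint of the class means."""
    at = [x for x, y in train if y == "AT"]
    bt = [x for x, y in train if y == "BT"]
    if not at or not bt:  # only one class seen: always predict it
        only = "AT" if at else "BT"
        return lambda x: only
    mean_at, mean_bt = sum(at) / len(at), sum(bt) / len(bt)
    cut = (mean_at + mean_bt) / 2
    if mean_at >= mean_bt:
        return lambda x: "AT" if x > cut else "BT"
    return lambda x: "AT" if x < cut else "BT"

def error_and_sensitivity(model, validation):
    """Error rate (14) and sensitivity (15) on one validation subset."""
    tp = fp = fn = tn = 0
    for x, y in validation:
        y_hat = model(x)
        if y_hat == "AT" and y == "AT":
            tp += 1
        elif y_hat == "AT":
            fp += 1
        elif y == "AT":
            fn += 1
        else:
            tn += 1
    err = (fp + fn) / len(validation)
    se = tp / (tp + fn) if tp + fn else None  # undefined without any AT case
    return err, se

def resampled_estimates(data, fit, n_train, n_iter=1000, seed=0):
    """Average error rate and sensitivity over n_iter random train/validation splits."""
    rng = random.Random(seed)
    errs, ses = [], []
    for _ in range(n_iter):
        shuffled = list(data)
        rng.shuffle(shuffled)  # random draw without replacement
        train, validation = shuffled[:n_train], shuffled[n_train:]
        err, se = error_and_sensitivity(fit(train), validation)
        errs.append(err)
        if se is not None:
            ses.append(se)
    mean_se = sum(ses) / len(ses) if ses else None
    return sum(errs) / len(errs), mean_se
```

Comparing the two averages returned by `resampled_estimates` with the error rate and sensitivity measured on the full dataset gives the overfitting check described in the text.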


3.3 Results 


To implement the different algorithms, we used R, an open source language and software environment for statistical computing (https://www.r-project.org/). More precisely, the following packages were used: ‘stats’ for stepwise logistic regression, ‘glmnet’ for Ridge and LASSO, ‘plsRglm’ for PLS, ‘spls’ for Sparse PLS, ‘rpart’ for classification trees, ‘adabag’ for bagging, ‘ada’ for boosting, ‘nnet’ for neural network and ‘e1071’ for the different SVM versions. Table 2 gives the performance indicators $(\mathrm{err}_{\mathrm{Boot}}, \mathrm{se}_{\mathrm{Boot}})$ for each of the machine learning algorithms that have been applied to our dataset.


Table 2. Summary of the results obtained with the different algorithms.

Algorithm                  err_Boot (%)   se_Boot (%)
Stepwise                    4.49          92.59
Ridge                      13.48          66.67
LASSO                      10.11          74.07
PLS                         3.38          93.24
Sparse PLS                 14.61          93.87
Classification tree         5.12          96.15
Bagging                     5.61          95.71
Boosting                    3.37          96.32
Neural network              4.69          96.17
Linear SVM                  4.39          90.86
Gaussian radial SVM         3.46          91.03
Polynomial SVM              5.62          88.89
Hyperbolic tangent SVM      7.86          85.11


4 DISCUSSION 


Boosting is the most efficient approach, with both the lowest $\mathrm{err}_{\mathrm{Boot}}$ and the highest $\mathrm{se}_{\mathrm{Boot}}$. PLS logistic regression and neural network come in second position. Classification tree and bagging have promising sensitivity but a somewhat disappointing error rate. The different SVM kernels give fairly similar, middling results. The two shrinkage variants of the logistic regression, Ridge and LASSO, give the worst models, quite far behind all the other techniques.

Based on this single study, it is of course impossible to draw any general conclusions about the potential superiority of one algorithm over the others. Many other numerical tests should be carried out on simulated (and not real industrial) data to achieve such an ambitious goal. Nevertheless, it is interesting to highlight that the performance indicators are rather good, even excellent, given the size of the dataset. Indeed, in our case, we are far from what is called “big data”, since we only have n = 89 observations and p = 25 variables to predict our binary target output. This finding runs counter to the popular belief that machine learning algorithms necessarily require a huge amount of data. Nevertheless, one must not be mistaken: it is unrealistic to imagine that these black boxes will solve any problem and always be efficient, even with little data.

From the perspective of the decision maker, even if boosting is the most efficient algorithm in our case study, they may prefer a more interpretable model, such as the PLS logistic regression. Indeed, this approach has a quite honorable prediction performance and provides at the same time an interpretable description of how the inputs affect the output (see Equation (8)). Although interpretability and the explanation of the fundamental principles of the model are not necessary in various fields, they become significant arguments when traceability, auditability, transparency or physical justification are required, as in the nuclear industry.

It is also often pointed out that machine learning algorithms are greedy, requiring prohibitive computational time and/or memory to train the models or predict the outputs. With our small dataset, this was not an issue on a standard computer, even when carrying out the bootstrap procedure. We should also mention that open source statistical software, such as R, makes such machine learning algorithms financially accessible to any company, even if their use requires relevant expertise in order to properly parametrize the algorithms and avoid the potential pitfalls (for instance overfitting). The transferability of such black-box predictive models outside R&D divisions is also a real issue, in particular to engineering divisions which often only have spreadsheet software to make calculations.

Last but not least: before applying any machine learning algorithm, the quality of the input data must first be ensured. Moreover, it can only be beneficial for the data scientist, who designs and manipulates these black boxes, to consult the experts of the technical application field.


5 CONCLUSION AND PROSPECTS 


Thirteen supervised machine learning techniques 
have been tested on a real dataset from the nuclear 
industry. The fundamental principles of these algo- 
rithms have been presented and their prediction 
performance has been assessed on the case study. 

The most efficient methods give really promis- 
ing results, especially compared to the small size of 
the available data. Nevertheless one would be well 
advised not to draw any general conclusions about 
the efficiency of these techniques. 

Several prospects and extensions of this study 
can be envisaged. Other machine learning algo- 
rithms, such as discriminant analysis or random 
forests, could be tested on the same dataset to 
assess how they perform. An intensive study based 
on simulated data could also be carried out to try 
to identify if some algorithms are more efficient 
than others on datasets with characteristics close 
to those met in the nuclear industry.


New resilience performance indices based on the k-terminal
reliability of the complete graph


C. Tanguy 
Orange Labs, Orange/IMT/OLN/GDMI/TRM, Châtillon, France


ABSTRACT: In network reliability, the best-known performance measure is the so-called all-terminal reliability $\mathrm{Rel}_A(G)$, i.e., the probability that all the nodes of the network (or its underlying graph G) are connected. A particular family of graphs, namely the complete graphs $K_n$, in which each node is connected to the n - 1 others, has long been a key topic in this field. Brown, Cox and Ehrenborg have recently proposed new performance indices for networks based on the all-terminal reliability $\mathrm{Rel}_A(K_n)$: (i) the average reliability $\overline{\mathrm{Rel}}_A(K_n)$, (ii) the “average of the average” all-terminal reliability AvgAvg(G, n) of all the graphs with n vertices and at most one edge between two nodes. They showed that as n increases, these measures tend to 1. In this work, we generalize their idea to the k-terminal reliability (the probability that k nodes are connected), which can also be used to describe a network’s resilience. The new measures can be derived from the k-terminal reliability of the complete graph. Since we are interested in the application of these indices to large systems (n >> 1), we have performed numerical investigations to assess their variations with n. We have identified the leading, analytical correction (in 1/n) to unity. This corroborates a previous conjecture made on the asymptotic value of $\overline{\mathrm{Rel}}_A(K_n)$. These new results could be helpful as simple, ready-to-use evaluations/orders of magnitude of the resilience of a network, when its size is so large as to make exact or approximate computations either impossible or very cumbersome.


1 INTRODUCTION 

Complete graphs are graphs in which each vertex is connected to all other vertices (see Figure 1 for the first examples of $K_n$, the complete graph with n vertices). They have attracted interest for a very long time. Among the pioneering papers on random graphs (Erdős and Rényi 1959, Gilbert 1959), one focused on the possible application to the probability $P_N$ that N telephone central offices can call each other (Gilbert 1959), or that two offices can be connected (with probability $R_N$), assuming that the probability of connection between two nodes is p. The following simple expressions were given (Gilbert 1959, Frank and Gaul 1982)

$$P_N \sim 1 - N q^{N-1},$$ (1)
$$R_N \sim 1 - 2 q^{N-1},$$ (2)

where q = 1 - p.

Figure 1. First complete graphs $K_n$ ($3 \le n \le 6$).

More recently, complete graphs $K_n$ have been investigated in the context of the resilience of large networks (Sekine and Imai 1998, Imai et al. 1999, Tsitsiashvili 2011), and in particular for wireless networks (Park 2016). They also appear in performance measures recently defined by Brown and collaborators (Cox 2013, Brown et al. 2014). The first one is simply the average all-terminal reliability of a graph G over the interval [0,1]:

$$\overline{\mathrm{Rel}}_A(G) = \int_0^1 \mathrm{Rel}_A(G; p)\,dp.$$ (3)

The second one is the “average of the average” reliability of simple graphs (a “simple” graph is such that there is at most one link/edge between two given vertices) G with n vertices, AvgAvg(G, n), which can be expressed (Cox 2013) as

$$\mathrm{AvgAvg}(G, n) = 2 \int_0^{1/2} \mathrm{Rel}_A(K_n; p)\,dp.$$ (4)

Cox and collaborators (Cox 2013, Brown et al. 2014) showed that in the $n \to \infty$ limit, these two averages should go to one.

Multicast procedures in modern networks call for accurate assessments of the k-terminal reliabilities $\mathrm{Rel}_k(p)$ too. Let us recall that the k-terminal reliability is the probability that the k vertices of interest are connected. In this context, it is hardly surprising that the resilience of such procedures has stimulated a lot of work, from very mathematical approaches to more pragmatic ones. Its definition may also vary among authors: some of them (Colbourn 1993, Farley and Colbourn 2009) associate it with the number of nodes that are still connected, whereas others (Cox 2013, Brown et al. 2014, Heidtmann 2016) compute it as the average of all possible connections between k nodes of the system.

In a recent work (Tanguy 2017), we improved the asymptotic expansions of $P_N$ and $R_N$, and generalized them to the k-terminal reliability for the complete graph $K_n$. We also provided an estimate of the asymptotic expansion of $\overline{\mathrm{Rel}}_A(K_n)$, derived from numerical simulations.

In this work, we have addressed the generalization of equations (3) and (4) to the k-terminal reliability of k specific nodes of the system:

$$\overline{\mathrm{Rel}}_k(G) = \int_0^1 \mathrm{Rel}_k(G; p)\,dp.$$ (5)

The second generalization is the “average of the average” reliability of simple graphs G with n vertices, AvgAvg(G, n), which can be expressed (Cox 2013) as (see below for a justification)

$$\mathrm{AvgAvg}(\mathrm{Rel}_k; G, n) = 2 \int_0^{1/2} \mathrm{Rel}_k(K_n; p)\,dp.$$ (6)


When all the link reliabilities are equal to p, all nodes are equivalent. The above expressions therefore give a direct expression for the resilience as viewed by Cox (2013), Brown et al. (2014) and Heidtmann (2016).

This paper is organized as follows: we describe in Section 2 how to compute the k-terminal reliability for the complete graph $K_n$, and the associated averages. We then present in Section 3 a few figures showing how these averages approach 1 when n goes to infinity. We describe in Section 4 the determination of the first-order correction to unity for several values of k, from which we propose a closed-form expression. We finally apply these results to the case of the all-terminal reliability of the complete graph $K_n$.


2 MATHEMATICAL PROCEDURES 
AND DEFINITIONS 

2.1 Asymptotic expansions of the reliabilities of the complete graph $K_n$


The all-terminal reliability $\mathrm{Rel}_A(K_n; p) = A_n$ can be obtained recursively (Gilbert 1959):

$$A_n = 1 - \sum_{j=1}^{n-1} C_{n-1}^{j-1} A_j (1-p)^{j(n-j)},$$ (7)
$$A_1 = 1,$$ (8)

where $C_n^k$ is the binomial coefficient

$$C_n^k = \frac{n!}{k!\,(n-k)!}.$$ (9)

The k-terminal reliability $\mathrm{Rel}_k(K_n; p) = B_k$ is then deduced from all the $A_j$'s ($1 \le j \le n$) by

$$B_k = \sum_{j=k}^{n} C_{n-k}^{j-k} A_j (1-p)^{j(n-j)}.$$ (10)
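The recursion for the all-terminal reliability and the sum giving the k-terminal reliability can be evaluated exactly with rational arithmetic. The sketch below is an illustration only: the function names are hypothetical and it assumes the standard form of Gilbert's recursion, in which the component containing a given node has j vertices and is cut off from the n - j others.

```python
from fractions import Fraction
from math import comb

def all_terminal(n, p):
    """Return the list [None, A_1, ..., A_n] for link reliability p (Gilbert's recursion)."""
    q = 1 - p
    A = [None, 1]  # A_1 = 1
    for m in range(2, n + 1):
        # probability that the component of a fixed node has j < m vertices
        cut_off = sum(comb(m - 1, j - 1) * A[j] * q ** (j * (m - j))
                      for j in range(1, m))
        A.append(1 - cut_off)
    return A

def k_terminal(n, k, p):
    """B_k: probability that k given nodes of K_n lie in the same connected component."""
    q = 1 - p
    A = all_terminal(n, p)
    return sum(comb(n - k, j - k) * A[j] * q ** (j * (n - j))
               for j in range(k, n + 1))

p = Fraction(1, 2)
print(k_terminal(3, 2, p))  # 5/8, i.e. p + (1 - p) * p**2 for the triangle
print(k_terminal(4, 4, p))  # 19/32, the all-terminal reliability of K_4
```

Passing a `Fraction` for p keeps every intermediate value exact, which makes small cases easy to check by hand: for the triangle, two given nodes are connected either directly or through the third node.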
2.2 Asymptotic expansions of $A_n$ and $B_k$


In our previous work (Tanguy 2017), we extended Gilbert’s results (Gilbert 1959) for the asymptotic expansions in the case of large n and fixed q:

$$A_n \approx 1 - n\,q^{n-1} - \frac{n(n-1)}{2}\,(1 - 2q)\,q^{2n-4} + \cdots,$$ (11)

from which we derived

$$B_k \approx 1 - k\,q^{n-1} + q^{2n-4} \left\{ C_k^2 - k(n-1)(1-q) + \delta_{k,2}\,(1-q) \right\} + \cdots.$$ (12)

Equation (12) will be used in the following.


2.3 Average k-terminal reliabilities


We first start by defining, for k specific nodes,

$$\overline{\mathrm{Rel}}_k(K_n) = \int_0^1 \mathrm{Rel}_k(K_n; p)\,dp.$$ (13)

The second generalization is the “average of the average” reliability of simple graphs G with n vertices, AvgAvg(G, n), which can be expressed (Cox 2013) as

$$\mathrm{AvgAvg}(\mathrm{Rel}_k; G, n) = \int_0^1 \mathrm{Rel}_k(K_n; p/2)\,dp.$$ (14)

The origin of equation (14) is simple: all the k-terminal reliabilities are affine functions of each individual link reliability. Since we consider all possible graphs on n vertices, each link will be present (or not). The average is thus given by the corresponding reliability of the complete graph $K_n$, for which each link has a reliability equal to (0 + p)/2 = p/2. We can thus write

$$\mathrm{AvgAvg}(\mathrm{Rel}_k; G, n) = 2 \int_0^{1/2} \mathrm{Rel}_k(K_n; t)\,dt.$$ (15)

We deduce that

$$\mathrm{AvgAvg}(\mathrm{Rel}_k; G, n) = 2\,\overline{\mathrm{Rel}}_k(K_n) - 1 + 2 \int_{1/2}^{1} \left( 1 - \mathrm{Rel}_k(K_n; t) \right) dt.$$ (16)

Because of equation (12), the last integral in the above equation vanishes as $(2k)/(n\,2^n)$ when n goes to infinity. Consequently, the asymptotic expansions of the averages are linked by

$$1 - \mathrm{AvgAvg}(\mathrm{Rel}_k; G, n) \sim 2 \left( 1 - \overline{\mathrm{Rel}}_k(K_n) \right).$$ (17)

By studying the behavior of $\overline{\mathrm{Rel}}_k(K_n)$ when $n \gg 1$, we can get the asymptotic expansion of $\mathrm{AvgAvg}(\mathrm{Rel}_k; G, n)$ too.
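Because the k-terminal reliabilities of the complete graph are polynomials with integer coefficients, both averages can be computed exactly. The following sketch is an illustration under assumptions: polynomials are stored as coefficient lists in the variable q = 1 - p (a representation chosen here for convenience), and all function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def poly_add(a, b):
    """Add two polynomials given as coefficient lists (index = power of q)."""
    size = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
            for i in range(size)]

def all_terminal_polys(n):
    """A_1, ..., A_n as integer polynomials in q = 1 - p (Gilbert's recursion)."""
    A = [None, [1]]
    for m in range(2, n + 1):
        cut_off = [0]
        for j in range(1, m):
            # C(m-1, j-1) * A_j * q^(j(m-j)): shift coefficients by j(m-j)
            term = [0] * (j * (m - j)) + [comb(m - 1, j - 1) * c for c in A[j]]
            cut_off = poly_add(cut_off, term)
        A.append(poly_add([1], [-c for c in cut_off]))
    return A

def k_terminal_poly(n, k):
    """Rel_k(K_n) as an integer polynomial in q."""
    A = all_terminal_polys(n)
    B = [0]
    for j in range(k, n + 1):
        term = [0] * (j * (n - j)) + [comb(n - k, j - k) * c for c in A[j]]
        B = poly_add(B, term)
    return B

def integrate(poly, lo, hi):
    """Exact integral of the polynomial over [lo, hi]."""
    lo, hi = Fraction(lo), Fraction(hi)
    return sum(Fraction(c, i + 1) * (hi ** (i + 1) - lo ** (i + 1))
               for i, c in enumerate(poly))

def average_reliability(n, k):
    # integral over p in [0, 1] equals the integral over q in [0, 1]
    return integrate(k_terminal_poly(n, k), 0, 1)

def avg_avg(n, k):
    # p in [0, 1/2] corresponds to q in [1/2, 1]
    return 2 * integrate(k_terminal_poly(n, k), Fraction(1, 2), 1)
```

For instance, `average_reliability(3, 2)` returns 7/12, the integral of p + p^2 - p^3 over [0, 1]. Evaluating `n * (1 - average_reliability(n, k))` for increasing n is one way to explore the first-order correction numerically, although exact arithmetic becomes slow as n grows.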


3 EXAMPLES FOR 2 ≤ k ≤ 4


In this section, we present the results of our numerical calculations of $\overline{\mathrm{Rel}}_k(K_n)$ and $\mathrm{AvgAvg}(\mathrm{Rel}_k; G, n)$ as functions of n, for the first values of k.

We obtained the different $\mathrm{Rel}_k(K_n; p)$ from equation (10). Since they are polynomials in p with integer coefficients, the integrals in equations (13) and (14) were easily obtained.


3.1 $\overline{\mathrm{Rel}}_k(K_n)$


We have represented in Figures 2-4 their variations with n. Obviously, these quantities go rapidly to 1, so much so that the curves cannot be distinguished from unity when n = 200.


3.2 $\mathrm{AvgAvg}(\mathrm{Rel}_k; G, n)$


We have displayed in Figures 5-7 the variation with n of $\mathrm{AvgAvg}(\mathrm{Rel}_k; G, n)$ for 2 ≤ k ≤ 4. Even though the values go to 1 as n increases, they do so less rapidly than the average reliabilities of the preceding paragraph, in agreement with equation (17).


4 ASYMPTOTIC EXPANSIONS 
OF THE AVERAGES 


From the numerical values obtained in Section 3, we were able to see that

$$\overline{\mathrm{Rel}}_k(K_n) = 1 - \frac{C_k}{n} + o\!\left(\frac{1}{n}\right).$$ (18)

The determination of $C_k$ has been done by using “convergence acceleration methods”. Even though the convergence to 1 is rather slow, it is possible to assess the limit $C_k$ by using the Richardson extrapolation (Bender and Orszag 1999). Performing these computations, we found that $C_2 \approx 2.00000000$, $C_3 \approx 2.24999999$, and $C_4 \approx 2.4444444$, implying that $C_2$, $C_3$ and $C_4$ may well be 2, 9/4, and 22/9, respectively. This was a strong indication that all the $C_k$'s are rational numbers. Using large values of n (up to 700), we were able to increase the accuracy of our evaluations of the first constants, the identifications of which are given in Table 1.

Table 1. Values of the first constants $C_k$ ($2 \le k \le 10$).

k      C_k
2      2
3      9/4
4      22/9
5      125/48
6      137/50
7      343/120
8      726/245
9      6849/2240
10     7129/2268
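The Richardson extrapolation mentioned above can be sketched in its textbook form: for a sequence behaving like L + c_1/n + c_2/n^2 + ..., a weighted combination of N + 1 consecutive terms cancels the first N correction terms. The function name and the synthetic test sequence below are assumptions made for the illustration; the paper's own computations are not reproduced here.

```python
from fractions import Fraction
from math import factorial

def richardson(seq, n, order):
    """Order-`order` Richardson estimate of the limit of seq(n) as n grows.

    Assumes seq(n) = L + c1/n + c2/n**2 + ...; the result is the order-th
    finite difference of m**order * seq(m) at m = n, divided by order!.
    """
    total = Fraction(0)
    for j in range(order + 1):
        weight = Fraction((-1) ** (j + order) * (n + j) ** order,
                          factorial(j) * factorial(order - j))
        total += weight * seq(n + j)
    return total

def a(n):
    # Synthetic sequence with known limit 2 (invented for the example)
    return Fraction(2) + Fraction(3, n) - Fraction(5, n * n)

print(a(10))                 # 9/4, still far from the limit
print(richardson(a, 10, 2))  # 2, both correction terms are cancelled
```

Applied to the sequence n(1 - average reliability), whose limit is the constant sought, the same combination converges much faster than the raw sequence, which is why a limited range of n can be enough to recognize rational values.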


The last part of the study was to infer the general formula for $C_k$. By trial and error, and taking the factorization of the denominators into account, we were able to find that

$$C_k = \frac{k}{k-1}\,H_{k-1},$$ (19)

where $H_k = 1 + \frac{1}{2} + \cdots + \frac{1}{k}$ is the so-called harmonic number of order k.
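The rational values listed in Table 1 are all reproduced by the closed form C_k = k H_{k-1} / (k - 1), with H_k the harmonic number of order k. In the sketch below this closed form is an inference checked against the table entries, and the function names are hypothetical.

```python
from fractions import Fraction

def harmonic(k):
    """Harmonic number H_k = 1 + 1/2 + ... + 1/k as an exact fraction."""
    return sum(Fraction(1, i) for i in range(1, k + 1))

def c_k(k):
    """Candidate closed form for the first-order constant (k >= 2)."""
    return Fraction(k, k - 1) * harmonic(k - 1)

for k in range(2, 11):
    print(k, c_k(k))  # 2, 9/4, 22/9, 125/48, 137/50, 343/120, 726/245, 6849/2240, 7129/2268
```

Exact fractions make the comparison with the table unambiguous, since no rounding is involved.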

As k increases, the precision of C, gets greater. 
We have checked that equation (21) still gives sat- 
isfactory values; for instance, in the case k = 150, 
the difference between our estimate and Cso is less 
than 1071. 


4.1 Consequence for $\overline{\mathrm{Rel}}_A(K_n)$

In (Tanguy 2017), we had computed the average $\overline{\mathrm{Rel}}_A(K_n)$ as a function of n, and proposed that

$$\overline{\mathrm{Rel}}_A(K_n) \sim 1 - \frac{\ln n + C}{n},$$ (20)

where C is the Euler gamma constant (using the notation of Gradshteyn and Ryzhik, C = 0.5772156649). This assumption was correct since for the all-terminal reliability, we have to replace k by n, and because

$$H_n \to \ln n + C$$ (21)

as $n \to \infty$.


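4.2 Asymptotic expansion of $\mathrm{AvgAvg}(\mathrm{Rel}_k; G, n)$


If we now turn to $\mathrm{AvgAvg}(\mathrm{Rel}_k; G, n)$, we deduce from equation (17)

$$\mathrm{AvgAvg}(\mathrm{Rel}_k; G, n) \sim 1 - \frac{2k}{k-1}\,\frac{H_{k-1}}{n}.$$ (22)

A by-product of this expression is of course

$$\mathrm{AvgAvg}(\mathrm{Rel}_A; G, n) \sim 1 - 2\,\frac{\ln n + C}{n}.$$ (23)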


5 CONCLUSION AND OUTLOOK 


We have generalized the asymptotic expansions of (Cox 2013, Brown et al. 2014) for averaged k-terminal reliabilities of the complete graph $K_n$, which are performance measures recently introduced for the resilience of large networks. The first correction to unity, as n goes to infinity, has a very simple form, and confirms the assumption made in a previous work (Tanguy 2017) on the averaged all-terminal reliability for the complete graph. Other performance measures for the complete graph are currently under investigation, and will be presented elsewhere.


A mathematical programming approach to railway network
asset management


C. Fecarotti & J. Andrews 
Resilience Engineering Research Group, Department of Civil Engineering, The University of Nottingham, 
Nottingham, UK 


ABSTRACT: A main challenge in railway asset management is selecting the maintenance strategies to apply to each asset on the network in order to effectively manage the railway infrastructure, given that performance and safety targets have to be met under budget constraints. Due to economic, functional and operational dependencies between different assets and different sections of the network, optimal solutions at network level do not always include the best strategies available for each asset group. This paper presents a modelling approach to support decisions on how to effectively maintain a railway infrastructure system. For each railway asset, asset state models combining degradation and maintenance are used to assess the impact of any maintenance strategy on the future asset performance. The asset state models inform a network-level optimisation model aimed at selecting the best combination of maintenance strategies to manage each section of a given railway network in order to minimise the impact of the assets' conditions on service, given budget constraints and performance targets. The optimisation problem is formulated as an integer-programming model. By varying the model parameters, scenario analysis can be performed so that the infrastructure manager is provided with a range of solutions for different combinations of available budget and performance targets.


1 INTRODUCTION

The railway system is the result of the interaction of a number of different systems and infrastructure with the ultimate aim of transporting people and goods safely and on time. It consists of a diverse portfolio of assets, each bound to deliver a specific function but all together contributing to ultimately provide a reliable and safe service. Each railway asset is subject to degradation and failure processes, and maintenance is performed in order to control the state of the assets and ensure that each asset’s function is performed to the required standard. Maintenance policies are developed as a combination of periodic inspection, routine and emergency maintenance, enhancement and renewal activities, and these are specific to each railway asset. As maintenance resources and budget are limited, decisions have to be made on how to optimally allocate the available resources among all the assets on the network. Infrastructure asset management is the process of allocating maintenance resources among the assets comprising the system with the aim of minimising the whole-life costs while maximising the system performance. Optimal asset management involves decision making and selection of the best intervention strategy for each asset along the network in order to ensure that the required level of service reliability and safety risk is achieved within budget. Determining the best set of strategies for a given network does not simply consist in choosing the strategy which is optimal for each asset. When a network perspective is adopted, dependencies among different assets and different sections of the network arise, due for example to resource availability. This implies that intervention strategies that are optimal when an asset is considered individually might not be optimal when decisions are made at a network level.


1.1 Modelling approaches to infrastructure asset management optimisation

Optimisation models have been presented in the literature to support infrastructure asset management from different perspectives and to address different aspects of the problem. Two main approaches to the optimisation of infrastructure asset management can be identified: asset-level and system-level optimisation. Asset-level optimisation aims at determining optimal maintenance policies for an individual asset, while system-level optimisation seeks the optimal combination of maintenance policies for all the assets comprising the system. The focus of this paper is on system-level optimisation; in the following, the modelling approaches developed in the literature to determine the optimal set of maintenance policies for infrastructure systems composed of multiple assets are briefly discussed.

The authors in Yeo et al. (2012) address the problem of planning maintenance for a system of heterogeneous facilities undergoing stochastic deterioration over a finite time horizon. They develop a two-stage bottom-up approach according to which optimal maintenance policies are first determined for each facility. The deterioration of each facility is modelled as a Markov process. The state of the facilities is known at the beginning of every year when inspection is performed, and maintenance activities are selected year by year. The authors apply a dynamic programming approach to find the optimal activity as well as the alternative near-optimal activities and associated costs for each facility. Then, a system-level optimisation is developed to obtain the combination of activities, one for each facility, that minimises the system expected cost-to-go while the agency cost (cost of the maintenance activities) is kept within a given budget. All facilities in the system are considered to be independent and the system-level optimisation problem is formulated as a constrained combinatorial problem.
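The system-level step described above, choosing one activity per facility so that the total expected cost is minimal while the agency cost stays within budget, can be illustrated by a brute-force enumeration on a tiny instance. Everything in this sketch is hypothetical: the activity names, the cost figures and the function `best_combination` are invented for the example and are not taken from the cited papers, which use dedicated optimisation methods.

```python
from itertools import product

def best_combination(options, budget):
    """Pick one (name, agency_cost, expected_cost) option per facility.

    Returns (total_expected_cost, total_agency_cost, chosen_names) for the
    cheapest feasible combination, or None if no combination fits the budget.
    """
    best = None
    for combo in product(*options):
        agency = sum(option[1] for option in combo)
        if agency > budget:
            continue  # violates the budget constraint
        expected = sum(option[2] for option in combo)
        if best is None or expected < best[0]:
            best = (expected, agency, [option[0] for option in combo])
    return best

# Two facilities, three candidate activities each (invented figures)
options = [
    [("do nothing", 0, 10), ("repair", 4, 5), ("renew", 9, 2)],
    [("do nothing", 0, 8), ("repair", 3, 4), ("renew", 7, 2)],
]
print(best_combination(options, budget=10))  # (9, 7, ['repair', 'repair'])
print(best_combination(options, budget=6))   # (13, 4, ['repair', 'do nothing'])
```

With a budget of 10, renewing either facility would be the best choice for that facility taken alone, yet the best feasible combination repairs both. Enumeration grows exponentially with the number of facilities, which is why realistic instances call for integer programming or similar techniques.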

A similar approach has been used in Furuya & Madanat (2013), where facility-level and system-level optimisation are combined to obtain the best combination of activities for all facilities in a given network. The authors demonstrate their approach on a hypothetical dual redundant railway network. A number of facilities are associated with each link in the network, and a set of available maintenance activities is considered for each facility. As in Yeo et al. (2012), the degradation and maintenance of the railway assets are modelled as a Markov process, and the facility-level optimisation problem is formulated as a Markov decision process solved through dynamic programming. In the system-level optimisation problem, the budget constraint includes the cost reduction that can be achieved when adjacent facilities are maintained simultaneously. Constraints are also formulated on the minimum capacity to be guaranteed between an origin and a destination node and for each individual route. This makes it possible to consider the loss of throughput due to maintaining adjacent facilities simultaneously. A numerical example is solved, which demonstrates how including both the economic (opportunistic maintenance) and functional (capacity loss) dependencies that arise between the assets when performing maintenance has an impact on the optimal decision and associated lifecycle cost.

In Robelin & Madanat (2008) the authors address the optimisation of maintenance policies for a system of bridge decks with the objective of determining the optimal set of policies based on the current system conditions as well as the prediction of future conditions. The deterioration model of an individual deck is Markovian, where each state is defined in terms of the current condition of the deck, the last maintenance action performed and the time since the last intervention. The condition of a deck is given by its instantaneous probability of failure. A two-step approach is suggested. First, a facility-level optimisation is solved to obtain the optimal cost of maintenance and replacement for each facility. The facility-level optimisation problem is solved for a discrete range of failure probabilities. Then, at system level, the cost of the system, given by the combination of the costs for each facility, is minimised subject to a budget constraint, and the optimal threshold of failure probability is obtained. This threshold is then fed back into the facility-level optimisation to obtain the set of policies for each deck which are optimal at system level. Some of the assumptions on which the optimisation model in Robelin & Madanat (2008) is based are too restrictive to be applied to the railway system. Many of the railway assets exhibit multiple failure modes, each with different probabilities and frequencies of occurrence. Different failure modes usually have different effects on system performance and must therefore be considered individually. Decisions on maintenance policies must account for the different failure modes so that different effects on service performance can be distinguished, and both safety and performance requirements can be addressed in a cost-effective manner. Another simplifying assumption made in that paper is that, at system level, the optimal threshold of probability of failure is the same for all the facilities. While this makes the optimisation problem easier to solve, it also produces a less realistic model. In real systems the location of the assets on the network may play an important role within the decision-making process. The railway network includes lines and routes with different criticalities corresponding to different safety and service performance targets. It is often the case that, in the trade-off between cost and performance, more expensive policies are likely to be implemented on assets located on lines with higher criticality, while lower performance is accepted on lower criticality lines.

The author in Durango-Cohen (2007) presents a method to simultaneously address the condition and cost forecasting problem and the optimisation of maintenance actions for transportation infrastructure facilities. Facility deterioration is represented as an autoregressive moving average with exogenous input (ARMAX) model. Decision variables can be investment levels or maintenance rates, and the optimisation problem is formulated as a dynamic program seeking the minimum expected discounted cost over the planning horizon. Decisions are made based on the information available at the beginning of the planning period. The use of the ARMAX model is based on the assumption that the effects of maintenance actions are linear and additive. This assumption, however, is too restrictive for many railway assets (e.g. track) as it completely disregards the complexity of the combined effects of different interventions on the future asset state and the consequent impact on costs.

The approach presented in the aforementioned 
papers is aimed at selecting the maintenance poli- 
cies to be adopted year by year over a given time 
horizon. Inspection is not considered as part of the 
policies as it is assumed to be carried out at the 
beginning of every year. However, the frequency
of inspection is an important aspect of every
maintenance policy as it reveals the condition
of an asset before failures occur or unacceptable
degraded states are reached. Indeed, it is
the optimal combination of inspection frequency,
threshold values for asset conditions triggering
interventions and the time required to perform
maintenance that makes an effective maintenance
strategy. Furthermore, most of the contributions
use a Markov approach to model the degradation 
and maintenance processes of the assets. However, 
the Markov approach has a few limitations that 
prevent it from being an effective modelling tool for 
many of the railway assets. A significant limitation
is the requirement of Markov models that
transitions between states in the model (generally
representing degradation or repair) occur at a
constant rate. This means that the state residence
times are exponentially distributed. The memoryless
property of the Markov approach restricts the
ability of the model to consider the maintenance
history, which is important for some railway
asset components such as the track ballast. Furthermore,
the size of a Markov model can experience
a state-space explosion with the number of
components considered, making it difficult to
model assets with many different components or
assets formed by linking several sections of track. One final
significant limitation is its inability to represent a 
route or network perspective. If Markov models 
exist for two assets and it is required to account for 
their dependencies in constructing a route model, 
this can only be accomplished by the generation of 
a completely new model. 
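
The effect of the constant-rate restriction can be illustrated numerically. The following Python sketch compares the probability of surviving one further year, given the age already reached, for an exponential residence time and for a Weibull residence time with shape greater than one. All parameter values are hypothetical and chosen only for illustration; they are not taken from any of the cited studies.

```python
import math


def exp_survival(t, rate):
    """P(T > t) for an exponential residence time with constant rate."""
    return math.exp(-rate * t)


def weibull_survival(t, scale, shape):
    """P(T > t) for a Weibull residence time (shape > 1 models wear-out)."""
    return math.exp(-((t / scale) ** shape))


def conditional_survival(survival, age, horizon):
    """P(T > age + horizon | T > age) for a given survival function."""
    return survival(age + horizon) / survival(age)


# Hypothetical parameters, for illustration only.
RATE = 0.2                # constant transition rate (per year)
SCALE, SHAPE = 5.0, 2.5   # Weibull scale (years) and shape

if __name__ == "__main__":
    for age in (0.0, 2.0, 4.0):
        p_exp = conditional_survival(
            lambda t: exp_survival(t, RATE), age, 1.0)
        p_wei = conditional_survival(
            lambda t: weibull_survival(t, SCALE, SHAPE), age, 1.0)
        print(f"age={age:.0f}  exponential={p_exp:.4f}  weibull={p_wei:.4f}")
```

For the exponential case the conditional probability does not depend on the age, which is the memoryless property. For the Weibull case it decreases with age, which is the wear-out behaviour that a constant-rate model cannot reproduce.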

An alternative modelling technique that over- 
comes some of the limitations of the Markov 
approach in modelling railway asset degradation 
and maintenance is the Petri Net (PN) method. 
PNs are a formalism for modelling complex, 
dynamic systems characterised by concurrency
and dependencies, synchronisation and resource
sharing. PNs provide a valuable mathematical and
graphical description of the system behaviour. The PN
method is a stochastic technique which allows far greater
detail than the alternatives when modelling
asset degradation and complex management
strategies, whilst maintaining a manageable model
size. PNs can accommodate any distribution of degradation
and failure times; thus the increasing failure rate
typical of components subject to wear-out can
be considered. PNs also enable the modelling of
complex maintenance processes including condition-
and risk-based inspection and maintenance,
replacement prior to failure based on either age, 
condition or use, reactive repair, refurbishment and 
renewal and all the rules for the implementation of 
such activities. The resulting PN models are usually 
smaller in size than the alternative Markov repre- 
sentation. An additional and very desirable feature 
of PN models is their modularity. Models of assets 
consisting of many interacting components can be 
built up in parts giving the model a modular struc- 
ture which is easier to analyse. Monte Carlo simu- 
lation is the most common solution technique for 
PN models and produces distributions for the out- 
put variables of interest. The PN approach is sug- 
gested in this paper as a valid modelling technique 
to produce models that combine the degradation 
and maintenance processes involving the railway 
assets. Such models can be used as a tool to inves- 
tigate the effectiveness of a variety of maintenance 
strategies for each railway asset, covering a range
of performance and costs, so as to provide the decision
maker with a set of potential strategies from
which those that are best from a system perspective
can be selected.


2 THE METHODOLOGY 


This paper presents a modelling approach to sup- 
port decisions on how to effectively maintain a 
railway infrastructure system. First, for each rail- 
way asset, a modelling tool is required to assess the 
asset response to the implementation of a range 
of feasible maintenance strategies. Such modelling
tools, called asset state models, combine the degradation/failure
processes affecting the asset with
the intervention activities that can be performed 
in order to predict the future asset state. The asset 
state models developed for each asset inform a net- 
work-level optimisation model aimed at selecting 
the best combination of maintenance strategies to 
manage all the assets on a railway network under 
budget and performance constraints. The network-level
optimisation model is formulated as an integer
program with multiple constraints (Hillier and
Lieberman 2009). The model is constrained to select
one option for each individual asset located in the
considered railway network. Constraints are formulated
on the overall available budget and on the
availability required of each railway line. Different
lines in the network may have a different criticality
depending on the effect that failures have on
service. This is strictly linked to the frequency of
the service running on each line. The different criticality
of the lines is accounted for by imposing different
thresholds on the availability of each line. This
modelling approach has the advantage of enabling
the evaluation of a variety of different scenarios by
changing model parameters such as the available
budget or the threshold levels set for the line
availability.


2.1 Network segmentation for strategic planning purposes


The UK railway network is segmented for policy 
decisions. The whole network is divided into 19 
Strategic Routes, each divided into a number of 
Strategic Route Sections (SRSs). An SRS is a 
section of the railway network characterised by 
broadly homogeneous infrastructure type and traf- 
fic levels. Therefore strategy decisions are taken at 
SRS level. It is assumed that the same maintenance 
strategy will be applied within the same SRS. Asset 
state models are developed for each asset type 
existing on each SRS and are used to assess the 
impact of a range of maintenance strategies on the 
assets’ performance. 


2.2 Asset state models 


The PN method is adopted as the modelling 
approach to develop the asset state models. PNs 
are a formalism for modelling complex distributed 
systems characterised by concurrency and depend- 
ency, synchronization and resource sharing. Petri 
nets provide a valuable mathematical and graphi- 
cal description of the system behaviour. A PN is a 
directed, weighted bi-partite graph where nodes are 
places and transitions connected by arcs (Murata 
1989). A PN can be formally defined as follows. 


Definition 1. A PN is a 5-tuple $PN = (P, T, A, W, M_0)$
where: $P = \{p_1, p_2, \ldots, p_m\}$ is the non-empty
set of places, $T = \{t_1, t_2, \ldots, t_n\}$ is the non-empty set
of transitions, $P \cap T = \emptyset$, $A \subseteq (P \times T) \cup (T \times P)$
is the set of arcs, $W : A \to \{1, 2, \ldots\}$ is the multiplicity
function, and $M_0 : P \to \{0, 1, 2, \ldots\}$ is the initial
marking.


Places may represent physical resources, condi- 
tions or the state of a component. Tokens are held 
in places and the number of tokens in each place 
defines the marking of the Petri net, which represents
the state of the system at a given time. The
flow of tokens through the network is determined 
by transitions and represents the evolution of the 
system state over time. Transitions represent events 
that make the status of the system change. Arcs 
only connect places with transitions (input arcs) 
and vice versa (output arcs). Inhibitor arcs are 
defined as well, which can be used to stop the firing 
of a transition under certain circumstances. Arcs 
are characterised by a multiplicity. The marking 
and the multiplicity of the arcs determine the ena- 
bling conditions for each transition. Transitions 
can be deterministic or stochastic. The former have 
an associated constant firing time, while the latter 
sample their firing time from stochastic distributions.
Firing of transitions is governed by the following rules:

• If the number of tokens contained in the input
  places is at least equal to the multiplicity of the
  associated input arcs, and the number of tokens
  in the places connected by inhibitor arcs is lower
  than the multiplicity of those arcs, then the
  transition is enabled.

• Once the transition is enabled, it will fire after a
  time interval which is fixed for deterministic
  transitions. For stochastic transitions the firing time
  is sampled from a probability distribution.

• When the firing time is reached and the transition
  fires, a number of tokens is removed from
  the input places and added to the output places
  according to the arc multiplicities.
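
The enabling and firing rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used for the asset state models: the names `Transition`, `is_enabled` and `fire` are hypothetical, the marking is stored as a dictionary from place name to token count, and firing delays are deliberately left out so that only the token logic is shown.

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    """A transition with its input, output and inhibitor arcs.

    Each dictionary maps a place name to the multiplicity of the arc.
    """
    name: str
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)
    inhibitors: dict = field(default_factory=dict)


def is_enabled(marking, transition):
    """Check the enabling rule for a transition under a given marking."""
    enough_tokens = all(marking.get(place, 0) >= weight
                        for place, weight in transition.inputs.items())
    not_inhibited = all(marking.get(place, 0) < weight
                        for place, weight in transition.inhibitors.items())
    return enough_tokens and not_inhibited


def fire(marking, transition):
    """Return the new marking obtained by firing an enabled transition."""
    if not is_enabled(marking, transition):
        raise ValueError(f"transition {transition.name} is not enabled")
    new_marking = dict(marking)
    for place, weight in transition.inputs.items():
        new_marking[place] = new_marking.get(place, 0) - weight
    for place, weight in transition.outputs.items():
        new_marking[place] = new_marking.get(place, 0) + weight
    return new_marking


if __name__ == "__main__":
    # Hypothetical two-place net: one token moves from "good" to "degraded".
    degrade = Transition("degrade", inputs={"good": 1}, outputs={"degraded": 1})
    marking = {"good": 1, "degraded": 0}
    print(is_enabled(marking, degrade))
    print(fire(marking, degrade))
```

In a timed simulation, each enabled transition would additionally be given a firing delay, constant for deterministic transitions and sampled for stochastic ones, and the transition with the earliest firing time would be fired first.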


For the purpose of maintenance, a number
of discrete states are usually considered, corresponding
to levels of degradation that trigger
maintenance interventions with different levels of
urgency. The degradation process can therefore be
represented as a chain of places and transitions
as shown in Figure 1. Places $P_{deg,i}$ indicate different
states which are relevant from a maintenance
perspective: each state (except for the new
state) triggers maintenance with a different level
of urgency depending on the level of degradation.
Transitions $T_{deg,i}$ represent the degradation from
one state to the next (worse) one. These are stochastic
transitions whose firing time is sampled from the
distribution of times to degrade between two
consecutive states.
Asset conditions requiring a speed restriction or a
line closure can be included as well, these usually being
the last two levels of degradation. Inspection
is performed periodically to reveal the current asset
condition so that degraded states can be discovered
and maintenance planned accordingly. In Figure 2,
the inspection transitions are timed deterministic and fire at
a fixed frequency. Once a degraded condition is
revealed, maintenance is planned depending on
the level of urgency. Maintenance interventions
are represented by dedicated transitions (Figure 3).
After maintenance, the asset is usually restored to
a good condition rather than to new, unless
a renewal is carried out. If necessary, it is possible
to account for the effectiveness of maintenance by
adopting a probabilistic routing policy for the
maintenance transitions so that the state after maintenance can
be any of the degraded states with a given probability.
It is also possible to keep track of the number
of maintenance interventions performed. This is
achieved by monitoring the marking of a dedicated place
which is marked every time an intervention is performed
(and therefore any of the maintenance transitions fires).
For some assets, the degradation might depend on
the past maintenance history; an example is the ballast,
for which the rate of degradation increases with
the number of tamping interventions performed.
This can be accounted for if transitions $T_{deg,i}$ update
their distributions of times to degrade according to
the number of interventions performed on the asset.
This modelling approach enables the evaluation of
a wide range of maintenance strategies, for each
of which it is possible to specify the inspection frequency,
the thresholds on the asset conditions that
trigger maintenance, and the mean time to schedule and
perform any maintenance activity. Furthermore,
by keeping track of the marking during the simulation,
it is possible to evaluate the probability of
being in any of the considered states as well as the
number of interventions performed. The probability
of having a speed restriction or a line closure
is of particular interest to evaluate the impact of a
given strategy on service and safety risk.

Figure 1. Degradation.

Figure 2. Degradation and inspection.

Figure 3. Degradation, inspection and repair.
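
The way such a model is evaluated by Monte Carlo simulation can be sketched as follows. The Python code is a simplified stand-in for a PN asset state model, not the model used in this paper: it assumes a single chain of degradation states with Weibull times to degrade, periodic inspection, a fixed delay between revealing a degraded state and completing maintenance, and restoration to the good state. The function names and all numerical values are hypothetical.

```python
import random

INF = float("inf")


def simulate_history(horizon, inspect_every, repair_delay, threshold,
                     scales, shape, rng):
    """Simulate one asset history over the planning horizon.

    States are numbered 0..len(scales); 0 is the good state and the last
    one the worst. Returns (time spent in each state, interventions).
    """
    n_states = len(scales) + 1
    time_in_state = [0.0] * n_states
    t, state, interventions = 0.0, 0, 0
    next_degrade = rng.weibullvariate(scales[0], shape)
    next_inspect = inspect_every
    next_repair = INF
    while t < horizon:
        t_next = min(next_degrade, next_inspect, next_repair, horizon)
        time_in_state[state] += t_next - t
        t = t_next
        if t >= horizon:
            break
        if t == next_repair:
            # Maintenance completed: back to the good state.
            state, interventions, next_repair = 0, interventions + 1, INF
            next_degrade = t + rng.weibullvariate(scales[0], shape)
        elif t == next_degrade:
            # Degradation to the next (worse) state.
            state += 1
            if state < n_states - 1:
                next_degrade = t + rng.weibullvariate(scales[state], shape)
            else:
                next_degrade = INF
        else:
            # Inspection: schedule maintenance if the threshold is reached.
            if state >= threshold and next_repair == INF:
                next_repair = t + repair_delay
            next_inspect = t + inspect_every
    return time_in_state, interventions


def estimate(n_runs=2000, horizon=50.0, seed=1):
    """Estimate state probabilities and the mean number of interventions."""
    rng = random.Random(seed)
    scales = [8.0, 4.0, 2.0]   # hypothetical Weibull scales (years)
    totals = [0.0] * (len(scales) + 1)
    total_interventions = 0
    for _ in range(n_runs):
        times, k = simulate_history(horizon, 1.0, 0.25, 1, scales, 2.0, rng)
        totals = [a + b for a, b in zip(totals, times)]
        total_interventions += k
    probabilities = [x / (n_runs * horizon) for x in totals]
    return probabilities, total_interventions / n_runs


if __name__ == "__main__":
    probs, mean_interventions = estimate()
    print("state probabilities:", [round(p, 4) for p in probs])
    print("mean interventions per history:", round(mean_interventions, 2))
```

Changing the inspection interval, the intervention threshold or the repair delay in `estimate` corresponds to evaluating a different maintenance strategy. The fraction of time spent in the last two states would play the role of the speed restriction and closure probabilities discussed above.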

This modelling structure can be used as a modelling
template to describe a variety of railway
assets exhibiting degradation during their lifetime.
The number and features of places and transitions
representing the degradation processes and
the maintenance activities can be easily fitted to the
characteristics of the specific asset to be modelled.
Examples of degradation and maintenance models
adopting a similar structure have been proposed
in the literature for a number of railway assets
such as track (Andrews 2012, Prescott & Andrews 
2013, Andrews, Prescott, & De Rozieres 2014) 
and bridges (Le & Andrews 2016, Le, Andrews, & 
Fecarotti 2017). 


2.3 Network-level strategies optimisation 


The analysis conducted by means of the asset state 
models results in a set of potential asset manage- 
ment strategies covering a range of performance 
levels for each asset group. Given a set of potential 
strategies for each asset group, the infrastructure 
manager is faced with the task of selecting one 
strategy for each asset on the network given that 
a limited budget is available. Performance and 
safety targets are usually set for each route and line 
along the network, and these targets can be differ- 
ent depending on the route criticality. Decisions 
are therefore bounded by the available budget and 
are made with the aim of minimising the disrup- 
tion caused to the railway service, while a certain 
level of availability is ensured for each line depending
on the line criticality. Whichever asset fails, the
impact on train services is due to either a speed
restriction, leading to delays, or a section closure,
leading to journey cancellations. The extent of the
disruption depends on both the duration of such
control actions and the location of the section(s)
involved. If a speed restriction or a section closure
is imposed on a section belonging to a high-frequency
line, or to more than one line, then the
number of journeys affected by the disruption will
be high. With regard to the impact of failures on
service, it is therefore reasonable to define, for each
section in the network, two failure modes, each with
a different effect on service: (i) section subject to speed
restriction, and (ii) section subject to closure.

Let us define a Strategic Route as a set
of SRSs $R = \{R_1, R_2, \ldots, R_i, \ldots, R_{n_R}\}$. Railway
services run along a set of railway lines
$L = \{L_1, L_2, \ldots, L_l, \ldots, L_{n_L}\}$; each railway line
consists of one or more SRSs. Therefore each railway
line can be represented as a subset of set
$R$, $L_l \subset R$ $\forall\, l = 1, \ldots, n_L$. A railway line $L_l$ will be
unavailable if any of its SRSs is unavailable. If $a$
is the number of asset groups considered and $b$ is
the number of strategies available for each asset
group, then the set of maintenance strategies
for each SRS is given by all the possible combinations
of the individual asset groups' strategies,
$n_S = a \times b$. The set $S = \{S_1, S_2, \ldots, S_j, \ldots, S_{n_S}\}$ is
defined, containing the $n_S$ potential strategies available
for each SRS, each corresponding to a given
combination of the individual asset strategies.
From now on the term strategy will be used to
indicate a strategy for the individual SRS, among
the available ones in set $S$. The index $j = 1, 2, \ldots, n_S$
will be used to refer to a generic strategy within set
$S$, while the index $i = 1, 2, \ldots, n_R$ will be used to refer
to a generic SRS within set $R$. The vector of decision
variables $X$ has components $x_{ij}$ such that
$x_{ij} = 1$ if strategy $j$ is applied to SRS $i$, and 0 otherwise.
The infrastructure manager is constrained to choose
only one strategy per SRS. Following the implementation
of a given strategy, each SRS will be
subject to a given probability, average number
and duration of imposed speed restrictions and
section closures during the considered planning
period. Section closures contribute to defining the
availability of the SRS. In fact a section closure
means that the section is not available for use and
therefore all the journeys that use that section
are cancelled or rerouted if possible. If a speed
restriction is imposed, trains can still run but at
a reduced speed; this implies delays and sometimes
journey cancellations. Therefore we assume
that the number and duration of imposed speed
restrictions implicitly provide an indication of the
impact on service delay. Similarly, we assume that
the number and duration of imposed section closures
implicitly provide an indication of the cancelled
services due to section unavailability. The problem
is formulated as follows:


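$$\min Z(X) = \sum_{i=1}^{n_R} \sum_{j=1}^{n_S} n_{ij}^{(SR)} d_{ij}^{(SR)} f_i \, x_{ij} \quad \text{s.t.} \qquad (1)$$

$$\sum_{j=1}^{n_S} x_{ij} = 1 \quad \forall\, i = 1, 2, \ldots, n_R, \qquad (2)$$

$$\sum_{i=1}^{n_R} \sum_{j=1}^{n_S} c_{ij} x_{ij} \le B, \qquad (3)$$

$$Q_{L_l}(X) \le Q_l^* \quad \forall\, L_l \in L, \qquad (4)$$

$$x_{ij} \in \{0, 1\} \quad \forall\, i = 1, 2, \ldots, n_R;\; j = 1, 2, \ldots, n_S \qquad (5)$$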
where the model parameters are:

• $n_{ij}^{(SC)}$, the average number of closures in SRS $i$
  following implementation of strategy $j$,
• $d_{ij}^{(SC)}$, the average duration of closures in SRS $i$
  following implementation of strategy $j$,
• $n_{ij}^{(SR)}$, the average number of speed restrictions
  imposed on SRS $i$ following implementation of
  strategy $j$,
• $d_{ij}^{(SR)}$, the average duration of speed restrictions
  imposed on SRS $i$ following implementation of
  strategy $j$,
• $q_{ij}$, the probability of a closure in SRS $i$ following
  implementation of strategy $j$,
• $Q_l^*$, the threshold on the unavailability of line $L_l$,
• $c_{ij}$, the cost of strategy $j$ implemented on SRS $i$,
• $f_i$, the frequency of trains travelling on SRS $i$,
• $B$, the available budget.


The objective function $Z(X)$ is representative of
the impact that the selected combination of strategies
has on service delay, which allows different
solutions to be compared. It represents the expected
number of trains affected by a service disruption
over the considered time horizon. Each term
$n_{ij}^{(SR)} d_{ij}^{(SR)} f_i \, x_{ij}$ gives an indication of the
contribution of each SRS to the overall service disruption.
This contribution is proportional to the
average number of speed restrictions imposed on
the SRS, to their average duration, and to the frequency
of trains travelling through the SRS. The
train frequency is used to weight each SRS proportionally
to the normalised amount of flow travelling
on the SRS. The frequency also implicitly weights
each SRS based on its centrality, namely its role
in serving more than one line. The set of constraints
(2) indicates that only one strategy can be selected for
each SRS. Constraint (3) adds a bound on the overall
costs according to the available budget. The set of
constraints (4) puts an upper bound on the
unavailability of each line. A line is unavailable if
any of its SRSs is closed. Therefore, the probability
of line $L_l$ being closed, $Q_{L_l}(X)$, can be written as

$$Q_{L_l}(X) = 1 - \prod_{\forall i \,|\, R_i \in L_l} \Big[ 1 - \sum_{\forall j \,|\, S_j \in S} q_{ij} x_{ij} \Big] \qquad (6)$$

The optimal solution $X^*$ is given by the feasible
combination of strategies that provides the
minimum impact on service as represented by the
objective function $Z^*(X)$. The objective function
$Z(X)$ in (1) is linear in $X$, as are constraints (2) and (3), while
constraints (4) are non-linear. Problem 1 is therefore
a non-linear integer optimisation problem.
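
For a very small network, the problem can be solved by exhaustive enumeration, which makes the role of each constraint explicit. The Python sketch below is an illustration under stated assumptions, not the solution method adopted in this paper: it enumerates one strategy per SRS, which enforces constraint (2), discards combinations that violate the budget constraint (3) or the line unavailability constraint (4) evaluated with the exact expression of Equation (6), and keeps the combination with the smallest objective value. The function names and all data are hypothetical.

```python
from itertools import product
from math import prod


def line_unavailability(line, choice, q):
    """Exact probability that a line is closed, as in Equation (6).

    line:   list of SRS indices forming the line
    choice: tuple with the index of the strategy chosen for each SRS
    q:      q[i][j] is the closure probability of SRS i under strategy j
    """
    return 1.0 - prod(1.0 - q[i][choice[i]] for i in line)


def solve(n_sr, d_sr, f, c, q, lines, q_max, budget):
    """Return (objective value, choice) of the best feasible combination."""
    n_srs, n_strategies = len(c), len(c[0])
    best = None
    # One strategy index per SRS: constraint (2) holds by construction.
    for choice in product(range(n_strategies), repeat=n_srs):
        cost = sum(c[i][j] for i, j in enumerate(choice))
        if cost > budget:                                   # constraint (3)
            continue
        if any(line_unavailability(line, choice, q) > q_max[k]
               for k, line in enumerate(lines)):            # constraint (4)
            continue
        z = sum(n_sr[i][j] * d_sr[i][j] * f[i]
                for i, j in enumerate(choice))              # objective (1)
        if best is None or z < best[0]:
            best = (z, choice)
    return best


# Hypothetical data: three SRSs, three strategies, two lines.
C = [[50, 70, 85], [50, 70, 85], [60, 80, 95]]     # costs
Q = [[0.05, 0.03, 0.01]] * 3                       # closure probabilities
N_SR = [[4.7, 3.8, 2.5]] * 3                       # speed restrictions
D_SR = [[1.0, 1.0, 1.0]] * 3                       # durations
F = [4, 2, 1]                                      # trains per hour
LINES = [[0, 1], [1, 2]]                           # SRS indices per line
Q_MAX = [0.05, 0.10]                               # unavailability limits
BUDGET = 220

if __name__ == "__main__":
    print(solve(N_SR, D_SR, F, C, Q, LINES, Q_MAX, BUDGET))
```

The number of combinations grows as $n_S^{n_R}$, so enumeration is only practical for small instances. It is nevertheless useful for checking the output of an integer programming solver on a reduced network.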


2.3.1 Solution method 

There are no general-purpose solution methods
yielding the global optimum for non-linear (non-convex)
constrained optimisation problems and
approximate solution algorithms are usually used.
However, it is possible to solve a linear approximation
of the original problem if the non-linear functions
(objective function and/or constraints) can
be converted to an acceptable linear form.
Problem 1 is transformed into a linear integer
programming model by replacing the left hand side
of constraint (4) with its rare event approximation
(Andrews & Moss 2002) as follows:

$$Q_{L_l}(X) = 1 - \prod_{\forall i \,|\, R_i \in L_l} \Big[ 1 - \sum_{\forall j \,|\, S_j \in S} q_{ij} x_{ij} \Big] \le \sum_{\forall i \,|\, R_i \in L_l} \; \sum_{\forall j \,|\, S_j \in S} q_{ij} x_{ij} \qquad (7)$$

The rare event approximation is an upper bound
on the exact probability of the top event and can be used
when the probability of the basic events is low.
This is an acceptable approximation for the problem
at hand as the probability of a link closure is usually
small.

Integer programming is NP-hard, namely no algorithm
is known that solves it in polynomial time. Therefore,
depending on the problem size, it can be difficult
to solve in reasonable computational time. In such
circumstances, the associated relaxed problem
obtained through continuous relaxation can be
studied. The relaxed problem is a linear continuous
programming model which can be solved by
means of the simplex method. The optimal solution
of the relaxed problem is a lower bound on the
global optimum of the original problem.
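
The quality of the rare event approximation in Equation (7) can be checked numerically. The Python sketch below compares the exact line unavailability with the sum of the closure probabilities of the SRSs on the line, for the strategies already selected; the probability values are hypothetical and the function names are introduced only for this illustration.

```python
from math import prod


def exact_unavailability(closure_probabilities):
    """Exact probability that at least one SRS on the line is closed."""
    return 1.0 - prod(1.0 - q for q in closure_probabilities)


def rare_event_bound(closure_probabilities):
    """Rare event approximation: the sum of the closure probabilities."""
    return sum(closure_probabilities)


if __name__ == "__main__":
    # Hypothetical closure probabilities for lines made of three SRSs.
    for qs in ([0.01, 0.01, 0.02], [0.05, 0.05, 0.05], [0.3, 0.3, 0.3]):
        exact = exact_unavailability(qs)
        bound = rare_event_bound(qs)
        print(f"q={qs}  exact={exact:.4f}  bound={bound:.4f}  "
              f"gap={bound - exact:.4f}")
```

Because the sum is never smaller than the exact value, any solution that satisfies the linearised constraint also satisfies the original constraint (4); the linearised problem is therefore conservative. The gap grows quickly as the closure probabilities increase, which is why the approximation is only appropriate when closures are rare.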


3 NUMERICAL EXAMPLE 


The optimisation approach presented in this paper 
has been applied to select the best combination of 
maintenance strategies for a set of SRSs comprising
one of the UK Strategic Routes, the East Midlands
(EM) Route. Details of the EM Route and
its SRSs can be found in (NetworkRail 2015). A
schematic representation of part of the EM Route
showing seven of its eleven SRSs is given in Figure 4.
The SRSs considered in this example
are listed in Table 1 along with the train frequency.

Railway services running along the EM Route
which have been considered here are listed in
Table 2 along with the service type (long distance
high speed, LDHS; interurban; and local),
while Table 3 lists the SRSs included within each
service.

For each railway service, different availability
requirements are considered depending on the type
of service. Three potential maintenance strategies
are considered, $S = \{S_1, S_2, S_3\}$. The evaluation of
the maintenance strategies through the PN asset
models yields the input parameters to the optimisation
model. The values of the model parameters
used to run this numerical example are detailed in
Table 4, where $c_{ij}$, $q_{ij}$ and $n_{ij}^{(SR)}$ indicate the cost,
unavailability and number of speed restrictions due to
the implementation of the available strategies.

Figure 4. Map of part of the EM Route, including
SRSs 11.01 to 11.07.

Table 1. SRSs and train frequency.

SRS   Section                                Trains per hour
01    London St. Pancras-Bedford
02    Bedford-Nottingham
03    Wichnor Jn/Long Eaton-Chesterfield
04    Chesterfield-Nottingham
05    Nottingham-Newark Castle
06    Matlock-Ambergate
07    Netherfield-Grantham

Table 2. Railway services.

Service name                                   Service type
London St. Pancras to Nottingham               LDHS
London St. Pancras to Sheffield (via Derby)    LDHS
Norwich to Liverpool                           Interurban
Nottingham to Leeds                            Interurban
Newark Castle-Nottingham-Derby-Matlock         Local

Table 3. SRSs included within each railway service.

Service name                                   SRSs
London St. Pancras to Nottingham               {01, 02}
London St. Pancras to Sheffield (via Derby)    {01, 02, 03}
Norwich to Liverpool                           {02, 04, 07}
Nottingham to Leeds                            {02, 04}
Newark Castle-Nottingham-Derby-Matlock         {02, 03, 05, 06}

The optimisation model has been solved
for eight different values of the available
budget, $B_1 = 350$, $B_2 = 400$, $B_3 = 450$, $B_4 = 500$,
$B_5 = 550$, $B_6 = 600$, $B_7 = 650$, $B_8 = 700$, while the
thresholds on the unavailability of each railway
service remain unchanged and equal to

Table 4. Model parameters.

SRS   $c_{i1}$  $c_{i2}$  $c_{i3}$   $q_{i1}$  $q_{i2}$  $q_{i3}$   $n_{i1}^{(SR)}$  $n_{i2}^{(SR)}$  $n_{i3}^{(SR)}$
01    50        70        85         0.9       0.95      0.99       4.7              3.8              2.5
02    50        70        85         0.9       0.95      0.99       4.7              3.8              2.5
03    60        80        95         0.9       0.95      0.99       4.7              3.8              2.5
04    60        80        95         0.9       0.95      0.99       4.7              3.8              2.5
05    60        80        95         0.9       0.95      0.99       4.7              3.8              2.5
06    50        70        85         0.9       0.95      0.99       4.7              3.8              2.5
07    60        80        95         0.9       0.95      0.99       4.7              3.8              2.5

Table 5. Maintenance strategies selected. 
SRS 
Scenario 01 02 03 04 05 06 07 
1 
2 = 
3 
4 
5 S, S, S, S, Si Si S, 
6 S; S; S, S, Si S, S, 
T S; S, S, S; S3 S3 S, 
8 S; S S, S, S, S, S, 

Figure 5. Expected number of disrupted trains, $Z(X)$, for different available budgets $B_1$ to $B_8$.

$Q_1^* = 0.98$, $Q_2^* = 0.98$, $Q_3^* = 0.98$, $Q_4^* = 0.95$, $Q_5^* =$

The results of the scenario analysis are summa- 
rised in Table 5 and Figure 5. Table 5 details the 
optimal maintenance strategies for each SRS, while 
Figure 5 shows the corresponding value of the 
objective function which is indicative of the 
expected number of trains affected by a speed 
restriction. 

Results show that no feasible solution can be
found for budgets $B_1$ to $B_4$, as the strategies that
would be affordable within the available budget do
not ensure the required level of availability for each
railway service. If the budget is increased, solutions
can be found. Budget $B_5$ is enough to find a feasible
solution but does not allow the selection of the best
strategy available for each SRS. The algorithm
selects less expensive strategies for SRSs 05, 06 and
07 as they belong to those railway services for which
a less restrictive value of availability is required.
Furthermore, the train frequency on those sections 
is lower than in the others. By further increasing 
the available budget, better strategies can be chosen 
and the impact on service decreases. 
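
The role of the budget constraint in these results can be checked with simple arithmetic on the cost columns of Table 4. The Python sketch below assumes that the first three numerical columns of Table 4 are the costs of strategies $S_1$ to $S_3$ for each SRS, which is an interpretation of the table layout; it only considers the budget constraint and says nothing about the availability constraints.

```python
# Costs of strategies S1, S2, S3 per SRS, as read from Table 4
# (column interpretation is an assumption).
COSTS = {
    "01": (50, 70, 85),
    "02": (50, 70, 85),
    "03": (60, 80, 95),
    "04": (60, 80, 95),
    "05": (60, 80, 95),
    "06": (50, 70, 85),
    "07": (60, 80, 95),
}

# Budgets B1 to B8 used in the scenario analysis.
BUDGETS = {f"B{k}": 350 + 50 * (k - 1) for k in range(1, 9)}

cheapest = sum(min(c) for c in COSTS.values())   # cheapest strategy everywhere
dearest = sum(max(c) for c in COSTS.values())    # most expensive everywhere


def budget_status(budget):
    """Classify a budget with respect to the cost of the combinations."""
    if budget < cheapest:
        return "no combination affordable"
    if budget >= dearest:
        return "every combination affordable"
    return "only some combinations affordable"


if __name__ == "__main__":
    print("cheapest combination:", cheapest)
    print("most expensive combination:", dearest)
    for name, budget in BUDGETS.items():
        print(name, budget, budget_status(budget))
```

Under this reading of Table 4, the cheapest combination costs 390 and the most expensive one 635. A budget of 350 is therefore infeasible on cost alone, whereas for budgets of 400 to 500 affordable combinations exist, so the infeasibility reported for those scenarios must come from the availability constraints. Budgets of 650 and 700 allow the most expensive strategy on every SRS.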


4 CONCLUSIONS 


This paper presents a modelling approach to sup- 
port decisions on how to effectively maintain a 
railway infrastructure system. First, for each rail- 
way asset, a modelling tool is required to assess the 
asset response to the implementation of a range 
of feasible maintenance strategies. Such modelling
tools, called asset state models, combine the
degradation/failure processes affecting the asset
with the intervention activities that can be performed
in order to predict the future asset state.
The modelling approach suggested to develop the
asset state models is the PN method. A modelling
template based on the PN method has been
presented, which can be specified to represent a
variety of railway assets undergoing degradation 
and ageing. The asset state models developed for 
each asset inform a network-level optimisation 
model aimed at selecting the best combination of 
maintenance strategies to manage all the assets on 
a railway network under budget and performance 
constraints. The network-level optimisation model 
is formulated as an integer program with multiple 
constraints. A numerical example has been pre- 
sented to show the capabilities of the optimisation 
model. An advantage of the mathematical programming
formulation is that the model is not a black
box. Furthermore, when the problem size is such
that global solutions cannot be found in reasonable
computational time, the mathematical programming
formulation allows the use of tools to
estimate the goodness of approximate solutions.
By varying the model parameters, scenario analysis
can be performed so that the infrastructure
manager is provided with a range of solutions for
different combinations of available budget and performance
targets.
Operation and climate-weather change impact on maritime ferry safety


K. Kolowrocki & E. Kuligowska 
Gdynia Maritime University, Poland 


ABSTRACT: The paper is concerned with an application of a recently developed general analytical safety
model of a critical infrastructure under the influence of an operation process related to the climate-weather
change process. The model is presented and applied to the prediction of the maritime ferry safety
characteristics. As a result of this application, the ferry unconditional safety function and the risk function
under operation and climate-weather conditions changing in time are determined. Moreover, other
significant safety indicators, i.e. the mean lifetime up to exceeding a critical safety state, the moment
when the risk function value exceeds the acceptable safety level, the intensities of ageing and the coefficients
of the operation and climate-weather impact on the ferry intensities of ageing, are presented.


1 INTRODUCTION 


The paper presents the operation and climate- 
weather change influence on the safety of a criti- 
cal infrastructure. The maritime ferry operation 
process is described in (Kołowrocki & Soszyńska-
Budny 2011), whereas the climate-weather change
process for the ferry operating area is modeled 
in (Kuligowska 2017). The identification of the 
ferry operation process related to climate-weather 
change is performed in (Kołowrocki et al. 2017a). 
Having these processes identified, the safety pre- 
diction of the considered ferry under the operation 
process and climate-weather change influence is 
performed. 

An analytical safety model of a complex technical
system under the influence of the operation
process related to the climate-weather change process
is proposed (Kołowrocki et al. 2017b). It is an integrated
model of complex technical system safety,
linking its multistate safety model and the model
of its operation process related to the climate-weather
change process at its operating area, and considering
that the system safety structures and the safety
parameters of its components vary across the different
climate-weather states that affect them.

The maritime ferry safety characteristics, i.e. the
unconditional safety function and the risk function
under operation and climate-weather conditions
changing in time (in February), are determined.
Moreover, the safety and resilience indicators are
presented: the mean lifetime up to exceeding
a critical safety state, the moment when the risk
function value exceeds the acceptable safety level,
the intensities of ageing and the coefficients of the
operation and climate-weather impact on the maritime
ferry safety.


2 CRITICAL INFRASTRUCTURE 
OPERATION PROCESS RELATED 
TO CLIMATE-WEATHER VARIABLE 
CONDITIONS 


2.1 Critical infrastructure operation process 


We assume that the critical infrastructure during
its operation process takes $\nu$, $\nu \in N$, different
operation states $z_1, z_2, \ldots, z_\nu$. Moreover, we assume
that the critical infrastructure operation process
$Z(t)$ is a semi-Markov process with the conditional
sojourn times $\theta_{bl}$ at the operation states $z_b$ when its
next operation state is $z_l$, $b, l = 1, 2, \ldots, \nu$, $b \neq l$.
Under these assumptions, the critical infrastructure
operation process may be described by
(Kołowrocki & Soszyńska-Budny 2011):


— the vector $[p_b(0)]_{1 \times \nu}$ of the initial probabilities
$p_b(0) = P(Z(0) = z_b)$, $b = 1, 2, \ldots, \nu$, of the critical
infrastructure operation process $Z(t)$ staying at
particular operation states at the moment $t = 0$;

— the matrix $[p_{bl}]_{\nu \times \nu}$ of probabilities $p_{bl}$,
$b, l = 1, 2, \ldots, \nu$, $b \neq l$, of the critical infrastructure
operation process $Z(t)$ transitions between the operation
states $z_b$ and $z_l$;

— the matrix $[H_{bl}(t)]_{\nu \times \nu}$ of conditional distribution
functions $H_{bl}(t) = P(\theta_{bl} < t)$, $t \in \langle 0, +\infty)$,
$b, l = 1, 2, \ldots, \nu$, $b \neq l$, of the critical infrastructure
operation process $Z(t)$ conditional sojourn times
$\theta_{bl}$ at the operation states.


The limit values of the critical infrastructure
operation process $Z(t)$ transient probabilities
$p_b(t) = P(Z(t) = z_b)$, $t \in \langle 0,+\infty)$, $b = 1,2,\ldots,\nu$, at the
particular operation states can be found using
the procedure given in (Kołowrocki & Soszyńska-Budny 2011).
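
The cited procedure is not reproduced in this paper. As a minimal sketch of the standard result for an irreducible semi-Markov process, the limit transient probabilities can be obtained from the stationary distribution of the embedded Markov chain weighted by the mean unconditional sojourn times. The function names and the three-state numbers below are hypothetical and serve only as an illustration; they are not the ferry data.

```python
def stationary_distribution(P, iterations=10_000, tol=1e-14):
    """Stationary distribution pi of an irreducible embedded chain (pi = pi P).

    Iterates the lazy chain (P + I) / 2, which has the same stationary
    distribution as P but cannot oscillate on periodic chains.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        new = [0.5 * pi[j] + 0.5 * sum(pi[i] * P[i][j] for i in range(n))
               for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


def limit_transient_probabilities(P, M):
    """Limit probabilities p_b = pi_b * M_b / sum_l(pi_l * M_l).

    P : transition probability matrix [p_bl] of the embedded chain.
    M : mean unconditional sojourn times, M_b = sum_l p_bl * E[theta_bl].
    """
    pi = stationary_distribution(P)
    weights = [pi_b * m_b for pi_b, m_b in zip(pi, M)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical three-state example (illustrative values only).
P_example = [[0.0, 0.7, 0.3],
             [0.5, 0.0, 0.5],
             [0.4, 0.6, 0.0]]
M_example = [10.0, 2.0, 5.0]
p_limit = limit_transient_probabilities(P_example, M_example)

# Two-state check: the chain alternates, so time is shared as 1 : 3.
p_two = limit_transient_probabilities([[0.0, 1.0], [1.0, 0.0]], [1.0, 3.0])
```

States with long mean sojourn times dominate the limit distribution even when the embedded chain visits them rarely.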


2.2 Climate-weather change process at the critical
infrastructure operating area


To model the climate-weather change process
for the critical infrastructure operating area we
assume that the process takes $w$, $w \in N$, different
climate-weather states $c_1, c_2, \ldots, c_w$. Further,
we define the climate-weather change process $C(t)$,
$t \in \langle 0,+\infty)$, with discrete states from
the set $\{c_1, c_2, \ldots, c_w\}$. Assuming that the
climate-weather change process $C(t)$ is a semi-Markov
process, it can be described by (Kołowrocki,
Soszyńska-Budny & Torbicki 2017a,b):


— the vector $[q_b(0)]_{1\times w}$ of the initial probabilities
$q_b(0) = P(C(0) = c_b)$, $b = 1,2,\ldots,w$, of the
climate-weather change process $C(t)$ staying at particular
climate-weather states $c_b$ at the moment $t = 0$;

— the matrix $[q_{bl}]_{w\times w}$ of the probabilities of
transitions $q_{bl}$, $b, l = 1,2,\ldots,w$, $b \neq l$, of the
climate-weather change process $C(t)$ from the
climate-weather states $c_b$ to $c_l$;

— the matrix $[C_{bl}(t)]_{w\times w}$ of the conditional
distribution functions $C_{bl}(t) = P(C_{bl} < t)$, $t \in \langle 0,+\infty)$,
$b, l = 1,2,\ldots,w$, of the conditional sojourn times
$C_{bl}$ at the climate-weather states $c_b$ when its next
climate-weather state is $c_l$, $b, l = 1,2,\ldots,w$, $b \neq l$.


The limit values of the climate-weather change
process $C(t)$ transient probabilities $q_b(t) = P(C(t) = c_b)$,
$t \in \langle 0,+\infty)$, $b = 1,2,\ldots,w$, at the particular
climate-weather states can be found using the
procedure given in (Kołowrocki, Soszyńska-Budny &
Torbicki 2017a).


2.3 Critical infrastructure operation process 
related to climate-weather change 


We assume, as in (Kołowrocki et al. 2017c), that the
critical infrastructure operation process related to
climate-weather change $ZC(t)$, $t \in \langle 0,+\infty)$, can take
$\nu w$, $\nu, w \in N$, different operation states
$zc_{11}, zc_{12}, \ldots, zc_{\nu w}$, described by:


— the vector $[pq_{bl}(0)]_{1\times\nu w}$ of initial probabilities
of the critical infrastructure operation process
related to climate-weather change $ZC(t)$ staying
at the initial moment $t = 0$ at the operation
and climate-weather states $zc_{bl}$, $b = 1,2,\ldots,\nu$,
$l = 1,2,\ldots,w$;

— the matrix $[pq_{bl\,\bar{b}\bar{l}}]_{\nu w\times\nu w}$ of the probabilities of
transitions of the critical infrastructure operation
process related to climate-weather change
$ZC(t)$ between the operation states $zc_{bl}$ and
$zc_{\bar{b}\bar{l}}$, $b = 1,2,\ldots,\nu$, $l = 1,2,\ldots,w$,
$\bar{b} = 1,2,\ldots,\nu$, $\bar{l} = 1,2,\ldots,w$;

— the matrix $[HC_{bl\,\bar{b}\bar{l}}(t)]_{\nu w\times\nu w}$ of the conditional
distribution functions of the critical infrastructure
operation process related to climate-weather
change $ZC(t)$ conditional sojourn times
$\theta C_{bl\,\bar{b}\bar{l}}$, $b = 1,2,\ldots,\nu$, $l = 1,2,\ldots,w$,
$\bar{b} = 1,2,\ldots,\nu$, $\bar{l} = 1,2,\ldots,w$, at the operation state $zc_{bl}$
when the next operation state is $zc_{\bar{b}\bar{l}}$.


Assuming that we have identified the unknown
parameters of the critical infrastructure operation
process related to climate-weather change $ZC(t)$,
we can predict the basic characteristics of this process, e.g.
the limit transient probabilities $pq_{bl}(t) = P(ZC(t) = zc_{bl})$,
$t \in \langle 0,+\infty)$, $b = 1,2,\ldots,\nu$, $l = 1,2,\ldots,w$,
at the particular states, according to the procedure
given in (Kołowrocki et al. 2017c).


3 CRITICAL INFRASTRUCTURE
SAFETY AT VARIABLE OPERATION 
CONDITIONS RELATED TO CLIMATE- 
WEATHER CHANGE 


In the safety analysis of critical infrastructures at
variable operation conditions related to climate-weather
change, to define the critical infrastructure
with degrading components we assume that the
changes of the states of the process $ZC(t)$ have an impact on
the critical infrastructure's components and its structure
(Kołowrocki 2014, Kołowrocki et al. 2017b). We
denote by $[T_i(u)]^{(bl)}$ the conditional lifetime of the critical
infrastructure asset $A_i$, $i = 1,2,\ldots,n$, in the safety state
subset $\{u, u+1, \ldots, z\}$ while the operation process related to
climate-weather change $ZC(t)$ is at the state $zc_{bl}$,
$b = 1,2,\ldots,\nu$, $l = 1,2,\ldots,w$. Moreover, in this
section we assume that the critical infrastructure
assets at particular states have exponential safety
functions. According to (Kuligowska & Soszyńska-Budny
2017b), the conditional critical infrastructure
safety function is defined by the vector


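$[S_i(t,\cdot)]^{(bl)} = [1, [S_i(t,1)]^{(bl)}, \ldots, [S_i(t,z)]^{(bl)}]$, $t \in \langle 0,+\infty)$,
$b = 1,2,\ldots,\nu$, $l = 1,2,\ldots,w$, $i = 1,2,\ldots,n$, (1)

with the coordinates

$[S_i(t,u)]^{(bl)} = P([T_i(u)]^{(bl)} > t \mid ZC(t) = zc_{bl}) = \exp[-[\lambda_i(u)]^{(bl)} t]$, $t \in \langle 0,+\infty)$,
$b = 1,2,\ldots,\nu$, $l = 1,2,\ldots,w$, $i = 1,2,\ldots,n$, (2)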


where the intensities of ageing of the critical
infrastructure assets related to operation and
climate-weather impact, existing in (2), are given by

$[\lambda_i(u)]^{(bl)} = [\rho_i(u)]^{(bl)} \cdot \lambda_i(u)$, $u = 1,2,\ldots,z$,
$b = 1,2,\ldots,\nu$, $l = 1,2,\ldots,w$, $i = 1,2,\ldots,n$, (3)

and $\lambda_i(u)$ are the intensities of ageing of the system
components without operation and climate-weather
impact and

$[\rho_i(u)]^{(bl)}$, $u = 1,2,\ldots,z$, $b = 1,2,\ldots,\nu$,
$l = 1,2,\ldots,w$, $i = 1,2,\ldots,n$, (4)

are the coefficients of operation and climate-weather
change impact on the critical infrastructure
assets' intensities of ageing without operation
and climate-weather change impact.
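
As a small illustration of (2) and (3), the sketch below scales a baseline intensity of ageing by an impact coefficient and evaluates the resulting exponential safety function. The function names and the numerical values are hypothetical and are not taken from the ferry data.

```python
import math


def conditional_intensity(rho_bl, lam):
    """Equation (3): intensity under impact = coefficient * baseline intensity."""
    return rho_bl * lam


def conditional_safety(t, rho_bl, lam):
    """Equation (2): exponential conditional safety function coordinate."""
    return math.exp(-conditional_intensity(rho_bl, lam) * t)


# Hypothetical values: baseline intensity 0.2 per year, coefficient 1.25.
lam_u = 0.2
rho = 1.25
s_without_impact = conditional_safety(1.0, 1.0, lam_u)
s_with_impact = conditional_safety(1.0, rho, lam_u)
```

A coefficient of 1 leaves the baseline unchanged, and any coefficient above 1 lowers the safety function at every t > 0.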

Further, we denote the critical infrastructure
unconditional lifetime in the safety state subset
$\{u, u+1, \ldots, z\}$ by $T^4(u)$ and the system unconditional
safety function by

$S^4(t,\cdot) = [1, S^4(t,1), \ldots, S^4(t,z)]$, (5)

with the vector's coordinates defined by

$S^4(t,u) = P(T^4(u) > t)$, $t \in \langle 0,+\infty)$, $u = 1,2,\ldots,z$. (6)


In the case when the critical infrastructure operation
time $\theta C$ is large enough, the coordinates of
the unconditional safety function of the system
defined by (5) are given by

$S^4(t,u) \cong \sum_{b=1}^{\nu} \sum_{l=1}^{w} pq_{bl} [S^4(t,u)]^{(bl)}$, $t \in \langle 0,+\infty)$, $u = 1,2,\ldots,z$, (7)


where $[S^4(t,u)]^{(bl)}$, $u = 1,2,\ldots,z$, $b = 1,2,\ldots,\nu$,
$l = 1,2,\ldots,w$, are the coordinates of the system conditional
safety function defined by (2)-(4) and $pq_{bl}$,
$b = 1,2,\ldots,\nu$, $l = 1,2,\ldots,w$, are the system operation
process limit transient probabilities (see section 2.3).
Further, we determine the mean values $\mu^4(u)$
and the standard deviations $\sigma^4(u)$ of the unconditional
lifetimes of the critical infrastructure in the
safety state subsets $\{u, u+1, \ldots, z\}$, $u = 1,2,\ldots,z$,
the mean values $\bar{\mu}^4(u)$ of the unconditional
lifetimes of the critical infrastructure in the particular
safety states $u$, $u = 1,2,\ldots,z$, the risk function $r^4(t)$
and the moment $\tau^4$ when the critical infrastructure
risk function exceeds a permitted level $\delta$, after
substituting for $S^4(t,u)$, $u = 1,2,\ldots,z$, the coordinates
of the unconditional safety functions given by (6).
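
The characteristics listed above follow directly from (7) once the conditional safety functions are known. The sketch below assumes, for illustration only, that each conditional system safety function at a fixed safety state is exponential with a single rate, so that (7) becomes a mixture of exponentials. All names and numbers are hypothetical, and the bisection requires positive rates and a level below 1.

```python
import math


def unconditional_safety(t, probs, rates):
    """Mixture form of (7): sum of pq_bl * exp(-rate_bl * t) over all states."""
    return sum(p * math.exp(-r * t) for p, r in zip(probs, rates))


def mean_lifetime(probs, rates):
    """Integral of the safety function over t >= 0."""
    return sum(p / r for p, r in zip(probs, rates))


def std_lifetime(probs, rates):
    """Standard deviation from the second moment of the mixture."""
    second_moment = sum(2.0 * p / r ** 2 for p, r in zip(probs, rates))
    return math.sqrt(second_moment - mean_lifetime(probs, rates) ** 2)


def risk(t, probs, rates):
    """Risk function: 1 minus the safety function at the critical state."""
    return 1.0 - unconditional_safety(t, probs, rates)


def moment_of_exceeding(delta, probs, rates):
    """Smallest t with risk(t) >= delta, found by bisection (0 < delta < 1)."""
    hi = 1.0
    while risk(hi, probs, rates) < delta:
        hi *= 2.0
    lo = 0.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if risk(mid, probs, rates) < delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical single-state check: rate 0.5 per year, level delta = 0.05.
mu_single = mean_lifetime([1.0], [0.5])
sigma_single = std_lifetime([1.0], [0.5])
tau_single = moment_of_exceeding(0.05, [1.0], [0.5])
```

For a single exponential state the mean and the standard deviation coincide; a gap between them appears once states with different rates are mixed.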


4 MARITIME FERRY OPERATION 
PROCESS RELATED TO CLIMATE- 
WEATHER CHANGE 


The maritime ferry technical system consists of a
navigational subsystem $S_1$, a propulsion and controlling
subsystem $S_2$, a loading and unloading
subsystem $S_3$, a stability control subsystem $S_4$ and
an anchoring and mooring subsystem $S_5$, which
form a series structure. The detailed scheme and
system description may be found in (Kołowrocki &
Soszyńska-Budny 2011). The maritime ferry safety
structure and the assets' safety depend on its operation
and climate-weather states, which change in time.
Taking into account the expert opinions and
according to section 2.3 and (Kołowrocki et al.
2017a), the maritime ferry operation process
related to climate-weather change $ZC(t)$,
$t \in \langle 0,+\infty)$, can take

$\nu \cdot w = 18 \cdot 6 = 108$, (8)

different operation states

$zc_{1\,1}, zc_{2\,1}, \ldots, zc_{18\,1}$,
$zc_{1\,2}, zc_{2\,2}, \ldots, zc_{18\,2}$,
$\ldots$,
$zc_{1\,6}, zc_{2\,6}, \ldots, zc_{18\,6}$. (9)


Considering the results of the identification of
the unknown parameters of the maritime ferry
operation process related to climate-weather change
(Kołowrocki et al. 2017), it was possible to predict
the basic characteristics of this process. The limit values
of the transient probabilities $pq_{bl}$, $b = 1,2,\ldots,18$,
$l = 1,2,\ldots,6$, of the maritime ferry operation process
related to climate-weather change $ZC(t)$ at the
particular states $zc_{bl}$ are given in the vector


$[pq_{bl}]_{1\times 108}$ = [0.015352, 0.000418, 0.017138, 0.00038,
0.004712, 0;

0.000808, 0.000022, 0.000902, 0.00002, 0.000248, 0; 
0.02093, 0.004342, 0.000208, 0, 0.000182, 0.000338; 
0.02898, 0.006012, 0.000288, 0, 0.000252, 0.000468; 
0.287496, 0.067155, 0, 0, 0.005808, 0.002541; 
0.010244, 0.000416, 0.008086, 0.00078, 0.006474, 0; 
0.00197, 0.00008, 0.001555, 0.00015, 0.001245, 0; 
0.006304, 0.000256, 0.004976, 0.00048, 0.003984, 0; 
0.014578, 0.000592, 0.011507, 0.00111, 0.009213, 0; 
0.000788, 0.000032, 0.000622, 0.00006, 0.000498, 0; 
0.001182, 0.000048, 0.000933, 0.00009, 0.000747, 0; 
0.006304, 0.000256, 0.004976, 0.00048, 0.003984, 0; 
0.277992, 0.064935, 0, 0, 0.005616, 0.002457; 
0.02737, 0.005678, 0.000272, 0, 0.000238, 0.000442; 
0.01932, 0.004008, 0.000192, 0, 0.000168, 0.000312; 
0.001212, 0.000033, 0.001353, 0.00003, 0.000372, 0; 
0.00202, 0.000055, 0.002255, 0.00005, 0.00062, 0; 
0.005252, 0.000143, 0.005863, 0.00013, 0.001612, 0]. 


(10) 
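
A quick consistency check on (10) is that the 108 limit transient probabilities sum to one. The sketch below copies the values from (10), arranged as 18 rows of 6 entries, and also sums each row. Reading the row sums as the limit probabilities of the 18 operation states is an interpretation made here by marginalising over the climate-weather index l; the paper does not state it.

```python
# Values copied from vector (10): rows b = 1..18, columns l = 1..6.
pq = [
    [0.015352, 0.000418, 0.017138, 0.00038, 0.004712, 0],
    [0.000808, 0.000022, 0.000902, 0.00002, 0.000248, 0],
    [0.02093, 0.004342, 0.000208, 0, 0.000182, 0.000338],
    [0.02898, 0.006012, 0.000288, 0, 0.000252, 0.000468],
    [0.287496, 0.067155, 0, 0, 0.005808, 0.002541],
    [0.010244, 0.000416, 0.008086, 0.00078, 0.006474, 0],
    [0.00197, 0.00008, 0.001555, 0.00015, 0.001245, 0],
    [0.006304, 0.000256, 0.004976, 0.00048, 0.003984, 0],
    [0.014578, 0.000592, 0.011507, 0.00111, 0.009213, 0],
    [0.000788, 0.000032, 0.000622, 0.00006, 0.000498, 0],
    [0.001182, 0.000048, 0.000933, 0.00009, 0.000747, 0],
    [0.006304, 0.000256, 0.004976, 0.00048, 0.003984, 0],
    [0.277992, 0.064935, 0, 0, 0.005616, 0.002457],
    [0.02737, 0.005678, 0.000272, 0, 0.000238, 0.000442],
    [0.01932, 0.004008, 0.000192, 0, 0.000168, 0.000312],
    [0.001212, 0.000033, 0.001353, 0.00003, 0.000372, 0],
    [0.00202, 0.000055, 0.002255, 0.00005, 0.00062, 0],
    [0.005252, 0.000143, 0.005863, 0.00013, 0.001612, 0],
]

n_states = sum(len(row) for row in pq)     # number of joint states
total = sum(sum(row) for row in pq)        # should equal 1
row_sums = [sum(row) for row in pq]        # marginal over l for each b
dominant = sorted(range(18), key=lambda b: row_sums[b], reverse=True)[:2]
```

The two largest row sums belong to b = 5 and b = 13, which together account for about 71% of the limit probability mass.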


5 MARITIME FERRY SAFETY
PREDICTION INCLUDING
OPERATION AND CLIMATE-WEATHER
CHANGE IMPACT


5.1 Maritime ferry safety parameters


After discussion with experts, taking into account
the safety of the operation of the ferry, we fix five
safety states (z = 4) of the ferry technical system
and we distinguish the following safety states:

— a safety state 4 — the ferry operation is fully safe;

— a safety state 3 — the ferry operation is less safe
and more dangerous because of the possibility
of environment pollution;

— a safety state 2 — the ferry operation is less safe
and more dangerous because of the possibility
of environment pollution and causing small
accidents;

— a safety state 1 — the ferry operation is much less
safe and much more dangerous because of the
possibility of serious environment pollution and
causing extensive accidents;

— a safety state 0 — the ferry technical system is
destroyed.


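Moreover, following the expert opinions, we assume
that transitions between the components' safety
states are possible only from better to worse ones.

Considering the assumptions and agreements
from the previous sections, we assume that the
components $E_{ij}^{(\upsilon)}$, $i = 1,2,\ldots,k$, $j = 1,2,\ldots,l_i$, of the
subsystems $S_\upsilon$, $\upsilon = 1,2,3,4,5$, at the system states
$zc_{bl}$, $b = 1,2,\ldots,18$, $l = 1,2,\ldots,6$, have exponential
safety functions, i.e. the coordinates of the vector

$[S_{ij}^{(\upsilon)}(t,\cdot)]^{(bl)} = [1, [S_{ij}^{(\upsilon)}(t,1)]^{(bl)}, \ldots, [S_{ij}^{(\upsilon)}(t,4)]^{(bl)}]$,
$t \in \langle 0,+\infty)$, $i = 1,2,\ldots,k$, $j = 1,2,\ldots,l_i$, $\upsilon = 1,2,3,4,5$,
$b = 1,2,\ldots,18$, $l = 1,2,\ldots,6$, (11)

are given by

$[S_{ij}^{(\upsilon)}(t,u)]^{(bl)} = P([T_{ij}^{(\upsilon)}(u)]^{(bl)} > t \mid ZC(t) = zc_{bl}) = \exp[-[\lambda_{ij}^{(\upsilon)}(u)]^{(bl)} t]$,
$t \in \langle 0,+\infty)$, $u = 1,2,3,4$, $i = 1,2,\ldots,k$, $j = 1,2,\ldots,l_i$,
$\upsilon = 1,2,3,4,5$, $b = 1,2,\ldots,18$, $l = 1,2,\ldots,6$. (12)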


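The intensities of ageing existing in the above formula,
of the components $E_{ij}^{(\upsilon)}$, $i = 1,2,\ldots,k$,
$j = 1,2,\ldots,l_i$, of the subsystems $S_\upsilon$, $\upsilon = 1,2,3,4,5$, at the
system operation process states $zc_{bl}$, $b = 1,2,\ldots,18$,
$l = 1,2,\ldots,6$, i.e. the coordinates of the vector of
intensities

$[\lambda_{ij}^{(\upsilon)}(\cdot)]^{(bl)} = [0, [\lambda_{ij}^{(\upsilon)}(1)]^{(bl)}, \ldots, [\lambda_{ij}^{(\upsilon)}(4)]^{(bl)}]$,
$i = 1,2,\ldots,k$, $j = 1,2,\ldots,l_i$, $\upsilon = 1,2,3,4,5$, $b = 1,2,\ldots,18$,
$l = 1,2,\ldots,6$, (13)

are given by

$[\lambda_{ij}^{(\upsilon)}(u)]^{(bl)} = [\rho_{ij}^{(\upsilon)}(u)]^{(bl)} \cdot \lambda_{ij}^{(\upsilon)}(u)$, $u = 1,2,3,4$,
$i = 1,2,\ldots,k$, $j = 1,2,\ldots,l_i$, $\upsilon = 1,2,3,4,5$,
$b = 1,2,\ldots,18$, $l = 1,2,\ldots,6$, (14)

where $\lambda_{ij}^{(\upsilon)}(u)$, $u = 1,2,3,4$, $i = 1,2,\ldots,k$, $j = 1,2,\ldots,l_i$,
are the intensities of ageing of the components
$E_{ij}^{(\upsilon)}$, $i = 1,2,\ldots,k$, $j = 1,2,\ldots,l_i$, of the subsystems
$S_\upsilon$, $\upsilon = 1,2,3,4,5$, without any impact, i.e. the
coordinates of the vector

$[\lambda_{ij}^{(\upsilon)}(\cdot)] = [0, \lambda_{ij}^{(\upsilon)}(1), \lambda_{ij}^{(\upsilon)}(2), \lambda_{ij}^{(\upsilon)}(3), \lambda_{ij}^{(\upsilon)}(4)]$,
$i = 1,2,\ldots,k$, $j = 1,2,\ldots,l_i$, $\upsilon = 1,2,3,4,5$, (15)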

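and

$[\rho_{ij}^{(\upsilon)}(u)]^{(bl)}$, $u = 1,2,3,4$, $i = 1,2,\ldots,k$, $j = 1,2,\ldots,l_i$,
$\upsilon = 1,2,3,4,5$, $b = 1,2,\ldots,18$, $l = 1,2,\ldots,6$, (16)

are the coefficients of the operation and climate-weather
change impact on the intensities of ageing of the
components $E_{ij}^{(\upsilon)}$, $i = 1,2,\ldots,k$, $j = 1,2,\ldots,l_i$,
of the subsystems $S_\upsilon$, $\upsilon = 1,2,3,4,5$, at the system
states $zc_{bl}$, $b = 1,2,\ldots,18$, $l = 1,2,\ldots,6$, i.e. the
coordinates of the vector

$[\rho_{ij}^{(\upsilon)}(\cdot)]^{(bl)} = [0, [\rho_{ij}^{(\upsilon)}(1)]^{(bl)}, \ldots, [\rho_{ij}^{(\upsilon)}(4)]^{(bl)}]$,
$i = 1,2,\ldots,k$, $j = 1,2,\ldots,l_i$, $\upsilon = 1,2,3,4,5$,
$b = 1,2,\ldots,18$, $l = 1,2,\ldots,6$. (17)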


According to expert opinions, changes of the maritime
ferry operation process states influence the system
safety structures and the safety parameters of its selected
components as well. For this system, the intensities of
components' departure from the safety state subsets
{1,2,3,4}, {2,3,4}, {3,4}, {4} without any impact are given
in (Kołowrocki, Soszyńska-Budny & Torbicki 2017d),
whereas the intensities of departure related to the operation
process influence on ferry safety are given in (Kołowrocki
et al. 2017c). The intensities of departure related to
the climate-weather influence on the maritime ferry
safety are calculated by multiplying the coefficients
given in Table 1 in the Appendix by the intensities
without any impact. Thus, considering the
above results, the new intensities of departure related
to the operation and climate-weather influence on
the maritime ferry safety are calculated according to
formula (14), where the coefficients of the operation
and climate-weather change impact on the components'
intensities of ageing at the particular states are
calculated as follows:

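$[\rho_{ij}^{(\upsilon)}(u)]^{(bl)} = [\rho^{1}{}_{ij}^{(\upsilon)}(u)]^{(bl)} \cdot [\rho^{3}{}_{ij}^{(\upsilon)}(u)]^{(bl)}$, $i = 1,2,\ldots,k$,
$j = 1,2,\ldots,l_i$, $\upsilon = 1,2,3,4,5$, $u = 1,2,3,4$, $b = 1,2,\ldots,18$,
$l = 1,2,\ldots,6$,

where $[\rho^{\xi}{}_{ij}^{(\upsilon)}(u)]^{(bl)}$, $\xi = 1,3$, are respectively the
coefficients of operation impact and climate-weather
change impact on the components' intensities of
ageing at the particular states.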
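
The calculation described above can be sketched as follows, assuming that the operation and climate-weather coefficients combine as a product and that the combined coefficient scales the baseline intensities as in (14). The helper names and all numerical values are hypothetical; the actual coefficients are those of Table 1 in the Appendix.

```python
def combined_coefficient(rho_operation, rho_climate):
    """Combined impact coefficient, assumed to be the product of the two."""
    return rho_operation * rho_climate


def impacted_intensities(baseline, coefficient):
    """Formula (14) applied to the baseline intensities for u = 1, 2, 3, 4."""
    return [coefficient * lam for lam in baseline]


# Hypothetical component: baseline intensities per year for u = 1..4.
baseline_example = [0.05, 0.06, 0.07, 0.08]
rho_example = combined_coefficient(1.2, 1.1)
lam_example = impacted_intensities(baseline_example, rho_example)
```

With both coefficients equal to 1 the baseline intensities are returned unchanged, which corresponds to the rows of ones in Table 1.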


5.2 Maritime ferry safety characteristics 


In (Kołowrocki & Soszyńska-Budny 2011), it is
established that the safety structure of the maritime
ferry technical system and the safety of its subsystems
and components depend on its operation states, which
change in time. The influence of the system state
changes on the system safety structure and its
components' safety functions is given in
(Kołowrocki et al. 2017b). Thus, in the case when
the operation time is large enough, according to (7)
the maritime ferry technical system unconditional
safety function is given by the vector


$S^4(t,\cdot) = [1, S^4(t,1), S^4(t,2), S^4(t,3), S^4(t,4)]$, $t \in \langle 0,+\infty)$, (18)

where according to (7), the vector coordinates are
given respectively for $t \in \langle 0,+\infty)$, $u = 1,2,3,4$, by

$S^4(t,u) \cong \sum_{b=1}^{18} \sum_{l=1}^{6} pq_{bl} [S^4(t,u)]^{(bl)}$, (19)

where $[S^4(t,u)]^{(bl)}$, $u = 1,2,3,4$, $b = 1,2,\ldots,18$,
$l = 1,2,\ldots,6$, are the coordinates of the system
conditional safety functions defined by (2)-(4) and $pq_{bl}$,
$b = 1,2,\ldots,18$, $l = 1,2,\ldots,6$, are the system operation
process limit transient probabilities given by (10).

The graph of the five-state maritime ferry 
technical system safety function is presented in 
Figure 1. 

Considering (19), the expected values and standard
deviations, given in years, of the maritime ferry
technical system lifetimes in the safety state subsets
{1,2,3,4}, {2,3,4}, {3,4}, {4} are, respectively,

$\mu^4(1) = 5.785139$, $\mu^4(2) = 3.161191$,
$\mu^4(3) = 2.342224$, $\mu^4(4) = 1.879662$; (20)

$\sigma^4(1) = 5.56368$, $\sigma^4(2) = 3.077813$,
$\sigma^4(3) = 2.28222$, $\sigma^4(4) = 1.830362$. (21)

Consequently, the mean values of the maritime
ferry technical system lifetimes in the particular
safety states 1, 2, 3, 4 are, respectively,

$\bar{\mu}^4(1) = 2.623948$, $\bar{\mu}^4(2) = 0.8189667$,
$\bar{\mu}^4(3) = 0.4625623$, $\bar{\mu}^4(4) = 1.879662$. (22)
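
The values in (22) can be reproduced from (20) with the usual multistate relation: the mean lifetime in a particular safety state u is the difference of the mean lifetimes in the subsets starting at u and at u + 1, and for the best state it equals the subset mean itself. The sketch below is an illustration with a hypothetical function name; only the four means from (20) are taken from the paper.

```python
# Mean lifetimes in the safety state subsets {u, ..., 4}, copied from (20).
mu_subsets = {1: 5.785139, 2: 3.161191, 3: 2.342224, 4: 1.879662}


def mean_lifetimes_in_states(mu, z):
    """Mean lifetime in each particular state u = 1..z from the subset means."""
    result = {u: mu[u] - mu[u + 1] for u in range(1, z)}
    result[z] = mu[z]
    return result


mu_bar = mean_lifetimes_in_states(mu_subsets, 4)
```

The differences agree with (22) to the precision allowed by the rounded inputs.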

By (20) and (21), the mean and the standard
deviation of the maritime ferry lifetime up to
exceeding the critical safety state $r = 2$ are

$\mu^4(2) = 3.161191$ years, $\sigma^4(2) = 3.077813$ years.


Figure 1. The graph of the maritime ferry safety function
coordinates (horizontal axis: t [years]).


Since the critical safety state is $r = 2$, the
risk function of the maritime ferry technical
system is given by

$r^4(t) = 1 - S^4(t,2)$, (23)

where $S^4(t,2)$ is given by (19) and illustrated in
Figure 1.

Hence, considering (23), the moment when the
system risk function exceeds a permitted level, for
instance $\delta = 0.05$, is

$\tau^4 = (r^4)^{-1}(\delta) = 0.17$ year. (24)

The maritime ferry intensities of ageing, according
to (Kołowrocki et al. 2017b) and considering
(19), are:

$\lambda^4(t,u) = -\frac{dS^4(t,u)}{dt} \cdot \frac{1}{S^4(t,u)}$, (25)

where particularly


$\lambda^4(t,1) = 0.216514$, $\lambda^4(t,2) = 0.3860834$,
$\lambda^4(t,3) = 0.516497$, $\lambda^4(t,4) = 0.6449959$.
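
Formula (25) is the hazard rate of the unconditional lifetime. As a sketch, assuming the unconditional safety function is a mixture of exponentials with hypothetical weights and rates (not the ferry parameters), the intensity of ageing can be evaluated in closed form:

```python
import math


def ageing_intensity(t, probs, rates):
    """Hazard rate -S'(t) / S(t) for S(t) = sum of p * exp(-r * t)."""
    density = sum(p * r * math.exp(-r * t) for p, r in zip(probs, rates))
    safety = sum(p * math.exp(-r * t) for p, r in zip(probs, rates))
    return density / safety


# Hypothetical two-state mixture.
probs_example = [0.5, 0.5]
rates_example = [0.1, 1.0]
lam_at_0 = ageing_intensity(0.0, probs_example, rates_example)
lam_at_10 = ageing_intensity(10.0, probs_example, rates_example)
```

For a single exponential the intensity is constant, while for a mixture it starts at the weighted mean of the rates and decreases towards the smallest rate.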


The graphs of the intensities of ageing for the 
maritime ferry are shown in Figure 3. 

Considering (20) and applying (57) from
(Kołowrocki et al. 2017d), the coefficients of the
operation and climate-weather impact on the maritime
ferry safety are

$\rho^4(t,1) = \frac{1/\mu^4(1)}{1/\mu^0(1)} = \frac{1/5.785139}{1/6.246} = 1.079663$,
$\rho^4(t,2) = \frac{1/\mu^4(2)}{1/\mu^0(2)} = \frac{1/3.161191}{1/3.390} = 1.072381$,
$\rho^4(t,3) = \frac{1/\mu^4(3)}{1/\mu^0(3)} = \frac{1/2.342224}{1/2.503} = 1.068642$,
$\rho^4(t,4) = \frac{1/\mu^4(4)}{1/\mu^0(4)} = \frac{1/1.879662}{1/2.007} = 1.067745$. (26)


Figure 2. The graph (the fragility curve) of the maritime
ferry risk function.


Figure 3. The graph of the intensities of ageing of the
maritime ferry.


The resilience indicator, i.e. the coefficient of
maritime ferry resilience to the operation process and
climate-weather change process impact, is

$RI^4(t) = \frac{1}{\rho^4(t,2)} = 0.9325047 = 93.25\%$. (27)
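
The arithmetic behind (26) and (27) can be checked directly. The sketch below takes the mean lifetimes under impact from (20) and the denominators printed in (26), read here as the mean lifetimes without impact. It also assumes that the resilience indicator is the reciprocal of the coefficient at the critical safety state r = 2, an assumption that reproduces the reported value.

```python
# Mean lifetimes with impact, from (20), and the denominators of (26).
mu_with_impact = {1: 5.785139, 2: 3.161191, 3: 2.342224, 4: 1.879662}
mu_without_impact = {1: 6.246, 2: 3.390, 3: 2.503, 4: 2.007}

# Coefficient of impact: ratio of the reciprocals of the mean lifetimes.
rho_impact = {u: (1.0 / mu_with_impact[u]) / (1.0 / mu_without_impact[u])
              for u in mu_with_impact}

critical_state = 2
resilience_indicator = 1.0 / rho_impact[critical_state]
```

All four coefficients lie between 1.06 and 1.08, so the modelled impact shortens the mean lifetimes by roughly 6 to 7 percent.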


6 CONCLUSIONS 


The simplified impact model of critical infrastructure
safety related to operation and climate-weather
change impact was applied to the safety and risk
evaluation of the maritime ferry operating in Baltic
Sea waters. The predicted maritime ferry safety
characteristics differ from those determined for
this system operating at constant conditions, without
considering operation and climate-weather influence;
for example, the mean lifetime in the safety state subset
{1,2,3,4} is 5.785139 years compared with the value
6.246 years appearing in (26).


APPENDIX


The coefficients $[\rho_{ij}^{(\upsilon)}(u)]^{(bl)}$, $u = 1,2,3,4$, $i = 1,2,\ldots,k$, $j = 1,2,\ldots,l_i$, $\upsilon = 1,2,3,4,5$, $b = 1,2,\ldots,18$, $l = 1,2,\ldots,6$,
of the operation process related to the climate-weather change process impact on the maritime ferry
intensities of degradation are given in the table below.


Table 1. Coefficients of operation and climate-weather change impact on the maritime ferry intensities of degradation. 


S S S; S, S; 


Oper. CW- 


States area CP b l En Eaa Bia Esg Ey Esg Eg By Ba Bo Bg Ep Ep Esn Es Ess 


1 GDY CM 111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
2 GDY C® 1 2 105 12 1.3 1.3 1.25 125 12 12 11 11 11 115 1.15 12 12 12 
3 GDY CC® 131 1 1 1 1 1 1 1 1 1 1 1 1 1 
4 GDY C) 1 4 1.02 11 11 11 1.1 11 11 1.1 1.05 1.05 1.05 1.05 1.05 1.05 1.05 1.05 
5 GDY C) 151 1 1 1 1 1 1 1 1 1 1 1 1 
6 GDY C 161.05 12 1.3 13 1.25 125 12 12 11 141 11 1.15 1.15 12 12 12 
7 GDY Cc)? 211 1 1 1 1 1 1 1 1 1 1 1 1 1 
8 GDY CY 22 105 12 13 1.3 125 1.25 1.2 12 2.1 21 11 “15 115 12 12 12 
9 GDY ©” 231 1 1 1 1 1 1 1 1 il 1 1 1 1 1 1 
10 GDY C® 24102 1.1 1.1 1.1 1.1 11 11 1.1 1.05 1.05 1.05 1.05 1.05 1.05 1.05 1.05 
11 GDY C 25 1 1 1 ier t 1 1 1 1 1 1 1 1 1 1 1 
12 GDY ©” 2 6105 12 13 13 1.25 1.25 12 12 1.1 11 11 1.15 1.15 12 12 12 
13 res œ 311 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
14 res œ 321 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
15 res @ 3 3-105 11. 1 T IAS MSR LI IS -LIS LIS 1 ll ol 1 1 
16 res Cc 3441.1 1.05 1 1 1.05 1.05 1.02 1.02 1.05 1.05 1.05 1 11 J 1 1 
17 res MM 35 11 115 1 1 1.15 1.15 1.05 1.05 Ll It Ll 1 1.15 1 1 1 
18 res Mm 3611 12 1 r LA E2 Lr ar DS DS: 1S 1.15 1 1 1 
19 res C 411 1 1 1 1 i 1 1 1 1 1 1 1 1 1 1 
20 res Cc 42 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
21 res C 43 1.05 11 1 1 LIS 1.05 LE LE TAS AI 2.150 1 Ki a 1 1 
22 res C® 441.1 1.05 1 1 1.05 1.05 1.02 1.02 1.05 1.05 1.05 1 11 ol 1 1 
23 res C 451.1 115 1 L DIS MIS 105-1505) TT TLE a I 1.15 1 1 1 
24 res C 4611 12 1 1 42 12. LL 1r Tas LIS 1.75: 1 1.15 1 1 1 
25 open C? 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
26 open C? 5 2 1 1 1 T A 1 1 1 1 1 1 1 1 1 1 1 
27 open C® 5 3 1.05 14 1 1 14 14 14 14 13 13 13 1 125: d 1 1 
28 open C? 5 4 1.15 1.1 1 1 Lt bt TA bl 105 1.05 1.05 1 Ii d 1 1 
29 opn C? SS 1.15 13 1 T de S 3 LS I2 2 2, q1 12 1 1 1 
30 open C? 5 6 1.15 1.5 1 O bS kS TS Laky ba Koel 13 1 1 1 
31 KAR C 61 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
32 KAR C 62 102 1.1 11 11 1.1 11 1.1 1.1 1.05 1.05 1.05 1.05 1.05 1.05 1.05 1.05 
33 KAR C® 6 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
34 KAR C 6 4 1.05 12 1.3 13 1.25 1.25 12 12 11 11 11 1.15 1.15 12 12 12 
35 KAR C 6 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
36 KAR C® 6 6 1.02 11 11 11 11 11 11 11 1.05 1.05 1.05 1.05 1.05 1.05 1.05 1.05 
37 KAR C® 71 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
38 KAR C? 72 1.02 11 11 11 blo 11 11 11 1.05 1.05 1.05 1.05 1.05 1.05 1.05 1.05 
39 KAR C 73 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
40 KAR C® 7 4 1.005 1.2 1.3 1.3 1.25 1.25 12 12 11 11 11 1.15 115 1.2 12 1.2 
41 KAR C? 75 1 1 1 1 eee | 1 1 1 1 1 1 1 1 1 1 1 
42 KAR C® 76 1.02 L1 11 11 blo 11 11 41 1.05 1.05 1.05 1.05 1.05 1.05 1.05 1.05 
43 KAR C® 8 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
44 KAR C® 8 2 102 11 11 11 14 11 11 11 1.05 1.05 1.05 1.05 1.05 1.05 1.05 1.05 
45 KAR C 8 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
46 KAR C 8 4 1.05 1.2 1.3 1.3 1.25 1.25 1.2 12 11 1.1 1.1 1.15 1.15 1.2 12 1.2 
47 KAR C® 8 5 1 1 1 i l 1 1 1 1 1 1 1 1 1 1 1 
48 KAR C® 8 6 1.02 1.1 11 11 11 11 11 1.1 1.05 1.05 1.05 1.05 1.05 1.05 1.05 1.05 


49 KAR CM 911 1 1 1 1 1 1 1 1 1 1 1 i 1 1 1 
50 KAR C® 9 2 1.02 1.1 11 11 11 1.1 11 1.1 1.05 1.05 1.05 1.05 1.05 1.05 1.05 1.05 
51 KAR Œœ 9 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
52 KAR C® 9 41.05 1.2 £3 1.3 1.25 1.25 1.2 1.2 11 11 TI 115 1.15 1.2 1.2 12 
53 KAR CY 951 1 1 1 1 1 1 1 1 1 1 1 1 1 
54 KAR C® 9 61.02 LI 11 Lt Ll LI 11 11 1.05 1.05 1.05 1.05 1.05 1.05 1.05 1.05 
55 KAR C® 101 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
56 KAR C® 102 1.02 11. 1.1 11 114. 11 11 1.4 (1.05 1.05 1.05 1.05 1.05 1.05 1.05 1.05 
57 KAR C® 103 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
58 KAR C® 104 1.05 1.2 13 1.3 1.25 1.25 1.2 1.2 11 11 A1 115 1.15 12 12 12 
59 KAR C® 105 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
60 KAR C® 106 1.02 1.1 11 11 11 11 11 11 1.05 1.05 1.05 1.05 1.05 1.05 1.05 1.05 
61 KAR C® 1111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
62 KAR C® 112 1.02 1.1 11 11 11 ti 11- 1.1 (1.05 1.05 1.05 1.05 1.05 1.05 1.05 1.05 
63 KAR C® 113 1 1 1 1 1 1 1 1 1 1 1 1 1 il 1 1 
64 KAR C® 114 1.05 12 13 1.3 1.25 1.25 12 12 11 11 11 1.15 1.15 12 12 12 
65 KAR C® 115 1 1 1 ee | 1 1 1 1 1 1 1 1 1 1 il 
66 KAR C® 116 1.02 11 11 11 11 141 11 T1 1.05 1.05 1.05 1.05 1.05 1.05 1.05 1.05 
67 KAR C® 121 1 1 1 I l 1 1 1 1 1 1 i 1 1 1 1 
68 KAR C® 122 1.02 1.1 11 1.1 1.1 #11 11 L1 (1.05 1.05 1.05 1.05 1.05 1.05 1.05 1.05 
69 KAR C® 123 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
70 KAR C® 1241.05 12 1.3 1.3 1.25 1.25 12 12 11 11 11 115 1.15 12 12 12 
71 KAR CC® 125 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 il 
72- KAR C® 126 1.02 11 L1 11 11. 11 11 11 105 1.05 1.05 1.05 1.05 1.05 1.05 1.05 
73 open C® 131 1 1 1 1 1 1 1 1 1 1 1 il 1 1 1 1 
74 open C® 132 1 1 1 1 1 1 1 1 1 1 1 il 1 1 1 1 
75 open C® 133 1.05 14 1 1 14 14 14 14 13 13 13 1 1.25 1 1 1 
76 open C® 134 1.15 11 1 1 AE EE LL LE 61,05 1.05. 1.05.1 tik. 1 1 
77 open C® 135 1.15 13 1 L: LF 13 13 13 12 12 12 1 12 1 1 1 
78 open C 136 1.15 15 1 1 US Us: 15s 15 13 13: 13° 1 1.3 1 1 1 
79 res C® 141 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
80 res C? 142 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
81 res C® 143 1.05 1.1 1 Lo AAS. 10S Te Tel IS TAS 1.15 1 iF) ee 1 1 
82 res C 144 1.1 1.05 1 1 1.05 1.05 1.02 1.02 1.05 1.05 1.05 1 Lr 1 1 1 
83 res C? 145 1.1 1.15 1 1 J15 115 1.05: 1,05 t1 1.2 11 1 1.15 1 1 1 
84 res CC® 1461.1 12 1 L L2 LZ T1 Ar IS L Sa 115 1 1 1 
85 res C® 1511 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
86 res C 152 1 1 1 i -i 1 1 1 1 1 1 1 1 1 1 1 
87 res C®? 153 1.05 11 1 1 1.15 1.15 11 LI 115 1.15 1.15 1 Ir l 1 1 
88 res C? 154 1.1 1.05 1 1 1.05 1.05 1.02 1.02 1.05 1.05 1.05 1 Ir I 1 1 
89 re C® 155 11 1.15 1 1 115 L15 105 1,05 11 141 T1 1 1.15 1 1 1 
90 res C? 1561.1 12 1 L J2° “12> TI LI kis 115 115-1 1.15 1 1 1 
91 GDY Cc 161 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
92 GDY C® 162 1.05 12 13 1.3 1.25 125 12 12 1.1 11 21.1 1.15 1.15 12 1.2 12 
93 GDY C 163 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
94 GDY C” 164 1.02 1.1 1.1 11 11 11 1.1 1.1 1.05 1.05 1.05 1.05 1.05 1.05 1.05 1.05 
95 GDY C” 165 1 1 1 I 1 1 1 1 1 1 1 i 1 1 1 1 
96 GDY CY 166 1:05 12 1:3 1.3 1.25. 125.12 12 24 11: 11 115 115 12 1:2. 12 

97 GDY CM 1711 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
98. GDY Co -172 1:05 1.2 13 13 1.25 125 122 12 Ii LI lab 115 L15 12 1.2 12 
99 GDY C 173 1 1 i 1 1 1 1 1 1 1 1 1 1 1 1 
100 GDY C® 174 1.02 1.1 1.1 11 1.1 11 1.1 1.1 1.05 1.05 1.05 1.05 1.05 1.05 1.05 1.05 
101 GDY CM 175 1 1 1 1 1 al 1 1 1 1 1 1 1 1 1 1 
102 GDY C™ 176 1.05 1.2 1.3 1.3 1.25 1.25 12 12 11i 11 11 1.15 1.15 1.2 12 12 
103 GDY C” 181 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
104 GDY C 182 1.05 12 13 1.3 1.25 1.25 12 12 IL 11 11 A15 1.15 1.2 12 1.2 
105 GDY C” 183 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
106 GDY ©?” 184 1.02 11 11 11 11 11 11 11 1.05 1.05 1.05 1.05 1.05 1.05 1.05 1.05 
107 GDY Cc” 185 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
108.. „GDY CH 186 105-12. 13 13° 125 1.25 12. 12 110 1 I1 -IIS 115. 1.2" 12. 1:2 


Operating environment threats and climate-weather hazards
impact on maritime ferry safety


K. Kołowrocki & E. Kuligowska
Gdynia Maritime University, Poland


ABSTRACT: In this paper, a general analytical safety model of a critical infrastructure under the
influence of an operation process including operating environment threats related to climate-weather
change is applied to the safety prediction of a maritime ferry. The main safety characteristics of the ferry in
its operating area are evaluated, namely the unconditional safety
function, the risk function, the mean lifetime up to exceeding a critical safety state, the moment when
the risk function value exceeds the acceptable safety level, the intensities of ageing and the coefficients of
the operation and climate-weather impact on the ferry intensities of ageing.


1 INTRODUCTION


The paper is concerned with the influence of operating
environment threats and climate-weather hazards
on the safety of a critical infrastructure.
The maritime ferry operation process related to
operating environment threats is described in
(Kołowrocki et al. 2017a), whereas the climate-weather
change process for the ferry operating area
is modeled in (Kuligowska 2017). The identification
of the ferry operation process related to operating
environment threats and climate-weather
hazards is performed in (Kołowrocki et al. 2017b).

Having these processes identified, the safety prediction
of the considered ferry under the operation
process including operating environment
threats and climate-weather change influence is
performed. The maritime ferry safety characteristics,
i.e. the unconditional safety function and
the risk function under operation (including threats)
and climate-weather conditions changing in time
(in February), are determined. Moreover, the safety
and resilience indicators are presented: the mean
lifetime up to exceeding a critical safety state,
the moment when the risk function value exceeds
the acceptable safety level, the intensities of ageing
and the coefficients of the operation and climate-weather
impact on the maritime ferry safety.


2 CRITICAL INFRASTRUCTURE
OPERATION PROCESS RELATED
TO OPERATING ENVIRONMENT
THREATS AND EXTREME
WEATHER HAZARDS


2.1 Critical infrastructure operation process
related to operating environment threats


We assume, as in (Kołowrocki et al. 2017c), that
the system during its operation process takes
$\nu'$, $\nu' \in N$, different operation states $z'_1, z'_2, \ldots, z'_{\nu'}$.
Further, we define the critical infrastructure operation
process $Z'(t)$, $t \in \langle 0,+\infty)$, related to the critical
infrastructure operating environment threats,
with discrete operation states from the set
$\{z'_1, z'_2, \ldots, z'_{\nu'}\}$. Moreover, we assume that the critical
infrastructure operation process $Z'(t)$ related to its
operating environment threats is a semi-Markov
process with the conditional sojourn times $\theta'_{bl}$ at
the operation states $z'_b$ when its next operation
state is $z'_l$, $b, l = 1,2,\ldots,\nu'$, $b \neq l$.

Under these assumptions, the critical infrastructure
operation process may be described by
(Kołowrocki, Kuligowska & Soszyńska-Budny
2017c):

— the vector $[p'_b(0)]_{1\times\nu'}$ of the initial probabilities
$p'_b(0) = P(Z'(0) = z'_b)$, $b = 1,2,\ldots,\nu'$, of the
process $Z'(t)$ staying at particular states at the
moment $t = 0$;

— the matrix $[p'_{bl}]_{\nu'\times\nu'}$ of probabilities $p'_{bl}$,
$b, l = 1,2,\ldots,\nu'$, $b \neq l$, of the process $Z'(t)$ transitions
between the operation states $z'_b$ and $z'_l$;

— the matrix $[H'_{bl}(t)]_{\nu'\times\nu'}$ of conditional distribution
functions $H'_{bl}(t) = P(\theta'_{bl} < t)$, $t \in \langle 0,+\infty)$,
$b, l = 1,2,\ldots,\nu'$, $b \neq l$, of the system operation
process $Z'(t)$ conditional sojourn times $\theta'_{bl}$ at the
operation states.


The limit values of the transient probabilities
$p'_b(t) = P(Z'(t) = z'_b)$, $t \in \langle 0,+\infty)$,
$b = 1,2,\ldots,\nu'$, of the process $Z'(t)$ at the particular
states can be found using the procedure given in (Kołowrocki,
Kuligowska & Soszyńska-Budny 2017c).


2.2 Climate-weather change process at the critical 
infrastructure operating area 


The climate-weather change process $C(t)$, $t \in \langle 0,+\infty)$,
for the critical infrastructure operating area is
defined and modelled in (Kołowrocki, Soszyńska-Budny
& Torbicki 2017a,b). We assume that the
process takes $w$, $w \in N$, different climate-weather
states $c_1, c_2, \ldots, c_w$, and that it is a semi-Markov
process. The process parameters are
described in (Kołowrocki & Kuligowska 2018),
whereas the process characteristics, e.g. the limit
values of the climate-weather change process $C(t)$
transient probabilities $q_b(t) = P(C(t) = c_b)$, $t \in \langle 0,+\infty)$,
$b = 1,2,\ldots,w$, at the particular climate-weather
states, can be found using the procedure
given in (Kołowrocki, Soszyńska-Budny &
Torbicki 2017a).


2.3 Critical infrastructure operation process 
related to operating environment threats 
and climate-weather hazards 


We assume, as in (Kołowrocki et al. 2017a), that
the critical infrastructure operation process $Z'C(t)$,
$t \in \langle 0,+\infty)$, related to operating environment
threats and climate-weather hazards can take $\nu' w$,
$\nu', w \in N$, different operation states
$z'c_{11}, z'c_{12}, \ldots, z'c_{\nu' w}$, described by:


— the vector $[p'q_{bl}(0)]_{1\times\nu' w}$ of initial probabilities of
the critical infrastructure process $Z'C(t)$ staying
at the initial moment $t = 0$ at the particular states
$z'c_{bl}$, $b = 1,2,\ldots,\nu'$, $l = 1,2,\ldots,w$;

— the matrix $[p'q_{bl\,\bar{b}\bar{l}}]_{\nu' w\times\nu' w}$ of the probabilities
of transitions of the critical infrastructure
process $Z'C(t)$ between the states
$z'c_{bl}$ and $z'c_{\bar{b}\bar{l}}$, $b = 1,2,\ldots,\nu'$, $l = 1,2,\ldots,w$,
$\bar{b} = 1,2,\ldots,\nu'$, $\bar{l} = 1,2,\ldots,w$;

— the matrix $[H'C_{bl\,\bar{b}\bar{l}}(t)]_{\nu' w\times\nu' w}$ of the conditional
distribution functions of the critical infrastructure
process $Z'C(t)$ conditional sojourn
times $\theta'C_{bl\,\bar{b}\bar{l}}$, $b = 1,2,\ldots,\nu'$, $l = 1,2,\ldots,w$,
$\bar{b} = 1,2,\ldots,\nu'$, $\bar{l} = 1,2,\ldots,w$, at the state $z'c_{bl}$
when the next state is $z'c_{\bar{b}\bar{l}}$.


Assuming that we have identified the unknown
parameters of the critical infrastructure operation
process $Z'C(t)$ related to operating environment
threats and climate-weather hazards, we can predict
the basic characteristics of this process, e.g. the limit transient
probabilities $p'q_{bl}(t) = P(Z'C(t) = z'c_{bl})$, $t \in \langle 0,+\infty)$,
$b = 1,2,\ldots,\nu'$, $l = 1,2,\ldots,w$, at the particular states,
according to the procedure given in (Kołowrocki,
Kuligowska & Soszyńska-Budny 2017a).


3 CRITICAL INFRASTRUCTURE SAFETY 
RELATED TO OPERATING 
ENVIRONMENT THREATS AND 
CLIMATE-WEATHER CHANGE 
PROCESS 


In the safety analysis of critical infrastructures
at variable operation conditions under the
influence of operating environment threats and
related to climate-weather change, to define the
critical infrastructure with degrading components
we assume that the changes of the states of the process
$Z'C(t)$ have an impact on the critical infrastructure's
components and its structure (Kołowrocki 2014,
Kołowrocki et al. 2017b). We denote by $[T_i(u)]^{(bl)}$ the
conditional lifetime of the critical infrastructure asset
$A_i$, $i = 1,2,\ldots,n$, in the safety state subset
$\{u, u+1, \ldots, z\}$ while the process $Z'C(t)$ is at the state
$z'c_{bl}$, $b = 1,2,\ldots,\nu'$, $l = 1,2,\ldots,w$. Moreover, in
this section we assume that the critical infrastructure
assets at particular states have exponential
safety functions. According to (Kołowrocki
et al. 2017b), the conditional critical infrastructure
safety function is defined by the vector


[SRA = [1 [S79 n”, wS; T te <0, +), 
bal 2s Vid = 152.6 ain os 1,8 (1) 
with the coordinates 

[Stu] = PATRUN” > t| ZCO = zcy) 


= [A t, tE <0, +), b= 1, 2, 
v’, /=1, 2,....w,i=1,2,...,n, (2) 


where the intensities of ageing of the critical infra- 
structure assets related to operating environment 
threats and climate-weather impact, existing in (2), 
are given by 


[Awl = [pw - a u=1,2,. 
b=1,2,.., V’, l=1,2,. 
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and A,(u) are the intensities of ageing of the system 
components without any impact and 
[e?(u)|, w= 1, 2, ..., 2, b=1, 2, ..., Vv’, 
LEl 2s We PH 15.25 saM (4) 
are the coefficients of operation and climate- 
weather change impact on the critical infrastruc- 
ture assets’ intensities of ageing without operation 
and climate-weather change impact. 

Further, we denote the critical infrastructure
unconditional lifetime in the safety state subset
$\{u, u+1, \ldots, z\}$ by $T^5(u)$ and the system
unconditional safety function by

$S^5(t,\cdot) = [1, S^5(t,1), \ldots, S^5(t,z)]$, (5)

with the vector's coordinates defined by

$S^5(t,u) = P(T^5(u) > t)$, $t \in \langle 0, +\infty)$, $u = 1, 2, \ldots, z$. (6)

In the case when the critical infrastructure operation
time $\theta'C$ is large enough, the coordinates
of the unconditional safety function of the system
defined by (5) are given by

$S^5(t,u) = \sum_{b=1}^{\nu'} \sum_{l=1}^{w} p'q_{bl} [S^5(t,u)]^{(bl)}$, $t \in \langle 0, +\infty)$,
$u = 1, 2, \ldots, z$, (7)

where $[S^5(t,u)]^{(bl)}$, $u = 1, 2, \ldots, z$, $b = 1, 2, \ldots, \nu'$,
$l = 1, 2, \ldots, w$, are the coordinates of the critical
infrastructure conditional safety function defined
by (2)-(4) and $p'q_{bl}$, $b = 1, 2, \ldots, \nu'$, $l = 1, 2, \ldots, w$, are
the system operation process limit transient
probabilities (see section 2.3).
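
Formula (7) is a probability-weighted mixture of the conditional safety functions. The sketch below illustrates this under a strong simplifying assumption: each conditional safety function is taken to be a single exponential, and only three operation states are used. The probabilities and intensities are hypothetical illustration values, not data from the paper, and the function names are introduced here only for the example.

```python
import math


def unconditional_safety(t, probs, rates):
    """Mixture of exponential conditional safety functions, in the spirit of (7)."""
    return sum(p * math.exp(-lam * t) for p, lam in zip(probs, rates))


def mean_lifetime(probs, rates):
    """Integral of the mixture over t >= 0: sum of p / lambda."""
    return sum(p / lam for p, lam in zip(probs, rates))


def std_lifetime(probs, rates):
    """Standard deviation from the second moment, sum of 2 p / lambda^2."""
    second_moment = sum(2.0 * p / lam ** 2 for p, lam in zip(probs, rates))
    return math.sqrt(second_moment - mean_lifetime(probs, rates) ** 2)


# Hypothetical inputs (assumed for illustration only).
probs = [0.6, 0.3, 0.1]      # limit transient probabilities, summing to 1
rates = [0.10, 0.25, 0.80]   # conditional intensities of ageing [1/year]

print(unconditional_safety(1.0, probs, rates))
print(mean_lifetime(probs, rates), std_lifetime(probs, rates))
```

In the model of the paper the conditional system safety functions follow from the system structure, so they need not be single exponentials; the mixture step itself is unchanged.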

Further, we determine the mean values $\mu^5(u)$
and the standard deviations $\sigma^5(u)$ of the unconditional
lifetimes of the critical infrastructure in the
safety state subsets $\{u, u+1, \ldots, z\}$, $u = 1, 2, \ldots, z$,
the mean values $\bar{\mu}^5(u)$ of the unconditional lifetimes
of the critical infrastructure in the particular
safety states $u$, $u = 1, 2, \ldots, z$, the risk function
$r^5(t)$ and the moment $\tau^5$ when the critical infrastructure
risk function exceeds a permitted level
$\delta$, after substituting for $S^5(t,u)$, $u = 1, 2, \ldots, z$, the
coordinates of the unconditional safety functions
given by (6).


4 MARITIME FERRY OPERATION
PROCESS RELATED TO OPERATING 
ENVIRONMENT THREATS AND 
CLIMATE-WEATHER HAZARDS 


Taking into account the expert opinions concerning
the operation process of the considered ferry
technical system (Kołowrocki et al. 2017d), we
assume that the maritime ferry operation process
and safety may depend on operating environment
threats and we distinguish the following 3 unnatural
threats:

• $ut_1$ - a human error,
• $ut_2$ - a terrorist attack,
• $ut_3$ - a heavy sea traffic.


After assuming that the maritime ferry operation
process including operating environment
threats and the climate-weather change process at
its operating area are independent, it is possible
to evaluate the unknown basic parameters of the
maritime ferry process $Z'C(t)$. Considering these
identification results (Kołowrocki et al. 2017a),
it was possible to predict the basic characteristics
of this process. The limit values of the process $Z'C(t)$
transient probabilities $p'q_{bl}$, $b = 1, 2, \ldots, 72$,
$l = 1, 2, \ldots, 6$, at the particular states $z'c_{bl}$, are given
in the vector


$[p'q_{bl}]_{1 \times 432}$ = [0.014948, 0.000407, 0.016687, 0.00037, 0.004588, 0;
0.0002424, 0.0000066, 0.0002706, 0.000006, 0.0000744, 0; 0, 0, 0, 0, 0, 0; 
0.0001616, 0.0000044, 0.0001804, 0.000004, 0.0000496, 0; 
0.000404, 0.000011, 0.000451, 0.00001, 0.000124, 0; 
0.0002424, 0.0000066, 0.0002706, 0.000006, 0.0000744, 0; 0, 0, 0, 0, 0, 0; 
0.0001616, 0.0000044, 0.0001804, 0.000004, 0.0000496, 0; 
0.020125, 0.004175, 0.0002, 0, 0.000175, 0.000325; 
0.000483, 0.0001002, 0.0000048, 0, 0.0000042, 0.0000078; 0, 0, 0, 0, 0, 0; 
0.000322, 0.0000668, 0.0000032, 0, 0.0000028, 0.0000052; 
0.028175, 0.005845, 0.00028, 0, 0.000245, 0.000455; 


0.000483, 0.0001002, 0.0000048, 0, 0.0000042, 0.0000078; 


0, 0, 0, 0, 0, 0; 


0.000322, 0.0000668, 0.0000032, 0, 0.0000028, 0.0000052; 
0.286704, 0.06697, 0, 0, 0.005792, 0.002534; 

0.0004752, 0.000111, 0, 0, 0.0000096, 0.0000042; 0, 0, 0, 0, 0, 0; 
0.0003168, 0.000074, 0, 0, 0.0000064, 0.0000028; 

0.00985, 0.0004, 0.007775, 0.00075, 0.006225, 0; 
0.0002364, 0.0000096, 0.0001866, 0.000018, 0.0001494, 0; 0, 0, 0, 0, 0, 0;
0.0001576, 0.0000064, 0.0001244, 0.000012, 0.0000996, 0; 

0.001576, 0.000064, 0.001244, 0.00012, 0.000996, 0; 

0.0002364, 0.0000096, 0.0001866, 0.000018, 0.0001494, 0; 0, 0, 0, 0, 0, 0; 
0.0001576, 0.0000064, 0.0001244, 0.000012, 0.0000996, 0; 

0.00591, 0.00024, 0.004665, 0.00045, 0.003735, 0; 

0.0002364, 0.0000096, 0.0001866, 0.000018, 0.0001494, 0; 0, 0, 0, 0, 0, 0; 
0.0001576, 0.0000064, 0.0001244, 0.000012, 0.0000996, 0; 

0.014184, 0.000576, 0.011196, 0.00108, 0.008964, 0; 

0.0002364, 0.0000096, 0.0001866, 0.000018, 0.0001494, 0; 0, 0, 0, 0, 0, 0; 
0.0001576, 0.0000064, 0.0001244, 0.000012, 0.0000996, 0; 

0.000394, 0.000016, 0.000311, 0.00003, 0.000249, 0; 

0.0002364, 0.0000096, 0.0001866, 0.000018, 0.0001494, 0; 0, 0, 0, 0, 0, 0; 
0.0001576, 0.0000064, 0.0001244, 0.000012, 0.0000996, 0; 

0.000788, 0.000032, 0.000622, 0.00006, 0.000498, 0; 

0.0002364, 0.0000096, 0.0001866, 0.000018, 0.0001494, 0; 0, 0, 0, 0, 0, 0; 
0.0001576, 0.0000064, 0.0001244, 0.000012, 0.0000996, 0; 

0.00591, 0.00024, 0.004665, 0.00045, 0.003735, 0; 

0.0002364, 0.0000096, 0.0001866, 0.000018, 0.0001494, 0; 0, 0, 0, 0, 0, 0; 
0.0001576, 0.0000064, 0.0001244, 0.000012, 0.0000996, 0; 


0.2772, 0.06475, 0, 0, 0.0056, 0.00245; 


0.0004752, 0.000111, 0, 0, 0.0000096, 0.0000042; 0, 0, 0, 0, 0, 0; 
0.0003168, 0.000074, 0, 0, 0.0000064, 0.0000028; 

0.026565, 0.005511, 0.000264, 0, 0.000231, 0.000429; 

0.000483, 0.0001002, 0.0000048, 0, 0.0000042, 0.0000078; 0, 0, 0, 0, 0, 0; 
0.000322, 0.0000668, 0.0000032, 0, 0.0000028, 0.0000052; 

0.018515, 0.003841, 0.000184, 0, 0.000161, 0.000299; 


0.000483, 0.0001002, 0.0000048, 0, 0.0000042, 0.0000078; 


0, 0, 0, 0, 0, 0; 


0.000322, 0.0000668, 0.0000032, 0, 0.0000028, 0.0000052; 
0.000808, 0.000022, 0.000902, 0.00002, 0.000248, 0; 


0.0002424, 0.0000066, 0.0002706, 0.000006, 0.0000744, 0; 


0, 0, 0, 0, 0, 0; 


0.0001616, 0.0000044, 0.0001804, 0.000004, 0.0000496, 0; 

0.001616, 0.000044, 0.001804, 0.00004, 0.000496, 0; 

0.0002424, 0.0000066, 0.0002706, 0.000006, 0.0000744, 0; 0, 0, 0, 0, 0, 0; 
0.0001616, 0.0000044, 0.0001804, 0.000004, 0.0000496, 0; 

0.004848, 0.000132, 0.005412, 0.00012, 0.001488, 0; 

0.0002424, 0.0000066, 0.0002706, 0.000006, 0.0000744, 0; 0, 0, 0, 0, 0, 0; 


0.0001616, 0.0000044, 0.0001804, 0.000004, 0.0000496, 0]. (8)


5 MARITIME FERRY SAFETY
PREDICTION RELATED TO
OPERATING ENVIRONMENT THREATS
AND CLIMATE-WEATHER HAZARDS


5.1 Maritime ferry safety parameters


Five safety states (z = 4) are distinguished for
the considered maritime ferry, as described in
(Kołowrocki & Kuligowska 2018). Moreover,
following the expert opinions, we assume that
transitions between the components'
safety states are possible only from better to worse ones.
Considering the assumptions and agreements
from the previous sections, we assume that the
components $E_{ij}^{(\upsilon)}$, $i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, l_i$, of
the subsystems $S_\upsilon$, $\upsilon = 1, 2, 3, 4, 5$, at the particular
states $z'c_{bl}$, $b = 1, 2, \ldots, 72$, $l = 1, 2, \ldots, 6$, have the
exponential safety functions, i.e. the coordinates 
of the vector 


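$[S_{ij}^{(\upsilon)}(t,\cdot)]^{(bl)} = [1, [S_{ij}^{(\upsilon)}(t,1)]^{(bl)}, \ldots, [S_{ij}^{(\upsilon)}(t,4)]^{(bl)}]$,
$t \in \langle 0, +\infty)$, $i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, l_i$,
$\upsilon = 1, 2, 3, 4, 5$, $b = 1, 2, \ldots, 72$, $l = 1, 2, \ldots, 6$, (9)

are given by

$[S_{ij}^{(\upsilon)}(t,u)]^{(bl)} = P([T_{ij}^{(\upsilon)}(u)]^{(bl)} > t \mid Z'C(t) = z'c_{bl}) = \exp[-[\lambda_{ij}^{(\upsilon)}(u)]^{(bl)} t]$,
$t \in \langle 0, +\infty)$, $u = 1, 2, 3, 4$, $i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, l_i$,
$\upsilon = 1, 2, 3, 4, 5$, $b = 1, 2, \ldots, 72$, $l = 1, 2, \ldots, 6$. (10)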


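The intensities of ageing appearing in the above formula
for the components $E_{ij}^{(\upsilon)}$, $i = 1, 2, \ldots, k$,
$j = 1, 2, \ldots, l_i$, of the subsystems $S_\upsilon$, $\upsilon = 1, 2, 3, 4, 5$, at
the system states $z'c_{bl}$, $b = 1, 2, \ldots, 72$, $l = 1, 2, \ldots, 6$,
i.e. the coordinates of the vector of intensities

$[\lambda_{ij}^{(\upsilon)}(\cdot)]^{(bl)} = [0, [\lambda_{ij}^{(\upsilon)}(1)]^{(bl)}, \ldots, [\lambda_{ij}^{(\upsilon)}(4)]^{(bl)}]$,
$i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, l_i$,
$\upsilon = 1, 2, 3, 4, 5$, $b = 1, 2, \ldots, 72$, $l = 1, 2, \ldots, 6$, (11)

are given by

$[\lambda_{ij}^{(\upsilon)}(u)]^{(bl)} = [\rho_{ij}^{(\upsilon)}(u)]^{(bl)} \cdot \lambda_{ij}^{(\upsilon)}(u)$,
$u = 1, 2, 3, 4$, $i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, l_i$,
$\upsilon = 1, 2, 3, 4, 5$, $b = 1, 2, \ldots, 72$, $l = 1, 2, \ldots, 6$, (12)

where $\lambda_{ij}^{(\upsilon)}(u)$, $u = 1, 2, 3, 4$, $i = 1, 2, \ldots, k$,
$j = 1, 2, \ldots, l_i$, are the intensities of ageing of the
components $E_{ij}^{(\upsilon)}$ of the subsystems $S_\upsilon$,
$\upsilon = 1, 2, 3, 4, 5$, without any impact, and
$[\rho_{ij}^{(\upsilon)}(u)]^{(bl)}$, $u = 1, 2, 3, 4$,
$i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, l_i$, $\upsilon = 1, 2, 3, 4, 5$,
$b = 1, 2, \ldots, 72$, $l = 1, 2, \ldots, 6$, are the coefficients
of the operation impact, including threats and
climate-weather change impact, on the intensities of ageing
of the components $E_{ij}^{(\upsilon)}$ of the subsystems
$S_\upsilon$, $\upsilon = 1, 2, 3, 4, 5$, at the states
$z'c_{bl}$, $b = 1, 2, \ldots, 72$, $l = 1, 2, \ldots, 6$.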

According to expert opinions, changes of the maritime
ferry operation process states influence
the system safety structures and the safety parameters
of selected components as well. For this system,
the intensities of components' departure from
the safety states subsets {1,2,3,4}, {2,3,4}, {3,4},
{4} without any impact are given in (Kołowrocki
et al. 2017d), the intensities of departure related to
the climate-weather influence on the maritime ferry
safety are given in (Kołowrocki & Kuligowska 2018)
and the intensities of departure related to the influence
of operating environment threats on ferry safety are
given in (Kołowrocki et al. 2017d). Thus, the intensities
of departure related to the joint influence of operating
environment threats and climate-weather hazards
on ferry safety are calculated according to formula
(12), where the coefficients of the operation, operating
environment threats and climate-weather change
impact on the components' intensities of ageing at
the particular states are the products of the
above mentioned coefficients of impact on the
components' intensities of ageing at the particular states.


5.2 Maritime ferry safety characteristics 


Assuming that the safety structure of the maritime ferry
technical system and the safety of its subsystems and
components depend on its operation states changing in time,
the influence of the changes of the system states
on the system safety structure and on its components'
safety functions is given in
(Kołowrocki et al. 2017b). Thus, in the case when
the operation time is large enough, according to (7)
the maritime ferry unconditional safety function is
given by the vector

$S^5(t,\cdot) = [1, S^5(t,1), S^5(t,2), S^5(t,3), S^5(t,4)]$, $t \in \langle 0, +\infty)$, (13)

where according to (7) the vector coordinates are
given respectively for $t \in \langle 0, +\infty)$, $u = 1, 2, 3, 4$, by

$S^5(t,u) = \sum_{b=1}^{72} \sum_{l=1}^{6} p'q_{bl} [S^5(t,u)]^{(bl)}$, (14)

where $[S^5(t,u)]^{(bl)}$, $u = 1, 2, 3, 4$, $b = 1, 2, \ldots, 72$,
$l = 1, 2, \ldots, 6$, are the coordinates of the system
conditional safety functions defined by (2)-(4) and
$p'q_{bl}$, $b = 1, 2, \ldots, 72$, $l = 1, 2, \ldots, 6$, are the limit
transient probabilities given by (8).

The graph of the five-state maritime ferry 
technical system safety function is presented in 
Figure 1. 

Considering (14), the expected values and
standard deviations, given in years, of the maritime
ferry technical system lifetimes in the safety states
subsets {1,2,3,4}, {2,3,4}, {3,4}, {4}, respectively are

$\mu^5(1) = 5.778643$, $\mu^5(2) = 3.157926$,
$\mu^5(3) = 2.339679$, $\mu^5(4) = 1.877708$; (15)

$\sigma^5(1) = 5.557708$, $\sigma^5(2) = 3.07524$,
$\sigma^5(3) = 2.280044$, $\sigma^5(4) = 1.828565$. (16)

Consequently, the mean values of the maritime
ferry technical system lifetimes in the particular
safety states 1, 2, 3, 4, respectively are:

$\bar{\mu}^5(1) = 2.620717$, $\bar{\mu}^5(2) = 0.8182476$,
$\bar{\mu}^5(3) = 0.4619703$, $\bar{\mu}^5(4) = 1.877708$. (17)
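
The values in (17) can be checked from (15): the mean lifetime in a particular safety state u is the difference between the mean lifetimes in the subsets starting at u and at u + 1, and for the best state it equals the subset mean itself. A minimal check, with the function name introduced here only for illustration:

```python
# Mean lifetimes in the safety state subsets {u, ..., 4}, taken from (15), in years.
mu = {1: 5.778643, 2: 3.157926, 3: 2.339679, 4: 1.877708}


def mean_in_particular_states(mu, z=4):
    """Return the mean lifetime in each particular state u = 1, ..., z."""
    result = {}
    for u in range(1, z + 1):
        # For u < z subtract the mean lifetime in the next, better subset.
        result[u] = mu[u] - mu[u + 1] if u < z else mu[u]
    return result


mu_bar = mean_in_particular_states(mu)
for u, value in mu_bar.items():
    print(u, round(value, 6))
```

The differences agree with (17) up to rounding of the last printed digit.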


Figure 1. The graph of the maritime ferry safety function
coordinates (horizontal axis: t [years]).
By (15) and (16), the mean and the standard
deviation of the maritime ferry lifetime up to
exceeding the critical safety state r = 2 are

$\mu^5(2) = 3.157926$ years, $\sigma^5(2) = 3.07524$ years.


The risk function of the maritime ferry
technical system is given by

$r^5(t) = 1 - S^5(t,2)$, $t \in \langle 0, +\infty)$, (18)

where $S^5(t,2)$ is given by (14), and it is illustrated in
Figure 2.

Hence, considering (18), the moment when the
system risk function exceeds a permitted level, for
instance $\delta = 0.05$, is

$\tau^5 = (r^5)^{-1}(\delta) = 0.17$ year. (19)
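
Because the risk function (18) is increasing in t, the moment in (19) can be found by numerically inverting it. The sketch below uses bisection and assumes, for illustration only, that the safety function is a mixture of exponentials with hypothetical probabilities and intensities; it does not reproduce the ferry data.

```python
import math


def risk(t, probs, rates):
    """Risk function r(t) = 1 - S(t, r) for an assumed exponential mixture."""
    return 1.0 - sum(p * math.exp(-lam * t) for p, lam in zip(probs, rates))


def moment_of_exceeding(delta, probs, rates):
    """Smallest t with r(t) >= delta, found by bisection (0 < delta < 1)."""
    lo, hi = 0.0, 1.0
    while risk(hi, probs, rates) < delta:   # enlarge the bracket if needed
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if risk(mid, probs, rates) < delta:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical inputs (assumed for illustration only).
probs = [0.6, 0.3, 0.1]
rates = [0.10, 0.25, 0.80]
tau = moment_of_exceeding(0.05, probs, rates)
print(tau)
```

For a single exponential with intensity lambda the result reduces to the closed form -ln(1 - delta) / lambda, which gives a quick sanity check of the routine.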
The maritime ferry intensities of ageing, according
to (Kołowrocki et al. 2017b) and considering
(14), are:

$\lambda^5(t,u) = -\frac{dS^5(t,u)}{dt} \cdot \frac{1}{S^5(t,u)}$, $t \in \langle 0, +\infty)$, $u = 1, 2, 3, 4$, (20)

where particularly

$\lambda^5(t,1) = 0.2165163$, $\lambda^5(t,2) = 0.3860834$,
$\lambda^5(t,3) = 0.5164971$, $\lambda^5(t,4) = 0.644996$.

Figure 2. The graph (the fragility curve) of the maritime
ferry risk function (horizontal axis: t [years]).

Figure 3. The graph of the intensities of ageing of the
maritime ferry (horizontal axis: t [years]).


The graphs of the intensities of ageing for the 
maritime ferry are shown in Figure 3. 

Considering (15) and applying (57) from 
(Kołowrocki, Soszyńska-Budny & Torbicki 
2017d), the coefficients of operating environment 
threats and climate-weather change impact on the 
ferry safety are 


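$\rho^5(1) = \frac{1/\mu^5(1)}{1/\mu^0(1)} = \frac{1/5.778643}{1/6.246} = 1.080877$,

$\rho^5(2) = \frac{1/\mu^5(2)}{1/\mu^0(2)} = \frac{1/3.157926}{1/3.390} \approx 1.0735$,

$\rho^5(3) = \frac{1/\mu^5(3)}{1/\mu^0(3)} = \frac{1/2.339679}{1/2.503} = 1.069805$,

$\rho^5(4) = \frac{1/\mu^5(4)}{1/\mu^0(4)} = \frac{1/1.877708}{1/2.007} \approx 1.0689$. (21)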


The resilience indicator, i.e. the coefficient
of maritime ferry resilience to the impact of the operation
process including threats and of the climate-weather change
process, for the critical safety state r = 2, is

$RI^5(r) = \frac{1}{\rho^5(r)} = 0.9315416 = 93.15\%$. (22)
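
The arithmetic behind (21) and (22) is a ratio of mean lifetimes, so it can be reproduced directly from the printed operands. In the sketch below the list names are introduced for illustration; the no-impact means are the denominators printed in (21).

```python
# Mean lifetimes in the subsets {u,...,4}: with impact, from (15), and without
# impact, as printed in the denominators of (21). Values in years.
mu_with_impact = [5.778643, 3.157926, 2.339679, 1.877708]
mu_without_impact = [6.246, 3.390, 2.503, 2.007]

# Impact coefficients: (1 / mu_with) / (1 / mu_without) = mu_without / mu_with.
rho = [m0 / m5 for m5, m0 in zip(mu_with_impact, mu_without_impact)]

# Resilience indicator for the critical safety state r = 2.
r = 2
resilience = 1.0 / rho[r - 1]

print([round(value, 6) for value in rho])
print(round(resilience, 7))
```

The first and third coefficients agree with the printed 1.080877 and 1.069805, and the indicator agrees with 0.9315416.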


6 CONCLUSIONS 


The simplified impact model of critical infrastructure
safety related to operating environment threats
and climate-weather change impact was applied
to the safety and risk evaluation of the maritime
ferry operating in Baltic Sea waters. The predicted
maritime ferry safety characteristics differ
from those determined for this system operating
at constant conditions without considering any
impact: the mean lifetimes are shorter, with a
resilience indicator of 93.15%. Accounting for these
impacts therefore makes the system safety
prediction more precise.
ABSTRACT: The capabilities of nontraditional models for engineering computation under uncertainty
have been under continuous review, and several nonprobabilistic approaches have been developed.
The first applications of nonprobabilistic interval analysis in geotechnical engineering have recently
been explored, a research area formally motivated by input information characterised by imprecision.
Thus, the conventional probabilistic approach to uncertainty may be extended to include imprecise
information in the form of intervals. For demonstration, results are provided on the analysis of a strip
spread foundation designed by the Eurocode 7 methodology. A limit state imprecise interval analysis
for bearing capacity is presented in the format of a sensitivity analysis. The limit state charts for safety
assessment are sketched separately for the cohesion and friction angle interval scenarios, wherein the
random variables are bounded on different levels of probability. The corresponding optimisation-based
probability box structures are then sketched. The extension to high dimensional cases is also considered
through a limit state three-dimensional joint view for safety assessment, in which the interval
variables cohesion and friction angle are considered simultaneously. Finally, the Eurocode 7 partial
factor design is discussed on the basis of distinct levels of credibility.


1 INTRODUCTION

In recent years, several nonprobabilistic approaches
have been developed to include uncertainties
described by scarce information. If reliability is
seen as a probability related to the satisfactory
performance of a system under given circumstances,
a nonprobabilistic concept of reliability rests on
the acceptable range of performance fluctuations.
According to the discussion of Haim & Elishakoff
(1995), when only interval input variables are
considered, the nonprobabilistic concept of reliability is
approached only by bounds, as the safety margin
is also expressed in the interval format. Among
nonprobabilistic approaches, the ordinary interval
analysis involves the mapping of interval input to
interval output quantities. The main advantage is
the evaluation of the analytical enclosure of the
true solution. Accordingly, a considerable number
of papers devoted to interval versus stochastic
analysis have been published. Intervals represent
an appropriate model to describe uncertainty in
cases when a possible range between bounds is
known and no other information concerning
frequencies is available. Apart from this, interval
quantities may also be included in computation based
on other uncertainty models. Interval probabilities
emerge from the consideration of a set of plausible
probabilistic models in order to find the lower and
upper probability bounds. This procedure is
particularly convenient for sensitivity analysis, a very
efficient tool to identify important design parameters
and to address the performance level of real
engineering systems in the context of optimisation
procedures. Thus, whenever some probabilistic
information is available, a mixed approach gathering
imprecise information in the form of intervals
may be explored. Vicig & Seidenfeld (2012) discuss
the idea of replacing one exact probability value
by introducing an indecision interval with two
different exact one-sided values as endpoints. As the
solution of practical problems in geotechnical
engineering often requires judgement based on limited
information, the assumption of interval-valued
parameters in a mixed approach that also admits
probabilistic information complies well with
the imprecision of the input scenario. For
demonstration, results are provided for a strip spread
foundation designed by the Eurocode 7 methodology
(EN 1997-1 2004), wherein the shear strength
parameters of the foundation soil are implemented
as intervals and then combined with other uncertain
parameters in the form of random variables
bounded on different levels of probability. A limit
state imprecise interval analysis for bearing capacity
is then presented in the format of a sensitivity
analysis wherein the safety margin is also
expressed in the interval format.


2 LIMIT STATE IMPRECISE INTERVAL 
ANALYSIS 


In the limit state imprecise interval analysis, some
selected parameters are separately or simultaneously
implemented as imprecise intervals and then
combined with other uncertain parameters in the
form of random variables bounded under dependence;
thereby, the random variables are bounded
by using imprecise interval quantities in order
to express different levels of probability. The
methodology for minimisation and maximisation,
which considers dependence, is
summarised next. The step-by-step optimisation
algorithm is supported by transformations derived
in compensative probability to express the
performance function in the standard normal space
of uncorrelated random and interval variables:


[Step 1] Express the distributions and the statistics
         of basic input variables;

[Step 2] Express the equivalent standard normal
         correlation matrix;

[Step 3] Express the outcome matrix from the
         Cholesky decomposition of the equivalent
         standard normal correlation matrix;

[Step 4] Express the set of random and interval
         variables vector in the standard normal
         space of uncorrelated random and interval
         variables;

[Step 5] Express the performance function and
         the limit state in the standard normal
         space of uncorrelated random and interval
         variables;

[Step 6] Express the set of constraints and, whenever
         required, the set of guess values for
         initialisation of the suboptimisation
         algorithm;

[Step 7] Select the suboptimisation algorithm
         properly and run the iterative process in
         a multistart approach;

[Step 8] Estimate the pair minimum-maximum
         and the corresponding coordinates in the
         standard normal space of uncorrelated
         as well as in the original space of correlated
         random and interval variables.


Thereby, the performance function in the standard
normal space of uncorrelated variables is expressed
by the equivalent transformations xi = f(yi)
and f(xi) = yi, derived directly from
Equation (1) in compensative probability:

$xi = F_{xi}^{-1}[\Phi(yi)]$, $\Phi^{-1}[F_{xi}(xi)] = yi$ (1)

where xi = variable in the original space; yi = variable
in the standard normal space; $F_{xi}$ = cumulative
nonnormal distribution function; $F_{xi}^{-1}$ = inverse
cumulative nonnormal distribution function;
$\Phi$ = cumulative standard normal distribution function;
and $\Phi^{-1}$ = inverse cumulative standard normal
distribution function.

The correlation assignment in Equation (2) is
also added to this standard normal space script:

$[cyi] = [Ch][uyi]$ (2)

where cyi = set of variables vector in the standard
normal space of correlated variables; uyi = set of
variables vector in the standard normal space of
uncorrelated variables; and Ch = outcome matrix
from the Cholesky decomposition of the equivalent
standard normal correlation matrix.
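
Equation (2) can be illustrated with a small Cholesky routine. The sketch below uses the correlation coefficients listed in Table 4 of the design example and, as a simplifying assumption, applies them directly as the equivalent standard normal correlation matrix, without the transformation F. The helper names are introduced here only for illustration.

```python
import math


def cholesky(matrix):
    """Lower-triangular factor Ch with Ch * Ch^T = matrix (positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def correlate(lower, uncorrelated):
    """Equation (2): map uncorrelated standard normal values to correlated ones."""
    n = len(uncorrelated)
    return [sum(lower[i][k] * uncorrelated[k] for k in range(n)) for i in range(n)]


# Correlation coefficients from Table 4 (variables x1, ..., x6).
correlation = [
    [1.0, 0.0, 0.5, 0.9, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 1.0, 0.5, 0.0, 0.0],
    [0.9, 0.0, 0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]

ch = cholesky(correlation)
print(correlate(ch, [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
```

A unit value in the first uncorrelated coordinate is spread over the first, third and fourth correlated coordinates, which mirrors the nonzero entries in the first column of the matrix.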

For normal and lognormal distributions,
the equivalent transformations that represent the
variables xi in the original space as a function
of the variables yi in the standard normal space are
detailed in Table 1 as
explicit expressions for the equivalent transformations
xi = f(yi) and f(xi) = yi, as previously derived
from Equation (1) in compensative probability.

A simplified approach may be considered for
the computation of the equivalent standard normal
correlation matrix, wherein the values of the
correlation coefficients between the variables are
transformed by an empirical relationship. In contrast, the
transformation F expressed in Table 2
is derived by an exact relationship and is used for the
computation of the resultant equivalent standard
normal correlation coefficient when the random
variables are normal and lognormal.

A test function such as the Rosenbrock function, which is
characterised by one challenging minimum point,
is very convenient for the purpose of testing, see
Figure 1.


Table 1. Transformations for representation of variables
xi in the original space as a function of variables
yi in the standard normal space.

Distributions   xi = f(yi)              f(xi) = yi
Normal          xi = μ + σ·yi           (xi - μ)/σ = yi
Lognormal       xi = exp(μl + σl·yi)    (ln(xi) - μl)/σl = yi

μ - mean value; σ - standard deviation; μl - log mean
value; σl - log standard deviation.
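
The Table 1 transformations can be written as four small functions. The sketch below additionally assumes the standard relations between the mean, the coefficient of variation and the log parameters of a lognormal variable, which are not stated in the table; function names are introduced for illustration, and the numerical inputs are the cohesion mean value and coefficient of variation from Table 3.

```python
import math


def log_parameters(mean, cv):
    """Log mean and log standard deviation of a lognormal variable (assumed relations)."""
    sigma_l = math.sqrt(math.log(1.0 + cv ** 2))
    mu_l = math.log(mean) - 0.5 * sigma_l ** 2
    return mu_l, sigma_l


def normal_to_original(y, mean, cv):
    return mean + mean * cv * y          # xi = mu + sigma * yi


def normal_to_standard(x, mean, cv):
    return (x - mean) / (mean * cv)      # (xi - mu) / sigma = yi


def lognormal_to_original(y, mean, cv):
    mu_l, sigma_l = log_parameters(mean, cv)
    return math.exp(mu_l + sigma_l * y)  # xi = exp(mu_l + sigma_l * yi)


def lognormal_to_standard(x, mean, cv):
    mu_l, sigma_l = log_parameters(mean, cv)
    return (math.log(x) - mu_l) / sigma_l


# Cohesion of the foundation soil: mean 14.00, coefficient of variation 0.40.
x_median = lognormal_to_original(0.0, 14.0, 0.40)
print(x_median, lognormal_to_standard(x_median, 14.0, 0.40))
```

The point yi = 0 maps to the median of the lognormal variable, which lies below its mean value.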


Table 2. Transformation F by exact relationship for
representation of the coefficient of correlation between
two standard normal variables xi' and xj' as ρxi'xj' = F·ρxixj.

Variable xi   Variable xj   Transformation F
Normal        Lognormal

cv - coefficient of variation.
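
For a normal variable paired with a lognormal one, a well-known exact factor from the Nataf-type transformation is F = cv / sqrt(ln(1 + cv^2)), where cv is the coefficient of variation of the lognormal variable. The sketch below evaluates it under the assumption that this is the expression meant in Table 2; the function name is hypothetical.

```python
import math


def factor_normal_lognormal(cv_lognormal):
    """Exact correlation factor for a normal-lognormal pair (assumed form of F)."""
    return cv_lognormal / math.sqrt(math.log(1.0 + cv_lognormal ** 2))


# Example: x1 is normal, x3 (friction angle) is lognormal with cv = 0.10,
# and their correlation coefficient in Table 4 is 0.5.
rho_original = 0.5
rho_equivalent = factor_normal_lognormal(0.10) * rho_original
print(factor_normal_lognormal(0.10), rho_equivalent)
```

For small coefficients of variation the factor is close to one, so the equivalent standard normal correlation differs only slightly from the original one.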


Figure 1. Rosenbrock test function overview,
x1 ∈ [-3,+3] and x2 ∈ [-3,+3].


In fact, a considerable number of suboptimisation
algorithms may be considered for the purpose,
but both local and global engines may fail under
certain circumstances. In particular, conjugate gradient
and quasi-Newton optimisation algorithms in
a multistart approach are successful in the search
for the minimum point of the Rosenbrock test
function.
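
A multistart quasi-Newton search on the Rosenbrock function can be sketched as follows. This is a minimal hand-written BFGS with a backtracking line search, not the implementation used by the authors; the grid of starting points on [-3, +3] x [-3, +3] and all tolerances are assumptions chosen for illustration.

```python
import math


def rosenbrock(x):
    """Rosenbrock test function with its single minimum f(1, 1) = 0."""
    return (1.0 - x[0]) ** 2 + 100.0 * (x[1] - x[0] ** 2) ** 2


def gradient(x):
    """Analytical gradient of the Rosenbrock function."""
    return [-2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0] ** 2),
            200.0 * (x[1] - x[0] ** 2)]


def bfgs(start, max_iter=2000, tol=1e-8):
    """Quasi-Newton (BFGS) descent with a backtracking (Armijo) line search."""
    x, g = list(start), gradient(start)
    h = [[1.0, 0.0], [0.0, 1.0]]  # approximation of the inverse Hessian
    for _ in range(max_iter):
        if math.hypot(g[0], g[1]) < tol:
            break
        d = [-(h[0][0] * g[0] + h[0][1] * g[1]), -(h[1][0] * g[0] + h[1][1] * g[1])]
        slope = d[0] * g[0] + d[1] * g[1]
        if slope >= 0.0:  # safeguard: restart from steepest descent
            h = [[1.0, 0.0], [0.0, 1.0]]
            d, slope = [-g[0], -g[1]], -(g[0] ** 2 + g[1] ** 2)
        fx, step = rosenbrock(x), 1.0
        while step > 1e-16:
            trial = [x[0] + step * d[0], x[1] + step * d[1]]
            if rosenbrock(trial) <= fx + 1e-4 * step * slope:
                break
            step *= 0.5
        g_new = gradient(trial)
        s = [trial[0] - x[0], trial[1] - x[1]]
        y = [g_new[0] - g[0], g_new[1] - g[1]]
        sy = s[0] * y[0] + s[1] * y[1]
        if sy > 1e-10 * math.hypot(s[0], s[1]) * math.hypot(y[0], y[1]):
            # BFGS update of the inverse Hessian (curvature condition holds).
            hy = [h[0][0] * y[0] + h[0][1] * y[1], h[1][0] * y[0] + h[1][1] * y[1]]
            yhy, rho = y[0] * hy[0] + y[1] * hy[1], 1.0 / sy
            for i in range(2):
                for j in range(2):
                    h[i][j] += (-rho * (s[i] * hy[j] + hy[i] * s[j])
                                + (rho * rho * yhy + rho) * s[i] * s[j])
        x, g = trial, g_new
    return x, rosenbrock(x)


# Multistart: run the local search from a grid of starting points, keep the best.
starts = [(a, b) for a in (-3.0, 0.0, 3.0) for b in (-3.0, 0.0, 3.0)]
best_x, best_f = min((bfgs(s) for s in starts), key=lambda result: result[1])
print(best_x, best_f)
```

Keeping the best of several local runs is what protects the search against a single unfavourable starting point; a library optimiser could replace the hand-written routine without changing the multistart logic.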


3 DESIGN EXAMPLE 


The design example refers to the strip spread
foundation on a relatively homogeneous soil shown
in Figure 2, wherein the groundwater level is far from
the foundation base. Considering the vertical noneccentric
loading problem and the calculation model for bearing capacity,
the performance function may be described by the
simplified Equation (3):

$M = f(B, D, \gamma_s, c_f, \phi_f, \gamma_f, P, Q)$ (3)

where B is the foundation width; D is the soil height
above the foundation base; $\gamma_s$ is the unit weight of
the soil above the foundation base; $c_f$ is the cohesion
of the foundation soil; $\phi_f$ is the friction angle
of the foundation soil; $\gamma_f$ is the unit weight of the
foundation soil; P is the dead load; and Q is the
live load.


Figure 2. Strip spread foundation. 


Table 3. Summary description of basic input variables.

Basic input                      Mean      Coefficient
variables       Distributions    value     of variation

B (m)           Deterministic    1.30      0.00
D (m)           Deterministic    1.00      0.00
γs (kN/m³)      Normal           16.80     0.05
cf (kN/m²)      Lognormal        14.00     0.40
                Interval*        -         -
φf (°)          Lognormal        32.00     0.10
                Interval*        -         -
γf (kN/m³)      Normal           17.80     0.05
P (kN/m)        Normal           370.00    0.10
Q (kN/m)        Normal           70.00     0.25

*Cases cohesion interval scenario [0.00,35.00] and friction
angle interval scenario [25.00,35.00].


Table 4. Correlation coefficients between the basic
input variables.

Correlation matrix [ρxixj], i, j = 1, 2, ..., 6

        x1     x2     x3     x4     x5     x6
x1      1.0    0.0    0.5    0.9    0.0    0.0
x2      0.0    1.0    0.0    0.0    0.0    0.0
x3      0.5    0.0    1.0    0.5    0.0    0.0
x4      0.9    0.0    0.5    1.0    0.0    0.0
x5      0.0    0.0    0.0    0.0    1.0    0.0
x6      0.0    0.0    0.0    0.0    0.0    1.0

x1: γs; x2: cf; x3: φf; x4: γf; x5: P; x6: Q; ρ - coefficient of
correlation.


Table 3 summarises the description of basic
input variables, with different types of distributions.
The considered correlation coefficients between the
basic input variables are presented in Table 4.
The strip spread foundation is designed by the
Eurocode 7 methodology, Design Approach DA.2*.


4 RESULTS AND DISCUSSION 


A cautious estimate of the 95% reliable mean
value for a known coefficient of variation is considered
for each geotechnical characteristic value.
The summary description of reliability estimates
in interval scenario is provided in Table 5
for the cohesion and friction angle interval
scenarios, wherein results are obtained from 5e6
simulations or sampling points using Monte Carlo
simulation (MCS). The generalised
extreme value distribution is selected after a study
with the Kolmogorov-Smirnov, Anderson-Darling
and chi-squared goodness of fit tests. The
shear strength parameters of the foundation soil
are separately implemented as intervals and then
combined with other uncertain parameters in the
form of random variables under dependence. The
optimisation-based probability box structures for
the cohesion and friction angle interval scenarios
are drawn in Figure 3 and Figure 4
by using conjugate gradient and quasi-Newton
optimisation algorithms in a multistart approach.


Figure 3. Optimisation-based probability box structure
for the case cohesion interval scenario.


Table 5. Summary description of reliability estimates in interval scenario.

                          MCS           Reliability   FIT           Reliability
                          failure       index         failure       index
Case             Value    probability   βMCS          probability   βFIT

Cohesion          0.00    4.4845e-2     1.6970        4.4800e-2     1.6974
interval          5.00    4.5420e-3     2.6089        5.0000e-3     2.5127
scenario         10.00    2.0200e-4     3.5375        2.8887e-4     3.4419
                 11.00    8.9800e-5     3.7461        1.4933e-4     3.6165
                 12.00    4.1800e-5     3.9338        7.6973e-5     3.7846
                 13.00    1.9800e-5     4.1098        3.8507e-5     3.9535
                 14.00    7.2000e-6     4.3377        1.8511e-5     4.1253
                 15.00    3.2000e-6     4.5127        9.1553e-6     4.2846
                 20.00    ≈0            ≈∞            1.8173e-7     5.0872
                 25.00    ≈0            ≈∞            2.5786e-9     5.8420
                 30.00    ≈0            ≈∞            2.9561e-11    6.5459
                 35.00    ≈0            ≈∞            3.0374e-13    7.1988

Friction angle   25.00    2.2054e-3     2.8472        1.8000e-3     2.9189
interval         25.50    5.7560e-4     3.2507        3.8867e-4     3.3607
scenario         26.00    1.3200e-4     3.6483        6.0447e-5     3.8443
                 26.50    1.7600e-5     4.1369        5.9806e-6     4.3783
                 27.00    1.8000e-6     4.6332        3.3652e-7     4.9690
                 27.50    2.0000e-7     5.0690        8.9437e-9     5.6313
                 30.00    ≈0            ≈∞            ≈0            ≈∞
                 35.00    ≈0            ≈∞            ≈0            ≈∞

MCS results from 5e6 simulations in interval scenario.
FIT results from 5e6 sampling points for the generalised extreme value distribution.


Figure 4. Optimisation-based probability box structure
for the case friction angle interval scenario.


The target reliability index of 3.8 is considered in
the determination of the cohesion and friction
angle values which satisfy safety in every interval
scenario by simulation and distribution fitting.
A practically unbounded (immeasurable) reliability is further
noticed in both cases. A comparative analysis of the
results from both methodologies shows appreciable
differences whenever the cohesion values
approach 15.0 kN/m² and the friction angle
values approach 27.5°, noting that for the
corresponding levels of reliability the Monte Carlo
simulation failure probability error is estimated to be
higher than 10%. Moreover, it is clearly shown that
small variations in the friction angle input are very
influential, in that the median and the imprecise
lower bound of probability correspond to a practically
unbounded reliability, considering that satisfactory
levels of reliability estimates are separately
attained for a cohesion of about 12.0 kN/m² and
for a friction angle of about 26.0°.
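
The link between the failure probabilities and the reliability indices in Table 5, and the size of the Monte Carlo sampling error, can be checked with the Python standard library. The sketch below assumes the usual relation between the reliability index and the failure probability through the standard normal distribution, and the usual coefficient of variation of a crude Monte Carlo estimator; both are standard results rather than formulas quoted from the paper, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def reliability_index(failure_probability):
    """Reliability index as minus the standard normal quantile of the failure probability."""
    return -NormalDist().inv_cdf(failure_probability)


def mcs_error(failure_probability, simulations):
    """Coefficient of variation of a crude Monte Carlo failure probability estimate."""
    return math.sqrt((1.0 - failure_probability) / (simulations * failure_probability))


n = 5_000_000  # number of simulations reported for Table 5

# Cohesion scenario, MCS column: values 0.00, 12.00 and 15.00 kN/m2.
for pf in (4.4845e-2, 4.1800e-5, 3.2000e-6):
    print(pf, round(reliability_index(pf), 4), round(mcs_error(pf, n), 3))
```

With 5e6 simulations the estimated error stays below 10% for a cohesion of 12.0 kN/m² but reaches about 25% at 15.0 kN/m², which is consistent with the remark on the Monte Carlo error above.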

Regarding the graphs in Figure 3 and Figure 4,
an appropriate representation of the minimum and
maximum branches requires respectively
351 or 201 discretisation points uniformly
distributed on the interval [0.0,35.0] kN/m²
or [25.0,35.0]°. The stairstep median trend is
drawn for comparison from a moderate number of
discretisation points uniformly distributed on the
intervals. The optimisation-based probability box
structures are constructed relative to the
median empirical cumulative distribution function,
noting that the empirical cumulative distribution
functions corresponding to the collection of
minimum and maximum values are sketched at a
central credibility level 3σ. From observation of
Figure 3 and Figure 4, it is noted that the linear
and nonlinear trends shown by the graphs express
the influence of the parameters cohesion and friction
angle on the calculation model for bearing
capacity. The position of the median trend, which
separates the fifty percent chance cases, is further
sketched on the safe side. Regarding the limit state,
the probability of no failure across the cohesion
and friction angle interval scenarios at a central
credibility level 3σ is respectively circa 50% or 70%,
see the probability level for a positive outcome on
the minimum branch of the graphs.

The limit state charts for safety assessment are
shown in Figure 5 and Figure 6, respectively for
the cohesion and friction angle interval scenarios
considered separately, in a sensitivity
analysis wherein the random variables are bounded
on different levels of probability. Distinct levels of
credibility are used to sketch the lines which express
the limit state bounds. It is possible to search for a
credibility level which ensures no failure regardless
of the parameter value on the horizontal axis, see
the safe level marked by the arrow in Figure 6 against
the unsafe level marked by the arrow in Figure 5.
Conversely, it is possible to find the threshold parameter
which ensures no failure for a given credibility level
and then to proceed
with proper ground investigation and testing or
improvement, see in both charts the circles at the
crossing of the zero limit state boundary and the 0.9900
credibility level line. In decision making, this
approach may be extended by numerical
analysis to high dimensional cases with several
indecision variables considered simultaneously,
see Figure 7 and Figure 8.

According to the results in Table 5, the threshold values that satisfy the target reliability index of 3.8 are about 12.0 kN/m² for the cohesion interval scenario and about 26.0° for the friction angle interval scenario. Given the vagueness of the indecision interval, this imprecise approach is interpreted together with a complementary limit state imprecise interval analysis for sensitivity, validation and decision. It is thereby concluded that satisfactory levels of reliability estimates are attained by the probability level of 0.9900 sketched on the limit state charts for safety assessment. Within this framework, Figure 7 and Figure 8 display the limit state three-dimensional joint view for safety assessment at a 0.9900 and a 0.9990 probability level, respectively. The three-dimensional zero limit state reference surface is crossed in one corner by the limit state lower bound surface constructed from a joint assumption on the values of the interval variables cohesion and friction angle. In Figure 7, unsafe coordinates are limited to a cohesion below a value between 7.0 kN/m² and 10.5 kN/m² or to a friction angle below a value between 29.0° and 30.0°, for the two worst combination cases corresponding to a friction angle of 25.0° or to a cohesion of 0.0 kN/m², respectively. In Figure 8, unsafe coordinates are limited to a cohesion below a value between 10.5 kN/m² and 14.0 kN/m² or to a friction angle below a value between 30.0° and 31.0°, for the same two worst combination cases.

Figure 5. Limit state chart to safety assessment for the case cohesion interval scenario.

Figure 6. Limit state chart to safety assessment for the case friction angle interval scenario.

Figure 7. Limit state three-dimensional joint view to safety assessment for a 0.9900 probability level.

Figure 8. Limit state three-dimensional joint view to safety assessment for a 0.9990 probability level.

Unrealistic model assumptions are revealed by the limit state charts for safety assessment when higher levels of probability are considered; see the cohesion interval scenario, wherein the limit state upper bounds are obtained from unrealistic coordinates. Therefore, a critical evaluation of the interval scenario may influence the safety-based decision on ground investigation and testing or improvement. In addition, the interval uncertainty carries a probabilistic meaning for every combinatory possibility; see the cohesion interval scenario, wherein unsatisfactory levels of reliability estimates are attained in a significant part of the cohesion interval. Another important issue is the position of the median trend, which separates the fifty percent chance cases and is here sketched on the safe side, noting that the linear and nonlinear trends shown by the charts express the influence of the shear strength parameters on the calculation model for bearing capacity.

In the multivariate case the considered probability level prescribes a credible region in the hyperspace. The imprecise interval analysis then combines a mixed set of probabilistic and nonprobabilistic interval models, wherein different bounding measures may be applied in order to find the limit state lower and upper bounds in different scenarios. The proposed step by step optimisation algorithm for minimisation and maximisation further considers any important dependence relationships, and whenever interval variables are involved proper constraints are required; see the friction angle interval scenario. In fact, dependence is an important feature of geotechnical engineering practice, wherein design parameters carry a physical meaning.

The summary description of reliability estimates in the probabilistic scenario is finally provided in Table 6, wherein three methods are compared: the first order reliability method (FORM), the second order reliability method (SORM) and the Monte Carlo simulation (MCS), the latter from counting among 5e6 trials. Regardless of the type of function and within an acceptable margin of error, the global behaviour is approachable in every case.


Table 6. Summary description of model results for the reliability index and respective relative errors.

Reliability   Reliability   Reliability   β_FORM and β_MCS,   β_SORM and β_MCS,
index         index         index         relative error      relative error
β_FORM        β_SORM        β_MCS         (%)                 (%)
3.3716        3.4392        3.4388        -1.9542             0.0102

MCS results from 1e6 simulations.

Figure 9. Limit state function histogram in probabilistic scenario.
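The comparison in Table 6 can be checked with a few lines of arithmetic. The sketch below assumes that the relative error is taken with respect to the MCS reliability index and that the failure probability follows the usual relation P_f = Φ(−β); the function names are illustrative and not taken from the paper. Under these assumptions the FORM error reproduces the tabulated −1.9542%, while the SORM error recomputed from the rounded indices comes out near 0.012%, close to the tabulated 0.0102% (the difference is presumably due to rounding of the printed indices).

```python
import math

def failure_probability(beta):
    """P_f = Phi(-beta), evaluated with the complementary error function."""
    return 0.5 * math.erfc(beta / math.sqrt(2.0))

def relative_error_percent(beta, beta_ref):
    """Relative error of beta with respect to a reference index (assumed: MCS)."""
    return 100.0 * (beta - beta_ref) / beta_ref

# Reliability indices as printed in Table 6.
beta_form, beta_sorm, beta_mcs = 3.3716, 3.4392, 3.4388

for name, beta in (("FORM", beta_form), ("SORM", beta_sorm), ("MCS", beta_mcs)):
    print(name, beta, failure_probability(beta),
          relative_error_percent(beta, beta_mcs))

# Failure probability implied by the 3.8 target reliability index.
print("target", failure_probability(3.8))
```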


A limit state function histogram in the probabilistic scenario is further presented in Figure 9. The behaviour of the nonlinear bearing capacity model is illustrated there, noting that the limit state function shows a nonnormal distribution under the probabilistic scenario characterised by the uncertain parameters and dependencies formerly described. Thorough observation reveals a small amount of data to the left of the zero limit state boundary, on the unsafe side. Failure may therefore be considered a quite rare event among the considered 5e5 trials, which show a nonsymmetric, right-skewed arrangement with no single centre.


5 CONCLUSION 


The capabilities of nontraditional models for engineering computation under uncertainty have been under continuous review. Practical experience suggests that important dependence relationships should be considered in the proposed step by step optimisation algorithm for minimisation and maximisation. For demonstration, results are provided for the analysis of a strip spread foundation designed by the Eurocode 7 methodology, and precise versus imprecise probabilistic approaches are discussed. The limit state imprecise probabilistic analysis is interpreted together with a limit state imprecise interval analysis for bearing capacity. This sensitivity analysis may reveal unrealistic model assumptions and may provide meaningful results for safety-based decisions on ground investigation and testing or improvement. Thus, the limit state charts for safety assessment are separately sketched for the cohesion and friction angle interval scenarios, wherein the random variables are bounded at different levels of probability.

The extension to high dimensional cases is also considered through a limit state three-dimensional joint view for safety assessment, taking the interval variables cohesion and friction angle simultaneously. In the multivariate case the considered probability level prescribes a credible region in the hyperspace; the imprecise interval analysis then combines a mixed set of probabilistic and nonprobabilistic interval models, wherein different bounding measures may be applied in order to find the limit state lower and upper bounds in different scenarios. The Eurocode 7 partial factor design is thereby discussed on the basis of distinct levels of credibility. In this particular case study it is clearly shown that small variations in the friction angle input are very influential, in that the median and the imprecise lower bound of probability correspond to a boundless immeasurable reliability. This conclusion is also noticeable from the limit state chart for safety assessment for the friction angle interval scenario.


ABSTRACT: The underlying idea of imprecise probability theory consists in modelling an imprecise probability distribution by a set of candidate probability distributions which are derived from the available data. The family is represented through a probability bounding approach applied to specify the lower and upper bounds of the imprecise probability distribution. A number of set-based uncertainty models derived from the probability bounding approach have been considered, namely the probability box structure. Designed from different approaches, probability boxes may differ meaningfully from each other. A parametric approach may involve distributions with interval parameters or an envelope of competing probabilistic models. From a search among the candidate cumulative distribution functions, the envelope of competing probabilistic models is expressed by a probability box function. In this way, a procedure for construction of a probability box structure by optimisation is advanced. Different dependencies may lead to quantitatively varied results, so that a single scalar measure such as a correlation coefficient may not be able to capture the complexity of the dependence model. Thus, the effects of correlation on the probability box structure may be comparatively considered. The technology is demonstrated on a synthetic exercise and on a design example concerning a strip spread foundation designed by the Eurocode 7.


1 INTRODUCTION

A probability bounding approach is hereafter applied to specify the lower and upper bounds of one imprecise probability distribution, see the review of Vicig & Seidenfeld (2012). By principle, variational correlation measures are insufficient to explore the possible nonlinear dependencies between variables in the form of a general relationship. In fact, the standard correlation concept is associated to a directional variation of paired random variables in that the word correlation is often reserved to be used only with random variables. In this way, a procedure for construction of a probability box structure by optimisation is advanced. In particular, the proposed procedure is aimed to consider how interval variables contained in a set between two endpoints may be related to other random variables characterised by a probability distribution. Regarding a sensitivity analysis, the effects of correlation on the probability box structure are comparatively demonstrated on a synthetic exercise, see the challenge problems in Oberkampf et al. (2004), and on a design example referred to a strip spread foundation designed by the Eurocode 7 (EN 1997-1 2004).

Considered that the parameter a follows a uniform distribution on the interval [0.1,1.0] and the parameter b follows a uniform distribution on the interval [0.0,1.0], Figure 1 represents comparatively the Kendall correlation pattern at 0.7 correlation coefficient for Archimedean parameterised copula families as Clayton or Frank or Gumbel, defined directly rather than being defined constructively from multivariate distributions. Considered the Pearson correlation pattern at 0.9 correlation coefficient, the equivalent Kendall rank correlation coefficient of 0.7 is determined at first to find the correspondent copula parameter α to launch the experiments accordingly by using a statistical toolbox. Afterwards, the Kendall correlation pattern at 0.7 correlation coefficient may be compared to the Pearson correlation pattern at 0.9 correlation coefficient, for the purpose see Figure 2. These approaches represent only a measure of the strength of the linear relationship between the two variables. It is further noted that the search for minimum and maximum values may reveal differences when compared the correlatedness and uncorrelatedness scenarios, particularly in the case of a higher correlation coefficient wherein the designed correlation pattern may be displayed as sharp at the endpoints of the intervals.

Figure 1. Kendall correlation pattern at 0.7 correlation coefficient, Clayton Copula and Frank Copula and Gumbel Copula.

Figure 2. Pearson correlation pattern at 0.9 correlation coefficient.


2 OPTIMISATION-BASED SYNTHETIC EXERCISE

The step by step optimisation algorithm is supported by transformations derived in compensative probability to express the performance function in the standard normal space of uncorrelated random and interval variables. Afterwards, the pair minimum-maximum is estimated at a given central credibility level for every parametric combination across the indecision interval on the construction of the optimisation-based probability box structure:

[Step 1] Express the distributions and the statistics of basic input variables;
[Step 2] Express the equivalent standard normal correlation matrix;
[Step 3] Express the outcome matrix from the Cholesky decomposition of the equivalent standard normal correlation matrix;
[Step 4] Express the set of random and interval variables vector in the standard normal space of uncorrelated random and interval variables;
[Step 5] Express the performance function and the limit state in the standard normal space of uncorrelated random and interval variables;
[Step 6] Express the set of constraints and whenever required the set of guess values for initialisation of the suboptimisation algorithm;
[Step 7] Select the suboptimisation algorithm properly and run the iterative process in a multistart approach;
[Step 8] Estimate the pair minimum-maximum and the corresponding coordinates in the standard normal space of uncorrelated as well as in the original space of correlated random and interval variables.

The performance function in the standard normal space of uncorrelated random and interval variables is expressed by the equivalent transformations xi = f(yi) and f(xi) = yi given by Equation (1) in compensative probability:

$$x_i = F_{x_i}^{-1}\left[\Phi(y_i)\right] \;\Leftrightarrow\; \Phi^{-1}\left[F_{x_i}(x_i)\right] = y_i \qquad (1)$$

where $x_i$ = variable in the original space; $y_i$ = variable in the standard normal space; $F_{x_i}$ = cumulative nonnormal distribution function; $F_{x_i}^{-1}$ = inverse cumulative nonnormal distribution function; $\Phi$ = cumulative standard normal distribution function; and $\Phi^{-1}$ = inverse cumulative standard normal distribution function.

The correlation assignment in Equation (2) is also added to this standard normal space script:

$$[cy_i^*] = [Ch]\,[uy_i^*] \qquad (2)$$

where $cy_i^*$ = set of variables vector in the standard normal space of correlated variables; $uy_i^*$ = set of variables vector in the standard normal space of uncorrelated variables; and $Ch$ = outcome matrix from the Cholesky decomposition of the equivalent standard normal correlation matrix.

A considerable number of suboptimisation algorithms may be considered for the purpose, but both local and global engines may fail under certain circumstances. A test function such as Peaks, which is characterised by six minimum-maximum points, is very convenient for testing. In particular, conjugate gradient and quasi-Newton optimisation algorithms in a multistart approach are successful in the search for the six minimum-maximum points of the Peaks test function.
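The multistart idea can be illustrated on the Peaks surface with a deliberately simple local engine. The sketch below is not the conjugate gradient or quasi-Newton scheme used in the paper: it swaps in a plain projected gradient descent/ascent with a finite-difference gradient, started from a coarse grid of points, and keeps the best local result. The Peaks expression is the widely used form popularised by MATLAB; the step size, iteration count, start grid and function names are assumptions chosen for illustration.

```python
import math

def peaks(x, y):
    """Peaks test surface (the commonly used MATLAB form)."""
    return (3.0 * (1.0 - x) ** 2 * math.exp(-x * x - (y + 1.0) ** 2)
            - 10.0 * (x / 5.0 - x ** 3 - y ** 5) * math.exp(-x * x - y * y)
            - math.exp(-(x + 1.0) ** 2 - y * y) / 3.0)

def gradient(f, x, y, h=1e-6):
    """Central-difference approximation of the gradient."""
    return ((f(x + h, y) - f(x - h, y)) / (2.0 * h),
            (f(x, y + h) - f(x, y - h)) / (2.0 * h))

def local_search(f, x, y, sign, step=0.01, iters=1000, lo=-3.0, hi=3.0):
    """Projected gradient descent (sign = -1) or ascent (sign = +1) in a box."""
    for _ in range(iters):
        gx, gy = gradient(f, x, y)
        x = min(hi, max(lo, x + sign * step * gx))
        y = min(hi, max(lo, y + sign * step * gy))
    return f(x, y), x, y

def multistart(f, sign, starts):
    """Run a local search from every start point and keep the best result."""
    results = [local_search(f, x0, y0, sign) for x0, y0 in starts]
    return min(results) if sign < 0 else max(results)

grid = [-3.0, -1.5, 0.0, 1.5, 3.0]
starts = [(x0, y0) for x0 in grid for y0 in grid]
best_min = multistart(peaks, -1, starts)
best_max = multistart(peaks, +1, starts)
print("lowest minimum (value, x, y):", best_min)
print("highest maximum (value, x, y):", best_max)
```

Starts that sit on the flat outskirts of the surface barely move, which is why several starting points are needed before the deepest valley and the highest peak are both found.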

For demonstration, the proposed procedure is then applied to the synthetic exercise expressed by Equation (3), using conjugate gradient and quasi-Newton optimisation algorithms in a multistart approach:

$$y = (a + b)^{a} \qquad (3)$$

where the parameter a belongs to the interval [0.1,1.0] and the parameter b follows a uniform distribution on the interval [0.0,1.0].

Different degrees of correlation are considered to show the effect of the parameter b on the interval for the parameter a. A sequence of transformations is exemplified next for the optimisation procedure by Equations (5) and (7), developed respectively for the case of minimisation at a 0.1 and at a 0.9 correlation coefficient, at central credibility level 5σ and a = 1, wherein Equations (4) and (6) present the corresponding equivalent standard normal correlation matrix R and the outcome matrix Ch from the Cholesky decomposition of the equivalent standard normal correlation matrix:

$$R = \begin{bmatrix} 1.0 & 0.1 \\ 0.1 & 1.0 \end{bmatrix} \quad \text{and} \quad Ch = \begin{bmatrix} 1.0000 & 0.0000 \\ 0.1000 & 0.9950 \end{bmatrix} \qquad (4)$$

Given

$$L(ya, yb) = \left( F_u^{-1}\left[\Phi(1.0000 \cdot ya + 0.0000 \cdot yb)\right] + F_u^{-1}\left[\Phi(0.1000 \cdot ya + 0.9950 \cdot yb)\right] \right)^{F_u^{-1}\left[\Phi(1.0000 \cdot ya + 0.0000 \cdot yb)\right]}$$

$$\begin{aligned}
& -5 \le ya \le 5 \\
& -5 \le yb \le 5 \\
& a = F_u^{-1}\left[\Phi(1.0000 \cdot ya + 0.0000 \cdot yb)\right] = 1 \\
& M = \mathrm{Minimise}(L, ya, yb) \\
& ya = 5 \\
& yb = -5 \\
& a = F_u^{-1}\left[\Phi(1.0000 \cdot ya + 0.0000 \cdot yb)\right] = 1 \\
& b = F_u^{-1}\left[\Phi(0.1000 \cdot ya + 0.9950 \cdot yb)\right] = 3.8206\mathrm{e}{-6} \\
& L(ya, yb) = L(a, b) = 1.0000
\end{aligned} \qquad (5)$$

$$R = \begin{bmatrix} 1.0 & 0.9 \\ 0.9 & 1.0 \end{bmatrix} \quad \text{and} \quad Ch = \begin{bmatrix} 1.0000 & 0.0000 \\ 0.9000 & 0.4359 \end{bmatrix} \qquad (6)$$

Given

$$L(ya, yb) = \left( F_u^{-1}\left[\Phi(1.0000 \cdot ya + 0.0000 \cdot yb)\right] + F_u^{-1}\left[\Phi(0.9000 \cdot ya + 0.4359 \cdot yb)\right] \right)^{F_u^{-1}\left[\Phi(1.0000 \cdot ya + 0.0000 \cdot yb)\right]}$$

$$\begin{aligned}
& -5 \le ya \le 5 \\
& -5 \le yb \le 5 \\
& a = F_u^{-1}\left[\Phi(1.0000 \cdot ya + 0.0000 \cdot yb)\right] = 1 \\
& M = \mathrm{Minimise}(L, ya, yb) \\
& ya = 5 \\
& yb = -5 \\
& a = F_u^{-1}\left[\Phi(1.0000 \cdot ya + 0.0000 \cdot yb)\right] = 1 \\
& b = F_u^{-1}\left[\Phi(0.9000 \cdot ya + 0.4359 \cdot yb)\right] = 0.9898 \\
& L(ya, yb) = L(a, b) = 1.9898
\end{aligned} \qquad (7)$$

where ya = corresponding parameter a coordinates in the standard normal space; yb = corresponding parameter b coordinates in the standard normal space; $F_u^{-1}$ = inverse cumulative uniform distribution function; and $\Phi$ = cumulative standard normal distribution function; σ denotes a unitary standard deviation in the standard normal space.
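The numbers in Equations (4) to (7) can be retraced with the standard library alone. The sketch below reads Equation (3) as y = (a + b)^a, the form of the Oberkampf et al. (2004) challenge problem, and takes a as uniform on [0.1,1.0] and b as uniform on [0.0,1.0] for the purpose of the transformation; the helper names are illustrative. It evaluates the limit state at the reported optimum ya = 5, yb = −5 rather than running the optimiser itself.

```python
import math

def std_normal_cdf(z):
    """Cumulative standard normal distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def uniform_inverse_cdf(p, lo, hi):
    """Inverse cumulative uniform distribution function on [lo, hi]."""
    return lo + (hi - lo) * p

def cholesky_2x2(rho):
    """Lower Cholesky factor of the correlation matrix [[1, rho], [rho, 1]]."""
    return ((1.0, 0.0), (rho, math.sqrt(1.0 - rho * rho)))

def limit_state(ya, yb, rho):
    """Correlate (ya, yb), map to the original space, evaluate (a + b)^a."""
    ch = cholesky_2x2(rho)
    za = ch[0][0] * ya + ch[0][1] * yb
    zb = ch[1][0] * ya + ch[1][1] * yb
    a = uniform_inverse_cdf(std_normal_cdf(za), 0.1, 1.0)
    b = uniform_inverse_cdf(std_normal_cdf(zb), 0.0, 1.0)
    return (a + b) ** a, a, b

for rho in (0.1, 0.9):
    value, a, b = limit_state(5.0, -5.0, rho)
    print("rho =", rho, "Ch =", cholesky_2x2(rho))
    print("  a =", a, " b =", b, " L =", value)
```

With a pinned at its upper endpoint, a higher correlation drags b upwards even when yb sits at its lower bound, which is why the minimum rises from about 1.0000 to about 1.9898.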

The following graphs are then drawn at 0.1 and at 0.9 correlation coefficient. Figure 3 and Figure 5 present the minimum, mean and maximum trends across the interval [0.1,1.0] for parameter a. Figure 4 and Figure 6 present the corresponding optimisation-based probability box structure. For appropriate representation of every sketched line, 181 discretisation points uniformly distributed on the interval [0.1,1.0] are selected for parameter a.
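How a probability box emerges from the discretised interval can be sketched for the simplest situation. The example below is not the paper's correlated case: it assumes that b is independent of a, reads Equation (3) as y = (a + b)^a, and uses the fact that y increases with b for fixed a, so that the p-quantile of y given a is (a + p)^a. The envelope over the 181 grid points then gives the lower and upper quantile bounds at each probability level. Names are illustrative.

```python
import math

# 181 discretisation points uniformly distributed on [0.1, 1.0].
A_GRID = [0.1 + 0.9 * i / 180 for i in range(181)]

def quantile_given_a(a, p):
    """p-quantile of y = (a + b)^a for b uniform on [0, 1], independent of a."""
    return (a + p) ** a

def pbox_bounds(p):
    """Envelope of the p-quantile over the interval for a: (lower, upper)."""
    values = [quantile_given_a(a, p) for a in A_GRID]
    return min(values), max(values)

def argmin_a(p):
    """Grid point of a at which the p-quantile is lowest."""
    return min(A_GRID, key=lambda a: quantile_given_a(a, p))

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    lower, upper = pbox_bounds(p)
    print(p, round(lower, 4), round(upper, 4))

print("a at the lowest minimum (p = 0):", round(argmin_a(0.0), 3))
```

Under these assumptions the lowest minimum falls near a = 1/e, inside the interval rather than at an endpoint, and the highest maximum falls at a = 1.0.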

The optimisation-based probability box structure is constructed around the mean empirical cumulative distribution function, noting that the adjacent empirical functions corresponding to the collection of minimum and maximum values are sketched at a central credibility level of 5σ. From Figure 3 and Figure 5 it is noted that the lowest minimum occurs in the inner part of the interval [0.1,1.0] for parameter a, while the highest maximum occurs at the coordinate 1.0. From Figure 4 and Figure 6 it is noted that the major probability lines gradually converge as the correlation coefficient increases, with the exception of the zone near the zero probability level on the minimum branch of the optimisation-based probability box structure at 0.9 correlation coefficient. This particular feature is observable because the lowest minimum occurs in the inner part of the interval [0.1,1.0] for parameter a.

Figure 3. Trends for the synthetic exercise at 0.1 correlation coefficient.

Figure 4. Optimisation-based probability box structure for the synthetic exercise at 0.1 correlation coefficient.

Figure 5. Trends for the synthetic exercise at 0.9 correlation coefficient.

Figure 6. Optimisation-based probability box structure for the synthetic exercise at 0.9 correlation coefficient.


3 DESIGN EXAMPLE 


The design example concerns the strip spread foundation on a relatively homogeneous soil shown in Figure 7, wherein the groundwater level is far from the foundation. Considering the vertical noneccentric loading problem and the calculation model for bearing capacity, the performance function may be described by the simplified Equation (8):

$$M = f(B, D, \gamma_s, c_f, \varphi_f, \gamma_f, P, Q) \qquad (8)$$

where B is the foundation width; D is the soil height above the foundation base; $\gamma_s$ is the unit weight of the soil above the foundation base; $c_f$ is the cohesion of the foundation soil; $\varphi_f$ is the friction angle of the foundation soil; $\gamma_f$ is the unit weight of the foundation soil; P is the dead load; and Q is the live load.
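Equation (8) leaves the function f unspecified. One common choice for a vertically and centrally loaded strip footing, used here purely as an assumption, is the drained bearing resistance of EN 1997-1 Annex D with shape, load inclination and base inclination factors all equal to one, set against the sum of dead and live load. The sketch below evaluates that margin per metre run at the mean values of Table 1; the function names are illustrative and the paper's own model may differ.

```python
import math

def bearing_factors(phi_deg):
    """Bearing capacity factors Nc, Nq, Ngamma (EN 1997-1 Annex D form).
    Valid for phi > 0; the friction angles used here lie in [25, 35] degrees."""
    phi = math.radians(phi_deg)
    nq = math.exp(math.pi * math.tan(phi)) * math.tan(math.pi / 4 + phi / 2) ** 2
    nc = (nq - 1.0) / math.tan(phi)
    ngamma = 2.0 * (nq - 1.0) * math.tan(phi)
    return nc, nq, ngamma

def performance(B, D, gamma_s, c_f, phi_f, gamma_f, P, Q):
    """Assumed margin M = resistance - load for a strip footing, in kN/m."""
    nc, nq, ngamma = bearing_factors(phi_f)
    q = gamma_s * D  # overburden pressure at the foundation base, kN/m2
    resistance = B * (c_f * nc + q * nq + 0.5 * gamma_f * B * ngamma)
    return resistance - (P + Q)

# Mean values from Table 1 (deterministic B and D).
m_mean = performance(B=1.30, D=1.00, gamma_s=16.80, c_f=14.00,
                     phi_f=32.00, gamma_f=17.80, P=370.00, Q=70.00)
print("bearing factors at 30 degrees:", bearing_factors(30.0))
print("margin at mean values (kN/m):", round(m_mean, 1))
```

A positive margin at the mean values says nothing about reliability on its own; it only fixes the sign convention, with M > 0 on the safe side.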

Table 1 summarises the description of basic input variables, with different types of distributions. The considered correlation coefficients between the basic input variables are presented in Table 2.

The strip spread foundation is designed by the Eurocode 7 methodology, Design Approach DA.2*, wherein partial factors are coupled with characteristic values.

A cautious estimate of the 95% reliable mean value for a known coefficient of variation is considered for each geotechnical parameter regarding the occurrence of a limit state. Scenarios wherein the parameters cohesion and friction angle of the foundation soil are separately implemented as intervals are further considered. The proposed procedures are then applied to the design example. The optimisation-based probability box structure is drawn by using conjugate gradient and quasi-Newton optimisation algorithms in a multistart approach for the cohesion interval scenario and for the friction angle interval scenario, represented respectively in Figure 8 and Figure 9. For the cohesion interval scenario, the minimum and maximum branches are represented by 351 discretisation points uniformly distributed on the interval [0.0,35.0] kN/m². For the friction angle interval scenario, the minimum and maximum branches are represented by 201 discretisation points uniformly distributed on the interval [25.0,35.0]°.

Figure 7. Strip spread foundation.

Table 1. Summary description of basic input variables.

Basic input                        Mean      Coefficient
variables       Distributions      value     of variation
B (m)           Deterministic      1.30      0.00
D (m)           Deterministic      1.00      0.00
γ_s (kN/m³)     Normal             16.80     0.05
c_f (kN/m²)     Lognormal          14.00     0.40
                Interval*          ---       ---
φ_f (°)         Lognormal          32.00     0.10
                Interval*          ---       ---
γ_f (kN/m³)     Normal             17.80     0.05
P (kN/m)        Normal             370.00    0.10
Q (kN/m)        Normal             70.00     0.25

*Cases cohesion interval scenario [0.00,35.00] and friction angle interval scenario [25.00,35.00].

Table 2. Correlation coefficients between the basic input variables.

Correlation matrix [ρ_xixj], i, j = 1, ..., 6

       x1     x2     x3     x4     x5     x6
x1     1.0    0.0    0.5    0.9    0.0    0.0
x2     0.0    1.0    0.0    0.0    0.0    0.0
x3     0.5    0.0    1.0    0.5    0.0    0.0
x4     0.9    0.0    0.5    1.0    0.0    0.0
x5     0.0    0.0    0.0    0.0    1.0    0.0
x6     0.0    0.0    0.0    0.0    0.0    1.0

x1 = γ_s; x2 = c_f; x3 = φ_f; x4 = γ_f; x5 = P; x6 = Q; ρ = coefficient of correlation.

The optimisation-based probability box structure is constructed around the median empirical cumulative distribution function, noting that the adjacent empirical functions corresponding to the collection of minimum and maximum values are sketched at a central credibility level of 3σ. The median trend is drawn, for comparison, from a moderate number of discretisation points uniformly distributed on the interval. From Figure 8 and Figure 9 it is noted that the linear and nonlinear trends shown by the graphs express the influence of the parameters cohesion and friction angle of the foundation soil on the calculation model for bearing capacity. The position of the median trend, which separates the fifty percent chance cases, is further sketched on the safe side. Regarding the limit state, the probability of no failure across the cohesion and friction angle interval scenarios at a central credibility level of 3σ may be determined from the probability level for a positive outcome on the minimum branch.

Figure 8. Optimisation-based probability box structure for the case cohesion interval scenario.

Figure 9. Optimisation-based probability box structure for the case friction angle interval scenario.


4 CONCLUSION 


A procedure for construction of a probability box structure by optimisation is advanced. Different dependencies may lead to quantitatively varied results and, as the degree of correlation may be unknown, a single scalar measure such as a correlation coefficient may not be able to capture the complexity of the dependence model. Thus, the effects of correlation on the probability box structure are comparatively demonstrated on a synthetic exercise and on a design example concerning a strip spread foundation designed by the Eurocode 7. The optimisation-based probability box structure opens a new path in the framework of engineering limit state design under dependence, allowing the failure analysis to be considered at a number of different central credibility levels.


ABSTRACT: This paper presents a reliability assessment model of a technical object with respect to catastrophic damage. The technical object adopted is a mechanical device in which jamming may occur between operating elements, and the diagnostic parameter is the operating time of the device, t. The distribution of the device operating time until jamming was determined on the basis of Yule's process as modified by Gercbach and Kordonski. Because of the simplifications adopted in the published record of that model, the process of deriving the final equations is set out here. This process, in a discrete system, depends on the system of equations typical of the accrual process of the diagnostic parameter. By transforming these equations, formulas were determined for the reliability of the technical object and for the density function of the object operating time until catastrophic damage in the form of jamming occurs.


1 INTRODUCTION 

A characteristic feature of mechanical devices is the presence of movable mating elements that form a system of kinematic pairs. Depending on the intensity with which the device is used and the conditions in which the technical object works, this may lead to wear and, in consequence, e.g. to jamming.

As the device operates, the clearances between mating elements, or the resultant clearances in kinematic sequences, change and deviate from their normal values, which harms the operation of the automatics. These deviations of clearances from normal values change the operating conditions of the technical object, which has a substantial impact on its operational reliability. Sometimes, disturbances in the automatics of the technical object, caused by the change in operating conditions together with the increased clearances, contribute to jamming, which is a kind of catastrophic damage (Baranowski & Małachowski 2015, Idziaszek & Grzesik 2014, Tomaszek et al. 2016).

When analysing the technical condition of a device, an essential issue is the choice of the leading (forecasting) parameter for the assessment under consideration. It therefore appears appropriate to adopt the number of working cycles completed by the device as the parameter by which the possibility of catastrophic damage (jamming) is evaluated. As the number of completed cycles increases, the device undergoes (destructive) wear processes, which considerably influence the quality of the device's usage conditions (Dhillon 1999, Tomaszek et al. 2013, Tomaszek et al. 2011, Wazny 2009).


DETERMINING THE DISTRIBUTION 
OF THE NUMBER OF THE DEVICE 
COMPLETED WORKING CYCLES 
UNTIL DAMAGE APPEARS (JAMMING) 


The number of cycles completed by the device was adopted as the (forecasting) diagnostic parameter by which the change in the technical condition of the device (wear) is determined. This parameter is a function of the speed at which the device completes working cycles and is defined by equation (1):

$$\text{number of completed cycles} = v\,t \qquad (1)$$

where v = speed of working cycles completed by the device; and t = device operating time.

As the number of cycles completed by the device grows, the effects of destructive processes, in the form of wear and clearances in kinematic chains, continue to increase. As the effects of component wear accumulate, the chance that the device will jam also increases. It may therefore be concluded that, with the increase in working cycles completed by the device, its technical condition changes and the chance of damage in the form of jamming increases. This suggests that certain models can be used to describe the risk of damage in the form of jamming.

To determine the distribution of the number of completed cycles until jamming occurs, a modified Yule's process was applied. The outline of this modification is provided in (Gercbach & Kordonski 1968) without a detailed derivation of the final results. Using the results of that work (Gercbach & Kordonski 1968) requires supplementing their presentation with the intermediate operations that lead to the final equations.

To define the regularities of this model it is convenient to use discrete values of the diagnostic parameter. The way of discretization is shown in Fig. 1, where $E_k$ = discrete values of the diagnostic parameter, defined as parameter value states; $\lambda\Delta t$ = probability of transition from $E_k$ to $E_{k+1}$ in a time interval of length $\Delta t$; $\lambda$ = intensity of the parameter's state change, described as

$$\lambda = \frac{P}{\Delta t} \qquad (2)$$

where P = probability of completion of one working cycle in a time interval of length $\Delta t$; q(t) = probability of interrupting the development of the process of increasing the diagnostic parameter, depending on the process state; $\mu$ = increment of the intensity of jamming occurrence along with the increase of device working cycles; $\mu_0$ = intensity of jamming for the initial state (for t = 0); h = average value of the increment of the diagnostic parameter in time $\Delta t$; and t = device operating time.

Figure 1. Way of discretization of a diagnostic parameter.

Let $P_k(t)$ signify the probability that, for a device operating time equal to t, the diagnostic parameter takes the value $E_k$ (k = 0, 1, 2, ...). For the adopted arrangements, using the postulates of the Poisson process, a system of equations characterizing the accrual process of the diagnostic parameter may be obtained (DeLurgio 1998, Pham 2006, Werbinska & Zajac 2015, Zio 2009).

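For $k = 0$:

$$ P_0(t+\Delta t) = (1-\lambda\Delta t)(1-\mu_0\Delta t)\,P_0(t) $$
$$ P_0(t+\Delta t) = (1-\mu_0\Delta t-\lambda\Delta t+\lambda\mu_0\Delta t^2)\,P_0(t). $$

After omitting the higher-order terms, the formula takes the
following form:

$$ P_0(t+\Delta t) = (1-(\lambda+\mu_0)\Delta t)\,P_0(t) \tag{3} $$

For $k = 1, 2, \ldots$:

$$ P_k(t+\Delta t) = (1-\lambda\Delta t)(1-(\mu_0+k\mu)\Delta t)\,P_k(t) + \lambda\Delta t\,(1-(\mu_0+k\mu)\Delta t)\,P_{k-1}(t); $$
$$ P_k(t+\Delta t) = (1-(\mu_0+k\mu)\Delta t-\lambda\Delta t+\lambda\Delta t\,(\mu_0+k\mu)\Delta t)\,P_k(t) + (\lambda\Delta t-\lambda\Delta t\,(\mu_0+k\mu)\Delta t)\,P_{k-1}(t). $$

Again, by omitting the higher-order terms, the above formula takes
the following form:

$$ P_k(t+\Delta t) = (1-(\lambda+\mu_0+k\mu)\Delta t)\,P_k(t) + \lambda\Delta t\,P_{k-1}(t). \tag{4} $$

Thus, we obtain the following system of equations:

$$ \begin{aligned}
P_0(t+\Delta t) &= (1-(\mu_0+\lambda)\Delta t)\,P_0(t) + O(\Delta t) \\
P_k(t+\Delta t) &= (1-(\mu_0+k\mu+\lambda)\Delta t)\,P_k(t) + \lambda\Delta t\,P_{k-1}(t) + O(\Delta t), \quad \text{for } k = 1, 2, \ldots
\end{aligned} \tag{5} $$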


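From the system of equations (5), after rearrangement and passing
to the limit $\Delta t \to 0$, the following system of equations is
obtained:

$$ \begin{aligned}
P_0'(t) &= -(\mu_0+\lambda)\,P_0(t) \\
P_k'(t) &= -(\mu_0+k\mu+\lambda)\,P_k(t) + \lambda\,P_{k-1}(t), \quad \text{for } k = 1, 2, \ldots
\end{aligned} \tag{6} $$

The initial condition for all of these equations may be written in
the following form:

$$ P_k(0) = \begin{cases} 1, & k = 0 \\ 0, & k \neq 0 \end{cases} \tag{7} $$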


The system of equations (6) is solved recursively.

Solution for $k = 0$:

$$ P_0'(t) = -(\mu_0+\lambda)\,P_0(t), $$
$$ \int \frac{P_0'(t)}{P_0(t)}\,dt = -\int (\mu_0+\lambda)\,dt. $$

Therefore,

$$ P_0(t) = C_0\,e^{-(\mu_0+\lambda)t} \tag{8} $$

For $t = 0$, $P_0(0) = 1$; thus, $C_0 = 1$.

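The solution for $k = 1, 2, 3, \ldots$ can be determined in the
following way. The differential equation takes the form:

$$ P_k'(t) = -(\mu_0+k\mu+\lambda)\,P_k(t) + \lambda\,P_{k-1}(t) \tag{9} $$

In this case, we look for the solution in the form:

$$ P_k(t) = C_k(t)\,e^{-(\mu_0+\lambda)t} \tag{10} $$

The derivative of relationship (10) assumes the following form:

$$ P_k'(t) = C_k'(t)\,e^{-(\mu_0+\lambda)t} + C_k(t)\,(-(\mu_0+\lambda))\,e^{-(\mu_0+\lambda)t}. \tag{11} $$

Substituting the above equation into equation (9), the following
formula is obtained:

$$ \begin{aligned}
C_k'(t)\,e^{-(\mu_0+\lambda)t} - (\mu_0+\lambda)\,C_k(t)\,e^{-(\mu_0+\lambda)t}
&= -(\mu_0+k\mu+\lambda)\,P_k(t) + \lambda\,P_{k-1}(t) \\
&= -(\mu_0+k\mu+\lambda)\,C_k(t)\,e^{-(\mu_0+\lambda)t} + \lambda\,C_{k-1}(t)\,e^{-(\mu_0+\lambda)t}.
\end{aligned} $$

Thus, we obtain the equation:

$$ C_k'(t) + k\mu\,C_k(t) = \lambda\,C_{k-1}(t) \tag{12} $$

Equation (12) for $k = 1$ is as follows:

$$ C_1'(t) + \mu\,C_1(t) = \lambda. \tag{13} $$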


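Differential equation (13) has the following general form:

$$ y' + P(x)\,y = Q(x), $$

whose solution is:

$$ y = e^{-\int P(x)\,dx}\left(\int Q(x)\,e^{\int P(x)\,dx}\,dx + C\right). \tag{14} $$

Using formula (14), with $C_1(0) = 0$ because $P_1(0) = 0$, the
solution of equation (13) may be written in the following form:

$$ \begin{aligned}
C_1(t) &= e^{-\mu t}\int_0^t \lambda\,e^{\mu\tau}\,d\tau
= e^{-\mu t}\left[\frac{\lambda}{\mu}\,e^{\mu\tau}\right]_0^t \\
&= \frac{\lambda}{\mu}\,e^{-\mu t}\left(e^{\mu t}-1\right)
= \frac{\lambda}{\mu} - \frac{\lambda}{\mu}\,e^{-\mu t}.
\end{aligned} \tag{15} $$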
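For $k = 2$, equation (12) assumes the form:

$$ C_2'(t) + 2\mu\,C_2(t) = \lambda\left(\frac{\lambda}{\mu} - \frac{\lambda}{\mu}\,e^{-\mu t}\right). \tag{16} $$

Solution of equation (16):

$$ \begin{aligned}
C_2(t) &= e^{-2\mu t}\int_0^t \frac{\lambda^2}{\mu}\left(1-e^{-\mu\tau}\right)e^{2\mu\tau}\,d\tau \\
&= e^{-2\mu t}\,\frac{\lambda^2}{\mu}\left[\frac{e^{2\mu\tau}}{2\mu} - \frac{e^{\mu\tau}}{\mu}\right]_0^t \\
&= e^{-2\mu t}\left(\frac{\lambda^2}{2\mu^2}\,e^{2\mu t} - \frac{\lambda^2}{\mu^2}\,e^{\mu t} + \frac{\lambda^2}{2\mu^2}\right) \\
&= \frac{\lambda^2}{2\mu^2} - \frac{\lambda^2}{\mu^2}\,e^{-\mu t} + \frac{\lambda^2}{2\mu^2}\,e^{-2\mu t}.
\end{aligned} \tag{17} $$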


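The equation describing function $C_2(t)$ can be converted to a
form that suggests the general form of this function:

$$ C_2(t) = \frac{1}{2}\left(\frac{\lambda}{\mu}\right)^2\left(1 - 2e^{-\mu t} + e^{-2\mu t}\right). $$

Thus:

$$ C_2(t) = \frac{1}{2}\left(\frac{\lambda}{\mu} - \frac{\lambda}{\mu}\,e^{-\mu t}\right)^2 \tag{18} $$

The form of equation (18) allows the general form of this function
to be predicted:

$$ C_k(t) = \frac{1}{k!}\left(\frac{\lambda}{\mu} - \frac{\lambda}{\mu}\,e^{-\mu t}\right)^k \tag{19} $$

Applying (19), the solution of equation (9) may be written in the
following form:

$$ P_k(t) = \frac{1}{k!}\left(\frac{\lambda}{\mu}\left(1-e^{-\mu t}\right)\right)^k e^{-(\mu_0+\lambda)t} \tag{20} $$

for $k = 1, 2, 3, \ldots$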
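The general form of $C_k(t)$ is only predicted from the cases $k = 1$ and $k = 2$, so a short check against recursion (12) may be useful. The lines below are a sketch of that check, assuming the general form $C_k(t) = \frac{1}{k!}\left(\frac{\lambda}{\mu}\right)^k\left(1-e^{-\mu t}\right)^k$:

```latex
\begin{aligned}
C_k'(t) &= \frac{1}{(k-1)!}\left(\frac{\lambda}{\mu}\right)^{k}
           \left(1-e^{-\mu t}\right)^{k-1}\mu\,e^{-\mu t}, \\
k\mu\,C_k(t) &= \frac{\mu}{(k-1)!}\left(\frac{\lambda}{\mu}\right)^{k}
           \left(1-e^{-\mu t}\right)^{k}, \\
C_k'(t) + k\mu\,C_k(t) &= \frac{\mu}{(k-1)!}\left(\frac{\lambda}{\mu}\right)^{k}
           \left(1-e^{-\mu t}\right)^{k-1}
           \left[e^{-\mu t} + \left(1-e^{-\mu t}\right)\right] \\
        &= \lambda\,\frac{1}{(k-1)!}\left(\frac{\lambda}{\mu}\right)^{k-1}
           \left(1-e^{-\mu t}\right)^{k-1}
         = \lambda\,C_{k-1}(t).
\end{aligned}
```

In addition, $C_k(0) = 0$ for $k \geq 1$, which agrees with the initial condition $P_k(0) = 0$ for $k \neq 0$.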
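Using equations (8) and (20) it is possible to determine the
reliability of the device. Therefore:

$$ R(t) = \sum_{k=0}^{\infty} P_k(t) = e^{-(\mu_0+\lambda)t}\sum_{k=0}^{\infty}\frac{1}{k!}\left(\frac{\lambda}{\mu}\left(1-e^{-\mu t}\right)\right)^k. \tag{21} $$

It should be noted that the following equation is true:

$$ \sum_{k=0}^{\infty}\frac{1}{k!}\left(\frac{\lambda}{\mu}\left(1-e^{-\mu t}\right)\right)^k = e^{\frac{\lambda}{\mu}\left(1-e^{-\mu t}\right)}. \tag{22} $$

Using equation (22), the formula for reliability takes the
following form:

$$ R(t) = e^{\frac{\lambda}{\mu}\left(1-e^{-\mu t}\right)}\,e^{-(\mu_0+\lambda)t}. $$

Thus,

$$ R(t) = e^{\frac{\lambda}{\mu}\left(1-e^{-\mu t}\right)-(\mu_0+\lambda)t} \tag{23} $$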


Based on the above equation, the probability of jamming of the
device within the number of cycles completed in time $t$ equals:

$$ Q(t) = 1 - e^{\frac{\lambda}{\mu}\left(1-e^{-\mu t}\right)-(\mu_0+\lambda)t} \tag{24} $$

The distribution of the number of working cycles completed by the
device until its jamming is as follows:

$$ f(t) = \frac{d}{dt}\,Q(t). $$

Thus,

$$ f(t) = \left(\mu_0+\lambda\left(1-e^{-\mu t}\right)\right)e^{\frac{\lambda}{\mu}\left(1-e^{-\mu t}\right)-(\mu_0+\lambda)t} \tag{25} $$

Relation (25) is the density function of the number of working
cycles completed by the device until catastrophic damage in the
form of jamming occurs. This distribution can be used to establish
the risk of a catastrophic failure in the form of jamming.
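The closed forms of this section can be cross-checked numerically. The sketch below is only an illustration: it evaluates $R(t)$, $f(t)$ and the state probabilities $P_k(t)$ as read from equations (20), (23) and (25), and sums the $P_k(t)$ to compare with $R(t)$. The function names and the parameter values are assumptions chosen for the example; they are not data from the paper.

```python
import math

def reliability(t, lam, mu0, mu):
    """R(t) = exp((lam/mu) * (1 - exp(-mu t)) - (mu0 + lam) t), equation (23)."""
    return math.exp(lam / mu * (1.0 - math.exp(-mu * t)) - (mu0 + lam) * t)

def jamming_intensity(t, lam, mu0, mu):
    """f(t) / R(t) = mu0 + lam * (1 - exp(-mu t))."""
    return mu0 + lam * (1.0 - math.exp(-mu * t))

def density(t, lam, mu0, mu):
    """f(t) = -dR/dt, equation (25)."""
    return jamming_intensity(t, lam, mu0, mu) * reliability(t, lam, mu0, mu)

def state_probability(k, t, lam, mu0, mu):
    """P_k(t) = (1/k!) * ((lam/mu)(1 - exp(-mu t)))**k * exp(-(mu0 + lam) t)."""
    a = lam / mu * (1.0 - math.exp(-mu * t))
    return a ** k / math.factorial(k) * math.exp(-(mu0 + lam) * t)

# Hypothetical parameter values, used only to exercise the formulas.
lam, mu0, mu, t = 2.0, 0.05, 0.5, 1.5

# Truncated series sum_k P_k(t); 60 terms are ample for these values.
series = sum(state_probability(k, t, lam, mu0, mu) for k in range(60))
print(series, reliability(t, lam, mu0, mu))
```

The two printed values should agree to rounding error, which is the content of identity (22).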


3 CALCULATION EXAMPLE 


Remarks on establishing the distribution parameters $\lambda$,
$\mu_0$ and $\mu$ are also in order. The parameter $\lambda$ is
evaluated from the number of working cycles completed by the
device. The calculation formula for parameter $\lambda$ is as
follows (Wazny 2015):


= 26) 


~ At* 


where $\Delta t^*$ = the duration of one device working cycle.

To determine $\mu_0^*$ and $\mu^*$, data from observation of the
device operating process are used: for $N$ devices we obtain the
list of the numbers of cycles completed by each device until
jamming appears, $t_1, t_2, \ldots, t_N$. These data are used to
build a histogram. The value $\mu_0^*$ is the ordinate of the
histogram at the point $t = 0$. The estimate $\mu^*$ is established
from the following expression:


$$ \frac{1}{\lambda^*}\left(\ln\frac{n(t)}{N} + \mu_0^*\,t\right) + t = \frac{1}{\mu^*}\left(1 - e^{-\mu^* t}\right) \tag{27} $$

where $n(t)$ = the number of efficient devices at time $t < t_N$;
and $t$ = device operating time (agreed value).

For an agreed value of $t$, the left side of equation (27) can be
evaluated from the data. The value $\mu^*$ is the one for which the
right side of equation (27) equals its left side. To estimate the
jamming intensity of the device $\nu(t)$, the following equation is
used:

$$ \nu(t) = \frac{f(t)}{R(t)} \tag{28} $$


Substituting equations (25) and (23) into formula (28), equation
(29) is obtained:

$$ \nu(t) = \mu_0 + \lambda\left(1 - e^{-\mu t}\right). \tag{29} $$

As follows from equation (29), the jamming intensity of the device
increases with the number of working cycles completed by the
device.
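Finding the value $\mu^*$ that balances the two sides of the estimating equation is a one-dimensional root search. The sketch below is a minimal illustration under an explicit assumption: the estimating equation is taken to be the one obtained by equating the empirical reliability $n(t)/N$ with $R(t)$ of equation (23), that is $(1/\lambda)(\ln(n(t)/N) + \mu_0 t) + t = (1/\mu)(1 - e^{-\mu t})$. The function names, the bracket and the tolerance are hypothetical choices.

```python
import math

def right_side(mu, t):
    """(1/mu) * (1 - exp(-mu t)); strictly decreasing in mu for t > 0."""
    return -math.expm1(-mu * t) / mu

def left_side(n_t, N, t, lam, mu0):
    """(1/lam) * (ln(n(t)/N) + mu0 t) + t, from ln R(t) with R(t) ~ n(t)/N."""
    return (math.log(n_t / N) + mu0 * t) / lam + t

def solve_mu(lhs, t, lo=1e-9, hi=1e3, tol=1e-12):
    """Bisection for mu such that right_side(mu, t) == lhs."""
    if not (right_side(hi, t) < lhs < right_side(lo, t)):
        raise ValueError("no root inside the bracket [lo, hi]")
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if right_side(mid, t) > lhs:
            lo = mid  # right side still too large, so mu must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Bisection is used here only because the right side is monotone in $\mu$, so a bracketed root is unique; any other scalar root finder would serve equally well.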

For numerical verification of the presented model, technical data
characterizing a hypothetical technical object were adopted.

Table 1. Calculation results.

z      Δt*      λ*        n(t)    N     μ0*    μ*
850    0,071    14,197    36      36    0      121037

Figure 2. Graphic form of reliability function R(t) (horizontal axis: t [min]).

Figure 3. Graphic form of density function f(t).

Figure 4. Graphic form of jamming intensity function ν(t).

The input data used for the calculations and the results obtained
are summarized in Table 1, where z is the number of working cycles
completed by the device.

For the above data the characteristics R(t), f(t) and ν(t) were
established.


4 FINAL REMARKS 


The value of a diagnostic parameter reflects the accumulated
impact of destructive processes such as surface wear, corrosion
and other factors. This accumulation of negative effects leads to
catastrophic damage, among others in the form of jamming, cracking
and similar events in the assemblies of technical objects
(devices).

The presented method of assessing device operation reliability
with respect to catastrophic damage in the form of jamming appears
to be justified and correct. The calculation example made it
possible to verify the developed model numerically and showed the
applicability of the method. Using parameters determined by the
described method, it is possible to present the characteristics of
the technical properties of the object under consideration. In
addition, the developed model can also be applied to estimate the
reliability of other technical objects with respect to sudden
damage originating from relaxation stimuli.


Extensions of the I&AB method for the reliability assessment
of the spent fuel pool of EPR


M. Bouissou 
EDF Lab Saclay, Palaiseau, France 


ABSTRACT: The I&AB (Initiator and All Barriers) method was first introduced at ESREL 2016, as 
an efficient means to calculate, thanks to closed form formulae, the reliability of a very large repairable 
system with dependencies among components. The mathematical support of I&AB is continuous time 
Markov chains, and therefore it cannot be used for modeling the spent fuel pool of a nuclear power plant, 
because for this system, there are two kinds of deterministic delays that must be taken into account: grace 
times (for example, after the total loss of cooling of the pool, it takes exactly 14 hours for the water to start 
boiling), and deterministic failures due to the limited capacity of water tanks. In the present paper, we 
extend the I&AB method to account for deterministic delays. We explain how we could apply this method 
in the case of the fuel pool of the EPR (European Pressurized Reactor) starting from a model in the form 
of a BDMP (Boolean Logic Driven Markov Process), and how results and computation times compare to 
a Monte Carlo simulation of the same BDMP. 


1 INTRODUCTION

The standard PSA method (based on fault tree linking) is not well
suited for the reliability assessment of the spent fuel pool of a
nuclear power plant, for several reasons. Firstly, the dynamics of
the phenomena to be modeled are relatively slow because of the
large amount of water available in the pool itself and in the
safety systems. It is thus not sufficient to look at what can
happen in only 24 hours after an initiator. Secondly, the fact that
components are repairable, and the existence of multiple standby
redundancies cannot be ignored.

EDF has developed several tools for creating and quantifying
dynamic models, better suited for this kind of system study. In
particular, BDMPs (Boolean logic Driven Markov Processes) are a
powerful modeling tool for the dependability analysis of dynamic
systems (Bouissou & Bon 2003). For more than ten years, they have
been used for assessing the reliability, availability, and safety
of complex reconfigurable systems. BDMPs have a graphical
representation close to fault trees, yet they specify (potentially
very large) CTMCs (Continuous Time Markov Chains). A BDMP model
with the same detail level as fault trees of a standard PSA would
not be quantifiable by analytical methods, even with classical
approximations. On the other hand, it would require too large
computation times with Monte Carlo simulation, because the
probability of reaching a too low level in the spent fuel pool is
very small.

For these reasons, we have developed a new approximate method for
the quantification of very large BDMPs, and more generally any
model able to generate minimal products containing one initiating
event and the failures of the barriers activated after it in order
to avoid the undesirable event. This is why the main foreseen
application domain is nuclear PSA, all the more so as existing PSA
models will be very easy to adapt to I&AB, merely by adding repair
rates to component data. In a PSA context, I&AB can be used to take
repairs into account instead of postulating that 24 hours after an
initiating event, either the undesirable event is unavoidable, or
the system is in a safe state.

The I&AB (Initiator and all barriers) main principles were
published in a paper at ESREL 2016 (Bouissou & Hernu 2016). In the
present paper, we give all analytical formulae of I&AB and of its
extension in the case of grace times and deterministic failures. We
also give some numerical application examples, comparing the I&AB
approximation to “exact” calculations performed on a dynamic model
via Monte Carlo simulation.


2 THE INITIAL I&AB METHOD (2016)


2.1 Hypothesis on the system and definitions

Suppose we want to calculate the reliability of a repairable
system with standby redundancies; it may be a good approximation to
take into account only one level of dependences between the
components. In other words, one can distinguish failures of
“normal” components (they are called “initiating events”) and
failures of components in standby (that function only in case of
failures of normal components). But one cannot discriminate between
a component of “primary standby” (that assures the functioning of
the system after a failure of the corresponding normal component)
and a component of “secondary standby” (that operates only after a
failure of the primary standby component).

The I&AB method relies on the two following 
approximations: 


A0: When an initiating event occurs, all standby components are
supposed to start functioning (or maybe refuse to start)
immediately after the initiating event; then, they may fail and be
repaired independently from each other until the initiating event
is repaired.

A1: Once an initiating event is repaired, the system can no longer
fail, whatever happens.


We suppose that the real system is described by a CTMC where the
initial, “perfect” state is the state into which the system always
returns, until it is absorbed by a failure state. Then its
unreliability can be estimated from the following formula
(Bouissou & Bon 1992):

$$ \bar{R}(t) \le 1 - \exp(-\Lambda\,p\,t) \tag{1} $$

where $\Lambda$ is the frequency of initiating events (sum of
rates of all transitions exiting the initial state) and $p$ is the
probability that the initiating events lead to an accident before
the system goes back to the perfect state.

In I&AB, in order to estimate p we use the 
“Minimal Content of (failure) Sequences” (MCS) 
of the Markov chain, as it was defined in (Bouissou 
2006). The formal definition of a MCS is given 
ibid, but it can be defined informally as the result 
of a Boolean reduction of the following fault tree: 
a single OR gate with one son per failure sequence, 
each son being presented as an AND gate over the 
events appearing in the sequence. Initiating events 
must be distinguished from other events, so that for 
example, the MCS of a system made of 2 compo- 
nents Y and Z in active redundancy is {Y_init, Z} 
{Z_init, Y} and not simply {Y, Z}. 
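The informal definition above can be illustrated by a few lines of code. The sketch below is not the formal definition of (Bouissou 2006); it simply builds one AND term per failure sequence, marks the initiating event, and applies Boolean absorption. It assumes that the first event of each sequence is its initiating event, and the function name and the `_init` suffix convention are illustrative choices.

```python
def minimal_contents(sequences):
    """Boolean reduction of an OR over AND gates, one AND per failure sequence.

    Each sequence is an ordered list of event names. Assumption: its first
    event is the initiating event, renamed '<name>_init' so that it stays
    distinguishable from the same component failing later in a sequence.
    """
    products = set()
    for seq in sequences:
        first, rest = seq[0], seq[1:]
        products.add(frozenset([first + "_init", *rest]))
    # Absorption: drop every product that strictly contains another product.
    return {p for p in products if not any(q < p for q in products)}

# Two components Y and Z in active redundancy: two failure sequences.
mcs = minimal_contents([["Y", "Z"], ["Z", "Y"]])
print(sorted(sorted(p) for p in mcs))
```

For the two-component example this yields the two products {Y_init, Z} and {Z_init, Y} rather than the single set {Y, Z}.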

For real, complex systems, the MCS can be obtained in (at least)
two ways: by building a PSA type model made of event trees and
fault trees, and calculating its minimal cut sets, or by building a
BDMP and applying the steps described in (Bouissou & Hernu 2016) to
transform it into a standard fault tree whose minimal cut sets are
the MCS of the Markov chain specified by the BDMP. In the remainder
of the paper, we will therefore suppose that we have the MCS of the
studied system at hand, and we will call its elements “minimal
products”.


2.2 I&AB general formulae 


Let us suppose that there are $n$ initiating events that can lead
the system out of its perfect state. Then, according to (1), the
system unreliability at time $t$ can be found from

$$ \bar{R}(t) \le 1 - \exp\left(-t\sum_{ie=1}^{n}\lambda_{ie}\,p_{ie}\right) \tag{2} $$

where $\lambda_{ie}$ is the failure rate of initiating event $ie$
and $p_{ie}$ includes probabilities for all $k$ minimal products
corresponding to it.

In calculations we distinguish two time intervals. The first
interval is the mission time figuring in (2). The second time
interval is infinite and starts once an initiating event takes
place. The probability that all components in a minimal product $c$
fail within the time interval $[0, \infty]$ is simply the
unreliability $\bar{R}_c(\infty)$ of a parallel system made of
these components; then we can use the following upper bound for
$p_{ie}$, which will be a good approximation when all failure
probabilities are small:

$$ p_{ie} \le \sum_{c=1}^{k}\bar{R}_c(\infty) \tag{3} $$

Using the Murchland approximation, we obtain:

$$ p_{ie} \le \sum_{c=1}^{k}E(N_c(\infty)) \tag{4} $$

where $N_c(\infty)$ is the number of failures of the minimal
product $c$ on an infinite horizon. What keeps $p_{ie}$ small is
the fact that in the initial state considered for $c$, the
initiating event is realized with probability 1, but once repaired,
it never fails again, contrary to other elements of the product
(this is the approximation A1).
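Bounds (2) and (4) combine into a very small computation once the expected numbers of failures of the minimal products are known. The sketch below only illustrates that combination; the function name and the input layout (one pair per initiating event) are assumptions made for the example.

```python
import math

def unreliability_bound(initiators, t):
    """Upper bound 1 - exp(-t * sum_ie(lambda_ie * p_ie)), with p_ie replaced
    by its bound sum_c E(N_c(inf)) from (4).

    initiators: list of (lambda_ie, [E(N_c(inf)) for each minimal product c
    attached to this initiating event]).
    """
    total = sum(lam * sum(expected) for lam, expected in initiators)
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for very small x.
    return -math.expm1(-total * t)

# Hypothetical rates (per hour) and expected numbers of failures.
example = [(1e-4, [1e-3, 2e-3]), (5e-5, [4e-3])]
bound = unreliability_bound(example, t=10000.0)
```

Using `expm1` rather than `1 - exp(...)` is a design choice for the very small probabilities typical of this kind of study.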

In order to calculate $E(N_c(t))$ we first need a few definitions.
We will use the following reliability characteristics:

— Unavailability $Q(t)$: the probability that a component is in a
failure state at time $t$;

— Unconditional failure intensity $W(t)$: $W(t)\Delta t$ is the
mean number of failures of a component between $t$ and
$t + \Delta t$.

For markovian basic events, depending on their type, these
quantities are given by the following expressions:

— Initiating event (the repair is definitive)

$$ Q(t) = \exp(-\mu t), \qquad W(t) = 0 $$

— Failure in operation (it can fail several times)

$$ Q(t) = \frac{\lambda}{\lambda+\mu}\left[1-\exp(-(\lambda+\mu)t)\right], \qquad W(t) = \lambda\,(1-Q(t)) $$

— Failure on demand (the repair is definitive)

$$ Q(t) = \gamma\exp(-\mu t), \qquad W(t) = 0 $$
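The three types of basic events translate directly into code. The sketch below is a plain transcription of the expressions for $Q(t)$ and $W(t)$; the function names are illustrative, and the symbols are read as failure rate `lam`, repair rate `mu` and probability of failure on demand `gamma`.

```python
import math

def q_initiator(t, mu):
    """Initiating event: present at t = 0, repaired definitively at rate mu."""
    return math.exp(-mu * t)

def q_in_operation(t, lam, mu):
    """Failure in operation: unavailability of a repairable component."""
    return lam / (lam + mu) * (1.0 - math.exp(-(lam + mu) * t))

def w_in_operation(t, lam, mu):
    """Unconditional failure intensity: lam times the availability."""
    return lam * (1.0 - q_in_operation(t, lam, mu))

def q_on_demand(t, gamma, mu):
    """Failure on demand: fails at t = 0 with probability gamma, then repaired."""
    return gamma * math.exp(-mu * t)

# W(t) = 0 for the initiating event and for failures on demand, because
# after the definitive repair these events cannot occur again.
```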


For lack of space, we will not recall here the demonstration given
in (Bouissou & Hernu 2016) that leads to the following formula,
written for a minimal product $c$ containing $l$ failures on demand
and $m$ failures in function:

$$ E(N_c(t)) = \prod_{i=1}^{l}\gamma_i\int_0^t \exp(-\mu_{ie}x)\,f(x, c)\,dx \tag{5} $$

with

$$ f(x, c) = \exp\left(-x\sum_{i=1}^{l}\mu_i\right)\sum_{i=1}^{m}\Big(W_i(x)\prod_{j\ne i}Q_j(x)\Big). $$

Equation (5) assumes that a minimal product contains at least one
basic event with a failure in operation. However, sometimes minimal
products are only composed of basic events corresponding to
failures on demand (plus one initiating event, as usual). In such a
case, we suppose that these events happen at $t = 0$ and the
unreliability for minimal product $c$ is given by:

$$ \bar{R}_c(\infty) = \Pr(\text{top} = \text{true at } t = 0) = \prod_{i=1}^{l}\gamma_i \tag{6} $$

2.3  I&AB formulae in the markovian case 


The general equation (5) yields a closed form 
formula in the purely markovian case, where 
all components have constant failure and repair 
rates. 

In order to simplify notations, we will omit 
the index c in the remainder of this section: we 
will implicitly give formulas for a single minimal 
product. 

Taking an infinite time horizon and replacing 
W(x) by its expression given in section 2.2 for a 
failure in operation, equation (5) becomes: 


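$$ E(N(\infty)) = \prod_{i=1}^{l}\gamma_i\int_0^\infty \exp(-\mu_{ie}x)\,f(x)\,dx \tag{7} $$

with

$$ f(x) = \exp\left(-x\sum_{i=1}^{l}\mu_i\right)\sum_{i=1}^{m}\Big(W_i(x)\prod_{j\ne i}Q_j(x)\Big)
= \exp\left(-x\sum_{i=1}^{l}\mu_i\right)\sum_{i=1}^{m}\Big(\lambda_i\,(1-Q_i(x))\prod_{j\ne i}Q_j(x)\Big). $$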


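Here we need to introduce new notations in order to simplify
upcoming formulas. Let:

$$ \mu = \mu_{ie} + \sum_{i=1}^{l}\mu_i, \qquad r_i = \lambda_i + \mu_i. $$

Hence, replacing the functions $Q_i$ by their definitions and using
these new notations, we obtain:

$$ E(N(\infty)) = \prod_{i=1}^{l}\gamma_i\left(\sum_{i=1}^{m}\lambda_i\int_0^\infty e^{-\mu x}\prod_{j\ne i}\frac{\lambda_j}{r_j}\left(1-e^{-r_j x}\right)dx
- \sum_{i=1}^{m}\lambda_i\int_0^\infty e^{-\mu x}\prod_{j=1}^{m}\frac{\lambda_j}{r_j}\left(1-e^{-r_j x}\right)dx\right) \tag{8} $$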

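Each integrand includes a product of functions, which can be
represented in the following way:

$$ \prod_{j=1}^{m}\frac{\lambda_j}{r_j}\left(1-e^{-r_j x}\right)
= \prod_{j=1}^{m}\frac{\lambda_j}{r_j}\times\left(1 - \sum_{i=1}^{m}e^{-r_i x} + \sum_{i=1}^{m}\sum_{j>i}e^{-(r_i+r_j)x} - \ldots + (-1)^m\exp\left(-\sum_{i=1}^{m}r_i\,x\right)\right). \tag{9} $$

Hence, after the integration from 0 to infinity, we obtain an
alternating series, every term of which, in its turn, is a sum of
fractions. For instance, apart from the constant factor
$\prod_j \lambda_j/r_j$, the second integral results in:

$$ \int_0^\infty e^{-\mu x}\prod_{j=1}^{m}\left(1-e^{-r_j x}\right)dx
= \frac{1}{\mu} - \sum_{i=1}^{m}\frac{1}{\mu+r_i} + \sum_{i=1}^{m}\sum_{j>i}\frac{1}{\mu+r_i+r_j} - \ldots + (-1)^m\left(\mu+\sum_{i=1}^{m}r_i\right)^{-1}. \tag{10} $$

The first integral is calculated in a similar way; the only
difference is that one should exclude the current element $i$ from
the product.

These analytical formulae (8) and (10) seem very cumbersome;
however, they considerably reduce the processing time (in
comparison with a numerical integration) while ensuring an
excellent accuracy.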
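The alternating sum (10) and its use in (8) can be written compactly by enumerating subsets of the rates $r_j$. The sketch below is an illustrative implementation, not the one used for the results of this paper: the function names are hypothetical, and the subset enumeration costs $2^m$ terms, which is only reasonable for the small orders $m$ of typical minimal products.

```python
import math
from itertools import combinations

def alternating_sum(mu, r):
    """int_0^inf exp(-mu x) * prod_j (1 - exp(-r_j x)) dx, expanded as in (9):
    the sum over subsets S of (-1)^|S| / (mu + sum of r_j for j in S)."""
    total = 0.0
    for size in range(len(r) + 1):
        for subset in combinations(r, size):
            total += (-1) ** size / (mu + sum(subset))
    return total

def expected_failures(mu, lam, rep, gammas=()):
    """E(N(inf)) for one minimal product, following the structure of (8).

    lam, rep: failure and repair rates of the m failures in operation.
    gammas:   probabilities of the failures on demand.
    mu:       repair rate of the initiator plus the repair rates of the
              failures on demand.
    """
    r = [l + m for l, m in zip(lam, rep)]
    ratio = [l / ri for l, ri in zip(lam, r)]
    # Second integral of (8): product over all m components.
    full = math.prod(ratio) * alternating_sum(mu, r)
    total = 0.0
    for i in range(len(lam)):
        # First integral of (8): same product with component i excluded.
        others = math.prod(ratio[:i] + ratio[i + 1:])
        total += lam[i] * (others * alternating_sum(mu, r[:i] + r[i + 1:]) - full)
    return math.prod(gammas) * total
```

For a single failure in operation ($m = 1$) the result reduces to $\lambda_1(\mu+\mu_1)/(\mu(\mu+r_1))$, which gives a quick hand check of an implementation.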


3 I&AB EXTENSIONS 


3.1 Taking grace times into account 


In this section, the focus is on systems such that, 
after the loss of all components subject to ran- 
dom failures in a minimal product, the undesir- 
able event is delayed by some physical process that 
guarantees a deterministic grace time. The spent 
fuel pool is a good example: after the complete 
loss of the cooling system, the water will heat until 
it boils, but this process is deterministic and it 
would give an excessively conservative evaluation 
to replace the grace time by a random delay, expo- 
nentially distributed in order to stay in the marko- 
vian framework. 

We first suppose that we need to quantify minimal products
containing failures of components (with the same hypotheses as in
§2.3) and a single deterministic grace time. Let $X_c$ be the
failure time of the set $A_c$ of markovian elements of the minimal
product $c$, $Y_c$ the time needed to repair at least one of the
markovian components, starting from the state where they are all
failed, and $T_c$ the grace time. For the sake of simplicity, we
suppose that after a given occurrence of the initiator, the basic
event corresponding to the grace time behaves like a Heaviside
function: it becomes true at $X_c + T_c$ and stays true forever (it
is “not repairable”). The probability $p_{ie}$ to go from the state
where the system is just after the initiator $ie$ to the failure
state can be estimated as:

$$ p_{ie} \approx \sum_{c}E(N_c(\infty))\cdot\Pr(Y_c > T_c). \tag{11} $$

The total repair rate when all markovian components are failed is
the sum of their repair rates. Hence

$$ \Pr(Y_c > T_c) = \exp\left(-T_c\sum_{i\in A_c}\mu_i\right). \tag{12} $$

As for $E(N_c(\infty))$, it can be computed using the formulae of
§2.3.
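The grace time correction is a single exponential factor per minimal product. The sketch below illustrates it under the stated hypotheses; the function names and the layout of the input tuples are assumptions made for the example, and the numerical values are hypothetical.

```python
import math

def prob_no_repair_during_grace(grace_time, repair_rates):
    """Pr(Y_c > T_c) = exp(-T_c * sum of the repair rates of the markovian
    components), all of them being failed when the grace time starts."""
    return math.exp(-grace_time * sum(repair_rates))

def p_with_grace(products):
    """Sum over minimal products of E(N_c(inf)) * Pr(Y_c > T_c).

    products: list of (expected_failures, grace_time, repair_rates).
    """
    return sum(e_n * prob_no_repair_during_grace(T, reps)
               for e_n, T, reps in products)

# Hypothetical product: E(N_c(inf)) = 1e-3, grace time 25 h, two components
# with repair rate 0.02/h each, so the factor is exp(-25 * 0.04) = exp(-1).
p = p_with_grace([(1e-3, 25.0, [0.02, 0.02])])
```

With a grace time of zero the factor equals 1 and the purely markovian result of §2.3 is recovered.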

To conclude this section, let us mention that the grace delay may
depend on the minimal product, and that a minimal product can
contain two or more grace delays: in this case, only the last one
must be taken into account (cf. §4.1.2 for more details about this
choice).


3.2 Taking deterministic failures into account 


If, after a non-recovered loss of cooling, the water
starts to boil in the fuel pool, there is a possibility
to add water coming from tanks. However, the
capacity of tanks is limited and after a given time 
the water flow is interrupted: this is what we call a 
deterministic failure. After a given initiator, such 
failures can be considered as non-repairable: it is 
impossible to replenish the tanks in a short amount 
of time (the same applies to batteries). However, 
in a dynamic model, they can be associated to a 
repair (with a small repair rate, see discussion on 
that topic in §4.1.2) in order to allow the model to 
return to its initial state. In order to be consistent 
with general assumptions of I&AB, we will sup- 
pose that the “timers” associated to deterministic 
failures start just after the initiating event; this is 
obviously conservative, as in fact they start after 
some failures. This assumption has an immediate 
consequence: if there are two or more determinis- 
tic failures in a minimal product, the one associated 
to the greatest delay suffices to prevent the minimal 
product from becoming true until it happens. So, 
without loss of generality, we will consider in this
section that we want to quantify a minimal product
containing $l$ failures on demand, $m$ failures in
function, and one deterministic failure.

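We define the unavailability $Q_d$ and unconditional failure
intensity $W_d$, needed in equation (5), for this type of basic
event. $Q_d$ is a Heaviside function and $W_d$ a Dirac
distribution:

$$ Q_d(t) = \bar{R}_d(t) = H(t-t_d) = \begin{cases} 0, & t < t_d \\ 1, & t \ge t_d \end{cases} \qquad W_d(t) = \delta(t-t_d). $$

With these notations, equation (5) can be written as follows (with
an infinite time horizon, and omitting the minimal product index
$c$), the sum and the products running over the $m$ failures in
function and the deterministic failure:

$$ E(N(\infty)) = \prod_{i=1}^{l}\gamma_i\times\int_0^\infty \exp\left(-\mu_{ie}x-\sum_{i=1}^{l}\mu_i\,x\right)\times\sum_{i}\Big(W_i(x)\prod_{j\ne i}Q_j(x)\Big)dx \tag{13} $$

Taking, as in §2.3, $\mu = \mu_{ie} + \sum_{i=1}^{l}\mu_i$ and
$r_i = \lambda_i + \mu_i$:

$$ E(N(\infty)) = \prod_{i=1}^{l}\gamma_i\times\left[\int_0^\infty e^{-\mu x}\sum_{i=1}^{m}\Big(W_i(x)\prod_{j\ne i}Q_j(x)\Big)H(x-t_d)\,dx
+ \int_0^\infty e^{-\mu x}\,\delta(x-t_d)\prod_{j=1}^{m}Q_j(x)\,dx\right] $$

Finally,

$$ E(N(\infty)) = \prod_{i=1}^{l}\gamma_i\times\left[\int_{t_d}^\infty e^{-\mu x}\sum_{i=1}^{m}\Big(W_i(x)\prod_{j\ne i}Q_j(x)\Big)dx
+ \exp(-\mu t_d)\prod_{j=1}^{m}Q_j(t_d)\right] \tag{14} $$

The integral term of equation (14) can be written, using the same
notations as in §2.3:

$$ \int_{t_d}^\infty e^{-\mu x}\sum_{i=1}^{m}\Big(W_i(x)\prod_{j\ne i}Q_j(x)\Big)dx
= \sum_{i=1}^{m}\lambda_i\int_{t_d}^\infty e^{-\mu x}\prod_{j\ne i}\frac{\lambda_j}{r_j}\left(1-e^{-r_j x}\right)dx
- \sum_{i=1}^{m}\lambda_i\int_{t_d}^\infty e^{-\mu x}\prod_{j=1}^{m}\frac{\lambda_j}{r_j}\left(1-e^{-r_j x}\right)dx \tag{15} $$

After integration from $t_d$ to infinity, we obtain for the second
integral (apart from the constant factor $\prod_j \lambda_j/r_j$)
the following alternating sum:

$$ \begin{aligned}
\int_{t_d}^\infty e^{-\mu x}\prod_{j=1}^{m}\left(1-e^{-r_j x}\right)dx
={}& \frac{\exp(-\mu t_d)}{\mu} - \sum_{i=1}^{m}\frac{\exp(-(\mu+r_i)t_d)}{\mu+r_i}
+ \sum_{i=1}^{m}\sum_{j>i}\frac{\exp(-(\mu+r_i+r_j)t_d)}{\mu+r_i+r_j} \\
&- \sum_{i=1}^{m}\sum_{j>i}\sum_{k>j}\frac{\exp(-(\mu+r_i+r_j+r_k)t_d)}{\mu+r_i+r_j+r_k} + \ldots \\
&+ (-1)^m\left(\mu+\sum_{i=1}^{m}r_i\right)^{-1}\exp\left(-\left(\mu+\sum_{i=1}^{m}r_i\right)t_d\right)
\end{aligned} \tag{16} $$

Of course, taking $t_d = 0$, we obtain again the formula (10)
given in §2.3.

All these formulae are so complicated that it is necessary to
carefully validate their implementation in a program. The next
section has two purposes: give what we believe is the result of
I&AB (we cannot guarantee that our Python implementation is totally
bug free) and see how I&AB approximations compare to more precise
calculations made by Monte Carlo simulation (the only possible
method because of deterministic times) on a truly dynamic model.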
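One simple way to validate an implementation of the alternating sum for the integral from the deterministic delay to infinity is to compare it with a brute-force quadrature on a small case. The sketch below illustrates this idea and is not the implementation used for the results reported here; the function names, the delay symbol `t_d`, the truncation span and the step count are illustrative assumptions.

```python
import math
from itertools import combinations

def tail_integral(mu, r, t_d):
    """int_{t_d}^inf exp(-mu x) * prod_j (1 - exp(-r_j x)) dx as the
    alternating sum over subsets S of the rates r_j:
    (-1)^|S| * exp(-(mu + r_S) t_d) / (mu + r_S)."""
    total = 0.0
    for size in range(len(r) + 1):
        for subset in combinations(r, size):
            rate = mu + sum(subset)
            total += (-1) ** size * math.exp(-rate * t_d) / rate
    return total

def tail_integral_numeric(mu, r, t_d, span=400.0, steps=40000):
    """Composite Simpson rule on [t_d, t_d + span]; assumes mu * span is
    large enough for the neglected tail to be insignificant."""
    def g(x):
        return math.exp(-mu * x) * math.prod(1.0 - math.exp(-rj * x) for rj in r)
    h = span / steps  # steps must be even for Simpson's rule
    s = g(t_d) + g(t_d + span)
    for k in range(1, steps):
        s += (4 if k % 2 else 2) * g(t_d + k * h)
    return s * h / 3.0

# Hypothetical rates (per hour) and a 30 h deterministic delay.
closed = tail_integral(0.1, [0.03, 0.07], 30.0)
numeric = tail_integral_numeric(0.1, [0.03, 0.07], 30.0)
```

Setting `t_d = 0` in `tail_integral` gives back the alternating sum of the purely markovian case, which provides a second consistency check.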


4 ACCURACY TESTS OF I&AB 
EXTENSIONS 


The small examples of this section were designed just to make
comparisons between I&AB and “exact” calculations performed with
Monte Carlo simulation. In practice the models were input
graphically as BDMPs in KB3 (see Figure 1 for an example), then
processed both by I&AB and by the Monte Carlo simulator YAMS. An
overview of EDF tools including KB3 and YAMS is given in
(Bouissou 2005).

Table 1. I&AB accuracy on various simple test cases. Columns 2 and
3 are the estimations of the unreliability at 10000 hours computed
by I&AB and Monte Carlo simulation (the last column is the
half-width of the 90% confidence interval of the YAMS result).

Test case   I&AB        YAMS        Conf. interval
4.1.1a      5.25 107    5.05 10°    5.21 10°
4.1.1b      1.17 10°    1.12 103    5.50 10°
4.1.1c      5.85 10°    5.72 10°    3.93 10°
4.1.2a      7.08 10°    2.60 10°    2.64 104
4.1.2b      1.73 107    3.90 10°    3.24 10+
4.1.2c      2.89 10°    6.95 10+    4.33 10°
4.1.2d      9.55 10°    1.03 103    3.27 10°
4.1.2e      3.54 10+    3.80 105    1.01 105
4.1.2f      3.89 10°    1.00 10+    1.64 105
4.2.1a      1.87 10"!   1.67 10"!   8.67 10+
4.2.1b      6.21 10°    5.44 10°    1.71 104
4.2.1c      1.65 103    1.67 10°    9.50 105
4.2.2a      1.87 10"!   1.42 10?    2.75 10+
4.2.2b      1.87 10"!   1.56 107    2.88 104
4.2.2c      6.21 10°    9.86 104    7.30 10°
4.2.2d      6.21 10°    1.14 103    7.86 10°
4.2.2e      1.65 10°    8.28 10+    4.73 10°
4.2.2f      1.65 107    8.62 10+    4.83 10°


In all calculations, failures are associated to a 
failure rate of 10-*/h and repair rate of 2 10-*/h. The 
grace times and delays of deterministic failures are 
indicated in § 4.1 and 4.2. Table 1 gives a synthesis 
of all comparisons. The numbers in the first col- 
umn correspond to the numbers of sections below 
that explain the test cases. All calculations with 
I&AB require a negligible time, whereas some of 


the Monte Carlo simulations require a few minutes 
for sufficient precision. 

Below are the descriptions of the test cases and 
comments on the results. 


4.1 Grace times


4.1.1 Single grace time

The minimal product to quantify is {Initiator, A, B, grace_time}.
In the dynamic model, there are only two sequences: Initiator, A,
B, grace_time and Initiator, B, A, grace_time (A and B are in
active redundancy). The grace time is successively taken equal to
25 h (line 4.1.1a of Table 1), 50 h (line b), 100 h (line c).

In this case, I&AB works quite well, which is not surprising given
that the dynamic model corresponds exactly to the simplifying
assumptions made in §2.1.


4.1.2 Two grace times

The minimal product to quantify is {Initiator, A, grace_time_1, B,
grace_time_2}. In the dynamic model, there is only one sequence:
Initiator, A, grace_time_1, B, grace_time_2. The grace time is
split in two, and the component B can fail only after the failure
of A and the end of grace_time_1. The two grace times (in hours)
are successively taken equal to (5, 20) (line 4.1.2a of Table 1),
(20, 5) (line b), (15, 35) (line c), (35, 15) (line d), (30, 70)
(line e), (70, 30) (line f). Note that in the dynamic model, the
basic event grace_time_1 is considered as repairable (with repair
rate equal to 1/250 h), so that after an initiating event, the
system can return to its initial state provided it does not reach
the undesirable event; the result is not sensitive to the value
chosen for the repair rate as long as the mean time to repair
components is much smaller than the mean time to repair the grace
time: in such a case, after a given occurrence of the initiator,
the grace time can be considered as “not repairable” just like in
I&AB.
In the simulation model, the order of the two grace 
times makes a difference if they are not equal. In 
I&AB, it is also the case because only the last grace 
delay is taken into account. We have also tested the 
idea of taking the sum of the two grace delays like 
a single one in I&AB: it yields results much closer 
to those of the dynamic model, but this approxima- 
tion can produce optimistic results in some cases 
(for case f the result is 5.85 10>). Intuitively, this is 
due to the fact that in the dynamic model only the 
last grace delay is competing with all repairs of the 
markovian elements of the cut set. Cf. also §5. 


4.2 Deterministic failures 


4.2.1 Barriers in active redundancy 
Let us consider a little hydraulic system modeled
by the BDMP below:

When the initiator Leak occurs, the two barriers
(each one composed of a pump and a tank) are 
activated. The undesirable event occurs if, before 
the repair of the leak, the two pumping systems 
are lost, either because of a random failure of the 
pump, or because the tank is empty. The failure 
and repair rates for random events are as described 
at the beginning of §4, except that the repair rate 
of the Leak is 0.1/h in order to get small enough 
probabilities. 

The results given in Table 1 correspond to 
the following values for the times after which 
Tank1 and Tank2 are empty: (40, 30) (line 4.2.1a), 
(80, 60) (line b), (150, 100) (line c). Note that 
the order of the two numbers is not important 
here, because of the symmetry of the two bar- 
riers. I&AB performs quite well on this exam- 
ple, where the minimal product containing the 
two deterministic failures is dominant. In the 
dynamic model as well as in I&AB, the amount 
of water in the biggest tank is the most influential
parameter.


4.2.2 Barrier 2 activated on failure of barrier 1 

In this case, in the dynamic model, the functioning
times of the two tanks add up, unless a failure
of pump1 forces barrier2 to start before depletion
of tank1. It is therefore not surprising that
I&AB is more conservative in this case than when 
the two barriers are in active redundancy. The 
BDMP corresponding to this case is not shown, 
because it is the BDMP of Figure 1 with just one 
additional trigger (red dotted line), going from 
gate Barrierl_lost to gate Barrier2_lost. There is 
no need to re-run the calculations with I&AB, 
since for this method, this case gives the same 
results as the previous one (active redundancy 
of barriers). But here, the capacities of the 
two tanks are not exchangeable in the dynamic 
model, this is why we ran YAMS with the fol- 
lowing couples of values for deterministic delays: 
(40, 30) (line 4.2.2a), (30, 40) (line b), (80, 60) (line 
c), (60, 80) (line d), (150, 100) (line e), (100, 150) 
(line f). The unreliability increases a bit when the 
greatest delay is the last one. Going from line a to 
f, the results of I&AB range from extremely con- 
servative (by a factor 10) to acceptably conserva- 
tive (by a factor 2). On the other hand, using the 
sum of the delays instead of the greatest cannot 
be recommended because it could lead to opti- 
mistic results. 


5 DIFFERENCES BETWEEN GRACE 
TIMES AND DETERMINISTIC DELAYS 


In a dynamic model like a BDMP, both grace times 
and deterministic delays are represented as leaves 


associated to a deterministic time to failure. So 
the difference between those two concepts is not 
obvious. In essence, the difference between a grace 
delay and a deterministic failure is that: 


— Once the grace delay has started, whatever hap- 
pens on the system can only postpone (case 
of a repair) the undesirable event, or leave it 
unchanged (case of a failure); 

— In the case of a deterministic failure, whatever 
happens on the system can only make the unde- 
sirable event happen sooner (case of a failure), 
or leave it unchanged (case of a repair). 


The I&AB theory makes a very clear distinction 
between the two concepts, because it considers that 
the grace time starts when all other components of 
the cut set have failed, whereas the timer of a deter- 
ministic failure starts just after the initiator. An 
intermediate grace time such as in the example of
§4.1.2 corresponds to neither of these cases, which is
why in I&AB it should simply be ignored. It is the user's
responsibility to mark leaves as grace delays, deter- 
ministic failures or “to be ignored” in the BDMP 
before transforming it into input data for I&AB. 

On a large model, it is probable that the few 
minimal products with a too conservative quanti- 
fication will be “hidden in the crowd” and that the 
global result will not be much affected. 


6 APPLICATION TO THE SPENT 
FUEL POOL 


To perform all our tests so far, we have used the 
implementation of I&AB that we described in 
(Bouissou & Hernu 2016). It is not the most efficient 
because it separates the search for minimal products 
from their quantification, therefore preventing the 
use of a probability threshold to discard at an early 
stage in the calculations most minimal products, 
as it is done by the MOCUS algorithm (Fussell 
& Vesely 1972). In spite of this limitation, we have 
been able to demonstrate impressive performances 
of I&AB in the spent fuel pool application. 

We have built a model relative to the spent fuel 
pool of the European Pressurized Reactor and 
its support systems. Although less detailed than 
a classical PSA model, the BDMP we have built 
takes into account all dependencies due to standby
redundancies, common cause failures, sharing of
electrical supplies, etc. The model takes into account
both the grace time of 14 hours before boiling of 
the water and deterministic failures of tanks used 
to replace evaporated water. 

This BDMP (326 leaves, 77191 minimal prod- 
ucts of order up to 6) could be processed by I&AB 
in a few minutes on a laptop. This model happened 
to be also quantifiable by YAMS: the Monte Carlo 
simulation gave a failure probability smaller than
the result of I&AB by a factor of around 2, but the
calculation took 25 minutes to reach a 10% precision
with 95% confidence on the same machine.

Besides, with Monte Carlo simulation, it is very 
hard to get qualitative results: for that particular 
model, there is only one dominant sequence and all 
other sequences are much less probable: it would 
require many hours of simulation to get results 
comparable to the, say, 10 most probable minimal 
products that are easily identified by the I&AB 
method. 


7 CONCLUSION


I&AB is an analytical method for the reliabil- 
ity calculation of large repairable systems with 
dependences between components. Two kinds of 
models can serve as input for this method: BDMPs 
or standard nuclear PSA models complying with 
the fault tree linking method. Both of them can 
be transformed into a set of minimal products 
that are the basis of the calculation. I&AB as it 
was described in (Bouissou & Hernu 2016) can- 
not readily be used for the fuel pool case, because 
for this system, there are two kinds of determinis- 
tic delays that must be taken into account: grace 
times, and deterministic failures due to the limited 
capacity of water tanks. 

In the present paper, we have given two theo- 
retical contributions: the analytical formulae of the 
I&AB method (so far, they were only available in 
the French patent file FR3044787) and their exten- 
sion in the case of deterministic delays. In addi- 
tion, we have shown on several examples that the 
extended method can yield reasonably conservative 
results, in times incomparably shorter than Monte 
Carlo simulation. 

Thanks to a partnership between EDF and 
Lloyd’s Register, I&AB will soon be available for 
the large community of users of the RiskSpectrum 
PSA tool. This could revolutionize PSA praxis in 
upcoming years. 
Structure function in analysis of multi-state system availability


M. Kvassay, V. Levashenko, J. Rabcan, P. Rusnak & E. Zaitseva 
University of Zilina, Zilina, Slovakia 


ABSTRACT: The mathematical representation of the investigated system is an important step of
reliability analysis. There are two principal representations of a system. The first is the
Binary-State System (BSS), which permits only two states of system and component performance:
functioning and failed. Very often, this is not sufficient for an adequate mathematical description
of system behavior, so other types of mathematical representation are used. The Multi-State System
(MSS) is one of the most promising. It allows more than two states to be defined in describing the
behavior of the system and its components. A MSS can be described by a structure function, which
defines an unambiguous correspondence between all possible combinations of component states and the
system state. Typical disadvantages of the MSS structure function are its large dimension and the
impossibility of forming it from uncertain data. We propose to apply the mathematical methods of
Multiple-Valued Logic (MVL) to form the MSS structure function. MVL is a natural extension of
Boolean algebra in MSS reliability assessment, and it has been used for the analysis of logic
functions with more than two values. A new method for forming the structure function from uncertain
data by means of MVL methods is proposed. In particular, Multi-Valued Decision Diagrams (MDDs) are
used to represent the structure function. A MDD is a useful data structure for representing MVL
functions of large dimension. The method presented in this paper addresses both disadvantages of
the MSS representation by a structure function.


1 INTRODUCTION


System availability analysis can be implemented based on a
mathematical representation of the investigated system. Several
factors determine the final form of this representation.
The first of them is the definition of the number of
availability levels (states) (Natvig 2011; Zaitseva &
Levashenko 2017). The second factor is related to
the mathematical method that is used for qualitative
or quantitative analysis (Lisnianski & Levitin
2003; Zaitseva & Levashenko 2017). The third factor
is caused by initial data uncertainty: different
mathematical representations are used for
completely specified initial data and for uncertain
initial data (Aven 2010).

According to these factors, mathematical representations
for evaluation of system availability are divided into
Binary-State Systems (BSS) and Multi-State Systems (MSS).
A BSS represents the initial system as a mathematical model
with two possible states: complete failure and perfect
working. A MSS permits considering more than
two states in system behavior. In this paper,
the MSS is used for the mathematical representation
of the investigated system because it allows us
to describe system behavior in more detail. Both
BSS and MSS can be used in the analysis of systems
for which the initial data is completely specified or
uncertain. Four groups of methods are used for
MSS analysis (Lisnianski & Levitin 2003): stochastic
processes, universal generating function, Monte
Carlo simulation, and extensions of Boolean methods.
The methods based on extensions of Boolean
methods were historically the first in MSS reliability
evaluation (Murchland 1975), and they are
based on the representation of the investigated system
in the form of the structure function.

The structure function defines the dependency of
the system state on the states of the system components.
There are many mathematical approaches in reliability
engineering for structure function analysis.
The most frequently used are fault trees (Kabir
2017), reliability block diagrams (Distefano &
Puliafito 2009), and minimal cut/path sets (Kvassay
et al 2015a). Very often, methods of Boolean
algebra such as Binary Decision Diagrams (BDDs)
(Zaitseva et al 2015), Boolean function minimization (Di
Maio et al 2016) and Logical Differential Calculus
(Schneeweiss 2009) are used in the structure function
analysis of BSSs. Similar methods have
been developed for MSSs based on the mathematical
background of Boolean Logic (Xing & Amari 2015)
and Multiple-Valued Logic (MVL) (Zaitseva &
Levashenko 2017). For example, Multi-Valued
Decision Diagrams (MDDs) as a generalization
of BDDs are considered in (Xing & Dai 2008,
Zaitseva & Levashenko 2008) for analysis of MSSs.
Methods for finding minimal cut/path sets of
a MSS are proposed in (Kvassay et al. 2015b).
Logical Differential Calculus for MSS importance
analysis is considered in (Kvassay et al. 2017).

These investigations show the efficiency of
MVL mathematical methods in MSS reliability
analysis. However, methods based on the structure
function have some restrictions in the analysis of
MSSs: they do not allow investigating systems
for which the initial data is incompletely specified.
As a rule, the structure function is constructed from
completely specified data. Therefore,
MVL methods are combined with other methods
to construct and analyze the structure function of a
MSS. One possible method has been proposed
in (Zaitseva & Levashenko 2016); it allows us to
construct the MSS structure function using a Fuzzy
Decision Tree (FDT). The FDT is inducted based
on incompletely specified and uncertain data. The
FDT permits creating a decision table that agrees
with the structure function of the MSS. However,
the table is not an optimal representation of the MSS
structure function because this function has large
dimension as a rule. A more suitable representation
of the structure function of a MSS is the MDD.

In this paper, we consider the MSS structure func- 
tion mathematical representation based on com- 
pletely specified data (section 2) and incompletely 
specified data (section 3). The MDD allows us to 
represent the completely specified structure function 
of a MSS. The complexity of the MDD depends on 
the ordering of variables of the structure function in 
this form. The FDT can be used in case of incom- 
pletely specified and uncertain data. The algorithm 
for induction of FDT for representation of the MSS 
structure function has been proposed in (Zaitseva 
& Levashenko 2016). However, this algorithm does 
not take into account the ordering of variables
in a MDD. Therefore, we propose to use another
type of FDT, the ordered FDT (Levashenko
et al 2007). A new algorithm for transformation of
the ordered FDT into MDD is considered in sec- 
tion 4. The evaluation of accuracy of the FDT use 
in construction of a MDD representing the MSS 
structure function and efficiency of the ordering of 
variables in MDD are shown in section 5. 


2 MATHEMATICAL REPRESENTATION 
OF SYSTEM BASED ON COMPLETELY 
SPECIFIED DATA 


2.1 Structure function 


A system of n components in the stationary state
is considered in this paper. This system has M
performance levels and is interpreted as a MSS. We
suppose that this system is repairable and that repairs
of the individual components end in the as-good-as-new
state. This assumption allows the MSS structure
function to be defined as a time-independent function.

The structure function of MSS defines correla- 
tion of the MSS performance levels and its compo- 
nents states (Zaitseva & Levashenko 2017): 


$$\phi(x) = \phi(x_1, \ldots, x_n): \{0, \ldots, m_1 - 1\} \times \cdots \times \{0, \ldots, m_n - 1\} \to \{0, \ldots, M - 1\}, \tag{1}$$


where $\phi(x)$ defines the system performance level, from
complete failure ($\phi(x) = 0$) to perfect functioning
($\phi(x) = M - 1$); $x = (x_1, \ldots, x_n)$ is a state vector; $x_i$ is
the i-th component state, which changes from failure
($x_i = 0$) to perfect functioning ($x_i = m_i - 1$).

Every system component is characterized by the
probabilities of its states:

$$p_{i,s} = \Pr\{x_i = s\}, \quad s = 0, \ldots, m_i - 1. \tag{2}$$


The probabilities of the MSS performance lev- 
els, availabilities and unavailability of the system 
based on the structure function (1) are defined as 
(Kvassay et al. 2015b; Lisnianski & Levitin 2003): 


$$P_j = \Pr\{\phi(x) = j\}, \quad j = 0, 1, \ldots, M - 1, \tag{3}$$
$$A_h = \Pr\{\phi(x) \ge h\}, \quad h = 1, \ldots, M - 1, \tag{4}$$
$$U = \Pr\{\phi(x) = 0\}. \tag{5}$$


The MSS structure function (1) according to 
(Zaitseva & Levashenko 2017) is interpreted as 
MVL function that allows using MVL mathemati- 
cal methods for MSS analysis. There are some 
descriptions of MVL function that can be applied 
for the MSS structure function representation. 
Two of most well-known are truth table (Zaitseva 
& Levashenko 2017) and Multi-Valued Decision 
Diagram (MDD) (Xing & Dai 2008). 

For example, let us consider the truth table
of the simple service system analysed in (Kvassay
et al 2015b), which consists of 3 components (n = 3):
2 service points (components 1 and 2) and the
service system infrastructure (component 3). This
system has 3 performance levels (M = 3): 0 (non-operational),
1 (partially operational) and 2 (fully
operational). Components 1 and 2 have 2 states
($m_1 = m_2 = 2$): functional (state 1) and dysfunctional
(state 0). The third component has 3 states
($m_3 = 3$): failure (state 0), partly working (state 1) and
perfect working (state 2). The structure function of
this system is shown in Table 1.

The basic measures (3)-(5) of this system are 
calculated by the truth table (Table 1) as: 


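$$P_2 = p_{1,1} p_{2,1} (p_{3,1} + p_{3,2}) + p_{1,1} p_{2,0} p_{3,2}, \tag{6}$$
$$P_1 = (p_{1,0} p_{2,1} + p_{1,1} p_{2,0}) p_{3,1} + p_{1,0} (p_{2,1} + p_{2,0}) p_{3,2}, \tag{7}$$
$$A_2 = p_{1,1} p_{2,1} (p_{3,1} + p_{3,2}) + p_{1,1} p_{2,0} p_{3,2}, \tag{8}$$
$$A_1 = (p_{1,0} p_{2,1} + p_{1,1} p_{2,0} + p_{1,1} p_{2,1}) p_{3,1} + p_{3,2}, \tag{9}$$
$$U = p_{3,0} + p_{1,0} p_{2,0} p_{3,1}. \tag{10}$$


Table 1. The structure function of the simple service system.

Component states      x3
x1   x2               0    1    2
0    0                0    0    1
0    1                0    1    1
1    0                0    1    2
1    1                0    2    2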


The truth table of the MSS structure function
has dimension $\prod_{i=1}^{n} m_i$. This dimension grows
rapidly with the number of function
variables, i.e. with the number of system components.
In MVL, another representation is used for
functions of large dimension: the MDD (Miller &
Drechsler 2002).
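
The measures (3)-(5) can be checked numerically by summing over the rows of the truth table. The sketch below is illustrative only: it assumes statistically independent components (as the product form of (6)-(10) implies), and the probability values in `P` and the helper names are hypothetical, chosen for this example.

```python
# Structure function of the simple service system (Table 1):
# keys are state vectors (x1, x2, x3), values are performance levels.
PHI = {
    (0, 0, 0): 0, (0, 0, 1): 0, (0, 0, 2): 1,
    (0, 1, 0): 0, (0, 1, 1): 1, (0, 1, 2): 1,
    (1, 0, 0): 0, (1, 0, 1): 1, (1, 0, 2): 2,
    (1, 1, 0): 0, (1, 1, 1): 2, (1, 1, 2): 2,
}

# Hypothetical state probabilities P[i][s] = Pr{x_(i+1) = s}; illustrative values only.
P = [
    [0.1, 0.9],          # component 1
    [0.2, 0.8],          # component 2
    [0.05, 0.15, 0.80],  # component 3
]


def level_probabilities(phi, p, levels):
    """Return [P_0, ..., P_(M-1)] as in (3), assuming independent components."""
    result = [0.0] * levels
    for state_vector, level in phi.items():
        prob = 1.0
        for i, s in enumerate(state_vector):
            prob *= p[i][s]
        result[level] += prob
    return result


def availability(level_probs, h):
    """Return A_h = Pr{phi(x) >= h} as in (4)."""
    return sum(level_probs[h:])


probs = level_probabilities(PHI, P, levels=3)
U = probs[0]                 # unavailability (5)
A1 = availability(probs, 1)  # compare with (9)
A2 = availability(probs, 2)  # compare with (8)
```

The enumeration visits every row of the truth table, so its cost grows with the product of the numbers of component states, which is the dimension problem that motivates the MDD.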


2.2 Multi-valued decision diagram 


A MDD is a graphical representation of a MVL function
(Miller & Drechsler 2002). This form of
MSS structure function representation has been
proposed in (Xing & Dai 2008, Zaitseva & Levashenko
2008). A MDD has been modified into a
Multi-state Multi-valued Decision Diagram (MMDD)
in (Xing & Dai 2008). The MMDD has been proposed
for MSSs in which the number of system
performance levels is not equal to the number of
component states, and each component in the
system may have a different number of states. The
structure function of such a MSS is defined according
to (1). A MMDD analyzes the system states
with success rather than all system states. In (Zaitseva
& Levashenko 2017) the theoretical aspects
of applying MVL mathematical approaches to
a MSS with the structure function (1) have been
considered. These results allow us to use a MDD
to represent a MSS in which the system and each
component may have a different number of states.

A MDD is specified in the same way as a BDD, except
that the nodes become more complex. This is due to
the Shannon expansion for MVL functions, which
is defined as (Miller & Drechsler 2002):


Wx) =0- A0, x) +1-00,%x)+...+(m-1)- 
W(m — 1), x). (11) 
The Shannon expansion (11) is the basis for
using MDDs and for the definition of manipulations
over them. The format for MDD manipulation
is defined as the expression $\mathrm{case}(a, b_0, b_1, \ldots, b_{m-1}) = b_a$
(Miller & Drechsler 2002).


Figure 1. The MDD for the structure function of the
simple service system (Table 1).

A MDD is a directed acyclic graph representing
a MVL function based on the Shannon expansion
(11). This graph has M terminal nodes, labelled
from 0 to (M - 1), representing the M states of the MSS.
Each non-terminal node is labelled with a structure
function variable $x_i$ and has $m_i$ outgoing edges. The
outgoing edges of a non-terminal node are interpreted
as component states.

For example, the MDD for the representation 
of the simple service system defined by Table 1 
is shown in Figure 1. This MDD is uniquely 
described by procedures case and if-then-else as: 


if x1 = 0 then
    if x2 = 0 then case(x3, 0, 0, 1)
    else case(x3, 0, 1, 1)
else
    if x2 = 0 then case(x3, 0, 1, 2)
    else case(x3, 0, 2, 2)


The probabilistic interpretation of the MDD and
the rules for MSS measure calculation are based on the
canonical Shannon expansion (11). The probabilistic
interpretation of the MSS assumes that every
edge labeled s leaving a node of the i-th variable
corresponds to the i-th component state probability $p_{i,s}$.
Rules for the MSS evaluation according to (3)-(5) based
on the MDD are presented in detail in (Zaitseva & Levashenko
2008). The calculation of these measures based on
the MDD consists of the analysis of paths from the
root node to the terminal node j, j = 0, ..., M - 1.
For example, the system availability $A_2$ based on
the MDD in Figure 1 is defined by 3 paths from
the root node labeled $x_1$ to terminal node 2:


$$A_2 = p_{1,1} p_{2,0} p_{3,2} + p_{1,1} p_{2,1} p_{3,1} + p_{1,1} p_{2,1} p_{3,2} = p_{1,1} p_{2,1} (p_{3,1} + p_{3,2}) + p_{1,1} p_{2,0} p_{3,2}.$$


This agrees with the availability $A_2$ (8) calculated
by the truth table.
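
The case/if-then-else description and the path-based calculation can be sketched in a few lines. This is an illustrative encoding, not the authors' implementation: the nested-tuple node format, the function names and the probability values in `P` are assumptions made for this example, and components are assumed to be independent.

```python
def case(var, *children):
    """Non-terminal node: variable index plus one child per state of that variable."""
    return (var, children)


# MDD of the simple service system with variable order x1, x2, x3
# (indices 0, 1, 2). Terminal nodes are plain integers (performance levels).
MDD = case(
    0,
    case(1, case(2, 0, 0, 1), case(2, 0, 1, 1)),
    case(1, case(2, 0, 1, 2), case(2, 0, 2, 2)),
)

# Hypothetical state probabilities P[i][s]; illustrative values only.
P = [[0.1, 0.9], [0.2, 0.8], [0.05, 0.15, 0.80]]


def evaluate(node, x):
    """Follow the edges selected by state vector x down to a terminal node."""
    while not isinstance(node, int):
        var, children = node
        node = children[x[var]]
    return node


def terminal_probability(node, p, target):
    """Sum, over all paths reaching terminal `target`, the product of edge probabilities."""
    if isinstance(node, int):
        return 1.0 if node == target else 0.0
    var, children = node
    return sum(
        p[var][s] * terminal_probability(child, p, target)
        for s, child in enumerate(children)
    )


A2 = terminal_probability(MDD, P, target=2)  # the three paths to terminal node 2
```

With these illustrative probabilities, `A2` equals the value obtained from expression (8), since both sum the same three products.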


The truth table and the MDD are typically used 
for representation of completely specified function 
(Stankovic et al 2012). There are investigations to 
develop methods of the MDD construction for 
incompletely specified function. Most of these 
investigations have been provided in logic design 
(Popel & Drechsler 2003). The unspecified values of 
function in logic design can be interpreted arbitrar- 
ily (Stankovic et al 2012). The unspecified values 
of the structure function in reliability engineering 
appear if they cannot be measured or observed, 
but they must exist. Therefore, these values cannot 
be ignored (Aven 2010, Zaitseva et al 2017). These 
specifics require development of new methods and 
algorithms for representation of incompletely spec- 
ified structure function because methods for MDD 
construction proposed in logic design cannot be 
used. One possible approach has been proposed
in (Zaitseva & Levashenko 2016, Zaitseva et al 
2017). It is based on use of decision trees. 


3 MATHEMATICAL REPRESENTATION 
OF SYSTEM BASED ON INCOMPLETELY 
SPECIFIED DATA 


3.1 Initial data 


According to the approach for the MSS structure 
function construction based on incompletely speci- 
fied data in (Zaitseva & Levashenko 2016, Zaitseva 
et al 2017), this function is interpreted as a classification
procedure: all system state vectors $(x_1, \ldots, x_n)$
are divided into M classes. This classification is
illustrated by MDD too. For example, the MDD 
of a simple service system in Figure 1 divides all 
state vectors into 3 classes. 

The interpretation of the MSS structure func- 
tion as classifier allows us to use methods for induc- 
tion of classifiers based on incompletely specified 
data. Such methods are well known in Data Min- 
ing. One of them is induction of a decision tree. 
For example, the decision tree can be inducted by 
algorithm ID3/C4.5 developed by Quinlan (1987). 
This algorithm can be used for incompletely specified
data that has crisp values. However, initial data for
reliability analysis is often collected as expert data.
This data is latent or uncertain. There are different
factors that cause uncertainty of data collected
for reliability analysis (Aven 2010, Ley 2011). In 
this paper two of them are considered. The first is 
incompleteness of data. The second is ambiguity 
and vagueness of initial data. 

Incomplete data is collected if it is impossible
to indicate some values of the system component
states or system performance levels. For example,
obtaining them can be very expensive, can need an
unacceptably long time, or can be dangerous for the environment.
The ambiguity and vagueness of initial data are
caused by the methods of data collection: measurement
and/or expert knowledge. A measurement
can be inaccurate, with an error that depends
on the accuracy of the measuring device. Therefore, these
data can be defined and used with some likelihood.
In the case of data collected from expert
knowledge, the data cannot be indicated exactly because
experts can have different opinions on one situation.
Fuzzy logic makes it possible to define the
structure function in a more flexible form.

The mathematical representation of a MSS in 
this case is interpreted as a classification problem 
for uncertain and incompletely specified data that 
can be decided with the application of FDT (Zait- 
seva & Levashenko 2016, Zaitseva et al 2017). 


3.2 Fuzzy Decision Tree 


A FDT is one of possible types of decision trees. 
A FDT allows taking into account not only 
unspecified values of system states but also uncer- 
tain values of components states. It is a formalism 
for expressing mapping of input attributes on out- 
put attribute/attributes, consisting of an analysis 
of attribute nodes linked to two or more sub-trees 
and leaves or decision nodes labeled with a class 
(in our case it is the system performance level) 
(Olaru & Wehenkel 2003). A FDT defines the correlation
between n input attributes $\{A_1, \ldots, A_n\}$ and
an output attribute B. The important step in the
use of FDT for the MSS mathematical representa- 
tion is definition of the correlation between termi- 
nologies (formalisms) of FDT induction and MSS 
structure function. 

In (Zaitseva & Levashenko 2016) the state vector
$x = (x_1, \ldots, x_n)$ is interpreted as the set of input
attributes $A = \{A_1, \ldots, A_n\}$, and the value of the structure
function $\phi(x)$ as the output attribute B for
FDT induction. Each input attribute (component
state) $A_i$ ($1 \le i \le n$) is measured by a group of values
ranging from 0 to $m_i - 1$, which agree with the
values of states of the i-th component:
$\{A_{i,0}, \ldots, A_{i,m_i-1}\}$. Every value $A_{i,j}$ of fuzzy attribute $A_i$ can
be considered as a fuzzy set. These sets create the
domain of a fuzzy attribute and their count is
considered as a domain size. Each value of each
instance is described by a membership function
$\mu_{A_{i,j}}(e) \in [0, 1]$, where $\sum_{j=0}^{m_i-1} \mu_{A_{i,j}}(e) = 1$ and e
represents an instance. These restrictions on the representation
of initial data are caused by the strategy for
FDT induction for mathematical representation
of the MSS. A FDT assumes that the input set
$A = \{A_1, \ldots, A_n\}$ is classified as one of the values of
output attribute B. Value $B_w$ of output attribute B
agrees with one of the system performance levels
and is defined as M values ranging from 0 to M - 1
($w = 0, \ldots, M - 1$).


For example, let us suppose that we have incom- 
pletely specified data with some likelihood about 
the simple service system (section 2.1). The data 
is represented in form of Table 2 (Zaitseva et al 
2017). This data can be used for the system struc- 
ture function construction based on likelihood 
of values (Table 3). But this structure function is 
incompletely specified (3 values are absent). There- 
fore, special methods or algorithms have to be used 
to “reconstruct” this MSS structure function. The 
method for structure function construction based 
on uncertain and incomplete data by FDT has 
been proposed in (Zaitseva & Levashenko 2016). 
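
One simple reading of "based on likelihood of values" is to take, for every attribute of an instance, the value with the largest membership degree. The sketch below follows that reading as an assumption; it is not the FDT-based method of the paper, and the helper names are hypothetical. The three instances correspond to the first, third and last rows of Table 2.

```python
def most_likely(memberships):
    """Index of the value with the largest membership degree."""
    return max(range(len(memberships)), key=memberships.__getitem__)


# Each instance: membership degrees for x1, x2, x3 and the performance level.
instances = [
    ([0.9, 0.1], [1.0, 0.0], [0.8, 0.2, 0.0], [1.0, 0.0, 0.0]),
    ([0.9, 0.1], [0.1, 0.9], [0.1, 0.8, 0.1], [0.0, 1.0, 0.0]),
    ([0.2, 0.8], [0.1, 0.9], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]),
]

# Membership degrees of every attribute must sum to 1 for each instance.
for instance in instances:
    for memberships in instance:
        assert abs(sum(memberships) - 1.0) < 1e-9

# Partially specified structure function: state vector -> performance level.
partial_phi = {}
for x1, x2, x3, level in instances:
    state_vector = (most_likely(x1), most_likely(x2), most_likely(x3))
    partial_phi[state_vector] = most_likely(level)
```

State vectors that never appear as the most likely reading of any instance stay undefined, which is how the unspecified entries of such a structure function arise.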

This method for structure function construc- 
tion based on uncertain and incomplete data by 
FDT includes three steps (Zaitseva & Levashenko 
2016): 


— collection of data into the repository according 
to requests of FDT induction; 

— representation of the system model in the form 
of a FDT that classifies components states 
according to the system performance levels; 

— construction of the structure function as a deci- 
sion table that is created by inducted FDT. 


According to this method, the structure function
is formed as a truth table that indicates the
system performance level for each possible combination
of component states. In this paper, we
propose to transform the inducted FDT into a MDD
for the MSS mathematical representation, which is
more suitable for the description of a MSS of
large dimension. In particular, we propose to use
the ordered FDT introduced in (Levashenko et al
2007) to create the MDD.


Table 2. Uncertain data for the simple service system.

Component states                              Performance levels
x1           x2           x3
0     1      0     1      0     1     2       0     1     2
0.9   0.1    1     0      0.8   0.2   0       1     0     0
1     0      0.9   0.1    0.1   0.9   0       1     0     0
0.9   0.1    0.1   0.9    0.1   0.8   0.1     0     1     0
1     0      0     1      0     0.1   0.9     0     1     0
0.1   0.9    1     0      0.9   0.1   0       1     0     0
0     1      1     0      0.1   0.8   0.1     0     1     0
0.1   0.9    0     1      1     0     0       1     0     0
0.1   0.9    0     1      0.1   0.9   0       0     0     1
0.2   0.8    0.1   0.9    0     0.1   0.9     0     0     1


Table 3. The incompletely specified structure function
of the simple service system.

Component states      x3
x1   x2               0    1    2
0    0                0    0    *
0    1                *    1    1
1    0                0    1    2
1    1                *    2    2


4 MULTI-VALUED DECISION 
DIAGRAM CONSTRUCTION BY 
FUZZY DECISION TREE 


A FDT can have different properties. One possible
type of FDT is the ordered FDT, which has the
same attribute at every node of a given level. This permits
the use of a predefined order of attributes in the analysis of new
samples. In this paper, the ordered FDT is used
for the construction of the MDD based on uncer- 
tain data. The proposed algorithm includes three 
steps and differs from the method proposed in 
(Zaitseva & Levashenko 2016) only by the last step: 


— collection of data into the repository according 
to requests of FDT induction; 

— representation of the system model in the form 
of the ordered FDT that classifies components 
states according to the system performance levels; 

— construction of the MDD by inducted FDT. 


4.1 Data collection and Fuzzy Decision Tree induction


Collection of data in the form of a repository is provided
by monitoring the values of system component
states and the system performance level. This
repository can be presented in the form of a table
whose columns agree with the input and output
attributes. This table is composed of n + 1 columns
associated with n input attributes and 1 output
attribute. The i-th column, for i = 1, ..., n + 1, is
divided into $m_i$ sub-columns. The j-th sub-column,
for j = 1, ..., $m_i$, agrees with the j-th value of the
attribute represented by the i-th column. Each row
of the repository corresponds to one instance of
collected data.

For example, the data about the service system 
in Table 2 satisfies the given conditions and can be 
interpreted as the repository. 

There are different algorithms for FDT induc- 
tion. The ordered FDT has been introduced by 
Levashenko et al (2007). In that paper, the algo- 
rithm for induction of ordered FDT based on 
cumulative information estimation of attributes 
for the selection of next node in tree’s induction 
has been proposed. 

The ordered FDT is inducted based on data in
the repository by application of the cumulative
information estimates $I(B; A_{i_1}, \ldots, A_{i_{q-1}}, A_{i_q})$, where
$A_{i_1}, \ldots, A_{i_{q-1}}$ are input attributes that were used at
previous levels of the FDT. FDT induction is based on
the determination of the expanded attribute $A_{i_q}$.
The selection criterion for the expanded
attribute $A_{i_q}$ is defined as (Levashenko et al 2007):


$$i_q = \arg\max \frac{I(B; A_{i_1}, \ldots, A_{i_{q-1}}, A_{i_q})}{\mathrm{Cost}(A_{i_q}) \times H(A_{i_q})}, \tag{12}$$


where $\mathrm{Cost}(A_{i_q})$ is an integrated measure that
covers the financial and temporal costs required to
define the value of $A_{i_q}$ for an instance, and
this value is defined a priori; $H(A_{i_q})$ is the cumulative
entropy of input attribute $A_{i_q}$.

The cumulative mutual information in output
attribute B about the attribute $A_{i_q}$ and the
sequence of attributes $A_{i_1}, \ldots, A_{i_{q-1}}$ reflects the
influence of attribute $A_{i_q}$ on the output attribute
B when the sequence $A_{i_1}, \ldots, A_{i_{q-1}}$ is known. The maximum
value $i_q$ in (12) determines the selection of the expanded
attribute $A_{i_q}$. This attribute will be associated
with a node of the ordered FDT.

Two tuning thresholds, α and β, are used in the algorithm
for induction of the ordered FDT (Levashenko et al
2007). A tree branch stops expanding either when the
frequency of the branch is below α or when more
than a fraction β of the instances left in the branch have
the same class label. These values are key parameters
needed to decide whether we have already arrived at a
leaf node or whether the branch should be expanded
further. Decreasing the parameter α and increasing
the parameter β allow us to build large FDTs. On the one
hand, large FDTs describe datasets in more detail.
On the other hand, these FDTs are very sensitive to
noise in the dataset. We empirically select parameters
near α = 0.10 and β = 0.80.
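
The stopping rule controlled by the two thresholds can be written as a single predicate. This is a minimal sketch under stated assumptions: the function name and argument layout are hypothetical, β is treated as a fraction between 0 and 1, and the way branch frequency and class fractions are computed from fuzzy memberships is left outside the sketch.

```python
def is_leaf(branch_frequency, class_fractions, alpha=0.10, beta=0.80):
    """Decide whether a branch of the ordered FDT stops expanding.

    branch_frequency: relative frequency of the instances reaching the branch.
    class_fractions:  share of each class label among those instances.
    The branch becomes a leaf if it is too rare (frequency below alpha)
    or sufficiently pure (one class holds more than beta of the instances).
    """
    return branch_frequency < alpha or max(class_fractions) > beta


rare_branch = is_leaf(0.05, [0.4, 0.3, 0.3])     # frequency below alpha
pure_branch = is_leaf(0.50, [0.88, 0.10, 0.02])  # dominant class above beta
open_branch = is_leaf(0.50, [0.5, 0.3, 0.2])     # neither condition holds
```

Lowering `alpha` or raising `beta` makes both conditions harder to meet, so more branches are expanded and the tree grows.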

The algorithm for the ordered FDT induc- 
tion based on cumulative information estimates 
(12) is considered in detail in (Levashenko et al 2007).
According to this algorithm, we inducted
the ordered FDT based on data in Table 2 for the 
mathematical representation of the simple service 
system (Figure 2). 

The ordered FDT in Figure 2 allows us to form 
the decision table for all possible combinations of 
component states (all state vectors). For example, 
for state vector x = (0 0 0) in Figure 2, the value of 
the output attribute is 0. This value is defined by the 
path from root node to leaf node A, , with the confi- 
dence of 0.880 without analysis of other attributes. 
The values for other state vectors are computed in 
a similar way. All calculated values of the output
attribute are equal to system states in Table 1. 
Figure 2. FDT of the simple service system (Table 2).


Therefore, the ordered FDT allows the structure function
to be represented uniquely. At the same time,
there are many methods and algorithms for the 
MSS evaluation based on a MDD. The calculation 
of availability and other measures for the system 
reliability evaluation are considered in (Mo et al 
2017, Xing & Dai 2008, Zaitseva & Levashenko
2008, Zaitseva et al 2013). So, let us consider the 
transformation of ordered FDT into MDD. 


4.2 Construction of multi-valued decision 
diagram from Fuzzy Decision Tree 


An important problem in MDD construction is the
ordering of the variables: the complexity of MDD 
depends on the variables order in MDD (Miller & 
Drechsler 2002). This problem for MDD is being 
actively investigated and some methods have been 
proposed for approximate decisions for the opti- 
mal ordering of variables in MDD (Stankovic et al 
2012, Popel & Drechsler 2003). At the same time 
the ordered FDT induction supposes the optimal 
(or quasi optimal) ordering of nodes. Therefore, 
the ordered FDT can be transformed into MDD 
with the optimal ordering of variables. 


The algorithm for the transformation of the 
ordered FDT into the MDD includes three steps: 


— the transformation of the ordered FDT into 
a decision tree by the defuzzification of the
ordered FDT nodes;

— the merger of the same leaf nodes; 

— the tree's attributes substitution by appropriate 
variables. 


Let us illustrate these steps by an example of 
the MDD construction for the service system 
(Table 2) based on the ordered FDT of this system 
(Figure 2). According to the proposed algorithm, 
the first step is defuzzification of the ordered FDT. 
It can be implemented by construction of decision 
tree based on maximal values of thresholds α and
β. The decision tree for the investigated system is
shown in Figure 3. 

The second step is the merger of the same leaves,
which is illustrated in Figure 4. The result of the
last step, which is the substitution of the tree's attributes
by appropriate variables, is shown in Figure 5
as the MDD of the structure function of the simple
service system.


Figure 3. The decision tree of the simple service system
formed from the ordered FDT in Fig. 2.


Figure 4. The merger of equal leaf nodes of decision
tree in Fig. 3.


Figure 5. The MDD for the structure function of the
simple service system.

The comparison of MDDs in Figure 1 and 
Figure 5 shows that the MDD constructed based 
on the ordered FDT has 5 non-terminal nodes and 
MDD constructed based on the truth table consists 
of 7 non-terminal nodes. This simple example 
illustrates that the MDD can be constructed for 
MSS by incompletely specified data and that the 
constructed MDD has better ordering of variables 
than the MDD constructed in section 2.2. 


5 EVALUATION 


In this section, the efficiency of the algorithm for 
the construction of the MDD of the MSS structure 
function based on uncertain data is considered. We 
use structure functions of three systems analyzed 
in (Zaitseva et al 2017): outline of an offshore 
electrical power generation system (Natvig 2011); 
army battle plan (Boedigheimer & Kapur 1994); a 
computer system with a memory subsystem sub- 
ject to competing failure isolation and propagation 
effect (Xing & Levitin 2011). Basic characteristics 
of these systems are shown in Table 4. Two types 
of investigation were implemented. 

The first of them is the influence of incomplete- 
ness of data on the accuracy of the MSS repre- 
sentation. For this investigation, all integer values 
representing components states and performance 
levels of the system were transformed to values 
with possibilities (Zaitseva et al 2017). We indi- 
cated known value with the possibility 1.00 and 
other values with possibilities 0.00 for components 
states and performance levels.


Table 4. Characteristics of investigated systems.

                           Natvig    Xing & Levitin    Boedigheimer &
                           (2011)    (2011)            Kapur (1994)
Number of state vectors    243       512               108
Number of components       5         5                 4


Figure 6. The error rate for the construction of the
structure function for three systems (horizontal axis: the
incompleteness of structure functions; vertical axis: error
rate; curves: Natvig (2011), Xing & Levitin (2011),
Boedigheimer & Kapur (1994) and the initial
misclassification error).


To model incompleteness of data, the structure functions of the
MSSs were transformed into incompletely specified functions by
randomly deleting some of their values. The proportion of deleted
values was varied from 5% to 90%. The transformed functions were
interpreted as incompletely specified and restored by the ordered
FDTs. The restored structure function and the initial completely
specified function were compared, and the error rate was
calculated as the ratio of erroneous values of the structure
function to the dimension of the unspecified part of the function.
The experiments were done for every structure function, and the
average values over all functions with the specified range of
deleted values were computed. The error rate for the investigated
systems is less than the initial misclassification error (the
maximal error rate), which is indicated by the red line in
Figure 6. The evaluation showed that the error increases
significantly if the unspecified part is less than 10% or
more than 85% for these systems (Figure 6).
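
The error-rate protocol described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the names `delete_values` and `error_rate` are hypothetical, and the ordered-FDT restoration is replaced by a trivial stand-in that fills every gap with a constant value.

```python
import random

def delete_values(values, fraction, rng):
    """Return (incomplete copy, deleted indices); deleted entries become None."""
    n_deleted = round(fraction * len(values))
    deleted = sorted(rng.sample(range(len(values)), n_deleted))
    incomplete = list(values)
    for i in deleted:
        incomplete[i] = None
    return incomplete, deleted

def error_rate(original, restored, deleted):
    """Erroneous restored values divided by the size of the unspecified part."""
    if not deleted:
        return 0.0
    wrong = sum(1 for i in deleted if restored[i] != original[i])
    return wrong / len(deleted)

# Toy structure function given as a list of performance levels (assumed data).
original = [0, 0, 1, 1, 1, 2, 2, 2]
incomplete, deleted = delete_values(original, 0.5, random.Random(0))

# Stand-in for the ordered-FDT restoration: fill every gap with level 1.
restored = [1 if v is None else v for v in incomplete]
rate = error_rate(original, restored, deleted)
```

In an actual experiment the stand-in would be replaced by the FDT-based restoration, and the rate would be averaged over repeated random deletions for each proportion.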

The second type of evaluation was the analysis of the MDD, in
particular the number of non-terminal nodes of MDDs constructed
by different approaches. We compared the MDD without special
ordering of the variables of the structure function, the MDD
formed based on FDT (this algorithm has been considered in
(Zaitseva et al 2017)) and the MDD constructed by ordered FDT.
This analysis was implemented for 16 structure functions of
coherent systems (number of performance levels M = 2 and number
of variables n = 5). The evaluation of these MDDs showed that the
MDDs created based on ordered FDT have fewer non-terminal nodes
(Figure 7).


Figure 7. The efficiency of MDD constructed based on
the truth table and FDT.


6 CONCLUSION 


The main contribution of this paper is in develop- 
ment of the new algorithm for the construction of 
the structure function based on incompletely speci- 
fied and ambiguous data in form of the MDD. It 
directly supports MSS reliability estimation and 
can cope with uncertain data for the analysis of 
system reliability/availability. 

The analysis of the error rate of the proposed
algorithm for the construction of the structure
function based on ordered FDT showed that the
method has good efficiency, similar to that of the
method for construction of the structure function
based on unordered FDT (Zaitseva et al 2017). This
method is acceptable for incomplete data when the
incompleteness of the initial data lies between 10%
and 85%. Within this interval of incompleteness, the
structure function constructed by the proposed method
has a lower error rate than the maximal error rate.


ABSTRACT: Safety and risk analyses rely on models. These models have several important
characteristics. They are event-oriented. The system under study changes state when events, such as failure,
hazard, repair and so on, occur. They are probabilistic. The exact moment of the occurrence of a failure
is in essence unpredictable. They are discrete. States are represented by means of variables that take their
values in finite, usually very small, domains. The most widely used modeling formalisms such as Fault
Trees, Block Diagrams and Event Trees rely on Boolean algebra. There are cases however where binary
states are not sufficient. For instance, it is sometimes necessary to represent the level of degradation
of a component, the quality of a signal, and so on. This kind of model can be easily represented with
AltaRica 3.0, a high level modeling language dedicated to safety analyses. AltaRica 3.0 is at the core of
the OpenAltaRica project, whose aim is to develop a complete set of assessment tools for the language,
including among others compilers to Fault Trees and Markov Chains, stochastic and stepwise simulators.
In this article we study how the notion of prime implicants can be extended to finite domain calculus. We
discuss the efficient implementation of finite domain calculus and show how these results can be applied
to simplify Fault Trees automatically generated from AltaRica 3.0 models. This simplification in turn
significantly improves the efficiency of the assessment of the automatically generated Fault Trees.


1 INTRODUCTION


Risk analysis relies on models. These models have
several important characteristics:

• They are event-oriented. The system under study
changes state when events, such as failure,
hazard, repair and so on, occur.

• They are probabilistic. The exact moment
of the occurrence of a failure is in essence
unpredictable.

• They are discrete. States are represented by
means of variables that take their values in
finite, usually very small, domains.

The last characteristic is pragmatic: given the
difficulty of designing models and the computational
complexity of the calculation of indicators, discrete
abstractions are a necessary tradeoff. Hence the role
of Boolean algebra in Reliability, Availability,
Maintainability, Safety engineering. The most widely used
modeling formalisms such as Fault Trees, Block
Diagrams and Event Trees rely on Boolean algebra.
There are cases however where binary states are not
sufficient. For instance, it is sometimes necessary
to represent the level of degradation of a component,
the quality of a signal, and so on. This kind of
model can be easily represented with AltaRica 3.0,
a high level modeling language dedicated to safety
analyses (Prosvirnova, Batteux, Brameret, Cherfi,
Friedlhuber, Roussel, & Rauzy 2013). AltaRica 3.0
is at the core of the OpenAltaRica project¹, whose
aim is to develop a complete set of assessment tools
for the language, including among others compilers
to Fault Trees (Prosvirnova & Rauzy 2015) and
Markov Chains (Brameret, Rauzy, & Roussel 2015),
stochastic and stepwise simulators (Aupetit, Batteux,
Rauzy, & Roussel 2015).

¹ See https://www.openaltarica.fr.

In this article we study how the notion of prime
implicants can be extended to finite domain calculus
and how to encode it efficiently. The contribution
of this article is thus multiple. First, we
present how the notion of prime implicants can be
extended to finite domain calculus. Second, we discuss
the efficient implementation of finite domain
calculus. Finally we show how these results can be
applied to simplify Fault Trees automatically generated
from AltaRica 3.0 models.

The remainder of this article is organized as 
follows. Section 2 describes a motivating example. 
Section 3 presents a theoretical work about finite 
domain calculus and discusses its implementa- 
tion. Section 4 shows the application of the finite 
domain calculus to the simplification of Fault Trees 
automatically generated from AltaRica 3.0 models. 
Section 5 presents some experimental results using 
the motivating example. Section 6 concludes this 
article. 


2 MOTIVATING EXAMPLE 


Consider a parametric block diagram use case (see
Figure 1) with three parameters:


• s, the number of blocks in series;
• p, the number of parallel blocks;
• q, the level of recursivity (depth).


These relatively simple but large safety models 
can be easily represented in AltaRica 3.0 and han- 
dled simply and efficiently by means of the Fault 
Tree compilation tool chain. 

Note that without losing the efficiency of the
assessment, in AltaRica 3.0, it is possible to represent
multi-state blocks, e.g. consider the quality of
data with the values ok, lost or erroneous, or the
level of degradation with the values ok, degraded
or failed.

This use case is both representative of a class of
industrial models and parametric to show the scalability
of the approach. We shall use it throughout
the article to illustrate the advances in the
simplification of Fault Trees.


Figure 1. Parametric block diagram use case.


3 FINITE DOMAIN CALCULUS 


3.1 Definitions 


Let $\Xi = \{X_1, X_2, \ldots, X_n\}$ be a finite set of variables.
Each $X_i$ takes its values in a finite domain (a finite
set of constants) denoted as $\mathrm{dom}(X_i)$. The set of well
formed formulas over $\Xi$ is the smallest set such that:


• The two Boolean constants 0 (false) and 1 (true)
are formulas.

• If $X$ is a variable and $c$ is a constant then $X = c$ is
a formula. Such a formula $X = c$ is called a literal
and makes sense only if $c \in \mathrm{dom}(X)$.

• If $f$ and $g$ are formulas, then so are $f + g$ (disjunction),
$f * g$ (conjunction), and $\neg f$ (negation).


We assume that the negation ($\neg$) has a higher
priority than the conjunction ($*$), which has a
higher priority than the disjunction ($+$).

A product is a set of literals interpreted as the
conjunction of its elements. A product is said to be
fundamental if it does not contain two literals built over
the same variable. We shall consider only fundamental
products. The empty product is denoted 1.

A minterm is a product that contains a literal for
each variable of $\Xi$. As we shall see, minterms play
a fundamental role in the finite domain calculus
for they are the atoms of the underlying Boolean
algebra.

A sum of products is a set of products interpreted
as the disjunction of its elements. The empty sum
of products is denoted 0.

A variable assignment of $\Xi$ is a function $\sigma: \Xi \rightarrow
\mathrm{dom}(X_1) \times \mathrm{dom}(X_2) \times \ldots \times \mathrm{dom}(X_n)$ that associates
to each variable $X_i$ its value from $\mathrm{dom}(X_i)$, $i = 1, \ldots, n$.

Let $f$ and $g$ be formulas and $\sigma$ be a variable
assignment over $\Xi$. The value of $\sigma(f)$ is calculated
recursively as follows:


• $\sigma(1) = 1$, $\sigma(0) = 0$;
• $\sigma(X = c) = 1$ if $\sigma(X) = c$ and 0 otherwise;
• $\sigma(f + g) = \max(\sigma(f), \sigma(g))$, $\sigma(f * g) = \min(\sigma(f), \sigma(g))$,
$\sigma(\neg f) = 1 - \sigma(f)$.

The variable assignment $\sigma$ satisfies the formula
$f$ if $\sigma(f) = 1$, otherwise it falsifies it.

There is a one to one correspondence between
minterms and variable assignments: the minterm $p$
corresponds to the variable assignment $\sigma$ if for each
variable $X \in \Xi$, $X = c \in p$ if and only if $\sigma(X) = c$.
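
The recursive valuation above can be illustrated by a minimal sketch. The tuple encoding of formulas and the function name `evaluate` are assumptions made for this illustration; they are not part of the article.

```python
def evaluate(formula, sigma):
    """Value (0 or 1) of a finite domain formula under the assignment sigma.

    Assumed encoding: ("const", 0 or 1), ("lit", variable, constant),
    ("or", f, g), ("and", f, g), ("not", f); sigma maps variables to constants.
    """
    kind = formula[0]
    if kind == "const":
        return formula[1]
    if kind == "lit":
        _, variable, constant = formula
        return 1 if sigma[variable] == constant else 0
    if kind == "or":
        return max(evaluate(formula[1], sigma), evaluate(formula[2], sigma))
    if kind == "and":
        return min(evaluate(formula[1], sigma), evaluate(formula[2], sigma))
    if kind == "not":
        return 1 - evaluate(formula[1], sigma)
    raise ValueError(f"unknown formula kind: {kind!r}")

# (X = ok) + (Y = lost) * -(X = failed)
f = ("or",
     ("lit", "X", "ok"),
     ("and", ("lit", "Y", "lost"), ("not", ("lit", "X", "failed"))))

satisfied = evaluate(f, {"X": "degraded", "Y": "lost"})
falsified = evaluate(f, {"X": "failed", "Y": "lost"})
```

Here the first assignment satisfies the formula and the second falsifies it.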


3.2 Implication, equivalence, properties 


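Let $\Xi = \{X_1, \ldots, X_n\}$ be a finite set of finite domain
variables. Let $f$ and $g$ be two formulas built over $\Xi$.
$f$ implies $g$, which is denoted as $f \Rightarrow g$, if any variable
assignment that satisfies $f$ satisfies $g$ as well. $f$
is equivalent to $g$, which we denote as $f \Leftrightarrow g$, if both
$f \Rightarrow g$ and $g \Rightarrow f$.

The usual properties of Boolean algebras hold
for the finite domain calculus:

Neutral element: $f + 0 \Leftrightarrow 0 + f \Leftrightarrow f$ and $f * 1 \Leftrightarrow 1 * f \Leftrightarrow f$

Absorbing element: $f + 1 \Leftrightarrow 1 + f \Leftrightarrow 1$ and $f * 0 \Leftrightarrow 0 * f \Leftrightarrow 0$

Idempotence: $f + f \Leftrightarrow f$ and $f * f \Leftrightarrow f$

Commutativity: $f + g \Leftrightarrow g + f$ and $f * g \Leftrightarrow g * f$

Associativity: $f + (g + h) \Leftrightarrow (f + g) + h$ and $f * (g * h) \Leftrightarrow (f * g) * h$

Distributivity: $f * (g + h) \Leftrightarrow f * g + f * h$ and $f + (g * h) \Leftrightarrow (f + g) * (f + h)$

Double negation: $\neg \neg f \Leftrightarrow f$

de Morgan's law: $\neg(f + g) \Leftrightarrow \neg f * \neg g$ and $\neg(f * g) \Leftrightarrow \neg f + \neg g$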


3.3 Negation 


The real difference between the propositional and 
finite domain calculi stands in the negation. 

Let $\Xi$ be a finite set of variables, let $X$ be a
variable from $\Xi$, and finally let $c$ be a constant of
$\mathrm{dom}(X)$. Then, $\neg(X = c) \Leftrightarrow \sum_{d \in \mathrm{dom}(X),\, d \neq c} (X = d)$.

Theorem 1 (Elimination of negations): For any
formula of the finite domain calculus, there exists an
equivalent formula involving no negation.

Note that any formula is equivalent to the sum 
of minterms that satisfies it, which is a first way to 
demonstrate the theorem. A more syntactic proof 
consists in pushing negations down to literals, 
thanks to de Morgan’s law, and then to transform 
negative literals as shown above. 
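
The syntactic proof can be turned into a small procedure. The following is a minimal sketch, assuming a tuple encoding of formulas ("const", "lit", "or", "and", "not") that is not part of the article; the function names are hypothetical. Negations are pushed down with de Morgan's law and each negative literal is replaced by the disjunction of the other literals of its variable.

```python
from itertools import product

def eliminate_negations(formula, dom, negate=False):
    """Return an equivalent formula without "not"; dom maps variables to constants."""
    kind = formula[0]
    if kind == "const":
        return ("const", 1 - formula[1]) if negate else formula
    if kind == "lit":
        if not negate:
            return formula
        _, variable, constant = formula
        result = ("const", 0)  # empty disjunction
        for d in dom[variable]:
            if d != constant:
                literal = ("lit", variable, d)
                result = literal if result == ("const", 0) else ("or", result, literal)
        return result
    if kind == "not":
        return eliminate_negations(formula[1], dom, not negate)
    left = eliminate_negations(formula[1], dom, negate)
    right = eliminate_negations(formula[2], dom, negate)
    if negate:  # de Morgan: swap the connective under a negation
        kind = "and" if kind == "or" else "or"
    return (kind, left, right)

def evaluate(formula, sigma):
    """Value of a formula under the assignment sigma (used only for checking)."""
    kind = formula[0]
    if kind == "const":
        return formula[1]
    if kind == "lit":
        return int(sigma[formula[1]] == formula[2])
    if kind == "not":
        return 1 - evaluate(formula[1], sigma)
    values = [evaluate(formula[1], sigma), evaluate(formula[2], sigma)]
    return max(values) if kind == "or" else min(values)

def contains_negation(formula):
    return formula[0] == "not" or any(
        contains_negation(sub) for sub in formula[1:] if isinstance(sub, tuple))

dom = {"X": ["ok", "degraded", "failed"], "Y": ["ok", "lost"]}
f = ("not", ("or", ("lit", "X", "ok"), ("not", ("lit", "Y", "lost"))))
g = eliminate_negations(f, dom)

# Equivalence is checked by enumerating every variable assignment.
variables = sorted(dom)
equivalent = all(
    evaluate(f, dict(zip(variables, values))) == evaluate(g, dict(zip(variables, values)))
    for values in product(*(dom[v] for v in variables)))
```

The enumeration at the end is exponential in the number of variables and is only meant as a check on small examples.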


3.4 Subsumption, resolution 


Let $p$ and $q$ be two products built over $\Xi$. We say
that $p$ subsumes $q$ if $q \Rightarrow p$, i.e. if and only if any
literal of $p$ is also a literal of $q$. If $p$ subsumes $q$,
then $p + q \Leftrightarrow p$.

Let $X$ be a variable of $\Xi$, let $\mathrm{dom}(X) = \{c_1, c_2, \ldots, c_k\}$
and let $p_1, \ldots, p_k$ be $k$ products in which $X$ does
not show up. Let $r$ be the product $p_1 * p_2 * \ldots * p_k$.
Then the following implication holds:
$(X = c_1) * p_1 + \ldots + (X = c_k) * p_k \Leftarrow r$.

The product $r$ is called the resolvent of the products
$(X = c_1) * p_1, \ldots, (X = c_k) * p_k$.

In the case where there is a product $p_i$ such that
$p_i = r$, the following equivalence holds:


$(X = c_1) * p_1 + \ldots + (X = c_k) * p_k$
$\Leftrightarrow (X = c_1) * p_1 + \ldots + (X = c_i) * p_i + \ldots + (X = c_k) * p_k + r$

3.5 Shannon normal form 


Let $X$ be a variable of $\Xi$, let $c$ be a constant of
$\mathrm{dom}(X)$ and finally let $f$ be a formula built over $\Xi$.


There exist two formulas $f_1$ and $f_0$ in which the
atom $(X = c)$ does not show up such that:


$f \Leftrightarrow (X = c) * f_1 + f_0$


The above representation is called the pivotal
decomposition of $f$ with respect to $X$ and $c$.

Assume we are given an (arbitrary) order $<$ over
the variables of $\Xi$ and over the constants of the
domain of the variables of $\Xi$. The set of formulas
in Shannon Normal Form is defined inductively as
follows:


• The two constants 0 and 1 are in Shannon Normal
Form.

• If $f$ and $g$ are two formulas in Shannon Normal
Form, $X$ is a variable and $c$ is a constant of
$\mathrm{dom}(X)$, the formula $(X = c) * f + g$ is in Shannon
Normal Form if $X$ does not show up in $f$ and, for
every literal $(Y = d)$ showing up in $g$, either
$X < Y$, or $X = Y$ and $c < d$.


3.6 Representation theorem 


Let $\Xi$ be a finite set of finite domain variables. Let
$X$ be a variable of $\Xi$, let $c$ be a constant of $\mathrm{dom}(X)$
and finally let $f = (X = c) * f_1 + f_0$ be a formula in
Shannon Normal Form built over $\Xi$.

In the above representation we can assume without
loss of generality that:


• $f_1 \neq 0$, as $(X = c) * 0 + f_0 \Leftrightarrow f_0$
• $f_0 \neq 1$, as $(X = c) * f_1 + 1 \Leftrightarrow 1$


From now on, we shall assume that these two
simplification rules are systematically applied.

Theorem 2 (Representation): For any formula of
the finite domain calculus, there exists at least one
equivalent formula in Shannon Normal Form.

In general, this equivalent formula is not unique.
We shall see that two of the formulas that represent
a given sum of products are of special interest:
the first one can be interpreted as a sum of disjoint
products, the other one as the set of prime implicants.
These two formulas are extrema in a sense
we shall explain.

A formula in Shannon Normal Form can be
interpreted as a sum of products. Namely,


• $\mathrm{SumOfProducts}[0] = 0$;

• $\mathrm{SumOfProducts}[1] = 1$;

• $\mathrm{SumOfProducts}[(X = c) * f + g] = \{(X = c) * p;\ p \in \mathrm{SumOfProducts}[f]\} \cup \mathrm{SumOfProducts}[g]$


Theorem 3 (Sums-of-Products): Formulas in Shannon Normal
Form are in one-to-one correspondence with Sums-of-Products
(for a given order of variables and constants).
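
The interpretation of a formula in Shannon Normal Form as a sum of products can be sketched as below. The encoding is an assumption of this illustration: a formula is 0, 1, or a tuple (variable, constant, f, g) standing for (variable = constant) * f + g, and a product is a frozenset of (variable, constant) pairs. The example formula is the one used later in Section 4.3.

```python
def sum_of_products(node):
    """Set of products encoded by a formula in Shannon Normal Form."""
    if node == 0:
        return set()              # the empty sum of products
    if node == 1:
        return {frozenset()}      # the sum reduced to the empty product
    variable, constant, high, low = node
    with_literal = {p | {(variable, constant)} for p in sum_of_products(high)}
    return with_literal | sum_of_products(low)

f1 = ("B2.Out", "ko", 1, ("B2.Out", "ok", 1, 0))
f2 = ("B2.Out", "ko", 1, 0)
f = ("B1.Out", "ko", f1, ("B1.Out", "ok", f2, 0))

products = sum_of_products(f)
```

For this formula the function returns the three products (B1.Out = ko) * (B2.Out = ko), (B1.Out = ko) * (B2.Out = ok) and (B1.Out = ok) * (B2.Out = ko).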


3.7 Factors and cofactors 


The factor and cofactor of a formula $f$ with respect
to a variable $X$, denoted respectively as $f|X$ and
$f \sim X$, are syntactic operations that select respectively
the products of $f$ that contain $X$ and the
products of $f$ that do not contain $X$. The factor $f|X$
is defined recursively as follows:


• $0|X = 0$ and $1|X = 0$ (the empty product does not contain $X$)

• $[(X = c) * f + g]|X = (X = c) * f + [g|X]$

• $[(Y = c) * f + g]|X = 0$ if $X < Y$

• $[(Y = c) * f + g]|X = (Y = c) * [f|X] + [g|X]$ if $X > Y$


The cofactor $f \sim X$ is defined recursively as
follows:


• $0 \sim X = 0$ and $1 \sim X = 1$

• $[(X = c) * f + g] \sim X = g \sim X$

• $[(Y = c) * f + g] \sim X = (Y = c) * f + g$ if $X < Y$

• $[(Y = c) * f + g] \sim X = (Y = c) * [f \sim X] + [g \sim X]$ if
$X > Y$


3.8 Logical operations 


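Let $\Xi$ be a finite set of finite domain variables. Let
$X$ and $Y$ be two variables of $\Xi$ with $X < Y$. Let $c$,
$d$ and $e$ be three constants such that $c, d \in \mathrm{dom}(X)$
with $c < d$, and $e \in \mathrm{dom}(Y)$. Finally let $f = (X = c) * f_1 + f_0$,
$g = (X = c) * g_1 + g_0$, $h = (X = d) * h_1 + h_0$ and
$l = (Y = e) * l_1 + l_0$ be four formulas built over $\Xi$
in Shannon Normal Form. The following equivalences
hold and they are used as recursive equations
to perform logical operations on formulae in
Shannon Normal Form:


• $f + g \Leftrightarrow (X = c) * [f_1 + g_1] + [f_0 + g_0]$

• $f + h \Leftrightarrow (X = c) * f_1 + [f_0 + h]$

• $f + l \Leftrightarrow (X = c) * f_1 + [f_0 + l]$

• $f * g \Leftrightarrow (X = c) * [f_1 * g_1 + f_1 * (g_0 \sim X) + (f_0 \sim X) * g_1] + [f_0 * g_0]$

• $f * h \Leftrightarrow (X = c) * [f_1 * (h \sim X)] + [f_0 * h]$

• $f * l \Leftrightarrow (X = c) * [f_1 * l] + [f_0 * l]$

• $\neg f \Leftrightarrow [\sum_{a \in \mathrm{dom}(X),\, a \neq c} (X = a) + \neg f_1] * \neg f_0$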


3.9 Subsumption 


As we shall see, it is of interest to remove from a
formula $f$ all the products that are subsumed by a
product of a formula $g$. This operation, denoted
$f \div g$, can be defined by means of the following
recursive equations. Let $\Xi$ be a finite set of finite
domain variables. Let $X$ and $Y$ be two variables of
$\Xi$ ($X < Y$), let $c$, $d$ and $e$ be three constants such that
$c, d \in \mathrm{dom}(X)$, $c < d$, and $e \in \mathrm{dom}(Y)$. Then:


• $f \div 0 = f$, $f \div 1 = 0$, $0 \div g = 0$ and $1 \div g = 1$

• $[(X = c) * f_1 + f_0] \div [(X = c) * g_1 + g_0]$
$= (X = c) * [(f_1 \div g_1) \div g_0] + [f_0 \div g_0]$

• $[(X = c) * f_1 + f_0] \div [(X = d) * g_1 + g_0]$
$= (X = c) * [f_1 \div g_0] + [f_0 \div [(X = d) * g_1 + g_0]]$

• $[(X = c) * f_1 + f_0] \div [(Y = e) * g_1 + g_0]$
$= (X = c) * [f_1 \div [(Y = e) * g_1 + g_0]]$
$+ [f_0 \div [(Y = e) * g_1 + g_0]]$

• $[(X = d) * f_1 + f_0] \div [(X = c) * g_1 + g_0]$
$= [(X = d) * f_1 + f_0] \div g_0$

• $[(Y = e) * f_1 + f_0] \div [(X = c) * g_1 + g_0]$
$= [(Y = e) * f_1 + f_0] \div g_0$


3.10 Prime implicants 


Let $\Xi$ be a finite set of finite domain variables with
an order over variables and constants. Let $f$ and $p$ be
respectively a formula and a product built over $\Xi$.


• $p$ is an implicant of $f$ if $p \Rightarrow f$.

• $p$ is a prime implicant of $f$ if it is an implicant of
$f$ and no strict sub-product (subsuming product)
of $p$ is.


The set of prime implicants of $f$ is denoted
$PI[f]$.
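
The two definitions above can be checked directly on small examples by exhaustive enumeration. The sketch below is only such a brute-force check, exponential in the number of variables; it is not the decomposition algorithm of this section, and the function name and the encoding of products as frozensets of (variable, constant) pairs are assumptions.

```python
from itertools import product

def prime_implicants(dom, satisfies):
    """Prime implicants of a formula given as a predicate over assignments.

    dom maps each variable to its constants; satisfies(sigma) tells whether
    the assignment sigma (a dict) satisfies the formula.
    """
    variables = sorted(dom)
    assignments = [dict(zip(variables, values))
                   for values in product(*(dom[v] for v in variables))]
    # A fundamental product leaves each variable free (None) or fixes it.
    choices = [[None] + list(dom[v]) for v in variables]
    implicants = []
    for pick in product(*choices):
        p = frozenset((v, c) for v, c in zip(variables, pick) if c is not None)
        covered = [s for s in assignments if all(s[v] == c for v, c in p)]
        if all(satisfies(s) for s in covered):
            implicants.append(p)          # p => f
    # Keep the implicants that have no strict sub-product among the implicants.
    return {p for p in implicants if not any(q < p for q in implicants)}

dom = {"B1.Out": ["ko", "ok"], "B2.Out": ["ko", "ok"]}

def out_is_ko(sigma):
    # Two blocks in series: the output is ko unless both blocks are ok.
    return not (sigma["B1.Out"] == "ok" and sigma["B2.Out"] == "ok")

primes = prime_implicants(dom, out_is_ko)
```

For the series example of Section 4.3 this yields the two products (B1.Out = ko) and (B2.Out = ko).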

Theorem 4 (Decomposition of Prime Implicants):
Let $f$ be a formula in Shannon Normal
Form. Then $f = (X = c_1) * f_1 + ((X = c_2) * f_2 + \ldots + ((X = c_k) * f_k + f_0) \ldots)$
for some constants $c_1, \ldots, c_k$ from
$\mathrm{dom}(X)$ and some formulas $f_1, \ldots, f_k, f_0$ in Shannon
Normal Form in which $X$ does not occur.

Let $h = (f_1 * f_2 * \ldots * f_k) + f_0$. Then, the set of
prime implicants of $f$, denoted by $PI[f]$, is calculated
as follows, where $\div$ is the subsumption removal operation
of Section 3.9:


$PI[f] = \{(X = c_1) * p;\ p \in PI[f_1] \div PI[h]\}$
$\cup \ldots$
$\cup \{(X = c_k) * p;\ p \in PI[f_k] \div PI[h]\}$
$\cup PI[h]$


The decomposition theorem gives an algorithm
to calculate for any formula $f$ in Shannon Normal
Form an equivalent formula $h$ such that:


$g = \mathrm{SumOfProducts}[h] = PI[f]$


Because all possible resolutions and subsumptions
have been performed, $g$ can be considered as
the most simplified form of $f$.

Conversely, we may want to transform $f$
into an equivalent sum of disjoint products so as to be
able to calculate the exact probability of $f$. Disjoining
the products encoded by $f$ is performed by the dual
operation of calculating resolvents.

Let $f = (X = c_1) * f_1 + ((X = c_2) * f_2 + \ldots + ((X = c_k) * f_k + f_0) \ldots)$.
Assume that $\mathrm{dom}(X) = \{c_1, \ldots, c_k\}$
(if some constant $c_i$ of $\mathrm{dom}(X)$ is missing we can
always add the term $(X = c_i) * 0$). Then, $f$ is equivalent
to the following formula:


$(X = c_1) * [f_1 + f_0] + ((X = c_2) * [f_2 + f_0] + \ldots + ((X = c_k) * [f_k + f_0] + 0) \ldots)$.

By applying this transformation recursively, we
get a sum of disjoint products, which is also unique,
for a given order on variables and constants.


3.11 Diagrammatic representation


The idea is to represent sums of products in Shannon
Normal Form by means of a variant of Bryant's
Binary Decision Diagrams (Bryant 1992), i.e. by
means of Directed Acyclic Graphs with two types
of nodes:


• Leaves, that are labeled with either 0 or 1.

• Internal nodes, that are labeled with a variable
$X$ and a constant $c$ of $\mathrm{dom}(X)$ and that have
two out-edges called the 1-outedge and the
0-outedge. Such a node represents the formula
$(X = c) * f + g$, where $f$ and $g$ are the formulas
represented respectively by the node pointed
by the 1-outedge and the node pointed by the
0-outedge.


The Shannon Diagram representing a formula is 
always built bottom-up. Nodes are maintained into a 
unique table (and accessed by means of a hashtable). 
In this way, for any formula f, there is at most one 
node representing fin the table. Checking the equiv- 
alence of two formulas is thus performed in con- 
stant time once their Shannon Diagrams are built. 
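
The unique table can be illustrated by a minimal hash-consing sketch. The class name `UniqueTable`, its method `node` and the use of integer identifiers are assumptions of this illustration; the two shortcuts in `node` are the simplification rules of Section 3.6.

```python
class UniqueTable:
    """Hash-consed store of Shannon Diagram nodes, identified by integers."""

    ZERO, ONE = 0, 1  # identifiers of the two leaves

    def __init__(self):
        self._ids = {}               # (variable, constant, high, low) -> identifier
        self._nodes = [None, None]   # slots 0 and 1 are reserved for the leaves

    def node(self, variable, constant, high, low):
        """Identifier of the node for (variable = constant) * high + low.

        high and low are identifiers of nodes that already exist, which is
        why diagrams are built bottom-up.
        """
        if high == self.ZERO:        # (X = c) * 0 + g  <=>  g
            return low
        if low == self.ONE:          # (X = c) * f + 1  <=>  1
            return self.ONE
        key = (variable, constant, high, low)
        if key not in self._ids:
            self._ids[key] = len(self._nodes)
            self._nodes.append(key)
        return self._ids[key]

table = UniqueTable()
first = table.node("B2.Out", "ko", UniqueTable.ONE, UniqueTable.ZERO)
second = table.node("B2.Out", "ko", UniqueTable.ONE, UniqueTable.ZERO)
same_node = first == second  # constant-time comparison of identifiers
```

Building the same formula twice returns the same identifier, so comparing two diagrams reduces to comparing two integers.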


4 APPLICATION 


One of the possible applications of the finite domain
calculus presented above is the simplification of
Fault Trees automatically generated from AltaRica
3.0 models. AltaRica 3.0 is an event-based high level
modeling language dedicated to Safety Analyses
(Prosvirnova, Batteux, Brameret, Cherfi, Friedlhuber,
Roussel, & Rauzy 2013). Its semantics is based
on Guarded Transitions Systems (Rauzy 2008).


4.1 Guarded transitions systems


A Guarded Transitions System (GTS) $G$ is a quintuple
$\langle V, E, T, A, \iota \rangle$, where:


• $V = S \uplus F$ is a set of variables, divided into
two disjoint sets: a set $S$ of state variables and a
set $F$ of flow variables.

• $E$ is a set of events.

• $T$ is a set of transitions. A transition is a triple
$t = \langle e, G, P \rangle$, where $e$ is an event from $E$, $G$ is a
Boolean expression built over variables from $V$
and called the guard of the transition, and $P$ is
an instruction built over $V$ and called the action
or the post-condition of the transition.

• $A$ is an assertion (i.e. an instruction built over $V$).

• $\iota$ is the initial (or default) assignment of variables
of $V$.


A GTS $G = \langle V, E, T, A, \iota \rangle$ is an implicit
representation of a labeled Kripke structure, i.e. a graph
$\Gamma = (\Sigma, \Theta)$, where

• the set of nodes $\Sigma$ represents the variable
assignments (of $V$), and
• $\Theta$ is the set of edges labeled by the events from $E$.


Instructions of GTS are defined recursively as
follows:


• "skip" is an empty instruction.

• If $v$ is a variable and $Exp$ an expression, then
"$v = Exp$" is an instruction (called "assignment").

• If $C$ is a Boolean expression and $I$ is an instruction,
then "if $C$ then $I$" is an instruction (called
"conditional assignment").

• If $I_1$ and $I_2$ are two instructions, then so is "$I_1; I_2$"
(called "parallel composition").


We shall consider two types of instructions:
the "Actions", which are instructions in which
the left members of assignments are only state
variables, and the "Assertions", which are instructions
in which the left members of assignments are only flow
variables.

Let us denote by $\tau = \mathrm{Propagate}(A, \iota, \sigma)$ the variable
assignment obtained after applying the assertion $A$
to the variable assignment $\sigma$, i.e. after the calculation of
the values of flow variables. $\mathrm{Propagate}(A, \iota, \sigma)$ computes
the values of flow variables using the instructions
of the assertion $A$ and the values of state variables
in $\sigma$. At the end, if there are flow variables without
any value, they are set to their initial values in $\iota$
and the assertion $A$ is applied to check that all the
assignments are satisfied.


4.2 Compilation to fault trees 


The compilation of AltaRica 3.0 models to Fault
Trees (Prosvirnova & Rauzy 2015) works in 5 steps
(see Figure 2):


1. The AltaRica 3.0 model is flattened into a
GTS.

2. The obtained GTS is partitioned into independent
GTSs plus an independent assertion.

3. Reachability graphs of each independent GTS
are calculated.

4. Each reachability graph is separately compiled
into Boolean equations.

5. The independent assertion is compiled into
Boolean equations.


Figure 2. Compilation of AltaRica 3.0 models to Fault
Trees.


The independent assertion $\langle V^a, A^a, \iota^a \rangle$ (the 5th
step of the algorithm) is transformed into a set of
Boolean formulas in the following way. For each pair
$(f, d)$, where $f \in V^a$ is a flow variable and $d \in \mathrm{dom}(f)$
is its value, a formula $\varphi_{(f,d)}$ is constructed according
to the instructions in the assertion $A^a$ and the Boolean
formulas $\{\varphi_{(u,c)},\ u \in U,\ c \in \mathrm{dom}(u)\}$ obtained from
the compilation of the independent GTSs.

In order to compile the assertion into Boolean
formulas efficiently, one needs to separate it into
independent parts. The dependency relation
between variables in the assertion $A^a$ defines a
dependency graph. This graph may contain cycles.
The strongly connected components of this graph
divide the variables of $A^a$ into sets and make it possible
to decompose the assertion $A^a$ into blocks of instructions $A^a_i$
($i = 1, \ldots, m$), where $m$ is the number of strongly
connected components: $A^a = A^a_1; A^a_2; \ldots; A^a_m$.

Each block of instructions $A^a_i$ is compiled into
Boolean formulas recursively. Let us denote by


• $V^a_i$ the set of variables labeling the vertices of
the strongly connected component number $i$.

• $A^a_i$ the instruction that calculates the values of
variables from $V^a_i$.

• $\iota^a_i$ the initial assignment of variables from $V^a_i$.

• $W^a_i$ the set of variables such that variables from
$V^a_i$ depend on them in $A^a_i$.


For every variable $v$ in $V^a_i$ the formula $\varphi_{(v,c)}$
(where $c \in \mathrm{dom}(v)$) is built as follows:


• Let $\Sigma = \prod_{w \in W^a_i} \mathrm{dom}(w)$ be the Cartesian product
of the domains of variables from $W^a_i$.

• Let $\sigma \in \Sigma$ be an assignment of variables from
$W^a_i$.

• Let $\varphi_\sigma$ be a product built over $W^a_i$ (as defined in
Section 3.1) calculated as follows:

$\varphi_\sigma = \prod_{w \in W^a_i} (w = \sigma(w))$

• Let $\tau$ be a partial variable assignment,
$\tau: V^a_i \cup W^a_i \rightarrow C$, such that:

$\forall w \in W^a_i,\ \tau(w) = \sigma(w)$

• The partial variable assignment $\tau$ can be completed
by propagating the assertion $A^a_i$:

$\tau = \mathrm{Propagate}(A^a_i, \iota^a_i, \tau)$

• Then for each couple $(v, c)$, with $v \in V^a_i$, such
that $\tau(v) = c$, the formula associated with $(v, c)$ is
updated as follows:

$\varphi_{(v,c)} = \varphi_{(v,c)} + \varphi_\sigma$


At the end of the algorithm, for all variables $v \in V^a$
and their values, we obtain a formula $\varphi_{(v,c)}$ built
over a finite set of finite domain variables $W^a$, such
that $v$ depends on them in the assertion $A^a$. We use
the diagrammatic representation as defined in Section
3.11 to represent these formulas.

As we have seen in Section 3.10, $\varphi_{(v,c)} \Leftrightarrow PI[\varphi_{(v,c)}]$
and it is the most simplified form of $\varphi_{(v,c)}$.

For each variable $v \in V^a$ and its value $c \in \mathrm{dom}(v)$,
we compute $PI[\varphi_{(v,c)}]$ and use this form,
which greatly simplifies the generated Fault Tree.
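
The enumeration of input assignments described above can be sketched as follows. This is a simplified illustration under stated assumptions: `compile_block` and `series_assertion` are hypothetical names, the propagation of the assertion is passed in as an ordinary Python function, and each formula is kept as a plain set of products rather than as a Shannon Diagram.

```python
from itertools import product

def compile_block(inputs, dom, propagate):
    """Build, for each pair (v, c), the sum of products over the input variables.

    inputs: the variables the block depends on; dom: variable -> constants;
    propagate(sigma): values of the computed variables for the input assignment.
    Returns a dict (v, c) -> set of products (frozensets of (w, value) pairs).
    """
    formulas = {}
    for values in product(*(dom[w] for w in inputs)):
        sigma = dict(zip(inputs, values))
        phi_sigma = frozenset(sigma.items())   # product of the (w = sigma(w))
        tau = propagate(sigma)                 # completed assignment
        for v, c in tau.items():
            formulas.setdefault((v, c), set()).add(phi_sigma)
    return formulas

def series_assertion(sigma):
    # Assertion of two blocks in series: ok only if both outputs are ok.
    both_ok = sigma["B1.Out"] == "ok" and sigma["B2.Out"] == "ok"
    return {"Out": "ok" if both_ok else "ko"}

dom = {"B1.Out": ["ko", "ok"], "B2.Out": ["ko", "ok"]}
formulas = compile_block(["B1.Out", "B2.Out"], dom, series_assertion)
```

For the series assertion this produces one product for (Out, ok) and three products for (Out, ko), before any simplification by prime implicants.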


4.3. Example 


Consider the parametric block diagram use case 
presented in Section 2. Figure 3 illustrates how 
each basic block of these diagrams can be repre- 
sented in AltaRica 3.0. 

The variable State represents the internal state
of a basic block and takes its value in the domain
BLOCKSTATE = {ok, ko}. A domain is an enumeration
having any finite number of values. The
event failure represents the internal failure of a
basic block. It is possible to associate different
probability distributions to the events of basic
blocks (e.g. exponential, constant, Weibull). The
value of the parameters can also be changed. The
behavior of a basic block is represented by a state
machine given in Figure 3. The variable Out is a flow
variable, which represents the output of a basic
block. The assertion of a basic block is an instruction,
which calculates the value of this variable Out
according to the value of the state variable State.

    domain BLOCKSTATE = {ok, ko}
    class BasicBlock
      BLOCKSTATE State (init = ok);
      BLOCKSTATE Out (reset = ko);
      event failure (delay = exponential(0.001));
      transition
        failure: State == ok -> State := ko;
      assertion
        Out := State;
    end

Figure 3. AltaRica 3.0 model of a basic block (the state
machine has two states, State = ok and State = ko, linked
by the event failure).

Figure 4 shows how two blocks in series can
be modeled in AltaRica 3.0.

    class Series
      BasicBlock B1, B2;
      BLOCKSTATE Out (reset = ko);
      assertion
        Out := if (B1.Out == ok) and (B2.Out == ok)
               then ok else ko;

Figure 4. AltaRica 3.0 model of two blocks in series.

The assertion of the whole model is

    B1.Out := B1.State;
    B2.Out := B2.State;
    Out := if (B1.Out == ok) and (B2.Out == ok)
           then ok else ko;

The compilation into Fault Trees is performed
as follows.

First, local reachability graphs are compiled:

$\varphi_{(B1.State,ok)}$ = true
$\varphi_{(B2.State,ok)}$ = true
$\varphi_{(B1.State,ko)}$ = B1.failure
$\varphi_{(B2.State,ko)}$ = B2.failure

Second, local assertions are compiled:

$\varphi_{(B1.Out,ok)}$ = (B1.State = ok)
$\varphi_{(B2.Out,ok)}$ = (B2.State = ok)
$\varphi_{(B1.Out,ko)}$ = (B1.State = ko)
$\varphi_{(B2.Out,ko)}$ = (B2.State = ko)

Third, the global assertion is compiled:

$\varphi_{(Out,ok)}$ = (B1.Out = ok) * (B2.Out = ok)
$\varphi_{(Out,ko)}$ = ((B1.Out = ok) * (B2.Out = ko)
+ (B1.Out = ko) * (B2.Out = ok)
+ (B1.Out = ko) * (B2.Out = ko))

Figure 5 represents the last formula by means
of a variant of Binary Decision Diagram (as presented
in Section 3.11). It can be simplified using
the algorithm presented in Section 3.10 as follows,
where $\div$ denotes the subsumption removal operation
of Section 3.9:

dom(B1.Out) = dom(B2.Out) = {ko, ok}
$f$ = (B1.Out = ko)
* [(B2.Out = ko) * 1 + [(B2.Out = ok) * 1 + 0]]
+ [(B1.Out = ok) * [(B2.Out = ko) * 1 + 0] + 0]
$f_1$ = (B2.Out = ko) * 1 + [(B2.Out = ok) * 1 + 0]
$f_2$ = (B2.Out = ko) * 1 + 0
$f_0$ = 0
$h = f_1 * f_2 + f_0$ = (B2.Out = ko) * 1 + 0
$PI[h]$ = (B2.Out = ko) * 1 + 0
$PI[f_1]$ = 1
$PI[f_2]$ = (B2.Out = ko) * 1 + 0
$PI[f_1] \div PI[h]$ = 1
$PI[f_2] \div PI[h]$ = 0
$PI[f]$ = (B1.Out = ko) * 1
+ [(B2.Out = ko) * 1 + 0]

which reads as (B1.Out = ko) + (B2.Out = ko). It
is the most simplified form of $\varphi_{(Out,ko)}$.


Figure 5. Diagrammatic representation of ((B1.Out =
ok) * (B2.Out = ko) + (B1.Out = ko) * (B2.Out = ok) +
(B1.Out = ko) * (B2.Out = ko)).


5 EXPERIMENTS 


The Fault Tree compiler of the OpenAltaRica
platform produces Fault Trees from AltaRica
3.0 models. More precisely, according to one or
several Boolean observers, representing safety 
cases of the modeled system, the Fault Tree com- 
piler generates Fault Trees in Open-PSA model 
exchange format (Hibti, Friedlhuber, & Rauzy 
2012) with these Boolean observers as top events. 
The produced Fault Trees can be then assessed by 
XFTA (Rauzy 2012) to compute Minimal Cut 
Sets, probabilities of the top events, and so on. 

We have implemented the algorithm presented 
in Section 4 and integrated it in the original version 
of the Fault Tree compiler. This algorithm greatly 
simplifies the generated Fault Trees, compared to 
those generated by the original version. 

We have performed experiments with different values for the three parameters s, p and q of the motivating example pictured in Figure 1.

In Table 1 we present the results obtained with the original version of the Fault Tree compiler and with the new one.

The first column is the number of the considered case. The next five columns, from the second to the sixth, give the number of components contained in each sub-part. We start with n1 parallel blocks; in each block there are n2 sub-blocks in series; in each sub-block there are n3 parallel sub-blocks, and so on. For example, the first case means 3 parallel blocks, with 3 sub-blocks in series, each one containing 3 parallel sub-blocks; whereas the fifth case means 2 blocks in series, with 4 parallel sub-blocks, each block containing 4 sub-blocks in series, with 4 parallel sub-blocks in each one. The seventh column gives the total number of basic blocks in the AltaRica 3.0 model. Finally, the eighth and ninth columns give the number of intermediate events in the Fault Trees generated by the new version of the Fault Tree compiler (eighth column) and by the original one (ninth column).

The main observation concerns the reduction in the number of generated gates obtained with the new version of the Fault Tree compiler compared to the original one. On average, this reduction is 56.9%: the new algorithm generates, on average, 56.9% fewer gates than the original one. The smallest reduction is 37.8%, in the second case, and the largest is 64.9%, in the fifth case. The benefit obtained with the new version of the algorithm implemented in the Fault Tree compiler is therefore substantial.

Table 1. Parametric block diagram use case: fault tree compilation.

Case  n1  n2  n3  n4  n5  Number of blocks  Number of gates (new)  Number of gates (original)
1st   3   3   3   0   0   27                243                    525
2nd   3   3   3   3   0   81                720                    1158
3rd   4   4   4   0   0   64                537                    1276
4th   4   4   4   4   0   256               2133                   5114
5th   0   2   4   4   4   128               1135                   3236
6th   0   3   3   3   2   54                558                    1264
7th   0   3   3   3   0   27                243                    596
8th   0   3   3   3   3   81                768                    1914
9th   0   4   4   4   0   64                549                    1562
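
The percentages quoted in this section can be rechecked directly from the figures of Table 1. The following Python sketch is only an illustration: the names `CASES` and `reduction` are ours, and the rule used to recompute the number of blocks (the product of the non-zero parameters) is an assumption inferred from the table, not a description of the Fault Tree compiler.

```python
from math import prod

# Rows of Table 1: ((n1, ..., n5), blocks, gates with new compiler, gates with original).
CASES = [
    ((3, 3, 3, 0, 0), 27, 243, 525),
    ((3, 3, 3, 3, 0), 81, 720, 1158),
    ((4, 4, 4, 0, 0), 64, 537, 1276),
    ((4, 4, 4, 4, 0), 256, 2133, 5114),
    ((0, 2, 4, 4, 4), 128, 1135, 3236),
    ((0, 3, 3, 3, 2), 54, 558, 1264),
    ((0, 3, 3, 3, 0), 27, 243, 596),
    ((0, 3, 3, 3, 3), 81, 768, 1914),
    ((0, 4, 4, 4, 0), 64, 549, 1562),
]


def reduction(new: int, original: int) -> float:
    """Relative reduction of the number of gates, in percent."""
    return 100.0 * (1.0 - new / original)


reductions = [reduction(new, original) for _, _, new, original in CASES]
average = sum(reductions) / len(reductions)

# Assumed rule: the number of basic blocks is the product of the non-zero n_i.
blocks_ok = all(prod(n for n in ns if n) == blocks for ns, blocks, _, _ in CASES)

print(f"average reduction: {average:.1f}%")
print(f"smallest: {min(reductions):.1f}% (case {reductions.index(min(reductions)) + 1})")
print(f"largest:  {max(reductions):.1f}% (case {reductions.index(max(reductions)) + 1})")
```

Under these assumptions the script recovers the average, the smallest and the largest reductions stated in the text, and the block counts of the seventh column.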


6 CONCLUSION 


Boolean models are widely used for probabilistic safety analysis. There are cases, however, where binary states are not sufficient. For instance, it is sometimes of interest to represent the level of degradation of a component, the quality of a signal, and so on. This kind of model can be easily represented with AltaRica 3.0, a high level modeling language dedicated to safety analyses. AltaRica 3.0 comes with several efficient assessment tools, amongst them a Fault Tree compiler.

In this article we presented how the notion of prime implicants can be extended to finite domain calculus. We discussed how the finite domain calculus can be efficiently encoded using a variant of Binary Decision Diagrams. We showed, using a parametric block diagram use case, how these results can be applied to simplify Fault Trees automatically generated from AltaRica 3.0 models. The number of generated intermediate events is on average divided by two, which greatly improves the readability of Fault Trees and the efficiency of their assessment.


ABSTRACT: AltaRica 3.0 is an event-based, object-oriented modeling language dedicated to (probabilistic) safety analyses of complex systems. It makes it possible to design models at a higher level than with formalisms traditionally used for safety analyses (fault trees, Markov chains, stochastic Petri nets, etc.), without increasing the complexity of calculations of risk indicators. Several assessment tools have been developed for AltaRica 3.0, including a stepwise simulator. This tool is of great help for the design and the validation of AltaRica 3.0 models. It plays for modeling the role that debuggers play for programming.

In this article, we show how the AltaRica 3.0 stepwise simulator has been greatly enhanced by the intro- 
duction of an abstract notion of time. The key mathematical property is that abstract and concrete simu- 
lation are bisimilar: any concrete (timed, stochastic) execution can be simulated by an abstract execution 
and reciprocally any abstract execution corresponds to at least one concrete execution. This important 
result paves the way to the design of efficient model-checking algorithms, e.g. generators of sequences of 
events leading to a failure state. 


1 INTRODUCTION

AltaRica 3.0 is an event-based, object-oriented modeling language dedicated to (probabilistic) safety analyses of complex systems (Prosvirnova, Batteux, Brameret, Cherfi, Friedlhuber, Roussel, & Rauzy 2013). It makes it possible to design models at a higher level than with formalisms traditionally used for safety analyses (fault trees, Markov chains, stochastic Petri nets, etc.), without increasing the complexity of calculations of risk indicators.

The semantics of AltaRica 3.0 is defined in terms of stochastic guarded transition systems (Rauzy 2008). AltaRica 3.0 executions are similar to those of other discrete event modeling formalisms: each time a transition becomes fireable, it is scheduled and possibly fired after a certain real-valued delay, see e.g. (Cassandras & Lafortune 2008, Zimmermann 1976) for introductions to (stochastic) discrete event systems. Events labeling transitions may be either deterministic or stochastic. In the latter case, AltaRica 3.0 provides both built-in distributions (exponential, Weibull, ...) and empirical distributions.

Several assessment tools have been developed for AltaRica 3.0, see e.g. (Prosvirnova & Rauzy 2015, Brameret, Rauzy, & Roussel 2015, Aupetit, Batteux, Rauzy, & Roussel 2015), including a stepwise simulator. This tool is of great help for the design and the validation of AltaRica 3.0 models. It makes it possible to perform interactive step by step simulations, i.e. to go back and forth in sequences of events, which helps to track modeling errors, unexpected behaviors and so on. In that respect, stepwise simulators play a similar role for discrete event modeling as debuggers like GDB or DDD for programming, see e.g. (Matloff & Salzman 2008) for an introduction to the latter.

In this article, we introduce the notion of abstract executions of AltaRica 3.0 models. This notion is implemented in the new version of the AltaRica 3.0 stepwise simulator. The previous version of this tool did not consider time at all. The reason was that it would have been much too tedious for the analyst to enter by hand the delay associated with a stochastic transition each time this transition becomes fireable. Moreover, infinitely many real-valued delays can be chosen, leaving the analyst to ponder which one is the most suitable for her or his purpose. However, ignoring delays had a major drawback: the stepwise simulator allowed the firing of sequences of events with no counterpart in stochastic simulation and, more generally, that did not obey the timed semantics of AltaRica 3.0.

The idea is therefore to abstract away the time in 
stepwise simulation: each transition is now associ- 
ated with a time interval. Firing a transition may 
modify the time intervals associated with already 
scheduled transitions. This idea is by no means 
new: it enters into the general framework of Cou- 
sot’s abstract interpretation (Cousot & Cousot 
1977). The problem at stake was to make it work 
for the particular case of stochastic discrete event 
simulations. The key mathematical property here 
is that abstract and (concrete, in the sense of the 
semantics of AltaRica 3.0) simulations are bisimi- 
lar, see e.g. (Milner 1989) for an introduction to 
this important notion: any (concrete) execution 
can be simulated by an abstract execution and 
reciprocally any abstract execution corresponds 
to at least one (concrete) execution. In a word, 
abstract executions are in agreement with AltaRica 
3.0 semantics. 

This important result paves the way to the design of efficient model-checking algorithms, see e.g. (Clarke, Grumberg, & Peled 2000) for an introduction. In particular, it makes possible the design of generators of sequences of events leading to a failure state.

The remainder of this article is organized as 
follows. Section 2 presents an illustrative example. 
Section 3 recalls fundamental notions about timed 
and stochastic guarded transition systems. Section 
4 introduces the notion of abstract execution of 
guarded transition systems. Section 5 concludes 
this article and gives some perspectives. 


2 ILLUSTRATIVE EXAMPLE 


As an illustrative example, we shall consider a system made of two identical, periodically tested components that evolve independently of one another.

The behavior of such a component is represented by the state diagram pictured in Figure 1.

The component alternates operation and test phases. It is initially working and starts with a first operation phase that lasts a constant time $\theta$. All subsequent operation phases last a constant time $\tau$.

The component may fail when in operation, with a failure rate $\lambda$.

Figure 1. State diagram for a periodically tested component.

Table 1. A possible evolution of a periodically tested component.

Transition      Firing date
startTest1      $d_1 = \theta$
completeTest    $d_2 = d_1 + \pi$
failure         $d_3 = d_2 + \delta_1$, $0 < \delta_1 < \tau$
startTest       $d_4 = d_2 + \tau$
repair          $d_5 = d_4 + \delta_2$, $u \le \delta_2 \le v$
startTest       $d_6 = d_5 + \tau$

If the component is working when it enters a maintenance phase, this maintenance phase lasts a constant time $\pi$. If, on the contrary, it is failed when it enters a maintenance phase, the duration of its repair is uniformly distributed between two values $u$ and $v$.

Finally, the component is as-good-as-new after a repair.

In the state diagram of Figure 1, transitions failure and repair are thus stochastic while transitions startTest1, startTest and completeTest are deterministic.

Table 1 shows a possible evolution of such a component.

We can assume that $\theta$ and $\tau$ are relatively big compared to $\pi$, $u$ and $v$ and that $\pi$ is smaller than $u$ (itself smaller than $v$).

In order not to have the two components A and B out of service due to test or maintenance at the same time, it is reasonable to take different values of $\theta$ for A and for B (the values of the other parameters being identical for the two components). For instance, if an operation phase normally lasts six months ($\tau = 4380\,h$), the component A can be tested after three months ($A.\theta = 2190\,h$) while the component B is tested after six months ($B.\theta = 4380\,h$). In this way, tests/maintenances of A and B are shifted by three months, which improves the overall availability of the system. Table 2 gives some typical values of the parameters for components A and B that we shall use throughout the article.

Table 2. Typical values of the parameters.

Parameter   A      B
$\theta$    2190   4380
$\tau$      4380   4380
$\pi$       0      0
$u$         12     12
$v$         24     24

With these values of parameters in mind, the reader sees immediately the problem of using a stepwise simulator that does not consider delays associated with transitions. Many executions that would be impossible with a timed semantics become possible with a non-timed semantics, e.g.

$S = \xrightarrow{B.startTest1} \xrightarrow{A.startTest1} \xrightarrow{A.completeTest} \xrightarrow{A.startTest} \xrightarrow{A.completeTest}$

On the other hand, asking the analyst to enter the delays of stochastic transitions interactively is not a practical solution. Not only would it be tedious, but it would also leave the analyst facing the choice of suitable delays, which quickly becomes puzzling. Hence the need for an abstract notion of time, which makes it possible to take into account delays of transitions without asking the analyst to enter them interactively.

To fulfill this need, the idea is to reason in terms of time intervals rather than in terms of dates. To explain how this idea works, we shall first recall the regular semantics of AltaRica in the next section. Then, we shall introduce its abstract semantics (in terms of time intervals) in Section 4.


3 TIMED/STOCHASTIC GUARDED 
TRANSITIONS SYSTEMS 


The semantics of AltaRica 3.0 is defined in terms of stochastic guarded transitions systems (Rauzy 2008), (Batteux, Prosvirnova, & Rauzy 2017). We shall recall here only the notions that are important for the purpose of this article. The reader should refer to the cited articles for in-depth presentations.


3.1 Definition 


A guarded transitions system is a quintuple $S = \langle V, E, T, A, \iota \rangle$, where:

- $V$ is a set of variables. $V$ is the disjoint union of the set $S$ of state variables and the set $F$ of flow variables: $V = S \uplus F$. Each variable $v$ of $V$ takes its value in a finite or infinite set of constants called the domain of $v$ and denoted as $dom(v)$. The global state of the system is thus a variable valuation, i.e. a member of the Cartesian product $\prod_{v \in V} dom(v)$.

- $E$ is a set of events. Each event $e$ of $E$ is associated with a function $delay(e)$ that returns a non-negative real number. $delay(e)$ may be deterministic, in which case it always returns the same value, or stochastic, in which case it returns a value according to a certain cumulative probability distribution.

- $T$ is a set of transitions, i.e. of triples $\langle e, G, P \rangle$, where $e$ is an event of $E$, $G$ is a Boolean expression built on variables of $V$ (called the guard of the transition) and $P$ is an instruction that modifies the value of state variables (called the action of the transition). For the sake of clarity, we shall write a transition $\langle e, G, P \rangle$ as $G \xrightarrow{e} P$.

- $A$ is an assertion, i.e. an instruction that modifies the values of flow variables.

- $\iota$ is a valuation of the variables of $V$, called the initial state.


Example. The guarded transitions system encoding the periodically tested components described in the previous section is as follows.

The state of the component is represented by means of two state variables: state that takes its value in {WORKING, FAILED} and phase that takes its value in {OPERATION1, TEST, OPERATION}.

The events are startTest1, startTest, completeTest, failure and repair. They are associated with the delays described in the previous section.

The transitions are as follows.

$state = WORKING \wedge phase \neq TEST \xrightarrow{failure} state \leftarrow FAILED$

$phase = OPERATION1 \xrightarrow{startTest1} phase \leftarrow TEST$

$phase = OPERATION \xrightarrow{startTest} phase \leftarrow TEST$

$state = WORKING \wedge phase = TEST \xrightarrow{completeTest} phase \leftarrow OPERATION$

$state = FAILED \wedge phase = TEST \xrightarrow{repair} state \leftarrow WORKING,\ phase \leftarrow OPERATION$

Finally, the initial state is defined by the variable valuation: state = WORKING, phase = OPERATION1.


3.2 Composition


One of the advantages of guarded transitions sys- 
tems over some other similar formalisms is that 
they are highly compositional. 

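Formally, let $M_1 = \langle V_1, E_1, T_1, A_1, \iota_1 \rangle$ and $M_2 = \langle V_2, E_2, T_2, A_2, \iota_2 \rangle$ be two guarded transitions systems. Then the composition of $M_1$ and $M_2$, denoted as $M_1 \otimes M_2$, is simply the guarded transitions system $\langle V, E, T, A, \iota \rangle$ such that $V = V_1 \cup V_2$, $E = E_1 \cup E_2$, $T = T_1 \cup T_2$, $A = A_1 \circ A_2$ and $\iota = \iota_1 \circ \iota_2$.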

The above principle extends to any number of 
guarded transitions systems. 

Example. To represent the system discussed in the previous section, it suffices to create two copies of the above guarded transitions system and to compose them (which is done automatically by the AltaRica compiler).

In our example, it may be worth introducing a Boolean flow variable failed to tell when the system is failed. The assertion defining this variable could be as follows.

$failed \leftarrow A.state = FAILED \wedge B.state = FAILED$
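
As an illustration of the composition principle, here is a minimal Python sketch (not the AltaRica compiler, and not AltaRica syntax). It encodes the untimed part of the component as a dictionary of guarded transitions, composes two copies by taking unions, and applies the assertion after each firing. The function names `component`, `compose`, `assertion` and `fire`, as well as the dictionary-based encoding of states, are assumptions made for the example; delays are deliberately left out.

```python
from typing import Callable, Dict, Tuple

State = Dict[str, object]
Guard = Callable[[State], bool]
Action = Callable[[State], State]


def component(c: str) -> Tuple[State, Dict[str, Tuple[Guard, Action]]]:
    """Initial state and transitions of one component, with variables prefixed by c."""
    st, ph = f"{c}.state", f"{c}.phase"
    init: State = {st: "WORKING", ph: "OPERATION1"}
    transitions = {
        f"{c}.failure": (lambda s: s[st] == "WORKING" and s[ph] != "TEST",
                         lambda s: {**s, st: "FAILED"}),
        f"{c}.startTest1": (lambda s: s[ph] == "OPERATION1",
                            lambda s: {**s, ph: "TEST"}),
        f"{c}.startTest": (lambda s: s[ph] == "OPERATION",
                           lambda s: {**s, ph: "TEST"}),
        f"{c}.completeTest": (lambda s: s[st] == "WORKING" and s[ph] == "TEST",
                              lambda s: {**s, ph: "OPERATION"}),
        f"{c}.repair": (lambda s: s[st] == "FAILED" and s[ph] == "TEST",
                        lambda s: {**s, st: "WORKING", ph: "OPERATION"}),
    }
    return init, transitions


def compose(*systems):
    """Composition: union of the variables (initial valuations) and of the transitions."""
    init, transitions = {}, {}
    for sub_init, sub_transitions in systems:
        init.update(sub_init)
        transitions.update(sub_transitions)
    return init, transitions


def assertion(s: State) -> State:
    """Update the flow variable 'failed' from the state variables."""
    return {**s, "failed": s["A.state"] == "FAILED" and s["B.state"] == "FAILED"}


def fire(name: str, state: State, transitions) -> State:
    """Fire a transition: action on state variables first, then the assertion."""
    guard, action = transitions[name]
    if not guard(state):
        raise ValueError(f"{name} is not fireable")
    return assertion(action(state))


init, transitions = compose(component("A"), component("B"))
s0 = assertion(init)
s1 = fire("A.failure", s0, transitions)
s2 = fire("B.failure", s1, transitions)
print(s0["failed"], s1["failed"], s2["failed"])  # prints: False False True
```

Prefixing variable and event names with the component name keeps the unions disjoint, which is what makes the composition a plain union in this sketch.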


3.3 Semantics 


The semantics of a guarded transitions system $S = \langle V, E, T, A, \iota \rangle$ is defined as the set of its possible executions (sometimes called "concrete" in this article, as opposed to the abstract executions introduced in Section 4).

To define executions formally, we need to introduce the notion of schedule. A schedule of a guarded transitions system $S = \langle V, E, T, A, \iota \rangle$ is a function from $T$ to $\mathbb{R}^+ \cup \{+\infty\}$.

A schedule $\Gamma$ is compatible with a state $\sigma$ of the guarded transitions system $S$ and a date $d$ if the following conditions hold for all transitions $t: G \xrightarrow{e} P$ of $T$.

- $d \le \Gamma(t) < +\infty$ if $G(\sigma) = true$.
- $\Gamma(t) = +\infty$ if $G(\sigma) = false$.

Intuitively, an execution of the guarded transitions system $S$ is a sequence:

$(\sigma_0, d_0, \Gamma_0) \xrightarrow{t_1} (\sigma_1, d_1, \Gamma_1) \xrightarrow{t_2} \cdots \xrightarrow{t_n} (\sigma_n, d_n, \Gamma_n)$

where $n \ge 0$, the $\sigma_i$'s are states of $S$, the $d_i$'s are dates, i.e. non-negative real numbers verifying $0 = d_0 \le d_1 \le \cdots \le d_n$, each $\Gamma_i$ is a schedule compatible with $\sigma_i$ and $d_i$, and finally the $t_i$'s are transitions of $S$.

The set of valid executions of the guarded transitions system $S$ is defined recursively as follows.

The empty execution $(\iota, 0, \Gamma_0)$ is a valid execution if the schedule $\Gamma_0$ is such that for all transitions $t: G \xrightarrow{e} P$ of $T$:

- $\Gamma_0(t) = delay(e)$ if $G(\iota) = true$.
- $\Gamma_0(t) = +\infty$ if $G(\iota) = false$.

Now, if $\Lambda = (\sigma_0, d_0, \Gamma_0) \xrightarrow{t_1} \cdots \xrightarrow{t_n} (\sigma_n, d_n, \Gamma_n)$, $n \ge 0$, is a valid execution, then so is the execution $\Lambda \xrightarrow{t_{n+1}} (\sigma_{n+1}, d_{n+1}, \Gamma_{n+1})$ if the following conditions hold, assuming $t_{n+1} = G_{n+1} \xrightarrow{e_{n+1}} P_{n+1}$.

- $G_{n+1}(\sigma_n) = true$.

- $\sigma_{n+1} = A(P_{n+1}(\sigma_n))$, i.e. the firing of the transition $t_{n+1}$ is performed in two steps: first, state variables are updated by means of the action $P_{n+1}$ of the transition, then flow variables are updated by means of the assertion $A$.

- $d_{n+1} = \Gamma_n(t_{n+1})$ and there is no transition $t$ of $T$ such that $\Gamma_n(t) < \Gamma_n(t_{n+1})$.

- $\Gamma_{n+1}$ is obtained from $\Gamma_n$ by applying the following rules to all transitions $t: G \xrightarrow{e} P$ of $T$:

  - If $G(\sigma_{n+1}) = true$, then:
    - If $G(\sigma_n) = true$ and $t \neq t_{n+1}$, then $\Gamma_{n+1}(t) = \Gamma_n(t)$.
    - Otherwise, $\Gamma_{n+1}(t) = d_{n+1} + delay(e)$.
  - If $G(\sigma_{n+1}) = false$, then $\Gamma_{n+1}(t) = +\infty$.

Example. Consider again our system of two components.

At time 0, 4 transitions are fireable:

Transition      Firing date
A.startTest1    2190
A.failure       5617
B.startTest1    4380
B.failure       4111

As A.startTest1 has the earliest firing date, it is fired (at 2190). After its firing, 3 transitions are fireable:

Transition      Firing date
A.completeTest  2190 + 0 = 2190
B.startTest1    4380
B.failure       4111

As A.completeTest has the earliest firing date, it is then fired (at 2190). After its firing, 4 transitions are fireable:

Transition      Firing date
A.startTest     2190 + 4380 = 6570
A.failure       2190 + 6020 = 8210
B.startTest1    4380
B.failure       4111

As B.failure has the earliest firing date, it is fired (at 4111). After its firing, 3 transitions are fireable:

Transition      Firing date
A.startTest     6570
A.failure       8210
B.startTest1    4380

As B.startTest1 has the earliest firing date, it is fired (at 4380). After its firing, 3 transitions are fireable:

Transition      Firing date
A.startTest     6570
A.failure       8210
B.repair        4400

As B.repair has the earliest firing date, it is fired (at 4400). After its firing, 4 transitions are fireable:

Transition      Firing date
A.startTest     6570
A.failure       8210
B.startTest     4400 + 4380 = 8780
B.failure       4400 + 5201 = 9601

And so on...


This sequence shows how deterministic and stochastic transitions can be intertwined. In particular, dates of tests are not decided once and for all. They depend on the times to failure and to repair of the component.
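
The scheduling rules of this section can be replayed on the example with a few lines of Python. The sketch below is an assumption-laden illustration, not the AltaRica stochastic simulator: stochastic delays are not drawn at random but read from lists that contain the values used in the example (5617 and 6020 for A.failure, 4111 and 5201 for B.failure, 20 for B.repair), the assertion is omitted because the example has no flow variable, and the names `Transition`, `component` and `run` are ours.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]


@dataclass
class Transition:
    name: str
    guard: Callable[[State], bool]
    action: Callable[[State], State]
    delay: Callable[[], float]


def component(c, theta, tau, pi, failure_delays, repair_delays) -> List[Transition]:
    """Transitions of one component; stochastic delays are read from the given lists."""
    st, ph = f"{c}.state", f"{c}.phase"
    failures, repairs = iter(failure_delays), iter(repair_delays)
    return [
        Transition(f"{c}.failure",
                   lambda s: s[st] == "WORKING" and s[ph] != "TEST",
                   lambda s: {**s, st: "FAILED"}, lambda: next(failures)),
        Transition(f"{c}.startTest1", lambda s: s[ph] == "OPERATION1",
                   lambda s: {**s, ph: "TEST"}, lambda: theta),
        Transition(f"{c}.startTest", lambda s: s[ph] == "OPERATION",
                   lambda s: {**s, ph: "TEST"}, lambda: tau),
        Transition(f"{c}.completeTest",
                   lambda s: s[st] == "WORKING" and s[ph] == "TEST",
                   lambda s: {**s, ph: "OPERATION"}, lambda: pi),
        Transition(f"{c}.repair",
                   lambda s: s[st] == "FAILED" and s[ph] == "TEST",
                   lambda s: {**s, st: "WORKING", ph: "OPERATION"},
                   lambda: next(repairs)),
    ]


def run(transitions: List[Transition], init: State, steps: int) -> List[Tuple[str, float]]:
    """Replay the concrete semantics for at most `steps` firings."""
    state = dict(init)
    # Initial schedule: delay(e) if the guard holds, +infinity otherwise.
    schedule = {t.name: (t.delay() if t.guard(state) else math.inf) for t in transitions}
    trace = []
    for _ in range(steps):
        fired = min(transitions, key=lambda t: schedule[t.name])
        if math.isinf(schedule[fired.name]):
            break  # no fireable transition left
        date = schedule[fired.name]
        new_state = fired.action(state)
        for t in transitions:
            if not t.guard(new_state):
                schedule[t.name] = math.inf
            elif not (t.guard(state) and t is not fired):
                schedule[t.name] = date + t.delay()  # newly fireable or just fired
            # otherwise the transition keeps its previous firing date
        state = new_state
        trace.append((fired.name, date))
    return trace


transitions = (component("A", 2190, 4380, 0, [5617, 6020], [])
               + component("B", 4380, 4380, 0, [4111, 5201], [20]))
init = {"A.state": "WORKING", "A.phase": "OPERATION1",
        "B.state": "WORKING", "B.phase": "OPERATION1"}
trace = run(transitions, init, steps=5)
for name, date in trace:
    print(name, date)
```

With these inputs the five firings printed are those of the example: A.startTest1 and A.completeTest at 2190, B.failure at 4111, B.startTest1 at 4380 and B.repair at 4400.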


4 ABSTRACT SEMANTICS 


4.1 Principle 


The first idea to abstract the executions consists in associating an abstract delay $delay^*$ with each event of the model. $delay^*(e)$ is simply the image of the function $delay(e)$, i.e. an interval of non-negative real numbers.

We have to be a bit careful because some intervals that are the images of distributions are closed while some others are open (to the left and/or to the right), and because we have to consider infinite bounds. A solution consists in working only with closed intervals, but in a non-standard arithmetic built over the set $\overline{\mathbb{R}} = \mathbb{R}^+ \cup \{\varepsilon, \infty\}$, where $\varepsilon$ and $\infty$ are respectively infinitely small and infinitely big numbers verifying $\varepsilon + \varepsilon = \varepsilon$ and $\infty + x = \infty$ for all $x \in \overline{\mathbb{R}}$. In this way, the interval $]a, b[$, with $a, b \in \mathbb{R}^+$, can be encoded as $[a + \varepsilon, b - \varepsilon]$. Table 3 gives the abstract delays associated with the most widely used distributions in AltaRica 3.0. Moreover, a transition whose guard is not satisfied in the current state is scheduled in the interval $[\infty, \infty]$.

The second idea is to consider not the date at which transitions are fired, but an interval of time within which they are fired.

Table 3. Intervals associated with delay functions.

Concrete delay                 Abstract delay
Dirac($t$)                     $[t, t]$
Uniform deviate($l$, $h$)      $[l, h]$
Exponential($\lambda$)         $[0 + \varepsilon, \infty]$
Weibull($\alpha$, $\beta$)
Empirical distribution

Assume that we are building the sequence under study step by step and that the last transition we considered must be fired in the time interval $[l, h]$. Assume moreover that transitions $t_1, t_2, \ldots, t_n$ are scheduled in time intervals $[l_1, h_1], [l_2, h_2], \ldots, [l_n, h_n]$. Then, we can make the following remarks.

1. We must have $l \le l_i$ for all $i = 1, \ldots, n$, because the next transition cannot be scheduled in the past.

2. We must also have $h \le h_i$ for all $i = 1, \ldots, n$, because if $h_i < h$ for some $i$, it means that the transition $t_i$ must be fired before $h$, therefore the last transition must also be fired before $h_i$.

3. For the same reason, we can choose $t_i$ as the next transition to be fired only if there is no other transition $t_j$ such that $h_j < l_i$.

4. Again for the same reason, if the transition $t_i$ is fired, it is necessarily fired in the interval $[l_i, h_{min}]$, where $h_{min}$ is the smallest of the $h_j$'s.

5. If the transition $t_i$ is fired and the transition $t_j$ is such that $l_j < l_i$, then $l_j$ must be changed to $l_i$ so as to obey our first remark.

6. Finally, if the transition $t_i$ is fired and a transition $t$ associated with the interval $[l_t, h_t]$ becomes fireable ($t$ can be the transition $t_i$ itself), then $t$ must be scheduled in the interval $[l_i, h_{min}] + [l_t, h_t] = [l_i + l_t, h_{min} + h_t]$.


We are now able to define formally the abstract 
semantics of guarded transitions systems (and 
therefore for AltaRica 3.0). 
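
The non-standard arithmetic of Section 4.1 is small enough to be sketched in Python. The encoding below is only one possible choice and is not taken from the AltaRica tools: a bound is a pair made of a real value and a sign telling whether $\varepsilon$ is subtracted, absent or added; sums of $\varepsilon$ terms saturate so that $\varepsilon + \varepsilon = \varepsilon$; and any sum involving $\infty$ is $\infty$. The treatment of $-\varepsilon$ in sums is an assumption of this sketch, since the text only states the rule for $\varepsilon + \varepsilon$. The names `Bound`, `add`, `add_intervals`, `dirac` and `uniform` are hypothetical.

```python
import math
from typing import NamedTuple, Tuple


class Bound(NamedTuple):
    """A bound value + eps * epsilon, with eps in {-1, 0, +1}."""
    value: float
    eps: int = 0


INF = Bound(math.inf)


def add(a: Bound, b: Bound) -> Bound:
    """Sum of two bounds: infinity absorbs everything, epsilon terms saturate."""
    value = a.value + b.value
    if math.isinf(value):
        return INF
    return Bound(value, max(-1, min(1, a.eps + b.eps)))


Interval = Tuple[Bound, Bound]


def interval(low: float, high: float) -> Interval:
    """Closed interval [low, high]."""
    return (Bound(low), Bound(high))


def open_interval(a: float, b: float) -> Interval:
    """The open interval ]a, b[ encoded as [a + epsilon, b - epsilon]."""
    return (Bound(a, 1), Bound(b, -1))


def add_intervals(x: Interval, y: Interval) -> Interval:
    """[l1, h1] + [l2, h2] = [l1 + l2, h1 + h2]."""
    return (add(x[0], y[0]), add(x[1], y[1]))


def dirac(t: float) -> Interval:
    return interval(t, t)


def uniform(low: float, high: float) -> Interval:
    return interval(low, high)


# Abstract delay used for exponentially distributed events in the example: [0 + epsilon, infinity].
EXPONENTIAL: Interval = (Bound(0, 1), INF)

print(add_intervals(dirac(4380), uniform(12, 24)))   # [4392, 4404]
print(add_intervals(dirac(2190), EXPONENTIAL))       # [2190 + epsilon, infinity]
```

Because a `Bound` is a tuple, Python's lexicographic comparison orders bounds as expected: a value plus $\varepsilon$ is greater than the value itself and smaller than any larger real number.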


4.2 Formal definition 


The abstract semantics of a guarded transitions system $S = \langle V, E, T, A, \iota \rangle$ is defined as the set of its possible abstract executions.

To define abstract executions formally, we need to introduce the notion of abstract schedule. An abstract schedule of a guarded transitions system $S = \langle V, E, T, A, \iota \rangle$ is a function from $T$ to closed intervals over $\overline{\mathbb{R}}$.

An abstract schedule $\Gamma^*$ is compatible with a state $\sigma$ of the guarded transitions system $S$ and the abstract date $[l, h]$ if the following conditions hold for all transitions $t: G \xrightarrow{e} P$ of $T$, with $\Gamma^*(t) = [l_t, h_t]$.

- $l \le l_t < \infty$ and $h \le h_t$ if $G(\sigma) = true$.
- $l_t = h_t = \infty$ if $G(\sigma) = false$.

An abstract execution of the guarded transitions system $S$ is a sequence:

$(\sigma_0, d^*_0, \Gamma^*_0) \xrightarrow{t_1} (\sigma_1, d^*_1, \Gamma^*_1) \xrightarrow{t_2} \cdots \xrightarrow{t_n} (\sigma_n, d^*_n, \Gamma^*_n)$

where $n \ge 0$, the $\sigma_i$'s are states of $S$, the $d^*_i$'s are abstract dates, i.e. time intervals, each $\Gamma^*_i$ is an abstract schedule compatible with $\sigma_i$ and $d^*_i$, and finally the $t_i$'s are transitions of $S$.

The set of valid abstract executions of the guarded transitions system $S$ is defined recursively as follows.

The empty abstract execution $(\sigma_0, [0, 0], \Gamma^*_0)$ is a valid abstract execution if the abstract schedule $\Gamma^*_0$ is such that for all transitions $t: G \xrightarrow{e} P$ of $T$:

- $\Gamma^*_0(t) = delay^*(e)$ if $G(\iota) = true$.
- $\Gamma^*_0(t) = [\infty, \infty]$ if $G(\iota) = false$.

If $\Lambda^* = (\sigma_0, [0, 0], \Gamma^*_0) \xrightarrow{t_1} \cdots \xrightarrow{t_n} (\sigma_n, [l_n, h_n], \Gamma^*_n)$, $n \ge 0$, is a valid abstract execution, then so is the abstract execution $\Lambda^* \xrightarrow{t_{n+1}} (\sigma_{n+1}, [l_{n+1}, h_{n+1}], \Gamma^*_{n+1})$ if the following conditions hold, assuming $t_{n+1} = G_{n+1} \xrightarrow{e_{n+1}} P_{n+1}$, $\Gamma^*_n(t_{n+1}) = [l, h]$ and $h_{min} = \min \{ h_t \mid [l_t, h_t] = \Gamma^*_n(t),\ t \in T \}$.

- $G_{n+1}(\sigma_n) = true$.

- $\sigma_{n+1} = A(P_{n+1}(\sigma_n))$.

- There is no transition $t$ of $T$ such that $\Gamma^*_n(t) = [l_t, h_t]$ and $h_t < l$.

- $[l_{n+1}, h_{n+1}] = [l, h_{min}]$.

- $\Gamma^*_{n+1}$ is obtained from $\Gamma^*_n$ by applying the following rules to all transitions $t: G \xrightarrow{e} P$ of $T$, with $\Gamma^*_n(t) = [l_t, h_t]$.

  - If $G(\sigma_{n+1}) = true$, then:
    - If $G(\sigma_n) = true$ and $t \neq t_{n+1}$, then $\Gamma^*_{n+1}(t) = [\max(l_{n+1}, l_t), h_t]$.
    - Otherwise, $\Gamma^*_{n+1}(t) = [l_{n+1}, h_{n+1}] + delay^*(e)$.
  - If $G(\sigma_{n+1}) = false$, then $\Gamma^*_{n+1}(t) = [\infty, \infty]$.
Example. We shall consider the abstract version of the execution given in the previous section. At time 0, 4 transitions are fireable:

Transition      Abstract date
A.startTest1    $[2190, 2190]$
A.failure       $[0 + \varepsilon, \infty]$
B.startTest1    $[4380, 4380]$
B.failure       $[0 + \varepsilon, \infty]$

A.startTest1 is fired at the abstract date $[2190, 2190]$. After its firing, 3 transitions are fireable:

Transition      Abstract date
A.completeTest  $[2190, 2190] + [0, 0] = [2190, 2190]$
B.startTest1    $[4380, 4380]$
B.failure       $[2190, \infty]$

A.completeTest is fired at the abstract date $[2190, 2190]$. After its firing, 4 transitions are fireable:

Transition      Abstract date
A.startTest     $[2190, 2190] + [4380, 4380] = [6570, 6570]$
A.failure       $[2190, 2190] + [0 + \varepsilon, \infty] = [2190 + \varepsilon, \infty]$
B.startTest1    $[4380, 4380]$
B.failure       $[2190, \infty]$

B.failure is fired at the abstract date $[2190 + \varepsilon, 4380]$. After its firing, 3 transitions are fireable:

Transition      Abstract date
A.startTest     $[6570, 6570]$
A.failure       $[2190 + \varepsilon, \infty]$
B.startTest1    $[4380, 4380]$

B.startTest1 is fired at the abstract date $[4380, 4380]$. After its firing, 3 transitions are fireable:

Transition      Abstract date
A.startTest     $[6570, 6570]$
A.failure       $[4380, \infty]$
B.repair        $[4380, 4380] + [12, 24] = [4392, 4404]$

B.repair is fired at the abstract date $[4392, 4404]$. After its firing, 4 transitions are fireable:

Transition      Abstract date
A.startTest     $[6570, 6570]$
A.failure       $[4392 + \varepsilon, \infty]$
B.startTest     $[4392, 4404] + [4380, 4380] = [8772, 8784]$
B.failure       $[4392, 4404] + [0 + \varepsilon, \infty] = [4392 + \varepsilon, \infty]$

And so on...
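
One abstract firing step can be sketched as follows. This is a simplified illustration under explicit assumptions, not the stepwise simulator itself: bounds are plain floating-point numbers, so the infinitesimal $\varepsilon$ is ignored (an interval such as $[4392 + \varepsilon, \infty]$ appears as `(4392, inf)`); guards are not evaluated, the caller states which transitions become fireable (with their abstract delays) and which ones are disabled; transitions that are not fireable are simply dropped from the schedule instead of being mapped to $[\infty, \infty]$. The names `candidates` and `fire` are ours.

```python
import math
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[float, float]


def candidates(schedule: Dict[str, Interval]) -> List[str]:
    """Transitions t that may be fired next: no other transition has h < l_t."""
    h_min = min(h for _, h in schedule.values())
    return [t for t, (l, _) in schedule.items() if not math.isinf(l) and l <= h_min]


def fire(schedule: Dict[str, Interval], fired: str,
         enabled: Dict[str, Interval],
         disabled: Iterable[str] = ()) -> Tuple[Interval, Dict[str, Interval]]:
    """Fire `fired`; `enabled` maps newly fireable transitions to their abstract delay."""
    l, _ = schedule[fired]
    h_min = min(h for _, h in schedule.values())
    if l > h_min:
        raise ValueError(f"{fired} cannot be fired: another transition must fire first")
    date = (l, h_min)
    disabled = set(disabled)
    new_schedule: Dict[str, Interval] = {}
    for t, (lt, ht) in schedule.items():
        if t == fired or t in disabled or t in enabled:
            continue
        # Transition that stays fireable: its lower bound cannot be in the past.
        new_schedule[t] = (max(l, lt), ht)
    for t, (dl, dh) in enabled.items():
        # Newly fireable transition: abstract date of the firing + abstract delay.
        new_schedule[t] = (l + dl, h_min + dh)
    return date, new_schedule


# Schedule of the example just after B.startTest1 has been fired.
schedule = {
    "A.startTest": (6570, 6570),
    "A.failure": (4380, math.inf),
    "B.repair": (4392, 4404),
}
print(candidates(schedule))
date, schedule_after = fire(
    schedule, "B.repair",
    enabled={"B.startTest": (4380, 4380), "B.failure": (0, math.inf)},
)
print(date)
print(schedule_after)
```

On this input, A.startTest is not a candidate because B.repair must be fired no later than 4404, B.repair is fired in $[4392, 4404]$, and B.startTest is rescheduled in $[8772, 8784]$, as in the last step of the example.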


4.3 Bisimulation 


The key mathematical property is that abstract 
and concrete executions are bisimilar: any concrete 
(timed, stochastic) execution can be simulated by 
an abstract execution and reciprocally any abstract 
execution corresponds to at least one concrete 
execution. 

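Theorem 1. Let $S = \langle V, E, T, A, \iota \rangle$ be a guarded transitions system. Let $\Lambda = (\sigma_0, d_0, \Gamma_0) \xrightarrow{t_1} \cdots \xrightarrow{t_n} (\sigma_n, d_n, \Gamma_n)$ be a (concrete) execution of $S$. Then there exists an abstract execution $\Lambda^* = (\sigma_0, d^*_0, \Gamma^*_0) \xrightarrow{t_1} (\sigma_1, d^*_1, \Gamma^*_1) \cdots \xrightarrow{t_n} (\sigma_n, d^*_n, \Gamma^*_n)$, such that the following properties hold:

- $\forall n \ge 0,\ d_n \in d^*_n$;
- $\forall n \ge 0,\ \forall t \in T,\ \Gamma_n(t) \in \Gamma^*_n(t)$.

Theorem 2. Let $S = \langle V, E, T, A, \iota \rangle$ be a guarded transitions system. Let $\Lambda^* = (\sigma_0, d^*_0, \Gamma^*_0) \xrightarrow{t_1} (\sigma_1, d^*_1, \Gamma^*_1) \cdots \xrightarrow{t_n} (\sigma_n, d^*_n, \Gamma^*_n)$ be an abstract execution of $S$. Then $\Lambda^*$ corresponds to at least one (concrete) execution $\Lambda = (\sigma_0, d_0, \Gamma_0) \xrightarrow{t_1} \cdots \xrightarrow{t_n} (\sigma_n, d_n, \Gamma_n)$ such that the following properties hold:

- $\forall n \ge 0,\ d_n \in d^*_n$;
- $\forall n \ge 0,\ \forall t \in T,\ \Gamma_n(t) \in \Gamma^*_n(t)$.

The proofs of these two theorems proceed by induction on the abstract executions or on the (concrete) executions. They are out of the scope of this article and are not described.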


5 CONCLUSION AND PERSPECTIVES 


In this article, we introduce the notion of abstract executions of AltaRica 3.0 models. This notion is implemented in the new version of the AltaRica 3.0 stepwise simulator. It makes it possible to reconcile stochastic and stepwise simulations of AltaRica 3.0 models.

We show that abstract and (concrete) simulations are bisimilar: any (concrete) execution can be simulated by an abstract execution and reciprocally any abstract execution corresponds to at least one (concrete) execution.

We illustrate our purpose using a motivating example that mixes stochastic and deterministic transitions.

The introduction of the notion of abstract exe- 
cutions to the stepwise simulator paves the way to 
the design of efficient model-checking algorithms, 
and in particular to the design of generators of 
sequences of events leading to a failure state. 

The next step of our work is the application 
of the presented results for the development of 
an efficient sequence generator for AltaRica 3.0 
models. 


REFERENCES 


Aupetit, B., M. Batteux, A. Rauzy, & J.-M. Roussel (2015, September). Improving performance of the AltaRica 3.0 stochastic simulator. In L. Podofillini, B. Sudret, B. Stojadinovic, E. Zio, and W. Kröger (Eds.), Proceedings of Safety and Reliability of Complex Engineered Systems: ESREL 2015, pp. 1815-1824. CRC Press.




Batteux, M., T. Prosvirnova, & A. Rauzy (2017, Septem- 
ber). Altarica 3.0 assertions: the why and the where- 
fore. Journal of Risk and Reliability. 

Brameret, P.-A., A. Rauzy, & J.-M. Roussel (2015, July). 
Automated generation of partial markov chain from 
high level descriptions. Reliability Engineering and 
System Safety 139, 179-187. 

Cassandras, C.G. & S. Lafortune (2008). Introduction to Discrete Event Systems. New-York, NY, USA: Springer.

Clarke, E.M., O. Grumberg, & D.A. Peled (2000, February). Model Checking. Cambridge, MA, USA: MIT Press.

Cousot, P. & R. Cousot (1977). Abstract interpretation: a 
unified lattice model for static analysis of programs by 
construction or approximation of fixpoints. In Con- 
ference Record of the Fourth Annual ACM SIGPLAN- 
SIGACT Symposium on Principles of Programming 
Languages, New York, NY, USA, pp. 238-252. ACM 
Press. Los Angeles, California. 

Matloff, N. & P.J. Salzman (2008). The Art of Debugging with GDB, DDD, and Eclipse. San Francisco, CA, USA: No Starch Press.

Milner, R. (1989). Communication and Concurrency. 
Prentice-Hall international series in computer science. 
Upper Saddle River, New Jersey, USA: Prentice Hall. 

Prosvirnova, T., M. Batteux, P.-A. Brameret, A. Cherfi, 
T. Friedlhuber, J-M. Roussel, & A. Rauzy (2013, 
September). The altarica 3.0 project for model-based 
safety assessment. In Proceedings of 4th IFAC Work- 
shop on Dependable Control of Discrete Systems, 
DCDS'2013, York, Great Britain, pp. 127-132. Inter- 
national Federation of Automatic Control. 

Prosvirnova, T. & A. Rauzy (2015). Automated gen- 
eration of minimal cutsets from altarica 3.0 models. 
International Journal of Critical Computer-Based Sys- 
tems 6(1), 50-79. 

Rauzy, A. (2008). Guarded transition systems: a new 
states/events formalism for reliability studies. Journal 
of Risk and Reliability 222(4), 495-505. 

Zimmermann, A. (1976). Stochastic Discrete Event Sys- 
tems. Berlin, Heidelberg, Germany: Springer. 


Safety and Reliability - Safe Societies in a Changing World — Haugen et al. (Eds) 
© 2018 Taylor & Francis Group, London, ISBN 978-0-8153-8682-7 


Reliability forecasting of components/systems in automobile 
applications by using two-dimensional stress functions 


Abderrahim Krini 
Robert Bosch GmbH Engineering of Remanufacturing and Quality, Schwäbisch Gmünd, Germany 


Josef Börcsök 
Department of Computer Architecture and System Programming, University of Kassel, Kassel, Germany 


ABSTRACT: For remanufacturing automobile subsystems, knowledge of the future failure rates of systems produced in serial production is of great importance. Remanufacturing departments especially need information about the number of cores to remanufacture, the condition of those cores, and the number of replacement systems the market requires. On the basis of an existing model that predicts failure rates using warranty data as input, a bivariate prognosis model is developed, which takes two variables of stress in the field into account. By enhancing the existing model, a more precise prognosis of future failure rates is expected, as well as better knowledge of the damage. This paper presents all the mathematical tools necessary to perform a bivariate lifetime prognosis as well as the implementation in MATLAB.


1 INTRODUCTION


After serial production, automotive suppliers are obliged to assure the post-series supply of
replacement components and systems. The demanded number of replacement systems influences the
production method. Once demand drops below a certain point, the remanufacturing of old systems
becomes the most economical method. In order to optimize the strategy of after-series production,
a suitable prognosis is needed. This prognosis requires a precise prediction of future failure rates
in field in order to estimate the number of returning cores as well as the demand for replacement
systems. Additionally, an accurate picture of the stress in field that the cores have experienced is
an advantage, because it indicates the costs to be expected for remanufacturing the systems. On the
basis of a univariate prognosis model by Pauli and Meyna [1], [2], [3], a bivariate model is developed.
This model predicts future failures in field with respect to two different variables that represent the
stress in field. The data basis for the model is warranty data. Knowledge of the distribution of stress
in field for a car model allows a transformation to predict failure rates as a function of time. The
prognosis is carried out by using bivariate stochastic distribution functions. Methods of fitting
bivariate functions and comparing the goodness of fit between several distributions are presented. The
discussed multivariate prognosis method is to be categorized between univariate prognosis models and
prognosis models using neural networks.


2 THEORETICAL BACKGROUND


This chapter gives a brief overview of the basic tools the model is based on.


2.1 Univariate prognosis model as basis for the new model


As a basis for reliability prediction models in general, data can be collected from different origins:
laboratory testing, testing in field or real field data from customers. The disadvantage of laboratory
testing and testing in field is the small sample size as well as the incomplete consideration of the
variety of stress in field. Therefore, Pauli and Meyna developed an approach to predict the reliability
of automobiles as well as automotive subsystems by using warranty data. These data must contain the
time in field until failure and the mileage until failure for at least 50 failures. This method is
described briefly here. For a better understanding, see also references [1], [2], [3] and [4].

In order to describe the stress in field, the driven mileage until failure is a suitable variable. Time
in field is inappropriate due to varying user behavior [1]. Varying usage of automobiles can be
described by the mileage distribution. Considering time in field and mileage (until failure), under the
assumption of a linear driving behavior of each driver, one can develop a mileage distribution for a
certain time period by using warranty data. Theoretical distribution functions can then be fitted to
the empirical data (Figure 1).


By cumulating the number of failures that occurred within a certain mileage interval, e.g.
3001 to 4000 km, a failure frequency is obtained. From the mileage distribution of the warranty
period it is known that only a share of all components in the sample has reached a given mileage
interval within the warranty period. Until all components have reached this interval, a certain
number of additional failures is expected. These so-called candidates can be estimated by correcting
the observed failures with the mileage distribution for each interval.

In order to calculate a time-dependent failure probability, a theoretical distribution function is
fitted to the corrected failure frequency. Normalizing the resulting function with respect to the
sample size gives the mileage-dependent failure probability. The transformation to a time-dependent
failure probability is carried out by utilizing the mileage distribution for each date.


2.2 Bivariate reliability parameters 


Considering the probability of failure as a bivariate function, the lifetime of a component is
characterized by two positive continuous random variables $X_1$ and $X_2$. The system fails if both
variables of stress, $x_1$ and $x_2$, exceed $X_1$ and $X_2$:

$$F(x_1, x_2) = P(X_1 \le x_1, X_2 \le x_2) \quad \text{for } x_1, x_2 \ge 0 \tag{1}$$

$F(x_1, x_2)$, the failure distribution function, indicates the probability that a component which
has experienced the stress $x_1$ and $x_2$ has failed.

For $x_1, x_2 \to \infty$ the component fails:

$$\lim_{x_1, x_2 \to \infty} F(x_1, x_2) = 1 \tag{2}$$

The derivative of $F(x_1, x_2)$ with respect to $x_1$ and $x_2$ gives information about the density
of failures at a certain combination of stress:

$$f(x_1, x_2) = \frac{\partial^2 F(x_1, x_2)}{\partial x_1 \, \partial x_2} \tag{3}$$


Figure 1. Empirical yearly mileage distribution, fitted lognormal distribution and fitted Weibull
distribution (x-axis: mileage [km]).

Figure 2. Observed failures and corrected failures.

The reliability $R$ is the complement of the failure distribution function $F$. A component survives
if one or both variables of stress, $x_1$ and $x_2$, remain below the random variables $X_1$ and
$X_2$ [6][7]:

$$R(x_1, x_2) = P(X_1 > x_1 \ \text{or}\ X_2 > x_2) = 1 - P(X_1 \le x_1, X_2 \le x_2) = 1 - F(x_1, x_2) \tag{4}$$

Dividing the failure density by the reliability gives the hazard rate $h(x_1, x_2)$, an important
indicator for the reliability of a component:

$$h(x_1, x_2) = \frac{f(x_1, x_2)}{R(x_1, x_2)} \tag{5}$$


Both the failure distribution function and the reliability function can be determined from empirical
data; all other parameters can be derived from them. For a better distinction, the empirical functions
are marked with a tilde. The sample of all examined components ($n_0$) consists of the subsets of
failed components ($n_f$) and surviving components ($n_s$):

$$n_0 = n_f(x_1, x_2) + n_s(x_1, x_2) \tag{6}$$

The empirical probability of failure is defined as:

$$\tilde{F}(x_1, x_2) = \frac{n_f(x_1, x_2)}{n_0} \tag{7}$$

Its complement, the empirical reliability, is defined as:

$$\tilde{R}(x_1, x_2) = \frac{n_s(x_1, x_2)}{n_0} \tag{8}$$

In equations (7) and (8), the cumulated numbers $n_f$ and $n_s$ must be used.
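As a minimal sketch of equations (7) and (8), the empirical failure probability can be obtained by counting the failed components whose two stress values both lie at or below the evaluation point. The function names and the sample values below are hypothetical and only illustrate the counting; they are not taken from the paper's data set or its MATLAB implementation.

```python
from typing import List, Tuple

def empirical_failure_probability(
    failures: List[Tuple[float, float]], n0: int, x1: float, x2: float
) -> float:
    """Share of the n0 examined components that failed with both
    stress values at or below (x1, x2), i.e. the cumulated failures / n0."""
    n_failed = sum(1 for s1, s2 in failures if s1 <= x1 and s2 <= x2)
    return n_failed / n0

def empirical_reliability(
    failures: List[Tuple[float, float]], n0: int, x1: float, x2: float
) -> float:
    """Complement of the empirical failure probability."""
    return 1.0 - empirical_failure_probability(failures, n0, x1, x2)

# Hypothetical sample: (mileage [km], switching cycles) at failure, 10 examined units.
failures = [(12000.0, 900.0), (30000.0, 2500.0), (45000.0, 1800.0)]
n0 = 10
f_emp = empirical_failure_probability(failures, n0, 40000.0, 3000.0)
r_emp = empirical_reliability(failures, n0, 40000.0, 3000.0)
```

In this example two of the three failures fall inside the evaluated region, so the empirical failure probability is 2/10.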


2.3 Bivariate distribution functions 


For most of the common univariate distribution functions, multivariate counterparts exist. The most
important fields of application for such functions are financial (insurance) mathematics, meteorology
and reliability statistics. This chapter presents bivariate distributions that are important for the
reliability of automobile components.
In general, bivariate distribution functions are a useful tool for a dataset that meets the following
criteria:
- The events follow the same type of univariate distribution in each variable.
- The correlation between both variables is neither 0 nor ±1.


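In [5] the distribution function of the bivariate normal distribution is defined as follows:

$$f_N(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 - 2\rho\,\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 \right] \right\} \tag{9}$$

The parameters stand for (Table 1):

Table 1. Parameters of the bivariate distribution function.

Parameter                   Meaning                                        Domain of definition
$\mu_1, \mu_2$              First moment of $X_1$, $X_2$                   $-\infty < \mu_1, \mu_2 < \infty$
$\sigma_1^2, \sigma_2^2$    Variance of $X_1$, $X_2$                       $\sigma_1^2, \sigma_2^2 > 0$
$\rho$                      Correlation coefficient of $X_1$ and $X_2$     $-1 < \rho < 1$

For a given distribution function, the cumulative distribution function can be derived by double
integrating the distribution function with respect to both variables [5]:

$$F_N(x_1, x_2) = \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} f_N(v_1, v_2) \, dv_2 \, dv_1 \tag{10}$$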
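The double integration in equation (10) can be illustrated numerically. The following sketch assumes the standard bivariate normal density for equation (9) and approximates the double integral with a plain midpoint rule; the truncation of the lower integration limit at eight standard deviations and the grid size are illustrative choices, not part of the paper's implementation.

```python
import math

def bivariate_normal_pdf(x1, x2, mu1, mu2, sigma1, sigma2, rho):
    """Standard bivariate normal density (assumed form of equation (9))."""
    z1 = (x1 - mu1) / sigma1
    z2 = (x2 - mu2) / sigma2
    q = z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2
    norm = 2.0 * math.pi * sigma1 * sigma2 * math.sqrt(1.0 - rho * rho)
    return math.exp(-q / (2.0 * (1.0 - rho * rho))) / norm

def bivariate_normal_cdf(x1, x2, mu1, mu2, sigma1, sigma2, rho,
                         steps=200, span=8.0):
    """Midpoint-rule approximation of the double integral in equation (10).
    The lower limits are truncated at mu - span * sigma (an approximation)."""
    lo1, lo2 = mu1 - span * sigma1, mu2 - span * sigma2
    if x1 <= lo1 or x2 <= lo2:
        return 0.0
    h1, h2 = (x1 - lo1) / steps, (x2 - lo2) / steps
    total = 0.0
    for i in range(steps):
        v1 = lo1 + (i + 0.5) * h1
        for j in range(steps):
            v2 = lo2 + (j + 0.5) * h2
            total += bivariate_normal_pdf(v1, v2, mu1, mu2, sigma1, sigma2, rho)
    return total * h1 * h2

# Probability that both variables lie below their means.
p_independent = bivariate_normal_cdf(0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0)
p_correlated = bivariate_normal_cdf(0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.5)
```

For uncorrelated variables the probability of both lying below their means is 1/4; for a correlation of 0.5 the known closed form 1/4 + arcsin(0.5)/(2π) gives 1/3, which the sketch reproduces to within the discretization error.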


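Another important distribution for reliability models is the bivariate lognormal distribution. In
[6] the distribution function is defined as:

$$f_{LN}(x_1, x_2) = \frac{1}{2\pi x_1 x_2 \sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{\ln(x_1)-\mu_1}{\sigma_1}\right)^2 - 2\rho\,\frac{(\ln(x_1)-\mu_1)(\ln(x_2)-\mu_2)}{\sigma_1\sigma_2} + \left(\frac{\ln(x_2)-\mu_2}{\sigma_2}\right)^2 \right] \right\} \tag{11}$$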

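One way to estimate the parameters of the distribution is the method of moments. On the basis of
empirical failure data, the parameters can be estimated using the equations given in (12), where
$\tilde{\mu}_i$ and $\tilde{\sigma}_i$ denote the empirical mean and standard deviation of $X_i$ and
$Y_i = \ln X_i$:

$$\begin{aligned}
\sigma_i &: \text{standard deviation} & \sigma_i &= \left[ \log\left( 1 + \frac{\tilde{\sigma}_i^2}{\tilde{\mu}_i^2} \right) \right]^{1/2} \\
\mu_i &: \text{first moment} & \mu_i &= \log(\tilde{\mu}_i) - \frac{\sigma_i^2}{2} \\
\rho &: \text{correlation coefficient} & \rho &= \frac{E[(Y_1-\mu_1)(Y_2-\mu_2)]}{\sigma_1\sigma_2}
\end{aligned} \tag{12}$$

The cumulative distribution function can be derived by applying equation (10).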
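The method of moments for the lognormal case can be sketched as follows. The sketch assumes the usual lognormal moment relations (log-scale variance ln(1 + variance/mean²), log-scale mean ln(mean) minus half the log-scale variance) and takes the correlation from the logarithms of the data; the function name and the sample values are hypothetical.

```python
import math
from typing import List, Tuple

def lognormal_moment_fit(x1: List[float], x2: List[float]) -> Tuple[float, float, float, float, float]:
    """Return (mu1, sigma1, mu2, sigma2, rho) of a bivariate lognormal
    distribution estimated from sample means and variances."""
    n = len(x1)

    def log_params(x: List[float]) -> Tuple[float, float]:
        mean = sum(x) / n
        var = sum((v - mean) ** 2 for v in x) / n
        sigma = math.sqrt(math.log(1.0 + var / mean ** 2))
        mu = math.log(mean) - sigma ** 2 / 2.0
        return mu, sigma

    mu1, sigma1 = log_params(x1)
    mu2, sigma2 = log_params(x2)
    # Correlation of the log-values Y = ln X around the fitted log-scale means.
    cov = sum((math.log(a) - mu1) * (math.log(b) - mu2) for a, b in zip(x1, x2)) / n
    rho = cov / (sigma1 * sigma2)
    return mu1, sigma1, mu2, sigma2, rho

# Hypothetical failure data: mileage [km] and switching cycles at failure.
mileage = [8000.0, 15000.0, 22000.0, 31000.0, 47000.0]
cycles = [700.0, 1100.0, 1900.0, 2300.0, 3600.0]
mu1, sigma1, mu2, sigma2, rho = lognormal_moment_fit(mileage, cycles)
```

By construction the fitted parameters reproduce the sample mean, since exp(mu + sigma²/2) equals the empirical mean of each variable.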

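The reliability of electronic components in particular can often be described by the exponential
distribution. In [7] a bivariate exponential distribution, developed by Gumbel, is described:

$$F(x_1, x_2) = e^{-(\alpha x_1 + \beta x_2 + \theta\alpha\beta x_1 x_2)} \quad \text{for } 0 \le \theta \le 1 \tag{13}$$

By differentiating the cumulative distribution function with respect to both variables, the following
distribution function is obtained:

$$f(x_1, x_2) = \{ (1-\theta)\alpha\beta + \theta\alpha^2\beta x_1 + \theta\alpha\beta^2 x_2 + \theta^2\alpha^2\beta^2 x_1 x_2 \} \, F(x_1, x_2) \tag{14}$$

The parameters $\alpha$ and $\beta$ follow from the first moments:

$$\alpha = \frac{1}{E(X_1)}, \qquad \beta = \frac{1}{E(X_2)} \tag{15}$$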
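The relation between equations (13) and (14) can be checked numerically. The sketch below assumes that equation (13) has the Gumbel form exp(-(αx₁ + βx₂ + θαβx₁x₂)) and compares the bracketed expression of equation (14) with a finite-difference mixed derivative; the function names and parameter values are illustrative assumptions.

```python
import math

def gumbel_exp_F(x1, x2, alpha, beta, theta):
    """Assumed form of equation (13): exp(-(a*x1 + b*x2 + theta*a*b*x1*x2))."""
    return math.exp(-(alpha * x1 + beta * x2 + theta * alpha * beta * x1 * x2))

def gumbel_exp_f(x1, x2, alpha, beta, theta):
    """Equation (14): mixed second derivative of F in closed form."""
    factor = ((1.0 - theta) * alpha * beta
              + theta * alpha ** 2 * beta * x1
              + theta * alpha * beta ** 2 * x2
              + theta ** 2 * alpha ** 2 * beta ** 2 * x1 * x2)
    return factor * gumbel_exp_F(x1, x2, alpha, beta, theta)

def mixed_difference(x1, x2, alpha, beta, theta, h=1e-3):
    """Central finite-difference approximation of d^2 F / (dx1 dx2)."""
    F = lambda a, b: gumbel_exp_F(a, b, alpha, beta, theta)
    return (F(x1 + h, x2 + h) - F(x1 + h, x2 - h)
            - F(x1 - h, x2 + h) + F(x1 - h, x2 - h)) / (4.0 * h * h)

# Equation (15): rate parameters from (hypothetical) first moments.
alpha = 1.0 / 2.0   # E(X1) = 2.0
beta = 1.0 / 0.5    # E(X2) = 0.5
theta = 0.3
closed_form = gumbel_exp_f(1.0, 0.4, alpha, beta, theta)
numeric = mixed_difference(1.0, 0.4, alpha, beta, theta)
```

Agreement between the two values supports the term-by-term form of equation (14) under the assumed equation (13).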


For the univariate Weibull distribution several bivariate counterparts have been developed, for
example by Hougaard [8]. Because these functions are difficult to apply, the developed model handles
the bivariate Weibull case with copula functions. A copula function takes the marginal distributions
as its variables and contains additional parameters that describe their dependence. For the developed
reliability forecast model, three different copula functions fit the need. The goodness of fit with
empirical failure data is comparable for all three functions [9].

Gumbel copula:

$$F(x_1, x_2) = \exp\left\{ -\left[ (-\ln F_1(x_1))^{\theta} + (-\ln F_2(x_2))^{\theta} \right]^{1/\theta} \right\} \quad \text{for } \theta \ge 1 \tag{16}$$

Clayton copula:

$$F(x_1, x_2) = \left( F_1(x_1)^{-\theta} + F_2(x_2)^{-\theta} - 1 \right)^{-1/\theta} \quad \text{for } \theta > 0 \tag{17}$$

Frank copula:

$$F(x_1, x_2) = -\frac{1}{\theta} \log\left[ 1 + \frac{(e^{-\theta F_1(x_1)} - 1)(e^{-\theta F_2(x_2)} - 1)}{e^{-\theta} - 1} \right] \quad \text{for } \theta \ne 0 \tag{18}$$

$F_1(x_1)$ and $F_2(x_2)$ represent the cumulated marginal Weibull distributions of $x_1$ and $x_2$.
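A bivariate Weibull distribution built from a copula can be sketched as below. The three copulas are written in their standard textbook forms, which are assumed to correspond to equations (16) to (18); the Weibull scale and shape values are hypothetical and the function names are illustrative.

```python
import math

def weibull_cdf(x: float, scale: float, shape: float) -> float:
    """Cumulative marginal Weibull distribution."""
    return 1.0 - math.exp(-((x / scale) ** shape)) if x > 0.0 else 0.0

def gumbel_copula(u: float, v: float, theta: float) -> float:
    """Gumbel copula, theta >= 1, for u, v in (0, 1]."""
    return math.exp(-(((-math.log(u)) ** theta + (-math.log(v)) ** theta) ** (1.0 / theta)))

def clayton_copula(u: float, v: float, theta: float) -> float:
    """Clayton copula, theta > 0, for u, v in (0, 1]."""
    return (u ** (-theta) + v ** (-theta) - 1.0) ** (-1.0 / theta)

def frank_copula(u: float, v: float, theta: float) -> float:
    """Frank copula, theta != 0."""
    num = (math.exp(-theta * u) - 1.0) * (math.exp(-theta * v) - 1.0)
    return -math.log(1.0 + num / (math.exp(-theta) - 1.0)) / theta

def bivariate_weibull_cdf(x1, x2, copula, theta,
                          scale1=20000.0, shape1=1.6, scale2=1500.0, shape2=1.3):
    """Joint CDF: copula applied to the two marginal Weibull CDFs.
    Scale and shape defaults are made-up values for illustration."""
    return copula(weibull_cdf(x1, scale1, shape1), weibull_cdf(x2, scale2, shape2), theta)

joint = bivariate_weibull_cdf(15000.0, 1000.0, gumbel_copula, theta=2.0)
```

A useful sanity check for any copula is that it returns the other marginal when one argument equals 1, and that the Gumbel copula with θ = 1 reduces to the product of the marginals.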


2.4 Fitting of bivariate distribution functions 


The difficulty in dealing with multivariate distribution functions is the estimation of appropriate
parameters. The most important methods are explained briefly.


Method of moments

This method makes use of the mathematical relationship between the parameters of the function and
characteristic values of the sample data. The first step is to express the parameters of the function
as a formula of moments of the distribution. The values of the moments are then calculated on the
basis of the sample data. Important moments are, for example, the mean and variance of a
distribution. [5]


Maximum likelihood

A maximum likelihood estimate is the value of a parameter for which the sample data attain the
highest likelihood. For fitting univariate distributions, maximum likelihood is the default method
MATLAB uses. [5]


Method of least squares

This method estimates parameters by minimizing the deviation between the empirical and the
theoretical distribution. The deviation is measured by the sum of the squared residuals, and those
parameters for which this sum is minimal are chosen. The sum of squared residuals (SSE) also serves
as an index to compare the goodness of fit of several fitted theoretical distributions. [10] [11]
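The least-squares idea can be sketched for a single copula parameter. The example below minimizes the SSE between a Gumbel copula and given cumulative values with a simple grid search; the grid search, the search interval and the synthetic "empirical" values are assumptions made for illustration and do not reflect the optimizer used in the paper's MATLAB implementation.

```python
import math
from typing import List, Tuple

def gumbel_copula(u: float, v: float, theta: float) -> float:
    """Gumbel copula in its standard form, theta >= 1, u and v in (0, 1]."""
    return math.exp(-(((-math.log(u)) ** theta + (-math.log(v)) ** theta) ** (1.0 / theta)))

def sse(theta: float, points: List[Tuple[float, float, float]]) -> float:
    """Sum of squared residuals between model and empirical cumulative values."""
    return sum((gumbel_copula(u, v, theta) - p) ** 2 for u, v, p in points)

def fit_theta(points: List[Tuple[float, float, float]],
              lo: float = 1.0, hi: float = 10.0, steps: int = 901) -> float:
    """Grid search for the theta that minimizes the SSE (step width 0.01 here)."""
    candidates = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(candidates, key=lambda th: sse(th, points))

# Synthetic "empirical" values generated with theta = 2.0 (illustration only).
grid = [0.2, 0.5, 0.8]
points = [(u, v, gumbel_copula(u, v, 2.0)) for u in grid for v in grid]
theta_hat = fit_theta(points)
sse_at_fit = sse(theta_hat, points)
```

The SSE at the fitted parameter can then be compared across candidate distributions, as the text describes for the goodness of fit.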


3 DISTRIBUTION OF STRESS IN FIELD 


The univariate prognosis model described above uses the driven mileage as the variable that
describes the stress in field. When dealing with ECUs, the finite number of storage (write)
operations the non-volatile storage can handle before wearing out should also be considered.
Therefore switching cycles¹ are considered as an additional appropriate variable when modelling
electronic components.

The sample data used in this paper contains failure data of ECUs in automobile applications. The
prognosis is carried out on the basis of time in field, mileage and switching cycles at the time of
failure for each sample. Nevertheless, the model is not limited to electronic components. It can be
applied to all automobile systems and subsystems once two appropriate variables have been determined
to describe the stress in field.

Exposure profiles of automobiles vary greatly from customer to customer. To take this variety into
account, a distribution of stress in field is formed on the basis of all failures. A linear increase
of mileage and switching cycles over time is assumed. An empirical distribution of stress in field,
$lz_1(s, z)$, is obtained by considering all failure data, normalized to a certain time period, for
example one year:

$$s_1 = s_F \cdot \frac{t_1}{t_F}, \qquad z_1 = z_F \cdot \frac{t_1}{t_F} \tag{19}$$

$s_1$, $z_1$: mileage, switching cycles during one year
$s_F$, $z_F$: mileage, switching cycles until failure
$t_1$: one year
$t_F$: time in field until failure


¹ Switching cycle: Turning the ignition key to the off position forces the ECU to write data to the
non-volatile memory.
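Under the stated assumption of a linear increase of mileage and switching cycles over time, the normalization of each failure record to one year amounts to a simple rescaling by the ratio of the reference period to the time in field. The following sketch shows this step; the function name and the records are hypothetical.

```python
from typing import List, Tuple

def normalize_to_period(s_f: float, z_f: float, t_f: float,
                        t_1: float = 1.0) -> Tuple[float, float]:
    """Scale the stress at failure (mileage s_f, switching cycles z_f),
    reached after t_f years, to the reference period t_1 (linear usage assumed)."""
    if t_f <= 0.0:
        raise ValueError("time in field must be positive")
    scale = t_1 / t_f
    return s_f * scale, z_f * scale

# Hypothetical warranty records: (mileage [km], switching cycles, time in field [years]).
records: List[Tuple[float, float, float]] = [
    (30000.0, 2400.0, 2.0),
    (9000.0, 700.0, 0.5),
]
yearly = [normalize_to_period(s, z, t) for s, z, t in records]
```

The resulting yearly values are the raw material for the empirical distribution of stress in field, which is then binned and fitted.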


By using all of the fitting methods described above, a bivariate normal, lognormal and exponential
distribution as well as a Weibull copula function can be fitted to the sample data. A brief
description of the implemented fitting procedure for each distribution is given in the following:


Bivariate normal distribution

All parameters of the bivariate normal distribution (Table 1) are estimated by the method of
moments, with values calculated from the sample data.


Bivariate lognormal distribution

On the basis of mean, variance and covariance of the sample, the parameters were estimated using
equations (12).


Bivariate exponential distribution

Parameters $\alpha$ and $\beta$ can be estimated from the first moments of the sample data,
according to (15). Afterwards an estimator for $\theta$ is found by the method of least squares.


Bivariate Weibull distribution (Copula)

Both marginal distributions are estimated using maximum likelihood. In MATLAB this operation is
carried out by "wblfit". Parameter $\theta$ for equations (16), (17) and (18) is then estimated by
the method of least squares.

According to the goodness of fit, measured by SSE, the bivariate lognormal distribution and the
bivariate Weibull distribution fit the sample data best.

The sum of squared residuals of the bivariate Weibull distribution, with a value of 0.00255, is
smaller than that of the bivariate lognormal distribution with 0.00291. Therefore, for further
calculations the stress in field is considered to follow a bivariate Weibull distribution with the
estimated parameters. The cumulative distribution function $LZ_1(s, z)$ is derived by double
integrating $lz_1(s, z)$.


Figure 4. Empirical probability distribution of stress in field during a period of one year (axes:
mileage [km], switching cycles [number]).

Figure 5. Empirical probability distribution of stress in field during a period of one year and
fitted bivariate lognormal distribution.

Figure 6. Empirical probability distribution of stress in field during a period of one year and
fitted bivariate Weibull distribution (Gumbel copula).

Figure 7. Cumulative empirical probability distribution of stress in field during a period of one
year and fitted cumulative, bivariate Weibull distribution (Gumbel copula).

The cumulative distribution function can be transformed for arbitrary time periods, using the
following equation:

$$LZ_G(s, z) = LZ_1\left( s \cdot \frac{t_1}{t_G}, \; z \cdot \frac{t_1}{t_G} \right) \tag{21}$$

$LZ_1$: cumulative distribution function for the period of one year
$LZ_G$: cumulative distribution function for the warranty period
$t_1$: one year
$t_G$: warranty period

Figure 8. Distribution of the conditional failures during warranty period as a function of the
class of stress in field.
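The transformation of the one-year stress distribution to another period can be sketched as a wrapper around the one-year cumulative distribution function. The sketch assumes that equation (21) rescales both arguments by the ratio of one year to the target period, in line with the linear-usage assumption; the one-year CDF used here (independent Weibull marginals with made-up parameters) is only a stand-in for the fitted distribution.

```python
import math
from typing import Callable

Cdf2 = Callable[[float, float], float]

def scale_stress_cdf(lz_1: Cdf2, t_1: float, t_g: float) -> Cdf2:
    """Return the stress CDF for the period t_g, given the CDF lz_1 for the period t_1."""
    def lz_g(s: float, z: float) -> float:
        return lz_1(s * t_1 / t_g, z * t_1 / t_g)
    return lz_g

def lz_one_year(s: float, z: float) -> float:
    """Hypothetical one-year stress CDF (illustration only)."""
    f_s = 1.0 - math.exp(-((s / 15000.0) ** 1.8)) if s > 0.0 else 0.0
    f_z = 1.0 - math.exp(-((z / 1200.0) ** 1.5)) if z > 0.0 else 0.0
    return f_s * f_z

# Two-year warranty period as an example.
lz_warranty = scale_stress_cdf(lz_one_year, t_1=1.0, t_g=2.0)
```

With a two-year period, the probability of staying below 30 000 km and 2400 cycles equals the one-year probability of staying below 15 000 km and 1200 cycles.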


4 ESTIMATION OF CANDIDATES 


Regarding warranty data, the sample is divided into two subsets: components subject to a complaint
and fully functional components still in field. For every component subject to a complaint,
additional information exists:

- Time in field
- Mileage
- Switching cycles

For efficient further processing in MATLAB, the data set of failed components is discretized. For
the described application, a class size of 2000 km x 200 switching cycles was chosen. Each class
contains $n_{\mathrm{cond}}(s, z)$ failures. These failures are called conditional failures, because
additional failures, the candidates, are expected to occur until all components have reached the
considered class.

Considering the distribution of stress in field, the share of components that has already reached a
certain class of stress during the warranty period can be estimated:

$$P(S \ge s, Z \ge z) = 1 - LZ_G(s, z) \tag{22}$$

The number of corrected failures $n_{\mathrm{corr}}$ for each class is composed of all conditional
failures and all candidates:

$$n_{\mathrm{corr}}(s, z) = \frac{n_{\mathrm{cond}}(s, z)}{1 - LZ_G(s, z)} \tag{23}$$

Figure 9. Distribution of the corrected failures during warranty period as a function of the class
of stress in field.
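The correction of the observed failures per class can be sketched as follows, assuming that the corrected count is the observed (conditional) count divided by the share of components that has already reached the class. The dictionary layout, the point at which the stress CDF is evaluated for a class, and the example CDF values are all hypothetical choices for illustration.

```python
from typing import Callable, Dict, Tuple

StressClass = Tuple[float, float]  # (mileage [km], switching cycles) identifying a class

def corrected_failures(n_cond: Dict[StressClass, int],
                       lz_g: Callable[[float, float], float]) -> Dict[StressClass, float]:
    """Divide the conditional failures of each class by the share of
    components that has reached that class within the warranty period."""
    corrected = {}
    for (s, z), count in n_cond.items():
        share_reached = 1.0 - lz_g(s, z)
        if share_reached <= 0.0:
            raise ValueError("no component is expected to reach this class")
        corrected[(s, z)] = count / share_reached
    return corrected

# Hypothetical counts for two classes of size 2000 km x 200 switching cycles.
observed = {(2000.0, 200.0): 12, (4000.0, 400.0): 9}

def lz_g_example(s: float, z: float) -> float:
    """Made-up warranty-period stress CDF values for the two classes."""
    return 0.25 if s <= 2000.0 else 0.5

corrected = corrected_failures(observed, lz_g_example)
candidates = {k: corrected[k] - observed[k] for k in observed}
```

The difference between corrected and observed failures is the number of candidates per class.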


5 STRESS DEPENDENT LIFETIME
PREDICTION


Fitting an adequate theoretical distribution function to the distribution of corrected failures
allows a forecast of the stress-dependent lifetime of the regarded component. The procedure of
fitting bivariate distributions is similar to the methods described in Section 3. The corrected
failure density is calculated using the following equation, where $n_{\mathrm{corr}}(s, z)$ is the
number of corrected failures per class and $n_0$ the sample size:

$$\tilde{f}_{\mathrm{corr}}(s, z) = \frac{n_{\mathrm{corr}}(s, z)}{n_0} \tag{24}$$

A bivariate exponential, normal and lognormal distribution as well as a bivariate Weibull (copula)
distribution were fitted. According to the goodness of fit, measured by SSE, the bivariate lognormal
distribution and the bivariate Weibull distribution fit the corrected failure density
$\tilde{f}_{\mathrm{corr}}(s, z)$ best.

For the calculation of a time-dependent lifetime prediction, the goodness of fit of the theoretical
distribution functions has to be compared. According to SSE, the Weibull distribution fits the data
best:

SSE lognormal distribution: 33065

SSE Weibull distribution: 16945

For the present data set, the time-dependent lifetime prediction is therefore best carried out with
the corrected failure density represented by the Weibull distribution.


6 TIME DEPENDENT LIFETIME
PREDICTION


To transfer the stress-dependent lifetime distribution determined in Section 5 into a
time-dependent distribution, the distribution of stress in field must be considered:


Figure 10. Empirical corrected failure density and the fitted lognormal distribution.

Figure 11. Empirical failure density and fitted Weibull distribution using Gumbel copula.

Figure 12. Time-dependent lifetime distribution as the result of a bivariate prognosis model.

Figure 13. Time-dependent hazard rate.


The result of the bivariate prognosis model is the time-dependent lifetime prediction.

A time-dependent hazard rate can be determined using the following formula:

$$h(t) = \frac{1}{R(t)} \cdot \frac{dF(t)}{dt} = \frac{f(t)}{1 - F(t)}$$

The hazard rate shows a typical early failure characteristic.
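As an illustration of the hazard-rate formula, the sketch below differentiates a failure distribution function numerically and divides by the reliability. The Weibull distribution with a shape parameter below 1 is a hypothetical stand-in for the fitted time-dependent lifetime distribution; it is chosen because it produces the decreasing hazard rate that is typical of early failures.

```python
import math
from typing import Callable

def hazard_rate(cdf: Callable[[float], float], t: float, dt: float = 1e-4) -> float:
    """h(t) = f(t) / (1 - F(t)), with f(t) from a central difference of F."""
    density = (cdf(t + dt) - cdf(t - dt)) / (2.0 * dt)
    return density / (1.0 - cdf(t))

def weibull_cdf(t: float, scale: float = 5.0, shape: float = 0.7) -> float:
    """Hypothetical time-dependent failure distribution (time in years)."""
    return 1.0 - math.exp(-((t / scale) ** shape)) if t > 0.0 else 0.0

h_1 = hazard_rate(weibull_cdf, 1.0)  # hazard rate after one year
h_2 = hazard_rate(weibull_cdf, 2.0)  # hazard rate after two years
```

For a Weibull distribution the hazard rate is known in closed form, (shape/scale)·(t/scale)^(shape-1), which can be used to check the numerical result.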


7 ASSUMPTIONS/LIMITATIONS 


The model rests on a few assumptions concerning the data and the usage behavior:

- The warranty database is complete and all failures of the regarded period of time are
registered.
- The warranty database contains an appropriate number of samples (recommendation: minimum
1 000 failed samples).
- A constant increase of mileage over time is assumed for the complete lifetime of the component
or automobile.

The basic limitation of the described model is that it can describe only one type of failure
behavior (e.g. early failures, random failures or wear-out failures). In order to describe the full
life cycle of a product, consider the superposition of three independent predictions.


8 CONCLUSIONS 


The previous chapters demonstrate that a bivariate reliability forecast model can be created and
successfully implemented in MATLAB. Comparing the calculated forecast with real field failure data
showed good agreement. As a result, a model based on bivariate distribution functions with respect
to two variables of stress in field can successfully be applied in quality management of automobile
systems.


Newly enhanced computing algorithm to quantify unavailability
of maintained multi-component systems


R. Briš & N.T.T. Tran
VŠB - Technical University of Ostrava, Czech Republic


ABSTRACT: In our previous work we developed an analytical algorithm which is able to carry out
exact reliability quantification of highly reliable systems with maintenance (both preventive and
corrective). A directed Acyclic Graph (AG) was used as a system representation. The unavailability
of a node of an AG is in fact given by going over all possible combinations of probabilities of the
input edges (such combinations that cause failure of the node). This paper presents a new improvement
of the computing methodology which efficiently reduces computing time and complexity. The
improvement is based on applicable properties of internal non-terminal nodes. If an internal node has
high dimensionality, i.e. a lot of input edges, resulting in an excessive number of combinations to be
summed, the number of combinations can be significantly reduced by multiple application of the
pivotal decomposition. The effectiveness of the process will be demonstrated on selected systems.


1 INTRODUCTION

Estimating the (un)availability of a highly reliable, maintained multi-component system is a
problem of great interest in different areas such as computer systems, telecommunications,
mechanics, aircraft design, power utilities, and many other engineering fields. The increasing demand
for system reliability cannot rely on increasing the reliability of components alone, due to
technological restrictions. Safety systems of nuclear power stations represent another example of
highly reliable complex systems. They have to be reliable enough to comply with ever stricter
internationally agreed safety criteria; moreover, they are mostly so-called sleeping systems which
start and operate only in the case of big accidents. Their hypothetical failures are not apparent
(hidden or latent failures) and thus repairable only at optimally selected inspection times. A wide
class of highly reliable fault-tolerant systems which, through the use of redundancy, have the ability
to operate properly in the presence of faults is investigated in (Villén-Altamirano 2014). Any system
failure should have a small probability of occurring; that is, it should be a rare event. It is
important to estimate such probabilities because when a rare event does occur, its consequences may
be catastrophic. For example, network servers play an increasingly important role due to the rapid
growth in demand for internet services, and a server breakdown event may cause significant financial
losses. As a result, redundancy is usually built in to prevent services from breaking down.

A Monte Carlo simulation method (Marseguerra & Zio 2001) is used for the quantification of
reliability when accurate analytic or numerical procedures do not lead to satisfactory computations
of system reliability. Since highly reliable systems require many Monte Carlo trials to obtain
reasonably precise estimators of the reliability, various variance-reducing techniques (Tanaka,
Kumamoto & Inoue 1989) and possibly also techniques based on reduction of prior information
(Baca A. 1993) have been developed. A direct simulation technique has been improved by the
application of a parallel algorithm to such an extent that it can be used for real complex systems,
which can then be modelled and quantitatively estimated from the point of view of reliability without
the unrealistic simplifying conditions which analytic methods usually expect (Briš 2008). However, if
it is necessary to work with and quantitatively estimate highly reliable systems, for which
unreliability indicators (i.e. system non-functions) move to the order 10° and higher (i.e.10° etc.),
the simulation technique, however improved, can run into the problems of prolonged and inaccurate
computations.

In our previous work we developed a new analytical algorithm which is able to carry out exact
reliability quantification of highly reliable systems with maintenance (both preventive and
corrective). An exponential distribution is assumed for the time to failure and possibly for the time
to restoration. A generalization of the original methodology, so that it can be used for
unavailability quantification of systems with ageing input components with an optional lifetime
distribution (i.e. where generally distributed failure and repair times are assumed), is developed in
(Briš & Byczanski 2017a). A directed acyclic graph was used as a system representation. The algorithm
allows highly reliable and maintained input components to be taken into account. The unavailability
of a node of an AG is in fact given by going over all possible combinations of probabilities of the
input edges (such combinations that cause failure of the node). For example, with 20 input edges there
are on the order of a million (2^20) combinations to be summed. This process has two disadvantages:
first, computing errors can easily be committed, which were discussed and removed in (Briš &
Byczanski 2013), and second, systems having big nodes with a lot of input edges are hard to compute.

This paper presents a new improvement of the computing methodology which efficiently reduces both
CPU time and computing complexity. The improvement is based on applicable properties of internal
non-terminal nodes. If an internal node has high dimensionality, i.e. a lot of input edges, resulting
in a lot of combinations to be summed, the number of combinations can be significantly reduced by
multiple application of the pivotal decomposition. The effectiveness of the process will be
demonstrated on selected systems.


2 SYSTEM REPRESENTATION AND 
UNAVAILABILITY COMPUTATION 


2.1 The directed acyclic graph 


An example of a real system to be analyzed is shown in Figure 1. The system is represented by means
of a directed Acyclic Graph (AG) (Briš 2008). A graph is composed of nodes and edges. The highest
node (here u1) represents the functionality of the whole system (success, failure); internal and
terminal nodes represent subsystems and components. All of the nodes are connected by edges. The
direction of the graph is not explicitly marked in Figure 1; it is implied by the projection to the
vertical direction. The graph is acyclic, which means that it cannot contain feedback loops.

Terminal nodes of the AG, for example T1 or T2, are marked by blue squares. They represent the
stochastic functionality of input system components, given by a probability distribution of their
time to failure and a maintenance model. From them we can compute the time course of the
unavailability function of input components, using the methodology of basic renewal theory
(Briš 2007).

Internal (non-terminal) nodes are marked by blue triangles. They represent the functionality of
subsystems. A subsystem is functioning at a given time point (success) only if the number of
functioning inferior edges reaches at least the number shown inside the triangle, see Figure 1.
Otherwise it is not functioning (failure). For example, internal node u3 is functioning when the
number of functioning inferior edges is at least 1.

Figure 1. Graph structure of a real system.

The key problem is to estimate the point (or instantaneous) unavailability at any time t, which is
the probability that the system is unavailable at time t due to a failure or due to an ongoing repair
after the detection of a failure. The system may be composed of highly reliable components. In
(Briš 2010) we developed a new procedure for exact reliability quantification of a highly reliable
system. The procedure eliminates all errors made by computing hardware when calculations close to
the error limit are executed.


2.2 Determination of system unavailability 
according to a graph structure 


The algorithm will be demonstrated on the system from Figure 1, which is composed of different
terminal nodes. The probability of a non-functional state of the system (represented by the u1 node)
can be obtained bottom-up from the unavailability functions of independent terminal nodes.

For instance, the unavailability of internal node u7 can be computed as follows:

• numerical expression of the unavailability of the inferior terminal nodes, i.e. elements C6, C7
and T7 (having unavailability $q_6$, $q_7$ and $q_{T7}$).

• numerical expression of the unavailability $q_{u7}$ of the internal node u7, which is given by
the following sum:

$$\begin{aligned}
q_{u7} = {} & q_6 \, q_7 \, q_{T7} + (1-q_6) \, q_7 \, q_{T7} + q_6 \, (1-q_7) \, q_{T7} \\
& + q_6 \, q_7 \, (1-q_{T7}) + (1-q_6)(1-q_7) \, q_{T7} \\
& + (1-q_6) \, q_7 \, (1-q_{T7}) + q_6 \, (1-q_7)(1-q_{T7})
\end{aligned} \tag{1}$$

In other words, we go over all possibilities of the non-functioning state of the node u7 in
ascending order of functioning edges. The process is terminated when the number of functioning
inferior edges reaches 3 (i.e. the number that is inside the triangle).

The unavailability of a node of an AG is in fact given by going over all possible combinations of
probabilities of the input edges (such combinations that cause failure of the node). If a node has a
lot of input edges, we can expect computational difficulties. For example, with 20 input edges there
are on the order of a million (2^20) combinations to be summed, which results in an unacceptably long
computing time.
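The enumeration described above can be sketched directly: every combination of functioning and failed input edges is visited, and the probabilities of those combinations with too few functioning edges are summed. The function name and the numerical unavailabilities are hypothetical; the sketch assumes independent input edges, as the text does for terminal nodes.

```python
from itertools import product
from typing import Sequence

def node_unavailability(q: Sequence[float], m: int) -> float:
    """Unavailability of a node that functions when at least m of its
    input edges function; q holds the unavailabilities of the edges.
    All 2**len(q) edge-state combinations are enumerated."""
    total = 0.0
    for states in product((0, 1), repeat=len(q)):  # 1 = edge is functioning
        if sum(states) < m:                        # node fails in this combination
            prob = 1.0
            for q_i, up in zip(q, states):
                prob *= (1.0 - q_i) if up else q_i
            total += prob
    return total

# Node with three input edges that all have to function (as for u7),
# with made-up edge unavailabilities.
q_u7 = node_unavailability([0.1, 0.2, 0.3], m=3)
```

For three edges this visits 8 combinations, 7 of which contribute to the sum; the number of combinations doubles with every additional edge, which is the source of the computing-time problem.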


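2.3 Computational improvement


Figures 2-4 define the nodes N1-N3, with input edges denoted as 1, 2, ..., k. Consistent with
equation (2), N1 has the input edges 1, ..., k and the threshold m (the number inside the triangle),
N2 has the input edges 2, ..., k and the threshold m, and N3 has the input edges 2, ..., k and the
threshold m-1.

We denote the unavailability of the j-th node as U(Nj) and the unavailability of the i-th input
edge as U(i). Obviously we can write:

$$U(N1) = U(1) \cdot U(N2) + (1 - U(1)) \cdot U(N3) = U(1) \cdot [U(N2) - U(N3)] + U(N3) \tag{2}$$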


We call this operation pivotal decomposition (applied to the first input edge). The expression in
square brackets can be simplified (a lot of elements are eliminated by the subtraction) to the
following expression:

$$\sum \; \ldots \times U(i) \times \ldots \times (1 - U(j)) \times \ldots \tag{3}$$

where the indexes i, j are from {2, 3, ..., k} and the number of factors in parentheses is "m-1",
i.e. the sum runs over all combinations in which exactly m-1 of the edges 2, ..., k are functioning.

The last summand in (2) can be simplified by analogy (i.e. recurrently, in the same way as U(N1)).
In other words, the process of pivotal decomposition can be applied to the second input edge, etc.
By applying this process recurrently we obtain an alternative and time-saving formula to compute
U(N1).

Example: Let us substitute m = 3 and k = 4 in Figure 2. Then

$$\begin{aligned}
U(N1) = {} & U(1) \cdot \{ (1-U(2)) \cdot (1-U(3)) \cdot U(4) \\
& \qquad + (1-U(2)) \cdot U(3) \cdot (1-U(4)) \\
& \qquad + U(2) \cdot (1-U(3)) \cdot (1-U(4)) \} \\
& + U(2) \cdot \{ (1-U(3)) \cdot U(4) + U(3) \cdot (1-U(4)) \} \\
& + U(3) \cdot U(4)
\end{aligned} \tag{4}$$

Figure 2. Node 1 (N1).

Figure 3. Node 2 (N2).

Figure 4. Node 3 (N3).


2.4 Unavailability models of terminal nodes 


Most component models (i.e. models of terminal nodes) including
maintenance, both preventive and corrective, can be described by the
following three models:

Model I with components that cannot be repaired. The time dependent
unavailability coefficient U(t) for this simplest model is given by the
distribution function of the time to failure of the component:

$$
U(t) = F(t) = \int_0^t f(x)\,dx, \tag{5}
$$

where F(t) and f(t) are the distribution function and the probability
density function (pdf) of the time to failure.
Model II with repairable components (CM, Corrective Maintenance) for
apparent failures, i.e. a model in which a failure is identified at its
occurrence and a process leading to its restoration starts immediately
afterwards. In Model II, two random variables are directly connected,
i.e. the time to failure X, characterized by distribution function F(t)
and density f(t), and the repair (or restoration) time Y, characterized
by distribution function G(t) and density g(t). In this model we can
apply well known relations from renewal theory and alternating renewal
processes. In (Briš & Byczanski 2017b) we derived and proved the
following Theorem for the time dependent unavailability U(t):

$$
U(t) = \int_0^t f(x)\,[1 - G(t-x)]\,dx + \int_0^t (f * g)(x)\,U(t-x)\,dx, \tag{6}
$$

where * means convolution. This Theorem can be considered as a
recurrent linear integral equation which helps us to compute U(t).
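
How such a recurrent integral equation can be solved step by step is sketched below. The sketch assumes the standard alternating-renewal form U(t) = a(t) + ∫ h(x) U(t-x) dx with a(t) = ∫ f(x)[1 - G(t-x)] dx and h = f * g, exponential failure and repair times (rates `lam` and `mu`, with `mu` different from `lam`), and a trapezoidal discretisation; the function names and the grid are illustrative choices, not the numerical scheme of the cited work.

```python
import math


def unavailability_model2(lam, mu, t_end, steps):
    """Solve U(t) = a(t) + int_0^t h(x) U(t - x) dx on a uniform grid.

    Exponential time to failure (rate lam) and repair time (rate mu != lam):
      a(t) = lam (exp(-lam t) - exp(-mu t)) / (mu - lam)
      h(x) = (f * g)(x) = mu * a(x)
    Returns the list U(0), U(dt), ..., U(t_end).
    """
    dt = t_end / steps
    a = [lam * (math.exp(-lam * i * dt) - math.exp(-mu * i * dt)) / (mu - lam)
         for i in range(steps + 1)]
    h = [mu * ai for ai in a]
    u = [0.0] * (steps + 1)  # U(0) = 0: the component starts in the working state
    for n in range(1, steps + 1):
        # Trapezoidal rule; both end-point terms vanish since h(0) = 0 and U(0) = 0.
        conv = sum(h[j] * u[n - j] for j in range(1, n))
        u[n] = a[n] + dt * conv
    return u


def unavailability_markov(lam, mu, t):
    """Closed-form two-state Markov result, used only as a reference."""
    return lam / (lam + mu) * (1.0 - math.exp(-(lam + mu) * t))
```

For exponential distributions the grid solution can be compared with the closed-form Markov expression; for other distributions only a(t) and h(x) would have to be replaced, typically by numerically evaluated integrals.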

Model III with repairable components with hidden failures, i.e. a
model in which a failure is identified only at special
deterministically assigned times that recur with a given period (the
moments of periodical inspections). If a failure is found at one of
these times, a restoration process analogous to the previous case
starts. Inspections are carried out periodically. Let us denote the
inspection time points as $k_1, k_2, k_3, \ldots$; then
$k_{i+1} - k_i = T$ is the period of inspections. In (Briš & Byczanski
2017b) we derived a numerical formula to calculate the time dependent
unavailability coefficient U(t):


t-ki 
| E*N adx 


ky-kj 


U(t)= [reoae+ 5, { 
kn i=l 
+{1-Gr-%)| (7) 


where $S_i$ is the probability that a renewal was carried out at the
inspection time $k_i$. Numerical formulas for the probabilities $S_i$
are also derived in (Briš & Byczanski 2017b).


3 RESULTS WITH TESTED 
SYSTEM FROM REFERENCE 


We consider a system model that is a generalization of the Highly
Reliable Markovian System (HRMS) often used to represent the evolution
of multi-component systems in reliability settings, and which has been
studied in (Villén-Altamirano 2014, L’Ecuyer & Tuffin 2011), among
others. In the HRMS model, the system has c types of component, with
$n_i$ identical components of type i. The system works if at least
$r_i$ components of each type i work. Each component is either in a
failed state or in an operational state. Specifically, we consider a
system with 3 types of component, c = 3, with n components each.
Although formulas (5)-(7) are numerically realized for any probability
distributions f(t) and g(t), we assume exponential lifetimes and repair
times with failure rates $\lambda_1 = 0.01$, $\lambda_2 = 0.015$ and
$\lambda_3 = 0.0002$, respectively, and repair rate $\mu = 1$ for all
components, so that the results can be compared with those in the above
mentioned reference. In other words, all components are of Model II.
There are ample repairmen who work simultaneously on all the failed
components. The system breaks down as soon as at least one component
type has fewer than 2 operational units (r = 2). The redundancy is the
same for all 3 types of component. The graph structure of the system
for n = 8 is shown in Figure 5. We carried out comparison computations
for n = 8, 12, 16 and 20.

The concern is to estimate transient measures, 
such as time dependent system unavailability, 
including steady state unavailability. 


3.1 Computed results 


The system is so complex that it could hardly be computed with the
original methodology from (Bris 2010): the number of combinations of
all input edges of each of the 3 internal nodes is excessive. If the
computational improvements are applied, see formulas (2)-(4), all
computations run in CPU times of less than 1 min. Figs 6-8 show the
dependence of system unavailability on time for n = 8, 16 and 20,
ending in the steady state unavailability values.

Figure 5. Graph structure of the referenced system for n = 8.


3.2 Comparisons with referenced results 


An advanced simulation methodology, the so-called RESTART estimators,
was used to compute the unavailability of this system in
(Villén-Altamirano 2014). Table 1 compares our numerical results for
the steady state unavailability with the simulation results from the
reference, including the CPU time of the computation.

As can be observed, very low probabilities were 
accurately estimated with reduced computational 
effort. 
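
For the exponential case studied here, the order of magnitude of the steady state values can be cross-checked independently. The sketch below is not the AG algorithm of this paper: it assumes that, with ample repairmen, all components fail and are repaired independently, so that each component has steady state unavailability λ/(λ + μ) and the number of working units of one type is binomially distributed. The constant and function names are illustrative.

```python
from math import comb

FAILURE_RATES = (0.01, 0.015, 0.0002)  # lambda_1..lambda_3 from Section 3
REPAIR_RATE = 1.0                       # mu, common to all components
R_MIN = 2                               # each type needs at least r = 2 working units


def type_unavailability(q, n, r=R_MIN):
    """P(fewer than r of n independent components work), each down with probability q."""
    return sum(comb(n, w) * (1.0 - q) ** w * q ** (n - w) for w in range(r))


def system_unavailability(n):
    """Steady-state probability that at least one component type is down."""
    u = 0.0
    for lam in FAILURE_RATES:
        q = lam / (lam + REPAIR_RATE)  # steady-state unavailability of one component
        p_type = type_unavailability(q, n)
        # u_new = 1 - (1 - u)(1 - p_type), written so that tiny values are not lost.
        u = u + p_type - u * p_type
    return u


if __name__ == "__main__":
    for n in (8, 12, 16, 20):
        print(n, system_unavailability(n))
```

Writing the union as u + p - u p instead of 1 - (1 - u)(1 - p) matters here, because probabilities this small would vanish when subtracted from 1 in double precision.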

Our improved computational method, as well as the RESTART estimators,
can be extended to many other models of highly reliable components,
such as non-Markovian models, without much additional effort. In the
first case, computational experiments in (Bris & Byczanski 2017a)
comparing exponential and Weibull distributions do not show a relevant
difference in computational time between the two probability
distributions. In the second case, the author (Villén-Altamirano 2014)
showed that the computational times in the first two system versions
(for n = 8 and 12) were around 2.5-3 times greater when Rayleigh-Erlang
distributions were utilized, for two reasons: (i) it is more time
consuming to generate random numbers from these distributions than from
the exponential, and (ii) rescheduling with the Erlang distribution is
much more time-consuming than with the exponential.

Table 1. Steady state unavailability of the referenced system with
c = 3, r = 2 and changing n.

      Numerical results             Simulation results (relative error = 0.1)
n     U                CPU-time     U                CPU-time
8     1.284 x 10^-12   < 1 min      1.31 x 10^-12    0.67 min
12    8.74 x 10^-20    < 1 min      8.76 x 10^-20    2.33 min
16    5.49 x 10^-27    < 1 min      5.38 x 10^-27    3.60 min
20    3.258 x 10^-34   < 1 min      3.19 x 10^-34    9.60 min

Figure 6. Dependence of system unavailability on time, n = 8.


4 CONCLUSIONS 


We demonstrated that the innovative computational method for
unavailability quantification is comparable with recent advanced
simulation methodology. Our numerical method even gives better
computational times, particularly when n ≥ 12.

The innovative methodology for high-per- 
formance computing enables exact unavailability 
quantification of a maintained and highly reliable 
system containing highly reliable components hav- 
ing optional probability distribution of the time to 
failure, as well as time to repair, i.e. a complex sys- 
tem with both ageing and non-ageing components 
can be analyzed. All frequently used component 
models with both preventive and corrective main- 
tenance may be considered. 

The most important advantage of the comput- 
ing methodology is that it enables the analyst to 
calculate arbitrary small values of the unavailabil- 
ity function during a mission time exactly. Such 
small unavailability values as for example values 
of order 10* are hardly computed by other meth- 
ods or software, including advanced simulation 
RESTART estimators, where computational time 
increases with increasing number of components. 

Having system represented by AG, numerical 
expression of an unavailability value of one inter- 
nal node of the AG has a combinatorial character. 
We have to go over all combinations of input edges 
leading to a non-functional state of the node. We 
showed that using pivotal decomposition this com- 
putational process can be efficiently improved. 

The innovative computing methodology has been numerically realized in
the high-performance language MATLAB. All computations above were run
on an Intel (R) Core™ i7-3770 CPU @ 3.4 GHz 3.9 GHz, 8.00 GB RAM.

MLE versus MCMC estimators of the mixture of failure rate model

ABSTRACT: In this paper, the parameters and reliability
characteristics of the failure distribution of the mixture of failure
rates are estimated based on a complete sample using both the Markov
Chain Monte Carlo (MCMC) method and Maximum Likelihood Estimation
(MLE). While MLE is the most frequently used method for parameter
estimation, MCMC has recently emerged as a good alternative. The most
popular MCMC method, the Metropolis-Hastings algorithm, is used to
provide a complete analysis of the posterior distribution concerned. A
simulation study is provided to compare MCMC with MLE, and differences
between the estimates obtained by the two approaches are evaluated.


1 INTRODUCTION 

Engineering systems, while in operation, are always 
subject to environmental stresses and shocks which 
may or may not alter the failure rate function of 
the system. Suppose p is the unknown probability 
that the system is able to bear these stresses and 
its failure model remains unaffected, and q is the 
probability of the complementary event. In such 
situations, a failure distribution is generally used 
to describe mathematically the failure rate on the 
system. To some extent, the solution to the pro- 
posed problem is attempted through the mixture 
of distributions (Mann et al. 1974, Sinha 1986, 
Lawless 2002). However, in this regard we are faced 
with two problems. Firstly, there are many physi- 
cal causes that individually or collectively cause the 
failure of the system or device. 

At present, it is not possible to differentiate 
between these physical causes and mathematically 
account for all of them, and, therefore, the choice 
of a failure distribution becomes difficult. Sec- 
ondly, even if a goodness of fit technique is applied 
to actual observations of time to failure, we face a 
problem arising due to the non-symmetric nature 
of the life-time distributions whose behaviour is 
quite different at the tails where actual observations 
are sparse in view of the limited sample size (Mann 
et al. 1974). Obviously, the best one can do is to 
look out for a concept which is useful for differ- 
entiating between different life-time distributions. 
Failure rate is one such concept in the literature on 
reliability. After analyzing such physical considera- 
tions of the system, we can formulate a mixture of 
failure rate functions which, in turn, provide the 
failure time distributions. In view of the above, and
due to continuous stresses and shocks on the system,
let us suppose that the failure rate function
of a system remains unaltered with probability p,
and that it undergoes a change with probability q. Let
the failure rate function of the system in these two
situations be in either of the following two states
(Sharma et al. 1997):


1.1 State 1


Initially the system experiences a constant failure rate model, and
this model may (or may not) change with probability q (p = 1 - q).


1.2 State 2


If the stresses and shocks alter the failure rate model of the system
with probability q, then it experiences a wear-out failure model.

In comparison with Sharma et al. (1997), this study brings a
distinctive generalization of the state by introducing a new parameter,
which makes it possible to take into account the more general Weibull
model as well.

In probability theory and statistics, the Weibull distribution is a
continuous probability distribution named after Waloddi Weibull. It is
the most commonly used distribution to model times until failure and
provides a good description of many types of lifetimes (Rinne 2008,
Lawless 2002). The Weibull distribution has two parameters, a shape
parameter β and a scale parameter. Only shape parameters β > 1
correspond to an increasing failure rate, implying that ageing
processes can be intensively studied. We will therefore not consider
cases with β < 1. Recent studies that adopt a Weibull lifetime
distribution include Xia et al. (2015), Zhou et al. (2015), and
Xu & Cao (2015).

Because it flexibly describes the time to failure of a very wide
variety of mechanisms, the two-parameter Weibull distribution has
recently been used quite extensively in reliability and survival
analysis, particularly when the data are not censored. Much of the
attractiveness of the Weibull distribution is due to the wide variety
of shapes it can assume when its parameters are altered.

Using such a failure rate pattern, the characterization of the
life-time distribution in the corresponding situation is given. Various
inferential properties of this life-time distribution, along with the
estimation of parameters and reliability characteristics, are the
subject matter of the present study. Since the estimates based on the
operational data can be updated by incorporating past environmental
experience of the random variations in the life-time parameters (Martz
and Waller 1982), the Bayesian analysis of the parameters and other
reliability characteristics is also given.

The remainder of this article is organized as follows. Section 2
introduces the intended mixture of failure rates model together with
the basic characteristics of the corresponding life-time distribution.
Section 3 presents the maximum likelihood estimators for the two
unknown parameters of the model on the one hand, and the
Metropolis-Hastings MCMC algorithm on the other. In addition, the
Bayesian methodology resulting from the special likelihood function of
the mixture model is demonstrated, giving formulas for estimators that
are alternatives to the MLE. Section 4 gives a short illustration of
how the estimation procedures work. Section 5 reports a simulation
study of specially selected situations in which MCMC and MLE are
compared.


2 CHARACTERISTICS OF THE LIFE-TIME 
DISTRIBUTION 


Notations: Let
T: the random variable denoting the life-time of the system.
h(t): the failure rate function.
f(t): the probability density function (p.d.f.) of T.
F(t): the cumulative distribution function of T.
R(t) = P(T > t): the reliability/survival function.
$E(T) = \int_0^\infty R(t)\,dt$: the mean time to failure (MTTF).


2.1 The mixture of failure rate model 


Let

p: the probability of the event A, that the system is able to bear the
stresses and shocks and its failure pattern remains unaltered.

q = 1 - p: the probability of the complementary event $A^c$.

Further, let the mixture of the failure rate functions be

$$
h(t) = p\lambda + (1-p)\lambda t^{k}, \qquad \lambda, t > 0,\ 0 \le p \le 1 \tag{1}
$$

for which

1. p = 1 represents the failure rate of an exponential distribution.

2. k = 1 and p = 0 represents the failure rate of the Rayleigh
distribution, i.e. the Weibull distribution with shape parameter 2.
Note: in our context β = k + 1.

3. k = 1 represents the linear ageing process.

4. 0 < k < 1 represents the concave ageing process.

5. k > 1 represents the convex ageing process.


In Weibull reliability analysis it is frequently the case that the
value of the shape parameter is known (Martz & Waller 1982). For
example, the Rayleigh distribution is obtained when k = 1. The earliest
references to Bayesian estimation of the unknown scale parameter are in
Harris & Singpurwalla (1968). Since that time this case has been
considered by numerous authors, see Sharma et al. (1997), Canavos
(1974), Moore & Bilikam (1978), Tummala & Sathe (1978), Alexander et
al. (2009) and Muhammad et al. (2014). This study is a free
continuation and generalization of the research originally introduced
by Sharma et al. (1997).


2.2 Characteristics of the life-time distribution 


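Using the well-known relationship between the p.d.f. and the failure
rate function

$$
f(t) = h(t)\exp\left\{-\int_0^t h(x)\,dx\right\}, \qquad t > 0 \tag{2}
$$

and in view of (1), the p.d.f. of the life-time T is

$$
f(t) = \lambda\left(p + (1-p)t^{k}\right)\exp\left\{-\lambda\left(pt + \frac{1-p}{k+1}\,t^{k+1}\right)\right\}, \qquad t > 0. \tag{3}
$$

The reliability function is

$$
R(t) = \exp\left\{-\lambda\left(pt + \frac{1-p}{k+1}\,t^{k+1}\right)\right\}, \qquad t > 0. \tag{4}
$$

The MTTF is given by

$$
\mathrm{MTTF} = E(T) = \int_0^\infty R(t)\,dt = \int_0^\infty \exp\left\{-\lambda\left(pt + \frac{1-p}{k+1}\,t^{k+1}\right)\right\}dt. \tag{5}
$$

This integral can be evaluated by a suitable numerical method.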
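
One possible numerical evaluation of the MTTF integral is sketched below. The composite Simpson rule, the rule for truncating the infinite upper limit and the function names are our illustrative choices; the paper does not state which numerical method was used.

```python
import math


def cumulative_hazard(t, p, lam, k=1.0):
    """Integral of h(x) from 0 to t: lam (p t + (1 - p) t^(k+1) / (k + 1))."""
    return lam * (p * t + (1.0 - p) / (k + 1.0) * t ** (k + 1.0))


def reliability(t, p, lam, k=1.0):
    """R(t) = exp(-cumulative hazard), the reliability function of the model."""
    return math.exp(-cumulative_hazard(t, p, lam, k))


def mttf(p, lam, k=1.0, intervals=20000):
    """Approximate the MTTF integral by the composite Simpson rule on [0, upper]."""
    # Double the upper limit until R(upper) <= exp(-50), so the tail is negligible.
    upper = 1.0
    while cumulative_hazard(upper, p, lam, k) < 50.0:
        upper *= 2.0
    step = upper / intervals  # `intervals` must be even for Simpson's rule
    total = reliability(0.0, p, lam, k) + reliability(upper, p, lam, k)
    for i in range(1, intervals):
        weight = 4.0 if i % 2 else 2.0
        total += weight * reliability(i * step, p, lam, k)
    return total * step / 3.0


if __name__ == "__main__":
    print(mttf(1.0 / 3.0, 0.2, k=1.0))
```

For k = 1 the integrand is a shifted Gaussian, so the result can also be checked against a closed form based on the complementary error function.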


3 ESTIMATION OF PARAMETERS AND 
RELIABILITY CHARACTERISTICS 


Let D: $t_1, \ldots, t_n$ be the random failure times of n items under
test whose failure time distribution is as given in (3). Then the
likelihood function is

$$
L(D \mid \lambda, p) = \lambda^{n}\left[\prod_{i=1}^{n}\left(p + (1-p)t_i^{k}\right)\right]
\exp\left\{-\lambda\sum_{i=1}^{n}\left(pt_i + \frac{1-p}{k+1}\,t_i^{k+1}\right)\right\}. \tag{6}
$$


3.1 Maximum likelihood estimation


The log-likelihood function can be written as

$$
\log L(D \mid \lambda, p) = \sum_{i=1}^{n}\log\left(p + (1-p)t_i^{k}\right) + n\log\lambda
- \lambda\sum_{i=1}^{n}\left(pt_i + \frac{1-p}{k+1}\,t_i^{k+1}\right). \tag{7}
$$

From (7), we derive the likelihood equations for the two parameters p
and λ by taking the partial derivatives with regard to each of the
parameters and equating them to zero:


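$$
\frac{\partial \log L}{\partial \lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n}\left(pt_i + \frac{1-p}{k+1}\,t_i^{k+1}\right) = 0, \tag{8}
$$

$$
\frac{\partial \log L}{\partial p} = \sum_{i=1}^{n}\frac{1 - t_i^{k}}{p + (1-p)t_i^{k}} - \lambda\sum_{i=1}^{n}\left(t_i - \frac{t_i^{k+1}}{k+1}\right) = 0. \tag{9}
$$

We get

$$
\lambda = \frac{n}{\sum_{i=1}^{n}\left(pt_i + \frac{1-p}{k+1}\,t_i^{k+1}\right)} \tag{10}
$$

and

$$
\sum_{i=1}^{n}\frac{1 - t_i^{k}}{p + (1-p)t_i^{k}} = \frac{n\sum_{i=1}^{n} t_i\left(1 + k - t_i^{k}\right)}{\sum_{i=1}^{n} t_i\left(p(1+k) + (1-p)t_i^{k}\right)}. \tag{11}
$$

Equation (11) may be solved for p by the Newton-Raphson method or
another suitable iterative method, and this value is substituted in
(10) to obtain λ. By the invariance property of MLEs:

1. The MLE for R(t), say $\hat{R}(t)$, will be

$$
\hat{R}(t) = \exp\left\{-\hat{\lambda}\left(\hat{p}t + \frac{1-\hat{p}}{k+1}\,t^{k+1}\right)\right\}. \tag{12}
$$

2. The MLE for h(t), say $\hat{h}(t)$, will be

$$
\hat{h}(t) = \hat{\lambda}\left(\hat{p} + (1-\hat{p})t^{k}\right). \tag{13}
$$

3. The MLE for MTTF will be

$$
\widehat{\mathrm{MTTF}} = \mathrm{MTTF}(\hat{p}, \hat{\lambda}), \tag{14}
$$

which can be obtained by substituting into (5) and integrating.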
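
The two-step estimation can be sketched as follows. Instead of the Newton-Raphson iteration mentioned above, this illustration substitutes a plain grid search over p on the profile log-likelihood, which maximizes the same function while avoiding derivatives; the function names are hypothetical, and the data are the 30 failure times listed in Table 1 of Section 4.

```python
import math

# Failure times listed in Table 1 of Section 4 (n = 30).
DATA = [3.615, 1.261, 1.964, 4.534, 2.176, 1.799,
        6.704, 1.169, 4.563, 1.371, 2.784, 4.779,
        2.346, 2.105, 5.059, 3.657, 1.882, 5.270,
        5.955, 2.894, 2.452, 0.821, 0.863, 0.468,
        3.748, 4.455, 3.326, 0.811, 2.416, 2.468]


def lambda_given_p(p, data, k=1.0):
    """Formula (10): the maximizing lambda for a fixed p."""
    return len(data) / sum(p * t + (1.0 - p) / (k + 1.0) * t ** (k + 1.0)
                           for t in data)


def profile_log_likelihood(p, data, k=1.0):
    """Log-likelihood (7) evaluated at lambda = lambda_given_p(p)."""
    n = len(data)
    lam = lambda_given_p(p, data, k)
    # At this lambda the last term of (7) equals exactly n.
    return sum(math.log(p + (1.0 - p) * t ** k) for t in data) + n * math.log(lam) - n


def mle(data, k=1.0, grid=10000):
    """Grid search for p on [0, 1] with resolution 1/grid, then lambda from (10)."""
    p_hat = max((i / grid for i in range(grid + 1)),
                key=lambda p: profile_log_likelihood(p, data, k))
    return p_hat, lambda_given_p(p_hat, data, k)


if __name__ == "__main__":
    print(mle(DATA))
```

A finer grid or a few Newton-Raphson steps started from the grid maximum would refine the estimate of p further if more digits were needed.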


3.2 The Metropolis-Hastings algorithm 


The Metropolis-Hastings algorithm is the most popular MCMC method
(Hastings 1970, Metropolis et al. 1953). According to Navarro &
Perfors, MCMC provides a method for sampling from some generic
distribution p(x), called the target distribution. The idea is that in
many cases we know how to write out the equation for the target
distribution p(x), but we do not know how to generate a random number
from it. This is the situation in which MCMC is very useful. In fact,
for the Metropolis-Hastings algorithm we do not even need to know how
to calculate p(x) completely.

The basic idea behind MCMC is to define a Markov chain over possible x
values in such a way that the stationary distribution of the Markov
chain is in fact p(x). That is, we use a Markov chain to generate a
sequence of x values, denoted $(x_0, x_1, x_2, \ldots)$, in such a way
that as $n \to \infty$ we can guarantee that $x_n \sim p(x)$. There are
many different ways of setting up a Markov chain that has this
property, one of which is the Metropolis-Hastings algorithm.

Here is how the Metropolis-Hastings algorithm works. Suppose that the
current state of the Markov chain is $x_n$, and we want to generate
$x_{n+1}$. In the Metropolis-Hastings algorithm, the generation of
$x_{n+1}$ is a two-stage process. The first stage is to generate a
candidate, which we will denote $x^*$. The value of $x^*$ is generated
from the proposal distribution, denoted $q(x^* \mid x_n)$, which
depends on the current state of the Markov chain, $x_n$. There are a
few minor technical constraints on what we can use as a proposal
distribution.

The second stage is the accept-reject step. First, we calculate the
acceptance probability $A(x_n \to x^*)$, which is given by:

$$
A(x_n \to x^*) = \min\left(1,\ \frac{p(x^*)}{p(x_n)} \times \frac{q(x_n \mid x^*)}{q(x^* \mid x_n)}\right) \tag{15}
$$

There are two things to pay attention to here. First, notice that the
ratio $p(x^*)/p(x_n)$ does not depend on the normalizing constant of
the target distribution p(x). The second thing to pay attention to is
the behavior of the other term, $q(x_n \mid x^*)/q(x^* \mid x_n)$. What
this term does is correct for any biases that the proposal distribution
might induce. In this expression, the denominator $q(x^* \mid x_n)$
describes the probability of generating $x^*$ as the candidate given
that the current state is $x_n$ (i.e. what actually happened), whereas
the numerator describes the probability that the "opposite" event would
have occurred: that is, if the current state had actually been $x^*$,
what is the probability that $x_n$ would have been generated as the
candidate value? If the proposal distribution is symmetric, then these
two probabilities are equal, i.e.
$q(x_n \mid x^*)/q(x^* \mid x_n) = 1$. This special case of the
Metropolis-Hastings algorithm is called the Metropolis algorithm.

Having proposed the candidate $x^*$ and calculated the acceptance
probability $A(x_n \to x^*)$, we now decide either to "accept" the
candidate (in which case we set $x_{n+1} = x^*$) or to "reject" the
candidate (in which case we set $x_{n+1} = x_n$). To make this
decision, we generate a (uniformly distributed) random number between 0
and 1, denoted u. Then:

$$
x_{n+1} = \begin{cases} x^* & \text{if } u \le A(x_n \to x^*) \\ x_n & \text{if } u > A(x_n \to x^*) \end{cases} \tag{16}
$$


3.3 Bayesian estimation


For our mixture model, the Bayesian model is constructed by specifying
a prior distribution for p and λ, and then multiplying it by the
likelihood function to obtain the posterior distribution. Given a set
of data D: $t_1, \ldots, t_n$, the log-likelihood function is

$$
\log L(D \mid \lambda, p) = \sum_{i=1}^{n}\log\left(p + (1-p)t_i^{k}\right) + n\log\lambda
- \lambda\sum_{i=1}^{n}\left(pt_i + \frac{1-p}{k+1}\,t_i^{k+1}\right). \tag{17}
$$

Denoting the prior distribution of p and λ as $\pi(p, \lambda)$, the
posterior distribution of p and λ given D: $t_1, \ldots, t_n$ is

$$
\pi(p, \lambda \mid D) = \frac{L(D \mid p, \lambda)\,\pi(p, \lambda)}{\iint L(D \mid p, \lambda)\,\pi(p, \lambda)\,dp\,d\lambda}. \tag{18}
$$

Because the denominator in (18) is a normalizing constant, Bayes'
theorem is often expressed as:

$$
\pi(p, \lambda \mid D) \propto L(D \mid p, \lambda)\,\pi(p, \lambda) \tag{19}
$$

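Here the prior distribution is given beforehand, usually based on
prior information about the parameters, such as historical data,
previous experience, expert suggestions, even wholly subjective
suppositions, or simply mathematical convenience.

The proposed priors for the parameters p and λ may be taken as

$$
\pi(p) = \frac{1}{B(a,b)}\,p^{a-1}(1-p)^{b-1}, \qquad a, b > 0, \tag{20}
$$

$$
\pi(\lambda) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\lambda^{\alpha-1}e^{-\beta\lambda}, \qquad \alpha, \beta > 0,\ \lambda > 0. \tag{21}
$$

For these two parameters, we assume independent priors. Then the joint
prior distribution for p and λ will be

$$
\pi(p, \lambda) = \frac{\beta^{\alpha}}{B(a,b)\,\Gamma(\alpha)}\,p^{a-1}(1-p)^{b-1}\lambda^{\alpha-1}e^{-\beta\lambda}. \tag{22}
$$

In view of the prior in (22), the posterior distribution of p and λ
given D: $t_1, \ldots, t_n$ is given by

$$
\pi(p, \lambda \mid D) \propto \lambda^{n+\alpha-1}\left[\prod_{i=1}^{n}\left(p + (1-p)t_i^{k}\right)\right]
\exp\left\{-\lambda\left(\sum_{i=1}^{n}\left(pt_i + \frac{1-p}{k+1}\,t_i^{k+1}\right) + \beta\right)\right\}
p^{a-1}(1-p)^{b-1}. \tag{23}
$$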

Then, under the squared error loss function, the Bayes estimates of p,
λ, the failure rate function h(t) and the reliability function R(t) are
given by

$$ p^* = E(p \mid D) \tag{24} $$
$$ \lambda^* = E(\lambda \mid D) \tag{25} $$
$$ h^*(t) = E(h(t) \mid D) \tag{26} $$
$$ R^*(t) = E(R(t) \mid D) \tag{27} $$

In our study, we use adaptive Metropolis-Hastings sampling (Chivers
2012) to generate a sample $\theta_i = (p_i, \lambda_i)$,
$i = 1, \ldots, N$, from the posterior distribution
$\pi(p, \lambda \mid D)$. Monte Carlo integration then estimates
$p^*$, $\lambda^*$, $h^*(t)$ and $R^*(t)$ by calculating the means:

$$ p^* = E(p \mid D) = \frac{1}{N}\sum_{i=1}^{N} p_i \tag{28} $$
$$ \lambda^* = E(\lambda \mid D) = \frac{1}{N}\sum_{i=1}^{N} \lambda_i \tag{29} $$
$$ h^*(t) = E(h(t) \mid D) = \frac{1}{N}\sum_{i=1}^{N} h(t; p_i, \lambda_i) \tag{30} $$
$$ R^*(t) = E(R(t) \mid D) = \frac{1}{N}\sum_{i=1}^{N} R(t; p_i, \lambda_i) \tag{31} $$

4 ILLUSTRATIVE EXAMPLE 


In this section, we present an example to illustrate the estimation
procedures discussed in this paper. We consider the data given in
Table 1, which were used as an illustrative example in our previous
study (Bris & Thach 2016). The data set was generated for p = 1/3,
λ = 0.2 and n = 30, with the parameter k fixed to one. In this study,
we use both the MLE and the MCMC method to estimate the parameters and
reliability characteristics. To obtain the MCMC estimators, the prior
parameters are arbitrarily taken as a = b = 2, α = 0.1 and β = 1; we
ran the Metropolis-Hastings algorithm to construct a Markov chain of
length 50,000 with a burn-in of 1000, and reduced the autocorrelation
by retaining only every 5th iteration of the chain, which gives 9801
samples. Table 2 shows our MCMC point estimates and two-sided 90% Bayes
credible intervals for p and λ, and Table 3 shows our MLE point
estimates and two-sided 90% bootstrap confidence intervals BCa (bias
corrected and accelerated) for p and λ. Figures 1-2 show the posterior
distributions and trace plots of each parameter of the Bayesian model
obtained by the Metropolis-Hastings algorithm, and Figures 3-4 show the
time courses of the reliability function and failure rate function
obtained by both the MLE and the MCMC method. On the basis of these
results, we may conclude that, for this data set, the MCMC point
estimates are closer to the true values than the MLE ones.


Table 1. Data from our previous study.

3.615   1.261   1.964   4.534   2.176   1.799
6.704   1.169   4.563   1.371   2.784   4.779
2.346   2.105   5.059   3.657   1.882   5.270
5.955   2.894   2.452   0.821   0.863   0.468
3.748   4.455   3.326   0.811   2.416   2.468

Table 2. Point estimates and two-sided 90% Bayes credible interval
(BCI) for p and λ.

        True value   MCMC     90% BCI
p       1/3          0.3242   [0.3089, 0.6169]
λ       0.2          0.2062   [0.2035, 0.2792]
MTTF    2.9844       2.9797   [2.4726, 3.4726]

Table 3. Point estimates and two-sided 90% bootstrap confidence
interval BCa (bias corrected and accelerated) for p and λ.

        True value   MLE      90% BCa
p       1/3          0.0065   [0.0000, 0.6276]
λ       0.2          0.1792   [0.1294, 0.2083]
MTTF    2.9844       2.9636   [2.6410, 3.3790]

Figure 2. Traces of each parameter of the Bayesian model.

Figure 4. The time courses of failure rate functions.


5 SIMULATION STUDY 


A Monte Carlo simulation study is conducted to compare the performance
of the MLE and MCMC estimators for the parameters of the mixture
failure rate. For each of the following choices of parameters, we
simulate 1000 sets of data with sample size n = 20, 50, 100 and 200,
respectively, and based on each set of data we compute the MLE and MCMC
estimators of the model parameters. To obtain the MCMC estimators, the
prior parameters are taken as in Section 4, and we ran the
Metropolis-Hastings algorithm to construct a Markov chain of length
20,000 with a burn-in of 1000, reducing the autocorrelation by
retaining only every 5th iteration of the chain, which gives 3801
samples. Note that, as discussed earlier, when p = 1 the hazard rate
function h(t) represents the failure rate of an exponential
distribution, while when p = 0 it represents the failure rate of the
Rayleigh distribution, i.e. the Weibull distribution with shape
parameter 2.


1. p = 0.3 and λ = 0.2
2. p = 0.5 and λ = 0.2
3. p = 0.7 and λ = 0.2


Tables 4-6 list the results of the simulation study. Denote
$\hat{\theta}$ as the MLE and $\tilde{\theta}$ as the MCMC estimator of
θ. Bias is calculated as the mean of the 1000 estimates minus the true
value, and MSE is the mean square error, i.e. the mean of the squared
differences between the 1000 estimates and the true value. Figures 5-10
show the bias and MSE values listed in Tables 4-6.


Table 4. Comparison of $\hat{\theta}$ and $\tilde{\theta}$ for θ = (0.3, 0.2).

n     Method   Bias p     MSE p    Bias λ     MSE λ
20    MLE      -0.0503    0.0657    0.0119    0.0036
      MCMC      0.0890    0.0152    0.0120    0.0019
50    MLE      -0.0431    0.0414    0.0016    0.0015
      MCMC      0.0500    0.0119    0.0070    0.0010
100   MLE      -0.0157    0.0239    0.0021    0.0007
      MCMC      0.0342    0.0098    0.0057    0.0005
200   MLE      -0.0131    0.0130    0.0006    0.0004
      MCMC      0.0128    0.0071    0.0028    0.0003

Table 5. Comparison of $\hat{\theta}$ and $\tilde{\theta}$ for θ = (0.5, 0.2).

n     Method   Bias p     MSE p    Bias λ     MSE λ
20    MLE      -0.0974    0.0864    0.0053    0.0038
      MCMC     -0.0530    0.0126   -0.0044    0.0018
50    MLE      -0.0505    0.0423    0.0026    0.0016
      MCMC     -0.0462    0.0140   -0.0018    0.0010
100   MLE      -0.0207    0.0221    0.0019    0.0009
      MCMC     -0.0317    0.0123   -0.0014    0.0006
200   MLE      -0.0139    0.0113    0.0010    0.0004
      MCMC     -0.0251    0.0090   -0.0011    0.0004

Table 6. Comparison of $\hat{\theta}$ and $\tilde{\theta}$ for θ = (0.7, 0.2).

n     Method   Bias p     MSE p    Bias λ     MSE λ
20    MLE      -0.1178    0.0867    0.0026    0.0048
      MCMC     -0.1750    0.0433   -0.0218    0.0024
50    MLE      -0.0559    0.0312   -0.0011    0.0018
      MCMC     -0.1218    0.0282   -0.0163    0.0013
100   MLE      -0.0308    0.0147   -0.0002    0.0009
      MCMC     -0.0779    0.0165   -0.0099    0.0008
200   MLE      -0.0115    0.0067    0.0001    0.0005
      MCMC     -0.0393    0.0076   -0.0055    0.0004
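
The two building blocks of such a simulation study can be sketched as follows. The paper does not state how the data sets were generated; the sampler below assumes inverse-transform sampling from the reliability function of the model with k = 1, for which the cumulative hazard is a quadratic in t, and the function names are hypothetical.

```python
import math
import random


def sample_lifetime(p, lam, rng):
    """Draw T with R(t) = exp(-lam (p t + (1 - p) t^2 / 2)), i.e. the model with k = 1."""
    e = rng.expovariate(1.0)  # cumulative hazard reached at failure, E ~ Exp(1)
    c = e / lam               # solve p t + (1 - p) t^2 / 2 = c for t >= 0
    if p == 1.0:
        return c              # purely exponential case
    return (-p + math.sqrt(p * p + 2.0 * (1.0 - p) * c)) / (1.0 - p)


def bias_and_mse(estimates, true_value):
    """Bias = mean(estimate) - truth; MSE = mean((estimate - truth)^2)."""
    m = len(estimates)
    bias = sum(estimates) / m - true_value
    mse = sum((x - true_value) ** 2 for x in estimates) / m
    return bias, mse


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [sample_lifetime(0.3, 0.2, rng) for _ in range(20)]
    print(sample)
```

In a full study, each simulated data set would be passed to the MLE and MCMC estimators, and `bias_and_mse` would then be applied to the resulting 1000 estimates of each parameter.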

From the comparison of estimates, we observe the following:


• For the estimation of p and λ, although in some cases the biases of
  MLE are smaller than those of MCMC, MCMC has an overwhelming
  advantage over MLE in terms of MSE. Therefore MCMC is more stable
  than MLE, in spite of the fact that when the sample size is large
  (say, larger than 100), MLE has less bias than MCMC. In general MCMC
  is better than MLE.

• When the sample size is small (say, less than 100), MCMC behaves
  better than MLE with regard to the indexes bias and MSE. The
  advantage of MCMC over MLE is especially remarkable in the estimation
  of the parameter λ.

Figure 7. Comparison of bias of $\hat{\theta}$ and $\tilde{\theta}$ for θ = (0.5, 0.2).

Figure 10. Comparison of MSE of $\hat{\theta}$ and $\tilde{\theta}$ for θ = (0.7, 0.2).


6 CONCLUSION 


Based on the simulation results, we may conclude that, under the
squared error loss function, the MCMC method can give better results
than the MLE method in estimating the parameters and reliability
characteristics of the failure distribution of the mixture failure
rate. We suggest the use of MCMC instead of MLE for point estimation
when the sample size is not very large. Even when the sample size is
large, MCMC is still more stable than MLE.

ABSTRACT: The crane is supporting equipment in the construction of
engineering projects, so ensuring its safe operation has become an
important task. Failure of the critical load-bearing part, the steel
structure, can affect the reliable operation of the whole crane.
Therefore, the steel structure is chosen as the object of study. First
of all, we construct the testing system and data-integration platform
to collect running data. Then, the article focuses on establishing the
index system of safety assessment, determining the weight of each index
based on the Analytic Hierarchy Process (AHP), and obtaining the
membership matrix by utilizing Fuzzy Theory. On this basis, we evaluate
the safety level of the current state of the steel structure, and
further express the possibility of each safety level occurring in
probability form. Finally, we introduce a parameter, the transition
probability between different safety states, and build a model for
predicting the future safety state of the steel structure.


1 INTRODUCTION

1.1 Research background


Cranes play an indispensable role in the construction of engineering
projects, and have been widely used in pillar industries such as
machinery manufacturing, transportation and logistics, water
conservancy and hydropower engineering, and nuclear power construction.
In recent years, cranes have been developing rapidly in the direction
of large scale and specialization, in order to adapt to the increasing
expansion of basic industry and infrastructure.

Under actual operating conditions, complex alternating loads are
frequently applied to cranes, which makes safety a significant index in
the process of design, manufacture and maintenance. The safety of
cranes is therefore of great importance. In case of failure, the
economic loss produced tends to be extremely disastrous. Accordingly,
carrying out the study of safety assessment and safety state prediction
has become the critical breakthrough in solving the problem.


1.2 Research reviews


The latest research developments indicate that testing technology
(Tian et al. 2009) and assessment method (Yang 2005) have been the
focus in the field of crane safety assessment.

In the aspect of testing technology, remote monitoring (Szpytko 1998,
Li & Liu 2012) and spot inspection (Zhang 2017) are widely used in this
field, and have also become the research focus. Especially in the era
of the Internet of Things and Big Data, real-time monitoring of actual
operation and timely capture of overload, emergency braking and other
safety conditions can improve the safety and production efficiency of
cranes. However, the current inspection of cranes is still based on the
traditional way, which depends on spot experience and simple instrument
inspection (Li & Yin 2011). Obviously, the current inspection level is
relatively low, and the inspection content is narrow. It is limited to
the standard inspection, and no inspection items are carried out
according to new market demands, such as safety level assessment and
future state prediction. Moreover, many monitoring systems mainly focus
on reflecting the real-time running parameters (Cen et al. 2015), and
the in-depth analysis of these parameters is not sufficient. Thus, it
is difficult to assess the current safety state and failure possibility
of cranes (Wang et al. 2013, Makovskii 1994).

In terms of assessment method, the current research work mostly
includes two aspects, the assessment of the current safety state (Yang
et al. 2009, Bucas et al. 2014) and the prediction of the future safety
state (Wu et al. 2010); the latter is also called life prediction. At
present, the assessment of a crane's safety state is mainly based on
the Grey Theory Method, Combination Weighting Method, Unascertained
Measurement Theory, Fisher Discriminant Method, Support Vector Machine,
Fuzzy Theory Method and Artificial Neural Network (Fan et al. 2011).
Among these methods, the application of the Fuzzy Theory Method is the
most extensive.
On the other hand, scholars have conducted in- 
depth and systematic research by theoretical and 
experimental methods for life prediction (Zhou 
et al. 2012), and put forward several life prediction 
models for the whole crane or steel structure. How- 
ever, due to the complexity and randomness of 
environment and loading conditions, these classi- 
cal methods based on deterministic equations have 
being developing towards the direction of prob- 
ability statistics. 

To address the gaps identified above, we respond to the new
market demand for safety assessment and carry out research on
the index system of safety assessment, a probabilistic safety
assessment method, and a future state prediction method for the
crane’s steel structure. The research results will help to ensure
the reliable long-term operation of cranes and bring economic
benefits. They can also promote technological progress in the
hoisting machinery industry and enhance the international
competitiveness of cranes, which has long-term social benefits.

In this paper, the Analytic Hierarchy Process (AHP) is used to
assess the safety level of cranes. AHP represents the decision
process by first resolving the total assessment target into
several layers and then carrying out qualitative and quantitative
analysis. The specific procedures are: establishing the index
system, defining the assessment level, constructing the judgment
matrix, calculating the weight vector, checking the consistency,
determining the membership and assessing the safety level, as
shown in section 2 and section 3.


2 INDEX SYSTEM OF SAFETY 
ASSESSMENT 


2.1 Establishing the index system 


According to BS 7121-1:2006, GB 6067-1:2010 and GB/T 21920:2008,
the influence factors have been sorted out and an index system of
safety assessment for cranes has been established. It comprises
three layers: the target layer, the standard layer and the index
layer, as shown in Figure 1.

The target layer, namely the final result layer, refers to the
safety assessment level of cranes, which is represented by the
parameter F.

Figure 1. Index system of safety assessment for crane’s
steel structure.


The standard layer covers four aspects: working environment,
dimension testing, stress crack and use status. Namely,
V = [V_1, V_2, V_3, V_4] = [working environment, dimension
testing, stress crack, use status].

The index layer expands each factor in the standard layer. For
the working environment, V_1 = [M_{11}, M_{12}, M_{13}, M_{14}],
where M_{11} represents the temperature; M_{12} represents the
humidity; M_{13} represents the acid-base property; M_{14}
represents the corrosion degree. For the dimension testing,
V_2 = [M_{21}, M_{22}, M_{23}, M_{24}, M_{25}, M_{26}, M_{27},
M_{28}, M_{29}], where M_{21} represents the rail height
difference between front and back doorframes; M_{22} represents
the rail height difference of semi-girder; M_{23} represents the
camber of main girder; M_{24} represents the upwarp of
semi-girder; M_{25} represents the deflection between front
doorframe and pull rod; M_{26} represents the deflection of
semi-girder; M_{27} represents the thickness of upper cover plate
on main girder; M_{28} represents the thickness of upper cover
plate on front doorframe; M_{29} represents the thickness of main
girder web. For the stress crack, V_3 = [M_{31}, M_{32}, M_{33},
M_{34}], where M_{31} represents the stress distribution; M_{32}
represents the dynamic stress; M_{33} represents the crack
distribution; M_{34} represents the crack activity. For the use
status, V_4 = [M_{41}, M_{42}, M_{43}, M_{44}, M_{45}], where
M_{41} represents the utility time; M_{42} represents the
hoisting volume; M_{43} represents the load situation; M_{44}
represents the personnel operation; M_{45} represents the
maintenance operation.
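
The three-layer index system described above can be held in a small nested structure. The following is a minimal sketch, assuming a plain Python dictionary is an acceptable representation; the name `INDEX_SYSTEM` and the key format are illustrative and do not come from the paper.

```python
# Hypothetical encoding of the index system: target layer F,
# standard layer V_1..V_4, index layer M_ij (names follow the text).
INDEX_SYSTEM = {
    "V1 working environment": [
        "M11 temperature",
        "M12 humidity",
        "M13 acid-base property",
        "M14 corrosion degree",
    ],
    "V2 dimension testing": [
        "M21 rail height difference between front and back doorframes",
        "M22 rail height difference of semi-girder",
        "M23 camber of main girder",
        "M24 upwarp of semi-girder",
        "M25 deflection between front doorframe and pull rod",
        "M26 deflection of semi-girder",
        "M27 thickness of upper cover plate on main girder",
        "M28 thickness of upper cover plate on front doorframe",
        "M29 thickness of main girder web",
    ],
    "V3 stress crack": [
        "M31 stress distribution",
        "M32 dynamic stress",
        "M33 crack distribution",
        "M34 crack activity",
    ],
    "V4 use status": [
        "M41 utility time",
        "M42 hoisting volume",
        "M43 load situation",
        "M44 personnel operation",
        "M45 maintenance operation",
    ],
}

# Number of index-layer factors under each standard-layer factor.
factor_counts = {name: len(items) for name, items in INDEX_SYSTEM.items()}
total_factors = sum(factor_counts.values())
```

Keeping the hierarchy explicit makes it easy to check that every weight vector and membership matrix used later has one entry or row per index-layer factor.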

The index system above is established on the basis of AHP. It
takes into account not only technical elements such as dimension
testing and stress crack, but also non-technical elements such as
working environment and use status. The method and index system
are mainly intended for assessing the comprehensive safety
performance of cranes in the middle and later stages of service.
For other mechanical products similar to cranes, safety
assessment can be carried out by adjusting the factors in the
index system accordingly.


2.2 Constructing the testing system 


Under actual service conditions, the working environment of
cranes involves uncertain factors, and heavy alternating loads
often act on the crane’s steel structure at the same time.
Collecting real-time running data and testing data is therefore
the basis of safety assessment research.

Based on the index system of safety assessment shown in Figure 1,
we construct a testing system with multiple modules: a corrosion
morphology testing module, a dimension deformation testing
module, a stress distribution testing module, a dynamic stress
testing module, a crack distribution testing module and a crack
activity testing module. Sensors are installed on the crane’s
steel structure to collect testing parameters, including
corrosion, deformation, stress-strain, crack and crack growth
parameters. On this basis, a data-integration software platform
is built to gather all the testing data, which form the basic
data source for assessing the safety level of the crane’s steel
structure.


3 SAFETY ASSESSMENT OF 
CURRENT STATE 


3.1 Definition of assessment level 


According to statistical data and previous experience, the
assessment level of safety state has been classified in
consideration of the severity of crane failure accidents. There
are five safety levels, listed in descending order:
U = [U_1, U_2, U_3, U_4, U_5], where U_1 represents the excellent
level; U_2 represents the good level; U_3 represents the
available level; U_4 represents the require-repair level; U_5
represents the discard level.


3.2 Weight of factors in standard layer 


The 1-9 scale method is used to quantify the relative importance
between working environment, dimension testing, stress crack, and
use status, as shown in Table 1.

According to the scale method in Table 1, the judgment matrix can
be constructed as follows.

    N = (a_{ij})_{n \times n}    (1)

where the element a_{ij} in matrix N indicates the relative
importance of factor i with respect to factor j.

For the factor F in the target layer, the matrix N reflecting the
relative importance between working environment V_1, dimension
testing V_2, stress crack V_3 and use status V_4 in the standard
layer has been constructed as follows.

    N = \begin{pmatrix}
    1 & 1/5 & 1/7 & 1/3 \\
    5 & 1   & 1/5 & 3   \\
    7 & 5   & 1   & 5   \\
    3 & 1/3 & 1/5 & 1
    \end{pmatrix}    (2)

The maximum eigenvalue and eigenvector of 
matrix N have been calculated by MATLAB. The 
eigenvector exactly reflects the relative importance 
of factors in standard layer, which is also called the 
allocation of weight coefficient. The expression of 
eigenvector after normalization is given as follows. 


A=(0.0521 0.2195 0.6194 0.1090) (3) 


Table 1. 1-9 scale method.

Scale value    Explanation
1              The two factors are of equal importance
3              One is slightly more important than the other
5              One is more important than the other
7              One is strongly more important than the other
9              One is extremely more important than the other
2, 4, 6, 8     Intermediate values between the adjacent scale values above

To verify whether the allocation of weight coefficients is
reasonable, it is necessary to carry out a consistency check of
judgment matrix N. The consistency ratio is defined as the
parameter CR.

    CR = \frac{CI}{RI}    (4)

    CI = \frac{\lambda_{\max} - n}{n - 1}    (5)

where \lambda_{\max} represents the maximum eigenvalue; n
represents the matrix dimension; RI represents the mean random
consistency index.

When the consistency ratio satisfies the condition CR < 0.1, the
allocation of weight coefficients is considered reasonable.
Otherwise, the element values in judgment matrix N should be
adjusted and the weight coefficients redistributed.

Based on equations 4-5, the calculation gives CR = 0.09 < 0.1,
which indicates that the element values of matrix N and the
weight distribution of eigenvector A are both reasonable.
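
The weight calculation and consistency check of this section can be sketched in a few lines. The paper uses MATLAB; the sketch below instead uses plain-Python power iteration, which is a swapped-in technique for finding the principal eigenvector of a positive matrix. The function names are hypothetical, and the RI table holds commonly tabulated values that the paper does not list, so it is an assumption.

```python
# Mean random consistency index RI by matrix order.
# Assumption: commonly tabulated values; the paper does not list them.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(matrix, iterations=100):
    """Return (normalized principal eigenvector, lambda_max) by power iteration."""
    n = len(matrix)
    w = [1.0 / n] * n
    lam = float(n)
    for _ in range(iterations):
        v = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        lam = sum(v)          # w sums to one, so sum(N w) estimates lambda_max
        w = [x / lam for x in v]
    return w, lam


def consistency_ratio(lam_max, n):
    """CR = CI / RI with CI = (lambda_max - n) / (n - 1); defined here for 3 <= n <= 9."""
    ci = (lam_max - n) / (n - 1)
    return ci / RANDOM_INDEX[n]


# Judgment matrix N of equation 2.
N = [
    [1, 1 / 5, 1 / 7, 1 / 3],
    [5, 1, 1 / 5, 3],
    [7, 5, 1, 5],
    [3, 1 / 3, 1 / 5, 1],
]
weights, lam_max = ahp_weights(N)
cr = consistency_ratio(lam_max, len(N))
```

With the matrix of equation 2 as input, the weights should come out close to the vector in equation 3 and `cr` should fall just under the 0.1 threshold; the exact CR depends on which RI table is used.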


3.3 Weight of factors in index layer


For the factor V_1, the judgment matrix reflecting the relative
importance between temperature M_{11}, humidity M_{12}, acid-base
property M_{13} and corrosion degree M_{14} in the index layer
has been constructed based on the 1-9 scale method, as follows.

    N_1 = \begin{pmatrix}
    1 & 1/5 & 1/4 & 1/7 \\
    5 & 1   & 3   & 1/3 \\
    4 & 1/3 & 1   & 1/5 \\
    7 & 3   & 5   & 1
    \end{pmatrix}    (6)

The maximum eigenvalue and eigenvector of matrix N_1 have been
calculated by MATLAB. The eigenvector after normalization is
given as follows.

    A_1 = (0.0516  0.2605  0.1276  0.5604)    (7)

For the consistency check, the calculation gives
CR_1 = 0.067 < 0.1, which indicates that the element values of
matrix N_1 and the weight distribution of eigenvector A_1 are
both reasonable.
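
A judgment matrix built with the 1-9 scale is reciprocal: every diagonal element is 1 and a_{ji} = 1/a_{ij}. The text does not spell this check out, so the following is only a sketch of how a matrix could be validated before computing its eigenvector. The helper name `is_reciprocal` is hypothetical, and exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction as F


def is_reciprocal(matrix):
    """True if a_ii == 1 and a_ij * a_ji == 1 for every pair (i, j)."""
    n = len(matrix)
    return all(
        matrix[i][i] == 1 and matrix[i][j] * matrix[j][i] == 1
        for i in range(n)
        for j in range(n)
    )


# Judgment matrix for the working-environment factors (equation 6).
N1 = [
    [F(1), F(1, 5), F(1, 4), F(1, 7)],
    [F(5), F(1), F(3), F(1, 3)],
    [F(4), F(1, 3), F(1), F(1, 5)],
    [F(7), F(3), F(5), F(1)],
]
n1_ok = is_reciprocal(N1)

# A deliberately inconsistent copy: a_12 changed without updating a_21.
N1_bad = [row[:] for row in N1]
N1_bad[0][1] = F(1, 3)
bad_ok = is_reciprocal(N1_bad)
```

A check of this kind catches transcription errors in larger matrices, where a single wrong entry is hard to spot by eye.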

For the factor V_2, the judgment matrix reflecting the relative
importance between M_{21}, M_{22}, M_{23}, M_{24}, M_{25},
M_{26}, M_{27}, M_{28} and M_{29} in the index layer has been
constructed based on the 1-9 scale method, as follows.

    N_2 = \begin{pmatrix}
    1   & 1/2 & 2   & 2   & 2   & 3   & 2   & 3 & 2   \\
    2   & 1   & 3   & 3   & 2   & 3   & 2   & 4 & 2   \\
    1/2 & 1/3 & 1   & 3   & 2   & 2   & 1/2 & 1 & 1/2 \\
    1/2 & 1/3 & 1/3 & 1   & 1/3 & 1/2 & 1/3 & 2 & 1/3 \\
    1/2 & 1/2 & 1/2 & 3   & 1   & 3   & 2   & 3 & 2   \\
    1/3 & 1/3 & 1/2 & 2   & 1/3 & 1   & 1/3 & 2 & 1/3 \\
    1/2 & 1/2 & 2   & 3   & 1/2 & 3   & 1   & 4 & 2   \\
    1/3 & 1/4 & 1   & 1/2 & 1/3 & 1/2 & 1/4 & 1 & 1/3 \\
    1/2 & 1/2 & 2   & 3   & 1/2 & 3   & 1/2 & 3 & 1
    \end{pmatrix}    (8)

The maximum eigenvalue and eigenvector of matrix N_2 have been
calculated by MATLAB. The eigenvector after normalization is
given as follows.

    A_2 = (0.1660  0.2136  0.0969  0.0488  0.1338
           0.0555  0.1326  0.0427  0.1100)    (9)

For the consistency check, the calculation gives
CR_2 = 0.061 < 0.1, which indicates that the element values of
matrix N_2 and the weight distribution of eigenvector A_2 are
both reasonable.

For the factor V_3, the judgment matrix reflecting the relative
importance between M_{31}, M_{32}, M_{33} and M_{34} in the index
layer has been constructed based on the 1-9 scale method, as
follows.

    N_3 = \begin{pmatrix}
    1 & 1/3 & 1/6 & 1/7 \\
    3 & 1   & 1/4 & 1/5 \\
    6 & 4   & 1   & 1/3 \\
    7 & 5   & 3   & 1
    \end{pmatrix}    (10)

The maximum eigenvalue and eigenvector of matrix N_3 have been
calculated by MATLAB. The eigenvector after normalization is
given as follows.

    A_3 = (0.0513  0.1061  0.2890  0.5536)    (11)

For the consistency check, the calculation gives
CR_3 = 0.065 < 0.1, which indicates that the element values of
matrix N_3 and the weight distribution of eigenvector A_3 are
both reasonable.

For the factor V_4, the judgment matrix reflecting the relative
importance between M_{41}, M_{42}, M_{43}, M_{44} and M_{45} in
the index layer has been constructed based on the 1-9 scale
method, as follows.

    N_4 = [5 x 5 judgment matrix; entries not legible in the source]    (12)

The maximum eigenvalue and eigenvector of matrix N_4 have been
calculated by MATLAB. The eigenvector after normalization is
given as follows.

    A_4 = (0.2705  0.5045  0.1153  0.0368  0.0729)    (13)

For the consistency check, the calculation gives
CR_4 = 0.045 < 0.1, which indicates that the element values of
matrix N_4 and the weight distribution of eigenvector A_4 are
both reasonable.


3.4 Obtaining the membership matrix 


According to the assessment level of safety state defined in
section 3.1, a fuzzy subset for the factors in the index layer
needs to be constructed. Namely, D = [D_1, D_2, D_3, D_4, D_5] =
[excellent, good, medium, poor, extremely poor], where D_1
represents the excellent state; D_2 represents the good state;
D_3 represents the medium state; D_4 represents the poor state;
D_5 represents the extremely poor state. We then quantify the
degree to which each factor belongs to fuzzy subset D, and so
obtain the membership matrix.

For working environment V_1, the membership matrix has been
obtained as follows.

    R_1 = \begin{pmatrix}
    0.1 & 0.1 & 0.2 & 0.3 & 0.3 \\
    0.1 & 0.1 & 0.2 & 0.3 & 0.3 \\
    0.1 & 0.1 & 0.1 & 0.4 & 0.3 \\
    0   & 0   & 0   & 0   & 1
    \end{pmatrix}    (14)


The first row vector represents the membership degree of the
temperature factor M_{11} with respect to fuzzy subset D. As the
temperature of cranes changes during service, the membership
degree corresponding to each state D_1-D_5 is the result of
probability statistics. According to the weather data, poor
weather appears more frequently, and the membership vector of
temperature M_{11} is [0.1 0.1 0.2 0.3 0.3]. In the same way, for
the other two factors, humidity M_{12} and acid-base property
M_{13}, the membership vectors are obtained from previous data,
as shown by the second and third row vectors.

The fourth row vector represents the membership degree of the
corrosion factor M_{14} with respect to fuzzy subset D. Note that
this membership vector is a definite value rather than a
probability distribution, because the corrosion degree is
uniquely determined by the on-site testing results. According to
the testing data, the degree of corrosion has reached state D_5,
namely the extremely poor state. Therefore, the membership vector
of M_{14} is [0 0 0 0 1].
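
The two kinds of membership vector described above can be sketched as follows. This is an illustration under stated assumptions: the paper gives only the resulting vectors, so the observation counts below are hypothetical, as are the helper names `membership_from_counts` and `one_hot`.

```python
def membership_from_counts(counts):
    """Turn per-state observation counts (D_1..D_5) into relative frequencies."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("at least one observation is required")
    return [c / total for c in counts]


def one_hot(state, n_states=5):
    """Definite membership: the factor lies entirely in state D_state (1-based)."""
    if not 1 <= state <= n_states:
        raise ValueError("state index out of range")
    return [1.0 if k == state else 0.0 for k in range(1, n_states + 1)]


# Hypothetical counts of days per temperature state, chosen only so that the
# frequencies match the vector quoted in the text for M_11.
temperature_counts = [10, 10, 20, 30, 30]
m11 = membership_from_counts(temperature_counts)

# Corrosion degree M_14 was found on site to be in state D_5.
m14 = one_hot(5)
```

Statistical factors spread their membership over several states, whereas factors fixed by an on-site measurement place all of it in one state.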

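For dimension testing V_2, the membership matrix has been
obtained as follows.

    R_2 = \begin{pmatrix}
    0 & 0 & 0 & 1 & 0 \\
    0 & 0 & 1 & 0 & 0 \\
    0 & 0 & 1 & 0 & 0 \\
    0 & 0 & 0 & 0 & 1 \\
    0 & 1 & 0 & 0 & 0 \\
    0 & 0 & 0 & 1 & 0 \\
    0 & 0 & 1 & 0 & 0 \\
    0 & 0 & 1 & 0 & 0 \\
    0 & 0 & 1 & 0 & 0
    \end{pmatrix}    (15)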
The first row vector represents the membership degree of factor
M_{21} with respect to fuzzy subset D. We have collected six
groups of testing data, which are respectively 7 mm, 10 mm,
14 mm, 6 mm, 3 mm and 7 mm. The results indicate that M_{21} has
reached state D_4, namely the poor state, so the membership
vector is [0 0 0 1 0].

The second row vector represents the membership degree of factor
M_{22} with respect to fuzzy subset D. We have collected thirteen
groups of testing data, which are respectively 32 mm, 9 mm,
18 mm, 13 mm, 6 mm, 4 mm, 12 mm, 1 mm, 3 mm, 25 mm, 1 mm, 23 mm
and 25 mm. The results indicate that M_{22} has reached state
D_3, namely the medium state, so the membership vector is
[0 0 1 0 0].

The third row vector represents the membership degree of factor
M_{23} with respect to fuzzy subset D. We have collected two
groups of testing data, which are respectively 8 mm and 5 mm.
The results indicate that M_{23} has reached state D_3, namely
the medium state, so the membership vector is [0 0 1 0 0].

The fourth row vector represents the membership degree of factor
M_{24} with respect to fuzzy subset D. We have collected two
groups of testing data, which are respectively 76 mm and 133 mm.
The results indicate that M_{24} has reached state D_5, namely
the extremely poor state, so the membership vector is
[0 0 0 0 1].

The fifth row vector represents the membership degree of factor
M_{25} with respect to fuzzy subset D. The testing value of
M_{25}, the deflection between front doorframe and pull rod, is
29 mm, which has reached state D_2, namely the good state, so the
membership vector is [0 1 0 0 0].

The sixth row vector represents the membership degree of factor
M_{26} with respect to fuzzy subset D. The testing value of
M_{26}, the deflection of semi-girder, is 100 mm, which has
reached state D_4, namely the poor state, so the membership
vector is [0 0 0 1 0].

The seventh row vector represents the membership degree of factor
M_{27} with respect to fuzzy subset D. We have collected three
groups of testing data, which are respectively 25.8 mm, 25.9 mm
and 25.9 mm. The results indicate that M_{27} has reached state
D_3, namely the medium state, so the membership vector is
[0 0 1 0 0].

The eighth row vector represents the membership degree of factor
M_{28} with respect to fuzzy subset D. We have collected three
groups of testing data, which are respectively 14.5 mm, 14.6 mm
and 14.5 mm. The results indicate that M_{28} has reached state
D_3, namely the medium state, so the membership vector is
[0 0 1 0 0].

The ninth row vector represents the membership degree of factor
M_{29} with respect to fuzzy subset D. We have collected three
groups of testing data, which are respectively 11.8 mm, 11.9 mm
and 11.9 mm. The results indicate that M_{29} has reached state
D_3, namely the medium state, so the membership vector is
[0 0 1 0 0].

For stress crack V_3, the membership matrix has been obtained as
follows.

    R_3 = [4 x 5 membership matrix; entries not legible in the source]    (16)
For use status V_4, the membership matrix has been obtained as
follows.

    R_4 = \begin{pmatrix}
    0   & 0   & 0   & 0   & 1   \\
    0   & 0   & 0   & 0   & 1   \\
    0.1 & 0.2 & 0.3 & 0.2 & 0.2 \\
    0.1 & 0.1 & 0.1 & 0.4 & 0.3 \\
    0   & 0.1 & 0.1 & 0.2 & 0.6
    \end{pmatrix}    (17)


The membership degrees of factors M_{31}-M_{34} and M_{41}-M_{45}
with respect to fuzzy subset D are determined in the same way, by
on-site testing and data analysis, so the details are not
repeated here.


3.5 Assessing the safety level 


Based on equation 7 and equation 14 above, the assessing vector
of working environment V_1 has been calculated. This vector B_1,
after normalization, is given as follows.

    B_1 = A_1 R_1 = (0.0440  0.0440  0.0752  0.1447  0.6922)    (18)


Based on equation 9 and equation 15 above, the assessing vector
of dimension testing V_2 has been calculated. This vector B_2,
after normalization, is given as follows.

    B_2 = A_2 R_2 = (0  0.1338  0.5958  0.2216  0.0488)    (19)

Based on equation 11 and equation 16 above, the assessing vector
of stress crack V_3 has been calculated. This vector B_3, after
normalization, is given as follows.

    B_3 = A_3 R_3 = (0  0  0.4464  0.5536  0)    (20)

Based on equation 13 and equation 17 above, the assessing vector
of use status V_4 has been calculated. This vector B_4, after
normalization, is given as follows.

    B_4 = A_4 R_4 = (0.0152  0.0340  0.0456  0.0524  0.8528)    (21)


The next step is to integrate the assessing vectors B_1, B_2, B_3
and B_4 into the fuzzy relation matrix B of the final index F in
the target layer, as shown in the following equation.

    B = \begin{pmatrix} B_1 \\ B_2 \\ B_3 \\ B_4 \end{pmatrix}
      = \begin{pmatrix}
    0.0440 & 0.0440 & 0.0752 & 0.1447 & 0.6922 \\
    0      & 0.1338 & 0.5958 & 0.2216 & 0.0488 \\
    0      & 0      & 0.4464 & 0.5536 & 0      \\
    0.0152 & 0.0340 & 0.0456 & 0.0524 & 0.8528
    \end{pmatrix}    (22)

Based on equation 3 and equation 22 above, the assessing vector
of the final target F has been calculated. This vector C, after
normalization, is given as follows.

    C = A B = (0.0039  0.0354  0.4161  0.4048  0.1397)    (23)
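
The two-level evaluation above can be sketched as a pair of weighted sums. The paper writes B = A R without naming the fuzzy operator; the sketch assumes the weighted-average (ordinary matrix product) operator followed by renormalization, and the function name `fuzzy_compose` is hypothetical. With the published inputs this assumption appears to reproduce equations 18 and 23 to about three decimal places.

```python
def fuzzy_compose(weights, membership):
    """Weighted-average composition B = A R, renormalized to sum to one."""
    n_states = len(membership[0])
    raw = [
        sum(w * row[k] for w, row in zip(weights, membership))
        for k in range(n_states)
    ]
    total = sum(raw)
    return [x / total for x in raw]


# Index layer: working environment (equations 7 and 14).
A1 = [0.0516, 0.2605, 0.1276, 0.5604]
R1 = [
    [0.1, 0.1, 0.2, 0.3, 0.3],
    [0.1, 0.1, 0.2, 0.3, 0.3],
    [0.1, 0.1, 0.1, 0.4, 0.3],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
B1 = fuzzy_compose(A1, R1)

# Standard layer: B_2..B_4 are taken from equations 19-21 as published.
A = [0.0521, 0.2195, 0.6194, 0.1090]
B = [
    B1,
    [0.0, 0.1338, 0.5958, 0.2216, 0.0488],
    [0.0, 0.0, 0.4464, 0.5536, 0.0],
    [0.0152, 0.0340, 0.0456, 0.0524, 0.8528],
]
C = fuzzy_compose(A, B)
```

Other fuzzy operators (for example max-min composition) would give different vectors, so the operator choice should be confirmed against the original method before reuse.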


Equation 23 indicates that each safety level U_1-U_5 has a
possibility of occurrence based on the current testing data, and
the vector C represents the probability distribution shown in
Figure 2.

Figure 2. The probability distribution of current safety
state.

In Figure 2, the first column represents the probability of the
excellent level occurring, which is 0.39%. The second column
represents the probability of the good level occurring, which is
3.54%. The third column represents the probability of the
available level occurring, which is 41.61%. The fourth column
represents the probability of the require-repair level occurring,
which is 40.48%. The fifth column represents the probability of
the discard level occurring, which is 13.97%.

Next, the final score of the current state is calculated from
the assessing vector and a vector of weighting coefficients, as
shown in the following expression.

    G = C \times S^{T}    (24)

where S represents the vector of weighting coefficients; G
represents the final score of the current state.

Based on equation 23 and equation 24, the safety score G is
60.99. Combined with the partition of safety states shown in
Table 2, the service state of the crane’s steel structure can
then be evaluated.

Table 2. Partition of safety state.

Safety score    Safety state
90-100          excellent level
80-89           good level
70-79           available level
60-69           require-repair level
< 60            discard level

According to Table 2, the score 60.99 corresponds to the
require-repair level but is close to the discard level. From a
conservative perspective, we therefore conclude that this crane
should be discarded in the short term.

The above describes the process of analyzing the safety state of
cranes based on AHP and Fuzzy Theory. We then invited experts in
this field to discuss the assessment results. The experts
considered the method reasonable and the factors listed in the
index system comprehensive and consistent with the actual
situation of cranes. The method can also be applied to other
critical mechanical products, such as automobiles, wind turbines
and nuclear power equipment, to provide timely information on
their current service state.
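
The partition in Table 2 can be expressed as a simple lookup. The sketch below assumes that non-integer scores fall into the band whose lower bound they reach (so 89.5 counts as the good level); the table itself lists only integer ranges, and the function name `safety_level` is hypothetical.

```python
def safety_level(score):
    """Map a safety score in [0, 100] to the safety state of Table 2."""
    if not 0 <= score <= 100:
        raise ValueError("score must lie between 0 and 100")
    if score >= 90:
        return "excellent level"
    if score >= 80:
        return "good level"
    if score >= 70:
        return "available level"
    if score >= 60:
        return "require-repair level"
    return "discard level"


# Score reported in the text for the assessed crane.
current_level = safety_level(60.99)
```

A score of 60.99 sits less than one point above the discard threshold, which is why the text treats the result conservatively.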


4 SAFETY PREDICTION OF 
FUTURE STATE 

4.1 Transition probability between different safety states


The factor F in the target layer has five safety states, namely
U = [U_1, U_2, U_3, U_4, U_5] = [excellent, good, available,
require-repair, discard]. We introduce the transition probability
p_{ij} (i, j = 1, 2, 3, 4, 5) to characterize the possibility of
safety state i developing into safety state j, as shown in the
following equation.

    p_{ij} = P(U_i \to U_j)    (25)

When the time interval of state development is one year, p_{ij}
is known as the one-step transition probability. It represents
the probability of the safety state changing from this year to
the next year.

For the crane’s steel structure, the assessing result is a vector
rather than a single numerical value: it covers several safety
states, each with a possibility of occurrence. In section 3.5,
the assessing vector C of this year was calculated and expressed
as a probability distribution. The focus now is to predict the
assessing vector C' of the next year. The transition probability
matrix P can be expressed as follows.

    P = \begin{pmatrix}
    p_{11} & p_{12} & p_{13} & p_{14} & p_{15} \\
    p_{21} & p_{22} & p_{23} & p_{24} & p_{25} \\
    p_{31} & p_{32} & p_{33} & p_{34} & p_{35} \\
    p_{41} & p_{42} & p_{43} & p_{44} & p_{45} \\
    p_{51} & p_{52} & p_{53} & p_{54} & p_{55}
    \end{pmatrix}    (26)

Based on equation 23 and equation 26 above, the assessing vector
of the future safety state can be obtained as follows.

    C' = C \times P    (27)

    (c'_{U_1}  c'_{U_2}  c'_{U_3}  c'_{U_4}  c'_{U_5})
      = (c_{U_1}  c_{U_2}  c_{U_3}  c_{U_4}  c_{U_5}) \times
    \begin{pmatrix}
    p_{11} & p_{12} & p_{13} & p_{14} & p_{15} \\
    p_{21} & p_{22} & p_{23} & p_{24} & p_{25} \\
    p_{31} & p_{32} & p_{33} & p_{34} & p_{35} \\
    p_{41} & p_{42} & p_{43} & p_{44} & p_{45} \\
    p_{51} & p_{52} & p_{53} & p_{54} & p_{55}
    \end{pmatrix}    (28)

where c_{U_1}, c_{U_2}, c_{U_3}, c_{U_4} and c_{U_5} represent
the probabilities of the excellent, good, available,
require-repair and discard levels occurring in this year, and
c'_{U_1}, c'_{U_2}, c'_{U_3}, c'_{U_4} and c'_{U_5} represent the
probabilities of the same levels occurring in the next year.
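
Written out component by component, the one-step update reads as follows. This is only a restatement of the matrix product; the row-sum condition is the usual property of a transition probability matrix and is an assumption here, since the text does not state it explicitly.

```latex
c'_{U_j} = \sum_{i=1}^{5} c_{U_i}\, p_{ij}, \qquad j = 1, \dots, 5,
\qquad \text{with} \quad \sum_{j=1}^{5} p_{ij} = 1 \quad (i = 1, \dots, 5).
```

Under that condition the predicted vector sums to one whenever the current vector does, because \sum_j c'_{U_j} = \sum_i c_{U_i} \sum_j p_{ij} = \sum_i c_{U_i}.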


4.2 Solving the transition probability matrix P 


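Based on equation 27, the following equations can be obtained
for consecutive years.

    C^{2011} = C^{2010} \times P    (29)
    C^{2012} = C^{2011} \times P    (30)
    C^{2013} = C^{2012} \times P    (31)
    C^{2014} = C^{2013} \times P    (32)
    C^{2015} = C^{2014} \times P    (33)

Stacking these equations row by row, we can solve the transition
probability matrix P as follows.

    P = K^{-1} \times L    (34)

    K = \begin{pmatrix} C^{2010} \\ C^{2011} \\ C^{2012} \\ C^{2013} \\ C^{2014} \end{pmatrix},
    \quad
    L = \begin{pmatrix} C^{2011} \\ C^{2012} \\ C^{2013} \\ C^{2014} \\ C^{2015} \end{pmatrix}    (35)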
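
Solving K P = L and applying the result for a one-step prediction can be sketched as below. The yearly assessing vectors are not published in the paper, so the example uses a hypothetical three-state chain purely to exercise the code; the function names are also hypothetical. Gauss-Jordan elimination is used in place of an explicit inverse.

```python
def solve_transition_matrix(K, L):
    """Solve K P = L for P by Gauss-Jordan elimination with partial pivoting."""
    n = len(K)
    # Augmented matrix [K | L]; each row holds one year's vector and its successor.
    aug = [list(map(float, K[i])) + list(map(float, L[i])) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("K is singular: the yearly vectors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def predict_next(c, P):
    """One-step prediction C' = C P for a row vector c."""
    n = len(P)
    return [sum(c[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical 3-state chain (not data from the paper).
P_true = [
    [0.8, 0.2, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
history = [[1.0, 0.0, 0.0]]
for _ in range(3):
    history.append(predict_next(history[-1], P_true))

K = history[:-1]   # vectors of years t
L = history[1:]    # vectors of years t + 1
P_est = solve_transition_matrix(K, L)
c_next = predict_next(history[-1], P_est)
```

With measured data, successive yearly vectors are often nearly collinear, which makes K ill-conditioned; in that case a least-squares or constrained estimate may be more robust than a direct solve.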
Figure 3. The predicting probability distribution of
future safety state.


4.3 Predicting the future safety state 


For the past years, the assessing vectors C^{2010}, C^{2011},
C^{2012}, C^{2013}, C^{2014} and C^{2015} have been obtained from
running data tested once a year. Accordingly, the matrix P can be
solved and the assessing vector in 2016 can be predicted as
follows.

    C^{2016} = (c^{2016}_{U_1}  c^{2016}_{U_2}  c^{2016}_{U_3}  c^{2016}_{U_4}  c^{2016}_{U_5})
             = (0.0055  0.0311  0.4358  0.3800  0.1475)    (36)

The assessing vector in equation 36 represents the predicted
probability distribution, as shown in Figure 3.

We now compare the result of on-site testing with the predicted
result. Equation 23 and Figure 2 represent the safety state in
2016 based on on-site testing, while equation 36 and Figure 3
represent the safety state in 2016 based on the prediction model.
In general, the possible safety states in the later stage of
service are the last three states. The comparison shows that the
percent error is 4.7354% for the available level, 6.1166% for the
require-repair level and 5.5548% for the discard level. All three
percent errors are below 10%, so the probabilistic prediction
agrees well with the tested state.
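
The comparison above can be sketched as a relative-error calculation. The sketch assumes the percent error is taken relative to the tested value, which the paper does not state explicitly. Because it starts from the four-decimal vectors of equations 23 and 36, the later decimals may differ slightly from the percentages quoted in the text.

```python
def percent_error(predicted, measured):
    """Relative error of the prediction with respect to the tested value, in percent."""
    return abs(predicted - measured) / measured * 100.0


measured = [0.0039, 0.0354, 0.4161, 0.4048, 0.1397]   # equation 23 (tested, 2016)
predicted = [0.0055, 0.0311, 0.4358, 0.3800, 0.1475]  # equation 36 (predicted, 2016)

# Only the last three states are compared, as in the text.
labels = ["available", "require-repair", "discard"]
errors = {
    label: percent_error(p, m)
    for label, p, m in zip(labels, predicted[2:], measured[2:])
}
all_below_threshold = all(e < 10.0 for e in errors.values())
```

The first two states are left out because their probabilities are so small that relative errors there would be dominated by rounding.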


5 CONCLUSIONS 


The steel structure is the key load-bearing part of a crane, and
its failure can compromise reliable operation of the whole crane.
The study of safety assessment and safety state prediction is
therefore critical. Considering the complexity and diversity of
influence factors, we adopt the Analytic Hierarchy Process (AHP)
to assess the safety level of cranes. The results indicate that
AHP can be applied effectively to safety assessment. The
following conclusions have been drawn.


1. An index system of safety assessment comprising the target
   layer, standard layer and index layer has been established,
   which covers most of the factors affecting crane safety. On
   this basis, a testing system with multiple modules has been
   constructed and real-time running data have been collected.

2. Using AHP and fuzzy theory, we obtain the weight vector and
   membership matrix of each factor, and then derive the
   probability distribution of the current safety state.
   Furthermore, the safety score of 60.99 indicates that the
   current safety state has nearly reached the discard level,
   and this crane should therefore be discarded in the short
   term.

3. To predict the safety state of cranes, we introduce the
   transition probability p between different states and build a
   model that obtains the future safety state from the testing
   data of previous years. In the comparison with on-site
   testing, the percent errors of the predicted probabilities
   for the last three safety levels are all below 10%.
ABSTRACT: Railway tracks are among the most important assets in rail transport; they are subject
to very high stresses and have to be made of very high-quality steel alloy. The dominant form of track
degradation is Rolling Contact Fatigue (RCF), which occurs mainly due to cyclic loading. If RCF is not
controlled by maintenance operations, it will result in intensive correction works, service disruption, and
even train derailment. In order to reduce such negative consequences and improve the quality of transport
services, railway organizations are moving towards using prognostic analytics approaches for the
optimisation of maintenance and repair actions. In this paper, we investigate a cost-optimal prognostic
maintenance policy for a railway track system by incorporating RCF data. The mechanism of RCF is analyzed
by means of Finite Element Analysis (FEA) and the fatigue propagation behavior is described by
the Paris-Erdogan power law function. The length of the cracks over time is also modelled by a stochastic
gamma process and its parameters are estimated using the Maximum Likelihood Estimation (MLE)
method. An optimization model is proposed to determine the fixed time interval and/or Million Gross Tons
(MGT) achieving the best possible balance between non-destructive tests and emergency maintenance.
The proposed model is applied to support the maintenance decision-making for a conventional rail track
system 60E1 in steel grade R350HT. The results show that the use of the proposed prognostic maintenance
allows a significant reduction of the costs compared to the strategy in which maintenance is conducted
on an as-needed basis.


1 INTRODUCTION 


Britain’s railway system is one of the most reliable,
comfortable and safest rail networks in the world.
Nevertheless, the country’s rail network is still
confronted with serious problems arising from the
failures of infrastructure assets that require costly
and time-consuming maintenance work. The rail
infrastructure includes fixed assets such as tunnels,
bridges, permanent way, tracks, stations and signaling
equipment. The key components of a rail infrastructure
system are illustrated in Figure 1.

Figure 1. Railway infrastructure components.

Conducting regular Maintenance and Renewal (M&R) is essential
for railway infrastructure to ensure network availability and
reliability, passenger safety and comfort, and also energy
efficiency. Currently, the maintenance of railway infrastructure
assets is preventive in nature and includes repair or renewal of
certain components at regular time intervals or pre-determined
Million Gross Tons (MGT). However, it has been reported in many
case studies that a significant portion of maintenance resources
(e.g., budget, time, manpower) is wasted due to insufficiency or
inefficiency of current programs.

To increase the cost-effectiveness of maintenance operations
while achieving higher levels of reliability and service quality,
several rail transport organisations across Europe, such as
Network Rail in the UK, Administrador de Infraestructuras
Ferroviarias (ADIF) in Spain, ProRail in the Netherlands,
Trafikverket in Sweden, and Österreichische Bundesbahnen (ÖBB) in
Austria, are moving towards using prognostic analytics approaches
for the optimisation of maintenance programmes within the railway
transport network (Dinmohammadi, 2018). However, a survey of the
literature shows that there are few studies examining the problem
of optimised prognostic maintenance regimes for railway transport
infrastructure assets.

Generally, railway infrastructure defects occur 
due to a number of specific causes that have been 
classified by many researchers. Olofsson and Nils- 
son (2002) divided the defects of steel tracks into 
two types of surface-initiated and subsurface-initi- 
ated cracks. Cannon et al. (2003) classified the steel 
track defects into three main groups: (i) defects 
originating from rail manufacture, (ii) defects orig- 
inating from damage caused by inappropriate han- 
dling, installation and use, and (iii) defects caused 
by the exhaustion of the rail’s inherent resistance 
to Rolling Contact Fatigue (RCF). 

The RCF process in rail tracks is very complex 
as it depends on various factors such as age, traf- 
fic density, axle load, material properties, track 
geometry, curvature, speed, and accumulated 
MGT (Kumar, 2008). If RCF is not controlled by 
maintenance operations, it will result in intensive 
correction works, service disruption, and even train 
derailment. In order to analyse and control the rate 
of RCF growth, Non-Destructive Testing (NDT) 
is extensively used by railway operators. NDT 
includes an extensive range of inspection tech- 
niques such as ultrasonic, acoustic, eddy current, 
thermography, etc. that provide information about 
the presence of defect, its location, size and depth. 
The steel rail track has to undergo an emergency 
repair when the size or depth of crack exceeds a 
“warning” level. We assume that the costs for an
NDT inspection and an emergency repair task are
respectively $C_1$ and $C_2$, where $C_2 > C_1 > 0$. The main
problem encountered in this policy is to determine 
the optimal time interval (T), or MGT level (U), or 
number of fatigue cycles (N) for maintenance tasks 
such that the railroad availability is maximized and/ 
or total inspection and repair cost is minimized. 

In this paper, we formulate an optimal prog- 
nostic maintenance policy for steel rail tracks 
subjected to progressive RCF phenomenon. The 
mechanism of RCF is analyzed by means of Finite 
Element Analysis (FEA) and the fatigue propaga- 
tion behavior is described by the Paris-Erdogan 
power law function. The length of the cracks over 
time is also modelled by a stochastic gamma proc- 
ess and its parameters are estimated using the 
Maximum Likelihood Estimation (MLE) method. 
The explicit expression of the long-run expected 
cost function per unit time is derived and the exist- 
ence and uniqueness of the optimal solution are 
shown for the infinite-horizon case. The perform- 
ance of the proposed prognostic policy in terms of 
cost is evaluated and compared to the case when 
only emergency repairs are considered. 

The rest of this paper is organized as follows.
The formulation of the optimization model and
the properties of the optimal solution are discussed
in Section 2. In Section 3, the model is applied to a
real-life case study. Section 4 concludes this study.


2 MODEL FORMULATION 


The aim of this model is to determine an optimal
time interval, MGT level, or number of fatigue
cycles for the maintenance of steel railway tracks
that are subject to the risk of progressive RCF.
The RCF process typically involves the following
three phases: (i) initiation, (ii) propagation (or growth),
and (iii) failure. The step-by-step procedure for
implementing the model is presented below:


1. Suppose that the RCF processes initiate in the
interval $[0, t)$ following a Non-Homogeneous
Poisson Process (NHPP), $\{N(t); t \ge 0\}$, with
intensity function $m(t)$ and mean value function
$M(t)$, i.e.,

$$M(t) = \int_0^t m(x)\,dx, \quad t \ge 0, \tag{1}$$

where $t$ is the age of the track system and $M(t)$ is a
non-decreasing function of $t$ with $M(0) = 0$. Then,
the probability that exactly $j$ (= 0, 1, 2, ...) RCF
processes occur in the interval $[0, t)$, $P_j(t)$, is given by

$$P_j(t) = P\{N(t) = j\} = e^{-M(t)} \frac{[M(t)]^j}{j!}. \tag{2}$$

Let $T_j$ ($j$ = 0, 1, 2, ...) denote the initiation time of
the $j$th RCF process in the track system, where $T_0 = 0$.
Then, the cumulative distribution function of the
random variable $T_j$ is given by

$$F_{T_j}(t) = P\{T_j \le t\} = \sum_{n=j}^{\infty} P_n(t). \tag{3}$$


2. Propagation is the second phase of the RCF
process, which may be accelerated by adverse
environmental conditions. Many models have
been developed to study how RCF processes in
steel rail tracks propagate. For instance, Ringsberg
(2001) proposed a crack growth model for
railway tracks in which the crack propagation
life is divided into three stages: (i) shear stress
driven initiation at the surface; (ii) transient crack
growth behavior; and (iii) subsequent tensile and/
or shear driven crack growth (see Figure 2).


In this paper, the RCF propagation is modelled
using a stochastic gamma process, which represents
the evolution of RCF length/size/depth in time. The
gamma process is a stochastic process with independent
non-negative increments having a gamma
distribution with identical scale parameter. The
gamma process has been widely studied for different
maintenance applications by several authors (see
van Noortwijk (2009) for a thorough review on the
use of gamma process in maintenance modeling).
Also, it has been observed that the gamma process
is satisfactorily fitted to data of different gradual
degradation phenomena (such as wear and crack
propagation) in railway industry (Meier-Hirmer
et al., 2005). Moreover, the existence of an explicit
probability distribution function of gamma process
permits feasible mathematical developments.


Figure 2. Crack propagation phenomenon in railway
tracks (Ringsberg and Bergkvist, 2003).


Let $X_j(t)$ be the length/size/depth of the $j$th RCF
process after $t$ units of time from initiation. We
assume that $X_j(t)$ follows a homogeneous gamma
process with shape and scale parameters given by
$\alpha t$ and $\beta$, respectively. Then, for $t > 0$, the density
and the cumulative distribution function of the
increment of the length/size/depth of the $j$th RCF
process are given by (Park and Padgett, 2008):

$$f_{X_j(t)}(x) = \frac{\beta^{\alpha t}}{\Gamma(\alpha t)}\, x^{\alpha t - 1} e^{-\beta x}, \quad x \ge 0, \tag{4}$$

and

$$F_{X_j(t)}(x) = \frac{\gamma(\alpha t, \beta x)}{\Gamma(\alpha t)}, \quad x \ge 0;\ \alpha, \beta > 0, \tag{5}$$

where $\Gamma(\cdot)$ [$\gamma(\cdot,\cdot)$] denotes the gamma [incomplete
gamma] function, i.e.,

$$\Gamma(v) = \int_0^{\infty} z^{v-1} e^{-z}\,dz; \quad \gamma(v, u) = \int_0^{u} z^{v-1} e^{-z}\,dz, \quad v, u > 0.$$

3. Steel railway track undergoes an emergency repair
when the length/size/depth of RCF exceeds a
warning level $D$. In the event of a repair, the track
segment returns to an “as-good-as-new” condition.
Let $U_j$ be the length of the interval between
the initiation time of the $j$th RCF process and the
time that it attains the warning threshold $D$. Thus,

$$U_j = \inf\{t \ge 0 : X_j(t) \ge D\}, \quad j = 1, 2, \ldots \tag{6}$$

Then, from Eqs. (4) and (5), the density and
the cumulative distribution function of $U_j$, respectively,
are given by

$$g_U(t) = \frac{d}{dt} G_U(t) = \frac{\partial}{\partial t}\left[1 - \frac{\gamma(\alpha t, \beta D)}{\Gamma(\alpha t)}\right], \quad t > 0;\ \alpha, \beta > 0, \tag{7}$$

and

$$G_U(t) = P\{X_j(t) \ge D\} = 1 - \frac{\gamma(\alpha t, \beta D)}{\Gamma(\alpha t)}, \quad t \ge 0;\ \alpha, \beta > 0. \tag{8}$$

We denote by $S_j$ the time point at which the length/
size/depth of the $j$th RCF process exceeds the warning
level $D$. Then,

$$S_j = T_j + U_j, \quad j = 1, 2, \ldots \tag{9}$$


Lemma. Let $I_A(\cdot)$ denote the indicator function
that is defined as $I_A(x) = 1$ for $x \in A$, and 0 otherwise.
Let $\{N_S(t); t \ge 0\}$ be the counting process
associated with the random variables $S_j$ ($j$ = 1,
2, ...), that is,

$$N_S(t) = \sum_{j=1}^{\infty} I_{[0,t)}(S_j). \tag{10}$$

Then, having in mind that the convolution of
any functions $a(\cdot)$ and $b(\cdot)$ is given by

$$a(x) * b(x) = \int_0^x a(x - t)\,db(t),$$

$\{N_S(t); t \ge 0\}$ is an NHPP with intensity
function

$$h(t) = m(t) * g_U(t), \tag{11}$$

where $g_U(t)$ is given by Eq. (7) (Shafiee and Finkelstein,
2015).

Let $X_1$ denote a maintenance cycle defined by the
time interval between maintenance actions (either
NDT inspection or emergency repair). Under the
assumptions of the model, we have

$$X_1 = \min(S_{(1)}, T), \tag{12}$$

where $S_{(1)}$ denotes the time that, for the first time,
an RCF process exceeds the warning threshold $D$,
i.e.,

$$S_{(1)} = \min\{S_j,\ j = 1, 2, \ldots\}. \tag{13}$$

Then, by using the lemma, the survival function of
$S_{(1)}$ is given by

$$\bar{F}_{S_{(1)}}(t) = P\{S_{(1)} > t\} = P\{N_S(t) = 0\} = \exp\left\{-\int_0^t h(x)\,dx\right\}, \tag{14}$$

where $h(\cdot)$ is the failure rate function of $S_{(1)}$, and is
given by Eq. (11). Then, the expected length of a
maintenance cycle, $E[X_1]$, is given by

$$E[X_1] = \int_0^T \bar{F}_{S_{(1)}}(t)\,dt, \quad T > 0. \tag{15}$$


Let $C(t)$ be the s-expected cost of operating the
system for the time interval $[0, t)$. From the renewal
reward theorem (see Ross 1970, p. 52), the expected
cost rate, denoted by $CR(T)$, is the expected operational
cost incurred in a maintenance cycle divided
by the expected cycle length, i.e.,

$$CR(T) = \lim_{t \to \infty} \frac{C(t)}{t} = \frac{C_1\,\bar{F}_{S_{(1)}}(T) + C_2\,F_{S_{(1)}}(T)}{\int_0^T \bar{F}_{S_{(1)}}(t)\,dt}, \tag{16}$$

where $F_{S_{(1)}}(\cdot)$ [$\bar{F}_{S_{(1)}}(\cdot)$] is the cumulative distribution
[survival] function of $S_{(1)}$, and $C_1$ and $C_2$ are the
costs of an NDT inspection and an emergency repair,
respectively. The problem is to
find the optimal value $T^*$ that minimizes the
objective function $CR(T)$, given in Eq. (16). Therefore,
the proposed optimization model can be
formulated as follows:

$$\text{minimise} \quad CR(T) = \frac{C_1\,\bar{F}_{S_{(1)}}(T) + C_2\,F_{S_{(1)}}(T)}{\int_0^T \bar{F}_{S_{(1)}}(t)\,dt}. \tag{17}$$

The following theorem solves this problem.
Theorem. If $h(T)$ is strictly increasing in $T$, and
$(C_2 - C_1)\,h(T_{\max}) > CR(T_{\max})$, there exists a unique and
finite minimum $T^* \in (0, T_{\max})$ that verifies the
following equation:

$$h(T^*) \int_0^{T^*} \bar{F}_{S_{(1)}}(z)\,dz - F_{S_{(1)}}(T^*) = \frac{C_1}{C_2 - C_1}, \tag{18}$$

whereas, if $h(T)$ is non-decreasing in $T$, and
$(C_2 - C_1)\,h(T_{\max}) \le CR(T_{\max})$, then $T^* = T_{\max}$
(implying maximum preventive replacement interval).
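
The minimisation of the expected cost rate can be illustrated numerically. The following is only a minimal Python sketch, not the MATLAB code used in the study: the Weibull survival function (with its `scale` and `shape` values) is a hypothetical stand-in for the survival function of $S_{(1)}$, and the names `survival`, `cost_rate` and `optimal_interval` are assumptions introduced for illustration.

```python
import math


def survival(t, scale=60.0, shape=2.5):
    """Hypothetical survival function of S_(1): Weibull with increasing hazard."""
    return math.exp(-((t / scale) ** shape))


def cost_rate(T, c1, c2, steps=2000):
    """CR(T) = [c1 * Fbar(T) + c2 * F(T)] / integral_0^T Fbar(t) dt."""
    h = T / steps
    # Trapezoidal rule for the expected cycle length.
    integral = 0.5 * (survival(0.0) + survival(T))
    integral += sum(survival(i * h) for i in range(1, steps))
    integral *= h
    fbar = survival(T)
    return (c1 * fbar + c2 * (1.0 - fbar)) / integral


def optimal_interval(c1, c2, t_max=72):
    """Grid search over whole months 1..t_max for the minimiser of CR(T)."""
    return min(range(1, t_max + 1), key=lambda T: cost_rate(float(T), c1, c2))


# Equal costs: no incentive to inspect early, so the search returns t_max.
print(optimal_interval(1000.0, 1000.0))
# Emergency repair much costlier than inspection: an interior optimum appears.
print(optimal_interval(1000.0, 11000.0))
```

A grid search is used here only because it is easy to check by hand; with an increasing failure rate the theorem guarantees a single interior minimum, so any one-dimensional root finder applied to Eq. (18) would serve equally well.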


3 SIMULATED CASE STUDY 


In this Section, we present an application of the 
RCF-based maintenance decision-making model 
to a conventional rail track system 60E1 in steel 
grade R350HT. 60E1 is a common rail profile
developed by the International Union of Railways
(UIC) (http://uic.org/). It also has been widely used
by Network Rail as the standard for all new high
speed rail lines. The rail 60E1 profile was designed
by 3D Abaqus simulation software and is shown
in Figure 3.

UIC60 rails are made of high-quality steel 
alloys, resistant to Rolling Contact Fatigue (RCF), 
wear, plastic deformation and corrosion. Differ- 
ent sizes of UIC60 rails are available and used on 
railroads. In this study, a standard rail section as 
shown in Figure 4 was chosen for the analysis. 

Figure 3. Rail 60E1 profile designed by 3D Abaqus
simulation.


Figure 4. A rail section designed by 3D Abaqus
simulation.


Figure 5. CDF of time-to-initiate a RCF crack.


Figure 6. Cross section of contact patch localised on
rail track.


The material properties of the rail (e.g. mass,
density and Young’s modulus) were provided
by the manufacturer, British Steel (britishsteel.co.uk).
We assume that the arrival of RCF cracks
on rail track follows a Poisson process with rate
$m$ = 0.144/month (Shafiee et al., 2016). The
Cumulative Distribution Function (CDF) of
time-to-initiate a crack is illustrated in Figure 5. The
mean-time-to-initiate a crack is estimated as 6.94
months.
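
The quoted mean-time-to-initiate follows directly from the initiation rate. The short Python sketch below is an illustration under the stated assumption of a homogeneous Poisson process, so that $M(t) = 0.144\,t$; the function names are hypothetical.

```python
import math

RATE = 0.144  # crack initiations per month (value quoted in the text)


def prob_count(j, t, rate=RATE):
    """P{N(t) = j} for a homogeneous Poisson process, i.e. M(t) = rate * t."""
    mean = rate * t
    return math.exp(-mean) * mean ** j / math.factorial(j)


def cdf_first_initiation(t, rate=RATE):
    """P{T_1 <= t} = 1 - P{N(t) = 0}: CDF of the time to the first crack."""
    return 1.0 - prob_count(0, t, rate)


mean_time_to_initiate = 1.0 / RATE
print(round(mean_time_to_initiate, 2))  # 6.94 months, as stated in the text
print(cdf_first_initiation(12.0))       # probability of a crack within a year
```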

In order to estimate the rate of degradation 
propagation, an FEA simulation model was con- 
structed by commercially available software pack- 
age, Abaqus. The cross section of the contact patch 
localised on rail track is shown in Figure 6. 

Many models have so far been developed to
describe the crack growth process, e.g. Paris’ law
(also known as the Paris-Erdogan law). Paris’
law is a power function used to predict crack evolution
for structures subject to fatigue stresses. The
Paris’ law equation is given as follows (Paris and
Erdogan, 1963):

$$\frac{da}{dN} = C(\Delta K)^m, \tag{19}$$

where $a$ represents the crack length, $N$ represents
the number of load cycles, $da/dN$ is the fatigue
crack growth rate per cycle, and $C$ and $m$ are
empirical constants (usually referred to as Paris’
law parameters) which depend on material properties
and operating environment. The range of the
stress intensity factor, $\Delta K$, represents the difference
between the stress intensity factor at maximum and
minimum loads for a particular crack length and is
calculated as:

$$\Delta K = K_{\max} - K_{\min} = \Delta\sigma\, Y \sqrt{\pi a}, \tag{20}$$

where $K_{\max}$ and $K_{\min}$ are, respectively, the maximum
and minimum stress intensity factors, $\Delta\sigma$ is the
range of cyclic stress amplitude, and $Y$ is a dimensionless
parameter that depends on the crack and
loading geometries. The remaining cycles can be
found by substituting Eq. (20) into Eq. (19):

$$\frac{da}{dN} = C\left(\Delta\sigma\, Y \sqrt{\pi a}\right)^m. \tag{21}$$

For relatively short cracks, $Y$ can be assumed as
independent of $a$ and the differential equation can
be solved via separation of variables

$$\int_0^{N_f} dN = \int_{a_i}^{a_c} \frac{da}{C\left(\Delta\sigma\, Y \sqrt{\pi a}\right)^m} = \frac{1}{C\left(\Delta\sigma\, Y \sqrt{\pi}\right)^m} \int_{a_i}^{a_c} a^{-m/2}\,da \tag{22}$$

and subsequent integration

$$N_f = \frac{2\left(a_c^{(2-m)/2} - a_i^{(2-m)/2}\right)}{(2-m)\,C\left(\Delta\sigma\, Y \sqrt{\pi}\right)^m}, \tag{23}$$

where $N_f$ is the remaining number of cycles to fracture,
$a_c$ is the critical crack length at which instantaneous
fracture will occur, and $a_i$ is the initial crack
length at which fatigue crack growth starts for the
given stress range $\Delta\sigma$. If $Y$ strongly depends on $a$,
numerical methods might be required to find reasonable
solutions. Basically, crack propagation can
be divided into three stages: stage I (short cracks),
stage II (long cracks) and stage III (final fracture).
At stage I, once a fatigue crack is initiated, it propagates
along high shear stress planes (45 degrees).
When the stress intensity factor $K$ increases as a
consequence of crack growth or higher applied
loads, slips start to develop in different planes
close to the crack tip, initiating stage II. Finally,
stage III is related to unstable crack growth as $K_{\max}$
approaches $K_{Ic}$. At this stage, crack growth is controlled
by static modes of failure and is very sensitive
to the microstructure, load ratio, and stress
state (plane stress or plane strain loading). The
cross-section of a fatigue crack introduced to the
model at the 100 nm scale is shown in Figure 7.
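
The closed-form cycle count of Eq. (23) can be cross-checked against a direct numerical integration of Eq. (22). The Python sketch below does this for purely hypothetical parameter values (the Paris constants, stress range and crack lengths used in the study are not reported here), and the function names are assumptions made for illustration.

```python
import math


def cycles_closed_form(a_i, a_c, C, m, dsigma, Y):
    """Eq. (23): remaining cycles for constant Y and m != 2."""
    p = (2.0 - m) / 2.0
    denom = (2.0 - m) * C * (dsigma * Y * math.sqrt(math.pi)) ** m
    return 2.0 * (a_c ** p - a_i ** p) / denom


def cycles_numeric(a_i, a_c, C, m, dsigma, Y, steps=20000):
    """Midpoint-rule integration of dN = da / (C * dK**m), as in Eq. (22)."""
    h = (a_c - a_i) / steps
    total = 0.0
    for k in range(steps):
        a = a_i + (k + 0.5) * h
        dK = dsigma * Y * math.sqrt(math.pi * a)
        total += h / (C * dK ** m)
    return total


# Hypothetical values: lengths in metres, stress range in MPa.
params = dict(a_i=1e-3, a_c=1e-2, C=1e-11, m=3.0, dsigma=100.0, Y=1.0)
print(cycles_closed_form(**params))
print(cycles_numeric(**params))
```

The numerical route is the one that generalises: when $Y$ depends on $a$, only the line computing `dK` needs to change, whereas Eq. (23) no longer applies.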


Figure 7. Cross-section of a crack that was introduced 
to model (scale bar: 100 nm). 


Figure 8. Cumulative distribution function for crack
length.


The length of the fatigue cracks in time is modelled
using the gamma distribution as presented
in Equations (4) and (5). The parameters of the
gamma process are estimated by means of the maximum
likelihood estimators given in Kahle et al.
(2016) and are reported as follows:

$\alpha$ = 0.576 and $\beta$ = 1.50 ($\alpha/\beta$ is 0.384 mm per
month).
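
With these estimates, the probability that a crack has reached a given level by age $t$ can be evaluated from the gamma distribution. The Python sketch below is illustrative only: it treats $\beta$ as a rate parameter (consistent with the quoted mean growth of $\alpha/\beta$ per month), the 10 mm warning level is a hypothetical value not taken from the study, and the incomplete gamma function is computed with a simple power series rather than a library routine.

```python
import math

ALPHA = 0.576  # shape growth per month (estimate quoted in the text)
BETA = 1.50    # rate parameter per mm (estimate quoted in the text)


def reg_lower_gamma(a, x, terms=500):
    """Regularised lower incomplete gamma P(a, x) via its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, terms):
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def prob_exceeds(t, d, alpha=ALPHA, beta=BETA):
    """P{X(t) >= d}: probability the crack has reached level d by age t."""
    return 1.0 - reg_lower_gamma(alpha * t, beta * d)


D_WARN = 10.0  # hypothetical warning level in mm (assumption)
print(ALPHA / BETA * 12.0)         # mean crack length after 12 months, in mm
print(prob_exceeds(36.0, D_WARN))  # chance of exceeding D_WARN within 3 years
```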

The best-fit models for the cumulative distribu- 
tion function associated with the length of RCF 
processes are shown in Figure 8. 

The average length of rail replaced in an emergency repair is
L = 8 meters (Patra, 2009). The cost of 60E1 railway
track (including neutralization) per meter is
2,250 Monetary Units (MU). The average labour cost
per hour (including the track worker cost, track
welder cost, and inspection personnel cost) is 625
MU. The hourly rate of hiring the welding equipment
or service vessels for maintenance, replacement
or inspection of the railway track is 80 MU. The
mean time required to perform a maintenance action
(either corrective or preventive) is 4 hours. However,
the corrective type may cause traffic disruption
that incurs an additional cost $\eta$ to the route
operator. The physical lifetime of 60E1 railway
track is considered to be equal to six years (72
months). We wrote a MATLAB code for the minimization
of the expected cost rate, as given in Eq.
(17), to determine the optimal preventive NDT
inspection interval $T^*$. The expected cost rate as a
function of the inspection interval $T$ for three
different cases, (i) $\eta = C_2 - C_1 = 0$, (ii) $\eta$ = 5,000 MU,
and (iii) $\eta$ = 10,000 MU, is
shown in Figure 9. From Figure 9, it is found that
the use of the optimal prognostic inspection policy
reduces the maintenance cost when $\eta > 0$
compared to the strategy when only emergency
repair is considered (the corresponding cost
is the asymptote of the path, when $T$ tends to $T_{\max}$).
The optimal value $T^*$ and the corresponding
expected cost rate, $CR(T^*)$, the expected cost rate
for the corrective maintenance policy, $CR(T_{\max})$, and
the percentage reduction of the maintenance cost,
$r$, are presented in Table 1.

It can be seen that as the cost parameter $\eta$
increases, the optimal NDT inspection interval $T^*$
becomes shorter, while the expected cost rate
$CR(T^*)$ increases. Also, when a large additional
cost is likely to be incurred by the infrastructure
owner in the emergency maintenance case, applying
the prognostic inspection policy is more efficient
than corrective repair and reduces the
maintenance cost. For instance,
when the cost parameter $\eta$ is 10,000 MU, the prognostic
maintenance policy allows for an approximately
10% reduction of the maintenance cost compared
to the corrective repair policy.


Figure 9. Expected cost rate for different values of η.


Table 1. Results of the optimization model for different
values of η.

                  Unit        η = 0     η = 5,000   η = 10,000
T*                month       72        49          42
CR(T*)            MU/month    448.38    546.40      594.81
CR(T_max)         MU/month    448.38    554.94      661.49
Cost reduction    %           0         1.54        10.08
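
The cost-reduction figures can be reproduced from the two cost rates reported in Table 1. The following Python sketch simply recomputes $r = 100\,[CR(T_{\max}) - CR(T^*)]/CR(T_{\max})$; the dictionary layout and function name are illustrative assumptions.

```python
# (CR(T*), CR(T_max)) in MU/month for each value of eta, taken from Table 1.
TABLE_1 = {
    0: (448.38, 448.38),
    5000: (546.40, 554.94),
    10000: (594.81, 661.49),
}


def cost_reduction_percent(cr_opt, cr_max):
    """Relative saving of the prognostic policy over corrective repair."""
    return 100.0 * (cr_max - cr_opt) / cr_max


for eta, (cr_opt, cr_max) in TABLE_1.items():
    print(eta, round(cost_reduction_percent(cr_opt, cr_max), 2))
```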


4 CONCLUSIONS 


In this paper, an optimal prognostic Non-Destruc- 
tive Testing (NDT) and inspection policy is 
presented for railway track systems subject to pro- 
gressive Rolling Contact Fatigue (RCF). The RCF 
phenomenon was modelled by means of Finite 
Element Analysis (FEA) and the fatigue propaga- 
tion behavior was described by the Paris-Erdogan 
power law function. The length/size/depth of the 
RCF processes over time was formulated by a sto- 
chastic gamma process and a preventive inspec- 
tion was conducted before the length/size/depth of 
cracks exceeds a “warning” level. This study can 
be extended in many directions to make it more 
practical in maintenance management of railway 
industry. The model can be formulated and ana- 
lyzed for the case when RCF is only detected if its 
length/size/depth reaches a detection threshold c 
(> 0); and, a cost comparison can be made between 
the proposed prognostic inspection policy and 
other common strategies such as Reliability-Centred Maintenance (RCM).


ABSTRACT: Accelerated Degradation Testing (ADT) is used to collect more performance degradation
data under accelerated stress levels in a limited time. However, limited sample size or inadequate testing
time may lead to a lack of degradation information, which causes inaccuracy in the evaluation of the
lifetime. Highly Accelerated Life Testing (HALT) and Highly Accelerated Stress Screening (HASS) are common
testing methods, whose information should be considered in the evaluation of lifetime and reliability. In this
paper, the Methodology for Integration of HALT, HASS and ADT (MIHHA) is proposed as a method
that makes combined use of HALT, HASS and ADT. One difficulty of MIHHA is how to integrate the
information of HALT, HASS and ADT to provide crucial information for product life prediction. We
propose an evaluation method of MIHHA which integrates the degradation information of HALT and HASS
into the ADT evaluation based on Bayesian theory. This method would be a great help to the engineering
application of MIHHA.


1 INTRODUCTION


During the past decades, Accelerated Degradation
Testing (ADT), which can collect more performance
degradation data under accelerated stress levels in a
limited time, has drawn increasing attention
in both industry and academia (Ge et al. 2012).
It is convenient to establish models for reliability
and life assessment from the degradation information
obtained through ADT. However, a limited sample
size or inadequate testing time
may lead to a lack of degradation information,
which makes the evaluation of
lifetime and reliability inaccurate. Highly Accelerated
Life Testing (HALT) and Highly Accelerated Stress
Screening (HASS) are usually applied as qualitative
methods to improve the reliability of products
(Liu et al. 2016), but the information from HALT
and HASS can make up for the lack of ADT information
very well. Based on this, the Methodology for
Integration of HALT, HASS and ADT (MIHHA)
was proposed to coordinate the design of HALT,
HASS and ADT, and to collect more information under
limited time and samples.

The difference between ADT and MIHHA is that
MIHHA contains information from HALT and HASS.
The key issue of MIHHA is how to integrate the
information of HALT and HASS into ADT to predict
the reliability or lifetime of the product. As the quality and
reliability of products have improved, fewer failures
occur in HALT and HASS, and more attention is paid
to the degradation information in HALT and
HASS (Gray & Paschkewit, 2016). In this article,
we make full use of the degradation information
of HALT and HASS in the evaluation method
of MIHHA.

Bayesian inference is an important technique in
statistical inference (Box & Tiao 1992). In Bayesian
inference for the ADT evaluation method, the degradation
data of HALT and HASS can be regarded
as the prior information and the degradation data of
ADT can be considered as the sample information.
Bayesian theory can thus integrate the
degradation data of HALT and HASS into the ADT
evaluation method. To handle the
two sources of prior information coming from HALT
and HASS, a weighting fusion method based on a
correlation function is adopted
to integrate them into one prior distribution.
The technology roadmap
of the evaluation method of MIHHA with integrated
degradation information of HALT and HASS
is shown in Fig. 1.


Figure 1. Technology roadmap of the evaluation
method of MIHHA.


In conclusion, this article proposes an evaluation
method of MIHHA based on Bayesian inference,
which integrates the degradation information of
HALT, HASS and ADT. The evaluation method
of MIHHA can make full use of various sources
of information and improve the engineering application
value of MIHHA.


2 BAYESIAN INFERENCE OF 
ADT MODEL 


ADT Model and the relationship of parameters 
in Bayesian inference are essential parts of the 
evaluation method and integration method. In this 
paper, we describe the degradation paths through 
Wiener process and give the relationship of ADT 
model and parameters in Bayesian inference. 


2.1 ADT model based on Wiener process 


The Wiener process is widely used for degradation data
analysis (Whitmore & Schenkelberg 1997). The
degradation model of the Wiener process can be
expressed as:

$$Y(t) = \sigma B(t) + d(s)\cdot t + y_0, \tag{1}$$

where $Y(t)$ is the performance degradation process
of the product, $B(t)$ is the standard Brownian
motion, denoted as $B(t) \sim N(0, t)$, $\sigma$ is a constant
named the diffusion coefficient, which does not
change with stress or time, $y_0$ represents the
initial value of product performance, and $d(s)$ is
the drift coefficient, which represents the degradation
rate of the product under a specific stress level $s$.

Normally we use an acceleration model to describe
the relationship between $d(s)$ and $s$ (Zou, T.J., et al.,
2015). This article defines the acceleration model as:

$$\ln d(s) = a + b\,\varphi(s), \tag{2}$$

where $\varphi(s)$ is a function of stress $s$. From
the properties of the Wiener process, the degradation
increment $\Delta y$ during the unit time $\Delta t$ follows
a normal distribution with mean
$d(s)\cdot\Delta t$ and variance $\sigma^2\cdot\Delta t$, i.e.,
$\Delta y \sim N(d(s)\cdot\Delta t,\ \sigma^2\cdot\Delta t)$. The probability density
function of $\Delta y$ is:

$$f(\Delta y) = \frac{1}{\sqrt{2\pi\sigma^2\Delta t}} \exp\left\{-\frac{[\Delta y - d(s)\cdot\Delta t]^2}{2\sigma^2\cdot\Delta t}\right\}. \tag{3}$$

The literature (Whitmore & Schenkelberg 1997)
suggests that the first passage time of a Wiener process
to a threshold follows the inverse Gaussian distribution.
We take $y_p$ as the failure threshold
of performance degradation. In other words, the product
fails when $Y(t) - y_p < 0$. Thus, the key of the
ADT evaluation method turns to the calculation of the
probability distribution of the first passage time $t$
to the threshold $y_p$. The probability density function
of the inverse Gaussian distribution with
mean $\mu = (y_p - y_0)/d(s)$ and shape parameter
$\lambda = (y_p - y_0)^2/\sigma^2$ is:

$$f(t) = \frac{y_p - y_0}{\sqrt{2\pi\sigma^2 t^3}} \exp\left\{-\frac{[y_p - y_0 - d(s)\,t]^2}{2\sigma^2 t}\right\}. \tag{4}$$

The reliability function of the degradation process
(Lu, J., 1995) is given by:

$$R(t) = \Phi\left[\frac{y_p - y_0 - d(s)\,t}{\sigma\sqrt{t}}\right] - \exp\left\{\frac{2\,d(s)\,(y_p - y_0)}{\sigma^2}\right\} \Phi\left[-\frac{y_p - y_0 + d(s)\,t}{\sigma\sqrt{t}}\right]. \tag{5}$$

From the reliability function above, the
unknown parameters $a$, $b$, $\sigma$ are the key parameters
in the ADT evaluation method, which are also
known as prior parameters in Bayesian inference.
We denote by $\theta$ the prior parameter vector
of the model, i.e. $\theta = (a, b, \sigma)$. The relationship between the
ADT model and the prior parameters is shown in
Fig. 2.
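
Once the drift and diffusion parameters are known, the reliability function of Eq. (5) is straightforward to evaluate. The Python sketch below is a minimal illustration: the numerical values of the drift, diffusion coefficient and threshold distance are hypothetical, `gap` stands for the distance between the initial value and the failure threshold, and the drift is taken as the rate of movement towards that threshold; none of these values come from the paper.

```python
import math


def std_normal_cdf(z):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def reliability(t, d, sigma, gap):
    """Eq. (5) with gap = distance to the threshold and d = drift towards it."""
    root = sigma * math.sqrt(t)
    first = std_normal_cdf((gap - d * t) / root)
    second = math.exp(2.0 * d * gap / sigma ** 2) * std_normal_cdf(-(gap + d * t) / root)
    return first - second


# Hypothetical parameters: mean life gap / d = 500 time units.
D_RATE, SIGMA, GAP = 0.01, 0.1, 5.0
for t in (1.0, 100.0, 500.0):
    print(t, reliability(t, D_RATE, SIGMA, GAP))
```

Note that the exponential factor in the second term can overflow for a small diffusion coefficient, so a production implementation would work with logarithms; the direct form is kept here because it mirrors Eq. (5) term by term.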

If the prior parameters are obtained, it is easy to
accomplish the evaluation of lifetime and reliability
(Wang et al. 2016). The crux of the evaluation
method of MIHHA is how to obtain reasonable
estimates of the prior parameters from the degradation
information of HALT, HASS and ADT.


Figure 2. The relationship of ADT model and prior
parameters.


2.2 Bayesian inference of ADT model 


Bayesian inference is an important technique in
statistical inference. Normally, we use Bayes'
theorem to update the probability for a hypothesis
as more evidence or information becomes available
(Wang et al. 2016). It is suitable for integrating
degradation information of HALT, HASS and
ADT in the evaluation method. Bayesian inference is
particularly important in the dynamic analysis of a
sequence of data. Through Bayesian inference, this
article can obtain the posterior distribution of the
parameters to evaluate the lifetime and reliability of
the product with degradation information of HALT,
HASS and ADT.

In this paper, the observed data $x$ (also known
as sample information) can be obtained from the
degradation data of ADT, and the degradation
data of HALT and HASS can be regarded as prior
information. The posterior distribution is the
distribution of the parameters after taking into
account the observed data, which combines the
observed data and the prior information and
forms the core of Bayesian inference. With the
observed data $x$, Bayesian theory suggests
that the posterior distribution of the parameters can
be expressed as:

$$\pi(\theta \mid x) = \frac{\pi(\theta)\,p(x \mid \theta)}{\int_{\Theta} \pi(\theta)\,p(x \mid \theta)\,d\theta}, \tag{6}$$

where $\Theta$ is the value range of $\theta$, $\pi(\theta)$ is the prior
distribution, and $p(x \mid \theta)$ is the distribution of the observed
data conditional on its parameters, which is termed
the sampling distribution or likelihood function.
In this paper, it is expressed as:

$$p(x \mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2\Delta t}} \exp\left\{-\frac{[x - \exp(a + b\cdot\varphi(s))\cdot\Delta t]^2}{2\sigma^2\Delta t}\right\}. \tag{7}$$


3 INTEGRATION OF PRIOR 
DISTRIBUTIONS 


3.1 Integrate degradation information from 
HALT and HASS 


Before the integration, we should make sure that
the degradation data from HALT, HASS and
ADT meet the following conditions: 1) the degradation
data from HALT, HASS and ADT should
belong to the same performance parameter; 2) the
degradation mechanism of the performance degradation
data from HALT, HASS and ADT should be
the same (Wang et al. 2013).

The degradation data of HALT and HASS can be
regarded as prior information. Thus, we let $\pi_1(\theta)$
represent the prior distribution obtained from the
degradation information of HALT and $\pi_2(\theta)$ represent
the prior distribution obtained from the degradation
information of HASS. Since there are two prior
distributions for the parameters, a weighting fusion method
is adopted to integrate them into one prior distribution.
Using the weight coefficients $\alpha_1$ and $\alpha_2$, the
prior distribution after integration is expressed as:

$$\pi(\theta) = \alpha_1\pi_1(\theta) + \alpha_2\pi_2(\theta), \tag{8}$$

where the weight coefficients $\alpha_1$ and $\alpha_2$ are
constrained as follows:

$$\alpha_1 + \alpha_2 = 1. \tag{9}$$

According to Bayesian theory, the posterior
distribution after integration is expressed as:

$$\pi(\theta \mid x) = \frac{\alpha_1\pi_1(\theta)\,p(x \mid \theta) + \alpha_2\pi_2(\theta)\,p(x \mid \theta)}{\int_{\Theta} \left[\alpha_1\pi_1(\theta)\,p(x \mid \theta) + \alpha_2\pi_2(\theta)\,p(x \mid \theta)\right] d\theta}. \tag{10}$$

It can be transformed into:

$$\pi(\theta \mid x) = \frac{\alpha_1 m_1(x)}{m(x)}\,\pi_1(\theta \mid x) + \frac{\alpha_2 m_2(x)}{m(x)}\,\pi_2(\theta \mid x), \tag{11}$$

where $\pi_1(\theta \mid x)$ and $\pi_2(\theta \mid x)$ represent the posterior
distributions obtained from the HALT and HASS priors, respectively.
$m_i(x)$ is the marginal (prior predictive) distribution
of the data, i.e., the likelihood
marginalized over the prior $\pi_i(\theta)$:

$$m_i(x) = \int_{\Theta} \pi_i(\theta)\,p(x \mid \theta)\,d\theta, \quad i = 1, 2, \tag{12}$$

$$m(x) = \alpha_1 m_1(x) + \alpha_2 m_2(x). \tag{13}$$

Here, we define:

$$\beta_i = \frac{\alpha_i\,m_i(x)}{m(x)}, \quad i = 1, 2. \tag{14}$$

So we can easily reach the conclusion:

$$\beta_1 + \beta_2 = 1. \tag{15}$$

The expression of the posterior distribution
after integration is then transformed into formula (16),
which is a weighted sum of the posterior distributions
with the weight coefficients $\beta_1$ and $\beta_2$:

$$\pi(\theta \mid x) = \beta_1\pi_1(\theta \mid x) + \beta_2\pi_2(\theta \mid x). \tag{16}$$
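
The weighting in Eqs. (11)-(16) can be made concrete with a conjugate toy problem. The Python sketch below is an assumption-laden illustration rather than the paper's procedure: it takes a single scalar parameter with two normal priors, one normally distributed observation with known variance, and hypothetical numerical values, so that the marginals $m_i(x)$ and the component posteriors have closed forms.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fuse(x, obs_var, priors, alphas):
    """Return the posterior weights beta_i and the fused posterior mean.

    priors is a list of (mean, variance) pairs for the normal priors pi_i,
    alphas holds the prior weights alpha_i (summing to one).
    """
    # m_i(x): marginal density of x under prior i (normal-normal model).
    marginals = [normal_pdf(x, mu, obs_var + tau2) for mu, tau2 in priors]
    m = sum(a * mi for a, mi in zip(alphas, marginals))
    betas = [a * mi / m for a, mi in zip(alphas, marginals)]
    # Mean of each component posterior pi_i(theta | x).
    post_means = [
        (mu / tau2 + x / obs_var) / (1.0 / tau2 + 1.0 / obs_var)
        for mu, tau2 in priors
    ]
    fused_mean = sum(b * pm for b, pm in zip(betas, post_means))
    return betas, fused_mean


# Hypothetical example: the first prior lies closer to the observation.
betas, fused_mean = fuse(1.2, 0.5, [(1.0, 0.25), (3.0, 0.25)], [0.5, 0.5])
print(betas, fused_mean)
```

The prior whose predictions agree better with the observed data receives the larger posterior weight, which is the mechanism by which the fused posterior down-weights a poorly matching information source.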


3.2 Calculation based on correlation function 


Correlation function is a function that gives the 
statistical correlation between random variables. 
Sometimes correlation functions of different ran- 
dom variables are referred to as cross-correlation 
functions to emphasize that different variables are 
being considered. Through analyzing the correlation 
function, the correlativity between posterior distri- 
butions and prior distributions can be weighted. 

In this paper, we set $\mu_i$ as the mean of $\pi_i(\theta)$ and
$\mu_i'$ as the mean of $\pi_i(\theta \mid x)$. According to
formula (16), the mean of the posterior distribution $\pi(\theta \mid x)$
can be described as:

$$\mu' = \beta_1\mu_1' + \beta_2\mu_2'. \tag{17}$$

Then we assume three equations as follow to 
describe the correlativity between prior and poste- 
rior distributions. 


u’ = a) + ath + a,u, 
+ L , r 

U = dy + Ath + Gy Uy 
$ + rd , 

Uy = do + Aath + Ay Uy 


(18) 


There are three unknown coefficients in each 
equation in formula(18). Through simulation sam- 
pling, three groups of random data are obtained. 
So we can get three group of estimated value of u,, 
u,, u’ to solve the equations. 
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We v? as the square deviation of (x). The 
correlation function defined as (Wang et al. 2013): 


2 2 
day? + d;a, 


Pe 
i JAR ve -fath ae 


i=1,2 (19) 


From formula (19) we can see that $0 < r_i < 1$. A larger $r_i$
means that $\pi_i(\theta \mid x)$ is more closely associated with $\pi(\theta \mid x)$,
which illustrates that $\pi_i(\theta)$ should have a stronger
weight in $\pi(\theta)$. The weight coefficients $\alpha_1$ and $\alpha_2$ of
the prior distributions are obtained by formula (20):

$$\alpha_i = \frac{r_i}{r_1 + r_2}, \quad i = 1, 2. \tag{20}$$

With the values of $\alpha_i$, we can easily obtain the
posterior distribution $\pi(\theta \mid x)$ after integration.
According to the reliability function of ADT, it is
easy to achieve the evaluation of ADT.


4 A SIMULATION EXAMPLE 


4.1 Data simulation 


Accelerated test conditions are typically produced
by testing units at higher levels of temperature, voltage,
pressure, etc. (Nelson 1971). In this paper, we
assume the degradation process is mainly affected
by high temperature. The Arrhenius model is suitable to
describe the relationship between the degradation
rate and temperature (Alferink 2012), which means
$\varphi(s) = 1/(s + 273.15)$, where $s$ is the temperature in degrees Celsius.

In order to verify the effectiveness of the evaluation
method of MIHHA, we simulated 240 degradation
data points for each of the 6 samples under
HALT and 30 degradation data points for each of
the 6 samples under HASS. The performance monitoring
interval, i.e. $\Delta t$, is one minute. Fig. 3 shows the
degradation paths under different high-temperature
levels in HALT. Fig. 4 shows the degradation paths at a
high temperature of 120°C with simultaneous vibration
under different cycles in HASS. The data of both
Fig. 3 and Fig. 4 are the prior information
in the Bayesian inference.

Then, we simulated a 2-day Step Stress ADT
(SSADT) with 2880 degradation data points for
each of the 6 samples as observed data, or
sample information. Fig. 5 shows the degradation
paths of SSADT under different stress levels.

According to formula (2), we can convert all
the degradation data under accelerated stress into
degradation data at 25°C by the method of linear
regression. The equivalent $\Delta t_0$ at 25°C are shown in
Table 1. We can obtain the $d(s)\cdot\Delta t$ at 25°C of the
different samples under HALT, HASS and ADT
through $\Delta y/\Delta t_0$.
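
The conversion to 25°C can be sketched as follows. This is an illustration only: it assumes that the acceleration factor between a stress level $s$ and 25°C takes the Arrhenius form $\exp(b(\varphi(25) - \varphi(s)))$. The coefficient `b = 1050` is not stated in the text; it is an assumption chosen here because it appears to reproduce the factors listed in Table 1. The function names are hypothetical.

```python
import math


def phi(s_celsius):
    """Arrhenius stress transform: reciprocal of the absolute temperature."""
    return 1.0 / (s_celsius + 273.15)


def equivalent_interval(dt_min, s_celsius, b=1050.0, s_ref=25.0):
    """Convert a test interval at stress s into an equivalent interval at s_ref.

    b is an assumed Arrhenius coefficient (not given in the text).
    """
    return dt_min * math.exp(b * (phi(s_ref) - phi(s_celsius)))


for s in (60, 80, 100, 120, 130, 140, 145, 150):
    print(f"{s} C -> {equivalent_interval(1.0, s):.4f} min")
```

Under these assumptions, one minute of testing at a higher temperature corresponds to a longer equivalent interval at 25°C, and the factor grows monotonically with temperature.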
Figure 3. The degradation paths of high-temperature
Figure 4. The degradation paths of high temperature vibration synthesis test in HASS.
Figure 5. The degradation paths of SSADT.


4.2 Integration of prior distributions 


As all the data have been converted into data
at 25°C, we can set $d(s)\cdot\Delta t$ as the prior parameter
in place of $a$ and $b$ to reduce the calculation. The prior
information is obtained from the distribution of the
HALT and HASS data, as shown in Table 2.

Table 1. Conversion factors of $\Delta t_0$.

Accelerated stress    $\Delta t$/min    Equivalent $\Delta t_0$ at 25°C/min
60°C                  1                 1.447711
80°C                  1                 1.730614
100°C                 1                 2.029593
120°C                 1                 2.341943
130°C                 1                 2.502343
140°C                 1                 2.665169
145°C                 1                 2.747404
150°C                 1                 2.830144

Table 2. Prior distributions.

Prior parameters    $d(s)\cdot\Delta t$      $\sigma^2\cdot\Delta t$
HALT                N(0.0014, 0.00041)       N(2.34e-05, 2.61e-06)
HASS                N(0.0098, 0.0047)        N(1.21e-03, 9.19e-04)

With the help of Matlab, we simulate three
groups of random samples from the prior distributions
$\pi_1(d(s)\cdot\Delta t)$, $\pi_1(\sigma^2\cdot\Delta t)$,
$\pi_2(d(s)\cdot\Delta t)$, $\pi_2(\sigma^2\cdot\Delta t)$ and the
posterior distributions $\pi_1(d(s)\cdot\Delta t|x)$, $\pi_1(\sigma^2\cdot\Delta t|x)$,
$\pi_2(d(s)\cdot\Delta t|x)$, $\pi_2(\sigma^2\cdot\Delta t|x)$ to calculate the values of
the three unknown coefficients in each equation of
formula (18). Finally, the posterior distributions after
integration are obtained as follows:


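$\pi(d(s)\cdot\Delta t \mid x) = 0.9408\,\pi_1(d(s)\cdot\Delta t \mid x) + 0.0592\,\pi_2(d(s)\cdot\Delta t \mid x)$ (21)

$\pi(\sigma^2\cdot\Delta t \mid x) = 0.9054\,\pi_1(\sigma^2\cdot\Delta t \mid x) + 0.0946\,\pi_2(\sigma^2\cdot\Delta t \mid x)$ (22)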


From formulas (21) and (22), we can see that
HALT has a heavier weight coefficient than
HASS, which illustrates that the degradation
information of HALT is more relevant to the
observed data of ADT in the simulation example.


4.3 Evaluation of reliability 


Through the posterior distribution after integration,
we can obtain the assessed value of the key
parameters in the reliability function:

$\hat{\theta} = E(\theta|x) = \varepsilon_1 E_1(\theta|x) + \varepsilon_2 E_2(\theta|x)$ (23)

where $E$ represents the mean of the posterior
distributions.

We assume that the initial value of the product
performance is 100 and that the failure threshold of
performance degradation is 50. The evaluation
of reliability can be calculated by formula (5). To
compare the evaluation results with and without the
information of HALT and HASS, the
results of reliability before and after integration
are shown in Fig. 6.

Figure 6. Evaluation of reliability before and after
integration. [Plot residue: legend entries "before integration" and
"after integration"; horizontal axis: time/min; marked point
X: 4.454e+004, Y: 0.5000.]

The red line in Fig. 6 indicates the evaluation
of reliability before integration, which is obtained
from the information of ADT only. It represents the
traditional ADT evaluation method. The blue line
indicates the evaluation of reliability after integration,
which integrates the information of HALT, HASS
and ADT. It represents the evaluation method of
MIHHA. From Fig. 6 we can conclude
that the median life is reduced after integration.
The evaluation result of MIHHA updates the evaluation
result of ADT with the information of HALT and HASS
through Bayesian inference. Thus, the evaluation method
of MIHHA contains more information than ADT alone and
can be a better indicator of the reliability level of the product.


5 CONCLUSION 


To address the problem that ADT may lack degradation
information because of limited sample size or inadequate
testing time, this article proposes an evaluation method
of MIHHA. It solves the key issue of MIHHA, namely how to
integrate HALT and HASS into ADT to predict the reliability
or lifetime of a product. First, an ADT model based on the
Wiener process is proposed and the prior parameters
are given. Second, based on the framework of
Bayesian inference, the degradation information
of HALT and HASS is regarded as prior information
and the degradation information of ADT is
considered as observed data. Through the prior
information and observed data, we can obtain
the posterior distribution, which integrates the
information of HALT, HASS and ADT. Third,
the degradation information of HALT and HASS
is integrated into one prior distribution based on a
weighting fusion method, and the correlation function
is adopted to calculate the weight coefficients
of HALT and HASS, which solves the problem
that the prior information comes from two sources.
Finally, a simulation example is given to demonstrate the
feasibility of the evaluation method of MIHHA.
The evaluation method of MIHHA integrates the
degradation information of HALT, HASS and
ADT. It provides a new path for product evaluation
that makes full use of the information
of HALT and HASS. This method would be a great
help to the promotion and application of MIHHA.

ABSTRACT: The aim of this work is to present a novel methodology for damage detection in a structure.
In this method, new features based on the Short Time Fourier Transform are used, dimensionality
reduction is carried out by Principal Component Analysis and the classification is performed using logistic
regression. The efficacy of this method is demonstrated by detecting the presence of damage in a
cantilever beam using features based on a transformation of the displacement response.

Even though the difference between the displacement waveforms from damaged and the undamaged 
beam is not clearly perceptible, the difference is clearly visible in the principal components’ space. Data 
belonging to damaged and undamaged classes are linearly separable up to a certain level of noise after 
which linear separability is lost. A linear decision boundary was obtained corresponding to a particular 
noise level after the mean normalization and feature scaling of the data. Feasibility of dimensionality 
reduction is ensured by checking the loss percent in the reduction process. And generalization ability of 
the classifier has been assessed on some test sets. 


1 INTRODUCTION 


Structural Health Monitoring (SHM) has been
studied as a potential way to improve the safety and
reliability of various mechanical, aerospace and
civil infrastructures. Current SHM techniques
process the dynamic responses of a structure in order
to detect, localize or find out the extent of damage
(Farrar & Worden 2013). There are two approaches
for SHM: i) the model based approach and ii) the data
driven approach. In the model based approach, the
system is modeled based on its physics using a suitable
modeling technique. This model is then updated
using field data. During this updating, the initial
model is tuned so that it produces outputs which
match well with the real data. After the updating
stage, a reasonably accurate model of the healthy
system is obtained. The characteristics of the damage
are then indicated by changes in parameter values
on further updating using data from
the monitoring stage. In the data driven approach,
on the other hand, instead of using a physics based
model, the data obtained from the system are used
to develop a statistical model for classification to
achieve diagnosis. The major steps involved in the data
driven approach are 1) measurement of dynamic
response, 2) feature extraction, 3) dimensionality
reduction, 4) classification and 5) prognosis. They
are followed in sequence as shown in Figure 1.

Figure 1. Major steps in SHM. [Flowchart boxes: Measurement of
dynamic response; Feature extraction; Dimensionality reduction;
Classification.]

It is expected that accurate damage detection
will require successful feature extraction,
dimensionality reduction and signal classification.
Appropriate selection of features and classification
algorithm for an application depends on the
nature of the dataset at hand. No single
classification method can be preferred over others to
achieve good generalization (Duda et al. 2001).
Similarly, by the Ugly Duckling theorem, no set of features
can be called the best irrespective of the
application (Duda et al. 2001). Consequently,
many different classification methods and feature
sets are explored in the SHM literature to
suit particular applications. For example, Zhou
et al. (2013) proposed a methodology for damage
detection based on integrating data fusion
and random forests. They used energy features
extracted using the wavelet packet transform for
training. He & Yan (2007) suggested a method
where the classification was carried out using a
wavelet based support vector machine and the
features were extracted from the wavelet energy
spectrum constructed using the wavelet packet
transform. Satpal et al. (2016) used support vector
regression for localization of the damage, using
displacement values corresponding to the first mode
shape as the features. Although new methodologies
are continually being proposed for damage
detection, many remain to be explored.
Damage detection accuracy can depend not only
on the individual techniques used for feature
extraction and classification but also on their
various combinations.

The aim of this work is to present a novel meth- 
odology for damage detection in a structure. In this 
method, new features based on Short Time Fou- 
rier Transform (STFT) are used, dimensionality 
reduction is carried out by Principal Component 
Analysis and the classification is performed using 
Logistic regression. The efficacy of this method is 
demonstrated by detecting the presence of damage 
in a cantilever beam using the features based on 
transformation of displacement response obtained 
from a model of the beam. 

Even though the difference between the dis- 
placement waveforms from damaged and the 
undamaged beam is not clearly perceptible, the 
difference is clearly visible in the principal com- 
ponents’ space. Data belonging to damaged and 
undamaged classes are linearly separable up to a 
certain level of noise after which linear separabil- 
ity is lost. A linear decision boundary has been 
obtained for a particular noise level after the mean 
normalization and feature scaling of the data. Fea- 
sibility of dimensionality reduction is ensured by 
checking the loss percent in the reduction process. 
And generalization ability of the classifier has been 
assessed on some test sets. 


2 METHODOLOGY 


In this section the mathematical formulation 
of each step taken toward damage detection is 
provided. 


2.1 Data generation 


The purpose of this step is to acquire multiple 
signals (displacement time history) from a dam- 
aged and an undamaged beam. Here damaged and 
undamaged conditions are two different classes for 
which the classifier has to be designed. In a data 
driven approach the data are obtained either by 
carrying out experiments or with a reliable model. 
In this case we model the damaged beam based on 
Euler-Bernoulli hypothesis. The task of producing 
multiple signals has been accomplished as follows. 
Firstly the variational formulation of the damaged 
beam (shown in Figure 2) was done using Ham- 
ilton’s principle and application of Ritz method 
to obtain the approximate response/signal. Subse- 
quently the above signal was mixed with noise to 
obtain multiple signals corresponding to damaged 
beam with different damage locations. To get mul- 
tiple signals for undamaged beam the damage size 
was substituted to be zero and the corresponding 
signal was mixed with noise. The mathematical for- 
mulation of this process is described below. 


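By Hamilton’s principle:

$\delta \int_{t_1}^{t_2} [K - U + W]\,dt = 0$ (1)

where the kinetic energy is

$K = \frac{1}{2}\int_0^{L_1+\Delta L+L_2} \rho A \left(\frac{\partial w}{\partial t}\right)^2 dx - \frac{1}{2}\int_{L_1}^{L_1+\Delta L} \rho b \delta \left(\frac{\partial w}{\partial t}\right)^2 dx$

the potential energy is

$U = \frac{1}{2}\int_0^{L_1+\Delta L+L_2} EI \left(\frac{\partial^2 w}{\partial x^2}\right)^2 dx - \frac{1}{2}\int_{L_1}^{L_1+\Delta L} E\,\Delta I \left(\frac{\partial^2 w}{\partial x^2}\right)^2 dx$

and the work done by external forces is

$W = f(t)\,w(x = L_1+\Delta L+L_2)$

where $\rho$ = density, $E$ = Young’s modulus, $I$ = moment of inertia,
$f(t)$ = excitation force,
$w$ = deflection of the centreline of the beam,
$A$ = area of cross section, and
$x$ is the direction along the length with its zero at
the fixed end.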

Figure 2. Beam geometry: (a) front view; (b) top view.

To get the response using the Ritz method we assume


$w(x, t) = \{H\}^T \{p\}$


where, {H}=[ H,(x),H,(x).. Hy (x) consisting 
of N linearly independent admissible functions. 
Such that H,(x)=(#)'(1-Z)"",i=1,2...N & 
{p} = AGH AG Nas (t) I here p, (t), p,(t),---Py (0) 
are unknown coordinate functions. 

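The following equation is finally obtained


$[[m] - [m_d]]\{\ddot{p}\} + [[K] - [K_d]]\{p\} = f(t)\,\{H\}_{x = L_1+\Delta L+L_2}$ (2)


where,

$[m] = \int_0^{L_1+\Delta L+L_2} \rho A \{H\}\{H\}^T dx$

$[m_d] = \int_{L_1}^{L_1+\Delta L} \rho b \delta \{H\}\{H\}^T dx$

$[K] = \int_0^{L_1+\Delta L+L_2} EI \{H''\}\{H''\}^T dx$

$[K_d] = \int_{L_1}^{L_1+\Delta L} E\,\Delta I \{H''\}\{H''\}^T dx$


The above equations are subsequently solved.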

The output from the sensor placed at $x = x_p$ is
$w(x_p, t) = \{H(x_p)\}^T \{p\}$, which is the required signal
when the damage is at a location $L_1$ from the fixed
end as shown in Fig. 2. By changing the location $L_1$
one can get signals corresponding to different damage
locations. However, when these signals are acquired
experimentally, they are bound to be distorted
by noise. To simulate a real experiment, noise
is added to the original signal, giving multiple signals
corresponding to each damage location. Similarly,
one can get multiple signals from an undamaged
beam as well. The process of mixing noise with the
deterministic signal is expressed mathematically as


$w_{new}(x_p, n) = \Psi(w(x_p, n), SNR)$


where $\Psi$ is a function which adds noise to $w(x_p, n)$
with signal to noise ratio being SNR.
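
A minimal sketch of such a noise-adding function is given below. It assumes white Gaussian noise and an SNR expressed in decibels; the text specifies neither the noise type nor the unit, so both are assumptions. The function name `add_noise` and the clean test signal are hypothetical.

```python
import math
import random


def add_noise(signal, snr_db, rng=None):
    """Return a noisy copy of `signal` with white Gaussian noise at `snr_db` (dB)."""
    rng = rng or random.Random()
    signal_power = sum(v * v for v in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in signal]


# Hypothetical clean response: a decaying 20 Hz oscillation sampled at 1 kHz.
clean = [math.exp(-2.0 * n / 1000.0) * math.sin(2.0 * math.pi * 20.0 * n / 1000.0)
         for n in range(1000)]

# Several noisy realizations of the same deterministic signal.
rng = random.Random(0)
noisy_copies = [add_noise(clean, snr_db=30.0, rng=rng) for _ in range(5)]
```

Calling the function repeatedly on one deterministic response yields the multiple signals per damage location that the text describes.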


2.2 Feature extraction 


Step 1: The set of time domain signals obtained
from the data generation step is first transformed
to time-frequency domain signals by using the
discrete Short Time Fourier Transform (STFT) as
shown in Equation 3.


$S[m, \xi] = \sum_{n=0}^{N_s-1} w_{new}(x_p, n)\,\varphi[n - m]\,e^{-j 2\pi \xi n}$ (3)


where $\varphi(n)$ is the window function given as:
where $I_0$ is the zeroth-order modified Bessel function
of the first kind, $w$ is the width of the window
and $\alpha$ is an arbitrary non-negative real number that
determines the shape of the window.
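
The description above (a window built from the zeroth-order modified Bessel function, a width and a non-negative shape parameter) matches a Kaiser window. The sketch below uses one common textbook form, $\varphi(n) = I_0\left(\pi\alpha\sqrt{1 - (2n/(w-1) - 1)^2}\right) / I_0(\pi\alpha)$ for $0 \le n \le w-1$. Whether the paper uses exactly this parameterization is an assumption, and the function names are hypothetical.

```python
import math


def bessel_i0(x, terms=50):
    """Zeroth-order modified Bessel function of the first kind (power series)."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        # Ratio of consecutive series terms: (x/2)^2 / (k+1)^2.
        term *= (x / 2.0) ** 2 / ((k + 1) ** 2)
    return total


def kaiser_window(width, alpha):
    """Kaiser window with `width` samples; alpha >= 0 controls the shape."""
    if width < 2:
        raise ValueError("width must be at least 2")
    denom = bessel_i0(math.pi * alpha)
    window = []
    for n in range(width):
        ratio = 2.0 * n / (width - 1) - 1.0
        arg = math.pi * alpha * math.sqrt(max(0.0, 1.0 - ratio * ratio))
        window.append(bessel_i0(arg) / denom)
    return window


window = kaiser_window(width=11, alpha=2.0)
```

With `alpha = 0` the window is rectangular; larger values taper the edges more strongly.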

Step 2: In this step one forms a matrix [P], given
in Equation (4), using the outputs of the STFT
given by Equation (3). In this matrix, time increases
across the columns of [P] and frequency increases
down the rows, starting from zero.

First, two vectors, namely the frequency and
time vectors, are defined as follows:

$\{F\} = [\xi_1, \xi_2, \ldots]^T \quad (\xi_i < \xi_{i+1})$
$\{T\} = [m_1, m_2, \ldots]^T \quad (m_i < m_{i+1})$


The entries of [P] denoted by P(m,, &) is defined 
as: 


P (mé Pai m, $, 


— oO p 
cs le i 


n=l 


frequency $577 g Šo 


The time and frequency used in equation (3) are 
taken from {T} and {F} respectively. Finally the 
matrix [P] looks like 


where, when é + 0, Nyquist 


p otherwise 


$[P] = \begin{bmatrix} P(m_1, \xi_1) & P(m_2, \xi_1) & \cdots \\ P(m_1, \xi_2) & P(m_2, \xi_2) & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$ (4)

This matrix has been computed for each of the
signals in the data generation step.

Step 3: Subsequently the eigenvalues of $[P]^T[P]$
are calculated and used as initial features. Therefore
the data matrix [X] is given as:


$[X] = [\{x\}_1, \{x\}_2, \ldots, \{x\}_{n_s}]^T$ (5)


where $\{x\}_k = [f_1^k, f_2^k, \ldots]^T$, the $f_i^k$ are the
eigenvalues of the matrix $[P]^T[P]$
constructed from the $k$-th signal, and $n_s$ is the number of
signals used in training. In other words, the rows
of the data matrix [X] represent the feature vectors
corresponding to each signal used in the training.


2.3 Dimensionality reduction using principal 
component analysis 


Principal Component Analysis (PCA) attempts 
to project the data obtained in the feature extrac- 
tion step into a lower dimensional vector space 
maintaining the maximum variance possible. The 
mathematical procedure to obtain such a transfor- 
mation is as follows. 

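Step 1: Given the data matrix [X], the
variance-covariance matrix can be calculated as:


$[S] = \begin{bmatrix} var(F_1) & cov(F_1, F_2) & \cdots & cov(F_1, F_q) \\ cov(F_2, F_1) & var(F_2) & \cdots & cov(F_2, F_q) \\ \vdots & \vdots & \ddots & \vdots \\ cov(F_q, F_1) & cov(F_q, F_2) & \cdots & var(F_q) \end{bmatrix}$


where $F_j = [X_{1j}, X_{2j}, X_{3j}, \ldots, X_{n_s j}]^T$ such that $X_{ij}$ is the
element of [X] at the $i$-th row and $j$-th column,
$cov(F_i, F_j) = \frac{1}{n_s}(F_i - \mu_i)^T (F_j - \mu_j)$, and
$var(F_i) = \frac{1}{n_s}(F_i - \mu_i)^T (F_i - \mu_i)$, with $\mu_i$ the mean of $F_i$.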

Step 2: Now the eigenvalues and eigenvectors of
matrix [S] are calculated by solving the eigenvalue
problem given in Equation 6.


$[S]\{v\}_i = \lambda_i \{v\}_i$ (6)


Step 3: Subsequently the $\lambda_i$ are arranged in descending
order as $\lambda_1 > \lambda_2 > \lambda_3 > \ldots > \lambda_q$.

Step 4: For dimensionality reduction, the first $\gamma$
eigenvectors, corresponding to the eigenvalues
arranged in descending order, are chosen.
These eigenvectors are stacked into a matrix
[T] such that


$[T] = [\{v\}_1\ \{v\}_2\ \{v\}_3 \ldots \{v\}_\gamma]$ and $\dfrac{\sum_{i=1}^{\gamma} \lambda_i}{\sum_{i=1}^{q} \lambda_i} \geq 0.99$


The quantity $\left(1 - \dfrac{\sum_{i=1}^{\gamma} \lambda_i}{\sum_{i=1}^{q} \lambda_i}\right) \times 100$ is referred to as the
loss percentage.

Step 5: Finally the new data matrix corresponding
to the new vector space can be obtained as:


$[X_{new}] = [T]^T [X]^T$ (7)


Each column of the new data matrix denotes
the projection of a point in $R^{q}$ onto $R^{\gamma}$.
Hence, all the $q$-dimensional vectors are
converted to $\gamma$-dimensional vectors, where $\gamma < q$. The
proof that the matrix $[T]^T$ indeed projects the data
onto a subspace which maximizes the variance can
be found in (Bishop 2006). For visualization of the
data, $\gamma$ must be $\leq 3$.
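
The selection of the number of retained components can be sketched as below: given the eigenvalues of [S], choose the smallest number of leading components whose retained variance fraction reaches 0.99 and report the corresponding loss percentage. The function name and the eigenvalues are hypothetical and do not come from the paper.

```python
def select_components(eigenvalues, retained=0.99):
    """Return (gamma, loss_percent) for the smallest gamma meeting `retained`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for gamma, value in enumerate(ordered, start=1):
        running += value
        if running / total >= retained:
            return gamma, (1.0 - running / total) * 100.0
    return len(ordered), 0.0


# Hypothetical eigenvalues of the covariance matrix [S].
gamma, loss = select_components([0.02, 5.1, 0.003, 1.4, 0.001])
print(gamma, round(loss, 3))
```

For these illustrative eigenvalues the two largest already carry more than 99% of the total variance, so a two-dimensional projection would satisfy the criterion.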


2.4 Classification using logistic regression 


Reduced dimensional feature vectors obtained after
the PCA are first mean normalized and scaled and
then augmented with unity to create the augmented
feature vectors. These augmented feature vectors
along with the corresponding class labels are used
as training data. Logistic regression uses the hypothesis
function given below (Ng 2012). The unknown
parameter vector $\{\theta\}$ in the hypothesis function is
estimated by optimizing the cost function given in
equation (9) using the MATLAB function ‘fminunc’.
The value of the hypothesis function calculated for
a feature vector can be treated as the probability of
it belonging to the damaged beam.

Training set:

$\{(\{z\}^{(1)}, y^{(1)}), (\{z\}^{(2)}, y^{(2)}), \ldots, (\{z\}^{(n_s)}, y^{(n_s)})\}$

Hypothesis function:

$h_\theta(\{z\}) = \dfrac{1}{1 + e^{-\{\theta\}^T \{z\}}}$ (8)

The cost function $J(\{\theta\})$ is given as follows:

$J(\{\theta\}) = -\dfrac{1}{n_s} \sum_{i=1}^{n_s} \left[ y^{(i)} \log h_\theta(\{z\}^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(\{z\}^{(i)})\right) \right]$ (9)

where $\{z\}^{(i)} = [1, C_1^{(i)}, C_2^{(i)}]^T$ such that

$C_1^{(i)} = \dfrac{PC_1^{(i)} - meanPC_1}{\sigma_{PC1}}, \quad C_2^{(i)} = \dfrac{PC_2^{(i)} - meanPC_2}{\sigma_{PC2}}$

$PC_1^{(i)}$ and $PC_2^{(i)}$ are the values of the 1st and 2nd
principal components corresponding to the $i$-th column of
the new data matrix, $meanPC_1$ and $meanPC_2$ are the
means of the 1st and 2nd principal components, while
$\sigma_{PC1}$ and $\sigma_{PC2}$ are the corresponding standard
deviations, $y^{(i)} \in \{0, 1\}$, $n_s$ is the number of signals
used in training and $\{\theta\}$ is the vector of parameters.
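
The classification step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: plain batch gradient descent stands in for MATLAB's `fminunc`, the population standard deviation is used for scaling, and the two-dimensional principal-component scores and labels are hypothetical.

```python
import math


def sigmoid(t):
    """Numerically safe logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def standardize(points):
    """Mean-normalize and scale each coordinate, then prepend the constant 1."""
    dims = len(points[0])
    means = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    stds = [math.sqrt(sum((p[d] - means[d]) ** 2 for p in points) / len(points))
            for d in range(dims)]
    return [[1.0] + [(p[d] - means[d]) / stds[d] for d in range(dims)] for p in points]


def cost(theta, zs, ys):
    """Cross-entropy cost J(theta) averaged over the training set."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(zs, ys):
        h = sigmoid(sum(t * v for t, v in zip(theta, z)))
        total += y * math.log(h + eps) + (1 - y) * math.log(1.0 - h + eps)
    return -total / len(zs)


def fit(zs, ys, rate=0.5, steps=2000):
    """Batch gradient descent on J(theta); a stand-in for an optimizer like fminunc."""
    theta = [0.0] * len(zs[0])
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for z, y in zip(zs, ys):
            err = sigmoid(sum(t * v for t, v in zip(theta, z))) - y
            for j, v in enumerate(z):
                grad[j] += err * v / len(zs)
        theta = [t - rate * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical 2-D principal-component scores: label 1 = damaged, 0 = undamaged.
pcs = [(0.9, 1.1), (1.2, 0.8), (1.0, 1.3), (-1.0, -0.9), (-1.2, -1.1), (-0.8, -1.3)]
labels = [1, 1, 1, 0, 0, 0]
zs = standardize(pcs)
theta = fit(zs, labels)
predictions = [1 if sigmoid(sum(t * v for t, v in zip(theta, z))) >= 0.5 else 0
               for z in zs]
```

The set of points where $\{\theta\}^T\{z\} = 0$ is the linear decision boundary; a feature vector is assigned to the damaged class when the hypothesis value is at least 0.5.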

Assessment of the classifier was done by creating
test sets following the same procedure as for
the training data, except that during test set
creation the noise added to the signals was altered.
The accuracy obtained for the various test sets is
tabulated in Table 2. The classifier thus obtained from the
training data can be used for classifying signals
from the real structure.


3 RESULTS AND DISCUSSION 


The method explained in the previous sections is 
now applied to a cantilever beam of the following 
specifications. 


$L$ = 45 cm, $b$ = 5.2 cm, $h$ = 0.5 cm, $\delta$ = 0.2 cm
$L_1$ = (5 cm, 10 cm, 15 cm, 20 cm, 25 cm, 30 cm)
$\Delta L$ = 0.2 cm, $L_2 = L - \Delta L - L_1$, $E$ = 200 GPa
$\rho$ = 7850 kg/m³

where the geometric parameters are shown in Figure 2. For
simulation of equation (2) the initial conditions were set
to zero and the forcing function was a rectangular pulse
of small width as shown in Figure 3. The displacement
responses for different damaged and
undamaged cases are shown in Figure 4.

Figure 3. Excitation force.

To verify the correctness of the above results, the damage
size was set to zero in the mathematical model of the
damaged beam, and the first four natural frequencies
obtained through the Fast Fourier Transform (FFT) of the
response were compared with those obtained by the
analytical formula and by ANSYS. The comparison is shown
in Table 1. It is clear from Table 1 that all natural frequencies
calculated through FFT of the response match reasonably
well with the natural frequencies obtained
by the analytical formula. On the other hand, when
compared with the natural frequencies obtained by
ANSYS, the match is fairly good for the first two
natural frequencies but there is a big difference
for the third and fourth natural frequencies.
This mismatch at higher frequencies can be
attributed to the following reason. In
ANSYS the field problem is solved over the entire
domain without making any special assumptions as
in the Euler-Bernoulli beam theory. It is known
that at high frequencies more realistic assumptions,
like the inclusion of rotary inertia (Rayleigh beam
theory) and shear deformation (Timoshenko beam
theory), are to be incorporated (Shames 1985). The
predicted results of the Euler-Bernoulli beam
overestimate the correct values. Looking at the
displacement signals corresponding to the various damage
conditions, it is almost impossible to figure out any
difference between them. The time-frequency plots
given in Figure 5 do reveal some differences
between the signals. But after the feature extraction
and dimensionality reduction by the STFT and
PCA respectively, a good separation can be seen in
the final 2D feature space (principal components’
space) as shown in Figure 6.The loss% during 
PCA is of the order of 10° which indicates that 
the data obtained after the feature extraction step
has most of its variance in the first two principal
directions. Another interesting thing that can be noted in
Figure 6 is that there are seven clusters, which are
indicative of the nature of the data set because the data
set contains vectors corresponding to the six different
damage locations and the undamaged system.
The decision boundary obtained using logistic regression
for training data obtained by mixing noise of
SNR = 91 is shown in Figure 7. It can be noticed
from Figure 7 that the accuracy obtained on the
training set is 100%. The generalization assessment,
performed by calculating the accuracy achieved
on different test sets, is given in Table 2. It is clear
that the accuracy achieved on test sets created with
SNR ≥ 91 is 100%; on the other hand, the accuracy
is 98.89% with SNR = 90 in the test set. Hence
the classifier works quite well on data
outside the training set.

Figure 4. Displacement response for different damage
locations.

Figure 5. Spectrograms for different damage locations.

Table 1. Comparison of natural frequencies.

Natural frequency    From FFT of      MATLAB        (FEM)
number               the response     Analytical    ANSYS
1st                  20 Hz            20.1329 Hz    20.131 Hz
2nd                  126 Hz           126.17 Hz     126.08 Hz
3rd                  354 Hz           353.28 Hz     207.21 Hz
4th                  694 Hz           692 Hz        328.4 Hz

Table 2. Accuracy achieved for various test sets.

SNR    No. of test points    Accuracy
90     90                    98.89%
91     90                    100%
92     90                    100%
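
As a cross-check of the analytical column of Table 1, the sketch below evaluates the standard Euler-Bernoulli expression for a clamped-free beam, $f_n = \dfrac{(\beta_n L)^2}{2\pi L^2}\sqrt{\dfrac{EI}{\rho A}}$, with the beam data listed above. The values $\beta_n L$ are the well-known roots of $\cos x \cosh x = -1$. That this is the "analytical formula" meant in the text is an assumption, and the variable names are hypothetical.

```python
import math

# Beam data from the text, converted to SI units.
length = 0.45        # m
width = 0.052        # m
thickness = 0.005    # m
youngs = 200e9       # Pa
density = 7850.0     # kg/m^3

area = width * thickness
inertia = width * thickness ** 3 / 12.0

# First four roots of cos(x) * cosh(x) = -1 (clamped-free beam).
BETA_L = (1.875104, 4.694091, 7.854757, 10.995541)


def natural_frequencies_hz():
    """Euler-Bernoulli natural frequencies of the undamaged cantilever."""
    stiffness_term = math.sqrt(youngs * inertia / (density * area))
    return [(bl / length) ** 2 * stiffness_term / (2.0 * math.pi) for bl in BETA_L]


for mode, freq in enumerate(natural_frequencies_hz(), start=1):
    print(f"mode {mode}: {freq:.2f} Hz")
```

Under these assumptions the computed values agree closely with the analytical column of Table 1 for all four modes.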


4 CONCLUSION 


The proposed structural damage detection methodology
integrates techniques used in the fields of signal
processing and machine learning; the combination of
the methods and the proposed features are new.
New features based on the Short Time Fourier
Transform of the response signal have been
successfully used to detect the presence of damage
in a beam. As these features are able to discriminate
between the damaged and undamaged beams,
they can also be used with classifiers other than
the one presented here. In addition, the method is
also capable of capturing the characteristics of the
data by creating different clusters in the principal
components’ space for different damage locations,
indicating the possibility of damage localization as
well. The assessment of the classifier showed an
accuracy of 100% on the test sets with SNR ≥ 91
and 98.89% for SNR = 90.


REFERENCES 


Bishop, Christopher M. 2006. Pattern Recognition.
Springer Science + Business Media, LLC. 

Duda, Richard O., Peter E. Hart, and David G. Stork. 
2001. “Pattern Classification.” New York: John Wiley, 
Section. doi:10.1007/BF01237942. 

Farrar, Charles R., and Keith Worden. 2013. Structural 
Health Monitoring: A Machine Learning Perspective. 
doi:10.1177/1475921708090560. 

He, Hao-Xiang, and Wei-ming Yan. 2007. “Struc- 
tural Damage Detection with Wavelet Support 
Vector Machine: Introduction and Applications.” 
Structural Control and Health Monitoring 14 (1): 162-76.
doi:10.1002/stc.150. 

Ng, Andrew. 2012. “1. Supervised Learning.” Machine 
Learning, 1-30. doi:10.1111/j.1466-8238.2009.00506.x.

Satpal, S.B, A Guha, and S Banerjee. 2016. “Damage 
Identification in Aluminum Beams Using Support 
Vector Machine: Numerical and Experimental Stud- 
ies.” Structural Control and Health Monitoring 23 (3): 
446-57. doi:10.1002/stc.1773. 

Shames, Irving H. 1985. Energy and Finite Element Meth- 
ods in Structural Mechanics. CRC Press. 

Zhou, Q., Y. Ning, Q. Zhou, L. Luo, and J. Lei. 2013. 
“Structural Damage Detection Method Based on Ran- 
dom Forests and Data Fusion.” Structural Health Mon- 
itoring 12 (1): 48-58. doi:10.1177/1475921712464572. 


Safety and Reliability - Safe Societies in a Changing World — Haugen et al. (Eds) 
© 2018 Taylor & Francis Group, London, ISBN 978-0-8153-8682-7
deteriorating components in hybrid dynamical system
ABSTRACT: A Hybrid Dynamical System (HDS) includes a set of continuous dynamics in which a
particular continuous dynamics is activated at a particular set of discrete events, termed as mode of the 
system. Thus, a different mode triggers a different continuous dynamics. Degradation evolutions of the 
components of HDS depend on the operating mode of the system. Thus, the existing approaches for 
continuous systems are not suited for Remaining Useful Life (RUL) prediction for HDS. In addition, a 
discrete mode fault may be possible besides the parametric faults (abrupt or progressive nature). This arti- 
cle presents an integrated approach to Fault Diagnosis (FD) and RUL prediction of multiple deteriorat- 
ing components in an HDS. For improving FD scheme, dynamic fault signature matrices are utilized for 
parametric and discrete mode fault isolation, which minimize the possible suspected faults by using the 
possible deviation direction of the faults. If the detected fault is progressive in nature, then the FD scheme 
is further utilized to point out the severity change points of the degradation. Using the knowledge of each 
severity change points and the deviation direction of progressive fault, constrained parameter estimation 
method with dynamically updated parameter’s bound is proposed for fast degradation states estimation. 
The estimates of the degradation at different time instances in a respective operating mode are further uti- 
lized for mode-dependent degradation model identification and RUL prediction. An online degradation 
model selection scheme is proposed for degradation model identification in different operating modes. 
The proposed method is able to identify the degradation model of multiple degrading components in 
a real time at different operating modes and can be adapted with new information of their degradation 
states estimated by the constrained parameter estimation during continuous monitoring. The proposed 
approach is demonstrated through numerical simulation of an example hybrid dynamical system. 


1 INTRODUCTION

Nowadays, due to advances in technology and
modern control systems, most process
engineering systems, like drinking water systems,
chemical process engineering systems, etc., can
be best modeled as hybrid dynamical systems
(HDS). Fault diagnosis (FD) and remaining useful
life (RUL) prediction of HDS are complicated
because the dynamical behaviour of such a system
is governed by a specific combination of discrete
modes (autonomous mode or supervisory controller
mode) (Narasimhan & Biswas, 2007; Wang
et al. 2013; Borutzky, 2015). These systems show
different continuous dynamics in different modes.
Most of the existing FD and RUL prediction
approaches (Medjaher & Zerhouni, 2009; Jha
et al., 2016) are typically developed for continuous
dynamical systems and assume the occurrence
of a single fault with a constant rate of degradation
throughout the component’s life cycle. However,
many faults may be possible during the continuous
operation of a system and the RULs of all such faulty
components must be provided to the plant engineers
for decision making and maintenance
scheduling. Generally, a component’s degradation is
a slow process where the degradation severity level
changes in time units of hours, days, weeks,
months, or even years depending on the type of
the system, its dynamics and the environmental
conditions. These severity change time points may be
different for different components, i.e., different
components degrade at different rates. The
existing approaches may not be easily applied to
HDS since the degradation progress in the faulty
components depends on the system operating mode,
and the existing approaches may fail to diagnose
the actual faulty components due to the single fault
assumption. Thus, the switching behavior of the
system and the possibility of occurrence of multiple
faults make the FD and RUL prediction tasks very
challenging for HDS, especially considering the
fact that some fault symptoms may be masked or
compensated by pre-existing fault symptoms. Also, 
a discrete mode fault may be possible besides the 
parametric faults (abrupt or progressive nature). 
Examples of discrete mode faults are valve stuck 
on/off fault, mode transition failure, control com- 
mand communication fault, etc. Thus, the existing 
approaches based on single fault assumption may 
provide unreliable results and lead to misdiagno- 
sis. Therefore, there is a need for development of 
more effective and efficient techniques for FD and 
RUL prediction of HDS and it is also practical to 
unify both in a common methodology for health 
monitoring of the system. Also, degradation of a 
component is usually irreversible and the associ- 
ated parameter value deviates monotonically in a 
certain direction (either increasing or decreasing 
parameter value). This information of param- 
eter deviation direction can be used for improving 
parameter estimation algorithms through specifi- 
cation of appropriate constraints. 

Recently, many works have been published using the hybrid bond graph (HBG) technique with applications to modeling, control and FD of HDS (Narasimhan & Biswas, 2007; Wang et al. 2013; Borutzky, 2015; Low et al., 2010; Ghoshal et al., 2012; Levy et al., 2015). However, very few works (Yu et al., 2015; Daigal et al., 2015) deal with RUL prediction along with FD of HDS. In (Yu et al., 2015; Daigal et al., 2015), the HBG technique is utilized for system modelling and FD, whereas a particle filtering/Monte Carlo technique is applied for identification of the degradation rate. In the available literature on HDS, no work was found on RUL prediction of sequentially occurring multiple faults, where a new fault's symptoms may be masked or compensated by already existing degradations in other components.

In summary, this paper proposes an integrated approach for real time RUL prediction of multiple degrading components in an HDS using HBG as a common framework. A constrained parameter estimation (CPE) technique with dynamically updated constraints, supported by the information on parameter drift direction, is proposed to speed up the degradation pattern identification and RUL prediction. The proposed method accommodates the influences of different operating modes and adapts to new information on degradation states identified through continuous monitoring.

The remainder of the article is organized as follows. Section 2 presents the common terminology and methodology used for health monitoring of HDS using the HBG approach. Section 3 presents the FD and RUL prediction of multiple deteriorating components. Section 4 shows the efficacy of the proposed method using simulation and Section 5 gives conclusions and perspectives.


2 HBG-BASED DIAGNOSIS 
METHODOLOGY 


2.1 Hybrid dynamical system 


A benchmark hybrid two-tank system, shown in Figure 1, is considered as the application example in this paper. This system includes tanks, valves, pipes, and PI-controlled pump flow ($Q_p$). Valve $V_1$ is of on-off type and operates in different states (on or off) according to the supervisory controller command ($a_{s1}$). Two drainage pipes $L_1$ and $L_2$ belong to two autonomous mode transitions ($a_1$ and $a_2$, respectively) of the system, which depend on the internal dynamics of the system, i.e. the state of liquid level ($H_1$, $H_2$) of tanks $T_1$, $T_2$, respectively. When the level of liquid in $T_1$ exceeds the level $H_{L1}$, the autonomous mode $a_1$ is switched to its active mode and liquid starts flowing from $T_1$ to $T_2$ through pipe $L_1$. Likewise, when the level of liquid in $T_2$ exceeds the level $H_{L2}$, another autonomous mode $a_2$ is switched to its active mode and the liquid starts flowing from $T_2$ to atmosphere through pipe $L_2$. These changing modes complicate the FD and RUL prediction process. For example, performance of the on-off valve $V_1$ may degrade due to fouling (buildup of sediment, lime-scale, etc.). However, the deterioration rate of $V_1$ is not always constant, as it depends on the liquid flow through this unit. During the off command of $V_1$, there is no buildup of fouling as no liquid flows through this unit. Only during the on command of $V_1$ is fouling possible, and it may increase at some unknown rate.

Also, two imaginary valves $V_{Leak1}$ and $V_{Leak2}$ are modeled to simulate leakage faults in tanks $T_1$ and $T_2$, respectively. The pump saturation characteristic ($\Phi_p$) and output law of PI-controller ($\Phi_{PI}$) are, respectively, stated as

$$Q_p = \begin{cases} U_{PI}, & \text{if } 0 \le U_{PI} \le f_{max} \\ 0, & \text{if } U_{PI} < 0 \\ f_{max}, & \text{if } U_{PI} > f_{max} \end{cases} \qquad (1)$$

$$U_{PI} = K_P \left(S_p - \rho g H_1(t)\right) + K_I \int \left(S_p - \rho g H_1(t)\right) dt = \Phi_{PI}(H_1(t)) \qquad (2)$$

where $U_{PI}$ is PI-controller output, $f_{max}$ is the maximum pump flow, $S_p$ is a controller set point, $K_P$ is proportional gain and $K_I$ is integral gain, $g$ is acceleration due to gravity, $\rho$ is liquid density.

Figure 1. A schematic of hybrid two-tank system.
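As an illustration of Equations (1)-(2), the saturation characteristic and the PI law can be sketched in Python as follows. This is a minimal sketch and not the authors' implementation: the names `pump_flow` and `PIController`, the rectangular integration rule and the default values of `rho` and `g` are assumptions made here for illustration.

```python
def pump_flow(u_pi: float, f_max: float) -> float:
    """Saturation of Equation (1): clamp the PI output to [0, f_max]."""
    return min(max(u_pi, 0.0), f_max)


class PIController:
    """PI law of Equation (2), acting on the pressure head rho*g*H1."""

    def __init__(self, k_p, k_i, set_point, rho=1000.0, g=9.81):
        self.k_p = k_p
        self.k_i = k_i
        self.set_point = set_point
        self.rho = rho
        self.g = g
        self.integral = 0.0  # running integral of the control error

    def output(self, h1: float, dt: float) -> float:
        """Return U_PI for the current level h1 and sample period dt."""
        error = self.set_point - self.rho * self.g * h1
        self.integral += error * dt  # rectangular rule (an assumption)
        return self.k_p * error + self.k_i * self.integral


# Usage with illustrative numbers only (not the paper's parameters).
controller = PIController(k_p=2.0, k_i=0.5, set_point=10.0, rho=1.0, g=1.0)
u_pi = controller.output(h1=4.0, dt=1.0)
q_p = pump_flow(u_pi, f_max=1.0)
```

The clamp reproduces the three branches of the saturation law in a single expression; the controller keeps its integral state between calls so that it can be stepped together with a simulation loop.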


2.2 Diagnostic Hybrid Bond Graph (DHBG) 
model 


In bond graph modelling, generalized elements, i.e., Se (effort source), Sf (flow source), I (inertia), C (compliance), R (resistor), 0 (equal effort junction), 1 (equal flow junction), TF (transformer), GY (gyrator), De (effort detector) and Df (flow detector), are used to model the multi-physics system in a unified way (Samantaray et al., 2008). The diagnostic BG technique was proposed for diagnosis of continuous systems (Samantaray et al., 2006; Samantaray et al., 2008). This technique was further extended to hybrid systems as the DHBG technique (Low et al., 2010), which is well-suited for diagnosis of HDS by means of controlled/switched junctions and by feeding the measurements and mode information to the DHBG model. For an uncertain system, the nominal parameters are also decoupled from their uncertain parts to account for the parameter uncertainties and modeled in linear fractional transformation (LFT) form by using feedback loops of internal variables, which is called the LFT-DHBG model (Djeziri et al., 2007; Merzouki et al., 2012). For example, an LFT-DHBG model of the hybrid two-tank system is presented in Figure 2, where the two level sensor measurements ($H_1$, $H_2$) are fed to the model and the respective parameter uncertainty is decoupled from its nominal value to account for the uncertainty. In Figure 2, the 1-junctions with subscripts $a_{s1}$, $a_1$ and $a_2$ are switched junctions related to discrete modes.
Figure 2. LFT-DHBG model of hybrid two-tank system.


2.3 ARRs/GARRs and adaptive thresholds


The analytical redundancy relations (ARRs) or global ARRs (GARRs) represent the conservation equations of the system, such as continuity or mass balance equations, energy equations, momentum equations, etc. These relations hold at all times and in any working mode of the system, and they do not depend on the past history of events. Evaluating these relations provides the residuals, which are ideally zero during fault-free operation of the system. In practice, due to measurement and parameter uncertainties, residuals show small non-zero values and hence they are bounded by time-varying thresholds, called adaptive thresholds. The ARRs/GARRs and adaptive threshold equations can be systematically derived from the LFT-DHBG model; their general form is

$$GARR_{in}(Z, \theta, U, Y) + (\Delta_i + \Delta_{si}) = 0 \qquad (3)$$

where $GARR_{in}$, $\Delta_i$ and $\Delta_{si}$ represent the nominal residual ($r_i$) ($i = 1, 2, \ldots, n$; $n$ is the number of residuals), the uncertain part due to parameter uncertainties and the small static uncertain part needed to account for measurement errors, respectively. The uncertain parts of the residual provide the adaptive threshold as $\varepsilon_i = \pm\,\mathrm{abs}(\Delta_i + \Delta_{si})$. Also, $Z = [a_1, a_2, \ldots, a_k, \ldots, a_m]^T$ signifies the switched-junction ideal mode vector of $m$ discrete components, $a_k \in \{0, 1\}$, $\theta = [\theta_1, \theta_2, \ldots, \theta_j, \ldots, \theta_p]^T$ signifies a known ideal parameter vector of $p$ components, and $U$ and $Y$ are the measured input and output variables of the system.

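The two imaginary flow detectors (Df) in the LFT-DHBG model of the hybrid two-tank system (Fig. 2) provide the two $GARR_i$ ($i = 1, 2$) and uncertain parts ($\Delta_i$). These GARRs represent the continuity equations of the system in any mode.

Since the absolute values of the uncertain parts are added in the adaptive threshold, the small $\Delta_{si}$ part can be neglected; thus, the GARRs can be written as

$$\begin{aligned} GARR_1:\ & Q_p - C_{T1}\frac{d}{dt}\big(\rho g H_1(t)\big) - a_{s1} C_{dV1}\sqrt{|\rho g (H_1(t) - H_2(t))|}\;\mathrm{sign}(H_1(t) - H_2(t)) \\ & - a_1 C_{dL1}\,\rho g\,(H_1(t) - H_{L1}) - C_{dLeak1}\sqrt{|\rho g H_1(t)|} + \Delta_1 = 0, \end{aligned} \qquad (4)$$

$$\begin{aligned} GARR_2:\ & a_{s1} C_{dV1}\sqrt{|\rho g (H_1(t) - H_2(t))|}\;\mathrm{sign}(H_1(t) - H_2(t)) - C_{T2}\frac{d}{dt}\big(\rho g H_2(t)\big) \\ & + a_1 C_{dL1}\,\rho g\,(H_1(t) - H_{L1}) - a_2 C_{dL2}\,\rho g\,(H_2(t) - H_{L2}) \\ & - C_{dV2}\sqrt{|\rho g H_2(t)|} - C_{dLeak2}\sqrt{|\rho g H_2(t)|} + \Delta_2 = 0 \end{aligned} \qquad (5)$$

where $C_{Ti} = A_i/g$, $A_i$ is the tank cross-section area, and

$$a_i = \begin{cases} 0, & \text{if } H_i < H_{Li} \\ 1, & \text{if } H_i > H_{Li} \end{cases} \qquad i = 1, 2.$$

Using Equations (1)-(2), two more ARRs, respectively, for actuator and controller (assumed to have no uncertainty) are obtained as

$$ARR_3:\ Q_p - \Phi_p(U_{PI}) = 0 \qquad (6)$$

$$ARR_4:\ U_{PI} - \Phi_{PI}(H_1) = 0 \qquad (7)$$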

The uncertain parts A, (i = 1, 2) presented in 
Equations (4) and (5), respectively, are given as 


2.4 Fault signature matrix and coherence vector


For different faults (parametric or mode faults), the GARRs are utilized to create the fault signature matrix (FSM), which depends upon the sensitivities of GARRs to the parameter variations (Low et al., 2010). In this study, the global fault sensitivity signature matrix (GFSSM) and mode change sensitivity signature matrix (MCSSM) (Levy et al., 2015) are utilized, which can discriminate between increasing ($\theta\uparrow$) and decreasing ($\theta\downarrow$) variations in the parameters. These dynamic matrices (see Equations (10) and (11)) are the extended forms of the GFSM/MCSM matrices (Low et al., 2010) used for diagnosis of HDS.

$$GFSSM_{ij} = \begin{cases} -\mathrm{sign}(\partial r_i/\partial \theta_j), & \text{if } r_i \text{ is sensitive to } \theta_j\uparrow \\ +\mathrm{sign}(\partial r_i/\partial \theta_j), & \text{if } r_i \text{ is sensitive to } \theta_j\downarrow \\ 0, & \text{otherwise} \end{cases} \qquad (10)$$

$$MCSSM_{ik} = \begin{cases} -\mathrm{sign}(\partial r_i/\partial a_k), & \text{if } r_i \text{ is sensitive to } a_k\uparrow \\ +\mathrm{sign}(\partial r_i/\partial a_k), & \text{if } r_i \text{ is sensitive to } a_k\downarrow \\ 0, & \text{otherwise} \end{cases} \qquad (11)$$

For example, the signature for the leakage fault in tank $T_1$, i.e. $C_{dLeak1}\uparrow$, is

$$GFSSM(C_{dLeak1}\uparrow) = \left[ -\mathrm{sign}\left(\frac{\partial GARR_1}{\partial C_{dLeak1}}\right),\ -\mathrm{sign}\left(\frac{\partial GARR_2}{\partial C_{dLeak1}}\right) \right] = \left[ +\mathrm{sign}\left(\sqrt{|\rho g H_1|}\right),\ 0 \right]$$

Likewise, signatures for all faults are derived.

A coherence vector (CV) is used to generate the alarm state (0 or 1), whose standard form is given as

$$CV = [cv_1(t), cv_2(t), \ldots, cv_n(t)] \qquad (12)$$

where $cv_i(t)$, $i = 1, 2, \ldots, n$, are generated from the decision procedure $\Theta(r_i(t))$. For robust FD, each residual $r_i(t)$ is tested against an adaptive threshold $\varepsilon_i(t)$ as follows:

$$cv_i(t) = \Theta(r_i(t)) = \begin{cases} 0, & \text{if } -\varepsilon_i(t) \le r_i(t) \le \varepsilon_i(t) \\ +1, & \text{if } r_i(t) > \varepsilon_i(t) \\ -1, & \text{otherwise} \end{cases} \qquad (13)$$

Here, it is assumed that $\varepsilon_i(t) = \Delta_i(t)$. During monitoring, the CV is derived at every sampled data point for generating the alarm state. The alarm state is 1 if any abnormality is found in the system, i.e. $CV \ne [0, 0, \ldots, 0]$. After an alarm, the CV is matched with the GFSSM/MCSSM for the isolation of the real degrading component. If a unique fault signature matches the obtained CV, the fault is isolated. A detectable and isolable component fault is represented by detectability index $D_b = 1$ and isolatability index $I_b = 1$, respectively, in the fault signature matrix.
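The decision procedure of Equation (13) and the matching of the CV against fault signatures can be sketched as below. The function names and the three example signatures are illustrative assumptions (they correspond to the case in which valve $V_1$ is commanded on and the level in $T_1$ is above that in $T_2$), not the complete matrices of the paper.

```python
def coherence_vector(residuals, thresholds):
    """Apply the threshold test of Equation (13) to every residual."""
    cv = []
    for r, eps in zip(residuals, thresholds):
        if -eps <= r <= eps:
            cv.append(0)    # residual inside its adaptive threshold
        elif r > eps:
            cv.append(1)    # positive threshold violation
        else:
            cv.append(-1)   # negative threshold violation
    return cv


def suspected_faults(cv, signatures):
    """Return every fault whose signature equals the coherence vector."""
    return sorted(name for name, sig in signatures.items()
                  if list(sig) == list(cv))


# Illustrative signatures over (r1, r2); names are hypothetical labels.
signatures = {
    "C_dV1_down": (-1, +1),   # fouling of valve V1
    "a_s1_down": (-1, +1),    # valve V1 stuck off (discrete mode fault)
    "C_dLeak1_up": (+1, 0),   # leakage in tank T1
}

cv = coherence_vector(residuals=[-0.3, 0.2], thresholds=[0.1, 0.1])
alarm = int(any(c != 0 for c in cv))
ssf = suspected_faults(cv, signatures)
```

In this example two candidates share the same signature, so the set of suspected faults contains both and a further step (mode tracking or parameter estimation) is needed to isolate the fault.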


3 FD AND RUL PREDICTION 
OF MULTIPLE COMPONENTS 
DEGRADATIONS 


A schematic flow diagram of the integrated FD and RUL prediction of multiple degrading components occurring in a sequential way is presented in Fig. 3. The GARRs and adaptive thresholds are numerically evaluated at every time instant. If a threshold violation occurs for any unknown fault (abrupt/progressive parametric fault or discrete mode fault), the obtained instantaneous CV is checked against both the MCSSM and the GFSSM to generate the initial set of suspected faults (SSF). Inclusion of a discrete mode fault in the SSF complicates the fault isolation process, since the discrete mode fault in a component shares the same fault signature as the partial parametric fault associated with the same component. First, it is assumed that the threshold violation is due to a discrete mode fault, as a discrete mode fault has a more severe impact on the system dynamics and its stability. This paper presents a new method to discriminate a discrete mode fault from a parametric fault based on the magnitude of the residual deviation after a fault, which is identified from Equation (14) as


$$D_{ik} = |GARR_i(Z, \theta, U, Y)| - \frac{\partial GARR_i}{\partial a_k} < \varepsilon_i \qquad (14)$$


where $D_{ik}$ is the difference between the absolute value of the residual deviation after a fault and the sensitivity of the residual with respect to the suspected discrete parameter $a_k$ in the initial SSF. If any discrete mode fault is found, the DHBG model is updated with the faulty mode for subsequent detection of faults (if monitoring is continued). After updating the DHBG model, all residuals are forced to remain bounded within the updated thresholds even in the presence of one or more faults. If the discrete modes are found to be consistent, the threshold violation is due to a parametric fault. The discrete mode faults are then removed from the initial SSF, and the algorithm is triggered for parametric fault identification only for the suspected parameters which remain in the refined SSF. Using the knowledge of deviation directions of the suspected parameters obtained from the GFSSM, the bounds of the suspected parameters, $\theta_j \in [\theta_{jl}, \theta_{ju}]$, are created. Bounds are created based on the previously known nominal values of the parameters ($\theta_{jn}$) and their possible extreme variations after the fault, derived from deep knowledge of the system, called technological specifications. For a time-varying parameter, the parameter bound is also updated when the true magnitude of the varying parameter is obtained after successive estimation of the fault at different times in the respective mode. A CPE technique with dynamically updated parameter bounds is proposed by integrating the gradient-projection method with the Gauss-Newton method. To further improve the parameter estimation, the sensitivity BG technique (Samantaray & Ghoshal, 2007) is utilized, which provides the gradient information of the cost function during the optimization process. The gradient projection method is more efficient, particularly when the constraints include only bounds on the parameters (Nocedal & Wright, 2006). The objective function is formulated as:


$$\min_{\theta} J(\theta) = \frac{1}{2} \sum_{\tau = k-q}^{k} r^T(t_\tau)\, W\, r(t_\tau) \qquad (15)$$

subject to: $\theta_l \le \theta \le \theta_u$

where $r(t_\tau)$ is the residual vector obtained by evaluating the GARRs at time instant $t_\tau$, $k$ is the current sample time, $q \ge 0$ is the number of past sampled data collected during monitoring, $W \in R^{n \times n}$ is a positive semi-definite weighting matrix that can be taken as the unit matrix, and $\theta_l$ and $\theta_u$ are, respectively, the sets of lower and upper bounds on the parameters.
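A minimal sketch of bound-constrained estimation in the spirit of Equation (15) is given below for a single parameter, with the weighting matrix taken as the unit matrix. It applies a Gauss-Newton step and then projects the result onto the bounds, which is a simplification and not necessarily the authors' gradient-projection scheme. The function name `bounded_gauss_newton`, the flow model (flow equals coefficient times the square root of pressure) and all numbers are hypothetical.

```python
import math


def bounded_gauss_newton(residuals, jacobian, theta0, lower, upper,
                         max_iter=50, tol=1e-12):
    """Scalar Gauss-Newton iteration projected onto [lower, upper]."""
    theta = min(max(theta0, lower), upper)
    for _ in range(max_iter):
        r = residuals(theta)
        j = jacobian(theta)
        jtj = sum(ji * ji for ji in j)
        if jtj == 0.0:
            break  # residuals are insensitive to the parameter
        step = -sum(ji * ri for ji, ri in zip(j, r)) / jtj
        new_theta = min(max(theta + step, lower), upper)  # projection
        converged = abs(new_theta - theta) < tol
        theta = new_theta
        if converged:
            break
    return theta


# Synthetic data: flows through a valve with a degraded coefficient.
pressures = [100.0, 400.0, 900.0]
true_coefficient = 0.012
flows = [true_coefficient * math.sqrt(p) for p in pressures]


def residuals(c):
    return [q - c * math.sqrt(p) for q, p in zip(flows, pressures)]


def jacobian(c):
    return [-math.sqrt(p) for p in pressures]


nominal = 0.0159  # assumed nominal value; fouling can only decrease it
first = bounded_gauss_newton(residuals, jacobian, nominal, 0.0, nominal)

# The first estimate becomes the new upper bound for the next search.
second = bounded_gauss_newton(residuals, jacobian, first, 0.0, first)
```

Using the known drift direction to set the upper bound to the last estimate shrinks the search interval at every re-estimation, which is the idea behind the dynamically updated bounds.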

Thus, CPE finally isolates the actual deteriorating component ($\theta_j$, at mode $Z = z$) at the $k$-th time instant and gives the first estimate $\hat{\theta}_j(k, z)$ (fault magnitude at the $k$-th instant). The parameter vector is updated as $\theta = [\theta_1, \theta_2, \ldots, \hat{\theta}_j(k, z), \ldots, \theta_p]^T$ and considered as a new nominal parameter vector at the $k$-th instant. Also, the initial bound $\theta_j \in [\theta_{jl}, \theta_{ju}]$ is updated with the fault magnitude at the $k$-th instant as $\theta_j \in [\theta_{jl}, \hat{\theta}_j(k, z)]$. The evaluation of residuals with updated parameters again forces the residuals to lie within the updated thresholds. If the identified degradation is progressive, the parameter $\theta_j(t, z)$ varies slowly at mode $z$. Thus, the updated residuals again violate the same thresholds after some more time, and the CPE technique with the updated bound $\theta_j \in [\theta_{jl}, \hat{\theta}_j(k, z)]$ gives the second fault estimate $\hat{\theta}_j(k + \Delta k, z)$ at the $(k + \Delta k)$-th time instant. The parameter estimation converges quickly as the search zone is reduced dynamically. Likewise, more estimated data points are accumulated with dynamically updated bounds of the supervised system at a particular mode $z$. If a mode switching occurs during sample data accumulation for CPE in the preceding mode, then fresh sampled data of a fixed time window in the new mode are accumulated and the deteriorating parameter is estimated in this new mode. This way, a set of estimated parameter values (degradation trend) is obtained at different time instants in different operating modes. Depending upon the number of such estimates, an interpolation curve is generated and that curve is extrapolated to obtain the RUL.

Figure 3. A schematic flow diagram of integrated approach.


3.1 Degradation model and RUL prediction 


The set of estimates $D_j^{z} = (t_w, \hat{\theta}_j(t_w, z))$, $w = 1, \ldots, k$, is further used for degradation model (DM) identification at mode $z$. Model fitting requires a sufficient number of estimated data points for precise degradation model identification. The proposed RUL prediction adapts to any new information on the deteriorating state of the supervised system. Initially, a linear DM is used for RUL prediction, as it has good prediction accuracy with few data points; as monitoring continues, the DM and RUL are adapted to the new information on the deteriorating state of the component or system. The initially predicted RUL provides an early indication to the maintenance engineers for decision making and maintenance planning. Thus, the selection of the DM, which depends on the estimated data points available up to the current time $t_k$ at mode $Z = z$, is represented as


$$M_j^{z}:\ \hat{\theta}_j(t, z) = \begin{cases} M_{lin}, & \text{if } w < w_s \\ \phi_1 M_{lin} + \phi_2 M_{pol} + \phi_3 M_{exp}, & \text{if } w \ge w_s \end{cases} \qquad (16)$$

where $w$ is the number of estimated data points up to the current time and $w_s$ is the sufficient number of estimated data points decided by the user for precise degradation model identification. $M_{lin}$ is the linear model, $M_{pol}$ is the second order polynomial model and $M_{exp}$ is the exponential model. Thus, if $w \ge w_s$, a particular degradation model is selected for data fitting according to the value of the degradation model vector $DMV = [\phi_1, \phi_2, \phi_3]$. If $DMV = [1, 0, 0]$ then $M_{lin}$ is selected, if $DMV = [0, 1, 0]$ then $M_{pol}$ is selected and if $DMV = [0, 0, 1]$ then $M_{exp}$ is selected; the model which has the least root mean square error (RMSE) of the data fit is selected as the best degradation model at mode $Z = z$. Likewise, other models can be plugged into Equation (16).

However, this paper uses only these three models for RUL prediction since they show a monotonically increasing or decreasing trend and can reach the set failure threshold when extrapolated. Polynomial models of order higher than two should be avoided in data fitting unless there is a known physical reason or past experience of such a type of degradation of the component. RUL prediction with higher order polynomial models may give good interpolation results but poor extrapolation results (Randall, 2011).

RUL is predicted for the degrading parameter $\{\theta_j(t, Z), t > 0\}$ by extrapolating the data using the identified model $M_j^{z}$ with the future operating modes ($Z$) of the system known up to the current time (obtained from past experience of the system). Thus, when the extrapolated trend of $\theta_j$ reaches a well set failure threshold $\theta_j^f$, the component is declared an end of life (EOL) component at the time to failure (TTF) $t_f$. Thus, $t_f$ and RUL are defined as:

$$t_f = \inf\{t \in R : \theta_j(t, Z) > \theta_j^f \mid \theta_j(0, Z) < \theta_j^f\} \qquad (17)$$

$$RUL(t, Z) = t_f - t \qquad (18)$$
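The model selection by least RMSE and the extrapolation to a failure threshold can be illustrated with the sketch below. It is a simplified example under stated assumptions: only the linear and exponential candidates are fitted (the second order polynomial is omitted for brevity), the exponential model is fitted by a log-linear least-squares fit, the parameter is assumed to decrease towards its threshold, and the data, threshold and function names are hypothetical.

```python
import math


def fit_line(t, y):
    """Ordinary least-squares line y = slope * t + intercept."""
    n = len(t)
    t_mean, y_mean = sum(t) / n, sum(y) / n
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    slope = sum((ti - t_mean) * (yi - y_mean)
                for ti, yi in zip(t, y)) / s_tt
    return slope, y_mean - slope * t_mean


def rmse(model, t, y):
    """Root mean square error of a fitted model on the data."""
    return math.sqrt(sum((model(ti) - yi) ** 2
                         for ti, yi in zip(t, y)) / len(t))


def select_model(t, y):
    """Fit both candidates and keep the one with the least RMSE."""
    slope, intercept = fit_line(t, y)
    rate, log_scale = fit_line(t, [math.log(yi) for yi in y])
    candidates = {
        "linear": lambda x: slope * x + intercept,
        "exponential": lambda x: math.exp(log_scale + rate * x),
    }
    best = min(candidates, key=lambda name: rmse(candidates[name], t, y))
    return best, candidates[best]


def time_to_failure(model, threshold, t_now, horizon, dt=1.0):
    """First time at which a decreasing trend reaches the threshold."""
    t = t_now
    while t <= t_now + horizon:
        if model(t) <= threshold:
            return t
        t += dt
    return None  # threshold not reached within the horizon


# Synthetic estimates of a slowly decaying parameter (illustrative).
times = [0.0, 100.0, 200.0, 300.0, 400.0, 500.0]
estimates = [0.016 * math.exp(-1.0e-4 * ti) for ti in times]

name, model = select_model(times, estimates)
t_f = time_to_failure(model, threshold=0.008, t_now=times[-1],
                      horizon=10000.0)
rul = t_f - times[-1]
```

With only a few estimates the same routine could be restricted to the linear candidate, mirroring the rule that a linear model is used until enough data points are available.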


Also, the identified mode-dependent DMs, $M_j^{z}$, are fed into the DHBG model for precise detection of subsequent faults. In case of multiple degrading components, RUL is predicted for every degrading component, and the component with the least predicted RUL requires more attention by plant technicians.


4 NUMERICAL SIMULATION 


The proposed approach is demonstrated through simulation of the two-tank system (Fig. 1). Two faults are injected, the first in valve $V_1$ (degradation dependent on mode $a_{s1}$, see Fig. 4) and the second in tank $T_1$, at time instants of 225 s and 1475 s, respectively, as

(Z).a.,, ift<t 

dvin vl s! 

Ca (t, Z) = Col Ze r(Z)4on ), Can if t 2ta (3) 
pens dLeakIn? ifr ` tp 

Carear (É) E P + kı (t- ti) if t= tro 


where C,,,,(Z) is the nominal discharge coefficient 
of V, at respective mode (Z), r(Z) = 1.0 x 10“ s~! at 
Z=2)=a,,=1,and r(Z)=0s" at Z=2=a,, =0, 
Lon = | in a, Careakin 1S the nominal discharge 
coefficient of Y, a and k, = 1.0 x 10* kg"? m'? s7. 
The system is simulated for a time span of 2400s, 
with a fixed step size of 0.02 s, by setting all state 
variables to zero at ź = 0 s. The nominal parameters 
of system are K, = 1 ms, K,=5x10°m, S,,=0.5 m, 
Fonax = 1 kg/s, A; =2.16 x 102 m’, C,,;= 1. 593 x 102 
kg"? m”, Cy, = 1 x 103 ms, Cire = 0 kg? m!?, 
(i = 1, 2), H,, = 0.58 m, H,, = 0.40 m, Pam = 0 
N/m’, g = 9.81 m/s’, p = 1000 kg/m’. 




4.1 Implementation of proposed scheme 


The measurements ($Q_p$, $H_1$, $H_2$) and the mode information ($a_{s1}$, $a_1$, $a_2$) obtained from the simulation are used to evaluate the residuals and the adaptive thresholds presented in Equations (4)-(5) and Equations (8)-(9), respectively. Activation of mode $a_1$ and no activation of mode $a_2$ are observed according to the respective prescribed conditions, so $a_2 = 0$ for the whole duration of the simulation. The response of the residuals ($r_1$, $r_2$) and adaptive thresholds ($\varepsilon_1$, $\varepsilon_2$) under the single fault assumption is shown in Figure 5.


Figure 4. (a) Mode $a_{s1}$; (b) mode-dependent degradation in $V_1$.

Figure 5. Residuals and thresholds with single fault assumption.


From the simulation results, $H_1 > H_2$ is found during the observation period 0 to 2400 s, so the GFSSM and MCSSM are as presented in Table 1. The obtained $CV_S$ (subscript $S$ signifies the single fault assumption) using previously existing approaches and the obtained $CV_M$ (subscript $M$ signifies the multiple sequential faults assumption) using the proposed method just after both faults are also shown. Note that taking the absolute value of the signatures in Table 1, without deviation direction, provides the standard GFSM/MCSM. $CV_S$, just after 225 s, provides SSF = $\{a_{s1}, C_{dV1}\}$ if mode $a_1 = 0$; otherwise SSF = $\{a_{s1}, C_{dV1}, C_{dL1}\}$ if mode $a_1 = 1$ (see Table 1). In any mode, the parametric fault ($C_{dV1}$) is not isolable as $a_{s1}$, $C_{dV1}$ and $C_{dL1}$ share a common signature. Also, the inclusion of $a_{s1}$ in the SSF complicates the task of FD; so, tracking the real mode ($a_{s1} = 1$) just after the fault is also required. As the dynamics of $V_1$ depends on mode $a_{s1}$, evaluating the residuals ($r_1$, $r_2$) at $a_{s1} = 0$ forces the residuals to re-enter the threshold bounds even after this fault has been detected (see Fig. 5); in such a situation, FD is not possible unless the system enters a different working mode ($a_{s1} = 1$). Also, $CV_S$, just after 1475 s, provides the same SSF as before. In such cases, the real fault ($C_{dLeak1}$) is not included in the SSF, which leads to misdiagnosis. Here, the next component degradation ($C_{dLeak1}$) is concealed by the already known component degradation ($C_{dV1}$).
The response of the residuals ($r_1$, $r_2$) and adaptive thresholds ($\varepsilon_1$, $\varepsilon_2$) using the proposed modified method, in which the LFT-DHBG model is dynamically updated after each fault estimate, is shown in Figure 6. It is observed that, using the proposed technique with local updating of the model, the SSF includes the true faults according to $CV_M$ for the same fault scenario discussed beforehand. The suspected mode fault ($a_{s1}\downarrow$) is tested as per Equation (14) just after the threshold violation and it is found that $a_{s1} = 1$ is consistent. Thus, using the CPE technique only for the faults that remain in the refined SSF, $C_{dV1}\downarrow$ and $C_{dLeak1}\uparrow$ are isolated as the true faults just after 225 s and 1475 s, respectively. Also, the predicted RULs for both faults using linear DMs, with few data points of the deteriorating parameters ($C_{dV1}$ and $C_{dLeak1}$), are presented in Figures 7a and 8a, respectively. The horizontal lines in these figures indicate the degradation threshold or end of life of a component. Subsequently, updated RULs estimated with adapted models after obtaining sufficient data points are presented in Figures 7b and 8b, respectively. The gradually evolving degradation pattern reconstructed from the results of the simulation and parameter identification matches the expected behavior defined in Equations (19)-(20).

Table 1. GFSSM with MCSSM of the system.

Parameter                 $r_1$        $r_2$        $D_b$       $I_b$
$C_{dV1}\uparrow$         $+a_{s1}$    $-a_{s1}$    $a_{s1}$    0
$C_{dV1}\downarrow$       $-a_{s1}$    $+a_{s1}$    $a_{s1}$    0
$C_{dV2}\uparrow$         0            +1           1           0
$C_{dV2}\downarrow$       0            -1           1           $1-a_2$
$C_{dL1}\uparrow$         $+a_1$       $-a_1$       $a_1$       0
$C_{dL1}\downarrow$       $-a_1$       $+a_1$       $a_1$       0
$C_{dL2}\uparrow$         0            $+a_2$       $a_2$       0
$C_{dL2}\downarrow$       0            $-a_2$       $a_2$       0
$C_{dLeak1}\uparrow$      +1           0            1           1
$C_{dLeak2}\uparrow$      0            +1           1           0
$a_{s1}\uparrow$          +1           -1           1           0
$a_{s1}\downarrow$        -1           +1           1           0
$CV_S$ after 225 s        -1           +1           1           0
$CV_S$ after 1475 s       -1           +1           1           0
$CV_M$ after 225 s        -1           +1           1           0
$CV_M$ after 1475 s       +1           0            1           1
Figure 6. Residuals and thresholds using proposed method.

Figure 7. RUL of $C_{dV1}$ using (a) initial linear model (b)
identified true model, with a failure threshold ( Cf, 
5 CONCLUSION


The key features of the proposed technique are (1) the use of the residual response to identify the minimal set of prognosis candidates for which parameter estimation is triggered, (2) detection of the time points when the parameter estimation needs to be triggered, (3) the minimization of only the inconsistent residuals for the degradation state estimation and the use of the sensitivity of ARRs during estimation, (4) successive updating of the ARRs with the degradation estimates in order to force the residuals to lie within the respective adaptive thresholds and thereby detect further degradation and predict the degradation trends of multiple degrading components, and (5) RUL prediction that is coupled with the operating mode of the system and adapts to any new state of health information on the system's components. The implementation of the proposed CPE technique with the SBG approach, dynamic model updating and the use of GFSSM/MCSSM improve the FD and RUL prediction. The problem of misdiagnosis with multiple sequential faults is also discussed. However, the proposed work is demonstrated only through simulation and the predicted RULs are deterministic in nature. In future, the proposed technique will be applied to a real experiment, and uncertainties and confidence limits will be taken into account in the prediction of the RULs. Also, in the current work, the failure threshold for each component is selected arbitrarily just for demonstration. However, setting the failure threshold for a multi-component system based on system level performance limits is also a crucial design factor in prognosis, which can be considered as a future research problem.
ABSTRACT: This paper discusses novel technologies for energy efficiency and predictive maintenance
using hardware accelerated energy disaggregation. The disaggregation process involves the use of custom 
designed smart sensors that collect and treat aggregated information on the current and voltage wave- 
forms. The treated data are further on transmitted to the cloud where they are stored and processed to 
enable the extraction of advanced information on individual device consumption patterns and health 
status. This information can be extremely useful for the management of electric devices in residential or 
commercial sites as well as for predictive maintenance in industrial sites. The paper reviews the underlying 
methodologies, and presents preliminary work and results from data collection in the offices of a software 
company. The presented work involves the installation of measurement devices and the development of 
complementary hardware and software. This is part of the ongoing 4-year project PREDIVIS (PREdic- 
tive, Disaggregation Intelligent VIS (meaning “power” in Latin)). 


1 INTRODUCTION

Nowadays, the ever-growing power demand of industries and households, combined with the goals for carbon dioxide emission reduction, has led communities to take action by implementing conservation and energy efficiency programs. The first step in energy reduction actions is the rollout of smart meters to monitor energy consumption and of smart grid technologies to distribute the available energy more efficiently, combined with the wide adoption of renewable energy sources.

The energy consumption and carbon emissions are regulated by frameworks and directives, mainly focused on actions by industries in order to minimize their impact on climate change. In most cases these actions are costly and inefficient, and often industries are incapable of adopting new equipment and carbon dioxide emission reduction techniques, which leads to increased taxes and fines when goals are not achieved.

Monitoring of energy consumption at appliance level is essential for predicting energy needs and monitoring appliance operation in a household, a building or an industrial system. Energy disaggregation refers to using data analytics and signal processing to identify specific patterns and to break down electricity consumption to individual appliances. This is usually done in a non-intrusive manner by monitoring the utility connection meter, and has been a field of significant research work for over twenty years. Non-Intrusive Load Monitoring (NILM) is a process where the aggregated electricity consumption is metered at the grid-consumer connection point and, by analyzing the changes in voltage and current waveforms, the appliances in use at a certain time are identified. Still, the main goal of NILM technology is to provide insights into energy consumption at appliance level, mainly to support energy efficiency actions with economic and environmental impact. There are novel techniques using various approaches of NILM for a great number of applications, such as safety in industrial environments, device health monitoring, predictive maintenance and demand response.

Equipment monitoring on industrial sites is a necessity and, most of the time, a costly and complex process. Various industries monitor their machinery and equipment to prevent malfunctions, minimize danger, service and repair costs, and increase the overall operating time. NILM techniques could be a cheap alternative to equipment monitoring systems, which are costly and require huge and complex installations. Such monitoring equipment is also vulnerable to failures due to the numerous sensors and measurement devices that are deployed. NILM is not widely used in industrial and commercial environments because of the complexity of these environments: the great number of similar devices, power factor correction and load balancing equipment, as well as the large number of harmonics in loads make this process really challenging.


2 TECHNICAL PROBLEM DESCRIPTION 


Consider a system of N devices. Devices can be of different types (let K denote the number of possible types, e.g., washing machine, PC, monitor, refrigerator, etc.). For each type k = 1, 2, ..., K of device n = 1, 2, ..., N there is a set S_k of possible operation states. The assumption here is that there is a mapping between the state of a certain device and its electrical footprint on an aggregated time series. Consequently, the sequence of operational states leaves a string of unique fingerprints on the time series of energy consumption measurements.

The usual case is that for each time segment, only 
the total energy consumption is measured and not the 
individual consumptions of each device. The device 
fingerprints are therefore mixed up. The disaggrega- 
tion exercise consists of analyzing the system data 
in order to unravel the strands of each device, and 
enable further analysis of the device operation. The 
disaggregation process is accompanied by informa- 
tion on the operational pattern of each device type, 
for instance, continuous operation or interrupted, 
expected duration of each operational state etc. 
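
As a minimal sketch of this formulation, the following Python snippet
builds the aggregated signal as the sum of the per-device
consumptions. The device types, state names and power values in
STATE_POWER are hypothetical and chosen only for illustration; they
are not taken from any dataset.

```python
# Hypothetical state sets S_k: for each device type, state name -> power in watts.
STATE_POWER = {
    "refrigerator": {"off": 0.0, "compressor_on": 120.0},
    "monitor": {"off": 0.0, "standby": 1.0, "on": 30.0},
}


def aggregate(devices, schedules):
    """Sum the per-device power for every time step.

    devices   -- list of device types, one entry per device n = 1..N
    schedules -- list of state sequences, one per device, all of equal length
    """
    steps = len(schedules[0])
    total = []
    for t in range(steps):
        total.append(sum(STATE_POWER[dev][sched[t]]
                         for dev, sched in zip(devices, schedules)))
    return total


devices = ["refrigerator", "monitor", "monitor"]
schedules = [
    ["off", "compressor_on", "compressor_on", "off"],
    ["on", "on", "standby", "standby"],
    ["off", "on", "on", "on"],
]
print(aggregate(devices, schedules))
```

Only the summed list is available to a NILM system; disaggregation
is the inverse task of recovering the individual schedules from it.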


Figure 1. Example of a household total energy consumption from the
REDD dataset (horizontal axis: time of day).

Looking more closely at the device types, it is also
possible to extract and make use of more advanced 
information. Indeed, different device types usually 
generate slightly different harmonic distortions. The 
harmonic distortions can be identified if the resolu- 
tion of the time series is sufficiently high. 


3 STATE OF THE ART ON ENERGY
DISAGGREGATION


Over the past 20 years, there have been many dif- 
ferent approaches to addressing the problem of 
energy disaggregation and the subsequent moni- 
toring of device health. 

Since Hart [1] first introduced the concept of Non-Intrusive
Appliance Load Monitoring (NIALM) in 1992, numerous techniques have
been developed to address the problem. The initial approach modelled
each device as a state machine, trying to identify its states and to
separate the device from the aggregated load. The method performs
well for large loads and for devices that are not always on and have
a finite number of states with discrete power changes between them.

Since then, other works [5, 6, 7] have proposed different techniques
for electrical signature analysis to address the classification
problem, with promising results. Such methods include Support Vector
Machines (SVM), Bayesian methods and k-Nearest Neighbors. The most
common techniques are Hidden Markov Models and Artificial Neural
Networks, applied in supervised, semi-supervised and unsupervised
settings [8-12].

Regarding energy disaggregation, the non- 
intrusive approaches (NILM) promise adequate 
accuracy with lower installation costs and complex- 
ity compared to smart plug-based [2, 12] approaches. 
NILM methods using steady-state and transient 
load signatures are further classified according to 
data time series’ measurement frequency. High- 
frequency [13] methods require custom hardware 
(high-frequency meters ~10° Hz/s) and employ an 
array of machine learning and pattern recogni- 
tion methods. Low-frequency methods [14] (1 sec 
up to 1 hour) apply similar data processing
techniques, but are not sufficiently tested to guarantee
commercial-grade accuracy. NILM is so far mainly 
focused on households and small-scale buildings, so 
there is also the issue of its scalability to commer- 
cial buildings or even entire neighborhoods in order 
to extract useful information for demand response 
applications, as well as for grid and device health. 

The models employed range from least square 
estimation to Hidden Markov models. Some 
approaches use Fast Fourier Transform and other 
transformations to reduce hardware, bandwidth
and storage cost. Research in this field focuses
on finding the algorithm that increases the accu- 
racy of energy disaggregation in each application 
case. More recently, new approaches with semi-/ 
un-supervised algorithms are being studied [15]. 
An emerging research trend is to use IoT based 
architectures for data capturing and on-board [16] 
analytics on appliance level to provide energy effi- 
ciency solutions. By using IoT devices dedicated 
to energy monitoring and data manipulation, 
researchers aim to extract more information by 
analyzing the electrical characteristics of the appli- 
ances in the deployed sites. 

At the moment, there is no universal Machine 
Learning (ML) algorithm that will fit multiple 
application cases. The specification of the ML 
algorithms varies with the constraints of each 
application case. A list of well performing algo- 
rithms in different application cases should include 
Artificial Neural Networks, SVM + kernels, Deci- 
sion Tree, Random Forests etc. Marking a turning 
point in the history of Artificial Intelligence, Deep 
Neural Networks (DNN) are now widely used, 
e.g., for face recognition on smartphone cameras. 
Research in this field is ongoing in response to the 
evolving market interest for improved DNNs. It 
includes development of new hardware architec- 
tures implementing DNNs to improve on the cur- 
rent CPUs and GPUs [17]. Neuromorphic chips 
have reduced energy consumption and enhanced 
DNN capabilities in processing the vast volumes 
of information generated by the IoT [18]. 

Analyzing data from multiple sensors can pro- 
vide critical information on each current state of 
the monitored devices and enable predictions of 
behavior in the future. The sensors can measure a 
variety of device/environmental attributes, such as 
the temperature [19]. Likewise, electrical consump- 
tion data, when added to other available device 
data, can provide significant input to predictive 
maintenance, and also minimize the data volume 
that needs to be processed and stored. So far, approaches
to predictive maintenance through electrical consumption
data have been made only for specific cases.


4 PROPOSED APPROACH 


4.1 Description of solution 


Project PREDIVIS aims to develop novel tools for 
energy disaggregation and monitoring of device 
operation status, based on real-time pattern rec- 
ognition/matchmaking of complex energy load 
data time-series, using hardware acceleration tech- 
niques. The proposed approach requires the design, 
development and implementation of complex algo- 
rithms on a reprogrammable Field Programmable 
Gate Array (FPGA), in order to create a network
of distributed agents that performs the majority of
the data analysis in real-time, and transmits events, 
instead of raw data, to a main server. 

This project addresses the problem of energy disaggregation in both
household and commercial/industrial environments. Because these two
cases differ in complexity, we will need to use different approaches
and algorithms, and to fine-tune the sampling frequency for each case
so that sufficient data are acquired for the disaggregation process.
Such a system depends on the specifications of each deployment site;
for example, different sites have different numbers of devices with
different characteristics, which may call for a completely different
data acquisition rate. Typically, a NILM system design involves three
main components: Data Acquisition and Storage, Analysis, and
Classification.

Most previous projects used privately generated
or open datasets to train models and to validate
the results and the efficiency of the algorithms.
Previous works use data collection
methods at either High frequency (1 to 10° kHz) 
or Low frequency (10° to 10° kHz). The sampling 
rate may differ from one sample per 15 minutes 
or more, to a couple of millions per second. This 
project focuses on High frequency methods using a 
sampling rate between 8 kHz and 64 kHz to be able 
to extract more features from the available data 
to assist the classification process. Sampling rate 
determines the information that can be extracted 
from the sampled signals. Consider, for example, an electrical
installation with a fundamental power frequency of 5 × 10^-2 kHz
(50 Hz). Sampling the waveform at a higher rate (8 kHz) places the
Nyquist frequency at 4 kHz, so that, by the Nyquist-Shannon theorem,
our system can capture harmonics up to roughly the 80th. By analyzing
the different harmonic distortions, we can identify and differentiate
each device within the aggregated load.
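
The relation between sampling rate and observable harmonics can be
checked with a short calculation. The sketch below is illustrative
only; the function name is hypothetical, and the harmonic that falls
exactly on the Nyquist frequency (the 80th at 8 kHz and 50 Hz) is
treated as a boundary case and excluded.

```python
import math


def highest_harmonic(sampling_rate_hz, fundamental_hz):
    """Largest harmonic order h with h * fundamental strictly below Nyquist."""
    nyquist_hz = sampling_rate_hz / 2.0
    return math.ceil(nyquist_hz / fundamental_hz) - 1


# The two ends of the sampling range considered in the text, 50 Hz grid.
for fs in (8_000, 64_000):
    print(fs, highest_harmonic(fs, 50.0))
```

Raising the sampling rate from 8 kHz to 64 kHz therefore extends the
observable harmonic range by a factor of about eight, at the cost of
a proportionally larger data volume.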

Using high sampling rates requires large storage space for the
acquired data and considerable bandwidth to transfer the data over
the internet to a powerful central unit for further analysis. To
minimize storage and bandwidth, some applications use compression
techniques or on-site devices to analyze the data. Such hardware
devices require a great amount of power to perform the analysis and
are usually very expensive.

Project PREDIVIS will use a novel technique implementing on-site
disaggregation to limit the bandwidth and storage needed (Figure 2).
Using dedicated hardware implemented in FPGA devices will help to
decrease not only the overall bandwidth and storage but also the
power consumption needed for the data analysis. Depending on the
installation, which can be one-, two- or three-phase, the amount of
electric power data to be used for the disaggregation will increase
proportionally. Note, however, that the detected events will be more
or less the same in each case, regardless of the number of phases.


Figure 2. Bandwidth trade-off between raw data transfer and event
reporting architecture.
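
To give a feeling for the orders of magnitude involved, the following
sketch compares the two reporting strategies. All numeric values
(bytes per sample, event message size, events per hour) are
assumptions made for illustration and are not figures from the
project.

```python
def raw_rate_bytes_per_s(sampling_rate_hz, channels, bytes_per_sample):
    """Data rate when every sample of every channel is transmitted."""
    return sampling_rate_hz * channels * bytes_per_sample


def event_rate_bytes_per_s(events_per_hour, bytes_per_event):
    """Average data rate when only detected events are transmitted."""
    return events_per_hour * bytes_per_event / 3600.0


# Assumed values: 8 kHz sampling, 3 bytes per sample,
# 200 events per hour, 64 bytes per event message.
for phases in (1, 2, 3):
    channels = 2 * phases  # one voltage and one current channel per phase
    raw = raw_rate_bytes_per_s(8_000, channels, 3)
    events = event_rate_bytes_per_s(200, 64)
    print(f"{phases}-phase: raw {raw / 1e3:.0f} kB/s, events {events:.2f} B/s")
```

Under these assumptions the raw stream grows with the number of
phases while the event stream does not, which is the behaviour
described in the paragraph above.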

For data storage, PREDIVIS will use a small local memory capable of
storing the device signatures and a couple of hours of the data
stream. In addition to the local storage, data such as on/off events,
total energy consumption, total operation time and anomalies in the
electrical characteristics of appliances will be sent to a cloud
infrastructure and saved in a NoSQL database. Each agent will be able
to retrain when specific conditions are met. The system will use
adaptive learning techniques to adapt better to new and existing
installations by sharing knowledge on already known devices between
the deployed agents through the central cloud system.

In terms of data analysis, various techniques 
will be used to extract information from the avail- 
able data in order to identify: 


• Event transitions, when a device is turned on or off. Event-based
approaches detect only major changes and anomalies in the energy load
time series.

• State transitions, when the device shifts from one state to another
(e.g., from full operation to standby and vice versa). State-based
approaches also detect the different states of device workload.
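
A very simple event-based detector can be sketched as a threshold on
the change between consecutive power readings. This is an
illustrative example only, not the detection method of the project;
the function name, the 50 W threshold and the sample readings are
assumptions.

```python
def detect_events(power_w, threshold_w=50.0):
    """Return (index, delta) for every step change of at least threshold_w.

    power_w     -- aggregated power readings in watts, evenly spaced in time
    threshold_w -- assumed minimum change that counts as an on/off event
    """
    events = []
    for i in range(1, len(power_w)):
        delta = power_w[i] - power_w[i - 1]
        if abs(delta) >= threshold_w:
            events.append((i, delta))
    return events


readings = [30.0, 31.0, 150.0, 151.0, 149.0, 32.0, 30.0]
print(detect_events(readings))
```

A positive delta suggests a device switching on and a negative delta
a device switching off; small fluctuations below the threshold are
ignored. A state-based approach would additionally have to track
intermediate levels such as standby.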


4.2 Description of project 


Project PREDIVIS consists of three main components:


• Agents, which are custom hardware components for data collection,
data analysis and load monitoring

• Cloud-based platform for data visualization, storage and NILM
assisting mechanisms

• NILM and Predictive maintenance algorithm suite


Figure 3. PREDIVIS architecture diagram (labels: Transfer Protocol
through internet; NILM Assisting Mechanism; Predictive Maintenance).


The Agents mentioned above are custom hardware implementations using
FPGA devices and different Intellectual Property (IP) blocks (e.g.,
ADC, DSP etc.) to address data collection and analysis on the
deployment site. Hardware acceleration of NILM algorithms will help
us take the computational load off the cloud infrastructure and
minimize the bandwidth needed, as mentioned earlier. Each deployment
site is unique, with a different number N of devices of K possible
types and different operation states, which makes the use of a
universal algorithm or approach very challenging. Each agent will be
able to perform better on its deployment site through adaptive
learning techniques, with the assistance of the central platform and
of information gathered by other deployed agents with similar site
characteristics or identical device types.

The Cloud-based platform will provide data visualization and display
information through a friendly User Interface (UI), helping the user
gain insights into the energy consumption. Acting as a central point
for reporting, the platform will complement the distributed agents by
collecting and analyzing information from each of them about the
deployment site and the site’s devices, and will help distribute
knowledge between them.

The Cloud platform will also host a NILM suite 
with several algorithms for analyzing real-time data 
to determine which algorithm/method is more suit- 
able for that particular site’s appliance mix, condi- 
tion etc. Finally, the predictive maintenance suite 
will analyze the data and anomalies detected by the 
agent to help reduce hazardous machinery errors, 
downtimes and failures. 

The consortium of this project comprises the following complementary
partners:

• the System Reliability and Industrial Safety Laboratory, National
Center for Scientific Research “Demokritos”, as a research partner
supporting NILM and predictive maintenance analysis;

• Plegma Labs S.A., as enterprise partner in IoT technologies
supporting the cloud infrastructure and the data storage and
management;

• the Department of Information and Communication Systems
Engineering, University of the Aegean, as a research partner
supporting hardware development and data acquisition processes.


Figure 4 depicts the three main stages of this 
project. The project will run for 4 years and the 
work is currently at stage one. 


4.3 PREDIVIS technologies breakdown


The project combines the aforementioned techniques, ranging from
hardware blocks to advanced software features. In a nutshell, the
project will implement hardware designs for data collection using
Analog to Digital conversion and Digital Signal Processing
techniques, combined with embedded Artificial Intelligence functions
and methods. The ability to reprogram an FPGA device over the air can
help each device adapt better to new or pre-installed environments.
The software portion of this project includes:


a. the Cloud-based high-level software for data transport and
storage,

b. the intelligent adaptive NILM algorithm suite to reprogram and
calibrate DNN on agents, and

c. the Predictive maintenance analytics suite combined with
statistical models and machine learning algorithms, to predict future
failures and simulate faults.


Figure 4. PREDIVIS project stages (stage labels: Training, First
approach; DL Techniques adaptive learning models; Predictive
Maintenance).


5 GENERATION OF DATA SETS


In order to develop and test different energy disaggregation methods
and to optimize the efficiency of the PREDIVIS project, a variety of
datasets will be used. These include open high-frequency public
datasets as well as privately generated data. Some of the public sets
that we intend to use are REDD [2] (2011), BLUED [3] (2012) and
UK-DALE [4] (2015). These datasets relate mainly to residential
applications.

The private datasets will contain data from office sites and from
industrial sites. Regarding the former, a set of measuring devices is
currently installed at the offices of a typical software SME.
Figure 5 shows the installation at the company offices. The site is
connected to the electricity grid through a three-phase power supply.
Note that three-phase data have entirely different characteristics
compared to single- or two-phase data, and this will be considered
during the analysis stages.

The installed measuring devices collect data logs 
from the main power circuitry connector, as well 
as from individual devices. The main power data 
involve the aggregated electric current and voltage 
waveforms, and these are measured at both low and 
high-frequency rates. The electric power of individual 
devices is monitored using one smart plug per device. 

Seventeen (17) different entities are monitored, ranging from
lighting to air-conditioning units. These include multiple devices of
the same appliance type, for example 9 monitors and 3 laptops.


5.1 Monitoring devices setup


This section presents the technical equipment employed for the
generation of the dataset described above.


Figure 5. Plegma Labs headquarters installation.


For the main power supply, the authors devel- 
oped a custom implementation using (a) a set 
of voltage and electric current converters for the 
measurements, (b) an Analog to Digital Converter 
(ADC) and (c) an ARM based single board com- 
puter for data treatment. In particular: 


a. The current sensors used for this project are current transformers
with a 1:1800 turn ratio and a rated input of 100 amperes to 50
milliamperes output. For the voltage sensing, voltage transformers
implemented in-house have been used.

b. The employed ADC is the ADS131E08 by Texas Instruments, which is
capable of sampling eight different channels simultaneously. The
number of available channels allows the measurement of electric
current and voltage for each of the three phases, leaving two
channels free for additional analog sensors (e.g., for temperature,
humidity, luminosity etc.). The sampling frequency can range from
1 kHz up to 64 kHz. For the needs of the PREDIVIS project, data are
collected at the maximum resolution. In the future, the analysis will
indicate whether lower resolutions are sufficient (to reduce the
sensor consumption) without loss in the quality of the results.

c. The ARM single-board computer is a Raspberry Pi 3 by the Raspberry
Pi Foundation, which receives data from the ADC through Serial
Peripheral Interface (SPI) communication.


In order to verify the data collected through the above custom
implementation, a widely used industrial-grade energy meter/analyzer
is also connected to the system. The selected device is the BFM136
produced by SATEC Ltd (see Table 1).

For the recording of individual devices’ data, a Z-Wave based system
is implemented. The system uses two different types of smart plugs,
namely the Wall Plug by Fibaro and the Smart Switch 6 by Aeotec. Each
plug is monitored continuously at intervals of 5 seconds or less,
using a Z-Wave USB adapter. The adapter is the Z-Stick S2 by Aeotec
(see Table 1).

Table 1. Devices used in the case study.

Metering device
BFM136
100 A High Accuracy Current Sensors
ADS131E08 with RPi
100 A:50 mA Current Sensors
Voltage Sensors
Fibaro Wall Plug
Aeotec Smart Switch 6


The two sets of measuring devices are accompanied by appropriate
software components developed by the authors. These include:


a. a NoSQL database to store the data locally. The recorded data take
the form of time-voltage-current triplets, stored in a key-value
format with a UNIX timestamp for the time.

b. a custom interface to collect and transmit the data and to
visualize the current and voltage waveforms. The data are sent over
the internet to Plegma’s cloud infrastructure.
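
The time-voltage-current triplets described above could, for
instance, be represented as follows. This is only a sketch: the field
names "key", "value", "voltage" and "current", the helper functions
and the synthetic waveform are assumptions, not the actual schema or
data of the project.

```python
import math


def make_record(timestamp, voltage_v, current_a):
    """One key-value record; the UNIX timestamp serves as the key."""
    return {"key": timestamp, "value": {"voltage": voltage_v, "current": current_a}}


def active_power_w(records):
    """Mean of the instantaneous power v(t) * i(t) over the given records."""
    total = sum(r["value"]["voltage"] * r["value"]["current"] for r in records)
    return total / len(records)


# Synthetic single cycle: 50 Hz sine waves sampled at 8 kHz (160 samples),
# 230 V RMS and 2 A RMS in phase, i.e. an active power of about 460 W.
fs, f0 = 8_000, 50.0
start = 1_500_000_000.0  # arbitrary UNIX timestamp used for illustration
records = [
    make_record(start + n / fs,
                230.0 * math.sqrt(2) * math.sin(2 * math.pi * f0 * n / fs),
                2.0 * math.sqrt(2) * math.sin(2 * math.pi * f0 * n / fs))
    for n in range(int(fs / f0))
]
print(round(active_power_w(records), 1))
```

Keeping voltage and current together under one timestamp makes it
straightforward to derive quantities such as active power from the
stored waveforms.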


All the individual devices reported in Table 2 are measured through
the smart-plug installation. The two main monitoring devices
currently record data at 3-second intervals (BFM136) and at a
frequency of 8 kHz (custom implementation). The main monitoring
devices measure the three-phase installation as follows:


• Phase 1: lights for rooms 1, 2, 3 and 4,
• Phase 2: sockets of rooms 1 and 2,
• Phase 3: sockets of rooms 3 and 4.


Table 2. Measuring devices and entities.

Entity                                                 Measuring device         Room No
Main electricity board                                 BFM136 & ADS with RPi
Water cooler                                           Fibaro Wall Plug         2
Microwave                                              Fibaro Wall Plug         2
Coffee maker                                           Fibaro Wall Plug         2
Refrigerator                                           Fibaro Wall Plug         2
Work station 1: desktop with 2 monitors                Aeotec Smart Switch 6    1
Work station 2: laptop with 1 monitor                  Aeotec Smart Switch 6    4
Work station 3: 1 desktop with 2 monitors              Aeotec Smart Switch 6    3
Work station 4: 1 high load desktop with 2 monitors    Aeotec Smart Switch 6    4
1 Monitor                                              Aeotec Smart Switch 6    3
1 Laptop                                               Aeotec Smart Switch 6    3
1 TV monitor                                           Aeotec Smart Switch 6    2
1 Router                                               Aeotec Smart Switch 6    2
1 Printer                                              Aeotec Smart Switch 6    1
Guest plug 1                                           Aeotec Smart Switch 6    1 or 3
Guest plug 2                                           Aeotec Smart Switch 6    1 or 3
2 Air-condition Units                                  Aeotec Smart Switch 6    1, 2


5.2 Future steps 


Because the two cases this project addresses, residential and
industrial/commercial, differ considerably, it is necessary to be
able to simulate different events and scenarios to test the
efficiency of the approach. In order to test the efficiency of the
utilized algorithm on specific cases and to validate our data,
different devices will be simulated in various working states.

The collected data allow us to test different NILM approaches, based
on either high- or low-frequency data. This project also addresses
the problem of device health monitoring and predictive maintenance.
In order to have sufficient data for the third phase of the project,
where a Predictive Maintenance suite will be implemented, we will
deliberately alter the operation of the devices on some specific days
using different methods (e.g., leaving the fridge door open, turning
devices on and off, modifying thermal loads etc.).

Additional data will be generated to see how 
the disaggregation mechanisms work, to test their 
ability to distinguish the anomalies produced and 
match them to the device or entity appropriately. 
More similar sites will follow so that the final data- 
set has adequate variety to allow the development 
of widely applicable algorithms and tools. 


6 CONCLUSIONS 


In this paper, we reviewed the fundamentals of NILM systems and the
energy disaggregation problem. A novel technique is proposed to
address this problem, applicable to energy efficiency and predictive
maintenance.

Recent works indicate that energy disaggregation is an active field
with many different approaches. However, even the most advanced
methods have not achieved results reliable enough for deployment on a
large scale. The problem, therefore, is still open, and the potential
benefits of disaggregation, in terms of its ability to support
end-users and utilities, cannot yet be fully exploited.

The project PREDIVIS presented here proposes a novel approach with a
custom hardware implementation of the measuring devices. The
combination of software and hardware modules can address the problem
of data bandwidth and storage size. Adequate information for
predictive maintenance can be obtained by combining electric
consumption data with other monitoring data.

The availability of relevant and reliable data is
crucial for the development of the disaggregation 
tools. For this reason, the project starts with the 
generation of datasets for residential, commercial 
and industrial energy use patterns. The collected 
residential and commercial data will be combined 
with publicly available datasets. For commercial 
and industrial environments, the lack of public 
datasets makes it a more challenging process. 

Once the datasets are fixed, the main work 
involves the development of energy disaggregation 
algorithms, the design of data collection and smart 
on-site devices for hardware accelerated analysis. 
With continuous monitoring it is possible to produce 
useful information about the device and machinery 
health. Monitoring the full cycle of operation could 
support energy efficiency and predictive mainte- 
nance applications, by detecting abnormalities, 
predicting total operating time of components etc. 
The current approach could thus detect early-stage 
device malfunction in industrial sites, supported by 
energy consumption evidence as well as other sensor 
data (e.g. temperature). The project considers, at a 
later stage, the development of decision support sys- 
tems for industrial, commercial and large residential 
sites. The success of non-intrusive electric load mon- 
itoring on predictive maintenance could provide a 
much cheaper alternative as compared to complex 
and expensive monitoring equipment.


ABSTRACT: This paper establishes a model for identifying wear faults
of marine diesel engines based on grey rough sets and a
Self-Organizing Map (SOM) network, using oil monitoring data.
Empirical data indicate that wear faults account for a large
proportion of diesel engine faults. Through oil monitoring, changes
in the parameters of the lubricating oil and information on wear
particles can be obtained to analyze the status of components. First,
the paper constructs a two-dimensional fault decision table.
Subsequently, grey relational analysis and rough set theory are used
to reduce the fault decision table horizontally and longitudinally.
Next, the fault diagnosis model is established with a SOM network.
Finally, the proposed model is validated by empirical research. The
results suggest that the proposed model is feasible for the wear
fault diagnosis problem. Moreover, compared with the traditional SOM
neural network, the model has a smaller error and a better diagnosis
effect.


1 INTRODUCTION 

The marine diesel engine is the heart of the marine power plant, so
its safety and reliability are vital. However, due to harsh working
conditions and the complexity of the marine diesel engine structure,
diesel engine faults are relatively frequent. In addition, there are
many kinds of faults in diesel engines, and the proportion of faults
caused by abnormal wear is the highest, at about 37.5% (Jones et al.
2000). Therefore, monitoring the running status in real time and
recognizing the wear condition of the marine diesel engine in a
timely manner can effectively improve the reliability and economy of
ship operation (Xiang 2009).

During the operation of the marine diesel engine, the oil circulates
through various parts of the equipment. Through oil monitoring, we
can learn the changes of the lubricating oil indexes and the wear
particles of each friction pair. Further, the wear state of the
diesel engine can be identified qualitatively and quantitatively.
Spectral analysis and ferrography analysis, which detect the
concentration of elements and identify the fundamental parameters of
ferromagnetic wear particles in lubricating oil, are oil monitoring
technologies widely used for diagnosing wear faults of diesel engines
(Gao et al. 2013).

Because the factors that affect the oil are complex, and the result
of fault diagnosis is greatly influenced by the oil sampling period
and oil change, the traditional “three-line value” method based on
statistics is no longer applicable. Therefore, the best way is to
uncover the nonlinear relationship behind the data as far as possible
through data learning. The development of artificial intelligence
algorithms provides more choices for fault diagnosis research (Li et
al. 2017). Among them, the Self-Organizing Map (SOM) network has good
self-organization, self-adaptability and robustness. Because it
learns in an unsupervised way, the SOM network does not need the
category of the input vector to be specified. It is a kind of
recognition network based on small-sample training, unlike the
traditional neural network, which requires a large number of training
samples to ensure the accuracy of classification (Wen 2016).
Therefore, SOM networks are widely used in pattern recognition and
classification.
However, when a large amount of complex data is input, the SOM
network often suffers from slow convergence and low classification
accuracy due to the complexity of the network. Rough set theory can
identify and extract the hidden and valuable key data in the input
information, and remove redundant and invalid data. Yi et al. (2014)
used rough set theory to simplify decision rules and remove redundant
information, and proposed a fault diagnosis model for the lube oil
system of a gas turbine based on rough set and SOM neural network.
Zhao et al. (2016) designed a neural network fault diagnosis model
based on rough set theory. These studies use rough set theory to
reduce the attributes of the fault decision table, that is, the
two-dimensional decision table is reduced along the longitudinal
dimension. However, the other dimension of the two-dimensional
decision table, the data of the horizontal dimension, is ignored.
Therefore, the traditional reduction of input data is not complete,
and some irrelevant and redundant data remain. These data are still
learned by the SOM network, which affects the classification accuracy
of the model.

Therefore, this paper makes a two-dimensional 
reduction of the fault decision table; first, the 
data reduction of the horizontal dimension is car- 
ried out by grey relational analysis, and then the 
rough set theory is used to reduce the redundant 
attributes of the longitudinal dimension in the 
fault decision table. Finally, a diesel engine wear 
fault diagnosis model based on Grey rough set and 
SOM neural network is established. 


2 TWO DIMENSION REDUCTION OF 
FAULT DECISION TABLE 


For the marine diesel engine, diagnosing faults caused by abnormal
wear is actually a decision-making process. The basis of decision
making can be a predetermined decision table which represents the
experience and knowledge of managers. It can also be a cumulative
decision table which is abstracted and generated from a summary of
historical failure events (Gao et al. 2013). A common fault decision
table is shown in Table 1.

Each column in the table represents an attribute, and each row
represents an object. $a_1, \ldots, a_n$ represent the condition
attributes, and D represents the decision attribute. In practice,
there are many fault reporting events and much oil monitoring
information for a diesel engine. Therefore, there are many condition
attributes, and the objects in the horizontal dimension also include
some irrelevant and redundant data.

Thus, in this paper, a two-dimensional reduction of the
two-dimensional fault decision table is introduced. The data
reduction of the horizontal dimension in the fault decision table is
carried out by grey relational analysis, and then rough set theory is
used to reduce the redundant attributes of the longitudinal
dimension.


Table 1. Diagrammatic sketch of fault decision table.
U a_1 a_2 a_3 ... a_n D
X_1 1 va 2 i 1 0
X_2 2 1 2 1
X_m L 1 1 0


2.1 Grey relational analysis method 


Grey system theory is one of the important methods and techniques for
studying uncertain systems. Grey relational analysis is a very active
branch of grey system theory. Its basic idea is to represent the
factors as sequence curves and to obtain the degree of correlation
between the factors from the similarity of the geometric shapes of
the curves (Gao et al. 2013). The closer the shapes of the curves
are, the greater the correlation between the corresponding sequences.

The calculation steps of grey correlation degree are as follows
(Liang & Zhang 2009):


Step 1: $X_0 = \{a_0(k) \mid k = 1, 2, \ldots, n\}$ is taken as the
reference sequence, and $X_i = \{a_i(k) \mid k = 1, 2, \ldots, n\}$
are the comparative sequences, where $i = 1, 2, \ldots, m$; that is,
there are m sequences and n attributes.

Step 2: Calculate the correlation coefficient
$\gamma(a_0(k), a_i(k))$ of each attribute relative to the reference
sequence.

\[
\gamma(a_0(k), a_i(k)) =
\frac{\min_{i \in m} \min_{k \in n} |a_0(k) - a_i(k)| + \rho \max_{i \in m} \max_{k \in n} |a_0(k) - a_i(k)|}
     {|a_0(k) - a_i(k)| + \rho \max_{i \in m} \max_{k \in n} |a_0(k) - a_i(k)|}
\tag{1}
\]

where $\rho$ is the resolution coefficient; the smaller its value,
the greater the differentiation. Generally, $\rho \in (0, 1)$, and
when $\rho < 0.5463$ the discrimination performance is the best. Here
$\rho$ is taken as 0.5.

Step 3: Calculate the grey relational grade.

\[
\gamma(X_0, X_i) = \frac{1}{n} \sum_{k=1}^{n} \gamma(a_0(k), a_i(k))
\tag{2}
\]

In a diesel engine wear fault decision table, if 
each row is considered as a factor leading to die- 
sel engine fault, then the grey correlation grade 
of these factors can be sorted from large to small, 
eliminating some invalid, low correlation and 
redundant data. 
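The calculation in Steps 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name, the toy sequences and the absence of any prior normalization of the sequences are assumptions made here, while the resolution coefficient 0.5 and the 0.85 cut-off are the values quoted in the paper.

```python
def grey_relational_grades(reference, comparisons, rho=0.5):
    """Grey relational grade of each comparison sequence (Equations 1 and 2)."""
    # |a_0(k) - a_i(k)| for every comparison sequence i and attribute k.
    deltas = [[abs(r - c) for r, c in zip(reference, seq)] for seq in comparisons]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0.0:
        # Every comparison sequence coincides with the reference.
        return [1.0] * len(comparisons)
    grades = []
    for row in deltas:
        # Correlation coefficient per attribute, then its mean over attributes.
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades.append(sum(coeffs) / len(coeffs))
    return grades


# Hypothetical toy data: one reference sequence and three comparison sequences.
reference = [1.0, 2.0, 3.0]
comparisons = [[1.0, 2.0, 3.0], [1.5, 2.5, 3.5], [3.0, 1.0, 5.0]]
grades = grey_relational_grades(reference, comparisons)
# Rows whose grade reaches the 0.85 cut-off used later in Section 4.1 are kept.
kept = [i for i, g in enumerate(grades) if g >= 0.85]
```

With these toy values only the first sequence, which is identical to the reference, passes the cut-off; the other two would be treated as low-correlation rows and removed.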


2.2 Attribute reduction based on 
information entropy 


2.2.1 Rough set theory 
Rough set theory is based on an information system IS = (U, A, F), where $U = \{X_1, X_2, \ldots, X_i, \ldots, X_m\}$ is called the universe and $A = \{a_1, a_2, \ldots, a_j, \ldots, a_n\}$ denotes the attribute set. $F = \{f_k : U \to V_k\ (k \le n)\}$ is the set of relations between U and A, and $V_k$ is the range of $a_k$ (Jia et al. 2016).

Definition 1 Let IS = (U, A, F) be the information system of a diesel engine, and let $V_d$ be the range of the fault state D. The mapping from the universe U to the fault state is denoted as $d: U \to V_d$; then DIS = (U, A, F, d) is called a fault diagnosis information system.

$$R_A = \{(X_i, X_j) \mid f_k(X_i) = f_k(X_j),\ \forall a_k \in A\} \quad (3)$$

$$R_d = \{(X_i, X_j) \mid d(X_i) = d(X_j)\} \quad (4)$$

If $R_A \subseteq R_d$, DIS is called a coordinated (consistent) diagnosis information system. Otherwise, it is called an inconsistent diagnostic information system.

For inconsistent diagnostic information systems, 
the processing is no longer a simple inclusion rela- 
tion, but the inclusion degree between sets (Zheng 
et al. 2014). The concept of inclusion degree is 
introduced as follows. 


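For $B \subseteq A$,

$$[X_i]_B = \{X_j \mid (X_i, X_j) \in R_B\} \quad (5)$$

$$R_B = \{(X_i, X_j) \mid f_k(X_i) = f_k(X_j),\ \forall a_k \in B\} \quad (6)$$

$$U/R_d = \{[X_i]_d \mid X_i \in U\} = \{D_1, D_2, \ldots, D_r\} \quad (7)$$

For $D_j \in U/R_d$ $(j = 1, 2, \ldots, r)$, the inclusion degree is:

$$D(D_j / [X_i]_B) = \frac{|D_j \cap [X_i]_B|}{|[X_i]_B|} \quad (8)$$

In the formula, $|X|$ denotes the cardinality of the set X.

In order to enhance the anti-interference ability of the rough set model, Ziarko proposed the variable precision rough set model, which introduces the classification accuracy $\beta$ $(0.5 < \beta \le 1)$.

Definition 2 Let DIS be an inconsistent diagnostic information system, $B \subseteq A$, and precision threshold $\beta \in (0.5, 1]$. For $X \subseteq U$,

$$\underline{R}_B^{\beta}(X) = \{X_i \mid D(X / [X_i]_B) \ge \beta\} = \bigcup \{[X_i]_B \mid D(X / [X_i]_B) \ge \beta\} \quad (9)$$

$\underline{R}_B^{\beta}(X)$ is called the $\beta$ lower approximation of X.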

Definition 3 Let DIS be an inconsistent diagnostic information system, $B \subseteq A$, and precision threshold $\beta \in (0.5, 1]$. The $\beta$ approximate dependence of the fault state D on the parameter set B is defined as:

$$\eta(B, D, \beta) = \frac{\left| \bigcup_{j=1}^{r} \underline{R}_B^{\beta}(D_j) \right|}{|U|} \quad (10)$$

For inconsistent diagnostic information systems, the classification accuracy $\beta$ is selected first, and the corresponding approximate dependence is calculated. If the evaluation index meets the requirements, the system reduction can proceed.
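To make the role of the precision threshold concrete, the sketch below computes the inclusion degree of Equation (8) for every equivalence class and returns the approximate dependence of Equation (10) as the share of objects that fall into some lower approximation. The function names, the dictionary-based row layout and the seven-object toy table are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group object indices into equivalence classes on the given attributes."""
    classes = defaultdict(set)
    for idx, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].add(idx)
    return list(classes.values())


def beta_dependence(rows, decisions, attrs, beta):
    """Share of objects lying in the beta lower approximation of a decision class."""
    decision_classes = defaultdict(set)
    for idx, d in enumerate(decisions):
        decision_classes[d].add(idx)
    covered = set()
    for block in partition(rows, attrs):
        for d_class in decision_classes.values():
            inclusion = len(block & d_class) / len(block)  # Equation (8)
            if inclusion >= beta:
                covered |= block
    return len(covered) / len(rows)


# Hypothetical discretized table: one condition attribute, seven objects.
rows = [{"Fe": 0}] * 5 + [{"Fe": 1}] * 2
decisions = ["well", "well", "well", "well", "fault", "fault", "fault"]
strict = beta_dependence(rows, decisions, ["Fe"], beta=1.0)
relaxed = beta_dependence(rows, decisions, ["Fe"], beta=0.8)
```

In this toy table the class Fe = 0 contains one conflicting object, so it is excluded at a threshold of 1.0 but accepted at 0.8, where four of its five objects agree. This is the tolerance to noise that the variable precision model is meant to provide.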


2.2.2 Information entropy and conditional entropy 
For diesel engine fault diagnosis, the purpose of 
attribute reduction is to use fewer parameters to 
obtain the same diagnosis effect as many param- 
eters, and improve the diagnosis efficiency. Infor- 
mation entropy can measure the uncertainty of 
knowledge, and reveals that the roughness of 
knowledge is essentially the description of the 
information contained in it (Li et al. 2012). 

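Definition 4 For DIS, let $U/R_B = \{X_1, X_2, \ldots, X_s\}$ be the partition of U induced by the parameter set B. The information entropy of B is:

$$H(B) = -\sum_{i=1}^{s} \frac{|X_i|}{|U|} \log \frac{|X_i|}{|U|} \quad (11)$$

Definition 5 The conditional entropy of the fault state D relative to the parameter set B is:

$$H(D \mid B) = -\sum_{i=1}^{s} \frac{|X_i|}{|U|} \sum_{j=1}^{r} \frac{|X_i \cap D_j|}{|X_i|} \log \frac{|X_i \cap D_j|}{|X_i|} \quad (12)$$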
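A small numerical example may help to read the entropy definitions in Equations (11) and (12). The sketch below assumes base-2 logarithms (the base is not recoverable from the text) and uses a hypothetical four-object table; the function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels (form of Equation 11)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(decisions, conditions):
    """H(D | B): entropy of D inside each B-class, weighted by class size (Equation 12)."""
    groups = {}
    for d, b in zip(decisions, conditions):
        groups.setdefault(b, []).append(d)
    n = len(decisions)
    return sum(len(g) / n * entropy(g) for g in groups.values())


# Hypothetical table: B splits four objects into two classes of two.
conditions = [0, 0, 1, 1]
decisions = ["well", "well", "well", "fault"]
h_d = entropy(decisions)
h_d_given_b = conditional_entropy(decisions, conditions)
```

Here the first B-class is pure and contributes no uncertainty, while the second is evenly split and contributes one bit at weight 1/2, so knowing B lowers the uncertainty about D from about 0.811 bits to 0.5 bits.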


2.2.3 Heuristic attribute reduction algorithm 
Attribute reduction is one of the key problems in rough set theory. Searching for all reductions or for an optimal reduction has been proved to be NP-hard. Therefore, heuristic algorithms are usually used to search for an optimal or near-optimal reduction (Zhang et al. 2009).

Definition 6 The mutual information of the fault state D and the parameter set B is defined as:

$$I(D; B) = H(D) - H(D \mid B) \quad (13)$$

Mutual information measures the amount of information about the fault state D that is obtained from the parameter set B.

Definition 7 The importance of any parameter $a \in A - B$ relative to the fault state D is defined as:

$$SGF(a, B, D) = H(D \mid B) - H(D \mid B \cup \{a\}) \quad (14)$$

The attribute importance measures the amount of information about the fault state D contributed by the parameter a when the parameter set B is already known.

In order to reduce the parameter redundancy in the longitudinal dimension of the fault decision table, a forward heuristic search algorithm is proposed for attribute reduction. According to rough set theory, the relative core of any decision information system is unique. Therefore, the relative core can be used as the starting point for searching for the optimal reduction (Tian et al. 2014). The algorithm steps are as follows:


Step 1: Calculate the mutual information $I(D; A)$ between the attribute set A and the fault state D in DIS.

Step 2: Calculate the core C of the attribute set A relative to the fault state D. If the relative core is empty, $C = \emptyset$, then $I(D; C) = 0$.

Step 3: Let B = C and repeat 1) to 3) for the parameter subset A − B.


1. If $I(D; B) = I(D; A)$, go to Step 4.

2. For each parameter $a \in A - B$, calculate the importance $SGF(a, B, D)$ of the attribute.

3. Select the attribute with the greatest importance, denoted as p, and let $B := B \cup \{p\}$.


Step 4: Output the reduced parameter set B.
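The four steps above can be sketched as a greedy search. Because I(D; B) = I(D; A) is equivalent to H(D | B) = H(D | A), the sketch compares conditional entropies directly. It is a minimal illustration under stated assumptions: base-2 logarithms, plain (not variable-precision) equivalence classes, a core defined as the attributes whose removal increases H(D | A), and a hypothetical toy table.

```python
import math
from collections import Counter, defaultdict


def cond_entropy(rows, decisions, attrs):
    """H(D | attrs) in bits; with an empty attribute list this is H(D)."""
    groups = defaultdict(list)
    for row, d in zip(rows, decisions):
        groups[tuple(row[a] for a in attrs)].append(d)
    n = len(rows)
    total = 0.0
    for labels in groups.values():
        for count in Counter(labels).values():
            p = count / len(labels)
            total -= len(labels) / n * p * math.log2(p)
    return total


def reduce_attributes(rows, decisions, attrs, tol=1e-12):
    """Forward heuristic reduction starting from the relative core."""
    target = cond_entropy(rows, decisions, attrs)  # H(D | A), Step 1
    # Step 2: attributes whose removal loses information about D form the core.
    selected = [a for a in attrs
                if cond_entropy(rows, decisions, [x for x in attrs if x != a]) > target + tol]
    # Step 3: add the most important remaining attribute until nothing is lost.
    while cond_entropy(rows, decisions, selected) > target + tol:
        current = cond_entropy(rows, decisions, selected)
        rest = [a for a in attrs if a not in selected]
        # Importance SGF(a, B, D) = H(D | B) - H(D | B + {a}), Equation (14).
        best = max(rest, key=lambda a: current - cond_entropy(rows, decisions, selected + [a]))
        selected.append(best)
    return selected  # Step 4


# Hypothetical discretized table in which Fe alone determines the state.
rows = [{"Fe": 0, "Cu": 0, "Pb": 0}, {"Fe": 1, "Cu": 0, "Pb": 0},
        {"Fe": 0, "Cu": 1, "Pb": 1}, {"Fe": 1, "Cu": 1, "Pb": 1}]
decisions = ["well", "fault", "well", "fault"]
reduct = reduce_attributes(rows, decisions, ["Fe", "Cu", "Pb"])
```

In this toy table removing Fe makes two objects with different states indistinguishable, so Fe belongs to the core, and the core alone already preserves all the information about D.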


3 FAULT DIAGNOSIS MODEL OF 
DIESEL ENGINE WEAR BASED ON 
SOM NETWORK 


3.1 Self-organizing map network 


The SOM network is an unsupervised, self-organizing neural network composed of fully connected neuron arrays. The typical SOM network structure is shown in Figure 1; it consists of an input layer and a competition layer (also known as the output layer). The number of neurons in the input layer is n, and the competition layer is a two-dimensional plane array composed of a × b neurons. Each input neuron is connected to all the neurons in the two-dimensional plane array (Xu et al. 2014). Training the SOM network consists of continuously adjusting the weights of the network nodes so that different input types correspond to different neurons in the two-dimensional plane array.


Figure 1. Network model of SOM (input layer and competition layer).
The learning steps of the SOM network are as
follows (Zhang et al. 2017): 


Step 1: Network initialization. Random numbers are assigned as the initial connection weights between the input neurons and the output neurons.

Step 2: Accept input samples. The matrix $U = (X_1, X_2, \ldots, X_m)$ composed of the sample characteristic parameters is input to the SOM network.

Step 3: Calculate the Euclidean distance between the input vector and each connection weight vector. The neuron with the smallest Euclidean distance is determined and denoted as the winning neuron O.

The Euclidean distance between the input vector $X = (x_1, \ldots, x_n)$ and the j-th neuron in the mapping layer is

$$d_j = \lVert X - \omega_j \rVert = \sqrt{\sum_{i=1}^{n} (x_i - \omega_{ij})^2} \quad (15)$$

where $\omega_{ij}$ is the weight between the i-th neuron of the input layer and the j-th neuron of the mapping layer.

Step 4: Weight adjustment. According to formula (16), the weights of the winning neuron O and its neighboring neurons are corrected.

$$\Delta \omega_{ij} = \eta\, h(j, j^*)\, (x_i - \omega_{ij}) \quad (16)$$

where $\eta$ is a constant in the range (0, 1), $j^*$ is the index of the winning neuron, and

$$h(j, j^*) = \exp\left( -\frac{|j - j^*|^2}{\sigma^2} \right) \quad (17)$$

In the formula, $\sigma$ decreases as learning proceeds, so the range of $h(j, j^*)$ goes from wide to narrow and the weight adjustment changes from coarse to fine, which ensures the accuracy of classification.

Step 5: Stopping criterion. Determine whether the expected requirements have been met. If so, end; otherwise, go back to Step 2 and proceed to the next round of learning.
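The learning steps can be sketched in plain Python as follows. This is an illustrative sketch, not the Matlab toolbox routine used in the paper: the linear decay of the neighbourhood width, the fixed number of epochs as stopping criterion, the squared grid distance standing in for |j − j*|², and all names and sample values are assumptions.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x (Equation 15)."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))


def train_som(samples, rows, cols, epochs=200, eta=0.5, seed=0):
    """Train a rows x cols competition layer on the given samples."""
    rng = random.Random(seed)
    dim = len(samples[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Step 1: random initial weights, one vector per competition-layer neuron.
    weights = [[rng.random() for _ in range(dim)] for _ in grid]
    sigma0 = max(rows, cols) / 2
    for epoch in range(epochs):
        # The neighbourhood width shrinks, so updates go from coarse to fine.
        sigma = sigma0 * (1.0 - epoch / epochs) + 0.01
        for x in samples:                                   # Step 2
            win = best_matching_unit(weights, x)            # Step 3
            for j, w in enumerate(weights):                 # Step 4
                d2 = (grid[j][0] - grid[win][0]) ** 2 + (grid[j][1] - grid[win][1]) ** 2
                h = math.exp(-d2 / sigma ** 2)              # Equation (17)
                for i in range(dim):
                    w[i] += eta * h * (x[i] - w[i])         # Equation (16)
    return weights                                          # Step 5: fixed epoch budget


# Hypothetical normalized samples forming two loose groups.
samples = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
weights = train_som(samples, rows=2, cols=2, epochs=50)
winners = [best_matching_unit(weights, x) for x in samples]
```

After training, `winners` gives the competition-layer neuron excited by each sample; in a diagnosis setting, neurons would then be labelled with the wear state of the training samples they represent.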


3.2 The process of diesel engine wear fault 
diagnosis model 


By using grey relational analysis and rough set the- 
ory, the data of horizontal and vertical dimensions 
in a fault decision table are reduced. Therefore, 
the validity of the input data to SOM network is 
greatly improved, and the classification accuracy 
of the network is improved. The diesel engine wear 
fault diagnosis model based on Grey rough set and 
SOM network is shown in Figure 2. 


The concrete steps are as follows: 


Step 1: Initial fault decision table. The Ferrography and spectral analysis data from the diesel engine oil monitoring information, together with multiple groups of sample data (the universe), form the initial fault decision table.

Step 2: The grey correlation grade of each row of data is calculated, and the data reduction of the horizontal dimension is carried out.

Step 3: Discretization of continuous attributes and data reduction in the longitudinal dimension.

Step 4: Before the rough set analysis, the continuous attributes remaining after horizontal reduction are discretized by equal frequency binning. Then, the attribute set is reduced by using information entropy.


[Flowchart: Initial Fault Decision Table → Calculate the Grey Correlation Degree of Each Row Data → Reduction of Row Data (horizontal reduction) → Discretization of Continuous Attributes → Reduction of Conditional Attributes → Determination of SOM Network Model → Analysis of Output Results]
Step 5: Determination of SOM network model. The number of neurons in the input layer equals the number of attributes in the reduced set B.

Step 6: Analysis of fault diagnosis results. The evaluation index of the SOM network is the classification accuracy, defined in formula (18).

$$R(\%) = \frac{y_c}{y_m} \times 100\% \quad (18)$$

where $y_c$ is the number of correctly classified samples and $y_m$ is the total number of samples.


4 CASE EXPERIMENT 


The oil spectrum and Ferrography data used in this paper come from a marine diesel engine with a rated power of 1760 kW and an engine speed of 1800 r/min. There are 92 sets of oil monitoring data. Samples were taken at irregular intervals from 2013 to 2015 from the main oil duct of the engine, and the oil is changed regularly. The oil spectral analysis covers 8 elements (unit: ppm): Cu, Fe, Cr, Ba, Zn, Si, Al and Pb. The oil Ferrography data are derived from Quantitative Ferrography Analysis and contain the concentration of large wear debris Dl (> 5 μm) and small wear debris Ds (1-2 μm) in the oil samples. To better characterize the wear condition of the diesel engine, we introduce the total wear $Dls = Dl + Ds$ and the wear severity index $IS = Dl^2 - Ds^2$. The data set (fault decision table) is shown in Table 2. The fault states include 3 types, namely well condition (50 samples), degradation (21 samples) and fault (21 samples). The judgment of the condition of the diesel engine is given by experts.


Figure 2. Diesel engine wear fault diagnosis model based on Grey rough set and SOM network.


Table 2. Oil monitoring data set.

U Cu Fe Cr Ba Zn Si Al Pb Dl Ds Dls IS Wear-out status
X1 0.9 7.5 0.4 0.9 28 11.0 19 1.2 28.4 11.6 40 672 Well condition
X2 5.6 8 1.5 0.7 3.1 98 1.4 1.3 32.1 16.6 48.7 754.9 Well condition
X3 0.5 9.6 0.4 16 2.5 159 2 0.5 38.3 15.3 53.6 1171.1 Well condition
X50 0 3.9 0.5 0.6 0.3 21.1 1.8 0 33.5 20.9 54.4 685.44 Well condition
X51 13.2 298 333) 05 04 40.6 23.9 5.8 96.9 88.8 185.7 1504.17 degradation
X52 24.8 296.7 2.4 0.7 29 304 14 8.2 143.5 106.5 250 9250 degradation
X71 0.8 489 55.4 04 3.5 406 1.6 2.9 168.3 142.7 311 7961.6 degradation
X72 19 221.5 0.3 0.9 25 343 171 1.8 98.5 132.7 231.2 -7907.04 fault
X73 13 186.3 0 1.0 0.4 40.6 15.8 13 69.2 96.1 165.3 -4446.57 fault
X91 76.2 100.6 2.1 0.5 1.5 39.1 17.8 24.3 96.9 88.8 185.7 1504.17 fault
X92 100.6 78.6 1.6 0.4 3.0 30.3 29.6 36.4 96.9 62.3 159.2 5508.32 fault
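The two derived Ferrography indices can be checked against Table 2 with a short calculation. The sketch below reads the wear severity index as IS = Dl² − Ds², a reading that reproduces the Dls and IS columns for the two rows used here (X1 and X72); the function name is illustrative.

```python
def wear_indices(dl, ds):
    """Return (total wear Dls, wear severity index IS) from large and small debris readings."""
    total_wear = dl + ds            # Dls = Dl + Ds
    severity = dl ** 2 - ds ** 2    # IS = Dl^2 - Ds^2
    return total_wear, severity


# Dl and Ds readings of samples X1 and X72 as listed in Table 2.
x1 = wear_indices(28.4, 11.6)
x72 = wear_indices(98.5, 132.7)
print("X1 ", round(x1[0], 2), round(x1[1], 2))
print("X72", round(x72[0], 2), round(x72[1], 2))
```

A negative IS, as for X72, means that small debris dominates large debris in that sample, since IS factors as (Dl + Ds)(Dl − Ds).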


4.1 Grey relational analysis 


The sample ID represents the monitoring cycle of the marine diesel engine oil. Therefore, the oil sample in which a fault first appears is selected as the reference sequence. Sequences whose grey relational grade is less than 0.85 are considered unable to reflect the wear fault state of the diesel engine effectively, so these invalid, redundant data samples are deleted. In this study, the samples with ID 2, 14, 17, 29, 43, 55, 63, 67, 71, 76, 80, 82, 85 and 90 were excluded from the fault decision table.


4.2 Attribute reduction 


First, equal frequency binning is used to discretize the data remaining after horizontal reduction. Since $R_A \not\subseteq R_d$, the fault decision table shown in Table 2 is an inconsistent diagnostic information system. Moreover, the classification performance of the system differs for different values of the classification accuracy $\beta$. We select different values of $\beta$, and the resulting classification performance is shown in Table 3.

Since $\beta = 0.8$ gives the highest approximate dependence in Table 3, this paper chooses the precision threshold $\beta = 0.8$ and uses the heuristic algorithm of Section 2.2.3 to reduce the attributes. The final attribute reduction result is B = {Fe, Al, Cu, Pb, Dl, IS}.
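Equal frequency binning, the discretization used before the rough set analysis, can be sketched as follows. The number of bins is not stated in the paper, so three bins are assumed here purely for illustration, and the input is a handful of Fe readings taken from Table 2.

```python
def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so that the bins hold (almost) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        # The rank within the sorted order decides the bin, not the value itself.
        labels[idx] = rank * n_bins // len(values)
    return labels


# Fe readings (ppm) of samples X1, X2, X3, X50, X51, X52, X71, X73 and X91.
fe = [7.5, 8.0, 9.6, 3.9, 298.0, 296.7, 489.0, 186.3, 100.6]
fe_levels = equal_frequency_bins(fe, n_bins=3)
```

Each bin receives three of the nine readings regardless of how unevenly the raw values are spread. One design caveat of this rank-based variant is that tied values may land in different bins, so a production version would need an explicit tie rule.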


4.3 Establishment of SOM network model 


The parameters of the attribute set B are normal- 
ized and then used as input to the SOM network. 
The competition layer of SOM network is set to 
6 x 6 = 36 neurons, and the number of training 
iterations is defined as 200. There are 78 groups of 
samples after horizontal data reduction. 45 groups 
were selected as training sets (including 25 well 
condition, 10 degradation and 10 fault), and the 
remaining 33 (including 20 well condition, 7 deg- 
radation and 6 fault) were the test sets. 

The Matlab toolbox was used to establish the SOM network structure, and the results are shown in Figure 3, Figure 4 and Table 4. In Figure 3, the hexagons with numbers represent the winning neurons, and each number gives the number of training samples represented by that neuron. The hexagons with numbers in Figure 4 represent neurons, and the straight lines between them represent the connections between neurons. The distance between neurons was calculated with the Euclidean distance formula and is indicated by the background color of the hexagons joining the connected neurons. The darker the color, the greater the distance between the neurons, that is, the greater the difference between the two neurons.


Table 3. Comparison table of classification performance for the samples.

β                                       1.0      0.9      0.8      0.7
Approximate dependence η(A, D, β)       0.1563   0.5721   1.000    0.8109

Table 4 illustrates the neurons excited by each 
wear state. The distribution of neurons in the 
whole competition layer can be explained by 
Table 4 and Figure 4. 


4.4 Results analysis 


The diagnosis results of diesel engine wear fault 
based on SOM network are shown in Table 5. As 


Figure 3. Winning neuron map.


Figure 4. Distance between the neurons (SOM neighbor weight distances).
Table 4. Competitive layer neurons corresponding to wear states.

Wear-out status    Serial number    Index of excited neurons
Well condition     1                23, 30, 36, 24, 18, 12, 6, 11, 29, 35, 29
Degradation        2                3, 13, 2, 1, 14, 7, 31, 25
Fault              3                4, 9, 10, 32, 33, 28, 21, 3, 16

Table 5. Diagnostic result statistics of SOM network model.

Sample type        Classification accuracy
Well condition     100%
Degradation        85.71%
Fault              100%
Total              96.97%

Table 6. Classification accuracy of different models. 


Model Total classification accuracy 
GRS-SOM 96.97% 
RS-SOM 87.88% 
PCA-SOM 75.76% 
GRPCA-SOM 78.79% 


shown in Table 5, the accuracy of the whole test set 
is 96.97%, and the diagnosis process is feasible and 
the results are satisfactory. 
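As a consistency check, the percentages in Table 5 can be reproduced with formula (18) from the test set composition given in Section 4.3 (20 well condition, 7 degradation and 6 fault samples). The counts of correctly classified samples below are inferred from the reported percentages and are not stated by the authors.

```latex
R_{\text{degradation}} = \frac{6}{7} \times 100\% \approx 85.71\%,
\qquad
R_{\text{total}} = \frac{20 + 6 + 6}{33} \times 100\% = \frac{32}{33} \times 100\% \approx 96.97\%
```

Under this reading, a single misclassified degradation sample accounts for the whole gap between 96.97% and 100%.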

To illustrate the advantages of the model, we 
compare it with Rough set-SOM (RS-SOM) 
(without horizontal data reduction), Principal 
component analysis-SOM (PCA-SOM), and Grey 
relational principal component analysis-SOM 
(GRPCA-SOM) models. Using the same training 
set and test set, the results are shown in Table 6. 

Compared with RS-SOM, PCA-SOM and 
GRPCA-SOM methods, GRS-SOM has higher 
classification accuracy. Therefore, it is feasible to 
use the SOM network model based on Grey rough 
set to diagnose the wear fault of diesel engine. In 
addition, it can solve the problem that the wear 
state of diesel engine is difficult to identify. 


5 CONCLUSION 


This paper presents a fault diagnosis method for 
marine diesel engine wear based on Grey rough set 
and SOM neural network. The horizontal and lon- 
gitudinal dimensions of the two-dimensional fault 
decision table are reduced by using grey correlation 


analysis and rough set reduction theory respectively. 
The reduced data is used as the input of the SOM 
network, and the fault diagnosis model is established 
to identify the wear state of the marine diesel engine. 

Grey relational analysis and rough set theory are used to reduce the input data, removing redundant samples and attributes. This simplifies the network and improves the accuracy of fault diagnosis.

Compared with the RS-SOM, PCA-SOM and GRPCA-SOM models, the fault diagnosis model based on Grey rough set and SOM neural network achieves a smaller error and higher classification accuracy on the test set.
ABSTRACT: This paper describes and proposes some indicators for continuous monitoring of anomalous conditions in the hydraulic system of a Kaplan turbine using SCADA data. The indicators are based on significant deviations between the estimated values for key variables describing the current working conditions of the components at the plant, and those actually observed. This monitoring strategy requires models describing the expected values for variables through the whole range of possible working conditions of the monitored components. These models are normal behavior models able to characterize the typical relationships between a set of variables used as inputs to the models and the corresponding output of a target variable whose expected value has to be predicted. The criteria to select the variables to use in the models are based on the physical working principles of the component. The paper is focused on models of normal behavior applied to a real case of condition monitoring of a Kaplan turbine regulating mechanism.


1 INTRODUCTION 


Hydropower is the leading renewable global source 
for electricity generation supplying 71% of all 
renewable electricity and reaching 1,064 GW of 
installed capacity in 2016 (WEC, 2017). It gen- 
erated 16.4% of the electricity produced in the 
world from all sources. Hydropower is the most 
flexible and consistent of all the renewable energy 
resources, capable of meeting base load electric- 
ity requirements, as well as with pumped storage 
technology, meeting peak and unexpected demand 
due to shortages or the use of intermittent power 
sources. Also, hydroelectricity is a source of electri- 
cal energy coming from water that is clean and safe. 

A large amount of data is logged in the SCADA (Supervisory Control And Data Acquisition) systems of hydropower plants, but in Norway and Sweden SCADA data are currently little used beyond plant control, for purposes such as condition monitoring and maintenance planning. Thus, there is large potential for using SCADA data for these new purposes. This may contribute to increased availability and energy production by preventing failures and shutdowns.


The identification of possible failure modes in a hydropower plant (Topliceanu, 2016) is one of the key steps in determining how failures can be detected at an early stage. The analysis of the causes and effects of these failure modes can suggest which variables are useful for the detection of abnormal behaviors or anomalies (Chandola, 2009). The scientific literature proposes several methods for anomaly detection and, more generally, fault detection in industrial processes (Garcia Matyos, 2013) based on the values of variables measured in real time. One active research area in hydropower plants is vibration analysis focused on key components (Mohanta, 2017). The health condition of components observed through several types of measurements is the goal of other studies, such as (Jamil, 2013) and (Selak, 2014).

In this paper, the hydraulic system of a Kaplan turbine was identified as the target of analysis, in particular the detection of a possible oil leakage in the system. This analysis is part of the results obtained in the research project MonitorX, “Optimal utilization of hydropower asset lifetime by monitoring of technical condition and risk”. MonitorX is a joint industry project initiated and led by Energi Norge (Energy Norway, the Norwegian electricity industry association) in cooperation with Energiforsk (the Swedish Energy
Research Centre), more than 20 Norwegian and 
Swedish power companies, a number of equip- 
ment manufacturers and service providers, and the 
research institutions Comillas Pontifical Univer- 
sity, SINTEF Energy Research and the Norwegian 
University of Science and Technology as R&D 
partners. The project is financially supported by 
the Research Council of Norway. 

The aim of the MonitorX project is to develop 
models and algorithms for condition monitoring 
and the detection of faults in hydropower equip- 
ment. The main focus in the project is on models 
based on machine learning and artificial intelli- 
gence. The project is case-driven, and several rel- 
evant cases have been identified in the beginning 
of the project, whereof the case related to monitor- 
ing of the Kaplan turbine regulating mechanism 
and corresponding hydraulic system was consid- 
ered as relevant for further work. Since several 
components and parts of the system are difficult 
to inspect, models that can be used to monitor the 
system condition and detect failures are valuable. 
Furthermore, oil leakage from the hydraulic sys- 
tem may cause environmental damage, especially 
when oil leaks into the river. 

Usually, no separate condition monitoring systems are installed in power stations to monitor the condition of the regulating mechanism. The data normally available come from the SCADA system of the plant, which usually provides one-hour average values. Thus, one of the aims of the presented case is to study whether this type of data is useful for modelling the normal behavior of hydropower components and for detecting, with these models, anomalies that are related to faults.

The paper is organized in the following sections. 
Section 2 describes the method and objectives 
used for the creation of normal behavior models 
and detection of anomalies. Section 3 presents a 
description of the hydraulic system of the hydrau- 
lic power plant analyzed. Section 4 includes the 
description and development of normal behavior 
models used as references for detection of anoma- 
lies. Section 5 presents several cases about how the 
normal behavior models can be used as reference 
patterns for the detection of anomalies. Finally, 
section 6 summarizes some conclusions of the 
analysis developed throughout the paper. 


2 METHOD AND OBJECTIVES 


This section describes the main steps of the process to
build anomaly indicators for detection of abnormal 


behavior in some functional characteristics of 
components in a hydropower plant. These indi- 
cators are based on patterns previously obtained 
from observing the typical normal behavior of the 
monitored components. The following sequential 
steps are required in order to detect anomalies 
based on an estimation for these indicators: 


a. Selection of a data training set for learning 
the typical normal behavior of the compo- 
nent. This includes data selection and filtering, 
removing of outliers and treatment of missing 
measurements. 

b. Identification of failure modes that could be 
detected with the variables available in the 
SCADA system, and selection of variables. 
Information available in a Failure Modes and 
Effects Analysis (FMEA) may help to select rel- 
evant failure modes and variables. The variables 
will be used for the characterization of normal 
behavior patterns developed in the next step. 

c. Building of normal behavior patterns of a com- 
ponent described through variables collected in 
real-time from the hydropower plant. The cases 
studied in this paper are based on data samples 
collected every hour. The patterns were built 
using multi-layer perceptrons (Bishop, 1995), 
(Bishop, 2006). This technique is supervised 
requiring previous knowledge of behavior con- 
sidered as normal and covering all the typical 
working conditions of the plant. A good selec- 
tion of this behavior, considered as normal, 
is crucial in this method because the normal 
behavior will be learnt by the models as a refer- 
ence to watch when new information is coming 
from the power plant. 

d. Estimation of anomaly indicators. Once the previous steps are completed, the anomaly indicators can be estimated. Their objective is to warn about data collected from the hydropower plant that do not correspond to the behavior expected by the reference patterns. The evolution of the anomaly indicators over time will suggest whether or not the monitored components need attention from the point of view of scheduled maintenance and operation.
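Steps c and d can be illustrated with a residual-based indicator: a tolerance band is learned from the prediction errors on normal training data, and new observations falling outside the band are flagged. This is only a sketch of the general idea. The band of three standard deviations, the function names and the linear stand-in for the trained multi-layer perceptron are assumptions made here, not details given in the paper.

```python
import statistics


def fit_residual_band(model, inputs, observed, k=3.0):
    """Learn a tolerance band from residuals on normal-behaviour training data."""
    residuals = [y - model(x) for x, y in zip(inputs, observed)]
    mean = statistics.fmean(residuals)
    std = statistics.pstdev(residuals)
    return mean - k * std, mean + k * std


def anomaly_flags(model, inputs, observed, band):
    """Flag observations whose residual leaves the band learned in training."""
    low, high = band
    return [not (low <= y - model(x) <= high) for x, y in zip(inputs, observed)]


def model(x):
    """Hypothetical stand-in for a trained normal behaviour model."""
    return 2.0 * x


# Hypothetical hourly values: training period, then two new observations.
band = fit_residual_band(model, [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.1, 7.9])
flags = anomaly_flags(model, [2.0, 3.0], [4.05, 7.5], band)
```

The first new observation stays within the band and the second does not, so only the second is flagged; in practice the persistence of such flags over time, rather than a single exceedance, would be the signal to investigate.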


Sections 4 and 5 will describe details about each 
of the previous steps with examples demonstrating 
their use. 


3 SYSTEM ANALYSED 


The cases analyzed in the paper are from Embret- 
sfoss 4, which is a hydropower plant using a Kap- 
lan turbine for the production of electric energy. 
The Kaplan turbine is a propeller type turbine 
controlled by the operation of the turbine runner
blades (turbine blades) and the wicket gates (guide 
vanes). See illustration in Figure 1. A Kaplan tur- 
bine is a typical run-of-river turbine, which can be 
operated at different flows and at varying head. 
For each head and flow, there is a given ideal com- 
bination of the wicket gate and runner blade posi- 
tion to ensure the best efficiency of the turbine. 

A turbine regulator controls and operates the 
turbine. Based on information about head and 
flow it uses predefined combination curves for the 
runner and wicket gate. The regulator controls 
the turbine by adjusting the blade and wicket gate 
positions with a correlated movement between the 
two. The acting mechanism for the wicket gates and runner blades is based on high-pressure hydraulics, where an HPU (high-pressure unit) and an accumulator bank provide high-pressure oil for actuation of the hydraulic servomotors.


3.1 The high-pressure hydraulic system 


The turbine regulator controls the wicket gates and the runner blades by the use of a high-pressure hydraulic system, which consists of the following main components:


— Turbine governor oil sump tank with oil pumps 
(HPU — High Pressure Unit) 

— Pressure accumulator banks. One bank for run- 
ner blades and one bank for the wicket gates 

— Hydraulic oil cooling/heating system 

— Wicket gate control system 

— Runner blade control system 


— Quick stop/Emergency stop system

— Oil system for runner hub. The runner hub is the lowermost part of the runner, the cone just below the runner blades. See Figure 1.


Figure 1. Illustration of the Kaplan turbine, showing the generator, the generator shaft and the turbine (Courtesy of Wikipedia).


For a simplified view of the high-pressure hydraulic system, see Figure 2. The HPU is located at the turbine floor and supplies the wicket gate and runner blade control systems with high-pressure oil. The main components of the HPU are the oil reservoir, the oil pumps, valves, filters and coolers. In addition to supplying oil to the control system, the HPU charges a total of five accumulator banks. The accumulator system is a safety system designed to handle a predefined number of safe shutdown cycles in case of malfunction of the HPU system or a blackout of the station. The HPU has systems for monitoring the oil level, temperature and water-in-oil content. To prevent pollution of the oil, each of the HPU pumps is equipped with a filter system.

For maintenance reasons, the oil reservoir is 
designed to be big enough for storage of all the 
oil in the system. However, during operation, the 
oil is in the different components of the hydraulic 
system hence only a limited amount of oil is con- 
tained in the reservoir. A minimum level is however 
required in the reservoir for avoiding dry running 
of the HPU pumps. 

The hydraulics system has an oil cooling (and 
heating) system. The cooling system cools the oil 
during operation and the heating system heats the 
oil during standstill. 

The wicket control system controls the wicket 
gates by the use of two hydraulic servos (cylin- 
ders). The servos actuate the control ring, which 
again provides the open/close movement on the 
wicket gates. When the control ring, seen from the 
top, turns clockwise, the wicket gates close. 

The runner blade control system controls the 
position of the runner blades by the use of a servo 
actuator located in the runner hub. The actuator 
high-pressure oil supply/return is routed through 
the center of the turbine shaft via the oil supply 
head located at the top of the shaft. 

The hydraulic system is equipped with a system for safe emergency stopping of the turbine. It can be activated manually with the emergency stop, or automatically when the turbine overspeeds and the overspeed trip valve is activated.

The turbine hub is filled with oil and has an oil 
pressure that is slightly higher than the surrounding 
water pressure. In the case of runner blade sealing 
degradation, this pressure prevents water from enter- 
ing the hub. The oil pressure in the hub is a static 
pressure created by the elevated location of the hub 
oil tank (see Figure 2). The oil in the hub oil tank is 
pumped up from the HPU oil reservoir. The hub oil 
Figure 2. Simplified view of the hydraulic system.


tank and the hub oil are not a part of the high-pres- 
sure circuit, but a leakage in the runner blade servo 
will influence the oil level in the hub oil tank and will 
eventually sound an alarm or stop signal. 


4 MODELS OF NORMAL BEHAVIOUR 


An industrial component or system can be stressed by normal operation, extraordinary operation, extreme environmental conditions, or a combination of these. Over time, these stresses, along with ageing, can shift the ranges of typical values observed in measured variables even when the component or system still fulfils its functional objectives (Sanz-Bobi, 2011). However, when a component has been stressed or overloaded over time, the risk of failure increases. For this reason, it is important to characterize the normal behavior expected of an industrial component or system when it performs its function under several typical working conditions, because any deviation from this behavior could signal an incipient failure. The sooner this is detected, the sooner the effect of a failure can be mitigated.


This section describes real examples of normal 
behavior models. These models are able to charac- 
terize the typical dynamical evolution of variables 
when the component is working under different 
operating conditions without symptoms of failure 
or stress. 

In particular, the models developed and presented
as examples in this paper are based on information
collected in real time from a hydraulic power plant
located in Norway. The models use neural networks
based on multi-layer perceptrons (Bishop, 1995;
Kruse, 2013) because this method can approximate
non-linear relationships among variables.

A basic model to characterize the normal
behavior of the hydraulic power plant can be
expressed by the function f in Equation 1:


P = f(GVP, WF, HW-TW) (1)


where:
P: Power generated by the power plant in MW
GVP: Guide Vane Position in percentage
WF: Water flow through the turbine in m³/s
HW-TW: Difference between headwater and
tailwater levels in m.

Equation 1 tries to predict the power generated
as a function of the values of the main variables
contributing to the power generation.

In order to build a normal behavior model
characterizing the function f in Equation 1, a training
set was selected covering different seasonal conditions
from January 1 to August 20, 2015. The data set
is based on hourly values for the variables considered.
The model was developed with a multi-layer
perceptron based on one hidden layer containing
20 neurons and using the Levenberg-Marquardt
algorithm for learning. The model obtained fits the
training data closely, as can be observed in Figure 3,
where the estimated values for the power generated
and the real values observed are almost identical. The
mean value of their difference (error of the trained
model) is 0.0012 MW and the standard deviation
is 0.067 MW. This error follows a narrow normal
distribution.
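
As a minimal sketch of the model structure described above, the following Python code shows the forward pass of a perceptron with three inputs and one hidden layer of 20 neurons, together with the error statistics reported in the text. The tanh activation, the random (untrained) weights, the sign convention of the error (observed minus predicted) and all function names are assumptions made for illustration; the Levenberg-Marquardt training step is not shown.

```python
import math
import random
import statistics

N_INPUTS = 3    # GVP, WF, HW-TW (the inputs of Equation 1)
N_HIDDEN = 20   # hidden layer size stated in the text


def init_weights(seed=0):
    """Random, untrained weights; a real model would be fitted to data."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1, 1) for _ in range(N_INPUTS)]
                for _ in range(N_HIDDEN)]
    b_hidden = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
    w_out = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
    b_out = rng.uniform(-1, 1)
    return w_hidden, b_hidden, w_out, b_out


def mlp_predict(x, weights):
    """Forward pass: tanh hidden layer (assumed) and a linear output."""
    w_hidden, b_hidden, w_out, b_out = weights
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


def error_stats(observed, predicted):
    """Mean and standard deviation of the error (observed - predicted)."""
    errors = [o - p for o, p in zip(observed, predicted)]
    return statistics.mean(errors), statistics.stdev(errors)
```

In practice the inputs would normally be scaled to comparable ranges before training; that step is omitted here.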

A family of models is presented next to characterize
the normal relationships that exist between the
tank oil level of the turbine regulator and variables
observed in different components of the turbine
regulator that use this oil. It is important to monitor
that the oil in the tank is at the expected level,
because a deviation from it could indicate a leakage.

The first normal behavior model of the family
that was tested is described in Equation 2 using the
function f1.


OTL = f1(P, OTT, AITR) (2)


where:

OTL: Oil tank level in the HPU in percentage

P: Power generated by the power plant in MW

OTT: Oil tank temperature in °C

AITR: Oil level in accumulator 1 of the turbine
runner.


Figure 3. Estimated value for the power generated
predicted by the normal behavior model and the real value
observed for the training set using the guide vane position,
the flow through the turbine and the difference
between headwater and tailwater level.

Equation 2 tries to predict the oil tank level in 
the HPU of the turbine regulator knowing the 
working conditions of the plant, the level of one 
oil accumulator of the turbine runner and the tem- 
perature of the tank oil. 

The model for f1 was obtained with an architecture
similar to that used for f in Equation 1. Also, the same
dates as in the previous case were used to obtain the
samples of the training set. The model obtained is
good, as can be observed in Figure 4, where the
estimated values for the oil tank level and the real
observed values are very close. The oil tank level is
measured in percentage (%). The mean value of their
difference (error of the trained model) is 0.0007% and
the standard deviation is 0.0644%. This error is
approximately normally distributed.

The hydraulic power plant studied has a second,
similar accumulator in the turbine runner, designated
accumulator 2. A normal behavior model was fitted
for it, and the results obtained were very similar to
those obtained for accumulator 1 of the turbine runner.

Figure 4. Estimated value for oil tank level in percentage
predicted by the normal behavior model and the real
value observed for the training set using as inputs the
power generated, the oil tank temperature and the oil
level in accumulator 1 of the turbine runner.


Other important components in the turbine regulator
of the hydraulic power plant are the three oil
accumulators for the guide vanes. These are very
important for the correct regulation of the hydraulic
turbine. Three models, one for each of the oil
accumulators, were developed following the form of
Equation 2. For simplicity, only one of them will be
presented. Equation 3 describes it using function f2.


OTL = f2(P, OTT, A3GV) (3)


where: 

OTL: Oil tank level in percentage 

P: Power generated by the power plant in MW 

OTT: Oil tank temperature in °C 

A3GV: Oil level in the accumulator 3 for the 
guide vanes. 

Equation 3 tries to predict the oil tank level in the 
turbine regulator knowing the working conditions 
of the plant, the level of the oil accumulator 3 for 
the guide vanes and the temperature of the tank oil. 

The model for f2 was obtained following the
same method as in the previous cases.
The main difference was that the data
used in the training set covered the period from
April 9, 2016 to October 13, 2016, because before
that period some measurements of the oil accumulators
of the guide vanes were not collected correctly.
In any case, more than half of this period
overlaps with the one used for obtaining f and
f1. The resulting model for f2 is good,
as can be observed in Figure 5, where the estimated
values for the oil tank level and the real observed
values (both in percentage) are very close. The mean
value of their difference (error of the trained model)
is 0.0022% and the standard deviation is 0.09%. This
error is approximately normally distributed.

Good results were also obtained for the two
models that are similar to the one in Equation 3,
where the variable oil level in accumulator 3
has been replaced by the oil levels in accumulators
1 and 2, respectively.


Figure 5. Estimated value for oil tank level in percentage
predicted by the normal behavior model and the real
value observed for the training set using the power
generated, the oil tank temperature and the oil level in
accumulator 3 for the guide vanes.


5 ANOMALY DETECTION BASED ON 
PATTERNS OF NORMAL BEHAVIOUR 


Once a normal behavior model has been built,
it can be used in real time with real-time values of
the required inputs. The output of the model can
then be compared with the corresponding measured
output variable. The prediction corresponds
to the expected value for normal behavior under the
current working condition. Any incipient failure will
produce a deviation between the expected value and
the measured value of the monitored variable.
This section presents how the normal behavior models
obtained in the previous section respond to new
input data collected after the training set dates.
This allows the discovery of behavior that differs
from the expected one.
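
The comparison described above can be sketched in Python as a residual check. The paper does not state a decision threshold, so the band of k standard deviations around the training error mean (with k = 3) is an assumption, as are the function name and the sign convention of the residual.

```python
def flag_anomalies(observed, predicted, train_mean, train_std, k=3.0):
    """Return indices where the residual leaves the band mean +/- k*std.

    train_mean and train_std are the error statistics of the trained
    model; k = 3 is an illustrative choice, not taken from the paper.
    """
    flags = []
    for i, (obs, pred) in enumerate(zip(observed, predicted)):
        residual = obs - pred
        if abs(residual - train_mean) > k * train_std:
            flags.append(i)
    return flags
```

For example, with the training statistics reported for the oil tank level model (mean 0.0007%, standard deviation 0.0644%), a measured level 0.8 percentage points below the prediction would be flagged, while a difference of 0.1 would not.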

Model f was used with data not contained in the
training set, covering the period from November 25,
2015 to May 31, 2017. Figure 6 shows the results
obtained by the model. The real behavior observed
is very close to the predicted one, which confirms
that the behavior observed in this new period of
time is similar to that in the training set.
No abnormal behavior was detected in the power
generation according to model f. The mean value
of the difference (error) is -0.017 MW and the
standard deviation is 0.7 MW. Both are larger in
magnitude than those obtained for the training data
set, but the prediction is still reasonable. This error
is also approximately normally distributed.

Furthermore, model f1 was used with data
not contained in the training set, covering the
period from November 25, 2015 to May 31, 2017.
Figure 7 shows the results obtained from the model.
The real behavior observed is close to the predicted
one in some cases in the central part of the figure,
and different in the rest of the period studied. This
means that the behavior observed in the training
data set differs from the one observed in the
new test data set during some periods. An abnormal
behavior was detected in the relationships between
the output and input variables of this model for
the test period. Once this was detected, it became
necessary to investigate the cause.


Figure 6. Estimated value for the power generated
predicted by the normal behavior model and the real value
observed for the testing data set using the guide vane
position, the flow through the turbine and the difference
between the headwater and tailwater levels.


Figure 7. Estimated value for oil tank level in percentage
predicted by the normal behavior model and the real
value observed for the testing data set using as inputs the
power generated, the oil tank temperature and the oil
level in accumulator 1 of the turbine runner.

Any of the variables used in model f1 could be the
cause of the abnormal behavior that breaks the
relationship it models. The power generated cannot
be the cause, because the test carried out with
model f and presented in Figure 6 confirms that no
abnormal generation of power exists. The remaining
variables are candidates to be anomalous; they are
related to the oil tank (level and temperature) and
to the level of accumulator 1 of the turbine runner.

A model similar to the one presented in Equation 2
was developed, replacing the variable AITR
(oil level in accumulator 1 of the turbine runner)
with the equivalent variable measuring the
oil level in accumulator 2 of the turbine runner.
The model obtained was very good and similar to
that presented in Figure 4. This model was checked
with data not contained in the training set, covering
the period from November 25, 2015 to May 31,
2017, as for accumulator 1 of the turbine runner.

The result is presented in Figure 8. The profiles
of the predicted and real oil tank levels are almost
the same in Figures 7 and 8. The same broken
relationship is shown between the oil tank level and the
oil level in accumulators 1 and 2 of the turbine runner.
Since it is improbable that both turbine runner
accumulators are anomalous at the same time, the
abnormal behavior observed is unlikely to originate
in them, and it is therefore advisable to closely
monitor the oil tank level.


Figure 8. Estimated value for oil tank level in percentage
predicted by the normal behavior model and the real
value observed for the testing data set using as inputs the
power generated, the oil tank temperature and the oil
level in accumulator 2 of the turbine runner.

For this purpose, model f2 was also tested with data
covering the period from October 14, 2016 to May
31, 2017. This period includes data from sample
8000 to the end of the graphs in both Figures 7
and 8. Figure 9 presents the results of applying
model f2 to this data set. The discrepancy between
predicted and real values for the oil tank level is
clear. The real level is lower than expected for the
working conditions of accumulator 3 of the guide
vanes. In fact, the difference between the real and
expected values for the oil tank level appears to
increase over time, except in the last part of the
graph in Figure 9, where the real and expected
values approach each other.
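
One way to quantify a gap that widens over time is to fit a straight line to the residuals. The Python sketch below is an illustration only: the paper describes the trend qualitatively from the figures, and the function name and the use of an ordinary least-squares slope are assumptions.

```python
def residual_slope(residuals):
    """Least-squares slope of the residuals against the sample index.

    A clearly negative slope means the measured value is falling
    further below the prediction as time passes.
    """
    n = len(residuals)
    if n < 2:
        raise ValueError("need at least two residuals")
    mean_x = (n - 1) / 2
    mean_y = sum(residuals) / n
    sxy = sum((i - mean_x) * (r - mean_y) for i, r in enumerate(residuals))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return sxy / sxx
```

With hourly samples, the slope is expressed in percentage points of tank level per hour.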

Two models similar to f2 were built and tested
over the same periods of time, replacing the
variable A3GV (oil level in accumulator 3 for the
guide vanes) with the equivalent variables related
to accumulators 1 and 2 for the guide vanes,
respectively. The results were similar.

According to the results obtained, all five models
applied for anomaly detection in the oil tank level
(three of them presented in Figures 7, 8 and 9) agree
in indicating a lower level of oil over time. This is
an indicator of a possible oil leakage in the oil tank
or surrounding locations. The accumulators are
working as expected, but the total oil level in the
tank of the HPU is decreasing. This was verified and
a leakage was discovered from the oil side to the
nitrogen side of the accumulators.

These examples demonstrate that the deviations
obtained by comparing the real value with the one
predicted by the patterns of normal behavior can be
good indicators for alerting when a typical
relationship among variables may be broken.

In this case, the unexpected decreasing level in
the oil tank must be monitored.


Figure 9. Estimated value for oil tank level in percentage
predicted by the normal behavior model and the real
value observed for the testing data set using the power
generated, the oil tank temperature and the oil level in
accumulator 3 for the guide vanes.


6 CONCLUSIONS 


This paper describes a methodology for the early 
detection of anomalous behavior conditions of 
selected Kaplan turbine components. The method 
is based on discovering behavior patterns, also 
called normal behavior models, from the observa- 
tion of the typical relationships existing between a 
set of variables used as inputs to the models and the 
corresponding output of a target variable whose 
expected value has to be predicted. The criteria to 
select the variables to use in the models are based 
on the physical working principles of the compo- 
nent in order to detect symptoms of abnormal 
behavior that can cause a possible failure mode. 
The data set used for discovering patterns of normal
behavior comes from the SCADA system of the
plant. Abnormal behavior is any significant deviation
or difference between the predicted output of
the models and its corresponding real observation.
The paper presented some examples of normal
behavior models characterizing the power generated
by the hydropower plant and the oil tank level, the
latter considered from different perspectives such as
the oil level in the bank of accumulators of the
turbine runner and in the bank of accumulators of
the guide vanes. Once the models were created, they
were applied to new examples of operation. The
generated power was always as predicted, but the oil
tank level was not. The analysis of deviations from
normal behavior described in the paper shows that
the oil levels in the accumulator banks were
consistent with their working conditions, but the oil
tank level was continuously decreasing during the
time analyzed. This suggests a need for close
monitoring of this level in order to search for the
cause of this potential leakage.

In future work, an approach based on different
algorithms working in parallel for anomaly discovery
will be tested. This is expected to further improve
the robustness of the proposed anomaly detection
method.


Current status of the MFM suite for diagnostic and prognostic
reasoning of industrial process plants


Harald P.-J. Thunem 


Institute for Energy Technology, OECD Halden Reactor Project, Norway 


ABSTRACT: This paper presents the status of a software system, the Multilevel Flow Modeling (MFM) 
Suite, dedicated to the design and analysis of MFM models related to diagnostic and prognostic analysis 
of physical processes. New and updated features of the system are described, as well as some examples 
of its practical use. The paper also briefly describes how the system facilitates the collaboration between 
control room and field operators via the Android-based MFM Viewer app. 


1 INTRODUCTION 


Multilevel Flow Modeling (MFM) (Lind, 2011) is 
a methodology for graphical modeling of industrial 
processes by representing the goals and functions of 
industrial plants. The purpose is to model the com- 
bined functions of any number of physical proc- 
ess components, which together provide the means 
to achieve one or more goals. The model may then 
be used for diagnostic and prognostic purposes to 
determine the possible causes and potential conse- 
quences of unwanted process events. 

Since MFM models consist of graphical, inter- 
connected elements arranged in specific structures, 
there is an apparent need for dedicated software 
to design them. The MFM models also need to 
be connected to the process, the functionality of 
which they represent. 

For several years such a dedicated software sys- 
tem, the MFM Suite, has been under development 
at the Institute for Energy Technology (Thunem, 
2013, Thunem and Zhang, 2015). The system will 
allow a user to graphically design and verify the 
semantic correctness of MFM models, in addi- 
tion to creating graphical models of the indus- 
trial process. It will provide associations between 
process components and MFM functions. Using a 
dedicated MFM reasoning engine developed at the 
Technical University of Denmark (DTU), the sys- 
tem will perform diagnostic and prognostic analy- 
ses of anomalous events on online process data. 

This paper provides an update of the system’s 
features, a brief description of results from applying 
it in a practical experiment, and the functionality 
and use of the Android-based MFM Viewer app. 


2 APPLICATIONS IN THE MFM SUITE 


The MFM Suite primarily consists of three 
applications: the MFM Editor for graphical 
creation, editing and verification of MFM and 
process models, the MFM Runtime for real-time 
acquisition and analysis of on-line sensor data, 
and the MFM Playback for step-wise and simu- 
lated real-time playback and analysis of sensor 
data. Since the applications share a lot of func- 
tionality, they are all based on the MFM Applica- 
tion Java class. 

The MFM Suite also includes a simple launcher
from which to start the applications.


Figure 1. Inherited functionality in MFM Suite
applications.


Figure 2. The launcher application.


3 THE EDITOR APPLICATION


The MFM Editor contains two essential graphi- 
cal modelers, the MFM modeler and the process 
modeler. As both are based on the ShapeShifter 
framework (Thunem et al, 2011, Thunem, 2012), 
they share a lot of functionality. 


3.1 Meta-model display 


Most complex processes may be in more than one 
operating state (e.g. start-up, normal, shutdown), 
and the functionality of the process components 
and the process goals may depend on the operating 
state. Therefore, it may be necessary to create sepa- 
rate MFM models to describe the component func- 
tionalities and goals for each of the operating states. 

The MFM Suite applications include a meta-level
display (Figure 3), which allows MFM models
to be connected by arrows, indicating transitions
from one operating state to another.

Note that only one MFM model may be con- 
sidered active at any point, and any MFM reason- 
ing analysis will be performed on the active model 
only. This furthermore requires that for each MFM 
model, the alarm limits for process sensors associated 
with MFM functions must be set individually, as 
their alarm limits may depend on the operating state. 
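
The rule that only one model is active at a time, each with its own alarm limits, can be sketched as a small state holder. The MFM Suite itself is Java-based; this Python sketch is illustrative only, and the class names, the sensor id and the limit values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MfmModel:
    name: str
    # Per-model alarm limits: sensor id -> (low, high); values are made up.
    alarm_limits: dict = field(default_factory=dict)


class MetaModel:
    """Holds one MFM model per operating state; exactly one is active."""

    def __init__(self, models, transitions, initial):
        self.models = {m.name: m for m in models}
        self.transitions = set(transitions)   # allowed (from, to) pairs
        self.active = initial

    def switch_to(self, target):
        """Follow an arrow of the meta-model to another operating state."""
        if (self.active, target) not in self.transitions:
            raise ValueError(f"no transition {self.active} -> {target}")
        self.active = target

    def sensor_range(self, sensor_id, value):
        """Classify a value using the limits of the active model only."""
        low, high = self.models[self.active].alarm_limits[sensor_id]
        if value < low:
            return "low"
        if value > high:
            return "high"
        return "normal"
```

The same sensor reading can therefore be normal in one operating state and abnormal in another, which is why the limits are stored per model rather than per sensor.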


3.2 Click-and-drop model design 


Both the MFM and the process modelers provide
graphical click-and-drop design of models in a
manner similar to e.g. PowerPoint. For both modelers,
a list of available components is shown to the
left of the editing area (see Figure 4 and Figure 5).
Clicking on one of them changes the cursor into 
a miniaturized version of the component. Click- 
ing anywhere within the editing area will place the 
component at this point. 

The MFM modeler provides another, faster
method which also facilitates the creation of
semantically correct models. When selecting an
MFM function in the editing area, a list of allowed
(according to MFM rules) connections (each consisting
of one MFM relation and one MFM function)
will be displayed (see Figure 6). Clicking on one of
them will create the connection from the selected
function and place it at a suitable position. The
modeler will automatically select the newly added
function and update the list of available connections.
In this way, complex MFM structures can be
designed in relatively few steps.


Figure 3. A project’s meta-model.


Figure 5. Designing a process model by click-and-drop.


Figure 6. Assisted design of MFM structures.


4 THE RUNTIME AND PLAYBACK 
APPLICATIONS 


The MFM Runtime and MFM Playback applications
use the same ShapeShifter-based graphical
components for displaying the MFM and process
models as the MFM Editor; in this case, however,
the editing functionality is disabled.


4.1 Separate analysis threads


The applications will perform diagnostic and
prognostic analyses on a given MFM model using
the reasoning rules that are explained in detail in
(Zhang, 2015). The analysis processes will run in
separate execution threads, which will set flags to
avoid e.g. starting a new diagnosis before the previous
one has terminated (however, a diagnosis may run
concurrently with a prognosis). This will prevent
two problems: the analysis process will not lag behind
the sensor sampling in cases where the sampling
interval is shorter than the analysis time, and the user
interface will remain responsive since the main
application thread is not blocked (Figure 7).
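
The flag mechanism described above can be sketched as follows. The MFM Suite is written in Java; this Python version only illustrates the idea, and the class and method names are hypothetical.

```python
import threading


class AnalysisRunner:
    """Runs analyses off the main thread, at most one per kind at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = {"diagnosis": False, "prognosis": False}

    def try_start(self, kind, analysis):
        """Start `analysis` in a worker thread unless `kind` is busy."""
        with self._lock:
            if self._running[kind]:
                return None          # previous run not finished: skip
            self._running[kind] = True

        def worker():
            try:
                analysis()
            finally:
                # Clear the flag even if the analysis raised an error.
                with self._lock:
                    self._running[kind] = False

        thread = threading.Thread(target=worker, daemon=True)
        thread.start()
        return thread
```

Because a request is skipped rather than queued while an analysis of the same kind is running, a slow analysis cannot build up a backlog behind the sensor sampling, and the calling thread returns immediately.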


4.2 Sensor display based on value and origin 


In addition to indicating the sensor range by 
changing the shape, color coding is used to indi- 
cate whether the sensor’s value is obtained from the 
process, manually set by an operator, or inferred 
from an MFM model analysis. 

An example is shown in Figure 8, where the shape
(down-arrow) and color (red) of sensor P1444 indicate
that it has a value below the low limit, and that
the source is the actual process/simulator. This value,
possibly along with other abnormal sensor values, has
triggered an analysis in the associated MFM model.

The analysis has concluded that a possible
cause of P1444’s low value is a high value in sensor
T1463, so this sensor is visualized with an up-arrow
shape and a yellow color (indicating that the
sensor state is inferred from an MFM analysis).
Note that if the sensor were functioning correctly and
showed a high value, it would be visualized using
the red color. Since it is not, we can conclude that
either this is not the cause of the abnormal value in
P1444, or the sensor is malfunctioning.

The shape and color of sensor T1465 indicate
that the sensor has a normal value/state (round
shape), and that it has been set manually by an
operator (green color), perhaps as a result of a
visual inspection.
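
The encoding in these examples can be summarized as a lookup from value range and origin to shape and color. The shapes and colors below are taken from the text; the key names and the function are hypothetical, and the MFM Suite itself is Java-based.

```python
# Shape encodes the value range, color encodes where the value came from.
SHAPES = {"low": "down-arrow", "normal": "round", "high": "up-arrow"}
COLORS = {"process": "red", "inferred": "yellow", "manual": "green"}


def sensor_symbol(value_range, origin):
    """Return the (shape, color) pair used to draw a sensor."""
    return SHAPES[value_range], COLORS[origin]
```

For the three sensors discussed above this gives a red down-arrow, a yellow up-arrow and a green round symbol, respectively.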


4.3 Sensor trend display 


The process viewer component in the Runtime and
Playback applications will display trend graphs for
selected process sensors (Figure 9). Through a dedicated
settings menu the user may select which sensor
trends to display (using a searchable selection panel),
whether to show different markers depending on
sensor type, and which colors to use in the trends.


Figure 7. Concurrent main (M), diagnosis (D) and
prognosis (P) threads.


Figure 8. Sensor visualization.


Figure 9. Process sensor trend panels.


4.4 Playback of existing data (Playback) 


The MFM Playback application will read stored 
sensor data from files to allow step-wise playback 
and analysis of previous experiments. The applica- 
tion also includes a slider to move back and forth 
freely. Furthermore, it can also replay a previous 
experiment in simulated real-time by activating a 
timer in a separate thread, and using the stored 
sampling intervals to update all sensors and re-run 
any analyses at correct times. 
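
A minimal Python sketch of simulated real-time playback is given below. It assumes the stored data can be read as a sequence of (interval in seconds, sensor values) pairs; that layout, the function names and the speed factor are assumptions, not details of the MFM Playback application.

```python
import threading


def replay(samples, on_sample, speed=1.0, stop_event=None):
    """Replay (interval_seconds, values) pairs, waiting each interval."""
    stop_event = stop_event or threading.Event()
    for interval, values in samples:
        # Event.wait doubles as an interruptible sleep.
        if stop_event.wait(interval / speed):
            break
        on_sample(values)


def start_replay(samples, on_sample, speed=1.0):
    """Run the replay in a separate thread so the caller is not blocked."""
    stop_event = threading.Event()
    thread = threading.Thread(
        target=replay, args=(samples, on_sample, speed, stop_event),
        daemon=True)
    thread.start()
    return thread, stop_event
```

Setting the returned event stops the replay at the next sample boundary; the callback is where sensors would be updated and analyses re-run.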


5 FEATURES COMMON TO ALL 
APPLICATIONS 


5.1 Server functionality 


The MFM Suite applications include functionality
to allow external network clients to download
the currently open MFM project (process and
MFM models) and to access updated MFM function
states and sensor values. The MFM Suite also
accepts commands from external clients for setting
MFM function states and sensor value ranges
(Figure 10). This allows e.g. a field operator to
have a situation understanding similar to that of a
control room operator and to prune away cause paths
that have been determined to be irrelevant.


5.2 Web export 


The MFM Suite applications will provide updated
displays of models with dynamic data accessible
through an ordinary web browser. When opening
an MFM project, the applications will generate
several files, depending on the number of models.
An example of the files is shown in Figure 11.


Figure 10. Data server connectivity.


Figure 11. Web export files.


Figure 12. Web-page showing an MFM model.


The graphical models are exported as SVG 
(Scalable Vector Graphics) files, and since these 
graphics files are (as the name implies) scalable, 
they can be zoomed in the browser without the loss 
of any details (Figure 12). 

Whenever the visual appearance of any element 
(MFM function, sensor etc.) in a model changes 
(e.g. due to an analysis affecting the elements), or 
whenever the user saves the project, the files are 
updated. Code is included in the HTML files to 
reload the pages at given frequencies (default every 
5 seconds). This frequency is specified using the 
Settings dialog box of the applications, which also 
allows specifying an output folder, to which the 
generated files are exported. In order to make the
web pages available to other devices in a network,
this folder could be set to the active root folder of a
web server application, such as Apache or IIS.
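
The export step can be sketched in Python as writing an SVG file plus an HTML page that asks the browser to reload itself. The paper does not say how the reload code is implemented; the HTML meta refresh tag used here is one standard option and is an assumption, as are the function name and the file naming.

```python
from pathlib import Path


def export_model_page(output_dir, model_name, svg_markup, reload_seconds=5):
    """Write <model_name>.svg and an HTML page that reloads periodically."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    svg_path = out_dir / f"{model_name}.svg"
    svg_path.write_text(svg_markup, encoding="utf-8")
    html = (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        # Ask the browser to reload the page every `reload_seconds` seconds.
        f'<meta http-equiv="refresh" content="{reload_seconds}">\n'
        f"<title>{model_name}</title>\n"
        "</head>\n<body>\n"
        f'<img src="{svg_path.name}" alt="{model_name}">\n'
        "</body>\n</html>\n"
    )
    html_path = out_dir / f"{model_name}.html"
    html_path.write_text(html, encoding="utf-8")
    return html_path
```

Pointing output_dir at the document root of a web server makes the page reachable from other devices, as described above.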


6 EXAMPLES OF USE 


Several tests have been conducted using an MFM
model of the primary side of two PWR simulators:
the RIPS simulator, which is based on the Ringhals
power plant, and a generic PWR simulator (Zhang
et al, 2014, Zhang et al, 2016, Thunem, 2017). Several
scenarios were tested, including a reactor trip
caused by too low pressure in the steam generators,
and a high pressure in the reactor coolant system.

Before running the experiment, the simulator 
was initialized to a normal running state. At this 
point, the values of all sensors were registered and 
used to set the individual alarm limits. The process 
model contained a total of 49 sensors, 21 of which 
were associated with MFM functions. 

Before running each scenario, the simulator
was reset to a normal running state. The control
room operator would run the selected scenario by
adjusting selected process parameters and letting
the simulator run its course.

In most cases, the reasoning engine performed
flawlessly. Figure 13 shows one scenario with multiple
anomalous sensor readings (right side) and
the successful diagnosis using MFM reasoning
(left side).

It was, however, noticed that in some cases the 
selection of trigger function in the MFM model 
would cause a memory problem (stack overflow) in 
the reasoning engine. By manually selecting another 
trigger function, the problem would disappear. 

The MFM reasoning engine uses the Java-based
JESS inference engine (Friedmann-Hill, 2003),
which employs recursion, a common cause of stack
overflow problems. It is reasonable to assume that
the selection of different trigger functions resulted
in different recursion depths during the rule-based
reasoning. One solution to this problem is to
increase the size of the Java call stack; however, the
maximum size of the call stack may be platform
dependent and difficult to ascertain.
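
The link between recursion depth and stack exhaustion can be illustrated with a toy example. The Python sketch below does not reproduce JESS or the MFM reasoning rules; the cause chain and all function names are hypothetical. It only shows that a recursive traversal fails once the chain is deeper than the call stack allows, while an equivalent loop does not.

```python
def make_chain(length):
    """Build a linked chain of nodes, each pointing to its cause."""
    node = None
    for _ in range(length):
        node = {"cause": node}
    return node


def depth_recursive(node):
    """Length of a cause chain, computed recursively."""
    if node is None:
        return 0
    return 1 + depth_recursive(node["cause"])


def depth_iterative(node):
    """Same result with a loop, so the call stack does not grow."""
    depth = 0
    while node is not None:
        depth += 1
        node = node["cause"]
    return depth
```

In Python the failure surfaces as a RecursionError once the interpreter's recursion limit is exceeded; in Java the analogous failure is a StackOverflowError.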

Since the JESS system is no longer supported, the
MFM reasoning engine has been re-implemented
using the Drools inference engine. However, the
experiments have not yet been re-run to verify any
performance improvements.


7 MFM VIEWER 


The MFM Viewer is a graphical model viewer
designed for Android-based portable devices such
as tablets and cell phones. The application helps
a distributed team of e.g. control room and field
operators gain a common understanding of a process
situation regardless of their location. For practical
purposes a cell phone was used to create the images
below, while in a work situation a tablet may be more
suitable. The application will scale appropriately to
all common display resolutions.

The main menu will allow the user to specify rel- 
evant settings, including the IP address and port 
number of the data server, i.e. any of the MFM 
Suite applications currently running. The main 
menu further provides a “Download project” menu 
item, which will instruct the MFM Suite applica- 
tions to package the currently open project into a 
zip file and to transmit the zip file to the device. 
After receiving the zip file, the MFM Viewer will 
unzip the project files and display the project meta- 
model. Clicking on any of the model boxes (MFM 
or process) opens the model in a new window. In 
any of the models, the user may scroll and zoom 
using common finger gestures. 

At the bottom of the MFM and process dis- 
plays is a green button. Pressing this will cause 
the application to connect to the MFM Suite and 
retrieve and display the current MFM states and 
sensor values at a sampling frequency specified in 
the “Server Settings”. The button will turn red, 
indicating the retrieval of data (Figure 14). Press- 
ing the red button or returning to the meta-model 
will disable the connection. Note that for both 
models there is a search icon in the upper right 
corner. Clicking this will open a search field, which 
allows the user to find (by highlighting) any MFM 
function or process component whose outer label 
contains the given text string. 

The MFM Viewer also allows the user to set the
states of individual MFM functions or sensors. This
is done by clicking on the desired element, which will
open a dialog box, and clicking on the desired state
(Figure 15). This will in turn trigger a re-analysis of
the models in the MFM Suite application.


Figure 13. Scenario diagnostic.


Figure 14. Displaying models and highlighting analysis
results in the MFM Viewer.


Figure 15. Setting MFM function and sensor states in
the MFM Viewer.


8 FURTHER WORK 


The primary focus of the MFM Suite development
has until recently been on including the necessary
functionality to graphically design MFM and
process models, create links between them, and 
perform online analysis with simulated process 
sensor data. When presenting the analysis results, 
focus has primarily been on how the MFM func- 
tions are affected, and which MFM functions 
have been identified as root causes. Interpreting 
the results has required some familiarity with the 
MFM methodology, which is unreasonable to 
expect from e.g. a control room operator. 


To make the analysis results more accessible to 
someone without MFM training, the focus of the 
analysis presentation will therefore shift towards 
the process display, indicating which process com- 
ponents are the possible root causes and which 
components may be affected by the identified 
anomalies. 

The activity will further evaluate various display 
modes to enhance the operators’ comprehension 
of a given process situation in light of established 
analysis results, and to facilitate collaboration 
between control room and field operators via dedicated
interaction mechanisms.


Prognostic and health management design for subsea applications

ABSTRACT: The design of a subsea production system needs to ensure high reliability and safety
figures since these assets will be deployed in harsh environments for extended periods of time. Maintenance
costs associated with these systems represent a significant percentage of the total operational expenditure
incurred by an Oil & Gas operator. Traditional reactive maintenance approaches applied on subsea
equipment are starting to drop in their efficiency as degradation occurs on such systems, resulting in prolonged
downtime periods. Prognostics and Health Management is a relatively new topic and other industry
sectors have demonstrated that it can provide a solution for reducing maintenance costs and improving
systems’ overall availability. This paper presents a prognostic and health management development process
suitable for subsea production systems.


1 INTRODUCTION 


When oil and gas exploration is not economically
viable through traditional oil platforms, subsea
production systems represent an alternative for the
majority of operators. A Subsea Production System
(SPS) is a collection of hydrocarbon extracting
equipment located on the seabed and its main com- 
ponents consist of a seabed wellhead, subsea x-mas 
tree (XT), manifold, umbilical, riser, a network of 
pipelines, flowline as well as subsea power and con- 
trol systems. The amount of shallow water oil and 
gas reserves is decreasing. This has led exploration 
and production into deep waters where SPS are the 
preferred solution to make development economi- 
cally viable. The subsea industry is now facing a 
number of challenges on how to improve reliability 
and safety of critical assets in an economical way. 
Subsea lifecycle analysis demonstrated that Capital 
Expenditure and Reliability Availability Mainte- 
nance Expenditure (RAMEX) are the two major 
costs of a subsea asset, with downtime cost making 
up the majority of RAMEX. Prognostic and health 
management (PHM) has the capability to address 
these financial challenges and reduce the downtime 
cost by assessing and predicting the Remaining 
Useful Life (RUL) of a component/system while
supporting the system’s goal and compliance with
high-level requirements such as safety, reliability,
availability and maintainability.

At present, PHM research for subsea applications
targets the development of such a capability for
isolated components only. This paper introduces an
integrated approach to designing the PHM for an SPS
at the system level. This approach is mapped onto a
typical engineering design and operational process
of a subsea system, and it involves integration and
concurrent analysis of multi-disciplinary sources
of knowledge and interaction between several
engineering functions.

Section 2 of this paper discusses the current 
prognostic approaches and the level of adoption 
of PHM for subsea equipment. Section 3 will cover 
four engineering disciplines and their potential use 
for the development of the PHM capability. Sec- 
tion 4 will present a novel PHM development proc- 
ess capable of integrating knowledge, information 
and data within concurrent engineering analysis. 
In Section 5, an instantiation of the PHM development
process for an XT will be presented.


2 STATE OF THE ART OF PHM 


Four different types of prognostic approaches
currently exist:

• Experience-based prognostic approaches are
  based on historical data and knowledge accumulated
  during the lifecycle of systems.

• Model-based approaches involve the construction
  of a mathematical model which integrates the
  underlying physics of failure of the critical
  components of the system, their degradation and
  their failure modes.

• Data-driven prognostic approaches specify the
  behavior of a system through gathered operational
  data (condition monitoring (CM) data via sensors
  and/or event data). The data is processed and
  compared with key parameters/features to predict
  the probability of fault occurrence.

• Hybrid prognostics combine the model-based and
  data-driven approaches: parameters in the model are
  continuously updated as data from service becomes
  available, with the ultimate purpose of improving
  the accuracy of the prediction (Vachtsevanos, Lewis,
  Roemer, Hess, & Wu, 2006; Medjaher & Zerhouni, 2013;
  da Silva & Radespiel, 2013).
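
The hybrid idea can be illustrated with a minimal sketch. It assumes a linear degradation model whose rate starts from a model-based prior and is re-estimated by least squares each time an in-service measurement arrives. The class name, the linear model form, the failure threshold and the sample values are all hypothetical and are not taken from the cited works.

```python
from dataclasses import dataclass, field


@dataclass
class HybridLinearPrognostic:
    """Assumed degradation model d(t) = initial_level + rate * t.

    The model-based part is the linear form and the prior rate; the
    data-driven part is the least-squares refit of the rate from
    in-service measurements (line forced through the initial level).
    """
    initial_level: float        # degradation at t = 0 (assumed known)
    rate: float                 # prior degradation rate from the model
    failure_threshold: float    # level at which functional failure occurs
    _times: list = field(default_factory=list)
    _levels: list = field(default_factory=list)

    def update(self, time: float, measured_level: float) -> None:
        """Store a new measurement and re-estimate the degradation rate."""
        self._times.append(time)
        self._levels.append(measured_level)
        num = sum(t * (d - self.initial_level)
                  for t, d in zip(self._times, self._levels))
        den = sum(t * t for t in self._times)
        if den > 0:
            self.rate = num / den

    def remaining_useful_life(self, now: float) -> float:
        """Time left until the modelled degradation reaches the threshold."""
        if self.rate <= 0:
            return float("inf")
        time_of_failure = (self.failure_threshold - self.initial_level) / self.rate
        return max(0.0, time_of_failure - now)


# Illustrative values only (arbitrary time and degradation units).
model = HybridLinearPrognostic(initial_level=0.0, rate=1.0, failure_threshold=100.0)
rul_prior = model.remaining_useful_life(now=0.0)    # model-only estimate
for t, d in [(10.0, 20.0), (20.0, 40.0), (30.0, 60.0)]:
    model.update(t, d)                              # data from service arrives
rul_updated = model.remaining_useful_life(now=30.0)
print(rul_prior, rul_updated)
```

In this toy case the service data show degradation twice as fast as the prior model assumed, so the predicted RUL shrinks accordingly. A practical hybrid scheme would typically also carry an uncertainty estimate for the RUL, which this sketch omits.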


The majority of industry sectors have adopted
data-driven prognostic approaches as a first
attempt at developing a PHM capability, particularly
for complex systems (Bykovsky, 2008), as this enables
an understanding of degradation to be developed
without constructing a mathematical model that
captures the physics of failure. Currently, in the
subsea arena, there are very few Condition Based
Maintenance (CBM) solutions deployed in the field.
The major Original Equipment Manufacturers (OEMs)
for SPS include Technip/FMC Technologies, Cameron,
GE, Aker Solutions and OneSubsea, among others.
Based on information available in the public domain,
Cameron provides a blowout preventer condition-based
monitoring system and a riser annulus condition
system. FMC, however, has delivered a first attempt
at a system-level Condition and Performance Monitoring
(CPM) capability for a subsea asset, currently
being installed at the Gjøa field (Soosaipillai,
Roald, Alfstad, Aas, Smith & Bressand, 2013).


3 ENGINEERING DISCIPLINES AND THE 
STAKEHOLDERS INVOLVED IN THE 
SUBSEA PHM DESIGN 


3.1 Overview 


To develop and implement any of the prognostic
approaches mentioned in the previous section,
different types of data and information are required
(Vachtsevanos, Lewis, Roemer, Hess, & Wu, 2006).
This information and data represent the output
of several engineering disciplines owned by
various functional teams. The design stage for a
subsea system is usually undertaken by the OEM
and suppliers under the requirements established
by an operator. During the operation period,
maintenance teams will be hired by the operator to
Inspect/Maintain/Repair (IMR) the subsea equipment,
although the maintenance of control systems
will be the responsibility of the OEM. Very
often, upgrades, overhauls and de-commissioning
are done by different parties. During the
lifetime of a subsea production system, multiple
organizations are involved, and a subsea system
may include components and processes originating
from all over the world. Throughout the entire
lifecycle of a subsea field, interactions also take
place between different engineering disciplines
belonging to different companies, which operate
under multiple languages and cultures. Hence, one
of the challenges is the fact that the data/information
required to develop a PHM solution is not
unified and/or centralized. In the context of subsea
applications, these engineering functions and
the data/information associated with each of them
are captured in Table 1.


3.2 Subsea system design—initiation of the PHM 
design 


PHM Design must be underpinned by a level of 
understanding of the healthy state of a system. In 
the case of a subsea system, this characterization 
is developed during various stages of the design 
process (conceptual design, front end engineering 
design (FEED) and detailed design). Engineering 
modelling is carried out during the FEED. Modifications
to the design (to include the PHM capability as an
afterthought) might be extremely costly; it is
therefore recommended to design in the PHM function
as the asset design progresses through various
technical and business gates. It is instrumental
at this stage to derive the PHM requirements from 
the subsea asset requirements, if such a capability 
is to be developed. 


3.3 Reliability and availability analysis— 
foundation of the PHM design 


In the offshore industry, Failure Modes, Effects and
Criticality Analysis (FMECA) tends to be one of the
most common approaches for reliability analysis
and it has been increasingly implemented in the
last ten years on subsea projects (DNV, 2013).
FMECA also forms the foundation for good PHM
design (Vachtsevanos, Lewis, Roemer, Hess, &
Wu, 2006) although, in the context of subsea
equipment, it is mainly used to support qualification
of new technologies or re-qualification of
legacy systems. Reliability analysis must be based
on the actual design of the system; a concurrent
engineering design and reliability analysis is
therefore recommended to ensure that both disciplines
are targeting the same point of truth. Input is
required from IMR to ensure the efficiency and
accuracy of this analysis. An FMECA study aims at a
good understanding of system behavior under faulty
conditions, which is instrumental in designing
the diagnostic and prognostic capability. Historical
data and knowledge can also be employed to
adjust or re-design components/equipment/systems
to improve reliability targets. This information
usually resides with the operators and OEMs,
but it is very often not available to the reliability
team responsible for the qualification of the
subsea equipment under investigation.


Table 1. Engineering disciplines.

Engineering discipline              Information
System design                       Engineering models; schematics; reports;
                                    hierarchical levels; dependencies
Reliability & Availability          Failure concepts; criticality information;
                                    (re)certification
Control and Instrumentation /       Operational conditions; condition and
Condition and Performance           performance indicators; control data
Monitoring (CPM)
Inspection, Maintenance and         Past operational conditions; maintenance
Repair (IMR)                        records; failure history


3.4 Control & instrumentation/CPM—core of the
PHM design


A large number of parameters used to assess the
condition of an SPS are currently captured by CPM
through sensors such as acoustic control systems,
multiphase flow meters, accelerometers, pressure
and temperature sensors, sand and leak detection
systems, as well as detectors for dropped objects
(ISO 13628, 2010). These types of sensors found
their way into subsea design through either
regulation or recommendation, and there is a
consensus in the industry that they are not engineered
for the purpose of ensuring higher availability
figures. Production efficiency of subsea equipment
has been particularly poor in recent years, reaching
a low point of 60% in 2012 and averaging 71% in 2015,
although back in 2004 the production figures were
above 80%. Degradation of production equipment is
one of the contributing factors to this drop.
Abnormal behavior of subsea equipment can be detected
by condition monitoring solutions: information
sensed by instrumentation is sent to the control
module and interpreted by the operational teams
(Markeset, Moreno-Trejo, & Kumar, 2013) to allow
informed decisions guiding operation and maintenance.


3.5 Maintenance—exploitation of the PHM 
design 


Various types of maintenance strategies exist and
can be implemented for subsea equipment in service.
Corrective maintenance is typically applied after
the failure has occurred. A scheduled maintenance
regime aims at dealing with faults before they occur,
but it is based on fixed time intervals. Conditional
maintenance only supports non-dynamic estimation of
the degradation, whereas the most recent approaches
take advantage of the PHM capability to estimate the
RUL of a critical component and to plan the
maintenance job according to these calculations
(based on data-driven, model-based and hybrid
prognostic algorithms). The PHM capability of a
system is represented by a set of techniques and
methods from different disciplines that combine
knowledge and data to support predictive maintenance
by detecting, diagnosing, predicting, advising
and analyzing (postmortem) the failure information
(Guillén, Crespo, Macchi, & Gomez, 2016).

The information and data related to functional
failures (failure modes) and physical failures
(faults) of subsea equipment are scarce, so the
traditional reactive maintenance approaches
(corrective and scheduled) are still the preferred
choices for oil and gas operators. Predictive
maintenance regimes also present their own set of
challenges, which must be considered during the
design stage: the inability to accurately and
reliably predict the RUL of a component/system; the
inability of maintenance systems to document, learn
and recommend the action that should be taken; and
the lack of tools capable of demonstrating the
effectiveness of a predictive maintenance program.

SPS have typically been designed to operate for
five years without failure, so the operator will
plan to carry out preventive maintenance every five
years (Moreno-Trejo & Markeset, 2012). However,
reliability data from the last decade show that some
components had to be maintained or replaced sooner
to prevent failure. Hence, traditional maintenance
can incur huge expense and consequential damage
for an asset and the environment, such as pollution
and loss of production (Markeset, Moreno-Trejo, &
Kumar, 2013; Uyiomendo & Markeset, 2015). In
recent years, traditional maintenance activities in
the oil and gas industry have been transformed
through the adoption of CBM, which ensures efficient
maintenance, reduces lifecycle costs and improves
the systems’ overall availability. The application of
CBM and PHM (as an extension of CBM) in the
oil and gas industry has started to be exploited on,
for example, drilling systems, control systems and
pumping systems.


4 SUBSEA PHM DEVELOPMENT PROCESS 


To determine the prognostic method, these four
major disciplines need to share information and data
with each other. However, PHM development for an
entire subsea system is still a challenge from the
point of view of big data management and information
support, so it is hard to integrate those disciplines
into a single system.

In this paper, we propose an integrated approach
to the subsea PHM design process. This process is
captured in Figure 1 and it highlights the data/
information exchanged between the four main
engineering disciplines discussed in the previous
section. The proposed subsea PHM development
process includes feedback loops that are intended
to enable enhanced data collection, exchange and
analysis of knowledge related to the degradation
of subsea components, addressed by different
engineering disciplines as the subsea project moves
through various technical and business reviews.
Good communication during a generic design
engineering process requires both historical and
current information to be shared freely, problems to be
reported, views to be exchanged, and positive
interpersonal relationships to be retained, as the design
progresses through various technical and business
gates. However, effective communication also
depends on balancing the level of information
against the complexity of the task/topic, avoiding both
over-complication and over-simplification. Hence,
suitable levels of information must be delivered to
the correct person at the correct time to communicate
essentials without the receiver being overburdened
with data (Parkes & Hodkiewicz, 2011).

Figure 1. An integrated approach for exchanging engineering knowledge to support Subsea PHM Design.
[Diagram not reproduced; legible block labels: Maintainability Assessment (Maintenance Aware Design);
Condition and Performance Monitoring / Prognostics & Health Management (diagnostics and prognostics
analysis, fault coverage, false positives/false negatives, ambiguity groups, sensor set identification
and optimisation, IMR test points); Cost-Benefit Analysis.]

Step 1: The start of the PHM design process
is represented by the requirements phase. High-Level
(HL) requirements are typically divided into
functional requirement (FR) and non-functional
requirement (NFR) categories. Functional requirements
define the technical details of a system (including
the function of the system and the functions of
each individual component) and non-functional
requirements cover the attributes of the system
(such as safety, reliability, maintainability, usability,
performance, security, etc.). The Design/Re-design/
Instrumentation phase of the PHM development
process coordinates various engineering analyses
to ensure the final subsea design meets the
requirements established at the start of the project.
During the design phase, data associated with the
environmental conditions, reservoir, well completion,
process and operations, host facilities, safety
and hazards should be considered when progressing
through conceptual, FEED and detailed design.
These engineering efforts target the technical
requirements (also known as functional requirements
(FR)) of a subsea system, since this design
element is heavily regulated and supported through
recommended practices (ISO 13628-1, 2010).
Non-functional requirements are also considered by
current subsea design best practices, and they cover
safety, performance and security. However, there is
no attempt to define and target the reliability and
maintainability requirements at the early stage of
the design. This influences the definition of PHM
requirements, as these are derived from the reliability
and maintainability requirements. Only recently
has the industry generated recommended practices
such as API RP 17N on topics related to reliability,
technical risk and integrity management (API RP
17N, 2009). However, they are not yet adopted due
to the lack of tools, processes and meaningful
reliability data to support their implementation.
Reliability and maintainability requirements should be
part of the non-functional requirements (NFR) of
the system, and PHM requirements should be derived
from the system’s high-level (HL) non-functional
requirements. For example, an HL NFR for an XT can
be affordability, achieved by reducing the downtime
periods while keeping the same levels of safety.
In this manner, cost and safety become the main drivers
for the development of a PHM solution. The derived
PHM requirements can then be stated as follows.
PHM Requirement 1: the XT must have a feature that
can reliably predict functional failures at least one
week prior to the actual event. PHM Requirement 2:
the XT must have the capability of offering
mitigation/advisory generation in the context
of current operational conditions. With access to the
information provided by such features, the operator
has reasonable time to schedule a vessel, equipment
and personnel to carry out an intervention on the
faulty components.

Step 2: To support the PHM within the subsea
design phase, reliability analysis needs to be carried
out concurrently with the subsea system design.
Reliability analysis should be based on the actual design
data to support the engineering analysis that can
identify the means to detect and isolate the potential
faults which may occur during operation. One
way to realize this step is through the exploitation
of functional models, which can support reliability
engineers in carrying out analyses such as
FMECA, Fault Tree Analysis (FTA) and Reliability
Block Diagrams (RBD) from the very early stages of
the subsea design process. These methods and
techniques can also underpin Reliability Centered
Maintenance (RCM) and Back-Fit RCM analysis
at later stages of the lifecycle. RCM includes four
elements that are critical to a maintenance program
aimed at improving the availability of a given asset:
preservation of the system function; identification
of the failure modes that can lead to functional
failures (and subsequently downtime); prioritization
of failure mode candidates based on Occurrence (O),
Severity (S) and Detectability (D) by highlighting
the Risk Priority Number (RPN = O × S × D) of each
candidate; and, finally, selection of applicable and
effective tasks to control the failure modes.
Back-Fit RCM builds on the same RCM principles by
incorporating operational reliability figures
(updating the O, S and D parameters) and evaluating
the applicability and effectiveness of the control
measures. We believe that the reliability,
availability and maintainability analyses should be
the foundation of the PHM design, as they can
highlight the risk associated with a brand new subsea
design or a legacy system using information from
service (provided to the reliability team during Step
7 using the output of a bespoke maintenance
analytics engine). Subsequently, using this information,
the design will be assessed by the reliability authority
during a technical review (targeted reliability
assessment); if the design fails to meet a specific
target, the risk and the critical components must be
addressed through re-design, redundancy or the
addition of instrumentation. Informed trade-off
studies between these three approaches must be
in place to guarantee that the final design meets all
the requirements of the project; however, this topic
is beyond the scope of this paper. For far too long,
reliability assessment of subsea equipment
was carried out only to present a measured reliability
figure to support re-certification of production
equipment already in exploitation.
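
As a minimal sketch of the prioritization element of RCM described in Step 2, the following ranks failure mode candidates by RPN = O × S × D and shows a Back-Fit style update of the occurrence rating. The rating scale (assumed here to be 1 to 10), the type and function names, and all ratings are hypothetical illustrations, not data from the case study.

```python
from typing import NamedTuple


class FailureMode(NamedTuple):
    component: str
    occurrence: int      # O rating (assumed 1-10 scale)
    severity: int        # S rating (assumed 1-10 scale)
    detectability: int   # D rating (assumed 1-10 scale)


def risk_priority_number(mode: FailureMode) -> int:
    """RPN = O x S x D, as defined in the text."""
    return mode.occurrence * mode.severity * mode.detectability


def rank_by_rpn(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=risk_priority_number, reverse=True)


# Ratings invented purely for illustration.
modes = [
    FailureMode("choke valve", 6, 7, 5),
    FailureMode("production master valve", 3, 9, 4),
    FailureMode("tubing hanger", 2, 8, 8),
]
ranked = rank_by_rpn(modes)
for m in ranked:
    print(m.component, risk_priority_number(m))

# Back-Fit RCM style update: operational data revises the O rating.
backfit = modes[1]._replace(occurrence=7)
print(backfit.component, risk_priority_number(backfit))
```

Keeping O, S and D as separate fields rather than storing only the product makes the Back-Fit update a one-field change, and it keeps the source of a high RPN visible during a technical review.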

If the PHM is channeled as a derived requirement
from the RAM requirements, specific levels
of targeted reliability can be achieved. Also,
different PHM requirements and implementation
strategies can be evaluated at this stage against specific
sets of RAM requirements. Very often, in subsea
applications, redundancy seems to be the option
preferred by the system designers, as this guarantees
improved availability figures when the primary
component/sub-system fails. The major drawback
of the redundancy approach is the fact that it does
not offer any indication of the RUL of the critical
component, sub-system or system; hence our
case for the addition of instrumentation supporting
the PHM capability. If instrumentation is required
for diagnostic and prognostic purposes, this must
be defined by the PHM analysts in collaboration
with the subsea design and reliability teams. We
believe that the majority of the condition monitoring
applications existing in a subsea environment
were driven by vendors of sensors capable of
targeting symptoms associated with specific failures.
This approach captures failures in isolation and
does not account for the propagation of faults leading
to functional failures of other components and
subsequently to failure of the system. This limitation
can be overcome using functional models of
the system, together with functional relationships
and failure/effect dependencies for both functional
and physical failures, defined using widely
accepted, well-defined failure taxonomies. The
propagation tables generated from these models (a
component’s reaction to functional failures of the
system) are validated by the stakeholders of the asset
against analyses capturing various end-item effects.
The end-item effect is the consequence a failure
mode has on the operation or functional output of
the system at the highest indenture level (an item’s
position in the system hierarchy relative to the
top-level item).

What we propose in Step 2 is an implementation
of the API RP 17N recommended practices through a
concurrent design-reliability analysis that evaluates
the actual reliability of the design and highlights the
effects of failure modes leading to functional failures.

Step 3 enables an informed dialogue between
RAM engineers and PHM analysts by allowing the
use of a system-level propagation table, i.e. a
collection of all the failure mode signatures (the
effects throughout the system, and not just at the
point of occurrence) of a given failure mode universe.
Currently, the job description of a PHM analyst falls
loosely under the control team, although we believe
that the analyst’s responsibility and involvement go
beyond the control systems. The PHM team should
liaise very closely with the design team, as it aims at
the identification and optimization of sensor set
solutions capable of detecting, isolating and
predicting a given set of failure modes considered under
the PHM analysis. The means of realization of this
dialogue is represented by the reliability models
populated with failure and criticality information of
failure concepts.

The automated PHM instrumentation analysis aims
to identify and optimize, in a systematic manner,
the sensor set configurations capable of supporting
the detection, diagnosis and prediction functions
of a subsea asset to further enable advisory
generation. It also aims to calculate, for each sensor
set solution identified during this process, the fault
detection and isolation characteristic, representing
the proportion of failure modes selected for PHM
analysis that can be detected and identified by the
sensor set under consideration. The PHM
instrumentation analysis must allow modification
of existing sensor arrangements based on user
knowledge or trade-offs. Legacy sensors present on
the system can be considered as part of this analysis,
although qualification institutions and regulators
will not easily accept interrogation of sensors used
for fail-safe control purposes (functional safety
sensors).

The criticality of failures affecting a subsea system
must be considered during the PHM instrumentation
analysis. For new subsea designs, no measured
criticality information exists, and this must
be defined using input from various stakeholders
(subsea designers, RAM team, operators, IMR
personnel), taking into account standards and
recommended practices for the qualification of new
technology (DNV-RP-A203, 2011; DNV-DSS-401, 2012).
For subsea legacy systems, the criticality should be
based on what has failed in service, and it can
be characterized through occurrence, severity and
detectability parameters by calculating a risk
priority number for every single component of the
subsea equipment. The PHM instrumentation analysis
should be capable of running the identification and
optimization algorithms for specific groups of
components by focusing on specific targeted criticality.

There is also a feedback loop between the PHM and
RAM functions, meant to ensure that the selected
sensor set solution meets the reliability criteria of the
system (a sensor that fails in service ahead of
the component it is monitoring does not ensure
higher levels of availability for the asset).
This feedback loop is represented within the PHM
development process by Step 4.
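
The fault detection and isolation characteristic described in Step 3 can be sketched as follows, under a deliberately simplified assumption: each failure mode signature is reduced to the set of sensors that respond to it (real signatures would also carry the direction and magnitude of the response). The function name, the sensor tags and the propagation table are hypothetical and are not outputs of any particular tool.

```python
from collections import defaultdict


def fdi_characteristics(propagation_table, sensor_set):
    """Return (detection coverage, isolation coverage, ambiguity groups).

    propagation_table maps each failure mode to the sensors it affects.
    A failure mode is detected if at least one chosen sensor responds,
    and isolated if no other failure mode yields the same response
    pattern on the chosen sensors.
    """
    if not propagation_table:
        raise ValueError("propagation table is empty")
    sensors = frozenset(sensor_set)
    groups = defaultdict(list)
    for mode, affected in propagation_table.items():
        observed = frozenset(affected) & sensors   # signature seen by this set
        groups[observed].append(mode)

    total = len(propagation_table)
    undetected = groups.pop(frozenset(), [])       # no chosen sensor responds
    detection = (total - len(undetected)) / total
    isolation = sum(1 for g in groups.values() if len(g) == 1) / total
    ambiguity_groups = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    return detection, isolation, ambiguity_groups


# Hypothetical propagation table: P = pressure sensor, F = flow sensor.
table = {
    "choke valve stuck": {"P1", "F1"},
    "master valve leak": {"P1", "P2"},
    "safety valve fails closed": {"P2", "F1"},
    "tubing hanger seal leak": {"P3"},
}
full = fdi_characteristics(table, {"P1", "F1"})
sparse = fdi_characteristics(table, {"P1"})
print(full)
print(sparse)
```

With the single sensor `P1`, two failure modes produce the same observed pattern and therefore form an ambiguity group; adding `F1` separates them, which is the kind of evidence a sensor set trade-off would weigh against the cost of the extra sensor.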

Step 5 of the PHM development process enables
the dialogue between the PHM analysts and
the subsea design team. Although represented as
a separate link/step, this dialogue should take place
concurrently with the reliability analysis of the
instrumented subsea system. Decisions regarding
unfeasible sensor set solutions are taken, and
trade-off studies related to cost, weight, coverage,
location, reliability, probability of detection,
probability of prediction, positive and negative
likelihood ratios, physical constraints, loads and
environmental conditions should be carried out by a
multi-disciplinary team led by the RAM function.
During this step, the reliability and maintainability
requirements are verified and validated by a group
of experts. As a means of realizing this dialogue,
we propose a model-based approach, as this
will allow rapid generation of sensor sets spanning
multiple levels of hierarchy for a subsea system by
considering technical and economic metrics.
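
One way to picture the trade-off studies of Step 5 is a Pareto screening of candidate sensor sets on two of the listed metrics, cost and coverage. The sketch below enumerates every subset of a small hypothetical sensor list, scores each on total cost and detection coverage, and keeps the non-dominated sets. The function names, sensor tags, cost units and propagation table are assumptions for illustration; the exhaustive enumeration is only practical for small sensor lists and is not the algorithm of any specific tool.

```python
from itertools import combinations


def detection_coverage(propagation_table, sensor_set):
    """Fraction of failure modes seen by at least one chosen sensor."""
    chosen = set(sensor_set)
    hits = sum(1 for affected in propagation_table.values()
               if chosen & set(affected))
    return hits / len(propagation_table)


def pareto_sensor_sets(propagation_table, sensor_costs):
    """Return (sensor set, cost, coverage) tuples not dominated by another set."""
    names = sorted(sensor_costs)
    candidates = []
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            cost = sum(sensor_costs[s] for s in subset)
            coverage = detection_coverage(propagation_table, subset)
            candidates.append((subset, cost, coverage))

    def dominated(c):
        # Another set is no worse on both metrics and better on at least one.
        return any(
            o[1] <= c[1] and o[2] >= c[2] and (o[1] < c[1] or o[2] > c[2])
            for o in candidates
        )

    front = [c for c in candidates if not dominated(c)]
    return sorted(front, key=lambda c: c[1])


# Hypothetical inputs: sensors affected per failure mode, and sensor costs.
table = {
    "choke valve stuck": {"P1", "F1"},
    "master valve leak": {"P1", "P2"},
    "safety valve fails closed": {"P2", "F1"},
    "tubing hanger seal leak": {"P3"},
}
costs = {"P1": 3, "P2": 2, "F1": 4, "P3": 1}
front = pareto_sensor_sets(table, costs)
for sensors, cost, coverage in front:
    print(sensors, cost, coverage)
```

The resulting front gives the multi-disciplinary team a short list to discuss instead of the full set of combinations; further metrics from the list above (weight, sensor reliability, likelihood ratios) could be added as extra objectives in the same dominance test.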


Step 6 facilitates the exploitation of the data
provided by a PHM-enabled subsea system and
the support offered to the IMR function. Data
from sensors will be fed into diagnostic and
prognostic engines capable of supporting the IMR
function in fault detection, fault isolation and,
ideally, prediction of the remaining useful life.

Step 7 of the PHM development process facilitates
the implementation of an integrated, analysis-driven
sustainment activity. It provides traceability
of the subsea maintenance activities when using
PHM information. We recommend the use of
function-based reliability models to gather, share
and analyze maintenance data in order to enable
an automated failure (or data) reporting, analysis and
corrective action system (FRACAS/DRACAS).


5 CASE STUDY 


For the implementation of the integrated PHM
development process for subsea equipment, a
commercial off-the-shelf software tool, namely the
Maintenance Aware Design environment (MADe™)
developed by PHM Technology, was employed. It
was used to carry out an instantiation of the process
on an XT. The selection of this software package
was based on its previous successful use in the
aerospace sector on fuel systems and environmental
control systems (Hess, Frith & Calvello,
2005). MADe™ is a ‘model-based’ engineering
tool that can provide an integrated framework to
manage, control and analyse the information and
data across different disciplines, including
design, safety, reliability, availability and
maintainability, for high-value, highly complex systems.
However, good inter-discipline communication requires
expert document management and control, and in
this instance this is achieved through a model
gathering data and knowledge from the various disciplines.
The instantiation of the PHM design process was
carried out on a typical subsea XT. Several concurrent
engineering analyses belonging to separate engineering
disciplines, but derived from a single model of
this XT (developed within MADe™), were carried
out. The outcomes of some of these analyses are
highlighted in Table 2 (a and b) and are briefly
summarized below. The functional model
accommodates information describing the input
and output flows of each component, the causal
relationships between these flows that allow for
systematic propagation of failures throughout the
system, and criticality data (retrieved from the 6th
edition of the Offshore Reliability Data, Volume 2).
Knowledge and data characterizing the failure
of each of the components forming a subsea XT
were added to the functional model using concepts
defining the causes, mechanisms, faults, symptoms
and the links to the functional failures previously
defined. Significant challenges were faced when
trying to align the failure taxonomy used by
OREDA (based on the ISO 14224 standard) with
the failure taxonomy employed and defined by
PHM Technology in MADe™.

Table 2a. Outcomes of implementation of the integrated PHM
development process. [Graphic not reproduced; panel title:
1. Design/Re-design/Instrumentation.]


Sensor Set Solution Y 


A clear understanding of the causes, mechanisms, 
potential symptoms leading to a fault and the way 
this fault develops into a functional failure is instru- 
mental in selecting the correct maintenance task or 
the PHM instrumentation, in an informed manner. 
Table 2b. Outcomes of implementation of the integrated PHM development process. Panel: PHM Instrumentation Trade-off Studies (axis label: Number of Sensors).

Four components were selected during the criticality assessment as having a significant
impact on the function of the XT, namely the choke valve, the downhole safety valve, the
production master valve and the tubing hanger. For the scenario of these four functional
failures, 100 sensor set solutions were automatically generated from the functional model,
each having between 7 and 9 sensors (measuring pressure and flow rate) and offering between
50% and 100% fault coverage. Ambiguity groups were clearly highlighted during this process
due to the similarities in the fault signatures characterizing two of the faults. At this
stage, the PHM development process allows trade-off studies on the sensor set solutions to
be investigated very early in the design process, and various maintenance strategies can be
benchmarked when using specific diagnostic and prognostic engines coupled to the
instrumentation identified in the previous step.
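
The notions of fault coverage and ambiguity groups used above can be illustrated with a minimal sketch. It is not the algorithm implemented in MADe™, which is not described here. The fault names, sensor names and signature values are hypothetical, and a fault is assumed to be detected as soon as at least one selected sensor responds to it.

```python
from collections import defaultdict


def coverage_and_ambiguity(signatures, sensor_set):
    """Return (fault coverage in %, list of ambiguity groups).

    signatures: dict mapping fault -> dict mapping sensor -> symptom (-1, 0 or +1).
    sensor_set: sensors that are actually installed.
    """
    groups = defaultdict(list)
    for fault, signature in signatures.items():
        # Signature of the fault as seen through the installed sensors only.
        observed = tuple(signature.get(sensor, 0) for sensor in sensor_set)
        groups[observed].append(fault)
    # Assumption: a fault counts as covered if any installed sensor reacts to it.
    detected = [f for obs, faults in groups.items() if any(obs) for f in faults]
    coverage = 100.0 * len(detected) / len(signatures)
    # Faults with identical non-zero signatures cannot be told apart.
    ambiguity = [sorted(faults) for obs, faults in groups.items()
                 if any(obs) and len(faults) > 1]
    return coverage, ambiguity


# Hypothetical fault signatures (illustrative values, not results of the case study).
signatures = {
    "fault_A": {"P1": +1, "F1": -1},
    "fault_B": {"P2": -1, "F1": -1},
    "fault_C": {"P2": -1, "F1": -1},
    "fault_D": {"P3": +1},
}
cov, amb = coverage_and_ambiguity(signatures, ["P1", "P2", "F1"])
print(cov, amb)
```

In this invented example the sensor set misses fault_D entirely and cannot separate fault_B from fault_C, which is the kind of trade-off that a comparison of sensor set solutions makes visible.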


6 CONCLUSIONS 


The definition and articulation of cost-effective maintenance regimes is a challenging task
for subsea assets. Very often, they are over-engineered since they are required to operate
in harsh conditions for long periods of time. Various stakeholders are involved with these
assets throughout their entire lifecycle, and knowledge related to the degradation of this
equipment is scattered throughout various organizations, being owned and used by different
engineering functions. In this paper, an integrated PHM development process was presented as
a multi-disciplinary engineering analysis, complementing the current development of subsea
equipment. The proposed development process aims to help subsea designers to integrate the
development of the PHM capability of a subsea asset with the actual design of such systems.
This is meant to happen at the early stages of the design process, but the process also
enables the retrofit of such capabilities on legacy subsea fields to achieve higher
availability and operational reliability figures. This is achieved by placing the
reliability, availability and maintainability engineering analysis at the heart of the
subsea asset and PHM system level design.
ABSTRACT:  Rubber-metal-elements are used in a wide range of applications for vibration and sound
isolation. Nowadays it is state of the art to calculate the lifetimes of these elements under mechanical
stress prior to their service life. To establish more reliable and safer rubber-metal-elements, continuous
monitoring by different sensors can be used. Prognostics in particular enables a rise in reliability,
availability and safety. To realize these advantages, the remaining useful lifetime (RUL) of a
rubber-metal-element should be estimated during its service life based on current information on its
condition. Therefore a suitable measure to monitor the condition of the element is necessary. This work
focuses on temperature signals. This approach allows the ambient temperature to be included and thereby
accounts for changing operating conditions. For estimating the RUL of rubber-metal-elements a model-based
prognostics approach based on particle filtering is proposed. Its performance is analyzed with regard to
relevant parameters to enable the best performance for the applied data.


1 INTRODUCTION 


1.1 State of the art and motivation 


Rubber-metal-elements are used in a wide range 
of applications for sound isolation and in particu- 
lar for isolating critical components from strong 
vibrations. Typical applications are trains, trucks 
and wind turbines. The bearing in focus is dis- 
played in Figure 1. It consists of an inner steel ring, 
a rubber part and an outer steel ring which is slot- 
ted. This main part is within the outer hollow cyl- 
inder which is used in combination with the inner 
bolt for generating a prestress on the rubber part. 
Nowadays, it is state of the art to follow a preventive maintenance strategy when handling
these bearings. For this purpose, the lifetime of the bearing needs to be estimated prior to
its service life. It is estimated conservatively by the developer based on experience,
lifecycle tests and the conditions of the planned application. This calculation is often
based on linear damage accumulation theory (Spitz 2012). A preventive maintenance strategy
has some drawbacks regarding optimal utilization of the resource and costs. Moreover,
today’s industry has growing expectations concerning the efficiency of capabilities and
availability. That is why condition monitoring is gaining more and more importance in the
field of maintenance.


1.2 Maintenance strategies 


Maintenance can be divided into different strategies according to DIN EN 13306. The oldest
strategy is reactive maintenance. Technical systems were used until failure and needed to be
repaired or replaced once they reached their end of lifetime. This procedure leads to a
number of problems concerning costs and time, for example possibly high consequential costs
due to unplanned downtime. Therefore the preventive maintenance strategy was developed. In
that case the mechanical lifetimes of technical systems are calculated based on experience,
lifecycle tests and fatigue life calculations. However, this maintenance strategy does not
allow the whole lifetime of a single product to be exploited. The calculation is a
generalized one that is based on assumptions regarding the expected loads over all bearings.
Furthermore, safety factors are included in the calculations that ensure with a high degree
of certainty that every product is maintained or replaced prior to its end of lifetime.
However, this strategy provides no information on the current state of individual bearings,
which experience individual loads during their lifetime and therefore degrade individually.
That is why, on the one hand, early failures can occur and, on the other hand, bearings are
replaced although their lifetimes are not yet exhausted. These disadvantages of the
previously named strategies are the reason why the condition based maintenance strategy was
developed. This strategy is mainly based on the condition of the product in focus. Using
different kinds of sensors, information about the condition of the product is acquired and
processed by condition monitoring methods. Maintenance can thus be planned optimally based
on the condition of the product and, in the case of prognostics, additionally on the
estimated remaining useful lifetime (RUL), which improves the reliability of the product and
leads to optimized efficiency.

Figure 1. Rubber-metal-bearing.


1.3 Structure of the following sections 


In this work the prognostics method used to estimate the RUL of these rubber-metal-bearings
is analyzed regarding its performance on temperature data of these bearings. The method used
is a particle filter. Because different types of particle filters exist (Arulampalam et al.
2002, Jouin et al. 2016), chapter 2 presents the method used with regard to type-related
differences. With the aim of realizing the best RUL prediction for rubber-metal-bearings,
relevant parameters of the method are identified for a performance analysis. Chapter 3
focuses on the necessary lifecycle tests and the data generated for developing the condition
monitoring system. Chapter 4 deals with the analysis of the temperature-based prognostics.
Two different measured values are implemented and the particle filtering performance is
analyzed based on varied parameters. In chapter 5 a conclusion and a short outlook are given.


2 PROGNOSTICS METHODS 


2.1 Types of particle filter


Particle filters are Monte Carlo methods that are based on Bayesian probability theory.
These filters are model-based methods for state estimation that are appropriate for
estimating non-linear behavior. Currently, particle filtering is a classical method for
model-based predictions of RUL (Jouin et al. 2016). Moreover, particle filters create a
probabilistic output which can be used to present the uncertainty involved in RUL
predictions. Additionally, this method was chosen because a multi model particle filter has
been successfully applied to other signals of rubber-metal-elements (Bender et al. 2017b,
Bender et al. 2017a, Bender et al. 2017c).

These filters can be divided into different types. The commonly known ones are the Auxiliary
particle filter, Unscented particle filter, Regularized particle filter, Sequential
Importance Sampling (SIS) filter and Sampling Importance Resampling (SIR) filter
(Arulampalam et al. 2002). In this work a SIR particle filter is used for estimating the RUL
of rubber-metal-bearings due to the named advantages and the SIR-related mitigation of
particle degeneracy. Particle degeneracy is a weakness of the classical SIS filter. The SIR
particle filter is a further development of the SIS filter and prevents that degeneracy by
resampling. All these Monte Carlo based filters use random samples, called particles, to
estimate the state of the monitored product in the form of a distribution. The relevance of
each sample is expressed by a weight. These weights are calculated based on a defined
distribution and a comparison of the predicted and the measured values. In the case of
degeneracy, after a few iterations most of the particle weights tend towards zero while only
one particle has a large weight. That means that only one particle forms the basis for the
state estimation and the subsequent estimation of the RUL. Nevertheless, all particles are
still part of the estimation even if their influence on the result tends towards zero. This
degeneracy problem can be solved by resampling. Thereby only relevant samples survive, that
is, samples with a higher weight. Those samples form the basis for the next prognostics step
while the probably irrelevant particles are no longer considered. In that case a less varied
set of samples is used, but the result is more accurate. The RUL prediction is an iterative
method. As long as measured data is available, the weights can be updated and resampling can
be performed (Arulampalam et al. 2002, Jouin et al. 2016).
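
The predict, weight and resample cycle described above can be sketched in a few lines. This is a generic illustration and not the authors' implementation. The drift model, the Gaussian measurement likelihood, the noise levels and the measurement values are assumptions, and plain multinomial resampling is used.

```python
import math
import random


def sir_step(particles, weights, measurement, transition, meas_std, rng):
    """One SIR iteration: predict, re-weight against the measurement, resample."""
    # Prediction: propagate every particle through the (assumed) state model.
    predicted = [transition(x, rng) for x in particles]
    # Update: weight each particle by an assumed Gaussian measurement likelihood.
    likelihoods = [math.exp(-0.5 * ((measurement - x) / meas_std) ** 2)
                   for x in predicted]
    new_weights = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(new_weights)
    if total == 0.0:
        # All particles are far from the measurement: fall back to uniform weights.
        new_weights = [1.0 / len(predicted)] * len(predicted)
    else:
        new_weights = [w / total for w in new_weights]
    # Resampling (multinomial): particles with higher weights survive more often.
    resampled = rng.choices(predicted, weights=new_weights, k=len(predicted))
    uniform = [1.0 / len(resampled)] * len(resampled)
    return resampled, uniform


rng = random.Random(0)
n = 500
particles = [rng.uniform(20.0, 40.0) for _ in range(n)]  # hypothetical temperatures
weights = [1.0 / n] * n


def drift(x, rng):
    # Hypothetical state model: slow temperature rise plus process noise.
    return x + 0.1 + rng.gauss(0.0, 0.2)


for z in [30.0, 30.1, 30.2, 30.3]:  # invented measurements
    particles, weights = sir_step(particles, weights, z, drift, 0.5, rng)

estimate = sum(p * w for p, w in zip(particles, weights))
print(round(estimate, 2))
```

Because the weights are reset to uniform values after each resampling, the state estimate is simply the mean of the surviving particles.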

The general structure of a particle filter is given 
in Figure 2. The models are developed based on 
data for training that is presented in chapter 3. Due 
to the complex, nonlinear behavior and multiple 
ways of degradation, no physical model of failure 
for rubber-metal-elements exists. Therefore empiri- 
cal parameterized models are implemented. For 
every dataset these parameters are estimated by 
using Differential Evolution, a population based 
optimization algorithm (Elsayed et al. 2012). So, 
every model is related to one bearing. These mod- 
els are used within the method for state estimation 
based on samples. Therefore a multi model version 
of a SIR particle filter is implemented. The gen- 
eral state equation to estimate the samples is given 
in Equation 1 (Vachtsevanos 2006, Arulampalam 
et al. 2002) and the particular state equation in 
Equation 2. 
Figure 2. Structure of a particle filter.

where $x_i$ is the state vector at time $t_i$, mdl is the model with its parameters $p$ and
$v_i$ is added noise. The model parameters are chosen based on the weights of the previous
state vector. In this version each initial sample (initial state) is generated by one of the
models and the first measurement. With an appropriate number of samples, the model choice is
equally distributed for the initial sample generation. Therefore different numbers of
samples are evaluated in chapter 4. If new measurements are available, the estimated states
can be corrected through resampling. With the aim of estimating the RUL, the prediction step
is repeated until a given threshold is reached by the samples.
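
The two ideas in this paragraph, an equal distribution of the models over the initial samples and a prediction step repeated until a threshold is reached, can be sketched as follows. The growth model, its rates, the threshold and all noise levels are hypothetical placeholders and do not reproduce the empirical models of this work.

```python
import random
import statistics


def init_particles(first_measurement, models, n, rng, spread=0.5):
    """Create n samples; models are assigned round-robin so each seeds an equal share."""
    return [(first_measurement + rng.gauss(0.0, spread), i % len(models))
            for i in range(n)]


def predict_rul(particles, models, threshold, max_steps, rng, noise_std=0.05):
    """Propagate every sample until it reaches the threshold; return steps per sample."""
    ruls = []
    for x, model_index in particles:
        rate = models[model_index]  # assumed per-step growth rate of this model
        steps = 0
        while x < threshold and steps < max_steps:
            x = x * (1.0 + rate) + rng.gauss(0.0, noise_std)
            steps += 1
        ruls.append(steps)
    return ruls


rng = random.Random(1)
models = [0.008, 0.010, 0.012]  # hypothetical parameters of three bearing models
particles = init_particles(60.0, models, 300, rng)
ruls = predict_rul(particles, models, threshold=80.0, max_steps=1000, rng=rng)
median_rul = statistics.median(ruls)
print(median_rul, min(ruls), max(ruls))
```

The list of per-sample values approximates the RUL distribution, so a point estimate such as the median can be reported together with its spread.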


2.2 Relevant parameters 


Variable parameters of a particle filter influence the 
accuracy of prognostics. The parameters to be ana- 
lyzed are number of simulations, number of samples, 
the measured values and the resampling strategy. 

The accuracy of a particle filter strongly depends on the number of particles. That is
because a large random sample of a defined distribution is more likely to represent that
distribution well than a smaller random sample. To show the influence of a variable number
of samples on predictions of RUL, three possible numbers of samples are compared. In this
context the number of simulations is analyzed as well.

The measured values in focus are temperatures acquired in or close to the bearing. It was
observed that the temperature of rubber-metal-bearings changes over their lifetime,
especially at the end of their lifetime. Because the bearing temperature is influenced by
operating conditions, these conditions should be considered. In chapter 4 measurements based
on similar conditions are implemented, including similar exciter power, similar frequency
and similar bearings. Nevertheless, there is one parameter that cannot be kept constant: the
ambient temperature. That is why the ambient temperature is measured as well. The relative
temperature $\Delta T$ combines both temperatures by subtraction,
$\Delta T = T_{\text{bearing}} - T_{\text{ambient}}$. In the following chapter both measured
values, absolute bearing temperature and relative temperature, are presented.

To mitigate the degeneracy problem, resampling can be included in the particle filter.
Multiple resampling schemes exist (Arulampalam et al. 2002, Ignatious, Lincon 2013); in this
work the SIR is implemented. One point of interest in this context is the question of when
to resample. Two possibilities are compared for the application of rubber-metal-bearings.
The first, continuous strategy performs resampling in every iteration step, which is easy to
implement but leads to high computational cost. The other strategy is based on a defined
threshold for resampling. In this case resampling is only executed if the condition is
fulfilled. The implemented threshold is based on the effective sample size $N_{\text{eff}}$,
which is a measure of degeneracy. The effective sample size cannot be computed exactly,
therefore an estimate $\hat{N}_{\text{eff}}$ of $N_{\text{eff}}$ is used here, see Equation 3.

$$\hat{N}_{\text{eff}} = \frac{1}{\sum_{i=1}^{N} (\omega^{i})^{2}} \qquad (3)$$

where $\omega^{i}$ is the normalized weight of particle $i$ and $N$ is the number of
particles (Arulampalam et al. 2002). A threshold needs to be defined which allows resampling
when $\hat{N}_{\text{eff}}$ is smaller than that threshold. This resampling strategy needs
less computational time because resampling is not performed in every iteration step.
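
As a minimal sketch of the threshold-based strategy, the estimate of the effective sample size can be computed as the reciprocal of the sum of squared normalized weights, following Arulampalam et al. (2002). The function names and the example weights are illustrative assumptions.

```python
def effective_sample_size(weights):
    """Estimate N_eff as 1 / sum of squared normalized weights."""
    total = sum(weights)
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)


def should_resample(weights, threshold):
    """Threshold-based strategy: resample only if the estimate falls below the threshold."""
    return effective_sample_size(weights) < threshold


uniform = [1.0 / 100] * 100                 # no degeneracy: estimate close to 100
degenerate = [0.97] + [0.03 / 99] * 99      # one dominant particle: estimate close to 1

ess_uniform = effective_sample_size(uniform)
ess_degenerate = effective_sample_size(degenerate)
print(ess_uniform, ess_degenerate)
print(should_resample(uniform, 50), should_resample(degenerate, 50))
```

The estimate ranges from 1 (a single particle carries all the weight) to the number of particles (all weights equal), which is why a threshold between these two bounds can be used as a trigger.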


3 LIFECYCLE TESTS 


3.1 Lifecycle tests 


Testing rubber-metal-elements is a complex task. 
Due to their nonlinear behavior and wide distri- 
butions concerning lifetime characteristics of rub- 
ber caused by manufacturing, lifetime estimation 
is not trivial (Steinweger 2006, Wallmichrath et al. 
2009). That is why nowadays preventive mainte- 
nance based on prior calculated lifetimes, often 
using linear damage accumulation, is state of the 
art in applying rubber-metal-elements. 

Due to a lack of real data, lifecycle tests are 
performed to generate data for prognostics. Here 
accelerated lifecycle tests with an increased excita- 
tion force are realized because of the long lifetime 
of these bearings. In the suspension system of 
trains they are used for up to 8 years (Bender et al. 
2017b). These lifecycle tests are performed on a 
vibration analysis system using a hydraulic cylinder 
as exciter. It enables movements of the outer ring 
of the bearing, whereas the inner ring is fixed. The 
rubber between those rings allows a small move- 
ment. Under this mechanical stress the characteris- 
tics of rubber change over time due to degradation. 
Finding a suitable measure to monitor a rubber- 
metal-bearing condition is a challenging task due 
to non-linear rubber characteristics and many pos- 
sible impacts on the lifetime of rubber. Moreover, 
the structure of the elements increases the difficulty 
of installing a sensor for a suitable and reliable 
measure. In this work the focus is on temperature, 
a measure that is used in other applications as well, 
for example ball bearings (Kimotho, Sextro 2015) 
or subsystems of wind turbines (Crabtree 2011). 
The corresponding concept for temperature measurements in rubber-metal-bearings is
introduced in (Bender et al. 2017c). Based on that work, a prototype of a
rubber-metal-bearing was developed that enables temperature measurement inside the bearing.
Integrating a sensor inside the rubber introduces a weak point and could lead to a shorter
useful lifetime. Moreover, (Molls 2013) showed that the temperature inside the rubber part
of rubber-metal-bearings deviates by a maximum of 3 °C from temperature measurements at its
surface. Therefore, the thermocouples used are placed inside the outer ring of the bearing
close to the surface of the rubber. Small pockets are formed in the metal, in which the
thermocouples are bonded. These pockets protect the sensitive thermocouples from external
influences. In addition to the absolute temperature of the bearing, the ambient temperature
is measured close to the lifecycle tests.


3.2 Measurement data 


For the temperature measurements, sheathed type K thermocouples are used in the lifecycle
tests to monitor the temperature of the bearing. Moreover, they are robust enough to
withstand the conditions of the tests and of real applications. Data is measured over the
whole lifecycle test, including data of the failure state. Prior to the prediction, the
measured data is preprocessed for generating empirical models. As shown before, these models
are based on a combination of exponential functions which describes the graph of the
measurements. The characteristic graphs of the absolute temperature of three bearings are
shown in Figure 3.

Figure 3. Absolute temperatures of bearings during lifecycle tests.

Figure 4. Relative temperature during lifetime of bearing 3.

In the beginning the absolute temperature of the bearings rises strongly, before it
fluctuates during the main part of the life of a bearing. Bearing 2 shows a small fall in
temperature during that time, whereas the temperature of bearing 3 stays almost constant. In
the last part all temperature curves rise until the end of lifetime is reached. Analyzing
Figure 3, it becomes obvious that, in addition to their common characteristics, these curves
differ in aspects such as starting and ending temperature, lifetime of the bearing and the
corresponding graph. This has different reasons based on the characteristics of the bearing
and the operating conditions, especially the ambient temperature. That is why the ambient
temperature is included in the second measured value, the relative temperature. The graph of
the relative temperature for bearing 3 is depicted in Figure 4. In general the curve of the
relative temperature shows characteristics similar to those of the absolute temperature of a
bearing during its lifetime. The significant temperature rise in the beginning and in the
end is related to the absolute temperature of the bearing. Because the ambient temperature
fluctuates more easily than the temperature inside the bearing, the relative temperature
fluctuates during the main part of the lifecycle test.
Moreover a stop of the test after about 10° cycles 
leads to a falling temperature because of a cooling. 
After starting the test again the temperature of the bearing rises quickly. Due to the
similar graphs, all models of both measured values are based on the same state equation;
only the parameters differ. All in all, the relative temperatures have a more similar value
range than the absolute temperatures of the bearings. Therefore the models should fit better
and might result in an improved prediction. Both measured values are implemented in the
following analysis, which focuses on the influence of the previously mentioned parameters.


4 ANALYSIS OF PREDICTIONS 


4.1 Test structure 


In this chapter the presented measured values are 
used for estimating the RUL with the presented 
SIR particle filter. To find the best performance 
different parameters shall be implemented and 
compared. The following tests are evaluated on: 


1. Measured values: absolute temperature of bearings and relative temperature

2. Number of simulations

3. Number of samples

4. Resampling strategy including different thresholds


For evaluating the performance a metric based on relative error is used. This metric is
calculated according to Equation 4.

$$\text{Error} = \frac{RUL_{\text{real}} - RUL_{\text{estimated}}}{RUL_{\text{real}}} \cdot 100\% \qquad (4)$$

where $RUL_{\text{real}}$ is the current RUL of the element and $RUL_{\text{estimated}}$ is
the predicted RUL. In this work the RUL is estimated at different times from 15 to 95% of
the spent lifetime of the bearings. The reported error is the mean of the errors of the RULs
estimated at these different times. Positive and negative errors thereby compensate each
other; therefore the number of negative errors is given in brackets to give an impression of
the sign of the single errors. As an example, Figure 5 depicts the RULs at the mentioned
starting times for bearing 1. Illustrated are the real (grey circles) and the estimated RULs
(black squares). The dashed lines mark an error band of 15%. Only one error is negative
(1 N), which is good: an estimated RUL greater than the real one means a prediction that is
too late and thereby a possible unwanted breakdown of the system. The parameters used to
generate this result are 100 particles, three simulations, resampling in every iteration
step and the relative temperature as measured value.
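
The reported metric, a mean signed relative error together with the number of negative errors, can be sketched as follows. The sign convention (real minus estimated RUL, so that a late prediction gives a negative error) is inferred from the description above, and the RUL values are invented for illustration.

```python
def mean_relative_error(rul_real, rul_estimated):
    """Return (mean signed relative error in %, number of negative errors)."""
    # Assumed sign convention: negative when the estimated RUL exceeds the real RUL.
    errors = [100.0 * (real - est) / real
              for real, est in zip(rul_real, rul_estimated)]
    mean_error = sum(errors) / len(errors)
    negatives = sum(1 for err in errors if err < 0)
    return mean_error, negatives


# Invented RULs at four prediction times (arbitrary units).
rul_real = [100.0, 80.0, 60.0, 40.0]
rul_estimated = [90.0, 70.0, 66.0, 35.0]

mean_error, negatives = mean_relative_error(rul_real, rul_estimated)
print(f"{mean_error:.2f} ({negatives} N)")
```

Since errors of opposite sign cancel in the mean, a small mean error alone does not guarantee accurate single predictions, which is why the count of negative errors is reported alongside it.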


4.2 Results for bearing temperature 


In this section the prognostics are based on absolute bearing temperature measurements and
the associated models. The first parameter of interest is the number of simulations. Because
the SIR particle filter is based on probability, it is necessary to reach a result that is
repeatable within certain limits. To find a suitable number of simulations, three
alternatives between three and 100 simulations are tested and the results are displayed for
three test bearings in Table 1. The tests are numbered. An ‘a’ is added to the name to
indicate the measured value absolute temperature of bearings. Table 1 shows that the
predictions of the RUL of these three test bearings do not lead to the same result regarding
the best number of simulations. Each of the bearings shows the best performance for a
different number of simulations. However, the number of negative errors differs only
slightly for each bearing. SIR particle filters are sensitive to outliers (Arulampalam et
al. 2002), which is why a high number of simulations is necessary to compensate for
outliers. To reach valuable results, particle based methods need a suitable (minimum) number
of samples or particles. In the previous simulations 100 particles were implemented.

Figure 5. Estimated RUL at different times of bearing 1.

For that reason 10 and 100 trials are analyzed again for a varying number of samples from
100 to 1000 with the aim of improving the result. The performance metrics for the three test
bearings are compared in Table 2. Again no consistent results over all bearings exist.

Bearing 3 performs best for 1000 samples, bearing 1 for more than 100 samples and bearing 2
for 100 samples. This indicates that especially bearing 2 leads to unexpected results.
Furthermore, only 2/3 of the results exhibit a better performance for 100 simulations. This
may be related to the small number of models. However, the difference between the results of
the different numbers of trials decreases with a growing number of samples. To analyze the
influence of the number of samples and resampling thresholds on the performance, both
parameters are varied in the next step for bearing 1. So far continuous resampling was
implemented; in Table 3 both resampling strategies are realized. The resampling threshold is
in a first step based on the mean effective sample size measured during a continuous
resampling strategy. In the following steps it is adapted to the performance metric. The
chosen number of simulations is ten.

Table 3 underlines that a threshold based resampling strategy is able to improve the
performance for both numbers of samples. While for 100 particles a threshold of around 50
leads to the best performance, for 1000 particles a threshold of 410 is the best. A
threshold based resampling strategy can compensate for worse performance metrics caused by a
smaller number of simulations. As Table 3 shows, a parameter combination of 10 simulations,
1000 samples and a threshold of 410 (test 15a) leads to a result similar to that of a
parameter combination of 100 simulations, 1000 samples and a continuous resampling strategy
(test 7a). In the context of online prognostics this could be a great advantage, since fewer
simulations and a threshold based resampling strategy need less computational cost. However,
the error is smaller for 100 samples, but more sensitive to unwanted negative prediction
errors.

Table 1. Influence of number of simulations on prognostics of absolute temperature of bearings.

Test No.   Simulations   Error (bearing 1) in %   Error (bearing 2) in %   Error (bearing 3) in %
1a         3             12.6 (1 N)               -26.0 (17 N)             13.1 (3 N)
2a         10            15.7 (0 N)               -19.1 (17 N)             13.7 (0 N)
3a         100           14.0 (1 N)               -22.6 (17 N)             13.0 (1 N)

Table 2. Influence of simulations and number of samples for absolute temperature of bearings.

Test No.   Simulations   Number of samples   Error (bearing 1) in %   Error (bearing 2) in %   Error (bearing 3) in %
2a         10            100                 15.7 (0 N)               -19.1 (17 N)             13.7 (0 N)
3a         100           100                 14.0 (1 N)               -22.6 (17 N)             13.0 (1 N)
4a         10            500                 13.3 (0 N)               -24.4 (17 N)             10.1 (1 N)
5a         100           500                 13.5 (0 N)               -24.3 (17 N)             10.0 (1 N)
6a         10            1000                13.6 (0 N)               -24.5 (17 N)             9.6 (2 N)
7a         100           1000                13.4 (0 N)               -24.0 (17 N)             9.3 (1 N)

Table 3. Influence of number of samples and resampling strategy (bearing 1) for absolute temperature of bearings.

Test No.   Number of samples   Resampling threshold   Error in %
2a         100                 -                      15.7 (0 N)
8a         100                 40                     13.3 (1 N)
9a         100                 42                     13.1 (2 N)
10a        100                 45                     15.7 (0 N)
11a        100                 50                     12.0 (2 N)
6a         1000                -                      13.6 (0 N)
12a        1000                390                    13.8 (0 N)
13a        1000                400                    14.5 (0 N)
14a        1000                405                    14.1 (0 N)
15a        1000                410                    13.4 (0 N)
16a        1000                412                    15.5 (0 N)


4.2.1 Results for a further position of measurement

Molls suggested that the temperature inside the rubber changes quite similarly to the
surface temperature (Molls 2013). The measurements in this work show similar temperature
behavior between the bearing temperature and the temperature measured at the bolt of a
bearing. In Figure 6 these two temperatures are visualized for bearing 3.

Figure 6. Absolute temperature of bearing and bolt temperature.

This leads to the possibility of testing the method and the models, based on those of
absolute temperature of bearings, on new bearings whose bolt temperatures are known. The
parameters used are 100 particles, 10 simulations and a resampling threshold of 50 effective
samples. The resulting errors are 12.6% (0 N) for bearing I and 4.0% (8 N) for bearing II.
Figure 7 depicts the result of bearing I. The errors are within a 15% error band or just
below it. The estimated RULs of bearing II fluctuate more strongly, as the large number of
negative errors shows, although the mean error of 4.0% seems to be good.

Figure 7. Estimated RULs for bolt temperature of new bearing I.

These two tests show that temperature measurements close to the rubber-metal-element that
have similar characteristics could be used for estimating RULs in the case of missing
bearing temperatures.


4.3 Results for relative temperature 


In this paragraph predictions based on the relative 
temperature are evaluated. Because of the former 
results and the similarities between the two meas- 
ured values, this evaluation is based only on bear- 
ing 1. The implemented models base on relative 
temperatures measurements. The structure of this 
analysis is similar to the one in chapter 4.2, the 
parameters simulations, number of samples and 
resampling strategy are varied and the performance 
metric is evaluated. In Table 4 the influence of the 
number of simulations on the performance of bear- 
ing 1 is shown for 100 particles. The names of the 
numbered tests for this measured value contain an 
added ‘b’. The error falls with increasing number 
of simulations. The number of negative errors is 
small and nearly constant as before. The best result 
is based on 100 simulations that is why the predic- 
tions shown in Table 5 include 100 simulations. 
Table 5 depicts an improved estimation of the 
RUL for an increasing number of samples. The 
best performance of 12.0% is predicted for 1000 
particles. The former good result with less simula- 
tions and a threshold based resampling (error,;,) 
should be examined in the next step for the rela- 
tive temperature. Table 6 shows the results for vary- 
ing resampling strategies and thresholds based on 


Table 4. Influence of simulations (bearing 1) for the 
relative temperature. 


Test No. Simulations Error in % 
1b 3 15.4 (1 N) 
2b 10 12.7 (0 N) 
3b 100 12.4 (0 N) 
Table 5. Influence of number of samples (bearing 1) for 


the relative temperature. 


Test No. Number of samples Error in % 
3b 100 12.4 (0 N) 
4b 500 12.2 (0 N) 
5b 1000 12.0 (0 N) 
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Table 6. Influence of resampling strategy (bearing 1) 
for the relative temperature. 


Number of Resampling Error 

Test No. samples threshold in % 

6b 1000 ~ 12.5(0 N) 

7b 1000 390 12.5 (0 N) 

8b 1000 395 12.0 (0 N) 

9b 1000 400 11.8 (0 N) 
10b 1000 405 11.9 (0 N) 
11b 100 45 12.3 (0 N) 


10 simulations and 100 or 1000 samples. Similar
to Table 3, Table 6 shows possible performance
improvements based on suitable thresholds. Because
the performance metric differs only slightly between
the numbers of samples, the best performance was
evaluated for both 100 and 1000 samples.
An error of 12.3% was achieved for a threshold of
45 using 100 samples. The best threshold of 400
yields an error of 11.8% using 1000 samples. Comparing
the threshold-based result for 1000 particles
and 10 simulations (test 9b) with that of continuous
resampling with 100 simulations (test 5b), the
threshold-based resampling slightly improves on the
former result (11.8% versus 12.0%). It can be concluded
that threshold-based resampling can improve the
performance of the SIR particle filter. At the very
least, it saves computation time.
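
The threshold-based resampling discussed above can be sketched in Python as follows. This is only an illustration under stated assumptions: it assumes that the resampling threshold is compared against the effective sample size N_eff = 1 / sum(w_i^2), which the text does not state explicitly, and it uses systematic resampling as one common scheme. All function names are hypothetical and this is not the authors' implementation.

```python
import numpy as np


def effective_sample_size(weights):
    """Effective sample size N_eff = 1 / sum(w_i^2) of normalized weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)


def systematic_resample(weights, rng):
    """Return particle indices drawn by systematic resampling."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    n = len(w)
    # One random offset, then n evenly spaced positions in [0, 1).
    positions = (rng.random() + np.arange(n)) / n
    cumulative = np.cumsum(w)
    cumulative[-1] = 1.0  # guard against round-off in the cumulative sum
    idx = np.searchsorted(cumulative, positions, side="right")
    return np.minimum(idx, n - 1)  # guard against a position rounding to 1.0


def resample_if_needed(particles, weights, threshold, rng):
    """Resample only if N_eff falls below the threshold (assumed criterion).

    Returns the (possibly resampled) particles, the normalized weights and a
    flag telling whether resampling took place.
    """
    particles = np.asarray(particles)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    if effective_sample_size(w) >= threshold:
        return particles, w, False
    idx = systematic_resample(w, rng)
    n = len(w)
    return particles[idx], np.full(n, 1.0 / n), True
```

With 1000 particles and a threshold of 400, as in test 9b, equally weighted particles would be left untouched, while a strongly degenerate weight vector would trigger resampling. Skipping the resampling step whenever the weights are still well spread is what saves computation time.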


4.4 Comparison of the results 


In this part of chapter 4 the results for the two
measured values are compared. Regarding the number
of simulations, 100 simulations are less prone to
outliers and therefore usually lead to the best results.
Comparing the errors, the relative temperature
enables better performance with respect to
the number of simulations: while the smallest
error for the absolute temperature of bearing 1
is 14.0%, the error for the relative temperature
of bearing 1 is 12.4%. In most cases the errors are
reduced by an increased number of samples.
Once again the relative temperature performs
better than the absolute temperature of the bearings
(12.0% versus 13.3% for bearing 1). Finally,
the analysis of different resampling strategies
shows that threshold-based resampling
with a suitable threshold leads to similarly good
performance with a smaller number of simulations.
Moreover, in the case of the relative temperature
the performance is slightly improved
(12.0% for test 5b versus 11.8% for test 9b).

All in all, estimating the RUL of rubber-metal-bearings
is possible based on temperature measurements
and an SIR particle filter for almost constant
conditions. In reality, bearings in service experience
varying conditions, e.g. changing excitation.
If the operating conditions change to a
great extent, the measurements will no longer be
similar. Therefore, implementing
or adapting models for different excitation forces
seems to be necessary.


5 CONCLUSIONS


In this paper a temperature-based estimation of
the RUL of rubber-metal-bearings is introduced.
To evaluate and improve the performance of the
SIR particle filter, the number of simulations, the
number of samples and the resampling strategy are
analyzed. The predictions are based on two different
measured values: the absolute temperature of the
bearings and the relative temperature, which combines
the ambient temperature and the absolute temperature
of the bearings.
Predictions based on relative temperature show
better performance than those based on absolute
temperature. The reason lies in larger differences
between the temperature curves, which lead to more
variance in the results than for predictions
based on relative temperature. Regarding the
parameters, on average 10 to 100 trials and 1000
particles are a good choice for this application.
Moreover, both predictions can be improved by
a threshold-based resampling strategy. It can be
concluded that, even though rubber-metal-bearings
show nonlinear behavior and slightly changing
characteristics, threshold-based resampling in
combination with a suitable threshold enables a
relatively good RUL estimation based on temperature
measurements.

Open questions relate to the threshold of
the previous predictions. In this work the end of
lifetime is defined by the last measurement. Therefore,
a threshold that marks the end of lifetime still
needs to be estimated. Moreover, the search for
thresholds for effective resampling can be optimized.

ABSTRACT: Different types of safety barriers are deployed in many infrastructures to reduce the
occurrence of hazards and to protect people, the environment and other assets if unexpected events
occur. The capacity of these barriers against hazards can be weakened by degradation or by
failures related to changes over time. It is natural to adapt the approaches of Prognostics and Health
Management (PHM) to monitor the conditions and measurable parameters of safety barriers, and to predict
their future performance by assessing the extent of degradation. This study aims to identify the unique
features and possible challenges when implementing PHM on safety barriers. Definitions and classifications
of safety barriers are discussed, taking into account their installation environment in infrastructures, in
order to reveal which characteristics of barriers lead to a higher demand for prognostics and health
monitoring. Another objective of this paper is to review the qualitative and quantitative measures for the
capacity and performance of safety barriers, and to explore possible methods and research gaps in
the assessment of different PHM strategies, taking into account their effects on safety barriers and on
the infrastructures protected by the barriers.


1 INTRODUCTION 


Maintenance can be defined as the activities that
keep a system in working order (Do et al. 2015).
With the development of sensor technologies, the
maintenance of many complex systems involves
more and more condition-based and preventive
activities, to reduce maintenance costs on the one
hand and to improve system performance on the other
(Sharma et al. 2017, Liu et al. 2017). Prognostics
and Health Management (PHM), which covers
fault detection, diagnostics, prognostics and
health management, is a developing approach that
enables real-time health assessment of a system
and prediction of its future state based on up-to-date
information. PHM has been applied in many
fields, including manufacturing, aerospace
systems, railways, energy, and the military industry
(Sun et al. 2012, Pecht and Rui 2010).

Safety barriers are installed in many critical
systems and infrastructures to prevent hazardous
events or mitigate their consequences; examples are
fire prevention systems and railway signaling systems.
Technological safety barriers, such as shutdown
valves in the process industry and airbags in cars,
are also called safety-critical systems (Rausand 2014).
However, these safety barriers can also degrade and
fail to perform their safety function in an
evolving environment (Zio 2016). If the barriers
fail, serious accidents or disasters
may occur. Many studies have been carried out on
the operational and performance analysis of
safety barriers (Innal et al. 2015, Duijm and Goossens
2006, Rahimi et al. 2011, Cai
et al. 2012), and most of them assume that the
failures of components in the safety barriers follow
the exponential distribution (Guo and Yang 2008,
Jin and Rausand 2014, Catelani et al. 2011, Liu
and Rausand 2011), meaning that their failure rates
remain constant over time.

According to IEC 61508 (2010) and IEC 61511
(2003), many technological safety barriers consist
of three subsystems: sensor(s), logic solver(s) and
actuating unit(s). The mechanical actuating units
can degrade due to corrosion, wear-out, etc., and
become more vulnerable over time (Zio 2016),
so that the assumption of exponentially distributed
failures is challenged. Because of this concern,
growing attention is being paid to predicting the
degradation of safety barriers and providing suitable
maintenance in advance to ensure barrier adequacy.
PHM can be a helpful approach to performance
prediction and maintenance decision-making.

The purpose of this paper is to review the techniques
of PHM and the design and operational
characteristics of safety barriers, so as to explore
the research issues that arise when the PHM approach
is to be implemented to improve the
integrity of safety barriers.

The remainder of this paper is organized as follows:
Section 2 introduces the development and advantages
of PHM; Section 3 reviews
safety barriers in infrastructures and
introduces technological barriers; Section 4 introduces
several unmet problems and challenges
related to using PHM on safety barriers. A conclusion
is given in Section 5.


2 PROGNOSTICS AND HEALTH 
MANAGEMENT 


2.1 Development of PHM


PHM builds on the concept of condition-based
maintenance (CBM). CBM is an
approach in which maintenance actions are carried out
based on information collected through condition
monitoring of systems, in contrast to breakdown
or time-based preventive maintenance. To
make timely maintenance decisions, prognostics
is the key technology for CBM (Jardine et al.
2006, Shin and Jun 2015, Bousdekis et al. 2015),
and it is from this starting point that PHM has
developed. A CBM program consists of three
key steps (see Figure 1) (Lee 2004):


1. Data acquisition step; 
2. Data processing step; 
3. Maintenance decision-making step 


Diagnostics and prognostics are two aspects of
CBM. Diagnostics deals with fault detection, isolation
and identification when a fault occurs (Jardine
et al. 2006). Prognostics, according to ISO 13381 (2015),
estimates the time to failure and the risk for one or
more existing and future failure modes. The relative
placement of detection, diagnostics and prognostics
is illustrated in Figure 2 (Gouriveau
and Medjaher 2011).


[Figure 1: flow from Data Acquisition to Data Processing
to Maintenance Decision Making.]

Figure 1. Three steps in CBM.

Figure 2. Complementarity of detection, diagnostic and
prognostic activities (Gouriveau, 2011).


In the literature, prognostics is a process of health
assessment and prediction, which includes incipient
fault/failure detection, performance monitoring,
life tracking and prediction of the residual useful
lifetime (RUL) (Hess et al. 2005, Lee et al. 2014).

PHM is the extension of prognostics. According
to CALCE (Center for Advanced Life Cycle Engineering)
(2012), PHM is the means to predict and
protect the integrity of equipment and complex
systems, and to avoid unanticipated operational
problems leading to mission performance deficiencies,
degradation, and adverse effects on mission safety.

Sun et al. (2010) regard PHM as a methodology
to predict when and where failures will occur and
to mitigate risks by evaluating the reliability
of a system under its actual life cycle conditions. It
is an enabling discipline for solving reliability
problems in design, manufacturing, operation
and maintenance (Pecht and Jaai 2010). PHM
aims to use all information about a piece of equipment,
past, present and future, while considering its
environmental, operational and usage conditions, so as
to detect its degradation, diagnose faults, and predict
and manage failures (Zio 2012).

Haddad et al. (2012) regard PHM as a
discipline that can be used for: (i) evaluating the
reliability of systems over their life cycle; (ii)
determining the possible occurrence of failures and
reducing risk; (iii) estimating the Remaining Useful
Lifetime (RUL). In practice, modern and comprehensive
PHM systems take many issues into consideration,
such as fault detection, fault isolation, remaining
useful life, and performance degradation trending,
and thus provide a broader set of maintenance
benefits than any single function by itself (Hess et al. 2005).

In this paper, we understand PHM as an approach 
to carry out dynamic management based on RUL 
which is predicted by status information collected 
through actual life cycle conditions, including envi- 
ronmental, operational and usage conditions. 


2.2 PHM architecture 


PHM covers the complete process from capturing
data to decision-making (in maintenance,
lifetime control, equipment design, etc.) (Guillén,
Crespo, Macchi, & Gomez 2016). This process was
originally conceived in ISO 13374 and has gradually
become a standard in OSA-CBM (Open System
Architecture for Condition Based Maintenance).
As shown in Figure 3, the whole process of PHM
is based on that of CBM and can be divided into
two parts. The first part (from Level 1/L1 to Level
5/L5) is related to health monitoring and prognostics,
and the second part (Level 6/L6) is for health
management.

In such a process, PHM attempts to answer several
questions, e.g.:


• What is the current status of the system? (Performance
assessment).

• When will the system fail? (Remaining useful
lifetime).

• What will be the primary faults that cause system
failure?

• Why does the incipient fault occur?


2.3 PHM methodologies


To answer the above questions, prognostics is currently
carried out in different ways, namely with
model-based, data-driven and hybrid approaches
(Brahimi et al. 2016).

The model-based approaches rely on
good knowledge of the physics of the system and of
the known failure modes. With this knowledge, analysts
can construct mathematical models and analyze systems
for which field operational and failure data are
scarce (Lee et al. 2014,
Luo et al. 2003). However, for many complex
systems, one limitation of the model-based
approaches is the difficulty of creating detailed
models that represent the multiple physical processes
involved (Pecht 2008). Moreover, it is very difficult to
adapt models built for one specific application
to another, even when the systems are
very similar.

The data-driven approaches are based on statistics
and machine-learning techniques (Gu
et al. 2007). In data-driven prognostics, the remaining
useful life is predicted by fitting the monitoring data
of a developing fault to the degradation trend
and estimating when it will reach the predetermined
threshold level (e.g., see Medjaher et al. 2012). These
methods are relatively simple to deploy because they
do not require an analytical model of the behavior
and failure of the system.


[Figure 3: process levels shown include L4 health assessment
(diagnostics, failure mode), L5 prognostics assessment
(prognostics, failure mode evolution) and L6 advisory
generation (decision making).]

Figure 3. General process of PHM. Correlation with ISO 13374 (Guillén, 2016).

The hybrid approaches have been proposed in
consideration of the pros and cons of the previous two
groups (Lee 2004), and their prognostic results are
claimed to be more reliable. Hybrid approaches
have been used for RUL prediction and maintenance
of systems (e.g. Kumar et al. 2008,
Garcia et al. 2010, Skima et al. 2015, Zhang et al.
2009).

PHM has been applied in many areas, such
as infrastructures and the aerospace industry. In
this paper we focus on its application to safety
barriers.


3 SAFETY BARRIERS 


3.1 Safety barriers and classification


Safety barriers, or simply barriers, are the equipment
and features that are installed to protect people,
the environment and other assets against harm
should failures or deviations occur in the
system (Rausand 2013). Safety barriers
are always related to certain safety functions,
which are defined by Sklet (2006) as the functions
planned to prevent, control, or mitigate undesired
events or accidents.

Figure 4 is a bowtie diagram, widely used in the
field of risk analysis, in which we can identify
two different roles of safety barriers. A hazardous
event can occur due to some causes, so some
barriers are located on the left side of the diagram
(the causes side) to reduce the probability
of the hazardous event. Such barriers are
called proactive barriers or prevention barriers;
examples are the antilock braking system and the
electronic stability control system in automobiles.
Other barriers are located on the right side (the
consequences side) to reduce the effects of a
hazardous event, or failure escalation, once the event
has occurred. They are regarded as reactive barriers
or protection barriers, e.g. seat belts and airbag
systems (Hollnagel 2004, Rausand 2013, Groot 2016).


[Figure 4: bowtie diagram with causes and prevention
barriers on the left and consequences and protection
barriers on the right.]

Figure 4. Bowtie diagram for a Top Event with prevention
and protection barriers.

This classification is based on the objectives or
functions of barriers. In addition, considering the
operational modes of barriers, Rausand (2013)
distinguishes between passive and active
barriers. An active barrier depends on some
energy source and a sequence of detection, diagnosis
and action to perform its function, such as an airbag.
A passive barrier does not need
to take any action and achieves its function simply
through the presence of its elements (e.g. a seat belt).

Safety barriers can also be divided into on-line
and off-line barriers. On-line barriers operate
continuously or very often, whereas
off-line barriers are only used intermittently or
infrequently. In practice, most protective barriers are
off-line (Rausand & Arnljot 2004).

Sklet (2006), on the other hand, considers who or what
carries out the safety functions, and classifies
barriers as physical, technical, and human/
operational barriers. Combining this with the
categorization based on operational modes, we
obtain Figure 5. In the figure, technical barriers are
always active. They are further divided into three
groups: Safety Instrumented Systems (SIS), i.e.
technical barriers that involve electric,
electronic, and programmable electronic (E/E/PE)
technologies; other technology safety-related
systems; and external risk reduction facilities. In
the rest of this paper, we focus on technical barriers.


3.2 Technological barriers 


A technological barrier, involving E/E/PE technologies
and some mechanical items, generally
consists of three subsystems: an input element
subsystem (e.g., sensors, transmitters), a logic solver
subsystem (e.g., programmable logic controllers
[PLC]) and a final element subsystem (e.g., safety
valves, circuit breakers). The main parts are
illustrated in Figure 6.


[Figure 5: a barrier function is realized by a barrier system,
which is either passive (physical, human/operational) or
active (technical, human/operational); technical barriers
comprise Safety Instrumented Systems (SIS), other technology
safety-related systems and external risk reduction facilities.]

Figure 5. Classification of safety barriers (adopted
from Sklet, 2006).

Figure 6. Main parts of a technological barrier.


The system protected by a technological barrier 
is called the Equipment Under Control (EUC). A 
safety-instrumented function (SIF) is a function 
that has been designed to protect the EUC against 
a specific demand. To enhance the reliability of a 
barrier, redundancy is often implemented in the 
system configuration. 


4 DEVELOPMENT OF PHM FOR SAFETY
BARRIERS


PHM can be introduced to safety barriers
in order to assess the degree of deviation or
degradation of barriers and then plan maintenance
in advance, so as to improve their availability
and make the EUC safer.


4.1 Main functions of PHM on barriers 


Compared to the existing diagnostics of barriers,
PHM is expected to predict failures from incipient
failures or deviations in components. The main
functions and potential benefits of PHM on
barriers include:


• Advance warning of failures: Prognostics in
PHM can evaluate the degradation of barriers,
so as to detect incipient deviations. With prognostic
results for the given operational conditions,
maintenance staff can take action on a barrier
before a failure actually occurs.

• Optimized maintenance: With prognostics,
maintenance staff can also estimate the remaining
life of a component in a barrier, especially a
mechanical one, and then develop a maintenance,
repair or replacement plan. Compared
with scheduled maintenance, such condition-based
and predictive maintenance eliminates
unnecessary activities and keeps the barrier
effective.

• Logistic support and cost reduction: Ideal
prognostics tell the maintenance staff when and
where failures will occur, so that they can identify
and fix the failed components easily. PHM
can reduce lead time and therefore increase the
available time of safety barriers. Moreover,
“just-in-time” maintenance based on prognostics
reduces the unnecessary costs of scheduled
inspections and interruptions.


4.2 Challenges of PHM on barriers 


Although PHM has proven itself in many applications,
we may meet challenges when implementing
PHM on technological safety barriers, due to
the following design and operational characteristics
of barriers:


4.2.1 Operational modes of barriers

Current PHM is typically used for systems that run
continuously, while safety barriers have several
operational modes instead:


• Low-demand mode: the safety function is
only performed on demand, and the frequency
of demands is relatively low;

• High-demand mode: the mechanism
is the same as in low-demand mode, but the frequency
of demands is relatively high;

• Continuous mode: the safety function is
part of normal operation.


In the latest version of IEC 61508 (2010), the
borderline between low-demand and high-demand
is once per year in terms of demand frequency.
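
The distinction between the three modes can be written down as a small Python helper. This is only an illustrative sketch: the function name and arguments are hypothetical, and it assumes that a demand frequency of exactly once per year still counts as low-demand.

```python
def demand_mode(demands_per_year, part_of_normal_operation=False):
    """Classify the operational mode of a safety barrier.

    Assumption: the borderline of one demand per year is itself
    treated as low-demand.
    """
    if part_of_normal_operation:
        # The safety function runs as part of normal operation.
        return "continuous"
    if demands_per_year < 0:
        raise ValueError("demand frequency must be non-negative")
    return "low-demand" if demands_per_year <= 1.0 else "high-demand"
```

For example, a shutdown valve that is demanded about once every ten years would be classified as low-demand, whereas a barrier demanded monthly would be classified as high-demand.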

Technological barriers operating in a demand
mode are usually in a dormant
state and switch to an active state when a
demand occurs. The degradation mechanisms differ
between states. Not many studies have
been conducted so far on degradation prediction
with state transitions. New approaches and
parameters are needed to predict the future
performance of a barrier in response to demands and
during the demands.


4.2.2 Structures of barriers 
Redundancy structures are often used in barriers
to improve availability and to enhance safety,
e.g., two shutdown valves are installed in parallel
to stop flow when the downstream pressure is
too high. When one of them cannot activate, the
process is still safe if the other works. Such a
structure is called a 1-out-of-2 (1oo2) configuration.
For a system with N channels, if at least K of the N
channels need to be functional to ensure that the
system is functional, the system has a K-out-of-N
(KooN) configuration.
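
As a minimal illustration of the KooN definition, the following Python sketch computes the probability that a KooN structure is functional. It assumes that the channels are independent and identical, each working with probability p, which is the textbook simplification; the function name is hypothetical.

```python
from math import comb


def koon_reliability(k, n, p):
    """Probability that at least k of n channels are functional.

    Assumes independent, identical channels that each work with
    probability p (binomial model).
    """
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

Calling `koon_reliability(1, 2, 0.9)` gives the value for the two-valve example, and `koon_reliability(2, 3, 0.9)` the value for a 2oo3 structure. Degradation and common cause failures would violate the independence assumption used here.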

Many barriers can be adaptive, meaning that
they can change their configurations to keep performing
safety functions when certain events occur. For
example, a 2oo3 barrier can automatically switch
to a 2oo2 configuration when one of the three
channels fails. The challenge for PHM is to predict
the effects of degradation in one channel on the
entire barrier system with its complex configuration
and adaptivity, as well as on the EUC.


4.2.3 Failure modes and tests of barriers 

Failures of technological barriers can be classified
as dangerous (D) failures and safe (S) failures. A D
failure is a failure that has the potential to put
the barrier in a hazardous or fail-to-function state,
while an S failure does not leave the barrier in a
fail-to-function state (Rausand 2014), e.g. a valve
that shuts down unnecessarily.

The integrity of a technological barrier is closely
related to tests, especially for barriers running in
the low-demand mode. Regular proof tests are
conducted on technological barriers (e.g. once per
year) to reveal failures and then initiate maintenance
activities if necessary. Many modern safety
barriers have automatic self-testing
modules installed, which have a diagnostic function
and detect some failures. The D failures that can be
found in diagnostic tests are called dangerous
detected (DD) failures, such as signal loss, signal
out of range and final element in the wrong position
(Rausand 2014). The D failures that are not
detected are called dangerous undetected (DU)
failures. DU failures are only revealed in proof
tests at regular intervals.

A research challenge for PHM is therefore to
find suitable approaches to link incipient failures
or deviations with the D failures that are of interest
for the integrity of barriers. Most data-driven PHM
approaches depend on historical/training data
to predict failure trends, but the published
data sources for technological barriers, such
as Offshore Reliability Data (OREDA) and Process
Equipment Reliability Data, do not provide
such data. For model-based PHM approaches, no
guidance is given on how to deal with DU failures.

Another challenge arises from failures occurring
in redundancy structures. Common cause
failures (CCFs) are the main contributor to the
unavailability of redundant safety barriers (Hauge
et al. 2015). CCFs are failures of multiple components,
simultaneously or within a short time interval,
due to a shared root cause or a common cause. It is
valuable to identify those deviations that can lead to
CCFs and to predict their potential influence in PHM.


4.2.4 Measures of technological barriers 
IEC 61508 (2010) suggests the average probability
of failure on demand (PFD) as a measure for
technological barriers in low-demand mode, and the
probability of failure per hour (PFH) as the measure
for technological barriers in high-demand mode. Based
on the resulting PFD and PFH values, safety
barriers are assigned to different safety integrity
levels (SILs), from the least strict SIL 1 to the
strictest SIL 4. These measures are widely used, and
they are always calculated on the basis of some basic
assumptions (Jin and Rausand 2014, Wang and Rausand
2014, Rausand 2014), including: (1) each failure is
assumed to occur at a constant rate (i.e. exponentially
distributed failures); and (2) the channels in a
redundant structure are identical and independent.
These assumptions are relaxed when implementing
PHM, because deterioration in the mechanical
components of a technological barrier is unavoidable,
and this weakens the theoretical foundations
of the measure calculations. However, to evaluate
the effectiveness of a PHM program, we still need
to use the widely accepted measures and to build a
relationship between SILs and the effects of PHM.
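
To make the link between PFD and SIL concrete, the following Python sketch uses the SIL bands of IEC 61508 for low-demand mode together with the common single-channel approximation PFDavg ≈ λ_DU · τ / 2, where λ_DU is a constant dangerous undetected failure rate and τ the proof test interval. The approximation rests on exactly the constant-rate assumption questioned above, and the function names and example values are hypothetical.

```python
def pfd_avg_single_channel(lambda_du, tau):
    """Approximate average PFD of a single channel: lambda_DU * tau / 2.

    lambda_du: constant dangerous undetected failure rate (per hour)
    tau:       proof test interval (hours)
    Valid only for a constant failure rate and lambda_du * tau << 1.
    """
    return lambda_du * tau / 2.0


def sil_low_demand(pfd_avg):
    """Map an average PFD to a SIL (low-demand bands of IEC 61508).

    Returns None if the value lies outside the tabulated bands.
    """
    bands = [
        (1e-5, 1e-4, 4),
        (1e-4, 1e-3, 3),
        (1e-3, 1e-2, 2),
        (1e-2, 1e-1, 1),
    ]
    for lower, upper, sil in bands:
        if lower <= pfd_avg < upper:
            return sil
    return None
```

With a hypothetical rate of 2e-6 failures per hour and a yearly proof test (τ = 8760 h), `sil_low_demand(pfd_avg_single_channel(2e-6, 8760))` places the channel in the SIL 2 band. A PHM-based model with a time-dependent failure rate would need to replace the first function while still reporting against the same bands.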


4.2.5 Cost-benefit analysis of PHM

Safety and availability are dominant in the assessment
of safety barriers. For PHM, however, the return
on investment (ROI) also needs to be considered
(Saxena et al. 2008, Wang and Pecht 2011), especially
since other tests and diagnostics
are also employed on safety barriers.

The main task in ROI analysis or cost-benefit
analysis is to quantify the costs and benefits of
PHM (Scanff et al. 2007). The costs of a PHM
program can include the cost of acquiring and
installing data collection equipment, such as sensors
and microprocessors, and the cost of re-designing the
host product, which can be a big investment (Sun et al.
2012). The benefits are more complex and include the
reduction of proof tests and maintenance. It is
challenging to choose the indicators for calculating the
ROI of a PHM program. Moreover, we also need
to determine the best PHM program for a specific
technological barrier.
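
A generic ROI calculation can be sketched in Python as follows. It uses the usual definition ROI = (benefit - cost) / cost rather than a formula taken from the cited works, and the cost and benefit categories and numbers in the usage note are purely hypothetical.

```python
def phm_roi(costs, benefits):
    """Return on investment of a PHM program: (benefit - cost) / cost.

    costs and benefits are dictionaries mapping a category name to a
    monetary amount in the same currency and time horizon.
    """
    total_cost = sum(costs.values())
    total_benefit = sum(benefits.values())
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return (total_benefit - total_cost) / total_cost
```

For instance, `phm_roi({"sensors": 40.0, "redesign": 60.0}, {"fewer_proof_tests": 90.0, "avoided_maintenance": 60.0})` compares a total cost of 100 with a total benefit of 150. The difficult part, as noted above, is choosing and quantifying the benefit indicators, not the arithmetic.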


5 CONCLUSIONS 


In this paper, a short review of PHM is presented.
PHM enables the RUL of in-service
equipment to be estimated, which supports timely
maintenance decisions. Given the vital role of
technological barriers and the advantages of PHM, the
idea of developing a PHM system for SIS is presented.
Compared with mechanical systems, technological
barriers have their own characteristics, which pose
new challenges.

Therefore, we propose several research topics
to be addressed in the future, specifically in a PhD
project:


• New approaches for predicting degradation of
a component with state transitions;

• Mechanisms for incorporating redundancy structures
and varied configurations in degradation
modeling and analysis;

• Models to link the effectiveness of PHM with
the measures for safety barriers;

• Methods to optimize PHM and other maintenance
activities under the constraints of the SIL
requirements for safety barriers.


ABSTRACT: One of the main challenges faced by industry in the context of failure diagnosis is the
high quantity and high dimensionality of the available data. Due to the increasing capability and
availability of sensing technology, it is now possible to acquire a large amount of (unlabeled) data on
many operational and maintenance related variables from monitored machines. The problem lies in how
to extract useful information from such data. A standard approach in fault diagnosis is to first apply a
dimensionality reduction method. In this paper, we propose a method for dimensionality reduction based
on Variational Auto-Encoders (VAEs). VAEs have shown good results in areas such as image processing,
image generation and speech processing. In particular, in this paper, the VAE based method works
on spectrograms generated from vibration signals measured during the system’s operation. This approach is
applied to the fault diagnosis of ball-bearings.


1 INTRODUCTION 


The early detection of faults in machinery is
one of the main challenges faced by industrial
sectors today. In general, faults in machinery
components lead to one of two possible outcomes. Either
the component is replaced according to the
manufacturer's instructions, which in the majority of
cases leads to a less than optimal useful life of the
element, or a sudden failure occurs that causes
a performance drop in the machine. The
development of new methods that can identify
faults early and that make use of the entire useful
life of the component is an active topic of research
(Hasani, Wang, & Grosu, 2017; Liu, Li, & Ma,
2016).

One possible approach to this problem is to
assume that data measured during the operation
of a machine contains useful information about
its health state and that such information can be
extracted if a proper technique is used. Machine
Learning has been used in the past to extract useful
information from fault data (Blum & Langley,
1997; Ciresan, Meier, & Schmidhuber, 2012).

Nevertheless, the use of Machine Learning in
conjunction with operational and maintenance
data brings another problem to the table: the curse
of dimensionality. Data obtained during
the operation of a system tends to be noisy
and high dimensional. The use of dimensionality
reduction techniques is a popular strategy
to improve diagnosis or prognosis performance;
see for example (Shuang, Automation, ICMA, &
2007, n.d.) for Support Vector Machines and (Pan,
Rust, Networks, IJCNN, & 2000, n.d.) for Artificial
Neural Networks (ANN).

In fact, one type of model that has been developed
within the Machine Learning field with the objective
of performing dimensionality reduction
is the Auto-Encoder (AE) (Hinton & Salakhutdinov,
2006). AEs are unsupervised models based
on neural networks. They are trained by
passing the data through an intermediate layer
of lower dimensionality than the input, thus
generating a latent representation and performing
a dimensionality reduction. The rest of the
neural network then tries to reconstruct the input from
this reduced representation, forcing the latent
space to be meaningful. In the area of fault diagnosis,
AEs and their variants have recently been used
with great success (Liu et al., 2016; Zhou, Gao, &
Wen, n.d.).
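
The encode-then-reconstruct idea can be illustrated with a deliberately small Python sketch. It trains a purely linear auto-encoder (no hidden nonlinearities) by gradient descent on the mean squared reconstruction error using NumPy only. It is a toy stand-in for the deep networks in the cited works, and all names and hyperparameters are hypothetical.

```python
import numpy as np


def train_autoencoder(x, latent_dim, epochs=1000, lr=0.02, seed=0):
    """Train a linear auto-encoder x -> z = x W_enc -> x_hat = z W_dec.

    x is an (n_samples, n_features) array, ideally centered. Returns the
    encoder and decoder weights and the reconstruction loss per epoch.
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w_enc = rng.normal(scale=0.1, size=(d, latent_dim))
    w_dec = rng.normal(scale=0.1, size=(latent_dim, d))
    losses = []
    for _ in range(epochs):
        z = x @ w_enc              # latent representation (bottleneck)
        x_hat = z @ w_dec          # reconstruction of the input
        err = x_hat - x
        losses.append(float(np.mean(err ** 2)))
        # Gradients of the mean squared error with respect to both matrices.
        scale = 2.0 / (n * d)
        grad_dec = scale * (z.T @ err)
        grad_enc = scale * (x.T @ (err @ w_dec.T))
        w_enc -= lr * grad_enc
        w_dec -= lr * grad_dec
    return w_enc, w_dec, losses
```

After training, `x @ w_enc` gives the reduced representation that a downstream classifier would receive in place of the original features. A practical auto-encoder would add nonlinear layers and a proper optimizer, typically through a deep learning library.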




Another model that also reduces
the dimensionality of the data by feeding it through
a pipeline that generates a latent representation is
the Variational Auto-Encoder (VAE) (Kingma & Welling,
2013). VAEs are currently one of the most promising
unsupervised machine learning techniques,
mainly because of their success in processing
complex data such as images (Rezende, Mohamed,
& Wierstra, 2014; Salimans, Kingma, & Welling,
2015) and speech (Hsu, Zhang, & Glass, 2017).
VAEs use a combination of variational inference
(Blei, Kucukelbir, & McAuliffe, 2017) and neural
networks to solve the difficult problem of finding
a good approximation of a posterior distribution
in a Bayesian model. However, no attempt has
been made to develop a dimensionality reduction
method for fault diagnosis using the VAE as the
underlying model.

Therefore, this paper presents a VAE-based
dimensionality reduction approach for fault
diagnosis in machinery using spectrogram images
extracted from vibration signals, and evaluates it by
comparing the classification results of the
proposed approach with those of the case where no
reduction in dimensionality is performed.

This paper is organized as follows. Sec- 
tion 2 introduces the VAE model. Section 3 
discusses the proposed VAE’s architecture for per- 
forming fault diagnosis. Then, Section 4 presents 
an example of application. Finally, Section 5 draws 
some concluding remarks. 


2 VARIATIONAL AUTO-ENCODERS 


Variational Auto-Encoders (VAEs) are generative
models originally proposed by Kingma & Welling
(2013) that are built on top of the concept of
variational inference (VI) (Blei et al., 2017). VI is an
approach for approximating probability
distributions that are difficult to compute. In a
Bayesian framework, the observations x
are assumed to be produced by a set of latent or
hidden variables z. In general, the objective is to
compute p(z|x) for two main reasons. First,
for a given observation x, the vector of latent variables
that produces such an observation can be of interest in
itself, for example, when the latent variables
represent physical magnitudes that are not easily
measurable. Second, using p(z|x) and Bayes' theorem,
one would be able to compute p(x), which
represents the model that generates the data itself. This is
of special interest for generative models, where the
objective is to produce new data that shares
characteristics with the data that is already in the dataset.
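
For reference, the relationship between these quantities can be written out explicitly. The following is the standard statement of Bayes' theorem in the notation of this section; the integral form of p(x) is the usual marginalization over the latent variables and is included here as a reminder, not as a formula taken from the original text.

```latex
p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)},
\qquad
p(x) = \int p(x \mid z)\, p(z)\, dz
```

The integral over z is what usually makes p(x), and therefore p(z|x), difficult to compute, which is the motivation for approximating the posterior.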

The main difference between VAEs and other
variational-inference-based approaches lies in the
assumptions made about the distributions that
control the model. In the VAE model, such
distributions are assumed to be parametrizable by a set of
parameters (e.g., Normal or Bernoulli
distributions). These parameters can be found by neural
networks via proper training. The combination
of the assumptions made for the distributions and
the use of neural networks is what makes the VAE
model so efficient in searching for an approximation
of the true posterior p(z|x). For the proposed
VAE-based dimensionality reduction approach, the
following sections discuss the parametric
form of the prior over the latent variables, p(z), the
data conditioned on the latent variables, p(x|z),
and the approximate posterior distribution, q(z|x),
along with the form of the VAE's objective
function.


2.1 Prior over the latent variables p(z) 


The definition of the latent variables of a certain
model is usually a complex problem. The nature
of those variables and the relationships between them
are very difficult to express beforehand without
deep knowledge about the situation that we want
to model. VAEs take a simple approach to this,
assuming that the latent variables are distributed
according to the following distribution:


$p(z) = \mathcal{N}(z \mid 0, I)$ (1)


From Equation 1, the prior distribution p(z)
assumes no prior information about the
relationship between the latent variables or their
nature (other than that they are Normally distributed).
As mentioned in Doersch (2016), this can be
done because any distribution in k dimensions can
be generated from a vector of k Normally distributed
variables if they are processed with a function
that is complex enough. This function corresponds
to the neural network of the VAE's decoder.
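
The idea that a sufficiently complex function can turn Normally distributed samples into a very different distribution can be illustrated with a small sketch. The function `ring_transform` below is a hypothetical, hand-written stand-in for the decoder network (it is not part of the proposed method); it maps 2-D standard Normal samples onto a noisy ring.

```python
import numpy as np

def ring_transform(z):
    """Map 2-D standard Normal samples onto a noisy ring.

    g(z) = z / 10 + z / ||z|| is a hypothetical stand-in for the decoder.
    """
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return z / 10.0 + z / norms

rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 2))   # samples from the prior N(0, I)
x = ring_transform(z)                # samples from a ring-shaped distribution

# Every transformed point lies at a distance 1 + ||z|| / 10 from the origin.
radii = np.linalg.norm(x, axis=1)
print(radii.min(), radii.mean())
```

In a VAE this mapping is not written by hand; it is learned by the decoder's neural network during training.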


2.2. The data conditioned on the latent 
variables p(x|z) 


This distribution represents the probability of x 
being generated by the latent variables contained in 
z. Note that the family of p(x|z) should be defined 
based on the nature of the data. For example, for 
black and white spectrograms, or gray scale ones 
that are real valued but restricted to the interval 
(0,1), p(x|z) is chosen to be a Bernoulli Distribu- 
tion, in the form of: 


$p(x \mid z) = \mathrm{Be}(x \mid f(z, \theta))$ (2)


The function f(z, θ) is a neural network that
takes as inputs both the latent variables z and a
vector of parameters θ that represents the weights
and biases of the NN, and then outputs a vector of
Bernoulli parameters, p, to define the distribution.
Here we can see that it does not matter if z is
sampled from a simple distribution such as a standard
Normal, because if we let f have an acceptable
level of complexity (i.e., number of neurons and
layers), then f will be able to first find the mapping
from those Normally distributed samples of z
to the true but unknown distribution that controls
the latent variables, and then find the relationship
between that distribution and the vector of
Bernoulli parameters. The Bernoulli distribution is
chosen to represent the output of the decoder for
cases where binary data is being reconstructed by
the VAE, as is the case in this paper, where we use
images as inputs for the model.

The structure formed by f and p(x|z) is named 
the decoder of the VAE, since its job is to output a 
reconstruction of the original input from the latent 
variables z. 
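
As a minimal sketch of how the Bernoulli choice is used in practice, the log-likelihood log p(x|z) of an image under the decoder's output can be evaluated pixel by pixel. The function name, the clipping constant `eps` and the toy vectors below are assumptions made for illustration only.

```python
import numpy as np

def bernoulli_log_likelihood(x, p, eps=1e-7):
    """log p(x|z) for pixels x in [0, 1] given Bernoulli parameters p = f(z, theta)."""
    p = np.clip(p, eps, 1.0 - eps)   # avoid log(0)
    return float(np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p)))

x = np.array([1.0, 0.0, 1.0, 0.0])       # toy binary "image"
good = np.array([0.9, 0.1, 0.9, 0.1])    # decoder output close to x
bad = np.array([0.1, 0.9, 0.1, 0.9])     # decoder output far from x

print(bernoulli_log_likelihood(x, good))
print(bernoulli_log_likelihood(x, bad))
```

A decoder output that is closer to the input yields a higher log-likelihood, which is what training rewards.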


2.3 The approximate posterior q(z|x) 


In variational inference, the true posterior p(z|x)
is approximated by a distribution q(z|x) that is
searched for in a family Q of parametric probability
distributions. In the VAE model, Q is assumed
to be the family of multivariate isotropic Normal
distributions. So, the following holds for q(z|x):


$q(z \mid x) = \mathcal{N}(z \mid \mu(x, w_\mu), \sigma(x, w_\sigma) I)$ (3)


In Equation 3, the distribution q(z|x) has
both of its parameter vectors, μ and σ,
parametrized by neural networks that take as inputs
the input data and vectors of weights and biases,
denoted as w_μ for the means vector and w_σ for the
variances vector. The structure formed by the
distribution q(z|x) in conjunction with the neural
networks for μ and σ is called the encoder of
the VAE, since its task is to transform the input
data to the latent representation z. The choice of
a Normal distribution to represent the
approximate posterior in the VAE model serves a double
purpose. First, as we shall see in the next section,
this choice allows a very quick and efficient
training of the VAE. Second, without adding excessive
complexity, the use of a Normal distribution adds
flexibility to the model, since every latent variable
is controlled by its own mean and variance.

Since VAEs use variational inference to find
the approximate distribution q(z|x), the idea is to
train the neural networks that parametrize the
vectors p, μ and σ by optimizing the VAE's objective
function. First, however, we need to be precise about
the form that the VAE's objective function takes
with the assumptions made for p(z), p(x|z) and
q(z|x).


2.4 Objective function of the VAE model 


As stated before, the VAE model uses variational
inference to solve the problem of finding a good
approximation to the true posterior distribution p(z|x).
Variational inference proposes the following
objective function to optimize in order to find q(z|x),
named the Evidence Lower Bound (ELBO) function:


$\mathrm{ELBO}(q(z \mid x)) = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - \mathrm{KL}(q(z \mid x) \,\|\, p(z))$ (4)

It can be proved (Blei et al., 2017) that by
maximizing Equation 4, variational inference is able to
find the parameters that determine the best
possible approximate distribution q(z|x) within the
chosen family of distributions Q.

In the ELBO function, there are two terms that
need to be computed efficiently in order to find
q(z|x). First, as both distributions q(z|x) and p(z)
belong to the Normal family, the Kullback-Leibler
divergence between them that appears in Equation 4
has a closed form. The KL divergence between
an isotropic Normal and a standard Normal, as
stated in Doersch (2016), is shown in Equation 5:


$\mathrm{KL}(\mathcal{N}(\mu, \sigma I) \,\|\, \mathcal{N}(0, I)) = \frac{1}{2}\left(\mathrm{tr}(\sigma I) + \mu^{T}\mu - k - \log\det(\sigma I)\right)$ (5)

where k is the dimensionality of the distribution, 
i.e., the dimensionality of the latent space of the 
VAE model. 
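
The closed form in Equation 5 is cheap to evaluate when the covariance is diagonal, because the trace and the log-determinant reduce to sums over the variances. The sketch below assumes the encoder returns a means vector `mu` and a variances vector `var`; the function name is hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) following Equation 5."""
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    k = mu.size                        # dimensionality of the latent space
    trace = np.sum(var)                # tr(Sigma) for a diagonal covariance
    log_det = np.sum(np.log(var))      # log det(Sigma) for a diagonal covariance
    return 0.5 * (trace + mu @ mu - k - log_det)

print(kl_to_standard_normal([0.0, 0.0, 0.0], [1.0, 1.0, 1.0]))  # 0.0
print(kl_to_standard_normal([1.0, 0.0], [1.0, 1.0]))            # 0.5
```

The divergence is zero only when the approximate posterior coincides with the prior, and it grows as the means move away from zero or the variances move away from one.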

The first term that appears in the ELBO function
is the expectation of the logarithm of p(x|z).
Computing that expectation by sampling z many times would
be extremely inefficient because it would require
evaluating f (which is a neural network that
may be very complex) many times in order to obtain
a good estimate of the expectation. The neural
networks within the VAE model are trained using
stochastic gradient descent (SGD), so we can
also use SGD in the evaluation of this expectation
by sampling z once, computing log p(x|z)
from that value of z and then using that result as an
estimate of the expectation mentioned above. This
process is repeated until the solution converges
to a good result for the posterior approximation.


2.5 The variational auto-encoder model 


Figure 1 shows a graphical representation of the VAE
model that comprises all the elements discussed
above.




[Figure 1 diagram: VAE's Encoder, Latent Representation z, VAE's Decoder]


Figure 1. VAE model, with both neural networks, one
for the encoder and the other for the decoder. Notice that
it is the encoder that generates the latent representation
z, and then the decoder, from that latent representation,
reconstructs the input.

Figure 2. VAE model with the reparametrization trick
applied.


Nevertheless, there is still one difficulty with the
VAE model. Neural networks are trained with
the back-propagation (BP) algorithm, which makes
training efficient. But BP does not work
if there are stochastic units within the network, as
is the case when we sample from the encoder's
distribution: such units have no gradient, so the error
cannot be propagated through them using the chain
rule. Kingma & Welling (2013) proposed a
solution to this problem called "the reparametrization
trick", in which all the stochasticity of the model
is moved to an input, so the error can be propagated
without problems. Figure 2 shows the VAE model
with the reparametrization trick applied.
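
A minimal sketch of the reparametrization trick is given below. It assumes the encoder provides a means vector `mu` and a variances vector `var`; the function name and the numerical values are hypothetical. The noise `eps` is drawn from a fixed standard Normal, so the sample z is a deterministic, differentiable function of `mu` and `var`.

```python
import numpy as np

def sample_latent(mu, var, rng):
    """Draw z ~ N(mu, diag(var)) as a deterministic function of (mu, var) and noise."""
    eps = rng.standard_normal(mu.shape)   # all randomness enters through this input
    return mu + np.sqrt(var) * eps        # differentiable with respect to mu and var

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
var = np.array([0.25, 4.0])
samples = np.stack([sample_latent(mu, var, rng) for _ in range(20000)])

# The empirical moments should be close to the requested mean and variance.
print(samples.mean(axis=0), samples.var(axis=0))
```

Because the sampling step no longer sits between the encoder outputs and the loss, gradients can flow from the decoder back to the encoder's weights.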


3 PROPOSED VAE’S ARCHITECTURE FOR 
DIMENSIONALITY REDUCTION 


The VAE model has as core components two neural
networks, one for the encoder and the other for the
decoder. In what follows, the proposed
architectures for these neural networks are described.


3.1. Encoder’s deep neural network architecture 


In Figure 3, a graphical representation of the
encoder's neural network is portrayed.

[Figure 3 diagram: Input Layer, Hidden Layer #1 (400 units), Hidden Layer #2 (200 units), Hidden Layer #3 (100 units), Output Layer #1 (k units), Output Layer #2 (k units)]

Figure 3. The encoder of the proposed VAE has three
hidden layers and two output layers, one for the means
vector μ and the other for the variances vector σ. The
number of units in both output layers is k, the chosen
dimension for the latent space of the VAE.

The encoder's neural network is chosen to have
three hidden layers with 400, 200 and 100 units
for the first, second and third
layers, respectively, and two different output layers,
one for the vector of means, μ, and one for the vector of
variances, σ, to parametrize the approximate
posterior distribution q(z|x). Both output
layers have k units, where k corresponds to the
target latent space dimension, i.e., the
dimensionality of the data once the reduction is performed.
This neural network uses the ReLU activation
function in all the hidden layers.
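
The layer sizes described for the encoder can be sketched as a plain NumPy forward pass. This is only an illustration of the shapes involved, not the authors' TensorFlow implementation: the helper names are hypothetical, the uniform variant of Xavier initialization is an assumption, and the exponential used to keep the variances positive is also an assumption, since the text does not state the output activations.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def init_layer(n_in, n_out, rng):
    """Xavier-style uniform weights and zero biases (assumed variant)."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, (n_in, n_out)), np.zeros(n_out)

def build_encoder(y, k, rng):
    """Three hidden layers (400, 200, 100 units) and two k-unit output layers."""
    sizes = [y, 400, 200, 100]
    hidden = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
    mean_head = init_layer(100, k, rng)
    var_head = init_layer(100, k, rng)
    return hidden, mean_head, var_head

def encode(x, encoder):
    hidden, (w_mu, b_mu), (w_var, b_var) = encoder
    h = x
    for w, b in hidden:
        h = relu(h @ w + b)               # ReLU in every hidden layer
    mu = h @ w_mu + b_mu                  # means vector
    var = np.exp(h @ w_var + b_var)       # assumption: exp keeps variances positive
    return mu, var

rng = np.random.default_rng(0)
encoder = build_encoder(y=9216, k=8, rng=rng)
batch = rng.random((5, 9216))             # five placeholder input vectors in [0, 1)
mu, var = encode(batch, encoder)
print(mu.shape, var.shape)                # (5, 8) (5, 8)
```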


3.2 Decoder’s deep neural networks architecture 


In Figure 4, the decoder of the proposed VAE 
model is portrayed. 

As shown in Figure 4, the decoder of the proposed
VAE has three hidden layers with 100, 200
and 400 units for the first, second and third hidden
layers, respectively, and one output layer with
y units, where y is the dimensionality of the data
prior to the reduction. As in the encoder's case,
this neural network uses the ReLU activation
function in all its hidden layers.

The VAE's encoder and decoder work
together. The encoder first compresses the data to
its latent representation by computing the
parameters that control the distribution q(z|x), from
which the latent variables are sampled. The decoder
then decompresses the data from this latent
representation to a reconstruction of its original
form by computing the vector of Bernoulli
parameters p and then, depending on whether the data
is binary or real valued and restricted to the (0,1)
interval, either sampling from the distribution
or taking p as the reconstruction itself.

Figure 4. The decoder of the proposed VAE model has
three hidden layers and one output layer to compute the
vector of Bernoulli parameters p. The number of units
in the output layer is equal to y, the dimensionality of
the input data.

Figure 5. Dimensionality reduction method based on
the encoder of a VAE that has been trained successfully.


Once the VAE is trained, the neural network of
the encoder can be used as a feature map to reduce
the dimensionality of the input data to the lower
dimensional space by computing the means and
variances vectors. Nevertheless, it is usually
preferable for the dimensionality reduction to
be deterministic, i.e., not to include any
sampling in its process. This can be done by noting
that the probability distribution of the encoder is
a Normal distribution, so we can take the vector
of means from the encoder's neural network as the
latent variables directly, in the form z = μ. This
provides a latent representation that does not vary
each time a reduction of a specific data point
is performed. A graphical representation of the
dimensionality reduction process using the VAE's
encoder can be seen in Figure 5.


4 EXAMPLE OF APPLICATION 


4.1 Dataset 


We use vibration data from the Society for Machinery
Failure Prevention Technology (MFPT) open
dataset (Bechhoefer, 2016), which was obtained during
the operation of a NICE ball-bearing element.
The dataset contains three main classes, each
corresponding to one health condition: baseline
(where no failure is present), outer race fault and
inner race fault. The original signals for each class
were first divided into portions of length L = 1024
points each, with 50% overlap
between adjacent chunks. Then, by means of the
Short Time Fourier Transform (STFT), a spectrogram
of each portion of data was computed and,
via bilinear interpolation (Raveendran &
Thomas, 2014), transformed into an image
of 96 by 96 pixels. We chose to use spectrograms
because they portray both time and frequency
information of the signal, instead of just time
information, as would have been the case if we had
fed the original signal directly into the VAE model.
Table 1 shows the total number of images per class
in the dataset.
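
The segmentation step described above (portions of L = 1024 points with 50% overlap) can be sketched as follows. The function name and the synthetic signal are hypothetical, and the sketch assumes the record is at least one portion long; the STFT and interpolation steps are not reproduced here.

```python
import numpy as np

def split_with_overlap(signal, length=1024, overlap=0.5):
    """Split a 1-D signal into fixed-length chunks with fractional overlap."""
    hop = int(length * (1.0 - overlap))            # 512 samples for 50% overlap
    n_chunks = 1 + (len(signal) - length) // hop   # only chunks that fit entirely
    starts = hop * np.arange(n_chunks)
    return np.stack([signal[s:s + length] for s in starts])

signal = np.arange(4096, dtype=float)   # hypothetical stand-in for a vibration record
chunks = split_with_overlap(signal)
print(chunks.shape)                     # (7, 1024)
```

With 50% overlap, the second half of each chunk is identical to the first half of the next one, which roughly doubles the number of spectrograms obtained from a record.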

However, for the VAE model to work
properly, as we can recall from Figure 3, the
VAE's encoder has to receive as input a vector
of y components, not a two
dimensional array as it would if we fed the
spectrogram images directly to the model. Thus, we
first need to reshape the images by stacking their
rows horizontally to form vectors, as shown
in Figure 6.
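
The reshaping step amounts to a row-major flattening of each image. In the short sketch below, the array `image` is a placeholder for one 96 by 96 spectrogram and is not real data.

```python
import numpy as np

image = np.arange(96 * 96, dtype=float).reshape(96, 96)   # placeholder spectrogram
vector = image.reshape(-1)          # row-major: rows are placed one after another

print(vector.shape)                 # (9216,)

# The operation is lossless: the image can be recovered from the vector.
restored = vector.reshape(96, 96)
```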


4.2 Validation experiments 


We are interested in two situations. First, we want
to know if the VAE's latent representation is stable
across dimensions of the latent space. This is
important because the choice of the dimensionality
k is not clear beforehand, and it usually relies
on the person in charge of the analysis, so models
that are less sensitive to variations of this
parameter have an advantage over those that need
fine tuning to be effective. For this purpose, we
perform reductions using the VAE approach for a
series of different values of k and then analyze whether
variations in the dimension of the latent space
produce variations in the accuracy of classification.
The values of k tested can be seen in Table 2.


Table 1. Classes and number of samples per class for
the MFPT dataset.

ID         Fault's location    # of samples
IR         Inner Race Ring     1981
OR         Outer Race Ring     5404
Baseline   No Failure          3423


[Figure 6 diagram: the rows of a spectrogram image placed side by side to form a single vector]

Figure 6. Diagram showing the process of reshaping
the spectrogram images as vectors.


Table 2. Values for k and ε.

k                       ε
2, 4, 8, 16, 64, 128    1%, 100%


The second situation of interest is when the
amount of labeled data available is low. This is
usually the case in industrial applications, because
it is usually cheap to acquire data from installed
sensors, but the process of labeling such data for
training machine learning models requires highly
qualified personnel and often further analysis of
the machine's degradation process. For this, we
test two extreme situations: the training dataset
for the fault diagnosis model is either the full
training dataset or only
1% of it. This is represented by the parameter ε,
the percentage of the training dataset available.
This emulates situations where we might have a
very reduced amount of labeled data versus situations
where the availability of labeled data is not a
concern. From there, we evaluate whether the VAE-based
reduction generates a representation of the data
that makes the differences between the different
health states more explicit, so that the NN-based
classifier can still learn even if less data is
available. Recall that, as the VAE is an unsupervised
model, it does not require labels for its training,
so even when working with ε = 1%, the VAE can
use the whole training dataset for its unsupervised
training.

For the classification tasks, we use a neural
network based classifier with a single hidden layer
of 100 units and ReLU activation function.
Regularization is performed via dropout with a
probability of 50%. For the output layer, softmax
is used as the activation function. The model
is trained using the cross entropy cost function,
a learning rate of 10* and a maximum of 15000
epochs. The VAE's weights and biases are
initialized with the Xavier initialization (Glorot &
Bengio, n.d.) and with an array of zeros, respectively.
For the NN based classifier, the weights and biases
were randomly initialized from a Normal
distribution with zero mean and unit variance.

The results shown in the next section were 
obtained following the steps below: 


I. Choose a value for ε and k from Table 2. Then,
divide the whole dataset into a training and a
testing set in the proportion 3:1.

II. Train the VAE in an unsupervised way using
the full training dataset. This training is
performed using a maximum of 500 epochs and a
learning rate equal to 10“.

III. Reduce the dimensionality of the training
and testing datasets with the trained VAE's
encoder as discussed in Section 3.

IV. If ε = 1%, extract that proportion from the
training dataset and train the NN based
classifier. If ε = 100%, use the whole training
dataset to perform the training.

V. Use the full testing dataset to evaluate the
accuracy of the classification obtained.
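
The data handling in the steps above (the 3:1 split and the extraction of a labeled fraction ε of the training set) can be sketched as follows. The function names are hypothetical, and the random shuffling is an assumption, since the text does not state how the split or the 1% subset was drawn. The sample counts are those of Table 1.

```python
import numpy as np

def split_train_test(n_samples, rng, train_fraction=0.75):
    """Shuffle indices and split them 3:1 into training and testing sets."""
    order = rng.permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]

def labeled_subset(train_idx, epsilon, rng):
    """Keep a fraction epsilon of the training indices as the labeled set."""
    n_labeled = max(1, int(round(epsilon * len(train_idx))))
    return rng.choice(train_idx, size=n_labeled, replace=False)

rng = np.random.default_rng(0)
n_total = 1981 + 5404 + 3423          # sample counts from Table 1
train_idx, test_idx = split_train_test(n_total, rng)
labeled_idx = labeled_subset(train_idx, 0.01, rng)

print(len(train_idx), len(test_idx), len(labeled_idx))   # 8106 2702 81
```

Under these assumptions, the classifier for ε = 1% would be trained on only 81 labeled images, while the VAE is still trained on all 8106 training images without labels.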


We also compare the results obtained with the 
VAE model against results obtained when the spec- 
trograms are fed directly to the NN-based classifier. 
To perform this type of experiment, the same pro- 
cedure as before applies except for steps II and III. 

The hardware and software used in the example
of application are: an Nvidia Titan X GPU, an Intel
i7 7700K processor, 32 GB of DDR4 RAM and
TensorFlow 1.2.0 combined with CUDNN 8.0.


4.3 Results and discussion 


In Figure 7 and Figure 8, results regarding the 
accuracy obtained in the classification of health 
states when using the VAE approach to reduce 
the dataset to different values of k are compared 
against the case where no reduction is performed 
to the dataset. 

Table 3 shows the best accuracy results obtained
with the VAE approach and the model that does
not perform a reduction in dimensionality, for
both ε = 1% and ε = 100%.

As we can see from Figure 7 and Figure 8, fault
diagnosis accuracies obtained with the VAE model
are almost always better than the ones obtained
when no reduction is performed. The only exception
is when the full training dataset is used and the
VAE model reduces the dimensionality to k = 2. This
can be explained by the fact that the spectrogram
images have an original dimensionality of y = 9216
data points (96 by 96 pixels), and reducing from that
dimension to k = 2 induces some loss of
information. Nevertheless, as Figure 8 shows, for every
other choice of k (including k = 4) the VAE model
surpasses, in terms of fault diagnosis accuracy, the
model that does not perform reduction.


Figure 7. Accuracy versus latent space dimension for
the case where only 1% of the training dataset is used to
train the NN classifier.

Figure 8. Accuracy versus latent space dimension for
the case where the full training dataset is used to train
the NN classifier.


Table 3. Best accuracy results for both models
tested. The VAE model achieves its best
results for k equal to 4 and 64 for ε = 1%
and ε = 100%, respectively.

            VAE       No reduction
ε = 100%    93.43%    90.38%
ε = 1%      88.39%    55.76%


Also observe that the drop in accuracy caused by
the reduction in the size of the training dataset is
much less pronounced for the VAE model. As we can
see from Table 3, the drop in accuracy for the VAE
model is approximately 5 percentage points when the
training dataset is reduced to 1% of its original size,
but for the model that does not perform a reduction in
dimensionality, the drop is about 35 percentage points
(from 90.38% to 55.76%). This indicates the
robustness of the VAE's latent space in making the
different health states present in the dataset more
distinguishable for the NN based health state classifier.


In terms of the results obtained with the VAE by
varying the latent space's dimensionality, the size of
the training dataset does affect the stability of the
model. For ε = 100%, except for k = 2, the model is
very stable, with differences in accuracy of at most
2%. In contrast, when ε = 1%, the model needs
to be tuned prior to its application in order to find the
dimension that delivers the best performing latent
space, since the differences in accuracy reach 12%.


5 CONCLUDING REMARKS 


We have introduced a new method for performing 
dimensionality reduction in a dataset using deep 
Variational Auto-Encoders. The VAE model maps 
the data to a latent representation that can be cho- 
sen to have a lower dimensionality than the original 
data, thus producing a dimensionality reduction. 

Results show that when using the STFT and
spectrograms to preprocess ball-bearing operational
data provided by vibration sensors, the
VAE approach produces better accuracies in the
classification tasks performed than the model that
does not perform a reduction in dimensionality,
especially when the labeled data available to train
the fault classifier is limited.

ABSTRACT: System health management is of utmost importance with today's sensor integrated systems
where a constant stream of data is available to feed information about a system's health. Traditional
methods to assess this health focus on supervised learning of fault classes. This requires labeling sometimes
millions of data points and is often laborious to complete. Additionally, once the data is labeled,
handcrafted feature extraction and selection methods are used to identify which features are indicators of the
fault signals. This process requires expert knowledge to complete. An unsupervised generative adversarial network
based methodology is proposed to address this problem. The proposed methodology comprises a deep
convolutional Generative Adversarial Network (GAN) for automatic high-level feature learning as an input
to clustering algorithms to predict a system's faulty and baseline states. This methodology was applied to a
public data set of rolling element vibration data from a rotary equipment test rig. Wavelet transform
representations of the raw vibration signal were used as an input to the deep unsupervised generative adversarial
network based methodology for fault classification. The results show that the proposed methodology is
robust enough to predict the presence of faults without any prior knowledge of their signals.


1 INTRODUCTION 


Much of fault diagnostics involves the use of 
labeled data. This is challenging for new assets out- 
fitted with sensor suites capable of generating mas- 
sive amounts of data. Without knowledge of faults 
or their corresponding signals, engineers may not 
be able to diagnose faults effectively. Traditional 
methods include feature extraction and selection 
methods which attempt to use a specific feature 
of the signal to diagnose the faults. This method 
requires knowledge of which features are relevant 
for the task. Moreover, if an engineer has some 
knowledge of the fault, that knowledge could be 
biased or incomplete. Unsupervised fault diagnos- 
tics attempts to fill in that knowledge. 

Deep learning algorithms can perform auto- 
matic feature learning to better understand the 
underlying data features that are most relevant. 
This automatic feature learning attempts to fill in 
the gaps of knowledge of relevant features to the 
fault signals. There are challenges with this auto- 
matic feature extraction and selection. 


Unsupervised learning has been attempted for 
fault diagnostics previously. Indeed, Langone 
(2017) took pre-stressed concrete bridge natural 
frequency data and proposed an unsupervised 
adaptive kernel spectral clustering for damage 
events. Wang (2016) proposed unsupervised feature 
extraction via continuous Sparse Auto-Encoders 
(SAE). Once the SAEs extracted the features,
supervised learning was used on transformer faults.
Lei (2016) proposed unsupervised sparse filtering 
feature learning. Faults were then diagnosed with 
supervised softmax regression. Jiang (2016) pro- 
posed unsupervised feature learning with SAEs for 
chemical sensor data. These features were fed into 
supervised softmax regression to diagnose faults. 
Sun (2016) took induction motor fault data and 
proposed the use of SAEs for unsupervised feature 
extraction. These features were again followed by
supervised learning for classification by neural
networks (NN). Of these approaches, only Langone
et al. could be considered truly unsupervised. The
rest are restricted to unsupervised feature learning
followed by supervised fault diagnostics. Moreover,
apart from the use of SAEs, none of these methods
would be considered deep.

In this paper, we propose a GAN-based methodology
for unsupervised fault diagnostics
on scalogram image representations. To validate
the proposed methodology, the public Machinery 
Failure Prevention Technology (MFPT) Society 
bearing data set (Bechhoefer 2016) is used. The 
remainder of this paper is structured as follows. 
Section 2 gives an overview of GANs and the 
methodology. Section 3 presents results of the 
GANs based methodology applied to the MFPT 
data set. Section 4 provides conclusions. 


2 GENERATIVE ADVERSARIAL 
NETWORKS 


Generative adversarial networks (GANs) have at
their core a minimax game that pits a
forger, the generator network, against a detective,
the discriminator network. The generator seeks
to create fake data, or scalograms in this paper,
to trick the discriminator, which must discriminate
between the real data and the fake data, as shown
in Figure 1. Back propagation is performed on the
weights and biases and the process is repeated.
The benefit of this training is that, while the generator
seeks to learn the underlying distribution of the
real data, the discriminator feeds information
back to the generator based not on the real data but on
the weights and biases of the learned features. This
helps to prevent overfitting of the data.

Figure 1. GAN training.

Within this minimax game, the objective
function's goal is to optimize the value, V, to the point
where the discriminator and generator no longer
find it necessary to make changes to their weights
and biases. While this is the goal of GAN training,
there is no mechanism within the training
to enforce it. Therefore, there can be issues with
convergence. More formally, Eq. 1 from Goodfellow
(2014) reads:


$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$ (1)


where p_data is the data distribution, p_z is the
noise distribution, D(x) is the discriminator objective
function, and G(z) is the generator objective
function.
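
To make the two terms of the value function concrete, the sketch below estimates V(G, D) from discriminator outputs on a batch of real samples and a batch of generated samples. The function name, the clipping constant and the numerical values are hypothetical; no networks are trained here.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-7):
    """Monte Carlo estimate of V(G, D) from discriminator outputs.

    d_real: D(x) on real samples; d_fake: D(G(z)) on generated samples.
    """
    d_real = np.clip(d_real, eps, 1.0 - eps)   # avoid log(0)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# A discriminator that separates real from fake well yields a high value.
confident = gan_value(np.array([0.9, 0.95]), np.array([0.1, 0.05]))
# A discriminator that outputs 0.5 everywhere yields -2 log 2.
fooled = gan_value(np.array([0.5, 0.5]), np.array([0.5, 0.5]))

print(confident, fooled)
```

The discriminator is trained to increase this value, while the generator is trained to decrease it by making D(G(z)) approach the values the discriminator assigns to real data.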

The GAN-based methodology used in this
paper is shown in Figure 2. The methodology
starts with developing a scalogram image
representation of the raw data, and then proceeds to
training the deep convolutional generative adversarial
network (DCGAN). Once the DCGAN training is
completed and the generator output images have been
visually inspected, the last
activation layer of the discriminator is concatenated.
Once the activations are concatenated, k-means++
is used for clustering on the first two principal
components. Visual inspection of the generator
output is still needed within GAN training and is
a crucial step.
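
The final stage of the methodology (projection onto the first two principal components followed by k-means++ clustering) can be sketched in plain NumPy as follows. The synthetic `features` array is a hypothetical stand-in for the concatenated discriminator activations, the choice of three clusters is an assumption, and the helper functions are illustrative, not the implementation used in the paper.

```python
import numpy as np

def pca_2d(features):
    """Project rows of `features` onto their first two principal components."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def kmeans_pp_init(points, n_clusters, rng):
    """k-means++ seeding: new centers are drawn with probability ~ squared distance."""
    centers = [points[rng.integers(len(points))]]
    for _ in range(n_clusters - 1):
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(centers)

def kmeans(points, n_clusters, rng, n_iter=50):
    """Lloyd iterations started from the k-means++ seeds."""
    centers = kmeans_pp_init(points, n_clusters, rng)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):               # leave empty clusters unchanged
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(0)
# Hypothetical stand-in for discriminator activations: three well-separated blobs.
blob_centers = np.zeros((3, 16))
blob_centers[0, 0], blob_centers[1, 1], blob_centers[2, 2] = 10.0, 10.0, 10.0
features = np.vstack([c + 0.3 * rng.standard_normal((100, 16)) for c in blob_centers])

labels, _ = kmeans(pca_2d(features), 3, rng)
print(np.bincount(labels))
```

No labels are used at any point, which is what makes the resulting assignment of health states unsupervised.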


[Figure 2 flowchart; legible labels: Start, Raw Signal, Scalogram Representation]


Figure 2. Proposed unsupervised GAN methodology.

There are two goals for the output of this methodology:
1) separation of the baseline healthy data from
the fault data, and 2) separation of the individual
faults. When a new sensor system comes online, the
engineer needs to know when the system drifts away from
healthy signals in order to decide when
to conduct planned maintenance. Once the engineer
has familiarity with the system and signals can be
identified as individual faults on the inner or outer
raceway, better predictions and a fully
supervised methodology can be used (Verstraete 2017).

The GAN architecture used in this paper
incorporates the guidelines proposed in Radford (2016);
however, adjustments to that paper's architecture
were made for handling the MFPT data set. Radford
et al. provide the following five GAN architecture
guidelines: 1) pooling layers are replaced with strided
convolutions in both the generator and discriminator
networks, 2) batch normalization (BN) is required for both
the discriminator and generator networks, 3) fully
connected hidden layers should be removed for deep
architectures, 4) Rectified Linear Unit (ReLU)
activation is used in all layers of the generator except the
output, which should use Tanh, and 5) Leaky ReLU
activation is used in all layers of the discriminator.
The combination of these five guidelines
composes what is defined as deep convolutional
generative adversarial networks (DCGANs). DCGANs
are used in this paper as a baseline to implement
GANs.


2.1 Strided convolutions 


The output shape, o, of a convolutional layer along axis j is
related to the layer's input shape, i, through three factors:
1) kernel size (k), 2) stride (s), and 3) padding (p).
Convolutional strides are generally set to s_j = 1 for most
operations; however, for GANs strided convolutions with s_j > 1
are used in place of pooling layers. This allows the
discriminator to learn its own downsampling and the generator
to learn its own upsampling.
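
The standard shape relations for a strided convolution and its transposed counterpart can be sketched as follows. The kernel size, stride and padding used here (k = 4, s = 2, p = 1) are assumptions chosen for illustration, since the text does not state them; the 96-pixel and 6-pixel sizes are those reported in Section 2.4.

```python
def conv_output_size(i, k, s, p):
    """Output size of a strided convolution along one axis."""
    return (i + 2 * p - k) // s + 1

def transposed_conv_output_size(i, k, s, p):
    """Output size of a transposed (upsampling) convolution along one axis."""
    return s * (i - 1) + k - 2 * p

# Assumed hyperparameters (not given in the text): k = 4, s = 2, p = 1.
down = [96]
for _ in range(4):
    down.append(conv_output_size(down[-1], k=4, s=2, p=1))

up = [6]
for _ in range(4):
    up.append(transposed_conv_output_size(up[-1], k=4, s=2, p=1))
```

Under these assumptions each stride-2 layer halves the spatial size in the discriminator and doubles it in the generator.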


2.2 Batch normalization 


Batch normalization (BN) (Ioffe 2015) is an important addition
to the architecture between each convolutional
layer. As the data moves through the
convolutional layers the weight and bias values 
are adjusted. This has the potential to lead to the 
data increasing or decreasing to unrealistic values. 
Batch normalization prevents this from becom- 
ing an issue with the training by normalizing the 
data to a mean of zero and a variance of one for 
each data batch. Setting values of x over a mini-batch:
$\mathcal{B} = \{x_1, \ldots, x_m\}$ to output the learned param-


2.3 Activation layers


The following activation functions are used throughout the
architecture. For the generator, two activation functions are
used: 1) Rectified Linear Unit (ReLU),

$$f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases}$$

and 2) hyperbolic tangent (tanh), $f(x) = \frac{2}{1 + e^{-2x}} - 1$.
Within the generator network ReLU is used between every layer,
except that tanh activation is used after the last layer.
For the discriminator, Leaky ReLU is used on every layer:

$$f(x) = \begin{cases} 0.01x & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases}$$

Leaky ReLU differs from ReLU only for values less than 0.
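
The three activation functions can be written directly as scalar functions. This is a minimal sketch for illustration; the function names are hypothetical, and the 0.01 slope is the value quoted for Leaky ReLU in the text.

```python
import math

def relu(x):
    """Rectified Linear Unit: zero for negative inputs, identity otherwise."""
    return x if x >= 0 else 0.0

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: a small negative slope instead of a hard zero."""
    return x if x >= 0 else slope * x

def tanh(x):
    """Hyperbolic tangent written as 2 / (1 + exp(-2x)) - 1."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

values = [-2.0, 0.0, 0.5]
relu_out = [relu(v) for v in values]
leaky_out = [leaky_relu(v) for v in values]
tanh_out = [tanh(v) for v in values]
```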


2.4 Neural network architectures 


The neural network architectures used in the pro- 
posed methodology incorporate the guidelines as 
proposed by Radford et al. Two networks were 
developed to account for the data set presented 
in Section 3. The generator network, as shown in 
Figure 3, takes the vector of noise and through 
deconvolution, BN, and activation functions creates 
an image. In this case the output is a 96 x 96 image
of a scalogram of a signal. To do this, a 100 x 1
vector is projected and reshaped to deconvolve 
into a 6 x 6 x 512 feature space. This space is then 
deconvolved to a 12 x 12 x 256, then 24 x 24 x 128, 
48 x 48 x 64, and finally a 96 x 96 x 3 image. 

The discriminator network, as shown in Figure 4, then takes
that generated image and judges whether the image is real or
fake. It does this by taking the real images and automatically
learning the feature subspace. For the 96 x 96 images this
results in a network of convolutional layers consisting of a
48 x 48 x 64 layer, 24 x 24 x 128, 12 x 12 x 256, and
6 x 6 x 512 layers. The output of this last activation holds a
lot of information about the feature space and is useful for
unsupervised fault diagnostics.

Figure 3. Generator network.

Figure 4. Discriminator network.


3 PROPOSED METHODOLOGY 
APPLICATION 


The MFPT data set is a good test of any algorithm as the outer
race fault and baseline conditions are difficult to separate.
NICE bearings were used within an experimental test rig.
Accelerometer data was gathered under three conditions. First,
at a sampling rate of 97,656 Hz, a baseline condition at 270 lbs
of load was captured. Second, a total of ten faults on the outer
raceway were gathered. At the same sampling rate and loading
condition as the baseline, three outer race faults were tested,
and the remaining seven outer race faults had the following load
cases: 25, 50, 100, 150, 200, 250 and 300 lbs. These seven load
cases had a sampling rate of 48,828 Hz. Third, with a sampling
rate again of 48,828 Hz, seven inner race faults at loadings of
0, 50, 100, 150, 200, 250 and 300 lbs were gathered. From these
raw signals, scalogram image representations were created with
the following three classes as shown in Table 1: normal baseline
(N), inner race fault (IR), and outer race fault (OR). In total
10,808 scalogram images were generated with 3,423 baseline,
1,981 inner race, and 5,404 outer race images. Fifty percent of
the images were used as the training set. Bilinear interpolation
(Raveendran 2014) aided in reducing the original images down to
a trainable size for the GAN architecture.

The first step once the GANs training is com- 
pleted is visual inspection of the generator image 
outputs. These can be seen in Figure 5. The different 
fault conditions can be identified within the images. 
This step is a key indicator for identification of mode 
collapse, vanishing gradients, non-convergence, or
checkerboarding artifacts. With this completed, the 
last activation layer of the discriminator network is 
concatenated and clustering can be done. 


Table 1. 96 x 96 pixel MFPT scalogram images. 


Baseline Inner race Outer race 


Kmeans++ is used for clustering within the 
paper to demonstrate how robust the GANs train- 
ing can be towards a simple clustering algorithm. 
Figure 6 shows the resultant clustering predictions 
of the first two principal components of the last 
activation layer of the discriminator and colored 
by the predicted labels. There is overlap in the outer 
and inner race predictions, but the GANs training 
plus kmeans++ does an excellent job separating the 
baseline signals from the fault conditions. 

Figure 7 shows the first two principal compo- 
nents of the last activation layer color coded by 
the real labels. It appears the GAN training with 
kmeans++ had the most difficulty with separating 
the fault conditions. A clustering algorithm more 
capable of handling the non-convex nature of the 
outer race fault could potentially increase the evalu- 
ation metrics but is beyond the scope of this paper. 

Since the labels to the data are known, evalua- 
tion metrics like purity (Manning 2008), normal- 
ized mutual information (NMI) (Kuncheva 2004), 
and adjusted Rand index (ARI) (Hubert 1985)
can be used to validate the architecture. Table 2 has 
the overview of these metrics for this methodology. 
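
Two of these metrics, purity and ARI, can be computed from true and predicted labels as in the following sketch. It uses the standard textbook definitions rather than the authors' code, and the label vectors are hypothetical. The ARI function is undefined (division by zero) for trivial partitions such as a single cluster.

```python
from collections import Counter
from math import comb

def purity(true, pred):
    """Fraction of samples that fall in the majority true class of their cluster."""
    total = 0
    for cluster in set(pred):
        members = [t for t, p in zip(true, pred) if p == cluster]
        total += Counter(members).most_common(1)[0][1]
    return total / len(true)

def adjusted_rand_index(true, pred):
    """Adjusted Rand index computed from the contingency table of pair counts."""
    n = len(true)
    sum_ij = sum(comb(v, 2) for v in Counter(zip(true, pred)).values())
    sum_a = sum(comb(v, 2) for v in Counter(true).values())
    sum_b = sum(comb(v, 2) for v in Counter(pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# Hypothetical labels: a perfect clustering (up to renaming) and a poor one.
true_labels = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]
poor = [0, 1, 0, 1]
```

Both metrics ignore how the clusters are named, so the relabeled perfect clustering scores 1.0 on each.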

Overall these numbers could be improved; however, the first
goal of this methodology is to separate the baseline healthy
system state from that of the faults. This methodology
demonstrates that it can handle that. More work can be done to
improve these numbers and provide better information to the
engineer regarding which individual fault case the signal
represents.

Figure 5. Output images of DCGAN generator training.

Figure 6. DCGAN PCA Kmeans++ predicted.

Figure 7. DCGAN PCA Kmeans++ real.

Table 2. MFPT 96 x 96 generator output, DCGAN, Kmeans++.

ARI     Purity  NMI
0.50    0.79    0.62


4 CONCLUSIONS 


Generative adversarial networks and deep learning 
as a field stand to unlock numerous potential appli- 
cations within the field of engineering research. 
This application is the first of its kind and shows 
great promise. 


The proposed architecture demonstrates automatic feature
learning to a level at which a simple clustering algorithm can
separate the healthy baseline signals from the fault data. An
engineer can make a maintenance decision without the need for
any knowledge of the individual signals.

The practical application of this paper has far 
reaching possibilities into many engineering fields 
and is not limited to rolling element bearings. Aero- 
space, automotive, oil & gas, and many other indus- 
tries can utilize this unsupervised methodology. 


ABSTRACT: Recent advances in deep neural networks have revolutionized machine
learning techniques for practical applications involving computer vision tasks. The flexibility of these models has
allowed difficult tasks such as image segmentation to be tackled by this type of algorithm with much
improved results. In this work, we propose and explore the capabilities of a novel deep residual neural
network with atrous convolutions for pixel to pixel classification tasks to achieve localization and
quantification of structural damage in noisy image datasets. The proposed model is applied to a dataset of
images synthesized to resemble debonding damage in honeycomb structures.


1 INTRODUCTION 


Critical infrastructures such as bridges or complex systems in
the mining industry might present damage that severely reduces
their reliability. Inspections are important to ensure that a
dangerous and costly failure does not take place. In this
context, Structural Health Monitoring is critical to ensuring
cost-effective and safe operational efficiency.

Inspections of structures usually include some form of visual
task that relies heavily on individuals with domain knowledge.
In scheduling these inspections, a preventative maintenance
program must balance a system's safety and the cost of an expert
analyst. Moreover, there are situations where these tasks are
carried out in perilous or difficult circumstances.

To tackle these limitations, automated visual 
inspection systems might help to reduce costs, 
increase accessibility, and improve safety. For 
instance, unmanned aerial vehicles (UAVs) can 
autonomously produce videos and photographs 
of damaged areas. This can be further improved 
by methods based on computer vision to identify 
damage in real time. 

For certain structures and equipment, the identification of the
type or location of damage is not enough. It is also important
to quantify the size of the damage, as it can be used to track
damage growth over time and therefore to assess the possible
effects, and their severity, on the health state of the
structure or equipment. However, damage quantification is not
widely used by current forecast algorithms (Douka et al, 2003)
(Ohno et al, 2010) because it is a complex process that is
difficult to automate.

The existing methods for damage quantification are based on
processing of images (Zhou et al, 2015) that in general require
different parameters for each image and manually extracted
features, a procedure that is labor intensive and costly as well
as usually not adequate for real time damage quantification.

Thus, we propose a deep learning based model 
for damage quantification based on deep residual 
neural networks with atrous convolution layers. 
Applied in the context of semantic segmentation 
(L.-C. Chen, et al 2016) (J. Long, et al 2015), these 
convolution operations use contextual information 
to perform pixel-by-pixel image classification by 
identifying which pixels correspond to damaged 
areas in a structure or equipment. The proposed 
model is applied to a dataset of images synthesized 
to resemble debonding damage in honeycomb 
structures. 

The remainder of the paper is organized as follows. Section 2
introduces the building blocks of residual networks and presents
the proposed model. Section 3 discusses the example of
application and results based on the proposed model and compares
it to a fully convolutional neural network and k-means for image
segmentation. Section 4 presents some concluding remarks.


2 PROPOSED DEEP RESIDUAL NEURAL 
NETWORK MODEL WITH ATROUS 
CONVOLUTIONS 


In this section, we discuss the proposed model 
based on deep residual neural network with atrous 
convolutions and operating on images for dam- 
age quantification. We first introduce the several 
building blocks of the proposed model and then 
the proposed architecture is presented. 


2.1 Deep learning 


Deep learning is a set of techniques based on neu- 
ral networks that in recent times have shown bet- 
ter results than traditional techniques. The great 
success of this framework is due to the increase of 
the computational power added to the increasing 
availability of larger datasets, thus being able to 
train more complex and robust models. 

These techniques are based on the hierarchy 
of the layers of a neural network, which can vary 
from a few layers to hundreds (He et al, 2016). The 
success of deep learning lies in how information is 
distributed throughout the network, extracting the 
most abstract characteristics of the data in the first 
layers and locating the most specific characteristics 
in the latter layers. 

Deep learning has established a change in how 
to solve machine learning problems by generating a 
variety of architectures that are not limited to super- 
vised tasks. The applications range from unsuper- 
vised models such as auto-encoders (Vincent et al., 
2010) to autoregressive models such as recurrent networks
(Gregor et al., 2014). In the case addressed in
this paper, a supervised task is analyzed that encom- 
passes a deep neural network architecture with mul- 
tiple outputs, each corresponding to a pixel. 


2.2 Segmentation 


Semantic segmentation consists of the pixel to pixel
classification of an image, making it possible to predict
whether a pixel is considered a part of the damage. This type of
algorithm takes into consideration the pixel context in the
image, establishing whether it corresponds to an isolated damage
without importance to the problem at hand or belongs to a damage
type of interest. This type of segmentation is not limited to
binary classification, as it can also be extended to establish
whether the damage is of a certain type, that is, how it was
generated, since different damages result from different
failure modes.

By obtaining the pixel to pixel prediction of the 
image it is possible to obtain the size of the dam- 
age (e.g., crack), essential information to diagnose 
the damage and predict the Remaining Useful Life 
(RUL) of a structure or machine. 
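
As a simple illustration of how a pixel to pixel prediction yields a damage size, the damaged pixels in a binary mask can be counted and scaled by the physical area of one pixel. The mask and the `pixel_area` value below are hypothetical; this is only a sketch of the idea.

```python
def damage_area(mask, pixel_area=1.0):
    """Damage size as (number of pixels labeled 1) x (area covered by one pixel)."""
    damaged_pixels = sum(1 for row in mask for value in row if value == 1)
    return damaged_pixels * pixel_area

# Hypothetical 2 x 3 predicted mask: 1 = damage, 0 = no damage.
predicted_mask = [
    [0, 1, 1],
    [0, 1, 0],
]
area = damage_area(predicted_mask, pixel_area=0.25)  # e.g. 0.25 mm^2 per pixel
```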

One of the disadvantages of this type of algo- 
rithm lies in the construction of the dataset 
required for its training. In fact, since the predic- 
tion of the model is pixel by pixel, the classifica- 
tion task for the training images must also be pixel 
to pixel. An example of the data that should be 
available is shown in Figure 1, where the left image
shows the training image while the right image is 
the corresponding ground truth. If the training 
set is properly constructed, a procedure can be 
automated that delivers a damage prediction that 
is flexible for a given application, which is deter- 
mined by the ground truth. 


2.3 Convolutional neural networks


Convolutional neural networks are a type of neural network that
takes advantage of the translational invariance that images
have. In summary, the parameters that these networks learn are
filters W that are convolved with a subarea of the feature maps.
As the convolution is done in only one sector of the feature
maps, the number of parameters learned is smaller, and therefore
the computational cost is reduced and more relevant
characteristics are learned for data that fulfill the assumption
of translational invariance. Mathematically, a convolutional
layer can be written as shown in equation (1).


$$x^{l}_{ij} = \sum_{m}^{f_1} \sum_{n}^{f_2} \sum_{c}^{C} w^{l}_{m,n,c}\, s^{l-1}_{(i+m),(j+n),c} + b^{l} \qquad (1)$$

in which l is the index of the layer. The input of each layer is
x, of dimensions H x W x C. These variables are the height, width
and number of feature maps, respectively, with x^0 the training
image that is input to the network. The weights of the network
are w and b is the bias. The variable s denotes the feature maps
after the activation function; m and n iterate over the width
and height of the filter, f1 and f2. The variable c iterates over
the number of channels C, and i and j iterate over the width and
height of the feature map, H and W. In addition, the convolution
in equation (1) is carried out a number of times equal to the
number of feature maps desired for layer l.

Figure 1. Left: training image. Right: ground truth.
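
The triple sum of the convolutional layer can be sketched with plain nested loops for a single output feature map. The sketch assumes zero-based indices and no padding (so the output is smaller than the input), neither of which is specified in the text; the function name and the toy input are hypothetical.

```python
def conv_layer(s_prev, w, b):
    """One output feature map of a convolutional layer.

    s_prev: previous-layer feature maps indexed as s_prev[row][col][channel]
    w:      filter indexed as w[m][n][channel]
    b:      scalar bias
    """
    H, W, C = len(s_prev), len(s_prev[0]), len(s_prev[0][0])
    f1, f2 = len(w), len(w[0])
    out = []
    for i in range(H - f1 + 1):
        row = []
        for j in range(W - f2 + 1):
            acc = b
            for m in range(f1):
                for n in range(f2):
                    for c in range(C):
                        acc += w[m][n][c] * s_prev[i + m][j + n][c]
            row.append(acc)
        out.append(row)
    return out

# Toy 3 x 3 single-channel input and a 2 x 2 filter of ones.
s_prev = [[[1], [2], [3]], [[4], [5], [6]], [[7], [8], [9]]]
w = [[[1], [1]], [[1], [1]]]
x = conv_layer(s_prev, w, b=0.0)
```

Repeating this with a different filter for each desired feature map gives the full layer output.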


2.4 Atrous convolutions 


In the context of semantic segmentation, it is relevant to
establish the information of the context in which a pixel is
found. To do this, previous algorithms based on deep networks
(Krizhevsky et al, 2012) used a series of convolutions followed
by pooling layers, with a subsequent layer known as
deconvolution. The latter increases the size of the smaller
feature map so that a cost can be established against the pixel
to pixel prediction. This type of model has a disadvantage:
resolution is lost with increasingly deeper network
architectures, so it is not possible to use this type of
transformation when the number of layers is large, as is the
case of the residual networks used in the proposed model. Atrous
convolutions (Holschneider et al, 1989) try to solve this
problem by creating filters that have zeros in between. For a
traditional filter, this means increasing its size, placing
zeros in between its values and performing a standard
convolution. Mathematically this is shown in equation (2) for
the case of one dimension (1D). If the rate r is equal to 1, we
have the usual convolution; x is the input and w[k] is the
filter of length K. Figure 2 shows the variation of the filters
used in the atrous convolution for different rates, with the
green color representing 0 values in the filter. As these zeros
are not learned parameters, performing atrous convolution does
not increase the number of parameters. The blue color represents
values of the filters that are learned by backpropagation.


$$y[i] = \sum_{k=1}^{K} x[i + r \cdot k]\, w[k] \qquad (2)$$
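
The 1D atrous convolution can be sketched in a few lines. The sketch uses zero-based filter indices and only computes outputs where the dilated filter fits entirely inside the input; both choices are assumptions made for illustration.

```python
def atrous_conv1d(x, w, rate=1):
    """1D atrous convolution: y[i] = sum_k x[i + rate * k] * w[k]."""
    K = len(w)
    span = rate * (K - 1)  # distance between the first and last filter tap
    return [sum(x[i + rate * k] * w[k] for k in range(K))
            for i in range(len(x) - span)]

x = [1, 2, 3, 4, 5, 6]
w = [1, 0, -1]

standard = atrous_conv1d(x, w, rate=1)  # usual convolution
dilated = atrous_conv1d(x, w, rate=2)   # same 3 weights, wider context
```

With rate 2 the same three weights span five input positions, so the context grows while the number of learned parameters stays at three.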


2.5 Atrous pyramid pooling 


In the proposed architecture discussed in Section 2.7, the
concept of pyramid pooling is used, in which atrous convolution
is performed in parallel using different resolutions and,
therefore, capturing contextual information at different levels.

Figure 2. Atrous convolutions for different rates (panels: rate 1, rate 2, rate 3).


2.6 Residual neural network 


In general, deep neural networks before 2015 did not exceed 20
layers, as trying to increase this number resulted in
performance deterioration. This is attributed to the vanishing
gradient problem, in which the earliest layers of the neural
network receive updates that are too small and, therefore,
barely learn. This is due to the chain rule and the way
backpropagation is carried out.

Recently this difficulty was overcome by a type of neural
network called residual networks (He et al, 2016). These
networks allow the use of much deeper architectures, reaching
up to 152 layers arranged in residual blocks of several layers
each, thus outperforming the usual convolutional networks. For
example, in the ILSVRC & COCO 2015 competitions (ImageNet,
2015), residual networks obtained first place in all categories
(He et al, 2016). Some known residual networks are ResNet-50,
ResNet-101 and ResNet-152 (He et al, 2016), where the number in
the name represents the number of layers.


Figure 3. Residual block. 

Figure 4. Proposed architecture.


The main idea underlying the residual net- 
works is that if a network is working properly, 
it should not worsen its performance by increas- 
ing the number of layers. In this way, the residual 
networks are constructed in blocks as shown in 
Figure 3, where the information flows through the 
identity mapping if the network does not need to 
change the information that is reaching a certain 
layer. On the other hand, if changes in the feature maps are
required, mappings with different non-linear functions are
applied.

In Figure 3, the classical components of a residual block are
shown, where the non-linear transformations correspond to the
transformations that a traditional convolutional network could
have. The “Merge” is the function that determines how the input
of the block is combined with the output of the non-linear
transformations. Following the “Merge”, a non-linear activation
is placed.
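
The structure of a residual block can be sketched on a flat feature vector as follows. The sketch assumes that the merge is an element-wise addition and that the final activation is a ReLU; the `transform` argument is a hypothetical stand-in for the block's non-linear transformations.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def residual_block(x, transform):
    """Residual block: activation(merge(x, F(x))) with addition as the merge."""
    fx = transform(x)                        # non-linear transformations F(x)
    merged = [a + b for a, b in zip(x, fx)]  # identity shortcut plus F(x)
    return relu(merged)

x = [1.0, -2.0, 3.0]

# If the transformations contribute nothing, the input passes through the
# identity path and is only affected by the final activation.
passthrough = residual_block(x, lambda v: [0.0] * len(v))

# A hypothetical transformation that modifies the features.
shifted = residual_block(x, lambda v: [0.5 * a for a in v])
```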


2.7 Proposed architecture 


The proposed residual neural network architecture is comprised
of five residual blocks followed by a pyramid pooling layer, as
shown in Figure 4. Each residual block has the following
structure (see Figure 5): (i) one convolution with stride of
1 x 1 and batch normalization; (ii) atrous convolution with
filter of 3 x 3 and rate of 2 followed by batch normalization;
(iii) another convolution with stride 1 x 1 and batch
normalization; (iv) an addition operation over all the resulting
feature maps, whose result is passed to a ReLU activation
function. The resulting feature map is the input to the next
residual block.
Note that the number of feature maps (FM) varies from block to
block. For the first two blocks, the feature maps FM_1, FM_2 and
FM_3 are equal to 256, 256 and 1024, respectively. For the
remaining blocks, the numbers of feature maps FM_1, FM_2 and
FM_3 are 512, 512 and 2056, respectively. After all the residual
blocks, pyramid pooling is applied with 4 different rates, i.e.,
atrous convolutions with rates of 6, 12, 18 and 24. Moreover,
the model uses a cross entropy loss function and the parameters
are updated via backpropagation with the ADAM optimizer.

Figure 5. Residual block used.
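
To give a feel for the context captured by each atrous rate, the side length covered by a dilated filter can be computed as k + (k - 1)(r - 1). The sketch below applies this to the 3 x 3, rate 2 filter of the residual blocks and to the four pyramid pooling rates; the 3 x 3 filter size for the pyramid branches is an assumption, as the text only gives the rates.

```python
def effective_filter_size(k, rate):
    """Side length covered by a k x k filter with (rate - 1) zeros between taps."""
    return k + (k - 1) * (rate - 1)

# Atrous layer inside each residual block: 3 x 3 filter, rate 2.
block_size = effective_filter_size(3, 2)

# Pyramid pooling branches; 3 x 3 filters are assumed here.
pyramid_sizes = [effective_filter_size(3, r) for r in (6, 12, 18, 24)]
```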


3 EXAMPLE OF APPLICATION 


3.1 Data 


The dataset used in this example was obtained via finite element
modeling of debonding damage in honeycomb structures and
consists of 6000 noisy images of 98 x 69 pixels, of which 2000
images contain circular damage, 2000 contain rectangular damage
and the remaining ones are undamaged. Note that the damage
quantification is performed pixel by pixel, thus the
classification task is separated into two: quantification of
rectangular damage and quantification of circular damage. In
both cases, the ground truth is the original image without
noise. Figure 6 shows examples of rectangular and circular
damage images.

Figure 6. Examples of damage types and corresponding ground
truth used for training: the two top images are for circular
damage while the two bottom images are for rectangular damage.

The results presented in the next section were obtained by
running the experiments on a computer with an Intel Core i7
processor at 4.2 GHz, 32 GB of RAM, and an Nvidia Titan X Pascal
GPU, with TensorFlow 1.3, CUDA 9 and cuDNN 5.1 used to compile
and optimize the neural networks.


3.2 Metrics 


To analyze the damage quantification results, the following
semantic segmentation metrics are considered: pixel accuracy
(PA), mean accuracy (MA), mean intersection over union (MIoU)
and frequency weighted intersection over union (FWIoU). Pixel
accuracy in equation (3) corresponds to the overall accuracy.
Mean accuracy, given in equation (4), is the average accuracy
over all classes involved in the segmentation task. Mean
intersection over union is shown in equation (5) and is the
average over classes of the number of correctly classified
pixels divided by the number of pixels that either belong to or
are predicted as that class; it is thus an important performance
metric for semantic segmentation as it encompasses true
positives and false positives for the pixel by pixel
classification. Frequency weighted intersection over union, as
shown in equation (6), is similar to MIoU, but with the
difference that FWIoU takes into account the number of data
points in each class.

Therefore, MIoU and FWIoU are stricter metrics; MIoU and MA are
not sensitive to unbalanced datasets, whereas FWIoU and PA could
be inflated if that is the case.


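$$PA = \frac{\sum_i n_{ii}}{\sum_i t_i} \qquad (3)$$

$$MA = \frac{1}{n_{cl}} \sum_i \frac{n_{ii}}{t_i} \qquad (4)$$

$$MIoU = \frac{1}{n_{cl}} \sum_i \frac{n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}} \qquad (5)$$

$$FWIoU = \left(\sum_k t_k\right)^{-1} \sum_i \frac{t_i\, n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}} \qquad (6)$$

$n_{cl}$: Number of classes included in the ground truth
segmentation.

$n_{ij}$: Number of pixels of class i predicted to belong to
class j.

$t_i$: Total number of pixels of class i in the ground truth
segmentation.

$\sum_k t_k$: Total number of pixels in the ground truth
segmentation.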
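
The four metrics can be computed from a confusion matrix whose entry n[i][j] is the number of pixels of class i predicted as class j. The following is a minimal sketch using the standard definitions of these metrics; the function name and the toy matrix are hypothetical.

```python
def segmentation_metrics(n):
    """PA, MA, MIoU and FWIoU from a confusion matrix n[i][j]
    (pixels of ground-truth class i predicted as class j)."""
    n_cl = len(n)
    t = [sum(row) for row in n]                                   # t_i
    predicted = [sum(n[j][i] for j in range(n_cl)) for i in range(n_cl)]
    diag = [n[i][i] for i in range(n_cl)]
    total = sum(t)
    iou = [diag[i] / (t[i] + predicted[i] - diag[i]) for i in range(n_cl)]
    return {
        "PA": sum(diag) / total,
        "MA": sum(diag[i] / t[i] for i in range(n_cl)) / n_cl,
        "MIoU": sum(iou) / n_cl,
        "FWIoU": sum(t[i] * iou[i] for i in range(n_cl)) / total,
    }

# Toy two-class example: 4 pixels per class, 3 of each classified correctly.
metrics = segmentation_metrics([[3, 1], [1, 3]])
```

In this balanced toy case PA and MA coincide at 0.75, while both IoU-based metrics drop to 0.6 because every misclassified pixel is penalized in two classes.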


3.3 Results and discussion 


In this section, we present the results of the damage
quantification obtained from the proposed model as well as a
comparison to two other approaches: K-means for segmentation
with feature space in a similar fashion as presented in
(Dhanachandra et al, 2015) and a deep fully convolutional neural
network (FCN) with the same architecture shown in
(Long et al, 2015).

Table 1 shows the results for damage quantification for the test
dataset containing 200 images. For circular damage, both the
proposed model and the FCN deliver quite similar performance
metric results, which are significantly superior to those of the
unsupervised K-means. This is mainly because the deep learning
based models are supervised, thus taking into consideration the
source of the damage images. In terms of the MIoU metric, the
proposed model performs better than the FCN because atrous
convolutions do not lose detail information as occurs with the
standard convolutions used in the FCN architecture.

Table 1. Segmentation results for the considered metrics. R:
Rectangular damage, C: Circular damage.

         Proposed model %    FCN %             K-means clustering %
         R       C           R       C         R       C
PA       97.8    98.5        97.5    98.6      96.5    97.2
MA       89.3    90.0        83.9    85.5      86.4    85.7
MIoU     81.2    82.6        76.7    81.6      76.5    76.9
FWIoU    96.1    97.2        95.5    97.4      94.2    95.3


The performance metrics for rectangular damage quantification
are worse than in the previous case for all models, as
segmentation for this type of damage is a more complex task than
for the circular one: it requires obtaining greater contextual
information and, at the same time, resolving finer details
(e.g., corners). Moreover, the proposed model outperforms the
two other algorithms. In particular, the performance
deterioration of the FCN model might be attributed to the
information loss due to the use of pooling layers, whereas the
proposed model uses atrous convolution, which captures
contextual information without losing information.

Figure 7 shows examples of predicted damage 
quantifications delivered by the proposed and the 
FCN models for two test images: the top images 
are the noisy test data, the middle ones are the 
ground truths and the bottom images correspond 
to the predicted damages. It can be observed that 
the rectangular damage prediction from the pro- 
posed model is able to better capture the shape of 
the damage, in particular the corners, whereas the 
FCN model struggles in that task and thus leading 
to deterioration in its performance when quantify- 
ing this type of damage. 

Table 2 and Table 3 show the non-normalized
and normalized versions of the confusion matrix 
for the proposed model for every pixel in test data 
and for both rectangular and circular damages. 
Note that the true negatives are 99.1% and 98.9% 
for circular and rectangular damages, respectively. 
These results can be explained by the unbalanced 
pixel count in the ground truth between no damage 
and damage. It is important to note also that the 
proposed model has higher scores for true positives 
with 85.9% and 79.7% for circular and rectangular 
damages, respectively, arguing in favor of the pro- 
posed model’s robustness for the unbalanced data 
in this example of application. 

The promising results obtained by the proposed model come at a
computational cost that might be perceived as a disadvantage. In
fact, training the proposed model takes forty minutes for twenty
epochs. However, it is significantly faster than K-means in
predicting the damage size of an unseen image: 0.07 second
against 0.6 second for the latter. These results indicate that
the proposed model might be a candidate for developing online
monitoring systems where fast responses are required.

Figure 7. Top: test images; second row: ground truth; third row:
damage prediction from proposed model; bottom: damage prediction
from FCN.

Table 2. Confusion matrix of the proposed model. ND: No Damage,
D: Damage.

              Circular damage        Rectangular damage
              ND         D           ND         D
No Damage     1277298    11532       1262236    14340
Damage        8962       54608       15394      60403

Table 3. Normalized confusion matrix of the proposed model. ND:
No Damage, D: Damage.

              Circular damage (%)    Rectangular damage (%)
              ND         D           ND         D
No Damage     99.1       0.9         98.9       1.1
Damage        14.1       85.9        20.3       79.7
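
The relation between the raw and normalized confusion matrices is a row-wise normalization, which can be sketched as follows using the pixel counts reported in Table 2. The function name is hypothetical.

```python
def normalize_rows(matrix):
    """Express each row of a confusion matrix as percentages of the row total."""
    return [[round(100.0 * v / sum(row), 1) for v in row] for row in matrix]

# Pixel counts from Table 2 (rows: ground truth ND, D; columns: predicted ND, D).
circular = [[1277298, 11532], [8962, 54608]]
rectangular = [[1262236, 14340], [15394, 60403]]

circular_pct = normalize_rows(circular)
rectangular_pct = normalize_rows(rectangular)
```

The diagonal entries of the second row are the true positive rates quoted in the text, 85.9% and 79.7%.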


4 CONCLUDING REMARKS 


This paper presented a novel residual atrous convo- 
lution neural network model for quantification of 
damage based on image processing. The proposed 
model was applied for damage quantification and 
segmentation of debonding damage in honeycomb 
structures. 

The results show that the proposed model is 
a promising tool for damage quantification and 
segmentation with superior performance in nois- 
ier and more complex damage types and shapes 
than both deep fully convolutional networks and 
K-means algorithm. 


ABSTRACT: The Return On Investment (ROI) of a PHM system for a fleet of rolling stock was calculated.
We show that the financial feasibility depends strongly on expected asset life as well as on the asset
reliability model. The tool that was used for the analysis can be used for choosing the optimal PHM system
for various assets.


1 INTRODUCTION 


The goals of PHM systems are to provide advanced 
warning of system failures, enable Condition 
Based Maintenance (CBM), increase system avail- 
ability, reduce Life-Cycle Cost (LCC), and reduce 
No Failure Found events (Pecht 2008). 

While these goals are appealing, PHM integra- 
tion projects include expensive installation of sen- 
sors, communication networks, big data storage 
and computing hardware or services (Feldman, 
Sandborn, and Jazouli 2008). 

Therefore, it is not surprising that many asset 
owners are reluctant to venture into such projects. 
In order to assess the pros and cons of a suggested 
PHM project, Return On Investment (ROI) Analy- 
sis is required. The PHM system ROI is defined as: 


$$ROI = \frac{C_1 + C_2 - C_3}{C_3} \qquad (1)$$


where $C_1$ represents the financial cost reduction
due to decreased downtime, $C_2$ is the financial savings
due to reduced maintenance costs, and $C_3$ is
the cost of PHM system procurement, installation
and services.

While the cost of investment is relatively easy 
to quantify, the expected gain is hard to assess. A 
good PHM system is expected to reduce the occur- 
rence of some failure modes. Therefore, Failure 
Mode, Effects, and Criticality Analysis (FMECA) 
is a good basis for PHM design (Meng and Zhang 
2013; Banks, Reichard, Crow and Nickell 2009) 
and ROI analysis. 

However, additional considerations are 
required. Maintenance policies such as inspections 
and scheduled maintenance affect the number of 
expected corrective and preventive maintenance 
events during the lifetime. Logistics policies affect 
the transportation and spare waiting times. These 
parameters heavily influence the asset/fleet main- 
tenance cost as well as downtime penalties. There- 
fore, expected financial gain depends also on the 
facility/fleet operation profile, maintenance and 
logistics policies, and related LCC (Kacprzynski, 
Roemer, and Hess 2002). 

The general tradeoff of initial investment and 
maintenance cost for LCC optimization was dis- 
cussed by W. Taylor (Taylor 1981). The question of 
PHM system ROI is a classic case for such tradeoff 
analysis. 

Several PHM ROI evaluations were reported 
in the literature (see references in Feldman, Sand- 
born, and Jazouli 2008), mostly related to the 
defense industry. In recent years PHM and the 
Industrial Internet of Things (IoT) have penetrated 
other industries, with a focus on critical, expensive 
equipment such as turbines in the oil and gas, wind 
farm, and aviation industries. These cases relate to 
new equipment with built-in sensors. 

In this paper we present an example regarding 
the ROI on a PHM system for an existing rolling 
stock asset. The LCC of several scenarios is con- 
sidered and it is shown under which conditions 
PHM installation is financially advantageous. 


2 CASE DESCRIPTION 


A fleet of 17 trains located in two sites (10 trains 
in site A and 7 trains in site B) is considered. The 
trains operate 18 hours each day, and have 6 hours 
for scheduled maintenance and inspections during 
the night. A central stock services both sites. 

The suggested PHM system will monitor the 
state of the rolling stock motors and pantographs. 
Adding the PHM system is expected to increase 
visibility and control of the rolling stock fleet, and 
to reduce failure events and downtime. 
Following is a financial analysis of the ROI on 
the suggested rolling stock PHM system. 


2.1 Investment 


Adding the PHM system is expected to cost 
$20,000 per train for motor and pantograph sen- 
sors, and $160,000 for the central servers and 
operation center dashboard displays. An addi- 
tional cost of $4,000 per month is paid for support 
and maintenance of the PHM system by the PHM 
provider. 

This amounts to an investment of $1.22M for a 
15-year period, or $1.94M for a 30-year period. 
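
The two totals can be reproduced with a few lines of arithmetic. The sketch below uses only the cost figures quoted in this section and the fleet size of 17 trains from Section 2; the function and parameter names are illustrative and are not part of any tool used in the study.

```python
def phm_investment(n_trains, years,
                   per_train=20_000,         # sensors per train, from Section 2.1
                   central=160_000,          # servers and dashboard displays
                   support_per_month=4_000): # support and maintenance fee
    """Total PHM investment in dollars over the given life-cycle."""
    one_off = n_trains * per_train + central
    recurring = support_per_month * 12 * years
    return one_off + recurring


for years in (15, 30):
    total = phm_investment(n_trains=17, years=years)
    print(f"{years} years: ${total / 1e6:.2f}M")
```

Note that the recurring support fee, not the hardware, dominates the investment: it accounts for $0.72M of the 15-year total and $1.44M of the 30-year total.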


2.2 Calculating expected gain 


In order to calculate the mean expected gain, the 
fleet behavior over the life period has to be predicted. 

Most commercial system simulation software 
packages are based on the Monte-Carlo method. Indeed, 
Monte-Carlo simulations are flexible and easy to 
create. However, in order to achieve highly accu- 
rate results, many simulations have to be carried 
out. This is especially important in systems where 
rare events can have significant effect on the LCC. 
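
As a rough illustration of why rare events drive the simulation effort (a standard sampling estimate, not a result from this study): if an event occurs with probability $p$ in one simulated life-cycle, the relative standard error of its estimated frequency after $N$ independent runs, and the number of runs needed for a target relative error $\varepsilon$, are approximately

```latex
\frac{\sigma_{\hat{p}}}{p} = \sqrt{\frac{1-p}{N\,p}}
\qquad\Longrightarrow\qquad
N \approx \frac{1-p}{p\,\varepsilon^{2}}
```

For an assumed $p = 10^{-3}$ and $\varepsilon = 0.1$, this gives $N \approx 10^{5}$ runs, which is why analytic methods become attractive when rare events matter.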

We used the apmOptimizer software that 
includes a combination of analytic methods 
(Birolini 1999) for calculating the fleet Life-Cycle 
Cost (LCC), and identifying cost and failure driv- 
ers. The advantages of analytic calculations are 
speed and accuracy. 

Calculations were carried out for various sce- 
narios with and without the PHM system. 

A detailed model of the existing fleet was con- 
structed, accounting for: 


Reliability Data 

• Component failure distribution 
• Component failure modes 
• Rolling stock redundancies 
• Operation profile 

Maintenance Data 

• Component repair / discard policy 
• Repair time 
• Corrective maintenance 
• Preventive maintenance 
• Inspections 

Logistic Data 

• Spare parts 
• Transportation times 
• Procurement time 

Financial Data 

• Cost of spare parts 
• Penalties due to service agreement 
• Corrective maintenance 
• Preventive maintenance 
• Inspections 


Using the input data, the apmOptimizer calcu- 
lates the expected rolling stock availability, failures, 
maintenance, inspections, and LCC. 

Figure 1 presents the breakdown tree of rolling 
stock in site A. The “Reliability Model” column in 
Fig. 1 describes the relevant model for each sub- 
system. The “Distribution Type” column in Fig. 1 
presents the failure distribution type for each 
component. Electronic components were assigned 
an Exponential failure distribution whereas the 
mechanical components were given a Normal dis- 
tribution that describes their ageing behavior. 
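
The practical difference between the two distribution types is easiest to see in the hazard rate (the instantaneous failure rate given survival so far). The following sketch compares them; the numerical parameters are hypothetical and are not taken from the fleet model.

```python
from statistics import NormalDist


def exponential_hazard(rate, t):
    """Hazard of an Exponential lifetime: constant, i.e. no ageing."""
    return rate


def normal_hazard(mean, sd, t):
    """Hazard of a Normal lifetime: pdf(t) / (1 - cdf(t)), increasing with age."""
    dist = NormalDist(mean, sd)
    return dist.pdf(t) / (1.0 - dist.cdf(t))


# Hypothetical component: mean life 10 years, standard deviation 2 years.
for age in (6.0, 8.0, 10.0, 12.0):
    print(age, exponential_hazard(0.1, age), round(normal_hazard(10.0, 2.0, age), 4))
```

A constant hazard means that preventive replacement brings no benefit, whereas an increasing hazard is what makes scheduled or condition based maintenance worthwhile for the mechanical parts.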

The initial model described the fleet behav- 
ior without a PHM system. In order to calculate 
the behavior of the fleet with PHM, the original 
model was copied, and the following changes were 
implemented: 


• The motor and several pantograph components have 
  sensors; therefore, there is no need for scheduled 
  maintenance, only Condition Based Maintenance 
  (CBM). 

• When CBM is conducted instead of scheduled 
  maintenance, only the problematic components 
  are treated. Therefore, each CBM event is 
  cheaper than the corresponding scheduled 
  maintenance event. 


Four scenarios were considered: 


No PHM system, life-cycle of 15 years 
No PHM system, life-cycle of 30 years 
Added PHM system, life-cycle of 15 years 
Added PHM system, life-cycle of 30 years 


[Figure 1: screenshot of the breakdown tree. For each subsystem and component (sensors control console, bogies, bearings, brakes, motors, pantograph, etc.) it lists the quantity, the reliability model (Serial, Parallel, K-out-of-N or Leaf) and, for leaf components, the distribution type (Exponential or Normal).] 
Figure 1. Breakdown tree of rolling stock in site A. 
For each scenario the optimal maintenance and 
logistics policies were found using apmOptimizer’s 
enhanced dynamic programming algorithms. The 
optimization process was very fast due to the use 
of analytic calculations (as opposed to Monte 
Carlo). The optimization goal is to minimize the 
LCC. Since the LCC includes the downtime penalty, 
the optimization also ensures low downtime (high 
fleet availability). Furthermore, the LCC includes the 
cost of spare parts, corrective and preventive 
maintenance, and inspections. 

The net gain of a PHM is equal to the calculated 
LCC of the fleet without PHM minus the LCC of 
the fleet with PHM, minus the PHM investment 
(this is $C_1 + C_2 - C_3$ from Eq. 1). 


3 RESULTS 


LCC was calculated for each optimized scenario. 
Table 1 presents a summary of the calculated results. 
From Table 1 it is clear that a PHM system is 
not expected to yield financial savings when a life- 
cycle of 15 years is considered (ROI < 0). On the 
other hand, some savings are expected when a life- 
cycle of 30 years is considered (ROI = 0.278). 
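
The ROI values can be reproduced from the Table 1 figures. The sketch below reads Eq. 1 as net gain divided by PHM investment, which matches the value of 0.278 quoted above; the function name is illustrative.

```python
def roi(lcc_without_phm, lcc_with_phm, investment):
    """Net gain (LCC saving minus PHM investment) divided by the investment.

    All arguments are in the same currency unit (here: millions of dollars).
    """
    net_gain = lcc_without_phm - lcc_with_phm - investment
    return net_gain / investment


# Figures from Table 1, in $M.
roi_15 = roi(43.52, 42.35, 1.22)   # slightly negative
roi_30 = roi(100.9, 98.42, 1.94)   # about 0.278
print(f"15 years: {roi_15:.3f}, 30 years: {roi_30:.3f}")
```

In the 15-year case the LCC saving of $1.17M falls just short of the $1.22M investment, so the sign of the ROI is sensitive to small changes in either figure.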

The analytic tool that was used produced addi- 
tional useful information: it was found that the 
main contributor of rolling stock downtime is 
bearing failure. The suggested PHM program did 
not include bearing monitoring. 


Table 1. Expected LCC for various scenarios. 

Model         PHM investment   LCC        Total 
Lifecycle 15 years 
No PHM        $0               $43.52M    $43.52M 
With PHM      $1.22M           $42.35M    $43.57M 
Lifecycle 30 years 
No PHM        $0               $100.9M    $100.9M 
With PHM      $1.94M           $98.42M    $100.4M 

Table 2. Expected LCC for various scenarios with 
enhanced PHM. 

Model         PHM investment   LCC        Total 
Lifecycle 15 years 
No PHM        $0               $43.52M    $43.52M 
With PHM      $2.44M           $14.3M     $16.74M 
Lifecycle 30 years 
No PHM        $0               $100.9M    $100.9M 
With PHM      $3.88M           $29.38M    $33.26M 


Next we consider the case where bearing sen- 
sors were also added. The bearing sensors, data 
analysis, storage and maintenance are expected to 
double the PHM cost. Table 2 presents a summary 
of expected LCC for the enhanced PHM system. 

The data in Table 2 clearly demonstrate the financial 
advantage of PHM systems which are applied 
to the asset/fleet critical failure driver. The reduced 
LCC results from reduced maintenance costs as 
well as greatly reduced service penalties. ROI for 
15 years is 10.97 while for 30 years the ROI is 
17.43. These values are very high. One possible 
reason for the high ROI is the assumption that 
bearing failures are 100% prevented by the PHM 
system. The apmOptimizer can also account for 
non-ideal PHM systems. 
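
As a check by hand, reading Eq. 1 as net gain over PHM investment and inserting the Table 2 figures (in $M) gives

```latex
\mathrm{ROI}_{15} = \frac{43.52 - 14.30 - 2.44}{2.44} = \frac{26.78}{2.44} \approx 10.98,
\qquad
\mathrm{ROI}_{30} = \frac{100.9 - 29.38 - 3.88}{3.88} = \frac{67.64}{3.88} \approx 17.43
```

which agrees with the values quoted above up to rounding of the first figure.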


4 CONCLUSIONS 


A general conclusion is that PHM systems are 
most suited for asset-intensive systems with long 
expected life-cycles (aircraft, rolling stock, utilities, 
mining and O&G). The reason is as follows: 
While PHM reduces downtime penalties and main- 
tenance cost, a high installation cost is incurred. 
For long life-cycles the accumulated savings over- 
come the initial PHM installation cost. 

Another conclusion is that the right PHM sys- 
tem has to be selected in order to address the asset/ 
fleet main drivers of failure and downtime. This 
requires preliminary field data collection and sen- 
sitivity analysis using modeling software such as 
the apmOptimizer. 
ABSTRACT: In recent years, the development of complex technical products has shown an increasing 
number of sensors, electronic control units, data logging and monitoring systems within consumer 
goods (e.g. automobiles, washing machines) and industrial goods (e.g. machine tools, manufacturing 
systems). In many cases, the main goal of data logging is monitoring, controlling and optimisation of the 
product functionalities within the usage phase. A further aim is the fulfilment of the process capability of 
manufacturing processes. Therefore, the concept of operating data logging and monitoring systems, 
especially operating data (mainly type, volume, and format) as well as hardware (like sensors and 
storage), is designed by the development engineer within the product concept development phase. This 
paper discusses challenges, requirements and approaches for the future conceptual design of operating data 
logging concepts of technical products related to reliability engineering. The starting point is the state of 
the art. Based on that, the concept for operating data logging within a monitoring system is shown. The 
concept draft is subdivided into three parts: part one deals with data analytics, part two contains data 
requirements, and part three focuses on hardware requirements. The presented research study was worked 
out on the international research platform “Computational Reliability Engineering in Product Development 
and Manufacturing (CRE) - 2017” and contains contributions of universities, institutes and original 
equipment manufacturers of the industrial nations Germany, United Kingdom, Japan, Turkey and France. 
1 INTRODUCTION 


In recent years, the development of complex 
technical products has shown an increasing 
number of sensors, electronic control units, data 
logging and monitoring systems within consumer 
goods (e.g. automobiles, washing machines) and 
industrial goods (e.g. machine tools, manufacturing 
systems). In many cases, the main goal of data 
logging is monitoring, controlling and optimisation 
of the product functionalities within the usage 
phase. A further aim is the fulfilment of the 
process capability of manufacturing processes. 
Therefore, the concept of operating data logging 
and monitoring systems (especially operating data 
type, volume, format and storage) has so far in 
most cases been designed by the development 
engineer within the product concept development 
phase. However, the design engineer is also 
responsible for the product functionality. Hence, 
the logged data is very often directly related to a 
technical discipline (e.g. automotive engineering: 
the rotation angle and cycle sensor is related to the 
antilock braking system, which belongs to the 
division of chassis engineering). But this data can 
also serve as a foundation for the reliability 
analysis (e.g. automotive engineering: the number 
of steering turns, or the frequency gathered from 
the rotation angle sensor, are life span variables 
which can be used for statistical reliability models). 
Consequently, a comprehensive operating data 
logging (software) and monitoring system 
(hardware) with complementary functionality is 
needed for future generations of complex technical 
products: controlling product functionality and 
ensuring product reliability of the current and 
subsequent generations. 


2 GOAL OF RESEARCH ACTIVITIES 


This paper discusses challenges, requirements and 
approaches for future conceptual design of oper- 
ating data logging systems of technical products 
related to reliability engineering. In detail: (1) 
Requirements regarding operating data struc- 
ture: e.g. data type, data volume, data format; (2) 
Requirements regarding data recording structure 
and hardware aspects: e.g. frequency, sensors and 
storage; (3) Aspects of data analytics based on 
gained operating data in the usage phase. 


3 FUNDAMENTALS 


3.1 Design of the monitoring system within the 
product life cycle 


The product life cycle of technical products can 
be described in four main and eight subordinate 
phases, cf. (Bracke 2016): 


1. Concept phase 
1a. Definition of the product characteristics 
1b. Development of the product concept 
2. Development phase 
2a. Construction stages (different prototype 
levels and finalising the design) 
2b. Preparation of manufacturing 
3. Production phase 
3a. Start of production (SOP) 
3b. Production 
4. Sale/Usage phase 
4a. Sale of products to the markets 
4b. Usage phase and product observation 


The concept of operating data logging and moni- 
toring systems—especially operating data type, vol- 
ume, format and storage — has to be designed by the 
reliability engineer of the Original Equipment Man- 
ufacturer (OEM) or Supplier within the product con- 
cept development phase (Phase 1b, cf. section 3.1). 


3.2 Design of operating data type 


In general, the operating data types can be subdi- 
vided in following different categories: 


a. Secretly compiled data (OEM / Supplier): 
   - Definition of the logging logic by the OEM, 
   - Data encryption by the OEM, 
   - Storage strategy: “fleeting”, “semi-permanent”, 
     “permanent” 
b. Officially compiled data: 
   Example automobile: eCall emergency system 
   (since 31-03-2018), which places an emergency 
   call after an accident (“sleeping system”) and 
   transfers basic automobile operating data. 
c. Voluntarily compiled data: 
   Example: vehicle insurance policy with scoring 
   option; the logging function is always on. 

The reliability engineer has to define the essential 
life span variables and operating data types in 
the concept development phase (cf. section 3.1). To 
ensure long-term availability of the operating data 
for reliability analysis within the entire product 
life cycle, the storage strategy “permanent” (cf. 
item (a) above) is to be pursued. The strategy 
“fleeting” is only of interest for direct operating 
decisions, and the strategy “semi-permanent” does 
not allow data analysis after the end of product life. 
Officially compiled data is also of interest if the 
storage strategy is “permanent”. Voluntarily 
compiled data is not in the focus of this study. 


4 OPERATING DATA AND MONITORING 
SYSTEM: DATA ANALYTICS, DATA AND 
HARDWARE REQUIREMENTS 


Within this section, the data and hardware require- 
ments, based on data analysis strategies, for an 
operating data and monitoring system are shown. 
The concept draft is subdivided into the following 
three parts. Part one deals with data analytics, e.g. 
uncertainty, second life and lessons learned aspects 
(cf. section 4.1). Based on the goal of data analysis, 
parts two and three contain data requirements 
(cf. section 4.2; e.g. structure and format) and 
hardware requirements (cf. section 4.3; e.g. sensors 
and storage availability within monitoring systems 
in products and facilities). 


4.1 Data analytics 


4.1.1 Aspects of data uncertainty regarding a 
reliability model 
Operating data logging and monitoring systems 
are largely used to improve the knowledge of a 
specific system or component. However, data are 
always associated with some noise or measurement 
error, e.g. due to different environmental 
conditions. In turn, the model used, e.g. to predict 
the remaining useful life of a component or to 
schedule maintenance, is also affected by 
uncertainty. If such uncertainties are neglected, 
wrong and costly decisions can be made (for 
instance, recalling a product). Such uncertainty can 
also nullify the benefits of using machine learning 
frameworks for analysing the continuously 
increasing amount of available data. One of the 
current challenges is to discriminate whether such 
machines and tools are providing reliable estimates 
or whether their predictions have been fooled by 
noise. One possible solution is to use past 
experience and predictions to determine precise 
levels of confidence for new predictions (Shafer 
and Vovk 2008). Another popular approach is 
based on the Bayesian paradigm for inference. In 
such a framework, Bayes’ rule is used to update 
our belief in the validity of the model prediction 
with information from empirical observations 
(data), taking into account the associated 
uncertainty in such observations. The reader is 
referred to (Vehtari and Ojanen 2012) for a 
detailed description of Bayesian methods. 
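
A minimal sketch of such a Bayesian update is given below. It assumes a constant failure rate (exponentially distributed lifetimes) with a Gamma prior, for which the posterior is again a Gamma distribution; the prior parameters and the field data are hypothetical and serve only to show how observations reduce the uncertainty.

```python
def update_failure_rate(prior_shape, prior_rate, n_failures, exposure_hours):
    """Conjugate update of a Gamma(shape, rate) belief about a failure rate.

    Assumes failures occur at a constant rate, so n_failures observed over
    exposure_hours of operation simply add to the two Gamma parameters.
    """
    return prior_shape + n_failures, prior_rate + exposure_hours


def gamma_mean_and_sd(shape, rate):
    """Mean and standard deviation of a Gamma(shape, rate) distribution."""
    return shape / rate, shape ** 0.5 / rate


# Hypothetical prior: about 2e-4 failures per hour, with large uncertainty.
prior = (2.0, 10_000.0)
# Hypothetical logged operating data: 3 failures in 50,000 operating hours.
posterior = update_failure_rate(*prior, n_failures=3, exposure_hours=50_000.0)

print("prior     mean, sd:", gamma_mean_and_sd(*prior))
print("posterior mean, sd:", gamma_mean_and_sd(*posterior))
```

The posterior standard deviation is smaller than the prior one, which quantifies how much the logged data have sharpened the estimate; noisy or sparse data would leave it correspondingly wide.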


4.1.2 Aspects of the use of product operating data 
for a second life cycle 

In order to avoid environmental issues, it is 
necessary to minimize the material and energy 
consumption during the whole product lifecycle 
(Yamada 2012). One option for environmentally 
and economically sound material circulation is to 
reuse End-of-Life (EOL) assembly products by 
remanufacturing them for a second life cycle. 
Remanufacturing is the process of bringing an 
assembly to like-new condition through replacing 
and rebuilding its components at least to current 
specification (Ilgin and Gupta 2012). There are 
two essential processes in remanufacturing: 
disassembly and re-assembly of the EOL products 
(Lambert and Gupta 2005). 

To conduct the data analytics for the disassembly 
process in an environmentally friendly and 
economical way, a parts selection method 
(Igarashi et al. 2016; Kinoshita et al. 2016) that 
includes reuse (Hasegawa et al. 2017a, 2017b) 
shows which parts should be reused, recycled or 
disposed of in terms of environmental impacts and 
costs. The operating data of a part from the usage 
stage of the first life cycle help in selecting the 
parts to be disassembled. Here, one of the 
challenges is that the data affect the part selection 
decision itself through the recovery costs and the 
different sales revenues of the parts. 


4.1.3 Transforming operating data in Lessons- 
Learned-Data-Structure for subsequently 
following product generations 

The application of data analytics to improve 
manufacturing operations and the transfer of 
critical information to following product 
generations are an integral part of data-driven 
decision making. When processes are better 
defined and more standardized, lessons can be 
combined into standards and guidelines. It is 
rather hard to share lessons in areas of complex 
or context-specific need, for topics that are rapidly 
changing, and where new problems are frequently 
being identified. Therefore, lessons should be 
written down and stored in a database in such a 
way that other people concerned can find and 
access the required knowledge. It is important to 
sort the individual lessons and store them under 
themes or topics in the lessons learned database. 
By this means, data can be filtered and previous 
actions taken on any problem can be considered 
during the product’s engineering design process. 
Updating the database (i.e., guidance documents, 
best practices and standards for the process) is 
also crucial in view of the garbage in, garbage out 
principle. The quality of lessons learned may 
range from extremely useful to completely 
unhelpful. Therefore, the ease and accuracy of 
transforming data for subsequent product 
generations depend on how data is collected, 
stored, and updated. 

Commonly used data mining applications in 
manufacturing include failure evaluation, quality 
control, safety analysis, and capacity planning. 
Statistical Process Control (SPC) is one of 
the techniques suggested for real-time monitoring 
of the operational performance of manufacturing 
systems. There is ongoing research on better 
ways of collecting data and developing big data 
infrastructure technologies. Besides sensor 
technologies, Enterprise Resource Planning (ERP) 
and Manufacturing Execution Systems (MES) are 
also used for data collection technology and the 
development of cloud service platforms. 
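
As an illustration of SPC-style monitoring, the sketch below derives control limits from an in-control reference sample and flags new readings outside them. It is a simplified Shewhart-type check (mean plus or minus three sample standard deviations rather than a moving-range estimate), and all readings are hypothetical.

```python
from statistics import mean, stdev


def control_limits(reference, k=3.0):
    """Lower limit, centre line and upper limit from an in-control sample."""
    centre = mean(reference)
    spread = stdev(reference)
    return centre - k * spread, centre, centre + k * spread


def out_of_control(samples, limits):
    """Indices of readings that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, x in enumerate(samples) if x < lower or x > upper]


# Hypothetical reference readings taken while the process was known to be stable.
reference = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
limits = control_limits(reference)

# Hypothetical new readings; the third one is clearly off.
flags = out_of_control([10.0, 10.1, 11.5, 9.9], limits)
print(limits, flags)
```

Each flagged reading is a candidate entry for the lessons learned database, together with the cause that was eventually identified.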


4.1.4 Aspects of reliability analytics: 
Probabilistic uncertainty based on input 
data/operating data 

The quality of the data can have a significant 
influence on the analysis results and their 
interpretation, which could lead to a wrong 
conclusion about the product reliability. In fact, 
the amount, quality, format and processing of the 
data are only a few of many factors which have to 
be considered during the statistical analysis of 
operating data. It is still unclear which particular 
properties need attention during the statistical 
analysis, and in which order. A comprehensive list 
of factors which can influence the data analysis, 
or rather its results, is given in (Hinz 2015). The 
proposed factors are divided into four groups: 


• Data quality: the requirements with respect to 
  the composition of diverse inputs and properties 
  of a data set (e.g. diverse load profiles or the 
  unit of life span variables) 

• Empiricism: knowledge based on experience 
  regarding the application of the product fleet as 
  well as market specific boundary conditions (e.g. 
  product derivatives or user profiles) 

• Aim of analysis: various purposes of the 
  statistical analysis will result in the application 
  of different methods which may cause further 
  uncertainties (e.g. the kind of a damage case) 

• Mathematical models: this group describes 
  statistical models, equations and algorithms which 
  can be used with regard to the reliability analysis 
  (e.g. various methods for the estimation of the 
  distribution parameters). 


In many cases, even the application of methods 
that would be expected to always provide reliable 
results may lead to high uncertainties. For example, 
consider the estimation of the shape parameter of 
the Weibull distribution based on different 
estimators (here: Maximum Likelihood, Gumbel, 
Least Squares, Method of Moments, Nelson, and 
DIN 55303) and various sample sizes (varying 
between 10 and 1000). The results are shown in 
Figure 1. The mentioned methods are compared 
to the maximum likelihood estimate (MLE), and 
the relative error is plotted as a function of the 
sample size. 


Figure 1. Relative error of the estimated shape 
parameters of Weibull distribution using various 
estimation models. 

It can easily be observed that, especially for 
small sample sizes, the application of different 
parameter estimators can cause large differences in 
the results obtained. For a sample consisting of 10 
entries, the difference between MLE and Nelson 
equals 28%. 
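
The effect can be explored with a small simulation. The sketch below is not the computation behind Figure 1: it compares only two estimators (maximum likelihood and a least-squares fit on median ranks), and the true shape, sample sizes, number of trials and random seed are assumptions chosen for illustration.

```python
import math
import random


def shape_mle(data, lo=0.05, hi=50.0, iterations=60):
    """Maximum-likelihood Weibull shape, found by bisection on the score equation.

    Assumes strictly positive data that are not all identical.
    """
    top = max(data)
    xs = [x / top for x in data]      # rescaling leaves the shape estimate unchanged
    logs = [math.log(x) for x in xs]
    mean_log = sum(logs) / len(xs)

    def score(k):
        powers = [x ** k for x in xs]
        weighted = sum(p * l for p, l in zip(powers, logs))
        return weighted / sum(powers) - 1.0 / k - mean_log

    for _ in range(iterations):       # score(k) increases with k
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def shape_least_squares(data):
    """Shape as the slope of ln(-ln(1 - F)) against ln(x), using median ranks for F."""
    xs = sorted(data)
    n = len(xs)
    u = [math.log(x) for x in xs]
    v = [math.log(-math.log(1.0 - (i - 0.3) / (n + 0.4))) for i in range(1, n + 1)]
    u_bar, v_bar = sum(u) / n, sum(v) / n
    sxy = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    sxx = sum((a - u_bar) ** 2 for a in u)
    return sxy / sxx


def mean_relative_difference(n, true_shape=2.0, trials=20, seed=1):
    """Average |least squares - MLE| / MLE over simulated samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.weibullvariate(1.0, true_shape) for _ in range(n)]
        mle = shape_mle(sample)
        total += abs(shape_least_squares(sample) - mle) / mle
    return total / trials


if __name__ == "__main__":
    for size in (10, 100, 1000):
        print(size, round(mean_relative_difference(size), 3))
```

With these assumptions the two estimators disagree far more for samples of 10 than for samples of 1000, which is the qualitative behaviour described above.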

In many cases, combinations of multiple factors 
play a major role during the data analysis. This can 
cause a high uncertainty of the results, which can 
then differ strongly from reality. 


4.2 Data requirements: Operating data structure: 
e.g. data type, data volume, data format 


To some extent, the requirements for data are 
relatively flexible as long as that data supports 
the data-scheme for monitoring systems. The 
requirements therefore focus on data scheme 
and depend less on the actual data itself. The 
objective is to enable communication between 
the data-sender and the data-receiver. The data 
sender, say a logging system, has a data scheme 
to store relevant information in its own local 
database (which may be small if there are detec- 
tors only). The data may be system state mes- 
sages, error messages or alarms that may contain 
a timestamp, serial numbers, identification codes, 
numeric values and meta-data. The data receiver, 
say a Matlab application for reliability engineer- 
ing, has its own data scheme. Only part of the 
data from the data logger is useful for the Mat- 
lab application; some kind of data transforma- 
tion is required. Such transformations can be 
made in many different ways; the key, however, 
is that the meta-data about the data-scheme is 
correct, informative and up-to-date. In many 
industries data standards have been developed 
to harmonize efforts of different industry part- 
ners; this tends to be efficient for many indus- 
tries. For instance, the Oil and Gas industry uses 
ISO 15926 as an International Standard for the 
representation of process plant life-cycle infor- 
mation. This standard specifies a generic, con- 
ceptual data model that is suitable as the basis 
for implementation in a shared database or data 
warehouse. Many industries have developed 
similar standards; aligning with them in one’s own 
field is well worth the effort, since IT solutions 
that do not follow the standards may not be 
accepted in the industry. In summary, data 
requirements focus on correct data scheme 
descriptions. For engineers, such descriptions are 
captured in technical reports. For the computer, 
they are captured in the database format. 
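
The sender/receiver idea can be sketched as two explicit data schemes plus a transformation driven by meta-data. Python is used here in place of the Matlab application mentioned above; all class names, field names and message codes are hypothetical and do not follow ISO 15926 or any other standard.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LoggerMessage:
    """Sender-side scheme: what the (hypothetical) data logger stores."""
    timestamp: str        # ISO 8601 text, local time with UTC offset
    serial_number: str
    code: str             # identification code of the message
    value: float
    meta: dict


@dataclass(frozen=True)
class ReliabilityRecord:
    """Receiver-side scheme: what the reliability application needs."""
    unit_id: str
    time_utc: datetime
    event: str
    reading: float


# Meta-data about the mapping: only these codes matter to the receiver.
EVENT_NAMES = {"E042": "overtemperature", "E107": "vibration_alarm"}


def transform(msg):
    """Map a logger message to a reliability record, or None if not relevant."""
    if msg.code not in EVENT_NAMES:
        return None
    when = datetime.fromisoformat(msg.timestamp).astimezone(timezone.utc)
    return ReliabilityRecord(msg.serial_number, when, EVENT_NAMES[msg.code], msg.value)


message = LoggerMessage("2018-06-17T08:30:00+02:00", "SN-001", "E042", 93.5, {"unit": "degC"})
print(transform(message))
```

The transformation itself is trivial; what has to be kept correct and up to date is the mapping table, which is exactly the meta-data about the data scheme that this section emphasises.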
4.3 Hardware requirements 


4.3.1 Aspects/requirements of hardware (sensor 
and storage technologies) within certain 
products 

The hardware components of data recording and 
monitoring systems typically represent a measure- 
ment chain that consists of four blocks for sensing, 
signal conditioning, signal processing, as well as data 
presentation and storage (Bentley 2005). Depending 
on the field of application, each hardware compo- 
nent has to fulfil certain requirements that are typi- 
cally provided by established standards. An example 
of a general high-level standard is the RTCA- 
DO-160 standard for the environmental testing of 
hardware in an aerospace context (RTCA-DO160). 
The related tests are, for example, of a mechanical, 
chemical, or electromagnetic nature and, as a whole, 
rather involved and time-consuming. Depending 
on the field of application of a data recording and 
monitoring system, it might be neither feasible nor 
economical to perform the whole variety of desirable 
tests for each component. This leads to the challenge 
of determining, on a component level, meaningful 
hardware requirements that avoid excessive testing 
while keeping necessary standards. There is no gen- 
eral solution to this problem, which highly depends 
on the actual case. Existing standards, however, can 
provide valuable guidelines. 

Moving from component to system level, the 
proper integration of a data recording and moni- 
toring system into a complex technical product is 
required. Due to the general increase of complex- 
ity and electromagnetic sensitivity within technical 
systems, this task becomes increasingly challeng- 
ing as well. The traditional approach for a proper 
integration involves the analytical and numerical 
modelling of possible unwanted electromagnetic 
couplings between different components (Tesche 
1997). This is caused by the signal propagation 
between sensor and data unit along the aforemen- 
tioned four blocks which is mainly of an electro- 
magnetic nature. However, it has recently become 
apparent that increasing complexity requires, 
besides deterministic methods, also statistical meth- 
ods that are adapted from reliability engineering to 
electrical engineering (Mao 2016). As a result, and 
as a new development, both hardware requirements 
and aspects of uncertainty have to be considered 
as a whole during the system integration. 


4.3.2 Hardware requirements: Aspects of 
monitoring systems in complex 
technical facilities 
In an industrial environment a monitoring system 
will be developed, installed and operated only if a 
commercial benefit, either direct or on multi-level 
basis, can be expected. 


The basic kind of monitoring is meant to prevent 
system damage during use, offering upfront 
indications (e.g. life span variables like 
temperature, vibrations, etc.) for imperative 
service, usually combined with routine 
maintenance. It is applied for cheap and simple 
mass products. Second order monitoring is used to 
acquire data about the product quality during 
production (e.g. weight, shape, homogeneity); it is 
recommended for mass products of some value 
and complexity. Ideally, this information is used 
to control the production process inline. 

On the next level, sensor data combined with 
the operating parameter log can be used to 
continuously diagnose the present system status 
and thus the product quality. It is combined with 
preferably non-destructive sample inspection of 
the product to verify the stability of the process. 
This kind of monitoring may also allow the 
maintenance frequency to be adapted to the actual 
strain on the system. It is used for complex 
processes where reliability is the most important 
aspect (low-volume, costly or safety-related 
products). 

Finally, this vast amount of information and 
data has to be merged with a system behaviour 
model in order to approve, recommend or 
cautiously tune operation modes and to venture a 
prediction of the remaining lifetime, while the 
maintenance schedule is aligned with production 
requirements. On this level, the product is not 
necessarily a tangible item but could also be 
energy (battery), information (data storage), or 
movement (aircraft engine). 

Some of the economic effects of such monitoring 
efforts are obvious: increased lifetime, less 
downtime, higher throughput, fewer spares in 
stock and sufficient time to plan inevitable 
replacements. But there are also savings due to 
fewer failures in general, and less risk caused by 
unknown defective parts distributed into the 
market (product liability: documented monitoring 
is mandatory to defend against claims). Wherever 
potential savings outweigh the costs, an 
appropriate level of monitoring will be 
established. 


4.3.3 Conceptual aspects of standardisation of 
operating data recording within technical 
products 

The purposes of the operating data logging system regarding
reliability engineering are the detection of the cause of a
failure or observation, the prediction of failures, and the
provision of maintenance actions to the user or producer. To
describe a failure and its cause, a monitoring system with the
necessary and sufficient kind and number of sensors needs to be
allocated to the product system. Hence, the failure mode needs to
be decomposed into basic events by using FTA: Fault Tree Analysis
(Lee et al. 1985), and the designer selects appropriate sensors
with reference to these events. On the other hand, many existing
products such as automobiles and machine tools are already
equipped with a monitoring system to achieve their functionality.
This monitoring system consists of sensors, ECU: Electronic
Control Unit, and actuators, and these components are usually
modularized from the perspective of functionality by using DSM:
Design Structure Matrix (Eppinger et al. 1994). Therefore, when a
reliability monitoring system is developed in addition, it should
not change the structure of the existing functionality-based
system; instead, it has a monitoring structure which includes the
existing system modules. In addition, the occurrence of product
failure depends on various elements such as the usage time,
client usage, and other external factors. The sensors which are
components of the functionality-based system might not detect
external factors such as temperature, humidity, and
electromagnetic waves. Hence, to build a failure-based monitoring
system, the designer needs to integrate the following three
steps: (1) decomposing the target failure mode into basic events
by using FTA, (2) identifying the necessary functionality-based
monitoring systems and additional sensors for detecting external
factors, and (3) modularizing these functionality-based
sub-systems, additional sensors, and the administration unit for
transmitted signals. Figure 2 illustrates the conceptual
structure of the failure-based monitoring system.
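
The three steps can be illustrated with a minimal sketch. The fault tree, the event names and the event-to-signal mapping below are hypothetical examples introduced only for illustration; they are not taken from the cited works, and a real FTA would also distinguish AND and OR gates.

```python
# Hypothetical fault tree: each event maps to the events that cause it.
FAULT_TREE = {
    "motor overheating": ["cooling loss", "overload"],
    "cooling loss": ["fan blocked", "high ambient temperature"],
    "overload": ["excess current"],
}

# Hypothetical mapping from basic events to the signal that can observe them.
SENSOR_FOR_EVENT = {
    "fan blocked": "fan speed (existing ECU signal)",
    "excess current": "phase current (existing ECU signal)",
    "high ambient temperature": "ambient thermocouple (additional sensor)",
}


def basic_events(tree, event):
    """Step 1: expand an event down to the leaves (basic events) of the tree."""
    children = tree.get(event)
    if not children:  # no causes listed: this is a basic event
        return [event]
    leaves = []
    for child in children:
        leaves.extend(basic_events(tree, child))
    return leaves


def plan_monitoring(tree, top_event, sensor_map):
    """Steps 2 and 3: assign a signal to each basic event and group the
    signals into existing (functionality-based) and additional modules."""
    modules = {"existing": [], "additional": []}
    for event in basic_events(tree, top_event):
        signal = sensor_map[event]
        key = "additional" if "additional sensor" in signal else "existing"
        modules[key].append((event, signal))
    return modules


plan = plan_monitoring(FAULT_TREE, "motor overheating", SENSOR_FOR_EVENT)
```

In this sketch the existing signals are reused unchanged, and only the external factor (ambient temperature) requires an added sensor, which mirrors the idea that the failure-based system wraps the functionality-based modules instead of modifying them.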

This failure-based monitoring system does not affect the
structure of the existing functionality-based monitoring system.
Therefore, it can optionally be added to a product system already
in operation as an upgrade, without major design or structure
changes.


4.3.4 Hardware in use: Impacts on reliability of 
data recording safety within the product 
use phase 
The first impact of data recording systems within 
the product usage phase is related to scenarios in 
which the measurements are performed: 


1. Field test: the goal is to measure the transient
and steady state inputs of a vehicle as it operates
in the real environment, in order to anticipate the
market region of use.

2. Proving ground measurement: the goal is to
replicate the most significant drive profiles from
the field test, but in a more controlled environment
(e.g. test track or climatic wind tunnel).


Figure 2. Structure of failure-based monitoring system.


Field tests are designed to capture all the environmental
loadings that might affect the reliability of the vehicle or of
single components during the in-use phase. This type of
measurement usually takes days or weeks. Data are recorded by
means of a mobile data logger, which allows the simultaneous
recording of a wide variety of sensor measurements. This type of
device can be small and have integrated sensors, making it ideal
for surveys of final customers (Figure 3 left).

Proving ground measurements are based on the results from field
tests and focus on precise driving events during the development
phase of a new vehicle. They are performed with an acquisition
system, which guarantees more refined measurements but is less
robust against environmental stresses and needs to be interfaced
to a laptop for data saving and storage.

Whether it is a road field test or a wind tunnel test, vehicle
measurements are expensive. To perform such tests, one must first
build a prototype (for the Original Equipment Manufacturer, OEM)
or buy or rent the selected car (for component suppliers). The
required sensors need to be mounted, connected, and cabled. There
must be enough room for all sensors and cables, and a comfortable
environment for both the driver and the acquisition system.
Additional care must be taken when measuring the response of a
component, which needs to be equipped with sensors (e.g. strain
gauges and thermocouples) by a reliable supplier and then
assembled in the vehicle.

Because so many aspects are time and money consuming, the key
role of the test engineer is to make sure that measurement
sessions are not jeopardized by 1) improper sensor mounting or
2) inadequate data acquisition/storage.

Once a suitable acquisition system has been selected for the
measurements, the key to a successful measurement lies in the
type of sensors.

Sensors need to be tailored to the physical value of interest.
Prior knowledge of the value range (maximum and minimum value,
the frequency etc.) is therefore fundamental.


Figure 3. Small data logger with integrated sensors
(left). PC interfaced data acquisition system (right) (cf.
(MSRDatenlogger)).

Sensors must also be accurate and sensitive enough to properly
measure small variations, yet robust against the inevitable
environmental stress resulting from the driving environment:
shock, heat, humidity, and contamination associated with
potentially adverse road conditions (dust, mud, water etc.). In
general, sensors must be operative under all ambient temperature
conditions.

Similarly, the acquisition system must be tailored to the
measurement type. There must be a trade-off between system
performance and robustness and, in some cases, its dimensions. As
a typical example, integrated circuit piezoelectric (ICP)
accelerometers are less intrusive (smaller and lighter) and more
accurate than capacitive ones, but less resistant to high
temperature and shock. ICP sensors would be a perfect fit for
vibration measurements on chassis or cabin components, but would
be unreliable for engine vibrations, due to the high temperature
reached during combustion.

Great care should be taken when mounting the sensors: external
factors that might interfere with the measurements are electrical
leakage and shorts. Moreover, the electromagnetic compatibility
of the acquisition systems and logger should be verified to avoid
parasitic electromagnetic noise.

Some additional considerations and precautions apply when
planning long field measurements.

The settings of the data logger (usually mounted 
inside the cabin) require a trade-off between fre- 
quency of acquisition and storage memory. 
Particular care must be taken for acceleration 
measurements, since their high frequency and 
long acquisition time might be computationally 
demanding during data post-processing. 
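
The trade-off between acquisition frequency and storage memory can be made concrete with a rough estimate. The sketch below assumes uncompressed samples of fixed size and ignores file-system and metadata overhead; the numbers in the example (32 GB, 8 channels, 4 bytes per sample) are illustrative assumptions, not values from the paper.

```python
def recording_hours(capacity_bytes, n_channels, sample_rate_hz, bytes_per_sample):
    """Approximate recording time before the memory is full.

    Assumes every channel is sampled at the same rate and stored
    uncompressed with a fixed number of bytes per sample.
    """
    bytes_per_second = n_channels * sample_rate_hz * bytes_per_sample
    return capacity_bytes / bytes_per_second / 3600.0


# Illustrative values only: 32 GB of storage, 8 acceleration channels.
hours_at_1_khz = recording_hours(32e9, 8, 1_000, 4)
hours_at_10_khz = recording_hours(32e9, 8, 10_000, 4)
```

Under these assumptions a tenfold increase in sampling rate cuts the available recording time by the same factor, from roughly eleven and a half days to a little over one day, which is why acceleration channels deserve particular attention in multi-week field tests.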

Solid State Drive (SSD) memory devices are preferred to Hard Disk
Drives (HDD) because they are less affected by dust contamination
and vibration loading. Moreover, SSDs do not require rotating
components such as the platter or the fan system.

The planning of field measurements also needs to consider the
available memory and how it works, to avoid data loss. The most
commonly used memory storage methods are i) erasable data storage
systems (once the memory is full, the system is erased after a
certain time) and ii) circular or buffer memory (the oldest data
are overwritten when the memory is full).
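
The difference between the two storage methods can be sketched in a few lines. The class names are hypothetical, and the erasable variant is simplified: it wipes the memory at the first write after it becomes full, whereas the text describes an erase after a certain time.

```python
from collections import deque


class CircularLog:
    """Ring buffer: once full, each new sample overwrites the oldest one."""

    def __init__(self, capacity):
        self._samples = deque(maxlen=capacity)

    def record(self, sample):
        self._samples.append(sample)

    def contents(self):
        return list(self._samples)


class ErasableLog:
    """Simplified erasable storage: a full memory is wiped before the next write."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._samples = []

    def record(self, sample):
        if len(self._samples) >= self._capacity:
            self._samples.clear()  # everything recorded so far is lost
        self._samples.append(sample)

    def contents(self):
        return list(self._samples)


ring, erasable = CircularLog(3), ErasableLog(3)
for sample in range(5):
    ring.record(sample)
    erasable.record(sample)
```

With a capacity of three samples and five recorded values, the circular log always keeps the most recent three, while the erasable log keeps only what was written after the last wipe. Which behaviour is acceptable depends on how often the data can be downloaded during the field test.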

During field or proving ground measurements, both data loggers
and acquisition systems can be used to record data coming from
the On-Board Diagnostics (OBD) or the Controller Area Network
(CAN) bus. When recording data both from sensors and from the
vehicle on-board computer, the measurements need to be
synchronized.
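
One simple way to synchronize the two data streams in post-processing is to match each sensor sample with the bus message closest in time. The sketch below assumes that both streams carry timestamps in seconds, that a constant clock offset between logger and bus is known, and that a maximum acceptable gap is chosen by the test engineer; all function names and parameters are hypothetical.

```python
from bisect import bisect_left


def nearest_index(sorted_times, t):
    """Index of the entry in sorted_times closest to t."""
    i = bisect_left(sorted_times, t)
    if i == 0:
        return 0
    if i == len(sorted_times):
        return len(sorted_times) - 1
    before, after = sorted_times[i - 1], sorted_times[i]
    return i if (after - t) < (t - before) else i - 1


def align(sensor, can, clock_offset_s=0.0, max_gap_s=0.05):
    """Attach the nearest CAN value to every sensor sample.

    sensor, can: lists of (timestamp_s, value), sorted by timestamp.
    clock_offset_s: assumed constant offset added to the CAN timestamps.
    max_gap_s: samples with no CAN message within this gap get None.
    """
    can_times = [t + clock_offset_s for t, _ in can]
    merged = []
    for t, value in sensor:
        j = nearest_index(can_times, t)
        can_value = can[j][1] if abs(can_times[j] - t) <= max_gap_s else None
        merged.append((t, value, can_value))
    return merged


sensor_samples = [(0.004, 1.0), (0.018, 1.1), (0.5, 2.0)]
can_messages = [(0.0, 10), (0.02, 20)]
aligned = align(sensor_samples, can_messages)
```

Marking samples without a sufficiently close bus message, instead of silently reusing an old value, keeps gaps in the bus recording visible during later analysis.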


A check-list to avoid potential problems encoun- 
tered during field measurements could include the 
following topics: 


— Sensors: must be robust, accurate and properly 
calibrated. 

— Acquisition system: suitable to the type of 
measurements (portable PC interfaced vs. data 
logger). 

— Memory storage: chosen with respect to the 
amount of expected data, limitation of the 
memory capability and post-processing compu- 
tational effort. 

— Post processing: proper labelling of measurement
channels; data saved in an exploitable format.

— Privacy issue: field tests on final customers (e.g.
users’ fleet) must comply with privacy policy,
which varies from one country to another.


5 SUMMARY 


The development process of complex technical products in recent
years shows an increasing number of sensors, electronic control
units, data logging and monitoring systems within consumer goods
(e.g. automobiles, washing machines) and industrial goods (e.g.
machine tools, manufacturing systems). These operating data can
be a foundation for statistical analysis regarding the
reliability of the product. The goals of data analytics (focus:
data uncertainty, reliability analytics, second-life-cycle
aspects, Lessons-Learned issues) are the basis for the data and
hardware requirements.

Main requirements regarding the monitored 
data are as follows: 


• Clear data scheme regarding the local data storage
system,

• Data content (storage): messages, errors, alarms,
timestamp, serial number, identification codes,
numeric values, meta data,

• Data receiver: possibility of data transformation,

• Consideration of industrial data standards,
depending on product category,

• Possibility of technical report.


Main requirements regarding the monitoring 
system hardware are as follows: 


• Considering standards for sensing, signal conditioning,
signal processing, data presentation and storage,

• Considering possible electromagnetic sensitivity
regarding signal propagation between sensor and data unit,

• Modularisation of monitoring system components,

• Considering upgrade possibility during product
life cycle,

• Considering load profile regarding expected product
life cycle within monitoring prototype testing (field
test versus proving ground measurement).


The requirements shown can be used by the reliability engineer as
a guideline for the design of operating data logging and
monitoring systems within the product concept development phase
of a new product generation.


REFERENCES 


Bentley, J.P.: “Principles of Measurement Systems”, 4th 
ed., (Pearson, Harlow, 2005). 

Bracke, S., Hinz, M., Inoue. M., Patelli, E., Kutz, S., 
Gottschalk, H., Ulutas, B., Hartl, C., Mors, P. and 
Bonnaud, P.: Reliability engineering in face of shorten 
product life cycles: Challenges, technique trends and 
method approaches to ensure product reliability. In: 
L. Walls, M. Revie, T. Bedford; Risk, Reliability and 
Safety: Innovating Theory and Practice; ESREL 2016, 
Glasgow, United Kingdom, 25th — 29th September 
2016; European Safety and Reliability Association, 
ESRA (2016). 

Eppinger, S.D., Whitney, D.E., Smith, R.P., & Gebala, 
D.A.: A Model-Based Method for Organizing Tasks 
in Product Development. Research in Engineering 
Design 6(1): 1-13. 1994. 

Hasegawa, S., Kinoshita, Y., Yamada, T., Inoue, M., 
Bracke, S.: Disassembly Parts Selection for Recovery 
Rate and Cost Considering Reuse, The 24th International
Conference on Production Research (ICPR-24),
Poznan, Poland, July (2017a).

Hasegawa, S., Kinoshita, Y., Yamada, T., Inoue, M., 
Bracke, S.: Disassembly Parts Selection for Material- 
based CO2 Saving Rate and Cost Considering Reuse, 
The 3rd International Conference on Remanufactur- 
ing, (ICoR2017), Linkoping, Sweden, pp.175—188, 
Oct (2017b). 

Hinz, M., Sochacki, S., Rosebrock, C. and Bracke, S.: 
Qualitative and quantitative analysis of uncertainties 
in the risk analysis of field data within the product 
usage phase. In: L. Podofillini, B. Sudret, B. Stojadi- 
novic, E. Zio, W. Kröger; Safety and Reliability of 
complex Engineered Systems; ESREL 2015. 


Igarashi, K., Yamada, T., Gupta, S.M., Inoue, M., 
Itsubo, N.: Disassembly System Modeling and Design 
with Parts Selection for Cost, Recycling, and CO2 
Saving Rates using Multi Criteria Optimization, Jour- 
nal of Manufacturing Systems, Vol.38, No.41, pp.151— 
164 (2016). 

Ilgin, M.A., Gupta, S.M.: Remanufacturing Modeling 
and Analysis, Boca Raton, FL, USA: CRC Press; 
2012. 

Kinoshita, Y., Yamada, T., Gupta, S.M., Ishigaki, A.,
Inoue, M.: Disassembly Parts Selection and Analysis 
for Recycling Rate and Cost by Goal Programming, 
Journal of Advanced Mechanical Design, Systems, and 
Manufacturing, Vol.10, No.3, pp.1-15 (2016). 

Lambert, A.J.D., Gupta, S.M.: Disassembly modeling 
for assembly, maintenance, reuse and recycling, Boca 
Raton, FL, USA: CRC Press; 2005 

Lee, W.S., Grosh, D.L., Tillman, F.A. & Lie, C.H.: Fault
Tree Analysis, Methods, and Applications: A Review. 
IEEE Transactions on Reliability R-34(3): 194-203. 
1985. 

Mao, C. and Canavero, F.: “System-Level Vulner- 
ability Assessment for EME: From Fault Tree Anal- 
ysis to Bayesian Networks—Part I: Methodology 
Framework”, IEEE Transactions on Electromag- 
netic Compatibility, vol. 58, no. 1, (February 2016), 
pp. 180-187. 

MSRDatenlogger — Source: MSR Electronics, CC 
BY-SA 3.0, https://commons.wikimedia.org/w/index. 
php?curid=18557569 and Müller-BBM GmbH, www.
muellerbbm.com. 

RTCA-DO160: “Environmental Conditions and Test Pro- 
cedures for Airborne Equipment”, Version G, (Radio 
Technical Commission for Aeronautics, 2010). 

Shafer, G., Vovk, V.: A Tutorial on Conformal Predic- 
tion, Journal of Machine Learning Research 9 (2008) 
371-421. 

Tesche, F.M., Ianoz, M.V. and Karlsson, T.: “EMC Anal- 
ysis Methods and Computational Methods”, (John 
Wiley & Sons, New York, 1997). 

Vehtari, A. and Ojanen, J.: A survey of Bayesian pre- 
dictive methods for model assessment, selection and 
comparison Statistics Surveys, Vol. 6 (2012) 142-228. 

Yamada, T.: Part 6, 6.2, design of closed-loop and low- 
carbon supply chains for sustainability. In: Soemon 
Takakuwa, Nruyen Hong Son, Nguyen Dang Minh 
(Editors), Manufacturing and environmental manage- 
ment. Hanoi, Vietnam: National Political Publishing 
House; 2012. pp. 211-221. 




Safety and Reliability - Safe Societies in a Changing World - Haugen et al. (Eds) 
© 2018 Taylor & Francis Group, London, ISBN 978-0-8153-8682-7 


Enhanced hybrid prognostic approach applied to aircraft on-board 
electromechanical actuators affected by progressive faults 


P.C. Berri, M.D.L. Dalla Vedova & P. Maggiore
Department of Mechanical and Aerospace Engineering (DIMEAS), Politecnico di Torino, Torino, Italy 


ABSTRACT: In the latest generation of aircraft, the architecture of the powered flight control system
has adopted Electromechanical Actuators (EMAs). Since some on-board actuators are safety critical,
monitoring their behavior to determine their health condition is a task of growing importance. The
choice of the best prognostic algorithm is driven primarily by its effectiveness in correctly identifying
the health condition of the system, since each technique might be more or less useful in a given
situation. In this context, the authors propose a new GA-based fault detection tool, relying on a
model-based approach, which compares the system output to that of a Monitor Model (MM) able to reproduce
accurately the dynamic response of the actual EMA in terms of position, speed and equivalent current,
even under the effects of different failure modes, while keeping a reasonably low computational cost; this
Fault Detection and Identification (FDI) algorithm has been extended to seven progressive failures. A
numerical simulation test environment has been developed to simulate progressive faults and to evaluate
the accuracy of this prognostic method. Results showed adequate robustness and a suitable ability
to identify malfunctions early, with low risk of false alarms or missed failures. Moreover, the effect of a
failure different from those considered was studied, to avoid safety concerns related to the missed
identification of an incipient failure hidden by another unknown failure mode.


1 INTRODUCTION


Electromechanical Actuators, or EMAs, are grad- 
ually replacing hydraulic systems in aeronautical 
applications, starting from the less safety critical 
uses (such as the actuation of trim tabs, cargo bay 
doors, or weapon and sensor systems of military 
aircraft) up to the most important ones, like pri- 
mary and secondary flight controls. EMAs are 
composed of an electric motor driving the user 
through a mechanical transmission; then, the main 
advantages over a traditional electrohydraulic 
system are the absence of a centralized hydraulic 
power generation and distribution system, lead- 
ing to an overall weight reduction, and the total 
absence of the hydraulic fluid itself, which is usu- 
ally pollutant or flammable. 

Given that the EMA technology is quite new and the reliability of
these systems is not yet adequately known, risk reduction methods
based on redundancy and on scheduled maintenance and inspections
must be heavily employed, which somewhat limits the diffusion of
this new technology due to the unavoidable cost increase.
Moreover, those risk reduction methods can do nothing against
failures caused by unexpected and extreme scenarios such as
exceeding the flight envelope.


A new approach to risk reduction, called Prognostics and Health
Management (PHM), relies on the monitoring of functional
parameters of the system to detect and identify the precursors of
failures at an early stage (Vachtsevanos et al. 2006), in order
to estimate the Remaining Useful Life (RUL) of the components.
The monitored parameters usually have to be converted into
electric signals, so a PHM approach is particularly convenient
when applied to an electromechanical system, where most
parameters are already in the form of electric signals without
the need for dedicated sensors and transducers, which would
increase the overall costs and worsen the basic reliability of
the system. In the literature, many different Fault Detection and
Identification (FDI) methods have been investigated: model-based
techniques based on the direct comparison between the output of
the real and monitoring systems (Raie & Rashtchi 2002, Byington
et al. 2004, Alamyal et al. 2013), on the spectral analysis of
well-defined system behaviors performed by Fast Fourier Transform
(Mamis et al. 2013, Dalla Vedova et al. 2014), on combinations of
these methods (Borello et al. 2009a, Dalla Vedova et al. 2015a,b)
or on Artificial Neural Networks (Su & Chong 2007, Hamdani et al.
2011, Refaat et al. 2013, Dalla Vedova et al. 2016a).

This paper proposes an FDI algorithm relying on a model-based
approach and, in particular, on parametric estimation, for the
prognostic analysis of a typical EMA according to the More
Electric Aircraft and All Electric Aircraft paradigms (Quigley
1993, Howse 2003); the robustness of this algorithm is tested
under different operating conditions and the effects of its use
integrated with the traditional RAMS approach are investigated.


2 REFERENCE AND MONITOR MODELS 


Two numerical models of the actuator were developed for this
study. A very detailed reference model is used as a virtual test
rig for the FDI algorithm, simulating the behavior of the faulty
physical system. The computing time required by this model,
however, is not compatible with its use in the FDI algorithm
itself, which involves an iterative evaluation of the fitness
function in the Genetic Algorithm (GA). For this reason, a
simplified monitor model was built to achieve a low computing
cost and, at the same time, a high accuracy in reproducing the
early effects of different incipient fault modes.

The Reference Model (RM), widely described 
by Dalla Vedova et al. (2016b), contains a detailed 
simulation of the physical phenomena acting in 
the EMA, in particular regarding the electromag- 
netic stator-rotor coupling (Haskew et al. 1999, 
Lee & Ehsani 2003, Halvaei et al. 2009, Cunkas 
& Aydogdu 2010), end-of-travels, compliance and 
backlashes acting on the mechanical transmis- 
sion (Borello et al. 2009b, Borello & Dalla Vedova 
2014), dry friction acting on bearings, gears, hinges 
and screw actuators (Borello & Dalla Vedova 2012) 
and a precise model of the behavior of the power 
electronics, including the solid-state inverter and 
the PWM control of the three electrical phases. 

The Monitor Model (MM) is a simplified representation of the
system using, for example, an equivalent single-phase DC motor
with a single feedback loop instead of the complex
electromagnetic model of the BLDC motor. This requires the
introduction of a shape-function-based model for the simulation
of the electrical fault, which is not strictly related to the
physics of the system, but makes it possible to reproduce the
effects of faults with good accuracy, as shown by Berri, Dalla
Vedova & Maggiore (2016).


3 FAULT MODES 


Five different fault modes were considered for the study. They
were chosen among the most common for EMAs, as highlighted by
Kenjo & Nagamori (2003), Chesley (2011) and Weiss (2014);
moreover, they are usually characterized by a progressive
evolution, which makes an effective prognostic detection
possible.

The considered faults are briefly listed below: 


— Dry friction due to the wearing of mechanical 
components; 

— Backlash of the reducer gearbox and/or rotary- 
to-linear conversion device; 

— Partial short circuit of the BLDC stator coils; 

— Rotor eccentricity due to the degradation of its 
support bearings; 

— Control electronics fault resulting in the drift of 
the PID controller Proportional gain. 


The implementation of the first four faults 
in both the RM and MM is described by Berri, 
Dalla Vedova & Maggiore (2017); the last one is 
modelled by varying the Proportional gain param- 
eter in the Controller subsystem of both models. 
Despite the relatively straightforward implementa- 
tion of this fault, non negligible difficulties were 
found due to the position of the affected subsys- 
tem in the feedback loop, in particular consider- 
ing the interactions with other fault modes: in fact, 
its effects are hardly distinguishable from those of 
partial short circuit. 


4 FITNESS FUNCTIONS 


The objective function to be optimized by a GA is known as the
fitness function. In the proposed FDI technique, it is the
cumulative error in terms of equivalent single-phase current
between the MM and RM. This results in an 8-variable function to
be optimized, since two of the considered fault modes have
multiple degrees of freedom. The rotor eccentricity is
characterized by its magnitude ζ and its phase φ (i.e. the
angular position of the minimum air gap measured from the
reference rotor angular position); similarly, the partial short
circuit can affect each of the three stator phases, which have to
be treated separately in order to isolate this fault mode from
the others.

The fault parameters are normalized as follows: 


— k(1) is the normalized friction: k(1) = 0 means
nominal conditions, while k(1) = 1 means 300%
of nominal condition;

— k(2) is the normalized backlash: k(2) = 0 means
nominal condition, k(2) = 1 means 100 times
the nominal condition; although at a first glance 
this range may seem exaggerated, 100 times the 
nominal condition means about half a radian of 
mechanical play on the fast shaft, reduced by the 
gear ratio to 5.7-10° degrees on the slow shaft 
or 20% of the already small chirp command 
amplitude. 
— k(3), k(4) and k(5) are respectively the normalized
short circuit of phases A, B and C; for example,
k(3) = 0 means a fully functional phase A, while
k(3) = 1 means a complete short circuit for the
same phase.

— k(6) is the rotor static eccentricity amplitude:
k(6) = 0 means no rotor eccentricity, while
k(6) = 1 means ζ = 1; in fact, k(6) is equal to the
eccentricity parameter ζ (Belmonte et al. 2015).

— k(7) is the phase of rotor eccentricity φ, i.e. the
direction corresponding to the minimum air gap;
k(7) = 0 means φ = −180°, k(7) = 1 means φ = 180°.

— k(8) is the normalized variation of the proportional
gain: k(8) = 0 is a 50% reduction of the proportional
gain, while k(8) = 1 means a 50% increase.
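
As an illustration of the normalization, the following sketch converts a fault vector back into physical quantities. The text only gives the end points of each range; the linear interpolation between them is an assumption made here, and the backlash parameter k(2) is left out because the text does not say whether its range from 1 to 100 times nominal is covered linearly or logarithmically. The function name and dictionary keys are hypothetical.

```python
def decode_fault_vector(k):
    """Map the normalized fault vector k(1)..k(8) to physical quantities.

    Only the range end points come from the text; the linear mapping
    in between is an assumption of this sketch.
    """
    if len(k) != 8:
        raise ValueError("expected the eight fault parameters k(1)..k(8)")
    return {
        # k(1): 0 -> nominal friction (100%), 1 -> 300% of nominal
        "friction_percent_of_nominal": 100.0 + 200.0 * k[0],
        # k(3)..k(5): 0 -> healthy phase, 1 -> complete short circuit
        "short_circuit_fraction": {"A": k[2], "B": k[3], "C": k[4]},
        # k(6): equal to the eccentricity parameter
        "eccentricity": k[5],
        # k(7): 0 -> -180 degrees, 1 -> +180 degrees
        "eccentricity_phase_deg": -180.0 + 360.0 * k[6],
        # k(8): 0 -> 50% reduction of the gain, 1 -> 50% increase
        "proportional_gain_factor": 0.5 + k[7],
    }


# Reference values of the medium damage combination used in the results.
medium_damage = decode_fault_vector([0.4, 0.4, 0.2, 0.0, 0.0, 0.2, 0.5, 0.6])
```

For the medium damage combination this gives, under the linear assumption, 180% of nominal friction, an eccentricity phase of 0° and a proportional gain 10% above nominal.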


The fitness function is then computed with a 
modified total least squares method, which is tol- 
erant to small phase lags cumulated between the 
two EMA models, even in presence of steep gradi- 
ents and abrupt changes in the equivalent current 
(Markovsky & Van Huffel 2007, and Berri, Dalla 
Vedova & Maggiore 2016). The resultant error is 
therefore: 


$$ err = \int_0^{t_{max}} \frac{\left(I_r - I_m\right)^2}{\left(\dfrac{dI_r/dt}{k}\right)^2 + 1}\, dt \qquad (1) $$


where I_r and I_m are the reference and monitor
single-phase currents, t_max is the duration of the
simulation and k is a constant used to normalize the
derivative of the reference current. Alternatively,
computing the integral with a numerical scheme,
considering the discrete nature of the simulations:


$$ err = dt \cdot \sum_i \frac{\left(I_{r,i} - I_{m,i}\right)^2}{\left(\dfrac{\left(dI_r/dt\right)_i}{k}\right)^2 + 1} \qquad (2) $$


Note that in this formulation dt is the finite time
step of the numerical integration.
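
A minimal sketch of the discrete error of Eq. (2) is given below. It assumes that each squared current difference is divided by the squared, normalized derivative of the reference current plus one, and it estimates that derivative with finite differences; the differencing scheme is a choice made for this sketch, not something stated in the text.

```python
def fitness(i_ref, i_mon, dt, k):
    """Discrete cumulative error between reference and monitor currents.

    i_ref, i_mon: equally sampled current histories (same length, >= 2).
    dt: time step of the simulation; k: normalization constant.
    """
    n = len(i_ref)
    total = 0.0
    for i in range(n):
        # Finite-difference estimate of dI_r/dt (one-sided at the ends).
        if i == 0:
            slope = (i_ref[1] - i_ref[0]) / dt
        elif i == n - 1:
            slope = (i_ref[-1] - i_ref[-2]) / dt
        else:
            slope = (i_ref[i + 1] - i_ref[i - 1]) / (2.0 * dt)
        # Steep parts of the reference are weighted down.
        total += (i_ref[i] - i_mon[i]) ** 2 / ((slope / k) ** 2 + 1.0)
    return dt * total


# A step in the reference and a monitor response one sample early.
step_ref = [0.0, 0.0, 10.0, 10.0]
step_mon = [0.0, 10.0, 10.0, 10.0]
weighted = fitness(step_ref, step_mon, dt=0.1, k=1.0)
plain_least_squares = 0.1 * sum((a - b) ** 2 for a, b in zip(step_ref, step_mon))
```

In this example a one-sample phase lag across a steep edge contributes far less to the weighted error than it would to a plain least squares sum, which is the tolerance to small phase lags described above.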


5 GA SETTINGS 


The genetic algorithm used for this study is based on the ga
function available in the MATLAB Optimization Toolbox. Employing
an eight-variable fitness function requires a little calibration
of the GA to achieve good convergence. However, various
parameters are left at their default values, so a further
optimization of the algorithm settings is likely to allow a
performance improvement, increasing the convergence speed. In
particular, the function tolerance in the stopping criteria is
tightened from the default 1·10⁻⁶ to a value of 1·10⁻⁷, to
prevent the method from stopping in a local minimum; moreover,
the maximum iterations parameter is changed from 100 to 200, in
order to avoid stopping the algorithm too early, before
convergence is reached. In addition, the hybrid function option
is set to start a deterministic optimization with the fmincon
Interior Point solver at the end of the genetic algorithm to
refine the results. The gradient-based algorithm starts from the
final point of the GA, offering a faster way to converge on the
optimum solution, while the genetic algorithm provides a suitable
start point to prevent convergence on local minima, ensuring
robustness of the method.
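
The two-stage idea (a global stochastic search followed by a deterministic local refinement) can be sketched as follows. This is not the MATLAB ga/fmincon implementation: the genetic search is a deliberately small stand-in, the local stage is a derivative-free pattern search in place of the interior point solver, the stall-based stopping rule is a simplification, and the toy fitness replaces the comparison between MM and RM currents. All names and default values are assumptions of this sketch.

```python
import random


def genetic_search(fitness, n_vars, pop_size=40, generations=200,
                   function_tolerance=1e-7, stall_generations=20, seed=0):
    """Tiny elitist GA on the unit hypercube [0, 1]^n_vars."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_vars)] for _ in range(pop_size)]
    best_prev, stall = float("inf"), 0
    for _ in range(generations):
        pop.sort(key=fitness)
        best = fitness(pop[0])
        # Stop when the best value has stalled for several generations.
        stall = stall + 1 if best_prev - best < function_tolerance else 0
        if stall >= stall_generations:
            break
        best_prev = best
        elite = pop[: pop_size // 4]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child = [min(1.0, max(0.0, g + rng.gauss(0.0, 0.05)))
                     for g in child]  # Gaussian mutation, clipped to bounds
            children.append(child)
        pop = elite + children
    return min(pop, key=fitness)


def pattern_refine(fitness, x0, step=0.05, min_step=1e-6):
    """Deterministic coordinate pattern search started from the GA result."""
    x, fx = list(x0), fitness(x0)
    while step > min_step:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = list(x)
                trial[i] = min(1.0, max(0.0, x[i] + delta))
                f_trial = fitness(trial)
                if f_trial < fx:
                    x, fx, improved = trial, f_trial, True
        if not improved:
            step *= 0.5  # shrink the pattern once no neighbour is better
    return x


# Toy fitness: squared distance from a known fault vector (illustration only).
TARGET = [0.4, 0.4, 0.2, 0.0, 0.0, 0.2, 0.5, 0.6]


def toy_fitness(k):
    return sum((a - b) ** 2 for a, b in zip(k, TARGET))


coarse = genetic_search(toy_fitness, n_vars=8)
refined = pattern_refine(toy_fitness, coarse)
```

The division of labour is the same as in the text: the population-based stage only has to land near the right basin, and the cheap deterministic stage then tightens the estimate.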


6 RESULTS 


In order to measure the accuracy of the fault iden- 
tification in all the eight variables, a total error was 
defined as the quadratic mean of the errors on sin- 
gle variables: 


$$ e_{tot} = \sqrt{\frac{1}{8}\left(e_1^2 + e_2^2 + e_3^2 + e_4^2 + e_5^2 + e_6^2 + k(6)^2\, e_7^2 + e_8^2\right)} \qquad (3) $$
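
The total error can be checked against the values reported in Table 1. The sketch below assumes that the seventh squared error (the eccentricity phase) is weighted by the square of the reference eccentricity amplitude k(6); this weighting is an interpretation of Eq. (3), chosen because it reproduces the tabulated totals for optimizations #1 and #2.

```python
from math import sqrt


def total_error(reference, estimate):
    """Quadratic mean of the eight parameter errors.

    Assumption: the phase error e_7 is weighted by the reference value
    of k(6), so that the phase matters less when the eccentricity is small.
    """
    squares = [(est - ref) ** 2 for ref, est in zip(reference, estimate)]
    squares[6] *= reference[5] ** 2
    return sqrt(sum(squares) / len(squares))


REFERENCE = [0.4, 0.4, 0.2, 0.0, 0.0, 0.2, 0.5, 0.6]
RUN_1 = [0.4129, 0.4620, 0.1756, 0.0060, 0.0067, 0.1912, 0.4986, 0.6131]
RUN_2 = [0.3985, 0.4254, 0.1796, 0.0109, 0.0175, 0.2085, 0.5149, 0.6082]

error_run_1_percent = 100.0 * total_error(REFERENCE, RUN_1)
error_run_2_percent = 100.0 * total_error(REFERENCE, RUN_2)
```

With these inputs the sketch returns about 2.48% and 1.43%, matching the first two totals listed in Table 1.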


The FDI algorithm was executed several times, 
with the fault parameters set to different values, in 
order to test the effectiveness and accuracy of the 
proposed method. For each combination of faults, 
the GA was executed ten times, because the method 
is inherently non-deterministic, involving the itera- 
tive random choice of points for evaluating the 
fitness function. This way it was possible to assess 
the repeatability of results, which were not strongly 
influenced by the aforementioned random choice. 

Then, to assess the robustness of the algorithm,
the variance σ² of results and total error is used:


$$ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(k(j)_i - \bar{k}(j)\right)^2 \qquad (4) $$


where N is the number of optimizations, k(j)_i is the
value of the j-th fault parameter resulting from the
i-th optimization and k̄(j) is the average value of
the j-th parameter over N optimizations.
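
A short sketch of the per-parameter variance follows. It uses the population form (division by N), which is an assumption about the normalization of Eq. (4), and it is applied only to the three optimizations listed in Table 1, so the numbers are illustrative and do not reproduce the ten-run variances of the later tables.

```python
from statistics import pvariance

# The three optimizations reported in Table 1 (medium damage combination).
RUNS = [
    [0.4129, 0.4620, 0.1756, 0.0060, 0.0067, 0.1912, 0.4986, 0.6131],
    [0.3985, 0.4254, 0.1796, 0.0109, 0.0175, 0.2085, 0.5149, 0.6082],
    [0.4090, 0.4303, 0.1798, 0.0000, 0.0005, 0.1961, 0.5037, 0.6024],
]


def parameter_variances(runs):
    """Variance of each fault parameter over the optimizations (1/N form)."""
    columns = zip(*runs)  # one tuple of N values per fault parameter
    return [pvariance(column) for column in columns]


variances = parameter_variances(RUNS)
```

A small variance across repeated runs indicates that the random choices inside the GA have little influence on the identified fault parameters.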

Table 1 reports, as an example, the results obtained with three
of the optimizations for the medium damage level combination of
faults.


Table 1. Optimizations for medium multiple damage.

                                     Optimizations
Fault parameter        Reference     #1        #2        #3
k(1)                   0.4           0.4129    0.3985    0.4090
k(2)                   0.4           0.4620    0.4254    0.4303
k(3)                   0.2           0.1756    0.1796    0.1798
k(4)                   0.0           0.0060    0.0109    0.0000
k(5)                   0.0           0.0067    0.0175    0.0005
k(6)                   0.2           0.1912    0.2085    0.1961
k(7)                   0.5           0.4986    0.5149    0.5037
k(8)                   0.6           0.6131    0.6082    0.6024
Total error e_tot [%]                2.48      1.43      1.33


Tables 2 to 5 summarize the results of 40 optimizations,
performed with multiple faults and different damage levels. For
each damage level, the GA was executed ten times, and the
arithmetic mean value of the Fitness Function input vector was
computed, along with its variance. Moreover, average value and
variance of the total error are provided. It can be noticed that,
for damage levels low to medium (and therefore in the prognostic
field of interest), both average total error and variance of the
results are satisfactorily low, meaning that the proposed
prognostic tool is both accurate and robust.


Table 2. Results for a low damage combination.

                                     Mean value
Fault parameter   Reference value    (over 10 optimizations)   Variance
k(1)              0.1                0.0953                    6.019E-05
k(2)              0.1                0.1109                    3.424E-04
k(3)              0.1                0.0876                    5.908E-04
k(4)              0.0                0.0069                    8.215E-05
k(5)              0.0                0.0021                    1.040E-05
k(6)              0.1                0.0944                    3.706E-04
k(7)              0.5                0.5085                    1.412E-04
k(8)              0.4                0.3940                    7.716E-04
Total error e_tot                    1.51%                     1.029E-04


Table 3. Results for a medium damage combination.

                                     Mean value
Fault parameter   Reference value    (over 10 optimizations)   Variance
k(1)              0.4                0.4017                    1.774E-04
k(2)              0.4                0.4373                    5.558E-04
k(3)              0.2                0.1809                    1.780E-04
k(4)              0.0                0.0074                    1.048E-04
k(5)              0.0                0.0088                    1.414E-04
k(6)              0.2                0.1982                    1.245E-04
k(7)              0.5                0.5039                    2.010E-05
k(8)              0.6                0.6050                    7.051E-04
Total error e_tot                    2.00%                     8.916E-05


Table 4. Results for a high damage combination.

                                     Mean value
Fault parameter   Reference value    (over 10 optimizations)   Variance
k(1)              1.0                0.9727                    2.719E-04
k(2)              1.0                0.9728                    7.219E-05
k(3)              0.5                0.5623                    4.757E-04
k(4)              0.0                0.0086                    9.420E-05
k(5)              0.0                0.0071                    7.816E-05
k(6)              0.5                0.6912                    2.A91E-04
k(7)              0.5                0.5019                    1.554E-05
k(8)              1.0                0.9948                    4.585E-05
Total error e_tot                    7.33%                     4.190E-05


Table 5. Medium damage combination with noise.

                                     Mean value
Fault parameter   Reference value    (over 10 optimizations)   Variance
k(1)              0.4                0.3929                    2.546E-04
k(2)              0.4                0.4072                    1.131E-03
k(3)              0.2                0.1895                    2.414E-04
k(4)              0.0                0.0028                    9.730E-06
k(5)              0.0                0.0085                    5.591E-05
k(6)              0.2                0.2135                    5.635E-04
k(7)              0.5                0.5031                    7.505E-06
k(8)              0.6                0.5996                    1.064E-03
Total error e_tot                    1.97%                     8.469E-05


The high level damage, with its error rising up to over 7%,
starts to show the divergence between MM and RM; however, this
case is considered only to assess the range of applicability of
the model, but is not of practical interest. Eventually, the
introduction of a white noise disturbance superimposed to the
controller output signal in the RM seems not to affect the method
accuracy nor its robustness (as shown in Table 5).

Figure 1 shows the probability distribution of the total error,
with data gathered in all the performed optimizations. It can be
noticed that, for most optimizations, the total error has a very
low value, in the order of 1%. The smaller peak of the
probability distribution, settled around 7%, is indeed caused by
the simulations executed with a high damage level, which causes
the behavior of the two models to diverge slightly. However, this
case is not of practical interest for prognostic applications (in
fact, such a high damage results in a jammed actuator response
and therefore lies in the field of diagnostics), but rather is
considered to assess the limits and the applicability of the MM
damage implementation.




Figure 1. Probability distribution of the total error e_tot.


Table 6. Effect of unknown failure mode.

                                        Fitness function
Initial conditions                      final value
k_e = 0.02 Nm/rad                       0.0618
k_e = 0.05 Nm/rad                       0.3609
k_e = 0.1 Nm/rad                        1.8743
Baseline (low to medium damage,
known faults only)                      3.049E-04


In addition, in order to partially rule out the possibility that an
unknown failure mode (i.e. one different from those considered in the
models) goes undetected by the GA, a fictitious failure has been added
to the RM only, to evaluate its effect on the FDI algorithm. The
fictitious failure consists of an elastic external force on the motor
shaft, with elastic constant k varying between 0.02 and 0.1 Nm/rad.
In this case the system health status is partially misinterpreted,
since the total error increases. However, the presence of an unknown
failure is revealed by the fitness function failing to converge to a
near-zero value.

Table 6 shows the optimized value of the fitness function for
different magnitudes of the added fictitious failure, compared to the
mean baseline value obtained for low to medium damage with known
fault modes only.

It can be seen that introducing even a small unknown failure
increases the value of the fitness function after the optimization by
2 to 4 orders of magnitude. This increase can easily be related to
the presence of a condition that the algorithm cannot identify, such
as an unknown failure mode or an excessively high damage level.
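The "2 to 4 orders of magnitude" figure can be checked directly from the values in Table 6. The short sketch below only reproduces that arithmetic; the variable and function names are illustrative and are not taken from the authors' implementation.

```python
import math

# Optimized fitness function values copied from Table 6.
baseline = 3.049e-4  # low to medium damage, known faults only
unknown_failure = {0.02: 0.0618, 0.05: 0.3609, 0.1: 1.8743}  # k [Nm/rad] -> fitness


def orders_of_magnitude(value, reference):
    """Return log10 of the ratio value / reference."""
    return math.log10(value / reference)


# Increase of the fitness function with respect to the baseline, per case.
orders = {k: orders_of_magnitude(f, baseline) for k, f in unknown_failure.items()}

for k, increase in orders.items():
    print(f"k = {k} Nm/rad: {increase:.2f} orders of magnitude above baseline")
```

A detection threshold placed between the baseline and the smallest of these values would flag all three cases; where exactly to place it is a design choice that the paper does not specify.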

Since each failure mode affects the waveform of the monitored
variables in a recognizable pattern, it appears unlikely that two
failures can cancel each other out and result in a missed
identification; however, this aspect shall be further investigated in
future work.


7 RELATION BETWEEN RAMS AND PHM 
APPROACHES 


Although the RAMS and PHM disciplines are sel- 
dom considered together in practical applications, 
they are closely related and their integration in a 
unified approach to the system life cycle manage- 
ment can lead to significant benefits in terms of 
safety and cost effectiveness (Dersin et al. 2017). 


7.1 Effect of PHM approach on RAMS 


The use of a prognostic tool to detect the early 
effects of incipient faults can be an effective and 
powerful method to improve the characteristics of 
Reliability, Availability, Maintainability and Safety 
of a system. A reliable PHM strategy can in fact enable
the cost effective implementation of Predictive
Maintenance and Condition Based Maintenance 
practices: with this approach, maintainability is 
affected by the improvement of the self-diagnostic 
capability of the system, which reduces the time to 
detect the fault to be fixed. In fact, with a PHM 
approach for the management of the system life 
cycle, most maintenance interventions are actually 
preventive scheduled maintenance. The corrective 
maintenance is then reduced to the replacing of 
those components that underwent failure modes 
impossible to detect in advance. 

The overall Maintenance Man Hours/Flight 
Hour of a given aircraft system are then decreased, 
as well as the maintenance related operating costs. 

The above mentioned effect on maintenance greatly improves the
availability of the system. Since corrective maintenance is reduced,
most of the necessary interventions can be scheduled in advance and
performed in the already slated ground time.

This allows a more cost effective management of the fleet, and in
some cases also a reduction of the number of aircraft necessary in
the fleet of a given commercial airline or military service.

On the other hand, a prognostic fault detection 
and useful life estimation would virtually eliminate 
the risk of having faulty components flying on an 
aircraft in service. Then, the effective failure rate 
of the considered components is reduced, since 
worn components are replaced before their damage 
level starts affecting the performance of the sys- 
tem. Therefore, both the reliability and safety are 
improved, since they are both related to the failure 
rates of the system (the difference between them is 
only the potential effect of the considered failures). 


7.2 Use of RAMS tools for PHM purposes 


Various analytic tools commonly used in the
RAMS disciplines can be employed to perform an
effective PHM activity, leading to safety improvement
and cost reduction. Through the Failure
Modes Effect & Criticality Analysis (FMECA) it 
is possible to identify the most significant failure 
modes of the system in terms of their impact on 
service reliability and on safety, based both on their 
effects at subsystem and aircraft levels and on their 
failure rate, estimated through statistical analysis 
of field data and return on experience. Therefore, 
focusing the PHM activity on the prediction of the 
occurrence of those failure modes can lead to a sig- 
nificant benefit in terms of improvement of safety 
and service reliability. 

On the other hand, the Life-Cycle cost analysis 
is intended to identify the most cost affecting and 
availability affecting maintenance actions. Then, 
prognostic tools can be profitably employed
to reduce the number of required cost affecting 
maintenance actions and to plan the availability 
affecting ones into the already slated ground time 
windows, thus improving the availability of the air- 
craft and reducing its operating costs. 


8 CONCLUSIONS 


A satisfactorily effective FDI tool intended for an electromechanical
actuator of a flight control system has been implemented by the
authors and tested on a simulated test bench under different working
conditions, in terms of combinations of fault modes of the monitored
system.

The algorithm has shown adequate robustness even in the presence of a
noisy input signal or of the effects of failure modes not considered
in the monitor model. In particular, the presence of an unknown
failure mode leads to a partial misinterpretation of the system
health status, but this condition is correctly recognized and
reported. On the other hand, the possibility of an unknown failure
mode and a considered one cancelling each other's effects, thus
producing a missed fault identification, appears extremely unlikely
but shall be further investigated. Future work on this FDI algorithm
will include the extension of the models to take into account a
greater number of fault modes, and their validation on a physical
test bench, also to ensure that the required accuracy and resolution
of the measured signals are matched by the currently available
sensors. The implementation on a flying aircraft of an FDI algorithm
similar to the one proposed (adapted to match the parameters of the
particular actuator installed on board), coupled with a statistical
RUL prediction, could then reduce the maintenance-related operating
costs, reduce the downtime and increase both the safety and the
availability of the system, integrating the traditional RAMS
disciplines with the emerging PHM approach.


REFERENCES 


Alamyal, M., Gadoue, S.M. & Zahawi, B. 2013. Detec- 
tion of induction machine winding faults using 
genetic algorithm. Diagnostics for Electric Machines, 
Power Electronics and Drives 9th IEEE Int.Sympo- 
sium, Valencia, Spain: 157-161. 

Berri, P.C., Dalla Vedova, M.D.L. & Maggiore, P. 2016. 
A Smart Electromechanical Actuator Monitor for 
New Model-Based Prognostic Algorithms. Interna- 
tional Journal of Mechanics and Control (JoMaC) 
17(2): 59-66.

Berri, P.C., Dalla Vedova, M.D.L. & Maggiore, P. 2017.
On-board electromechanical servomechanisms
affected by progressive faults: proposal of a smart GA
model-based prognostic approach. Proc. of the 27th
European Safety and Reliability Conference, Portoroz,
Slovenia: 839-845.

Belmonte, D., Dalla Vedova, M.D.L. & Maggiore, P. 
2015. Electromechanical servomechanisms affected 
by motor static eccentricity: Proposal of fault evalu- 
ation algorithm based on spectral analysis techniques. 
Safety and Reliability of Complex Engineered Systems
- Proc. of the 25th European Safety and Reliability
Conf. ESREL 2015: 2365-2372.

Borello, L., Dalla Vedova, M.D.L., Jacazio, G. & Sorli, 
M. 2009a. A Prognostic Model for Electrohydraulic 
Servovalves. Annual Conference of the Prognostics and 
Health Management Society, San Diego, CA. 

Borello, L., Villero, G. & Dalla Vedova, M.D.L. 2009b. 
New asymmetry monitoring techniques: effects on 
attitude control. Aerospace Science and Technology 
13(8):475-487. 

Borello, L. & Dalla Vedova, M.D.L. 2012. A dry friction 
model and robust computational algorithm for revers- 
ible or irreversible motion transmission. International 
Journal of Mechanics and Control 13(2): 37-48. 

Borello, L., & Dalla Vedova, M.D.L. 2014. Flaps Failure 
and Aircraft Controllability: Developments in Asym- 
metry Monitoring Techniques. Journal of Mechanical 
Science and Technology (JMST) 28(11): 4593-4603. 

Byington, C.S., Watson, W., Edwards, D. & Stoelting, P. 
2004. A Model-Based Approach to Prognostics and 
Health Management for Flight Control Actuators. 
IEEE Aerospace Conference Proceedings, USA. 

Cunkas, M., & Aydogdu, O. 2010. Realization of Fuzzy 
Logic Controlled Brushless DC Motor Drives using 
Matlab/Simulink. Mathematical and Computational
Applications 15(02): 218-229. 

Dalla Vedova, M.D.L., Maggiore, P., & Pace, L. 2014. 
Proposal of Prognostic Parametric Method Applied 
to an Electrohydraulic Servomechanism Affected by 
Multiple Failures. WSEAS Trans. on Environment and 
Development 10: 478-490. 

Dalla Vedova, M.D.L., Maggiore, P., & Pace, L. 2015a. A 
New Prognostic Method Based on Simulated Anneal- 
ing Algorithm to Deal with the Effects of Dry Fric- 
tion on Electromechanical Actuators. International 
Journal of Mechanics 9: 236-245. 




Dalla Vedova, M.D.L., Maggiore, P., Pace, L. & Desando, 
A. 2015b. Evaluation of the correlation coefficient as 
a prognostic indicator for electromechanical servo- 
mechanism failures. International Journal of Prognos- 
tics and Health Management 6(1). 

Dalla Vedova, M.D.L., De Fano, D. & Maggiore, P. 
2016a. Neural Network Design for Incipient Failure 
Detection on Aircraft EM Actuator. International 
Journal of Mechanics and Control (JoMaC) 17(1): 
77-83. 

Dalla Vedova, M.D.L., Germana, A. & Maggiore, P.
2016b. Proposal of a new simulated annealing model- 
based fault identification technique applied to flight 
control EM actuators. Risk, Reliability and Safety: 
Innovating Theory and Practice: Proceedings of 
ESREL 2016: 313-321. 

Dersin, P., Alessi, A., Lamoureux, B., Brahimi, M. & 
Fink, O. 2017. Prognostics and health management in 
railways. Proc. of the 27th European Safety and Reli- 
ability Conference, Portoroz, Slovenia: 889-895. 

Halvaei Niasar, A., Moghbelli, H. & Vahedi, A. 2009. 
Modelling, Simulation and Implementation of Four- 
Switch Brushless DC Motor Drive Based On Switch- 
ing Functions. IEEE EUROCON 2009, 18-23 May,
St.-Petersburg, Russia. 

Hamdani, S., Touhami, O., Ibtiouen, R. & Fadel, M. 
2011. Neural network technique for induction motor 
rotor faults classification-dynamic eccentricity and 
broken bar faults. IEEE International Symposium on 
Diagnostics for Electric Machines, Power Electronics & 
Drives: 626-631. 

Haskew, T.A., Schinstock, D.E. & Waldrep, E. 1999.
'Two-Phase On' Drive Operation in a Permanent
Magnet Synchronous Machine Electromechanical
Actuator. IEEE Transactions on Energy Conversion 14.

Howse, M. 2003. All-electric aircraft. Power Engineer, 
17(4). 

Lee, B.K. & Ehsani, M. 2003. Advanced Simulation 
Model for Brushless DC Motor Drives. Electric Power 
Components and Systems, 31(9): 841-868. 

Mamis, M.S., Arkan, M. & Keles, C. 2013. Transmission 
lines fault location using transient signal spectrum. 
Int. J. Electr. Power Energy Syst. 53: 714-718. 

Markovsky, I. & Van Huffel, S. 2007. Overview of total 
least-squares methods. Signal Processing 87 (10): 
2283-2302. 

Quigley, R.E.J. 1993. More electric aircraft. Proc. of 
Eighth Annual IEEE Applied Power Electronics Con- 
ference - APEC ‘93, San Diego, CA: 906-911. 

Raie, A., & Rashtchi, V. 2002. Using a genetic algorithm 
for detection and magnitude determination of turn 
faults in an induction motor. Electrical Engineering 
84(5): 275-279. 

Refaat, S.S., Abu-Rub, H., Saad, M.S., Aboul-Zahab, 
E.M. & Iqbal, A. 2013. ANN-based for detection, 
diagnosis the bearing fault for three phase induction 
motors using current signal. IEEE International Con- 
ference on Industrial Technology (ICIT): 253-258.

Su, H. & Chong, K.T. 2007. Induction machine condi- 
tion monitoring using neural network modelling. 
IEEE Transactions on Industrial Electronics, 54(1): 
241-249. 

Vachtsevanos, G., Lewis, F., Roemer, M., Hess, A. & Wu, 
B. 2006. Intelligent Fault Diagnosis and Prognosis for 
Engineering Systems. Wiley. 




Safety and Reliability - Safe Societies in a Changing World — Haugen et al. (Eds) 
© 2018 Taylor & Francis Group, London, ISBN 978-0-8153-8682-7 


A method for wind speed generation 


J. Ma, M. Fouladirad & A. Grall 


Institut Charles Delaunay, LM2S, Université de Technologie de Troyes, CNRS, Troyes, France 


ABSTRACT: This paper proposes a flexible continuous wind speed model based on a first-order Markov
chain and Stochastic Differential Equations (SDE). The model generates high-frequency wind speed
sequences whose statistical properties are similar to those of the wind speed observed at a given
geographical site. It can be combined with the deterioration health indicators of a wind turbine to make
prognoses and plan maintenance actions.


1 INTRODUCTION 


Since the Paris Agreement was signed by the 195 members of the
United Nations Framework Convention on Climate Change (UNFCCC), more
and more countries have announced plans to reduce CO2 emissions and
to develop alternative clean energies such as wind energy, solar
energy, tidal energy, natural gas, etc. Consequently, in the past
decade, the worldwide exploitation and technical development of wind
energy have increased. In Europe, the capacity installed in 2016
reached 13489.9 MW and the estimated electricity production from wind
power in the European Union (EU) in 2016 was 302.7 TWh (windeurope
2017a). By 2030, wind energy could cover 29.6% of the EU's
electricity demand (windeurope 2017b). However, because of the
inherent randomness of wind speed, the wind power industry faces
enormous challenges in practice, and its development is laborious.

To obtain a good prognosis of the wind turbine lifetime and to carry
out efficient maintenance policies, it is necessary to study the
impact of the wind speed on the wind turbine deterioration. It is
important to rely on an appropriate wind speed model which can
simulate the wind sequence of a given geographical region with the
same features as the real wind sequence, and which puts
high-frequency wind speed data at our disposal.

The literature on wind speed can be classified
into three groups.


• Wind characteristics study on a specific site
• Wind speed generation
• Wind speed prognosis


The wind characteristics study is more quantitative and focuses on
statistical properties of the wind speed, such as its average,
maximum, minimum, standard deviation and statistical distribution
(usually the Weibull distribution) (Shu et al. 2016, Barthelmie et
al. 2005, Celik 2004). The daily patterns in wind energy production
have been investigated (Scholz et al. 2014, Lennard 2014); for the
modelling of extreme events, refer to Baseer et al. (2017). Wind
speed generation and wind speed prognosis are two different subjects.
However, they are similar in their modelling aspects, and a great
effort has been devoted to wind speed modelling. To achieve accuracy,
models using environmental parameters such as temperature, pressure
and humidity have been proposed. Generally, these models require
computational fluid dynamics in order to simulate the environmental
conditions on different grids (Sezer-Uzol and Long 2006, Landberg et
al. 2003). Since they are computationally burdensome, they cannot
generate a large data set and are therefore not well suited to
applications requiring a good knowledge of the wind sequence in a
short time.

Probability distributions such as the Weibull distribution can be
used to generate wind speed data with the same statistical features
as the wind data at our disposal. Regrettably, wind speed samples are
usually measured at an hourly time scale, and data generation based
on this type of data can only reproduce the macroscopic features of
the wind behaviour.
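As a rough illustration of distribution-based generation, the sketch below draws independent hourly wind speeds from a Weibull distribution using the Python standard library. The scale and shape values are hypothetical placeholders, not parameters fitted in this paper. Because the samples are independent, the generated series matches the marginal distribution but carries no temporal correlation, which is the limitation noted above.

```python
import random
import statistics


def generate_weibull_wind(n, scale, shape, seed=None):
    """Draw n independent wind speeds (m/s) from a Weibull(scale, shape) law."""
    rng = random.Random(seed)
    return [rng.weibullvariate(scale, shape) for _ in range(n)]


# Hypothetical parameters: scale 8 m/s, shape 2; one year of hourly values.
hourly = generate_weibull_wind(24 * 365, scale=8.0, shape=2.0, seed=1)

print(f"mean = {statistics.fmean(hourly):.2f} m/s, "
      f"std = {statistics.pstdev(hourly):.2f} m/s")
```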

Discrete models with memory applied in time series analysis, such as
Autoregressive Moving Average (ARMA) models (Poggi et al. 2003),
Autoregressive Integrated Moving Average (ARIMA) models (Kavasseri
and Seetharaman 2009) and Generalized Autoregressive Conditional
Heteroskedasticity (GARCH) models (Liu et al. 2011), are commonly
used to reproduce wind speed data for a particular location from the
available historical measurements. Although these models generate
wind speed data efficiently, their linearity discards the possibility
of nonlinearity in the time series (Wanga et al. 2018). Since a
simple model ignores the complexity of the wind behaviour and an
elaborate model is not easy to deal with, some numerical methods
relying on computing capacity have been proposed to predict wind
behaviour from historical data only.

Recently, wind speed prognosis based on machine learning and
artificial neural networks (ANNs) has become very popular; see
Bilgili et al. (2007) and Ji et al. (2007). Generally, the method
generates wind speed by training on historical wind speed data. A key
problem of ANNs is that training is easily trapped in local optima.
Moreover, without a model it is difficult to make instantaneous
predictions and to control the behaviour of the engineering systems
affected by wind speed evolution.

Lately, in order to reduce the computational complexity while still
proposing a model for the wind evolution, memoryless or short-memory
stochastic models have been getting more attention. In this
framework, Markov processes, Lévy processes or diffusion processes
are proposed. For instance, the wind evolution is described by Wilson
and Sawford (1996) on the basis of Brownian motion and Langevin
equations. Markov chains are more accessible and are capable of
modelling wind speed data. Despite its simplicity, a first-order
Markov chain can generate wind speed time series with statistical
characteristics very similar to those of the observed data (Nfaoui et
al. 2004, Ettoumi et al. 2003, Yang et al. 2011). However, Markov
chains are not able to capture wind speed characteristics at high
frequency. Research by Brokish and Kirtley (2009) shows the limits of
Markov chains for simulating data with a time step smaller than
fifteen minutes. As a generalisation, second- or third-order Markov
chains and semi-Markov chains are used for wind speed modelling in
order to improve the accuracy (D'Amico et al. 2013). Diffusion
processes or stochastic differential equations are very flexible
short-memory models and, as mentioned in Zarate-Minano et al. (2013),
they seem to be suitable candidates for wind speed modelling. Since
SDE models are continuous, technically they can be used to simulate
wind speed at any time scale. However, because of diurnal and
seasonal influences, the wind speed generation length of SDE models
is limited to a few hours. SDE models are especially suitable for
modelling wind speed turbulence.

In summary, the existing wind speed models focus either on short-term
modelling or on long-term modelling at a large time scale. Hence,
their utility is limited when they are needed to analyse wind turbine
performance and operation in relation to wind speed. Short-term
models limit the analysis to a short current period; long-term models
cannot provide sufficient and qualitatively useful data for analysis.

This problem is especially prominent in reliability analysis, failure
prognosis and output power estimation of wind turbines. It has been
highlighted that wind speed has a significant impact on the
degradation of wind turbine components. In particular, the
blade-pitch system has a high failure rate among the components.
Moreover, as indicated in Barthelmie et al. (2005), the wind
turbulence intensity (standard deviation divided by mean wind speed)
has a key influence on wind turbine lifetime. For instance, fast
variation of wind speed leads to frequent and sudden actions of the
blade-pitch system. It has been demonstrated that turbulence
intensity affects turbine power production (Elliott and Cadogan 1990,
Frandsen et al. 2000, Kaiser et al. 2007, Gottschall et al. 2006,
Gottschall and Peinke 2007, Lubitz 2014). Furthermore, from the point
of view of prognosis and remaining useful lifetime (RUL) estimation,
hourly average wind speed data are not satisfactory, because they
reflect the trend of wind speed while losing details, namely the
turbulence of wind speed. In addition, the degradation of wind
turbine components is a long-term process, and the RUL of wind
turbine components changes dynamically according to the working
conditions of the wind turbine.
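To make the loss of detail concrete, the sketch below computes the turbulence intensity of a short sample, taken here as the ratio of the standard deviation to the mean wind speed (the usual convention), and compares it with the value obtained once the sample is replaced by its average. The wind speed values are made up for illustration only.

```python
import statistics


def turbulence_intensity(speeds):
    """Standard deviation of the wind speed divided by its mean."""
    mean_speed = statistics.fmean(speeds)
    if mean_speed <= 0:
        raise ValueError("mean wind speed must be positive")
    return statistics.pstdev(speeds) / mean_speed


# Made-up high-frequency samples (m/s) within one averaging period.
sample = [9.8, 10.4, 10.1, 9.5, 10.9, 10.2, 9.7, 10.6]
ti = turbulence_intensity(sample)

# The same period described only by its average: all fluctuation is lost.
averaged = [statistics.fmean(sample)] * len(sample)
ti_avg = turbulence_intensity(averaged)

print(f"TI from raw samples: {ti:.3f}, TI from averaged data: {ti_avg:.3f}")
```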

Presently, wind speed data at different time scales are applied to
different problems: long-term wind speed data are used to estimate
the wind energy of a future wind farm; diverse wind speed sequences
are used to test a new control strategy; and annual wind sequences
with detailed information are indispensable to reliability research
on wind turbines, for example on the degradation of the blade pitch
and yaw systems.

A method that generates continuous wind speed over a long period can
fill the existing gap in the literature. This encourages us to
propose a wind speed model that satisfies the following requirements.


• It reflects the trend of the wind over a long-term period.

• It contains turbulence information at a small time scale.

• It agrees with the probability distribution of real wind speed
data.

• It can be generated quickly.

In other words, this paper aims to propose a mathematical model which
is able to generate satisfactory wind speed series for the RUL study
of wind turbine components and for accurate output power estimation.

The main contributions of this paper can be summarized as follows.


• Proposition of a new wind speed model based on Markov chains
embedded with an SDE model;

• Consideration of different SDE models in order to generate
different wind profiles.


The remainder of this paper is organized as follows. Section 2
describes the model, gives a brief introduction to Markov chains and
SDE models, and presents the methods used to estimate the model
parameters. An illustrative example is given in Section 3.
Conclusions are drawn in Section 4.


2 MODEL DESCRIPTION 


In the wind industry, the Reynolds decomposition needs to be
considered when analyzing turbulence effects. Hence, the wind speed
time series $U(t)$ is decomposed into its mean value $\bar{U}(t)$ and
the fluctuation $u(t)$:


$U(t) = \bar{U}(t) + u(t)$    (1)


This suggests modelling the mean wind speed and the wind speed
fluctuation separately. According to Calif (2012), $\bar{U}(t)$ can
be regarded as the output of a low-pass filter, corresponding to the
hourly, daily, monthly, seasonal or annual effects, whereas the
turbulence $u(t)$, which has a zero mean value, can be seen as the
output of a high-pass filter, corresponding to turbulent effects.
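A minimal numerical version of the Reynolds decomposition is sketched below. It assumes that the mean component is approximated by block averages over a fixed window, which is one simple choice of low-pass filter and not necessarily the one used by the authors; the sample values and the function name are illustrative.

```python
def reynolds_decomposition(speeds, window):
    """Split a series into block means and zero-mean fluctuations per block."""
    mean_part, fluctuation = [], []
    for start in range(0, len(speeds), window):
        block = speeds[start:start + window]
        block_mean = sum(block) / len(block)
        # Mean component: constant over the block (low-pass part).
        mean_part.extend([block_mean] * len(block))
        # Fluctuation: what remains after removing the block mean.
        fluctuation.extend(v - block_mean for v in block)
    return mean_part, fluctuation


# Made-up wind speeds (m/s): two blocks of four samples each.
U = [8.2, 8.6, 7.9, 8.3, 10.1, 9.7, 10.4, 9.8]
U_bar, u = reynolds_decomposition(U, window=4)

for total, mean, fluct in zip(U, U_bar, u):
    print(f"U = {total:5.2f}  U_bar = {mean:5.2f}  u = {fluct:+.2f}")
```

By construction, the two components add up to the original series and the fluctuation averages to zero within each block.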

Depending on the research emphasis, wind speed can be modelled in
different ways at different time scales. In the wind energy industry,
continuous wind speed is more appropriate for studying the operation
of a wind turbine, as studies show that wind turbulence intensity has
an enormous effect on energy production and on the detected failures
of wind turbines; hourly average wind speed is a good choice for
estimating the wind resource of a site; and 10-minute average wind
speed is what SCADA systems record.

In this paper, we propose a Markov chain model embedded with SDEs.
Owing to its properties, this model is very flexible. The Markov
chain is used to model the macroscopic wind speed trend. It
represents the hourly average wind speed or different wind speed
classes related to wind turbulence intensity. The embedded SDEs are
mainly used to model a continuous wind speed depending on the
environmental states set by the Markov chain. Different parameter
settings and different applications of the SDEs give different
continuous wind speed models. Other SDE models can also be considered
for simulating different conditions such as gusts or extreme weather.
Hence this model is easily adapted to the location of interest.

Here we would like to give two illustrative examples.


• One day's (24 hours) wind speed generation
Each state in the Markov chain represents an hourly average wind
speed and drives the choice of the related SDE. In this case, 24
states are randomly generated to represent the average wind speed
time series. In each state, continuous wind sequences are generated
from a specific SDE at a small time scale, such as one-second data.

• Generation of different wind conditions
According to the IEC standard (Commission et al. 2005), wind
conditions include normal, extreme and gust conditions, which could
be generated by the Markov chain. Within the Markov chain, SDEs are
used to simulate the wind speed associated with a given wind
condition.
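The first example can be sketched as follows, under strong simplifying assumptions. The three states, their mean speeds, the transition matrix and the fluctuation parameters are all hypothetical and chosen only to make the code run. The fluctuation inside each hour follows a linear SDE of the form dx = a x dt + b dW, simulated with its exact Gaussian transition. This is an illustration of the "Markov chain embedded with SDE" idea, not the authors' implementation.

```python
import math
import random

# Hypothetical hourly mean wind speed states (m/s) and transition matrix.
STATE_MEANS = [6.0, 9.0, 12.0]
TRANSITIONS = [
    [0.7, 0.3, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.4, 0.6],
]
# Hypothetical fluctuation parameters (a in 1/hour, b) attached to each state.
SDE_PARAMS = [(-30.0, 3.0), (-30.0, 5.0), (-30.0, 7.0)]


def generate_day(samples_per_hour=60, hours=24, seed=None):
    """Generate one day of wind speed: Markov chain for the hourly mean,
    linear SDE (exact discretization) for the fluctuation around it."""
    rng = random.Random(seed)
    dt = 1.0 / samples_per_hour  # time step in hours
    state, fluct, series = 1, 0.0, []
    for _ in range(hours):
        a, b = SDE_PARAMS[state]
        rho = math.exp(a * dt)
        sd = b * math.sqrt((math.exp(2 * a * dt) - 1) / (2 * a))
        for _ in range(samples_per_hour):
            fluct = rho * fluct + sd * rng.gauss(0.0, 1.0)
            # Wind speed cannot be negative, so clip at zero.
            series.append(max(0.0, STATE_MEANS[state] + fluct))
        # Draw the state of the next hour from the current row of the matrix.
        state = rng.choices(range(len(STATE_MEANS)), weights=TRANSITIONS[state])[0]
    return series


day = generate_day(seed=7)
print(f"{len(day)} samples, min = {min(day):.2f} m/s, max = {max(day):.2f} m/s")
```

Raising `samples_per_hour` to 3600 gives one-second data, at the cost of a longer series.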


2.1 Markov chain 


A Markov chain process is uniquely defined by its state space,
transition matrix and initial distribution. Markov chain wind speed
modelling consists of two main steps: state classification and
transition matrix estimation. The state classification is arbitrary
and depends on the purpose. In this paper, the range of wind speed
$[U_{\min}, U_{\max}]$ is divided into several intervals, and each
interval is identified by its central point, changing the event "the
wind speed value is in the range $[a, b]$" to the event "the wind
speed is $\frac{1}{2}(b + a)$". We introduce the notation
$S = \{s_1, s_2, \dots, s_N\}$ for the wind speed states, and let
$\{X_n\}_{n \in \mathbb{N}}$ represent the wind speed time series.
Hence, the event "$X_0 = s_i$" means that at time t = 0 the wind
speed belongs to state $s_i$. Table 1 shows an example of state
assignment.

The initial distribution is estimated by dividing the dataset into
bins according to the states, then normalising the vector of the
occurrences in each bin. For example, to calculate
$p_i^0 = P\{X_0 = s_i\}$, which is the probability that the first
element of the wind time series is in state $s_i$, we count the
number of times that a value falls in state $s_i$ in the entire
recorded time series and divide it by the total number of recorded
values.

The empirical frequencies are the estimation of the transition
probabilities

$P(X_{n+1} = s_j \mid X_n = s_i, X_{n-1} = s_{i_{n-1}}, \dots, X_0 = s_{i_0}) = P(X_{n+1} = s_j \mid X_n = s_i) = p_{ij}$    (2)

where $s_i, s_j \in S$. The transition probability $p_{ij}$ can be
estimated by counting how many times a value goes from state $s_i$ to
state $s_j$ in the wind time series, normalised over the total number
of occurrences of values in state $s_i$. The transition matrix,
denoted $\pi$, is filled with the transition probabilities $p_{ij}$,
$1 \le i, j \le N$.


Table 1. Example for Markov chain's state space.

State      Interval
s_1        [4, 8]
s_2        (8, 9.4]
s_3        (9.4, 10.2]
s_4        (10.2, 10.7]
s_5        (10.7, 11.2]
s_6        (11.2, 12]
s_7        (12, 12.8]
s_8        (12.8, 25)
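The counting procedure described above can be sketched as follows, using the interval edges of Table 1 as state boundaries. The input series is made up for illustration, states are indexed from 0 in the code, and values outside the range of Table 1 are clamped to the first or last state, which is an assumption of this sketch.

```python
import bisect

# Upper edges of the eight intervals of Table 1: [4, 8], (8, 9.4], ..., (12.8, 25).
UPPER_EDGES = [8.0, 9.4, 10.2, 10.7, 11.2, 12.0, 12.8, 25.0]


def to_state(speed):
    """Map a wind speed to a 0-based state index (intervals closed on the right)."""
    return min(bisect.bisect_left(UPPER_EDGES, speed), len(UPPER_EDGES) - 1)


def estimate_markov_chain(speeds, n_states=len(UPPER_EDGES)):
    """Return (initial distribution, transition matrix) as empirical frequencies."""
    states = [to_state(v) for v in speeds]

    # Initial distribution: share of recorded values falling in each state.
    counts = [0] * n_states
    for s in states:
        counts[s] += 1
    initial = [c / len(states) for c in counts]

    # Transition matrix: count moves i -> j, then normalise each row.
    trans_counts = [[0] * n_states for _ in range(n_states)]
    for i, j in zip(states, states[1:]):
        trans_counts[i][j] += 1
    matrix = []
    for row in trans_counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return initial, matrix


# Made-up hourly average wind speeds (m/s).
speeds = [7.5, 8.8, 9.0, 10.0, 9.1, 7.9, 8.5, 9.9, 10.5, 9.2]
p0, P = estimate_markov_chain(speeds)
print("initial distribution:", p0)
print("row of state s_2:", P[1])
```

States that never occur in the data produce all-zero rows here, which illustrates why sparsely populated states make the estimation fragile.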


2.2 SDE 


After studying 1 million wind speed fluctuation distributions, Calif
(2012) concludes that wind speed fluctuation distributions commonly
fall into three classes. About 90% of the wind speed turbulence
follows a symmetrical mono-modal PDF which is well fitted by a
Gaussian PDF (class 1 shown in Figure 1). About 9% follows an
asymmetrical mono-modal PDF which can be described by a Gram-Charlier
series (class 2 shown in Figure 1), and the remaining 1% follows a
bimodal PDF which is fitted by a mixture of Gaussian PDFs (class 3
shown in Figure 1). Figure 1 illustrates the mean PDF of wind speed
turbulence for each class. In this paper, we mainly consider wind
with speed fluctuations of class 1 and class 2.


2.2.1 SDE model selection 

In order to randomly switch between the two classes, an inner Markov
chain is considered inside each state $s_i$. To avoid confusion
between the wind speed Markov chain and this SDE switching Markov
chain, the latter is designated as the SDE-Selection Model (SSM) in
this paper. The transition probability matrix of the SSM is directly
assigned as follows.


           Class 1   Class 2
Class 1    0.9       0.1
Class 2    0.9       0.1

Figure 1. PDFs of wind speed fluctuation (m/s) for class 1, class 2
and class 3 (Source: Calif 2012).


Since the majority of turbulence seems to follow a Gaussian
distribution and some can be fitted by mono-modal distributions, it
seems natural to propose a stochastic process whose increments are
Gaussian or a functional of a Gaussian distribution. Potential
candidates are diffusion processes, which are derived from stochastic
differential equations. The general form of a one-dimensional SDE is
as follows:


$dx(t) = a(x(t), t)\,dt + b(x(t), t)\,dW(t), \quad t \in [0, T]$    (3)


where $a(x(t), t)$ and $b(x(t), t)$ are the drift term and the
diffusion term, respectively. $W(t)$ is a standard Wiener process
described as follows:


• $W(0) = 0$ with probability 1.

• For $0 \le t_i < t_{i+1} \le T$, the increment
$\Delta W_i = W(t_{i+1}) - W(t_i)$ is Gaussian with zero mean and
variance $t_{i+1} - t_i$, namely
$\Delta W_i \sim N(0, t_{i+1} - t_i)$.

• For $0 \le t_i < t_{i+1} < t_{i+2} \le T$, the non-overlapping
increments $\Delta W_i = W(t_{i+1}) - W(t_i)$ and
$\Delta W_{i+1} = W(t_{i+2}) - W(t_{i+1})$ are independent.

Hence, a standard Wiener process is a continuous Gaussian process
with independent increments and paths of unbounded variation. Since
there is a large variety of functions $a(x(t), t)$ and $b(x(t), t)$,
the diffusion process can model a large range of phenomena. In this
paper, we focus on a special case of SDE initially proposed by
Vasicek (1977).
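The general SDE with drift a(x, t) and diffusion b(x, t) can be simulated numerically with the Euler-Maruyama scheme, which is a standard discretization and is not claimed to be the one used by the authors. The sketch below draws each Wiener increment as a Gaussian with zero mean and variance equal to the time step, as in the definition above; the function name and arguments are illustrative.

```python
import math
import random


def euler_maruyama(a, b, x0, T, n_steps, seed=None):
    """Simulate dx = a(x, t) dt + b(x, t) dW on [0, T] with n_steps equal steps."""
    rng = random.Random(seed)
    dt = T / n_steps
    t, x = 0.0, x0
    path = [x0]
    for _ in range(n_steps):
        # Wiener increment over one step: N(0, dt).
        dW = rng.gauss(0.0, math.sqrt(dt))
        x = x + a(x, t) * dt + b(x, t) * dW
        t += dt
        path.append(x)
    return path


# With zero drift and unit diffusion the scheme returns a Wiener path itself.
wiener = euler_maruyama(lambda x, t: 0.0, lambda x, t: 1.0, x0=0.0,
                        T=1.0, n_steps=1000, seed=42)

# With unit drift and zero diffusion it reduces to x(t) = t.
line = euler_maruyama(lambda x, t: 1.0, lambda x, t: 0.0, x0=0.0,
                      T=1.0, n_steps=100)

print(f"W(1) = {wiener[-1]:.3f}, deterministic check x(1) = {line[-1]:.3f}")
```

For the linear SDEs considered next, the transition law is known in closed form, so exact sampling can replace this approximation.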


2.2.2 Ornstein-Uhlenbeck process

Figure 2 shows 3 histograms of wind speed turbulence. This group of
wind data was downloaded from the site www.winddata.com and records
the wind speed during one hour at the site of San Gorgonio, USA. It
can be observed that the wind speed turbulence follows a symmetrical
mono-modal PDF with a mean value close to zero. Combining this with
the research results of Calif (2012) (shown in Figure 1),
Ornstein-Uhlenbeck processes appear to be suitable candidates for
modelling wind speed fluctuations. In other words, the aim of the SDE
application here is to model a continuous wind speed during a short
period, such as several seconds, 10 minutes or 1 hour.


Figure 2. Histogram of wind speed turbulence (data
downloaded from www.winddata.com).


Form A


Langevin's equation is used for the first process chosen to depict
the increments of wind speed, namely,


$dx(t) = a\,x(t)\,dt + b\,dW(t), \quad t \in [0, T], \qquad x(0) = x_0$    (4)


where x(t) is the wind speed turbulence at time t, W(t) is a standard
Brownian motion, and a and b are constants. x(t) is a stationary
autocorrelated Gaussian diffusion process (stationarity requires
a < 0).

Fouque et al. (2000) provides a maximum likeli- 
hood estimation method. The logarithm likelihood 
function is defined as 


Jigs Mice = —F (In (b?)+1n (v(a)) 
1 n-l (5) 


| lay exp (aA)x, )) 


with v(a) = 


the estimated parameters. 


2aA —-1 
as Therefore, we can get 


n 


i=l i-l“ ti (6) 


A Èa 
5 1 wt 
ye 


-exp (@A)x, 1)? (7) 
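The closed-form estimators of Form A can be tried on synthetic data. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the sampling interval is assumed to be one time unit, and the "true" parameters are borrowed from Equation (11) only to generate test data.

```python
import math
import random

def simulate_form_a(a, b, x0, delta, n, rng):
    """Exact sampling of dx = a*x dt + b dW on a grid of step delta (assumes a != 0)."""
    v = (math.exp(2.0 * a * delta) - 1.0) / (2.0 * a)  # v(a) as in Equation (5)
    xs = [x0]
    for _ in range(n):
        xs.append(math.exp(a * delta) * xs[-1] + b * math.sqrt(v) * rng.gauss(0.0, 1.0))
    return xs

def mle_form_a(xs, delta):
    """Closed-form estimates (a_hat, b_hat) following Equations (6) and (7)."""
    pairs = list(zip(xs, xs[1:]))
    ratio = sum(x0 * x1 for x0, x1 in pairs) / sum(x0 * x0 for x0, _ in pairs)
    a_hat = math.log(ratio) / delta  # requires a positive lag-one ratio
    v_hat = (math.exp(2.0 * a_hat * delta) - 1.0) / (2.0 * a_hat)
    rss = sum((x1 - math.exp(a_hat * delta) * x0) ** 2 for x0, x1 in pairs)
    return a_hat, math.sqrt(rss / (len(pairs) * v_hat))

rng = random.Random(1)  # fixed seed for reproducibility
series = simulate_form_a(-0.0314, 0.2517, 0.0, 1.0, 20000, rng)
a_hat, b_hat = mle_form_a(series, 1.0)
print(a_hat, b_hat)  # should be close to the generating values
```

The logarithm in Equation (6) is only defined when the lag-one ratio is positive, which holds for strongly autocorrelated series such as this one.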


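Form B


A classical mean-reverting Ornstein-Uhlenbeck
process is chosen as the second one for wind speed
simulation. It has the following form:

$$dx(t) = -(x(t) - \mu)\,dt + \sigma\,dW(t), \qquad x(0) = x_0 \tag{8}$$

where $x(t)$ is the wind speed at time $t$, $\mu$ and $\sigma$ are
parameters and $W(t)$ is the standard Brownian motion.
Let $\Delta t = t_i - t_{i-1}$. According to Deng et al. (2016),
the transition density function is

$$p(x_i \mid x_{i-1}; \mu, \sigma) = \sqrt{\frac{e^{2\Delta t}}{\pi \sigma^2 \left(e^{2\Delta t} - 1\right)}} \times \exp\left(-\frac{\left(x_{i-1} - e^{\Delta t} x_i + \mu\left(e^{\Delta t} - 1\right)\right)^2}{\sigma^2 \left(e^{2\Delta t} - 1\right)}\right) \tag{9}$$

Hence, the log-likelihood function is as follows:

$$\log L(\mu, \sigma) = \sum_{i=1}^{n} \log p(x_i \mid x_{i-1}; \mu, \sigma) \tag{10}$$

The parameters $\mu$ and $\sigma$ are estimated by maximizing (10).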
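The maximization of the log-likelihood (10) can be sketched as follows. This is an illustration, not the authors' code: it writes the Gaussian transition density in the equivalent mean and variance form (mean $\mu + (x_{i-1} - \mu)e^{-\Delta t}$, variance $\sigma^2(1 - e^{-2\Delta t})/2$), for which the maximizers are available in closed form. The function names, the time step and the generating parameters (taken from Equation (12)) are assumptions made for the example.

```python
import math
import random

def transition_moments(x_prev, mu, sigma, dt):
    """Mean and variance of x_i given x_{i-1} for dx = -(x - mu) dt + sigma dW."""
    mean = mu + (x_prev - mu) * math.exp(-dt)
    var = sigma ** 2 * (1.0 - math.exp(-2.0 * dt)) / 2.0
    return mean, var

def log_likelihood(xs, mu, sigma, dt):
    """Sum of the log transition densities along the observed path."""
    total = 0.0
    for x_prev, x in zip(xs, xs[1:]):
        mean, var = transition_moments(x_prev, mu, sigma, dt)
        total += -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)
    return total

def mle_form_b(xs, dt):
    """Closed-form maximizers of the Gaussian log-likelihood above."""
    pairs = list(zip(xs, xs[1:]))
    decay = math.exp(-dt)
    mu_hat = sum(x - decay * x_prev for x_prev, x in pairs) / (len(pairs) * (1.0 - decay))
    rss = sum((x - mu_hat - (x_prev - mu_hat) * decay) ** 2 for x_prev, x in pairs)
    sigma_hat = math.sqrt(2.0 * rss / (len(pairs) * (1.0 - math.exp(-2.0 * dt))))
    return mu_hat, sigma_hat

rng = random.Random(2)  # fixed seed for reproducibility
dt = 0.1                # illustrative time step
xs = [10.0]
for _ in range(20000):
    mean, var = transition_moments(xs[-1], 10.0245, 0.6459, dt)
    xs.append(rng.gauss(mean, math.sqrt(var)))
mu_hat, sigma_hat = mle_form_b(xs, dt)
print(mu_hat, sigma_hat)
```

A generic numerical optimizer could be used instead; the closed form is chosen here only because the reversion speed in Equation (8) is fixed to one.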


3 APPLICATION 


3.1 Experimental dataset 


Long-term (up to one year) hourly average wind speed data can be
easily accessed at www.ncdc.noaa.gov/cdo-web, and www.winddata.com
provides free time series of wind characteristics measured
under different conditions during 10 minutes or
1 hour. The real data used in this paper were downloaded
from these two websites.


3.2 Wind speed generation 


Two difficulties arise when real wind speed data are used to
estimate the transition probability matrix of the Markov chain model:

• A Markov chain with a great many states could be constructed,
resulting in a huge transition matrix that might cause additional
difficulties in estimation.

• The number of elements in certain states could be much smaller
than in others, resulting in many small probabilities near 0 in the
transition probability matrix.

To avoid the former, the number of states should be kept relatively
small. The number of states can be set by the user or by some
criterion. Tang et al. (2015) recommend determining the intervals
from the empirical quantiles. Suppose that the wind speed time
series is stationary and ergodic. Then the empirical cumulative
distribution function is a consistent estimator of the cumulative
distribution of the invariant measure of this time series. The
boundaries are taken to be $F_n^{-1}(j/k)$, $j = 1, 2, \ldots, k$,
where $k$ is the number of states and $F_n$ is the
empirical cumulative distribution function.
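The quantile-based choice of state boundaries can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the Weibull sample merely stands in for real hourly wind speed data.

```python
import bisect
import random

def quantile_boundaries(speeds, k):
    """Upper boundaries F_n^{-1}(j/k), j = 1..k, read off the sorted sample."""
    ordered = sorted(speeds)
    n = len(ordered)
    # F_n^{-1}(p) is the smallest sample value x with F_n(x) >= p
    return [ordered[(j * n + k - 1) // k - 1] for j in range(1, k + 1)]

def state_of(speed, boundaries):
    """0-based index of the first interval whose upper boundary is >= speed."""
    return min(bisect.bisect_left(boundaries, speed), len(boundaries) - 1)

rng = random.Random(3)
# synthetic stand-in for 6180 hourly averages (scale 8 m/s, shape 2: assumptions)
speeds = [rng.weibullvariate(8.0, 2.0) for _ in range(6180)]
boundaries = quantile_boundaries(speeds, 8)
counts = [0] * 8
for s in speeds:
    counts[state_of(s, boundaries)] += 1
print(counts)  # states are populated almost equally
```

Because the boundaries are empirical quantiles, every state receives nearly the same number of observations, which addresses the second difficulty listed above.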

The principal steps for wind speed data generation
(shown in Figure 3) are as follows:


Step 1 Choose appropriate wind speed data for
Markov chain and diffusion process model estimation separately


Step 2 • Determine the states of the Markov chain

• Estimate the transition probability matrix
of the Markov chain using real wind speed data

• Estimate the parameters of the SDE with real
wind speed data


Step 3 Generate hourly average wind speed


Step 4 Select an SDE model and generate detailed
wind speed data inside each state

3.2.1 Generation of wind speed in macroscopic
scale using Markov chain model

A sequence of hourly average wind speed during
one year is used to estimate the transition probability
matrix of the Markov chain model. Due to data
acquisition issues, the number of available data points
is 6180 instead of 8760 (24 hours/day × 365 days).
The number of states is set to 8. With the states shown
in Table 1, we can obtain the transition probability
matrix of the Markov chain.
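Estimating the transition probability matrix from a sequence of state indices amounts to counting transitions and normalizing each row. The sketch below is illustrative: the function names are hypothetical, and filling unvisited rows with a uniform distribution is an assumption of this example, not something stated in the paper.

```python
import random

def estimate_tpm(states, k):
    """Row-normalized transition counts between consecutive states."""
    counts = [[0] * k for _ in range(k)]
    for i, j in zip(states, states[1:]):
        counts[i][j] += 1
    tpm = []
    for row in counts:
        total = sum(row)
        # unvisited state: fall back to a uniform row (assumption of this sketch)
        tpm.append([c / total if total else 1.0 / k for c in row])
    return tpm

def simulate_chain(tpm, start, n, rng):
    """Generate n successive states, e.g. one per hour."""
    path = [start]
    for _ in range(n - 1):
        path.append(rng.choices(range(len(tpm)), weights=tpm[path[-1]])[0])
    return path

rng = random.Random(4)
observed = [rng.randrange(8) for _ in range(6180)]  # placeholder state sequence
tpm = estimate_tpm(observed, 8)
hourly_states = simulate_chain(tpm, observed[0], 24, rng)
print(hourly_states)
```

With real data, `observed` would be the sequence of state indices obtained by mapping each hourly average wind speed to its interval.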


3.2.2 Generation of wind speed at one-second time
scale using SDE

As the Markov chain's states represent the hourly
average wind speed, the aim of using the SDE is to
generate wind speed data at one-second resolution during a short
period. Taking into account the common data collection
frequency of SCADA (10 minutes), a state
contains 6 groups of SDE models, which represent
the wind speed at the one-second scale during 10 minutes,
depending on the hourly average wind speed
set by the Markov chain. We classify the real wind speed
sequences according to their average values associated
with the states determined in section 2.1.


[Figure 3: flowchart. Real wind speed data feed the modelling stage:
Markov chain (states determination, TPM estimation) and SDE
(parameters estimation from high-frequency wind data).]
Figure 3. Steps for wind speed generation.

An example of wind speed generation by SDE

To illustrate the performance of the SDE wind speed
model, the parameters estimated for one state
are chosen to build the two Ornstein-Uhlenbeck
processes, with form A expressed by Equation (11)
and form B expressed by Equation (12), for $x(0) = 0$:

$$dx(t) = -0.0314\,x(t)\,dt + 0.2517\,dW(t) \tag{11}$$

$$dx(t) = -(x(t) - 10.0245)\,dt + 0.6459\,dW(t) \tag{12}$$

In Figure 4, the red line shows the probability
density of a real wind speed sequence, and the blue
lines show the probability densities of 500 samples of
simulated data generated by Equation (12).

The SDE model thus has the ability to generate continuous
wind speed.


[Figure 4: PDF of the real data set and PDFs of the simulated data
sets versus wind speed (m/s).]
Figure 4. Probability density of real data and simulated
data.

4 CONCLUSION


In this paper, a continuous wind speed generation
model based on a Markov chain and stochastic differential
equations is proposed. This model is flexible
enough to generate wind speed at different
time scales without any limitation on the time length. Its
low computational requirement is another advantage.
Two forms of SDE are studied
in this paper; other forms of SDE could be applied
to this model according to the user's requirements.
In particular, the model is suitable for being
merged with deterioration models of key wind turbine
components, such as the blade-pitch system.
The model could also be used in the analysis of
a wind turbine's power system, such as the dynamic
behaviour of the generator.


The class of life time distributions with a mean residual life linear
in time: Application to prognostics and health management

Alstom Digital Mobility, St-Ouen, France


ABSTRACT: Prognostics and Health Management (PHM) elicits increasing interest in view of its 
potential economic impact both through service-affecting failure avoidance and maintenance cost reduc- 
tion. One of the key challenges in the PHM discipline is the estimation of Remaining Useful Life (RUL), 
i.e. the time remaining until failure in the absence of maintenance. The expectation of RUL, Mean Resid- 
ual Life (MRL), has long been studied by reliability engineers, dating back to the nineteen sixties. There 
is a one-to-one relationship between MRL, reliability function and failure rate (also called hazard rate): 
each one of three functions uniquely determines the other two. Usually, MRL varies with time. We study 
here a special class of life-time distributions characterized by the fact that their MRL is a linear, non- 
increasing function of time, and highlight the central role of the time derivative of the MRL, called ageing 
rate, for PHM; and then generalize to piecewise-linear MRL. 


1 INTRODUCTION 


Since the pioneering work of (Barlow & Proschan 
1965), the notion of MRL is well known, as well as 
its relation to reliability function and failure rate. 

However, we have not seen in the literature any 
study of the class of life-time distributions which 
have a MRL that varies linearly with time, even 
though the literature on RUL is quite extensive; 
see e.g. (Si et al. 2011). 

This class of distributions is important for two 
reasons: 


— because a linear relationship is always a good 
approximation locally; more-general functions 
can be approximated by piecewise-linear ones. 

— because this class contains important, well- 
known special cases: the exponential distri- 
bution, the uniform distribution, and the 
deterministic (‘Dirac’) distribution. 


Explicit expressions for reliability function, 
MRL and failure rate are given here for this class. 
Cases of special interest are presented and the 
meaning of some relationships is discussed. A sim- 
ple relation is derived between the coefficient of 
variation of the time to failure and the derivative 
of MRL with respect to time (which we call the 
rate of ageing). A confidence interval for the RUL 
is derived, in terms of this rate of ageing; the width 
of the confidence interval decreases as the rate of 
ageing grows from 0 (corresponding to the expo- 
nential distribution failure) to 1 (corresponding 
to the ‘Dirac’ deterministic case). Potential use of 
those results for PHM, and their generalization to 


piece-wise linear MRL, are discussed. In Section 2, 
a reminder of the key relationships between MRL, 
reliability function and failure rate is provided. 
In Section 3, expressions for those quantities are 
derived for the family of lifetime distribution with 
MRL linear with time, and special cases are iden- 
tified. In Section 4, a confidence interval for the 
RUL is provided, in that family. In Section 5 an 
explicit relation is given between the rate of ageing 
and the coefficient of variation of the time to fail- 
ure. Potential use in PHM and generalizations to 
the family of lifetime distributions with piecewise- 
linear MRL are discussed in Section 6. Finally, 
Section 7 presents the conclusions, followed by 
bibliographical references in Section 8. 


2 REMINDER OF KEY RELATIONSHIPS 


2.1 Mean residual life, reliability function and 
failure rate 


If T denotes the time to failure of an item not 
subject to maintenance, the reliability function is 
defined by: 


$$R(t) = P[T > t], \quad t \ge 0; \qquad R(0) = 1, \quad \lim_{t \to \infty} R(t) = 0 \tag{1}$$


Then the Mean Residual Life (MRL) at time t
is the expectation of the remaining time to failure
from time t under the condition that no failure
has taken place until time t. The following relation
results directly from the definition. In what follows,
the notation V(t) will be used for MRL(t).

$$V(t) = \frac{1}{R(t)} \int_t^{\infty} R(s)\,ds \tag{2}$$


From Equation 2 it is seen that, for t = 0,
MRL(0) = MTTF, i.e. the a priori value for the
mean residual life is the mean time to failure. The
failure rate $\lambda(t)$ (sometimes called hazard rate,
especially in safety contexts), on the other hand,
is given by:

$$\lambda(t) = -\frac{1}{R(t)} \frac{dR(t)}{dt} \tag{3}$$


so that R(t) can be derived from it by integration
(see e.g. Birolini 2017). Finally, under simple regularity
conditions it can be shown (Swartz 1973)
that the reliability function R(t) can be derived
directly from the MRL function:

$$R(t) = \frac{V(0)}{V(t)} \exp\left(-\int_0^t \frac{dx}{V(x)}\right) \tag{4}$$


Therefore, any one of those three functions 
determines the other two. 


2.2 Differential equation formulation 


There is also a simple relationship between the 
MRL and the failure rate which takes the form of 
a differential equation: 


$$\frac{dV(t)}{dt} = -1 + \lambda(t)\,V(t) \tag{5}$$


Equation 5 can be derived immediately from
Equation 2 and Equation 3 (Watson & Wells 1961).
When Equation 5 is written in differential form:

$$dV(t) = -dt + \lambda(t)\,V(t)\,dt \tag{6}$$


its interpretation provides an interesting insight:
usually

$$\frac{dV(t)}{dt} \le 0 \tag{7}$$


as the mean residual life does not increase with
time.
Equivalently,

$$\lambda(t)\,V(t) \le 1 \tag{8}$$


The loss in mean residual life during a time span dt
is given by:

$$dV(t) = -\left(1 - \lambda(t)\,V(t)\right) dt \tag{9}$$


Therefore it is usually less than the elapsed time
dt.

The limiting case $\lambda V = 1$ corresponds
to a distribution for which V(t) = MTTF is constant
with time, which is the exponential distribution,
also characterized by the property that its
failure rate $\lambda$ is constant in time.

Moreover, Equation 5 then yields:

$$\lambda(t) = \frac{1}{MRL(0)} = \frac{1}{MTTF} \tag{10}$$


a relation valid only in the exponential distribution
case.
This property is the 'no aging' property of the
exponential distribution in the reliability context.
More generally, the next section studies
the class of time-to-failure distributions for
which the MRL is a linear function of time, i.e.

$$\frac{dV(t)}{dt} = -k \tag{11}$$

where the constant k will be called the ageing rate.

It is assumed that $0 \le k \le 1$ so as to ensure,
according to Equation 11, that

$$-1 \le \frac{dV(t)}{dt} \le 0 \tag{12}$$

i.e. the mean residual life either is constant or decreases
with time, but no faster than time (in a time
interval $\Delta T$, the item does not lose more than an
amount $\Delta T$ of its remaining life). Note that k is
dimensionless.

The case k = 0 corresponds to the limiting case
of the exponential distribution.


3 THE FAMILY OF LIFE DISTRIBUTIONS 
WITH MRL LINEAR IN TIME 


3.1 General case 


We study the family of time-to-failure distributions 
characterized by the property that their mean resid- 
ual life MRL(t) is a linear non-increasing function 
of time or, equivalently, it satisfies Equation 11. 

For convenience, from now on, the MTTF will
be denoted by the symbol $\mu$. Thus:

$$V(0) = MTTF = \mu \tag{13}$$


From Equation 11 and Equation 13, we obtain

$$V(t) = \mu - kt \tag{14}$$


The range of the time variable t is the interval
$[0, \mu/k]$, so that

$$0 \le V(t) \le \mu \tag{15}$$


(In the limiting case of the exponential distribution,
k is allowed to vanish asymptotically, and the
range of the time variable becomes the nonnegative
real half line).

From Equation 5 and Equation 14, it follows
that the failure rate of such a distribution is given by:

$$\lambda(k, t) = \frac{1 - k}{\mu - kt}, \quad \text{for } 0 \le t < \frac{\mu}{k} \tag{16}$$


From Equation 3 the reliability function,
denoted from now on D(k,t), can be derived as

$$D(k, t) = \exp\left(-\int_0^t \frac{1 - k}{\mu - ks}\,ds\right) \tag{17}$$

which yields:

$$D(k, t) = \left(\frac{\mu - kt}{\mu}\right)^{\frac{1}{k} - 1}, \quad \text{for } 0 \le t < \frac{\mu}{k} \tag{18}$$


and D(k,t) = 0 for $t \ge \mu/k$.

Thus Equations 14, 16, 18 characterize the one-parameter
family of probability distributions, with
parameter k ranging over the interval [0,1].
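The consistency of Equations 14, 16 and 18 can be checked numerically by recomputing the MRL from the reliability function through Equation 2. The sketch below is illustrative; the function names and the midpoint-rule integration are assumptions of this example.

```python
import math

def mrl(t, mu, k):
    """Equation 14: mean residual life, linear in time."""
    return mu - k * t

def hazard(t, mu, k):
    """Equation 16: failure rate, valid for 0 <= t < mu/k."""
    return (1.0 - k) / (mu - k * t)

def reliability(t, mu, k):
    """Equation 18, with the exponential limit for k = 0."""
    if k == 0:
        return math.exp(-t / mu)
    if t >= mu / k:
        return 0.0
    return (1.0 - k * t / mu) ** (1.0 / k - 1.0)

def mrl_numeric(t, mu, k, n=20000):
    """Equation 2 evaluated with the midpoint rule on [t, mu/k] (requires k > 0)."""
    h = (mu / k - t) / n
    integral = sum(reliability(t + (i + 0.5) * h, mu, k) for i in range(n)) * h
    return integral / reliability(t, mu, k)

# MTTF = 100 h, ageing rate k = 0.6, evaluated at t = 50 h
print(mrl(50.0, 100.0, 0.6), mrl_numeric(50.0, 100.0, 0.6))
```

Both values should agree closely, since inserting Equation 18 into Equation 2 gives back $\mu - kt$ exactly.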


3.2 Special cases 


3.2.1 Exponential distribution 

If the parameter k is allowed to go to 0, the range
of the variable t becomes $(0, \infty)$.
The following limit is obtained from Equation 16:

$$\lim_{k \to 0} \lambda(k, t) = \frac{1}{\mu} \tag{19}$$


It can also be verified from Equation 18 that:

$$\lim_{k \to 0} D(k, t) = e^{-t/\mu} \tag{20}$$


Those properties characterize the exponential
distribution.

As seen previously and confirmed from Equation 14,
the mean residual life is then constant in
time: $V(t) = \mu = MTTF$.


3.2.2 Dirac distribution 
At the other extreme, if k = 1, the evolution of the
mean residual life over time is characterized by:

$$V(t) = \mu - t, \quad \text{for } 0 \le t \le \mu \tag{21}$$

which means that, over any time interval $\Delta t$, the
loss of mean residual life is precisely equal to $\Delta t$.
Equation 16 shows that the failure rate remains
equal to zero as long as $t < \mu$ and jumps to infinity as
t approaches the life limit $\mu$. This is best seen by first
taking $k = 1 - \varepsilon$ with $\varepsilon$ close to 0 and
letting t go to $\mu/k$.
Likewise the reliability function has a discontinuity
at $t = \mu$, where it jumps from 1 to 0, which
corresponds to a deterministic lifetime equal to $\mu$.
It can also be verified that, since the failure rate is zero
for $t < \mu$, the right-hand side of Equation 6 is always
equal to $-dt$.


3.2.3 Uniform distribution 
The intermediate case $k = 1/2$ defines the uniform
distribution over the range $(0, 2\mu)$.

Indeed, from Equation 18, the corresponding
reliability function is given by:

$$D\left(\tfrac{1}{2}, t\right) = 1 - \frac{t}{2\mu}, \quad 0 \le t \le 2\mu \tag{22}$$


The mean residual life (Eq. 14) is given by

$$V(t) = \mu - \frac{t}{2}, \quad 0 \le t \le 2\mu \tag{23}$$


which states that, over a time interval $\Delta t$, the loss
of mean residual life is equal to $\Delta t / 2$.
The failure rate is given, as a function of time, by:

$$\lambda\left(\tfrac{1}{2}, t\right) = \frac{1}{2\mu - t} \tag{24}$$
3.2.4 Synopsis
The results just derived are summarized in Table 1.

Table 1. Family of life distributions with MRL linear
in time.

Value of k    | Range of t    | D(k;t)                  | V(t)        | λ(t)
0 < k < 1     | $[0, \mu/k)$  | $(1 - kt/\mu)^{1/k - 1}$ | $\mu - kt$  | $(1 - k)/(\mu - kt)$
0 (exp.)      | $[0, \infty)$ | $e^{-t/\mu}$            | $\mu$       | $1/\mu$
1/2 (uniform) | $[0, 2\mu]$   | $1 - t/(2\mu)$          | $\mu - t/2$ | $1/(2\mu - t)$
1 (Dirac)     | $[0, \mu]$    | 1 for $t < \mu$         | $\mu - t$   | 0 for $t < \mu$


4 CONFIDENCE INTERVAL FOR RUL 


It is now possible to obtain the residual life-time 
distribution, i.e. the probability distribution of 
the Remaining Useful Life (RUL), or time to the 
next failure, under the condition that no failure has 
taken place up to time t, for the class of lifetime 
distributions under study. Note that the concept 
of ‘failure’ is understood in a broad sense and can 
also apply to pseudo-failures, such as the crossing 
of a threshold by a health indicator. 


First let us rewrite Equation 18 slightly
differently:

$$D(k, t) = \left(1 - \frac{kt}{\mu}\right)^{\frac{1}{k} - 1} \tag{25}$$


The probability distribution of the remaining
useful life (RUL) at time t can be characterized by
the conditional survival function. Let us denote by
$D_t(s)$ the probability that the item survives at least
for a duration s given that it has lived (without failure)
up to time t. Then:

$$D_t(s) = \frac{D(k; t + s)}{D(k; t)} \tag{26}$$

Or, according to Equation 25,

$$D_t(s) = \left(\frac{1 - \dfrac{k(t + s)}{\mu}}{1 - \dfrac{kt}{\mu}}\right)^{\frac{1}{k} - 1} \tag{27}$$

Or

$$D_t(s) = \left(1 - \frac{ks}{\mu - kt}\right)^{\frac{1}{k} - 1} \tag{28}$$

for $0 < k < 1$; $0 \le t < \mu/k$.

Let us now determine the confidence interval of
the RUL at time t, with confidence level $1 - \alpha$.

To that end, define $s^+$ and $s^-$ by:

$$D_t(s^+) = \frac{\alpha}{2} \tag{29}$$

$$D_t(s^-) = 1 - \frac{\alpha}{2} \tag{30}$$


Then, if T denotes the random variable "survival
time to failure after time t", the conditional probability
of survival for a time between $s^-$ and $s^+$ is:

$$P\left(t + s^- < T < t + s^+ \mid T > t\right) = 1 - \alpha \tag{31}$$


i.e., the interval $(t + s^-, t + s^+)$ is the confidence
interval with confidence level $1 - \alpha$.

In other words, the failure time lies between $t + s^-$ and $t + s^+$
(equivalently, $s^- < RUL < s^+$) with probability
$1 - \alpha$. Equations 28-30 determine the confidence
bounds. Consequently, from Equations 28-29 one
obtains:

$$s^+ = \left(\frac{\mu}{k} - t\right)\left[1 - \left(\frac{\alpha}{2}\right)^{\frac{k}{1 - k}}\right] \tag{32}$$


Similarly, from (28) and (30), one obtains:

$$s^- = \left(\frac{\mu}{k} - t\right)\left[1 - \left(1 - \frac{\alpha}{2}\right)^{\frac{k}{1 - k}}\right] \tag{33}$$


[Figure 1a: MRL with upper and lower confidence bounds versus time (h).]
Figure 1a. Confidence interval for RUL for k = 0.6
(MTTF = 100 h).

[Figure 1b: MRL with upper and lower confidence bounds versus time (h).]
Figure 1b. Confidence interval for RUL for k = 0.9
(MTTF = 100 h).

Figure 2. Width of confidence interval versus ageing
rate k. (CI calculated at t = 50 h, for MTTF = 100 h).


Confidence intervals are illustrated in Figure 1:
in Fig. 1a for k = 0.6 and in Fig. 1b for k = 0.9,
where the confidence interval is narrower at any time.

For k = 1 (deterministic case), it is verified that
the confidence interval reduces to a point (no
uncertainty) and, for k = 0, the known result for
the exponential distribution is found:

$$s^+ = -\mu \ln\left(\frac{\alpha}{2}\right) \tag{34}$$

$$s^- = -\mu \ln\left(1 - \frac{\alpha}{2}\right) \tag{35}$$


Indeed the width $(s^+ - s^-)$ of the confidence
interval decreases as k grows from 0 to 1, as illustrated
in Fig. 2.

Thus, the faster the asset ages, the narrower the
confidence interval for the RUL is. This property is
intuitive. For instance, in the limit of deterministic
ageing (k = 1), the RUL is known with certainty and
the failure rate jumps from 0 to infinity at t = RUL.
In the other extreme (k = 0, exponential distribution),
there is no ageing and the confidence interval is the
widest and stays constant with time (Eq. 34-35).
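The confidence bounds of Equations 32-35 are straightforward to evaluate. The following sketch is illustrative: the function names and the choice $\alpha = 0.1$ are assumptions of this example (the paper does not state the confidence level used in its figures).

```python
import math

def conditional_survival(s, t, mu, k):
    """Equation 28: probability of surviving a further duration s beyond t (0 < k < 1)."""
    return (1.0 - k * s / (mu - k * t)) ** (1.0 / k - 1.0)

def rul_bounds(t, mu, k, alpha):
    """(s_minus, s_plus) from Equations 32-33, or Equations 34-35 when k = 0."""
    if k == 0:
        return -mu * math.log(1.0 - alpha / 2.0), -mu * math.log(alpha / 2.0)
    scale = mu / k - t
    power = k / (1.0 - k)  # requires k < 1
    s_minus = scale * (1.0 - (1.0 - alpha / 2.0) ** power)
    s_plus = scale * (1.0 - (alpha / 2.0) ** power)
    return s_minus, s_plus

# setting of Figure 2: t = 50 h, MTTF = 100 h; alpha = 0.1 is an assumption
widths = []
for k in (0.0, 0.3, 0.6, 0.9):
    lo, hi = rul_bounds(50.0, 100.0, k, 0.1)
    widths.append(hi - lo)
print(widths)  # the interval narrows as the ageing rate k grows
```

Substituting the bounds back into `conditional_survival` returns $1 - \alpha/2$ and $\alpha/2$, which is a convenient sanity check of the algebra.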


5 RELATION BETWEEN AGING RATE K 
AND COEFFICIENT OF VARIATION OF 
TIME TO FAILURE 


The probability density function f(k;t) for the
time to failure is obtained from Equation 16 and
Equation 18:

$$f(k; t) = \lambda(k; t)\,D(k; t) \tag{36}$$

$$f(k; t) = \frac{1 - k}{\mu - kt}\left(\frac{\mu - kt}{\mu}\right)^{\frac{1}{k} - 1} \tag{37}$$

Or equivalently,

$$f(k; t) = \frac{1 - k}{\mu}\left(1 - \frac{kt}{\mu}\right)^{\frac{1}{k} - 2} \tag{38}$$


It follows that the variance of the time to failure
is given by:

$$\sigma^2 = \int_0^{\mu/k} \frac{1 - k}{\mu}\left(1 - \frac{kt}{\mu}\right)^{\frac{1}{k} - 2} (t - \mu)^2\,dt \tag{39}$$


This integral can be calculated in closed form,
which results in:

$$\left(\frac{\sigma}{\mu}\right)^2 = \frac{1 - k}{1 + k} \tag{40}$$

an expression for the coefficient of variation $\sigma/\mu$ as
a function of the ageing rate k. Equivalently:

$$k = \frac{1 - \left(\dfrac{\sigma}{\mu}\right)^2}{1 + \left(\dfrac{\sigma}{\mu}\right)^2} \tag{41}$$

Equations 40 and 41 show that the coefficient
of variation of the time to failure decreases from
1 (exponential distribution) to 0 (deterministic
case) as the ageing rate increases from 0 to 1.


For instance, the uniform distribution (k = 1/2)
corresponds to $\sigma/\mu = 1/\sqrt{3}$, as is well known.
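The relation between ageing rate and coefficient of variation can be checked by simulation. The sketch below is illustrative: it samples times to failure by inverting the reliability function (setting $D(k,T) = U$ with $U$ uniform gives $T = (\mu/k)(1 - U^{k/(1-k)})$), and all function names, the sample size and the seed are assumptions of this example.

```python
import math
import random
import statistics

def cv_from_k(k):
    """Equation 40: coefficient of variation as a function of the ageing rate."""
    return math.sqrt((1.0 - k) / (1.0 + k))

def k_from_cv(cv):
    """Equation 41: ageing rate recovered from the coefficient of variation."""
    return (1.0 - cv ** 2) / (1.0 + cv ** 2)

def sample_ttf(mu, k, rng):
    """Inverse-transform sample of the time to failure, for 0 < k < 1."""
    u = rng.random()
    return (mu / k) * (1.0 - u ** (k / (1.0 - k)))

rng = random.Random(5)  # fixed seed for reproducibility
times = [sample_ttf(100.0, 0.5, rng) for _ in range(200_000)]
sample_cv = statistics.pstdev(times) / statistics.fmean(times)
print(sample_cv, cv_from_k(0.5))  # both close to 1/sqrt(3)
print(k_from_cv(sample_cv))       # estimate of k from the observed CV
```

The last line mirrors the use suggested for Equation 41: an estimate of the coefficient of variation yields an estimate of k.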


6 POTENTIAL USE FOR PHM AND 
GENERALIZATION 


6.1 Use for RUL estimation 


In PHM applications, one of the key concerns is to 
estimate the RUL of a monitored asset on the basis 
of past observations. If, for some reason, based 
for instance on physics, it is believed that MRL 
decreases linearly with time, then Equations 32-33 
can be used to determine a confidence interval for 
the RUL at any time t. To that end, it is necessary 
to estimate the ageing rate k. If field data or simulation
data are available that represent failure times
or pseudo-failure times (a pseudo-failure may be
defined as the crossing of a threshold by a health
indicator, for instance), then those data can be
used in order to obtain an MLE (Maximum Likelihood
Estimator) of k, or a Bayesian estimator, on
the basis of Equation 38. If on the other hand an
estimate of the coefficient of variation of time-to-failure
(or time-to-pseudo-failure) can be obtained
from field data or simulations, then Equation 41
can be used in order to estimate k.

However, in general, the assumption of linear 
evolution of MRL with time may be unrealistic. 

Below it is now shown that the results demon- 
strated so far can be generalized to the case of a 
MRL which is a piecewise linear function of time. 

Therefore the generality of the method is con- 
siderably extended, as any continuous function can 
be approximated by a piecewise-linear function. 


6.2 Generalization 


Let us consider the case of a MRL consisting of 
two linear segments, as in Figure 3; it is then pos- 
sible to generalize to any number of segments. 

In that example, the MRL has a change of slope
at $T_1$ = 40 h. It is defined as follows:

$$\text{For } 0 \le t \le T_1, \quad V(t) = \mu - k_1 t \tag{42}$$

$$\text{For } T_1 < t \le T_2, \quad V(t) = \mu - k_1 T_1 - k_2 (t - T_1) \tag{43}$$

where $V(T_2) = 0$.

Therefore the parameters are linked by the relation
$\mu - k_1 T_1 - k_2 (T_2 - T_1) = 0$.

In the example of Figure 3, ageing accelerates
after $T_1$ = 40 h. The slopes are $k_1$ = 0.5 before $T_1$
and $k_2$ = 0.9 after $T_1$, and $T_2$ = 133.33 h.

From Equation 42, the reliability function is
obtained, using Equation 4.

In this example,

$$\text{For } 0 \le t \le T_1, \quad D(k_1; t) = \left(1 - \frac{k_1 t}{\mu}\right)^{\frac{1}{k_1} - 1} \tag{44}$$

And, for $T_1 < t \le T_2$,

$$D(k_1, k_2; t) = \left(1 - \frac{k_1 T_1}{\mu}\right)^{\frac{1}{k_1} - 1} \left(1 - \frac{k_2 (t - T_1)}{\mu - k_1 T_1}\right)^{\frac{1}{k_2} - 1} \tag{45}$$


[Figure 3: plot of V(t) versus time.]
Figure 3. Piecewise linear MRL (2 segments).


From those expressions, it is possible to derive a
confidence interval for the RUL in the same manner
as was done in Section 4 for the linear case, i.e.
from the conditional survival function.

Similarly, the probability density can be derived
from Equation 44 and Equation 45 and therefore
the MLE of parameters $k_1$ and $k_2$ can be
obtained.

Increases in the ageing parameter k should alert 
the maintainer as they are reasonably correlated 
with reduction in the RUL. 

Generalization to a piecewise linear function 
with any number of segments (instead of two) is 
relatively straightforward. 


7 CONCLUSIONS AND SUGGESTIONS 
FOR FUTURE RESEARCH 


Prognostics methods usually fall into the following
categories: physics-of-failures analysis, data-driven
methods, and fusion methods which combine both.

Unlike traditional reliability engineering, PHM 
focuses on individual assets rather than on a popu- 
lation of assets. However, the author contends 
that some methods of reliability engineering can 
be of use to PHM and can complement the other 
approaches, in particular for the elaboration of 
maintenance policies at fleet level. The recent work 
by (Huynh et al. 2017) points in that direction. 

In this paper, a class of time-to-failure distribu- 
tions characterized by a mean residual life evolv- 
ing linearly in time has been studied, to derive 
confidence intervals for the RUL of assets whose 
time-to-failure follows such distributions. 

Despite the fact that that class contains very 
important representatives such as the exponen- 
tial distribution, the uniform distribution and the 
Dirac distribution, we have not found such a study 
in the literature. 

The relationship between the ageing rate and 
the coefficient of variation of the time to failure 
is interesting to the extent that it exhibits an inti- 
mate link between the variability (measured by the 
coefficient of variation) and the rate of ageing: 
the higher the former, the lower the latter. It also
allows one to be estimated from the other.

The results have also been generalized to dis- 
tributions with piecewise linear MRL, which can 
approach any distribution if enough segments 
are considered. The challenge then will consist of 
identifying the ‘knees’, i.e. the points of change of 
slope in the MRL. Methods such as trend filtering
(Tibshirani 2014), for instance, could be used for
that purpose. It would be of interest to investigate 
how coefficient of variation of time to failure and 
ageing rate are linked in that general case; and to 
study in that respect the coefficient of variation of 
the RUL rather than that of the time to failure. 

Recent contributions (Atamuradov et al. 2017) 
have highlighted the importance of changes in 
coefficient of variation of extracted features in 
detecting faults and in segmenting time series for 
health assessment and prognostics. 

Since useful health indicators normally vary 
monotonically with RUL, changes in their coef- 
ficient of variation could be used as proxies for 
changes in that of the RUL. 

In conclusion, the author believes that PHM
stands to gain from cross-fertilization with traditional
reliability engineering.


ABSTRACT: In many applications, detection methods are implemented to monitor systems. Detectors
may be built upon an analytical model of the system behavior, or based on recorded data. Recently, much
effort has been put into developing methods that make it possible to transfer a detector from an assessed
system to a new one when the two systems are similar. These so-called transfer learning approaches have
proved efficient when dealing with a fleet composed of different but similar systems. For the detector of
each model of system, a tradeoff between false alarm and non-detection must be chosen. One question
that has not received much attention up to now is the tuning of these detection systems. The common
practice is to tune the detection system associated with each model individually. In this communication we
tackle the problem of the global efficiency of this practice. How can we tune all those detection systems
together so that the group of detectors satisfies some performance constraint? The formalization of the
problem is introduced, discussed and commented on, based on detector ROC curves. Three toy examples are
used to show the gain one could expect from joint optimization of detectors and to illustrate optimization
issues. Example results show that the non-detection probability can easily be reduced by up to 50%. In
conclusion, several extensions of this work are discussed.


1 INTRODUCTION 


1.1 Context 


In many applications, detection methods are
implemented to monitor systems. The aim of these
detectors is basically to raise an alarm when the
behavior of the monitored system departs from its regular
behavior. Detectors may be built upon an analytical
model of the system behavior (Tartakovsky et al.
2015), or based on recorded data (Pimentel et al.
2014). Recently, much effort has been put into developing
methods that make it possible to transfer a detector
from an assessed system to a new one when the
two systems are similar (He et al. 2014), (West et
al. 2007), (Evgeniou et al. 2005). These so-called
transfer learning approaches have proved efficient
when dealing with a fleet composed of
different but similar systems. The transfer learning
idea is to adapt a decision system trained on a
source system to a new target system using a limited
amount of data coming from the target system.
The adaptation process makes it possible to keep most
of the information coming from the source system,
and so to achieve good performance with few data. An
informal way to put it would be to consider transfer
learning as a calibration operation of a trained
detector on a target system. To give an example of
such a fleet, we can consider car engines in the case
of K car models sharing the same engine. Since the
models differ, the monitoring system may slightly
differ from one model to another. This leads to defining
one detector $D_i$ (i = 1 to K) with its specific tuning
parameters for each car model.

For each detector, and so for each model, a tradeoff
between false alarm and non-detection must be
chosen. One question that has not received much
attention up to now is the tuning of these detection
systems $D_i$. A quite common practice is to tune the
detection system associated with each model individually,
so that each individual detector satisfies
the global targeted performance constraint.


1.2 Considered problem 


In this communication we tackle the problem of 
the global efficiency of this practice. Can we tune 
all the detection systems together so that the group 
of detectors satisfies some performance con- 
straint? Does it make sense to tune all the detectors 
together? Is it feasible? And finally can we expect 
gain from such an approach? 


2 FORMULATION OF THE PROBLEM 


Consider K systems and the corresponding K
detectors. Each detector makes decisions according
to measurements x between two hypotheses: $H_0$, the
system is working fine, and $H_1$, the system experiences
a problem. Measurements x are considered
as realizations of a random variable X. Without
loss of generality, we can assume that each detector
$D_i$ takes decisions depending on a function of
the measurements $d_i(x)$. Each detector compares
the output value of the function $d_i(x)$ with a tuned
threshold $t_i$, which is equivalent to deciding according
to:

$$D_i(x) = \mathrm{sign}\left(d_i(x) - t_i\right). \tag{1}$$


$H_1$ is chosen and an alarm is set when $D_i(x) = 1$.

In this context the thresholds $t_i$ must be tuned
according to some criteria. Criteria are usually
related to performance constraints. A very well-known
performance constraint setting is the Neyman-Pearson
framework, which is adopted here
(DeGroot and Schervish 2002), (Neyman and
Pearson 1933). So detectors are tuned according to
a false alarm rate.

We assume that the owner of these systems
considers that a global acceptable false alarm rate
is $\alpha_c$, so systems must be tuned according to this
objective.

A usual means of meeting this objective is to tune
each model so that its false alarm rate is $\alpha_i$,

$$P_{fa}^i = \alpha_i \tag{2}$$

so that

$$P_{fa}^i = \int_{\{x:\, d_i(x) > t_i\}} f_i(x)\,dx \tag{3}$$

where $f_i(x)$ is the probability density function of X
for system i (under $H_0$).

A basic solution consists in choosing $\alpha_i = \alpha_c$
for all systems i.

By doing so we ensure that the global performance,
considering all models, named $P_{fa}$, is $\alpha_c$:

$$P_{fa} = \sum_{k=1}^{K} P_k\,P_{fa}^k \tag{4}$$


where $P_k$ is the a priori probability of using
system k.

This solution is valid, but it is one very specific
setting among many possible ones. One could allow
a larger false alarm rate on some models and a smaller
one on others and still respect the constraint. As in
the Neyman-Pearson problem, we consider that the best
solution is the one that respects the constraint and
that maximizes the detection power of the global
system $P_d$:

$$P_d = P(D(X) = 1 \mid H_1) \tag{5}$$

$$\phantom{P_d} = \sum_{k=1}^{K} P_k\,P(D_k(X) = 1 \mid H_1) \tag{6}$$


So the aim is to find $(t_1, t_2, \ldots, t_K)^*$ that maximize
$P_d$ under the constraint $P_{fa} \le \alpha_c$.

3 OPTIMIZATION 


The optimization problem we want to solve is the following:

$$\max_{(t_1, \ldots, t_K)} P_d \quad \text{with } P_{fa} = \alpha_G. \qquad (7)$$

This problem could be solved with a Lagrangian formulation as in (Grall-Maes and Beauseroy 2009), but this would lead to a quite complex optimization problem for at least two reasons. First, a change of a threshold $t_i$ has an indirect effect on both $P(D_i(X) = 1 \mid H_1, t_i)$ and $P(D_i(X) = 1 \mid H_0, t_i)$. To express the dual Lagrangian problem one would need the partial derivatives of these probabilities with respect to $t_i$ for all $i$. Such expressions cannot be computed in the case of trained detectors, and in the case of a toy problem, as in this paper, the formulation would be rather complex even for simple laws on $X$. The second reason is that the function $P_d$ may be non-concave, leading to local optima.


3.1 Proposed approach 


To tackle this optimization problem we propose a rather simple strategy: first find a feasible solution, then select two detectors $l$ and $m$ and change their settings so that their joint probability of false alarm $P_{fa_{lm}}$ does not change:

$$P_l P_{fa_l} + P_m P_{fa_m} = P_{fa_{lm}} \qquad (8)$$

and so that $P_{d_{lm}}$ is maximized:

$$P_{d_{lm}} = P_l P_{d_l} + P_m P_{d_m} \qquad (9)$$

This treatment is repeated with a new couple of detectors until the detection probability cannot be improved.


3.2 Tuning two detectors 


To implement the proposed approach, for a given couple of detectors $(l, m)$ one needs to change their thresholds $t_l$ and $t_m$ simultaneously so that equation 8 is satisfied.

So if $t_l$ is changed to $t_l'$ then $P_{fa_l}$ is changed to $P_{fa_l}'$, and the threshold $t_m'$ that satisfies the relation

$$P_{fa_m}' = \frac{P_{fa_{lm}} - P_l P_{fa_l}'}{P_m} \qquad (10)$$

has to be found.

The impact of these changes can be obtained directly using the ROC curves of the two detectors, since each ROC curve establishes a relation:

$$P_{d_i} = g_i(P_{fa_i}) \qquad (11)$$

The ROC function $g_i(\cdot)$ is a non-decreasing function.

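To maximize the detection probability for this couple, we search for a point that maximizes:

$$P_{d_{lm}} = P_l\, g_l(P_{fa_l}) + P_m\, g_m(P_{fa_m}). \qquad (12)$$

Combining with equation 10, we obtain

$$P_{d_{lm}} = P_l\, g_l(P_{fa_l}) + P_m\, g_m\!\left(\frac{P_{fa_{lm}} - P_l P_{fa_l}}{P_m}\right). \qquad (13)$$

Differentiating with respect to $P_{fa_l}$, we obtain:

$$\frac{dP_{d_{lm}}}{dP_{fa_l}} = P_l \frac{dg_l}{dP_{fa_l}}(P_{fa_l}) - P_m \frac{P_l}{P_m} \frac{dg_m}{dP_{fa_m}}\!\left(\frac{P_{fa_{lm}} - P_l P_{fa_l}}{P_m}\right), \qquad (14)$$

which simplifies to:

$$\frac{dP_{d_{lm}}}{dP_{fa_l}} = P_l \left[\frac{dg_l}{dP_{fa_l}}(P_{fa_l}) - \frac{dg_m}{dP_{fa_m}}\!\left(\frac{P_{fa_{lm}} - P_l P_{fa_l}}{P_m}\right)\right]. \qquad (15)$$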

This formulation gives a means to implement a gradient algorithm, increasing $P_{fa_l}$ if the gradient is positive and decreasing it otherwise. $P_{fa_m}$ is then set according to equation (10). The optimum corresponds either to an extreme point or to a point that cancels the derivative (equation 15). This formulation makes it possible to optimize the local detection probability given by (9) under the constraint (8) with respect to the detectors' false alarm rates instead of the thresholds (the thresholds may be deduced afterward).
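The pairwise tuning step can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the authors' implementation: the power-law ROC curves `g(p) = p**a`, the function names, the fixed step size and the stopping rule are all hypothetical choices made for the example.

```python
def roc(p, a):
    """Hypothetical concave ROC curve g(p) = p**a, with 0 < a <= 1 (assumption)."""
    return p ** a


def roc_slope(p, a):
    """Derivative dg/dp of the hypothetical ROC curve."""
    return a * p ** (a - 1.0)


def tune_pair(P_l, P_m, a_l, a_m, pfa_lm, pfa_l,
              step=1e-3, n_iter=20000, eps=1e-6):
    """Gradient ascent on pfa_l; pfa_m always follows from equation (10)."""
    # Admissible interval for pfa_l so that pfa_m stays within [eps, 1].
    low = max(eps, (pfa_lm - P_m) / P_l)
    high = min(1.0, (pfa_lm - P_m * eps) / P_l)
    pfa_l = min(max(pfa_l, low), high)
    for _ in range(n_iter):
        pfa_m = (pfa_lm - P_l * pfa_l) / P_m                          # equation (10)
        grad = P_l * (roc_slope(pfa_l, a_l) - roc_slope(pfa_m, a_m))  # equation (15)
        new = min(max(pfa_l + step * grad, low), high)
        if abs(new - pfa_l) < 1e-12:   # no further progress
            break
        pfa_l = new
    pfa_m = (pfa_lm - P_l * pfa_l) / P_m
    p_d = P_l * roc(pfa_l, a_l) + P_m * roc(pfa_m, a_m)               # equation (12)
    return pfa_l, pfa_m, p_d


# Two equiprobable systems, both initially tuned to a false alarm rate of 0.1.
pfa_l, pfa_m, p_d = tune_pair(0.5, 0.5, a_l=0.1, a_m=0.5, pfa_lm=0.1, pfa_l=0.1)
baseline = 0.5 * roc(0.1, 0.1) + 0.5 * roc(0.1, 0.5)
print(pfa_l, pfa_m, p_d, baseline)
```

With these assumed curves the joint false alarm probability of equation (8) is preserved at every iteration, and the pair's detection probability ends above the value obtained with both detectors tuned to 0.1.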


3.3. Comments 


In order to speed up the convergence the idea is to 
choose the couple (/, m) so that the absolute value of 
the gradient is as large as possible at each iteration, 
meaning that the optimization focuses on the cou- 
ple that locally brings the largest improvement. 
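The pair selection rule described above can be sketched as follows. This is a hypothetical helper, assuming that the ROC slopes of all detectors at their current tuning are available in a list `slopes`; at the current point the constraint (8) holds, so the gradient of equation (15) for a couple $(l, m)$ reduces to $P_l$ times the difference of the two slopes.

```python
from itertools import permutations


def pair_gradient(priors, slopes, l, m):
    """Equation (15) evaluated at the current tuning: P_l * (g_l' - g_m')."""
    return priors[l] * (slopes[l] - slopes[m])


def select_pair(priors, slopes):
    """Return the ordered couple (l, m) with the largest absolute gradient."""
    k = len(priors)
    return max(permutations(range(k), 2),
               key=lambda lm: abs(pair_gradient(priors, slopes, *lm)))


# Hypothetical slopes of four ROC curves at their current operating points.
print(select_pair([0.25, 0.25, 0.25, 0.25], [3.0, 1.0, 2.0, 0.5]))
```

With equal priors the rule simply picks the two detectors whose ROC slopes differ the most; with unequal priors the factor $P_l$ also matters, which is why ordered couples are enumerated.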

It must be mentioned that the proposed alternate 
gradient approach can lead to a global optimum 
only if the starting point is located in a convex 
neighborhood of the optimal solution. Otherwise 
the algorithm converges to a local optimum. 

To reach global optimum we propose to initiate 
the optimization process with different initial solu- 
tions when the number of detectors is large. When 
it is not too large a random optimization method 
can be applied without big computing time loss. 

The experimental study that follows illustrates the optimization process for two detectors and discusses the impact of the priors on the solution. More complex cases are tested with four detectors, and the non-convex case is illustrated with two examples.


4 EXPERIMENTAL RESULTS 


We design the experiments to show the key issues 
related to the problem. The experimental study 
first illustrates the optimization process for two 
detectors. The impact of priors on the solution is 
discussed. More complex cases are tested next with 
four detectors, and the non convex case is illus- 
trated with 2 examples. 


4.1 Two systems case 


Our first experiment corresponds to the simplest setting with 2 models. The ROC curves of the two detectors are shown in Figure 1. In that scenario one detector is good while the second one is significantly less accurate.

Figure 1. ROC curves of machine 1 and 2 detectors.

The targeted global false alarm rate $\alpha_G$ is set to 0.1 and the systems are assumed equiprobable.

By tuning each detector to reach a false alarm rate of 0.1, the power of the global system is about 0.88 (Figure 2). By using the proposed optimization approach we improve the power of the global system to 0.925. To obtain this result the worst detector (system 2) is tuned to a higher false alarm rate (0.185) while that of the best one is reduced to less than 0.02, as shown by Figure 3.

It means that to optimize the global system we must be more demanding with the best detector, the one whose detection rate decreases slowly when the false alarm rate is reduced. Conversely, the worst detector's false alarm rate is relaxed to gain a significant detection improvement. Combining these two actions respects the false alarm constraint and improves the overall detection rate.

Figure 2. ROC curves of the complete system depending on system 2 detector's false alarm with usual tuning (red) and proposed solution (green).

Figure 3. Tuning of each detector reported on its ROC curve: standard tuning (dotted lines) and proposed tuning (dashed lines).


4.2 Impact of models prior


With a two system case the point which cancels the derivative (15) depends on the prior probabilities, due to the argument $\frac{P_{fa_{lm}} - P_l P_{fa_l}}{P_m}$ of its second term $\frac{dg_m}{dP_{fa_m}}$. To study the impact of the prior probability, the example of the previous section is used again, changing the prior $P_1$ from 0 to 1 and decreasing $P_2$ accordingly. The gain related to the global optimization is reported in Figure 4. It goes from 0 up to 0.047, which represents approximately a 30% reduction of the non-detection rate (one minus the detection probability plotted in Figure 5) and can be considered significant.

The optimal false alarm tuning of machines 1 and 2 as a function of the prior probability is illustrated by Figure 6. The false alarm rate of the best detector (blue line) is always smaller than the targeted value $\alpha_G = 0.1$ but logically tends to $\alpha_G$ when $P_1$ tends to 1, while the false alarm rate of the worst detector (red line) is always larger than $\alpha_G$.

Figure 4. Performance gain as a function of the probability to use machine 1.

Figure 5. Power of the global system as a function of the probability to use machine 1.

Figure 6. Tuned optimal false alarm probability for the detectors as a function of the probability to use machine 1 (machine 1 in blue, machine 2 in red).

4.3 Simple four systems case 


To illustrate the proposed optimization process we designed a $K = 4$ systems case. The detector ROC curves corresponding to these systems are given by Figure 7. All priors are set to 1/4 and $\alpha_G = 0.25$. The optimal tuning is illustrated by Figure 8. The non-detection rate corresponding to the reference setting ($\alpha_i = \alpha_G$, $\forall i = 1..K$) is 0.054 and decreases to 0.028 once optimized.


4.4 Optimization limits 


To illustrate the limits of the proposed optimiza- 
tion process when the power function is not convex, 
a 3 system case has been designed and compared to 
a convex case. 

Figure 9 shows the 3 detectors ROC curves for 
the convex case. The corresponding power func- 
tion is displayed by Figure 10. 

Figure 11 shows the 3 other detectors ROC 
curves. The corresponding power function is dis- 
played by Figure 12. 

In the second case the power function is not convex although the ROC curves are not very complex. This means that the non-convex case may not be an exception. In fact, such a case appears whenever the second derivative of a ROC curve becomes positive, which is not so rare. It implies that the proposed optimization process may lead to a local optimum, so we have to find an optimization procedure that can cope with such non-convex cases.


Figure 7. ROC curves of a 4 detector case.

Figure 8. Tuning of each detector reported on its ROC curve: standard tuning (dotted lines) and proposed tuning (dashed lines).

Figure 9. ROC curves of a simple 3 detector case.

Figure 10. ROC surface of the global system as a function of detectors 1 and 2 tuning.

Figure 11. ROC curves of a more complex 3 detector case.

Figure 12. ROC surface of the global system as a function of detectors 1 and 2 tuning.


4.5 Improved optimization process 


To improve the proposed optimization process, the idea is to start the optimization from many different points, so we generate the initial solutions randomly. In our simulation we randomly choose an initial false alarm rate for each machine and normalize these initial guesses so that the global false alarm rate satisfies the constraint. Next, for each initial solution we apply the previous optimization procedure.
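The random initialization and restart loop can be sketched in Python as below. This is a minimal sketch under assumptions: the function names, the uniform draw, the rejection of points with a rate above 1 and the `local_opt` callback (standing for the pairwise procedure of Section 3) are all hypothetical and are not taken from the paper.

```python
import random


def random_start(priors, alpha_g, rng, max_tries=1000):
    """Draw random false alarm rates and rescale them so that
    sum_k P_k * P_fa_k equals alpha_g (equation 4)."""
    for _ in range(max_tries):
        raw = [rng.random() for _ in priors]
        total = sum(p * r for p, r in zip(priors, raw))
        if total == 0.0:
            continue
        rates = [alpha_g * r / total for r in raw]
        if all(0.0 < r <= 1.0 for r in rates):   # keep valid probabilities only
            return rates
    raise RuntimeError("no admissible starting point found")


def multi_start(priors, alpha_g, local_opt, n_starts=20, seed=0):
    """Run local_opt from several random starts and keep the best power.

    local_opt(rates) must return a pair (tuned_rates, detection_power)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        rates, power = local_opt(random_start(priors, alpha_g, rng))
        if best is None or power > best[1]:
            best = (rates, power)
    return best


start = random_start([0.25] * 4, 0.25, random.Random(1))
print(start, sum(0.25 * r for r in start))
```

Rescaling keeps the relative proportions of the random draw while enforcing the global constraint exactly; rejected draws are those for which the rescaled rates would exceed 1.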

We apply this improved process on a 4 system 
example whose detector ROC curves are shown in 
Figure 13. 

Figures 14 and 15 depict the results found with 
the initial optimization process and the results 
based on the improved one respectively. 

Table 1 summarizes the tuning of each system and the overall performance in each case. The non-detection rate decreases from more than 17% in the reference case to 10.3% with the simple optimization, and is reduced to 7.8% with the improved optimization proposed in this last section. These results illustrate the influence of the detectors' tuning on the overall performance and show that a very significant gain can be achieved.

Figure 13. ROC curves of a complex 4 detector case.

Figure 14. Tuning of each detector reported on its ROC curve: standard tuning (dotted lines) and proposed tuning (dashed lines) obtained with the proposed optimization strategy.

Figure 15. Tuning of each detector reported on its ROC curve: standard tuning (dotted lines) and proposed tuning (dashed lines) obtained using random optimization.

Table 1. Tuning and performances for different settings.

Exp.         P_fa1   P_fa2   P_fa3   P_fa4   P_d
Ref.         0.25    0.25    0.25    0.25    0.828
Basic opt.   0.026   0.412   0.248   0.314   0.897
Imp. opt.    0.014   0.328   0.557   0.101   0.922
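The values of Table 1 can be cross-checked with a few lines of Python. The sketch below assumes equal priors of 1/4 for the four systems, which is not stated explicitly for this example; under that assumption each row should give a global false alarm rate of 0.25, and one minus the detection power gives the non-detection rates quoted in the text.

```python
# Rows of Table 1: (false alarm rates of the 4 detectors, global detection power).
TABLE_1 = {
    "Ref.":       ([0.25, 0.25, 0.25, 0.25], 0.828),
    "Basic opt.": ([0.026, 0.412, 0.248, 0.314], 0.897),
    "Imp. opt.":  ([0.014, 0.328, 0.557, 0.101], 0.922),
}
PRIORS = [0.25, 0.25, 0.25, 0.25]  # assumption: equiprobable systems


def check_row(rates, p_d, priors=PRIORS):
    """Return (global false alarm rate, non-detection rate), rounded to 3 digits."""
    p_fa = sum(p * r for p, r in zip(priors, rates))
    return round(p_fa, 3), round(1.0 - p_d, 3)


for name, (rates, p_d) in TABLE_1.items():
    print(name, check_row(rates, p_d))
```

Under the equal-prior assumption all three settings satisfy the same global constraint, so the differences in detection power come only from how the false alarm budget is shared between the detectors.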


5 CONCLUSION 


This communication proposes to tackle the prob- 
lem of global detection optimization of a detector 
fleet. This can be encountered in distributed detec- 
tion systems or in detection systems applied to a 
fleet of similar systems. In a distributed system the 
question that arises is to find an optimal fusion 
of the local detections that all relate to the same 
object (Blum and Kassam 1997). In the latter case 
the main concern is about the global performance 
of the system over decisions on different popula- 
tions, typically a fleet of similar systems, with a 
specific detector for each kind of systems. 

We propose a model for this problem that fits 
in the Neyman-Pearson’s framework. We develop 
the analytical formulation and discuss the optimi- 
zation issue of the global detection system in terms 
of detector ROC curves. An approach to optimize 
the global system is proposed and some simulations 
are designed to illustrate the main issues related to 
the problem and to the optimization process. 

It is shown that the optimization can significantly improve the global performance compared to a naive reference approach, which supports the utility of the proposed approach.

The next step is the extension of this first study to real cases. This extension seems quite straightforward since the approach is based on ROC curves: whether the ROC curves correspond to maximum likelihood ratio tests, as in the proposed examples, or are built upon trained detectors makes no real difference.

A more challenging extension is to develop the same approach for production lines, taking into account some real-time issues. Typically, in recent production systems, the production of a line may evolve due to diverse causes such as production planning, availability of the line inputs or maintenance issues. The related questions concern our capacity to adapt such detection or quality control systems to these changes in real time, and the gain that can be expected from such a dynamic tuning system.

Other extensions can be studied such as the 
impact of the number of detectors. Can we use the 
same optimization approach or should we adopt 
more efficient ones? The case of detection with a 
rejection option should also be considered. The 
optimal criteria in the latter case must be redefined 
and the tuning strategy must be also reconsidered. 
Assessment method of the deterioration degree of asphalt concrete airport pavements


M. Zieja, P. Barszcz, K. Blacha & M. Wesołowski
Air Force Institute of Technology, Warsaw, Poland 


ABSTRACT: Proper airport management, based on systematically obtained information on the pavement condition of the airport's functional elements, is an important factor affecting the safety of flight operations. One element of the technical condition estimation of airport pavements is the assessment of their deterioration degree on the basis of identified damage and performed repairs. Such an approach makes it possible to forecast the resources needed for repairs and to plan overhauls rationally. The multicriteria analysis presented in this article is a weighted assessment method supporting the estimation of the deterioration degree of airport pavements. While calculating the criteria, it is important to focus on parameters characterising the deterioration degree of airport pavements of both civil and military facilities. Therefore, it is important to diagnose the pavements of the functional elements of these airports within the framework of the current five-year inspections, and to carry out annual inspections that include the inventory of damage and performed repairs.


1 INTRODUCTION 


In the literature and operational practice, many 
terms related to the life phases of airport facilities, 
such as, e.g. durability, service life, labour resource, 
technical service life resource, service life resources 
between repairs, are used (Shahin 2007, Zieja 
et al. 2016, Zieja 2015, Werbińska-Wojciechowska 
& Zając 2015, Barszcz & Blacha 2015 & 2016). 
The airport facilities’ life phases are characterised 
by their operation strategies. Figure 1 graphically 
shows operation strategies of the airport func- 
tional elements’ pavements, which currently func- 
tion within the framework of operation of airport 
pavements. 

Figure 1. Operation strategies of the airport functional elements' pavements.

In the case of the pavements of functional elements of airports operated by air forces, mixed operation strategies are used; however, a series of works aimed at moving to operation according to the technical condition has been undertaken. Operation according to the technical condition requires information that makes it possible to monitor the technical condition of airport pavements. One of the parameters characterising the pavement's technical condition is its deterioration degree, which is estimated on the basis of data obtained during the inventory. The inventory consists of drawing up a detailed physical listing of components, which specifies the assessed feature for a given day. The inventory of damage and repairs to the pavements of airport functional elements is carried out in accordance with the adopted assumptions, using previously prepared bases. It involves the inspection of a basic element, namely a slab (sample). The damage and performed repairs identified during the inspection are marked on the previously prepared bases using adopted symbols in appropriate colours: identified damage is marked in red and repaired damage in black. The quantities which characterise a given type of damage and repairs are also marked on the previously prepared bases (Zurek et al. 2014, Barszcz & Wesołowski 2015, Zio 2009). In accordance with the adopted nomenclature, the parameters characterising the deterioration degree of the airport pavement are marked, inter alia, as follows (Figure 2):
— alkali-silica reaction, alligator cracking;
— surface losses;
— blisters;
— chipping and loosening;
— cavity and wide cavity cracks;
— heaves or fractures;
— deep losses;
— ruts;
— boreholes.

Figure 2. Inventory of damage and repairs performed by a prepared expert with the use of a visual method.


Runways are among the most important functional elements and are therefore given a higher standard of maintenance than taxiways or airport aprons. Loose parts of the pavement can cause damage to aircraft engines and propellers. For reasons of aircraft operational safety, the runway inventory is conducted with the slab as the primary element to be inspected. For other functional elements, such as airport aprons or taxiways, the pavement tests can be carried out by analysing a number of selected samples on the tested section; at the same time, it is important to check whether the operated aircraft taxi under their own power or are towed along the analysed test section (Smirnov & Mickiewicz 2002).


2 ALGORITHM OF ESTIMATING THE
DETERIORATION DEGREE OF AIRPORT
PAVEMENTS


The algorithm, presented in Figure 3, shows the 
way of assessment of the deterioration degree of 
the airport’s functional element pavement. 

In order to assess the pavement deterioration degree, a finite sequence of defined activities is performed, and it includes:

— determination of the parameter limit values of relative functional elements of airports on the basis of the analysis of:
   • histogram and regression curve;
   • probability distribution;
— determination of the parameter limit values characterising deterioration of a single slab on the basis of the analysis of:
   • histogram and regression curve;
   • probability distribution;
— determination of assessment criteria related to the deterioration degree of airport pavements;
— determination of the indicators' values characterising the pavement deterioration;
— determination of the pavement deterioration degree.

Figure 3. Algorithm of assessing the deterioration degree of airport pavements.


3 METHOD OF ESTIMATING THE
DETERIORATION DEGREE OF AIRPORT
PAVEMENTS


Deterioration is a slow process spread over time. It mainly involves the reduction of the properties of the construction under the impact of external factors, which results in changes in its structure. The deterioration degree of the pavement of an airport's functional element is affected by damage and performed repairs. This quantity is determined on the basis of 11 defined types of damage and repairs. In order to select an indicator that best characterises the actual deterioration degree of the pavement surfaces, two calculation variants are considered, in which the performance of repairs is assumed to affect the pavement deterioration by 20% or 50%. While analysing the indicators characterising the pavement deterioration, the impact of the types of damage and repairs on aircraft operation safety is taken into account by introducing properly selected weights. The indicator for assessing the pavement deterioration degree of the airports' functional elements, computed with the occupied surface method and including the impact of a specific parameter on aircraft operation safety, is calculated from data obtained during the inventory in accordance with the following formula:


$$D^{FW(U,N)} = w'^{U} \times W^{UFW} + w'^{N} \times W^{NFW}, \qquad (1)$$

$$W^{UFW} = \frac{\sum_{i} (w''^{U})_i \times (Ob^{U})_i \times (p^{U})_i}{F \times \sum_{i} (w''^{U})_i} \times 100, \qquad (2)$$

$$W^{NFW} = \frac{\sum_{i} (w''^{N})_i \times (Ob^{N})_i \times (p^{N})_i}{F \times \sum_{i} (w''^{N})_i} \times 100, \qquad (3)$$

where $D^{FW(U,N)}$ = pavement deterioration of the airport's functional element made of asphalt concrete; $p$ = conversion rate of the parameter characterising damage and repairs on the surface including damaged and repaired areas; $w'$ = statistical weight of the importance of damage and repairs in the assessment of deterioration of the airport's functional element pavement; $w''$ = statistical weight of the validity of specific damage and repairs in the assessment of deterioration of the airport's functional element pavement; $Ob$ = measurement of damage and repairs of the airport's functional element pavement; $F$ = total area of the tested pavement of the airport's functional element; $U$ = damage to the airport's functional element pavement; and $N$ = repairs of the airport's functional element pavement.

The indicator for assessing the pavement deteri- 
oration degree of the airports’ functional elements 
calculated with the use of a method of limit values 
including the impact of a specific parameter on the 
aircraft operation safety is calculated on the basis 
of data obtained during the inventory in accord- 
ance with the following formula: 


$$D^{GW(U,N)} = w'^{U} \times W^{UGW} + w'^{N} \times W^{NGW}, \qquad (4)$$

$$W^{UGW} = \frac{\sum_{i} \dfrac{(w''^{U})_i \times (Ob^{U})_i \times (p^{U})_i}{F \times (WG^{U})_i}}{\sum_{i} (w''^{U})_i} \times 100, \qquad (5)$$

$$W^{NGW} = \frac{\sum_{i} \dfrac{(w''^{N})_i \times (Ob^{N})_i \times (p^{N})_i}{F \times (WG^{N})_i}}{\sum_{i} (w''^{N})_i} \times 100, \qquad (6)$$

where $D^{GW(U,N)}$ = pavement deterioration of the airport's functional element made of asphalt concrete; and $WG$ = limit value for specific types of damage and repairs.

The unloaded indicator for assessing the pave- 
ment deterioration degree of the airports’ func- 
tional elements with the use of a method of 
occupied surface is calculated on the basis of data 
obtained during the inventory in accordance with 
the following formula: 


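$$D^{F(U,N)} = w'^{U} \times W^{UF} + w'^{N} \times W^{NF}, \qquad (7)$$

$$W^{UF} = \sum_{i=1}^{13} \frac{(Ob^{U})_i \times (p^{U})_i}{F} \times 100, \qquad (8)$$

$$W^{NF} = \sum_{i=1}^{13} \frac{(Ob^{N})_i \times (p^{N})_i}{F} \times 100, \qquad (9)$$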


The unloaded indicator for assessing the pave- 
ment deterioration degree of the airports’ func- 
tional elements with the use of a method of limit 
values is calculated on the basis of data obtained 
during the inventory in accordance with the fol- 
lowing formula: 


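$$D^{G(U,N)} = w'^{U} \times W^{UG} + w'^{N} \times W^{NG}, \qquad (10)$$

$$W^{UG} = \sum_{i=1}^{13} \frac{(Ob^{U})_i}{F \times (WG^{U})_i} \times 100, \qquad (11)$$

$$W^{NG} = \sum_{i=1}^{13} \frac{(Ob^{N})_i}{F \times (WG^{N})_i} \times 100. \qquad (12)$$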
The indicator for assessing the pavement deterioration degree of the airports' functional elements is calculated in accordance with the following formula:

$$D = w'^{U} \times W^{U} + w'^{N} \times W^{N}, \qquad (13)$$

$$W^{U} = w^{UG} \times W^{UG} + w^{UF} \times W^{UF} + w^{UGW} \times W^{UGW} + w^{UFW} \times W^{UFW}, \qquad (14)$$

$$W^{N} = w^{NG} \times W^{NG} + w^{NF} \times W^{NF} + w^{NGW} \times W^{NGW} + w^{NFW} \times W^{NFW}. \qquad (15)$$


4 ASSESSMENT OF THE 
DETERIORATION DEGREE OF 
OPERATED AIRPORT PAVEMENTS 


While estimating the pavement deterioration degree, the results obtained during visual inspections of the pavement surfaces of the airport facilities were taken into account. The analysed spectrum of tested facilities was divided into 7 groups, and the limits of the ranges characterising the assessment criteria of the pavement deterioration were determined. The indicators on the basis of which the values of the assessment criteria were estimated include the unloaded indicator $D^{(U,N)}$ and the weighted index $D^{W(U,N)}$ of deterioration of the airports' functional elements, which are defined on the basis of the indicators characterising damage, $W^{U}$, and repairs, $W^{N}$, of the airport's functional element. The values achieved by the indicators characterising the deterioration of airport pavements, taking into account appropriately selected weights, are shown in Figures 4 and 5. However, for a complete view of the pavement deterioration state it is important to analyse the indicators characterising damage and repairs, which are presented in Figures 6 and 7.

A histogram, in the form of a bar graph, was used to display graphically the variability of a particular set of data characterising the pavement deterioration. The raw data are organised by division into ranges, the so-called classes. This makes it possible to present the empirical distribution of characteristics for quantitative variables and to determine the values around which the majority of results are located. Based on the probability distribution analysis, the distributions matching the nature of the probability density function were selected. The probability refers to the possibility of the occurrence of an event or several events. From the values of the index characterising the pavement deterioration degree, the probability of the occurrence of a specific event, which takes a value in the range from 0 to 1, was determined. The probability range of the occurrence of an event was divided into seven ranges, and the possibility of the occurrence of a specific event with the specified probability was calculated.

Figure 4. Deterioration degree $D$ of the airport's functional element pavement.

Figure 5. Deterioration degree $D$ of the airport's functional element pavement.

Figure 6. Indicator of the assessment of damage and repairs of the airport's functional element pavement.

Figure 7. Indicator of the assessment of damage and repairs of the airport's functional element pavement.
Figure 8. Standard distribution of the variability assessment indicator of the deterioration degree of the airport's functional element pavement.


The indicator offered by Air Force Institute 
of Technology, which characterises the pavement 
deterioration degree of the airport’s functional ele- 
ments, is within the range from 0 that means the 
perfect condition pavement to 100 that means the 
pavement unfit for further use. The calculation of 
D indicator is based on visual inspection results, 
during which different types of damage and repairs 
as well as their measurement are determined. The 
impact of a type of damage and repairs on the air- 
craft operation safety is included in the calculation 
by adopting weights estimated on the basis of the 
experts’ method. The standard assessment scale of 
the pavement deterioration degree includes 7 lev- 
els, however, it is also possible to apply a simplified 
scale, where there are three decision-making levels 
of a description of the deterioration degree of the 
airport’s functional element pavement. For each 
level, classes determining the pavement condi- 
tion were assigned. The first one is a desired level, 
which includes new, renovated and operated pave- 
ments, with the assumption that these pavements 
will not require planned renovation works over the 
next five years. The indirect warning level identi- 
fies the pavement condition as the one in which 
it is reasonable to perform detailed tests in terms 
of conducting treatments in order to improve the 
pavement condition. The last one is a critical level, 
which determines the prompt performance of tech- 
nical and operational research, in order to define 
activities aimed at the introduction of procedures 
to improve the pavement condition or taking the 
facility out of service. Figure 9 shows the relation- 
ship between decision-making levels and classes of 
the deterioration state of the airport’s functional 
element pavement. 

The interpretation of the pavement classes is shown in Table 1.

Figure 9. Assessment criteria of the deterioration degree of the airport's functional element pavement offered by Air Force Institute of Technology.

Table 1. Pavement classes.

State        D(U,N)   DG(U,N)   DFW(U,N)   DGW(U,N)   DF(U,N)
             0-9      0-10      0-9        0-10       0-10
             10-30    11-25     10-34      11-25      11-25
             31-34    26-40     35-39      26-40      26-40
             35-48    41-55     40-47      41-55      41-55
Very bad     49-51    56-70     48-51      56-70      56-70
             52-54    71-85     52-69      71-85      71-85
Unsuitable   55-100   86-100    70-100     86-100     86-100


5 CONCLUSIONS 


On the basis of verified parameters characterising 
the deterioration degree of airport pavements, it is 
possible to predict and estimate the period of safe 
operation of a particular functional element of the 
airport, which as a result, provides the proceeding to 
operation of pavements of the airport’s functional 
element in accordance with the technical condition. 
The deterioration degree of airport pavements is 
estimated on the basis of accepted indicators with 
the use of selected weights verified according to the 
experts’ method. In order to reliably predict the con- 
dition of airport pavements, it is necessary to use 
an objective, repetitive assessment system. Design- 
ing of the IT support system for managing pave- 
ments of the airport’s functional element should be 
preceded by the analysis of processes, which oper- 
ate within the organisational unit. Currently, while 
estimating the assessment criteria of the technical 
condition, a verified database, which was obtained 
within the framework of works carried out on the 
functional elements’ pavements of tested civil air- 
ports, is used. While calculating the criteria, it is 
important to focus on parameters characterising 
the deterioration degree of airport pavements, both 
of civil and military facilities; therefore, it is impor- 
tant to diagnose these airport pavements within the 
framework of current five-year inspections, and to 
carry out inspections in order to inventory damage 
and repairs within the annual intervals. 
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ABSTRACT: Monitoring systems for rotating machines has been largely used in industry even with the 
high reliability achieved in turbines and compressors. These systems can reduce the number of unscheduled
shutdowns and the time needed for scheduled ones. It therefore remains a challenge for researchers and industry
to develop better monitoring, diagnosis and prognostic techniques. These systems generally use multiple
sensors to analyze the machine behavior, so the Mahalanobis-Taguchi strategy (MTS), which is recognized
for its suitability for multivariate data analysis, was applied in order to integrate these sensors. To
compare and validate the MTS results, a more classic approach based on vibration analysis is also
considered. The objective of this paper is to present some initial results on unbalance
estimation in rotating machines using this method. A centrifugal compressor used in an offshore unit is
selected for application. Actual data and computational simulations were used to validate the method and
to provide the unbalance estimation of the selected machine.


1 INTRODUCTION 


Monitoring systems for rotating machines have
been widely used in industry, even with the high
reliability achieved in turbines and compressors.
This technology aims to reduce unscheduled
shutdowns and the duration of scheduled ones,
minimizing downtime costs.

These problems are more severe in offshore
facilities, because logistics and weather
conditions affect how quickly production can be
re-established. It therefore remains a challenge for
researchers and industry to develop better
monitoring, diagnosis and prognostic techniques.

Another peculiarity of these monitoring systems
is that they have multiple sensors installed around
the machine, and integrating them in a
troubleshooting analysis is not a simple task. For this
reason, the Mahalanobis-Taguchi strategy (MTS), which
is recognized for its suitability for multivariate data
analysis, was applied to improve this analysis (Cudney, 2015).

In order to verify the applicability of MTS to
rotating machinery, the unbalance malfunction
was chosen because, as described by Bently &
Hatch (2002), it is the most common malfunction
in this type of machine.

As it is not possible to insert an unbalance error
in a real machine to validate the method,
computational simulations were used to produce these
data.

Also, to compare and validate the MTS
results, a more classic approach based on vibration
analysis is considered.

Therefore, the objective of this paper is to
present some initial results on unbalance estimation
in rotating machines using MTS. A centrifugal
compressor used in an offshore unit is selected for
application.


2 MAHALANOBIS-TAGUCHI SYSTEM 


As described by Cudney (2015), Soylemezoglu et al.
(2011), Jin & Chow (2013) and John (2014), the
Mahalanobis-Taguchi system (MTS) is a method
that has been used in multivariable diagnostic
applications. It defines two groups, “normal” and
“abnormal”, using the Mahalanobis distance (MD),
and then optimizes the number of variables used in
the diagnosis by applying orthogonal arrays (OA) and
signal-to-noise (SN) ratios. Finally, a prognosis can be
made in order to avoid a fault in the monitored
application. The MTS can be generally summarized in
four stages, as follows:


2.1 Stage 1: Mahalanobis space construction 


The normal group is created in this step to be
used as the reference. The variables are selected and
“healthy” samples of each one are collected.
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At first, calculate the mean of each variable in 
the “normal” condition per Equation 1: 


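$$\bar{x}_i = \frac{1}{n} \sum_{j=1}^{n} x_{ij} \qquad (1)$$

where n = number of samples of the ith variable.


Then, calculate the standard deviation for each
variable per Equation 2:


$$s_i = \sqrt{\frac{\sum_{j=1}^{n} (x_{ij} - \bar{x}_i)^2}{n-1}} \qquad (2)$$


Then, normalize each variable per Equation 3:


$$Z_{ij} = \frac{x_{ij} - \bar{x}_i}{s_i} \qquad (3)$$


Then, calculate the transpose of Z, Z^T.
Calculate the correlation matrix per Equation 4:


$$C = \frac{\sum_{j=1}^{n} Z_j Z_j^{T}}{n-1} \qquad (4)$$

where Z_j = vector of the normalized variables of the jth sample.


Calculate the inverse of the correlation matrix, C^{-1}.
Calculate the MD per Equation 5:


$$MD_j = \frac{1}{k} Z_j^{T} C^{-1} Z_j \qquad (5)$$


where k = number of variables.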
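
The steps of Equations 1 to 5 can be sketched in a few lines of Python. This is only an illustrative sketch, specialised to the two-variable case used later in the paper (drive end and non-drive end major axis), so that the inverse of the correlation matrix has a closed form. The function names and the sample readings are hypothetical and are not taken from the paper.

```python
import statistics

def build_mahalanobis_space(x1, x2):
    """Return the normal-group statistics (Equations 1, 2 and 4) for two variables."""
    n = len(x1)
    mean1, mean2 = statistics.mean(x1), statistics.mean(x2)
    # statistics.stdev uses the n - 1 denominator of Equation 2.
    std1, std2 = statistics.stdev(x1), statistics.stdev(x2)
    z1 = [(v - mean1) / std1 for v in x1]  # Equation 3
    z2 = [(v - mean2) / std2 for v in x2]
    # Off-diagonal term of the 2 x 2 correlation matrix (Equation 4).
    r = sum(a * b for a, b in zip(z1, z2)) / (n - 1)
    return mean1, std1, mean2, std2, r

def mahalanobis_distance(space, v1, v2):
    """Scaled MD of one observation (Equation 5) with k = 2 variables."""
    mean1, std1, mean2, std2, r = space
    z1 = (v1 - mean1) / std1
    z2 = (v2 - mean2) / std2
    k = 2
    # Closed form of Z^T C^-1 Z for C = [[1, r], [r, 1]].
    quad = (z1 * z1 - 2.0 * r * z1 * z2 + z2 * z2) / (1.0 - r * r)
    return quad / k

# Hypothetical "healthy" major-axis readings (not data from the paper).
drive_end = [2.1, 2.4, 2.2, 2.6, 2.3, 2.5]
non_drive_end = [2.0, 2.3, 2.4, 2.2, 2.5, 2.1]
space = build_mahalanobis_space(drive_end, non_drive_end)
normal_md = [mahalanobis_distance(space, a, b)
             for a, b in zip(drive_end, non_drive_end)]
# An abnormal observation is normalized with the normal-group statistics.
abnormal_md = mahalanobis_distance(space, 26.1, 3.5)
```

With this scaling, the MD values of the normal group average (n - 1)/n, that is, close to 1, which is a convenient sanity check on an implementation.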


2.2 Stage 2: Validation of measurement scale 


In this stage it is necessary to identify abnormal
samples of each variable in order to calculate the
abnormal Mahalanobis distance (abnormal MD). The
abnormal samples are normalized using the mean and
standard deviation of the respective normal variable,
and the correlation matrix of the normal group is
also used. Then, the abnormal MD is calculated. If the
abnormal MD values are greater than the normal MD
values, this indicates that the measurement scale
built from the normal group is valid.


2.3 Stage 3: Optimization 


To optimize the number of useful variables, it is
necessary to construct a two-level OA. The variables will
be placed in a row and will have two levels in each
column. Level-1 indicates that the variable should
be used in the construction of the Mahalanobis space and
Level-2 means that it should not be used. Then, the
abnormal MD will be recalculated following the OA.

The signal-to-noise ratio (η) is calculated per
Equation 6.


$$\eta = -10 \log_{10} \left[ \frac{1}{t} \sum_{j=1}^{t} \frac{1}{MD_j} \right] \qquad (6)$$

where t = number of abnormal conditions;
MD_j = the MD of the jth abnormal condition.

The gain in signal-to-noise ratio is calculated per
Equation 7. If the value is positive, the variable is
kept. If not, it is removed from the analysis.


$$\text{Gain} = (S/N\ \text{ratio})_{\text{level-1}} - (S/N\ \text{ratio})_{\text{level-2}} \qquad (7)$$
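
As an illustration of Equations 6 and 7, the sketch below computes the larger-the-better signal-to-noise ratio from a list of abnormal MD values and the gain of one variable from the S/N ratios of the OA runs. Averaging the S/N ratios over the runs at each level is the usual Taguchi convention and is assumed here. The function names and the numbers are hypothetical.

```python
import math

def sn_ratio(abnormal_mds):
    """Larger-the-better S/N ratio of Equation 6, in dB."""
    t = len(abnormal_mds)
    return -10.0 * math.log10(sum(1.0 / md for md in abnormal_mds) / t)

def gain(sn_by_run, levels):
    """Equation 7 for one variable: mean S/N at level 1 minus mean S/N at level 2."""
    level1 = [sn for sn, lv in zip(sn_by_run, levels) if lv == 1]
    level2 = [sn for sn, lv in zip(sn_by_run, levels) if lv == 2]
    return sum(level1) / len(level1) - sum(level2) / len(level2)

# Hypothetical S/N ratios of four OA runs and the level of one variable in each run.
sn_by_run = [12.0, 10.0, 3.0, 1.0]
levels = [1, 1, 2, 2]
variable_gain = gain(sn_by_run, levels)  # positive, so the variable would be kept
```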


2.4 Stage 4: Diagnosis


Once the optimization of stage 3 is complete, the
Mahalanobis space is reconstructed with the useful
variables and the diagnosis process is performed.


Figure 1. MTS process (flowchart steps: calculate MD; validate the
measurement scale, repeating the construction if it is not valid;
identify the useful variables; construct the measurement scale using
optimized variables; validate the measurement scale using optimized
variables; use the optimized model for diagnostics).
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The Figure 1 summarize the four presented 
stages. 


3 UNBALANCE 


Unbalance is a centrifugal force produced by an
offset between the mass center and the geometrical
center of the moving parts. This offset arises because
the manufacturing process cannot produce a rotor
whose geometrical and mass centers coincide. When
the rotor is in operation, it tries to rotate around
its geometrical center, but the off-center mass
produces a centrifugal force that results in unbalance
vibration (Sun et al. 2017).

A strong radial vibration at the fundamental
frequency is the characteristic diagnostic symptom
of unbalance. As the response amplitude is related to
the square of the rotational speed, unbalance is
a dangerous condition in machinery that runs at
high rotational speeds. In variable-speed machines,
the effects of unbalance vary with the shaft
rotational speed. In low-speed machines,
the high spot (location of maximum displacement
of the shaft) is at the same location as the
unbalance. At increased speeds, the high spot
lags behind the unbalance location.

A useful tool for analyzing vibration signals, and
therefore unbalance, is the Fast Fourier Transform
(FFT). The FFT is a mathematical algorithm
that converts a periodic signal from the time
domain to the frequency domain. In the resulting
spectrum, every periodic component of the signal
appears as a peak at its respective frequency
(Al-Badour et al. 2011).
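
To make the idea of the synchronous (1X) component concrete, the sketch below evaluates a single bin of the discrete Fourier transform at the shaft frequency, which is the quantity an FFT computes for every bin at once. It is an illustration only: the function name, the sampling parameters and the phase-lag sign convention are assumptions, and the record is assumed to span a whole number of revolutions.

```python
import cmath
import math

def synchronous_component(signal, sample_rate_hz, shaft_speed_hz):
    """Zero-to-peak amplitude and phase lag (degrees) of the 1X component."""
    n = len(signal)
    acc = 0j
    for i, value in enumerate(signal):
        # Correlate the record with a complex exponential at the shaft frequency.
        acc += value * cmath.exp(-2j * math.pi * shaft_speed_hz * i / sample_rate_hz)
    phasor = 2.0 * acc / n
    return abs(phasor), -math.degrees(cmath.phase(phasor))

# Hypothetical signal: 10 um at 1X (50 Hz, 30 degrees lag) plus 2 um at 2X.
fs, f_shaft, n_samples = 2000.0, 50.0, 2000
record = [
    10.0 * math.cos(2 * math.pi * f_shaft * i / fs - math.radians(30.0))
    + 2.0 * math.cos(2 * math.pi * 2 * f_shaft * i / fs)
    for i in range(n_samples)
]
amplitude_1x, lag_1x_deg = synchronous_component(record, fs, f_shaft)
```

Because the record covers an integer number of revolutions, the 2X content does not leak into the 1X estimate.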

Every rotating machine has some degree of
unbalance, which becomes a problem only when it
exceeds the acceptable limit (Walker et al. 2013).
Unbalance is generally identified by a high synchronous
vibration (1X) in the frequency or time domain that, in
excessive cases, may cause fatigue and internal rubs
and may damage bearings and seals (Bently & Hatch
2002).

Other malfunctions also produce a high
synchronous vibration (for more details, see
Bently & Hatch 2002). When analyzing
unbalance, it is important to remove the runout
from the signal, because it can influence the 1X
sensor readings and disturb the diagnosis (Bently &
Hatch 2002).

Runout is a false vibration measurement that
occurs when the rotor rotates about its geometric center
with no displacement of its center line. Runout
can have two different sources: mechanical (defects
on the shaft surface in the vibration probe area)
and electrical (variation of the electrical conductivity
and permeability in the vibration probe area)
(Bently & Hatch 2002).

Normally, there are four radial vibration probes:
two located at the drive end (Fig. 3) and two
at the non-drive end (Fig. 4) of the machine shaft,
mounted 90° apart. The recorded vibration
data consist of the shaft radial displacement
and phase angle measured by each probe. A
rotor orbit with the 1X filtered signal is presented in
Figure 5; the major axis is obtained by combining
the displacement and phase angle of the two probes.
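
One common way to combine the two probe readings into the orbit major axis is to split the 1X motion into forward and backward whirl components; the sketch below follows that approach. The paper does not state which procedure was used, so this is an assumption, as are the function name and the phase-lag convention x(t) = A cos(wt - lag). The result is in the same units as the input amplitudes (for example, pk-pk in gives pk-pk out).

```python
import cmath
import math

def orbit_axes(amp_x, lag_x_deg, amp_y, lag_y_deg):
    """Major and minor axes of the 1X orbit from two orthogonal probes."""
    x = cmath.rect(amp_x, -math.radians(lag_x_deg))
    y = cmath.rect(amp_y, -math.radians(lag_y_deg))
    # x(t) + j*y(t) = forward * exp(jwt) + backward * exp(-jwt)
    forward = (x + 1j * y) / 2.0
    backward = (x.conjugate() + 1j * y.conjugate()) / 2.0
    major = abs(forward) + abs(backward)
    minor = abs(abs(forward) - abs(backward))
    return major, minor

# Circular orbit: equal amplitudes, Y lagging X by 90 degrees.
circle = orbit_axes(5.0, 0.0, 5.0, 90.0)
# Straight-line orbit: both probes in phase.
line = orbit_axes(3.0, 0.0, 4.0, 0.0)
```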


Figure 3. Drive end radial vibration probes. 


Figure 4. Non-drive end radial vibration probes. 
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Probe 1 


Figure 5. Rotor orbit, 1X filtered signal (diagram labels: Probe 1,
Probe 2, major axis, rotor orbit).


4 ROTORDYNAMICS MODELING 


As described by Vance et al. (2010), models built
from beam elements are generally adequate for
modelling rotors. The main objective is to calculate
the beam deflection in order to verify the
internal clearances between the rotor and the static
parts, as well as the rotor stability.

To construct this model, beam elements are
defined with their geometrical and material
characteristics. Then, applying finite element
formulations, the stiffness, damping and inertia
matrices are calculated.

The vibrational performance of the machine
results from the interaction between the dynamic
properties of the bearings and of the rotor system.
The modeling program used in this paper simulates the
variation of the bearing stiffness and damping
properties with rotor speed by solving the
Reynolds equation. For more details, refer to He
et al. (2005).

As described above, rotordynamic modeling is
a complex task, so the program RotorLab, developed
by the Rotating Machinery and Controls Laboratory
(ROMAC), was chosen to simulate the rotor
unbalance response. ROMAC is an industrial
consortium at the University of Virginia with over 40
years of experience (Weaver et al. 2017).

Finally, the authors validated the model by
reproducing the API lateral analysis developed by
the machine manufacturer.


5 UNBALANCE IDENTIFICATION USING 
VIBRATION ANALYSIS 


As mentioned before, in order to compare and
validate the MTS results, a more classic approach based on
vibration analysis is considered. As widely reported
in the literature, unbalance is strongly associated
with 1X synchronous vibrations. Thus, the major axis
displacement signal filtered at 1X the rotational speed
in both bearings can give a clue about the unbalance
magnitude. Generally, it is expected that the greater
the unbalance, the greater the vibration measured at 1X.

However, depending on the position and
magnitude of the unbalance, the system modal
response can assume different configurations and,
therefore, the displacement amplitude at the
bearings may not follow the logic described above.

In this case, an approach considering the virtual
work (Lalanne & Ferraris 1998) of the forces
acting on the shaft is applied. The bearing stiffness
and damping terms are considered known and the
influence of shaft bending is neglected.

The bearings are considered symmetrical, since
the unbalance responses in both directions
are almost identical, and without cross-coupled
forces, since the machine has tilting pad bearings
(He et al. 2005).

Considering just the forces acting on the shaft
due to the bearing stiffness, the virtual work is
given by Equation 8.


$$\delta W = -k\,u\,\delta u - k\,w\,\delta w \qquad (8)$$


where k is the bearing stiffness and u and w are
the displacements in the two directions.

From Equation 8 it can be concluded that
the maximum potential energy related to the shaft
displacement, considering just the 1X component,
occurs in the major axis direction.

The idea is to associate the energy level,
considering the scalar value obtained for both bearings,
with the rotor level of unbalance: the greater the
unbalance, the greater the potential energy.

As the bearing stiffness is constant under
the same speed conditions, the evaluated potential
energy depends only on the major axis
displacement (d), as presented by Equation 9.


$$\frac{U}{k} = \frac{d^{2}}{2} \qquad (9)$$


6 CURRENT METHOD OVERVIEW 


As described in Figure 6, a real tested machine will
be considered as the reference to construct the
Mahalanobis space. Then, this machine will be
modeled using the RotorLab software with two cases
that produce the same vibration level as observed
in the tested machine (550 g.mm at the midspan,
and 275 g.mm at two points of the rotor, out
of phase). This simulation will be done in order to
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Figure 6. Method overview. 


verify the similarity between the real machine and
the simulated data. Finally, four more unbalance cases
will be simulated to calculate the abnormal
Mahalanobis distance.


7 APPLYING MTS


Following the stages described in Section 2, the MTS
is applied as follows.


7.1 Stage 1: Mahalanobis space construction 


The first step is the construction of the Mahalanobis
space, which requires a “healthy” machine as the standard.
The activities developed in this step are summarized
below and detailed in the sequence.


• Collect the “healthy” vibration data with runout
compensation of the four vibration probes during
the mechanical running test (MRT);

• Filter the 1X signal (synchronous vibration) of
each probe;

• Use the major axis displacement of the drive and
non-drive ends to construct the Mahalanobis space
(there will be two variables to construct the
Mahalanobis space).


The “healthy” machine considered was one that
was tested and approved in the API 617 7th edition
MRT.

This test is conducted on a test bench at low
pressure and consists of accelerating the compressor
from zero to the maximum continuous speed
(MCOS) in increments of 10% of MCOS and
running at MCOS until bearing temperatures and
shaft vibrations have stabilized. Then, the speed is
increased to the trip speed and kept at this level for
fifteen minutes. Finally, the machine is decelerated
to MCOS and run in this condition for four
hours. A schematic can be seen in Figure 7 (API
617 7th edition).


[Figure 6 diagram labels: Health Sim. Machine; Simulated Data;
Abnormal Sim. Machine; F. E. models with sim unbalance conditions;
Calculated Mahalanobis distance]


Figure 7. Mechanical running test schematic (stages shown: warm up
@ MCOS; 15 min @ trip speed; 4 hours @ MCOS).

To construct the Mahalanobis space, vibration 
data were collected during the four hours operat- 
ing at the maximum continuous speed with runout 
compensation (Fig. 8). Then, only the synchro- 
nous vibration signal (1X) of each probe was kept 
because unbalance is identified by high synchro- 
nous vibration. 


Figure 8. Trend of a radial vibration probe during the
MCOS (1X compensated signal, 16:41:42 to 16:51:42; panels: phase
lag, 5 deg/div, and amplitude, 0.5 µm pp/div).


Table 1. Calculated Mahalanobis distance.


Minimum value -6,3
Maximum value 10,5
Standard deviation 0,7


Finally, the Mahalanobis space was constructed
using the major axis displacement of the 1X filtered
signal at the drive and non-drive ends. This choice
eliminates the influence of the phase angle.

Applying Equations 1 to 5, the calculated
MD is presented in Table 1.


7.2 Stage 2: Validation of measurement scale 


At this point it is necessary to identify abnormal
behaviors to validate the measurement scale. The
activities developed in this step are summarized
below and detailed in the sequence.


• Simulate four rotor unbalance cases;

• Verify the simulated vibration displacement and
phase angle at the vibration probes;

• Analyze the distribution and variance of the
major axis vibration data obtained during the
MRT;

• Use the simulated major axis displacement of the
unbalanced rotor as the mean, with the same variance
obtained in the previous step, to create an
abnormal vibration sample.


Six unbalance responses were simulated
using the program RotorLab developed by
ROMAC: two following API 617 7th
edition paragraph 2.6.4 (550 g.mm at midspan to
excite the first bending mode, and two 275 g.mm
unbalances 180° out of phase placed at the points of
maximum shaft displacement to excite the second
bending mode), one as per paragraph 2.6.4 (1100 g.mm
at the coupling) and one with the same unbalance
magnitude (1100 g.mm) but at the thrust disk. The
other two are 1500 g.mm at the coupling and at the
thrust disk.

At MCOS, the first two cases presented the
same vibration level as the “healthy” machine at the
vibration probes. This shows that, depending on where
the unbalance is placed and where the vibration probes
are located, this malfunction will not be identified.
A similar case was identified in the company where
one of the authors works: a centrifugal compressor
was operating with a broken impeller without
triggering an alarm from the radial vibration probes,
but it presented degraded performance, and the root
cause was identified only during the overhaul.

The third case presented a major axis vibration
of 26,1 microns pk-pk at the drive end and 3,5 microns
pk-pk at the non-drive end. The fourth case presented
a major axis vibration of 3,1 microns pk-pk at the
drive end and 25 microns pk-pk at the non-drive end.
The fifth case presented a major axis vibration of
31,4 microns pk-pk at the drive end and 3,5 microns
pk-pk at the non-drive end. The sixth case presented
a major axis vibration of 4,3 microns pk-pk at the
drive end and 34,3 microns pk-pk at the non-drive
end. A summary can be seen in Table 2.

Although the unbalance simulation produces a
discrete result, the abnormal MD needs a sample of
vibration measurements. So, it was decided to verify the
statistical distribution of the major axis vibration at
the drive and non-drive ends collected during the
four-hour mechanical running test. The drive end followed
a Weibull distribution and the non-drive end a mixed
Weibull distribution, as presented in Table 3 and
Table 4, respectively.


Table 2. Unbalance response at MCOS.

Case  Unbalance     Position           Vibration at drive end  Vibration at non-drive end
1     550 g.mm      Midspan            2,4 µm                  2,3 µm
2     2 x 275 g.mm  180° out of phase  1,9 µm                  8,4 µm
3     1100 g.mm     Coupling           26,1 µm                 3,5 µm
4     1100 g.mm     Thrust disk        3,1 µm                  25 µm
5     1500 g.mm     Coupling           31,4 µm                 3,5 µm
6     1500 g.mm     Thrust disk        4,3 µm                  34,3 µm
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Table 3. Weibull distribution on drive end. 

F(x) R? 
1- exp[-(x /11,3)*7] 0,864 
Table 4. Weibull distribution on non-drive end. 

F(x) R? 
0,459{1 — exp[—(x /5,15)!99>8]} 0,798 
+0,355{1 — exp[-(x / 5,41)” ]} 0,765 
+0,186{1 — exp[—(x / 5,97)357}} 0,848 


As described by Lewis (1994), Equations 10,
11 and 12 describe the Weibull distribution.
According to Kececioglu & Wang (1998), the
mixed Weibull distribution is described by
Equations 13 and 14.


$$F(x) = 1 - \exp\left[-(x/\theta)^{m}\right] \qquad (10)$$


where F(x) = cumulative distribution function;
θ = scale parameter; and m = shape parameter.


$$\mu = \theta\,\Gamma(1 + 1/m) \qquad (11)$$

where μ = mean.

$$\sigma^{2} = \theta^{2}\left[\Gamma(1 + 2/m) - \Gamma^{2}(1 + 1/m)\right] \qquad (12)$$

where σ² = variance.

$$f(x) = p\,f_1(x) + q\,f_2(x) \qquad (13)$$


where f(x) = probability density function; f_1(x)
and f_2(x) = probability density functions of two
different subpopulations.


$$p + q = 1 \qquad (14)$$


where p and q = mixing weights of the
subpopulations f_1(x) and f_2(x), respectively.

Then, per Equations 11 and 12, new parameters
m and θ were calculated using the same variance
(σ²) as the measured sample and the simulated
major axis vibration as the mean (μ).

To create the new sample (the term x in
Equation 10), random numbers between 0 and 1 were
generated for the term F(x).
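
The two steps above can be sketched as follows. The paper does not say how m and θ were solved for, so the bisection on the squared coefficient of variation is an assumed numerical choice; the function names, the bracketing interval and the example variance are likewise hypothetical. Sampling inverts Equation 10, x = θ[-ln(1 - F)]^(1/m).

```python
import math
import random

def weibull_from_moments(mean, variance):
    """Solve Equations 11 and 12 for the shape m and the scale theta."""
    target = variance / mean ** 2  # squared coefficient of variation

    def cv_squared(m):
        g1 = math.gamma(1.0 + 1.0 / m)
        g2 = math.gamma(1.0 + 2.0 / m)
        return g2 / (g1 * g1) - 1.0

    lo, hi = 0.1, 200.0  # assumed search interval for the shape parameter
    if not cv_squared(hi) <= target <= cv_squared(lo):
        raise ValueError("moments outside the range handled by this sketch")
    for _ in range(200):  # cv_squared decreases as m increases
        mid = 0.5 * (lo + hi)
        if cv_squared(mid) > target:
            lo = mid
        else:
            hi = mid
    m = 0.5 * (lo + hi)
    theta = mean / math.gamma(1.0 + 1.0 / m)
    return m, theta

def sample_weibull(m, theta, size, rng):
    """Invert Equation 10 for uniform random values of F(x)."""
    return [theta * (-math.log(1.0 - rng.random())) ** (1.0 / m)
            for _ in range(size)]

# Mean from a simulated case (26,1 um); the variance 0.49 is a made-up value.
m, theta = weibull_from_moments(26.1, 0.49)
sample = sample_weibull(m, theta, 1000, random.Random(0))
```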

Finally, the abnormal MD is presented in Tables 5 and 7
for coupling unbalance and in Tables 6 and 8 for
thrust disk unbalance. All of them show higher values
than the normal MD, validating the measurement scale.


Table 5. Calculated MD, for 1100 g.mm coupling 


unbalance. 

Minor value 631,3 
Major value 1.106,2 
Standard deviation 60,8 


Table 6. Calculated MD, for 1100 g.mm thrust disk 
unbalance. 

Minor value 4.254,8 
Major value 4.841,9 
Standard deviation 73,4 


Table 7. Calculated MD, for 1500 g.mm coupling 
unbalance. 

Minor value 1.667,3 
Major value 2.224,1 
Standard deviation TE 
Table 8. Calculated MD, for 1500 g.mm thrust disk 
unbalance. 

Minor value 10.886,1 
Major value 12.175,0 
Standard deviation 205,9 
Table 9. MS - 550 g.mm at midspan. 

Minor value —16,4 
Major value 22,5 
Standard deviation 0,7 


Table 10. MS -2 x 275 g.mm 180° out of phase. 


Minor value -11,1 
Major value 15,2 
Standard deviation 0,7 


As described previously, cases 1 and 2
in Table 2 presented a low vibration
level. So, two healthy vibration samples were
generated following the presented Weibull distributions,
and two new Mahalanobis spaces were calculated
in order to compare the simulated data with the
tested data. Comparing Table 1 with Tables 9 and
10, it can be seen that the simulated and tested data
give similar results.
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7.3 Stage 3: Optimization 


As mentioned in Section 2.3, the optimization aims
to reduce the number of monitored variables. The
gain was calculated per Equation 7 for each
variable (drive and non-drive end major axis vibration).
The results of the optimization considering the
1100 g.mm coupling unbalance and the 1100 g.mm thrust
disk unbalance are presented in Tables 11 and 12,
respectively.

Following the optimization results in Table 11,
only the drive end vibration should be kept.
This result is consistent because, when the
unbalance is placed at the coupling of this machine, the
high vibration signal is noticed only at the drive
end probes. The same interpretation applies to
Table 12: here only the non-drive end
vibration should be kept, because an
unbalance placed at the thrust disk creates a high
vibration signal only at the non-drive end probes.

So, in order to use the same Mahalanobis space for
both unbalance positions, the optimization stage was
removed from the analysis.


Table 11. Variables gain for coupling unbalance.

Vibration major axis   Gain   Result
Drive end              12,8   Keep variable
Non-drive end          -0,2   Discard variable


Table 12. Variables gain for thrust disk unbalance.

Vibration major axis   Gain   Result
Drive end              -0,3   Discard variable
Non-drive end          12,3   Keep variable


Table 13. MTS diagnoses.

     MD                     Malfunction
1    -6,3 to 10,5           Normal operation
2    631,3 to 1.106,2       1100 g.mm coupling unbalance
3    4.254,8 to 4.841,9     1100 g.mm thrust disk unbalance
4    1.667,3 to 2.224,1     1500 g.mm coupling unbalance
5    10.886,1 to 12.175,0   1500 g.mm thrust disk unbalance


Table 14. Energy approach diagnoses.

     U/k      Malfunction
1    76,13    Normal operation
2    347,49   1100 g.mm coupling unbalance
3    317,56   1100 g.mm thrust disk unbalance
4    504,89   1500 g.mm coupling unbalance
5    597,05   1500 g.mm thrust disk unbalance


7.4 Stage 4: Diagnosis


The diagnosis is summarized in Table 13. It can be
seen that each malfunction has an MD range different
from that of normal operation, showing the
effectiveness of the method.

The effectiveness of the MTS method can be checked
against the approach presented in Section 5, in which
the potential energy associated with the major axis
displacement at 1X is related to the amount of
unbalance. The results of the energy approach are
presented in Table 14.

As with the MTS method, the malfunctions are also
clearly associated with different energy levels.
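
As a rough numerical check of the energy approach, the sketch below evaluates Equation 9 for the four abnormal cases using the major axis displacements listed in Table 2. Summing the contributions of the drive and non-drive end bearings is an assumption about how the scalar value "for both bearings" was formed; under that assumption the results come out close to, though not identical with, the values in Table 14. The function and variable names are hypothetical.

```python
def energy_index(drive_end_um, non_drive_end_um):
    """U/k of Equation 9 summed over the two bearings (units: um^2)."""
    return 0.5 * (drive_end_um ** 2 + non_drive_end_um ** 2)

# Major axis displacements (um) at the drive and non-drive ends, from Table 2.
cases = {
    "1100 g.mm coupling": (26.1, 3.5),
    "1100 g.mm thrust disk": (3.1, 25.0),
    "1500 g.mm coupling": (31.4, 3.5),
    "1500 g.mm thrust disk": (4.3, 34.3),
}
indices = {name: energy_index(de, nde) for name, (de, nde) in cases.items()}
# Cases ordered from lowest to highest energy level.
ranking = sorted(indices, key=indices.get)
```

The index separates the two unbalance magnitudes clearly, while the two positions at the same magnitude give values of similar size, which is the behaviour discussed in the conclusion.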


8 CONCLUSION 


The Mahalanobis-Taguchi strategy proved to be a
useful tool for analyzing multivariate data. This
characteristic is an advantage for producing a good
malfunction diagnosis in complex turbomachinery,
which has many installed sensors.

Considering the results of the MTS and of the energy
approach, the malfunctions are clearly detected in
both cases. Nevertheless, because the
energy method is based on the vibration amplitude
at the bearings, its results for the analyzed
cases show a much greater sensitivity to the
unbalance value than to the unbalance position.
On the other hand, the MTS method presents
different results for all cases, being sensitive to both
the unbalance position and magnitude in the analyzed
cases.

Considering these results, it can be said that MTS
has the potential for a more complete analysis than
methods based on more traditional approaches,
such as vibration amplitude analysis, especially in
cases with many monitored points.

However, some care should be taken during the
optimization step because, as seen in Section 7.3, if
the optimization had been carried out, the different
diagnoses would not have been possible using the same
Mahalanobis space.

Also, the initial selection of variables to construct
the Mahalanobis space is not an easy task: some
previous troubleshooting knowledge is necessary,
or the iterative procedure for validating the
measurement scale may take a long time.

This method has great potential to be applied
to standardized machines because, as they have similar
behavior, more abnormal conditions may be
identified with the same Mahalanobis space.

The runout compensation is also an important
procedure, mainly when constructing
the Mahalanobis space, because the “healthy”
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machine presents vibration signals levels that 
runout can disturb them. 

Finally, regardless of the monitoring system,
not all malfunctions can be identified:
as described in Section 7.2, two types of
unbalance presented the same vibration level as
the “healthy” machine at the drive and non-drive end
radial vibration probes.


ACKNOWLEDGEMENTS 


The authors thank Petrobras for the use of the
RotorLab software for academic purposes and CNPq for
the financial support.


REFERENCES 


Al-Badour, F.; Sunar, M. & Cheded, L. 2011. Vibration
analysis of rotating machinery using time-frequency
analysis and wavelet techniques. Mechanical Systems
and Signal Processing volume 25: 2083-2101.

API Standard 617 7th Edition. 2009. Axial and
Centrifugal Compressors and Expander-compressors
for Petroleum, Chemical and Gas Industry Services.
Washington D.C.: American Petroleum Institute.

Bently, D.E. & Hatch, C.T. 2002. Fundamentals of
Rotating Machinery Diagnostics. Minden: Bently
Pressurized Bearing Press.

Cudney, E.G.A.A.E.A. 2015. Mahalanobis-Taguchi
system: a review. International Journal of Quality &
Reliability Management volume 32 (3): 291-307.

He, M.; Cloud, C.H.; Byrne, J.M. 2005. Fundamentals
of Fluid Film Journal Bearing Operation and
Modeling. Proceedings of the Thirty-Fourth Turbomachinery
Symposium: 155-175.

He, Y.; Shi, L.; Shi, Z. & Sun, Z. 2017. Unbalance
Compensation for HTR-10GT: A Frequency-Domain
Approach Based on Iterative Learning Control.
Science and Technology of Nuclear Installations (ID
3126738): 1-15.

Jin, X. & Chow, T.W.S. 2013. Anomaly detection of
cooling fan and fault classification of induction motor
using Mahalanobis-Taguchi system. Expert Systems
with Applications volume 40 (issue 15): 5787-5795.

John, B. 2014. Application of Mahalanobis-Taguchi
system and design of experiments to reduce the field
failures of splined shafts. International Journal of
Quality & Reliability Management volume 31 (issue 6):
681-687.

Kececioglu, D.B. & Wang, W. 1998. Parameter Estimation
for Mixed-Weibull Distribution. IEEE Proceedings
Annual Reliability and Maintainability
Symposium: 247-252.

Lalanne, M. & Ferraris, G. 1998. Rotordynamics
Prediction in Engineering. Second edition. Hoboken: John
Wiley & Sons, Inc.

Lewis, E.E. 1994. Introduction to Reliability Engineering.
Evanston: John Wiley & Sons, Inc.

Soylemezoglu, A.; Jagannathan, S. & Saygin, C. 2011.
Mahalanobis-Taguchi System as a Multi-Sensor
Based Decision Making Prognostics Tool for
Centrifugal Pump Failures. IEEE Transactions on Reliability
volume 60 (4): 864-878.

Vance, J.; Zeidan, F. & Murphy, B. 2010. Machinery
Vibration and Rotordynamics. Hoboken: John Wiley &
Sons, Inc.

Walker, R.; Perinpanayagam, S. & Jennions, I. 2013.
Rotordynamic Faults: Recent Advances in Diagnosis and
Prognosis. International Journal of Rotating
Machinery volume 2013 (ID 856865): 1-12.

Weaver, B.; Tsukuda, T.; Rizvi, S.A.A.; Schwartz, B.;
Nichols, B.; Griffin, D.; Branagan, M.; Fittro, R.; Lin,
Z. & Wood, H. 2017. Experimental Measurements
of Turbomachinery Rotordynamics, Component
Performance, and Dynamic Control at ROMAC - A
Review. Journal of the Gas Turbine Society of Japan:
1-8.




Safety and Reliability - Safe Societies in a Changing World - Haugen et al. (Eds) 
© 2018 Taylor & Francis Group, London, ISBN 978-0-8153-8682-7 


Optimization of periodic inspection time of SIS subject to regular
proof testing


H. Srivastav, A.V. Guilherme, A. Barros & M.A. Lundteigen 
Department of Mechanical and Industrial Engineering, NTNU, Norway 


F.B. Pedersen & A. Hafver 
Group Technology and Research, DNV GL, Hovik, Norway 


F.L. Oliveira 
R&D Center, DNV GL, Rio de Janeiro, Brazil 


ABSTRACT: Periodic testing is a method to ascertain the availability of Safety Instrumented Systems
(SIS). These systems are generally passive and are activated only on demand. Testing is then required to
diagnose their current state and to take the corresponding maintenance action. However, the testing
procedure can damage some units of the SIS (especially the mechanical parts), so that the system as a
whole becomes more prone to failures. This situation is currently not well covered by standards, where it
falls under the umbrella of so-called imperfect testing. The decision maker in practice faces an
optimization problem where the objective is to determine the optimal compromise between an accurate diagnosis
of the current system state (high test frequency) and the possible failures or degradation provoked by the
testing procedure itself. The commonly used criteria to assess the performance of SIS are all related to the
mean downtime of the SIS between two tests. IEC 61508 provides such an analysis for multi-unit
SIS when all the units are assumed to follow exponential lifetime distributions. It cannot be applied in
this case, as some parts of the system have a time-varying failure rate which can increase after every test.
We propose the use of a Markov process to model the degradation of the mechanical parts upon test and
possible preventive maintenance after testing. Since the degradation due to tests is experienced at
deterministic dates, we use the modelling framework of multiphase Markov processes to calculate the mean
downtime. The paper focuses on explaining the trade-off between the frequency of testing
and PFDavg, and on finding the optimum frequency through simulations.


1 INTRODUCTION

A Safety Instrumented System (SIS) is often used to detect hazardous events and to mitigate their consequences at facilities and plants that produce or handle hazardous substances, e.g. hydrocarbon fluids and gases. Due to their criticality, such systems must comply with regulatory requirements and international standards on safety. IEC 61508 (1998) and related standards (such as IEC 61511 (2002) for the process industry sector) are key in framing the design and operation of SIS. One important requirement mandated by these standards is the need to verify, by quantitative analysis, that the safety performance is adequate in light of risk acceptance criteria. Most safety functions implemented by a SIS, the so-called Safety Instrumented Functions (SIFs), are seldom demanded, as normal operation is managed by a dedicated control system. According to the mentioned IEC standards, the SIFs are classified as operating in the low demand mode. This means that the SIFs are passive most of the time and are supposed to act only when needed (“on demand”). The reliability of low demand SIFs is measured by the average probability of failure on demand (PFDavg). PFDavg is calculated over a time interval between two proof tests and corresponds to the mean downtime per unit of time between proof tests. The same measure is also used to express the reliability requirement for each SIF, but then the associated required value is derived on the basis of a risk analysis (Jin et al. 2012). IEC 61508 defines four safety integrity levels (SIL), each corresponding to a specified range of PFDavg. For example, a SIF with a SIL 2 requirement must demonstrate that the PFDavg is between 10^-3 and 10^-2.
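The mapping from a PFDavg value to a SIL can be sketched as a simple lookup. The snippet below is an illustration only: it assumes the low-demand bands of IEC 61508 (SIL 1: 10^-2 to 10^-1, SIL 2: 10^-3 to 10^-2, SIL 3: 10^-4 to 10^-3, SIL 4: 10^-5 to 10^-4), and the function name is hypothetical.

```python
def sil_from_pfd_avg(pfd_avg: float) -> int:
    """Return the SIL (1-4) whose low-demand PFDavg band contains pfd_avg.

    Returns 0 when the value falls outside all tabulated bands.
    """
    # Band limits (lower inclusive, upper exclusive), assumed from IEC 61508.
    bands = {4: (1e-5, 1e-4), 3: (1e-4, 1e-3), 2: (1e-3, 1e-2), 1: (1e-2, 1e-1)}
    for sil, (low, high) in bands.items():
        if low <= pfd_avg < high:
            return sil
    return 0


# A SIF with PFDavg = 5e-3 lies in the SIL 2 band.
print(sil_from_pfd_avg(5e-3))
```

A value of 0 here only means "outside the tabulated ranges"; it is a convention of this sketch, not a category of the standard.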

The PFDavg can be quantified using different reliability models. These models are based on assumptions and simplifications, and in some situations they can lead to different results, depending on the dominating contributing factors. Low-demand SIS are periodically tested (proof tests) in order to confirm that they are able to act on demand. The length of the intervals between such tests is an important contributor to PFDavg. Normally, it is assumed that the proof tests are perfect and that the equipment is restored to an as-good-as-new condition (Shao-Ming et al. 1994). These assumptions imply that the proof tests are carried out in a manner and under conditions similar to a real demand, so that all dangerous failure modes, i.e. failure modes that result in a failure to carry out the SIF, are revealed. The assumptions also imply that no degradation is experienced by the SIS due to the test itself (a non-destructive test). In reality, however, proof tests may not be perfect, and the equipment may degrade from the exposures applied during the tests. The latter effect is also identified by Brissaud et al. (2010). Rausand (2014a) gives a practical example of how the proof test can degrade a Downhole Safety Valve (DHSV) installed to protect against releases from oil and gas wells. The DHSV is exposed to harsh conditions when operated (due to high pressure drops and in some cases high temperature). A perfect proof test would imply that the DHSV is closed against full flow from the well (which would be the real demand situation). However, this type of exposure is known to degrade the performance of the DHSV, and the proof test is therefore carried out under non-perfect (imperfect) test conditions by closing the DHSV with the downstream valves already closed. Still, it is of interest to better understand the impact of perfect versus imperfect test conditions. One approach has been suggested by Oliveira et al. (2016), where an additive test-step varying (ATSV) model was elaborated to reflect the increment of the failure rate after each proof test in a blowout preventer (BOP) system. Yet, it is still not clear how to account for the full effect of degradation in the quantification of PFDavg. Reviews of the modelling framework were performed by Rouvroye & Brombacher (1999) and Bukowski (2005), and both promoted the use of Markov processes when states other than functioning and failed are to be included.

The objective of this paper is to demonstrate the use of a Markov process to model the combined effects of degradation due to equipment wear out (aging) and the exposure from the proof test. A simple homogeneous Markov process cannot be used, since the transition rates change after each proof test. Instead, a multiphase Markov approach is suggested. This method was applied in Strand and Lundteigen (2015) to assess BOP reliability and also in Innal et al. (2016) to establish new generalized formulas with repair time. Compared to simple Markov processes, multiphase Markov processes allow one to take into account changes of the transition rates at deterministic time points (Wu et al. 2018). The paper is organized as follows: Section 2 provides the problem statement and assumptions. The model is discussed in Section 3, within a multiphase Markov framework. Section 4 describes the model implementation in terms of discrete event simulation and Monte Carlo simulations. The last section is devoted to numerical results and the consequent optimization problem.


2 MODELLING FRAMEWORK AND 
MODEL ASSUMPTIONS 


PFDavg is defined by Rausand (2014b) as:

“If a demand of safety function of the item occur at a random time in future, the PFDavg is the average probability that the item is not able to react and perform its safety function in response to demand.”

Theoretically, the PFDavg value stems from the risk analysis. For practical purposes, it is estimated on the basis of the reliability model of the SIF. In general, an estimator of PFDavg can be interpreted as the long-run average value of the unavailability, and it can be defined as:

\widehat{PFD}_{avg} = \frac{1}{n\tau} \sum_{k=1}^{n} \int_{(k-1)\tau}^{k\tau} U(t)\,dt

where:

\widehat{PFD}_{avg} = Estimated average probability of failure on demand
n = Total number of inspections performed
τ = Duration between two consecutive inspections
U(t) = Unavailability of the system at time t
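As a minimal sketch of how such a long-run average can be evaluated from a simulated history, assume that each inspection interval contributes a known downtime (the time spent failed before the next inspection), so that the integral of U(t) over that interval is replaced by the observed downtime. The function name and the sample numbers are illustrative and are not taken from the paper.

```python
def estimate_pfd_avg(downtimes, tau):
    """Average unavailability: total downtime divided by total observed time n * tau."""
    n = len(downtimes)
    if n == 0 or tau <= 0:
        raise ValueError("need at least one interval and a positive tau")
    if any(d < 0 or d > tau for d in downtimes):
        raise ValueError("each downtime must lie in [0, tau]")
    return sum(downtimes) / (n * tau)


# Four intervals of 730 h each; the unit failed 73 h before the third inspection.
print(estimate_pfd_avg([0.0, 0.0, 73.0, 0.0], tau=730.0))  # 0.025
```

Averaging this quantity over many independent histories gives the Monte Carlo estimate of PFDavg.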


Inspection is an integral part of the proof test and reveals the state of the system at the time of the proof test. For all calculations, the frequency of inspection is equal to the frequency of proof testing. In this situation PFDavg is the average proportion of time that the multiphase Markov process spends in the failed state. It is the dangerous failure rate that is considered in the calculation of PFDavg, i.e. the failures that can prevent the SIF from functioning on demand.

The modelling framework used for this problem is described hereafter.


2.1 Modelling framework 


There are basically two different approaches to modelling degradation due to equipment wear out (aging) and degradation due to proof tests. The first is inherited from reliability theory: the degraded unit is modelled by a binary random variable moving from a working state to a failed state, and the transition rate between these two states increases with time or with the number of tests experienced by the unit. In other words, the unit has a lifetime law with an increasing failure rate which is a function of the number of tests. The second is more common among people working on maintenance optimization. The unit is modelled by a random variable with more than two states. The state space can be finite and discrete, infinite and discrete, or continuous. The main idea is that there exist intermediate states between the new one and the failed one. All the intermediate states can be considered as working states, possibly with degraded performance, and they are taken as a health indicator of the system. They often correspond to degradation phenomena or symptoms that can be monitored, diagnosed and used as a decision indicator to trigger preventive maintenance actions. The advantages of such models are that:


• We can relate degradation phenomena to the performance of the system (here 1 − PFD).

• We can use the intermediate states to define and optimize condition-based preventive maintenance.


However, while expert judgment may be sufficient to define the number and nature of the intermediate states, the distribution of the sojourn time in each state may be difficult to estimate. A model relying only on a lifetime law and a binary random variable may then be more reasonable.

Most of the existing models that are described 
in the introduction are inherited from Reliabil- 
ity theory. The calculation of the PFD for SIS is 
mainly based on binary random variables. In this 
paper, we want to explore the use of intermediate 
states in a specific context when the tests have a 
negative impact on the system condition. We want 
to investigate such a framework because 


• The literature, guidelines and practices related to the negative impact of testing should be linked at some point to the identification of some degradation mechanism.

• This seems to be a good way to prepare for condition-based maintenance and the optimal use of condition monitoring.


As a preliminary study, we propose a model with two intermediate states. This number is chosen arbitrarily and we do not investigate any preventive maintenance. We only aim to show that there is a trade-off between the negative effect of tests (pushing the system randomly into more degraded states) and the added value of performing more tests to detect failures earlier.


Equipment wear out is modelled by a finite number of intermediate degraded states between the new state and the failed one. Degradation due to proof tests is modelled by an increase of the transition rates between two states at inspection time. In addition, direct transitions are possible from any functioning state to the failed one: they model sudden failures that are not due to wear. Since the unit is passive, all failures are undetectable without testing, whatever the failure mode. Finally, in order to allow analytical formulations to be developed later, we choose a Markovian framework. Because the transition rates change at inspection times, we refer to it as a multiphase Markov process. The current paper is devoted only to Monte Carlo simulations, in order to demonstrate the relevance of the problem statement and the trade-off that arises from the negative effect of testing. Analytical formulations seem to be tractable but are left for further work.


2.2 Assumptions 


To model degradation using a multiphase Markov process, we make the following assumptions:


• In general, SIF equipment is exposed to two types of dangerous failures:

– Dangerous detected (DD) failures, i.e. the dangerous failures revealed by online diagnostics.

– Dangerous undetected (DU) failures, i.e. the dangerous failures that are not DD and which are to be revealed by regular proof tests.

• For simplicity, we only consider the effect of DU failures in our analysis, since the equipment considered in our study (valves) has no or very limited facilities for diagnostics. The effect of DD failures, for modelling purposes beyond the equipment type in our study, will be considered in a future paper. From now on, the terms “detected” and “detectable” denote DU failures that are revealed by the proof test, in light of the real (non-perfect/imperfect) test conditions.

• DU failures are of two types: they can be sudden or they can be due to a progressive degradation process, hereafter named aging. Sudden failures are modelled by a failure rate λ_u, and aging is modelled by several intermediate states (degradation levels) between the new state and the failed one, with associated transition rates. Whatever the failure mode, the system stays in the failed state until the next inspection, and is then repaired as per the chosen maintenance policy.

• There are four degradation levels, A, B, C and D, which are the states of a Markov process (A: system working with no degradation; B: system working with degradation of level 1; C: system working with degradation of level 2; D: failed).

• In our model the following instantaneous transitions are possible:

– The system can always jump to the next higher state of degradation due to the effect of aging.

– The system can always jump to the failed state due to sudden DU failures.

– The system cannot go to a less degraded state until maintenance is performed.

• The instantaneous transition rates of the multiphase Markov process are represented in Figure 1.


• In Figure 1, λ_a represents the effect of aging on the system, which changes every time a proof test is performed on the system. We consider that the proof test has a negative effect on the system condition (a shock leading to extra stress) and that this negative effect increases the aging transition rates. The impact of the negative effect of testing is modelled as follows:

\lambda_a(t_0^+) = \begin{cases} 1.01\,\lambda_a(t_0^-) & \text{current state A} \\ 1.03\,\lambda_a(t_0^-) & \text{current state B} \\ 1.05\,\lambda_a(t_0^-) & \text{current state C} \end{cases} \qquad (1)

We assume here that a proof test is performed at t = t_0 and that the current state is the state of the system at t = t_0^-.

– The underlying idea behind this model is that the negative impact of the proof test increases with the degradation of the system.


Figure 1. Instantaneous transition rates for the multiphase Markov process.


• Between two consecutive proof tests, λ_a and λ_u remain constant.

• When a failure is detected by the proof test, we assume that the mean time to repair the system is negligible.
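Under the assumptions listed above, the transition-rate (generator) matrix within one phase, i.e. between two consecutive proof tests, can be written out explicitly. The matrix below is a hedged reconstruction from the listed transitions rather than a formula stated by the authors. States are ordered A, B, C, D; λ_a denotes the aging rate and λ_u the sudden DU failure rate (symbol names assumed here). State D is absorbing within a phase because a failure stays hidden until the next test.

```latex
Q =
\begin{pmatrix}
-(\lambda_u + \lambda_a) & \lambda_a & 0 & \lambda_u \\
0 & -(\lambda_u + \lambda_a) & \lambda_a & \lambda_u \\
0 & 0 & -(\lambda_u + \lambda_a) & \lambda_u + \lambda_a \\
0 & 0 & 0 & 0
\end{pmatrix},
\qquad
\frac{d\mathbf{P}(t)}{dt} = \mathbf{P}(t)\,Q
```

In state C both aging and sudden failure lead to D, so the two rates add in the last column. At each proof test λ_a is updated according to Equation (1), which gives a new matrix Q for the next phase.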


3 METHODOLOGY 


The multiphase Markov process was analyzed using discrete event simulation, with an exponential distribution for the time spent in each state. The system starts in state A; the degradation time (T_a) and the failure time (T_u) are sampled from exponential distributions with respective parameters λ_a(t) and λ_u(t). The next state of the process is then chosen based on the minimum of (T_a, T_u, τ). Some specific decisions were made for the modelling:

• If the system goes to the failed state (state D), the unavailability is calculated by measuring the time spent by the system in state D. On inspection, the maintenance action is taken and the process is re-initiated.

• If the minimum is τ, the system stays in the same state for the duration between two consecutive proof tests. The inspection is then performed and the process is repeated with the increased λ_a(t).

• If the system goes to a more degraded state, T_a and T_u are sampled again from the corresponding exponential distributions. The minimum is now taken over the new T_a, the new T_u and τ − T_a, the time remaining until the next proof test. The process repeats until the system goes to the failed state. Once the system fails, the unavailability is calculated, the maintenance action is taken and the process is re-initiated.


The following maintenance policies were considered when the system was found to be in the failed state on inspection:

• As-good-as-new (AGAN): The system is reset to the new state (A) and the transition rate of the system is reset to λ_a[i + 1] = λ_a[1], i.e. we consider that the system is as-good-as-new and that the effect of aging is removed by the maintenance.

• As-bad-as-old (ABAO): On maintenance, the new state of the system is set to C and the transition rate of the system is kept at λ_a[i + 1] = λ_a[i].
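The simulation procedure and the two maintenance policies can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `lam_a0` (initial aging rate) and `lam_u` (sudden-failure rate) are hypothetical, times are in hours, repair is instantaneous, and the multiplier of Equation (1) is applied only when the test finds the unit working (after a repair, the rate follows the AGAN or ABAO rule instead).

```python
import random

# Multipliers applied to the aging rate at each proof test (Equation 1).
TEST_FACTOR = {"A": 1.01, "B": 1.03, "C": 1.05}
NEXT_STATE = {"A": "B", "B": "C", "C": "D"}


def simulate_pfd_avg(tau, mission_time, lam_a0, lam_u, policy="ABAO", rng=None):
    """Simulate one history and return total downtime / total observed time."""
    rng = rng or random.Random()
    n_tests = int(mission_time // tau)
    if n_tests < 1:
        raise ValueError("mission_time must cover at least one test interval")
    state, lam_a = "A", lam_a0
    downtime = 0.0
    for _ in range(n_tests):
        elapsed = 0.0  # time since the previous proof test
        while state != "D":
            t_age = rng.expovariate(lam_a)   # time to the next degradation step
            t_fail = rng.expovariate(lam_u)  # time to a sudden failure
            step = min(t_age, t_fail)
            if elapsed + step >= tau:
                break  # the proof test arrives before any transition
            elapsed += step
            state = "D" if t_fail < t_age else NEXT_STATE[state]
        if state == "D":
            downtime += tau - elapsed  # failure stays hidden until the test
            if policy == "AGAN":
                state, lam_a = "A", lam_a0  # state and rate fully reset
            else:
                state = "C"  # ABAO: repaired to state C, rate kept
        else:
            lam_a *= TEST_FACTOR[state]  # test shock on a working unit
    return downtime / (n_tests * tau)


def mean_pfd_avg(n_runs, seed=0, **kwargs):
    """Average simulate_pfd_avg over n_runs independent histories."""
    rng = random.Random(seed)
    return sum(simulate_pfd_avg(rng=rng, **kwargs) for _ in range(n_runs)) / n_runs
```

Because the sojourn times are exponential, resampling both times at the start of each interval is consistent with the memoryless property. Calling `mean_pfd_avg` for a range of `tau` values, with rates chosen by the user, gives one PFDavg curve per maintenance policy.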


4 RESULTS AND DISCUSSION 
Recall that PFDavg is the performance measure. Simulations were performed to estimate PFDavg by calculating the average unavailability of the system. The proof test interval (τ), i.e. the time between two consecutive inspections/proof tests, is varied from 3 days to 1 year. We considered the following values of τ for the simulations: τ = (3 days, 6 days, 15 days, 21 days, 1 month, 2 months, 3 months, 4 months, 5 months, 6 months, 7 months, 8 months, 9 months, 10 months, 11 months, 12 months).

The values of parameters such as λ_a, λ_u and the mission time are chosen based on industry guidelines on the performance measure. The mission time of the system for the purpose of simulation is chosen to be 5 years. Based on industrial guidelines, the impact factor of the proof test is taken as per Equation (1). For each value of τ, 500 random realizations were simulated to obtain the average unavailability of the system.
Figure 2 shows the estimated value of PFDavg of the system for different values of τ, together with the borderlines of SIL 1 and SIL 2: 0.01 < PFDavg < 0.1 corresponds to the range of SIL 1, and PFDavg < 0.01 to the range of SIL 2. The left-hand plot in Figure 2 shows that when both λ_a and λ_u are of the order of 10^(?) per hour, PFDavg remains within SIL 2 for both the AGAN and ABAO maintenance policies for 15 days < τ < 1 year. The right-hand plot in Figure 2 shows that when λ_a and λ_u are increased to the order of 10^(?) per hour, PFDavg increases for both maintenance policies. For the AGAN maintenance policy, PFDavg leaves the range of SIL 2 and enters SIL 1 at τ = 15 days, and leaves the range of SIL 1 at τ = 6 months. For the ABAO maintenance policy, PFDavg leaves the range of SIL 1 at τ = 4 months and for τ < 15 days, and stays within the range of SIL 1 for an optimal proof test interval (15 days < τ < 3 months).

In Figure 2, the plots for the AGAN maintenance policy show that the information gained through inspection outweighs the negative effect of testing. This is because, with the AGAN maintenance policy, the system does not carry the history of the past tests that it has experienced.

Figure 2. Effect of different maintenance policies on PFDavg.

The important conclusion that can be drawn from Figure 2 is that, when the ABAO maintenance policy is chosen, the PFDavg of the system shows a trade-off between the negative effect of performing a proof test and the information gained by performing it. In other words, when the system undergoes a high frequency of proof tests, the unavailability represented by PFDavg increases instead of decreasing as it did for the AGAN policy. At the same time, when the frequency of proof tests is reduced, the user does not get enough information about the state of the system. Therefore, there exists an optimum frequency of testing which minimizes the value of PFDavg in Figure 2.
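The existence of such an optimum can be illustrated with a deliberately simplified deterministic toy model, which is not the multiphase Markov model of this paper. It assumes that the unit is working at the start of every interval, that every test multiplies the aging rate by a fixed factor of 1.05, and that the mean unavailability over one interval with constant hidden-failure rate Λ is 1 − (1 − e^(−Λτ))/(Λτ), which is approximately Λτ/2 for small Λτ. The rate values, the 730 h month and all names are assumptions chosen for illustration.

```python
import math


def interval_pfd(rate, tau):
    """Mean unavailability over one test interval for a constant hidden-failure rate."""
    x = rate * tau
    return 1.0 + math.expm1(-x) / x  # equals 1 - (1 - exp(-x)) / x


def toy_pfd_avg(tau, mission_time=43800.0, lam_a0=1e-6, lam_u=5e-6, factor=1.05):
    """Average interval unavailability when each test multiplies the aging rate."""
    n = int(mission_time // tau)
    total = 0.0
    for k in range(n):
        # Aging rate after k tests, plus the constant sudden-failure rate.
        total += interval_pfd(lam_u + lam_a0 * factor**k, tau)
    return total / n


# Test intervals in hours: 3, 6, 15, 21 days, then 1 to 12 months of 730 h.
grid_hours = [72, 144, 360, 504] + [730 * m for m in range(1, 13)]
best_tau = min(grid_hours, key=toy_pfd_avg)
print(best_tau, toy_pfd_avg(best_tau))
```

With these assumed values the minimum falls strictly inside the grid: very short intervals accumulate many test shocks, while very long intervals leave failures hidden for too long.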

Figure 3 shows the effect on PFDavg of changing the value of λ_a while keeping the value of λ_u constant at 5 × 10^(?) per hour. Note that the trade-off between the multiplicative negative effect of testing at a high test frequency and the loss of information at a low test frequency is an attribute of the ABAO maintenance policy only. Hence, the maintenance policy considered in Figure 3 is ABAO. Figure 3 shows that PFDavg remains within the range of SIL 2 when λ_a < 5 × 10^(?) per hour for τ ∈ [15 days, 5 months]. The plots show that for each value of λ_a there exists an optimum value of τ for which PFDavg attains a minimum. It is also observed that PFDavg increases with increasing values of λ_a.

Figure 3. Effect of changing failure rate on PFDavg.

Figure 4 shows the effect on PFDavg of changing the value of the failure rate λ_u while keeping the value of λ_a constant at 5 × 10^(?) per hour. The ABAO maintenance policy is considered for these plots, for the same reasons as for the plots in Figure 3. The left-hand plot in Figure 4 shows that when λ_u is increased from 10^(?) per hour to 10^(?) per hour, the shape of the PFDavg curve changes from a flat convex to a steep convex one, indicating that PFDavg increases with λ_u.

The optimum proof test interval, which minimizes PFDavg, is clearly visible in Figure 4 for higher values of λ_u, whereas for lower values of λ_u the curve needs to be magnified to observe the optimum, as shown in the right-hand plot of Figure 4.

Figure 4. Effect of changing failure rate on PFDavg.


5 CONCLUSIONS AND IDEAS FOR 
FUTURE WORK 


Under the AGAN maintenance policy, the technical state of the system is restored to “as-good-as-new” after each regular proof test, meaning that the system does not accumulate the negative effect of the regular proof tests after maintenance. Hence, with AGAN we can make PFDavg as small as required by increasing the frequency of proof testing. However, the AGAN maintenance policy may not be economical in most practical situations, so we focus on the ABAO maintenance policy in this section.


5.1 Conclusions


Our case study showed that, under the ABAO maintenance policy, there are two competing forces that can increase PFDavg. The first is the multiplicative negative effect of frequent proof tests, despite the maintenance that is carried out as part of the tests. This force becomes more dominant when the frequency of proof testing is high. The second is the information obtained about the status of the system by carrying out the proof test. The second force favours increasing the frequency of proof testing to lower PFDavg, while the first favours decreasing it to obtain the same effect on PFDavg. An optimum can be obtained for a regular proof test interval, which can be verified against the constraints of the SIL requirement. It is therefore suggested that there exists an optimum frequency of proof testing that minimizes the PFDavg of the system whenever the following conditions hold:


• The regular proof tests, which involve inspection of the technical state of the system, have some negative effect on the performance of the system due to test conditions and exposures.

• Some dangerous failure modes of the system can only be revealed by the regular proof tests, and not by other means (e.g. diagnostic testing).

• The system is maintained with the ABAO maintenance policy, meaning that the technical state is not “as-good-as-new” after a regular proof test. The ABAO maintenance policy accumulates the negative effects of regular proof tests.


5.2 Ideas for future work 


The above study was performed assuming no DD failures and a negligible mean time to repair. It would be interesting to see the effect of adding DD failures and a non-negligible mean time to repair to the study. Analytical solutions need to be developed to find the exact solution of the stochastic differential equation involved in the above situation. The two degraded states were chosen arbitrarily in the above study; the connection between the physical phenomena of the degradation and the quantification of the degraded states needs to be explored. The effect of predictive maintenance and redundancies on PFDavg in this situation also needs to be studied.
ABSTRACT: Typical measurement technologies are coordinate measuring technology, structured-light 3D scanning and Computed Tomography (CT) scanning. In this paper, an innovative measurement and analysis strategy for the statistical comparison of these technologies is developed. Furthermore, the measurement results can be checked with regard to plausibility and sensitivity. This strategy can be used not only for company-internal purposes but also for the alignment and harmonization of product and production improvement activities and corrective actions between original equipment manufacturers (OEMs) and suppliers. Plausibility checks of the gathered measurement results are also facilitated when different technologies are applied.


1 INTRODUCTION 


Measuring, mapping and analysing parts or objects is of interest in many areas of production and product development. Especially in the automotive industry, specification requirements and product complexity are steadily increasing, which demands high performance from measurement methods. Depending on product geometry, complexity and critical specifications, a proper measurement method has to be chosen. Every measurement technology is basically suitable for measuring the product geometry, but each has various advantages and disadvantages. A structured-light 3D scanner is well suited for quick measurement of the entire geometry but is sensitive to shadowing and to product undercuts. An advantage of coordinate measuring technology is the high measuring accuracy at the touched measurement points, which can be used for calculation and reconstruction of the actual product shape. In contrast, the effort required to measure free-form surfaces is very high, which is a great disadvantage of this method. With both the structured-light 3D scanner and coordinate measuring technology, product cavities are measurable only after the product has been sawn open. A CT scan makes use of computer-processed combinations of many X-ray images taken from different angles and allows undercuts and cavities to be surveyed, but it cannot be used for all component materials (e.g. lead).
It is common in practical applications for a complex geometry to be measured with several different measurement technologies. The problem is that the obtained results cannot be directly compared with each other because of the different technology principles. It is uncertain whether all three technologies capture the product geometry with sufficient precision.
Therefore, a study based on the statistical comparison of the three measurement techniques regarding various dimensions of the measured objects is presented in this paper. For the sake of overall understanding, the measurement methods are first discussed with regard to their applicability as well as their advantages and disadvantages. Subsequently, the measured part, the material it is made of and the measured points are presented. The statistical methods used for the comparison, in this case non-parametric statistical tests, are discussed in detail. For the purpose of a sensitivity study, Monte Carlo simulated values of various dimensions are analysed first. In other words, it is necessary to understand the sensitivity of the tests before presenting and discussing the real measured values. Finally, the results are discussed in detail and the study is concluded.


2 MEASUREMENT METHODS 


The following measurement methods have various 
advantages and disadvantages. In general there are 
three measurement methods, which were also used 
in this work: 


• Coordinate measuring machine (CMM)
• Structured-light 3D scan method (3DSM)
• Computer tomography (CT)
The coordinate measuring machine is based on a tactile measuring method. Its main components are the measuring head with a touch sensor, the measuring table and the positioning system with incremental sensor technology. The measurement object and the measuring head move relative to each other; CMM is therefore suitable for measuring spatial coordinates. In comparison to 3DSM and CT, CMM provides the best measurement results, with the highest measurement precision at the touched measurement points. Due to the relative movement of measurement object and head, measuring an object point by point takes a lot of time. Furthermore, it is difficult to measure objects with free-form surfaces and it is impossible to measure objects with undercuts in a non-destructive way.

The structured-light 3D scan method is an optical, non-contact method whose setup consists of a projector and two cameras on a tripod. The measurement result is a point cloud of the surface of the measurement object, which subsequently needs to be mapped by computer algorithms. It is also possible to create a CAD model from these points. The quick measurement of the entire object is the major advantage of 3DSM and makes it a good option for measuring parts in a production line. Disadvantages are large scattering effects in measurement result and precision due to shadowing. Furthermore, it is not possible to measure objects with undercuts in a non-destructive manner.

Computer tomography is an imaging method. Its construction is similar to that of an X-ray machine, although the operating principle is slightly different. The difference is that during a CT measurement many images are taken from various angles and directions and recorded in a systematic way. Subsequently, a computer combines the recorded images with the help of complex algorithms and maps them to a CAD model. The CT method is not very fast, but it is able to measure undercuts without destroying the measured object. It is therefore well suited to measuring complex objects and geometries. The measurement result and precision are strongly influenced by the material of the measurement object and by many setup parameters of the machine itself. The thicker and denser the material, the more the measurement result and precision are negatively affected by scattering effects.


3 MEASURED PART—DTM INTAKE 
SOCKET 
3.1 Geometry 


The part chosen for the study is an aluminium intake socket from the DTM (German Touring Car Masters). Since the material it is made of is very important for the CT measurement, a spectral analysis was undertaken in order to determine the material composition. The results are described in the following section.

The overall geometry was required to comprise various measured dimensions as well as geometric tolerances. Furthermore, it was required to be complex yet still manageable for the measurements. The chosen intake socket, presented in Figure 1, was a good compromise between all these attributes.

Further requirements on the object were different measurement units (millimetres or degrees). The object was required to comprise various surfaces and curvatures, as well as cavities and undercuts. The material had to be measurable in a CT. It should additionally be stated that the part was prepared and refurbished for the purpose of the study. The exact composition of the material was unknown at the beginning and had to be analysed before starting the measurements.


3.2 Material 


As stated in the previous section, the material
of the part had to be suitable for the CT system.
Therefore, the material composition of the chosen
part had to be analysed first. For this purpose, a small
splinter was removed from the inner part of the socket
(in the upper flange). The splinter was then prepared
and analysed by energy-dispersive X-ray spectroscopy
(Liptak 2003). The result of the spectroscopy is shown
in Figure 2.


Figure 1. Measured object: intake socket from
automotive engineering.

The abscissa of Figure 2 shows the energy in
[keV], and the ordinate shows the intensity in counts
per minute [cpm] of a certain element in the composition.
The main peak of the diagram indicates an aluminium
alloy. However, a proper estimation of the material
composition cannot be performed without a material
expert. Therefore, the overall composition was discussed
with a material scientist, who concluded that the
material is, with high probability, an AlMgSi
wrought alloy.

The small amounts of various other elements
are impurities. Since the analysed material is an
aluminium wrought alloy, it is well suited for the
CT measurement.


3.3 Measured points 


The 19 points chosen for the measurement are
presented in Figure 3. A precise description of each
point is out of scope for this paper and not necessary
for understanding the performed research, and is
therefore omitted. Nonetheless, the chosen points
represent a variety of lengths and angles with various
geometric tolerances and tolerance specifications.
Hence, the chosen points can be considered
representative for parts of similar dimensions.

Figure 2. Spectral analysis of the material: intake
socket from automotive engineering.

Figure 3. Measured points of the DTM socket.


4 SAMPLE ANALYSIS BASED ON TEST STATISTICS


The first step of the analysis of the measurements
is to perform various hypothesis tests in order to
select the proper analysis methods. Therefore, a
goodness of fit test with the null (and alternative)
hypothesis

$H_0: F = F_0$ (and $H_1: F \neq F_0$) (1)

that the sample data follow a certain distribution
(either normal or Weibull) has to be performed first.
For the normal distribution, the Kolmogorov-Smirnov
and Shapiro-Wilk tests (Hartung 1998) are in common
use. Accordingly, the Anderson-Darling,
Kolmogorov-Smirnov and Chi-squared tests (Sachs 2009)
can be applied for the Weibull distribution.
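A goodness of fit test of this kind can be sketched in a few lines. The following Python example is only an illustration and not the implementation used in the study: it computes the Kolmogorov-Smirnov distance between a sample and a normal distribution fitted to that sample, and compares it with the asymptotic critical value 1.36/sqrt(n) for a significance level of about 5%. The function names and the demo data are assumptions. Because the parameters of the normal distribution are estimated from the sample itself, this critical value is conservative (a Lilliefors-type correction would be stricter).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(sample):
    """Return the Kolmogorov-Smirnov distance D between the empirical
    distribution of `sample` and a normal distribution fitted to it."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))  # F_0 with estimated parameters
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # Largest gap just after and just before the ECDF step at x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def rejects_normality(sample, coefficient=1.36):
    """Reject H_0 when D exceeds the asymptotic critical value c / sqrt(n).

    c = 1.36 corresponds to a significance level of about 5 %.
    """
    return ks_statistic_normal(sample) > coefficient / sqrt(len(sample))


# Deterministic demo data (hypothetical): 20 normal quantiles and a
# sample of 20 values that contains one gross outlier.
normal_like = [NormalDist().inv_cdf((i + 0.5) / 20) for i in range(20)]
with_outlier = [0.0] * 19 + [100.0]

print(rejects_normality(normal_like))   # False
print(rejects_normality(with_outlier))  # True
```

With samples of only 20 values, as used in this study, such a test has limited power, which is one reason to fall back on non-parametric statistics when the results are mixed.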

These tests are performed for two different
tasks. The first is the analysis of the distributions
with regard to the measurement technologies themselves;
the second concentrates on the distributions of the
measured points. The second analysis is mainly of
interest for the different groups of geometric tolerances.
In simple words, it is analysed whether, e.g., the angles
can always be fitted with the same distribution
function. A further purpose of the goodness of
fit tests is the proper choice of test statistics:
parametric for normally distributed samples and
non-parametric in all other cases (Hinz 2014).

The goodness of fit tests applied for both cases
provide heterogeneous results. This means that no
specific distribution function can be fitted to all
samples, whether grouped by technology or
by measured point. Therefore, non-parametric
statistics are applied in the further analyses.

For the comparison of samples, various significance
tests are available for different applications.
Non-parametric significance tests do not
rely on a specific distribution. Compared with
parametric significance tests for similar application
cases, they cover further application cases
and offer benefits such as simple calculation and
detection of randomness in the data.

For the purpose of this study, two non-parametric
statistical tests were performed:

the Mann-Whitney U test with the null (and
alternative) hypothesis

$H_0: F(z) = G(z)$ (and $H_1: F(z) \neq G(z - \theta)$) (2)

that the samples have the same location (focus), and
Levene’s test with the null (and alternative)
hypothesis

$H_0: F(z) = G(z)$ (and $H_1: F(z) \neq G(\theta z)$) (3)

that the variances of the samples are equal.

A visual example of the statistical comparison
based on the mentioned tests is shown in Figure 4.
Here, two different samples obtained with the
CT technology are statistically compared with each
other.

Figure 4. Boxplot: comparison of the samples.
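To make the location test concrete, the Mann-Whitney U test can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the procedure used by the authors: the p-value is taken from the normal approximation of U without tie or continuity correction, and the function names and demo data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based ranks in the tie
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U, two-sided p-value) using the normal approximation."""
    x, y = list(x), list(y)
    n1, n2 = len(x), len(y)
    ranks = average_ranks(x + y)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2     # U statistic of sample x
    mu = n1 * n2 / 2                            # mean of U under H_0
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p


# Two clearly shifted samples and two interleaved samples (hypothetical data).
print(mann_whitney_u(range(1, 11), range(11, 21)))      # U = 0, p far below 0.05
print(mann_whitney_u(range(1, 20, 2), range(2, 21, 2)))  # U = 45, p well above 0.05
```

A small p-value leads to the rejection of the null hypothesis of equal location; for the sample size of 20 used here the normal approximation is usually considered adequate.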


5 SENSITIVITY STUDY 


Prior to the study based on real data, a sensitivity
analysis was performed on the comparison of various
samples based on simulated data. The main aim of the
sensitivity analysis was to detect the critical boundaries
at which the tests distinguish the samples. In
simple words, it is analysed when the samples
can be regarded as significantly different with
regard to the measured value itself, the scattering
of the value and the number of decimal digits of
the measurement technology.

For the purpose of the study, 27 samples
with a sample size of 20 measurements each were
Monte-Carlo simulated according to the following
conditions:


• All values are simulated with three decimal digits
(due to the maximal resolution of the measurement
technologies)

• For every measured magnitude, three different
values have been simulated: mean, upper specification
limit, and lower specification limit

• All values were assumed to be normally distributed


Based on the given assumptions, samples with
different magnitudes have been simulated:
19 ± 0.05; 33 ± 0.5; and 136 ± 0.5, which, combined
with the three simulated scatterings (see Table 1),
provides 27 samples in total. Note that the values have no
units, since the unit has no influence on the analysis
itself and the values can represent any measurable quantity.
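The simulation set-up can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' code: the 27 samples are interpreted as three magnitudes times three centre values (lower limit, mean, upper limit) times three scatterings (0.005, 0.05, 0.5), and the seed, function names and data layout are hypothetical. The sketch also includes Levene's statistic W based on absolute deviations from the group means, so that the scattering of two simulated samples can be compared.

```python
import random
from statistics import mean

SAMPLE_SIZE = 20
SCATTERS = (0.005, 0.05, 0.5)
MAGNITUDES = ((19, 0.05), (33, 0.5), (136, 0.5))  # (nominal value, tolerance)


def simulate_samples(seed=1):
    """Return {(centre, scatter): sample} for all 27 combinations."""
    rng = random.Random(seed)
    samples = {}
    for nominal, tol in MAGNITUDES:
        for centre in (nominal - tol, nominal, nominal + tol):
            for s in SCATTERS:
                # Normally distributed values, rounded to three decimals.
                samples[(centre, s)] = [
                    round(rng.gauss(centre, s), 3) for _ in range(SAMPLE_SIZE)
                ]
    return samples


def levene_w(*groups):
    """Levene's statistic W from absolute deviations about the group means."""
    z = []
    for g in groups:
        centre = mean(g)
        z.append([abs(v - centre) for v in g])
    n_total = sum(len(g) for g in z)
    k = len(z)
    grand = sum(sum(g) for g in z) / n_total
    group_means = [mean(g) for g in z]
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(z, group_means))
    within = sum((v - m) ** 2 for g, m in zip(z, group_means) for v in g)
    return (n_total - k) / (k - 1) * between / within


samples = simulate_samples()
w = levene_w(samples[(19, 0.005)], samples[(19, 0.05)])
print(f"Levene W = {w:.2f}")
```

For two samples of 20 values each, W is compared with the 95% quantile of the F distribution with (1, 38) degrees of freedom, which is about 4.10; larger values indicate significantly different scattering.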

The simulated samples are analysed with both
the Mann-Whitney U test and Levene’s test. Exemplary
results are shown in Table 1. Here, the first column
shows the simulated magnitude and the second
shows the compared simulated scatterings.


Table 1. Exemplary results of the statistical analysis of
simulated values.

Simulated value   Compared scattering        U-Test               Levene’s Test
μ = 19            s = 0.005 vs. s = 0.05     μ_0.005 = μ_0.05     σ_0.005 ≠ σ_0.05
                  s = 0.005 vs. s = 0.5      μ_0.005 = μ_0.5      σ_0.005 ≠ σ_0.5
                  s = 0.05 vs. s = 0.5       μ_0.05 = μ_0.5       σ_0.05 ≠ σ_0.5
μ = 33            s = 0.005 vs. s = 0.05     μ_0.005 = μ_0.05     σ_0.005 ≠ σ_0.05
                  s = 0.005 vs. s = 0.5      μ_0.005 = μ_0.5      σ_0.005 ≠ σ_0.5
                  s = 0.05 vs. s = 0.5       μ_0.05 = μ_0.5       σ_0.05 ≠ σ_0.5
μ = 136           s = 0.005 vs. s = 0.05     μ_0.005 ≠ μ_0.05     σ_0.005 ≠ σ_0.05
                  s = 0.005 vs. s = 0.5      μ_0.005 = μ_0.5      σ_0.005 ≠ σ_0.5
                  s = 0.05 vs. s = 0.5       μ_0.05 = μ_0.5       σ_0.05 ≠ σ_0.5

The results of the statistical tests are shown in
the columns U-Test (short for Mann-Whitney U test)
and Levene’s Test. It can be seen that the null
hypothesis is rejected only once in case of the U-test
(seventh row), which means that the test recognizes
only one significant difference regarding the mean
value. However, it cannot be determined why this
particular test shows a significant difference. On the
other hand, all null hypotheses are rejected in the
comparison of the scattering of the samples. This means
that, independent of the magnitude of the simulated
value, a factor of 10 in the scattering is always
detected. This provides a very good detectability
for the samples.

Since a detailed description of all results is out
of scope for this paper, the most important results
are summarized as follows: All 27 Levene’s tests
show a significant difference between the simulated
scatterings. This means that a difference in scattering
of at least one order of magnitude can always be
detected for such measured values. 26 of 27 U-tests
(96.30%) show no significant difference between
the samples regarding their location. This
means that a small scattering of the values has no
influence on the detection of significance in the
measured samples regarding the mean value.
Furthermore, it means that once a significant difference
is detected in the measured samples, one
can be sure that the technologies are significantly
different.

Since hardly any significant differences in the
location of the samples could be detected at these
scatterings, a further analysis was performed. For
this purpose, further samples have been Monte-Carlo
simulated based on the mentioned conditions with a
minor exception: the values are simulated with four
decimal digits. Based on these assumptions, the following
samples have been simulated: 19 ± 0.002; and 19 ± 0.005.

All in all, 22 additional Mann-Whitney U tests and
22 additional Levene’s tests were performed in the same
manner. Once the scattering is much lower, the
location of the samples can be differentiated much
better: 18 out of 22 tests show a significant difference
regarding the location of the samples. The variances
can always be differentiated based on Levene’s
test. This shows a very good applicability of the
introduced test statistics.


6 ANALYSIS OF THE MEASUREMENTS 


Based on the results and the knowledge gained in
the previous section, the measurements of the
three described technologies are analysed. Each
measurement was performed 20 times with every
technology. Therefore, each of the 19 measured
points provides samples of 20 measurements.


However, CT does not provide the measured values
themselves, but .stl files, which have to be
imported into a CAD program and processed.
The .stl files are created from the measurements
and can be exported with very different qualities
(the higher the number of elements in the .stl grid,
the bigger the files). Therefore, CT measurements
were exported with different qualities in order to
analyse the influence of the mesh quality on the final
results. The smallest .stl file was 120 MB and the biggest
3.5 GB, which obviously also determines the
export and calculation time.

The statistical tests were performed in exactly the
same manner as for the simulated values.
All in all, over 250 statistical tests were performed
based on the measurements. The results lead to
the following conclusions:


• CMM shows a significant difference in the mean
value in 97.14% of all measurements

• CMM shows a significant difference in the
scattering in 98.85% of all measurements

• 3DSM shows a significant difference in both mean
and scattering in 100% of all measurements, compared to CMM

• 3DSM shows a significant difference in the mean
value in 86.67% of all measurements, compared to CT

• 3DSM shows a significant difference in the
scattering in 56.67% of all measurements, compared to CT

• CT shows a significant difference in the mean
value in 76.67% of all measurements


Based on the gathered results, the following
statements are valid:


• CMM differs significantly from the remaining
technologies

• Based additionally on experience, CMM can be
regarded as the most accurate technology

• The choice of the conversion algorithm for the
.stl files within the CT measurements has a
significant influence on the quality of the results

• No difference between 3DSM and CT can
be observed; both technologies have the same
measuring accuracy


7 CONCLUSIONS 


A comprehensive study comparing three different
measuring technologies is presented in this paper.
For this purpose, an aluminium DTM socket has been
measured at 19 different points. For the comparisons,
non-parametric tests have been chosen and
classified as suitable.

For the purpose of a sensitivity study, a number
of samples have been Monte-Carlo simulated and
analysed. It has been shown that the chosen tests
provide satisfactory results regarding the
differentiability of the measured samples.

The analysis of the results shows many significant
differences, which means that the technologies
themselves are significantly different, with the CMM
being the most accurate one.

Based on the results of the analysed measurements
as well as the advantages and disadvantages
of the technologies, the decision on the proper
technology for a certain industry can be made
easily.


ABSTRACT: The exponential growth of industrial data generated by sensors, modern equipment
and devices is pushing the service sector to use more sophisticated analytics tools that can produce useful
knowledge and predict certain events, especially events whose losses can be reduced through preventive
maintenance. This work presents the application of big data analytics for machine learning processing
to a railway company problem, using one of the most powerful tools for large scale data
management: the open-source Apache Spark platform. The practical implication is a reliable
prediction of the condition of trains before they are loaded and sent to a destination.


1 INTRODUCTION 


The exponential growth of industrial data
generated by sensors, modern equipment and
devices is pushing the service sector to use more
sophisticated analytics tools that can produce useful
knowledge and predict certain events, especially
events whose losses can be reduced through
preventive maintenance.

However, processing large scale datasets and
building predictive models with advanced algorithms
demands high efficiency in iterative computation
tasks. In recent years, the open-source
platform Apache Spark has experienced rapid
growth due to its outstanding performance and wide
range of settings, which make it well suited for the
development of machine learning (ML) applications
using popular programming languages such
as Java, Python, Scala and R.

This work shows how to build a predictive model
in simple steps using the Spark environment for
the manipulation of massive structured data. The model
consists of a decision tree model for binary
classification and a sensitivity analysis of its performance
with respect to the main parameters. For academic
purposes, the modeling is set up on a pseudo-distributed
single node cluster running Ubuntu 16.04 and the
Apache Spark 2.2.0 framework, with the datasets
loaded into the Hadoop Distributed File System (HDFS),
a reliable storage for large files that allows parallel
processing.


2 PROBLEM STATEMENT 


In the railroad industry, train wheels are critical
components in terms of safety and one of the
main priorities due to their probability of failure,
which may imply catastrophic consequences. One
of the main concerns of the maintainers is to be
able to predict whether, given several factors such as
the condition of the wheels, the size of the load to
be transported, the distance to travel and the
conditions of the road, the vehicle is fit to go
on a trip and will arrive at its destination on time and
without problems. Otherwise, the necessary
actions should be taken. The use of technology,
and especially of reliable databases
as well as mechanisms that add intelligence and
structure the data in predictive models, is relevant
to support such decision-making.

Thanks to the development of new sophisticated
sensors such as wheel impact load detectors
(WILD), it is possible to monitor structural health
trends and spot the critical wheels which need to
be removed. However, setting out a car when it is
loaded causes considerable losses to railroad companies,
among other problems such as delayed shipments and
unnecessary disruptions in the network traffic. This
situation creates the need to develop predictive
models that can be used to project whether the total
vertical force imposed on the rails (denoted as peak kips)
will be above certain values in the next loaded status.
The WILD system scans millions of wheels per day
throughout the international rail industry (Stratman,
2007), and it provides useful empirical
datasets for training and testing in this endeavour.


3 PROPOSITION FOR METHODOLOGY 


In real world applications, the data provided by
different sources is not always structured. It needs to be
preprocessed in order to generate the inputs that feed
machine learning algorithms. For this particular case,
we propose the following framework for the creation
and validation of a predictive model for wheel peak kips.


3.1 Machine learning data pipeline 


The data pipeline represents the flow of data through the
machine learning process. It starts with a set of
massive raw data which needs to be pre-processed and
then fed to the machine learning algorithm. There
is a large number of machine learning algorithms,
which can be classified, among other criteria, according
to the type of problem they address (regression,
classification, etc.) or according to the way they process
the data (linear, tree-based, neural network models).
The ML algorithm learns from the data it is fed
to predict the behavior of newly supplied
data. This is called “training” an algorithm.
Subsequently, the output of the ML algorithm needs
to be interpreted to finally obtain a diagnosis that is
helpful in decision making. The entire process, from
obtaining and pre-processing the raw data to the
interpretation and use of the predictions requested by
the final users, is called the ML data pipeline.

The development and assembly of pipeline components
need to support distributed computation
and other requirements regarding data treatment,
including fault tolerance, resource management,
scalability and maintainability. This makes Apache
Spark a very useful and reliable environment, due to its
fundamental data structure for parallel processing
(Resilient Distributed Datasets) and the Scala DataFrame
API, which allows processing from kilobytes to
petabytes of data on a single node cluster and
performing operations with only a few lines of code.

Figure 1 summarizes the steps followed by the 
ML data pipeline used in this work. 


3.1.1 Step 1: Data ingestion 

Data ingestion refers to the process of obtaining
the data to be used in the ML data pipeline. It
is a non-trivial process of prioritizing
data sources, validating the data obtained and
sending them in an orderly manner to their next
destination. All this must be done quickly and reliably
so as not to lose information, especially when
the data are collected in real time. As previously mentioned,
the WILD system is the source of the two datasets
used in this project: a training dataset and a testing
dataset. These files contain exhaustive information
about a great number of variables monitored
in relation to the performance of train wheels,
including very complete mechanical and physical
information about the condition of each wheel at each
moment, corresponding to specific cars, trains and
trips.

Data ingestion of the mentioned datasets is
done using the Hadoop Distributed File System
(HDFS), a reliable and highly fault-tolerant
storage for large files that is specially designed to be
deployed on low-cost hardware. Its architecture is
shown in Figure 2.

This distributed file system follows a master/slave
architecture, with the Namenode as master and the
Datanodes as slaves. For example, suppose we have five
computers running, each with 500 Gigabytes of storage
and the Hadoop environment installed. Accessing the
storage from any of these five machines then works
like a single large machine with a total capacity of 2.5
Terabytes. This is where parallel processing pays off:
if a single machine needs 20 minutes for a task
on 500 Gigabytes of data, ten machines can, in the
ideal case, complete the same task in only 2 minutes.

In our particular case, the training and testing
datasets consist of tab-separated text files of 1.000
and 500 Gigabytes, corresponding to 7 million and
2.5 million rows, respectively. Both have headers
with the names of the 22 features including the
label, i.e. the value that we want to predict (peak
kips). In our local machine terminal, the following
commands are used to load these files:


> hadoop fs -put /localpath/training.tsv /training
> hadoop fs -put /localpath/testing.tsv /testing


3.1.2 Step 2: Data cleaning 
Data cleaning is considered a main challenge in
the era of big data due to the increasing volume,
velocity and variety of data in many applications
(Tang, 2014). It is the process of detecting and fixing
or removing incorrect and misleading records.
Dirty source inputs are very likely, and they
can easily trigger runtime exceptions and thereby
terminate the whole process in Apache Spark.
There are several solutions and tools for data
cleaning. In the Scala programming language, the
main utilities for this task are the classes
Try, Success and Failure.


// Try wraps a computation that may throw an exception
import scala.util.Try

// Function to transform peak kips into a binary label
def b(y: String): String =
  if (y.toDouble >= 90) "1" else "0"


// Load from HDFS and clean data: rows that throw are dropped
val file = sc.textFile("/training")
val file2 = file
  .map(_.split("\t"))
  .flatMap(c => Try { b(c(1)+...) }.toOption)
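The same cleaning idea, dropping every record whose conversion fails instead of letting the exception stop the job, can be illustrated without a cluster. The following Python sketch is an analogue of the Try(...).toOption pattern and not part of the original pipeline; the function names, the sample rows and the choice of returning the split fields are assumptions, while the label column (index 1) and the threshold of 90 follow the Scala fragment.

```python
def to_binary_label(peak_kips, threshold=90.0):
    """Map a peak-kips string to "1" (at or above threshold) or "0"."""
    return "1" if float(peak_kips) >= threshold else "0"


def clean_rows(lines, label_column=1):
    """Yield (label, fields) for every well-formed tab-separated line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        try:
            label = to_binary_label(fields[label_column])
        except (ValueError, IndexError):
            continue  # malformed record: drop it, like Try(...).toOption
        yield label, fields


# Hypothetical raw rows: two valid, one non-numeric, one truncated.
raw = ["w1\t95.2\t40", "w2\tn/a\t38", "w3", "w4\t12.0\t55"]
print([label for label, _ in clean_rows(raw)])  # ['1', '0']
```

Silently dropping records is convenient, but in practice it is advisable to count the discarded rows so that a systematic data problem does not go unnoticed.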


3.1.3 Step 3: Feature extraction

Feature extraction is a kind of dimensionality
reduction that efficiently represents an initial
set of data. The resulting reduced set is a
non-redundant and good representation of the most
relevant information in the entire data. The
subsequent model training and testing processes
can then be executed over this reduced set of features
(feature vector) instead of over the complete initial
database.

Since the incorporation of the DataFrame API
into Spark in 2015, data processing and functional
transformations have become much easier to code
in general-purpose programming languages, and
at the same time their performance has
improved. The features selected for this work
are listed in Table 1.


3.1.4 Steps 4 and 5: Model training and testing 
To train the model, we use the Random Forest
classification model as ML algorithm. One
of the main advantages of this algorithm is that
it does not require assumptions of normality of the
variables and it can deal with highly correlated
variables and non-linear relationships between them.

Table 1. Description of selected features.

Feature Name               Description

LOAD_EMPTY                 Binary variable that indicates
                           whether the equipment is
                           loaded or empty

[name not recoverable]     The tonnage of the equipment
                           that the wheel is part of

[name not recoverable]     The speed of the wheel at the time
                           of the WILD measurement

[name not recoverable]     Weight of the car when empty
                           (in tons)

[name not recoverable]     The tonnage that the particular
                           wheel type can sustain


Figure 3. Graphical representation of a Random Forest
decision tree.


Basically, the decision trees are built as follows:

1. Randomly select f features from the available
features F

2. Compute the best split point for tree t using the
chosen splitting metric (in our case study,
Gini impurity), split the current node into
child nodes and reduce the number of features f
from this node on

3. Repeat steps 1 to 2 until the maximum tree
depth has been reached

4. Repeat steps 1 to 3 in order to create a number T
of trees


The result is represented in Figure 3.
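The split selection in step 2 can be illustrated for a single feature. The following Python sketch is a simplified illustration and not the Spark implementation: it evaluates every candidate threshold of one numeric feature and keeps the one with the lowest weighted Gini impurity of the two child nodes. The function names and the toy data (wheel speed against an alarm label) are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best split
    `value <= threshold`; threshold is None if no split helps."""
    n = len(values)
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: wheel speed and whether the alarm was raised.
speeds = [10, 20, 30, 40, 50, 60]
alarm = [0, 0, 0, 1, 1, 1]
print(best_split(speeds, alarm))  # (30, 0.0)
```

In a Random Forest this search is repeated at every node over a random subset of the features, and each tree is grown on a resampled version of the training data.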
In the Spark platform, the generic code for
building the model can be written as follows:


> val model = RandomForest.trainClassifier(trainingData, numClasses,
    categoricalFeaturesInfo, numTrees, featureSubsetStrategy,
    impurity, maxDepth, maxBins)


3.1.5 Step 6: Evaluation 

One of the most effective methods of evaluating
the performance of a binary classifier is
the receiver operating characteristic (ROC) curve,
which is defined as a plot of the true positive rate
(TPR, y coordinate) against the false positive rate
(FPR, x coordinate). Accuracy is measured by the area
under the ROC curve (AUROC), which is given by
Equation (1):

$\mathrm{AUROC} = \int_0^1 \mathrm{TPR}\, d(\mathrm{FPR})$ (1)


This value ranges between 0.5 and 1, where 0.5
denotes no predictive capacity and 1 a perfect
classifier.
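The AUROC can also be computed directly from scores and labels. The following Python sketch uses the pairwise formulation, which is equivalent to the area under the ROC curve: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. It is an illustration with hypothetical names and data, not the Spark routine used in this work.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: one positive is ranked below one negative.
print(auroc([0.9, 0.7, 0.6, 0.1], [1, 0, 1, 0]))  # 0.75
```

The quadratic loop is only suitable for small examples; for millions of rows a sort-based or distributed computation is needed.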

To compute the raw scores on the test set, we
can code the following lines:


> val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
    val prediction = model.predict(features)
    (prediction, label) }
> // Instantiate metrics object
> val metrics = new BinaryClassificationMetrics(predictionAndLabels)
> // Area under the ROC curve
> val auROC = metrics.areaUnderROC
> println("Area under ROC = " + auROC)


In addition, we can provide a simple sensitivity
analysis by modifying the number of trees (numTrees)
and the maximum depth (maxDepth) and calculating
the corresponding values of the AUROC (1).


4 RESULTS 


As we can see in Table 2, the area under the receiver
operating characteristic curve (AUROC) does not
experience major fluctuations.

The most accurate model has 4 trees and a maximum
depth of 6. With an AUROC of 0.7864, there is a
probability of roughly 79% that the model ranks a
randomly chosen wheel that will activate the alarm in
the next loaded status (peak kips > 90) above a
randomly chosen wheel that will not. This represents
the prediction accuracy of the model.


5 DISCUSSION AND CONCLUSIONS 


This work presents an application of big data
analytics for machine learning processing to a
railway company problem using
open-source tools. According to the obtained
results, we can state that the application of this
methodology is adequate and valuable as
decision-making support when used together with the
experience of personnel, maintainers and technicians.
At the same time, future analyses may
try to increase the predictive performance of the
tool used in this project, for example by considering
more features for the modeling and using a computer
with more processing capacity.


Table 2. Area under the ROC curve according to the
parameters.

             Number of trees
Max depth    3        4        5        6

3            0.7767   0.7708   0.7742   0.7702
4            0.7757   0.7828   0.7380   0.7812
5            0.7819   0.7632   0.7620   0.7622
6            0.7600   0.7864   0.7604   0.7622
7            0.7567   0.7660   0.7535   0.7622

We want to highlight the versatility of the Random
Forest algorithm, which can be used in many
industrial sectors, in this case for preventive
maintenance. Its outstanding performance can be used to
obtain valuable predictive information that can be
translated into new opportunities. However, due to
limitations in hardware (settings on a single node
cluster), training models with even larger input
datasets may leave a wide margin for improvement
in terms of accuracy and processing speed.


ABSTRACT: Jacket structures are often employed in shallow to moderate water depths.
The bracing systems and jacket legs typically use circular sections in order to balance hydrodynamic
resistance and high torsional rigidity. However, under lateral impact, these tubular bracing members
are susceptible to local denting due to ship collisions or the impact of falling objects, which can
weaken the overall performance of the entire platform. Accurately forecasting the dent depth of
these members is therefore of great significance. This paper investigates the use of an adaptive
meta-heuristic algorithm to provide automatic detection of denting damage in an offshore structure.
A model is developed based on the dent depth expressed as a percentage of the damaged member
diameter and is used to assess the performance of the method. It is demonstrated that small changes
in the stiffness of individual damaged bracing members are detectable from measurements of global
structural motion.


1 INTRODUCTION 


Offshore jackets play an important role in the oil
and gas industry. The risk of platform failure
increases when operating an ageing platform.
As a result, the need for techniques to assess
platform integrity is clearly established. Several
standards and recommended practices list detailed
inspection methods and requirements for the underwater,
splash zone and topsides structures (May et al. 2008).
For the oldest operational platforms, the
assessment is quite challenging due to
deterioration or dent damage of structural elements
caused by collisions or impacts. A change in
dent depth can dramatically reduce the axial and
bending capacities of individual jacket leg
members and weaken the overall performance of the
entire structure (Bruin, 1995). Therefore, accurate
prediction of dent depth and dent direction angle
would provide valuable information for
operations and maintenance at offshore
platforms.

Karamanos and Andreadakis (2006) studied the
denting of tubular members subjected to lateral
loads. The loading condition was quasi-static and
the tubes were internally pressurized. The relation
between denting force and displacement was
evaluated through experiments and finite element
simulations, and a relation between normalized
force and normalized denting displacement
was derived from both procedures. A wedge-shaped
denting tool was used in the experiments.
The study concludes that internal pressure in the
tubular member increases the resistance against
denting and reduces the denting length.
Travanca and Hao (2014) discussed
the response of jackets to ship impact.
Finite element formulations and calculations were
derived to incorporate the dynamic aspects as well.
The jacket was idealized as a cantilever beam
and the number of degrees of freedom (DOF) was
reduced to simplify the study. The reduced model
gave results similar to those of the finite element
model of the original structure. Response time
histories, mode deformations, normalized deformations
at different stages, and response histories from the
original model and the equivalent reduced
single-degree-of-freedom (SDOF) model were compared
in the parametric study. A four-legged jacket, a
tripod jacket and a jack-up model were considered
as case studies. Furthermore, the effect of the top mass
was included and found to be significant
when the natural period is large and the
ratio of top mass to jacket mass is high. A
two-stage dynamic analysis was also suggested to achieve
more accuracy. Cosham and Hopkins (2004)
examined the effect of dents on pipelines
used for oil and gas transmission.
Different types of defects along with different
types of dents were studied. Burst
strength and fatigue life were investigated for
different types of dents. It was found that a plain,
smooth dent does not significantly affect the
burst strength but does affect the fatigue life.
Moreover, smooth dent reduces the fatigue life to 
a great extent. Kinked dents may be are dangerous 
for longitudinal cyclic stresses and are risky for 
external pressures with the internals. Moreover, it 
is noted that the smooth dent along with gouge is 
very unsafe considering both burst strength and 
fatigue which reduces considerably. Khedmati and 
Nazari (2012) explored the behavior of tubular
members in response to impact loads. Strength and
deformation characteristics were examined,
including the axial shortening behavior, and the
effects of preloading and quasi-static lateral load
were also described. Storheim and Amdahl (2014)
discussed the design process of offshore platforms
susceptible to ship collision, including the
ship-platform interaction. Four different collision
scenarios were considered and the force-displacement
curves for each were interpreted. The study
suggested revising the design specification for
collision, as vessel sizes and bow configurations
have changed somewhat in recent years. Cerik
et al. (2016) examined denting damage in tubular
members under low-mass impact, using both
experimental and numerical investigation. The
numerical results agreed satisfactorily with the
experimental tests. Local denting and the
load-indenter displacement relationship were
described, and the deformation characteristics of a
single member with clamps were studied extensively.
Cho et al. (2010)
investigated the consequences of denting damage in
tubular members of offshore structures.
Experimental denting and bending tests were
conducted and a relation between dent depth and
denting force was established. Moreover, the
residual strength of dented tubes can be predicted
through an equation derived from the relation
between residual strength and bending moment. Cho
et al. (2015) discussed the response of tubular
members in maritime structures to impact loading.
Drop tests along with statistical analysis were
carried out to predict the behavior under dynamic
impact loading. The consequences of local denting
and global bending were discussed in detail on the
basis of the experimental and numerical results.
Pacheco and Durkin (1988) illustrated that minor
dent damage due to ship collision can cause a major
reduction of ultimate capacity, and that a
geometrically ideal dent can represent the actual
damage in finite element analysis. Li et al. (2013)
focused on the consequences of ship collision for
the tubular members of jacket structures. Both
elastic and plastic behaviors were examined and
different ship collision scenarios were assessed. It
is reported that, for larger deformations, the
standard guideline underestimates the
resistance-indentation relationship. The elastic
response in vessel-platform impacts should be
considered more carefully to understand its effect
on energy absorption during impact.

Damage detection can be treated as a system
identification problem or an inverse optimization
problem (Farrar and Worden, 2012), in which
optimization techniques are used to quantify the
unknown parameters of the damage. Over the last
few decades, adaptive metaheuristics and predictive
control methods have become more widely used in
several research fields, especially in damage
detection under ambient vibration (Miguel et al.,
2012) and in the optimal design of fixed jacket
offshore platforms under environmental loads
(Nasseri et al., 2014). Although much has been
reported, very little has been done on dent
examination of circular members in jacket
structures, which is the most common type of
problem when investigating structural damage after
ship collision. Intelligent computational
techniques such as metaheuristics can serve to
highlight areas where sensor technologies and
structural integrity monitoring techniques might be
useful.

In this work, an optimization problem for offshore
dent detection is posed to find the percentages of
dent in element diameters and the impact angles.
Five well-established self-adaptive metaheuristics,
namely JADE (Zhang & Sanderson, 2009), CMAES
(Hansen, Muller et al., 2003), SHADE (Tanabe and
Fukunaga, 2013), L-SHADE (Tanabe and Fukunaga,
2014), and ASCDE (Bureerat and Pholdee, 2017), are
used to perform detection within a simulated damage
scenario on a reduced finite element model of an
offshore jacket platform. The efficiency and
accuracy of each optimizer are discussed. Finally,
the main conclusions and recommendations for future
work are summarized.


2 DESCRIPTION OF AN OFFSHORE
JACKET PLATFORM FOR THE
ADAPTIVE META-HEURISTIC
ANALYSIS


2.1 Fixed offshore platform model 


The offshore jacket platform is modeled with 3D
frame elements for the adaptive meta-heuristic
analysis. The physical configuration of the jacket
platform is shown in side, front and 3D views in
Figure 1.

The offshore jacket platform is simulated to
operate at a water depth of 65.53 m. The topside of
the jacket platform rises 17.68 m above sea level.
The bracing system consists of two types, single
bracing and K-bracing. The bottom bracing of the
jacket structure is K-bracing, so that the
connecting angle between chord and brace members
stays above 30 degrees and welding problems are
avoided. The remaining bracing is single bracing,
to reduce the structural weight.

The topside of the oil and gas platform is
simplified as a lumped mass for finite element
analysis. The total weight of the topside is 2500
tons, distributed equally over the four legs, so
that each leg carries 625 tons at the top connection
of the jacket structure.

The structural members of the offshore jacket
platform are selected according to the design
specifications (DNV-RP-C203, 2001). The jacket
structure consists of 11 groups of tubular cross
sections, whose properties are listed in Table 1.
All structural members have the same material
density and modulus of elasticity. The locations of
the structural elements, with a different color for
each group, are illustrated in Figure 2, and the
color assigned to each group is indicated in
Table 1. The groups are formed on the basis of the
outer diameter and wall thickness of the tubular
section.


Figure 1. 3D and side views of offshore platform
(panels: Side View, Front View, 3D View).


Table 1. Element properties.

Group   Color       Outside
name    indicator   diameter   Thickness
G1      [swatch]    1.067      0.038
G2      [swatch]    0.457      0.010
G3      [swatch]    0.406      0.013
G4      [swatch]    0.356      0.010
G5      [swatch]    0.457      0.013
G6      [swatch]    0.356      0.013
G7      [swatch]    0.406      0.016
G8      [swatch]    0.324      0.010
G9      [swatch]    0.559      0.013
G10     [swatch]    0.559      0.019
G11     [swatch]    0.610      0.025

Modulus of Elasticity, E = 2.1 x 10^11 N/m^2;
Density of Steel = 7833 kg/m^3.


Figure 2. Member group specifications.
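
Since the groups are defined by outer diameter and
wall thickness, the undented section properties that
enter the frame-element stiffness follow from the
standard formulas for a circular tube. The sketch
below is only an illustration: the function name is
hypothetical, and the Table 1 dimensions are assumed
to be in metres because the table does not state
units.

```python
import math


def tube_section_properties(outside_diameter, thickness):
    """Return (area, second moment of area) of an undented circular tube."""
    inner_diameter = outside_diameter - 2.0 * thickness
    area = math.pi / 4.0 * (outside_diameter**2 - inner_diameter**2)
    inertia = math.pi / 64.0 * (outside_diameter**4 - inner_diameter**4)
    return area, inertia


# Group G1 from Table 1 (units assumed to be metres, not stated in the table).
area_g1, inertia_g1 = tube_section_properties(1.067, 0.038)
print(f"G1 area = {area_g1:.5f}, second moment of area = {inertia_g1:.5f}")
```

A dented section would need a different geometric
model, which the text does not specify, so only the
undented case is sketched here.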


2.2 Denting data and assumptions 


The assumption of dented members in the jacket
platform follows from the consequences of ship
collision. Three members from two different groups
are assumed to be affected: members no. 10 and 20
from group G1 and member no. 88 from group G9. The
vessel can strike the jacket platform in two
directions, either from the bow or from the
starboard direction. The locations of the dented
members and the impact directions are presented in
Figure 3. Two leg members and one bracing member
are impacted, as shown in Figure 4. Denting is
generally defined in terms of percentage of
diameter reduction, as per different codes of
practice (ISO 19901-3, 2014).


Figure 3. Ship collision and denting in members.


Figure 4. Denting sample in G1 and G9.


Table 2. Element groups of the jacket structure.

Group name   Element numbers    Outside diameter
G1           1-20               In element 10 & 20
G2           21-28              No denting
G3           29-32              No denting
G4           33-44              No denting
G5           45-52              No denting
G6           53-60, 117-120     No denting
G7           61-68              No denting
G8           69-72              No denting
G9           73-80, 85-92       In element 88
G10          81-84              No denting
G11          93-116             No denting

# Members 41-44 and 57-60 are the members above
the water level.


A sample denting configuration for both member
types, G1 and G9, is illustrated in Figure 4. In
this sample, 20% denting is considered. The
diameters of the members and the dented length are
also identified in the figure. Due to denting, the
stiffness of the member changes because the
cross-sectional area and the moment of inertia
change, but the mass remains unchanged.


Figure 5. Denting of members for rotation of axis
(label: Dent Section).

It is assumed in this study that a dented section
at a single point affects the whole member length.
Moreover, in reality the axis along which denting
occurs is often not fixed.

As shown in Figure 5, denting can occur at
different angles if the vessel strikes the platform
from different directions. For different local dent
angles, the member has different moments of inertia
about its two axes, while the area remains the same.
For this reason, the angle of the denting direction
is also considered in this investigation, with a
range between 0 and 360 degrees. That is, the ship
can strike the offshore jacket platform from any
direction and denting can occur at any angle.


3 OPTIMIZATION PROBLEM OF 
OFFSHORE DENT DETECTION AND 
NUMERICAL EXPERIMENT SETUP 


3.1 Frequency and mode shape changes 


Dent detection of the offshore structure detailed
in the previous section is performed using
vibration-based damage detection. The main concept
is that the mechanical properties of a mathematical
model, such as a finite element model, are updated
until the modal data of the model, such as its
natural frequencies, agree well with the measured
modal data. Damage to the structure is then
identified by detecting changes in the mechanical
properties.

In this work, an optimization problem for offshore
dent detection is posed to find the percentages of
dent in element diameters (P_d) and the impact
angles (\theta), which consequently lead to changes
in the mechanical properties (cross-sectional area,
second moment of area about the y and z directions)
and in the natural frequencies of the offshore
structure. Since natural frequencies can be
accurately measured, the objective function can be
constructed in terms of changes in natural
frequencies.

The percentage of dent damage in a tubular element
can be found by solving an optimization problem
that minimizes the root mean square error (RMSE)
between the natural frequencies measured from the
dented structure and the natural frequencies
computed with the finite element model. The problem
can be expressed as:


Min: f(x) = \sqrt{ \frac{1}{n_{mode}} \sum_{j=1}^{n_{mode}} \left( \omega_{j,damage} - \omega_{j,computed} \right)^{2} }    (1)


where \omega_{j,damage} and \omega_{j,computed} are
the structural natural frequency of mode j obtained
from the dented structure and from the finite
element model, respectively. The vector x is the set
of design variables, comprising the percentages of
dent in element diameters and the impact angles
(x = \{P_{d,1}, P_{d,2}, ..., P_{d,n}, \theta_{1},
..., \theta_{n}\}^{T}).

In this work, only elements 5, 10, 15, 20, 86, 88,
90 and 92, which are located at sea level, are
allowed to be dented. The total number of design
variables is therefore 16 (8 for the percentages of
dent in element diameters and 8 for the impact
angles of the elements). The percentage of dent is
restricted to the range [0, 0.7], while the impact
angle takes values in the set {0, 22.5, 45}.


3.2 Numerical experiment 


To investigate the search performance of the
optimization methods on the proposed dent detection
problem, the percentages of dent in element
diameter and the impact angles are pre-defined,
while the natural frequencies are simulated by
finite element analysis instead of using real
measured data. The simulated damage is a dent of 0.3
(on the [0, 0.7] scale of the design variables) at
elements 10, 20 and 88, with impact angles of 0,
22.5 and 45 degrees, respectively. The natural
frequencies of the first six modes of the dented
and undented structure (shown in Table 3) are used
for the objective function calculation.


Table 3. Natural frequencies (Hz) of the first six modes.

Mode   Undented   Dented
1      0.680      0.674
2      0.690      0.714
3      0.919      0.989
4      2.506      2.507
5      2.527      2.527
6      3.179      3.145
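
The RMSE objective of Equation 1 can be sketched in
a few lines. The function below is a minimal
illustration with a hypothetical name; in the actual
study the computed frequencies would come from the
finite element model for a candidate x, which is not
reproduced here. As a sanity check, the frequencies
of Table 3 are used to evaluate the objective for an
undented candidate.

```python
import math


def rmse_objective(measured, computed):
    """Root mean square error between two lists of natural frequencies."""
    if len(measured) != len(computed):
        raise ValueError("both frequency lists must cover the same modes")
    squared = [(m - c) ** 2 for m, c in zip(measured, computed)]
    return math.sqrt(sum(squared) / len(squared))


# Natural frequencies (Hz) of the first six modes, taken from Table 3.
undented = [0.680, 0.690, 0.919, 2.506, 2.527, 3.179]
dented = [0.674, 0.714, 0.989, 2.507, 2.527, 3.145]

# Objective value when the candidate model is still the undented structure.
baseline_error = rmse_objective(dented, undented)
print(f"RMSE of the undented model: {baseline_error:.5f}")
```

A candidate that reproduces the dented frequencies
exactly gives an objective value of zero.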


To minimize the objective function, five
well-established self-adaptive meta-heuristics
(MHs) are used. Details and notation for these
methods are available in the corresponding
references and are not repeated here. The MHs used
are:


— Adaptive Differential Evolution (JADE) (Zhang 
& Sanderson, 2009). 

— Evolution Strategy with Covariance Matrix 
Adaptation (CMAES) (Hansen, Muller et al., 
2003). 

— Success-History Based Adaptive Differential 
Evolution (SHADE) (Tanabe & Fukunaga, 
2013). 

— SHADE with Linear Population Size Reduction 
(L-SHADE) (Tanabe & Fukunaga, 2014) 

— Adaptive Sine Cosine algorithm with integrat- 
ing Differential Evolution mutation (ASCDE) 
(Bureerat and Pholdee, 2017) 


Each optimizer is used to solve the offshore
structure dent detection test problem in 10
optimization runs. The population size is set to 30
and the number of iterations to 300. All methods
are terminated by either of two criteria: the
maximum number of function evaluations, 30 x 300,
is reached, or the objective function value is
less than or equal to 1 x 10°. 
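
The two stopping criteria can be illustrated with a
plain differential evolution loop (rand/1/bin). This
is not one of the five self-adaptive optimizers used
in the study, only a minimal stand-in; the function
name, the control parameters f_scale and cr, the toy
objective and the threshold value in the usage lines
are all assumptions made for the illustration.

```python
import random


def differential_evolution(objective, bounds, threshold, pop_size=30,
                           max_iterations=300, f_scale=0.5, cr=0.9, seed=0):
    """Plain DE (rand/1/bin) that stops on an evaluation budget or a threshold."""
    rng = random.Random(seed)
    dim = len(bounds)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    fitness = [objective(ind) for ind in population]
    evaluations = pop_size                      # initial population counts
    max_evaluations = pop_size * max_iterations  # 30 x 300 by default
    while evaluations < max_evaluations and min(fitness) > threshold:
        for i in range(pop_size):
            if evaluations >= max_evaluations:
                break
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            j_rand = rng.randrange(dim)
            trial = []
            for j, (lo, hi) in enumerate(bounds):
                if rng.random() < cr or j == j_rand:
                    value = population[a][j] + f_scale * (
                        population[b][j] - population[c][j])
                    value = min(max(value, lo), hi)  # clip to the bounds
                else:
                    value = population[i][j]
                trial.append(value)
            trial_fitness = objective(trial)
            evaluations += 1
            if trial_fitness <= fitness[i]:
                population[i], fitness[i] = trial, trial_fitness
    best = min(range(pop_size), key=fitness.__getitem__)
    return population[best], fitness[best], evaluations, fitness[best] <= threshold


# Toy objective standing in for the finite element based RMSE (hypothetical).
def toy_objective(x):
    return sum((value - 0.3) ** 2 for value in x)


best_x, best_value, used_evaluations, success = differential_evolution(
    toy_objective, bounds=[(0.0, 0.7), (0.0, 0.7)], threshold=1e-3)
```

A run that ends with success equal to True corresponds
to termination by the objective function threshold;
otherwise the evaluation budget was exhausted.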


4 RESULTS AND DISCUSSIONS 


After performing 10 optimization runs of each MH on
the offshore structure dent detection test problem,
the results obtained are given in Table 4. The mean
objective function value is used to measure the
rate of convergence of an algorithm in cases where
the objective function threshold is not reached
during the search. Otherwise, the mean number of FE
runs is used as the indicator. The number of
successful runs out of 10 is used to measure the
search consistency. An algorithm that is terminated
by the objective function threshold is considered
superior, and any run stopped by this criterion is
counted as a successful run (Bureerat and Pholdee,
2017).


Table 4. Comparison results among each optimizer.

             Mean      Mean   No. of
Optimizers   Obj.      FEs    successful runs
CMAES        0.02428   8166   1
JADE         0.03626   8268   1
SHADE        0.03886   8367   1
LSHADE       0.02903   9000   0
ASCA         0.00095   2148   8


Table 5. The best results on finding the percentages of dent
and impact angles obtained from all optimizers.

      Simulated
No    dent        CMAES   JADE    SHADE   LSHADE   ASCA
10    0.3         0.36    0.50    0.47    0.19     0.26
20    0.3         0.07    0.17    0.18    0.01     0.35
88    0.3         0.29    0.23    0.17    0.00     0.30
5     0           0.13    0.13    0.19    0.27     0.05
15    0           0.00    0.00    0.00    0.16     0.00
86    0           0.27    0.06    0.08    0.70     0.00
90    0           0.00    0.00    0.00    0.01     0.00
92    0           0.00    0.00    0.00    0.01     0.00

      Simulated
      impact
No    angle       CMAES   JADE    SHADE   LSHADE   ASCA
10    0           45      22.5    45      0        45
20    22.5        45      45      22.5    45       0
88    45          22.5    22.5    45      -        0
5     -           45      22.5    45      45       0
15    -           0       -       0       45       -
86    -           22.5    0       22.5    22.5     -
90    -           -       -       -       0        -
92    -           -       -       -       0        -

\omega_1 (Hz)   0.674   0.674   0.675   0.673   0.672   0.673
\omega_2 (Hz)   0.714   0.715   0.713   0.714   0.716   0.714
\omega_3 (Hz)   0.989   0.989   0.988   0.989   0.988   0.989
\omega_4 (Hz)   2.507   2.507   2.507   2.507   2.505   2.507
\omega_5 (Hz)   2.527   2.527   2.527   2.527   2.527   2.527
\omega_6 (Hz)   3.145   3.146   3.146   3.145   3.149   3.145
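
The three indicators reported in Table 4 (mean
objective value, mean number of function
evaluations, number of successful runs) can be
computed from per-run records as sketched below. The
function name, the record layout and the example
numbers are hypothetical and are not the runs behind
Table 4; the threshold is passed in as a parameter.

```python
from statistics import mean


def summarize_runs(runs, threshold):
    """Summarize (final_objective, function_evaluations) records of one optimizer."""
    objectives = [objective for objective, _ in runs]
    evaluations = [count for _, count in runs]
    # A run is successful when it ended at or below the objective threshold.
    successes = sum(1 for objective in objectives if objective <= threshold)
    return {
        "mean_objective": mean(objectives),
        "mean_fes": mean(evaluations),
        "successful_runs": successes,
    }


# Made-up example records: two runs stop early, one exhausts the budget.
example_runs = [(0.0005, 1200), (0.0200, 9000), (0.0008, 3000)]
summary = summarize_runs(example_runs, threshold=1e-3)
print(summary)
```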


From Table 4, the best performer in terms of mean
objective function value is ASCA, while the second
and third best algorithms are CMAES and LSHADE,
respectively. In terms of the number of successful
runs, ASCA is the most efficient optimizer: it
reaches the objective function threshold, and thus
a successful detection run, in 8 out of 10
optimization runs, with an average of 2,148
function evaluations.

Table 5 shows the best results for the percentages
of dent and impact angles obtained from all
optimizers. The MHs only report impact angle values
for elements with a percentage of dent greater than
zero. From Table 5, ASCA comes closest to the
simulated percentages of dent (0.26, 0.35 and 0.30
against the simulated 0.3 at elements 10, 20 and
88), while the other optimizers fail to achieve
comparable results. For the impact angles, the
results of all optimizers are not accurate and need
further improvement. One idea for such improvement
is to include in the objective function some nodal
displacements picked from the numerical mode
shapes, which will be explored and presented in
future work.


5 CONCLUSIONS AND 
RECOMMENDATIONS 


Five meta-heuristic optimizers were tested for
detecting dent damage of circular members in a
jacket structure. The damage detection problem is
based on vibration measurement and can be treated
as an inverse optimization problem. The comparative
results reveal that ASCA is outstanding at
predicting the dent percentage in diameter but not
the denting angles. The results from ASCA could be
used as the baseline for further improvement and
investigation of dent damage examination using
meta-heuristics.


ABSTRACT: An Asset Health Index (AHI) is a tool that processes data about an asset's condition. The
index is intended to reveal whether the health of the asset changes over its life cycle.
These data can be obtained during the asset's operation, but they can also come from other information
sources such as geographical information systems, suppliers' reliability records, relevant external agents'
records, etc. The AHI provides an objective point of view in order to justify, for instance, the
extension of an asset's useful life, or to identify which assets in a fleet are candidates for early
replacement as a consequence of premature aging. This paper develops a model applicable to different
classes of equipment and industrial sectors. A review of the main cases where the asset health index has
been applied is included. Likewise, advantages and disadvantages of applying this kind of tool
are presented, providing a guide for a line of research on the general application of this tool.


1 INTRODUCTION 


Nowadays, network operators are facing many
challenges in managing their assets. Stakeholder
demands (safety, reliability, environment, and
financial impact) are increasing while assets are
aging, which increases the risk of failure. The
need to estimate the expected time to failure
becomes more relevant every day, and planning an
optimal replacement or maintenance program to renew
assets becomes even more essential. With a large
number of assets, the maintenance manager's
challenge is to decide which assets require more
attention and what actions should be taken. The
complexity of this decision increases because each
asset class has different failure modes, and each
failure has different consequences in the asset
network (Vermeer et al. 2015).

The objective of this contribution is to highlight
the most relevant recent studies related to the
asset health index. The concept of the asset health
index (AHI) and its application appear throughout
this document. Each model is described in three
parts. The first part deals with data gathering and
treatment, the second part corresponds to the index
composition, and the third part is the output of
results and recommendations related to the index
value. As a final part of the contribution, the
limitations of the models and the future scope of
the asset health index are discussed.


2 CONCEPT OF AN ASSETS HEALTH 
INDEX 


An Asset Health Index (AHI) is an asset score
designed to reflect or characterize the asset's
condition and thus its performance in terms of
fulfilling the role established by the
organization.

An AHI represents a practical method to quantify
the general health of a complex asset. Most of
these assets are composed of multiple subsystems,
and each subsystem can be characterized by multiple
modes of degradation and failure. In some cases, an
asset may be considered to have reached the end of
its useful life when several subsystems have
reached a state of deterioration that prevents the
continuity of service required by the business
(Hjartarson & Otal 2006). The health index, based
on the results of operational observations, field
inspections and laboratory tests, therefore
produces a single objective and quantitative
indicator. It may be used as a tool to manage
assets and to identify capital investment needs and
maintenance programs (Naderian et al. 2008). In
addition to condition and operation factors, the
health index should also contain static factors
linked to the asset's location, that is,
environmental conditions that are independent of
the asset itself and do not change over time
(Scatiggio & Pompili 2013).

The critical objectives in the formulation of a
complex Health Index are as follows (Hjartarson &
Otal 2006):


• The index should be indicative of the asset's
suitability for continued service and
representative of the overall asset health.

• The index should contain objective and verifiable
measures of asset condition, as opposed to
subjective observations.

• The index should be understandable and readily
interpreted.


3 MODELS 


Next, four different models from the literature
(proposed for the calculation of an asset health
index) are studied. In general terms, the inputs to
all of them can be data related to condition,
equipment operation, the availability of spare
parts used in maintenance and, in some cases,
information from the geographic location.

For the algorithms used in the index calculation,
it will be seen how the different models integrate
all the information sources through weighting
factors, depending on the maturity level of the
model implemented in each sector. Likewise, the
outputs of each model and the associated
recommendations are presented.

The following scheme (Figure 1) represents the
concept of an AHI. It compiles the different inputs
to the model found in the literature consulted, and
the different outputs for making long-term
decisions (Azmi et al. 2017). It is important to
highlight that this paper is focused on long-term
decisions. There are also AHI models that are used
as tools in the field of Prognostics and Health
Management (PHM) (Ludovic et al. 2011; Abichou et
al. 2012; Abichou et al. 2015).


[Figure 1 labels: Operating observations; Location
factors; Site and laboratory testing; Field-testing;
AHI algorithm; Repair & upgrades; Overhaul;
Replacement; Contingency control]


Figure 1. Concept of the health index within an asset
management framework.


3.1 Asset health index calculation model 
by Kinetrics 


The model developed by Kinetrics, Canada, proposes
an overall assessment of transformer condition. The
inputs to the model are data from different
variables related to the operation and condition of
the equipment throughout its useful life. The
calculations for each variable, together with a
detailed assessment, are used to normalize the
values, which are then weighted to build a
personalized asset health index.


3.1.1 Inputs 

The inputs of the model are the historical data of
the operating variables (load, number of
operations, etc.), the results of laboratory tests
on oil samples (oil quality, dissolved gas content,
acidity, etc.) and on-site tests carried out by
technicians, such as insulation tests,
thermography, corrosion status, etc.


3.1.2 Index calculation methodology 

The methodology proposed for the index calculation
is based on the normalization of each variable into
a value between 0 and 4, together with a variable
weight for the final composition into a single
indicator.

As an example of normalization of the results of a
laboratory test, Table 1 shows the link between the
concentration of dissolved gases in transformer oil
and aging. In the table, the concentrations of
different dissolved gases are weighted according to
their relationship with the aging of the asset.
Gases with greater weight are therefore those that
appear when the asset has reached a certain level
of aging (Naderian et al. 2008).

Equation 1 below, calculated from the results of
gases dissolved in oil, gives the value of the
variable that is one of the inputs to the AHI.


LTC Oil Quality = \frac{\sum_{i=1}^{n} S_i \times W_i}{\sum_{i=1}^{n} W_i}    (1)


For the variables, the author proposes a weight
between 1 and 10. Values close to 1 are assigned to
variables whose relationship with the aging of the
equipment is very small or null. Higher weights
(above 5 points) are assigned to condition
variables that more accurately reflect the aging of
the equipment. The operating variables, such as the
load factor of transformers and the power factor,
are also related to the aging speed of the
equipment, because they are good indicators of when
the equipment operates outside its design
conditions.


Table 1. Concentration in ppm of gas dissolved in oil.
Gas in oil concentration in tap changer

Gas         ppm     ppm       ppm       ppm     W_i
CH4         <50     50-150    150-250   >250    3
C2H6        <30     30-50     50-100    >100    3
C2H4        <100    100-200   200-500   >500    3
C2H2        <10     10-20     20-25     >25     3
Score (S)   1       2         3         4
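
The scoring and weighting step of Equation 1 can be
sketched as follows. The limits and weights are read
from the four rows of Table 1; the function names,
the treatment of values that fall exactly on a range
boundary and the ppm readings are assumptions made
only for this illustration.

```python
def score_from_limits(value_ppm, upper_limits):
    """Map a concentration to a score 1..4 using the upper limits of each band."""
    for score, limit in enumerate(upper_limits, start=1):
        if value_ppm < limit:
            return score
    return len(upper_limits) + 1


def weighted_factor(scores, weights):
    """Weighted average of scores, as in Equation 1."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Band limits (ppm) for the four rows of Table 1, in table order.
row_limits = [(50, 150, 250), (30, 50, 100), (100, 200, 500), (10, 20, 25)]
row_weights = [3, 3, 3, 3]

# Hypothetical readings in ppm, one per row of Table 1.
readings = [40, 60, 250, 30]

scores = [score_from_limits(v, limits) for v, limits in zip(readings, row_limits)]
factor = weighted_factor(scores, row_weights)
print(scores, factor)
```

With equal weights the factor reduces to the plain
average of the scores; unequal weights would shift it
toward the gases judged more indicative of aging.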


3.1.3. Model outputs 

For the model output, the author proposes a
composition of all normalized variables into a
single indicator ranging between 0 and 100. A value
of 100 corresponds to new equipment, and a value of
zero refers to equipment that has reached the end
of its useful life and must be replaced because it
is already out of service. Equation 2 is proposed by
the author for calculating the health index
(Naderian et al. 2008).


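HI = 60\% \times \frac{\sum_{j=1}^{17} K_j HIF_j}{\sum_{j=1}^{17} 4 K_j} + 40\% \times \frac{\sum_{j=18}^{20} K_j HIF_j}{\sum_{j=18}^{20} 4 K_j}    (2)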
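
The composition into a 0 to 100 indicator can be
illustrated for a single group of weighted factors.
This is a simplified sketch and not the published
formula: it assumes that each factor HIF_j lies
between 0 and 4 with 4 as the best condition, that
K_j is its weight, and that only one group of factors
is combined. The function name and the example
numbers are hypothetical.

```python
def health_index(factors, weights, max_factor=4.0):
    """Scale a weighted sum of factor scores (0..max_factor) to a 0..100 index."""
    achieved = sum(k * hif for k, hif in zip(weights, factors))
    best_possible = sum(k * max_factor for k in weights)
    return 100.0 * achieved / best_possible


# Hypothetical factor scores and weights for three condition variables.
example_index = health_index(factors=[4, 3, 2], weights=[10, 5, 1])
print(f"Health index: {example_index:.1f}")
```

All factors at their maximum give 100 (as new), and
all factors at zero give 0 (end of useful life),
which matches the range described in the text.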


The index value is related to the failure
probability of the equipment and is divided into
ranges, each with its own interpretation. For each
range of the index output, recommendations and
measures are proposed to support decision making in
maintenance management. Figure 2 shows the
relationship between the health index and the
probability of failure (Naderian et al. 2009).


3.2 Asset health index calculation model by DNV 
GL 


This model, developed by DNV GL, Arnhem, the
Netherlands, proposes a methodology for calculating
a health index from the maximum admissible failure
rates for the business, together with the asset
criticality, in order to obtain an index for
prioritizing maintenance, overhaul and substitution
of parts. Its input variables are the estimated
useful life of the equipment, its current age and
condition variables (load, on-site condition
analysis, number of maintenance actions, etc.). The
output of this model is the remaining useful life
of the equipment in years (Vermeer et al. 2015).


Figure 2. Asset health index ranges and the relationship
with the probability of failure.


3.2.1 Inputs 

The model uses the useful life of the equipment as
static data. This allows a first estimate to be
made, which is later corrected with information on
the asset's condition and with the failure modes
that appear throughout the asset life cycle.


3.2.2 Index calculation methodology 
The methodology proposed by the author is separated
into three large blocks, depending on the type of
input data. The blocks are called the degradation
function, the static function and the condition
function. The model output is the relationship
between these functions that determines the health
index, which in this case is the remaining life of
the equipment.
The application of the model requires a prior
estimate of the average asset age based on
historical data. This average life becomes an
average technical life that is later corrected with
the specific condition data of the asset. Put
simply, the model brings forward or postpones the
end of the asset's technical life, depending on the
real asset condition at the moment of analysis
(Figure 3) (Vermeer et al. 2015).
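
The idea of shifting the end of technical life can
be sketched as below. The source does not give the
size of the corrections or how they are derived, so
the function name, the criteria and every number in
this example are hypothetical placeholders.

```python
def remaining_life(manufacture_year, current_year, average_technical_life,
                   corrections_years):
    """Remaining life after shifting the end of technical life (EoTL)."""
    # Start from the average technical life, then add or subtract years
    # according to the condition assessment of this specific asset.
    end_of_technical_life = (manufacture_year + average_technical_life
                             + sum(corrections_years.values()))
    return max(0, end_of_technical_life - current_year)


# Hypothetical asset: poor past maintenance costs years, good design adds years.
example_corrections = {"past maintenance": -3, "design / quality": 2}
years_left = remaining_life(1990, 2018, 40, example_corrections)
print(f"Estimated remaining life: {years_left} years")
```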


[Figure 3 timeline labels: Date of manufacturing;
Past life; Current date; Estimated remaining life
under normal operating conditions (years); End of
technical life (EoTL)]

Condition assessment output

                           Subtract years   Add years
Criteria                   from EoTL        to EoTL
Design / Quality           Poor             Good
Spare part availability    Poor             Good
Past maintenance           Too little       Sufficient
Future maintenance costs   Increase         Decrease
Wear from past loading     Much             Little
Future loading / use       Increase         Decrease
Inflicted damage           Yes


Figure 3. Estimated remaining life corrected with the
specific condition data of the asset.


3.2.3 Model outputs

Once all the proposed functions have been
calculated, they are combined to provide the
remaining life of the asset. To make the
calculation, the static function is first combined
in series with the condition function, while the
degradation function is calculated in parallel with
the others. As in the previous model, the index
output provides an approximate value of the asset
health status, which in this case is the remaining
life of the equipment. Figure 4 shows how the
different functions are combined to calculate the
asset health index proposed by the author.


3.3 Asset health index calculation model by 
TERNA 


This model, developed by Terna Rete, Italy,
proposes the calculation of the equipment health
index based on static and dynamic parameters.
Static parameters are associated with the place
where the equipment is located; they are invariable
in time and independent of the asset, for example
the recurrence of catastrophic phenomena, the
probability of electrical storms, etc. Dynamic
parameters are associated with the equipment and
can be measured in situ by functional and visual
tests, as well as in the laboratory by analysis of
oil samples, lubricants, etc.
The output of the model is an index between 0 and
0.5, where 0 refers to the state as new and 0.5 to
a critical state. It is intended to provide a
technical and economic justification for capital
investment decisions on equipment replacement
(Scatiggio et al. 2016; Scatiggio & Pompili 2013).


3.3.1 Inputs 

Figure 4. Schematic of how the functions are combined
to give a health index.


The model author proposes static and dynamic
variables as inputs to the model. Static variables
do not depend on the asset itself but on its
location (lightning frequency, catastrophic events,
etc.). The dynamic variables depend on the asset,
and their values change as the asset ages.
Therefore, by capturing the change over time and
comparing it with the maximum and minimum
admissible values for each kind of equipment, the
condition status can be estimated at each moment of
the asset life. The following condition parameters
are taken into account as inputs to the model
(Pompili & Scatiggio 2015); each parameter is known
as a Health Index (HI).


• HI dielectric: parameters related to dielectric and
thermal condition, as may be obtained from
dissolved gas analysis. These parameters are able
to provide information on electrical (partial
discharges, low energy discharges, arcing) and
thermal problems (hot spots, overloads);

• HI thermal: parameters related to the pure thermal
condition of the insulating paper, as may be
obtained from the CO2, CO and further periodical
determinations;

• HI mechanical: parameters related to the mechanical
condition of the transformer, as may be obtained
from on-site electrical tests (inductance
measurements, Sweep Frequency Response Analysis or
SFRA, Frequency Domain Spectroscopy or PDC/FDS);

• HI oil: parameters related to insulating oil
condition, as may be obtained from water content,
acidity, 50-60 Hz Breakdown Voltage (BDV) and
Dielectric Dissipation Factor (DDF)
determinations.


3.3.2 Index calculation methodology 
Because the condition factors are very different from each other, they
must first be standardized with their corresponding weights. International
guidelines and regulations (IEC, IEEE, CIGRE, etc.) are used to transform
each value into a non-dimensional number.

Once the parameters have been standardized, the HI is calculated from
Equation 3:

$$\mathrm{HI} = \frac{\mathrm{HI}_{dielectric} + \mathrm{HI}_{thermal} + \mathrm{HI}_{mechanical} + \mathrm{HI}_{oil}}{\mathrm{HI}_{max}} \tag{3}$$

where HI_max is a prefixed number and, as a consequence, the HI of each
asset may be expressed per unit (p.u.).
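The aggregation step can be sketched in a few lines of Python. This is only an illustration: it assumes Equation 3 is the sum of the four standardized partial indices divided by HI_max, and the function name, the partial scores and the value of `hi_max` are hypothetical, since the source does not give them.

```python
def health_index(hi_dielectric, hi_thermal, hi_mechanical, hi_oil, hi_max):
    """Per-unit health index: sum of the four partial indices over hi_max.

    Assumes the partial indices are already standardized and weighted.
    """
    if hi_max <= 0:
        raise ValueError("hi_max must be positive")
    return (hi_dielectric + hi_thermal + hi_mechanical + hi_oil) / hi_max


# Hypothetical partial scores and normalization constant (not from the source).
example_hi = health_index(
    hi_dielectric=2.0, hi_thermal=1.0, hi_mechanical=0.5, hi_oil=1.5, hi_max=40.0
)
print(example_hi)
```

Dividing by a fixed HI_max is what makes indices from different assets comparable on the same per-unit scale.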


3.3.3 Model outputs
The model output is an HI value between 0 and 0.5. Higher HI values are
associated with lower asset reliability, and lower HI values with higher
reliability.
Depending on their HI, assets are classified in four classes (Table 2).
Assets classified as in "very good" or "good" condition may be managed
following common, standard maintenance practices; assets classified as
"fair" or "doubtful" need an increased analysis frequency or a deeper
investigation (Scatiggio et al. 2016).

Table 2. Health Index (HI) evaluation.

Health Index (HI)    Condition
0-0.10               Very Good
0.10-0.20            Good
0.20-0.30            Fair
>0.30                Doubtful
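As an illustration of how the classes in Table 2 could be applied, the sketch below maps an HI value to its condition class. The handling of the shared boundaries (0.10, 0.20, 0.30) is an assumption, because the table lists overlapping interval ends; here each boundary value is assigned to the better class.

```python
def classify_health_index(hi):
    """Map a per-unit HI to the condition classes of Table 2.

    Boundary values are assigned to the better class (an assumption).
    """
    if hi < 0:
        raise ValueError("HI cannot be negative")
    if hi <= 0.10:
        return "Very Good"
    if hi <= 0.20:
        return "Good"
    if hi <= 0.30:
        return "Fair"
    return "Doubtful"


for value in (0.05, 0.15, 0.25, 0.40):
    print(value, classify_health_index(value))
```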


The models introduced in this paper are relevant, among other things,
because:


• Asset managers need models to study options that maximise the value of
  an asset as it approaches the end of its useful life. Options may
  include, for example, changing the operating regime, partial asset
  replacement or refurbishment to extend useful life, or an indefinite
  ongoing 'patch-and-continue' programme, perhaps involving suppliers to
  provide necessary parts or services.

• Predicted performance supported by knowledge and asset information is
  available in many companies, normally based on a good understanding of
  how assets degrade, but it is not incorporated in formal processes for
  capital investment. These models contribute to the decision process
  that seeks the optimal life cycle value.


4 CHALLENGES OF AHI APPLICATION 


To respond to the increasingly demanding requirements of asset
management, AHI models offer the possibility of improving decision making
in maintenance. According to the literature reviewed, best practice
agrees that using the asset health index offers the following advantages:


• Consolidation of all information sources about the asset condition in
  a single integrated view of asset health.

• An indication that an asset is approaching the end of its useful life.

• Assessment of asset condition and performance.

• Report generation for maintenance attention.

• Identification of short- and medium-term needs for the replacement of
  individual equipment.

• Prediction of long-term replacement needs in large volumes of assets,
  identifying potential peaks in investment requirements.

• Identification of problems, risks and opportunities for maintenance
  management.

• Information on asset deterioration trends that do not correspond to
  the rates of natural aging processes, which can be useful for planning
  appropriate maintenance strategies.

• Comparison of asset condition by classes and locations, allowing
  action to be taken in the operation and maintenance strategy of the
  organization.


On the other hand, any organization that decides to implement this tool
in its strategic processes in order to improve its asset management will
have to take the following considerations into account. Depending on the
maturity of the organization, they may be a challenge to overcome in some
cases and an inconvenience to avoid or mitigate in others:


• Data collection has a high cost. Capturing certain information
  requires a field technician to inspect and record the data.

• Uncertainty in evaluating asset conditions can create inconsistencies
  in the collected data.

• Uncertainty about the return on investment, the valuation of costs and
  the financing of asset replacement or renewal can make it difficult to
  determine the information.

• The lack of consistent and compatible methods to record, store and
  reference information can cause errors in the analytical phase.


Given these advantages and disadvantages, the implementation of the AHI
tool is worth investigating. The initial capture and processing of data
is critical; improving asset management at this stage will ensure better
results. For any organisation that decides to apply AHI tools, it is
essential to incorporate into its asset management model a condition
study of the asset after a replacement prompted by a decision based on
asset health. This will allow the mathematical model to be learned and
adjusted from the organisation's own experience, which will lead to
better results in long-term decision making.


5 CONCLUSIONS 


Today, tools for decision making about the long-term renewal and
replacement of equipment are widely used by organizations. Life-Cycle
Cost (LCC) analysis makes it possible to know, from an economic point of
view, the cost of an asset over its useful life and to estimate the time
for replacement if needed. The disadvantage, in many cases, is the large
number of variables that must be handled when estimating the real cost of
an asset over its useful life, which generates a scenario of high
uncertainty (Durairaj et al. 2002). This is where the AHI comes into play
as a support tool with a completely different calculation methodology: it
is estimated from laboratory tests of the asset condition, visual
inspections, operation and maintenance history, and the age of the
equipment and its components.

The roadmap for the definition of an AHI model is applicable to
different kinds of equipment. It is currently under development, and this
development is generating the need to open new lines of research, in
parallel to what is currently implemented in the field of electrical
networks and, more specifically, electrical transformers.

ABSTRACT: Acoustic Emission (AE) has seen increased popularity in machine condition monitoring
applications. AE applications usually involve higher sampling rates than vibration signals, not rarely
reaching 2 MHz. One of the main challenges of AE-based fault diagnosis is the need to preprocess the
massive amounts of data generated by this technique, including the engineering of appropriate features
and dimensionality reduction so that such massive datasets can be handled. In this paper, we propose a
novel method based on deep Convolutional Neural Networks (CNN) to handle raw AE signals for
diagnosis of a system's health states. This method is flexible enough not only to handle the massive
amount of AE data, but also to provide the means for automatic feature extraction by applying various
filters to the raw AE signals, thus identifying relevant frequencies related to different faults. The
proposed CNN method is applied to fatigue crack detection on the blades of an experimental rotor.


1 INTRODUCTION 


Unscheduled maintenance of mechanical systems leads to loss of
production and may also affect safety. There are several ways to minimize
this effect, among them increased redundancy, programmed maintenance to
identify problems that could result in extended downtime, and condition
monitoring (Rabiei et al., 2016).

A popular approach to condition monitoring is vibration analysis.
However, vibration monitoring is usually sensitive only to damage that
has already developed, which poses a significant limitation in sensitive
systems. Acoustic Emission techniques, on the other hand, are gaining
ground because they can identify damage at early stages, with the
tradeoff of higher sampling rates that result in massive,
higher-dimensional data.

Moreover, both AE and vibration monitoring require signal preprocessing
and interpretation, such as wavelets, fast Fourier transform and band
filtering, among others (Riaz et al., 2017), a labor-intensive and
expensive endeavor requiring specialized engineering expertise.

Machine learning techniques have become a popular choice for fault
diagnosis and prognosis. Most of these shallow models rely heavily on
manual feature identification and extraction (Ruiz-Gonzalez et al., 2014;
Kane and Andhare, 2016; Li et al., 2016).

As discussed in Verstraete et al. (2017), the performance of these
methods depends on the quality of the hand-engineered features, which
requires significant understanding of the system's degradation processes.

To tackle these challenges, we propose a deep CNN-based method for fault
diagnosis that operates on massive raw acoustic signals and allows
automatic, hierarchical "layer to layer" feature extraction to learn
complex representations of the data.

The remainder of the paper is structured as follows. Section 2
introduces deep learning and CNNs and their architectural building
blocks. Section 3 then discusses the proposed method and its application
and validation for fault diagnosis of an experimental rotor, and compares
its performance to a fully optimized shallow neural network. Section 4
presents some concluding remarks.

2 CONVOLUTIONAL NEURAL NETWORKS


2.1 Artificial neural networks 


Artificial Neural Networks (ANNs) are composed of simple units called
neurons. Each neuron computes a linear combination of its input (x) using
a weight (w) and a bias (b), both of which are parameters learned by the
ANN.

An activation function (f) then adds the non-linear behavior that allows
the network to compute nontrivial responses. The output (O) of a neuron
is computed by Equation (1):


O(x) = f(wx + b)    (1)
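As a minimal sketch of Equation (1), the snippet below evaluates a single neuron with a vector input. The sigmoid is used only as an example activation; the function names and the numeric values are illustrative assumptions, not part of the original method.

```python
import math


def sigmoid(z):
    """Example activation function f."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(x, w, b, f):
    """O(x) = f(w.x + b) for input vector x, weight vector w and bias b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return f(z)


# Illustrative values: w.x + b = 0.5 - 0.5 + 0.0 = 0, and sigmoid(0) = 0.5.
neuron_example = neuron_output(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0, f=sigmoid)
print(neuron_example)
```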


2.2 Deep learning 


Simply put, deep learning is a branch of machine learning that uses many
hidden layers to perform the learning. Deep networks learn features on
top of the features learned by previous layers, which introduces a
hierarchy of features with different levels of abstraction (Deng and Yu,
2014). This is important for achieving high accuracy in tasks with
complex relationships among the data, such as image recognition and
signal processing.


2.3 Convolutional neural network overview


Convolutional Neural Networks (CNNs) are a type of neural network
specialized for processing data with a grid topology (Goodfellow, Bengio
and Courville, 2017). CNNs have been shown to outperform shallow
architectures in many image recognition tasks and have been applied to
vibration-based fault diagnosis (Verstraete et al., 2017). The main
characteristics of CNNs are sparse connectivity and parameter sharing.
The first means that CNNs use filters that are considerably smaller than
the input, so the filters store fewer parameters than a shallow neural
network while still detecting important features of the input. The second
means that the filter weights in a convolutional layer are reused
multiple times across the input, which results in computationally
efficient matrix multiplication.


2.3.1 Convolutional layer

Because the raw AE data are one-dimensional time series, the convolution
operation in the proposed model is also 1D. This means that the filters
(w(t)) learned by the network are in the time domain and generate a
filtered version of the input signal (x(t)) that highlights features
representing the system's health state. The output signal of a 1D
convolution, s(t), is computed as shown in Equation (2):


$$s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a) \tag{2}$$
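The 1D convolution of Equation (2) can be illustrated for finite signals as follows. This is a plain-Python sketch that keeps only the positions where the filter fully overlaps the signal ("valid" output); that choice, and the example filter, are assumptions for illustration. Note that many deep learning libraries implement the closely related cross-correlation (no filter flip), which makes no practical difference when the filter is learned.

```python
def conv1d_valid(x, w):
    """Discrete convolution s(t) = sum_a x(a) * w(t - a), 'valid' part only."""
    n, k = len(x), len(w)
    out = []
    for t in range(k - 1, n):
        # a runs over the k input samples that overlap the flipped filter.
        out.append(sum(x[a] * w[t - a] for a in range(t - k + 1, t + 1)))
    return out


# A two-tap difference filter applied to a ramp gives a constant slope.
conv_example = conv1d_valid([1, 2, 3, 4], [1, -1])
print(conv_example)
```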


2.3.2 Batch normalization layer 
Batch Normalization (BN) is a method for accelerating the training of
deep networks by reducing the internal covariate shift of the data (Ioffe
and Szegedy, 2015). This is achieved by normalizing each mini-batch and
learning two new parameters: a scale parameter gamma (γ) and a shift
parameter beta (β).

Given the p-dimensional input to a BN layer,
$x = (x^{(1)}, \ldots, x^{(p)})$, the transformation is given by
Equation (3) and Equation (4):


$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}} \tag{3}$$

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)} \tag{4}$$


where Var[x^(k)] is the variance and E[x^(k)] is the expectation, both
computed over the training set. Equation (3) standardizes the features
and accelerates convergence, and Equation (4) restores the representation
power of the network.
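A minimal sketch of the batch normalization transform follows, computed per feature over one mini-batch. The small constant `eps` added to the variance for numerical stability is an assumption of this sketch (it is common practice but does not appear in the equations above), and the example values are illustrative.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Standardize each feature over the batch, then scale and shift.

    batch: list of samples, each a list of p features.
    gamma, beta: per-feature scale and shift (learned in a real network).
    """
    n, p = len(batch), len(batch[0])
    out = [[0.0] * p for _ in range(n)]
    for k in range(p):
        column = [sample[k] for sample in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        for i in range(n):
            x_hat = (column[i] - mean) / math.sqrt(var + eps)
            out[i][k] = gamma[k] * x_hat + beta[k]
    return out


bn_example = batch_norm([[1.0, 10.0], [3.0, 30.0]], gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(bn_example)
```

With gamma = 1 and beta = 0, each feature column ends up with approximately zero mean and unit variance regardless of its original scale.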


2.3.3 Pooling layer

Pooling layers are used to reduce the dimension of the input and achieve
spatial invariance. This is usually accomplished by taking the maximum
value within a pooling window and using it to replace all values in that
window. This lowers the resolution while keeping only the most important
feature.
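For illustration, 1D max pooling can be sketched as below. The window size and the non-overlapping stride are example choices, not values from the paper (the proposed architecture does not use pooling).

```python
def max_pool1d(x, window, stride=None):
    """Replace each window of the signal by its maximum value."""
    stride = stride or window  # non-overlapping windows by default
    return [max(x[i:i + window]) for i in range(0, len(x) - window + 1, stride)]


pool_example = max_pool1d([1, 3, 2, 5, 4, 0], window=2)
print(pool_example)
```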


2.3.4 Activation function 

As discussed before, activation functions add non-linear behavior to the
network. In the proposed CNN architecture, we employ Rectified Linear
Units (ReLUs) as the activation function for the convolutional layers, as
they provide increased sparsity compared with Tanh or Sigmoid activation
functions, thus decreasing computation time (Maas et al., 2013). The
commonly used ReLU is shown in Equation (5):


g(x) = max(0, x)    (5)


For the fully connected layers (see Section 2.3.5), the softmax
activation function is used to quantify the probability that a sample
corresponds to a given system health state. This function is shown in
Equation (6):


$$\sigma(x)_j = \frac{e^{x_j}}{\sum_{k} e^{x_k}} \tag{6}$$
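The two activation functions can be sketched as follows. Subtracting the maximum before exponentiating in `softmax` is a standard numerical-stability measure and an addition of this sketch; it does not change the result.

```python
import math


def relu(x):
    """Rectified linear unit: g(x) = max(0, x)."""
    return max(0.0, x)


def softmax(z):
    """Turn a list of scores into probabilities that sum to one."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs_example = softmax([1.0, 2.0, 3.0])
print(relu(-2.0), relu(3.0), probs_example)
```

The largest score always receives the largest probability, so the predicted health state is the index of the maximum softmax output.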
2.3.5 Fully-connected layer

Finally, two fully connected layers of the same dimension are
responsible for the classification based on the feature maps from the
last convolution.


2.3.6 Network optimization 

For supervised learning, the network learns by minimizing a loss
function that measures the difference between the predicted and true
labels. The selected loss function is the cross-entropy between the
estimated softmax output q(x) and the target class p(x):


$$\mathrm{Loss} = -\sum_{x} p(x) \log(q(x)) \tag{7}$$
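A short sketch of the cross-entropy loss for one sample, with a one-hot target p and a softmax output q. The clipping constant `eps`, which guards against log(0), is an assumption of this sketch, and the probabilities are illustrative.

```python
import math


def cross_entropy(p, q, eps=1e-12):
    """Loss = -sum_x p(x) * log(q(x)) for one sample."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))


# One-hot target for the second class and an example softmax output.
ce_example = cross_entropy(p=[0.0, 1.0, 0.0], q=[0.1, 0.8, 0.1])
print(ce_example)
```

With a one-hot target, the loss reduces to the negative log-probability assigned to the true class, so it approaches zero as that probability approaches one.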


The network is optimized via ADAM (Kingma and Ba, 2014), an algorithm
based on Stochastic Gradient Descent (SGD). This optimizer works with
adaptive estimates of lower-order moments and performs well with noisy
and sparse gradients. The method combines two extensions of SGD. The
first is the Adaptive Gradient Algorithm, which improves performance on
problems with sparse gradients by maintaining a learning rate per
parameter. The second is Root Mean Square Propagation, which adapts the
learning rate as a function of the magnitude of the gradients for the
weights, improving performance with noisy data.
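To make the update rule concrete, here is a sketch of a single-parameter ADAM step following Kingma and Ba (2014). The default values of `beta1`, `beta2` and `eps` are those suggested in that paper; the learning rate and the toy quadratic objective are illustrative assumptions, since the hyperparameters used for the proposed CNN are not stated.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias corrections
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy problem: minimize f(theta) = theta**2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for step in range(1, 2001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, step, lr=0.01)
print(theta)
```

Because the step is the ratio of the two moment estimates, its size stays close to the learning rate regardless of the raw gradient scale.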


2.3.7 Regularization 

To tackle the CNN's tendency to overfit during training and to improve
generalization performance, regularization is implemented by means of two
approaches. First, we employ L2 regularization by adding a term to the
loss function that penalizes large weights across the network (Peng et
al., 2015). Second, we use dropout, which consists of disconnecting some
neurons during training to prevent co-adaptation (Srivastava et al.,
2014); it is implemented in the fully connected layers with a 50% drop
probability.
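Both regularizers can be sketched in a few lines. The penalty coefficient `lam` is hypothetical (the paper does not report it), and the rescaling of the surviving activations by 1 / keep probability ("inverted dropout") is an assumption of this sketch about how dropout is applied.

```python
import random


def l2_penalty(weights, lam):
    """Term added to the loss: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)


def dropout(activations, drop_prob=0.5, training=True, rng=random):
    """Zero each activation with probability drop_prob during training."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep = 1.0 - drop_prob
    # Survivors are rescaled so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


penalty_example = l2_penalty([1.0, 2.0], lam=0.1)
dropped_example = dropout([1.0] * 1000, drop_prob=0.5, rng=random.Random(0))
print(penalty_example, sum(1 for a in dropped_example if a == 0.0))
```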


3 PROPOSED CNN METHOD 


3.1 Dataset 


The proposed CNN-based method is applied to a dataset generated from AE
monitoring of an experimental rotor, shown in Figure 1.

The setup comprises:


• Mistras Micro 30 Acoustic Emission sensors
• Rotor and blades
• DC motor MY-1016, 24 V, 13.7 A
• MCP Q10-QS305 power source


The rotor has 8 blades, one of which is notched according to the sizes
and positions in Table 1.


Figure 1. Experimental rotor setup.


Table 1. Size and position of notches.

Position [mm]    Size [mm]
5                3
20               6
                 10


Figure 2. Sketch of a blade with a 6 mm notch at the 5 mm position.


A sketch of a blade is displayed in Figure 2. 

The acquisition rate is 500 kHz, and each combination of notch position
and size is measured for 176.16 s, divided into 168 files of 524,288 data
points each.

In this dataset, there are seven health conditions: undamaged; 3 mm,
6 mm and 10 mm cracks at the 5 mm position; and 3 mm, 6 mm and 10 mm
cracks at the 20 mm position. For each health state, there are 53,477,376
data points, and the CNN is fed with samples composed of slices of 49,152
points (1.77 rotor turns) of the raw signal with 50% overlap. Note that
the proposed CNN method is trained for a three-health-state scenario:
(1) undamaged; (2) damage at the 5 mm position, obtained by combining the
3 mm, 6 mm and 10 mm cracks at that position; and (3) damage at the 20 mm
position, obtained by combining the 3 mm, 6 mm and 10 mm cracks at that
position.
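The slicing and relabeling described above can be sketched as follows. Whether the authors sliced each file separately or a concatenated signal is not stated, so the per-file count below is only an illustration; the function names and label strings are hypothetical.

```python
def sliding_windows(signal, window=49152, overlap=0.5):
    """Cut a signal into fixed-length slices with fractional overlap."""
    hop = int(window * (1.0 - overlap))
    return [signal[s:s + window] for s in range(0, len(signal) - window + 1, hop)]


def three_class_label(notch_position_mm):
    """Merge the crack sizes so that only the notch position is kept."""
    if notch_position_mm is None:
        return "undamaged"
    return "damage_%dmm" % notch_position_mm


# Small illustration: 10 points, window of 4, 50% overlap -> hop of 2.
small_example = sliding_windows(list(range(10)), window=4, overlap=0.5)

# Slices obtainable from one 524,288-point file, if files are sliced separately.
windows_per_file = len(range(0, 524288 - 49152 + 1, 49152 // 2))
print(len(small_example), windows_per_file)
```

The 50% overlap roughly doubles the number of training samples relative to non-overlapping slices, at the cost of correlated neighbouring samples.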

Data augmentation is also implemented to improve network
generalization, and thus the accuracy on unseen samples. The dataset has
a total of 548,458,432 data points, which are split in an 80-20
proportion for training and testing, respectively.

Figure 3 shows examples of raw AE signals for each of the three health
conditions. Notice that the signals contain a significant amount of
noise, mainly because the rotor system is not perfectly balanced and has
some degree of misalignment. In addition, vibrations from the bearings,
coupling and motor add further noise to the response.

However, de-noising methods incur a loss of information and add
pre-processing time, both of which we want to avoid by handling the noise
within the proposed architecture.

Moreover, based on the raw signals and the amplitude spectra shown in
Figure 3, the health conditions are remarkably similar, which, coupled
with the signal noise levels, makes this dataset a significantly
challenging diagnosis task.

Figure 3. Raw AE signals and amplitude spectra for the three health
conditions.


3.2 Proposed deep CNN architecture 


The proposed CNN architecture, which processes batches of 256 samples,
consists of five convolutional layers (see Figure 4). The first
convolutional layer has 32 oversized filters of size 128 x 1, designed to
tackle background noise in the raw acoustic emission signal. It is
followed by four convolutional layers with 32, 32, 64 and 128 filters of
size 3 x 1, respectively, which are designed to extract features from the
AE data automatically and hierarchically. The output of the last
convolutional layer is reshaped before being fed to the two fully
connected layers, each with 1024 neurons, which process the features
obtained from the convolutional layers to perform fault diagnosis.
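As a rough check of the size of this architecture, the sketch below counts weights and biases per layer. Several points are assumptions rather than statements from the text: the first layer is given a stride of 32 (inferred from Figure 4, where the 49,152-point input becomes feature maps of length 1536), the remaining layers are assumed to preserve length, a final 3-class output layer is assumed after the two hidden layers, and batch normalization parameters are ignored.

```python
def conv1d_params(kernel, in_channels, out_channels):
    """Weights plus biases of a 1D convolutional layer."""
    return kernel * in_channels * out_channels + out_channels


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


INPUT_LENGTH = 49152
FIRST_STRIDE = 32  # assumption inferred from Figure 4 (49152 -> 1536)
feature_length = INPUT_LENGTH // FIRST_STRIDE

# (kernel size, input channels, output channels) for the five conv layers.
conv_layers = [(128, 1, 32), (3, 32, 32), (3, 32, 32), (3, 32, 64), (3, 64, 128)]
conv_total = sum(conv1d_params(*layer) for layer in conv_layers)

flattened = feature_length * conv_layers[-1][2]
dense_total = (
    dense_params(flattened, 1024)
    + dense_params(1024, 1024)
    + dense_params(1024, 3)  # assumed 3-class output layer
)
print(feature_length, flattened, conv_total, dense_total)
```

Under these assumptions the flattened feature vector has 196,608 entries, and almost all learnable parameters sit in the first fully connected layer rather than in the convolutions.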

The proposed CNN method is trained for 15000 epochs, where one epoch is
one pass over all training samples. The CNN is regularized via dropout in
the fully connected layers with a 50% keep probability, L2 regularization
(see Section 2.3.7) and early stopping, which saves the best epoch in
terms of accuracy and generalization capability (train loss remaining low
as test loss decreases).


Figure 3 panels: a) 5 mm sample signal, b) 5 mm amplitude spectrum, c) 20 mm sample signal, d) 20 mm
amplitude spectrum, e) Undamaged sample signal and f) Undamaged amplitude spectrum. The spectrum
panels show single-sided amplitude spectra.


Figure 4. Architecture of the CNN. Raw signal input 1@49152x1; a 128x1 convolution followed by four
3x1 convolutions, each with batch normalization and ReLU, producing feature maps of length 1536.


Also of note is that the proposed CNN architecture does not have
pooling layers, for two reasons. First, the proposed CNN method is not
required to achieve spatial invariance, as it deals with raw acoustic
emission signals. Second, the reduction in feature-map size produced by
pooling layers improved the CNN only marginally (in terms of accuracy and
generalization) or, conversely, degraded its performance because of the
loss of information that pooling incurs.

In terms of activation functions, the fully connected layers use
softmax, whereas ReLU is implemented in all five convolutional layers
(see Section 2.3.4 for details). The CNN weights are initialized by
Xavier normal initialization, a normal distribution scaled by the sizes
of the previous and next layers (Glorot and Bengio, 2010), and the biases
are set to a constant 0.1 in all layers.
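A sketch of this initialization for one fully connected layer is shown below. It uses the usual Glorot normal standard deviation, sqrt(2 / (fan_in + fan_out)); the layer sizes and the helper name are illustrative, and library implementations may additionally truncate the distribution.

```python
import math
import random


def xavier_normal(fan_in, fan_out, rng=random):
    """Weights drawn from N(0, 2 / (fan_in + fan_out)), biases set to 0.1."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    weights = [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]
    biases = [0.1] * fan_out
    return weights, biases


init_weights, init_biases = xavier_normal(200, 100, rng=random.Random(0))
flat_weights = [w for row in init_weights for w in row]
sample_std = math.sqrt(sum(w * w for w in flat_weights) / len(flat_weights))
print(sample_std, math.sqrt(2.0 / 300))
```

Scaling the variance by the layer sizes keeps the magnitude of activations and gradients roughly constant from layer to layer at the start of training.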


3.3 CNN implementation


All the results shown in the next section were obtained using the
following hardware configuration at the Smart Reliability and Maintenance
Integration Laboratory (SRMILab) of the University of Chile: Intel®
Core™ i7-6700K CPU with 32 GB RAM and an NVIDIA Titan XP GPU.


3.4 Results and discussion 


The proposed CNN method is compared with a shallow ANN that has the
same two fully connected layers but lacks the convolutional layers. This
allows us to assess the impact on fault diagnosis performance of the
convolutions as a signal filtering and de-noising tool, as well as the
quality and robustness of the extracted features. This ANN is fully
optimized with the ADAM adaptive gradient-based optimization algorithm
and regularized via dropout (with 50% keep probability), weight
regularization for both hidden layers and early stopping.


Table 2. Accuracy for health state of the system.

       Test accuracy (%)
ANN    33.7
CNN    93.0


Table 3. Performance measures in percentages [%] for the proposed CNN
method.

               5 [mm]          20 [mm]         Undamaged
Sensitivity    89.77 ± 1.70    91.50 ± 1.48    97.65 ± 0.36
Specificity    94.69 ± 0.80    95.93 ± 0.62    99.04 ± 0.46
Precision      89.22 ± 1.62    91.53 ± 1.22    98.18 ± 0.92
F1 Score       89.48 ± 0.98    91.51 ± 0.88    97.91 ± 0.57
Accuracy       93.06 ± 0.80    94.50 ± 0.42    98.56 ± 0.39

Table 2 shows the overall test fault diagnosis 
accuracy. The proposed CNN method significantly 
outperforms the shallow ANN in terms of accu- 
racy and generalization capacity, with the ANN 
barely learning from the complex AE dataset. 

To corroborate these results, we collect the average values and
corresponding standard deviations per health state from multiple runs
for different performance metrics, as shown in Table 3.

Table 4. Confusion matrix for the proposed CNN method.

             5 [mm]    20 [mm]    Undamaged
5 [mm]       363       35         6
20 [mm]      28        382        0
Undamaged    8         0          402


Figure 5. Normalized confusion matrix for the proposed CNN method (true
label versus predicted label; values legible for the 5 mm class: 0.899,
0.087, 0.015).


Figure 6. Accuracy and loss behavior of the CNN and the ANN over
training epochs; panel d) shows the loss behavior of the ANN.


Moreover, the unnormalized and normalized 
confusion matrices are shown in Table 4 and 
Figure 5, respectively. 
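The per-class measures can be derived from the counts in Table 4 as sketched below. The sketch assumes that rows are true labels and columns are predicted labels, which is consistent with the first row of Figure 5 (0.899, 0.087, 0.015). The results describe this single confusion matrix and are not expected to reproduce exactly the multi-run averages of Table 3.

```python
LABELS = ["5 mm", "20 mm", "Undamaged"]
CONFUSION = [
    [363, 35, 6],
    [28, 382, 0],
    [8, 0, 402],
]  # rows: true label, columns: predicted label (assumed orientation)


def per_class_metrics(cm, k):
    """One-vs-rest sensitivity, specificity, precision and F1 for class k."""
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(row[k] for row in cm) - tp
    tn = sum(sum(row) for row in cm) - tp - fn - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, precision, f1


normalized_first_row = [round(c / sum(CONFUSION[0]), 3) for c in CONFUSION[0]]
overall_accuracy = sum(CONFUSION[i][i] for i in range(3)) / sum(map(sum, CONFUSION))

for index, name in enumerate(LABELS):
    print(name, [round(100 * m, 2) for m in per_class_metrics(CONFUSION, index)])
print(normalized_first_row, round(100 * overall_accuracy, 2))
```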

Based on these results, the proposed CNN method outperforms the shallow
ANN for the rotor's fault diagnosis based on acoustic emission
monitoring. This is corroborated by Figure 6a) and c), which show that
the CNN presents a monotonically decreasing testing loss that leads to
improvement in the fault diagnosis accuracy. It should be observed,
however, that the ratio of accuracy improvement to training time for the
CNN is very low in the last epochs, even though the network is still
learning. This could be driven by a very low learning rate in these
epochs, as ADAM adapts this hyperparameter.

In contrast, as shown in Figure 6b) and d), the ANN barely learns from
the raw AE data. This could be attributed to the complexity of the data
as well as to the meaningless features that the ANN extracts by treating
the signals as independent points, a problem that seems to be compensated
for by the convolutional filters in the CNN.

However, the superior performance achieved by the proposed CNN method
comes at a much higher computational cost, because the significant number
of learnable parameters and hyperparameters leads to extended training
times.


Figure 6 panels: a) Accuracy behavior of the CNN, b) Accuracy behavior of the ANN, c) Loss behavior of
the CNN and d) Loss behavior of the ANN.

4 CONCLUSIONS


This paper has introduced a new deep CNN-based method for fault
diagnosis using raw acoustic emission signals. Its application to an
experimental rotor has shown that the proposed method delivers
satisfactory performance metrics for health state diagnosis. The CNN
method was also compared to a fully optimized shallow ANN, which it
significantly outperformed.

These solid results in fault diagnosis are mainly due to the CNN's
ability to automatically extract features from, and efficiently handle,
the noisy acoustic emission signals. This also brings major advantages to
the development of automated monitoring and fault diagnosis tools, such
as the possibility of bypassing human intervention in the labor-intensive
feature engineering process and reducing the need for preprocessing and
de-noising of acoustic emission signals. Based on these preliminary
results, the proposed CNN method is a promising tool for fault diagnosis.

ABSTRACT: The world's population is increasingly settled in urban areas. Ensuring the welfare of society
against upcoming crises derived from emerging challenges such as climate change, social dynamics and
critical infrastructure dependencies within cities is seen as a priority by both scholars and practitioners.
In this context, the concept of city resilience gains relevance. Moreover, the task of ensuring the
well-being of society by increasing city resilience cannot rely completely on public entities; the
contribution of private companies and citizens is also needed. There is a need to develop effective
mechanisms such as Public Private People Partnerships (4Ps) to support the city resilience-building
process. In order to develop effective and long-lasting 4Ps that contribute to the city
resilience-building process, it is important to consider three dimensions: stakeholder relationship,
information flow and conflict resolution. The aim of this paper is to present and describe a repository
of best practices gathered from real city resilience-building processes that are currently taking place
in different cities all over the world and that contribute to the development of these three important
4P dimensions.


1 INTRODUCTION 


According to the United Nations, by 2030 more than 60% of the world's
population will be settled in urban areas (WHO 2017). Moreover, cities
are currently at the crossroads of challenges such as climate change,
social dynamics and the increasing dependence on the correct functioning
of critical infrastructures, which directly affect the welfare of society
(Gonzalez et al. 2017, Schauppenlehner-Kloyber & Penker 2016, Toubin et
al. 2014, Elmqvist et al. 2013). Therefore, crises affecting cities
derived from these challenges will potentially affect the welfare of
citizens. This is why ensuring effective crisis management within cities
will be increasingly important in the upcoming years in order to ensure
the wellbeing of society (Toubin et al. 2014).

It is also important to bear in mind that the events that trigger crises
can be predictable or unpredictable. Moreover, predictable crises can
also have unpredictable consequences due to potential cascading failures
between complex interconnected systems (Pyrko et al. 2017). Therefore, a
risk management approach that only considers predictable risks and
consequences is not enough to deal with today's crises (Boin & McConell
2007). The resilience concept seems promising to address the need to deal
with unexpected crises (FCOP 2011). Therefore, efforts are being made to
promote resilience in order to be able to face upcoming unpredictable
crises that could potentially affect the welfare of society.


City resilience is an emerging concept that has been gaining popularity
in the last few years. However, there is still a lack of consensus on its
definition, and different approaches to it coexist (Bang & Rankin 2016).
Within this research, city resilience is defined as "the ability of a
city or region to resist, absorb, adapt to and recover from acute shocks
and chronic stresses to keep critical services functioning, and to
monitor and learn from on-going processes through city and cross-regional
collaboration, to increase adaptive abilities and strengthen preparedness
by anticipating and appropriately responding to future challenges"
(Hernantes et al. 2016).

Therefore, increasing city resilience will be a priority to ensure the
welfare of society in the upcoming years (Toubin et al. 2014). In fact,
ensuring the well-being of citizens in times of crisis is not a mission
that can be delegated to public entities alone. To fulfil this mission
successfully, the collaboration of additional city stakeholders, such as
private companies and citizens, is required (Kapucu 2012, Oxley 2013).
While public entities should be the ones in charge of coordinating all
the efforts being made to increase city resilience, it could also be
beneficial to involve private companies and citizens (Gimenez et al.
2016). Private companies could contribute technical expertise and
additional resources in case the damage caused by an unpredictable event
exceeds the capacity of public entities to deal with the crisis.
Moreover, citizens could use their specific knowledge of the local
community to better understand the needs of local people regarding city
resilience (O'Sullivan et al. 2015, Scolobig et al. 2015). Therefore,
addressing unpredictable events affecting cities requires setting aside a
silo-thinking mentality by involving different city stakeholders and
coordinating their efforts to increase the city resilience level.

In light of this situation, developing meaningful 
public private people partnerships (4Ps) at the city 
level for strategic decision-making regarding city 
resilience could have a significant positive impact 
(Boyd & Juhola 2014, Ng et al. 2013). We define 
4Ps as partnering arrangements, both formal and 
informal, that are developed between public enti- 
ties, private companies and citizens with the aim of 
improving the city resilience-building process. 

It is important to bear in mind that the objective 
of developing 4Ps is to foster meaningful collabora- 
tion among city stakeholders rather than to develop 
agreements per se. Rigid agreements have proven 
to be suitable for addressing expected crises, but 
trusting that rigid predetermined agreements can 
address unexpected events is not always effective 
(Stewart et al. 2009). Developing meaningful and
long-lasting 4Ps in which all the city stakeholders
are represented makes it possible to increase adaptability
and improvisation capacity in times of crisis.

In order to develop effective 4Ps, it is important
to consider three dimensions: stakeholder relationship,
information flow and conflict resolution
(Marana et al. 2018a). The aim of this paper is to
present and describe a repository of best practices 
(projects, strategies, activities, policies, methodolo- 
gies and tools) gathered from real city resilience- 
building processes that are currently taking place 
in different cities all over the world that contribute 
to the development of 4P dimensions. 


2 STATE OF THE ART 


2.1 Fragmented efforts between city stakeholders 


Emerging wide-scope, complex challenges like climate
change, socio-political issues or critical infrastructure
dependency cannot be addressed by a
single institution on its own.

The awareness regarding the importance of 
addressing the effects of these challenges is increasing 
among most city stakeholder groups (Gonzalez et al. 
2017). For instance, local governments are developing
and implementing climate change adaptation
plans; critical infrastructures are investing resources
in understanding the existing interdependencies
between them in order to prevent cascading failures
that end up affecting several services; NGOs are
focused on developing programs to reduce inequalities
that affect vulnerable populations; and so on.

Each city stakeholder group can contribute to 
the city resilience-building process in different ways. 


Public entities (local, regional and national government,
emergency services and so on) can contribute
with their decision-making experience and
capacity as well as with material resources. Private
companies (Critical Infrastructure providers, busi- 
nesses, insurance companies and so on) can con- 
tribute with technical and operational expertise as 
well as with additional material resources. People 
(citizens and NGOs) can contribute with knowl- 
edge of social, behavioral, economic and environ- 
mental issues (Scolobig et al. 2015). 

All the approaches are equally valuable and are
required to create a holistic city resilience-building
process. The key is not to consider them as isolated
efforts but to integrate all of them in an effective way. 
Therefore, aligning the efforts of all the city stake- 
holders using mechanisms like 4Ps should be consid- 
ered as a priority in order to improve city resilience. 


2.2 Dimensions of public private people partnerships 


Collaboration enables partners to share their 
knowledge, skills, resources and perspectives to 
use them in alternative and complementary ways 
(Gagnon et al. 2016, Jones & Barry 2011). Foster- 
ing 4Ps within the context of city resilience ena- 
bles each stakeholder to contribute in the most 
appropriate way to the resilience-building process. 
In order to develop 4Ps in the most effective man- 
ner, the following dimensions should be considered 
(Marana et al. 2018a) (Figure 1).

Figure 1. 4P dimensions: stakeholder relationship, information flow and conflict resolution.


2.2.1 Stakeholder relationship

This dimension is related to the attributes and attitudes
stakeholders must possess to work together
successfully. We highlight the importance of promoting
the commitment of the city stakeholders to be an
active part of the city resilience-building process
assuming that mutually beneficial goals could be 
achieved. Coordination and trust not only among 
city stakeholders but also with other systems and 
institutions with similar or complementary pur- 
poses are also considered within the scope of this 
dimension. It also embraces the need to increase 
adaptability of the partnership in the face of 
upcoming challenges. Finally, this dimension also 
considers the relevance of involving representa- 
tives of all the city stakeholder groups within the 
city (public institutions, private companies and 
citizens). 


2.2.2 Information flow 

This dimension is related to the communication 
channels and protocols that stakeholders must 
use to invest resources in the most effective man- 
ner. When we refer to this dimension we are high- 
lighting the need to ensure the timeliness, accuracy 
and relevance of the shared information. Active
participation of city stakeholders in planning,
goal setting and execution of tasks is also considered
within this dimension. This dimension also
embraces how quickly information is available to
relevant city stakeholders and the ease with which
partners understand the information provided.
Finally, it also considers to what extent critical and 
sensitive information is shared with other author- 
ized partners. 


2.2.3 Conflict resolution 

This dimension is related to the techniques used to 
solve problems related to the correct functioning 
of the partnership. It highlights the importance 
of finding solutions to solve conflicts between 
city stakeholders in a constructive way in which 
the interests of all the involved partners are rep- 
resented. This dimension also refers to the ability 
of the partnership to use lessons learnt in the past 
and to increase the effectiveness of future deci- 
sions. Finally, it also considers the importance of 
aligning the self-interests of each partner into a 
mutually beneficial goal. 


3 METHODOLOGY 


This research consisted of two different stages.
The first stage consisted of an academic literature
review and the second stage consisted of a review
of existing city resilience strategies.


3.1 1st stage: Academic literature review


A literature review was conducted in the
Scopus database in order to find best practices that
contribute to the development of 4Ps in the city
resilience-building process. This literature review
made it possible to find articles focused on the dimensions
of multi-stakeholder collaboration in the context
of city resilience.

The query used in the search in order to find 
relevant articles was the following: “city resilience” 
OR “community resilience” OR “urban resilience” 
AND partnership OR collaboration. 
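
As written, the query leaves the precedence of AND and OR implicit. The following is a minimal sketch of how the search string could be assembled with explicit grouping. The grouping (resilience terms) AND (collaboration terms) is an assumed reading of the query, and the helper names `or_group` and `build_query` are hypothetical; the sketch only builds a string and does not call any database API.

```python
def or_group(terms):
    """Join terms with OR, quote multi-word phrases, wrap in parentheses."""
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(resilience_terms, collaboration_terms):
    """Combine the two OR-groups with AND (assumed grouping)."""
    return or_group(resilience_terms) + " AND " + or_group(collaboration_terms)


query = build_query(
    ["city resilience", "community resilience", "urban resilience"],
    ["partnership", "collaboration"],
)
print(query)
```

Making the parentheses explicit avoids depending on the operator precedence of the search engine, which may otherwise bind AND more tightly than OR.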

In order to ensure a more consistent quality standard,
only academic papers published in scientific journals
were considered in this research. Although
conference proceedings usually present interesting
research projects, the main outcomes are generally
published in scientific journals. Therefore, the type
of publications was limited.

After conducting the search in the Scopus database,
a total of 96 articles were obtained.
After reading the titles and the abstracts, a total
of 52 research articles were analysed in
full detail.

The aim of this academic literature review was 
to gather information about projects, strategies, 
activities, policies, methodologies and tools that 
have proven to be effective in the development of 
4Ps in the city resilience-building process. 


3.2 2nd stage: Review of city resilience strategies


A review of city resilience strategies was conducted
in order to identify what cities are currently
doing and planning to do in order to improve the
effectiveness of 4Ps and, consequently, increase city
resilience.

City resilience strategies were obtained from the
100 Resilient Cities webpage (100 Resilient Cities,
2016c). 100 Resilient Cities is an initiative funded
by the Rockefeller Foundation whose aim is to help
cities around the world to become more resilient to
the physical, social and economic challenges that
are a growing part of the 21st century.

The 36 city resilience strategies available on
the 100 Resilient Cities website at the time
this research was carried out were reviewed
in order to find projects, strategies, activities, policies,
methodologies and tools that contribute to
improving each of the three dimensions of 4Ps and
thereby to developing effective 4Ps in the city
resilience-building process.


4 RESULTS 


In the following section, the most important results 
obtained after reviewing scientific papers as well 
as city resilience strategies will be presented. Con- 
sidering their final aim, the best practices gathered 
from the literature review have been classified into 
the three 4P dimensions.


4.1 Stakeholder relationship


Improving the interaction among stakeholders rep- 
resenting different city sectors is key for developing 
effective 4Ps that contribute to the city resilience- 
building process. 

Representing the interests of all the city stake- 
holders when developing the basis of the city 
resilience-building process is key so that everyone 
accepts it and feels part of it. 

The empowerment of citizens is key to foster 
engagement and a sense of belonging to the city. 
The city of Christchurch is developing alternative 
forms of public participation to promote awareness 
of issues and engage citizens in resilience related 
decision-making (100 Resilient Cities 2016d). The
creation of community boards, advisory groups
and working parties will enable city stakeholders
to be better informed about the issues on which
community leaders have to make decisions.

The involvement of city stakeholders is not suffi- 
cient to ensure the effectiveness of resilience-build- 
ing processes. There is a need to coordinate all the 
efforts being made by all the different groups. The 
city of Glasgow is conscious of this and has started 
to develop an integrated resilience plan for critical 
services in the face of long-term stresses in order 
to ensure the wellbeing of its citizens (100 Resilient
Cities 2016b). The interdependencies among
cross-sectoral critical services within Glasgow
are being identified with the objective of ensuring that
critical services remain functional and accessible
regardless of the upcoming challenges.

Moreover, it is also important to realize that 
the resilience level of the city does not only rely on 
the city itself. Due to the increasing interconnec- 
tion among cities throughout the world, learning 
from what others are doing is also key to address 
the challenge of developing resilience in cities. 
For instance, Bangkok is one of the cities that 
has addressed this challenge and is now interact- 
ing with different cities around the world with the 
support offered by the 100 Resilient Cities network 
(100 Resilient Cities 2017a). The city of Bangkok is
currently working together with Jakarta
and Mexico City. All three are rapidly
developing megacities where the challenge of
efficient mobility exists. Therefore, they are working
together to find solutions to this challenge.


4.2 Information flow 


Improving the information flow among different
stakeholder groups is key to improving the
decision-making process in the context of the city
resilience-building process.

Developing communication channels to share 
information with city stakeholders is required to 


increase city resilience. The municipality of Dakar 
is aware of this fact and therefore is currently work- 
ing on implementing tools and services to provide 
its city stakeholders with access to information on 
imminent crises in real time (100 Resilient Cities 
2016a). These tools will enable the municipality, a few
months before a rainy season or a period of high tides,
to start communicating the disturbances observed, in
order to anticipate and reduce the eventual damage
caused by flooding. They realized
that although public entities keep track of natural 
events preceding major disasters, such information 
is not actively used to the best advantage of the 
city. In light of this situation, they are working on 
creating a database of current data on the city’s 
vulnerability state and they are establishing a net- 
work of community resilience champions. 

This last initiative includes the integration of
real-time information into the city’s mobile application
NAVIGEM in order to advise people when
imminent risks are detected. However, communication
with city stakeholders should not rely
only on social media and apps. Usually, the most
vulnerable groups in society, like elderly people and
children, are not users of these communication
channels. In order to reach those groups, alternative
tools for proactively communicating ahead of
periods of vulnerability, like daily/weekly radio
programs, are being developed.

The timeliness, accuracy and accessibility of
information are another key issue when talking
about information in the context of city resilience.

Due to the technological upgrades carried out in
cities in recent years, local governments are now
increasingly considering leveraging the internet of
things (IoT) to gather timely and accurate data that
can improve resilience-related decision-making
processes. Using the most current technology
could help them to address disasters more
efficiently and safely. However, this type of progress
will require more than just employing the IoT to 
improve emergency preparedness and response; 
city stakeholders need to be ready to receive, inter- 
pret, and use the data in an effective manner. IoT 
sensors can be critical for urgent decisions like 
whether to evacuate an area at risk of earthquake, 
or how to guide residents to the safest exit routes 
ahead of an emergency. For instance, Santiago is 
working on applying sensors to improve the early 
warning systems as well as to use that information 
to design more effective evacuation protocols (100 
Resilient Cities 2017b). In San Francisco, an early
warning system called ShakeAlert is being
implemented in order to detect the first wave
sent by an earthquake and to report it to citizens
(100 Resilient Cities 2016f).

Public entities must also know which communication
channels work best to reach the affected city
stakeholders. For instance, if the at-risk population
is predominantly Spanish-speaking, then the mes- 
sages should be sent in Spanish. When dealing with 
an elderly population, the outreach can be done 
through television, newspapers, and radio rather 
than tech-driven channels like text alerts and apps. 
This targeted communication is a shift from the 
conventional “one size fits all” approach. 
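
The targeting rules described above can be sketched as a small lookup. This is only an illustration: the `Audience` fields and the `plan_outreach` helper are hypothetical names, and the default channels for a non-elderly population (text alerts and apps) are an assumption drawn from the "tech-driven channels" mentioned in the text.

```python
from dataclasses import dataclass


@dataclass
class Audience:
    main_language: str    # predominant language of the at-risk population
    mostly_elderly: bool  # True if the population is predominantly elderly


def plan_outreach(audience):
    """Pick the message language and channels for one audience profile."""
    if audience.mostly_elderly:
        # Traditional media, as suggested in the text for elderly populations.
        channels = ["television", "newspapers", "radio"]
    else:
        # Assumed default: tech-driven channels.
        channels = ["text alerts", "apps"]
    return {"language": audience.main_language, "channels": channels}


plan = plan_outreach(Audience(main_language="Spanish", mostly_elderly=True))
print(plan)
```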


4.3 Conflict resolution 


Improving conflict resolution to enable perspective
alignment among stakeholders representing different
interests will contribute to a holistic view
of city resilience.

Not only are the involvement and contribution of
all the city stakeholders required to develop an
effective city resilience-building process; an alignment
of their perspectives on resilience is also
needed. The city of New Orleans is aware of this need
and is currently working on establishing a resilience 
center (100 Resilient Cities 2015). The aim of this 
center is to provide a space to build awareness and 
expertise of city stakeholders to develop projects 
and coalitions and to exchange ideas and practices 
both locally and globally. This space will make it
possible to create synergies between different city
stakeholders to work on initiatives that benefit all.

In order to increase the city’s resilience level, 
the experience and lessons learnt by all the city 
stakeholders should be considered. The city of
Melbourne is very aware of this need and is currently
working on an initiative called the Monash University
Disaster Resilience Initiative (MUDRI),
which explores how different stakeholders prepare
to respond effectively to crises and develops a
Resilience Compendium that identifies leading
practices and facilitates the sharing of best practices
among different city stakeholders (100 Resilient
Cities 2016e). This initiative provides
a centralized resource for sharing and accessing
information on resilience-building activities
undertaken in the city. Consequently, this will prevent
duplication of efforts and will promote a more
efficient use of available resources.

In Rotterdam, the local government is also very 
aware of the need to empower all city stakeholder 
groups and has developed an action called the inte- 
gration tours (100 Resilient Cities 2016g). In fact, 
talks and events aimed at encouraging cooperation 
and fostering dialogue between public entities, pri- 
vate companies and people are organized. These 
actions make citizens aware of their own roles in 
society and how they can better contribute to city 
resilience. Talks seek to break down the barriers 
created by self-interests to enhance effective dia- 
logue among city stakeholders. These tours bring 
groups from different backgrounds and roles in 


society together to discuss different issues that are 
important for increasing the city’s resilience level. 
Activities like this support knowledge sharing, 
strengthen mutual understanding, and enable the 
alignment of the different perspectives. 
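
The classification used in Sections 4.1 to 4.3 can be pictured as a simple repository keyed by 4P dimension. The sketch below is only an illustration of that structure: the short practice labels are paraphrases of the examples above, and the names `PRACTICES` and `group_by_dimension` are hypothetical.

```python
from collections import defaultdict

DIMENSIONS = ("stakeholder relationship", "information flow", "conflict resolution")

# (city, short label of the practice, 4P dimension), taken from Sections 4.1-4.3.
PRACTICES = [
    ("Christchurch", "community boards, advisory groups and working parties", "stakeholder relationship"),
    ("Glasgow", "integrated resilience plan for critical services", "stakeholder relationship"),
    ("Bangkok", "exchange with other cities on efficient mobility", "stakeholder relationship"),
    ("Dakar", "real-time information tools and resilience champions", "information flow"),
    ("Santiago", "sensors for early warning and evacuation protocols", "information flow"),
    ("San Francisco", "ShakeAlert early warning system", "information flow"),
    ("New Orleans", "resilience center", "conflict resolution"),
    ("Melbourne", "Resilience Compendium", "conflict resolution"),
    ("Rotterdam", "integration tours", "conflict resolution"),
]


def group_by_dimension(practices):
    """Group (city, practice) pairs under their 4P dimension."""
    repository = defaultdict(list)
    for city, practice, dimension in practices:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown 4P dimension: {dimension}")
        repository[dimension].append((city, practice))
    return dict(repository)


repository = group_by_dimension(PRACTICES)
for dimension in DIMENSIONS:
    print(dimension, "->", [city for city, _ in repository[dimension]])
```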


5 INTERCONNECTIONS BETWEEN 
4P DIMENSIONS 


This paper has presented the three different dimen- 
sions that should be considered when developing 
effective 4Ps that contribute to the city
resilience-building process. However, it is important to bear
in mind that all these dimensions are closely related
to each other. The literature review has shown
us that improving one 4P dimension has collateral
effects on the other dimensions.

For instance, improving the relationship of 
different city stakeholders will have a potential 
impact on the amount and quality of the infor- 
mation they share between them (O’Sullivan et al. 
2015). When a sense of belonging and trust among 
different entities exists, there is a bigger chance to 
improve the information flow among the partners. 
Improving the quality, accessibility and sharing 
of information also improves the coordination of 
city stakeholders and the sense of inclusiveness 
(Davenport et al. 2010). 

Moreover, an improvement of the stakeholder
relationship and of the information flow within
the partnership also has an impact on the conflict
resolution dimension (O’Sullivan et al. 2015). The
better the relationship among city stakeholders
and the more information is shared, the easier it
will be to solve potential conflicts among city
stakeholders.

Although all the best practices could have been
classified within one 4P dimension, the effects of
their implementation are usually transversal and
contribute not only to the improvement of their own
dimension but also to the others. Therefore, further
research should be conducted to better understand
in which priority order these best practices
should be implemented in order to use the available
resources in the most effective manner.


6 CONCLUSIONS 


Although practitioners and academics are aware 
of the importance of improving collaboration 
between all the city stakeholders (public entities, 
private companies and citizens) to increase the 
resilience level of a city, there is no established concept
to refer to this idea. This paper has presented the concept
of public private people partnership (4P) as
a new mechanism to foster collaboration through
formal or informal arrangements between city
stakeholders in the city resilience building process. 

This research has presented relevant best practices
that are currently being implemented in certain
cities all over the world. However, it is important to
consider the limitations of the work presented. The
best practices presented in this paper to illustrate
how each dimension can be developed have been
chosen in a pragmatic manner, without following
a concrete methodology. The relevance of these
examples will also decrease as available knowledge
about the city resilience-building process increases and
technology improves.

The efforts and resources dedicated to developing
and implementing an effective city resilience-building
process are limited. Therefore, there is a need to set
a priority order when implementing the resilience-building
best practices depending on the strengths
and weaknesses of each particular city. Moreover, it
is important to bear in mind that some cities could
find it easier to develop one dimension rather than
another due to cultural aspects. For instance, in some
countries, like the ones in northern Europe, more
attention is paid to the standardization of information
sharing procedures. Therefore, these countries
may find it easier to improve the information flow
dimension. However, other countries with a different
cultural background, for instance the Mediterranean
countries, may find it easier to establish
informal relationships among stakeholders due
to their more sociable character. All these
aspects should be addressed in future research.


ABSTRACT: The safety of technical facilities is a fundamental issue for human society and its development.
In reality, it is disrupted by many known and newly recognized risks that are related to the ever-growing
complexity of technical facilities and of the whole world. Today, we know that it is also necessary to consider the
risks that are connected with the interfaces among their subsystems and components. With regard to the world's
dynamic development, it is necessary to monitor the priority risks and to cope with them over time.
The safety level is measured by means of a logical arrangement of the requirements of the individual
techniques used in work with risks in technical facilities, divided into 7 domains, and the Maximum Utility
Theory principle. The paper also shows the resulting safety levels for 5 technical facilities.


1 INTRODUCTION 


On the basis of the present level of knowledge, which is
represented e.g. by publications from the ESREL
conferences (Ale et al. 2010, Bérenguer et al. 2011,
Briš et al. 2009, Cepin & Bris 2017, Nowakowski
et al. 2014, Podofillini et al. 2015, Steenbergen
et al. 2013, Walls et al. 2016) and summarized
in (Prochazkova 2015, 2017), we perceive each
technical facility as an open complex system of systems,
i.e. as several open systems that mutually
interpenetrate and interface with their vicinity.

The interfaces ensure the fulfilment of important
operations and services, and simultaneously
they cause the dependences that are the roots of
specific vulnerabilities. Under specific conditions,
highly unfavourable interfaces originate that
lead to technical facility failure, which under certain
circumstances also distinctly damages the technical
facility vicinity. Therefore, in ensuring technical
facility safety it is necessary to consider that a
technical facility has various assets that change
in a dynamically variable world. The multiplicity
and variability of assets mean that, under certain
conditions, the measures ensuring the safety
of individual assets are conflicting, which means
that the methods used in risk management aimed at
technical facility safety need to be multi-criteria
(Prochazkova 2017).


2 SAFETY AND RISKS OF TECHNICAL 
FACILITIES AND PRODUCTS 


At present, in the advanced engineering disciplines
described in (Ale et al. 2010, Bérenguer
et al. 2011, Briš et al. 2009, Cepin & Bris 2017,
Nowakowski et al. 2014, Podofillini et al. 2015,
Steenbergen et al. 2013, Walls et al. 2016, Prochazkova
2015, 2017), safety is understood as an
attribute that emerges at the system level.
Safety reflects the quality of the set of human measures
and activities which ensure that the system is
safe. The following relations hold among the important
quantities: a dependable (reliable) system
is a system that performs the required functions in a
given place, at a given time and in a given quality during
its whole life cycle; a secure system is a dependable
system that is protected against internal
and external disasters of all kinds; a safe system is
a secure system that does not endanger itself
or its vicinity under any conditions; and risk is
understood as the probable size of losses, damages
and harms to protected assets in a real system,
calculated per unit of space and unit of time. Risk
depends on the disaster size and on the local
vulnerabilities of the assets. Safety and risk are related,
but they are not complementary quantities.
Risk reduction means a safety increase,
but the converse does not always hold (Prochazkova
2015). The quantity complementary to safety is
criticality; in some legislation, e.g. in the SEVESO
directive, the term recklessness is used instead of
criticality. Criticality denotes the limit (boundary)
from which the risk impacts range from significant up to
destructive for the system under consideration, which means that
the corresponding risk always needs to be mastered.


3 DATA USED AT CHECKLIST 
FORMATION 


On the basis of data given in publications (Ale et al.
2010, Bérenguer et al. 2011, Briš et al. 2009, Cepin
& Bris 2017, Nowakowski et al. 2014, Podofillini
et al. 2015, Steenbergen et al. 2013, Walls et al.
2016, Prochazkova 2015, 2017), and in Archives
(CVUT 2017), it is necessary to consider seven
items (Figure 1) that influence the result of work
with risks of technical facility, i.e. its safety, namely:

1. Context in which the risks, inherently connected
   with technical facility, are inserted.
2. List of considered sources of risks.
3. Type of risk form.
4. Ways of mastering the risks.
5. Process model of work with risks, application
   of the TQM and Coase theorem.
6. Technique of management and coping with
   risks of technical facility.
7. Way of management of risks in time.

Figure 1. Items that influence the result of work with
risks of technical product.


Ad 1. The most general context into which the risks
of a technical facility are placed comprises
the assets: human life, health and security; property
and public welfare; environment; and technologies
and infrastructures. The process model
ensuring human security and development
is in (Procházková 2015). On the basis of the results
in (Prochazkova 2015, 2017) and data obtained
directly in practice (CVUT 2017), the technology
sector often considers only the context
of the technical facility or of the enterprise that
administers it, and in many
cases only the context of a production facility or its
part. Understandably, the use of a more limited
context means a greater departure from reality.


In practice, this means that the corresponding solution
does not consider some sources of risks or
the impacts of the realization of risks on all public and
enterprise assets. The following are neglected: harmful
phenomena from the technical facility vicinity and
phenomena induced by bad decisions of the enterprise
management or administrative bodies; and the
impacts of risks on humans, property and the
environment in the technical facility vicinity.


Ad 2. With regard to the results in (Prochazkova
2011a, 2015, 2017) and data obtained directly in
practice (CVUT 2017), the following choices of
sources of risks are used in practice:

1. Sources of risks determined either by legislation
   or by the experience of the worker who solves the
   task.
2. Only technical sources of risks in a given technical
   facility. Usually, these are risks connected
   with: material (fulfilment of required parameters,
   supplier relations, alternative material
   etc.); construction and interfaces of components
   and facilities (free procedures, presence of
   unstable hazardous substances etc.); production
   procedures, e.g. welding or specific work with
   milling machines, lathes etc.; and conditions that are
   necessary for the production of a quality product, e.g.
   a certain pressure, temperature or
   humidity of the surrounding medium.
3. Technical sources of risks and the human factor. To
   the items given in point 2 are added the risks
   connected with erroneous operations by workers.
4. Technical sources of risks and the human factor
   in the broadest interpretation. To the items given
   in point 3 are added the risks connected with
   sources of organizational accidents (i.e. bad
   decision-making, use of wrong procedures etc.).
5. Technical sources of risks, sources of risks
   threatening the workers' lives, health and safety,
   sources of organizational accidents and sources
   of risks in the working environment.
6. The sources of risks given in point 5 plus external
   sources of risks.
7. The sources of risks given in point 6 plus sources
   of risks from interfaces of facilities, components
   and systems that disturb the technical integrity,
   the originators of which are in automation,
   education and skill.
8. All Hazard Approach in the form described
   in (Procházková 2011a). This selection considers
   risks from the five basic domains (ca 77
   sources).


The last set of risk sources is complete, but it is
demanding in terms of data, methods, knowledge,
experience and time. It requires a strategic, systemic
and proactive approach and, according to
the results of the FOCUS project (Prochazkova 2015),
its use in present practice shows many deficits.


Ad 3. On the basis of the results in (Prochazkova
2015, 2017) and data obtained directly in practice
(CVUT 2017), partial, integrated and integral
(systemic) risks are used in technical practice.
Partial risk is the risk connected
with one asset. The partial risks are various, e.g.
health risks, technological risks, risk of fire etc. For
their determination, many legal rules and supporting
software exist (Procházková 2011a). Integrated
risk represents the sum or other aggregation of
partial risks. It is used e.g. in the protection of workers'
lives, health and safety (Procházková 2011a).
Integral (systemic) risk is based on the system concept
of an entity and it also includes the interfaces among
the assets and components of the technical facility
(Prochazkova 2015, 2017). It is given by the relation


$$R(H) = \sum_{i=1}^{n} A_i(H)\, Z_i(A_i) + \sum_{i=1}^{n} \int_{S} \int_{0}^{T} F(H, A_i, P_i, O, S, t, T)\, dt\, dS \; \tau^{-1}$$


in which H is the hazard connected with the given
disaster at the site of the technical facility; A_i are the values of
the monitored assets for i = 1, 2, ..., n; Z_i are the vulnerabilities
of the assets for i = 1, 2, ..., n; F is the loss function;
P_i are the occurrence probabilities of damage to the
assets for i = 1, 2, ..., n (these are conditional
probabilities); O is the vulnerability of protective measures;
S is the size of the monitored space; t is the time measured
from the disaster origin; T is the time period over which
losses originate; and τ is the disaster return period.

It is evident that, to ensure a safe technical
facility in the long term, it is necessary to consider
the integral risk. Because the loss function in the
formula given above is not known, (Prochazkova
2015, 2017) give procedures used
in practice for the estimation of integral risk; they are
based on the analysis of real and simulated disaster
scenarios and on expert judgement.
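
To make the structure of the integral risk concrete, the following minimal sketch evaluates it numerically under strong simplifying assumptions. It assumes the relation consists of an asset term (sum of asset value times vulnerability) plus an interface term obtained by integrating a loss function over space and time and dividing by the return period. Since the loss function is not known, `loss_rate` is a hypothetical stand-in that depends only on time and is uniform in space, and all numbers are illustrative rather than taken from the paper.

```python
from typing import Callable, Sequence


def asset_term(values: Sequence[float], vulnerabilities: Sequence[float]) -> float:
    """Sum over assets of value times vulnerability."""
    if len(values) != len(vulnerabilities):
        raise ValueError("one vulnerability is needed per asset")
    return sum(a * z for a, z in zip(values, vulnerabilities))


def interface_term(loss_rate: Callable[[float], float], area: float,
                   duration: float, return_period: float, steps: int = 1000) -> float:
    """Loss rate integrated over [0, duration] (midpoint rule), times area,
    divided by the return period. Spatial uniformity is an assumption."""
    dt = duration / steps
    time_integral = sum(loss_rate((k + 0.5) * dt) for k in range(steps)) * dt
    return time_integral * area / return_period


def integral_risk(values, vulnerabilities, loss_rate, area, duration, return_period):
    """Asset term plus interface term (simplified sketch)."""
    return asset_term(values, vulnerabilities) + interface_term(
        loss_rate, area, duration, return_period)


# Illustrative numbers only: two assets and a constant hypothetical loss rate.
risk = integral_risk(
    values=[100.0, 50.0],
    vulnerabilities=[0.2, 0.1],
    loss_rate=lambda t: 2.0,
    area=3.0,
    duration=4.0,
    return_period=12.0,
)
print(round(risk, 6))
```

In practice the interface term would come from scenario analysis and expert judgement, as noted above, rather than from a closed-form loss function.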

It is necessary to note that the determination of
individual types of risks also differs in its demands
on data and on the methods of their processing (Procházková
2011a, 2017); the least demanding is the
determination of partial risks, and therefore these
are mostly used in practice, although their validity
with regard to total technical facility safety is very
limited.

Ad 4. On the basis of the results of investigations
given in (Procházková 2015, 2017) and data from
practice (CVUT 2017), three cases are found. In
the first one, risks are determined and mastered
only after the technical facility has been
constructed (Prochazkova 2015); this
way is dangerous because some important risks,
which could be mastered only by specific technical
measures in the specification of the technical
facility, can then only be reduced by organizational
measures, which are less effective than technical
measures (CVUT 2017, Prochazkova 2015, 2017). In the
second one, risks are considered from the beginning
of the technical facility design up to its withdrawal
from operation. This way depends on the requirements
of legislation and on the knowledge and skill of
designers, constructors and operators, which does not
guarantee that all sources of risk are considered. In
the third one, risks are considered from the beginning
of the technical facility design, and the
Defence-In-Depth approach is used in dealing with
them, which requires system thinking and multi-sectoral
and transdisciplinary knowledge and experience
(CVUT 2017, Prochazkova 2015).

Ensuring the safety of a technical facility and
its vicinity depends on the quality of work with risks
and on the possibilities available to both the
technical facility management and personnel and the
public administration (CVUT 2017).

Ad 5. Risk mastering at a given time and
a given site requires knowledge; capabilities;
competences; finance; and material, technical and
human resources. Therefore, in what follows we deal
not only with the work with risks itself, but also with
the practical procedures that are used in
decision-making on risk mastering. On the basis of
the results of investigations in (Prochazkova 2015)
and data obtained directly in practice (CVUT 2017),
technical practice needs to use the process model
shown in Figure 2.

It is evident that if we are not able to identify
and analyse some risk, we are not capable of
effectively defending the followed entity against it.
An error made in risk analysis is transferred
to emergency, continuity and crisis plans, and it
reduces their value in relation to the planned measures
directed at the protection of human lives and health,
and also to the operational capability of the rescue
units participating in rescue operations.




Figure 2. Process model of work with risks.
Criterions = conditions that determine when the risk is
acceptable, conditionally acceptable or unacceptable. Aims
denote required states. Numbers 1, 2, 3, 4 denote
feedbacks that are used if monitoring shows that the
followed safety requirements are not fulfilled.

The aim of risk management is to find the
optimum way to reduce the identified risks to the
required socially acceptable level and, where possible,
to keep them at this level. The aim of risk engineering
is then to find the way in which, under the available
options, the measures and activities for risk mastering
proposed by risk management can be realized, and to
ensure their reliability and function. Risk reduction is
almost always connected with an increase in expenses
and in demands on knowledge. Risk management is
led by the effort to find the boundary up to which
risk reduction is bearable, so that the expenses spent
are socially acceptable.

In harmony with the public interest, risk
acceptability needs to have a social
dimension. Therefore, it is necessary to consider:


1. For whom the risk should be acceptable: for risk
originators, for politicians or for the public?

2. Who determines the risk acceptability: politicians
adjudicate on what is legal, and
so they cannot adjudicate on what is
acceptable.

3. Whether the risk determination took into account
the actually permitted risks, intolerable threshold
values and the attitudes of the public to risks.


Risks are inherent factors of the human system, i.e.
they were, are and will be present, and new ones
will also occur. Therefore, the management of
risks requires a risk dimension and a measurement of
risk that consider not only physical damage,
harm, victims and the bulk of economic losses, but also
social, organizational and institutional factors.

The outputs from the risk management process for
the needs of good governance according to TQM are:
Risk assessment document: records on all pertinent
risks; Top risks list: list of selected risks whose
mastering makes the highest demands on
resources and time; Retired risk list: serving as the
historical reference for future decision-making.

With a view to the prudent handling of forces,
resources and means, the risk management technique
itself formally reviews, at each phase of the
work with risks, the results of risk management and
risk mastering in the context of the benefits and
costs of outputs. The Coase theorem (Coase
1960) is used to determine the economic optimum
of expenses on risk mastering.

Ad 6. On the basis of the results in (Prochazkova
2015, 2017) and data obtained directly in practice
(CVUT 2017), technical practice
needs to understand that the risk management and
risk mastering of a technical facility are not the task
of one individual, one organisation or one sector.
They are the collective effort of all participants. It
is evident that: only professionals who have the
knowledge, data and capability to apply suitable
methods can determine the risk; only persons who have
appropriate competences can decide on the handling
of risk, i.e. the legally determined representatives of
the public administration or of the technical facility;
and risk mitigation and control can be performed only
by professionals who have appropriate knowledge,
capabilities, skill, equipment, resources and means.
The public is a legitimate participant in risk mastering
because its security and quality of life are at stake.
Because there are many risk sources, and the
countermeasures for their mastering are very often
conflicting, it is necessary to use risk management
aimed at safety (Zairi 1991).

The treatment of risks starts from the present
possibilities of human society, and it lies in splitting
the measures and activities for risk mastering into:
prevention, mitigation, response and renovation.

So that the executive body of an organisation can
work with risks effectively, it is necessary to
determine the procedure for risk determination by a
legal rule, and simultaneously to determine the value
scales by which the outputs of the tools for risk
determination in the organisation are interpreted;
i.e. it is necessary to determine which risk value
is acceptable, which one is conditionally acceptable
and which one is unacceptable. Among the tools for
risk determination, it is necessary to distinguish
sophisticated tools for the professional sphere
from tools for administrative bodies, for which
checklists are the most suitable.
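
A value scale of this kind can be sketched as a simple three-way classification. The sketch below is only an illustration under stated assumptions: the function name `classify_risk`, the two threshold parameters and the numbers in the example are hypothetical, and in practice the limits would come from the legal rule mentioned above.

```python
def classify_risk(value: float,
                  acceptable_limit: float,
                  unacceptable_limit: float) -> str:
    """Interpret a risk value against a two-threshold value scale.

    Values up to acceptable_limit are acceptable, values at or above
    unacceptable_limit are unacceptable, and values in between are
    conditionally acceptable. Both limits are assumed to be given.
    """
    if acceptable_limit > unacceptable_limit:
        raise ValueError("acceptable_limit must not exceed unacceptable_limit")
    if value <= acceptable_limit:
        return "acceptable"
    if value < unacceptable_limit:
        return "conditionally acceptable"
    return "unacceptable"


# Hypothetical scale: limits of 1.0 and 10.0 in arbitrary risk units.
examples = {v: classify_risk(v, 1.0, 10.0) for v in (0.5, 5.0, 12.0)}
```

Keeping the limits as explicit parameters separates the tool that produces the risk value from the scale that interprets it, which is the distinction the paragraph draws.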

Ad 7. On the basis of the results in (Prochazkova
2015, 2017) and data obtained directly in practice
(CVUT 2017), technical practice
needs to understand that, from the system
viewpoint, ensuring the technical facility safety is
a requirement on the complex system as a whole, not
on its components.

Risks are an inherent attribute of the human system
and of each technical facility, and therefore they
need to be managed during the whole life time of the
technical facility. The aim of risk management
is to ensure a safe technical facility, and thereby also
its competitiveness today and in the future; i.e. it
consists in determining the priority risks and
managing them correctly. Risk management needs to
ensure the technical facility safety under normal,
abnormal and critical conditions.

On the basis of present knowledge given in
(Procházková 2011a, 2015, 2017), the Safety
Management System (SMS) of a complex object is
built on the principles of process management, and it
includes the organizational structure,
responsibilities, practices, regulations, procedures
and resources for determining and enforcing the
prevention of disasters, or at least the mitigation of
their unacceptable impacts. Usually, it deals with many
matters, among others the organisation, workers,
identification and assessment of hazards and of the
risks that follow from them, organization management,
change management in the organization, emergency and
crisis planning, safety monitoring, audits and review.

Process safety management is concentrated
into six processes: concept and management;
administrative procedures; technical matters; external
co-operation; emergency preparedness; and
documentation and investigation of accidents. These
processes are further divided into sub-processes
that are described in detail in (Prochazkova 2015,
2017).

The coordination of processes is aimed at
ensuring a safe facility under normal, abnormal
and critical conditions. Coordination in this context
is understood as a controlled process, the aim
of which is to create and operate the technical
facility in the required quality; it follows the
processes in spheres such as: space and time,
personnel, material, finance and documentation
(Prochazkova 2015, 2017).

To support the safety management system, it
is necessary to prepare a series of remedial tools
such as: security plans; on-site and off-site emergency
plans; continuity plans; and crisis plans; in practice,
risk management plans for priority risks have
proved very useful (Prochazkova 2017). The
most important is safety culture, as is stressed in
the fundamental work (Kongsvik, Almklov, & Fenstad
2010).


4 METHOD FOR CHECKLIST MAKE-UP
AND METHOD OF ITS USE IN
PRACTICE


Data and experiences from work with risks are
given in (Procházková 2011a, 2015, 2017) and in
works that are cited in them. They clearly show
that the better the facility risks are mastered, the
better the level of facility safety that is reached.
For this reason, in constructing the tool for judging
technical facility safety we used: the final judgement
of the quality of the risk engineering methods used
(US EPA 2008), i.e. we compiled a checklist; and the
principles of maximum utility theory (Keeney & Raiffa
1993). The checklist was proposed by the procedure
described in (Prochazkova 2011b) so that the answer
"YES" to a question on each aspect given in Chapter 3
corresponds to the best way of solving that aspect on
the basis of present knowledge and experience. The
scale for judging the total result is selected in
agreement with the recommendation in (Prochazkova 2017).

For the actual judgement of the safety of technical
facilities, we used the safety audit method
(Procházková 2011a, b). In the safety audit, the answer
to each question is always formulated separately by
5 evaluators (technical director, security expert of
the technical facility, security expert of the local
public administration, security expert of the regional
public administration, authors) according to the
documentation of the technical facility. The final
evaluation of each question is taken as the median of
the partial evaluations. In the case of significant
doubts in judging a certain question, a note was
entered in the special column of the checklist; the
final results in these cases were obtained by a panel
discussion of experts.


5 CHECKLIST


The specific checklist compiled by the procedure
described in the foregoing chapter is in Table 1. It
contains 72 questions. The scale for its final
evaluation (i.e. the determination of the safety rate)
is in Table 2.


Table 1. Checklist for judgement of technical work safety according to judgement of work with risks. Answers: Y = YES; N = NO; R = Remark.


                                                                                              Answer
Question                                                                                      Y  N  R


Are in technical facility documentation distinguished the terms danger, hazard and risk?

Is technical facility documentation based on context that considers only the technical work assets?

Is technical facility documentation based on context that considers technical work assets and selected
public assets (employees, contractors and visitors, humans in work vicinity, working setting and
environment)?

Is technical work documentation based on context that considers technical facility assets and all public
assets?

Are only considered risk sources that are determined by expert experience?

Are only considered risk sources that are determined by legislation and expert experience?

Are only considered risk sources that are connected with technical facility alone?

Are considered risk sources that are connected with technical facility alone and human factor connected
with badly performed working operation?




Are considered risk sources that are connected with technical facility alone and human factor in the 
broadest concept? 

Are considered risk sources that are connected with technical facility alone, human factor in the 
broadest concept, workers health jeopardy and threatening the working environment? 

Are considered risk sources that are connected with technical facility alone, human factor in the 
broadest concept, workers health jeopardy, threatening the working environment and environment 
outside the technical facility? 

Are considered risk sources that are connected with technical facility alone, human factor in the 
broadest concept, workers health jeopardy and threatening the working environment in system 
context, i.e. also risk sources connected with linkages and flows in technical facility? 

Are considered risk sources according to All-Hazard-Approach? 

Are only considered partial risks? 

Are considered partial risk and integrated risk? 

Are considered partial risks, integrated risk and integral risk? 

Are risks in technical facility systematically followed? 

Are risks in technical facility systematically followed only after technical facility building? 

Are risks in technical facility systematically followed for its whole life cycle, i.e. from its design? 

Are risks in technical facility systematically followed for its whole life cycle, i.e. from its design and 
in its design and operation used the Defence-In-Depth approach? 

Is at work with risks in technical facility systematically used the process model of work with risks? 

Is at work with risks in technical facility systematically used the process model of work with risks that
possesses clearly determined criteria for risk acceptance?

Is at work with risks in technical facility systematically used the process model of work with risks that
possesses clearly determined criteria for risk acceptance, which respect public interest (i.e. they
have social dimension)?

Is at work with risks in technical facility systematically used the process model of work with risks that
possesses clearly determined criteria for risk acceptance and aims of risk management?

Is at work with risks in technical facility systematically used the process model of work with risks that
possesses clearly determined criteria for risk acceptance with regard to public interest?

Is at work with risks in technical facility systematically used the process model of work with risks that
possesses clearly determined criteria for risk acceptance with regard to public interest and corrective
measures in monitoring for the case that the risk becomes unacceptable?

Is at work with risks in technical facility systematically determined and followed the set of priority risks? 

Does technical facility risk management technique ensure in each phase of work with risks the review 
of profits and costs connected with measures for risks mastering, so economical handling with forces, 
sources and means might be ensured in technical work? 

Does technical facility risk management technique ensure in each phase of work with risks the review 
of profits and costs connected with measures for risks mastering, so economical handling with forces, 
sources and means might be ensured in technical work and in public administration? 

Are in technical facility systematically performed the preventive measures for reduction or averting of
some risks?

Are in technical facility systematically performed the preventive measures for reduction or averting of all
priority risks?

Are in technical facility systematically performed the preventive measures for reduction or averting of all
risks that have potential to cause important losses to technical facility?

Are in technical facility systematically performed the preventive measures for reduction or averting of all
risks that have potential to cause important losses to technical facility and unacceptable impacts on
surrounding environment?

Are in technical facility systematically performed preventive measures for reduction or averting of all risks
and prepared the mitigating measures for reduction of the impacts of some of the highest risks?

Are in technical facility systematically performed preventive measures for reduction or averting of all risks
and prepared the mitigating measures for reduction of the impacts of all priority risks?

Are in technical facility systematically performed preventive measures for reduction or averting of all risks
and prepared the mitigating measures for reduction of the impacts of all risks that can cause significant
losses to technical facility?




Are in technical facility systematically performed preventive measures for reduction or averting of all risks
and prepared the mitigating measures for reduction of the impacts of all risks that can cause significant
losses to technical facility and unacceptable consequences for surrounding environment?

Is technical facility insured against risks?

Does technical facility possess the financial, material, technical, personnel and organisational measures
for response to important risk?

Does technical facility possess the financial, material, technical, personnel and organisational measures for
renovation after important risk realisation?

Does technical facility possess the financial, material, technical, personnel and organisational measures also
for response and renovation after extreme unexpected realisation?

Are at work with risks in technical facility only considered the results of preliminary risk analyses?

Are at work with risks in technical facility preferred the results of standard, fast and low-precision risk
analyses over the results of preliminary risk analyses?

Are at work with risks in technical facility preferred the results of detailed risk analyses in synoptic
concept over the results of preliminary risk analyses and standard, fast and low-precision risk
analyses?

Are at work with risks in technical facility preferred the results of individual and specific risk analyses
over the results of detailed risk analyses in synoptic concept, preliminary risk analyses and standard,
fast and low-precision risk analyses?

Are at work with risks in technical facility determined the criteria for assessment?

Are at work with risks in technical facility determined the criteria for assessment of technical and
economic items?

Are at work with risks in technical facility determined the criteria for assessment of technical
and economic, external and internal items?

Are at work with risks in technical facility determined the criteria for assessment of technical and
economic, external and internal, and socio-political items?

Are at work with risks in technical facility determined the requirements for ensuring the safety? 

Are at work with risks in technical facility determined the requirements, standards and norms for 
ensuring the safety? 

Are at work with risks in technical facility determined the requirements, standards and norms for 
ensuring the safety and partial aims? 

Are at work with risks in technical facility determined the requirements, standards and norms for 
ensuring the safety, partial aims and methods and procedures? 

Are at work with risks in technical facility determined the requirements, standards and norms for 
ensuring the safety, partial aims, methods and procedures, and also limits and conditions? 

Are at work with risks in technical facility determined the requirements, standards and norms for 
ensuring the safety, partial aims, methods and procedures, limits and conditions and also the 
authorizations of persons or institutions? 

Does the technical facility administrator hold the safety management system that is compiled on the 
principles of process management and systemic work with risks? 

Does the technical facility administrator hold the safety management system (SMS) that contains the
organizational structure, responsibilities, practices, rules, procedures and resources for determination
and enforcement of disaster prevention or at least for mitigating the unacceptable disaster impacts in
technical facility and in its surroundings?

Does the technical facility administrator hold the safety management system (SMS) that contains
management of six processes: concept and management; administrative procedures; technical matters;
off-site co-operation; emergency preparedness; and documentation and accident investigation?

Does the technical facility administrator hold the SMS that contains the concept and management
process with sub-processes for: overall concept; reaching the safety partial aims; safety governance;
the safety management system itself; personnel (human resources management, education and
training, internal communication, working environment); audit and assessment of performance of
safety aims?

Does the technical facility SMS contain the administrative procedures process with sub-processes for:
hazard identification from possible disasters and corresponding risk assessment; documentation of
procedures (including the work permits); change management; safety connected with contractors;
supervision of product safety?




Does the technical facility SMS contain the technical matters process with sub-processes for: research
and development; design and assembly; inherently safer processes; technical standards; storage of
hazardous substances; and integrity maintenance and maintenance of equipment and buildings?

Does the technical facility SMS contain the off-site co-operation process with sub-processes for:
co-operation with public administration; co-operation with the public and other involved parties
(including the academic institutions); and co-operation with other enterprises?

Does the technical facility SMS contain the emergency preparedness process with sub-processes for:
on-site planning; facilitation of off-site planning (for which the public administration is responsible); and
co-ordination of activities of departmental organisations in ensuring emergency preparedness and response?

Does the technical facility SMS contain the documentation and accident investigation process with
sub-processes for: processing the reports on disasters, accidents, near misses and other instructive
experiences; investigation of damages, losses and harms and their causes; and response and
consequential activities after disasters (including the application of lessons and information sharing)?

Does the technical facility SMS contain the program for safety improvement in which there are given:
roles of stakeholders; rules for safety culture improvement (golden rules); and relevant responsibilities?

Does the technical facility SMS contain the program for safety improvement in which there are given:
security plans (on strategic, tactical, functional and technical levels); on-site and off-site emergency
plans; continuity plans; and crisis plans?

Does the technical facility SMS contain the program for safety improvement in which there is given the
risk management plan with clearly determined countermeasures and responsibilities?

Does the technical facility SMS contain the program for safety improvement in which there is given
the risk management plan with clearly determined countermeasures and responsibilities that only
contains the technical risks?

Does the technical facility SMS contain the program for safety improvement in which there is given the
risk management plan with clearly determined countermeasures and responsibilities that only contains
the technical and organisational risks?

Does the technical facility SMS contain the program for safety improvement in which there is given
the risk management plan with clearly determined countermeasures and responsibilities that contains
the technical, organisational and external risks?

Does the technical facility SMS contain the program for safety improvement in which there is given the
risk management plan with clearly determined countermeasures and responsibilities that contains the
technical, organisational, external and cyber risks?

Does the technical facility SMS contain quality monitoring of both the integral risk and all
important partial risks, and the corrective countermeasures for the occurrence of unacceptable risks?


Total 


Table 2. Value scale for safety level determination.

Safety level        Values in %      Number of answers "YES" in Table 1

Extreme high - 5    More than 95%    More than 68
Very high - 4       70-95%           51-68
High - 3            45-70%           33-50
Medium - 2          25-45%           19-32
Low - 1             5-25%            4-18
Negligible - 0      Lower than 5%    Lower than 4


6 RESULTS OF JUDGEMENT OF SAFETY
LEVEL FOR SELECTED COMMON
TECHNICAL FACILITIES AND
DISCUSSION


For the judgement of safety levels, 5 common
technical facilities were selected that do not belong
to the critical objects in the Czech Republic; they
belong to Small and Medium Enterprises (SME);
specifically: chemical plant; machine plant; thermal
power plant; airport; highway (ČVUT 2017). In all cases,
access to the facility documentation was conditioned
by the agreement that real data on the technical work
will not be published. Therefore, only the final result
of the investigation is given, in the form:

1. The number of answers YES lies in the interval
20-29 with the mean value (median) 24.

2. The highest reached validities of work with risks:

- only assets of the technical facility are
followed,

- only the risk sources that are in the technical
facility and the human factor connected with
wrongly performed operations are followed,

- partial risks are considered, and mostly the
integrated risk connected with threats to
workers' health,

- risks are followed only after the technical
facility has been built,

- in the work with risks of the technical facility, a
process model of work with risks is used
that has clearly determined acceptance
criteria only for risks inside, and sporadically
for risks outside,

- preventive measures are performed
only for reducing or averting the priority risks,

- insurance of technical facilities is ensured
for the case of realisation of well-known risks,

- the results of fast and less precise risk
analyses are preferred,

- in the work with risks, only the criteria for
technical and economic assessment are
determined,

- demands, standards and norms for ensuring
technical safety are applied,

- the technical facility administrator has a
safety management system based on the
principles of process management.


Comparison of the number of answers YES with
the scale in Table 2 shows that the safety level is
medium in the followed SMEs. The judgement of the
level of validity reached by the techniques used for
work with risks shows that in practice the system
approach is missing, and that common technical
facilities only respect the demands
given in legislation and their own experience with risks.
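
The evaluation chain described in Chapters 4 and 5 (median of the evaluators' marks per question, count of YES answers, comparison with Table 2) can be sketched as follows. This is a minimal illustration, not the authors' tool: the function names, the 1 = YES / 0 = NO coding of marks and the requirement of an odd number of evaluators are assumptions; only the count boundaries are taken from Table 2.

```python
from statistics import median
from typing import List, Sequence, Tuple

TOTAL_QUESTIONS = 72

# Lower bounds on the number of YES answers, taken from Table 2.
SAFETY_SCALE: List[Tuple[int, str, int]] = [
    (69, "Extreme high", 5),
    (51, "Very high", 4),
    (33, "High", 3),
    (19, "Medium", 2),
    (4, "Low", 1),
    (0, "Negligible", 0),
]


def final_answer(marks: Sequence[int]) -> int:
    """Median of the evaluators' marks for one question (1 = YES, 0 = NO)."""
    if len(marks) % 2 == 0:
        raise ValueError("an odd number of evaluators is assumed")
    if any(mark not in (0, 1) for mark in marks):
        raise ValueError("marks must be 0 (NO) or 1 (YES)")
    return int(median(marks))


def safety_level(yes_count: int) -> Tuple[str, int]:
    """Map the number of YES answers to the safety level of Table 2."""
    if not 0 <= yes_count <= TOTAL_QUESTIONS:
        raise ValueError("yes_count must lie between 0 and 72")
    for lower_bound, name, grade in SAFETY_SCALE:
        if yes_count >= lower_bound:
            return name, grade
    raise AssertionError("unreachable: the scale starts at 0")


# Five evaluators answer one question; three say YES, so the median is YES.
one_question = final_answer([1, 1, 0, 1, 0])

# The reported median of 24 YES answers falls into the 19-32 band.
reported_level = safety_level(24)
```

With five evaluators and yes/no marks the median coincides with the majority answer, and the whole reported interval of 20-29 YES answers maps to the same band as the median value 24.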

From the viewpoint of human system security,
the obtained finding is not very comforting. The
results of detailed long-term research into accidents
of technological facilities and infrastructure,
summarized in (Prochazkova 2017), fully agree with the
outcomes published in many papers in the journal
Safety Science, which are well expressed in the work
(Kongsvik, Almklov & Fenstad 2010). The building of
safety culture needs to start at the top management
level and to spread to the lower management levels.

However, safety culture is not an all-powerful
tool. Therefore, it is also necessary that: the
technological facility owner does not put profit
before human system security; and the tools
used for building safety correspond to the
present level of knowledge. In these matters, the state
government and legislation play an important role.
The government needs to ensure the corresponding
level of education and highly qualified supervision
and inspection of the behaviour of technological
facilities, starting from siting, through building
and operation, up to decommissioning and
decontamination of the territory.

A by-product of the study of the documentation
of the mentioned technical facilities and others
(CVUT 2017) is the finding that experts from
different domains connected with a technical facility
do not co-operate; this is proved by records in the
documentation of conflicts that would not have
originated if the experts had communicated with each
other.


7 CONCLUSION 


An overview of the areas that affect the selection of
individual techniques of work with risks shows
that, where safe technical facilities are to be
ensured, it is necessary to use techniques for working
with risks that are based on the system concept
and on a critical evaluation of all influences that can
act on the technical facility, now and in the future.

The investigation of problems related to the
work with risks of technical facilities showed that
common technical facilities have a medium level
of safety. It reflects a low level of work with risks,
i.e. a low safety culture and the weak power of
government in forming the safety of the territory.

The judgement of the validity of the methods and
procedures of work with risks that are used in
practice in the Czech Republic shows that techniques
that do not respect the system nature of technical
facilities and the dynamics of development still
predominate. From the study of the documentation of
the followed technical facilities, it is obvious
that in forming their safety, experts from
different fields work separately, which of course
cannot guarantee optimal safety and optimal costs.


The Kursk submarine disaster in view of resilience assessment

R. Mock
Zurich University of Applied Sciences, Winterthur, Switzerland 


ABSTRACT: On August 12, 2000, the Russian Oscar-class submarine Kursk (K-141) sank during a navy
manoeuvre in the Barents Sea, killing all 118 personnel on board. The vessel was powered by two nuclear
reactors and carried nuclear missiles which can be armed. The disaster is well documented and
encompasses many socio-technical elements influencing the sequence of events finally leading to the
wreckage. For this reason, the disaster is considered an archetypical event which might highlight the
advantages as well as the limitations of resilience assessment approaches, e.g. in comparison with
established risk assessment methodology. To this end, the paper starts with the results of a literature
survey on resilience metrics and areas of technical application. The Kursk disaster is reviewed on the
basis of available literature and research reports by means of Root Cause Analysis. The causal aspects
(events, procedures, human factors, etc.) are then structured and classified according to their relevance
and impact on the vessel's resilience. In a next step, these aspects are contrasted with the risk
assessment approach as defined, e.g., by ISO 31000. The methodological juxtaposition is intended to
characterize the maturity level of resilience analysis in a real-world framework as well as to elaborate
major differences in the validity of the underlying system analysis concepts. Finally, the pros and cons
of the reviewing approach are discussed.


1 INTRODUCTION 


In the context of risk analysis, the term resilience
is often used nowadays. It is noticeable that both
a generally accepted definition of this term and,
consequently, a metric of resilience are missing.
The differences between risk and resilience
assessment often remain unclear, e.g. in connection with
related terms such as availability, vulnerability, and
Business Continuity Management (BCM). To a
certain extent, this follows a tradition of dealing
with indefinite terms such as risk, which in turn
is based on other terms that are not always clearly
definable. For instance, there is a risk if several
factors coincide: danger, exposure and vulnerability
(cf. Lenz 2009).

The paper is an attempt to work out the differences
and similarities between the two concepts
of risk and resilience, where the approach follows
the idea of “learning by doing” system assessments.
An archetypical case was selected for this:
the Kursk submarine disaster in 2000. On the
one hand, a submarine is a self-contained
socio-technical system, which simplifies considerations.
On the other hand, the case itself can be presented
from a variety of sources. One of us (A. Leksin) can
refer to less well-known Russian literature as well as
to feedback from one Russian accident investigator.
The case was examined by a root cause analysis (RCA),
which was then used for the discussion on risk and
resilience assessment.

The remainder of the paper is structured as follows:
Chapter 2 compiles definitions of risk and resilience
and compares major system management
terms. Chapter 3 describes the chronology of
major events and causative aspects of the Kursk
disaster and presents a part of the resilience
identification. Based on the sequence of major events,
differences in risk and resilience assessment are
elaborated in chapter 4. The results are discussed
in chapter 5.


2 TERMINOLOGY 


There is extensive literature on the definition
of the term resilience, e.g. (Husseini et al.
2016, Francis et al. 2014). There is a consensus that
resilience is concerned with socio-technical systems
and their ability to respond to disturbances in
order to maintain the specified performance. This
paper follows the definition of (Lay et al. 2015),
who define resilience by a set of system abilities:

Resilience: System abilities to respond to
disturbances, to monitor, to learn, and to anticipate
developments.

Table 1. Comparison of major system management terms.

Term           Connotation  Intrinsic system property  Management             Focus
Risk           negative     no                         external interference  (undesired) events
Resilience     positive     yes                        intrinsic              system performance
Vulnerability  negative     yes                        external interference  (undesired) flaws
BCM            positive     no                         external interference  (undesired) events
Availability   positive     yes                        external interference  failures

Responsiveness takes all kinds of disturbances
into account, i.e. all deviations from specified
performance levels, with both positive and negative
impacts. The term “disturbance” indicates that
the point of view is dominated by negative impacts.

Furthermore, responsiveness indicates a system’s
immediate response to disturbances. Hence, resilient
systems are designed to react to disturbances
in a self-managing way.

Looking at a socio-technical system, humans are
the carriers of its learning and anticipating abilities
as covered by system management processes.
Monitoring can be done both automatically/technically
and by humans, depending on the surveillance
level.

The concept of risk is assumed to be known to 
the reader. The paper follows the well-established 
definition of risk of (Kaplan & Garrick 1981): 


$\{\mathrm{Risk}_i \mid s_i, f_i, c_i\}$,

where:

$s_i$: scenario identification or description

$f_i$: probability (or frequency) of that scenario

$c_i$: consequence or evaluation measure of that
scenario, i.e., the measure of damage.


The frequency/consequence concept of risk is
also in line with common risk management
standards, e.g. (ISO 31000 2009). Risk figures are
usually computed by $f_i \cdot c_i$. Note that scenario is
not defined by (Kaplan & Garrick 1981) and (ISO
31000 2009). The authors will use it in the sense of an
(imagined) sequence of events. Probability is a
measurement of the uncertainty of (future) events
based on (statistical) data. As a consequence, risk
becomes a concept of proactivity and, finally, of
preparation by management.
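
The frequency/consequence view can be illustrated by a
small sketch. It is not taken from the paper: the class
name RiskTriplet, the function rank_by_risk, the units
and all numbers are assumptions chosen for illustration
only. Each triplet holds a scenario, a frequency and a
consequence, and the risk figure is their product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskTriplet:
    """One triplet (s_i, f_i, c_i) in the sense of Kaplan & Garrick."""
    scenario: str       # s_i: scenario description
    frequency: float    # f_i: events per year (assumed unit)
    consequence: float  # c_i: damage in arbitrary cost units (assumed unit)

    def risk_figure(self) -> float:
        """Risk figure computed as f_i * c_i."""
        return self.frequency * self.consequence


def rank_by_risk(triplets):
    """Return the triplets sorted from highest to lowest risk figure."""
    return sorted(triplets, key=lambda t: t.risk_figure(), reverse=True)


# Hypothetical scenarios; none of these numbers come from the paper.
triplets = [
    RiskTriplet("scenario A", frequency=0.01, consequence=200.0),
    RiskTriplet("scenario B", frequency=0.5, consequence=1.0),
    RiskTriplet("scenario C", frequency=0.001, consequence=10000.0),
]

ranked = rank_by_risk(triplets)
for t in ranked:
    print(f"{t.scenario}: f*c = {t.risk_figure():.2f}")
```

In this toy data the rare but severe scenario C ranks
first, which shows how a single product value hides
whether a risk is driven by frequency or by consequence.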

Vulnerability is a well-established term in IT
security which is easily adaptable to any other
engineered system. According to (NIST 2012),
the definition is:

Vulnerability: Weakness in an information system,
system security procedures, internal controls
or implementation that could be exploited by a
threat source.

Hence, the understanding of vulnerability follows
the keyhole principle, and vulnerability is an
intrinsic system property. Vulnerability management
then means reactively plugging flaws.


However, the limits of the concepts are not
always clear: According to (Lenz 2009:38-43.69),
risk always coincides with danger, exposure and
vulnerability. Additional components such as coping
capacity and criticality (meaning and consequences
upon occurrence) can be added to this assumption.

The concept of Business Continuity Management
is a system-maintaining process related to
risk, resilience and vulnerability management, as
defined by (SBA 2013):

Business Continuity Management (BCM) is a
company-wide approach designed to ensure that
critical business processes can be maintained in the
event of major internal or external incidents.

The focus is on the management of single (major)
undesired events in order to minimise their impacts.
The management objective is to maintain the
specified business performance level.


3 THE SUBMARINE KURSK (K-141) 
DISASTER 


The Kursk submarine disaster took place in the
Barents Sea on 12 August 2000, killing all 118
personnel on board. In this paper, the course of the
disaster, as far as publicly known, serves as a test
case for the methods of system analysis listed in
chapter 1. The detailed description of the disaster is
a significant part of the resilience identification step,
which is explained in chapter 4.2. The authors treat
the Kursk catastrophe on the basis of publicly
available information and present a possible sequence
of events. Further discussions are then based on
this. The case covers all elements that make
an analysis interesting from very different
perspectives, i.e. the interaction of people and technology
in a stressful overall situation. However, the basic
system performance remains simple: ensuring
the safety and health of the crew. The question is
to what extent the system analysis methods listed
above would have been suitable for anticipating this
accident.


3.1 Event of the Kursk disaster 


There are about 18 different versions of the disaster
of the Russian Oscar-class submarine Kursk (K-141).

Figure 1. Example of the inner and outer hull construction
with P-700 Granit “Shipwreck” cruise missiles on
the bow side (Militaryarms 2017).


Figure 2. Characteristics of Oscar class submarine
(Defending 2015).


This paper considers one of the official versions:
an explosion of a torpedo, but caused by the influence
of a second submarine. The chronology of major
events and causative aspects (events, procedures,
human factors, etc.) is structured and classified
according to relevance, so that only the important
steps of the disaster are described and the reader
gets a better overview. The RCA breaks down
a complex scenario into individual steps (black
boxes in the RCA diagram of Figure 3), which
ultimately indicate a cause-effect chain. Secondary
event chains can be added (grey boxes). For better
understanding, the boxes are numbered. The event
numbers can also be found in the case description.

The significant factors which had a strong
influence on the worst-case scenario
can be traced back to 1999. Kursk was on a
military mission in the Mediterranean Sea to monitor
the United States Sixth Fleet responding to the
Kosovo crisis (1). After the successful mission,
the submarine returned to its home port of
Vidyayevo. After a longer downtime for financial
reasons, the recommissioning of the submarine by
the crew towards the end of May 2000 was under time
pressure because of the large-scale naval exercise
the Russian Navy had planned for August 2000.
Therefore, the crew had lacked planned training
activities in the last approx. 9 months (2). But due
to the last successful mission in the Mediterranean

Table 2. Explosive characteristics of USET-80 and
VA-111.

                  USET-80             VA-111
                  (warshot torpedo)   (warshot torpedo)
Total weight      2000 kg             2700 kg
Combat unit                           200 kg
Explosive weight  200/300 kg          200 kg


Sea, it cannot be ruled out that part of the crew
was overconfident. Either because of time pressure
and/or because of incorrect planning of the marine
areas by the Military-Maritime Fleet of the Russian
Federation (3), the route to the naval exercise area
led over “underwater mountains” (4). Manoeuvring
through such shallow areas of the sea
can be dangerous for an Oscar-class submarine and
other submarines, because sonar shadows and
magnetic interference make navigation difficult.
The threat obviously increases given that
submarines of other countries are
always present at such naval exercises.

On August 10th, 2000, the Kursk began
the planned activities of the naval exercise near
the Kola Bay. On August 12th, 2000, at 11:28 local
time, two explosions were detected by various
seismological and hydroacoustic stations. The first
explosion corresponded to ~500 [kg] TNT equivalent,
followed after 135 seconds by the second explosion
with ≈5000 [kg] TNT equivalent. Unfortunately, the
exact number of armed cruise missiles on the Kursk
varies between references. The typical armament
consists of 24 SS-N-19/P-700 Granit “Shipwreck”
cruise missiles, which were designed to defeat the
best naval air defences. The missile containers are
located on both sides of the deckhouse, outside
the pressure hull. According to most references,
photographic material and video footage, the Kursk
had 24 P-700 Granit “Shipwreck” cruise missiles on
board during this naval exercise. Due to the
double-hull construction of the Oscar-class submarine,
the second explosion did not set off the P-700 Granit
“Shipwreck” cruise missiles. The designers had
considered such a worst-case scenario and
reinforced the inner hull with high content stainless
steel about 45-68 [mm] thick. There is a 200-350
[cm] gap to the 5-10 [mm] thick outer hull.

Therefore, both detected explosions occurred in the
1st (torpedo) compartment. As before, the exact
number of dummy and warshot torpedoes varies
between references, from 8 to 18 and even 24.
Weapons included 18 of SS-N-16 “Stallion” (RPK-6
“Vodopad”), hydrogen peroxide-fueled Type 65
torpedo (65-76A), USET-80 (УСЭТ-80) and their
different types. At that moment, Kursk was armed
with dummy (65-76PV and USET-80) and warshot
(65-76A, USET-80) torpedoes as well as the
VA-111 Shkval torpedo.

Figure 3. Sequence of events in form of RCA.


Although it was an exercise, Kursk was, as
mentioned before, also loaded with combat-capable
weapons. This means that some of the torpedo tubes
were constantly in combat readiness with an armed
warshot torpedo. The warshot torpedo used
by Kursk on military missions is typically the
USET-80. Table 2 shows brief characteristics
of the torpedoes USET-80 and VA-111 Shkval.

Based on these characteristics, the probability
that a USET-80 was in the torpedo tube is very high,
and its TNT equivalent is close to 500 [kg] (7). It
was also planned to launch the USET-80 torpedo
second in this naval exercise.

However, all versions of the disaster reports agree
on one point: the first explosion was an explosion of
a torpedo in a torpedo tube. As mentioned before,
submarines of other countries are always present in
the area of all naval exercises, as well as near the
main naval ports throughout the year. The history of
underwater incidents between submarines is well
known and documented in different languages and
countries (Drew et al. 1998). “Among the specified
accidents there are several tens of collisions of
submarines, including 20 underwater collisions of
Russian Navy submarines with foreign submarines.
From these 20 examples 11 were in grounds of
combat trainings (naval exercises) on the way to the
main stationing sites of the Northern and Pacific
fleet, including 8 in the north and 3 in the Pacific
Ocean in a short time period from 1968 till 1993”
(Aleksin 2001, Viperson 2001). Several accidents
have also been registered from 1993 until today.
On August 12th, 2000, Kursk prepared for shooting
practice in the predetermined area and carried out
radio and radio-engineering reconnaissance of the
surface forces of the “opponent”, the Kirov-class
battlecruiser Pyotr Velikiy (5). Due to force 3
conditions at sea, the speed of Kursk was approx.
8 knots. As is typical for such an exercise, the Kursk
changed its depth many times. A second, non-Russian
submarine, which had been monitoring Kursk for the
last two days (6.1), lost contact (6.2) and could not
find the Russian submarine (6.3). Its crew decided to
ascend to periscope depth (6.4) to clarify the
situation and to check whether Kursk had also
surfaced. On the way to periscope depth, the
non-Russian submarine unexpectedly struck (6.5), with
the lower edge of its bow section and at a high angle
of attack, the top area of the starboard bow side of
Kursk, where a torpedo tube was loaded with the
warshot torpedo USET-80. Both submarines continued to
move at their former speed (5.5 [m/s]), destroying
each other’s hulls (6). Nuclear submarines of the US
and UK navies are built with only a single 35-45 [mm]
thick stainless steel hull. Thus the damage to Kursk
was much higher.
Within a second of the impact, the torpedo tube
on the starboard side of Kursk was crumpled
over half of its length, which caused a detonation
of the warshot torpedo USET-80 (7). This detonation
followed the line of least resistance to the hatch
of the torpedo tube, destroying it and creating a
hole more than half a metre in diameter. Water flowed
into the torpedo compartment and caused a trim to
the bow (8). The captain of Kursk gave the order to
ascend and increase speed (9). However, water
penetration caused short circuits in the electrical
networks (10), and as a result the emergency
system shut down both nuclear reactors (11.1).
The Kursk was out of control (11.2) with a strong
trim to the bow and hit the seafloor (12). The
second explosion was initiated by the impact on
the seafloor (13). This explosion killed many crew
members in the conning tower and control room
(compartment 2), the radio-electronics room
(compartment 3), the living quarters (compartment 4),
and the room with the diesel generator, the
electrolysis installation for air regeneration, the
high-pressure compressors etc. (compartment 5).
Although Kursk was designed to withstand the external
pressure at depths of up to 1000 [m], the second,
internal explosion destroyed the bulkheads between
the compartments (probably up to compartment 5),
which are rated for only 10 atmospheres.

The inner hull is designed for 60 atmospheres,
which prevented the explosion of the P-700 Granit
“Shipwreck” cruise missiles, as mentioned before in
this paper.

Based on the RCA and literature statistics on
similar incidents, the catastrophe of the submarine
Kursk must be considered an archetypal event. The
next chapter discusses how such a scenario could be
handled from the perspective of classical risk
analysis, as well as a possible approach based on a
resilience concept and its implementation problems.


4 RISK AND RESILIENCE ASSESSMENT 


The established system assessment process can be 
summarized by three steps: identification, analysis, 
evaluation. These processes are well defined in risk 
assessment while there are methodological gaps in 
resilience assessment. 

The following subsections outline these gaps 
and point to differences in risk and resilience 
assessment by exemplary application to the Kursk 
disaster. 


4.1 Risk assessment 


The established approaches and concepts of risk
assessment are presumably known to the reader (i.e.,
how to perform Failure Mode and Effects Analysis
(FMEA), Fault Tree Analysis (FTA), and others).

In risk assessment, too, a study starts with the
determination of system boundaries. When following
the risk management process of, e.g., ISO
31000, risk assessment covers the (technical)
system, whereas the remaining risk management
processes lie beyond it. Compared with resilience
assessment (cf. chapter 4.2), risk studies are based on
a restricted system definition. As a consequence,
some boxes of the RCA in Figure 3 are excluded (the
results of all selection criteria compiled in this
chapter are applied to the RCA of the Kursk disaster
and summarized in Table 3).

There are naval studies available that support
established risk assessment approaches, e.g.,
(Holmboe et al. 1992) on likelihoods of threats,
maturity of technologies, and a system’s potential to
develop a threat scenario.

The identification process starts with the
specification of hazards and threats as well as
vulnerabilities of the system and system components.

A hazard is commonly defined as a condition,
circumstance or process that can cause damage.
Furthermore, hazard is limited to accidental,
undesired and sudden events.

Risk analysis needs the quantification of the
frequency and likelihood of an undesired event.
Figure 5 shows the risk analysis model as applied
by the authors.

The terms in Figure 5 are specified by:


• hazards are characterised by possibilities,

• results of threat factors. Scenario analysis is used
to anticipate how threats and opportunities
might develop and is used for all types of risk
with short- and long-term time frames,


Figure 4. Compartments of Kursk (Naked-science
2017).


Table 3. Risk analysis relevant actions according to
RCA.

RCA steps  Action  H  T  V  S&S  C
2     negative  n.r.  -
3     negative  n.r.  -
4     negative  n.r.  +
5     n.r.  n.r.  n.r.  n.r.  n.r.  n.r.
6     negative
7     negative
8     negative
9     n.r.  n.r.  n.r.  n.r.  n.r.  n.r.
10    negative
11.1  positive  n.r.  n.r.  n.r.  n.r.  +
11.2  negative
12    negative
13    negative
End   negative

+: relevant impact regarding risk analysis; -: no impact
on defined system; n.r.: not relevant for risk analysis.


Figure 5. Risk analysis model.
• destruction of objects as a result of hazards is
characterized by conditional probability,

• unavailability of safety and security systems
because of combinations of non-reliability and
human factors, among others, is quantified
by probabilities of scenario development from
emergencies towards accident.


Depending on analysis goals (quantitative or
qualitative), different approaches are in use. The
Russian Army carries out FTA (personal
communication with Saint Petersburg State Institute
of Technology). However, FTA does not consider
positive events, and an analysis of the top event
“Recessed submarine” is then incomplete. A failure
analysis based on qualitative approaches, such as
FMEA, could be added.

Within this defined framework for risk analysis,
Table 3 shows the relevant boxes of the RCA.

The selection process for identifying the RCA
boxes relevant to risk analysis is based on the
following rules:


Rule 1: The box event is within the defined system 
boundary. 

Rule 2: The hazard is relevant within the defined 
system boundary. 

Rule 3: If a threat (from outside) exists, it is taken 
into account. 

Rule 4: Searching for vulnerabilities in the system. 

Rule 5: Investigation of safety and security systems 
is a special task of risk analysis. 

Rule 6: Negative consequences are always relevant 
for risk analysis independent from threats 
and vulnerabilities. 


Thus, the step of preparing the shooting exercise (5)
is not relevant from the view of risk assessment. The
order by the captain (9) is a positive measure and
thus not relevant for the defined scenario. The
automatic shutdown of the nuclear reactor (11.1) is
not part of the risk analysis because it is a planned
safety process. Steps (2) and (3) are only considered
in human reliability analysis. Finally, risk is
often evaluated by a risk matrix. However, the risk
evaluation process is not the subject of this paper.


4.2 Resilience assessment 


As mentioned in chapter 2, resilience considers
extended socio-technical systems where (human
as well as automated) actors are responsible for
actions that positively or negatively affect system
responsiveness. Applying resilience assessment
in a case study brings to light further differences
in understanding compared with risk assessment.
The system assessment processes (i.e. aspect
identification, analysis and evaluation) structure the
following discussion.


System performance is a matter of documented
system design specifications and other characteristics
of embedding system entities. Accordingly, the
resilience assessment of the Kursk disaster comprises
all involved submarines and crews as well as the
entire Northern Military-Maritime Fleet of the Russian
Federation at the moment of the exercise, and
impacts from the sea environment. The impact on the
environment is not relevant. Hence, actions, actors,
and system boundaries can be defined easily in the
Kursk example, in contrast to, e.g., infrastructures.
Within this framework, the identification process
starts with the specification of the system
performance P. For this, two approaches are common in
resilience assessment (cf., e.g., Mock 2018): either
the analysts decide to model the time-dependent
performance P(t) or they compile a set of n
resilience-impacting aspects $P = \{a_1; a_2; \ldots; a_n\}$.
P(t) can be easily defined (e.g. safe and secure
transport of crew and cargo during mission time), but
finding a corresponding measurement is not always
as straightforward as, e.g., the oxygen content of the
breathing air during mission time. Note that
availability, as shown in Table 1, is a performance model
P(t) showing the probability course of the operability
of a system. Maintenance and repair are considered
as activities to keep the system resilient, and
are actions of responsiveness.

The identification process by compilation of a
set of aspects influencing resilience appears
straightforward, e.g. the number of redundancies of
life support systems, educational level of the crew,
repair, etc. However, time dependency and the
representation of systemic relationships are then lost.

The resilience analysis process by P(t) follows the
common processes of formal mathematical/physical
system modelling and simulation and will not be
discussed here. However, the RCA presentation of
the Kursk disaster in Figure 3, which follows a
timeline of succeeding events, is considered a simple
representation of P(t) after revision towards resilience
(see Table 4). P(t) analysis needs the specification
of normal operation bandwidths of total system
performance. For instance, the oxygen content on
a submarine can be above or below a lethal
threshold. The life support system may be able to provide
a breathable atmosphere again, but this can be too
late for the crew. In terms of resilience analysis, the
responsiveness of the entire submarine system is lost
as safe transport has ended. This points to a specific
view in resilience analysis: total loss of performance
or functionality (worst case) is excluded from
analysis (“If dead you are not resilient any longer”).
The analysis of impacting aspects P needs the
definition of a resilience metric, which is still under
discussion in academia. Table 4 summarises the findings
in resilience identification and analysis for the Kursk
example as represented in Figure 3.
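
The idea of a normal operation bandwidth for P(t) can be
sketched as follows. This is a minimal illustration and
not the authors' model: the function name
resilient_time_ratio, the piecewise-constant sampling,
the band limits and all sample values are assumptions.
The sketch returns the share of mission time spent inside
the band and credits nothing after a total loss, even if
the measured value recovers later.

```python
def resilient_time_ratio(samples, lower, upper, loss_threshold):
    """Share of mission time spent inside the normal operation band.

    samples: list of (time, value) pairs sorted by time; each value is
             assumed to hold until the next sample (piecewise constant).
    lower, upper: assumed normal operation bandwidth of the measure.
    loss_threshold: a value below it counts as total loss; no later
                    interval is credited, even if the value recovers.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to span a mission time")
    total = samples[-1][0] - samples[0][0]
    if total <= 0:
        raise ValueError("samples must cover a positive time span")

    in_band = 0.0
    lost = False
    for (t0, value), (t1, _) in zip(samples, samples[1:]):
        if value < loss_threshold:
            lost = True  # total loss: resilience is no longer assessed
        if not lost and lower <= value <= upper:
            in_band += t1 - t0
    return in_band / total


# Hypothetical oxygen readings in percent over mission time in hours.
recovering = [(0, 20.9), (2, 20.5), (4, 17.0), (6, 19.8), (8, 20.6), (10, 20.6)]
too_late = [(0, 20.9), (2, 9.0), (4, 20.9), (6, 20.9)]

ratio_recovering = resilient_time_ratio(recovering, 19.5, 23.5, 10.0)
ratio_too_late = resilient_time_ratio(too_late, 19.5, 23.5, 10.0)
print(ratio_recovering, ratio_too_late)
```

The second series mirrors the remark in the text: the
atmosphere becomes breathable again, but the time after
the loss is not counted as resilient operation.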

Table 4. Resilience assessment for performance “safe
and secure transport of Kursk crew and cargo during
mission time”.

RCA   Sub-system          P     Action                          Actor(s)
1     fleet               +     success                         Kursk crew, fleet
2     Kursk               -                                     Kursk crew
3     fleet               -     preparedness                    Kursk crew, fleet
4     environment         -
5     Kursk               +     exercise
6     Kursk, other sub.   -     ram                             Both crews
7     Kursk               -     explosion
8     Kursk               -     leakage
9     Kursk               +     order                           Kursk’s captain
10    Kursk               n.r.  short-circuit, loss of control
11.1  Kursk               n.r.  shutdown reactor
11.2  Kursk               n.r.  loss of control
12    Kursk, environment  n.r.  grounding, loss of control
13    Kursk               n.r.  detonation, loss of control
End   Kursk               n.r.

+/-: positive/negative impact on resilience; n.r.: not
relevant for resilience assessment purposes.


As shown in Table 4, resilience assessment
ends with the loss of control of the Kursk
(“If faint, then you are no longer resilient”). Step
(6) can be similarly analysed by considering RCA
steps (6.1) to (6.6), which introduces the second
submarine into the analysis. The marine environment
(step (4)) has been identified as challenging for
submarines, which does not support safe transport.

Resilience evaluation is the process of assessment.
Again, the analyst depends on how the resilience
analysis has been performed. In the case of modelling
P(t), a characteristic value needs to be defined, e.g.
the ratio of resilient operation mode to total mission
time. This is equivalent, e.g., to reliability and
availability analysis. The evaluation of the set of
impacts P needs the definition of a resilience metric
comparable to the risk prioritisation value RPV in
risk analysis as provided by FMEA. Evaluation
criteria for acceptance/non-acceptance of resilience
analysis results still need to be defined (a resilience
priority value is introduced in (Mock 2018)).


4.3 Synopsis


Based on the RCA, every single step is discussed from
the side of resilience assessment in comparison to
risk assessment. Table 5 distinguishes between the
selections yes (Y) and no (N).

Table 5. Juxtaposition of risk and resilience assessment.

RCA events  Risk assessment  Resilience assessment
1           N                Y
2           N                Y
3           N                Y
4           Y                Y
5           N                ?
6           Y                Y
6.1         N                Y
6.2         N                Y
6.3         Y                Y
6.4         ?                Y
6.5         ?                Y
7           Y                Y
8           Y                Y
9           ?                Y
10          Y                N
11.1        Y                N
11.2        Y                N
12          Y                N
13          Y                N
Summary     Y: 10 of 19      Y: 13 of 19

Avoidance of a worst-case scenario or disaster is
the aim of a risk assessment. Therefore, steps that
are positive or neutral (from the view of risk
analysis), such as (1), (3), (5), (6.1), (6.2), are not
considered. But, e.g., step (9) could be considered if
the order of the captain is incorrect. Step (2) can
also be considered, but only in the case of a human
reliability analysis. Equivalent to a worst-case
scenario “meltdown of nuclear reactor”, step (13) must
be taken into account in this disaster. Similarly,
step (6) could correspond to a “plane crash on nuclear
power plant” scenario (external event). Step (5) is
not a malfunction or optimization, but an important
point in the overall process. Due to the defined
performance indicator, step (6.2) must be considered
too (the second submarine is part of the whole
system). As mentioned before, with the defined
performance indicator, safe implementation of the
naval exercise for the crew, the resilience analysis
ends with step (10). As mentioned in chapter 2, risk
assessment often uses the basic frequency/consequence
definition to obtain calculation values. However, the
next step after the description of the RCA from the
side of resilience analysis is complicated by the
absence of any useful values and equations which could
support the calculations and, as a result, the
evaluation of the defined system and performance
indicator.


5 CONCLUSIONS 


Resilience assessment should be different from
risk assessment and other related concepts and
approaches. For instance, risk assessment is
basically restricted to undesired events and does not
cover the extended view of technical systems. On
the other hand, event identification highly depends
on the definition of system performance, indicating
resilience as a measurement of system quality.

The issue of applied resilience assessment is
shown by considering the archetypical case of the
Kursk submarine disaster. The detailed description
of the sequence of steps by Root Cause Analysis shows
that a precise accident analysis is significant for
the identification of aspects which have impacts on
resilience. The specification of system performance
and the view of extended socio-technical
systems increase the resource requirements (time,
expertise, etc.) of auditing. This way of thinking
definitely uncovers additional elements of system
disturbances.

However, resilience analysis is still in its
beginnings and there is no commonly accepted
methodology and metric. In summary, resilience
assessment differs from risk assessment in some
ways and shows promising aspects in extended system
analysis. However, further steps towards
operationalisation of the resilience concept are needed.

ABSTRACT: Today’s infrastructural systems are expected to be safe and resilient. In this context,
assessment of such systems faces two principal challenges: on the one hand, common approaches in risk
assessment have reached their limits in methodology and feasibility in assessing complex and
interconnected systems. On the other hand, resilience assessment is in its beginnings and lacks, e.g., a
commonly accepted resilience metric. The paper starts by specifying a practical definition of resilience
and an assigned metric: Resilience is characterised by influencing recovery properties of a socio-technical
system. Actors and actions are carriers of these properties. This corresponds to the views of system
representation by Use Case Diagrams (UCD). In order to quantify a UCD, actions are validated by assessing
their compliance level L. Actors are associated with their abilities to respond, monitor, learn, and
anticipate developments. The result is given by the Resilience Priority Value REPV = L - J of actors and
overall system. The resilience assessment process is exemplified by a case study of a car park guidance
system.


1 INTRODUCTION 


Current infrastructural systems show a high level of
complexity, and technical development will further
strengthen this trend. As a consequence, such systems
will become increasingly difficult to handle for
system-operating organisations (private and
non-private) and the managers involved. It already
looks as if the methodological or practical limits of,
e.g., established risk assessment approaches have been
reached. New terms reflecting newly desired system
properties (e.g., resilience, smartness) are emerging
too. However, the methodology of resilience
assessment is at an early stage of development and
not (yet) in the focus of most organisations. This
is also due to the lack of a practicable, quantitative
metric of resilience. In this context, the paper
presents an approach to facilitate applied resilience
assessment audits. Following the concept of system
representation and resilience quantification, the
remainder of the paper is structured as follows:
Chapter 2 defines resilience and the terms in use. The
results of a literature survey on resilience
definitions in specified engineering domains are given
in Chapter 3. In Chapter 4, resilience assessment is
implemented using quantified Use Case Diagrams (UCD).
The case study presented in Chapter 5 serves as a
proof of concept. The paper closes with a discussion
of the pros and cons of the approach and its context.


2 TERMS 


The view in applied research and development in 
resilience analysis covers the requirements of users in 


organisations and enterprises (mainly small to mid- 
sized enterprises SME). Hence, any resilience assess- 
ment approach needs to cover additional demands 
(cf. (ISO-31010 2009)), which might be unimportant 
to basic research. A major concern of organisations
is method efficiency. Thus, the resilience assessment
approach introduced in this paper aims to finally
reach the practicability known from basic risk assessment
audits, fire and explosion inspections, and annual
tests of vehicle safety (Ministry of Transport (MOT)
test), among others. According to the author’s experience,
such a system analysis must typically be performed
by one person in about one day.

There are already exhaustive literature surveys 
on terminology of resilience, where the most com- 
mon understanding of resilience is exemplified 
by Scholz et al. (2012): Resilience is the ability of 
the system to adjust its functioning [...] following 
changes and disturbances, so that it can sustain 
required operations. 

Hosseini et al. (2016) also consider system
recovery abilities as a crucial part of resilience, where
recovery is the capability of a system to absorb and
adapt to disruptive events.

Lay et al. (2015) elaborate the characteristics and abilities
of resilient systems in more detail, which finally
leads to the definition used in this paper:


DEFINITION 1 (RESILIENCE) characterises the abilities
of a system to respond to disturbances, to monitor,
to learn and to anticipate developments.


With this, resilience belongs to a set of
related engineering terms characterising system
capabilities by attributes or a system performance
function P(t), e.g., availability A(t):


DEFINITION 2 (AVAILABILITY) is the probability of
finding a unit in an operational condition at time t.


This definition of availability follows, e.g.,
DIN EN 61703 (2002). Note that A(t) encompasses
maintenance, which is a system ability to respond 
to disturbances (failures, incidents, over-fulfilment, 
etc.), to monitor them (failure identification) and 
to learn (optimising maintenance processes) and 
anticipate trends (expected failures). The latter is 
covered by reliability management processes and 
preventive maintenance. So far, resilience looks 
like the generalisation of availability towards the 
analysis of extended socio-technological systems. 
Furthermore, management and associated proc- 
esses are considered as an integral part of such a 
system in resilience assessment (cf. (Leksin et al. 
2018)). By contrast, management tends to play the 
role of an external controller in risk assessment. 
Business continuity (BC) also follows the concept
of system recovery but concentrates on business
impacts:


DEFINITION 3 (BUSINESS CONTINUITY) is a corporate 
capability. This capability exists whenever organisa- 
tions can continue to deliver their products and serv- 
ices at acceptable predefined levels after disruptive 
incidents have occurred (cf. ISO 22301: 2012). 


Resilience is in line with established approaches 
to manage deviations, e.g., risk management 
according to ISO-31000 (2009). Note that all
system assessment approaches cover the sub-processes
of event identification, analysis and
evaluation. Resilience, availability and
business continuity describe system capabilities
with associated performance functions, whereas
risk relates to (undesired) events.


3 STATE OF THE ART 


The following results of a survey on resilience definitions
concentrate on engineering domains
and are then used to justify the way resilience
assessment is operationalised in Chapter 4.
Hosseini et al. (2016) give an extended review of
definitions. They state that the engineering domain
“includes technical systems designed by engineers
that interact with humans and technology, such as
electric power networks”. There, engineering resilience
is defined from various points of view:


— Sum of the passive survival rate (reliability) and 
proactive survival rate (restoration) of a system. 

— “Intrinsic ability of a system to adjust its func- 
tioning prior to, during, or following changes 
and disturbances, so that it can sustain required 
operations under both expected and unexpected 
conditions” (Hollnagel et al. 2010). 


— “Infrastructure resilience is the ability to reduce 
the magnitude and/or duration of disruptive 
events. The effectiveness of a resilient infra- 
structure or enterprise depends upon its ability 
to anticipate, absorb, adapt to, and/or rapidly 
recover from a potentially disruptive event” 
(NIAC 2009). 

— Factors, e.g., minimisation of failure, limitation 
of effects, administrative controls/procedures, 
flexibility, controllability, early detection. 


Furthermore, Hosseini et al. (2016) state that: 


— Many definitions focus on the capability of a system
to absorb and to adapt to disruptive events,
and recovery is considered as the critical part of
resilience.

— For engineered systems, reliability is considered 
to be an important feature (e.g. nuclear power 
systems). 

— Returning to steady state performance is needed 
for resilience; some definitions do not impose 
that the system returns to the pre-disaster state 
(e.g. infrastructure). 

— Multidimensionality and threat-dependency of 
resilience definitions. 


These lists and the findings of Chapter 2 substantiate
that resilient systems show abilities of
preparedness and recovery in general. Preparedness
is typically covered by a descriptive
(i.e. qualitative) and case-specific set of attributes
$A = \{a_1, a_2, \ldots, a_n\}$. System performance
P(t) uses various modelling and simulation
approaches to model system dynamics
according to the resilience triangle
concept. The definition of preparedness attributes
follows methods to design and evaluate questionnaires,
check lists, etc., and graphs represent relationships
of system abilities. This notation is useful
to characterise the common approaches of resilience
system analysis:


— Attributive: Starting point is the compilation
of system specific attributes $A = \{a_1, a_2, \ldots, a_n\}$,
which characterise the presence of or impact on
resilience properties, e.g., awareness, flexibility, 
risk management, competence, and redundancy. 
Then, analysis follows methods to evaluate ques- 
tionnaires, check lists, etc. or uses any graphs to 
represent relationships. 

— Performance: System performance modelling 
needs the specification of time dependent and 
system specific performance measurements, e.g., 
availability (i.e. function showing the alterna- 
tion of operation and maintenance), returns 
(money), among others. Note that P(t) already
comprises the recovery properties. Modelling
parameters of P(t) might be based on A. Hence, P(t) can
be modelled by graph theory (e.g. Markovian
models, system dynamics, state diagrams), classical
mechanics considering damped harmonic
oscillation (i.e., spring model), as well as control
theory using a proportional-integral-derivative
controller (PID controller).


The definition of resilience performance P(t) 
results in generic system statements: 


— $\max\{|\Delta P(t)|\} < b$: There is a specified band
width b (i.e. defined upper and lower performance
levels) which defines system operability
(Mock and Zipper 2017).

— P(t) =0: Total system failure (worst case) which 
ends reparability. Recovery is not possible and 
system re-construction is equivalent to a differ- 
ent and thus new system. Reconstruction is only 
possible by supporting measures of the superior 
system (cf. definition of disaster). Hence, P(t = 0) is
associated with a fully operable (new) system,
which is a common boundary condition in reli- 
ability analysis. 

— $P(t) = k$ and $\dot{P}(t) = 0$: Nominal operation on
constant performance level k.

— $\dot{P}(t) \neq 0$: System is in resilience mode.

— $\ddot{P}(t) \neq 0$: Acceleration of performance alteration
is an indicator of resilience request.
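
The generic statements above can be checked numerically on a sampled performance curve. The following Python sketch is only an illustration: the function names `within_band` and `classify_modes`, the sample values, the reading of ΔP(t) as the deviation from a nominal level, and the use of finite differences to approximate the first and second derivative of P(t) are assumptions and are not taken from the surveyed literature.

```python
from typing import List


def within_band(p: List[float], nominal: float, b: float) -> bool:
    """Check max |P(t) - nominal| < b (assumed reading of the band width criterion)."""
    return max(abs(x - nominal) for x in p) < b


def classify_modes(p: List[float], dt: float = 1.0, tol: float = 1e-9) -> List[str]:
    """Label every interior sample of a performance series P(t)."""
    labels = []
    for i in range(1, len(p) - 1):
        first = (p[i + 1] - p[i - 1]) / (2 * dt)             # approximates dP/dt
        second = (p[i + 1] - 2 * p[i] + p[i - 1]) / dt ** 2   # approximates d2P/dt2
        if abs(p[i]) <= tol:
            labels.append("total failure")                    # P(t) = 0
        elif abs(first) <= tol:
            labels.append("nominal")                          # P(t) = k, dP/dt = 0
        elif abs(second) > tol:
            labels.append("resilience mode (accelerating)")   # d2P/dt2 != 0
        else:
            labels.append("resilience mode")                  # dP/dt != 0
    return labels


# Hypothetical performance curve in percent: disturbance, then recovery.
curve = [100, 100, 100, 80, 60, 40, 70, 100, 100]
modes = classify_modes(curve)
```

With this invented curve, the first interior sample is labelled as nominal operation, while the samples on the falling and rising flanks are labelled as resilience mode.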


In summary, there is no common understanding
of resilience and how to model resilience.
Many authors define key abilities of
resilience on their own. The lowest common
denominator is the ability of a resilient system to
respond to disturbances and, hence, function-preserving
capabilities (i.e. recovery). In order
to verify these findings, Table 1 uses Def. 1 of 
resilience to allocate the specified attributes of 
resilience for infrastructure systems as named in 
references. 

Table 1 also shows that “respond” to distur- 
bance is the main property of a resilient system. 
The remaining properties are less frequently listed. 
From the author’s point of view, this table exempli- 
fies the uncertainty of how to deal with resilience 
key capabilities other than “respond”, and that it 
is still necessary to utilise the resilience concept for 
concrete applications. 


Table 1. Key abilities of resilience for infrastructure systems.

| System | Respond | Monitor | Learn | Anticipate | Reference |
|---|---|---|---|---|---|
| System homeland security | robustness, consequence mitigation | threat and hazard assessments | adaptability, harmonisation of purposes, comprehensive of scope | risk-informed planning and investment | (Hosseini et al. 2016) |
| Telecommunication network | maintainability | reliability, safety, confidentiality, availability, integrity performance | - | - | (Hosseini et al. 2016) |
| Communication network | defend, remediate, recover | detect, diagnose | refine | - | (Hosseini et al. 2016) |
| Infrastructure system | absorb, recover | - | adapt | anticipate | (Lay et al. 2015) |
| Critical infrastructure | responsiveness, timely recovery, minimum level of service while undergoing changes, flexibility | - | - | coordinated planning | (Lay et al. 2015) |
| Infrastructure network | ability to regain a previous state | - | adopt the stress-strain model | - | (Bergstrom et al. 2015) |
| Infrastructure | recovery (bouncing back) | - | - | - | (Lundberg and Johansson 2015) |
| Critical infrastructures | robustness, (availability of redundancy, resourcefulness and efficiency of supporting measures) | - | - | - | (BABS 2013) |
4 UTILISATION


Chapter 4 identifies the interrelationships among 
system elements by UCD and how to quantify sys- 
tem resilience by attributes as given in Def. 1. 


4.1 Use Case Diagram UCD 


The Unified Modeling Language (UML) is a quasi-standard
of system representing diagrams offering
conformance in syntax and semantics. UML 2.5
defines thirteen types of diagrams, divided into three
major categories: Structure Diagrams, Behaviour
Diagrams, and Interaction Diagrams (cf. www.uml.org).
UML diagrams are standardised by (ISO-19501
2005), where the UCD is the simplest behaviour diagram
in UML. This chapter gives a short introduction
to the concept of UCD, referring to the
mentioned standard unless otherwise stated.


DEFINITION 4 (USE CASE) is a kind of classifier rep- 
resenting a coherent unit of functionality provided 
by a system, a subsystem, or a class as manifested by 
sequences of messages exchanged among the system 
(subsystem, class) and one or more outside interac- 
tors (called actors) together with actions performed 
by the system (subsystem, class). 


A use case is shown as an ellipse containing the 
name of the use case which characterises activities 
of actors. 


DEFINITION 5. “An [actor] defines a coherent set of
roles that users of an entity can play when interacting
with the entity. An actor may be considered to
play a separate role with regard to each use case with
which it communicates”.


The standard stereotype icon for an actor is a 
“stick man” figure with the name of the actor. 

Besides the association between actors and use cases,
there are three types of relationships among
use cases (actions):


— Association: The participation of an actor in a 
use case. In Figure 1, associations are shown by 
solid lines. 

— Extend: An extend relationship from use case A 
to use case B indicates that an instance of use 
case B may be augmented (subject to specific 
conditions specified in the extension) by the 
behaviour specified by A. 

— Include: An include relationship from use case 
E to use case F indicates that an instance of the 
use case E will also contain the behaviour as 
specified by F. 

— Generalisation: A generalisation from use case C 
to use case D indicates that C is a specialisation 
of D. 


The author considers UCDs as especially use- 
ful for resilience assessment purposes in order to 


depict actors and associated actions on technical 
and organisational level (i.e., modelling socio-tech- 
nical systems). 


4.2 Semi-quantified resilience assessment 
by UCD 


Establishing the resilience assessment audits at 
organisations needs an approach which is resource 
saving and follows established ways of system rep- 
resentation, e.g., by UML. In a first step, it is sug- 
gested to use the interrelationships among system 
elements by UCD and to assess system resilience 
by means of the resilience attributes as given in 
Def. 1. As mentioned above, the UCD differenti- 
ates between actors and actions. In engineering 
terms, actions can be evaluated by assessing their 
level of compliance with standards, best practices, 
etc. It is assumed that a high compliance level has 
a positive effect on the system resilience. Actors 
are the carriers of system resilience where their 
impact on recovery abilities is evaluated. For this, 
the Resilience Priority Value REPV of an actor is 
introduced, which uses the definition of resilience 
as given in Def. 1: 


$REPV = L \cdot I(d, m, l, a)$, (1)


where 


— REPV: Resilience Priority Value of an actor

— L: compliance fulfilment level of a use case
(action)

— I: impact of recovery ability of an actor

— d, m, l, a: actor’s abilities to respond to disturbances,
to monitor, to learn and to anticipate.


All assessments use ordinal scales of range [1, 2,
..., 10], where 1 indicates the best and 10 the worst case.
The concept follows the familiar idea of estimating
risk priority figures, even if resilience is understood
as a positive system property.

So far, the assessment of L is the result of audits
and expert judgement about the proven record of
reached compliance levels of actions or associated
technology, e.g., the operation of IT security management.
In the best case, the rating of L is based on
already available reports of compliance certifications,
e.g., according to ISO/IEC-27002 (2005).

Actors are considered as the intrinsic carriers
of resilience. As mentioned above and following
Def. 1, the impact of recovery ability I depends on
four attributes, which are rated by ordinal scales
of range [1, 2, ..., 10]. I(d, m, l, a) of an actor is
assessed by the mean value of these abilities. The
abilities of learning l and anticipation a are currently
covered by humans. However, trends in
smart manufacturing and artificial intelligence
blur this classification.
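
Eq. 1 and the mean value rule for I can be illustrated by a minimal Python sketch. The helper names `round_half_up`, `impact` and `repv` as well as the four ability ratings are hypothetical; rounding the mean of the abilities to an integer is an assumption made here so that I stays on the ordinal scale.

```python
import math

SCALE = range(1, 11)  # ordinal scale 1 (best) ... 10 (worst)


def round_half_up(x: float) -> int:
    """Round to the nearest integer, with .5 rounded up."""
    return math.floor(x + 0.5)


def impact(d: int, m: int, l: int, a: int) -> int:
    """Impact of recovery ability I(d, m, l, a): mean of the four ability ratings."""
    for rating in (d, m, l, a):
        if rating not in SCALE:
            raise ValueError("ratings must lie on the ordinal scale 1..10")
    return round_half_up((d + m + l + a) / 4)


def repv(compliance_level: int, impact_value: int) -> int:
    """Resilience Priority Value of an actor, REPV = L * I (Eq. 1)."""
    return compliance_level * impact_value


# Invented ratings for one actor: respond, monitor, learn, anticipate.
i_actor = impact(d=8, m=7, l=6, a=7)   # mean of 8, 7, 6, 7
repv_actor = repv(9, i_actor)          # with an assumed compliance level L = 9
```

With these invented ratings the mean of the abilities is 7, so the actor obtains REPV = 9 · 7 = 63.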
The analysis of system resilience needs rules to
make use of UCD. For a very first proof of concept,
the following procedural steps are defined:


1. An estimated compliance fulfilment level $L_i$,
$i = 1, 2, \ldots, k$, is assigned to each use case $U_i$.
2. Relationships are considered for assessing $L_i$ of $U_i$:
— Apply the mean value of all values assigned
to incoming extend associations
— Apply the mean value of all values assigned
to outgoing include associations
— Compute the mean of both values
3. Round off to the next integer.
4. Repeat the process until all use cases (actions)
are assessed.


In summary, every use case (action) $U_i$ is characterised
by a number of extend and include relationships
R (i.e., edges): $U_i(R_{\mathrm{extend}}, R_{\mathrm{include}})$. For further
resilience computation, only subsets of relationships
are needed. For this, every $U_i$ is assessed by
the mean values $\bar{x}$ of compliance levels L of associated
incoming extend relationships and outgoing
include relationships, i.e. $U_i(\bar{x}_{\mathrm{extend}}, \bar{x}_{\mathrm{include}})$.
The mean of both values finally gives the
compliance level $L_i$ of an action that is looked for.
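
The procedural steps can be sketched in Python as follows. This is one possible reading, not a definitive implementation: the function name `compliance_level` is hypothetical, and it is assumed (in line with the worked example of the case study in Chapter 5) that the own estimate of the action enters the mean of the include values and that every intermediate mean is rounded, with .5 rounded up.

```python
import math
from statistics import mean


def round_half_up(x: float) -> int:
    """Round to the nearest integer, with .5 rounded up (8.5 -> 9).

    Python's built-in round() rounds halves to the even neighbour
    (round(8.5) == 8), which would not reproduce the worked example.
    """
    return math.floor(x + 0.5)


def compliance_level(estimate, include_out=(), extend_in=()):
    """Compliance level L_i of one use case (action).

    estimate    -- level estimated by the auditor for this action
    include_out -- levels of the actions reached by outgoing include relationships
    extend_in   -- levels attached to incoming extend relationships
    """
    parts = []
    if include_out:
        # Assumption: the own estimate is part of the include mean.
        parts.append(round_half_up(mean([estimate, *include_out])))
    if extend_in:
        parts.append(round_half_up(mean(extend_in)))
    if not parts:
        return estimate  # action without extend or include relationships
    return round_half_up(mean(parts))


# Figures of the worked example in Chapter 5: estimate 8, three include
# relationships rated 8, 9 and 9, one incoming extend relationship rated 8.
level = compliance_level(8, include_out=[8, 9, 9], extend_in=[8])
```

For the figures of the worked example the include mean is 8.5, rounded to 9, and the mean of 9 and 8 is again 8.5, so the resulting compliance level is 9.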

Next, every actor $A_j$, $j = 1, 2, \ldots, k$, is evaluated
by the following rules:


1. Assign values of $I_j(d_j, m_j, l_j, a_j)$

2. Compute the mean of the assigned values in $I_j$,
which is the impact value of recovery
ability $I_j$ of an actor that is looked for


Then every actor shows an impact value $I_j$
and is assigned a use case value $L_j$ (if there
is more than one association, use the mean value
of the $L_i$s). With that, all values are given to compute
$REPV_j$ as defined in Eq. 1. System resilience is
estimated by the mean value of all actors’ REPV,
again rounded off to the next integer.


4.3 Proposed audit process 


The utilisation process of resilience assessment is 
finalised by auditing a system. The following steps 
roughly structure such an audit: 


— Step 1 — Drafting use cases and actors of socio- 
technical system to be audited 

— Step 2 — Transfer of use cases and actors into the
UCD

— Step 3 — Quantification of UCD 

— Step 4 — Evaluation of results and REPV. 


Steps 1 and 2 follow the basic steps of creating 
any UCDs. In terms of risk and resilience assess- 
ment Step 1 covers the identification process and 
Step 3 the analysis process. The resulting REPVs 
might be evaluated by a matrix or threshold 


approaches as known in risk assessment. However, 
this step is not elaborated in this paper. 

The suggested resilience audit process opens 
developments towards semi-automated processes 
to support auditors. The generation of UCDs is 
a well-known activity in software engineering and 
there are many tools available to do that (cf. Chap- 
ter 5). The computation process of UCDs follows 
ideas of using complexity metrics as common to 
characterise computer codes and associated UMLs 
(cf. (Mock et al. 2015)). Altogether, it is intended to 
develop the following audit supporting steps: The 
auditor has to identify actions and actors for UCD 
generation. Both aspects are plant or system spe- 
cific. However, there are repetitive elements, e.g., 
associated with IT security, fire and explosion pro- 
tection, and occupational safety. These elements 
are typically standardised and subject of compli- 
ance checks. Frequently occurring or, e.g., indus- 
try branch specific actions and actors can thus be 
deposited in a tool library. An auditor then selects 
the appropriate ones by a drop down menu. 

In a next step, the auditor has to identify and 
create the relationships among actions and actors. 
This step is tool supported too. 

Finally, the auditor needs to input the estimated
compliance levels L for every action with only one
relationship and to assign $I_j(d_j, m_j, l_j, a_j)$ for every
actor. The remaining computations will be done
by the tool.

In the end, the auditor needs more knowledge of
system relationships than, e.g., for filling in check lists
or performing an FMEA. On the other hand, the
usual actions and actors as well as the associated $I_j$
and $L_i$ ratings should be known to an experienced
auditor as they are close to common checks and
results of site-specific compliance checks.


5 CASE STUDY: CAR PARK GUIDANCE 


The audited system in this case study is a car park 
guidance system as implemented in a Swiss city. 
The system is designed to manage and optimise 
car traffic flow between a parking lot outside town
(“Castle”) and a car park building in the city centre
(“Town”). All parking spaces are equipped with
sensors, networked and controlled by a Super- 
visory Control and Data Acquisition system 
(SCADA). Parking space allocation is visible for 
drivers by displays in “Town”. 


Step 1 — Drafting actors and use cases 
The actors are defined by 


— Driver (Family): The family is on a getaway.
The Driver (Family) speaks German and
strictly follows the parking guidance displays in
order to avoid looking for a parking space. The
Driver (Family) first drives to the display at the
car park in the city.

— Driver (Tourist): The foreign Driver (Tourist)
does not understand German and feels unsure
about the display symbols. Hence, this driver
ignores the parking guidance displays and makes
ad-hoc decisions where to park.

— Car Park Operator (Town): There are no
specifications about sensors. The Car Park Operator
(Town) is assumed to be responsible for car
park and system operation. The operator might
start parking place managing activities.

— Parking Lots Operator (Castle): There are no
specifications. The Parking Lots Operator (Castle)
is assumed to be responsible for 84 parking
lots and system operation. The operator might
start parking place managing activities.


Actions (use cases) are defined as 


— Display: The only car parking display is located
at the car park “City” in town and shows the
number of free parking spaces at “City” (max.
340) and “Castle” (two parking lots, small and
big: 10 + 74 = 84) near the castle. It is assumed
that the display hardware does not fail at any time
within the observation period of 4.5 years of
operation. The associated system software is
remotely updated and patched via the Internet.

— Gateway: The Kerlink LoRa IoT Station (2 identical
stations) “is an industrial solution suitable for
people who want to mount the gateway outside
and who have sufficient technical skills to connect,
mount and maintain the device themselves. ...
somewhat older software, that is being used, [and]
this device will do the job. A trained software engineer
will be able to update the device using the
[firm] software” (source: thethingsnetwork.org).
The Gateways link the 84 Sensors (Castle) with
the Internet via the Swisscom Mobile network.

— Sensors (Castle): The “Fastpark Flush-Mounted
Sensor” units (84 sensors in total) are part
of the Parking Management System (PMS). “The
wireless system uses smart sensors installed in
parking spaces and guides drivers to areas with
vacancies via electronic panels ...” (source:
www.worldsensing.com). The Sensors (Castle)
are linked with the associated Gateways and the Parking
Management System PMS (Castle). Sensors
might fail but are not maintained during the observation
time. The sensors are battery operated and use
the novel Low Power Wide Area (LPWA) technology
for gateway communication.

— PMS Operation: PMS operation and the associated
data storage are done by a separate EU computing
centre.

— Sensors (Town): There are no specifications
about the sensors of car boxes. It is only assumed
that there are sensors which provide display
data.

— PMS (Castle): SCADA device used to process
and monitor data from Sensors (Castle).
The SCADA serves as Human Machine Interface
(HMI). The operator is considered as an
integral part of PMS (Castle) who then might
start parking place managing activities.


Step 2 — Creating UCD 


Information on actors and actions is used to build
up the UCD of Figure 1. The software tool PlantUML
(www.plantuml.com) creates UCDs from
textual inputs. It is available as a plug-in, e.g., for Eclipse. The
possibility of integrating PlantUML into various
software development frameworks is considered a
pre-condition for further resilience software tool
development.
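
To indicate what such a textual input may look like, the following Python sketch assembles a PlantUML use case description from lists of actors, use cases and relationships. The function `to_plantuml`, the aliases and, in particular, the relationships chosen here are assumptions for illustration only; they do not reproduce Figure 1.

```python
def to_plantuml(actors, use_cases, associations, includes=(), extends=()):
    """Return PlantUML text for a use case diagram.

    actors, use_cases -- dicts mapping an alias to a display name
    associations      -- (actor alias, use case alias) pairs
    includes, extends -- (source alias, target alias) pairs between use cases
    """
    lines = ["@startuml", "left to right direction"]
    for alias, name in actors.items():
        lines.append(f'actor "{name}" as {alias}')
    for alias, name in use_cases.items():
        lines.append(f'usecase "{name}" as {alias}')
    for actor, use_case in associations:
        lines.append(f"{actor} -- {use_case}")            # association (solid line)
    for source, target in includes:
        lines.append(f"{source} ..> {target} : <<include>>")
    for source, target in extends:
        lines.append(f"{source} ..> {target} : <<extend>>")
    lines.append("@enduml")
    return "\n".join(lines)


# Hypothetical excerpt using names from the case study.
diagram = to_plantuml(
    actors={"DF": "Driver (Family)", "OP": "Car Park Operator (Town)"},
    use_cases={"DISP": "Display", "SENS": "Sensors (Town)", "PMS": "PMS Operation"},
    associations=[("DF", "DISP"), ("OP", "PMS")],
    includes=[("PMS", "SENS")],
    extends=[("DISP", "PMS")],
)
print(diagram)
```

The printed text can be pasted into a PlantUML editor or plug-in to render the diagram. Keeping the relationships as plain lists also makes them available for the later quantification step.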


Step 3 — Quantification of UCD 


Table 2 shows the quantification of the UCD as given
in Figure 1.

Computation in Table 2 is exemplified by considering
one action $U_i$: The auditor estimates and
assigns a compliance fulfilment level of 8 to
the action “operates PMS”. This action points to
three other actions by include relationships associated
with (8 + 9 + 9). The mean value including the estimated level
gives 9. There is an input of an extend relationship
with a level of about 8, which then gives the final mean value of
$L_i = \frac{9 + 8}{2} \approx 9$ (rounded off to the next integer).

Every actor is assigned an impact value of
recovery ability using $I_j(d_j, m_j, l_j, a_j)$.


Figure 1. UCD of case study.


Table 2. Estimation of compliance fulfilment level $L_i$ by
use case (action) $U_i$.


Table 3. Estimation of impact value of recovery ability
$I_j$ of actors.

j  Actor $A_j$  $d_j$  $m_j$  $l_j$  $a_j$  $I_j$ (mean $\bar{x}$)

1  Car park operator (town)
2  Parking lot operator (castle)  9 1
3  Driver (family)  97
4  Driver (tourist)  75


Table 4. Resilience priority value REPV of actors.

| j | Actor $A_j$ | $L_j$ | $I_j$ | $REPV_j$ | Comment |
|---|---|---|---|---|---|
| 1 | Car park operator (town) | 9 | 7 | 63 | |
| 2 | Parking lot operator (castle) | 9 | 8 | 72 | |
| 3 | Driver (family) | 10 | 6 | 60 | $L_j = \frac{10 + 9}{2}$ |
| 4 | Driver (tourist) | 10 | 3 | 30 | |

Step 4 — Evaluation of results and REPV 


As a result from Table 4, the actor Driver (Tourist)
shows the lowest resilience properties. The overall
resilience value of the car park guidance system is
the mean of all REPVs, i.e., $REPV_{\mathrm{sys}} = 56$, indicating
a system with medium resilience.
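
The figures of Table 4 can be retraced with a few lines of Python. The sketch only recomputes the published values of L and I per actor; the variable names and the rounding helper (.5 rounded up) are choices made for this illustration.

```python
import math
from statistics import mean

# (L_j, I_j) per actor as listed in Table 4.
actors = {
    "Car park operator (town)": (9, 7),
    "Parking lot operator (castle)": (9, 8),
    "Driver (family)": (10, 6),
    "Driver (tourist)": (10, 3),
}

# REPV_j = L_j * I_j (Eq. 1)
repv = {name: level * impact for name, (level, impact) in actors.items()}

# System value: mean of all actors' REPV, rounded to the nearest integer.
system_repv = math.floor(mean(list(repv.values())) + 0.5)

# Actor with the smallest REPV.
lowest = min(repv, key=repv.get)

print(repv)
print(system_repv, lowest)
```

The mean of 63, 72, 60 and 30 is 56.25, which rounds to the system value of 56 stated above; the smallest individual value belongs to the tourist driver.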


6 CONCLUSIONS 


In view of extended socio-technical system analysis,
developing a closed resilience assessment
approach is a subject of research (cf. (Mock and Zipper
2017)). However, this research only makes sense
if the understanding of resilience finally results in
an approach different from those already established by the
concepts of, e.g., risk, BCM and availability. From
the author’s experience, discussion about resilience
often follows paths synonymous with those already given
by these established concepts (cf. (Leksin et al.
2018)).

On the other hand, resilience assessment methodology
is in its beginnings, still beyond entrepreneurial
interests, and has not yet been established as state of
the art. Thus, the paper is understood as


a step toward utilisation of resilience assessments 
of complex systems. For this, a simple REPV is 
defined and the assessment process uses standard- 
ised system representation by UCD, which prop- 
erly differentiates between actions and actors. This 
property covers well the inclusion of socio-technical 
aspects, where actors are carriers of major properties
of resilience (e.g., learning). They are integral
parts of the audited system, which then becomes
describable as a socio-technical system. By defining
rules to quantify UCDs, the proposed resilience
assessment approach opens paths for software tool 
development in order to support resilience assess- 
ment audits of, e.g., infrastructural systems. The 
case study serves as a proof of concept. 

Discussions at the ESREL conference in 2017 have
given rise to fears that the inclusion and detailed
understanding of the technical functioning of
(infrastructural) systems could be neglected in
resilience assessments. The use of UCD provides a
practical way out of this situation, since UCDs are
based on comprehensive descriptions of actions,
actors and their relationships, supporting a systemic
analysis approach.

The proposed concept of system assessment
supports auditors in checking to what extent infrastructural
systems are resilient. However, the
approach still needs verification of quantification 
rules, which are presumably too simplistic. The 
approach also needs an extended review based on 
a broader application example. Further develop- 
ments consider the inclusion of complexity meas- 
ures in order to increase the meaningfulness of 
UCD quantification. 
ABSTRACT: The signaling system is a typical safety-critical system aiming to enhance the safety and
efficiency of train operation. Any signaling failure or abnormality will force a fallback to the safe side (stop),
causing a drop in operating efficiency. Therefore, a fast and efficient way to replace the metro system after
a signaling failure is of great significance for passengers. Considering the road traffic conditions, a method
for generating an emergency metro-bus bridging plan is presented to improve the metro system resilience by
assessing the satisfied passenger travel demands. The plan is implemented by generating the bus bridging
routes based on the constructed metro-bus network and optimally allocating the limited bus resources to the
generated routes. The approach is applied to Beijing Metro Line 5 under a high signaling failure
probability, and the simulation results show that the system resilience is significantly enhanced, by about
21%-43%, with the metro-bus bridging service.


1 INTRODUCTION 


Nowadays, the operation of metro systems relies
more and more on the signaling system as the
degree of automation increases. Any disruption
in the signaling system may delay trains
and passengers and may even result in
chaos in public traffic. On August 18, 2016,
Beijing Metro Line 1 was suspended for more
than 2 hours during the evening peak due to a
3 min interruption of the signal transmission network.
Coupled with bad weather, thousands of people
were trapped in the downtown area. Therefore,
metro operators are more concerned about the
impact of signaling failures on train headways and
passenger transport than about the frequency of failures.
However, RAM (Reliability, Availability,
Maintainability), the commonly used performance
indexes, are always assessed by average data
over a period of time and cannot clearly show the
reduction and recovery of train passing efficiency.
Therefore, the concept of resilience is introduced
to measure the impact of failures on the metro system.
The first systematic definition of resilience was
presented by Holling (1973) for ecological systems
more than 40 years ago. He states that “resilience
determines the persistence of relationships within a system
and is a measure of the ability of these systems
to absorb changes of state variables, driving variables,
and parameters, and still persist.” Since that
time, researchers from numerous fields have
gradually realized the advantages of resilience in
system design and management. In the social domain,
the US Department of Homeland Security (2006) tends
to view social resilience in terms of the ability of a system
or asset to maintain its function or recover from
an attack or incident. For economic systems, Rose
& Liao (2005) characterize dynamic resilience as the
speed at which an entity or system recovers from a
severe shock to achieve a desired state. For railway
systems, Adjetey-Bahun et al. (2016) state that the
concept of resilience has been introduced to measure
both the ability to absorb perturbations and
the ability to rapidly recover from them. All of these
definitions emphasize the ability of a system
to absorb a perturbation and adapt
to the changes it causes, as well as to recover from it quickly. The
most quoted concept of resilience is from Bruneau
et al. (2003), who state that the seismic resilience of
a system can be achieved by reducing failure probabilities,
reducing consequences from failures and
reducing the time to recovery. Furthermore, it can
be characterised by four properties: robustness, redundancy,
resourcefulness, and rapidity, and a broad
measurement of resilience capturing these features
is proposed as a resilience triangle model.
Considering the characteristics of the metro system,
its resilience can be defined as the ability of the
system to reduce the probability of failure, adapt
to the impact of a failure and recover from the
failure. On this basis, Vugrin et al. (2010) define
three capacities with which to quantify and design
for better system resilience:


— Absorptive capacity: 
It acts before or at the time the perturbation
occurs, when the system can automatically
withstand the perturbation and minimize the
consequences of failure. It is an endogenous feature
of the system. In a metro system, it is reflected
in reducing the probability of system failure
through reliability design. For example, the
onboard wireless unit is made redundant so that
the system immediately switches to the rear unit
when the front unit fails, and the train can
still receive the Movement Authority with
no impact on train operation.

— Adaptive capacity: 
It acts during the disruption and recovery period,
when the impact of the disruption is dealt with
through ingenuity or extra effort. It is a set of
self-organizing actions in response to the
perturbation. One example is that, during signaling
perturbations, the operation order of the metro system
relies upon the commands of the dispatcher in
the degraded mode.

— Restorative capacity: 
The ability of the system to recover quickly to
its original or expected state through maintenance
management, and the ease with which the system
can be repaired. For example, the maintenance plans
for common failures are trained and tested
to improve the skills of maintenance staff so
that the system can be recovered quickly.


A quick and efficient substitute for metro
service is necessary to accommodate metro
passengers during signaling perturbations. Thus, an
ad hoc bus bridging service is set up after a signaling
perturbation to enhance the adaptive capacity of the
metro system by transferring passengers to the
nearest or destination metro stations
in the shortest time.

The remainder of the paper is organized as follows.
Section 2 reviews the quantitative measurement
methods of resilience as well as the strategies
to enhance system resilience. Section 3 introduces
the proposed metro-bus bridging emergency
plan to enhance the system resilience. The plan is
applied to Beijing Metro Line 5 in Section 4 and
discussed in Section 5. Section 6 ends the paper
with two conclusions.


2 LITERATURE REVIEW 


2.1 Quantitative measurement of resilience


The quantitative measurement of resilience proposed
by Bureau et al. (2003) (shown in Figure 1)
has been extended and improved in many ways. Reed
et al. (2009) propose that the resilience of a system
can be assessed by the ratio of the area under the
resilience curve to the time interval. In addition,
Vugrin et al. (2013) propose that system resilience can
be assessed by resilience costs comprising the
systemic impact and the total recovery effort, where
the systemic impact is evaluated by the size of the
resilience triangle and the recovery effort is
represented by the area under the recovery effort curve.

Figure 1. Resilience triangle model proposed by Bureau
et al. (axes: system performance Q(t) against time t,
with t0 and t1 marked).

Besides the size of the resilience triangle, the
key points of the resilience curve can also be
used to measure system resilience. Dorbritz
(2011) proposes that the disaster resilience of a
transportation system can be quantified either by the
resilience triangle area or by three values: (1) the
initial impact of the disaster; (2) the minimum value
of system performance; (3) the time the system takes
to recover. Nan et al. (2016) present an integrated
metric for quantifying system resilience. The metric
combines three resilience capabilities over four phases
of the resilience curve; it is dimensionless and
useful for comparing the resilience of different systems.
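
The three values listed by Dorbritz (2011) can be read directly off a sampled performance curve. The following is a minimal sketch in Python, assuming a normalized curve whose normal level is 1.0; the function name `curve_key_points` and the sample data are hypothetical and are not taken from the cited work.

```python
from typing import Sequence, Tuple


def curve_key_points(times: Sequence[float], q: Sequence[float],
                     q_normal: float = 1.0) -> Tuple[float, float, float]:
    """Return (initial impact, minimum performance, recovery time).

    initial impact : performance drop at the first degraded sample
    minimum        : lowest performance value on the curve
    recovery time  : time from the first degraded sample until the
                     performance is back at q_normal
    """
    degraded = [i for i, value in enumerate(q) if value < q_normal]
    if not degraded:
        # The curve never leaves the normal level.
        return 0.0, min(q), 0.0
    first = degraded[0]
    initial_impact = q_normal - q[first]
    # First sample after the disruption at which performance is normal again.
    recovered = next(
        (i for i in range(first, len(q)) if q[i] >= q_normal), None
    )
    if recovered is None:
        raise ValueError("performance does not recover within the curve")
    return initial_impact, min(q), times[recovered] - times[first]


# Hypothetical curve sampled every 10 minutes.
impact, minimum, recovery = curve_key_points(
    [0, 10, 20, 30, 40, 50], [1.0, 0.6, 0.4, 0.7, 1.0, 1.0]
)
```

For the hypothetical curve above, the first degraded sample is at minute 10 and the curve is back at 1.0 at minute 40, so the recovery time is 30 minutes.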

In addition to the traditional numerical methods
described above, stochastic methods can also be used
to assess system resilience. Based on
the resilience triangle, Chang & Shinozuka (2004)
developed a probabilistic method for assessing
system resilience, in which resilience is defined as
the probability that the loss of system performance
after a disruption stays below the acceptable level.
The proposed framework is applicable to
both infrastructure systems and communities, but
the acceptable performance level of a system is
difficult to standardize. Ouyang et al. (2012) introduce a
time-dependent expected resilience metric, defined as
the expected ratio of the areas under the real and
the target performance curves. The metric is
stochastic because perturbations are modeled as a
Poisson process, and it
can incorporate one or multiple related hazards.

The general resilience evaluation methods above
concentrate on the resilience triangle model.
Mathematical optimization models have
also proved effective for assessing system
resilience based on the network topology
of the system, with one or more objectives,
because the scale-free graph nature of such systems
makes their resilience easy to quantify,
especially for transportation systems. Ip & Wang
(2011) represent a transportation network by
an undirected graph with cities as nodes and
traffic roads as edges; the resilience of the network
is assessed by the feasible links between each pair
of nodes after a disruption. Following that, a
multi-objective optimization model is proposed to
identify the edges and nodes that most efficiently
improve network resilience. Beyond that, fuzzy logic
is also used to quantify system resilience when it
involves several variables that are all important to
the system.
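
A graph-based assessment of this kind can be illustrated with a simple connectivity proxy. The sketch below is not the metric of Ip & Wang (2011); it assumes a much simpler measure, the share of node pairs that are still joined by some path after a set of edges is removed. The function name `connected_pair_ratio` and the toy network are hypothetical.

```python
from collections import deque
from itertools import combinations


def connected_pair_ratio(nodes, edges, removed=()):
    """Share of node pairs still joined by a path after removing edges."""
    removed_set = {frozenset(edge) for edge in removed}
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        if frozenset((u, v)) not in removed_set:
            adjacency[u].add(v)
            adjacency[v].add(u)

    def reachable(start):
        # Breadth-first search over the surviving edges.
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        return seen

    pairs = list(combinations(nodes, 2))
    connected = sum(1 for u, v in pairs if v in reachable(u))
    return connected / len(pairs)


# Toy undirected network: four cities, four roads.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")]
intact = connected_pair_ratio(nodes, edges)
after_loss = connected_pair_ratio(nodes, edges, removed=[("C", "D")])
```

In the toy network, losing the road C-D isolates city D, so only three of the six city pairs remain connected, whereas losing A-B changes nothing because A and B stay linked through C.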


2.2 Strategies to enhance resilience in 
transportation system 


The definition and value of system resilience
become useful and meaningful when used to devise
effective resilience strategies for the system of
interest. For transportation systems, two ways
to enhance system resilience can be summarized. On
the one hand, resilience can be enhanced through the
three capacities by which it is defined. As
resilience is the embodiment of several capacities,
enhancing a specific capacity contributes to
the resilience of the system. Nan et al. (2016)
test such strategies in an electric power supply
system, where the resilience-enhancing strategies can
be interpreted as: (1) increasing the capacity of the
battery to improve the absorptive capability during
the disruption phase; (2) improving the ability of
human operators to enhance the adaptive capability
during the recovery phase; (3) improving
the efficiency of line operation to enhance the
restorative capability during the recovery phase.

In this respect, organizational and management
plans can be implemented to enable the system
to recover quickly and to minimize the impact of
disruptions. Adjetey-Bahun et al. (2016) propose crisis
management plans addressing the system capacities
and assess the extent to which they increase
the resilience of a mass railway transportation
system during perturbations. For example, setting up
temporary train services on part of the impacted line
during perturbations can enhance the resilience of
the system effectively, and decreasing the repair time
of faulty equipment contributes to the restorative
capacity.

The same approach can be found in Chan et al.
(2016), who assess transportation system resilience
under weather disruptions. Three common
strategies are defined for making
transportation systems more resilient: (1) hardening,
e.g. building levees and floodwalls to hold back
floodwaters; (2) redundancy, e.g. a power distribution
system with redundancy that provides a backup path
between points; (3) elasticity, e.g. holding aircraft
on the ground to protect the fleet and passengers
and taking off again quickly after the storm. These
three aspects of resilience can work in combination to
enhance transit system resilience and support
system management decisions.

On the other hand, pre- and post-disruption
actions also have an impact on system resilience.
For the period after a metro system disruption,
Jin et al. (2014) introduce the integration of local
bus services with the metro system to
enhance metro network resilience, where a two-stage
stochastic model is presented to evaluate the
resilience of the Singapore public transit network and
to optimize bus service routes that run in parallel
with the metro lines. More comprehensively, Faturechi
& Miller-Hooks (2014) treat pre-event mitigation,
pre-event preparedness and post-event
response in the disaster management life cycle as
three decision stages for system resilience, and
propose a three-stage program to quantify
and optimize roadway network resilience.


3 METHODOLOGY 


Because metropolises have large populations and
complex road conditions, road congestion
and the metro station exits are considered in the
metro-bus bridging methodology so that
passengers can reach their destinations in a feasible
and quick way. In order to enhance metro
system resilience after signaling perturbations, the
proposed bus bridging plan is divided into three
steps: (1) constructing the metro-bus network from
the actual topology of the roadway, with the failed
metro stations as endpoints; (2) generating
the bus bridging routes between each O-D pair so as
to avoid congested sections with minimum travel
time; (3) allocating the limited bus resources to the
bridging routes to increase the passenger
demand that can be satisfied.


3.1 Metro-bus network 


The metro-bus bridging network is modelled by a
directed graph $G = (V, A)$, illustrated in Figure 2. In
order to prevent passengers from being stranded at
metro stations during a signaling perturbation, not
only the disrupted metro stations but also the metro
stations directly connected to them are bridged. Set
$V$ is the set of nodes in the network, including metro
station nodes and roadway nodes. The metro station
nodes are the bus bridging stations and the roadway
nodes are the ends of roadways, modelled according
to the actual road network. Set $A$ is the set of
arcs in graph $G$; each arc connects a roadway
node with a metro node or another roadway node, and
is labelled $a_{ij}$ ($i, j \in V$). Based on the
metro-bus network, a bus bridging route can be
modelled as a chain of several
arcs between an origin node and a destination node.

Figure 2. Metro-bus bridging topology network.

Table 1. Classification of roadway congestion conditions
in Beijing.

Congestion condition     $h_{ij}$
Smooth                   0
Basically smooth         0.2-0.5
Mild congestion          0.5-0.8
Moderate congestion      0.8-1.1
Serious congestion       >1.1


3.2 Generation of bus bridging route 


Among all the alternative routes, the shortest
path may not meet the constraints of
actual demand, such as route length constraints and
road capacity limits. Thus, in the plan, Yen’s (1971)
k-shortest path algorithm is used to explore the
feasible bus bridging routes under given constraints.

Unlike other research, the plan considers the
real-time traffic condition $h_{ij}$ ($i, j \in V$),
known as the Traffic Performance Index (TPI) in
Beijing. It means that, under congested conditions,
the average travel time on $a_{ij}$ is longer than
usual by $h_{ij}$ times the usual travel time;
details are shown in Table 1.

According to the real-time traffic condition $h_{ij}$,
the travel time $t_{ij}$ ($i, j \in V$) on each arc
$a_{ij}$ is calculated as:

$$t_{ij} = \frac{l_{ij}}{v} \times (1 + h_{ij}) \qquad (1)$$

where $v$ is the average travel speed of the bus and
$l_{ij}$ ($i, j \in V$) is the length of $a_{ij}$.
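
Equation (1) can be checked with a short calculation. The sketch below is a minimal illustration, assuming lengths in kilometres, speeds in km/h and a result in minutes; the function name `arc_travel_time` and the 3 km arc are hypothetical, while the speed of 18 km/h is the value used later in the case study.

```python
def arc_travel_time(length_km: float, speed_kmh: float, h: float) -> float:
    """Travel time on one arc in minutes: (l / v) * (1 + h), as in Equation (1)."""
    if speed_kmh <= 0:
        raise ValueError("speed must be positive")
    if h < 0:
        raise ValueError("the congestion index h must not be negative")
    hours = length_km / speed_kmh * (1.0 + h)
    return hours * 60.0


# A hypothetical 3 km arc at v = 18 km/h.
free_flow = arc_travel_time(3.0, 18.0, 0.0)   # smooth traffic, h = 0
congested = arc_travel_time(3.0, 18.0, 0.5)   # mild congestion, h = 0.5
```

With smooth traffic the arc takes about 10 minutes; with $h_{ij} = 0.5$ the same arc takes about 15 minutes, that is, half as long again.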

The route constraints in the plan include two
components: the minimum route travel time $T_{min}$
and the maximum route travel time $T_{max}$. The final
feasible generated route $b \in B$ between each O-D
pair can be formulated as $f(w, k)$:

$$b: f(w, k) = \min_{w \in W} \{ t_w \times x_w^k \} \qquad (2)$$

subject to:

$$T_{min} \le t_w \le T_{max} \quad (w \in W) \qquad (3)$$

$$x_w^k \in \{0, 1\} \quad (w \in W) \qquad (4)$$

Objective function (2) minimizes the travel time of
each O-D route, where $W$ is the set of all
available routes between each O-D pair and $t_w$ is
the travel time of route $w \in W$. Constraint (3)
bounds the travel time of each route. In constraint
(4), $x_w^k$ is a binary decision variable: $x_w^k = 1$
means that route $w$ belongs to the k-shortest paths.
Table 2 describes the process of generating the bus
bridging routes $b \in B$.
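
The route generation procedure can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: for brevity it enumerates all loop-free paths by depth-first search and sorts them, which stands in for Yen's k-shortest path algorithm and is only practical on small graphs. The names `simple_paths` and `bridging_route`, the graph encoding and the toy network are assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

Graph = Dict[str, Dict[str, float]]  # node -> {successor: travel time in minutes}
Route = Tuple[float, List[str]]      # (total travel time, node sequence)


def simple_paths(graph: Graph, origin: str, destination: str) -> List[Route]:
    """Enumerate all loop-free paths with their total travel time."""
    found: List[Route] = []

    def visit(node: str, path: List[str], elapsed: float) -> None:
        if node == destination:
            found.append((elapsed, path))
            return
        for successor, minutes in graph.get(node, {}).items():
            if successor not in path:
                visit(successor, path + [successor], elapsed + minutes)

    visit(origin, [origin], 0.0)
    return found


def bridging_route(graph: Graph, origin: str, destination: str,
                   k: int, t_min: float, t_max: float) -> Optional[Route]:
    # Step 1: the k shortest loop-free paths (brute force instead of Yen).
    candidates = sorted(simple_paths(graph, origin, destination))[:k]
    # Step 2: keep only candidates inside the travel time window.
    feasible = [route for route in candidates if t_min <= route[0] <= t_max]
    # Step 3: the fastest feasible route, or None if no candidate qualifies.
    return feasible[0] if feasible else None


# Toy network: two metro stations (S1, S2) and two roadway nodes (a, b).
graph = {
    "S1": {"a": 4.0, "b": 2.0},
    "a": {"S2": 5.0},
    "b": {"a": 1.0, "S2": 9.0},
}
route = bridging_route(graph, "S1", "S2", k=2, t_min=1.0, t_max=20.0)
```

In the toy network the three loop-free paths take 8, 9 and 11 minutes. With k = 2 and a window of 1 to 20 minutes the selected route is S1-b-a-S2 (8 minutes); tightening the window to at most 7 minutes leaves no feasible route.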


3.3 Bus resource allocation 


According to Reed et al. (2009), the system
resilience can be quantified as (shown in Figure 3):

$$R = \frac{\int_{t_0}^{t_1} Q(t)\,dt}{t_1 - t_0} \qquad (5)$$

where $R$ is the system resilience, calculated
as the integral of the normalized resilience metric
$Q(t)$ over a time interval divided by the length of
that interval, $t_0$ is the time at which the
perturbation occurs and $t_1$ is the end of the fixed
time window. The time interval $(t_1 - t_0)$ can be
defined flexibly: in the metro system, it can vary
with the type of signaling failure or be defined as a
fixed value. When the resilience metric $Q(t)$ takes
discrete values, the system resilience $R$ can be
defined as:

$$R = \frac{\sum_{i=1}^{N} Q(i \cdot \Delta T) \cdot \Delta T}{N \cdot \Delta T} \qquad (6)$$

where $N$ is a number such that:

$$(t_1 - t_0) = N \cdot \Delta T \qquad (7)$$

Figure 3. Resilience quantitative model proposed by
Reed et al. (2009).

Table 2. Steps of generating the bus bridging routes.

Input: $G = (V, A)$, $T_{min}$, $T_{max}$, $k$
Output: $b \in B$

Step 1  All the alternative routes are generated using
        the k-shortest path algorithm between each
        O-D pair.
Step 2  Check all the generated routes against the
        minimum and maximum route travel time
        constraints. If a route satisfies them, it is
        accepted as a candidate route.
        Otherwise, it is removed.
Step 3  For all candidate routes, find the shortest
        one for each O-D pair and keep it in the set B
        as the final generated bus bridging route.
Step 4  Output the set B.


Following Chen & Miller-Hooks (2012), the metro
system resilience metric $Q(t)$ is expressed as the
share of passenger travel
demand between each Origin-Destination (O-D)
pair that can be satisfied during the signaling
perturbation, referred to below as the Travel Demand
Satisfaction Rate (TDSR).
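
With a sampled metric, the discrete resilience of Equation (6) reduces to the mean of the sampled values, because the interval length cancels. A minimal sketch, assuming equally spaced samples; the function name `discrete_resilience` and the sample values are hypothetical.

```python
from typing import Sequence


def discrete_resilience(q_values: Sequence[float], delta_t: float = 1.0) -> float:
    """R = sum(Q(i * dT) * dT) / (N * dT) for N equally spaced samples."""
    n = len(q_values)
    if n == 0:
        raise ValueError("at least one sample of Q(t) is required")
    if delta_t <= 0:
        raise ValueError("the sampling interval must be positive")
    return sum(q * delta_t for q in q_values) / (n * delta_t)


# Four hypothetical TDSR samples taken over the perturbation window.
r = discrete_resilience([1.0, 0.5, 0.5, 1.0], delta_t=10.0)
```

For the four hypothetical samples the result is 0.75, whatever sampling interval is chosen.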

Since the Genetic Algorithm (GA) provides a
robust search as well as a near-optimal solution
in a reasonable time, it is employed to
maximize system resilience
by allocating the limited bus resources to the
generated bus bridging routes $b \in B$. The resilience
metric (TDSR) of the metro system is formulated as:

$$f(TDSR) = \max \frac{\sum_{b \in B} n_b \times C}{\sum_{b \in B} d_b} \qquad (8)$$

subject to:

$$0 \le n_b \le N \quad (b \in B,\ n_b \text{ is an integer}) \qquad (9)$$

$$\sum_{b \in B} n_b = N \qquad (10)$$

Objective function (8) maximizes the passenger
Travel Demand Satisfaction Rate (TDSR),
where $d_b$ is the passenger demand
for each O-D route $b \in B$, obtained at fixed
intervals $\Delta t$. The integer variable $n_b$ is the
number of buses allocated to each O-D route
$b \in B$, $C$ is the capacity of a bus and $N$ is the
total number of buses available in the bus depots.
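
The TDSR of a given allocation can be evaluated as in the sketch below. One assumption goes beyond the text: the passengers served on a route are capped at that route's demand, i.e. $\min(n_b \times C, d_b)$, so that buses sent to a route with little demand do not count as satisfied demand. Without such a cap, the numerator would equal $N \times C$ for every allocation that uses all $N$ buses. The function name `tdsr` and the demand figures are hypothetical; C = 80 is the value from the case study.

```python
from typing import Dict


def tdsr(buses: Dict[str, int], capacity: int, demands: Dict[str, float]) -> float:
    """Share of total O-D demand served by the allocated buses.

    Assumption of this sketch: the passengers served on a route cannot
    exceed the demand on that route.
    """
    total_demand = sum(demands.values())
    if total_demand <= 0:
        raise ValueError("total demand must be positive")
    served = sum(
        min(buses.get(route, 0) * capacity, demand)
        for route, demand in demands.items()
    )
    return served / total_demand


# Hypothetical demand per bridging route in one time interval.
demands = {"r1": 400, "r2": 160, "r3": 240}
matched = tdsr({"r1": 5, "r2": 2, "r3": 3}, 80, demands)   # 10 buses, all demand met
short = tdsr({"r1": 2, "r2": 2, "r3": 1}, 80, demands)     # 5 buses
lopsided = tdsr({"r1": 10}, 80, demands)                   # 10 buses on one route
```

Ten buses matched to the demand serve everyone, while the same ten buses sent to a single route serve only half of the passengers, which is why the allocation matters.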

Hence a GA chromosome consists of integer
gene values and represents one solution of the bus
allocation, as follows:

$$[n_1, n_2, \ldots, n_m] \qquad (11)$$

where $m$ is the total number of generated routes
in set $B$. The objective of the GA model presented
here is to guide the allocation of bus resources
and select the solution that
maximizes the satisfied passenger travel demand.
The implementation of the GA is shown in
Table 3.

Table 3. The process of bus resources allocation.

Input: $G = (V, A)$, $d_b$, $B$, $N$, $C$, $m$,
       Genetic Algorithm setting parameters.
Output: $f(TDSR)$, $[n_1, n_2, \ldots, n_m]$

Step 1  Input all the GA related parameters.
Step 2  Generate the GA population for
        the current O-D pair set of size m and
        initialize each chromosome randomly.
Step 3  Set generation = 1.
Step 4  According to the passenger demand $d_b$ of
        each O-D pair, evaluate each chromosome by
        the objective function.
Step 5  Keep the current solution.
Step 6  Generate the next generation:
        • Rank.
        • Selection.
        • Crossover.
        • Mutation.
Step 7  If generation < Max_Generation,
        increase generation by 1 and go to Step 4;
        else update the current best solution if
        improved.
Step 8  Output the best bus-allocation solution from
        the best solution found, and its performance
        $f(TDSR)$.
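
The steps of Table 3 can be sketched as a small genetic algorithm. The code below is an illustrative outline under stated assumptions, not the configuration used in the paper: the fitness caps the passengers served on a route at that route's demand, selection keeps the better half of the ranked population, one-point crossover is followed by a repair step that restores the total number of buses, the best chromosome is carried over unchanged, and the population size, generation count and seed are arbitrary. All function names and demand values are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def tdsr(genes: Sequence[int], capacity: int, demands: Sequence[float]) -> float:
    """Satisfied share of demand; each route is capped at its own demand."""
    served = sum(min(n * capacity, d) for n, d in zip(genes, demands))
    return served / sum(demands)


def random_allocation(rng: random.Random, m: int, total: int) -> List[int]:
    """Chromosome [n_1, ..., n_m]: place `total` buses on random routes."""
    genes = [0] * m
    for _ in range(total):
        genes[rng.randrange(m)] += 1
    return genes


def repair(rng: random.Random, genes: Sequence[int], total: int) -> List[int]:
    """Add or remove single buses until the genes sum to `total` again."""
    fixed = list(genes)
    while sum(fixed) > total:
        fixed[rng.choice([i for i, n in enumerate(fixed) if n > 0])] -= 1
    while sum(fixed) < total:
        fixed[rng.randrange(len(fixed))] += 1
    return fixed


def crossover(rng: random.Random, a: List[int], b: List[int], total: int) -> List[int]:
    cut = rng.randrange(1, len(a))            # one-point crossover
    return repair(rng, a[:cut] + b[cut:], total)


def mutate(rng: random.Random, genes: Sequence[int]) -> List[int]:
    """Move one bus from a route that has one to a randomly chosen route."""
    child = list(genes)
    source = rng.choice([i for i, n in enumerate(child) if n > 0])
    child[source] -= 1
    child[rng.randrange(len(child))] += 1
    return child


def allocate_buses(demands: Sequence[float], total: int, capacity: int,
                   pop_size: int = 30, generations: int = 100,
                   seed: int = 0) -> Tuple[List[int], float]:
    if len(demands) < 2 or total < 1 or pop_size < 2:
        raise ValueError("need at least two routes, one bus and two chromosomes")
    rng = random.Random(seed)

    def fitness(genes: Sequence[int]) -> float:
        return tdsr(genes, capacity, demands)

    population = [random_allocation(rng, len(demands), total) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)   # rank
        parents = ranked[: max(2, pop_size // 2)]                # selection
        children = [ranked[0]]                                   # keep the best so far
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(rng, crossover(rng, a, b, total)))
        population = children
    best = max(population, key=fitness)
    return best, fitness(best)


# Hypothetical demand on four bridging routes, 10 buses of 80 seats each.
allocation, score = allocate_buses([400, 160, 240, 90], total=10, capacity=80)
```

Because every chromosome is repaired to use exactly the available number of buses, constraint (10) holds throughout the search, and carrying over the best chromosome guarantees that the best TDSR never decreases from one generation to the next.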


4 CASE STUDY 


Based on the actual signaling system fault
records, typical trackside signaling perturbations
are summarized and the impact of different
conditions on system resilience is discussed.


4.1 Background of case study 


The model is applied to generate a metro-bus bridging
plan for Beijing Metro Line 5, whose average
number of passengers on weekends can reach one
million. The signaling failure data are provided by the
Communication and Signaling Branch Company
affiliated with Beijing Mass Transit Railway
Operation Corp. Ltd. According to the statistics of
the failure records of the past four years, the main
trackside signaling equipment with a high failure
probability is: (1) the WESTRACE
TCOM system, used to produce the carrier frequency
of the track circuit; (2) the track circuit transmitter
or receiver; (3) the trackside APR, used to send
position information to the trains. Once such
equipment fails, single or multiple track sections are
affected and the impact lasts one or two hours.




The analysis of historical fault records shows that
the failure of the TCOM system has the greatest
influence among the trackside equipment, as a TCOM
system failure affects multiple sections for a few
hours. Therefore, four typical areas with a high
failure rate of the TCOM system are identified (shown
in Figure 4), and the corresponding bus bridging areas
are listed in Table 4.

Due to the large passenger flow on Line 5, a signaling
failure causes great disturbance to passenger travel.
Therefore, we
assume that the operators call for bus bridging if
the perturbation lasts ten minutes.

Figure 4. Typical areas with high failure rate of TCOM
system in Beijing Metro Line 5.

Table 4. Four typical bus bridging scenarios.

Scenario No.   Bus bridging areas
Scenario 1     Huixinxijie Beikou-Huixinxijie Nankou-
               Hepingxiqiao-Anzhenmen
Scenario 2     Tiantongyuan-Tiantongyuan
               South-Lishuiqiao-Lishuiqiao
               South-Huoying
Scenario 3     Datunlu East-Huixinxijie Beikou-
               Huixinxijie Nankou-Hepingxiqiao-
               Anlilu-Anzhenmen
Scenario 4     Yonghegong Lama Temple-Beixinqiao-
               Zhangzizhonglu-Dongsi-Dengshikou-
               Nanluoguxiang

The parameters related to the proposed bus-bridging
plan are set as follows:


— The perturbation occurs at 7:00 am and lasts 
2 hours. 

— The average bus speed of bus: v = 18 km/h. 

— Feasible bus resources: N = 15 during each time 
interval, C = 80 person/bus. 

— Time constraints: 1 min—20 min. 


4.2 Modeling and bus-bridging strategy 


The path network models between metro stations
are built from Google Maps. The shortest
bridging routes are then selected with the improved
k-shortest path algorithm. For example, in
Scenario 1, k is set to 4. Under the constraints of
travel time and passenger travel demand,
9 bus bridging routes are generated (shown in
Figure 5). The bus bridging plan avoids the
congested sections {v7-v12-v15, v27-v28} from
Huixinxijie Beikou to Huixinxijie Nankou. However,
from Huixinxijie Nankou to Hepingxiqiao, the
congested section {v15-v18-v27} remains in the
plan, as the alternative costs too much travel time.
The travel demand for each O-D route is
collected from the AFC (Automatic Fare Collection)
system between 7 am and 9 am on a workday. The
available bus resource in each time interval is 15. The
TDSR of Scenario 1 with the bus bridging plan is shown
in Figure 6. The resilience of the metro system is
0.80, an improvement of 29% over the case without
bus bridging.


Figure 5. The bus bridging routes in Scenario 1.

Figure 6. TDSR with bus bridging plan in Scenario 1.


5 DISCUSSION


5.1 The influence of bus resources on bus bridging 
plan 


The system resilience metric (TDSR) is calculated
from the O-D demand that the bus bridging
plan can satisfy in each perturbation scenario, as
shown in Figure 7. At the beginning of the
implementation of the plan, a large number of
passengers remain at the stations, so the TDSR
decreases following the start of the failure.
About half an hour later, the passengers are gradually
evacuated by the bus bridging plan, and the
resilience metric of the system begins to recover.
However, the time of the lowest TDSR varies with
the delayed passenger travel demand at each metro
station. In Scenario 2, the system is less resilient
at the outset because passenger
demand is greatest at the time the perturbation
occurs. In Scenario 3, the TDSR of the system changes
little because passenger demand is stable during
the perturbation.

According to Formula (6), the system resilience
in the four scenarios with and without the bus
bridging plan is shown in Table 5. With the proposed
bus bridging plan, the system
resilience is increased by 21% to 43%.

Figure 7. TDSR in different scenarios with bus bridging
plan.

Table 5. The comparison of the metro resilience before
and after the implementation of the bus bridge scheme.

Scenario     System       System resilience    Improvement
No.          resilience   with bus bridging    ((R2 - R1)/R1)
             (R1)         (R2)
Scenario 1   0.62         0.80                 0.29
Scenario 2   0.58         0.70                 0.21
Scenario 3   0.70         0.88                 0.26
Scenario 4   0.49         0.70                 0.43
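
The improvement column of Table 5 follows directly from the two resilience values. The short check below recomputes it from the R1 and R2 values in the table; the variable and function names are chosen for this example only.

```python
# (R1, R2) per scenario, copied from Table 5.
table5 = {
    1: (0.62, 0.80),
    2: (0.58, 0.70),
    3: (0.70, 0.88),
    4: (0.49, 0.70),
}


def improvement(r1: float, r2: float) -> float:
    """Relative gain in resilience: (R2 - R1) / R1."""
    return (r2 - r1) / r1


improvements = {
    scenario: round(improvement(r1, r2), 2)
    for scenario, (r1, r2) in table5.items()
}
```

Rounded to two decimals, the recomputed gains are 0.29, 0.21, 0.26 and 0.43, matching the last column of the table and the reported range of 21% to 43%.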


5.2 The influence of bus resources 


In order to minimize the impact on the normal bus
transit system, the number of buses that can be used
for the metro-bus bridging plan is limited. As the
number of buses directly affects the improvement
of system resilience, it is necessary to weigh
the needs of the bus transit system against the
enhancement of metro system
resilience. Figure 8 shows the TDSR of
the system with different bus resources in Scenario 2.
When 25 buses are available in each time interval,
the system resilience reaches 0.91, 30% higher
than with N = 15. From 7 am to 8 am, about 25
buses are needed in each time interval for the
TDSR of the system to reach 0.8; from 8 am to
8:30 am, 20 buses are required; and from 8:30 am to
9 am, 10 or 15 buses are enough to bring the TDSR
to 0.8. This provides a reference for the metro
operator and the bus transit operator when designing
the bus bridging emergency plan: the number of
buses can be adjusted in different time periods to
meet passenger demand without affecting the bus
transit system.
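
The number of buses needed to reach a target TDSR in one time interval can be estimated with a back-of-the-envelope calculation. The sketch below is a rough lower bound under two assumptions that are not from the paper: every bus runs full, and no bus is sent to a route whose demand is already covered. The interval demand of 2000 passengers is hypothetical, since the paper does not report the demand per interval; C = 80 is the bus capacity from the case study.

```python
import math


def buses_needed(interval_demand: float, capacity: int, target_tdsr: float) -> int:
    """Smallest number of full buses that serves target_tdsr of the demand."""
    if capacity <= 0:
        raise ValueError("bus capacity must be positive")
    if not 0.0 <= target_tdsr <= 1.0:
        raise ValueError("the target TDSR must lie between 0 and 1")
    # The small tolerance guards against floating point noise that would
    # otherwise push an exact result such as 20.0 up to 21.
    exact = target_tdsr * interval_demand / capacity
    return max(0, math.ceil(exact - 1e-9))


# Hypothetical interval demand of 2000 passengers, C = 80, target TDSR 0.8.
n = buses_needed(2000, 80, 0.8)
```

Under these assumptions, 20 buses are enough to serve 80% of 2000 passengers. An estimate of this kind only bounds the requirement from below; the allocation across routes still decides whether the target is actually reached.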


5.3 The influence of tidal flow


The northern end of Beijing Metro Line 5 reaches
Tiantongyuan, an important residential area for
office workers, so the passenger flow on working days
during the morning and evening peaks has an obvious
tidal trend (shown in Figure 9). Therefore, in the bus
bridging plan, the bus routes on the up and down lines
are reduced to one-direction routes according to the
direction of passenger flow in order to serve the
dominant passenger demand. In Scenario 2, before
8:30 am, the TDSR is about one third higher than that
of two-direction bus bridging and the TDSR of the
system is above 0.7 (shown in Figure 8). System
resilience improves noticeably, by about 21%, when
the bus bridging plan considers only the
one-direction (uplink) passenger flow.

Figure 8. TDSR in different amount of buses over time
in Scenario 2.

routes in Scenario 2.

Figure 10. TDSR improvement from different bridging
routes in Scenario 4 (x-axis: time from 7:10 to 9:00;
series for 24, 20 and 16 bridging routes).


In Scenario 4, removing part of the
bridging routes also improves the system TDSR
(shown in Figure 10). When the total number of
bus bridging routes is 24, the system resilience
is 0.7; when the number of bridging routes
is streamlined to 16, the system resilience reaches
0.83, an improvement of 19%. In general, with
limited bus resources, the metro operator can adjust
the bus bridging routes according
to the distribution of passenger flow in each time
period and assign the limited buses to the
bridging routes with the most demand, so as to
prevent the accumulation of stranded passengers in
stations.


6 CONCLUSION 


The resilience of a metro system is mainly manifested
as absorptive capacity, adaptive capacity and
restorative capacity. We focus on the adaptive
capacity, which can be enhanced with external bus
bridging services after a signaling failure. The main
conclusions are as follows:


1. In a metropolis, the urban rail transit system is the
most important means of travel. Once
the signaling system fails, a large number of
passengers accumulate at the stations in
a short time during the peak period. With the
bus bridging plan, the metro system resilience
is improved by about 21% to 43% compared
with the situation without it. The bus bridging
plan can also relieve the pressure
caused by the stranded passengers. The results show
that the passenger flow returns to its normal state
about an hour later without other actions taken by the
metro operators. Therefore, we suggest that during
perturbations, in addition to the bus bridging
plan, operators inform passengers
of the signaling failure in time through online or
other media so as to avoid excessive numbers of
passengers stranded at stations.

2. The morning and evening tidal trend of passenger
flow is obvious on Line 5. The generated
bus bridging routes can be adjusted to
the change of passenger flow, giving priority
to the routes with more demand, to enhance the
metro system resilience. The results provide a
reference for the metro operator in the day-to-day
handling of signaling failures.


As a further improvement of the metro-bus bridging
plan, we are interested in the exits of the bridged
metro stations, as the wide distribution of
metro exits influences the bus bridging routes,
especially at transfer stations. In addition, further
work can be carried out on the commands of the
dispatcher during perturbations, optimizing
the dispatching strategy to restore the normal
operation order of trains as soon as possible and
enhance the resilience of the metro system.


ABSTRACT: Resilience of Critical Infrastructure (CI) has been a research focus for several years now, 
with efforts being made to develop methods for the analysis and assessment of CI resilience. However, 
these efforts are often carried out without consideration of enriching societal risk or resilience assess- 
ments with knowledge of the resilience of CI. Bearing in mind that the definition of CI according to the 
EU reflects the fact that it exists to deliver vital societal functions, the consideration of its resilience in 
isolation of the community it serves is only addressing part of the problem. The Horizon 2020 project 
IMPROVER has already developed methodologies for assessing and managing CI resilience. This paper 
proposes an evolution of the management framework for CI resilience which enriches societal resilience 
assessment with knowledge of the CI resilience. The framework and societal resilience analysis methodol- 
ogy are both described along with an application of the analysis method. 


1 INTRODUCTION 


In recent years, increasing resilience of Critical 
Infrastructures (CI) has been a major objective 
of the EU as well as worldwide. Disruptions of 
critical infrastructures, caused by natural or man- 
made events, are increasingly affecting today’s 
society as the reliance on infrastructure systems to 
provide vital societal functions increases (Rinaldi 
et al. 2001). As one of the main purposes of CI is 
to deliver services to the society, resilience assess- 
ment of CI should be closely linked to the study 
of societal resilience. Societal resilience refers 
to “the ability of social groups or communities 
to cope with external stresses and disturbances 
as a result of social, political and environmen- 
tal change” (e.g. Adger 2000, Folke 2006, Furedi 
2007, Marshall 2010, Voss 2008). Being able
to tolerate a reduction in the quality, quantity, or
availability of a service provided by a CI
demonstrates coping capacity, and can therefore be seen
as a key component of societal resilience. Thus, CI
resilience and societal resilience are closely linked, 
with each influencing the other. Indeed, many 
existing societal resilience analysis methodologies 
include indicators relating to critical infrastructure
(e.g. Michel-Kerjan 2015, Boon et al. 2012, Cutter
et al. 2010, Renschler et al. 2010). As such,
increasing CI resilience also serves to increase
societal resilience, creating a positive feedback loop
between the two. The societal domain mainly becomes
of interest at a higher level, regional
or national, where several CIs operate.

To that end, the EU Horizon 2020 project 
IMPROVER (Improved risk evaluation and 
implementation of resilience concepts to critical 
infrastructure) has developed the IMPROVER 
Societal REsilience Framework (IS-REF), which 
maps resilience management onto common risk 
management frameworks and includes aggregated 
CI resilience assessments as an input to the soci- 
etal resilience analysis (Fig. 1). The IS-REF is 
an evolution of ICI-REF (IMPROVER Critical 
Infrastructure Resilience Framework), which is 
developed for managing CI resilience, and has a 
similar structure. 

The initial step in IS-REF is to establish the
context for societal resilience management. This
includes the gathering of CI resilience assessments,
which may have been conducted by using the
ICI-REF framework with methodologies developed
within IMPROVER, designed for CIs to manage
their technological and/or organisational resilience.
In the next step of IS-REF, as a complement to
societal risk assessment, a societal resilience
assessment is undertaken. The results of such a
resilience analysis shall be evaluated against
previously decided upon criteria, and if necessary, a
societal resilience treatment plan is devised.
Throughout the entire IS-REF framework, communication
and consultation as well as monitoring and review
take place. For more details on the framework see
IMPROVER’s deliverable 5.1 (Lange et al. 2017).

Figure 1. The structure and process of the IMPROVER
Societal REsilience Framework (IS-REF). Elements shown:
establishing the context; CI resilience assessment
(multiple operators and multiple sectors); aggregated
CI resilience assessments; societal risk assessment;
societal resilience assessment (societal resilience
analysis, societal resilience evaluation); societal
resilience treatment; societal risk treatment;
monitoring and review.

Within both the ICI-REF and IS-REF, the 
resilience analysis can be performed using exist- 
ing methodologies. However, IMPROVER has 
designed success factors for the performance of 
the resilience management frameworks and their 
content. Those of the success factors which gener- 
ate requirements to resilience analysis methodolo- 
gies are presented in Table 1. The success factors 
were designed to meet requirements from stake- 
holders and end-users, and have their basis in the 
Horizon 2020 call text. With a background in systems
theory (Roux-Rouquié & Moigne 2002), the
success factors are categorised according to four 
fundamental system dimensions: goal, environ- 
ment, structure and evolution, and are described 
in detail in IMPROVER’s deliverable D6.1 (Reitan 
et al. 2017). 

For the purpose of fulfilling the success fac- 
tors, a well-defined methodology which is highly 
suitable for analysing societal resilience within 
the context of CI is currently being developed:
the IMPROVER Societal Resilience Analysis 
(ISRA) methodology. This paper describes the 
background and structure of ISRA which is still
in an early phase. Pilot tests, giving insight in how
the methodology can be used, will aid the further
development and optimisation of the methodology.
A societal resilience analysis of one of the
living labs in IMPROVER was conducted and is
presented in this paper.

Table 1. Success factors for IMPROVER’s resilience
management frameworks, with requirements to analysis
methodologies.

System dimensions: Goal, Environment, Structure, Evolution

Success factors:
Applicable to all types of CIs
Easy to use
Effective and coherent crisis and disaster resilience management
Provide relative resilience measurements
Supplements existing practice
Taking into account public communication
Arranged for being revised continuously
Learning capabilities
Willingness of utilisation


2 ANALYSING SOCIETAL RESILIENCE 


Assessing and enhancing the resilience of critical 
infrastructures will not automatically result in a 
resilient society, since social and human dimen- 
sions have a strong influence in the achievement 
of a resilient society. Indeed, it is important to 
consider the link between physical and human sys- 
tems to understand and enhance societal resilience 
(Chan et al. 2014). However, the concept of resil- 
ience is not yet fully operationalized, and many dif- 
ferent approaches have been developed to achieve 
a measurement of resilience (e.g. Frankenberger 
et al. 2013, Boon et al. 2012, Cutter et al. 2010, 
Norris et al. 2008). While there are no generally 
agreed upon metrics, indicators can be an effec- 
tive tool to help decision makers understand where 
their community stands in terms of resilience and 
as a base for developing plans and strategies to 
enhance resilience. Furthermore, for CI, a societal 
resilience analysis based on indicators provides a 
holistic picture of a community’s strengths and 
weaknesses in times of disasters, offering an under- 
standing about what kind of society the critical 
infrastructure operates in. 


2.1 Resilience capacities and dimensions 


In the field of societal resilience, the concepts of coping
capacity, adaptive capacity and transformative
capacity are common denominators that are used
to categorise the capacities needed to achieve resilience
(Keck & Sakdapolrak 2013). Coping capacity
refers to the ability to respond, absorb and recover 
from a disruptive event and is generally related to 
a time frame close to the event. Adaptive capacity 
includes the ability to plan for and adjust to future 
challenges, which is related to a longer time-frame 
both before and after an event. Resilience is not 
only about quick recovery and adjusting to new 
circumstances; the aspect of transformation must 
also be taken into account. Transformative capac- 
ity refers to the ability to transform the stability 
landscape in order to create new, better, pathways 
for the system and is thus related to major changes 
in the long-term. 

Resilience is by definition a complex and mul- 
ti-dimensional topic, and a resilience assessment 
should ideally include all of these dimensions and 
their interdependencies (Sharifi & Yamagata 2016). 
However, to be able to operationalize the resilience 
concept, some trade-offs are necessary. From lit- 
erature, six major societal resilience dimensions 
were identified as physical, social, human, natural, 
economic and institutional capital. Each of these 
dimensions has influence on the resilience capaci- 
ties discussed above. 


3 THE ISRA METHODOLOGY 


3.1 Overall structure 


The objective of the methodology is that the analy- 
sis should be able to inform a community on how 
to enhance coping, adaptive and transformative 
capacities. The starting position is to analyse a 
community’s perceived capability to react, adapt 
and recover from a shock. Based on the resilience 
dimensions, a set of indicators has been identified 
to analyse societal resilience. Within ISRA, detailed 
individual assessments of CI resilience are used to 
provide an input to the physical dimension. 


[Figure 2 legend: CC = coping capacity; AC = adaptive capacity; TC = transformative capacity; w = weight.]

The indicators are further categorized by which 
capacity they mainly have influence on (Table 2). 
The indicators and their parent resilience dimensions 
can, in reality, influence more than one capacity, but 
in this first development phase, they are categorized 
according to the capacity they are assumed to have 
the most effect on. The proposed aim of using the
methodology is to i) develop a common understanding
of societal resilience, and ii) establish the current
position (the result of the analysis).

However, it is important to keep in mind that 
ISRA is only a part in an overall resilience manage- 
ment strategy, the results of which need to be used 
in the resilience evaluation part of the framework 
to ensure sense making, before deciding how to go 
about improving resilience in the resilience treat- 
ment part of IS-REF. In this early development 
phase, the focus was to identify relevant indica- 
tors from literature and create a holistic picture of 
the societal resilience domain. Thus, at this stage, 
interdependencies or conflicting goals among the 
indicators were not taken into account. 


3.2 Aggregation of the indicators 


The assessment is performed by qualitatively scor- 
ing a set of indicators on a scale from 1 to 5. The 
indicators are categorized according to the six resil- 
ience dimensions at Level 2 of the ISRA structure 
(Fig. 2). The score for each resilience dimension 
is achieved by aggregating the indicators under 
each dimension and capacity to a single measure- 
ment, and thereafter aggregating the three capacity 
scores into one score for each resilience dimension. 
In this early development phase of ISRA, all indicators
and subcategories are weighted equally, but
one might want to weight the capacities and
dimensions differently in order to capture subjective
resilience aspects. The indicators are aggregated
by the weighted arithmetic mean (Eq. 1)


$$\text{Level } k \text{ indicator} = \sum_{i=1}^{n} w_i x_i \qquad (1)$$

where $n$ is the number of Level $k+1$ indicators, $w_i$
the weighting coefficient for the individual Level
$k+1$ indicator and $x_i$ the scoring of the individual
Level $k+1$ indicator. The aggregation can be done all the
way up to Level 1 (see Fig. 2) resulting in a global
score for societal resilience. However, a lot of information
can be lost by doing that, so presenting the
resilience dimensions in Level 2 provides a more
detailed result.

[Figure 2: tree diagram of the indicator hierarchy across Levels 1 to 3 and the indicator level below; the 18 Level 3 branches carry the weights w11, w12, w13, ..., w61, w62, w63 (six dimensions, three capacities each).]

Figure 2. Structure of the ISRA methodology.
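
As a minimal sketch of the aggregation in Eq. 1, the Python snippet below computes a weighted arithmetic mean and applies it twice: indicator scores are first aggregated to a capacity score, and the capacity scores are then aggregated to a dimension (Level 2) score. The function names, the normalisation of the weights so that they sum to one, and the example scores are assumptions made for illustration; they are not taken from the ISRA documentation.

```python
from typing import Dict, Optional, Sequence


def weighted_mean(scores: Sequence[float],
                  weights: Optional[Sequence[float]] = None) -> float:
    """Eq. 1 as a weighted arithmetic mean of the Level k+1 scores.

    Weights are normalised to sum to one (an assumption); omitting them
    gives the equal weighting used in the pilot.
    """
    if len(scores) == 0:
        raise ValueError("at least one score is required")
    if weights is None:
        weights = [1.0] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("scores and weights must have the same length")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * x for w, x in zip(weights, scores)) / total


def dimension_score(indicators_by_capacity: Dict[str, Sequence[float]],
                    capacity_weights: Optional[Dict[str, float]] = None) -> float:
    """Aggregate indicators to capacity scores, then capacities to one dimension score."""
    capacities = list(indicators_by_capacity)
    capacity_scores = [weighted_mean(indicators_by_capacity[c]) for c in capacities]
    weights = None
    if capacity_weights is not None:
        weights = [capacity_weights[c] for c in capacities]
    return weighted_mean(capacity_scores, weights)


# Hypothetical 1-5 scores for one resilience dimension, grouped by capacity.
hypothetical_dimension = {"CC": [4, 3], "AC": [5], "TC": [2, 4, 3]}
level2_score = dimension_score(hypothetical_dimension)
print(round(level2_score, 3))  # 3.833
```

Passing `capacity_weights`, for example `{"CC": 0.5, "AC": 0.25, "TC": 0.25}`, changes only the second aggregation step, which is where subjective priorities between capacities would enter.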


4 PILOT DEMONSTRATION OF ISRA 


4.1 Case study 


The Port of Oslo has been one of the living labs 
in IMPROVER from the start of the project. The 
relevance of the Port of Oslo from a CI resilience 
perspective can be traced back to 2014, when the 
Norwegian Directorate for Civil Protection pub- 
lished an overall assessment of the management of 
the safety conditions in Sydhavna, and pinpointed 
improvement measures for the enterprises, the City 
of Oslo, the Port of Oslo and central government 
authorities. 

In the wake of the report, the extensive exercise 
HarbourEx15 was carried out at Sydhavna from 28 
to 29 April 2015. The scenario for the exercise was 
an explosion and fire in containers with hazard- 
ous substances in the container area, fire in the fuel 
depot, evacuation of smoke-filled areas in parts of 
Oslo, and the grounding of a vessel and subse- 
quent oil spill in the Inner Oslo Fjord. 

The goal of the exercise was to improve emer- 
gency planning and rescue operations in the 
event of a major accident in Oslo. The exercise 
involved a total of ca. 40 participating organisa- 
tions, including the rescue agencies, public authori- 
ties with responsibility for emergency planning, 
international rescue services and private actors at 
Sydhavna and in the City of Oslo. 

In the evaluation of HarbourEx15, nine pro- 
posed initiatives were described, among these some 
of relevance to societal resilience, including popu- 
lation alert (acute alert) and providing information 
to the public (DSB 2016). Based on this work, a 
well-defined approach for societal resilience assess- 
ment is of interest to the area. 


4.2 The pilot test approach 


4.2.1 The focus group 

In the development of ISRA, which is an operational 
methodology, users (authorities) were included at an 
early stage of the work to ensure that the method- 
ology is fit-for-purpose. The focus group consisted 
of 5 relevant representatives from the HarbourEx15 
evaluation. The participants were chosen to 


represent different parts of the community, as well as 
for their ability to provide insight and experience in 
areas connected to societal resilience. Their task was 
to evaluate the methodology from the perspective of 
the HarbourEx15 based on their expert knowledge 
about Sydhavna and the surroundings. 


4.2.2 Approach 

The exercise was in the form of a questionnaire. 
This was considered a sufficient method to com- 
bine a societal resilience analysis of the case study, 
and simultaneously receive the focus group’s spon- 
taneous evaluation of the indicators, as well as 
the quality of the questions that were designed to 
provide measures to the indicators. The question- 
naires were designed as follows: 

The focus group was asked to decide on a geographical
area to perform the evaluation on. They
were subsequently asked to evaluate the area according
to the 56 societal resilience indicators. The indicators
were evaluated according to a symmetric Likert
scale from 1 to 5. To facilitate the self-evaluation, each
indicator was described by a statement to which the
respondents could specify their level of agreement
from Strongly Disagree (corresponding to 1 on the
scale) to Strongly Agree (corresponding to 5 on the
scale). To support their evaluation, they were asked
to provide evidence and support for their answers, 
along with a summary of the discussions that led to 
the response. Finally, they were asked to give their 
opinion about each indicator’s relevancy and under- 
standability. The feedback was then analysed using 
thematic analysis (Braun & Clarke 2006). 
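
The questionnaire design described above could be represented roughly as follows. This is an illustrative sketch only: the paper names just the two end points of the scale, so the three intermediate labels, the class and field names, and the two example responses are assumptions.

```python
from dataclasses import dataclass
from typing import List

# Symmetric five-point Likert scale; only the end points are named in the text,
# the intermediate labels are assumed.
LIKERT_SCALE = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neither Agree nor Disagree": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}


@dataclass
class IndicatorResponse:
    """One focus-group answer to one indicator statement."""
    indicator: str
    agreement: str               # a key of LIKERT_SCALE
    evidence: str = ""           # evidence and discussion summary, if provided
    relevant: bool = True        # opinion on the indicator's relevancy
    understandable: bool = True  # opinion on its understandability

    @property
    def score(self) -> int:
        return LIKERT_SCALE[self.agreement]


def evidence_share(responses: List[IndicatorResponse]) -> float:
    """Fraction of responses that come with some supporting evidence."""
    if not responses:
        raise ValueError("no responses given")
    supported = sum(1 for r in responses if r.evidence.strip())
    return supported / len(responses)


# Hypothetical responses, not taken from the pilot data.
responses = [
    IndicatorResponse("Hypothetical indicator A", "Agree",
                      evidence="Reference to a past exercise"),
    IndicatorResponse("Hypothetical indicator B", "Disagree",
                      understandable=False),
]
print([r.score for r in responses], evidence_share(responses))
```

Keeping the evidence text next to each score makes it straightforward to report how many scores are actually supported, which matters when interpreting the aggregated results.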


4.3 Results 


In this section, general comments from the focus 
group and evaluations of the specific indicators are 
described. 


4.3.1 Comments and input 

Three main themes were identified from the com- 
ments and input regarding the chosen indicators 
and statements: lack of clarity, more than one 
question, and overlap. 

For several of the indicator statements, focus 
group participants expressed the need for several 
clarifications. For example, one comment stated, 
“Unclear what it is you are looking for here.” 
Another way to express lack of clarity was by ask- 
ing directly for clearer definitions for the following 
words: Cooperation, Coordination, Community, 
Crowdsourcing, Community based, Actor, Stake- 
holder, Public, People in the community and Trust. 
Other times, there was uncertainty about which
actor the methodology was asking about (e.g.
“unclear if referring to decision making structures

Table 2. Societal resilience indicators according to their resilience dimension and supporting capacity, and scores
from the pilot self-evaluation exercise. CC = Coping capacity, AC = Adaptive capacity, TC = Transformative capacity.

Resilience dimension Indicator CC AC TC Reference Self-eval. score
Physical 1.1 Preparedness x Pursiainen et al. (2016) N/A 
capital 1.2 Prevention x Pursiainen et al. (2016) N/A 
1.3 Warning x Pursiainen et al. (2016) N/A 
1.4 Response Pursiainen et al. (2016) N/A 
1.5 Risk assessment Pursiainen et al. (2016) N/A 
1.6 Recovery Pursiainen et al. (2016) N/A 
1.7 Learning Pursiainen et al. (2016) N/A 
Social 2.1 Social welfare and family support x Sharma & Srivastava (2016) 4 
capital 2.2 Isolation/decline in place attachment xX Born (2014) 3 
2.3 Trust between citizens X Mayunga (2007) 4 
2.4 Attitudes towards sharing of resources x Coles & Buckle (2004) 4 
2.5 Perception of risk Cabinet Office (2011) 2 
2.6 Participation in community x Burns et al. (2004) 3 
organisations/projects 
2.7 Minority groups in decision Magis (2010) 4 
making structures 
2.8 Social cohesion Poortinga (2012) 4 
2.9 Shared community values Flora & Flora (2004) 3 
2.10 Perceptions of inclusiveness in Ahmed et al. (2004) 4 
decision making 
2.11 Sense of identity in community Paton & Johnston (2006) 4 
2.12 Engaging the public by using Pursiainen et al. (2016) 4 
social technologies 
2.13 Information to public about their x Höppner et al. (2012) 4
responsibilities in case of emergency/ 
disaster 
2.14 Exposure to media Serafinelli et al. (2017) 5 
2.15 Interoperable communication IMPROVER D2.2 5 
among stakeholders 
2.16 Information to public about Höppner et al. (2012) 4
hazards and risks 
2.17 Partnership between agencies, Evart & McLean (2017) 3 
community groups and private enterprises 
2.18 Social network/good social Paton & Johnston (2006) 3 
infrastructure 
2.19 Knowledge sharing by different ISO/DIS 22316:2017 5 
stakeholder groups 
Human 3.1 Diversity in resources and skills Magis (2010) 4 
capital 3.2 Health inequality x Chandra et al. (2011) 5 
3.3 Immunization coverage x WHO/EHA (1998) 5 
3.4 Water quality X Wilson (2012) 5 
3.6 Exercises and drills for disaster response Collis et al. (2004) 3 
3.7 Attitudes towards change Wilson (2012) 3 
3.8 Attitudes towards value of education Cutter et al. (2010) 4 
3.9 School completion UNDP (2014) 5 
3.10 Adoption of new technologies Magis (2010) 5 
3.11 Free education UNDP (2014) 5 
3.12 Experiences of disasters/emergencies Wilson (2012) 3 
3.13 Knowledge about formal Kuhlicke et al. (2011) 2 
institutions, laws, legal frameworks 
and actors involved 
(Continued)
Table 2. (Continued).
Resilience dimension Indicator CC AC TC Reference Self-eval. score
Natural capital 4.1 Existence of green spaces x FEMA (2014) 5
4.2 Communal resource management structures x Wilson (2012) 4
4.3 Carbon footprint x Wilson (2012) 2
Economic capital 5.1 Economic inequality x Magis (2010) 3
5.2 Economic resources available x Paton & Johnston (2006) 5
5.3 Community support for maintenance of services x Magis (2010) 5
5.4 Population covered by hazard insurance x Tierney (2007) 5
Institutional capital 6.1 Disasters/emergency response plans x Chandra et al. (2011) 3
6.2 Zoning ordinances for high hazard areas x Frankenberger et al. (2013) 5
6.3 Land use and growth management plans x Frankenberger et al. (2013) 5
6.4 Hazard mitigation and vulnerability assessments x Frankenberger et al. (2013) 5
6.5 Disaster recovery plans x Frankenberger et al. (2013) 4
6.6 Crowdsourcing platforms as decision support x Serafinelli et al. (2017) 2
6.7 Usage of ICT for public awareness of disasters x Serafinelli et al. (2017) 2
6.8 Public involvement in decision making and planning x Priest et al. (2016) 4
6.9 Satisfaction with emergency managers x Frankenberger et al. (2013) 5
6.10 Trust in local government and authorities x Paton & Johnston (2006) 3
6.11 Satisfaction with local government x Frankenberger et al. (2013) 4
6.12 Dialogue-oriented communication with the public x Kuhlicke et al. (2011) 4
6.13 Emergency management procedures x Frankenberger et al. (2013) 4
6.14 Early warning and contingency planning x Frankenberger et al. (2013) 4
6.15 Building standards, codes and enforcement x Frankenberger et al. (2013) 5
6.16 Transparency and accountability x Frankenberger et al. (2013) 5
6.17 Regulatory mechanisms for use of pasture, water, agricultural lands and forest resources x Frankenberger et al. (2013) 5


in administrative decision-making systems or in civil 
society”) or the scope of the question (e.g. “should 
we refer to the South Harbour and industrial duty, 
or should we refer to the community at large”). This 
theme appeared 18 times in the comments. 

Another theme, appearing 11 times, was that several
indicator statements actually comprise multiple
statements. This comment illustrates the participants’
views well: “The statement contains two different
elements/issues that can have different answers.”
There were also a few (five) comments with regard to
certain indicators overlapping with one another. One
such comment, “there is an overlap here with several
of the other indicators,” illustrates this theme well.

Of special importance, though only men- 
tioned once, was a comment related to the entire 


methodology instead of a single indicator or indi- 
cator statement. The participants asked, “How 
close should we attach the response to events in 
Oslo harbor?” This comment also shows a lack of 
clarity in how to use the method. 

The participants of the focus group offered
helpful critiques of the methodology. They also
stated that, although they found the exercise challenging,
they perceived it to be interesting and relevant
to the assessment of societal resilience.


4.3.2 Self-evaluation 

The preliminary scores from the pilot self-evaluation
are shown in Table 2. Since no analysis of the critical
infrastructure in Sydhavna has been made within the
IMPROVER project, there are no scores for
the indicators under physical capital.

The focus group was also asked to provide evidence
and discussion with regard to the self-evaluation
score given for each indicator. However, evidence
and discussion were provided for only 18 out of 
56 indicators. Evidence and discussion provided 
ranged from very relevant and specific, for example 
for indicator 2.12 (see Table 2), “Emergency services 
and the municipality use social media, and have a 
large number of followers (c.f. HarbourEx15)”, to 
relevant but lacking in empirical support, for exam- 
ple for indicator 2.3 Trust between citizens, “Based 
on the generally high level of trust in Norway, we 
answer that we agree,” to suppositions, as for exam- 
ple indicator 6.7 Usage of ICT for public awareness 
of disasters, “we don’t think so”. 


4.3.3 Presentation of societal resilience 

The societal resilience of Sydhavna and its
surrounding neighbourhoods is preliminarily
illustrated by the radar chart in Figure 3. All
dimensions except physical capital are represented
by aggregated scores from the self-evaluation, and
the score for physical capital is purely fictitious,
assigned by the authors of this paper. Note that the
radar chart does not represent absolute measurements of the
societal resilience in Sydhavna, but is rather a
way to show how the output from ISRA may be
presented.

[Figure 3: radar chart with six axes: Physical capital, Social capital, Human capital, Natural capital, Economic capital and Political capital.]

Figure 3. Presentation of the result from the pilot self-evaluation
of societal resilience using ISRA.


4.3.4 Sensitivity analysis of indicator weights 

At this early development phase, the indicators 
were weighted equally due to lack of information 
about relative importance of the indicators and 
dimensions. A simple sensitivity analysis was per- 
formed to study if, and how, different weightings 
would affect the results. The sensitivity analysis 
was performed by assigning the Level 3 indica- 
tors a weight of 0.5 one at a time, and assigning 
the remaining two Level 3 indicators a weight of 
0.25. The resulting scores were compared to those 
obtained with equal weighting. The result from the 
sensitivity analysis is shown in Table 3. 
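
The one-at-a-time weighting described above can be sketched in a few lines of Python. The function name is hypothetical, and the capacity-level scores used in the example (2, 5 and 4) are not listed in the paper: they are back-calculated assumptions chosen because they reproduce the Natural capital row of Table 3.

```python
from typing import Dict


def one_at_a_time(capacity_scores: Dict[str, float],
                  emphasised_weight: float = 0.5) -> Dict[str, float]:
    """Dimension score with equal weights, then with each capacity emphasised in turn.

    The emphasised capacity gets `emphasised_weight`; the remaining weight is
    split equally between the other capacities (0.25 each for three capacities).
    """
    names = list(capacity_scores)
    if len(names) < 2:
        raise ValueError("need at least two capacities to vary the weights")
    rest = (1.0 - emphasised_weight) / (len(names) - 1)
    results = {"equal": sum(capacity_scores.values()) / len(names)}
    for emphasised in names:
        results[emphasised] = sum(
            (emphasised_weight if name == emphasised else rest) * capacity_scores[name]
            for name in names
        )
    return results


# Assumed capacity scores (see lead-in); CC, AC and TC as defined for Table 2.
natural_capital = {"CC": 2.0, "AC": 5.0, "TC": 4.0}
for label, score in one_at_a_time(natural_capital).items():
    print(f"{label}: {score:.3f}")
# equal: 3.667
# CC: 3.250
# AC: 4.000
# TC: 3.750
```

With only three capacity scores per dimension, the spread between the columns is driven entirely by how far the emphasised capacity lies from the other two, which is why a dimension with a low coping score drops most under coping emphasis.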


5 DISCUSSION 


5.1 The outcome of the pilot test 


The methodology was perceived as valuable in 
terms of creating an overview of aspects that can 
affect coping, adapting and transforming capacity. 
However, the self-evaluation scores given to the 
indicators need to be supported by evidence or ref- 
erences. In the pilot test, only 18 out of 56 indica- 
tors were supported with evidence. This means that 
the output of the analysis should not be consid- 
ered a realistic result in terms of maturity of soci- 
etal resilience. In general, the focus group scored 
themselves high on the indicators and the resulting
aggregated scores for each resilience dimension
were around 4 on a scale from 1 to 5. This could mean
that the studied area is mature regarding societal
resilience, but since the evidence provided in the
analysis was weak, strong conclusions should not
be drawn from these results.


5.2 Limitations of the study 


At this stage, the indicator hierarchy that ISRA is 
constructed of does not consider interdependencies
and correlations between different resilience dimensions.
However, as the socio-ecological system is a
very complex system, there is a need to consider
how the different parts interact and affect each
other. Moreover, there is no consideration of conflicting
goals among the indicators or if there are
synergies that could result in emerging properties.
These limitations need to be explored, and the
effects need to be evaluated, to further improve
ISRA and produce valid results from the analysis.

Table 3. Level 2 scores aggregated with equal weights, and w = 0.5 for each Level 3 indicator (capacity) by letting the
rest of the weights remain equal (w = 0.25).

Resilience dimension   Equal weights   Coping w=0.5   Adaptive w=0.5   Transform. w=0.5
Physical capital       3.500           3.313          3.625            3.500
Social capital         3.802           3.768          3.744            3.893
Human capital          3.961           4.054          3.908            3.921
Natural capital        3.667           3.250          4.000            3.750
Economic capital       4.333           4.500          4.500            4.000
Institutional capital  3.944           3.875          4.125            3.833


5.3 Going forward: Ways to improve the method 


The sensitivity analysis showed that weights can
affect the end result, although neither a statistical
sensitivity analysis nor an analysis of the weighting
of each individual indicator was made. To further develop
ISRA there is a need to investigate if and how to
assign weights to the indicators.

One limitation of the indicator hierarchy that 
ISRA is constructed of is that it does not consider 
interdependencies and correlations between different 
resilience dimensions. As the socio-ecological system 
is a very complex system, there is a need to consider 
how the different parts interact and affect each other. 
This would increase the validity of the analysis. 

The thematic analysis of the participants’ com- 
ments identified three main areas for improving 
ISRA indicators. These are the need to clarify not 
only the definitions of words but also the scope 
and actors associated with the statements, ensure 
that each statement is only presenting the respond- 
ent with one question, and that the statements do 
not overlap with other indicator statements. As 
such, for future iterations of the methodology, we
propose to add a definitions section that clearly
defines commonly used terms, as well as to define
certain terms that are used only once within the
statement provided. When operationalising the indicator
into a statement, care needs to be taken to ensure 
that each indicator statement is indeed composed of 
only one question, thus a review of the statements 
to include only the most applicable question to the 
indicator will be done. Lastly, overlapping indica- 
tors will be evaluated to see if the overlap is neces- 
sary. If so, the statements will need to be rephrased 
in order to put the accent on the differences between 
the indicators and ensure they do not overlap. The 
focus group also demonstrated the importance of 
including ISRA in the IS-REF framework, and bet- 
ter explaining how these two work together. Indeed, 
the comment relating to the scope of the methodol- 
ogy demonstrates that the fact that the context was 
meant to be identified before ISRA was used was 
unclear to the participants. 

Furthermore, the level of detail required for the 
evidence and discussion part of the methodology 
should be clearly defined, and more emphasis should 


be put on the importance of providing evidence 
to support one’s self-evaluation. In this first pilot
study, the scoring was, in general, not underpinned 
by strong evidence, and thus the results in terms of 
societal resilience maturity are not reliable. Indeed, 
the evidence allows future users of the evaluation to 
understand why the indicators were scored as they 
were, and also provides good input for the resilience 
treatment part of IS-REF. 

While improvements need to be made, the overall
feeling shared by the participants was that ISRA
provides a novel approach to disaster risk management
and may be useful for evaluating societal resilience,
thus demonstrating the success of the method.


6 CONCLUSIONS 


This paper has presented the IMPROVER Societal
Resilience Analysis (ISRA) and demonstrated the
methodology. The methodology is being developed
to enrich societal resilience with the knowledge of
critical infrastructure resilience on a regional or
national level. To ensure operability of the methodology,
a focus group consisting of potential future
users was involved from the start of the development
process. Their feedback provides a basis for
further development of the methodology, including
increasing clarity, removing overlap and multiple
statements, better explaining how ISRA fits into IS-REF,
and putting more emphasis on the collection
of evidence. These improvements will ensure that
ISRA meets the needs of the intended user and
provides a useful process as well as valid results.
The ISRA methodology draws on existing indi- 
cators and frameworks for societal resilience anal- 
ysis; however it introduces the results of critical 
infrastructure resilience evaluations to the societal 
resilience analysis as an overall high level indicator 
of the physical capital of a community. In combina- 
tion with the IMPROVER IS-REF, as proposed in 
the introduction to this paper, ISRA is a tool which 
could be used to enrich societal risk assessments by 
providing an overview of the capacity of a society 
to cope with, adapt to and transform in the aftermath
of a disaster or emergency. This information,
while not necessarily adding value for a risk assessment
addressing the frequency of an incident and
its immediate consequences, would
add significant value and help to evaluate risks by
providing an overview of the medium and long term
ability of a community to cope with the incident.

ABSTRACT: In the field of Critical Infrastructures (CI), both policy and research focus has shifted
from protection to resilience. The IMPROVER project has developed a CI resilience management frame- 
work (ICI-REF), applicable to all types of CI and resilience domains (technological, organisational and 
societal) allowing operators to understand and improve their resilience. IMPROVER has also developed 
methodologies to be used within the framework, accompanied with resilience indicators for operators to 
assess their technological and organisational resilience. The framework allows CI operators to incorporate 
resilience management as part of their risk management processes. The ICI-REF, the resilience analysis 
methodologies and indicators have been optimised, applied and demonstrated in a pilot implementation, 
focusing on the potable water supply in Barreiro, Portugal. Conclusions from the operators so far are that 
the indicators, well-defined and unambiguously described, are crucial for monitoring resilience activities,
to ensure objective, consistent, repeatable and representative results from the assessed processes.


1 INTRODUCTION 


Increasing Critical Infrastructure (CI) resilience is
one of the main objectives for the European
strategy towards a more secure Europe (COM,
2010). Through the European Programme for Critical
Infrastructure Protection (EPCIP), issues and approaches to
focus on are defined, where measures to facilitate 
implementation of resilience concepts to CI are 
identified (SWD, 2013). The concept of resilience 
has evolved from ecological resilience, via psychol- 
ogy, engineering to the disaster risk reduction field. 
There is thus a range of definitions of the concept 
of resilience and for this context we use that of 
UNISDR “The ability of a system, community or 
society exposed to hazards to resist, absorb, accom- 
modate to and recover from the effects of a hazard 
in a timely and efficient manner, including through 
the preservation and restoration of its essential
basic structures and functions” (UNISDR, 2009).
In the EU, CI is defined as: “an asset, system or part
thereof located in Member States which is essential 
for maintenance of vital societal functions, health, 
safety, security, economic or social well-being of 
people, and the disruption or destruction of which 
would have a significant impact in a Member State 
as a result of the failure to maintain those functions” 
(Council, 2008). 

An overall goal of the EU-funded Horizon
2020 project IMPROVER is to improve European
CI resilience to crises and disasters, through
the implementation of technological, organisa- 
tional and societal resilience concepts. To that 
end, the IMPROVER Critical Infrastructure 
REsilience Framework (ICI-REF) was developed. 
The framework is supported by resilience analy- 
sis methodologies and indicators, also developed 
in IMPROVER. It is inspired by existing standards
and frameworks, e.g. ISO 31000, ISO 22301,
ISO 22316, Org. Resilience HealthCheck (Austr.
Government, 2017), Benchmark Resilience Tool
(Resilient Organisations, 2014) and Resilience
Measurement Index (Petit et al., 2013).

To ensure that the developed ICI-REF frame- 
work, with supporting methodologies and indi- 
cators, is fit-for-purpose, it is optimised in pilot 
implementations, by application to relevant sce- 
narios in semi-real environments at several living 
labs. One pilot implementation has recently been 
conducted, focusing on potable water supply in 
Barreiro, Portugal. 

This paper describes structures and processes of 
the ICI-REF and resilience analysis methodologies, 
including preliminary results of the pilot imple- 
mentation at the Barreiro living lab, Portugal. 


2 IMPROVER CRITICAL 
INFRASTRUCTURE RESILIENCE 
FRAMEWORK (ICI-REF) 


2.1 The ICI-REF structure and process


ICI-REF is a general and well-defined framework 
for managing the technological, organisational 
and societal resilience of CI (Lange et al., 2017a; 
2017b). It includes the flexibility to account for the 
unique features of the various types of CI, giving 
CI operators an understanding of, and a capabil- 
ity to improve, their resilience. The framework 
extends standard risk procedures (ISO 31000) and 
considers resilience assessment as complementary 
to risk assessment. The framework is constructed 
such that it is easily incorporated within exist- 
ing risk management processes by CI operators. 
Initial feedback by CI operators (Theocharidou 
et al., 2016) indicated this approach as the most 
feasible one, as it can improve their current prac- 
tices and allow for risk and resilience management 
decisions to be taken based on the results of both 
assessments. ICI-REF allows operators to perform 
self-assessment or focused analysis of technologi- 
cal/organisational aspects in order to either moni- 
tor resilience over time, or compare to similar CI 
within the same sector. The ICI-REF structure is 
depicted in Fig. 1. 

The ICI-REF process starts with establishing
the context, which implies gathering information,
defining the resilience domain(s), etc. The
defined context, risk identification and risk analysis
are then fed into, and complemented by, the
resilience assessment process. Resilience assessment
comprises resilience analysis and evaluation
against pre-defined criteria. Three different
resilience analysis methodologies have been developed
(described in 2.3). The results from risk and
resilience assessments constitute the basis for
designing treatment plans, describing how to both
mitigate risk and improve resilience. This parallel
process allows decision makers to select risk and
resilience measures in a cost-effective way, especially
when a measure can be implemented to address
both risk and resilience objectives. Throughout the
ICI-REF process, Monitoring and review as well as
Communication and consultation are continuous
background processes (see Lange et al., 2017).

Figure 1. Structure of the IMPROVER Critical Infrastructure
REsilience Framework (ICI-REF).

CI resilience analysis can be performed by 
implementing existing methodologies, or by meth- 
odologies developed in IMPROVER. This paper 
focuses on technological and organisational resil- 
ience analysis, for which resilience assessments 
are performed at the CI level. Assessment and 
management of societal resilience shall instead be 
conducted on regional or national levels, using CI 
resilience assessments as input. A modified version 
of ICI-REF is developed for this purpose (Rosen- 
qvist et al., 2018). 


2.2 Resilience indicators 


In the context of IMPROVER, the term “resil- 
ience indicators” is related to variables that can be 
used, either alone or in combination, as a represen- 
tation of resilience. Qualitative, semi-quantitative 
or quantitative indicators are analysed and, when 
sufficient, aggregated to a measure of resilience. 

The resilience indicators should be clearly 
defined, in order to ensure objectivity and a proper 
balance between generality and specificity. To
monitor resilience over time or to compare with similar
CI, the indicators must also provide reproducibility
and repeatability. Measurement scales for the
indicators and their possible weight factors should 
ideally be benchmarked at a sectoral level. 

Based on literature and defined requirements 
from CI operators associated with IMPROVER, 
the resilience indicators to be included in the resil- 
ience analysis step of ICI-REF are developed and 
optimised. They relate to the various resilience
analysis methodologies used for different resilience
domains. 


2.3 Resilience assessment 


As a first step, the CI operator may want to con- 
duct an initial self-assessment to indicate strengths 
and weaknesses in its resilience; i.e. in which areas or 
domains a more in-depth assessment is required. For 
this purpose, the operator may find a resilience anal- 
ysis methodology with high flexibility useful, such as 
the Critical Infrastructure Resilience Index (CIRI) 
developed in IMPROVER (Pursiainen et al., 2017). 

This process may be sufficient, but if required,
operators can perform a re-assessment using
analysis methodologies that go into more
detail. For this purpose, two different methodologies
have been developed: the IMPROVER
Technological Resilience Analysis (ITRA) and 
IMPROVER Organisational Resilience Analysis 
(IORA) for analysing technological or organisa- 
tional resilience, respectively (Bram et al., 2017; 
Mindykowski et al., 2016). CIRI, ITRA and IORA 
methodologies are briefly described below. 

2.3.1 Critical Infrastructure Resilience Index 
(CIRI) 

Critical Infrastructure Resilience Index (CIRI) is a
holistic and easy-to-use self-assessment methodology.
It is applicable to all types of infrastructure
and is built on a four-level hierarchy of indicators,
focusing mainly on the technological and organisational
domains. The backbone of CIRI is the
crisis management cycle (OECD, 2011; Pursiainen,
2017). The different phases in the cycle
correspond to the seven Level 1 indicators: Risk
assessment, Prevention, Preparedness, Warning,
Response, Recovery, and Learning (Fig. 2). Under
each Level 1 indicator there is a subset of given
generic indicators (Level 2).

Further, for each Level 2 indicator there is a
new subset of mainly given, measurable indicators
(Level 3). However, as sectors use different metrics
and measures (quantitative/qualitative), the exact
measurement depends on the sector; these sector-specific
measurements are referred to as Level 4
indicators, the bottom of the hierarchy.

For a common viewpoint, Level 4 indicators are
transformed to a qualitative maturity scale ranging
from 0 to 5. At Levels 3 and 4, the operator has
the possibility to assign weights to the indicators
according to their importance. After assessing the
Level 4 indicators, results are aggregated up the
hierarchy, and each Level 1-3 indicator gets a score
from 0 to 5. The result is presented in a radar chart
with all seven Level 1 indicators.
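
The aggregation step can be illustrated with a minimal sketch. The text does not state the aggregation rule, so a weighted arithmetic mean at each level is assumed here; the function name `aggregate`, the indicator names, the scores and the weights are all hypothetical and serve only to show the mechanics.

```python
from typing import Union

# A node is either a leaf maturity score (0-5) or a dict mapping a child
# indicator name to a (weight, child node) pair. The weighted mean used
# below is an assumption; the source only says scores are aggregated.
Node = Union[float, dict]


def aggregate(node: Node) -> float:
    """Return the 0-5 score of a node as the weighted mean of its children."""
    if isinstance(node, (int, float)):
        if not 0 <= node <= 5:
            raise ValueError("maturity scores must lie in [0, 5]")
        return float(node)
    total_weight = sum(weight for weight, _ in node.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weight * aggregate(child) for weight, child in node.values())
    return weighted / total_weight


# Hypothetical fragment of one Level 1 indicator (illustrative values only).
level_1_indicator = {
    "Level 2 indicator X": (1.0, {
        "Level 3 indicator X1": (2.0, {
            "Level 4 indicator a": (1.0, 4),
            "Level 4 indicator b": (1.0, 2),
        }),
        "Level 3 indicator X2": (1.0, {
            "Level 4 indicator c": (1.0, 5),
        }),
    }),
}

score = aggregate(level_1_indicator)
print(f"Level 1 score: {score:.2f}")
```

Because every level returns a value on the same 0 to 5 scale, the seven Level 1 scores obtained this way could be plotted directly on a radar chart.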

In addition, to present a more detailed analysis, 
it is possible to construct charts for all Level 1 over 
their respective Level 2 indicators, see Fig. 3.

Figure 2. The hierarchical structure of Critical Infrastructure
Resilience Index (CIRI). The Level 1 indicators,
representing different phases in the risk management
cycle, are here denoted (A) Risk assessment, (B) Prevention,
(C) Preparedness, (D) Warning, (E) Response,
(F) Recovery and (G) Learning.

Figure 3. Analysis results for most Level 1 indicators,
and Level 2 indicators under Prevention, Preparedness,
and Response.


It should be noted here that this is a self-assess- 
ment methodology and thus not fully objective. 
However, the result is indicative of the CI’s resil- 
ience level and highlights strengths and weaknesses 
of the infrastructure, both from the technological 
and organisational perspective. It can be used as 
the basis for further detailed analysis, using meth- 
odologies like IORA and ITRA. 


2.3.2 IMPROVER Technological Resilience 
Analysis (ITRA) 

Technological resilience is often visualised using
the performance loss and recovery function or the
area between the function and an uninterrupted
capacity/performance. From the risk identification
(Fig. 1), a prioritised list of possible hazards which
could impact the CI is used as input to the technological
resilience analysis, which aims at quantifying
the performance loss and recovery of the CI service.
Thus, technological resilience is conditional on the
occurrence of a specific hazard, following the risk
management procedure of ISO 31000.
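
The area between the performance function and an uninterrupted performance level can be approximated numerically. The sketch below is only an illustration under stated assumptions: performance is sampled at discrete times, the trapezoidal rule is used, and the function name `resilience_loss`, the target level of 1.0 and the sample values are hypothetical.

```python
from typing import Sequence


def resilience_loss(times: Sequence[float],
                    performance: Sequence[float],
                    target: float = 1.0) -> float:
    """Trapezoidal area between the target level and the performance curve.

    Performance above the target is treated as no loss.
    """
    if len(times) != len(performance) or len(times) < 2:
        raise ValueError("need at least two matching samples")
    loss = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        if dt < 0:
            raise ValueError("times must be non-decreasing")
        gap_prev = max(target - performance[i - 1], 0.0)
        gap_curr = max(target - performance[i], 0.0)
        loss += 0.5 * (gap_prev + gap_curr) * dt
    return loss


# Hypothetical event: service drops to 40 % at day 0 and recovers
# linearly to full service by day 10.
times = [0.0, 10.0]
performance = [0.4, 1.0]
loss = resilience_loss(times, performance)
print(f"Performance loss: {loss:.2f} service-days")
```

A smaller area indicates either a shallower drop, a faster recovery, or both, which is why the same number can be compared across scenarios only if the performance measure and the time window are held fixed.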

Estimating the functionality therefore requires a suitable
intensity measure of the hazard, against which the
vulnerability of the system’s subparts can be evaluated
through their fragility. Combining this information
gives a measure of the damage to the system,
which should be transformed into one or several
performance measures in order to focus on the core
aspect of resilience: functionality of the system.
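
The chain from intensity measure, through fragility, to a performance measure can be sketched as follows. The text does not prescribe a fragility model, so a lognormal fragility curve (a common choice in earthquake engineering) is assumed here, together with independent component damage and a performance measure equal to the expected share of capacity that remains. All function names and parameter values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fragility(im: float, median: float, beta: float) -> float:
    """Probability of damage at intensity `im` (assumed lognormal curve)."""
    if im <= 0:
        return 0.0
    z = math.log(im / median) / beta
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_performance(im: float,
                         components: Sequence[Tuple[float, float, float]]) -> float:
    """Expected remaining fraction of capacity at intensity `im`.

    Each component is (capacity share, median capacity, dispersion beta).
    A component is assumed to deliver its share only if undamaged.
    """
    total_share = sum(share for share, _, _ in components)
    remaining = sum(share * (1.0 - fragility(im, median, beta))
                    for share, median, beta in components)
    return remaining / total_share


# Hypothetical subparts of a network: (share, median intensity, beta).
components = [(0.5, 0.30, 0.5), (0.3, 0.45, 0.4), (0.2, 0.60, 0.6)]
for im in (0.1, 0.3, 0.6):
    print(im, round(expected_performance(im, components), 3))
```

The output of such a step is already expressed as functionality rather than damage, which is the form needed for comparison with end-user expectations.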

Once the loss and recovery functions of the performance
measures are estimated, they should be evaluated
against other CI, historical performance or
the needs and expectations of the infrastructure’s 
end-users. It is therefore of vital importance to 
choose the performance criteria keeping in mind 
that: (i) they should be possible to translate from 
estimated damages, with sufficient accuracy and 
(ii) they should be constructed such that they can 
be compared to other CIs, historical performance 
or (preferably) the needs and tolerances from the 
end-user (Petersen, 2018). 


2.3.3 IMPROVER Organisational Resilience
Analysis (IORA)

IORA follows a similar structure to other organi- 
sational analysis methods. The purpose of the 
analysis is promoting resilient performance. Sub- 
sequent levels are functions, forms and processes 
which contribute to this purpose. The functions 
required to achieve this are: design of tasks and 
roles; design of the framework and its content, 
goals, rules, processes and procedures; strengthening
collaboration; learning and redesign; underlying
values and interpretations (Fig. 4).




The organisational resilience analysis process
requires collection and processing of information
about how the organisation’s processes contribute
to this purpose. For the Barreiro implementation this
is done via in-depth interviews based on a narra- 
tive of a historical event (saline intrusion in a fresh 
water well). Functions, forms and processes during 
this event form the basis for the analysis and the 
subsequent evaluation. 


3 PILOT IMPLEMENTATION ON 
POTABLE WATER SUPPLY NETWORK 


3.1 Test object


The object to be tested in the pilot implementation
comprises the ICI-REF, its supporting
methodologies for resilience analysis (CIRI, IORA
and ITRA) and the developed resilience indicators.
The test object will be denoted as ICI-REF in the
remainder of the document for simplicity.


3.2 Living lab: The potable water supply system 
of Barreiro 


Barreiro’s municipality, with an area of 36.41 km²,
has, according to the Census 2011, a population
of 78,764 people. It has a 17 km river front on the Tagus
and Coina rivers and an important road-rail-river 
terminal. It is located about 40 km from Lisbon to 
which it is linked by two bridges, and about 35 km 
from Setubal, the district capital. Barreiro’s pota- 
ble water supply system consists of 11 licensed 
ground-water intakes from a semi-confined aquifer, 
7 reservoirs for treated water storage with the total 
capacity of 12.750 m³, 7 treatment installations for
disinfection with the addition of sodium hypochlo- 
rite, 3 pumping stations, 5 blowers, 16.1 km of main 
ducts, and 308 km of meshed distribution pipes. 


Figure 4. Indicators on different abstraction levels in the IMPROVER Organisational Resilience Analysis methodology
(IORA).

The municipality has a remote management system
that allows real time monitoring of pressure and
flows in the water supply (and waste water) systems. 
The pilot implementation focuses on three pressure 
zones in the north, which combined account for 
60% of the total water supply in the municipality. 

Fig. 5 shows the area subject to the assessment. 


3.3 Systematic approach for testing and 
evaluating the performance of ICI-REF 


To make the pilot implementation robust, a triangulation
approach was used for testing and evaluating
the performance of ICI-REF. Triangulation is the
combination of two or more data sources, investigators,
methodological approaches, theoretical
perspectives or analytical methods within the same
study (Denzin, 1970; Kimchi et al., 1991). Using
multiple methods decreases the “deficiencies and
biases that stem from any single method” (Mitchell,
1986), creating “the potential for counterbalancing
flaws or the weaknesses of one method with
the strengths of another.” Therefore, a focus group,
documentation, field studies and surveys were used
to collect data for the critical evaluation of the per- 
formance of ICI-REF. The IMPROVER project 
embraces all these approaches in several steps and 
iterations for optimising ICI-REF. 


3.3.1 Collection of data 

A focus group, consisting of representatives from
the operator at the Barreiro living lab, was selected
based on their insight into current processes and
methodologies for risk assessment at the Barreiro
living lab. There has been close cooperation
between the focus group and the project team
throughout the project via continuous communication
and workshops. These were invaluable
in addressing strengths and weaknesses of ICI-REF
before the final pilot implementation. The
focus group, as a qualitative, exploratory research
method, has aided the understanding of not
only the operators’ opinions, but also how and why
they think the way they do.

Figure 5. The northern part of the Barreiro municipality
subject to the pilot implementation.

Field studies were performed for testing the
application of ICI-REF in a semi-real environment.
Field studies require detailed observation
and evaluation, allowing conclusions to be drawn
and the information generated from each site to be
compared (Burgess, 1984; Denzin & Lincoln,
2011; Rossman & Rallis, 2011). An advantage of
field studies is that they give better external validity
than laboratory experiments, because a field
experiment takes place in a typically occurring social
setting.

The field study relied on application of ICI- 
REF to a relevant hazard scenario. A scenario 
with high disaster risk was prioritised by structured 
expert judgement elicitation by the stakeholders. 
Fig. 6 shows a hazard map for the Barreiro living lab. 

The hazard chosen to assess the resilience of the 
water supply system was an earthquake with liq- 
uefaction, which is considered the highest disaster 
risk for the water network combining consequence 
and probability. The assets susceptible to the haz- 
ard are: 


— The reservoir, pipe system, pumps and the criti- 
cal users being the hospital and the health centre. 

— All technical equipment used to repair and to 
distribute redundant functionality. 


— All staff and the entire organisation and the
processes used in the preparatory, functional and
administrative work.

Figure 6. Plot of rank scores of six natural hazard scenarios
based on their likelihood to cause disaster and
to occur at Barreiro’s water network in the next 5 years
(Pursiainen et al., 2015). Axes: Rank Score (Likelihood of
Occurrence) versus Rank Score (Natural Disaster). Plotted
scenarios include seismic ground shaking and tsunami,
seismic ground shaking and liquefaction, heatwave leading
to water shortage, wildfire, and storm surge.

Documentation was collected in order to analyse
vital data from the CI, for example the safety
plans and organisation chambers. These documents
were used to assess the as-is situation of the
CI. Typically, the analysis aims at visualising the 
current state process to clarify how the CI process 
works today, and what can be done to improve the 
current situation. 

Different forms of surveys, aimed at the opera- 
tor, project team members and other stakeholders 
were used in advance, during and after the pilot 
implementation. By using surveys, a broad range
of data has been collected, e.g. tolerance levels,
attitudes, opinions, beliefs, values, behaviour and
factual information. The surveys were used as a basis both for
defining performance criteria for resilience assess- 
ment and for the critical evaluation of the perform- 
ance of ICI-REF. 


3.3.2 Critical evaluation 

Eighteen success factors were developed for the 
critical evaluation of the performance of ICI-REF. 
These ensure that ICI-REF meets stakeholders’ and
end-users’ needs and are designed based on continuous
input from the living labs during the project.
The design science research methodology (Hevner 
et al., 2004) is used for the critical evaluation proc- 
ess in which the success factors are evaluated based 
on demonstration results and applications of 
ICI-REF. 

The defined success factors of the project are 
primarily designed for critical evaluation of the 
overall ICI-REF framework, but they also implic- 
itly set requirements to the relevance and quality of 
the tested analysis methodologies with indicators. 
Examples of success factors related to indicators 
are shown in Table 1. 


3.4 Results from initial demonstrations 


CIRI, IORA and ITRA were all applied in the 
initial demonstrations. Evaluation from ITRA 
showed that the system most probably will meet the 
expectations of end-users for reasonable scenarios 
of damage. Also, despite the organisation being highly dependent
on key personnel resources, its flat structure
helps in fast recovery in times of crisis, as shown in
the IORA evaluation.

A set of resilience indicators was tested within 
the CIRI methodology and assessed, using a soft- 
ware tool developed in IMPROVER (accessed at: 
http://improver-inov.herokuapp.com/). The indicators
were discussed and evaluated by the operator
according to the indicators’ relevance and compre- 
hensibility as means of assessing their resilience. 


Table 1. Examples of success factors for critical evaluation
of the relevance and quality of resilience indicators.

Success factor | Defined by
The framework shall be applicable to all types of CI | The balance and definitions of indicators
The framework shall be easy to use | Clearly described and categorised indicators; guidance on how framework indicators can be interpreted in relation to resilient performance
The framework shall provide effective and coherent crisis and disaster resilience management | Resilience indicator follow-up should promote a shared view within the organisation on real work challenges
The framework is arranged for being revised continuously | Existence of a system for recurring analysis, criticism and revision of the indicator framework and implementation


Table 2. Scale for analysing perception of indicators by
the Barreiro operator.

Rating | Definition
A | The indicator was perceived, and there is evidence of the indicator
B | The indicator was perceived, but there is no evidence of the indicator
C | The indicator was not perceived
D | Not applicable


The operator was asked to assign resilience 
measurement scores to the indicators, and to rate 
them on the perception scale, according to how well 
they were understood by the operator. The scale 
for the perception ratings is presented in Table 2. 

The structure and processes of resilience analy- 
sis methodologies proved functional in the demon- 
stration. Based on the feedback from the operator, 
only minor modifications of the methodologies 
were required to optimise the relevance of the anal- 
ysis results towards the main pilot implementation. 
The operator expressed the need for user-friendly, 
clear and not too complex assessments. They fur- 
ther concluded that the structure is not the main 
point of interest to the living lab, but the func- 
tionality of the assessment process, and the ques- 
tions and goals related to the indicators. An issue 
pointed out by the living lab is which resources are 
required to perform the assessment; i.e. whether 
they need to employ external resources or can train 
internal resources.

The operator emphasised the crucial role of
resilience indicators in the monitoring of resilience 
activities. However, to ensure objective, consistent, 
repeatable and representative results, the indicators 
and their designed questions must be defined using 
unambiguous terms. As long as the indicators are 
well described and leave little room for subjectiv- 
ity, the high number of indicators is not a problem. 
The need for guidelines was also expressed. 

Challenges related to the definitions of meas- 
urement scales and assignments of weights for 
qualitative or semi-quantitative indicators were 
pinpointed. For example, the measurement scale used to
assess the indicators should be well-defined since 
it is mandatory to understand the differences 
between the different measurement scales to per- 
form benchmarking. It was also discussed how 
flexible the indicator structure should be; e.g. if 
CI operators shall be allowed to define their own 
scales and weights, and how this will affect the 
assessments and limit their relevance. 

The operator perceived some of the indicators
as too vague or in need of better explanation,
and found it difficult to point out evidence for
others. This led to adjustments and development
of the overall set of indicators and how they are
presented.

To address the need for proper descriptions and 
definitions of sector-specific indicators, “indicator 
cards” were developed for the complete developed 
set of technological and organisational resilience 
indicators at the lower CIRI level. Each individ- 
ual resilience indicator card provides a detailed 
description of the sector-specific indicator subject 
to assessment as exemplified in Fig. 7. 

The indicator cards consist of the following 
information: 


— The assessed indicator and its parent indicators 
are listed. 

— Detailed information about the context is given. 
The resilience domain (technological or organi- 
sational), hazard types (natural, non-malicious 
man-made, malicious man-made and multi- 
hazards) and situational factors (e.g. temporal, 
geographical or conceptual considerations for 
taking such an indicator into account) are indi- 
cated. Finally, the applicable sector (in this case 
potable water supply) is pointed out and if the 
indicator is generic or scenario specific. 


— A description of the indicator and guidance for
assessing the maturity level are provided through a
rationale of why this indicator is justified. Moreover,
a question is provided, which can be asked to the
operator for measuring the indicator in a clear
and explicit manner, together with the 6 different maturity
levels described (scale 0-5) and a reference for
describing the indicator.
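
The content of an indicator card listed above maps naturally onto a small record type. The following is a minimal sketch, assuming one possible set of field names; the class `IndicatorCard`, its fields, and every value marked as a placeholder are hypothetical and are not taken from the IMPROVER software tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IndicatorCard:
    """One sector-specific indicator card (field names are illustrative)."""
    name: str
    parents: List[str]            # parent indicators, highest level first
    domain: str                   # "technological" or "organisational"
    hazard_types: List[str]       # e.g. natural, malicious man-made
    sector: str
    scenario_specific: bool       # False means the indicator is generic
    rationale: str
    question: str
    maturity_levels: List[str]    # six descriptions, index = score 0..5
    reference: str = ""
    situational_factors: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.domain not in ("technological", "organisational"):
            raise ValueError("domain must be technological or organisational")
        if len(self.maturity_levels) != 6:
            raise ValueError("exactly six maturity levels (0-5) are expected")

    def describe(self, score: int) -> str:
        """Return the description of the given maturity score."""
        if not 0 <= score <= 5:
            raise ValueError("score must be between 0 and 5")
        return self.maturity_levels[score]


# Example values: the name, parent and question echo the card shown for
# Barreiro; domain, hazard types and level texts are placeholders.
card = IndicatorCard(
    name="Availability (SIRESP)",
    parents=["Interoperable information and communication technology"],
    domain="technological",                      # placeholder assumption
    hazard_types=["natural"],                    # placeholder assumption
    sector="potable water supply",
    scenario_specific=False,
    rationale="Emergency communication supports coordination.",
    question="What is the availability level of the SIRESP system?",
    maturity_levels=[f"placeholder description for level {i}" for i in range(6)],
)
print(card.describe(3))
```

Validating the domain and the number of maturity levels at construction time is one way to keep cards consistent across sectors, which matters if the results are later compared between operators.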


Figure 7. Indicator card for a sector-specific indicator
at a lower CIRI level (here level 4), showing its parent
indicators, context, description and measurement scale.
The example card shows the indicator “Availability (SIRESP)”
under “Interoperable information and communication technology”
for the water supply sector, with the description “SIRESP is the
emergency communication system used in Barreiro, and it is
crucial for effective coordination and exchange of information
during emergency periods.” and the question “What is the
availability level of the SIRESP system?”


3.5 Results from pilot implementation 


Based on the feedback from the initial demon- 
strations, ICI-REF was optimised, and the pilot 
implementation was conducted. 

The resilience assessments resulted in suggestions
for resilience treatment. Raising public awareness
as well as training were pinpointed by all
three tested methodologies. Application of CIRI
resulted in recommendations to prevent silos,
i.e. to have quick and easy cooperation between
management and people in the field and to have
structures in place to ensure this. Application of
IORA identified that such a characteristic
already existed in the Barreiro organisation, although
more as an unofficial way of working. This demonstrates
the ability of the different methodologies
to bring up different levels of detail and different
perspectives.

The performance of ICI-REF in the pilot
implementation is currently being critically evaluated
with regard to the success factors. Generally,
indicators were perceived by the operator as clearly
described and easy to interpret, which suggests that the
adjustments made after the initial demonstrations
improved ICI-REF significantly. When weighting
the indicators, information from the operator
could be valuable with regard to the importance
of an indicator, but at the same time, it is important
that the indicators are not biased towards the
operator. The operator was of the opinion that
ICI-REF can be valuable both as an internal audit 
tool and also in everyday work, and that it was 
useful in promoting reflection around resilience 
of the organisation and resilience treatment. The 
ability to compare results with other operators in 
the same sector outside of Portugal would also be 
useful for benchmarking purposes. 

In order to provide an overall resilience score, all 
relevant indicators must be assessed. Although the 
pilot implementation assessed only a sample set of 
all the defined indicators, the operator found the 
results valuable for prioritising future work and 
development within their organisation. 


4 DISCUSSION AND WAY FORWARD 


After the initial demonstration of ICI-REF at the 
Barreiro living lab, the operator was of the opinion 
that indicators are crucial for monitoring resilience 
activities. However, to ensure that the assessment 
results are objective, consistent, repeatable, and 
representative of the assessed processes, the indica- 
tors should be defined using unambiguous terms. It 
was strongly suggested that clear questions should 
be asked for the operator to better understand 
what the indicator is assessing. The main potential 
for improvements of ICI-REF therefore lies in the 
design of sector-specific resilience indicators. 

The indicators must not only be comprehensi- 
ble and clear, but also at the same time leave some 
room for site-specific information. The degree of 
indicator specificity has been discussed with several 
living labs through the project, and the need for a 
balance between generality and specificity has been 
emphasised. If an indicator is too general, this may 
reduce the ability to detect details or new areas of 
resilience improvements. On the other hand, infor- 
mation about the specific CI can also be lost if the 
indicator is too detailed, which can make a further 
comparison with similar CI less relevant. 

Regarding the measurement scales for the sec- 
tor-specific indicators, it is not only challenging to 
define the scales, but it may also be challenging to 
assign quantitative value to a qualitative indica- 
tor without introducing subjectivity. The operator 
should therefore provide evidence and comments 
to support their assigned values for each indicator. 

Despite the challenges in defining the indicator
scales, weights and degree of specificity, the
need for including sector-specific indicators is
unquestionable. Guidelines or references should
describe at which level the indicators’
scales and weights should be defined to
ensure legitimacy. Benchmarked indicators exist
within certain CI sectors. Although it may not be
a requirement for a CI to compare to similar CI,
the living labs have expressed the wish to perform
such a comparison at a regional level. However, for
indicators that are not benchmarked, the comparison
between similar CI will not be applicable if the
operators themselves define scales and weights.

The indicator cards for the Barreiro living lab 
were successfully tested in the pilot implemen- 
tation, as the indicators were considered well 
described and easy to assess and respond to. Indi- 
cator cards are now being developed for applica- 
tion in the next pilot implementation at another of 
the living labs in IMPROVER: the M1 highway in
Budapest, Hungary. 


5 CONCLUSION 


The ICI-REF, technological and organisational 
resilience analysis methodologies and indicators 
have been applied and demonstrated in a pilot 
implementation, focusing on the potable water 
supply in Barreiro, Portugal. These have been 
developed with the aim to smoothly extend cur- 
rent risk management practices into a resilience 
management framework. A set of technologi- 
cal and organisational resilience indicators has 
been designed and described in “indicator cards”. 
Efforts are made to improve the clarity of defini- 
tions and descriptions of resilience indicators, 
since unambiguous description of indicators is 
crucial for monitoring resilience activities. Based
on the feedback from the Barreiro living lab during
the project, the initial demonstrations and the pilot
implementation, the ICI-REF and the developed
resilience analysis methodologies proved functional
according to preliminary results.

They are now ready to be fine-tuned towards the 
next pilot implementation in IMPROVER. This 
focuses on the M1 highway in Budapest, Hungary, 
covers several scenarios and will be finalised in 
2018. Combining results from the two pilot implementations
allows evaluating the performance of
the ICI-REF framework, methodologies and indicators
across different CI sectors and contexts. Based
on this, European guidelines for the resilience
management of CI will be developed, addressed
both to CI operators and policy makers.

ABSTRACT: No consensus currently exists on how to measure and evaluate Critical Infrastructure (CI)
resilience. Attempting to use the public’s declared coping capacity as a target for CI resilience, this paper
explores how to develop relevant resilience performance measurements that enable comparison to the tolerance
levels of the general public. To do so, one must first establish the normal performance of the system
and the applicable performance measures. Then, a survey is used to convert public perception into these
measures so as to enable comparison with the technical resilience performance. The CI resilience will be presented
through a family of so-called resilience triangles which will illustrate the evolution of the performance
before, during and after a crisis event. A case study of the Municipal Water Network of Barreiro,
Portugal, is used. The overall performance is preferably described with the categories quality, quantity and
delivery. In quantifying the performance, the importance of what is being assessed, to what hazard and for
which end-user became evident.


1 INTRODUCTION 


Critical infrastructure (CI) resilience is often 
defined as the ability to maintain a minimum 
acceptable level of service and the ability to rap- 
idly restore full service in relation to a crisis event 
on the CI system. However, no consensus cur- 
rently exists on how to measure these elements. 
Furthermore, most existing methodologies do 
not take into account human factors arising from 
the society which the CI serves. Since the general 
public, end users of CI services, appear to have 
reasonable expectations of CI operators in cri- 
sis times (Petersen et al., 2017), we suggest using 
their declared coping capacity as a reference for 
CI resilience measurements. In order to do so,
one must convert public perception into measurable
indicators that are comparable to the
technical performance of the service, and examine
how these indicators can be measured, forecast
or assessed over the course of an event. Thus, this
paper explores how to identify relevant comparable
performance measurements for the CI case
of a water distribution network. The methodology
presents the resilience measures through a
family of so-called resilience triangles, which
illustrate the evolution of the performance before,
during and after a crisis event. A resilience triangle
(Bruneau et al., 2003) is shown schematically in
Figure 1.

The performance, Q, quantifies the performance
of the system. For many CI systems, it is normal
that the system performance decays slowly over
time as a result of aging, reflected by the difference
between Q, and Q,. A sudden drop in the performance
represents the effect of a sudden shock
to the system. How deep and how steep the drop is
depends on how well the system
is able to absorb the initial shock (minimizing
the impact) (Q, to Q,) and respond to it (Q, to Q,).
The size of the triangle (marked in red) further
depends on the system’s capacity to recover (Q, to Q,).
The total performance loss of the system over the
duration of the initial incident and recovery is,
however, partially an indirect result of how well
the system has previously addressed risk
assessment, prevention and preparedness. The gain
in performance over time after the performance
loss reflects how well the system recovers from the
shock and eventually learns from it. The
learning phase could lead to
increased system functionality compared to before
the shock, due to the renewal involved in restoring the
system, but also from the learning experience itself.

Figure 1. A schematically drawn “resilience triangle”
illustrating the performance, Q, of a system over the course
of a crisis event.
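
The size of a resilience triangle can be expressed as a single number. Bruneau et al. (2003) define the loss of resilience as the area between the normal performance level (100%) and the performance curve Q(t) over the duration of the event. The sketch below computes that area with the trapezoidal rule; the function name and the example curve are illustrative assumptions, not values from this study.

```python
def resilience_loss(times, performance, normal=100.0):
    """Area between the normal level and Q(t), using the trapezoidal rule.

    times       -- increasing time stamps (e.g. days); repeated values model a sudden drop
    performance -- Q(t) in percent of normal performance at each time stamp
    """
    if len(times) != len(performance) or len(times) < 2:
        raise ValueError("need two equally long sequences with at least two points")
    loss = 0.0
    for i in range(len(times) - 1):
        width = times[i + 1] - times[i]
        gap_start = normal - performance[i]
        gap_end = normal - performance[i + 1]
        loss += 0.5 * (gap_start + gap_end) * width
    return loss


# Hypothetical triangle: full service, a sudden drop to 40% at day 0,
# then linear recovery to 100% over 10 days.
example_times = [-1.0, 0.0, 0.0, 10.0]
example_q = [100.0, 100.0, 40.0, 100.0]
example_loss = resilience_loss(example_times, example_q)  # in percent-days
```

A smaller area corresponds to a more resilient system, either because the drop is shallower or because recovery is faster.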

A case study of the Municipal Water Network of
Barreiro, Portugal, is used to illustrate the development
of the performance measures needed to create
the resilience triangles. The network is the subject of a pilot
implementation of a CI resilience management
framework developed within the IMPROVER
(Improved risk evaluation and implementation
of resilience concepts to critical infrastructure)
project, funded under the Horizon 2020 programme.
The implementation explicitly investigates public
tolerance levels through a survey. The overall
performance is categorized in
terms of quality (suitability for drinking and cooking),
quantity (the amount of water accessible
to the public) and delivery (the amount of water
delivered to the tap) of water. In quantifying
these, the importance of understanding what is
being assessed, to what hazard and for which end-user
became evident. After describing the Municipal
Water Network of Barreiro, the paper
describes the development of performance measures,
followed by a first look at the preliminary
results of the resilience assessment through the
resilience triangles.


2 BACKGROUND ON THE BARREIRO 
LIVING LAB CASE STUDY 


The city of Barreiro is part of the Lisbon metropolitan
area, located on the south bank of the
Tagus River estuary, about 40 km from the city
of Lisbon. It has a population of almost 80 000
people and an area of 36.41 km². Ferries and two
bridges connect Barreiro to Lisbon.


2.1 The municipal water network 


The Barreiro Municipal Water Network delivers
potable water to the municipality of Barreiro and
serves all its inhabitants and industries. It has an
annual water flow of 6,200,000 m³. The
drinking water comes exclusively from an underground
aquifer (Barreiro Municipality, 2009). The
water supply system in Barreiro consists
of 11 licensed groundwater intakes from a semi-confined
aquifer, 7 reservoirs for treated water
storage, 3 pumping stations, 5 blowers, 16.1 km of
mains, 263 km of ordinary pipes and a considerable
number of service connector pipes. The pipe
system is made mostly of fibre cement (FC),
with some parts of PVC (polyvinyl chloride) and
PE (polyethylene).

The Barreiro municipality is divided into 7
pressure zones. This study is limited to
the resilience of 3 of these water supply zones,
which together serve about 60% of the population.
Within the three pressure zones considered for this
study there are three water storage units:
one high storage tank, the Alto da Paiva high
tank (SNZM Reservoir), and 2 semi-buried tanks,
Alto da Paiva (SNZB1 Reservoir) and Sete Portais
(SNZB2 Reservoir). The reservoirs have a reserve
capacity of 12 750 m³, which supplies the population
for about 24 hours based on volume.

Several hazards may affect the water network
in Barreiro, including earthquakes (leading
to severe ground shaking or liquefaction), droughts
and heatwaves. A historical example can be found
in 1969, when Barreiro’s water network sustained
moderate damage due to a magnitude 6.8 earthquake,
which led to the unavailability of potable
water for 24 hours. More recently, in 2012, rice
and cereal agriculture in Setúbal were affected by a
water shortage (Ioannou et al., 2016).


3 DEVELOPMENT OF THE
METHODOLOGY


3.1 Interview with living lab 


The first step in developing a methodology to
measure resilience was to interview the living lab.
Semi-structured interviews were held with employees
from the Municipal Water Network in order to
better understand how the system functions
and how they would act in the case of a crisis.
Specifically, questions were posed regarding the
technical details of the system, recovery time estimates,
and emergency/contingency plans.


3.1.1 Results 

The Barreiro Municipal Water Network operators
provided models of their different pressure zones
in which all the details of pipes, valves, reservoirs and
tanks are stored. The models can solve the hydraulic
flow through the whole system using the EPANET
freeware (Rossman, 2000), provided by the United
States Environmental Protection Agency. The
operators of Barreiro especially pointed out two
assets that are crucial for the overall supply of
water in the system: a supported, semi-buried
reservoir and one critical pipe, the General Distributor
Conduit (DN350). The supported reservoir
of Alto da Paiva supplies the zones Zona
Baixa 1 (SNZB1) and Zona Media (SNZM), while
the General Distributor Conduit, a DN 350 mm
fibre cement pipe, supplies Zona Baixa 2 (SNZB2).
Figure 2 shows these assets together with the water
network in the study site.

When discussing recovery times and actions, the
operators informed us that if there is a water outage
longer than 24 hours, regardless of the cause,
they intend to use two water tanks of 80 m³ capacity
provided by the Barreiro Firefighters to provide
water to the public. However, the water will need
to be boiled. They estimate that within 30 minutes
they would be able to make the tanks available
to the public if using their own water sources.
However, if the crisis were larger and water had
to come from neighboring municipalities or
elsewhere, it could take a few hours, depending on
road accessibility. They can also
request assistance from the district civil protection
command and draw on a collaboration
between the firefighting corporations of
the district, a total of 25 teams with about 1500
to 2000 m³ (estimated values) of total water supply
capacity. However, the availability of these resources
depends on the other duties of the firefighting corporations. If an
earthquake also affects the neighboring municipalities,
the assistance could come from either the
north bank of the Tagus River, Lisbon, or even
further south, using the same means. The operators are
currently in the process of developing their Water
Safety Plan, which will include both emergency and
contingency plans.

Figure 2. Barreiro Study Site Pipe network and
Reservoir.

Based on the hazard assessment and interest of 
the operators, a scenario of liquefaction following 
a severe earthquake is considered for this study. 


3.2 Performance measures 


In order to define the performance measures, the
first step was to define the normal performance.
For this case, normal performance was considered
to be normal domestic water use, covering consumption
(drinking and cooking), hygiene (both
personal and domestic cleanliness) and amenity
use (e.g. watering plants, washing bikes) as listed by
WHO (Howard & Bartram, 2003). In Barreiro the
average water consumption to cover this is about
200 L/person/day. To survive and to avoid an
outbreak of disease due to lack of sanitation, the lowest
acceptable quantity of water is approximately
20 L/person/day (Cousins, 2013; Kameda, 2000;
Mowll, 2012). In this study we have assumed that
this is the minimum requirement for the first 3 days
and that the requirement thereafter increases
to 50 L/person/day, in line with the
WHO recommendations.
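
The assumed minimum requirement can be written as a simple step function of time. The sketch below uses the three quantities stated above (200, 20 and 50 L/person/day); treating "the first 3 days" as days 1 to 3 inclusive, and the function names, are assumptions made for illustration.

```python
NORMAL_USE_L = 200.0      # average domestic use in Barreiro, L/person/day
SURVIVAL_L = 20.0         # minimum requirement during the first days
LATER_MINIMUM_L = 50.0    # minimum requirement thereafter
SURVIVAL_DAYS = 3         # assumption: days 1, 2 and 3 count as "the first 3 days"


def minimum_requirement(day):
    """Minimum litres per person per day on a given day after the event (day 1 = first day)."""
    if day < 1:
        raise ValueError("day must be 1 or greater")
    return SURVIVAL_L if day <= SURVIVAL_DAYS else LATER_MINIMUM_L


def fraction_of_normal(litres_per_person_per_day):
    """Express a supplied quantity as a fraction of normal domestic use."""
    return litres_per_person_per_day / NORMAL_USE_L


first_week = [minimum_requirement(day) for day in range(1, 8)]
```

Expressed this way, the early minimum corresponds to 10% of normal use and the later minimum to 25%.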

In this study, the performance of the water system
was quantified, for each performance measure,
as the percentage of the population that receives
the service at the same level as on a
“normal day” before the earthquake. A similar study on the Los
Angeles water service restoration following the
1994 Northridge earthquake (Davis et al., 2012)
defines five performance measures of the system:
quality, quantity, delivery, firefighting and functionality.
While the first three are measures directly
impacting the public, functionality describes how
the system performs its function in terms of efficiency,
durability, sustainability and economics.
This measure is not dealt with in this study. The
availability of extinguishing water for firefighting
is assumed to be covered by the abundant access to
water around the municipality from the Tagus River
and is not covered here. Thus, the three performance
measures in this study are:


1. Water delivery: Percentage of the population
served by the pipe system through water on tap
(the water delivered may not meet the quality or
quantity requirements).

2. Water quality: Percentage of population that 
have access to water at drinkable standards (not 
needing to boil the water). 

3. Water quantity: Litres of water available per 
person per day. 


3.2.1 Estimating the performance measures 

First, an estimated time for service restoration is 
needed. To do so, the repair time was divided into 
two groups: emergency response time and recovery 
time. The emergency response time covers the period 
in which search and rescue is highly prioritized. This 
also includes work for road accessibility etc. Recov- 
ery time covers the period from when the repair 
starts until the entire network is repaired. These two 
times overlap and this is dealt with according to pri- 
oritizations inspired by the Oregon resilience plan 
(OSSPAC, 2013) and from literature on experiences 
and lessons learned from previous earthquakes 
(Bragado, 2016; EERI, 2007; Eidinger & Davis, 
2012; Pedroso et al., 2013; Mowll, 2012). Based on 
this, a simple spreadsheet model is built which can 
estimate the emergency response time. The recov- 
ery time is estimated from simulations of perturbed 
EPANET models, as well as historical data for repair 
times for different pipe materials. These combined 
methods allow us to measure the water delivery and 
water quantity. The water quantity is defined as the 
total volume of water delivered on tap or to com- 
munity service points for the public to carry home. 
This water is delivered either through tanks or, at 
later stages, also through pipes connected to these 
service points. At even later stages, the quantity also 
includes water being delivered to homes through the 
water distribution pipe system. Water delivery is calculated
as the total amount of water that can
be extracted from the network through nodes connected
to households or other buildings as the system
is repaired. Nodes that are connected
to a water source can often deliver the same capacity as before
the incident. We assume that the “normal” delivery
at each node is proportional to the number of
people served at this point. Thus, the delivery at all
connected nodes relative to the delivery before the incident is
assumed equal to the fraction of people served
by the connected nodes. Based on the interview with
the operators, the water quality indicator will need
to remain at 0% of normal performance (i.e. water
needs boiling before consumption but is suitable
for washing) until the system has been repaired and
thoroughly flushed.
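
The delivery assumption above, that relative delivery equals the fraction of people served by nodes still connected to a source, can be sketched with a plain graph search. This is a simplification for illustration only: it checks connectivity and ignores pressure and flow, which the study obtains from EPANET. The node names, pipe list and population figures are hypothetical.

```python
from collections import deque


def delivery_fraction(pipes, people_at_node, sources):
    """Fraction of the population at nodes that are still connected to a water source.

    pipes          -- iterable of (node_a, node_b) pairs for pipes still in service
    people_at_node -- dict mapping demand node -> number of people served there
    sources        -- iterable of source nodes (reservoirs, tanks)
    """
    adjacency = {}
    for a, b in pipes:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    # Breadth-first search from all sources at once.
    reached = set(sources)
    queue = deque(reached)
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in reached:
                reached.add(neighbour)
                queue.append(neighbour)

    total = sum(people_at_node.values())
    if total == 0:
        return 0.0
    served = sum(n for node, n in people_at_node.items() if node in reached)
    return served / total


# Hypothetical network: reservoir R feeds nodes A, B and C in series.
people = {"A": 100, "B": 300, "C": 600}
intact = [("R", "A"), ("A", "B"), ("B", "C")]
damaged = [("R", "A"), ("B", "C")]          # pipe A-B has been removed
before = delivery_fraction(intact, people, ["R"])
after = delivery_fraction(damaged, people, ["R"])
```

In the damaged case only node A remains connected, so the delivery measure drops to the share of people served at A.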


3.3 Public expectations


Previous research on expectations of and satisfaction with
water service disruptions has shown that attitudes
are not very strongly held on this subject matter
(Vloerbergh et al., 2007). Most previous research
does not deal with expectations/tolerances/ 
satisfaction during times of crisis, but instead with 
“normal times” or planned works (Speers et al., 
2002). As mentioned above, previous work within 
the IMPROVER project has found that the general 
public, end users of CI services, appear to have rea- 
sonable expectations of CI operators in crisis times 
(Petersen et al., 2017). As such, we suggest using 
their declared coping capacity as a target for CI 
resilience. In order to do so the public perception 
must be discovered in a way that is comparable to 
the technical performance measures of the service. 
To our knowledge, no studies comparing expecta- 
tions to performance measures currently exist. 

A common way to evaluate public expectations/ 
user satisfaction for water operators is using a 
“willingness to pay” model. However in a disaster 
situation, this is not an appropriate measurement 
method as people need to have access to water, 
being a basic human need, regardless of the cost. 
As such, we propose to use a questionnaire in order 
to determine the public’s coping capacity. Indeed, 
people themselves have been found to be good 
judges of their ability to deal with disturbance and 
change (Nguyen et al., 2013) and the idea of peo- 
ple being able to accurately judge themselves has 
been used in other domains, ranging from psychol- 
ogy to well-being and climate change adaptation 
(Jones & Tanner, 2015). Furthermore, previous
research into water customer preferences has also
used questionnaires to establish coping capacity
in times of water shortage (Vloerbergh et al., 2007;
Speers et al., 2002).


3.3.1 Development of the survey 

The survey was developed with a view to the real-world
performance capabilities of operators that
were established in the interview. The performance
measures were not asked about directly, meaning
we did not use the terms delivery, quality or
quantity. Instead, situational, layman’s terms were
used to increase understanding of the survey (for
more on the importance of the understandability
of questionnaires, see OECD, 2013). The following
paragraphs describe the questionnaire.

The respondent is presented with the following
scenario: “Imagine that a high magnitude earthquake
occurs, where a large part of the population
is left without access to potable water on tap without
any previous warning”. Next, the respondent is
reminded of the various needs for water following
an earthquake (drinking, hygiene (showering,
flushing toilet), cooking and cleaning). Then, the
same measures as used for technical performance
are used to create questions to find public
expectations/tolerance levels for reduced service. This took
the form of three questions. The first question
deals with the tolerance for recovery times of delivery
by asking how long the respondent is willing
to tolerate water being delivered via tanks (“Water
would have to be delivered in tanks. How long would
you tolerate these conditions?”). The next question
addresses the recovery time tolerance level for
quality by asking how long the respondent would
be willing to boil water before drinking (“How long
would you tolerate having to boil the water before
drinking it?”). Lastly, to address the quantity indicator,
respondents are asked how long they will tolerate
having only X amount of water (going from
10 L to 100 L) per person per day (“How long will
you tolerate having only the following amount of
water per person per day?”). Respondents were also
presented with some examples of water consumption
(washing dishes by hand uses about 25 L, taking
a 5-minute shower uses about 35 L). The time
frames proposed to the respondent come not
only from historic examples and satisfaction surveys,
but also from realistic time frames for operators given
in the interview. The proposed times are <12 h;
12-24 h; 1-2 days; 3-4 days; 5-6 days; 1 week; 2
weeks; 2 weeks to 1 month; more than 1 month.
The questionnaire also asked about respondents’
demographics and satisfaction levels with the current
water service.


3.3.2 Dissemination plan 

The questionnaire was designed as a telephone
questionnaire. A sample of 1,005 Barreiro residents,
representative in terms of age and gender
(confidence level of 95.5% and margin of error
of +/-3%), was interviewed. The questionnaire
was carried out by Pitagorica — Investigação e 
Estudos de Mercado SA. Respondents were inter- 
viewed in Portuguese. A few face-to-face interviews 
were also held, as there were issues finding enough 
young adults to answer the questionnaire via the 
telephone. Data collection was from 11 October 
2017 to 5 November 2017. 
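
The reported sampling figures can be checked against the standard margin-of-error formula for a proportion. The sketch below assumes simple random sampling, the most conservative proportion p = 0.5, and z = 2 (two standard deviations, which corresponds to a confidence level of about 95.5%); the optional finite population correction uses a population of 80 000 as a rough figure for Barreiro. None of these choices is stated in the survey description, so they are assumptions.

```python
import math


def margin_of_error(sample_size, z=2.0, proportion=0.5, population=None):
    """Margin of error for an estimated proportion under simple random sampling."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    moe = z * math.sqrt(proportion * (1.0 - proportion) / sample_size)
    if population is not None:
        # Finite population correction for sampling without replacement.
        moe *= math.sqrt((population - sample_size) / (population - 1))
    return moe


moe_infinite = margin_of_error(1005)
moe_barreiro = margin_of_error(1005, population=80_000)
```

Under these assumptions both values come out slightly above 3%, which is consistent with the rounded margin of error reported for the sample.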


4 APPLICATION OF THE METHODOLOGY 


4.1 Technical resilience analysis: performance 
measures resilience triangles 


The performance is briefly described below. The
details of the models and assumptions involved
will be further described in future publications by
the same authors. Here, we show the result of one
event only: a magnitude 7 earthquake with peak ground
acceleration of 0.21-0.36 g, which has a probability
of 4-10% in 50 years for the specific location. It
should be noted that the results shown here are
preliminary and the final results will be subject to
review throughout the complete pilot implementation.
They serve, nevertheless, as a good example
of how to compare estimated performance and
public expectations.

The quantity of water during the response phase,
in which resources are focused on rescue, clearing
roads and electricity, is limited to what is in tanks and
what is stored in people’s homes and stores. It is
conservatively set to 5 L/day/person, assuming
that all water in the storage tanks can be placed at
community service points to reach the full population.
After 24 hours water will be transported from
adjacent regions via eight tank trucks of 40 m³ to
service points (~40 L/person). Since it is difficult
to carry more than 20 L of water over long distances,
also considering that people have to carry water for others
(children, elderly etc.), a reasonable estimate
of water availability when it is not delivered on tap,
based on previous experience, is 20 L per person for
the first days and twice that as needs become more
pressing. While the network is being repaired, the
performance of the system is modelled by modifying
and repeatedly running the EPANET model. Using
fragilities of ductile and brittle pipes based on
historical seismic events (Eidinger et al., 2001; Shih
& Chang, 2006), we assigned each pipe a fragility
to liquefaction following earthquakes, describing
the number of breaks in a pipe of a certain material
and size per unit length. The probability of a pipe
breaking due to an event is given as a function of the
number of breaks per unit length of the pipe and its length.
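
One common way to turn a break rate and a pipe length into a break probability is a Poisson model, in which the probability of at least one break is 1 - exp(-rate x length). The sketch below uses that form as an assumption; it is not necessarily the expression used in this study, and the rate and length values are hypothetical.

```python
import math


def break_probability(breaks_per_km, length_km):
    """Probability of at least one break, assuming breaks follow a Poisson process.

    breaks_per_km -- expected number of breaks per km for the pipe's material and size
    length_km     -- pipe length in km
    """
    if breaks_per_km < 0 or length_km < 0:
        raise ValueError("rate and length must be non-negative")
    return 1.0 - math.exp(-breaks_per_km * length_km)


# Hypothetical brittle pipe: 0.5 breaks/km over 2 km.
p_brittle = break_probability(0.5, 2.0)
```

With this form, the probability rises with both the break rate and the length but never exceeds 1, which makes it directly usable in a random draw per pipe.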

The pipes are then removed from the EPANET
model by comparing random numbers with
the probability of failure. The flow in nodes connected
to removed pipes is set to zero. Large negative
pressures at nodes as a result of the perturbed
network also result in the removal of those nodes and
any connected pipes. Running the model for the
pressure and flow in the new system yields, for example, the
demand (water extracted) at each node, which is
summed over all nodes. This is interpreted as a
measure of the delivery of water. The pipe model
takes into account only mains and not service lines
from the street to the tap, which are usually the
responsibility of the property owner.
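
The random removal step can be sketched as one uniform random draw per pipe, compared with that pipe's failure probability. The sketch covers only the sampling; the hydraulic run in EPANET that follows is not reproduced. The pipe identifiers and probabilities are hypothetical.

```python
import random


def sample_failed_pipes(failure_probability, rng):
    """Return the set of pipes that fail in one random realisation.

    failure_probability -- dict mapping pipe id -> probability of failure in the event
    rng                 -- a random.Random instance, so that runs can be reproduced
    """
    # A pipe fails when its uniform draw in [0, 1) falls below its failure probability.
    return {pipe for pipe, p in failure_probability.items() if rng.random() < p}


# Hypothetical probabilities for three pipes.
probabilities = {"main_1": 0.05, "fc_17": 0.30, "pvc_4": 0.10}
rng = random.Random(2018)
one_realisation = sample_failed_pipes(probabilities, rng)

# Repeating the draw many times shows the spread caused by the random numbers.
runs = 10_000
failures_fc_17 = sum("fc_17" in sample_failed_pipes(probabilities, rng) for _ in range(runs))
observed_rate = failures_fc_17 / runs
```

Repeating the draw, as in the last lines, is the basis of a probabilistic analysis: each realisation gives a different perturbed network, and the spread of the resulting performance curves reflects the uncertainty in the damage.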

Starting with the pipes of high priority (serving
the hospital and community service points) and
then according to their distance to reservoirs and
tanks, the pipes are then replaced in the model, which is run
repeatedly to investigate the performance of
the network until the network is back to its original
state (note that other repair strategies are being
considered in the ongoing full study). The time to
replace a pipe is treated as a function of its material
and size based on empirical data (Porter, 2016)
as well as the interviews with the operators. In all,
several hundred models are run, controlled
from Matlab. The result presented here in Figure 3
is very preliminary.

Figure 3. A simulation of the delivery to tap after the
event using analytical models and EPANET modelling
with fragilities for each pipe.
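
The repair ordering described above, high-priority pipes first and then by distance to reservoirs and tanks, can be sketched as a sort followed by a running clock. The sketch assumes a single repair crew working sequentially and a repair start 48 hours after the event; both are simplifying assumptions, and the pipe data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BrokenPipe:
    name: str
    high_priority: bool          # serves the hospital or a community service point
    distance_to_source_m: float  # distance to the nearest reservoir or tank
    repair_hours: float          # depends on material and size


def repair_schedule(pipes, start_hour=48.0):
    """Return (pipe name, completion time in hours after the event) in repair order."""
    ordered = sorted(pipes, key=lambda p: (not p.high_priority, p.distance_to_source_m))
    schedule = []
    clock = start_hour
    for pipe in ordered:
        clock += pipe.repair_hours
        schedule.append((pipe.name, clock))
    return schedule


broken = [
    BrokenPipe("P1", False, 100.0, 6.0),
    BrokenPipe("P2", True, 900.0, 12.0),
    BrokenPipe("P3", False, 50.0, 4.0),
]
schedule = repair_schedule(broken)
```

Each completion time marks a point at which the network model would be re-run with the repaired pipe restored, giving one step of the recovery curve.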


For the magnitude of earthquake considered
here the response phase takes 2 days, after which
the repairs on the network start. After 48 hours,
water distributed through some parts of the network will
add to the water delivered at service points. As
the network is repaired and additional nodes
become connected, water is not uniformly distributed
throughout the municipality. Instead, water is
delivered with close to normal service to the limited
number of people with access to it. There is likely
a higher quantity capacity in the system than can
be delivered.

It will be recommended that all water collected 
from service points or extracted from tanks should 
be boiled before consumption and, as such, the 
quality will not meet the drinkable water standards 
until the complete network is properly flushed. 


4.2 Evaluation criteria: tolerance triangles 


Preliminary results from the survey have been used
to create tolerance triangles. Quantity is described
as an average of the amount of water accepted,
weighted by the lowest fraction of the population
tolerating this (people tolerating 20 L also tolerate
40 L); see Figure 4. Quality and delivery are defined
as the percentage of the population tolerating the
action (boiling and collecting from a service point,
respectively). The expectation resilience triangles
are then created from the accumulated percentages
of respondents willing to tolerate a given time
frame.
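
The accumulation step can be sketched as follows: each respondent states the longest time frame they would tolerate, and the share of respondents still tolerating the situation at a given time is the share whose stated limit is at least that long. The response counts are hypothetical, and representing each time frame by its upper limit in hours is an assumption.

```python
def tolerance_curve(responses):
    """Accumulate survey answers into the percentage still tolerating each time frame.

    responses -- list of (limit_hours, number_of_respondents), ordered by increasing limit,
                 where limit_hours is the longest duration those respondents tolerate
    Returns a list of (limit_hours, percent of respondents tolerating at least that long).
    """
    total = sum(count for _, count in responses)
    if total == 0:
        raise ValueError("no respondents")
    curve = []
    remaining = total
    for limit_hours, count in responses:
        curve.append((limit_hours, 100.0 * remaining / total))
        remaining -= count   # these respondents drop out beyond their stated limit
    return curve


# Hypothetical answers: 100 people tolerate up to 12 h, 300 up to 24 h, 600 up to 48 h.
answers = [(12, 100), (24, 300), (48, 600)]
curve = tolerance_curve(answers)
```

The resulting curve decreases with time, since fewer and fewer respondents tolerate longer disruptions.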


4.3 Technical resilience evaluation: performance 
vs tolerances 


To determine whether the operator currently meets public
expectations for service restoration time, the tolerance
and performance triangles are compared.
The comparison of delivery tolerance against performance
is shown in Figure 5.


Figure 4. Percentage of respondents that tolerate a
certain quantity of water per person per day after an
earthquake.

Figure 6. Resilience evaluation of quantity.

Figure 7. Resilience evaluation of quality.


The weighted average of the tolerance quantity
at each moment in time is compared to the total
quantities from our performance models (Figure 6).
Advice to everyone to boil the water is considered
the zero level for quality. The advice will remain in place
until the whole network is restored. Clearly, this
does not meet the tolerances of the public (Figure 7).
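
A comparison of this kind can be sketched as a point-by-point check of two curves on a common time grid. The sketch assumes the tolerance curve has already been expressed as the service level the public requires at each time, in the same unit as the performance; that conversion and all numbers below are hypothetical.

```python
def unmet_periods(times, performance, required):
    """Return (time, shortfall) for every time step where performance is below what is required."""
    if not (len(times) == len(performance) == len(required)):
        raise ValueError("all three sequences must have the same length")
    return [
        (t, need - q)
        for t, q, need in zip(times, performance, required)
        if q < need
    ]


# Hypothetical quality example: service stays at zero until the network is flushed on day 14,
# while the required level rises as fewer people tolerate boiling their water.
days = [0, 1, 2, 4, 7, 14]
quality_performance = [0.0, 0.0, 0.0, 0.0, 0.0, 100.0]
quality_required = [0.0, 10.0, 40.0, 70.0, 90.0, 100.0]
gaps = unmet_periods(days, quality_performance, quality_required)
```

An empty result would mean that the performance meets or exceeds the tolerance throughout; any entries show when, and by how much, the public's tolerance is exceeded.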


5 DISCUSSION


5.1 Meaning of the resilience evaluation 


The expectations of the public seem to be nearly
met by the technical performance in terms of
quantity (Figure 6). At the beginning the tolerance
is somewhat below the performance; however, it
seems that the endurance of the citizens is longer
than the time needed to restore full capacity in
terms of quantity. A month after the incident the
tolerances are for much lower quantities than the expected
performance. Also, the simulated performance level
during the first days is very conservative. This is
because it is very difficult to assess road accessibility
and to estimate the stored water capacity. We
chose to ignore larger water storages in stores or
homes. These can be significant, and the resilience
of the society might be much higher than the resilience
of the specific infrastructure itself. Nevertheless,
given the mismatch identified, one way to
close the gap could be
to lower the expectations of the public during the
initial stage and to increase the capacity through
communication campaigns highlighting the need
to store water for emergency situations.

Furthermore for delivery the performance is 
within a good range of the tolerances; the perform- 
ance seems better than the tolerance for the entire 
time period studied. This is also in line with the pre- 
vious findings in this project (Petersen et al., 2017). 

One aspect where the performance does not
meet the public tolerances is quality. Apparently,
boiling water for more than two days is too long,
as it appears to have a major impact
on many people’s lives. One treatment to close
this gap could be to use antiseptic tanks, to which
there is access, although structures and practices for
their use are lacking. Also, the quality of tap-delivered
water could possibly be guaranteed earlier by
flushing separate areas of the system once they are
repaired. Public campaigns could again be useful
for changing the public’s perception of, and the
need for, boiling water.


5.2 Performance measures chosen based on 
comparability of results 


In developing these performance measures the
importance of what is being assessed, for which
end-user and to which hazard became evident. The
need for water could, for example, be larger in a heat wave
and dry spell than in an earthquake. The perception
of the service provided to the public is not
always measurable in a technical evaluation of the
system. From a personal perspective, it can be difficult
to know how few litres of water one can
get by with. It is even harder to relate this
to pressure or head in the system, parameters that
are actually measured. Indeed, general performance
indicators do not take into account the service
that is provided to the public by a given infrastructure.
Thus, by focusing on the normal performance
of the service, the tolerance levels of the public
to the change of service can be taken into
account. Uncertainties in the modelling are found
in the estimates of accessible resources and the time
efficiency of the restorative capacity. The damage
itself is assigned based on the likelihood of exceeding a
break probability. A natural way to address this is to
run the model probabilistically to
identify the spread in the results arising from the random
numbers. This is currently being done as this
article is being written. Another aspect is that the
performance is here measured as a scalar number. It
might not be the people with the least tolerance who get
access to water first. Thus, even if the tolerance
and performance curves matched perfectly, a portion
of the population would consider the system
to underperform whereas another portion would
consider the opposite. In addition, the tolerances
are naturally a function of the magnitude of the
event, something which is very difficult to identify
in a survey. Previous findings show, however, that
communication and information about the situation
and the reasons for performance drops are vital to
keep the public content during crisis events. Lastly,
while the performance measures suggested could
be used for almost any hazard, a scenario is necessary
to be able to measure them effectively (for
both technical performance and the survey).


5.3 Survey limitations


When responding to questionnaire surveys, people
often respond by providing snap judgments based
on available information and may be influenced by
emotional or contextual factors (Schwarz & Stack,
1999). Also, question wording could affect stated
tolerance levels, as research has demonstrated that
when asked if they care about a given issue, people
state concern for issues that do not exist (Herrmann
et al., 1994). Further, respondents may choose to
answer in their own self-interest, claiming to tolerate
less so as not to give the CI operators an excuse
to perform any lower than absolutely necessary. The
opposite may also be true, with respondents reporting
that they are willing
to tolerate more than they actually could handle
in order to appear heroic. This is compounded by the
fact that research has also shown that disaster victims
rarely wait passively for someone else
to take care of their needs (Quarantelli, 1998), and
having high expectations of CI operators to
act in a disaster may indicate a gap between expectations
and the ability of citizens to respond
to crisis situations. Lastly, expectations have been
found to be influenced by demographic factors, previous
disaster experience and information provision
(Petersen et al., 2016). However, with purposeful
survey design and adequate sampling methods such
as the one used here, many of these limitations are
reduced or even overcome (Jones & Tanner, 2015).


5.4 Success of the method 


These preliminary results show that public tolerance
and technical performance of a critical infrastructure
can be evaluated and compared to each other in
the case of a drinking water distribution network. It
seems from this case study that reasonable comparable
performance measures have been found. The
operators of Barreiro have given positive feedback
on the methodology and think that the results
provide relevant knowledge. However, a more
descriptive resilience treatment, in which strategies for
closing the gaps between expected performance and
public tolerance are formed, is still work in
progress at the time of writing. The success
of the method, which depends on its usefulness to the
operators, is not yet entirely apparent. However, this
work shows that tolerance and performance can be
compared if the survey asks the right questions and
the right modelling work is conducted.


6 CONCLUSIONS 


This paper describes the preliminary results of
comparing the expected performance of a critical
infrastructure, in the form of a water distribution
network, with the public tolerances of the service
that the infrastructure provides. It is suggested
that this comparison is a valuable reference
for evaluating the performance, compared to
an otherwise arbitrary scale of performance. It
is shown that comparisons can be made and that
the chosen performance measures must be clearly
defined prior to evaluating both performance and
tolerances. When doing this, the different aspects
of the service provided must be considered, as the
perception of the service is usually multifaceted
and not necessarily directly linked to the scalar
that might be the most straightforward to assess
technically. The results of the resilience analysis
presented here are based upon an initial
study of the Barreiro municipality’s potable water
network. As such, a number of simplifying
assumptions and changes have been made,
which means that the performance shown is not
the true performance of the system (for example, it
is based on only one earthquake scenario).
Nevertheless, the results are useful as an illustration
of the resilience evaluation methodology
shown.
ABSTRACT: Resilience Engineering is an original approach to safety management that considers the
development of agents’ capacities to adapt to the diversity of situations that can occur as the main
target of safety management practices. The Resilience Analysis Grid aims to support the assessment of
the key capacities of a resilient organisation. A specific instance has been developed to support railway safety
management. It is composed of a set of key indicators and a methodology to collect and analyse information.
Application of the prototype to the study of a train station’s resilience capacities demonstrates its potential for
understanding resilience capacities by deducing resilience and fragility factors.


1 INTRODUCTION 


Resilience Engineering is considered by Borys, Else
and Leggett (2009) as the fifth age of safety, following
an age of integration (Glendon et al. 2006)
in which safety management aims to integrate technical,
human, managerial and cultural factors into risk
management practices such as risk analysis, barrier
management and accident analysis (Hale and
Hovden 1998) (Hudson 2007). The Resilience Engineering
perspective on safety management considers
that the way safety practices evolve
does not give workers the capacity to cope with
the complexity of their environment, and that theoretical
and methodological innovations are necessary
to endow systems with the requisite imagination to
respond to and overcome the diversity of situations
that can possibly occur (Adamski and Westrum
2003, Woods and Hollnagel 2006). Safety evolves
reactively: each event calls into question the relevance
of the models structuring safety theories, methods and
tools. The new factor or element found to
explain the event and the failure
of the safety management system is theorized and
integrated into safety management practices (procedures,
indicators, etc.). The safety management
system then tries to constrain the system dynamics in order
to minimize the occurrence of such factors. The
development of “human error”, “organisational
failures” and “safety culture” theories, methods, tools
and operational practices illustrates this dynamic.
The complexity of a system is associated, among
other properties, with non-linearity and with the difference
between reality and artifice (Chandlers 2014). Consequently,
the Resilience Engineering perspective on
safety aims to shift the main focus of safety management
from risk prevention to workers’ adaptive
capacity to respond to and overcome unwanted situations.
This perspective considers that the absence
of unwanted consequences is caused not by the
efficiency of risk barriers but by the capacity of the
system to remain in control despite the variability and
the complexity of situations and the lack of time,
knowledge, competence or resources (Hollnagel
and Woods 2006). The Resilience Engineering perspective
on safety management defines the resilience
of a system as its “ability to recognize and adapt
to handle unanticipated perturbations that call into
question the model of competence, and demand
a shift of processes, strategies and coordination”
(Woods 2006) and as “the intrinsic ability of a system
to adjust its functioning prior to, during, or following
changes and disturbances, so that it can sustain
required operations under both expected and unexpected
situations” (Hollnagel 2011).

The aims of this paper are to present, first, the
indicators defined to assess the resilience of an
organisation, then the associated method for assessing
resilience performance and, finally, the results of its
application to a train station.


2 PERFORMANCE INDICATORS 


According to the resilience engineering perspective
on safety management, organisations can be considered
resilient if they are able: 1) to respond to
and overcome the diversity of situations that
may arise; 2) to monitor that which changes, or
may change in the near term, such that it will require a
response; 3) to learn from both positive and negative
experience of the past; 4) to anticipate developments,
threats, and opportunities further into the
future (Hollnagel 2011). A first set of indicators,
the Resilience Analysis Grid, has been proposed to
support the assessment of these four dimensions
(Hollnagel 2011). Starting from resilience engineering
concepts and models and from the Resilience
Analysis Grid, a process of contextualisation to
the railway domain has been conducted. The result
of this process is a set of nine indicators to be used to
assess the resilience of railway systems. For each
indicator, four rules are defined to support the
evaluation. A system is rated five stars if
all four rules are true, four stars if the first three
rules are true, etc.
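
The star rating described above can be illustrated with a minimal
sketch. It rests on an assumption that the text leaves implicit in its
"etc.": the four rules are taken to be cumulative, so that the rating
is one star plus the number of rules satisfied in order before the
first unsatisfied rule (two rules give three stars, one rule gives two
stars, no rule gives one star). The function name `star_rating` is
hypothetical and is not part of the method as published.

```python
from typing import Sequence


def star_rating(rules: Sequence[bool]) -> int:
    """Return a 1-5 star rating from four ordered rule evaluations.

    Assumption (not stated in the source): rules are cumulative, so
    counting stops at the first rule that is not satisfied.
    """
    if len(rules) != 4:
        raise ValueError("each indicator is evaluated with exactly four rules")
    satisfied = 0
    for rule_ok in rules:
        if not rule_ok:
            break  # later rules do not count once one rule fails
        satisfied += 1
    return satisfied + 1  # all four rules true -> five stars


if __name__ == "__main__":
    # First three rules true, fourth false.
    print(star_rating([True, True, True, False]))
```

Under this reading, an indicator whose second rule fails is rated two
stars even if the later rules hold; a different reading of "etc."
(for instance, counting all satisfied rules) would give another score.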


2.1 Capacity to respond to and overcome the
diversity of situations that may arise


Two indicators are related to the capacity to
respond to and overcome the diversity of situations
that may arise. The first one is related to
routine situations and the second to situations that are
considered abnormal. The definition of the indicators
considers the tasks related to the adaptive process
(detect, recognize, decide to change behaviour,
define behaviour, mobilise resources), the existence
or not of procedures or good practices associated
with the situation, the difference between the context
of action (competence, knowledge, resources and
time) available and the context required to respond
to the situation, and the different dimensions of
performance of the system (quality, reliability,
safety, security, sustainability, etc.).


The first indicator is related to the capacity of
operational agents to adjust their procedural
and/or methodological framework or to be creative
in order to carry out their normal activity
in spite of the variability of their environment
while respecting the temporal, economic and
activity specific performance criteria. The four
rules associated with the indicator are:


1. Agents know their work and associated performance
criteria.

2. They have the skills or know the procedures to
follow and have the resources, time and information
to carry out their work in accordance
with the different performance criteria.

3. If they lack skills, resources, time or information
they are able to be creative in carrying out
their work according to performance criteria.

4. If the situation changes and the procedural
framework is no longer applicable, they are able
to be creative enough to carry out their work in
accordance with performance criteria and have
the necessary margins of manoeuvre.


The second indicator is related to the capacity
of operational agents to adjust their normative
and/or methodological framework or to be creative
in order to face and overcome the occurrence of an
urgent and/or unexpected situation, anticipated or
not, while respecting the temporal, economic and
activity specific performance criteria. The four
rules associated with the indicator are:


1. Agents are aware of the abnormal situations,
the behaviour to adopt when they occur, or which
document to consult to find out.

2. They have the skills, resources, time and infor- 
mation to respond to the situation in accordance 
with the different dimensions of performance. 

3. If they lack skills, resources, time or informa- 
tion they are able to be creative in responding 
to the situation in accordance with the different 
dimensions of performance. 

4. If the situation changes and the procedural 
framework is no longer applicable or there is no 
procedural framework, they are able to be crea- 
tive in responding to the situation. 


2.2 Capacity to monitor that which changes, 
or may change in the near term that it will 
require a response 


Three indicators are related to the capacity to
monitor that which changes, or may change in
the near term, such that it will require a response. They
are related to the capacity to evaluate and exploit
retrospective safety performance, actual safety performance
and prospective safety performance. The
four rules associated with each of the three indicators are:


1. The system has indicators for measuring
retrospective/actual/prospective performance.

2. Retrospective/actual/prospective safety performance
indicators are consistent with the system
and are properly and regularly reviewed.

3. The nature, period and frequency of the measurement
(qualitative or quantitative) of the indicators
and the time between measurement and
exploitation are correct.

4. There is no conflict between safety performance
indicators and production performance indicators.


2.3 Capacity to learn from both positive and 
negative experience of the past 


Two indicators are related to the capacity to learn
from both positive and negative experience of the
past. They are related to the capacity to acquire,
disseminate and use experience gained from the occurrence
of unwanted situations (incident, accident, etc.)
for the first one, and experience gained during the
observation of daily operations for the second.
The four rules associated with the first indicator are:


1. The occurrence of an abnormal situation (non- 
compliance, incident, accident, disaster, etc.) is 
detected, listed, investigated and the results dis- 
seminated within the organization. 

2. Criteria for identifying a situation to be investigated
are clearly identified, shared and appropriated
by the system. The direct and indirect
causes and the reaction of the system
are investigated.

3. The skills necessary for conducting the investigation
are available, the necessary information can be
accessed and the study is carried out independently
of the stakeholders.

4. Lessons learned are processed,
capitalized and proactively transmitted to the
other entities of the system.


The second indicator is evaluated with these 
four rules: 


1. Audits are carried out to understand the real
functioning of the system.

2. The competences necessary to carry out audits are
sufficient.

3. Audit results are used and capitalized by the
station.

4. Results are proactively transmitted to other
stations.


2.4 Capacity to anticipate development, threats, 
and opportunities further into the future 


Two indicators are related to the capacity to anticipate
developments, threats, and opportunities further into
the future. They are related to the capacity to identify,
use and disseminate information about the consequences
on safety performance of a change of a component
of the system for the first indicator, and of a change
in the environment of the system for
the second indicator.

The four rules associated to the first indicator 
are: 


1. Anticipating the consequences of internal 
changes on safety is a dimension of the culture 
of the organisation. 

2. The methodological approach for conducting 
the study of consequences of change is clearly 
formulated and based on adequate expertise 

3. The anticipation of consequences of the change 
is carried out with sufficient time so that the 
consequences identified can be taken into 
consideration. 

4. Outcomes relating to potential sources of 
threats or opportunities are shared within the 
organisation. 


The second indicator is evaluated with these 
four rules: 


1. Anticipating the consequences of external 
changes and of trends is a dimension of the cul- 
ture of the organisation. 

2. The methodological approach for conducting 
the study of consequences of external changes 
and trends is clearly formulated and based on 
adequate expertise 

3. The anticipation of consequences of the exter- 
nal changes and trends is carried out with suf- 
ficient time so that the consequences identified 
can be taken into consideration. 

4. Outcomes relating to potential sources of 
threats or opportunities are shared within the 
organisation. 


3 METHODOLOGY 


In order to assess and enhance the resilience of
a socio-technical system, a four-phase
method is proposed:


• Phase 1. Definition of the context of the diagnostic.
The system representative defines the scope,
schedule, working team and stakeholders of the
diagnostic process.

• Phase 2. Performance assessment. The working team
collects the data necessary to evaluate the performance
indicators and writes an assessment report validated
by the stakeholders.

• Phase 3. Action plan definition. The working team
collects the data necessary to define, with the stakeholders,
action plans aiming to close gaps and
preserve good practices.

• Phase 4. Conclusion of the study. The working
team writes and presents the final reports to the system
representatives.


In order to test it, the method
has been applied to the study of the processes
managing train departures and arrivals at a
train station. Accordingly, the following paragraphs
outline the results of the application of
these four phases.

The study conducted aims to examine both the
resilience of the train management activities of the
station and the relevance of the method. The scope
is the operational activities dedicated to the management
of train departures and arrivals in the
station. Operational train departure and arrival
functions are relevant to the study of resilience because
they involve technical agents, proximity managers,
and safety and production services; they are shaped
by schedules, procedures and time constraints;
injuries may occur; and they are sensitive to events
occurring in the station and in the network.
The second phase of the process consists of the
organisation of data collection. A set of workshops
was scheduled. The first workshop was dedicated
to a global presentation of the organisation of
departure and arrival operational functions, from
schedule design to the effective realisation of the tasks;
observation and informal interview techniques were
used. The results of this first workshop supported the
definition of a questionnaire. This questionnaire
was used to interview twelve representative
agents of the system (operation, technical manager,
proximity manager, safety manager, head of
the safety service). The data collected in the interviews
support the assessment of the different indicators.
These results were discussed in a group workshop.

The next section is dedicated to the presentation of
the results.


4 RESULTS 


Data collected during interviews support the
assessment of the different indicators that constitute
the resilience performance. For each rule of an
indicator, an evaluation and a set of fragility
and resilience factors are proposed. Results for
the indicators related to the capacity to respond to and
overcome the diversity of situations that may
arise are discussed in this section.


4.1 The capacity to respond to and overcome the
diversity of routine situations that may arise


The analysis of the data collected for
the first indicator supports the assessment of the adaptive
capacity of operational agents and of the margin
of manoeuvre provided by the system in the face of the
variability of routine situations.

Agents know their work and associated per- 
formance criteria. 

The assessment of the rule is based on the comparison
of the descriptions of the different tasks to
be performed given by operational and managerial
agents with procedures and observations.

According to the results of the analysis, the part
of the rule dedicated to the knowledge of the activities
is judged satisfactory. Agents seem to have a
fairly good perception of the reality of their tasks.

According to the results of the analysis, the
part of the rule dedicated to the knowledge of the
performance associated with the activities is judged
satisfactory. Agents seem to know the different
performance criteria associated with their activities and aim
to satisfy safety and punctuality requirements. Proximity
managers aim to find a good trade-off
between meeting the objectives associated with
their position and managing the human
dimension of their team.


They have the skills or know the procedures to 
follow and have the resources, time and informa- 
tion to carry out their work in accordance with the 
different performance criteria. 

The assessment of the rule is based on the
analysis of agents’ testimonies about their work
environment and its relation to competences,
resources, procedures, time and information.

The operational work environment is judged poorly
adapted to working conditions such as winter and summer
temperatures; workplaces are not improved
and cleanliness is judged insufficient. Moreover, agents
have the feeling that working in difficult conditions
is considered normal and that the budget is used for
purposes other than improving working conditions.

A period of empowerment is necessary for
operational agents before they can overcome their fear
of the hazards associated with the tasks; during
this period of one to two months they can make mistakes.
This period is longer for the traffic manager,
not because of hazards but because of the complexity of
the management tasks. Moreover, it is very difficult
to allocate training periods to agents due to
workforce issues and unplanned absences. These
difficulties are compensated by initial training,
accompaniment by experienced agents during
their first interventions, the vigilance of the hierarchy
and mutual assistance between agents.

Procedures are learned by agents in initial training;
monitoring by proximity managers
and their hierarchy helps check that they are
fully known and understood. In case of doubt,
agents ask their managers, who are able to answer
them. Agents are used to mentally reviewing the
procedure to be followed before performing critical
and hazardous tasks. Agents consider that some
tasks planned by procedures do not impact safety
and consequently may not apply them if they
are under constraints. This fragility is compensated by
the hierarchy’s monitoring of the correct
application of all the tasks in the procedures. When a
procedure is modified, each agent concerned has
to attest that he has taken note of the change. Due to the
multiplication of changes, sometimes minor, agents
know that a change has occurred but do not take the time
to study it. Proximity managers compensate
for this issue by teaching and by monitoring the
correct understanding of changes that affect the
work of the operational agents.

There is a difference between the technical and
human resources necessary to achieve goals, in terms
of the number of daily trains negotiated with stakeholders,
and the resources available, which induces recurring
problems. There is a feeling that the solutions proposed
by operational agents to solve them are not considered
by the hierarchy. When tools are not available,
exchanges are organized between agents. Agent
absenteeism is significant; it is compensated by the
capacity of proximity managers and the hierarchy
to perform operational tasks.

The work schedule is planned so that time is not a
constraint for operational agents achieving tasks,
and time margins are planned between two work
activities. The hierarchy considers that safety is the
priority and does not put pressure on agents. A point
of fragility is that agents alternate short periods of
activity with periods of inactivity, causing issues
related to activation and concentration.

Information management is a major issue for all agents
involved in the management of train departures and
arrivals. The use of information systems and informal
communication networks helps manage constraints
during both design and production phases.
Information systems and communication protocols
aim to take into account the constraints of each agent.
Operational and proximity managers are continuously
looking for information about the state of
activities and of the network in order to anticipate
potential issues. To that end, they seek out information
systems and build a personal communication
network. When information is lacking, operational
agents do not hesitate to ask their managers, who are
responsible for finding a solution. Information systems
may be saturated by the multiplication of messages.
Managers have to be careful to deliver the right
message to the right person at the right time in order
not to induce perturbations in the system. Conflicts
may arise from the absence of an answer or because
of the tone or the type of message exchanged.

In view of all these positive and negative factors,
the rule has been evaluated as medium.

If the situation changes and the procedural 
framework is no longer applicable, they are able 
to be creative enough to carry out their work in 
accordance with performance criteria and have the 
necessary margins of manoeuvre. 

The assessment of the rule is based on the
analysis of agents’ testimonies about complicated
and complex situations that can occur in routine
situations.

Many situations related to incidents, delays and
malfunctions require an adaptive response from
agents and from the system. A routine situation
may cause blockages because it is not taken
into account by procedures or by the definition of
agents’ roles. Some agents demonstrate initiative
during such situations in order to ensure that the train
is able to depart on time and safely. They are given
margin of manoeuvre by the hierarchy to take the
time necessary to perform work safely,
even if it creates some delays for the train.

Based on all the fragility and resilience factors
identified, the indicator is evaluated as
acceptable for the agents’ capacity to adapt to the
variability of routine situations and as medium for the
contribution of the organization.


4.2 The capacity to respond to and overcome the
diversity of abnormal situations that may arise


The analysis of the data collected for
the second indicator supports the assessment
of the adaptive capacity of operational agents and of the
margin of manoeuvre provided by the system in
abnormal situations. A typology of abnormal situations
was first deduced and, for each type, factors
of fragility and resilience have been identified.
Four abnormal situations have been studied:


• Situations caused by an increase in agents’
workload resulting from the absence of two
to three operational agents.

• Situations resulting from a safety incident occurring
in one of the areas of the station.

• Situations resulting from an incident occurring
at the station or on the network which disrupts
production, where the management is under
the responsibility of the train station.

• Situations resulting from an incident occurring
at the station or on the network which disrupts
production, where a crisis management room
is opened under the responsibility of an authority
external to the station.


Situations caused by an increase in agents’
workload resulting from the absence of
two to three operational agents.

When such a situation occurs, agents must perform
the same tasks as in routine situations but in
a different context. They may be required to perform
more tasks during the same service, to carry
them out on numerous trains consecutively, or to carry
them out in a crisis atmosphere with many passengers
seeking information, or even in a hostile situation
with aggressive and violent passengers. Agents’
adaptation is supported by their acceptance that
on some days they must perform more activities
than originally planned, by the ability of management
to perform operational tasks and by agents’
knowledge of how to prioritize their work,
managing their production tasks as a priority
while trying to find a solution for the passengers.

Situations resulting from a safety incident
occurring in one of the areas of the station.

When such a situation occurs, the psychological
impact of the event
on agents’ activities in the short and medium term
has to be considered, as well as the impact
of the absence of one or more agents due to injury,
rest or suspension on the production capacity of
the station. A resilience factor is the ability of management
to support agents in overcoming the event and its possible
consequences (suspension, etc.) and to make the incident
a source of awareness of risks not previously taken into
account and of learning to perform activities
compliantly and safely.
Situations resulting from an incident occurring
at the station or on the network which disrupts
production, where the management is under the
responsibility of the train station.

When such a situation occurs, delays may affect
the following services, which then need to be adjusted;
supervisors may have to leave the site and oversee
operational activities to manage the crisis situation,
which creates strain; and some tasks that have to be carried
out quickly in order to resolve the situation may
not be possible due to the unavailability of technical
resources or difficulties in setting up the stopover
(access to storage areas, etc.). A culture of mutual
assistance between services and agents in disturbed
situations, the "Pride of the railway man", facilitates
the adaptation needed to overcome such situations.

Situations resulting from an incident occurring 
at the station or on the network which disrupts 
the production, where a crisis management room 
is open under the responsibility of an authority 
external of the station. 

When such a situation occurs, agents and managers
feel that they are not listened to by the crisis
management unit when they propose solutions
that appear effective to them, and that they have to endure
the decisions and their negative consequences on
production and on customers. Moreover, there is a
risk of excessive caution caused by fear of the perception
and reaction of general headquarters. A culture
of mutual assistance between services and agents
in disturbed situations, the "Pride of the railway
man", facilitates the adaptation needed to overcome such
situations.

Based on all the fragility and resilience factors
identified, the indicator is evaluated as
medium for the agents’ capacity to adapt to the
variability of abnormal situations and as medium for the
contribution of the organization.


5 DISCUSSION 


Resilience can be perceived as a process, a set of
properties, the result of a dynamic of development,
the result of the response of a system to a situation,
or a combination of all the above. Resilience assessment
is a complex process that requires considering the
system studied, a set of situations of adversity and
a set of values.

As in other comparable work (Patriarca et al. 2017),
a phase of translation of the initial Resilience Analysis Grid
(Hollnagel 2011) is necessary to address concrete
topics of the system studied.


The overall process is complicated to apply.
Firstly, the topics studied relate to the system
as it is and not as it should be; consequently,
conditions have to be favourable for
people to agree to describe their working conditions.
Secondly, resilience performance relates to
situations that do not occur frequently, and some
of which never occur. Consequently, it is difficult
to collect relevant data allowing one to predict how
the system will respond to such a situation.

The results of the application are mainly qualitative
because they aimed to promote dialogue between
actors of the organization and to support the
definition of action plans for improving the
system.


6 CONCLUSION 


Application of the method to the train station demonstrates
the potential of the indicators and method to
collect and analyse data on the resilience performance
of an organisation. Fragility and resilience
factors have been identified. While some topics were
familiar to agents, others were more difficult
to study, such as learning from normal situations,
actual and prospective safety performance indicators
or change management with operational
agents.

Resilience factors have been identified; nevertheless,
the success of such factors depends
on the availability and motivation of operational
and managerial agents performing tasks that are
not part of their duties, on good team
cohesion and on vigilance in maintaining the margins of
manoeuvre available.

Lessons learned from this experiment will struc- 
ture the refinement of indicators and methodology 
in order to produce an operational method. The 
new method will be applied to other systems to 
continue validating its performance. 
ABSTRACT: In recent years there has been a constant growth of interest in Resilience Engineering.
According to its approach, accidents in complex socio-technical systems happen due to ‘functional resonance’
of many underdefined system functions. The resonance is often analysed and presented with the help
of the Functional Resonance Analysis Method (FRAM). However, in much reported research, the analysis is
limited to a separate group of functions and describes a process rather than a system as a whole. Contrary
to this common practice, we assume that the processes are ‘looped’ and that the system functioning is
infinite. In the paper, we introduce the concept of novel computer software which can demonstrate such
systems described with FRAM models. The prepared software has been used for simulating a moving
tram. Several simulations with different parameters have been compared to show the possible use of this
software in the further investigation of systems described with the help of FRAM.


1 INTRODUCTION 


The issues of safety have probably accompanied 
humankind since the very beginning of its exist- 
ence. However, the need for a more formalised 
approach to this subject matter did not appear 
until the industrial revolution in the 18th century 
(Hollnagel 2014). Scientific research came even lat- 
er—the essential book by Heinrich (Heinrich 1931) 
dedicated to work safety is worth mentioning here 
(Lundberg et al. 2009). In the period after World 
War II, reliability engineering developed, thanks 
to which many tools still in use today, such as the 
FMEA or FTA methods, were introduced. 

Along with the noticeable improvement in the 
reliability characteristics of technical objects, unde- 
sirable events increasingly resulted from the errors 
of machine operators. In consequence, the models 
in use were supplemented with the so-called human 
factor. However, it turned out relatively quickly 
that an attempt at attributing responsibility solely 
to operators of technical objects does not bring 
the expected effect. It therefore became necessary 
to also include the organisations for which these 
operators worked in the analyses, which gave rise 
to safety management systems (Hollnagel 2014). 

The results of scientific research translated into 
legal regulations concerning safety, in which depar- 
ture from the focus on the technical details towards 
ways of making decisions and management could 
be observed since the 1970s (Hale et al. 1997). This 
trend was additionally reinforced by the results 
of studies on disasters from the 1970s and 1980s, 


such as the Piper Alpha oil platform disaster of 
1988 (Paté-Cornell 1993), as well as the govern- 
ments’ will to withdraw from direct responsibility 
for the level of safety. Since the announcement of 
the results of the investigation of the causes of 
the Chernobyl disaster of 1986, also the need to 
enhance the culture of safety has been the topic of 
discussion (Wang & Liu 2012). 

Experience in implementing safety management 
systems revealed a number of problems, however. 
Already in the 1990s, it was observed that people’s 
behaviour changes under the influence of the ubiq- 
uitous procedures. Power (Power 1997) called this 
phenomenon the formation of an ‘audit society’, in 
which more emphasis is put on obtaining another 
certificate than on the actual effects of work. Forc- 
ing compliance with procedures does have certain 
positive effects, e.g. makes cooperation more pre- 
dictable (Jeffcott et al. 2006), but it also leads to the 
marginalisation of the significance of the employees’ 
knowledge and experience (Almklov et al. 2014). 

The problems mentioned above are the reason 
that changes and additions to the approach to 
safety management are currently proposed. They 
concern changes in the manner of formulating 
safety procedures (Hale & Borys 2013) and moving 
from searching for the causes of undesirable events 
to searching for the causes of correct execution of 
system tasks (Hollnagel 2014). As a result, system 
resilience to unexpected changes in the manner of 
their operation is expected to increase. 

One of the methods used for modelling of com- 
plex socio-technical systems in order to find the 
resilience mechanisms is the Functional Resonance
Analysis Method (FRAM), proposed by Hollnagel
(Hollnagel 2012). It has been applied in a
variety of organisations and industries. The most
recent papers describe, e.g., a sinter plant (Patriarca,
Di Gravio, Costantino, et al. 2017), an air traffic
management system (Patriarca, Di Gravio, & Costantino
2017, Yang et al. 2017) and an application in
medical care (Pickup et al. 2017).

Most authors, however, tend to use the
FRAM to analyse isolated processes in complex
systems, from the beginning to the end of such a
process. In our opinion, the FRAM can also be used to
describe systems as a whole, where the processes are
‘looped’ and the system functions constantly and
indefinitely. The article aims to present an early version
of simulation software that can be used for this
purpose, based on a simple example of a moving tram.

In Section 2 we present how the approach to
safety has changed over the years and describe the
foundations of the FRAM. In Section 3 we show how
these foundations are transformed into the proposed
software. In Section 4 we show the results of a tram
ride simulation. The paper ends with conclusions in
Section 5.


2 FUNCTIONAL RESONANCE 
ANALYSIS METHOD 


Recently, a considerable increase in the complexity 
of systems, caused by the possibilities offered by 
new technologies and pressure to introduce changes 
as soon as possible and for as low a price as pos- 
sible, could be observed. The consequence of this 
situation is often the lack of complete understand- 
ing of the processes occurring within the systems, 
which leads to their sudden failures. Since 2000, 
a rapid increase in the number of publications 
on this topic, dedicated to the issues of resilience 
engineering from the point of view of e.g. safety, 
system complexity, organisation or ecology, has 
been observed. The term ‘resilience’ is understood 
in four ways in these works (Woods 2015): 


— As rebound — how a system rebounds from dis- 
rupting or traumatic events and returns to previ- 
ous or normal activities. 

— As robustness — expanding the set of distur- 
bances the system is prepared for. 

— As graceful extensibility — how a system extends 
performance or uses extra adaptive capacity in 
case of unpredictable events. 

— As sustained adaptability — a policy making it 
possible to continue proper operation in the 
long term as external conditions change. 


It is sometimes believed (Woods 2015) that only
the latter two approaches are justified. Studying
only the process of returning to the initial condition
is considered incomplete, as its course is determined
by activities undertaken before the disturbance
(which corresponds to increasing resilience
in the third and fourth sense above). Preparing the
system for previously predicted events in turn
corresponds to the phase of reacting to risk in the
classic form of the hazard risk management process.

A different attempt at taking a complete look at 
the issue of resilience was made by Lundberg and 
Johansson, whose work (Lundberg & Johansson 
2015) presents a resilience model including six func- 
tions: anticipation, monitoring, response, recovery, 
learning, and self-monitoring. An in-depth analy- 
sis of various definitions of resilience can also be 
found in (Adjetey-Bahun et al. 2016). In principle, 
resilience engineering is supposed to be as universal 
as possible (Haavik et al. 2016), and research on the 
topic has been carried out in many different areas 
of human activity, including medicine, aviation, 
and nuclear power stations (Le Coze 2016). 

The Functional Resonance Analysis Method
(FRAM) was proposed by Hollnagel (Hollnagel
2012) for modelling how complex socio-technical
systems work. The method is based on four
principles (Patriarca, Di Gravio, & Costantino 2017):


— Equivalence of failures and successes. Failures
and successes come from the same origin,
i.e. everyday work variability. This variability allows
things both to go right, working as they should, and
to go wrong.

— Principle of approximate adjustments. People 
as individuals or as a group and organizations 
adjust their everyday performance to match 
the partly intractable and underspecified work- 
ing conditions of the large-scale socio-technical 
systems. 

— Principle of emergence. It is not possible to 
identify the causes of any specific safety event. 
Many events appear to be emergent rather than 
resultant from a specific combination of fixed 
conditions. Some events emerge due to particu- 
lar combination of time and space conditions, 
which could be transient, not leaving any traces. 

— Functional resonance. Functional resonance
represents the detectable signal emerging from
the unintended interaction of the everyday
variability of multiple signals. This resonance
is not completely stochastic, because the signals’
variability is not completely random but is
subject to certain regularities, i.e. recognizable
short-cuts.


The principles constitute a new approach for 
understanding safety, called ‘Safety-II’ (Hollnagel 
2014). In brief, the steps for safety system analysis 
using FRAM are (Patriarca, Di Gravio, & Costan- 
tino 2017): 
— Identification and description of the system’s
functions

— Identification of performance variability

— Aggregation of variability

— Management of variability.


The distinctive feature of the FRAM is the way
in which the functions are represented. The graphical
form of this representation is shown in Figure 1.

As shown in Figure 1, each function is characterised
through six aspects (Yang et al. 2017):


— Input (I): what the function transforms or proc- 
esses or what starts the function 

— Output (O): the result of the function, either an 
entity or a state change 

— Preconditions (P): conditions that must exist 
before a function can be executed 

— Resources (R): what the function needs when 
it is carried out (Execution Condition) or con- 
sumes to produce the Output 

— Time (T): temporal constraints influencing the 
function (with regard to starting time, finishing 
time or duration) 

— Control (C): how the function is controlled. 


The definition of the aspects is intuitive, but often
too general to support a consistent procedure for
choosing them in particular circumstances.
Anvarifar et al. (Anvarifar et al. 2017) note that further
work is required to test the applicability of the
FRAM for detailed risk analysis in more complicated
and data-demanding case studies. Patriarca
et al. (Patriarca, Bergstrom, et al. 2017) add that
a comprehensive FRAM analysis might generate
a representation that is impressive in terms of
its sheer number of functions and couplings, but
hard to interpret for further analytical purposes.
The solutions used to overcome
this problem consist of decomposition schemes
and various original computer programs based on
Monte Carlo simulation.


Figure 1. A hexagon characterising a function in FRAM
(Hollnagel 2016).


3 THE SIMULATION SOFTWARE 


With the software presented in this paper we
aim to simulate ‘looped’ systems described with
FRAM models. It has been prepared using Microsoft
Visual Studio Community 2017, which is available
free of charge, among others for academic purposes,
extended with a demo version of the NMath library
from CenterSpace. The program has been written in
the Visual Basic .NET programming language as a
console application and runs in text mode only.
This makes it possible to focus more on the algorithms
than on the visual part of the software. The following
assumptions have been made to ensure the consistency
of the FRAM-based models used as the simulation basis:


— All the functions use information of generic type 
provided through their Input aspects to produce 
results of generic type, which are made available 
through the Output aspects 

— The Control aspect is an input of generic type 
for information used by the function in order to 
minimise its own variability 

— The Timing aspect is an input of date/time type 
that provides the earliest time when the function 
can be activated 

— The Preconditions and Resources aspects are
inputs of ‘true’ or ‘false’ type; the first aspect is
checked for being ‘true’ at the activation
of the function and the second aspect is checked
constantly throughout its execution.


The assumptions have been shown graphically 
in Figure 2. 

With the assumptions (Fig. 2), the aspects have 
been effectively divided into two groups: 


— Aspects responsible for determination if and 
when the function is performed: Timing, Pre- 
conditions and Resources 

— Aspects responsible for how the function is per- 
formed: Input, Output and Control. 


For the first group of aspects, it was possible to
determine the type of variables used for communication
between functions. The types correspond
to the description in Fig. 2, i.e. Date for the Timing
aspect and Boolean for the Preconditions and
Resources aspects. The second group of aspects is
of generic type, as it depends on what the function
actually does. The implementation of the ‘inside’ of
the function has to be written manually, directly in
the program’s code. However, in the future it would be
possible to introduce an interface similar to
LabVIEW from National Instruments. It would make it
possible to ‘construct’ the functions graphically from
a set of predefined instructions (loops, conditions, etc.).


Figure 2. Interpretation of FRAM’s aspects used in the
proposed simulation software: T (schedule), I (generic),
C (generic), O (generic), P (true / false), R (true / false).

All the functions identified during the FRAM
analysis and present in the model are put by the
software simultaneously into a ‘stand-by’ mode at the
beginning of the simulation. Their actual activation
depends on the respective Timing, Preconditions
and Resources aspects and can be repeated throughout
the simulation time. With this approach, the
simulation never ends, just as the modelled systems
never ‘stop’ their existence.
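The activation rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not the Visual Basic .NET program of this paper: the class FramFunction, its field names, the fixed execution time, the fixed shift of the Timing aspect after each run, and the one-second time step are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FramFunction:
    """Hypothetical stand-in for one FRAM function kept in 'stand-by' mode."""
    name: str
    earliest_start: float                   # Timing aspect: earliest activation time [s]
    precondition: Callable[[float], bool]   # Preconditions aspect, checked at activation
    resource: Callable[[float], bool]       # Resources aspect, checked during execution
    duration: float                         # assumed fixed execution time [s]
    period: float                           # assumed shift of the Timing aspect per run [s]
    log: List[str] = field(default_factory=list)
    _running_until: float = -1.0
    _active: bool = False

    def step(self, now: float) -> None:
        if self._active:
            if not self.resource(now):
                # Resource lost while running: back to stand-by (assumed behaviour)
                self.log.append(f"{now:.0f}: {self.name} interrupted")
                self._active = False
            elif now >= self._running_until:
                self.log.append(f"{now:.0f}: {self.name} finished")
                self._active = False
                self.earliest_start += self.period   # ready for the next loop
        elif (now >= self.earliest_start
              and self.precondition(now)
              and self.resource(now)):
            self.log.append(f"{now:.0f}: {self.name} started")
            self._active = True
            self._running_until = now + self.duration


def simulate(functions: List[FramFunction], horizon: float, dt: float = 1.0) -> None:
    """Step all functions; only the observation window ends, not the system."""
    now = 0.0
    while now <= horizon:
        for f in functions:
            f.step(now)
        now += dt


move = FramFunction("Move tram", 0.0, lambda t: True, lambda t: True, 10.0, 40.0)
simulate([move], 100.0)
print("\n".join(move.log))
```

In this sketch a function is never removed from the model after it finishes; it only waits for its Timing aspect to allow the next activation, which mirrors the ‘looped’ view of the system.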


4 PERFORMANCE VARIABILITY OF A 
MOVING TRAM 


To test the simulation software, we have
decided to simulate a simple system
of a tram moving through a city. The respective
FRAM model is presented in Figure 3.

Diagrams such as the one in Figure 3 can be created
with the help of specialised software, called FRAM Model
Visualiser, which is available free of charge (Hollnagel
2016). The following two foreground functions
have been considered in the model:


— Exchange of passengers, a function encapsulat- 
ing the boarding activities between opening and 
closing doors 

— Move tram, a function of keeping the vehicle 
running. 


Figure 3. FRAM model of a tram moving through the
city.


In addition, the following background functions
have been considered:


— Prepare timetable, giving the earliest time of 
moving away from a tram stop 

— Passenger flow, a function that represents the 
passengers waiting for the tram on the tram stop 

— Traffic lights, a technical function that periodi- 
cally prevents the driver from leaving the tram 
stop 

— Produce electricity, a technical function that
supplies electricity, which is indispensable for
performing the function Move tram.


In the model, the Gaussian distribution has been
used for generating random numbers that describe
the timing of traffic lights, producing energy, and
the travel time between stops. The number of
passengers increases with time and, when boarding
needs more time than the schedule allows, the
Exchange of passengers function uses a warning
signal to speed up the process. The warning signal
device works with an average efficacy of 1/e, i.e. on
average one out of e signals lowers the remaining
boarding time by some value that depends on
the device type.

A sample simulation record is shown
in Figure 4. The simulation software can save
the simulation record as a text file, which
can be processed further with e.g. Microsoft Excel.

In the case study we have assumed that the
departure time from stop i + 1 is 40 seconds
later than the departure time from stop i. For the
Gaussian distributions determining the timing of the
background functions Produce electricity and
Traffic lights (Fig. 3), we have used the following
parameters:


— For Power = true: μ = 35, σ = 5 [s],
— For Power = false: μ = 10, σ = 3 [s],
— For Green light = true: μ = 4, σ = 4 [s],
— For Green light = false: μ = 8, σ = 2 [s].


If a drawn time is negative, it is set
to zero. Additionally, the travel time is drawn
according to the Gaussian distribution with μ = 10,
σ = 3 [s]. The Passenger flow output is the boarding
time, which equals the time between closing the
doors at stop i and opening them at stop i + 1.
The boarding time at the beginning of the simulation
equals 10 s.


Figure 4. Sample record of a simulation performed in
the software.


Figure 5. Total delay time and number of delays with
respect to the impact of the warning signal.
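The drawing rule for the timing of the background functions and for the travel time can be illustrated as follows. The fragment is a minimal sketch, assuming Python's standard random module instead of the NMath library used in the actual program; the dictionary PARAMS and the function draw_duration are hypothetical names, while the numerical values are those of the case study.

```python
import random

# (mean, standard deviation) in seconds, as listed in the case study
PARAMS = {
    ("Power", True): (35.0, 5.0),
    ("Power", False): (10.0, 3.0),
    ("Green light", True): (4.0, 4.0),
    ("Green light", False): (8.0, 2.0),
    ("Travel time", None): (10.0, 3.0),
}


def draw_duration(rng: random.Random, key) -> float:
    """Draw one duration from the Gaussian distribution; negative draws become zero."""
    mu, sigma = PARAMS[key]
    return max(0.0, rng.gauss(mu, sigma))


rng = random.Random(2018)  # fixed seed only to make the illustration repeatable
green = [draw_duration(rng, ("Green light", True)) for _ in range(1000)]
print(min(green), sum(green) / len(green))
```

For Green light = true the mean equals the standard deviation, so roughly 16% of the raw draws are negative and are replaced by zero; the clamping therefore noticeably changes this distribution, whereas it hardly matters for Power = true.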

The aim of the case study is to determine which
of five possible warning signal devices should
be used in the tram after its renovation. The devices
have the same efficacy e = 0.4, but differ in terms
of their effectiveness in speeding up the boarding.
Each time the signal is taken into consideration by
the passengers (which happens with an average
probability of 0.4), the signal lowers the remaining
boarding time by 1, 2, 3, 5 or 10 seconds. This timespan
will be called the ‘impact’ of the device. For each
warning signal device, a 10-minute simulation
has been run. The results are summarised in
Figure 5.
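The effect of a single warning signal can be written down compactly. The sketch below assumes that each signal is heeded independently with probability 0.4 and that the remaining boarding time cannot drop below zero; the names apply_signal and expected_reduction are hypothetical and the fragment is not taken from the simulation software.

```python
import random

HEED_PROBABILITY = 0.4          # probability that passengers react to a signal
IMPACTS = (1, 2, 3, 5, 10)      # reduction of the remaining boarding time [s]


def apply_signal(remaining: float, impact: float, rng: random.Random) -> float:
    """Return the remaining boarding time after one warning signal."""
    if rng.random() < HEED_PROBABILITY:
        return max(0.0, remaining - impact)   # assumed floor at zero
    return remaining


def expected_reduction(impact: float) -> float:
    """Mean reduction per signal, ignoring the floor at zero."""
    return HEED_PROBABILITY * impact


for impact in IMPACTS:
    print(impact, expected_reduction(impact))
```

Under these assumptions the mean reduction per signal grows linearly with the impact, while the number of delays in a short simulation run remains a small integer, which is one reason why the results of a single 10-minute run should be read as estimates.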

The number of delays during the simulation
amounts to 2 in four cases and 3 in one case. Due
to the different number of delays, the total delay
time increases for impact 2 and then decreases and
remains at the same level for impact 5 and 10. The
results, although they should only be considered as
estimates, support opting for the device with impact 3
or 5 and suggest that investing in the device with
impact 10 is not justified.


5 CONCLUSIONS 


Models prepared with the FRAM can be used
not only for the description of processes, but also
for simulating systems throughout their lifetimes.
Models used as a basis for this kind of simulation
will often be complex and, therefore, a strict and
consistent understanding of the functions’ aspects
is needed. A proposal for such an understanding
has been presented in this paper together with an
early version of dedicated simulation software.
Its applicability has been shown using the example of
warning signal devices installed in a tram. Further
work is needed to make the software fully intuitive
and to integrate it with the official FRAM editor
(Hollnagel 2016).
Technical safety and reliability methods for resilience engineering


I. Häring & P. Gelhausen
Fraunhofer Ernst-Mach-Institut, EMI, Efringen-Kirchen, Germany


ABSTRACT: Resilience of technical and socio-technical systems can be defined as their capability to 
behave in an acceptable way along the timeline pre, during and post potentially dangerous or disruptive 
events, i.e. in all phases of the resilience cycle and overall. Hence technical safety and reliability methods 
and processes for technical safety and reliability are strong candidate approaches to achieve the objective 
of engineering resilience for such systems. This is also expected when restricting the set of methods to 
classical safety and reliability assessment methods, e.g. classical Hazard Analysis (HA) methods, induc- 
tive Failure Mode and Effects Analysis (FMEA), deductive Fault Tree Analysis (FTA), Reliability Block 
Diagrams (RBDs), Event Tree Analysis (ETA) and reliability prediction. Such methods have the advan- 
tage that they are typically already used in industrial research and development. However, improving the 
resilience of systems is usually not their explicit aim. The paper covers how to allocate such methods to 
different resilience assessment, response, development and resilience management tasks when engineering 
resilience from a technical perspective. In particular, the resilience dimensions of risk management, resil- 
ience objectives, resilience cycle time phases, technical resilience capabilities and system layers are used 
explicitly to explore their range of applicability. Also typical system graphical modelling, hardware and
software development methods are assessed to document the usability of technical reliability and safety
methods for resilience analytics and technically engineering resilience.


1 INTRODUCTION 


As the number of applications of the concept of 
resilience to (socio) technical systems in mainly aca- 
demic research and development rises, the question 
of how to successfully implement these approaches 
in the private sector, industry and small and medium 
enterprises is getting more and more prominent. 
The present paper addresses this challenge by sur- 
veying the suitability of classical, mainly analytical 
system analysis methods for assessing and improv- 
ing resilience of (socio) technical systems. 

The wording resilience analytics has recently
been used quite generally in the sense of resilience
assessment in the societal-technical domain, see
e.g. (López-Cuevas et al. 2017; Linkov, Florin 2016;
Thorisson et al. 2017). However, the present use of
the term analytical relates to classical system
analysis methods, such as hazard analyses (HAs)
including hazard lists (HLs), failure mode and
effects analyses (FMEAs), fault tree analysis (FTA),
event tree analysis (ETA), reliability block diagrams
(RBDs) and double failure matrix (DFM).

The application of established system model- 
ling, analysis and simulation methods for resil- 
ience analysis covers Bayesian networks (e.g. Yodo 
et al. 2017) and Markov Models (e.g. Zhao et al. 
2017). Fault propagation in a hazard and oper- 
ability analysis (HAZOP) context for resilience 
assessment is described in (Cai et al. 2015). Also 


first approaches have been reported to use FMEA 
for resilience assessment with the aim of applying 
the functional resonance analysis methodology 
(FRAM) to a smart building (Mock et al. 2016). 
However, for instance fault tree analysis has not 
yet been used for resilience analysis. 

First attempts of an evaluation of the suitabil- 
ity of system analysis, simulation and development 
methods including some selected classical system 
analysis methods for resilience assessment have 
been conducted in (Haring et al. 2016c; Haring 
et al. 2016b). In contrast to (Haring et al. 2017), the 
present approach does not provide generic consid- 
erations of the suitability of methods for technically 
driven resilience assessment and development proc- 
ess steps. 

The present approach focuses on determining 
which contributions to resilience assessment can 
be expected from mainly analytical system analysis 
methods and their extensions. To this end, it resorts 
to already often used resilience dimensions such as 
resilience or catastrophe response phases, system 
management domains or resilience capabilities. 

By resorting to 5 such resilience dimensions, as
detailed and motivated below, the expected relevance
of mainly analytical system analysis methods
is assessed from different and complementary
perspectives. Using the resilience dimensions, the
method suitability assessment resolves the
expected benefit in the respective phases or resilience
aspects rather than aiming at an overall applicability
scoring.

This is conducted based on expert judgement
(Meyer, Booker 2001) and consensus feedback of
scientists related to the research field of technical
safety and risk analysis. Groups of master’s
students close to graduation in security and safety
engineering at a university of applied sciences, mainly
trained in tabular system analysis methods, also
contributed.

Key motivations for focusing on classical system 
analysis methods include: 


• Analytical system analysis methods are established
and accepted by practitioners in industry;

• Expectation that resilience analysis can in
parts be delivered with extensions of classical
methods;

• Expected efficiency of semi-quantitative methods
compared to quantitative approaches;

• Identification of implicit resilience activities
within current existing risk analysis and management
practice;

• Clarification of resilience concepts by specifying
methods supporting their fulfillment;

• Identification of critical resilience aspects that
need to be analyzed with more effort, i.e. going
beyond classical system analysis methods.


The paper is structured as follows. Section 2
describes the approach to assessing the suitability
of mainly classical analytical system analysis
methods for resilience analysis by employing
resilience dimensions suitable for technical resilience
understanding. Section 2 also details the methodology.
It illustrates the need to go beyond classical
risk assessment with the help of resilience event
propagation through logic and assessment layers.

In the following sections 3 to 7, for each of the
listed resilience concepts, possible contributions
from the methods are discussed. For each resilience
dimension a matrix is filled with assessments of the
suitability of each method for contributing to each of
the resilience dimension attributes. Also, recommendations
for the extension of the methods are given.

In section 8, the overall suitability of each 
method is summarized and conclusions regarding 
adaptations and further developments are drawn. 


2 APPROACH TO ASSESS THE 
SUITABILITY OF METHODS 


Before detailing the approach of suitability assess- 
ment of classical system analysis methods, the paper 
gives some general considerations on the necessary 
extension of methods for resilience assessment 
when compared to classical risk assessment. 
Conditional probability expressions based on and
extending classical notions of risk have recently
been used to quantify key objectives of resilient
response (Aven 2017). This shows that resilience
analysis may benefit from the application of traditional
and more modern risk concepts.

The idea used as starting point in (Aven 2017)
is that resilience behavior can be defined to occur
post disruption events. Thus resilience events B, e.g.
“system stabilizes post disruption”, “system recovers”,
“system bounces back better”, “recovery
time shorter than critical time” or “sufficient system
performance level reached within $t$”, are always
conditional on previous events,


$P(B \mid A)$, (1)


where A is a “disruptive event” or equals a chain
of events,


$A = A_1, A_2, \ldots, A_n$. (2)


This approach is related to the often used definition
of conditional vulnerability in risk expressions,
see e.g. (Gerstein et al. 2016),


$R = P(E)\,P(C \mid E)\,C$, (3)


where E is a threat event and C the consequence.

However, the classical vulnerability approach of 
(3) focuses on the quantification of the conditional 
consequence probability, whereas (1) refers to resil- 
ience behavior post disruption events. 

As the risk definition (3), which includes
vulnerability, is already an extension of the classical
definition of risk


$R = P(C)\,C$, (4)


and typical resilience expressions further extend
the definitions (1) and (3), it is expected that
classical system reliability and safety approaches
are challenged when used for assessing resilience. In
particular, (very) simple tabular approaches resort
to risk concepts as described by (4) when applied
in a traditional way, i.e. they focus on avoidance of
events and system robustness in case of events only.
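The relation between the classical risk expression (4) and the vulnerability-including expression (3) can be checked numerically. The following lines are a minimal sketch; all numbers are hypothetical and serve only as an illustration, and the function names are not taken from any of the cited works.

```python
import math


def risk_with_vulnerability(p_event: float, p_cons_given_event: float,
                            consequence: float) -> float:
    """Expression (3): R = P(E) * P(C|E) * C."""
    return p_event * p_cons_given_event * consequence


def classical_risk(p_consequence: float, consequence: float) -> float:
    """Expression (4): R = P(C) * C."""
    return p_consequence * consequence


# Hypothetical numbers, chosen only for illustration
p_e = 0.01          # probability of the threat event E
p_c_given_e = 0.2   # conditional probability of the consequence given E
c = 1000.0          # consequence in arbitrary units

r3 = risk_with_vulnerability(p_e, p_c_given_e, c)
# If C can only follow from E, then P(C) = P(E) * P(C|E) and (4) reproduces (3)
r4 = classical_risk(p_e * p_c_given_e, c)
print(r3, r4, math.isclose(r3, r4))
```

Both expressions only weight the occurrence of a consequence; neither contains a term for what happens after the disruption, which is the part that the conditional expression (1) adds.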
Generalizing (1), resilience expressions of interest
typically are of the form (Häring et al. 2016a)


$P(B \mid A) = \sum_{i=1}^{n} P(B \mid D_i)\,P(D_i \mid A)$, (5)


where $D_i$, $i = 1, 2, \ldots, n$, form a complete set of
expansion events. Equation (5) uses the law of total
probability and can be understood as an insertion
of unity of all possible intermediate states


$\sum_{i=1}^{n} P(\cdot \mid D_i)\,P(D_i \mid \cdot)$ (6)


between any two known states. Equation (5) can
also express the idea of possibly unknown transition
states or disruptions which are included in
the set $D_i$. In this case, A is just a system initial state.

Of course, (5) can be generalized to consider 
multiple resilience layers or response and recovery 
phases, see (Haring et al. 2016a). Along the lines of 
interpretation given for (1), an interpretation of (5) 
reads for instance 


$A$ = “Disruption event”,
$\{D_i\}_{i=1,\ldots,n}$ = Set of possible response and recovery events / Set of transition states, (7)
$B$ = “Final state of interest”.
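A small worked example may help to read expression (5) with the interpretation (7). It is a sketch under stated assumptions: the three response events and all probabilities are hypothetical, and the function name resilience_probability is introduced here only for illustration.

```python
import math
from typing import Sequence


def resilience_probability(p_b_given_d: Sequence[float],
                           p_d_given_a: Sequence[float]) -> float:
    """Evaluate expression (5): sum over i of P(B|D_i) * P(D_i|A)."""
    if len(p_b_given_d) != len(p_d_given_a):
        raise ValueError("both sequences must cover the same events D_i")
    if not math.isclose(sum(p_d_given_a), 1.0, abs_tol=1e-9):
        raise ValueError("the events D_i must form a complete set given A")
    return sum(pb * pd for pb, pd in zip(p_b_given_d, p_d_given_a))


# Hypothetical response and recovery events after a disruption A:
# D_1 automatic recovery, D_2 manual recovery, D_3 no response
p_d_given_a = [0.7, 0.2, 0.1]
# Hypothetical probabilities of reaching the final state B for each D_i
p_b_given_d = [0.95, 0.80, 0.10]

print(resilience_probability(p_b_given_d, p_d_given_a))
```

With these numbers the three contributions are 0.665, 0.16 and 0.01. The check that the P(D_i | A) sum to one corresponds to the requirement that the D_i form a complete set of expansion events.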


When comparing risk and vulnerability expres- 
sions of the form (3) and (4) with resilience expres- 
sions of the form (1) and (5), it becomes obvious 
that it is not straightforward to expect that classi- 
cal analytical system analysis methods can deliver 
assessment results regarding resilience. This moti- 
vates the question how such methods can contrib- 
ute to resilience assessment. 

For focusing the research question, the follow- 
ing system modelling, classical system analysis 
and system development methods are considered 
regarding their suitability for resilience assessment: 


• SysML, UML;

• HL, PHA, HA, O&SHA, HAZOP;

• FMEA, FMECA, FMEDA;

• RBD;

• ETA;

• DFM;

• FTA, time-dependent FTA (TDFTA);

• Reliability prediction with standards;

• Methods for HW and SW development;

• Bit error correction methods.


To assess the suitability of methods for resilience
engineering, the following resilience dimensions
are used:


• 5-step risk management process (AS/NZS ISO
31000:2009), for review: (Purdy 2010), (Luko
2013), for critical discussion mainly regarding
the coverage of uncertainty (Aven 2011);

• Resilience time-phase cycle, based on (Thoma
2014);

• Technical resilience capabilities, based on
(Haring et al. 2016a);

• System layers, based on (Haring 2016a);

• Resilience criteria (Bruneau et al. 2003) (Pant
et al. 2014);

• Resilience analysis and management process
(Haring et al. 2017).


For the first 5 resilience dimensions, each combination
of system analysis method and resilience
dimension attribute is assessed using the three
equivalent semi-quantitative scales


{1, 2, 3, 4, 5},

{--, -, o, +, ++},

{not suited (adaptation useless),
rather not suited (or only with major modifications),
potentially suited (after adaptation),
suited (with minor modifications),
very well suited (straightforward/no adaptations)}. (8)
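Because the three scales of (8) are equivalent, an assessment can be stored in one form and displayed in another. The lookup tables below are a minimal sketch of this correspondence; the names SCALE and describe are hypothetical, and the order of the entries simply follows (8).

```python
# One row per scale level: (numeric score, symbol, verbal rating), following (8)
SCALE = (
    (1, "--", "not suited (adaptation useless)"),
    (2, "-", "rather not suited (or only with major modifications)"),
    (3, "o", "potentially suited (after adaptation)"),
    (4, "+", "suited (with minor modifications)"),
    (5, "++", "very well suited (straightforward/no adaptations)"),
)

SCORE_TO_SYMBOL = {score: symbol for score, symbol, _ in SCALE}
SYMBOL_TO_SCORE = {symbol: score for score, symbol, _ in SCALE}
SCORE_TO_LABEL = {score: label for score, _, label in SCALE}


def describe(symbol: str):
    """Translate a table symbol into its numeric score and verbal rating."""
    score = SYMBOL_TO_SCORE[symbol]
    return score, SCORE_TO_LABEL[score]


print(describe("+"))
```

Keeping the numeric form alongside the symbols makes it easy to sort or compare ratings within one resilience dimension, although the scale remains ordinal and averages over it should be interpreted with care.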


Typical examples read as follows: (i) The identification
of potential disruption events of systems
can be supported by using the classical system
analysis methods hazard list (HL) and preliminary
hazard analysis (PHA). Hazard lists are very well
suited for identifying hitherto unknown events
when used as checklists of potential disruptions
and asking the question “what if?”. Regarding
the identification of possible disruptions for a
system under consideration, the overall rating
could be “++” for HL and “+” for PHA. This example
shows that rather than assessing the generic suitability
of a method, its use within a certain resilience
assessment process or conceptual structuring
is addressed.

(ii) Fault tree analysis (FTA) makes it possible to
consider combinations of events by using the AND gate.
When only a known sequence of events is possible,
the sequencing AND gate can be used, which
enforces an order of occurrence of events. Such a
sequence might be first “detection of threat”, second
“decision to start counter-measure” and third
“activation of counter-measure”. This order is then used
for assessing the probability of success of a technical
prevention measure. Sequential events can be analyzed
with time-dependent Boolean differences to
analyze sequential structure functions rather than
classical combinatorial Boolean structure functions
(Moret, Thomason 1984). Hence, FTA and, even
more so, TDFTA can be expected to also cover the
response and recovery phase after modifications,
resulting in a “+” assessment for each.
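The difference between the plain AND gate and the sequencing AND gate can be made concrete with event occurrence times. The sketch below is an illustration only: the function names are hypothetical, an event that has not occurred is represented by None, and simultaneous events are assumed not to satisfy the required order.

```python
from typing import Optional, Sequence


def and_gate(times: Sequence[Optional[float]], t: float) -> bool:
    """True if every input event has occurred by time t, in any order."""
    return all(x is not None and x <= t for x in times)


def sequencing_and_gate(times: Sequence[Optional[float]], t: float) -> bool:
    """True if every input event has occurred by time t in the listed order."""
    if not and_gate(times, t):
        return False
    # Strictly increasing occurrence times (ties are assumed to violate the order)
    return all(a < b for a, b in zip(times, times[1:]))


# Hypothetical occurrence times [s] of detection, decision and activation
in_order = [1.0, 2.5, 4.0]
out_of_order = [2.5, 1.0, 4.0]

print(and_gate(in_order, 5.0), sequencing_and_gate(in_order, 5.0))
print(and_gate(out_of_order, 5.0), sequencing_and_gate(out_of_order, 5.0))
```

For the second list the AND gate is satisfied while the sequencing AND gate is not, which is the distinction needed when the success of a counter-measure depends on detection, decision and activation happening in this order.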


3 SUITABILITY ASSESSMENT WITH 
FIVE-STEP RISK MANAGEMENT 
SCHEME 


The 5-step risk management scheme is only a very
generic framework for identifying risks to resilience
objectives. As discussed in the introduction,
objectives in the case of resilience analysis are more
second order (e.g. “fast recovery in case of disruption”)
when compared to classical risk analysis and
management (e.g. “avoid disruption”).

Table 1 assesses the suitability of analytical system
analysis and some development methods for resilience
analysis sorted along the 5-step risk management
scheme using the scale of (8).

Understanding the system sufficiently for resilience
risk analysis is supported by graphical, semi-formal
modelling with the Unified and Systems Modeling Languages
(UML/SysML), see the first two lines of Table 1.

The initial hazard analysis methods HL and PHL 
support the identification of possible disruptions. 
They are considered as a starting point. Refined ana- 
lyses can be supported with SSHA, HA, O&SHA, 
and HAZOP, the differences of which are typically 
small and depend on the application; for small sys- 
tems they can be summarized in one analysis. 

Approaches that need substantial system knowledge
include RBD, the inductive approaches ETA and
FME(D/C)A, and the deductive approaches (TD) FTA,
which are often summarized in a bow tie analysis.
FMEA variations are expected to be more efficient
when they take system functions (or services) as
inductive starting points rather than system components
or subsystems.

In the case of (TD) FTA, the success of applica- 
tion will strongly depend on the definitions of the top 
events, which should cover main resilience objectives. 


Table 1. Suitability of analytical system analysis and HW/SW
development methods for resilience analysis sorted along the
5-step risk management scheme.

Columns (5-step risk management process steps): (1) Establish context; (2) Identify risk/hazards; (3) Analyze/compute risks; (4) Evaluate risks; (5) Mitigate risks.

Rows (methods): SysML; UML; HL; PHA; SSHA, HA; O&SHA, HAZOP; FMEA, FMECA; FMEDA; RBD; ETA; DFM; FTA, TDFTA; Reliability prediction with standards; Methods for HW and SW development; Bit error correction methods.

[Table body: the individual ratings are not legible in the source, except for the last row, Bit error correction methods: -, -, -, -, ++.]



4 METHOD USABILITY ASSESSMENT 
USING RESILIENCE RESPONSE CYCLE 
TIME PHASES 


The catastrophe management cycle in 4 steps (e.g.
preparation, prevention and protection, response
and recovery, learning and adaption), as well as the
5-step version used in Table 2, takes advantage of a logical
or temporal ordering of events with respect to disruption
events (Häring et al. 2016a): (far) before, during,
immediately (after). Another typical timeline and
logic sequence example was given in section 2.

The first observation in Table 2 is that the analysis
methods should be conducted, if considered relevant,
mainly in the preparation phase. However,
fast and simple analytical methods in particular can also
be applied while response and recovery are actually
under way. For instance, during and after events,
a PHA scheme could be used to identify further
possible second-order events given a disruption.

The second observation in Table 2 concerns
the coverage of resilience cycle phases. The
suitability ratings stem from the fact
that classical analytical approaches by definition
cover prevention and protection when these are identified
with the assessment of event frequency and of immediate
(first order) damage.


Table 2. Suitability of system modelling, analytical system
analysis and selected development methods for resilience analysis
along the 5 phases of the resilience cycle: Resilience event order
logic or timeline.

Columns (resilience timeline cycle phases): (1) Prepare; (2) Prevent; (3) Protect; (4) Respond; (5) Recover.
(? = rating not legible in the source)

Method                                   (1)   (2)   (3)   (4)   (5)
SysML                                    +     +     +     +     +
UML                                      +     +     +     +     ?
HL                                       +     +     ++    ?     ?
PHA                                      ?     ?     +     +     ?
SSHA, HA                                 ++    +     +     +     +
O&SHA, HAZOP                             +     ?     ?     +     +
FMEA, FMECA                              ++    ++    ++    +     +
FMEDA                                    ?     ?     ?     +     +
RBD                                      ++    +     +     +     +
ETA                                      ?     +     ++    +     +
DFM                                      ?     ?     +     +     ?
FTA, TDFTA                               ++    +     ++    +     ?
Reliability prediction with standards    ?     ?     ?     ?     +
Methods for HW and SW development        ++    +     ++    +     +
Bit error correction methods             ++    ?     ?     ?     +

If system failure, in variations of HA,
FMEA and FTA, is defined as failure of adequate
response (e.g. absorption and stabilization), of
recovery (e.g. reconstruction and rebuilding) or even
of improving or bouncing forward using damage
as an optimization opportunity, these methods can,
with adaptions, also be used for these resilience timeline
phases. RBD and ETA are assessed similarly.

The system modelling and development methods 
for hardware and software (HW/SW) can be used 
for all resilience cycle phases. As in the case of classi- 
cal system analysis methods, adaptions up to major 
new developments are believed to be necessary. 

Although the classical tabular system
analysis methods in particular were assessed as very relevant
for resilience assessment, it is noted that they are
part of established processes in practice. Therefore,
even if only a few columns are added,
changing their established best practice is expected to
be challenging in company development environments.
In this sense, Table 2 is a guideline for the
expected usability of the listed methods.


5 METHOD USABILITY ASSESSMENT 
USING TECHNICAL RESILIENCE 
CAPABILITIES 


Sensor-logic-actor chains are basic functional ele- 
ments used for active safety applications in safety 
instrumented systems, especially within the context of 
functional safety as governed by (IEC 61508 Series). 
The technical resilience capabilities can be considered 
as a generalization of such functional capabilities. 

They can also be related to the much more abstract 
and generic OODA (observe, orient, decide, act) loop, 
which has found much application also in the catas- 
trophe response arena, see e.g. (Lubitz et al. 2008; 
Huang 2015). The technical resilience capabilities are 
also very close to capabilities to be expected from a 
general artificial intelligence (Baum et al. 2010) and 
related possible architectures (Goertzel et al. 2008). 

Table 3 assesses the suitability for use of the
selected methods along each technical resilience
capability dimension attribute. Since the technical
resilience capabilities are generic properties of (socio)
technical systems, the realization of the properties
in systems is prone to risks of several kinds: external and internal;
accidental and intentional; safety and security
related; systematic (by construction) and statistical.

In Table 3, the more generic system model- 
ling methods SysML and UML are rated better 
when compared to more specific methods. Table 3 
expresses with the uniform distribution of “+” that 
any resilience analysis conducted using the methods 
has to take into account all the technical resilience 
properties. This shows that major adaptations and 
further developments are necessary to apply clas- 
sical methods, since a cross-cutting task has to be 
covered by the methods. 


Table 3. Suitability of system modelling, analytical system
analysis and selected development methods for resilience analysis
along the technical resilience capability dimension attributes.

Columns (technical resilience capabilities): (1) Observation, sensing; (2) Representation, modeling, simulation; (3) Inference, decision making; (4) Activation, action; (5) Learning, modification, adaption, rearrangement.

Method                                   (1)   (2)   (3)   (4)   (5)
SysML                                    ++    +     +     +     +
UML                                      +     ++    ++    +     +
HL                                       +     +     +     +     +
PHA                                      +     +     +     +     +
HA                                       +     +     +     +     +
O&SHA, HAZOP                             +     +     +     +     +
FMEA, FMECA                              +     +     +     +     +
FMEDA                                    +     +     +     +     +
RBD                                      +     +     +     +     +
ETA                                      +     +     +     +     +
DFM                                      +     +     +     +     +
FTA, TDFTA                               +     +     +     +     +
Reliability prediction with standards    +     +     +     +     +
Methods for HW and SW development        +     +     +     +     +
Bit error correction methods             o     o     o     o     o


For instance, columns or labels could be added
to HA, FMEA and ETA to indicate which type of system
resilience function a system failure belongs to.
Likewise, the top-level event formulations of FTAs
either have to address the functional steps separately
or have to be sufficiently generic to allow for
combinations of top events.


6 METHOD USABILITY ASSESSMENT 
USING SYSTEM LAYERS 


Table 4 assesses the potential of application of the
representative methods of section 2 with the help of
system layers for socio technical systems. The often
used 4 layers (physical, information, cognitive, social),
see e.g. (Fox-Lent et al. 2015), have been refined in
the physical-technical domain and specified in more detail in
all attributes when compared to (Häring et al. 2016a).

The strength of the selected representative very
specific methods is in the domain of hardware and
data integrity as well as HW/SW development.
Also the classical tabular methods focus somewhat
on electronics, especially FMEDA.

Table 4. Suitability of system modelling, analytical system
analysis and selected development methods for resilience analysis
along system layers or generic management domains.

Columns (system layer, management domain): (1) Physical; (2) Technical, hardware; (3) Cyber, software-wise, protocols; (4) Operational, organizational; (5) Societal, economic, ethical.
(? = rating not legible in the source)

Method                                   (1)   (2)   (3)   (4)   (5)
SysML                                    +     +     +     +     +
UML                                      ++    ++    +     +     +
HL                                       +     +     +     o     o
PHA                                      +     +     +     o     o
HA                                       +     ++    +     o     o
O&SHA, HAZOP                             +     ++    +     +     +
FMEA, FMECA                              +     +     +     o     o
FMEDA                                    +     ++    +     o     o
RBD                                      +     ?     +     o     o
ETA                                      +     ++    +     +     +
DFM                                      +     ++    +     +     +
FTA, TDFTA                               +     +     +     o     o
Reliability prediction with standards    -     +     -     o     o
Methods for HW and SW development        +     +     +     o     o
Bit error correction methods             -     +     +     o     o

Table 5. Suitability of system modelling, analytical system
analysis and selected development methods for resilience analysis
along modified resilience criteria.

The general purpose methods ETA, DFM, RBD
and FTA, and especially HAZOP, require educated application, often
outside their classical domain of application.
RBDs and SysML/UML methods
are expected to be of use for acquiring and
documenting sufficient system understanding.

HA-type methods are well suited but need to be
applied outside their typical technical domain, also
to operational and societal system layers.


7 METHOD USABILITY ASSESSMENT 
USING RESILIENCE CRITERIA 


Table 5 assesses the potential of application of the 
representative methods of section 2 with the help 
of the often used 4 resilience criteria introduced 
by (Bruneau et al. 2003) and technically refined 
by (Pant et al. 2014). For the suitability of method 
assessment, the following modified working defini- 
tions are used in this paper: 


1. Robustness: measure for low level of damage
(vulnerability) in case of event; ‘good’ absorption
behavior; ‘good’ protection.

2. Redundancy: measure for low level of overall
system effect in case of local (in space, in
time, etc.) disruption event; system disruption
tolerance.

3. Resourcefulness: measure for capability of successful
allocation of resources in the response
phase to stabilize the system post disruptions.

4. Rapidity: measure for fast recovery of system.

Table 5 columns (modified resilience criteria): (1) Robustness: low initial damage; (2) Redundancy: system property of overall damage tolerance; (3) Resourcefulness: fast stabilization and response; (4) Rapidity: fast recovery and reconstruction.

Table 5 rows (methods): SysML; UML; HL; PHA; HA; O&SHA, HAZOP; FMEA, FMECA; FMEDA; RBD; ETA; DFM; FTA, TDFTA; Reliability prediction with standards; Methods for HW and SW development; Bit error correction methods.

[Table 5 body: the individual ratings are not legible in the source, except that Reliability prediction with standards and Methods for HW and SW development are rated - in all four columns, and Bit error correction methods is rated - in at least the first three.]


The results are similar to the suitability assess- 
ment along the logic or timeline resilience cycle 
phases as conducted in section 4: classical analyti- 
cal approaches do not focus beyond the damage 
events. Robustness, resourcefulness and rapidity 
are according to the working definitions strongly 
related to the resilience cycle phases absorption/ 
protection, response and recovery. 

Redundancy is understood in the classical way 
as an overall system property. Hence in all cases 
sufficient system understanding is required, which 
is supported by graphical modelling. 

The (also) graphical approaches RBD, ETA and
FTA are strong for redundancy and resourcefulness
assessment. Especially time dependent FTA
and underlying time dependent Markov models
are believed to be key for resourcefulness and
redundancy assessment, albeit with major
adaptations.


8 SUMMARY AND CONCLUSIONS 


In summary, each of the representative analytical 
system analysis methods as well as HW/SW devel- 
opment methods (techniques and measures in the 
sense of (IEC 61508 Series)) showed potential for 
resilience engineering, i.e. resilience assessment 
and development and optimization as defined in 
the introductory sections. 

The classical tabular approaches HA and 
FMEA are assessed to be suited with minor up to 
major modifications for resilience analytics. Major 
advantages are expected by redefining and add- 
ing dedicated columns to cover resilience aspects. 
Also graphical methods like RBD, ETA and FTA 
are tools that by definition cover at least techni- 
cal aspects of resilience of systems in case of very 
informed application. 

In all cases, the extensions and adaptions need
to carefully consider the initial background and
application context of the methods. Therefore, in
technical resilience engineering contexts,
it is expected that the methods have to be newly
established. This holds since all these methods are
prone to routine use, which is often contrary
to the out-of-the-box thinking necessary for resilience
engineering. For instance, established hazard
lists for an application domain will not contribute
to a disruption threat list that is as complete as
possible.

The different resilience dimensions used for suit- 
ability assessment exhibited strengths and weak- 
nesses for exploring the methods’ potentials: 


• The risk management cycle is a very generic
process, allocating most analysis methods to the
risk analysis step. Formulating resilience objectives
is both key and challenging.

• Resilience cycle (time or logic) phases allow
assessments and activities to be spread out. However,
they are prone to ‘divide et impera’ effects of losing
the overall picture.

• Technical resilience capabilities need to be covered
for the operation of typical system (service)
functions on overall system level, allowing a technical
approach. Modifying and extending classical methods
to cover them is deemed challenging.

• Traditional resilience criteria (“Resilience Rs”)
can be nicely linked to timeline/logic concepts
as well as system redundancy assessments. They
also link with performance-based resilience
curve assessments. Challenges are expected when
trying to translate the more abstract concepts
into system analysis and development requests.


In summary, future work is expected to benefit
from informed further development of classical system
analysis methods for resilience analysis. Such
resilience analytics is also believed to strongly support
the development of resilient systems, in particular
in industrial environments. Such informed
applications are expected to reap many of the benefits
listed in the bullet list of the introduction.


ACKNOWLEDGEMENTS 


This research has been conducted in the context
of the Freiburg Sustainability Center of Excellence,
a cooperation of the Fraunhofer institutes
in Freiburg and the Albert-Ludwigs-University
Freiburg. It is supported by grants from the
Baden-Württemberg Ministry of Economics and
the Baden-Württemberg Ministry of Science,
Research and the Arts. In parts, the work has also
been supported by the German BMBF Project
“Windows for continuous academic education”
within the Sub-Project “Resilient Technical Systems”.
Thanks also go to master students of
security and safety engineering of the Hochschule
Furtwangen University.


REFERENCES 


AS/NZS ISO 31000:2009: Risk management - Principles 
and guidelines. 

Aven, Terje (2011): On the new ISO guide on risk management
terminology. In Reliability Engineering and
System Safety 96 (7), pp. 719-726. DOI:
10.1016/j.ress.2010.12.020.

Aven, Terje (2017): How some types of risk assessments 
can support resilience analysis and management. In 
Reliability Engineering & System Safety 167, pp. 536- 
543. DOI: 10.1016/j.ress.2017.07.005. 

Baum, Eric; Hutter, Marcus; Kitzelmann, Emanuel 
(Eds.) (2010): Artificial general intelligence. Proceed- 
ings of the Third Conference on Artificial General 
Intelligence, AGI 2010, Lugano, Switzerland, March 
5-8, 2010. Conference on Artificial General Intelli- 
gence; AGI. Amsterdam: Atlantis Press (Advances in 
intelligent systems research, 10). 

Bruneau, Michel; Chang, Stephanie E.; Eguchi, Ronald
T.; Lee, George C.; O’Rourke, Thomas D.; Reinhorn,
Andrei M. et al. (2003): A Framework to Quantitatively
Assess and Enhance the Seismic Resilience of
Communities. In Earthquake Spectra 19 (4), pp. 733-752.
DOI: 10.1193/1.1623497.

Cai, Zhansheng; Hu, Jinqiu; Zhang, Laibin; Ma, Xi 
(2015): Hierarchical fault propagation and control 
modeling for the resilience analysis of process system. 
In Chemical Engineering Research and Design 103, pp. 
50-60. DOI: 10.1016/j.cherd.2015.07.024. 




Gerstein, Daniel M.; Kallimani, James G.; Mayer, Lauren
A.; Meshkat, Leila; Osburg, Jan; Davis, Paul et al.
(2016): Developing a Risk Assessment Methodology
for the National Aeronautics and Space Administration.
RAND Corporation (RR-1537-NASA).
Available online at
https://www.rand.org/pubs/research_reports/RR1537.html.

Fox-Lent, Cate; Bates, Matthew E.; Linkov, Igor (2015):
A matrix approach to community resilience assessment.
An illustrative case at Rockaway Peninsula. In
Environ Syst Decis 35 (2), pp. 209-218. DOI:
10.1007/s10669-015-9555-4.

IEC 61508 Series, 2010: Functional safety of electrical/
electronic/programmable electronic safety-related
systems. Available online at
http://www.iec.ch/functionalsafety/standards/page2.htm,
checked on 12/27/2017.

Goertzel, Ben; Wang, Pei; Franklin, Stan (Eds.) (2008): 
Artificial general intelligence, 2008. Proceedings of 
the First AGI Conference. ebrary, Inc; AGI Confer- 
ence. Amsterdam, Washington, DC: IOS Press (Fron- 
tiers in artificial intelligence and applications, v. 171). 

Häring, Ivo; Ebenhoch, Stefan; Stolz, Alexander (2016a):
Quantifying resilience for resilience engineering of
socio technical systems. In Eur J Secur Res 1 (1), pp.
21-58. DOI: 10.1007/s41125-015-0001-x.

Häring, Ivo; Sansavini, Giovanni; Bellini, Emanuel;
Martyn, Nick; Kovalenko, Tatyana; Kitsak, Maksim
et al. (2017): Towards a generic resilience management,
quantification and development approach. In:
Linkov I., Palma-Oliveira J. (eds) Resilience and Risk.
NATO Science for Peace and Security Series C:
Environmental Security. Springer, pp. 21-80.
https://link.springer.com/chapter/10.1007/978-94-024-1123-2_2.

Häring, Ivo; Scharte, Benjamin; Hiermaier, Stefan
(2016b): Towards a novel and applicable approach
for Resilience Engineering. In: 6-th International
Disaster and Risk Conference (IDRC). Integrative Risk
Management - towards resilient cities. Davos,
28.08-01.09.

Häring, Ivo; Scharte, Benjamin; Stolz, Alexander;
Leismann, Tobias; Hiermaier, Stefan (2016c): Resilience
Engineering and Quantification for Sustainable Systems
Development and Assessment. In: Resource
Guide on Resilience. Lausanne: EPFL International
Risk Governance Center.

Huang, Yanyan (2015): Modeling and simulation method 
of the emergency response systems based on OODA. 
In Knowledge-Based Systems 89, pp. 527-540. DOI: 
10.1016/j.knosys.2015.08.020. 

2010: IEC 61508 - Functional safety of electrical/elec- 
tronic/programmable electronic safety-related systems. 

Linkov, Igor; Florin, M.-V. (Eds.) (2016): IRGC
Resource Guide on Resilience. Edited Book. Available
online at
https://www.irgc.org/risk-governance/resilience/.


Lopez-Cuevas, Armando; Ramirez-Marquez, José;
Sanchez-Ante, Gildardo; Barker, Kash (2017): A
Community Perspective on Resilience Analytics. A
Visual Analysis of Community Mood. In Risk analysis:
an official publication of the Society for Risk Analysis
37 (8), pp. 1566-1579. DOI: 10.1111/risa.12788.

Lubitz, D.K. von; Beakley, James E.; Patricelli,
Frederic (2008): ‘All hazards approach’ to disaster
management: the role of information and knowledge
management, Boyd’s OODA Loop, and network-centricity.
In Disasters 32 (4), pp. 561-585. DOI:
10.1111/j.0361-3666.2008.01055.x.

Luko, Stephen N. (2013): Risk Management Principles 
and Guidelines. In Quality Engineering 25 (4), pp. 
451-454. DOI: 10.1080/08982112.2013.814508. 

Meyer, M.A.; Booker, J.M. (2001): Eliciting and Analyz- 
ing Expert Judgment. A Practical Guide: Society for 
Industrial and Applied Mathematics. 

Mock, R.; Lopez de Obeso, Luis; Zipper, Christian 
(2016): Resilience assessment of internet of things. A 
case study on smart buildings. In Lesley Walls, Mat- 
thew Revie, Tim Bedford (Eds.): European Safety and 
Reliability Conference (ESREL). Glasgow, 25-29.09. 
London: Taylor & Francis Group, pp. 2260-2267. 

Moret, B.M.E.; Thomason, M.G. (1984): Boolean 
Difference Techniques for Time-Sequence and 
Common-Cause Analysis of Fault-Trees. In IEEE 
Trans. Rel. R-33 (5), pp. 399-405. DOI: 10.1109/ 
TR.1984.5221879. 

Pant, Raghav; Barker, Kash; Ramirez-Marquez, Jose 
Emmanuel; Rocco, Claudio M. (2014): Stochastic 
measures of resilience and their application to con- 
tainer terminals. In Computers & Industrial Engineer- 
ing 70, pp. 183-194. DOI: 10.1016/j.cie.2014.01.017. 

Purdy, Grant (2010): ISO 31000. 2009—Setting a 
New Standard for Risk Management. In Risk 
analysis: an official publication of the Soci- 
ety for Risk Analysis 30 (6), pp. 881-886. DOI: 
10.1111/j.1539-6924.2010.01442.x. 

Thoma, Klaus (Ed.) (2014): Resilien-Tech: “Resilience by
Design”: a strategy for the technology issues of the
future. München: Herbert Utz Verlag; Utz, Herbert
(acatech STUDY).

Thorisson, Heimir; Lambert, James H.; Cardenas, John
J.; Linkov, Igor (2017): Resilience Analytics with
Application to Power Grid of a Developing Region.
In Risk analysis: an official publication of the Society
for Risk Analysis 37 (7), pp. 1268-1286. DOI:
10.1111/risa.12711.

Yodo, Nita; Wang, Pingfeng; Zhou, Zhi (2017): Predic- 
tive Resilience Analysis of Complex Systems Using 
Dynamic Bayesian Networks. In IEEE Trans. Rel. 66 
(3), pp. 761-770. DOI: 10.1109/TR.2017.2722471. 

Zhao, S.; Liu, X.; Zhuo, Y. (2017): Hybrid Hidden 
Markov Models for resilience metrics in a dynamic 
infrastructure system. In RESS 164, pp. 84-97. DOI: 
10.1016/j.ress.2017.02.009. 




Safety and Reliability - Safe Societies in a Changing World — Haugen et al. (Eds) 
© 2018 Taylor & Francis Group, London, ISBN 978-0-8153-8682-7 


Interdependent infrastructure network restoration from a community 
resilience perspective 


K. Barker, D.B. Karakoc & Y. Almoghathawi 
University of Oklahoma, Norman, OK, US 


ABSTRACT: Many critical infrastructure networks that dot the global landscape often rely on each 
other in different ways for each to be functional. Government planning documents around the world 
recognize the interdependence of these infrastructure networks. But naturally infrastructure networks do 
not exist for their own operation but because society relies upon them for convenience, productivity, and 
health, among others. Recent large-scale disruptions to critical infrastructure, primarily due to natural 
disasters whose frequency appears to be increasing, have left communities devastated for extended peri- 
ods. As such, planning for the resilience of critical cyber-physical-social networks should emphasize the 
social aspects of disruptions. In this work, we study the problem of the restoration of interdependent 
infrastructure networks after the occurrence of a disruptive event with a focus on the vulnerability of the 
society that interacts with the networks. We integrate (i) a resilience-driven multi-objective mixed-integer 
programming formulation that schedules the restoration of disrupted demand nodes in each network 
with (ii) a geographically distributed index of social vulnerability that measures the impact to the com- 
munity surrounding the disrupted demand nodes. This model integration is illustrated with an example of 
community resilience in Shelby County, Tennessee. 


1 INTRODUCTION 


A critical infrastructure network is defined as a 
network of independent, mostly privately-owned, 
human-made systems and processes that function 
collaboratively and synergistically to produce and 
distribute a continuous flow of essential goods and 
services (The Report of the President’s Commis- 
sion on Critical Infrastructure Protection 1997). 
Such infrastructure networks, such as electric 
power, water distribution, natural gas, transporta- 
tion, and telecommunications, operate on a daily 
basis to maintain the functioning of modern socie- 
ties and to provide their essential needs. 

With continuous technological developments,
infrastructure networks and their distribution
systems become more dependent on each
other’s functionality to perform with higher efficiency
(Rinaldi et al. 2001). However, this type of
complex coordination and interconnection in
various respects, such as sharing components, utilizing
one’s output as another’s input, transmitting
information and much more, makes critical infrastructure
networks more vulnerable to possible
disruptive events (Rinaldi et al. 2001). This
(often bi-directional) relationship among these
networks increases the possibility of chain reactions
between the disrupted and undisrupted components,
where one infrastructure network might
lead to the failure of another one due to their high
interdependency (Little 2002, Wallace et al. 2003,
Buldyrev et al. 2010, Eusgeld et al. 2011, Ouyang
2014, Danziger et al. 2016, Wu et al. 2016). Therefore,
resilience planning in the form of restoration
scheduling of these potentially highly vulnerable
networks becomes more challenging, especially
when the increasing frequency of man-made or
natural disruptive events is considered.

In the literature, many different approaches have
been introduced to quantify the resilience of a network,
where the ability to withstand, adapt to, and
recover from a disruption is referred to as resilience
(Barker et al. 2017). As shown in Figure 1, consider
two primary dimensions of resilience: vulnerability
and recoverability (Henry and Ramirez-Marquez
2012, Barker et al. 2013). The vulnerability of a
network is defined as the magnitude of damage
in network performance due to a disruptive event
(Jönsson et al., 2008), whereas the recoverability of a
network describes the speed at which the network
reaches a desired performance level (Rose 2007).
Resilience can be defined as the time-dependent
ratio of network recovery over its loss (i.e.
Я(t) = Recovery(t) / Loss(t), where Я_φ(t | e^j) = 1
indicates that the network is fully resilient [Henry and
Ramirez-Marquez 2012]).

Figure 1. Network performance, φ(t), across state transitions
before, during, and after the occurrence of a disruptive event,
with the vulnerability and recoverability phases marked.
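As a worked illustration of the recovery-over-loss ratio, the sketch below evaluates it from three performance values. It is an assumption of this example, not stated in the text above, that loss is measured as the drop from nominal to disrupted performance and recovery as the gain since the disrupted state; the function and argument names are hypothetical.

```python
def resilience_ratio(phi_nominal, phi_disrupted, phi_now):
    """Recovery(t) / Loss(t) for a single performance measure.

    Assumed definitions (illustrative):
      loss     = nominal performance minus performance right after disruption
      recovery = current performance minus performance right after disruption
    """
    loss = phi_nominal - phi_disrupted
    if loss <= 0:
        raise ValueError("no performance loss: the ratio is undefined")
    recovery = phi_now - phi_disrupted
    return recovery / loss


if __name__ == "__main__":
    # Hypothetical network flow: 100 units nominal, 40 after the event.
    for current in (40, 70, 100):
        print(current, resilience_ratio(100, 40, current))
```

A value of 0 means nothing has been recovered yet, and a value of 1 corresponds to the fully resilient case mentioned above.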

Recent research has explored how critical infrastructure
networks are interrelated with the geographical
vulnerabilities of the
local regions where they are built and the social
vulnerabilities of their surrounding communities
(Cutter and Finch 2007). The specific and varying
demographics of the communities that these critical
infrastructures serve can be considered
key factors to guide response and
restoration operations.

In this initial study, we have considered a social 
vulnerability index as a community resilience 
measure which is a function of predetermined 
demographic factors of a community (Cutter 
et al. 2003). The weighted sum of these factors 
assigns relative vulnerability scores to disrupted 
components of the interdependent infrastructure 
networks according to the region in which they 
are located. In addition to the social vulnerability 
index, we have also considered the population den- 
sity of the community whose interdependent infra- 
structure network service is disrupted. 


2 SOCIAL VULNERABILITY INDEX 


Different levels of socio-economic conditions and
distinguishing properties of a community shape its
resilience against disruptive events by either contributing
to or counteracting its vulnerability. In
their work on the Social Vulnerability Index (SoVI),
Cutter et al. (2003) identify eleven key factors that
contribute to measuring the vulnerability of a community,
including the age, gender, race, wealth, and
occupation of members of the community.

A threshold level has been identified for each of the
factors, and the percentage of the population
that is below these limits is considered more
vulnerable to disruptive events and, therefore, a
contributor to the overall social vulnerability. Due to
their higher vulnerability, it is noted that during
the recovery process these subgroups would
require more time and investment of resources to
achieve a resilience level similar to that of other
communities (Cutter et al. 2003).

The SoVI-Lite technique (Cutter et al. 2011, 
Evans et al. 2014) is a reduced version of the SoVI 
that calculates a community’s vulnerability score 
with the following steps: 


1. Calculate the percentage of population that 
falls beyond the predetermined level of vulner- 
ability for each factor, 

2. Calculate the z-score of each factor by using 
mean and standard deviation of each factor, and 

3. Sum the z-scores of all factors to find the 
total social vulnerability score of a specific 
community. 


Furthermore, the SoVI-Lite score can be scaled 
between 0 and 1 using Eq. (1), where 0 represents 
the least socially vulnerable community and 1 is 
the most socially vulnerable community. 


$$\frac{x - \min(x)}{\max(x) - \min(x)} \qquad (1)$$
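The three SoVI-Lite steps and the scaling of Eq. (1) can be sketched as follows. This is an illustrative reading rather than the reference procedure: the community names and percentages are made up, the use of the population standard deviation is an assumption, and the sketch does not handle factors whose sign would need to be reversed.

```python
from statistics import mean, pstdev


def sovi_lite_scores(percentages):
    """Sum of per-factor z-scores for each community.

    percentages: dict mapping community -> list of factor percentages
    (step 1 output), with factors in the same order for every community.
    """
    communities = list(percentages)
    n_factors = len(percentages[communities[0]])
    totals = {c: 0.0 for c in communities}
    for f in range(n_factors):
        column = [percentages[c][f] for c in communities]
        mu, sigma = mean(column), pstdev(column)  # assumption: population std
        for c in communities:
            # Step 2: z-score; a factor with no spread contributes nothing.
            z = (percentages[c][f] - mu) / sigma if sigma > 0 else 0.0
            totals[c] += z  # Step 3: sum over factors
    return totals


def min_max_scale(scores):
    """Eq. (1): map scores to [0, 1]; 0 is least and 1 most vulnerable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {c: 0.0 for c in scores}
    return {c: (v - lo) / (hi - lo) for c, v in scores.items()}


if __name__ == "__main__":
    # Hypothetical step-1 percentages for two factors in three communities.
    tracts = {"A": [10.0, 20.0], "B": [30.0, 25.0], "C": [50.0, 45.0]}
    print(min_max_scale(sovi_lite_scores(tracts)))
```

Because z-scores are computed across the communities being compared, the resulting scores are relative to the study area and change if communities are added or removed.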


3 PROPOSED MODEL 


In this study, we have proposed a multi-objective
resilience-driven restoration optimization model
using mixed-integer programming, where our main
goal is to maximize the resilience of interdependent
infrastructure networks while minimizing the total
cost associated with the entire restoration phase
(Almoghathawi et al. 2017). We have integrated
a social vulnerability index into the objectives to help
guide interdependent infrastructure network restoration
from a community resilience perspective.

Let $K$ represent a set of infrastructure networks,
$K = \{1, \ldots, \kappa\}$, and $T$ represent a set of available
time periods, $T = \{1, \ldots, \tau\}$. For each network
$k \in K$, the sets of nodes and links are represented
by $N^k$ and $L^k$, respectively. The sets of source nodes
and demand nodes are defined with $N_s^k \subseteq N^k$
and $N_d^k \subseteq N^k$, respectively. The sets of disrupted
nodes and links are denoted by $N'^k$ and $L'^k$,
respectively.

Let $b_i^k$ be the maximum amount of supply at
node $i \in N_s^k$ in network $k \in K$, considered to be
the maximum flow from node $i \in N_s^k$ to all demand
nodes in network $k \in K$. The amount of unmet
demand at node $i \in N_d^k$ in network $k \in K$ in time
$t \in T$ is denoted by $s_{it}^k$. Total unmet demand at all
demand nodes in network $k \in K$ after recovery at
time period $t \in T$ is $\sum_{i \in N_d^k} s_{it}^k$.

To introduce the social vulnerability index into 
the model, the SoVI-Lite score, SoVI;, is cal- 
culated for node ie N% in network ke K. To 
more effectively emphasize the social vulnerability 
index, Eq. (2) introduces an exponential effect to 
give more relative importance to nodes in socially 
vulnerable areas with V;* . 


Vit = eh Y ie Nyt, bE Z (2) 
Population density was also
included in the proposed model, where densities
were assigned to demand nodes to more effectively
place importance on them during the restoration
process. $P_i^k$ is the population density for demand
node $i \in N_d^k$ in network $k \in K$, shown in Eq. (3).


$P_i^k = \frac{\text{population of community where node } i \text{ is located}}{\text{total population of area being studied}}, \quad \forall i \in N_d^k$ (3)
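The two node-level factors of Eqs. (2) and (3) can be sketched in Python as follows. The district names, scaled SoVI values, and populations are hypothetical, and multiplying the two factors into a single combined weight is an assumption made here for illustration rather than a statement of how the model combines them.

```python
import math


def vulnerability_weight(sovi_scaled, b=10):
    """Exponential transform of a scaled SoVI-Lite score, as in Eq. (2)."""
    return math.exp(b * sovi_scaled)


def population_share(community_population, total_population):
    """Share of the study-area population in a community, as in Eq. (3)."""
    return community_population / total_population


# Hypothetical districts: scaled SoVI-Lite score and population.
districts = {
    "A": {"sovi": 0.0, "population": 200_000},
    "B": {"sovi": 0.5, "population": 300_000},
    "C": {"sovi": 1.0, "population": 500_000},
}
total_population = sum(d["population"] for d in districts.values())

# Assumed combined weight V * P for demand nodes located in each district.
combined_weight = {
    name: vulnerability_weight(d["sovi"])
    * population_share(d["population"], total_population)
    for name, d in districts.items()
}
```

With b = 10, a district at the top of the scaled SoVI range receives a weight several orders of magnitude larger than one at the bottom, which is the exponential emphasis the text describes.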


The unmet demand in the network represents
the system loss in the maximum flow caused by a
disruptive event. Accordingly, decreasing the total
amount of unmet demand to a desirable level
reflects the effectiveness of the restoration
process and a reasonable recoverability level of the
network. Hence, the resilience of the system can
be represented by Eq. (4) as the ratio of the total
unmet demand that is recovered to the total amount
of unmet demand after the disruption occurs. This
equation represents the cumulative recovery of the
interdependent infrastructure networks over time
$t \in T$, where $Q_i^k$ is the unmet demand at demand
node $i \in N_d^k$ in network $k \in K$ after a disruption
and $w_i^k$ is the weight of demand node $i \in N_d^k$ in
network $k \in K$ such that $\sum_{k \in K} \sum_{i \in N_d^k} w_i^k = 1$.


È sex È ent AOVP) T 
E LA (otters) -(sivers)) (4) 
-(t- 1)(( ONVEP*) J- (sk, VER ‘= 


The other objective of the restoration process
minimizes the total cost associated with the restoration
process. The fixed restoration costs for disrupted
nodes and links are $fn_i^k$ for $i \in N'^k$ and $fl_{ij}^k$ for
$(i,j) \in L'^k$, respectively. The unitary flow cost
through link $(i,j) \in L'^k$ is $c_{ij}^k$, and $p_i^k$ is the unitary
unmet demand cost for node $i \in N'^k$. The binary
decision variable $z_i^k$ equals 1 if node $i \in N'^k$
is restored and 0 otherwise, and $y_{ij}^k$ is also a binary
decision variable that equals 1 if link $(i,j) \in L'^k$ is
restored and 0 otherwise. Finally, $x_{ijt}^k$ is the nonnegative
decision variable that represents the total
flow through link $(i,j) \in L'^k$ in network $k \in K$ at
time $t \in T$. Therefore, the total cost of the restoration
process can be represented as Eq. (5).


Py | > ie N’* Snpz} p> (ijjeL'k NY; +), ef 
be pert HH * Qe te we PES sive) | (5) 


In network $k \in K$, the restoration durations
for node $i \in N'^k$ and link $(i,j) \in L'^k$ are $dn_i^k$
and $dl_{ij}^k$, respectively. The link capacity is $u_{ij}^k$ for
link $(i,j) \in L'^k$. The binary decision variable $\beta_{it}^k$
equals 1 if node $i \in N'^k$ is operational and 0 otherwise,
and the binary decision variable $\alpha_{ijt}^k$ is 1 if
link $(i,j) \in L'^k$ is operational and 0 otherwise
in network $k \in K$ at time $t \in T$. For each network
$k \in K$, $R^k$ represents the available work crews or
resources that are specific to network $k$ (e.g., in
terms of work crew expertise and restoration
equipment). The scheduling variables are denoted
by binary variables $\gamma_{it}^{kr}$ and $\delta_{ijt}^{kr}$, respectively, for
node $i \in N'^k$ and link $(i,j) \in L'^k$, where they are
equal to 1 if restoration of the related component
is completed by work crew $r \in R^k$ at time $t \in T$
and 0 otherwise. Finally, the network interdependencies
are denoted by $((i,k),(\bar{i},\bar{k})) \in \Psi$, meaning that node
$i \in N^k$ in network $k \in K$ depends on the functionality
of node $\bar{i} \in N^{\bar{k}}$ in network $\bar{k} \in K$. The
complete version of the proposed mathematical model
is as follows.


md 2X 70 AO Om ee > E [ (( QEP) 


-( sEV EPK )) = ( t= a OWVEP*) — ( Raa l 


nin | D fnri J, fix 


keK \ jen’ (ier ¥ 
+) È oh D prsiV Pi 
teT | (i j)er ie Nf 
Subject to: 
D eo Vie N',ke K,te T 
(i j)e k 
X xk- J xt =0, vie N*\{ NF, NE}, 
(i j)e tk (jie Lk 
ke K,teT 
= xi + Sf =p. Vie Nk,ke K,teT 
(jie Lk 
Ty ty $ D, V(ijje L,keK,teT 
Xj — Uj D v(i j)e Lie Nike K,teT 
xy UG Fy S v(i j)e Lie N*,ke K,teT 
ees V(ijje lke K,teT 
S-<0, W((ik) (TK) )eW eT 
y= YS nr, V(ij)e Like K 
re R* teT 
25 >”, Vie N *t,ke K 
re RK teT 
min (zrak -1) min( ( e+ dnk 
2 Jier oy EÈ 2 t 
(Gj)er* ie’ 
yes, Vke K,re Rte T 
SDT Yje L ke Keer 
re RK 
BES Lae Vie N *,ke K,te T 


V(ijjel ike K 
Vie N't,ke K 
V(ijjel ke K 


re RK 

ZZ =, Vie N *,ke K 
re RK 

sk 20, Vie Nt, ,ke K,teT 

x% 20, V(ij)e L,keK,teT 
yke{0.1}, v(ij)e lke K 
zke{0,1}, Vie Nt,ke K 


a V(ijjeL'ke Kyte T 

fe {0,1}, Vie N‘,ke K,te T 
g {0}, V(ijje lke Kyte Tre Rt 
vy e{0,1}, Vie N’*,ke K,te Tyre R* 


4 ILLUSTRATIVE EXAMPLE


In this study, the proposed model is illustrated with
data collected for Shelby County, Tennessee, in the
United States, a location in the New Madrid Seismic
Zone at risk of earthquakes (Gonzalez et al.,
2016).

Three critical interdependent infrastructure 
networks, water, gas, and power distribution sys- 
tems as shown in Figure 2, were examined. We 
consider a disruptive scenario consisting of a 
total of 43 disrupted components, 19 of which 
are demand nodes. We assigned two work crews
to each network and used a time horizon of 23 periods
in total.

Figure 3 represents five districts in Shelby
County. To relate unmet demand in the three
infrastructure networks to adverse community effects,
SoVI-Lite was calculated for the five districts to guide
the restoration process with community resilience
in mind. The demand nodes in each district were
assigned the $V_i^k$ value that is specific to
that district.

The SoVI-Lite algorithm (Cutter et al., 2011,
Evans et al., 2014) was implemented using the
available variables for Shelby County to cover the
eleven key factors, shown in Table 1. As shown
in Figure 4, District 5 is assigned the highest
scaled SoVI, indicating that it may be given a
higher priority in the restoration process. Figure 5
illustrates the exponential transform of the SoVI
score, $V_i^k$, where District 5 stands out substantially
from the other districts. Population density, $P_i^k$,
was also calculated for and assigned to each district
in the study.

Figure 2. Critical gas, water, and power infrastructure
networks of Shelby County, TN, respectively (Gonzalez
et al., 2016).

Figure 3. Representation of five districts in Shelby
County, TN (www.smartcitymemphis.com).

Table 1. SoVI-Lite algorithm variables for Shelby
County, TN.

SoVI-Lite Variables
% of population that lives in poverty
% of population that is over 65
% of population that is under 5
% of population earning less than $75,000 per year
% of population that lives in a single-mother household
% of population that is female
% of population that is Hispanic
% of population that is African-American
% of population that is Asian
% of population that is not a high school graduate
% of population that relies on food stamps
% of population that is unemployed
% of population that works in low-skilled service jobs
% of population that speaks English as 2nd language

Figure 4. Social-vulnerability indexes for Shelby
County, TN districts.

Figure 5. Exponentially increasing social-vulnerability
scores for the districts of Shelby County, TN for b = 10.

The multi-objective problem was solved using
the $\varepsilon$-constraint method, where the resilience
constraint was assigned values $\varepsilon \in [0,1]$ as in
Eq. (6) (Almoghathawi et al., 2017).
Lk 


ee ee 
LX LL A(aver')- (sive) 
—(-((arvert)-(sionte)) |] 


Tables 2, 3, and 4 are subset comparisons of the
restoration optimization model results, where the
second column represents the schedule without SoVI
and population density factors and the third
column represents the schedule with consideration
of those factors.

“Without SoVI and population density” refers
to the removal of the $V_i^k$ and $P_i^k$ terms from the
vulnerability objective function. Note how the
ranking, that is, the order in which those particu- 
lar components of the individual networks would 
be restored, differs when community resilience 


(6) 


Table 2. Water network restoration schedule comparison
with and without considering social-vulnerability.

Water network    Rank           Rank
components       without SoVI   with SoVI
Node 29          3              1
Node 37          2              2
Node 11          1              3
Node 36          4              4

Table 3. Power network restoration schedule comparison
with and without considering social-vulnerability.

Power network    Rank           Rank
components       without SoVI   with SoVI
Node 13          3              1
Node 14          4              2
Node 57          2              3
Node 56          1              4


Table 4. Gas network restoration schedule comparison
with and without considering social-vulnerability.

Gas network      Rank           Rank
components       without SoVI   with SoVI
Node 9           4              1
Node 14          1              2
Node 8           2              3
Node 6           3              4
Figure 6. Unmet demand through recovery period,
with the consideration of social-vulnerability scores.

Figure 7. Unmet demand through recovery period,
without considering the social-vulnerability scores.


measures are taken into account. The trajectory
of recovery, as measured by unmet demand over
time, is depicted in Figures 6 and 7, which compare
total unmet demand with and without consideration
of the community resilience perspective.
Naturally, the figures display different trajectories,
as a different priority is given to meeting demand
at the demand nodes of the different networks.


5 CONCLUDING REMARKS 


Critical infrastructure networks often rely on each
other. Such dependent and interdependent
relationships potentially make these networks
more vulnerable to disruption. Moreover, physical
infrastructure networks are not the only systems
adversely impacted, as the communities that rely on
those networks can also be significantly disrupted. As
such, restoration planning and resource allocation
should account for disrupted communities in addition
to disrupted physical infrastructure networks.

In this paper, we have studied restoration
process planning and scheduling with (i) social
vulnerability and (ii) population density in mind.
Social vulnerability was calculated using an established
index, a variation on the Social Vulnerability
Index (Cutter et al. 2003, 2011), that accounts for
several age, income, race, and educational attainment
dimensions.

This community resilience perspective was
added to a multi-objective resilience-driven restoration
model using mixed-integer programming.
The objective of the model was to maximize the
cumulative community resilience of the interdependent
networks over time while also minimizing the total
cost associated with the restoration process.

We found that accounting for community resilience
measures impacts the restoration schedule of disrupted
infrastructure networks. While future work
will follow up on this initial study, at a minimum we
have demonstrated that different restoration priorities
are found when we account for the vulnerability
and the density of the population associated
with unsatisfied demand in the interdependent
networks. The community resilience perspective
should be emphasized in future research.
ABSTRACT: The resilience of modern societies is to a large degree determined by the resilience of
their Critical Infrastructures (CI). These infrastructures are critical because interruptions not only influence
the infrastructures themselves, but loss of functionality has secondary effects on the society. The use
of smart technologies makes these “Smart” CIs (i.e. SCIs) increasingly interdependent and vulnerable
to various hazards, such as terror attacks, cyber-attacks and extreme weather. The EU H2020 research
project SmartResilience has developed a baseline resilience assessment method, which measures the level
of resilience indirectly through a selection of resilience indicators considered relevant by the user of the
SCI in question. Other methods have also been developed in SmartResilience, but this paper focuses on the
development and application of the baseline resilience assessment method and the development and collection
of resilience indicators used in the assessment method. The application is demonstrated using a
production facility as a case.


1 INTRODUCTION 


The power grid in Ukraine was cyber-attacked both 
in 2015 and 2017. The attack in 2015 was a complex 
and pervasive attack on three energy distribution 
companies, resulting in about 230 thousand people 
being left without electricity for a period from 1 to 
6 hours (Wikipedia 2017). Energy supply systems, 
such as those attacked in Ukraine, are examples of 
critical infrastructures (CIs); critical because their 
functions are vital for the society. 

Smart technologies are introduced in infrastruc- 
tures to maximize the service they provide using 
intelligent systems. Thus, the term Smart Critical 
Infrastructure (SCI) is introduced. However, smart 
features may also make the SCIs more vulnerable, 
e.g. by providing a gateway for hackers and cyber- 
terrorists. 

The need to defend these SCIs has been recognized
for decades through e.g. Critical Infrastructure
Protection (CIP) programs. However, in
recent years, it has been realized that with increasingly
complex and interdependent infrastructure
systems, CIP is not enough (HSAC 2006). It is not
enough to focus on protection of a CI from events
like cyber-attacks, terror attacks and extreme
weather, because the complexity and interdependencies
make it virtually impossible to foresee and
prevent all scenarios, and when they occur, no
matter how unlikely, it is vital for society that the
loss of functionality is minimized, e.g. that the CIs
are up and running as soon as possible after an
event.

A shift of the focus from CIP towards CIR, 
i.e. Critical Infrastructure Resilience has been 
observed. “Overall, a resilience-based approach 
for CI is an approach that is gradually adopted by 
nations in order to face the challenges and costs of 
achieving maximum protection in an increasingly 
complex environment and to overcome limitations 
of the traditional scenario-based risk management 
approach, where the organization may lack capa- 
bilities to face risk from unknown or unforeseen 
threats and vulnerabilities” (Setola et al. 2016). 

Resilience is not a straightforward term. It has
many different applications and a broad scope. A
helpful review paper providing insights into the
term and its history is Alexander (2013). Suffice it to
say here that although the term was unfamiliar
within risk of critical infrastructures in the US
some ten years ago (HSAC 2006), it is now a
well-recognized term. Resilience is also a familiar everyday
term in English-speaking countries, but it is not
easily understood by lay people when translated to
other languages. In addition, the CIR approach is
relatively new in the EU compared to the US. This
poses some challenges for the implementation of
CIR in the EU and in the individual EU member states.

Recognizing the challenges with the term resilience,
the questions are still: How can we make a
system like the energy system in Ukraine, and other
SCIs, resilient against cyber-attacks and other
relevant threats? How can we know, and measure,
the level of resilience of an SCI? These are
the challenges that the EU H2020 project
SmartResilience (2016) sets out to solve. It answers the
DRS-14 call, which explicitly asks for an
indicator-based approach.

Several methods and tools for assessing 
and monitoring resilience are developed in the 
SmartResilience project. In this paper, we present 
the baseline resilience assessment method meas- 
uring the Resilience Level (RIL) of SCIs through 
resilience indicators. We denote this as the “RIL 
method” in the following. It is based on review, 
adaptation and further development of relevant 
reference methods having their roots in high reli- 
ability theory (Wreathall 2006), resilience engi- 
neering (Woods 2006) and critical infrastructure 
resilience (Fisher et al. 2010). 

The resilience indicators have been developed 
(identified and/or proposed) mainly by the case 
study partners in the SmartResilience project, cov- 
ering a range of different critical infrastructures. 
They are stored in a database as “candidate” resil- 
ience indicators, i.e. the users select the most rele- 
vant indicators for their case from the candidates in 
the database, or add new indicators, when necessary. 

Based on the selected set of resilience indicators, 
the RIL method provides a level of resilience on 
a scale from E (worst) to A (best) for one specific 
SCI, or several SCIs, within an area. In addition to 
an overall level of resilience, that can be trended 
periodically, the results point to areas where 
improvements are most needed. In this paper, the 
application of the RIL method is demonstrated for 
a production facility. 

The description of the development of the RIL
method and the resilience indicators is based
on Øien et al. (2017a-c). Earlier versions of the
SmartResilience methodology are also presented
in Jovanovic et al. (2017a; 2018).


1.1 Concepts and definitions 


In the SmartResilience project, the resilience of an 
infrastructure is defined as: “The ability to antici- 
pate possible adverse scenarios/events (including 
the new/emerging ones) representing threats and 
leading to possible disruptions in operation/func- 
tionality of the infrastructure, prepare for them, 
withstand/absorb their impacts, recover from dis- 
ruptions caused by them and adapt to the chang- 
ing conditions” (Jovanovic et al. 2016). 


[Figure: resilience curve showing functionality and loss of functionality over the phases understand risks, anticipate/prepare, absorb/withstand, respond/recover, adapt/learn]

Figure 1. Resilience phases in the resilience curve/cycle.


Based on this definition, we arrive at the
following five phases of the resilience curve/cycle:
understand risks, anticipate/prepare, absorb/withstand,
respond/recover, and adapt/learn. The five
phases, representing the main resilience attributes
in SmartResilience, are illustrated in Figure 1.

Each of the phases is measured by indicators
through the most important “issues” affecting each
of the phases.

An issue is a very general term referring to any- 
thing (factors, conditions, functions, actions, capaci- 
ties, capabilities, etc.) that is important in order 
to be resilient against severe threats such as terror 
attacks, cyber threats and extreme weather. It is what 
is important, and it is allocated to one of the five 
phases in the resilience cycle. E.g., it can be “train- 
ing” performed in the anticipate/prepare phase. 

An indicator is the description of how to measure
an issue. Any type or form of indicator is
considered appropriate in the RIL method, meaning
that indicators can be yes/no questions, numbers,
percentages, frequencies, or some other type. E.g.,
it can be “percentage of personnel in a certain
response team taken a certain course”.


2 METHOD DEVELOPMENT 


The RIL method is an indicator-based approach
consisting of two main parts: the resilience assessment
method itself and the indicators used to
measure the resilience level. The development of
the two parts is described in the following.


2.1 Resilience assessment method 


The RIL method has its roots in high reliability
organization theory (EPRI 2000, 2001) and resilience
engineering (Øien 2010, 2012; Øien & Nielsen,
2012; Øien et al. 2012), but also in more recent
resilience developments within critical infrastructures,
especially in the US (e.g. Petit et al. 2013, Linkov
et al. 2014).
2.1.1 The ANL method

The Argonne National Laboratory (ANL) method 
for assessing a resilience index (RI) (Fisher et al., 
2010), or a resilience measurement index (RMI), as 
it is termed in the most recent version (Petit et al. 
2013), is structured in five (or six) levels, providing 
indicators on the lowest level. A similar hierarchy 
is used in the SmartResilience project for assessing 
resilience levels, entering the indicators on level 6. 
The structure is comparable in the two approaches, 
and many of the resilience attributes are the same; 
however, the level at which the various resilience 
attributes are found, differs between these two 
methods. 


2.1.2 The LIOH method 

The Leading Indicators of Organizational Health
(LIOH) method focused on developing indicators
for a set of seven themes important for the “health”
of a nuclear power plant, some of which have their
roots in the research on high reliability organizations
(HRO) (Wreathall 2006). They also formed
part of the basis for factors considered important
in resilience engineering. In addition to themes,
LIOH uses issues and indicators as the three levels
in the structure of the method.

The LIOH method is a contributory-based 
method in which the users of the indicators take 
part in workshops and define their own issues 
(general and nuclear power plant—NPP—specific) 
for each theme, and for each issue they define 
indicators. There are no predefined examples of 
issues prior to the workshops, and no proposals 
or “candidate” indicators are in place prior to the 
workshops. 

The case studies of the LIOH method show that 
there is often only one level of issues used, i.e. the 
issues are not divided into general and NPP issues 
(EPRI 2000, 2001). A second observation is that 
the results (the issues and indicators defined) from 
identical power plant units are very different. The 
reason for this difference is that there is no guid- 
ance with respect to issues and indicators (no a 
priori “candidates”), and that there have been dif- 
ferent participants in the workshops in each of the 
case studies. 


2.1.3 The REWI method

The idea of combining the issues into one common
level was carried over to the Resilience-based
Early Warning Indicator (REWI) method (Øien
et al. 2010, 2012), which uses three levels to identify early
warning indicators for resilience, i.e. starting with
resilience attributes, followed by issues important
for these resilience attributes, and finally developing
indicators to measure the issues. In REWI, the
level of resilience attributes is not termed themes
as in LIOH, but rather contributing success factors
(CSFs). Thus, the structure consists of CSFs, issues
and indicators.

The CSFs are structured in two levels, of which
the lowest level consists of eight factors, or resilience
attributes. The CSFs at the first level are: risk
awareness, response capacity, and support. The
CSFs at the second level are: risk understanding,
anticipation, attention, response, robustness (of
response), resourcefulness/rapidity, decision support,
and redundancy (for support). The CSFs represent
the REWI operationalization of the concept
of resilience, similar to how themes are used in LIOH
and phases are used in the SmartResilience
project. The CSFs are partly, but not entirely,
sequential. For each CSF, there is a set of issues
contributing to the fulfillment of the goals of the
CSF. There is only one level of issues, denoted
general issues, for which indicators are developed.
The CSFs were developed based on a literature
review and an empirical study on successful recovery
from high-risk incidents; hence the term contributing
success factors (Størseth et al. 2009).

The REWI method consists of a predefined 
set of issues and a set of candidate indicators for 
each issue. This is a main difference compared to 
the LIOH method, and makes it less “open ended”. 
However, it is still a contributory-based method 
and new issues may be added. The predefined set 
of issues and sets of candidate indicators “forces” 
the participants to assess the a priori set of gen- 
eral issues and candidate indicators. Thus, it coun- 
teracts the tendency to identify indicators during 
workshops just as random “indicators of the day”. 

The issues are just candidates, which may be 
considered appropriate or rejected, and addi- 
tional issues may be included. After selecting the 
important issues, the next step is to consider how 
to measure them. How well are we doing with the 
selected issues? What would tell me that we are 
doing well (or have problems) with a specific issue? 
What information do we have about this? This is 
the role of the indicators. 

The issues we try to measure, and the indica- 
tors we use to measure the issues, are two differ- 
ent things. The indicator will typically be described 
as a number, ratio, score on some scale, or similar. 
Without this type of specification or operation- 
alization, we are left with just a theoretical issue. 
We cannot start with the indicators either, since we 
need to know what we want to measure (i.e. the 
issues) and why. 


2.1.4 The SmartResilience RIL method 

Like the LIOH method and the REWI method,
the RIL method uses issues and indicators on the
two lowest levels of the structure, whereas phases
are used on the next higher level, compared to
themes in LIOH and contributing success factors
in REWI. For each of the phases, issues that are
important for them are identified, and indicators
to measure the issues are developed.

In addition, the issues (and corresponding indicators)
may be structured according to five dimensions,
which are system/physical, information/data,
organizational/business, societal/political, and
cognitive/decision-making (Jovanovic et al. 2016).
The phases and dimensions form what is denoted
the Resilience Matrix, commonly used in several
resilience assessment methods (e.g. Linkov et al.
2014). However, in the SmartResilience project,
dimensions are only optionally used for structur- 
ing and triggering the identification of issues and 
indicators. Only phases are directly included in the 
quantification, i.e. it is the columns in the Resil- 
ience Matrix that are of interest, not the rows (or 
the single cells) in the matrix. 

The SmartResilience RIL method has been 
developed through several iterations, including 
input from user requirements (Buhr et al. 2016), 
test case use, and feedback from case study part- 
ners in workshops and through a questionnaire 
(Jovanovic et al. 2017b). A description of the 
resulting method is provided in Section 3.1. 


2.2 Resilience indicators 


The candidate issues and indicators collected in 
the SmartResilience project are to a large degree 
provided by the partners from existing standards, 
guidelines and reports within the areas of risk, 
safety, security, crisis management, business con- 
tinuity and similar domains. 

Resilience is considered an “umbrella” term 
(Setola et al. 2016), covering all the mentioned 
domains; thus, the term resilience indicators may 
include risk indicators, safety indicators, etc. The 
umbrella concept is illustrated in Figure 2. 

In addition to standards, guidelines and reports, 
some indicators are based on what the case study 
providers already are using, and some indicators 
are developed as part of the project. Figure 2 also 
illustrates that the resilience concept in general 
and the resilience indicators, aim at capturing the 
unexpected, by using the metaphor “rain from a 
blue sky”. 

Candidate issues and indicators are stored in a 
database, and reported in Øien et al. (2017a), rep- 
resenting the status of the collected issues and indi- 
cators approximately half way through the project. 

In addition, Øien et al. (2017c) present generic
candidate issues (without indicators) covering more
genuine resilience issues, i.e. capturing topics typically
discussed in the resilience literature. The two
main sources are the guideline for implementing
the REWI method (Øien et al. 2012), and an emergency
preparedness plan developed by SINTEF
(2014). Some issues are derived from IMPROVER
(2016), a few from RESILENS (2016a, b), and the
rest are based on input from SINTEF as part of the
SmartResilience project. Some issues are taken
directly from the original sources, whereas others
are slightly adapted. Indicators need to be developed
only for those generic candidate issues that are
considered relevant for each user.

[Figure: umbrella spanning the phases understand risks, anticipate/prepare, absorb/withstand, respond/recover, adapt/learn]

Figure 2. Resilience as an “umbrella” term.

A presentation of the collected candidate issues 
and indicators is provided in Section 3.2. 


3 RESULTS 


3.1 The SmartResilience RIL Method 


3.1.1 Model 

The three lower levels (levels 4-6) of the hierarchical
model are phases, issues and indicators, as
described in Section 2.1. In addition, the overall
structure consists of three more levels. The
first level is the area level, e.g. a city. The second
level consists of the smart critical infrastructures
(SCIs), and the third level defines the threats. This
is illustrated in Figure 3.
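The six-level hierarchy can be pictured as nested records. The following is a minimal Python sketch under that reading; the class names, field names, and the example content are hypothetical and are not taken from the SmartResilience tools.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Indicator:  # level 6
    name: str
    score: float  # score on the 0 (worst) to 5 (best) scale


@dataclass
class Issue:  # level 5
    name: str
    indicators: List[Indicator] = field(default_factory=list)


@dataclass
class Phase:  # level 4
    name: str
    issues: List[Issue] = field(default_factory=list)


@dataclass
class Threat:  # level 3
    name: str
    phases: List[Phase] = field(default_factory=list)


@dataclass
class SCI:  # level 2
    name: str
    threats: List[Threat] = field(default_factory=list)


@dataclass
class Area:  # level 1
    name: str
    scis: List[SCI] = field(default_factory=list)


def count_indicators(area: Area) -> int:
    """Count all indicators found at the lowest level of the hierarchy."""
    return sum(
        len(issue.indicators)
        for sci in area.scis
        for threat in sci.threats
        for phase in threat.phases
        for issue in phase.issues
    )


# Hypothetical example with a single branch through all six levels.
example_area = Area("Example city", [
    SCI("Production facility", [
        Threat("Cyber-attack", [
            Phase("Anticipate/prepare", [
                Issue("Training", [Indicator("Share of team trained", 3.0)]),
            ]),
        ]),
    ]),
])
```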


3.1.2 Method steps 

At each level, the scores, optionally combined
with weights, correspond to a certain resilience
level (RIL) given by a letter from E to A, where E is
worst and A is best. A weighted score between 0 and 1
corresponds to resilience level E, a weighted score
between 1 and 2 corresponds to resilience level D, and so on.
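The mapping from a weighted score to a resilience level can be sketched as a simple lookup. The text gives only the bands (0-1 for E, 1-2 for D, and so on), so the treatment of the band boundaries below (lower bound inclusive, with 5 assigned to A) is an assumption, as is the function name.

```python
def score_to_ril(score):
    """Map a weighted score in [0, 5] to a resilience level letter E..A."""
    if not 0.0 <= score <= 5.0:
        raise ValueError("score must lie between 0 and 5")
    letters = "EDCBA"  # E is worst, A is best
    # Assumption: each band includes its lower bound; a score of 5 maps to A.
    band = min(int(score), 4)
    return letters[band]


ril_low = score_to_ril(0.5)    # falls in the 0-1 band
ril_mid = score_to_ril(2.7)    # falls in the 2-3 band
ril_high = score_to_ril(5.0)   # upper end of the scale
```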

The method steps are as follows: 


Step 1: Select the area, e.g. a smart city
Step 2: Select the relevant SCIs for the area
Step 3: Select relevant threats for each SCI
Step 4: Consider each phase for each threat
Step 5: Define the issues within each phase
Step 6: Search for the indicators for each issue
Step 7: Determine the range of values for each
indicator (and optionally assign weights)
Step 8: Assign values to the indicators
Step 9: Perform the calculations (scores and RILs)
Step 10: Use the results and make decisions


Figure 3. The six levels in the hierarchical model.


The method steps have been described in
Jovanović et al. (2017a) and we will only focus on
the changes that have been made lately. These apply
to Steps 7 and 9.

The indicators' real values are collected and
transformed to a score (or rating) on a scale from 0
(worst) to 5 (best). This requires the determination
of the best and the worst values for each indicator, i.e.
Step 7. This part is simplified by using five categories,
or value ranges (Øien et al. 2017b).
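To illustrate the idea of turning a real indicator value into a score between 0 and 5, the sketch below interpolates linearly between an indicator's worst and best values. This is an assumed simplification for illustration only: the method itself uses five value ranges whose definitions are given in Øien et al. (2017b) and are not reproduced here. The function name and sample values are hypothetical.

```python
def indicator_score(value, worst, best):
    """Linearly map a real indicator value to a score from 0 (worst) to 5 (best).

    Works for indicators where higher is better (best > worst) and for
    indicators where lower is better (best < worst).
    """
    if worst == best:
        raise ValueError("worst and best values must differ")
    fraction = (value - worst) / (best - worst)
    # Clip values that fall outside the worst-best interval.
    fraction = min(1.0, max(0.0, fraction))
    return 5.0 * fraction


# Hypothetical indicator: percentage of a response team that has taken a course.
training_score = indicator_score(50.0, worst=0.0, best=100.0)

# Hypothetical indicator where fewer is better: unresolved findings per year.
findings_score = indicator_score(2.0, worst=10.0, best=0.0)
```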

At every level, there is a possibility to give 
weights; however, we recommend being restric- 
tive with the use of different weights. It is chal- 
lenging to substantiate the assignment of weights 
(who and how), and the assignment itself can 
easily be criticized. Thus, equal weights are the 
default values at all levels. However, if different 
weights are considered necessary, we now propose 
using a simple type of pairwise comparison (Øien 
et al. 2017b). It can also be considered to include 
weights after gaining some experience, i.e. “tun- 
ing” the assessment. 

In Step 8, the values are assigned to the indica- 
tors, i.e. the measurement itself is performed, and 
in Step 9 scores are calculated, first on the indi- 
cator level, and then aggregated upwards through 
all levels until the area level. On each level in the 
hierarchy, the scores can be transformed to resil- 
ience levels. This is new, and also the use of char- 
acters E-A is new; previously a scale 0-10 was 
used for RILs, and the transformation from scores 
to RILs only took place at the phase level (Øien 
et al. 2017b). 
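The upward aggregation in Step 9 can be sketched as a recursive roll-up. The text does not specify the aggregation rule, so a weighted mean with equal default weights is assumed here; the function name, the nested-list representation, and the example scores are hypothetical.

```python
def roll_up(node):
    """Aggregate scores upward through the hierarchy.

    A node is either a number (an indicator score on the 0-5 scale) or a
    list of (weight, child) pairs. Assumption: each level is aggregated
    as the weighted mean of its children.
    """
    if isinstance(node, (int, float)):
        return float(node)
    total_weight = sum(weight for weight, _ in node)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weight * roll_up(child) for weight, child in node) / total_weight


# Hypothetical phase with two issues, using equal weights throughout.
training_issue = [(1.0, 4.0), (1.0, 2.0)]   # two indicator scores
backup_issue = [(1.0, 5.0)]                 # one indicator score
prepare_phase = [(1.0, training_issue), (1.0, backup_issue)]

phase_score = roll_up(prepare_phase)
```

The same function can be applied to threats, SCIs, and the area by nesting further levels, and the resulting score at any level can then be translated to a resilience level letter.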


The use of the results, in Step 10, is described in 
Section 3.3. 


3.1.3 Special topics 

The way cascading effects, dependencies and 
interdependencies, interoperability, and smartness 
opportunities and vulnerabilities are treated in the 
RIL method is briefly described below. We strive 
for a good balance between the comprehensiveness 
of the analysis framework and the simplicity of 
understanding and using the framework. Thus, the
specific topics have been addressed explicitly, but
in a relatively simple way.

Cascading effects where the SCI in question
is affected from the outside should be treated as
a specific threat, e.g. toxic cloud, flooding, etc. If
the effect is in the form of loss of service, then it
is treated as a dependency as part of Step 5, i.e.
explicitly as issues. Internal escalation of an event
is also treated explicitly as issues (Step 5), reflecting
the safety systems or barriers needed to
prevent escalation.

Critical infrastructures, or other infrastructures,
services or systems that the SCI is dependent on,
should be addressed explicitly as issues in the relevant
phases for the relevant threats. This could e.g.
be the need for redundant energy supply or communication
networks. Interdependencies are treated in
the same way. The difference is that the SCIs that are
dependent on “your” SCI need to explicitly include
this as issues in their resilience assessment.

If interoperability is an internal concern e.g. 
interoperable communication systems, then it 
should be treated as an issue. If it is related to 
external interoperability in the sense of external 
backup systems, e.g. “bus for train”, then it should 
be included explicitly as an issue (e.g. cooperation 
agreements) if this is the responsibility of the SCI 
being assessed. 

The relevance of smartness opportunities and 
smartness vulnerabilities related to smart features 
(sensors, gateways, processors, actuators, etc.) 
should be considered explicitly as issues in each 
phase. 


3.2 The collection of issues and indicators 


Øien et al. (2017c) describe candidate resilience
issues and indicators to be used when assessing,
predicting and monitoring resilience of Smart
Critical Infrastructures (SCIs). A total of 233 candidate
issues and 1264 indicators are provided for
various threats, SCIs and the five phases of the
resilience cycle.

Table 1 shows the number of issues and indi- 
cators in the five phases defined in the Smart- 
Resilience project. In addition, some issues and 
indicators are considered relevant for all phases. 

Table 1. No. of issues and indicators in each phase.

Phase                              Issues   Indicators
Phase I    Understand risks          46        226
Phase II   Anticipate/prepare        93        520
Phase III  Absorb/withstand          45        236
Phase IV   Respond/recover           39        180
Phase V    Adapt/learn               20         95
Relevant for all phases              10        182


Table 2. Calculations on indicator level (example).

Indicator scores, weights and RILs

I&I      Real value   Score value   RIL   Weight   Weighted score
I.1      Safety risk registry
I.1.1    Y            5             A     0,33     1,67
I.1.2    Y            5             A     0,33     1,67
I.1.3    N            0             E     0,33     0,00
I.2      Management of change (MOC)
I.2.1    N            0             E     1,00     0,00
I.3      Register of accidents/incidents
I.3.1    Y            5             A     0,33     1,67
I.3.2    1/6mth       1,5           D     0,33     0,50
I.3.3    80%          3,5           B     0,33     1,17


Although a substantial number of issues and
indicators have been collected, they will never be
complete and they are just candidates. There will
always be a need for additional and/or more relevant
issues and indicators for each specific user;
and in the end, it is always the user that is responsible
for finding a relevant and complete set of issues
and indicators for his/her own case study.

Issues are essential in order to focus on those 
aspects that are most important to measure. There- 
fore, issues are considered first, and then indica- 
tors to measure the selected issues are established. 
Focusing on indicators first may result in important 
aspects (issues) being missed and not measured. 

The importance of issues is also reflected by the 
143 generic candidate issues provided in Øien et al. 
(2016c). 


3.3 Results obtained by using the method 


From the overall result, i.e. the resilience level of 
an area or a specific SCI, we can “drill-down” 
through the levels 2-6 for detailed results, which 
can be used in Step 10, together with the overall 
result. We do not have “just one number” (the over- 
all resilience level). 

There are many possibilities for use of the 
results, including: 


1. Following up own development over time 
(trending) and analyse status 

2. Comparing with others (benchmarking) 

3. Providing overview of strengths and weaknesses 
and point at improvement needs 

4. Making any gaps visible (lack of relevant 
indicators) 


3.4 Example 


To explain the assessment and calculations per- 
formed, Table 2 shows an extract of an example 
RIL assessment of a production facility within 
the chemical industry. The threat considered is 
terrorist attack (threat 1), and only the first phase 
(phase I) is shown. 

Issue and indicator (I & I) IDs are listed in the
first column. The indicators for the first issue (I.1)
are: Does a safety risk register exist? (I.1.1); Is this
registry used in decision making? (I.1.2); Is a frequency
for updating the registry defined? (I.1.3).
The second issue (I.2) has only one indicator: Is a
procedure for MOC established? (I.2.1). The third
issue (I.3) has the following three indicators: Does
an accident/incident register exist? (I.3.1); Frequency
of communication about incidents (I.3.2); Percentage
of employees informed about incidents (I.3.3).

Each indicator is measured, i.e. providing the 
real values for the indicators, whether it is yes/no 
questions, frequencies, percentages, or some other 
type of indicator. Based on the real value and the 
predetermined range of values, from worst to best 
(not shown in Table 2), an indicator score value is 
calculated. This value can be transformed to an 
indicator resilience level, from E (worst) to A (best) 
according to a predefined scale. Weights are deter- 
mined, and the default values are equal weights. By 
multiplying the indicator scores with the indicator 
weights, the indicator weighted score is obtained in 
the last column. The indicator weighted scores are 
brought to the next level in the calculations, i.e. the 
issue level (level 5), where similar calculations are 
performed obtaining issue weighted scores, and so 
on, all the way to the area level (level 1). 
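
As a worked check of the indicator-level arithmetic, the values for issue I.3 in Table 2 can be reproduced with a few lines of Python. The mapping from score to letter is an assumption: the predefined scale is not given in this paper, so five equal-width bands on the 0 to 5 scale are used here, which is consistent with every score and letter pair listed in Table 2 and with the overall score of 3,06 reported as B. The function name `score_to_ril` is hypothetical.

```python
def score_to_ril(score):
    """Map a 0-5 score to a resilience level from E (worst) to A (best).

    Assumption: five equal-width bands [0,1), [1,2), [2,3), [3,4), [4,5];
    the paper's predefined scale may differ.
    """
    if not 0.0 <= score <= 5.0:
        raise ValueError("score must lie in [0, 5]")
    return "EDCBA"[min(int(score), 4)]


# Indicator scores for issue I.3 as listed in Table 2, with equal weights.
i3_scores = {"I.3.1": 5.0, "I.3.2": 1.5, "I.3.3": 3.5}
weight = 1.0 / len(i3_scores)

# Weighted score per indicator (last column of Table 2).
weighted = {name: weight * score for name, score in i3_scores.items()}

# The issue score is the sum of the weighted indicator scores.
issue_score = sum(weighted.values())
issue_ril = score_to_ril(issue_score)
```

Rounded to two decimals, the weighted scores match the last column of Table 2 (1,67, 0,50 and 1,17), and their sum is the score carried up to the issue level.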

The calculations gave an overall score on area 
level of 3,06 corresponding to RIL = B (Øien et al. 
2017b). 

The overall result just represents one aggre- 
gated character or value, which provides limited 
information. We need to “drill down” in the levels 
beneath, to reveal more detailed information about 
the various contributions to the overall result. One 
example of results on level 2 (SCI level) is shown 
in Figure 4. Here it is revealed that the threats with 
the lowest scores are Threat 1 — Terrorist attack 
and Threat 2 — Natural threats, both with a score 
of 2,64, which would be natural to look further 
into to improve resilience. 

[Figure: SCI: Chemical industry. Scores (scale up to "Excellent") shown for Threat 1: Terrorist attack, Threat 2: Natural threats, and Threat 3: New technology related threats]

Figure 4. Resilience status at threat level (example).


4 DISCUSSIONS AND CONCLUSIONS 


The SmartResilience RIL method helps to under- 
stand how resilient the SCIs are against specific 
types of threats and what measures could help 
improve their resilience. The results show the level 
of resilience (RIL) and where improvements are 
most needed (“drill-down”), emphasizing and fos- 
tering a continuous improvement mindset through 
regularly (typically yearly) updated assessments. 

The resilience assessment uses a holistic
(“umbrella”) approach that goes beyond tradi- 
tional risk of known events, emergency prepared- 
ness, crisis management, and business continuity. It 
covers e.g. preparing for the unforeseen, imagina- 
tion, vigilance, flexibility, improvisation, recovery 
including business continuity aspects, and learning 
and adaptation. 


4.1 How to use the SmartResilience RIL method 


There are two main options for resilience assessment:
internal self-assessment and external assessor
audit. One main reason for using external
assessments is the possibility of benchmarking
between similar SCIs or even areas/cities with
similar SCIs. To ensure comparability, it is impor- 
tant to use the same threats, issues and indicators, 
with the same range of indicator values, weights 
and similar requirements for collecting data for the 
indicators. This is possible to achieve (at least for 
a simple assessment/audit), but may not prove very 
useful for each individual user. 

It is also possible to adapt the assessment to the user
and customize the set of threats, issues and indicators,
ranges of indicator values, weights and so
on, e.g. by allowing the user to reject or add indicators.
However, the more the “dynamic checklists”
(the tool used in the SmartResilience project) of
threats, issues and indicators are adapted to take
user requirements into account, the less comparable
they will be.

Internal self-assessment can also be performed 
using similar checklists as an external assessor 
would use; however, if the focus is not on bench- 
marking and comparing with others, the assess- 
ment can be adapted to the specific requirements 
of each user. This will ensure a more relevant and 
accurate assessment useful for trending own devel- 
opment over time. A user customized self-assess- 
ment approach requires more engagement from the 
users. On one hand this is positive, since the users 
will take more ownership of the analysis framework
and the results; however, on the other hand it will 
require more resources compared to an external 
assessment using a standardized framework. 


4.2 Usefulness of the SmartResilience RIL 
method 


The purpose of assessing resilience is to obtain a 
measure of how resilient a city or an individual SCI 
is against severe threats such as terror attacks,
cyber-attacks and extreme weather. Assessing RIL 
provides a baseline assessment of resilience that 
gives insight on status and improvement needs to 
increase or maintain a high level of resilience. 

A RIL assessment goes beyond traditional risk 
assessments by focusing on unknown and unfore- 
seen events, and the capability to recover from 
events. This is achieved by capturing the time 
dimension through (five) distinct phases, incor- 
porating e.g. emergency response and business 
continuity. A RIL assessment complements risk 
assessment; it is not a substitute for risk assess- 
ment. Risk assessments also provide valuable input 
to a RIL assessment, especially to phase I
“Understanding risks”.

An important purpose of a RIL assessment is 
to identify potential problems before they occur, so 
that risk reducing measures may be planned and 
implemented as needed, regardless of the likeli- 
hood of events. Most SCIs in the world have never, 
and will never, experience an extreme event. Still it 
is possible to assess the RIL, i.e. the level of risk 
understanding, anticipation and preparation, the 
capability to absorb and withstand, to respond and 
recover, and the abilities to learn and adapt. With a 
high RIL, it is less likely to experience adverse con- 
sequences due to an extreme event, and should it 
occur, then disruptions are likely to be less severe. 


4.3 Conclusions 


The SmartResilience project has developed a
method for assessing resilience of SCIs with
respect to specific types of threats on a scale from E
(worst) to A (best). An overall RIL is obtained by
combining resilience levels for five main attributes/
phases of resilience for each threat. For each phase,
the user/analyst must identify the most important
“issues” affecting SCI resilience and for each issue
select relevant indicators and indicator range values,
and perform calculations. The SmartResilience
project has provided candidate issues and indicators
for various SCIs that may be used as a starting
point for identifying issues and indicators for resilience
assessment of specific SCIs. This baseline
resilience assessment can be used for trending as
well as identifying improvement needs.

The resilience curve, describing the SCI func- 
tionality as a function of time, before, during and 
after an adverse event, is treated as a conceptual 
model, i.e. the method does not consider the exact 
shape, size or area of the curve directly. It is an 
indirect measurement. For direct assessment of 
SCI resilience, the SmartResilience project has 
developed a functionality assessment method with 
respect to specific threat scenarios. This alternative 
method provides a quantitative measure of loss of 
SCI functionality as a function of time addressing 
explicitly the resilience curve. 


ABSTRACT: With the increasing frequency and severity of disasters resulting especially from natural
hazards and impacting both infrastructure systems and communities, thus challenging their timely recovery,
there is a strong need to prepare for more effective response and recovery. Communities have especially
struggled to understand the aspects of recovery patterns for different systems and prepare accordingly.
Therefore, it is essential to develop models that are able to measure and estimate the recovery trajectory
for a certain community or infrastructure network given system characteristics and event information.
The objective of the study is to deploy the Poisson Bayesian kernel model developed and tested in earlier
work in risk analysis to measure the recovery rate of a system. In this paper, the model is implemented and
tested on a resilience modeling case study of power systems. The model is validated using a comparison to
other count data models such as the Poisson generalized linear model and the negative binomial generalized
linear model.


1 INTRODUCTION 


Recent disasters severely impacting both infra- 
structure systems and communities emphasize the 
need to prepare for more effective response and 
recovery. Communities have especially struggled to
understand the aspects of recovery patterns for
different systems. Therefore, there is a strong need
to develop models that are able to measure and estimate
the recovery prospects for a certain
community or infrastructure network given system
characteristics and event information. In addition,
the models need to account for the uncertainty underlying
the information that has been or is being gathered
before, during, and after the disruption.

Prior work on recovery rate modeling of infra- 
structure systems focuses on the time to recovery 
from power outages as a function of event attributes 
and impact of the disaster (Mackenzie & Barker 
2013, Barker & Baroud 2014, Barabadi & Ayele 
2018). In this research, the goal is to incorporate 
the uncertainty in estimating the resilience of sys- 
tems after disruption. More specifically, the objec- 
tive of the study is to analyze the recovery rate of a 
system or a community that has been impacted by 
a disaster. The response variable considered in this 
work is the average recovery rate computed based 
on the impact of the event and the total time to 
network recovery as well as other variables. 

In order to integrate information from experts 
with data on the disruptive event and recovery 
process, this work proposes the use of a Poisson
Bayesian kernel model which accommodates count
data while accounting for prior information and 
uncertainty in the estimates. The model has been 
developed and tested using sample data in earlier 
work (Floyd et al. 2014) and has been applied to a 
risk analysis case study to predict the frequency of 
disruptive events in inland waterways (Baroud et al.
2013). However, the method has never been imple- 
mented in post-disaster scenarios, more specifically 
to model recovery rate. In this paper, the model is 
implemented and tested on a resilience modeling 
case study of power systems. More specifically, the 
recovery rate of a community from power outages 
is represented by a parameter following a Gamma 
distribution. This prior distribution is updated 
using historical data of disruptive events as well as 
a set of attributes that are represented by the kernel 
function, a measure of similarity between the new 
data point and the training set. The model perform- 
ance is evaluated in comparison to other count data 
models such as the Poisson generalized linear model 
and the negative binomial generalized linear model. 

Section 2 provides background literature on 
community resilience modeling and count data 
methods with an outline of the paper’s contribu- 
tions. Section 3 briefly describes the Poisson Baye- 
sian kernel method and provides a structure to the 
model comparison and performance measures. 
Section 4 describes the case study with an over- 
view of the data and a summary of the results of 
the models used in this work. Finally, concluding 
remarks are provided in section 5. 

2 BACKGROUND AND CONTRIBUTION


2.1 Community resilience modeling 


The ultimate goal of recovery measures after a disaster
is to ensure that society is able to bounce back
from the losses incurred and reach normalcy as fast
as possible. In recent studies this has been termed
“community resilience.” One common definition
for community resilience refers to the ability for a 
social system to respond and recover from a disaster. 
While vulnerability was previously used as an indica- 
tor, researchers and government policy have realized 
the advantages of utilizing resilience as an indicator 
to measure the ability of a community to not only 
recover during the post-disaster phase, but also 
advance beyond the pre-disaster state and adapt or 
transform to improve preparedness to future events. 
Furthermore, resilient communities are also less vul- 
nerable to hazards than an equivalent less resilient 
community. Initially, community resilience modeling 
research focused on qualitative approaches founded 
in a set of metrics and indicators that describe the 
resilience of a community (Johansen et al. 2016). 
The concept of resilience can be useful when quan- 
tified and used as a decision-making tool, however, 
this can be challenging due to the uncertainty in 
many factors impacting resilience as well as the lack 
of data in recovery measures. As such, a number of 
research initiatives have focused on quantifying resil- 
ience ranging from stochastic modeling to simula- 
tion and data-driven approaches, among others. 

Models of community resilience often include
a variety of social factors. In one study, community
resilience was modeled as categorical variables
based on four primary sets of adaptive capacities:
Economic Development, Social Capital, Information
and Communication, and Community
Competence (Norris et al. 2008). It is proposed in
this work that advancements within each category 
will aim to create a community that is more resil- 
ient to disasters as a whole. More specifically, one 
example of the hypothesis proposed in Norris et al. 
(2008) is the ability to measure infrastructure and 
economic resilience in terms of power restoration 
time which can therefore be used as a proxy to 
understand community resilience. 

A more robust model for community resilience 
uses a composite index of social and geographi- 
cal factors, the Baseline Resilience Indicators for 
Communities (BRIC) (Cutter et al. 2014). This 
relative value measure of resilience can point to 
counties and tracts within a specific geographic 
location that are particularly vulnerable to dis- 
asters and require more attention and more time 
to fully recover. This measure was found to have 
significant negative correlation with the previously 
established Social Vulnerability Index (SoVI). 


Analysis has been performed to identify recov- 
ery rate specifically following a disaster. However, 
two relevant primary issues are dealing with miss- 
ing data as well as homogeneity and heterogeneity 
across the data set and the fact that some models are 
so specific that they need to be adapted for differ- 
ent situations. In addition, most studies have aimed 
to provide restoration curves that give information 
on the number of customers with service over time. 
A lack of literature exists to model recovery rate spe- 
cifically. One study focuses on the need to not only 
develop recovery rate plots but to be able to select the 
appropriate models based on the characteristics of a 
specific data set (Barabadi & Ayele 2018). 


2.2 Methods for modeling count data 


Modeling the recovery rate requires methods that
can accommodate count data, as the response variable
in this case is the number of recovered
subjects per unit of time.

Generalized Linear Models (GLM) are widely
used as regression models when count data are
present. Within this class of models, the Poisson
density function is often used with a log-link function.
If the variance of the counts is higher than
the mean of the counts, it is common to also use
a negative binomial GLM. In certain special cases,
extensions of these models can accommodate specific
situations. For example, zero-truncated models
and zero-inflated models can be used when there are
excess zero counts (Shankar & Mannering, 1997),
and both use an underlying Poisson distribution.
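
A quick way to see whether counts look overdispersed is to compare the sample variance with the sample mean. The sketch below is illustrative only: the counts are made up, the helper name `dispersion_index` is hypothetical, and a ratio well above one is a rough indication rather than a formal test.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by sample mean.

    Values near 1 are compatible with a Poisson model; values well above 1
    suggest overdispersion, values well below 1 suggest underdispersion.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("the mean of the counts is zero")
    return variance(counts) / m


# Made-up count data for illustration.
counts = [0, 1, 1, 2, 3, 9, 14, 0, 2, 8]
index = dispersion_index(counts)
```

For these made-up counts the variance is several times the mean, which is the situation in which a negative binomial GLM is commonly considered.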

However, both Poisson and negative binomial
models lack the flexibility to handle both
underdispersed and overdispersed data.
As such, other models have been developed. One
example is the Conway-Maxwell Poisson (COM)
distribution GLM (Guikema & Coffelt, 2008). The
COM distribution includes the Bernoulli distribution
as a limiting underdispersed case, the geometric
distribution as an overdispersed case, and the Poisson
distribution when the variance is equal to the mean.

Using a Bayesian framework to account for 
the uncertainty in the regression parameters, it is 
possible to improve on their accurate estimation 
by updating the parameter distributions with new 
data. Another approach to analyzing count data
in a Bayesian framework is the use of conjugate priors.
These methods are quite attractive as they offer 
the benefit of uncertainty modeling using Baye- 
sian techniques without adding any computational 
cost. Given a specific prior distribution and a spe- 
cific likelihood function, the posterior distribution 
will have the same form as the prior distribution 
but with updated posterior parameters. There are 
different forms of conjugate priors, one of which
is the Gamma conjugate prior used to model count
data in the model presented in this paper. The 
method assumes that the rate of occurrence follows 
a Gamma prior and updates the distribution using 
information represented by a Poisson likelihood. 
The Gamma conjugate prior is the foundation 
of the Poisson Bayesian kernel model used in this 
paper and will be further discussed in the following 
section. This method allows the user to model and 
understand the uncertainty around each variable 
and estimate them by considering their probability 
distributions as opposed to point estimate. 


2.3 Contributions


This paper presents new analysis for data-driven 
community resilience modeling. A Bayesian 
approach developed and tested in prior work is 
implemented and tested in a case study of com- 
munity recovery from power outages. The work 
presented here constitutes a first step in advancing 
data-driven methods for applications in infrastruc- 
ture and community resilience. 


3 METHODOLOGY 


3.1 Poisson Bayesian kernel model 


Poisson Bayesian kernel methods estimate the rate 
of occurrence of the event rather than estimating 
a deterministic value for the number of times the 
event is estimated to occur. A common distribution 
to model count data within a Bayesian framework 
is the Gamma-Poisson conjugate prior. The devel- 
opment of the Poisson Bayesian kernel method 
discussed can be found in Baroud et al. (2013) and 
Floyd et al. (2014). The approach uses the Gamma 
conjugate prior as the basis of the model. 

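It is assumed that the parameter to be estimated
is the rate of occurrence, $\lambda > 0$, which follows a
Gamma prior distribution with parameters $\alpha > 0$
and $\beta > 0$, as shown in Eq. (1).


$$P(\lambda) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda^{\alpha-1} e^{-\beta\lambda} \tag{1}$$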


For the likelihood function, the product of the
Poisson density function, shown in Eq. (2), is used,
since this is a Gamma-Poisson conjugate prior
approach.


$$P(x \mid \lambda) = \prod_{i=1}^{m} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} \tag{2}$$


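Thus, the posterior distribution is proportional to the product
of Eqs. (1) and (2). Rearranging the product of
the likelihood function and the prior distribution
function results in a Gamma posterior distribution
where $\alpha^* = \sum_{i=1}^{m} x_i + \alpha$ and $\beta^* = m + \beta$.


$$
\begin{aligned}
P(\lambda \mid x) &\propto \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda^{\alpha-1} e^{-\beta\lambda} \left( \prod_{i=1}^{m} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} \right) \\
&\propto \lambda^{\sum_{i=1}^{m} x_i + \alpha - 1} e^{-(m+\beta)\lambda} \\
\Rightarrow \quad P(\lambda \mid x) &= \frac{(m+\beta)^{\sum_{i=1}^{m} x_i + \alpha}}{\Gamma\left(\sum_{i=1}^{m} x_i + \alpha\right)} \lambda^{\sum_{i=1}^{m} x_i + \alpha - 1} e^{-(m+\beta)\lambda} \\
&= \mathrm{Gamma}(\alpha^*, \beta^*)
\end{aligned} \tag{3}
$$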


This result is the basic Gamma conjugate prior 
approach used in Bayesian analysis. This approach 
assumes the notion of exchangeability meaning 
that for different sets of training and testing data, 
the resulting posterior parameters will be similar 
since they are a function of the prior parameter, 
the size of the dataset, and the summation of 
all the data points. The characteristics of each 
outcome are not taken into consideration in this 
case, but rather the overall property of the dataset 
(MacKenzie et al., 2014). 
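
The basic conjugate update can be illustrated numerically. The sketch below assumes the rate parametrization of the Gamma distribution, so that the posterior parameters are the prior shape plus the sum of the counts and the prior rate plus the number of observations; the function name and the example numbers are hypothetical.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Posterior Gamma parameters for a Poisson rate.

    alpha, beta: prior shape and rate (both > 0).
    counts: observed non-negative counts x_1..x_m.
    Returns (alpha*, beta*) = (alpha + sum(x_i), beta + m).
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("prior parameters must be positive")
    if any(x < 0 for x in counts):
        raise ValueError("counts must be non-negative")
    return alpha + sum(counts), beta + len(counts)


# Made-up prior and data for illustration.
alpha_post, beta_post = gamma_poisson_update(2.0, 1.0, [3, 5, 4])
posterior_mean = alpha_post / beta_post

# Exchangeability: the order of the observations does not matter.
reordered = gamma_poisson_update(2.0, 1.0, [4, 3, 5])
```

Only the sum of the counts and the number of observations enter the update, which is why the individual characteristics of each outcome play no role at this stage.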

The Poisson Bayesian kernel approach extends 
the notion of the conjugate prior such that the pos- 
terior parameters computation not only depends 
on the prior parameters and the historical data but 
also on the attributes through the kernel matrix. 
The parameters for the Bayesian kernel model for 
counts are expressed in Eqs. (4) and (5). K is the 
m x m kernel matrix, Y is an m x 1 vector contain- 
ing the output data associated with the m observa- 
tions of X, and V is an m x 1 vector containing 
ones. Each entry in the kernel matrix represents 
the similarity measure between the attributes of 
the testing set and the training set, respectively. 
As such, the new data point is compared with the 
training set and according to the similarities of the 
attributes, new values for the parameter of the pos- 
terior distribution are computed. Note that in this 
case, the training and testing sets are assumed to 
have the same size, m. However, when the model is 
deployed, the sets can be of different sizes, and in 
some cases, the testing set could include only one 
data point such as in a leave-one-out analysis. 


$$\alpha^* = KY + \alpha \tag{4}$$
$$\beta^* = KV + \beta \tag{5}$$


As with other statistical and mathematical 
models, there are a few assumptions underly- 
ing the deployment of such modeling approach. 
Even though the form of the prior distribution is 
known from the conjugate prior, the model user
would still need to identify the values of the prior 
parameters. While there are formal ways to deter- 
mine the prior parameters (Kass & Wasserman, 
1996), the selection of such parameters might not 
always be considered (Montesano & Lopes, 2009; 
Mason & Lopes, 2011). Oftentimes, the priors are 
either assumed to be known or are assigned such 
that the prior distribution is non-informative. In
other cases, these parameters are estimated using 
data and prior knowledge by matching the sample 
mean and variance to those of the prior distribu- 
tion (MacKenzie et al. 2014; Carlin & Louis, 2008). 
Another assumption to consider is the choice of 
the kernel function which depends on the appli- 
cation and the model user. This research uses the 
most popular kernel function, the Radial Basis 
Function (RBF) in Eq. (6), where $k(x_i, x_j)$ is one
entry in the matrix K representing the kernel function
between the attributes of the $i$th and $j$th data
points.


$$k(x_i, x_j) = \exp\left( -\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2} \right) \tag{6}$$


In addition to being commonly used in kernel
methods, the RBF has useful properties. The function
has only one parameter, $\sigma$, to be tuned to an optimal
value. This reduces computational effort significantly
in comparison to other kernel functions with
two or more parameters requiring a grid search to
estimate them. Also, the structure of the function
is based on the Euclidean distance, whereby similar
data points are closer to each other in the feature
space. Finally, the kernel matrix of the RBF
has full rank and the entries fall between zero and
one, resulting in kernel functions of the data points
acting as weights in the computation of the posterior
parameters (Schölkopf & Smola, 2002). More
discussion on the impact of the RBF parameter, $\sigma$,
on the performance of the model will follow in the
case study presented in section 4.

The estimated rate for the new data point then
follows a Gamma distribution with parameters
$\alpha^*$ and $\beta^*$. As a point estimate for this parameter,
the expected value of the posterior distribution is
considered, shown in Eq. (7) as the ratio of the
Gamma distribution parameters $\alpha^*$ and $\beta^*$.


$$\hat{\lambda} = \frac{\alpha^*}{\beta^*} \tag{7}$$


Note that a different point estimate for the rate 
can be used such as the median, the mode, or the 
variance, depending on the type of problem and 
the model users. 
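
The kernel-weighted update and the point estimate can be sketched for a single new data point as follows. This is a minimal illustration, not the authors' implementation: it assumes the common RBF form with $2\sigma^2$ in the denominator, treats one row of the kernel matrix at a time, and uses made-up attributes, counts and prior parameters. The function names `rbf` and `pbk_posterior` are hypothetical.

```python
import math


def rbf(x, z, sigma):
    """RBF kernel between two attribute vectors (assumed form with 2*sigma^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def pbk_posterior(x_new, X_train, y_train, alpha, beta, sigma):
    """Posterior Gamma parameters for one new point.

    alpha* = sum_i k(x_new, x_i) * y_i + alpha
    beta*  = sum_i k(x_new, x_i) + beta
    i.e. one row of K applied to Y and to a vector of ones.
    """
    k_row = [rbf(x_new, x, sigma) for x in X_train]
    alpha_star = alpha + sum(k * y for k, y in zip(k_row, y_train))
    beta_star = beta + sum(k_row)
    return alpha_star, beta_star


# Made-up training data: attribute vectors and observed counts.
X_train = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
y_train = [2, 4, 20]

a_star, b_star = pbk_posterior([0.0, 0.0], X_train, y_train,
                               alpha=1.0, beta=1.0, sigma=1.0)
rate = a_star / b_star  # posterior mean used as the point estimate
```

Training points far from the new point receive a kernel weight close to zero and contribute almost nothing, while a very large $\sigma$ pushes all weights toward one and recovers the plain conjugate update.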


3.2 Predictive accuracy measures 


The ultimate objective of developing and identi- 
fying predictive models is their application in risk 
and resilience analysis problems, such as predict- 
ing the frequency of disruptions in a particular 
network system or the recovery rate of infrastructure
and communities. While the goodness of fit is
important to assess whether the model is capturing
the pattern and variability in the data, it is equally
important to analyze the prediction power of a
statistical model if it is going to be used for forecasting
purposes. Prediction accuracy is assessed
by the out-of-sample error, which accounts for the 
discrepancy between the estimated parameter and 
the actual observation of data points that were not 
in the set used to train the model. In order to vali- 
date the prediction power of the models, several 
metrics are evaluated to assess the out-of-sample 
error, and they are summarized in Table 1. 

While RMSE and MAE are the most commonly
used measurements of error, the normalized
RMSE is also considered to account for the
variability across different samples of training
sets generated by the multi-iteration validation
process. NRMSE can either be normalized based
on the standard deviation of the observed values,
$sd(Y_i)$, or the range of values in the testing set,
$Y_{maximum} - Y_{minimum}$, and both cases are considered in
this paper.
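
The out-of-sample error metrics described above can be computed as in the following sketch. It is an illustration under assumptions: the sample standard deviation is used for the normalization (the paper does not state whether the sample or population form is meant), and the observed and predicted values are made up.

```python
import math
from statistics import stdev


def rmse(observed, predicted):
    """Root mean square error between observations and predicted rates."""
    n = len(observed)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error between observations and predicted rates."""
    n = len(observed)
    return sum(abs(y - p) for y, p in zip(observed, predicted)) / n


def nrmse_sd(observed, predicted):
    """RMSE normalized by the (sample) standard deviation of the observations."""
    return rmse(observed, predicted) / stdev(observed)


def nrmse_range(observed, predicted):
    """RMSE normalized by the range of the observations."""
    return rmse(observed, predicted) / (max(observed) - min(observed))


# Made-up held-out observations and predictions.
y_obs = [2.0, 4.0, 6.0, 8.0]
y_hat = [3.0, 4.0, 5.0, 10.0]
errors = {
    "RMSE": rmse(y_obs, y_hat),
    "MAE": mae(y_obs, y_hat),
    "NRMSE (sd)": nrmse_sd(y_obs, y_hat),
    "NRMSE (range)": nrmse_range(y_obs, y_hat),
}
```

Normalizing by a spread measure makes errors comparable across validation iterations whose testing sets differ in scale.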


3.3 Comparative analysis


In order to assess the performance of the models, 
the predictive accuracy measures are used to evalu- 
ate the models. More specifically, Poisson Bayesian 
kernel model is compared to a Poisson generalized 
linear model and a negative binomial generalized 
linear model (Cameron & Trivedi, 1986, 2013). 


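Table 1. Prediction accuracy metrics formulae.


Prediction accuracy metrics          Formula

Root Mean Square Error (RMSE)        $\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{\lambda}_i\right)^2}$

Normalized Root Mean                 $\frac{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{\lambda}_i\right)^2}}{sd(Y)}$
Square Error
(NRMSEM & NRMSED)                    $\frac{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{\lambda}_i\right)^2}}{Y_{maximum} - Y_{minimum}}$

Mean Absolute Error (MAE)            $\frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{\lambda}_i\right|$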

The Poisson GLM assumes that the rate to be
estimated has an exponential relationship with a
set of covariates, with coefficients for the
different attributes, $\hat{\lambda}_{GLM} = e^{X\beta}$ (where $\beta$ here
denotes the regression coefficients), while the predicted
rate for the PBK is equal to the expected
value of the posterior probability distribution,
$\hat{\lambda}_{PBK} = \alpha^*/\beta^*$.


4 CASE STUDY 


A case study is presented in this paper to demon- 
strate the use of the Poisson Bayesian kernel model 
in assessing the resilience of communities. More 
specifically, the study is focused on major power 
outage events that happened in the US between 
1999 and 2016. The goal is to compare the per- 
formance of the model against classical methods 
and assess its ability to predict, with a high level of 
accuracy, the recovery rate after these major events. 

The ability to accurately measure and predict the 
recovery rate from power outages allows respond- 
ers and recovery crews to improve their strategies 
and resource allocations before, during, and after 
a disruption. 




4.1 Data 


The data used in the case study were collected from the Energy Information Administration and include the date, time and duration of each outage, the magnitude of the power outage (megawatt loss and customers affected) and the disturbance type (severe weather, equipment failure, among others). The dependent variable to be modeled is the recovery rate, defined as the number of customers affected divided by the duration of the outage. To model the rate using a Poisson linear model, an offset of duration was used. The recovery rate is modeled with 10 regression coefficients that represent information on the cause of the outage, the severity, the location, the duration, and the time of day and month.
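The role of the offset can be written out explicitly. The following is a sketch under assumed notation: $C_i$ denotes the number of customers affected, $D_i$ the outage duration, $x_i$ the covariate vector and $\beta$ the regression coefficients. These symbols are introduced here for illustration and are not the paper's own.

```latex
\begin{align}
  C_i &\sim \mathrm{Poisson}(\lambda_i D_i), \\
  \log \mathrm{E}[C_i] &= \log D_i + x_i^{\top}\beta, \\
  \lambda_i = \frac{\mathrm{E}[C_i]}{D_i} &= e^{x_i^{\top}\beta}.
\end{align}
```

Entering $\log D_i$ with a fixed coefficient of one means the model is fitted to counts while the linear predictor describes the rate per unit of duration.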

Figure 1 is a scatterplot matrix of all variables in the data set. Each square is a pairwise plot of the corresponding pair of variables on the x-axis and the y-axis, and the red line is a local regression line for the two variables. The numbers shown in the upper part of the scatterplot are the correlations of the pairs of variables which, in this data set, are not significant with the exception of a couple of variables. Examining Figure 1, it is difficult to identify visually any particular relationships beyond the expected linear correlations due to multicollinearity, such as start date and time with restoration time. The plot also provides histograms for the different variables, and it can be seen that there is a large variance for many predictors.

Figure 1. Pairwise scatter plots for all the variables in the data.

Further examination of the patterns in the data focuses on the impact of seasonal variations and types of disturbance on the recovery process from power outages. Rates of recovery are generally slower in the winter than in the summer months (Figure 2).

While wide variations are observed in the recovery rate by type of disturbance, outages due to load shedding and fire/extreme heat experience the highest average recovery rates. Disasters such as flooding and hurricanes, however, have much slower recovery rates (Figure 3). Severe Weather events also result in the largest number of outliers in the data.


4.2 Results 


The Poisson Bayesian kernel model (PBK), the Poisson GLM (PGLM), and the negative binomial GLM (NBGLM) were used to model the data and predict the recovery rate as a function of the predictors related to the time, location, disruption, and other characteristics. The error measures discussed earlier and presented in Table 1 were calculated for each model and are summarized in Table 2.

Figure 2. Recovery rate as a function of month.

Figure 3. Recovery rate by disturbance type.

Across all predictive accuracy measures, PBK yields small errors overall. PGLM results in very large errors that could be driven, for instance, by the extreme values under Severe Weather events, whereas PBK is able to control for these and provide more stable estimates. For two of the predictive error measures, NRMSED and MAE, the PBK outperforms the NBGLM.

Overall, the performance of PBK and NBGLM is comparable from a predictive accuracy standpoint. However, PBK also provides an assessment of the uncertainty in the estimates through the prior and posterior distributions of the recovery rate: the outcome is a probability distribution over a comprehensive range of possible values for the recovery rate. As a result, it is possible for a decision maker to identify multiple point estimates based on their risk preference. For example, a risk-averse decision maker or infrastructure operator will rely on a more extreme (lower) value than the expected value of the recovery rate posterior distribution, since a more conservative mitigation and recovery strategy is preferred. A risk-taking decision maker, however, would prefer to save on the cost of mitigation and recovery, and would consider the upper tail of the distribution as an optimistic measure of the recovery rate. The choice of the posterior point estimate is not the only way a decision maker is involved in this process. Stakeholders also play an important role in identifying multiple initial parameters in the model.
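The idea of choosing a point estimate according to risk preference can be illustrated with posterior samples. This is only a sketch: the Gamma posterior, its shape and scale, and the 10% and 90% quantile levels are assumptions made for the example, not the posterior or the choices of the case study.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical posterior of the recovery rate (customers/hour):
# Gamma with shape 4 and scale 300, so the posterior mean is 1200.
posterior_samples = [rng.gammavariate(4.0, 300.0) for _ in range(10_000)]

# statistics.quantiles with n=10 returns the nine decile cut points.
deciles = statistics.quantiles(posterior_samples, n=10)

estimates = {
    "risk_averse": deciles[0],                             # lower tail: pessimistic recovery rate
    "risk_neutral": statistics.fmean(posterior_samples),   # posterior expected value
    "risk_taking": deciles[-1],                            # upper tail: optimistic recovery rate
}
```

A planner who sizes crews from `estimates["risk_averse"]` assumes a slower recovery than the expected value suggests and therefore commits more resources.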

Table 2. Prediction error values for all the models.
Model RMSE NRMSED NRMSEM MAE
PBK 2435 2.03 0.25 1258
PGLM 12961 10.88 1.34 5579
NBGLM 1706 2.06 0.17 2039

As mentioned earlier, the definition of the prior is an important consideration for any Bayesian approach. In this case, a non-informative prior was assumed. However, another important consideration is the value of the parameter in the kernel function. The results in Table 2 were obtained based on an arbitrary value of sigma. In order to understand the effect of this parameter on the predictive accuracy, Figure 4 shows the value of the root mean squared error as a function of $1/\sigma$.

Figure 4. PBK RMSE as a function of different values for the tuning parameter of the kernel function.

There is clearly an optimal value for this parameter, at approximately 12. It would be ideal if the parameter were tuned to minimize the error during the training process. The drawback of doing so is the additional computation time for tuning, which would increase exponentially as more parameters are considered in other forms of the kernel function.
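Tuning a kernel parameter by held-out error can be sketched as a grid search. The example below does not implement PBK: it uses a Gaussian-kernel weighted mean as a stand-in predictor, and the data, the candidate values and all function names are hypothetical.

```python
import math


def kernel_predict(x_train, y_train, x_new, sigma):
    """Gaussian-kernel weighted mean of the training responses (stand-in model)."""
    weights = [math.exp(-((x_new - x) ** 2) / (2.0 * sigma ** 2)) for x in x_train]
    total = sum(weights)
    if total == 0.0:  # all weights underflowed: fall back to the plain mean
        return sum(y_train) / len(y_train)
    return sum(w * y for w, y in zip(weights, y_train)) / total


def holdout_rmse(x_train, y_train, x_test, y_test, sigma):
    """RMSE of the stand-in predictor on the held-out points."""
    squared = [(y - kernel_predict(x_train, y_train, x, sigma)) ** 2
               for x, y in zip(x_test, y_test)]
    return math.sqrt(sum(squared) / len(squared))


def tune_sigma(x_train, y_train, x_test, y_test, candidates):
    """Evaluate every candidate sigma and return the one with the lowest RMSE."""
    scores = {s: holdout_rmse(x_train, y_train, x_test, y_test, s) for s in candidates}
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic data: a linear trend sampled on even points, tested on odd points.
x_train = [float(i) for i in range(0, 20, 2)]
y_train = [3.0 * x for x in x_train]
x_test = [float(i) for i in range(1, 19, 2)]
y_test = [3.0 * x for x in x_test]

best_sigma, scores = tune_sigma(x_train, y_train, x_test, y_test,
                                [0.5, 1.0, 2.0, 5.0, 10.0])
```

With one parameter the search costs one model evaluation per candidate. A full grid over k parameters with m candidates each costs m to the power k evaluations, which is the exponential growth mentioned above.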


5 CONCLUSION 


The work presented in this paper evaluates the use 
of Poisson Bayesian kernel models to measure and 
predict the rate of recovery. The ultimate goal of 
the research is to be able to quantify community 
resilience in order to inform resource allocation 
before, during, and after a disruption. The pro- 
posed approach to model the rate of recovery was 
compared to traditional count data models such as 
Poisson and negative binomial generalized linear 
models. 

The advantage of using Bayesian techniques is their ability to provide a probability distribution of the estimates, accounting for the uncertainty in resilience metrics. Another important benefit is the ability to update predictions as new information on the evolution of the disaster and the corresponding response of the community becomes available.

An initial comparison to other methods shows 
that PBK provides a higher accuracy than tradi- 
tional models with the added benefit of accounting 
for uncertainty and the decision maker’s opinion 
and prior knowledge.

ABSTRACT: Critical technical infrastructures constitute the backbone of society by providing essential services to vital societal functions and the community at large; it is therefore essential that they are resilient. Critical infrastructures, e.g. power systems, telecommunication systems and railway systems, are designed and operated based on different philosophies of where to put the resilience emphasis: robustness, rapidity of recovery, or a combination of the two. Here empirical failure data, such as duration and consequence of disruptions, from several critical infrastructures in Sweden are explored and analysed. To facilitate comparisons, a generic resilience assessment approach is also presented and applied. The results give insight into the resilience level of different infrastructures in Sweden and a basis for exploring its causes, e.g. differences in regulatory schemes, design or risk cultures. It is concluded that there exist significant differences in the infrastructures’ resilience levels and in the factors shaping the resilience.


1 INTRODUCTION 


The large-scale societal consequences arising in past events involving critical infrastructures, such as the European blackout in 2006, the Eyjafjallajökull eruption in 2010 and Hurricane Sandy in 2012 (Johansson et al., 2015), clearly indicate the need for approaches and measures aimed at increasing infrastructure robustness and rapidity of recovery, in essence addressing the resilience of critical infrastructures. These events have also revealed the complexities and uncertainties involved in assessing the resilience of critical infrastructures.

In recent years there have been ample contributions of different approaches, both academic and policy oriented, for assessing critical infrastructure resilience. These include expert based methods (e.g. different types of index methods), modelling and simulation based methods, and empirical data based methods. Limited attention has been given to contrasting and comparing the resilience of different types of critical infrastructures through the use of empirical failure and interruption data, which are normally gathered by regulatory authorities or internally by infrastructure owners.

Here we present and apply a generic resilience assessment approach to this type of data, aiming at assessing and contrasting resilience levels as input to the exploration of possible causes, e.g. underlying influential factors such as regulatory schemes, differing design philosophies or risk cultures.

In Chapter 2 a background is given for the generic resilience assessment approach, which is presented in Chapter 3 together with the data collection method. In Chapter 4 the data is presented for five critical infrastructures in Sweden: electric transmission, electric distribution, transport railway, transport road and water supply. In Chapter 5 the resilience levels of the infrastructures are contrasted and compared. The findings are then discussed in Chapter 6 and conclusions from the study are drawn in Chapter 7.


2 BACKGROUND 


The most common approach in the scientific literature related to critical infrastructures is to assess resilience in terms of the system’s functionality over a time period. Emphasis is normally on measuring a system’s resilience towards a single specific event or scenario (e.g. Pursiainen et al., 2016, Hosseini et al., 2016, Panteli et al., 2017, Nan & Sansavini, 2017). The approaches also tend to follow the fundamental resilience conceptualisation proposed by Bruneau et al. (2003), sometimes referred to as the “resilience triangle”, where the area of the triangle is defined as the resilience loss. Several more detailed conceptualisations of this engineering-oriented concept of resilience have lately emerged in the literature, for example accounting for the fact that the recovery behaviour is normally not linear but can, e.g., be exponential (Cimellaro et al., 2010). There are also examples where the resilience curve is divided into several different phases, e.g. original steady state, disruptive phase, system recovery state, and finally a stable end state (cf. Henry & Ramirez-Marquez, 2012, Nan & Sansavini, 2017). These conceptualisations seem to consider single large-impact events that put great stress on the measured system’s functionality and have high societal consequences. Further, the literature also presents a great variety of ways of defining or measuring the functionality of different infrastructures.
Methods for assessing the resilience of critical infrastructures can broadly be divided into three categories: expert based, modelling and simulation based, and empirically based methods. Expert based methods are generally assessments based on system characteristics, where the opinions of experts have been aggregated to produce some sort of index of infrastructure resilience (Hosseini et al., 2016, Hassel & Johansson, 2016). One example is Chang et al. (2014), who present a resilience elicitation approach in which resilience levels are estimated from expert opinions for a specific scenario in an interdependent critical infrastructure setting, resulting in estimated service disruption levels, rated from no loss to severe disruption, for different time lengths of disruption. Modelling and simulation based methods, in this context, are quantitative approaches aiming at observing modelled system behaviour during disruptive events (e.g. Hosseini et al., 2016, Ouyang, 2014). Some authors also include more general resilience metric conceptualisations, which could also be applicable to empirical studies, and utilize them in a modelling and simulation context (e.g. Nan & Sansavini, 2017, Panteli et al., 2017). Modelling and simulation efforts span, e.g., network theoretical, Monte Carlo, optimization and fuzzy logic approaches (e.g. Hosseini et al., 2016). Empirically based methods aim at deriving resilience levels of critical infrastructures during past events (see e.g. Johansson et al., 2015, Zorn & Shamseldin, 2015a). Normally this involves statistical analyses of the rapidity and robustness of one or several infrastructures during a specific disruptive event; sometimes the quantification of interdependencies is also addressed (Zorn & Shamseldin, 2015b). The empirical methods normally aim at measuring resilience during different phases of the disruptive event. These approaches are less applicable when addressing the overall resilience of an infrastructure towards the range from low impact to high impact events, and specifically when assessing the resilience over time.

Here the aim is to provide such an approach, based on assessing the resilience level of different types of critical infrastructures from interruption data, which to our knowledge has not been covered in the scientific literature. It should be noted, however, that reliability oriented approaches for critical infrastructures (for power systems see e.g. Billinton & Allan, 1996) to a large extent deal with similar issues of assessing system functionality during disruptive events, based on either modelling and simulation approaches or empirical failure data. These approaches, however, normally focus on infrastructure specific indices and not on a unified approach for the comparison of different types of infrastructures.


3 METHODS 


3.1 Data collection 


The data was collected through a three-phase process: I) identifying technical critical infrastructures (CIs) within Sweden, II) identifying actors within these infrastructures, preferably at national level, that potentially collect interruption data for their respective CI, and III) contacting the actors and retrieving the data. This data then underwent an examination to gain further understanding of its quality and quantity, e.g.: how is the data collected (e.g. in the form of databases or in the form of incident reports), what parameters are given in the data, and what are the delimitations and limitations of the data? In some instances, further contacts were made with the actors for clarifications or to gain additional data, resulting in an iterative data collection process. For each data set it was finally determined whether it contained the desired parameters for the resilience assessment approach.

In the end, data from the following five infrastructures and actors are included (out of 17 initially addressed infrastructure actors): a) Electricity transmission, Svenska Kraftnät (SvK), b) Electricity distribution, Energiföretagen, c) Transport Road, Trafikverket, d) Transport Railway, Trafikverket, and e) Water supply, Stockholm Vatten och Avfall (SVOA).


3.2 Resilience assessment approach 


The aim of the Resilience Assessment Approach 
(RAA) is to present a generic method that is appli- 
cable for the resilience analysis of several different 
types of critical infrastructures, at a variety of hierarchical levels, based on empirical interruption data. The approach is hence naturally bounded by the
interruption parameters that are typically collected 
by infrastructure owners or regulatory authori- 
ties. The interruptions are typically reported as 
independent incidents, although there might be 
underlying correlations between the interruptions 
(e.g. during a storm). Furthermore, the collected 
interruption data of a single event generally lacks 
detailed information of the actual functionality 
during the recovery process for the recorded inter- 
ruptions. Normally only start time, end time and 
maximum functionality loss is given. Hence the 
data does not reflect the widely spread “resilience 
curve” as normally depicted in the academic lit- 
erature (e.g. Cimellaro et al., 2006, Hosseini et al., 
2016, Panteli et al., 2017). If the interruption data 
would have had higher granularity, the approach 
presented here would however still be applicable. 

The resilience assessment approach aims to measure the resilience of a critical infrastructure at a “national system level” to guide policy making and regulatory considerations at this level. As several sub-infrastructures typically together build up an overall critical infrastructure (e.g. in Sweden there are about 160 distribution system owners, 4 regional system owners and 1 national system owner of the electricity infrastructure), it is necessary to aggregate the data from several sub-infrastructures that cover different geographical areas to a national level. Each individual failure is considered to place a strain on the overall system, with a negative effect on the functionality during a specific time period. Where two or more failures occur simultaneously or overlap during any time period, their functionality losses are summed (stacked), resulting in a fictive national functionality loss of the system as a whole ($F_t$), see Figure 1. As the true behaviour of the system functionality during a single interruption is unknown, each individual reported interruption has been depicted with a box-shaped profile.
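The stacking of box-shaped interruptions can be sketched on a regular time grid. This is a minimal illustration, assuming integer time steps with an exclusive end step; the function name and the three example interruptions are hypothetical.

```python
def stack_interruptions(interruptions, n_steps):
    """Sum box-shaped functionality losses on a regular time grid.

    interruptions: iterable of (start_step, end_step, loss) with end_step exclusive.
    Returns the stacked loss for every time step 0 .. n_steps - 1.
    """
    national_loss = [0.0] * n_steps
    for start, end, loss in interruptions:
        # Clip each interruption to the studied period before adding its loss.
        for t in range(max(start, 0), min(end, n_steps)):
            national_loss[t] += loss
    return national_loss


# Hypothetical interruptions from different sub-infrastructures.
events = [(1, 4, 0.02), (3, 6, 0.01), (8, 9, 0.05)]
f_national = stack_interruptions(events, n_steps=10)
```

At step 3 the first two interruptions overlap, so the stacked loss there is the sum of both losses.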

The fundamental criterion for the approach is that each interruption is recorded with a start time, end time and one or several consequence parameters which can be transformed into a single functionality loss, $F_t$, by normalising with an appropriately chosen time-independent baseline.

Figure 1. Illustration of how interruption data from several sub-infrastructures (left) is aggregated to a fictive “national system level” (right).

The total resilience, $R_{tot}$, of a critical infrastructure over a specified time period is then derived as:

$$R_{tot} = \sum_{t=T_s}^{T_{end}} F_t \qquad (1)$$

where:

$F_t$ = Functionality loss during time period $t$

$T_s$ = Start time of time period

$T_{end}$ = End time of time period

In order to facilitate a comparison of different infrastructures and for varying time periods, the mean resilience loss, $R_{mean}$, for a given time period is calculated as:

$$R_{mean} = \frac{R_{tot}}{N} \qquad (2)$$

where $N$ = Total number of samples in time period
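Equations (1) and (2) amount to a sum and an average over the sampled functionality-loss series. The following is a minimal sketch, assuming Eq. (1) sums the functionality loss over all samples between the start and end times and Eq. (2) divides that sum by the number of samples; the function name and the series are hypothetical.

```python
def total_and_mean_loss(functionality_loss):
    """Return (R_tot, R_mean) for one series of sampled functionality losses."""
    n = len(functionality_loss)
    if n == 0:
        raise ValueError("the time period must contain at least one sample")
    r_tot = sum(functionality_loss)   # Eq. (1) under the stated assumption
    r_mean = r_tot / n                # Eq. (2): average over the N samples
    return r_tot, r_mean


# Hypothetical hourly functionality losses for a ten-hour period.
f_t = [0.0, 0.02, 0.02, 0.03, 0.01, 0.01, 0.0, 0.0, 0.05, 0.0]
r_tot, r_mean = total_and_mean_loss(f_t)
```

Dividing by the number of samples is what makes periods of different length comparable.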


4 INFRASTRUCTURE DATA 


Here the background and boundaries of the data for the infrastructures are presented. Many factors set the boundaries of the data, such as whether it is gathered due to regulatory obligations or for reasons of internal audits and improvements. Interruptions lacking the information necessary for the resilience calculations are labelled as “missing data”.


4.1 Electricity transmission 


Svenska Kraftnät (SvK) is the national operator and the Swedish authority responsible for the
electric power transmission system. The transmis- 
sion system transports high voltage power from 
generating sources, such as power plants and wind 
parks, to regional networks. The transmission sys- 
tem is also synchronously interconnected to other 
countries (Norway, Finland and Denmark), but 
the data is delimited only to the Swedish system. 
According to Swedish and EU regulations, all 
companies providing power supplies are bound by 
law to deliver reports of outages and disturbances 
since the year 2009 (ENTSOE, 2017, Regering- 
skansliet, 2009, SVK, 2017). 

The data from SvK contained operational outages from the years 2006 to 2016. Each interruption has in total 14 different parameters. Of specific interest here are the start time, end time and the amount of Energy Not Supplied (ENS) due to the interruption. It should be noted that most interruptions in the transmission system (95%) lead to no loss of power supply, due to high levels of redundancies and spare capacities.

The data set received from SvK is judged to be of very high quality and comes with only a few unusable data points each year, see Figure 2. Regarding data required for the resilience calculation, the data set is almost complete. In total 2468 interruptions are contained in the data, of which only 113 (4.6%) were missing the relevant information.

The data used for the resilience calculations are the given start and end times, and the consequence is calculated by transforming the reported ENS to Power Not Supplied (PNS), averaged over the duration of the outage. The functionality loss is derived by dividing the PNS by the baseline of total Swedish power consumption for a specific year.
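The conversion from ENS to a functionality loss can be sketched as follows. The baseline is assumed here to be the annual energy consumption converted to an average power by dividing by the hours in a year; that conversion, the function name and all numbers are assumptions for illustration.

```python
HOURS_PER_YEAR = 8760.0  # assumption: a non-leap year


def functionality_loss_from_ens(ens_mwh, duration_h, annual_consumption_mwh):
    """Functionality loss of one interruption from its Energy Not Supplied."""
    if duration_h <= 0:
        raise ValueError("duration must be positive")
    pns_mw = ens_mwh / duration_h                          # Power Not Supplied, averaged over the outage
    baseline_mw = annual_consumption_mwh / HOURS_PER_YEAR  # assumed average national power
    return pns_mw / baseline_mw


# Hypothetical outage: 50 MWh not supplied during 2 hours,
# against an illustrative annual consumption of 140 TWh.
loss = functionality_loss_from_ens(ens_mwh=50.0, duration_h=2.0,
                                   annual_consumption_mwh=140_000_000.0)
```

An interruption with zero ENS gives a functionality loss of zero, which matches the observation that most transmission interruptions have no consequence for supply.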


4.2 Electricity distribution 


Energiföretagen is a relatively new Swedish trade association, formed in 2016 through the merging of existing trade organisations. Its main task is to ensure and maintain its members’ commercial interests, but it also supports the Swedish distribution actors in keeping up to date with changing demands and environmental needs. Energiföretagen continuously gathers and analyses failure data from its members, annually summarised and presented in the “DARWin-reports” (Energiföretagen, 2017). With almost 100 member companies, the data represents the distribution of electricity to nearly 90% of the end users in Sweden.

The data from Energiföretagen were anonymized (no reference to company) and contained outage data from the years 2005 to 2015, see Figure 3. Each interruption has in total 13 different parameters. As opposed to the data for the transmission system, disruptions at the distribution level lead to consequences to a much higher degree (only 0.75% of the interruptions had zero affected customers), due to the infrastructure’s limited redundancies and spare capacities.

Figure 2. Interruption data for the electricity transmission infrastructure. Categorized into failure cause and missing data.

Figure 3. Interruption data for the electricity distribution infrastructure. Categorized into failure cause and missing data.


The data set contains about three hundred times more interruptions than the SvK data. The data is judged to be of high quality: after correcting some obvious errors, there is only one bad value among just over 700 000 interruptions in total.

The parameters used for the resilience calcula- 
tions are the start and end times of the interrup- 
tions, and the consequence is calculated as the sum 
of affected low and medium voltage customers. 
The baseline for functionality is the total number 
of customers during a specific year for the report- 
ing companies. 


4.3 Transport road 


Trafikverket is a public authority responsible for 
the long-term infrastructure planning of road-, 
railway-, shipping- and aviation-operations, and 
construction and maintenance of the road and 
railway in Sweden. 

The data from Trafikverket regarding the road infrastructure comes from a project called “Total Traffic Stops” (TTS). Data collection started in 2016 and we have data until the first half of 2017. In TTS, unplanned and total stops of traffic for a given road section are reported. Hence, e.g., planned interruptions or maintenance actions that give rise to considerable delays in the traffic are not included. On larger roads, with several lanes and the two directions completely separated, an event is classified as a total stop only if all the lanes in one direction are closed. For roads where the two directions are not separated, all lanes in both directions must be in full stop for the event to be classified and registered in the TTS.

In the data quite many parameters, 30 in total, are described for each interruption. Most of them are used for classification in accordance with TTS and are considered less relevant here, e.g. coordinates of the accident and other location specific descriptions. The quality of the data is good. Of a total of 8295 recorded interruptions, none were missing the relevant data. 2730 interruptions had either zero duration or zero consequences.

As resilience metrics, the given start and end times of the interruptions are used, and the consequence of each failure is given by the reported “Average daily traffic” in terms of vehicles affected. The baseline for calculating functionality is the average number of vehicles that are in the transport system at any given moment (about 600 000), derived from additional data provided by Trafikverket.


4.4 Transport railway 


The data for the railway infrastructure is also from Trafikverket and covers the years 2012 to 2016. Failure data on railway system outages was scarce and according to Trafikverket there is no general data that sums up or connects railway outages or failures with a root cause. However, they do log train delays, which can be used as a proxy. If a train is more than 3 minutes delayed when passing a designated “measurement point” (usually a train station), it is registered. This is logged manually and, if correctly logged, all train delays associated with the same root cause are given the same “delay ID”. It is however quite uncertain how accurately and stringently these logs connect cause and effect. In total the data contains three million delays and about one million delay IDs. Less than 0.1% of the interruptions lack the relevant parameters. The quality of the data is judged to be fair.

Figure 4. Interruption data for the transport road infrastructure. Categorized into failure cause and missing data.

Figure 5. Interruption data for the transport railway infrastructure. Categorized into failure cause and missing data.

Start and end times give the duration, and the number of registered delays is averaged over the interruption duration, with the unit delays/hour. The baseline is derived from the total number of train departures in the system during the year 2012, in total 38 million. Hence, on average about 4 300 trains/hour are passing or departing from a measurement point. Due to lack of yearly data, this number is normalised and scaled with the total transported km in the system per year.
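The baseline figure can be checked with simple arithmetic, and the same numbers give a functionality loss for a single interruption. The sketch below assumes a 365-day year and leaves out the yearly scaling by transported km, for which no figures are given; the function name and the example interruption are hypothetical.

```python
DEPARTURES_2012 = 38_000_000   # total train departures in 2012, as stated in the text
HOURS_PER_YEAR = 365 * 24      # assumption: a 365-day year (2012 had 366 days)

baseline_trains_per_hour = DEPARTURES_2012 / HOURS_PER_YEAR


def railway_functionality_loss(n_delays, duration_h, baseline=baseline_trains_per_hour):
    """Delays per hour during an interruption, normalised by trains per hour."""
    if duration_h <= 0:
        raise ValueError("duration must be positive")
    return (n_delays / duration_h) / baseline


# Hypothetical interruption: 12 registered delays over 3 hours.
example_loss = railway_functionality_loss(n_delays=12, duration_h=3.0)
```

Using 366 days instead changes the baseline by less than one percent, so the rounded value of about 4 300 trains/hour holds either way.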


4.5 Water supply 


For the water supply infrastructure, attempts 
to gather both national data and data from sev- 
eral different distribution system operators were 
made. However, only Stockholm Vatten och Avfall 
(SVOA), which is the main distributor of drinking 
water and operates the wastewater treatment for 
Stockholm city and the municipality of Huddinge, 
gathered interruption data in a format suitable for 
our purpose. In total, SVOA serves about 10% of 
the Swedish population and the data is used for 
internal audit, as no regulatory demands exist for 
the water sector with respect to disruption data. 
The data covers water outages caused by service 
and repairs of the infrastructure. The data set 
stretches from July 2009 to May 2017. 

In total, each interruption has 9 different parameters, where those of specific interest here are start and end time, duration (given in “Hours without water”) and consequence (given as “Number of users”). In total 2500 interruptions are recorded in the data, of which quite many are missing the relevant information. About 1700 interruptions were in the end used. The baseline for calculating functionality is the approximate total number of customers that are served by SVOA (circa one million).

Figure 6. Interruption data for the water supply infrastructure. Categorized into failure cause and missing data.


4.6 Other infrastructures 


During the process of gathering interruption data 
attempts were also made to get data for telecom- 
munication, electronic communication, maritime 
and aviation infrastructures. However, due either to the sensitivity of the data or to incomplete or lacking data collection processes, no usable data could be retrieved for these infrastructures.


5 RESULTS 


5.1 Comparison of resilience levels 


In Figure 7 the resilience levels for the different infrastructures are presented. For each of the infrastructures, depending on the availability of data, the yearly mean resilience level for 2005 to 2017 is presented.

It is clear that there are rather large differences in the resilience levels when comparing the infrastructures. The worst performing infrastructure is the transport railway and the best is the electric transmission system. The resilience level for a given infrastructure also varies over the studied period. The electricity distribution system demonstrates quite fluctuating resilience levels over the years. The resilience levels of the transport railway system are declining over the years, with more or less similar resilience levels during the last two years, 2015 and 2016.

Figure 7. Annual resilience levels of the studied infrastructures.

Table 1. Resilience values for: ETr = Electricity Transmission, EDi = Electricity Distribution, TRo = Transport Road, TRa = Transport Railway, Wat = Water supply.

        Duration (h)              F_t                                     Resilience
Inf.    Mean    Var     Max       Mean (10°)   Var (10°)   Max (10°)      Mean
ETr     2.23    76.5    67.3      0.51         8.00        44.5           1.0000
EDi     8.16    3420    8490      3.95         160         159            0.9989
TRo     5.87    3200    10500     0.24         0.229       7.48           0.9999
TRa     19.3    8170    42700     16.6         55.1        421            0.9899
Wat     4.76    130     350       0.17         0.155       7.55           1.0000

In Table 1 the overall mean resilience levels are given, together with the mean and variance of duration and functionality loss, respectively. Electricity transmission scores the best result, closely followed by Transport road and Water supply. Then comes Electricity distribution, followed by Transport railway, which is by far the least resilient infrastructure.


5.2 Comparison of duration and consequence 


In Figure 8 histograms of duration and functionality 
loss for each of the infrastructures are 
presented. All interruption data that meet the 
requirements for calculating the resilience metrics, 
in accordance with the data section, are included. 
Interruptions that exceed either the duration 
threshold of 12 hours or the functionality loss 
threshold of 0.001 are binned together. These binned data 
(i.e. the tails of the distributions) are also presented 
separately in inset figures. 
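
The binning described above can be sketched as follows. This is a minimal Python illustration, assuming a fixed bin width and a single overflow bin; the function name, the bin widths and the sample values are hypothetical, and only the two thresholds come from the text.

```python
from collections import Counter

DURATION_THRESHOLD_H = 12.0   # duration threshold stated in the text
LOSS_THRESHOLD = 0.001        # functionality loss threshold stated in the text


def bin_with_overflow(values, bin_width, threshold):
    """Histogram counts with a single overflow bin for values above threshold.

    Returns (counts, tail): counts maps a bin index (or "overflow") to the
    number of interruptions, and tail holds the exceedance values that an
    inset plot would show separately.
    """
    counts = Counter()
    tail = []
    for v in values:
        if v > threshold:
            counts["overflow"] += 1
            tail.append(v)
        else:
            counts[int(v // bin_width)] += 1
    return counts, tail


# Hypothetical sample data, for illustration only.
durations_h = [0.5, 0.7, 4.2, 12.0, 30.0, 100.0]
losses = [0.0002, 0.0004, 0.02]

duration_counts, duration_tail = bin_with_overflow(
    durations_h, bin_width=1.0, threshold=DURATION_THRESHOLD_H
)
loss_counts, loss_tail = bin_with_overflow(
    losses, bin_width=0.0001, threshold=LOSS_THRESHOLD
)
```

Keeping the tail values in a separate list mirrors the inset figures: the main histogram stays readable while the rare, large interruptions remain available for inspection.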

Comparing the length of disruptions among the 
different infrastructures shows that the water supply 
infrastructure has a slightly different distribution shape: 
a typical interruption lasts about 4-5 hours, compared 
to 0-1 hours for the other infrastructures. 
Comparing functionality loss, the values are generally 
very small (maxima from 0.5% to 3%), except 
for the transport railway infrastructure, which has 
experienced larger-scale disruptions (up to 39%). 


6 DISCUSSION 


One of the complications of comparing critical 
infrastructure resilience levels based on empirical 
failure data is that the processes for gathering the 
data and the format of the data vary significantly 
between the studied infrastructures. To facilitate 
comparisons, the parameters that are essential for 
assessing resilience levels of critical infrastructures at 
a national level need to be defined, communicated 
and maybe even regulated. Some of the infrastructures 
are either legally obliged, or encouraged by 
authorities, to systematically gather interruption 
data, while for other infrastructures such incentives 
are lacking. In the latter case, where interruption data are 
gathered at all, they are generally geared more towards 
specific internal uses than 
towards comparisons across different infrastructures. 

Figure 8. Histograms of duration (left) and functionality loss (right) for: a) Electricity transmission, b) Electricity 
distribution, c) Transport Road, d) Transport Railway, e) Water supply. In the plots the interruptions exceeding 12-hour 
duration or exceeding 0.001 functionality loss have been separately binned for reasons of clarity. In the inset figures 
these binned exceedance values are shown. 


Striking a balance between these two goals, national 
comparisons and internal usage, is hence essential. 
For some infrastructures there seems to be an overall 
lack of structured data collection processes, 
e.g. the water supply infrastructure and the 
telecommunication/electronic communication infrastructures. 
For example, for the water infrastructure we found 
a document from the trade organisation Svenskt 
Vatten that does not encourage collection of the kind 
of data that is necessary for empirical resilience 
estimations as carried out here. This document gives 
quite detailed suggestions on how to collect disruption 
data for the water distribution sector and what 
parameters to include, but it proposes only the start time 
as being of importance and leaves out the end 
time. This is one of the reasons why much of the data 
we received from the water sector was not usable: 
it did not record end times 
for the interruptions. 

There are many aspects that can potentially 
explain the differences found between the 
resilience levels of the infrastructures. For example, 
the infrastructure with the lowest resilience level, 
transport railway, is known to be a highly congested 
system with few redundancies and a system 
that often operates close to the limits of maximum 
capacity, leaving no margins to absorb coincidental 
variations and stresses. In comparison, the transport 
road infrastructure has greater redundancies, 
since alternative routes are available if 
a certain road is interrupted, lessening the overall 
consequences. Similarly, the infrastructure with 
the highest resilience level, the electricity transmission 
system, has a high degree of redundancy 
and more seldom operates close to the limits 
of maximum capacity. The electricity transmission 
data was the only data set with systematically collected 
zero-consequence interruptions. These zero-consequence 
interruptions are of high interest when trying 
to give an account of the level of flexibility and 
adaptability in the system for tolerating stresses, but 
unfortunately these types of interruptions are not 
systematically logged for the other infrastructures. 

Since electricity transmission is one of the most 
critical infrastructures, on which many other 
infrastructures depend (e.g. Johansson 
et al., 2015), there might be more incentives in place 
for this infrastructure that have led to the high level 
of resilience that we see in our data, which would 
be interesting to explore further. 

There are also other structural properties and 
contextual factors that influence the resilience level 
as assessed from empirical failure data. For example, 
the electricity transmission system is a quite 
protected system, e.g. highly tree-secured and generally 
harder to access, compared to the electricity 
distribution system, which is in general more exposed 
and where interruptions due to falling trees and 
excavations are quite common. The water supply 
infrastructure was shown to be the second most 
resilient infrastructure. This is likely because, even 
when failures occur, some degree of 
functionality can often still be maintained (although with 
a loss of pressure in the system), so that customers in many 
cases will not experience a complete loss of service. 

The resilience assessment approach presented 
here is designed to utilise empirical failure data. 
We discovered that there are inherent limitations 
of the data in terms of resolution when it comes 
to assessing infrastructure resilience, such as that 
the behaviour during the disruptive and recovery 
phases is unknown. As such the results are slightly 
pessimistic, as customers normally get incrementally 
reconnected during an interruption. Further, 
as with all empirical approaches it is a retrospective 
exercise, and hence has limited power to predict 
the resilience towards unknown or not yet experienced 
stresses. The approach does, however, allow 
for a robust resilience comparison of different 
critical infrastructures and of how the resilience level 
changes over time. As such it provides relevant 
input for further explorations of causes and incentives 
for achieving resilience across infrastructures. 
With further extensions of the presented work, 
it could also be possible to analyse the influence 
of infrastructure interdependencies on the resilience 
levels (cf. Dueñas-Osorio & Kwasinski, 2012; 
Zorn & Shamseldin, 2015b). Given for example a 
power outage in the electricity transmission system, 
it could be possible to quantify how that impacts 
other infrastructures, e.g. the electricity distribution 
system and the transport railway system. Further, 
if the interruption data are also combined with other 
data such as wind speed, rainfall and temperature, 
or average income or population densities in the 
outage area, further explorations into the causes 
and effects of infrastructure failures are possible. 
The results presented here are valuable input for 
further studies of the underlying factors that are 
shaping and influencing the resilience levels of different 
infrastructures. In the end, such studies can give guidance 
on how underlying influential factors, 
such as regulatory schemes, differing design 
philosophies or risk cultures, shape the resilience of 
critical infrastructure. The approach can 
also be used for benchmark-type studies, e.g. how 
resilience strategies for one type of infrastructure 
might be possible, or impossible, to implement 
for another. It is also possible to explore different 
design philosophies in more depth and their 
impact on the resilience of critical infrastructures. 


7 CONCLUSIONS 


In the paper, a generic approach for the assess- 
ment of resilience levels of different types of criti- 
cal infrastructures based on empirical interruption 
data is presented and applied to several Swedish 
critical infrastructures. It is concluded that the 
approach is applicable for a unified comparison of 
the resilience levels of different types of technical 
infrastructures at a national level. The most resilient 
infrastructures are Electricity transmission and 
Water supply, and the least resilient infrastructures 
are Transport railway and Electricity distribution. 
The results also reveal differences in how resilience 
seems to be achieved, where some infrastructures 
seem to focus on limiting the interruption times 
(such as Electricity transmission) and some focus 
on limiting the consequences that arise (Water 
supply). This type of information is valuable for 
understanding the level of resilience and what is 
influencing and shaping the resilience of our critical 
infrastructures. 

ABSTRACT: Improving the resilience level of a critical infrastructure (CI) is vital to face crises and ensure 
its proper functioning. This paper describes the process followed to gather information for developing 
a methodology to assess the resilience level of Red Eléctrica de España (REE), which is the company 
responsible for the transmission and operation of the electricity system in Spain. The process was based 
on the three units of analysis of a crisis: peak of the crisis, lifecycle of the crisis, and learning process 
among crises. In the peak of the crisis we identified the impact variables that characterize a crisis, 
and the key stakeholders. In the lifecycle of the crisis, we analyzed how the preparation and prevention 
activities affect the response and recovery. Furthermore, an analysis of the CI’s interdependencies was carried out. 
Finally, the last step focused on understanding how the CI learns between one crisis and the next. 


1 INTRODUCTION 


The welfare of society is highly dependent on the 
efficient performance of Critical Infrastructures 
(CIs). In general, CIs are those systems and companies 
that provide essential services that underpin, 
maintain and sustain the vital societal functions 
on which society’s wellbeing relies (DHS 2017; EU 
2017). 

CIs are very complex systems with a high degree 
of interconnection among them (Rinaldi 2001). 
These systems are designed to be reliable and 
robust but, at the same time, their inherent complexity 
increases their vulnerability, since they become 
completely dependent on the proper functioning 
of other CIs. This vulnerability can translate into 
direct and indirect impacts when one of those CIs 
fails. 

Nowadays, as the potential threats affecting CIs 
are global, so are the effects of the crises they lead 
to. Trends like the technological advances of 
recent years, climate change or interdependencies 
among systems have modified how crises occur and 
evolve, making the events and their consequences 
more difficult to foresee and prevent. A crisis can 
be defined as the consequence of an unexpected triggering 
event that, suddenly or through a cumulative 
process of near misses, strikes the entire system 
(Coleman, 2004; Mitroff & Anagnos, 2000; Pearson 
& Claire, 1998). 

Relevant crises in recent years, like Hurricanes 
Katrina in 2005, Sandy in 2012 and, most recently, 
Harvey in 2017, the earthquake in Japan that 
led to the Fukushima nuclear accident in 2011 
or the Ukraine cyberattack in 2015, have something 
in common: all of them affected several CIs significantly, 
causing important impacts in many sectors 
and making the recovery phase more difficult and 
longer (Comes & Van de Walle 2014; Chang et al. 
2007). Sometimes, cross-border effects and interdependencies 
cause several nations to be affected by 
a crisis, which requires international cooperation. Dealing 
with crises requires a huge effort and involves 
many stakeholders from organizations of 
different nature and even from different countries. 

It is clear that CIs must do all they can in order 
to anticipate and prevent crises but, at the same 
time, they should improve their capacity to act in 
a dynamic, flexible and creative way. Furthermore, 
CIs should develop skills and tools that lead them 
to a successful resolution of any type of crisis, pre- 
dictable or not. Risk management is necessary but 
is not enough when facing unexpected crises. CIs 
must go beyond known risks and must be resilient 
(CSS 2011). 

Resilience can be considered a strategic property 
of a system that enhances its capacities 
to adapt to and successfully face any type of 
crisis in a changing environment. Those capacities 
refer to: (1) the ability to prevent, anticipate and 
then avoid a crisis; (2) when the crisis occurs, the 
capacity of the system to resist the triggering event, 
absorbing and mitigating the impacts; (3) the 
capacity to recover rapidly and efficiently; and (4) the 
capacity to learn, improve and prepare for future 
stressors (Ganin et al. 2016; Hosseini 2016; Francis 
& Bekera 2013; Hollnagel et al. 2007). 

The literature contains some 
approaches and proposals for measuring the resilience 
of a CI. The majority of them focus on technical 
attributes such as the robustness of the physical assets 
or their redundancy level. Other methodologies 
adopt a risk management approach, considering the 
system behavior against a specific risk. The organizational 
resilience perspective takes into account the 
sociotechnical attributes of the CIs but does not 
pay special attention to the technical aspects. Furthermore, 
most of the methodologies fail to offer 
a more holistic view, and they do not provide tools to 
characterize both the event and the impacts (Cimellaro 
et al., 2010; Panteli et al. 2017; Brown et al. 2017; 
Labaka et al. 2016; Ouyang & Wang 2015; Aleksić 
et al. 2013; Petit et al. 2013; McManus et al. 2008). 

To develop a methodology to evaluate the resilience 
level of a CI from a holistic approach, considering 
both the technical and the organizational aspects, 
it is necessary to gain deep knowledge about how a 
crisis occurs, analyzing how it happens and the 
effects it has on the CI. Furthermore, it is useful 
to find a way to analyze the evolution of the crisis over 
time, in order to identify key cross-cutting capacities that, 
duly enhanced, will lead to a more resilient system. 

Before defining the resilience assessment meth- 
odology, the challenge is to design an effective 
strategy to gather the most relevant information to 
create the method. 


2 METHODOLOGY 


Knowledge about how CIs behave in crises is 
embedded in the practitioners’ minds. Thus, to 
determine a CI’s performance in a crisis, it is essential 
to work together with all relevant stakeholders 
to aggregate their knowledge and obtain a comprehensive 
vision of the whole crisis that includes all the 
different points of view. 

As the resilience building process needs the com- 
mitment and participation of all the stakeholders 
involved (Gimenez et al. in press), we can combine 
different methods to gather as much information 
as possible from all the sources available. 

Individual interviews with key stakeholders are 
a valuable resource, as they provide precise data 
from their daily work and give an idea of 
their personal perspectives, what kind of data they 
manage and how familiar they are with the 
topics under study. 

If we want to collect information from a large 
group of people we could carry out a survey. A 
survey is a systematic method for gathering infor- 
mation from entities for the purpose of construct- 
ing quantitative descriptors of the attributes of the 
larger population of which the entities are members 
(Groves et al. 2011). The selection of a representative 
sample, the adequacy of the questions and 
the processing of the data to reach the goals set will 
be crucial to get valid results from the survey. 

Once the most significant issues related to resilience 
have been identified from a wide range of 
practitioners, we can work on the most relevant 
issues in workshops with a reduced number of 
experts through collaborative methodologies. 

Collaborative methodologies can be very useful, 
as people from different departments within the 
organization, with distinct levels of responsibility, 
share and discuss their knowledge and ideas 
to contribute to a common goal. 

As a result, we will get a shared vision from the 
individual points of view and understanding. 

Group model building (GMB) is an example of 
collaborative methodology that enables integrating 
fragmented knowledge, initially residing in the minds 
of different agents, into aggregated models (Scott et 
al. 2016; Andersen et al. 2007). GMB is based on 
workshops where modelers work on the problem 
jointly with multidisciplinary domain experts. 

GMB has been employed successfully in many 
areas such as inter-organizational integration of 
information, health care system organization, 
organizational strategy changes, analysis of inte- 
grated operation strategy in a crude oil and gas 
company or CIP (Luna-Reyes et al. 2016; Her- 
nantes et al. 2012; Ackermann et al., 2010; Rich 
et al., 2009). 

The outcomes from the GMB process are essen- 
tial to understand and identify the crisis behavior 
patterns, and then, to set the most representative 
variables to typify the resilience of the system. 

This paper describes how the methods explained 
above have been used to identify the key elements 
in the CI under study, in order to develop a methodology 
to assess its resilience level and improve 
it. 

The process has been carried out within the con- 
text of a project with a CI from the energy sector in 
Spain called Red Eléctrica de España (REE). 


3 INFORMATION GATHERING PROCESS 


Crises can be analyzed from different perspec- 
tives. The analysis can only focus on the trigger- 
ing event and subsequent direct impact, or it can 
have a broader perspective including the pre-crisis 
period, the peak of the crisis and the post-crisis. 
Furthermore, the analysis can focus on looking at 
how the system learns from one crisis to the next 
one, broadening even more the perspective. 

In order to analyze the crises from different perspectives, 
this research uses the three units of analysis 
(see Figure 1) defined by Labaka et al. (2011): 

1. Peak of the crisis. It focuses on the magnitude of 
the triggering event, evaluating the immediate consequences 
of the event and the impacts. This 
analysis provides relevant information about issues 
such as what factors characterize the crisis, which are 
the most important impact variables, who are the 
key stakeholders and which are the mechanisms of 
the CI to cope with the crisis. 
2. Lifecycle of the crisis. It comprises the pre-crisis, 
the peak of the crisis and the post-crisis. The study 
of the complete cycle of one crisis makes it possible 
to understand how the preparation and prevention 
activities carried out in the pre-crisis phase affect 
the response and recovery activities. 
3. Multiple crises. Here the period of time between one 
crisis and the next is analyzed. The objective 
in this case is to identify how the CI under study 
learns between one crisis and the next. 


Figure 1. Three units of analysis (Labaka et al., 2011). 

This method has been successfully used in 
another project to analyze major power-cut crises 
in Europe (Hernantes et al. 2013). 

Based on the three perspectives defined, the 
process we followed was structured in the following 
three steps: 


1. Analysis of the peak of the crisis. 
2. Analysis of the lifecycle of the crisis. 

a. Analysis of the CI interdependencies. 
3. Multiple crisis analysis. 


Each of those three steps was performed to 
obtain knowledge related to specific objectives, as 
explained below. 


3.1 Peak of the crisis 


The peak of the crisis covers the period of time 
between the moment when the crisis strikes, and 
the moment when the CI starts its recovery. It rep- 
resents the most visible part of the crisis, in which 
most of the impacts appear. It is the moment where 
all the efforts we have made before to improve our 
capacity to face crises are tested. 

The objectives for this first phase were: (1) To 
characterize the peak of a crisis in the CI under 
study, and (2) To determine the most important 
variables to characterize the impact of a crisis. 

To reach the goals set we needed to comprehend 
the CI under study, to understand what a crisis 
means for them and, finally, to know how they 
represent and measure their crises. 


With that aim, we asked the CI for information 
related to the organization itself and, in particular, 
everything related to crisis management, records 
and indicators. 

The CI employees participating in the project 
were first interviewed individually. In particular, 
seven people were interviewed. The participants 
had a technical profile and belonged to the transmission 
and operation divisions, more specifically to 
the following departments: renovation and facilities 
improvement, lines maintenance, substations maintenance, 
telecommunications, lines engineering, substations 
engineering and system operation. 

All the interviews had the same structure. At 
the beginning general information was requested 
(company structure, departments, activities, business, ...) 
and the interviewees were asked, in particular, about 
their function and responsibility in the organization. 
Secondly, they were asked about their knowledge 
of the crisis-related procedures in the 
organization (risk analysis, contingency plans, 
chains of command for crisis management, reports 
of past crises and incidents, procedures and agreements 
with external stakeholders). After that, they 
were asked about the indicators and variables they 
used to identify, report and record crises. 
We were especially interested in the indicators 
used to represent the crisis impacts. 

In addition to the interviews with the participants 
in the project, we also interviewed the manager 
of the risk management department, to obtain 
the strategic view of the company about crises. The 
manager of the press office was also interviewed, 
to understand how crises are communicated externally, 
for example to the media, to other stakeholders 
and to society. 


The outcomes from this first step were: 


1. A list of 21 indicators to quantify the impacts 
of an eventual crisis, independently of the trig- 
gering event and classified in four dimensions: 
structural impacts in the organization, social 
impacts derived from the CI failure, economic 
impacts, and environmental impacts. 

2. A set of five representative indicators of the peak 
of the crisis, which let us represent the evolution 
of the crisis over time, taking into account 
both internal and external aspects for the CI. 
Those five indicators were chosen among the 
previously identified 21 indicators, and they are: 
percentage of the CI’s own infrastructures damaged 
(V1), percentage of users without electrical supply 
(V2), percentage of extra resources needed 
for the crisis resolution (V3), level of the company’s 
knowledge about the situation (V4) and, 
finally, the public anxiety due to the crisis (V5). 

3. The duration of a “standard” peak of the crisis 
in the CI was set at 48 hours. 

4. A classification of the crises according to four 
extreme patterns defined based on two parameters: 
on the one hand the resilience level of the CI 
and on the other hand the magnitude of the 
event. Thus, we would be able to represent any 
kind of crisis based on the resilience level of 
the system and the magnitude of the triggering 
event regardless of its origin (see Figure 2). 
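
The outcomes above can be captured in a small data structure. The following Python sketch is only an illustration of how the five peak indicators and the four extreme patterns might be represented; the class, field and function names are hypothetical, and the numeric scales for V4 and V5 are assumptions, since the text does not define them.

```python
from dataclasses import dataclass

PEAK_DURATION_H = 48  # "standard" peak duration stated in the text


@dataclass
class PeakIndicators:
    """One observation of the five peak-of-crisis indicators (V1 to V5)."""
    hour: float                              # time since the crisis struck
    v1_own_infrastructure_damaged_pct: float
    v2_users_without_supply_pct: float
    v3_extra_resources_needed_pct: float
    v4_company_knowledge_level: float        # scale is an assumption
    v5_public_anxiety_level: float           # scale is an assumption

    def within_standard_peak(self) -> bool:
        """True if the observation falls inside the 48-hour peak window."""
        return 0 <= self.hour <= PEAK_DURATION_H


def crisis_pattern(resilience_is_high: bool, magnitude_is_high: bool) -> str:
    """Name one of the four extreme patterns from the two parameters."""
    resilience = "high resilience" if resilience_is_high else "low resilience"
    magnitude = "high-magnitude event" if magnitude_is_high else "low-magnitude event"
    return f"{resilience}, {magnitude}"


# Hypothetical observation six hours into a crisis.
sample = PeakIndicators(6, 12.0, 35.0, 20.0, 40.0, 70.0)
patterns = {crisis_pattern(r, m) for r in (True, False) for m in (True, False)}
```

Treating resilience level and event magnitude as two binary inputs is a simplification: it yields exactly the four extreme patterns, whereas a real crisis would sit somewhere between them.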


This first phase of interviews highlighted some 
barriers that hindered optimal data and knowledge 
collection. For that reason, we decided to 
apply collaborative methodologies such as GMB 
from then on, to manage barriers like: 


• Silo thinking. Each department is expert in a 
specific issue and focuses only on the aspects 
related to its specific field of knowledge. 

• Confidentiality. Sensitive information is only 
available to a few people within the organization, 
usually “high” profiles, and does not 
reach other operational levels. 

• The existence of long chains of command (and 
levels of information), which result in a loss of 
perspective of the whole crisis and its impacts. 

• A poorly established failure-sharing culture. 

• Reports about crises do not reflect a holistic view 
of the crisis as, in general, they focus on technical 
aspects and do not collect information about 
social or economic impacts. 


3.2 Lifecycle of the crisis 


In this unit of analysis, we extend the study taking 
into account the whole lifecycle of the crisis, that is: 
the pre-crisis, the peak of the crisis and the post-crisis. 

Whilst in the peak of the crisis the focus was 
on the impacts, the lifecycle also considers the 
preparation and the recovery activities to manage 
the crisis. During the pre-crisis, CIs’ activities 
are oriented towards anticipating and preventing 
the occurrence of the crisis by identifying hazards, threats 
and vulnerabilities, and preparing plans to address 
risks and reduce weaknesses. In the peak of the 
crisis all the efforts are concentrated on absorbing 
and minimizing the impacts, trying to reduce the 
human and material losses, avoiding cascading 
effects and recovering the service, all in the shortest 
possible time. Finally, the post-crisis encompasses 
the recovery to normal operation levels, during which 
all the damaged and affected equipment and physical 
systems must be repaired and restored. These 
activities can last several months or even years. 


Figure 2. Four patterns for CI crises classification 
according to CI resilience level and the magnitude of the 
triggering event. 

With the objective of obtaining information 
about the prevention, preparation, response and 
recovery activities in REE we carried out a table-top 
exercise with the participants. As an 
example we used a hypothetical catastrophe which considerably 
damages the physical systems, such as power 
lines and some important substations, and also 
weakens the response capacity of REE by leaving the 
system without the telecommunication service. 

The session was structured in three parts. 

In the first part the objective was to understand 
the mechanisms of the CI for the response and recovery 
phases. With that aim, the hypothetical scenario 
of the catastrophe was introduced to the attendees 
and then the participants were invited to describe, 
individually, the activities they would carry out to 
face the crisis and bounce back to their normal operational 
state. In addition, they were asked to identify all the 
stakeholders involved in the crisis resolution and, in 
particular, to match the stakeholders 
with the activities they should be involved in. 

In the second part of the session the participants 
focused on the preparation phase. On this 
occasion they needed to think about the activities 
and measures that must be carried out to prevent a 
disaster and to be ready to resolve it when it happens. 
We invited them not only to focus on technical 
issues but also to take into account management 
aspects. Moreover, they were asked to identify strategic 
relationships with stakeholders within and outside 
the company for a satisfactory crisis resolution. 

Finally, once the policies had been identified 
they recognized and analyzed the barriers and the 
difficulties to implement those policies. 

The outcomes from this unit of analysis were: 


1. The four dimensions of Resilience for REE, 
based on the literature (Labaka et al. 2016; 
McManus 2008; SMR 2016) and on the policies 
resulting from the table-top exercise. 1) 
Leadership: company management’s commitment 
to the resilience-building process and its 
capability to promote and consolidate a culture, 
attitude and values based on it. 2) Preparedness: 
it refers to the capabilities, knowledge, 
procedures and technical means of the company 
to cope with crises. 3) Technical: it refers to 
system robustness and resistance capability in 
terms of impact reduction and loss of performance. 
4) Cooperation: efficient management of 
human and material resources, internal and 
external, in major crises. 

2. A set of policies to improve the resilience level 
of REE. Those policies refer to properties that 
the system should have and transversal activities 
that should be developed in order to enhance 
the resilience capacity of the CI. 

3. A list of indicators for the preparation phase. 
For each of the policies identified at least one 
indicator is proposed, in order to be able to 
measure their evolution over time. 

4. A list of indicators for the response and recovery 
phases. As in the previous point, for each 
of the policies at least one indicator has been 
identified, to monitor and control the degree of 
implementation of each policy. 

5. A tool to generate random crisis scenarios 
(see Figure 3). Resilience is the capacity to 
withstand and deal with any kind of crisis, 
predictable or not. With the idea of introducing 
a degree of uncertainty we proposed a tool 
that takes into account the potential risks but 
also trends, like climate change, demographic 
imbalances or interdependencies, that can act as 
crisis enhancers. The core of the tool represents 
the resilience capacity of the CI. 


3.2.1 Analysis of the interdependencies 
CIs are more tightly coupled than ever before, constituting 
a very complex and strongly interconnected 
system. The resilience approach takes into account 
the context of the CI under study and its interaction 
with other CIs, stakeholders, organizations, etc. 


Figure 3. Random scenarios generator. 

Interdependencies are the cause of many indirect 
impacts and cascading effects. Furthermore, 
they can be decisive for the recovery phase of 
a crisis. For that reason, when we want to evaluate 
the resilience level of a CI, an analysis of the 
interdependencies is needed. 

As part of the crisis lifecycle analysis, we carried 
out the study of the CI interdependencies. 

The objectives for this phase were: 


1. Identify the CI’s interdependencies: 
a. Which CIs it depends on most. 
b. Which CIs depend on it. 
2. Analyse in detail the most critical 
interdependencies. 


To achieve objective 1, we decided 
to conduct an on-line survey within the company 
(CI) under study, based on the structure that Lauge 
et al. (2014) suggest, to analyze the CI interdependencies. 
We chose that method to obtain a broad 
view of interdependencies and, indirectly, of 
crisis management. In the survey 39 people participated, 
of whom 26 completed the entire questionnaire. 
They belonged to 15 different departments 
and most of them had a technical profile. 

The survey content focused on the dependencies 
that REE had on other CIs. We wanted to establish 
to what degree REE is dependent on the operation 
of other CIs, in case of failure in any of them. 

The survey proposed different crisis scenarios
caused by a failure in one CI lasting for different periods of
time: less than 2 hours, from 2 to 8 hours, from 8
to 24 hours, from 24 hours to one week, and more than
one week.

From the answers obtained, we identified the CIs on
which REE is most dependent and the failure
duration beyond which the situation becomes critical
for REE.
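
As an illustration only, the aggregation of such survey answers could be sketched as follows. The data layout, the band labels and the 50% threshold are assumptions introduced here; the paper does not state how the critical duration was derived from the answers.

```python
from collections import defaultdict

# Duration bands used in the survey, ordered from shortest to longest.
# The short labels are hypothetical names for the five periods.
DURATION_BANDS = ["<2h", "2-8h", "8-24h", "24h-1w", ">1w"]


def critical_band(responses, threshold=0.5):
    """Return, per CI, the shortest band rated critical by enough respondents.

    `responses` is a list of (ci_name, band, is_critical) tuples.
    `threshold` is an assumed share of respondents, not a value from the paper.
    """
    votes = defaultdict(lambda: defaultdict(list))
    for ci, band, is_critical in responses:
        votes[ci][band].append(bool(is_critical))

    result = {}
    for ci, per_band in votes.items():
        result[ci] = None  # None: no band reached the threshold
        for band in DURATION_BANDS:
            answers = per_band.get(band, [])
            if answers and sum(answers) / len(answers) >= threshold:
                result[ci] = band
                break
    return result


# Made-up answers, for demonstration only.
sample = [
    ("power", "<2h", False), ("power", "<2h", False),
    ("power", "2-8h", True), ("power", "2-8h", True),
    ("telecom", "8-24h", False), ("telecom", "24h-1w", True),
]
print(critical_band(sample))
```

Sorting the CIs by their critical band would then give a simple ranking of the dependencies that become critical soonest.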

Furthermore, the analysis of the survey data
also provided additional information about
the employees’ perception of crises and
interdependencies. In general, there was no consensus
on the issues raised. There was no common
view of the crisis, and the employees found it
difficult to identify both the stakeholders and
the dependencies.

To complete the interdependency analysis
(objective 2), a GMB workshop was carried out in
order to share and discuss views and then reach a consensus
among the participants. In the first part of the workshop
the results of the questionnaire were presented,
explained and discussed. In the second part of the
session, we focused on the analysis of the most critical
scenarios according to the results of the survey.

We applied the dependency radar tool for the
analysis (Laugé-Eizagirre 2014). The dependency
radar provides a graphical representation of
the level of dependency among CIs and helps to
understand CI dependencies by identifying
the dimensions that determine the dependency
of a CI on others (see Figure 4). The tool sets out
five dependency dimensions classified under two
main areas. The failure area refers to how a CI can
make another CI fail. Dependency for working,
redundancy, and effect of external aggravating factors
are the dimensions within this area. The recovery
area refers to how the first CI can contribute to or
hinder the second CI’s recovery. Recovery time
and effects of resource sharing for recovery are
the dimensions defined in this area.

Two dependency scenarios were analysed
with the dependency radar: the first considered
a power supply failure lasting 8 hours and the second
a communications failure lasting 24 hours.

The participants worked on the cases in small groups of two
or three people. They were then invited to explain
and discuss the different radars in order to reach an
agreement for each of the dependency scenarios.

The outcomes from this phase were: 


1. The CIs on which the company is most
dependent and the duration of the failure
beyond which the situation becomes critical. This
information is relevant as it lets REE define
actions to control and, when possible, reduce
the level of dependency on other CIs and, consequently,
minimize the risks.

2. The period of time beyond which the situation
becomes critical when one of those previously
identified CIs fails. This information is very
important for characterizing the crises. Furthermore,
it will help to design and choose appropriate
prevention and mitigation actions.

3. A classification into four extreme patterns of
dependency, according to whether the CI depends on
others to operate and to recover: FDRD (Failure
Dependent Recovery Dependent), FDRI
(Failure Dependent Recovery Independent),
FIRD (Failure Independent Recovery Dependent)
and FIRI (Failure Independent Recovery Independent)
(see Figure 5). Classifying
the dependency relationships with other CIs by
pattern makes it possible to establish similarities and
differences among the scenarios analysed.

Figure 4. Dependency radar (Laugé-Eizagirre 2014). Dimensions shown: dependency for working, redundancy, effect of external aggravating factors, recovery time, effect of resource sharing for recovery.

Figure 5. Dependency Patterns for dependency analysis.
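
The four patterns above are the combinations of two yes/no judgements, which can be made explicit in a short sketch. The function name and the boolean inputs are hypothetical; how a workshop group decides that a CI is failure dependent or recovery dependent is not specified here and is left to the caller.

```python
def dependency_pattern(failure_dependent: bool, recovery_dependent: bool) -> str:
    """Map two yes/no judgements onto the pattern labels FDRD, FDRI, FIRD, FIRI."""
    failure_part = "FD" if failure_dependent else "FI"
    recovery_part = "RD" if recovery_dependent else "RI"
    return failure_part + recovery_part


# Enumerate all four extreme patterns.
for failure in (True, False):
    for recovery in (True, False):
        print(failure, recovery, dependency_pattern(failure, recovery))
```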


3.3 Multiple crisis analysis 


In the third unit of analysis we studied the learning
process between one crisis and the next. When
learning from incidents or crises is
not systematized, organizations are condemned to
repeat the same mistakes again and again.

Resilient systems learn from their own crises
and also from those of others. One of the characteristics of
resilient systems is their situational awareness,
an attribute that lets them stay alert
enough to detect small changes and early signals, which may
help to prevent a crisis.

The objective of this third step was to understand
how REE learns from its own crises and from
those of others. With that aim we organized a workshop
with the project participants. The session
was structured in two parts.

In the first part we focused on the learning process
and the knowledge the participants had about their own crises.
The exercise had four steps: 1. Identify relevant
REE crises from memory, that is, without documentary
support. The goal was to gauge the participants’
knowledge and awareness of
their most relevant crises. 2. Look up information
about those crises in the internal network. We
wanted to learn how REE documents
its crises and incidents, what kind of information
is reported and the quality of the reports
from a lessons-learned point of view, as well as
how accessible the information was and how
familiar the participants were with such searches. 3. Identify
relevant changes in procedures and plans due
to lessons learned from those crises. The aim was
to see whether the lessons turn into actions that contribute
to improving the resilience of the organization. 4.
Identify indicators for the learning process in REE.

In the second part of the session, the objective was
to analyze how REE learns from the crises that happen
to others. With that aim, we asked the participants to repeat
the four steps performed in the first part of the
session. In addition, we asked them about
their participation in national and international sectoral
networks related to crisis management.

The outcomes resulting from this third unit of 
analysis were: 


1. A list of indicators to monitor the learning process
in REE. The indicators are classified into four
previously identified dimensions of learning:
documentation, analysis, dissemination and
communities of practice. The documentation
dimension assesses how far the organization
reports crises and incidents. In the analysis
dimension we pay attention to the quality of the reports
and to the reflection process and lessons
derived from the incidents. Dissemination refers
to how the organization spreads and shares the
knowledge, lessons learned and good practices
extracted from the crisis analysis. Finally,
the communities of practice dimension studies
how the organization promotes and participates
in communities of practice on resilience topics
within and outside the company, including
international communities (see Table 1).

2. A set of good practices and lessons learned,
extracted from the reports and
from the workshop.

3. A list of barriers to learning identified by the
participants in the workshop, together with a proposal of
actions for improvement.


Table 1. Dimensions and Indicators for the learning process.

Dimensions: Documentation; Analysis; Dissemination; Communities of Practice.

Indicators:

Documented incidents in 1 year/total incidents in 1 year.

Number of lessons learned from incidents in 1 year.

Own incidents analysed in 1 year/total incidents in 1 year.

Others’ incidents analysed in 1 year.

Improvement actions from lessons learned in 1 year.

Number of incident analysis sessions in one year.

Number of lessons learned sessions in one year.

Man hours for incident analysis/total man hours per year.

% of the staff who receive direct communication of lessons learned.

Level of knowledge of the staff about Company’s incidents (0-10).

Quantity of communities of practice within the company, or conversations, related to crisis.

Level of use of internal communities of practice (0-10).

Company’s level of presence in international communities of practice (0-10).
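
Several of the indicators in Table 1 are simple yearly ratios. A minimal sketch of how they could be computed is given below; the dictionary keys and all figures are invented for illustration and are not data from REE.

```python
def ratio(part, total):
    """Return part/total, or None when total is zero (indicator undefined)."""
    return part / total if total else None


# Hypothetical yearly figures, for demonstration only.
year = {
    "incidents": 40,
    "documented": 30,
    "analysed": 10,
    "analysis_hours": 120,
    "total_hours": 48000,
}

indicators = {
    # Documented incidents in 1 year / total incidents in 1 year
    "documented_share": ratio(year["documented"], year["incidents"]),
    # Own incidents analysed in 1 year / total incidents in 1 year
    "analysed_share": ratio(year["analysed"], year["incidents"]),
    # Man hours for incident analysis / total man hours per year
    "analysis_hours_share": ratio(year["analysis_hours"], year["total_hours"]),
}

for name, value in indicators.items():
    print(name, value)
```

Returning None for a zero denominator keeps a year without incidents from being reported as a share of zero.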


4 CONCLUSIONS, LIMITATIONS AND 
FUTURE RESEARCH 


The process described in this paper has been successful
in gathering the information needed to design a
methodology to evaluate the resilience of REE.

Building resilience requires the participation
and commitment of all the stakeholders. It
is therefore very important to involve them from the
early stages of a project. While the three units of analysis
offer a wide view of the crisis cycle and establish
an appropriate framework for crisis analysis, the
use of a combination of different methods to obtain the
data and knowledge has created spaces for both
individual and shared reflection. Furthermore,
the workshops themselves have provided an opportunity
to encourage dialogue and the exchange
of experiences around crisis management and resilience
among the participants involved in the project.

However, the process has also highlighted some
barriers. One of the most important is that information
related to crises is treated as confidential, and
only a few people within the company have access to it.
Related to this, in this case it was not possible
to interview or work with stakeholders outside the
organization. Nevertheless, the proposed steps also
provide for the participation of external stakeholders.

Now that the information has been gathered, the
next step is to analyze and structure it
so that it can be used to assess
the resilience level of REE. Furthermore, the
methodology should provide REE with guidance on
improving its resilience level.
ABSTRACT: Resilience of interdependent infrastructures increasingly depends on collaborative
responses from actors with diverse backgrounds who may not be familiar with cascade effects in areas
beyond their own sector. A simulation-game can enable societal actors to obtain a deeper understanding
of the interdependencies between their infrastructures and their respective crisis responses. Following a
design science approach, a simulation-game has been developed that combines role-playing simulation
and computer simulation. The simulation-game challenges participants to address the interaction between
payment disruptions, food and fuel supply, security problems (riots, robberies) and communication challenges
(preventing hoarding). A number of crucial design choices were handled while developing the
simulation-game. The main design challenges were: How to validate an unthinkable escalation scenario?
How to give the simulation a sufficient level of detail on all aspects while keeping the complexity graspable so
that it can be played instantly? And how much time should each playing round take?


1 INTRODUCTION 


Resilience of critical infrastructures is a complex
problem area. When societal actors with different
backgrounds need to orchestrate a collective
crisis response quickly, a deep understanding of the existing
interdependencies between their respective infrastructures
and crisis response strategies is required.
Gaming-simulation can help to collaboratively
develop a deeper understanding of such interdependencies
and can provide a safe environment in which to explore the
robustness of response strategies from a multi-sectorial
perspective. In the context of tightly interrelated
infrastructures, a response strategy should not only
be beneficial for individual organizations or sectors,
but should also mitigate consequences and limit escalation
from a holistic multi-sectorial perspective.

Building a simulation game involves many design
choices. Depending on which choices are made, consciously
or unconsciously, very different simulations
or simulation-games can be created for studying the
same problem. It is important to build simulation-games
of good quality and to understand how crucial
design choices impact simulation-game design
and simulation-game outcomes. Consequently, the
contribution of this paper is a detailed description
of our simulation-game design. Following a design
science research approach, a simulation-game has
been created that enables actors from a large variety
of critical infrastructures to analyze and mitigate
the cascading effects of payment disruptions on
their respective infrastructures. Besides presenting
and motivating the most important simulation-game
design choices (Section 4), we discuss three major
design challenges that were identified: scenario validation,
game complexity, and the length and number of
playing rounds (Section 5).


2 THEORY BACKGROUND 


Our research builds upon three research areas: 
critical infrastructures, resilience and gaming- 
simulation. 


2.1 Critical infrastructures and cascading effects


Societies rely on well-functioning critical infra- 
structures such as Energy, Information and Com- 
munication Technology, Water Supply, Food 
and Agriculture, Healthcare, Financial Systems, 
Transportation Systems, Public Order and Safety, 
Chemical Industry, Nuclear Industry, Commerce, 
Critical Manufacturing, and so on (Alcaraz & 
Zeadally 2015). When one or more critical infra- 
structures break down or provide only limited 
service, large numbers of citizens, companies or 
government agencies can be severely affected (Boin 
& McConnell 2007, Van Eeten et al. 2011). Breakdowns
can be caused by internal factors (human or
technical failure), external factors (natural catastrophes,
terror attacks) or by failures of other infrastructures,
as there are many dependencies between
critical infrastructures (Van Eeten et al. 2011).
Energy and Information Technology or Telecom- 
munications are well-known event-originating 
infrastructures that generate cascading effects in 
many other infrastructures, as has been shown in 
different types of analyses (Van Eeten et al. 2011, 
Laugé et al. 2015). In times of increasing digitalisation
and an ever more digitally
interconnected society, security experts
argue for more awareness of digital vulnerabilities,
more attention to cyber security and a need
to educate professionals and citizens on these matters
(Hagen 2016).

Ansell et al. (2010) argue that resilience of inter- 
dependent infrastructures increasingly depends on 
collaborative responses from actors with diverse 
backgrounds that may not be familiar with cas- 
cade effects into areas beyond and outside their 
own organisation or sector. Boin & McConnell 
(2007) and Van Eeten et al. (2011) argue that there 
is limited empirical evidence of cascading effects 
across many infrastructures, which makes it hard 
to foresee which interactions may occur across 
sectors. Risk analysis, business continuity management
and crisis management training are often
performed within the context of a single organisation
or sector and seldom address the holistic
analysis of multiple infrastructures (Van Eeten
et al. 2011).

More research is needed to understand collec- 
tive resilience in the context of critical infrastruc- 
ture management. In this study, a contribution is 
made by focusing on one application area, i.e. how 
payment disruptions impact other critical infrastructures.
Despite the long-term efforts of public
and private actors in the Swedish financial sector
to identify, analyse and understand risks and
to develop routines for preventing and mitigating
serious disruptions in the payment system,
there is still a lack of insight into exactly how the
proposed action plans need to be executed
and how numerous other actors in society (e.g. citizens,
food stores, gas stations, voluntary organizations,
governmental agencies and so on) will act in
case of a temporary or complete breakdown of the
payment system. For instance, several key actors in
the payment system have in earlier studies expressed
that they will take on a larger responsibility than their
formal responsibility (MSB-2009-3309 2010), but it
is not clear what this implies and how these organizations
will actually act when a crisis hits.


2.2 Resilience 


Lundberg & Johansson (2015) and Bergström et al.
(2015) note that resilience can refer, among other things,
to: bouncing back to a previous state, or bouncing
forward to a new state, or both; absorbing variety
and preserving functioning, or recovering from damage,
or both; and being proactive and anticipating,
or being reactive (when recovering during and
after events), or both. Given this variety of interpretations,
resilience is hard to operationalize
into measurable indicators (Lundberg &
Johansson 2015).

Lundberg & Johansson (2015) made an effort to
merge and compile different points of view in the
field of disaster and crisis response resilience into
one systemic model, the Systemic Resilience Model
(SyRes). The model starts from the idea that
coping with an unwanted event can be seen as a
downward spiral activating certain basic resilience
functions (anticipation, monitoring, responding,
recovery and learning) and their associated strategies
(where the strategies are the actual manifestation
of the functions, or their ‘form’, which may
differ from system to system). Further, Lundberg &
Johansson (2015) suggest that resilience is needed
to protect core values, i.e. values central to the
existence of the system in focus. In safety-critical
systems, such core values usually take the form
of maintaining safety, such as avoiding harm to
humans or critical infrastructures. For a commercial
business such as a grocery store, a petrol station
or a bank, a core value is typically to generate revenue,
i.e. to ensure that income exceeds expenses. Without
this profit, the business will cease to exist. This
core value will manifest itself in a number of practical
activities, which usually take the form of different
flows such as goods, money, services etc.

In line with the challenges to resilience suggested 
by Johansson & Lundberg (2010) comes the fact 
that most systems in society, such as the payment 
system, depend on several different actors to func- 
tion properly. Therefore, resilience must be con- 
sidered from a systems perspective. In the field of 
resilience, this is sometimes referred to as ‘collec- 
tive resilience’. Weick & Sutcliffe (2007) argue that 
loosely coupled systems relying on a ‘sensemaking’
process generally are more resilient than tightly cou- 
pled systems based on the assumption that all sys- 
tem states can be predicted and safeguarded against 
possible threats. This resembles distinctions made 
in safety science between the paradigms labelled 
Safety I and Safety II (Hollnagel 2013) where Safety 
I is signified by the idea that safety can be designed 
into a system and Safety II is signified by the idea 
that human adaptability is the most important con- 
tributor to success despite inadequate design or 
insufficient predictive capacity of safety engineers. 
Weick & Sutcliffe (2007) argue that a dilemma exists
in sensemaking: you can optimise for analysis or
action, but not both. This dilemma seems to contradict
the requirements of resilience, because
Weick & Sutcliffe argue both for sensitivity to operations
and reluctance to simplify (i.e. an interest in details
and in scrutinizing the situation at hand) and, simultaneously,
for blunt and immediate action without thorough
analysis. The solution suggested by Weick & Sutcliffe
(2007) is that deep knowledge about the system
should have been acquired earlier (long before
the disruption) so that quick and blunt action based
on a deep understanding of the system’s dynamics is
possible in case of disruptions. As several actors may
simultaneously initiate a quick and blunt response,
there is a risk that these responses counteract each other.
Weick & Roberts (1993) discuss how attentiveness
(heedful interrelating) is key in a resilient group
response, i.e. while acting quickly and bluntly, the various
actors should pay close attention to how other actors
respond and to the kind of system behaviour their
collective response leads to. Heedful interrelating has
been demonstrated in small groups. Heedful inter- 
relating becomes challenging when systems become 
larger, more interrelated and involve more and more 
decision makers that do not really know each other 
and do not understand the impact of their decisions 
on nearby systems, as in the case of large interde- 
pendent infrastructure systems (Ansell et al. 2010). 
Then these groups of stakeholders may lack swift 
trust (Weick & Roberts 1993) and may lack a shared 
understanding of the situation and a shared vision, 
which may lead to inferior performance (Berggren 
et al. 2014). Yet another risk might be organisations 
or companies who continue putting their own goals 
ahead of the common good, thus risking initiating 
counterproductive actions that may hamper the 
process of recovery from disruptions. 


2.3 Gaming-simulation


Gaming-simulation is defined as a specific form of
simulation. Simulation in general aims at designing
a model of a system in a complex problem area
in order to be able to experiment with the model.
Deeper insight into the behavior of the system is
created by evaluating various operating strategies
against each other in one or multiple scenarios.
Gaming-simulation differs from other forms of 
simulation in that it incorporates roles to be played 
by participants and game administrators, implying 
that people and their (goal-directed) interactions 
become part of the simulation (Laere et al. 2006). 
In addition to role descriptions and interaction 
formats, simulation-games can also include a phys- 
ical simulation model (a board game, a mock-up, a 
computer simulation, or any other representation 
of a physical reality) which the game participants 
need to interact with. It is important to understand 
that both the changes and impacts of changes to 
the physical simulation model in the simulation- 
game and the interaction between the participants 
(often negotiation processes about what to change 
and how to interpret changes in the physical simu- 
lation model) are part of the simulation-game and 
object of study (Mayer 2009). Gaming-simulation 
is especially relevant when the “how and why” of 
the interaction processes between the participants 
are of interest and when these interactions can- 
not easily be incorporated in computer simulation 
models. In addition, it creates a deeper learning 
opportunity, as simulation-game participants liter- 
ally are active participants in the simulation, rather 
than passive observers of a computer simulation. 

To design a high-quality simulation-game, many
design choices have to be made, which
often are not self-evident, but rather involve tricky
cost-benefit analyses ending in a dilemma (is
the benefit worth the extra cost?). Examples of such
design choices are (Laere 2003, Mayer
2009, Meijer 2009): defining a limited number of
research or learning objectives, defining the number 
and content of roles, defining the scope of the mod- 
elled situation/problem, guaranteeing the validity 
of the simulation, defining rules and constraints, 
defining the load (difficulty), choosing the location/ 
environment where the game will be played, select- 
ing the type of participants to be invited, design of 
qualitative and quantitative data collection during 
the game, degree of realism of the scenario, degree 
of complexity of the game (often phrased as mod- 
elling internal complexity of the system to be mod- 
elled, but creating external simplicity, i.e. an easy to 
understand and easy to play game for the partici- 
pants), degree of competition, degree of dynamics, 
macro cycle (preparation, playing, debriefing, fol- 
low-up), micro-cycle (number of playing rounds) 
and real-time or symbolic-time. 


3 RESEARCH DESIGN 


Our research design is based on an inductive 
research strategy and a qualitative research method. 
A clear theory on how critical infrastructures exactly
are related, and how the many actors involved col- 
laboratively could manage disruptions that create 
cascading effects in many infrastructures, is lacking. 
As such, there is a need for theory building rather 
than theory testing, which leads us to an inductive 
research strategy (Eisenhardt & Graebner 2007). 
From an interpretative perspective, we are inter- 
ested in exploring the many different interpretations 
of actors involved regarding what challenges dis- 
ruptions can pose and how they could be handled 
collaboratively across the affected infrastructures. A 
simulation-game can be a safe environment where 
participating actors can experiment with different 
action alternatives, and through their participation 
and their choice of resilience strategies demonstrate 
the core values they hold. 

For the design of the simulation-game a design
science research strategy is adopted. The result of
design science research is a purposeful artifact created
to address an important organizational problem
(Hevner et al. 2004). In our case, the problem
is “understanding critical infrastructure dependencies
and exploring collective infrastructure resilience
strategies” and the artifact is “a simulation-game
that can serve as a safe analysis, learning and exploration
environment”. As argued in Hevner et al.
(2004), design science is an iterative search method
aiming at identifying a creative solution for the
problem at hand. Given our interpretative stance,
our aim is not to design the best or an optimal simulation-game,
but rather to design one appropriate
simulation-game (amongst many alternatives) and
to develop a deep understanding of what the benefits
and drawbacks of our chosen design are. Design
science addresses relevance through a strong interest in
the societal needs of the application environment
studied, and aims simultaneously at rigor through
reflecting on the design process and arguing how
the produced solution informs the research front
(where the produced artifact and/or the
insights regarding how to design such an artifact
can be research contributions).

A first data collection phase consisted of a document
study of prior incidents (33 reports), 6 interviews
with key representatives from each sector
and two half-day workshops with 26
national and 11 local actors respectively, in order to identify
cascading effects, consequences, actors involved
and potential mitigating actions which they could
perform with regard to payment disruptions (Laere
et al. 2017a). Mapping these characteristics of our
problem environment contributed to the identification
of the elements to be simulated in our simulation-game.
A second data collection phase aimed at
analysing existing simulation-games for critical
infrastructure resilience (Laere et al. 2017b). Here,
six existing simulation-games were analysed in
detail with the purpose of understanding how different
design choices impact the capabilities of the
learning environment and the learning experience
of the participants.

Next, the collected data was analysed and transformed
into elements of the envisioned simulation-game.
During a series of six bi-monthly
full-day workshops with the project team of 10
researchers, different versions of the simulation-game
were created, tested and refined. In between
the workshops the researchers involved worked
in smaller task forces on different elements of the
simulation-game. During the last two full-day workshops,
societal actors from the different sectors
were involved to gather their feedback on the simulation-game
design. The next two sections summarize
the main design choices and main design
challenges that were identified and dealt with
during this design process.


4 GAME DESIGN CHOICES 


4.1 Game overall structure 


When role-playing simulation games and computer
simulations are combined, a powerful simulation
environment is created. Actors, as game participants,
can collaborate or compete with each other
in different rounds, enter their decisions in the 
computer simulation and receive the output of the 
computer simulation as input in their next playing 
round. As such, participants can experience social 
interaction (role playing) and large scale system 
dynamics (impacts of their decisions over time, or 
on a large scale). The participating decision mak- 
ers can compare intended consequences with unin- 
tended and unexpected consequences and create a 
deeper understanding of the system as a whole and 
the behavior of other game participants. 
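
The round structure described above, in which the team's decisions feed the computer simulation and its output becomes the input for the next round, can be sketched as a simple loop. Everything in the sketch is a toy assumption made for illustration: the state variables, the rule that each mitigating action removes 20 percentage points of failed purchases, and the example team policy are not taken from the actual simulation model.

```python
def simulate_round(state, decisions):
    """Toy stand-in for the computer simulation: advance the society one round.

    Assumed rule: failed purchases grow by a fixed escalation each round and
    each mitigating action removes 20 percentage points.
    """
    relief = 20 * len(decisions)
    failed = state["failed_purchases_pct"] + state["escalation_pct"] - relief
    failed = max(0, min(100, failed))  # keep the percentage within 0..100
    return {"failed_purchases_pct": failed,
            "escalation_pct": state["escalation_pct"]}


def play(initial_state, choose_decisions, rounds):
    """Alternate between team decisions (role play) and simulation output."""
    state = initial_state
    history = [state]
    for round_no in range(1, rounds + 1):
        # The team discussion would take place here; its outcome is a
        # list of actions entered into the simulation.
        decisions = choose_decisions(round_no, state)
        state = simulate_round(state, decisions)
        history.append(state)  # output becomes input for the next round
    return history


def team(round_no, state):
    """Hypothetical team policy: act only while the situation looks serious."""
    if state["failed_purchases_pct"] > 30:
        return ["accept_invoice_payments"]
    return []


history = play({"failed_purchases_pct": 50, "escalation_pct": 10}, team, rounds=3)
print([s["failed_purchases_pct"] for s in history])
```

In this toy run the team stops acting once the indicator reaches its threshold and the situation worsens again in the following round, a small instance of the unintended consequences that participants are meant to discover.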

The main purpose of the simulation-game is 
to create a deeper understanding of the dynamics 
and interdependencies in the overall system. Alter- 
natively or additionally, collaboration between the 
different actors involved could be a learning goal. 
When collaboration is a learning goal, actors may 
be placed in different rooms and different actors 
may have different information at hand. In such 
games sharing the right information with the right 
actor at the right time might be in focus. In our
design it became clear quite early that grasping the
complexity of the overall societal system (i.e. all
sectors that are impacted by payment disruptions)
and their interactions is a challenge in itself. It
was decided that grasping this complexity created
sufficient load and that additional collaboration
challenges would jeopardize the main objective of
understanding overall system dynamics. Therefore
it was decided that the players, who can each represent
different societal roles (i.e. food sector, fuel
sector, media etc.), would be placed in one team
that would collaboratively try to manage payment
disruptions.

Putting the participants in one team makes the
use of the simulation-game flexible. Teams could consist
of 3, 5, 7, 9 or 11 participants interacting
as one team with the computer simulation. From a
learning perspective it is preferable to have a larger
group with a strong diversity in backgrounds, but
from an execution perspective it is a benefit that
a simulation-game session can still be performed
even if two of seven participants do not
show up.

The team interacts with a fictive society represented
in the computer simulation. The computer
simulation is created with AnyLogic simulation
software. The main reason for choosing this software
package is that it makes it possible to combine agent-based
simulation, discrete event simulation and system
dynamics simulation, which gives us a certain flexibility
to implement different scenarios. The computer
simulation covers a typical region with some
cities and some countryside, where relevant societal
infrastructures can be distinguished (see 4.2).
The overall idea is that payment disruptions occur
(see 4.3) in this fictive society and that the team
can try out different combinations of action strategies
(see 4.4) to learn how they differ in impact
on a number of performance criteria (see 4.5). An 
important characteristic of the simulation-game is 
that the participating teams can re-play the same 
scenario over and over again (see 4.6). By keeping 
the scenario conditions constant the participants 
can really compare their chosen action strategies 
and experience and learn how different combina- 
tions of actions give different impacts. 
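
Keeping the scenario conditions constant across replays amounts, in a computer simulation with random elements, to fixing the random seed. The sketch below shows the idea with Python's standard `random` module; it is not AnyLogic code, and the outage model, the seed value and the loss figures are all invented for illustration.

```python
import random


def run_scenario(lost_sales_per_outage_hour, seed=2018, days=5):
    """Replay one scenario; the same seed yields the same sequence of outages."""
    rng = random.Random(seed)  # private generator, so replays are repeatable
    total_lost_sales = 0
    for _ in range(days):
        outage_hours = rng.randint(0, 12)  # hypothetical daily card outage
        total_lost_sales += outage_hours * lost_sales_per_outage_hour
    return total_lost_sales


# Two hypothetical strategies that differ only in how costly an outage hour is.
baseline = run_scenario(lost_sales_per_outage_hour=10)
with_cash_fallback = run_scenario(lost_sales_per_outage_hour=4)
print(baseline, with_cash_fallback)
```

Because both runs face identical outages, any difference between the two totals is attributable to the strategy and not to chance.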

During the design process we have alternated 
between versions that could be played at a dis- 
tance, or at one physical location. Playing at a 
distance allows for more elaboration time between 
playing rounds which might be beneficial for learn- 
ing (i.e. making more thoughtful choices). While 
keeping the alternative of playing at a distance as a 
potential future development, our current impres- 
sion is that the intense discussion and interaction 
between the participants in the team are of major 
importance (as the learning and creation of deeper 
insight occurs exactly there). Therefore physical 
presence at one location is to be preferred. 


4.2 Sectors represented in the computer 
simulation 


From the document studies and workshops with
societal actors (Laere et al. 2017a) a number of
societal actors, sectors and processes have been
selected that are particularly vulnerable to payment
disruptions and therefore form the core of
the computer simulation of the fictive society in the
simulation-game.

The fictive society consists of a number of
grocery stores of varying size, a number of fuel
stations and a number of pharmacies (where medicine
can be bought). For each store a customer flow is
created. The number of customers, their demands,
and the number of stores are balanced based on
statistics for typical regions in Sweden. Stores offer
one or several of the following payment options:
card payment, cash payment, digital phone payment
and delayed invoice payment. Individual
customers also have one or more different payment
options available. When customers collect goods
in the store, the store’s payment options and their
payment preferences need to match to create a
transaction. Payment transactions are performed
and accredited by the actors from the finance
sector (i.e. credit card companies and/or banks) and
lead to account changes for stores and customers.
When goods are sold, new goods are ordered and
delivered by transport companies. Customers and
transport companies consume fuel, which in turn
requires financial transactions when they buy new
fuel. ATMs are available for those customers and
transport companies who want to acquire cash, and
ATMs are refilled by certain transport companies.
Security guards are present at the larger stores, and
more can be hired when needed. Different media
actors are represented who can spread news, which
in turn can influence consumer behavior.
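
The matching rule between store and customer payment options can be illustrated with a minimal Python sketch. This is not the AnyLogic model used in the project; the class names, the starting stock and balances, the `disrupted` parameter and the alphabetical tie-break are all assumptions made for illustration.

```python
from dataclasses import dataclass

# The four payment options named in the text.
CARD, CASH, PHONE, INVOICE = "card", "cash", "phone", "invoice"


@dataclass
class Store:
    name: str
    payment_options: set   # options this store offers
    stock: int = 100       # hypothetical starting stock (units of goods)
    account: float = 0.0


@dataclass
class Customer:
    name: str
    payment_options: set   # options this customer has available
    account: float = 1000.0


def try_purchase(store, customer, price, disrupted=frozenset()):
    """Return the payment option used, or None if no transaction happens.

    A transaction requires goods in stock and at least one payment option
    that the store offers, the customer has, and is not disrupted.
    """
    usable = (store.payment_options & customer.payment_options) - set(disrupted)
    if store.stock <= 0 or not usable:
        return None  # this customer leaves disappointed
    option = sorted(usable)[0]  # assumed tie-break: alphabetical order
    store.stock -= 1
    store.account += price
    customer.account -= price
    return option


store = Store("grocery", {CARD, CASH})
alice = Customer("alice", {CARD})
bob = Customer("bob", {CARD, CASH})

print(try_purchase(store, alice, 50.0))                    # normal day
print(try_purchase(store, alice, 50.0, disrupted={CARD}))  # card outage
print(try_purchase(store, bob, 50.0, disrupted={CARD}))    # falls back to cash
```

In this sketch a card outage leaves a card-only customer without a purchase, while a customer who also carries cash can still buy at a store that accepts it.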

In our current implementation the logic is rather
rough. The purpose of the development so far has
been to quickly arrive at an implementation that
can be played with actual representatives of
different critical infrastructure managers. Based on their
feedback in early playing sessions, the
simulation-game will be further refined. Our aim is to perform
30 playing sessions in 2018 and 2019 and gradually
improve the design science artifact under study.


4.3 Payment disruption scenario


Thus far one main scenario has been developed
and implemented. During the course of our
project (2016-2021) two additional scenarios will
be created. Our current scenario is a 10-day card
payment disruption at the store level. The other
scenarios will be developed in such a way that
they affect other parts of the payment system (e.g.
disruptions in the transfer of money between
accounts, or a long-term scenario that covers
multiple years rather than only a few days).

The current 10-day card payment scenario is
based on the fact that 90% of transactions in stores
in Sweden are made by card, which makes
Swedish society extremely dependent on that
payment option, as the other alternatives are not
capable of instantly handling such large volumes of
transactions. Although the scenario is much more
detailed than presented here, the main elements of
the scenario are as follows.


Day 1: Card payment disappears as a payment
option. Most actors expect the disruption to
last only a few hours. Stores close or offer digital
phone payments or cash payments as alternatives.
Chaotic scenes among disappointed
customers. Queues at stores and at ATMs.

Day 2-3: Banks and media announce that the
disruption will last several days. Customers
are confused about where they can buy goods. Sales
drop dramatically, use of cash and digital payments
increases dramatically, some customers start
hoarding, and deliveries and logistics to stores are in
disarray as major fluctuations occur. The large amount
of cash in stores and in society at large increases
robbery risks.

Day 4-5: Cash and digital payment options
collapse as well, as they cannot cope with the large
volumes. Long queues; customers who are running
out of goods at home become angry and
aggressive; many stores close, and those
that remain open experience massive hoarding.
Perishable goods need to be thrown away as they
cannot be sold. Logistics problems increase.

Day 6-7: The government, in collaboration with stores,
introduces a general “buy based on your identity
and pay later by invoice” option. Massive hoarding
when stores open. Logistics collapse again,
as it is hard to adjust from a total sales stop
to massive hoarding.

Day 8-10: The general “buy based on your identity
and pay later by invoice” option is too complicated
and time-consuming, which creates enormous
queues, frustration and aggression. Chaos
and panic in more and more places. Police and
army guard the few stores that remain open.


The cascading effects that occur are not
hard-coded in the implementation; they arise
as a result of the initial card payment disruption.
All effects other than the initial disruption can be
influenced when other actions are chosen by the players.


4.4 Action alternatives to mitigate disruptions 


The team that plays the simulation-game in
several rounds can select one or more of the following
actions. Besides these alternatives, which are given
(and prepared), we are open to creative ideas from
the participants. When they come up with a
suggestion for an unforeseen action, the game facilitators
will try to simulate that action and its presumed
impacts instantly in the simulation if possible.


Possible actions that the team can select are, for
example (note that each action can be implemented
on any day in the scenario): offer more/fewer payment
options at all or some stores; close or open stores;
increase/decrease deliveries to stores; communicate
information or instructions to customers; offer
cash withdrawal in stores; limit the amount of
goods per purchase; increase/decrease the number
of security guards for one or several stores; throw
away perishable goods; give away perishable goods
for free.

The design of the computer simulation involves
an implementation of the impacts of each and every
action, based on interviews and discussions with
key representatives from the different societal
processes simulated. Even though we as designers know the
approximate impact of each individual action, the
playing sessions need to reveal how the different actions
play out in combination. In addition, actions can
be implemented at different moments in time
(day one to ten in the scenario), which makes the
number of alternative strategies nearly infinite.
Rather than experimenting with the computer
simulation as such ourselves, the whole idea of
involving real societal actors in role-playing is to let
their expertise and value frames guide the selection
and time-planning of combinations of actions.
Moreover, not only the selection of actions as such
is of interest, but also the motivation and reasoning
behind it. Therefore, the teams who play need to
motivate the timing and selection of actions before
they are implemented in the various playing rounds,
and the collection of these motivations is seen as a
crucial element of the simulation-game.
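
One way to picture an action strategy is as a list of time-stamped actions that each carry the team's motivation. The Python sketch below is only an illustration of that idea; the `PlannedAction` record, its field names and the target labels are hypothetical and not taken from the actual simulation-game.

```python
from dataclasses import dataclass

SCENARIO_DAYS = range(1, 11)  # the 10-day scenario described in 4.3


@dataclass(frozen=True)
class PlannedAction:
    day: int          # scenario day on which the action is implemented
    action: str       # e.g. "offer cash withdrawal in stores"
    target: str       # e.g. "all stores" (hypothetical label)
    motivation: str   # the team's stated reasoning, recorded before execution

    def __post_init__(self):
        if self.day not in SCENARIO_DAYS:
            raise ValueError("day must lie within the 10-day scenario")
        if not self.motivation.strip():
            raise ValueError("an action needs a motivation before it is played")


def actions_on(strategy, day):
    """Return the actions of a strategy that are scheduled for a given day."""
    return [a for a in strategy if a.day == day]


strategy = [
    PlannedAction(1, "communicate information to customers", "all stores",
                  "reduce confusion about where purchases are possible"),
    PlannedAction(1, "offer cash withdrawal in stores", "large stores",
                  "relieve queues at ATMs"),
    PlannedAction(4, "limit the amount of goods per purchase", "all stores",
                  "dampen hoarding"),
]
print(len(actions_on(strategy, 1)))
```

Requiring a non-empty motivation at construction time mirrors the rule that teams must motivate an action before it is implemented in a playing round.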


4.5 Performance metrics 


Extensive discussions have been held at several
of our design workshops and in intermediate
work group meetings about which indicators
are most relevant and appropriate to visualize
performance in the various sectors of society.
Currently, three major performance areas have
emerged: 1) payment options, 2) goods flows, and 3)
security.

Indicators for payment options are statistics on the
actual use of each of the four payment options
over time, or the number of stores (in % of total
stores) where each option is available.

For goods flows the main indicator is
“disappointed customers” over time (the simulation
counts the number of arriving customers that
cannot fulfil their purchase for any reason).
Additionally it is shown how many stores are currently
closed (in %), which groups of goods are currently
out of stock, how many perishable goods are
destroyed over time, and how many planned
deliveries fail (due to fuel shortages).

Security-related indicators are the amount of cash
in stores (implying increased robbery risk), the number
of shoplifting incidents, and the number of
security guards per store.
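
As an illustration of how such indicators could be computed from the state of a simulation, consider the following Python sketch. The function names and the snapshot data are hypothetical; the actual simulation may define and aggregate the indicators differently.

```python
def availability_pct(store_options, option):
    """Share of stores (in % of total stores) where a payment option is offered."""
    offering = sum(1 for options in store_options.values() if option in options)
    return 100.0 * offering / len(store_options)


def disappointed_customers(purchase_completed):
    """Number of arriving customers who could not complete their purchase."""
    return sum(1 for completed in purchase_completed if not completed)


def closed_pct(store_is_open):
    """Share of stores (in %) that are currently closed."""
    closed = sum(1 for is_open in store_is_open.values() if not is_open)
    return 100.0 * closed / len(store_is_open)


# Hypothetical snapshot of four stores on one simulated day.
store_options = {"A": {"card", "cash"}, "B": {"cash"}, "C": {"card"}, "D": set()}
store_is_open = {"A": True, "B": True, "C": True, "D": False}
purchase_completed = [True, False, False, True, True]

print(availability_pct(store_options, "cash"))     # 2 of 4 stores offer cash
print(disappointed_customers(purchase_completed))  # customers without a purchase
print(closed_pct(store_is_open))                   # 1 of 4 stores is closed
```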

A performance area which has been suggested
but has been hard to implement thus far is “trust”.
Although trust is a core value in society, there are
different kinds of trust (trust that you can obtain
certain goods, trust in banks and stores, trust that
you will be safe when being out in society). Our
current interpretation is that trust depends on the
other indicators and that it thus might be sufficient
to only model them.


4.6 Replay-ability 


After a short introduction to the learning goals,
the computer simulation environment, the start
scenario, and the way the team can choose
actions to influence the scenario, the team can play
as many rounds as it chooses. When the start
scenario is introduced, the simulation is paused at day
1, day 3, day 6 and day 10 to show how the
performance measures slowly deteriorate.

When the team later plays itself and chooses
actions, the simulation-game is initially paused
at the same moments to be able to compare the
new performance statistics with the earlier ones.
Typically, it takes 10 to 20 minutes to discuss and
decide on actions, so it takes 1 to 1½ hours to play the full
scenario once. Our expectation is that teams might
succeed in playing 3 rounds in a half-day (leaving
time to sum up and debrief the whole playing
session) and maybe 6-8 rounds in a full day (where
the expectation is that playing speed can slowly be
increased when the team plays more rounds as they
get familiar with the simulation-game).


5 GAME DESIGN CHALLENGES 


After some iterations and refinements, most design
choices have evolved into more permanent ones,
and the motivation for why each choice was
important has gradually become clearer.
Three design issues have been particularly
challenging and are therefore interesting to highlight
as potential areas for future research.


5.1 Validation 


How to validate an unthinkable crisis
escalation scenario? Many of the interactions that are
simulated in the computer simulation are based
on loosely related incidents and on the expectations of
experts we have interviewed. It is however hard to
translate observed effects of loosely related cases
or to judge the imaginative power of the experts. There
might be certain interactions that are hard to
imagine and which are not correctly represented in
our current simulation. Normally, when building
a simulation of an existing system, there is some
kind of real data to validate against. As the
purpose of crisis scenarios is to be far from the current
equilibrium state, it is hard to foresee or imagine
what the relevant (new) elements and (new)
interactions and dependencies are. An interesting future
research area is therefore to develop methods and
tools to improve the validation of crisis scenarios
and simulations.


5.2 Fidelity and playability 


A major concern in our current design is that
players can easily get stuck in details. Multiplying
25 stores and several other actors by 3 decision
points in time and roughly 15 different types of
actions that each individual store can pick at each
point in time results in over 1000 potential actions,
which can be combined in a practically unlimited
number of variations. Even
though our simulation is a strong simplification
of the actual complexity of our society, players
might easily get lost here. It has become particularly
clear that players can easily zoom in on individual
decisions in individual stores and lose the “overall
society helicopter view”. This is a typical risk of
introducing a detailed computer simulation in the
role-playing simulation.
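
The size of this decision space can be checked with a short calculation. The Python sketch below counts stores only (the other actors mentioned in the text are left out) and assumes, purely for illustration, that every potential action can be taken or skipped independently.

```python
STORES = 25          # stores in the current design (other actors are ignored here)
DECISION_POINTS = 3  # decision points in time per playing round
ACTION_TYPES = 15    # rough number of action types per store

# One "potential action" = one action type, at one store, at one decision point.
potential_actions = STORES * DECISION_POINTS * ACTION_TYPES
print(potential_actions)

# If every potential action could independently be taken or skipped, the number
# of distinct strategies would be 2 ** potential_actions. This ignores mutually
# exclusive actions (e.g. opening and closing the same store), so it overstates
# the store-only count, but it shows the order of magnitude.
strategies = 2 ** potential_actions
print(len(str(strategies)), "decimal digits")
```

The count is finite, but far too large for a team to explore exhaustively, which is why the selection is left to the players' expertise.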

In our current design discussions, different
options are explored to handle this issue. One is
to develop facilitator strategies to keep the playing
teams on track (while keeping the fine granularity
of the computer simulation interface). Another
is simplifying the computer simulation interface
(i.e. limiting the number or granularity of actions
to be taken and decreasing the number of
performance statistics). The latter carries the danger that the
simulation becomes too abstract, so that the lessons
learned in the simulation-game can no longer be
transferred to, and lose their value in, real society.


5.3 Time per playing round and number of rounds


A closely related concern is the number of playing
rounds and the time per playing round for
discussion in the team. More rounds are preferable, but
they should not become so short that players stop
discussing their motivations and just guess. On the
other hand, teams might get stuck in endless
discussions about which actions to choose without ever
implementing them in the computer simulation.
Here, experienced facilitators are currently
seen as the most viable way to address this challenge.
An alternative could be to allow for playing
the simulation independently at a distance after
participating in the first facilitated team session.


6 DISCUSSION AND CONCLUSION


During the last two full-day workshops where the
latest version of the simulation-game was tested, it
was concluded (by designers and potential
players, i.e. representatives from societal sectors) that
the current design can potentially increase insight
into collective critical infrastructure resilience. The
main challenge is to make sure that the team that
plays the game does not get stuck in details (due
to complexity) and that the game facilitation is of
such quality that a reasonable playing speed and
number of playing rounds is achieved in a session,
while at the same time team players feel that they
have sufficient time in each playing round to arrive
at thoughtful and well-motivated action packages.

Researchers and practitioners can benefit from 
an increased insight into the challenges of design- 
ing simulation-games for critical infrastructure 
resilience analysis and training, as documented in 
this paper. Combining the insights from our design 
process with insights from alternative applications 
and approaches can increase the quality of our 
designs and thereby subsequently improve overall 
critical infrastructure resilience in society.


ABSTRACT: September 2015 saw a sharp increase in the influx of refugees in the Øresund region. In
this study, resilience defined as flexible adaptation was taken as a baseline to guide interviews with societal
infrastructure actors and NGOs engaged in managing the situation. Different actors had different
organisational preconditions that influenced their ability to adapt to the new situation. Among the strongest
drivers behind resilient performance were the organisation’s ways of relating to established rules,
regulations, procedures and processes, the way relationships were formed between people and hierarchical layers
within the organisations, and the perceived value of the human operator and the human contribution
within the organisational whole. These values, in turn, determined how the organisations shaped many of
the basic conditions that allowed resilient performance to develop. In the study it was found, for public
actors in particular, that the criteria necessary to adapt to the situation were not met by organisational
structures and processes.


1 INTRODUCTION 


In the Øresund region, which consists of Denmark
and the southernmost province of Sweden (Skåne),
an increase in immigration was noticed during the
spring and summer of 2015, but fluctuations were
still within the normal range. However, in the
beginning of September, the number of refugees reached
unexpectedly high levels in just a few days, rising
to the highest levels since the Second World War.
In October the number of asylum seekers doubled
compared to the month before, and in November,
with the argument that the large number of
refugees was threatening national safety and straining
critical infrastructure functions to an
unacceptable level (SOU 2017:12), the Swedish government
decided to initiate border controls, which continued
in steps during the rest of the year.

Refugees travelled to or through Denmark,
Sweden, Norway and Finland, some reaching
the Nordic countries by boat from Germany, but
most travelling over the Øresund Bridge, arriving
in Malmö, which is the third-largest city in
Sweden. Even though structures for the reception of
refugees existed, the volume and rapid increase of
refugees came as a surprise to most of the
organisations involved. The situation put massive stress
on infrastructure and vital societal functions,
especially in the southern part of Sweden, and some
organisations went into formal crisis management.

This article, which represents a limited part of a
larger study, focuses on drivers and barriers for
resilient performance within a group of organisations
involved in immigration management in Malmö.
The aim of this article is to identify areas of future
organisational research and development that could
contribute to better support for such performance.


2 STUDY OF CRITICAL
INFRASTRUCTURE RESILIENCE IN
THE ØRESUND REGION


This case study on critical infrastructure
resilience was performed within the EU Horizon 2020
project IMPROVER. In the IMPROVER project,
the concept “critical infrastructure” is defined in
the following way:

Critical Infrastructure is an asset, system or part 
thereof located in Member States which is essen- 
tial for the maintenance of vital societal functions, 
health, safety, security, economic or social well-being 
of people, and the disruption or destruction of which 
would have a significant impact in a Member State 
as a result of the failure to maintain those functions. 

Resilience is a popular concept that is
interpreted and applied in different ways (Bergström
& Dekker, 2014; Woods, 2015). In this study,
resilience defined as flexible adaptation was taken as
a baseline. When studying resilience, researchers
should strive to understand the factors that allow
organisations to uphold their functionality despite
changes in context, time constraints and workload,
i.e. adaptability, and how adaptability may manifest
itself and evolve over time. This stance is
exemplified by Woods (2015), who says that the search
for understanding rebound in past events should
focus on understanding how the system changed
to manage the new circumstances. This notion
was used as the starting point for this case study.
The rapidly increasing number of refugees within
a short period of time meant that many
organisations had to change their practices, particularly the
public institutions on which society relies. In this
study, the narratives provided by these actors were
examined for signs of resilient performance.


2.1 Research method 


A literature review of organisational aspects of 
resilience was conducted. In addition, several 
organisations were interviewed: DSB Trains (Dan- 
ish national railway operator), the Danish police 
force, the Øresund bridge consortium (Danish and 
Swedish sides), the Swedish Migration Agency, 
Malmö Municipality, the Swedish Civil Contingen- 
cies Agency, the Swedish Armed Forces, Jernhusen 
(operator of Malmö central station), the Red Cross 
(Danish and Swedish) and Kontrapunkt (a Swed- 
ish autonomous NGO). In total, 28 semi-struc- 
tured interviews were performed. 

Since aspects of resilience are not operationalised in
most organisations today, informants could not be
asked direct questions about resilience. Therefore
a thematic analysis (Bowen, 2009) of the entire
empirical and theoretical body was suitable, since
such an analysis is directed towards the empirical
data as a whole, not only towards
predetermined rules. Reflections on the empirical
interview material were based on Alvesson’s (2011)
reflection approach and framework, where
traditional notions of the interview have been
complemented with eight metaphors of a social,
psychological and linguistic nature.

An overarching goal of the IMPROVER project
is to create an indicator hierarchy for resilience in
critical infrastructure, but from a qualitative
ethnographic perspective this could be problematic.
Describing the performance of a complex system
in terms of measurements may not provide the
most useful information, because such system
interactions typically require qualitative
descriptions. The use of indicators is a logical,
non-contextual exercise, while real decision-making
in everyday work or crisis is a complex analytical
process (Salmon et al., 2014). Because of these
difficulties, the abstraction hierarchy framework from
Rasmussen (1985) was used as an inspiration when
describing the identified themes from the thematic
analysis in terms of indicators (or “things to look
for in story-telling about past events in
organisations”). The Rasmussen framework was originally
intended for the design of technical decision
support in complex systems, typically an automated
system and a human operator combined. In this
study we applied the framework to an
organisational structure, where artefacts are procedural
rather than technical.


2.1.1 Result of thematic analysis 

The thematic analysis of the entire collection of 
empirical material resulted in four overarching 
themes. These themes represent areas where key 
examples of organisational resilient performance 
were identified. 


1. Design of roles, tasks and processes
2. Artefact design: procedures and tools
3. Strengthening collaboration
4. Learning and re-design


Beyond these four areas, a fifth theme was iden- 
tified which is referred to in the study as “Under- 
lying values and interpretations”, representing the 
way in which the other themes are interpreted. In 
the view of the authors, striving for resilience is 
not only about knowing what organisational abili- 
ties to enhance, but also about the way in which 
such abilities are sought—organisational values 
that may allow or deny developments that enable 
resilient performance. This article focuses on how 
design could support resilient performance. 


2.2 Focus of this article 


While the engineering community has spent many 
years looking for ways to measure and increase the 
reliability of isolated components in critical infra- 
structure, not nearly as much attention has been 
given to interactions within socio-technical systems. 
Paradoxically, this is precisely where the causes 
of many of the great disasters of our time can be 
found (Woods, Leveson, & Hollnagel, 2012). As 
time passed after the initiation of this study, offi- 
cial crisis evaluation reports were published (e.g. 
RiR 2017:4; SOU 2017:12) and measures were 
established. These measures have largely consisted 
of more administrative routines, plans and control 
processes, with little respect for the fact that simi- 
lar structures caused problems when public organi- 
sations were faced with unexpected and rapidly 
evolving events. This article focuses on how design 
aspects within the five identified themes could equip 
these organisations with better adaptive capacities. 


3 SIGNS OF RESILIENCE IN REFUGEE 
RECEPTION IN THE ØRESUND 
REGION 


The results of this study reflect the notion in
systems-oriented safety research that because of
system complexity, work is always to some extent
under-specified, and that humans in a
socio-technical system should be seen as a unique asset
instead of an unreliable and risky system
component (Cook, 1998). Our study exposed several
examples of resilient performance, i.e. adaptation
to evolving circumstances and needs, both on a
societal level and within each organisation.


3.1 First response at Malmö central station 


The security guards of Jernhusen, the
organisation that manages Malmö central station, made
the first observations of the increasing number of
refugees at the station. Jernhusen informants said
that they normally respond quickly to different cir- 
cumstances, which explained their fast response in 
this situation. Jernhusen is a relatively flat organi- 
sation with few formal procedures and they have 
short paths of communication between hierarchi- 
cal layers. Informants said decision-makers have 
a tradition of using the information provided by 
operative guards to build a picture of what hap- 
pens at the station and to determine their response. 
Jernhusen functioned as a central hub during the 
entire autumn, organising space and functions 
within the station’s walls and donating conference 
rooms where the involved organisations could meet 
twice a week. At an early stage, Jernhusen had to 
fight hard to get the attention of public agencies, 
and to get representatives of those agencies to visit 
the station and make their own assessment. 


3.2 Adaptations within public organisations 


The most affected public agencies were the
Swedish Migration Agency and Malmö Municipality.
In addition to the more obvious long-term
responsibility for the asylum process, the Swedish
Migration Agency was also responsible for the
short-term housing of adults and families, while
Malmö Municipality was responsible for the
housing of unaccompanied children. In the operative
functions of these organisations, typically on or
near the accommodation sites, the employees
rapidly adjusted to the new circumstances.

As the workload quickly increased it was no
longer possible to use the normal routines, which
were deemed too time- and resource-consuming.
Personnel at the accommodation sites understood
that if they were to follow normal procedures, they
would fail to meet the overall purpose of housing
people in need. Instead they started to change the
way they worked. Electronic registration forms
were replaced with whiteboards, giving an overview
of all the persons living there, their health status,
their asylum process status and other important
data. The formal way to procure materials and
tools was too time-consuming and had to be set
aside. Instead, managers themselves went to IKEA
to pick up necessities like mattresses and
kitchenware. These adaptations were crucial to fulfil the
purpose of putting a roof over the head of every
newly arrived refugee. In all interviewed
organisations, normal operations were abandoned in
favour of new solutions, often based on the
working experience and creativity of the personnel, or
as Rasmussen (1985) puts it, familiarity with the
system’s value structures. There is never just one
representation of how to operate, especially
when moving towards the higher purpose in the
abstraction hierarchy. The purpose can always be
met in different ways.


3.3 Focus on routines in public organisations 


Even though operative adaptations within Malmö
Municipality and the Swedish Migration Agency were
necessary to handle the situation, these
adaptations were not always facilitated by existing
organisational structures or even condoned by
management. Informants said that the
abandonment of overly rigid routines was to some extent
contested by management and that operative achievements
were never fully acknowledged afterwards. Since
the activities of operative personnel were not
officially sanctioned, they were put in a sensitive
situation in the case of negative outcomes.

Informants from the operative parts of pub- 
lic agencies said that during the whole event, no 
representatives of upper leadership came to the 
sites to form their own view of the situation. In 
the non-operative parts of these public organisations,
there seems to have been a more widespread
belief that the situation could be solved through
ordinary processes, and this seems to have resulted 
in a lack of practical support for operations. It is 
known from safety research that when an outside 
observer attempts to describe the work of others, 
those descriptions tend to be more simplified and 
linear than actual work (Woods et al., 2012). In 
many organisations, decisions about procedures, 
plans and routines are made on a management 
level. If the gap between management and opera- 
tions is too large, the assumptions guiding deci- 
sions about organisational structures may not 
reflect actual work needs, as seen in Malmö among 
the public actors. 


3.4 Internal organisational dynamics 


Judging from many examples described by
informants, a key factor behind good internal collaboration
and the ability to meet goals was a tight interface or
likeness between those who detected and interpreted
early signals of change and those who decided about
actions. Signals of important changes are often
subtle and manifest in the course of operations, and if
such information is not heeded, the organisation’s
response may be delayed, which was the case for
public agencies. Jernhusen on the other hand, where
reactions were quick, generally depends on the input
from its guards on the floor for decision-making.
For Jernhusen surprises are a normal part of operations,
which means that reactions to these surprises have to
be of the same flexible nature. In public organisations
surprises are not desirable, which could be coupled
to the demand on their processes for legal certainty.
Paradoxically, in this case the focus on regular
procedure combined with reluctance towards adaptations
resulted in delays and problems in the process of
fulfilling the basic needs of arriving refugees.


3.5 Complementing NGO activities


While public agencies were in some cases tied down
by regulations, the results of this study showed
that NGOs could sometimes fill the gaps created
by slow official response. A large group of refugees
arriving in Sweden were transit refugees,
attempting to pass through Sweden on their way to Norway
or Finland. In Swedish legislation, however, there
is no such thing as a transit refugee. Any refugee
arriving in Sweden has to seek asylum there, and
the policies of the migration agency made it
impossible for established NGOs to help these people.
The Swedish Red Cross had a permanent barrack
outside Malmö central station, on the square
Posthusplatsen, which functioned as a welcome center
and housed numerous different organisations. The
organisations at Posthusplatsen had agreed to
follow Swedish legislation, meaning that they would
only help people who had the intention to seek
asylum in Sweden. Instead, autonomous
organisations stepped in to help the transit refugees. As
one example, the cultural association Kontrapunkt
arranged housing for 100-500 persons per night.


4 DISCUSSION 


The analysis revealed a number of drivers for resil- 
ient performance within the studied organisations, 
such as different ways of relating to established 
rules, regulations, procedures and processes, the 
way relationships were formed between people and 
hierarchical layers within the organisations, and the 
perceived value of the human operator and of the 
human contribution to the organisational whole. 
These values, in turn, affect how the organisations 
shape many of the basic conditions that allow—or 
obstruct—resilient performance. 

Different actors involved in the response to the
2015 situation had different formal and informal
organisational prerequisites that influenced their
ability to adapt to the situation. This section deals
with a number of such prerequisites, such as tasks,
roles, working environments, supporting tools and
organisational structures.


4.1 Heuristics for adaptation in organisational 
goals and values 


In the light of the present analysis, a likely chal- 
lenge for public organisations will be to find ways 
of detecting when established rules and procedures 
are no longer appropriate and to find ways of 
adjusting them based on information from emerg- 
ing events. In several of the organisations that were 
able to make successful adaptations, adjustments of 
normal procedures were made in relation to clearly 
defined and deeply rooted core goals. One way of 
approaching the issue of how to guide adaptations 
may be to look for such core goals or core functions 
which are central to the organisation and necessary 
for operations under any circumstances. Organisa- 
tional goals may however have considerable room 
for interpretation. Because of that, it might be 
equally important to examine how organisations 
engage in dialogue around goals and values and 
who is allowed to participate when abstract goals 
are interpreted in terms of strategies and action. 


4.2 Adaptation supported by shared perceptions 


Examples of operative adaptations within the case 
could serve as an inspiration to more permanent 
and widespread design approaches within public 
organisations. Today, management typically has the 
final decision about artefacts such as plans, proc- 
esses, procedures and work roles. For public actors 
a gap was observed between real work challenges 
and higher management’s understanding of the 
circumstances and issues of operations. If organisa- 
tions were to adapt supporting artefacts more con- 
sciously to operative needs, a first step may be to try 
to increase management knowledge about opera- 
tive conditions, the real-life issues and difficulties 
faced by operators, so that solutions implemented 
by management do not run the risk of undermin- 
ing operations. In terms of more profound changes 
to organisational practices, it could be rewarding to 
explore the implementation of design processes cen- 
tred on system and user needs and to give employ- 
ees a more active role in the constant evolution of 
artefacts. This could be an opportunity to bring the 
fields of safety science, organisational science and 
design science closer together. 


4.3 Lessons learned as re-design 


All of the interviewed organisations involved in 
the 2015 reception of refugees in the Öresund
region give examples of organisational lessons
learned and adjustments that have been made with 
respect to these experiences. On the other hand, 
and for public organisations in particular, a large 
portion of learning and re-design seems to have 
been oriented towards formal organisational arte- 
facts such as plans and procedures. For example, 
one of the main interventions has been to extend 
the practice of Risk and Vulnerability Analysis to 
the Swedish Migration Agency. These analyses are 
a national requirement for specific public agencies 
and are coordinated by the Swedish Civil Con- 
tingencies Agency (MSBFS 2016:7). As noted in 
the analysis, this could be interpreted as a reflec- 
tion of the very focus on compliance and official 
doctrine that undermined adaptation during the 
crisis. From the perspective of systems oriented 
safety research, strict routines and procedures 
are in themselves no guarantees for safety and 
efficiency (Dekker, 2001). Yet another scenario is 
added to risk analyses, only to find that the next 
large disturbance has unexpected or unique quali- 
ties. The calibration of routines and procedures, 
or the addition of new documentation or similar 
symbolic barriers is a common response to nega- 
tive events within organisations (Hollnagel, 2008). 
Here it must be acknowledged that procedures will 
never cover every possible scenario and may even 
limit the creative problem-solving abilities of pro- 
fessionals (Dekker, 2003). Every added barrier or 
new procedure will also increase system complex- 
ity, thus possibly increasing the demands on the 
people controlling the process, with potential nega- 
tive effects on their performance (Praino & Sharit, 
2016). On the other hand, procedures may provide 
stability and common ground, particularly in crisis 
response that involves a diverse set of actors. Just 
as with any organisational artefact, procedures can 
be consciously designed, informed by and adapted 
to their users. As noted above, it may also be pos- 
sible to explore different ways for organisations 
to assess when normal procedures are not enough 
and adaptation is needed. 

Another possible interpretation is that the prob- 
lem does not lie so much in the procedures them- 
selves, but rather in organisational perceptions 
and values around management and compliance. 
The process of gathering experiences and turning 
them into improvements should never focus only 
on administrative outputs. Firstly, future research 
within the public sector could explore the imple- 
mentation of methods that allow for systems-ori- 
ented analyses of events and activities. This could 
give governmental organisations a more complete 
picture of past events and a better understanding 
of all the different pre-conditions that must exist to 
support operations. Secondly, event analysis could 
be tied more tightly to a general design process so
that lessons learned are used to produce solutions
that are suited to the needs of potential end users. 
This approach could counter the risk of adding to 
their administrative work burden. 


4.4 The role of employee experience and 
overarching organisational values 


Much of the above discussion centres on design 
issues, but the extent to which the studied organi- 
sations were able to perform resiliently also seems 
to depend on the way relationships were formed 
between people and hierarchical layers within the 
organisation, or on the perceived value of the 
human operator in the organisational whole. It is 
important to acknowledge that processes and tools 
do not in themselves guarantee good outcomes. 
Even though work-supporting artefacts from 
processes to decision-aids may have all the right 
attributes, an organisation can still prove too rigid 
and slow to adapt to the fast pace of real-world 
operations. For example, if an organisation is to 
reap the benefits of use-centred design, it also has 
to have a fundamental respect for the experience 
and practical knowledge of operative personnel. It 
is hard to imagine that the individual operator envisioned
in resilience research (knowledgeable, creative
and full of initiative) would be likely to exist
within an organisation that does not have a fun- 
damental appreciation of its employees, acknowl- 
edging the signals and interpretations emerging 
from operations. Furthermore, the existence of 
good adaptive designs that are achieved repeat- 
edly presupposes a continuous dialogue around 
working conditions and developments within the 
organisation, so that designs can build on a good 
representation of reality. This concept calls into 
question the measure-centric, hierarchical para- 
digm of management and control found in many 
of today’s organisations and raises the question of 
how future management could be steered towards 
more inclusive and systems-oriented principles. 
Purposeful designs cannot be reached without 
applying a systems perspective, making sure that 
solutions support not only individual actors within 
the system, but also activities that are distributed 
among people, mediated by technology, within an 
organisational context. 


5 CONCLUSION 


The 2015 increase of refugees arriving in Malmö 
meant a great challenge for the organisations 
involved in the response, and interviews within this 
study revealed a number of examples of resilient 
performance. One of the main findings of this 
study has been that resilience does not primarily 
reside in simple organisational features, functions
or resources, i.e. simple boxes to tick in an organi- 
sation’s management system. Rather, some of the 
most important resilient behaviours in the observed 
case had to do with adaptations—trade-offs and 
judgments made by professionals under the some- 
times harsh conditions of real-world operations. 

In terms of resilient performance, although many 
examples of this emerged during the reception of 
refugees, it also became clear that disconnections 
between management and operative personnel may 
hamper adaptability and lead to designs of organi- 
sational structures that do not fully answer the needs 
of their users. For these reasons, it is suggested that 
future research investigates different ways of guiding 
adaptations e.g. from a basis of core organisational 
goals and values, and of creating joint perceptions 
between management and personnel. 

Practices such as employee inclusion, increased 
local autonomy, user-centred design and systems- 
oriented organisational learning may require a new 
set of values within an organisation. These values 
include an understanding of the organisation as a 
socio-technical system and a fundamental respect 
for human experience, initiative, collaboration and 
problem-solving. While studies such as this one can 
provide positive examples of resilient performance, 
deeper changes may require a paradigm shift for 
both public and private organisations that in itself 
will require further research and interventions. 

ABSTRACT: Recent natural and man-made disasters highlight that a more resilient approach to preparing
for and dealing with such events is needed. To address this challenge, the main objective of the
research and innovation H2020 project DARWIN is the development of European resilience management 
guidelines for Critical Infrastructures (CI). Based on a systematic literature survey with a world-wide 
scope and prioritization of resilience concepts, the guidelines have been developed taking into account 
everyday operations, contingency plans, training, etc. This paper describes insights gained from the 
adaptation of these guidelines in the domains of Air Traffic Management (ATM) and Healthcare (HC). 
A collaborative and iterative process has been defined involving relevant experts and practitioners. To 
ensure transnational, cross-sector applicability and uptake, a Community of Crisis and Resilience Practitioners
(DARWIN DCoP) has been involved. The preliminary results indicate that a big step has been
taken in moving from resilience theory to practice.


1 INTRODUCTION 


ATM and HC have a strong track record of safe operations
in challenging conditions, even though disruptions
and occasional crises do occur. While this record can
certainly be improved, both domains have already
implemented a number of practices and methods,
especially for handling such disruptions
and learning from them. Still, recent
disasters are reminders of the urgent need to
improve our ability to reveal, assess and manage
resilience, both in everyday operations and during
crises (Hollnagel et al., 2011; Adini et al., 2017).

The overall objective and main result of the 
Horizon 2020 EC project DARWIN is the devel- 
opment of European resilience management 
guidelines. These guidelines are called DARWIN 
Resilience Management Guidelines (DRMG). 

The DRMG consist of suggested interventions
and guiding principles to help or advise any
organization in the creation, assessment or
improvement of its own reference guidelines, procedures
and practices.

A key requirement is that the DARWIN
results are useful to their end users, namely the
Critical Infrastructures, which include ATM and
HC.

For this purpose, the DARWIN Resilience 
management guidelines are designed to address 
disruptions, changes and opportunities; facilitate 
anticipation, adaptation, flexibility; and provide a 
foundation for an effective crisis response (Adini 
et al., 2017). 

An initial set of generic DRMG was produced 
(DARWIN D2.1, 2016) and then adapted to ATM 
and HC to make the guidelines more operational 
and usable in these domains. 

This paper presents the approach and meth- 
odology carried out to adapt the DRMG to both 
domains and discusses relevant results. 


1.1 Nature of the DARWIN guidelines


The methodology used to obtain the list of DRMG has
been thoroughly defined. Based on a world-wide
systematic literature review carried out for the
DARWIN project (DARWIN D1.1), 56 concepts,
approaches and practices were identified and
evaluated (DARWIN D1.2).

An evaluation following a modified
Delphi process with practitioners and experts
identified the essential and important resilience
concepts to be included in the resilience management
guidelines. These conceptual requirements, together
with user requirements, are the input for the development of the
DRMG (DARWIN D1.3).

The guidelines are developed as individual 
topics that address the conceptual requirements 
identified. Those topics are referred to as Con- 
cept Cards (CC). CCs propose interventions that 
organizations can implement (the how) to reach 
the resilience management capabilities captured in 
the conceptual requirements (the what). Through 
those interventions, the guidelines aim to help CI 
organizations in developing a critical view of their 
own crisis management activities (management of 
resources, procedures, training, etc.). The CCs are
structured in content blocks that contain information
such as: purpose; interventions proposed;
actors in charge; illustration; associated practices,
methods and tools; etc. In addition, while they
address specific aspects of resilience management,
CCs are not independent, and links between them
are captured through various means.
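
As an illustration only, the content blocks listed above can be pictured as a simple record. The following Python sketch is not part of the DRMG: the class name, the field names and the helper functions are hypothetical and merely mirror the blocks named in the text (purpose, interventions, actors, illustration, practices, methods, tools) and the fact that CCs are linked to each other.

```python
from dataclasses import dataclass, field

# Content blocks named in the text; the identifiers themselves are assumptions.
CONTENT_BLOCKS = (
    "purpose", "interventions", "actors", "illustration",
    "practices", "methods", "tools",
)


@dataclass
class ConceptCard:
    """Hypothetical record for one Concept Card (CC)."""
    number: int
    title: str
    theme: str
    purpose: str = ""
    interventions: list = field(default_factory=list)
    actors: list = field(default_factory=list)
    illustration: str = ""
    practices: list = field(default_factory=list)
    methods: list = field(default_factory=list)
    tools: list = field(default_factory=list)
    related_cards: set = field(default_factory=set)


def link_cards(a: ConceptCard, b: ConceptCard) -> None:
    """Record a symmetric link, since CCs are not independent of each other."""
    a.related_cards.add(b.number)
    b.related_cards.add(a.number)


def empty_blocks(card: ConceptCard) -> list:
    """Return the names of content blocks that have not been filled in yet."""
    return [name for name in CONTENT_BLOCKS if not getattr(card, name)]
```

A card with only a purpose filled in would report the six remaining blocks as empty, which gives a rough view of how mature a card is before it is assessed for adaptation.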

DARWIN CCs, and in particular adapted CCs, 
could be complementary to guidelines, procedures 
and practices already present in the organiza- 
tions of the two domains, fostering their revision, 
improvement or even creation of new guidelines. 

Also, each CC includes a Minimum Viable
Product (MVP), which is the smallest way to
start using the interventions proposed in the CC.
The MVP is the minimum set of features
required to test or experiment with a solution. Its purpose
is to get through the “build-measure-learn”
feedback cycle as quickly and efficiently as possible
(Ries, 2011). The DARWIN project proposed
this solution based on interactions with
experts (managers and front-line operations).
This approach contrasts with traditional product
development, which proceeds through designing, performing
preliminary and critical reviews, producing, testing
and perfecting the product.


1.2 Content of the DARWIN guidelines 


The DARWIN CCs are organized under the fol- 
lowing themes: 


SUPPORTING COORDINATION AND SYNCHRONIZATION OF DISTRIBUTED OPERATIONS

1. Promoting common ground in cross-organizational
collaboration

2. Establishing networks for promoting inter-organizational
collaboration

3. Ensuring that actors involved in resilience management
have a clear understanding of their
responsibilities and the responsibilities of other
involved actors


MANAGING ADAPTIVE CAPACITY

4. Enhancing the capacity to adapt to both
expected and unexpected situations

5. Establishing the capacity for adapting during
crises and other events that challenge normal
plans and procedures


ASSESSING RESILIENCE

6. Identifying sources of resilience

7. Noticing brittleness

8. Assessing community resilience to understand
and develop its capacity to manage crises


DEVELOPING AND REVISING PROCEDURES AND CHECKLISTS

9. Managing policies systematically, involving
policy makers and operational personnel, for
dealing with emergencies and disruptions


INVOLVING THE PUBLIC IN RESILIENCE MANAGEMENT

10. Interacting with the public not yet affected by
or involved in a crisis


2 METHODOLOGY 


The established methodology is a systematic step-by-step
approach closely intertwined with the other
DARWIN activities. These include in particular those
relevant to the development of generic guidelines, to
their evaluation and to interaction with the DCoP.

The adaptation process consists of two main 
steps: 


— Step 1: Selection of adaptable CCs, i.e. the assess- 
ment for the adaptability of the generic CCs 

— Step 2: Adaptation of adaptable CCs, that is 
the adaptation of the generic CCs to ATM and 
HC domains, and the release of the adapted 
guidelines. 


2.1 Selection of adaptable CCs 


This phase has been performed by applying a 
methodology based on a quantitative and qualita- 
tive SWOT (Strengths, Weaknesses, Opportunities 
and Threats) analysis that assessed if a CC was 
adaptable or not. 

The SWOT analysis methodology is commonly
used to develop a deep understanding of all kinds
of situations faced by businesses, organizations and
individuals, in support of the decision-making process.

Notably, the SWOT analysis findings also point
to actions that guideline developers can take
to improve the generic
guideline content. In particular, during the
development of the DRMG, the CCs were simultaneously
assessed with regard to a possible adaptation
to the specific domains, so as to avoid gaps
between development and later adaptation.

At the end of the adaptability assessment two 
lists are expected: the list of non-adaptable CCs 
including the rationale behind their non-adaptability 
or elements that can be improved, and the list of 
adaptable CCs including the rationale behind their 
adaptability. 

The applied SWOT analysis has been defined 
combining the quantitative and qualitative 
approaches for a richer collection of data. 

The quantitative SWOT analysis has been based
on the definition of a set of Indicators (I) that were
identified starting from the fields of the CC used
for the process of adaptation.

Seventeen indicators (Table 2) have been estab- 
lished, each of them formulated as a specific state- 
ment and categorized according to the four areas 
of the SWOT: 


— the Strength/Weakness (S/W) areas include indi- 
cators concerning internal aspects of the CC (i.e. 
specific contents of the CC fields). 

— the Opportunity/Threat (O/T) areas include
indicators whose assessment needs to take into
account a more long-term perspective and the
interdependency with external factors linked to
the contexts of the CC application.


Table 1. Step 1 overview.

Input:
• Last available version of Generic CCs

Output:
• List of Adaptable and non-Adaptable CCs
• Information concerning content for CCs adaptation (from qualitative SWOT)
• Information concerning elements of the Generic CC improvement

Effort required:
• 1 day per interview with each ATM/HC expert concerning each single CC SWOT
• 3-4 days (per CC) to organize relevant information and perform additional research
• 1 day to review the results with the involved expert


Experts’ opinions on the indicators have subsequently
been collected through seventeen questions
formulated as follows: “How much do you agree with
the following statement (I_01, I_02, ..., I_17)?”

The answers to each question were recorded 
using a 5-point Likert scale ranging from “Disa- 
gree” to “Very Strongly Agree”, with “Somewhat 
agree” in the middle. 

In order to obtain a quantitative figure for 
each indicator’s assessment, a numeric value was 
assigned to each level of the scale, starting from 
1 (=“Disagree”) to 5 (=“Very Strongly Agree”) and 
incrementing by one per level. 

Thus, according to its mean value, the assessment
of each indicator I has been classified according to the
criteria given in Table 3.
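
The scoring and classification rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual tooling: the function name and the `comments_positive` flag are hypothetical, the O/T indicator numbers are read off Table 2, and the mapping "internal indicator gives Strength or Weakness, external indicator gives Opportunity or Threat" is one reading of Tables 2 and 3.

```python
from statistics import mean

# Indicators listed under the O/T category in Table 2; all others are S/W.
OT_INDICATORS = frozenset({2, 3, 6, 7, 11, 12})


def classify_indicator(number, scores, comments_positive=None):
    """Classify one indicator from expert scores on the 1-5 Likert scale.

    1 stands for "Disagree" and 5 for "Very Strongly Agree".
    comments_positive is only consulted when the mean is exactly 3,
    where the qualitative comments decide the classification.
    """
    if not scores or any(s not in (1, 2, 3, 4, 5) for s in scores):
        raise ValueError("scores must be a non-empty list of integers from 1 to 5")

    external = number in OT_INDICATORS
    favourable = "Opportunity" if external else "Strength"
    unfavourable = "Threat" if external else "Weakness"

    m = mean(scores)
    if m > 3:
        return favourable
    if m < 3:
        return unfavourable
    # Mean exactly 3: fall back on the balance of the experts' comments.
    if comments_positive is None:
        raise ValueError("mean is 3: qualitative comments are needed to classify")
    return favourable if comments_positive else unfavourable
```

For instance, three experts scoring indicator 2 as 4, 5 and 3 give a mean of 4, so the indicator counts as an Opportunity; an indicator whose mean is exactly 3 cannot be classified from the numbers alone.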

The qualitative SWOT analysis of each generic 
CC was carried out by collecting comments and 
feedback from experts during the assessment of 
the quantitative SWOT analysis indicators. 

After the quantitative assessment of each indi- 
cator, the expert was asked to explain the rationale 
of the scoring, indicator by indicator, while the 
interviewer was taking notes. 

The interview started with the narration of the 
illustrative case or lesson learnt that, according to 
the expert, better supports the discussion on the 
contents of the specific CC applied to the domain. 

The rationales provided during the interviews 
have been collected and grouped into four areas of 
the SWOT on the basis of the mean values calcu- 
lated for each indicator (ref. Table 2). 

Notably, rationales that fully contradicted
the average evaluation for a specific
indicator have also been kept and taken into account
for the sake of richness of data.
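
The grouping step can be illustrated with a short Python sketch. It assumes each rationale is stored with the indicator it refers to and the score the expert gave; the `Rationale` type, the `group_by_area` function and the "dissenting" flag are hypothetical names introduced here for illustration. Rationales are filed under the SWOT area derived from the indicator's mean score, and those that contradict that area are flagged rather than discarded.

```python
from collections import defaultdict
from typing import NamedTuple


class Rationale(NamedTuple):
    """One expert comment (hypothetical structure)."""
    indicator: int
    score: int      # the expert's own 1-5 score for this indicator
    text: str


POSITIVE_AREAS = {"Strength", "Opportunity"}


def group_by_area(area_of, rationales):
    """Group rationales under the SWOT area of their indicator.

    area_of maps an indicator number to the area obtained from its mean score.
    Each rationale is paired with a flag that is True when the expert's own
    score points the opposite way from the mean-based classification.
    """
    grouped = defaultdict(list)
    for r in rationales:
        area = area_of[r.indicator]
        if area in POSITIVE_AREAS:
            dissenting = r.score < 3
        else:
            dissenting = r.score > 3
        grouped[area].append((r, dissenting))
    return dict(grouped)
```

Keeping the flag next to each comment preserves minority views in the same SWOT area as the majority, which matches the choice to retain contrasting rationales.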

In addition to the CCs adaptability assessment, 
the qualitative SWOT analysis results have been 
considered as one of the main sources of informa- 
tion used to adapt the CCs’ content to the specific 
domain. Moreover, the collected information has 
been enriched using sources of information avail- 
able online. 

Some criteria were established to evaluate the 
adaptability of each CC to the specific domain. 
They were based on two mean scores of the SWOT 
analysis results: 


— I_02 mean score: this indicator directly refers
to the applicability of the CC to the local context
in which the card will be used. Applicability
is the sine qua non condition for using the CC
in a real ATM/HC environment.

— Total mean score of all CC indicators: this value
provides a synthetic measure of the “adequacy”
and maturity of the CC fields for adaptation
purposes.


The combination of these two mean scores, as
explained in Figure 1, defines the adaptability of
each CC and the specific issues to be addressed by CC
developers.

Figure 1 and Table 4 show the criteria applied
to establish whether a CC is adaptable or not, and the
actions identified to handle the issue with guideline
developers.


Table 2. List of SWOT indicators.

Nr. Statement [SWOT category]
1 The CC overlaps with other CCs [S/W]
2 The CC is applicable to local ATM/HC contexts (where the card will be used) [O/T]
3 The CC can be complementary to local ATM/HC artefacts (i.e. procedures, regulations) [O/T]
4 Actors, as described in the CC, are identifiable in the ATM/HC domain [S/W]
5 The roles and responsibilities of the actors, as described in the CC, are clear in the ATM/HC domain [S/W]
6 It is possible to identify actors, roles and responsibilities, as described in the CC, in case of sudden changes in the ATM/HC domain (i.e. regulatory bodies, etc.) [O/T]
7 It is possible to identify actors, roles and responsibilities, as described in the CC, in case of future changes in the ATM/HC domain (i.e. regulatory bodies, etc.) [O/T]
8 The implementation before, as developed in the CC, is relevant for the ATM/HC domain and adaptable [S/W]
9 The implementation during, as developed in the CC, is relevant for the ATM/HC domain and adaptable [S/W]
10 The implementation after, as developed in the CC, is relevant for the ATM/HC domain and adaptable [S/W]
11 Internal factors of the ATM/HC domain, facilitating or hindering the implementation of the contents of the CC, can be easily identified and explained [O/T]
12 External factors (cultural, social, economic environment), facilitating or hindering the implementation of the contents of the CC, can be easily identified and explained [O/T]
13 Expected results, that can be inferred from the CC, can be identified and explained within the ATM/HC domain [S/W]
14 Illustrative cases and/or lessons learnt, linked to the contents of the CC, are available in ATM/HC domain [S/W]
15 Practices, linked to the contents of the CC, are available in ATM/HC domain [S/W]
16 Methods, linked to the contents of the CC, are available in ATM/HC domain [S/W]
17 Tools, linked to the contents of the CC, are available in ATM/HC domain [S/W]


2.2 Adaptation of adaptable CCs


Once a CC has been evaluated as adaptable, the
second step of the adaptation process begins. The
Adapted CCs have been developed by integrating
several sources:


— The findings of the qualitative SWOT analysis
performed in Step 1;

— The information collected during ad-hoc interviews
with domain specific experts;

— The information provided by the domain specific
experts involved in the “initial evaluation of
guidelines”;

— The feedback provided by the DCoP during the
workshop;

— The results collected during the implementation
of Pilot exercises.


A strategic selection of participants from management
as well as front line operators was performed.
A template was prepared to follow a semi-structured
interview with experts. The interview started with a
narration of an illustrative case to better support the
discussion of the context of the CC.

The topics covered concern the actors involved and the
actions prior to, during and after the crisis. The
template also includes context information, practices,
methods and tools as well as other illustrative
cases. The relevance of the content proposed by the
CC, in particular concerning the interventions, was
discussed.


Table 3. Criteria for the classification of SWOT results.

Indicator mean score: Classification of the Indicator I

I > 3: I classified as Strength or Opportunity.
The indicator is helpful to the CC adaptation to the ATM/HC domain.

I < 3: I classified as Weakness or Threat.
The indicator is “harmful” to the CC adaptation to the ATM/HC domain.

I = 3: Those Indicators whose mean value was 3 have been classified by taking
into account the experts’ comments collected by the qualitative SWOT analysis:
• the Indicator has been classified as Strength or Opportunity if the
majority of the comments mainly emphasized positive elements;
• the Indicator has been classified as Weakness or Threat if the majority
of comments highlighted gaps and missing points.


Table 4. Rationale for CC classification (example).

CC classification: “Non-adaptable” or “Partially adaptable”
Rationale:
• the concept is valid at a general level but difficult to adapt. It is not
applicable in the local ATM/HC domain (i.e. due to type of organization and
current policies of the local ATM/HC systems); or
• the CC is not adequately developed to be adapted; or
• so far, it has been particularly difficult to find specific ATM/HC content
for the majority of the fields.
Action: Major amendments are needed and the issue has to be discussed with
guideline developers.

CC classification: “Adaptable”
Rationale:
• the CC is applicable in the local ATM/HC domain; or
• most of the fields of the CC are adequately developed. However, in some cases
some effort could be needed to make adjustments or to find specific ATM/HC
content for some fields.
Action: The CC can be adapted and some issues are to be discussed with
guidelines developers.


Figure 1. Adaptability Criteria. [Flowchart titled “Step 1 - Selection of
Adaptable Concept Cards to ATM”; one outcome is labelled “Issue to be discussed
with guidelines developers”.]


Table 5. Step 2 overview.

Input:
• Last available version of Generic CCs
• Output from Step 1:
  - List of Adaptable CCs
  - Information concerning content for adaptation of CCs (from qualitative SWOT)

Output:
• DRMG/Adapted CCs to ATM/HC

Effort required:
• 1 day (per CC) to interview one/two ATM/HC expert/s;
• 5 days (per CC) to integrate Wiki with relevant information coming from SWOT,
expert interviews, DCoP feedback, CC evaluations, feedback from pilots,
additional research on internet.


3 RESULTS ON THE SELECTION OF 
ADAPTABLE CC 


The adaptability assessment of the CCs has been 
gradually carried out during the development 
process of the generic CCs. 




The SWOT analysis started as soon as the 
generic guidelines developers team considered the 
CC mature enough to be released and assessed for 
adaptation. 

The information collected during the SWOT 
has been enriched using sources of information, 
suggested by the interviewee, available online and 
integrated in the DARWIN Wiki. 

An overall review of expert feedback has provided
guidance on the quality of the adaptable
guidelines. It has been taken into consideration
when updating the adaptable CCs as well as when
elaborating new CCs.

Elements that have been mostly appreciated 
among experts are: 


— The concepts developed by the CCs are relevant 
to the ATM and HC context, 

— Actors in ATM and HC context are identifiable
in a clear and concise manner; roles and
responsibilities are also clear, since the hierarchy is well
defined in ATM and HC,

— The list of actions/interventions gives sufficient
explanation of responsibilities, making it easier
to adapt to ATM and to HC, while taking
into account the broader and different fields of
HC,

— The triggering questions are useful and well 
grouped, 

— The indications provided in the fields “imple- 
mentation before/during/after” are sufficient 
to develop a CC adapted to the ATM and HC 
context, 

— The level of information provided makes integration
with local artefacts (procedures,
plans) easier,

— Useful examples, illustrative cases, practices 
and methods are available in the ATM and HC 
context. 


Elements that need improvement: 


— In some CCs, the information is very high 
level or too generic thus making it difficult to 
adapt, 

— Some content concerning the “Triggering Ques- 
tions” and “Actions” is redundant and needs to 
be simplified, 

— No tools are provided in some of the current
versions of the CCs; effort should therefore be spent
on this during the adaptation process,

— Harmonization still needs to be reached among 
some CCs. 


4 RESULTS FROM ADAPTATION OF CC 


At first sight, ATM and HC seem to be very different
contexts, but during meetings the DARWIN
team has discovered that they share many similarities
and many common issues (e.g. criticality of
the infrastructures, impact on the public, etc.).

Notwithstanding that, the aviation domain in
general is characterized by a high level of standardization.
The number of standards and regulations
guarantees that ATM has a great track record of
safe operations.

Regulatory bodies and concerned actors are well 
defined together with roles and responsibilities. 

For example, the limited geographical extent of
an aerodrome exposes this type of environment
to a relatively limited number of crisis
types (e.g., aircraft accident during take-off or 
landing, disaster in the premises, loss of working 
resources, climatic event, etc.). 

Although it is impossible to know when the cri- 
sis will occur, the characteristics and dynamics of 
crisis situations can be foreseen in advance to some 
degree. As a consequence, the concerned actors 
and the response procedures can be defined with 
sufficient accuracy before the crisis occurs. 

On the other hand, other types of crises in ATM 
may be much more extended from a geographical 
point of view and less predictable in the way they 
evolve (as in the example of the Eyjafjallajökull 
volcano eruption in 2010). 

As previously suggested, the HC domain shares
with ATM aspects related to criticalities and
brittleness. In addition, the scientific basis
underlying public health and HC tasks (i.e. care,
surveillance, research, regulation and control)
matches the bases/criteria of Safety and Quality
Assurance present in ATM.

What is, however, peculiar to HC is the
individual/team resilience that HC workers
(professionals and operators) practice daily while
performing their tasks and while coping with
unexpected situations. In this case, the management
of this resilient approach, i.e. systematically
creating the conditions to bridge the gap between
work-as-imagined and work-as-done (WAI vs. WAD),
proves to be challenging.

In addition, this domain shows more complexity
and a greater variety of tasks, with many actors,
ranging from surgeons and nurses to regulatory
bodies, providers and training organizations.
These lead to a variety of systems, processes and
outputs (protocols, documents, or records) that
could differ from hospital to hospital.

HC is moving towards innovation; however, it
should be recalled that professionals frequently
favour clinical judgement over standardization (a
clinician sometimes relies on autonomous judgement,
provided that it is based on knowledge and belief).

One of the noteworthy outcomes of the adaptation
process is that we discovered more uses than
we expected at the beginning of the project. We
found that the guidelines are useful and can be
adapted and adopted on many occasions, such as
training, workshops and meetings.

They help to start discussions and to deal with 
significant topics and they can be used to: 


1. Check or update current procedures and guide- 
lines, if already existing; 

2. Define new procedures and guidelines if not 
existing; 

3. Identify possible indicators and evaluation of 
trends (to do possible benchmarking); 

4. Prepare plans; 

5. Perform risk assessment and management. 


During the interviews with ATM and HC 
experts concerning the CCs, some common aspects 
that play an important role in the resilient manage- 
ment of crisis emerged: 


— THE CCs SHOULD CONCERN ALL LEVELS OF
ORGANIZATION

Even if, at first sight, the DARWIN CCs may
seem to address only policy makers and management,
who are responsible for the modification of
current procedures, it is noteworthy that all concepts
address all levels of organization, from senior
management to front line operators.


— THE ROLES AND RESPONSIBILITIES OF INVOLVED 
ACTORS CHANGE ACCORDING TO THE TYPE OF CRISIS 
AND THE RELATED ENVIRONMENT OF OPERATIONS 

In the ATM context, several actors are involved,
depending on the type of crisis. According to
‘ICAO Annex 14. Emergency and other services’,
an Airport Emergency Plan shall be established
to coordinate the response and participation of all
existing agencies which could assist in responding
to an emergency.

Examples of possible agencies ON and OFF 
aerodrome are provided: 

ON-aerodrome: air traffic control unit, rescue 
and firefighting services, aerodrome administra- 
tion, medical and ambulance services, aircraft 
operators, security services, and police; 

OFF-aerodrome: fire departments, police, health 
authorities (including medical, ambulance, hospi- 
tal and public health services), military, and har- 
bour patrol or coast guard. 


— THE ESTABLISHMENT OF JUST CULTURE AND SAFETY 
CULTURE IN ALL ORGANIZATIONS 
With particular reference to the concept of “notic- 
ing brittleness”, Just Culture and Safety Culture 
are the internal factors that could help in facili- 
tating the identification of brittleness in each 
organization. 
The concept of ‘Just culture’ is discussed in
EUROCONTROL (2006): “in recent years the concept
of “Just culture” has become better understood
and accepted by people employed in the aviation
industry. However [...] the need for a “just culture”
is generally not understood by many legislators and
therefore not accepted within their State judicial
systems.”

This issue causes “increased fear of sanctions 
against the reporter, particularly if partly or fully 
responsible for the reported occurrence.” 

“Furthermore, certain elements of the media may 
deal aggressively with apparent breaches of flight 
safety within certain airlines and ANSPs.” 

“These factors—punishing Air Traffic Control- 
lers or pilots with fines or license suspension—may 
have the cumulative effect of reducing the level of 
incident reporting and the sharing of safety infor- 
mation. This hinders safety improvement and as a 
cascading effect resilience.” 

There could be concerns about possible misuse 
of information regarding brittleness in the organi- 
zations, since “one of the major problems with 
collecting and analysing information is that such 
information can be a very powerful tool and, like any 
powerful tool, if used properly it will provide great 
benefit. However, it can also be used improperly and 
if that occurs considerable harm can be caused”. 

In the last decade, much progress has been
made to encourage Just Culture in the European
ATM context, mainly thanks to the efforts of
EUROCONTROL: e.g. Air Navigation Service
Providers are endorsing Just Culture policies and
programmes, Task Forces have been created to
promote, debate and discuss issues concerning safety
and justice, meetings are organized to encourage
interaction between safety and judicial experts,
and special “just culture” courses for aviation
experts and prosecutors have been organized.

According to EUROCONTROL (2008), Safety 
Culture is “the way safety is perceived, valued and 
prioritised in an organization. It reflects the real 
commitment to safety at all levels in the organiza- 
tion. [...] It is not something you get or buy; it is 
something an organisation has. [...] It can therefore 
be positive, negative or neutral.” 

Since 2006, EUROCONTROL has been actively
involved, in collaboration with FAA and CANSO,
in measuring and improving Safety Culture within
ANSP organizations. Safety Culture surveys are
continuously planned and performed, and their
results and recommendations are taken into account
and implemented to guarantee an effective SMS and
a healthy Safety Culture.


— THE IMPORTANCE OF PLANNING, TRAINING AND 
TESTING IN ADVANCE 

The plan should include a clear definition of the
agencies involved, the responsibility and role of
each agency and the contact details of offices/people
to be contacted in case of emergency.

Training people maintains a high level of
preparedness for possible crisis events.

The plan can be tested in many different ways,
beginning with the organization of exercises,
from a lower level to a higher level, with
each one building on the concepts of the previous:
discussion-based and operations-based exercises.
Carrying out the exercises makes it possible to
identify weaknesses in the plans and possibly
improve them.

Discussion-based exercises are organized to 
discuss the plans for upcoming operations-based 
exercises, and to make everyone familiar with roles, 
procedures and responsibilities. They include: sem- 
inars, workshops, tabletop exercises, and games. 

Operations-based exercises are used to validate
and test plans and procedures that have been
consolidated after the discussion-based exercises.
They help to clarify the roles and responsibilities
of involved actors, identify gaps and limitations of
the plan, and improve everyone’s performance. They
include drills, functional and full-scale exercises.


— THE IMPORTANCE OF DISSEMINATION
The dissemination of relevant information after
crisis events is fundamental to improving the
resilience of the organization during a crisis. In the
ATM context and for this particular purpose,
EUROCONTROL encourages the distribution of lessons
learnt and the exchange of best practices through
the website Skybrary. In addition, the magazine
Hindsight contains many useful case studies and
provides Air Traffic Controllers (ATCOs) with a means
to share their experiences concerning ATM-related
safety occurrences. The objective is to “broaden ATCOs
understanding of the problems that may be encountered,
learn more about possible solutions and be
better prepared in the face of similar occurrences.”

Moreover, the presence of the “triggering
questions” was particularly appreciated, even though
during time-critical crises it may be difficult to use
them as a checklist to be read step by step and
to identify someone to check their completion.
Nevertheless, it is important that all the
actors involved in the management of the crisis are
fully aware of the topics addressed.

For crises developing over a longer time (e.g. 
Icelandic volcano eruption or Ebola outbreak) it 
is possible to organize workshops and meetings to 
reflect with other colleagues on the possible sources 
of brittleness and use the triggering questions to 
support the reflection. The same approach can be 
used during a drill or a simulation by a facilitator 
to guide the simulation and stimulate participants 
to notice brittleness. 




5 DISCUSSION AND CONCLUSIONS 


The work performed so far confirms the intended
readership as policy makers, front line operators,
resilience engineering managers, crisis managers,
critical infrastructure managers, methodologists,
and the community of practice in ATM and HC.
Stakeholders such as managers and policy makers can
use this work as a source of inspiration when
adapting resilience guidelines to their domains.

In particular, applying the DARWIN resilience 
concepts, triggering questions, methods and tools, 
they will be able to: 


— Apply the proposed interventions provided in 
the CCs to survey current practices, strategies, 
procedures and guidelines; 

— Start to reflect on “what went well” and not only 
“what went wrong” when learning from events; 

— Assess the effectiveness of roles and responsi- 
bilities during a crisis; 

— Revise and/or define common action plans 
through periodical coordination activities and 
training; 

— Identify brittleness in the system and the appli- 
cation of procedures and response to the crisis; 

— Get to know practices, methods and tools 
applied by others; 

— Test and improve their plan of communication 
with public during emergencies. 


The collaborative method presented in this
paper illustrates an iterative approach that brings
theoretical concepts close to their practical
implementation. We gathered information from other
domains through workshops with members of the
DCoP. The SWOT facilitated translation of
resilience concepts into practical interventions. The
methodologies proposed to adapt the concepts are
defined in detail so that other concepts can be
included in the future. The participation of experts
is essential to ensure applicability, to relate the
cards to the specific domain, and to enrich them
with existing practices and methods.

At the beginning of the work, we planned
separate guidelines for ATM and HC. The first results
were cards that replicated the generic cards. This
overlap in relevance and adaptation of the CC
outcomes in ATM and HC indicates the potential of
the generic CCs to be applicable to other sectors. The
current result combines generic fields with
adaptations to HC and ATM as required. The results
indicate the possibilities of similar adaptations to
other domains.

A challenge is the achievement of consensus
on the review process and iteration to achieve
sufficient maturity. Another challenge, as well as an
opportunity, is to merge different cultural
perspectives across Europe when dealing with crises.
We found this to be a window of opportunity to learn
by mapping recommended practices and methods
within and across domains.

Further work includes evaluation of DRMG 
and associated CCs in relevant operational scenar- 
ios. We consider collecting feedback from ATM, 
HC as well as other domains from the DCoP. 

PSA modeling method for a safety critical DI&C system


Sung Min Shin & Jaehyun Cho 
Korea Atomic Energy Research Institute, Yuseong-gu, Daejeon, Republic of Korea 


ABSTRACT: I&C systems in NPPs are being digitalized by adopting new features, such as software, 
fault-tolerant techniques, and network communication. Although the risk caused by these new features 
should be analyzed in an appropriate framework, at present there is no consensus on PSA methods for 
them. In this study, a general frame of a PSA model for the automatic safety signal generation function 
in a DI&C system is proposed, in consideration of the representative safety features of this system and 
the linkage between them. Through the related literature, we identified the requirements to construct the 
DI&C PSA model, constructed a general frame reflecting its possible parts, and specified the assumptions 
and approaches applied in this process. Although this study has focused on a qualitative approach because 
an appropriate database cannot be obtained yet, important failure modes that are understood in this current
phase, and the research topics that need to be considered for the development of the enhanced DI&C
PSA model, are summarized.


1 INTRODUCTION 


Numerous analog Instrumentation and Control
(I&C) systems in Nuclear Power Plants (NPPs) are
now being replaced by digital systems owing to the
obsolescence of safety-critical analog components.
This shift entails the adoption of new features that 
did not exist in analog systems, such as software, a 
Fault Tolerant Technique (FTT), and network com- 
munication. Although these features are expected 
to contribute to the enhancement of both efficiency 
and economy, from a safety point of view, the risk 
caused by the new features should be analyzed in an 
appropriate framework to ensure the dependabil- 
ity of the entire NPP (Kang, 2009, Authen, 2012). 
In recent years, regulatory bodies in each country
have been demanding that the reliability of
DI&C systems be incorporated in the PSA model.
In this regard, some studies have been conducted on
software reliability, which is the representative
feature of a DI&C system; however, a more
comprehensive approach seems necessary, covering
how to integrate it with other digital features and
apply it to the actual NPP PSA model.

Therefore, in this study, an approach to the
DI&C PSA model is suggested in consideration
of the representative safety features of the DI&C
system and the linkage between them. In this
preliminary phase for the DI&C PSA, the approach is
qualitative, without considering the available
database, and focuses on the reliability model of the
automatic signal generation part of the DI&C system
functions, excluding the part related to human
behavior.


2 AUTOMATIC SAFETY SIGNAL 
GENERATION IN DI&C SYSTEM 


Although a more appropriate modeling method 
for the DI&C system can be developed during the 
research progress, in this phase, a conventional 
fault tree (FT) format is considered for the DI&C 
system reliability model in terms of the conven- 
ience of integration with the plant model. 

From the risk perspective, the modeling factors 
caused by the introduction of the DI&C system 
can be roughly expressed through Figure 1. 


— DI&C induced initial event (spurious operation
of DI&C system)

— Hardware and software failure (failure
mechanism)

— Fault tolerant failure (fail-safe mechanism)

— Human error in digital environment


Figure 2. Functions implemented for safety signal
generation.


Among the four categories, the DI&C induced
initial event is not considered in this first phase,
as this factor can have different effects on the
system and can be considered separately.

The remaining factors, in combination, imple- 
ment certain functions for safety signal generation, 
as expressed in Figure 2. The functions in orange 
indicate human behavior. As can be seen from this
figure, there are two methods for safety signal
generation. One is fully automated, with the
fault tolerant technique applied. In the other,
the safety signal is generated by a human operator
using the information transmitted. This method
also utilizes digitalized features, such as a display, 
CPS, and control, helping the decision making of 
the human operator. To develop a reliability model 
of this, information of the hardware and software 
failure in the functions corresponding to 2 through 
5 (numbers in Figure 2), and the Human Error 
Probability (HEP) in this digitalized environment, 
should be analyzed. 

Because the reliability model of the two methods 
can be treated separately, and the HEP has to be 
additionally obtained, research into the reliability 
model of human intervention is being conducted 
separately, with a plan to merge them later. 

Therefore, this study focuses on only the auto- 
matic safety signal generation (1 in Figure 2) in 
consideration of the failure mechanism and the 
effect of the fault tolerant technique. 


3 TYPICAL CONFIGURATION OF DRPS 


The Digital Reactor Protection System (DRPS) of
an NPP differs in its detailed configuration
according to the plant type and reactor model. To
develop a general frame, it is first necessary to
confirm the typical configuration of the DRPS
covering various types through the examination of
many systems. For this purpose, the DRPSs applied to
the IDiPS-RPS (Integrated Digital Protection
System-Reactor Protection System) developed through
the KNICS (Korean Nuclear Instrumentation and
Control) project, the OPR-1000 (Optimized Power
Reactor), and the APR-1400 (Advanced Power Reactor)
are investigated, and a typical configuration, as
shown in Figure 3, is confirmed. For reference, only
the parts related to automatic safety signal
generation are shown in this configuration; the
indication parts are excluded. The notations used in
the configuration are as follows.


— AIM: Analog Input Module 

— DIM: Digital Input Module 

— PM: Processor Module 

— CM: Communication Module 

— F: Fiber Optic Module (FOM) 

— DOM: Digital Output Module 

— AT: Automatic Test Module 

— CPC: Core Protection Calculator 


The more detailed functions are as follows. 


— It consists of four physically isolated multiple 
channels (A, B, C, and D) and performs the 
same function independently in each channel. 

— There are two sub-racks in each channel, sharing 
the inputs and performing the same function. 

— Each sub-rack in each channel is composed of
bistable logic (B) that generates Trip and
Engineering Safety Feature (ESF) signals through
comparison with internal set points, and
coincidence logic (C) that receives signals generated
from each PM in bistable logic in each channel
and applies voting.

— Bistable logic receives inputs from sensors using
two AIMs and CPC values through one DIM,
and the actual comparison with internal set
points is performed by the AS (Application
Software) in the PM.

— There are three PMs in each coincidence logic:
two perform trip-related functions and one
performs ESF-related functions.

— The two DOMs in the coincidence logic receive 
the trip signal from connected PM and transmit 
it to the function of selective 2/4 logic. 

— The safety signal generated by the PM in the 
bistable is transmitted to the coincidence logics 
in the other channels. In this process, FOM is 
used to ensure one-way signal transmission and 
independence between channels. 

— Each PM in the coincidence logic receives two 
values generated by PMs in the different rack in 
the different channel. Before performing 2/4 vot- 
ing logic in the associated PM in the coincidence 
logic, 1/2 logic is performed first for the two val- 
ues at each PM. 
— Regarding the fault tolerant techniques, each
input module and PM has a self-diagnostic
function that is performed in real time at the
operating system level, and the AT module,
which exists in each channel, detects some
portion of faults in each module within a channel
periodically through the network and the CMs.
The fault detection coverage of each technique
for each module is different but partially
overlapped.
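
The 1/2-then-2/4 voting described in the list above can be illustrated with a minimal Python sketch. It assumes that the two values received per channel are the trip outputs of that channel's two bistable racks; the function names and the dictionary layout are hypothetical and are not taken from any DRPS design document.

```python
def one_out_of_two(rack1_trip: bool, rack2_trip: bool) -> bool:
    """1/2 logic: a channel demands a trip if either bistable rack does."""
    return rack1_trip or rack2_trip


def two_out_of_four(channel_trips: list[bool]) -> bool:
    """2/4 voting: trip when at least two of the four channels demand it."""
    return sum(channel_trips) >= 2


def coincidence_pm_output(bistable_trips: dict[str, tuple[bool, bool]]) -> bool:
    """Apply 1/2 logic per channel (A to D), then 2/4 voting across channels.

    bistable_trips maps a channel name to the (rack 1, rack 2) trip values
    (assumed layout, for illustration only).
    """
    per_channel = [one_out_of_two(*bistable_trips[ch]) for ch in "ABCD"]
    return two_out_of_four(per_channel)


if __name__ == "__main__":
    demand = {
        "A": (True, False),   # only rack 1 of channel A trips
        "B": (False, False),
        "C": (False, True),   # only rack 2 of channel C trips
        "D": (False, False),
    }
    print(coincidence_pm_output(demand))
```

Under this reading, a failure of one rack in a channel is masked by the 1/2 stage, and the loss of up to two whole channels is tolerated by the 2/4 stage.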


[Figure 3 labels: Bistable logic (B), Channel A rack 1; Bistable logic (B), Channel A rack 2; selective logic]


Figure 3. Typical configuration of DRPS for safety signal generation, corresponding to one channel (channel A).


Table 1. Proposed requirements [U.S.NRC, 2008].


ID     Proposed requirements     Note
R-1    Review the level of PSA proportional to the use of results and insights.    B
R-2    Identify how systems can fail and what these failures can affect.    A
R-3    Identify CCF events.    A
R-4    Address the uncertainties in modeling and data.    C
R-5    Confirm the capability of its safety function.    B
R-6    Address the impact of external events.    C
R-7    Model the failure of control room indication.    C
R-8    Determine and evaluate the scope, boundary condition, and modeling assumptions.    A
R-9    Model the recovery actions taken for loss of DI&C function.    C
R-10   Quantify the contribution of software failures.    A
R-11   Verify the credit for defensive design.    A
R-12   Review the DI&C data.    B, C
A-1    Verify that physical and logical dependencies are identified and their bases provided.    A
A-2    Evaluate the spurious actuations of diverse backup systems or functions.    C
A-3    CCFs can occur in areas where there is sharing of design, application, or functional attributes.    A
A-4    Evaluate the credit that should be given for defensive design features.    A
A-5    If a DI&C system shares a communication network, the effects on all systems due to failures of the network should be modeled.    A
A-6    Calculations, their bases, and the modeling assumptions used in standard methods may be warranted.    B
A-7    Review of applicant claims regarding data should be proportional to the use made of the PRA results.    B
A-8    Confirm the suitability of data based on the suggested criteria.    B
A-9    Interactions (1) between plant system and physical process, and (2) within a DI&C system.    C
A-10   Target reliability and availability specifications should be described (2)    B

4 REQUIREMENTS FOR DI&C PSA


The US is also carrying out research on the
reliability evaluation of digitalized I&C systems in
relation to the second-lifetime extension issue.
However, because the recent guidance (10 CFR Part 52)
does not provide sufficient details regarding the
DI&C system, a separate ISG (Interim Staff Guidance),
consistent with the existing guidance, was
issued to provide reviewers with more
detailed requirements to be identified in the DI&C
PSA (U.S.NRC, 2008). To reflect these requirements
in the general frame of the DI&C reliability
model, the 12 review guidelines (R-x) and the 10
additional steps (A-x) in this reference were checked
and marked according to their applicability to this
study. In Table 1, the marks in the note column
have the following meanings: “A” marks the
requirements selected for consideration in this study,
“B” indicates what is required in the actual review
process, and “C” marks the requirements that are
difficult to consider or apply in this phase of the
study.


5 GENERAL FRAME OF RELIABILITY 
MODEL FOR SAFETY SIGNAL 
GENERATION 


To implement the requirements selected in
Section 4 in the general frame, the DI&C system is
divided into standard units, and within each
unit, the characteristics of the HW/SW failure and
the effects of fault tolerant techniques are
considered. Some assumptions are made in this
standard-unit-based approach, and the general frame
is developed on it.


5.1 Assumptions and approaches 


The top event of the DI&C reliability model is 
defined as the failure of automatic safety signal 
generation for each mitigation system in each acci- 
dent scenario. 

To effectively analyze the DI&C system, this 
study took the module level, such as the input, out- 
put, and processor module, as the standard unit, as 
the different sets of modules are utilized for differ- 
ent safety signal generation. 

Within a single module, failures are divided
into three main categories: HW (hardware) failure,
OS (operating system) failure, and AS (application
software) failure. Although one of these failures
can affect another, their original causes are
assumed to be independent of one another. As for the
AS and OS, because the development process,
developer, and functions of each are different, the two
need to be considered separately. Furthermore, a
different AS can be applied to the same base (HW and
OS), and thus the AS and OS should be distinguished
in order to consider the CCF in a more realistic way.

The effect of a fault tolerant technique can be
reflected in such a way that the technique detects
some portion of the HW, OS, and AS failures
in each module and treats them in a fail-safe way
so that they do not lead to a module failure; the
reliability of the technique itself can be considered
together in this approach.


5.2 General frame of reliability model for a 
module failure 


Figure 4 shows the general frame of the reliability
model for a module failure. This frame is applicable
to all modules as a basic structure, but only some
of its parts may be modeled, depending on the module.
For example, the processor module has a specific AS
for comparing the set points and input value, but
the input/output modules are composed of only an
HW and an OS, and do not have an AS.

The safety signal is generated by a sequential
process in which a value generated in one module is
passed to another module and a value generated in
that module is passed on to the next module.
Therefore, the linkage between the modules can be
modeled such that the failure of the previous module
is input as the ‘signal transmission failure’ and the
failure of that module is in turn connected to the
‘signal transmission failure’ of the next module's
reliability model.

Failures of the HW, OS, and AS in each module
lead to a module failure when they are not detected
by the FTT or are detected but cannot be handled
properly by the FTT. Taking the HW failures
illustrated in Figure 4 as an example, if an HW failure
is not detected by the FTT, it simply leads to a
failure of the module. On the other hand, when an HW
failure is detected by the FTT, it can be treated in
a fail-safe way when the FTT works properly.
Therefore, the reliability of the relevant FTT needs
to be considered together in this case. The
reliability of the FTT itself proportionally affects
the reliability of the entire DI&C, and the degree of
the influence depends on the magnitude of the fault
detection coverage that the FTT has for each fault.


[Figure 4 labels: Module failure; signal transmission failure; undetected failure (HW failure probability, 1 - detection probability); detected-FTT failure (detected failure: HW failure probability, detection probability; FTT failure)]


Figure 4. General frame of reliability model for a
module failure.
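
The structure of the module failure frame can be illustrated with a short Python sketch. It assumes that HW, OS, and AS failures, the FTT failure, and the upstream signal transmission failure are mutually independent, which is a simplification made here for illustration; the class name, function names, and all numerical values are hypothetical and are not data from this study.

```python
from dataclasses import dataclass


@dataclass
class FailureSource:
    """One failure category (HW, OS, or AS) of a module; values are hypothetical."""
    p_fail: float      # failure probability of this category
    coverage: float    # fraction of its failures detected by the FTT
    p_ftt_fail: float  # probability that the FTT fails to handle a detected failure


def source_contribution(src: FailureSource) -> float:
    """Undetected failure OR (detected failure AND FTT failure)."""
    undetected = src.p_fail * (1.0 - src.coverage)
    detected_but_ftt_failed = src.p_fail * src.coverage * src.p_ftt_fail
    # The two branches are mutually exclusive, so their probabilities add.
    return undetected + detected_but_ftt_failed


def module_failure_probability(sources, p_signal_transmission_failure=0.0):
    """OR over the upstream signal transmission failure and all failure categories."""
    p_ok = 1.0 - p_signal_transmission_failure
    for src in sources:
        p_ok *= 1.0 - source_contribution(src)
    return 1.0 - p_ok


if __name__ == "__main__":
    hw = FailureSource(p_fail=1e-3, coverage=0.9, p_ftt_fail=0.01)
    upstream = module_failure_probability([hw])
    # The failure of one module is the 'signal transmission failure' of the next.
    downstream = module_failure_probability([hw], upstream)
    print(upstream, downstream)
```

The sketch makes the role of coverage visible: with zero coverage the module inherits the full failure probability of the category, while with full coverage only the FTT failure term remains.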

In certain cases, various FTTs are applied to
a specific module simultaneously, in which case,
regarding the detection probability and FTT
failure, it is necessary to integrate the effects of
the various FTTs or to model the effect of each in
detail. Either approach can be built as an extension
of this structure.
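
One possible way to integrate partially overlapping detection coverages is sketched below in Python. It is only an illustration under an assumption introduced here: each FTT is described by the set of fault modes it can detect, and the combined coverage is the rate-weighted share of fault modes detected by at least one FTT. The fault mode names and rates are hypothetical.

```python
def combined_coverage(fault_rates: dict[str, float],
                      detected_by: list[set[str]]) -> float:
    """Rate-weighted fraction of fault modes detected by at least one FTT."""
    covered = set().union(*detected_by)  # overlap is counted only once
    total = sum(fault_rates.values())
    detected = sum(rate for mode, rate in fault_rates.items() if mode in covered)
    return detected / total


if __name__ == "__main__":
    # Hypothetical relative rates of three fault modes of one module.
    rates = {"f1": 2.0, "f2": 1.0, "f3": 1.0}
    self_diagnostic = {"f1", "f2"}  # hypothetical coverage of the self-diagnostic
    automatic_test = {"f2"}         # hypothetical coverage of the AT module
    print(combined_coverage(rates, [self_diagnostic, automatic_test]))
```

Because the overlapping fault mode is counted once, the combined coverage is smaller than the sum of the individual coverages, which is the effect that a detailed model of each FTT would need to capture.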


5.3 Example application: Steam generator low
water level trip


In order to verify the feasibility of the proposed
frame, as an example, a fault tree was developed
for the configuration of the DRPS shown in
Figure 3. As a result of analyzing which trip signal
is required first for 18 initiating events of the
OPR-1000 nuclear reactor model, the steam generator
(SG) low water level trip signal was most
likely to occur first [Cho et al., 2016]. The following
assumptions were made during the development of
the reliability model for the failure of this trip
signal generation: the sensor for the SG water level
is connected to each AIM1 in each rack; FOM integrity
of both the transmitting and receiving sides
should be guaranteed for the signal transmission
from the bistable logic to the coincidence logic; and
the failure of hardwired connection is ignored.

Figure 5 shows a part of the reliability model
for the typical configuration of the DRPS shown
in Figure 3, and Figure 6 shows an example of the
application of the general frame (Figure 4) for a
module failure regarding the processor module
HW failure shown in Figure 5.


Figure 5. Example of reliability model of the general configuration of DRPS.


Figure 6. Example of application of the module
reliability model (Processor module failure in Figure 5).


6 DISCUSSION OF PROPOSED APPROACH 


Although the failure information needs to be 
obtained to develop the actual reliability model, in 
the first phase of this study, we focused on the devel- 
opment of the general reliability frame for the DI&C 
system. On the other hand, the frame of the DI&C 
PSA developed through this study may suggest the 
required characteristics of the reliability data to be 
collected or the direction for securing them. 

The important failure types identified from
the example application are basically forms
of CCF, which can defeat the independence and
redundancy in the following cases, resulting in the
failure of the entire system.


— Bistable logic input module HW/OS CCF 
— Bistable logic PM HW/OS/AS CCF 

— FOM HW CCF and OS CCF 

— Coincidence logic PM HW/OS/AS CCF 
— Coincidence logic DOM HW/OS CCF 


Because it is difficult to compare the objective
significance of the above cases through this
qualitative analysis, their importance needs to be
reconfirmed by developing and applying appropriate
reliability data and comparing the impacts on the
overall system. Although applicable reliability data
have not been obtained yet, we can consider their
relative priority under the following expected
conditions.


— Regarding the CCF group, OS and HW are expected to be grouped
into the same size for each module, and the size of the AS CCF
group is expected to be relatively small
(CCF group size: HW = OS > AS).

— For the CCF parameter, the alpha factor can be applied for the
HW CCF, but the beta factor is more likely to be valid for the
OS and AS CCF (value of CCF parameter: OS = AS >> HW).

— The reliability data values are expected to decrease in the
order HW, AS, OS (value of reliability data: HW >> AS > OS).
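
To make the distinction between independent and common cause contributions concrete, the following minimal sketch applies the beta-factor model to a redundant pair of modules. The failure probability and beta value are purely hypothetical placeholders, since no reliability data have been obtained yet, and the function names are illustrative. The rare-event approximation used for the pair is an additional assumption.

```python
def beta_factor_split(q_total, beta):
    """Split a module failure probability into independent and CCF parts.

    Beta-factor model: a fraction `beta` of the total failure probability
    is attributed to common cause failure (CCF), the rest is independent.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return (1.0 - beta) * q_total, beta * q_total


def redundant_pair_failure(q_total, beta):
    """Rare-event approximation for a redundant pair (fails if both fail).

    Both modules fail either independently (q_ind squared) or together
    through the common cause (q_ccf).
    """
    q_ind, q_ccf = beta_factor_split(q_total, beta)
    return q_ind ** 2 + q_ccf


# Hypothetical numbers, for illustration only.
q_module = 1.0e-4   # assumed total failure probability of one module
beta = 0.05         # assumed CCF fraction
q_pair = redundant_pair_failure(q_module, beta)
print(f"pair failure probability ~ {q_pair:.3e}")
```

With these placeholder values the CCF term is several orders of magnitude larger than the independent term, which illustrates why CCF can defeat redundancy.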


Finally, in addition to the above topics (CCF group, CCF
parameter, and reliability data), the following points should be
further investigated or confirmed for the development of a more
advanced DI&C PSA.


— The self-reliability of various FTTs and the com- 
bined fault detection coverage for each module 
— Interdependency between SW and HW 


7 CONCLUDING REMARKS 


In this study, a method for DI&C PSA modeling of the automatic
safety signal generation was proposed, based on a general
reliability model frame of a module. Although some assumptions
and open points remain to be checked to confirm the validity of
this methodology, the authors think that the complex
relationships between the elements composing the DI&C system can
be modeled objectively by using the suggested frames.


ABSTRACT: Complex technical systems assist air traffic controllers in managing air traffic in
the airspace. One of their most important tasks is to provide the controller with a visualization of the
traffic situation, thereby supporting situational awareness and correct decisions. In practice, errors
sometimes occur despite efforts to improve the reliability of visualization systems. The purpose of this
paper is to present a method for determining the impact of different error types on air traffic safety. The
assessment of the threat level is influenced by subjective factors and cannot be expressed precisely.
Therefore, fuzzy reasoning theory has been used. The developed fuzzy model has been implemented as a
tool for experimental simulation of the impact of various factors on the traffic safety assessment. As the
results of the experiments indicate, the most important determinants of safety are the time during which
the air traffic controller remains unaware of the breakdown and the total time he/she does not have full
knowledge of the traffic situation. It can therefore be stated that a key role in the proper operation of the
air traffic visualization system, and thus in the restoration of full situational awareness, is played by
self-diagnostic systems that can restore the visualization system’s correct functioning without the
controller even being aware that an error occurred. Their role in assuring safety might be even greater
than that of redundancy, which is commonly used.


1 INTRODUCTION 


The airspace is divided into smaller volumes called sectors, in
each of which a radar controller is responsible for the safety
of aircraft. One of the most important determinants of the
ability to issue appropriate clearances is precise information
about the traffic situation. The controllers are supported by
Air Traffic Control (ATC) systems. Their key subsystem is the
Traffic Situation Visualization System (TSVS), which is
responsible for representing the aircraft’s positions. ATC
systems are constructed with awareness of their role in air
traffic safety. Nonetheless, software bugs and hardware failures
still occur. Among their most severe effects are errors in the
operation of visualization systems.

An error in the visualization system causes a partial loss of
the controller’s situational awareness (Endsley & Smolensky
1998). Depending on the complexity of the situation and the
error type, various risks to air traffic safety may arise. In
this paper, different types of visualization system errors are
analyzed, their impact on air traffic safety is assessed
quantitatively, and finally some case studies are conducted,
vulnerability is examined and opportunities to reduce the risk
are sought.


There are many studies examining the various factors affecting
air traffic safety, as well as evaluating safety itself. Many
authors suggest a quantitative approach to the analysis and
assessment of safety and security (Lee 2006, Ali et al. 2015,
Vismari & Camargo 2011, Skorupski 2015, Skorupski 2016,
Patriarca et al. 2017, Stroeve et al. 2015).

In typical ATC systems, information about 
three-dimensional (3D) scenery is displayed with 
a two-dimensional representation. Bagassi et al. 
(2010) presented an innovative concept based on 
a four-dimensional (4D = 3D space + time) visu- 
alization. A new working environment containing 
special information was proposed by Rohacs et al. 
(2016). 

Many papers have been devoted to analyzing controllers’
information perception (Moehlenbrink & Papenfuss 2011, Inoue
et al. 2012). Ahlstrom (2005) analyzed the effect of improper
construction of visualization systems on the probability of
causing a threat to air traffic safety. Kesseler & Knapen (2000)
highlighted the need to consider the interactions between the
controllers, ATC systems and functions offered by individual
systems.

Many papers suggest using fuzzy logic methods and tools in the
area of air traffic management (Hadjimichael 2009, Teodorović &
Lučić 1998, Lower et al. 2016). The research by Xianfeng &
Shengguo (2012) and Skorupski & Uchronski (2015, 2016) attempts
an airport security assessment in which the human factor is
taken into consideration. Other examples of the use of fuzzy
methods in air traffic management can be found in Babić & Krstić
(2000) and Netjasov (2004).

The literature review indicates the need to analyze the
visualization system errors that occur in air traffic
controllers’ practice. This analysis will be the basis for
assessing the risk caused by different types of errors. This
assessment has a significant degree of subjectivity and cannot
be quantified unequivocally. In such situations, methods that
deal well with uncertain and imprecise information are applied.
In our paper, fuzzy logic, and more precisely fuzzy reasoning
systems, is used to develop an expert advisory system that
assigns different error types to hazard classes.

It is also important to look for the kinds of 
errors that have the greatest impact on safety. Sev- 
eral distinct factors are considered, such as the 
ability to identify the error quickly or the availabil- 
ity of backup resources. The rest of the paper is 
organized as follows. Section 2 outlines the essence 
of visualization systems’ errors. Section 3 provides 
a brief introduction to the theory of fuzzy rea- 
soning systems. Section 4 describes a fuzzy model 
for assessing the hazard caused by TSVS errors. 
Section 5 shows the results of several simulation 
experiments using a computer tool created in the 
SciLab environment. Section 6 provides a sum- 
mary and final conclusion. 


2 AIR TRAFFIC CONTROL SYSTEMS 


The main task of ATC systems is to assist air traf- 
fic controllers in ensuring a safe and effective flow 
of air traffic, but their range of applications is 
much wider. That is why they are often called Air 
Traffic Management (ATM) systems. 


2.1 General structure of ATC systems 


The main part of an ATM system is usually a server processing
data from multiple sources. The following main data processing
modules can be listed:


— surveillance data processing module, 

— tracker—its task is to follow objects based on 
surveillance data, 

— flight data processing module, 

— decision supporting modules. 


2.2 Air traffic controllers’ work technology


Actions performed by radar controllers depend on many factors,
such as the type and distribution of traffic streams, traffic
volume or airspace availability. To ensure the safety and
efficiency of air traffic, the controller needs information.
The most important source of traffic information is the
visualization system. It allows the position of the aircraft to
be determined relative to different objects.

Anticipating future positions of the aircraft is an 
essential element of the controller’s work. A func- 
tion of displaying routes according to current flight 
plans and so-called vectors (predicted trajectories 
of the aircraft) is used for this purpose. 

Another action of the controller interacting 
with the TSVS is to read the predicted time neces- 
sary for the aircraft to arrive at a specific point. A 
major part of the radar controller’s work is also 
monitoring the aircraft maneuvers to ensure that 
they are consistent with the expectations and do 
not jeopardize safety. 

A cardinal part of the work is verifying the data displayed by
the system, such as the aircraft’s flight level, speed, heading
and the altitude selected by the crew in the FMS (Flight
Management System). These data are essential for making the
right decisions.


2.3 The role of the visualization system 


The actions performed by the controller, which are 
presented in Section 2.2, show that the visualiza- 
tion system is a fundamental component of the 
air traffic control system. Functionally, the most 
important role of the TSVS is to assist the control- 
ler in creating an image of the current and future 
traffic situation. As this is the basis for decision 
making, the visualization system is the controller’s 
essential tool. Its role is further enhanced by inte- 
grating the TSVS with some executive features, for 
example transfer of control to an adjacent sector. 


2.4 Errors in ATC systems 


Experience in using TSVS shows that some errors 
may occur. This section contains their classifica- 
tion. All the described errors have been observed 
during operational work at the air traffic con- 
trol position at approach (APP) and area control 
(ACC) units (one of this paper’s co-authors is an 
active air traffic controller). 


2.4.1 Incorrect indication of aircraft position 

This error lies in showing the aircraft position sym- 
bol at a distance from its actual position. The most 
likely cause is a malfunction of the tracker algo- 
rithm due to an internal error or erroneous input. 
The significance of this error is determined by the 
number of incorrectly positioned symbols and 
the type of misrepresentation. The most dangerous situation
occurs when the symbols on the display are spaced apart while
the aircraft are in fact close together. In the case of
significant divergences, the error is relatively easy to spot,
but clarifying the situation generates stress and a high
workload. Furthermore, this situation is also dangerous because
it distracts the controller from observing other traffic.


2.4.2 Flight plan—track linking error 

This type of error may appear as a total or partial lack of
flight plan data that should be available when the aircraft’s
position symbol is displayed. Another variant of this error is
the inability to update the current flight plan. This function
is integrated with the visualization system as standard. The
threat to safety depends on the number of aircraft affected by
the error, the period for which the issue persists and the
number of incorrect flight plan parameters.


2.4.3 Disappearance of aircraft position symbol 
The threat level caused by this error depends on how long the
symbol is not displayed. During straight and level flight, a
temporary disappearance of the symbol of a single aircraft
causes relatively little trouble. However, when more complicated
maneuvers are performed or when the disappearance is prolonged,
the threat created by such a failure is much higher. In
addition, the disappearance of aircraft symbols prevents the
controller from monitoring the aircraft’s maneuvers, which in
some cases can substantially increase the safety risk.


2.4.4 Delayed aircraft position update 

It takes some time from the moment the aircraft’s position is
measured to the moment it is displayed in the TSVS. This delay
consists of the data transmission time, the processing time and
the hardware delay. As long as the delay is approximately
constant, the error may be compensated. The problem occurs when
this value is variable. In such a situation, an aircraft
movement may be displayed unrealistically. An error of this kind
is easy to spot and is not a serious problem as long as the
image is constantly visible and the deviations are not large.
The problem, however, arises when the crew performs an incorrect
maneuver or crosses the final approach track.


2.4.5 Conflict warning systems malfunctions 
ATC systems are equipped with functions for short- and
medium-term conflict detection (STCA and MTCD). All of the
above-mentioned shortcomings and errors of the positioning
algorithms, such as information delay and incorrect input data,
may indirectly lead to malfunctions of the conflict warning
systems. The most serious case is when a necessary STCA or MTCD
warning is missing or is activated too late. However, this type
of error is extremely rare. So-called false alarms are much more
widespread. False alarms may also be caused by errors in the
STCA and MTCD algorithms. Whatever the cause, false alarms can
pose a threat.


2.4.6 Total loss of image 

An error of this type may be caused by a restart of 
the workstation resulting from an internal system 
error, a major technical failure or even a terrorist 
activity. This error can be divided into two kinds: the 
image disappears entirely or just stops refreshing. 
The latter situation is obviously better from a safety 
point of view, as it allows the controller to use his- 
torical data to build a picture of the traffic situation 
for some time. Depending on the number and type 
of the protection systems (additional power sources, 
additional data lines) the period for which the image 
is lost may differ. Lack of image on several worksta- 
tions is even more critical as there is no opportunity 
to use the picture at the workstation nearby. 


3 FUZZY REASONING SYSTEMS 


The problem discussed in this paper is distinguished by two
major characteristics. On the one hand, we consider a
socio-technical system in which the role of the human factor is
crucial. The result is a high degree of subjectivity in
opinions, because controllers are not equally prone to making
mistakes arising from a faulty indication in the TSVS.

On the other hand, the problem is characterized by high
ambiguity and lack of precision. Errors in the visualization
system are unpredictable: not only can we not predict when they
will appear, but we are also unable to define their type
precisely, because errors can result from numerous causes.

In such situations, the literature recommends using tools and
methods suitable for problems of epistemic uncertainty, i.e.
ones in which full knowledge of the phenomenon is unavailable.
In such cases, expert opinions have to be used. Such opinions
are very often formulated in a descriptive and imprecise way.

Among the possible approaches, we have chosen to use fuzzy
logic, in particular fuzzy reasoning systems. Zadeh (1965)
created the basis for modern applications of fuzzy logic.

A fuzzy set A in X is defined as the set of pairs

$$A = \{(x, \mu_A(x)) : x \in X,\ \mu_A(x) \in [0, 1]\}$$

where $\mu_A$ is the membership function of this set and X is
the set under consideration (the universe of discourse).

A linguistic variable is a variable whose values 
are words or sentences in a natural or artificial lan- 
guage. These words or sentences will be called the 
linguistic values of a linguistic variable. 

Within the reasoning process, we will use an input
fuzzification block, a reasoning block using fuzzy rules and a
defuzzification block. The rule sets will be created using the
opinions of experts, in particular air traffic controllers. We
will use the so-called compositional method of reasoning
introduced by Zadeh (1973), which uses the generalized “modus
ponens” fuzzy reasoning rule.
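
As a rough illustration of the compositional rule of inference, the sketch below works on small discrete universes. It assumes the min operator for the fuzzy relation (the Mamdani choice) and max-min composition; the paper does not state which operators were used, and the membership grades and function names are made up for the example.

```python
def relation(a, b):
    """Fuzzy relation R(x, y) = min(A(x), B(y)) for the rule IF x is A THEN y is B."""
    return [[min(ax, by) for by in b] for ax in a]


def compose(a_observed, r):
    """Max-min composition: B'(y) = max over x of min(A'(x), R(x, y))."""
    n_out = len(r[0])
    return [
        max(min(ax, row[j]) for ax, row in zip(a_observed, r))
        for j in range(n_out)
    ]


# Made-up membership grades on three-point universes X and Y.
A = [0.0, 0.5, 1.0]   # antecedent, e.g. "time is long"
B = [0.0, 0.5, 1.0]   # consequent, e.g. "threat is high"
R = relation(A, B)

# Observing exactly A reproduces B (classical modus ponens as a special case).
exact = compose(A, R)
# An observation that only partly matches A yields a clipped conclusion.
partial = compose([0.5, 0.5, 0.0], R)
print(exact, partial)
```

The second call shows the "generalized" aspect: the observed fuzzy set does not have to equal the antecedent for a conclusion to be drawn, but the conclusion is then weaker.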


4 FUZZY REASONING SYSTEM FOR 
THREAT LEVEL ASSESSMENT CAUSED 
BY ERRORS IN VISUALIZATION 
SYSTEMS 


In this section, the fuzzy reasoning system that allows an
assessment of the threat caused by errors in the visualization
system is presented. Factors influencing threat assessment are
introduced, as well as their representation by linguistic
variables that are inputs to the fuzzy inference system. The
knowledge base plays a critical role in this system. We have
gained it from experts, namely air traffic controllers. A
computer application created in the SciLab environment allows
specific breakdown situations to be assessed.


4.1 Factors influencing the threat level assessment 


4.1.1 Degree of situational awareness loss 

The concept of ‘situational awareness loss’ generally describes
all situations in which a controller or pilot is not entirely
aware of the current traffic situation. Visualization errors
cause safety risks only if they cause a loss of situational
awareness for the air traffic controller. The level of threat
depends on the degree of loss of situational awareness, that is,
on how much the image of the traffic situation created in the
controller’s mind differs from reality.


4.1.2 Awareness of errors in the situation image 
The problem of the loss of situational awareness is 
linked with the issue of the controller’s belief that 
the picture of the traffic situation he/she has created 
in mind is correct. In some cases, he/she may be con- 
vinced that the image is proper, while reality is dif- 
ferent. That is the most dangerous case because the 
controller will continue to work based on the wrong 
image without taking any verification action. 
Knowledge of the existence of a malfunction may 
appear after some time, which depends on the obvi- 
ousness of this error. The period from the moment 
the error occurs until the controller learns about it 
will be used as an evaluation criterion for this factor. 


4.1.3 Time of situational awareness loss 

Another factor affecting the degree of threat is the 
time interval in which the controller, because of a 
system error, does not have full situational aware- 
ness. It may range from a few seconds to dozens of 
minutes. The longer the time, the greater the safety 
threat. For this factor, we will use a judgment based 
on the anticipated time of situational awareness loss. 


4.1.4 Backup resources availability 

When a controller is aware of the error, especially 
if the error persists for an extended period, he/she 
will try to use other available resources to keep 
him/her aware of the situation. A straightforward 
and efficient manner is to use: 


— an image in another workstation, 
— a backup system, 
— another data source of aircraft positions. 


An important remedy is the use of traditional 
flight progress strips, which can be employed when 
no traffic picture is available. 

Backup resources availability should be consid- 
ered in two ways. Firstly, it is necessary to deter- 
mine whether backup resources are available at all 
and, secondly, to assess their quality. 


4.1.5 Human factor 

Perhaps the most important factor influencing the assessment of
the safety threat in case of a visualization system failure is
the human factor. One of its key components is the level of
training. Certain standards and requirements must be met by all
controllers, but the differences between individuals may affect
their ability to deal with an emergency.

In addition, the experience of the controller, which can be
expressed either by the number of years of work or by the number
of hours worked at the position, affects the level of threat.

On the other hand, human perceptual abilities decline with age,
so an older person is less able to perceive and remember, and
this can have an adverse impact on actions in a particular
situation. Another, equally important component of the human
factor that affects the level of threat is the psychophysical
condition of the controller.


4.2 General structure of the Threat level fuzzy 
inference system 


The scheme of the fuzzy model for assessing the level of threat
caused by an error in the visualization system is shown in
Figure 1. The output variable Threat level depends on five input
variables: Loss of situational awareness, Time without knowledge
of error, Time of situational awareness loss, Quality of
remedies and Human factor. The last of these input variables is
the output of a local fuzzy reasoning system with four input
variables: Experience, Age, Training and Psychophysical
condition.

Figure 1. General scheme of the Threat level fuzzy model.

The form of membership functions of linguistic 
variables values, the basis for their determination 
and the knowledge bases of both fuzzy reasoning 
models will be presented in subsequent sections. 


4.3 Input linguistic variables of fuzzy 
reasoning system 


4.3.1 Loss of situational awareness 

The degree of divergence of the traffic situation 
picture in the controller’s mind relative to reality 
can be assessed based on the difference between 
the actual position of the aircraft and the position 
indicated (incorrectly) by the visualization system. 
We relate this value to the separation minimum that is
obligatory in a given airspace. Based on expert knowledge, it
has been assumed that the linguistic variable Loss of
situational awareness will take one of six values: slight,
small, significant, serious, large, total. We propose the use of
an integrated indicator of the following form to determine the
value of the linguistic variable:


$$d = \max\left( n \cdot e^{\delta_h / S_h},\; n \cdot e^{\delta_v / S_v} \right)$$

where d is the integrated indicator determining the degree of
situational awareness loss; n is the number of aircraft whose
visualized positions do not correspond to their actual
locations; $\delta_h$ and $\delta_v$ are the deviations of the
imaged position from the actual position of the aircraft in the
horizontal and vertical plane respectively; and $S_h$ and $S_v$
are the separation minima obligatory in the considered airspace
in the horizontal and vertical plane respectively.
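
The indicator can be evaluated directly. The sketch below assumes the form d = max(n·exp(δ_h/S_h), n·exp(δ_v/S_v)), which is a reconstruction from the definitions above; the function and parameter names are illustrative. The horizontal separation minimum of 7 NM is not stated in the text: it is an assumption, chosen because it reproduces the indicator values reported for the simulated scenarios (2.7 for a 7 NM deviation and 8.5 for a 15 NM deviation of one aircraft). The vertical minimum of 1000 ft is likewise assumed.

```python
import math


def awareness_loss_indicator(n, dev_h, dev_v, sep_h, sep_v):
    """Integrated indicator d of situational awareness loss.

    n      -- number of aircraft displayed at a wrong position
    dev_h  -- horizontal deviation (same unit as sep_h)
    dev_v  -- vertical deviation (same unit as sep_v)
    sep_h  -- horizontal separation minimum
    sep_v  -- vertical separation minimum
    """
    return max(n * math.exp(dev_h / sep_h), n * math.exp(dev_v / sep_v))


SEP_H_NM = 7.0      # assumed horizontal separation minimum [NM]
SEP_V_FT = 1000.0   # assumed vertical separation minimum [ft]

# One aircraft displayed 7 NM from its true position, no vertical error.
d_7nm = awareness_loss_indicator(1, 7.0, 0.0, SEP_H_NM, SEP_V_FT)
# One aircraft displayed 15 NM from its true position.
d_15nm = awareness_loss_indicator(1, 15.0, 0.0, SEP_H_NM, SEP_V_FT)
print(round(d_7nm, 1), round(d_15nm, 1))  # prints: 2.7 8.5
```

Because the deviation enters through an exponential, d grows quickly once the displayed position is off by more than one separation minimum, and it scales linearly with the number of affected aircraft.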

Membership functions of the values of the linguistic variable
Loss of situational awareness are shown on a logarithmic scale
in Figure 2.

Figure 2. Membership functions of values of Loss of
situational awareness linguistic variable.


4.3.2 Time without knowledge of error 

The level of safety threat depends on whether the 
controller is aware of the occurrence of an error or, 
more specifically, how long he/she is not. Accord- 
ingly, we have used the linguistic variable Time with- 
out knowledge of error, which can take five values: 
very short, short, average, long, very long. The mem- 
bership functions of the values of this linguistic 
variable were adopted based on expert knowledge, 
and their logarithmic form is shown in Figure 3. 
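
A minimal sketch of how membership functions defined on a logarithmic time axis might be evaluated is given below. The breakpoints (peaks at 1, 10, 100, 1000 and 10000 s) and the triangular shape are hypothetical; the actual functions were set by experts and are only available graphically in Figure 3.

```python
import math

# Hypothetical peak positions on a log10(seconds) axis.
PEAKS = {
    "very short": 0.0,
    "short": 1.0,
    "average": 2.0,
    "long": 3.0,
    "very long": 4.0,
}


def time_memberships(seconds):
    """Degrees of membership of a time value in each linguistic value.

    Triangles of half-width 1 on the log10 axis; times below 1 s or above
    10000 s are clamped so that the outer values act as shoulders.
    """
    u = math.log10(max(seconds, 1.0))
    u = min(u, 4.0)
    return {name: max(0.0, 1.0 - abs(u - peak)) for name, peak in PEAKS.items()}


for t in (0, 10, 60):
    grades = {k: round(v, 2) for k, v in time_memberships(t).items() if v > 0}
    print(t, grades)
```

On such an axis a time of 60 s belongs partly to short and partly to average, and the two grades sum to one because neighbouring triangles overlap symmetrically.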


4.3.3 Time of situational awareness loss 
Depending on the type of error, the period in 
which the situational awareness of the controller 
is disturbed differs. In case of a short-term disap- 
pearance of the track of individual aircraft, con- 
troller’s situational awareness is usually maintained 
all the time. In case of major system malfunction, 
the time of situational awareness loss may be long 
and is not necessarily the same as the time when 
the system works incorrectly. 

The linguistic variable Time of situational 
awareness loss may take five values: very short, 
short, average, long, very long (Figure 4). 


4.3.4 Quality of remedies 

A controller who is aware of the dysfunctional operation of the
TSVS will seek to use other available means to ensure air
traffic safety. As already mentioned, the possibility of using
them in each situation will be considered within two categories:
availability and quality of the available remedies. The
linguistic variable Quality of remedies will take four values:
none, low, average, high.

Figure 3. Membership functions of values of Time
without knowledge of error linguistic variable.

Figure 4. Membership functions of Time of situational
awareness loss linguistic variable values.

Table 1. Fuzzy inference rules for the local model Threat level.

Rule    Loss of situational  Time without        Time of situational  Quality of  Human      Threat
number  awareness            knowledge of error  awareness loss       remedies    factor     level
1       total                any                 ≠ very short         none        any        very high
9       large                any                 short                none        any        high
27      significant          very short          average              none        very high  average
30      significant          average             average              average     average    high
40      small                any                 average              any         very high  low
45      slight               any                 very short           any         any        very low


4.3.5 Human factor

For an assessment of the influence of the human factor on the
degree of safety threat in the event of a TSVS failure, we will
use combined information about the professional experience, age,
level of training and psychophysical condition of the
controller. The linguistic variable Human factor, which
describes the general ability to cope with a failure of a
visualization system, will take five values: very low, low,
average, high, very high. It will be determined by the result of
the local fuzzy reasoning system with four inputs: Experience,
Age, Training and Psychophysical condition.

The linguistic variable Experience will take 
three values: low, average and high, and will be 
determined by the number of years of operation 
at the radar control position. The linguistic vari- 
able Age will take three values: young, middle and 
old. The linguistic variable Training will take one 
of three values: poor, average and good, and will 
be determined by the controller’s level of training 
in procedural control. The linguistic variable Psy- 
chophysical condition will take one of three values: 
poor, average, good. 


4.4 Output variables of the fuzzy reasoning 
systems 


Both local models, Human factor and Threat level, are
Takagi-Sugeno-Kang models with singleton output values. The
Human factor variable, discussed in Section 4.3, is also an
input variable for the Threat level model. The Threat level
variable, in turn, will take five values: very low, low,
average, high and very high (Figure 5). At the output of the
fuzzy reasoning system, the level of safety threat caused by the
visualization system error is evaluated as a real number from
the interval [1, 5].
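
For a Takagi-Sugeno-Kang model with singleton outputs, the crisp result is commonly taken as the average of the singleton values weighted by the rule firing strengths. The sketch below assumes this weighted-average form and assumes singletons 1 to 5 for the values very low to very high, which is consistent with the output interval but is not stated explicitly in the text; the firing strengths are invented.

```python
# Assumed singleton positions of the Threat level values.
SINGLETONS = {
    "very low": 1.0,
    "low": 2.0,
    "average": 3.0,
    "high": 4.0,
    "very high": 5.0,
}


def tsk_output(firing_strengths):
    """Weighted average of singleton consequents.

    firing_strengths maps an output value name to the aggregated firing
    strength (in [0, 1]) of the rules that conclude that value.
    """
    total = sum(firing_strengths.values())
    if total == 0.0:
        raise ValueError("no rule fired")
    weighted = sum(SINGLETONS[name] * w for name, w in firing_strengths.items())
    return weighted / total


# Invented firing strengths: rules concluding "low" fire more strongly
# than rules concluding "average".
threat = tsk_output({"low": 0.75, "average": 0.25})
print(threat)  # prints: 2.25
```

A result such as 2.25 would be read as a low threat with a slight shift towards average, which is the kind of statement used when interpreting the simulation results.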


Figure 5. Membership functions of values of Threat
level linguistic variable (horizontal axis: safety threat level).


4.5 Knowledge base of the fuzzy reasoning system 


The knowledge in inference systems describing 
complex socio-technical systems is subjective to 
some degree and as such impossible to quantify 
precisely. Therefore, fuzzy inference rules based on 
expert knowledge have been applied. One of the 
authors of the paper is an active air traffic control- 
ler, but for more credibility of the knowledge base, 
the rules have been verified with other field experts. 

In the Human factor fuzzy reasoning system, 
81 fuzzy inference rules are defined. In the Threat 
level fuzzy reasoning system, 49 rules have been 
defined, some of which are shown in Table 1. 
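
The size of the Human factor rule base is consistent with one rule per combination of its four three-valued inputs (3^4 = 81). The sketch below enumerates these combinations and shows how rules with the wildcard "any", such as those in Table 1, can cover many input combinations at once. Matching here is crisp (a label either matches or it does not), which is a deliberate simplification of fuzzy rule firing; the data structures and function names are illustrative.

```python
from itertools import product

HUMAN_FACTOR_INPUTS = {
    "Experience": ["low", "average", "high"],
    "Age": ["young", "middle", "old"],
    "Training": ["poor", "average", "good"],
    "Psychophysical condition": ["poor", "average", "good"],
}

# Every combination of input values: 3 * 3 * 3 * 3 = 81.
combinations = list(product(*HUMAN_FACTOR_INPUTS.values()))

# Three of the Threat level rules listed in Table 1. Antecedent order:
# loss of situational awareness, time without knowledge of error,
# time of situational awareness loss, quality of remedies, human factor.
RULES = {
    9: (("large", "any", "short", "none", "any"), "high"),
    30: (("significant", "average", "average", "average", "average"), "high"),
    45: (("slight", "any", "very short", "any", "any"), "very low"),
}


def matches(antecedent, facts):
    """True if every antecedent label is 'any' or equals the given label."""
    return all(a == "any" or a == f for a, f in zip(antecedent, facts))


def fired_rules(facts):
    """Numbers of the rules whose antecedent matches the given labels."""
    return [num for num, (ante, _) in RULES.items() if matches(ante, facts)]


print(len(combinations))
print(fired_rules(("large", "long", "short", "none", "low")))
```

Wildcards are what allow the Threat level model to be described with 49 rules even though its five inputs admit far more combinations of values.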

The fuzzy model for evaluation of the safety 
threat to air traffic caused by visualization sys- 
tem errors has been implemented in the SciLab 
5.4 environment with Fuzzy Logic Toolbox 
package. 


5 SIMULATION EXPERIMENTS 


The developed model, together with its computer implementation,
allows us to assess the influence of errors in the TSVS on
traffic safety. Simulation experiments have been carried out to
check the usability of this software tool for threat assessment
and also for the selection of the most critical system
components that influence the safety of air traffic control.
Some of these experiments are described in this section. For all
experiments, it was assumed that the controller who encountered
the error had average values of the individual characteristics
(age, experience, training, psychophysical condition).


5.1 Scenario S1 — minor failure 


The analyzed scenario (S1) can be described as follows. Due to
an anomaly in the tracker subsystem, one of the tracks
representing an aircraft stops and does not change its position.
Because of heavy traffic, the air traffic controller does not
notice the error, and after one minute the system automatically
switches to a backup tracker. The maximum deviation of the
displayed radar position from the actual aircraft location was
7 NM. The aircraft involved was performing a level flight.
Remedies were not available.

The input parameters of the fuzzy inference 
system and the results of the experiment for the 
scenario S1 are given in Table 2. 

For the scenario being analyzed, the result 
obtained from the fuzzy inference system places 
the emergency in a low threat area with a slight 
shift towards the average rating (Figure 5). 


5.2 Scenario S2 — major failure 


The scenario S2 can be described as follows. Because of a power
failure, the controller completely loses indications from the
TSVS and the monitor goes blank. At the time, there are 10
aircraft in the control sector. The backup display located
nearby is available, so after one minute the air traffic
controller starts working with it. This ends the emergency. The
controller’s characteristics are the same as in the scenario S1.

The input parameters of the fuzzy reasoning 
system and the results of the experiment for the 
S2 scenario are given in Table 3. 

For the scenario being analyzed, the result 
obtained from the fuzzy reasoning system places 
the situation in a high threat area with a slight shift 
towards the average rating. 


Table 2. Results of the experiment for the scenario S1.

Parameter                                      Value
Human factor                                   3.0
Loss of situational awareness [indicator d]    2.7
Time without knowledge of error [s]            60
Time of situational awareness loss [s]         60
Quality of remedies                            none
Threat level                                   3.8


Table 3. Results of the experiment for the scenario S2.

Parameter                                      Value
Human factor                                   3.0
Loss of situational awareness [indicator d]    27.2
Time without knowledge of error [s]            0
Time of situational awareness loss [s]         60
Quality of remedies                            none
Threat level                                   2.3


Table 4. Results of the experiment in scenario S1a.

Parameter                                      Value
Human factor                                   3.0
Loss of situational awareness                  8.5
Time without knowledge of error [s]            90
Time of loss of situational awareness [s]      120
Quality of remedies                            none
Threat level                                   2.0


5.3 Sensitivity analysis in scenario S1 


The scenario S1 is characterized by a relatively small
deterioration of safety because the TSVS quickly switches to a
backup tracker. At this point, we will assume that the switch to
the backup tracker does not take place and the track does not
move for a few minutes. We will mark this as scenario S1a. After
about 90 seconds the controller notices the error and, using the
available previous generation ATC system, continues the work
after about two minutes. The difference between the displayed
and the actual position of the aircraft is 15 NM.

The input parameters of the fuzzy inference system and the
results of the experiment for scenario S1a are set out in
Table 4.

As we can see, extending the duration of the failure causes the
threat level to fall into the high rating area. That clearly
shows how dangerous these errors can be, and how important it is
to implement effective self-diagnostic means in the TSVS, which
are responsible for detecting, for example, a tracker error and
switching to a backup.


6 SUMMARY AND FINAL CONCLUSIONS 


Traffic situation visualization modules are essential elements of air
traffic control systems. They constitute a basis for building the
situational awareness of air traffic controllers. At the same time,
they concentrate the effects of all the hardware failures and software
errors which, despite the use of technology with a very high level of
reliability, can happen in practice. Regardless of their origin, system
malfunctions can result in several typical situations that have been
categorized in Section 2.4. The essence of this paper has been to
analyze the level of threat to air traffic resulting from errors of
each category, taking into account factors such as the controller’s
experience or the volume of traffic at which the error occurred.

Experiments have shown that one of the most 
important factors influencing threat assessment 
is the amount of time a controller does not have 
full knowledge of the traffic situation. The time 
depends on the awareness that we are handling 
an abnormal image of the TSVS. That, in turn 
depends on the type of error. The results of experi- 
ments carried out using the created computer tool 
confirm these observations. In addition, they allow 
for quantitative assessment. It is worth noting that 
the results indicate a crucial role of diagnostic 
modules built into ATC systems. Waiting for the 
controller to notice an error in TSVS and take any 
corrective action can significantly increase the time 
spent without complete knowledge of the traffic 
situation. It is therefore possible to provide a gen- 
eral recommendation to extend and further develop 
such systems. They can be even more important 
than redundancy that is usually used for increas- 
ing reliability of the system. In the case of redun- 
dancy, duplication of the same error can occur on 
all backup devices. In contrast, self-diagnostic systems can restore
system performance even without the controller being aware of the
malfunction.


Automated driving on steel and rubber


ABSTRACT: In this paper, we compare the principles of, and the
experience with, autonomous or automatic systems on rails and on the
street. An automatic metro operates in a controlled and well-defined
environment that makes automatic driving possible. Passengers are
separated from moving systems, e.g. by platform screen doors that allow
access only directly into a train that is at standstill. In addition,
passengers and third persons are separated from moving trains by
fences, tunnels, etc. For road vehicles, a large number of assistance
systems are currently available that are able to handle specific
situations. This leads to the impression that these vehicles can move
autonomously. However, these assistance systems are developed in such a
manner that the driver must always be able to intervene. There are only
a few exceptions with genuinely autonomous vehicles. In general, the
environment in which a road vehicle operates is much more complex than
that of a train, mainly because of unforeseen situations. We describe
the differences regarding approval for automated metros, road vehicles
and so-called Automated Guided Vehicles (AGVs). We discuss the legal
requirements for the homologation of road vehicles according to the
Convention on Road Traffic and their implications for the system and
the behavior of the driver. We sketch the current technical
possibilities for automated driving and the existing technical
solutions.


1 INTRODUCTION 


Autonomous driving on the street has become more and more popular, and
the first demonstrator systems are operational. On the other hand,
automatic metros and people movers have already been operating
successfully for many years.

In this paper, we provide a comparison between 
principles and experience of autonomous or auto- 
matic systems on rails and on the street. 

We compare the different levels of automation 
as defined by UITP and SAE and their meaning 
for the system. In addition, manual fallback modes 
are considered. 

An automatic metro is located in a controlled 
and well-defined environment that makes auto- 
matic driving possible. Passengers are separated 
from moving systems, e.g. by using platform screen 
doors that allow access only directly into the train. 

For road vehicles, a large number of assistance systems are currently
available that are able to handle specific situations. This leads to
the impression that these vehicles move autonomously.

In general, the situation for a road vehicle is much more complex than
that of a train.

We describe the differences regarding approval for automated metros,
road vehicles and so-called Automated Guided Vehicles (AGVs). We
discuss the legal requirements for the homologation of road vehicles
according to the Convention on Road Traffic and their implications for
the system and the behavior of the driver.

We sketch the current technical possibilities for automated driving and
the existing technical solutions. In particular, we discuss the
possibilities and restrictions of artificial intelligence. We briefly
describe a roadmap of possible next steps.


2 THE STATUS WITH METROS AND 
PEOPLE MOVERS 


In the meantime, automated metros and automated people movers are
operating in many cities. Examples are:


• On the New York City Subway, the BMT Canarsie Line.
• On the London Underground, the Central, Northern, Jubilee, and
  Victoria lines run with ATO.
• On the Nuremberg U-Bahn, the existing U2 and the new U3 lines have
  been converted to ATO.
• On the Barcelona Metro, the L9 (Europe’s longest driverless line),
  L10 and L11 run with ATO.
• The Rio Tinto Group has received the go-ahead for a driverless iron
  ore railway.
• The Tren Urbano has a Siemens ATC system that allows for fully
  automatic operation.
• The Vancouver SkyTrain.
• Frankfurt Airport Skyline.
• Copenhagen Metro.
• On the Milan Metro, the M1 Red Line runs with ATO.
• On the Mass Rapid Transit (Singapore), all lines currently in
  operation run with ATO (since 1987).

For metros and people movers, a principle of separation has been
applied: the automated trains are separated from all other traffic.
They run in tunnels, open track is enclosed by fences, and platform
screen doors separate the trains from passengers. This simplifies the
operating conditions significantly.

The Automatic Train Protection system (ATP) is used to prevent
collisions and derailments. It also allows manually operated trains to
use the same network.

The normal safety requirement for the ATP is safety integrity level
SIL 4. Nevertheless, manually operated fallback modes exist. In some
systems, stewards are present to assist the passengers, especially in
case of evacuation.

For metros and people movers, the UITP (2017) has established five
grades of automation (GoA). The picture is therefore not black and
white, with only manual or automated driving; automation is a stepwise
process. The five levels are as follows, UITP (2017).

GoA 0 is on-sight train operation, similar to a tram running in street
traffic (no automation at all).

GoA 1 is manual train operation where a train 
driver controls starting and stopping, operation 
of doors and handling of emergencies or sudden 
diversions. 

GoA 2 is Semi-automatic Train Operation (STO), where starting and
stopping are automated, but a driver operates the doors, drives the
train if needed and handles emergencies. Many ATO systems are GoA 2.

GoA 3 is Driverless Train Operation (DTO) 
where starting and stopping are automated but a 
train attendant operates the doors and drives the 
train in case of emergencies. 

GoA 4 is Unattended Train Operation (UTO) 
where starting and stopping, operation of doors 
and handling of emergencies are fully automated 
without any on-train staff. 
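
The five grades can be summarized as a small lookup structure. The following Python sketch only restates the descriptions given above; the class name, field names and the helper function are hypothetical and serve illustration only, they are not taken from UITP (2017).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradeOfAutomation:
    """One grade of automation as described in the text (field names are illustrative)."""
    level: int
    name: str
    automated_start_stop: bool
    automated_doors: bool
    automated_emergency_handling: bool
    staff_on_train: str  # "driver", "attendant" or "none"


GOA_LEVELS = [
    GradeOfAutomation(0, "On-sight train operation", False, False, False, "driver"),
    GradeOfAutomation(1, "Manual train operation", False, False, False, "driver"),
    GradeOfAutomation(2, "Semi-automatic Train Operation (STO)", True, False, False, "driver"),
    GradeOfAutomation(3, "Driverless Train Operation (DTO)", True, False, False, "attendant"),
    GradeOfAutomation(4, "Unattended Train Operation (UTO)", True, True, True, "none"),
]


def needs_staff_on_train(level: int) -> bool:
    """Return True if the given grade still relies on a person on the train."""
    return GOA_LEVELS[level].staff_on_train != "none"


if __name__ == "__main__":
    for goa in GOA_LEVELS:
        print(f"GoA {goa.level}: {goa.name}, staff on train: {goa.staff_on_train}")
```

Only GoA 4 removes the person on the train altogether; every lower grade keeps a human in the role of driver or attendant.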

As a conclusion, automatic metros and auto- 
matic people movers can be seen as established sys- 
tems. However, one needs to note that they operate 
in a controlled and simplified environment. 

Road vehicles: The general impression of how autonomous driving works
is dominated mainly by vehicles such as the Google vehicle or the Tesla
and by other systems that have appeared in the meantime. Simpler
systems are those for automated parking, which is carried out using a
mobile phone, the driver being outside the vehicle. Studies of
autonomous driving have been carried out with a driver on board for
testing or demonstration purposes. Automated Guided Vehicles in closed
areas and transport systems in workshops are also in use. Strictly
speaking, the latter systems are not road vehicles but moving machines.

As an example, consider the Google vehicle. This is a Smart-like
vehicle with two seats, and one can read that it drives autonomously,
with no driver action being necessary.

However, an accident has been reported, and Google said it bears “some
responsibility” after the car struck a municipal bus in Mountain View,
Google (2016). That means that the Google vehicle caused a crash. In
that case, the car would be responsible, i.e. ultimately its
manufacturer. However, the driver and his responsibility also need to
be discussed.

Another example is a Tesla vehicle that crashed into a trailer. The
driver did not react because he relied on automated driving, and he
died as a consequence of the crash. In fact, the technical driving
system of the Tesla was not able to detect the trailer. The question
then arises of who is responsible for the accident. Certainly, the
automatic systems needed permanent supervision by the driver, and the
question arises whether the driver was sufficiently instructed. It also
needs to be discussed whether the driver had the possibility to stop
the vehicle or to take over the steering. This involves the reaction
time as well as features of the technical systems.

A simple calculation for the Tesla S at the time of the accident
illustrates the problems. One fatality has occurred after 210 million
km that have been accumulated by the Tesla S model, see Focus (2016).
This yields a rate of $1/(2.1 \times 10^{8}\,\mathrm{km}) \approx 5
\times 10^{-9}$ fatalities/km. We need to admit that this “rate” has
been computed from just one fatality, and a serious statistical
investigation would show that this figure is subject to a large
uncertainty.
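
To give an impression of this uncertainty, the following Python sketch computes an exact (Garwood) two-sided confidence interval for a Poisson mean from a single observed event and converts it into a rate per kilometer. The choice of a Poisson model and of a 95 % confidence level is an assumption made here for illustration; it is not part of the original computation, and all function names are hypothetical.

```python
import math


def poisson_cdf(k: int, mu: float) -> float:
    """P(X <= k) for X ~ Poisson(mu)."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total


def bisect(f, lo: float, hi: float, tol: float = 1e-10) -> float:
    """Root of a monotone function f on [lo, hi], found by bisection."""
    f_lo = f(lo)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def exact_poisson_interval(k: int, confidence: float = 0.95):
    """Exact two-sided confidence interval for the Poisson mean given k observed events."""
    alpha = 1.0 - confidence
    search_limit = 10.0 * (k + 1) + 20.0  # generous upper end of the search range
    if k == 0:
        lower = 0.0
    else:
        # Lower bound: P(X >= k | mu) = alpha / 2
        lower = bisect(lambda mu: (1.0 - poisson_cdf(k - 1, mu)) - alpha / 2,
                       0.0, search_limit)
    # Upper bound: P(X <= k | mu) = alpha / 2
    upper = bisect(lambda mu: poisson_cdf(k, mu) - alpha / 2, 0.0, search_limit)
    return lower, upper


if __name__ == "__main__":
    km_travelled = 210e6  # km accumulated by the Tesla S model (Focus 2016)
    fatalities = 1
    lower, upper = exact_poisson_interval(fatalities)
    print(f"point estimate: {fatalities / km_travelled:.2e} fatalities/km")
    print(f"95 % interval:  {lower / km_travelled:.2e} to {upper / km_travelled:.2e} fatalities/km")
```

Under these assumptions the interval spans roughly two orders of magnitude, which is why the figure of $5 \times 10^{-9}$ fatalities/km should be read as an order-of-magnitude indication only.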

What does this rate mean in practice? We illustrate this with some data
on traffic accidents in Germany.

In 2013, private cars in Germany travelled 494 080 million kilometers,
see Statista (2016). Let us now estimate the number of additional
fatalities to be expected. We get $5 \times 10^{-9}$ fatalities/km
$\times\ 5 \times 10^{11}$ km $= 2500$ additional fatalities per year,
provided everyone used the Tesla S model.
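
The estimate can be retraced with a few lines of Python. The sketch below uses only the two figures quoted in the text and shows both the rounded computation and the result obtained without intermediate rounding; the variable names are chosen here for illustration.

```python
# Figures quoted in the text (Focus 2016, Statista 2016).
tesla_km_per_fatality = 210e6        # km accumulated before the one fatality
german_car_km_2013 = 494_080 * 1e6   # km travelled by private cars in Germany in 2013

# Without intermediate rounding.
rate_exact = 1.0 / tesla_km_per_fatality
expected_exact = rate_exact * german_car_km_2013

# With the rounded values used in the text.
rate_rounded = 5e-9       # fatalities per km
mileage_rounded = 5e11    # km per year
expected_rounded = rate_rounded * mileage_rounded

print(f"without rounding:       {expected_exact:.0f} fatalities per year")
print(f"rounded as in the text: {expected_rounded:.0f} fatalities per year")
```

Both variants give a value of the same order, about 2400 to 2500 fatalities per year, so the rounding does not change the conclusion.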

Note that a driver would have noticed the trailer 
and have reacted. 

Moreover, in the statistics only possible addi- 
tional accidents have been computed under the 
provision that the Tesla S model will be rolled out 
as it is now and that all drivers would behave as the 
one who died in the accident. On the other hand,
accidents that would have been prevented by the
Tesla S are taken into account.

This short computation shows the complexity of, and the problems
connected with, autonomous driving. These are caused mainly by the very
complex situation on the streets.

The SAE (2016) and the UN (2017) have defined the following levels:


• 0 No automation
• 1 Driver assistance
• 2 Partial automation
• 3 Conditional automation
• 4 High automation
• 5 Full automation


Detailed information on the levels is shown in Figure 1.

The systems currently available are mainly systems for assisted
driving. The assistant helps in simple situations; however, the driver
always retains full responsibility. Examples are:


• Distance assistant,
• Platooning,
• Lane assistant,
• Highway pilot for trucks.


A brief look at the approval regimes shows the differences:


• Automated metros are approved according to EN 50126, EN 50128,
  EN 50129 and local laws on metros, which differ from country to
  country.
• Road vehicles receive a European approval based on ECE rules. In
  Germany the approval authority is the KBA, in the Netherlands it is
  the RDW.
• AGVs are neither road vehicles nor trains; they are considered
  automated machines, and approval is according to the Machinery
  Directive (2006) and IEC 61508 (2011).


A new law for the homologation of road vehicles in Germany allows
automated driving in specific cases (note that this is not assisted
driving), but the driver must be able to overrule the technical system.

This is in line with the Convention (1973) on Road Traffic, which says:


• Article 8, 1: “Every moving vehicle or combination of vehicles shall
  have a driver.”
• Article 8, 3: “Every driver shall possess the necessary physical and
  mental ability and be in a fit physical and mental condition to
  drive.”
• Article 8, 5: “Every driver shall at all times be able to control his
  vehicle or to guide his animals.”


Currently, these principles are implemented in national law.

What does this mean for automated driving? The requirement “... driver
must always be able to overrule the system” leads to the following
requirements:

1. Technical possibility: the brake, the accelerator pedal and the
   steering wheel must have priority over the automatic systems. This
   must be implemented in safe controllers (ASIL C or ASIL D).

2. The driver must be present and vigilant, i.e. no sleeping, no
   messaging on the smartphone, no games such as Pokémon GO, etc.

3. The driver must have the necessary time to react if there is a
   failure of the system or an unwanted reaction.


In fact, the last point leads to the following requirements for
automatic driving:


• Braking: automatic systems must brake with a smaller deceleration
  than the driver could apply; the difference between the decelerations
  (vehicle, driver) must still allow for the reaction time of the
  driver (braking curves).
• Steering: the distance from dangerous objects (other vehicles, border
  of the lane, etc.) must be large enough to allow for the driver’s
  reaction, together with a limit on the steering angle. This might
  lead to speed restrictions.
• The driver may need special training.


Figure 2 shows an example of a brake curve, plotting speed (m/s) versus
distance. There are two curves: one for automatic braking (deceleration
3 m/s²) and one for braking by the driver (5 m/s²), where a reaction
time of 1.3 s has been taken into account for the driver. The initial
speed is 20 m/s.

In this example, the driver is still able to come to a standstill in
time if he detects that the automatic system fails to brake. Of course,
the driver must react, and must be able to react, within 1.3 s.
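
The stopping distances behind the two curves can be checked with elementary kinematics. The Python sketch below assumes constant deceleration and a constant speed during the reaction time; the function names are hypothetical, and the numerical values are those given for Figure 2.

```python
import math


def stopping_distance(v0: float, deceleration: float, reaction_time: float = 0.0) -> float:
    """Distance travelled until standstill: reaction phase plus braking phase."""
    return v0 * reaction_time + v0 ** 2 / (2.0 * deceleration)


def speed_at_distance(s: float, v0: float, deceleration: float,
                      reaction_time: float = 0.0) -> float:
    """Speed after distance s, i.e. one point of the brake curve."""
    s_braking = s - v0 * reaction_time
    if s_braking <= 0.0:
        return v0  # still in the reaction phase, speed unchanged
    v_squared = v0 ** 2 - 2.0 * deceleration * s_braking
    return math.sqrt(v_squared) if v_squared > 0.0 else 0.0


V0 = 20.0               # initial speed in m/s
A_AUTOMATIC = 3.0       # deceleration of the automatic system in m/s^2
A_DRIVER = 5.0          # deceleration applied by the driver in m/s^2
T_REACTION = 1.3        # reaction time of the driver in s

d_automatic = stopping_distance(V0, A_AUTOMATIC)
d_driver = stopping_distance(V0, A_DRIVER, T_REACTION)

# Longest reaction time for which the driver still stops within d_automatic.
t_max = (d_automatic - V0 ** 2 / (2.0 * A_DRIVER)) / V0

if __name__ == "__main__":
    print(f"automatic braking: {d_automatic:.1f} m")
    print(f"driver braking:    {d_driver:.1f} m")
    print(f"maximum admissible reaction time: {t_max:.2f} s")
```

With these values the driver comes to a standstill slightly before the point at which the automatic system would have stopped the vehicle, and the margin on the reaction time is only a few hundredths of a second. This illustrates how tightly the admissible deceleration of the automatic system is coupled to the assumed reaction time.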

For steering, similar requirements must be taken into account: the
driver must have the necessary reaction time. This reaction time
depends on the distance to the shoulder or the adjacent lane, the speed
and the reaction of the system. The latter includes the maximal angular
velocities and accelerations with which the system might show a faulty
reaction.

The current technical solutions are supported by the following existing
equipment:


• Different controllers or safe computers are available that are
  qualified up to ASIL D/SIL 4.
• Sometimes even “intelligent sensors” with a SIL are available.
• Different, diverse sensors (no SIL), which are cross-validated by the
  safe computer. Examples of such sensors are cameras, lasers, radar,
  infrared, ultrasonic, etc.
• Multiple, diverse actuators; safety relays as electric actuators; the
  use of proven mechanical systems is also possible.


Figure 2. Example of a brake curve.


3 POSSIBLE NEXT STEPS


Based on the current status, one can imagine the following future steps
for road vehicles:


• Safe guidance (lane keeping) could be implemented, e.g. using
  differential GPS together with a good update service for precise
  maps. All road works and all temporarily blocked roads need to be
  present on these maps.
• Stopping before traffic lights, enforced by wireless transmission of
  information between traffic lights and vehicles. Nevertheless, the
  driver needs to watch out for violators, e.g. cyclists, even if he
  has a green light.
• Speed limit enforcement, e.g. the speed limit is transmitted
  wirelessly from a sign broadcasting the speed limit, or the sign is
  read by a camera; alternatively, a map is used as the source.
• Handling of simple traffic situations, e.g. following the lane on
  motorways without overtaking maneuvers.
• Vehicles in separated areas and on separated road networks.


Further development leads to the following scenario, which includes:


• The road or lane might be separated by two fences, forming a
  controlled environment in which a vehicle can run automatically, with
  steering, braking and driving implemented according to ASIL D.
• Vehicles drive at very short distances using platooning.
• At certain places, entry to and exit from this network of roads is
  allowed. There, the driver takes over the automatic vehicle and
  drives it manually to the destination.
• The necessary information, such as maps, position, speed limits and
  communication with other automated vehicles, would be implemented on
  the vehicle rather than on the road.
• The infrastructure would be rather cheap, consisting of the road and
  fences. Compared with a railway, the infrastructure is more flexible:
  no signals, switches, ballast or sleepers are necessary.


In all these cases, the relevant technical systems would need to be
safe-life systems with a safety level up to ASIL D/SIL 4.

A safe-life system is a system that, in contrast to a fail-safe system,
does not switch itself off in case of a failure, but ensures the safety
function even in case of one (or sometimes several) failures.

The safety integrity levels (SIL/ASIL) are defined in standards for
functional safety. IEC 61508 and EN 50129 define SIL 1 to SIL 4.
ISO 26262 defines the automotive SIL (ASIL) A to D.


The SIL/ASIL consists of two essential requirements:


• A maximum tolerable rate of dangerous failures, which must not be
  exceeded.
• Measures against systematic failures (verification, traceability of
  requirements, specific techniques).
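
The first of the two requirements can be illustrated with the target failure measures that IEC 61508 assigns to SIL 1 to SIL 4 for high demand or continuous mode of operation. The Python sketch below is a simplified lookup; the band limits are quoted here as an assumption to be checked against the standard, the function name is hypothetical, and ISO 26262 does not define ASIL through an identical table. Meeting the rate alone is not sufficient, since the measures against systematic failures have to be fulfilled as well.

```python
# Bands of the average frequency of a dangerous failure per hour
# (high demand or continuous mode), given as (lower limit, upper limit).
# Assumption: values as commonly quoted from IEC 61508, to be verified there.
SIL_PFH_BANDS = {
    4: (1e-9, 1e-8),
    3: (1e-8, 1e-7),
    2: (1e-7, 1e-6),
    1: (1e-6, 1e-5),
}


def highest_sil_for_rate(dangerous_failures_per_hour: float):
    """Return the highest SIL whose upper band limit the given rate stays below.

    Returns None if the rate is too high even for SIL 1. Only the quantitative
    requirement is checked; systematic measures are outside this sketch.
    """
    for sil in (4, 3, 2, 1):
        _, upper_limit = SIL_PFH_BANDS[sil]
        if dangerous_failures_per_hour < upper_limit:
            return sil
    return None


if __name__ == "__main__":
    for rate in (5e-9, 5e-8, 5e-7, 2e-5):
        print(f"{rate:.0e} dangerous failures per hour -> SIL {highest_sil_for_rate(rate)}")
```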


Regarding future development, one also needs to consider the problems
that an automatic or autonomous vehicle driving on the road has to
solve in order to become comparable with a human driver. First of all,
such a system needs to distinguish objects such as persons or animals
from unmoving objects. Another example would be to distinguish vehicles
on high wheels from bridges, etc. A further problem is that the
intentions of a person or animal sometimes need to be guessed: does the
person or the animal intend to step onto the road and cross it? A
typical example would be a child standing on the sidewalk who has
dropped a ball that has rolled onto the street. Many such tasks would
require intelligence, and one would tend to use artificial intelligence
for them.

Assume now that artificial intelligence is to be implemented for
autonomous driving. Then the requirements for SIL 4/ASIL D would need
to be implemented in full rigor in the software and the hardware. On
the other hand, the algorithms for artificial intelligence are
voluminous and complex. If, for example, traceability needs to be shown
from a requirement such as “The algorithm must distinguish human beings
from other objects”, one can imagine the complexity of such a task.
This would be only one requirement. The entire set of requirements for
the software would have to take into account a large number of driving
situations, environmental conditions, etc. If the algorithm is a
self-learning algorithm, one needs to ensure that it has learned enough
within a certain time, and this must be proven in the light of the
standards IEC 61508 and/or ISO 26262. Another possibility would be to
use a proven-in-use argument with $3 \times 10^{9}$ accumulated hours
in service, see IEC 61508, part 7, annex D. With 600 hours of driving
per vehicle and year, that would mean having 5,000,000 vehicles driving
for an entire year under controlled circumstances, i.e. with trained
drivers who can override the system and who would also register all
events (or the vehicle has to do this). One can decrease the number of
vehicles by increasing the number of driving hours per year, e.g. up to
6,000, which would mean driving in shifts. Nevertheless, 500,000
vehicles would still be necessary. In addition, each change of the
software would require this approval process to be repeated.
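
The fleet sizes follow directly from dividing the required service hours by the annual driving hours per vehicle. A minimal Python sketch of this arithmetic, with a hypothetical function name and the figures from the text:

```python
def required_fleet_size(total_service_hours: int, hours_per_vehicle_per_year: int,
                        years: int = 1) -> int:
    """Number of vehicles needed to accumulate the service hours in the given time."""
    hours_per_vehicle = hours_per_vehicle_per_year * years
    # Integer ceiling division, so that a partial vehicle counts as a whole one.
    return -(-total_service_hours // hours_per_vehicle)


REQUIRED_HOURS = 3 * 10 ** 9  # accumulated hours for the proven-in-use argument

if __name__ == "__main__":
    for hours_per_year in (600, 6000):
        fleet = required_fleet_size(REQUIRED_HOURS, hours_per_year)
        print(f"{hours_per_year} h per vehicle and year -> {fleet:,} vehicles")
```

Spreading the test over several years reduces the fleet proportionally (the `years` parameter), but the effort would still have to be repeated after every software change.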

The conclusion is that solutions for the safety-relevant software must
be simpler, without guessing intentions etc., in order to overcome
these problems. Artificial intelligence would be well suited for
assistance systems.


4 CONCLUSIONS 


In this paper we have provided some considerations 
on automatic (or autonomous) driving for rail and 
road vehicles. It turns out that for road vehicles, 
the environment is much more complex than for 
rail vehicles. Therefore, the experience from e.g. 
automatic metros cannot be directly used. 

Most of the existing systems are either pure assistance systems or
dedicated to simplified traffic situations.

It is to be expected that the first safe solutions for autonomous
driving will come for situations with a simplified environment,
especially where the environment is controlled or even adapted to the
task of autonomous driving. AGVs (Automated Guided Vehicles) are a
special case here: they move only in an environment fully adapted to
them, but not on an open road.


Probabilistic analysis of faults affecting multiple trains of the electrical
power supply system of nuclear power plants


ABSTRACT: Faults simultaneously impairing multiple redundant trains of
the electrical power supply system of NPPs have recently received
growing attention from the nuclear community. This was triggered by
events at several different NPPs, including Byron in the U.S. and
Forsmark in Sweden. Such events have generally not been included in
PSAs of NPPs yet. Therefore, GRS has initiated a research project
aiming at a comprehensive and in-depth analysis of events characterized
by fault states of multiple trains of the electrical power supply
system (including but not limited to open phase conditions) and at the
development of modelling and quantification methods to include them in
PSAs. The project consists of different interacting efforts. Firstly,
the possible causes of faults affecting multiple trains of the
electrical power supply system and their consequences are assessed from
an operating and modelling perspective. The second step comprises the
development of a detailed dynamic model of the electrical power supply
system of a German PWR and the investigation of the cause and the
propagation of such faults. Then, a current PSA model of a German PWR
is extended to allow for the modelling of the phenomena identified in
the previous steps. This includes adding relevant equipment not
modelled before and new failure modes of equipment already modelled.
The additional reliability parameters and frequencies of initiating
events required to quantify the extended PSA model are estimated.
Finally, the additional failure mechanisms considered in the extended
PSA model are evaluated quantitatively.


1 INTRODUCTION 


Faults simultaneously impairing multiple trains of the electrical power
supply system of Nuclear Power Plants (NPPs) have recently received
growing attention from the nuclear community (Brück 2016). This was
triggered by events involving so-called asymmetrical faults at several
different NPPs. An asymmetrical fault results from the degradation
(e.g. an interruption) of one or two of the three phases in a
three-phase alternating current system. For example, at the Byron NPP
in the U.S., asymmetries in the power supply system arose from a single
failure of an insulator in the switchyard of the plant. The asymmetry
did not cause the Reactor Protection System (RPS) to initiate the
isolation of the emergency bus bars and the operation of the emergency
diesel generators. As another example, at the Forsmark NPP in Sweden,
the failure of one pole of a breaker to open led to an open phase
condition that was also not detected by the RPS. In both cases the
electrical consumers remained connected to the fault and were exposed
to an asymmetric voltage supply, leading to unavailabilities and even
destruction of electrical equipment.
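
A common way to quantify such an asymmetry is the decomposition of the three phase voltages into symmetrical components (zero, positive and negative sequence). The Python sketch below is a strongly idealized illustration and not the plant model used in the project: the voltage of the interrupted phase is simply set to zero, whereas in a real plant the transformer connections and the connected loads influence the voltages that actually appear. The function names are hypothetical.

```python
import cmath
import math

# Rotation operator a = exp(j * 120 degrees) used in the symmetrical component transform.
A = cmath.exp(2j * math.pi / 3)


def symmetrical_components(va: complex, vb: complex, vc: complex):
    """Return (zero, positive, negative) sequence voltages of a three-phase set."""
    v0 = (va + vb + vc) / 3
    v1 = (va + A * vb + A ** 2 * vc) / 3
    v2 = (va + A ** 2 * vb + A * vc) / 3
    return v0, v1, v2


def unbalance_factor(va: complex, vb: complex, vc: complex) -> float:
    """Ratio of negative to positive sequence voltage magnitude."""
    _, v1, v2 = symmetrical_components(va, vb, vc)
    return abs(v2) / abs(v1)


# Balanced system in per unit: phases b and c lag phase a by 120 and 240 degrees.
balanced = (1 + 0j, A ** 2, A)
# Idealized open phase condition: the voltage of phase a is assumed to vanish.
one_phase_open = (0j, A ** 2, A)

if __name__ == "__main__":
    print(f"balanced system: unbalance factor {unbalance_factor(*balanced):.3f}")
    print(f"one phase open:  unbalance factor {unbalance_factor(*one_phase_open):.3f}")
```

In a balanced system the negative sequence component vanishes. In the idealized open phase case it reaches half of the positive sequence component, which indicates the kind of stress to which connected three-phase equipment is exposed.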


Such events have generally not been included in 
Probabilistic Safety Analyses (PSAs) of NPPs yet. 
Therefore, GRS has initiated a research project 
aiming at a comprehensive and in-depth analysis 
of events characterized by fault states of multi- 
ple trains of the electrical power supply system, 
including—but not limited to—open phase con- 
ditions, and at the development of modeling and 
quantification methods to include them in PSAs. 

The electrical power supply system is particu- 
larly susceptible to faults affecting multiple trains 
since during normal power operation there is 
no separation between the redundant trains. As 
shown in Fig. 1, failures that occur on or above 
the generator bus bars will affect all underlying bus 
bars simultaneously. 


2 PROJECT 


This project consists of different interacting efforts:

Initially, the possible causes of faults affecting multiple trains of
the electrical power supply system and their consequences are assessed
from an operating and modeling perspective. To this end, a detailed
analysis of international operating experience is first carried out
with respect to actual and potential faults affecting multiple trains
of the electrical power supply system of NPPs. It complements the
German operating experience, which GRS had previously analyzed with
respect to this topic in the course of the general monitoring and
analysis of German operating experience (Mildenberger 2016).

Figure 1. Typical electrical systems of an NPP (inspired by IAEA 2016).

The second step comprises the development of 
a detailed dynamic model of the electrical power 
supply system of a generic modern German Pres- 
surized Water Reactor (PWR) and the investiga- 
tion of the cause and the propagation of such 
faults. 

Then a current PSA model of a German PWR is 
extended to allow for the modeling of the phenom- 
ena identified in the previous steps. This includes 
adding relevant equipment not modeled before and 
new failure modes of equipment already modeled. 

The additional reliability parameters and fre- 
quencies of initiating events required to quan- 
tify the extended PSA model are estimated. New 
approaches and procedures to achieve this are 
developed as needed. 

Finally, the additional failure mechanisms con- 
sidered in the extended PSA model are evaluated 
quantitatively. 


3 OPERATING EXPERIENCE ANALYSIS 


The events at the Byron and Forsmark NPPs highlighted the importance of
the grid connections and the associated equipment for the reliability
of the plants’ safety system. These events also revealed the importance
of a systematic evaluation and analysis of operating experience: while
the physical and electro-technical effects that led to the failures
were all well known and well understood in theory, this theoretical
knowledge was not used in the design of the safety system of NPPs
worldwide until operating experience made these problems obvious.

Therefore, an evaluation of operating experi- 
ence with a special focus on effects that might 
impair multiple trains of the electrical power sup- 
ply system is performed. By doing so, two targets 
are pursued: 


– Identification of failure mechanisms that might lead to multi-train
  impairments and that are not yet covered by the design assumptions of
  the plant. Events where no (multiple) failures have been observed so
  far, but where the effective failure mechanism may cause such
  failures under other circumstances, are also relevant.

– Development of failure scenarios that describe how such failure
  mechanisms would affect modern German KWU type NPPs. The intention in
  developing the scenarios is not only to “copy” the event to the KWU
  type plant, but also to develop variations of the actual event.


3.1 Events with asymmetrical faults 


As a first step, the systematic evaluation of international operating
experience with a focus on asymmetrical faults, which GRS conducted
after the events at Byron and Forsmark, was taken as the basis for
developing failure scenarios. This evaluation (NRC 2007) revealed that
such faults can be observed regularly in NPP operating experience. Ten
events were identified in which the active grid connection of the plant
was affected by an asymmetrical fault. In four events such faults were
discovered in the standby grid connection. The identified events in
which active grid connections were affected are presented in Table 1.

The systematic analysis of asymmetric faults 
showed that a single failure mechanism might lead 
to various different failure scenarios. 

Table 1. Events with asymmetrical faults.

Date         Plant         Failure cause
1994-05-13   Kalinin       Collapse of a transformer duct, OPC in one phase
1997-02-25   Balakovo      Unintended closure of a single breaker pole
2001-03-31   South Texas   One breaker pole in the switchyard failed to close
2005-11-11   Koeberg       One breaker pole in the switchyard failed to close
2006-07-26   Vandellos     Mechanical failure of a disconnector
2007-05-14   Dungeness-B   One pole of a HV-transformer breaker failed to close
2012-01-30   Byron         Collapsed insulator caused a line interruption
2012-12-01   Bruce         Mechanical line failure during severe weather (storm)
2013-05-30   Forsmark      Failure to open on command of a single breaker pole
2014-04-27   Dungeness-B   Open breaker pole in the switchyard


In case of an asymmetrical failure event, the following set of features
that had a significant influence on the extent of the degradation of
the onsite power system could be identified:


– Type of failure: Failures of one and of two breaker poles have been
  observed in operating experience and need to be analyzed.

– Location of the failure: Asymmetrical failures may occur in the main
  grid connection, the auxiliary grid connection and the generator bus
  duct, each with different consequences. In case of grid side
  asymmetries, the different distances between the location of the
  failure and the plant have to be considered as well as parallel grid
  connections that are not impaired.

– Neutral point treatment: Operating experience has shown that the
  treatment of the neutral point of the main or auxiliary transformers
  has a crucial effect on the propagation of a grid side asymmetry into
  the plant.

– Load of the onsite power system: Both the load and its
  characteristics (inductive or ohmic) have an influence on the
  asymmetry and need to be evaluated carefully.


In total, more than 500 combinations of features
can be derived from the list above. Methods are
currently being developed to reduce this number
to a practicable number of failure scenarios. Once
this reduction is achieved, the identified scenarios
will be analyzed with the simulation model
described in Section 4.
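The way feature combinations multiply can be sketched as a Cartesian product over the feature levels. The following is only an illustration: the failure types, locations and load characteristics are taken from the list above, whereas the neutral point treatments and load levels are hypothetical placeholders, so the resulting count is not the figure of more than 500 combinations quoted in the text.

```python
from itertools import product

# Feature levels. Failure types, locations and load characteristics are
# named in the text; neutral point treatments and load levels are
# hypothetical placeholders used only for illustration.
FEATURES = {
    "failure_type": ["one breaker pole", "two breaker poles"],
    "location": [
        "main grid connection",
        "auxiliary grid connection",
        "generator bus duct",
    ],
    "neutral_point": ["grounded", "isolated"],   # hypothetical levels
    "load_characteristic": ["inductive", "ohmic"],
    "load_level": ["low", "high"],               # hypothetical levels
}


def enumerate_scenarios(features):
    """Return every combination of feature levels as a dictionary."""
    names = list(features)
    return [dict(zip(names, combo)) for combo in product(*features.values())]


scenarios = enumerate_scenarios(FEATURES)
print(len(scenarios))  # 2 * 3 * 2 * 2 * 2 = 48
```

Each additional feature or level multiplies the count, which is why a reduction step is needed before simulation.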


3.2 Extended scope 


Besides the asymmetrical faults, several other
phenomena that might affect multiple redundant
trains of the electrical power supply system are
already known from operating experience, both
from the International Reporting System for
Operating Experience (IRS) and from German
operating experience. Among these phenomena are
the Forsmark event of 2006-07-25 (NRC 2007),
where combined voltage and frequency fluctuations
in the 400 kV grid caused multiple impairments
of Uninterruptible Power Supply (UPS) units
necessary for the startup of the emergency
diesel generators (EDGs), and an event in a
German NPP (RSK 2015) where four inverters
(each in a separate redundant train) failed because
of a single failure in a 660 V breaker of a residual
heat removal pump.

In the light of these insights, it was decided to
extend the scope of the analysis from asymmetrical
faults to all failures that are capable of affecting
more than one of the redundant trains of the
electrical power supply system. This includes
non-redundant components or systems like the grid
connections, the generator or the generator bus duct,
but also redundant components inside and outside
of the safety systems that have caused impairments
in more than one train during a failure event.

Since German operating experience with NPPs
is limited to about 800 reactor years, it was decided
to draw on additional operating experience. Based
upon an evaluation of the technical comparability of
the plants, the accessibility of the necessary
information and the amount of available data, it was
decided to use the operating experience of U.S.
NPPs, as provided through the Licensee Event
Reports (LER), as an additional information source.
Although all these events have already been
analyzed in depth by the U.S. NRC, it was concluded
that additional insights from the events could be
gained by an analysis focused on possible effects
on modern KWU type plants, which have a safety
system that relies primarily on redundancy rather
than diversity.


3.3 Methodology


Effects with the potential to affect multiple
redundant trains simultaneously are rare since
extensive precautionary measures are taken to avoid
such events. Therefore, a substantial amount of
operating experience has to be analyzed to identify
some of these types of failures. To achieve this, all
3466 LERs with events in PWRs and BWRs from the
beginning of the year 2000 to the end of 2009 were
included in the scope of the analysis. To cope
with this high number of events in a reasonable
amount of time, a four-stage process was developed
to identify the relevant fault scenarios:


1. Initial screening of the events; based upon an 
event summary, all available events are screened 
to filter out all those events that are obviously 
not relevant for a further analysis. 

2. Thorough analyses of the remaining events to 
further reduce the number of events; at this 
stage all information included in the LER is 
used for the assessment. 

3. The remaining events are analyzed and
described in depth by taking into account all
available information.

4. Based upon the results achieved in step 3,
failure scenarios (see Section 3.1) are developed.


The first step of the process resulted in 250 
potential events, which were reduced to 29 events 
in step two. 

They cover a wide range of effects like external 
impacts (severe weather or grid fluctuations), com- 
ponent failures inside the plant or in the associated 
switchyard, or events due to human error. There- 
fore, it may be expected that a comprehensive set of 
relevant failure scenarios will result from this effort. 
Up to now, three scenarios have been developed. 
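The reduction achieved by the first two screening steps can be put into proportion with a short calculation. The sketch below uses only the event counts reported above; the function and variable names are illustrative.

```python
# Event counts reported in the text for the LER screening process.
STAGES = [
    ("all LERs 2000-2009", 3466),
    ("after step 1 (initial screening)", 250),
    ("after step 2 (thorough analysis)", 29),
]


def retention(stages):
    """Return (label, count, share of previous stage, share of initial set)."""
    initial = stages[0][1]
    previous = initial
    rows = []
    for label, count in stages:
        rows.append((label, count, count / previous, count / initial))
        previous = count
    return rows


rows = retention(STAGES)
for label, count, step_share, total_share in rows:
    print(f"{label:35s} {count:5d} {step_share:7.1%} {total_share:7.2%}")
```

Roughly 7% of the LERs pass the initial screening, and fewer than 1% of the original set remain after step two.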


4 DETAILED DYNAMIC MODEL 


A generic model of the auxiliary power system of
modern German KWU type NPPs has been developed
using the software NEPLAN (NEPLAN),
see Fig. 2 and Fig. 3. Using this model, different
calculations and analyses can be performed, such
as load flow calculations, short circuit calculations,
harmonic analysis, and dynamic simulations.


Figure 2. Generic model of the auxiliary power system
of a German NPP. On each bus bar, several individual
electrical consumers are modelled (see Fig. 3).


Figure 3. One of the four main 10 kV bus bars with


Figure 4. Total power and power of the three phases at
the generator terminal after a single-phase interruption
of the connection to the grid occurring at t=0.5 (see Fig. 2).


The model currently consists of 733 elements
(including 114 asynchronous machines as consumers)
and is being developed further.
It is already suitable for estimating the impact of
different scenarios on the NPP’s safety system.

As an example, Fig. 4 shows the voltage and 
current of the generator for all three phases as a 
function of time for a three phase disturbance as 
marked in Fig. 2. The next step will be the detailed 
investigation of the scenarios to examine the fail- 
ure modes and effects. 


5 PROBABILISTIC MODEL 


First analyses have shown that the appropriate 
modelling of realistic single phase failure scenarios 
and other scenarios simultaneously impairing mul- 
tiple trains of the electrical power supply system in 
a PSA requires extensive augmentation and modi- 
fication of present PSA models. This comprises 
modeling the complex impacts of such phenom- 
ena on the electrical equipment, including parts of 
the electrical power supply system not important 
to safety, in the PSA model and adding additional 
failure modes of electrical components already 
modeled. To efficiently and systematically integrate 
such modifications into existing PSA models, GRS 
has developed and continuously improves the soft- 
ware tool “pyRiskRobot” for modifying complex 
fault tree topologies in an automated and trace- 
able manner (Berner 2017). This tool will facilitate 
modifying and enhancing the PSA model. 


6 QUANTIFICATION 


To quantify the model, the frequencies of the new
initiating events and additional reliability
parameters need to be estimated, including e.g. failure
probabilities and failure rates for electrical
equipment that is or has been exposed to single phase
failure conditions. While the extension of the PSA
model is expected to be more or less straightforward,
quantification will pose a major challenge, since
neither reliability parameters for the equipment
under the respective conditions nor analyses of
relevant operating experience on which estimates of
these parameters could be based are readily
available.

To achieve this nonetheless, information from 
existing databases, from operating experience 
and expert knowledge will be utilized. Models, 
methods and procedures to extract and combine 
the available information will be developed and 
evaluated as needed. This may include modeling 
of the emergence of relevant initiating events by 
fault trees in order to estimate their rate, modeling 
of the failures of components under appropriate 
boundary conditions as a result of failures of their 
piece parts by fault trees, quantitative analysis of 
operating experience using Bayesian statistical 
methods, and quantitative assessments by experts. 
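As an illustration of what a Bayesian analysis of operating experience can look like, the sketch below applies the conjugate Gamma-Poisson update that is commonly used for rare-event rates. It is not necessarily the method used in the project. The Jeffreys prior and the event count are assumptions chosen for illustration; the exposure of 800 reactor years is the figure given for German operating experience in Section 3.2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GammaRate:
    """Gamma distribution over an event rate (events per reactor year)."""
    shape: float  # alpha
    rate: float   # beta, in reactor years

    def update(self, events: int, exposure_years: float) -> "GammaRate":
        # Conjugate update for Poisson-distributed event counts.
        return GammaRate(self.shape + events, self.rate + exposure_years)

    @property
    def mean(self) -> float:
        return self.shape / self.rate

    @property
    def std(self) -> float:
        return self.shape ** 0.5 / self.rate


# Jeffreys prior for a Poisson rate (improper, so its own mean is undefined).
JEFFREYS = GammaRate(shape=0.5, rate=0.0)

# Hypothetical observation: 2 events in 800 reactor years.
posterior = JEFFREYS.update(events=2, exposure_years=800.0)
print(f"posterior mean rate: {posterior.mean:.2e} per reactor year")
print(f"posterior std:       {posterior.std:.2e} per reactor year")
```

With few observed events the posterior standard deviation is of the same order as the mean, which reflects the quantification challenge described above.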

As a first step, the rates of the initiating events
single-phase and dual-phase failure in the active grid
connection have been assessed utilizing the
operating experience analysis described in Section 3.
The rate of single-phase failures in the active grid
connection is comparable to the rate of small Loss
of Coolant Accidents (LOCAs), while the rate of
dual-phase failures is approximately one order of
magnitude smaller.


7 CONCLUSIONS 


Faults affecting multiple redundant trains of
the electrical power supply system of NPPs,
including but not limited to open phase conditions,
may pose a significant threat to the safety
of NPPs. Such failures have generally not been
appropriately considered in PSAs despite the fact
that the rate of such events is comparable to that of
small LOCAs. An on-going project of GRS to research
this subject comprises the systematic analysis of
national and international operating experience,
the development of a detailed dynamic model of
the electrical power supply system, the extension
and modification of an existing PSA model,
and the quantification of the enhanced model. The
advancements made so far suggest that substantial
new insights will be gained in the frame of this
project. However, the estimation of the reliability
parameters and rates of initiating events needed to
quantify the enhanced PSA model will still pose a
challenge.
ABSTRACT: Risk analysis is very useful for identifying hazardous events affecting the functional
limits of complex systems such as autonomous vehicles. To ensure the robustness of an autonomous vehicle
architecture, new approaches should be investigated providing the dimensioning parameters related to
functional scenarios. Research is currently ongoing at the VEDECOM Institute to identify and analyze
safety critical situations including accidents. For this reason, a set of functional scenarios has been defined
as an abstraction of real life driving situations. The aim is to explore how autonomous vehicles evolve and
behave under various environmental and traffic conditions. Thus, this paper presents a qualitative risk
analysis approach taking into account the evolution of a scenario by means of the concept of transition
between successive scenes. Then, combinations of hazards potentially leading to safety
critical situations are identified. Finally, the feasibility of our proposal is shown on an application case examining
the behavior of a level 4 autonomous vehicle in high traffic driving.


1 INTRODUCTION 


During the last decade, the focus of research in the
automotive industry has shifted to the development
of highly or even fully automated (autonomous)
driving (Bengler et al. 2014). Recent progress in this
field has shown that it is now possible to provide
the driver with useful assistance systems, e.g. lane
departure warning (Dickmanns 2002), lane change
assistant (Ruder, M. et al. 2002), Adaptive Cruise
Control (ACC), or even autonomous driving over
long distances on highways (Dagli et al. 2004).
Nevertheless, in order to achieve highly robust
autonomous driving systems, there are still
challenging use cases that must be mastered (Schmidt
et al. 2015, Geyer et al. 2014). In fact, mastering
these cases is important not only for Society
of Automotive Engineers (SAE) level 4 and 5
automated vehicles (Donges 1999), which are capable
of sensing their environment and navigating without
human input, but also for the rate at which driver
take-over requests are triggered at level 3
(conditional automation). Outside of these areas or
circumstances, the vehicle must be able to safely
abort the trip, i.e. park the car, if the driver does
not retake control. Increasing the automation level
requires remarkable improvements in the existing
perception, prediction and planning algorithms
but also in the system architectures (Weiss et al.
2004). Although the former have been an active
field of research, the focus on the system (functional)
architecture has been limited so far. In our
view, the system architecture, and in particular its
behavioral planner layer, is a key element for future
autonomous vehicles (Bertolazzi et al. 2004, Bonic
et al. 2017).

In order to set up a robust decision module in
an autonomous system, a number of functions
have to be realized in an integrated architecture.
As processing tasks for perception, situation
assessment, behavioral module and actual control
of the vehicle have tight constraints, it is necessary
to focus on the surrounding environment (Chan et al.
2004, Dickmanns 2003, Bertozzi et al. 2000).

Developing a safe functional architecture is one 
of the main objectives of the ‘Robustness of archi- 
tectures and systems’ program at VEDECOM 
(Bonic et al. 2017). 

Today, risk and safety analysis has not
sufficiently addressed the problem of exploring the
behavioral and decision module of an autonomous
system architecture while taking the surroundings
into account (Donges 1999, Figlewski & Levich
2002).

In this sense, risk analysis should provide
relevant design insights concerning the overall
robustness of an autonomous vehicle, i.e. its
ability to adapt and operate properly under a wide
diversity of operating conditions (e.g.
infrastructure, weather, illumination, traffic etc.). However,
most existing risk analysis approaches are
rather specialized in assessing system reliability
and usually focus on system and component
failure modes (Kontio & Basili 1997). Moreover, in
these approaches information is more or less used
directly on a quantitative level, with an uncertainty
that has to be considered.

In this paper, we present a risk analysis approach
which aims to qualitatively identify hazardous
patterns and, in this way, the underlying critical
situations. We thereby focus on the limits of the
behavioral performance of an SAE level 4 autonomous
vehicle. The aim is to address the robustness
of the autonomous system architecture by focusing
on the evolution of a driving scenario.

The risk is mainly seen, as discussed in
(Polychronopoulos et al. 2004), as the approach of critical
situations resulting from the combination of several
hazardous events which might endanger the
autonomous vehicle and its passengers or other traffic
participants. Concretely, the paper addresses risk
analysis qualitatively and thus allows the abstract
critical situations to be described in a comprehensible
way. Being oriented towards dynamic modelling,
the approach allows risk analysis insights to be
integrated into an experimental evaluation (simulation)
platform for a scenario-based design of
autonomous system architecture (Go & Carroll, 2004).

The rest of the paper is structured as follows:
Section 2 gives an overview of the global framework
on which our risk analysis methodology
relies. Section 3 introduces the conceptual
formalization of scenes and scenarios along with
some notions relating to the dynamic modeling of
autonomous driving. In Section 4 an application
case is proposed. The paper closes with some
conclusions and an outlook on further developments
to adapt the proposed technique to the specifics of
autonomous systems design.


2 FROM REAL WORLD DRIVING TO 
SAFETY CRITICAL SCENARIOS 


The risk analysis approach that we propose is based
on a global framework for assessing risks in critical
driving situations generated by the occurrence of
hazardous events combined with critical operational
modes. We chose to use the adjective
hazardous since it makes the tradeoff between the
semantic usage specific to dependability studies and
the standardized expressions used in a research area
such as that related to autonomous driving.

As illustrated in Figure 1, this risk analysis
methodology has been proposed to meet the need of
analyzing nominal functional scenarios designed starting
from real world driving data collected in the
context of the MOOVE project (Bonic et al. 2017).


Figure 1. Scheme of the global approach.

The parameterization of the functional scenar- 
ios makes it possible to do queries in the MOOVE 
database and search “virtual” aggravating situa- 
tions in driving records. 

In our risk analysis approach, we focus on the
hazardous events leading to major risk situations
linked to safety issues affecting the autonomous
vehicle performance (behavioral and decision).

After this introduction to the main concepts and
the development framework in which this method
fits, Section 3 presents in detail the different
parts that compose it and the underlying approach,
which aims to make the link with dynamic modelling.


3 A SCENARIO-BASED RISK ANALYSIS 
FOR AUTONOMOUS DRIVING 


3.1 Defining the main terms 


In the literature several definitions exist which are 
used in different application fields. We have con- 
sidered those that fit better with the choices made 
in the context of the MOOVE project. Thus, the 
following terms have been defined as they consti- 
tute the basic components of our approach. 

First of all, we need to define the term scene as 
below: 


A scene describes a snapshot of the ego vehicle and
its surroundings, including the infrastructure, the
environmental conditions as well as all the interacting
traffic participants and their relationships with
the ego vehicle.


The objects in the scene can be characterized in 
different ways: 


— Dynamic or static (i.e. “potentially” mobile or
“always” static objects);
— Environmental elements;
— Etc.


Overall, the scene describes the current state of
the system according to the traffic conditions
(position, speed, etc. of the ego and the other vehicles),
the environment (weather, road surface, visibility,
etc.) and infrastructure (type of road, number
of lanes, slope, presence of working zones and
mobile objects, etc.).

The second term which has to be defined is 
scenario: 


A scenario describes the evolution of the system by 
a succession of scenes. The (temporal) succession 
starts with an initial scene and ends with a final scene. 


The mechanisms for moving from one scene to 
another are mainly of two types: 


— Actions, i.e. the ego vehicle state changes; 
— External events, i.e. other vehicles or the environ- 
ment state changes. 


After having formalized the current situation
(initial scene), how it evolves and the final state of
the ego vehicle, the nominal development of the
scenario can be modified or degraded by adding
sequential hazardous events (Bengler et al. 2014).
Maneuvers of traffic participants interacting with
the ego vehicle can be considered along with
unsafe behavior of the ego vehicle itself.
This allows critical patterns to be identified and,
consequently, the risks of failing to control
deviations from a specific scenario.

So, we have to define two last concepts, hazard- 
ous event: 


A hazardous event is an event that may increase the 
risk likelihood or severity. It is any condition that 
can degrade the development of the ongoing scene 
towards a negative evolution of the scenario, with 
more or less serious consequences on the achieve- 
ment of objectives directly related to safety. 


and critical pattern: 


A critical pattern is a set of hazardous events which 
combined with actions or external events leads to a 
critical situation. 


Based on the nominal scenario, we have to deal 
with a large combinatorial explosion concerning 
the scenario branching as in sequential critical pat- 
terns (Lattner & Herzog, 2004). This combinato- 
rial explosion essentially depends on: 


— The actions and/or external events that may take 
different values and occur at different times; 

— The number of hazardous events and their com- 
bination to produce critical patterns. 
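The growth in the number of candidate patterns can be sketched by enumerating combinations of hazardous events. In the sketch below the event list is hypothetical, loosely worded after the illustrative case in Section 4, and only unordered combinations are counted; the timing and values of actions and external events, which the text also names as a source of branching, are left out.

```python
from itertools import combinations

# Hypothetical hazardous events, loosely based on the illustrative case.
HAZARDOUS_EVENTS = [
    "T1 changes lanes without turn signals",
    "T1 stops abruptly",
    "E detects the working zone too late",
    "E brakes too late",
]


def candidate_patterns(events, max_size):
    """All unordered combinations of 1..max_size hazardous events."""
    patterns = []
    for size in range(1, max_size + 1):
        patterns.extend(combinations(events, size))
    return patterns


patterns = candidate_patterns(HAZARDOUS_EVENTS, max_size=len(HAZARDOUS_EVENTS))
print(len(patterns))  # 2**4 - 1 = 15 non-empty combinations
```

With n hazardous events there are 2**n - 1 non-empty combinations, so limiting the pattern size (for example to pairs) is one simple way to keep the enumeration tractable.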


This representation of a driving scenario is
naturally oriented to the dynamic modelling of an
autonomous system functional architecture. The next
section provides a further formalization
supporting the dynamic modelling.


3.2 Formalization of transitions for dynamic 
modelling 


In the previous section we defined the concepts
on which this risk analysis approach relies, i.e.:


— Scene, which in turn is based on the ‘infrastructure’,
‘environmental conditions’ and ‘traffic’ descriptors,

— Scenario, which starts with an initial scene,
evolves and ends with a final scene,

— Critical patterns, which are introduced in a
scenario by considering a combination of hazardous
events.


We now give some elements to formalize
the evolution of a driving scenario for dynamic
risk assessment. This formalization is intended
to provide a way to deal with the dynamic
modelling of autonomous driving functions.


To do that, we refer to the concept of
transition, which is well known in graph models for
risk assessment such as Petri nets (Reschka et al. 2015;
Ulbrich et al., 2014). In this risk analysis
methodology, a transition is defined as follows:


A transition is any action of the ego vehicle or
external event occurring during the evolution of the
scenario and leading to the next scene.


In other words, scenes are connected and
triggered by transitions. Graphically, a driving
scenario can then be represented as a discrete event
dynamic system. Like a Petri net, it is a directed
bipartite graph in which the nodes represent
transitions (i.e. events that may occur, represented by
bars) and places (i.e. conditions, represented by
circles) (see Figure 2).

To summarize, the risk analysis methodology
proposed in this paper can be depicted as a top-down
approach (Matthaei & Maurer, 2015) (Figure 3).

In Section 4, an illustrative case is presented as
an application of level 1. In particular, the analysis
of a relevant scenario is discussed for the decision
and behavioral module of an autonomous vehicle.
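The scene and transition concepts of this section can be sketched as a small data structure. This is a simplified linear chain of scenes connected by typed transitions, not a full Petri net with tokens and firing rules; the class names, fields and example labels are assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transition:
    label: str
    kind: str    # "action" (ego vehicle) or "external_event"
    source: str  # scene the transition leaves
    target: str  # scene the transition leads to


@dataclass
class Scenario:
    scenes: set = field(default_factory=set)
    transitions: list = field(default_factory=list)

    def add_transition(self, transition: Transition) -> None:
        if transition.kind not in ("action", "external_event"):
            raise ValueError(f"unknown transition kind: {transition.kind}")
        self.scenes.update((transition.source, transition.target))
        self.transitions.append(transition)

    def path_from(self, initial: str) -> list:
        """Follow the single outgoing transition of each scene in turn."""
        path = [initial]
        while True:
            outgoing = [t for t in self.transitions if t.source == path[-1]]
            if not outgoing:
                return path  # no outgoing transition: final scene reached
            if len(outgoing) > 1:
                raise ValueError("branching is not handled in this sketch")
            if outgoing[0].target in path:
                raise ValueError("cyclic scenario")
            path.append(outgoing[0].target)


scenario = Scenario()
scenario.add_transition(Transition(
    "working zone begins", "external_event", "initial scene", "evolution"))
scenario.add_transition(Transition(
    "E follows T1 past the working zone", "action", "evolution", "final scene"))

path = scenario.path_from("initial scene")
print(" -> ".join(path))
```

Hazardous events could be added to such a structure as alternative transitions leaving a scene, which is where the branching discussed in Section 3.1 would appear.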


4 ILLUSTRATIVE CASE 


As application cases, we tested the approach on a 
set of scenarios of interest in the context of the 
MOOVE project. Among these, a very interesting 
and illustrative one is the “Inconsistent yellow and 
white lane marking” scenario. It allows for show- 
ing the behavioral layer of the ego vehicle archi- 
tecture in the presence of a working zone (WZ). In 
particular, it is a WZ characterized by two types of 
lane markings simultaneously: white marking for 
lanes before the WZ, and yellow marking for lanes 
after the WZ. The superposition of both white and 
yellow markings on the road makes lane detection 
problematic for the perception system. Conse- 
quently, the ego vehicle trajectory planner depends 
on tracking (i.e. following the target vehicle ahead). 
The absence, or in any case the loss of the target 
vehicle, significantly compromises the possibility 
for the ego vehicle to go beyond the WZ. 

So, we proceed to the identification of all the critical
patterns potentially leading to critical situations.


As illustrated in Fig. 4, level 1 proceeds by
analyzing the scenario evolution as described above, i.e.:


— Initial scene;
— Evolution (i.e. transitions);
— Final scene.


This first part is followed by the identification 
of hazardous patterns and the critical situations. 

Let’s consider the nominal scenario evolution 
analysis. The scenario that we are considering 
starts with an initial scene, illustrated in Figure 4. 
The ego vehicle drives on the right lane and follows 
the green car ahead (the target vehicle). Let’s name 
this latter T1. 

The initial scene is described in our methodol- 
ogy as shown in Table 1. 

An event occurring after the initial scene marks 
the evolution of the scenario. This event is the 
beginning of a working zone (WZ), as illustrated 
in Figure 5. 

T1 detects the WZ and changes lanes. 


Figure 4. Snapshot of the initial scene. 


Table 1. Initial scene as analyzed in the methodology.


Level 1 — Initial scene

                Description

Infrastructure  — Road with separate carriageways;
                — Two or more lanes for the same direction;
                — Dashed lines in the middle;

Environmental   Nothing to report (NR)
conditions

Traffic         — The ego vehicle (E) drives on the rightmost lane
                  at a constant average speed;
                — The vehicle (T1) in front of E advances at the
                  same speed;
                — Other vehicles drive on the left lane.

Figure 5. Evolution of the scenario: An event occurs
consisting in the presence of the WZ.


Based on this event, the ego vehicle performs
actions that essentially consist in tracking T1 in order
to pass the WZ (see Figure 6 below).

The scenario closes with the final scene 
(Figure 7). 

Table 3 shows how the final scene is described
in the risk analysis before proceeding to the
identification of the critical patterns and the
resulting critical situations (Table 4).


Figure 6. Evolution of the scenario: The ego vehicle performs
actions in response to the events that occurred in the scenario.


Figure 7. Final scene of the scenario: Action.


Table 2. Scenario evolution.


Level 1 — Evolution

                Description

Infrastructure  — In the right-hand lane there is a working area
                  indicated by warning signs;
                — New traffic lanes are marked in yellow on the
                  ground, and white markings of initial lanes are
                  still present.

Environmental   Nothing to report (NR)
conditions

Traffic         — T1 turns on the left indicators and follows the
                  lines of the new yellow marking to go beyond the
                  working zone;
                — E detects that T1 changes lanes to the left,
                  detects the new marking lines, turns on the left
                  indicators and follows T1.


Table 3. Final scene.


Level 1 — Final scene

                Description

Infrastructure  — Two tighter lanes are available

Environmental   Nothing to report (NR)
conditions

Traffic         — E continues to drive behind T1 in the new lane on
                  the right (yellow marking) and goes beyond the
                  working zone.


Table 4. Hazardous patterns & critical situations.


Level 1 — Hazardous patterns & critical situations

Hazardous patterns                    Critical situations

1. T1 suddenly changes lanes          2. Collision in the working
   without turn signals AND E            zone;
   brakes too late;

3. T1 stops abruptly;                 4. Accident between E and T1/
                                         Exit of lane or road;

5. E does not detect or detects       6. Accident in the work area
   too late the work area (panels        with injured workers;
   placed too close to the work)
   AND does not brake sufficiently
   before;

7. E abruptly shifts to the right     8. Accident with one or more
   (bad weather conditions, e.g.         injured pedestrians.
   rain, side wind; loss of grip,
   e.g. slippery ground).


5 CONCLUSIONS 


In this paper we presented a risk analysis approach
adapted to the specifications of the autonomous
vehicle. For an SAE level 4 autonomous vehicle the
driver only provides a destination or navigation
instructions.

A robust behavioral planner should therefore safely
manage a large number of critical situations.
These critical situations need to be well identified
and mastered in order to design the functional
architecture of an autonomous vehicle. The
aim is to evaluate the performance limits of a level 4
autonomous vehicle.

To identify these limits, in this paper we
proposed a risk analysis methodology which focuses
on a specific scenario and identifies possible
hazardous events. In that way, we tried to contribute to
two main research axes relevant for autonomous
driving:


1. Exploring the ego vehicle behavior qualitatively
in functional scenarios and identifying the safety
critical situations dimensioning the ego vehicle
architecture without taking into account failure
modes;

2. Mastering critical situations by proceeding with
contextual semantics adapted to the operational
behavior of autonomous vehicles.


In order to validate these contributions, this
methodology has been applied to one of the
functional scenarios defined in the MOOVE
project on the basis of our real life driving database.
Scenarios have been designed to explore the ego
vehicle behavioral limits (safe functional) and
identify critical situations which it could face in real
life. These critical situations arise by combining
hazardous events. As hazardous events, we did not
consider those related to failure modes but
focused on hazards arising from extreme or critical
operational modes that no longer allow the ego
vehicle to behave safely.

Based on the application case, some perspec- 
tives have been identified to extend this approach 
to fully deal with autonomous driving safety: 


— On the one hand, in the first part of the analysis,
the integration into the scenario analysis of
failure modes concerning perception and control;

— On the other hand, in the second part, the
authors would like to stimulate a deeper discussion
on the new risk classification requirements for
autonomous vehicles compared to the classical
ISO-26262 standard (International Organization
for Standardization, 2011).


A whole system approach to managing defective on-train equipment


A.J. Gilchrist 
RSSB, London, UK 


ABSTRACT: This paper describes recent risk analysis work undertaken to determine the safest 
operational responses to on-train equipment failures on the mainline railway in Great Britain. The current 
rules and guidance for managing these failures have focused on minimising the risk to passengers onboard 
the train with defective equipment. In more recent years, the rail network has become significantly more 
congested and it has become increasingly necessary to also consider the safety impact any control measures 
applied to an individual train will have on the rest of the network. For different operational responses two 
types of safety risk were calculated: the immediate risk from a train not being able to use the defective on- 
train equipment; and the knock-on risk resulting from any impact on train performance. To illustrate the 
methodology, results are presented for an Automatic Warning System (AWS) failure on a passenger train. 


1 INTRODUCTION 


Each train operator on the mainline railway in
Great Britain (GB) is required to produce a
contingency plan outlining how they will manage
situations where certain items of on-train equipment
fail. The pieces of equipment for which contingency
plans should be made include a wide range
of safety-critical equipment, such as train
protection systems and cab radios.

The safety impact from an on-train equipment 
failure will depend on the equipment which has 
failed and the role of that equipment onboard the 
train. The loss of on-train equipment will generally 
result in an increased risk from train accidents such 
as train collisions, derailments and buffer stop col- 
lisions. This is called the ‘immediate risk’ and may 
be reduced by imposing restrictions on the train, 
such as reducing the maximum permissible speed 
or detraining passengers. 

Over the last decade, passenger usage on the
GB mainline railway has increased significantly.
The annual number of passenger journeys rose from
1.2 billion in 2006 to 1.7 billion in 2016
(Office of Rail and Road 2017). This
increased passenger usage has resulted in increased
congestion on the rail network, with both trains
and stations becoming busier. Any control measures
which impact train performance will further
increase this congestion and have a resulting safety
impact. This is called the ‘knock-on risk’ and
includes personal accident risk resulting from extra
boarding, alighting, and crowding at stations, as
well as train accident risk caused by
miscommunication and additional red signal approaches.


In order to effectively manage on-train equipment
failures, both the immediate and knock-on
risks must be taken into account so that the imme- 
diate risk to the train with defective equipment is 
reduced whilst minimizing the knock-on risk to 
the rest of the rail network. The management of 
defective on-train equipment is currently covered 
by a rail industry standard (RSSB 2016) and asso- 
ciated guidance note (RSSB 2015) as well as Rule 
Book module TW5 (RSSB 2017). These docu- 
ments outline the basis by which a train operator 
should produce their contingency plans, including 
speed restrictions and maximum distances travelled
with the defective equipment. The current
rules and guidance were designed to reduce the
immediate risk to as low a level as possible but did
not fully consider the knock-on risk associated with
these operational responses.

Quantitative risk models have been developed 
to calculate both the immediate and knock-on 
risk for a range of different operational responses 
to defective on-train equipment. Using these risk 
models, the operational response resulting in the 
lowest total risk (both immediate and knock-on) 
has been determined. This paper will describe the 
risk modelling undertaken to determine the most 
effective way of managing an Automatic Warning 
System (AWS) failure on a passenger train. 


2 AUTOMATIC WARNING SYSTEM (AWS) 


2.1 Introduction 


The primary safety role of the AWS is to warn
the driver of potentially hazardous situations
ahead. These might be signals at which the train
is required to slow down and be prepared to stop,
or severe speed reductions.

When the driver is approaching a potentially 
hazardous situation: 


— The driver will receive a warning horn. 

— The driver must acknowledge the warning within 
a set time period. 

— The AWS will then change visually to indicate 
that the driver has acknowledged the warning. 


An emergency brake application will be applied 
on the train if the driver does not respond correctly 
to an AWS warning. 
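
The warning sequence above can be summarised as a simple decision rule. The following Python sketch is illustrative only: the function name, the field names and the length of the acknowledgement window are assumptions, since the set time period is not specified in this paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AwsOutcome:
    """Result of one AWS warning, as described in the text."""
    horn_sounded: bool
    indicator_shows_acknowledged: bool
    emergency_brake_applied: bool


def aws_warning_outcome(ack_delay_s: Optional[float],
                        ack_window_s: float) -> AwsOutcome:
    """Model a single AWS warning.

    ack_delay_s  -- seconds until the driver acknowledges, or None if the
                    driver never acknowledges (hypothetical input).
    ack_window_s -- the set time period allowed for acknowledgement
                    (hypothetical parameter; no value is given in the paper).
    """
    if ack_window_s <= 0:
        raise ValueError("acknowledgement window must be positive")
    acknowledged = ack_delay_s is not None and 0 <= ack_delay_s <= ack_window_s
    # The horn always sounds; the indicator changes only on a timely
    # acknowledgement; otherwise the emergency brake is applied.
    return AwsOutcome(
        horn_sounded=True,
        indicator_shows_acknowledged=acknowledged,
        emergency_brake_applied=not acknowledged,
    )
```

A late or missing acknowledgement both lead to the same outcome in this sketch, which matches the statement that the brake is applied if the driver does not respond correctly.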

Fitment of AWS at signals became standard for 
British Rail in 1956. Figure 1 shows the historical 
number of accidents caused by Signals Passed At 
Danger (SPADs) for the years 1950-1980 (Evans 
2003). It can be seen that in the years immediately 
after AWS fitment became standard, the number 
of accidents caused by SPADs was significantly 
reduced (by a factor of approximately 3 to 10). Whilst
other factors, such as the change from steam to diesel
and electric trains and the introduction of colour light
signals, will have contributed to this decrease, the
introduction of AWS is thought to have been a
major factor.

In the years after 1956, AWS has been intro- 
duced at locations other than signals, primarily as a 
response to major accidents. Following the Morpeth 
derailment in 1969, a recommendation was made 
to provide AWS for Permanent Speed Restrictions
(PSRs) requiring a reduction in speed of one third
or more. Following a derailment at Nuneaton in 1975,
temporary AWS magnets were provided on the
approach to Temporary Speed Restrictions (TSRs)
and, since 1987, they have also been provided on the
approach to Emergency Speed Restrictions (ESRs).
AWS started to be provided on the approach to cer- 
tain locally monitored level crossings in 1981. 

Figure 1. Historical number of accidents caused by
SPADs for the years 1950-1980.


3 CALCULATING THE IMMEDIATE RISK 


3.1 Risk model structure 


The immediate risk from a train travelling without 
AWS is principally from the train either travelling 
too fast or too far at locations where AWS is fitted. 
The hazards which have been explicitly quantified 
in the risk model are therefore: 


— Train collisions (caused by SPADs).

— Derailments (caused by overspeeding at speed
restrictions and from SPADs).

— Collisions at level crossings. 


All frequencies and consequences have been 
quantified based on established risk models. The 
main risk model used to determine the immediate 
risk in this work was RSSB’s Safety Risk Model 
(SRM). 

The SRM consists of a series of fault tree and 
event tree models representing 131 hazardous 
events, which collectively define the overall level 
of risk on the GB mainline railway. It provides a 
structured representation of the causes and con- 
sequences of potential accidents arising from 
railway operations and maintenance on railway 
infrastructure as well as other areas where the 
industry has a commitment to record and report 
accidents. These risk estimates are for the current 
level of residual risk on the railway, which is the 
level of risk remaining with the current risk control 
measures in place and with their current degree of 
effectiveness. The SRM is calculated assuming that 
all events are independent and can be attributed to 
a single cause, reflecting how safety event data has 
been historically recorded in GB. More informa- 
tion on the SRM and how it calculates risk may be 
found in the Risk Profile Bulletin (Dacre 2014). 

The SRM has been designed to take account 
of both high-frequency, low-consequence events 
(occurring routinely, and for which there is a signifi- 
cant quantity of recorded data) and low-frequency, 
high-consequence events (occurring rarely, and for 
which there is little recorded data). For each of the 
low-frequency, high-consequence train accidents 
considered in this work the SRM has a specific 
fault and event tree structure. For example, for train 
collisions, the national frequency of SPADs manu- 
ally coded by type of signal and cause is used as a 
precursor. Fault trees are then used to estimate a 
predicted frequency of train collisions for each type 
of SPAD, based on the probability the train will 
reach a potential conflict point and whether there is 
another train at this location. Event trees are
subsequently used to determine the average consequence
of a train collision, considering escalation factors
such as the probability of train fires and secondary
collisions. Where there is sufficient data, all
probabilities and frequencies in the SRM fault and
event trees are derived from historical event data
by calculating average rates and carrying out trend
analysis. Since all available event data is used to
determine these probabilities, the SRM provides
national average risk estimates. Because of the
network-wide nature of the SRM, it is necessary to make
averaged assumptions that represent the general
characteristics of the network when calculating the
risk values.

To determine the effect of operational responses 
on the immediate risk, risk estimates were required 
for trains travelling at different speeds and passen- 
ger loadings. This required significant modifica- 
tion of the fault and event trees contained within 
the SRM, including estimating the change in effec- 
tiveness of the Train Protection and Warning Sys- 
tem (TPWS) in stopping a train from reaching a 
potential conflict point following a SPAD. TPWS 
is a system which is fitted in certain high-risk loca- 
tions and automatically applies the brakes on a 
train if it passes a signal at danger, or if the train’s 
speed is excessive when approaching a signal at 
danger, permanent speed restriction or buffer stop. 
Note that not all signals are fitted with TPWS and 
of those that are, only some are additionally fitted 
with overspeed protection on approach. By ana- 
lysing the historical data from trains which have 
been stopped following a TPWS intervention, the 
effectiveness of TPWS fitment at signals and speed 
restrictions can be estimated (Harrison 2007). This 
methodology was incorporated into the SRM 
fault and event trees to determine the effect of a 
speed restriction on the train’s immediate risk. The 
TPWS effectiveness calculations have assumed
Network Rail’s standard Overspeed Sensor (OSS)
and Train Stop Sensor (TSS) loop positions and
set speeds, including an additional TPWS+ loop
for line speeds of 75 mph and over.
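
The fault-tree step described above for train collisions, from SPAD precursor to predicted collision frequency, can be sketched as a product of a frequency and conditional probabilities. This is a simplified illustration and not the SRM itself: the function name, the way TPWS effectiveness is applied and all numerical inputs are assumptions.

```python
def collision_frequency(spad_frequency: float,
                        p_reach_conflict_point: float,
                        p_train_at_conflict_point: float,
                        tpws_effectiveness: float = 0.0) -> float:
    """Predicted train collisions per year for one type of SPAD.

    spad_frequency            -- SPADs per year of this type (precursor).
    p_reach_conflict_point    -- probability an unprotected train reaches
                                 a potential conflict point after the SPAD.
    p_train_at_conflict_point -- probability another train is there.
    tpws_effectiveness        -- assumed probability that TPWS stops the
                                 train before the conflict point.
    """
    for p in (p_reach_conflict_point, p_train_at_conflict_point,
              tpws_effectiveness):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    if spad_frequency < 0:
        raise ValueError("frequency must be non-negative")
    # TPWS is modelled here as reducing the chance of reaching the
    # conflict point; this is an assumption made for illustration.
    p_reach = p_reach_conflict_point * (1.0 - tpws_effectiveness)
    return spad_frequency * p_reach * p_train_at_conflict_point
```

In such a sketch, a speed restriction would enter through a speed-dependent `tpws_effectiveness` and through the consequence side of the event tree, neither of which is modelled here.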


3.2 Risk without AWS 


The risk values in the SRM are based on recent
historical data and therefore only estimate the risk
for trains with AWS fitted and working. To calculate
the increase in risk during an AWS failure, the
increase in driver error probabilities when AWS is
no longer available as a reminder was estimated
using Railway Action Reliability Assessment
(RARA) (Gibson 2012). RARA provides a
consistent approach to human error quantification 
for the rail industry and may be used to determine 
high-level estimates of the increase in driver error 
rates without AWS for different causes of train 
accidents. 

To estimate the increase in train collision and 
derailment risk at signals, the causes of SPADs 
most likely to be affected by AWS were deter- 
mined. These were found to be: 


— The driver failing to check the signal aspect. 

— The driver failing to react correctly to a caution- 
ary aspect. 

— The driver misreading a signal (either misread- 
ing the correct signal or reading the incorrect 
signal). 

— Miscommunication between the driver and 
signaller. 


Of these causes, failing to check the signal 
aspect and not reacting correctly to a cautionary 
aspect were determined to be those where AWS 
would have the greatest influence since these are 
the situations where AWS can directly prevent the 
error. For these causes of SPADs, RARA analysis 
gives two estimates of the increase in SPAD fre- 
quency without AWS: 


— 20 times increase—drivers who have driven the 
route many times before and have a good knowl- 
edge of where signals are so that, even without 
AWS, approaching signals is still a simple, rou- 
tine task. 

— 150 times increase—drivers who are less famil- 
iar with the route or are particularly reliant on 
AWS to let them know where they are and react 
correctly to signals. This may only be a very ini- 
tial increase for a short distance whilst the driver 
becomes used to driving without AWS. 


In reality, this increase will not be constant and 
will vary with different signal approaches and driv- 
ers. Some signal approaches will be simple and 
correspond to the lower increase whilst the more 
complex approaches may correspond to the higher 
increase. Therefore, the mean of the two estimates, an
increase of 85 times, has been used as the best estimate.

For the other causes of SPAD identified as 
being influenced by AWS, the initial error will not 
be prevented by AWS but the AWS sunflower will 
act as a visual reminder to the driver, providing 
an opportunity to rectify the initial error before it 
results in a SPAD. For these causes the effect of 
AWS being unavailable will be less and RARA 
analysis suggests that there should be a 6.25 times 
increase in these causes of SPADs without AWS. 
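
The way these multipliers combine can be shown with a short Python sketch. The factors of 20, 150 and 6.25 are taken from the text above; the cause labels and the baseline frequencies in the usage note are hypothetical and are not SRM values.

```python
from typing import Dict, Iterable

RARA_LOWER_FACTOR = 20.0     # drivers familiar with the route
RARA_UPPER_FACTOR = 150.0    # drivers reliant on AWS or unfamiliar with the route
REMINDER_ONLY_FACTOR = 6.25  # causes where AWS only acts as a visual reminder


def best_estimate_factor(lower: float = RARA_LOWER_FACTOR,
                         upper: float = RARA_UPPER_FACTOR) -> float:
    """Mean of the two RARA estimates, used as the best estimate."""
    return (lower + upper) / 2.0


def spad_frequency_without_aws(baseline_by_cause: Dict[str, float],
                               directly_prevented: Iterable[str],
                               reminder_only: Iterable[str],
                               direct_factor: float) -> float:
    """Scale baseline SPAD frequencies by cause for a train without AWS."""
    direct = set(directly_prevented)
    reminder = set(reminder_only)
    total = 0.0
    for cause, frequency in baseline_by_cause.items():
        if cause in direct:
            total += frequency * direct_factor
        elif cause in reminder:
            total += frequency * REMINDER_ONLY_FACTOR
        else:
            total += frequency  # causes not influenced by AWS are unchanged
    return total
```

For example, with hypothetical baseline frequencies of 1.0 for a directly prevented cause, 2.0 for a reminder-only cause and 4.0 for an unaffected cause, the best-estimate factor of 85 gives a total of 85 + 12.5 + 4 = 101.5. Passing `RARA_LOWER_FACTOR` or `RARA_UPPER_FACTOR` as `direct_factor` gives the corresponding bounds.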

As well as accidents caused by SPADs, the cal- 
culated error rate increases were also used to esti- 
mate the increase in drivers overspeeding at speed 
restrictions. Since AWS is not generally fitted at 
buffer stops, no increase in the frequency of buffer 
stop collisions is predicted without AWS. 

The total calculated immediate risk for differ- 
ent line speeds is illustrated in Figure 2. The figure 
shows the estimated immediate risk without AWS, 
including the upper and lower bounds as calculated 
from the RARA analysis. For reference, the baseline 
level of risk calculated by the risk model for when 
AWS is working is also shown. The drop in risk at
75 mph is due to the addition of a second TPWS+
loop, providing extra protection at signals. The risk
results are given in units of Fatalities and Weighted
Injuries (FWI) per billion train km. This is an
aggregate measurement of safety risk, using weightings
for fatalities, major injuries, minor injuries and
shock/trauma events which have been agreed for use
by the GB rail industry (Jones-Lee & Loomes 2008).

Figure 2. Immediate risk per billion train km in FWI
as a function of line speed. The black and red lines show
the average calculated risk with and without AWS
respectively. The dashed lines show the upper and lower
bounds of the calculated risk values without AWS using
the two different estimates for the increase in driver
error from the RARA analysis.
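
The FWI measure described above is a weighted sum of harm categories, normalised here by train km. The sketch below shows the arithmetic only. The weightings in `ASSUMED_WEIGHTS` are illustrative assumptions and are not stated in this paper; the agreed industry values should be taken from the cited source, which also distinguishes sub-classes of minor injury and shock/trauma that are merged here.

```python
from typing import Dict

# Assumed, illustrative weightings (not given in this paper).
ASSUMED_WEIGHTS: Dict[str, float] = {
    "fatality": 1.0,
    "major_injury": 0.1,
    "minor_injury": 0.005,
    "shock_trauma": 0.005,
}


def fwi(counts: Dict[str, float],
        weights: Dict[str, float] = ASSUMED_WEIGHTS) -> float:
    """Fatalities and Weighted Injuries for a set of event counts."""
    unknown = set(counts) - set(weights)
    if unknown:
        raise KeyError(f"no weighting for: {sorted(unknown)}")
    return sum(weights[kind] * number for kind, number in counts.items())


def fwi_per_billion_train_km(counts: Dict[str, float],
                             train_km: float,
                             weights: Dict[str, float] = ASSUMED_WEIGHTS) -> float:
    """Normalise FWI by exposure, in FWI per billion train km."""
    if train_km <= 0:
        raise ValueError("train_km must be positive")
    return fwi(counts, weights) / train_km * 1e9
```

With the assumed weightings, one fatality and ten major injuries both contribute 1.0 FWI, so together they give 2.0 FWI.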

Analysis of the historical data in Figure 1 can 
also give an estimate of the risk without AWS. This 
will, however, not take into account any reliance on 
AWS that a driver may have developed by driving 
regularly with AWS. The level of risk calculated 
from historical data can therefore be thought of
as the level of risk in areas where AWS has never 
been introduced or if AWS has been unavailable for 
a long period of time. The values calculated from 
historical data analysis are almost identical to the 
lower bound of the RARA estimate. This may be 
expected since in areas where AWS has never been 
provided drivers will not have developed any reliance 
on AWS to warn them of signals and approaching 
signals without it will be a routine task. 


4 KNOCK-ON RISK 


Analysis of data shows that there is a strong cor- 
relation between certain types of hazardous events 
and train performance. Any operational response 
which affects train performance will therefore result 
in an increase in the risk from these hazardous 
events. This risk increase is called knock-on risk. 


4.1 Relationship between train performance 
and risk 


In order to calculate the knock-on risk, the 
amount of risk associated with train performance 
needs to be determined. This is achieved by esti- 
mating the percentage of annual risk, as calculated 
by the SRM, which is attributable to train delay 
minutes and cancellations. These percentages may 
be derived by investigating the correlation between 
event frequency and train performance data. Four 
main areas of risk were considered: 


— SPADs 

— Staff assaults (both physical and verbal) 
— Boarding and alighting incidents 

— Passenger slips, trips & falls at stations. 


The daily frequency with which each of these
events occurs may be plotted against Public
Performance Measure (PPM) data (Office of Rail
and Road 2017) to determine the percentage of
these events which are attributable to performance.
PPM is a measure of the percentage of passenger
trains which were delayed or cancelled on a
particular day. An example plot for SPADs is shown
in Figure 3. In this example, it can be shown that
each delayed train results in approximately an
additional 1 in 10,000 chance of a SPAD and that
34% of current SPADs are related to train delays.
The knock-on risk from SPADs is subsequently 
determined by adding together all of the risk in 
the SRM associated with SPADs, calculating the 
fraction of the annual delay minutes accumulated 
during the failure and multiplying this by the por- 
tion of the risk identified as resulting from delays. 
This process was then repeated for each of the four 
main risk areas. 
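
The calculation described in this paragraph can be written as a single product. The sketch below is a minimal illustration: the 34% share for SPADs comes from the text, while the function name and every other input value used in the usage note are hypothetical.

```python
def knock_on_risk(srm_risk_fwi_per_year: float,
                  delay_minutes_during_failure: float,
                  annual_delay_minutes: float,
                  fraction_attributable_to_delay: float) -> float:
    """Knock-on risk (FWI) for one risk area during one failure.

    srm_risk_fwi_per_year          -- total SRM risk for the risk area.
    delay_minutes_during_failure   -- delay minutes caused by the response.
    annual_delay_minutes           -- network delay minutes in a year.
    fraction_attributable_to_delay -- share of the area's risk linked to
                                      performance (0.34 for SPADs in the text).
    """
    if annual_delay_minutes <= 0:
        raise ValueError("annual_delay_minutes must be positive")
    if not 0.0 <= fraction_attributable_to_delay <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    share_of_annual_delay = delay_minutes_during_failure / annual_delay_minutes
    return (srm_risk_fwi_per_year
            * share_of_annual_delay
            * fraction_attributable_to_delay)


def total_knock_on_risk(area_risks) -> float:
    """Sum over risk areas; each item is a tuple of the four arguments above."""
    return sum(knock_on_risk(*area) for area in area_risks)
```

For instance, a hypothetical risk area with 10 FWI per year, 100 delay minutes during the failure, one million annual delay minutes and a 34% delay-related share gives 10 x 0.0001 x 0.34 = 0.00034 FWI.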




Figure 3. Number of SPADs per million trains run as a 
function of Public Performance Measure (PPM). The red 
line is a linear best-fit to the data. 

In addition to the risk areas where the percentage
associated with delays and cancellations has
been determined through the analysis of current
data, other risks have also been included in the 
knock-on risk calculations. These were identified 
in a previous risk assessment for the Interim Voice 
Radio System (IVRS) through expert judgment in 
workshops (Harris 2006). 


4.2 Calculating the knock-on risk 


Delay minutes and the number of cancelled 
trains are calculated for each possible operational 
response to an on-train equipment failure. The fac- 
tors considered when calculating these delay min- 
utes are: 


— Any delays (both reactionary and primary) from 
running at reduced speed. 

— Delays accrued from part or full cancellation of 
trains. 


In addition to delay minutes, any extra boarding 
or alighting resulting from cancelling a train mid- 
journey is also calculated. These delay minutes and 
extra boarding and alighting are then used with the 
relationships described previously to determine a 
knock-on risk for each operational response. 


5 OPERATIONAL RESPONSES 


For AWS failures, once a failure has been detected 
the train should travel to the next available location 
where the train can be dealt with. At this location it 
is assumed one of the following will occur: 


1. The train is taken out of service and is replaced 
by one with working AWS for the rest of the day. 

2. The affected cab is “boxed in” so that it is not 
required to be used for the rest of the day. 

3. If the rear cab has working AWS, then the train 
enters service driven from the working rear cab, 
and at the end of this subsequent journey, either 
1 or 2 above occurs.


Whilst travelling to the next available location, 
the train should proceed at a maximum speed 
of either 40 mph or 60 mph. If passengers are 
onboard the train, they may either be detrained 
at the next suitable station or remain onboard 
whilst the train travels to the next available loca- 
tion in order to complete their journey. Therefore, 
the total risk during four possible operational 
responses has been explicitly calculated: 


1. The train proceeds with a maximum permit- 
ted speed of 40 mph. Passengers may remain 
onboard the train for the duration of the dis- 
tance to the next available location whilst they 
finish their journey. 


2. The train proceeds with a maximum permitted 
speed of 40 mph. Passengers are detrained at 
the next suitable station and the train then pro- 
ceeds without passengers to the next available 
location. 

3. The train proceeds with a maximum permit- 
ted speed of 60 mph. Passengers may remain 
onboard the train for the duration of the dis- 
tance to the next available location whilst they 
finish their journey. 

4. The train proceeds with a maximum permitted 
speed of 60 mph. Passengers are detrained at 
the next suitable station and the train then pro- 
ceeds without passengers to the next available 
location. 
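
The four responses listed above are the combinations of two speed limits and two ways of handling passengers. A minimal Python sketch of this enumeration, and of selecting the response with the lowest total risk, is given below. The names are hypothetical, and any risk values passed to `lowest_total_risk` would have to come from the risk models; none are supplied here.

```python
from itertools import product
from typing import Dict, List, Tuple

Response = Tuple[int, str]  # (maximum speed in mph, passenger handling)

SPEED_LIMITS_MPH = (40, 60)
PASSENGER_OPTIONS = ("remain onboard", "detrained at next suitable station")


def operational_responses() -> List[Response]:
    """All combinations, in the same order as the numbered list above."""
    return list(product(SPEED_LIMITS_MPH, PASSENGER_OPTIONS))


def lowest_total_risk(risks: Dict[Response, Tuple[float, float]]) -> Response:
    """Pick the response minimising immediate + knock-on risk.

    risks maps each response to (immediate_risk, knock_on_risk) in FWI/mt.
    """
    if not risks:
        raise ValueError("no responses supplied")
    return min(risks, key=lambda response: sum(risks[response]))
```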


6 RESULTS 


The calculated risk values, in units of Fatalities 
and Weighted Injuries per million trains (FWI/mt),
for each of the possible operational responses are 
given in Figure 4. The risk is calculated for each 
train from the point of failure (assumed to be half- 
way through the day), to the end of the day (when 
the failure is assumed to be fixed). The baseline is 
the risk from a train running from the middle of the 
day till the end of the day with no failure. The risk 
values have been calculated by assuming the train 
is operating under approximately network average 
conditions and using the average risk increase from
the RARA analysis. For comparison, the level of
risk if a train continues until the end of the day
without AWS is also shown.

Figure 4. Risk in FWI/mt for different operational
responses to an AWS failure. The blue and orange bars
represent the immediate and knock-on risks respectively.
The black dotted line is the baseline level of risk for a
train with AWS working.

As can be seen in Figure 4, for a train operating 
under network average conditions the operational 
response which results in the lowest total risk is 
when: 


— The train proceeds to the next available location 
with a maximum permitted speed of 60 mph. 

— Passengers remain onboard the train for the 
duration of the distance to the next available 
location whilst they finish their journey. 


6.1 Effect of speed restrictions 


There is only a very small safety benefit in the 
immediate risk from reducing a train’s speed further
from 60 mph to 40 mph. The main reason for
this is the effectiveness of TPWS at preventing
a collision. At signals, TPWS is designed to
automatically apply the brakes if trains are
overspeeding on the approach to a signal as well as if
they go past the signal at danger. The majority of
overspeed sensors for TPWS systems are set so 
that they only intervene if trains are travelling over 
speeds in the range 42-46 mph on the approach 
to a signal at danger. Reducing the train’s speed 
to 40 mph therefore removes most of the protec- 
tion provided by the overspeed sensor. Whilst 
the effectiveness of applying the brakes once the 
train has passed the signal will increase as speed 
is reduced, the overall effect is that the effective- 
ness of the TPWS system is very similar at 40 mph 
and 60 mph. The overall predicted effectiveness of
the TPWS system is illustrated in Figure 5 for a
train travelling with different speed restrictions on
a track with a maximum permitted line speed of
70 mph. Whilst this only applies to signals where
TPWS is fitted, since these junction signals represent
a large percentage of the overall train collision
risk they have a large influence on the overall
results.

Figure 5. Predicted effectiveness of TPWS at stopping
a train collision at a signal for different trains travelling
with different maximum speeds on a track with a
maximum permitted line speed of 70 mph.

Reducing a train’s speed does, however, cause 
a large number of extra delay minutes and conse- 
quently an increase in the knock-on risk. Reducing 
the train’s speed from 60 mph to 40 mph results in 
an increase in the knock-on risk from 1.6 FWI/mt 
to 4.5 FWI/mt for an initial line speed of 70 mph. 
Since any safety benefit in the immediate risk from 
reducing the speed is very small, it is outweighed by 
the approximately three times increase in knock-on 
risk as delay minutes increase. It is found that for 
any initial line speed, applying a speed restriction 
of 60 mph always provides a lower total risk than 
40 mph. 
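
The trade-off in this paragraph can be checked with simple arithmetic. In the sketch below the knock-on values of 1.6 and 4.5 FWI/mt are those quoted above for an initial line speed of 70 mph; the immediate risk arguments are placeholders, because the corresponding values are only shown graphically.

```python
KNOCK_ON_AT_60_MPH = 1.6  # FWI/mt, from the text
KNOCK_ON_AT_40_MPH = 4.5  # FWI/mt, from the text


def knock_on_ratio() -> float:
    """Ratio of knock-on risk at 40 mph to that at 60 mph (about 2.8)."""
    return KNOCK_ON_AT_40_MPH / KNOCK_ON_AT_60_MPH


def extra_knock_on_at_40_mph() -> float:
    """Additional knock-on risk incurred by the tighter restriction."""
    return KNOCK_ON_AT_40_MPH - KNOCK_ON_AT_60_MPH


def prefers_60_mph(immediate_at_40: float, immediate_at_60: float) -> bool:
    """True if the 60 mph restriction gives the lower total risk.

    The immediate risks (FWI/mt) are hypothetical inputs.
    """
    total_at_60 = immediate_at_60 + KNOCK_ON_AT_60_MPH
    total_at_40 = immediate_at_40 + KNOCK_ON_AT_40_MPH
    return total_at_60 < total_at_40
```

Under these figures, the 40 mph restriction gives the lower total risk only if it reduces the immediate risk by more than about 2.9 FWI/mt, which is the additional knock-on risk it incurs.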


6.2 De-training passengers 


The knock-on risk from de-training passengers 
immediately and not letting them remain onboard 
to complete their journey has been calculated to 
be 5.5 FWI/mt. This risk is mainly due to the risk 
from extra boarding and alighting at unscheduled 
stations, where the platform may not be as suitable 
as a terminal station to accommodate a train full 
of passengers. 

The immediate risk for a train operating with- 
out passengers is still significant since an empty 
train could have a collision with another passenger 
train. Whilst the knock-on risk from de-training
passengers will only depend on the loading of the
train when it is cancelled, the immediate risk will
depend on the distance passengers would need to
remain on the train in order to finish their journey.
Figure 6 illustrates how the difference in immediate
risk (with and without passengers) increases
with journey distance for a train travelling with a
speed restriction of 60 mph on a section of track
with a line speed of 125 mph.

Figure 6. Immediate risk associated with allowing
passengers to remain onboard a train to complete their
journey or knock-on risk from de-training them, given by
the red and black lines respectively, plotted as risk
(FWI/mt) against distance to end of journey (miles).
Results are for a train travelling with a speed
restriction of 60 mph on a section of track with a line
speed of 125 mph. The dotted lines represent the upper
and lower bounds of the immediate risk estimate using
the two different estimates for the increase in driver
error from the RARA analysis.

From Figure 6 it can be seen that, even for a 
track with a line speed of 125 mph, the additional 
immediate risk of passengers remaining onboard 
in order to complete their journey is less than the 
knock-on risk of de-training them immediately for 
distances of up to approximately 100 miles. For 
distances of over 100 miles, the upper bound of 
the risk estimate starts to outweigh the knock-on 
risk. Therefore, if the remaining journey distance 
is longer than 100 miles, passengers should be de- 
trained before this distance is reached. 
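
The break-even argument above can be illustrated as follows. The sketch assumes that the additional immediate risk grows linearly with distance, which is a simplification introduced here and not a statement about the underlying model. The 5.5 FWI/mt knock-on risk and the 100 mile limit come from the text; the risk-per-mile rate is a hypothetical input.

```python
from typing import Tuple

DETRAIN_KNOCK_ON_RISK = 5.5   # FWI/mt, from the text
ONBOARD_LIMIT_MILES = 100.0   # approximate break-even distance from the text


def break_even_distance(immediate_risk_per_mile: float) -> float:
    """Distance at which remaining onboard matches the de-training risk.

    Assumes the additional immediate risk is proportional to distance.
    """
    if immediate_risk_per_mile <= 0:
        raise ValueError("rate must be positive")
    return DETRAIN_KNOCK_ON_RISK / immediate_risk_per_mile


def passenger_plan(remaining_miles: float,
                   limit_miles: float = ONBOARD_LIMIT_MILES) -> Tuple[float, float]:
    """Split the remaining distance into (miles with passengers, miles empty)."""
    if remaining_miles < 0:
        raise ValueError("distance must be non-negative")
    with_passengers = min(remaining_miles, limit_miles)
    return with_passengers, remaining_miles - with_passengers
```

Under the linear assumption, a break-even distance of about 100 miles corresponds to an upper-bound rate of roughly 0.055 FWI/mt per mile.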


7 CONCLUSIONS 


Following the risk analysis work, the operational 
response which was found to provide the lowest 
total risk during an AWS failure was determined 
to be: 


— The train to proceed to the next available location 
at 60 mph once the failure has been identified. 

— If passengers are onboard, they should be 
allowed to remain onboard in order to finish 
their journey whilst the train proceeds to the 
next available location. 

— If the distance to the next available location is 
greater than 100 miles, the distance passengers 
remain onboard should be limited to 100 miles. 
The train should proceed for the remainder of 
the distance to the next available location with- 
out passengers. 


Another possible mitigation for AWS failures is 
to provide a competent person in the driver’s cab 
to monitor the driver and ensure they react cor- 
rectly to speed restrictions and signal aspects. The 
risk from operating with a competent person was 
reviewed separately and all results presented in this 
paper have assumed no competent person has been 
provided. 


These operational responses represent a change 
from the current rules whereby the train proceeds 
at 40 mph and passengers are de-trained at the next 
suitable station. These rules were previously deter- 
mined by only considering the immediate risk to the 
train without AWS and before the national fitment 
of TPWS was completed. The proposed increase in 
speed restriction therefore seems reasonable and is 
consistent with more recently updated rules for other 
items of defective on-train equipment. The sensitiv- 
ity of the chosen operational responses to the various 
assumptions and input parameters in the model has 
been assessed and the results are found to be robust. 

The changes proposed in this work should pro- 
vide a more effective way of dealing with an AWS 
failure, considering both the immediate risk to the 
train operating without AWS and the knock-on 
risk an operational response has on the rest of the 
rail network. These changes should therefore pro- 
vide a benefit both in terms of safety and perform- 
ance for the GB mainline railway. 


ABSTRACT: The paper presents a semi-quantitative methodology developed to help Italian local
authorities address multi-risk aspects in their Land Use Planning practices. The methodology acts as
a pre-screening of the risks present in the territory, highlighting the areas most exposed to risk and risk
interactions, and also taking into account aspects neglected by the sectorial plans. A brief overview of the
methodology is provided, together with a significant Italian case study: a small town in Piedmont, for which
neither the land use planning related to major-risk plants nor the supra-regional plans for flood prevention
were sufficient to obtain a detailed representation of the overall risk. The proposed methodology analyzed
the context and evaluated the possible interactions, identifying possible environmental consequences, and
then directed further studies and interventions towards the critical situations. A dedicated questionnaire was
developed for the plants, to examine in depth the assets most exposed to NaTech risks.


1 INTRODUCTION 


Land Use Planning (LUP) procedures improve and
plan the use of territories; they therefore have
to deal with several types of risk, from the
natural risks deriving from the territory itself (flood,
earthquake) to man-made risks (technological risks,
climate change, etc.). Risks are addressed through
dedicated sectorial plans, which are developed and
applied hierarchically: in Italy, for example, they are
usually drafted by regional or supra-regional
authorities and then applied by the Municipalities.
However, the multiplication of tools dealing with
risks (City plans, Emergency plans, supra-local plans)
can sometimes lead to the loss of important
information; in addition, climate change is affecting
the reliability of the calculated return periods for
events influenced by climate. Above all, an integrated
plan containing all the risks does not currently exist,
so possible risk interactions are also neglected.

Several projects related to multi-risk have been
developed in recent years, proposing different
types of methodologies to deal with the problem
of interactions. Besides qualitative approaches,
which are mainly adopted at a wider general scale,
many projects proposed quantitative analyses
aimed at taking into account and harmonizing the
different probabilities of occurrence of the risks.
However, these methodologies can be affected by
a lack of data, and are very long, costly and difficult
for the final users and stakeholders, namely the
LUP decision-makers (Menoni et al., 2006; Nadim
& Liu, 2013). This is particularly evident in Italy:
Municipalities, as final LUP planners, do not have the
expertise and financial resources needed to apply any
multi-risk approach, especially if it is not made
mandatory by law. The lack of integration between
risk plans and the failure to consider risk
interactions, combined with the increasing effects of
climate change, have already led to several disasters.
Examples are the repeated floods in Genoa, caused
by a creek whose dangerousness was well known
but not adequately represented in the Municipal
emergency plan and City plan, and the Rigopiano
hotel tragedy, in which an avalanche triggered by an
earthquake struck a hotel built in an area where
construction should not have been permitted. The
dangerousness of this area had been correctly
identified in an old map produced by the Abruzzo
Region, which however was not carried over into the
update of the local City plan.

In order to help the Municipalities take
their territorial risks into account in an integrated
way, the authors developed a semi-quantitative
methodology for the local scale, based on indexes and
intended for direct use by the Municipality
technicians. It acts as a pre-screening instrument to
rapidly identify the areas most exposed to risks and
their possible interactions, to which further studies,
resources and planning actions should primarily be
directed. The methodology is briefly summarized in
Section 2; an in-depth explanation can be found in
(Pilone et al., 2017). This paper mainly focuses on its
application to an Italian case study, a Municipality in
the Piedmont region, where possible interactions
between flood and industrial risks were identified.


2 METHODOLOGY FOR RISK 
INTERACTION 


The steps of the proposed methodology were partially
inspired by the ERIR, a plan for safe LUP
around Seveso plants, which in Italy is mandatory for
Municipalities with a Seveso plant inside their
territory. While the ERIR only considers industrial
risk, the objective of the proposed methodology
was to identify the impact of several territorial
risks and of their possible interactions, on
the basis of a semi-quantitative rating scale for the
index I, running from 0 to 3 and above:


0 < I < 0.99: Negligible 

1 < I < 1.99: from Low to Moderate 

2 < I < 2.99: from Moderate to High 

I > 3 onwards: from High to very high. 


The methodology was explicitly designed for a 
direct use by the Municipality technicians. 
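
As an illustration, the rating scale above can be expressed as a small
lookup function. This is only a sketch: the function name `impact_band` is
hypothetical, and it assumes that the bands are contiguous (a value such as
0.995 is assigned to the lower band) and that only non-negative ratings are
classified.

```python
def impact_band(rating: float) -> str:
    """Map a semi-quantitative rating I to its qualitative impact band.

    Assumption: bands are treated as half-open intervals [0, 1), [1, 2),
    [2, 3) and [3, inf), so that no value falls between two bands.
    """
    if rating < 0:
        raise ValueError("the scale is defined for non-negative ratings only")
    if rating < 1:
        return "Negligible"
    if rating < 2:
        return "Low to Moderate"
    if rating < 3:
        return "Moderate to High"
    return "High to Very High"


if __name__ == "__main__":
    for value in (0.5, 1.98, 2.5, 3.0):
        print(value, "->", impact_band(value))
```

A technician could apply such a helper to label each rated point of the
territory before the values are compared with the vulnerability maps.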


2.1 Risk characterization 


The most relevant territorial risks have to be described and
investigated according to three macro-categories, which
express specific aspects of the risk analyzed and determine
its impact: 1) HE, Historical and recent events: recurrence
of the risk events analyzed; 2) PM, Protection measures:
protective and preventive measures that could reduce the
impact of the risk analyzed; 3) SE, Strengthening effects:
local characteristics that increase the risk effects. The
latter was explicitly introduced to consider risks not only
on the basis of their probability of occurrence, but also in
relation to all the intrinsic factors, sometimes neglected in
the sectorial plans, that could enhance the final risk
impact: e.g., for earthquakes, the quality of the soil; for
floods, section reductions and flow obstructions; for Seveso
industries, the quantity and type of substances held, the
type of items, etc.

The macro-categories are rated according to the local
variations of the risk, in compliance with the
above-mentioned scale. A dedicated guideline was developed
to help the Municipality technicians in this procedure; at
the moment, the guide covers the most widespread risks in
Italy and in Europe: industrial, flood and seismic risks
(Table 1 reports the first two).


2.2 Risk interaction 


The macro-categories constitute the basis for evaluating
possible risk interactions, because they accurately describe
each risk and its possible impact. Therefore, when risks
overlap, the effect of one risk on another (binary
interaction) can be assessed through a weighted sum of the
ratings assumed by the two risks, here denoted R1 and R2, at
the analyzed point of the territory, divided by 6, following
Equation 1 below:

$$\text{Interaction} = \frac{(HE_{R1} + HE_{R2}) \cdot 2 + (SE_{R1} + SE_{R2}) \cdot 1 + (PM_{R1} + PM_{R2}) \cdot 0.5}{6} \qquad (1)$$


Equation 1 also shows the different weights 
assigned to the Risk macro-categories for the inter- 
action assessment; in fact, they have different reli- 
ability in terms of available data and capacity to 
influence the final interaction value. The weights 
assigned following these criteria were validated 
through expert judgement. 
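
A minimal sketch of Equation 1 is given below to make the weighting
explicit. The names `RiskRating` and `binary_interaction` are hypothetical
and not part of the methodology; the input values are the ratings reported
for the case study in Tables 4 and 6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskRating:
    """Ratings of the three macro-categories of one risk at one point."""
    se: float  # Strengthening effects
    he: float  # Historical and recent events
    pm: float  # Protection measures


# Weights and divisor as they appear in Equation 1.
WEIGHT_HE = 2.0
WEIGHT_SE = 1.0
WEIGHT_PM = 0.5
DIVISOR = 6.0


def binary_interaction(a: RiskRating, b: RiskRating) -> float:
    """Apply Equation 1 to two overlapping risks."""
    weighted_sum = (
        (a.he + b.he) * WEIGHT_HE
        + (a.se + b.se) * WEIGHT_SE
        + (a.pm + b.pm) * WEIGHT_PM
    )
    return weighted_sum / DIVISOR


if __name__ == "__main__":
    # Plant 'X' area (Table 4): flood vs. industrial risk.
    flood_x = RiskRating(se=3, he=2, pm=0)
    plant_x = RiskRating(se=3, he=1.5, pm=0)
    print(round(binary_interaction(flood_x, plant_x), 2))  # 2.17

    # Plant 'Z' area (Table 6).
    flood_z = RiskRating(se=3, he=2, pm=0)
    plant_z = RiskRating(se=2.8, he=1.5, pm=-1.8)
    print(round(binary_interaction(flood_z, plant_z), 2))  # 1.98
```

Both results match the interaction values reported for the case study. The
value of 0.58 reported for plant 'Y' is reproduced when the industrial HE
rating of 0.5 listed in Table 3 is used.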


2.3 Risk compatibility and planning phase 


The values obtained for the binary interactions and for the
risks have to be superimposed on the territorial and
environmental vulnerabilities, whose identification follows
the Italian legislation for ERIR (Ministerial Decree
09/05/2001). If the values of the interactions or of the
macro-categories exceed the threshold of 2.5 (medium-high
impact) in areas where relevant or extreme vulnerable
elements are present, a potential incompatibility is
identified. The Municipality shall then conduct further
studies and investigations to verify the situation and adopt
appropriate planning measures.
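
The screening rule described above can be sketched as follows. The function
name, the dictionary keys and the vulnerability labels are illustrative
assumptions; in particular, the strict comparison (values must be greater
than 2.5) is one reading of "threshold" and is exposed as a parameter.

```python
from typing import Iterable, Mapping

THRESHOLD = 2.5  # medium-high impact, as stated in the text
SENSITIVE_LEVELS = {"relevant", "extreme"}  # assumed labels


def potential_incompatibility(
    values: Mapping[str, float],
    vulnerability_levels: Iterable[str],
    threshold: float = THRESHOLD,
) -> bool:
    """Return True when any rating or interaction value exceeds the
    threshold in an area containing relevant or extreme vulnerable
    elements."""
    exceeds = any(value > threshold for value in values.values())
    sensitive = any(
        level.lower() in SENSITIVE_LEVELS for level in vulnerability_levels
    )
    return exceeds and sensitive


if __name__ == "__main__":
    area_values = {"SE": 2.8, "HE": 1.5, "interaction": 1.98}
    print(potential_incompatibility(area_values, ["relevant"]))
```

A positive result does not establish an incompatibility; it only marks the
area as one where further studies are required.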

An optional step was introduced in this phase of the
methodology to help Municipalities assess the interactions
involving industrial risks: with two modelling software
packages (ALOHA® and HSSM®), the possible spatial extent of
the damage areas can be estimated and taken into account in
the planning phase.

Table 1. Rating guideline: industrial and flood risks.

| Category | Rating 1 ≤ I ≤ 1.99 | Rating 2 ≤ I ≤ 2.99 | Rating I ≥ 3 |
|---|---|---|---|
| INDUSTRY | | | |
| SE | Few items with NaTech risk; scenarios related to flammable substances, with a reduced area of impact. | Items with NaTech risk; scenarios related to flammable and environmental substances. | Huge quantities of hazardous substances. Items with NaTech risk. Toxic scenarios and/or with a great extension. |
| HE | No relevant or NaTech accidents occurred. | Low impact events (NaTech/with external repercussions). | High impact events (NaTech/with external repercussions). |
| PM | No dedicated measures for NaTech; lack of protective measures towards the environment. | Good safety level, partially effective also towards NaTech accidents. | Preventive measures adequate for avoiding NaTech risk and domino effects. |
| FLOOD | | | |
| SE | Interaction with other rivers/creeks with low or reduced criticalities; hydraulic devices in good state; no or few critical points. | Interaction with other rivers/creeks and hydraulic control devices with moderate criticalities; critical points; the river/creek/etc. analysed contains a key element for safeguarding the general safety of the system. | Problematic interaction, recognized high critical areas, reported in Flood plans. Hydraulic devices in bad condition, with recognized criticalities. |
| HE | Rare main flood events, return time of Flood management plans is confirmed (zones classified as C, or Em, Cn, if recent events do not evidence different distributions/timing of the floods). | Floods of moderate impact, and/or in areas not included in Plans, with a short return time (250 years) (zones classified as B, or Eb, Cp, if recent events do not evidence different distributions/timing of the floods). | Events with return time > than that of the Flood management plan worst zone (zones classified as A, or Ee, Ca, if recent events do not evidence different distributions/timing of the floods). |
| PM | No water regulation artefacts/systems or insufficient number/way. Criticalities and inadequate safety level. | Water network/river/creek is properly controlled, the artefacts do not show relevant criticalities. | The management of the water network/river/creek is well coordinated, evidencing no criticalities. |


3 CASE STUDY 


The application of the proposed methodology is shown through
an Italian case study: a small Municipality near Turin with
16000 inhabitants, affected by flood and industrial risks
generated by "minor sources", which are not adequately
considered in the sectorial planning.

The town lies on flat land crossed by several artificial
channels, derived from the Stura river and used in the past
for irrigation. Urbanization and industrialization
completely altered the functioning of the water network:
many channels were diverted, interrupted or put underground,
and their maintenance ceased completely, while the
impermeable surface increased dramatically. Besides this
minor water network, the northern portion of the municipal
territory is crossed by a creek, characterized by several
reduced flow sections and by low banks.


Banna-Bendola and the water network produced extensive
floods in 1994, 2000 and 2008; the water height reached
80-100 cm. The flooded areas were mapped in detail by
(Regione Piemonte, 1998) and (Provincia di Torino, 2009),
but the new PGRA (Piano Gestione del Rischio Alluvioni, Plan
for Flood management) (A.D.B.Po, 2016) only reported the
potential flooding areas of the creek. The dangerousness of
the secondary water network and its combined effects with
the creek during intense rainfall events were neither
analyzed nor mapped. At the same time, the return times
assigned to the creek buffer zones were not fully in line
with the recurrence of the recent events, as also
demonstrated by (Politecnico di Torino, 2009).

The industrial risk of the Municipality analyzed is ascribed
to a single Seveso plant, a former plating factory named 'X'
for this paper, which was closed in 2010 for breach of the
obligations related to the AIA (Integrated Environmental
Authorization). Since it was never ascertained whether the
underground plating basins had been emptied safely, the
Municipality had to prepare the ERIR plan. The ERIR plan was
drafted in compliance with (Regione Piemonte, 2010) and
(Provincia di Torino, 2010), which require the analysis to
include the so-called Seveso Sub-threshold plants (holding
20% of the quantity of hazardous substances necessary to be
classified as a Lower-tier Seveso plant) and all the
potentially hazardous activities. Thanks to this
requirement, not applied in other Italian regions, two
potential Seveso plants (named 'Y' and 'Z'), not reported by
the Authorities in charge, were identified. Probably, since
ERIR was drafted immediately after the approval of
Legislative Decree 105/2015 (Italian implementation of the
Seveso III Directive), these plants had not adequately
checked the new classification of hazardous substances
imposed by the decree. The hazardous plants identified are
mostly located close to the water network; they were
repeatedly affected by flooding, even though there is no
record of the consequences of these events (see Figure 1).
Since the current sectorial plans neglect some sources of
risk (e.g. the secondary water network) and do not allow the
danger deriving from the Flood/Industry interaction to be
considered in an integrated way, the methodology was applied
to assess possible unforeseen consequences of the
co-presence of the risks on the Municipal territory.

[Figure 1 map legend. Flood buffer areas (PGRA): Extreme
dangerousness (Ee), High dangerousness (Eb), Medium-low
dangerousness (Em). Areas affected by flood events: flood
event 09.2008, flood event 09.2008 (Municipal data), flood
event 10.2000, flood event 11.1994.]

Figure 1. Plants and areas affected by flood risk (Provincia
di Torino, 2009).


3.1 Risk characterization 


FLOOD: In order to assign the ratings to the flood risk
macro-categories, rivers and creeks have to be divided into
portions with homogeneous characteristics and behavior. For
the creek, a single portion was identified in the Municipal
territory because, according to (Politecnico di Torino,
2009), this is a single sub-basin, uniformly characterized
by very gentle slopes of both the riversides and the river
course. The secondary water network was also considered as a
single element, given the complexity of the interactions and
interdependencies between the canals, and also because all
the main canals overflowed in the same way during the past
events.

Table 2 shows the ratings attributed to the macro-categories
of these two hydraulic elements, which are responsible for
the municipal flood risk. As shown in Table 1, as far as HE
ratings are concerned, the areas with a higher probability
of flooding obtain a higher rating. However, the flood
events that affected the municipal area in the last 20 years
recurred more frequently than the return times defined by
PGRA imply, in particular for the areas with medium-low
flood hazard. The need to re-assess the buffer zones of the
creek because of this mismatch was also recognized by
(A.D.B.Po, 2016). As a consequence, the HE rating of the
areas with medium-low flood hazard (Em) was raised to 2,
instead of 1 (see Table 2, fourth column of 'Creek').

For the secondary network, no flood return times were
available, but since the network took part in all the major
flood events of the last 20 years, and in 2008 was
responsible for an actual breakdown (Politecnico di Torino,
2009), a HE value equal to 2 was assigned.

SE ratings considered all the possible criticalities
encountered for both the creek and the water network, while
PM ratings unfortunately reflect the total absence of
protection measures, in spite of the interventions proposed
by (Provincia di Torino, 2009).

INDUSTRY: The ratings for the plants 'X', 'Y' and 'Z' were
assigned on the basis of the following information: for the
first plant, officially recognized as Seveso, the Emergency
plan (Prefettura di Torino, 2007), the Environmental
Authorization AIA (Provincia di Torino, 2007) and the
Notification of the plant, which were the last documents
drafted before the closure; for the other plants, the
questionnaires compiled by the owners during the ERIR
drafting.

Table 2. Flood risk: rating assignation.

| Element | SE: Interaction with other elements | SE: Criticalities of the artefacts, sections | PM: Hydraulic artefacts, levees etc. | HE: Recurrence |
|---|---|---|---|---|
| Creek | Possibilities of inverted flow from the creek to the tributaries or upstream. | 5 critical sections producing hydraulic insufficiencies were identified in correspondence of bridges. SE rating: 3. | 2 areas for flood expansion and a stone riverbank were planned; only the latter was built, after the flood in 2000. | 2 zones: Extreme flood hazard Ee (20-50 yrs), Medium-low flood hazard Em (300-500 yrs). Floods in Em more frequent than the assigned return time, water height 1 m. HE ratings: Ee 3, Em 2. |
| Secondary water network | Water intakes from Stura cannot be regulated; the creek can feed the channel network during flood events. | Reduced slope of soil and canals, scarce maintenance, obstructions, riverbeds not defined, inadequate crossing artefacts; covered portions, diversions. Raised roads block the natural flow. | 5 floodway channels were planned to return the excess flows to Stura, but no interventions were carried out. | No flood buffer zones and return times assigned, except for a small portion in an agricultural area (Em). Recurrent overflowing in 1994, 2000 and 2008, water heights between 30 and 80 cm in the city centre, other areas. |

However, essential information on case history, storage
conditions and preventive and protection measures was
missing for all the plants. Table 3 shows the ratings
assigned to the macro-categories of each industrial plant:
for the macro-categories SE and PM, Google Maps and Google
Street View made it possible to partially fill in the
missing data on items exposed to NaTech risk, waterproof
aprons, etc., but no alternative sources of information were
available for HE. Therefore, a common indicative value of
1.5, corresponding to a low-medium impact, was assigned if
the plant had been affected by past flood events. A
negligible HE value, equal to 0.5, was given to the plants
not hit by flood events but located in proximity of canals,
to take into account the possibility of overflowing water.
The ratings assigned to the macro-category PM were generally
kept low for precautionary reasons.


3.2 Risk interaction 


The industrial and flood characterizations made clear that
the analyzed Municipality is not affected by extremely high
risks due to huge plants or important rivers; the risks are
generated by small lower-tier Seveso plants and low-energy
flood events. However, the interaction of plants, mostly
holding toxic and environmentally hazardous substances, with
the recurrent overflowing events could produce unexpected
and severe conditions for people and the environment,
because of the lack of adequate protection and prevention
measures. In order to verify possible consequences,
Equation 1 was applied to each area where flood areas and
plants overlap, through dedicated Binary Interaction tables
(Tables 4, 5 and 6). They report the ratings attributed to
the risks at a specific point of the territory (in this
case, in the areas of plants 'X', 'Y' and 'Z') and allow
Equation 1 to be applied repeatedly.

The possible interaction was also verified for plant 'Y',
even though it was never affected by flooding, for
precautionary reasons: it is adjacent to canals whose slope,
maintenance and riverbed conditions are not different from
those of the other canals. Indicative low values were
attributed to Flood risk in this area: SE = 1, HE = 0,
PM = 0.


Table 3. Industrial risk: rating assignation.

| Plant | SE: Assets, items at risk and substances | PM: NaTech and general pollution | HE: Recurrence | Ratings (SE / PM / HE) |
|---|---|---|---|---|
| X | 5 plating basins: Nickel (N). 3.5 t storage barrels: Chromium trioxide, nickel (T, N). Other SE elements: plant closed under unsafe conditions, situation not monitored. | [...] recommended measures and was closed for its non-compliance. POLLUTION: measures for environmental protection not adopted. NATECH: information not available. | [...] included in (Em) buffer zone. | 3 / 0 / 1.5 |
| Y | 4.6 t storage barrels: Diisocyanates (T). 5 t storage barrels: Isophorone diisocyanate (T, N). 20.4 t storage barrels: Formic acid (T). Further substances listed: DMAE (T, F), Propylene diamine (F), Zinc oxide, derivates (N), DPG dust (N), Acetone (F). 5 t bags: Sodium fluosilicate (F). Other SE elements: Toxic substances = lower tier Legislative Decree 105/2015. Plant not compliant with the regulation. | Waterproof apron; storage and area for loading/unloading under cover. POLLUTION: Plant subjected to AIA; Plan for the management of rainy water (addressed to a canal). Collection system for accidental spills; different drainage lines for rainy and process water; sedimentation basin, shutter. NATECH: information n.a. | No information on the plant accidents is available. The area is very close to those affected by the floods in 1994 and 2008. | 2.5 / -2 / 0.5 |
| Z | Tank, 27 t: [...]. Tank, 50 t: Formaldehyde 24% (T). Tank, 25 t: Acrylic acid (F, N). Tank, 27 t: Acetic acid (F). Bags, 22 t: Ammonia (F). Other SE elements: Toxic substances > lower tier Legislative Decree 105/2015. Seveso plant not compliant with the regulation. Outdoor unprotected storage areas, some tanks seem to have the containment basin. | [...] authority recommended adopting for the tanks: level alarms, containment basins. POLLUTION: Plant subjected to AIA; Plan for the management of rainy water (addressed to a canal). Collection system for accidental spills; different drainage lines for rainy and process water; emergency basin. NATECH: information n.a. | No information on the plant accidents is available. The plant was repeatedly affected by the flooding of the adjacent canal. | 2.8 / -1.8 / 1.5 |


3.3 Compatibility assessment


The vulnerable elements were investigated according to the
requirements of national and regional legislation. For the
territorial vulnerability, related to urban density and
people density, the majority of the residential areas was
included in classes C and D of Ministerial Decree 09/05/2001
(building ratio index < 1.5 m³/m²). Some individual


buildings were categorized as A (mostly schools) and B
(discotheque, bowling center), because they are highly
frequented. As for the Environmental vulnerability, no
elements with Extreme vulnerability were found, but Relevant
vulnerable elements typical of flat land areas were present:
1) depth of the aquifer between 0 and 3 meters; 2) land use
capacity of the soil between classes 1 and 2. These features
make the Municipal territory very vulnerable to possible
pollution events. The compatibility was assessed in the
areas of risk interaction (flood and industry),
corresponding to the zones of plants 'X', 'Y' and 'Z': a
buffer zone of 500 m was drawn around each plant, projecting
onto it the values of the Industrial macro-categories and of
the F/I interaction.

The condition of compatibility can be considered satisfied
if no A and B vulnerable elements are included in buffer
zones where the HE, SE or Interaction values are higher than
2.5, a threshold corresponding to a medium-high impact.

Figure 2 shows an example of the Territorial compatibility
analysis for plant 'X': inside the buffer zone, residential
areas classified as C and D, and E areas (building ratio
index I < 0.5 m³/m²), are identified.

Figure 2. Plant 'X' buffer zone with territorial vulnerable
elements.

The threshold of 2.5 is adopted for the environmental
vulnerability too, but the specific relation between the
threats and the environmental vulnerable element has to be
investigated (not all the elements are equally sensitive to
risks). The environmental vulnerable elements identified for
the case study were sensitive both to Industrial risk and to
its combined effects with flood.

The assessment of the territorial and environmental
compatibility for each plant is reported in Tables 7, 8
and 9.

Table 4. Binary interaction: Plant 'X'.

| | Flood risk | Industrial risk |
|---|---|---|
| Ratings in plant 'X' area (SE / HE / PM) | 3 / 2 / 0 | 3 / 1.5 / 0 |
| Flood risk | No interaction | 2.17 |
| Industrial risk | No interaction | - |

Table 5. Binary interaction: Plant 'Y'.

| | Flood risk | Industrial risk |
|---|---|---|
| Ratings in plant 'Y' area (SE / HE / PM) | 1 / 0 / 0 | 2.5 / 0.5 / -2 |
| Flood risk | No interaction | 0.58 |
| Industrial risk | No interaction | - |

Table 6. Binary interaction: Plant 'Z'.

| | Flood risk | Industrial risk |
|---|---|---|
| Ratings in plant 'Z' area (SE / HE / PM) | 3 / 2 / 0 | 2.8 / 1.5 / -1.8 |
| Flood risk | No interaction | 1.98 |
| Industrial risk | No interaction | - |


3.4 Results and planning steps 


The application of the methodology to the case study
demonstrated that the simultaneous presence of Industrial
and Flood risk can produce unexpected interactions,
characterized by low-medium impacts (plant 'X' area = 2.17,
plant 'Z' area = 1.98), which are reasonably in line with
the verified low energy of the flood events in the areas
(water height between 30 and 80 cm).

The Interaction values do not exceed the alert threshold of
2.5; however, the plants analyzed are subject to potential
incompatibility related to their Industrial macro-category
SE, whose values are high because of particular conditions
(abandonment of plant 'X', non-compliance with the Seveso
regulation of plants 'Y' and 'Z'). HE received low ratings
only as a consequence of the unavailability of data.

Exceeding the 2.5 threshold signals to the Municipality that
further investigations are needed, in order to: 1) confirm
or rule out the incompatibility; 2) plan possible LUP
actions, taking into account the actual conditions of the
plants and their possible interactions. However, collecting
further information on the plants could be a difficult task:
Sub-threshold plants and abandoned plants do represent a
potential threat for population and environment, but they
have no legal obligation to provide information about the
substances held or the possible external risks deriving from
their activities. This lack of obligation and monitoring
could in some cases raise the level of risk above that of a
Seveso plant.

In order to verify the actual hazardousness of the plants
and establish their compatibility, it is essential to know
at least the type of storage, the prevention and protection
measures adopted and the case history. For this reason, a
detailed questionnaire, reported in Table 10, was proposed
by the authors; the portion related to the environmental
analysis is extracted from the Provincia di Torino Seveso
guidelines (Provincia di Torino, 2010).

Table 7. Compatibility: Plant 'X'.

| | Territorial | Environmental |
|---|---|---|
| Vulnerabilities inside 500 m | 1) C and D residential areas. 3 productive areas (E) destined to future commercial function. Two are affected by Flood HE = 3; 2) Few C punctual elements are included; 3) No linear elements and strategic areas/buildings/infrastructures. | Land use soil capacity 1st and 2nd classes; water table depth < 3 m; historical urban areas. Canal for irrigation adjacent to the northern border of the plant, probably used in the past to drain the rainy water. Presence of a well inside the plant. |
| Compatibility | HE and SE ratings for Flood and Industrial risks > 2.5 threshold; no manifest incompatibility because of low people density. However, the state of abandonment of the plant represents a potential threat for the territorial elements, particularly in case of flood events (medium value of interaction). Further analysis should be carried out, in particular to verify the state and filling of the containment basins. Areas addressed to future transformations: considering the High Flood risk level, avoiding high density of people and adopting specific constructive parameters. | A potential incompatibility is detected: Industrial SE and HE ratings ≥ 2.5 threshold, in an area where the environmental elements are particularly sensitive to pollution. The interaction value is medium: flood events, even with their low energy, could cause unexpected spreading and diffusion of pollutants towards groundwater and surface water. No prevention and protective measures for the environment have ever been adopted. An onsite visit is recommended to verify the actual conditions of the plant, and to organize a recovery procedure. |

Table 8. Compatibility: Plant 'Y'.

| | Territorial | Environmental |
|---|---|---|
| Vulnerabilities inside 500 m | 1) E productive areas; 2) No A and B punctual elements; 3) No linear elements and strategic areas/buildings/infrastructures. | Water table depth < 3 m; land use soil capacity 1st and 2nd classes (agricultural areas around the plant); two canals for irrigation are close to the plant. |
| Compatibility | No incompatibilities were encountered with respect to the territorial vulnerabilities. | Potential incompatibility (SE = threshold) in a highly sensitive area. The plant declared adequate Protection measures; however, since it is not in line with the Seveso regulation, at least an in-depth analysis of the storage methods and of the protection and preventive measures should be carried out. |

Table 9. Compatibility: Plant 'Z'.

| | Territorial | Environmental |
|---|---|---|
| Vulnerabilities inside 500 m | 1) C residential areas + 2 E productive areas for future commercial function; 2) 2 punctual elements in B (commercial centre, bowling; church); 3) Power lines. | Water table depth < 3 m; presence of a canal for irrigation adjacent to the northern border of the plant. |
| Compatibility | Potential incompatibility: threshold for SE > 2.5 with two punctual elements classified as B. An in-depth analysis is recommended for: 1) specific activities of the 2 vulnerable elements; 2) the storage methods and protection and preventive measures of the substances classified as toxic (H2). | Potential incompatibility: SE = 2.8 exceeds the compatibility threshold; the interaction with flood events, even if characterized by a low-medium value (1.98), could enhance the threat. Further analysis of the possible pollution scenarios and of the prevention and protective measures against flood should be carried out. |

Table 10. Questionnaire for in-depth investigation of
plants.

A. STORAGE CONDITIONS & NA-TECH ITEMS

1) With reference to the hazardous substances held, please
indicate in detail the storage conditions of each hazardous
substance, describing type, capacity, quantity and
containment measures adopted:

Hazardous substance:
Stored in (container type):
Number of containers and/or total capacity:
Single container capacity:
Position (inside, outside, outside under cover,
underground, etc.):
Containment measure adopted for the container (basin,
waterproof ground etc.):

2) Please report if the following items are present:

Underground pipelines, pipelines passing over
non-waterproofed soil.
Description (length, width, substance transported,
protection measure):

Tall and slender structures (flares, chimneys, cooling and
distillation towers etc.).
Description of the structure and its function:

Open-air water treatment basin/liquid waste storage.
Description of the installation and related preventive
measures:

B. CASE HISTORY

3) Please report a list of the accidents that occurred in
the last 20 years and caused a release of hazardous
materials:

Date | Item affected | Accident description

4) Please report any damage caused by: flood events, extreme
climate events, earthquake.

Date | Item affected | Accident description

C. ENVIRONMENTAL ANALYSIS (1)

5) For the environmental protection, the owner shall
demonstrate to have adopted the protective and preventive
measures recommended by the Turin Province Guidelines;

OR

5) Proceed with a vulnerability analysis of the conditions
of water and soil around their plants:

> Depth and direction of the phreatic aquifer near the
plant, in a sector with 30° of amplitude and 3 kilometres of
extension, measured from the possible point of release in
the direction of the aquifer flow;

> Presence of wells inside the same sector, within an
extension of 500 metres;

> Presence of drains into surface creeks or canals.

(1) Turin Province guidelines: if the three conditions
reported are all verified, the owner shall adopt all the
measures of points 1, 2, 3 (although the Municipality could
in some cases relieve the owner of the application of
point 3).


4 CONCLUSIONS 


The application of the proposed methodology to the case
study quickly identified the areas most exposed to risk,
returning plausible results in terms of possible risk
interaction impact, in line with the initial risk values.
The risk pre-screening makes it possible to take into
account, in an integrated way, the risk information
contained in the various sectorial plans; at the same time,
the Municipality technicians can employ their direct and
detailed knowledge of the Municipal territory. Therefore,
the methodology can foster increased awareness of risks and
sound risk and LUP management.

Many developments and further steps are possible: the
proposed framework, so far elaborated for 3 risks
(Industrial, Flood and Seismic), can be extended to more
territorial threats, and the methodology could be exported
to other countries simply by adapting the criteria for
rating assignment. The authors are currently working on the
development of participative practices to ease the
technicians' approach to the methodology, and contacts are
under way with Municipalities to test the proposed approach
directly.

ABSTRACT: The occurrence of an emergency during freight rail transport is not necessarily, and frequently is
not, affected by the type of material transported; however, when such an emergency occurs, the category of
the substance transported may indicate an increased risk, in particular the risk of explosion, fire, and
significant threat to property and people. Leakage of any substance, in particular a hazardous substance,
represents a significant part of all emergency cases in rail transport. Despite the seemingly decreasing
number of these threats, every single incident has to be investigated and analysed. Examination of the
individual cases often showed that the leaked substance was not classified as hazardous, because it was
either plain water or commonly leaking operating fluids. Nevertheless, every leakage has to be thoroughly
investigated because it is not clear beforehand which category the substance belongs to. Our aim was to
identify the crucial factors leading to emergencies on the railway in the Czech Republic through a detailed
examination of past incidents. Most of these factors are the same for railway transport in general; however,
there may be particular local specificities and organizational faults in individual transport units. Another
goal of the research was to reveal these specifics.


1 INTRODUCTION 


The occurrence of an emergency during freight rail
transport does not necessarily depend, and
frequently does not depend, on the type of material
transported. If an emergency does occur, however,
the category of the transported substance may imply
an increased risk, in particular of explosion, fire,
and significant threat to property and people.

Leakage of any substance, in particular a hazardous
substance, accounts for a significant part of all
emergency cases in rail transport. Despite the
seemingly decreasing number of such incidents,
every single one has to be investigated and
analysed. Examination of the cases showed that the
leaked substance was often not classified as
hazardous, because it was either plain water or an
operating fluid that leaks frequently. Nevertheless,
every leakage has to be investigated thoroughly
because the category of the substance is not known
beforehand. Správa železniční dopravní cesty
(Management of Railway Network Company) gave us
access to its database, which was studied
thoroughly; data from the 9-year period 2008-2016
could be selected and processed. A longer period was
not available because the data had previously been
archived by a different method. In 2008-2016, 597
leakages were recorded in the Czech Republic. In
accordance with the Regulations for international
rail transport (RID), dangerous substances are
classified into categories according to their hazard
class. Under this categorization, most leakages of
hazardous substances fell into hazard class 3
(flammable liquids), class 8 (corrosive substances),
and class 2 (gases).

Our aim was to identify the crucial factors leading
to emergency cases on the railway in the Czech
Republic through a detailed examination of past
incidents. Most of these factors apply to railway
transport in general; however, there may be local
specifics and organizational faults in individual
transport units. Revealing these specifics was a
further goal of the research.

By statistical processing of the available data, a
model of accident distribution based on the leakage
location was developed (Hasilova, et al. 2017b).
The research also identified the operating units
with the highest number of emergency cases, the
risk of emergency occurrence and the development
trend. The clear objective of the research was to
contribute to increasing the safety of hazardous
substance transport by rail. All the data obtained
were passed to the transportation company for
further assessment and incorporation into internal
safety regulations.

The presence of hazardous substances in rail
transport poses a higher risk to individuals,
critical infrastructure and the environment.

The most frequent factors affecting the occurrence
of emergency cases on the railway are as follows:


1. Technical condition: state of the track, state of
safety devices, state of notification devices,
state of the crossing security equipment, state
of railway rolling stock, condition of equipment
used for railway infrastructure maintenance. The
technical state is frequently affected by
vandalism: equipment and various devices are
often damaged, destroyed or stolen. Therefore,
frequent monitoring of all equipment connected
to the transmission system is necessary.

2. Human factor: poor competence, fatigue,
negligence, errors of judgment, inattention,
scattered attention, non-observance of working or
technological procedures.

3. Climate conditions: snow, rain, fog, floods, 
calamities. 


In recent years, many incidents and accidents
have occurred on the railways. The impact of these
accidents could have been much worse if hazardous
substances had been involved. Therefore, the
transport of hazardous substances on the railways
remains an urgent and critical issue. Our effort is
aimed at reducing the consequences of possible
accidents to a minimum. In order to prevent
emergency cases and reduce negative effects, it is
necessary to examine thoroughly both the causes and
the consequences of all accidents that have
occurred; in addition, knowledge of the current
situation in hazardous substance transport has to
be available as well. The scope of the consequences
of a possible accident is related both to the type
of material transported and to the extent of
residential zones close to the site of the accident.
Selecting an appropriate route as well as the right
time is one of the crucial safety factors. A quality
analysis of the current state, identification of
the busiest corridors and stations, determination
of the level of safety at work in particular
operating units and an understanding of the overall
transport context will be of great help when
subsequently proposing measures to ensure maximum
transport safety.

In its first phase, the developed analysis
identifies the most heavily loaded corridor sections
(operating units); in the second phase, it
identifies their risky segments and the potential
sources of threat in typical activities, and
predicts the frequency of emergency cases.

The Czech Republic is among the countries with the
densest railway networks, used for both domestic
and international transport. The chemical industry
is the third largest industrial sector in the Czech
Republic; it comprises basic chemistry, petroleum
processing (petro-chemistry), the pharmaceutical
industry (drug production), the rubber industry,
the plastics industry and the paper industry.
Production of basic chemicals (64% of total sales)
and drug production (17%) dominate. The remaining
sectors account for smaller shares: production of
specific chemical products and fibres (9%),
cleaning and cosmetic agents (5%), coating
materials and paints (4%), and pesticides and
agrochemicals (1%) (Jeneralova, 2011).

The company CD Cargo, a.s. dominates freight
transport by rail, and a substantial part of its
traffic consists of hazardous substances.

Table 1 shows the most important producers of
chemicals in the Czech Republic that use the company
CD Cargo, a.s. to transport their commodities,
together with the number of carriages with
hazardous substances that the company transported
within the Czech Republic.

The data cover a 3-year period (2014-2016) and come
from the CD Cargo, a.s. database. Only transport
within the Czech Republic is considered. The
assessment of transported hazardous substances
comprises a total of 141,229 railway carriages for
the monitored period. The authors also prepared an
overview of all companies using the services of
CD Cargo, a.s., broken down by hazard class and
number of railway carriages transported. However,
that table is too extensive to be presented here.
Therefore, a map was created to visualize the
current situation: it shows the 4 transit corridors
that are used by the

Table 1. The most significant companies and number
of transported carriages with hazardous substances.

Nr.  Company                          2014    2015    2016     Sum
1.   CESKÁ RAFINERSKA, a.s.          12656   13490    3842   29988
2.   TERMINAL OIL a.s.                4224    5554    1883   11661
3.   METRANS, a.s.                    5512    4471    4846   14829
4.   DEZA, a.s.                       2979    3688    3807   10474
5.   BorsodChem MCHZ, s.r.o.          1367    1451    1600    4418
6.   Ceské dráhy, a.s.                1015    1022    1010    3047
7.   Czech Airlines Handling, a.s.       0    1870    1088    2958
8.   ArcelorMittal Ostrava a.s.          0    1246    1321    2567
9.   Synthesia, a.s.                  1447       0    1026    2473
10.  Lovochemie, a.s.                 1216       0    1108    2324

Source: Processed by the author (Internal materials CD
Cargo, 2017).
Figure 1. Map of the most significant companies utilizing
the company CD Cargo, a.s. depending on operating units.
Source: (Becherova, 2017).


company CD Cargo, a.s., which transports hazard- 
ous substances produced by the most significant 
companies. The map also shows the sites where 
hazardous substances are loaded depending on the 
operating units (main stations). 

In 2009-2016, the system of data storage
changed: some transport units were excluded
(Nymburk, Olomouc and Plzen), and the available
data do not show where they were subsequently
included. Therefore, the analysis has to be divided
into two time periods.

In the first period, the highest numbers of
emergency cases in the transport of hazardous
substances were recorded in Usti nad Labem and
Praha. In the second period, Praha again ranks
first and Ostrava second. The main reason for the
high number of emergency cases is the high density
of shipments in these regions; in terms of
transport, the operating unit Praha is the busiest.
It also comprises the former operating units
Nymburk and Plzen. In 3 years (2014-2016), this
operating unit dispatched 142,079 railway carriages
with hazardous substances, i.e., approx. 130
carriages a day. At this station, hazardous
substances of hazard class 3, i.e. flammable
liquids, prevail. The analysis was complicated by
the fact that the company CD Cargo, a.s. does not
file all defects; therefore, it was necessary to
obtain additional data (incompatible with the
original ones) from the Management of Railway
Network Company (hereinafter referred to as
SZDC-MRNC). A detailed analysis of these data,
together with a manual comparison of the individual
RID records for 2006-2016 (inland shipment of
hazardous substances), yielded the following data.
Owing to the amount of information in Table 2,
defect types are identified there by number only;
the numbers are explained below the table.


Table 2. Number of defects at shipment of hazardous
substances reported by the company CD Cargo, a.s.

Type of
defect   2006  2007  2008  2009  2010  2011
1.         13    10    13    13    19    35
2.        256   175   172   149   171   294
3.         39    33    50    68    35    51
4.         20    22     8     8    10    10
5.         57    39    21    39    46    15
6.          0     0     3     1     1     0
7.          0     0     0     0     0     0
Total     385   279   267   278   282   405

Type of
defect   2012  2013  2014  2015  2016  Total
1.         31     0     3     0     1    138
2.        166    69    85   102    41   1680
3.         45     1     5     4     2    333
4.          7     0     3     6     3     97
5.         22     6     5     6     4    260
6.          0     0     0     0     0      5
7.          0     0     0     1     0      1
Total     271    76   101   119    51   2514

Source: Processed by the author.


Explanatory notes on the recorded defect types:


1. A column not marked with a cross,

2. Incorrect or missing entry necessary for RID
transport,

3. Incorrect entry “EMPTY TANK”, “LAST
LADING” or alternatively “EMPTY,
UNCLEANED”, or “RESIDUES, LAST
CONTENT”,

4. Leakage of the transported agent: tanks are
not sealed properly, leakage through some
device/valve/fittings,

5. Faults without hazardous substance leakage:
improperly set boards with labels, missing blind
flanges, screws,

6. Relevant specification for transport and
inscription of the stored goods are inconsistent
(gases of class 2 are in tanks),

7. Overfilling of the transport unit.
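
The relative weight of the seven defect types can be read off the row totals of Table 2. The following minimal Python sketch computes each type's share of all reported defects for 2006-2016; the dictionary and function names are illustrative assumptions, only the counts come from the table.

```python
# Row totals for 2006-2016 as printed in Table 2 (defect type -> count).
DEFECT_TOTALS = {1: 138, 2: 1680, 3: 333, 4: 97, 5: 260, 6: 5, 7: 1}


def defect_shares(totals):
    """Return each defect type's percentage share of all reported defects."""
    grand_total = sum(totals.values())
    return {kind: 100.0 * count / grand_total for kind, count in totals.items()}


shares = defect_shares(DEFECT_TOTALS)
# Print the defect types from most to least frequent.
for kind, pct in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"defect type {kind}: {pct:.1f} %")
```

Under these figures, type 2 (incorrect or missing RID entry) accounts for roughly two thirds of all reports, while leakage (type 4) stays below 4 %.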


2 EMERGENCY CASES DUE TO 
HAZARDOUS SUBSTANCE LEAKAGE 


Leakages of hazardous substances during transport
by rail represent a significant share of all
emergency cases. The following tables summarize the
leakages of hazardous substances that occurred
during operation of the company CD Cargo, a.s. in
the Czech Republic. SZDC-MRNC gave us access to its
database; after detailed examination, the data were
processed to obtain the following results
(Bekesiene, et al., 2016).

In the 9-year period 2008-2016, 597 leakages of
hazardous substances occurred during transport by
rail in the Czech Republic. All emergency cases in
which any amount of a hazardous substance leaked
are recorded. According to the Regulations for
international rail transport (RID), the highest
number of leakages occurred in hazard class 3,
i.e., flammable liquids. A considerable number of
leakages also occurred in class 8, i.e. corrosive
substances, and in class 2, i.e. gases. Table 3
presents the number of leakages of hazardous
substances by hazard class that occurred within the
Czech Republic (Becherova & HoSkova-Mayerova,
2017).

The table shows that the highest number of
leakages during operation of the company CD Cargo,
a.s. in the Czech Republic belongs to RID class 3,
i.e., flammable liquids. The aim of the analysis
was to determine the distribution of emergency
cases, particularly leakages, within the Czech
Republic.

Table 4 focuses on the number of accidents in the
transport of hazardous substances according to the
specific UN code. The data cover the 9-year period
2008-2016 and come from the database provided by
SZDC-MRNC.

Table 4 and supplement N show that the hazardous
substances that leaked most frequently in the
9-year period 2008-2016 are the following:


• class 2: carbon dioxide (UN 1013) and propane
butane (UN 1965);


Table 3. Number of leakages at transport of hazardous
substances in the period 2008-2016.

Number of leakages by hazard class

Year      2     3   4.1   4.2   5.1   6.1     8     9   Total
2008     25   107     1     1     1     2    31     5     173
2009      9    82     0     0     5     0     3     1     100
2010     15    73     1     0     0     2     9     1     101
2011      8    94     0     0     2     1    16     1     122
2012      2    45     1     0     4     0     9     0      61
2013      4    33     0     0     0     0     9     1      47
2014      2    18     2     0     5     0     6     3      36
2015     11    12     0     0     1     0     4     0      28
2016      4    14     0     0     5     1     6     0      28
Total    80   478     5     1    23     4    93    12     696

Source: Processed by the author (Internal materials CD
Cargo, 2017).


Table 4. Emergency cases due to leakage of hazardous
substance by hazard class.

Year      2     3   4.1   4.2   4.3   5.1   5.2   6.1     8     9   Total
2008     25   107     1     1     -     -     -     2    31     5     173
2009      9    82     -     -     -     5     -     -     3     1     100
2010     15    73     1     -     -     -     -     2     9     1     101
2011      8    94     -     -     -     2     -     1    16     1     122
2012      2    45     -     -     -     4     -     -     9     -      61
2013      4    33     -     -     -     -     -     -     9     1      47
2014      2    18     2     -     -     5     -     -     6     3      36
2015     11    12     -     -     -     1     -     -     4     -      28
2016      4    14     -     -     -     5     -     1     6     -      30
Total    80   478     4     1     -    22     -     6    93    12     698

Source: Processed by the author.


Table 5. Number of leakages by leakage site (operating unit).

Year    CT    BR    CB    NY    OL    OS    PL    PR    UL   Sum
2008     9    14    11    12    28    12    13    26    48   173
2009     3    15    12     4     9    15     4    19    19   100
2010     5    18    12    11     6    11     5    15    18   101
2011    11    33     6     2     4    17     5    28    16   122
2012     3     7     8     2     2    10     5    10    15    62
2013     6    11     1     1     1     9     -     8    10    47
2014     1     3     2     4     1     8     4     4     9    36
2015     2     2     1     2     3    10     1     4     3    28
2016     -     8     -     -     3     9     2     5     3    30
Sum     40   111    53    38    57   101    39   119   141   699

Source: Processed by the author.
Figure 2. Accident distribution: leakage of
hazardous substances according to operating units.
Source: processed by the author.


• class 3: diesel (UN 1202);

• class 8: sodium hydroxide (UN 1824) and
hydrochloric acid (UN 1789);

• class 9: liquid tar (UN 3257).
Table 5 and the graph show that the accident
distribution depends on the site of the leakage.
The graph is broken down by operating unit, with
emergency cases concentrated in Praha, Usti nad
Labem and Ostrava.
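
The concentration of leakages in the three units named above can be checked against the Sum row of Table 5. The sketch below is only illustrative: it assumes that the column codes PR, UL and OS stand for Praha, Usti nad Labem and Ostrava, and the variable and function names are not part of the original analysis.

```python
# Sum row of Table 5: leakages per operating unit, 2008-2016.
LEAKS_BY_UNIT = {
    "CT": 40, "BR": 111, "CB": 53, "NY": 38, "OL": 57,
    "OS": 101, "PL": 39, "PR": 119, "UL": 141,
}

# Assumption: PR = Praha, UL = Usti nad Labem, OS = Ostrava.
NAMED_UNITS = ("PR", "UL", "OS")


def share_of_units(leaks, units):
    """Percentage of all recorded leakages that fall in the given units."""
    total = sum(leaks.values())
    return 100.0 * sum(leaks[unit] for unit in units) / total


share = share_of_units(LEAKS_BY_UNIT, NAMED_UNITS)
print(f"{share:.1f} % of leakages occurred in {', '.join(NAMED_UNITS)}")
```

With the printed sums, these three units together account for slightly more than half of the 699 leakages listed in Table 5.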


3 DISCUSSION AND CONCLUSION 


The analytical part examines in detail the main
factors affecting the fluency of railway traffic
and the occurrence of emergency cases in the
shipment of hazardous substances. Through detailed
identification and exploration of the risky
segments involved in shipment, potential sources of
threat in typical activities were identified, so
that the most heavily loaded operating units could
be selected (Hoskova-Mayerova et al., 2017, Kass,
1980).

Within the specified area of interest, a crucial
step was to explore the cargo company CD Cargo,
a.s.: how the company operates, which companies it
cooperates with, what goods it transports and in
what volumes, etc. The analyses showed which
hazardous substances are transported within the
Czech Republic most frequently, which operating
units are the busiest, how many railway carriages
and how many tons of particular substances were
transported, and what the accident rate is during
the transport itself (HoSkova-Mayerova, 2016,
Vališ et al., 2017).

In 2009-2016, 2,384 emergency cases occurred on the
Czech railways during operation of the company
CD Cargo, a.s. In the 3-year period 2014-2016,
3,809,266.262 tons of hazardous substances were
shipped; most of them belonged to hazard class 3,
i.e., flammable liquids. In the same period, 12,114
loadings of hazardous substances were carried out.
The analysis showed that the highest number of
loadings took place in the operating unit Usti nad
Labem. In terms of railway carriages, 142,079
carriages were loaded with hazardous substances in
that period, most of them in the operating unit
Praha. The number of reported defects and failures
in the shipment of hazardous substances within the
company CD Cargo, a.s. was monitored as well. In
the 11-year period 2006-2016, 2,514 defects and
failures were reported; most of them belonged to
the category ‘Incorrect or missing record required
by RID transport’. As for leakages of hazardous
substances, 696 more or less significant leakages
occurred in 2008-2016. The analysis showed that the
highest number of leakages during operation of the
company CD Cargo, a.s. within the Czech Republic
occurred in RID class 3, i.e., flammable liquids
(Prochazkova, 2016, Prochazkova & Prochazka, 2016).


4 SUGGESTED SYSTEMIC MEASURES
TO IMPROVE THE PREDICTION OF
EMERGENCY CASES AND THEREBY
INCREASE RAILWAY SAFETY


Owing to staff turnover, the records in the
database are not consistent. The categorization of
emergency cases changed every year, which made the
data processing very complicated. In particular, it
is essential to


• have the data available in a uniform and
specified form,

• obtain the data in the same way,

• categorize them using the same method, which is
not modified.
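
One way to picture the three requirements above is a fixed record format that rejects entries outside an agreed categorization. The following Python sketch is a hypothetical illustration, not the schema of the CD Cargo or SZDC-MRNC databases: the class name, the field names and the example values are assumptions, and only the hazard class codes are taken from the headers of Tables 3 and 4.

```python
from dataclasses import dataclass
from datetime import date

# Hazard class codes as they appear in the headers of Tables 3 and 4.
ALLOWED_HAZARD_CLASSES = ("2", "3", "4.1", "4.2", "4.3",
                          "5.1", "5.2", "6.1", "8", "9")


@dataclass(frozen=True)
class LeakageRecord:
    """Hypothetical uniform record for one leakage event."""
    event_date: date
    operating_unit: str   # e.g. a unit code such as "PR"
    hazard_class: str     # must be one of ALLOWED_HAZARD_CLASSES
    un_number: int        # four-digit UN code of the substance

    def __post_init__(self):
        # Reject categories outside the fixed list, so that the
        # categorization cannot drift from year to year.
        if self.hazard_class not in ALLOWED_HAZARD_CLASSES:
            raise ValueError(f"unknown hazard class: {self.hazard_class!r}")
        if not 1 <= self.un_number <= 9999:
            raise ValueError(f"invalid UN number: {self.un_number!r}")


# Illustrative entry; the date and unit are made up.
record = LeakageRecord(date(2016, 1, 1), "PR", "3", 1202)
print(record)
```

Validating at the point of entry means that a change of staff does not silently change how events are categorized.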


These systemic measures lead to accurate and
practical data handling, and therefore to a more
accurate analysis of the possible causes of
emergencies and to increased safety of persons and
property (Hasilova, 2017a, b).

If a change in data categorization or recording is
unavoidable, the employees have to be retrained in
the new type of categorization (Woch, 2015, 2017).

The practical benefit is a thorough review of
emergency cases in which hazardous substances were
present. The paper provides an assessment of the
hazardous substances transported within the Czech
Republic in 2008-2016. Exact, timely information on
where a particular hazardous substance is located
could help increase the safety of the population
living close to an operating unit. A hazardous
substance in close proximity to residential zones
might result in serious consequences; therefore,
accurate records of the particular substances
transported by rail would help prevent such
situations and limit the possible consequences of
emergency cases (Malek, et al., 2017, Otřísal,
et al., 2017, Rosicka, 2006).

Linking the information system of the company
CD Cargo, a.s. with that of SZDC-MRNC, and with the
Integrated rescue system, would result in fast
notification of employees and, further, in
immediate warning of the population.
ABSTRACT: In security risk analysis and assessment, uncertainties regarding occurring threats,
consequences and the capabilities of security systems to mitigate vulnerability are enormous. Although
some quantitative approaches exist in security risk analysis that allow the consideration of these
uncertainties, most practical assessments are based on expert knowledge in semi-quantitative or
qualitative models. This paper presents a study on the influence of uncertainties in physical security
risk analysis using the example of a semi-quantitative risk assessment of a notional production
infrastructure. To this end, a procedure is suggested as a systematic approach to transfer differing
expert ratings into a pdf-based description for a quantitative approach. The influences of uncertainties
on the exemplary assessment are calculated and discussed regarding the validity of the results. To
visualize these results and to support the decision-making process, a three-dimensional risk matrix is
proposed.


1 INTRODUCTION 


In security risk analysis and assessment, uncer- 
tainties regarding occurring threats, consequences 
and the capabilities of security systems to mitigate 
vulnerability are high. Although some quantitative 
approaches exist in security risk analysis that allow 
the consideration of these uncertainties, most prac- 
tical assessments are based on expert knowledge in 
semi-quantitative or qualitative models. 

As uncertainties are hardly or not completely
considered, these models do not allow an estimation
of their influence on the validity of a conducted
assessment. Additionally, the influence of events
with a very low probability of occurrence and
disastrous consequences, e.g. black swan events, is
possibly not considered in these models, as such
events are very uncertain.

This paper investigates the influence of uncer- 
tainties on physical security risk analysis for the 
example of a semi-quantitative risk assessment of 
a notional production infrastructure. 

To demonstrate the described lack of considera- 
tion of uncertainties, the state of the art regarding 
security risk assessment and vulnerability assess- 
ment is outlined. Existing approaches are briefly 
introduced and the occurring difficulties regarding 
the use of expert knowledge are described. 

For analysis purposes, a simple semi-quantitative
model for security risk assessment is set up and a
transition to a quantitative approach based on
(Lichte & Wolf 2017) is introduced. This approach
uses probability density functions (pdfs) to
describe the abilities of security measures as well
as the probability of occurrence of a threat and
the level of possible consequences. The variance of
the security measure-related pdfs can be
interpreted as the uncertainty concerning the
capabilities of the considered measures that are
part of the security system. At the same time,
differing ratings by experts reflect a rising level
of uncertainty, which consequently leads to a
higher variance of the generated pdfs.

Besides the intended purpose of the analysis,
the transition is developed to suggest a systematic
approach to transfer differing expert ratings to the
pdf-based description for the quantitative
approach. Subsequently, the influences of the
resulting uncertainties on the exemplary assessment
are calculated and discussed regarding the validity
of the results. To visualize these results and to
support the decision-making process, a
three-dimensional risk matrix is proposed. Finally,
the study is summarized and discussed in the general
context of security risk assessment and the
consideration of uncertainties.


2 STATE OF THE ART 


2.1 Security risk analysis 


Security risk analysis is conducted by experts from
different fields of expertise, as security is a
holistic term (Harnser Group 2010). The protection
of infrastructures against intentional physical
attacks is covered by the subdomain of physical
security (Beyerer et al. 2010). In detail, the
protection of infrastructures is implemented by
security measures intended to prevent attackers
from reaching their targets or assumed
infrastructure assets. The measures include
different means of protection, detection and
intervention and additionally resilient structures
to mitigate the consequences of successful attacks
(Garcia 2008).

A possible definition of security risk is for- 
mulated as a function of the triplet threat, 
vulnerability and consequence (Contini et al. 2012, 
McGill et al. 2007): 


Risk = Threat × Vulnerability × Consequence


This definition is the common basis for almost all
semi-quantitative methods for risk assessment that
are used in the field. Recently, this classic
definition was discussed, e.g. in (Amundrud et al.
2017), regarding its failure to consider the
uncertainties inherent to its parameters. If the
above formulation of security risk is considered as
a multiplication of quantitative values, it is
apparently only valid for stochastically
independent discrete probabilities. Usually, these
preconditions are not given, as at least threat and
vulnerability are fraught with uncertainty. It may
even be discussed whether the description of these
parameters as probabilities in the sense of
classical frequentist probability theory is
justified. However, probability theory is an
established concept to handle uncertain quantities.
Additionally, it can be extended in accordance with
the axioms of Kolmogorov in a suitable manner, so
that it can be used under the above outlined
constraints in security risk assessment. For
example, the interpretation of the risk definition
based on Bayesian probability theory enables the
modeling of the parameter triplet as
degree-of-belief densities, thus enabling the
consideration of inherent uncertainties formally
compliant with probability theory (Beyerer &
Geisler 2016).
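
To illustrate what it means to treat the triplet as densities rather than point values, the following Python sketch propagates three triangular densities through the product by Monte Carlo sampling. It is a minimal illustration under strong assumptions: the three parameters are sampled independently (which, as noted above, is usually not given in practice), the triangular shape and all numerical values are hypothetical, and the function name is not taken from the cited works.

```python
import random
import statistics


def sample_risk(threat, vulnerability, consequence, n=10_000, seed=0):
    """Draw n samples of Risk = Threat x Vulnerability x Consequence.

    Each argument is a (low, mode, high) tuple describing a triangular
    density. The three quantities are sampled independently, which is
    a simplifying assumption of this sketch.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # random.triangular takes its arguments as (low, high, mode).
        t = rng.triangular(threat[0], threat[2], threat[1])
        v = rng.triangular(vulnerability[0], vulnerability[2], vulnerability[1])
        c = rng.triangular(consequence[0], consequence[2], consequence[1])
        samples.append(t * v * c)
    return samples


# Hypothetical degree-of-belief parameters, not taken from the paper.
risk = sample_risk(threat=(0.1, 0.2, 0.4),
                   vulnerability=(0.2, 0.5, 0.9),
                   consequence=(1.0, 3.0, 5.0))
print("mean risk:", statistics.mean(risk))
print("std dev  :", statistics.stdev(risk))
```

The spread of the resulting samples, rather than a single product of point estimates, is what carries the uncertainty of the inputs into the risk value.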

Thus, the definition quantitatively combines the
consequences of attacks and the probabilities of
threat scenarios with the likelihood of success of
individual attacks, defined as vulnerability. The
above quantitative definition of risk may help to
deduce acceptable risks and the measures necessary
to reduce risks (Broder & Tucker 2012). Inherent
uncertainties regarding the three risk factors
should be considered cautiously (Campbell & Stamp
2004).

Various approaches to security risk assessment
have been developed, which may be divided into
qualitative, quantitative and hybrid methods
(Meritt 2008). Qualitative methods are mostly based
on expert knowledge, while existing quantitative
methods use discrete probabilities. Additionally,
some quantitative methods aiming at cost-benefit
analysis have been developed. Typically,
cost-benefit analyses of security measures compute
potential financial losses as a result of an attack,
the probability of occurrence of various attack
scenarios and the vulnerability of the security
system (Flammini et al. 2009). This analysis yields
accurate results but is more complex than
qualitative methods (Landoll 2011).


2.2 Vulnerability assessment 


Quantitative vulnerability analysis as part of the 
quantitative risk analysis is mostly based on meth- 
ods adapted from reliability and general risk analy- 
sis. Here, the considered model is dependent on 
given attack scenarios (French & Gootzit 2011). 
This dependency is detrimental to a comprehensive 
analysis as knowledge about the behavior of a poten- 
tial attacker may be insufficient (Cox Jr. 2009). The 
different modeling approaches can be further split 
up into mainly analytical but also formal methods. 
An overview of approaches is given by Nicol et al. 
(Nicol et al. 2004). Analytical methods are often 
based on attack trees, which can be seen as a deriva- 
tive of the fault trees known from reliability analysis. 
Attack trees were first used by Schneier (Schneier 
1999) for IT-security analysis and since then have 
been further developed by different authors, sum- 
marized e.g. by Vintr et al. (Vintr et al. 2012). 

Contini et al. have introduced incoherent attack 
trees to characterize the dynamic behavior of the 
considered system (Contini et al. 2008). Addition- 
ally, they integrated simple probability distribu- 
tions for protection into attack trees to investigate 
the chronologic sequence of attacks. Hence, it is 
possible to analyze the security system’s ability for 
an attack intervention by comparing the probabili- 
ties of residual protection and system’s response 
(Contini et al. 2012). 

Garcia describes this relation (Garcia, 2008), 
where feasible attack paths as part of different 
attack scenarios and corresponding barriers are 
used. The model is time-based and introduces the 
critical detection point, which is the latest possible 
point of detection that ensures a successful inter- 
vention against the potential attacker. 
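
The idea of a critical detection point can be sketched for a single attack path with deterministic times. The Python function below is an illustrative simplification, not Garcia's model: it assumes that each barrier contributes a fixed delay, that detection happens when the attacker starts working on a barrier, and that the intervention needs a fixed response time. All names and numbers are hypothetical.

```python
def critical_detection_point(delays, response_time):
    """Index of the last barrier at which detection still allows a
    successful intervention, or None if no such barrier exists.

    delays        -- delay times of the barriers along the path, in
                     the order the attacker meets them
    response_time -- time the intervention needs after detection
    """
    residual = 0.0
    # Walk backwards from the asset, accumulating the delay that
    # remains if the attack is detected at barrier i.
    for i in range(len(delays) - 1, -1, -1):
        residual += delays[i]
        if residual >= response_time:
            return i
    return None


# Hypothetical path with four barriers (delays in seconds).
path_delays = [30.0, 60.0, 90.0, 20.0]
print(critical_detection_point(path_delays, response_time=100.0))
```

For the example path, detection at the third barrier (index 2) still leaves 110 seconds of delay, whereas detection at the last barrier leaves only 20 seconds, which is less than the assumed response time.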

A quantitative model that further develops the 
approaches of Contini et al. (Contini et al. 2012) 
and Garcia (Garcia, 2008) was introduced in 
(Lichte et al. 2016) and applied to an infrastructure 
in (Lichte & Wolf 2017). The suggested vulnerabil- 
ity model uses pdfs to describe the characteristics 
of security measures and uses a path-based bar- 
rier model. Additionally it uses the principle of 
the weakest path. The developed approach in this 
paper uses this vulnerability model as a basis. 

In summary, the existing approaches to analytical
modeling and analysis of vulnerability, as well as
to security risk analysis, lack the consideration
of uncertainties in the system parameters and
overall behavior. Additionally, a great number of
these approaches rely on expert knowledge, causing
various problems that are described in the next
section.


2.3 Expert knowledge and uncertainty in security
risk assessment


The elicitation and use of expert knowledge is a
frequently used approach in risk assessment, e.g. to
determine point estimates for unknown parameters
(Kaplan 1992). In particular, this approach is used
to develop and parametrize models for risk
assessment when the available database is very small
or objective data are lacking (Bolger & Wright
2017).

Therefore, methods for the elicitation of expert
knowledge are developed to minimize subjectivity
and uncertainty in parameter estimation. However,
the use of expert knowledge itself generally raises
the problem of uncertainties in probabilistic risk
assessment, which should be analyzed cautiously
(Rausand 2013, Aven & Zio 2013). The latest
developments in general risk assessment focus on
uncertainty as a key concept of risk assessment
(Aven 2016). Especially for the analysis of rare
events that may lead to catastrophic consequences
with large uncertainties, different approaches to
deal with uncertainty have been suggested and
analyzed, e.g. in (Flage et al. 2014).

Security risk assessment in particular is often
accompanied by great uncertainties, as there is a
lack of evidence on threats, consequences and the
abilities of security measures. Thus, qualitative
or semi-quantitative models that rely strongly on
expert knowledge are often used, although these
models can lead to misleading or even wrong results
(Landoll 2011). Consequently, an analysis of the
resulting uncertainties should be conducted for
security risk assessment. This paper discusses a
transition from a semi-quantitative to a
quantitative model, which enables further analysis
in this direction.


3 APPROACH 


To analyze the influence of uncertainties, in
particular those emerging from the dependence on
expert knowledge in semi-quantitative risk
assessment methods, an approach is introduced that
allows the transition to a risk model that uses
triangular pdfs. Both methods are applied step by
step to an exemplary infrastructure depicted in
Figure 1. The results of both risk models are
analyzed for deviations and possible causes. To
this end, a simplified semi-quantitative approach
to security risk assessment is presented first. It
is based on existing methods, e.g. (Harnser Group
2010). The vulnerability assessment part reflects
four basic assumptions presented in (Lichte & Wolf
2017):


1. The weakest path of the security system
determines the system’s vulnerability, as the
chosen path of the attacker is uncertain.

2. The combination of protection and observation
at barriers is necessary, as an attacker who is
not detected is always able to break through a
barrier given unlimited time.

3. The detection of an attack is possible only if
the protection is sufficient to prevent a
break-through under observation until detection.

4. After detection, an attack can be stopped only
if the residual protection along the remaining
attack path lasts long enough to prevent the
attacker from reaching the asset until the
intervention is completed.


Figure 1. Outline of infrastructure and security
system. (Barrier legend, in the order given in the
figure: gated outer entrance; main entrance to
lobby; employees' entrance, shop floor; fire exit,
shop floor; outer perimeter fence; server room
entrance, office section; server room entrance,
shop floor.)


In a second step, the developed approach is 
extended to be able to consider various contri- 
butions from expert knowledge. Subsequently, a 
transition is presented that uses triangular pdfs to 
represent the various semi-quantitative ratings of 
the experts. Thus, the differences of the experts’ 
assessments are treated as uncertainties. In the 
last step, possible results for the semi-quantitative 
model are compared to the transformed model. 


3.1 Exemplary production infrastructure 


A basic drawing of the considered production 
infrastructure is outlined on the upper left of 
Figure 1. The model only considers security meas- 
ures located along paths that lead to the server 
room, which is assumed as the asset of a possible 
attack. The upper right of Figure 1 denotes the 
security measures marked in the drawing. 

The model depicted on the lower left of Figure 1
(onion layer model) represents the structure and
feasible attack paths by means of barriers. Four
individual attack paths lead to the asset A via
barriers B_1-B_7.

All feasible attack paths of the exemplary
infrastructure are extracted into a path-based model.
The result of the extraction of the attack paths is
shown on the lower right of Figure 1. It includes all
four attack paths directed towards the asset.


3.2  Semi-quantitative risk assessment 


The presented semi-quantitative method for
security assessment is divided into three submodels.
In analogy to the risk definition, there are
submodels for threat, vulnerability and consequence
assessment. The security risk is then plotted in a
two-dimensional risk matrix, in which the ordinate
shows the probability of occurrence of a threat,
incorporating the vulnerability of the
infrastructure against the threat. The abscissa
shows the estimated consequences.

Threat and consequence assessment are simpli- 
fied to the estimation of threat probability and 
level of consequences in ranking scales with five 
steps, which are described in Table 1. 

The presented semi-quantitative model for
vulnerability assessment is based on a
barrier-oriented view of the infrastructure.
Therefore, the commonly used parameters protection
(P), observation (O) and intervention (I) have to be
estimated at every security barrier of the
infrastructure (Lichte et al. 2016). To this end, the
model uses a five-step ranking scale to estimate the
time-based capabilities of the considered security
measures. The ranking scale, which is chosen
depending on the considered infrastructure, and the
corresponding estimations are shown in Tables 2
and 3.

Table 1. Ranking scale threat & consequence.

Score                        1        2        3        4        5        Expert score
Scale threat [probability]   0.0-0.2  0.2-0.4  0.4-0.6  0.6-0.8  0.8-1.0  2
Scale consequence [100k $]   0-1      1-2      2-3      3-4      4-5      4


Table 2. Score descriptions P, O, I.

Score            1      2        3        4        5
Scale P, O [s]   0-90   90-180   180-270  270-360  360-450
Scale I [s]      0-180  180-360  360-540  540-720  720-900


Table 3. Expert scores for P, O, I per barrier.

Barrier i   P   O   I
1           1   5   3
2           4   1   1
3           2   2   4
4           2   5   5
5           1   5   4
6           5   3   2
7           5   2   2


First, the probability of a detection of the
attacker needs to be modeled. Based on the four
basic assumptions, a high level of protection and
observation enables a high probability of a
triggered alarm A_i at a barrier, so that:

A_i = \frac{P_i + (O_{max} - O_i)}{P_{max} + O_{max}}    (1)

Herein P_max and O_max represent the highest
possible score in the ranking scale. For the first
barrier B_1 of path 1 of the example infrastructure
we get:

A_1 = \frac{1 + (5 - 5)}{5 + 5} = 0.1    (2)


In a second step, the level of security of a barrier
is described by the possibility of an intervention
taking place before the attacker reaches the asset.
This is described by merging the residual protection
of a barrier with the estimated duration of an
intervention. The residual protection R_i can be
interpreted as the sum of all remaining n protection
times along the considered attack path. Using a
floor function to obtain an integer as a result we
obtain:

R_i = \left\lfloor \frac{\sum_{j=1}^{n} P_j}{n} \right\rfloor    (3)

R_1 = \left\lfloor \frac{1 + 4 + 2 + 5}{4} \right\rfloor = 3    (4)

The residual protection is then merged with the
duration of intervention I_i of the barrier to obtain
the possibility of a timely intervention T_i:

T_i = \frac{R_i - I_i}{R_{max} + I_{max}} + 0.5, \quad T_i \in \{0.1, \dots, 0.9\}    (5)

T_1 = \frac{3 - 3}{10} + 0.5 = 0.5    (6)
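The scoring steps of Eqs. (1) to (6) can be retraced
with a short Python sketch. It assumes that all
five-step scales share the maximum score 5, that the
floor in Eq. (3) is applied to the mean protection
score of the remaining barriers (as the worked
example in Eq. (4) suggests), and that path 1 carries
the protection scores 1, 4, 2 and 5 used there. The
function names are illustrative and not taken from
the paper.

```python
import math

MAX_SCORE = 5  # assumed highest score of every five-step ranking scale


def alarm_score(p, o):
    """Eq. (1): possibility of a triggered alarm at a barrier."""
    return (p + (MAX_SCORE - o)) / (MAX_SCORE + MAX_SCORE)


def residual_protection(remaining_p):
    """Eqs. (3)-(4): floored mean protection score of the remaining barriers."""
    return math.floor(sum(remaining_p) / len(remaining_p))


def timely_intervention(r, i):
    """Eq. (5): possibility of a timely intervention, between 0.1 and 0.9."""
    return (r - i) / (MAX_SCORE + MAX_SCORE) + 0.5


# Barrier B_1 of path 1: P = 1, O = 5, I = 3 (expert scores of the example).
a_1 = alarm_score(p=1, o=5)
r_1 = residual_protection([1, 4, 2, 5])  # protection scores along path 1
t_1 = timely_intervention(r_1, i=3)
print(a_1, r_1, t_1)
```

For barrier B_1 this retraces A_1 = 0.1, R_1 = 3 and
T_1 = 0.5 from Eqs. (2), (4) and (6).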


Using the results for the possibility of a triggered
alarm and a timely intervention, the vulnerability
of the feasible attack paths depicted in Figure 1 is
calculated by:

V_{Path} = \prod_{i=1}^{n} \left(1 - (A_i \cap T_i)\right)    (7)

For path 1 this yields:

V_{Path1} = (1 - 0.3)(1 - 0.35)(1 - 0.25)(1 - 0.35) = 0.147    (8)


Figure 2. Exemplary histograms and rankings
according to aggregated expert knowledge.


3.3. Consideration of multiple expert knowledge 


In order to analyze the influence of uncertainties
in threat, vulnerability and consequence
assessments, judgements of multiple experts are now
introduced. Hypothetical assessments of k = 10
experts are aggregated in histograms that show the
frequency of the scores given by the experts. As an
example, the histograms of the absolute and relative
frequencies for the protection measure at B_1 as
well as for threat and consequence are shown in
Figure 2.

On the right-hand side of Figure 2 the ranking 
of all experts for all barriers of the security system 
of the example infrastructure is included. 
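
The aggregation of expert scores into absolute and
relative frequencies can be sketched as follows. The
ten scores are hypothetical values chosen for
illustration only and are not data from the paper;
the function name is likewise an assumption.

```python
from collections import Counter


def score_histogram(scores, n_steps=5):
    """Return absolute and relative frequencies for the scores 1..n_steps."""
    counts = Counter(scores)
    absolute = [counts.get(step, 0) for step in range(1, n_steps + 1)]
    relative = [count / len(scores) for count in absolute]
    return absolute, relative


# Hypothetical ratings of k = 10 experts for one security measure.
expert_scores = [1, 1, 2, 1, 1, 2, 1, 2, 1, 2]
absolute, relative = score_histogram(expert_scores)
print(absolute)  # [6, 4, 0, 0, 0]
print(relative)  # [0.6, 0.4, 0.0, 0.0, 0.0]
```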


3.4 Transition to the usage of probability 
density functions 


In this step, the semi-quantitative models are 
transformed into quantitative models. Both types 
of models use similar operations based on condi- 
tional probabilities to describe the security risk of 
an infrastructure. 

Therefore, we develop triangular pdfs on the
basis of the histograms found above, which result
from the differing ratings of multiple experts.
Triangular distributions are widely used in
different fields of risk analysis to estimate
probability distributions under uncertainty
(Haimes 2015). The characteristic parameters of the
function are the lower limit a, the upper limit b
and the mode c. The determination of these
parameters for all input parameters of the risk
model is based on the respective histograms of the
relative frequency (see Fig. 2).

The lower limit a and the upper limit b for the
pdfs of threat and consequence are determined
by the minimum and maximum value given in
the ranking score for probability of occurrence
and costs of a successful attack, respectively. The
median interval of the respective histogram is
assumed as mode c. The resulting fitted triangular
pdfs are outlined in Figure 2.

The obtained values for the pdf parameters for
threat th_Q(TH) and consequence c_Q(C) are listed in
Table 4.

        a    b    c
th_Q    0    0.8  0.4
c_Q     1    5    3.33
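
For reference, the triangular pdf with lower limit a,
upper limit b and mode c has a standard closed form,
sketched below in Python. The parameter values are
those listed in Table 4; the function names are
illustrative, and the sketch makes no claim about how
the authors implemented the model.

```python
def triangular_pdf(x, a, b, c):
    """Density of a triangular distribution with a <= c <= b and a < b."""
    if x < a or x > b:
        return 0.0
    if x < c:
        return 2.0 * (x - a) / ((b - a) * (c - a))
    if x > c:
        return 2.0 * (b - x) / ((b - a) * (b - c))
    return 2.0 / (b - a)  # peak at the mode, also covers c == a or c == b


def triangular_mean(a, b, c):
    """Mean of a triangular distribution."""
    return (a + b + c) / 3.0


threat = dict(a=0.0, b=0.8, c=0.4)        # probability of occurrence
consequence = dict(a=1.0, b=5.0, c=3.33)  # in units of 100k $

print(triangular_pdf(0.4, **threat))      # peak density 2 / (b - a)
print(triangular_mean(**consequence))
```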

In contrast to the simplified threat and
consequence submodels, the transition of the
vulnerability submodel is more complex due to the
relations between the input parameters. To manage
the transition, the five ranking scores are
transformed into time-based intervals depending on
the chosen time-based ranking scales (see Tab. 2) to
describe the characteristics of the security
measures according to (Lichte & Wolf 2017):


• Protection is characterized by an estimated time
needed for a break-through.

• Observation is the time span needed for a
detection.

• Intervention is the period of time until an
intervention is completed.


The resulting intervals mark reasonable time 
steps to describe the time based characteristics 
and at the same time reflect the assessment of the 
experts. 

In order to ensure comparability within the
model, the J = 5 time intervals Δt_{i,j} for all
protection and observation measures at the i
barriers of the infrastructure lie in the time span
[0, t_max). A reasonable choice for the value of
t_max depends on the considered infrastructure. The
intervals Δt_{i,j} are all of the same size. For the
example infrastructure t_max = 450 s.

Due to the modeled relations, the time span of
the intervention measures can reach higher values.
Therefore, the upper limit of the considered time
span for intervention measures is set to
t_{max,I} = 900 s. As the number of intervals does
not change, the intervals Δt_{I,i,j} grow
proportionally.
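
The mapping from a ranking score to its time interval
follows directly from the equal interval width. The
sketch below assumes J = 5 intervals and the t_max
values of the example infrastructure; the helper name
is hypothetical.

```python
def score_interval(score, t_max, n_intervals=5):
    """Return the half-open time interval [lower, upper) in seconds for a score."""
    if not 1 <= score <= n_intervals:
        raise ValueError("score outside the ranking scale")
    width = t_max / n_intervals
    return (score - 1) * width, score * width


print(score_interval(2, t_max=450))  # protection/observation: (90.0, 180.0)
print(score_interval(5, t_max=900))  # intervention: (720.0, 900.0)
```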

Now, a triangular pdf can be fitted to the
histograms for protection, observation and
intervention measures of the security system. The
intended fitting of the function is outlined in
Figure 2.

To this end, the lower limit a and the upper limit b
are defined for protection and observation as the
minimum and maximum value of t within the
intervals covered by the scoring of the experts.
With k_j denoting the number of expert scores in
interval j, the lowest interval l of the J = 5
intervals that contains expert scores can be found
via the following conditions:

\sum_{j=1}^{l} k_j > 0 \quad \text{and} \quad \sum_{j=1}^{l-1} k_j = 0    (9)

The conditions for the upper interval u are:

\sum_{j=u}^{J} k_j > 0 \quad \text{and} \quad \sum_{j=u+1}^{J} k_j = 0    (10)

Subsequently, the lower limit a and the upper limit
b are determined by:

a = t_{min} = \min(\Delta t_{i,l})    (11)

b = t_{max} = \max(\Delta t_{i,u})    (12)

The limits for intervention are defined in the same
way using Δt_{I,i,j}. The mode c for all security
measures is determined by calculating the median
of grouped data. First, the median interval m is
defined as the interval that contains the median of
the dataset:

\sum_{j=1}^{m-1} k_j < \frac{k}{2} \quad \text{and} \quad \sum_{j=1}^{m} k_j \geq \frac{k}{2}    (13)

Subsequently, the median within the interval m is
estimated by linear interpolation, assuming a
uniform distribution, as no further information
about the distribution within the interval is
available. With the lower boundary l_m and the
upper boundary u_m of the median interval we
obtain:

c = t_m = l_m + \frac{\frac{k}{2} - \sum_{j=1}^{m-1} k_j}{k_m} (u_m - l_m)    (14)


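For the derivation of the parameters of the
triangular function of the protection measure at
barrier B_1 of the example system we obtain:

\sum_{j=1}^{l} k_j > 0 \quad \text{and} \quad \sum_{j=1}^{l-1} k_j = 0 \Rightarrow l = 1    (15)

\sum_{j=u}^{J} k_j > 0 \quad \text{and} \quad \sum_{j=u+1}^{J} k_j = 0 \Rightarrow u = 2    (16)

a = t_{min} = \min(\Delta t_{1,1}) = 0    (17)

b = t_{max} = \max(\Delta t_{1,2}) = 180    (18)

\sum_{j=1}^{m-1} k_j < \frac{k}{2} \quad \text{and} \quad \sum_{j=1}^{m} k_j \geq \frac{k}{2} \Rightarrow m = 1    (19)

c = t_m = 0 + \frac{\frac{10}{2} - \sum_{j=1}^{0} k_j}{6} (90 - 0) = 75    (20)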
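
The parameter derivation of Eqs. (9) to (14) can be
condensed into one function. The following Python
sketch assumes equally wide intervals over
[0, t_max) and takes the histogram counts as input.
The counts [6, 4, 0, 0, 0] for the protection measure
at B_1 are inferred from k = 10, u = 2 and the
denominator 6 in Eq. (20), not read from the figure,
and the function name is illustrative.

```python
def fit_triangular(counts, t_max):
    """Return (a, b, c) of a triangular pdf from grouped expert scores.

    counts[j] is the number of experts who chose interval j + 1.
    """
    k = sum(counts)
    if k == 0:
        raise ValueError("no expert scores given")
    width = t_max / len(counts)
    occupied = [j for j, k_j in enumerate(counts) if k_j > 0]
    lower, upper = occupied[0], occupied[-1]  # Eqs. (9) and (10)
    a = lower * width                         # Eq. (11)
    b = (upper + 1) * width                   # Eq. (12)

    cumulative = 0
    for m, k_m in enumerate(counts):          # Eq. (13): find the median interval
        if cumulative + k_m >= k / 2:
            break
        cumulative += k_m
    # Eq. (14): linear interpolation inside the median interval
    c = m * width + (k / 2 - cumulative) * width / k_m
    return a, b, c


print(fit_triangular([6, 4, 0, 0, 0], t_max=450))  # (0.0, 180.0, 75.0)
```

With these counts the sketch returns a = 0, b = 180
and c = 75, in line with Eqs. (17), (18) and (20).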


As all parameters for the triangular pdfs p(t), o(t)
and i(t) of the vulnerability submodel are now
defined, the path vulnerability V_{Q,Path i} can be
calculated by applying the quantitative model. The
characteristic parameters and their relation to each
other in the quantitative vulnerability submodel are
taken from (Lichte & Wolf 2017).

Inserting the obtained pdfs for the barriers of
path 1 of the example system into the quantitative
model, we obtain for the path vulnerability
V_{Q,Path1}:

V_{Q,Path1} = 0.028    (21)


3.5 Results and analysis 


As the vulnerability of both models is based on the 
principle of the weakest path, the other three fea- 
sible attack paths also have to be calculated. The 
results for the semi-quantitative and the quantita- 
tive model are summarized in Table 5: 

Table 5. Vulnerability V_S & V_Q for all attack paths.

Path   V_S     V_Q
1      0.147   0.028
2      0.141   0.014
3      0.328   0.209
4      0.341   0.430

The resulting risk is entered into a security risk
matrix for both models. The risk matrix for the
semi-quantitative model considering the
vulnerability of the weakest path 4 is shown in
Figure 3.

Figure 3. Security risk matrix for the
semi-quantitative model.

The security risk R_{S,sec} estimated by the
semi-quantitative model is calculated as:

R_{S,sec} = (TH_S \cdot V_{S,Path4}) \cdot C_S = 59,752 \$    (22)
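
The weakest-path principle amounts to selecting the
attack path with the highest vulnerability. A minimal
Python sketch using the values of Table 5 (the
variable and function names are illustrative):

```python
# Path vulnerabilities from Table 5 (semi-quantitative V_S, quantitative V_Q).
v_s = {1: 0.147, 2: 0.141, 3: 0.328, 4: 0.341}
v_q = {1: 0.028, 2: 0.014, 3: 0.209, 4: 0.430}


def weakest_path(vulnerabilities):
    """Return the path with the highest vulnerability and that value."""
    path = max(vulnerabilities, key=vulnerabilities.get)
    return path, vulnerabilities[path]


print(weakest_path(v_s))  # (4, 0.341)
print(weakest_path(v_q))  # (4, 0.43)
```

Both models identify path 4 as the weakest path,
although the absolute values differ.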

As the quantitative model uses pdfs, the risk
matrix is extended to a three-dimensional risk
matrix. Basically, the matrix maps the bivariate pdf
of the risk R_{Q,sec}. It incorporates the
probability of occurrence of a threat considering
the vulnerability of the infrastructure and the
consequences. The distribution-based quantitative
risk matrix of the example infrastructure is
depicted in Figure 4.

The security risk estimated by the quantitative
model is obtained by integrating the bivariate
pdf. For the example infrastructure we obtain:


R_{Q,sec}(c_Q, th_{Q,V}) = \int_{C_Q} \int_{TH_{Q,V}} f_{c_Q, th_{Q,V}}(C_Q, TH_{Q,V}) \, C_Q \, TH_{Q,V} \, dC_Q \, dTH_{Q,V}, \quad TH_{Q,V} = TH_Q \cdot V_{Q,Path4}    (23)

Thus it is possible to calculate the overall
security risk:

R_{Q,sec} = 133,344 \$    (24)
Results for other boundaries of interest of the
distribution variables can also be obtained. For
example, we choose two especially important risk
cases to be considered:

• Very high estimated consequences C > 400,000 $

• Black swan events with a very low probability
of occurrence of the threat TH < 0.05 and very
high consequences C > 400,000 $


When calculating according to (23), we are able
to obtain a probability P for the special case, given
the overall scenario, as well as the resulting risk in
terms of consequence:

R_{Q,1} = 29,920 \$, \quad P_{Q,1} = 0.07    (25)

R_{Q,2} = 1,029 \$, \quad P_{Q,2} = 0.0024    (26)


Figure 4. 3D security risk matrix.


The analysis shows that the semi-quantitative
model covers a smaller window of possible security
risks and estimates a significantly lower overall
risk (59,752 $ compared to 133,344 $). Moreover,
the model provides only a point estimate of the
risk. Therefore, information about certain cases,
e.g. rare events, is neither visible nor can it be
considered in the risk assessment.

This is shown in particular by the two risk cases
applied to the considered scenario and the example
infrastructure. In contrast to the quantitative
model, the results of the semi-quantitative model
contain no information about the probability of
occurrence of such events. This information is
visualized in the three-dimensional risk matrix of
the quantitative model (see Fig. 4). Besides an
average risk, it shows the probability of risk cases
occurring at the boundaries of the considered
scenario, e.g. black swan scenarios as in R_{Q,2}.

Furthermore, this is reflected in the differing
results of the vulnerability analysis. Although
the trend for the vulnerability of the single attack
paths of the considered infrastructure is similar,
the absolute results differ. This difference again
results from the consideration of uncertainties
within the vulnerability assessment, which stem
from the elicited expert knowledge.

Both findings underline the importance of
considering uncertainties in order to be aware of
possible boundary cases, e.g. rare events with
great consequences, in a complete risk analysis and
decision-making process.


4 CONCLUSION AND OUTLOOK 


The paper shows the influence of uncertainties on
physical security risk assessment and the resulting
need to consider them. To this end, the paper
outlines the state of the art in the field of
security risk assessment and expert knowledge,
especially regarding the resulting uncertainties. It
is shown that semi-quantitative modeling using
expert knowledge is common practice, although these
models have so far lacked a consideration of
uncertainties. To tackle this problem, an approach
is introduced that enables a transition towards
quantitative modeling using expert knowledge as a
basis.

Subsequently, the approach is detailed and applied
step by step to a notional production
infrastructure. A simplified semi-quantitative
security risk assessment method is introduced and
the transition using triangular pdfs is shown. In a
last step, a three-dimensional risk matrix based on
a bivariate pdf resulting from threat, vulnerability
and consequences is set up. Finally, the results of
both modeling approaches for the example
infrastructure are compared. This comparison
reveals differing results for the risk analysis and
the possible influence of uncertainties.

The presented approach shows the influence of
uncertainties on physical security risk analysis
and proposes a three-dimensional risk matrix that
visualizes possible rare events at the boundaries of
estimated threats. Additionally, it presents a
method for the elicitation and use of expert
knowledge in the context of physical security risk
assessment. In the future, this approach can be
developed further to support well-founded
decision-making based on a comprehensive security
risk assessment.

Nevertheless, the approach needs further
enhancement, especially in detailing the input
parameters for threats and consequences. Further
analysis of the vulnerability is also needed.
Here, the influence of uncertainties on the
vulnerability assessment should be analyzed, e.g. by
Monte Carlo simulation and sensitivity analysis.
The applicability of other modeling approaches
such as info-gap models should also be investigated.


ABSTRACT: An increasing number of Cyber Physical Systems are used in different areas of application
such as smart grid, smart factory or smart home. This paper outlines a first approach for an integrated
consideration of safety and security for Cyber Physical Production Systems in the so-called Industry 4.0
context, which can be interpreted as Systems of Systems. The approach is based on a use case-based model
for application in the context of Industry 4.0. To realize a safe and secure operation of Cyber Physical
Production Systems in Systems of Systems, a high number of elements, relations and functions have to be
taken into account. A Systems Engineering-based approach is introduced in this paper to deal with
this complexity. The approach consists of a SysML-based model which is associated with a procedure that
ensures the safe and secure design of Cyber Physical Systems. Specified safety use cases are used in the
subsequent security analysis and assessment. By harmonizing security assessment and safety use cases, the
integrated consideration is accomplished. The results can be used for technically solution-neutral designs
in early development phases.


1 INTRODUCTION 


Cyber Physical Systems (CPS) have become
increasingly important. As mechatronic systems,
they consist of sensors, actuators and embedded
intelligence and are able to communicate with
other CPS (Anderl et al. 2013). Applications of
CPS are diverse, e.g. advanced automotive systems,
environmental control or smart structures (Lee
2008), as well as e-health, smart homes, smart
factories, micro grids etc. (Geisberger & Broy 2012).

This paper focuses on applications in industrial
environments, where CPS are an important part of
so-called Industry 4.0 (Jazdi 2014). Here,
intelligent manufacturing systems (IMS) communicate
and collaborate within production networks (PN)
that are connected via the internet of things (IoT).
These systems are defined as cyber physical
production systems (CPPS). Basically, CPPS are
systems of systems (SoS), as they consist of
autonomous and cooperative elements and
sub-systems that interact with each other depending
on predefined (production) goals (Monostori 2014).

Besides autonomous machine-to-machine
communication (M2M), the collaboration of humans
and machines, so-called human machine interaction
(HMI), is essential in Industry 4.0 scenarios.
Collaborative robots (Cobots), on which this paper
puts a strong emphasis, are a representative
example of these interactions. Cobots collaborate
with humans without special safety barriers and
use sensors and intelligence to avoid collisions with
co-workers. They are connected to other machines
in a CPPS, e.g. for collecting data and updating
production procedures or steering software.

A primary challenge is to maintain safe and
secure operation of these CPPS for industrial
applications (Lee et al. 2008), since safety and
security requirements are key factors in reducing
operational risks. The design of a safe and secure
system may be a difficult task because of inherent
trade-offs (Lichte et al.). One example is the
trade-off between a safe shutdown of a cobot in
case of an imminent collision with the
collaborating human and the protection against an
interruption of production by an intentionally
triggered malfunction of the safety system. If
several CPPS are considered simultaneously, this
task becomes even more difficult, as the result is a
high number of CPS combinations and associated
use cases.

Frequently, CPS research focuses on interfaces
or technical standardization approaches for
communication. Yet these approaches have not been
harmonized so far to provide safe and secure
interoperability between the great variety of
connected collaborating devices (Kim et al. 2014,
Knight 2007, Sarijari et al. 2014). The level of
detail and the discipline-specific focus of these
approaches do not allow them to be applied
sufficiently to the design of the SoS.

In this context, this paper demonstrates a first
approach for an integrated consideration of safety
and security for CPS in a CPPS by means of a use
case-based model. To realize safe and secure
operation, a large number of elements, relations and
functions have to be taken into account
(Banerjee et al. 2012, Axelrod 2013).

A systems engineering-based approach will be 
introduced in this paper in order to deal with this 
complexity. The approach consists of a SysML- 
based model which is combined with a procedure 
to ensure the safe and secure design of Cyber Phys- 
ical Systems. The procedure is then applied to the 
cobot system and a first simplified model focusing 
on the CPS in the CPPS based on use cases is devel- 
oped. Overlapping use cases (by time and location) 
are investigated throughout the analysis and sup- 
ported by the model for a safe and secure design. 
Finally, results are summarized and discussed. 


2 STATE OF THE ART 


In a scientific context, security and safety are
often distinguished by the source of risk: a
deliberate threat (security) and an unwanted hazard
(safety) (Beyerer et al. 2010). Safety functions are
designed to protect users from hazards, e.g. an
accident. Security functions protect the system and
its contents against attacks such as intentional
misuse. The variety of components and their
IT-based networking lead to a growing number of
safety and security requirements, which have to be
fulfilled by functions.

Regarding CPS in SoS, the variety and diver- 
sity of requirements, components and functions is 
growing as industrial applications are increasingly 
integrated into the Industry 4.0 context. As the 
general development in the field of CPPS is similar, 
CPPS mainly face the same challenges. 

The diverse functions are often subject to a 
fundamental goal conflict. For reasons of safety, 
redundancies are designed to ensure safety in dan- 
gerous situations. Simultaneously these redundan- 
cies should not be implemented for reasons of 
security, because they result in additional attack 
vectors. Consequently, safety and security func- 
tions influence each other. 

Additionally, the system’s complexity, which is
defined by the number and diversity of elements
and relations as well as by dynamics (Meyer 2007),
is increasing, e.g. due to networked systems. One
result is the requirement for additional security
functions to avoid an intrusion into the SoS.
Besides the mentioned diversity of elements,
complexity is also determined by the high number
of participating systems. In turn, systems which
carry out tasks independently of each other as
well as together for a limited period of time can
be considered as a SoS (Holt & Perry 2014).
According to these characteristics, CPPS, which are
considered to be SoS of CPS, can also be defined
as virtual SoS:


• No central management and no overarching
agreed-upon purpose.

• No consistent configuration or maintenance of
the SoS as a whole system.

• The individual constituent system will be
configured and managed.


These constituent systems consist of a variety
of components. For instance, a cobot includes
components for movement, steering, control and
positioning, which implement various safety- and
security-related functions. As there is no central
management or consistent configuration of a
virtual SoS composed of such systems, a model that
considers safety and security in an integrated way
is needed.

Use cases of CPS allow an extensive description 
of their safety-related behavior. These use cases do 
not describe the behavior of a SoS consisting of 
different collaborating CPS. To focus on a compre- 
hensive description of safety and security aspects 
and resulting goal conflicts, intersection points 
between the CPS in a SoS have to be investigated. 
These intersection points have to be defined by 
CPS specific use cases, which can be postulated 
(Cockburn 2015). Nevertheless, these use cases do 
not contain the required information on conse- 
quent safety and security goal conflicts. 

In order to reach a defined level of safety and 
security, different methods and concepts may be 
used, e.g. TSM or GlobalPlatform for security archi- 
tectures or risk analysis to estimate a safety level. 
Although specific methods for safety or security 
exist, an integrated, simultaneous consideration of 
both aspects is not possible yet (Lichte et al. 2016) 
or only for software related aspects (Axelrod 2013). 

Safety and security aspects need an interdisci- 
plinary understanding for CPS as well as CPS in 
SoS. Many existing approaches lack a common 
understanding. 

With its focus on complexity, Systems Engineering
(SE) can handle these challenges (Mamrot et al.
2014), as it is about creating effective solutions
to problems and managing the technical complexity
of the resulting developments. SE includes a
system model for handling complexity together with
an interdisciplinary procedure. However, many
different SE-based approaches have been developed.
In (Marchlewitz et al. 2015) a first common model
for SoS was developed and combined with a
procedure. This new Generic Systems Engineering
(GSE)-based procedure is standardized and uses the
modules “analysis” (problem identification and
system analysis), “target definition” (problem
localization) and “design” (recommendations)
(Winzer 2015). The order and structure of these
modules depend on the specific problem under
consideration.

Different GSE-based approaches are used, e.g. 
for requirements engineering (Nicklas 2015) or for 
the design support of autonomous robots (Mam- 
rot et al. 2014, Marchlewitz et al. 2015). However, 
the existing system model which was introduced in 
(Mamrot et al. 2014) does not support the specific 
combination of CPS in CPPS which is needed for an 
integrated safety and security consideration as well as 
a standardized notation. Therefore, a SysML-based 
approach will be used. Based on its diagrams and 
standardized notations, an integration in the existing 
common GSE model of thinking can be realized. 

In summary, the following challenges were 
identified: 


• No standardized model for SoS or CPPS (of
CPS) for safety and security aspects already
exists.

• Missing description of the virtual SoS and its
behavior by the CPS specific use cases.

• Difficulties in handling diverging high level use
cases caused by inherent complexity.


To deal with these challenges, the approach for 
an integrated consideration of safety and security 
aspects for smart home applications will be devel- 
oped and introduced in the following section. 


3 APPROACH 


In order to analyze the described complex systems 
and enable a further development, a Systems Engi- 
neering-based approach is introduced [24]. This 
new GSE-based approach is shown in Figure 1. 
The proposed procedure is combined with a safety 
and security integrated system model for applica- 
tions in the context of CPPS as SoS of CPS. In 
step 1 the CPPS and its scope are analyzed by
using the GSE-module “analysis”. This analysis is 
initially realized by the CPS use cases. Hazardous 
behavior of the CPPS is then identified by com- 
bining relevant use cases, e.g. by time and location 
intersection points. Step 2 represents the safety 
use case definition based on SysML notation and 
diagrams. This is then combined with the GSE 
target definition module. In the following step 3, 
resulting safety use cases are investigated to iden- 
tify related attack scenarios. The security analysis 
based on the derived safety use cases is necessary 
as safety use cases possibly create new attack vec- 
tors. This ensures the extensive analysis of goal 
conflicts between safety and security. Finally, the 
harmonization of safety and security is carried out 
in step 4. As a result, design recommendations can 
be derived based on the GSE-module “design”. 


Figure 1. Approach. (Steps of the procedure shown in
the figure: use case (UC) definition; risk analysis;
safety UC definition; security UC definition;
harmonization of safety and security architecture.)


In the following, the four-level approach is
explained in detail.


3.1 CPPS definition 


In a first step, based on systems thinking, the
scope of the system has to be limited in order to
handle the complexity of the SoS [25]. Therefore,
the proposed CPPS is divided into its subsystems
and users. With this limitation, the focus is placed
on systems and interacting users. In this article,
an example based on four different systems is used:


• Cobot (CB),

• Smart wristband (SWB),

• Intelligent emergency reaction unit (IER),

• Communication hub (CH).


The smart wristband worn by the co-worker is 
used for position tracking while working with the 
cobot to avoid hazardous collisions and to monitor 
the health status in case of potential work accidents. 
In addition, the communication hub is understood 
as a part of the CPPS and not as a central manage- 
ment. It only enables the communication between 
the systems. With the help of these systems the iden- 
tified challenges and use cases have to be derived. A 
typical use case description includes preconditions, 
postconditions, primary flow and an alternative or 
exception flow (Friedenthal et al. 2015). Therefore, 
a predefined template is suitable. Suitable use cases
are shown in Figure 2; they have to be
analyzed regarding safety risks. The challenge is to
identify every hazard resulting from an interaction
of two or more CPS. This interaction is depicted
by the intersection points with regard to time and
location. For example, the use case analysis determines
that the use cases “CB1-assisted mounting”
and “CB2-autonomous mounting” (see Fig. 2)
cannot overlap.

Figure 2. Use cases and combination.
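
The intersection check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `UseCase` class, the time windows and the location label `cell-1` are hypothetical and do not come from Figure 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    name: str
    start: float    # start of the time window (hypothetical unit: minutes)
    end: float      # end of the time window
    location: str   # hypothetical work-cell identifier


def intersects(a: UseCase, b: UseCase) -> bool:
    """True if both use cases share a location and their time windows overlap."""
    return a.location == b.location and a.start < b.end and b.start < a.end


def intersecting_pairs(use_cases):
    """Return the names of all use case pairs that meet in time and location."""
    pairs = []
    for i, a in enumerate(use_cases):
        for b in use_cases[i + 1:]:
            if intersects(a, b):
                pairs.append((a.name, b.name))
    return pairs


# Hypothetical schedule: the two cobot use cases are planned back to back,
# so they do not overlap, while position tracking runs the whole time.
use_cases = [
    UseCase("CB1-assisted mounting", 0, 30, "cell-1"),
    UseCase("SW2-position tracking", 0, 60, "cell-1"),
    UseCase("CB2-autonomous mounting", 30, 60, "cell-1"),
]
pairs = intersecting_pairs(use_cases)
print(pairs)
```

Each returned pair marks a combination of use cases for which a risk analysis would be required.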

In the following exemplary application these 
four use cases will be used: 


• Use case “CB1-assisted mounting”
• Use case “SW1-health status monitoring”
• Use case “SW2-position tracking”
• Use case “CH2-ensuring communication”


By combining the use cases the virtual CPPS is 
formed out of the cobot, the smart wristband and 
the communication hub. 

For the identified intersecting use cases, and thus
for the new CPS within the CPPS, a risk analysis has to
be performed. This risk analysis is state of the art
and is therefore not discussed further in this paper.
Here, the risk is defined quantitatively or qualitatively
as a function of severity, exposure,
occurrence and controllability (ISO
2011).

By this procedure the risk is assessed for the
corresponding use case (Step 2). As a result, potential
risks are identified for the following steps (steps
2-4) to achieve a sufficiently safe and secure CPPS.


3.2 Safety use case definition 


With the result of step 2 safety use cases are 
defined. The goal of the safety use cases is derived 
from the risk analysis in step 2. In the example of 
the combined use cases “CB1” and “SW2” the 
collision between the user and the AVC should be 
avoided. Therefore, a new use case “AC” (avoid 
collision) is defined. The following Figure 3 shows 
the storyline of this use case. 


Safety UC: AC
name: avoid collision
actor: co-worker
trigger event: co-worker falling below safety distance
short description: avoid collision between co-worker and “CB”
pre-condition: co-worker wearing “SW”, “CB” operational, network communication available
primary flow: 1) recognize direction of co-worker movement
              2) adapt movement of “CB” to regain safety distance
alternative flow: 1) recognize direction of co-worker
                  2) adapt movement
                  3) collision
post-condition: co-worker wearing “SW”, “CB” operational, network communication available

Figure 3. Safety use case “AC”.
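
A predefined use case template of this kind can be captured as a small data structure. The following Python sketch is an assumption made for illustration: the class `SafetyUseCase` and its method `needs_follow_up` are hypothetical names, and only the field contents are taken from the storyline of the use case “AC”.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SafetyUseCase:
    name: str
    actor: str
    trigger_event: str
    short_description: str
    pre_conditions: List[str]
    primary_flow: List[str]
    alternative_flow: List[str] = field(default_factory=list)
    post_conditions: List[str] = field(default_factory=list)

    def needs_follow_up(self) -> bool:
        """An alternative flow signals that a further safety use case
        covering the resulting hazard should be considered."""
        return bool(self.alternative_flow)


conditions = [
    "co-worker wearing SW",
    "CB operational",
    "network communication available",
]
avoid_collision = SafetyUseCase(
    name="avoid collision",
    actor="co-worker",
    trigger_event="co-worker falling below safety distance",
    short_description="avoid collision between co-worker and CB",
    pre_conditions=conditions,
    primary_flow=[
        "recognize direction of co-worker movement",
        "adapt movement of CB to regain safety distance",
    ],
    alternative_flow=[
        "recognize direction of co-worker",
        "adapt movement",
        "collision",
    ],
    post_conditions=conditions,
)
print(avoid_collision.needs_follow_up())
```

Keeping the alternative flow as an explicit field makes it easy to list the use cases that require a follow-up safety use case.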


Figure 4. Sequence diagram for “collision” from safety
UC: AC and safety UC: emergency.


Unlike “IER1” (see Fig. 2) the use case “AC” 
has an alternative flow to include a possible colli- 
sion. Therefore it is necessary to consider a safety 
use case that reflects the resulting hazard of the 
collision of the alternative flow. In consequence 
the safety use case “emergency” is equally defined 
and documented. 

Consequently, a sequence diagram is used to 
describe the interaction of the safety use cases. 
Sequence diagrams are based on the predefined 
use cases (Cockburn 2000). The use cases will be 
depicted and considered in the safety sequence 
diagram, which shows the required exchange of 
messages to describe the functionality of the safety 
scenario (SysML 2015). 

In the example the alternative flow from safety 
use case “AC” is represented by the first two 
sequence steps of the diagram. The other part 
illustrates the steps of the subsequent safety use 
case “emergency”. An emergency alert will be trig- 
gered and “IER” will shut down the whole CPPS if 
the monitored health status of the user is not okay 
after the incident (see Figure 4). 

Figure 5. Internal block diagram.


The technical safety relevant information flow of 
the involved CPS “CB”, “SW”, “CH” and “IER” 
is determined by an internal block diagram (see 
Fig. 5). In addition to the logical task-orientated 
sequence the internal block diagram allows the 
description of information flow. Hereby, design 
support and further security analysis of the involved 
CPS are prepared. Figure 5 shows the information 
flow through the radio (“RC”) and network (“NC”) 
communication ports and its direction. Likewise, 
communication redundancies can be defined. 

Based on the internal block diagram a security 
analysis and assessment is prepared in step 3. 


3.3 Security analysis and assessment 


In step 3 the security measures are defined that are
needed to prevent an outside attacker from deliberately
triggering threats that result from the safety use case.
The goal is to describe barriers between the com- 
ponents that show necessary limitations of infor- 
mation flow and encryption of communication 
between the components of the CPPS. Both items 
of information can be used to extend the solution 
structure of the safety use case by adding security 
barriers and hierarchic structures. 

Therefore, the CPPS is analyzed by means of 
a security assessment based on attack scenarios. 
The most important results of these scenarios are 
goals and methods of the attack. The safety use 
case derived in step 2 is used to define the goal of 
the attack. The attack goal that results from the 
exemplary safety use case is achieving access to 
the home. Feasible attack paths and methods are 
deduced by the diagrams of the CPPS defined in 
step 2, which show involved CPS and informa- 
tion flows between them. The description of use 
cases and attack paths may require the integration 
of further CPPS components. The resulting simplified
attack scenarios are summarized in attack
trees, which were introduced by Schneier (1999).


Figure 6 shows five resulting scenarios defined 
by the CPPS information flow of the use cases. 
Subsequently, a qualitative assessment of the CPPS
is conducted considering the developed security
scenarios. The assessment includes a ranking
regarding the probability of occurrence (PO) and 
goal achievement (PG) based on the attack trees 
shown in Figure 6. The scenarios S5 and S6 are 
excluded from further analysis as they are very 
unlikely to occur in terms of PO and PG. 
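
The screening of scenarios can be illustrated with a short Python sketch. The rankings below are hypothetical placeholders, not the values behind Figure 6, and the exclusion rule (drop a scenario only if both PO and PG are ranked low) is an assumption chosen for illustration.

```python
# Ordinal scale for the qualitative ranking (an assumption for illustration).
RANK = {"low": 1, "medium": 2, "high": 3}

# Hypothetical (PO, PG) rankings; the actual assessment is based on the
# attack trees and is not reproduced here.
scenarios = {
    "S1": ("high", "medium"),
    "S2": ("medium", "high"),
    "S3": ("medium", "medium"),
    "S4": ("high", "low"),
    "S5": ("low", "low"),
    "S6": ("low", "low"),
}


def probable_scenarios(rankings):
    """Keep a scenario unless both PO and PG are ranked 'low'."""
    return [
        name
        for name, (po, pg) in rankings.items()
        if max(RANK[po], RANK[pg]) > RANK["low"]
    ]


kept = probable_scenarios(scenarios)
print(kept)
```

With these placeholder rankings, the scenarios S5 and S6 are filtered out and the remaining ones are passed on to the analysis of attack vectors.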

As a result of the security analysis the attack 
vectors of the probable scenarios (S1-S4) have 
to be investigated. This shows where barriers are 
needed to secure the considered CPPS for the 
specific use case. The simplified block diagram in 
Figure 7 depicts this. 

The barriers shown above describe the limitation
of information flow or the needed encryption.
Additionally, a simplified hierarchic model is
established by analyzing the proposed limitation of
direction of command and information flow signals
between the components of the CPPS.

Figure 7. Security ibd and communication flow.

Figure 8. Sequence diagram for “collision” from safety
UC: AC and safety UC: emergency.


3.4 Harmonization of safety use cases and 
security Scenarios 


In the last step, the results of step 2 and 3 will be 
matched to expose and solve the safety-security 
goal conflicts related to the analyzed safety use case. 
As a result, Figure 8 shows a harmonized sequence 
diagram to achieve an adequate safety and security 
level. The analysis of information and command 
flow leads to changed connections in the sequence 
diagram. In the explained example, the connection 
of “SW” and “CB” is identified as critical for secu- 
rity. Therefore, the tasks “check health status” and 
“check emergency send” have to be executed by the 
“CH”. The activity “send rescue alert” is addition- 
ally realized by the “CH”. As a result, the sequence 
diagram “sd collision ver. B” is recursively adjusted. 

In summary, an integrated safety and security
consideration for the CPPS is derived based on the use cases.
On the one hand, possible goal conflicts
between safety and security functions are revealed.
On the other hand, the method enables a design
process that solves these conflicts. Due to the high
degree of abstraction, a design that is neutral with
respect to the technical solution can be planned at an early stage.


4 CONCLUSION AND OUTLOOK 


In this article a first use case-based approach is
developed, which integrates safety use cases and
resulting security scenarios for a comprehensive
overview. First the problem of goal conflicts between
safety and security was outlined and the state of 
the art regarding Cyber Physical Production Sys- 
tems (CPPS) that can be considered as SoS and 
especially cobots as CPS was summarized. It 
was shown that existing models and approaches 
regarding SoS do not focus on an integrated safety 
and security perspective. Simultaneously the secu- 
rity of CPPS in the context of Industry 4.0 has to 
be considered in a more detailed way with regard 
to users and experts. Hence this article proposes 
an approach based on Systems Engineering to 
analyze and harmonize safety and security at 
the same time. The four individual steps of the 
approach include use case definition, safety use 
case definition, security scenario analysis and har- 
monization. Additionally, an example illustrates 
how the concurrent single steps may contribute 
to a safe and secure model of the CPPS, which 
is enhanced in every step of the procedure. In the 
first step, use cases are defined that may overlap 
in time and space and combined by time and loca- 
tion for the identified systems of the CPPS. These 
combinations are analyzed. Step 2 comprises 
the definition of the resulting safety use cases 
to avoid risks. They are described by storyline, 
internal block diagram and sequence diagram in 
SysML-based diagram types (Alt 2012). The secu- 
rity analysis in step 3 identifies attack goals as a 
result of the safety use cases and establishes attack 
scenarios based on attack trees. The probabilities 
of occurrence and goal achievement of the attack 
scenarios are qualitatively assessed and security 
structures containing the limitation of communi- 
cation and encryption are derived. The resulting 
security structure is compared to the safety use 
case in step 4. Occurring goal conflicts are solved 
by adapting the sequence diagram of the safety 
use cases. 

ABSTRACT: The aim of this paper is to present a mathematical framework for estimating the expected
heat loss due to the implementation of de-icing measures for vessels operating in the Arctic offshore.
Sea-spray icing on vessels operating in Arctic waters imposes financial and safety risks, such as loss of 
vessel stability, safety risks for vessel crew, as well as delays in maintenance and operations of offshore 
units. In Arctic offshore logistics, efficient planning of platform supply vessel operations should be per- 
formed while accounting for the risk of spray icing associated with selected voyages. Although Arctic 
vessels are equipped with a range of anti-icing and de-icing options, an optimum design and estimation 
of energy consumption for winterisation purposes remains a challenging task especially due to the uncer- 
tainties associated with the temporal-spatial variation of meteorological and oceanographic parameters 
contributing to ice accretion. However, long-term forecasts of such parameters are hardly available during 
logistics planning phase. Thus, this study uses 3-hourly reanalysis hindcast data (Norwegian Reanalysis 
10 km data: NORA10) for estimation of icing rate and develops a probabilistic framework for estimation 
of expected heat loss, due to implementation of winterisation measures, over sea voyages for long-term 
logistics plans. Expected icing rate and winterisation-related heat loss can be used as safety and financial 
risk indicators in Arctic offshore logistics operations and their involved long-term decision-making proc- 
esses. The framework is illustrated by a case study in the Arctic-Norwegian waters. 


1 INTRODUCTION 


Sea-spray icing is considered the most severe icing 
type due to its potentially high accretion rate 
(Ryerson, 2008; Ryerson, 2011) and a major safety 
concern for vessels operating in the Arctic offshore 
as the weight of the ice negatively affects vessel’s 
stability and manoeuvrability. Spray icing can 
threaten the safety of crew on-board and structural 
reliability of platforms and vessels. It can interrupt
routine on-board maintenance and operations
activities due to safety concerns. Heavy icing
events and the subsequent de-icing operations can
delay the delivery of goods, spare parts, and services
(Jones and Andreas, 2009; Naseri and Barabady,
2016; Samuelsen et al., 2015). Figure 1 shows
severe spray icing on the vessel KV Nordkapp on
26.02.1987 in the Barents Sea.

Figure 1. 110 tons of ice accumulating during a
17 hours period on KV Nordkapp on 26.02.1987, while
sailing from Tromso to waters between Bjørnøya and
Hopen in the Barents Sea (Samuelsen et al., 2015).

There are numerous works on planning offshore
logistics operations, such as designing and optimising
offshore fleet size and scheduling a number
of platform supply vessels to support the needs of
a cluster of offshore platforms; for example, see
(Fernandez Cuesta et al., 2017; Halvorsen-Weare
et al., 2012; Maisiuk and Gribkovskaia, 2014; Stalhane
et al., 2016). Some other works tackle the
issue of reducing CO2 emission and vessel voy- 
age cost by lowering fuel and energy consumption 
(Norlund and Gribkovskaia, 2013; Norlund et al., 
2015). However, application of these research find- 
ings and developed frameworks and techniques for 
Arctic offshore logistics may be faced with a great 
deal of uncertainties due to the contribution of 
icing risks, its related safety concerns, and induced 
operational delays. 

In order to tackle the issue of icing, vessels 
and platforms operating in the Arctic offshore 
are equipped with a range of anti-icing and de- 
icing techniques such as using heat tracers, insu- 
lations, shelters, semi-enclosures, or chemical ice 
protection options (DNV, 2013; Farzaneh,
2015; Rashid et al., 2016; Ryerson, 2008; Ryerson,
2011). Implementation of such techniques, on the
other hand, increases capital and operating costs,
energy usage, greenhouse gas emissions, and fuel
consumption. In addition, one should ensure
the reliability of winterisation techniques, i.e.,
that they perform their required functions
as expected and that, if they malfunction, they can
be maintained and repaired within an acceptable
timeframe.

In this regard, efficient planning and execution 
of platform supply vessel operations is crucially 
important in terms of timely supply of goods and 
services to Arctic offshore units while taking the 
risk of spray icing associated with selected voyages 
into consideration. To this aim, long-term and real- 
time estimation of spray icing rate over the voyages 
is vital, especially due to the temporal-spatial varia- 
tion of meteorological and oceanographic param- 
eters contributing to ice accretion. 

However, modelling the sea-spray icing rate is in
general very challenging. This is mainly due to
the uncertainties related to accurately estimating
the spray amount during wave-ship interaction,
the turbulent heat transfer between the atmosphere
and wetted surfaces on the ship, and the freezing
temperature of the brine water. The freezing temperature
of the brine water on the wetted surfaces of a ship is
lower than that of the incoming sea water due to
salt expulsion during the freezing process (Samuelsen,
2017a). In spite of such challenges,
some researchers have proposed highly sophisticated
models to estimate the spray-icing rate (e.g.,
Horjen, 2013; Kulyakhtin and Tsarau, 2014).
However, these models have been little verified,
and their complexity is therefore not justified by
observations. A major drawback is, for instance,
that they assume that the wave height may be
estimated directly from the wind speed, which is
rarely the case in observed icing events (Samuelsen
et al., 2017). In this study, the MINCOG model
developed by Samuelsen et al. (2017) is adopted for
estimating spray icing rates. A brief description of the
MINCOG model is given in Section 2.

Long-term prediction of icing rates or the 
parameters contributing to spray-ice formation are 
hardly available during logistics planning phase. 
Therefore, by employing MINCOG model, and 
using 3-hourly reanalysis hindcast data (Norwe- 
gian Reanalysis 10 km data: NORA10), icing rates 
are estimated for a sufficiently long period in the 
past. By the use of such estimates and applying a 
non-sequential Monte Carlo simulation technique 
(Zio, 2013), a probabilistic representation of icing 
rate for a specific sea voyage and time period is 
developed to simulate the icing events and their 
rates in future for that voyage. This is further used 
as a fundamental input for estimation of expected 
icing rates and expected amount of energy required 
for winterisation purposes associated with selected 
sea voyages during certain time intervals. 

This framework and its provided information 
on expected icing risk and winterisation-related 
energy consumption can be used as a safety and 
financial risk indicator for long-term decision-making
processes in Arctic offshore logistics operations.
In addition, by applying short-term weather
prediction data, the presented framework can help
vessel crew make short-term decisions with
respect to icing risks along the selected sea voyage.
The rest of this paper is organised as follows. In 
Section 2, after reviewing spray-icing process, the 
MINCOG model is briefly discussed. Section 3 
describes the proposed mathematical framework 
for estimating expected icing rate and its related 
energy consumption for a given voyage. The case 
study and conclusions are presented in Sections 4 
and 5, respectively. 


2 SEA-SPRAY ICING RATE MODELLING 


In this study, a newly developed physics-based ship- 
icing model, known as Marine Icing model for the 
Norwegian COast Guard (MINCOG), is adopted 
from (Samuelsen et al., 2017) to predict the sea- 
spray icing rate. This model uses the Norwegian 
coast guard ship class named “KV Nordkapp” as 
a reference ship type for ship-icing calculations. 
Samuelsen (2017b) shows how this model provides 
higher verification scores than previously-applied 
ship-icing models and nomograms, like the com- 
monly-applied Overland (1990) model, when the 
models are verified against ship-icing data from 
Arctic-Norwegian waters, outside Alaska, and at 
the east coast of Canada. 

Sea spray generated from the interaction
between ships and waves is considered the
dominant water source in ship-icing events (e.g.,
Samuelsen, 2017a, and references therein). The
MINCOG model is, therefore, based on the modelling
of wave-ship interaction icing. Firstly, the sea-spray
flux is calculated based on spray data derived
from Borisenkov et al. (1975). The icing rate $r$ can be
calculated from the average sea-spray flux by taking
into account the different heat fluxes involved
in the icing process at a fixed position in the front
of the ship. The heat balance is given by (Samuelsen
et al., 2017),


$$q_f = q_c + q_e + q_d + q_r, \qquad (1)$$


where $q_f$ is the energy that is released by the freezing
process, during which salt is expelled, making the
freezing temperature lower than that of incoming
sea water. In Equation (1), $q_c$ is the convective
cooling from the air to the freezing brine, $q_e$ is the
evaporative cooling of the brine, $q_d$ is heating (or
cooling) from the sea water to the brine, and $q_r$ is
the incoming or outgoing longwave and shortwave
radiative heat flux. Once the icing rate is computed,
the amount of energy released by freezing, i.e., $q_f$, is
used for estimating the amount of energy needed to avoid
icing (see Section 3.3).

Six model-input parameters, including wind
speed, air temperature, relative humidity, mean
sea-level pressure, significant wave height, and
significant wave period, are all derived from the NOrwegian
ReAnalysis 10 km data (NORA10) (Reistad et al.,
2011). Constant values of ship speed, sea-surface
or water temperature, and incoming sea-water
salinity are applied.

As illustrated in Samuelsen (2017b, Figures 12
and 13), the sensitivity to these latter parameters
in the normal range considered in marine-icing
studies is relatively low compared to the former
parameters, and thus their medians are chosen as
fixed model inputs. Incoming shortwave radiation
is neglected here, and incoming longwave
radiation is parametrized by assuming that the
atmosphere radiates as a black body with a
temperature equal to the air temperature at the
level of the ship.

Furthermore, it is assumed that the winds and 
waves are coming from the same direction, and a 
constant angle is applied for the direction between 
the ship and wind equal to the median value of this 
angle derived from the icing reported by Samu- 
elsen et al. (2017). 

For simplicity, the trajectory model of water
droplets used in Samuelsen et al. (2017) is skipped,
and the droplet velocity is calculated from the relative
velocity of the wind and ship in the horizontal
direction and an assumed terminal velocity of uniform
spherical droplets with a constant
diameter of 2 mm. A detailed discussion of the
MINCOG model, its underlying assumptions and
the constant values of input parameters is given in
(Samuelsen, 2017a; Samuelsen, 2017b; Samuelsen
et al., 2017).


3 LOGISTICS RISK INDICATOR MODEL: 
ANTI-ICING EXPECTED HEAT LOSS 


In this work, energy consumption for anti-icing
purposes and anti-icing heat loss are used
interchangeably. In order to calculate the expected heat
loss, a systematic procedure based on a non-sequential
Monte Carlo simulation is suggested in this
study, as shown in Figure 2. First, the trajectory is
decomposed into several segments and the coordinates
of the origin and destination of each segment are
determined. According to the vessel speed and the
length of each segment, the time interval during which
the vessel is sailing along each segment is computed. In the second
step, corresponding to those time intervals and for
each segment, the occurrence of icing events and their
rates are determined using the hindcast data, which
are used in the next step to extract the statistics
of icing rates for those locations. Using the determined
distributions of icing events and rates for each
segment, the occurrence of icing events and their
associated icing rates are simulated for each segment
at given times. This procedure is repeated
a sufficiently large number of times in order to estimate
the statistics and determine the distribution
of icing rate and heat loss for the whole trajectory.


3.1 Vessel trajectory decomposition 


Let $\alpha$ and $\beta$ denote, respectively, latitude and
longitude, in degrees, of a geographical location on
a map. Thus, vessel trajectory $AB$ is recognised
by the coordinates of its origin $A$ and destination
point $B$, i.e., $A = (\alpha_A, \beta_A)$ and $B = (\alpha_B, \beta_B)$.


STEP I: Vessel trajectory decomposition and computation of
trajectory segments, sailing time, and corresponding coordinates

STEP II: Estimation of icing rates for each segment at
corresponding time intervals using hindcast data

STEP III: Computation of the statistics of icing rates for each
segment and corresponding time using estimated icing rates

STEP IV: Simulation of icing events and their associated rates
for future and computation of corresponding heat loss for each
segment and thus the whole trajectory

Figure 2. A systematic procedure for estimating
expected heat loss for a given vessel trajectory.

Further, let trajectory $AB$ be divided into $N$
not-necessarily-equal segments $S_i = \overline{n_i n_{i+1}}$, $i = 1, \dots, N$,
with length $d_i$ and coordinates $n_i = (\alpha_i, \beta_i)$,
where,

$$n_1 = A \Rightarrow (\alpha_1, \beta_1) = (\alpha_A, \beta_A) \qquad (2)$$
$$n_{N+1} = B \Rightarrow (\alpha_{N+1}, \beta_{N+1}) = (\alpha_B, \beta_B) \qquad (3)$$

Note that $\sum_{i=1}^{N} d_i = D$, with $D$ being the rhumb
line distance (Snyder, 1987) taken as the length of
trajectory $AB$ in km, given by (Alexander, 2004),

$$D = \frac{\pi R}{180} \, \frac{\alpha_B - \alpha_A}{\cos\theta} \qquad (4)$$

where $R = 6371$ km is the earth’s radius and $\theta$,
in degrees, is the constant heading of trajectory
$AB$, i.e., the heading of the rhumb line connecting $A$
to $B$, clockwise from north. The constant heading
is given by (Snyder, 1987),

$$\theta = \frac{180}{\pi} \arctan\left(\frac{x_B - x_A}{y_B - y_A}\right) \qquad (5)$$

with,

$$x_A = \frac{\pi R}{180} \beta_A \qquad (6)$$
$$y_A = R \ln\left[\tan\left(45^\circ + \frac{\alpha_A}{2}\right)\right] \qquad (7)$$

being rectangular coordinates of $A = (\alpha_A, \beta_A)$.
While $x_A$ lies along the equator, increasing towards
east, $y_A$ lies along the central meridian, increasing
towards north. Similar definitions stand for $x_B$
and $y_B$.
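
As a hedged illustration of Equations (4) to (7), the following Python sketch computes the constant heading and the rhumb line distance between two points. The function names are hypothetical. Two choices deviate slightly from the equations as written and are assumptions of this sketch: `atan2` is used instead of the plain arctangent so that all quadrants are handled, and a due east or west course (where the cosine of the heading vanishes) is treated separately. Longitude wrap-around at the date line is not handled.

```python
import math

R_EARTH_KM = 6371.0  # earth's radius used in the text


def mercator_xy(lat_deg, lon_deg, radius=R_EARTH_KM):
    """Rectangular coordinates of a point, Equations (6) and (7)."""
    x = math.pi * radius / 180.0 * lon_deg
    y = radius * math.log(math.tan(math.radians(45.0 + lat_deg / 2.0)))
    return x, y


def rhumb_heading_deg(lat_a, lon_a, lat_b, lon_b):
    """Constant heading in degrees, clockwise from north, Equation (5)."""
    xa, ya = mercator_xy(lat_a, lon_a)
    xb, yb = mercator_xy(lat_b, lon_b)
    return math.degrees(math.atan2(xb - xa, yb - ya)) % 360.0


def rhumb_distance_km(lat_a, lon_a, lat_b, lon_b):
    """Rhumb line distance in km, Equation (4)."""
    theta = math.radians(rhumb_heading_deg(lat_a, lon_a, lat_b, lon_b))
    km_per_degree = math.pi * R_EARTH_KM / 180.0
    if abs(math.cos(theta)) < 1e-12:
        # Due east or west: sail along the parallel of latitude instead.
        return km_per_degree * abs(lon_b - lon_a) * math.cos(math.radians(lat_a))
    return km_per_degree * abs(lat_b - lat_a) / abs(math.cos(theta))


# Heading from Hammerfest (70.66N, 23.68E) to the site at 75.40N, 24.46E.
heading = rhumb_heading_deg(70.66, 23.68, 75.40, 24.46)
print(round(heading, 2))
```

For the two case study locations this sketch gives a heading of about 2.73 degrees.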

In order to determine the coordinates of the
segment vertices, i.e., $n_i = (\alpha_i, \beta_i)$, $i = 2, \dots, N$,
we can use the length of segment $S_{i-1}$, i.e., $d_{i-1}$,
$i = 2, \dots, N$, and the constant heading of trajectory
$AB$, $\theta$, in degrees.

The latitude of the destination point of segment
$S_{i-1}$, $n_i$, is given by (Kaplan, 1995),

$$\alpha_i = \alpha_{i-1} + \frac{180}{\pi} \, \frac{d_{i-1}}{6371} \cos\theta \qquad (8)$$

where $d_{i-1}$, in km, is the rhumb line distance from
$n_{i-1}$ to $n_i$. The longitude of the destination point of
segment $S_{i-1}$, $n_i$, is given by (Kaplan, 1995),

$$\beta_i = \beta_{i-1} + \frac{180}{\pi} \, \frac{d_{i-1}}{6371 \cos\alpha_i} \sin\theta \qquad (9)$$

Once the coordinates $n_i$, $i = 1, \dots, N+1$, are determined,
the coordinates of the midpoint of segment $S_i$,
denoted by $\bar{n}_i = (\bar{\alpha}_i, \bar{\beta}_i)$, $i = 1, \dots, N$, are
determined by substituting half of the segment’s length into
Equations (8) and (9), i.e., $d_{i-1} \rightarrow d_{i-1}/2$.
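
The stepping scheme of Equations (8) and (9) can be sketched as follows. The function names `next_vertex` and `decompose` are hypothetical, and the sketch assumes that the latitude entering the cosine in Equation (9) is the newly computed latitude of the destination point.

```python
import math

R_EARTH_KM = 6371.0


def next_vertex(lat_deg, lon_deg, dist_km, heading_deg):
    """Advance dist_km along a constant heading, Equations (8) and (9)."""
    theta = math.radians(heading_deg)
    new_lat = lat_deg + math.degrees(dist_km / R_EARTH_KM * math.cos(theta))
    new_lon = lon_deg + math.degrees(
        dist_km * math.sin(theta)
        / (R_EARTH_KM * math.cos(math.radians(new_lat)))
    )
    return new_lat, new_lon


def decompose(lat_a, lon_a, heading_deg, lengths_km):
    """Return the N + 1 segment vertices and the N segment midpoints."""
    vertices = [(lat_a, lon_a)]
    midpoints = []
    for d in lengths_km:
        lat, lon = vertices[-1]
        # Midpoint: substitute half of the segment length.
        midpoints.append(next_vertex(lat, lon, d / 2.0, heading_deg))
        vertices.append(next_vertex(lat, lon, d, heading_deg))
    return vertices, midpoints


# Simple check case: two segments of one degree of latitude each, due north.
one_degree_km = math.pi * R_EARTH_KM / 180.0
vertices, midpoints = decompose(0.0, 0.0, 0.0, [one_degree_km, one_degree_km])
print(vertices)
print(midpoints)
```

Sailing due north from the equator in two steps of one degree each places the vertices at latitudes 0, 1 and 2 degrees and the midpoints at 0.5 and 1.5 degrees, with the longitude unchanged.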


3.2 Probabilistic representation of icing events 
and rates 


Probabilistic representation of icing events for
a location involves two stochastic
processes. In the first stochastic process, one
predicts whether an icing event occurs; in
the second, the icing rate
is predicted should the icing event occur. To this
aim, assume that there exists a sufficiently large
number of icing event observations for a given
location, with which both stochastic processes
can be simulated.

For modelling the first process at a given
location and time instant, let the probability
of the occurrence of an icing event be denoted by
$\Pr(O = 1) = p$. Thus the complementary event (i.e.,
non-occurrence of an icing event) occurs with a
probability of $\Pr(O = 0) = 1 - p$. For a given location
and time instant, this process can be represented
by a Bernoulli distribution, whose cumulative
distribution function (CDF) is given by (Rausand
and Høyland, 2004):

$$F_O(o) = \begin{cases} 1 - p & \text{for } o = 0 \\ 1 & \text{for } o = 1 \end{cases} \qquad (10)$$

In order to simulate this process using a
non-sequential Monte Carlo simulation, one can sample
a random number from the CDF given by
Equation (10). In other words, let $\zeta \sim U[0, 1)$; then,
an icing event occurs if $\zeta \geq 1 - p$, and it does not
occur otherwise.

The second stochastic process initiates once the
icing event occurs, i.e., $O = 1$. Once a sufficiently
large number of icing rate observations are collected,
one can determine the empirical CDF of the
icing rate for a given location and time instant, i.e.,
$r \sim F_R(r)$, from which the icing rate
can be predicted.
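
Both stochastic processes can be sketched in Python as follows. The function names are hypothetical, the observed rates are made-up placeholder values, and drawing from the empirical CDF is implemented here by inverse-transform sampling on the sorted observations, which is one common choice rather than necessarily the one used in the study.

```python
import random


def icing_occurs(p, rng):
    """Bernoulli trial: an icing event occurs if zeta >= 1 - p, zeta ~ U[0, 1)."""
    zeta = rng.random()
    return zeta >= 1.0 - p


def sample_icing_rate(observed_rates, rng):
    """Draw one icing rate from the empirical CDF of the observations."""
    ordered = sorted(observed_rates)
    zeta = rng.random()
    index = min(int(zeta * len(ordered)), len(ordered) - 1)
    return ordered[index]


rng = random.Random(42)          # fixed seed for reproducibility
observed = [0.2, 0.5, 0.5, 1.1]  # placeholder icing rates, e.g. in cm/h
if icing_occurs(0.3, rng):
    print("icing event, rate:", sample_icing_rate(observed, rng))
else:
    print("no icing event")
```

Repeating these two draws many times for a given location and time instant reproduces the occurrence probability $p$ and the observed distribution of icing rates.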


3.3 Heat loss estimation 


To estimate the amount of heat required for
anti-icing, one can calculate the equivalent amount
of heat released due to freezing, denoted by $q_f$, in
J m$^{-2}$ s$^{-1}$ (Samuelsen et al., 2017),

$$q_f = L \cdot r \cdot \rho \qquad (11)$$

where $\rho = 890$ kg m$^{-3}$ is the ice density, taken as
constant, $r$ is the ice accretion rate in m s$^{-1}$, and
$L = 2.3 \times 10^{5}$ J kg$^{-1}$ is the latent heat of freezing for
saline-water ice (Samuelsen et al., 2017).
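
A minimal numerical sketch of Equation (11) is given below. The helper names are hypothetical, the latent heat is taken as 2.3 x 10^5 J/kg, and the icing rate of 1 cm/h is an illustrative input, not a result of the study.

```python
LATENT_HEAT_J_PER_KG = 2.3e5     # latent heat of freezing, saline-water ice
ICE_DENSITY_KG_PER_M3 = 890.0    # ice density, taken as constant


def cm_per_hour_to_m_per_s(rate_cm_per_h):
    """Convert an icing rate from cm/h to m/s."""
    return rate_cm_per_h / 100.0 / 3600.0


def freezing_heat_flux(icing_rate_m_per_s):
    """Heat released by freezing in J m^-2 s^-1, Equation (11)."""
    return LATENT_HEAT_J_PER_KG * icing_rate_m_per_s * ICE_DENSITY_KG_PER_M3


q_f = freezing_heat_flux(cm_per_hour_to_m_per_s(1.0))
print(round(q_f, 1))
```

Under these assumptions, an icing rate of 1 cm/h corresponds to a heat flux of roughly 570 W per square metre of iced surface.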


3.4 Anti-icing expected heat loss for a given 
vessel trajectory 


Assume that the vessel departs from the origin $A$
at time $t_0$. The vessel sails segment $S_i$ with constant
speed of $V_i$. By using the length of each segment
and the time of departure from the origin, the time at
which the vessel is sailing along segment $S_i$ can be
calculated accordingly.

By assuming that the vessel is exposed to the
same level of icing continuously for the period it
sails segment $S_i$, the amount of heat loss for
segment $S_i$ can be calculated by:

$$e_i = o_i \cdot q_{f,i} \cdot \frac{d_i}{V_i} \qquad (12)$$

with $o_i = 1$ if an icing event occurs, and 0 otherwise,
$q_{f,i}$ being the heat flux of Equation (11) for segment $S_i$,
and $d_i / V_i$ the sailing time along the segment in consistent units.
By repeating the presented procedure a sufficiently
large number of times, an empirical CDF
of $e_i$, $E_i \sim F_{E_i}(e)$, can be determined.

The amount of heat loss for the whole trajectory
$AB$, thus, will be given by:

$$e_{AB} = \sum_{i=1}^{N} e_i \qquad (13)$$

Mean, median, standard deviation, and quantiles
of heat loss due to anti-icing can be extracted
from the empirical distribution of $e_{AB}$ to represent
the uncertainties associated with the occurrence
and rate of icing events along the given trajectory.
Another application of the presented framework is
identification of the most critical segment of the
trajectory, from a safety or heat loss viewpoint.
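
The simulation loop over the whole trajectory can be sketched as below. This is an illustration under stated assumptions: the segment heat loss is taken as the heat flux of Equation (11) multiplied by the sailing time of the segment, the segment dictionaries with keys `length_km`, `p_icing` and `rates_m_per_s` are a hypothetical data layout, and all input values are placeholders.

```python
import random
import statistics

LATENT_HEAT_J_PER_KG = 2.3e5
ICE_DENSITY_KG_PER_M3 = 890.0


def simulate_heat_loss(segments, speed_m_per_s, n_runs, seed=0):
    """Return n_runs simulated heat losses (J per m^2) for the trajectory."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_runs):
        total = 0.0
        for seg in segments:
            duration_s = seg["length_km"] * 1000.0 / speed_m_per_s
            # First process: does an icing event occur on this segment?
            if rng.random() >= 1.0 - seg["p_icing"]:
                # Second process: draw a rate from the observed rates.
                rate = rng.choice(seg["rates_m_per_s"])
                flux = LATENT_HEAT_J_PER_KG * rate * ICE_DENSITY_KG_PER_M3
                total += flux * duration_s
        totals.append(total)
    return totals


def summarise(totals):
    """Mean, median, standard deviation and 5 % / 95 % quantiles."""
    cuts = statistics.quantiles(totals, n=20)
    return {
        "mean": statistics.mean(totals),
        "median": statistics.median(totals),
        "stdev": statistics.stdev(totals),
        "q05": cuts[0],
        "q95": cuts[-1],
    }


# Placeholder inputs: two 43.2 km segments, icing rates of 0.5 and 1 cm/h.
segment = {
    "length_km": 43.2,
    "p_icing": 0.4,
    "rates_m_per_s": [0.005 / 3600.0, 0.01 / 3600.0],
}
totals = simulate_heat_loss([segment, segment], 4.0, n_runs=1000)
print(summarise(totals))
```

Keeping the per-segment contributions instead of only the total would also allow the most critical segment of the trajectory to be identified.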


4 ILLUSTRATIVE CASE STUDY 


Consider an operational site in the northern Bar- 
ents Sea, 75.40N and 24.46E degrees, and Ham- 
merfest 70.66N and 23.68E degrees; points B 
and A in Figure 3. A vessel leaves Hammerfest at 
00:00 01.01.2018 towards location B. The aim is 
to simulate the occurrence of possible icing events 
along this trajectory and, in addition, simulate the 
expected heat loss associated with implementation 
of anti-icing measures. 


4.1 Data 


As discussed in Section 2, the MINCOG model is
used in this study to model the sea-spray icing rate.
As suggested by Samuelsen et al. (2017), median
values of vessel speed $V = 4$ m s$^{-1}$, water temperature
2.5°C, and water salinity 35 ppt are chosen as
fixed input parameters. The main six input parameters,
including wind speed, air temperature, relative
humidity, mean sea-level pressure, significant
wave height, and significant wave period, are all
derived from NORA10, obtained every 3 hours
from 01.01.1980 00:00:00 to 31.12.2012 21:00:00.

Figure 3. Maximum icing rate (cm/h) over the Barents Sea in
January during 1980 to 2012. The solid black line shows
the rhumb line of vessel trajectory AB, decomposed into
segments shown by dark circles.


4.2 Analysis, results, and discussion 


The occurrence of icing events and their rates are 
subject to temporal and spatial variations over 
the Barents Sea due to changes in meteorological 
and oceanographic conditions. Figure 3 shows the
maximum icing rate that occurred in January during
the period 1980 to 2012. As illustrated, the icing
rate in the region is highest in the northeastern 
part. As shown, the maximum icing rate in Janu- 
ary over the selected trajectory is also subject to 
considerable variations due to temporal and spa- 
tial changes in meteorological and oceanographic 
parameters. 

Icing rate is lowest in the southwest due to 
higher air temperatures. This is only partly a result 
of the relatively high sea-surface temperature asso- 
ciated with the North Atlantic Current. However, 
the main reason is that these areas are located 
some distance away from the cold lands or sea ice, 
leading to the fact that the air has time to be suf- 
ficiently heated from below to avoid the high icing 
rates apparent in the other areas due to strong ver- 
tical mixing in weather situations in which icing 
occurs (Samuelsen and Graversen, 2017). 
The icing rate is highest in the north and northeast
due to lower air temperatures associated with
strong winds and high waves that may arise dur- 
ing cold-air outbreaks from the ice. Since the ice 
edge normally is located just north of these areas 
exposed to severe icing, the air does not have time 
to be sufficiently heated from below in these areas 
to avoid severe icing. Sea-surface temperature 
alone does not have a large effect on icing (Samu- 
elsen, 2017a). An interesting finding in Figure 3 
is the maximum values obtained outside some of 
the fjords near the coast of Northern Norway. 
Such maximum values are associated with strong 
and cold gap winds blowing out of some of the large 
fjords, generated in mountain-wave situations during 
offshore flow from the Scandinavian archipelago, as 
described by Samuelsen and Graversen (2017). With 
hindcast reanalysis data for atmosphere and ocean 
variables at a spatial resolution finer than 10 km 
between grid points, this effect would most likely 
have been more apparent. Thus, the maximum icing rates 
in these sea areas are probably underestimated. 

In order to discuss the expected icing rate and 
heat loss over trajectory AB, one needs to decompose 
it into segments. The heading and length of 
trajectory AB are obtained using Equations (5) and 
(4), respectively, giving a heading of 2.732 degrees 
and D = 653 km. Thus, the total sailing time of a 
vessel with a speed of V = 4 m s^-1 is 45.35 h. Since 
the model input data are available every 3 hours, 
we divide the trajectory into 15 equal segments of 
d_i = 43.2 km, i = 1, ..., 15, and a final segment of 
d_16 = 5 km (i.e., N = 16), and assume that the 
meteorological and oceanographic conditions remain 
constant during each 3-hour interval and along the 
corresponding segment. Coordinates of the vertices 
and midpoint of each segment are then computed using 
Equations (8) and (9). From the vessel speed, one can 
determine the time at which the vessel enters each 
segment on its way to the destination. 
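
The segmentation arithmetic above can be checked with a short sketch. It only reproduces the numbers quoted in the text (D = 653 km, V = 4 m s^-1, 3-hourly data); the function names are illustrative and are not part of the paper's method, and the geographic coordinates from Equations (8) and (9) are not computed here.

```python
def segment_lengths(total_km, speed_ms, step_h):
    """Split a route into equal-time segments plus a shorter final segment."""
    step_km = speed_ms * 3600.0 * step_h / 1000.0  # distance sailed in one time step
    n_full = int(total_km // step_km)              # number of complete segments
    lengths = [step_km] * n_full
    remainder = total_km - n_full * step_km
    if remainder > 1e-9:                           # last, shorter segment
        lengths.append(remainder)
    return lengths


def entry_times_h(lengths_km, speed_ms):
    """Hours after departure at which the vessel enters each segment."""
    times, elapsed = [], 0.0
    for d in lengths_km:
        times.append(elapsed)
        elapsed += d * 1000.0 / speed_ms / 3600.0
    return times


lengths = segment_lengths(653.0, 4.0, 3.0)   # D = 653 km, V = 4 m/s, 3-hourly data
entries = entry_times_h(lengths, 4.0)
total_time_h = 653.0 * 1000.0 / 4.0 / 3600.0
print(len(lengths), round(total_time_h, 2))
```

Under these inputs the sketch yields 15 segments of 43.2 km and one of about 5 km, with the vessel entering a new segment every 3 hours, which matches the time step of the input data.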

MINCOG input data corresponding to each 
midpoint location, if available, and to the nearest 
location otherwise, are extracted from the NORA10 
database. Over the period 1980 to 2012, a total of 
33 sets of input data are extracted for each 
coordinate and arrival time, and the corresponding 
icing rates are computed using the MINCOG model. 
Figure 4 shows the frequency of occurrence of an 
icing event, in per cent, at given locations and 
arrival times. These results are based on the 
reanalysis hindcast data over the period 1980 to 
2012 corresponding to certain arrival times at those 
specific locations. We assume that the probability 
of occurrence of an icing event in the future, i.e., 
Pr(O = 1), is equal to its frequency over the past 
years, although this approach suggests that, for example 


Figure 4. Frequency (probability of icing) in each segment 
(1 to 16) at certain arrival times. Vertical axis: frequency 
of occurrence; horizontal axis: segment. 


in certain segments an icing event occurs with 
certainty for specific arrival times at those segments. 
Thus, for instance, for a segment with an observed 
icing frequency of 0.2121, the Bernoulli distribution 
of the occurrence of an icing event is given by 

\[
\Pr(O = o) =
\begin{cases}
1 - 0.2121 & \text{for } o = 0, \\
0.2121 & \text{for } o = 1.
\end{cases}
\]


Note that the overall trend of increasing icing 
rate in Figure 3 is also confirmed by the frequency 
of icing occurrence, which increases from location A 
towards location B. 

Empirical distributions of icing rates in each 
segment at given times, F_i(r), are obtained by 
considering the occasions where r > 0.05 cm h^-1 
(Samuelsen et al., 2017). The modified box-plots 
in Figure 5 show the minimum, the 5th quantile, 
median, the 95th quantile, and maximum of 
icing rates in each segment. Segment S_1 is 
disregarded due to lack of weather data in its nearby 
locations. 

The amount of heat loss for each segment is 
then calculated using Equations (11) and (12). For 
this purpose, a non-sequential Monte Carlo simulation 
is used, wherein, for each segment, a realisation 
of the icing event is sampled to simulate whether 
icing occurs. 

Once an icing event occurs, another stochastic 
process is simulated and a random icing rate is 
sampled from the corresponding icing rate 
distribution, from which the energy required for 
anti-icing is computed. This procedure is repeated a 
sufficiently large number of times to obtain the 
statistics of the energy consumption for each segment; 
the maximum, mean and 95th quantile (upper 5th 
quantile) values are presented in Table 1. 
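
The two-stage sampling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: Equations (11) and (12) are not reproduced in this section, so `energy_from_rate` is a hypothetical placeholder that simply scales the accreted ice thickness by an illustrative constant `k`, and the observed rates passed in are made-up example values.

```python
import random
import statistics


def energy_from_rate(rate_cm_per_h, hours, k=1.0):
    """Hypothetical stand-in for Equations (11) and (12): energy per area
    taken as proportional to accreted ice thickness (k is illustrative)."""
    return k * rate_cm_per_h * hours


def simulate_segment(p_icing, observed_rates, hours, n_runs, rng):
    """Stage 1: Bernoulli draw for icing occurrence.
    Stage 2: draw an icing rate from the empirical (observed) rates."""
    energies = []
    for _ in range(n_runs):
        if rng.random() < p_icing:              # icing event occurs
            rate = rng.choice(observed_rates)   # empirical distribution
            energies.append(energy_from_rate(rate, hours))
        else:                                   # no icing, no anti-icing energy
            energies.append(0.0)
    return energies


def summarise(energies):
    """Mean, 95th quantile and maximum of the simulated energies."""
    ordered = sorted(energies)
    idx95 = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {"mean": statistics.fmean(ordered),
            "q95": ordered[idx95],
            "max": ordered[-1]}


rng = random.Random(42)                         # fixed seed for repeatability
runs = simulate_segment(0.2121, [0.1, 0.3, 0.8], 3.0, 10_000, rng)
stats = summarise(runs)
```

Summing one sampled energy per segment within each run would give one realisation of the total along the trajectory, from which an empirical CDF could be built.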




Figure 5. Box-plots showing the minimum, lower 5th 
quantile, median, upper 5th quantile and maximum values 
of icing rates (cm h^-1) for each segment at arrival times. 


Table 1. Maximum, mean, and the 95th quantile of 
anti-icing energy consumption per area for each segment. 

              Energy consumption for each segment, e_i (MJ m^-2)
Segment S_i   Maximum   Mean     95th quantile
2             1.7283    0.2221   1.4708
3             1.6353    0.1745   1.3452
4             1.5429    0.1897   1.4747
5             1.7001    0.2966   1.6411
6             2.2691    0.4373   2.2318
7             2.0671    0.5098   2.8852
8             3.7993    0.6128   3.1980
9             4.0705    0.6325   3.0958
10            4.1505    0.6671   3.0741
11            4.0949    0.7177   2.9959
12            3.7782    0.7983   3.2998
13            4.0377    0.8963   4.0040
14            5.2738    1.0631   4.4079
15            5.0115    1.0447   4.0969
16            0.5971    0.1276   0.5574


Finally, Equation (13) is employed to compute 
the overall expected energy consumption for all 
segments, i.e., trajectory AB, whose CDF is 
shown in Figure 6. The mean, median, upper 5th 
quantile and maximum energy consumption along 
trajectory AB are 8.40, 8.13, 15.00 and 29.24 MJ m^-2, 
respectively. Depending on the decision-maker’s risk 
perception and the required safety guidelines, one may 
choose either the maximum or some other statistic, 
such as the 95th quantile of expected energy consumption. 

The maximum icing rate, and thus the maximum 
energy consumption, can be computed by adding 
up the required energy associated with the maximum 
icing rates computed from the data available for 
previous years. However, from a probabilistic 
viewpoint, the probability of such a scenario is 
extremely low. Alternatively, a Monte Carlo 
simulation can be used to sample a sufficiently large 
collection of scenarios for icing events along the 
trajectory, from which the maximum energy 
consumption for anti-icing is computed as 
around 29.24 MJ m^-2. 

Figure 6. CDF of total expected energy consumption 
for anti-icing along trajectory AB. Horizontal axis: total 
energy consumption per area (e_AB), MJ m^-2. 


5 CONCLUSIONS 


In this study, a sea-spray icing rate prediction 
model, known as MINCOG, is employed to esti- 
mate the rates of icing events using reanalysis 
hindcast data. Such estimates are used to represent 
icing events and their rates probabilistically. Based 
on this procedure, a mathematical framework is 
proposed to simulate the icing events and their 
associated rates along a sea voyage through two 
stochastic processes. It further estimates expected 
energy consumption required to avoid icing on 
vessels. The results of this paper, illustrated by a 
case study, can be used in long-term decisions to be 
made in Arctic offshore logistics operations, such 
as scheduling and routing platform supply vessels, 
while reducing icing risks, operational costs, fuel 
consumption and CO2 emissions. Future research 
in this area could focus on improving the 
simulation of meteorological and oceanographic 
parameters, and thus of icing events, by taking 
into account the short-term patterns 
present in such parameters. 
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ABSTRACT: One of the design measures adopted for the risk reduction of machinery is the de- 
energization of all the components of the machine for assuring a “safe stop state”. So, for gravity loaded 
axes, the risk of unintended gravity descent in the de-energized state has to be considered in the risk 
assessment. In case of power failure, the gravity loaded axes are held solely by the brake/counterweight 
systems, which are installed in the machine. The gravity loaded axes may drop down, if the existing brakes/ 
counterweights do not provide adequate protection against unintended descent due to gravity. 

In the paper, a full analysis of this underestimated risk of today’s very complex machines is 
presented first. Then the problem is described and the resulting requirements are presented 
in terms of design measures, information to be given to the end user and testing procedures to be 
ensured for new machines. 


1 INTRODUCTION 


While it can be assumed that horizontal movements 
in automatic production pose no hazards to persons 
due to gravity in the de-energized state, for 
vertical movements the risks of unintended gravity 
descent have to be considered in the risk 
assessment, see DGUV (2012). These hazards become 
particularly obvious with linear robots for the 
handling of heavy parts (Fig. 1), but also with 
jointed-arm robots or inside machines, e.g. at the 
vertical axes of machine tools. If the existing 
brakes/counterbalancing systems do not provide 
sufficient protection against unintended gravity 
descent, control measures can in any case contribute 
to reducing the risk. 


So, when the machine systems are in an energized 
state in any mode of operation, it can usually be 
assumed that the risk assessment performed by 
the machine builder ensures proper safety of the 
machine itself. 

The aim of this article is to give machine builders 
a guide to the design of “state of the art” 
Gravity Loaded Axes (GLA) for machine tools. The 
risk assessment and the design solutions presented 
were mainly discussed during the ISO standardization 
work on ISO 16090-1 (2017), but the results are 
generally applicable to all types of machine 
tools. 

The typical iterative three-step method used 
for risk reduction, see ISO 12100 (2010), ensures, 
in conjunction with the hypothesis of full availability 
of machine energy, the proper reduction of the risk of 
uncontrolled descent of all the gravity loaded axes. 

Figure 1. Robot arm used for handling of heavy parts, 
from DGUV (2012). 

In case of power failure or energy removal (for 
example during maintenance), gravity loaded axes 
(weight-loaded, vertical, slant axes) are held solely 
by the brake/counterbalancing system installed in 
the machine. 

The gravity loaded axes could descend in case 
of failure of the retention system and, usually, the 
control/warning system could also be off-line due 
to the complete de-energization of the machine. 

Clearly, the risk reduction for this potentially 
unsafe condition cannot be achieved through some of 
the typical step-two risk reduction means (“other 
protective measures”) either, because they are usually 
also ineffective in this state (e.g. light 
barriers, emergency stop buttons). 

As an example, for vertical axes with braking 
systems, mechanical wear or oil-fouling may 
cause the braking torque/force of the brakes to 
fall below its nominal value, which may result in an 
unintended descent of the gravity loaded axes. 

In the de-energized state the risk reduction cannot 
be achieved by the machine builder through safety 
functions: only inherently safe design measures, 
physical guards, and information for the user are 
available. That is to say, for the risk reduction 
of this potentially unsafe “state”, the end user’s 
behaviour during the use of the machine is essential: 
correct maintenance procedures and frequent 
inspections are essential to reduce the probability, 
severity and occurrence of the risk of descent of 
the GLA in the de-energized state. 

In the following paragraphs, the risk analysis 
performed for safety standardization will be 
presented first. Then the “state of the art” and the 
technically possible future solutions for risk 
reduction will be shown, also using an easy-to-read 
table format. 

This table format was also preferred by the 
authors during the ISO standardization work 
to ensure the clarity, effectiveness and brevity 
needed by the design departments of machine 
builders. 

Depending on the technical application and 
the risk to be reduced, different technical safety 
devices are suitable to prevent the unintended 
gravity descent of gravity-loaded axes. 

Safety functions related to gravity loaded or 
slant axes will also be presented and discussed. 
Finally, some examples of design solutions already on 
the market will be presented for clarity. The 
presented design measures will be introduced in the 
annexes of safety standards for machine tools such 
as ISO 16090-1 (2017). This new ISO 
standard was published in December 2017. 


2 RISK ANALYSIS FOR GLA 


For the risk assessment the tables in the new annex G 
of ISO 16090-1 (2017) shall be used. The annex 
helps the designer to find the “perfect solution” 
for the necessary risk reduction. The tables cover 
the foreseeable operation of the machines as well as 
their maintenance, cleaning and repair. 

The risk analysis for the GLA hazard starts from a 
precondition of machine design, i.e. the different 
safety conditions of operators in the operational 
modes and in maintenance. For more information 
see Tables G.1 and G.2 in annex G of ISO 
16090-1 (2017). 

During maintenance no energy supply is present, 
so the Numerical Control (NC) and Safety Functions 
(SF) are generally not available. The safety of the 
workers during maintenance is achieved 
through inherently safe measures, such as mechanical 
support (e.g. using struts) of the axes loaded by 
gravity, or equivalent measures. 

Sometimes direct safe support of the GLA 
is not the best technical solution because, for 
example, space limitations might not leave adequate 
room for the worker in the maintenance zone with the 
supported axis. In this case the mechanical locking 
of a different component directly connected to the 
GLA must be ensured. 
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It is possible that: 


• an appropriate mechanical support 
(i.e. a strut) shall be designed and provided by 
the manufacturer of the machine, or 

• a proper locking system (position) shall be 
designed and provided by the manufacturer, or 

• at least, if no “special” equipment/locking system 
is provided by the manufacturer, information on 
how to provide proper support shall be given 
(see Table G.2, situations G1.3 and G1.4). 


For all other operating modes, it was assumed 
that the most critical hazardous situation is full body 
access to the work area under a GLA (see G1.1). 

Using ISO/TR 14121-2 (2012) as an example for 
risk assessment, the authors stated that if only the 
upper limbs are under the GLA of the machine 
(G1.2), no serious (S2) injury is expected for the 
worker. Even if the hypothetical severity is sometimes 
S2, the velocity of the descending GLA is not high 
and the worker does not have to move the whole 
body to escape the risk. So, it is possible to 
avoid or reduce the harm, and A1 can be used for the 
possibility of avoiding or reducing the harm in the 
risk graph, see Figure 3 of ISO/TR 14121-2 (2012). 

Moreover, even if for some machines the 
hypothetical severity can be higher (S2), the necessity 
of cyclic testing of “single channel/unsafe 
systems”, such as a single brake (design V1 of Table 
G.1), always leads to a low occurrence of the 
hazardous situation (O1). 

The hazard becomes a real risk only if: 


• the machine goes into the de-energized state, and 

• the worker is under the GLA at that moment, and 

• there is a not yet evident failure of the braking 
system, i.e. the failure occurs between two cyclic 
tests. As one can see in Table G.3, if the workpiece 
is loaded manually (i.e. frequent exposure to the risk 
without the possibility of avoiding it) the cyclic 
testing interval is 8 h, whereas for automatic loading 
of the workpiece (i.e. short presence) the testing 
interval is 48 h. 
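
The three conditions and the two test intervals quoted above can be written down as a small sketch. The function and parameter names are illustrative only; the 8 h and 48 h values are the ones the text attributes to Table G.3, and nothing else from the standard is encoded here.

```python
def hazard_becomes_risk(de_energized, worker_under_gla, hidden_brake_fault):
    """All three conditions listed in the text must hold at the same time."""
    return de_energized and worker_under_gla and hidden_brake_fault


def max_test_interval_h(manual_loading):
    """Cyclic test interval quoted in the text: 8 h for manual workpiece
    loading (frequent exposure), 48 h for automatic loading (short presence)."""
    return 8 if manual_loading else 48


# Example: a de-energized machine with a worker under the axis, brake still sound.
print(hazard_becomes_risk(True, True, False))
print(max_test_interval_h(manual_loading=True))
```

The conjunction makes the role of the cyclic test explicit: it bounds how long the third condition, an undetected brake fault, can persist.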


It should be mentioned that, for correctly 
maintained/tested systems, the typical failures of 
GLA braking are usually due, for example, to wear 
and/or leakage (for mechanical and/or pneumatic 
systems, respectively). So, no free fall of the GLA 
is expected. 

At the end of this paragraph the authors want to 
emphasize that correct use and maintenance of 
the machine is crucial for the safety of the workers 
with regard to the GLA risk. The defeating of the 
safety systems by the user is particularly dangerous 
in the case of these systems. 


2.1 Safety functions for GLA 


It ought to be said clearly that it is not possible to 
design any SF for the GLA in the de-energized state of 
the machine, which is the main hazardous situation in 
the field. As mentioned before, when energy is missing 
from the system only inherently safe measures can be 
used, if required. 

When the machine is in the de-energized state (often 
called the weak state), or when the energy supply is 
disturbed, only the mechanical braking system 
(e.g. brake, clamping etc.) can work properly. 

The faultless function of this mechanical braking 
system has to be checked at periodic intervals 
(cyclic testing). This diagnostic function is not a 
real SF according to ISO 13849-1 (2008), but it 
guarantees the faultless function in the 
weak state. 

So, all the SF covering the GLA hazard are 
designed for the energized state of the machine. The 
matching safety functions (SF) are defined in 
annex J, Table J.3 of ISO/FDIS 16090-1. For 
information on this annex see Bornemann et al. 
(2015a, 2015b). 

As an example, a “safe” stop function for a 
GLA, when only the upper limbs are under the 
GLA, can be achieved in performance level 
c (PL_r = c, see ISO 13849-1 (2008)) with at least 
two different designs: the system is brought to a 
standstill, for example by the opening of an 
interlocked guard, and is then held in position either 
by a de-energized clamping or by an integrated SF with 
category 2 stopping and a monitored SOS (see IEC 
62061:2005 + A1 (2012)). 

For more examples, such as the prevention of 
unexpected start-up, see Table J.3 of ISO 16090-1 
(2017). 


3 DESIGN OF BRAKING SYSTEM 


The authors considered many different braking 
systems for GLA; see the design solutions V1 to V7 
in column 1 of Table G.1. 

The state of the art for machining centres is: 


• single or redundant systems based on purely 
mechanical brakes. The redundant brake can be 
internal to the electrical motor or external to it. 
The internal redundant brake can be used for 
axis movements if no additional risk is foreseen, 
for example a fault of the electrical motor shaft 
due to fatigue. The mechanical parts of the power 
transmission shall be designed for at least double 
the weight load in order to withstand the static and 
dynamic stresses that occur. Moreover, if the same 
shaft is used for the redundant internal brake, the 
situation of full braking with only the redundant 
(added) system has to be taken into account in the 
mechanical strength calculation if a longer shaft is 
expected in this case (resulting in a greater 
torsional deflection). In order to prevent 
unnecessary wear of the brakes, it is preferable to 
decelerate with the electrical drive controller 
instead of stopping with the mechanical brakes. 

• “hybrid systems” based on mechanical brakes 
in conjunction with counterweights. In this 
case the authors distinguished between purely 
mechanical counterweights, usually realized by 
balancing masses connected to the GLA, and 
hydraulic/pneumatic counterbalancing systems. The 
reliability of a purely mechanical counterweight is 
based on static/fatigue calculations of mechanical 
components, whose failure analysis is quite simple 
to carry out. There is a large body of literature on 
“usable safety”. Safety coefficients can be derived, 
for example, from the design/safety of mechanical 
lifting systems. Conversely, in pneumatic/hydraulic 
counterweight systems, additional non-return valves 
and other components have only recently been 
introduced in the field of “safety” systems. Also, 
failure mode and effects methods such as FMEA, see 
Stamatis, D.H. (2003), are more complex for those 
components than for simple balancing-mass systems. 
So, it was decided that, if the manufacturer cannot 
explicitly justify the fault exclusion for the 
hydraulic/pneumatic system (V6 of Table G.1), a 
hydraulic counterweight cannot be used when the 
worker’s whole body can be exposed to the hazard. 
An example of a proper hydraulic counterweight 
system for GLA where fault exclusion can be managed 
properly will be presented in paragraph 5 
below; 

• redundant system with one brake and one external 
clamping device (see V7 of Table G.1). 
In this case the external clamping system is 
designed to be completely separate from the motor 
braking system. If the designer can justify the 
exclusion of dangerous faults, the cyclic test 
can be omitted. It needs to be mentioned that, 
for external clamping, the cleanliness of every 
part of the clamping device is a key factor for 
proper braking in the coolant environment 
of machine tools. In this case the cleaning 
instructions should be properly defined in the 
instruction manual. 


As one can see, the V1 design is not suitable if 
whole body access is foreseen (situation G1.2). 
Because of the simple tabular format, the reason is 
not explained there: V1 is not suitable because of 
the frequent full body access during automatic mode, 
for which no other measures are possible. 

For repair and maintenance additional measures 
are necessary, e.g. underpinning, mechanical 
locking or hanging, as we will see in the next 
paragraph. 


4 ADDITIONAL DESIGN MEASURES 


As all safety engineers know, inherently safe 
measures (see step 1 of the risk reduction process of 
ISO 12100) are usually not sufficient to reduce the 
risk to a tolerable level. So additional measures 
have to be assigned during the subsequent 
step 2 of the risk reduction process. 

The additional measures have to be selected 
depending on the design of the braking system. In 
Table G.2, the additional measures are described 
for the same situations (Gx.x) as in 
Table G.1. 

The authors defined those measures using 
guiding factors such as: minimize the energy stored 
by gravity, and maximize the stability of the system, 
especially during maintenance. At the least, 
information and measures for preventing misuse 
during manual intervention on the system have to be 
provided by the machine manufacturer. 

Locking the GLA with a mechanical lock 
during maintenance not only ensures that the 
system is not restarted, but also “guides” the end 
user to stop the axis in the correct position (i.e. 
the correct maintenance location chosen by the 
designer of the machine). 

It should not be forgotten at this point that 
parking the axis in the “lowest position” is the 
safest measure of all, because the axis cannot fall 
below this position. 

Finally, according to step 2 of the risk reduction 
process, one or several warning signs shall be 
visibly fixed to the machine, pointing out the hazards 
due to GLA and suspended loads, for example 
“Do not stay underneath the vertical axis!”. 

The same should be stated in the instruction 
manual, together with advice on safe working 
practices. 


4.1 Instruction for use, key role of the end user 


As the reports of the national workers’ insurance 
bodies show, see for example INAIL (2015), many 
injuries are caused by workers defeating and/or 
misusing the machine. 

In the particular case of hazards related to GLA, 
the user has a key role in maintaining the safety of 
the system over time. If the system is not maintained 
correctly, even the safer redundant braking systems 
can be ineffective, especially because of problems 
related to cleanliness and wear. Also, small fluid 
leaks from hydraulic circuits that are not 
removed can reduce the braking torque 
(force) of the system to an ineffective level. 

So, the machine manufacturer has to define how 
normal operation, repair and cleaning shall be carried 
out safely by the machine user. It has to be 
remembered that maintenance, cleaning and repair work 
is carried out at or next to the gravity-loaded axis. 
Usually a safe mechanical support of the 
gravity-loaded axis is easily feasible, and 
consequently it has to be provided for the sake of safety. 

Operating instructions shall describe measures 
to protect the operator from a fall of the GLA. 
These instructions shall also point out the hazards 
due to gravity-loaded axes and suspended loads. 
The required skill level of the operators also needs 
to be considered. If the brake is removed for 
maintenance, a support or a manual mechanical lock 
shall be used, even in systems designed with 
fault exclusion (see V3, V6 and V7 in 
Table G.1). 

For extensive additional measures required 
for the different operative/inoperative modes, see 
Table G.2. 


4.2 Cyclic testing and testing of torque/force 


The need for this diagnostic function is considered 
in the column “Requirement for cyclic test” of 
Table G.1. The maximum tolerable time span 
between tests depends mainly on the frequency of 
the worker’s exposure to the GLA risk, see 
Table G.3. 

As mentioned before, it is very important to 
understand that cyclic testing is not an SF itself and 
that it can be done with the NC. The cyclic testing is 
always performed under safe conditions (e.g. with 
machine doors closed). When fault exclusion is not 
possible, cyclic testing of the brake system is 
performed at a predefined time span. 

The test has to be able to measure the braking 
performance of the system over time: a test 
torque/force is applied to the brake, e.g. the motor 
brake or the clamping device. 

Because the test condition of ISO 13849-1 
(2015) for cat. 2 systems is not applicable to brake 
testing (mainly because the brake cannot be tested 
100 times before it is used again), a proper 
specification is defined in ISO/FDIS 16090-1 (2017). A 
sudden complete failure of a spring-actuated brake 
can be excluded because of the 
basic principles of mechanical brake design in 
ISO 13849-2 (2012). 

For torque/force testing the following 
requirements apply: 


• 1 motor/1 brake (or clamping device) systems. 
The brake or the clamping device is loaded with 
1.3 times the maximum gravitational load for at 
least 1 s by the electric drive. If a permanently 
present counterweight system is also installed, 
the braking device is loaded with 1.3 times the 
maximum gravitational weight minus the 
counterbalanced weight. 

• 1 motor/2 brakes (or clamping devices) systems. 
The braking devices are tested separately, one 
after the other, at 1.0 times the maximum 
gravitational load. 

• 2 motors/2 brakes (or clamping devices) systems, 
mechanically connected. The braking devices 
are tested together at 2.0 times the maximum 
gravitational load, or one after the other at 1.0 
times the maximum gravitational load. 
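
As a rough illustration of the three cases above, the following sketch returns the test load for each test step. The configuration labels and the function name are hypothetical. The counterweight case is ambiguous in the text; the sketch assumes the factor 1.3 is applied to the net load (maximum gravitational load minus counterbalanced load), which should be checked against annex G of the standard before any real use.

```python
def brake_test_loads(config, max_gravity_load, counterbalanced_load=0.0):
    """Return the list of test loads, one entry per test step.

    config (hypothetical labels):
      "1m1b"          1 motor / 1 brake or clamping device
      "1m2b"          1 motor / 2 brakes, tested one after the other
      "2m2b_together" 2 motors / 2 brakes, tested together
      "2m2b_separate" 2 motors / 2 brakes, tested one after the other
    """
    if config == "1m1b":
        # Assumption: 1.3 is applied to the net load after counterbalancing.
        return [1.3 * (max_gravity_load - counterbalanced_load)]
    if config == "1m2b":
        return [1.0 * max_gravity_load, 1.0 * max_gravity_load]
    if config == "2m2b_together":
        return [2.0 * max_gravity_load]
    if config == "2m2b_separate":
        return [1.0 * max_gravity_load, 1.0 * max_gravity_load]
    raise ValueError(f"unknown configuration: {config}")


# Example: 1000 N maximum gravitational load, 400 N counterbalanced.
single = brake_test_loads("1m1b", 1000.0, 400.0)
redundant = brake_test_loads("1m2b", 1000.0)
```

The load units are whatever the caller uses (force or torque); the sketch only encodes the multipliers stated in the text.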


All the requirements for the brake test are defined 
in annex G of ISO 16090-1 (2017). It is important 
to note that, again, the safety of the worker during 
the test is mainly determined by the GLA position: 


• before the test, the GLA must be placed in a 
position where no hazard for the worker 
is foreseen, even if the test fails, 

• during the test, no additional hazard should arise 
from a failure of the test. As an example, the 
designer should consider the possible breaking of 
tools/parts with sharp contours if the GLA falls 
unexpectedly during the test, 

• after the test, the machine must be placed in a 
safe state before any worker can enter the 
machine, and further operation of the machine 
shall only be possible after a new successful 
cyclic test. 

If a fault is detected during the cyclic 
test, the NC shall display a message on the screen 
requesting a brake repair. In the case of guards with 
closed protective doors, a safe position shall not be 
approached until an unlock demand signal has been 
given. Again, further operation of the machine shall 
only be possible after a new successful cyclic test. 
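
The release logic described above (operation only after a successful cyclic test, and a repair message after a failed one) can be sketched as a small bookkeeping class. This is a hypothetical illustration of the rule, not NC code and not a safety function; the class name, method names and message text are assumptions.

```python
class CyclicTestMonitor:
    """Tracks the last brake test and decides whether operation is released."""

    def __init__(self, max_interval_h):
        self.max_interval_h = max_interval_h  # e.g. 8 h or 48 h
        self.last_pass_h = None               # time of last successful test
        self.fault = False                    # True after a failed test

    def record_test(self, now_h, passed):
        """Store a test result; a pass clears an earlier fault."""
        if passed:
            self.last_pass_h = now_h
            self.fault = False
        else:
            self.fault = True

    def operation_allowed(self, now_h):
        """Operation needs a passed test that is not older than the interval."""
        if self.fault or self.last_pass_h is None:
            return False
        return (now_h - self.last_pass_h) <= self.max_interval_h

    def message(self):
        """Text for the operator screen (wording is illustrative)."""
        return "Brake test failed: brake repair required" if self.fault else ""


monitor = CyclicTestMonitor(max_interval_h=8)
monitor.record_test(now_h=0.0, passed=True)
print(monitor.operation_allowed(now_h=5.0))
```

A failed test blocks operation regardless of earlier passes, and only a new successful test releases the machine again.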


5 EXAMPLES OF EXISTING SYSTEMS 


In this paragraph some design examples of GLA 
braking systems are presented. 


5.1 Example of a single brake system—with or 
without fault exclusion 


In Figure 2 the typical V1 system of Table G.1 
is presented. This system can be used only if 
whole-body exposure to the GLA is not possible 
during operative phases. During setting, repair work 
and so on, a short stay under the gravity loaded axis 
is sometimes necessary. In that case a single brake 
with cyclic test is also a suitable design measure. It 
should be remembered that during setting, for example, 
additional measures usually have to be taken, such as 
an operator pendant and/or reduced axis speed (which 
also leads to a shorter braking distance). Owing to 
these additional measures, a greater probability of 
avoiding the accident is expected. 

Figure 2. Single brake system without fault exclusion 
(V1). 

In Figure 3 the typical V2 system of Table G.1 is 
presented. A redundant brake system such as V2 is 
safer than V1, even if no fault exclusion is made. 
As one can see in this figure, all the parts of the 
braking system are redundant, including the clamping 
stick attached to the fixed part of the machine. This 
is because not all possible braking failures can 
simply be excluded (for example, leaking coolant 
lowering the coefficient of friction). 

It is very important that the risk assessment 
for the V3 design is carried out under the condition 
of fault exclusion, see e.g. ISO 13849-2 (2010). 
However, if a single fault in a component cannot 
be excluded, then a partial redundancy of the 
system is necessary. 


5.2 Redundant brakes system 

Figure 3. Redundant brake system without fault exclusion 
(V2). Labelled components: motor measuring system, motor 
(M1), motor brake (Q1), coupling system, clamping stick, 
ball screw nut, GLA mass (m), external brake (Q2), ball 
screw, linear measuring system. 


5.3 Counterweight systems 


As mentioned above, a counterweight system is 
often used in machine tool GLA technology, 
also to improve the dynamic capabilities of heavier 
axes. Counterbalancing a GLA with a proper 
system leaves only small unbalanced masses to be 
moved. So, smaller motors can be used for the axis 
movement and/or greater accelerations can be achieved 
compared with the same GLA without a 
counterbalancing system. 

In Figure 4 a typical single counterweight system 
with external clamping is presented as an 
example of system V5. 

Looking again at Table G.1, some simple conclusions 
can be derived for V4 to V6: 


• if it is not possible to prevent the worker’s 
full body from being under the GLA, a simple 
hydraulic/pneumatic system without fault exclusion 
cannot be used, even if cyclic testing 
is performed, 

• a mechanical counterweight without fault exclusion 
can be used in conjunction with a braking 
motor but, in this case, a cyclic test is required. 
Basic and well-tried safety principles of 
mechanics have to be used. Also, classical safety 
factors have to be applied in the dimensioning of the 
mechanical components, e.g. cable (rope), clamping 
device, bearing system. No safety control system is 
required for a counterweight that is assembled only 
from mechanical parts. In this case the full system 
has two different and large inertias to be supported, 
the system inertia and the counterbalancing inertia. 
The validation has to be done, as usual, according to 
ISO 13849-2 (2012), annex A for mechanical 
systems. 

Figure 4. Single counterweight system based on 
hydraulic/pneumatic counterbalancing (V5). Labelled 
components: base of gravity loaded axis, clamping unit, 
position controlled valve. 


For hydraulic counterweight systems, the most important aspect to be considered is the requirement of fault exclusion for the non-return valves of the counterbalancing system.

Because a pressure loss is considered possible due to different events (breaking of a pipe, leakage of a valve, ...), the piston(s) shall be sufficiently protected from non-closure of the non-return valves.

Typical design measures to avoid dangerous failures (or dangerous interference with the correct operational behaviour of the system) are:


• devices for fluid temperature monitoring (if hydraulic),

• devices for pressure monitoring,

• devices for fluid pollution monitoring (filters up to 1 micron), because complete valve closure can be prevented by pollution particles dispersed in the fluid. For faults related to valve aging, see Schuster, U., (2004).


Figure 5 shows a fully redundant counterweight system, which also has a redundant clamping unit.

If correctly designed, this system could be appropriate for V6.

It has to be said that, during operative phases, the drive systems and the correct pressure of the fluid assure GLA movements with counterbalancing. For safety aspects, the load capacity of the system with closed non-return valves is the crucial point: during de-energization the balancing force is not exerted through the hydraulic system, but directly in the pressurized pistons by the closed valves.

Depending on the applied principle for fault exclusion, some of the doubled components of the system in Figure 5 can be avoided, such as the double reservoir.


Figure 5. Fully redundant counterweight system based on hydraulic/pneumatic (V6). (Legible label: clamping unit 2.)

As examples, the counterbalancing system with fault exclusion for the non-return valves can be:


• a single-piston system capable of supporting, with the non-return valves closed (with no motion/limited motion at low speed), 1.3 times the full weight of the GLA. In this case, even if the motor or external clamping has a fault, the hydraulic system can stop the descending axis during the de-energized state. The valve closure shall be double monitored with cross monitoring. Moreover, it must be possible to make a fault exclusion for all relevant safety-related mechanical/hydraulic components of the system (see below). Even if theoretically possible, at the current state of the art a complete fault exclusion for this system is still difficult to achieve.

• a redundant piston/brake system, each of them capable of supporting, with the non-return valves closed (with no motion/limited motion at low speed), 1.3 times the full weight of the GLA. Each valve closure shall be monitored with safety functions in PLr = c.
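
The 1.3 times full-weight criterion above can be illustrated with a rough static check. This is only a sketch under simplifying assumptions that are not in the paper: the holding capacity of a locked piston is taken as rated pressure times bore area, and friction, dynamic loads and rod-side area are ignored. The function names and the numerical values are hypothetical.

```python
import math

G = 9.80665           # standard gravity in m/s^2
WEIGHT_FACTOR = 1.3   # factor on the full GLA weight quoted in the text


def required_holding_force(mass_kg: float) -> float:
    """Force in N that one locked piston must support: 1.3 x full weight."""
    return WEIGHT_FACTOR * mass_kg * G


def piston_holding_capacity(rated_pressure_pa: float, bore_diameter_m: float) -> float:
    """Static capacity in N, assumed here to be rated pressure x bore area."""
    bore_area = math.pi * bore_diameter_m ** 2 / 4.0
    return rated_pressure_pa * bore_area


def piston_holds_load(mass_kg: float, rated_pressure_pa: float,
                      bore_diameter_m: float) -> bool:
    """True if a single locked piston meets the 1.3 x weight criterion."""
    capacity = piston_holding_capacity(rated_pressure_pa, bore_diameter_m)
    return capacity >= required_holding_force(mass_kg)


if __name__ == "__main__":
    # Hypothetical axis: 800 kg, circuit rated for 10 MPa, two candidate bores.
    for bore in (0.05, 0.03):
        print(bore, piston_holds_load(800.0, 10e6, bore))
```

For the redundant variant, the same check would be applied to each piston separately, since each one is required to carry the load on its own.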


In relation to non-return valves, a fault exclusion for the mechanical spring of the valve shall be made by the valve manufacturer (the spring can be considered a well-tried component also for hydraulics, see Table B.6 of ISO 13849-2 (2012)).

For the validation of the hydraulic system, failure exclusions for safety-related pipes according to ISO 13849-2:2012 Table “C.7” and failure exclusions for hydraulic connecting elements according to Table “C.9” are necessary in any case. Safety valves shall be firmly connected to the cylinders in order to avoid safety problems with connection pipes.


6 CONCLUSIONS 


The risk analysis and the design measures presented in this article were initially developed as part of the standardization work of ISO/TC 39/SC 10/WG4, “Safety of machining centres, milling machines, transfer machines for cold metal materials”. The authors have been involved in this standard and think that this risk assessment can also be adopted for other types of machine tools.

Because the subject is new for standardization purposes, the DGUV (2012) document has been taken as a basis for the discussion. This document needed to be clarified by the authors, also with examples, in view of the publication of a normative annex on GLA in a new standard. The authors believe that the risk analysis and technical solutions presented in the paper provide an objective view of the state of the art and design solutions that can be used to effectively reduce the risk due to GLA.

The authors encourage the drafting of similar regulatory annexes for the forthcoming C-type standards in the field of machine tool safety.
Risk significance assessment with operational events of Korea nuclear power plants
ABSTRACT: The U.S. NRC published a handbook to document methods and guidance that NRC staff should use to achieve more consistent results when performing risk assessments of operational events. Korea Atomic Energy Research Institute (KAERI) launched a research project to develop a regulatory-purpose Level 1 PSA (Probabilistic Safety Assessment) model and framework for use in risk-informed regulation. To this end, we designed a regulatory risk model that reflects regulatory purposes based on real conditions and developed regulatory software called RYAN (Risk Analysis for ASP/SDP of NPP) that enables the regulatory body to perform overall safety assessments such as ASP/SDP (Accident Sequence Precursor/Significance Determination Process). In order to verify and validate the RYAN software, we investigated operational events that occurred in domestic NPPs from databases such as OPIS (Operational Performance Information System for Nuclear Power Plant) and KRDB (Korean Integrated Reliability Database). From those nuclear event databases, we selected some component failures and IEs (initiating events) for the software verification and validation. We performed a sensitivity analysis for the various cases with the selected operational events. Based on the framework, it is expected that the regulatory staff can identify a nuclear power plant or SSC (Structure, System, and Component) whose safety performance has decreased. As further study, we are confirming the applicability of RYAN by performing a sensitivity analysis with more event data. In addition, we will analyze the significance of each case with AIMS-PSA and compare the results from RYAN and AIMS-PSA to ensure the accuracy of the RYAN analysis results.


1 INTRODUCTION 


The U.S. NRC provides the Risk Assessment of Operational Events Handbook (RASP Handbook) to assist NRC staff in achieving more consistent results when performing risk assessments of operational events and licensee performance issues. The methods and processes described in the RASP Handbook can be primarily applied to risk assessments for Phase 3 of the SDP (Significance Determination Process), the ASP Program, and event assessments under the NRC’s Incident Investigation Program. For example, Figure 1 depicts the criteria used to characterize the safety significance of inspection findings for the NRC ROP by assigning a color to the inspection findings (U.S.NRC 2013).
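
The color assignment can be sketched as a simple threshold lookup. This is a minimal illustration, assuming the commonly cited NRC SDP bands (Green, White, Yellow, Red) with decade thresholds of 1E-6, 1E-5 and 1E-4 per year for the change in CDF and one decade lower for LERF. The function name and the treatment of values lying exactly on a threshold are assumptions, not taken from the paper.

```python
# Decade thresholds per year; LERF bands are one decade lower than CDF bands.
THRESHOLDS = {
    "CDF": (1e-6, 1e-5, 1e-4),
    "LERF": (1e-7, 1e-6, 1e-5),
}


def sdp_color(delta_per_year: float, metric: str = "CDF") -> str:
    """Map a risk increase (per year) to an assumed significance color."""
    white, yellow, red = THRESHOLDS[metric]
    if delta_per_year >= red:
        return "Red"
    if delta_per_year >= yellow:
        return "Yellow"
    if delta_per_year >= white:
        return "White"
    return "Green"


if __name__ == "__main__":
    for value in (5e-7, 3e-6, 3e-5, 2e-4):
        print(value, sdp_color(value, "CDF"), sdp_color(value, "LERF"))
```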

Korea Atomic Energy Research Institute (KAERI) launched a research project to develop a regulatory-purpose Level 1 PSA (Probabilistic Safety Assessment) model and framework for use in risk-informed regulation (J. Kim et al, 2017). The purpose of this research is to estimate the risk significance of initiating events and degraded conditions that occurred in domestic NPPs (Nuclear Power Plants) and to characterize the significance of an inspection finding consistent with regulatory response thresholds such as the SDP (Significance Determination Process) of the US NRC.


Figure 1. Criteria to determine the level of safety significance. (Axes: CDF (/yr) with thresholds 1E-4, 1E-5, 1E-6; LERF (/yr) with thresholds 1E-5, 1E-6, 1E-7.)


2 DEVELOPMENT OF A REGULATORY
RISK ASSESSMENT ALGORITHM


As this research aims to estimate the risk significance and to characterize the significance of an inspection finding consistent with regulatory response thresholds, it is essential to develop a regulatory risk assessment logic that satisfies the requirements of risk estimation for regulatory purposes.


Figure 2. Schematic diagram of algorithm for regulatory risk assessment. (Blocks: Event Response Analysis; Failure Identification, with branches Initiating Event, Common Cause Failure, Recovery and Repair, Human Reliability Analysis, Loss of Off-site Power Event, Support System IE; PSA Model Modification, with Related Basic Event Modification and SSC Failure Related CCF Event Modification; Risk Evaluation, with IE Assessment: CCDP (Conditional Core Damage Probability) and Basic Event; Significance Determination.)


Figure 2 represents the schematic diagram of the algorithm for regulatory risk assessment. The analysis of the risk significance consists of the following procedures. First, the type of failure is determined by performing failure identification.

The types of failures include ‘Initiating Event’, ‘Common Cause Failure’, ‘Recovery and Repair’, and ‘Human Reliability Analysis’. For example, the failure may be caused by an initiating event or by a component failure. After the failure analysis is completed, a risk analysis is performed in which the Conditional Core Damage Probability (CCDP) is calculated for an initiating event, or the risk is evaluated at the component level in the case of an equipment failure.

At this time, the risk assessment is performed by modifying the basic event probability (BEP) for the initiating event and the inoperable (out of service) time during which the related component is exposed to the fault. Finally, the significance is determined using the risk assessment results and a regulatory response plan is determined.

Figure 3 shows the procedure for evaluating the risk increase due to reactor shutdown and SSC failure: 1) Calculate the delta CDF (risk increase by degraded SSC) by performing a PSA model modification to change the BEP and CCF values of the equipment for the SSC degraded condition, including the SSC fault or out of service (OOS) due to maintenance. 2) Calculate the delta CDF (risk increase by degraded SSC) by performing a PSA model modification to change the BEP and CCF values of the equipment for the SSC degraded condition, including the SSC fault or OOS due to maintenance.


Figure 3. Process of increased risk evaluation due to SSC failure and reactor trip. (Branches: risk increase due to SSC failure, via PSA model modification, giving $\Delta\mathrm{CDP} = (\mathrm{CDF}_1 - \mathrm{CDF}_0) \times \Delta t$; risk increase by IE (reactor trip), via IE selection and related initiating event modification for an increased IE frequency, giving the CCDP, whose expression is not legible in the source.)

SSC: Structure, System, Component
$\Delta\mathrm{CDP}$: Incremental Core Damage Probability
CCDP: Conditional Core Damage Probability
$\mathrm{CDF}_1$: Modified Core Damage Frequency
$\mathrm{CDF}_0$: Base Core Damage Frequency
$\Delta t$: Exposure Time


Table 1. CCDP calculation for sensitivity analysis.

Case 1: IE only
  1) CCDP calculation
     - By setting the observed IE to 1.0 and all other IEs to 0.0

Case 2: IE & mutually exclusive SSC unavailability (SDP only)
  1) CCDP calculation
  2) $\Delta\mathrm{CDP}$ [$(\mathrm{CDF}_1 - \mathrm{CDF}_0)\,\Delta t$] calculation for the SSC unavailability
     - By setting the basic event associated with the SSC unavailability to TRUE
  3) Total risk calculation, $\Delta\mathrm{CDP}_{total} = \mathrm{CCDP} + (\mathrm{CDF}_1 - \mathrm{CDF}_0)\,\Delta t$

Case 3: IE & mutually inclusive SSC unavailability
  1) CCDP calculation for the combined IE and SSC unavailability
     - By setting the observed IE to 1.0 and all other IEs to 0.0
     - By setting the basic event associated with the SSC unavailability to TRUE
  2) $\Delta\mathrm{CDP}$ calculation for the SSC unavailability only
  3) Choose the highest of the CCDP or $\Delta\mathrm{CDP}$ result

Case 4: SSC unavailability increases the IE frequency (no IE occurred)
  1) Baseline system failure probability estimation by solving an applicable FT
  2) Calculate the system failure probability factor (or ratio)
     - By setting the basic event to TRUE
     - Calculation of system failure probability factor (new value/baseline system failure probability)
  3) The modified initiating event frequency calculation
     - By multiplying the system failure probability factor with the baseline IE frequency
  4) Calculate $\Delta\mathrm{CDP}$ for degraded condition

In order to carry out these procedures, the following cases are evaluated. From those nuclear event databases, we selected some component failures and IEs for the software verification and validation. We performed a sensitivity analysis for the four kinds of cases with the selected operational events. Table 1 shows the CCDP calculation method for each case.


• Case 1: IE only

• Case 2: IE & mutually exclusive SSC Unavailability (SDP only)

• Case 3: IE & Mutually Inclusive SSC Unavailability

• Case 4: SSC Unavailability Increases the IE frequency (No IE Occurred)
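
The arithmetic behind the four cases in Table 1 can be sketched as follows. This is an illustrative outline only, not the RYAN implementation: the CCDP and CDF values would come from quantifying the PSA model, which is outside this sketch, and all function names are hypothetical. The conversion of exposure time from hours to years (8760 h per year) is an assumption made so that a CDF per year can be combined with an exposure time in hours.

```python
HOURS_PER_YEAR = 8760.0  # assumed conversion, CDF is taken to be per year


def delta_cdp(cdf_modified: float, cdf_base: float, exposure_hours: float) -> float:
    """Incremental core damage probability: (CDF1 - CDF0) * exposure time."""
    return (cdf_modified - cdf_base) * exposure_hours / HOURS_PER_YEAR


def case1_ie_only(ccdp: float) -> float:
    """Case 1: the CCDP with the observed IE set to 1.0 is the result."""
    return ccdp


def case2_total(ccdp: float, cdf_modified: float, cdf_base: float,
                exposure_hours: float) -> float:
    """Case 2: IE and mutually exclusive SSC unavailability are added."""
    return ccdp + delta_cdp(cdf_modified, cdf_base, exposure_hours)


def case3_result(ccdp_combined: float, cdf_modified: float, cdf_base: float,
                 exposure_hours: float) -> float:
    """Case 3: take the higher of the combined CCDP and the SSC-only value."""
    return max(ccdp_combined, delta_cdp(cdf_modified, cdf_base, exposure_hours))


def case4_modified_ie_frequency(base_ie_frequency: float,
                                new_system_failure_prob: float,
                                base_system_failure_prob: float) -> float:
    """Case 4: scale the baseline IE frequency by the failure probability ratio."""
    factor = new_system_failure_prob / base_system_failure_prob
    return base_ie_frequency * factor


if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise the formulas.
    print(case2_total(1e-6, 2e-5, 1e-5, 4380.0))
    print(case3_result(1e-6, 2e-5, 1e-5, 4380.0))
    print(case4_modified_ie_frequency(0.1, 1e-2, 1e-3))
```

In Case 4 the modified IE frequency would then be fed back into the model to obtain the modified CDF used in `delta_cdp`.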


3 SDP ASSESSMENT SOFTWARE (RYAN) 


3.1 Development of RYAN Software 


The SDP importance assessment can be performed using full-scale quantification programs such as AIMS-PSA. However, the importance evaluation using AIMS-PSA is possible only for PSA experts who have knowledge of the PSA model. Therefore, it is necessary to develop software that provides non-PSA experts with the ability to roughly estimate the risk increase due to incidents/accidents. To solve this problem, this study developed RYAN, which is an SDP evaluation program for regulatory verification.

RYAN provides a simple sensitivity analysis interface that can easily assess risk changes for incidents/accidents resulting from plant shutdowns and equipment failures. For a detailed evaluation, a PSA expert can use AIMS-PSA. By using this, the regulatory body can establish regulatory standards for changes in the risk of the power plant. We called the software for the regulatory PSA model RYAN (Risk Analysis for ASP/SDP of NPP); it is intended for the significance assessment of incidents/accidents in domestic NPPs (Nuclear Power Plants).

The assessment of the importance of incidents/accidents is based on the evaluation of the increase in risk due to them, and is divided into the following: risk increase by an IE (Initiating Event) assessment and risk increase by a conditional assessment due to a damaged SSC. Figure 4 shows the risk assessment concept for RYAN (S. Y. Choi et al, 2017) (S. H. Han, 2017). RYAN is a user-friendly interface tool developed for the Windows environment using the PSA model provided by AIMS-PSA (Advanced Information Management System for PSA), which is software for integrating various types of PSAs, including typical external and shutdown PSAs (S. H. Han, 2016).
Figure 4. Risk assessment with RYAN. (Panels: risk increase by IE (trip) assessment, giving the CCDP; risk increase by conditional assessment due to damaged SSC, giving $\Delta\mathrm{CDP} = (\mathrm{CDF}_1 - \mathrm{CDF}_0) \cdot \Delta t$. Legend: $\mathrm{CDF}_1$: reassessed CDF (Core Damage Frequency) by an initiating event; $\mathrm{CDF}_0$: base CDF; CCDP: Conditional Core Damage Probability.)


Figure 5. An example of RYAN analysis interface. (Screenshot listing selectable initiating events, including loss of a CCW train, loss of DC bus, loss of 4.16 kV bus, events at SG #1 and SG #2, medium LOCA with RCS hot leg break, general transient, and loss of condenser vacuum and heat sink; the remaining text is not legible in the source.)


Figure 5 shows an example of the RYAN analysis interface used to quantify the risk increase due to an accident. To evaluate the increased risk due to an accident, the user should input the damaged SSC information, and the IE information when the damaged SSC affects an IE. RYAN then automatically changes the PSA model to quantify the increased risk.


3.2 Example of SDP Calculation with RYAN 


In order to verify and validate the RYAN software, we investigated operational events that occurred in domestic NPPs from databases such as OPIS (Operational Performance Information System for Nuclear Power Plant) and KRDB (Korean Integrated Reliability Database) (OPIS, 2017) (S.Y.Choi et al, 2005). Figure 6 shows an example of SDP analysis using RYAN.

Assumptions for the analysis are as follows.


— IE: General Transients 

— Failed SSC component: Fail to Start of AF WS- 
PP02 A (T/D Pump) 

— Exposure Time: 720 hours 


As shown in the figure, the analysis is performed in the following order.


1. Select the analysis type as the initiating event + SSC analysis.

2. Select the initiating event as General Transient.

3. Change the BE of the failed SSC to Fail.

4. Set the OOS Time to 720 hr.


[Figure 6: RYAN screenshot for this example; contents not legible in the source.]


Table 2. Result of sample calculation.

Option        Component                 IE CDP     SSC CDP    Total
Event Level   Comp = Fail, Value = 1    2.52E-06   1.39E-06   3.91E-06
Comp Level    Comp = Fail, Value = 11   4.05E-06   3.15E-06   7.20E-06


Table 2 shows the results of the SDP analysis in this example and indicates that the SDP analysis using RYAN has been performed properly.
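
The Total column of Table 2 appears to be the sum of the IE CDP and SSC CDP columns, which would be in line with the total risk calculation of Case 2 in Table 1. The following sketch only re-checks that arithmetic on the reported values; it does not reproduce the RYAN quantification, and the names used are hypothetical.

```python
import math

# Values copied from Table 2: (IE CDP, SSC CDP, reported Total).
TABLE_2 = {
    "Event Level": (2.52e-06, 1.39e-06, 3.91e-06),
    "Comp Level": (4.05e-06, 3.15e-06, 7.20e-06),
}


def check_totals(rows: dict) -> dict:
    """Return, per option, whether IE CDP + SSC CDP matches the reported total."""
    results = {}
    for option, (ie_cdp, ssc_cdp, reported_total) in rows.items():
        computed = ie_cdp + ssc_cdp
        # Tolerance reflects the three significant figures printed in the table.
        results[option] = math.isclose(computed, reported_total, rel_tol=1e-3)
    return results


if __name__ == "__main__":
    for option, ok in check_totals(TABLE_2).items():
        print(option, "consistent" if ok else "inconsistent")
```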




4 CONCLUSION 


In this research, KAERI developed a regulatory-purpose Level 1 PSA model and framework for use in risk-informed regulation. The purpose of this research is to estimate the risk significance of initiating events and degraded conditions that occurred in Korean NPPs and to characterize the significance of an inspection finding consistent with regulatory response thresholds. To this end, we designed a regulatory risk model algorithm that reflects regulatory purposes based on real conditions and developed RYAN (Risk Analysis for ASP/SDP of NPP), which enables the regulatory body to perform an overall safety assessment such as the ASP/SDP of the U.S. NRC. To verify the applicability of RYAN, we investigated operational events that occurred in domestic NPPs from databases such as OPIS (Operational Performance Information System for Nuclear Power Plant) and KRDB (Korean Integrated Reliability Database), and performed a sensitivity analysis for the selected operational events. Based on the framework, it is expected that the regulatory staff can identify a nuclear power plant or SSC (Structure, System, and Component) whose safety performance has decreased. As further study, we are confirming the applicability of RYAN by performing a sensitivity analysis with more event data. In addition, we are going to analyze the significance of each case with AIMS-PSA and compare the results with RYAN to ensure the accuracy of the RYAN analysis results.
Risk dimensions of fish farming operations and conflicting objectives

Institute of Marine Technology, NTNU, Trondheim, Norway


ABSTRACT: Operations at sea-based fish farms can be challenging, and several risk dimensions are 
of concern during operations. Sea lice represent a challenge for the fish farmers who are required to 
perform delousing when the infestation levels rise above a set value. Delousing operations are frequently
performed and require the use of heavy machinery operated from service vessels moored to the net-cages. 
Operators are exposed to hazards that may cause severe injuries and fatalities. Escape of salmon, which is 
a substantial environmental risk, has occurred in relation to delousing operations. Chemicals used during 
the operations may cause negative environmental consequences. Other safety related issues are the fish 
health and welfare. In this paper, a delousing operation on a fish farm is discussed with respect to different 
dimensions of risk, and potential conflicting objectives are discussed. 


1 INTRODUCTION 


The operators at fish farm localities have to navigate and make decisions in an environment where their own safety is weighed against other factors, such as fish welfare and prevention of escape of salmon. The workplace is exposed to forces from the environment, such as waves, current and wind, and maintaining focus on safety is crucial in all operations. Authorities with different regulatory responsibilities require risk assessments covering prevention of fish escape, environmental impact and fish welfare (Holmen et al., 2017). Identification of hazards and risk assessments are measures implemented to avoid accidents. Holistic and systematic risk management is a prerequisite for safe operations; however, the fragmented regulation might work against this (Utne et al., 2017).

Projects related to the evaluation of risks in fish farms have identified critical operations, such as lice counting, well boat operations and operations involving cranes (Sandberg et al., 2012). Technology, the physical working environment, work-load, work pressure and safety management are found to be among the factors influencing escape events (Thorvaldsen et al., 2015). External pressures on operations, such as time, costs and weather conditions, also put constraints on operations.

Lice infestations have become a major sustainability challenge in Norwegian fish farming, and have also become the main limiting factor for future growth in the industry (Svasand et al., 2017, Norwegian Ministry of Trade Industry and Fisheries, 2017). The fish farming industry in Norway spends up to NOK 4.5 billion on anti-lice measures (DN, 2017). Treatments to remove lice are decreed in regulations (Norwegian Ministry of Trade Industry and Fisheries, 2012), and delousing has become an operation frequently performed in fish farms. Delousing is an operation where several factors identified as critical or risk-influencing are present, see Table 1. In this paper, the first three risk dimensions are presented and compared with the purpose of identifying examples of potential conflicting objectives in the fish farming operation delousing. Conflicting objectives is an accident perspective, and by highlighting the consequences of the different pressures the human operators are exposed to in aquaculture, risk-reducing measures can be developed.


2 RISK DIMENSIONS IN A CONFLICTING
OBJECTIVES’ ACCIDENT PERSPECTIVE


The concept of conflicting objectives is described 
by Rasmussen’s migration model (Rasmussen, 
1997a). It explains how accidents may happen 
when decisions in an organization are made based 
on different objectives and constraints. One exam- 
ple is the decisions made by management to mini- 
mize costs, while operators may focus on making 
the operations as efficient as possible. These some- 
times competing, or conflicting, objectives may 
eventually lead to a migration towards the bound- 
ary of a functionally acceptable performance. As 
the decisions are made locally at separate levels, the
side effects of the decisions may eventually set the 
stage for an accident (Rasmussen, 1997b). The 
operators can be seen to be at the sharp-end, close 
to the hazard sources, while management can be 
seen to be at the blunt end, removed from the haz- 
ards (Rosness, 2001, Rosness et al., 2010a). 
Table 1. Risk dimensions present in the fish farming operation delousing. Adapted from (Yang et al., 2017).

Risk dimension: Risk to personnel
  General description: The Norwegian fish farming industry has one of the highest fatality and accident rates when compared to similar industries (Aasjord, 2010). Accident statistics show that the fish farm employees are among the most exposed workers with regard to injuries and fatalities (Holen et al., 2017a).
  Relation to delousing operations: Frequent use of safety critical equipment during delousing operations.

Risk dimension: Risk to environment
  General description: The escape of salmon represents a hazard for the stock of wild salmon living in the rivers and fjords of Norway (Svasand et al., 2017). The use of chemicals in delousing operations and on the net-cage to avoid fouling may affect the environment around the fish farm. Waste that accumulates under the fish farms due to fodder spill and organic matter may have benthic impacts and may affect species living around the fish farm (Holmer, 2010).
  Relation to delousing operations: Risk of net-tear is present during delousing operations. Medical treatment chemicals are released after operation.

Risk dimension: Risk to fish welfare
  General description: Fish welfare in fish farms is under pressure due to sea lice and diseases (Hjeltnes et al., 2017).
  Relation to delousing operations: Delousing operations require handling of the fish and may cause harm. The chemicals used in delousing may cause discomfort and wounds.

Risk dimension: Food safety
  General description: Food safety is a general concern due to the accumulation of toxins in the fish meat.
  Relation to delousing operations: Chemicals used for treatment of fish are not seen as critical for food safety (Norwegian Veterinary Institute, 2016).

Risk dimension: Risk to material assets
  General description: Risk to material assets (e.g., net-cages, service vessels, workboats etc.) in fish farm operations may have severe economic consequences, mainly to the fish farm company. This risk dimension has not received much attention in the literature (Xue, Yang et al. 2017).
  Relation to delousing operations: Structural damage to the net during delousing may lead to escape of salmon, which is a risk dimension already included.


Safety is an emergent property of a system, and risk should be considered in a systems perspective where all factors that can influence safety are analyzed. Control can be achieved by increasing the safety margin, increasing awareness of the boundary, or making the boundaries explicit. Making the limits of acceptable risk visible, by establishing criteria for critical decisions or other ways of establishing clear lines as to when the safety margin is small, should counter challenges with conflicting objectives. Managers should also communicate openly about the existence of conflicts of interest (Rosness et al., 2010b).

Fish farming is an industry dealing with production of livestock, thus requiring knowledge about biology, welfare, and diseases. In addition, operations are increasingly resource demanding, and large production equipment requires special expertise for safe handling. The fish farms are mainly placed in the fjords, where the operations may impact the fauna and wild animals living around the fish farm. These are all risk dimensions of concern for the operators at the sharp-end, and in some situations trade-offs between the risk dimensions must be made. In this paper these are seen as conflicting objectives. An example of a situation where operators are faced with having to choose between risk objectives is provided by Storkersen (2012). The operators have to choose between fixing a net cage damage immediately after discovery, or using valuable time to fetch the appropriate safety equipment to do the repair according to safety procedures. In the case presented, the operators do not hesitate to improvise and make the repair without the required safety equipment. Thus, the risk of escape is reduced, while the operators face a greater personal risk by giving lower priority to their own safety (Storkersen, 2012).

The management at the blunt end also makes choices that affect the risk in operations, by allocating resources, like personnel, equipment and timeslots, to operations. Management decisions influenced one of the biggest single escape events in Norway, which happened in relation to a delousing operation in 2011 where 176 000 salmon escaped (Soknes, 2012). The delousing operation had been ongoing for two continuous days in order to finish the operation as quickly and efficiently as possible. The company later claimed that the responsible operator was disloyal to the company when breaching procedures to get the job done. However, the court ruled that the employee had loyally tried to fulfill the management’s expectations, that there had been great time pressure on the employees, and that the company had shown no willingness to pay for extra personnel (Soknes, 2012).

Time pressure is a risk-influencing factor mentioned by personnel at fish farms in relation to escape events, fish welfare and personnel safety (Thorvaldsen et al., 2015, Hjeltnes et al., 2017, Fenstad et al., 2009). Time pressure is not only created by management's allocation of resources; unforeseen weather changes also put this constraint on operations. The regulation of the fish farming industry is characterized by being fragmented, and the authorities have developed separate regulations to protect the different values (Holmen et al., 2017). Fish farmers state that the focus in planning for safety in operations will be towards the area where they experience pressure from the authorities (Skjærvik, 2017). In line with Rasmussen’s framework of distanced decision-making, some unforeseen consequences might be the result. For example, the strict regulations on delousing according to infestation levels may lead to both unsafe situations concerning escape and reduced welfare for the fish.


3 THE DELOUSING OPERATION 


3.1 Anti-lice measures 


The sea louse, or salmon louse, is a parasite which only has salmonids as hosts. The last five stages of the life cycle of the sea louse are parasitic to the salmon, when it feeds on the mucus, skin and blood. Sea lice may cause fish welfare problems for both farmed and wild salmonids, and may ultimately cause fish death. Sea lice have become a major issue in the fish farming industry, where large outbreaks of the parasite are made possible by the high density of salmon in the fish farms along the coast. Sea lice are sensitive to temperature, and infestation levels change according to the season; the lowest levels are registered in the spring and the levels increase during summer and fall (Svasand et al., 2017). This has led to frequent delousing in periods of the year.

As the sea lice mainly live in the upper layers of the sea, some preventive measures against sea lice have been developed, e.g., a skirt placed around the net cages with a depth of up to 3 meters, preventing the sea lice from entering the area where the salmon are (Lien et al., 2014). The skirts around the net are the most used preventive measure (Svasand et al., 2017). Another measure is a “snorkel” solution, where the fish are held in a semi-enclosed net cage with access to the water surface and air only through a “snorkel” with a diameter of around 6 meters (Stien et al., 2016). In 2017 over 27 million wrasse were captured, mainly for delousing purposes in fish farming (Directorate of Fisheries, 2017).

The main mode of combating sea lice has been 
medicinal products. These are either added to the 
fodder, or the salmon are exposed to the 
medicament in bath treatments. In bath 
treatments, the salmon are exposed to medicinal 
products added to the seawater after the salmon are 
gathered in an enclosed area, either in a well vessel 
or using a tarpaulin around the net cage. Bath 
treatments require major resources and are among 
the most demanding operations carried out 
in fish farming. 

In addition to bath treatments, some new 
technologies have been developed to remove sea 
lice from farmed salmon. These methods have been 
developed mainly because the salmon lice have 
become resistant to the medicaments used. The new 
treatments use mechanical aids, such as water jets, 
higher temperature and brushes. These new methods 
are seen as the main cause of the large drop in 
prescribed anti-lice medicaments from 2015 to 2016. 
There is a concern that the new methods might be 
a risk to fish welfare, and that they were not 
sufficiently tested for welfare before they were 
put to use (Hjeltnes et al., 2017). In addition, signs 
of possible resistance to these new anti-lice 
treatments have been discovered. 


3.2 Steps of a bath treatment operation 


Figure 1 shows the steps of a fish farm operation 
using tarpaulin. This approach is representative of 
all methods of delousing; only the step “Perform 
delousing” differs according to the method and 
technologies used. 


Figure 1. Steps of a delousing operation using tarpaulin. 
[Diagram labels: Resources; Book sub-contractor; First day 
of operation; Lift net; Install tarpaulin; Add chemical; 
Lower net; Inspect net cage; Debrief.] 

• Planning 
Delousing must be carried out when the critical 
level of lice is reached. The operation is planned 
by the operations manager on the fish farm, 
sometimes in cooperation with higher-level onshore 
area managers. The operations are in most cases 
performed by or in cooperation with service providers 
who have both the required equipment and expertise. 
• Safe Job Analysis (SJA) 
Most fish farmers conduct a preparation meeting 
the day the operation starts. An integral part of 
this meeting is to perform an SJA, where hazards in 
the operation are identified and responsibilities for 
tasks during the operation are assigned. 
• Prepare net cage for delousing (Lift net) 
It is necessary to make the volume of the net cage 
smaller so that the fish are easily accessible in the 
upper layers of the sea. Lifting is demanding and 
time-consuming, and requires the use of cranes and 
winches from work vessels. If a well vessel or a type 
of barge is used in the treatment, “crowding” of 
the fish is also necessary. This is done by using an 
extra net to push the fish together in an even more 
confined area. 
• Perform delousing 
Bath treatments are performed either with tarpaulin 
in the net cage or in a well vessel. New types of 
mechanical treatments are performed on specialized 
barges. 
• Prepare net cage for normal operation (Lower 
net) 
After the treatment, the fish are put back in the net 
or the tarpaulin is removed, depending on the type 
of treatment. Then the net needs to be lowered to 
its normal position. This is done in the reverse 
manner to the lifting of the net. Careful lowering of 
the net and ropes is necessary to avoid any damage. 
• Finish operation 
After the operation is finished, an underwater 
inspection should be made by either divers or 
an ROV. Debrief meetings will ensure that any 
adverse events during the operation are discussed 
and subsequent changes implemented in safety 
management systems. 


4 RISK DIMENSIONS OF DELOUSING 
OPERATIONS 


In this section, the first three dimensions of risk 
in Table 1 (Yang et al. 2017) are presented and 
discussed for the delousing operations. 


4.1 Risk to personnel 


Delousing operations are demanding operations 
where the operators on fish farms are exposed to 
several hazards. Most of the delousing techniques 
require the use of cranes when preparing for the 
operation. In accident statistics from the fish 
farming industry, the use of cranes is found to 
contribute to several of the struck-by-object and 
entanglement injuries (Holen et al., 2017a). Work 
operations are also an increasing contributor to 
fatalities in the fish farming industry (Holen et al., 
2017b). As service vessels are an important part of 
the operation, man-overboard accidents are also an 
important risk to consider. In addition, the chemicals 
used in delousing operations may present a hazard to 
the operators. In some delousing operations, extra 
oxygen is used, and explosions may happen. 


4.2 Risk to the environment 


In general, two types of hazards to the environment 
should be assessed in relation to delousing 
operations: (i) the effects of escaped farmed 
salmon, and (ii) the release of treatment chemicals, 
which may have an effect on organisms around the 
fish farms. 


4.2.1 Risk of escape 

The main causes of escape from fish farms are 
structural failures, including net tearing. Net 
tearing can happen during operations and from 
abrasion by related components (Jensen, Dempster 
et al. 2010). Abrasion from the sinker tube chain 
is the most common cause of net tearing, while 
handling of net weights, including the sinker tube, 
is the second largest cause (Fore and Thorvaldsen, 
2017). Handling of net weights must be done in 
all delousing operations, as part of the preparation 
before the operation, and after the operation has 
been completed. Organizational factors influencing 
escape events are described in Thorvaldsen, 
Holmen et al. (2015). 

The consequences of escaped salmon are related 
to introgression of genes and the spreading of 
diseases, both of which may affect the wild salmon. 
Introgression of farmed salmon genes is unwanted 
because of the genetic differences between farmed 
salmon and wild salmon (Taranger et al., 2015). In 
the long term, introgression may lead to 
“changes in life-history traits, reduced population 
productivity and decreased resilience to future 
changes” (Glover et al., 2017). 


4.2.2 Risk of treatment chemicals on surrounding 
environment 
The medicinal chemicals used for bath treatment of 
sea lice may affect other animals, especially 
crustaceans, as sea lice belong to this group 
of animals. The chemicals used for bath treatments 
are azamethiphos, deltamethrin, cypermethrin and 
hydrogen peroxide; the first three chemicals are 
mainly used in tarpaulin treatments, while the last 
is used in well vessels. When the bath treatment 
is made with tarpaulin, the chemicals are released 
directly into the sea at the fish farm; when well 
boats are used, the chemicals can be transported 
away (Svasand et al., 2017). The different chemicals 
have different levels of toxicity. Deltamethrin 
has been shown to be very toxic for some 
non-target organisms, such as plankton, and may 
also be bound up in seaweeds. Hydrogen peroxide 
has the least effect on organisms in the surroundings 
of the fish farm (Svasand et al., 2017). In a 
five-year study of the effects of sea lice medicine on 
the receiving environment in Scottish sea lochs, no 
long-term effects could be found (Scottish 
Association for Marine Science, 2005). Chemical release 
in the case of a vessel capsizing may also be a risk. 


4.3 Risk to fish welfare 


Fish welfare is affected by the salmon louse itself 
and by the anti-lice treatments carried out to remove 
the lice. Normally the damage to the farmed 
salmon is not high because treatment is required 
before a critical number of lice is reached (Svasand 
et al., 2017; Norwegian Ministry of Trade, Industry 
and Fisheries, 2012). However, substantial injuries 
have been reported in some areas where it has not 
been possible to control the salmon lice infection 
pressure (Hjeltnes et al., 2017). Larger wounds 
caused by sea lice may lead to dehydration, 
electrolyte imbalance and increasingly affect the 
physiological functions of the fish (Svasand et al., 
2017). 

Anti-lice treatment represents a significant 
welfare challenge to the fish (Hjeltnes et al., 
2017). In particular, handling and crowding of fish, 
which are done in connection with the treatment, 
affect the welfare of the fish. Stress and fear 
levels increase in the fish during these operations, 
and if the fish are weak, heart failure may 
occur. Open wounds, scale and mucus loss, and 
stress are caused by handling and might 
also increase the risk of other infections in the 
fish (Svasand et al., 2017). The chemicals used in 
treatment may be overdosed and give toxic effects. 
Observed fish behavior during delousing operations 
may indicate that the fish experience the 
treatment chemicals as uncomfortable (Oppedal 
et al., 2011). 

Bath treatments have been the primary method 
of delousing, but new methods and technologies, 
which do not use chemicals, are increasingly in 
use, mainly due to resistance to chemicals in the 
salmon louse. Mechanical delousing using heated 
water, water jets or a combination of water jets 
and brushes is reported to give welfare issues 
related to reduced appetite, eye injuries, reduced 
mucus production and poor skin health, amongst 
others. These new methods of anti-lice treatment 
are of great concern to fish welfare as they are 
not sufficiently tested for effectiveness and welfare 
(Hjeltnes et al., 2017). Heated water treatment has 
caused mass fatalities of salmon (Heraldscotland, 
2016). 


4.4 Conflicting objectives of the risk dimensions 


Some examples of how each risk dimension may 
influence the others during the delousing operation 
are presented below. In particular, risk to 
personnel safety, risk of escape, and risk to fish health 
may come into conflict. All these dimensions are also 
subject to the constraints introduced by management 
decisions such as the allocation of resources, e.g., 
personnel, equipment and timeslots, to operations. 


4.4.1 Prioritizing personnel safety 

Personnel safety has been given increasing focus 
in the fish farming industry. Major hazards for 
personnel are especially present during operations 
using heavy machinery. For delousing operations, 
this type of machinery is used in preparation of 
the delousing and after the operation, when the net 
is lifted and lowered. Handling of the net also 
involves hazards with regard to tearing of the net 
and subsequent escape. In stressful situations, due 
to limited attention span, there could be a need to 
focus on one of the risk factors. Situations where 
focusing on personnel safety may cause higher risk 
with regard to escape may also occur after operations, 
when inspections of the net cages should be done 
to ensure that the nets have been correctly lowered. 
Inspections by divers or cameras must be done so 
that potential holes caused during operation are 
discovered. In cases where there may be risk of 
injuries because of, e.g., weather conditions, 
personnel safety must be prioritized over prevention 
of escape. 

Stopping the operation too soon or too late in 
cases of risk to personnel may cause the delousing 
treatment not to work adequately. The operation 
must then be repeated later, which represents 
an extra strain on the welfare of the fish, which 
must undergo handling again within a short time. If 
the operation is not completed, the net may not be 
lowered between operations, which also means that 
the fish must be kept “crowded”. 


4.4.2 Prioritizing fish welfare 

Fish welfare has traditionally been given high 
priority. Fish welfare is important to management as 
it affects earnings. Cases where fish welfare may 
influence personnel safety or prevention of escape 
during operations may occur if delousing with 
tarpaulin must be abruptly stopped due to, e.g., 
too low oxygen levels in the net cage. Stressful 
situations and a focus on fish welfare may lead 
to hazardous situations for personnel. The choice 
of delousing method may also have an influence 
on fish welfare. Bath treatments with tarpaulin 
include somewhat more hazardous tasks using 
cranes than bath treatments in well vessels, 
whereas well vessels may involve more welfare 
issues due to “crowding” and pumping of the fish 
in and out of the vessel. 


4.4.3 Prioritizing prevention of escape 
Prevention of escape is a major focus of the fish 
farming industry. This focus may also have been at 
the expense of personnel safety in procedures and 
risk assessments. An inadequate focus on hazards 
that may cause personnel injuries when planning 
operations and in the safe job analysis performed 
before delousing may contribute to accidents. 

If a hole in the net is discovered, fish may be 
kept crowded longer than normal to keep the 
fish away from the hole in the net. This will be at 
the expense of fish welfare. 


4.4.4 Prioritizing limited consequences for 
environment 

When using well vessels for delousing operations, 
chemicals used during operations may be transported 
out of the fjords into designated “drop 
zones”. The choice of using well vessels may have 
an influence on fish welfare in operations. Some of 
the new delousing methods do not use chemicals 
and, in this regard, do not represent a challenge to 
the environment. Emissions to the environment of 
the chemicals used in delousing operations are an 
integral part of the operation, especially when 
using the tarpaulin. The consequences of the 
release of chemicals into the fjords are a 
controversial issue between the stakeholders. 


5 DISCUSSION 


Several risk issues are present during fish farm 
operations, and delousing is no exception. Accidents, 
such as escape, serious personal injuries and 
major fish deaths, have happened in relation to the 
activities in delousing operations. In the accident 
perspective of conflicting objectives, one of the 
measures towards avoiding accidents is to make 
visible the limits of acceptable performance. It is 
important to assess how prioritizing one risk issue 
may affect other risk aspects and dimensions. During 
delousing operations, personnel safety, 
fish welfare and fish escape are all concerns that 
require attention. It is not possible to eliminate the 
conflicting objectives as they, with the methods 
available today for delousing, are inherent in the 
operation. However, one means of avoiding accidents 
due to conflicting objectives is to highlight the 
conflicts themselves and the possible consequences of 
giving priority to one aspect in operations. Visualizing 
the different risk dimensions, which may give rise 
to hazardous situations, gives an opportunity for 
operators and management to gain awareness of 
possible hazards in the operation. 

Possible risk-mitigating actions could be to 
assign some operators the main responsibility for 
monitoring whether one risk issue is given an 
unbalanced focus. The different steps of the operation 
may also be more hazardous with regard to one type of 
risk. For example, the beginning of preparation of 
the net cage is hazardous with regard to tearing of 
the net, while the last part of preparation may be more 
hazardous with regard to personnel injuries because of 
excess chains suspended from the crane. In addition, 
correct lowering of the net after operation is a 
critical part of the operation concerning escape events. 

Some risk-avoiding measures might 
also have mutually positive effects. One example of 
this is not starting delousing treatment in harsh 
weather, as this might present hazards to both 
personnel and fish welfare (Storkersen, 2012, Fenstad 
et al., 2009). 

In almost all situations during operations where 
one risk issue might be prioritized over a different 
one, management decisions and conditions such as 
time pressure, costs and weather may influence the 
decisions made during an operation. Stress due to 
time limits and limited resources will affect how 
choices are made, and procedures might be violated 
if that is what seems most rational in the moment. 
When evaluating how risk-mitigating measures 
might work, one should be aware of the mechanisms 
of the socio-technical system, where different 
actors will make decisions according to their 
respective constraints and options, and that some 
interpretation of rules will be made at lower levels 
of the organization (Rasmussen, 1997b). Within 
the aquaculture company, the operators are at 
the sharp end, in close proximity to the hazard, 
and they make decisions within different frames 
than the land-based organization, which has a higher 
level of authority and a greater distance to the hazard. 
Without the possibility of always seeing the whole 
picture, decisions at both ends are based on “local 
rationality”. Often, it is explicitly said that safety 
should be prioritized, but tacitly opposite messages 
are sent through planning, follow-up and resource 
allocation. Measures should be implemented in 
and continuously monitored by the management 
systems to ensure that safety is not compromised. 

In this paper, the immediate risk issues that arise 
during an operation due to conflicting objectives 
have been in focus. In a broader perspective, other 
risk issues would also be relevant to consider with 
regard to conflicting objectives, such as resistance 
of sea lice to the different treatments and the 
influence of regulations on the different risk issues. 
Decisions made at a higher level may have more 
impact on the risk picture than the decisions made 
by operators during operations. The regulations 
that specify the limits for the acceptable lice level 
may challenge fish welfare as they lead to frequent 
delousing (Hjeltnes et al., 2017). In some cases a 
given level of lice might be more acceptable to 
welfare than performing repeated treatments, which 
cause strain and stress to the fish. Repeated 
delousing operations will also increase the possibility 
of escape due to handling of the net. 


6 CONCLUSION 


Sea lice are a major challenge to the fish farming 
industry and delousing is decreed by the 
authorities. The delousing operation involves risk 
dimensions with regard to personnel safety, the 
environment and fish welfare, all with potentially 
severe consequences. Conflicting objectives may 
arise during the operation. Prioritizing one risk 
dimension at the expense of others may lead to 
situations such as: (i) focusing on personnel safety 
may hinder the discovery or repair of holes in the 
net, or (ii) pressure on operators to finish the 
operation for the sake of fish welfare may cause 
hazardous situations for personnel. Higher-level 
management decisions also influence the risk during 
operations through, e.g., timely allocation of 
resources. Unforeseen accidents may happen if 
conflicting objectives are not visible to management 
and operators, and they should be openly discussed 
to ensure safety in operations. 


The future of driver training and driver instructor education in 
Norway with increasing ADAS technology in cars 


G.B. Setren, J.P. Wigum, R. Robertsen, P. Bogfjellmo & E. Suzen 
Road Traffic Section, Business School, Nord University, Stjørdal, Norway 


ABSTRACT: On average, more than two people are killed or severely injured every day in Norway in 
road traffic. Hence, measures that help reduce this number are welcome, such as “Advanced 
Driver-Assist System” (ADAS) technology. However, increasing technology in cars might require new 
driving skills compared to those taught today, and the transition to more and new technology could 
potentially increase the accident rate. In the safety industry, it is well known that training for new and 
more automated technology is important. This raises a question: How does the transition to new, more 
complex and more automated technology affect driver training and the education of driver instructors? 
At the present time, there are no clear answers to this question. However, it seems that there is a need for a 
discussion and potentially a redefinition of which driver skills should be required, and of how to implement 
these skills. This is what we attempt to discuss in this paper. 


1 INTRODUCTION 


In 2016, there were 135 road deaths in Norway. The 
number for 2015 was 117, and for 2014 it was 147. 
However, if those severely injured in accidents are 
included, the number was 791 in 
2016, 810 in 2015, and 821 in 2014 (SSB 2018a). 
This means that, on average, more than two people 
are killed or severely injured every day due to traffic 
accidents in Norway. Compared to any other 
high-risk sector, the number is high, but the trend over 
the past decades is that the number is decreasing. 
The Norwegian government bases the National 
Transport Plan (NTP) on a vision of zero. This 
vision means zero dead and zero severely injured in 
road traffic. The objective for this period of NTP 
(2014-2023) is to halve the number of road deaths 
and severe injuries, and that, in 2020, there should 
be no more than 775 killed and severely injured in 
road traffic in Norway, that is about two people per 
day on average. Strategies to achieve this objective 
in Norway are, for instance, to design safer roads, 
to encourage safer behaviour from road users, 
and to encourage the development of technology 
to produce safer vehicles (NTP 2014-2023). Norway 
is not alone in such objectives, as this is also 
in accordance with the EU objective, which is to 
halve the number of people killed in road traffic 
during the period 2010-2020 (European Commission 
2015a). In order to achieve this, the EU has 
developed seven strategies: (1) improve education 
and training of road users, (2) increase enforcement 
of road rules, (3) safer road infrastructure, 
(4) safer vehicles, (5) promote the use of modern 
technology to increase road safety, (6) improve 
emergency and post-injury services, and (7) protect 
vulnerable road users (European Commission 
2010). Technological innovation is one of the seven 
strategies, in addition to improving education and 
training of road users. However, from what we know 
from other industries regarding humans interacting 
with increased automation (e.g., Lee 2006; 
Setren and Laumann 2015), it is not certain 
that the numbers of killed and severely injured will 
continue to decrease with an increase in technological 
solutions. Factors such as lack of standardisation 
in technological solutions, mode confusion, 
lack of situational awareness, overreliance and 
complacency could all be reasons why the 
interaction between humans and technology may 
not go according to plan (Young 
and Stanton 2007). Another reason is a lack of 
focus on training for automation (Setren and 
Laumann 2015). Regarding the technological 
development in cars, new technology is implemented at a 
fast pace, but little attention is given to training 
new and existing drivers in using the technology. 
Research shows, for instance, that only 24% 
of buyers were given instructions by the car 
dealer when cars with an “Advanced Driver-Assist 
System” (ADAS) were bought in The Netherlands 
(Harms and Dekker, 2017). Even less attention 
seems to be placed on teaching driver instructors 
how to teach driving skills with this vast variety of 
technology. In addition, we have found no literature 
on this topic from a pedagogical perspective. For this 
reason, we would like to look at training for 
automation when it comes to driving cars. 
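
The daily averages quoted in this introduction can be checked with a 
few lines of arithmetic. The sketch below is illustrative only: it 
assumes that the yearly figures 791, 810 and 821 are the combined 
numbers of killed and severely injured, and the variable and function 
names are our own. 

```python
from calendar import isleap

# Killed plus severely injured per year, as quoted in the text (SSB 2018a).
# Assumption: these are combined totals, not severe injuries alone.
killed_or_severely_injured = {2014: 821, 2015: 810, 2016: 791}

# Ceiling for 2020 stated in the National Transport Plan (NTP 2014-2023).
ntp_target_2020 = 775


def per_day(total: int, year: int) -> float:
    """Average number of people per day for a given calendar year."""
    days = 366 if isleap(year) else 365
    return total / days


daily = {year: per_day(n, year) for year, n in killed_or_severely_injured.items()}
target_daily = per_day(ntp_target_2020, 2020)

for year in sorted(daily):
    print(f"{year}: {daily[year]:.2f} killed or severely injured per day")
print(f"2020 target: {target_daily:.2f} per day")
```

Under this reading, every year in the series lies slightly above two 
people per day, and the 2020 target corresponds to about two per day. 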


How will ADAS technology in cars potentially affect 
driver training and driver instructor education, and 
which new skills might be needed for a driver? 


In order to answer this, the driver training pro- 
gram and the driver instructor education in Nor- 
way will be presented first, before we present issues 
regarding automation and training. After this we 
discuss ADAS technology in cars and how it would 
affect driver training and driver instructor educa- 
tion. Next, we look at which new skills a driver 
might need. Then, we present our conclusion. 


2 DRIVER LEARNING PROGRAM AND 
DRIVER INSTRUCTOR EDUCATION 
IN NORWAY 


The Norwegian driver education model is very 
comprehensive and systematic (Rismark and 
Solvberg 2007), and it normally takes about two 
years to become a driver under the program, which 
contains detailed curricula for content, progression, 
and teaching methods (NPRA 2013). This 
two-year education is a module-based training 
program consisting of four modules that include 
both individual and group tutorials that are both 
theoretical and practical. In addition, accompanied 
driving with someone who has held a driving 
license for a minimum of five years is highly 
recommended, and thus it is common, from the age 
of sixteen, to drive with a parent as a passenger. 

Driver instructor education in Norway is also 
extensive, as it is a two-year university 
education with an emphasis on traffic pedagogy, 
road traffic law, and traffic psychology in addition 
to physics and technology (Nord universitet 2017). 
This two-year education includes both theory and 
practice and emphasises operational, tactical and 
strategic driving skills (Michon 1985), and the 
GDE framework (Peraaho et al. 2003). However, 
in the future we might see a reduced need for an 
extensive focus on these elements, which until now 
have been viewed as basic. As future in-car 
technology might replace some of the information 
retrieval, assessments and decisions previously 
made by the driver, we might see a shift in which 
knowledge and skills are important for 
driving instructors to develop. 


3 AUTOMATION AND TRAINING 


There are a number of different systems, ranging 
from basics such as automatic windscreen wipers 
to more advanced technology such as lane departure 
tracking, automatic braking systems and even 
more enhanced levels of automated driving 
functionality. Examples of such systems are Autopilot 
(Tesla), Distronic Plus with Steering Assist 
(Mercedes), and IntelliSafe (Volvo). 

Increased automation in cars will probably lead 
to an eventual decrease in the number of accidents 
(Elvik and Hoye 2015; Wilmink et al. 2008). 
Some reports indicate that traffic fatalities could be 
reduced by as much as 90% (Bertanocelli and Wee 
2015). Further, levels of automation in cars will 
most probably increase as a result of the increased 
digitalisation of the transport sector, and brands 
such as Volvo, BMW, and Tesla, all popular brands 
in Norway, expect to have self-driving cars on the 
roads within the next five years (TechEmergence 
2017). However, the leap from where we are today 
to all cars being self-driven is expected to be a 
long one, and semi-automation with in-built 
ADAS technology seems to be a reality for some 
time to come, considering that the age of the motor 
vehicle population in Norway in 2016 was, on 
average, 10.6 years (SSB 2018b). The number for 
Europe is 10.7 years (ACEA 2017). 

There are several different taxonomies try- 
ing to capture the essence of the development of 
advanced technology in cars, and the most com- 
mon seems to be the SAE’s levels of automation 
(SAE 2014). This approach is based on six levels 
of automation ranging from “No automation” 
(level 0) to full automation (level 5). In levels 0-3 
the human driver has the responsibility for the driv- 
ing, and in levels 4-5 the car takes on this respon- 
sibility. Examples of technology at each level, 
according to Banks et al. (2017) are for instance 
level 1: Adaptive Cruise Control (ACC), level 2: 
Tesla Autopilot, level 3: Audi A7 prototype, level 
4: Toyota Highway Teammate, and level 5: Google 
self-driving car. Today, most cars equipped with 
ADAS technology are at level 1. Furthermore, seen 
from a driver's perspective (Banks and Stanton 
2017), there are different roles for the driver within 
automated systems. As an example, a Driver Driv- 
ing (DD) is defined as an operator responsible for 
completing basic operational, tactical, and strate- 
gic tasks (Michon 1985). However, the Driver Not 
Driving (DND) would expect an automated sys- 
tem to have full control of these tasks. That being 
said, the transition is not straightforward, and dur- 
ing the middle phases of automation, Driver Mon- 
itoring (DM) should be assumed. A challenge is 
that in level 2 the driver operates the vehicle, which 
assumes a transition between DD and DM, and in 
level 3 the driver, to a larger degree, supervises the 
vehicle but must intervene if needed, assuming 
a transition between DM and DND (Banks and 
Stanton 2017). The cars with the most advanced 
driver assist systems will be additional to the many 
cars that have less advanced technological 
equipment on the roads. However, this middle phase is, 
according to human factors and safety research, a 
phase where human intervention is relied upon, 
but the human is not very reliable (Wickens et al. 
2016; Son and Park 2017). Human interaction 
with semi-automated technology is known to 
potentially result in serious unwanted incidents in 
a wide range of sectors such as petroleum (Setren 
and Laumann 2015), aviation (Billings 1997; 
Parasuraman and Byrne 2003), and road transport 
(NTSB 2017). Research has found that there are 
several causes for this, for instance the issue of trust, 
overreliance, or complacency (Setren and Laumann 
2015; NTSB 2017), situational awareness (Kaber 
and Endsley 2007), mode confusion (NTSB 2014), 
or lack of optimal training (Salas et al. 2006; 
Setren and Laumann, 2015). Additionally, news 
items concerning ADAS technology in cars seem 
to share a common misperception that when more 
automation is introduced, human error will 
disappear (e.g. NRK 2017). This gives rise to the idea 
that training is not necessarily needed. Human 
factors research advises against omitting training for 
the use of new complex technology (Lee 2006; Salas 
et al. 2006; Setren and Laumann 2015), as there 
will always be a human in the technology loop, for 
instance in use, maintenance or design. 
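
The taxonomy described above can be summarised as a small lookup 
table. The following is a minimal sketch and not part of the SAE 
standard or of Banks and Stanton's work: the class and function names 
are hypothetical, the split of responsibility follows the description 
given in this section, and the driver roles for levels 0, 1, 4 and 5 
are our own assumption, since the text only states the roles for 
levels 2 and 3. 

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AutomationLevel:
    level: int
    example: Optional[str]          # example technology (Banks et al. 2017)
    driver_roles: Tuple[str, ...]   # DD, DM or DND (Banks and Stanton 2017)

    @property
    def responsible(self) -> str:
        # As described in this section: levels 0-3 human, levels 4-5 car.
        return "human driver" if self.level <= 3 else "vehicle"


LEVELS = {
    0: AutomationLevel(0, None, ("DD",)),                              # role assumed
    1: AutomationLevel(1, "Adaptive Cruise Control (ACC)", ("DD",)),   # role assumed
    2: AutomationLevel(2, "Tesla Autopilot", ("DD", "DM")),
    3: AutomationLevel(3, "Audi A7 prototype", ("DM", "DND")),
    4: AutomationLevel(4, "Toyota Highway Teammate", ("DND",)),        # role assumed
    5: AutomationLevel(5, "Google self-driving car", ("DND",)),        # role assumed
}


def needs_role_transition(level: int) -> bool:
    """True when the driver must be able to switch between two roles."""
    return len(LEVELS[level].driver_roles) > 1


for number in sorted(LEVELS):
    entry = LEVELS[number]
    print(number, entry.responsible, "/".join(entry.driver_roles),
          "(role transition)" if needs_role_transition(number) else "")
```

Seen this way, the middle levels 2 and 3 are the ones where the driver 
has to move between roles, which is where the training questions 
raised in this paper are most pressing. 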

Increased automation might even raise the level 
of competence required of the operator, as an 
operator must know both how to handle the system 
more or less manually, for instance if the sensors in 
a car turn off due to bad weather, and how 
to handle and supervise the advanced technology. 

So, as driving skills decrease, the need for poten- 
tially taking over the car will occur in more difficult 
scenarios such as in bad weather conditions like 
slippery roads, heavy snow, and so forth, because 
such conditions could be difficult for ADAS tech- 
nology to handle. One example is Adaptive Cruise 
Control (ACC). A driver who uses ACC, that 
works most of the time, does not get much train- 
ing in driving without it. Then, when it is time for 
the driver to take over control, for instance because 
the weather conditions are too harsh for the sys- 
tem to operate, the driver might lack optimal skills 
to handle the driving. Research has indicated that 
ACC technology leads to a reduction in mental 
workload and thus problems with regaining con- 
trol of the vehicle in failure scenarios (Stanton 
and Young 1998). ACC is one of the technologies 
that might be turned off in, for instance, heavy 
rain without advance warning, implying that 
the driver must be skilled in handling bad weather 
conditions while driving, and be able to take con- 
trol of the car straight away. 


During a transition period, there will 
be cars on the roads with very little to no ADAS 
technology in combination with cars with a large 
variety of ADAS technology. This raises the important 
question of which skills should be taught in 
a driver training program and in driver instructor 
education. The introduction of more automation 
in cars will lead to a change in the skills needed for 
the driver, and hence will bring about a need for a 
change in the competence of the driver instructor. 
This, in turn, will probably affect driver instructor 
education. 


4 AUTOMATED AND ADVANCED NEW 
TECHNOLOGY IN CARS IN REGARD 
TO DRIVER TRAINING AND DRIVER 
INSTRUCTOR EDUCATION 


There are some obvious strengths regarding more 
automation in cars as opposed to fully manual 
cars. First of all, the workload will decrease for the 
human driver. With more technology taking over 
tasks such as changing gears, keeping the speed 
stable, avoiding collisions with pre-crash systems, 
navigation, and so forth, the driver can pay atten- 
tion to other aspects. However, it is commonly 
known that when humans supervise a system as 
opposed to being an active participant, attention 
seems to fall (e.g. Yerkes and Dodson 1908). Even 
though there are many benefits such as the prob- 
ability of a lower accident rate, there are also sev- 
eral concerns regarding automation. Most of these 
concerns are about when the driver needs to take 
over a vehicle, for instance in critical conditions 
(Son and Park 2017) or intention to use/user resist- 
ance (Kyriakidis, et al. 2015; König and Neymar 
2017). When technology takes over many of the 
tasks, and works most of the time, driver skills will 
decrease. This is because maintaining skills with- 
out practice is probably not possible. However, 
very little information exists on driver training 
in regard to how to learn to drive with new tech- 
nology as a new driver, or driving cars with new 
technology as an experienced driver (Harms and 
Dekker 2017). The topic of learning to use the 
technology is not even mentioned when opportu- 
nities and barriers on a societal level are considered 
(Fagnant and Kockelman 2015). However, the use 
of the technology on the market today, such as 
lane assist and Adaptive Cruise Control (ACC), 
should perhaps be taught after proper 
driver skills are acquired. For instance, techni- 
cal driving using lane assist could be perceived as 
uncomfortable as the technical reaction of the car 
is generally slower compared to a driver. When 
turning, for instance, the car is often too far out 
in the curve before the turn is performed, and this can 
be repeated several times during the turn. If this 
were the behaviour of a learner driver during a lesson, 
the instructor would not consider the 
technical driving skills to be adequate. This means 
that the driver must be skilled in order to under- 
stand that the car’s behaviour is not adequate, 
and respond accordingly. The driver requires both 
good driver skills and an understanding of how 
the technology works, together with its advantages 
and limitations. On the other hand, technology 
such as lane assist could probably be of support in 
the event that an unexpected incident occurs and 
the driver loses control of the car. As single-vehicle 
off-the-road accidents, together with head-on accidents, 
have been the most frequent accidents with the highest death 
rate in Norway for the past decades (SSB 2018a), 
this technology could potentially save lives. How- 
ever, perhaps, it should not be trusted for use on a 
regular basis. In driver instructor education today, 
the teaching is that when driver assistance systems 
take over, the driving is not optimal. Thus, the sys- 
tems could be there as a backup, but not trustwor- 
thy enough to be used regularly. The driver should 
drive the car. Furthermore, if such systems are to 
be used while driving, there are other considera- 
tions involved. For instance, regarding ACC, it is 
a technological system that perhaps works better 
in some driving conditions than in others. As an 
example, on icy roads, or in higher density traffic 
in a more complex driving environment, it might 
be a better solution to control speed manually. 
Making the correct decisions on when to use, and 
when not to use, technology while driving requires 
good driving skills. 

Regarding driver instructor education and 
driver training, it seems that the introduction of 
ADAS technology requires that elements are added 
to the education and training rather than removed. 
Additionally, operating these technologies should 
perhaps be a larger part of driver training, driver 
testing, and hence driver instructor education. 

Technology has always had an impact on the 
content of the Norwegian driver education cur- 
ricula. For instance, driving on slippery roads has 
been a mandatory part of driver training in Nor- 
way since 1975. In the early days, the learner driv- 
ers were trained to manually adjust the brake pedal 
in different ways to minimise the braking distance, 
on ice and snow, as much as possible. After the 
ABS braking system was introduced and became 
common in most cars, the content of driver train- 
ing on slippery roads changed and focused more 
on letting the learner drivers experience that the 
ABS system enabled them to brake as hard as 
they could and to simultaneously use the steer- 
ing wheel to control the car (NPRA 1995; NPRA 
2005). However, the main difference between the 
ABS brakes transition and the present technological 
transition is that ABS brakes became common 
in many cars and were operated in the same way 
in all brands of car. The driver needed to change 
how to move the foot while braking, but the brakes 
were in the same place, the basic movement was 
the same, and most brands of car had the same 
system. Nowadays, new technological solutions 
such as ACC, are different in different makes of 
car where some brands for instance have a switch 
on the right side of the steering wheel, while others 
have a button on the front or on the left side of the 
steering wheel. This lack of standardisation could 
be confusing and hence could distract the driver. 
Different solutions such as these, together with 
different software solutions in touchscreens in new 
cars, may make it less easy than before to drive a 
car that the driver has not driven before. 
It could be difficult to know which 
technological solutions are included in the car, 
and difficult to know how to use the technology. 
Currently, distractions for the driver are about to 
increase due to in-vehicle devices. This runs counter 
to the necessity of keeping an eye on the road 
(Wickens et al. 2004). 

There is a possibility that the answer to this is 
to have differentiated driving licenses and not a 
standardised license such as we are used to today, 
because technology in cars is too varied and 
unstandardised. It should be a matter for discus- 
sion as to when cars are so different from each 
other that a standardised driving license is no 
longer good enough. 

Increased technology has affected the train- 
ing situation for a long time, and, in Norway, one 
example of an aspect that is in a transitional phase, 
is the trend that new cars are not equipped with 
manual gears. There are two important aspects to 
this situation. First, we see that the educational 
system does not keep up with the speed of technological 
development. For instance, more than 99% of the 
new cars sold by Toyota in Norway so far in 2017 
were equipped with automatic gears (Korsvoll 
2017). Thus, the driver will not need to learn how 
to use manual gears as automatic gears will most 
probably become the new normal. However, in 
Norway, driver training is based on manual gears, 
and the education of driver instructors is based on 
vehicles equipped with manual gears. Perhaps the 
driver instructor program should focus instead on 
other tasks rather than teaching new drivers how 
to drive with manual gears. If a technology as basic 
as gears is hard to keep up with regarding a transi- 
tion from manual to more automation, it could be 
a challenge when now even more technologically 
equipped cars enter the market. 

Second, the gearing system is an example where 
different technological equipment in cars requires 
different types of license for the driver. In Norway, 
as in the EU, you are allowed to drive an automatic 
car if your license is for manual gears, but not the 
other way around. You are not allowed to drive a 
car with manual gears if your license is for auto- 
matic driving (FOR 2017). For this reason, many 
driving schools only have manual gears in their 
cars, as for instance learner drivers know that they 
will probably buy a cheaper car with manual gears 
when they have their license. A solution such as 
this might also include more ADAS technology in 
the years to come. There could be different licenses 
based on the technology in the car you drive. 

New technology seems to be introduced 
faster than the educational system 
changes. Furthermore, if you have 
received your class B driving license, there is no re- 
testing or system to update your driving skills, so 
there is a question as to how these drivers should 
learn how to operate new technology properly. 
Additionally, for driver instructors who are already 
authorised, there are no mandatory courses for 
updating their competence, so another question 
could be how they should get the necessary skills 
to teach new and existing learner drivers. Even if the 
two-year university education to become a driver 
instructor in Norway were adjusted today, the market would 
not change completely for many years. Nevertheless, 
the rapid speed of technological progress will 
continue. 


5 NEW DRIVER SKILLS REQUIREMENTS 


In order to know which skills a driver must have, 
we need to know how the car works. For example, 
the GDE matrix has provided the basic understanding 
of the driving skills that a driver needs 
and has thus been one of the central elements in 
driver instructor education. The GDE matrix 
consists of five levels, where the lowest level is 
vehicle manoeuvring, the second level is mastering 
traffic situations, the third level is goals and con- 
text of driving, the fourth level is goals for life and 
skills for living (Keskinen 1996 in Hatakka et al. 
2002), and the fifth level is social skills (Keskinen 
2014; Keskinen et al. 2010). However, the situa- 
tion regarding new technology in the car is also 
changing the skills needed for a driver. It seems 
to be time to redefine which competence a driver 
must hold, and the GDE matrix may not be the 
optimal way to define the necessary skills in the 
future. If cars become more or less self-driving and 
automated, perhaps the lower stages of the GDE 
matrix might not correspond with the actual skills 
that are needed to drive a car. 

Another example is the driving process, which 
might be explained using a basic information 


processing model (e.g. Wickens and Carswell 2006). 
This model assumes that information is perceived, 
then processed before decisions are made based on 
how the information is processed and action is then 
taken. In regard to the driving process, the ques- 
tion is who is collecting the information and who 
is responsible for collecting which information, the 
car or the human? As an example, when driving 
with ACC, the driver needs to monitor the envi- 
ronment and collect information on driving condi- 
tions as the car does not collect information, for 
instance, on the road conditions such as rain or ice 
or dry asphalt. Furthermore, the system does not 
communicate with any other systems in the car, so, 
for instance, if the car skids and the traction control gets 
the car back on track, the ACC does not take the 
slippery road condition into consideration, and 
will only work to get the car back to the required 
speed or distance from the car in front. This 
assumes Driver Driving and Driver Monitoring 
with this technology (Banks and Stanton 2017). 
Regarding another technology, lane assist, the 
same aspect occurs as, for instance, lane assist will 
not work without proper road markings. Therefore 
the driver must pay attention to whether the road 
is properly marked or not. This is information a 
driver normally would not need to pay that much 
attention to if driving the car, as the driver would 
most likely hold the steering wheel and stay on her/ 
his side of the road regardless of the quality of the 
road marks. The technology could thus make the 
driver pay attention to the road closer to the vehi- 
cle rather than paying attention to the road traffic 
environment further ahead. Additionally, regard- 
ing decision making, it could be questioned as to 
whether it is the car or the driver that makes the 
decisions. For instance, with ACC, if the car does 
not collect information on the road conditions, 
it cannot be responsible for making decisions in 
this regard. The driver must monitor and make 
decisions based on the information gathered and 
processed. Finally, the question is who takes action 
based on the information and decisions? If the car 
does not make decisions or gather relevant infor- 
mation, it probably cannot take appropriate action, 
meaning this would be the driver’s responsibility. 

So, what do we hand over to the car and what is 
left to the driver? The question will have different 
answers for different technologies. If, for instance, 
we apply the same reasoning to another ADAS technology, 
adaptive lighting, the situation is different. 
Here the car gathers information on, for 
instance, the light conditions in the environment, 
and oncoming cars, and makes decisions based on 
the information gathered and takes action to turn 
lights on or off or chooses the degree of bright- 
ness. Thus, the driver will not need to use as much 
cognitive capacity for this operation. 

Another issue that complicates which driving 
skills are needed is the lack of standardisation 
between the car manufacturers on how the 
new technology should interact with the driver. 
For instance, there are several different solutions 
to touchscreen software in cars. So, the skills of 
the driver need to correspond to the actual car the 
driver will be driving, the technological solutions 
in the car and how the technological solutions 
interact with the driver. 


It seems that training needs adjustment in 
order to meet the new digitalisation of the future. 
However, in order to change what we teach to the 
learner drivers, we probably need to start with the 
educational institutions who educate the driver 
instructors. In addition, there is the question of 
if and how to re-educate driver instructors who 
are already certified as driver instructors for tra- 
ditional driving. In Norway alone, there are more 
than 1,000 driving schools, and providing courses 
for the instructors in all of these schools will take 
time and effort. This time does not seem to be 
available at the speed at which changes are hap- 
pening today. 

One solution could be that the manufactur- 
ers are responsible for the specific technological 
training for drivers, and license drivers for their 
technology. A solution such as this also requires 
consideration as to what training and testing for 
such a license should involve, in addition to who 
is responsible for the training and testing. Today it 
is the National Road Authorities in Norway who 
conduct the testing of learner drivers in order for 
them to qualify for a driving license. Therefore, to 
maintain the driver skills requirements, the test- 
ing could be the responsibility of the authorities. 
This testing could include drivers' ability to drive 
and supervise the systems in addition to how to 
respond to alarms and warnings. Therefore, one 
could think of driver education that comes in two 
levels. In that case, a standard learner driver could 
learn how to handle a manual car as level one in 
a standard driving school, but also learn how to 
operate and supervise a car of the future as level 
two. How to drive a car with technological solu- 
tions could, with this system, be up to the manu- 
facturers to teach properly to all drivers, and be 
tested by the road authorities. 


We see in aviation, for instance, that pilots 
are trained in simulators in order to uphold the 
required skill level to fly an aircraft. This is partly 
because flying with a high level of automation 
decreases flying skills. This could be a solution 
for drivers as well. In order to keep their driving 
license, drivers could be required to have a certain 
amount of simulator training in order to uphold 
driving skills because their cars have ADAS tech- 
nology. However, this will require an increase in 


simulators for one; in Norway today there 
are between 5 and 10 simulators for driving license B. 
Furthermore, retraining to uphold skills requires a 
system where everyone holding a driving license in 
Norway has training. A system will also be needed 
to deal with the bureaucratic aspects. Thus, there 
are some obvious obstacles to such a solution in 
regard to costs and resources in addition to the 
issue of how society would respond to it and if 
there will be public acceptance for such a system, 
and the political will to implement it. 


6 CONCLUDING REMARKS 


How will ADAS technology in cars potentially 
affect driver training and driver instructor educa- 
tion and which new skills might be needed for a 
driver? This was the question we wanted to exam- 
ine closely in this paper. 

We must be honest and admit that today, we 
do not know how to provide general training for 
more technology equipped cars, or even for self- 
driving cars. To be able to assess a good training 
program, it is essential that we know what we are 
training for. Today, however, with the vast variety 
of technological solutions on the roads, the lack 
of standardisation of the software and devices in 
cars, and a future in which new technological 
solutions seem to arrive quickly, it 
seems difficult for the driver instructor industry to 
prepare and come up with an optimal solution in 
the short run. 

Hence, we recommend that the content of driver 
training and driver instructor education should 
preferably be increased, not decreased, as good 
driving skills are still needed in addition to good 
understanding of how to operate the technology. 
This is because, as of today, ADAS technology in 
cars seems to result in more rather than less work 
for the driver. 

ABSTRACT: Improving the performance and safety of aircraft operation is one of the most important 
issues addressed by experts. Such improvement can result not only in less frequent loss of equipment 
but, primarily, in protecting the health or saving the lives of both crew members and others involved. 
Reducing such risks or minimizing their impact is possible by analyzing events that have already occurred. In 
this paper, our main motivation is to develop an effective and intelligent decision support system 
based on data mining techniques. In this context, data mining classification algorithms have been applied 
to large datasets to assess and analyse the risk factors statistically related to aircraft incidents and to 
compare the performance of the implemented classifiers, such as decision tree, discriminant analysis and random 
forest. To underscore the practical cost, i.e., the effectiveness, of our approach, the selected classifiers have 
been implemented using statistical programming tools with datasets taken from the operation process. 
This analysis is expected to find the algorithm that can support decision making. 


1 INTRODUCTION 


1.1 Formulation of the problem 


Following the increasing requirements of flight 
safety and cost reduction, the problem of finding 
the optimum between the economic demands and 
acceptable level of risk arises in terms of reliability. 
Modern control systems equipped with computerized 
processes and extensive diagnostic tools 
often do not use all the information collected from 
the hardware level (Tloczynski 2017b). Moreover, 
some of the relations between events are often 
ignored or neglected. The article presents a new 
approach to increasing reliability of aircraft opera- 
tions by predictive data analysis and increasing 
acceptable levels of safety. 


2 FORMULATION OF THE PROBLEM - A 
STATISTICAL APPROACH TO AVIATION 
SAFETY PREDICTION 


This article addresses the need to use statistical 
tools and artificial intelligence methods, 
which make it possible to discover the relations between 
events stored in the database. 


2.1 Decision trees 


Decision trees are a class of predictive data mining 
tools which predict either a categorical or continu- 
ous response variable. They get their name from the 
structure of the models built. A series of decisions 
are made to segment the data into homogeneous 
subgroups. This is also called recursive partition- 
ing. If presented graphically, the model can resem- 
ble a tree with branches (StatSoft Inc. 2013). 

A decision tree is composed of nodes and splits 
of the data. The tree starts with all training data 
residing in the first node. An initial division is 
made using a predictor variable, segmenting the 
data into 2 or more child nodes. Divisions can then 
be made from the child nodes. A terminal node is 
the one where no more divisions are made. Predic- 
tions are made based on the behaviour of terminal 
nodes (Mueller et al. 2017). 
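
The single division step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not say which splitting criterion was used, so Gini impurity is assumed here, and the function names and the toy flight data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(values, labels):
    """Find the threshold on one continuous predictor that gives the
    lowest weighted Gini impurity of the two child nodes.

    Returns (threshold, weighted_impurity); threshold is None when no
    split improves on the impurity of the parent node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: time in air (min) and incident flag (1 = incident).
time_in_air = [20, 35, 40, 90, 110, 120]
incident = [0, 0, 0, 1, 1, 1]
print(best_split(time_in_air, incident))
```

Applying the same search recursively to each child node, until no further division is made, yields the terminal nodes on which the predictions are based.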

Decision trees offer many advantages. One 
important advantage is the ease of interpretation 
of a decision tree. While the tree can be complex, 
involving a large number of splits and nodes, users 
can interpret the model (Sedlacik & Cechova 2016). 
Additionally, making model predictions does not 
involve mathematical calculations as in General 
Linear Models. The predictions are based on deci- 
sion rules. In classification problems, the user can 
specify misclassification cost. Decision trees tend 
to give good predictive accuracy and can allow for 
missing data in deployment (Hinz et al. 2017a). 


2.2 Logistic regression 


Statistical methods have been used to determine the 
safe performance of the aviation operation. Logistic 
regression and decision trees have been implemented 
as the most promising methods to reach this goal. The 
logistic regression model is based on assumptions similar 
to those of the linear regression model; however, the 
former can be used when the predicted variable is 
in binomial form (Babiarz 2016). 

Logistic regression is a mathematical modelling 
approach that can be used to describe the rela- 
tionship of several independent variables X, to a 
dichotomous dependent variable, such as outcome 
- decisions D (Kleinbaum & Klein 2010). Other 
modelling approaches are possible as well; how- 
ever, logistic regression is by far the most popular 
modelling procedure used to analyse epidemio- 
logic data when the illness measure is dichotomous 
(Vintr & Valis 2011). 

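Formally, the logistic regression model is 
as follows (Valis, Zak, & Pokora 2014): 

\[ \log \frac{p(x)}{1 - p(x)} = \beta_0 + x \cdot \beta \qquad (1) \]

where 

$p(x)$: logistic function of the probability, 

$\beta_0$, $\beta$: constant terms representing unknown 
parameters. 

Solving for p, this gives: 

\[ p(x) = \frac{e^{\beta_0 + x \cdot \beta}}{1 + e^{\beta_0 + x \cdot \beta}} = \frac{1}{1 + e^{-(\beta_0 + x \cdot \beta)}} \qquad (2) \]
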

It should be noted that the overall specification 
is a lot easier to fathom in terms of the transformed 
probability than in terms of the untransformed 
probability (Shalizi 2013, Hinz et al. 2017b). 
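
The relation between the transformed and the untransformed probability can be checked numerically. The following is a minimal sketch, assuming the standard form of the logistic model with an intercept and a coefficient vector; the function names and the coefficient values are illustrative and are not taken from the fitted model.

```python
import math


def logit(p):
    """Transformed probability: the log-odds log(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def logistic(beta0, beta, x):
    """Untransformed probability p(x) for the linear predictor
    beta0 + x . beta, i.e. 1 / (1 + exp(-(beta0 + x . beta)))."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and predictor values, for illustration only.
p = logistic(-0.5, [0.2], [3.0])
print(p, logit(p))  # the log-odds recover the linear predictor
```

On the log-odds scale the effect of each predictor is additive, which is why the transformed form is the easier one to read.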


3 EXPERIMENTAL STUDY OF 
STATISTICAL METHODS 


In order to carry out experimental studies, 
flights performed on third-generation fighter 
aircraft of two types were considered. The data 
were derived from the last 8 years of the operation 
process in Poland. Flights were analysed 
in terms of the occurrence of incidents or undesirable 
events. Table 1 presents the variables that represent 
the examined flight parameters. 


Table 1. Flight parameters. 

Variable name                                                      Unit    Type
Time after sunrise                                                 min     Continuous
Time after sunset                                                  min     Continuous
Month                                                                      Continuous
Aircraft type                                                              Categorical
Age of aircraft                                                    day     Continuous
Atmospheric conditions                                                     Categorical
Name of the military department                                            Categorical
Real time in air                                                   min     Continuous
Number of crew members                                                     Continuous
Flight-hour of the first pilot                                     hour    Continuous
Flight-hour of the first pilot performed on a given aircraft type  hour    Continuous
Year of the promotion of the first pilot                                   Continuous
Subsequent departure of the first pilot on a given day                     Continuous


3.1 Decision trees results 


Evaluation of the performance of a classifi- 
cation model is based on the counts of test 
records correctly and incorrectly predicted by 
the model. These counts are tabulated in a table 
known as a confusion matrix. Although a confusion 
matrix provides the information needed 
to determine how well a classification model 
performs, summarizing this information with a 
single number would make it more convenient 
(Bořil & Čičmanec 2016). This can be done using 
a performance metric such as accuracy, which is 
defined as follows: 


\[ \text{accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \]

\[ \text{error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} \]
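
As a worked example, the two metrics can be computed directly from a confusion matrix. The sketch below uses the percentages reported in Tables 2 and 3 of this paper; the function and variable names are illustrative, and rows are assumed to hold reality while columns hold the model output.

```python
def accuracy(matrix):
    """Share of predictions on the main diagonal of a square
    confusion matrix (rows: reality, columns: model)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def error_rate(matrix):
    """Share of wrong predictions; complements the accuracy."""
    return 1.0 - accuracy(matrix)


# Percentages from Table 2 (training sample) and Table 3 (testing sample).
training = [[62.34, 31.46],
            [0.29, 5.91]]
testing = [[60.15, 34.21],
           [3.01, 2.63]]

print(round(accuracy(training), 4), round(error_rate(training), 4))
print(round(accuracy(testing), 4), round(error_rate(testing), 4))
```

For the reported values this gives an accuracy of roughly 68% on the training sample and roughly 63% on the testing sample.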


Figure 1 shows the results obtained with the use 
of decision trees. 

In this analysis, 80% of the cases were selected 
as the testing samples. 

Tables 2 and 3 show the percentage of correct 
decisions for the training sample and the testing 
sample, respectively. 

Figure 1. Sketch of the decision tree. 


Table 2. Cross tabulation for training sample. 

                 Model
Reality          No incident    Incident
No incident      62.34%         31.46%
Incident         0.29%          5.91%


Table 3. Cross tabulation for testing sample. 

                 Model
Reality          No incident    Incident
No incident      60.15%         34.21%
Incident         3.01%          2.63%


3.2 Logistic regression results 


In order to analyse the significance of parameters, 
not all available flight data were included. For the 
analysis, all data from the operation process 
with recorded incidents were used, and the 
same number of randomly selected flights without 
an incident was considered. 

In a proper statistical model, all of the parameters 
should be significantly different from zero (Koucky 
& Valis 2007). There are cases when the model has 
to comprise parameters that are not statistically 
significant; however, the decision has to take them 
into account. The justification for this may be either 
theoretical knowledge of a phenomenon or experience 
from another, similar analysis. Wald statistics 
and the associated test probability level p can be 
used to determine the predictors. The simplest 
way to obtain a model with all statistically relevant 
parameters (Table 7) consists in removing those 
parameters from the model which are not significant; 
the full model is presented in Table 6. The 
model was constructed by 
backward elimination. The elimination starts with 
the entire model and then the variables with the 
highest test probability level p are eliminated. The 
elimination finishes at the moment when the model 
comprises only statistically significant variables. 

Tables 4 and 5 show the percentage of correct 
decisions for the model with all parameters and for 
the model with all statistically relevant parameters, 
respectively. 


Table 4. Cross tabulation for generalized 
linear model results. 

                 Model
Reality          No incident    Incident
No incident      31%            18%
Incident         23%            29%


Table 5. Cross tabulation for generalized 
linear model results - backward elimination. 

                 Model
Reality          No incident    Incident
No incident      33%            18%
Incident         25%            25%


In the present analysis, the logit node with 
implementation was applied, and the choice of 
predictors has been made with automatic execution. 
The logit model is linear and its complexity 
is determined by the number of independent variables 
included in the design (Valis et al. 2016, Tloczynski 
2017a). The more variables there are, the 
greater the risk of failure of the model. 


Table 6. Assessment of the parameters significance. 

Variable name                                                      Value  Value of the parameter  Std. err.  Wald’s stat.  p
Free term                                                                 -6.86E+01               3.99E+01   2.95E+00      8.58E-02
Real time in air                                                           2.76E-05               7.53E-06   1.34E+01      2.47E-04
Time after sunrise                                                         5.87E-04               3.95E-04   2.22E+00      1.37E-01
Time after sunset                                                         -1.22E-03               4.19E-04   8.53E+00      3.49E-03
Month                                                                      3.80E-02               1.84E-02   4.24E+00      3.95E-02
Flight-hour of the first pilot                                             4.48E-04               2.28E-04   3.86E+00      4.94E-02
Flight-hour of the first pilot performed on a given aircraft type         -4.10E-04               3.18E-04   1.67E+00      1.97E-01
Year of the promotion of the first pilot                                   3.65E-02               1.89E-02   3.76E+00      5.26E-02
Subsequent departure of the first pilot on a given day                    -1.16E-01               1.13E-01   1.06E+00      3.04E-01
Age of aircraft                                                           -2.64E-04               3.10E-04   7.27E-01      3.94E-01
Aircraft type                                                      103     7.14E-02               1.43E-01   2.48E-01      6.19E-01
Aircraft type                                                      102     1.96E-01               1.68E-01   1.36E+00      2.44E-01
Aircraft type                                                      101    -3.66E-01               1.51E-01   5.85E+00      1.55E-02
Atmospheric conditions                                             0       2.30E+00               8.30E-01   7.65E+00      5.69E-03
Atmospheric conditions                                             1       2.91E+00               7.47E-01   1.51E+01      9.99E-05
Atmospheric conditions                                             2       2.76E+00               7.54E-01   1.34E+01      2.52E-04
Atmospheric conditions                                             3       3.19E+00               7.74E-01   1.70E+01      3.73E-05
Atmospheric conditions                                             4       3.12E+00               7.56E-01   1.70E+01      3.76E-05
Atmospheric conditions                                             5       2.81E+00               8.28E-01   1.15E+01      7.00E-04
Atmospheric conditions                                             6      -1.84E+01               7.59E+00   5.90E+00      1.51E-02
Atmospheric conditions                                             8       2.83E+00               7.94E-01   1.27E+01      3.62E-04
Atmospheric conditions                                             9       2.43E+00               1.44E+00   2.85E+00      9.16E-02
Atmospheric conditions                                             10      3.00E+00               8.72E-01   1.18E+01      5.84E-04
Atmospheric conditions                                             12      3.92E+00
Atmospheric conditions                                             15     -1.41E+01
Type of flight                                                     0       1.81E-01               3.77E-01   2.31E-01      6.31E-01
Type of flight                                                     1       4.90E+00               3.20E-01   2.35E+02      0.00E+00
Type of flight                                                     2       4.86E+00               2.35E-01   4.28E+02      0.00E+00
Name of the military department                                    23     -5.01E+00               2.60E-01   3.71E+02      0.00E+00
Name of the military department                                    25     -4.73E+00
Name of the military department                                    216    -1.00E+00


Table 7. Assessment of the parameters significance - backward elimination. 

Variable name                                                      Value  Value of the parameter  Std. err.  Wald’s stat.  p
Free term                                                                 -4.85E-01               2.25E-01   4.66E+00      3.08E-02
Real time in air                                                           3.06E-05               6.39E-06   2.30E+01      1.64E-06
Name of the military department                                    23     -2.73E-01               1.82E-01   2.25E+00      1.34E-01
Name of the military department                                    25      1.46E-01               1.85E-01   6.20E-01      4.31E-01
Name of the military department                                    216    -6.94E-01               3.57E-01   3.77E+00      5.21E-02
Time after sunset                                                         -5.39E-04               2.38E-04   5.12E+00      2.37E-02
Month                                                                      4.18E-02               1.77E-02   5.55E+00      1.84E-02
Aircraft type                                                      103     2.49E-02               1.30E-01   3.64E-02      8.49E-01
Aircraft type                                                      102     1.91E-01               1.55E-01   1.52E+00      2.18E-01
Aircraft type                                                      101    -4.78E-01               1.42E-01   1.13E+01      7.61E-04
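
The Wald statistics and p values in the parameter tables can be reproduced approximately from the reported estimates and standard errors. The sketch below assumes the common form of the test, in which the squared ratio of estimate to standard error is compared with a chi-square distribution with one degree of freedom; the paper does not state the exact variant used, and the function names are illustrative.

```python
import math


def wald_statistic(estimate, std_err):
    """Wald statistic of a single parameter: (estimate / std. err.)^2."""
    return (estimate / std_err) ** 2


def wald_p_value(w):
    """Upper-tail probability of a chi-square variable with one degree
    of freedom, expressed through the complementary error function."""
    return math.erfc(math.sqrt(w / 2.0))


# "Real time in air" row of Table 7: estimate 3.06E-05, std. err. 6.39E-06.
w = wald_statistic(3.06e-05, 6.39e-06)
print(w, wald_p_value(w))
```

The results agree with the tabulated values up to the rounding of the published estimates. In backward elimination, the parameter with the largest p value above the chosen significance level would be the next candidate for removal, after which the model is refitted.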


4 CONCLUSIONS 


The work was carried out by three research
groups: the University of Defence, the Air Force
Institute of Technology and the Polish Air Force
Academy, as part of the recent international
cooperation between military universities and the
institute.

This article compares the results of two
models, a regression model and decision trees.
Both give similar results, with an accuracy of
around 60%. When interpreting the results, it
should be kept in mind that they differ slightly
between models; the selection of the training
sample has to be considered as well.

The algorithms give, step by step, the importance
of the predictors, which can be used in
decision-making processes.

This particular analysis resulted in a first,
general algorithm that can support decision
making. Moreover, because the algorithm is
general, it could be used for different types of
aircraft. Further calculations are expected to be
necessary in order to reach better accuracy.


ABSTRACT: This paper presents the experiences from applying SysML models as support for
establishing the safety requirements specification of a new safety-related railway application. The new 
railway application is a software-based system for securing work areas, meaning it prevents railway 
traffic in areas along the track allocated to maintenance. The experiences are collected within the Safety 
Assessment Framework for Efficient Transport (SafeT) project managed by Bane NOR. Bane NOR is the 
government agency that owns, operates and develops the Norwegian railway infrastructure. The objective 
of the SafeT framework is to offer a systematic, reusable way for creating system wide conceptual design 
models and based on them, creating a common risk model, which in turn will facilitate safety assessment, 
establishing the requirements specification, and safety demonstration of the system under consideration. 
The paper introduces the SafeT project as context of the work and presents experiences on the application 
of SysML for the conceptual system design of the new securing work areas application. The paper also 
discusses whether SysML models fit the SafeT framework’s objectives. 


1 INTRODUCTION

[Figure 1 labels: V-cycle (modified from EN 50126-1:2017); Development; Operation and Maintenance]


The SafeT project aims at developing a framework 
that supports the implementation of EN 50126 
(CENELEC, 2017) and thereby of the Common 
Safety Methods for Risk Assessment (CSM RA) 
(EU, 2013). Figure 1 shows which phases of
EN 50126 are within the scope of the current
SafeT work and this paper, marked by a dark
grey rectangle.

The current focus is on the development phases 1
to 4 of EN 50126. In these phases of a system’s life
cycle, Bane NOR takes a lead role in the development,
while successive development phases to a large
extent are outsourced. The SafeT framework intends
to support the development of the core artefacts
within the system life cycle. In the early stages of the
life cycle, in the part of the framework that concerns
the in-house conceptualisation, the core artefacts
are: 1) the conceptual system design model; 2) the
common risk model; and 3) the requirements specification.

Figure 1. Scope of paper and relationship to EN 50126.

The main objective of the SafeT framework
is to offer a systematic, reusable way of creating
system-wide conceptual design models and, based
on them, creating a common risk model, which in
turn will facilitate the safety assessment, requirements
specification and safety demonstration of
the system under consideration, throughout the
system’s lifetime.


2 RELATED WORK


International safety standards, such as EN 50126, 
provide requirements and guidance on how to carry 
out safety demonstration and assessment. Although 
most safety standards view the safety of a system
as a function of the reliability of its components,
little guidance is provided on how to derive safety 
requirements and acceptable risk for components 
whose failure rates are not known. Particularly, it is 
often difficult to derive safety requirements for logi- 
cal components such as the software. The problem 
can be formulated from a consideration of the fol- 
lowing two important tasks in the development of 
safety critical systems: (1) establishing the require- 
ments to the system, and (2) ensuring that the system 
fulfils these requirements. The safety requirements 
should be established through risk assessment and 
hazard analysis, and fulfilled through the use of 
techniques and measures adequate for the risk level. 
The framework proposed in SafeT has much of its 
inspiration from theoretical aspects of international 
safety standards such as IEC 61508 (IEC 61508). 
The novel part of the framework is fivefold: reus- 
ability, modularity, unification, transparency and 
argumentation. 

Next, many past projects that relate to the topics 
of SafeT are briefly introduced. The OPENCOSS 
project provides a common language for both 
safety-case and standards-based approaches for 
certification. The CHESS project seeks to improve 
Model Driven Engineering practices and tech- 
nologies to better address safety, reliability, per- 
formance, robustness and other extra-functional 
concerns while guaranteeing correctness of com- 
ponent development and composition for embed- 
ded systems, and offers a modelling language and 
editor. The CHESS modelling language and edi- 
tor is a collection-extension of subsets of standard 
OMG languages such as UML (UML), MARTE 
(MARTE) and SysML (SysML). 

The EU funded project MODSafe provides a
risk analysis method intended to combine potential
hazards, safety requirements and functions,
and link these elements to a generic functional and
object-oriented structure of a guided transport
system. The SafeCer project (Björnander, 2012)
provides a generic process model for integrated
certification and development of component-based
systems, including an overall picture of the
development and verification of components and
systems. ASCOS (Roelen, 2014) focuses on safety
and certification of new aviation operations and
systems, including advice on methods and tools
for safety-based design. ModelMe! (Falessi, 2011)
provides a tool-supported traceability framework
where the tool automatically extracts the
safety-related slices of SysML design models.


Another approach is the AltaRica language
(Griffault, 1998). AltaRica is an object-oriented
modelling language dedicated to performance
evaluation of complex systems. The main motivation
for its creation was the difficulty of designing,
sharing and, most importantly, maintaining safety and
reliability models such as fault trees, event trees,
Markov chains or stochastic Petri nets. The application
and further development of the language is
a continuous research activity at NTNU (Legendre
2017).

Of relevance is also CORAS (Lund, 2013; Gran, 
2004) which provides a methodology for model- 
based risk assessment, integrating aspects from 
partly complementary risk assessment methods 
and state-of-the-art modelling methodology. 

The SafeT project has also reviewed a number 
of ongoing and past industrial experiences among 
the project partners related to the use of design 
and risk models to facilitate the safety assessment 
and demonstration of complex systems. Some of 
the challenges observed in these projects have also 
been reported earlier within aviation (Gran, 2007). 
Finally, the CHASSIS method (Raspotnig, 2018) 
utilizes UML use cases and sequence diagrams 
with HAZOP guidewords to integrate safety and 
security considerations for early requirements 
determination. 


3 CONCEPTUAL MODELLING 


3.1 The role of models 


An important aspect of SafeT is the role of sys- 
tem modelling in the RAMS process defined in 
EN 50126, in particular for supporting the risk 
assessment process and the identification of safety 
requirements. An example of a modelling task 
related to the RAMS life-cycle phases is the intro- 
duction of the system under consideration in a 
model at the railway system level (phase 1). In phase 
2, the model can be refined as necessary to support 
the description of system objective, mission profile, 
boundaries and external interfaces and interactions. 
In phase 3, the model can be further refined to sup- 
port the establishment of the risk model, followed 
by a refinement in phase 4 to support the specifi- 
cation of requirements and application conditions 
for the system under consideration. In addition to 
the system models, there is also a need to establish 
risk models that capture the relations between the 
different hazards, causes, barriers, accidents, and 
consequences identified in the hazard identification 
performed at the different system levels. SafeT looks 
into the possibilities to enhance the system and risk 
modelling tasks by the appropriate application and 
combination of techniques evaluated against a set 
of criteria derived from the relevant standards.

SafeT intends to support the implementation
of EN 50126 by giving guidance on what kind of 
models can be used, and how they can be utilized, 
in the life cycle phases within the standard. In this 
paper, we focus on the application of models. In 
another paper, we focus on the risk assessment 
part (Skogvang, 2018). An important research 
problem in the SafeT project is how the use of 
models throughout the life cycle of a system can 
be integrated in a way that facilitates the overall 
safety demonstration and assessment. The models 
will serve different needs, related to the analysis 
of system, risk, requirements, etc. SafeT aims at 
arriving at a set of techniques that covers the mod- 
elling needs in the different life cycle phases, with 
a current focus on the first four phases aimed at 
establishing the requirements specification. Some 
examples of the prospective use of models are: 


— describing and analysing the static structure of a 
system and its constituent parts, down to the sys- 
tem level and the level of detail necessary to sup- 
port analysis, independence demonstration, etc.; 

— describing and analysing the behaviour of a sys- 
tem, internally as well as through its boundaries; 

— describing and analysing a system’s interaction 
with its environment, and how it affects, and is 
affected by, agents involved in its operation; 

— supporting the activities involved in risk assess- 
ment and hazard control, including the iden- 
tification of hazards at all system levels, their 
causes and possible consequences; 

— supporting the derivation of the safety require- 
ments needed to handle the hazards at the over- 
all system level as well as technical hazards at 
any system level; and 

— communicating the different design and risk 
aspects, as well as the safety argumentation as 
such, to the different stakeholders involved. 
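
The relations that a risk model is expected to capture (hazards with their causes, barriers and consequences at different system levels) can be illustrated with a small data structure. The following is only a sketch under our own assumptions: the class and field names are hypothetical, and the two hazard entries are invented for illustration rather than taken from the SafeT hazard identification.

```python
from dataclasses import dataclass, field


@dataclass
class Hazard:
    """One hazard with its related causes, barriers and consequences."""
    name: str
    system_level: str
    causes: list[str] = field(default_factory=list)
    barriers: list[str] = field(default_factory=list)
    consequences: list[str] = field(default_factory=list)


@dataclass
class RiskModel:
    hazards: list[Hazard] = field(default_factory=list)

    def unmitigated(self) -> list[Hazard]:
        """Hazards with at least one cause but no barrier recorded."""
        return [h for h in self.hazards if h.causes and not h.barriers]

    def by_level(self, system_level: str) -> list[Hazard]:
        """Hazards identified at the given system level."""
        return [h for h in self.hazards if h.system_level == system_level]


# Illustrative entries only; they are not taken from the SafeT hazard log.
model = RiskModel([
    Hazard("Work area unblocked prematurely", "railway system",
           causes=["Dispatcher releases the wrong area"],
           barriers=["Support system rejects the release"],
           consequences=["Train enters an occupied work area"]),
    Hazard("Wrong work area identified", "application",
           causes=["QR code placed at the wrong location"],
           consequences=["Unprotected maintenance workers"]),
])
print([h.name for h in model.unmitigated()])
```

A query such as `unmitigated()` hints at how a shared risk model could support the derivation of safety requirements: every hazard without a recorded barrier is a candidate for a new requirement.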


3.2 Requirements to models 


To facilitate the selection of design and risk mod- 
els, an initial set of 58 requirements to be fulfilled 
by the models is established within SafeT. The 
requirements were derived by reviewing the proc- 
ess requirements in the CENELEC standards EN 
50126, 50128 and 50129 (CENELEC 2017, 2011 
and 2003). The set of requirements acts as the 
evaluation criteria supporting the selection of tech- 
niques to be used in the development of the desired 
models. The identified modelling needs were refor- 
mulated in terms of requirements to the models as 
such and categorised as requirements concerning 


— Structure: to model the static aspects of a system 
at any system level, e.g. the possibility to sup- 
port any hierarchy of system levels, and describe 
any system level at the appropriate level of detail
without introducing unnecessary detail and
complexity at other system levels;

— Behaviour: to describe the dynamic aspects 
of a system at any level, e.g. the possibility to 
show how the behaviour and state of a system 
depends on, and changes with, the functionality 
of its sub-systems and components; 

— Interaction: to describe the reciprocal impact 
between a system and its environment, e.g. 
the possibility to show how the environment 
can influence, or be influenced by, the system, 
including anything to which the system connects 
mechanically, electrically or by other means; 

— Risk: to carry out the risk assessment and haz- 
ard control, e.g. the possibility to facilitate the 
identification of hazards associated with the 
system and events leading to these hazards, 
the determination of the risk associated with the 
hazards, and the identification of possible fur- 
ther safety requirements needed to reduce the 
risk to an acceptable level, at any system level; 

— Requirements: to identify and specify safety 
requirements, e.g. the possibility to provide the 
details necessary to explain and understand the 
requirements to the functions to be provided by 
the system, as well as any additional requirements
that are necessary to ensure proper functioning,
including contextual and technical
requirements;

— Design: to analyse the safety aspects of a design, 
e.g. the possibility to identify the need for, and 
analyse the effectiveness of, safety functions or 
any other barrier; and 

— Quality: to assure clarity, unambiguity, con- 
sistency, etc., e.g. the possibility to review the 
models for completeness of the identified safety 
requirements. 
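
To show how such categorised requirements could act as evaluation criteria, the sketch below computes, per category, the share of requirements that a candidate modelling technique fulfils. It is a hypothetical illustration: the identifiers, the three sample requirements and the assessment are placeholders, and the actual 58 requirements are not reproduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    STRUCTURE = "Structure"
    BEHAVIOUR = "Behaviour"
    INTERACTION = "Interaction"
    RISK = "Risk"
    REQUIREMENTS = "Requirements"
    DESIGN = "Design"
    QUALITY = "Quality"


@dataclass(frozen=True)
class ModelRequirement:
    ident: str          # hypothetical identifier
    category: Category
    text: str


def coverage(requirements, fulfilled_ids):
    """Fraction of requirements per category that a technique fulfils."""
    totals, hits = {}, {}
    for req in requirements:
        totals[req.category] = totals.get(req.category, 0) + 1
        if req.ident in fulfilled_ids:
            hits[req.category] = hits.get(req.category, 0) + 1
    return {cat: hits.get(cat, 0) / n for cat, n in totals.items()}


# Placeholder requirements, loosely paraphrasing the category descriptions.
REQUIREMENTS = [
    ModelRequirement("R1", Category.STRUCTURE,
                     "Support the breakdown of a system into its constituent parts"),
    ModelRequirement("R2", Category.STRUCTURE,
                     "Describe any system level at the appropriate level of detail"),
    ModelRequirement("R3", Category.BEHAVIOUR,
                     "Show how system state depends on sub-system functionality"),
]

# Hypothetical assessment of one candidate technique.
scores = coverage(REQUIREMENTS, fulfilled_ids={"R1", "R3"})
for category, share in scores.items():
    print(f"{category.value}: {share:.0%}")
```

Categories without any requirement in the list are simply absent from the result, so the same function can be applied to a subset of requirements relevant for a single life cycle phase.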


For each requirement, SafeT provided an expla- 
nation to guide the application of the requirement 
on models to be used in the RAMS life cycle. An 
example is shown in Figure 2. 


Requirement: The models must support the 
breakdown of a system into its constituent parts, in 
terms of system, sub-systems, and components. 
Explanation: A system generally consists of a hi- 
erarchy of subsystems and components, each of 
which can be understood as a system itself. It is 
therefore meaningful to speak about the different 
levels of a system, and represent these levels in 
such a way that the details presented for each level 
are adequate for this level. Furthermore, it should 
be possible to study the details at any system level 
by recursively opening up the system model down 
to the subsystem or component of interest. 


Figure 2. Example of a requirement and its explanation.


3.3 The use of models in the RAMS life cycle


The 58 requirements to models reflect needs identi- 
fied from an analysis of the tasks to be performed 
in the different phases of the RAMS life cycle. 
The requirements can therefore relatively easily be 
interpreted in this context by describing how they 
apply to the modelling needs in the first ten RAMS 
phases. The different requirements were gradually 
introduced along with possible procedures and 
flow charts. Concerning modelling, the concept 
phase can be carried out in accordance with the 
following procedure: 


1. Describe the needs and how these are met today 
without the system. 

2. Make a first informal description of the system 
and its environment. 

3. Make a first model of the system and its 
environment. 

4. Define the aspects to be analysed, including the 
aspects defined in EN 50126. 

5. Select an aspect for analysis. 

6. Analyse the aspect, refining the model to make 
it adequate for the analysis. 

7. If necessary, refine the model to make it repre- 
sent the analysis result adequately. 

8. Repeat from step 5 for the remaining aspects. 
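
The iterative part of this procedure (steps 4 to 8) can be sketched as a simple loop. The code below is a minimal illustration under our own assumptions: the model is represented as a plain dictionary, `analyse` is a caller-supplied function, and the refinement in step 7 is reduced to storing the analysis result in the model.

```python
from typing import Callable


def concept_phase_modelling(
    initial_model: dict,
    aspects: list[str],
    analyse: Callable[[dict, str], dict],
) -> dict:
    """Steps 4 to 8 of the procedure: analyse each aspect in turn.

    `initial_model` stands for the first model from step 3 and `aspects`
    for the list defined in step 4. `analyse` is a caller-supplied
    function that returns the findings for one aspect.
    """
    model = dict(initial_model)
    remaining = list(aspects)
    while remaining:
        aspect = remaining.pop(0)              # step 5: select an aspect
        findings = analyse(model, aspect)      # step 6: analyse it
        # Step 7: refine the model so that it represents the result.
        model.setdefault("findings", {})[aspect] = findings
    return model                               # step 8: loop until done


def note_missing(model: dict, aspect: str) -> dict:
    """Toy analysis: report whether the model describes the aspect."""
    return {"described": aspect in model.get("described_aspects", [])}


first_model = {"system": "black box", "described_aspects": ["scope"]}
result = concept_phase_modelling(first_model, ["scope", "purpose"], note_missing)
print(result["findings"])
```

In practice the refinement in steps 6 and 7 would change the model itself (for example by decomposing the black box), which this sketch does not attempt to show.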


The RAMS life cycle is initiated with the con- 
cept phase. The main objective of the phase is to 
investigate the overall system and its environment, 
confined to (1) scope, context and purpose, as well 
as (2) physical, interface, legislative and economic 
issues. This means that there already is some idea 
of a “system under consideration”, and some idea 
of the functionality that shall be offered, and most 
likely some constraints. The purpose of a model 
in this phase would therefore be to facilitate this 
investigation. Even if the system has not yet been 
defined in a proper sense, it will usually be possible 
to introduce the system as a black box, and concre- 
tize the aspects to be investigated. It might already 
in this phase even be possible to decompose this 
black box into a set of connected subsystems, each 
with its specific scope, context and purpose. 

Requirements posed to models in this phase 
demand the ability of the models to support differ- 
ent needs, for example: 


— support the breakdown of a system into its con- 
stituent parts, in terms of system, sub-systems, 
and components; 

— facilitate the treatment of systems, sub-systems 
and components as black boxes, for which the 
details on architecture, design and implementa- 
tion can be kept out of consideration, evaluating 
functions and hazards only at the boundaries; 

— describe the system as contained in its opera- 
tional environment; 


— show how the environment can influence, or be 
influenced by, the system, including anything to 
which the system connects mechanically, electri- 
cally or by other means; 

— show how man and organization can affect, or 
be affected by, the operation of the system; 

— use clear and intelligible means of description, 
such as formal notation for logical functions, 
natural language for introductions, justifications 
and representations of intentions, graphical rep- 
resentations of examples, semantic definition of 
graphical elements, and directories of specialised 
words; 

— be possible to communicate to the different 
stakeholders; 

— be understandable in themselves; 

— be understandable to the prospective user. 


4 APPLYING SYSML 


The Concept phase and System definition phase 
are focused on preparing the conceptual system 
model. The model acts as an input to the Risk 
analysis phase (see Figure 1). The first activity of 
the Risk analysis phase is the Hazard Identifica- 
tion (HI). This was the focus of a workshop in the 
SafeT project (see section 4.5) using the model- 
based description of an example case described in 
section 4.1. 

Related to the use of models, two questions were 
investigated: (1) whether the modelling technique 
selected on the basis of theoretical considerations 
(the identified requirements to models based on the 
standards) is also practical for phases 1 and 2, and
(2) whether the model-based description prepared 
is practical for the hazard identification activity. 


4.1 The securing work areas case 


In order to realistically evaluate existing techniques 
and develop the SafeT framework, the project chose 
a case example based on a concept of a new solu- 
tion for securing work areas (Sivertsen, 2014). The 
problem concerns the need to protect maintenance 
workers from accidents caused by the interference 
with the railway traffic. The concept involves the 
development of a software-based system for secur- 
ing the work areas from such interference. The 
basic requirements to such a system are to identify 
the workers’ position correctly, effectively block 
the correct work area, and prevent a premature 
unblocking of this work area. 

Figure 3. The securing work areas case.

In the proposed solution (see Figure 3), a safety
guard uses a smartphone both for the interaction
with the train dispatcher and for identifying the
work areas under consideration. The smartphone
contains a dedicated application with functionality
to manage the securing and releasing of the work
areas. Some of the characteristics of the functionality
are:


— The main Safety Guard (SG) selects the func- 
tions from the application on his smart phone, 
e.g. secure a Work Area (WA). 

— The scanning of the associated QR-code of a 
WA identifies both the SG and the WA. 

— The application communicates with the Sup- 
port System (SuS), which communicates with 
the Centralised Traffic Control (CTC) and other 
applications. 

— The SuS supervises the associated protocols. 

— The SuS supervises the secured WAs, and pre- 
vents the Train Dispatcher (TD) from prema- 
turely unblocking them. 
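
The basic behaviour described above (identify the work area from a QR code, keep track of secured work areas, and refuse a premature unblocking) can be illustrated with a toy class. This is a sketch under our own assumptions and not the SafeT design: the class, method names and identifiers are hypothetical, the safety guard is passed in explicitly instead of being derived from the scan, and the rule that only the securing guard may release a work area is assumed.

```python
class SupportSystem:
    """Toy stand-in for the Support System (SuS); not the SafeT design."""

    def __init__(self, qr_codes: dict[str, str]):
        # Maps a QR code to the work area (WA) it identifies.
        self._qr_codes = qr_codes
        # Maps a secured WA to the safety guard (SG) who secured it.
        self._secured: dict[str, str] = {}

    def secure(self, guard: str, qr_code: str) -> str:
        """An SG scans a QR code to secure the associated WA."""
        work_area = self._qr_codes[qr_code]   # KeyError for unknown codes
        if work_area in self._secured:
            raise ValueError(f"{work_area} is already secured")
        self._secured[work_area] = guard
        return work_area

    def release(self, guard: str, work_area: str) -> None:
        """Only the SG who secured the WA may release it (assumption)."""
        if self._secured.get(work_area) != guard:
            raise PermissionError(f"{guard} has not secured {work_area}")
        del self._secured[work_area]

    def may_unblock(self, work_area: str) -> bool:
        """The train dispatcher may unblock a WA only once it is released."""
        return work_area not in self._secured


sus = SupportSystem({"QR-001": "WA-1"})   # hypothetical identifiers
sus.secure("SG-A", "QR-001")
print(sus.may_unblock("WA-1"))            # secured, so unblocking is refused
sus.release("SG-A", "WA-1")
print(sus.may_unblock("WA-1"))
```

The check in `may_unblock` corresponds to the requirement that the SuS prevents the Train Dispatcher from prematurely unblocking a secured work area; communication with the CTC is left out entirely.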


4.2 SysML 


UML (Unified Modeling Language) was initially 
selected to be applied for modelling in phases 1 and
2 as it fulfils all of the related requirements to mod- 
els in these phases. However, UML’s focus is on 
supporting software analysis and design, while the 
system in our example case is not limited to soft- 
ware. Another important consideration was that 
the first RAMS phases are carried out at a higher 
system level (“the railway system level”), requiring 
a focus on the system as such and not merely on its 
software. Hence, we used SysML (Systems Modeling
Language) instead, which supports systems
engineering.

SysML is an extension of a frequently used 
subset of UML, and thus is expected to comply 
with most of the requirements to models that 
UML complies with. A SysML model is usually 
developed in a tool that stores the model entities 
with their characteristics and relations. The model 
entities can then be used in diagrams to present
graphical views on specific aspects, e.g. structural
or behavioural aspects.

Because of this unified model in the core of 
UML and SysML, they can be considered as a 
single but complex modelling technique. Further- 
more, they offer different kinds of diagrams where 
each kind can be considered as a modelling tech- 
nique in itself. 


4.3 Modelling the conceptual design 


Within the concept phase, modelling of the system
and its environment with respect to the following
aspects is required: (1) scope of the system, (2)
(application) context of the system, (3) purpose 
of the system, and (4) environment of the system 
(anything that could influence, or be influenced by, 
the system, including people and procedures). All 
of these aspects are expected to be considered in 
the context of RAMS performance. The system 
definition phase requires extending the model 
with: 


— functions and elements which need to be consid- 
ered in the risk assessment; 

— interfaces and interactions with the physical 
environment, other systems, humans, and other 
organisations; 

— operational requirements influencing the sys- 
tem, including a description of conditions, con- 
straints, logistics; 

— existing safety measures and assumptions that 
determine the limits for the risk assessment. 


The modelling was performed by an IT and 
dependability specialist with some experience in 
UML modelling, using a tool. A short textual 
description of the proposed system was the input 
to the modelling, and was analysed according to 
the needs of the two phases described above. The 
models were developed in an iterative process 
including consultation with the system owner. 

Diagrams were prepared for a HAZOP workshop,
e.g. Block Definition Diagrams (BDD)
about Work Area and related concepts (see 
Figure 4), Internal Block Diagrams (IBD) about 
the internal communication and interfaces of the 
Support System (see Figure 5), Use Case diagrams 
(UC) of the main functions of the Securing Work 
Area application, State Machine diagrams (STM) 
about the registerable states of a Work Area (see 
Figure 6), and Sequence Diagrams (SD) about the 
main functions of the application. 


4.4 Using the models to meet the needs of the 
concept phase and the system definition phase 


The concept phase modelling needs can be mainly
fulfilled by using BDDs and IBDs, since those needs
require representing the system with its elements
and environment, and their static, conceptual
relations. BDDs can depict an ontology. For example,
the BDD in Figure 4 identifies the main concepts
connected to a work area in the proposed solution,
their relevant characteristics, and their relations.
IBDs can visualize internal structure, lines of
communication and interfaces. For example, the
IBD in Figure 5 depicts the internal structure of
the Support System with its interfaces. The central
computer communicates with the applications via
its GSM-R receiver and transmitter; the operational
support computer is used by the operational
support staff to operate the system; the CTC
computer ensures the correct interaction between the
central computer and the CTC system. Another
SysML diagram type, not utilized by us, is the
Requirement Diagram (REQ), which could have
been useful for embedding the requirements in the
model if a structured requirement specification
had been available.

Figure 6. Example STM with the different states of a work area from the securing point of view.

The system definition phase modelling needs can
be partially fulfilled by using all five mentioned
diagram types. BDDs and IBDs can be used when the
system is further detailed (i.e. elements and interfaces
in the system definition description). UC and
STM diagrams as well as SDs are useful for depicting
the dynamic, behavioural aspects (i.e. functions
and interactions). For example, Figure 6 presents an
STM with the different states of a Work Area from
the point of view of the securing and releasing functions.
This diagram revealed, for example, that the states of
the Work Area (as seen by the system) were unclear
between “before securing”/“after releasing” and
when it was “secured”. Whether the states “WA
is blocked for securing” and “WA is not secured &
WA is blocked” are the same, and whether a transition
from the second directly back to the state “WA
is blocked & WA is secured” is possible, triggered
much discussion in the workshop.
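
The kind of ambiguity that surfaced in the workshop becomes explicit when the state machine is written down as a transition table. The sketch below uses the three state names quoted above; the event names and the two listed transitions are our own assumptions, and the full state set of Figure 6 may differ. The transition that the workshop could not settle is deliberately kept out of the table and flagged as an open question.

```python
from enum import Enum


class WorkAreaState(Enum):
    # State names quoted in the text; the full set in Figure 6 may differ.
    BLOCKED_FOR_SECURING = "WA is blocked for securing"
    BLOCKED_AND_SECURED = "WA is blocked & WA is secured"
    NOT_SECURED_BLOCKED = "WA is not secured & WA is blocked"


S = WorkAreaState

# Assumed transitions, keyed by (current state, event). Event names are invented.
TRANSITIONS = {
    (S.BLOCKED_FOR_SECURING, "secure"): S.BLOCKED_AND_SECURED,
    (S.BLOCKED_AND_SECURED, "release"): S.NOT_SECURED_BLOCKED,
}

# The transition the workshop could not settle; kept out of the table on purpose.
OPEN_QUESTIONS = {(S.NOT_SECURED_BLOCKED, "secure")}


def step(state, event):
    """Return the next state, or raise if the transition is undefined."""
    key = (state, event)
    if key in OPEN_QUESTIONS:
        raise LookupError(f"Undecided transition: {state.value!r} on {event!r}")
    if key not in TRANSITIONS:
        raise ValueError(f"No transition from {state.value!r} on {event!r}")
    return TRANSITIONS[key]


state = step(S.BLOCKED_FOR_SECURING, "secure")
state = step(state, "release")
print(state.value)
```

Distinguishing undecided transitions from forbidden ones is a design choice of this sketch: it keeps the open questions visible until the system owner has resolved them.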

Operational requirements were mostly not 
included in the model, but they can be added 
through REQ diagrams and by defining constraints.
Existing safety measures were not specifically
identified as such in the model; they
were depicted as regular parts of the diagrams.
However, there are suggestions in this direction, 
for example extending UC and SD for safety and 
security considerations (e.g. Misuse Cases, Failure 
Sequence Diagrams, Misuse Sequence Diagrams; 
an overview can be found in (Raspotnig, 2014)). 
Assumptions were either depicted as notes in the 
diagrams (e.g. see Figure 6), or as constraints. In 
summary, SysML has the potential to fulfil the 
needs of the concept and the system definition 
phases with respect to the requirements to models 
connected to these phases. 

One experience was that the modelling process
helped identify unclear and missing parts of the
case description, parts that are necessary for
persons not familiar with the planned system to
develop an understanding of it. It is quite hard for
a person involved in a task to judge what pieces of
information another person, with different expertise
and working on another aspect of the task, needs in
order to understand it. The necessary amount of
information is usually underestimated, which is also
reflected in the related system descriptions. Modelling
helps overcome this gap, but it does not guarantee
the completeness of the information provided.

Another experience was that modelling with 
SysML sometimes demands more details than 
available or expected in the conceptual design 
phase. In other words, it might be hard to draw the 
line between the conceptual design (defining the 
“what”) and detailed design (defining the “how”). 
For example, the conceptual design might stop at
the level where the actors and systems of the New
Solution are identified, maybe including the sub- 
systems of the Support System. However, includ- 
ing the SWA App in the model required some 
further details since it resides in the software part 
of the Smartphone, which is a subsystem of the 
Support System. SafeT will need to specify clear 
criteria or guidelines regarding the detailing of 
models at the different phases of the development 
of a planned system. 


4.5 Using the models in a HAZOP workshop


Two workshops utilizing HAZOP for hazard iden- 
tification (HI) were organized, one using only a 
textual description as input and the other using 
a model-based description as input. The hazard 
identification-related experiences from the workshops
are presented in (Skogvang, 2018). Here, we
focus on the modelling-related experiences from
the model-based workshop. A description based on
the diagrams, with limited text and an explanation
of the modelling language, was sent out one week
before the workshop.

Even though modelling helped identify
unclear and missing parts from the modeller’s
perspective, it gave no guarantee that these
covered every detail necessary for HI.
This became clear because the workshop participants
had many questions that were outside the scope
covered by the model but nevertheless important for
their understanding of the context and for identifying
hazards. A conclusion is that better coverage in
hazard identification requires that the relevant details
are included in the model and the diagrams. This could
be achieved, for example, by a preparatory workshop
focusing on eliciting such information, or by
involving a RAMS expert in the modelling.
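
One possible way to check, before a workshop, which model elements will be examined is to pair each element with each HAZOP guideword. The sketch below is only an illustration: the guidewords follow the style of IEC 61882, the selection actually used in the SafeT workshop is not given in the text, and the two model elements are hypothetical examples of messages that could appear in a sequence diagram.

```python
from itertools import product

# Guidewords in the style of IEC 61882; the selection used in the
# SafeT workshop is not given in the text.
GUIDEWORDS = ["NO", "MORE", "LESS", "AS WELL AS", "PART OF",
              "REVERSE", "OTHER THAN", "EARLY", "LATE"]

# Hypothetical model elements, e.g. messages taken from a sequence diagram.
ELEMENTS = ["secure work area request", "release work area request"]


def hazop_prompts(elements, guidewords):
    """Pair every model element with every guideword."""
    return [f"{gw}: {el}" for el, gw in product(elements, guidewords)]


prompts = hazop_prompts(ELEMENTS, GUIDEWORDS)
print(len(prompts))
print(prompts[0])
```

A list of this kind makes it visible which parts of the model would be discussed and, by omission, which topics raised by participants fall outside the model.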

Constructs in models can become complex, and
so can their visualization. According to the experiences
in the workshop, beyond a certain level of visual
complexity (e.g. when the whole diagram cannot
be shown at once, or becomes unreadable when it is),
understanding the diagram and following the train
of thought become cumbersome. One related problem
was following the flow of logic in SDs when branches
and parallel activities were involved. Modularization
might help with this issue.

During the workshop, an example of the physi- 
cal outline was drawn ad hoc as an illustration 
which was used a lot in the discussions. This sug- 
gests that a physical outline diagram could be part 
of the model. SysML has no obvious means for 
this, therefore another modelling technique might 
be required as support. Another consideration is 
that modelling specific, representative cases (e.g. 
application of the planned system at a specific 
work area) might be a necessary supplement to the
general model of the planned system. In our case, a
specific, representative train station could be con- 
sidered. The model-based description also missed 
some information, e.g. preconditions of the main 
functions of the software application, necessary 
for understanding how the system was intended 
to work. A question related to this is whether the 
workshop would have been able to process and uti- 
lize the information requested by the participants 
(defined terminology and roles, description about 
the old and current solutions, etc.). This needs to 
be taken into account when considering the use of 
models with other techniques. SafeT needs to pre- 
pare guidelines on how to use HAZOP in combi- 
nation with specific SysML diagrams. 


5 CONCLUSIONS 


In this paper we have elaborated on the experiences
of using SysML diagrams as support for the concept
and system definition phases. To the question
of whether the modelling technique selected on the 
basis of theoretical considerations is also practical 
for the two first phases, we can answer affirma- 
tively based on the experiences. The concept phase 
modelling needs can be fulfilled by using BDDs 
and IBDs since those needs require representing 
the system with its elements and environment, and 
their static, conceptual relations. BDDs can depict 
an ontology. The system definition phase model- 
ling needs can be partially fulfilled by using all five 
mentioned diagram types. 

However, further investigations and fitting 
guidelines will be necessary. Whether the model- 
based description prepared was practical for the 
hazard identification activities was not concluded,
but the HAZOP workshop suggests that the use of 
SysML models requires good preparation of the 
HAZOP, and the participants should be familiar 
with such modelling to benefit from the models. 
ABSTRACT: Dynamic Event Tree (DET) methodology has been developed to overcome the
limitations of the traditional Event Tree approach by taking the timing of events explicitly into account
through communication with the system model that describes the system's dynamic behavior during event sequence
construction. In addition to more rigorously accounting for process/hardware/software/human interactions,
this capability allows recoveries to be included within the sequence analysis. Furthermore, particularly for long
term scenarios, DET is able to model multiple failures and recoveries for a given system with
this capability. From a probabilistic point of view, modeling multiple failures and recoveries introduces a
major challenge since failure and recovery distributions for a given system can be correlated. Use of a
multidimensional distribution is proposed to address this challenge.


1 INTRODUCTION 


Dynamic Probabilistic Risk/Safety Assessment
(DPRA/DPSA) methodologies are those that, by
using a simulator for the system under analysis,
are able to explicitly model time-dependent system
evolution along with its stochastic behavior under
accident conditions (Aldemir, 2013). Among these
methodologies, Dynamic Event Tree (DET) is perhaps
the most popular one, given its similarity to the
static event-tree (ET) approach, but with the
capability to model the interaction between stochastic
events (e.g., failures, recoveries, etc.) and the
dynamic evolution of the system as a consequence
of these events.

A DET consists of an initiating event (e.g., sta- 
tion blackout, loss of offsite power, etc.), and a set 
of events that initiate system evolution in differ- 
ent directions, called branching conditions. Each 
branching point is defined by the analyst, and con- 
sists of a stochastic event described by (Alfonsi, 
2013): 


• values of process variables or thresholds, and
• corresponding probability distributions (e.g.,
the exponential distribution of the failure time).


When, during the simulation, the thresholds
corresponding to the branching points are met,
branches are generated that give rise to the typical
ET structure. Figure 1 shows a typical DET structure
with thresholds defined for values 0.33 and 0.66 of
the relevant cumulative distribution function (Cdf).
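
As a minimal sketch of how such Cdf thresholds translate into branching times, the snippet below inverts an exponential failure-time Cdf at the values 0.33 and 0.66. The failure rate and the function names are hypothetical, and assigning each branch the probability mass of the Cdf interval it represents is an assumption made here for illustration, not necessarily what a given driver does.

```python
import math

def exponential_branch_times(rate, cdf_thresholds):
    """Return the times at which an exponential failure-time Cdf reaches each threshold."""
    times = []
    for p in cdf_thresholds:
        if not 0.0 < p < 1.0:
            raise ValueError("Cdf thresholds must lie strictly between 0 and 1")
        # Inverse Cdf of the exponential distribution: t = -ln(1 - p) / rate
        times.append(-math.log(1.0 - p) / rate)
    return times

def interval_probabilities(cdf_thresholds):
    """Probability mass of each Cdf interval delimited by the sorted thresholds."""
    edges = [0.0] + sorted(cdf_thresholds) + [1.0]
    return [hi - lo for lo, hi in zip(edges, edges[1:])]

rate = 0.1  # hypothetical failure rate [1/h], chosen only for illustration
branch_times = exponential_branch_times(rate, [0.33, 0.66])
weights = interval_probabilities([0.33, 0.66])
```

With two thresholds, three Cdf intervals result; their masses sum to one regardless of the distribution chosen.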

The scheme presented in Figure 1 refers to one
of the codes currently used for generating DETs 
(i.e., RAVEN (Rabiti, 2016)). 

Since the timing of the events in the sequence 
is explicitly modeled with a DET, failure recovery 
can be included within the analysis by assigning a 
probability distribution to the recovery time. The 
modeling of system failure and recovery becomes 
particularly important when long term scenarios 
are considered. In these cases, multiple failure/
recovery cycles for a given system become possible
within the mission time.

Figure 1. DET scheme (Alfonsi, 2014).

The modeling of recoveries, however, introduces 
two main issues to consider. On one hand, as pre- 
viously mentioned, failures and recoveries for a 
single system can be multiple along the accident 
sequence. On the other hand, recovery and failure 
time distributions can be interdependent since, 
from a causality point of view, a recovery can 
occur only after a failure. 

Let us consider a system made up of different
components, each with a characteristic failure and
recovery distribution. The recovery time distribution
for the overall system will depend on the failure
time distribution of the system itself, through the
failed components. From the first recovery onwards,
each event will depend on the previous one: the
second recovery time will depend on the second
failure time, the second failure time will depend on
the first recovery time and, finally, the first
recovery time will depend on when the system first
failed. Therefore, in order to realistically represent
the behavior in this example with two failure/recovery
cycles, we need to use a 4-dimensional distribution to
be sampled according to the branching conditions.
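
The dependency chain described above can be illustrated with a small sampling sketch. It assumes, purely for illustration, that each time-to-failure and time-to-recovery is exponentially distributed and counted from the previous event; the rates and the function name are hypothetical and are not taken from the paper.

```python
import random

def sample_failure_recovery_chain(rng, failure_rate, recovery_rate, cycles=2):
    """Sample absolute event times (T_F1, T_R1, T_F2, T_R2, ...).

    Each event time is built on the previous one, so the components of the
    returned tuple are correlated even though the increments are independent.
    """
    times = []
    t = 0.0
    for _ in range(cycles):
        t += rng.expovariate(failure_rate)   # next failure, counted from the last recovery
        times.append(t)
        t += rng.expovariate(recovery_rate)  # recovery, counted from that failure
        times.append(t)
    return tuple(times)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = sample_failure_recovery_chain(rng, failure_rate=0.01, recovery_rate=0.5)
```

Each sample is one point of a 4-dimensional distribution whose components are ordered in time, which is the causality constraint mentioned in the text.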

The case of multiple failures and recoveries 
presents a unique challenge. In terms of physi- 
cal states of the system, the multiple failures and 
recoveries can be represented as a two-way transi- 
tion between two discrete states: an ON state and 
an OFF state (Picoco, 2017a). However, from a 
probabilistic point of view, each single transition 
(e.g., first failure, first recovery, second failure, 
etc.) could correspond to a different probability 
distribution. 

In this work, we address these two aspects of 
DET generation: failure/recovery behavior (poten- 
tially multiple) and use of multi-dimensional 
distributions. 

The paper is organized as follows. In Section 2, 
the framework for DET generation with multi- 
dimensional distributions is presented. In Sec- 
tion 3 some conclusions are drawn. 


2 FRAMEWORK TO MODEL MULTIPLE 
FAILURE/RECOVERY BEHAVIOR 
IN DET 


In order to generate a DET, a driver coupled with
a simulator is needed. Different driver-simulator
couples have been presented in the literature, such
as ADAPT-MELCOR (Hakobyan, 2006),
ADAPT-MAAP4 (Rychkov, 2015), RAVEN-RELAP7
(Alfonsi, 2013), MCDET-MELCOR (Hofer, 2002),
ADAPT-SAS4-SASSYS-1 (Jankovsky, 2015), and
RAVEN-MAAP5 (Picoco, 2017b).

In the coupling, the driver is the code responsible
for the probabilistic aspects. Generally, the driver
requires the definition of a probability distribution
for each branching condition. The branching conditions
are expressed as a grid of thresholds, defined in
terms of either values of the process variables/system
configuration or Cdfs. The driver is responsible for
generating the different branches and managing their
runs. The driver also collects the results from the
simulator and, for some drivers, performs any
post-processing analysis set by the analyst.

The simulator provides the branch consequences
and simulates the plant evolution as the different
branchings occur. In the nuclear field, the simulator
(e.g., MELCOR (Summers, 1981), MAAP5 (MAAP5, 2015),
RELAP7 (Anders, 2012)) is able to predict the behavior
of the reactor under several accident conditions as
the different branches occur. The control logic of
the simulator is often used to model operator actions
and possible interactions among the different branches.

The driver and the simulator usually exchange
information during the DET generation in order
to create the different branches, based on the plant
state.

Overall, the DET generation process is the following
(the process can vary slightly depending on
the driver-simulator couple used):


1. Simulation starts: the first branch is run. 

2. The simulation stops when a branching point 
is met. 

3. Two (or more) branches are created by the 
driver, each corresponding to a different new 
simulator instance, and run in parallel. 

4. The simulation of each branch progresses
until the next branching point is met. 

5. When all the branches simulated are completed, 
the DET is generated. 
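
The five steps above can be mirrored in a toy driver loop. This is only a sketch under stated assumptions: the branching points, their outcome labels and probabilities are hypothetical, the "simulator" is a stub that simply reports the next branching point, and branches are processed one after another from a queue rather than in parallel.

```python
from collections import deque

# Hypothetical branching points, in the order the toy simulator meets them.
# Each maps an outcome label to its conditional branch probability.
BRANCHING_POINTS = [
    {"pump_runs": 0.9, "pump_fails": 0.1},
    {"valve_opens": 0.8, "valve_stuck": 0.2},
]

def run_until_next_branching_point(history):
    """Toy simulator: index of the next branching point, or None when the run is complete."""
    depth = len(history)
    return depth if depth < len(BRANCHING_POINTS) else None

def generate_det():
    """Toy driver: explore every branch and return (history, probability) pairs."""
    completed = []
    queue = deque([((), 1.0)])  # step 1: the first branch starts with an empty history
    while queue:
        history, prob = queue.popleft()
        point = run_until_next_branching_point(history)  # steps 2 and 4
        if point is None:
            completed.append((history, prob))
            continue
        for outcome, p in BRANCHING_POINTS[point].items():  # step 3: one new branch per outcome
            queue.append((history + (outcome,), prob * p))
    return completed  # step 5: all branches completed

det = generate_det()
```

The history probabilities here are plain products of the branch probabilities, which is the independent treatment that the following paragraphs argue is insufficient for correlated failures and recoveries.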


In order to define each branch, as previously 
mentioned in Section 1, a probability distribution 
is needed. 

In the case of multiple failures and recoveries, a
first approach would be to treat these conditions
independently, defining a different distribution
(and corresponding branching points) for each
event. However, this type of modeling will not
account for their intrinsically correlated behavior.

Most drivers, so far, require branching points to be
defined in terms of 1-dimensional distributions and
a 1-dimensional grid for each variable. As a first
approach to the use of an N-dimensional distribution,
we therefore propose the following framework:
1. From the N-dimensional distribution, define the
corresponding N 1-dimensional marginal distributions.
It is worth recalling that, starting from
a joint N-dimensional distribution, the marginal
of a variable describes the probability distribution
of that variable only. Formally, given the
N-dimensional probability distribution function
$f(x_1, \ldots, x_N)$, the marginal for the generic
$x_i$ is

$$f(x_i) = \int \cdots \int f(x_1, \ldots, x_N) \, dx_1 \cdots dx_{i-1} \, dx_{i+1} \cdots dx_N \qquad (1)$$

The general concept of a multidimensional distribution
and the corresponding marginals is shown in
Figure 2 for the case of a bivariate normal
distribution. In Figure 2, the two marginal
distributions for $x_1$ and $x_2$ are $f(x_1)$ and $f(x_2)$,
respectively.

2. Once 1-dimensional distributions for each 
variable are obtained, it is possible to use 
them with the driver and define for each 
variable the grid of the branching points 
independently, as is current practice with DET 
generation. 

3. Run the simulation and generate the DET. 

4. Once the DET has been generated, post-process 
the results to recalculate the probabilities of 
each history based on the values of the mul- 
tidimensional distribution rather than on the 
marginal. 


This approach makes it possible to account for
correlated probabilities without the need to re-run
the DET, while defining the branching conditions in
the same way as for 1-dimensional distributions.
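
The post-processing idea in step 4 can be sketched on a discretized two-variable example, where the integral of Eq. (1) becomes a sum over grid cells. The joint probabilities and cell labels below are hypothetical values chosen only to show a correlated case; the function names are likewise illustrative.

```python
# Hypothetical joint probabilities of (x1 interval, x2 interval); illustrative values only.
JOINT = {
    ("x1_low", "x2_low"): 0.40,
    ("x1_low", "x2_high"): 0.10,
    ("x1_high", "x2_low"): 0.10,
    ("x1_high", "x2_high"): 0.40,
}

def marginal(joint, axis):
    """Discrete analogue of Eq. (1): sum the joint over all variables except one."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def reweight_histories(joint):
    """For each history, compare the marginal-based probability with the joint one."""
    m1, m2 = marginal(joint, 0), marginal(joint, 1)
    table = {}
    for (a, b), p_joint in joint.items():
        table[(a, b)] = {
            "marginal_product": m1[a] * m2[b],  # probability the DET would assign
            "joint": p_joint,                   # recalculated probability
        }
    return table

histories = reweight_histories(JOINT)
```

In this example both marginals are uniform, so a DET built on the marginals would weight every history at 0.25, whereas the joint distribution assigns 0.40 or 0.10; the set of histories is unchanged and only their weights are recomputed.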



Figure 2. Two-dimensional distribution and corresponding
marginals for a bivariate normal distribution.


3 CONCLUSIONS 


DETs are currently used to analyze the possible 
evolution of an accident scenario starting from a 
given initiating event. In general, as in traditional 
ET, only failures of the different systems involved 
in the accident evolution have been considered. By 
explicitly modeling time, DET has the capability 
to include recoveries within the model. Including 
recoveries, and even multiple failures and recoveries 
for the same system, becomes particularly impor- 
tant when long mission times are considered. 

In this paper, we have presented a framework
for dealing with DET generation in the case of
multiple failures and recoveries for a given system
modeled by a multidimensional distribution. In
summary, the approach proposed in this work consists
of the following: a) variables are sampled starting
from their marginal distributions, b) the DET
is generated using the marginal distributions, and
c) history probabilities are then recalculated based
on the values of the N-dimensional distribution.

This framework represents a first approach to
the use of multidimensional distributions in DET.
The approach is valid in regular cases (e.g.,
multidimensional uniform, multivariate normal).
However, sampling from the marginals can, in some
cases, lead to a low density of sampling points in
probabilistically relevant regions of the
multi-dimensional distribution.
Using an enterprise architecture model for assessing the resilience
of critical infrastructure

ABSTRACT: Assessing the resilience of Critical Infrastructure (CI) is a complex problem. Complexity
becomes especially high when hybrid threats are considered and, additionally, crises span interdependent
sectors and sovereign borders, with time-variable cascading effects. Models may help address complexity, 
since they are simplified representations of systems, intended to promote understanding within some 
domain of discourse. Furthermore, Enterprise Architecture (EA) allows for managing and visualizing 
integrated model repositories. In this paper, we propose an EA model to assist resilience assessments, 
performed using a Resilience Assessment Framework (RAF). To ensure that the framework was fit for 
purpose according to the evaluators’ needs, a new version of an existing RAF was designed and tested, 
using a Design Science Research Methodology (DSRM) process model. To measure the value of the 
EA model, we performed a comparative evaluation of the new RAF’s usefulness—i.e. with and without 
assistance of the EA model. We conclude that the proposed EA model is useful for assisting resilience 
assessment initiatives for CI. The main scientific contributions of this paper are the validation of the 
EA model’s usefulness, a set of EA viewpoints for assisting resilience assessments of CI, as well as an
evaluation of the new version of the RAF.


1 INTRODUCTION 


Assessing the resilience of Critical Infrastructure 
(CI) presents both conceptual and implementation 
challenges. At the conceptual level, frameworks 
and standards are required to align terminol- 
ogy (ISO 2009, 2012), provide reference models 
(ISACA 2013a), enable consistent audit and assur- 
ance programmes (NIST 2014, ISACA 2013b), as 
well as to facilitate communication, cooperation, 
and collaboration. At the implementation level, 
adequate methods and tools are required to enable 
effective and efficient assessment initiatives. 

Models are important to address the complex- 
ity of resilience assessments, since they are simpli- 
fied representations of the system of interest. Also, 
as we demonstrate in this paper, models may help 
clarify conceptual and methodological ambiguities. 
Furthermore, Enterprise Architecture (EA) allows 
for managing and visualizing integrated model 
repositories (Lankhorst 2013), thus enhancing the 
performance of assessment initiatives — when com- 
pared with the exclusive use of spreadsheet-like 
artifacts, informal diagrams, and natural-language 
descriptions. 

In this paper, we propose an EA model to assist 
the implementation of resilience assessment ini- 
tiatives. The EA model’s usefulness was evaluated 
using a demonstration, evaluation questionnaires,
and group sessions, to measure the efficacy,
generality, consistency, simplicity, and clarity of the
artifacts. For the demonstration and evaluation, 
we used a new version of an existing Resilience 
Assessment Framework (RAF) — from Cadete et al 
(2017). 

We conclude that the proposed EA model is 
useful for assisting resilience assessment initia- 
tives, by helping to achieve three framework objec- 
tives: provide a logical link between management 
and operational indicators for resilience, clarify 
conceptual and methodological ambiguities, and 
facilitate the implementation of resilience assess- 
ment initiatives. 


2 METHODOLOGY 


In this paper, we performed two iterations of a 
Design Science Research Methodology (DSRM) 
process model (Peffers et al 2014), for guiding the 
construction and evaluation of the architectural 
artifacts. 

DSRM incorporates principles, practices, and 
process models which are adequate to conduct 
design science research in applied research disci- 
plines, whose cultures value incrementally effective 
solutions (Hevner & Chatterjee 2010). The design 
science paradigm seeks to create and evaluate 
“what is effective” in the problem space (Hevner
et al. 2004).

Figure 1. DSRM process model used in this paper, with the activities Identify Problem & Motivate, Define
Objectives of a Solution, Design & Development, Demonstration, Evaluation, and Communication. Two DSRM
iterations were performed: in the first DSRM iteration, a new conceptual model for the RAF was designed and
tested; in the second DSRM iteration, an EA model for the new RAF was designed and tested.

Figure 2. Evaluation criteria, taken from Prat et al
(2014). The evaluation criteria used for evaluating the
new RAF are highlighted with ellipses.

Table 1. Evaluation objectives for the new RAF.

System        Evaluation criteria/     Objectives for
dimension     sub-criteria             the new RAF
Goal          Efficacy                 Integration of disaster risk management aspects
                                       Risk is associated to effect/impact on relevant objectives
                                       Indicators are relevant for risk management
                                       Management indicators relate to operational indicators
              Generality               May be tailored for any CI organization
                                       Is not overly prescriptive
                                       Cross-sector generality
                                       Cross-border generality
Environment   Consistency with         Useful for CI organizations
              organization/utility
              Consistency with         Easy to implement for CI managers
              people/ease of use
Structure     Simplicity               Is simple to communicate and understand
              Clarity                  The concepts and methods are clear and unambiguous

The first DSRM iteration (Fig. 1) was performed
to ensure that the framework was fit for
purpose according to the evaluators’ needs. Had
this iteration been omitted, the evaluators’ ratings
might have been negatively affected by perceived 
deficiencies in the reference framework design 
(Cadete et al. 2017). In the second DSRM itera- 
tion, an EA model was designed, for modeling the 
new RAF that resulted from the first DSRM itera- 
tion. This strategy allows for comparing the two 
sets of DSRM results, ensuring that the evaluation 
rating differences (i.e. between the first and second 
DSRM iterations) are a reasonably good measure 
of the benefits of using the EA model. 

As shown in Fig. 1, the DSRM process model 
includes an evaluation activity. For both DSRM 
iterations, we adopted the same evaluation criteria 
(see Fig. 2), as well as the same evaluation objec- 
tives (see Table 1), that were used for evaluating 
the RAF from Cadete et al (2017) — based on the 
evaluation taxonomy from Prat et al (2014). 


3 RESEARCH PROBLEM 


In Cadete et al (2017), the RAF evaluation results 
showed relatively weak ratings, regarding the 
achievement of three objectives: 


— Consistency with people, ease of use: easy to 
implement for CI managers; 

— Structural clarity: the concepts and methods are 
clear and unambiguous; 

— Generality: may be tailored for any CI 
organization. 


In related work, EA artifacts were used suc- 
cessfully to assist business continuity planning 
(Gomes et al. 2017) as well as to represent proc- 
ess-based frameworks (Vicente et al. 2013). Also, 
EA is recommended as best practice for guiding 
the creation and maintenance of the governance 
and management enablers for information sys- 
tems and related technologies (ISACA 2012a). 
Finally, it is important to note that when holistic 
frameworks are represented, complex graph-like 
structures of entities and relationships emerge 
due to the variety of concerns (e.g. many sectors, 
many countries, and many areas of expertise), as 
well as due to the complex networks of dependen- 
cies. Such complex graph-like structures may be 
represented and managed using EA methods and 
tools. 

These facts lead to the hypothesis that EA tech- 
niques and artifacts might help address some of 
the implementation issues identified previously by 
the RAF evaluators in Cadete et al (2017). 

The research problem is therefore to find and 
validate an EA solution to address the RAF con- 
ceptualization and implementation shortcomings, 
namely regarding goal efficacy, environmental con- 
sistency, and structural simplicity and clarity. 


4 DESIGN AND DEVELOPMENT 


To help improve the design of the new RAF, we
selected senior evaluators from the defense sector
with cyber-physical expertise. This selection differs
from the evaluators selected for evaluating the
reference RAF, who were experts from civilian sectors
(ICT, water and wastewater, and financial sectors).

Figure 3. The new RAF Process Reference Model (PRM) design, resulting from the first DSRM iteration. This
artifact was used in both DSRM evaluations (first and second). This new design ensured that the RAF
was sufficiently fit for purpose, according to the evaluators’ needs, thereby minimizing negative bias in the
evaluators’ ratings regarding the EA model’s usefulness.

To improve the fit for purpose of the RAF artifacts,
according to the evaluators’ needs, the first
DSRM design and development activity was dedi- 
cated to producing a new RAF design, shown in 
Fig. 3. The differences between the reference RAF 
design and the new RAF design concern the Proc- 
ess Reference Model (PRM), and are the following 
(see Fig. 3): 


— 6 new process areas were added to the PRM:
    — Doctrine, Principles, Policies, and Frameworks;
    — Organizational Structures;
    — People, Training, and Education;
    — Leadership, Culture, Ethics, and Behavior;
    — Crisis Coordination and Cooperation;
    — Stakeholder Communication and Public
    Relations.
— The generic governance and management areas
were sub-divided according to:
    — Generic process areas;
    — Structural enablers;
    — Relational enablers.


Essentially, these additions relate to standard
defense planning capabilities (DOTMLPF-I, the
acronym standing for Doctrine, Organization,
Training, Materiel, Leadership, Personnel, Facilities,
and Interoperability), as well as to COBIT 5
enablers (ISACA 2012b).

Figure 4. The RAF Goals Cascade and PAM model (modeled using ArchiMate). The organizational drivers, needs,
and goals cascade to a representation of the PAM elements (process purpose, outcomes, practices, inputs, outputs, and
related guidance).

The Process Assessment Model (PAM), the 
Process Measurement Model (PMM), and the 
Goals Cascade Methodology (GCM) remained 
unchanged in the new RAF design. 

For the second DSRM iteration, an EA model
was created to model the new RAF
that resulted from the first DSRM iteration. The
ArchiMate (The Open Group 2016) modeling language
was used for the EA representations. An open
source EA tool (Archi 2017) was used to provide
an integrated EA repository, as well as the ArchiMate
views. The viewpoint representing the goals
cascade and the PAM is shown in Figure 4. Note that
this viewpoint clearly shows the relations between the
goals cascade elements (stakeholder drivers and needs,
organizational goals, business area goals, and enabling
process goals), as well as the relations between
top-down and bottom-up alignment, which is important
for relating management and operational concerns.


5 DEMONSTRATION 


For demonstration purposes, we used Fig. 4 (goals 
cascade and PAM) as well as ArchiMate artifacts 
taken from Gomes et al (2017) (Figs. 5, 6). These 
artifacts provide an EA model that instantiates 
the RAF for the following resilience assessment 
scenario: 


— RAF goals cascade Business Area (see Fig. 4):
    — COBIT 5, information and related
    technologies.
— RAF PRM process area:
    — “Business Continuity Management, Availability,
    and Capacity” (see Fig. 3).
— RAF PAM process capability level:
    — Process capability level 1 (see Fig. 5).

Figure 5. Viewpoint for the COBIT 5 manage continuity process, for process capability level 1 assessments. Taken
from Gomes et al (2017).

Figure 6. Resilience assessment evidence (middle-row and lower-row elements) mapped to outputs (upper-row elements)
of the Manage Continuity COBIT 5 process. Taken from Gomes et al (2017).


In other words, in this scenario we are assess- 
ing process performance (i.e. achieving process 
goals) for the process area “Business Continuity 
Management, Availability, and Capacity”, focus- 
ing on information and related technologies’ 
concerns. For adequate coverage of such informa- 
tional concerns, we have used the state-of-the-art 
COBIT 5 governance and management framework.

Real-world formal assessments must be justified 
and documented using evidence. We demonstrated 
the instantiation of resilience assessment evidence 
using the ArchiMate view in Fig. 6, which represents
the mapping between the evidence (middle-row
and lower-row elements) and the outputs of the
Manage Continuity COBIT 5 process (upper-row
elements). This demonstration artifact was taken 
from Gomes et al (2017). 


6 EVALUATION 


The evaluation results are shown in Table 2. For 
rating the achievement of RAF objectives, we used 
a standard ordinal scale that is used for assessing 
outcomes (ISO 2015): 


— “FA” = fully achieved 

— “LA” = largely achieved 
— “PA” = partially achieved 
— “NA” = not achieved 


In the right column “Gain” we show the added
value of using the EA model (i.e. the difference between
the second DSRM and first DSRM ratings). Each
“+” sign accounts for one rating improvement (e.g.
from PA to LA, or from LA to FA). A change from
PA to FA is thus represented with two “+” signs.
Where no rating changes occurred, an “=” sign was
used.

In the “Research Problem” section, we reported
three issues that were found in the reference RAF:


— Consistency with people, ease of use: easy to 
implement for CI managers; 

— Structural clarity: the concepts and methods are 
clear and unambiguous; 

— Generality: may be tailored for any CI 
organization. 


From the results presented in Table 2, we can
observe only a minor improvement in achieving
the objective “may be tailored for any CI organization”.
However, higher gains were obtained for the
objectives “management indicators relate to
operational”, “easy to implement for CI managers”, and
“concepts/methods: clear and unambiguous”.

Using an EA model (introduced in the second 
DSRM iteration), we have thus obtained signifi- 
cant gains in the system dimensions: 


— Goal efficacy: management indicators relate to 
operational; 

— Environmental consistency: easy to implement 
for CI managers; 

— Structural clarity: the concepts and methods are 
clear and unambiguous. 


For all the remaining objectives, no achievement 
degradation was observed. 

These results are consistent with the hypoth- 
esis that EA models are useful to assist resilience 
assessment initiatives, since the only difference 
between the first and second DSRM iterations is, 
precisely, the use of the proposed EA model. 
Table 2. Evaluation ratings for the two DSRM iterations. The column “Gain” shows the benefits of using the EA
model: each “+” accounts for one rating improvement, e.g. from PA to LA, or from LA to FA. A change from PA to
FA is thus accounted for using two “+” signs.

Objectives for the RAF                         1st DSRM          2nd DSRM          Gain
Disaster risk management aspects               FA, FA, FA, FA    FA, FA, FA, FA    =
Risk is associated to effect on objectives     FA, FA, FA, FA    FA, FA, FA, FA    =
Indicators are relevant for risk management    FA, FA, FA, FA    FA, FA, FA, FA    =
Management indicators relate to operational    FA, LA, LA, PA    FA, FA, FA, FA    +++
May be tailored for any CI organization        FA, LA, LA, LA    FA, LA, LA, FA    +
Not overly prescriptive                        FA, FA, FA, FA    FA, FA, FA, FA    =
Cross-sector generality                        FA, LA, LA, FA    FA, LA, LA, FA    =
Cross-border generality                        FA, LA, LA, FA    FA, LA, LA, FA    =
Useful for CI organizations                    FA, FA, FA, FA    FA, FA, FA, FA    =
Easy to implement for CI managers              LA, LA, PA, PA    FA, FA, LA, LA    t
Simple to communicate and understand           FA, FA, FA, FA    FA, FA, FA, FA    =
Concepts/methods: clear and unambiguous        LA, LA, LA, LA    FA, LA, FA, FA


Interestingly, the efficacy objective “management
indicators relate to operational” received maximum
ratings in the second DSRM iteration (i.e.
with EA model), a significant upgrade from the
first DSRM ratings (i.e. without EA model). These
results reflect the expressiveness benefits of the EA
model, which provided an integrated representation
including all levels of the assessment rationale (from
high-level organizational drivers and needs, down
to low-level assessment evidence) in a graph-like
conceptual structure.

Note that these EA representations may be 
stored in an integrated EA repository, which 
means that they can be reused in several assess- 
ment initiatives, as well as integrated in the larger 
EA landscape of the CI organization. 

However, during the group sessions, the evaluators
commented that additional ontological artifacts
(such as more thorough formal definitions for
entities and relationships, as well as ontological
mappings) are needed to further clarify the
framework’s semantics, as well as to ease real-world
implementation initiatives in critical infrastructure
operators.


7 CONCLUSION 


Assessing the resilience of Critical Infrastructure 
(CI) is a complex conceptual and implementation 
challenge. Modeling artifacts such as languages, 
methods, and tools, are instrumental to address 
such complexity. Furthermore, EA models, meth- 
ods, and tools allow for managing and visualizing 
integrated model repositories, thus providing a 
powerful complement to representations based on 
spreadsheet-like artifacts, informal diagrams, and 
natural-language descriptions. 


It is important to note that this work does not 
prove that the new RAF is an improved version of 
the reference RAF from Cadete et al (2017). Also, 
no claims are made in relation to the relative ben- 
efits of the proposed ArchiMate artifacts used in 
the demonstration, vis-a-vis other EA modeling 
languages. Future work may address optimization 
of the RAF and related EA models, to assist actual 
assessment initiatives. 

However, the evaluation results are consistent 
with the hypothesis that EA models are useful to 
assist resilience assessment initiatives. Such results 
are also consistent with the informal feedback elic- 
ited during the group sessions. 

Regarding limitations, the evaluators com- 
mented that additional ontological artifacts are 
needed, namely for achieving higher ratings for 
the generality goals (such as achieving cross-sector, 
cross-border, and tailoring for any CI organiza- 
tion), as well as to improve the framework’s con- 
ceptual clarity and ease of implementation. 

Also, EA models may not be as useful for frame- 
works that are based on simple checklists or matri- 
ces of indicators. For these cases, spreadsheet-like 
artifacts and natural-language descriptions may 
be sufficient to assist the assessment initiatives. 
Note, however, that such simple artifacts may not 
provide the optimal solution for assisting holistic 
frameworks that comprise several points of view 
(i.e. many related concerns, many sectors, many 
countries, and many areas and levels of expertise) 
and account for complex networks of dependencies
and interdependencies. In these cases, a complex
graph-like structure emerges and may be
successfully addressed with adequate EA models,
methods, and tools.

A secondary contribution of this paper is the
new version of the resilience assessment framework,
as well as its set of evaluation ratings. This
new version and evaluation ratings may be used to
inform design, development, and testing for future
DSRM iterations.

The main contributions of this paper are the 
validation of the EA model’s usefulness for assist- 
ing resilience assessment initiatives, as well as a set 
of EA viewpoints that may be reused, improved, or 
adapted for actual resilience assessment initiatives.


ABSTRACT: The life and recovery factor of existing subsea gas fields and infrastructure may
be increased by installing boosting facilities to compensate for declining well pressures. Installing
such boosting facilities subsea has often been identified as more cost-efficient than installing them topside.
A recent example is the Asgard Subsea Gas Compressor, installed and started up in 2015 on the Norwegian
Continental Shelf. The compressor system is highly complex, involving, beyond the compressor itself,
numerous pipes, valves, sensors, a liquid removal facility and a liquid pump. The design of control and
safety systems is based on requirements in regulations and key standards, many of which build on topside
philosophies for process safety and protection. Ongoing research in the Centre on Subsea Production
and Processing (SUBPRO) investigates whether new requirement formulation methods can verify that the
current subsea safety and control philosophy is adequate. A motivation is to identify areas of improvement
for future subsea installations or similar systems. One such method is the Systems-Theoretic Process
Analysis (STPA), which has been developed specifically for hazard identification in system control
architectures. The main advantage of STPA over other hazard identification techniques is its ability to
capture system failures that may arise from the communication between equipment in the control
architecture, and this insight can be used to build more robust and reliable systems. STPA has already been
adopted in many different sectors and domains, but has not yet been tested for subsea processing systems.
The main objectives of this paper are: (1) to apply STPA to a subsea processing system, namely a subsea
compression system; (2) to discuss opportunities and challenges of applying STPA to subsea compression
systems; and (3) to extend the discussion to the general use of STPA and the necessity of improving the method.


1 INTRODUCTION


1.1 Background


The life and recovery factor of a subsea gas
reservoir depends on the reservoir pressure and
pressure loss in the production system. The reservoir
pressure is typically higher than the pressure loss
in the first period of gas production, so that the
production rate can be maintained (Monsen et al.,
2012). However, at some point during the field life,
installation of a boosting facility may be needed to
compensate for declining well pressure and extend
the plateau production (Baggerud et al., 2007). In
other cases, long distance transport increases
pressure loss, and consequently, the gas production can
be sustained at lower pressure.

One traditional and proven solution in these
cases is topside gas compression, but subsea gas
compression has often been identified as more
cost-efficient than topside gas compression.
Subsea gas compression can sustain higher production
rates with lower power consumption, because the
compressor is closer to the well (Lima et al., 2011).
In addition, unmanned operation of subsea gas
compression reduces operation costs (Lima et al.,
2011). On the other hand, the application of
subsea gas compression has been technically
challenging due to large electrical power consumption, the
need for fast acting control and use of complex
equipment (Baggerud et al., 2007).

To prevent hazardous events in subsea gas
compression systems, the control and safety
systems of subsea gas compression are designed in
accordance with regulations and key standards.
However, many of the regulations and key
standards for subsea gas compression systems build on
topside philosophies for process safety and
protection (Kim et al., 2016). Research is currently
underway in the Centre on Subsea Production
and Processing (SUBPRO) to investigate whether
new requirement formulation methods can help
verify the adequacy of subsea safety and control
philosophies (SUBPRO, 2017). One such method
is the Systems-Theoretic Process Analysis (STPA),
a hazard identification method that was recently
developed based on the Systems-Theoretic
Accident Model and Processes (STAMP).

STPA has been widely adopted in cyber
security (Young and Leveson, 2013, Salim, 2014,
Young, 2014, Schmittner et al., 2016), aerospace
(Ishimatsu et al., 2010, Nakao et al., 2011,
Leveson, 2014), aviation (Leveson et al., 2014, Chen
et al., 2015, Allison et al., 2017), medical devices
(Antoine, 2013, Samost, 2014, Proctor et al., 2015,
Zhang et al., 2017), and other domains. However,
there is a limited number of studies on STPA
application in the oil and gas industry, and only one
conference paper has investigated the application of
STPA to subsea systems (Rachman and Ratnayake, 2015).
To the best of our knowledge, no study has conducted
STPA on subsea processing systems and discussed
the opportunities and challenges of applying STPA to
such systems.


1.2 Objectives 


The main objective of this paper is to apply STPA 
to a subsea gas compression system and discuss 
associated opportunities and challenges. This 
main objective is further developed into three sub- 
objectives: 


— To conduct STPA analysis on a general subsea 
gas compression system and summarize the 
results 

— To discuss opportunities and challenges of applying
STPA to subsea processing systems, based
on the results of the analysis

— To extend the discussion to the general use of 
STPA and necessity to improve the method 


1.3 Structure of the paper 


The remainder of this paper is organized as follows:
the subsea gas compression system is introduced in
Section 2, and STPA is applied to a typical subsea
dry gas compression system in Section 3. A summary
of the analysis results and a discussion follow
in Section 4.


2 SUBSEA GAS COMPRESSION SYSTEM 


2.1 Subsea processing 


Any handling and treatment of the produced 
hydrocarbon fluids prior to reaching the platform 
or onshore can be defined as subsea processing, 
e.g. subsea boosting, subsea separation and subsea 
gas compression (Bai and Bai, 2012). Compared 
with topside processing, the advantages of subsea 
processing are (Bai and Bai, 2012): 


— Accelerated and/or increased production and/or
recovery;

— Enabling marginal field developments, especially
fields at deepwater/ultra-deepwater depths and
with long tie-backs;

— Extended production from existing fields;

— Enabling tie-in of satellite developments into
existing infrastructure by removing fluid;

— Handling constraints;

— Improved flow management;

— Reduced impact on the environment.


2.2 Subsea gas compression 


Since the late 1980s, several oil companies and 
research institutions tried to develop and com- 
mercialize subsea gas compression technology 
(Vintersto et al., 2016), because a well can produce 
at lower wellhead pressures with subsea gas com- 
pression, thereby accelerating gas production and/ 
or increasing recovery rate (Kuhnle et al., 2015). 

However, until 2005 subsea gas compression was
considered to require extensive further technology
maturation, while the other subsea processing
technologies were classified as mature or as having a
high technical maturity level (Fantoft, 2005).

On 16th September 2015, the world’s first
commercial subsea gas compression station was
started up on the Asgard field, and it was followed
by the station on the Gullfaks field (Vintersto et al.,
2016, Wadel-Andersen and Moe, 2016).

There are currently two different solutions for
subsea gas compression: dry gas compression and
wet gas compression (Tønnessen and Romanello,
2017). The former was applied at the Asgard subsea
gas compression station, while the latter is the
concept for the Gullfaks subsea compression project.
In dry gas compression, gas and liquid in the well
stream are separated and boosted by a compressor
and a pump respectively. In the wet gas compression,
on the other hand, the well stream is boosted
directly by a multiphase wet gas compressor without
separation (Tønnessen and Romanello, 2017). Dry
subsea compression is considered the standard
solution, because it adopted some common principles
from conventional topside compression (Dettwyler
et al., 2016, Tønnessen and Romanello, 2017).
Similarly, wet compression is called well-stream
compression (Dettwyler et al., 2016).


Figure 1. A typical subsea dry gas compression system
(API RP 17V, 2015).


2.3 General configuration of subsea dry gas 
compression system 


API RP 17V (2015) provides a diagram that
includes subsea dry gas compressors with typical
safety devices. For the analysis in this paper, a liquid
discharge valve and a flow transmitter were added,
as shown in Figure 1.

Abbreviations in Figure 1 are:


— M: Motor 

— FT: Flow Transmitter 

— TSL: Temperature Safety Low 

— TSH: Temperature Safety High 

— PSHL: Pressure Safety High and Low 


3 STPA FOR SUBSEA GAS COMPRESSION 


3.1 STAMP and STPA 


Leveson (2012) proposed a new accident causation 
theory, called Systems-Theoretic Accident Model 
and Processes (STAMP), whose main idea is that 
major accidents in today’s complex, software- 
intensive, and sociotechnical systems are mainly 
caused by control problems rather than reliability 
problems. The three main concepts of this theory 
are safety constraints, hierarchical control struc- 
tures, and process models. 

Based on the STAMP theory, Leveson (2012)
developed a new approach to hazard analysis,
called Systems-Theoretic Process Analysis
(STPA). The main reasons for developing STPA
were to include new causal factors of STAMP
that are not identified by traditional hazard
identification techniques and to provide guidance
to the users in getting good hazard identification
results. STPA can identify more causal factors
and hazardous scenarios, which are related
to software, system design, and human behavior,
than the other methods (Leveson and Thomas,
2013).


3.2 STPA procedure 


The STPA procedure consists of one preparatory
step (step 0) and two main steps (steps 1 and 2) as
described below (Leveson and Thomas, 2013):


— Step 0: Establishing the system engineering 
foundation 

— Step 1: Identifying unsafe control actions (UCAs) 

— Step 2: Identifying the causes of the unsafe con- 
trol actions 


Sub-steps with associated outcomes of each 
step are summarized in Figure 2. This paper fol- 
lows this procedure to conduct STPA analysis for a 
subsea gas compression system. 


3.3 STPA analysis for subsea gas compression
system


3.3.1 STPA Step 0 

In this step, we first identified system-level
accidents (SLA), system-level hazards (SLH), and
system-level safety constraints (SLSC) to define the
scope of the analysis. Any unsafe control actions
not relevant to the defined SLHs were excluded from
further analysis.

The definition of a hazard in STPA is significantly
different from the traditional definition.
Accidents and hazards in STPA are defined as
follows (Leveson and Thomas, 2013):


— An accident is an undesired and unplanned event
that results in a loss, including a loss of human life
or human injury, property damage, environmental
pollution, mission loss, financial loss, etc.

— A hazard is a system state or set of conditions that,
together with a worst-case set of environmental
conditions, will lead to an accident (loss).


Figure 2. STPA procedure.


In subsea gas compression, the release of large
amounts of gas can lead to loss of human lives (due to
explosion on topside installations or loss of
buoyancy of a vessel on the surface during a gas leak) or
environmental pollution. In addition, significant
economic loss may occur due to damage to costly
subsea components or a reduced gas production rate.
The SLAs, SLHs, and SLSCs of subsea gas
compression are summarized in Table 1. It is assumed
that the compression system is designed to be inherently
safe, so that the system can endure the pressure of
the well stream and the compressor.

Table 1. System-level accidents, hazards, and safety
constraints of subsea gas compression.

System-level accident       System-level hazard           System-level safety constraint
SLA1: People die or are     SLH1: SGC unit continues to   SLSC1: SGC unit must stop
injured due to large        supply gas when gas leaks     compressing gas when gas
amount of gas release       to the environment            leaks to the environment
SLA2: The sea is polluted
due to large amount of
gas release
SLA3: Valuable subsea       SLH2: Compressor operates     SLSC2: Compressor must be
components are damaged      outside normal operation      protected from extreme
                            conditions                    operating conditions that
                                                          can damage the compressor
SLA4: Production is         SLH3: SGC unit stops          SLSC3: SGC unit must never
reduced or interrupted      compressing gas when not      stop compressing gas when
unnecessarily               necessary                     not necessary
                            SLH4: Compressor operates     SLSC4: SGC must be operated
                            outside optimal conditions    within optimal conditions


The next step was to identify the functional control
structure of the system. The high-level functional
control structure of the subsea gas compression
system is illustrated in Figure 3. The
system consists of Human Operator, Control
System, Subsea Gas Compression (SGC) Unit,
and Other Sensors. The Human Operator provides
setpoint adjustment and process shutdown
commands to the Control System, and the Control
System provides commands to the compressor and
valves of the Subsea Gas Compressor Unit. The
Control System receives feedback on the status of the SGC
unit and on the status of other subsea and topside systems
from the Subsea Gas Compressor Unit and the Other
Sensors, respectively, and this feedback is finally
delivered to the Human Operator.


Figure 3. High-level functional control structure of
subsea gas compression system.

Based on the high-level functional control struc- 
ture, a detailed model can be further developed as 
shown in Figure 4. The Control System consists 
of Variable Speed Drive (VSD), Process Control 
System (PCS), Process Shutdown (PSD) System, 
Subsea Control Unit (SCU), Subsea Control Mod- 
ule (SCM), and Subsea Electronic Module (SEM). 
VSD, PCS, PSD System, and SCU are located 
topside, while SCM and SEM are installed subsea. 
The Subsea Gas Compressor Unit is composed of 
Subsea Gas Compressor (SGC), Shutdown Valves 
(SDVs), Anti-Surge Valve (ASV), Liquid Dis- 
charge Valve (LDV), and Sensors. Responsibilities 
and process models of each controller are summa- 
rized in Table 2. 
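The command flow through this control structure can be sketched as a small labelled graph. The following Python fragment is only an illustration: the link list is an assumption read off the responsibilities in Table 2 and the description of Figure 4, and the names `COMMAND_LINKS` and `route` are hypothetical. It shows how many controllers a given command passes before it reaches the actuated component.

```python
# Each link: (sender, receiver, commands carried on that link).
# Assumed wiring, derived from Table 2; not taken verbatim from Figure 4.
COMMAND_LINKS = [
    ("Human Operator", "PCS", {"adjust setpoint", "shutdown"}),
    ("PCS", "PSD", {"shutdown"}),
    ("PCS", "VSD", {"speed up/down"}),
    ("PSD", "VSD", {"trip compressor"}),
    ("VSD", "SGC", {"speed up/down", "trip compressor"}),
    ("PCS", "SCU", {"open/close LDV", "open/close ASV"}),
    ("PSD", "SCU", {"close SDVs"}),
    ("SCU", "SCM/SEM", {"open/close LDV", "open/close ASV", "close SDVs"}),
    ("SCM/SEM", "LDV", {"open/close LDV"}),
    ("SCM/SEM", "ASV", {"open/close ASV"}),
    ("SCM/SEM", "SDVs", {"close SDVs"}),
]


def route(command, start):
    """Follow the links that carry `command`, starting at controller `start`.

    Assumes each controller forwards a given command to at most one receiver
    and that the links contain no cycles.
    """
    path = [start]
    current = start
    while True:
        receivers = [dst for src, dst, cmds in COMMAND_LINKS
                     if src == current and command in cmds]
        if not receivers:
            return path
        current = receivers[0]
        path.append(current)


if __name__ == "__main__":
    for command, start in [("open/close LDV", "PCS"),
                           ("close SDVs", "PSD"),
                           ("trip compressor", "PSD")]:
        print(command, ":", " -> ".join(route(command, start)))
```

Under these assumptions, every valve command passes through both SCU and SCM/SEM, so each relay is a place where a command can be lost, delayed, or altered.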


3.3.2 STPA Step 1 

After establishing the functional control structure 
with responsibilities and process models, we could 
identify unsafe control actions (UCAs) by exam- 
ining combinations of control actions and associ- 
ated process models identified in Step 0. 

Table 3 shows an example of how to identify
UCAs of the system. One of the responsibilities of
the PCS is to automatically control the LDV depending
on the level inside the scrubber. If the level inside
the scrubber is too high, liquid may flow into the
gas compressor, resulting in severe damage to the
compressor. The PCS should therefore open or close
the LDV to control the level of liquid in the scrubber.
In this case, the control actions are open LDV and
close LDV, and the process model is scrubber level
(high/normal/low).


Figure 4. Detailed functional control structure.


Table 2. Responsibilities and process models of each controller.

Controller      Responsibilities                                  Process models
Human operator  — Adjust setpoint to maximize the efficiency      — Compressor inlet temperature/pressure/flow
                  of SGC unit                                       (low/normal/high)
                — Shutdown process when needed                    — Compressor outlet temperature/pressure
                                                                    (low/normal/high)
                                                                  — Status of other subsea and topside systems
                                                                    (normal/gas leak)
PCS             — Deliver shutdown process command from           — Setpoints (optimal/not optimal)
                  human operator to PSD system                    — Compressor inlet temperature/pressure/flow
                — Automatically adjust compressor speed             (low/normal/high)
                — Automatically open/close LDV                    — Compressor outlet temperature/pressure
                — Automatically open/close ASV                      (low/normal/high)
                                                                  — Scrubber level (low/normal/high)
VSD             — Deliver speed up/down and trip command from     — Control command from PCS (speed up/down)
                  PCS and PSD to SGC                              — Control command from PSD (trip compressor)
PSD             — Trip compressor and close SDVs based on         — Control command from PCS (shutdown)
                  shutdown command from human operator            — Status of other subsea and topside systems
                — Automatically shut down process when needed       (normal/gas leak)
SCU             — Deliver control commands from PCS and PSD       — Control commands from PCS (open/close LDV,
                  system to SCM/SEM                                 open/close ASV)
                                                                  — Control commands from PSD (close SDVs)
SCM/SEM         — Distribute control commands to each             — Control commands from PCS (open/close LDV,
                  component                                         open/close ASV, close SDVs)


Table 3. Identifying UCAs of open/close LDV command.

Controller: PCS

No  Control action  Scrubber level  Not provided   Provided       Too early      Too late       Too short      Too long
1   Open LDV        High            Unsafe [SLH2]  Safe           Safe           Unsafe [SLH2]  Unsafe [SLH2]  Safe
2                   Normal          Safe           Safe           Safe           Safe           Safe           Safe
3                   Low             Safe           Unsafe [SLH2]  N/A            N/A            N/A            N/A
4   Close LDV       High            Safe           Unsafe [SLH2]  N/A            N/A            N/A            N/A
5                   Normal          Safe           Safe           Safe           Safe           Safe           Safe
6                   Low             Unsafe [SLH2]  Safe           Safe           Unsafe [SLH2]  Unsafe [SLH2]  Safe


Table 4. Scenarios, causal factors, and safety constraints of a UCA.

UCA.PCS001: Open LDV command is not provided when scrubber level is high

Scenario                            Associated causal factors     Safety constraints
PCS receives wrong measurement      Drift of scrubber LT          SC.PCS001.01
of scrubber level                                                 Scrubber LT must be calibrated periodically
                                                                  SC.PCS001.02
                                                                  Scrubber LT must have 2oo3 configuration
PCS receives no measurement         No power supply to            SC.PCS001.03
of scrubber level                   scrubber LT                   PCS must generate an alarm when no signal is
                                                                  received from scrubber LT
                                                                  SC.PCS001.04
                                                                  Scrubber LT must have UPS
                                    Broken signal wires from      SC.PCS001.03
                                    scrubber LT to PCS            PCS must generate an alarm when no signal is
                                                                  received from scrubber LT
                                                                  SC.PCS001.05
                                                                  Signal wires must be inspected periodically
PCS receives correct measurement,   Wrong logic inside PCS        SC.PCS001.06
but PCS does not provide open                                     PCS logic to generate open LDV command
LDV command                                                       must be fully demonstrated during
                                                                  commissioning period

UCAs can be identified by examining
combinations of control actions and associated process
models. For instance, if the open LDV command
is not provided when the scrubber level is high, then
this is a UCA that can cause SLH2 identified in
Step 0. Conversely, it is a safe control action
if the open LDV command is not provided when the
scrubber level is low. The UCAs identified in Table 3 are:


— UCA.PCS001: Open LDV command is not provided
when scrubber level is high

— UCA.PCS002: Open LDV command is provided
too late when scrubber level is high

— UCA.PCS003: Open LDV command is provided
too short when scrubber level is high

— UCA.PCS004: Open LDV command is provided
when scrubber level is low

— UCA.PCS005: Close LDV command is provided
when scrubber level is high

— UCA.PCS006: Close LDV command is not provided
when scrubber level is low

— UCA.PCS007: Close LDV command is provided
too late when scrubber level is low

— UCA.PCS008: Close LDV command is provided
too short when scrubber level is low
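The examination of control action and process model combinations in Table 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in the analysis: the rule in `unsafe_modes` is an assumption chosen so that it reproduces the entries of Table 3, and all function and variable names are hypothetical.

```python
# Action that the scrubber level calls for; "normal" calls for no action.
REQUIRED_ACTION = {"high": "Open LDV", "low": "Close LDV"}
ACTIONS = ["Open LDV", "Close LDV"]
LEVELS = ["high", "normal", "low"]


def unsafe_modes(action, level):
    """Return the ways in which `action` is unsafe at the given scrubber level.

    Assumed rule: a required action is unsafe if it is missing, late, or
    stopped too soon; the opposite action is unsafe whenever it is provided.
    """
    required = REQUIRED_ACTION.get(level)
    if required is None:
        return []
    if action == required:
        return ["not provided", "provided too late", "provided too short"]
    return ["provided"]


def enumerate_ucas(controller="PCS"):
    """List (identifier, description) pairs for every unsafe combination."""
    ucas = []
    for action in ACTIONS:
        for level in LEVELS:
            for mode in unsafe_modes(action, level):
                uca_id = f"UCA.{controller}{len(ucas) + 1:03d}"
                text = f"{action} command is {mode} when scrubber level is {level}"
                ucas.append((uca_id, text))
    return ucas


ucas = enumerate_ucas()

if __name__ == "__main__":
    for uca_id, text in ucas:
        print(uca_id, text)
```

With these assumptions the sketch yields eight UCAs in the same order as the list above. The same loop could be reused for other controllers by swapping the action and state lists.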


Similarly, other UCAs can be identified by
combining control actions and process models of each
controller.


3.3.3 STPA Step 2

The last step of STPA was to identify scenarios,
associated causal factors, and safety constraints of
each UCA. STPA provides less help for this step,
and therefore, the analysts must rely on
brainstorming with their own background knowledge
and prior experiences (Leveson and Thomas,
2013).

An example of identified scenarios, causal
factors, and safety constraints of a UCA is
summarized in Table 4.
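Two of the safety constraints in Table 4 lend themselves to a short illustration: a two-out-of-three (2oo3) voting arrangement of the kind required by SC.PCS001.02, and the alarm on a missing signal required by SC.PCS001.03. The Python sketch below is an assumption-laden example and not the logic of any real PCS. The function name, the use of `None` for a lost signal, and the level scale (fraction of scrubber height) are all hypothetical.

```python
def vote_level_high(readings, high_limit):
    """2oo3 vote on 'scrubber level is high' from three level transmitters.

    `readings` holds three values; None marks a transmitter with no signal.
    Returns (level_high, alarm): level_high is True when at least two
    transmitters report a level above `high_limit`; alarm is True when any
    signal is missing.
    """
    if len(readings) != 3:
        raise ValueError("2oo3 voting needs exactly three readings")
    missing = sum(r is None for r in readings)
    votes = sum(r is not None and r > high_limit for r in readings)
    return votes >= 2, missing > 0


if __name__ == "__main__":
    # One transmitter has drifted low; the vote still reports a high level.
    print(vote_level_high([0.82, 0.85, 0.30], high_limit=0.8))
    # One signal is lost; the alarm is raised.
    print(vote_level_high([0.82, None, 0.30], high_limit=0.8))
```

The first call illustrates why voting addresses the "drift of scrubber LT" causal factor: a single drifting transmitter cannot suppress the open LDV command on its own.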


4 RESULTS AND DISCUSSION 


4.1 Results 


In this study, a total of 129 high-level UCAs have
been identified for a subsea gas compression
system. SCU and SCM/SEM have the largest number
of UCAs, while the Human Operator is associated
with the smallest number of UCAs. 66 out
of 129 UCAs are related to SLH2 (compressor
operates outside normal operation conditions), while
only nine UCAs can cause SLH3 (SGC unit stops
compressing gas when not necessary). The number of
UCAs of each controller and system-level hazard
is summarized in Table 5.

The human operator’s main responsibility is to
adjust setpoints, so most of the UCAs of the human
operator are related to SLH4 (compressor operates
outside optimal conditions). On the other hand, the
main function of the PSD system is to shut down the
process when a gas leak occurs, so the greater
part of the UCAs of the PSD system are connected to
SLH1 (SGC unit continues to supply gas when gas
leaks to the environment).

The main responsibility of the PCS is to
automatically control the LDV and ASV to prevent the
compressor from being damaged due to operating
outside normal operation conditions, and
therefore, UCAs of the PCS are mostly relevant to
SLH2 (compressor operates outside normal
operation conditions).


Table 5. Results of the analysis.

Controller  Total UCAs  UCAs to SLH1  UCAs to SLH2  UCAs to SLH3  UCAs to SLH4
Human Op.   10          2             0             1             7
PCS         30          2             10            1             17
PSD         12          8             0             4             0
VSD         15          2             0             1             12
SCU         31          2             28            1             0
SCM/SEM     31          2             28            1             0
Sum         129         18            66            9             36


SCU and SCM/SEM have the same number 
of UCAs, because the responsibility of SCU is to 
deliver control commands from PCS and PSD sys- 
tem to SCM/SEM. 

Most of the control actions in the subsea gas
compression system are intended to prevent compressor
damage and/or operate the compressor in optimal
conditions, rather than to prevent or stop gas
leaks, because these functions are already covered
by the emergency shutdown (ESD) system with
shutdown valves. Therefore, the subsea gas compression
system has a large number of UCAs related
to SLH2 and SLH4 compared to SLH1 and SLH3.
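The totals in Table 5 can be cross-checked with a short tally. The Python sketch below takes its counts from Table 5; the variable names and the percentage shares are additions for illustration only.

```python
# UCAs per controller, in the column order SLH1, SLH2, SLH3, SLH4 (Table 5).
HAZARDS = ("SLH1", "SLH2", "SLH3", "SLH4")
TABLE_5 = {
    "Human Op.": (2, 0, 1, 7),
    "PCS": (2, 10, 1, 17),
    "PSD": (8, 0, 4, 0),
    "VSD": (2, 0, 1, 12),
    "SCU": (2, 28, 1, 0),
    "SCM/SEM": (2, 28, 1, 0),
}

# Row totals: UCAs per controller.
row_totals = {name: sum(counts) for name, counts in TABLE_5.items()}

# Column totals: UCAs per system-level hazard.
col_totals = dict(zip(HAZARDS, (sum(col) for col in zip(*TABLE_5.values()))))

total = sum(row_totals.values())

# Share of all UCAs linked to each hazard, in percent.
shares = {h: round(100 * n / total, 1) for h, n in col_totals.items()}

if __name__ == "__main__":
    print("UCAs per controller:", row_totals)
    print("UCAs per hazard:", col_totals)
    print("Total:", total)
    print("Share per hazard (%):", shares)
```

The tally reproduces the row and column sums of Table 5 and shows that roughly half of all UCAs are linked to SLH2.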


4.2 Discussion and concluding remarks 


When applied to a subsea gas compression system,
STPA provided a systematic and structured way
of identifying UCAs in STPA Steps 0 and 1. On the
other hand, STPA provides less help in identifying
scenarios and causes of UCAs, and relies on
brainstorming in STPA Step 2. However, this is not a
problem of STPA alone; traditional hazard
identification methods have the same limitation.
For instance, hazard and operability (HAZOP)
studies provide guide words and parameters to
identify hazards, structured what-if (SWIFT)
uses checklists and what-if questions to identify
hazards, and failure modes, effects, and criticality
analysis (FMECA) utilizes system breakdown and
functional analyses to identify hazards. However,
once hazards are identified, all these methods rely
on brainstorming to identify causes of the
hazards. As Leveson and Thomas (2013) mentioned,
STPA can provide more help for STPA Step 2 in
the future, because there are common flaws that
lead to accidents.

One of the distinct characteristics of subsea
systems is the remoteness of control actions. The
control commands for subsea systems are provided
from topside installations or onshore locations that are
sometimes hundreds of kilometers away from the
subsea facilities. Therefore, some equipment
collects control commands and distributes the
commands to associated components, like SCU
and SCM/SEM in Figure 4. Accordingly,
appropriate transmission and handling of control signals
becomes important for the operation of the subsea
systems, and STPA is well suited for addressing
hazards in such distributed control structures.

However, only looking at the system from the
perspective of the control units and flow of
control commands and signals during operation can
lead to omission of important hazards. To be fair,
STAMP does consider a wider range of accident
causes, including unhandled environmental
disturbances or conditions, unhandled or uncontrolled
component failures, unsafe interactions among
components and inadequately coordinated
control actions by multiple controllers (Leveson and
Thomas, 2013). Also, Leveson uses the term control
in a broad sense, to include control by design (i.e.
to prevent or protect against component failures or
unsafe interactions), control processes (including
developmental, manufacturing, maintenance and
operational processes) and social controls (i.e. legal
requirements, cultural norms, or other interests
that constrain behaviour) (Leveson and Thomas,
2013). However, although STPA may in principle
cover a wide range of hazards, it is not necessarily
the best or easiest method to apply for
identifying or analyzing all types of hazards. Other
methods, such as HAZOP and FMECA, also have
their advantages and may be more familiar to
safety engineers and risk managers. For example,
while STPA takes the working system as its point of
departure and then identifies flaws that could cause
hazards, FMECA starts from the failure modes of
subsystems to identify system effects, and HAZOP
focuses on deviations from normal operations that
could cause hazards. Often these failure modes and
possible deviations are known and STPA is not
needed to identify them. Rather, the question is
how critical they are for the system operation and
safety, and whether there are system hazards not covered
by these subsystem-oriented approaches. In the
latter case STPA may add important additional
insight into system hazards.

The oil and gas industry has widely adopted
risk-based safety philosophies, where safety is
regarded as tolerable risk. In contrast, STAMP
and STPA view safety as a control problem,
where hazards can be eliminated by imposing
control measures and constraints, or by modifying
the system. STPA does not provide a method for
describing, ranking or comparing the risk associ- 
ated with identified unsafe control actions. In real- 
ity, imposing constraints and controls to eliminate 
unsafe control actions will have a cost and will not 
be perfectly reliable or remove hazards completely. 
While STPA is good for identifying inadequate 
control and suggesting additional controls, it does 
not provide stop criteria for when the control is 
sufficient. Eventually, the need for prioritization 
of resources will necessitate an evaluation of the 
associated risk to ensure that resources are spent 
optimally and that safety of the system represents 
a tolerable risk. 

In conclusion, if the goal is to identify as many 
hazards, unsafe control actions and dangerous sce- 
narios as possible, it is useful to view a system from 
several different perspectives, and not insist on
using only one method. Hence, rather than replacing
other methods, STPA should be used as a supplement,
providing a control perspective on safety.


ABSTRACT: Tool wear in machining processes can have a detrimental impact upon the surface finish
of a machined part and increase the energy consumption during manufacture; if the tool
fails completely, the damage incurred may require the part to be scrapped. Monitoring of the tool's condition
can therefore lead to preventative steps being taken to avoid excessively worn tools being used during
machining, which could damage a part. Several studies have been devoted to condition
monitoring of the machining process, including the evaluation of cutting tool condition. However, these
methods are either impractical for a production environment due to lengthy monitoring time, or require
knowledge of cutting parameters (e.g. spindle speed, feed rate, material, tool) which can be difficult to
obtain. In this study, we aim to investigate if tool wear can be directly identified using features extracted
from the electrical power signal of the entire Computer Numerical Control (CNC) machine (three-phase
voltage and current) captured at 50 kHz, for different cutting parameters. The wavelet packet transform is
applied to extract features from the raw measurements under different conditions. By analyzing the
energy and entropy of reconstructed signals at different frequency sub-bands, the tool wear level can be
evaluated. Results demonstrate that with the selected features, the effects due to cutting parameter
variation and tool wear level change can be discriminated with good quality, which paves the way for using this
technique to monitor the machining process in practical applications.


1 INTRODUCTION 


Tool wear and subsequent failure of tools during 
the manufacturing process will have a significant 
impact on the economics of machining, and about 
25% of machine down time can be attributed to 
the direct results of tool wear failure (Altintas & 
Yellowley 1989). Moreover, the development of 
tool wear will give rise to inconsistencies in sur- 
face finishes and geometric tolerances, affecting 
the quality of manufactured products. Therefore, a 
series of studies have been devoted to monitoring 
systems detecting underperforming tooling and 
improving machining efficiency and productivity. 
The monitoring techniques for tool wear can 
be divided into two categories, direct and indi- 
rect methods (Bhattacharyya & Sengupta 2009, 
Teti et al. 2010). With direct methods, tool wear 
is evaluated by analyzing the cutter itself, such as 
measuring the surface roughness and flank wear, 
etc. On the other hand, indirect methods apply
either model-based or data-driven techniques to
measurements such as cutting force, tool vibration
and output power for evaluating tool wear
conditions. It should be noted that due to the restrictions
of direct methods, such as stopping requirements
during production, indirect methods are more
suitable for industrial applications (Zhu et al. 2009).
With indirect methods, different measurements
can be collected and analyzed to evaluate the tool
condition, including acoustic emission (Prickett &
Johns 1999, Karimi et al. 2013, Hass et al. 2013),
cutting force (Dimla & Lister 2000, Li et al. 2006, 
Deng et al. 2013, Lee et al. 2006), vibration (Yesily- 
urt & Ozturk 2007, Zhang & Chen 2007, Lamraoui 
et al. 2014), temperature (Byrne 1987, Davoodi & 
Hosseinzadeh 2012), spindle power/current (He 
et al. 2017, Li et al. 2000, Simoneau & Meehan 
2013), etc. However, several of these methods
require expensive sensing equipment (Nouri et al.
2015) and can be difficult to install because they
must sit close to the cutting tool and workpiece,
which makes them impractical for large production
environments. Additionally, the
classification of tool wear from the collected data 
is challenging due to the high sensitivity of data 
to the cutting parameters (i.e. spindle speed, feed 
rate, depth and width of cut, material, tool type). 
Thresholding of time domain data has been used as a
method of classifying tool wear (Shao et al. 2004);
however, this requires large amounts of calibration
and training data, which is time consuming to collect
and restricts the robustness of the system to a
limited set of cutting conditions. Several studies
have investigated frequency and time-frequency domain
analysis to reduce the sensitivity of classification
to cutting parameters (Kuljanic et al. 2009, Liao
et al. 2007, Huang et al. 2010, Lauro et al. 2004);
however, most of these methods require specialist
monitoring equipment, which poses the challenges
described above.

Within this research we investigate the potential of
a low-cost, non-invasive sensing approach to tool
condition monitoring that is also agnostic to cutting
parameters, which has so far not been identified
within the existing literature. The investigated
solution uses current and voltage sensors across the
three-phase electrical input to the machine to
monitor the overall machine power consumption, whilst
classification of the signal is conducted through
time-frequency analysis using the wavelet packet
transform.

In Section 2 the diagnostic approach using the 
wavelet analysis is described. Section 3 details the 
experimental methodology and results, and Sec- 
tion 4 concludes the findings and highlights limita- 
tions and future work. 


2 DIAGNOSTIC APPROACH 


Although several studies have applied the wavelet
transform to condition monitoring of the milling
process (Choi et al. 2004, Li et al. 2008, Zhong
et al. 2010), these have mainly analyzed vibration or
cutting force measurements rather than the electrical
power consumed by the machine. Moreover, the
effectiveness of the wavelet transform in
discriminating tool wear levels under varying cutting
parameters still requires further investigation.

In this study, the wavelet packet transform (WPT) is
selected to evaluate the tool wear level. The reason
for using WPT is that, whereas the wavelet transform
further decomposes only the low-pass (approximation)
branch at each level, WPT decomposes both the
low-pass and the high-pass (detail) branches
(depicted in Figure 1). Therefore, more information
can be extracted from the original signals using WPT
(Torrence & Compo 1995). The extracted wavelet
coefficients $C_{j,k}$ can be expressed using Eq. (1).

$$C_{j,k} = \int f(t)\,\psi_{j,k}(t)\,dt \qquad (1)$$

where $f(t)$ is the original signal, $\psi_{j,k}$ is
the wavelet function, and $j$ and $k$ are the scale
and shift parameters. The wavelet function can be
expressed as in Eq. (2).

$$\psi_{j,k}(t) = \frac{1}{\sqrt{j}}\,\psi\!\left(\frac{t-k}{j}\right) \qquad (2)$$


Figure 1. Two-level wavelet packet transform, where A
and D are the approximation and detail obtained by
filtering the signal at the previous level.


It can be seen from Figure 1 that the application of
WPT provides a sub-band filtering of the original
signal into progressively finer equal-width intervals
with the extracted packets of wavelet coefficients,
i.e. the $i$-th packet of wavelet coefficients at
level $j$ represents the information of the original
signal within the frequency sub-band
$[iF_s/2^{j+1}, (i+1)F_s/2^{j+1}]$, where $F_s$ is the
sampling frequency.
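
As a rough illustration of the sub-band layout described above, the sketch below computes the frequency edges of one packet from its index, the decomposition level and the sampling frequency. The function name `packet_band` is hypothetical; the 50 kHz and 8-level values are the ones reported in Section 3; and the packets are assumed to be indexed in order of increasing frequency, which filter-bank implementations do not always guarantee.

```python
def packet_band(i: int, level: int, fs: float) -> tuple[float, float]:
    """Return the (low, high) frequency edges in Hz of packet i at a given level."""
    if not 0 <= i < 2 ** level:
        raise ValueError("packet index must lie in [0, 2**level)")
    width = fs / 2 ** (level + 1)  # each packet spans F_s / 2^(j+1)
    return i * width, (i + 1) * width


FS = 50_000.0  # sampling frequency reported in Section 3.1
LEVEL = 8      # decomposition depth reported in Section 3.2

low, high = packet_band(7, LEVEL, FS)
print(f"packet 7 at level {LEVEL}: {low:.2f} Hz to {high:.2f} Hz")
```

Under these assumptions each level-8 packet is roughly 97.7 Hz wide, and the 256 packets together cover the range from 0 Hz to half the sampling frequency.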

With wavelet coefficients, the time-history of
signals at different frequency sub-bands can be
reconstructed using Eq. (3).

$$f(t) = C \sum_{k} C_{j,k}\,\psi_{j,k}(t) \qquad (3)$$

where $C$ is a constant independent of signals.

With these reconstructed signals, energy and entropy
are calculated from each signal using Eqs. (4) and (5).

$$E_s = \int |f(t)|^{2}\,dt \qquad (4)$$

$$H_s = -\sum_{i=1}^{n} p(x_i)\log p(x_i) \qquad (5)$$

where $E_s$ and $H_s$ are the energy and entropy of
the signal, and $p(x_i)$ in Eq. (5) is the probability
of the signal taking the value $x_i$.
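
The two features can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: the integral in Eq. (4) is approximated by a Riemann sum, and the probabilities in Eq. (5) are estimated from an amplitude histogram. The bin count, the natural logarithm and the function names are all assumptions, since the text does not specify them.

```python
import math
from collections import Counter


def signal_energy(samples, dt):
    """Approximate the integral of |f(t)|^2 dt by a Riemann sum."""
    return sum(x * x for x in samples) * dt


def signal_entropy(samples, n_bins=16):
    """Shannon entropy of the amplitude distribution (histogram estimate)."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        return 0.0  # a constant signal has a single amplitude value
    width = (hi - lo) / n_bins
    # Assign each sample to a bin; the maximum value goes into the last bin.
    counts = Counter(min(int((x - lo) / width), n_bins - 1) for x in samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Hypothetical reconstructed sub-band signal sampled at 50 kHz.
fs = 50_000.0
samples = [math.sin(2 * math.pi * 700.0 * k / fs) for k in range(1000)]
print(signal_energy(samples, 1.0 / fs), signal_entropy(samples))
```

The entropy estimate depends on the chosen number of bins, so feature values are only comparable between signals processed with the same setting.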

The energy of reconstructed signals represents the
amount of information within different frequency
sub-bands, while the entropy of reconstructed signals
indicates the signal disorganization in those
sub-bands. These two features are expected to be
sensitive to changes in cutting parameters and
cutting tool wear level, and thus usable for
discriminating cutting tool wear levels. This is
investigated further in the following section.

It can be seen that with the use of WPT, the original
signals can be decomposed and reconstructed at
different frequency ranges, from which the frequency
information can be related to the signals in the time
domain, and better used for the feature extraction
and fault diagnosis.
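
To make the decomposition step concrete, the sketch below builds a small wavelet packet tree in plain Python. It deliberately swaps in the orthonormal Haar wavelet, which is the simplest filter pair to write out; the study itself uses the Shannon wavelet, so this only illustrates the tree structure. The function names and the test signal are hypothetical.

```python
import math


def haar_step(x):
    """One orthonormal Haar analysis step: return (approximation, detail)."""
    s = 1.0 / math.sqrt(2.0)
    a = [(x[k] + x[k + 1]) * s for k in range(0, len(x) - 1, 2)]
    d = [(x[k] - x[k + 1]) * s for k in range(0, len(x) - 1, 2)]
    return a, d


def wavelet_packet(x, levels):
    """Split x into 2**levels packets by filtering every branch at every level."""
    if len(x) % (2 ** levels) != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    nodes = [list(x)]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            a, d = haar_step(node)
            next_nodes.extend([a, d])  # both branches are split again
        nodes = next_nodes
    return nodes  # natural filter-bank order, not sorted by frequency


signal = [math.sin(2 * math.pi * 5 * n / 64) for n in range(64)]
packets = wavelet_packet(signal, 2)
energies = [sum(c * c for c in p) for p in packets]
print(energies)
```

Because the Haar transform is orthonormal, the packet energies add up to the energy of the original samples, which gives a quick sanity check on any implementation of this kind.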


3 PERFORMANCE OF DIAGNOSTIC 
APPROACH 


3.1 Experiments 


In this study, HSS-Co8, a high-speed steel containing
8% cobalt, is selected as the end milling tool
material because of its ease of wear measurement; the
tools have 4 flutes. Two end milling tools with
different diameters are selected for the analysis.
Table 1 lists the characteristics of these end
milling tools, where LOC refers to the tool’s length
of cut.

In the experiments, each end mill 
was assigned a work piece of dimension 
150 mm x 120 mm x 30 mm, and the plate mate- 
rial was selected as commercial aluminum grade 
6082 T651, which is a common alloy used in 
manufacturing. 

Cutting parameters used in the tests were selected
according to the manufacturer’s recommendation, and
are listed in Table 2.

For the duration of each cutting session the
energy monitoring device was connected to the
system, which collected the current and voltage
measurements at a sampling frequency of 50 kHz.
During each session, the tools were used to perform
climb milling on the work pieces. The number
of passes, cut depth and cutting radius are selected
as 10, half of the cutter diameter, respectively.


Table 1. Characteristics of end milling tools.

Mill Dia.   Shank Dia.   LOC    Overall length   No. teeth/
(mm)        (mm)         (mm)   (mm)             flutes
8.0         10           19     69               4
10.0        10           22     72               4


Table 2. Cutting parameters and corresponding wear
measurements.

Dia.   Cut   Time    Spindle speed   Feed rate   Localized tool wear
(mm)   no.   (min)   (RPM)           (m/min)     (mm)
8      1     0       4000            580         0
8      2     40      4000            580         0.182
8      3     60      4000            580         0.279
8      4     80      4000            580         0.327
8      5     100     4000            580         0.493
10     1     0       3100            600         0
10     2     40      3100            600         0.185
10     3     80      3100            600         0.434
10     4     100     3100            600         0.582

It should be mentioned that after each cutting
session, the tools were used to machine carbon steel
to induce wear (40 min initially, subsequently
20 min), and this process was repeated until 100 min,
at which point full tool wear was observed. Table 2
lists the cutting parameters used for the different
cuts and the corresponding wear measurements.


3.2 Discrimination of different cutting tools 
with different wear level 


In this section, the current and voltage from two end
milling tools with 8 mm and 10 mm cutting diameters
are collected at 0 min and 100 min, which represent
the intact and fully worn tools. The reason for
selecting these measurements is that the collection
process does not interrupt the machining process, and
the installation of sensors does not add complexity
to the monitoring system; thus the results can be
better applied in the practical machining process.
With current and voltage measurements, the
instantaneous power can be calculated using the
following equation.

$$P(t) = v(t)\,i(t) \qquad (6)$$

where $v(t)$ and $i(t)$ are the collected voltage and
current measurements at time $t$.
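
A minimal sketch of this calculation is given below. The per-phase product follows Eq. (6) directly. Summing the three phases to obtain the machine power is an assumption made here for illustration, since the text does not state how the phases are combined; the sample values and function names are likewise hypothetical.

```python
def instantaneous_power(voltage, current):
    """Sample-by-sample product v(t) * i(t) for one phase."""
    if len(voltage) != len(current):
        raise ValueError("voltage and current must have the same length")
    return [v * i for v, i in zip(voltage, current)]


def total_power(phases):
    """Sum the per-phase instantaneous powers (assumed combination rule)."""
    per_phase = [instantaneous_power(v, i) for v, i in phases]
    return [sum(samples) for samples in zip(*per_phase)]


def average_and_max(power):
    """Summary statistics of the kind reported for each cutting condition."""
    return sum(power) / len(power), max(power)


# Hypothetical samples for three phases: (voltage in V, current in A).
phases = [
    ([230.0, -115.0], [2.0, -1.0]),
    ([-115.0, 230.0], [-1.0, 2.0]),
    ([-115.0, -115.0], [-1.0, -1.0]),
]
power = total_power(phases)
print(average_and_max(power))
```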

Figure 2 depicts the instantaneous powers for the
8 mm and 10 mm tools at intact and fully worn
conditions. It should be mentioned that only the
power from a single cutting pass is illustrated
herein, as the powers of the 10 passes have a similar
trend. In the current study, only the power from a
single pass is analyzed. Table 3 lists the average
and maximum instantaneous power at each condition. It
can be seen from the table that the instantaneous
power increases with cutting tool wear.
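
The size of this increase can be read off the reported values with a short calculation. The snippet below uses only the figures listed in Table 3; the helper name `relative_increase` is hypothetical.

```python
# Average and maximum instantaneous power (W) as listed in Table 3.
TABLE_3 = {
    "8 mm": {"intact": (1593.7, 3277.4), "worn": (1989.6, 4890.1)},
    "10 mm": {"intact": (1712.3, 3827.7), "worn": (2537.5, 5971.5)},
}


def relative_increase(before, after):
    """Percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


for tool, rows in TABLE_3.items():
    avg = relative_increase(rows["intact"][0], rows["worn"][0])
    peak = relative_increase(rows["intact"][1], rows["worn"][1])
    print(f"{tool}: average +{avg:.1f}%, maximum +{peak:.1f}%")
```

On these numbers the average power rises by roughly a quarter for the 8 mm tool and by nearly half for the 10 mm tool between the intact and fully worn conditions.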

From the instantaneous powers shown in Figure 2,
neither the cutting tools with different diameters
and cutting parameters nor the same tool at different
wear levels can be discriminated easily in the time
domain, as the signals from the different conditions
have a similar shape. The four conditions therefore
cannot be discriminated using only the power
amplitude variation.

Figure 2. Instantaneous power of two cutters at intact
and fully worn conditions.


Table 3. Average and max instantaneous power.

Condition                Average power (W)   Max power (W)
8 mm tool at 0 min       1593.7              3277.4
8 mm tool at 100 min     1989.6              4890.1
10 mm tool at 0 min      1712.3              3827.7
10 mm tool at 100 min    2537.5              5971.5


As described in Section 2, WPT is applied to the
instantaneous power to extract wavelet coefficients
and reconstruct signals at different frequency
sub-bands. In the current study the WPT is used to
decompose the original signal over 8 levels. This
decomposition level is selected by considering both
the range of frequency sub-bands and computational
time. The Shannon wavelet function is used in the WPT
analysis, which can be written as follows:

$$\psi(t) = \sqrt{f_b}\,\mathrm{sinc}(f_b t)\,e^{2\pi i f_c t} \qquad (7)$$

where $f_b$ is the bandwidth parameter and $f_c$ is
the centre frequency of the wavelet.


Figure 3. Energy and entropy distributions of 8 mm
cutter at intact and worn conditions.


Energy and entropy are then calculated from 
each reconstructed signal. Figure 3 depicts the 
distribution of energy and entropy over the whole 
frequency range. It should be noted that as the 
distributions are similar for the two end milling 
tools, only energy and entropy distribution from 
the 8 mm tool are illustrated herein. 

It can be seen from Figure 3 that the energy
distribution shows similar trends for both intact and
worn conditions, and the maximum energy is
concentrated at around 700 Hz. The entropy, however,
is spread across the whole frequency range, and the
entropy distributions at intact and worn conditions
differ clearly. This indicates that the energy
features can provide more consistent results, while
entropy features are more sensitive to the change in
the cutting parameters.

In this study, the two highest energies and entropies
at the intact condition are selected for the
discrimination, as they represent the most
information and disorganization in the original
signal. Figure 4 depicts the discrimination results.
It should be noted that each point in Figure 4
represents the features calculated from a
two-second-long instantaneous power signal.
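
The feature selection described above can be sketched as follows, assuming that the power signal is cut into consecutive, non-overlapping two-second windows (100,000 samples each at 50 kHz) and that the two sub-bands are chosen once from the intact-condition values. Both assumptions, the function names and the numbers are illustrative and are not taken from the study.

```python
def split_windows(samples, fs, seconds=2.0):
    """Cut a signal into consecutive, non-overlapping windows of fixed duration."""
    size = int(round(fs * seconds))
    return [samples[s:s + size] for s in range(0, len(samples) - size + 1, size)]


def top_two(reference_values):
    """Indices of the two sub-bands with the largest reference feature values."""
    order = sorted(range(len(reference_values)),
                   key=lambda b: reference_values[b], reverse=True)
    return order[:2]


def feature_point(band_values, selected):
    """Pick the selected sub-band values of one window as a 2-D feature point."""
    return tuple(band_values[b] for b in selected)


intact_energy = [0.1, 5.0, 0.3, 2.0]  # hypothetical per-band energies, intact tool
selected = top_two(intact_energy)
window_energy = [0.2, 6.5, 0.4, 2.9]  # hypothetical per-band energies, one window
point = feature_point(window_energy, selected)
print(selected, point)
```

Fixing the selected sub-bands from the intact condition keeps the feature axes the same for every window, so points from different tool states can be plotted and compared in one plane.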

From Figure 4, it can be seen that with the selected
energy and entropy features, all four different
states, i.e. two end milling tools with two wear
levels, can be discriminated with good quality,
indicating that not only can the worn condition be
identified clearly for the same cutting tool, but
different cutting tools with similar wear levels can
also be separated accurately.


(b) Discrimination results using two highest entropies

Figure 4. Discrimination results using two highest
energies and entropies.


When this approach is used in practical applications,
the state of the end milling tool can be determined
from the minimum Euclidean distance between the
features (two highest energies or entropies) of the
instantaneous power from the unknown state and the
features shown in Figure 4. It should be mentioned
that as the analysis is computationally efficient
(taking only about 20 seconds to obtain the results),
this approach can be used in practical applications
for on-line monitoring purposes.
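
A minimal sketch of such a minimum-distance rule is shown below. The reference points stand in for the labelled features of Figure 4; their values, the state labels and the function name `classify` are hypothetical.

```python
import math


def classify(features, references):
    """Return the label and distance of the closest labelled reference point."""
    best_label, best_dist = None, math.inf
    for label, points in references.items():
        for p in points:
            d = math.dist(features, p)  # Euclidean distance
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label, best_dist


# Hypothetical reference features (two highest energies) for each known state.
references = {
    "8 mm intact": [(1.0, 0.2), (1.1, 0.25)],
    "8 mm worn": [(3.0, 0.9), (3.2, 1.0)],
    "10 mm intact": [(1.8, 1.6), (1.9, 1.7)],
    "10 mm worn": [(4.0, 2.4), (4.1, 2.6)],
}

label, distance = classify((3.1, 1.0), references)
print(label, round(distance, 3))
```

If the two features differ strongly in scale, one of them will dominate the distance, so some form of normalization may be needed before applying the rule in practice.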


3.3 Discrimination of different cutting tools 
with different wear level and similar 
instantaneous power 


The performance of WPT in discriminating cutting tool
conditions is further investigated using data from
end milling tools operated at different cutting
parameters but with similar instantaneous power,
which makes discrimination with time-domain
techniques extremely difficult.

In this study, two sets of data are used for the
analysis, from end milling tools with diameters of
8 mm and 10 mm at different wear levels and cutting
parameters. Table 4 lists the cutting parameters of
these two end mill tools and the corresponding wear
measurements.

Figure 5 depicts the instantaneous powers from these
two end milling tools. Similar instantaneous power
can be observed due to the combination of different
wear levels and cutting parameters. The average
instantaneous powers from these conditions are listed
in Table 5. It can be seen that these two conditions
provide similar average instantaneous power, while
clear variation is observed in the wear level, which
is listed in Table 4.


Table 4. Cutting parameters and corresponding wear
measurements.

Dia.   Spindle speed   Feed rate   Time    Wear measurement
(mm)   (RPM)           (m/min)     (min)   (mm)
8      4000            580         60      0.279
10     3100            600         0       0


(a) Instantaneous power from 8 mm cutter

Figure 5. Instantaneous powers from 8 mm and 10 mm
diameter end mill tools.


Table 5. Average and max instantaneous power.

Condition               Average power (W)   Max power (W)
8 mm tool at 60 min     1810.1              4461.2
10 mm tool at 0 min     1712.3              3827.7

WPT described in section 2 is applied to extract 
the wavelet coefficients over 8 levels, and signals at 
different frequency sub-bands are reconstructed. 
The two highest energies and entropies are then 
selected for the discrimination. Results are depicted 
in Figure 6. 

It can be seen from Figure 6 that with the selected
energy and entropy, the two end milling tools can be
discriminated with good quality, indicating the
effectiveness of the proposed approach in identifying
the states of different end mill tools at varying
cutting parameters.


Figure 6. Discrimination results of 8 mm and 10 mm
tools at different conditions.


4 CONCLUSIONS 


In this paper, the discrimination of end mill tools 
with different diameters, wear levels, and cutting 
parameters is investigated. Wavelet packet trans- 
form is applied to extract wavelet coefficients from 
the original signal. Signals at different frequency 
sub-bands are then reconstructed using wavelet 
coefficients from which energy and entropy are 
calculated. The two highest energies and entropies 
are selected to discriminate different cutting tool 
states. 

Two cases are used in this study to investigate the
performance of the proposed method: cutting tools
with different diameters and wear levels, and cutting
tools with different diameters and wear levels but
similar instantaneous powers. Results demonstrate
that with the proposed approach, the state of the
cutting tool can be discriminated with good quality,
in terms of both the tool wear level and the cutting
parameters.

Whilst these initial results are promising, further
work is required to expand the analysis over a wider
range of cutting parameters to establish whether the
methodology holds. Additionally, refinement of the
sensor measurement and tool monitoring service is
required. At present, signal analysis is performed
off-line, and data is captured at a high frequency
(50 kHz), which increases the cost of equipment and
the time of analysis. Optimization of this
methodology is required in order to enable on-line
monitoring.


ABSTRACT: The vulnerability of technological and administrative systems to cyberattacks has been
shown to be high in several cases, which has led to different unwanted consequences. Autonomous ships
will also be exposed to the threat of cyberattacks, due to their need for connecting to operational,
management and administrative systems onshore. The most critical hazards are possibly not associated with
consequences for the ship itself or its cargo, but the threat to infrastructure along the coast and offshore
if a ship under alien command is used as a “battering ram” to cause major structural damage. Even
relatively small autonomous ships may pose a real threat, and ships sailing in international waters may come
from distant locations. This implies that all autonomous ships may be considered as possible threats. This
paper outlines the risk for some infrastructure systems. Even though the probability may be low, such
events cannot be ruled out in the future, and the design of autonomous ships must involve a series of risk
reducing actions and designs.


1 INTRODUCTION 


Maritime security has risen on the agenda over the
past decade. In 2004, the U.S. presented a national
maritime security policy. The Sept. 11th attacks also
put maritime terrorism on the agenda. The increase in
piracy attacks in 2008 and 2011 off the coast of
Somalia drew even more attention to maritime security
globally. In 2011, maritime security became one of
the objectives in the North Atlantic Treaty
Organization’s (NATO) Alliance Maritime Strategy. The
UK, EU and the African Union proposed maritime
security strategies in 2014 (Bueger, 2015). The
Maritime Safety Committee (MSC) in the International
Maritime Organization has recently published
guidelines on maritime cyber risk management (IMO,
2017a).

There is an increased focus on developing 
autonomous ships. A motivation is reduced build- 
ing and operational costs, because the ships can be 
redesigned. Research projects, such as the Mari- 
time Unmanned Navigation Through Intelligence 
in Networks (MUNIN) (Rodseth & Tjora, 2014) 
and Advanced Autonomous Waterborne Appli- 
cations (AAWA, 2016) focus on the development 
of technological specifications and designs for 
autonomous ships. Industry projects aim at real- 
izing the first autonomous ships in the next 1-3 
years, e.g., Yara Birkeland (Kongsberg Maritime, 
2017). 

Autonomous ships will be exposed to the threat of
cyberattacks, due to their need to connect to
operational, management and administrative systems
onshore. The most critical hazards are possibly not
associated with consequences for the ship itself or
its cargo, but with the threat to infrastructure
along the coast and offshore if a ship under alien
command is used as a ‘battering ram’ to cause major
structural damage. Even relatively small autonomous
ships carry high kinetic energy when travelling at
full speed and may thus pose a real threat to
infrastructure systems. Ships sailing in
international waters may come from distant locations.
This implies that all autonomous ships may be
considered as possible threats. It will not be
sufficient to ensure that the high-quality
classification societies have stringent requirements;
all classification societies or IMO need to focus on
such threats.
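
The kinetic energy argument can be made concrete with a back-of-the-envelope calculation. The sketch below uses the standard expression 0.5·m·v² for translational kinetic energy; the ship displacement and speed are hypothetical values chosen only for illustration, and the hydrodynamic added mass that would increase the effective energy in a real collision is ignored.

```python
KNOT_IN_M_PER_S = 1852.0 / 3600.0  # one international knot in metres per second


def kinetic_energy_mj(displacement_tonnes, speed_knots):
    """Translational kinetic energy 0.5 * m * v**2, returned in megajoules."""
    mass_kg = displacement_tonnes * 1000.0
    speed = speed_knots * KNOT_IN_M_PER_S
    return 0.5 * mass_kg * speed ** 2 / 1e6


# Hypothetical small ship: 3000 t displacement at 12 knots (illustration only).
print(f"{kinetic_energy_mj(3000.0, 12.0):.1f} MJ")
```

Because the energy grows with the square of the speed, the speed at impact matters more than the size of the ship in an estimate of this kind.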

We may think that the probability of cyber-attacks is
low, but such events cannot be ruled out in the
future. We therefore believe that it is important,
before autonomous ships are built and commissioned,
that the marine and maritime industry at large
considers this threat and takes the necessary actions
to implement sufficient risk control measures.

A cyber-attack may have some parallels with the
terrorist attack on USS Cole, the United States Navy
guided-missile destroyer, on 12th October 2000, while
it was being refueled in Yemen’s Aden harbor (US
Navy, 2001). Seventeen sailors were killed and 39
injured in the attack by a small fiberglass boat
carrying explosives and two suicide bombers. The boat
approached the port side of the destroyer in broad
daylight and exploded, creating a 12 by 18 m gash in
the ship’s port side from what was estimated to be
180-320 kg of explosives.

The objective of the paper is to discuss the
implications of the vulnerability of autonomous ships
to cyber-attacks, the threats that a ship under alien
control may represent for infrastructure systems, and
how such risk should be mitigated in general. There
are also other activities and sectors in society
where cyber-attacks may be a potential threat. One
incident known from the petroleum industry is
described in Section 2.1. Some incidents in the
energy sector are briefly mentioned in Section 2.3.
Autonomous cars are another such sector; see further
descriptions in Section 2.2. Experiences from other
sectors can be used as a basis for assessing risk and
developing relevant risk mitigation measures for
autonomous ships.

Traditional risks to ships, which also apply to 
autonomous ships, such as collision, grounding, 
foundering, etc. are outside the scope of the paper, 
and are therefore not discussed. These risks are still 
important, and are subject to attention by several 
researchers. The risks to infrastructure systems are 
special in the sense that catastrophic consequences 
may cascade outside the industry itself. 

The paper considers unmanned autono- 
mous ships primarily, but differences between 
unmanned and manned autonomous ships are also 
considered. 


2 REVIEW OF CYBER THREATS 
IN COMPARABLE SYSTEMS 


2.1 Petroleum industry 


It is not easy to collect experience data about
cyber-attacks. Statoil corporate management was
invited to give a university lecture about cyber
threats to their systems and operations in October
2016 (Statoil, 2016). The incidents presented during
this lecture are listed in Section 2.3 below. No
incidents from Statoil’s own operations were
mentioned in the lecture. Three weeks later it was
revealed through the media that there had been a
serious unintended incident at Statoil’s Mongstad
refinery in May 2014, as described in the following.
Through the subsequent handling of this incident, it
became clear that Statoil has had many more
incidents, probably of varying severity. What was
revealed by the media a short while after the guest
lecture puts the lack of openness in the university
lecture in a special light.

The most well-known incident in the petroleum
industry is from the downstream part, where
maintenance on a server by an IT specialist at
Hindustan Computers Ltd. (HCL) in India disrupted the
loading of a gasoline tanker at the Statoil-operated
Mongstad refinery just outside Bergen in Norway on
21st May 2014. An input error by the operator gave
him access to a server he should not have been able
to access. It should not be possible to stop the
server in question remotely, but the HCL specialist
inadvertently accessed the server through a ‘back
door’, according to media.

The operation of certain IT systems, including those
at the Mongstad refinery, was outsourced by Statoil
to HCL in India in 2012, after a risk assessment. The
incident referred to here did not affect safety
directly, but could potentially have affected safety
functions and barriers, according to the audit report
by the Petroleum Safety Authority (PSA, 2017).

The NRK broadcasting company in Norway found 29
incidents where information and communication
technology (ICT) employees from India had accessed
servers they should not have had access to in
Statoil. Anonymous sources in Statoil have commented
that the problem was more extensive than what the
journalists found.¹

The PSA audit was initiated after the inci- 
dent was known in the public domain, almost 2.5 
years after it occurred. The audit considered the 
handling of incidents associated with ICT and 
information security by Statoil in general. PSA 
considered several ICT related incidents, as well 
as Statoil’s technical requirements to information 
security for industrial automation and control sys- 
tems. The wording of the PSA audit report is such 
that it indicates that other incidents have occurred 
that are unavailable in the public domain. 

Statoil was criticized by PSA for failing to notify
the authorities about the incident at the time it
occurred; according to Statoil’s own assessment, the
incident could have had consequences such as failure
of safety functions or barriers, according to media
reports (see footnote 1 above).

Statoil announced in mid-2017 that they had cancelled
all outsourcing contracts that affected safety
critical systems. They had concluded that the
outsourcing of these systems represented too high a
risk of unwanted influence on the systems.

From the media, it is known that Statoil was the
target of a massive attack over three days in 2013,
where hackers tried to install dangerous code on
Statoil computers²; the attack was apparently
unsuccessful.

¹ https://www.nrk.no/norge/xl/tastefeilen-som-stoppet-statoil-1.13174013.
² http://www.newsinenglish.no/2014/08/28/statoil-held-off-hacker-attack/.


2.2 Autonomous vehicles


Autonomous cars are expected to become an important
part of the transportation system within the next
decade. Self-driving vehicles will be shared by
several users (Lyche, 2017). Autonomous buses may be
realized in the near future with operators in control
centers remotely overseeing several buses. In
specific circumstances, the operators may take over
control and remotely operate the buses if needed
(Lyche, 2017). This means that the autonomous buses
will operate in different autonomy levels, with
shared control.

A major challenge is the increasing intercon- 
nection that may expose safety-critical systems 
to security threats. Cars are no longer physically 
isolated machines controlled mechanically and 
locally (Macher et al., 2017). They have become
computers with various electronic control units 
(ECU) and hackers may take control over brakes, 
engine, the steering wheel, radio, and lights. 
Recently, it was discovered that one million cars 
could be hacked simultaneously (Kibar, 2017; Slo- 
vik, 2017). 

A car’s vulnerability to hacking depends on what kind
of remote connection the car has, the configuration
of the car’s internal computer network, and how
external digital commands may affect physical
components (Kibar, 2017). Press (2017) discusses how
cars can become weapons of mass destruction on the
road. It will not be sufficient to install firewalls
or intrusion detection systems. The UK Government
states that Wi-Fi connected cars, along with
autonomous cars, are getting increasingly vulnerable
to hacking and data theft. It recently published key
principles of vehicle cyber security for connected
and automated vehicles to support the industry
(GOV.UK, 2017). These principles are (quote):


1. Organizational security is owned, governed and 
promoted at board level. 

2. Security risks are assessed and managed appro- 
priately and proportionately, including those 
specific to the supply chain. 

3. Organizations need product aftercare and inci- 
dent response to ensure systems are secure over 
their lifetime. 

4. All organizations, including sub-contractors, 
suppliers and potential 3rd parties, work 
together to enhance the security of the system. 

5. Systems are designed using a defence-in-depth 
approach. 

6. The security of all software is managed through- 
out its lifetime. 

7. The storage and transmission of data is secure 
and can be controlled. 

8. The system is designed to be resilient to attacks 
and respond appropriately when its defences or 
sensors fail. 


The connectivity means that the vehicle is integrated
in a global ad-hoc network system where external
information is important for decision making.
Security has become an important aspect to include in
systems safety engineering. The development of these
novel transportation systems means that systematic
approaches taking both safety and security aspects
into consideration are needed (Macher et al., 2017).

Standards relevant for the automotive domain 
are increasing their focus on security. IEC 
61508:2010 mentions that security threats may 
be identified during hazard analysis. Neverthe- 
less, the security threat analysis is not specified or 
detailed. The SAE J3061:2016 is a guideline for 
cybersecurity engineering. Among other things, 
it focuses on defining a process for implementing 
cybersecurity in the design, considering a vehicle’s 
lifecycle and providing basic guiding principles on 
cybersecurity. 


2.3 Energy sector 


Some other cyber-attacks on the energy sector that 
are known in the public domain are the following 
(Statoil, 2016): 


• Attacks on Technical network
  ○ Stuxnet: Iran’s uranium enrichment facility 2010
  ○ German Steel Mill 2014
  ○ Ukrainian power network 2015
  ○ German nuclear plant 2016
• Attacks on Office network
  ○ Shamoon incident: Saudi Aramco office network 2012
  ○ Energetic Bear: Energy industry in the US and Europe 2012 →
  ○ Cleaver (recon): Energy infrastructure several countries around the globe 2012 →


The Gundremmingen nuclear power plant in Germany,
located about 120 km northwest of Munich, is run by
the German utility company RWE. It was found to be
infected with computer viruses, but they appeared not
to have posed a threat to the facility's operations
because it is isolated from the Internet, according
to press reports.³ The viruses, which included
“W32.Ramnit” and “Conficker”, were discovered at
Gundremmingen’s B unit in a computer system
retrofitted in 2008 with data visualisation software
associated with equipment for moving nuclear fuel
rods, RWE said. Malware was also found on 18
removable data drives, mainly USB sticks, in office
computers maintained separately from the plant’s
operating systems. W32.Ramnit is designed to steal
files from infected computers and targets Microsoft
Windows software, according to the security firm
Symantec. Conficker has infected millions of Windows
computers worldwide since it first came to light in
2008. It is able to spread through networks and by
copying itself onto removable data drives, Symantec
said.

³ http://www.telegraph.co.uk/news/2016/04/27/cyber-attackers-hack-german-nuclear-plant/.

‘Energetic Bear’ is a Russian virus that lets hackers
take control of power plants. Over 1,000 energy firms
have been infected, according to media reports.⁴ The
hackers obtained access to power plant control
systems, and could have disrupted energy supplies in
affected countries if they had used the sabotage
capabilities open to them, according to the Daily
Mail.

In October 2017, it was revealed by the media⁵ that
Russians had jammed the GPS signals in Northern
Norway in September 2017, as a deliberate action by
the Russian military during a cyber warfare exercise.


3 CYBER RISKS FOR AUTONOMOUS 
SHIPS 


3.1 Hacking of autonomous ships 


The technological advancement towards ships operating
without an onboard crew is enabled by the
developments in ICT in recent years. ICT provides
on-board intelligence and data connection
capabilities. The ships may operate at different
levels of autonomy. At a high level of autonomy,
ships may be supervised by human operators in Shore
Control Centres (SCC). Whenever necessary, the
operator (supervisor) may intervene. An SCC could
take responsibility for overseeing specific phases of
a ship’s operation or voyage, for example maneuvering
in and out of port, which then means that the ship
would operate at a lower level of autonomy. The
connectivity between the ship and the SCC must have
high capacity and availability and is crucial for the
realization of autonomous ships (AAWA, 2016; MUNIN,
2015).

The increasing usage of networked ICT technology
makes it possible to access systems through
network interfaces and gain unauthorized remote
capability to control ship systems in undesired ways
(AAWA, 2016). Security threats that are relevant
for ships are piracy and hijacking, smuggling
of goods, human trafficking, damaging of ship or
port facility, vandalism, sabotage, such as intentional
jamming or spoofing of the ship automatic
identification system (AIS), GPS signals and
communication systems, and use of the ship as a weapon
for terrorist activity (AAWA, 2016; MUNIN, 2015).

*http://www.dailymail.co.uk/sciencetech/article-2675798/Hundreds-European-US-energy-firms-hit-Russian-Energetic-Bear-virus-let-hackers-control-power-plants.html.
https://www.nrk.no/finnmark/e-tjenesten-bekrefter_-russerne-jammet-gps-signaler-bevisst-1.13721504.

The security challenge of shipping has been 
addressed by the International Maritime Organization
(IMO) Maritime Safety Committee and the
Facilitation Committee, who recently issued guide- 
lines on maritime cyber risk management (IMO, 
2017a). The guidelines give high-level recommen- 
dations on security risk management to protect 
shipping from current and emerging threats. Five 
functional elements are presented consisting of 
identification, protection, detection, responding 
and recovering. Vulnerable systems that are men- 
tioned in the guideline are bridge systems, cargo 
handling and management systems, machinery 
and propulsion systems, control systems, passen- 
ger servicing and management systems, passenger 
public networks, crew welfare systems, and communication
systems. IMO states that cyber risk
management should be integrated into ship safety
management by 2021 (IMO, 2017b).

To protect a ship against cyber threats means 
that vulnerabilities in the ICT infrastructure need 
to be eliminated and effective measures for intru- 
sion prevention must be implemented. It is also nec- 
essary to consider that hackers may become more 
skillful over time with more advanced techniques 
available. This means that cyber security needs to be 
dynamic and proactive. Classification and encryp- 
tion of data, user identification, authentication, 
authorization, protection of data integrity and con- 
nectivity, as well as activity logging and auditing are 
examples of typical cyber security methods that are 
expected to be needed (AAWA, 2016). 

MUNIN (2015) presents a risk matrix, includ- 
ing both safety and security aspects. The highest 
ranked threats are found to be jamming, spoofing 
or hacker attacks of AIS, GPS signals, or commu- 
nication systems, leading to collision with other 
ships, or ship grounding in critical areas. 


3.2 Autonomous ships used as threat to 
infrastructure systems 


Control over an unmanned ship which is
hacked may be lost completely, which is the most
severe situation. It is assumed that complete loss
of control is impossible if there is a small crew
onboard, because such a crew may be able to
deactivate external control and take over
control locally. If this fails, they should at least be
able to shut off power and let the ship drift until
they are able to take back control locally.

But without a local crew such possibilities are
not available, and control may be lost completely,
at least for some time. In theory, control may be
reestablished by boarding the ship, for instance by
helicopter, such as a police or naval helicopter.
This will take time in any case, and if the
vessel is far from shore, a helicopter may not be
able to reach the ship until it comes closer to shore,
and then it may be too late.

It is therefore possible that an unmanned, 
autonomous ship that has been hacked may be 
used to ram into infrastructure systems. This is 
discussed further below. A similar scenario could 
also occur with a conventional manned ship, if the 
ship is hijacked, but this is outside the scope of
the present discussion. 

Let us first consider if a hacked, unmanned 
ship may be a threat to other ships in open seas. 
This may be possible in principle, but if the other 
ships are conventional, manned ships, they may be 
able to avoid the hacked, unmanned ship through 
maneuvering away from the threat. This may fail 
if the threat is not observed, but should normally 
be successful. If the second ship is an unmanned,
autonomous ship, control from shore should be 
able to observe the threat in a similar manner. 

A special case occurs if the other ships represent
potentially catastrophic consequences, for
instance if the other ship is a cruise ship with many
thousands of passengers, or a very (or ultra) large
crude carrier, capable of transporting on the order
of 2,000,000 bbls of crude oil. These ships would
not be autonomous, and should normally be able
to avoid attack.

But infrastructure installations are usually stationary
and not able to relocate to avoid the threat.
By infrastructure systems in this context one may
first of all think of bridges crossing fjords, bays
and other open sea areas, which are found in
almost all coastal areas worldwide. Other systems
may be offshore petroleum installations, which are
found far away from shore in several parts of the
world: the North Sea, the Gulf of Mexico, the Atlantic
off the coast of Brazil, several African countries,
Newfoundland and Shetland, as well as the Pacific in
some areas off Australia and the South China Sea.

There are considerable differences in resistance
to external impact among the various types of
infrastructure systems. In Norway, for example,
a study project has been ongoing to establish
possible concepts for crossing some of the largest
fjords on the west coast of Southern Norway.
For a possible crossing of the Sognefjord, a floating
bridge concept has been specified to have a kinetic
energy resistance of 1563 MJ, corresponding to a ship
of about 31,500 tdw travelling at a full speed of
17.7 knots (Statens Vegvesen, 2013). Smaller bridges
along the coast are believed to have a resistance at
least one order of magnitude lower, but the consequences
of a collision with a smaller bridge may be less extensive.
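
As a rough plausibility check on such energy figures, the kinetic energy of a moving ship can be estimated from E = ½mv². The sketch below is illustrative only and is not taken from the cited design basis: it ignores hydrodynamic added mass, treats the mass as the total moving mass in metric tonnes, and uses hypothetical function names.

```python
KNOT_IN_M_PER_S = 1852.0 / 3600.0  # one knot is one nautical mile (1852 m) per hour


def kinetic_energy_mj(mass_tonnes: float, speed_knots: float) -> float:
    """Kinetic energy in MJ from E = 0.5 * m * v**2 (added mass ignored)."""
    mass_kg = mass_tonnes * 1000.0
    speed_m_per_s = speed_knots * KNOT_IN_M_PER_S
    return 0.5 * mass_kg * speed_m_per_s**2 / 1e6


def mass_for_energy_tonnes(energy_mj: float, speed_knots: float) -> float:
    """Mass in tonnes that carries the given kinetic energy at the given speed."""
    speed_m_per_s = speed_knots * KNOT_IN_M_PER_S
    return 2.0 * energy_mj * 1e6 / speed_m_per_s**2 / 1000.0


# Moving mass implied by the specified 1563 MJ at 17.7 knots.
implied_mass = mass_for_energy_tonnes(1563.0, 17.7)
print(f"{implied_mass:,.0f} tonnes")
```

Under these simplifying assumptions the implied moving mass is somewhat below 38,000 tonnes, which is plausible for a vessel of about 31,500 tdw because a ship's displacement exceeds its deadweight.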


When it comes to offshore structures, they have
traditionally been designed to resist the impact from
a drifting service vessel. Typically, the design value
was 14 MJ for many years (Vinnem, 2013), but it has
in recent years been increased to around 50 MJ
(Yu & Amdahl, 2018), due to the increasing size of
the service vessels used for these installations. The
largest offshore structures, the concrete gravity based
structures (so-called Condeep structures), which
were commonly installed in the North Sea some 20
years ago, have a push-over resistance of about 200
MJ (Vinnem, 2013). This is almost an order of
magnitude lower than the specified resistance of
the bridge for the crossing of the Sognefjord.
Most offshore structures have a capacity on
the order of 50 MJ or less.

Floating offshore structures may in theory move
away, if the threat is detected sufficiently early.
However, if the hacked ship is used with the intention
to ram into a structure, it may be able to follow
the movements of the offshore installation.

Even a small ship with a mass of 5,000 tons, trav- 
elling at a speed of 12 knots, has a kinetic energy 
of roughly around 200 MJ, which is excessive in 
relation to structural capabilities of most offshore 
structures; only the Condeep structures could be 
expected to survive. Larger ships will be a threat to 
all offshore structures. 

The largest offshore structures are usually
manned with up to a few hundred persons, implying
that many lives are at risk. In addition comes
the blowout potential. Here the fixed installations
are the most vulnerable, because the equipment to
isolate the wells is mainly on deck. If the installation
is wiped out, very long-lasting blowouts may
occur as a result, in addition to the death toll.


4 FEASIBILITY OF RISK REDUCTION 


4.1 Approach to risk control 


The previous sections have shown that hacked 
autonomous unmanned ships may be a consid- 
erable threat to offshore installations, and to 
infrastructure elements along the coast unless par- 
ticularly strengthened. 

It is considered that further strengthening of
structures is not relevant. First of all, this is
impossible for existing structures, and further
strengthening of future structures is not relevant
due to excessive costs. The risk control actions will
need to be focused on preventing the threats from
causing incidents.

Traffic surveillance is one of the solutions
adopted by the offshore oil and gas industry for
protection of offshore installations against collision
threats from passing vessels. For the Norwegian sector,
there are several centers: two operated by offshore
companies and several government operated
centers along the coast. The main principle is to
detect a ship on collision course as early as possible,
to make it possible to communicate with the
ship and warn it to alter its course. If contact is not
established, the approach is to warn the installation
sufficiently early, such that safe evacuation of
all personnel may be completed. In addition, available
resources may be used to try to establish contact
with the vessel, if communication fails.
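
One generic way to flag a vessel on a possible collision course is to compute its closest point of approach (CPA) and the time to reach it (TCPA) from a tracked position and velocity. The sketch below is an illustration only and is not a description of how the surveillance centers work: it assumes a stationary installation at the origin of a flat local coordinate frame in metres, a vessel holding constant course and speed, and hypothetical function and parameter names.

```python
import math


def cpa_to_fixed_point(x_m, y_m, vx_m_s, vy_m_s):
    """Return (cpa_distance_m, tcpa_s) of a vessel relative to the origin.

    The vessel is at (x_m, y_m) with constant velocity (vx_m_s, vy_m_s);
    the installation is assumed stationary at the origin.
    """
    speed_sq = vx_m_s**2 + vy_m_s**2
    if speed_sq == 0.0:
        # A stationary vessel never gets closer than it is now.
        return math.hypot(x_m, y_m), 0.0
    # Time at which the distance to the origin is minimal.
    tcpa_s = -(x_m * vx_m_s + y_m * vy_m_s) / speed_sq
    if tcpa_s < 0.0:
        # The vessel is already moving away; closest approach is now.
        return math.hypot(x_m, y_m), 0.0
    cpa_x = x_m + vx_m_s * tcpa_s
    cpa_y = y_m + vy_m_s * tcpa_s
    return math.hypot(cpa_x, cpa_y), tcpa_s


# Hypothetical track: 20 km west and 500 m north of the installation,
# heading due east at 6 m/s.
distance_m, time_s = cpa_to_fixed_point(-20_000.0, 500.0, 6.0, 0.0)
print(f"CPA {distance_m:.0f} m in {time_s / 60:.1f} min")
```

A small CPA combined with a short TCPA would be the trigger for attempting contact and, failing that, for warning the installation in time to evacuate.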

But the approach in this case assumes that the
vessel does not want to collide. If it is on collision
course, this is due to lack of awareness, or in some
cases it is intentional for a certain period, with a
planned future course change. This approach is not
correspondingly well suited if the ship is on collision
course by intent. Communication is not going to change
anything, nor will the use of vessels or other resources
to achieve physical contact. Still, the detection of a
ship on collision course will imply that evacuation of
personnel may be possible, if the procedures to start
evacuation in a timely manner are adhered to. This will
not protect the installation, though.

Keeping a small crew onboard is the most effective
risk control action. It was assumed above that
a small crew may be able to deactivate external
control and take over control locally and mechanically.
If this fails, they should at least be able to
shut off power and let the ship drift until they are
able to take back control locally or assistance
from shore has arrived.

A small crew would not need to be onboard all
the time; their presence could be limited to areas
where there is critical infrastructure.

If keeping a small crew onboard is infeasible, then 
the only option is to ensure as far as possible that 
there are no possibilities for hackers to gain access 
to the control of an autonomous unmanned ship. 

Another option would be to limit the operational
area of an unmanned autonomous ship,
for instance by limiting the available fuel stored
onboard. This is to some extent used for aircraft,
although the main purpose in that case is to
limit the weight the aircraft is carrying. But this
would be an option with some other risks. If the
ship is significantly delayed due to weather or other
unforeseen events, it could run out of fuel. If such
risks are judged to be tolerable, however, it may
provide an effective way to prevent hackers from
turning a ship into a threat to targets far away from
the intended route. A battery powered ship will have
such limitations in any case.


4.2 Principles of prevention of threats 


If the ship is completely unmanned, it will be essential
to eliminate any vulnerabilities in the control and
communication systems onboard that may be used in a
cyber-attack to gain control over the ship. This implies
that complete control over the construction, procurement,
management, operation and maintenance of autonomous ships
without manning of the ship for any purpose is necessary.
At no time should unauthorized organizations or individuals
get the opportunity to install software or hardware which
may provide hackers with a “backdoor” into the control
system and software.


4.3 Responsibilities 


Even though the probability of a cyber-attack
against an unmanned autonomous ship may
be low, such events cannot be ruled out in the
future, and the design of autonomous ships has
to involve a series of risk reducing actions and
designs. Requirements for completely non-vulnerable
control and communication systems may pose
extreme restrictions on the construction, management,
operation and maintenance of a completely
unmanned ship, perhaps to the extent that the
advantage of zero manning is by far outweighed by
cost increases associated with such restrictions.


4.3.1 Role of ship owners 

It will be the responsibility of the ship owner who is
commissioning the construction of an unmanned,
autonomous ship to ensure that no alien software or
hardware which may be used in a cyberattack is
allowed on board.

This will imply that every aspect of construction, 
procurement, management, operation and mainte- 
nance of such ships is controlled in extreme detail. 
All suppliers, vendors and component manufactur- 
ers and all their personnel will have to be scruti- 
nized in order to ensure that no one has illegitimate 
purposes. This would be an extreme control system. 

In the late 1970s, the possibility of constructing
nuclear power plants in Norway was considered by
specialists and politicians. For many of the people
who were against, the most fundamental argument
was that very strong requirements would be needed
for control of the personnel who would operate
and maintain nuclear power plants. Such very
strong restrictions and surveillance of personnel
were completely unacceptable to many persons.

To prevent successful cyber-attacks on autonomous
ships, it will be crucial to maintain control
and sufficient quality assurance over the whole 
software development process. This might become 
costly and reduce some of the expected cost sav- 
ings related to autonomous ships. 


4.3.2 Role of designers and ship builders 

It is still the responsibility of designers and ship 
builders to implement the very strict control 
outlined above.


4.3.3 Role of classification societies

The classification societies will have to provide
assurance that no alien software or hardware has
been installed at any time during construction.
This will require quite extreme housekeeping and
control activities. It will not be sufficient to ensure
that the high-quality classification societies have
stringent requirements; all classification societies
(high quality and low quality) need to focus on
such threats.

Such assurance will also need to be maintained after
commissioning, due to software updates, etc. Verification
of software and software updates therefore
becomes even more important and challenging.
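
One basic building block for verifying a software update is to compare a cryptographic digest of the delivered file with a value obtained through a separate, trusted channel. The sketch below uses SHA-256 from the Python standard library and is an illustration only: a digest check by itself does not authenticate the publisher (that would require digital signatures), and the function names and sample data are hypothetical.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def update_is_unmodified(update_bytes: bytes, expected_hex: str) -> bool:
    """Check an update image against a digest obtained via a trusted channel."""
    actual_hex = sha256_hex(update_bytes)
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(actual_hex, expected_hex.lower())


# Hypothetical update image and the digest recorded when it was approved.
approved_image = b"navigation-controller v1.2"
approved_digest = sha256_hex(approved_image)

print(update_is_unmodified(approved_image, approved_digest))            # unchanged image
print(update_is_unmodified(approved_image + b"\x00", approved_digest))  # altered image
```

Any change to the image, however small, changes the digest, so an altered update would be rejected before installation.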


4.3.4 Role of IMO 

It is necessary to establish very stringent international
requirements to control the risk of cyberattacks on
autonomous ships. Any ship from anywhere in the
world can travel international waters all over the
globe and become a threat in very distant waters,
provided it has a sufficient amount of fuel (or operates
on solar power!). All ships will therefore need
to follow strict requirements.

The following would be expected as high-level
IMO requirements for two alternative
categories of autonomous ships, with and without
manning:


1. Autonomous ships that always require a small
   crew onboard to operate
   a. Ships to have a function which mechanically
      deactivates external control and replaces it
      with local control
   b. Ships to have a global power-off function
      which, as a last resort, gives a dead ship
2. Autonomous ships that may operate without
   any crew members onboard
   a. Ships to have a function which limits the
      stored fuel to the distance between ports with
      a small margin, or
   b. Steps to be taken to ensure fully that nobody has
      the opportunity to install hardware or software
      that may be used in cyberattacks against the
      ship.
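
The fuel limitation in requirement 2a can be illustrated with a simple fuel budget. The sketch below is an assumption-laden illustration: it supposes a constant fuel consumption per nautical mile, and the function names and all numbers are hypothetical rather than taken from any regulation.

```python
def max_fuel_tonnes(distance_nm, consumption_t_per_nm, margin_fraction):
    """Fuel to bunker for one port-to-port leg, including a safety margin."""
    if distance_nm < 0 or consumption_t_per_nm < 0 or margin_fraction < 0:
        raise ValueError("inputs must be non-negative")
    return distance_nm * consumption_t_per_nm * (1.0 + margin_fraction)


def max_range_nm(fuel_tonnes, consumption_t_per_nm):
    """Greatest distance the ship could cover on the bunkered fuel."""
    return fuel_tonnes / consumption_t_per_nm


# Hypothetical leg: 400 nautical miles, 0.05 t of fuel per nautical mile,
# and a 10% margin for weather and delays.
fuel = max_fuel_tonnes(400.0, 0.05, 0.10)
reach = max_range_nm(fuel, 0.05)
print(f"bunker {fuel:.1f} t, maximum reach {reach:.0f} nm")
```

The margin sets the trade-off discussed in Section 4.1: a larger margin reduces the risk of running out of fuel after a delay, but also extends the distance a hijacked ship could deviate from its intended route.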


5 CONCLUSIONS 


This article discusses cyber-attacks and their
potential threat to autonomous ships. Experience
from other sectors is presented and discussed.
A hacked autonomous ship may be used as a
weapon to ram offshore oil and gas systems or
infrastructure systems along the coast, or to collide
with cruise ships or oil tankers.

Infrastructure systems along the coast may
be considerably more robust against collision
impact compared to offshore structures. Typical
offshore structures may have a resistance up to
200 MJ, which corresponds to a 5,000-ton ship
with a speed of 12 knots, and are thus quite
vulnerable.

Keeping a small crew onboard is the most effec- 
tive risk control action, assuming that the crew 
may be able to deactivate external control and take 
over control locally and mechanically. 

An alternative to keeping a small crew onboard is to
ensure, as far as possible, that hackers cannot gain
access to the control of an autonomous unmanned
ship. This implies that there will have to be complete
control over the construction, procurement,
management, operation and maintenance of
autonomous ships.

Another option would be to limit the operational
area of an unmanned autonomous ship,
for instance, by limiting the available fuel stored
onboard. But this would be an option involving
some other risk: if the ship is significantly delayed
due to weather or other unforeseen events, it could
run out of fuel. If such risks are judged to be
tolerable, however, it may provide an effective way
to prevent hackers from turning a ship into a threat
to targets far away from the intended route.

It is necessary to establish very stringent international
requirements to control the risk of cyberattacks
on autonomous ships. As ships may travel
all over the globe and become a threat in very distant
waters, all ships will need to follow
strict requirements. There will have to be different
requirements for ships which require a small crew
onboard than for those without any crew.


ABSTRACT: The objective of this research was to evaluate a risk assessment process based on the
Functional Resonance Analysis Method (FRAM) (Hollnagel, 2012) as a practical tool within a manufacturing
environment. Instead of focusing on the activities and events which can cause adverse outcomes,
known as a Safety-I approach, this study uses risk assessment for detecting the sources of variability
within activities and how this may lead to both negative and positive outcomes within the system;
Hollnagel (2012) refers to this as a Safety-II approach. Four examples of work processes from an upholstered
seat manufacturer were assessed using a risk assessment process involving consultation with workers to
determine sources of variability in the ‘work-as-done’. The data was mapped onto the six FRAM aspects
to envision instantiations. The method provided a means of clearly articulating gaps in the system design
impacting safety and productivity. FRAM was found to be an effective mechanism for revealing aspects
of variability but required a greater resource commitment than regular risk assessment tools.


1 INTRODUCTION 


Despite their widespread use, risk assessment
tools such as risk matrices have documented pitfalls
(Gadd et al., 2004). Due to restricted requisite
variety, their effectiveness is limited in the identification
and control of hazards (Conant and Ross,
1970), which in turn limits the benefit they provide
to both employees and organisations. Further,
the use of a linear causal relationship to describe
hazards generates a concentration on negative
outcomes and lower order controls, limiting stakeholder
learning and cross disciplinary engagement
(Cox, 2008).

This research proposed to challenge this Safety-I 
archetype (Hollnagel, 2014) and replace it instead 
with the perspective of Safety-II (Hollnagel, 2014), 
principally to look beyond isolated hazard man- 
agement to the consideration and optimisation of 
the greater system within which the hazards exist. 
The objective was to investigate if the pitfalls asso- 
ciated with linear risk assessment methods could 
be mitigated by using a tool with greater requisite 
variety. 

The research evaluated four work systems
within a manufacturing environment. The systems
were selected based on work-as-imagined descriptions
of characteristics key to both Safety-II and
FRAM assessments, being: (i) variability in functions,
(ii) the level of acknowledgement and control
of that variability, and (iii) the couplings between
functions within the systems, as well as (iv) couplings
to upstream and downstream systems.


1.1 Safety-I and Safety-II


Hazard management is based on the definition of 
risk, the perception of risk, the tools employed to 
measure risk, and the respective controls identi- 
fied as a result of the risk assessment process (IEC, 
2009). The type of tools used for risk assessment 
will also influence the type and number of hazards 
identified, the controls for management of those 
hazards, and the capacity for organisational learn- 
ing (Lundberg et al., 2009). 

In line with this practice, the risk matrix is a
tool used for the assessment of risks based on the
likelihood and consequence of negative outcomes
(IEC, 2009). The risk matrix is widely accepted as
an industry tool because it is simple to use
and understand, and it is recommended by regulators
such as Australian governments (Worksafe,
2014).

In contrast to current Safety-I based practice, 
Safety-II seeks to understand why things go right 
even when there is variability in the system which 
would otherwise lead to error states (Hollnagel, 
2012b). This creates a focus on ‘work-as-done’ (the 
way work is actually done, as opposed to the docu- 
mentation which describes work-as-imagined) and 
how the system may adapt to emerging situations 
to prevent the occurrence of negative outcomes.

The adoption of a Safety-II perspective does
not mean that the principles of Safety-I should
be discarded. Safety-II provides a different lens to
question the way work-as-done is understood and
Safety-I is a subset of Safety-II (Hollnagel, 2014).
The trade-off of looking at everything that goes
right, as opposed to just that which goes wrong,
is the need to look at all activity, not just specific
outliers.

Further to this line of thought, the Safety-II 
concept recognises that workers adapt to situations 
that do not function as intended, interpreting the 
conditions of the environment and adjusting their 
own output accordingly (Hollnagel, 2012b). Rec- 
ognising that a system is unlikely to be completely 
defined or closed, humans play an important role 
in bringing together elements that could not oth- 
erwise manage the variability of an open system 
(Woods et al., 2010). 


1.2 The Functional Resonance Analysis 
Method (FRAM) 


The degree of variation that a Safety-I tool such 
as a risk matrix can describe is insufficient to 
characterise the degree of variation needing to be 
understood or controlled in the physical system 
being interrogated (Hollnagel and Woods, 2005). 
Conversely, the FRAM may have a greater requi- 
site variety due to its ability to better model the 
environment being investigated and thus its regula- 
tion (Conant and Ross, 1970). 

Unlike a risk matrix assessment which only 
looks at likelihood and consequence, a FRAM 
assessment requires four steps. The first two steps 
focus on understanding and defining work-as- 
done, the third examines emergent system states 
that result from system variability and the fourth 
considers managing these unwanted system states 
(Hollnagel, 2012a). 

In conducting a FRAM assessment, the system 
is divided into key functions which either directly 
or indirectly affect the outcome of the activity as it 
would usually be performed. In FRAM, functions 
have up to six aspects which define their char- 
acteristics and how they are coupled within the 
system: inputs, outputs, preconditions, resources, 
time constraints and controls. Potential variability 
is then described within the functions in terms of 
the way a function would typically vary in normal 
working conditions, and from where that variabil- 
ity originates. 
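
The six aspects and the couplings between functions can be pictured as a small data structure. The sketch below is a minimal illustration under stated assumptions, not part of the FRAM itself or of the study's tooling: the class, the `couplings` helper and the labels are hypothetical, and a coupling is assumed to exist wherever an output label of one function appears under another aspect of a different function.

```python
from dataclasses import dataclass, field

ASPECTS = ("input", "output", "precondition", "resource", "time", "control")


@dataclass
class FramFunction:
    """A FRAM function described by up to six aspects."""
    name: str
    aspects: dict = field(default_factory=dict)  # aspect name -> list of labels

    def add(self, aspect: str, label: str) -> None:
        if aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {aspect}")
        self.aspects.setdefault(aspect, []).append(label)


def couplings(functions):
    """Pair each output label with every other function aspect that uses it."""
    links = []
    for upstream in functions:
        for label in upstream.aspects.get("output", []):
            for downstream in functions:
                if downstream is upstream:
                    continue
                for aspect, labels in downstream.aspects.items():
                    if aspect != "output" and label in labels:
                        links.append((upstream.name, label, downstream.name, aspect))
    return links


# Hypothetical two-function model; the labels are illustrative only.
heat = FramFunction("Heat carpet")
heat.add("output", "heated carpet")
move = FramFunction("Move carpet")
move.add("input", "heated carpet")
move.add("time", "cycle time")

print(couplings([heat, move]))
```

Variability in an upstream output can then be traced along these links to the downstream aspects it may affect.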

This information constitutes a FRAM model 
which may then be used to recognise emergent states 
or instantiations of the system. When variability 
leads to system states which are not considered nor- 
mal activity, the variability is construed as uncon- 
trolled, and resonance is said to have occurred. 


By observing the relationship between instantia- 
tions and emergent system states, it is possible to 
identify sources of variability and how they may 
be managed. These may be tabulated to qualita- 
tively aggregate all sources of variability within 
an instantiation (Hollnagel, 2012a), thus forming 
a diagram of the relationship between functions 
through coupled aspects (Hill, 2015). 

From the perspective of risk assessment, the 
Safety-II objective is to identify and assess the 
sources of variability within a system, so they may 
be damped to stop the occurrence of resonant sys- 
tem states which otherwise would generate nega- 
tive outcomes. 

A small automotive manufacturing site was 
selected which had a range of different material 
processing and fabrication systems, all involving 
different degrees of worker and technology adjust- 
ments, based on work-as-imagined descriptions. 
By definition, systems with low technological precision
required more human adjustment and were
thus expected to contain higher variability.
Comparatively, systems with a high technological
precision required less human adjustment and were
thus expected to contain lower variability.

Based on work-as-imagined descriptions from 
written procedures, four manufacturing systems 
were selected which contrasted both technological 
and human precision. The systems were selected 
as two coupled pairs so that the upstream and 
downstream relationship of variability could be 
observed on the overall systems (Table 1). 

System 1A required two workers to move carpet
stock through a number of automated stations to
convert raw carpet stock into a moulded three-dimensional
shape with plastic-welded components.
The output of System 1A was fed into System 1B
which required another two workers to move the
stock through further automated stations, initially
adding a foam backing before waterjet cutting the
stock to a required shape for stillaging.


Table 1. Perceived level of adjustment (precision) in
manufacturing activities based on work-as-imagined
descriptions of technology and human action, resulting
in potential output variability.

                                     Perceived precision
System                               Technology    Human

1A   Carpet forming and welding      Precise       Imprecise
1B*  Carpet foaming and cutting      Precise       Imprecise
2A   Track welding                   Precise       Imprecise
2B*  Track assembly                  Acceptable    Acceptable

Precise = no adjustment for success;
Acceptable = approximate adjustment for success;
Imprecise = high adjustment for success;
*System B is downstream of System A.

System 2A required a worker to assemble many 
individual pre-welded steel components into a 
fixture inside a robotic weld station, producing a 
welded sub-assembly. The welded sub-assembly 
of System 2A was then fed into System 2B which 
required a second worker to construct a finished 
structure from several different sub-assemblies. 

At each of these workstations, four question 
sets were employed (Albery et al., 2016) to discuss 
each of the systems with the workers thus collect- 
ing the necessary data required to perform FRAM 
analyses. 

Workers, supervisors and quality engineers were 
asked the four sets of questions. All participants 
had been in their respective roles for more than a
year and were skilled in their roles. The question
sets were applied as a semi-structured interview 
process, focusing discussions on the precision of 
each of the systems in the workers’ own language 
(Louise Barriball and While, 1994). 

Aggregate variability was determined qualita- 
tively from participant responses to the question 
sets. Alignments and misalignments between work- 
as-imagined and work-as-done were recorded 
within the precision of respective aspects. 

Notes from the data collection were subse- 
quently transcribed into a template adapted from 
the FRAM (Hollnagel, 2012a). This qualitative 
coding formed the basis of the aggregation of 
variability described for each system in Tables 2 
through 5. 

For each system the variability associated with 
normal work (N) was recorded as instantiation “0” 
for each aspect of each function. Normal work was 
considered an alignment between work-as-imag- 
ined and work-as-done where the required adjust- 
ments were considered in the system design. 

When the response to the questions revealed 
either an increase (I) or reduction (R) in variability 
for the same aspect, this was construed as a unique 
instantiation and “1, 2, ...” was recorded, signify- 
ing that a different level of precision was required 
to manage the variability within the system. 
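
The coding scheme described above can be sketched as a simple lookup from aspect to one code per instantiation. The example below is illustrative only: the codes are transcribed from the "Move carpet" rows of Table 2, while the data layout and the `deviations` helper are hypothetical and not the template used in the study.

```python
INSTANTIATIONS = ("1A.0", "1A.1", "1A.2", "1A.3")

# Codes for the "Move carpet" function, transcribed from Table 2:
# N = normal, I = increased, R = reduced or no variability.
MOVE_CARPET = {
    "Input":        "NNNN",
    "Output":       "NNNI",
    "Precondition": "NNNN",
    "Resource":     "NNNN",
    "Control":      "NNII",
    "Time":         "NIIN",
}


def deviations(function_codes):
    """Map each instantiation to the aspects whose code is not N."""
    result = {name: [] for name in INSTANTIATIONS}
    for aspect, codes in function_codes.items():
        for name, code in zip(INSTANTIATIONS, codes):
            if code != "N":
                result[name].append((aspect, code))
    return result


for name, found in deviations(MOVE_CARPET).items():
    print(name, found)
```

Listing the non-normal aspects per instantiation in this way makes it easier to see which aspects distinguish an unexpected instantiation from normal work.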

Finally, visualisations of the systems were then
constructed based on this information.


2 RESULTS 


The first instantiation of System 1A, System 1A.0,
was considered “normal work”. It revealed that the
shape of the carpet through the process from start
to finish was heavily dependent on its temperature.
If the carpet temperature dropped too low at any
stage in the process it would become incompatible
with the other functions. This placed a reliance on
human precision to carefully maintain the system
cycle time (Table 2 and Fig. 1).

Instantiation 1A.1 identified time delays moving
the stock between stations in upstream functions;
this caused a temperature drop in the carpet
stock which led to the carpet border, or perimeter,
being geometrically incompatible with the downstream
functions.


Table 2. System 1A, aggregation of variability.

Variability                    Normal   Unexpected
Instantiation                  1A.0     1A.1   1A.2   1A.3

Heat carpet
  Input                        N        N      N      N
  Output                       N        N      N      N
  Precondition                 N        N      N      N
  Resource                     N        N      I      N
  Control                      N        N      N      N
  Time                         N        N      N      N
Move carpet
  Input                        N        N      N      N
  Output                       N        N      N      I
  Precondition                 N        N      N      N
  Resource                     N        N      N      N
  Control                      N        N      I      I
  Time                         N        I      I      N
Form carpet
  Input                        N        N      N      N
  Output                       N        N      N      N
  Precondition                 N        N      N      N
  Resource                     N        N      N      N
  Control                      N        I      N      N
  Time                         N        N      N      N
Hand trim carpet
  Input                        R        N      N      R
  Output                       R        N      N      R
  Precondition                 R        N      N      R
  Resource                     R        N      N      R
  Control                      R        I      N      R
  Time                         R        I      N      R
Weld carpet
  Input                        N        N      N      N
  Output                       N        N      N      N
  Precondition                 N        N      N      N
  Resource                     N        N      N      N
  Control                      N        N      N      N
  Time                         N        I      I      N
Foam backing to System 1B
  Input                        N        N      N      N
  Output                       N        N      N      N
  Precondition                 N        N      N      N
  Resource                     N        N      N      N
  Control                      N        N      N      N
  Time                         N        I      I      I

N = Normal variability.
I = Increased variability.
R = Reduced or no variability (function not active).


Figure 1. System 1A/Instantiation 0 (1A.0), normal

Specifically, the geometric border change 
meant that the carpet stock was incompatible 
with machine fixtures, meaning (i) it could not be 
correctly welded and (ii) the foam backing mate- 
rial could not be correctly applied as the last stage 
before entering the next system (Fig.2). 

To force the welding and foam backing functions
to work successfully, workers compensated by hand
cutting the carpet borders as a hidden upstream
sub-function.

Instantiation 1A.2 identified that approxi- 
mately 50% of carpet stock was a recycled mate- 
rial which was stiffer than the normal carpet and 
incompatible with the system temperature range. 
The recycled material was imprecise and required 
additional heating time as well as a hidden hand 
cutting operation to remain compatible with the 
other functions (Fig. 3). 

Instantiation 1A.3 identified that workers were
exposed to minor burns when manually handling the
hot carpet between each of the functions (human,
imprecise). This resulted in the carpet becoming
soiled and rejected during the foam backing process
(Fig. 4).
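
The aggregation in Table 2 can also be summarised programmatically. The following is a minimal sketch, not part of the original analysis: it transcribes the N/I/R codes of Table 2 and lists, for each instantiation of System 1A, the FRAM aspects that show increased variability in at least one function. The names `TABLE_2` and `increased_aspects` are illustrative assumptions.

```python
# Aspect order used in every row block of Table 2.
ASPECTS = ("Input", "Output", "Precondition", "Resource", "Control", "Time")
INSTANTIATIONS = ("1A.0", "1A.1", "1A.2", "1A.3")

NORMAL = "NNNN"  # normal variability in all four instantiations

# Transcribed from Table 2: per function, one code string per aspect;
# each string holds one character (N, I or R) per instantiation.
TABLE_2 = {
    "Heat carpet": (NORMAL, NORMAL, NORMAL, "NNIN", NORMAL, NORMAL),
    "Move carpet": (NORMAL, "NNNI", NORMAL, NORMAL, "NNII", "NIIN"),
    "Form carpet": (NORMAL, NORMAL, NORMAL, NORMAL, "NINN", NORMAL),
    "Hand trim carpet": ("RNNR", "RNNR", "RNNR", "RNNR", "RINR", "RINR"),
    "Weld carpet": (NORMAL, NORMAL, NORMAL, NORMAL, NORMAL, "NIIN"),
    "Foam backing to System 1B": (NORMAL, NORMAL, NORMAL, NORMAL, NORMAL, "NIII"),
}


def increased_aspects(table):
    """Return, per instantiation, the sorted aspects coded 'I' in any function."""
    result = {inst: set() for inst in INSTANTIATIONS}
    for rows in table.values():
        for aspect, codes in zip(ASPECTS, rows):
            for inst, code in zip(INSTANTIATIONS, codes):
                if code == "I":
                    result[inst].add(aspect)
    return {inst: sorted(aspects) for inst, aspects in result.items()}


if __name__ == "__main__":
    for inst, aspects in increased_aspects(TABLE_2).items():
        print(inst, ", ".join(aspects) or "none")
```

Under this transcription, the aspects returned for 1A.1, 1A.2 and 1A.3 are the same sets that Table 8 reports for System 1A.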

The first instantiation of System 1B, System
1B.0, was considered “normal work”. Workers
advised that when the system was functioning correctly,
a sufficient carpet temperature was maintained
and there was no rejected stock from the
previous System 1A (Table 3, Fig. 5). If the workers
in System 1A could achieve the ‘minimum’ temperature,
the carpet stock could successfully
pass through each function without hidden worker
adjustments, such as cutting the border as
described in System 1A (Fig. 2).

Instantiation 1B.1 resulted from a delay in
instantiation 1A.1. This resulted in the introduction
of a second precise hidden function required
to hand cut the carpet for the waterjet function to
succeed (Fig. 6).

Figure 2. System 1A/Instantiation 1 (1A.1), unexpected
variability welding carpet.

Figure 3. System 1A/Instantiation 2 (1A.2).

Figure 4. System 1A/Instantiation 3 (1A.3).

Instantiation 1B.2 was also caused by the
upstream instantiation 1A.1. As the carpet was
at the wrong temperature (technology imprecise)
when entering instantiation 1B.1 the foam backing
was incorrectly applied, and it was rejected when
stillaged (Fig. 7).

Table 3. System 1B, aggregation of variability.
Variability Normal Unexpected
Instantiation 1B.0 1B.1 1B.2 1B.3
Foam backing from System 1A
Input N N N N
Output N N N N
Precondition N N N N
Resource N N N N
Control N I I N
Time N N N N
Move carpet
Input N N N N
Output N N N N
Precondition N N N N
Resource N N N N
Control N N N I
Time N N N N
Waterjet cut carpet
Input N N N N
Output N N N N
Precondition N N N N
Resource N N N N
Control N I N N
Time N N N N
Hand trim carpet
Input R N R R
Output R N R R
Precondition R N R R
Resource R N R R
Control R I R R
Time R I R R
Stillage finished carpet
Input N N N N
Output N N N I
Precondition N N N N
Resource N N N N
Control N N I N
Time N N N N


N Normal variability.
I Increased variability.
R Reduced or no variability (function not active).

Figure 5. System 1B/Instantiation 0 (1B.0). Normal work.

Figure 6. System 1B/Instantiation 1 (1B.1).

Figure 7. System 1B/Instantiation 2 (1B.2).

Figure 8. System 1B/Instantiation 3 (1B.3).

Instantiation 1B.3 resulted from the carpet 
becoming soiled when manually handled between 
functions (human imprecise), and rejected when 
stillaged (Fig. 8). 

The first instantiation of System 2A, 2A.0 was 
considered “normal work”. The processes and 
assemblies which incorporated the stamping stock 
were dependent on its inputs being within their 
correct geometric tolerances (Table 4, Fig. 9). 

Instantiation 2A.1 resulted from variability in
a previous upstream function (geometric tolerances,
human or technology imprecise), impacting
all functions within this system, and those downstream
from it (Fig. 10). This meant stock was
either rejected as it could not be welded, or was
welded and rejected in a downstream function.

Table 4. System 2A, aggregation of variability.
Variability Normal Unexpected
Instantiation 2A.0 2A.1
Move stock
Input N N
Output N N
Precondition N N
Resource N N
Control N N
Time N N
Assemble stamping in weld station
Input N N
Output N N
Precondition N N
Resource N I
Control N N
Time N N
Weld assembly
Input N I
Output N I
Precondition N N
Resource N N
Control N N
Time N N
Remove welded assembly from weld station TO System 2B
Input N N
Output N N
Precondition N N
Resource N N
Control N N
Time N N


N Normal variability.
I Increased variability.

Figure 9. System 2A/Instantiation 0 (2A.0). Normal work.

Figure 10. System 2A/Instantiation 1 (2A.1).

Table 5. System 2B, track assembly, aggregation of variability.
Variability Normal Unexpected
Instantiation 2B.0 2B.1 2B.2 2B.3
Move parts FROM System 1A
Input N N N N
Output N N N N
Precondition N N N N
Resource N N N N
Control N N N N
Time N N N N
Assemble upper and lower parts in station #1
Input N N N N
Output N N N N
Precondition N N N N
Resource N I I I
Control N N N N
Time N N N N
Install hardware in station #1
Input N N N N
Output N N N N
Precondition N N N N
Resource N N N N
Control N N N N
Time N N N N
Install floating washers in station #2
Input N N N N
Output N N N N
Precondition N N N N
Resource N N N N
Control N N N N
Time N N N N
Torque hardware in station #2
Input N N N N
Output N N N N
Precondition N N N N
Resource N N N N
Control N N N N
Time N N N N
Install cable
Input N N N N
Output N N N N
Precondition N N N N
Resource N N N N
Control N N N N
Time N N N N
Check/mark finished assembly
Input N N N N
Output N I N I
Precondition N N N N
Resource N N N N
Control N N N N
Time N N N N


N Normal variability.
I Increased variability.

The first instantiation of System 2B, System 
2B.0 was considered “normal work”. The system 
required workers to manually assemble a number
of sub-assemblies into a station, including sub- 
assemblies from System 2A. These were then 
bolted together (human precise). The system was 
reliant on workers continually adjusting their con- 
trol of the processes and timing between each of 
the functions to ensure the total aggregate variabil- 
ity was successfully managed (Table 5, Fig. 11). 

Instantiation 2B.1 resulted from welded stock 
from instantiation 2A.1 entering this system. 
Workers could not confirm if the 2B.1 part was 
geometrically incorrect (human, imprecise) until 
checking the final assembly, at which point it was
rejected and had to be returned to the first system
input for re-assembly (Fig. 12).

Instantiation 2B.2 resulted from using stock
from instantiation 2A.0 that had misaligned hard- 
ware. Like Instantiation 2B.1, the assembly was 
returned to the first system input for re-assembly 
(Fig. 13). 

Finally, instantiation 2B.3 resulted from using 
stock from instantiation 2A.1 that was welded cor- 
rectly but was geometrically incorrect. However, 
the assembly was not detected and rejected initially; 


Figure 11. System 2B/Instantiation 0 (2B.0). Normal work.

Figure 12. System 2B/Instantiation 1 (2B.1).

Figure 13. System 2B/Instantiation 2 (2B.2).

Figure 14. System 2B/Instantiation 3 (2B.3).


so the variation was carried into downstream func- 
tions and then rejected in the last function. 


3 DISCUSSION 


The key learning associated with each instantiation
was the identification of variability introduced
by upstream plant processes where technological
precision was expected but absent. Subsequently,
reactive human precision was introduced to manage
variability and maintain successful outcomes
(Dekker, 2006). The adjustments to the introduced
variability led to the subsequent creation of hazards
to workers (Table 6).

Furthermore, upstream variability was observed to
have downstream effects on quality and productivity,
as demonstrated by the increased output variability
recorded in the tables of the results section.
This was best typified as reject stock output from
each of the systems. These learnings contrasted with the
initial work-as-imagined evaluations of each of the
four systems (Table 7).

Acknowledging that each of the systems has a 
reliance on human precision to manage variability,
a summary of the main source of variability in each
of the systems is provided (Table 8). Interestingly, 
downstream functions could be improved with no 
action other than better managing unwanted vari- 
ability in the upstream functions. 
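
To illustrate this point, the sources listed in Table 8 can be traced back to their upstream origins. The sketch below is an illustrative assumption rather than part of the original analysis: it transcribes the "Source of variability" column of Table 8, treats any entry that names another instantiation as an upstream link, and treats every other entry as a root cause. The names `SOURCES`, `root_causes` and `affected_by` are hypothetical.

```python
# "Source of variability" column of Table 8. An entry that is itself a key
# (e.g. "1A.1") is an upstream instantiation; anything else is a root cause.
SOURCES = {
    "1A.1": ["Carpet temperature"],
    "1A.2": ["Recycled carpet raw stock"],
    "1A.3": ["Manual handling carpet"],
    "1B.1": ["1A.1", "1A.2"],
    "1B.2": ["1A.1"],
    "1B.3": ["Manual handling carpet"],
    "2A.1": ["Geometric variation in raw stock"],
    "2B.1": ["2A.1"],
    "2B.2": ["2A.1"],
    "2B.3": ["2A.1"],
}


def root_causes(instantiation):
    """Follow upstream links until only root causes remain (order preserved)."""
    roots = []
    for source in SOURCES[instantiation]:
        found = root_causes(source) if source in SOURCES else [source]
        for root in found:
            if root not in roots:
                roots.append(root)
    return roots


def affected_by(root):
    """List every instantiation whose variability traces back to `root`."""
    return [inst for inst in SOURCES if root in root_causes(inst)]


if __name__ == "__main__":
    print(root_causes("1B.1"))
    print(affected_by("Geometric variation in raw stock"))
```

In this transcription, a single root cause such as the geometric variation in raw stock accounts for one instantiation in System 2A and all three unexpected instantiations in System 2B.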

Two specific interventions which would have an
impact on the aggregate variability of the systems
discussed above are (i) for System 1A, ensuring the
system cycle time is compatible with the required
raw material forming temperature, and (ii) for System
2A, tightening the tolerances on raw stock
prior to welding sub-assemblies.

It is hoped that both of these simple examples
serve to emphasise the importance of addressing
or damping unwanted variability at its source.

Table 6. Potential safety impacts of unwanted variability.
System instantiation   Safety hazard      Potential injury type*
1A.1                   Manual handling    Musculoskeletal disorder
                       Use knife          Cut/laceration
1A.2                   Manual handling    Musculoskeletal disorder
                       Use knife          Cut/laceration
                       Hot surfaces       Burn
1A.3                   Manual handling    Musculoskeletal disorder
1B.1                   Using knife        Cut/laceration
1B.2                   Manual handling    Musculoskeletal disorder
1B.3                   Manual handling    Musculoskeletal disorder
2A.1                   Manual handling    Musculoskeletal disorder
2B.1                   Manual handling    Musculoskeletal disorder
                       Pinch point        Crush/bruise/laceration
2B.2                   Manual handling    Musculoskeletal disorder
                       Pinch point        Crush/bruise/laceration
2B.3                   Manual handling    Musculoskeletal disorder
                       Pinch point        Crush/bruise/laceration

*Observed or described by workers.

Table 8. Actual sources of variability identified from analysis.
System instantiation   Aspect actual variability   Source of variability
1A.1                   Time, Control               Carpet temperature
1A.2                   Resource, Time, Control     Recycled carpet raw stock
1A.3                   Output, Control, Time       Manual handling carpet
1B.1                   Time, Control               Instantiation 1A.1, Instantiation 1A.2
1B.2                   Time, Control               Instantiation 1A.1
1B.3                   Output, Control             Manual handling carpet
2A.1                   Resource, Input, Output     Geometric variation in raw stock
2B.1                   Resource, Output            Instantiation 2A.1
2B.2                   Resource, Output            Instantiation 2A.1
2B.3                   Resource, Output            Instantiation 2A.1


Table 7. Perceived (work-as-imagined) and evalu- 
ated (work-as-done) control of variability (precision) in 
manufacturing activities based on descriptions and sub- 
sequent assessments of technology and human action 
resulting in actual output variability. 


Work-as-imagined Perceived precision 


System Technology Human 

1A Carpet forming Precise Imprecise 
and welding 

1B Carpet foaming Precise Imprecise 
and cutting 

2A Track welding Precise Imprecise 

2B Track assembly Acceptable Acceptable 


Work-as-done Evaluated precision 


System Technology Human 

1A Carpet forming Imprecise Acceptable 
and welding 

1B Carpet foaming Imprecise Acceptable 
and cutting 

2A Track welding Acceptable Precise 

2B Track assembly Imprecise Precise 


Further to these points, the FRAM analysis of
each of the functions aligns each of the sources
of unwanted variability with potential hazards
to workers. This is significant as traditional risk
assessment tools do not possess the requisite variety
to understand the detail of the systems they seek
to control (Gadd et al., 2004). That said, the cost
and training required to develop a level of analysis
similar to that described within this paper may be
greater than for many standard risk assessment
tools (IEC, 2009), possibly deterring some
safety practitioners.


4 CONCLUSION 


Using a Safety-II lens it was found that for each 
of the systems analysed, work-as-imagined was 
reliant on worker adjustments for success. These 
adjustments were conscious actions made in 
response to introduced variability in each of the 
systems. From a safety viewpoint, this highlights 
that collaboration with other stakeholders is 
required to identify systemic solutions which look 
beyond the prevention of only localised hazards 
to those which emerge from the system design (i.e. 
systemic upstream issues that are not proximal to 
the worker). 
It is concluded that the use of the FRAM
provides deeper learnings of system perform- 
ance in the management of variability as well as 
the impact of precise and imprecise control. The 
Safety-II perspective potentially involves greater 
time and practice to develop more meaningful 
output than a traditional Safety-I approach; how- 
ever, this perspective is necessary if it is desirable 
to increase requisite variety and thus knowledge of 
the system. 

These conclusions are limited by the sample 
size and findings of the systems considered in this 
research. However, the case studies presented indi- 
cate that a degree of variability and adjustment 
exists in all of the systems investigated. Further to 
these learnings, it would be of value to extend this 
type of analysis to (i) a greater sample size and (ii)
coupled systems in other disciplines. 
ABSTRACT: This paper presents experiences from applying hazard and operability analysis
(HAZOP) as support for establishing the safety requirements specification of a new safety-related railway
application. The new railway application is a software-based system for securing work areas, meaning it
prevents railway traffic in areas along the track allocated to maintenance. The experiences were collected
within the Safety Assessment Framework for Efficient Transport (SafeT) project managed by Bane NOR,
the government agency that owns, operates and develops the Norwegian railway infrastructure.
The objective of the SafeT framework is to offer a systematic, reusable way of creating system-wide
conceptual design models and, based on them, creating a common risk model, which in turn will facilitate
safety assessment, establishment of the requirements specification, and safety demonstration of the system
under consideration. Experience with applying HAZOP was collected through two workshops that used
different documentation formats. The objective was to collect guidance on how HAZOP can be
supported in the SafeT framework.


1 INTRODUCTION 


The project “Safety Assessment Framework for 
Efficient Transport” (SafeT) aims at developing a 
framework that supports the implementation of 
EN 50126 (CENELEC, 2017) and thereby of the 
Common Safety Methods for Risk Assessment 
(CSM RA) (EU, 2013) in the railway industry, in 
particular how the railway infrastructure may sup- 
port efficient transport. 

This paper presents ongoing results from the
case studies, while the results from the modelling
are presented in another paper (Karpati et al., 2017).
Figure 1 illustrates which phases of EN 50126
are within the scope of the current SafeT work and
both papers, annotated by a dark grey rectangle.
Some of the related work (chapter 2) is therefore
relevant for both papers, and the case (chapter 4)
is applied throughout the SafeT project. The aim
of the paper is to show the differences, advantages
and disadvantages of two different approaches to
hazard identification, applied to the new safety-related
railway application.

Figure 1. Scope of paper and relationship to EN 50126.

The current focus in the SafeT project is on 
the development phases 1 to 4 of EN 50126. In 
these phases of a system's life cycle, Bane NOR
takes a lead role in the development while succes- 
sive development phases to a large extent are out- 
sourced. The SafeT framework intends to support 
the development of the core artefacts within the 
system life cycle. In the early stages of the life cycle, 
in the part of the framework that concerns the in- 
house conceptualisation, the core artefacts are: 
1) the conceptual system design model; 2) common 
risk model; and 3) requirements specification. 

The main objective of the SafeT framework 
is to offer a systematic, reusable way for creating 
system wide conceptual design models and based 
on them, creating a common risk model, which in 
turn will facilitate the safety assessment and safety 
demonstration of the system in focus, throughout 
the system’s lifetime. 


2 RELATED WORK 


International safety standards, such as EN 50126, 
provide requirements and guidance on how to 
carry out the assessment process. Although most 
safety standards often view the safety of a system 
as a function of the reliability of its components, 
little guidance is provided on how to derive safety 
requirements and acceptable risk for components 
whose failure rates are not known. Particularly, 
it is often difficult to derive safety requirements 
for logical components such as the software. The 
problem can be formulated from a consideration 
of the following two important tasks in the devel- 
opment of safety critical systems: (1) establishing 
the requirements to the system, and (2) ensuring 
that the system fulfils these requirements. The 
safety requirements should be established through 
risk assessment and hazard analysis, and fulfilled 
through the use of techniques and measures ade- 
quate for the risk level. The framework proposed 
in the project has much of its inspiration from the- 
oretical aspects of international safety standards 
such as IEC 61508 (IEC 61508). The novel part of 
the framework is fivefold: reusability, modularity, 
unification, transparency and argumentation. 

In the following, a number of past projects
that relate to the topics of SafeT are briefly introduced.
However, most of them relate to the need
to establish models and provide support for a
safety case; see the related work presented in the other
SafeT paper (Karpati et al., 2017).


The EU-funded project MODSafe provides a
risk analysis method intended to combine potential
hazards, safety requirements and functions,
and link these elements to a generic functional
and object structure of a guided transport system.
ASCOS (Roelen, 2014) focused on safety and certification
of new aviation operations and systems,
and included, among other things, advice on methods and
tools for safety-based design. ModelMe! (Falessi,
2011) provides a tool-supported traceability framework
where the tool, for example, automatically
extracts the safety-related slices of SysML design
models (SysML).

The AltaRica language (Griffault, 1998) is an
object-oriented modelling language dedicated to 
performance evaluation of complex systems. The 
main motivation for its creation was the difficulty 
to design, to share and most importantly to main- 
tain safety and reliability models such as fault 
trees, event trees, Markov chains or stochastic Petri 
nets. The application and further development of 
the language is a continuous research activity at 
NTNU (Legendre, 2017). 

Of relevance is also CORAS (Lund, 2011; Gran, 
2004) which provides a methodology for model- 
based risk assessment integrating aspects from 
partly complementary risk assessment methods 
and state-of-the-art modelling methodology. 

The SafeT project has also reviewed a number 
of ongoing and past industrial experiences among 
the project partners related to the use of design 
and risk models to facilitate the safety assessment 
and demonstration of complex systems. Some of 
the challenges observed in these projects have also 
been reported earlier within aviation (Gran, 2007). 
Finally, the CHASSIS method (Raspotnig, 2018) 
utilizes UML use cases and sequence diagrams 
with HAZOP guidewords to integrate safety and 
security considerations for early requirements 
determination. 


3 APPLYING DIFFERENT APPROACHES 
FOR HAZARD IDENTIFICATION 


3.1 The role of the hazard identification 


There may be a number of different motivations 
for performing a hazard identification. Among 
them are avoiding loss of value, life and property, 
optimizing performance and reducing costs. The 
motivation for studying hazard identification in the 
SafeT project is to make sure that relevant hazards 
associated with development and use of software 
are evaluated, risk mitigation is in place, and the 
methods used for hazard identification are applica- 
ble and useful, with a basis in case studies that are 
carefully selected together with Bane NOR. 
The purpose and method of a hazard and
operability study (HAZOP study) are
well described in the literature, for example in Risk
assessment (Rausand, 2011) and IEC 61882:2016
(HAZOP studies) (IEC 61882). A HAZOP study
is performed as a group review using structured
brainstorming to identify and assess potential hazards.
The group of experts starts with a list of tasks or functions,
and then applies guidewords such as none, reverse, less,
later than, part of, more. The aim is to discover
potential hazards, operability problems and potential
deviations from intended operating conditions.
Finally, the group of experts establishes the
likelihood and the consequences of each hazard
and identifies potential mitigating measures. The
analysis covers all stages of the project life cycle. In
practice, the name HAZOP is sometimes (ab)used
for any “brainstorming with experts to fill a table
with hazards and their effects”. Many variations or
extensions of HAZOP have been developed.
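
The structured part of such a study can be sketched in a few lines. The example below is a minimal illustration, not the worksheet format used in the SafeT workshops: it pairs each function with each guideword to produce empty worksheet rows that the expert group would then fill in. The function names passed in and the row fields (`hazard`, `likelihood`, `consequence`, `mitigation`) are assumptions chosen for illustration.

```python
from itertools import product

# Guidewords as listed in the text above.
GUIDEWORDS = ("none", "reverse", "less", "later than", "part of", "more")


def hazop_worksheet(functions, guidewords=GUIDEWORDS):
    """Create one empty worksheet row per (function, guideword) pair.

    The remaining fields are left as None; they are filled in by the
    expert group during the structured brainstorming.
    """
    return [
        {
            "function": function,
            "guideword": guideword,
            "hazard": None,
            "likelihood": None,
            "consequence": None,
            "mitigation": None,
        }
        for function, guideword in product(functions, guidewords)
    ]


if __name__ == "__main__":
    # Hypothetical function list, used only to show the row layout.
    rows = hazop_worksheet(["Secure work area", "Release work area"])
    for row in rows:
        print(f'{row["function"]}: {row["guideword"]}')
```

Enumerating the pairs up front makes it visible which function and guideword combinations the group has not yet discussed.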

Hazard identification can be defined as the 
process of identifying and listing the hazards and 
accidents associated with a system (DEF-STAN 
00-56, 2007). There are numerous different defini- 
tions of the term hazard described in standards and 
the literature. In the following, we will combine the 
definitions used in EN 50126 and EN 50129 and 
define hazard as “a physical situation or a condition 
that can lead to an accident”. 


3.2 Hazard identification in the RAMS lifecycle 


Throughout the European Union, railway signal- 
ling and interlocking projects are carried out on 
the basis of the CENELEC standards EN 50126 
(CENELEC 2017), 50128 (CENELEC 2011) and 
50129 (CENELEC 2003). This set of standards
provides a consistent, European approach to the
management of reliability, availability, maintain- 
ability, and safety, denoted by the acronym RAMS. 
In order to demonstrate that a technical system is 
safe to take into use and suitable for its intended 
application, the CENELEC standards require that 
the system under consideration is described and 
analysed in its intended context, in particular with 
respect to its relationship to hazards that can occur 
in this context and how these hazards can be con- 
trolled through the system design. This requires 
good models of both system design and risk that 
capture the relations between the different system 
levels and between hazards, causes, barriers, acci- 
dents, and consequences. Of particular importance 
to the safety demonstration is the utilization of 
common risk models that include the results from 
the hazard identifications at the different system 
levels, from an overall railway system down to the 
separate subsystems (Sivertsen, 2016). The use of
models to support the safety management is central
to SafeT, which therefore focuses on criteria for the
choice of modelling techniques and how they can
be combined, adapted and further developed to satisfy
the modelling needs. These needs are associated
with the analyses at the different system levels and its
context, the risk associated with the application, and
the requirements established to control this risk.

Hazard identification, operability studies, and analysis
and evaluation of the risks are key activities in
phase 3, but they are also relevant for all the following
RAMS phases, shown in Figure 1 and listed here in
accordance with EN 50126-1 (CENELEC 2017):


# Phase 


1 Concept 
2 System definition and operational context 
3 Risk analysis and evaluation 
4 Specification of system requirements 
5 Architecture and apportionment of system 
requirements 
6 Design and implementation 
7 Manufacture
8 Integration 
9 System validation 
10 System acceptance 
11 Operation and maintenance 
12 De-commissioning and disposal 


As part of continuous improvement work as 
described in the ISO 9000-family of standards 
(ISO 9001), identification and evaluation of poten- 
tial hazards should also be done as a continuous 
activity throughout the system’s whole life cycle. 
For all steps and phases, there may be numerous 
hazards that can compromise the RAMS perform- 
ance of the system. 


4 CASE EXAMPLE DESCRIPTION 


4.1 Introduction to the case example of securing 
work areas 


The introduction of axle counters for train detec- 
tion necessitates a new solution for securing work 
areas. The current solution, on track sections 
without axle counters, is to use a contact mag- 
net to induce a short circuit in a manner similar 
to how an axle of a train induces a short circuit 
and thereby is detected. The short circuit induced 
by the contact magnet triggers a state change in 
the interlocking that prevents the train dispatcher 
from locking routes through the affected section 
until the contact magnet is removed by the safety 
guard. 

In the proposed solution for securing work 
areas (see Fig. 2 and Fig. 3), a safety guard uses a 
Figure 2. Work areas and roles.

Figure 3. The securing work area case (figure labels: Control (CTC), Support System (SuS), Safety Guard (SG)).


smartphone to interact with the train dispatcher. 
Besides allowing voice communication with the 
train dispatcher, the smartphone also contains a 
dedicated application with functionality to manage 
the securing and releasing of work areas. 

In the Norwegian infrastructure, the train detec- 
tion has usually been performed with different 
variants of track circuits. Axle counters were intro- 
duced in the infrastructure just a few years ago, 
and are gradually replacing the existing track circuits.

Irrespective of the train detection system used,
there is a need to protect workers along the track
from trains unintentionally moving into the work
area. A work area is a track section (possibly more
than one track) that can be made available for work,
without any trains entering or leaving the area
(Figure 2). The work area and the surrounding
tracks can be protected by points, derailers, main 
signals, shunting signals, and regulations. 

While the train dispatcher in either case has the 
possibility to block the work area, a basic safety 
principle in Norwegian railway operation is that 
the workers should be able to prevent the train dis- 
patcher from unblocking the work area before the 
work is finished. Basically, 


• the workers’ position must be correctly identified;

• the correct work area must be effectively blocked; and

• the work area must not be unblocked prematurely.


One of the challenges with the introduction of 
axle counters has been that the existing methods to 
secure the work area no longer worked. This applies 
both to the correct identification of the workers’ 
position and to the barriers against hazards caused 
by premature unblocking of the work area. Since a 
track circuit short-circuits when a train is present 
in the track section, the presence of trains can be 
imitated by short-circuiting the track circuit with 
other means, viz. the contact magnets. In this way, 
the workers along the track can indicate their posi- 
tion to the train dispatcher, who can block the sec- 
tion to prevent trains from entering. The contact 
magnet furthermore works as a barrier to hazards 
caused by premature unblocking of the area (trains 
entering the work area), since the track section is 
considered occupied by the interlocking. 

The current solution in Norway for securing 
the work area when axle counters are used for the 
train detection involves removing a physical key 
for the relevant work area from its lock when the 
train dispatcher has blocked the work area and 
released the key. The train dispatcher is prevented 
from unblocking the work area until the safety 
guard has put the key back. While this certainly 
works, the solution is both expensive and ineffi- 
cient due to the need for additional physical equip- 
ment along the track, and physically interlocking 
this with the signalling system. There is therefore a 
need for a system that can replace the current use 
of physical keys. 

This is the background for the invention of the 
concept described in the next section. 


4.2 A concept of a new solution for securing work
areas


The concept involves the development of a software 
based system for safe interaction and supervision 
related to the protection of maintenance workers 
from accidents caused by the interference with the 
railway traffic. The solution is planned to require 
no other physical measures in the infrastructure 
than simple marking along the track in terms of 
a barcode or QR-code identifying the work area. 

The proposed solution for securing work areas 
(see Fig. 3) consists of a software-based solution 
whereby a safety guard uses a smartphone to inter- 
act with the dispatcher. Besides allowing voice 
communication with the dispatcher, the smart- 
phone also contains a dedicated application with 
functionality to manage the securing and releasing 
of work areas. 
The safety guard identifies the work area by scanning
the code on site. This identification of the work
area is required at certain steps in the operation.
Some of the characteristics of the functionality are:


• The main safety guard selects the functions from
the application on his smartphone.

• Scanning the work area identifies both the safety
guard and the work area.

• The application communicates with the support
system, which communicates with the CTC and
other applications.

• The support system supervises the protocol
associated with each function.

• The support system supervises the secured work
areas, and prevents the train dispatcher from
prematurely unblocking the work area.
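
The last characteristic, the supervision that prevents premature unblocking, can be sketched as follows. This is a minimal illustration under stated assumptions and not the design of the actual support system: the class `SupportSystem`, its methods, and the rule that only the securing safety guard may release an area are all hypothetical simplifications.

```python
class WorkAreaError(Exception):
    """Raised when a requested operation would violate the supervision rules."""


class SupportSystem:
    """Toy model: tracks which work areas are secured and by which guard."""

    def __init__(self):
        self._secured = {}  # work area id -> id of the securing safety guard

    def secure(self, area_id, guard_id):
        """Secure a work area on behalf of a safety guard."""
        if area_id in self._secured:
            raise WorkAreaError(f"work area {area_id} is already secured")
        self._secured[area_id] = guard_id

    def release(self, area_id, guard_id):
        """Release a secured work area (assumption: only by the securing guard)."""
        if self._secured.get(area_id) != guard_id:
            raise WorkAreaError(f"{guard_id} has not secured work area {area_id}")
        del self._secured[area_id]

    def may_unblock(self, area_id):
        """The train dispatcher may unblock only areas that are not secured."""
        return area_id not in self._secured


if __name__ == "__main__":
    system = SupportSystem()
    system.secure("area-1", "guard-A")
    print(system.may_unblock("area-1"))
    system.release("area-1", "guard-A")
    print(system.may_unblock("area-1"))
```

Keeping the secured state in the support system, rather than with the dispatcher, mirrors the role that the contact magnet and the physical key play in the solutions described in section 4.1.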


The solution gives several advantages, like less 
intervention in the infrastructure, no physical key 
to be kept and replaced, more convenient inspec- 
tion, improved safety locally, additional function- 
ality, larger flexibility, and simpler maintenance. 

For simplicity, the interfaces between the opera- 
tional support staff and the other roles are not 
shown in the figures. The operational support is 
not mentioned in the descriptions of the main func- 
tions, but a separate analysis of the support func- 
tions should be part of a complete analysis of the 
system. The responsibilities of the operational sup- 
port include 


• correcting errors or operational problems;

• keeping the support system updated with respect
  to information about known faults or operational
  problems; and

• keeping the support system data updated.


For the purpose of the risk assessment at the rail- 
way system level, all the functions can be described 
by considering only the interfaces between the 
applications and the safety guards, between the 
applications and the support system, and between 
the support system and the CTC. 

Twelve main functions have been specified for 
the system (T. Sivertsen, 2014): 


1. Log in: Logging into the system, thereby get- 
ting access to the other main functions. 

2. Log out: Logging out of the system, thereby 
being prevented from using other functions 
before a new login. 

3. Join: Enrolling in a work area, thereby
preventing the safety guard in charge from
releasing the work area.

4. Resign: Withdrawing from a work area, 
thereby allowing the safety guard in charge to 
release the securing of the work area. 

5. Secure: Securing a work area, thereby preven- 
ting the work area from being unblocked. 


6. Release: Releasing a secured work area, thereby 
allowing the work area to be unblocked. 

7. Set time: Setting the time available for work 
in a work area, thereby allowing an automatic 
countdown of the time available. 

8. Time: Reading the time available for work in a 
work area, thereby facilitating management of 
work in the work area. 

9. Status: Reading the status of a work area,
thereby facilitating management of work in the
work area.

10. Takeover: Requesting takeover of responsibil- 
ity for a work area. 

11. Full takeover: Requesting takeover of another 
safety guard’s responsibilities. 

12. Overview: Overview of the work areas the 
safety guard is in charge of or enrolled in. 


For each of these functions there is a list of tasks
that are performed by one or more of the actors
involved in the process of securing and releasing
the work areas, as shown in Figure 3.
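
The interplay between Join, Resign, Secure and Release can be illustrated with a small state model. The following is only a sketch under stated assumptions, not the specified system: the class and method names (`SupportSystem`, `WorkArea`, `ProtocolError`, `may_unblock`) are hypothetical, and the preconditions (for example, that a guard can only join an area that is already secured) are assumed for illustration, since the descriptions summarised here do not state them.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


class ProtocolError(Exception):
    """Raised when a request violates the (assumed) protocol."""


@dataclass
class WorkArea:
    area_id: str
    guard_in_charge: Optional[str] = None
    secured: bool = False
    enrolled: Set[str] = field(default_factory=set)


class SupportSystem:
    """Hypothetical supervisor of secured work areas."""

    def __init__(self) -> None:
        self.areas: Dict[str, WorkArea] = {}

    def _area(self, area_id: str) -> WorkArea:
        return self.areas.setdefault(area_id, WorkArea(area_id))

    def secure(self, guard: str, area_id: str) -> None:
        # Secure: prevents the work area from being unblocked.
        area = self._area(area_id)
        if area.secured:
            raise ProtocolError("work area is already secured")
        area.secured = True
        area.guard_in_charge = guard

    def join(self, guard: str, area_id: str) -> None:
        # Join: an enrolled guard blocks release of the work area.
        area = self._area(area_id)
        if not area.secured:  # assumed precondition, not from the source
            raise ProtocolError("cannot join an unsecured work area")
        area.enrolled.add(guard)

    def resign(self, guard: str, area_id: str) -> None:
        # Resign: withdrawing allows the guard in charge to release.
        self._area(area_id).enrolled.discard(guard)

    def release(self, guard: str, area_id: str) -> None:
        # Release: only possible when nobody else is enrolled.
        area = self._area(area_id)
        if not area.secured or guard != area.guard_in_charge:
            raise ProtocolError("only the guard in charge may release")
        if area.enrolled:
            raise ProtocolError("other safety guards are still enrolled")
        area.secured = False
        area.guard_in_charge = None

    def may_unblock(self, area_id: str) -> bool:
        # The dispatcher may unblock only when the area is not secured.
        return not self._area(area_id).secured
```

In this sketch, a dispatcher request to unblock would be refused whenever `may_unblock` returns `False`, which mirrors the stated role of the support system in preventing premature unblocking.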


5 TESTING TWO ALTERNATIVE 
APPROACHES FOR HAZARD 
IDENTIFICATION ON THE CASE 


In order to evaluate the importance of the system 
description in relation to the result of an analysis, 
two alternative system descriptions were applied 
in two different HAZOP workshops with different 
participants. 

The aim was to evaluate whether different ways of
presenting the system would result in different
findings. In the first workshop, the basis for
preparation and discussion was a graphical model of
the system, while the other used a textual
description. The same type of competence was
present in both workshops, although not represented
by the same individuals.

The participants in the two workshops were
mainly academics, with theoretical knowledge of the
new and current system and of different approaches
to risk assessment. There were no participants with
practical experience of using the existing system
for securing work areas, or of the other roles
involved when performing such tasks. Most
participants were familiar with the railway
infrastructure in general and had experience with
the HAZOP technique. All participants in the
workshops were familiar with the new concept for
securing work areas, through either the graphical
representation of the system or the textual
description.


5.1 HAZOP based on a graphical model 


As preparation, a description of the case utilizing
SysML diagrams, with limited text and explanation
of the modelling language, was sent out to the
participants one week before the workshop. In the
workshop the participants had many questions
outside the scope covered by the model; there were
also questions about the meaning of some of the
modelling symbols. During the workshop, an example
of the physical outline was drawn ad hoc as an
illustration, and it was used a lot in the
discussions. The facilitator had guidewords on
hand, but they were not applied actively, as the
participants constantly came up with new questions
related to system architecture or potential
problems. The HAZOP resulted in the identification
of two hazards, a large number of potential hazards,
and potential situations leading to down-time. The
large number of identified potential hazards was
due to uncertainty and lack of detailed system
procedures.


5.2 HAZOP based on a textual description 


A textual description of the case was provided 
in advance as input to the HAZOP workshop (a 
summary of the textual description is given in 
chapter 4.1). The participants had one week to 
familiarize themselves with the textual description 
of the system before the workshop. 

The following guidewords were used in the
meeting: early, late, before, after, wrong place,
missing and wrong. The guidewords were not used
actively for each function, but were presented on a
separate marker board throughout the whole
workshop. Each of the main functions was discussed
in the HAZOP workshop, in accordance with the order
given in chapter 4.2.
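
Although the guidewords were not applied function by function in the workshop, a facilitator could generate a complete checklist of guideword and function combinations in advance. The sketch below is purely illustrative: the function name `hazop_prompts` and the wording of the prompt are assumptions, while the guidewords and main functions are those listed in the text.

```python
from itertools import product
from typing import List, Tuple

# Guidewords used in the meeting (from the text above).
GUIDEWORDS = ["early", "late", "before", "after",
              "wrong place", "missing", "wrong"]

# The twelve main functions listed in chapter 4.2.
FUNCTIONS = ["Log in", "Log out", "Join", "Resign", "Secure", "Release",
             "Set time", "Time", "Status", "Takeover", "Full takeover",
             "Overview"]


def hazop_prompts(functions: List[str],
                  guidewords: List[str]) -> List[Tuple[str, str, str]]:
    """Return one (function, guideword, question) entry per combination.

    The question text is a hypothetical phrasing, not taken from the paper.
    """
    return [
        (func, word,
         f"Function '{func}', deviation '{word}': "
         "what could happen, and could it lead to a hazard?")
        for func, word in product(functions, guidewords)
    ]


if __name__ == "__main__":
    for _, _, question in hazop_prompts(FUNCTIONS, GUIDEWORDS)[:3]:
        print(question)
```

Such a list only structures the discussion; it does not replace the competence of the participants.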


5.3 Experiences from testing the two approaches


A textual description is, compared to a model 
description, a well-known and common way of 
presenting systems for most people. A textual 
description may therefore be less time consuming 
to understand and is easy to present in a meeting. 
However, the textual description was not detailed 
enough to present the system logic and all the pre- 
conditions in depth. Hence, an illustration includ- 
ing the sequence of main functions and roles 
involved in each function was made by one of the 
participants in the workshop. 

The illustrations were found to be useful
complements, which indicated that the textual
descriptions alone were not able to provide
sufficient information. Specifically, it was found
that understanding the correct sequence of functions
performed by the different roles was critical to the
hazard identification, and this was not easily
covered and captured by the textual description.

Constructs in models can become complex, and
thus their visualization as well. According to the
experiences in the workshop based on graphical
models, the models became difficult to understand
beyond a certain level of visual complexity (e.g.
when it is no longer possible to present the whole
system in one single, readable screen diagram, it
becomes more difficult to find support in the visual
representation). One specific related problem was
following the flow of logic in diagrams when
branches were involved. Modularization of the
visual representation, added to the textual
descriptions (if meaningfully possible), might help
here.

Both workshops included a physical description
in addition to the text or models provided
beforehand. This suggests that a physical outline
diagram could be part of the models, or an addition
to textual descriptions. Another consideration
is that modelling or describing specific, representa- 
tive cases (e.g. application of the planned system 
at a specific work area) might be a necessary sup- 
plement to the initial descriptions of the planned 
system. In our case, a specific, representative train 
station could be considered. 

Even though participants in both workshops
helped identify unclear and missing parts, both
workshops pointed to a number of potential hazards
due to uncertainty about how the system was
intended to work. Some of these details were
contained in only one of the descriptions, but a
number of descriptions were missing in both
workshops, for example: preconditions of the main
functions of the securing work area app, defined
terminology and roles, and a description of the old
and current solutions. A question related to this is
whether the workshops would have been able to
process and utilize the information requested by the
participants. This needs to be taken into account
when considering the use of HAZOP. In particular,
there is a need to find models supporting the
balance between the two considerations: giving
sufficient descriptions, but not drowning the
participants in details.

Based upon one workshop with models, we cannot
draw a conclusion on whether the model-based
description prepared is practical for the hazard
identification. There were, as described above, many
other influences in the workshop independent of the
modelling. However, it is clear that SafeT will need
to prepare guidelines on how to use HAZOP in
combination with specific SysML diagrams. Another
question is whether other models could have
provided the same.

The two workshops came up with the same
hazards. The only differences lay in how they were
identified in the two workshops. This is in
accordance with what one should expect. Since the
textual and graphical descriptions were based upon
the same source of knowledge within Bane NOR,
differences in the assessment would typically point
to flaws in one of the descriptions. Another reason
for having the same results is that the two
workshops had rather homogeneous groups in terms of
knowledge and experience. None of the groups had
participants with practical experience, such as
train dispatchers or safety guards. This is also
illustrated by the high number of potential hazards.
It is assumed that by having additional competence
in the workshop, some of these potential hazards
would be closed as not possible, while others would
be confirmed. One interesting observation is that
most of the potential hazards are not closed by just
combining the graphical and the textual description.
The uncertainty lies in what is not presented in
either of the two workshops. If the experiment had
included only one HAZOP, we could falsely have
concluded that the solution was simply to add the
graphical or the textual description.

For both the model-based and the text-based
descriptions there is a need to supplement the
descriptions with all of the following
visualisations, to compensate for their inherent
advantages and disadvantages:


— High-level visualisations—everything on one 
drawing. 

— Modularised visualisations—to explore the 
details where and when needed. 

— Sequences—to get necessary understanding on 
the order and timing of activities and tasks. 

— Visualisation of interactions: man—machine/ 
technology—organisation—environment. 


A conclusion from this is that a better coverage
of relevant details for the hazard identification
could have been included in the model and the
diagrams. This could also have been achieved by
a preparatory workshop focusing on eliciting such
information, or by involving a RAMS expert in the
modelling alongside the system modeller and the
system owner.

There are several sources of uncertainty in the
conclusions relating to relevant hazards identified
in the two workshops. The uncertainty related to
the sum of competencies covered by the participants
in the workshop is crucial. Whatever the approach,
the sum of competencies is of great importance. It
is not possible to compensate for lack of competence
by choosing the other approach, or by adding more
time for each participant’s preparations.

Applying these two approaches to the case
identifies basically the same hazards. This means
that the conclusion is not that one of the
approaches is preferable. Rather, the two approaches
give different nuances and different perspectives,
resulting in a broader risk picture, which may be
useful when it comes to communicating, evaluating
and mitigating the risk.


How sensitive these findings may be to the
chosen case has not been investigated. This means
that if the case had been a totally different one,
we do not know whether the two approaches would have
ended up with similar hazards. Nevertheless, the
findings in the HAZOPs from the two approaches, and
the findings from the comparison of the two
approaches, both indicate that the case is complex
enough for an experiment like this.

When introducing new technologies or new
applications of existing technologies, it is
important to assess the risk not with a single
approach, but with different approaches, to get a
broader understanding of the potential hazards.


6 CONCLUSIONS 


In this paper we have elaborated on the experiences
of using a graphical model presented as SysML
diagrams in comparison with an ordinary textual
description as a basis for hazard identification.

The model-based description is a practical
and useful supplement for the hazard identification
activities, but the HAZOP workshops point out that
the use of SysML models requires good preparation
of the HAZOP. SafeT will need to prepare guidelines
on how to use HAZOP in combination with specific
SysML diagrams. The participants should be familiar
with such modelling to benefit from the models. A
textual description is a mode of communication that
most of the potential participants in the HAZOP
workshop will already be familiar with and trained
in. Graphical models, pictures and drawings are
necessary and useful supplements for getting a
broader understanding of the case under analysis.

Analysis of the risk of pipe breaks based on hydraulic model


E. Bartkiewicz & I. Zimoch 
Institute of Water and Wastewater Engineering, Silesian University of Technology, Gliwice, Poland 


ABSTRACT: The Water Supply System (WSS) distributes and supplies water to customers for
consumption, production, maintenance of sanitary conditions and extinguishing fires, i.e. activities
necessary for life. For this reason, the hydraulic conditions of the WSS and the water quality must be
maintained. One of the most common problems encountered in underground infrastructure is pipe
failure, which has a major impact on WSS performance and water quality. To avoid such incidents, a
safety plan should be developed. WSS safety plans require risk assessment steps and decision-support
systems. Such a system should use all the information resources stored in the databases. Currently,
water supply companies have a number of network monitoring systems; however, these are warning
systems rather than preventive ones. Risk is defined in terms of undesirable situations that may occur,
and therefore the preventive system should be more comprehensive and based on historical operational
events. The basic tools for building such a system are mathematical models that simulate the
operational states of the WSS at different time intervals and during failure situations. There are two
kinds of models: hydraulic models, which simulate operating conditions of the water pipe network such
as pressure and velocity of water flow, and quality models, which show changes in water quality at
different time intervals. The article presents a hydraulic model of a real WSS, created in WaterGEMS.
This software uses historical pipe break data in its Pipe Break Analysis function to compute a pipe
break score for each pipe. The article presents the results of the analysis and the risk assessment in
matrix form.


1 INTRODUCTION 
factors (pressure, corrosion, external stresses) and
environmental factors (temperature, rainfall, soil
conditions) [2, 3, 4, 5, 6]. Figure 1 shows a few
examples of conditions that may cause pipes to
break. Pipe failures cause economic losses
(leakages) and threats to consumer health [7, 8, 9].
In small systems (often branched), pipe failures
cause interruptions in water supply, which puts
places important for public health (hospitals,
hotels, kindergartens, schools) at risk, whereas in
large systems they reveal themselves as water leaks.
Determining the type of recipient is an important
element in risk assessment. Every year, 32 to 48
billion cubic meters of water are lost around the
world through leaks, corresponding to a total annual
cost of water loss of more than 14 billion US
dollars [10, 11]. Through cracked pipes,
contaminated soil (by pesticides or animal and human
waste) enters the system and causes outbreaks of
waterborne diseases [12, 13]. For these reasons, the
analysis of pipe failures should be included in the
Water Safety Plans, as one of the elements enabling
the determination of water contamination areas.
The World Health Organization (WHO) wrote in the
Water Safety Plans: “Understanding the nature of
sources of contamination and how these may enter
the water supply is critical for assuring water safety”
[14], which means that every risk analysis is crucial
to the protection of human health.

Figure 1. Causes for pipe break [2].

There are two categories of pipe failure
modeling: physical and statistical. Physical methods
are designed to describe the physical mechanisms
underlying the failure of pipelines. However, the
physical mechanism of pipe cracking is often very
complex and difficult to determine accurately. The
reason for this is that the pipes are buried and the
failure records are insufficiently accurate.
Acquiring data to determine the physical causes of
failure requires expensive investments, which is why
statistical methods are used more often. Statistical
methods use different characteristics of water
networks, usually data that can be easily obtained,
such as pipe age or the number of failures [2]. The
aim of the statistical approach is to determine the
deterioration of the pipes' condition and to create
a forecasting model to assess the risk of failure
[11, 15]. Statistical methods can be divided into
three categories: deterministic, probabilistic
multi-variate and probabilistic single-variate.
Models that consider uncertainties related to input
data are probabilistic; otherwise they are
deterministic, while models based on more parameters
than pipe age alone are multi-variate [16]. A
deterministic method was described by Shamir and
Howard as a prediction model that relates a pipe’s
breakage to the exponent of its age [17]. Examples
of probabilistic methods include, among others,
models defining the time of failure occurrence, such
as the Cox proportional hazard model and accelerated
failure models [18]. Table 1 presents sample models
of pipe failure estimation. Analysis of the risk of
pipe breaks is related to the concept of reliability.
Reliability of a Water Distribution System (WDS) is
defined as the system’s ability to perform its
functions under given conditions of existence and
operation and within the assumed time. This means
that the reliability of a WDS is determined as the
probability of supplying all demand nodes [4, 19,
20]. Studying the reliability of a water distribution
network on the basis of its actual behavior supports
planning for accurate operation and management of
the WDS. Taking into account pipe failure prediction
models and models of WDS reliability, a system
supporting Water Safety Plans can be created.
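
To make the deterministic model concrete, the exponential form commonly cited for Shamir and Howard [17] is $N(t) = N(t_0)\, e^{A (t - t_0)}$, where $N(t)$ is the break rate in year $t$, $t_0$ is a base year and $A$ is a growth coefficient. The sketch below evaluates this expression; the function name and all numerical values are hypothetical and chosen only for illustration, not taken from the study.

```python
import math


def exponential_break_rate(base_rate: float, growth: float,
                           year: float, base_year: float) -> float:
    """Break rate N(t) = N(t0) * exp(A * (t - t0)).

    base_rate: N(t0), e.g. in breaks/yr/km (illustrative unit).
    growth:    A, growth coefficient per year.
    """
    if base_rate < 0:
        raise ValueError("base_rate must be non-negative")
    return base_rate * math.exp(growth * (year - base_year))


if __name__ == "__main__":
    # Hypothetical values: 0.2 breaks/yr/km in 2016, A = 0.05 per year.
    for year in (2016, 2021, 2026):
        rate = exponential_break_rate(0.2, 0.05, year, 2016)
        print(year, round(rate, 3))
```

The only inputs are a base break rate and the pipe's ageing, which is why such models are classified as deterministic and single-variate in the terminology above.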

Forecasting pipe failure requires the collection
and analysis of historical data. One of the most
commonly used systems for collecting data is the GIS
(Geographic Information System) database, which
enables the integration of multiple data sources
(including location data of dangerous locations,
geodemographic features and vulnerable
populations), the use of spatial analysis techniques
(including buffering and overlay), potential spatial
integration models and geographical presentation
of complex data in a cartographic format [22].
However, the GIS database is used to collect data 


Table 1. Pipe failure models.

Author | Year of study | Objective | Explanatory variables | Model type
Shamir, Howard | 1979 [17] | pipe breaks and other structural degradation | pipe length, pipe age, breakage history | Deterministic
Walski, Pellicia | 1982 [21] | pipe breaks and other structural degradation | pipe length, pipe age, breakage history | Deterministic
Kettler, Goulter | 1985 [16] | pipe breaks and other structural degradation | pipe length, pipe age, breakage history | Deterministic
Cox | 1972 [18] | time between failures or time to the next failure | pipe length, operating pressure, pipe age, break rate, soil corrosivity | Probabilistic
Constantine, Darroch | 1993 [16] | pipe breaks and other structural degradation | operating pressure, pipe diameter, soil type, overhead traffic conditions | Probabilistic
Accelerated failure models | [18] | time between failures or time to the next failure | pipe age, pipe diameter, pipe length, pipe material, traffic loading, soil acidity, soil humidity, number of breaks | Probabilistic
and not to acquire data; for this purpose, various
devices are used, together with systems that read
information from these devices. SCADA (Supervisory
Control and Data Acquisition) is one such system.
WDS operators commonly use SCADA systems to transmit
flow and pressure data in real time, which is why
SCADA is also used as an early warning system (e.g.
in the event of a pressure/flow drop or increase).
In combination with a telemetry system, SCADA can be
used to find water leaks and detect the place of
failure. While these systems enable data to be
transferred and stored, network models are used to
predict (simulate) failures. Mathematical models of
a water distribution network allow the simulation of
water flow and pressure in the system in “normal”
and partially failed states, and they can also
generate failure rates and repair events according
to specified probability distributions [23]. Using
the collected historical data and supporting systems
(e.g. hydraulic models), decision systems can be
created, which can then be used in the Water Safety
Plans. Water Safety Plans are based on risk
assessment and risk management. Risk analysis
provides answers to three questions: What can
happen? How likely is it to happen? Given that it
occurs, what are the consequences? [24]. To answer
these questions, a risk analysis should be carried
out using a matrix method in which probability and
consequence weights are assumed, and the product of
these weights determines the risk [25].


2 RESEARCH OBJECT 


The subject of the study is a selected subsystem
of the biggest collective Water Distribution System
(WDS) in Poland, which is located in the south-west
of Poland, in the Silesian region. The analyzed WDS
is composed of four local Water Treatment Plants
(WTP A, B, C and D) with a total average daily
production of 72 577 m³, and four storage tanks
(E, F, G and H) with a total capacity of 155 200 m³
(Figure 2). The area under consideration is
additionally fed by a pumping station (located
outside the research area), which supplies
60 000 m³ of water per day. The average daily
water consumption of this area is 102 000 m³. The
study area is characterized by high altitude
variability, from 240 m to 364 m above sea level.
The central point of the subsystem is storage tank
E (345 m above sea level), which is supplied from
two directions (WTP A and Pump Station I) and
delivers water to the largest number of customers
in the northern area. Tanks G (315 m above sea level)
and H (364 m above sea level) collect water and, if
necessary, provide a water supply. WTP C and B


Figure 2. Scheme of the water supply network with
marked mine shafts (legend: Water Treatment Plant,
Tanks, Mine shafts).

Figure 3. Material structure of water pipe network.

Figure 4. Age structure of water pipe network.


supply the smallest part of the north-west area,
while WTP D operates only occasionally, in case of
higher water demand.

The analyzed subsystem of the WDS is a widespread
system covering an area of about 100 km², with
a total water pipe length of 256 km. The water pipe
network is made mainly of steel, as well as
polyethylene (PE) and ductile iron (Figure 3), with
pipe diameters from 55 mm to 1600 mm. The oldest
pipes in this subsystem date from 1929 (steel) and
the newest from 2016 (PE) (Figure 4).

The analyzed network is located in an area of
intensive mining, where mining damage and tremors
occur. The consequences are collapse of the terrain
and cracks in the urban infrastructure. Figure 2
shows the location of existing and inactive mine
shafts in the considered area.


3 RESEARCH METHODOLOGY 


The hydraulic model of the subsystem was used for
the analysis. The network topology was exported
from the GIS database to the WaterGEMS software.
The model was calibrated to the average water demand
from 2016 (102 000 m³), while the validation was
carried out for the operating conditions during
three days (17-19.10.2016). For this model we
obtained a high correlation between computed and
observed values: 99.4% for pressure and 99.0% for
flow. Failure data from the mentioned area for a
5-year period (2012-2016) were entered into the
software. Table 2 shows the total number of pipe
failures for each year. Based on pipe length, number
of breaks and break history (duration of break
history), WaterGEMS calculates the predicted rate of
failure. The results calculated by the Pipe Break
Analysis include:


- failure rate $\lambda_P$ (breaks/yr/km): the number of
  breaks for an individual pipe, according to the
  formula:

$$\lambda_P = \frac{N}{L \cdot t} \quad \text{break/yr/km} \qquad (1)$$

where N = number of breaks; L = length of pipe
(kilometer); t = period of analysis (year).

- pipe group failure rate $\lambda_G$ (breaks/yr/km): the
  number of breaks for a given pipe group, according
  to the formula:

$$\lambda_G = \frac{N}{L \cdot t} \quad \text{break/yr/km} \qquad (2)$$

where N = number of breaks for the group;
L = total length of pipe in the group (kilometer);
t = period of analysis (year).

Based on the total length of the network, the
number of failures and the length of the failure
history, one group break rate was created and used
in the calculations for all pipes.

- projected failure rate $\lambda_{PP}$: the product of the
  scaled break rate, the projection period and the
  length of pipe. It is an estimate of the number of
  breaks over the projection period, assuming that
  past break rates persist. The scaled break rate is
  calculated according to the formula:

$$\lambda_{PS} = \alpha \cdot \lambda_P + (1 - \alpha) \cdot \lambda_G \qquad (3)$$

where $\alpha$ = index of the share of damage intensity.

Due to the creation of only one pipe break
group, the index value $\alpha$ was chosen as 0.8.
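
The calculation chain of formulas (1) to (3) can be sketched in a few lines. This is an illustration of the arithmetic only, not of the WaterGEMS implementation: the function names, the `Pipe` record and the three example pipes are hypothetical, and only the formulas and the value α = 0.8 come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pipe:
    name: str
    length_km: float
    breaks: int  # breaks recorded during the analysis period


def break_rate(breaks: int, length_km: float, years: float) -> float:
    """Formulas (1) and (2): breaks per year per kilometer."""
    if length_km <= 0 or years <= 0:
        raise ValueError("length and period must be positive")
    return breaks / (length_km * years)


def scaled_break_rate(pipe_rate: float, group_rate: float,
                      alpha: float = 0.8) -> float:
    """Formula (3): weighted mix of pipe and group break rates."""
    return alpha * pipe_rate + (1.0 - alpha) * group_rate


def projected_breaks(scaled_rate: float, length_km: float,
                     projection_years: float) -> float:
    """Scaled rate multiplied by projection period and pipe length."""
    return scaled_rate * projection_years * length_km


# Hypothetical example data, not taken from the studied network.
PIPES: List[Pipe] = [Pipe("P-1", 2.0, 3), Pipe("P-2", 0.5, 0),
                     Pipe("P-3", 1.5, 6)]
YEARS = 5.0

GROUP_RATE = break_rate(sum(p.breaks for p in PIPES),
                        sum(p.length_km for p in PIPES), YEARS)

if __name__ == "__main__":
    for p in PIPES:
        rate = break_rate(p.breaks, p.length_km, YEARS)
        scaled = scaled_break_rate(rate, GROUP_RATE)
        print(p.name, round(projected_breaks(scaled, p.length_km, 1.0), 3))
```

Note that a pipe with no recorded breaks still receives a non-zero scaled rate through the group term, which is the practical effect of choosing α below 1.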

The studied network is a main network sup- 
plying water to cities and industry, so every water 


Table 2. Number of pipe failures in a given year.

Year | 2012 | 2013 | 2014 | 2015 | 2016
Number of pipe failures | 313 | 226 | 232 | 303 | 210


customer will belong to a critical group (hospitals,
food production, etc.). Any interruption in water
supply can have a major impact on the functioning
of this customer group. The task of this work is to
determine the impact of pipe breakage on the
efficiency of water supply. For this purpose, the
water flows calculated in the hydraulic model and
the predicted number of pipe failures were used. The
results of the analyses are presented in the form
of a risk matrix.


4 RESULTS AND DISCUSSION


Figure 5 shows the average water flow in the water
supply system; the lowest flow values (below
50 m³/h) are marked in dark blue, while the highest
values (above 600 m³/h) are marked in red.

The number of failures was determined for each
individual pipe. Zero values were obtained for 2533
sections (sections without a failure during the
considered period); the minimum non-zero failure
rate was 0.098 break/yr/km and the maximum was 452
break/yr/km. The failure rate for the pipe group was
obtained for all sections in the analyzed network
and was 1.06 break/yr/km. Based on these rates, the
projected failure rate was calculated. The result of
the simulation is a map representation of the
failure risk of pipes in the network. Figure 6 is a
map of these pipes and their corresponding risk
levels. The largest number of expected pipe failures
was obtained for the section located east of tank E
and the section located north of tank G. They
constitute 2.5% of the analyzed network. For the
largest part of the network (90%), up to 4 failures
per year are projected, while the smallest part (1%)
has between 12 and 16 projected failures.

The predicted numbers of pipe failures were
calculated for each segment. The smallest projected
failure rate is 0.001, while the highest is 20.0.
Based on the obtained simulation results, a risk
matrix

Figure 5. Hydraulic simulation result.

Figure 6. Network map with marked ranges of
projected breaks.

Figure 7. Risk matrix of pipe failure (axes: weight
of consequences of flow rate, 1 to 5; weight of
projected failure rate).


Table 3. Flow rate division into weight classes.

Categorization due to the flow rate
Water flow [m³/h] | 0-50 | 50-200 | 200-400 | 400-600 | >600
Weight scale | 1 | 2 | 3 | 4 | 5


Table 4. Projected breaks division into weight classes.

Categorization due to the intensity of failure rate
Projected failure rate | 0-4 | 4-8 | 8-12 | 12-16 | >16
Weight scale | 1 | 2 | 3 | 4 | 5


was created (Figure 7). Based on the analysis, a 
five-stage categorization of the effects of damage 
based on the volume of water supplies was made. 
The ranges of water flow values and failure rate 
were assigned to individual weight classes from 1 
to 5 (Tables 3 and 4). 

For risk analysis, a risk matrix was developed,
which takes into account the consequences of failure
occurrence, expressed by the flow rate, and the
intensity of failures. The risk is defined by
formula (4):

$$R = W_f \cdot W_p \qquad (4)$$

where $W_f$ = weight of flow rate; $W_p$ = weight of
projected failure rate.

The risk value varies from 1 to 25. For this risk, a
three-class classification was made: acceptable
risk, controlled risk and unacceptable risk.

Figure 8. The distribution of risk on the network map.

The three risk classes are defined as follows:
acceptable risk, weight 1-4 (green); controlled
risk, weight 6-10 (yellow); and unacceptable risk,
weight 12-25 (red). Green indicates a low impact of
pipe failure on network operation, yellow a medium
impact, and red a high impact. Figure 8 shows the
distribution of risk classes in the network. The
result of the risk assessment of damage to a pipe
section allows for easy identification and
classification of pipes that need to be modernized.
Based on the impact levels, the areas with the
highest risk of lack or reduction of water supply to
the largest number of customers were designated. In
the areas located in cells 4B, 5B, 5C and 6C the
impact of pipe failure on water supply is highest,
so the pipes in these areas must have the highest
priority for modernization. Pipes in these cells
distribute water to the largest number of customers,
and failures of these pipes will cause water
shortages in the largest area. Pipes located in
cells 4B, 5B, 6B, 6C, 7C, 4D, 3E and 4E belong to
the medium-impact class; therefore their
rehabilitation should be considered later. The
remaining pipelines are the least affected and do
not pose a threat.
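
The weighting and classification described above can be sketched as follows. The class limits come from Tables 3 and 4 and from the three risk classes in the text; everything else is an assumption made for illustration: the function names are hypothetical, values lying exactly on a class boundary are assigned to the lower class, and a risk value of 5 (which the stated ranges 1-4, 6-10 and 12-25 do not cover) is treated here as acceptable.

```python
import bisect

# Upper limits of weight classes 1 to 4; anything above is class 5.
FLOW_LIMITS = [50, 200, 400, 600]   # water flow in m3/h (Table 3)
FAILURE_LIMITS = [4, 8, 12, 16]     # projected failure rate (Table 4)


def weight(value: float, limits: list) -> int:
    """Map a value to a weight from 1 to 5.

    Assumption: a value equal to a class limit falls in the lower class.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    return bisect.bisect_left(limits, value) + 1


def risk(flow_m3h: float, projected_failures: float) -> int:
    """Formula (4): product of the two weights, from 1 to 25."""
    return weight(flow_m3h, FLOW_LIMITS) * weight(projected_failures,
                                                  FAILURE_LIMITS)


def risk_class(r: int) -> str:
    """Three-class classification of the risk value.

    Assumption: r = 5 is not covered by the ranges given in the text
    and is counted as acceptable in this sketch.
    """
    if r <= 5:
        return "acceptable"
    if r <= 10:
        return "controlled"
    return "unacceptable"


if __name__ == "__main__":
    # Hypothetical pipe: 300 m3/h average flow, 9 projected failures.
    r = risk(300, 9)
    print(r, risk_class(r))
```

Changing either assumption only affects pipes on a class boundary or with a risk value of exactly 5.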


5 CONCLUSION 


The water distribution system is an extremely
important infrastructure enabling the functioning of
various areas of society. Lack of water supply,
reduced water supply or contamination of water can
pose a risk to human life. An important element of
network management is the determination and
assessment of risk. These analyses may concern
various aspects of the operation of the water supply
network, among others chlorine decay, secondary
water contamination or pipe failures. Data sets and
decision support systems are important for risk
management. These systems certainly include the GIS
and SCADA databases that acquire and provide data,
but hydraulic models, which make it possible to
perform various simulations, are also a useful tool.
Based on the conducted simulations, so-called
“sensitive” areas can be identified, for which
different types of risks can then be determined.
Knowing the risks, waterworks companies can save
money and time when planning future investments.


ABSTRACT: Offshore wind structures are subject to the combined action of wind and wave loads.
A change of these loads may significantly affect the integrity of the structural elements. Increased insta- 
bilities in the Earth’s climate system could increase the frequency of extreme events (e.g. rogue waves) 
well beyond the frequency values currently recommended within structural design standards. Inherent to 
extreme event modelling is the need to use expert (subjective) judgement and sparse data sets. In this con- 
text, a Bayesian Belief Network (BBN) can be applied to describe the effect of these changes on the fre- 
quency of rogue waves within wind farms located in shallow water depths of 20-60 metres. This graphical 
modelling approach provides the structure to effectively communicate, among others, parameter uncer- 
tainty, causality across multiple risk factors, quantitative definition of assessment subjectivity or potential 
impact of a change in rogue wave frequency relative to that described in current design standards. 


1 INTRODUCTION 


The term “rogue”, “freak”, “abnormal” or “giant” 
wave commonly refers to waves that are very 
steep and large in absolute measures and, at the 
same time, significantly larger than the surround- 
ing waves in the sea state, and are thus unexpected 
(Bitner-Gregersen, 2017). They are statistically 
unlikely to occur in a given sea state (either low, inter- 
mediate or high), based on averaged properties of 
that sea state (Bitner-Gregersen & Gramstad, 2015). 

This physical phenomenon is not fully understood, but increasingly reliable measurements and records, together with the significant increase in computational power and numerical modelling capacity, make it possible to explore these extreme events with greater accuracy.

There are several motivations to reduce the risk of wave-related incidents. First, they clearly represent a current threat to marine installations. Second, more severe sea state conditions may be expected in some ocean regions as a result of climate change and global warming (IPCC Panel, 2014). Third, understanding and forecasting waves under various conditions is essential for the design and operation of offshore structures.

Based on these initial premises, addressing these extreme events as a potential risk and including them in the customary Risk Assessment process of a company that operates physical assets in an offshore environment is entirely justified, despite its complexity and the high number of uncertainties involved.

In the present work, a causality-based probabilistic graphical modelling methodology is proposed to assess the risk associated with rogue waves in offshore wind farm projects at the final design stage. The methodology includes the impact of future climate change and provides the structure in which to effectively communicate: a) parameter uncertainty; b) correlation across multiple risk factors (i.e. “Systems of Systems” (SoS) complexity mapping/analyses); c) definition of assessment subjectivity; and d) potential impacts of low probability catastrophic events (i.e. extreme events). The methodology provides a holistic framework that can be integrated into existing decision-making processes currently defined within a large capital project execution process.

In brief, the method studies the probability of a rogue wave impacting an offshore structure situated in a predefined location of the Northern North Sea, between 20 and 60 m depth, and includes three main stages: risk understanding; qualitative bow-tie creation; and transformation to a Bayesian Belief Network.


2 RISK UNDERSTANDING


Risk assessment is to a large extent about gaining ‘risk understanding’ in the sense of knowledge (justified beliefs), by producing a risk description (C', Q, K), where C' are the specified consequences of the activity studied, Q is a measure of uncertainty, and K is the background knowledge on which C' and Q are based (Amundrud & Aven, 2015). According to these authors, these justified beliefs are based on data, information (relevant processed data) and models. The uncertainty judgments about C' using Q can also be seen as justified beliefs.

K is a limiting aspect in the proposed methodology, due to the lack of understanding of the physical process that creates rogue waves. For example, any description of the wave phenomenon carries a set of uncertainties. The random model for ocean waves is constructed by representing the sea surface as a sum of elementary waves with different wavelengths, frequencies, and directions of propagation (Bitner-Gregersen & Gramstad, 2015). In reality, however, ocean waves are not described exactly by a linear formulation or by second-order theories, and therefore require a set of increasingly accurate formulations. The more accurate the formulation, the more mathematically complex and difficult the model becomes. As a result, the logical functions and equations included in the proposed graphical model are based on linear theory, the most tractable approach for the graphical model under design.

Uncertainty related to environmental phenomena may be divided into aleatory uncertainty (natural randomness) and epistemic (knowledge) uncertainty; the latter is further divided into data uncertainty, statistical uncertainty, model uncertainty and climatic uncertainty (Bitner-Gregersen et al. 2013).

Assessing data uncertainty is out of the scope of this study, so the available data are assumed to be appropriate. To minimize the statistical and climatic uncertainty, a long-term data source was selected. The European Centre for Medium-Range Weather Forecasts’ ERA-Interim is a global atmospheric reanalysis from 1979, publicly accessible and continuously updated in real time (European Centre for Medium-Range Weather Forecast, 2017). After establishing a geographical location in the Northern North Sea, 4 measurements per day were obtained between 1979 and 2017 (about 55,000 values per variable) for 30 different variables. Only 9 of them were considered relevant for the project: model depth ($d$), zero-crossing mean period ($T_z$), wave spectral directional width ($\delta_\theta$), significant wave height ($H_s$), mean wave direction ($\theta$), mean direction of wind waves ($\theta_w$), mean direction of swell ($\theta_s$), and Benjamin-Feir index (BFI).


There are other relevant variables, such as the wave length ($\lambda$), that are not independent. In these cases, formulae given by the Recommended Practice DNVGL-RP-C205 have been used (DNV GL, 2017).

Finally, defining and managing the model is the core part of this work and a main responsibility of the risk analyst (designing, building, assigning probabilities, running simulations, reporting and maintaining). The model reflects the limitations of the previous factors and adds new uncertainties, due to incorrect assumptions in the physical process formulations or to the choice of probability distribution types used to represent uncertainties. In this regard, the method tries to register and track all the detected uncertainties. To limit this effect, all the variables were fitted to a probability distribution; the software tool ModelRisk (Vose Software, 2018) was used only when the best fit was not supported by the Bayesian Network software (OpenBUGS).


3 BOW-TIE CREATION 


The bow-tie is a graphical approach frequently 
used to represent a Risk Event, its Causes (Driv- 
ers), Prevention Barriers (Controls), Mitigation 
Barriers, and its Consequences (Impacts) in a vis- 
ual and logical manner. Centered on a critical (risk) 
event, it is composed of a simplified fault tree on 
the left-hand side and an equally simplified event 
tree on the right-hand side showing the possible 
consequences of the critical event based on the fail- 
ure or success of safety functions (Khakzad et al. 
2013). To understand the relations and depend- 
ences among factors involved in the creation and 
impacts of rogue waves and climate change on 
offshore wind structures, a qualitative bow-tie is 
proposed. The first step consists of formulating 
the critical event: impact of a breaking rogue wave 
on an offshore wind structure (named “IMPACT” 
in the graphical model). This step seems to be obvi- 
ous, but in complex or emerging risks it is essential 
to organize and plan the following phases of the 
method. 

In this case, due to the complexity of the ana- 
lyzed physical phenomena, the bow-tie focuses on 
the left-side, or analysis of causes (drivers) and 
barriers (controls). The event tree of consequences 
is reduced to one: the failure of the structure (F). 

After a thorough review of the state of the art on rogue waves and climate change impacts in the study area, as well as of the available data, the drivers and controls are analyzed individually and placed in the bow-tie, establishing the appropriate connections and causal relations. The graphic is continuously updated until it reaches its final shape, shown in Figure 1.


Figure 1. Final bow-tie.


Several different mechanisms may be responsible for generating rogue waves, such as linear focusing of energy (spatial and dispersive: $K_c$, $K_f$ and $T_F$), wave-current interactions (CI), crossing seas (wind sea and swell or two swell systems, CS), quasi-resonant nonlinear interactions (modulational instability, BFI), shallow water effects (SWC), soliton interactions (SO), directional spreading (DS), and wind forcing (W).

Atmospheric forcing has not been considered in the bow-tie as a cause of waves, to simplify the visual understanding of the process. The relevant variables obtained from the dataset are included in the bow-tie as primary events. Other relevant variables, such as slope (SL), angle between the wave crest and depth contours ($\alpha$), angle between target and protection structure ($\beta$), Ursell number ($U_r$) or maximum height ($H_{max}$), are added as primary events when statistical data are not available but are required for a consistent explanation of an intermediate event. Some of them are calculated in later steps or treated as assumptions. The bow-tie shows two main controls: the protective structure (O), as a physical barrier that prevents the impact of a breaking rogue wave on the offshore wind structure (OWS); and climate change (C). C is placed as a control on the assumption that its barrier effect is focused on limiting or preventing the CO2 emissions caused by humans. The key assumption is that the accumulation of CO2 and other greenhouse gases is the primary driver of climate change and that the human population is largely responsible for the significant increase in the atmospheric concentrations of those gases in the past 200 years. Other controls are related to shallow water restrictions or are used for reversing interactions of separated subsystems (current, ship traffic, etc.) over the wave fields or between drivers of different nature, when needed. The other three are natural controls: shallow water conditions and rogue wave conditions.


4 BAYESIAN NETWORK 


The bow-tie graphical model is used in this method as a primary tool to understand the risk and locate the critical event in its cause-effect framework. However, it presents a static picture of the problem. In addition, no causal relation can be established between primary events or between events on different branches of the fault or event tree. These problems are solved by transforming the bow-tie into a Bayesian Belief Network (BBN).

A BBN is an explicit description of the direct 
dependencies between a set of variables, in the 
form of a directed graph and a set of nodes linked 
to a probability. This structure offers the following 
benefits (Fenton & Neil, 2013): modelling causal 
factors explicitly, reasoning from effect to cause 
and vice versa; updating the probability distribu- 
tions for every unknown variable whenever an 
observation is entered into any node; reducing the 
burden of parameter acquisition; overturning pre- 
vious beliefs in the light of new evidence (explain- 
ing away); making predictions with incomplete 
data; combining diverse types of evidence includ- 
ing both subjective beliefs and objective data; and 
arriving at decisions based on visible, auditable 
reasoning. 

The conversion of a bow-tie into a BBN is sum- 
marized in Figures 2 and 3. 

Besides the wave system, the BBN includes other interacting systems: the current, seabed, wind, climate, ship traffic and artificial structure. Figure 4 shows this graphical model.

The fitted distributions are included as parent nodes, because they are the basic parameters of the model. There are 13 “parent distributions”, of which only four are not obtained from available data. In these cases, a uniform distribution is assumed. One variable relies on the seabed conditions and could be better characterized by considering a bathymetry model: the angle between wave crests and depth contours ($\alpha$). Another one (Froude number $F_d$) depends on the ship traffic around the offshore wind turbine structure (OWS), but its assessment is out of the scope of this work.


Figure 2. Mapping algorithm from bow-tie to Bayesian Network (Khakzad et al., 2013).


Figure 3. Bayesian Network related to bow-tie elements. The figure maps the driver (D), control (C), risk event (E), mitigant (M) and consequence (Q) nodes through the factorization $P(U) = P(E, C, D, M, Q) = P(C)\,P(D)\,P(M)\,P(Q \mid E, M)\,P(E \mid C, D)$.


Figure 4. Overall Bayesian Network with interacting systems.

Following the conclusions of the bow-tie analy- 
sis, the failure of the OWS occurs when a rogue 
wave impacts on it. The probability of this impact 
is “the probability of a rogue wave breaking in front 
of the OWS within the plunging range without a pro- 
tective structure in between”. When the wave breaks 
just at the location or behind, the plunge distance 
is not relevant for the targeted OWS. By contrast, 
when the wave breaks in front of the structure, 
this distance is relevant, because it defines the area 
where the wave is dangerous. However, given that 
the available data are restricted to the selected loca- 
tion, further spatial considerations (i.e. defining a 
breaking point or a plunge distance in front of the 
OWS) are out of the scope of this study. 

Therefore, for this event to happen, a rogue wave must be present and must break without an opposing protective structure in between.

In the graphical model, an extreme wave is considered a rogue wave R when its height is more than twice the significant wave height $H_s$ ($R > 2H_s$) (Bitner-Gregersen & Gramstad, 2015). The wave height is limited by breaking. The maximum wave height $H_{max}$ condition is based on the Recommended Practice DNVGL-RP-C205 (DNV GL, 2017):


$H_{max} = 0.142\,\lambda \tanh\left(\frac{2\pi d}{\lambda}\right)$ (1)


where $\lambda$ is the wave length corresponding to water depth $d$.
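
The rogue-wave criterion and the breaking limit of Eq. (1) can be sketched as follows. The function names are hypothetical, and the wave length is obtained here from the linear dispersion relation by a damped fixed-point iteration, which is an assumption of this sketch rather than a procedure stated in the text.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def wavelength(period_s: float, depth_m: float,
               tol: float = 1e-10, max_iter: int = 200) -> float:
    """Solve lambda = g*T^2/(2*pi) * tanh(2*pi*d/lambda) (linear theory)."""
    lam0 = G * period_s ** 2 / (2.0 * math.pi)   # deep-water wave length
    lam = lam0
    for _ in range(max_iter):
        target = lam0 * math.tanh(2.0 * math.pi * depth_m / lam)
        new = 0.5 * (lam + target)               # damping keeps the iteration stable
        if abs(new - lam) < tol:
            return new
        lam = new
    return lam


def max_wave_height(lam_m: float, depth_m: float) -> float:
    """Breaking-limited wave height, Eq. (1)."""
    return 0.142 * lam_m * math.tanh(2.0 * math.pi * depth_m / lam_m)


def is_rogue(height_m: float, hs_m: float) -> bool:
    """Rogue criterion used in the graphical model: height above 2*Hs."""
    return height_m > 2.0 * hs_m


# Example with hypothetical values: T = 10 s in 40 m of water, Hs = 4 m.
lam = wavelength(10.0, 40.0)
print(lam, max_wave_height(lam, 40.0), is_rogue(9.0, 4.0))
```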

The accompanying structure may be natural or artificial. If the structure is artificial, it can be either of floating type with a mooring to the seafloor or a solid anchored structure that is submerged or slightly above the surface. In an offshore wind farm, another OWS may protect the selected structure from the impact of a rogue wave. The condition for being protective is to be totally or partially aligned with the OWS in the mean wave direction. As shown in Figure 5, this condition is met when the angle between the wave and the segment that links both structures ($\beta$) is between $\theta + 90°$ and $\theta + 270°$ (no other physical phenomena, e.g. refraction or diffraction, are included).
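
A small sketch of this alignment test is given below. It assumes that both angles are compass-style directions in degrees measured from the same reference, so the sector must be evaluated modulo 360°; the function names are hypothetical.

```python
def is_within_sector(angle_deg: float, start_deg: float, end_deg: float) -> bool:
    """True if angle lies in the sector swept from start to end (increasing angle)."""
    span = (end_deg - start_deg) % 360.0
    offset = (angle_deg - start_deg) % 360.0
    return offset <= span


def is_protective(beta_deg: float, theta_deg: float) -> bool:
    """Alignment condition: beta between theta + 90 deg and theta + 270 deg."""
    return is_within_sector(beta_deg, theta_deg + 90.0, theta_deg + 270.0)


# Example: mean wave direction 300 deg, neighbouring structure at 100 deg.
print(is_protective(100.0, 300.0))
```

Working with modular offsets avoids special cases when the sector wraps past 360°, as it does for a mean wave direction of 300°.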

The critical assumption of the model is that the extreme wave heights ($H_{ex}$) calculated from the available data are generated exclusively by wave focusing under the action of wind. The final wave height (W) is then the result of an increase over the value of $H_{ex}$ due to the causes explained through the bow-tie, as expressed in Eq. (2):


$W = H_{ex} \cdot \left(1 + C_{mod} \cdot SWC_1 \cdot M + C_a \cdot C + \mathrm{step}(0.6 - U_r) \cdot C_{tf} \cdot T_F + C_{sr} \cdot SWC_2 \cdot \mathrm{step}(K_f - 1) \cdot (K_f - 1) + C_{cu} \cdot \mathrm{step}(CI) \cdot K_c\right)$ (2)


where

$H_{ex}$ = extreme value of height;

$K_c$ = height increase proportion due to current refraction;

$K_f$ = height increase proportion due to floor refraction;

$M$ = height increase proportion due to modularity;

$C$ = height increase proportion due to the climate change; and

$T_F$ = height increase proportion due to temporal focusing.

It may be argued that the measured heights are already the result of these causes or, at least, of the linear causes, i.e. spatial and temporal focusing. To deal with such complications, each driver has one control node (constant), so that the unexpected cause or interacting system can be eliminated from the model: $C_{cu}$ for the current; $C_{st}$ for the ship traffic; $C_{sr}$ for the seabed refraction; $C_{ps}$ for the protective structure; $C_{mod}$ for the modularity instability; and $C_{tf}$ for the temporal focusing.


Figure 5. Alignment of the protective structure (wave front, with the limits $\theta + 90°$ and $\theta + 270°$).

There are also natural controls (step($0.6 - U_r$), step(CI), $SWC_1$, $SWC_2$) that cancel the drivers due to natural conditions. These natural conditions can be modelled.

Only height increases are considered, so the condition to take refraction into account is $K_f > 1$, expressed as step($K_f - 1$).

$H_{ex}$ is calculated following the extreme value theory and fitting the results to a Gumbel distribution.
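
As an illustration of this step, the sketch below fits a Gumbel distribution to a series of maxima and evaluates a return level. It uses the method of moments purely for simplicity; the fitting method used in the study is not stated here, and the sample values and function names are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(maxima: list[float]) -> tuple[float, float]:
    """Method-of-moments estimates (location mu, scale beta) of a Gumbel law."""
    mean = statistics.fmean(maxima)
    sd = statistics.stdev(maxima)
    beta = sd * math.sqrt(6.0) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta


def return_level(mu: float, beta: float, return_period: float) -> float:
    """Value exceeded on average once per return_period blocks."""
    p = 1.0 - 1.0 / return_period          # non-exceedance probability
    return mu - beta * math.log(-math.log(p))


# Hypothetical annual maxima of wave height in metres.
annual_maxima = [8.1, 9.4, 7.6, 10.2, 8.8, 9.0, 11.5, 8.3, 9.7, 10.9]
mu, beta = fit_gumbel_moments(annual_maxima)
print(mu, beta, return_level(mu, beta, 100.0))
```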

$K_f$ is calculated based on the Recommended Practice DNVGL-RP-C205 (DNV GL, 2017):


$K_f = K_s \cdot K_r$ (3)

$K_r = \left[\frac{1 - \sin^2\alpha_0 \tanh^2(kd)}{\cos^2\alpha_0}\right]^{-1/4}$ (4)

$K_s = \sqrt{\frac{C_{g,0}}{C_g}}$ (5)

where


$K_s$ = shoaling coefficient;

$K_r$ = refraction coefficient;

$\alpha_0$ = the angle between the wave crest and the depth contours at the location;

$k$ = wave number;

$d$ = depth; and

$C_g$ = group velocity ($C_{g,0}$ in deep water).
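
A hedged sketch of the seabed factor follows. It uses the standard linear-theory forms of the shoaling and refraction coefficients for parallel depth contours, with the incidence angle taken as the deep-water angle; both choices are assumptions of this sketch, and the function names are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def group_velocity(k: float, d: float) -> float:
    """Linear-theory group velocity for wave number k (rad/m) and depth d (m)."""
    kd = k * d
    c = math.sqrt(G * math.tanh(kd) / k)            # phase velocity
    # For large kd the ratio tends to 0.5; the guard avoids overflow in sinh.
    n = 0.5 if kd > 50.0 else 0.5 * (1.0 + 2.0 * kd / math.sinh(2.0 * kd))
    return n * c


def shoaling_coefficient(period_s: float, k: float, d: float) -> float:
    """K_s = sqrt(deep-water group velocity / local group velocity)."""
    cg0 = G * period_s / (4.0 * math.pi)
    return math.sqrt(cg0 / group_velocity(k, d))


def refraction_coefficient(alpha0_deg: float, k: float, d: float) -> float:
    """K_r for parallel depth contours and incidence angle alpha0."""
    a = math.radians(alpha0_deg)
    ratio = (1.0 - math.sin(a) ** 2 * math.tanh(k * d) ** 2) / math.cos(a) ** 2
    return ratio ** -0.25


def seabed_factor(period_s: float, alpha0_deg: float, k: float, d: float) -> float:
    """K_f = K_s * K_r."""
    return shoaling_coefficient(period_s, k, d) * refraction_coefficient(alpha0_deg, k, d)


# Example with hypothetical values: T = 10 s, k = 0.05 rad/m, d = 10 m, 30 deg incidence.
print(seabed_factor(10.0, 30.0, 0.05, 10.0))
```

Only values of the factor above 1 would contribute in the model, since the refraction term is gated by step($K_f - 1$).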

$K_c$ is a good example of the difficulties found in modelling some of the drivers involved in the process. The first approach to defining the variable $K_c$ was based on the analysis of this phenomenon presented by Sorensen (2006). Figure 6 shows how a wave propagating with speed C from still water to water having a current velocity U changes its direction.


Figure 6. Definition sketch for wave refraction by a current (Sorensen, 2006).

In mathematical terms, these equations are 
obtained: 


U, Ñ 
l1- — sing 
A, zL; y COS C 


K= = (6) 
H L SORES) j4 Y nig 
C 
$L_c = L\,\frac{\sin\alpha_c}{\sin\alpha}$ (7)

$\sin\alpha_c = \frac{\sin\alpha}{\left(1 - \frac{U}{C}\sin\alpha\right)^2}$ (8)

where


$H_c$ = height after refraction;

$H$ = height before refraction;

$L_c$ = wave length after refraction;

$L$ = wave length before refraction;

$U$ = current velocity;

$C$ = wave velocity;

$\alpha$ = angle between the current and the crest front;

$\alpha_c$ = angle between the current and the crest front after refraction.

Considering $K_c = H_c/H$, an expression of $K_c$ as a function of $\alpha$ and $\alpha_c$ can be obtained, but introducing it in the model was impossible and always led to a systematic software error. Other equations were checked, such as those presented by Iwagaki et al. (1977). A different approach was finally selected, based on the work by Mathiesen (1987), who developed a computer model for the refraction of ocean directional wave spectra and applied it to a circular current whirl typical of the Norwegian coastal current. This model found that the relative changes in wave heights were within ±20% as compared with the wave height of the incoming waves.

M is calculated as the average probability of the nonlinear modularity drivers, which are: soliton interactions (SO), variable bathymetry (SL), crossing seas (CS), Benjamin-Feir interaction (BFI), directional spreading (DS) and wave-current interaction (CI). M is limited to a maximum value ($M_{max}$) of 0.20. This value is defined considering several systematic studies which show that the effects of modulational instability can enhance the crest height for long-crested waves by up to 20%, at lower probability levels, while the troughs become about 20% deeper than second-order troughs (Kharif et al. 2009).

C is calculated based on the CO2 emissions originating from the socio-economic scenarios (A1B, A2, B1 and B2) proposed by the Intergovernmental Panel on Climate Change (IPCC) and the values of emissions currently estimated for the North Sea.

$T_F$ is calculated as a function of the Ursell number $U_r$, with a maximum value still to be established:


$U_r = \frac{H\lambda^2}{d^3}$ (9)


Kharif et al. (2009) stated that this number characterizes the ratio of nonlinearity to dispersion. When the Ursell parameter is small, the nonlinearity can be neglected, and the wave is a linear dispersive wave. In real situations of wind waves, the values of the $U_r$ parameter are not too large, and the dispersive trains contribute significantly to the statistical wave characteristics. Based on these authors, a value of $U_r < 0.6$ is selected to consider the impact of the temporal focusing as relevant.

There are two restrictions related to shallow waters which must be considered, and they are given the variable names $SWC_1$ and $SWC_2$. Water is considered shallow when the surface waves are noticeably affected by bottom topography (Bitner-Gregersen & Gramstad, 2015). This condition occurs when the depth, $d$, becomes less than half the wavelength, $\lambda$.

Modulational instability becomes weaker with decreasing depth and it is suspected to play a less important role in shallow water (Bitner-Gregersen & Gramstad, 2015). Benney & Roskes (1969) estimated that modulational instability disappears when $2\pi d/\lambda < 1.363$ for unidirectional waves. Under this threshold, the model cancels the driver M. This is the restriction with the variable name $SWC_1$.

Similarly, the seabed related refraction ($K_f$) is canceled when the shallow water condition is not met (restriction $SWC_2$).
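
The two restrictions can be written as simple switches, as in the sketch below. The assignment of the names to the two conditions (the first for modulational instability, the second for seabed refraction) and the function names are assumptions of this sketch.

```python
import math

KD_THRESHOLD = 1.363  # Benney & Roskes (1969) limit for unidirectional waves


def swc1_modulational(depth_m: float, wavelength_m: float) -> float:
    """1.0 if modulational instability is retained (2*pi*d/lambda >= 1.363)."""
    kd = 2.0 * math.pi * depth_m / wavelength_m
    return 1.0 if kd >= KD_THRESHOLD else 0.0


def swc2_shallow(depth_m: float, wavelength_m: float) -> float:
    """1.0 if the water is shallow (d < lambda/2), so seabed refraction applies."""
    return 1.0 if depth_m < wavelength_m / 2.0 else 0.0


# Example with a hypothetical 100 m wave length at three depths.
for depth in (10.0, 30.0, 80.0):
    print(depth, swc1_modulational(depth, 100.0), swc2_shallow(depth, 100.0))
```

For a 100 m wave length, both drivers are active only in the intermediate band where the depth is above roughly 21.7 m and below 50 m.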


4.1 Modulational instability drivers 


Six drivers are involved in the creation of nonlinear instability. Their inclusion, conditions and limits are discussed in the following sections.


4.1.1 Solitons interaction (SO) 

Soliton interaction has been suggested as a source of nonlinearity in shallow water (Kharif et al., 2009). Peterson et al. (2003) linked this mechanism to relatively shallow coastal areas with high ship traffic density, particularly high-speed ships sailing at critical or supercritical speeds. These speed levels correspond to a Froude number $F_d$, the ratio of the ship speed to the maximum phase speed of gravity waves, equal to or higher than 1. Therefore, the model constrains the impact of this driver to this threshold. Analyzing the traffic in the vicinity of the location is out of the scope of this study, so a uniform distribution has been used for the variable $F_d$.


4.1.2 Variable bathymetry (SL) 

Recent works have shown that the probability of rogue waves may increase on the shallow side of an underwater slope. Sergeeva et al. (2011) linked the probability of rogue waves to the wave steepness, which is characterized in terms of the Ursell parameter. Both variables increase when the depth decreases (water shallowing), and the wave state deviates from the Gaussian. Based on previous research, the condition for nonlinearity due to interactions with a variable bottom has been set to $U_r > 0.6$ (Kharif et al., 2009).


4.1.3 Crossing seas (CS) 

When two wave systems (wind sea and swell or 
two swell systems) are separated in direction or 
frequency and cross, the modularity increases 
depending on the angle between them. Both wave 
trains are assumed to be narrow banded and 
weakly nonlinear (Kharif et al., 2009). 

Onorato et al. (2010) suggested that an increased probability of rogue waves was associated with angles between 40° and 60°. This is the condition used in the graphical model. The ERA-Interim database offers separate information about the mean direction of wind waves ($\theta_w$) and the mean direction of swell ($\theta_s$), so the possibility of crossing wind seas is not considered.
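
The crossing-sea condition can be sketched as a test on the smallest angle between the two mean directions. The function names are hypothetical, and treating the 40° and 60° limits as inclusive is an assumption of this sketch.

```python
def angular_separation(a_deg: float, b_deg: float) -> float:
    """Smallest angle between two directions, in the range 0 to 180 degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def crossing_sea_condition(theta_w_deg: float, theta_s_deg: float,
                           lower_deg: float = 40.0, upper_deg: float = 60.0) -> bool:
    """True when wind sea and swell cross at an angle inside the given band."""
    return lower_deg <= angular_separation(theta_w_deg, theta_s_deg) <= upper_deg


# Example: wind sea from 350 deg and swell from 40 deg cross at 50 deg.
print(crossing_sea_condition(350.0, 40.0))
```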


4.1.4 Benjamin-Feir interaction (BFI)

A key parameter controlling the importance of the nonlinear wave-wave interactions is the Benjamin-Feir Index (BFI), which is the ratio of the wave steepness to the spectral bandwidth (Kharif et al., 2009).


$BFI = \frac{\varepsilon\sqrt{2}}{\delta_\theta}$ (10)


where:

$\varepsilon$ = wave steepness; and

$\delta_\theta$ = spectral directional width.

The instability condition is given by Eq. (11), and is used as a condition in the graphical model (Bitner-Gregersen, 2017):


$\sqrt{2}\,BFI > 1$ (11)


4.1.5 Directional spreading (DS) 

Onorato et al. (2002) showed that the probability of 
occurrence of rogue waves depends not only on BFI, 
but also on the directional spreading of the waves. 
Waseda et al. (2011) found evidence that the occurrence of rogue waves was associated with sea states with directional spreading of less than about 30°, suggesting that sea states with increased occurrence of rogue waves may occur in realistic ocean conditions. This has been the condition used in the model.


4.1.6 Nonlinear wave-current interaction (CI)

There is theoretical, experimental, and numerical evidence that in some situations the combined effect of wave nonlinearity and currents can lead to an increase in rogue wave occurrence (Nakicenovic et al., 2000). Janssen & Herbers (2009) first discovered that initially stable narrow banded wave fields could become unstable when the nonlinearity was increased due to linear focusing. Toffoli et al. (2015) experimentally showed that realistic random waves propagating in opposing currents could destabilize, with a resulting increase in the occurrence of rogue waves, even for waves with directional spread that normally obey near-Gaussian properties. The probability of a current opposing a wave field therefore depends on the angle between wave and current. The opposing condition is addressed by the model as the probability of the mean current direction ($\theta_c$) lying between $\theta + 90°$ and $\theta + 270°$, with a maximum when $\theta_c$ is equal to $\theta + 180°$, as shown in Figure 7.


4.2 Addressing climate change 


For the estimation of the climate change impact on the frequency of occurrence of a breaking rogue wave within the location proposed for an offshore wind farm, several assumptions have been made. It is accepted that there is a stochastic dependence between levels of CO2 in the atmosphere and the ocean wave climate. On the other hand, only CO2 is considered as a factor of climate change, although it is just one of the components of the greenhouse gas (GHG) group.

The projections of future climate change scenarios are based on the four marker scenarios (A1B, A2, B1 and B2) proposed by the Intergovernmental Panel on Climate Change (IPCC) for the twenty-first century (Quante & Colijn, 2016). Each emission scenario reflects different assumptions on future socioeconomic development. Scenario A2 is the worst, followed by A1B, B2 and B1.


Figure 7. Wave-current interaction.

Regarding the study area, Grabemann et al. (2015) analyzed a set of ten wave climate projections to estimate the possible impact of anthropogenic climate change on mean and extreme wave conditions in the North Sea. The projections were based on different IPCC emission scenarios and included different global and regional models starting from different initial conditions.

They found a robust pattern of increase in median and severe significant wave height in the eastern North Sea (parts of the southeastern North Sea and large parts of the Dutch, German, and Danish coasts up to the Skagerrak) towards the end of the twenty-first century, while a decreasing trend was detected in the western North Sea. However, the magnitude of this increase was much more uncertain and oscillates between about −10% and 15% relative to the reference $H_s$. These numbers are consistent with other relevant studies in the area, which put the increase at 6-8%, or up to 10% (Kharif et al., 2009).

Therefore, in the model the increase in wave height has been defined as:


$C = c_m \cdot x_{ij}$ (12)


where $c_m$ = maximum emission factor in decimal fraction; $x_{ij}$ = reduction factor for the emission scenario i during the decade j in decimal fraction; and $s_i$ = emission scenario.

Based on the abovementioned data, a value of 0.10 has been assigned to $c_m$. It corresponds to the value for the worst scenario (A2). The values of $x_{ij}$ for each scenario $s_i$ have been calculated based on the projections of the IPCC simulated with model AIM in the OECD region, as stated in Table 1.


Table 1. Emission reduction factors based on IPCC scenarios ($x_{ij}$).

          Scenario $s_i$
Decade    A2     A1B    B2     B1
1990      1      1      1      1
2000      1      1      1      1
2010      1      0.97   0.93   0.89
2020      1      0.91   0.86   0.81
2030      1      0.83   0.78   0.71
2040      1      0.77   0.73   0.64
2050      1      0.72   0.68   0.58
2060      1      0.64   0.58   0.48
2070      1      0.57   0.49   0.4
2080      1      0.49   0.41   0.31
2090      1      0.4    0.33   0.23
2100      1      0.32   0.26   0.16
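
A short sketch of how Table 1 could feed the climate term is given below. It assumes that the height increase is the maximum factor of 0.10 multiplied by the tabulated reduction factor for the chosen scenario and decade, and that the second column corresponds to scenario A1B; the data structure and function name are hypothetical.

```python
DECADES = list(range(1990, 2101, 10))  # 1990, 2000, ..., 2100

# Reduction factors from Table 1, one value per decade.
REDUCTION = {
    "A2":  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "A1B": [1, 1, 0.97, 0.91, 0.83, 0.77, 0.72, 0.64, 0.57, 0.49, 0.4, 0.32],
    "B2":  [1, 1, 0.93, 0.86, 0.78, 0.73, 0.68, 0.58, 0.49, 0.41, 0.33, 0.26],
    "B1":  [1, 1, 0.89, 0.81, 0.71, 0.64, 0.58, 0.48, 0.4, 0.31, 0.23, 0.16],
}


def climate_increase(scenario: str, decade: int, c_max: float = 0.10) -> float:
    """Height increase proportion C for a scenario and decade (assumed product form)."""
    if scenario not in REDUCTION:
        raise KeyError(f"unknown scenario: {scenario}")
    if decade not in DECADES:
        raise ValueError(f"decade must be one of {DECADES}")
    return c_max * REDUCTION[scenario][DECADES.index(decade)]


# Example: scenario A1B in the decade starting 2050.
print(climate_increase("A1B", 2050))
```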


Figure 8. Final version of the Bayesian Network.


The final structure of the proposed Bayesian 
Network is presented in Figure 8. 


5 DISCUSSION AND CONCLUSIONS 


In the current study, a method for assessing a com- 
plex risk associated to physical phenomena not 
fully understood has been presented. The high 
level of complexity results in a high number of 
uncertainties which necessarily must be faced by 
the risk analyst. The focus of this study has been 
on understanding the physics behind the rogue 
wave phenomenon and determining the conditions 
under which such waves can be expected to occur 
more frequently when considering specific tempo- 
ral and spatial ranges. Without the right outcome 
from this stage, the aim at creating a Bayesian Net- 
work would have been impossible. The role of the 
analyst in a decision-making process is to create 
a model as efficient as possible. This requirement 
includes its running speed, computational calcu- 
lation and memory requirements, maintenance 
effort, file size, the least amount of assumptions, 
and finally, the ability to communicate the risk and 
the utility for the decision makers. 

Due to the complexity of the risk analyzed, the
number of assumptions in the model is considerable,
but substantial effort has been made to manage those
assumptions by tracking them for awareness and future
improvement, and by defining the model with multiple
options for isolating and simulating only a subset of
the individual drivers.

Currently, the BBN is being tested under different
scenarios and limitations. Further conclusions will
follow from the forthcoming analysis of the results.
In the worst case, the method will serve as a learning
tool to understand the risk and its consequences in a
deeper way. It will also be used to perform
sensitivity analysis of the different drivers involved
in the critical event. The optimal implementation
would be reached when the model is used as a part of
the strategic decision-making process. However,
several limitations have already been detected. The
output interface of the BBN software (OpenBUGS)
complicates the presentation of results. There are
other products on the market that seem better suited
to sharing results with management in a visual way.
In addition, spatial considerations cannot be
addressed by the graphical model, i.e., the analysis
of the plunging distance and the location of a
potential breaking point of the wave in front of the
structure.
ABSTRACT: Predicting the occurrence of failures in power grids through specific outage risk predictors
is a primary concern for utilities nowadays. Wooden poles are core items to focus on in this process.
Millions of them are used worldwide and they are all subject to the risk of crack formation. Analyzing
the evolution of pole cracks is particularly relevant in reliability analyses of power grids for two
main reasons. First, the cracks might highlight previously unconsidered or changing factors, such as
unusual local weather conditions (e.g. overload of ice and/or wind). Second, as cracks provide access
for external threats (e.g. humidity, fungi, insects) to potentially non-treated internal parts of the
poles, they might in turn accelerate the occurrence of further failures. Evaluating the role of crack
formation is thus essential for estimating the risk of outages in power grids. As climatic variations
are known to be among the most influential factors in the initiation and propagation of cracks in
wooden poles, we address this topic by suggesting a method combining open-access weather-data sources
with information provided by new technologies, such as drones. We first highlight the influence of
climatic factors on the reliability of wooden poles by reviewing studies describing the physical
properties of wood. We then focus our research on a Norwegian case study and show how we can combine
up to 60 years of meteorological information with the information provided by 17,352 geo-localized
aerial pictures of cracked and non-cracked wooden utility poles. We finally discuss how an indicator
constructed on this combination can be used to predict the formation of cracks and optimize the
allocation of decision-maker resources for inspection procedures.


1 INTRODUCTION


The modernization of society has led to a global
increase in power consumption over the last 50 years
(Refsnæs, Rolfseng, Solvang, & Heggset, 2006; Shiu &
Lam, 2004; Yoo & Kwak, 2010). As numerous businesses,
public infrastructures and private households rely on
the provision of power for their daily tasks, there is
a need for companies in charge of the power supply to
maximize their capacity and reliability in delivering
power.

Predicting outage risks and avoiding downtime is
crucial to ensure customer satisfaction. Moreover,
anticipating unwanted events directly enables power
utilities to significantly reduce losses and costs.
Finally, it also enables them to optimize resource
allocations for the inspection of their
infrastructures after natural disasters (e.g. storms,
flooding) or during scheduled maintenance procedures.

Ensuring this quality of service requires utilities to
use reliable components, from the power source,
through the transmission lines and to the consumption
nodes. Wooden poles are widely used for the
distribution part of the power grid (from regional
substations to local substations and from local
substations to end-users) (Eurelectric, 2010).
Figure 1. Outline of the transmission and distribution of power in a power grid, going from the
production sites to the consumption nodes. Adapted from U.S.-Canada Power System Outage Task Force (2004).


Identifying the principal factors responsible for the
appearance of cracks in wooden poles is thus a main
objective for predicting their failures. For this
purpose, we suggest a method for evaluating the
effects of potential predictors. The contribution
identifies the way forward for this research topic and
presents preliminary findings, which form the basis
for future research.

The rest of the paper is structured as follows.
Section 2 provides an overview of wooden pole
characteristics and failures. Section 3 reviews
various studies summarizing the main properties of
wood at the microscopic level. On this basis, it
highlights the influence climatic variations can have
on the physical structure of wooden poles. It
furthermore shows how these variations can affect the
reliability of the pole and thus of the transmission
line. Section 4 describes the strategy applied to
estimate a crack-appearance likelihood using a
Norwegian case study. It explains the choices made in
the selection of the different datasets and the
methods used to acquire them. Section 5 discusses the
pros and cons of the method used and briefly describes
plans for future research. The last section concludes
our work by summarizing it and suggesting additional
research possibilities.


2 WOODEN POLES CHARACTERISTICS 
AND FAILURES 


Figure 1 shows schematically how power is delivered
from a generating station, through transmission and
distribution lines (respectively maintained by
Transmission System Operators (TSO) and Distribution
System Operators (DSO)), to different categories of
end customers. Wooden utility poles used in the power
grid exist in different shapes and configurations,
depending on the physical requirements of the power
lines, on the geographical conformation of their
location, and on their position in the transmission or
distribution line (see Figures 2 to 4 as
illustrations).


Figure 2. First example of the shape of a wooden
utility pole.

Despite the variety of the existing shapes and
configurations, the number of basic elements composing
a utility pole is relatively limited. A wooden utility
pole is generally composed of one or more wooden
poles, one or more cross-arms and multiple insulators
that connect the electrical cables to the pole.
Figure 5 shows this assembly schematically.

Using wooden utility poles has multiple advantages in
comparison to concrete or steel utility poles (Bolin &
Smith, 2011; SEMCO, 1992; Stewart, 1996):
Figure 3. Second example of the shape of a wooden
utility pole.


Figure 4. Third example of the shape of a wooden util- 
ity pole. 


Figure 5. Basic components of a wooden utility pole: 
poles (brown), cross-arm (grey) and insulators (green) 
(Refsnæs, 2008). 


— They are lighter and easier to transport on 
mountainous terrain.

— They do not require earthing, which makes them 
advantageous when lightning strikes.

— They are easy to produce in wooded areas (e.g. 
Canada, Norway). 

— They generally have a reduced environmental 
impact. 

— They have interesting lifetimes, possibly going 
up to 75 years in favorable conditions. 


Identifying the main threats to wooden utility poles
makes it possible to look for the root causes of
failures. This in turn allows utilities to estimate
the effective remaining lifetime of the poles and to
optimize their replacement before any outage occurs.

In their review of power line inspection procedures,
Nguyen, Jenssen, & Roverso (2018) summarize some of
the most common faults of power line components. They
identify the appearance of cracks in wooden poles as
one of the main failures to identify during visual
inspection procedures. An additional review of the
literature shows that there is a need for inspection
protocols that make it possible to recognize and
assess cracks in timber structures in general (Dubois,
Chazal, & Petit, 2002; Riahi, Moutou Pitti, Dubois, &
Chateauneuf, 2016) and in wooden poles in particular
(Morrell, 2012).

Identifying cracks is fundamental for two main 
reasons: 


— First, as “stresses perpendicular to grain induce 
cracks which propagate longitudinally” (Coureau &
Morel, 2005), we can consider multiple occurrences of
significant cracks as indicators of the presence of
stress factors. This can for example suggest the
existence of a localized area subject to harsher
weather conditions (e.g. overload of ice and/or wind)
(Wong & Miller, 2010) and prompt a deeper analysis of
the region concerned.

— Second, as cracks provide an access for external 
threats (e.g. fungi, insects, humidity) to potentially
non-treated internal parts of the poles, their
existence might accelerate the onset of decay
(Morrell, 2012; Refsnæs et al., 2006; SEMCO, 1992).
This permanently alters the structural resistance of
the pole and considerably increases its probability of
failure.


3 WOOD PROPERTIES AND POTENTIAL
INFLUENCE OF CLIMATIC
VARIATIONS ON CRACK FORMATION


The theory of fracture mechanics has mainly been
developed since the first half of the 20th century.
Initiated by A.A. Griffith in 1920 (Griffith, 1921),
it was then popularized by G.R. Irwin in 1958 (Irwin,
1958) and has since been widely used to analyze the
origins and consequences of crack formation in
physical objects. Focusing on the microscopic level,
it provides models describing the “mechanical behavior
of cracked materials subjected to applied load”
(Perez, 2017).

Multiple studies use this theory as a basis for the
evaluation of crack growth in wooden structures
(Barrett, Haigh, & Lovegrove, 1981; Coureau & Morel,
2005; Dubois et al., 2002; Riahi et al., 2016). The
structure is first characterized at the microscopic
level to understand how wood behaves when it is
subject to a modification of its external environment
(load variation, climatic variation, etc.). Figure 6
shows the microscopic structure of a typical softwood.
It highlights the anisotropic character of wood and
intuitively shows that cracks are more likely to occur
parallel to the direction of growth of a tree
(longitudinal direction).

Furthermore, wood being a viscoelastic material, its
physical properties (e.g. modulus of elasticity,
volume) are directly influenced by its environment.
This is due to the hygroscopic behavior of wood (i.e.
its tendency to absorb humidity) and implies that the
physical properties of wood are highly sensitive to
the meteorological conditions of its surroundings
(especially temperature and humidity) (Chaplain &
Valentin, 2010; Hamdi, Moutou Pitti, & Saifouni, 2017;
Lamy, 2016; Morrell, 2012; Refsnæs et al., 2006;
Saifouni, 2014; Thybring, Lindegaard, & Morsing, 2009).

Because of the former functions of their cells during
their living period and because of the variations in
their environment during their growth, the mechanical
properties of timber-based structures can furthermore
be locally modified. This includes structural
modifications due to natural defects such as knots,
rotten knot holes or cracks due to freezing sap.
Combined with the application of external loads (e.g.
wind, ice on the wires in the case of wooden poles)
and the modification of the internal structure due to
temperature and humidity variations, this creates
fertile ground for the formation of cracks.


Figure 6. Typical softwood structure showing
orientation of longitudinal (l), radial (r) and
tangential (t) directions (Barrett et al., 1981).


4 DATA ACQUISITION AND PREDICTION 
METHODS 


Utility companies in Norway use over 3.5 million
wooden poles in their power grids to support over
25,400 km of electrical overhead lines (Eurelectric,
2010; Refsnæs et al., 2006). The Norwegian IT company
eSmart Systems¹ specializes in digital intelligence
and uses artificial intelligence to support Statnett,
Norway’s TSO, as well as some of the main Norwegian
DSOs (e.g., Lyse Elnett, Ringeriks-Kraft Nett, Troms
Kraft Nett, Hafslund Nett). In particular, the
algorithms used by eSmart Systems automatically
identify specific objects and recognize pre-defined
faults, such as cracks on wooden poles (see Figure 7
as an illustration). This enabled us to access a
database of 17,352 geo-localized aerial pictures of
wooden utility poles, of which 5383 are classified as
cracked.

In most cases, two to three pictures of the same
utility pole were taken from different angles. This
was done to obtain accurate information for each of
the observed poles without missing details hidden from
a single viewpoint. We merged this information with
the exact geographical coordinates of the electric
poles, made available by the Norwegian Water Resources
and Energy Directorate (NVE)². We could thus analyze a
dataset of 7653 geo-localized wooden utility poles,
each classified as cracked or not.
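
The step from picture-level labels to one label per pole can be sketched as follows. This is only an illustration: the rule that a pole counts as cracked as soon as any of its pictures shows a crack is an assumption, as are the function name `aggregate_pictures` and the sample records.

```python
from collections import defaultdict


def aggregate_pictures(pictures):
    """Collapse picture-level labels to one label per pole.

    `pictures` is an iterable of (pole_id, is_cracked) pairs.
    ASSUMED rule: a pole is cracked if ANY of its pictures shows a crack.
    """
    poles = defaultdict(bool)  # unseen poles start as "not cracked"
    for pole_id, is_cracked in pictures:
        poles[pole_id] = poles[pole_id] or is_cracked
    return dict(poles)


# Hypothetical picture-level records (not taken from the real database).
pictures = [
    ("pole_a", False), ("pole_a", True), ("pole_a", False),
    ("pole_b", False), ("pole_b", False),
]
pole_labels = aggregate_pictures(pictures)
```

A stricter rule (for example a majority vote over the pictures of a pole) would only change the line that updates `poles[pole_id]`.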


Figure 7. Wooden pole where a crack has been local- 
ized on the mast (see rectangle). 


1. eSmart Systems: www.esmartsystems.com. 
2. NVE: www.nve.no. 
Figure 8. From left to right: main axes of the Norwegian electrical grid⁶; map of the precipitation in
Norway on the 1st of August 2017⁷; map of the temperatures in Norway on the 1st of August 2017⁷.


In parallel, seNorge³ (created in collaboration
between the NVE, the Norwegian Meteorological
Institute⁴ and the Norwegian Mapping Authority⁵) gives
access to daily observed (or interpolated) records of
the climatic conditions in Norway. In particular, it
provides temperature and precipitation measurements
going as far back as 1957.

Figure 8 illustrates the type of information made
available by NVE and seNorge. Using the zoom feature
of the websites, it is possible to move from a
national overview down to a specific geo-localized
point (in our case, the location of the wooden utility
poles).

Different approaches are considered in our work. The
purpose is to create an indicator of the likelihood of
crack formation on wooden poles.

In order to benefit from the high granularity offered
by the web services used, we plan to use daily records
of temperature and precipitation as potential
predictors for a binary classification problem
(labeling as cracked or not-cracked). Predictive
features can be designed that summarize the daily
weather data at different granularities and extract
relevant indicators that correlate with crack
appearance. As an extreme reduction, we can for
example summarize the intensity of the meteorological
variation at a localized point into, e.g., a
temperature coefficient and a precipitation
coefficient. This would lead to a method using only
two predictors for this classification problem.


3. seNorge: www.senorge.no.
4. Norwegian Meteorological Institute: www.met.no.
5. Norwegian Mapping Authority: www.kartverket.no.
6. https://temakart.nve.no/link/?link=nettanlegg.
7. http://www.senorge.no/index.html?p=senorgeny&st=weather.

Equation (1) provides an example of the type of 
coefficient c that can be used when focusing on a 
specific pole. 


(1) 


where n is the number of daily records since the
installation of the wooden pole observed; i is the
enumeration index; X_i is the value of the
meteorological phenomenon observed at the specified
location on day i (here in millimeters or in degrees
Celsius); X_{i-1} is the record of the same phenomenon
at the same location on the previous day; and X_max
(resp. X_min) is the maximum (resp. minimum) value of
the observed phenomenon recorded over the entire
observation period at the specified location.
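
As an illustration of how such a coefficient could be computed, the sketch below uses one form that is consistent with the symbols just defined: the mean absolute day-to-day change, normalized by the observed range X_max − X_min. This form is an assumption and is not claimed to be the paper's Equation (1); the function name `variation_coefficient` and the sample values are hypothetical.

```python
def variation_coefficient(records):
    """Mean absolute day-to-day change, scaled by the observed range.

    ASSUMED form (not necessarily the paper's Equation (1)):
        c = (1 / n) * sum_{i=2..n} |X_i - X_{i-1}| / (X_max - X_min)
    """
    n = len(records)
    if n < 2:
        raise ValueError("need at least two daily records")
    x_max, x_min = max(records), min(records)
    if x_max == x_min:
        return 0.0  # no variation at all over the observation period
    total_change = sum(abs(b - a) for a, b in zip(records, records[1:]))
    return total_change / (n * (x_max - x_min))


# Hypothetical daily temperatures in degrees Celsius.
daily_temperature = [2.0, 4.0, 1.0, 1.0, 5.0]
c_temperature = variation_coefficient(daily_temperature)
```

Computing one such value for temperature and one for precipitation gives the two predictors per pole mentioned in the text.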

Alternatively, predictive features can be learned
automatically from the raw temperature and
precipitation time series using deep learning
techniques. Such techniques, which belong to the class
of artificial intelligence methods (and more
specifically, to the class of machine learning
methods), are based on repeated analyses of the data
over time and/or over space, from which they identify
and highlight, step by step, the most relevant
characteristics.

High temperatures favor the proliferation of fungi,
which weakens the structure of the wood. Furthermore,
high humidity levels over extended periods might
soften the wood and make it more sensitive to sudden
external loads (e.g. wind or freezing rain). Finally,
the intrinsic properties of wood lead it to easily
accommodate slow variations of external loads and
environmental conditions but make it particularly
sensitive to sudden variations. These approaches will
thus enable us to identify meteorological patterns
favoring the formation of cracks, as well as specific
regions where the likelihood of crack formation is
higher.

An increase in the period of exposure to external
factors leads to a rise in the probability of crack
formation. This implies that the age of the poles
plays a major role in the suggested methods. However,
part of this information might be missing. In such a
case, we could assume a generic date of installation
based on the period of installation of the power line
in the observed region.


5 DISCUSSION 


The suggested methods make it possible to evaluate the
role that temperature variations and precipitation
have in the formation of cracks on wooden poles. These
methods have the advantage of being flexible and easy
to extend when additional data sources become
accessible, such as daily records of wind intensity
and direction, humidity variations, cloud cover, etc.
They are nevertheless highly dependent on two main
factors:


— First, the initial classification of the poles as 
cracked or not. This is an important topic as the size
of a crack directly affects its detection by the
algorithm used to classify the poles. There is thus a
need for utility companies to define what should be
considered a problematic crack.

— Second, the information initially available on 
the poles themselves (e.g. age, maintenance tasks 
carried out). This information might be difficult to
access because it was not necessarily well recorded in
the first phases of the grid installation.


Despite using relatively simple techniques and 
being highly dependent on initial parameters, the 
proposed methods represent a first approach in 
the analysis and handling of cracks in wooden 
poles. This information may in turn be useful for 
decision makers in the prioritization of additional 
inspection procedures and future maintenance 
tasks. 

It should be noted that our paper only highlights
preliminary results of ongoing research, as the
described methods have not yet been fully applied.
Further work will thus focus on the extensive
application and validation of these approaches and
provide an in-depth analysis of the phenomenon of
crack formation on wooden poles by using additional
real data from the Norwegian network.


6 CONCLUSION 


Our paper highlighted the importance for utilities of
early detection and analysis of cracks on wooden
poles. We summarized how environmental conditions can
directly affect the physical properties of wood and
thus favor or limit the formation of cracks on wooden
poles. In order to better understand and predict their
occurrence, we then suggested two approaches using
pre-classified and geo-localized aerial pictures of
cracked and non-cracked poles in combination with up
to 60 years of meteorological measurements. Further,
we saw that, despite being highly dependent on initial
information, our approach might provide useful
information for the generation of maintenance
policies. This approach might finally be a good
starting point for researchers wanting to combine
fields of expertise such as the structural study of
wood at the microscopic level and crack detection
methods using image analysis.
ABSTRACT: A risk-based approach has been demonstrated to be capable of evaluating the quality of
permanently plugged and abandoned petroleum wells. The quality measure in this context is the leakage 
risk, expressed in terms of probability of barrier failure and the leakage rate (consequence), where associ- 
ated uncertainties are dealt with by means of probability distributions. In complex engineered systems, 
such as a barrier system in a permanently plugged and abandoned oil or gas well, it is reasonable to 
question whether the probability distributions provide adequate representations of the uncertainties. To 
improve the risk-based approach for plug and abandonment, and to contribute to more informed deci- 
sions by better reflecting the uncertainties upon evaluation of the leakage risk, in this paper, we propose 
an approach to assess the assumptions made and reflect the strength of knowledge, to complement the 
probability distributions. An example is included to illustrate the approach. 


1 INTRODUCTION 


The final phase in the life cycle of a petroleum 
well is permanent plug and abandonment, in order 
to e.g. prevent hydrocarbon leakage to the sur- 
face and pressure breakdown of the formations 
(Liversidge et al., 2006). On the Norwegian con- 
tinental shelf (NCS), a significant number of off- 
shore oil and gas wells will be entering their final 
phase in the coming decades and need to be per- 
manently plugged and abandoned. Plug and aban- 
donment (P&A) designs on the NCS are governed 
by the requirements and guidelines in NORSOK 
Standard D-010 (Standards Norway, 2013). In line 
with these requirements, P&A operations are con- 
sidered prescriptive “one-size-fits-all” approaches 
(Arild et al., 2017). A criticism of the prescriptive 
approach is that it neglects the well-specific char- 
acteristics, which can differ from well to well, such 
as geological formations, reservoir qualities, well- 
bore schematics and surrounding marine environ- 
ments, and is thus not cost-effective. 

A quantitative risk-based approach for evaluating the
containment performance of permanently plugged and
abandoned oil and gas wells has been established as an
alternative to the prescriptive approach (Arild et
al., 2017). The risk-based approach is considered a
“fit-for-purpose” alternative, which incorporates the
well-specific characteristics. Here, the quality of
the plugged and abandoned wells is measured by the
leakage risk, which is expressed in terms of the
probability of barrier failure within a given time
period and the associated leakage rate at the seabed.
The risk-based approach, demonstrated by Arild et al.
(2017), builds on the leakage calculator presented by
Ford et al. (2017) and can be seen as a risk
assessment tool applicable in the recommended practice
of “fit-for-purpose” well abandonment assessment
proposed by Buchmiller et al. (2016).

In order to evaluate and assess the leakage risk 
for a permanently plugged and abandoned well, 
the risk-based approach identifies potential fail- 
ure modes of each barrier element in the well, and 
assesses the failure probabilities and consequences. 
Thus, leakage risk can be considered as the
two-dimensional combination of probability and
consequence, a perspective which is widely criticized
in the literature as being too narrow (see e.g. Aven
(2014), p. 28).

The workflow is described in detail by Arild et al.
(2017). Simply stated, the assessment of failure
probability is performed by means of Bayesian
analysis, to create a lifetime distribution of the
well. Assessment of the consequences is done by
calculating the leakage rates through identified
leakage pathways, to provide a distribution of overall
leakage rates at the seabed. For both assessments,
uncertainties are dealt with by means of probability
distributions (propagated through the use of Monte
Carlo simulations).

A risk-based approach for evaluating the P&A 
of oil and gas wells should provide a broad, 
informative and balanced description of the leak- 
age risk, as a means to support decision-making. 
To achieve this, proper treatment and commu- 
nication of uncertainties are necessities (Flage & 
Aven, 2009). Looking at the current practice in the 
risk-based approach for evaluation of P&A, where 
uncertainties are treated by probability distribu- 
tions, it is reasonable to ask whether all relevant 
uncertainties are fully reflected. Probability dis- 
tributions are based on assumptions which may 
be more or less reasonable, and knowledge which 
may be strong or weak (or in-between). These two 
aspects are somewhat ignored, or not adequately 
reflected, by the probability distributions. 

The present paper intends to improve the risk- 
based approach by implementing a semi-quantitative 
assessment of the uncertainties, in addition to the 
probability distributions, to incorporate the aspects 
mentioned above. The improved approach goes 
beyond probabilities to assess and express uncertain- 
ties, by focusing on the knowledge, which forms the 
basis for the assessment. A strength-of-knowledge 
categorization, in line with the scoring used by Flage 
& Aven (2009), will be the starting point of the semi- 
quantitative analysis to reveal uncertainties hid- 
den in the background knowledge. In addition, the 
concept of assumption deviation risk, introduced 
by Aven (2013), will be applied as a tool to reveal 
the effect of potential deviations in the assumptions. 
Assumption deviation risk assessment has recently been
demonstrated to be a useful tool for highlighting
critical assumptions (see e.g., Berner & Flage (2016),
Khorsandi & Aven (2017)).
The improved approach for risk-based P&A intends 
to provide a more complete risk description, such 
that the decision-makers have greater decision sup- 
port when reasoning and deliberating upon the deci- 
sions to be made (Aven & Zio, 2011). 

The remainder of this paper is organized as 
follows: In Section 2, we briefly introduce the 
probability and consequence assessments in the 
risk-based approach, respectively. In Section 3, 
we discuss the weaknesses associated with the 
current practice in the risk-based approach. Sec- 
tion 4 presents the suggested improvements. Sec- 
tion 5 illustrates the improved approach through 
an example, while Section 6 discusses and evalu- 
ates the improved approach. Section 7 concludes. 


2 INTRODUCING THE CURRENT RISK- 
BASED APPROACH 


The current risk-based approach for evaluating the 
containment performance of plugged and aban- 
doned oil or gas wells is based on the assessments of 
probability of failures and consequences. Here, we 
briefly introduce the two assessments. See Arild et al. 
(2017) and the references therein for further details. 


2.1 Probability assessment 


A key quantity of interest when evaluating the qual- 
ity of a P&A design is the lifetime of the well, i.e. 
reflecting how long it will take before the barrier 
system starts to leak (Arild et al., 2017). Commonly 
used lifetime distributions are the exponential and 
Weibull distributions (Singpurwalla, 2006). In 
order to establish a lifetime distribution, historical 
data are often used as a basis to perform statistical 
inference, such as maximum likelihood estimation 
(MLE), to obtain estimates of the parameters which 
best explain the distribution of the observed data. 
To be applicable in estimating the parameters 
of a lifetime distribution, traditional methods such 
as MLE require that some failures have occurred 
(Singpurwalla, 2006). On the NCS, however, none 
of the 334 wells which are assumed plugged and 
abandoned are said to have failed since they were 
abandoned; they are considered censored observa- 
tions (Arild et al., 2017). One approach which is 
deemed feasible when working with completely cen- 
sored data is Bayesian analysis (Singpurwalla, 2006). 
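
To see why MLE breaks down in this situation, consider as an illustrative assumption (not necessarily the model used by Arild et al.) an exponential lifetime with failure rate λ, observed on n wells with exposure times t_i, of which d have failed:

```latex
L(\lambda) = \lambda^{d} \exp\!\Big(-\lambda \sum_{i=1}^{n} t_i\Big),
\qquad
\hat{\lambda}_{\mathrm{MLE}} = \frac{d}{\sum_{i=1}^{n} t_i}
```

With completely censored data, d = 0, so the likelihood reduces to exp(−λ Σ t_i), which is maximized at λ = 0. The estimate then states that the wells never fail, which is of no use for predicting lifetimes.
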
In Bayesian analysis, the objective is to use 
known quantities along with a specified parametric 
expression to make inferences about the unknown 
quantities or parameters (Singpurwalla, 2006). 
Since the parameters are unknown, we assign prior
distributions to the parameters to reflect our lack of
knowledge about the parameter values before observing
any evidence. After some data are observed, Bayes’
formula, expressed in terms of probability
distributions, is used to update the prior beliefs
into posterior distributions of the parameters. The
quantity of interest in the probability assessment is
the lifetime of the well and not the parameters
estimated by Bayes’ formula. By drawing parameter
values from the posterior distribution, a posterior
predictive distribution of the lifetimes can be
generated, which is a distribution of as-yet
unobserved lifetimes (predictions), conditional on the
observed data.
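
The procedure can be sketched for the simplest case. The block below assumes an exponential lifetime with a Gamma prior on the failure rate, which is a conjugate pair chosen here for illustration only; the prior parameters, the survival times and all function names are hypothetical and are not taken from Arild et al. (2017).

```python
import random


def posterior_gamma(prior_shape, prior_rate, censored_times):
    """Gamma posterior for an exponential failure rate.

    With right-censored observations only, the likelihood is
    exp(-rate * sum(times)), so only the Gamma rate parameter changes.
    """
    return prior_shape, prior_rate + sum(censored_times)


def predictive_survival(shape, rate, t):
    """Closed-form posterior predictive P(T > t) (a Lomax distribution)."""
    return (rate / (rate + t)) ** shape


def sample_lifetimes(shape, rate, size, seed=0):
    """Draw lifetimes by first drawing a failure rate from the posterior."""
    rng = random.Random(seed)
    lifetimes = []
    for _ in range(size):
        # random.gammavariate takes (shape, scale), and scale = 1 / rate
        failure_rate = rng.gammavariate(shape, 1.0 / rate)
        lifetimes.append(rng.expovariate(failure_rate))
    return lifetimes


# Hypothetical inputs: a Gamma(1, 100) prior and five wells that have
# survived 10 to 30 years without a detected failure.
shape, rate = posterior_gamma(1.0, 100.0, [10, 15, 20, 25, 30])
p_survive_50 = predictive_survival(shape, rate, 50.0)
lifetimes = sample_lifetimes(shape, rate, size=20000)
empirical = sum(t > 50.0 for t in lifetimes) / len(lifetimes)
```

The sampled fraction `empirical` should agree with the closed-form value `p_survive_50` up to Monte Carlo error, which is a convenient check on the sampling step. For a Weibull lifetime no such closed form exists in general and the posterior would have to be sampled numerically.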


2.2 Assessing the consequences of a failure 


If a well barrier fails, leakage of hydrocarbons to
the seabed can occur. Ford et al. (2017) have
developed a simple leakage calculator to assess the
leakage potential from a well barrier system. This
calculator estimates the cumulative leakage rate
through potential leakage pathways for each well
barrier. We refer to Ford et al. (2017) for detailed
explanations of the leakage calculations. The leakage
pathways considered in this paper are: leakage through
bulk cement given by Darcy’s law (e.g. Godøy et al.
(2015)); leakage through fractures or cracks in the
cement (e.g. Sarkar et al. (2004)); and leakage
through the micro-annuli (e.g. Aas et al. (2016)).
These pathways are represented by the following
equations, respectively:


$$Q = \frac{kA}{\mu L}\left(\Delta P - \rho g L \cos\theta\right) \tag{1}$$

$$Q = \frac{h^{3} W \cos\alpha}{12\mu}\left(\frac{\Delta P}{L}\right) \tag{2}$$

$$Q = \frac{\pi R_c \,\Delta P}{12\mu L}\,\delta R^{3} \tag{3}$$


where Q = flow rate; k = cement permeability;
A = cross-section of the cement plug and/or annulus;
μ = reservoir fluid viscosity; L = length of the plug
or annular cement; ΔP = pressure difference over the
cement plug and/or annuli; ρ = density of the
reservoir fluid; g = gravitational acceleration;
θ = inclination of the well at the depth of the plug;
h = fracture aperture; α = orientation of the
fracture; W = fracture width; R_c = casing diameter;
and δR = micro-annuli gap.
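
A minimal sketch of the bulk-cement pathway follows, assuming Darcy's law with a hydrostatic correction, Q = kA/(μL) · (ΔP − ρ g L cos θ), in SI units. It is not the leakage calculator of Ford et al. (2017): the function names, the input values and the lognormal spread assumed for the permeability are all hypothetical.

```python
import math
import random


def darcy_leakage_rate(k, area, mu, length, delta_p, rho, theta, g=9.81):
    """Leakage rate through bulk cement in m^3/s (SI units throughout).

    Darcy's law with a hydrostatic correction:
        Q = k * A / (mu * L) * (dP - rho * g * L * cos(theta))
    """
    driving_pressure = delta_p - rho * g * length * math.cos(theta)
    return k * area / (mu * length) * driving_pressure


def monte_carlo_leakage(n_samples, seed=0):
    """Propagate an ASSUMED lognormal permeability through Darcy's law."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_samples):
        # Hypothetical uncertainty: median 1e-17 m^2, log-std 1.0
        k = rng.lognormvariate(math.log(1e-17), 1.0)
        rates.append(darcy_leakage_rate(k, 0.03, 1e-3, 50.0, 1e7, 800.0, 0.0))
    return sorted(rates)


# Point estimate for a 50 m vertical plug (hypothetical inputs).
q_point = darcy_leakage_rate(1e-17, 0.03, 1e-3, 50.0, 1e7, 800.0, 0.0)
rates = monte_carlo_leakage(5000)
q_median = rates[len(rates) // 2]
```

The sorted list `rates` is an empirical distribution of leakage rates, from which percentiles can be read directly. The fracture and micro-annuli pathways could be added as further functions and summed per sample.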


3 ISSUES WITH THE CURRENT 
RISK-BASED APPROACH 


In the risk-based approach, as in any risk assessment,
we make a number of assumptions that are more or less
explicitly stated. A single assumption may be
formulated as X = x_0, where the value x_0 is fixed,
such as an increasing failure rate (e.g. a shape
parameter β fixed at x_0 = 1.5 in a Weibull function).
Since the result of the leakage risk assessment is
conditional on the assumptions made being true, it is
important to understand the uncertainties related to
these assumptions. We can classify the assumptions as
either probability-influencing assumptions or
consequence-influencing assumptions (which is a broad
interpretation of the three areas of assumptions
listed by Khorsandi & Aven (2017)).
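
The link between a Weibull shape parameter of 1.5 and an increasing failure rate can be made explicit. With shape β and scale η (the symbol η is introduced here for illustration), the Weibull failure rate is:

```latex
h(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta - 1},
\qquad
\beta = 1.5 \;\Rightarrow\; h(t) \propto t^{0.5}
```

The failure rate increases with time whenever β > 1, is constant for β = 1 (the exponential case) and decreases for β < 1, so fixing β = 1.5 encodes the assumption that the barrier system degrades with age.
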


3.1 Weaknesses with the probability assessment 


By virtue of Bayes’ theorem, the posterior predictive 
distribution is based on the posterior distribution 
of some parameters θ. The true values of such 
parameters are unknown. By using an alternative 
prior distribution, or probability distribution, a 
different posterior predictive distribution would 
be generated. Paramount for the assessment is 
to understand the basis for the resulting predic- 
tive distribution. Other technical difficulties with 
Bayesian analysis are summarized by e.g. Ferson 
(2005). 

The historical data extracted from the Norwe- 
gian Petroleum Directorate (2017) database can 
be questioned, regarding the validity and value of 
information with respect to when an abandoned 
well will fail. Censored data, as we have here, imply 
that we have not detected any failures (leakages) 
since the wells were abandoned. For a leakage to 
be detected, a failure must have taken place. In that 
sense, it is reasonable to assume that no detected 
leakages imply zero failures. A failure, on the con- 
trary, does not automatically imply a detectable 
leakage. So the true survival times of the data are 
actually unknown. 

The data contain survival times of abandoned 
wells, designed according to NORSOK Standard 
D-010. As the risk-based approach intends to 
incorporate well-specific characteristics, such as 
flow potential, it is contradictory to use all avail- 
able data as a basis for the probability assess- 
ment. Further investigation of the data shows that 
observations differ with respect to well-specific 
properties, such as geology and flow potential. It 
is reasonable to question whether a survival time 
from a well in a reservoir with, say, limited flow 
potential is relevant when evaluating a P&A design 
for a well in a reservoir with high flow potential. 
The risk-based approach aims to justify alternative 
P&A designs, which are “fit-for-purpose”. Utiliz- 
ing the whole sample when evaluating the leakage 
risk is therefore a radical assumption, as it takes 
into account censored lifetimes from, most likely, 
stricter P&A designs. 


3.2 Weaknesses with the consequence assessment 


Challenges with the leakage rate assessment relate 
to uncertain input values, as the models (Equa- 
tions 1-3) are established on strong knowledge, 
according to criteria in e.g. Aven (2014), p. 139. 
There are two categories of inputs in the con- 
sequence approach: (1) uncertain inputs that can 
be described by probability distributions, and (2) 
known inputs related to design variables, scenar- 
ios or low importance inputs (Arild et al., 2017). 
The former is the focus of the present paper. With 
respect to Equations 1-3, Table 1 summarizes the 
uncertain inputs of interest. Both the values and 
distributions (here, a triangular distribution) given 
in the second column are assumed. These values are 
difficult to measure exactly, and they impose uncertainties. 
The degree of uncertainty is case-specific 
and subject to judgment by the assessor (Flage 
et al., 2013). 

Table 1. Uncertain parameters (Arild et al., 2017). 

Uncertain parameter        Values (min, most likely, max) 
Cement permeability        0.1, 0.5, 5.0 μD 
Micro-annuli gap           3, 20, 70 μm 
Fracture aperture          10, 50, 200 μm 
Inclination of fracture    0, 30, 70° 
Fracture width             1, 2, 3 mm 
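
The triangular distributions in Table 1 can be sampled directly, which is one way to propagate the uncertain inputs into percentiles of a derived quantity. The sketch below uses only the Python standard library; the dictionary keys, function names and sample size are illustrative choices, not part of the original assessment.

```python
import random
import statistics

# (min, most likely, max) taken from Table 1; units are encoded in the key names
TABLE_1 = {
    "cement_permeability_uD": (0.1, 0.5, 5.0),
    "micro_annuli_gap_um": (3.0, 20.0, 70.0),
    "fracture_aperture_um": (10.0, 50.0, 200.0),
    "fracture_inclination_deg": (0.0, 30.0, 70.0),
    "fracture_width_mm": (1.0, 2.0, 3.0),
}

def sample_inputs(n, seed=0):
    """Draw n samples per parameter from its triangular distribution."""
    rng = random.Random(seed)
    return {
        name: [rng.triangular(lo, hi, mode) for _ in range(n)]
        for name, (lo, mode, hi) in TABLE_1.items()
    }

def p10_p50_p90(values):
    """Return the 10th, 50th and 90th percentiles of a sample."""
    deciles = statistics.quantiles(values, n=10)
    return deciles[0], deciles[4], deciles[8]

if __name__ == "__main__":
    samples = sample_inputs(5000)
    for name, values in samples.items():
        print(name, p10_p50_p90(values))
```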


4 DISCUSSION OF SOME POSSIBLE 
IMPROVEMENTS FOR THE 
RISK-BASED APPROACH 


The common denominator for the aspects dis- 
cussed in the previous section is lack of knowl- 
edge when quantitatively assessing the leakage 
risk. A limited amount of information is available 
on the subject matter, and determining whether 
the parametric functions and input parameters 
are appropriate is challenging. Decision-making 
under such conditions is also challenging. Not 
all uncertainties can be fully expressed or trans- 
ferred into quantitative formats. The background 
knowledge, on which the probabilities of failure 
and leakage rates are based, can hide uncertainty. 
The assumptions made are known to potentially 
deviate in reality, and any deviation that could 
affect the risk picture needs to be highlighted and 
assessed. We believe the following approaches can 
improve the treatment of uncertainty in the risk- 
based approach: (1) crude strength-of-knowledge 
assessment (SoK) and (2) assumption deviation 
risk assessment. 


4.1 Crude strength of knowledge assessment 


In the probability and consequence assessments, 
uncertain inputs are used as the basis for the 
assessments. Subjective probabilities (distribu- 
tions) are assigned to these inputs, expressing our 
degree of belief about the values of these param- 
eters, conditional on our background knowledge. 
To reflect the strength of this background knowl- 
edge, a crude SoK assessment is recommended. A 
categorization in line with the scoring by Flage & 
Aven (2009) is used as a basis for an assessment of 
the SoK. Here, the background knowledge is cat- 
egorized as strong if all the following criteria are 
met (Flage & Aven, 2009): 


— The assumptions made are seen as very reason- 
able (s1) 
— A large amount of reliable data is available (s2) 


— There is broad consensus among experts (s3) 
— The phenomena involved are well understood 
(s4). 


If, on the other hand, at least one of the follow- 
ing criteria is true, the background knowledge is 
classified as weak (Flage & Aven, 2009): 


— The assumptions made are strong simplifications 
(w1) 

— Data are non-existent or unreliable (w2) 

— There is a lack of consensus among experts (w3) 

— The phenomena involved are not well under- 
stood (w4). 


An in-between background knowledge is classified 
as moderate. The SoK assessment intends to capture 
uncertainties which are not easily transformed or 
expressed quantitatively, e.g. by probabilities. 
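
The scoring rule above is simple enough to state as a function. The sketch below is one possible encoding: the function name and the boolean-list inputs are assumptions, and checking the weak criteria first is a conservative choice for the (contradictory) case where both sets of criteria are reported as met.

```python
def classify_sok(strong_criteria_met, weak_criteria_met):
    """Classify strength of knowledge following the Flage & Aven (2009) scheme.

    strong_criteria_met: four booleans for criteria s1..s4
    weak_criteria_met:   four booleans for criteria w1..w4
    """
    if any(weak_criteria_met):
        return "weak"      # at least one weak criterion is true
    if all(strong_criteria_met):
        return "strong"    # every strong criterion is met
    return "moderate"      # anything in between

if __name__ == "__main__":
    # Hypothetical assessment: reasonable assumptions, but limited data
    print(classify_sok([True, False, True, True], [False, False, False, False]))
```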


4.2 Assumption deviation risk assessment 


The concept of assumption deviation risk, intro- 
duced by Aven (2013), highlights the risk with 
respect to the (main) assumptions on which the 
quantitative risk assessment is based. In general, 
Aven (2013) suggests focusing on the magnitude of 
the deviation, the degree of belief in this magni- 
tude occurring, how this deviation will influence 
the risk, and the SoK related to the phenomena 
affecting the assumption (as described above). 
Thus, the concept of assumption deviation risk 
goes beyond traditional sensitivity and uncer- 
tainty analysis. Sensitivity analysis, such as asking 
“what if” questions, is informative, as it produces 
a range of outcomes based on different input val- 
ues. Assumption deviation risk extends this type of 
analysis, as it assesses the risk of deviations, cover- 
ing an assessment of the consequences of the devi- 
ations, uncertainties related to the outcomes and 
a crude SoK assessment. The concept of assump- 
tion deviation risk aims to better communicate 
the uncertainties related to the leakage risk to the 
decision-maker, by explicitly assessing and identi- 
fying the influence of any deviation in the assump- 
tions. The assessment is systematic, as it intends to 
identify critical assumptions, analysing the risk of 
deviations in those assumptions, and is a means 
to understanding assumptions which may deviate 
far ahead in the future (Khorsandi & Aven, 2017). 
Requirements, such as that the well barrier should 
perform its purpose for eternity, make it important 
to understand how the leakage risk may change 
over the lifetime of an abandoned well. This is 
better understood when we understand how devia- 
tions in the assumed states can occur. The main 
concerns in the leakage rate assessment are the 
negative consequences; hence, we can restrict our 
focus to deviations that worsen the outcomes. 
Often, as the decision-making problem becomes 
more complex, the number of assumptions has a 
tendency to increase and the assumption deviation 
risk assessment becomes tedious and time-consum- 
ing. To make the improved approach practical, we 
suggest handling each assumption systematically, 
in line with guidance suggested by Berner & Flage 
(2016), who focus on six so-called settings that the 
decision-maker faces when making assumptions 
in a risk assessment. These settings are classified 
by the degree of belief in deviation from the initial 
assumption, the effect of such deviation and the 
SoK related to the assumption. When we know what 
setting we are facing, a general approach to handle 
the assumption is suggested. We must emphasize 
that the guideline should not be used as a mechanistic 
framework and must be adapted to the case 
of interest. 

In addition, the assumption deviation risk 
assessment will increase the likelihood of detecting 
surprises. Two categories are of particular 
concern with respect to surprises: a moderate/high 
belief in deviation, and when the SoK is moder- 
ate/weak. Assessment of the assumptions will 
highlight such potential surprises. This is a way of 
revealing black swans (Aven, 2014). 


5 A CASE STUDY 


A synthetic vertical gas well is considered to be 
plugged and abandoned. The well, well barrier 
and reservoir characteristics are based on Arild 
et al. (2017) (except the inclination of the well). 
For simplicity, we refer to the values in Table 1 for 
the uncertain parameters and to Arild et al. (2017) 
for an overview of the known parameters of inter- 
est. The secondary barrier, which functions as a 
backup to the primary, is placed right on top of the 
primary barrier, and the combination of the two is 
treated as one barrier (plug 1+2). In addition, there 
is a surface barrier (plug 3) located at 300 m TVD 
(true vertical depth). 

Let us say that the well appears to have a low leak- 
age risk (before any assessment is conducted). Neg- 
ligible issues, such as scaling and cross flows, were 
observed during the production phase. The only 
concern is an unconsolidated sandstone formation 
around the plug 1+2, indicated by an acoustic log. 
However, the overall impression is that the NOR- 
SOK Standard D-010 requirement of having 100 m 
thick barriers is too strict. In other words, the require- 
ment is questioned in terms of its cost-effectiveness. 
A more cost-effective design, with shorter cement 
plug lengths, is considered as an alternative. 

Reducing the cement plug lengths intuitively 
increases leakage risk and needs to be justified 
before implementation. A deviation in some of the 


assumptions, which form the basis for the assess- 
ment, can be critical, as the consequence threshold 
will be further reduced. The aim of the analysis 
is to illustrate how assumptions can be assessed 
according to the methods presented in this paper, 
to achieve better decision support. The case study 
is divided into three parts. First, the current risk- 
based approach is applied to assess the leakage risk. 
Then, the suggested improvements discussed in 
this paper are applied to the same case. Finally, we 
compare the current risk-based approach with the 
improved approach, in terms of decision support. 


5.1 The current risk-based approach 


The alternative P&A design, with shorter cement 
plug lengths, needs to be justified before implementation. 
This is done by assessing the leakage risk, which 
depends on the probability of failure and the associated 
leakage rates. These are estimated by the 
procedures introduced in Section 2. 


5.1.1 Probability assessment 

Following the approach described in Section 2.1 
and the example of Arild et al. (2017), we establish 
a Weibull distribution to reflect the probability of 
failure within a given time for this well. We assume 
that the shape parameter is 1.5, indicating that the 
failure rates increase with time. The scale param- 
eter is assumed to have a uniform distribution, 
between 0 and 1000 years. Basically, this means 
that whether failures are likely to occur in 10 years 
or 1000 years is unknown. The data used to update 
our prior belief about the scale parameter are the 
334 censored observations of wells assumed to be 
permanently abandoned. Probability of failure 
within a certain time period is then predicted from 
the posterior predictive distribution in Figure 1. 
The probability of failure within the first 100 years 
is approximately 5%, if the assumptions hold true. 
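
The updating step described above can be sketched on a grid: a uniform prior on the Weibull scale parameter, a fixed shape of 1.5, and a likelihood built only from right-censored survival times. The censoring times used below are random placeholders, not the Norwegian Petroleum Directorate data, so the printed value is not expected to reproduce the 5% figure; the function name and grid size are likewise illustrative.

```python
import math
import random

def failure_probability(censor_times, horizon, shape=1.5, scale_max=1000.0, n_grid=4000):
    """Posterior predictive P(T <= horizon) for a Weibull lifetime.

    Uniform prior on the scale over (0, scale_max]; every observation is
    right-censored, i.e. the well had not failed by its censoring time.
    """
    s = sum(t ** shape for t in censor_times)
    step = scale_max / n_grid
    scales = [step * (i + 1) for i in range(n_grid)]
    # Censored log-likelihood: sum over wells of -(t_i / scale)^shape
    log_like = [-s / eta ** shape for eta in scales]
    peak = max(log_like)
    weights = [math.exp(ll - peak) for ll in log_like]
    total = sum(weights)
    cdf = [1.0 - math.exp(-((horizon / eta) ** shape)) for eta in scales]
    return sum(w * c for w, c in zip(weights, cdf)) / total

if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder data: 334 wells abandoned between 1 and 40 years ago (assumed)
    times = [rng.uniform(1.0, 40.0) for _ in range(334)]
    print(failure_probability(times, horizon=100.0))
    # Widening the prior range changes the answer, since censored data alone
    # cannot bound the scale parameter from above
    print(failure_probability(times, horizon=100.0, scale_max=5000.0))
```

Because the data contain no observed failures, the likelihood keeps increasing with the scale parameter and the posterior piles up near the upper bound of the prior, which is why the choice of prior interval matters so much for the predicted probability.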


Figure 1. Posterior predictive probability of failure times given the initial assumptions. 

5.1.2 Consequence assessment 
If a failure occurs, the quantity of interest is the 
leakage rate. Here, we follow the calculations from 
Equations 1-3 for both the surface barrier and the 
combined primary and secondary barrier. Leakage 
to the seabed is determined by the minimum leakage 
rate through the surface barrier or the combined 
plug. The combined plug has a significantly 
lower flow rate, because it is twice as thick as 
the surface barrier. Figure 2 shows the sensitivity of 
this flow rate to different plug lengths: the solid 
lines represent our 90% certainty that the flow rate 
is below these flow rates, the dashed lines are the 
most likely rates, and the stippled lines reflect our 
90% certainty that the flow rates are above these 
flow rates. 

When the plug lengths are above the prescriptive 
length of 100 m, reducing them causes a negligible 
increase in the flow rates. However, for plug lengths 
shorter than 100 m, a further reduction causes a 
relatively significant increase in flow rate. 


5.1.3 Results from the current risk-based approach 
The probability and consequence assessments 
resulted in low leakage risk. The probability of 
failure within the next 100 years is predicted to be 
5%. With plug lengths of 50 m, there is an 80% probability 
that the leakage rate will be in the range 
of 1.303 × 10⁻⁶ m³/s to 5.087 × 10⁻⁵ m³/s, with a 
most likely leakage rate of 1.829 × 10⁻⁵ m³/s. If 
we assume that the reservoir fluid composition is 
mostly methane, these flow rates correspond to a 
yearly release in the range of 0.021 to 0.831 ton, 
with a most likely yearly release of 0.298 ton. With 
the prescriptive plug lengths of 100 m, there is an 
80% probability that the flow rate is in the range of 
7.667 × 10⁻⁷ to 2.780 × 10⁻⁵ m³/s (0.012 to 0.454 ton/ 
year), with a mean flow rate of 1.021 × 10⁻⁵ m³/s 
(0.457 ton/year). The leakage rate is almost doubled 
by reducing the plug lengths from 100 to 
50 m. These results represent the base case of this 
analysis. As the assumptions made in the probabil- 
ity and consequence assessment may deviate, we 
need to assess their influence on the leakage risk, 
in order to justify a plug length of 50 m. 


Figure 2. Flow rates through the primary and secondary plug for different cement plug lengths (axes: length of plug [m]; flow rate [× 10⁻⁵ m³/s]). 


5.2 The improved risk-based approach 


Following the approach described in Section 4, we 
intend to provide a more detailed understanding 
of the assumptions, and therefore the leakage risk, 
before making a decision. For simplicity, we have 
only considered a few assumptions in this paper. 
The assumption deviation risk assessment is sum- 
marized in Table 2, and complements the results 
presented in Section 5.1. All the assumptions are 
assessed in terms of degree of belief in devia- 
tion, effect of such deviation on the leakage risk 
and the SoK regarding the assumption. Some of 
the assumptions show significant influence on the 
leakage risk. 

Assumptions classified by a moderate or high 
belief in deviation from the initial assumption 
and moderate or high influence on the leakage 
risk, based on weak background knowledge, are 
the most critical in the leakage risk assessment 
(assumption Nos. 1, 2 and 3). For assumption No. 1, 
the SoK is weak. This is controversial, as weak 
knowledge discourages the establishment of prob- 
ability distributions and raises questions regard- 
ing the final result. A uniform prior distribution 
is the most appropriate option, since it intends to 
reflect the epistemic uncertainty to some degree. 
The goodness of this distribution is, however, open 
for debate, as we do not know with certainty which 
prior distribution or interval to choose. A devia- 
tion in the range of the uniform prior distribution 
greatly affects the probability of failure. 
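
The screening rule stated at the start of this discussion (moderate or high belief in deviation, moderate or high influence on the leakage risk, and weak background knowledge) can be written as a small filter. The class, field names and example entries below are hypothetical and are not taken from Table 2.

```python
from dataclasses import dataclass

@dataclass
class Assumption:
    name: str
    belief_in_deviation: str   # "low", "moderate" or "high"
    influence_on_risk: str     # "low", "moderate" or "high"
    sok: str                   # "weak", "moderate" or "strong"

def is_most_critical(a):
    """Apply the criticality rule: likely to deviate, influential, weakly supported."""
    return (
        a.belief_in_deviation in ("moderate", "high")
        and a.influence_on_risk in ("moderate", "high")
        and a.sok == "weak"
    )

if __name__ == "__main__":
    # Hypothetical entries for illustration only
    assumptions = [
        Assumption("prior range of the scale parameter", "moderate", "high", "weak"),
        Assumption("fracture width", "low", "moderate", "moderate"),
    ]
    print([a.name for a in assumptions if is_most_critical(a)])
```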

Assumption No. 2, regarding the validity of 
the historical data, is also critical. As the historical 
data are based on P&A designs in line with NORSOK 
Standard D-010, the data are poor representations 
of this well. There is a trade-off between 
a large sample and relevant data, which the assessors 
need to consider and reflect on. Based on sound 
engineering thinking, it is likely that this well, with 
shorter plug lengths, will have a higher probability 
of failure within the first 100 years, compared to 
the values in Figure 1. 

The assumption deviation risk assessment 
revealed that most of the assumptions are believed 
to deviate to some degree (Nos. 4, 6, 7 and 8), in 
addition to having moderate influence on the leak- 
age risk. As the SoK is not weak for these assump- 
tions, probability distributions may be established. 
These assumptions may deviate, reflected by not 
being based on strong background knowledge or 
by the fact that we believe in a deviation; and the 
number of such assumptions is of great concern. If 
deviations from more than one of the assumptions 
take place simultaneously, the cumulative influ- 
ence on the leakage risk can be significant, despite 
a deviation in each assumption, separately, show- 
ing little influence. This is highly relevant for the 
evaluation of P&A designs but beyond the scope 
of the present paper. 


5.3 Evaluating the results 


Based on the assumption deviation risk summa- 
rized in Table 2, it is now possible to argue that 
a reduction in the plug length from 100 to 50 m is 
not justified. If a decision were made exclusively 
on the probability and consequence assessment, 
the conclusion would most likely be different. 
The two-dimensional leakage risk of probabilities 
and consequences is low. Seeing beyond such a 
narrow perspective on risk, we see that there are 
some uncertainties which should not be ignored. 
An incorrect choice of the prior distribution can 
greatly affect the probability of failure, for exam- 
ple. In addition, the data may provide too opti- 
mistic information about expected survival times 
if the prior distribution’s ability to support longer 
lifetimes is poor. The leakage rate can also be sig- 
nificantly increased if the assumed value of the 
micro-annuli gap is slightly too low, compared to 
the true future value. The latter can take place if 
the annuli cement has expanding properties, such 
that the cement, over time, may expand radially 
towards the formation and cause a greater inner 
micro-annuli gap (Baumgarte et al., 1999). 

Ideally, the justification should be made with 
reference to some risk acceptance criteria. Such 
criteria have not yet been established. The final 
decision should be made after a managerial review 
and judgment process, according to the idea of 
risk-informed decision-making (e.g. Aven & Zio 
(2011)). 


6 DISCUSSION 


The main aim of this paper was to illustrate 
how the risk-based approach for evaluating 
the containment performance of a plugged and 
abandoned well can be improved by assessing 
uncertainties beyond probabilities. The methodology 
is systematic in its treatment of the assumptions. 
It should be mentioned that no attempt was 
made to quantify what a high degree of sensitivity 
actually corresponds to, although this would 
affect the number of critical assumptions. Does 
a deviation in an assumption that results in a 20% 
change in the leakage rate indicate high sensitivity? 
The challenge is related to how to justify what 
increase in the leakage risk is acceptable, which is 
beyond the scope of this paper. 

The quantitative approach for assessing the 
leakage risk is attractive. This type of approach 
in plug and abandonment evaluation is new to the 
industry and has seen little practical experience. 


One of the main challenges that remains in further 
developing the approach is to establish criteria that 
can be used in the justification of a specific design. 
Then the lifetime distribution and calculated leak- 
age rates complemented by an assumption devia- 
tion risk assessment would provide a powerful tool 
for making risk-informed decisions. 

Many of the assumptions made are based on 
relevant literature on the subject matter. The key 
point is that these values may be more or less rel- 
evant for the specific case. Assessing the assump- 
tions helps to declare whether the assumptions are 
appropriate or not. However, it requires the assessor 
to have a sound understanding of what it takes for 
an assumption to occur. If the knowledge is weak, 
we cannot say anything with certainty about the 
deviation in an assumption and the effect of such 
a deviation. We should communicate this fact to 
the decision-maker. The assumption deviation 
risk assessment provides a deeper understanding 
of the phenomena involved in a P&A operation, 
thus revealing uncertainties beyond the capability 
of the current risk-based approach. The fact that 
barriers should perform their functions to eternity 
requires that assumption deviations are taken into 
consideration when justifying an alternative P&A 
design. 

A deviation in an assumption which increases 
the leakage risk is undesirable. To avoid such 
negative outcomes, it is common to apply con- 
servative assumptions, which are often strong sim- 
plifications. In the current risk-based approach 
conservative assumptions are often made, such as 
assumption Nos. 7 and 9 in Table 2. One reason why 
we often resort to conservative assumptions is lack 
of knowledge. Rather than searching for increased 
knowledge, it is less time- and resource-consuming 
to make a conservative assumption. A danger 
of such a practice is that an overly high risk picture 
is presented to the decision-maker (Aven, 2016). 
Basic safety management principles would then 
imply an implementation of costly risk-reducing 
measures. Thus, the careless use of conservative 
assumptions works against the purpose of the 
risk-based approach, which is to provide decision 
support regarding an alternative, more cost-effective 
P&A design. 


7 CONCLUSION 


In this paper, we have shown how assumption devi- 
ation risk assessment can improve the treatment 
and reflection of uncertainties. The assumptions 
are assessed separately with respect to the analyst’s 
degree of belief in deviation, the sensitivity of the 
leakage risk to such deviation and the SoK related 
to the assumption. The fact that assumptions are 
known to potentially deviate more or less should 
not be ignored. Not all uncertainties are easily 
transformed and expressed quantitatively, and 
they require a semi-quantitative approach to 
be fully reflected, to ensure appropriate decision 
support. By complementing the leakage risk with 
an assumption deviation risk assessment, more 
informed decisions can be made. 

A case study was performed to illustrate how 
the improved risk-based approach provides more 
informed decision support than the current 
approach. In this case, an alternative P&A design 
was evaluated. The assumption deviation risk 
assessment highlighted some critical assumptions, 
which led to a decision that differed from what we 
would decide upon if it were made exclusively on 
the leakage risk. 
ABSTRACT: As a possible alternative energy source, hydrogen fuel cells, especially Polymer Electrolyte 
Membrane (PEM) fuel cells, have received growing attention in the last few decades and have already 
been deployed in many applications. A series of studies have been devoted to PEM fuel cell fault diagnosis 
to ensure reliability over the cell lifetime. However, owing to the complexity of PEM fuel cell systems and 
incomplete PEM fuel cell test protocols, it is difficult to test the various PEM fuel cell failure modes, so the 
performance of fault diagnostic techniques cannot be fully investigated. It is therefore necessary to develop 
a reliable PEM fuel cell model capable of simulating various PEM fuel cell faults. In this study, a hybrid model 
is developed to represent the behavior of PEM fuel cells in both continuous and discrete-time domains. With 
a continuous-time domain sub-model, various aspects of PEM fuel cell behavior can be simulated, including 
fluid, thermal, and electro-chemical dynamics. Moreover, the PEM fuel cell failure modes are implemented 
with stochastic Petri nets in the discrete-time domain. Based on the developed hybrid model, various PEM 
fuel cell failure modes can be simulated and their effects on the system performance can be observed. With 
the simulated data under different conditions, fault diagnostic techniques can be better evaluated by 
studying their performance in different failure mode scenarios. 


1 INTRODUCTION 


Owing to characteristics such as zero emissions 
and high efficiency, the PEM fuel cell has attracted 
growing attention as an alternative energy source. 
In the last few decades, PEM fuel cells have been 
deployed in several applications, including automotive 
systems, consumer devices, and stationary power systems. 

However, the reliability of the PEM fuel cell over 
its lifetime is still a main barrier to further 
commercialization. To address this, several studies 
have been devoted to PEM fuel cell fault diagnosis, 
which detects and isolates abnormal PEM fuel cell 
performance so that mitigation strategies can be 
applied to recover and extend fuel cell performance. 
Based on the methods adopted, these studies 
can be loosely divided into two categories: model-based 
techniques and data-driven approaches [Petrone 
et al. 2013, Zheng et al. 2013]. 

In model-based techniques, a PEM fuel cell 
model should be developed to express the system 
behavior, and the fault can be identified by calcu- 
lating the residual between the model outputs and 
actual measurements [Kamal and Yu 2011, Ohs 
et al. 2011, Zeller et al. 2010]. With data-driven 
approaches, features indicating the fuel cell condition 
are extracted from the measurements, 
and the fuel cell state is determined by applying 
pattern recognition algorithms to the extracted 
features [Mao et al. 2017, Placca et al. 2010, Rubio 
et al. 2010, Steiner et al. 2011, Zhongliang et al. 
2015]. 
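
The residual idea behind model-based diagnosis can be shown in a few lines: compare each measurement with the model output and flag samples whose residual exceeds a threshold. The function names, the fixed threshold and the voltage values below are hypothetical; practical schemes typically use adaptive thresholds or statistical tests.

```python
def residuals(model_outputs, measurements):
    """Residual = measurement minus model output, sample by sample."""
    return [y - y_hat for y_hat, y in zip(model_outputs, measurements)]

def flag_faults(model_outputs, measurements, threshold):
    """Return the indices of samples whose absolute residual exceeds the threshold."""
    return [
        i
        for i, r in enumerate(residuals(model_outputs, measurements))
        if abs(r) > threshold
    ]

if __name__ == "__main__":
    predicted_v = [0.70, 0.70, 0.70]   # hypothetical model cell voltages [V]
    measured_v = [0.71, 0.69, 0.55]    # hypothetical measurements [V]
    print(flag_faults(predicted_v, measured_v, threshold=0.05))
```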

Previous studies indicate that data-driven approaches 
are more widely used in PEM fuel cell fault 
diagnosis [Zheng et al. 2013]. The main reason is 
that the PEM fuel cell involves interacting phenomena 
from the fluidic, thermal and electrical domains, 
which makes it difficult to develop an accurate model, 
whereas a data-driven approach can perform fault 
diagnosis using only measurements from the PEM 
fuel cell system. 

However, owing to the incomplete protocols for 
PEM fuel cell failure tests, only a limited set of PEM 
fuel cell failure mode conditions can be tested in the 
lab [Yuan et al. 2011, Miller and Bazylak 2011], 
so the effectiveness of data-based fault diagnostic 
techniques cannot be fully investigated. Therefore, 
further studies of the performance of these 
approaches in diagnosing more PEM fuel cell failure 
modes are still required. 

In this study, a PEM fuel cell model is developed 
based on the bond graph technique, which 
can represent multiple physical phenomena in a 
unified graphical notation. With the developed 
model, various PEM fuel cell failure modes can 
be simulated, and the simulated data can be used in 
data-based fault diagnostic techniques to investigate 
their effectiveness. Section 2 presents background 
on the PEM fuel cell and the development of its bond 
graph model. The ability of the developed model to 
represent PEM fuel cell performance is validated in 
section 3. In section 4, the model is used to simulate 
the fuel cell dehydration phenomenon, and the 
performance of data-based fault diagnostic techniques 
is studied using the simulated data. Conclusions 
drawn from the results are given in section 5. 


2 PEM FUEL CELL BOND GRAPH 
MODEL 


2.1 Introduction to the PEM fuel cell 


A typical PEM fuel cell includes several components, 
i.e. anode and cathode electrodes, gas diffusion 
layers, catalyst layers, and a polymer electrolyte 
membrane, which are depicted in Figure 1. 

During operation, hydrogen and air/oxygen are 
injected into the anode and cathode sides, respectively. 
Hydrogen is split into protons and electrons 
according to Eq. (1). Protons can pass through the 
membrane, while electrons can only reach the cathode 
via the external circuit, where current is generated. 
At the cathode side, protons, electrons, and oxygen 
react to produce heat and water (Eq. 2), which can 
be removed from the cathode side. 


$$\mathrm{H_2 \rightarrow 2H^+ + 2e^-} \qquad (1)$$

$$\mathrm{\tfrac{1}{2}O_2 + 2H^+ + 2e^- \rightarrow H_2O} \qquad (2)$$


2.2 Bond Graph method 


The core principle of the bond graph (BG) method 
is energy conservation, i.e. the total energy in a 
closed system is never destroyed or lost, but converted 
from one form to another. With this method, 
systems involving multiple physical domains can 
be unified. 

Figure 1. A typical PEM fuel cell (labels: oxygen from air, air and water vapour, gas diffusion layer, proton exchange membrane). 

In a BG the rate at which energy is transferred 
between components is power, which is denoted as 
a half arrow as shown in Figure 2. It can be seen 
that power flow is characterized by two power vari- 
ables: effort (e) and flow (f), where 


$$e \times f = \text{power} \qquad (3)$$


Table 1 depicts some commonly used analogies 
for the meanings of effort and flow. 

Elements in the BG are located at the BG nodes, 
and represent different energy manipulation mech- 
anisms. Sources of effort (Se) and flow (Sf) are 
active elements and provide inputs to the system. 
Such elements, controlled by an external signal, are 
called ‘modulated’ and denoted by a prefix ‘m’, e.g. 
mSe. Energy dissipation and storage phenomena 
are implemented via resistive (R), capacitive (C) or 
inductive (I) elements. Detectors of effort (De) and 
flow (Df) are shown with a full arrow to emphasize 
that they do not participate in energy exchange, 
but rather simply act as sensors and measure cor- 
responding power variables. 

Multiple power bonds can meet at one of two 
junction types, 0- and 1-type, which enforce 
energy conservation within the system. 
Another junction structure, the transformer (TF), 
acts as an energy transducer, converting the 
transferred power from one physical domain to 
another. A TF element can only have two bonds 
connected. Figure 3 shows the different junction 
types, and the corresponding equations are given in 
Eqs. (4)-(6). 


Figure 2. Power bond between objects A and B. 

Table 1. Physical analogies for power variables. 

Domain        Effort               Flow 
Electrical    Voltage              Current 
Mechanical    Force                Velocity 
Pneumatic     Pressure             Volumetric flow 
Chemical      Chemical potential   Molar flow 
Thermal       Temperature          Entropy flow 

1-junction: $f_1 = f_2 = \dots = f_N, \quad \sum_{i=1}^{N} e_i = 0$ (4) 

0-junction: $e_1 = e_2 = \dots = e_N, \quad \sum_{i=1}^{N} f_i = 0$ (5) 

Transformer: $e_1 = m e_2, \quad f_2 = m f_1$ (6) 
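
The junction relations in Eqs. (4)-(6) can be checked numerically, which also shows why they conserve power. In the sketch below, efforts and flows are signed according to bond direction (a sign convention assumed here), and the function names and numbers are illustrative.

```python
import math

def one_junction_ok(efforts, flows, tol=1e-9):
    """1-junction: all flows equal and the signed efforts sum to zero."""
    common_flow = all(abs(f - flows[0]) <= tol for f in flows)
    return common_flow and abs(sum(efforts)) <= tol

def zero_junction_ok(efforts, flows, tol=1e-9):
    """0-junction: all efforts equal and the signed flows sum to zero."""
    common_effort = all(abs(e - efforts[0]) <= tol for e in efforts)
    return common_effort and abs(sum(flows)) <= tol

def transformer(e2, f1, m):
    """Return (e1, f2) from the transformer relations e1 = m*e2 and f2 = m*f1."""
    return m * e2, m * f1

if __name__ == "__main__":
    print(one_junction_ok([5.0, -3.0, -2.0], [1.5, 1.5, 1.5]))
    print(zero_junction_ok([2.0, 2.0, 2.0], [1.0, -0.4, -0.6]))
    e1, f2 = transformer(e2=4.0, f1=3.0, m=2.5)
    # Power entering the transformer equals power leaving it: e1*f1 == e2*f2
    print(math.isclose(e1 * 3.0, 4.0 * f2))
```

At either junction type the net power, the sum of e_i f_i over all bonds, is zero whenever the constraints hold, and the transformer passes power through unchanged because the modulus m cancels.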


2.3 PEM fuel cell BG 


The PEM fuel cell BG is hierarchical. At the base 
are basic bond graph elements describing energy 
storage and transfer mechanisms. A set of BG 
elements describing pneumatic and heat transfer 
phenomena is constructed for the two bipolar plates 
and for the anode and cathode sides, and BG elements 
describing electrochemical, transport and thermal 
phenomena represent the membrane electrode assembly 
[Saisset et al. 2006]. Additionally, cooling channels 
and the end plates are implemented as separate 
components [Vasilyev et al. 2017]. 

Figure 4 depicts the blocks representing the physical
components of the PEM fuel cell and the bonds
connecting them, which represent power flows between
components. Anode/cathode inlet and outlet blocks
correspond to mass flow controllers or valves and
regulate the flow of matter into or out of the cell.
This is shown by bonds labelled with P, T as efforts
and $\dot{m}$, $\dot{H}$ as flows. The source of
electric current (mSf) represents the load demanded
from the fuel cell. Electrochemical phenomena within
the membrane electrode assembly components determine
the rates of reactant consumption and product
formation, $\dot{m}_{H_2}$, $\dot{m}_{O_2}$ and
$\dot{m}_{H_2O}$. Transport phenomena determine the
diffusion flows through the membrane electrode
assembly, while thermal effects determine the heat
flows $\dot{Q}$ between the bipolar plates and the
membrane.

Figure 4. PEM fuel cell BG [Vasilyev et al. 2017].

A set of equations is used to develop the PEM fuel
cell BG, including the computation of mass flow rates
of gases into and out of the cell, and of thermal and
pneumatic activities within the bipolar plates. More
details about the modelling procedures for the PEM
fuel cell BG can be found in previous studies
[Gawthrop and Bevan 2007, Vasilyev et al. 2017].

With results from different PEM fuel cell com- 
ponents, the single cell voltage can be calculated 
using Eq. (7). 


$V_{cell} = E_{rev} - \eta_{act} - \eta_{ohm} - \eta_{conc}$ (7)


where $E_{rev}$ is the reversible potential, and
$\eta_{act}$, $\eta_{ohm}$ and $\eta_{conc}$ are the
activation loss, ohmic loss, and concentration loss,
respectively.
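
The voltage balance above (reversible potential minus the three losses) can be written as a small helper. This is a minimal sketch only: the function name and the numeric values are hypothetical and are not taken from the model or the test data. The losses themselves would come from the BG sub-models.

```python
def cell_voltage(e_rev, eta_act, eta_ohm, eta_conc):
    """Single cell voltage: reversible potential minus the three losses (all in V)."""
    for name, loss in (("activation", eta_act), ("ohmic", eta_ohm),
                       ("concentration", eta_conc)):
        if loss < 0:
            raise ValueError(f"{name} loss must be non-negative")
    return e_rev - eta_act - eta_ohm - eta_conc


# Illustrative (made-up) values in volts.
v = cell_voltage(e_rev=1.2, eta_act=0.3, eta_ohm=0.1, eta_conc=0.05)
print(round(v, 3))
```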

Figure 5 depicts the single PEM fuel cell BG obtained
by assembling the individual components. Figure 6
shows the PEM fuel cell system BG, which includes the
single cell BG and the cooling loops, where inlet mass
flows are regulated by Rth1 and Rth2. Each cooling
loop comprises a single Cth-element and elements
RC1-4, which calculate the heat transfer rate.

Figure 5. Single PEM fuel cell BG [Vasilyev et al. 2017].

Figure 6. PEM fuel cell system BG [Vasilyev et al.
2017].


3 VALIDATION OF PEM FUEL CELL BG 


Before using the developed PEM fuel cell BG in 
fault diagnosis, the performance of the developed 
BG should be validated. 

In this study, Electrochemical Impedance Spectroscopy
(EIS) data obtained from the test are used to
determine the model parameters, including electrical
resistance and double layer capacitance. With these
parameters, the polarization curve is computed from
the model and compared with the one from the test.
The comparison is depicted in Figure 7.


Figure 7. Comparison results of polarization curves
between the model and test [Vasilyev et al. 2017].
(Curves: experiment and simulation; x-axis: current
density, A/cm².)

Figure 8. Comparison results of cell voltage and
temperature curves between the model and test
[Vasilyev et al. 2017].


It can be seen from Figure 7 that the overall
polarization curve from the model matches the test
data with good accuracy. The deviation becomes
slightly larger in the concentration loss region, at
current densities above 0.55 A/cm². The reason is
that the model does not fully account for electrode
porosity and the effects of liquid water formation
within the cell.

Furthermore, a test with varying current den- 
sities is performed, and the cell voltage and tem- 
perature are obtained and compared with those 
from the developed model. Results are shown in 
Figure 8. It can be seen that the developed model 
can capture the PEM fuel cell behavior with good 
quality, which paves the way for using the devel- 
oped model for the following analysis. 


4 USE OF FUEL CELL BG MODEL FOR 
FAULT DIAGNOSIS 


4.1 Simulation of PEM fuel cell failure mode


In this study, dehydration is simulated and the sim- 
ulated data is used to test the performance of data- 
driven fault diagnostic approaches. The reason 
for selecting dehydration is that it is a commonly 
experienced failure mode in PEM fuel cell systems 
due to unbalanced water management. Moreover,
dehydration tests are rarely performed on real cells
because they cause permanent damage to the membrane.
Therefore, with the developed PEM fuel cell BG 
model, the performance of data-driven approaches 
can be investigated more efficiently in terms of 
both computational time and financial cost. 

In the simulation, a constant current (70 A herein)
is applied to the developed model. After a period of
normal operation, the relative humidity at the anode
side is reduced from 100% to 50% at 500 h, which
decreases the water content within the cell and thus
causes dehydration. Figure 9 depicts the variation of
anode relative humidity, voltage, and stack
temperature.

It can be seen from Figure 9 that, when operated at
constant conditions, the PEM fuel cell voltage decays
linearly, representing the degradation due to fuel
cell aging. Moreover, once the anode relative
humidity is decreased, the stack voltage decreases
more steeply and the increase in stack temperature
becomes more clearly visible. This is due to the
reduced water content within the cell caused by the
lower relative humidity of the inlet gas.


4.2 Data-driven fault diagnostic approaches 


In this study, several fault diagnostic approaches
have been applied to the simulated data for both
normal and dehydration conditions. As data from
multiple sensors are simulated, an approach for
reducing the dimension of the dataset is applied
first. Kernel Principal Component Analysis (KPCA) is
selected herein due to its better performance in
non-linear systems. After that, wavelet packet
transform (WPT) is applied to decompose the original
signal into different frequency ranges, from which the
features are constructed. The features with the
highest values are used in this case to discriminate
the PEM fuel cell states. This procedure is
illustrated in Figure 10. It should be noted that the
selected diagnostic framework has been reported to be
effective in identifying various failure modes in PEM
fuel cell systems [Mao et al. 2017]. More details
about these approaches can be found in previous
studies [Mao et al. 2017, Placca et al. 2010].

Figure 9. Simulation results of anode relative humidity,
fuel cell stack voltage, and stack temperature.

1. Reduction of dataset dimension using KPCA
2. Decomposition of dataset into signals at different
frequency ranges using WPT
3. Calculation of energy from each signal
4. Determination of fuel cell state with two energies
with the highest values

Figure 10. Flowchart of PEM fuel cell fault diagnosis
using data-driven approaches.


4.3 Diagnostic results 


In the analysis, simulation data from multiple
sensors are used, which are listed in Table 2. The
reason for using multiple sensors is that they
provide complementary information about the PEM fuel
cell performance, which should be included so that
useful information is not lost, without further
interpretation of the sensor measurements.

Table 2. Sensor measurements used in the analysis.

Sensor               Unit    Sensor                  Unit
Voltage              V       Air inlet flow          l/min
H2 inlet flow        l/min   Air inlet pressure      bar
H2 inlet pressure    bar     Air outlet pressure     bar
H2 outlet pressure   bar     Air inlet temperature   K
H2 inlet temp        K       Stack temperature       K

KPCA is applied to the dataset of sensor measurements
(listed in Table 2) to project it onto the two
principal directions. It should be noted that the
simulated data shown in Figure 9 is divided into two
parts representing different states (normal and
dehydration), and each part is further divided into
several segments for the following analysis.

WPT is then applied over 3 levels to each data
segment obtained from KPCA. The extracted wavelet
coefficients are used to reconstruct the signals at
different frequency ranges, from which the signal
energies are calculated. Figure 11 depicts the energy
distribution in both the normal and dehydration
states.
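
The energy feature extraction can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the implementation used in the study: the mother wavelet is not given in the text, so the Haar wavelet is assumed here; the KPCA step is omitted and the input is simply a one-dimensional segment; and energies are computed from the packet coefficients, which for an orthonormal transform such as Haar equal the energies of the reconstructed band signals. All function names are hypothetical.

```python
import math


def haar_step(signal):
    """Split an even-length signal into Haar approximation and detail halves."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def wavelet_packet_energies(signal, levels=3):
    """Energy of each of the 2**levels packets after a full Haar packet split.

    Packets are returned in natural (filter-bank) order, not sorted by frequency.
    """
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    packets = [list(signal)]
    for _ in range(levels):
        next_packets = []
        for packet in packets:
            # Unlike the plain wavelet transform, every packet is split again.
            next_packets.extend(haar_step(packet))
        packets = next_packets
    return [sum(c * c for c in packet) for packet in packets]


def top_two_features(energies):
    """The two largest packet energies, used as discrimination features."""
    return sorted(energies, reverse=True)[:2]


# Hypothetical segment: a slow drift plus a small oscillation.
segment = [0.1 * i + math.sin(0.3 * i) for i in range(64)]
energies = wavelet_packet_energies(segment, levels=3)
print(len(energies), top_two_features(energies))
```

Because the transform is orthonormal, the packet energies sum to the energy of the original segment, which gives a simple sanity check on any implementation.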

It can be seen from Figure 11 that the energy shows a
similar distribution in the different PEM fuel cell
states, and that the highest energies are located in
the same wavelet packets. Therefore, the two highest
energies are used herein as features for the
discrimination. The results are shown in Figure 12.
Since KPCA is used to project the original dataset
onto the two principal directions, the discrimination
results in these two directions are depicted. It can
be observed that the dehydration state can be
discriminated well from the normal state, indicating
that the applied data-driven approaches can identify
PEM fuel cell dehydration accurately.

Figure 11. Distribution of energy from signals at
different frequency ranges.

Figure 12. Discrimination results using selected features.


5 CONCLUSIONS 


In this paper, a PEM fuel cell model is developed
using the bond graph technique, which can represent
the various behaviours of a PEM fuel cell system. The
model parameters are determined using EIS data
collected from the PEM fuel cell, and the performance
of the developed model is validated using fuel cell
test data at different conditions.

With the developed PEM fuel cell bond graph model,
data from different fuel cell failure scenarios can
be simulated. In this study, fuel cell dehydration is
simulated, and the simulated data is used with
data-driven fault diagnostic approaches. Results
demonstrate that the approaches used can discriminate
the dehydration state with good quality. In future
work, more fuel cell failure modes will be simulated
using the developed model, so that the capability of
various fault diagnostic approaches in identifying
these fuel cell faults can be fully investigated.


ABSTRACT: We have run experiments with 17 groups of three participants each to see how they assess
hazards, probability and the possibility of avoiding dangers. We considered two scenarios taken from ISO 15998-2.
We used a three-step process: each participant read the scenario and assessed the consequence, the probability of
the event and the possibility of avoiding the hazard. Then they wrote down their rationale for the assessments
and adjusted them. They then presented their rationales to the other two members and made final assessments.

We discuss the results: what rationale made participants change their assessments, and which param- 
eters were changed? As in earlier experiments, risks are overestimated. We need to improve the way we 
describe hazardous scenarios so that we can get more realistic risk assessments. 

There exist several guidelines for writing scenarios, and they are followed by ISO 15998-2, but this did
not seem to help. We suggest some new guidelines for scenario description.


1 INTRODUCTION 


Risk assessment is usually done by one or more
persons, based on a scenario description. No feedback
or rationale is needed. We would like to examine how
the assessment process works if the participants need
to write rationales for their decisions, and how the
assessment process is affected by hearing other
persons’ rationales.

By studying how the assessment process is influenced
by other people’s rationales, and which parameter
values are influenced, we can get a better
understanding of the involved persons’ assessment
process and come up with a better way to construct
scenarios and procedures for risk assessment.


2 RELATED WORK 


Much has been published on problems pertaining to
risk assessment. A large part of these papers appears
in the field of psychology. The papers cited below
are just a small sample of the published material.

Kahneman (2011) has done research on human reasoning.
One result is of special interest here: the strong
tendency always to assume the worst consequences. His
observation is worth quoting in full: “The danger is
increasingly exaggerated as the media compete for
attention-grabbing headlines. Scientists and others
who try to dampen the increasing fear and revulsion
attract little attention, most of it hostile: anyone
who claims that the danger is overstated is suspected
of association with a ‘heinous cover-up’”.

Sandeman et al. (1998) have done research on risk
communication and found that you get a better
assessment of risks by enabling the respondents to
relate them to a “normal” situation, i.e. a situation
that the respondents are already comfortable with.

Li and Wang (2010) report valuable experience 
with using the Delphi process in a risk analysis 
process. A more complete evaluation of the Delphi 
method used in risk assessment is given by Zaloom 
and Subhedar (2009) who report an experiment 
using the method to assess the likelihood of 10 
identified risk scenarios. Unfortunately, after the 
second round of the process, there was still only 
one risk scenario where the participants agreed and 
they ended up using the score averages as the result. 

Scenario construction is an important part of 
risk assessment. One way of writing and analysing 
scenarios is presented by Whitney and Thompson 
(2009). They argue that a useful scenario should 
contain information related to who, what, why, 
when, where and how. In addition, it is not enough 
to assess consequences, probability and how easy it 
is to avoid the danger. It is also necessary to assess 
the probability of the scenario. 

Last, but not least, it is important to remember that
all experiments and case studies in the domain of
hazard assessment have shown that there is no
significant difference between laypersons and
experts. See e.g. Rowe and Wright (2001) for an
extensive discussion of this topic.


3 THE EXPERIMENT 


3.1 The experiment layout 


The input to the experiment was two scenarios for
earth-moving machinery. These scenarios, and seven
others, have also been used earlier in a set of
experiments reported by Stalhane and Malm (2016). The
scenarios are taken from ISO 15998-2:2012. According
to the standard, the two chosen scenarios (later
referred to as case 4 and case 5) should be assessed
as needing performance level (PL) c and e,
respectively. The average probability of dangerous
failure per hour (1/h) is $\geq 10^{-6}$ to
$< 3 \times 10^{-6}$ for PL c and $\geq 10^{-8}$ to
$< 10^{-7}$ for PL e. The scenarios are described as
follows:


— Case 4: Articulated Wheeled Loader. Machine 
boom moves without command. Operator is not 
compelled to be in the operator station. Opera- 
tor may be greasing machine or otherwise near 
moving parts. Operator typically in harm’s way 
much less than 10% of time. If operator is near 
moving part, it may be very difficult to get away 
quickly enough to prevent injury. 

— Case 5: Articulated Wheeled Loaders < 40 km/h. 
Complete loss of Primary Steering and Emer- 
gency Steering (Either steers un-commanded or 
not at all while propelling). Operator has braking
to stop the machine. Operator is not warned
prior to loss of steering. Potential to hit higher
speed vehicle with multiple passengers.
Multi-passenger vehicles in the path of machine is
much less than 10% of time. Operator can stop
the machine. Vehicle may be able to avoid the 
loader. 


To check the quality of the two scenario
descriptions, we ran a set of readability checks
(Adobe 2007). Most of the readability scores, e.g.
the Gunning Fog index and the Flesch-Kincaid Grade
Level, gave approximately the same results for both
scenarios. The main difference was observed in the
Flesch Reading Ease index, where 70 to 60 indicates
that the text is easy to read while 50 to 40
indicates that it is difficult. Case 4 scored 41 and
case 5 scored 59, indicating that case 4 is more
difficult to understand.
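
For reference, the Flesch Reading Ease index is computed from the average sentence length and the average number of syllables per word. The sketch below uses the standard formula; the function name is hypothetical, and the word, sentence and syllable counts are made-up inputs, not the counts for case 4 or case 5, which are not reported here.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Standard Flesch Reading Ease score; higher values mean easier text."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical text: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(100, 5, 150)
print(round(score, 1))
```

The formula shows why short, telegraphic scenario texts with long technical words can still score as difficult: the syllable term dominates when sentences are short.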

The experiments were performed with 51 participants
in 17 groups with three persons in each group. Each
group did the experiment under the supervision of the
researcher. The participants were given the scenario
descriptions shown above and went through a
three-phase process, which was a simplified version
of the Delphi process; see Linestone and Turoff
(1975). The process used was as follows:


1. Phase 1: Read the scenario and the scoring rules 
for the following three parameters: event conse- 
quence (S), event probability (F) and the possi- 
bility of avoiding the danger caused by the event 
(P). Descriptions for scoring of the parameters 
were taken from ISO 13849-1:2015. Based on 
the scenario description, they decided which 
value to assign to each of the three parameters. 

2. Phase 2: Write down the rationale for the three
values assigned. During this process, they
were free to change their assessment if they
felt that the rationale did not fit their previous
assessments.

3. Phase 3: Read the rationales aloud to the rest
of the group. Each participant was then free to
change their scores. Finally, they wrote their new
assessments on the form and returned it to the
researcher. There was no pressure or incentive
for the group to reach a final agreement.


The process described above gave us a total of 51
answers, which were registered in an Excel sheet.
Each answer contained the participant’s assessment of
S, F and P for each phase, a total of 153 values. The
response set for each participant contains three sets
of assessed S, F and P values plus the rationales for
each choice.


3.2 The assessment model 


The assessment model, which leads from the risk
parameters (S, F and P) to a performance level (PL)
or safety integrity level (SIL), is defined by a
decision tree. For each of the risk assessment
parameters S, F and P, an assessor (e.g., a
participant in our experiment) can choose between two
values, as shown below:


— Severity of injury. S1 Slight injury — bruise. S2 
Severe injury — amputation or death. 

— Frequency of exposure to injury. F1 Seldom. F2 
Frequent to continuous. 

— Possibility of avoiding the hazard. P1 Possi- 
ble. P2 Less possible. Based on the speed of 
approach of the hazard and the ability of the 
operator to avoid the hazard. (If the operator 
can avoid the hazard you would choose P1). 
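
The decision tree can be expressed as a lookup table. The sketch below assumes the risk graph as it is commonly reproduced from ISO 13849-1, since the parameter descriptions were taken from that standard; the exact tree in ISO 15998-2 may differ and should be checked against the standard itself. The names RISK_GRAPH and required_pl are hypothetical.

```python
# Assumed mapping (S, F, P) -> required performance level, following the
# risk graph commonly reproduced from ISO 13849-1. Verify against the
# standard before relying on it.
RISK_GRAPH = {
    (1, 1, 1): "a",
    (1, 1, 2): "b",
    (1, 2, 1): "b",
    (1, 2, 2): "c",
    (2, 1, 1): "c",
    (2, 1, 2): "d",
    (2, 2, 1): "d",
    (2, 2, 2): "e",
}


def required_pl(s, f, p):
    """Return the required performance level for scores S, F, P in {1, 2}."""
    try:
        return RISK_GRAPH[(s, f, p)]
    except KeyError:
        raise ValueError("S, F and P must each be 1 or 2") from None


print(required_pl(2, 1, 1), required_pl(2, 2, 2))
```

Under this assumed mapping, an assessment of S = 2, F = 1 and P = 1 corresponds to PL c, and only S = 2, F = 2 and P = 2 leads to PL e.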


4 DATA ANALYSIS 


We will need to discuss whether differences are
statistically significant when discussing differences
between persons or groups. We have approached this by
using a simple approximation to the standard
uncertainty for probability differences. The equation
that we start with is

$d^2 = u_\alpha^2 \, p(1-p)/N$ (1)


Here p is the probability of the event under
consideration and N is the size of the sample. Since
p(1-p) is always less than or equal to 1/4 and
$u_\alpha$ for 5% significance is approximately 2, we
get the simple approximation


$d = 1/\sqrt{N}$ (2)


Thus, for differences between the 17 groups, the
uncertainty is 0.24 (4 groups) and for differences
between the 51 participants, it is 0.14 (7
participants).
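
The approximation can be reproduced in a few lines: taking the worst case p(1-p) = 1/4 and a quantile of about 2 gives an uncertainty of one over the square root of N. In the sketch below the function name max_uncertainty is hypothetical.

```python
import math


def max_uncertainty(n, u=2.0):
    """Worst-case uncertainty d = u * sqrt(p(1-p)/n), using p(1-p) <= 1/4."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return u * math.sqrt(0.25 / n)


for n in (17, 51):
    d = max_uncertainty(n)
    # d is a proportion; d * n is the corresponding number of groups/persons.
    print(n, round(d, 2), round(d * n))
```

For N = 17 this gives about 0.24 (roughly 4 groups) and for N = 51 about 0.14 (roughly 7 participants), matching the values stated above.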


4.1 An exploratory case 


The experiments were not done to accept or reject one
or more predefined hypotheses. Neither did we start
with a set of research questions. What we did was to
run the experiments to collect data and then perform
an exploratory data analysis, looking for what the
data might tell us. For this reason, the results are
not final. However, they will serve as a starting
point for hypotheses and subsequent experiments.
Throughout this paper, when we say “the data shows
that...” we mean “the data indicates that...”.

We got the following data from the experiment: (1)
the participants’ scores for S, F and P, (2) their
rationales for choosing the values, and (3) how these
values changed when they wrote down a rationale for
their decisions and shared this information with the
other persons in their group. Since we have the
rationales, we can connect them to the information
provided by the scenario descriptions, e.g. which
parts they used and what was used in addition.


4.2 Different processes—different results 


A total of nine cases, including case 4 and case 5
used in this experiment, have been assessed earlier;
see Stalhane and Malm (2016). Most of the assessments
for all cases, irrespective of the results required
by the standard, were S = 2, F = 1 and P = 1.

For case 4 we get S = 2 and F = 1, while the split
between P = 1 and P = 2 is 67% versus 33%.

When we introduced the need to write a rationale and
use it as feedback to the other participants in the
group, most of the assessments, both for case 4 and
case 5, changed to S = 2, F = 1 and P = 1 or P = 2.
As we see, the split between P = 1 and P = 2 is 44%
versus 56%. The results for case 5 are almost
identical.

The feedback cannot be the only cause of this change,
since the decision tree structure is almost the same
for phase 1 and phase 3. The only change for case 4
between phase 1 and phase 3 is that in phase 1 we
have 45% (S, F, P) = (2, 1, 2) and 55% (2, 1, 1),
while in phase 3 we have 57% (2, 1, 2) and 43%
(2, 1, 1). These differences are not significant at
the 5% level. Thus, the main reason for the changes
is the requirement that the participants must write
rationales. In addition, we should note that this
change only affects the P-parameter.


4.3 Documentation of the rationales used 


Different participants used various parts of the
scenario description when writing their rationales.
If they found the scenarios to lack information, they
used the information in the scenarios plus their own
experience to make their own rationales. Tables 1 and
2 show the information used from the scenarios, while
Tables 3, 4 and 5 show information used but not found
in the scenario descriptions.

Table 1. Part of scenario information used in case 4.

Case 4

Scenario info used S F P Sum
Machine boom moves without command 4 1 2 7
Operator may be near moving parts 17 9 15 31
Operator in harm’s way less than 10% of time 2 14 16
Near moving parts => difficult to get away quickly 3 | a5
Persons (#) 24 23 26
Persons (%) 48 48 52

Table 2. Part of scenario information used in case 5.

Case 5

Scenario info used S F P Sum
Speed less than 40 km/h 27 2 9 38
Complete loss of primary and secondary steering 11 5 16
Brakes are working OK 2 3 ?7
No warning prior to loss of steering 1 1
May hit high speed vehicles with multiple passengers 10 1 1 12
Multi-passenger vehicle in path less than 10% of time 1 10 11
Operator can stop machine 4 3 4 11
Vehicle may be able to avoid the loader 3 gis
Persons (#) 38 31 39
Persons (%) 73 61 75

Table 3. General information used to assess S in cases 4 and 5.

S: event consequences

Rationales Case 4 Case 5
Safety instructions 1
Crushed limbs 14 4
Machine turned off 1
Strong hydraulics 2
Heavy parts or machine 8 6
Person run over 3
Hit someone 3
Persons (#) 29 13
Persons (%) 57 25

Table 4. General information used to assess F in cases 4 and 5.

F: event probability

Rationales Case 4 Case 5
Happens seldom 8 11
Operator mindful of danger 10
Regular maintenance 2 2
Safety mechanisms 4
Harm other vehicles 3
Persons (#) 25 18
Persons (%) 52 36

Table 5. General information used to assess P in cases 4 and 5.

P: escape possibility from hazardous event

Rationales Case 4 Case 5
Safety instructions 5 2
Operator mindful of danger 7 1
Keep away from moving parts 1 1
Easy to avoid 5 2
Severe injuries 2 4
Cannot stop the machine 3
Persons (#) 24 13
Persons (%) 48 26

For case 5, we see that two thirds of the
participants used one or more terms from the scenario
descriptions when writing a rationale for their
choice of scores for S, F and P. As for case 4, there
are no statistically significant differences between
the three parameters at the 5% level. The differences
between the two cases are, however, statistically
significant at the 5% level. The only part of the
case 5 scenario that is not used is “Either steers
un-commanded or not at all while propelling”.

For case 4, we see that half the participants used
one or more terms from the scenario descriptions when
writing a rationale for their choice of scores for S,
F and P. There are no statistically significant
differences between the three parameters at the 5%
level. The only part of the scenario that is not used
is “Operator is not compelled to be in the operator
station”.

Several parts of the scenario descriptions are 
used for several parameters—see Tables 1 and 2. 
For case 4 the statement “Operator may be near 
moving parts” is used for all three parameters. 

Tables 3 to 5 show the rationales that are not taken
from the scenario descriptions. The terms used have
been extracted from the rationales and grouped under
a set of generic terms, e.g. all rationales related
to safety instructions and safety procedures have
been grouped under the term “Safety instructions”.

That three participants used “Cannot stop the 
machine” as a problem for case 5 is strange, since 
the scenario explicitly states that “The operator 
can stop the machine”. 

We went through all rationales to look for rationales
used for the wrong parameter, i.e. cases where the
whole or a part of the rationale indicates one
parameter while it is used for another. Examples of
wrong or misplaced rationales:


— S (P): If the operator is in the danger zone, he 
may be seriously injured if he is crushed by the 
machine 

— S (F): If an accident occurs with a big and heavy 
machine—e.g. get caught in moving parts—the 
probability for severe damages is large 

— F (P): The operator is not close to dangerous 
parts of the machine very often 

— P (F): Daily inspection of hydraulics will ena- 
ble us to discover problems before an accident. 
Routines 


When we went through all the rationales, we found the
results shown in Table 6.

From the table, we see that the most frequent wrong
or mixed rationales are related to F and P, while the
next most frequent are related to S and P.


4.4 Parameter changes 


The last piece of information that we need is the
changes: when and where. This is shown in Tables 7
and 8.

There are 41 changed assessments from phase 1 to
phase 2 and 32 changed assessments from phase 2 to
phase 3. The difference between the two cases is
statistically significant at the 5% level.

The agreement for all parameter assessments improved
from phase 1 to phase 3. However, the only
statistically significant differences are between
phase 1 and 3 for S and P in case 4. For case 5,
there are no statistically significant differences.

Table 6. Wrong/mixed parameter rationales for cases 4 and 5.

Case 4 Case 5
S P=3 S P=4
SP=7 SP=7
F P=20 F P=9
FP=13 FP=14
P S= P F=
Sum 44 Sum 35
% 28 % 22

Table 7. Parameter assessment changes in case 4.

Parameter             Phase 1 to Phase 2   Phase 2 to Phase 3
S                     5                    0
F                     3                    3
P                     7                    9
Sum                   15                   12
Percentage of total   30                   24

Table 8. Parameter assessment changes in case 5.

Parameter             Phase 1 to Phase 2   Phase 2 to Phase 3
S                     2                    0
F                     18                   18
P                     6                    2
Sum                   26                   20
Percentage of total   52                   40


4.5 Summary of the data analysis 


Based on the tables above, we formulate the follow- 
ing exploratory research questions—ERQ: 


1. Which parts of the scenario descriptions are 
used for the risk assessment? 

2. From where do the participants who do not use 
the scenarios take their information? 

3. Which assessments are stable and which do 
change? 

4. How often do the participants assess two or
more factors using the same rationale?

5. Based on the findings from these ERQ, how 
can we improve the assessment process and the 
standard’s scenario descriptions so that they are 
used in a more efficient way? 


ERQ1: Roughly 50% of the case 4 assessments and 75%
of the case 5 assessments used only terms from the
scenario descriptions; see Tables 1 and 2. The most
commonly used terms are:


— Case 4, S: “Operator may be near moving parts”, 
F: “Operator in harm’s way less than 10% of 
time” and P: “Near moving parts = > difficult to 
get away quickly” 

— Case 5, S: “Speed less than 40 km/h”, F: “Speed 
less than 40 km/h” and “Multi-passenger vehicle
in path less than 10% of time” and for P:
“Brakes are working OK”.


One possible reason why the case 5 assessments used
more scenario description terms than the case 4
assessments may be that the case 5 description is
longer: 76 words, while case 4 has only 58 words.

ERQ2: Those who used terms other than those found in
the scenario used terms related to the machine, the
environment and the machine’s behaviour. The same
terms are considered important for both cases, e.g.
“Crushed limbs” or “Heavy parts or machine” for
consequences.

ERQ3: For case 4, approximately a third of the
assessments have been changed. For case 5,
approximately half of the participants changed their
mind one or more times during the three phases. For
case 4, P is changed most often, while for case 5, F
is changed most often. The differences between case 4
and case 5 are statistically significant at the 5%
level.

ERQ4: We see from the data that more than half of all
assessment rationales for S, F or P mixed two
parameters. Most commonly, they mixed F and P.
However, a considerable number of the participants
also mixed S and P. The participants’ logic is clear
and can be split into the following:


— For S: since it is easy to avoid the hazard, the 
consequences are not severe 

— For F: since it is easy to avoid the hazard, it can- 
not happen so often 


ERQ5: Based on this summary, we suggest three
temporary conclusions:


— Differentiating between P and the two other 
parameters is difficult for the participants. The 
possibility to avoid the danger is used to influ- 
ence the assessment of the event frequency and 
the seriousness of the consequences. 

— A considerable part of the participants did not 
think the information given in the scenarios is 
sufficient or relevant for risk assessment. 

— Most participants always consider the worst 
consequences. 


These temporary conclusions lead us to the next
conclusion, namely that standards should use a
different scenario description and a different
decision tree to help the participants to arrive at
more consistent hazard assessments.

The terms used in the assessment rationales 
but not present in the case descriptions, are the 
following: 


— Case 4 and 5: crushed limbs (S), heavy parts (S), 
heavy machinery (S), happens seldom (F), safety
instructions (P), easy to avoid (P) 

— Case 4: operator mindful of danger (P) 

— Case 5: cannot stop the machine (P) 


Some of these rationales are simply derived from the
information found in the case description, e.g.
crushed limbs, heavy parts and heavy machinery. Such
words act as warning signal words (Hellier 2000) that
add emphasis to the case description. Others
contradict the case description, e.g. cannot stop the
machine.


5 ASSESSMENT CHANGES 


The most important observation related to the
assessments done in this experiment is the
misplaced assessment rationales, especially for F
and P. The discussions in Sections 4.2 and 4.3 lead
us to the conclusion that people are not good at
separating the probability of a dangerous event
from the possibility of avoiding the danger.

Thus, instead of the current three-level decision
tree used in the standard, we suggest that the
standard should use a two-level decision tree. As
before, S denotes the consequences of the
dangerous event, while P should be defined as the
probability that a person is harmed. This will
include both the event probability and the escape
opportunities. Based on this, we suggest the
following ratings:


— P=1: we will almost surely avoid the consequences
— P=2: we might avoid the consequences 
— P=3: we will most likely suffer the consequences 


As noted earlier, there are more changes to
assessments between phase 1 and phase 2 than
between phase 2 and phase 3. The significant
difference between the two transitions is that in
the first phase, each participant first wrote down
his assessment and then wrote down the rationale.

There are presumably two mechanisms at work
when a participant changes one or more parameter
assessments: (1) one or two of the other
participants use a negative term when describing
the rationale for their parameter choice, or (2) he
sees that the rationale he wrote does not support
his assessment. In many cases, both for laypersons
and for experts, the assessment is based on
intuition (see Freeman et al., 2012). The rationale
is written after the decision and is thus an attempt
to justify this decision (see e.g., Pigozzi et al.,
2009).

26 of the changes moved a parameter from 1 to 2
(a parameter was changed from 2 to 1 in only six
cases). In these six cases, the changes occurred
after phase 2 and in all cases the parameters were
changed back to 2 again in phase 3. Thus, in no
case did any participant's final assessment move
a parameter from 2 to 1. This agrees with the
general observation that one negative statement
wins over several positive ones (see e.g.,
Baumeister et al., 2001). The research of e.g.,
Shang et al. (2015) shows that “people are likely to
direct more attentional resources toward
high-hazard stimuli compared to low-hazard ones”,
thus supporting Baumeister’s conclusions.

The P parameter (probability of avoiding the
hazard) is the one changed most frequently: 11
times for case 4 and seven times for case 5. This
fits well with our observation that P is the
parameter that the participants had the most
problems with. The mixing of F and P assessments
gets even more problematic if we consider it
together with the effect of a negative assessment.
This may cause a negative P assessment to change
a positive F assessment to a negative assessment
and vice versa.


6 ANOTHER WAY TO DESCRIBE 
SCENARIOS 


6.1 The scenarios revisited 


If a scenario is to be useful in risk assessment, it
must include information on the event (what went
wrong), the possible consequences, and how we can
avoid or reduce these consequences. To check this
against the scenarios used here and the rationales
used by the participants, we first restructure the
scenarios so that each argument is separated and
tagged with the relevant part of the risk
assessment: S (event consequences), F (event
probability) or P (avoidance possibility). The
result is shown in the two lists below.

Case 4: Articulated Wheeled Loader


— Machine boom moves without command — S, F 

— Operator is not compelled to be in the operator 
station — P 

— Operator may be greasing machine or otherwise 
near moving parts — P 

— Operator typically in harm’s way much less than 
10% of time — P 

— If operator is near moving part, it may be very 
difficult to get away quickly enough to prevent 
injury — P 

Case 5: Articulated Wheeled Loader

— Speed < 40 km/h — S, P 
— Complete loss of Primary Steering and Emer- 
gency Steering (Either steers un-commanded or 
not at all while propelling) — F 

— Operator has braking to stop the machine — P 

— Operator is not warned prior to loss of steering — 
P 

— Potential to hit higher speed vehicle with multi- 
ple passengers — S 

— Multi-passenger vehicles in the path of machine 
much less than 10% of time — P 

— Operator can stop the machine — P 

— Vehicle may be able to avoid the loader — P 


The most important thing to note is that most 
statements/sentences are about getting away or 
preventing harm—four out of five for case 4 and 
five out of eight for case 5. Thus, there is little 
focus on the event probability (F) and much focus 
on the possibility for getting out of harm’s way (P). 
This is another factor that can explain the large 
portion of P-related rationales for assessment of 
the F parameter. 
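
The counts quoted above can be reproduced with a small tally. The following is only an illustrative sketch: the statement texts are abbreviated, the tags are transcribed from the two lists in this section, and the function names are hypothetical helpers, not part of the experiment.

```python
# Tags transcribed from the two lists in Section 6.1; statement texts are abbreviated.
CASE_4 = [
    ("Machine boom moves without command", {"S", "F"}),
    ("Operator not compelled to be in the operator station", {"P"}),
    ("Operator may be greasing machine or near moving parts", {"P"}),
    ("Operator in harm's way much less than 10% of time", {"P"}),
    ("Difficult to get away quickly enough to prevent injury", {"P"}),
]
CASE_5 = [
    ("Speed < 40 km/h", {"S", "P"}),
    ("Complete loss of primary and emergency steering", {"F"}),
    ("Operator has braking to stop the machine", {"P"}),
    ("Operator is not warned prior to loss of steering", {"P"}),
    ("Potential to hit higher speed vehicle with multiple passengers", {"S"}),
    ("Multi-passenger vehicles in path much less than 10% of time", {"P"}),
    ("Operator can stop the machine", {"P"}),
    ("Vehicle may be able to avoid the loader", {"P"}),
]


def count_exclusive(statements, parameter):
    """Return (statements tagged only with `parameter`, total statements)."""
    hits = sum(1 for _, tags in statements if tags == {parameter})
    return hits, len(statements)


def count_mentions(statements, parameter):
    """Return the number of statements whose tag set includes `parameter`."""
    return sum(1 for _, tags in statements if parameter in tags)


case4_p = count_exclusive(CASE_4, "P")
case5_p = count_exclusive(CASE_5, "P")
case4_f = count_mentions(CASE_4, "F")
case5_f = count_mentions(CASE_5, "F")
print("P-only statements:", case4_p, case5_p)
print("statements mentioning F:", case4_f, case5_f)
```

Counting only the statements tagged exclusively with P gives four of five for case 4 and five of eight for case 5, while F is mentioned by a single statement in each case.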


6.2 How were the scenarios used in the standard? 


The scenarios used in the standard can be used
in two ways. They can be considered as reference
scenarios: if your case is like this, then the PL
should be the same as the PL for this scenario in
the standard. The other way is to use them to
indicate which factors are important in the risk
assessment process (see Tables 1 and 2). Thus, the
information listed in Tables 3-5 indicates that the
participants found the information in the scenarios
to be wanting. Our experiments have shown that the
participants in many cases need more or different
information than what the scenarios provide. The
most used pieces of information that are neither in
the scenario descriptions nor derived from them are


— Consequences: person run over — used three times 

— Equipment descriptions: safety mechanisms — 
used four times 

— Operator description: mindful operator, safety 
instructions — used 20 times 


The escape or avoid part (P) of the risk
assessment process would be greatly improved if the
scenarios were accompanied by a preamble giving
information about operator courses, protective
gear and safety instructions.


6.3 A better way to construct scenarios 


A good method for constructing scenarios is the
GMA approach. However, using GMA may lead
to many possible scenarios and it is not always
clear how we can reduce this to a practical volume.
Another, event-driven method for constructing
scenarios is the STEP process (Hendrick, 1987).
However, this method is difficult to use with the
scenarios used in the relevant standard; e.g., it
is difficult to model statements such as “Operator
typically in harm’s way much less than 10% of time”.
A workable alternative is to use the work of
Whitney and Thompson (2009). They have suggested a
checklist for scenarios, which runs as follows:


— Who: groups, individuals and organizations 
involved 

— What: activity, objectives and targets 

— Why: motivation of person or organization

— When: triggering events, time requirements, 
opportunity 

— Where: city, building, institutions or road 

— How: approaches required related to technol- 
ogy, funding or know-how 


We might think that the quality of the scenarios
increases with the amount of available
information. However, Oskamp (1965) has published
both his own results and meta-studies that seem to
confirm that “Beyond some early point in the
information-gathering process, predictive accuracy
reaches a ceiling”. Oskamp’s studies are related to
psychological diagnoses but the results are general
enough to apply also to our case. Thus, we need to
keep the amount of information low so that we do
not water down the few, important pieces of
information.

Since we need to assess three parameters, we
need to provide sufficient information for the
assessment of each of them. Sticking to the
schema of Whitney and Thompson (2009), we
must decide which of the checklist items (up to
three) are most important for each parameter. We
suggest the following:

S: Event consequence: 


— Who is or could be involved? 
— What may happen or is happening? 


F: Event probability: 


— What may happen or is happening? 

— Where is the event happening? 

— How — which approach is taken or required to 
initiate the event? 


P: Possibility to escape consequences: 


— Who is or could be involved? 
— How — which approach is taken or required? 
— Where is the event happening? 


When we go through the two scenarios used in
our experiments, we see that both scenarios lack
information needed to assess probability, i.e., how
probable this scenario is.

If we apply the checklist of Whitney and
Thompson to the two cases we have considered in
this experiment, we get the following results:

Case 4: Articulated Wheeled Loader


— Machine boom moves without command — S, F
What

— Operator is not compelled to be in the operator 
station. Not used during assessment. 

— Operator may be greasing machine (implies
“near moving parts”) or otherwise near moving
parts — P What, Who

— Operator typically in harm’s way much less than 
10% of time — P Who, When 

— If operator is near moving part, it may be very 
difficult to get away quickly enough to prevent 
injury — P Who, Where, When 


Case 5: Articulated Wheeled Loader 


— Speed < 40 km/h — S, P What 

— Complete loss of Primary Steering and Emer- 
gency Steering (Either steers un-commanded 
or not at all while propelling — not used during 
assessment) — F What 

— Operator has braking to stop the machine—P 
Who, What 

— Operator is not warned prior to loss of steering 
— P Who, What, When 

— Potential to hit higher speed vehicle with multi- 
ple passengers — S What, Who 

— Multi-passenger vehicles in the path of machine 
much less than 10% of time — P What, Where, 
When 

— Operator can stop the machine — P Who, What 

— Vehicle may be able to avoid the loader — P What 


Note that F and P share the keywords “where”
and “how”. In addition, S lacks “who” information,
which is important to understand the
consequences. In both cases, F lacks information on
“where” and “how”, which are important to
understand the failure mechanism and thus assess the
event probability. P lacks information on “what”
and “how”, which makes it difficult to assess the
escape probability.
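
A coverage check of this kind can be sketched as a set difference between the keywords required for each parameter and the keywords the scenario statements supply. The sketch below is an illustration under stated assumptions: the required keywords follow the S/F/P lists suggested earlier in this section, the case 4 tags are read literally from the list above, and all names are hypothetical. Because the outcome depends on how each tag is read (for P in particular), the output is not meant to reproduce the tallies in the text exactly.

```python
# Checklist keywords suggested in this section for each parameter.
REQUIRED = {
    "S": {"who", "what"},
    "F": {"what", "where", "how"},
    "P": {"who", "how", "where"},
}

# Case 4 tags read literally from the list above: (parameters, keywords).
# The statement marked "not used during assessment" is left out.
CASE_4_TAGS = [
    ({"S", "F"}, {"what"}),
    ({"P"}, {"what", "who"}),
    ({"P"}, {"who", "when"}),
    ({"P"}, {"who", "where", "when"}),
]


def covered_keywords(tagged_statements):
    """Collect, per parameter, every checklist keyword some statement supplies."""
    covered = {parameter: set() for parameter in REQUIRED}
    for parameters, keywords in tagged_statements:
        for parameter in parameters:
            covered[parameter] |= keywords
    return covered


def missing_keywords(tagged_statements):
    """Return the required keywords that no statement supplies, per parameter."""
    covered = covered_keywords(tagged_statements)
    return {p: REQUIRED[p] - covered[p] for p in REQUIRED}


missing_case4 = missing_keywords(CASE_4_TAGS)
shared_f_p = REQUIRED["F"] & REQUIRED["P"]
for parameter in sorted(missing_case4):
    print(parameter, "lacks", sorted(missing_case4[parameter]))
print("F and P share", sorted(shared_f_p))
```

Keeping the required keywords in one table makes it easy to rerun the check after a scenario description has been extended, e.g., with a “where” statement.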

Tables 3, 4 and 5 show the information added
by the participants during their assessment. The
information added is the same for both cases. If
we apply Whitney and Thompson’s keywords to
the information in Tables 3, 4 and 5, we find
that for S we can add “what” and “who”, while for
F and P we can add “when”, “who” and “what”.
Thus, for both cases we still miss “how” for F and
P and “where” for P.

Suggested additions to take care of the “where” 
aspects for the two cases are e.g., “on a construction 
site” for case 4 and “on a public road” for case 5. 

An additional challenge when writing scenarios
is the choice of words. Our choice will have an
important influence on how much and what kind of
attention each part of a scenario description will
get. In the end, it will also decide the assessment
of PL or SIL.


7 THREATS TO VALIDITY 


The main threats to validity are the small sample 
and participant motivations. We will give a brief 
discussion on how these two threats may influence 
our temporary conclusions. 

We chose the two cases 4 and 5 for the
experiment to get a large distance in protection
levels. Case 4 should have PL c while case 5 should
have PL e, both according to ISO 13849-1:2015.
However, just two cases are far too small a sample
for making a statistically significant statement
about the Delphi method when it comes to PL
assessment. On the other hand, 51 participants
assessing two cases is still enough to say something
about the general trends, such as mixing P and F,
and the influence of the general observation that
“bad is stronger than good”.

The question of the participants’ motivation
is more serious. People who do this for real have a
strong incentive to get it right, while the students
are just motivated to get the job done because that
is what they promised the researcher. Thus, we
cannot be sure that they “really put their soul into it”.

Even so, it is our opinion that the data are suffi- 
cient for doing some exploratory data analysis and 
to come up with some issues that should be consid- 
ered more seriously in future research. 


8 CONCLUSIONS 


Based on the results of this small experiment, we 
have tentatively identified the following poten- 
tial problems with the way hazard analyses are 
performed: 


— Whatever the scenario description, people
mostly go for the worst case because “it just
might happen” and “Bad is stronger than good”.
Thus, we should not use the Delphi process
to assess risks. Relevant empirical data might
improve this situation.

— Many persons have problems with keeping sepa- 
rate the dangerous event probability and the 
probability of avoiding harm. 

— Those who write scenarios for risk assessment,
e.g., for standards, should consider available
rules for writing efficient scenarios, e.g., the
rules of Whitney and Thompson (2009) and
readability checks (Adobe, 2007).


These problem areas need to be addressed when 
writing a scenario intended for risk assessment. 
The first problem is the most serious and difficult 
one because it is related to psychology and not to 
any technical problem. The two other ones can be 
solved by providing relevant information to the 
assessors. 

ABSTRACT: The inherently complex nature of risk interdependencies in construction projects, coupled
with incomplete data records during project development, often results in inaccurate assessments.
This paper showcases the use of neural networks for risk assessment in construction projects. A detailed
literature review identifies the different types and training methods of neural networks as well as the
respective tools applicable to construction project risk management. Based on these findings, the paper
presents the development of a specific neural network that partially assesses occupational risk in a
construction engineering project. The proposed neural network is trained with metadata from previous risk
assessments. The modeling of the network is realized through two software tools, in order to identify
potential difficulties in the modeling process as well as potential deviations in the assessments’ outputs.
The main conclusion is that neural networks are reliable for conducting risk assessments that realistically
integrate risk interdependencies in complex problems.


1 INTRODUCTION 


Neural Networks (NNs) are inspired by the
biological neural network of the human brain
(Haykin, 2008). According to Haykin (2008): “A Neural
Network is a massively parallel distributed
processor made up of simple processing units that has a
natural propensity for storing experiential
knowledge and making it available for use. It resembles the
brain in two respects: 1) Knowledge is acquired by
the network from its environment through a learning
process. 2) Interneuron connection strengths, known
as synaptic weights, are used to store the acquired
knowledge.”

Three basic types of NN architecture are
recognized (Haykin, 2008): 1) single-layer NN,
2) Multi-Layer NN (MLNN) and 3) recurrent
NN. The learning processes through which NNs
function can be categorized as (Haykin, 2008):
1) supervised learning, 2) reinforcement
learning and 3) unsupervised learning. Typical
problems that NNs are capable of dealing with
are fitting (or function approximation), pattern
recognition and association, and clustering and
prediction, all applicable to many fields
such as automotive, financial, medical, robotics,
telecommunications and management (Prieto
et al., 2016).


The ability to use NNs either individually or in
combination with other Artificial Intelligence (AI)
techniques, such as Expert Systems, in the
construction industry and construction project
management has been recognized since the beginning of
the 1990s (Moselhi, 1991). Currently, AI-based
alternatives to traditional methods, such as NNs,
are widely applied in the engineering and
construction industry (Paliwal & Kumar, 2009) and in a
variety of ways in construction project management,
such as assessing project success and identifying
critical success factors of the project, planning,
estimating time and cost, and managing risks
(Magaña Martinez & Fernandez-Rodriguez, 2015).

Construction projects are characterized by
complexity and incomplete data records during
development, which affects the capacity of a construction
organization to conduct a credible risk assessment
for every new project in hand. Furthermore, the
inherently complex issue of risk
interdependencies has so far led to approximate assessment
approaches or to assessments based on simplistic
assumptions that only remotely represent
reality. This paper proposes the use of NNs for risk
assessment in construction projects through the
presentation of a specific application of a NN that
partially assesses occupational risk in a
construction engineering project.




2 METHODOLOGY 


The paper first presents very briefly the basic
characteristics of NNs to demonstrate the
advantages of using them for risk assessment. Then, it
investigates the existing applications of NNs in
construction project risk management, in order to
document their appropriateness for use in complex
nonlinear problems such as construction projects.
To this end, a detailed literature review identifies
the different types and training methods of NNs
that are applicable in construction project risk
management as well as the respective tools that
are available for real applications.

Based on the findings from the literature review,
the paper then presents a specific application of a
neural network that partially assesses occupational
risk in a construction engineering project. The
training of the proposed NN is based on metadata
retrieved from a compilation of available data from
previous risk assessments. The modeling of the
network is realized with the help of two different
software tools, namely Palisade’s NeuralTools
(Palisade Corporation, 2015) and MathWorks’ Neural
Network Toolbox, in order to identify potential
difficulties in the modeling process as well as
potential deviations in the assessments’ outputs.


3 REVIEW OF NEURAL NETWORK 
APPLICATIONS IN CONSTRUCTION 
PROJECT RISK MANAGEMENT 


As mentioned in Section 1, the ability to use NNs in
the construction industry and construction project
management has been recognized since the
beginning of the 1990s. One of the first NN applications
in construction project management is authored by
Boussabaine (1996), who recognized the
usefulness of NNs in construction project management
and in risk analysis specifically. In construction
project risk management in particular, previous NN
applications focus on various topics, as presented
in Table 1. It is noted that more applications
than those presented in Table 1 were identified on
the analysis of claims from legal disputes as well as
of contractor’s performance.

In construction project risk management, the
first application of NN was probably carried out
by Sanchez (2005) with the aim of quantifying
total risk in economic terms. Wen (2010)
developed a total risk assessment model, embedded
with NN, Genetic Algorithm (GA) and Rough Set
Theory (RST) techniques, for construction
projects. Zhu et al. (2011) implemented NN
to perform analysis and evaluation of project cost
risk and identification of critical factors. Chenyun
& Zichun (2012) developed a Back-Propagation
(BP) NN to assess risks in the construction phase
of an expressway.

Concerning the assessment of claim causes,
Al-Sobiei et al. (2005) developed two models for
predicting the risk of contractor default in con- 
struction projects utilizing NN and GA tech- 
niques. Chau (2007) predicted the outcome of 
construction claims through the adaptation of a 
Particle Swarm Optimization (PSO) NN, trained 
with data from cases and past court decisions. 
Hosny et al. (2015) developed a NN-based predic- 
tive and decision-awareness framework for con- 
struction claims using backward optimization. 
Gholhaki et al. (2016) used Radial Basis Function 
(RBF) NN to predict claims’ causes and control 
and minimize claims. 

Manik et al. (2008) investigated the use of 
NNs to predict pavement construction payment- 
risk based on the quality of the construction. El- 
Sawalhi et al. (2008) developed a hybrid BP NN 
and GA model for predicting contractor’s per- 
formance in terms of cost, time and quality in a 
process of contractor’s prequalification. Jin & 
Zhang (2011) used NNs to model risk allocation 
decision-making process in PPP projects, mainly 
drawing upon transaction cost economics. Goh 
& Chua (2013) performed NN analysis in quanti- 
fied occupational safety and health management 
system audit with accident data obtained from the 
Singaporean construction industry to predict acci- 
dents and identify critical factors. 

Gajzler (2013) developed a method for support- 
ing the decision-making process for the selection 
of materials and technology for repairing indus- 
trial building floors using Knowledge-based NN, 
Hybrid model of BP NN, RBF NN and Fuzzy 
Logic (FL). Gajzler & Konczak (2015) investigated 
the possibility of applying NN in the analysis of 
observational data on the issue of simulation con- 
crete supplies in the construction industry. Patel & 
Jha (2016) developed an application of BP NN for 
the prediction and evaluation of employees’ work 
behavior in construction projects using the
constructs of the safety climate. Shahrara et al. (2016)
used NNs to model the relationship between
important project parameters and risk variables in
the process of negotiating the financial parameters
and uncertainties of a Build-Operate-Transfer
project.

Regarding the software programs used in the
abovementioned studies, the Neural Network
Toolbox of MathWorks’ Matlab was used by Manik
et al. (2008), Wen (2010), Jin & Zhang (2011), Li
et al. (2012), Chenyun & Zichun (2012), Shahrara
et al. (2015), and Gholhaki et al. (2016). Haykin
(2008) also used it for the example applications of
his work. Al-Sobiei et al. (2005) and Goh & Chua
(2013) deployed NeuroShell Predictor and
NeuroShell2 respectively. El-Sawalhi et al. (2005) used
Neuro Genetic Optimizer for their research.
Gajzler & Konczak (2015) used STATISTICA Data
Miner.

Table 1. Applications of NN in construction project risk management.

Author/Researcher | Year | Innovative idea | Risk management processes
----------------- | ---- | --------------- | -------------------------
Sanchez | 2005 | NN application in construction project risk management with the aim of quantifying total risk in economic terms. | Quantitative risk analysis
Al-Sobiei et al. | 2005 | Development of two models for predicting the risk of contractor default in construction projects utilizing NN and GA techniques. | Qualitative risk analysis / Risk response analysis
Chau | 2007 | Adoption of a PSO NN model, provided with characteristics of cases and the corresponding past court decision, for predicting the outcome of construction claims. Comparison with BP NN model. | Qualitative risk analysis / Risk response analysis
Manik et al. | 2008 | Investigation of the use of NN to predict pavement construction payment-risk based on the quality of the construction. | Qualitative risk analysis / Risk response analysis
El-Sawalhi et al. | 2008 | Development of hybrid BP NN and GA model for predicting contractor’s performance in terms of cost, time and quality in a process of pre-qualification. | Qualitative risk analysis / Risk response analysis
Wen | 2010 | Development of a total risk assessment model, embedded with NN, GA and RST techniques, for construction projects. | Quantitative risk analysis / Risk response analysis
Zhu et al. | 2011 | Analysis and evaluation of project cost risk and identification of critical factors based on BP NN. | Quantitative risk analysis
Jin & Zhang | 2011 | Modeling risk allocation decision-making process in PPP projects, mainly drawing upon transaction cost economics, using NNs. Comparison with multiple regression technique. | Risk response analysis
Chenyun & Zichun | 2012 | Application of BP NN to assess expressway construction phase risk. | Quantitative risk analysis
Goh & Chua | 2013 | NN analysis in quantified occupational safety and health management system audit with accident data obtained from the Singaporean construction industry in order to predict accidents and identify safety critical factors. | Qualitative risk analysis / Risk response analysis
Gajzler | 2013 | Developing a method for supporting the decision-making process for the selection of materials and technology for repairing industrial building floors using Knowledge-based NN. Hybrid model of BP NN, RBF NN and FL. | Qualitative risk analysis / Risk response analysis
Gajzler & Konczak | 2015 | Investigating the possibility of applying NN in the analysis of observational data on the issue of simulation concrete supplies in construction industry. | Qualitative risk analysis / Risk response analysis
Hosny et al. | 2015 | Development of a NN-based predictive and decision-awareness framework for construction claims using backward optimization. | Risk identification / Quantitative risk analysis / Risk response analysis
Patel & Jha | 2016 | Application of BP NN for prediction and evaluation of employees’ work behavior in construction projects using the constructs of the safety climate. | Qualitative risk analysis / Risk response analysis
Shahrara et al. | 2016 | Modeling the relationship between important project parameters and risk variables in the process of negotiating the financial parameters and uncertainties of a BOT project with the use of NN. | Qualitative risk analysis / Risk response analysis
Gholhaki et al. | 2016 | Application of RBF NN to predict claims’ causes and control and minimize claims. | Risk response analysis

Other software programs that were used in the
general field of construction project management
are NeuralWorks Professional II/Plus,
Neurosolutions, SPSS and FANN Library.

A study of software use in NNs showed that
Matlab is the most widely used software for NN
implementation, chosen by 28% of users
(Baptista et al., 2013). The reasons for this
preference are that it is a complete, flexible and
easy-to-program software that has strong and
fast computational power and several types of
NNs available (Baptista et al., 2013). In the same
study, “self-created code” (24%) and “other
software” (36%), including NeuroSolution (1%),
also account for large percentages. Additionally,
Matlab was rated as responding best to the needs
of the NN research community (29%), while SPSS
accounts for 1% of the answers to the question
“what other software are you using”
(Baptista et al., 2013).

It is evident that NNs can be applied in a
variety of ways in construction project risk
management. Applications primarily address qualitative
and quantitative risk analysis and risk response
analysis, approaching the problem either as
function approximation or as pattern recognition.
In most cases, the use of NNs produces credible
results, which is the reason for suggesting the use
of NNs in construction project risk management.
The BP algorithm has a prominent place in
supervised learning and in the development of
robust and credible models. Furthermore, the
combination of NNs with other techniques seems
to present even more potential in the development
of such models. Research into the implementation
of NNs in construction project risk management is
still a dynamic scientific field, given the small
number of studies published in the field over the
last decade.


4 DEVELOPMENT OF A NEURAL 
NETWORK MODEL FOR RISK 
ASSESSMENT IN CONSTRUCTION 
PROJECTS 


4.1 General context 


A common and acceptable formula of risk
assessment is shown in Equation (1):

$$R = P \times S \tag{1}$$

where R = risk; P = probability index; and S =
severity of harm index or importance of effect index.

For several risk factors i with different
probability of occurrence and consequences, Equation (1)
is extended to the forms shown in Equations (2)
and (3):

$$R_i = P_i \times S_i, \quad i = 1, 2, \ldots, n \tag{2}$$

$$R = [R_1, R_2, \ldots, R_n] = [P_1 \times S_1, P_2 \times S_2, \ldots, P_n \times S_n] \tag{3}$$

Assuming that each risk $R_i$ is independent of
other risks, the total risk R can be assessed
according to Equation (4):

$$R = \sum_{i=1}^{n} R_i = P_1 \times S_1 + P_2 \times S_2 + \ldots + P_n \times S_n \tag{4}$$


Nevertheless, considering individual risks as
independent and using their algebraic sum to
assess total risk is an extremely simplistic
approach to the mathematical modeling of risk.
The complex and often unpredictable nature of
construction projects and their risks inevitably
contributes to complex nonlinear risk
interdependencies.
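
The additive model of Equations (2) and (4) can be written out in a few lines. This is a minimal sketch: the function names and the index values for the three risk factors are hypothetical and are not data from the study.

```python
def individual_risks(probabilities, severities):
    """Equation (2): R_i = P_i x S_i for each risk factor i."""
    if len(probabilities) != len(severities):
        raise ValueError("need one probability and one severity per risk factor")
    return [p * s for p, s in zip(probabilities, severities)]


def total_risk(probabilities, severities):
    """Equation (4): total risk as the sum of independent individual risks."""
    return sum(individual_risks(probabilities, severities))


# Hypothetical index values for three risk factors (not data from the paper).
P = [0.1, 0.3, 0.05]
S = [4.0, 2.0, 10.0]

risks = individual_risks(P, S)
total = total_risk(P, S)
print("individual risks:", risks)
print("total risk:", total)
```

Because each term depends only on its own P and S, the sum cannot express any interaction between risk factors, which is the limitation discussed above.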

NNs can approximate any function for the
assessment of total risk in construction projects,
provided that a sufficient volume of valid and
reliable historical data is available (Haykin, 2008);
however, as already mentioned, such a historical
record is lacking in most construction projects, as
former research efforts on applications of NNs in
construction project risk management evidently
show.

A NN trained with data from projects similar to
the project in hand can capture the inherent
relation between the individual risks and the total
risk of the projects. This is possible through
appropriate modeling of the project’s development
parameters in combination with accurate historical
data on previous risk occurrences and their impact.
A properly designed NN can approximate the value
of the total risk in an indirect, yet practical
manner: although the historical record used to train
the network does not present the interdependencies
between risks analytically, it inherently contains
them. The trained NN therefore possesses the
respective knowledge on this aspect (i.e. total risk),
and its assessment can be considered reliable and
credible. The historical data can also serve for: a)
deriving distributions of the probability of
occurrence of each risk, and b) quantifying each risk’s
severity index. Both are useful in assessing the total
risk of construction projects using modern
computational tools such as NNs.
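
The idea of learning total risk from records that only implicitly contain the interdependencies can be sketched with a very small network. The example below is an illustration under stated assumptions and is not the model developed in this paper: it is a pure-Python stand-in for the software tools named in Section 2, the training records are synthetic, and the interaction term in `make_data` is a made-up interdependency chosen only to make the target non-additive.

```python
import math
import random


def make_data(n_samples, rng):
    """Synthetic records: three individual risks in [0, 1] and a total risk
    that includes an interaction term (a purely illustrative assumption)."""
    data = []
    for _ in range(n_samples):
        r = [rng.random() for _ in range(3)]
        total = sum(r) + 0.5 * r[0] * r[1]  # non-additive interdependency
        data.append((r, total))
    return data


class TinyMLP:
    """One hidden tanh layer and a linear output, trained by backpropagation."""

    def __init__(self, n_in, n_hidden, rng):
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        """Return the hidden activations and the scalar output for input x."""
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        output = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return hidden, output

    def train_step(self, x, target, lr):
        """One stochastic gradient step on the squared error of one record."""
        hidden, output = self.forward(x)
        err = output - target  # gradient of 0.5 * err**2 w.r.t. the output
        for j, h in enumerate(hidden):
            grad_h = err * self.w2[j] * (1.0 - h * h)  # back through tanh
            self.w2[j] -= lr * err * h
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * grad_h * xi
            self.b1[j] -= lr * grad_h
        self.b2 -= lr * err


def mse(model, data):
    """Mean squared error of the model's total-risk estimates on data."""
    return sum((model.forward(x)[1] - t) ** 2 for x, t in data) / len(data)


rng = random.Random(0)
train = make_data(200, rng)
model = TinyMLP(n_in=3, n_hidden=5, rng=rng)
mse_before = mse(model, train)
for _ in range(50):  # 50 passes over the synthetic records
    for x, t in train:
        model.train_step(x, t, lr=0.05)
mse_after = mse(model, train)
print("MSE before training:", mse_before)
print("MSE after training:", mse_after)
```

The network is never told about the interaction term; it only sees individual risks paired with totals, which mirrors the argument that the historical record carries the interdependencies implicitly.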


4.2 Data compilation for training the neural 
network 


Occupational safety and health is a key factor
in the successful delivery of construction
projects. Proper assessment of occupational risk
in the project’s design stage can significantly
contribute to avoiding accidents (Pinto et al., 2011).
The importance of occupational risk assessment is
evidenced by the increasing rate of published safety
and health studies in the construction sector and
the development of relevant risk assessment
models (Zhou et al., 2015).
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The training of the proposed NN is based on 
metadata retrieved from a compilation of available 
data from previous risk assessments. Available data 
concern the possible risk sources that may occur 
during the execution of construction projects, as 
well as the ranges of their probability of occurrence 
(P) and their severity of harm (S) (Argyriou, 2016). 
Risk sources are categorized into general risk cate- 
gories based on the grouping of risk sources accord- 
ing to the Labor Inspection Body of Greece. In 
Table 2, the risk range R, = [Rmin Rinacl = LPinin® Simic 
P nax Simaxd for each source of one category and the 
total risk range R, = [ZR nine Ring] of the category 
are evaluated, using Equations (2) and (4) and the 
minimum and maximum values of P and S, and 
presented. Totally, 300 scenarios of construction 


Table 2. An example of a risk category, risk sources and 
risk ranges. 


Category        Risk source                                              Risk range
Fall of person  Fall from height: moving ladder                          0,182-3,171
                Fall from height: steady ladder                          0,182-3,171
                Fall from height: stairs or steps                        0,061-3,171
                Fall from height: moving scaffolding or staging          0,091-3,171
                Fall from height: steady scaffolding or staging          0,091-3,171
                Fall from height: assembly/disassembly of scaffolding    0,091-3,171
                Fall from height: roof                                   0,151-3,171
                Fall from height: floor                                  0,061-3,171
                Fall from height: platform                               0,151-3,171
                Fall from height: through floor openings                 0,182-3,171
                Fall from height: mobile platform                        0,182-3,171
                Fall from height: moving vehicle                         0,182-1,269
                Fall from height: work at height without protection      0,091-2,230
                Fall at the same height: sliding / obstacle impact       0,061-2,262
                Fall down by stairs or ramp                              0,242-2,262
Total risk range                                                         2,001-42,904


project risks within the ranges described above were randomly
generated. Each scenario was assigned 100 random values of three
overlapping ranges. Values were drawn from a uniform distribution
for both the individual risks and the total risk. Overlapping
ranges were used to reflect reality more credibly, since they
introduce more interdependencies into the modeling of a NN that
assesses total risk. Two similar data sets, built as described
above and depicted in Figure 1, were used for training the NN.
They differ in the extent of the overlap between risk ranges: the
second data set introduces a more extensive overlap. The reasoning
behind this differentiation was to involve more individual risks
in the scenarios, thereby reflecting the complex and nonlinear
relationships found between risks in construction projects.
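
The aggregation behind the last row of Table 2 can be illustrated with a short Python sketch. It assumes that the total risk range of a category is obtained by summing the per-source minima and maxima, which is what the tabulated values suggest; the names `RISK_RANGES` and `total_risk_range` are illustrative, and decimal commas are written as points.

```python
# Risk ranges (min, max) of the 15 risk sources of the "Fall of person"
# category, copied from Table 2 (decimal commas written as points).
RISK_RANGES = [
    (0.182, 3.171),  # fall from height: moving ladder
    (0.182, 3.171),  # fall from height: steady ladder
    (0.061, 3.171),  # fall from height: stairs or steps
    (0.091, 3.171),  # fall from height: moving scaffolding or staging
    (0.091, 3.171),  # fall from height: steady scaffolding or staging
    (0.091, 3.171),  # fall from height: assembly/disassembly of scaffolding
    (0.151, 3.171),  # fall from height: roof
    (0.061, 3.171),  # fall from height: floor
    (0.151, 3.171),  # fall from height: platform
    (0.182, 3.171),  # fall from height: through floor openings
    (0.182, 3.171),  # fall from height: mobile platform
    (0.182, 1.269),  # fall from height: moving vehicle
    (0.091, 2.230),  # fall from height: work at height without protection
    (0.061, 2.262),  # fall at the same height: sliding / obstacle impact
    (0.242, 2.262),  # fall down by stairs or ramp
]


def total_risk_range(ranges):
    """Return (sum of minima, sum of maxima), rounded to three decimals."""
    total_min = sum(low for low, _ in ranges)
    total_max = sum(high for _, high in ranges)
    return round(total_min, 3), round(total_max, 3)


print(total_risk_range(RISK_RANGES))  # (2.001, 42.904)
```

The result reproduces the total risk range reported in Table 2, which supports reading the total as a plain sum over the risk sources of the category.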


4.3 Neural network tools selection 


Several software products were investigated for modeling the
developed NN. To identify potential difficulties in the modeling
process, as well as potential deviations in the assessment
outputs, two different software tools were tested. The first
choice was the Neural Network Toolbox of MathWorks' MATLAB, which
is considered the most widely used software in NN modeling
(Baptista et al., 2013).

The second choice was Palisade's NeuralTools, because of its ease
of use as an extension of the widely used Microsoft Excel and
because it can be combined with other management-oriented Palisade
software (e.g. @Risk). A final reason for selecting NeuralTools
was that the literature review revealed its limited use in
research studies, in contrast to its wide application in practice.
This finding motivated the application of NeuralTools in order to
gain insight into, and draw useful conclusions about, its
appropriateness for NN modeling.


Figure 1. Data sets for training the NN. Panels: Data set 1, Data set 2.


4.4 Neural network training and results 


Supervised learning is the learning process used to train the
MLNNs developed in both NeuralTools and Neural Network Toolbox.
MLNNs were chosen for their ability to approximate any function
satisfactorily, thanks to their speed of learning and the plethora
of options for their training, as long as sufficient and reliable
data are available. NeuralTools uses the conjugate-gradient
back-propagation algorithm exclusively, whereas Neural Network
Toolbox uses the Levenberg-Marquardt back-propagation algorithm by
default and allows it to be changed. In this research the default
option was selected.

The first data set was randomly split into 85% for training and
15% for testing the MLNN in NeuralTools. A 15-4-1 MLNN was finally
chosen as the one with the best performance. The available
stopping criteria for training are a) training time, b) number of
trials and c) a decrease of the training error below a
user-defined limit over a period of time; the criteria can also be
combined. In this case, the training time was set to two hours,
the number of trials to 10.000.000 and the error decrease to less
than 1% over five minutes. The third criterion was the one that
stopped the training. The performance measures were: a) Root Mean
Square Error (RMSE) equal to 3,873 and 5,046, b) Mean Absolute
Error (MAE) equal to 3,019 and 4,261 and c) Standard Deviation of
Absolute Error (SDAE) equal to 2,426 and 2,703 for the training
and testing samples, respectively. Then, a regression analysis
between the results of the network and the desired output was used
to evaluate the reliability of the data. For that purpose,
Palisade's StatTools was used and the results are depicted in
Figure 2. The values of R (0,8920) and R² (0,7957) suggest a good
relationship between results and desired outputs. The accumulation
of extreme prediction values of the testing sample at one value,
as seen in Figure 2, possibly indicates that the test data fall
outside the range of the training and validation data and that the
network may be extrapolating. A better separation of the data set
could lead to more accurate and reliable results.
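
The performance measures named above can be sketched in plain Python as follows. This is a minimal illustration, not the computation performed by NeuralTools or StatTools: the function names are hypothetical, the sample values are invented, and SDAE is assumed to be the sample standard deviation of the absolute errors.

```python
import math


def error_metrics(predicted, target):
    """Return (RMSE, MAE, SDAE) for two equal-length sequences of numbers."""
    if len(predicted) != len(target) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    abs_err = [abs(p - t) for p, t in zip(predicted, target)]
    n = len(abs_err)
    rmse = math.sqrt(sum(e * e for e in abs_err) / n)
    mae = sum(abs_err) / n
    # Assumption: SDAE is the sample standard deviation (n - 1 denominator).
    sdae = math.sqrt(sum((e - mae) ** 2 for e in abs_err) / (n - 1))
    return rmse, mae, sdae


def pearson_r(predicted, target):
    """Pearson correlation coefficient R between predictions and targets."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


# Hypothetical values, for illustration only (not data from the study).
predicted = [1.0, 2.0, 3.0, 4.0]
target = [1.0, 2.0, 3.0, 6.0]
rmse, mae, sdae = error_metrics(predicted, target)
r = pearson_r(predicted, target)
print(rmse, mae, sdae, r, r ** 2)
```

For a regression with a single predictor, R² is the square of R, which is consistent with the reported pair of values: 0,8920 squared is approximately 0,7957.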

Figure 2. Regression analysis for the testing sample of
15-4-1 NN (NeuralTools).


The same 15-4-1 MLNN was then developed with Neural Network
Toolbox. The data set was randomly divided into three samples: the
training sample (70%), the validation sample (15%) and the test
sample (15%). The stopping criterion used in this case was an
increase of the validation sample error for six iterations. An
increase of the validation sample error indicates the NN's
inability to “explain” the data and, in conjunction with a
simultaneous reduction of the training sample error, is a sign of
memorization of the training sample. As a result, generalization,
which is the NN's ability to deal adequately with new data, cannot
be achieved. The NN was trained for 11 epochs, while the best
performance was achieved in the fifth epoch. The performance
measure was RMSE, equal to 4,362, 5,483 and 4,968 for training,
validation and testing, respectively. Regression analysis between
results and desired outputs suggests a good relationship between
them for all three samples, as depicted in Figure 3. The results
obtained with the two software tools are relatively close.
Finally, a multiple regression analysis was carried out with
StatTools in order to compare the NNs' results with those of a
common and widely used traditional method. Aggregate results are
presented in Table 3. The results are too close to one another to
support a definitive conclusion in favor of any one model.
Furthermore, all models can probably be improved.
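
The stopping rule described above can be sketched as follows. This is a simplified illustration and not the Neural Network Toolbox implementation: it assumes that training stops once the validation error has failed to improve on its best value for six consecutive epochs, and the function name and the error values are hypothetical.

```python
def early_stopping(validation_errors, patience=6):
    """Return (best_epoch, stop_epoch), both 1-based.

    Training is considered stopped once the validation error has failed
    to improve on its best value for `patience` consecutive epochs.
    """
    best_error = float("inf")
    best_epoch = 0
    fails = 0
    for epoch, error in enumerate(validation_errors, start=1):
        if error < best_error:
            best_error, best_epoch, fails = error, epoch, 0
        else:
            fails += 1
            if fails >= patience:
                return best_epoch, epoch
    # The rule never triggered: training ran through all recorded epochs.
    return best_epoch, len(validation_errors)


# Hypothetical validation errors: the minimum is reached at epoch 5.
errors = [9.0, 8.0, 7.0, 6.0, 5.0, 5.5, 5.6, 5.7, 5.8, 5.9, 6.0, 6.1, 6.2]
print(early_stopping(errors))  # (5, 11)
```

Under this reading, a best epoch of five combined with a patience of six gives a stop at epoch 11, which matches the epoch counts reported for the 15-4-1 network.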

The wider overlapping ranges used in the second data set, as
mentioned in Section 4.2, are expected to pair the values of the
independent and dependent variables more randomly, resulting in an
even more complex relationship between individual risks and total
risk.

In this case, the MLNN was developed with Neural Network Toolbox
because regression analysis, which is a performance measure of the
reliability of data division, is generated automatically, whereas
NeuralTools requires the results to be processed with another MS
Excel extension, such as StatTools. The best performance was
achieved with a 15-12-1 MLNN. The Levenberg-Marquardt BP algorithm
was used for training the NN, and the stopping criterion was again
an increase of the validation sample error for six iterations. The
performance measure was RMSE, and its best value (5,645) was
achieved at the third epoch for the validation sample. For the
training and testing samples, RMSE was equal to 5,676 and 6,451,
respectively. Regression analysis between results and desired
outputs delivered values of R equal to 0,861, 0,879 and 0,856 for
training, validation and testing respectively, as depicted in
Figure 4. Multiple regression analysis was also conducted and
delivered R equal to 0,837. As the R values suggest, the NN
captures the relationship between the variables better than the
regression model. Aggregate results concerning the testing set R
values of all the models developed are presented in Table 3.


Figure 3. Regression analysis for training, validation
and testing sample of 15-4-1 NN (Neural Network
Toolbox).


Table 3. R values of the regression analysis for the testing
sets of the models developed.

Model                                  R
NN 15-4-1 (NeuralTools)                0,8920
NN 15-4-1 (Neural Network Toolbox)     0,89246
Multiple regression analysis           0,8922
NN 15-12-1 (Neural Network Toolbox)    0,85649
Multiple regression analysis           0,837

Similarly, NNs can be developed to assess the risk of each risk
category, and combining those NNs yields the total risk of a
construction project. Generally, the results of this research
demonstrate the computational power of NNs as a function
approximation tool and confirm their usefulness in construction
project risk management, particularly in quantitative risk
analysis, provided that reliable and sufficient historical data
are available.


Figure 4. Regression analysis for training, validation
and testing sample of 15-12-1 NN (Neural Network
Toolbox). Panels: training set, validation (R = 0.87878), test
sample; legible axis labels: Output ≈ 0.77·Target + 5.1 and
Output ≈ 0.77·Target + 5.3.


5 CONCLUSIONS


NN applications in construction project risk management primarily
concern qualitative and quantitative risk analysis and risk
response analysis, and they address the problem either as function
approximation or as pattern recognition. In most cases NNs produce
credible results, as evaluated in the studies that suggest their
use in construction project risk management. Similarly, the
results of the present research demonstrate the computational
power of NNs as function approximation tools and confirm their
usefulness in construction project risk management, particularly
in assessing total risk. Neural networks are reliable for
conducting risk assessments that realistically integrate risk
interdependencies and the complexities stemming from non-linearity
in problem modeling. The availability of training data is a
prerequisite for this goal. In this context, it is necessary to
collect historical data on the features and risks of construction
projects. Further research can address the improvement of the
models developed, as well as the deployment of other techniques
and their incorporation in NN models for construction project
total risk assessment.


ABSTRACT: In the chemical and process industry, accidental fires
may damage equipment, with severe consequences and possible domino
effects. The availability and effectiveness of the safety measures
that reduce the risk associated with this type of event may be
strongly reduced if the facility is located in a harsh
environment, due to complicating meteorological factors and
extreme temperatures. The present work defines a structured
approach to the quantitative assessment of fire-triggered domino
events that accounts for the influence of harsh environmental
conditions on safety barrier performance. A specific metric is
defined to capture the effect of external factors related to harsh
environments on the availability and effectiveness of hardware and
emergency safety barriers, with a specific focus on the time scale
of emergency response. A dedicated event tree analysis then
implements the resulting barrier performance values, in order to
support the quantitative assessment of the accident frequency
associated with domino scenarios. The method is applied to the
analysis of a chemical facility located in harsh environmental
conditions.


1 INTRODUCTION 


In recent decades, interest in cascading events and in the
assessment of their possible risks has been increasing. The
chemical process industry has been hit by major accidents
worldwide, some of which were completely disregarded by hazard
identification techniques (Paltrinieri et al., 2010; Paltrinieri
and Reniers, 2017). Among them, several domino events have been
documented (Abdolhamidzadeh et al., 2011; Darbra et al., 2010;
Delvosalle, 1996; Kourniotis et al., 2000; Lees, 1996; Rasmussen,
1996).

One of the most destructive cascading event disasters is the one
that occurred in Mexico City in 1984 (Pietersen, 1988). Europe
recognized the hazard posed by domino events, and specific
requirements are stated in Article 9 of the latest Seveso
Directive (European Commission, 2012). According to these, the
risk of propagation of primary hazardous scenarios to nearby units
must be assessed.

Different safety barriers are used and monitored in chemical
process plants (Paltrinieri and Khan, 2016) to prevent escalation
scenarios. They include active, passive and procedural
protections, for example the water deluge system (WDS),
fireproofing coating, pressure safety valves (PSVs) and the site
emergency response plan. Every safety barrier is associated with
performance parameters in terms of availability (expressed as
probability of failure on demand) and effectiveness.

However, barriers are subject to deterioration and depletion of
their performance. Meteorological and climatological conditions
can enhance these phenomena. For instance, cold temperatures,
extreme wind and snowfall may either cause deterioration of
hardware plant components or make it difficult for operators to
perform routine tasks and/or to act in emergency situations
(Bercha et al., 2003; Gao et al., 2010). The Arctic and sub-Arctic
regions experience extreme weather conditions that may be
challenging for technical barrier components as well as for human
intervention. However, a dedicated framework for the analysis of
safety barrier performance degradation in harsh environments is
still missing.

This work is aimed at investigating the safety 
barrier performance of chemical and process facil- 
ities operating in harsh environmental conditions, 
in order to evaluate the frequency and probability 
of escalation scenarios triggered by fire. 

The paper is organized as follows: Section 2 
provides a detailed overview of the methodology 
applied to assess the frequency of cascading events 
addressing the effect of severe environment on pro- 
tection devices; Section 3 describes the reference 
case considered for the present analysis; the results 
of the application of the methodology to the refer- 
ence case are shown in Section 4, while Section 5 
provides room for their discussion. The paper ends 
with conclusions in Section 6. 


2 METHODOLOGY 


2.1 Overview 


Figure 1 shows the flowchart of the methodology adopted in the
present study. The methodology was developed for the oil and gas
sector (Landucci et al., 2017) and is here extended to the
chemical process industry. A detailed description of the
methodology is provided in Sections 2.2-2.5.


Figure 1. Flowchart of the methodology. Steps: (1) identification
of reference safety barriers; (2) determination of harsh
environment score (HES); (3) quantitative assessment of safety
barriers performance; (4) evaluation of cascading events
probability and frequency.


2.2 Identification of reference safety barriers


The first step of the methodology consists of a preliminary
characterization of the safety barriers performance, with
particular reference to the prevention and mitigation of cascading
events triggered by fire. According to CCPS (Center for Chemical
Process Safety, 2000), barriers are classified as:


• Passive, which are in place and do not require
external activation;

• Active, which require automatic and/or external
activation;

• Procedural and emergency measures, which
involve the intervention of operators and
emergency teams.


This step is based on the application of a pre- 
viously developed methodology (Landucci et al., 
2016) in which the evaluation of safety barriers 
performance in the framework of escalation is 
aimed at quantifying: 


• availability, expressed as the probability of failure
on demand (PFD) of the safety barriers;

• effectiveness (η), defined as the probability that
the safety barrier, once successfully activated,
will be able to prevent the escalation.


Once the parameters needed to support the 
quantitative evaluation of safety barriers are 
defined, the influence of harsh environmental con- 
ditions on their performance is inferred in the fol- 
lowing steps. 


2.3 Definition of Harsh Environment 
Score (HES) 


The Harsh Environment Score (HES) is a preliminary metric that
describes the harshness of the environment and is used to assess
the influence of weather conditions on the performance of safety
devices. HES combines different site-specific environmental
parameters, such as temperature and wind velocity.

The approach for the HES evaluation is based on the identification
of stressors. These are the factors that most affect human
performance during operations in extreme weather conditions
(Section 2.4.2), but they are adopted in the present study also to
address the influence of extreme weather conditions on hardware
barrier performance (Section 2.4.1).

Musharraf et al. (2013) identify the significant stressors for
harsh environment as coldness, slippery ice, difficulty in
breathing, combined weather effect, low visibility and remoteness.
The present approach associates one or more external factors (EFs)
with each stressor. EFs are climate or environmental conditions
that can be measured and/or quantified. A non-dimensional penalty,
namely a score S, is assigned to each EF. Scores represent the
distance from favorable conditions. They vary from 0 to 1, where 0
represents favorable conditions and 1 the worst ones. Table 1
lists the EFs and the related scoring system applied in the
present study.


Table 1. Summary of external factors and scores
adopted for HES evaluation (adapted from Landucci
et al., 2017).

External factor             ID   Range           S_i
Temperature (°C)            1    > 45            0.4
                                 4 to 45         0
                                 -4 to 4         0.2
                                 -10 to -4       0.6
                                 -30 to -10      0.8
                                 < -30           1
Extreme wind speed (m/s)    2    0 to 3.3        0
                                 3.3 to 5.5      0.2
                                 5.5 to 8        0.4
                                 8 to 10.8       0.6
                                 10.8 to 13.9    0.8
                                 > 13.9          1
Snowfall (m/year)           3    0 to 0.125      0
                                 0.125 to 0.5    0.2
                                 0.5 to 1        0.4
                                 1 to 1.5        0.6
                                 1.5 to 2        0.8
                                 > 2             1
Visibility (fog/snow) (m)   4    < 50            1
                                 50 to 200       0.8
                                 200 to 500      0.6
                                 500 to 1000     0.4
                                 1000 to 2000    0.2
                                 > 2000          0
Sunlight hours (h/year)     5    < 1200          1
                                 1200 to 1600    0.8
                                 1600 to 2000    0.6
                                 2000 to 2400    0.4
                                 2400 to 3000    0.2
                                 > 3000          0
Remoteness                  6    Low             0
                                 Medium          0.5
                                 High            1


More detailed information about the scores 
assignment process and the EFs may be retrieved in 
a previous study (Landucci et al., 2017). The scores 
are assigned according to extensive literature sur- 
veys about the effects of different physical factors on 
technical and human behavior (American Petroleum 
Institute, 2000; DOA—Department of Army, 1982; 
Kunkel et al., 2007; Landsberg and Pinna, 1978; 
Musharraf et al., 2013; Shaw and Austin, 1919). 

Finally, HES is obtained as a weighted summation of the assessed
scores, as follows:

$$\mathrm{HES} = \sum_{i=1}^{n} w_i S_i \qquad (1)$$

where $S_i$ and $w_i$ are respectively the score and the weight
associated with the i-th EF, and $n$ is the number of EFs (six in
Table 1). In the present analysis, a preliminary set of weights is
assigned by using Zipf's law (Zipf, 1949).
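
The weighted summation of Eq. (1) can be sketched in Python as below. The text does not state how the Zipf weights are ranked or normalized, so the sketch assumes weights proportional to 1/rank in the ID order of Table 1, normalized to sum to 1; that choice, the function names and the boundary handling of the wind speed bands are all assumptions made for illustration.

```python
def zipf_weights(n):
    """Weights proportional to 1/rank for ranks 1..n, normalized to sum to 1.

    Assumption: this is one common reading of a Zipf-law weighting.
    """
    raw = [1.0 / rank for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def harsh_environment_score(scores, weights=None):
    """Weighted sum of the EF scores, as in Eq. (1)."""
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must lie between 0 and 1")
    if weights is None:
        weights = zipf_weights(len(scores))
    if len(weights) != len(scores):
        raise ValueError("one weight per score is required")
    return sum(w * s for w, s in zip(weights, scores))


# Upper bound of each wind speed band (m/s) and its score, from Table 1.
# Assumption: a value on a band boundary takes the lower score.
WIND_BANDS = [(3.3, 0.0), (5.5, 0.2), (8.0, 0.4), (10.8, 0.6), (13.9, 0.8)]


def wind_score(speed_m_s):
    """Score for the extreme wind speed EF (ID 2 in Table 1)."""
    for upper, score in WIND_BANDS:
        if speed_m_s <= upper:
            return score
    return 1.0


# Hypothetical set of six scores, in the ID order of Table 1.
example_scores = [0.6, wind_score(24.4), 0.4, 0.2, 0.8, 0.5]
print(harsh_environment_score(example_scores))
```

With normalized weights the score stays between 0 (all EFs favorable) and 1 (all EFs at their worst), which matches the range used for HES in the rest of the methodology.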


2.4 Barrier performance assessment 


2.4.1 Hardware barriers 

According to Gao et al. (2010), extreme environmental conditions
may affect the availability of hardware barriers but have no
significant effect on their effectiveness. The depletion of
barrier performance is strictly related to environmental
temperature. Recommended Practice 581 of the American Petroleum
Institute (2000) identifies a threshold value of -6.7°C for a
considerable effect on protection performance. This value
corresponds to a penalty $S_1 = 0.6$ or higher according to
Table 1. The present framework addresses the depletion in barrier
availability using the proportional hazard model (Cox, 1972), as
suggested by Gao et al. (2010). The failure rate of a generic
component, $\lambda$, increases in harsh environment according to
the following relationship:

$$\lambda(z) = \lambda_0 \, e^{-0.7140 z_1 - 1.013 z_2} \qquad (2)$$

where $\lambda_0$ is the failure rate in normal environment
(namely, the baseline value), assumed here to be constant during
the entire lifecycle of the facility. The factors $z_1$ and $z_2$
are the covariates: $z_1$ describes the protection conditions and
$z_2$ the equipment quality. The covariates are binary and can
take the value +1 or -1; the positive value is associated with
good quality of protections and equipment. The base relationship
for the estimation of tested component unavailability (Lees, 1996)
is applied to obtain the barrier PFD, which in this analysis
describes the barrier availability.
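
The effect of the two binary covariates on the failure rate can be illustrated with the sketch below. The function name and the baseline value are hypothetical, and the coefficient values are those read from Eq. (2); they are an assumption of this sketch and should be checked against Gao et al. (2010) before any use.

```python
import math


def failure_rate(baseline_rate, z1, z2, beta1, beta2):
    """Proportional hazard model: lambda(z) = lambda_0 * exp(beta1*z1 + beta2*z2)."""
    for z in (z1, z2):
        if z not in (-1, 1):
            raise ValueError("covariates are binary: +1 (good) or -1 (poor)")
    return baseline_rate * math.exp(beta1 * z1 + beta2 * z2)


# Coefficients as read from Eq. (2); an assumption of this sketch.
BETA_PROTECTION = -0.7140
BETA_QUALITY = -1.013

baseline = 1.0e-5  # hypothetical baseline failure rate
poor = failure_rate(baseline, -1, -1, BETA_PROTECTION, BETA_QUALITY)
good = failure_rate(baseline, 1, 1, BETA_PROTECTION, BETA_QUALITY)
print(poor / baseline, good / baseline)
```

With these coefficients, poor protection conditions combined with poor equipment quality multiply the baseline failure rate by roughly 5.6, while good conditions reduce it by the same factor.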

The present work considers that the effective- 
ness of the barriers is not affected by environmen- 
tal conditions. Once activated, hardware barriers 
perform as in the case of normal environment 
(Landucci et al., 2016). 

The reference active safety barriers analyzed in the present study
are water deluge systems (WDS), which attenuate the heat radiation
from fires affecting process units. According to different
experimental studies (Hankinson and Lowesmith, 2004; Roberts,
2004a, 2004b; Shirvill, 2004), the presence of a WDS reduces the
heat load on a target by about 50% compared to the unmitigated
case. Hence, $Q_{WDS}$ (the heat load received by a target exposed
to fire when the WDS is available) is expressed as follows:

$$Q_{WDS} = 0.5 \, Q_{HL} \qquad (3)$$

where $Q_{HL}$ represents the heat load affecting the target due
to the primary fire scenario.

Passive safety protections include the PSV and the fireproofing
coating. Birk (2006) proved that the presence of the PSV alone
does not significantly delay the time to failure (TTF) of the
target equipment. In that case, the PSV effectiveness is
considered as unitary but the TTF is evaluated assuming that the
vessel is unprotected (Landucci et al., 2009). Fireproofing
coatings are instead able to delay the vessel failure. Their
effectiveness is set to 1. When a protective coating is present,
the TTF of the target vessel is evaluated by adding a further
term, $TTF_c$, which represents the delaying action of the
coating, as shown in Eq. (4):

$$TTF = TTF_{unprotected} + TTF_c \qquad (4)$$

$TTF_c$ is evaluated according to a simplified approach that
considers the quality of the coating material. For
high-performance materials (intumescent, vermiculite spray,
fibrous mineral wool), $TTF_c$ is conservatively set to
70 minutes. $TTF_c$ is equal to 0 minutes when common insulating
materials (glass wool, rock wool) are used as coatings.
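
The two simplified rules for hardware and passive barriers, Eq. (3) and Eq. (4), amount to the short sketch below. The function names, the dictionary keys and the unprotected TTF used in the example are hypothetical; the 50% reduction and the 70-minute and 0-minute delays are the values stated in the text.

```python
# Delay added by the fireproofing coating (minutes), as stated in the text.
COATING_DELAY_MIN = {
    "high_performance": 70.0,   # intumescent, vermiculite spray, fibrous mineral wool
    "common_insulation": 0.0,   # glass wool, rock wool
}


def mitigated_heat_load(primary_heat_load, wds_available):
    """Eq. (3): an available water deluge system halves the heat load."""
    return 0.5 * primary_heat_load if wds_available else primary_heat_load


def time_to_failure(ttf_unprotected_min, coating=None):
    """Eq. (4): unprotected TTF plus the delay provided by the coating."""
    delay = 0.0 if coating is None else COATING_DELAY_MIN[coating]
    return ttf_unprotected_min + delay


print(mitigated_heat_load(48.0, wds_available=True))  # 24.0
print(time_to_failure(10.0, "high_performance"))      # 80.0
```

The 48 kW/m² input is the heat load quoted for tank V1 in the case study, while the 10-minute unprotected TTF is an invented value used only to show the addition.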


2.4.2 Procedural barriers 
Human reliability may be significantly affected by 
extreme weather (Musharraf et al., 2013). 

A customized version of the Success Likelihood Index Methodology
(SLIM) (Embrey, 1986) is adopted in the present framework to
evaluate the deterioration of emergency response availability
(expressed in terms of PFD). HES is considered as a simplified
ranking of the performance shaping factors affecting the emergency
response in harsh environment (Landucci et al., 2017). The higher
the HES, the lower the probability of success of the emergency
team intervention. The PFD is then evaluated as:

$$\log_{10} PFD = a(1 - HES) + b \qquad (5)$$

where a and b are -0.954 and -0.046 respectively. They have been
determined by setting the PFD equal to 0.1 in case of favorable
environmental conditions (HES = 0) and equal to 0.9 in worst-case
environmental conditions (HES = 1) (Landucci et al., 2016).
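
Eq. (5) and its two calibration points can be checked with a few lines of Python. The function name is illustrative; the coefficients are those given in the text, and the logarithm is taken to base 10, which is the base that reproduces the stated calibration.

```python
A = -0.954
B = -0.046


def emergency_response_pfd(hes):
    """PFD of the emergency response from Eq. (5): log10(PFD) = a(1 - HES) + b."""
    if not 0.0 <= hes <= 1.0:
        raise ValueError("HES must lie between 0 and 1")
    return 10.0 ** (A * (1.0 - hes) + B)


for hes in (0.0, 0.43, 1.0):
    print(hes, round(emergency_response_pfd(hes), 3))
```

The function returns 0.1 for HES = 0 and about 0.9 for HES = 1, as required by the calibration. For HES = 0.43, the value estimated for the case study, the relationship evaluates to roughly 0.26.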

The evaluation of the emergency response effectiveness follows the
approach suggested by Landucci et al. (2017). It is based on the
comparison between the TTF of the target equipment and the Time
for Final Mitigation (TFM) required by the emergency team to
extinguish the primary fire. The TFM is defined as the sum of the
times for the different emergency operations, as follows:

$$TFM = \sum_{j=1,\, j \neq 2} \tau_j \qquad (6)$$

The times $\tau_j$ are defined according to Table 2 (Landucci et
al., 2017), which also shows the relationships applied to account
for the delay due to harsh environment. The effectiveness of the
emergency response is set equal to 1 or 0 by comparing the TFM
with the TTF of the target equipment: when the TFM is lower than
the TTF, the emergency response effectiveness is set to 1;
otherwise it is zero.
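
The time scale relationships of Table 2 and the comparison between TFM and TTF can be sketched as follows. The function names are hypothetical. The sketch assumes that the sum in Eq. (6) covers the times with IDs 1, 3, 4 and 5, and that the additional time for interregional assistance (labeled 6 here for convenience) is added only when such assistance is needed.

```python
# Intercepts c_j of log10(tau_j) = -0.3 * (1 - HES) + c_j, from Table 2.
# Key 6 (interregional assistance) is a label chosen for this sketch.
INTERCEPTS = {1: 1.0, 2: 1.6, 3: 1.38, 4: 1.15, 5: 1.2, 6: 2.08}


def operation_time(j, hes):
    """Time (min) of emergency operation j for a given HES."""
    return 10.0 ** (-0.3 * (1.0 - hes) + INTERCEPTS[j])


def time_for_final_mitigation(hes, interregional=False):
    """Eq. (6): sum of the operation times, excluding onsite mitigation (j = 2)."""
    ids = [1, 3, 4, 5] + ([6] if interregional else [])
    return sum(operation_time(j, hes) for j in ids)


def emergency_response_effectiveness(tfm, ttf):
    """1 if the fire is mitigated before the target fails, otherwise 0."""
    return 1 if tfm < ttf else 0


for hes in (0.0, 0.43, 1.0):
    print(hes, round(time_for_final_mitigation(hes), 1))
```

At HES = 0 the relationships return the baseline times of Table 2 (about 5, 12, 7 and 8 minutes, for a TFM of about 32 minutes), and at HES = 1 every time is roughly doubled, since the exponent grows by 0.3.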


2.5 Evaluation of escalation probability 


A customized Event Tree Analysis (ETA) is adopted 
in order to evaluate the frequency (and probability) 
of domino escalation triggered by fire. The avail- 
ability and effectiveness of barriers evaluated as 
described in Section 2.4 are addressed in the ETA 
by using dedicated logic gates, as shown in Table 3. 

Further detailed information about gate defini- 
tions may be retrieved elsewhere (Landucci et al., 
2016). 

Gate A represents a simple composite probability. In this case,
the availability (expressed in terms of PFD) is multiplied by a
single probability value expressing the probability of barrier
success in the prevention of the escalation.

Gate B represents a composite probability distribution. In this
case, the PFD is multiplied by a probability distribution
expressing the probability of barrier success in the prevention of
escalation, thus obtaining a composite probability of barrier
failure on demand. In this work, the integrated probability is
adopted, obtaining the rule for gate quantification reported in
Table 3.


Table 2. Time scale for emergency operations and simplified
relationship for the estimation of time increment due to harsh
environment (adapted from Landucci et al., 2017). The baseline is
the time required in normal environment (HES = 0).

ID    Name                                    Baseline (min)   Simplified relationship (τ in min)
τ_1   Time to alert                           5                log10(τ_1) = -0.3 (1 - HES) + 1
τ_2   Time to onsite mitigation               20               log10(τ_2) = -0.3 (1 - HES) + 1.6
τ_3   Time for external team intervention     12               log10(τ_3) = -0.3 (1 - HES) + 1.38
τ_4   Time for equipment deployment           7                log10(τ_4) = -0.3 (1 - HES) + 1.15
τ_5   Time for extra set-up operations        8                log10(τ_5) = -0.3 (1 - HES) + 1.2
τ_6   Additional time in case of need of      30-60*           log10(τ_6) = -0.3 (1 - HES) + 2.08
      interregional assistance

* Depending on the type of location.


Table 3. Summary of gates introduced in the ETA to account for
barrier performance (adapted from Landucci et al., 2016).

Gate type   Output rules
A           OUT1 = IN × [PFD + (1 - η) × (1 - PFD)]
            OUT2 = IN × (1 - PFD) × η
B           OUT1 = IN × [PFD + (1 - η) × (1 - PFD)]
            OUT2 = IN × (1 - PFD) × η
C           OUT1 = IN × PFD
            OUT2 = IN × (1 - η) × (1 - PFD)
            OUT3 = IN × (1 - PFD) × η
D           OUT1 = IN × P_d
            OUT2 = IN × (1 - P_d)

Gate C is associated with a discrete probability 
distribution. 

Finally, Gate D incorporates equipment vulnerability models based
on probit approaches for the estimation of $P_d$ (the probability
of vessel failure). The effect of harsh environmental conditions
has been addressed in the probit models when describing the vessel
resistance behaviour. Vessel fragility models are described
extensively in previous works (Landucci et al., 2009).
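
The gate rules summarized in Table 3 can be written as small functions that split an incoming frequency into the outgoing branches of the event tree. This is a sketch under the assumption that the rules take the form OUT1 = IN × [PFD + (1 - η)(1 - PFD)] and OUT2 = IN × (1 - PFD) × η for gates A and B, with the analogous three-branch and two-branch rules for gates C and D; the function names and the input values are illustrative.

```python
def gate_a(inp, pfd, eta):
    """Gates A and B: (barrier fails to stop escalation, barrier succeeds).

    For gate B, `eta` is the integrated effectiveness.
    """
    failure = inp * (pfd + (1.0 - eta) * (1.0 - pfd))
    success = inp * (1.0 - pfd) * eta
    return failure, success


def gate_c(inp, pfd, eta):
    """Gate C: (unavailable, available but ineffective, available and effective)."""
    unavailable = inp * pfd
    ineffective = inp * (1.0 - eta) * (1.0 - pfd)
    effective = inp * (1.0 - pfd) * eta
    return unavailable, ineffective, effective


def gate_d(inp, p_failure):
    """Gate D: (vessel fails, vessel survives)."""
    return inp * p_failure, inp * (1.0 - p_failure)


# Hypothetical inputs: unit incoming frequency, PFD = 0.1, effectiveness = 0.75.
print(gate_a(1.0, 0.1, 0.75))
print(gate_c(1.0, 0.1, 0.75))
print(gate_d(1.0, 0.2))
```

Each gate conserves the incoming frequency: the outputs of a gate always sum to its input, which is a convenient consistency check when the gates are chained along an event tree.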


3 CASE STUDY 


3.1 Overview 


The reference case study concerns a plant for the production of
personal and home hygiene products. The plant uses ethanol and
propane as its main raw materials and, given the quantities
stored, it is subject to the Seveso Directive requirements
concerning hazardous materials (European Commission, 2012). The
plant is located in a harsh environment (see Section 3.2). The
methodology described in Section 2 is applied to estimate the
frequency of domino events triggered by fire, thus providing a
more complete risk picture of the facility.

Figure 2 shows the layout considered in the analysis of the case
study. Ethanol is stored in three underground tanks (T1, T2, T3)
with an overall volume of 90 m³ and kept at 15°C. Ethanol is
transferred to the processing area (see Fig. 2) through a pipeline
with a length of 20 m and a nominal diameter of 100 mm. Full-bore
rupture of the pipeline is considered to derive the features of
the primary scenario potentially triggering the domino escalation.
In particular, a non-confined pool fire following immediate
ignition of the spilled ethanol is taken into account. A standard
frequency of $3.9 \cdot 10^{-7}\ \mathrm{y}^{-1}$ has been assumed
for the pool fire from literature analysis. The physical effects
associated with the pool fire have been analyzed by applying the
conventional literature integral models implemented in the DNV GL
Phast 7.11 commercial software. According to the consequence
assessment results, the pool fire affects the target propane
storage tank (V1, see Fig. 2), which is exposed to about
48 kW/m².


Figure 2. Layout defined for the case study associated
with a non-confined pool-fire following the rupture of
the process ethanol pipeline. Labelled areas: Ethanol Storage
Area, Processing Area.

The safety barriers in place to protect V1 are listed in Table 4.
They are defined on the basis of different regulations for fire
protection of liquefied petroleum gas storage units (American
Petroleum Institute, 1996; National Fire Protection Agency (NFPA),
2018, 2017). The results of their performance assessment (in
normal and harsh environments) are shown in Section 4. The quality
of both the target equipment V1 and its protection devices is
assumed to be low, following a conservative approach.


Table 4. Summary of fire protection devices for horizontal LPG
storage tank (American Petroleum Institute, 1996; National Fire
Protection Agency (NFPA), 2018, 2017).

Target   Active barriers                Passive barriers                              Procedural barriers
V1       Water deluge system (WDS-V1)   Pressure safety valve (PSV-V1)                Emergency response (ER-01)
                                        Fireproofing coating (PFP-V1) (2 h rating)


3.2 Environmental and meteorological conditions


The reference production plant is located in an industrial site
close to Bodø, just north of the Arctic Circle, in Norway. The
climatic conditions in the reference area can be characterized as
severe. Table 5 summarizes the meteorological and climatological
conditions experienced in that area and adopted for the
determination of HES and, thus, to derive performance data in
harsh environment.


Table 5. Summary of meteorological and climatological conditions
experienced in Bodø.

Factor            Meteorological data                                   Reference
Temperature       Coldest month: January                                (Norwegian Meteorological Institute, 2017)
                  Minimum average temperature: -11.8°C
                  Typical value: -2.2°C
Wind speed        Harsh month: January                                  (Norwegian Meteorological Institute, 2017)
                  Maximum wind speed: 24.4 m/s (10 m above sea level)
                  Annual range: 8.9 m/s
Snow              Duration: 6 months (October-April)                    (weatherspark.com, 2017)
                  Average snowfall per day: 2.54 cm
Fog/snow effect   Visibility lower than 2000 m                          (ISO-International standardiza...
Sunlight hours    1200-1600 h/year                                      (Landsberg and Pinna, 1978)
Remoteness        The plant is located ...                              (Suedfeld and ...


4 RESULTS


4.1 Performance assessment of safety barriers 


Adverse meteorological conditions significantly 
affect the protection effect of safety devices. In order 
to account for this effect, the methodology described 
in Section 2 has been applied to the reference chemi- 
cal processing plant described in Section 3. 

According to the meteorological and climato- 
logical data summarized in Table 5 and to the scor- 
ing system described in Section 2.3, the estimated 
HES is 0.43 for the considered case. This value is 
implemented to evaluate the performance of the 
safety barrier protecting the target tank V1. Since 
the score associated with the external temperature 
S, = 0.6, a degradation of hardware barrier avail- 
ability must also be considered (see Section 2.4.1). 

Data were also calculated for normal environ- 
mental conditions for sake of comparison (thus, 
featuring HES = 0). The time for external emer- 
gency response is calculated according to the 
guidelines described in Section 2.4. It increases 
from 77 minutes (normal environmental condi- 
tions, HES = 0) to 124 minutes (harsh environmen- 
tal conditions, HES = 0.43). 

Table 6 summarizes the results of performance 
assessment in normal and harsh environment and 
it shows the gates associated with each barrier. 


4.2 Evaluation of escalation probability 


The customized ETA approach for the evaluation 
of the escalation probability and frequency has 


ness in an industrial site 
close to cities and 
amenities. The 
remoteness is 
considered to be low. 


Steel, 2000) 


Table 6. Summary of data adopted for the quantifica- 
tion of the ETA in the present case study. HES = 0: nor- 
mal environment; HES = 0.43: harsh environment. 


PFD Effectiveness 
Gate 
Barrier type HES=0 HES = 0.43 HES = 0 HES= 0.43 
WDS-VI1A 4.3310? 5.57107? 1 1 
PSV-V1 A 110° 1.2910! 1 1 
PFP-V1 A 110° 1.29107 1 1 
ER-01 C 110! 2.5710! 0; 18 0; 1° 


Depending on the comparison between TFM and TTF. 


been carried out starting from the frequency and 
consequence assessment of the primary scenario 
(ethanol non-confined pool-fire). 

Figure 3 shows an extract of the ETA developed 
for harsh environmental conditions (HES = 0.43). 
Each branch in the event tree is quantified according 
to the rules described in Section 2. A similar event 
tree is derived for the normal environment case. 

Three different scenarios arising from uncon- 
fined pool fire are analyzed in both normal and 
harsh environment. These scenarios are: 

1. Unmitigated domino (no effective activation
of safety barriers);

2. Mitigated domino (partial or ineffective activation
of one or more safety barriers);

3. No domino scenario (barriers effectively
mitigate/suppress the primary fire and avoid
escalation).


Figure 3. Extract of the ETA for the evaluation of cascading
event probability/frequency for the target tank V1.
It refers to the case of harsh environment, with HES = 0.43.
Branch outcomes shown, from top to bottom: No Escalation;
Unmitigated Escalation; No Escalation; Mitigated Escalation;
Mitigated Escalation; No Escalation.


The target equipment V1 may withstand the fire 
even in the absence of barrier activation. Also in 
these cases, escalation is excluded. 

Figure 4 shows the result of the analysis in terms 
of frequency and probability of the three exam- 
ined scenarios. The “No safety barrier” scenario 
has been considered for sake of comparison, e.g. 
based on the method developed in a previous work 
(Landucci et al., 2009). 


Figure 4. a) Probability and b) frequency (1/y) of secondary
scenarios. Cases compared: safety barriers, normal
environment; safety barriers, harsh environment; no safety
barriers. Series: no escalation scenario; mitigated secondary
scenario; unmitigated escalation scenario.


5 DISCUSSION 


The analysis of the case study demonstrates the
potential of the methodology in the assessment
of domino scenarios for chemical facilities located
in harsh environments. As shown in Figure 4, a
significant increase in escalation probability and
frequency is predicted in harsh environment operation
with respect to normal environment. When
safety barriers are considered, the unmitigated domino
scenario is the least credible, both in normal and
harsh environments. However, the degradation of
barrier performance in harsh environment leads to
higher frequency values. In particular, a reduction of
four orders of magnitude with respect to the case
without protection is obtained for harsh environment.
In normal environment, the reduction is of
eight orders of magnitude.

This is due to the degradation of barrier performance
in harsh environment, as documented in
the analysis shown in Section 4.1. In particular,
procedural and emergency measures are significantly
affected by cold environmental conditions. In fact,
the time for external emergency response increases
by about 60% compared to the value in normal
environment (from 77 to 124 minutes). This is due to
delays and difficulties in
carrying out emergency actions.
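
As a quick arithmetic check of the quoted increase, the relative change can be computed from the two response times reported in Section 4.1. The snippet below is only an illustrative sketch; the function and variable names are not taken from the paper.

```python
# Emergency response times reported in Section 4.1 (minutes).
t_normal = 77.0   # normal environment, HES = 0
t_harsh = 124.0   # harsh environment, HES = 0.43


def relative_increase(baseline: float, value: float) -> float:
    """Return the fractional increase of `value` over `baseline`."""
    return (value - baseline) / baseline


increase = relative_increase(t_normal, t_harsh)
print(f"Relative increase in response time: {increase:.1%}")
```

The result is consistent with the "about 60%" stated in the text.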

The escalation frequency results obtained from
the ETA analysis shown in Section 4.2 may be
used in detailed quantitative risk assessment
studies. In this way, a more detailed risk picture
of the facility, including escalation scenarios,
may be obtained. The necessary input to apply
the method, as exemplified in Sections 3 and 4, is
normally available from conventional risk analysis
studies and therefore no additional work needs to
be carried out for collecting input data. The meteorological
and climatological data for the HES
assessment are site-specific, but easily retrievable
from national institutes (see the example dataset
gathered in Table 5).

It is worth mentioning that the methodology
addresses human factors and barrier deterioration
phenomena in a very simplified way, despite
the considerable complexity of these issues. For that
reason, the escalation probabilities and frequencies
evaluated in this way should be considered to be
on the safe side.

The methodology allows room for further
refinement of data and for using different available
methods. In particular, for human reliability,
more advanced techniques may be implemented
to support the evaluation of operators’ performance
and error probability given the environmental
stressors; at the same time, emergency response
analysis may be improved with site-specific
response time data for a more accurate effectiveness
estimation.

Finally, for hardware barriers, further review of
the methodology should be considered when site-specific
performance data become available from
facilities operating in harsh cold environments.


6 CONCLUSIONS 


The present contribution shows a systematic
approach for the quantification of domino event
frequency and probability for chemical facilities
operating in harsh environmental conditions. The
approach accounts for the deterioration of safety
barrier performance due to extreme climate conditions.
A dedicated metric is used as a preliminary
index to assess the influence of environmental conditions
on barrier performance, thus allowing for a
modification of barrier availability and effectiveness.
The modified values of barrier performance
data allow for a more detailed probability and
frequency assessment of cascading scenarios triggered
by fire.

The outcomes of the methodology may drive
the design of hardware barrier components and
the improvement of emergency procedures in order
to reduce the risk of severe accidental scenarios
in chemical facilities operating in harsh
environments.
ABSTRACT: The majority of nuclear power sites house more than one reactor unit and other nuclear
facilities such as spent fuel pool storage. Currently, multi-unit risks are typically not adequately
accounted for in risk assessments, since the licensing is based on unit-specific PSA with focus on a reactor
accident. This paper presents an approach to site risk analysis, taking into account various dependences
between the units. The dependences can be caused by external hazards, which can affect multiple units at
the same time; shared operational and safety systems at the site; and common staff who should manage the
situations. The site risk assessment approach has been developed with the aid of two pilot studies made for
two Swedish sites. Preliminary results from the pilot studies are presented in the paper.


1 INTRODUCTION 


After the Fukushima Daiichi accident in March
2011, general interest in site level Probabilistic Safety
Assessment (PSA) has increased. The majority of
nuclear power sites house more than one reactor
unit and other nuclear facilities such as spent
fuel pool storage. Currently, multi-unit risks are
typically not adequately accounted for in risk
assessments, since the licensing is based on unit-specific
PSA with focus on a reactor accident.

The methodology for a site level risk analysis 
needs to consider the dependences between the 
units. By “unit” we mean here not only reactors 
but also other relevant sources for radioactive 
release such as spent fuel pools and storages. The 
dependences can be caused by external hazards, 
which can affect multiple units at the same time; 
shared operational and safety systems at the site; 
common staff who should manage the situations. 
Site risk analysis is not only a matter of extending
current risk analyses to properly cover inter-unit
dependences in the risk assessment, but it should
also provide risk insights for the site level safety
management, e.g., with respect to severe accident
management, emergency preparedness, and the design,
operation and maintenance of shared systems.


In 2017, the Swedish nuclear utilities Forsmark
Kraftgrupp and Ringhals AB and the Swedish
Radiation Safety Authority financed, together with
the Finnish Nuclear Safety Research Programme
SAFIR2018, a project called SITRON (SITe
Risk of Nuclear installations). The objective of
SITRON is to develop methods and requirements
for a nuclear power plant site risk analysis, driven by
pilot studies. The paper summarises
the developed method for the site risk analysis
and preliminary results from the pilot studies.


2 OVERALL APPROACH 


2.1 Definitions 


This section introduces the main concepts and definitions
used in this paper. They are adopted from the
SITRON project reports (Holmberg 2017; Bäckström
et al. 2018).

A “single-unit PSA” means a PSA made for a 
nuclear facility such as the reactor facility and the 
interim storage for spent fuel. A single-unit PSA 
is assumed to cover all fuel locations within the 
facility. For a reactor facility, the reactor and the 
fuel pool are the relevant locations from the risk 
assessment point of view. Single-unit PSA can
also be understood to refer to the types of risk analyses
that currently have been prepared for licensing of
nuclear facilities.

A “multi-unit PSA” means a PSA or a set of 
PSAs made to cover accident scenarios related 
to all fuel locations at the site, including spent 
fuel transportations. Multi-unit PSA can be also 
understood to be an extension of a single-unit 
PSA which can be used to quantify multi-unit risk 
metrics. The aim of this paper is to discuss one 
approach towards this direction. 

The SITRON project is limited to level 1 and level 2 
PSA reflecting the current state-of-the-practice 
for nuclear power plant applications and licensing 
requirements in most countries. Level 1 PSA assesses 
the risk of a reactor core damage or more generally 
the risk of a fuel damage. The main risk metric of 
level 1 PSA is the Core Damage Frequency (CDF) 
or generally the Fuel Damage Frequency (FDF). 

Level 2 PSA assesses the risk of radioactive 
release to the environment as a consequence of a 
fuel damage. In level 2 PSA, it is typical to use several 
risk metrics following the categorisation of releases. 
Internationally commonly used risk metrics are the 
large release frequency (LRF) and the large early 
release frequency (LERF) (OECD 2009). The meaning
of “large” may vary depending on the regulatory
framework, e.g., in Finland it is a release larger than
100 TBq of Cs-137 (STUK 2013). “Early” release
means a release that occurs before there is
sufficient time for offsite protective measures. As a
general term for level 2 PSA risk metrics, “release 
category frequency” (RCF) is used in this paper. 

The structure of PSA model can be defined to 
consist of a number of initiating events (IE) and 
plant response models that are used to quantify in 
level 1 the conditional probabilities of core damage 
(CCDP) with respect to each IE and in level 2 the 
conditional probability of certain release given the 
Plant Damage State (PDS). Plant damage states 
are interface states between level 1 and 2 PSA to 
facilitate more compact modelling of level 2 PSA. 

When analysing and modelling multi-unit scenarios,
it is necessary to divide the above concepts
into those involving single-unit impacts and
those involving multi-unit impacts. For instance,
initiating events are grouped into single-unit initiating
events (SUIE) and multi-unit initiating
events (MUIE).

Risk metrics for level 1 PSA can include the fol- 
lowing CDF metrics: 


• Single-Unit Core Damage Frequency (SUCDF):
frequency of a reactor accident involving core
damage on one and only one reactor unit per
site calendar-year.

• Multi-Unit Core Damage Frequency (MUCDF):
frequency of an accident involving core damage
on two or more reactor units concurrently per site
calendar-year.

• Site Core Damage Frequency (SCDF):
frequency of a reactor accident involving core
damage on one or more reactor units concurrently
per site calendar-year.
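
Read literally, these definitions make the single-unit and multi-unit outcomes mutually exclusive (exactly one unit versus two or more), and together they cover the site-level outcome (one or more). Under that reading the three metrics are related by SCDF = SUCDF + MUCDF. The sketch below illustrates this with hypothetical numbers; neither the values nor the function name come from the SITRON reports.

```python
def site_cdf(sucdf: float, mucdf: float) -> float:
    """Site CDF as the sum of the mutually exclusive single-unit
    (exactly one unit) and multi-unit (two or more units) CDFs."""
    if sucdf < 0 or mucdf < 0:
        raise ValueError("frequencies must be non-negative")
    return sucdf + mucdf


# Hypothetical, purely illustrative frequencies per site calendar-year.
sucdf_example = 2.0e-5
mucdf_example = 1.0e-6
scdf_example = site_cdf(sucdf_example, mucdf_example)
print(f"SCDF = {scdf_example:.2e} per site calendar-year")
```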


For a level 2 PSA, the significance of the release
can be characterised by two components: the magnitude
of released radionuclides (Cs-137 is typically
used as a representative isotope) and the timing of
the release. Unlike in level 1, in level 2 there is no
need to count the number of units or sources that
contribute to the release. Therefore, the release
categorisation in multi-unit scenarios can be based
on the following metrics:


• Release magnitude = sum of release magnitudes
from the units having fuel accidents

• Release timing = time point when the magnitude
criterion for the release category is exceeded.
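
The two metrics above can be sketched as a small calculation. The example assumes, purely for illustration, that each unit's release is available as a step-wise cumulative history of (time in hours, cumulative TBq of Cs-137) points and that the category criterion is a single magnitude threshold; the data layout, names and numbers are hypothetical and not part of the SITRON method.

```python
from typing import Optional

# Hypothetical data layout: one list of (time_h, cumulative_release_TBq)
# points per unit, sorted by time; the value is held constant between points.
ReleaseHistory = list[tuple[float, float]]


def cumulative_at(history: ReleaseHistory, t: float) -> float:
    """Cumulative release of one unit at time t (0 before the first point)."""
    released = 0.0
    for time_h, total in history:
        if time_h <= t:
            released = total
        else:
            break
    return released


def site_release(histories: list[ReleaseHistory],
                 threshold_tbq: float) -> tuple[float, Optional[float]]:
    """Return (summed release magnitude, first time the summed release
    exceeds the threshold, or None if it never does)."""
    times = sorted({time_h for h in histories for time_h, _ in h})
    exceed_time = None
    for t in times:
        if sum(cumulative_at(h, t) for h in histories) > threshold_tbq:
            exceed_time = t
            break
    magnitude = sum(h[-1][1] for h in histories if h)
    return magnitude, exceed_time


# Illustrative two-unit example with a 100 TBq criterion.
unit_1 = [(5.0, 30.0), (12.0, 70.0)]
unit_2 = [(8.0, 40.0), (20.0, 60.0)]
magnitude, timing = site_release([unit_1, unit_2], threshold_tbq=100.0)
print(magnitude, timing)
```

In this example neither unit alone exceeds the criterion, but the summed release does, which is the situation the site-level categorisation is meant to capture.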


The Site-level Release Category Frequency (SRCF) 
is the sum of frequencies of the single-unit and multi- 
unit scenarios leading to a certain release category. 

It should be noted that the SITRON project has 
not proposed numerical criteria for release catego- 
risation due to national differences in the risk cri- 
teria for level 2 PSA. It is, however, recommended 
that release magnitudes are counted in absolute 
units (e.g. TBq of Cs-137) rather than in relative 
units (e.g., x% of core inventory is released). 


2.2 Basic assumptions for site risk analysis 


A key assumption for the SITRON project is that 
the site risk analysis does not need to start from 
scratch. It is assumed that the nuclear power 
plants have rather complete and well-developed 
PSAs for the units at the site. The site risk analy- 
sis is expected to complement the existing PSA- 
studies by addressing multi-unit scenarios and unit 
dependences. 

The impact of site risk analysis is two-fold. 
Firstly, it should lead to improved single-unit 
PSAs, by ensuring that multi-unit scenarios and 
unit dependences are properly accounted for. Sec- 
ondly, site risk analysis should provide a represen- 
tation of risk at the site-level, i.e., it enables the 
quantification of site-level risk metrics. 

One important principle in the method development
of the SITRON project is that it should be
possible to quantify the site-level risk metrics with
the single-unit PSAs, without developing dedicated
multi-unit risk models. This idea follows the
assumption that approximate quantifications are
sufficient for site-level PSA application purposes.

Another important principle of the site risk
analysis is that effective screening will be applied to
identify relevant multi-unit scenarios. In principle,
one could postulate a huge number of multi-unit
scenarios, but most of them can be shown to have
an insignificant risk contribution and could thus be
screened out from further analysis. For instance, it
can be shown that a combination of two simultaneous
independent initiating events (events occurring
within a certain short time window) has an
insignificant risk importance and therefore can be
screened out (Bäckström et al. 2018).


Figure 1. Site risk analysis procedure. Steps shown:
selection of the analysis scope; qualitative analysis of
dependences; selection of multi-unit scenarios; quantitative
analysis of dependences; risk aggregation; and, as a
side step, updating of the single-unit PSA.


2.3 Analysis steps 


Figure 1 depicts a general procedure for a site-level
PSA. As the first step, the scope of the analysis
is determined, including a selection of the sources of
radioactive releases considered in the study and
the level of PSA considered. Section 3 of the
paper describes the elements of the qualitative
analysis of dependences, which results in the selection
of multi-unit scenarios for the quantitative
analysis and finally the risk aggregation. These
topics are discussed in Section 4.

Figure 1 also includes a grey box, “updating of
the single-unit PSA”, which can be seen as an excursion
in the procedure. This analysis step is needed to cope
with identified deficiencies in the single-unit PSA.
It also facilitates the quantification approach that is
based on an effective utilization of single-unit PSAs.


3 QUALITATIVE ANALYSIS 
OF INTER-UNIT DEPENDENCES 


Qualitative analysis of inter-unit dependences
should be a systematic and comprehensive assessment
of possible inter-unit dependences to identify
important factors in multi-unit scenarios. The purpose
is to ensure that the dependences that are considered
likely to be relevant are captured correctly
in the quantitative analysis, but also to screen out
dependences that do not require further analysis.

The analysis should cover various dependences 
topic by topic, as will be discussed in the follow- 
ing sub-sections. The identification of relevant 
initiating events is the basis for further analysis. 
The analysis of dependences due to shared systems 
and structures is a step that can be performed in a 
general manner without postulating any particular 
multi-unit scenario (by a multi-unit scenario we 
practically mean a scenario initiated by a multi- 
unit initiating event). The other topics (identical 
components, human and organisational depend- 
ences, and plant operating states) are more practi- 
cal to assess when the relevant multi-unit scenarios 
have been selected. 


3.1 Identification of initiating events 


The primary way of identifying important multi-unit
scenarios is based on the analysis of initiating
events which can have multi-unit impacts. There
are two types of such events: 1) a multi-unit initiating
event that has a more or less simultaneous
impact on multiple units, and 2) a propagating
initiating event, which first impacts a single
unit and then propagates to other units.

The identification of multi-unit initiating events
should be a straightforward task since practically
all external hazards belong to this group of events.
Screening can be carried out on a frequency basis,
but this may already have been done in the single-unit
PSAs.

The identification of relevant propagating initi- 
ating events may require further analysis compared 
to the analysis made in the context of the single- 
unit PSA. An initiating event that causes a distur- 
bance on one unit could potentially propagate to 
another unit, either by creating an initiating event 
during the accident progression, for example loss 
of offsite power, or through an accident scenario 
(core damage) that ultimately affects the other 
unit. Another example of an initiating event that 
propagates would be a fire in one unit that spreads 
to another unit. An assessment of propagating
hazards (fire and flooding) may require plant
walk-downs to judge the likelihood of hazard
propagation.


3.2 Common systems, buildings and structures 


There are different types of shared connections
in a nuclear power plant. These connections can
be categorized according to the approach used in
(Muhlheim & Wood 2007), where the two main
categories “structures and facilities” and “systems
and equipment” are used.

Examples of shared structures and facilities
include service water intake structures
and different types of storage tanks. There are also
plant designs where, for example, turbine or auxiliary
buildings are shared.

Shared systems and equipment can be
grouped into three categories:


• Systems that can support several units simultaneously.
Systems in this sub-category include, for
example, station blackout gas turbines and common
fire protection systems.

• Independent systems at each unit that can be
cross-connected to support another unit, or single
systems able to fully support only one
unit at a time. Systems in this sub-category
could, for example, include demineralized water
distribution. Emergency diesel generators may
also be configured to support only one unit at a
time.

• Independent systems at each unit sharing
standby or spare equipment. Systems in this sub-category
include, for example, portable pumps for
independent cooling.


Which connections are shared differs widely
between plants, even between plants from the same
vendor. Many of the shared connections are, however,
not important from a PSA point of view, e.g.
shared office buildings and shared communication
systems.


3.3 Identical components 


Identical components at different units form a
potential group of components which can fail due
to Common Cause Failures (CCF). Schroer &
Modarres (2013) present strong evidence that
dependent failures involving multiple units occur
with a relatively high frequency. The OECD/NEA
CCF data project ICDE has also made a study on
multi-unit CCF events, which indicates that such
events happen (Hakansson 2017), and therefore they
cannot be categorically ruled out from a multi-unit
PSA. The CCF candidates are selected by studying
the scenarios for the relevant multi-unit initiators.


3.4 Human and organisational dependences 


Human and organizational dependences related to 
the multi-unit scenarios should be identified and 
covered in the human reliability analysis (HRA). 
In general, multi-unit HRA will need to put more 
emphasis on organizational and management 
aspects in the analysis. These factors need to be 
included in not only quantification, but also task 
analysis and modelling. 


Multi-unit accidents pose additional challenges
for operators which are not modelled in a single-unit
PSA. These challenges may arise from constrained
human resources, additional complexity
in managing multiple scenarios from a common
location, shared system prioritization, prioritizing
the deployment of portable equipment, etc.
A radioactive release from one unit in case of a
multi-unit accident might affect critical operator
actions that have to take place outside the main
control room of another unit.

The degree of added complexity for multi-unit 
accidents will depend greatly upon the amount of 
interdependence between the individual units. This 
interdependence may come from the nature of the 
initiating event, the amount of shared systems/ 
equipment or the amount of shared resources. 


3.5 Plant operating states 


PSA models for nuclear power plants shall take into
account the various Plant Operating States (POS) of
the facility, since the list of relevant initiating events,
the status of safety systems and the system success criteria
can vary strongly between POSs. A usual POS categorisation
includes full-power operation, reactor
shutdown (from full-power to outage), reactor
up-rate (from outage to full-power) and a number
of POSs during the maintenance outage period.

A realistic multi-unit scenario assessment has 
to account for the units’ various combinations of 
POSs. However, a complete consideration of all 
possible combinations of POSs between several 
units could lead to a large number of “site level” 
POSs. For instance, in a Hungarian pilot study 
(Bareith et al. 2016), 123 distinct site level POSs 
were identified for a site with four reactors and 
four spent fuel pools. 

In SITRON, it is assumed that the need to
consider various POS combinations can be considerably
reduced by screening out irrelevant combinations
and by merging similar POSs.
This should be true at least for various outage
period POSs which have short time windows or for
which the time to fuel damage is very long due to
the high capacity of the water pool to keep the fuel
cooled even without an active cooling system. In any
case, the final screening of relevant POS combinations
must be carried out specifically for each
multi-unit initiating event.


4 QUANTIFICATION OF MULTI-UNIT
SCENARIOS


4.1 Basic approach


The modelling and quantification approach followed
in SITRON assumes that the single-unit PSA
properly addresses the multi-unit scenarios, i.e.,
the impacts of dependences are modelled in such
a manner that the quantification of the model provides
“correct” risk metrics from the single-unit
point of view.


4.2 Generation of scenarios 


Generation of scenarios means studying the Minimal
Cut Set (MCS) lists for the identified multi-unit
initiating events. The MCS lists are
generated from the single-unit PSA. Basic events
that can be associated with multi-unit dependences
are identified for further quantification of dependences
(see Section 4.3).

At this stage, unimportant dependences can be
quantitatively screened out. One approach is to
study the maximum contribution from potential
multi-unit sequences for each relevant dependence
(represented by selected basic events). If the
sequence has a frequency below a screening criterion
(e.g. 1E-8/yr), even if full dependence is
assumed, then the dependence can be screened
out. Another approach is to select MCSs which
are above a screening criterion and to restrict the
examination of basic event dependences to that
MCS list.
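
The first screening approach can be sketched as a simple filter. The example assumes that a bounding multi-unit sequence frequency, evaluated with full dependence, has already been computed for each candidate dependence; the dependence names and numbers are hypothetical, and only the 1E-8/yr criterion is the example value quoted in the text.

```python
SCREENING_CRITERION = 1e-8  # per year, example value quoted in the text


def screen_dependences(bounding_freq: dict[str, float],
                       criterion: float = SCREENING_CRITERION):
    """Split dependences into (retained, screened_out).

    `bounding_freq` maps a dependence name to the maximum multi-unit
    sequence frequency obtained when full dependence is assumed.
    A dependence is screened out if even this bound is below the criterion.
    """
    retained = {k: f for k, f in bounding_freq.items() if f >= criterion}
    screened_out = {k: f for k, f in bounding_freq.items() if f < criterion}
    return retained, screened_out


# Hypothetical bounding frequencies (1/yr) for three candidate dependences.
candidates = {
    "shared_seawater_intake": 3.0e-7,
    "identical_diesel_generators": 5.0e-8,
    "shared_demineralized_water": 2.0e-10,
}
retained, screened_out = screen_dependences(candidates)
print(sorted(retained), sorted(screened_out))
```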


4.3 Quantification of dependences 


For those dependences and associated basic events 
that are not screened out, a quantification of the 
degree of dependence must be performed. Basi- 
cally, it means the evaluation of the conditional 
probability of an event at another unit given that a 
dependent event has occurred at one unit. 

For multi-unit initiating events, full dependence
is assumed. For propagating events and for
partial multi-unit events, a case-specific assessment
needs to be made.

For shared systems that are fully common, full
dependence is assumed. For shared systems which
have partly common sections and partly unit-specific
sections, an assumption needs to be made about
which unit takes credit for the common section.

The assessment of inter-unit CCFs is a crucial
issue from the quantification point of view. When
the units are identical, there are several important
CCF groups that dominate the risk. Since typically
full CCF dominates the results, a conservative
approach is to assume that the components of two
units form a joint CCF group. The event of interest
will be a complete CCF of the full group given
that a specific half of the components have failed.
In SITRON, several approaches to assess the conditional
CCF probabilities have been tested, e.g.,
using the CCF parameters of the single-unit
model, ICDE operating experience data (Hakansson
2017) and generic U.S. CCF data (U.S.NRC
2016).
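
The event of interest for the joint CCF group can be written as a conditional probability: since failure of the full group implies failure of any specific half, P(full group fails | specific half has failed) = P(full group fails) / P(specific half fails). The sketch below shows only this ratio; the two input probabilities are hypothetical placeholders and would in practice come from whichever CCF parameter source is chosen.

```python
def conditional_full_ccf(p_full_group: float, p_specific_half: float) -> float:
    """P(complete CCF of the joint group | a specific half has failed).

    Failure of the full group implies failure of the specific half,
    so the conditional probability is the ratio of the two.
    """
    if not 0.0 < p_specific_half <= 1.0:
        raise ValueError("p_specific_half must be in (0, 1]")
    if not 0.0 <= p_full_group <= p_specific_half:
        raise ValueError("p_full_group must lie in [0, p_specific_half]")
    return p_full_group / p_specific_half


# Hypothetical values: CCF of all components of one unit (the specific half)
# and CCF of all components of both units (the full joint group).
p_half = 1.0e-5
p_full = 2.0e-6
p_conditional = conditional_full_ccf(p_full, p_half)
print(f"Conditional inter-unit CCF probability: {p_conditional:.2f}")
```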

Post-initiating event operator actions can be 
divided into three groups from the dependence 
assessment point of view. Firstly, there are actions 
that can be considered unit-specific without 
dependences. This group mostly includes actions 
required in the short time window when the units 
need to manage the disturbance individually. Sec- 
ondly, there are actions in longer term, for which 
partial dependence could be assumed due to shared 
resources. Thirdly, there are actions for which full 
dependence should be assumed. 


4.4 Computation of multi-unit CDF 


The general formula for the two-unit CDF for a multi-unit
initiating event i is

$$\mathrm{MUCDF}_i = f_i \cdot \sum_j p(d_j) \cdot P(\mathrm{CD1} \mid \mathrm{IE}_i, d_j) \cdot P(\mathrm{CD2} \mid \mathrm{IE}_i, d_j), \tag{1}$$

where $f_i$ is the initiating event frequency; $p(d_j)$ is
the probability of the dependence event(s) j; and
$P(\mathrm{CD1} \mid \cdot)$ and $P(\mathrm{CD2} \mid \cdot)$ are the conditional core
damage probabilities given an initiating event i and
dependence event(s) j.
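
Equation (1) can be evaluated directly once the dependence events and the conditional core damage probabilities are available. The following is a minimal sketch under the assumption, made here for illustration, that the dependence events form a set of mutually exclusive states whose probabilities sum to one; all names and numbers are hypothetical and are not results from the pilot studies.

```python
from dataclasses import dataclass


@dataclass
class DependenceState:
    probability: float  # p(d_j)
    ccdp_unit1: float   # P(CD1 | IE_i, d_j)
    ccdp_unit2: float   # P(CD2 | IE_i, d_j)


def multi_unit_cdf(ie_frequency: float, states: list[DependenceState]) -> float:
    """Two-unit CDF for one multi-unit initiating event, following Eq. (1)."""
    return ie_frequency * sum(
        s.probability * s.ccdp_unit1 * s.ccdp_unit2 for s in states
    )


# Hypothetical example: one initiating event with frequency 1E-2/yr and two
# dependence states ("no common failure" and "common failure occurred").
states = [
    DependenceState(probability=0.99, ccdp_unit1=1e-4, ccdp_unit2=1e-4),
    DependenceState(probability=0.01, ccdp_unit1=1e-2, ccdp_unit2=1e-2),
]
mucdf = multi_unit_cdf(1e-2, states)
print(f"MUCDF_i = {mucdf:.3e} per year")
```

In this illustrative case the rare dependence state dominates the result, since both conditional core damage probabilities are elevated at the same time.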

There are several possibilities how to handle the 
dependent events. One approach used in the pilot 
studies was to re-quantify the joint MCS list gener- 
ated from the MCS-lists of single-unit models. In 
this approach, a basic event, A, associated with a 
dependence is partitioned into a “common basic 
event”, cA, and a “unit-specific event”, iA, 


A=cA*iA. (2) 


When the joint MCS list is created as a Boolean
product of MCS lists, the dependences will be
explicitly taken into account by the common basic
events. The probabilities for the common and
unit-specific basic events, respectively, are obtained from
the previous analysis step (Section 4.3).


5 PILOT STUDIES 


5.1 Pilot study scope 


In the SITRON project, two Swedish pilot studies
are made, one for the Forsmark nuclear power
station (Cederhorn et al. 2018) and the second for the
Ringhals nuclear power station (Bäckström et al.
2018). The Forsmark pilot study is limited to reactor
units 1 and 2, and the Ringhals pilot study to reactor
units 3 and 4. Forsmark 1 and 2 are Boiling
Water Reactors (BWR) of Asea-Atom design and
Ringhals 3 and 4 are Pressurised Water Reactors
(PWR) of Westinghouse design.

In both cases, the two units are practically identical reactors
located close to each other and have several common systems and
structures, such as the sea water intake. For both cases, complete
level 1 and 2 PSAs exist, covering all initiating event categories
(internal events, internal hazards, external hazards) and plant
operating states (power operation, shutdown, outage, power up-rate).

In 2017, the pilot studies included a qualitative analysis of unit
dependences and a quantitative analysis of the multi-unit initiating
event Loss Of Offsite Power (LOOP). The pilot studies were limited to
level 1 PSA.


5.2 Findings from the qualitative analyses 


In this section, a summary of findings from the qualitative analysis
of the two pilot studies is presented. It can be noted that the
identified dependences are very similar even though one study concerns
two BWRs and the other two PWRs. Therefore, the discussion given below
is valid for both studies.

For initiating events, both PSA studies include a comprehensive
analysis of external hazards. The list of external hazards can be
taken directly as a list of potential multi-unit initiating events,
including events like loss of offsite power and organic material in
sea water. Assessment of propagating initiating events was left out of
the scope of the pilot studies, since this task would require plant
visits and walk-downs. It was, however, identified that there are a
few common buildings for which fire and flooding hazards may be
considered as propagating events. Later, when the pilot studies are
extended to level 2 PSA, propagating effects of severe accident
situations, e.g. increased radiation level at the site, may need to be
considered, too.

Both pilot cases have almost the same important system and building
dependences. Examples of important common systems are the offsite grid
connections and the sea water intake. There are also several less
important common systems, such as the fire water system and the
demineralized water system.

Since in both pilot studies the units at the site 
are identical, practically all common cause failure 
groups could be considered potential inter-unit 
CCF groups. Assessment of relevant CCF groups 
was limited to the example scenario, LOOP. 

Both pilot studies consider a full scope of plant 
operating states. The average time share that the 
twin-units are simultaneously at-power is about 
90%. Since maintenance outages are not carried out 
in parallel, it can be assumed that the other possible 
POS-combinations include one unit being at-power 
and the second unit being at some shutdown state. 


5.3 Results of the quantitative analyses 


In both PSAs, LOOP initiating events are divided into several
sub-cases. In the pilot study, the multi-unit LOOP, leading to
simultaneous loss of the external grid for the twin units, is
considered. This initiating event has rather high risk importance in
both PSA studies.

The LOOP event has been considered for all POSs. When the time shares
of the POSs and the risk importances of LOOP during the various POSs
were quantified, the result was that only both units being at-power is
a significant POS combination. The reason is that the other POSs are
very short, except for one longer POS during the maintenance outage,
during which the core/fuel damage risk is very low due to the long
time window available to recover the situation. Also, the POSs
immediately before and after the at-power POS are, from the
PSA-modelling point of view, very similar to the at-power scenarios.
The same inter-unit dependences are important for those POSs as for
the at-power POS.

The most important minimal cut sets for the multi-unit LOOP have been
analysed qualitatively to group similar minimal cut sets together and
to characterize the cut sets from the time window and system failures
point of view. There are about ten groups of minimal cut sets that
dominate the result.

In almost all cases, core damage happens due to loss of power supply
to the systems required for core cooling. A common feature is that
house turbine operation fails, after which the safety functions depend
on Emergency Diesel Generators (EDG), Gas Turbines (GT) or Mobile
Diesel Generators (MDG). Recovery of the external grid is also a
possibility.

From the timing point of view, there are two main 
categories for the loss of 500 V AC power supply: 


• Immediate loss of power supply. This is caused by various
combinations of failures to start EDGs or GTs, or to connect them to
supply the bus bars.

• Later loss of power supply. This is caused by various combinations
of failures where the EDG start succeeds but the EDG stops later. From
the battery capacity point of view, these minimal cut sets could be
further divided into those occurring before or after the battery
depletion time.


In the assessment of twin-unit CDF, the following events have large
importance:

• Failure of house turbine operation. House turbine operation is
rather unreliable and a high probability is assumed that it fails in
both units (2 x 2 turbines). This event is included in all dominating
MCSs.

• Inter-unit CCF of batteries (two systems), which are vital for
successful power supply from EDGs. Over 80% of MUCDF can be eliminated
if both CCFs can be eliminated. The inter-unit CCF for batteries has
been assessed conservatively, assuming that the batteries form a joint
CCF group. The conditional probability for a full CCF given that half
of the batteries have failed is high due to the assumed CCF-model
parameters.

• Failure to recover the 400 kV grid, which is a common event for both
units. MUCDF is decreased by almost 50% if LOOP is only a short-term
event.

• Unavailability of the gas turbine, which is a common system for both
units. Gas turbine events contribute about 50% to MUCDF.


Regarding operator actions, multi-unit dependent actions are important
only in the later phase of the scenarios, and they do not have large
risk importance.

The MUCDF assessment has been very simplified and includes several
uncertainties. The conditional probability of a double-unit core
damage given one core damage is 0.1 to 0.2. The most important
uncertainty is the assessment of the probability of the inter-unit
CCF. If no inter-unit CCF is assumed, MUCDF decreases by a factor of
more than 100.


6 CONCLUSIONS 


Qualitative analysis of multi-unit dependences is a rather
straightforward task and should be included already in the single-unit
PSA. For instance, multi-unit IEs can be identified rather easily
since they are practically equal to the list of external hazards.

As a first approximation, it can be assumed that external hazards are
complete multi-unit IEs. Judging whether hazards should be considered
partial multi-unit IEs may require considerably more effort, plant
visits and statistical analyses or expert judgements. The same applies
to the identification of propagating IEs, whose relevance from the
multi-unit risk point of view cannot be judged without plant visits.

Identification of common systems and structures is also a
straightforward task and should already have been considered in the
single-unit PSA. Relevant operator action dependences can be assumed
to exist mainly in long-term actions, since in the beginning of the
scenarios the units are designed to manage emergency situations
independently.

Relevant multi-unit dependences and scenarios are related to events
impacting the safety functions of core cooling and residual heat
removal. From the initiating event point of view, the common
disturbances can be classified into the general groups 1) loss of
power supply or 2) loss of ultimate heat sink. Some external hazards
can cause both plant impacts.


From the POS combinations point of view, it is 
likely that the only POS combination which needs 
to be considered is both units being at-power. The 
other combinations can be screened out either by 
their very short time duration or by the very long 
recovery times for which reasons such events have 
negligible contribution to the multi-unit risk. This 
conclusion is based on the assessment of the multi- 
unit LOOP initiating event. 

The assessment of inter-unit CCF is a crucial issue from the
quantification point of view. When the units are identical, there are
several important CCF groups that dominate the risk. The
quantification principle applied in the pilot study is presumably very
conservative, which can lead to a high conditional probability for a
multi-unit core damage given a single-unit core damage.

The pilot study case, loss of offsite power (both 400 kV and 70 kV),
did not reveal any dependences that would suggest revisions to the
single-unit PSA. There are a few dependent operator actions which have
some importance, but they are treated in a quite simplified manner in
the current PSA. The functional role of the gas turbine in the
multi-unit scenario might also require some more detailed analysis.

From the MUCDF assessment point of view, a rather simple
quantification can be performed using the dominating minimal cut sets
and basic events. Two quantification approaches have been tried out in
the pilot studies, both providing practical ways to assess the
multi-unit risk metrics and the risk importances of various items of
the model.

It should be noted that the conclusions made here are based on a
single scenario and on level 1 PSA. It could be expected that some
weather-related hazards impacting the sea water intake and the
extension of the analysis to level 2 may bring up further issues,
e.g., the role of dependent operator actions may be more important.

for emergency diesel generators in nuclear power plants

ABSTRACT: Safety classification of structures, systems and components
of nuclear power plants shall be based on the functional importance of
the items. One challenge with the nuclear safety classification is
that different classification systems are used in different countries
and standards. Even if a certain classification scheme can be agreed
upon, it is not straightforward how the classification should be
carried out for various components, e.g., in electric and automation
systems. At higher plant and system level, the safety importance of
functions can be defined directly, but the classification of smaller
components requires further assessments. In principle, a component’s
safety class follows its functional importance. Downgrading is
possible if mitigative factors and reliability arguments can be shown.
The importance of finding the correct safety class is that it
determines the QA requirements for the component. The paper will
outline a risk-informed safety classification approach for the
components, based on both probabilistic and deterministic assessments.
An emergency diesel generator system will be used as an example.


1 INTRODUCTION 


Safety classification of Structures, Systems and Components (SSC) of
nuclear power plants (NPP) shall be based on the functional importance
of the items. One challenge with nuclear safety classification is that
different classifications are used in different countries and
standards. International standards and guidelines give some common
guidance on the classification, but in practice the licensees and
vendors need to adapt the classification to the national system (WNA
2015). Even if a certain classification scheme can be agreed upon, it
is not straightforward how the classification should be carried out
for various components.

At higher plant and system level, the safety 
importance of functions can be defined directly, 
but the classification of smaller components 
requires further assessments. In principle, a com- 
ponent inherits its safety class from its functional 
role based on the functional impact of its failure. 
This principle can be considered a deterministic 
approach to safety classification since the func- 
tional importance is determined by the determinis- 
tic safety analysis and the defence-in-depth concept 
of the design. 

The strength of the deterministic approach to safety classification is
the link with strong safety design principles such as
defence-in-depth, safety margin, redundancy, diversity and
independence (Ahn et al. 2010). This approach has limitations since it
does not systematically and explicitly consider the risk importance of
items, which could be assessed by Probabilistic Safety Assessment
(PSA). The deterministic approach generally only considers worst case
scenarios (Kirschsteiger 1999).

The importance of finding the correct safety class is that it
determines the Quality Assurance (QA) requirements for the component.
Too low a safety class may imply a system reliability concern; too
high a safety class can be a significant cost factor, and it can even
make it difficult to find a component supplier for the nuclear market,
which has specific QA requirements.

Since the development of PSA methods and applications for NPP safety
management, there have been attempts to implement a risk-informed
approach to safety classification. A well-known and applied approach
is the U.S.NRC (2011) guide to risk-informed decision making. In
short, the PSA is used to determine the risk importance of a
component, and the risk importances are compared to the deterministic
safety class. Risk importance measures such as Fussell-Vesely and Risk
Achievement Worth are used for this purpose. Such studies and
discussions can be found, e.g., in (Jänkälä 2002; Holmberg & Männistö
2008).

Safety Integrity Levels (SIL) introduced in IEC 61508 (IEC 2010) and
related branch-specific standards are also examples of risk-informed
safety classification. The reasoning behind SIL is, however, quite
different from the nuclear safety classification principles, for which
reason it would be difficult to apply it as such in the nuclear
context, though associations can be made between SIL and nuclear
safety classes, see e.g. Annex D of IEC 61513 (IEC 2011).

This paper discusses an approach to risk-informed safety
classification. The primary role of the deterministic safety
classification is acknowledged, especially the assessment of the
functional importance of items. The issue to be resolved is how the
safety classification of components can be reassessed based on their
risk importance. Electric and automation systems are the main intended
application area since these systems consist of a large number of
components with different functional roles, failure modes and risk
importance. Section 2 describes the safety classification system.
Section 3 presents the emergency diesel generator (EDG) system used as
an example. Section 4 outlines the safety classification approach, and
Section 5 concludes the paper.


2 SAFETY CLASSIFICATION SCHEME 


2.1 Defence-in-depth and plant condition categories


The safety philosophy of NPPs builds on the Defence-in-Depth (DiD)
principle. Here, we consider DiD as successive levels of protection,
which rely on the application of the safety principles of multiple
barriers, physical separation, redundancy and diversity. The standard
nuclear five-level classification is taken as the basis (IAEA 1996).
DiD levels can be mapped one-to-one to “plant condition categories” or
“design basis categories”, which are fundamental elements of
deterministic safety analyses, see Table 1.

The relationship between defence-in-depth and safety classification is
immediate in the sense that each DiD level is assigned to a certain
safety class, and all items belonging to one level have the same
safety class. It follows that all items within one level have the same
requirements for design, qualification, regulatory review and QA
procedures during all life cycle phases. Different DiD levels can
belong to different safety classes, and this is considered beneficial
both from the diversity point of view and from the optimal resource
allocation point of view.


2.2 Safety classification systems in nuclear field 


In the nuclear field, there are both national and international
variations in the safety classification systems, though all systems
have a link to the DiD principle. For example, the International
Electrotechnical Commission categorization (IEC 2009) defines three
safety categories A, B and C, whereas the American standards of the
Institute of Electrical and Electronics Engineers use a classification
that only distinguishes between safety and non-safety systems (IEEE
2003).

Table 1. Defence-in-depth levels and plant condition categories.

DiD level | Objective | Plant condition category
1 | Prevention of abnormal operation and failures | Normal operation
2 | Control of abnormal operation and failures | Anticipated operational occurrences (AOO)
3a | Control of accident to limit radiological releases and prevent escalation to core melt conditions | Design basis accidents (DBA)
3b* | Control of accident to limit radiological releases and prevent escalation to core melt conditions | Design extension conditions (DEC)**
4 | Control of accidents with core melt to limit off-site releases | Postulated core melt accidents or severe accidents (SA)
5 | Mitigation of radiological consequences of significant releases of radioactive material | -

*3b controls postulated common cause failures in level 3a.
**DEC has three subcategories (STUK 2013a): a) AOO/DBA & common cause
failure in level 3a, b) significant events identified in PSA and
c) rare external events.

The International Atomic Energy Agency 
(IAEA) has adopted the following three-level 
safety category system (IAEA 2016): 


• Safety category 1: Any function that is required to reach the
controlled state after AOO or DBA and whose failure, when challenged,
would result in consequences of ‘high’ severity.

• Safety category 2: Any function that is required to reach a
controlled state after AOO or DBA and whose failure, when challenged,
would result in consequences of ‘medium’ severity.

• Safety category 3: Any function that is actuated in the event of AOO
or DBA and whose failure, when challenged, would result in
consequences of ‘low’ severity. Any function that is designed to
reduce the actuation frequency of the reactor trip or engineered
safety features in the event of a deviation from normal operation,
including those designed to maintain the main plant parameters within
the normal range of operation of the plant.


In addition to the overall safety classification systems, there are
more technically-oriented classification standards, such as IEC 60709
for the physical separation of I&C equipment (IEC 2004) and IEEE 384
for independence requirements of electric circuits and equipment (IEEE
2008). In this respect, the treatment of electric and I&C equipment is
quite straightforward. With regard to mechanical equipment in electric
and I&C systems, the requirements are less strict, which leaves room
for interpretation.


2.3 Risk-informed safety classification system 


In this paper, a risk-informed safety classification 
system is outlined, in which the system functions 
and associated components are first classified 
based on their functional importance. Refined 
assessment of the component safety classes is 
made based on the assessment of mitigative fac- 
tors and reliability considerations. The procedure 
is illustrated in Figure 1. 

In the first step, the functions, sub-functions, sub-sub-functions,
etc. of the system are defined down to a sufficient level of detail so
that the functional importance of each component can be defined. Here
“component” is associated with the spare parts level itemization of a
system, which is the level of detail that needs to be achieved in the
safety classification from the QA requirements point of view.

The functional importance of components can be analysed, e.g., using
Failure Modes and Effects Analysis (FMEA). The functional impact of
the component’s failure mode determines the preliminary safety class.

In the final step, the component’s safety class can be reassessed
based on two more criteria: the reliability of the component and the
existence of mitigative factors. Low failure probability and
mitigative factors can be used as arguments to downgrade the safety
class (see Section 4 for details).

Figure 1. Procedure for risk-informed safety classification. [Flow of
four steps: definition of functions and subfunctions of the system;
classification of functions and subfunctions; failure modes and
effects analysis of components, preliminary classification;
consideration of mitigative factors, final classification.]

A three-level safety class system is proposed. 
Three classes are considered practical for the 
categorisation of components and yet sufficient 
to be acceptable in various national regulatory 
frameworks. 

Safety Class 1 (SC1) represents the highest safety category. It is
assigned to functions that are required to cope with DBAs, i.e., they
belong to DiD level 3, which is the most important from the safety
point of view and thus has the highest QA requirements.

Safety Class 2 (SC2) is assigned to other safety 
related functions than those critical in DBA sce- 
narios. This includes, e.g., functions related to DiD 
levels 2, 3b and 4. In addition, SC2 can be assigned 
to manual back-up of SC1 functions, monitoring 
and surveillance of SC1 functions and functions 
whose failure can impact the long-term reliability 
of SC1 functions. 

Safety Class 3 (SC3) is assigned to non-safety 
related functions. This includes, e.g., functions 
related to DiD level 1. 

Compared to the safety classification systems of IEC and IAEA, the
proposed classification system has one class fewer. In practice, SC1
corresponds to the highest safety category of the nuclear
classification systems, and SC2 corresponds to the other safety
related classes (e.g. Cat. B and C in IEC 61226). The reason for
merging the other safety related classes into a single class is that
it is not practical to have too many safety classes for the
categorisation of components.
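
The assignment of the three proposed classes to DiD levels can be written down as a small lookup. The sketch below is only an illustration: the dictionary and function names are hypothetical, DiD level 3 is split into 3a (DBA, SC1) and 3b (SC2) in line with the class definitions above, and DiD level 5 is left out because no class is assigned to it in the text.

```python
# Proposed safety class per DiD level, following the definitions of SC1 to SC3.
# DiD level 5 is deliberately absent: the text does not assign it a class.
DID_TO_SAFETY_CLASS = {
    "1": "SC3",   # non-safety related functions
    "2": "SC2",   # safety related, not critical in DBA scenarios
    "3a": "SC1",  # functions required to cope with DBAs
    "3b": "SC2",
    "4": "SC2",
}


def functional_safety_class(did_level: str) -> str:
    """Return the proposed safety class for a function on the given DiD level."""
    try:
        return DID_TO_SAFETY_CLASS[did_level]
    except KeyError:
        raise ValueError(f"no safety class defined for DiD level {did_level!r}") from None


for level in ("1", "2", "3a", "3b", "4"):
    print(level, functional_safety_class(level))
```

The lookup only covers the DiD-based part of the definition; SC2 assignments for manual back-up, monitoring and surveillance of SC1 functions would need additional rules.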


3 FUNCTIONAL DESCRIPTION OF 
EMERGENCY DIESEL GENERATORS 


3.1 Main functions and subsystems 


In this paper, Emergency Diesel Generators (EDG) will be used as an
example. The main safety function of EDGs for nuclear power plants is
to provide power supply to safety-critical electric bus bars in case
of a Loss Of Offsite Power (LOOP) initiating event. Typically, there
is one EDG per safety train or per two safety trains, i.e., there are
two to four EDGs per reactor providing emergency power supply. In this
paper, “EDG system” refers to the complete set of redundant EDGs (2 to
4 EDGs), and “EDG” refers to a single-train EDG.

For new designs, the requirement is to have a diverse back-up for the
EDGs to cope with the DEC scenario where LOOP occurs in combination
with a CCF of EDGs. For this purpose, the plants may have diverse DGs,
called Station Black-Out (SBO) DGs, gas turbines and/or mobile DGs.
Even cross-connections between reactor units may be a solution, though
this is not allowed in all countries.

To analyse the EDG from the safety and reliability point of view, it
needs to be broken down into subsystems and functions, and a system
boundary must be defined. In practical EDG applications, system
boundaries vary, but here we divide the whole system into three major
parts. The EDG generates the electric power from diesel fuel and
includes subsystems such as the generator, diesel engine, fuel oil
system, cooling system, lubrication system, starting air system,
combustion air system, exhaust system, and I&C system. Important
auxiliary or support systems of the EDG include fuel oil storage and
supply, the cooling water system, the cooling air and ventilation
system for the engine room, and the electric power system for the
control and protection system. In addition, the EDG function needs
electric power transmission lines and control logic to the bus bars
dependent on the EDG.


3.2 Functional classification 


For the sake of simplicity, we consider two types 
of LOOP events: 


• Design Basis Accident (DBA) LOOP, for which the fuel stored in the
day tank is sufficient. This is called short-term LOOP, and the
corresponding EDG function belongs to DiD level 3a, receiving SC1.

• Design Extension Condition (DEC) LOOP, for which the fuel oil from
the storage tank is also needed. This is called long-term LOOP, and
the corresponding EDG function belongs to DiD level 3b, receiving SC2.


Sub-functions of EDG can be classified depend- 
ing on their criticality to the above main safety 
functions. As an example, the sub-functions of the 
fuel oil subsystem are provided in Table 2. Figure 2 
depicts a simplified flow diagram. The fuel oil sub- 
system is responsible for the feeding of the diesel 
engine with fuel oil. Feeding is arranged from the 
day tank, which can be loaded from a larger storage. 

The fuel oil subsystem has a functional safety 
class 1 since it is necessary for the DBA safety 
function. Most subsubsystems of the fuel oil sub- 
system have the same safety class. Oil storage and 
transfer is not needed during the DBA case but is 
needed in the DEC case. Therefore, it belongs to 
SC2. Fuel unloading from the oil storage belongs 
to SC3 since it is a maintenance action with no 
safety relevance. 

In the functional analysis, the sub-functions are broken down into a
level that facilitates the classification of components. In Table 2,
the sub-function 4 “Fuel oil feeding and circulation” has been broken
down into seven sub-sub-functions. The next step is to perform an FMEA
to assign preliminary safety classes for the components. Table 3 shows
an example for some components contained in the subsubsystem “Fuel oil
feeding and circulation”. The primary classification is derived from
the safety class of the impacted function.

Table 2. Functional classification of the fuel oil system.

Sub-function | Class
Fuel oil system | 1
1. Fuel unloading | 3
2. Oil storage and transfer | 2
3. Fuel oil day tank |
4. Fuel oil feeding and circulation |
4.1 Fuel feeding | 1
4.2 Fuel oil impurity removal (filters) |
4.3 Fuel injection to engine |
4.4 Leak fuel handling |
4.5 Emergency cut-off |
4.6 Overpressure protection |
4.7 Drain |
5. Fuel oil cooling |
6. Leak fuel handling |

[Class column illegible in the source; only the values stated in the
text and in Table 3 are shown.]

Figure 2. Simplified flow diagram of the fuel oil system for EDG.

Table 3. Failure modes and effects analysis example. Columns
F = Function, C = Failure cause and D = Failure detection are left
undeveloped in the example.

Component | F | Failure mode | C | D | Impacted function | Prel. class
Pipe x | . | Rupture | . | . | 4.1 | 1
 | . | Leakage | . | . | 4.1 | 1
 | . | Blocking | . | . | 4.1 | 1
Manual shut-off valve y | . | Wrong position | . | . | 4.1 | 1
Pump z | . | Failure to start | . | . | 4.1 | 1
 | . | Spurious stop | . | . | 4.1 | 1
Strainer s | . | Clogging | . | . | 4.1 | 1
 | . | Rupture | . | . | 4.1 | 1
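
The rule that a component inherits its preliminary class from the impacted function can be sketched in a few lines of Python. The data structures and names below are hypothetical; the class values are limited to those stated in the text and in Table 3, and the "Transfer pump" row is an invented example that is not part of Table 3.

```python
# Functional safety class per sub-function (only values stated in the text).
SUBFUNCTION_CLASS = {
    "1": 3,    # Fuel unloading
    "2": 2,    # Oil storage and transfer
    "4.1": 1,  # Fuel feeding
}

# FMEA rows: (component, failure mode, impacted sub-function).
FMEA_ROWS = [
    ("Pipe x", "Rupture", "4.1"),
    ("Pump z", "Failure to start", "4.1"),
    ("Strainer s", "Clogging", "4.1"),
    ("Transfer pump (hypothetical)", "Failure to start", "2"),
]


def preliminary_class(impacted_function: str) -> int:
    """A component failure mode inherits the class of the function it impacts."""
    return SUBFUNCTION_CLASS[impacted_function]


def component_classes(rows):
    """Most demanding (numerically lowest) preliminary class per component."""
    result = {}
    for component, _mode, function in rows:
        cls = preliminary_class(function)
        result[component] = min(cls, result.get(component, cls))
    return result


print(component_classes(FMEA_ROWS))
```

Taking the most demanding class over all failure modes of a component is an assumption made for this sketch; the text only states that the class follows the impacted function.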


4 RISK-INFORMED SAFETY 
CLASSIFICATION 


The risk-informed safety classification accounts for two criteria in
addition to the functional importance of the component: the
reliability of the component and the existence of mitigative factors.

The reliability of a component is principally measured by the
frequency or probability of the critical failure modes. The role of
the component reliability assessment is two-fold: 1) to demonstrate
the fulfilment of the system reliability target, and 2) to justify
downgrading for components whose unavailability can be demonstrated to
be insignificant.

A mitigative factor is a feature of the system or component design
which can eliminate or mitigate the impact of the component failure. A
mitigative factor can also be seen as a means to improve the system
reliability. The effectiveness of mitigative factors can thus be
measured probabilistically.

Figure 3 depicts a principal scheme for the reclassification. It
should be noted that this scheme is only applied to SC1 and SC2
components. For SC3 components, there is no need to consider
reclassification from the safety point of view. In the following
subsections, a further interpretation of the boxes of the scheme will
be provided.
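
One possible reading of the reclassification scheme is sketched below in Python. It is an interpretation and not the authors' exact flowchart: the function name, the string labels and the way the two criteria are combined (either a low failure rate or a mitigative factor opens the possibility of downgrading) are assumptions based on the description in this section.

```python
def reassess(functional_class: str, failure_rate_low: bool,
             mitigative_factors: bool) -> str:
    """Indicative outcome of the reclassification for one component."""
    if functional_class == "SC3":
        # SC3 components are outside the scope of the scheme.
        return "no need to reassess"
    if functional_class not in ("SC1", "SC2"):
        raise ValueError(f"unknown safety class: {functional_class!r}")
    if failure_rate_low or mitigative_factors:
        # Either argument may support a lower class (assumed combination).
        return "downgrading possible"
    # Neither argument is available: keep the class, or change the design.
    return "keep safety class or redesign"


print(reassess("SC1", failure_rate_low=False, mitigative_factors=True))
print(reassess("SC2", failure_rate_low=False, mitigative_factors=False))
print(reassess("SC3", failure_rate_low=True, mitigative_factors=True))
```

What counts as a "low" failure rate is left open here; the reliability targets derived in the following subsections are meant to give that threshold a quantitative basis.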


4.1 Safety goal based reliability targets 


4.1.1 System reliability target 

In the risk-informed safety classification, the functionally derived
classification of components can be revised based on the risk
importance of the component. The idea is not to exactly quantify the
risk importances, but to use probabilistic reasoning in an indicative
manner to define reliability levels that can be used to define rules
for the reassessment. Further, the derived reliability targets for
equipment can be used in the argumentation on a reasonable level of
required reliability. In addition, the discussion can be used to
support the assessment of the effectiveness of possible mitigative
factors and to derive an interpretation for negligible risk
contributors.

Figure 3. Principal scheme for final classification of components.
[Flowchart: define the functional safety class; for SC3 there is no
need to reassess the safety class; for SC1 and SC2 the questions
“Failure rate low?” and “Mitigative factors?” lead to the outcomes
“Redesign”, “Keep safety class” or “Downgrading possible”.]

We assume certain generic risk criteria and design features of a
nuclear power plant to obtain system reliability targets. In a
specific project, these assessments need to be re-evaluated. Typical
risk criteria are (OECD/NEA 2009) 1E-5/yr for the Core Damage
Frequency (CDF) and 1E-6/yr for the Large Release Frequency (LRF).

EDG-failure related accident sequences may not contribute more than 1%
to the CDF criterion (1E-7/yr). This is a hard requirement, but
station black-out sequences also have a high potential to lead to a
large release (LRF criterion).

The frequency of LOOP is typically of the order of 1E-1/yr (Johnson &
Schroeder 2016), but most LOOP events are very short. The frequency of
LOOP for which recovery of offsite power cannot be credited is assumed
to be about 1E-2/yr.

The plant has a diverse back-up for EDGs in case of station black-out.
The unavailability of the diverse back-up is 0.1. This is a
conservative value. The above probability numbers yield a failure
probability target for the EDG system

$$U_t(\text{system}) = \frac{CDF_{\text{target}}}{f(\text{LOOP}) \cdot p(\text{EDG back-up})} = \frac{1\text{E-}7/\text{yr}}{1\text{E-}2/\text{yr} \cdot 0.1} = 1\text{E-}4. \qquad (1)$$

This target is for a set of redundant EDGs. Typically, an NPP has 2 to
4 EDGs. To derive a target for one EDG, the probability of Common
Cause Failure (CCF) must be estimated. The probability of CCF depends
on many factors, e.g., the number of EDGs and the testing scheme of
the EDGs. Using CCF parameters of EDGs estimated from data from US
NPPs (U.S.NRC 2016), it can be estimated that the fraction of total
CCF for two-redundant and four-redundant systems, respectively, is
about (order of magnitude)

$$CCF_{2/2} = 5\% \qquad (2)$$
$$CCF_{4/4} = 0.5\% \qquad (3)$$

Assuming a four-redundant system, an unavailability target for one EDG
(out-of-four components) can be defined as follows

$$U_t(\text{1oo4 EDG}) = \frac{U_t(\text{system})}{CCF_{4/4}} = \frac{1\text{E-}4}{0.005} = 2\text{E-}2. \qquad (4)$$

This is a first estimate for a single-train EDG 
reliability target, and it will be compared to experi- 
ence based values for EDGs in the next subsection 
to define final choices for a reasonable reliability 
target. 
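
The step from the system target to a single-train target can be reproduced as follows. This is a minimal sketch with hypothetical names; the system target of 1E-4 follows from the figures in this subsection (1E-7/yr divided by 1E-2/yr times 0.1). The two-redundant case is included only for comparison and is an extrapolation that the text itself does not make.

```python
def single_train_target(u_system: float, ccf_fraction: float) -> float:
    """Single-EDG unavailability target, assuming system failure is dominated by total CCF."""
    return u_system / ccf_fraction


U_SYSTEM = 1e-4     # system target derived from the CDF criterion
CCF_4_OF_4 = 0.005  # fraction of total CCF, four-redundant system
CCF_2_OF_2 = 0.05   # fraction of total CCF, two-redundant system

u_1oo4 = single_train_target(U_SYSTEM, CCF_4_OF_4)
u_1oo2 = single_train_target(U_SYSTEM, CCF_2_OF_2)  # extrapolation, not in the text
print(f"four-redundant: {u_1oo4:.1e}")
print(f"two-redundant:  {u_1oo2:.1e}")
```

The comparison suggests that, under the same assumptions, a two-redundant system would need each EDG to be about ten times more reliable to meet the same system target.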


4.1.2 Experience based reliability of EDG 
As a reference for the reliability targets for EDGs, 
experience based reliability of EDGs is examined. 
The estimates will be made using T-book (TUD 
Office 2015) and U.S. NPP EDG reliability data 
(U.S.NRC 2015). Repair and maintenance time 
unavailabilities are omitted in the estimation. 
Using the parametrization of T-book, the unavailability 
of an EDG can be expressed as follows 

\[ U = q + \lambda_s \cdot TI + \lambda_d \cdot TM, \tag{5} \]

where q = constant unavailability (failure to start); 
λ_s = standby failure rate (failure to start); λ_d = mission 
time failure rate (spurious stop); TI = test 
interval; and TM = mission time. 

In T-book version 8, the generic parameters for 
the failure rates and probabilities are q = 3.9E-4; 
λ_s = 4.0E-3/h and λ_d = 3.9E-2/h. Assuming a test 
interval TI = 672 h and a mission time TM = 8 h, 
the mean unavailability is 

\[ U_{\text{T-book}}(\text{1 EDG, 8 h}) = \text{1.8E-2}. \tag{6} \]


In the U.S. NPP reliability database, the parametrization 
is slightly different, as follows 

\[ U_{US} = q + \lambda_1 \cdot 1\,\text{h} + \lambda_2 \cdot (TM - 1\,\text{h}), \tag{7} \]

where q = probability of failure to start; λ_1 = first 
hour (load run) failure rate; and λ_2 = mission time 
failure rate after the first hour. The parameter values 
for EDG are q = 2.88E-3 (table EDG-FTS); 
λ_1 = 3.72E-3/h (table EDG-FTLR); λ_2 = 1.52E-3/h 
(table EDG-FTR). These parameter values yield 

\[ U_{US}(\text{1 EDG, 8 h}) = \text{1.7E-2}, \tag{8} \]

which is remarkably close to the T-book’s estimate. 

One should note that the EDG system boundaries 
used in T-book and the US data may vary. In 
addition, the EDG system boundary applied by 
the above references does not include all support 
systems and connected systems, which are also 
critical to the overall EDG function. The conclusion 
is thus that the experience based unavailability 
figure of EDG is higher than the target derived in 
Section 4.1.1 (formula (4)). A difference is that the 
experience based values reflect old systems while 
the target value derived in Section 4.1.1 is suited 
for new systems. Therefore, this target value is chosen 
for further development of the component reliability 
targets, i.e., 

\[ U_t(\text{1 EDG}) = \text{2E-2}. \tag{9} \]


If this value can be demonstrated for a single- 
train EDG including auxiliaries, the system should 
be at least as good as current EDGs. 

In case of the DEC target (SC2 function), the 
mission time is longer and the fuel storage and 
transfer to the day tank are included in the consideration. 
If the mission time is changed to 72 h (as an 
example), the following experience based unavailabilities 
are obtained (excluding auxiliaries) 

\[ U_{\text{T-book}}(\text{1 EDG, 72 h}) = \text{1.2E-1}, \tag{10} \]

\[ U_{US}(\text{1 EDG, 72 h}) = \text{1.1E-1}. \tag{11} \]
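
The U.S. parametrization of formula (7) can be checked against the quoted results for both mission times. The Python sketch below is illustrative only; the function and constant names are assumptions made here, while the parameter values are those cited from tables EDG-FTS, EDG-FTLR and EDG-FTR.

```python
def unavailability_us(q, lam_first_hour, lam_after, mission_time_h):
    """Formula (7): failure to start, first-hour load run, then the remaining run time."""
    return q + lam_first_hour * 1.0 + lam_after * (mission_time_h - 1.0)

# Parameter values quoted in the text.
Q_FTS = 2.88e-3      # probability of failure to start
LAM_FTLR = 3.72e-3   # 1/h, first hour (load run)
LAM_FTR = 1.52e-3    # 1/h, after the first hour

u_8h = unavailability_us(Q_FTS, LAM_FTLR, LAM_FTR, 8.0)
u_72h = unavailability_us(Q_FTS, LAM_FTLR, LAM_FTR, 72.0)

print(f"mission time  8 h: {u_8h:.1e}")
print(f"mission time 72 h: {u_72h:.1e}")
```

The linear sum is a first-order approximation that holds while each term is small; at 72 h the run-time term already dominates the total.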

Considering possible trending and missing auxiliaries 
in the above estimates, one could nevertheless 
define the following design extension condition 
reference value for the EDG 

\[ U_t(\text{EDG}) = \text{1E-1}. \tag{12} \]

The above unavailability targets may seem to 
be rather high, but it should be noted that EDGs 
are backed up by a diverse power supply in all new 
NPPs and in all modernised NPPs, e.g., by SBO 
DGs, mobile DGs, unit cross-connections or a gas 
turbine. Therefore, from a risk point of view, there 
is usually no need to demonstrate better reliability 
for EDGs. 


4.1.3 Derivation of reliability targets for EDG items 

The next step is to derive reliability targets for the 
items of the EDG. This cannot be done straightforwardly, 
since it depends on the way the EDG is 
decomposed into items and the way the reliability targets 
are distributed between the items. For 
some items, a higher unavailability can be allowed 
if, at the same time, others can be shown to be very 
reliable. In any case, a fault tree analysis should be 
used to demonstrate that the overall system reliability 
target is achieved. 

It is assumed that the items form a serial system 
so that the EDG unreliability is the sum of the items’ 
unavailabilities. The impact of redundancy and other 
mitigative factors will be considered separately. 

Using a kind of ALARP approach (As Low As 
Reasonably Practicable), a limit and a target value 
are defined for the reliability. The limit value must be 
achieved, and the target value is a reference for the 
interpretation of a negligible contribution. 

We assume that there are, from the reliability 
point of view, three groups of components: a) 
unavailability is close to the limit value, u^a; b) unavailability 
is clearly below the limit value but not negligible, 
u^b; and c) unavailability is negligible, u^c. The 
relationship between the unavailability values u^a, 
u^b, and u^c can be defined as follows 

\[ u^a = 10\,u^b = 100\,u^c. \tag{13} \]

The total unavailability will then be 

\[ U = n_a u^a + n_b u^b + n_c u^c \approx n_a u^a + n_b u^b, \tag{14} \]

where n_a = number of components having unavailability 
close to the limit u^a; n_b = number of components 
having unavailability well below the limit; 
and n_c = number of components having negligible 
unavailability, i.e., the term n_c u^c is insignificant. 

Next, we assume that the number of components 
that have unavailability close to the limit 
value, n_a, is much smaller than the number of 
components that have lower unavailability, n_b. We 
can thus allocate the system unavailability target 
about evenly between these two groups, i.e., 50% 
of the target value to a-components and 50% to 
b-components. Further, as an order of magnitude 
estimate, we may say that there are about 100 components 
per single EDG (at the spare parts level of 
itemization) in each safety-critical class, SC1 and 
SC2. n_a could be about 10 and n_b about 100. This 
reasoning leads to a relative item-level unavailability 
limit of 5% (50% of the system target shared by about 
10 a-components), which corresponds to the following 
absolute unavailability limits 

\[ u^a_{SC1}(\text{EDG-item}) = \text{1E-3}, \tag{15} \]
\[ u^a_{SC2}(\text{EDG-item}) = \text{5E-3}. \tag{16} \]

Consistently, the target value based on formula 
(13) should be a factor 100 lower, 

\[ u^c_{SC1}(\text{EDG-item}) = \text{1E-5}, \tag{17} \]
\[ u^c_{SC2}(\text{EDG-item}) = \text{5E-5}. \tag{18} \]
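
The allocation reasoning leading to formulas (15) to (18) can be sketched as follows. This Python fragment is a minimal illustration under the assumptions stated in the text (about 10 a-components, about 100 b-components, half of the system target given to the a-group); the function name and its arguments are hypothetical.

```python
def item_limits(u_system_target, n_a=10, n_b=100, share_a=0.5):
    """Derive item-level values for groups a, b and c of a serial system.

    u_a follows from giving group a the fraction `share_a` of the system
    target; u_b and u_c then follow from formula (13).
    """
    u_a = share_a * u_system_target / n_a   # limit value per item
    u_b = u_a / 10.0
    u_c = u_a / 100.0                       # target value (negligible contribution)
    # Formula (14) without the group c term, as a consistency check.
    total = n_a * u_a + n_b * u_b
    return u_a, u_b, u_c, total

for label, target in (("SC1", 2e-2), ("SC2", 1e-1)):
    u_a, u_b, u_c, total = item_limits(target)
    print(f"{label}: limit {u_a:.0e}, target {u_c:.0e}, sum of a and b groups {total:.0e}")
```

With the default counts the b-group also sums to half of the system target, so the two groups together reproduce the system value; other counts would break this balance and require a different split.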


The unavailability limits/targets include the following 
unavailability contributions of an item: 

• latent failures occurring during the standby 
  period or in connection with the previous test, 
  maintenance or operation moment. This can be 
  split into a time-independent part and a time-dependent 
  part, cf. formula (5), 
• mission time failures. 

The maintenance and repair related unavailability 
contribution can be omitted in this context since 
it is stipulated by Safety Technical Specifications. 

Since for almost all items, the failure modes can 
be related either to the system standby time or to 
the operational time, there is no need to further 
split the item-specific reliability limits/targets into 
latent and mission time reliability targets. 

Table 4 provides item-specific limits and targets. 
These numbers have been derived using specific 
assumptions about the system reliability target and 
the number of items between which the reliability target 
must be allocated. The proposed numbers should 
be considered indicative values to support qualitative 
argumentation in the reassessment of the 
classification. 


Table 4. Indicative unavailability limits and targets for 
items of a serial system with reliability targets 2E-2 (SC1) 
and 1E-1 (SC2). Assumed number of items per function 
~100. 

                                        SC1 function       SC2 function
Failure mode time dependency            Limit    Target    Limit    Target
Latent, time-independent failure
  probability, q                        1E-3     1E-5      5E-3     5E-5
Latent, time-dependent failure
  rate, λ_s*                            1E-6/h   1E-8/h    5E-6/h   5E-8/h
Mission time failure rate, λ_d**        1E-4/h   1E-6/h    1E-4/h   1E-6/h

* 1-month test interval assumed; 
** 8 h/72 h mission time assumed for SC1/SC2 functions. 


Table 5. Classification of mitigative measures. 

Mitigative measure      Description, examples                         Mitigation factor
Practical elimination   Inherent feature that eliminates the          ~10^-4
of the failure mode     failure mode

Reliable elimination    Fail-safe behaviour, e.g., instrumentation    ~10^-3
of the failure mode     failure causes an actuation or it does not
                        stop the function if already actuated.
                        Diverse back-up, automated function, very
                        reliable switch function, negligible
                        possibility for CCF

Redundancy              Duplication with an identical item,           ~10^-2
                        automatic reliable switch function, small
                        (but non-negligible) possibility for CCF

Manual back-up,         Back-up function exists, but it is not        ~10^-1
manual recovery         automated. Success conditions for the
                        action exist. Human reliability analysis
                        is needed to verify this.


Table 6. Reconsideration of component’s safety class 
accounting for reliability and mitigative factors. 

                                            Component unavailability u_i
Mitigative measure                          u_i > limit                   limit > u_i > target          target > u_i
Practical elimination of the failure mode   safety class can be reduced   safety class can be reduced   safety class can be reduced
Reliable elimination of the failure mode    keep safety class*            safety class can be reduced   safety class can be reduced
Redundancy                                  keep safety class             keep safety class*            safety class can be reduced
Manual back-up, manual recovery             keep safety class             keep safety class             keep safety class*
None                                        redesign                      keep safety class             keep safety class

* For SC2 components, the option is “safety class can 
be reduced”. 


4.2 Assessment of mitigative measures 


The reliability of equipment can be improved by 
various mitigative measures. These are classified in 
Table 5 regarding their effectiveness. The mitiga- 
tion factors given in the last column of the table 
should be regarded as indicative probability num- 
bers, with some correspondence with typical fail- 
ure probability numbers used in risk and reliability 
studies for technical systems. 

Combining the original component’s unavail- 
ability with additional mitigative factors, the func- 
tional safety class of an item can be reconsidered. 
A proposal for such an approach is outlined in 
Table 6. 
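
The reassessment logic of Table 6 lends itself to a simple lookup. The Python sketch below encodes one reading of the table, in which cells that span several columns in the printed layout are repeated for each column; this reading, the short measure labels and the function name `reassess` are assumptions made for illustration, not part of the proposed method.

```python
# One reading of Table 6. Columns: u > limit, limit > u > target, target > u.
KEEP, KEEP_STAR, REDUCE, REDESIGN = "keep", "keep*", "reduce", "redesign"

TABLE_6 = {
    "practical elimination": (REDUCE, REDUCE, REDUCE),
    "reliable elimination": (KEEP_STAR, REDUCE, REDUCE),
    "redundancy": (KEEP, KEEP_STAR, REDUCE),
    "manual back-up": (KEEP, KEEP, KEEP_STAR),
    "none": (REDESIGN, KEEP, KEEP),
}

def reassess(measure, u, limit, target, safety_class="SC1"):
    """Return 'keep', 'reduce' or 'redesign' for a component."""
    if u > limit:
        column = 0
    elif u > target:
        column = 1
    else:
        column = 2
    outcome = TABLE_6[measure][column]
    if outcome == KEEP_STAR:
        # Table footnote: for SC2 components the starred option is "can be reduced".
        outcome = REDUCE if safety_class == "SC2" else KEEP
    return outcome

# An SC1 item with redundancy and an unavailability between target and limit.
print(reassess("redundancy", 5e-4, limit=1e-3, target=1e-5))
# The same situation for an SC2 item, using the SC2 limit and target.
print(reassess("redundancy", 5e-4, limit=5e-3, target=5e-5, safety_class="SC2"))
```

Such a lookup only supports the qualitative argumentation; the system-level reliability target still has to be verified separately.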


5 CONCLUSIONS 


Safety classification of structures, systems and 
components has an important governing role for 
the definition of QA requirements for various 
items. Safety classification is fundamentally based 
on deterministic safety analysis thinking where an 
item’s functional importance determines the safety 


class. At higher plant and system level, the safety 
importance of functions can be defined straight- 
forwardly, but the classification of components 
requires further assessments. This is especially true 
for electric and automation systems, which consist 
of a large range of components whose importance 
can vary a lot. 

The paper presents a risk-informed approach to 
safety classification where the safety classification 
is carried out in two phases: 1) deterministic, functional 
classification, 2) re-assessment considering 
mitigative and reliability factors of the item. In 
the functional classification, the system functions 
are broken down into sub-functions and sub-sub-functions 
until a level of detail is reached at which 
the component’s functional importance can be 
determined. FMEA is considered a practical tool 
to assess the functional importance, which is determined 
by the functional impact of the component 
failure. 

A challenge with the nuclear safety classification 
is that different classifications are used in different 
countries and standards. Besides the international 
standards organizations, almost every nuclear 
safety authority has local requirements. Thus, 
there can be inconsistencies between international 
and national codes and standards, which is problematic 
when local regulation must be combined 
with the standards applicable in vendor home 
countries. 

The paper suggests a three-level safety classification 
system, which is considered sufficient for 
component level classification and yet applicable 
with respect to various national and international 
systems. The highest safety class, SC1, is applied to 
functions belonging to DiD level 3. Other safety-related 
functions not critical to DiD level 3 functions are 
assigned to the second safety class, SC2. Non-safety-related 
functions are assigned to SC3. 

Reassessment of component safety classes can 
be based on argumentation on the reliability of the 
components and existence of mitigative factors. 
The paper outlines an approach to judge which 
levels of reliability together with mitigative factors 
can justify downgrading. The basic idea is to con- 
trol that the overall reliability target for the system 
will be reached, which also provides a reference for 
the judgment of unavailability contributions that 
are negligible. In any case, a fault tree analysis or 
equivalent quantitative system reliability assessment 
needs to be performed to verify the fulfilment 
of the system reliability targets. 
ABSTRACT: High-Impact Low-Probability (HILP) events in power systems historically involve a multitude 
of aspects, including diverse and disparate threats, failures and sequences of events. Each of these 
aspects is associated with different types of uncertainties. In practice, the analyst has to make trade-offs 
between computational efficiency and accuracy in the different aspects that are included in the analysis. 
Without a clear understanding of the specific problem to be solved and of which aspects are important 
to capture, elaborate quantitative analysis may be of limited value. This paper presents the development of 
a qualitative framework for analysing HILP events in power systems. By mapping aspects of power system 
HILP events to a bow-tie model, it provides a framework for defining, decomposing and delimitating 
decision problems related to such events. The framework may guide the analyst in the development and 
application of methods for quantitative analysis and for considering different types of uncertainties. 


1 INTRODUCTION 


A High-Impact Low-Probability (HILP) event, 
also referred to as an extraordinary event, is an 
event with a high societal impact and a low prob- 
ability to occur. In power systems, such events 
are often understood as blackouts, i.e. wide-area 
power interruptions. A number of such major 
blackout events have occurred in the last few dec- 
ades (Bompard et al. 2013, Hillberg 2016), each 
resulting in critical consequences to society. Such 
events therefore receive great attention both by 
power system operators and other stakeholders, 
such as researchers and the general public, despite 
their low probability of occurrence. Partly due to 
this low probability, these events typically are not 
captured in conventional reliability and risk analy- 
ses, which calls for analysis approaches specific to 
HILP events. 

HILP events historically involve a multitude 
of diverse and disparate threats and complex 
sequences of events, which present the analysts 
and researchers studying them with numerous 
uncertainties. Relevant aspects that can be taken 
into account in quantitative modelling of HILP 
events include: failure bunching due to extreme 
weather (Panteli and Mancarella 2015), other 
natural hazards, cascading outages (Vaiman et al. 
2012, Dobson and Newman 2017), dynamic phe- 
nomena, system protection schemes (Hillberg et al. 


2012), corrective actions (Vadlamudi et al. 2016), 
and valuation of the societal impact. Different 
approaches and methodologies exist for quantita- 
tively analysing these events (Gjerde et al. 2011), 
including methods of identifying unwanted events, 
causal analysis, consequence analysis, and risk and 
vulnerability evaluation. Such methods typically 
focus on one or a subset of all potentially relevant 
aspects. The realization is that there is no single 
methodology covering all these aspects that is suit- 
able for analyzing HILP events in power systems 
(Kjølle et al. 2013), and the full set of aspects is too 
comprehensive to analyse quantitatively. Without 
a clear understanding of what specifically is the 
problem to be solved or decision to be supported, 
and consequently which aspects are important to 
capture, elaborate quantitative analysis may be of 
limited value. 

In this paper, we take a broader view on HILP 
events and present the development of a qualita- 
tive framework for analysing HILP events in power 
systems. A qualitative framework provides the ana- 
lyst with a more complete overview of the set of 
problems and a starting point for detailed analysis. 
Previous work on HILP events largely focuses on 
methods of detailed, quantitative analysis (Vaiman 
et al. 2012), but some work on the more conceptual 
level also exists. For instance, (Watson et al. 2014) 
developed a framework for resilience metrics for 
energy infrastructures. In (Veeramany et al. 2016), 
an overarching modelling framework is formulated 
under which different models can be integrated for 
a multi-hazard risk assessment of power system 
HILP events. The cascading aspect of some HILP 
events is discussed conceptually in (Vaiman et al. 
2012, Dobson and Newman 2017). 

The qualitative framework presented in this 
paper is based on an existing framework for 
power system vulnerability analysis (Kjølle et al. 
2013, Kjølle and Gjerde 2015). The present paper 
advances previous work and attempts to consolidate 
relevant aspects of HILP events in a consistent 
and all-encompassing mapping. This framework 
explicitly discusses and structures uncertainties 
related to different decision problems. The framework 
is presented in Section 2, which forms the 
bulk of this paper. Subsection 2.1 shows how mapping 
relevant aspects and their relationships to a 
bow tie model provides a more complete overview 
of HILP events. Subsections 2.2 to 2.4 
present an approach to defining, delimiting and 
decomposing decision problems related to HILP 
events. This provides a starting point for quantitative 
analysis, as discussed in Subsection 2.4, and a 
basis for taking into account uncertainties, which 
is discussed in Subsection 2.5. Throughout these 
subsections, concrete examples of problems are 
discussed to illustrate the application of the framework. 
Finally, Section 3 concludes the paper 
and indicates future work in refining and applying 
the framework. 


2 QUALITATIVE FRAMEWORK 
FOR HILP EVENTS 


The qualitative framework presented in this paper 
is based on the conceptual bow tie model and a 
previously developed framework for power system 
vulnerability analysis (Kjølle et al. 2013, Kjølle 
and Gjerde 2015). The bow tie model describes the 
relationship between causes and consequences of 
unwanted events, which are here defined as power 
system failures. Note that the unwanted event in 
the centre of the bow-tie is not by itself a HILP 
event, but it could be the initiating event of a 
sequence of events with critical consequences that 
constitutes the HILP event. 


2.1 Getting a better overview of relevant aspects 


The bow tie model can be used as a visual aid 
in structuring the causes and consequences of 
unwanted events as illustrated in Figure 1. This 
figure gives a comprehensive overview of aspects 
relevant to HILP events in power systems and how 
these relate to each other. Such an overview is use- 
ful when structuring an analysis of HILP events. 


[Figure 1: bow-tie diagram. Recoverable labels: threat exposure and operating state; preventive actions reducing the probability of the unwanted event; preventive actions preparing for the unwanted event; unwanted event (power system failures, initiating event); protection system fault (missing barrier, hidden fault); barriers associated with corrective actions; sequence of events (blackout progression); restoration; time axis (not to scale).] 

Figure 1. Overview of relevant aspects of HILP events 
in power systems mapped to a bow-tie model. 


The left-hand part of the figure shows sche- 
matically how the exposure of the power system to 
different threats can cause power system failures, 
and the right-hand part shows how power system 
failures can result in consequences external to the 
power system, i.e. societal impact. The criticality 
of the consequences can be measured along dif- 
ferent dimensions, but for the illustrations in this 
paper we will consider total end-user power inter- 
ruption (MW) and interruption duration (hours) 
as the two principal dimensions. Each HILP event 
could, in principle, also be associated with a prob- 
ability. Other relevant factors include the types of 
end-users affected and the dependence of the soci- 
ety on electricity supply; for further discussion of 
the definition of “critical”, we refer to (Kjølle et al. 
2013, Kjølle and Gjerde 2015). 

Relevant threats on the left-hand side include 
conditions related to the operating state of the 
power system (e.g. challenges related to the power 
import/export situation, prior outages, etc.), natural 
hazards such as major storms, and human 
threats. Barriers on the left-hand side of the bow 
tie reduce the susceptibility of the power system 
to threats. These barriers reduce the probability of 
unwanted events through preventive actions such 
as condition monitoring, preventive maintenance 
and vegetation management. Some barriers also 
preemptively increase the coping capacity of the 
system to reduce the probability of critical con- 
sequences in case an unwanted event does occur. 
This category of barriers includes preventive 
scheduling, grid reconfiguration and islanding in 
preparation for a major storm. 

Barriers on the right-hand side of the bow-tie 
are intended to reduce the consequence of power 
system failures and correspond to the coping 
capacity of the power system with respect to these 
unwanted events. Examples of such barriers are 
corrective actions such as emergency generation 
rescheduling, controlled load shedding, controlled 
islanding, and various system protection schemes. 
Other barriers are associated with the restoration 
of system operation after power has been inter- 
rupted, for instance the black-start capability 
of generators and the availability of spare parts, 
equipment and competent personnel. 

To illustrate the distinction between these two 
types of barriers, we have in Figure 1 superim- 
posed a timeline with an example of how the inter- 
rupted power could develop as a function of time 
throughout the course of the HILP event. The 
sequence of events after the occurrence of the initiating 
event can be broadly separated into a blackout 
progression phase and a restoration phase. Corrective 
action barriers are associated with the blackout 
progression phase and are primarily intended to 
reduce the amount of interrupted power, whereas 
barriers associated with the restoration phase are 
generally intended to reduce the restoration time and 
thus the interruption duration. 
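
The two consequence dimensions used in this paper can be read off such a timeline of interrupted power. The Python sketch below uses a purely hypothetical timeline (the numbers are invented for illustration and do not describe any real event) and also integrates it to obtain the energy not supplied.

```python
# Hypothetical timeline: (hours after the initiating event, interrupted power in MW).
# It is assumed to start at the initiating event and end at full restoration.
profile = [(0.0, 0.0), (0.5, 800.0), (1.0, 2000.0), (4.0, 1200.0), (9.0, 0.0)]

def peak_interrupted_power(points):
    """Largest interrupted power in MW, reached at the end of blackout progression."""
    return max(power for _, power in points)

def interruption_duration(points):
    """Hours between the initiating event and full restoration."""
    return points[-1][0] - points[0][0]

def energy_not_supplied(points):
    """Trapezoidal integral of interrupted power over time, in MWh."""
    return sum((t1 - t0) * (p0 + p1) / 2.0
               for (t0, p0), (t1, p1) in zip(points, points[1:]))

print("peak interrupted power (MW):", peak_interrupted_power(profile))
print("interruption duration (h):  ", interruption_duration(profile))
print("energy not supplied (MWh):  ", energy_not_supplied(profile))
```

In this picture, corrective action barriers lower the peak of the curve, while restoration barriers shorten its tail; both reduce the area under it.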


2.2 Defining and framing the problem 


The analysis of HILP events in power systems is 
a broad problem area involving different decision 
problems as well as more fundamental research 
problems. The question one needs to ask is why 
one is interested in analyzing HILP events in the first 
place. A clear definition of the problem and a clear 
understanding of the motivation and purpose of 
solving the problem are necessary. 

Figure 2 shows two dimensions that can be used 
to frame problems related to HILP events: the time 
scales for power system-related decisions and the relevant 
stakeholders or decision makers. The figure 
also indicates the motivation of the stakeholders 
with regard to HILP events. The two dimensions 
in Figure 2 determine what information is available 
to the analyst and thus what uncertainties must be 
taken into account. This will be discussed in more 
detail in Section 2.5. 

Figure 2. Two dimensions relevant for framing problems 
related to HILP events: The stakeholder or decision 
maker, and the time scale of relevant decision problems. 

Here we will distinguish between operational, 
tactical and strategic decisions by the time scale of 
the planning horizon that is considered. Following 
the classification in (GARPUR Consortium 2016), 
these three time scales correspond to system opera- 
tion (including both real-time operation and day- 
ahead operational planning), asset management, 
and system development or planning, respectively. 
Note that other references may use other terms 
and definitions for the time scales. For instance, 
(Watson et al. 2014) distinguishes between system 
planning decisions and policy decisions, and (Yang 
and Haugen 2015) defines both strategic and operational 
decisions as planning decisions, which are 
in turn distinguished from instantaneous or emergency 
decisions. 

Stakeholders can be differentiated in terms of 
their influence over power system related deci- 
sions, and since system operators have the most 
direct influence, we will in the following take the 
perspective of the system operator as a decision 
maker. Furthermore, we will focus on transmission 
system operators (TSOs) since distribution system 
operators (DSOs) have less influence over deci- 
sions relevant for wide-area power interruptions. 
In practice, decisions will be taken by different 
departments and at different levels in the organi- 
sation, but in the following we simply refer to the 
decision maker as “the system operator”. 

To put the more general problem of analysing 
HILP events in a decision-making context, 
Figure 3 shows some examples of relevant decision 
problems for system operators, sorted by time 
scale. These decision problems will be defined in 
broad terms below and be used in the following 
sections to illustrate the qualitative framework. 
Although we do not define the decision problems 
formally in terms of their objective function etc. 
as done e.g. in (GARPUR Consortium 2016), it 
is important to keep in mind that these reliability 
management decisions typically involve some form 
of trade-off between costs and reliability of supply. 
The value of reliability of supply is sometimes 
monetized in the form of expected interruption 
costs, i.e. the cost of energy not supplied. 

Figure 3. Examples of decision problems for transmission 
system operators with relevance for the analysis of 
HILP events. 

Selection of system development plan: An exam- 
ple of a strategic decision problem is the evaluation 
of candidate system development plans (e.g. for 
new transmission lines) and selection of the best 
candidate. Regulation may dictate that a socio- 
economic cost-benefit analysis of the candidates is 
performed. Ideally, the cost of energy not supplied 
associated with possible HILP events should be 
included in such an analysis. 

Designing system protection schemes: System 
protection schemes (SPSs) are important examples 
of barriers on the right-hand side of the bow-tie, 
and the system operator has to plan which SPSs 
to implement. The motivation of implementing an 
SPS could be to increase the transmission capac- 
ity of the system as well as to increase the coping 
capacity of the system with respect to the occur- 
rence of contingencies that would otherwise result 
in critical consequences (Hillberg et al. 2012). 

Prioritize inspection and maintenance efforts: 
The system operator has to decide how to best 
allocate limited resources for preventive actions 
such as intensified inspection and maintenance 
and improved condition monitoring of power sys- 
tem components. Mitigating certain susceptibili- 
ties could help reduce the risk of HILP events as 
well as more ordinary events. 

Spare parts etc. for critical components: If the 
power system is vulnerable to the loss of a certain 
component, e.g. a transformer, the decision can be 
made to provide for spare parts to reduce the duration 
of potential power interruptions. 

Decide when preventive action is needed: During 
operation, preventive actions such as generation 
rescheduling may be needed e.g. due to the devel- 
opment of threat exposure and/or the operating 
state. The first step for the system operator is to 
correctly assess the situation and decide whether or 
not to effectuate preventive actions. 

Rescheduling generation e.g. to prepare for extreme 
weather: During an extreme weather event the near- 
simultaneous failure of multiple transmission lines 
(failure bunching) is more likely. In this case, one rel- 
evant preventive action is to reschedule generation in 
a way that makes the power system better able to cope 
with failures on one or several transmission lines. 


2.3 Defining and delimiting the analysis 


Decision making for problems as exemplified above 
can be supported by the analysis of HILP events. 
One way of defining and delimitating “analysis of 


HILP events” is to consider sub-problems distin- 
guished by the objective of the analysis. One pos- 
sible classification is: 


1. identifying critical contingencies 

2. identifying critical operating states 

3. identifying critical barriers 

4. assessing the contributions to the overall reli- 
ability of supply 


Each of these sub-problems can be associated 
with different parts of the bow-tie model as illustrated 
in Figure 4. In practice, the objectives may 
overlap and the sub-problems may be combined 
in one and the same analysis. The classification 
may nevertheless be useful in discussing specific 
decision problems and the underlying motivation. 


2.3.1 Identify critical contingencies 
A critical contingency is here understood as a 
failure or unplanned outage of a power system 
component that may potentially result in critical 
consequences. One purpose of identifying critical 
contingencies is to identify critical power system 
components with the motivation to strengthen or 
introduce appropriate barriers, cf. Section 2.3.3. 
One example of a system operation decision 
involving the identification of critical contingen- 
cies is the (optimal) preventive rescheduling of 
generation in preparation for an extreme weather 
event. In this case, the system operator should ide- 
ally know which (critical) higher-order contingen- 
cies to take into account when rescheduling. In the 
context of system development, one would like 
to identify critical contingencies in the candidate 
development plans to reduce the vulnerabilities 
of the development plan that is selected. Another 
purpose of identifying critical contingencies can be 
to screen contingencies to be considered as input 
to more detailed (e.g. dynamic) analysis. 


2.3.2 Identify critical operating states 
We here understand a critical operating state as 
an operating state which in combination with a 
critical contingency potentially results in critical 
consequences. The motivation for identifying these 
could be to increase the situational awareness of 
the system operators, which has previously been 
identified as being crucial to avoid HILP events 
(Johansson, E. et al. 2010). Situational awareness 
is relevant for operational decisions on which corrective 
actions to carry out after a contingency has 
occurred. Identifying critical operating states prior 
to contingencies may also be important to be able 
to decide when preventive action is needed. 

Figure 4. The placement in the bow tie model of different 
criticalities and sub-problems relevant in the analysis 
of HILP events. 


2.3.3 Identify critical barriers 

The identification of critical barriers may be used 
in selecting barriers to strengthen, and the identi- 
fication of critical barriers that are missing may be 
used in proposing new barriers to put in place. This 
involves corrective barriers such as well-designed 
system protection schemes, or preventive barriers 
such as inspection and maintenance. For the latter 
example, the decision of which components to pri- 
oritize also depends on the identification of critical 
contingencies. 


2.3.4 Assessing the contributions to the overall 
reliability of supply 
An underlying premise of this work is that conven- 
tional power system reliability analysis methods 
do not fully capture HILP events. The reliability 
of a power system can be defined as “the prob- 
ability of its satisfactory operation over the long 
run. It denotes the ability to supply adequate 
electric service on a nearly continuous basis, with 
few interruptions over an extended time period” 
(Kundur et al. 2004). The overall reliability of sup- 
ply may be quantified by reliability indices such 
as the expected annual energy not supplied. Over 
the long run, HILP events do contribute to these 
reliability indices, but their contribution may be 
underestimated by conventional reliability analysis 
methods. For instance, this may happen when the 
methods do not capture failure bunching, protec- 
tion system failures, or any of the other aspects 
and dependencies that may conspire to result in a 
HILP event. Furthermore, the short-term impact
of a HILP event may be disproportionate to its
long-run visibility in expected values of reliability
indices and may therefore warrant separate treatment
(Vaiman et al. 2012). These are some of the reasons 
why methods of vulnerability analysis focusing on 
HILP events have been advocated to complement 
traditional risk and reliability analysis methods 
(Johansson et al. 2013, Kjolle and Gjerde 2015). 
Nevertheless, estimates of reliability indices are 
used by system operators as part of their reliability 
management processes also for decisions relating to 
HILP events. An example is the selection of system 
development plans for a given region, supported
by a socio-economic cost-benefit analysis including
expected interruption costs. If the region is
exposed to strong winds, this could motivate cap- 
turing the contribution of HILP events due to fail- 
ure bunching effects in the estimated interruption 
costs. 
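
As a rough illustration of why expected values can hide HILP events, the sketch below computes the expected annual energy not supplied from a short list of events, each described by an annual frequency, an interrupted power and an interruption duration. The event names and all numbers are hypothetical and chosen only for illustration; they are not taken from any study cited here.

```python
from dataclasses import dataclass


@dataclass
class Event:
    name: str
    freq_per_year: float  # assumed expected occurrences per year
    power_mw: float       # assumed power interrupted
    duration_h: float     # assumed interruption duration

    @property
    def energy_mwh(self) -> float:
        """Energy not supplied if the event occurs once."""
        return self.power_mw * self.duration_h

    @property
    def eens_mwh_per_year(self) -> float:
        """Contribution to the expected annual energy not supplied."""
        return self.freq_per_year * self.energy_mwh


# Hypothetical events: two ordinary outages and one HILP event.
EVENTS = [
    Event("single line outage", 2.0, 10.0, 1.0),
    Event("double line outage", 0.1, 100.0, 4.0),
    Event("storm-related blackout (HILP)", 0.001, 1000.0, 24.0),
]


def eens(events):
    """Expected annual energy not supplied over all events."""
    return sum(e.eens_mwh_per_year for e in events)


def eens_share(event, events):
    """Fraction of the expected value that one event contributes."""
    return event.eens_mwh_per_year / eens(events)
```

With these assumed numbers the HILP event contributes less than a third of the expected value, although a single occurrence interrupts far more energy than either of the other events. This is the sense in which the long-run index understates the short-term impact.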


2.4 Decomposition in quantitative analysis 


After defining the purpose of the analysis, one 
needs to consider which quantities the analysis 
method needs to estimate and which of them is 
most important to estimate accurately. Here we 
will consider three primary output parameters: 1) 
The probability of an event and its consequence in 
terms of 2) power interrupted and 3) interruption 
duration. As illustrated in Figure 5, these output 
parameters are broadly speaking associated with 
different parts of the bow-tie model. To assess 
the consequences of an unwanted event, it is suf- 
ficient to consider the right-hand side of the bow- 
tie: The interrupted power is primarily determined 
by the sequence of events within the phase labelled 
“blackout progression”, and the interruption dura- 
tion is primarily determined by the events in the 
restoration phase. On the other hand, to determine 
the probability of a HILP event, characterized by 
a given consequence, one has to consider both the 
left-hand side (with the label “threat exposure” in 
Figure 5) and the right-hand side of the bow-tie. 
To approach more quantitative analysis and 
consideration of different uncertainties, we overlay 
the bow tie model with a schematic data flow dia- 
gram for the analysis in Figure 6. A cause analysis 
is depicted on the left-hand side of the bow tie that 
gives as output the failure rate (or the probability 
of failure during a certain time interval) for a given 
unwanted event (i.e. a given power system failure). 
Such a module could for instance be based on a 
fault tree. Failure bunching effects, for example due 
to major storms, could be incorporated in this step 
using existing tools for estimation of wind-depend- 
ent failure rates, as done in (Solheim et al. 2016). 
The consequence analysis on the right-hand side
of Figure 6 is divided into two modules representing
the blackout progression phase and the restoration
phase, respectively. The module for the blackout
progression phase models system responses and
resulting power interruptions. It could be based on
an event tree model, power flow analysis, dynamic
analysis, etc. This module can take as input
electrotechnical parameters describing the power
system and its operational limits as well as parameters
describing the actions and responses in the system.
For instance, if the analysis method is based on an
event tree accounting for corrective action failures
(Vadlamudi et al. 2016), input parameters can be
conditional probabilities determining the probability
of different sequences of events. The restoration
phase module represents the restoration
process. For instance, the restoration time could
be modelled by average outage times of the
components involved, in which case such outage times
are needed as input. Alternatively, the restoration
process could be modelled in more detail, which
would require additional input parameters.

Figure 5. Illustration of how the problem of analysing
extraordinary events can be decomposed and delimitated
based on what quantity one is focusing on estimating.

Figure 6. Schematic of quantitative analysis (blue,
within the bow-tie) with input data (green
parallelograms) and output data (purple).
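
The data flow described above can be sketched as three chained steps: a failure rate from the cause analysis, an event tree for the blackout progression phase, and an outage time for the restoration phase. The sketch below is a minimal illustration under stated assumptions: the branches are treated as independent, and the failure rate, branch probabilities, interrupted power and outage times are all hypothetical.

```python
from itertools import product


def sequence_probabilities(branch_fail_probs):
    """Enumerate event-tree sequences and their conditional probabilities.

    branch_fail_probs maps each corrective action to its assumed probability
    of failing, given that the unwanted event has occurred. Treating the
    branches as independent is a simplification made for this sketch.
    """
    names = list(branch_fail_probs)
    sequences = {}
    for outcome in product((False, True), repeat=len(names)):
        p = 1.0
        for name, failed in zip(names, outcome):
            q = branch_fail_probs[name]
            p *= q if failed else 1.0 - q
        sequences[outcome] = p
    return sequences


def expected_energy_not_supplied(failure_rate, branch_fail_probs, power_mw, outage_h):
    """Chain cause analysis, blackout progression and restoration.

    failure_rate: output of the cause analysis (events per year).
    power_mw:     function giving the interrupted power for a sequence.
    outage_h:     function giving the restoration time for a sequence.
    """
    sequences = sequence_probabilities(branch_fail_probs)
    return failure_rate * sum(
        p * power_mw(seq) * outage_h(seq) for seq, p in sequences.items()
    )


# Hypothetical input data for one unwanted event.
FAILURE_RATE = 0.1  # events per year
BRANCHES = {"system protection scheme": 0.05, "operator action": 0.2}


def power_mw(seq):
    # Each failed corrective action is assumed to add 400 MW of lost load.
    return 50.0 + 400.0 * sum(seq)


def outage_h(seq):
    # Restoration is assumed to take longer once any corrective action fails.
    return 8.0 if any(seq) else 2.0


EENS = expected_energy_not_supplied(FAILURE_RATE, BRANCHES, power_mw, outage_h)
```

Keeping the three steps as separate functions mirrors the decomposition: an analysis that only needs the interrupted power can drop the failure rate and the outage time without changing the event tree.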

When analyzing system protection schemes 
to identify critical barriers for certain unwanted 
events, it may not be important for the purpose 
of the analysis to consider what caused these 
unwanted events. For such an analysis, one could 
omit the left-hand side of Figure 6 and focus on 
the first part of the consequence analysis, e.g. 
using dynamic analysis to estimate the power inter- 
rupted. On the other hand, if the objective is to 
assess the contribution to the overall reliability of 
supply, one would typically also have to represent 
power system restoration in the analysis. 

In the determination of the consequences illus- 
trated in Figure 6, the consequence analysis stops 
after finding the interruption magnitude and dura- 
tion. However, as mentioned in Section 2.1, the 
societal impact of an HILP event is not determined 
by these two parameters alone. The box labeled
“societal factors” in Figure 6 represents other factors
determining the societal impact, such as the type
of customers (end-users) and the criticality of the 
loads that are interrupted. Consequences of power 
interruptions are typically monetized using inter- 
ruption cost functions determined by customer 
surveys, but these interruption costs give only a 
lower bound for the total socio-economic costs of 
the power interruption (GARPUR Consortium, 
2016). Estimating quantitatively the impact on 
society more widely might involve modelling of the 
interactions between the power system and other 
infrastructures (Johansson et al. 2015). 


2.5 Taking into account uncertainties 


HILP events can be argued to be inherently asso- 
ciated with uncertainties (Taleb 2010, p. xxviii). 
Factors such as the operating state, the technical 
condition of components and failure bunching 
effects due to adverse weather all have their own 
individual uncertainties. HILP events are often 
the results of multiple, interacting factors and cir- 
cumstances. As such, their combined uncertainty 
is larger than the uncertainty of the individual 
factors. 
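
One simple way to see why the combined uncertainty exceeds that of any single factor is to treat the quantity of interest as a product of independent, lognormally distributed factors. This is an assumption made purely for illustration, as are the factor names and spreads below; real factors are often dependent, which can increase the combined uncertainty further.

```python
import math
import random
import statistics

# Hypothetical log-scale standard deviations of three uncertain factors.
LOG_SIGMAS = {
    "operating state": 0.3,
    "component condition": 0.5,
    "weather-related failure bunching": 0.8,
}


def combined_log_sigma(log_sigmas):
    """Analytic log-scale spread of a product of independent lognormals."""
    return math.sqrt(sum(s ** 2 for s in log_sigmas.values()))


def sampled_log_sigma(log_sigmas, n=20_000, seed=1):
    """Monte Carlo estimate of the same spread.

    The logarithm of a product is the sum of the logarithms, so each sample
    is a sum of independent normal draws, one per factor.
    """
    rng = random.Random(seed)
    logs = [
        sum(rng.gauss(0.0, s) for s in log_sigmas.values())
        for _ in range(n)
    ]
    return statistics.stdev(logs)
```

Under these assumptions the combined spread is the root sum of squares of the individual spreads, so it is always larger than the largest single contribution.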

First, it is common to classify uncertainties as 
either aleatory, i.e. associated with random vari- 
ability, or epistemic, i.e. associated with a lack of 
knowledge. Given that HILP events are character- 
ized by a scarce experience base and severe lack of 
knowledge, epistemic uncertainties are especially 
important to consider. Next, following a similar 
classification as in (Rausand 2013), we will broadly 
distinguish between three types of uncertainties: 


— Input data uncertainties 
— Modelling uncertainties 
— Completeness uncertainties 


For the analysis of HILP events in power sys- 
tems, these types of uncertainties can be related to 
Figure 6 as follows. Input data uncertainties and 
modelling uncertainties are related to green and 
blue boxes, respectively. The additional category 
that we have here chosen to label “completeness 
uncertainty” represents uncertainty associated 
with the completeness of the models of the system. 
Although there are different ways to understand 
this term (Rausand 2013, Aven 2016), and “com- 
pleteness uncertainty” may not be unambiguously 
distinguished from “modelling uncertainty”, we 
find the term useful to describe uncertainty associ- 
ated with aspects omitted and/or outside the scope 
of the analysis. As an example, a consequence anal- 
ysis starting from a given set of contingencies (i.e. 
covering only the right-hand side of Figure 6) does 
not explicitly consider what might have caused 
the contingencies. If the problem was to identify 
effective system protection schemes, for instance,
threat and susceptibility aspects may not have been
within the scope of the analysis. 

Sources of incompleteness in the analysis can 
be either known or unknown to the analyst (Aven 
2016). If the analyst is unaware that an aspect is 
not considered in the analysis, this uncertainty can 
be labelled an “unknown unknown” (Feduzi and 
Runde 2014). Here, we use this term in a wider 
sense to refer to lack of knowledge that is implicit, 
i.e. a form of epistemic uncertainty associated with 
“what we don’t know we don’t know”. Furthermore, 
we focus on “unknown unknowns” that are “know- 
able”, i.e. that can in principle be transformed into 
“known unknowns” (Feduzi and Runde 2014). 

Another way to classify uncertainties related to 
an analysis of HILP events that is more specific 
to the domain of power systems is to consider 
uncertainties related to the aspects discussed in 
Section 2.1. An example of such a classification is 
illustrated in Figure 7. Here, each of the catego- 
ries along the vertical axis corresponds to one of 
the components of quantitative analysis that were 
illustrated in Figure 6. This shows how a domain- 
specific classification can be combined with the 
generic uncertainty classification discussed above: 
For each category, a given analysis is associated 
with uncertainty (indicated along the horizontal 
axis) related to the accuracy of modelling assump- 
tions and the input data. 

This multi-dimensional classification of uncer- 
tainties can be used to structure a qualitative 
assessment of the strength of background knowl- 
edge (Aven et al. 2014, p. 87) underlying a given 
analysis: If an aspect is modelled in a simplified 
or inaccurate manner, the knowledge of this aspect 
that is represented in the analysis is weak and the 
uncertainty is correspondingly high. Even if the 
modelling of an aspect is accurate, the uncertainty 
is still high if the associated input data represented 
in the analysis is inaccurate.

Figure 7. Example of classification and assessment of
uncertainties associated with analyses of HILP events.


Such a structured assessment of the uncer- 
tainties of a HILP event analysis can be used by 
the analyst to rank which uncertainties are most 
important (Aven et al. 2014) to improve the over- 
all accuracy and suitability of the analysis. More 
accurate modelling of an aspect often implies 
longer computation times. In practice, a trade-off 
must therefore be made between computational 
efficiency and accuracy, and trade-offs must be 
made between the modelling accuracy for the dif- 
ferent aspects considered in the analysis. 
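
A structured assessment of this kind could be recorded in a simple data structure, as in the sketch below. The aspect names, the three-level rating scale and the tie-breaking rule used in the ranking are assumptions made for illustration; they are not prescribed by the framework. The rule that the worse of the two ratings determines the overall uncertainty follows the argument that accurate modelling does not compensate for inaccurate input data.

```python
from enum import IntEnum


class Uncertainty(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical assessment of one analysis: aspect -> (modelling, input data).
ASSESSMENT = {
    "threat exposure": (Uncertainty.HIGH, Uncertainty.MEDIUM),
    "blackout progression": (Uncertainty.LOW, Uncertainty.HIGH),
    "restoration": (Uncertainty.MEDIUM, Uncertainty.LOW),
}


def overall(modelling, input_data):
    """The worse of the two ratings determines the overall uncertainty."""
    return max(modelling, input_data)


def rank_aspects(assessment):
    """Order aspects from most to least uncertain.

    Ties in the overall rating are broken by the sum of both ratings,
    which is an arbitrary choice made for this sketch.
    """
    return sorted(
        assessment,
        key=lambda a: (overall(*assessment[a]), sum(assessment[a])),
        reverse=True,
    )
```

The ranked list gives the analyst a starting point for deciding where additional modelling effort or better input data would matter most.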

An explicit qualitative assessment of uncer- 
tainties can also be used as a basis for compar- 
ing different analyses and informing the decision 
maker of their uncertainties (Aven et al. 2014). As 
an example, one can consider methods designed 
to analyse cascading outages. A number of such 
methods have been developed, each focusing on 
different subsets of the mechanisms and aspects 
involved in cascading outages. Considerable 
efforts have already been devoted to reviewing 
and validating such methods (Vaiman et al. 2012, 
Bialek et al. 2016), but there are still many open 
questions that may limit their credibility in deci- 
sion making. More explicit classification and 
assessment of their uncertainties, scope and pur- 
pose could help inform system operators of which 
methods are most suitable for different problems. 

Completeness uncertainty is not included as a 
separate dimension in Figure 7, but if an aspect is 
not covered in an analysis, the modelling uncertain- 
ties related to this aspect can be regarded as high. 
However, to fully characterize the completeness 
uncertainty dimension of the analysis one needs to 
identify and uncover “unknown unknowns”. It has 
been argued that to do so, the analysis needs to be 
placed in a sufficiently broad framework and avoid 
starting out with a too narrow view of the problem 
(Feduzi and Runde 2014, Aven 2016). A qualitative 
mapping of relevant aspects to the analysis as pro- 
posed in this paper can contribute to transform- 
ing “unknown unknowns” to “known unknowns”, 
or in other words making implicit assumptions 
and uncertainties explicit. Communicating such 
uncertainties associated with the completeness 
of the analysis can change, from the perspective 
of the decision maker, an “unknown unknown” to
a “known unknown”. To give a simple example: 
When deciding on system protection schemes to 
mitigate cascading outages and the analysis does 
not model the dynamics of rotor angle stability, 
the decision maker should be aware that the type 
of cascading events characterized by generators 
losing synchronism is omitted from the analysis. 

As mentioned in Section 2.2, the time scale of 
the decision problem is relevant for what infor- 
mation is available during the analysis and hence 
what is uncertain and what is known. For instance,
the system operator knows the operating state to
a good approximation during real-time system 
operation, whereas this information is not avail- 
able for an analysis for long-term planning pur- 
poses (Vaiman et al. 2012). For the example of 
cost-benefit analysis including the contributions of 
wind-related failures, the analyst needs to assume 
a selection of operating states expected to be rep- 
resentative of the future, and this is associated 
with additional uncertainties. For the example of 
preventive rescheduling in preparation for a major
storm, more information is available on the operat- 
ing state over the planning horizon, although this 
is still imperfect information as one may have to 
consider the forecast uncertainties. 


3 CONCLUSIONS AND FUTURE WORK 


This paper proposes a qualitative framework for 
analysing HILP events in power systems that may 
complement or guide more quantitative analysis. 
Mapping relevant aspects of such HILP events to 
a bow tie model provides the analyst with a broad 
overview of the set of problems at hand and a 
starting point for detailed analysis. Although the 
full set of aspects is too comprehensive to analyse 
quantitatively, the qualitative framework provides 
a basis for decomposing and delimitating the 
problem: Defining precisely the purpose of the 
analysis, one can then choose what aspects need to 
be modelled accurately and which aspects one is 
choosing to omit. Omitting and neglecting aspects 
of the overall problem introduce uncertainties in 
the analysis, but by being explicit about what is 
omitted and assumed, one reduces the number of
“unknown unknowns” in the analysis and may 
thus support more well-informed decisions. 

Further work will test the applicability of the 
framework in case studies of real problems related 
to HILP events. The approach for defining the pur- 
pose of an analysis and delimitating the problem 
presented will also be used to guide the develop- 
ment and application of methods for quantitative 
analysis of HILP events. Furthermore, the classi- 
fication of models and input data for the analysis 
may form the basis for considering which methods 
are most appropriate for handling different types 
of uncertainties related to modelling choices and 
input data.


ABSTRACT: Safety assessment and risk analysis are recognized as a priority in the development of
next generation nuclear systems (Generation-IV reactors and the full-scale fusion reactor, DEMO) and
demand a reconsideration of the safety philosophy currently applied to existing nuclear stations. Because
of their innovative physics and technology and the preliminary design phase of some of the concepts, their
safety assessment has to rely on the fundamentals of nuclear safety and on technology-neutral methodologies.
To meet this need, a bibliographic survey of nuclear and non-nuclear international standards
and best practices is performed. By comparing them, this work seeks a new and more systematic
approach, based on functional safety, suitable for dealing with the unique challenges of innovative
nuclear facilities, so that safety is “built-in” rather than “added-on” by influencing the concept
evolution from its earliest stages.


1 INTRODUCTION 


The contemporary research activity in the nuclear 
field is focused on the development of nuclear 
facilities able to satisfy the four goal areas iden- 
tified by the Generation IV International Forum 
(GIF, 2014) in its Technological Roadmap in order 
to advance nuclear energy in its next generation: 
sustainability, safety and reliability, economic 
competitiveness, proliferation resistance and phys- 
ical protection. The attempt to answer this request 
with a fully innovative technology is the rationale 
associating all Generation IV reactor designs and 
the proposed concepts for a full-scale fusion reac- 
tor (EUROfusion website). 

The nuclear energy systems must be designed so 
that, during normal operation or anticipated tran- 
sients, safety margins are adequate, accidents are 
prevented and off-normal situations do not dete- 
riorate into severe plant conditions (RSWG of the 
GIF, 2008). Therefore, safety assessment and risk 
analysis, in both operational and accidental condi- 
tions, are recognized as an essential priority in the 
development of these next generation nuclear sys- 
tems. Because of their innovative physics and tech- 
nology and the preliminary design phase of some 
of the concepts, their safety assessment has to rely
on functional safety and technology-neutral
methodologies. This demands a reconsideration,
modernization and adaptation of the
safety philosophy currently applied to existing
nuclear stations, and constant innovation and
development of safety assessment methods to
continue to advance the state of the art and improve
their adequacy.


In 2002 GIF selected six systems from nearly 
100 concepts as the Generation-IV fission nuclear 
plants: the Gas-cooled Fast Reactor (GFR), 
Sodium-cooled Fast Reactor (SFR), Lead-cooled 
Fast Reactor (LFR), Molten Salt Reactor (MSR), 
Very-High-Temperature Reactor (VHTR), 
Supercritical-Water-cooled Reactor (SCWR) (GIF, 
2014); on the other hand the DEMOnstration 
fusion power reactor (DEMO) is foreseen to follow 
the advancements of ITER (International Thermo- 
nuclear Experimental Reactor) by 2050 (EUROfu- 
sion website). These systems present a wide range of 
new technologies that create issues if the traditional 
safety approach adopted for Light Water Reactors 
(LWRs) is considered: for example, the MSR design 
is characterized by a liquid nuclear fuel, therefore 
the evaluation of the Core Damage Frequency 
(CDF) in terms of core melting as an indication of 
severe accident is no longer applicable. Moreover,
due to the online refueling envisaged for the MSR,
even in normal operating conditions the fuel is not
localized in the core (as it is in an LWR) but is
spread over several subsystems and occupies
different positions in the reactor, which makes the
traditional definition of physical barriers (cladding,
primary circuit, containment building) inconsistent.
A general comment, valid for all these innovative
nuclear systems including fusion machines, is that
in many cases the design is still in development.
A safety assessment performed at the component
level is therefore of limited use, since the
architecture will evolve over time. A functional
approach instead makes it possible to identify the
functional deviations challenging the system from
the early design onwards and, consequently, to
include safety features from a holistic perspective.

This work starts by investigating the safety
challenges of the new generation of nuclear plants
and by performing a bibliographic survey of nuclear
and non-nuclear international safety standards
and best practices. The objective of the paper is
to present an iterative methodology that can be
applied consistently from the conceptual phase of
the design and that aims at influencing the direction
of the concept and design development from its
earliest stages, so that safety is “built-in” rather
than “added-on”.


2 SAFETY CHALLENGES FOR NEW 
GENERATION NUCLEAR PLANTS 


The majority of current nuclear safety regulatory
requirements are based on LWR technology and
need to be changed to suit a new spectrum of
novel, advanced, next generation plants (Southern
Company, 2017). In Probabilistic Safety Assessment
(PSA), the risks associated with reactor
accidents are highly design, plant and site specific;
this has been demonstrated for any kind of reactor.
In particular, next generation nuclear plants exhibit
a much larger variability in risk than LWRs: there
are fundamental differences in the physical
processes as well as in the plant responses associated
with reactor transients and accidents. This is due
both to the use of different materials for the reactor
fuel, moderator and coolant and to different safety
design approaches for the implementation of
radionuclide barriers (Southern Company, 2017).
Because of these differences, the LWR risk metrics,
for instance the Core Damage Frequency (CDF) and
the Large Early Release Frequency (LERF), are
neither relevant nor useful for many advanced
nuclear reactors; some plants, in fact, may not
involve the core damage state that was defined for
LWRs and, even where they do, its meaning and risk
framework can be fundamentally different from
those of an LWR (INL, 2011).
Consequently, PSA for advanced reactors may be
structured differently from the traditional Level
1, 2 and 3 model for LWR PSA: it is expected to
include out-of-core sources of radioactive material
(especially in the case of online refueling, as for
the MSR) and to adopt adequate and more general
risk metrics (INL, 2011); the latter may lead to an
appropriate definition of severe accident, detached
from the core melting concept. Additionally, while
the traditional LWR risk assessment was developed
following the “one-reactor-at-a-time” approach, in
next generation nuclear plants the risk associated
with multi-unit sites becomes relevant and,
especially after the Fukushima Daiichi accident,
even dominant (Fleming, 2017). Advanced non-LWRs
are expected to consist of several
modules located on the same site: this increases
the possibility of common cause failures and domino
effects, due to the potential for sharing of systems
and structures or hazards involving more than
one reactor (e.g. external hazards). This influences
the traditional frequency-consequence tolerability
criteria.

Lastly, a major difference between the risk 
assessment methodologies (e.g. PSA) of LWRs 
and next generation nuclear plants is the following: 
the former were introduced after the plants were 
designed and licensed, limiting the risk-informed 
applications to additional systems or provisions 
for plants that were already built and operated; on 
the other hand, the latter are primarily used as a tool
to support the design and to expand the range of 
the risk-informed decisions (Southern Company, 
2017). 


3 OVERVIEW OF SAFETY ASSESSMENT 
METHODOLOGIES AND STANDARDS 


A huge set of prior activities, policies, standards, 
practices and requirements support the design and 
the licensing of LWRs. 

IAEA (International Atomic Energy Agency) 
standards provide the fundamental principles, 
requirements and recommendations to ensure 
nuclear safety. They serve as a global reference for 
protecting people and the environment and
contribute to a harmonized high level of safety
worldwide (IAEA, 2006). As stated in the Fundamental
Safety Principles of the IAEA Safety Standards
(IAEA, 2006),
“the fundamental safety objective is to protect
people and the environment from harmful effects
of ionizing radiation”. This fundamental safety
objective is detailed in ten safety principles on the
basis of which safety requirements are developed
and safety measures are implemented in all nuclear
facilities and activities, and for all stages over the
lifetime of a facility or a radiation source. These
principles inspire the “General Safety Requirements”
and the “General Safety Guide” which, for
each technical area, are broken down into a number
of “Specific Safety Requirements” and “Specific
Safety Guides” that provide all the guidance
necessary for implementing the general principles.
While the ten safety principles are general enough
to be applicable also to non-LWRs, all the other
documents and standards refer specifically
to LWRs. Similarly, the Nuclear Regulatory
Commission (NRC) regulations, and in particular
Title 10 of the Code of Federal Regulations, Part
50 (10 CFR 50) (USA NRC, 2017), establish
Principal Design Criteria (PDC) derived from the
General Design Criteria (GDC), which refer
specifically to LWRs (Appendix A of 10 CFR 50).
Since the last commercial non-LWR in the USA
(Fort St. Vrain, a High-Temperature Gas-cooled
Reactor, HTGR) was shut down in 1989,
the update of these documents is on-going but is
especially challenging because of the lack of specificity
in the technologies and designs that will ultimately
be submitted to the NRC for review, the lack of
design maturity and the unavoidable technical skills
gap (Lee, 2016).

Traditionally, the PSA is performed only after 
the definition of the detailed design and of the 
site: in this case, if the tolerability criteria are not
fulfilled, it may be necessary to modify the
preliminary design as well.

Nowadays a widely accepted approach in the
process industry is the one described in IEC
EN 61508, whose central idea is that the safety
of systems must be studied and pursued from
the early design by means of risk analysis tools.
One of its main activities is to define the Safety
Instrumented Functions (SIFs), which must then be
analysed in depth in order to understand the risk
reduction actually needed and how to implement
them in terms of safety systems and
additional safety requirements. Functional safety
assessment in the context of IEC EN 61508
constitutes a milestone in letting safety drive the
design (IEC EN 61508, 2005). IEC EN 61513 provides
requirements and recommendations for the
overall Instrumentation and Control (I&C)
architecture of a Nuclear Power Plant
(NPP), which may contain both hard-wired and
computer-based technologies. It aims at translating
the general requirements of IEC 61508-1, 61508-2
and 61508-4 for the nuclear application sector and,
similarly to IEC EN 61508, it introduces the
concept of a safety life-cycle for both the whole
architecture and the individual system, highlighting
the relations between the safety objectives of
the NPP and the requirements for the I&C
architecture (IEC EN 61513, 2013). Nevertheless, the
need to maintain the traditional safety approach
for nuclear applications means that IEC EN 61513
does not fully reflect the nature of IEC EN 61508;
instead, it represents an intermediate step within a
rigid process that was developed for, and is still
suitable for, LWRs, but is difficult to apply to concepts
of the next generation. Hence, the philosophy of
IEC EN 61508 can inspire the safety assessment of
advanced nuclear plants, provided that the strict
framework defined for LWRs is redefined.
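
To make the notion of required risk reduction concrete, the sketch below computes a risk reduction factor from an unmitigated and a tolerable event frequency, and maps the resulting target average probability of failure on demand to a safety integrity level (SIL) using the low-demand-mode bands of IEC 61508. The function names are hypothetical, and the sketch ignores everything else the standard requires for a SIL claim.

```python
def required_risk_reduction(unmitigated_per_year, tolerable_per_year):
    """Risk reduction factor that the safety function must provide."""
    if unmitigated_per_year <= 0 or tolerable_per_year <= 0:
        raise ValueError("frequencies must be positive")
    return unmitigated_per_year / tolerable_per_year


def sil_for_pfd(pfd_avg):
    """Map an average probability of failure on demand to a SIL.

    Bands are those for low-demand mode of operation; values outside
    the tabulated range return None.
    """
    bands = [
        (1e-5, 1e-4, 4),
        (1e-4, 1e-3, 3),
        (1e-3, 1e-2, 2),
        (1e-2, 1e-1, 1),
    ]
    for low, high, sil in bands:
        if low <= pfd_avg < high:
            return sil
    return None


def target_sil(unmitigated_per_year, tolerable_per_year):
    """SIL implied by the required risk reduction, or None if none is needed."""
    rrf = required_risk_reduction(unmitigated_per_year, tolerable_per_year)
    if rrf <= 1.0:
        return None
    return sil_for_pfd(1.0 / rrf)
```

For example, an assumed unmitigated frequency of 5e-3 per year against an assumed tolerable frequency of 1e-5 per year requires a risk reduction factor of 500, which falls in the SIL 2 band.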

A schematic representation of the two 
approaches is shown in Figure 1. 

Figure 1. Schematic of the traditional and the new
approaches to safety assessment.

Other methodologies, such as the Integrated
Safety Assessment Methodology (ISAM) and the
International Project on Innovative Nuclear
Reactors and Fuel Cycles (INPRO), are likewise
inspired by the IAEA general principles but at the
same time aim at implementing the concept of a
safety-driven design. The ISAM (RSWG of the GIF,
2011) is meant to combine probabilistic and
deterministic tools, as well as quantitative and
qualitative methods and evaluations, some focusing on
high-level issues, others on more detailed issues.
It aims at providing robust guidance, based on
a good understanding of risk and safety issues,
contributing to the achievement of Generation IV
safety objectives. The INPRO assessment (IAEA,
2008) is a stepwise approach with a hierarchical
structure: Basic Principles (BP), User Requirements
(UR) and Coordinated Criteria (CC), which
must be fulfilled by an Innovative Nuclear System
(INS) to determine if the system is sustainable or
not. This approach aims at providing a tool to
analyze a nuclear installation in order to:


— Evaluate if it is compatible with the objective of 
sustainable energy development, 

— Compare different plants or components to find 
a preferred or an optimum solution tailored to 
the needs of a specific region or a State and 

— Identify possible improvements. 


These two approaches represent guidelines that 
must be reviewed, completed and adapted, when 
needed, also using traditional risk analysis tools 
in order to better suit the unique case of each of 
the next generation nuclear plants. They define an 
inspiring philosophy but do not constitute an oper- 
ational framework, which still has to be defined 
for advanced concepts through tailored criteria, 
requirements and consolidated operational safety 
assessment methods. 


4 PERSPECTIVES OF SAFETY
ASSESSMENT METHODOLOGIES


4.1 Risk metrics


Each nuclear plant must fulfil Quantitative Health
Objectives (QHOs) of individual risk, used as a
basis for determining whether the level of safety
ascribed to a plant is consistent with the safety
goal policy (Whipple, 2012). In the LWR framework
these objectives are embodied by LERF and
CDF, which may not be meaningful for some of the
new generation nuclear plants and should therefore
be reviewed (INL, 2011). Although ISAM tries to
adapt the CDF definition to all kinds of reactors,
in some cases, especially those precluding the core
damage states defined for LWRs, this proves
problematic, and it may be appropriate to use a set of
risk metrics that can identify
the significant contributions to risk and provide
information to demonstrate the adequacy of defense
in depth. The proposal for advanced reactors needs
to be TI-RIPB (INL, 2017):


— Technology-Inclusive (TI), namely applicable
to any design independently of the implemented
processes;

— Risk-Informed (RI), since each decision must be
suitably derived from both probabilistic
and deterministic principles;

— Performance-Based (PB), because the risk and
safety analyses lead to the formulation of
performance requirements for Structures, Systems
and Components (SSCs) in order to avoid
accidents, or at least mitigate them.


Some proposed indexes include the frequencies
of event sequences grouped in accident families
having the same plant response and the same
off-site radiological consequences, the integrated
risk of a given consequence (e.g. site boundary dose),
the individual fatality risk (as compared to the
existing limits for LWRs) and the cumulative frequency
of an early or latent effect. Moreover, some
specific risk metrics can be defined for each reactor,
depending on its specific characteristics.

These values can be expressed as mean values and
uncertainty percentiles (5th and 95th percentiles)
and compared to frequency-consequence evaluation
criteria, such as those defined in Figure 2.

Furthermore, the integrated risk evaluation of 
the entire plant is performed taking into account 
four evaluation criteria (INL, 2017): 


— the total frequency of exceeding a site boundary 
dose of 100 mrem shall not exceed 1/plant-year 
according to the annual exposure limits in 10 
CFR 20; 

— the total frequency of a site boundary dose
exceeding 750 rem shall not exceed 10⁻⁶/plant-year
according to NRC Safety Goal Policy
Statement on limiting the frequency of a large
release;

— the average individual risk of early fatality within
1 mile of the Exclusion Area Boundary (EAB)
shall not exceed 5 × 10⁻⁷/plant-year according
to the NRC Safety Goal QHO for early fatality
risk;

— the average individual risk of latent cancer
fatalities within 10 miles of the EAB shall not exceed
2 × 10⁻⁶/plant-year according to NRC safety goal
QHO for latent cancer fatality risk.


Figure 2. Frequency/consequence evaluation criteria
(INL, 2017).
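
The four cumulative criteria listed above can be checked
mechanically once the integrated plant results are available.
The following Python sketch is only an illustration: the class
and field names are hypothetical, the limits are taken as 1,
10⁻⁶, 5 × 10⁻⁷ and 2 × 10⁻⁶ per plant-year, and "shall not
exceed" is read as "value greater than the limit is a violation".

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IntegratedRisk:
    """Integrated plant results, all expressed per plant-year (hypothetical names)."""
    freq_dose_over_100_mrem: float      # total frequency of exceeding 100 mrem at the site boundary
    freq_dose_over_750_rem: float       # total frequency of exceeding 750 rem at the site boundary
    early_fatality_risk_1_mile: float   # average individual early fatality risk within 1 mile of the EAB
    latent_cancer_risk_10_miles: float  # average individual latent cancer fatality risk within 10 miles


# Limits as quoted in the text from (INL, 2017).
LIMITS = IntegratedRisk(
    freq_dose_over_100_mrem=1.0,
    freq_dose_over_750_rem=1e-6,
    early_fatality_risk_1_mile=5e-7,
    latent_cancer_risk_10_miles=2e-6,
)


def exceeded_criteria(plant: IntegratedRisk) -> list[str]:
    """Return the names of the criteria whose limit is exceeded."""
    return [
        name
        for name, limit in asdict(LIMITS).items()
        if getattr(plant, name) > limit
    ]


# Made-up plant values, for demonstration only.
example_plant = IntegratedRisk(0.2, 3e-7, 1e-8, 4e-6)
violations = exceeded_criteria(example_plant)
```

For the made-up values above, only the latent cancer fatality
criterion is reported as exceeded.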


It is worth noting that the traditional
classification of PSA into Levels 1, 2 and 3 starts
from the concepts of CDF and LERF; updating the risk
metrics therefore implies a new, modernized PSA
concept.

According to the “Basis for the Safety
Approach for Design & Assessment of Generation IV
Nuclear Systems” (RSWG of the GIF, 2008), one of
the objectives to be pursued in optimizing advanced
designs is their rationalization through the
deliberate adoption of the ALARP (As Low As
Reasonably Practicable) principle, applicable
to the full spectrum of design conditions. The
UK Health and Safety Executive (HSE) defined an
ALARP region (or tolerability region) between the
acceptable and unacceptable risk regions: the
comparison between advantages and disadvantages
(i.e. between the reduction in risk and the cost of
achieving it), on a quantitative or qualitative basis,
establishes what is “reasonably practicable” to
do in order to reduce the risk (HSE, 2001).
This optimal risk reduction translates into
the implementation of innovative provisions
seeking further risk reduction (prevention of
the initiators and mitigation of the consequences)
on a cost-benefit basis (RSWG of the GIF, 2008). In
a frequency-consequence graph the ALARP area
is usually represented by a range of values, but in
Figure 2 it degenerates into a line. Given the
uncertainties characterizing both the considered
designs and their analyses, the ALARP principle should
represent a key point in the definition of the
acceptability criteria.


4.2 Iterative risk assessment


The risk assessment process for an advanced nuclear
plant is proposed to be iterative rather than serial (as
for PSA Levels 1, 2 and 3), so that it can be introduced
at an early stage of design and remain consistent with
the level of detail of the evolving design and,
subsequently, with the site characteristics. Moreover, it
is supposed to provide a logical and structured method
to guide the design and evaluate its safety
characteristics in a systematic and exhaustive manner. A
very preliminary PSA is introduced when the reactor
design is still conceptual: it focuses on internal
events involving radioactive sources and is simplified
according to the level of knowledge about
the design definition and the physical phenomena
occurring in the reactor. The challenges for the reactor
are defined in terms of Initiating Events (IEs),
with the corresponding plausible causes and
consequences. The traditional list of IEs defined for
LWRs can inspire the list for advanced reactors, but
it cannot be exhaustive and sometimes it is not even
coherent; different technologies and phenomena
have to be analyzed and can produce a completely
new list of accident initiators, as in the case of fusion
devices (Pinna, 2017). The events leading to a similar
“reactor end-state” are grouped together, and
the event involving the worst consequences is
selected as the Postulated Initiating Event (PIE)
representing the entire group. The first group of PIEs is
identified at a sufficiently early stage of the design
to enable the designer to select the events worth
considering to enhance plant safety. At this
level, methods following a functional approach can
be used, whose suitability is also recognized in
non-nuclear standards (IEC EN 61508, 2005). As the
design matures and more design details become
available, the set of PIEs is updated and broadened
to gradually address other plant systems and
operational states. At the same time, the selected
events are studied through deterministic analyses
in order to define more accurate event sequences.
When the deterministic inputs are modified, the
design changes and the PSA model evolves as well.
Progressively, all the internal hazards are included
(not only radioactive releases but also, for example,
internal fires and floods) and, when the site
characteristics become available, external hazards (e.g.
earthquakes) can also be taken into account. Finally,
the analysis can be refined by introducing information
about human factors (Southern Company, 2017).
This approach is expected to converge on a
successful design faster than trying to adapt to and
satisfy the LWR requirements.


4.3 Preliminary PSA 


The PSA model begins with a systematic search for
initiating events. Given the preliminary design stage
of some of the new generation plants, a functional
approach has been selected; it is suitable for defining
possible accident initiators when the design
detail is not yet sufficient to allow more specific
evaluations at the component level. This methodology
has already been applied to fusion devices,
in particular to analyse the Primary Heat Transfer
System (PHTS) of EU DEMO with a WCLL
(Water Cooled Lithium Lead) and a DCLL (Dual
Coolant Lithium Lead) breeding blanket (Pinna,
2017; Carpignano, 2016).

In order to identify the functional deviations able to
compromise system safety (in terms of Postulated
Initiating Events, PIEs) as completely as is reasonable,
two approaches can be applied in parallel. The
Functional Failure Mode and Effect Analysis (FFMEA)
is a bottom-up approach that focuses on identifying
the functions of the system and analysing the
consequences of the loss of each of them. The Master
Logic Diagram (MLD) is a top-down approach that,
after the selection of a top event, identifies its
possible elementary causes. A list of PIEs is compiled,
and for each of them a brief description of plausible
causes, consequences, involved components, and
preventive and mitigating actions is provided. In
addition to identifying PIEs, this approach makes it
possible to identify missing information on some
systems, procedures or phenomena, to point out
potential limitations of the design, and to make
suggestions for enhancing the safety of the concept.
An example of the application of this methodology is
shown in (Uggenti, 2017). The complexity of applying
this methodology is reflected in the number of
Postulated Initiating Events listed, between 25 and
30 for the three analysed systems, derived from an
FFMEA of around 1000-1200 lines.

Subsequently, each accident scenario has to
be classified into frequency and consequence
severity macro-categories. According to (INL,
2017), the event sequences include relatively
frequent events classified as Anticipated Operational
Occurrences (AOO, with a frequency higher than
10⁻² events per plant-year), infrequent events
classified as Design Basis Events (DBE, with a
frequency between 10⁻⁴ and 10⁻² events per
plant-year) and rare events classified as Beyond Design
Basis Events (BDBE, with a frequency lower than
10⁻⁴ events per plant-year). The severity of the
consequences can be evaluated in terms of release of
radioactive material (at the preliminary design stage,
it can be evaluated as a percentage of
the total amount) or in terms of damage to the
asset (taking into account whether the system can be
restarted immediately, after some time, or not
at all).
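
The frequency classification described above can be written as
a small function. This is a minimal Python sketch under stated
assumptions: the thresholds are taken as 10⁻² and 10⁻⁴ events
per plant-year, the function name is hypothetical, and the
treatment of values lying exactly on a threshold (assigned to
DBE here) is a choice made for illustration, not something the
text specifies.

```python
AOO_THRESHOLD = 1e-2   # events per plant-year
BDBE_THRESHOLD = 1e-4  # events per plant-year


def classify_event_sequence(frequency_per_plant_year: float) -> str:
    """Map an event-sequence frequency to AOO, DBE or BDBE."""
    if frequency_per_plant_year < 0:
        raise ValueError("frequency must be non-negative")
    if frequency_per_plant_year > AOO_THRESHOLD:
        return "AOO"   # relatively frequent events
    if frequency_per_plant_year >= BDBE_THRESHOLD:
        return "DBE"   # infrequent events (boundary values assigned here by assumption)
    return "BDBE"      # rare events


# Illustrative frequencies only, not results from any real PSA.
categories = [classify_event_sequence(f) for f in (0.5, 1e-3, 1e-6)]
```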

One or more risk matrices can be built using
consistent definitions of technology-inclusive
risk metrics and severe accidents: according to the
risk level of each unprotected accident sequence,
a number of provisions (or lines of defence) needs
to be defined. This number is then compared with the
number of barriers already present in the
preliminary design; where necessary, new provisions
are suggested to meet the requirements and to
help sketch the final architecture of the system,
using a more traditional approach.


5 CONCLUSIONS 


For non-LWRs, the frequencies of accidents
involving release of radioactive material may
be very small, and even those accidents with
releases may involve very small source terms
compared with the releases from an LWR core damage
accident. Therefore, the total risk may be very small
(Southern Company, 2017). Nevertheless, it is
necessary to understand the principal risk
contributors in order to reduce the risk sources at
the early design phase (Van der Borst, 2001): risk
importance measures can be defined for any kind
of risk metric, and it may be useful to calculate
the risk significance on both a relative and an
absolute basis, comparing it against risk goals rather
than only against baseline risks.

The evaluation of the sources of uncertainty has
to be performed without delay: uncertainties need
to be evaluated for both frequencies and
consequences, through quantitative uncertainty
analysis where the information needed is available,
and through sensitivity analyses to address other
sources of uncertainty that are more difficult to
quantify. To this end, these uncertainties have to be
considered in the frequency-consequence evaluation
criteria. This uncertainty treatment then becomes an
input to a risk-informed evaluation of Defence in
Depth (DID).

A major limitation of preliminary nuclear risk
assessment is that all the efforts are concentrated
on the nuclear island, while the remaining
“traditional” components (for example, all the
components constituting the Balance of Plant) and
the siting are usually only sketched. The precise
architecture of a system or the number of its
redundancies is commonly unknown, which increases
the uncertainty in the risk evaluation. A design
process connecting the research approach to a more
engineering-oriented approach would be necessary to
increase the realism and accuracy of the safety
evaluations.

In conclusion, it is worth highlighting that, owing
to their standardization, the safety assessment of
LWRs, and consequently their safety architecture,
is prescriptive (what to do) or proscriptive (what to
avoid doing), since historically the safety process
standards have been rules-based. The varied nature
of the next generation nuclear facilities requires
safety process standards to be simple and based
on stable, general principles, as suggested by
IEC EN 61508, already implemented in some
process industry sectors involving many
diversified plants and technologies (e.g. Oil & Gas,
chemical plants). Goal-based standards focus on
the final objective of the safety assessment (what
is necessary to achieve) and suit new technologies
(not only nuclear ones) better and more
cost-effectively, by exploiting the full potential and
versatility of risk analysis.


Scenario dependency of safety targets for platform doors


B. Hulin 
NTC-Systems GmbH, Gilching, Germany 


ABSTRACT: Platform barriers for railways shall protect passengers from events such as being
crushed by a train or falling off the platform onto the track. Passengers access a train through
automatically operating platform doors that are integrated into the platform barriers. From a safety
perspective, platform doors are an electronically controlled system whose risks need to be analysed and
reduced to an acceptable level. One of the most discussed points in this regard is the allocation of safety
(design) targets to the different functions of platform doors. This paper proposes applying SIRF to
determine safety targets for functions of platform doors. Besides a theoretical justification for using
SIRF, the paper gives examples of its application to platform doors. In particular, the hazard ‘vehicle starts
moving and doorways are open’ is analysed for its criticality and the related functions are assigned
safety targets. It is shown that this process depends strongly on the scenario, even if the starting conditions
are equal. This leads to the conclusion that all scenarios for the same situation have to be analysed. The
overall conclusion is that SIRF can be applied to platform doors easily and delivers reasonable results.


1 INTRODUCTION 


Access points of many recent transportation sys- 
tems are limited by special barriers like platform 
barriers with platform doors. There is especially 
the need for such barriers within automated trans- 
portation systems like unmanned people movers or 
metro systems (see EN 62267 (CENELEC 2009)). 
Platform doors are usually implemented as screen 
doors that are automatically opened for boarding 
a train. 

Platform barriers and their integrated platform
doors protect passengers from falling off the
platform or being struck or crushed by
a train. In this respect, platform barriers including
their platform doors shall reduce the risk of injuries
or deaths.

From a safety perspective, platform doors are 
an electronically controlled system that needs to 
be assessed for risk. In this context, the safety 
related functions with their safety targets are to be 
determined. 

Safety targets (see CLC/TR 50451 (CENELEC 
2007)) or design targets (see Regulation 2015/1136 
(EU 2015)) can be defined amongst others as 
TFFR, THR, MTBF or SIL. A safety target is 
allocated to a functional safety requirement. The 
safety target of a functional safety requirement is 
the maximum criticality of all hazards related to it. 

For determining safety targets for functional
safety requirements, a large number of methods is
available, such as risk graph, risk matrix, and so on
(Summers 1998). Each method has its own pros and
cons and is best suited to some domains or
applications.

A good method for the determination of safety 
targets for railway vehicles is SIRF (EBA 2012)¹.
SIRF (Sicherheitsrichtlinie Fahrzeug) is a German 
tailoring of the EN 50126 (CENELEC 1999) for 
safety assessment for functions of railway vehicles. 
It was first released in June 2011 by the German 
national railway safety authority and has been 
applied successfully in many projects in Germany 
and Austria for main line and urban railway vehi- 
cles (e.g. metros, tramways and people movers). 

As this method is used for the determination 
of safety targets of functions all over the railway 
vehicle including vehicle doors, the author argues 
that this method can be applied for platform doors, 
too, even if they do not belong to the structural 
subsystem vehicle. 

This paper discusses this argument and
concludes that applying SIRF to platform
doors is possible and reasonable. The main
part of this discussion consists of examples of the
hazard ‘vehicle starts moving and doorways are open’
with different situations and accident scenarios.

Since this is the first publication that applies
SIRF to platform doors, the method is described
first.


1. SIRF is freely available at www.eba.bund.de.


2 METHOD


2.1 Generic process 


SIRF initially assumes a well-defined system
whose main functions are defined².
Starting from these functions, a Functional Hazard
Assessment (FHA) is conducted (Milius & Gayen
2004). For this, railway experts combine function
failures with different operational situations and
scenarios (Einer & Käser 2004). The result of the
FHA is a set of hazards, each with an estimate of its
criticality. Each function then receives the highest
criticality of all hazards related to it.


2.2 Terminology of SIRF 


For the determination of safety targets, SIRF uses
five parameters³:


• S₁ – number of affected persons

• S₂ – degree of injury

• W – probability of the occurrence of the
expected severity⁴ after a function failure

• E – mean duration of exposure to a hazard

• V – possibility of avoidance of the severity of a
harm by the person at risk, after the occurrence
of the primary hazard


As described in SIRF, parameters S₁ and S₂ shall
be estimated for a realistic worst-case outcome of
a primary hazard within the considered scenario.
Their combination S = S₁ · S₂ is an estimate of the
expected outcome, expressed as a severity of harm.

Parameter W is often alternatively referred to
as the inevitability of the transition from a function
failure to the related severity of harm. Since the
severity is scenario dependent and parameter W
refers to a severity of harm, W is scenario
dependent, too. An ontological analysis of the
parameters can be found in (Hulin et al. 2016). Note
that the SIRF parameter W has a different meaning
from the parameter W of the risk graph.


2.3 Estimation of hazard criticality 


The parameters mentioned above are estimated 
qualitatively by experts. For that, SIRF defines 
certain values for each parameter. 

The combination of these values results in a
value called the indicator. The indicator I is
calculated using equation 1.

\[ I = S \cdot W \cdot E \cdot V \tag{1} \]

The indicator is then mapped to a safety
target according to Table 1.


2. A good overview of functions for railway vehicles can
be found in EN 15380-4 and E DIN 25002.

3. Since SIRF is a German directive without an official
and agreed English translation, we translated the
definitions of terms for safety target determination on our own.

4. Severity S is the product of S₁ and S₂.


Table 1. Mapping table.

Interval of indicator I    Safety target
]0; 21[                    SIL 0
]21; 36[                   SIL 1
]36; 72[                   SIL 2
]72; 122[                  SIL 3
]122; 281[                 SIL 4
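
The calculation of the indicator and its mapping to a safety
target can be sketched in a few lines of Python. The function
names are hypothetical, and the sketch rests on two assumptions:
the interval limits of Table 1 are read as 21, 36, 72, 122 and
281, and an indicator lying exactly on a limit is assigned to
the higher SIL, which the table itself does not settle.

```python
from bisect import bisect_right

# Lower limits of the SIL 1..SIL 4 intervals, as read from Table 1 (assumption).
SIL_LOWER_LIMITS = [21, 36, 72, 122]
INDICATOR_UPPER_LIMIT = 281


def sirf_indicator(severity: float, w: float, e: float, v: float) -> float:
    """Indicator I = S * W * E * V."""
    return severity * w * e * v


def safety_target(indicator: float) -> int:
    """Return the SIL (0..4) for an indicator value."""
    if not 0 <= indicator < INDICATOR_UPPER_LIMIT:
        raise ValueError("indicator outside the range covered by the mapping table")
    # Number of lower limits that are <= indicator gives the SIL directly.
    return bisect_right(SIL_LOWER_LIMITS, indicator)


# Death of one person with W = 3, E = 1, V = 1 (values used in example 1).
indicator = sirf_indicator(27, 3, 1, 1)
sil = safety_target(indicator)
```

With these inputs the indicator is 81, which falls into the
SIL 3 interval.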


3 APPLICATION TO PLATFORM DOORS 


3.1 Justification 


Even though SIRF is defined only for railway vehicles,
some projects have shown its suitability for some
other railway subsystems, especially platform doors
and emergency lighting in tunnels. The reason for
applying SIRF to non-vehicle functions
was to use as few different risk assessment methods
as possible in projects in which a complete railway
system (consisting of vehicles, tracks, platform
barriers, power supply and so on) was installed.

From a theoretical point of view, the functions
of vehicle access doors are nearly the same as those of
platform doors. Among others, these are “provide
external access” to the vehicle (see EN 15380-4), 
“ensure exiting” the vehicle (see DIN 25002) and 
“provide passenger emergency exits from inside 
the vehicle”. A special function for vehicle doors is 
“keep doors closed between two stations”. A spe- 
cial function of platform doors could be “provide 
emergency exit from the track to the platform”. 
Moreover, there exist some interface functions like 
“align train doors with platform doors”. 

Most potential accidents that could happen
with platform doors are similar to those of vehicle
access doors. For example, ‘being crushed by the
door leaves’ or ‘falling through an open door’
are potential accidents for both types of doors.

Moreover, hazards like ‘unintended door opening’
or ‘too fast closing of doors’ are the same for
vehicle access doors and platform doors.


3.2 Example 1: One open doorway 


3.2.1 Initial situation 

All scenarios in this section start from the same
operational situation: a driverless passenger train
is standing at the platform. At the edge of the
platform a barrier containing platform doors is
installed. All doors of the platform as well as of
the vehicle are sliding doors with a passage width
of 0.8 m. With this width, at most two persons
can be in one door at the same time.
For efficient air conditioning in the railway
vehicle, all doors are closed independently of each
other as soon as possible, and a door is opened only
on passenger request. All platform doors as well
as all train doors are closed except one doorway⁵.
Travellers are boarding the train through this one
open doorway.


3.2.2 Hazard and potential function failure 
The hazard which is considered in this section is 
‘one doorway is open and vehicle starts moving’⁶.

In this case, the potential accident ‘shearing of 
persons’ can occur. For the accident it does not 
matter which potential function failure is the cause 
for the hazard—e.g. the ‘vehicle starts moving 
with one open doorway’, ‘one doorway remains 
open while the vehicle starts moving’, ‘omission to 
prevent persons passing through the one doorway 
while vehicle accelerates’ or a ‘door interlock signal 
is sent while one doorway is open’. 

These potential function failures can be part of 
the vehicle, of the platform door system, of the 
control system or of a combination of some of 
these systems. 

For this example, the potential function failure 
‘vehicle starts moving with one open doorway’ is 
analysed. 


3.2.3 Analysis of accident scenarios 

The probability of a person being sheared between
stationary installations and vehicle parts is higher
with platform barriers than without, since only
small vehicle movements are necessary to shear a
person.

There are several accident scenarios involving
shearing of persons within the aforementioned context.
While one person might just lose an arm, another
could be killed. As mentioned before, with
a doorway width of 0.8 m it is conceivable that
one or two persons are in the doorway at the
same time. This variability of accident scenarios
can cause different outcomes. For each outcome
(specified as a severity of harm), the conditional
probability W of this outcome after the potential
function failure ‘vehicle starts moving with one
open doorway’ has to be estimated separately.

For the severity ‘death of person’ after the
assumed potential function failure, parameter
W is conservatively estimated as ‘high’, since we
assume that persons are passing through the doorway
for most of the time during which it is open.
The reason for this assumption is that the doors are
opened only on passenger request, passengers make
such a request for boarding or exiting, and doors are
closed right after the doorway is cleared. Moreover,
the time a person needs to enter the vehicle usually
increases with the number of persons standing in
the corridors and the entrance area of the vehicle.


5. A doorway consists of both a platform door and a
corresponding vehicle access door. A doorway is called open
if both doors are open.

6. The term ‘vehicle’ is more generic than the term ‘train’.

In the scenario in which several persons are killed by
this potential function failure, we estimate
parameter W as ‘low’ since two persons are very seldom
in the doorway at the same time. It is impossible
for many persons to be affected through only one
doorway and, consequently, for these cases
parameter W is assigned ‘incredible’
(W = 0). A complete overview of all estimates of the
probability W of a certain outcome can be found
in Table 2.

For all accident scenarios that relate to the
hazard ‘one doorway is open and vehicle starts
moving’, the exposure to the hazard is estimated as
‘low’ (E = 1) and avoidance is seen as ‘not or
nearly not possible’ (V = 1). The reason for the
estimate of parameter E is that persons are
exposed to the hazard only during boarding and
exiting. The estimate of parameter V results from
considering the time needed for an effective
human reaction: when the train starts moving,
persons may be startled or fall down and are thus
suddenly limited in their ability
to react effectively.

The graph of the SIRF indicator is shown in
Figure 1. The indicator is neither directly nor
inversely proportional to the severity, and such a
graph can have several local minima and maxima.
Consequently, a certain SIL can
result for different severities. Moreover, the
highest SIL of all credible scenarios leading
from one situation to one potential accident does not
necessarily belong to the scenario with the highest
severity. In Figure 1 the highest credible severity is
45, when several persons are killed, since a severity
S = 72 with many deaths is not credible with one
open doorway. For a severity S = 45 SIRF returns
SIL 2, while for a severity S = 27 it returns SIL 3.


Table 2. Potential outcomes of hazard ‘one doorway is
open and vehicle starts moving’.

Number of persons   Degree of injury   S    Prob. W          SIRF indic.
No                  none               0    high (3)         0
One                 minor injury       6    high (3)         18
Several             minor injury       10   low (1)          10
One                 serious injury     12   high (3)         36
Many                minor injury       16   incredible (0)   0
Several             serious injury     20   low (1)          20
One                 death              27   high (3)         81
Many                serious injury     32   incredible (0)   0
Several             death              45   low (1)          45
Many                death              72   incredible (0)   0
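
The indicator column of Table 2 can be recomputed from the
severity and the value of W, which also makes the
non-monotonic behaviour explicit. The sketch below is
illustrative Python with hypothetical names; it uses the
severities and W values of Table 2 together with E = 1 and
V = 1, and treats every outcome with W > 0 as credible.

```python
# Rows of Table 2: (number of persons, degree of injury, severity S, W).
OUTCOMES = [
    ("no", "none", 0, 3),
    ("one", "minor injury", 6, 3),
    ("several", "minor injury", 10, 1),
    ("one", "serious injury", 12, 3),
    ("many", "minor injury", 16, 0),
    ("several", "serious injury", 20, 1),
    ("one", "death", 27, 3),
    ("many", "serious injury", 32, 0),
    ("several", "death", 45, 1),
    ("many", "death", 72, 0),
]
E = 1  # exposure 'low'
V = 1  # avoidance 'not or nearly not possible'


def indicator(severity: int, w: int) -> int:
    """SIRF indicator I = S * W * E * V for the fixed E and V of example 1."""
    return severity * w * E * V


indicators = [indicator(s, w) for _, _, s, w in OUTCOMES]

# Credible outcomes are those not rated 'incredible' (W = 0).
credible = [row for row in OUTCOMES if row[3] > 0]
worst_by_severity = max(credible, key=lambda row: row[2])
governing = max(credible, key=lambda row: indicator(row[2], row[3]))
```

Here the credible outcome with the highest severity is the
death of several persons (S = 45, indicator 45), whereas the
outcome with the highest indicator is the death of one person
(S = 27, indicator 81).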


3.3 Example 2: Variable number of open 
doorways 


3.3.1 Description 
Let us now analyse the values of the SIRF
indicator as a function of the number of open
doorways. The basis for this example is the same initial
situation as described in section 3.2.1 for example 1,
except that the number of open doorways is variable.
Therefore, the hazard is ‘doorways are open and
vehicle starts moving’ with the function failure
‘vehicle starts moving with open doorways’.

The potential accident ‘shearing of persons’ is 
the same as for example 1. 


3.3.2 Analysis of accident scenarios 

Of course, the worst-case severity increases with
the number of open doorways. More
interesting is how SIRF parameter W and the SIRF
indicator relate to the number of open doorways
for a fixed severity.

For that, we analysed three different severities:
severity S = 72 stands for the deaths of many persons,
severity S = 45 for the deaths of several persons, and
severity S = 32 for many seriously injured
persons. SIRF parameters E and V remain as in
example 1, with E = 1 and V = 1. Parameter W is
the only parameter that changes with the number
of open doorways for a constant severity. If fewer
than 6 doorways are open, it is incredible that many
persons⁷ are killed; therefore W is equal to 0
and the SIRF indicator is equal to 0, too. With 22
or more open doors, it is very probable that ‘many’
persons are affected, and thus
parameter W is assigned the value 3. The death of
‘several’ persons⁸ is classified as ‘very probable’
(W = 3) for 4 or more open doorways. This results in a
SIRF indicator of 135 (45 · 3 · 1 · 1), which is mapped
to SIL 4.

The relation between the indicator and the number of
open doorways is shown in Figure 2.

The graphs for many affected persons climb at the
same positions (see Figure 2), whereas the graph
based on the deaths of several persons climbs at
other positions. These different gradients produce an
intersection between the graphs for severity S = 72
and for severity S = 45.

In this example, this intersection is meaningless
for the SIL allocation since the SIL does not change.
However, for other graphs or other scenarios such
intersections indicate a change in which accident
scenario takes priority for the determination of
criticality and thus for SIL allocation.


Figure 2. Relation between SIL and number of open
doorways.


7. SIRF defines ‘many persons’ as ‘more than 10
persons’.

8. SIRF defines ‘several persons’ as more than 1 but at
most 10 persons.
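
The dependency on the number of open doorways can be traced
for the severity S = 45 (deaths of several persons). The
following Python sketch uses hypothetical names and the
mapping limits read from Table 1 (limit values assigned to the
higher SIL by assumption). The values W = 1 for one open
doorway and W = 3 for four or more are taken from the text;
W = 2 for two or three open doorways is an assumption inferred
from the SIL 3 result quoted for that case in section 4.

```python
from bisect import bisect_right

SEVERITY_SEVERAL_DEATHS = 45
E = 1
V = 1
SIL_LOWER_LIMITS = [21, 36, 72, 122]  # as read from Table 1 (assumption)


def w_several_deaths(open_doorways: int) -> int:
    """Parameter W for 'several deaths' as a function of open doorways."""
    if open_doorways < 1:
        raise ValueError("at least one doorway must be open")
    if open_doorways >= 4:
        return 3  # 'very probable', from the text
    if open_doorways >= 2:
        return 2  # assumption, see the note above the code
    return 1      # 'low', from example 1


def sil_several_deaths(open_doorways: int) -> int:
    """SIL resulting from S = 45 for the given number of open doorways."""
    indicator = SEVERITY_SEVERAL_DEATHS * w_several_deaths(open_doorways) * E * V
    return bisect_right(SIL_LOWER_LIMITS, indicator)


sils = {n: sil_several_deaths(n) for n in (1, 2, 3, 4, 10)}
```

Under these assumptions the SIL rises from 2 for one open
doorway to 3 for two or three, and to 4 for four or more.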


4 COMPARISON OF RESULTS 


The French national railway safety authority
(EPSF) requires the safety target TFFR < 10⁻⁷
for the function failure ‘authorized traction with
one or more open doors’ (see SAM C 305 (EPSF
2013)). This safety target, which corresponds to
SIL 2, applies to main lines without platform doors.
Since the situation is quite different without
platform doors, these safety targets are not easily
comparable to the results of the examples in this paper.

A safety target of SIL 3 for platform door
functions whose failures can cause deaths is
presented in Lecomte (2008), although without
a derivation.

According to European Regulation 2015/1136 (EU
2015), the function failure ‘vehicle starts moving
with one open doorway’ is assigned a target
frequency TFFR < 10⁻⁷. The reason is that
the resulting potential accident is classified as a
‘critical accident’, which is ‘typically affecting a very
small number of people and resulting in at least
one fatality’ (see Article 1 of (EU 2015)). For this
function failure, the safety target determined with
SIRF in example 1 (see section 3.2) is slightly
more restrictive.

Moreover, Regulation 2015/1136 (EU 2015)
defines ‘catastrophic accidents’ that are ‘typically
affecting a large number of people and resulting in
multiple fatalities’. For function failures that could
lead to such accidents, ‘an occurrence of failure at
a frequency less than or equal to 10⁻⁹ per operating
hour’ is required. The regulation, however, does
not define what a ‘large number’ and a ‘very small
number’ of people mean. This is defined in the
guideline by Jovicic (2017), which estimates
TFFR < 10⁻⁹ for the function failure ‘train moves off
at station with one bodyside door open’ in the
situation ‘more than one door open’ during passenger
transfer.

This safety target of Jovicic (2017) is equal to
that of SIRF for trains that start moving with four or
more open doors. For two or three open doors,
however, the safety target of Jovicic (2017) is
more restrictive than that of SIRF, which yields SIL 3
(see Figure 2 and the explanation of example 2).


5 CONCLUSIONS 


In this paper SIRF is applied to platform doors
for safety target determination. It is shown that
such a determination is possible and useful with SIRF.
The results are similar to the safety targets for
railway doors and to those of the European regulations.

Some specialities of SIRF for safety target
determination have been discussed in this paper. One very
helpful speciality to better shape severity is the con- 
sideration of two parameters S, and S, for it. 


The safety target is highly dependent on the initial
situation and the way an accident happens. The
consequence is that the worst case scenario for a
class of potential accidents does not necessarily
lead to the highest safety target. Therefore, all
scenarios within a class of potential accidents have to
be considered.
A general framework for integrated risk assessment of nuclear/
non-nuclear combined installations on market-oriented nuclear industry


K. Kowal, S. Potempski & Pawel M. Stano 
National Centre for Nuclear Research, Poland 


ABSTRACT: The development of new nuclear technologies tends toward decentralized, small-to-medium-size,
modular installations with parameters tailored to specific applications, driven by market needs that are
wider than the energy sector, e.g. the production of process heat, hydrogen or hydrazine, which
is of great importance for the chemical industry. High Temperature Reactors (HTR) and Dual Fluid
Reactors (DFR) are examples of attempts to build such industrial applications. However,
implementing these concepts poses a challenge for safety assessment because of the interfaces between
the nuclear and non-nuclear parts of the installation, which have not been taken into account in the
safety studies completed so far. This is a driving force for the development of a new framework for
integrated risk assessment of nuclear/non-nuclear combined installations. This article attempts to sort
out the most demanding problems related to this issue and to indicate possible paths towards solutions.


1 INTRODUCTION


Given the intensive development of new nuclear
reactor technologies over recent years, one can
expect major changes in the nuclear industry,
broadly understood. So far, innovation in nuclear
reactors has been induced mostly by technology-push
(i.e., public R&D expenditures) and demand-pull
(i.e., NPP construction) incentives
(Berthélemy 2012), and the main stakeholders have
focused mostly on nuclear power generation.

Current trends in the development of new
reactor technology tend toward decentralized,
small-to-medium-size, modular installations with
parameters tailored to specific applications
(Locatelli, Bingham & Mancini 2014). These solutions
are considered as an on-site energy source for
different industrial processes rather than as a means
of producing electricity on an industrial scale. The
main driving force behind these concepts is the
reduction of greenhouse gas emissions from
industrial processes. It seems that the nuclear market
will soon evolve towards greater fragmentation
and a wider field of applications, in which
non-electric services will play an increasingly important
role.

The general concept and key technological solutions
for non-electric nuclear applications have
already been developed. However, they have not
reached the same industrial maturity as those for the
generation of electricity. Nevertheless, anticipating
progress in this type of nuclear technology
application, the International Atomic Energy
Agency performed an initial target market analysis,
which showed that there is an increased interest
in non-electric applications, facilitated by the
recent development of advanced reactor concepts
(IAEA 2002).

The market-oriented restructuring of the nuclear
industry requires, however, an accurate estimation
of the costs and benefits of nuclear applications
in comparison with non-nuclear suppliers of
similar services and, more importantly,
appropriate frameworks, methods and tools for the
integrated risk assessment of nuclear and non-nuclear
installations combined in one complex
structure. High Temperature Reactors (HTR)
and Dual Fluid Reactors (DFR) are examples
of attempts to build industrial applications
based on Generation IV technologies.

For example, in the near future the best
opportunities for cogeneration will be the application
of HTR in the chemical industry (Jackowski et al.
2017). In this respect, a review has been made within
NC2I-R (Nuclear Cogeneration Industrial Initiative -
Research and Development Coordination 2015), taking
into account the following main processes
compatible with HTR capabilities:


• refinery distillation steam,

• refinery distillation superheated steam,

• petrochemicals: reaction enthalpy,

• steam as utility for industrial complex,

• and paper steam (drying).


Use of DFR, in turn, is expected in the following
industrial processes (Huke et al. 2015):

• mixed process heat and electricity generation,

• medical isotope production with high efficiency,

• hydrogen-based chemistry, e.g., production
of synthetic fuels suitable for vehicles,

• and radiotomic chemical production:
utilization of intensive radiation for radiotomic
induction of chemical reactions requiring high
doses (Stannet & Stahel 1971).


The implementation of new technologies encounters,
however, new problems, among others with safety
demonstration (inadequate core damage definition,
interfaces between nuclear and non-nuclear
installations, etc.), licensing processes (inadequate
legislation), and social acceptance (nuclear technology in
the workplace, close to industrial centers). The
paper aims to discuss these kinds of issues as a part
of a general framework for the integrated risk
assessment of nuclear/non-nuclear combined installations.


2 THE LICENSING PROCESS 
ORGANIZATION 


Nowadays, the licensing process for newly
designed reactors seems to be one of the most
pressing challenges. It has to be developed with
respect to all specific features of the nuclear
technology and the related chemical installation.
A preliminary analysis of the HTR licensing issues has
been made within the NC2I-R and HTR-PL projects, with
consideration of the Next Generation Nuclear
Plant guideline (NGNP 2010). However, a new
framework for an integrated licensing process for
joint nuclear-chemical installations is still strongly
needed.

[Diagram of Figure 1; block labels: technical support 1, technical support 2, chemical regulatory bodies, nuclear regulatory body, PRA & chemical hazards, decision, chemical installation (QRA), nuclear installation (PRA), external hazards.]

Figure 1 shows two models of the licensing process
organization for joint nuclear-chemical installations.
One of them assumes separate paths leading
to the operating permits for the nuclear and
chemical parts of the installation, while the second
one is a proposal for integration. The structure of
the regulatory bodies, their competences and
communication, as well as the scope of the safety report or
reports, must be specified in detail to make either
model applicable, and this is a challenge for further studies.


3 RISK ASSESSMENT OF NUCLEAR/ 
NON-NUCLEAR COMBINED 
INSTALLATIONS 


Implementation of the integrated approach to the
licensing process requires, among other things, an
integrated risk assessment for the whole installation,
consisting of nuclear and chemical parts. This
requires, in turn, consideration of insights coming
from the Quantitative Risk Assessment (QRA)
developed for the chemical part of the installation
and the Probabilistic Risk Assessment (PRA) developed
for the nuclear one, including analysis of
interfaces, mutual reactions and interdependencies.

Figure 1. Two general models of the licensing process organization for the nuclear/non-nuclear combined
installation: a) separation of the licensing processes for nuclear and chemical installations, assuming different
regulatory bodies and separate safety reports, where the impact of the neighboring installation is treated as a
set of specific (nuclear or chemical) external hazards; b) integrated licensing process of the nuclear/non-nuclear
combined installation, where the object is defined as a joint nuclear-chemical installation and the integrated
risk assessment is expected.
3.1 Chemical QRA and nuclear PRA integration


As a result of an initiating event within the joint
nuclear-chemical installation, different systems in
both parts (nuclear and chemical) can fail
immediately or with a time delay. The time sequence of
the failures, however, is quite complex because of the
interfaces between systems within each part of the
installation and between the nuclear and
chemical plants. Independently developed chemical
QRA and nuclear PRA models do not properly describe
the real state of the whole installation
and thus need to be integrated. The integration
of QRA and PRA models within the overall risk
assessment framework can proceed through the
following steps:


1. identification of postulated initiating events 
for the nuclear and chemical parts of the 
installation; 

2. identification of systems which would be directly 
affected by the initiating events immediately 
after their occurrence or with a time delay; 

3. identification of all possible interactions
between the considered nuclear and chemical
systems, i.e.:

a. internal interactions between systems within
the nuclear installation;

b. internal interactions between systems within
the chemical installation;

c. nuclear/chemical interactions (posing a challenge
for safety of the chemical installation after
failure of one or more nuclear systems);

d. chemical/nuclear interactions (posing a challenge
for safety of the nuclear installation after
failure of one or more chemical systems);

4. specification of time-frames in which the whole
installation remains in the specific states characterized
by the systems affected and the interactions
within and between the nuclear and chemical plants;

5. identification of all safety functions that must 
be performed in each time-frame and determi- 
nation of their success/failure probability. 
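
Step 3 of the list above distinguishes four classes of interactions. A minimal Python sketch of that classification follows; the `System` type, the system names and the `classify_interaction` helper are hypothetical and serve only as an illustration of the bookkeeping involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    part: str  # "nuclear" or "chemical"


def classify_interaction(source: System, target: System) -> str:
    """Classify a failure propagation from `source` to `target`.

    Same part   -> internal interaction (classes a and b of step 3).
    Other part  -> "nuclear/chemical" or "chemical/nuclear" (classes c and d),
                   named after the direction source/target.
    """
    if source.part == target.part:
        return f"internal ({source.part})"
    return f"{source.part}/{target.part}"


# Hypothetical systems used only for illustration.
n1 = System("n1", "nuclear")
n2 = System("n2", "nuclear")
ch1 = System("ch1", "chemical")
```

For instance, `classify_interaction(n1, ch1)` yields the nuclear/chemical class, in which a failed nuclear system challenges the safety of the chemical installation.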


Figure 2 presents a simple example of how to
combine the chemical QRA and nuclear PRA models
developed for the two parts of the installation. The
following time-frames were established to define the
periods in which a specific functionality is required:


• δt,: after initiating event occurrence and before
the notification of effects on the nuclear systems;

• δt,: failure of nuclear system n, with possible or
conditional effect on the chemical system ch,;

• δt,: failure state of nuclear system n, and chemical
system ch, (nuclear/chemical interaction);

• δt,: failure state of chemical system ch,
and nuclear system n, (due to the internal
interaction);

• δt,: failure state of chemical system ch,
and nuclear system n, (due to the internal
interaction);

• δt,: failure states of n, and n, nuclear systems
due to the chemical/nuclear interaction with
system ch, or internal interaction with system n,;
Figure 2. Example of integration of chemical QRA and nuclear PRA studies within the overall risk assessment
framework for nuclear/non-nuclear combined installations: IE = Postulated Initiating Event; ET_ni = Event Tree for
the i-th system of the nuclear installation; ET_chi = Event Tree for the i-th system of the chemical installation;
δt_i = i-th time-frame to be considered in the integrated risk analysis.
• δt,: failure states of chemical systems due to the
internal interactions in the chemical installation;

• δt,: failure state of nuclear system n, due to the
chemical/nuclear interaction with system ch,;

• δt,: failure state of nuclear system n, due to the
chemical/nuclear interaction with system ch,.


This concept has many advantages, the most
important of which is the possibility of modelling
a wide spectrum of failure sequences,
including both the nuclear and chemical parts of the
installation, in response to an initiating event that
occurred in one of them. Such an approach, however,
is not free of weaknesses, among which
the following should be mentioned here:


• numerical inefficiency of calculations
based on large and complex failure tree
structures,

• multiple modelling of the same sequences of
events appearing in different time frames,

• difficulties in adding new systems, interactions
or time frames to the existing models.


[Diagram of Figure 3; panel labels: Nuclear, Chemical.]


3.2 Block framework 


Apart from the failure tree approach presented
above, it is reasonable to consider an alternative
approach to modelling failure sequences for
combined nuclear-chemical installations, based
on the application of Bayesian networks
to risk assessment. An example of such a block
framework is presented in Figure 3. This network
corresponds one-to-one to the failure tree model
shown in Figure 2, and it is constructed from
the following types of blocks:


• a collection of N initial events ($IE_1^{n}, \ldots, IE_N^{n}$)
defined for the nuclear plant and M initial events
($IE_1^{ch}, \ldots, IE_M^{ch}$) defined for the chemical plant;

• a collection of n nuclear and m chemical systems
potentially affected by the previously defined initial
events;

• logical gates that allow complex Boolean
expressions to be introduced.


The interactions between any two blocks i and j
are defined by a pair of parameters, the transition
probability $p_{ij}^{nch}$ and the transition time
$t_{ij}^{nch}$, where the lower indexes
indicate the direction of the connection (in
this case from i to j) and the upper indexes
indicate to which type of installation the
blocks belong (in this case i belongs to the nuclear plant,
whereas j belongs to the chemical plant).

Figure 3. Block model corresponding to the failure tree described in Figure 2. For the deterministic case, all the
transition probabilities are set to 1. The transition times are set to match the times in Figure 2, i.e., each
transition time is a single time-frame δt or a sum of several time-frames.

Such a block framework has several advantages
that make it particularly well suited to modelling
joint nuclear-chemical installations. First of all,
with such a model it is possible to represent virtually
all combinations of interactions between
the systems in all directions (back and forth).
Secondly, merging independent block models into
a single block model is relatively easy, as it only
requires defining new connections and transition
parameters between systems belonging to distinct
block models and does not disturb the networks
within the models.
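
The merging property can be illustrated with a small sketch. It assumes that each block model is stored as a square matrix of transition probabilities (plain nested lists here) and that the cross-plant connections are supplied as two rectangular matrices; the function `merge_block_models` and all numerical values are hypothetical.

```python
def merge_block_models(p_nn, p_chch, p_nch, p_chn):
    """Assemble one transition-probability matrix from two block models.

    p_nn   : n x n matrix, nuclear -> nuclear (existing nuclear model)
    p_chch : m x m matrix, chemical -> chemical (existing chemical model)
    p_nch  : n x m matrix, nuclear -> chemical (new connections)
    p_chn  : m x n matrix, chemical -> nuclear (new connections)
    The two existing models are copied unchanged into the diagonal blocks.
    """
    n, m = len(p_nn), len(p_chch)
    if len(p_nch) != n or any(len(row) != m for row in p_nch):
        raise ValueError("p_nch must be n x m")
    if len(p_chn) != m or any(len(row) != n for row in p_chn):
        raise ValueError("p_chn must be m x n")
    top = [list(p_nn[i]) + list(p_nch[i]) for i in range(n)]
    bottom = [list(p_chn[i]) + list(p_chch[i]) for i in range(m)]
    return top + bottom


# Illustrative values only: two nuclear systems and one chemical system.
p_nn = [[0.0, 0.3], [0.0, 0.0]]
p_chch = [[0.0]]
p_nch = [[0.1], [0.0]]
p_chn = [[0.0, 0.2]]
merged = merge_block_models(p_nn, p_chch, p_nch, p_chn)
```

Only the off-diagonal blocks are new information; the nuclear and chemical models themselves are reused as they are, which is the point made in the text.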

This property is especially important for our
considerations because it makes it possible
to design the safety models for the nuclear plant, the
chemical plant, and the interaction between the plants
independently, and to combine them only at the
final stage of the analysis to obtain quantitative
risk indicators. Furthermore, this property scales
down, i.e., either plant can be broken down
into smaller units, for which block models can be
developed independently following the philosophy
described above.

Finally, simulating block networks is very effi- 
cient numerically because the complete informa- 
tion about network can be stored in a simple matrix 
form. Consequently, computations of probabilistic 
risk measures require matrix algebra only. A gen- 
eral structure for the matrix of transition probabil- 
ities is presented below. The matrix of transition 
times is defined similarly. 


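\[
P = \begin{pmatrix}
p_{11}^{nn} & \cdots & p_{1n}^{nn} & p_{11}^{nch} & \cdots & p_{1m}^{nch} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
p_{n1}^{nn} & \cdots & p_{nn}^{nn} & p_{n1}^{nch} & \cdots & p_{nm}^{nch} \\
p_{11}^{chn} & \cdots & p_{1n}^{chn} & p_{11}^{chch} & \cdots & p_{1m}^{chch} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
p_{m1}^{chn} & \cdots & p_{mn}^{chn} & p_{m1}^{chch} & \cdots & p_{mm}^{chch}
\end{pmatrix}
\]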
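
For the deterministic case (all transition probabilities equal to 1), the matrix of transition times is enough to obtain the time at which each block fails after an initiating event. The sketch below is one possible way to do this, not the authors' implementation: it repeatedly applies a min-plus relaxation to a hypothetical transition-time matrix in which `INF` marks a missing connection.

```python
INF = float("inf")


def earliest_failure_times(times, start):
    """Earliest time at which each block fails when block `start` fails at t=0.

    times[i][j] is the transition time from block i to block j, or INF if
    there is no connection. Uses repeated min-plus relaxation
    (Bellman-Ford style), which is valid for non-negative times.
    """
    size = len(times)
    reach = [INF] * size
    reach[start] = 0.0
    for _ in range(size - 1):
        updated = list(reach)
        for i in range(size):
            for j in range(size):
                candidate = reach[i] + times[i][j]
                if candidate < updated[j]:
                    updated[j] = candidate
        if updated == reach:
            break  # no block can be reached any earlier
        reach = updated
    return reach


# Hypothetical example: block 0 is the initiating event, blocks 1-2 are
# nuclear systems, block 3 is a chemical system.
T = [
    [INF, 1.0, INF, INF],
    [INF, INF, 2.0, 4.0],
    [INF, INF, INF, 1.0],
    [INF, INF, INF, INF],
]
print(earliest_failure_times(T, start=0))  # -> [0.0, 1.0, 3.0, 4.0]
```

In this example the chemical system is reached earlier through the second nuclear system (1 + 2 + 1) than directly from the first one (1 + 4), which is the kind of sequence ordering that the time-frames of Figure 2 encode by hand.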
3.3 Global uncertainty and sensitivity analysis


In the context of HTR or DFR, where a nuclear
facility is connected to a chemical one, the structural
differences between the PRA (or QRA) models for the
two plants might be immense from the perspective of
a safety analyst (in complexity, accuracy,
purpose, etc.). Therefore, the uncertainty
that is inherently associated with either
model is further enlarged by the uncertainty
about the way the plants, or their systems, interact
with each other. Consequently, for risk-informed
decision-making, a proper analysis of PRA model
results (as well as model failures!) requires
that the analysis of safety outputs be
accompanied by the identification of relevant
uncertainties and an assessment of their impact
on the final results. This is done by conducting
uncertainty analysis (UA) and sensitivity analysis
(SA), which aim to support decision-making by
quantifying model uncertainties. The fundamental
difference between the two is that
UA adopts a forward-looking approach, which
focuses on how the uncertainty
in input variables (external uncertainty) and
parameters (internal uncertainty) affects the
uncertainty of output variables. SA adopts a
backward-looking approach, which focuses
on how sensitive the output variable
is to fluctuations in uncertain input variables
and parameters. Thus, in the face of prevailing
uncertainties in model parameters and input-output
variables, UA and SA complement
each other by looking at the same problem from
two opposite directions.
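
To make the distinction concrete, the following sketch propagates input uncertainty through a toy model (UA) and then ranks the inputs by their correlation with the output (SA). Everything here is assumed for illustration: the model (two independent systems, one nuclear and one chemical, where the failure of either leads to the undesired event), the uniform input ranges, and the use of a simple Pearson correlation as the sensitivity measure.

```python
import random
import statistics


def model(p_nuclear, p_chemical):
    """Toy model: probability that at least one of two independent systems fails."""
    return 1.0 - (1.0 - p_nuclear) * (1.0 - p_chemical)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def run_analysis(n_samples=5000, seed=1):
    rng = random.Random(seed)
    # Assumed input uncertainty: the chemical input is far less well known.
    p_n = [rng.uniform(1e-4, 2e-4) for _ in range(n_samples)]
    p_ch = [rng.uniform(1e-4, 1e-3) for _ in range(n_samples)]
    out = [model(a, b) for a, b in zip(p_n, p_ch)]
    # UA (forward): summarize the spread of the output.
    ua = {"mean": statistics.fmean(out), "stdev": statistics.stdev(out)}
    # SA (backward): which input explains the output variation?
    sa = {"p_nuclear": pearson(p_n, out), "p_chemical": pearson(p_ch, out)}
    return ua, sa


ua, sa = run_analysis()
```

With these assumed ranges the output spread is dominated by the chemical input, so the SA step points to the chemical model as the place where reducing uncertainty would pay off most.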

In recent years a variety of methods have
been developed to analyze model uncertainty
from both the UA and SA perspectives, specifically
for use in PSA (Borgonovo, Apostolakis,
Tarantola, & Saltelli, 2003). At the same time, a
number of spectacular failures in the management of
complex real-life processes associated with the nuclear
industry, caused by unrealistic uncertainty
assessments, have come to light. The more than
four-decade-long stalemate around the Yucca Mountain
nuclear waste repository is one example of the
so-called “wicked problems” (Saltelli, Stark, Becker, &
Stano, 2015), where uncertainty and disagreement
about values affect the very framing of what the
problem is and how to model it. Another example
is the handling of the Fukushima Daiichi nuclear
disaster, which provides a vivid illustration of how, in
the face of “unknown unknowns” (Logan, 2009),
safety assessments can become worthless in the
blink of an eye. Better methods of ascertaining and
managing model uncertainty are needed to realistically
re-evaluate the safety features of existing
installations, most importantly by paying more
attention to structural uncertainty, which concerns
issues such as the selection of variables and
processes to include in the model, how the variables
and processes are described mathematically,
how they interact, etc. When modelling complex
phenomena, structural uncertainty is likely
a larger source of uncertainty than the previously
mentioned uncertainty in input/output variables.
Another open topic in uncertainty quantification
is the assessment of human errors, especially in
highly stressful critical conditions during accident
progression.

In both PRA and QRA, which have been developed
specifically to quantify the various risks arising
from the operation of nuclear and chemical plants,
respectively, the quantification of uncertainty is of
crucial importance. However, despite much attention
being devoted to studying uncertainty in the
PRA context, there is no universally accepted
standard for handling the various types of uncertainty
in a systematic way. Furthermore, in the case of
joint nuclear and non-nuclear installations, all the
aforementioned uncertainties are inherited from the
respective installations and further increased by the
model of interactions between the systems.

Although many methods addressing structural
uncertainty have been developed in recent years,
this field of study is far from mature.
This is because the developed methods are usually
highly subject-specific and it is not clear how
they can be extrapolated and reliably applied to
problems other than those for which they were designed.
For example, in the analysis of nuclear plants much
of the focus is on preventing accidents leading to
core meltdown, while in the analysis of chemical
plants the main attention is on preventing fires and
explosions. Each of these critical scenarios has its
own typology, with a different time scale, different
undesired effects, etc., and consequently with different
types of uncertainty taken into consideration. It is
no surprise, then, that each of these fields of
scientific inquiry follows its own more or less unique
path in dealing with structural uncertainty, and very
little work has been done towards developing a global
approach that would link structural uncertainty
assessment with the assessment of aleatory and
epistemic uncertainties in a synthetic manner.

Thus, to assure the safety of joint nuclear and
non-nuclear plants, a systematic approach needs to be
developed that would make it possible to perform global
UA and SA, i.e., to take into consideration the main
sources of uncertainty for both plants jointly, but
also to investigate the new sources of uncertainty
that are due to interactions between the
installations.


4 CONCLUSIONS 


Many safety problems concern the mutual dependence
of the nuclear and chemical parts of installations
in which nuclear reactors are considered for
use as an energy source for various chemical
processes. In order to enhance the safety of such joint
nuclear and non-nuclear installations, a systematic
approach needs to be developed that would make an
integrated licensing process possible. Much
effort has to be made to meet this challenge.
The development of a new framework for integrated
risk assessment is part of it. In this article, two
methods for the integration of chemical QRA and
nuclear PRA were proposed as a contribution to
this task. The first one, based on the failure tree
structure, is very informative but not perfect,
and thus needs to be improved. The alternative
approach, in the form of a block framework, has
more advantages, and the authors believe that it
can be applied in real studies.
ABSTRACT: The Seveso III Directive points towards the introduction of plans for the safe management
of ageing of critical facilities at major risk. Such plans have to cover all phases of the life cycle of the
equipment and take into account current deterioration mechanisms (i.e. internal and external corrosion,
erosion, thermal and mechanical fatigue, etc.). Because of this requirement, procedures are needed to
check the condition of the equipment, especially at the final stages of its life cycle, and to evaluate the
adequacy of the actions taken to control it. Currently, managers adopt Risk-Based Inspection (RBI) standards;
nevertheless, it is essential to demonstrate the integration of ageing management within the overall management
of major hazard plants. This paper discusses the adequacy of the measures usually adopted to control
the ageing phenomenon in primary containment equipment, whose deterioration could generate a major
accident. In order to evaluate the status of such items, a shortcut method has been developed. It represents
a first attempt to develop a tool for ageing monitoring. The first release of the method is static and
appropriate for independent auditors and inspectors acting on behalf of Seveso Authorities. The second
release, which is currently under development, is considered dynamic, as information about process variables,
external data, inspection results, etc. is continuously collected and processed in order
to provide an overall picture of the ageing of systems in the form of an index. These indexes allow real-time
forecasting of the equipment deterioration process and its management based on the industrial risk
acceptance levels. The core of the method is a dynamic model of the factors that accelerate the degradation
processes and of those that slow them down. Based on this model, a “digital twin” of a complex plant
can be built by integrating smart sensors and other smart devices.


1 INTRODUCTION


The new requirements of the latest European legislation
on the control of major accident hazards, the
“Seveso III” Directive, include the monitoring of the
risk due to equipment ageing (EU Council, 2012).
A plan for the safe management of
ageing has to cover all steps of the lifecycle of critical
equipment; for this reason, procedures are needed
to verify the status of facilities and to evaluate the
adequacy of the actions taken by plant managers.


1.1 The issue of ageing in process industries


Recently, the safe ageing of equipment has become
a hot issue for several industries, in particular
those at major accident hazard. The term
ageing does not refer to the time elapsed from the
date of production, testing or commissioning of
the equipment; it relates to the condition of the
equipment and how it changes over time (Wintle et al., 2006).
Ageing of a component reveals itself as a general
form of deterioration that is usually associated
with the in-service time and reduces its reliability
(Horrocks et al., 2010). Ageing increases the risk
of loss of containment and other failures and has
been proven to be a determining factor in many
accidents in the process industry (Wood et al., 2013).
The nuclear industry started to pay attention to
this problem a decade ago, when it was realised
that the age of most in-service reactors was
exceeding the design lifetime. The guideline
published by IAEA (2009) defined the basic
principles for managing the ageing of equipment
in nuclear plants, with the aim of safely extending
their life beyond the limits defined during the
plant’s design phase. According to the IAEA
definitions, two terms have to be distinguished, i.e.
ageing and obsolescence. Both terms basically
refer to the effects of time on complex technical
systems. In this frame, ageing includes processes
that gradually change the physical characteristics
of the equipment over time or with use,
whereas obsolescence refers to the equipment becoming
out of date in comparison with current knowledge,
standards, regulations and technology, which makes
it inadequate. The consequences of
obsolescence include incompatibility between
old and new equipment and the non-compliance
of old equipment. Amongst degradation processes,
corrosion plays a primary role; thus, in many
cases, ageing and corrosion are confused with each
other. The word ageing is often used with a negative
connotation and understood as degradation,
even though the concept of “ageing management”
clearly implies the idea that ageing processes may
be controlled in order to slow down and minimise
their effects. In the present paper, the focus is only
on ageing; obsolescence is out of the scope of this
research, as are the consequences of the ageing of
workers, managers and organisations.

In the chemical industry and in the oil & gas sector,
the issue of ageing is particularly relevant, as most
European refineries have already been in service
for forty or more years and are expected to
continue operating, given the
difficulty of building new ones. A study promoted
by the European Commission a few years ago
analysed a hundred major accidents in oil refineries
worldwide that were due to inadequate management
of ageing and corrosion; this investigation
revealed the relevance of the problem (Wood
et al., 2012). A more recent study, promoted by
OECD (2017), outlined the impact of ageing
in the process industry as well, including the chemical
sector. A fundamental guideline for ageing management
in the chemical industry has been published by the
HSE (Wintle et al., 2006).

A keystone of ageing management is
replacement. In many cases, deteriorated or damaged
systems may be decommissioned and replaced by
new ones having equivalent features. As an example,
in a typical process plant there are thousands
of valves, which can lose their function due to
deterioration processes; these components usually
have an affordable cost and comply with standard
rules, so replacement may reasonably be proposed.
The case of larger items, which have an
unsustainable replacement cost as well as complex
authorisation procedures, is different: replacement
is so difficult that it discourages executives, and
these items are considered practically
“non-replaceable”. Extending their in-service life
as long as possible thus becomes a priority for the
maintenance engineers. In order to comply with the
definitions of common engineering practice, e.g. the HSE
document (Wintle et al., 2010), these non-replaceable
facilities are denoted as “static primary containment
systems”, i.e. systems for which the concern
about ageing is much higher than for other equipment,
such as rotating machinery (e.g. pumps) and control
systems.


1.2 The issue of ageing in the framework of the 
European Seveso legislation 


The previously mentioned report of the European
Commission (Wood et al., 2012) showed the relevance
of ageing in the refining industry. This led
the EU Council to add, to the new Directive on
major accident prevention, the requirement to
define a management program for the safe ageing of
critical equipment for all Seveso establishments.

For more than a decade, the oil and gas
industry has relied on the popular Risk-Based
Inspection (RBI) practice, as defined by the
recommended documents API 580 (API, 2016) and
API 581 (API, 2016a). As discussed by Bragatto
et al. (2012), the traditional RBI approach is valuable,
but it must be integrated within a dynamic
management system for major accident prevention.
The main limitation of these American standards
is that several industries other than refineries are
classified at major accident hazard according to
the Seveso legislation but are not included
in the field of application of API 580/581 and
thus lack clear guidelines. As discussed by
Bragatto and Milazzo (2016), the Seveso II Directive
also stresses the need to share some aspects of risk
management with the controlling authorities
and consequently increases the need to reduce
the uncertainties of RBI models. Such uncertainties
are associated with different aspects: firstly,
uncertainties derive from inadequate knowledge
of the failure modes and related probabilities;
further uncertainties are introduced in the
following steps of the application of the method
(Milazzo & Aven, 2012).


1.3 Audit of ageing programs 


The integrated safety management system required
by the new Seveso legislation has to be verified by
the competent authorities, who judge its
overall adequacy with respect to the
previously discussed ageing issue. In order to meet
the needs of establishment executives and controllers
(auditors and inspectors), a shared model for
verifying adequacy is essential.

This paper discusses the main elements to be
integrated in an effective approach for both monitoring
and inspecting critical equipment at Seveso
establishments. In the following, a piece of equipment
is defined as critical if it is involved in a sequence
of events leading to a top-event, as identified by
fault tree analysis or equivalent methods; top-events
could escalate and have major consequences. Equipment
containing an amount of hazardous substance equal to at
least 20% of the threshold indicated by the Seveso III
Directive is also included in the category of “critical
items”. Therefore, not all equipment in a Seveso
establishment has to be considered “critical”, but only
the systems that are explicitly identified in the risk
assessment as defined by the Seveso III Directive
(Safety Report).

This paper is structured as follows: Section 2
discusses the scope and objectives of the research.
Section 3 describes the ageing model, i.e. the list of
factors that affect the phenomenon and the correlations
amongst them. Section 4 summarises a short-cut method
for external audits, which supports the verification of
the adequacy of ageing management plans. Section 5
focuses on the use of the ageing model through its
integration in a more sophisticated tool, the so-called
ageing sensor. Finally, conclusions and a short
discussion of further developments are given in
Section 6.


2 SCOPE AND OBJECTIVES 


The scope of this research is to discuss how to 
integrate the main factors, affecting the ageing of 
facilities, in an effective tool for monitoring and 
inspecting critical equipment at Seveso establish- 
ments. The quantitative analysis of top-events, 
through the fault tree technique or equivalent 
methods, shows those that should be considered 
credible and, by referring to a frequency thresh- 
old, allows selecting those to be analysed from the 
consequence point of view. Even if the likelihood 
threshold is set at 10% event/year, it could be possi- 
ble that, due to deterioration processes, the failure 
probability does not follow the traditional bathtub
curve trend, but shows an increasing trend when
the equipment is close to the end of its life. Thus, it could
be possible that some top events, which were not 
considered credible because having frequency <10 
event/year, become credible as failure probabilities 
are higher than expected due to ageing. 
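The effect described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the ageing failure rate follows a Weibull wear-out law; the threshold, shape and scale values are hypothetical and are not taken from this paper.

```python
# All numbers below are hypothetical and only illustrate the argument.
CREDIBILITY_THRESHOLD = 1e-3  # events/year (assumed screening value)
SHAPE = 3.0                   # Weibull shape > 1 models wear-out
SCALE_YEARS = 200.0           # Weibull scale parameter (assumed)


def weibull_hazard(t_years: float, shape: float = SHAPE,
                   scale_years: float = SCALE_YEARS) -> float:
    """Instantaneous failure rate (events/year) of a Weibull life model."""
    return (shape / scale_years) * (t_years / scale_years) ** (shape - 1.0)


def is_credible(t_years: float) -> bool:
    """True when the ageing-adjusted rate reaches the screening threshold."""
    return weibull_hazard(t_years) >= CREDIBILITY_THRESHOLD


if __name__ == "__main__":
    for age in (10, 30, 60):
        print(age, f"{weibull_hazard(age):.2e}", is_credible(age))
```

With these assumed parameters, an event screened out as not credible for young equipment crosses the threshold once the equipment is old enough, which is the situation an ageing management program is meant to catch.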

The main objectives of this research are summarised
below: (1) to provide auditors and inspectors with a
reliable short-cut method for controlling ageing at
Seveso establishments; (2) to make it easy to update
the recorded condition of the equipment when required.
The short-cut method is based on a previous preliminary
study (Bragatto et al., 2017), from which a simple and
effective approach has been developed. The tool is
useful for the management of equipment operability and
is currently being tested in a few Italian
establishments. Concerning the elaboration of the
ageing status, the most ambitious future goal is
day-by-day monitoring. It must be pointed out that,
although the ageing of machinery is also important, it
is not discussed in this paper, since the focus is only
on major accidents.


3 AGEING MODEL 


3.1 Factors affecting equipment ageing 


The ageing status of a plant is described by a model
that collects a number of factors that contribute to
equipment deterioration. The management of ageing aims
at understanding these factors, assigning the proper
weight to their contribution to the deterioration and,
finally, understanding the relationships amongst them.
Given these considerations, the safe management of
ageing is a complex issue, which needs three essential
elements:


1. Knowledge (K): the understanding of all
deterioration mechanisms affecting the equipment
during its lifetime;

2. Information (I): the collection of the documents
that describe the past of the equipment starting
from the early stage of its lifetime, i.e. design
criteria, materials of construction and each change
made during its lifetime;

3. Data (D): the information collected by means of
non-destructive tests, i.e. measurements that have
to be processed in order to contribute to the
monitoring of equipment integrity and functionality.


These key elements give a complete picture of the
equipment. Clearly, if knowledge of the deterioration
mechanisms is poor or information over the entire
lifetime has been lost, the measurement data are not
enough for good decision-making, because past actions
could already have brought the facility to a
compromised condition.

To describe the ageing model, the factors affecting
the phenomenon have been grouped in two categories:
hardcore and softcore. Hardcore includes direct factors
(managerial factors), i.e. those linked to measurements
and to corrective actions that counteract known
deterioration mechanisms (it is correlated to Data).
Softcore includes indirect factors (physical factors),
i.e. those elements that are correlated to Information
and Knowledge and have the greatest influence in
predicting when the equipment has to be removed from
service.

As shown in Figure 1, the core of the model is
represented by the deterioration processes, which have
a physical or a chemical nature and sometimes both. The
scheme shows how these processes depend over time on a
number of accelerating and slowing-down factors, placed
downward and upward respectively. It can also be
observed that, beyond those mentioned above, there are
other factors that had an effect in the past (i.e.
design criteria, processes, materials, environment,
repairs and age/in-service time); their consequences
are due to choices that were made previously and whose
effects can no longer be corrected. It must be pointed
out that repairs is an accelerating factor that refers
to modifications not included in the management of
changes, whereas modifications has to be understood as
a factor that influences the equipment age and, in some
cases, has the power to reset it (age re-conditioning).


3.2 Relationship amongst factors 


Accelerating and retarding factors, shown in 
Figure 1, are not independent in their actions on the 
degradation phenomenon. In this section a descrip- 
tion about how they are related to each other is 
given. Relationships are summarised in Figure 2 in 
the form of arrows connecting related factors. 

It should be noted that defects, damages, fail- 
ures and accidents/near-misses are a direct conse- 
quence of the deterioration mechanisms. Defects 
refer to a structural damage, identified by inspec- 
tion, which does not compromise the operating 
of the system, thus, its repair is not necessary; 
whereas damages refer to something that com- 
promises the operability and compels repairs or 
replacement. Failure is the end of the contain- 
ment capacity of the system; therefore, it mani- 
fests itself through a loss of containment. Finally, 
accidents are scenarios occurring after the loss 
of containment (fire, explosion or dispersion of 
a toxic substance). The factors discussed above
represent different manifestations of the ageing
phenomenon, which evolves according to a precise
sequence: defects → damages → failures →
accidents/near misses. All four factors in turn
contribute to increasing the degradation rate
(accelerating factors). The number of unplanned stops
of the plant directly contributes to accelerating the
deterioration, as it subjects the equipment to various
types of stress.

Figure 1. Ageing model.

Amongst the factors which act directly by slowing down
the deterioration, there are physical protections (e.g.
cladding and lining), maintenance, inspection program
and inspection results. The inspections allow the
identification of defects, which is quicker the more
effective their planning. Inspection policy (or program) should
be based on risk assessment and the knowledge of 
inspection technique and scheduling. A risk-based 
inspection system is influenced by the inspector qual- 
ification and the audit of management system (SMS). 
This last factor includes change of property (or own- 
ership) and experienced personnel loss due to staff 
change (accelerating at the softcore level) and docu- 
mentation along lifecycle and risk assessment (retard- 
ing at the softcore level), and maintenance (retarding 
at the hardcore level). The process control is a factor 
that acts on the monitoring and control of operating 
parameters of equipment, for which the choice was 
made during the process design phase (see Figure 1). 
Finally, other factors that directly contribute to
ageing are environment, materials, processes and
design criteria. Their action, as discussed in the
previous section, is independent and cannot be
eliminated because it is associated with past choices.


3.3 Model simplifications


The model, described from the relational point of 
view in Section 3.2, can be simplified as shown in 
Figure 3. A short discussion on simplifications is 
given below. 

A comprehensive knowledge of deterioration 
mechanisms and control techniques is the sound 
basis for recognized inspection practices, such as 
API 581 (2016a), which is the distillate of dec- 
ades of scientific research and experience in oil 
and allied industries. Where there is a lack of
knowledge, such as in a few chemical industries,
inspection strategy is a relevant element; this is
particularly important for the knowledge of the
inspection technique (Inspection program). If the
inspection program is risk-based, a periodic update
is certainly considered in the inspection planning;
the approach could ideally be dynamic, especially
if an ageing sensor is used. In this case, risk
assessment and inspection program are merged in the
factor RBI inspection program.

Figure 3. Simplified model for ageing.

The failures factor has previously been linked to
repairs because a failure usually requires an
intervention. Under the hypothesis that, thanks to the
effectiveness of the dynamic system for inspection
planning, the intervention takes place before a
possible loss of containment, the link between the two
factors can be disregarded.


4 SHORT-CUT METHOD TO ASSESS 
AGEING MANAGEMENT PLANS 


To support auditors in assessing the adequacy of
ageing management plans, a short-cut method has
recently been proposed (Bragatto et al., 2017). It is an
index approach, which is simple and easy to use.

The method consists in assigning scores to the
factors that accelerate or retard ageing. These scores
take the form of penalties for accelerating factors and
of compensations for retarding ones. If the cumulated
compensations are greater than or equal to the
cumulated penalties, the activities in place for
ageing management are adequate.

Otherwise, the ageing management system must be
improved by adopting technical and/or managerial
solutions that increase the scores for retarding
factors.

This approach is characterised by proportionality
in applying countermeasures to the ageing phenomenon.
Hence, if penalties are low, little compensation is
required, whereas if they increase, prevention
activities must also increase to obtain a higher
compensation. The industrial manager can choose the
technical and/or managerial solutions to be applied to
offset the penalties, according to his/her preferences.
Compared to traditional check-lists, the method has the
advantage of greater clarity in the evaluation process
and of quantifying the weight to be attributed to each
factor. Unfortunately, like other index methods, it
introduces uncertainties. The method and its main steps
are represented in Figure 4. Each score (penalty or
compensation) is assigned by referring to a four-level
scale. The levels are identified as: 1 = low;
2 = medium; 3 = medium-high; 4 = high. A sign is also
associated with the score: negative for penalties and
positive for compensations.
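The signed-score rule can be written down as a short sketch. The class and function names are hypothetical, and the factor names and levels in the usage note are invented for illustration; only the sign convention, the four-level scale and the adequacy criterion come from the text.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FactorScore:
    """One scored factor on the four-level scale (1 = low ... 4 = high)."""
    name: str
    level: float         # may be non-integer when sub-factors are averaged
    accelerating: bool   # True -> penalty, False -> compensation

    def __post_init__(self) -> None:
        if not 1 <= self.level <= 4:
            raise ValueError(f"level must be within 1..4, got {self.level}")

    @property
    def signed(self) -> float:
        """Negative for penalties, positive for compensations."""
        return -self.level if self.accelerating else self.level


def overall_score(scores: Iterable[FactorScore]) -> float:
    """Cumulated compensations minus cumulated penalties."""
    return sum(s.signed for s in scores)


def is_adequate(scores: Iterable[FactorScore]) -> bool:
    """Adequate when compensations are greater than or equal to penalties."""
    return overall_score(scores) >= 0
```

For example, penalties of 3 and 2 against a single compensation of 4 give an overall score of -1, so further retarding measures would be needed; adding a compensation of 2 brings the score to +1.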

The accelerating and slowing-down factors are those
given in Figure 1. To simplify the work of the auditor
and/or the industrial manager, some factors have been
grouped into new ones and are treated as their
sub-factors. The new factors are: (i) audit of SMS
includes risk assessment, documentation along lifecycle,
change of property, experienced personnel loss and
maintenance, (ii) inspection management refers to
inspection program, (iii) inspection effectiveness
includes inspection program and inspector qualification
and, finally, (iv) inspection results includes
inspection scheduling.

Tables 1 and 2 show the criteria for assigning
scores to accelerating and retarding factors. They
are based on the following definitions:


• Age/In-service time = ratio “current age/maximum
designed age” or “current operating hours/maximum
in-service hours”.


[Figure 4 flow-chart: selection of critical equipment;
identification of accelerating and retarding factors;
assignment of scores (penalties and compensations);
calculation of cumulative penalty and cumulative
compensation; calculation of the overall score;
negative score: selection of technical and/or
managerial solutions; positive score: no requirements.]

Figure 4. Flow-chart illustrating the method for the
assessment of ageing management plans.


• No. unplanned stops = ratio “no. of unexpected
stops/total stops” over a reference period.

• Failures = actual failure rate over the reference
period (f) compared to the failure rate taken from
international databases (f_ref).

• Accidents/Near-misses = ratio between the number of
incidents and near misses due to ageing and the
total number of registered events over a reference
period.

• Deterioration mechanisms = average value among
three scores related to the consequences of the
degradation (i.e. dimension of leakage), the ability
to detect the main damage mechanisms (by an
inspection technique) and the velocity of propagation
of the phenomenon. To quantify the score, indications
are given by Bragatto et al. (2017).

• Defects/Damages = percentage of serious damage,
detected over the reference period, compared to the
number of critical equipment.

• Audits SMS = average value between two scores
related to the conduct of audits (internal and
external) on the SMS and their results.

• Inspection management = main characteristics of
the structure of the inspection management.


Table 1. Criteria for the assignment of scores for accelerating factors.

Factor                     Score    Value
Age/In-service time        1        < 90 %
                           2        90 ÷ 100 %
                           3        100 ÷ 120 %
                           4        > 120 %
No. of unplanned stops     1        < 10 %
                           2        10 ÷ 25 %
                           3        25 ÷ 60 %
                           4        > 60 %
Failures                   1        f ≤ 0.5 f_ref
                           2        0.5 f_ref < f ≤ f_ref
                           3        f_ref < f < 2 f_ref
                           4        f ≥ 2 f_ref
Accidents/Near-misses      1        < 5 %
                           2        5 ÷ 15 %
                           3        15 ÷ 35 %
                           4        > 35 %
Deterioration mechanisms   1 to 4   Average score accounting for: (i) consequences, (ii) ability to detect mechanisms, (iii) propagation velocity
Defects/Damages            1        < 1 %
                           2        1 ÷ 3 %
                           3        3 ÷ 5 %
                           4        > 5 %


Table 2. Criteria for the assignment of scores for retarding factors.

Factor                     Score    Value
Audits SMS                 1 to 4   Average score accounting for: (i) % of minor non-compliances, (ii) % of greater non-compliances
Inspection management      1        compliant with the legislation
                           2        risk-based integrated with inspection plan
                           3        updated after changes
                           4        periodically updated
Inspection results         1 to 4   Average score accounting for: (i) system functionality test results, (ii) system integrity test results, (iii) inspections planning (scheduling)
Inspections effectiveness  1 to 4   Average score accounting for: (i) effectiveness of inspections, (ii) inspector qualification
Process control            1        unregistering local control system
                           2        control system with data recording
                           3        data recording system with automatic blockage
                           4        control system with data recording + automatic blockage + certified blockage
Specific protections       1 to 4   Average score accounting for: (i) inspection intervals, (ii) protection’s conditions


• Inspection results = average value among three
scores accounting for the inspections planning
and the results of tests that verify the functionality
and integrity of the systems.

• Inspections effectiveness (it is partly included in
inspection program) = average value among three
scores accounting for extension and degree of
coverage of techniques, likelihood of damage
detection and qualification of inspectors. Reference
should be made to UNI 11325-8 (UNI, 2013) and
API 581 (API, 2016a).

• Process control = main characteristics of the
installation control systems of process variables.

• Physical protections = average value among three
scores accounting for the type of coating, the
frequency of the controls and the actual condition
of the material.


In Tables 1 and 2, the score of a factor that takes
into account several sub-factors is calculated by
averaging the scores of its sub-factors (see Bragatto &
Milazzo, 2016).
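The percentage bands of Table 1 and the averaging of sub-factors can be sketched as follows. The dictionary keys and function names are hypothetical. Two points are assumptions rather than statements of the paper: a value falling exactly on a shared band edge is assigned to the lower band, and the top band for Age/In-service time is read as "> 120 %".

```python
from statistics import mean
from typing import Sequence

# Band edges in percent, read from Table 1 (level 1 below the first edge,
# level 4 above the last one).
BAND_EDGES = {
    "age_in_service": (90.0, 100.0, 120.0),
    "unplanned_stops": (10.0, 25.0, 60.0),
    "accidents_near_misses": (5.0, 15.0, 35.0),
    "defects_damages": (1.0, 3.0, 5.0),
}


def level_from_percentage(factor: str, pct: float) -> int:
    """Map a percentage to the four-level scale for the given factor."""
    first, second, third = BAND_EDGES[factor]
    if pct < first:
        return 1
    if pct <= second:   # tie at a shared edge goes to the lower band (assumed)
        return 2
    if pct <= third:
        return 3
    return 4


def composite_level(sub_scores: Sequence[float]) -> float:
    """Score of a factor made of sub-factors: the plain average."""
    return mean(sub_scores)
```

A piece of equipment at 110 % of its designed age would thus receive level 3, and a factor with sub-scores 2, 3 and 4 would receive the composite level 3.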


5 VIRTUAL SENSOR FOR AGEING 


The first version of the ageing model, which does 
not take into account the relationships amongst 
accelerating and retarding factors, is implemented 
in the short-cut method for the ageing monitor- 
ing and control at major hazard establishments 
presented in Section 4. To evaluate the status of 
critical equipment, a static model appears the 
most appropriate for auditors acting on behalf of 
Seveso Authorities. 

Nevertheless, a dynamic model, which also accounts
for the interaction amongst factors, could be more
effective for industrial managers. For this reason, a
second release of the method is currently under
development within a system, called virtual sensor for
ageing, made up of hardware and software. The model is
considered dynamic because information about process
variables, external data, inspection information, etc.
is continuously collected and processed, in order to
provide an overall picture of the system’s status in
the form of an index (overall ageing score). This index
allows the real-time forecasting of the equipment
deterioration process and its management based on the
industrial risk acceptance levels. Based on this model,
a digital twin of a complex plant can be built by
integrating sensors and other smart devices that
collect information from the equipment and the
establishment.

Therefore, the digital twin is made up of meas- 
ured data, managed information and models for 
the physical evolution of equipment. It simulates 
the real evolution of equipment, anticipating 
possible failures. This set of “data-information- 
knowledge” can be recalled from every “location” 
(e.g. cloud, DCS, etc.) through a device that consti- 
tutes the interface with the user. 


6 CONCLUSIONS 


As required by legislation, in the context of major
hazard establishments, inspection planning and
maintenance activities are essential elements to
guarantee the safe ageing of installations. These must
be based on in-depth knowledge of all damage
mechanisms and backed up by appropriate controls. The
proposed model (in its various versions), by accounting
for the interaction amongst accelerating and
slowing-down factors and its dynamics, is aimed at
supporting the auditing activity and promoting an
ageing management based on knowledge, information and
data. At present, the second release of the method is
under development and implementation within a system
called virtual sensor. It will work on the basis of a
digital twin of a complex plant and achieve dynamic
monitoring of the ageing status of the overall
establishment.


ACKNOWLEDGMENT 


This work is part of an Italian research project
entitled “SmartBench” that is supported by INAIL
(Istituto Nazionale per l’Assicurazione contro gli
Infortuni sul Lavoro) and funded within the call
BRIC 2016.


REFERENCES 


API American Petroleum Institute 2016. Risk-based
Inspection. API recommended practice API RP 580.

API American Petroleum Institute 2016a. Risk-Based
Inspection Methodology. API recommended practice
API RP 581.

Bragatto, P., Delle Site, C. & Faragnoli, A. 2012.
Opportunities and threats of risk based inspections:
The new Italian legislation on pressure equipment
inspection. Chemical Engineering Transactions 26: 177-182.

Bragatto, P. & Milazzo, M.F. 2016. Risk due to the age- 
ing of equipment: Assessment and management. 
Chemical Engineering Transactions 53: 253-258. 

Bragatto, P., Delle Site, C., & Milazzo, M.F. 2017. Audit 
of Ageing Management in Plants at Major Accident 
Hazard. Proceedings 2nd International Conference on 
System Reliability and Safety ICSRS, pp. 400-405. 

EU Council, 2012. Directive 2012/18/EU on the con- 
trol of major-accident hazards involving dangerous 
substances. Official Journal of the European Union 
L197/1-37. 

Horrocks, P., Mansfield, D., Thomson, J., Parkerv, K. & 
Winter, P. 2010. Plant Ageing Study Phase 1 Report. 
Health and Safety Executive Report no. RR823. Avail- 
able on-line: www.hse.gov.uk. 

IAEA 2009. Ageing management for nuclear power 
plants. IAEA Safety Standards Series No. NS-G-2.12. 
Available on-line: http://www-pub.iaea.org. 

OECD Organisation for Economic Cooperation and 
Development 2017. Ageing of hazardous installations. 
OECD Environment, Health and Safety Publications 
— Series on Chemical Accidents, no. 29. Available on- 
line: http://www.oecd.org. 

Milazzo, M.F. & Aven, T. 2012. An extended risk assess- 
ment approach for chemical plants applied to a study 
related to pipe ruptures. Reliability Engineering and 
System Safety 99, 183-192. 

Ente Nazionale Italiano di Unificazione (UNI) 2013. 
Pianificazione delle manutenzioni su attrezzature a 
pressione attraverso metodologie basate sulla val- 
utazione del rischio (RBI). Document UNI 11325, Part 
8 (in Italian). 

Wintle, J., Moore, P., Henry, N., Smalley, S. & Amphlett,
G. 2006. Plant ageing. Management of equipment
containing hazardous fluids or pressure. Health and
Safety Executive Report no. RR509. Available on-line:
www.hse.gov.uk.

Wood, M.H., Arellano, A.V. & Van Wijk, L. 2013. Cor- 
rosion Related Accidents in Petroleum Refineries. 
European Commission Joint Research Centre Report 
no. EUR, 26331. 




Safety and Reliability - Safe Societies in a Changing World — Haugen et al. (Eds) 
© 2018 Taylor & Francis Group, London, ISBN 978-0-8153-8682-7 


Failure prognosis of discrete events systems based on extended 
Petri Nets 


R. Kanazy & S. Chafik 
Pluridisciplinary Research and Innovation Laboratory, EMSTI, Casablanca, Morocco 


E. Niel 


Department of Industrial Engineering, Ampere Laboratory, National Institute of Applied Science Lyon, 
Villeurbanne, France 


ABSTRACT: Fault prognosis has become a major concern for complex and interconnected systems. Significant
events such as faults can cause a partial or total stop of the expected functionalities. Preventing
failure events is essential to preserve performance, availability and the safety of both operators and
equipment. The aim of prognosis is to prevent fault events before they occur. Fault/repair management
refers to event control, and so it is relevant to the domain of Discrete Event Systems (DES), for which
stochastic finite state automata and Petri Nets (PN) have been used to prognosticate fault states. They are
based on predictions of a fault event at least m steps in advance.

The proposal is based on the notion of time, which is crucial for fault prognosis. Indeed, one can give the
remaining time before the occurrence of a fault event. The goal is to prevent the occurrence of a fault event
T time units in advance. This approach is based on labeled T-temporal Petri Nets, which have the
advantage of a formal character for the assessment of properties and of being sufficiently generic to
handle a high level of complexity.


1 INTRODUCTION 


The availability of complex systems can be
ensured only by controlling fault events, which
can occur at any time and trigger a partial or total
shutdown of the system. The criticality of complex
systems requires that a failure be reported before it
occurs, in addition to failure detection and the
reporting of dysfunction alarms, so as to avoid an
accidental shutdown of the system. Prognosis makes it
possible to meet these requirements and gives
visibility on the evolution of the system, thus
allowing the prediction of future failures. Several
fault prognosis methods have been developed; some have
adopted a stochastic approach (Ammotur et al. 2017)
(Dutta and Biswas 2015) (Chen and Kumar 2014), while
others have chosen a non-stochastic one (Takai 2012)
(Kumar and Takai 2010) (Takai 2015), for either state
automata or Petri Nets. These approaches aim at
predicting a failure m steps in advance, based on a
stochastic process, and cannot express the prediction
in terms of time. The challenge for each community
working on this topic is to predict the future reality
perfectly.

To gain visibility of the future, one must master the
present and have enough information about the past.
The verification of time constraints through an
extension of Petri Nets (PN) (Chen et al. 2017) called
temporal Petri Nets (Berthomieu 2001) is one of these
modeling methods. Two modeling methods are proposed.
The first one combines T-temporal PN with labeled PN
(Yin and Lafortune 2017) to give an extension of PN
called extended labeled T-temporal PN. In this
extension, a clock and a label are associated with
transitions. The second method integrates watchdog
techniques (Kovacs et al. 2007), which generate alarms
to indicate the existence of a fault in the system.
However, the goal is to go beyond detection and to
prevent faults; hence the interest in improving the
method so that the model is expressive enough to
determine, T time units in advance, whether or not a
failure will occur in the system.

The first proposal introduces the notion of
INIT and EXC events, which represent respectively
the initialization of the clock and the exceeding of
its time bound, thus allowing the system to be
partitioned into operating modes (Nominal, Degraded and
Failed) in order to control risks. Following this
partition, we can determine the relevant, non-relevant
and critical places. From the first relevant place in
degraded mode, one can prevent the occurrence of a
fault event T time units in advance. Considering that
the system cannot remain in degraded mode indefinitely,
the notion of execution cycle of the degraded mode is
introduced. By exploiting the temporal constraints of
transitions, it is possible to calculate the minimum
execution time of a mode, yet it would also be
interesting to calculate its maximum execution time.
The notion of execution cycle helps to calculate an
estimate of the latest execution time of a mode, which
yields a time estimation interval for the execution of
a mode, at the earliest and at the latest. If the
extended PN is not safe but bounded (places may contain
more than one token), multiple clocks are introduced.
The second method represents the watchdog approach
based on the labeled T-temporal PN (Ru and Hadjicostis
2009); the alarm places generated by the watchdog
become indicator places from which one can prevent a
fault occurrence by computing the earliest and latest
time of its occurrence.

The paper is organized as follows: the first sec- 
tion is dedicated to the representation of Petri 
nets and their extensions integrating time con- 
straints. In the second section, the rules of tran- 
sition from the T-temporal PN to the extended 
T-temporal PN are determined. The third sec- 
tion will be devoted to modeling the system 
using the various methods mentioned above. 
The fourth section will detail the proposed prognosis
approach. The fifth section will extend the proposed
method by introducing multi-clocks in the model. The
paper ends with a conclusion.


2 PRELIMINARY 


2.1 T-temporal Petri Nets


A T-temporal PN (Berthomieu 2001) (Sadou and
Demmou 2009) (Zuberek 1991) (Jiacun 1998) is
a tuple $TR = (P, T, Pre, Post, M_0, I_s)$, where $I_s$
is the static interval function. Temporal constraints
can be associated with places, arcs or transitions.

This differentiation in the representation does
not influence the semantics. In this paper, the
temporal constraints in T-temporal PN are associated
with transitions by rational bounded intervals:
$I_s(t) = [min, max]$ with $0 \le min \le max$, where
$max$ can be $\infty$. The firing of a transition can
only occur after a minimum number of time units (min)
and, at the latest, after a maximum number of time
units (max). For example, in Figure 1, the transition
is fired at the earliest after a time units (UT) and at
the latest after b UT.
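The firing window defined by the static interval can be sketched as follows. The class and method names are hypothetical, and time is measured from the instant the transition became enabled, which is an assumption about the intended semantics rather than a statement of the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedTransition:
    """A transition carrying a static interval [t_min, t_max]."""
    name: str
    t_min: float
    t_max: float = math.inf   # the upper bound may be infinite

    def __post_init__(self) -> None:
        if not 0 <= self.t_min <= self.t_max:
            raise ValueError("expected 0 <= t_min <= t_max")

    def may_fire(self, elapsed: float) -> bool:
        """True when the time since enabling lies inside the interval."""
        return self.t_min <= elapsed <= self.t_max

    def window_missed(self, elapsed: float) -> bool:
        """True when the latest firing time has already passed."""
        return elapsed > self.t_max
```

With an interval [2, 5], the transition cannot fire after 1 time unit, may fire at any instant from 2 to 5, and has missed its window at 6.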


2.2 Labeled Petri nets 


A labeled PN (Yin and Lafortune 2017) (Li 2017)
(Jiacun 1998) adds to a PN a label alphabet and a
function for labeling transitions with these labels.

A labeled PN is a quadruple $LR = (R, M_0, \Sigma, \Lambda)$,
with:


Figure 1. T-temporal Petri Nets.

Figure 2. Labeled Petri Net.

Figure 3. T-temporal labeled Petri Net.

• $(R, M_0)$: marked PN.

• $\Sigma$: finite set of events.

• $\Lambda: T \to \Sigma$: the labeling function,
which assigns a label (event) to each transition.

The model of Figure 2 can be considered as the
execution of the PN in Figure 1, obtained by
associating the events a and c, respectively, with its
transitions.
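A labeled net can be sketched as a token game in which every firing emits the event assigned by the labeling function. The class, the helper `example_net` and the place and transition names are hypothetical; the two-transition net carrying the events a and c is only loosely modelled on the description above.

```python
from typing import Dict

Arcs = Dict[str, Dict[str, int]]   # transition -> {place: arc weight}


class LabeledPetriNet:
    """Minimal labeled Petri net: Pre, Post, labels and a current marking."""

    def __init__(self, pre: Arcs, post: Arcs, labels: Dict[str, str],
                 initial_marking: Dict[str, int]) -> None:
        self.pre = pre
        self.post = post
        self.labels = labels
        self.marking = dict(initial_marking)

    def enabled(self, transition: str) -> bool:
        """A transition is enabled when every input place holds enough tokens."""
        return all(self.marking.get(place, 0) >= weight
                   for place, weight in self.pre[transition].items())

    def fire(self, transition: str) -> str:
        """Fire the transition and return the event assigned to it."""
        if not self.enabled(transition):
            raise ValueError(f"{transition} is not enabled")
        for place, weight in self.pre[transition].items():
            self.marking[place] -= weight
        for place, weight in self.post[transition].items():
            self.marking[place] = self.marking.get(place, 0) + weight
        return self.labels[transition]


def example_net() -> LabeledPetriNet:
    """Hypothetical net: p0 -t1(a)-> p1 -t2(c)-> p2, one token in p0."""
    return LabeledPetriNet(
        pre={"t1": {"p0": 1}, "t2": {"p1": 1}},
        post={"t1": {"p1": 1}, "t2": {"p2": 1}},
        labels={"t1": "a", "t2": "c"},
        initial_marking={"p0": 1},
    )
```

Firing `t1` and then `t2` on the example net yields the observed event sequence a, c and moves the token from `p0` to `p2`.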


2.3 Labeled T-temporal Petri nets 


A labeled T-temporal PN (Peres et al. 2011) is a
tuple $LTR = (P, T, Pre, Post, M_0, I_s, \Sigma, \Lambda)$, where:


• $\Lambda: T \to \Sigma$ is the labeling function.

Figure 3 represents a T-temporal labeled PN.


3 FAULT PROGNOSIS BASED ON
EXTENDED PN


The prognostic method proposed in this paper is
based on the temporal estimation of the occurrence
of a future event. The system is modeled as an extended
T-temporal labeled PN. In order to analyze the risks,
the system must be partitioned into operating modes
(nominal, degraded, faulty). Once the operating modes
are identified, it is necessary to determine which
places are relevant, which are not, and which are
critical. The aim of prognosis is to calculate the time
estimate of the execution of the degraded mode in order
to determine in advance the earliest possible
occurrence of a failure, i.e. of the entry into the
failure mode.
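One simple reading of this time estimate, for a purely sequential path of transitions through the degraded mode, is to add up the lower and the upper bounds of the static intervals. The function name and the interval values are hypothetical, and the sketch ignores concurrency and cycles, which the paper's method may treat differently.

```python
import math
from typing import Iterable, Tuple

Interval = Tuple[float, float]   # (min, max) static firing interval


def time_window(path: Iterable[Interval]) -> Tuple[float, float]:
    """Earliest and latest time to traverse a sequential path of transitions.

    Each transition is assumed to become enabled when the previous one
    fires, so the bounds simply add up along the path.
    """
    earliest = 0.0
    latest = 0.0
    for t_min, t_max in path:
        earliest += t_min
        latest += t_max
    return earliest, latest
```

For a hypothetical degraded-mode path with intervals [2, 5], [1, 3] and [0, 4], the failure mode could be entered no earlier than 3 and no later than 12 time units after the path is entered; an unbounded interval makes the latest time infinite.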
3.1 Extended PN


The modeling proposed in this section is based
on T-temporal labeled PN, adapted to our approach,
which considers critical systems such as real-time
systems. An extended PN is a 7-tuple
$(P, T, \Sigma, Init, Exc, X, C)$, with:


Init: the time initialization function of the clock.
Exc: the time exceeding function of the clock.
X: set of clocks.

C: set of temporal constraints.


The semantics differ from that of T-temporal
labeled PN. Firing a transition in an extended
T-temporal PN requires checking both the time
constraint and the occurrence of the event. Firing the
transition makes it possible to know whether the event
occurred before, simultaneously with or after the time
constraint.

Temporal constraints can be integrated in a PN by associating
them with places, arcs or transitions. An extended T-temporal
labeled PN associates temporal constraints with transitions; the
goal is to determine, during firing, whether an event has occurred
at the intended time or not. Two notions are introduced: clock
initialization and clock exceeding, denoted Init and Exc
respectively. This makes the model more expressive, so that the
future evolution of the system can be predicted. Init sets the
clock when a transition is fired. Exc checks whether the event
occurred before or after the clock bound was exceeded. To
introduce the method, the transformation rules from T-temporal PN
to extended PN are modeled (see Figure 10).

In the above model, the transition fires as soon as the event
appears; the clock x is reset automatically after the firing.
Init is associated with the first transition to reset the clock x
to 0 and set the guard. The transition is associated with
{b, Exc(x,30)}: it is fired when event b occurs before, or
simultaneously with, the expiration of the clock.

The expression {Exc(x,30), b} allows the firing of the transition
after checking that event b occurs after the expiration of the
clock.

The equivalent model does not include an Init, since the clock
initialization is not expressed in the T-temporal PN. The
expression {c, Exc(x,30)} checks whether event c occurred before
or simultaneously with the expiration of the clock.

The only condition for firing the transition in the first model
is the occurrence of event c; no guard and no clock
initialization are indicated. The equivalent model keeps the same
constraint. The notions of Init and Exc refer to the work of
Khoumsi on timed PN (Ouédraogo et al. 2006) (Khoumsi 2009)
(Khoumsi 2005), which consists of reformulating the real-time
prognosis problem in a non-real-time form by transforming the
timed automaton into a finite state automaton called SEA (for
Set-Exp-Automaton).


3.2 Example of model 


3.2.1 Modeling of T-temporal labeled PN 

In order to apply the transformation rules from the T-temporal PN
to the extended PN, the T-temporal labeled model of Figure 8 is
used. It is assumed that the PN of Figure 9 is safe and that
event c always respects the temporal constraints with which it is
associated in the model (to avoid the presence of a non-nominal
behavior). Each transition is associated with a triplet (event,
guard, reset). The guard is a temporal constraint that compares
the clock (for example with >, =, < or ≤) with a specified number
of time units. The guard may be omitted if the firing of the
transition depends solely on the occurrence of an event.

Figure 8. Example of T-temporal labeled PN model.
Figure 9. Extended PN model.


• $P$: set of places (six places in this model).

• $T$: set of transitions.

• $\Sigma = \{a, b, c, d, e, f, g, r\}$: set of events, where f is a
failure event and r a repair event.

• $\Sigma = \Sigma_n \cup \Sigma_d \cup \Sigma_f \cup \Sigma_r$.
- $\Sigma_n = \{a, b, c\}$: set of nominal events
- $\Sigma_d = \{d, e, g\}$: set of degraded events
- $\Sigma_f = \{f\}$: set of failure events
- $\Sigma_r = \{r\}$: set of repair events

• x: clock.


The clock reset is mentioned by x, an uninitial- 
ized clock is represented by —. 


3.2.2 Extended PN model 

Figure 10 represents the system modeled with an extended PN. Two
notions are introduced in this model. Init defines both the time
constraint and, when necessary, the initialization of the clock.
Exc checks whether the clock has been exceeded; it takes two
parameters: the first indicates the clock and the second the
temporal constraint to check. The model below represents the
extended PN. It is assumed that this PN is safe.


3.3 Description of the prognostic method 


3.3.1 Steps of the prognosis method 

The prognostic approach is based on verifying the earliest and
latest future occurrence of a failure event. The first step is to
provide a temporal model of the system, either as an extended PN
(see next section) or as a PN with the watchdog technique (see
Section 4), in order to exploit the time constraints represented
in the model. To analyze the risks, the second step identifies the
possible operating modes of the system (nominal, degraded,
failed). The nominal mode corresponds to operation of the system
without any violation of the time constraints; the degraded mode
corresponds to operation with degradation (presence of a partial
failure) but without stopping the system; the failed mode
represents the failure state. The third step determines to which
operating modes the places belong, in order to identify which
places are relevant, which are not, and which are critical.
Relevant places can be nominal, degraded, critical or failed,
depending on the mode in which they appear. A place that appears
in both the nominal and the degraded mode is referred to as a
non-relevant place. A critical place is an intermediate place
between the degraded and the failed mode: the firing of its
transition leads to the failed mode.

Figure 10. Operating modes of the system.

Definition 1:

$P = P_r \cup P_{nr}$, where $P_{nr}$ is the set of non-relevant
places and $P_r$ is the set of relevant places of the extended PN,
with $P_r = P_n \cup P_{deg} \cup P_{fail}$ and $T$ the set of its
transitions, where:

• $P_n$ is the set of relevant places $p_r$ that belong to the
nominal mode. Such a place is called a nominal place, denoted
$p_n$. All $p_n$ places constitute the set of places in nominal
mode: $\forall p_n,\ p_n \in P_n$.
• $P_{deg}$ is the set of relevant places $p_r$ that belong only to
the degraded mode. Such a place is called a degraded place,
denoted $p_{deg}$. All $p_{deg}$ places constitute the set of
places in degraded mode: $\forall p_{deg},\ p_{deg} \in P_{deg}$.
• $P_{fail}$ is the set of relevant places $p_r$ that belong only
to the failed mode. Such a place is called a failed place, denoted
$p_{fail}$. All $p_{fail}$ places constitute the set of places in
failed mode: $\forall p_{fail},\ p_{fail} \in P_{fail}$.


Definition 2:

A place is said to be relevant if and only if it belongs to one of
the subsets $P_n$, $P_{deg}$, $P_{fail}$. A place is said to be
critical if and only if it lies between the degraded and the
faulty mode.

Introducing time constraints on transitions provides an estimate
of the execution time of the degraded mode, which is essential
for calculating the earliest future occurrence of a failure. The
number of executions of the degraded mode then makes it possible,
in step four, to evaluate the time of occurrence of the future
failure.


3.3.2 Prognosis based on extended PN 

Step 1: Model of the system based on extended T-temporal PN.

Modeling by extended T-temporal PN is based on Init/Exc, which
are associated with transitions, as shown in Figure 9. This
association ties the event to the clock in order to determine
whether the event occurred at the right time. Init(x) means that
the clock x is initialized to zero without assigning a time
constraint. Init(x,30) means that clock x is initialized to zero
with a time constraint of 30 TU (time units). Exc(x,30) verifies
whether clock x has exceeded the time constraint of 30 TU. Note
that a transition with an Exc is not necessarily preceded by a
transition with an Init. {b, Init(x)} means that the
initialization of the clock depends on the occurrence of event b.
For the function Exc, the association of the event with the time
constraint can be expressed in two ways: {b, Exc(x,30)} means that
event b must occur before or simultaneously with the expiration
of 30 TU, whereas {Exc(x,30), b} means that the event certainly
occurs after 30 TU have been exceeded.
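
The two label forms described in Step 1 can be illustrated with a small sketch. It is only an illustration under stated assumptions: the names Clock, fires_before_expiry and fires_after_expiry are hypothetical and do not come from the paper, and dates are plain numbers expressed in TU.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Clock:
    """A single clock x; started_at is None while the clock is uninitialized."""
    started_at: Optional[float] = None

    def init(self, now: float) -> None:
        # Init(x): reset the clock to zero at the current date.
        self.started_at = now

    def value(self, now: float) -> float:
        if self.started_at is None:
            raise ValueError("clock is not initialized")
        return now - self.started_at


def fires_before_expiry(clock: Clock, bound: float, event_time: float) -> bool:
    """Label {b, Exc(x, bound)}: event b occurs before or exactly at expiry."""
    return clock.value(event_time) <= bound


def fires_after_expiry(clock: Clock, bound: float, event_time: float) -> bool:
    """Label {Exc(x, bound), b}: event b occurs after the bound is exceeded."""
    return clock.value(event_time) > bound


x = Clock()
x.init(now=0.0)                         # Init(x) at date 0
print(fires_before_expiry(x, 30, 25))   # event b at 25 TU: within the bound
print(fires_after_expiry(x, 30, 42))    # event b at 42 TU: bound exceeded
```

Exactly one of the two checks holds for any event date, which matches the idea that a firing tells whether the event came on time or late.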

Step 2: Designation of the operating modes: nominal, degraded and
faulty.

The behavior of the system is divided into three operating modes,
as shown in Figure 10.

Step 3: Determination of relevant places: nominal, degraded,
critical and failed. This labeling of places is intended to
reinforce the proposed prognostic method (see Figure 11).
Following the identification of operating modes, the places of
the model are labeled as either relevant or non-relevant.

Step 4: Evaluation of the execution time of the degraded cycle.

Let us take the example of Figure 11. The firing of the
transition that marks the change from nominal to degraded mode
makes it possible to calculate the execution time of the degraded
mode ($\tau$), which is equal to the sum of the time constraints
associated with the transitions in this mode. From it, the
earliest time at which the failure event f can occur is
determined: $\tau = 10\,TU + 20\,TU = 30\,TU$.

Indeed, since the clock is not initialized when the corresponding
transition is fired, the time constraint associated with event d
is equal to $40\,TU - 30\,TU = 10\,TU$. However, the degraded mode
may run n times before changing to the failed mode. This is
handled with the concept of an execution-cycle counter.
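
The arithmetic of Step 4 can be written out as a short sketch. The function names are hypothetical; the numbers (a bound of 40 TU on a clock that already reads 30 TU, followed by a 20 TU constraint) are the ones quoted in the example.

```python
def remaining_constraint(absolute_bound: int, clock_value_at_entry: int) -> int:
    """Time left on a bound when the clock was not reset on entering the mode."""
    return absolute_bound - clock_value_at_entry


def degraded_cycle_time(constraints: list[int]) -> int:
    """Sum of the time constraints met along one pass through the degraded mode."""
    return sum(constraints)


# The clock already reads 30 TU when the degraded mode is entered, and event d
# carries a bound of 40 TU on the same clock, so 10 TU remain for d.
d_constraint = remaining_constraint(40, 30)
tau = degraded_cycle_time([d_constraint, 20])   # 10 TU + 20 TU
print(tau)  # 30
```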


3.3.3 Counter of execution cycle 

A system cannot remain in a degraded mode indefinitely. It is
therefore necessary to compute the number of executions of this
mode in advance. Indeed, if the execution of the degraded mode is
repeated more than n times, it is certain that the system will be
led to the failed mode. The problem is how to compute the number
of cycle executions. To do this, a place associated with the
degraded mode is added to the model. This counter place (see
Figure 12) contains a number of tokens equal to the number of
possible executions of the degraded cycle before the system is
led to the failed mode. It is assumed that this number is fixed by
expertise (equal to 4 in the example).

Figure 11. Relevant places on extended PN.

In this example, when the transition that completes a cycle of
the degraded mode fires, a token is consumed in the counter
place. At this point, 3 executions remain possible before the
failed mode is reached. Suppose that the partial failure is
repaired after the first execution cycle of the degraded mode:
the system is led back to the nominal mode. The next partial
failure leads the system to the degraded mode again. At this
stage, the number of tokens in the counter place is equal to 3,
while it should be 4, so the prediction of failure cannot be
accurate. To avoid this problem, reset arcs are used to maintain
the exact number of tokens (4 tokens) in the counter place.


3.3.4 Notion of reset arcs 

To introduce the notion of reset arc, consider the model of an
access-code verification system (Akshay et al. 2017), in which
the number of tokens represents the number of possible attempts.
After the transition EnterCode is fired, if the entered code is
correct (transition CorrectCode), the model resets the number of
tokens in the place Trials. The reset arc performs this
withdrawal. A PN with reset arcs (Comlan et al. 2015) is a
4-tuple $(P, T, W, AR)$ with $(P, T, W)$ a PN as defined in
Definition 1 and $AR: P \times T \to \{0, 1\}$ the reset-arc
function: $AR(p, t) = 0$ if there is a reset arc that connects p
to t, otherwise $AR(p, t) = 1$. In the example in Figure 13:


• P = {Trials, InputCode, SystemAccess, RetryCode};

• T = {EnterCode, CorrectCode, WrongCode};

• AR = {CorrectCode};

• $M_0$ = (3, 0, 0, 0).


The integration of the reset arc in the model is 
intended to initialize the counter for the execution 
cycle of the degraded mode once the fault has been 
repaired. 
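
The interplay between the counter place and the reset arc can be sketched as follows. This is a simplified illustration, not the authors' PN semantics: the class and method names are hypothetical, and the counter place is reduced to an integer token count.

```python
class DegradedCycleCounter:
    """Counter place holding the remaining executions of the degraded mode."""

    def __init__(self, allowed_cycles: int) -> None:
        self.allowed_cycles = allowed_cycles
        self.tokens = allowed_cycles

    def cycle_executed(self) -> bool:
        """Consume one token; return True when no execution remains."""
        if self.tokens == 0:
            raise RuntimeError("no token left: the failed mode has been reached")
        self.tokens -= 1
        return self.tokens == 0

    def repair(self) -> None:
        """Repair: the reset arc empties the place, which is then refilled."""
        self.tokens = 0                      # effect of the reset arc
        self.tokens = self.allowed_cycles    # re-initialization after repair


counter = DegradedCycleCounter(allowed_cycles=4)
counter.cycle_executed()   # one degraded cycle: 3 tokens remain
counter.repair()           # without the reset, the next episode would start at 3
print(counter.tokens)      # 4
```

Emptying the place before refilling it is what keeps the count exact whether the repair happens after one cycle or after the failure itself.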
Figure 15. Modeling on Extended colored PN.


In the model of Figure 14, the firing of the repair transition
represents the repair of the system, which can occur either
before the possible execution cycles of the degraded mode are
completed, or after the occurrence of a failure event. In both
cases, the reset arc empties the counter place. After the repair
transition fires, the counter place can be re-initialized with a
number of tokens equal to the permitted number of executions of
the degraded mode. Assuming that the extended PN is safe, a
possible failure event f can thus be predicted: the firing of the
first transition of the degraded mode gives visibility on the
evolution of the system and makes it possible to estimate the
earliest and latest time of occurrence of a failure event.


3.4 Multi-clock approach 


When the assumption that the PN is safe does not hold, the notion
of colored and timed PN (Soares 2017) (Jiacun 1998) is
introduced. The aim is to associate a clock with each token, in
order to track the evolution of each token in the system
separately. Thus, on the same transition, several temporal
constraints can be set, depending on the clock associated with
each token.

Figure 16. Watchdog on PN.


In the multi-token/multi-clock case, the temporal estimation of
the failure occurrence depends on the clock attached to each
token and on the time constraints associated with the transitions
(see Figure 15). Take the case of token r1: if it changes from
nominal to degraded mode following the firing of the
corresponding transition, the earliest estimate of the occurrence
of a failure event is $\tau_{min}(r1) = 30\,TU$, whereas for token
r2, $\tau_{min}(r2) = 50\,TU$. Taking into account the execution
cycles of the degraded mode, the latest estimate of the failure
occurrence is $\tau_{max}(r1) = 150\,TU$ for token r1 and
$\tau_{max}(r2) = 170\,TU$ for token r2. Thus, the
multi-token/multi-clock concept makes it possible to predict in
advance the earliest and latest occurrence of a failure, despite
parallel execution, when the PN is not safe.
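
The figures quoted for the two tokens are consistent with adding n = 4 cycles of a 30 TU degraded run to each token's earliest estimate. That relation is an inference from the numbers, not a formula stated in the text, and the function name below is hypothetical.

```python
def failure_window(earliest: int, cycle_time: int, max_cycles: int) -> tuple[int, int]:
    """Return (earliest, latest) failure estimates in TU for one token.

    Assumption: latest = earliest + max_cycles * cycle_time.
    """
    return earliest, earliest + max_cycles * cycle_time


# Per-token earliest estimates from the example; 30 TU per degraded cycle, n = 4.
for token, earliest in {"r1": 30, "r2": 50}.items():
    print(token, failure_window(earliest, cycle_time=30, max_cycles=4))
```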


4 FAULT PROGNOSIS BASED ON 
WATCHDOG TOOL 


4.1 Watchdog concept 


To overcome ambiguity in diagnosis, T-temporal PN can be used to
model alarms by integrating the watchdog mechanism. Based on
experience, and considering that the risks are known a priori,
the watchdog mechanism (Kovacs et al. 2007) (Kovacs et al. 2006)
(Jerbi et al. 2006) verifies that an action has occurred before a
given deadline, and signals the presence of an error in the
system if the deadline is exceeded. Figure 16 shows the watchdog
mechanism on a T-temporal PN. It checks the maximum duration of a
task, represented by the place A. The transition Start
corresponds to the beginning of the task. Its firing creates a
token in the place Observer, so the transition MaxDuration
becomes enabled and the countdown of its firing interval [20, 20]
starts.

The token in the place Observer is removed either when the task
is completed, or when the firing interval of the transition
MaxDuration elapses before the end of A: MaxDuration is then
fired and the place Error is marked. This marking means that the
execution time of task A is longer than 20 time units.
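
The watchdog behavior described above amounts to a race between the end of task A and the MaxDuration deadline. A minimal sketch follows, assuming dates are numbers in time units and that a task ending exactly at the deadline is not an error; the function name and the return strings are hypothetical.

```python
from typing import Optional


def watchdog(start: float, end: Optional[float], max_duration: float = 20.0) -> str:
    """Return "ok" if task A ends within max_duration of Start, else "error".

    end is None when the task never completes.
    """
    deadline = start + max_duration
    if end is not None and end <= deadline:
        return "ok"      # the task finished first: the Observer token is removed
    return "error"       # MaxDuration fired first: the place Error is marked


print(watchdog(start=0.0, end=12.0))   # task lasted 12 time units
print(watchdog(start=0.0, end=27.5))   # task lasted longer than 20 time units
```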
4.2 Application of watchdog on T-temporal PN


The model in Figure 9 incorporates the watchdog mechanism. To go
beyond the limits of the watchdog (Kovacs et al. 2007, Combacau
1991) and exploit it for failure prognosis, a new method is
proposed to interpret alarms. For example, to signal the change
from nominal to degraded mode, the exceeding of the time allowed
for the occurrence of event b is checked when the corresponding
watchdog transition is fired. The upstream place of this
transition is a waiting place; its downstream place is an
indicator of the evolution from nominal to degraded mode. This
makes it possible to predict the time of the future failure
occurrence: on the model in Figure 17, the temporal estimation of
the failure occurrence can be made as soon as this transition
fires. The latest estimate of the occurrence of a failure is
computed by summing the possible execution time of the degraded
mode (possible number of cycles multiplied by the run-time of the
degraded mode) and the temporal constraints related to two
transitions of the model. When the first of these transitions
fires, failure f is certain.


4.3 Prognostic method with watchdog


If the permitted number of execution cycles of the degraded mode
is n = 4 (see Figure 18), then the earliest estimate of the
occurrence of a failure is $\tau_{min} = 30\,TU$ and the latest
is $\tau_{max} = 120\,TU$. The time interval in which a failure
can occur after the firing of the transition that signals the
change to degraded mode is therefore [30 TU, 120 TU].


5 WATCHDOG VS EXTENDED PN 


With the two models presented in Sections 3 and 4, it is possible
to follow the evolution of the system. In the watchdog-based
model, the watchdog mechanism, previously limited to fault
detection, is extended to determine the switch from nominal to
degraded mode. It thus becomes possible to predict the future
occurrence of a failure event. The extended PN model reduces the
state space of the PN model compared with the watchdog model
while maintaining the same properties. Both methods give
visibility on the evolution of the system and provide an earliest
and a latest temporal estimate of the failure. The advantage of
the watchdog method over extended PN is that a modeling and
simulation tool exists for it, Little Parametric Tools (LPT),
proposed by Godary-Dejean (Godary-Dejean 2008). This advantage
does not rule out the prognostic method with extended PN, but
leads us to include in our perspectives the development of a
modeling and simulation tool adapted to the proposed extended PN.


6 CONCLUSIONS 


Existing prognostic methods rely on a stochastic or
non-stochastic approach to verify the future occurrence of a
failure event, modeled by a Petri net or an automaton. The
prognosis method in this paper is based on a temporal approach,
which allows the occurrence of a failure to be predicted in
advance through a temporal estimate of its earliest and latest
appearance. Two modeling methods were proposed: the first is a
model on extended Petri nets, integrating the concepts of
initialization and expiration of clocks; the second integrates
the watchdog technique and adapts it to prognosis.

The prognosis approach begins by distributing the states of the
system over the operating modes (nominal, degraded, failed), in
order to locate each place correctly. It is then easy to
determine which places are relevant and which are not. Following
this identification, and from the first date of entry into the
degraded mode, one can predict the earliest and latest time of
the failure occurrence.

After verification of the approach on both modeling methods, it
is concluded that the two methods give the same temporal
estimate. The only difference is that the state space of the PN
with the watchdog technique is larger than that of the extended
PN, which reduces system modeling complexity. That is why
extended PN modeling is retained. Future work will formalize
partial prognosability (prognosis is possible for some parts of
the system but not for others) on extended PN by verifying its
properties and constructing the local prognoser.
ABSTRACT: The paper presents a Probabilistic Risk Assessment (PRA) method for the security of
supply of a gas network. The method is based on a procedure for automatic generation of fault trees,
which estimate the probability of disruption of the gas delivery from terminals/storages to each consumer
node in the gas network. The method allows probabilistic analyses of the availability of the demand
nodes and of the overall availability of the gas network. To assess the importance of each network
component, the risk achievement worth and risk reduction worth importance measures are utilized. The aim of
the developed method is to assess potential weaknesses in the gas network as well as to serve as an
analysis tool during expansion planning and maintenance scheduling activities. The framework developed in
the paper leverages steady-state analysis of the gas network performed using a physical flow/pressure
model. The impact of a component failure on the gas supply interruption at different demand nodes is
assessed and contrasted with the PRA results. The framework is exemplified with reference to the reduced UK
gas network. The results provide insights to support a robust reliability assessment of the gas network.
Moreover, the probabilistic mapping of the most important components in the gas network provides the
means for assessing optimal strategies for maintenance scheduling as well as for prioritizing improvements of
the gas network aiming for effective risk reduction.


1 INTRODUCTION 


In the past decades, the consumption of gas in Europe has
increased significantly (Weisser, 2007). Natural gas is
considered essential for energy security in the European Union,
as it comprises a quarter of the primary energy supply to
electrical power generation, households, feedstock for industry
and fuel for transportation. Considering the decrease of domestic
gas production and thus the higher import dependence, the need to
address security of supply has increased (EU, 2010). As a
response, Regulation (EU) No. 994/2010 of the European Parliament
and of the Council of 20 October 2010 concerning measures to
safeguard security of gas supply and repealing Council Directive
2004/67/EC was released.

The International Energy Agency (IEA) shows that, in many IEA
member countries, the electricity generation sector is especially
dependent on natural gas, with a tendency to grow. In 14
countries, gas accounts for over 20% of the electrical energy
generation and for more than 30% in nine countries, while in five
countries more than half of the generation is dependent on gas
(Simpson and Min, 2011). A sense of comfort exists over the gas
market's capability to successfully adjust to demand or supply
shocks, due to low gas prices along with expectations for
continued well-supplied gas markets over the medium term. However,
the IEA global security review shows that the current comfort
about gas security may change suddenly as market conditions change
(IEA, 2016).

Researchers have invested significant effort in modeling gas
networks to provide accurate assessments of the network behavior,
which can be of significance for the security of supply. In
(Osiadacz, 1987), a thorough description of steady-state and
transient simulation models for gas network analysis is
presented. Security of supply during conflicts and crises in
Europe is investigated in (Carvalho et al., 2014), and resilient
response strategies to supply disruption events under relevant
scenarios are provided. Furthermore, (Antenucci and Sansavini,
2017) identifies the contingencies that can jeopardize the
security of coupled gas and power systems under increasing gas
demand scenarios.

A probabilistic model that studies the security of supply in a gas
network is presented in (Praks et al., 2015). The model utilizes
Monte Carlo simulations along with graph theory to perform
analyses on a real-size gas network, i.e. an unspecified part of
the EU gas network. Even though the model provides a comprehensive
simulation of the security of supply in a gas network, it lacks
the support of a physical model. A fault-tree-based analysis of
the security of supply in a gas network is shown in (Praks et al.,
2014). The paper argues that the application of a fault tree
method to a real-size gas network requires an algorithm for the
automatic generation of fault trees.

The limited choice of methods for the assessment of the gas
network security of supply is the main motivation behind this
research. The framework presented herein aims at assessing the
reliability of a gas network by gaining insights from the
application of both probabilistic and deterministic analyses. A
PRA method based on the fault tree technique is developed. The
method uses a procedure for the automatic generation of fault
trees for a selected gas demand node. The unwanted event, i.e. the
top event, is defined as “node not supplied with gas”. The risk
of having gas demand not supplied is a function of the gas network
architecture and of the failure probabilities of the components
of which the network is comprised. A global risk measure is
introduced to evaluate the overall gas network security of supply.
Different importance measures are employed to assess the
importance of each network component with respect to each
individual demand node, as well as to the entire gas network. A
physical analysis model is employed to assess the behavior of the
gas flows, and thus the gas properties in the system. Combining
the PRA model analyses with the physical model analyses results in
a complementary platform capable of performing robust risk
analyses of the security of supply of gas networks. Thus, the
obtained results, besides providing various risk measures, bring
together the complementary insights from two conceptually
different models.

The paper is structured as follows: Section 2 
describes the PRA methodology used to assess the 
gas network reliability; Section 3 presents the phys- 
ical model for simulating the gas network opera- 
tional conditions; Section 4 presents the performed 
analyses and the obtained results; Section 5 gives
concluding remarks.


2 PRA METHOD FOR GAS NETWORKS 
RELIABILITY ANALYSES 


PRA supported by fault tree and event tree analyses has been
extensively used for risk evaluations in various engineering
domains, e.g. nuclear power plant safety, aerospace design and
power system reliability. The fault tree is a deductive
analytical method, in which an undesired system state is specified
and the system is then analyzed in the context of its operation
and environment to find all relevant ways in which the undesired
event can occur.

In this paper, a method for probabilistic risk 
assessment of gas networks is presented. The 
method exploits a procedure for the automatic 
generation of fault trees that was originally 
proposed for power system reliability analyses 
(Volkanovski et al., 2009). Herein we adapt the 
method for its application to real-size gas net- 
works. The graphical representation of the proce- 
dure for the automatic generation of fault trees is 
shown in Figure 1. 

The procedure starts by defining the adjacency matrix (Figure 1)
of the corresponding network graph. The adjacency matrix of a
graph, also known as the connection matrix, is a square matrix
$A(v_i, v_j)$ such that the element $a_{ij}$ is equal to one when
there is an edge from vertex i to vertex j, and equal to zero
otherwise (Biggs, 1993). The adjacency matrix is used to
determine all possible flow paths that connect a demand node with
all the source nodes in the gas network. For example, if demand
node 1 is connected to the source node 4 through node 2 and node
3, two possible paths can lead from node 1 to the source node 4.
In the process of identifying a path from a demand node to a
source node, no single component can be used twice in the same
path. This constraint prevents the path from looping, i.e.
repeating the same set of components in the same connectivity
path. In addition, a path is only complete if it starts with a
demand node and ends with a source node.
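
The path search described above can be sketched as a depth-first enumeration of loop-free paths over the adjacency matrix. This is an illustrative sketch rather than the authors' procedure: the function name is hypothetical, nodes only are considered as components, and the edge set (1-2, 1-3, 2-3, 3-4, undirected) is an assumption chosen to match the two paths quoted for Figure 1.

```python
def simple_paths(adj, demand, sources):
    """Enumerate loop-free paths from a demand node to any source node.

    adj is a square 0/1 adjacency matrix; nodes are 0-based indices.
    """
    n = len(adj)
    found = []

    def extend(path):
        last = path[-1]
        if last in sources:
            found.append(list(path))   # a path is complete once it reaches a source
            return
        for nxt in range(n):
            # No component may appear twice in the same path.
            if adj[last][nxt] and nxt not in path:
                path.append(nxt)
                extend(path)
                path.pop()

    extend([demand])
    return found


# Assumed 4-node example (labels are 1-based in the text, 0-based here).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
adj = [[0] * 4 for _ in range(4)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1

paths = [[v + 1 for v in p] for p in simple_paths(adj, demand=0, sources={3})]
print(paths)  # [[1, 2, 3, 4], [1, 3, 4]]
```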

Figure 1. Graphical representation of the procedure for automatic
fault tree creation.

The two identified paths (Figure 1, Connectivity Paths: path one
[1 2 3 4] and path two [1 3 4]) through which the demand at node 1
can be served from the source node are the foundation for the
creation of the fault tree. It is a top-down procedure in which
the fault tree is created starting from the demand node and
continues by unfolding the identified paths. The demand at node 1
is not supplied with gas if node 1 fails or the gas delivery to
node 1 is interrupted; this is the first gate written by the
procedure for automatic generation of fault trees, and it is
represented by a Boolean OR. The interruption of gas to node 1
occurs if both the gas pipeline between node 1 and node 2, which
is on the first path, and the gas pipeline between node 1 and node
3, which is on the second path, do not deliver gas to node 1. This
is the second gate in the fault tree and it is represented by a
Boolean AND. The gas pipeline between node 1 and node 2 does not
deliver gas to node 1 if either the pipeline fails or no gas is
delivered to the pipeline from elsewhere, which is represented by
another OR gate. The process of unfolding the first path ends when
the source at node 4 is reached. The same operation is repeated
for the second path. The output of the algorithm for the automatic
generation of fault trees for node 1 is a string-based fault tree
given in the lower left part of Figure 1. The “*” denotes an AND
gate, while “+” denotes an OR gate. The first line “TE00001 +
S000001 NOD0001” denotes the occurrence of the top event “no gas
from node 1” (TE00001) due to either “no gas to node 1” (S000001)
or “node 1 fails” (NOD0001). The remaining lines follow the same
rationale. The graphical equivalent of the fault tree is presented
in the lower middle part of Figure 1. The meaning of each symbol
is given in the lower right part of Figure 1.
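
To connect the connectivity paths with the cut sets used in the quantification, the sketch below derives minimal cut sets by brute force: a cut set is a set of components that intersects every path. This is not the authors' fault tree generation procedure. It treats only nodes as components (pipelines are left out), the function name is hypothetical, and the exhaustive search is only practical for very small examples.

```python
from itertools import combinations


def minimal_cut_sets(paths):
    """Return the minimal sets of components whose failure breaks every path."""
    components = sorted({c for p in paths for c in p})
    path_sets = [set(p) for p in paths]
    cuts = []
    # Candidates are visited by increasing size, so any candidate that contains
    # an already accepted cut set cannot be minimal.
    for size in range(1, len(components) + 1):
        for combo in combinations(components, size):
            candidate = set(combo)
            if any(cut <= candidate for cut in cuts):
                continue
            if all(candidate & p for p in path_sets):
                cuts.append(candidate)
    return cuts


# The two connectivity paths quoted for demand node 1.
print(minimal_cut_sets([[1, 2, 3, 4], [1, 3, 4]]))  # [{1}, {3}, {4}]
```

Node 2 does not appear in any minimal cut set because the second path bypasses it.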


2.1 Fault tree quantification 


The method, when applied to a real-size gas network, results in
large fault trees comprising tens of thousands to a few hundred
thousand lines. Solving large fault trees efficiently is known to
be a challenging problem (Contini and Matuzas, 2011). In general,
fault trees can be converted into an equivalent set of Boolean
logical equations. The quantitative analysis of a fault tree is
performed with the Rare Event Approximation method (Roberts et
al., 1981):


$$Q_{TE} = \sum_{i=1}^{n} Q_{MCS_i}(BE_1, \ldots, BE_m) = \sum_{i=1}^{n} \prod_{j=1}^{m} Q_{BE_j} \qquad (1)$$


where $Q_{TE}$ is the top event probability of occurrence, n is
the number of minimal cut sets (MCS), i.e. the smallest sets of
basic events that induce the top event when occurring
simultaneously, $Q_{MCS_i}(BE_1, \ldots, BE_m)$ is the probability
of occurrence of the ith MCS, which is comprised of m basic
events, and $Q_{BE_j}$ is the probability of occurrence of the jth
basic event within the ith MCS.
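
The rare event approximation is a sum over minimal cut sets of the product of their basic event probabilities. The sketch below shows that computation; the function name, the event labels and the probability values are hypothetical and chosen only for illustration.

```python
from math import prod


def rare_event_top_probability(cut_sets, q):
    """Sum over minimal cut sets of the product of basic event probabilities.

    cut_sets: iterable of tuples of basic event names.
    q: mapping from basic event name to its probability of occurrence.
    """
    return sum(prod(q[be] for be in mcs) for mcs in cut_sets)


# Hypothetical example: one single-event cut set and one double-event cut set.
cut_sets = [("A",), ("B", "C")]
q = {"A": 1e-3, "B": 1e-2, "C": 2e-2}
print(rare_event_top_probability(cut_sets, q))  # about 1.2e-3
```

The approximation overestimates the exact top event probability slightly, and the error stays small as long as the basic event probabilities are small.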


2.2 Network risk measure 


In this paper, a unique fault tree for each demand node in the gas
network is created by employing the proposed procedure. The fault
trees are solved using the Rare Event Approximation method
(Equation 1), and the probability of each demand node not being
supplied with gas is obtained.

The risk of a gas demand not being supplied is a function of the
gas network's architecture and of the failure probabilities of
its components. A global risk measure is defined (Volkanovski et
al., 2009) to assess the overall gas network security of supply:


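$$U_{GS} = \sum_{i=1}^{LN} Q_{t_i} \frac{L_i}{L} \qquad (2)$$


where $U_{GS}$ is the gas system unavailability, $Q_{t_i}$ is
the probability of failure of gas supply to the $i$th
demand node, $LN$ is the number of nodes with
load demands, $L_i$ is the gas demand at the $i$th
node, and $L$ is the total gas demand in the network, so
that $L_i / L$ is the weighting factor of the $i$th node.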
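
The weighting in Equation 2 can be illustrated with a short Python sketch. It assumes that each node is weighted by its share of the total network demand, which is consistent with the weighting factors reported in Table 1. The three nodes and the total demand of 1.90E+07 m³/h are taken from Section 4.1; the function and variable names are illustrative only.

```python
def weighted_unavailability(nodes, total_demand):
    """Per-node contribution Q_ti * L_i / L to the system unavailability."""
    return {
        node: q_top * demand / total_demand
        for node, (demand, q_top) in nodes.items()
    }


# (average demand in m3/h, top event probability) for three nodes of Table 1.
nodes = {
    33: (1.43e6, 2.58e-4),
    37: (5.92e5, 4.41e-5),
    55: (2.05e5, 4.98e-4),
}
TOTAL_DEMAND = 1.90e7  # total average gas demand in the network, m3/h

contributions = weighted_unavailability(nodes, TOTAL_DEMAND)
# Partial sum only: the full Equation 2 runs over all demand nodes.
partial_sum = sum(contributions.values())

for node, value in contributions.items():
    print(f"node {node}: {value:.2e}")
```

Small differences with respect to the last digit of Table 1 are to be expected, because the table reports rounded demands and weighting factors.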


2.3 Importance measures 


The performance of a system (e.g. gas network, 
power system) depends on its components. Some 
components contribute more to the failure of 
the system than others. Therefore, the concept of 
importance plays a major role in the quantification 
of risk in engineered systems. In this paper two of 
the most frequently exploited importance meas- 
ures in PRA are utilized, i.e. the Risk Achievement 
Worth (RAW) and Risk Reduction Worth (RRW). 
The RAW estimates the value of risk increase if the 
failure probability of a basic event is equal to one 
(component out of service): 


$$RAW_j = \frac{Q_t(Q_{BE_j} = 1)}{Q_t} \qquad (3)$$


where $RAW_j$ is the risk achievement worth of
basic event $j$, and $Q_t(Q_{BE_j} = 1)$ is the top event
probability when the probability of occurrence
of basic event $j$, $Q_{BE_j}$, is equal to one. The RAW
determines the maximum increase of risk due to the
occurrence of basic event $j$, i.e. it identifies the
components that need to be efficiently maintained
so that the reliability of the system does not
decrease (Volkanovski et al., 2009).

The RRW estimates the value of risk decrease if 
the failure probability of the basic event is equal to 
zero (the component never fails): 


$$RRW_j = \frac{Q_t}{Q_t(Q_{BE_j} = 0)} \qquad (4)$$


where $RRW_j$ is the risk reduction worth of basic
event $j$, and $Q_t(Q_{BE_j} = 0)$ is the top event probability
when the probability of occurrence of basic event
$j$, $Q_{BE_j}$, is equal to zero. The RRW determines the
maximum reduction of risk due to perfect reliability
of the component associated with basic event $j$,
i.e. it identifies the redundancy level of the component
associated with basic event $j$ (Volkanovski
et al., 2009). In most cases, the basic event is
associated with component unavailability (van der
Borst and Schoonakker, 2001).
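
A minimal Python sketch of the two importance measures follows. It assumes that the top event probability is evaluated with the rare event approximation over a list of minimal cut sets; the basic events, their probabilities and the helper names are hypothetical.

```python
from math import prod


def top_event_probability(cut_sets, q):
    """Rare event approximation of the top event probability."""
    return sum(prod(q[event] for event in mcs) for mcs in cut_sets)


def raw(cut_sets, q, event):
    """Risk achievement worth: the basic event probability is set to one."""
    failed = {**q, event: 1.0}
    return top_event_probability(cut_sets, failed) / top_event_probability(cut_sets, q)


def rrw(cut_sets, q, event):
    """Risk reduction worth: the basic event probability is set to zero."""
    perfect = {**q, event: 0.0}
    reduced = top_event_probability(cut_sets, perfect)
    if reduced == 0.0:
        return float("inf")  # the event appears in every cut set
    return top_event_probability(cut_sets, q) / reduced


# Hypothetical data: a valve that is a single point of failure and two redundant pipes.
q = {"PIPE_A": 2.0e-3, "PIPE_B": 1.0e-3, "VALVE": 1.6e-3}
cut_sets = [("PIPE_A", "PIPE_B"), ("VALVE",)]

for event in q:
    print(event, raw(cut_sets, q, event), rrw(cut_sets, q, event))
```

In this toy case the valve dominates both measures, while each redundant pipe has an RRW close to one, which is the behaviour the definitions above are meant to capture.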


2.4 Network importance measures 


A fault tree is created for each node with gas demand
$L_i$, and the probability of failure of gas supply ($Q_{t_i}$)
to the $i$th node is calculated. The RAW and RRW
importance measures are calculated for each demand
node, resulting in a unique set of importance values
for each component (basic event). The importance
values of the same basic event with respect to
different nodes may differ significantly.
Hence, a component may be very important for
some demand nodes and irrelevant for others.
In order to estimate the importance of each
component on a global level, i.e. for the overall gas
network, network importance measures are
introduced, namely the network risk achievement worth
(NRAW) and the network risk reduction worth
(NRRW), as defined in Volkanovski et al. (2009).


3 PHYSICAL MODEL SIMULATION 
OF GAS NETWORKS 


For a reliable representation of the gas system, 
several components need to be modelled, includ- 
ing pipelines and non-pipe elements, such as com- 
pressors, terminals and storages. Gas flow within a 
pipeline is represented with a steady state model. 
Pressure-drops in the network are modelled via the 
Panhandle “A” equation (Osiadacz, 1987): 


$$p_1^2 - p_2^2 = K \, \frac{L}{E^2 D^{4.854}} \, Q^{1.854} \qquad (5)$$


where $p_1$ and $p_2$ represent the pressure at the
beginning and at the end of a pipeline, $L$ is the pipeline
length, $Q$ is the volume flow rate through the
pipeline, $E$ is the efficiency factor, $D$ is the
pipeline diameter and $K$ is a numerical constant that
depends on the units used. The Newton loop-node method is
employed for solving the system of equations which
define the entire network (Osiadacz, 1987). This
iterative method is based on Kirchhoff’s second
law, which states that the sum of the pressure-drops
around any closed loop in the network is zero.
Compressor stations are modelled as fictitious
branches, and a constant pressure ratio between
the pressures at the sending and receiving node
of each compressor is considered. Injections from
terminals and storage facilities into the network are
proportional to their delivery capacity. Gas flow is
allowed in both directions in all modelled network
components.
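
The pressure-drop relation for a single pipeline can be sketched in Python as follows. The sketch assumes the commonly quoted form of the Panhandle “A” equation, in which the difference of the squared pressures is proportional to L·Q^1.854 / (E²·D^4.854). The proportionality constant k depends on the unit system and is left as a parameter; the function name and the numerical values are placeholders, not data from the UK network model.

```python
from math import sqrt


def outlet_pressure(p1, length, flow, efficiency, diameter, k):
    """Downstream pressure from p1^2 - p2^2 = k * L * Q^1.854 / (E^2 * D^4.854)."""
    drop = k * length * flow ** 1.854 / (efficiency ** 2 * diameter ** 4.854)
    p2_squared = p1 ** 2 - drop
    if p2_squared < 0.0:
        raise ValueError("flow too large for the given inlet pressure")
    return sqrt(p2_squared)


# Placeholder values in arbitrary consistent units (k is a hypothetical constant).
p2_low_flow = outlet_pressure(p1=2.0, length=1.0, flow=1.0,
                              efficiency=1.0, diameter=1.0, k=1.0)
p2_high_flow = outlet_pressure(p1=2.0, length=1.0, flow=1.5,
                               efficiency=1.0, diameter=1.0, k=1.0)
print(p2_low_flow, p2_high_flow)
```

A network solver would apply this relation to every branch and iterate on the loop and node equations; the sketch covers only the single-pipe building block.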


3.1 Physical importance measure: Total gas 
curtailment 


The importance of a component is evaluated by the
change that its loss induces in the operation of the gas
network. When a component is removed, the way
gas flows through the network varies and pressures
change accordingly. If pressures fall below the
minimum or exceed the maximum safety pressure,
actions are implemented in order to restore
normal operating conditions. In particular, in
case of a minimum pressure violation, gas
curtailments are enforced at the location of the violation.
The Total Gas Curtailment (TGC) that follows
the basic event in order to bring the system back
within safety margins is taken as the importance
measure. The chosen metric addresses the impact
of a single component removal on the entire
network, and it is formally expressed as:


$$TGC_j = \sum_{n=1}^{N} GC_n(Q_{BE_j} = 1) \qquad (6)$$


where $TGC_j$ is the importance measure for basic
event $j$, $N$ is the number of nodes in the system and
$GC_n$ is the gas curtailment at node $n$.


4 ANALYSES AND RESULTS 


The framework developed in this paper is applied
to the UK gas network. The simplified transmission
grid consists of 61 pipelines and 21 compressor
stations that work with a constant pressure
ratio. Safe operation is bounded to the pressure
range of 38 to 85 bar. The gas system comprises 9
terminals and 9 storage facilities. The UK gas network
topological data are adopted from Qadrdan
et al. (2010), including the maximum supply capacities
of the different terminals and the characteristics
of the storage facilities.


Table 1. Top event probabilities for all demand nodes in the UK gas network.

Node  Demand (m³/h)  Failure probability  Weighting factor  Weighted failure probability
3     2.85E+05       7.18E-05             1.50E-02          1.08E-06
5     7.63E+05       7.72E-06             4.02E-02          3.11E-07
7     2.68E+05       4.36E-06             1.41E-02          6.16E-08
9     7.19E+05       3.74E-08             3.79E-02          1.42E-09
11    2.44E+04       9.11E-06             1.29E-03          1.17E-08
12    2.37E+05       1.25E-07             1.25E-02          1.56E-09
14    1.49E+06       2.67E-07             7.86E-02          2.09E-08
15    5.10E+05       1.20E-10             2.69E-02          3.23E-12
17    1.88E+05       3.90E-11             9.94E-03          3.87E-13
18    1.27E+06       4.72E-06             6.72E-02          3.17E-07
22    2.04E+05       4.92E-09             1.07E-02          5.29E-11
24    2.29E+04       4.46E-06             1.21E-03          5.38E-09
26    3.80E+05       1.56E-06             2.00E-02          3.13E-08
28    9.45E+04       2.02E-09             4.98E-03          1.01E-11
30    9.28E+05       8.88E-08             4.89E-02          4.34E-09
31    2.08E+05       6.27E-05             1.10E-02          6.89E-07
32    5.97E+05       1.89E-06             3.15E-02          5.96E-08
33    1.43E+06       2.58E-04             7.52E-02          1.94E-05
34    1.45E+06       3.57E-08             7.62E-02          2.72E-09
36    3.43E+05       1.06E-05             1.81E-02          1.92E-07
37    5.92E+05       4.41E-05             3.12E-02          1.38E-06
38    1.88E+05       4.47E-06             9.92E-03          4.43E-08
39    6.30E+05       2.64E-05             3.32E-02          8.76E-07
40    2.62E+05       4.99E-06             1.38E-02          6.89E-08
41    1.77E+06       2.22E-06             9.32E-02          2.07E-07
44    1.25E+06       3.71E-08             6.60E-02          2.45E-09
46    6.04E+05       1.81E-08             3.18E-02          5.76E-10
48    6.59E+05       1.24E-05             3.48E-02          4.30E-07
52    1.39E+05       7.30E-06             7.32E-03          5.34E-08
53    9.41E+05       6.24E-06             4.96E-02          3.10E-07
54    3.14E+05       1.95E-06             1.66E-02          3.23E-08
55    2.05E+05       4.98E-04             1.08E-02          5.38E-06

The failure probability data are adopted from
Praks et al. (2015). Considering a one-year inspection
time, the failure probability (complete rupture)
of a gas pipeline in the European gas transmission
system is 3.5E-5 per kilometer, the failure probability
of a compressor is 2.5E-1, the failure probability
of a gas terminal is 1.5E-1, and the failure
probability of a gas storage is 1E-1. The failure of a
compressor does not necessarily interrupt the gas
flow through the compressor station, since the gas
can flow through the compressor bypass. The failure
probability of the bypass is based on the failure
probability of a disconnection valve, i.e. “valve fails
to open”. Due to the lack of specific data for valves
used in the bypasses at compressor stations, a value
of 1.6E-3 is taken from IAEA (1988).


4.1 Probabilistic risk analyses of the UK 
gas network 


The fault tree method is applied to each load
demand node in the UK gas network. The failure
probability of gas supply at each node is calculated,
and the results are presented in
Table 1. In all of the fault tree analyses, minimal
cut sets are truncated at a maximum of 10 basic
events and at a minimum cut set failure probability
of 1E-15.

The first column of Table 1 lists the
nodes with gas demands, while the second column
gives the average gas demand per node in
m³/h. Node 41 is the node with the highest gas
demand, with an average requirement of 1.77E+06
m³/h. The total average gas demand in the network
is 1.90E+07 m³/h, while the total gas supply
capacity is 3.78E+07 m³/h, including the gas
storages, which can provide gas for only a limited
period of time. The third column gives the
top event probability of each demand node. The
fourth column gives the weighting factor of
each demand node. The fifth column gives the
weighted failure probability of gas supply at each
node, calculated as the product of the top event
probability and the weighting factor of the
respective node. The total gas network failure probability
is 3.00E-5 and is calculated with Equation 2 based
on the individual demand node failure probabilities.
Table 1 shows that, in general, the nodes with the
highest gas demands have the largest weighted
probability of failure.

The network importance measures, NRAW and
NRRW, for the UK gas network are calculated and
given in Figure 2 a) and b), respectively.
Furthermore, the RAW and RRW results from the fault
tree analysis performed for the largest load demand
node in the network, node 41, are given in Figure 3.

Figure 2. Representation of the most important components in the network according to: a) NRAW and b) NRRW.

Figure 2 shows the most important components
in the gas network, according to the network
importance measures presented in Section 2.4. The color
defines the importance of each of the components
shown in the figure, i.e. brighter colors are
associated with components with higher importance values.
According to the NRAW, the most important gas
pipelines to keep operational in the system are the
pipeline between node 31 and node 32, with an NRAW of
2726.8, and the pipeline between node 1 and node 3,
with an NRAW of 397.7. The pipeline between node
31 and node 32 provides the connection between
the network and the gas terminal at node 59, IOG,
which is located in the South of the island and
is the only terminal in the southern region besides
the storages at node 68 and node 69 and the Bacton
terminal in the South-East. The pipeline between
node 1 and node 3 connects the network to the
terminal at node 56, St. Fergus, which is the second
largest gas source in the system and is located in
the far North of the island. Furthermore, among
the most important components in the gas network
are the compressor bypasses between node 31 and
node 32, with an NRAW of 786.9, and between nodes 2
and 3, with an NRAW of 242.4. Both bypasses are part
of the compressor stations that provide connectivity
to the terminals at node 59 and node 56,
respectively. From the presented results, it can be deduced
that performing maintenance simultaneously on
two or more of the above components may significantly
decrease the security of supply to the consumers
in the gas network. For example, the simultaneous
unavailability of the gas pipeline between node 31
and node 32 and the compressor station between
node 2 and node 3 due to maintenance activities
may have a significant impact on the gas network
reliability. Based on the conducted analyses, it is
possible to prioritize the maintenance
activities in the gas network so as to maximize the
security of supply.

On the other hand, Figure 2 b) shows the
components with the highest risk reduction
importance, i.e. the redundancy level of each of the
components in the gas network. According to
NRRW, the most important components in the
gas network are the gas terminal at node 61 and
the pipeline connecting this node with node 41
and thus to the rest of the network, as well as the gas
storage at node 66 and the pipeline connecting this
node with node 15 and thus to the rest of the
network, all with an NRRW value of 243.7.
Decreasing the failure probability of these components
(e.g. by installing more reliable components) will
result in increased security of supply. In other
words, the obtained NRRW results can help in
prioritizing future improvements on the gas network,
leading to the largest risk reduction.

Figure 3. Node 41 component importance according to: a) RAW and b) RRW.

Figure 3 a) and b) show the RAW and RRW
values, respectively, for all gas network components
when node 41 not being supplied with gas is considered
as the top event. The pipeline connecting node 41 to node
42 and the pipeline connecting node 42 to node 44
have the highest RAW values, 214.6 and 214.5
respectively, making them the most important
components with respect to the gas supply to node 41. On
the other hand, the storage at node 65 is by far the
most important component according to the RRW
importance measure, with a value of 953.4; the
second most important component is the pipeline
connecting node 13 to node 40, with an RRW of 3.8.


4.2 Steady-state simulations of the UK gas 
network based on the PRA results 


The physical model simulates the gas network 
behavior, and pressures and mass flows in the 
pipelines are computed. A steady-state analysis is 
performed for the loss of each network component 
and the consequent total gas curtailment is calcu- 
lated (Table 2). Gas curtailments are specified for 
each gas demand node and for the entire network. 
Table 2. The TGC effect of the 20 most important gas network components obtained using NRAW.

Component  Probability of failure  NRAW    TGC (m³/h)  Risk
L 31-33    1.30E-03                2726.8  0           0
B 31-32    1.60E-03                786.9   0           0
L 01-03    2.45E-03                397.7   0           0
B 02-03    1.60E-03                242.4   0           0
L 39-54    2.10E-03                160.9   6.18E+04    1.30E+02
B 01-55    2.50E-01                105.1   0           0
L 34-36    1.51E-03                103.0   4.12E+06    6.20E+03
L 37-39    3.40E-03                92.2    7.17E+05    2.43E+03
L 35-38    5.08E-03                83.9    1.38E+06    7.03E+03
L 18-21    1.79E-03                49.9    4.53E+05    8.09E+02
L 47-49    2.10E-03                45.3    1.13E+05    2.38E+02
L 45-53    2.28E-03                43.2    1.59E+06    3.62E+03
L 33-59    8.05E-04                39.5    3.18E+05    2.56E+02
L 15-19    2.24E-03                36.1    5.71E+05    1.28E+03
TER-59     1.50E-01                34.8    3.18E+05    4.77E+04
L 47-54    2.45E-03                33.9    0           0
L 50-52    1.58E-03                30.1    0           0
L 48-50    1.54E-03                30.1    0           0
L 40-41    1.23E-03                29.5    1.41E+06    1.73E+03
B 04-05    1.60E-03                28.6    0           0


The first column of Table 2 gives the
component/basic event name: the letters
denote the component type (L stands for pipeline,
B for compressor bypass, C for compressor,
TER for gas terminal and TES for gas storage),
while the digits denote the nodes to which the
respective component is connected. The second
column gives the failure probabilities of the 20 most
important elements according to NRAW, and the
third column gives the NRAW values of these components.
The fourth column gives the calculated TGC for
the respective components. The fifth column gives
the product of the component failure probability and
its TGC impact on the gas network. In the physical
model, a compressor station failure is represented
only by a compressor failure that does not interrupt
the flow of gas through the respective branch of the
gas network; instead, the compressor station pressure
ratio is set equal to 1, i.e. the gas pressure before
and after the compressor station remains the same.
Therefore, the zero TGC reported for compressor bypass
failures holds by default, since no such failure is
simulated. The results show that TER-59 (i.e. the IOG
terminal connected at node 59) is the component with
the highest risk impact on the gas network. Furthermore,
a high risk impact is expected from the loss of the
pipeline between node 35 and node 38 (L 35-38),
which is one of the main links providing connectivity
between the South and the South-West part
of the gas network, and from the loss of the pipeline
between node 45 and node 53 (L 45-53), which is one of
the main links providing connectivity between
the South-West and the central part of the gas
network. Remarkably, the loss of some elements,
such as the pipeline between node 31 and node 33
(L 31-33) or the pipeline between node 1 and
node 3 (L 01-03), induces no pressure violations
in the network, despite their large NRAW values.
Therefore, irrespective of the relevance of these
components from a probabilistic and network
connectivity perspective, gas network operations
efficiently manage the re-routing of gas via different
terminals and storages to sustain gas supply.
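
The risk column of Table 2 can be reproduced with a few lines of Python. The sketch below takes the failure probability and the total gas curtailment of five components from Table 2, multiplies them and ranks the components by the resulting risk; the variable names are illustrative, and the remaining rows of the table would be handled in the same way.

```python
# (failure probability, total gas curtailment in m3/h) for selected rows of Table 2.
components = {
    "L 31-33": (1.30e-3, 0.0),
    "L 34-36": (1.51e-3, 4.12e6),
    "L 35-38": (5.08e-3, 1.38e6),
    "L 45-53": (2.28e-3, 1.59e6),
    "TER-59": (1.50e-1, 3.18e5),
}

# Risk is the product of the component failure probability and its curtailment.
risk = {name: p * curtailment for name, (p, curtailment) in components.items()}

# Rank from the highest to the lowest risk impact.
ranked = sorted(risk.items(), key=lambda item: item[1], reverse=True)
for name, value in ranked:
    print(f"{name}: {value:.2e}")
```

The ranking differs from the NRAW ordering: the terminal comes first because of its comparatively high failure probability, while L 31-33 drops to the bottom because its loss causes no curtailment.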


5 CONCLUSIONS 


A framework for gas network security of supply 
analyses is presented. The framework is based on a 
PRA method which employs a procedure for auto- 
matic generation of fault trees related to a specific 
top event, i.e. gas demand node not supplied. The 
risk achievement worth and risk reduction worth 
importance measures are utilized to estimate the 
importance of the network components for the 
security of supply of the individual demand nodes 
and the overall gas network. Furthermore, a physi- 
cal model for the simulation of the gas network 
behavior is developed. The model is capable of 
computing gas pressures throughout the network 
and of enforcing curtailments if single or multiple 
contingencies occur. The physical model is utilized 
to supplement the PRA method by providing accurate
estimates of the gas network conditions.

The PRA results show the most important
components, based on risk increase and risk reduction
measures, for the security of supply to each
individual node and to the entire network. The NRAW
importance measure provides results that
can be used to prioritize the maintenance activities
in the overall gas network, while the RAW
can be used to schedule the maintenance activities
with respect to each individual node. The NRRW
importance measure provides results that
can be used to prioritize future improvements in
the overall gas network, thus increasing the security
of supply, while the RRW importance measure can
be used to schedule potential improvements that
will increase the security of supply of each
individual gas demand node. Furthermore, the physical
model results show that even though some
components are identified as being among the most
important according to the PRA importance measures,
their failure may not have a significant effect on the
gas supply, because the gas network is able to
re-route gas efficiently via different terminals and
storages to sustain gas supply.
ABSTRACT: Liquefied Natural Gas (LNG) as ship fuel is considered a viable solution to marine
environmental issues, due to the significant reduction in emissions with respect to conventional fuel oil. At the
same time, safety issues may arise in port areas when LNG is used as a fuel, due to its high flammability.
The present work focuses on the safety assessment of LNG carriers approaching a bunkering terminal
through port channels located in an industrial area. A risk matrix approach is adopted to evaluate the
risk level associated with the carrier approaching the harbour, considering the vulnerability of the
surrounding territory and potential interactions with industrial facilities located in the area. The methodology is
applied to a case study of industrial interest, showing the potential of the tool in supporting risk-based
decision making.


1 INTRODUCTION 


Liquefied Natural Gas (LNG) as ship fuel is consid- 
ered as a viable solution to marine environmental 
issues, due to the significant reduction in emis- 
sions with respect to conventional fuel oil (Bittante 
et al. 2017). Therefore, due to the potential benefits 
related to this technology, several projects have 
been proposed for the realisation of LNG bunker- 
ing terminals in harbour areas, contributing to the 
development of LNG infrastructure network. 
However, safety issues may arise in port areas 
when LNG is used as a fuel due to the high flam- 
mability of this substance, with the potential of 
severe fires and explosion scenarios (Jeong et al. 
2017). Thus, in the stages of early development 
and selection of LNG bunkering and ship sup- 
ply technologies in port areas, safety aspects will 
become crucial to develop sustainable and reliable 
technologies involving LNG as marine fuel. 
Several studies have been reported in the literature
concerning the safety of the LNG distribution chain,
addressing the analysis of LNG regasification
terminals, bunkering stations and ship fuel systems. Yun
et al. (2009) proposed a risk assessment
methodology for LNG terminals by incorporating
Bayesian and LOPA (Layers of Protection Analysis)
approaches. An inherent safety based approach
was proposed in (Tugnoli et al. 2012) to assess
the safety aspects of alternative LNG regasification
technologies.

Concerning the analysis of LNG for marine fuel 
application, several studies were presented (ABS 
2014, ADN Administrative Committee 2014). Lee 
et al. (2015) compared the fire risk assessments of 
two types of LNG fuel gas supply systems. DNV 
(2012) conducted a site-specific quantitative risk 
assessment of LNG bunkering in an effort to 
determine a safe distance for passing ships at the 
Port of Rotterdam. 

However, a structured methodology to perform 
feasibility studies for access of LNG supply ships 
in harbour areas close to sensitive urban areas and 
industrial parks is still lacking in the literature. 

The present work shows a risk-based approach to 
support the feasibility study of LNG ships access to 
harbour areas. The approach integrates geometrical 
and ship size considerations, legislative framework 
and safety aspects, and is applied to a reference case 
study, located in the Port of Venice (Italy). 

The paper is structured as follows: Section 2
describes the case study that is the object of the
feasibility study; Section 3 presents an overview of the
methodology; Section 4 provides details on the
safety and risk assessment; Section 5 shows the
results of the analysis, which are
discussed in Section 6. Conclusions and
recommendations are given in Section 7.
2 DESCRIPTION OF THE REFERENCE
CASE STUDY


The case study is located in the Port of Venice, which
is one of the most important industrial sites in Italy
(Zonta et al. 2007). The port is strongly intercon- 
nected with the Marghera industrial area, known 
as “Porto Marghera”. Figure 1 shows the overview 
of Porto Marghera area. The area is accessible 
through the Malamocco channel (see paths 3 and 
4 in Fig. 1), which connects Porto Marghera with 
an artificial channel leading to the Adriatic Sea 
(paths 1 and 2 in Fig. 1). 

Porto Marghera is the site selected for a new bun- 
kering terminal for LNG storage and distribution. 
The bunkering station will be located in the South- 
ern part of the industrial channel (path 5 in Fig. 1); 
in the following the channel is labelled as “SIC”. 


2.1 Description of the SIC 


SIC is accessible through the basin located in the 
Malamocco-Marghera channel; a detailed view of 
SIC is shown in Figure 2. The channel length is about
4 km and the draft ranges between 6 and 10.1 m.

The LNG bunkering terminal will be served by 
LNG carriers accessing from the Northern Adriatic 
Sea through the path shown in Figure 1 towards the 
terminal located in the SIC. This ship traffic may 
increase the risk level of the surrounding area due 
to possible accidents involving the spill and conse- 
quent ignition of LNG from the carriers. 
Figure 1. LNG carrier planned route through Porto
Marghera: 1) roadstead in the Northern Adriatic Sea;
2) harbour inlet and artificial canal; 3) Malamocco-Marghera
channel (before the industrial area); 4) Malamocco-Marghera
channel (inside the industrial area);
5) SIC (Southern part of the industrial channel).


Figure 2. Detail view of SIC. The squared symbols
indicate the location of the industrial berths, and the
round symbols indicate the location of civil/commercial
installations.


The critical areas along the LNG ship route are
summarized in Figure 2. They consist of
berths (either industrial or civil/commercial),
industrial facilities, and areas dedicated to civil
activities or other services.

As for the berths, along the route
there are 4 commercial berths located near the
Fusina area (labelled with C in Fig. 2) and 14
industrial berths (labelled with U in Fig. 2), 7 of which
serve O&G facilities. The industrial activities are
mostly chemical and petrochemical plants, together with
some smaller construction companies. There are also
some civil areas located close to the
Malamocco-Marghera channel.

The civil installations of interest for this study
are two camping sites, namely Fusina and Darsena
Fusina, and the Venice Ro Port for passenger
transport (P01, P02, and P03 in Fig. 2, respectively),
as well as smaller civil areas (beaches, small dockyards,
etc.) located in the artificial canal connected to the
Northern Adriatic Sea (path 2 in Fig. 1).

The new bunkering terminal location is foreseen
at the end of the SIC, before the final enlargement
of the SIC (e.g., between U4 and U5 in Fig. 2).


3 METHODOLOGY 


3.1 Overview 


The aim of the present work is to provide a meth- 
odology to perform feasibility studies for the access 
of LNG carriers to harbour areas. As shown in 
Figure 3, the methodology is based on two main 
parts. 

The first part focuses on the nautical accessibility
of the gas carriers with respect to both
geometrical issues and legislative aspects. This
part is aimed at providing accessibility evaluations,
also considering possible mitigation or compensatory
measures to reduce any suboptimal
interactions of the LNG carriers with the current
port configuration. In this way, potential events
leading to collision or other manoeuvre upsets
are identified and analysed in order to feed the
safety assessment carried out in the second part
of the methodology (see Fig. 3).

[Flowchart: Nautical accessibility study (analysis of: 1. LNG worldwide fleet; 2. legislative framework; 3. geometry and physical envelope; 4. traffic data) and Safety assessment, leading to the Accessibility assessment of LNG carriers.]

Figure 3. Overview of the methodology.

The safety assessment is aimed at investigat- 
ing possible interaction between the LNG carriers 
and the neighbouring areas, either civil/commercial 
or industrial. Potential hazardous events affecting 
humans or process units in the surrounding industrial 
areas are considered. A risk register is compiled to 
determine the most critical scenarios and the associ- 
ated vulnerability of the surrounding territory and 
industrial areas. Mitigation actions for critical scenarios
are proposed to reduce the risk level where necessary.


3.2 Nautical accessibility study 


The nautical accessibility of LNG carriers is evalu- 
ated considering both geometrical aspects and the 
legislative framework (see Fig. 3). 

Firstly, the type of gas tanker selected for the
bunkering activities is analysed considering the
worldwide LNG fleet. Secondly, regulations and
ordinances on entry to and exit from the port channel
are analysed to point out the case-specific legislative
issues. Finally, the ordinary port traffic and the
average residence time of each ship at the quay are
estimated from historical data, in order to evaluate
the geometrical compatibility with the gas carriers.


3.2.1 LNG carrier selection for the case study 
The LNG carriers of interest in
the present application feature a nominal capacity
ranging from about one hundred cubic meters up to
about 40,000 m³. The current worldwide fleet of
ships presenting these features consists of about
50 ships, considering those in activity and those in
delivery by 2017 (Lloyd’s Register Marine 2015).

In the present analysis, two reference carriers are
considered (namely, ship A and B), whose features
are summarized in Table 1.


3.2.2 Legislative framework and port data 
analysis 

There are no specific regulations for LNG carrier
access to the port issued by the North Adriatic
Sea Port Authority. However, there are three
ordinances from the Venice Harbour Masters of the
Italian Coast Guard concerning gas carriers sailing
into the Venice Port areas (Ordinance 2009,
2010, 2016). The requirements specify the minimum
manoeuvre speed, the draft in the area (between 7.95 m
and 10.4 m), limitations on the navigation of
dangerous cargo (e.g. interdiction in case of fog),
mandatory tug service depending on gross tonnage and
envelopment required, and the maximum width of the
convoy (never exceeding 1/3 of the minimum width
of the channels to be covered).

Given the legislative framework, port data
concerning ship fluxes are analysed in order to
evaluate the maximum time allowable for the LNG
carrier transit and operations, according to the
physical free space left by standard historical
traffic. The most critical berths are identified as
those located at the narrowest sections of the
SIC. The three potentially critical
areas and the corresponding channel breadths
are reported in Table 2.

Data about critical berths are part of the physical
compatibility assessment. In fact, the latter is
the result of matching the data collected for the berths
listed above with the geometry of the LNG carriers
defined in Section 3.2.1.
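
A first-order version of this compatibility check can be sketched in Python. The sketch applies the one-third rule on convoy width quoted above to the channel widths of Table 2 and the ship breadths of Table 1. It assumes, for illustration only, that the convoy width equals the ship breadth; tugs and the space taken by ships moored at the berths are not modelled, and the function name is hypothetical.

```python
def max_convoy_width(channel_width_m):
    """Largest convoy width allowed by the one-third rule of the ordinances."""
    return channel_width_m / 3.0


# Channel width (m) at the critical berths, from Table 2.
channel_widths = {
    "U09, U10, U11": 120.0,
    "U08, U12": 140.0,
    "U05, U06, U07, U13": 160.0,
}

# Ship breadth (m), from Table 1.
ship_breadths = {"Ship A": 29.5, "Ship B": 34.0}

# The narrowest section governs the whole transit.
limit = max_convoy_width(min(channel_widths.values()))
compatible = {ship: breadth <= limit for ship, breadth in ship_breadths.items()}

for ship, ok in compatible.items():
    print(ship, "within limit" if ok else "exceeds limit")
```

Under this simplification both reference carriers fall within the limit at the narrowest section; a full assessment would subtract the breadth of berthed ships and add the tug envelope before applying the rule.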

The geometrical analysis of critical points along
the SIC supports the risk assessment by evaluating
the likelihood of accidental scenarios not related
to process failures, namely ship
collision scenarios. The risk linked to accidental
scenarios following conventional upsets/failures in
the LNG carrier is the object of the safety assessment
summarized in Section 3.3.


Table 1. Characterization of the LNG carriers for the
case study.

Item       Units  Ship A            Ship B
Reference  —      Wartsila (WSD50)  ENI fleet (IGU 2016)
Capacity   m³     30,000            65,000
Length     m      170               216
Breadth    m      29.5              34
Draught    m      8                 9.5


Table 2. SIC width at the critical berths;
see Figure 2 for berth location.

Berth ID            Corresponding channel width (m)
U09, U10, U11       120
U08, U12            140
U05, U06, U07, U13  160


3.3 Overview of the safety and risk assessment 


The second part of the method focuses on safety
aspects related to the ships approaching a bunkering
station in a port area (Fig. 3). The assessment of
possible interactions between the LNG carrier and
the neighbouring areas, both civil and industrial,
follows a risk-based approach adopting risk matrix
analysis. The first step is the evaluation of the
critical areas in standard LNG carriers, in order to perform
hazard identification and to determine potential
release events. Secondly, the likelihood of the
scenarios following an accidental leak (expressed as
annual probability) is evaluated through standard
failure frequency databases and event tree
analysis (ETA). Then, the consequence evaluation is
carried out through the use of standard literature
models (Mannan 2005), implemented in the
software DNV PHAST 7.1, and following a
threshold-based approach (see Section 4.4.3). Finally, the
credibility of the accidental scenarios and their
consequences are combined through a risk matrix
and the results are summarized in a risk register
(see Section 5.2).


4 RISK-BASED ANALYSIS OF LNG 
CARRIERS APPROACHING HARBOUR 
AREAS 


4.1 Evaluation of critical areas on the LNG 
carrier 


The safety assessment is based on the identification of critical areas on the LNG carrier in order to determine the potential accidents. Collision with other ships moving in the channel is excluded from the present analysis, according to the results of the likelihood assessment and geometrical considerations (see Section 5.1). Thus, all the accidental release events are associated with process units exposed to the external environment, such as equipment and pipelines on the open deck. This is because the structural failure of the main LNG storage vessels due to process failures is not considered a credible event (Uijt de Haag & Ale 1999). 

The reference LNG carriers shown in Sec- 
tion 3.2.1 and considered for the analysis, despite 
featuring relevant differences in the total inventory, 
share the same types of process equipment on deck 
with similar geometries. Thus, since the structural 
failure of the storage vessels is excluded, the potential 
release events are the same for both types of carriers. 

During navigation in port areas, there are three 
critical zones onboard located on deck, from which 
an accidental release may develop. The possible 
leak sources are summarized in Table 3. 


4.2 Identification of release scenarios 


The Purple Book (Uijt de Haag & Ale 1999) guidelines for quantitative risk assessment of inland waterway transport are adopted to identify the release scenarios and to estimate the related frequencies. The structural failure of one or more tanks is not considered credible, while possible damage to the connections is taken into account. For the “gas tanker” ship category, the Purple Book considers two release diameters (3” = 76.2 mm and 6” = 152.4 mm equivalent diameter) and provides the related occurrence frequencies. However, the following limitations on the applicability of the guideline to the present case are worth considering: 


• the approach was developed for navigable channels longer than 1 km, and an area-specific study is suggested to obtain more reliable data; 

• the data refer to ships with a maximum capacity of 4000 m³, which are smaller than the ships of interest in the case study. Nevertheless, since the structural failure of the storage vessels is excluded, the equipment and the over-the-top pipeline systems should be roughly identical regardless of the ship’s capacity; 

• in the present case study, major release diameters (6”) are excluded, considering the typical piping configurations. 


Table 3. Critical areas characterization. T = temperature; P = pressure. 

                                                                      Operating conditions 
Description                                                  Phase    T (°C)         P (bar) 
Liquid piping system on deck                                 liquid   -161 to -141   1 to 4 
Boil Off Gas (BOG) piping/over-the-top vapour connections    vapour   -161 to -141   1 to 4 
Steam piping manifold on deck                                vapour   40             2 


The reference release diameters selected in the present analysis are summarized in the following: 


1. Large size release: equivalent diameter of 3” (= 76.2 mm) 

2. Small size release: equivalent diameter of 10 mm. 


The latter rupture type is the reference minor rupture for fixed process vessels (Uijt de Haag & Ale 1999). 

In order to estimate the duration of the release scenarios, the liquid and vapour connections on the deck are not considered directly open to the storage vessels (e.g., the lines are isolated through shutdown valves in closed position). Thus, leakages from those sources last until the entire inventory inside the pipes has been released. In contrast, the BOG piping and the over-the-top vapour connections are assumed to be in open connection with the storage tanks. In this case, the release lasts until the emergency shutdown (ESD) system closes the valves. According to the IGC code (IMO 2016), ESD valves in liquid piping systems shall close fully and smoothly within 30 s of actuation. In this analysis, a further 30 s are considered for detection and actuation, giving a total release time of 60 s under normal ESD operating conditions. In case of ESD system failure, the total release time is extended to a maximum of 30 min. 
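
The release-duration assumptions above can be restated as a small lookup. The following Python sketch is illustrative only: the function name and the `source` labels are hypothetical, and the inventory-limited duration of the isolated lines is left as an input because pipe inventories and discharge rates are not reported here.

```python
# Durations stated in the text, in seconds.
ESD_CLOSURE_S = 30.0             # IGC code: valve closes within 30 s of actuation
DETECTION_ACTUATION_S = 30.0     # further time assumed for detection and actuation
ESD_FAILURE_MAX_S = 30.0 * 60.0  # maximum release time if the ESD system fails


def release_duration_s(source, esd_works=True, inventory_limited_s=None):
    """Return the release duration in seconds for a leak source.

    source: "isolated_line" (line closed off from the storage tanks) or
            "open_to_tank" (BOG piping / over-the-top vapour connections).
    Both labels are hypothetical names used only in this sketch.
    """
    if source == "isolated_line":
        # The release stops when the pipe inventory is exhausted; the caller
        # must supply that time because it depends on data not given here.
        if inventory_limited_s is None:
            raise ValueError("inventory_limited_s is required for isolated lines")
        return inventory_limited_s
    if source == "open_to_tank":
        if esd_works:
            return ESD_CLOSURE_S + DETECTION_ACTUATION_S
        return ESD_FAILURE_MAX_S
    raise ValueError(f"unknown source: {source!r}")


print(release_duration_s("open_to_tank"))
print(release_duration_s("open_to_tank", esd_works=False))
```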


4.3 Frequencies evaluation 


The annual probability or accidental frequency of a single scenario is evaluated following the indications of the Purple Book (Uijt de Haag & Ale 1999). The frequency $f_i$ of the i-th accidental event (fire, explosion, dispersion, etc.) is calculated as follows: 


$f_i = F \times P_r \times P_{e,i}$ (1) 


where $F$ = initial accident frequency (1/y); $P_r$ = probability of having the release following the accident; $P_{e,i}$ = probability of having the i-th event given the considered release. 

F is a function of the frequency of damage to 
a ship per unit distance (events/year per vessel 
per km) which depends on the type of channel. A 
conservative value of 1.4 x 10% events/(year x ves- 
sel x km) is considered in the present analysis (Uijt 
de Haag & Ale 1999); a single vessel is assumed 
to be involved in case of damage, and 20 km are 
considered as length of the ship route obtaining 
F =2.8 x 10° 1/y. 

The probability $P_r$ expresses the possibility of having a release following a serious accident to the ship. This value depends on the type of ship and on the type and size of the release; indications reported in the literature (Spouge 2005, Uijt de Haag & Ale 1999) give $P_r$ = 0.025 and $P_r$ = 0.2025 for large and small size releases, respectively. 

Once the accidental release has occurred, the LNG can ignite immediately, giving rise to a pool/jet fire. Otherwise, it may spread out, forming a pool on the surface of the channel. The pool evaporation generates a vapour cloud, which can ignite, resulting in a flash fire or even a vapour cloud explosion (VCE). Each event described above has a probability of occurrence, expressed through the term $P_{e,i}$, where $i$ identifies the type of scenario. Standard ETA (Mannan 2005) is carried out to quantify $P_{e,i}$ for each scenario. 
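
As a worked illustration of Eq. (1), the Python sketch below multiplies the three factors for a single scenario. It assumes that the initial accident frequency is the simple product of the damage rate per vessel-kilometre, the route length and the number of vessels involved, which is how the text describes it. The release probabilities are the two values quoted above; the damage rate and the event probability used in the example are hypothetical placeholders, not values from the study.

```python
# Release probabilities reported in the text for large and small releases.
P_RELEASE = {"large": 0.025, "small": 0.2025}


def initial_accident_frequency(damage_rate_per_vessel_km, route_km, vessels=1):
    """Initial accident frequency (1/y), assumed to be a simple product."""
    return damage_rate_per_vessel_km * route_km * vessels


def scenario_frequency(initial_frequency, p_release, p_event):
    """Eq. (1): frequency (1/y) of one accidental event."""
    return initial_frequency * p_release * p_event


# Hypothetical inputs, for illustration only (not values from the paper).
F = initial_accident_frequency(1.0e-6, route_km=20.0)
f_example = scenario_frequency(F, P_RELEASE["large"], p_event=0.1)
print(f"F = {F:.2e} 1/y, f = {f_example:.2e} 1/y")
```

In practice the event probability would come from the branch probabilities of the event tree, one value per final outcome.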

The accidental scenarios related to the BOG 
piping and to the over-the-top vapour connec- 
tions may be mitigated by ESD activation (see 
Section 4.2). In case of ESD system failure the 
resulting hazardous scenario is not mitigated. In 
this study, the ESD system is assumed as a SIL 
2 “low-demand-mode” level. Thus, ESD failure 
probability is derived from the IEC 61508 standard 
(IEC 2010) and set equal to 10-? and implemented 
in the ETA. More details on the ETA are reported 
elsewhere (Chemical Controls 2017). 
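
For reference, IEC 61508 defines each safety integrity level in low-demand mode by a decade band of average probability of failure on demand (PFD). The Python sketch below encodes those bands and shows how a chosen PFD splits an event tree branch into "ESD closes" and "ESD fails". The function names are hypothetical, and the PFD used in the example is an arbitrary value inside the SIL 2 band, not the value adopted in the study.

```python
# IEC 61508, low-demand mode: average PFD band per safety integrity level,
# given as (inclusive lower bound, exclusive upper bound).
SIL_PFD_BANDS = {
    1: (1e-2, 1e-1),
    2: (1e-3, 1e-2),
    3: (1e-4, 1e-3),
    4: (1e-5, 1e-4),
}


def pfd_matches_sil(pfd, sil):
    """Check whether a PFD value falls inside the band of the given SIL."""
    low, high = SIL_PFD_BANDS[sil]
    return low <= pfd < high


def esd_branches(pfd):
    """Split an event tree node into the two ESD outcomes."""
    if not 0.0 <= pfd <= 1.0:
        raise ValueError("pfd must be a probability")
    return {"esd_closes": 1.0 - pfd, "esd_fails": pfd}


# Hypothetical PFD inside the SIL 2 band, for illustration only.
branches = esd_branches(5e-3)
print(pfd_matches_sil(5e-3, 2), branches)
```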


4.4 Consequences evaluation 


Consequence assessment is based on standard literature models for physical effect analysis (Mannan 2005), as implemented in the DNV GL Phast 7.1 software package. The main settings and assumptions are described in the following. 


4.4.1 Schematization of LNG composition 

LNG is a liquid mixture of hydrocarbons com- 
posed mainly of methane, with small amounts 
of ethane, propane, nitrogen and other typical 
components of natural gas. In the present work, 
the presence of other compounds in addition to 
methane is neglected since it is not significant for 
the purpose of evaluating the consequences. The 
physical properties of LNG are taken from (Lent- 
ner et al. 2017, Mannan 2005). 


4.4.2 Meteorological conditions 
Two reference meteorological conditions are 
assumed in this study: 


• F/2: Pasquill stability class “stable” and wind speed of 2 m/s, 

• D/5: Pasquill stability class “neutral” and wind speed of 5 m/s. 


Other relevant atmospheric parameters are summarized in Table 4. 


4.4.3 Threshold-based approach for physical effects assessment 

The maximum damage distances (or vulnerability radii, $r_{vul}$) for the accident scenarios considered in the present study are evaluated through conservative threshold values, which are summarized in Table 5. Threshold values are derived from Italian legislation on land use planning (DM 2001). Damage to both humans (in terms of irreversible effects) and industrial equipment is considered. 


4.5 Definition of the reference risk matrix 


Once the likelihood/probability and consequences 
are quantitatively evaluated, the risk associated 
with each LNG accidental scenario is estimated 
through a reference risk matrix. The matrix is built 
upon likelihood and consequences classification 
following criteria derived from a previous study 
related to the oil and gas sector (Petrone et al. 
2011). 

Table 6 shows the criteria chosen for the classifi- 
cation of likelihood, based on the evaluated annual 
probability of each considered scenario. 
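
A frequency-to-class lookup of this kind can be sketched in Python as follows. The class boundaries below are an assumption: they are one reading of the decade limits in Table 6, whose exponents are not fully legible in this copy, and should be checked against the original table before any use. The function name is hypothetical.

```python
from bisect import bisect_right

# ASSUMED class boundaries in 1/y (lower bound of classes 2 to 6).
LIKELIHOOD_BOUNDS = [1e-6, 1e-4, 1e-3, 1e-1, 1.0]

# Ratings as listed in Table 6, from class 1 to class 6.
LIKELIHOOD_RATINGS = [
    "Practically non credible occurrence",
    "Rare occurrence",
    "Unlikely occurrence",
    "Credible occurrence",
    "Probable occurrence",
    "Likely/Frequent occurrence",
]


def likelihood_class(frequency_per_year):
    """Return (class number, rating) for an annual frequency."""
    if frequency_per_year < 0:
        raise ValueError("frequency must be non-negative")
    # bisect_right makes each lower bound inclusive for the higher class.
    index = bisect_right(LIKELIHOOD_BOUNDS, frequency_per_year)
    return index + 1, LIKELIHOOD_RATINGS[index]


print(likelihood_class(5e-8))
```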

Table 7 summarizes the criteria considered for the consequence assessment. Consequences are classified by comparing the damage distances ($r_{vul}$) against a set of reference distances associated with the position of sensitive targets. Two categories of sensitive targets are considered, namely: 


Table 4. Atmospheric parameters set up for the consequences evaluation. 

Parameter                  Units    Value 
Air temperature            °C       20 
Water temperature          °C       20 
Pressure                   kPa      101.3 
Relative humidity          %        50 
Surface roughness length   mm       0.2 
Solar radiation            kW/m²    0.4 


Table 5. Threshold values implemented in the present study, derived from (DM 2001). LFL = lower flammability limit; VCE = vapor cloud explosion. 

                  Threshold value 
Event             Humans       Industrial equipment 
Flash Fire        LFL/2        -* 
Pool Fire         5 kW/m²      12.5 kW/m² 
VCE               0.07 barg    0.3 barg 
Flare/Jet Fire    5 kW/m²      12.5 kW/m² 

*Escalation is not credible. 


Table 6. Probability/likelihood qualitative classification criterion. 

Probability/likelihood (F)        Qualitative rating 
$f < 10^{-6}$ 1/y                 Practically non credible occurrence 
$10^{-6} \le f < 10^{-4}$ 1/y     Rare occurrence 
$10^{-4} \le f < 10^{-3}$ 1/y     Unlikely occurrence 
$10^{-3} \le f < 10^{-1}$ 1/y     Credible occurrence 
$10^{-1} \le f < 1$ 1/y           Probable occurrence 
$f \ge 1$ 1/y                     Likely/Frequent occurrence 

Table 7. Consequences qualitative classification criterion. $r_{vul}$ = vulnerability radius; $S_w$ = ship width. 

Consequence severity            Qualitative rating 
$r_{vul} < 1$ m                 Slight effect 
$1$ m $\le r_{vul} < S_w$       Effects internal to the source (ship) 
$S_w \le r_{vul} < d_U$         Effects external to the source (ship) & no interaction with targets 
$d_U \le r_{vul} < d_P$         Damages to other units & possible single fatality 
$r_{vul} \ge d_P$               Multiple fatalities 

• P: installations or activities characterized by the presence of people (ferries, campsites, buildings, etc.) 

• U: other units characterized by the presence of dangerous goods (ground-based plants, cargo ships, etc.) 


The minimum distances between the source of the accidental release (i.e., the LNG carrier) and the sensitive targets are defined as follows: 


• $d_P$: minimum distance between the LNG carrier and the type “P” installations; 

• $d_U$: minimum distance between the LNG carrier and the type “U” installations. 


Comparing $r_{vul}$ with the reference distances ($d_P$ and $d_U$) allows the consequences to be classified according to the criteria summarized in Table 7. 
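
The distance comparison can be sketched in Python as below. The ordering of the checks is an assumption about how the rows of Table 7 are meant to be read: the most severe applicable class is returned first, so the result stays conservative even if the two target distances are not ordered as the table implies. The function name and the distances in the example are hypothetical; only the 34 m breadth of Ship B comes from Table 1.

```python
def consequence_class(r_vul_m, ship_width_m, d_u_m, d_p_m):
    """Return the consequence class (1 to 5) for a vulnerability radius.

    r_vul_m:      vulnerability radius of the scenario
    ship_width_m: width of the LNG carrier
    d_u_m:        minimum distance to type "U" installations
    d_p_m:        minimum distance to type "P" installations
    """
    if r_vul_m >= d_p_m:
        return 5  # multiple fatalities
    if r_vul_m >= d_u_m:
        return 4  # damages to other units and possible single fatality
    if r_vul_m >= ship_width_m:
        return 3  # effects external to the ship, no interaction with targets
    if r_vul_m >= 1.0:
        return 2  # effects internal to the ship
    return 1      # slight effect


# Hypothetical target distances; ship width is the breadth of Ship B.
print(consequence_class(75.0, ship_width_m=34.0, d_u_m=60.0, d_p_m=120.0))
```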

Risk associated with each accidental scenario 
is finally assessed as the combination of the likeli- 
hood and the severity in the reference risk matrix 
shown in Figure 4 (see Section 5). 

The matrix is divided into three zones: 


• Low risk level: continuous improvement and acceptable risk; 

• Medium risk area or ALARP (As Low As Reasonably Practicable) zone; 

• High or intolerable risk level: mitigation and prevention measures are mandatory to reduce the risk to acceptable levels. 


Figure 4. Example of risk matrix application, showing risk results for the scenarios listed in Table 8 (indicated by the numbered circles in the matrix). 


Events with severity in class 4 (damages to other units & possible single fatality, see Table 7) and likelihood in class 1 (practically non-credible occurrence, see Table 6) may be associated with a low or ALARP risk level (see Fig. 4) depending, respectively, on the absence or presence of dangerous goods in the impacted units. If the target installation holds no hazardous substances, the damage caused by the primary scenario is purely economic; otherwise, people may be affected by the release of hazardous substances from the target unit damaged by the primary accidental scenario associated with the LNG carrier. 

The results obtained from the risk matrix are 
then summarized in the risk register. An example 
of risk register is shown in Section 5.2. 


5 RESULTS 


5.1 Geometrical assessment and legislative 
framework 


The legislative framework (see Section 3.2.2) 
showed no particular limitations or restrictions for 
what concerns the LNG carrier access to the har- 
bour, thus preliminarily supporting the feasibility 
of the LNG supply to the future terminal. 

The results of the geometrical envelope analysis show that there are no suboptimal interactions for small LNG carriers (e.g., Ship A in Table 1). On the other hand, for large LNG carriers (e.g., Ship B in Table 1), suboptimal geometrical interactions occur for about 50% of the time on a yearly basis, according to the legislative framework and the historical traffic data. Thus, small gas carriers (e.g., Ship A in Table 1) may need to be selected to avoid any poor or limiting interactions with the standard/historical harbour operations. 

It may be concluded that ship collision, impact,
grounding, and impact with berths are not credible
scenarios, for two reasons: piloting, tug service,
and speed reduction within the channels are mandatory
(Ordinance 2009, 2010) for the cargo ships
of interest; and the minimum distance between
ships in the convoy, together with the one-way
transit of the convoy (Ordinance 2009), ensures minimal
or no interaction between ships. Moreover, impact 
of the LNG carrier with other moored ships has a 
practically non-credible likelihood in the range of 
10° to 10° 1/y (Uijt de Haag & Ale 1999). Thus, 
the risk analysis focuses on the scenarios following 
a random process failure (see Section 4.2), exclud- 
ing all other scenarios. 


5.2 Safety assessment results 


In this Section, the results of the risk-based analysis described in Section 4 are shown for a set of representative scenarios. 

A total of 45 scenarios is obtained from hazard identification, including fire and explosion events following the release of either liquid or vapour natural gas. The risk register is compiled including all these scenarios, providing for each identified event an ID, a description, and the risk-based classification, which includes the frequency class, the consequence class and, finally, the risk level. An example is shown in Table 8 for the three most critical events associated with the LNG carrier. 


[Figure 5 legend: buffer zone; vulnerability radius corresponding to threshold value.] 

Figure 5. Example of buffer zone. For the definitions of threshold values and target type refer to Section 4.5. The LNG route refers to path 2 in Figure 1. The points labelled with “P” are the minor civil installations in the area. 


Table 8. Example of risk register showing risk analysis results. F = frequency class; C = consequence class; R = risk level defined as low (L), medium (M) or high (H) (see Section 4.5 for further details). 

Source                                        ID   Scenario         F   C   R 
Liquid piping system on deck                  01   VCE              1   5   M 
BOG piping/over-the-top vapour connections    02   Flare/Jet Fire   2   2   L 
Steam piping manifold on deck                 03   Flash Fire       1   3   L 


Results are also reported in the risk matrix (see 
Fig. 4) in order to drive the strategy for risk reduc- 
tion as discussed in Section 6. 

The most critical scenarios evaluated in the 
present analysis are associated with major liquid 
releases, leading to pool spread and evaporation, 
with potential large fires and explosions. The 
consequences are represented through the use of 
buffer maps in order to trace the maximum exten- 
sion of the scenarios. Figure 5 shows an example 
of buffer map. 

The scenario shown in Figure 5 is a flare/jet fire following the accidental release of LNG from the liquid pipeline, at maximum operating pressure, from a minor hole of 10 mm, under F/2 meteorological conditions. The damage distance $r_{vul}$ from the LNG route extends for 75 m and 62 m for targets of type “P” and “U” (see Section 4.5), respectively. The LNG ship route is located in the middle of the channel, with an uncertainty of 15 m from the channel centreline (path 2 in Fig. 1). Thus, the same ±15 m uncertainty is applied to extend the consequences zone. The vulnerability radii are then moved along the LNG route to obtain the buffer zones, which help visualize the possible targets with respect to the threshold values defined in Table 7. 
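
A minimal Python sketch of the buffer construction is given below. It assumes that the route uncertainty is simply added to the damage distance on each side of the channel centreline, which is one reading of the description above, and it treats the route as a straight line so that a target is described only by its perpendicular offset. The function names are hypothetical.

```python
ROUTE_UNCERTAINTY_M = 15.0  # uncertainty of the ship route about the centreline


def buffer_half_width_m(r_vul_m, route_uncertainty_m=ROUTE_UNCERTAINTY_M):
    """Half-width of the buffer zone measured from the channel centreline."""
    return r_vul_m + route_uncertainty_m


def target_in_buffer(offset_from_centreline_m, r_vul_m,
                     route_uncertainty_m=ROUTE_UNCERTAINTY_M):
    """True if a target at the given perpendicular offset lies in the buffer."""
    half_width = buffer_half_width_m(r_vul_m, route_uncertainty_m)
    return abs(offset_from_centreline_m) <= half_width


# Damage distances quoted in the text for targets of type "P" and "U".
for target_type, r_vul in (("P", 75.0), ("U", 62.0)):
    print(target_type, buffer_half_width_m(r_vul))
```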


6 DISCUSSION 


The outcomes of the analysis demonstrate that the LNG carrier access induces a relevant risk level for the industrial and civil installations close to the channel. The analysis may therefore constitute a preliminary driver to enhance safety measures and procedures in the development of the LNG terminal, with a dual purpose: i) reducing possible suboptimal interactions between the LNG carrier and the current port configuration; ii) reducing the risk level by lowering the likelihood and/or impact of the most critical events. Some prevention and mitigation actions are listed in the following to illustrate how the risk results obtained with the present methodology can be used. 

Prevention measures are aimed at reducing the credibility of the accidental scenarios. In the present case, the sequence of operations on the LNG carrier before entering the harbour area is crucial to prevent the occurrence of critical scenarios. In fact, liquid is present on board because of the cooling down activities, which normally precede the loading/unloading operations. These activities are usually carried out in the roadstead. Carrying out the cooling down activities at berth removes (and thus prevents) the accidental scenarios linked to the liquid pipeline system along the ship route. Implementing this prevention action, however, shifts the hazards from the ship route to the berth. Thus, a specific analysis should be performed to evaluate the risk of carrying out the cooling down operations at berth. 

Mitigation measures reduce the impact of an accidental scenario by lowering the consequences. An example of a mitigation action applicable to the case study is the use of fire-fighting tugs to reduce the effects of fire and explosion following LNG releases from the carrier. However, the fire-fighting tugs should be selected carefully, accounting for the water-mist demand for effective fire-fighting, the capacity of the water suction pumps, and the depth of the sea in the working area. Thus, further studies are required for the most critical scenarios. 


7 CONCLUSIONS 


In the present work, a feasibility study was car- 
ried out to evaluate the geometrical, legislative and 
safety aspects associated with the access of LNG 
carriers in the port of Venice. The carriers sup- 
ply LNG to a future bunkering terminal which is 
under development. 

A specific risk-based analysis supported the identification and evaluation of potential accidents associated with the transit of LNG carriers in the harbour area through a risk matrix. The most critical scenarios were identified, providing indications for risk control in terms of prevention and mitigation actions. 

The present method may support the planning of industrial harbour area development in the perspective of a wider implementation of LNG bunkering and distribution terminals. 


ABSTRACT: Performing risk assessments for hierarchical, multi-functional systems, such as a municipality, is an activity that requires input from a multitude of actors. In such systems, risk assessments can be performed at many system levels and support different types of decisions. For issues that are constrained to a specific sub-system, such as a municipal department, decisions can preferably be taken at sub-system level. However, for other issues, such as those crossing many sub-systems and system levels, decisions should preferably be taken at higher system levels, e.g. at the municipal level. At the same time, these decisions require extensive information from the sub-systems. The aim of the present paper is therefore to outline a framework for how risk information can be aggregated, with application in the context of Swedish municipalities. The research builds on previous work by the authors where a method for performing risk and vulnerability assessments in municipal departments has been developed using an action research approach. The method will soon be implemented in each municipal department in the municipality of Malmö, Sweden, and the next step is to develop the aggregation of these assessments. It is argued that this aggregation is facilitated by ensuring that key aspects of the risk assessments in the municipal departments are harmonized. At the same time, too much standardisation may also reduce the utility of the assessments for the municipal departments. 


1 INTRODUCTION 


Performing risk assessments for hierarchical, multi- 
functional systems, such as a municipality, is an 
activity that requires input from a multitude of 
actors. In such systems, risk assessments can be per- 
formed at many system levels and support different 
types of decisions. For issues that are constrained to 
a specific sub-system, decisions may preferably be 
taken at sub-system level. However, for other issues, 
such as those crossing many sub-systems and system 
levels, decisions should preferably be taken at higher 
system levels. At the same time, these decisions 
require extensive information from the sub-system 
level which has to be aggregated and synthesized to 
become meaningful and possible to use as a basis 
for risk reductions at system level. 

However, limited research has been conducted on this topic in the field of risk assessment and management. The issue is addressed by Ayyub et al. (2008), who argue that to facilitate aggregation of risk to higher levels of abstraction all levels “should share a common analytical framework” (Ayyub et al., 2008, p. 791). The UK Cabinet Office makes a similar argument when claiming that a benefit of a standardized risk assessment approach is that it “facilitates regional aggregation of local risk assessments” (UK Cabinet Office, 2005, p. 41). Klaver et al. (2008) provide similar arguments but also stress the need for consistent scales for impact, probability and risk evaluation as well as a common list of threat classes. Furthermore, David (2009) argues that one must be cautious when attempting to aggregate “risks” that are not independent, since a simple “summation” would not capture dependencies that may exist between them. 

Although not specifically in the context of risk and vulnerability assessment (RVA), Kramer argues that information sharing, which is a precondition for risk aggregation, can be “impeded by differences in how information is coded and categorized” (Kramer, 2005), thus stressing the need for common ways of describing risk-related information. On the other hand, Vaughan (1997) argues that there are many obstacles to knowledge and information sharing founded in the very nature of complex organizations that have multiple, specialized units, creating so-called “structural secrecy”. She warns that too much information sharing in too standardized ways may actually result in units simply getting less knowledge about other units in the organization, e.g. due to information overload. 

Other studies, presented in Mansson et al. (2015) and Mansson et al. (2017), have shown that inconsistencies in how risk information is expressed in the Swedish crisis management system reduce the possibilities of aggregating risk information. In addition, an experimental study (Mansson et al. 2017) showed that semi-quantitative or quantitative ways of expressing risk information may facilitate risk aggregation compared to qualitative ways. Although these two studies provide some clues on how to accomplish successful aggregation of risk information, they mainly focused on aggregation of likelihood and consequence information; in a risk and vulnerability assessment, however, there is potentially additional information that should be aggregated. An example is information about interdependencies between critical societal functions. 
In none of the studies referred to above, however, has a comprehensive analysis been made of what the challenges for aggregation/synthesis actually are and what processes and methods are needed to overcome these challenges in order to accomplish an appropriate synthesis of RVAs. Hence, there is very little guidance on how to go about establishing and implementing such a process in practice. This is troublesome, since establishing a process able to generate a consistent, high-quality picture of risk and vulnerability at a higher level, which utilizes the information from lower-level RVAs as much as possible, is far from straightforward. 

In the Swedish crisis management system, which is the context of the present paper, several public actors are obliged to perform Risk and Vulnerability Assessments (RVAs) (SFS 2006:544). In this system, aggregation of risk information is intended to play a key role. Information from municipal RVAs should be used as input to regional RVAs, and regional RVAs should be used as input to the national RVA. However, even though this system has been in place for more than 10 years, many improvements are still needed before risk aggregation can be successfully accomplished across societal levels. 

Rather than attempting to solve all these prob- 
lems in the Swedish crisis management system, the 
present paper delimits its focus to the municipal 
and municipal department level. The aim of the 
RVAs carried out in the Swedish municipalities is 
to increase risk awareness and knowledge of deci- 
sion makers and to implement effective risk reduc- 
tion measures (SFS 2006:544; MSB 2015:5). The 
scope of the analysis is primarily the municipal 
organization, where the aim is to ensure that criti- 
cal activities can continuously be performed also in 
times of crises, but secondarily also to create a risk 
picture for the municipality as a geographic region. 
The present paper mainly focuses on the organiza- 
tional perspective. 


In order to establish a successful risk and vulnerability assessment process, many municipalities in Sweden, especially larger ones, have decided to push extensive analytic activities down to the municipal department level. The main reason for this is that many of the opportunities, and much of the mandate and budget, to implement improvements reside at this level. In addition, the local ownership of the analysis process is important, and the expertise necessary for good-quality assessments exists at the department level. See Cedergren et al. (forthcoming) for further discussion of challenges and success factors related to municipal risk and vulnerability analyses in Sweden. 

At the same time, even though relevant risk infor- 
mation can be acquired at municipal department 
level, some critical risk information will probably 
not be apparent until risk information from depart- 
ments is somehow put together, aggregated and syn- 
thesized at the municipal level. This aggregated risk 
information can then both be used at the municipal 
level to make decisions that concern the municipal 
organization as a whole as well as be fed back to the 
municipal department level in order to improve the 
next iteration of the department level assessments. 

The aim of the present paper is to describe a 
general framework for how risk aggregation can 
be accomplished in order to support system-level 
decisions. This model is then applied in the con- 
text of the municipality of Malmö. The research 
builds on previous work by the authors where a 
method development process has been initiated in 
collaboration with the municipality of Malmö in 
southern Sweden, focusing on RVA in municipal 
departments. The method development process, 
grounded in action research and design science, has 
been described previously in Cedergren and Hassel 
(2017) and a first version of the method has been 
presented in Hassel and Cedergren (2017). The 
developed method is currently being implemented 
in each municipal department in the municipality of 
Malmö, Sweden, and this paper will provide a basis 
for the next step of the RVA process in Malmö. 

The outline of the paper is as follows: in Chap- 
ter 2, a generalised model for aggregation of RVAs 
is described, in Chapter 3 the model is applied 
in the context of the municipality of Malmö to 
outline initial ideas regarding aggregation of risk 
information, and in Chapter 4 the results are dis- 
cussed and conclusions drawn. 


2 A GENERALISED MODEL FOR 
AGGREGATION 


The general model for aggregation of RVAs is presented in Figure 1. It was developed in an iterative process where the point of departure has been the research referred to above and the author’s experience from the Swedish RVA system. The author has interacted with practitioners in meetings and workshops, who have expressed their needs and challenges related to the aggregation process. 

Figure 1. An overview of the model for aggregation of risk information. 

The model should be seen as recursive, meaning 
that the displayed relationships between system and 
sub-system levels could be repeated between the sys- 
tem level (e.g. municipal) and higher system levels 
(regional, national or international) as well as between 
sub-system level (e.g. municipal departments) and 
even lower system levels (e.g. specific units). The 
model will be briefly described in what follows. 


2.1 Instructions and procedures 


First, instructions and procedures for the sub-system RVAs need to be formulated. The purpose is to ensure and facilitate the subsequent aggregation of risk information from the sub-system RVAs at the system level. It is especially the form of the risk information that needs to be harmonized and made consistent, since aggregation would otherwise be extremely difficult and time-consuming. In addition, the method and analysis processes as a whole could also be harmonized, although this is not strictly necessary. At the same time, since information across different RVAs must use commensurate scales, these must either be absolute/quantitative or use common scale descriptions, definitions, categorizations, etc. 


2.2 Risk information to the system level 


Secondly, risk information must be submitted from
the sub-system, i.e. RVAs must be conducted at the
sub-system level and presented in a way that is in
accordance with the instructions and procedures
(see above). Of special importance here is not to
compromise information security. Hence, creating
trust between the involved actors and organizations
is a crucial precondition for successful aggregation,
and potentially a large issue, especially when
private actors are involved and sensitive
information is being processed. Of course, in order
to handle the information in a way that does not
compromise information security, some type of
information infrastructure needs to be in place
where information can be shared in a secure and
efficient way.


2.3 Aggregation mechanism 


Thirdly, there has to be an aggregation mechanism 
at the system level that can collect, store and proc- 
ess the risk information from the sub-systems in 
order to produce the desired outputs at the system 
level. In addition, in most cases it is likely that addi- 
tional analytical activities need to be performed at 
the system level as aggregation is not likely to be a 
mechanical, additive process due to interconnected- 
ness between the sub-systems (David, 2009). In spe- 
cial circumstances, where sub-systems do not have 
any significant relationships/interactions, the aggre- 
gation process can be mechanistic—e.g. aggregat- 
ing/adding risk metrics from the sub-systems into 
risk metrics for the system, or comparing impor- 
tance of sub-system elements across sub-systems. 
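
For the special case of sub-systems without significant interactions, the mechanical aggregation mentioned above can be illustrated with a minimal Python sketch. The department names, the choice of expected annual loss as the risk metric and all numbers are hypothetical and serve only as an illustration.

```python
# Hypothetical risk metric per sub-system: expected annual loss (arbitrary units).
# Plain summation is only meaningful if the sub-systems do not interact significantly.
sub_system_risk = {
    "department_a": 1.5,
    "department_b": 0.5,
    "department_c": 2.0,
}


def aggregate_additively(risks):
    """Return the system-level metric as the plain sum of the sub-system metrics."""
    return sum(risks.values())


def rank_sub_systems(risks):
    """Return sub-system names ordered from highest to lowest risk metric."""
    return sorted(risks, key=risks.get, reverse=True)


system_risk = aggregate_additively(sub_system_risk)
ranking = rank_sub_systems(sub_system_risk)
```

As soon as sub-systems depend on each other, such a sum is likely to misstate the system-level risk, which is why additional analytical activities are needed in most cases.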

These three aspects are the necessary compo- 
nents of an aggregation process; however, they 
should be complemented by three additional 
aspects in order to ensure high quality RVA proc- 
esses at both the system and sub-system level. 
Hence, the following three aspects described below 
can be added to the model. 


2.4 Information and knowledge bases to the 
sub-system level 


Fourthly, information and knowledge bases should 
be provided from the system level to the sub-sys- 
tem levels. This can concern information that all 
sub-system RVAs are in need of but which would 
require extensive data collection efforts. This activ- 
ity can then be more efficiently performed by the 
RVA coordinating actor at the system level which 
also would ensure a consistent knowledge basis 
concerning aspects that ideally should not vary
across the sub-system RVAs (e.g. the likelihood
of external events such as floods, hurricanes, etc.).


2.5 Support needs and guidance 


Fifthly, in order to develop the quality of the RVA
processes at both sub-system and system level,
support needs should be communicated from the
sub-system; and sixthly, support and guidance should
then be provided by the RVA coordinating actor
at system level. Again, the reason is primarily
efficiency-related: if several sub-system RVAs
struggle with the same problems, then a coordinated
effort to reduce these problems is likely to be
preferable to each sub-system working with quality
improvements only on their own. Note that this has
to do both with improving the quality of the
assessment on the sub-system level and with
increasing the possibilities of a successful
aggregation on the system level.


3 APPLICATION OF THE GENERAL 
MODEL OF AGGREGATION FOR 
MUNICIPAL RVA IN SWEDEN 


Below the general model for accomplishing system- 
level aggregation will be applied in the context of 
municipal RVA in Malmö, Southern Sweden. The 
authors have been extensively involved in the RVA 
development process there for more than 1.5 years. 
Currently RVAs are conducted in the municipal
departments and the next step is to aggregate the
output from these assessments. The description
below is the result of the first round of
discussions about how the aggregation could be
accomplished. It is likely that the aggregation
process will be further developed as more experience
from the RVA processes is gained.


3.1 Instructions and procedures for the 
sub-system RVAs 


Organizations at a higher system level do not always
have a legal mandate to provide instructions that
the lower system levels are obliged to follow. In
Sweden, the Swedish Civil Contingencies Agency (MSB)
has developed regulations for municipal RVAs
(MSB 2015:5); however, these only stipulate what
should be included in the assessment, not how the
assessment should be performed. In addition, a
number of reports have been developed by MSB
providing some general guidance (MSB 2011, 2013, 2014).

In the present paper the focus is on municipal
RVA where the municipality of Malmö decided to
develop a method that all municipal departments 
should use. The authors of this paper participated 
in the development of this method and the points 
of departure for the method development have 
been presented extensively in previous papers by the 
authors (Hassel and Cedergren, 2017; Cedergren 
and Hassel, 2017; Cedergren et al. forthcoming). 

Note that the aim of the method is to assist 
municipal departments to create robustness in 
their organisation by identifying the main sources 
of risk and vulnerability which should be targeted 
by risk reduction measures. It is only a secondary
aim that the output of the RVAs should be
aggregated at municipal level (which is the focus
of the present paper).

To support the municipal departments in conducting
the assessments, a method handbook has been
developed, educational activities have been and
will be arranged, and computer software has been
developed in which the analysis can be documented.
In addition, the municipality of Malmö is currently
developing short films that can be used by
the municipal departments.

The method consists of three main steps which 
are conducted every year, i.e. the RVA is succes- 
sively made both broader and deeper over time. 
A previous version of the method has been described
in detail in Hassel and Cedergren (2017); the
current version is summarized below.


3.1.1 Step 1 — Mapping the municipal department 
First, a mapping of the functions performed by the 
municipal department is conducted. This mapping 
includes determining how critical each function is 
for safety and functionality of the municipality of 
Malmö as well as mapping of dependencies that 
each function has. When mapping dependencies 
the strength of the dependencies is judged based on 
to what extent the function needs the dependency 
in order to be maintained as well as to what extent 
there are alternative back-up solutions in place. 


3.1.2 Step 2 — Analysing undesired events 

Secondly, undesired events that can happen are
identified. Then a selection of events is analysed.
The focus is primarily on how the department’s
functions are affected in the event and to what
extent the department can still perform its
critical functions. The extent of the negative
consequences for the municipality of Malmö is
judged based on the department’s capability to
continue to perform its functions during the event.
Finally, potential causes are identified,
indications/trends coupled to the event are
identified and the likelihood of the event is judged.


3.1.3 Step 3 — Create a decision basis for 
implementing improvements 

Thirdly, the information from the first two steps is
visualised in order to identify functions,
dependencies and events that are particularly
critical. This visualisation is then used as a basis
for suggesting risk reduction measures, which are
first identified and then evaluated. The evaluation
is based on the measures’ effects on the risk level,
their costs and potential side-effects.


3.2 Risk information from sub-system RVA to 
system level 


The sub-system RVAs produce several outputs
that can be used as input to the system-level RVAs.
All this information is stored in computer software
that can be accessed by the RVA coordinating
unit at the system level. A selection of the most
relevant risk information that is uploaded includes
the following:


• Time until reaching severe consequences when
  the function cannot be performed.
• Dependencies that each function has.
• Importance of each dependency for the
  functions.
• Extent of back-up solutions available for each
  dependency.
• List of undesired events.
• Description of each undesired event.
• Functions that are prioritized in each event.
• Capability to continue performing each
  prioritized function in each event.
• Description and estimation of negative
  consequences in each event due to reduced
  capability in the municipal department’s functions.
• Causes, trends and indications of events.
• Estimation of likelihood of the event.
• Strength of knowledge judgements for each
  estimation.
• Suggested risk reduction measures.
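
The uploaded risk information listed above could, for instance, be held in a record structure such as the one sketched below in Python. All class names, field names and the 1-4 ordinal scales are assumptions made for illustration; the paper does not describe the data model of the actual software.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Dependency:
    name: str            # chosen from a shared dependency list
    importance: int      # common ordinal scale (assumed 1-4)
    backup_extent: int   # common ordinal scale (assumed 1-4)


@dataclass
class FunctionRecord:
    name: str
    hours_until_severe: float   # time until severe consequences if the function stops
    dependencies: List[Dependency] = field(default_factory=list)


@dataclass
class EventAssessment:
    name: str
    description: str
    capability: Dict[str, int]   # prioritized function -> capability rating
    consequence: int             # estimated negative consequences
    causes_trends_indications: List[str] = field(default_factory=list)
    likelihood: Optional[int] = None             # None if the estimate was waived
    strength_of_knowledge: Optional[int] = None
    suggested_measures: List[str] = field(default_factory=list)


@dataclass
class DepartmentUpload:
    department: str
    functions: List[FunctionRecord]
    events: List[EventAssessment]


# Hypothetical example of one department's upload.
upload = DepartmentUpload(
    department="example_department",
    functions=[FunctionRecord("home care", 24.0, [Dependency("electricity", 4, 2)])],
    events=[EventAssessment("flooding", "Heavy rain floods low-lying areas.",
                            capability={"home care": 2}, consequence=3)],
)
```

Keeping the ratings as plain integers on shared scales is what makes the records comparable across departments at the system level.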


3.3 Aggregation mechanism and complementing 
analytical activities 


The purpose of the aggregation process is pri- 
marily that it should provide information and 
insights that can be used as a basis for system-level
improvements. A secondary purpose is that some 
information should be fed back to the municipal 
departments, and this will be further explored in 
section 3.4. 

As mentioned previously, according to the
regulations in Sweden the municipal RVA has two
perspectives: the municipality as an organisation and
the municipality as a geographical region. Note
that this paper takes the former perspective and
leaves the latter for future research, as it is a
much more complex task requiring extensive input
from non-municipal actors.

The aggregation in the municipality of Malmö
will be performed by the central RVA coordinating
unit. Below, a number of outputs from this
aggregation are described; this list of outputs is
likely to be expanded later on in the process.

3.3.1 Identification and ranking of joint 
dependencies 

A municipal department may identify that a
particular dependency is very critical for being
able to perform their functions. In some cases they
might be able to justify taking actions to reduce
the criticality of this dependency, e.g. by
investing in buffers, making the dependency more
robust (if internal) or ensuring that external
actors will be able to deliver the resources or
services they are dependent on (e.g. by improving
contracts). However, if many municipal departments
are critically dependent on the exact same
resources or services, then measures might be
warranted even though each municipal department
cannot justify it on an individual basis.
Additionally, in some cases risk reduction taken at
system/municipal level might become much more
effective than each municipal department taking
separate actions concerning the same dependency.
In order to create this output the dependencies
need to be described in a standardised, consistent
way. Previous experiences from dependency
assessments where no standardisation was used
showed that aggregation was very challenging since
the level of detail with which dependencies were
described varied greatly (Johansson et al. 2016).
Therefore, the RVA method for the municipal
departments must include an established list of
dependencies from which the municipal departments
can choose. Since it is difficult to foresee all
possible dependencies at the outset of the
assessments, this list must be dynamic and
continuously updated.
Furthermore, the ranking of the joint dependencies
should logically be a combination of how many
municipal departments have the dependency and how
critically dependent they are. Hence, the rating of
dependency criticality must be comparable across
the municipal departments, which is accomplished by
using common scales.
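
One conceivable way of combining the number of dependent departments with their criticality ratings is sketched below in Python. The data, the 1-4 scale and the choice to sum the ordinal ratings are assumptions for illustration only; the paper does not specify a ranking formula, and other combinations are equally possible.

```python
from collections import defaultdict

# Hypothetical ratings: (department, dependency, criticality on a common 1-4 scale).
ratings = [
    ("dept_a", "electricity", 4),
    ("dept_b", "electricity", 3),
    ("dept_c", "electricity", 4),
    ("dept_a", "it_network", 2),
    ("dept_b", "it_network", 4),
    ("dept_c", "fuel", 4),
]


def rank_joint_dependencies(rows):
    """Rank dependencies by summed criticality, then by number of departments."""
    totals = defaultdict(lambda: {"departments": 0, "criticality_sum": 0})
    for _department, dependency, criticality in rows:
        totals[dependency]["departments"] += 1
        totals[dependency]["criticality_sum"] += criticality
    return sorted(
        totals.items(),
        key=lambda item: (item[1]["criticality_sum"], item[1]["departments"]),
        reverse=True,
    )


ranked = rank_joint_dependencies(ratings)
ranked_names = [name for name, _ in ranked]
```

Summing ordinal ratings is a simplification; it only makes sense because all departments are assumed to rate criticality on the same scale.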


3.3.2 Identify dependency chains 
In the assessments at municipal department level 
only first-order dependencies are identified since 
higher-order dependencies both can be difficult to 
realise/have knowledge about and be time-consuming
to identify. However, by aggregating information
about first-order dependencies, dependency
chains can be constructed that provide insights
about second- and higher-order dependencies.
Information about dependency chains may be 
important for the municipality as a whole since it 
could provide insights on where resources should be 
directed to break potential cascading effects at an 
early response stage (both considering preventive 
measures and measures as an event is unfolding). 
Of course, only a partial view of the municipal- 
ity can be obtained based on dependency infor- 
mation from municipal departments. In order to 
obtain a more holistic view, additional actors need 
to be consulted and included, such as private and 
other public actors. 
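
As a minimal sketch of how dependency chains could be constructed from first-order dependencies, the Python fragment below treats the reported dependencies as a directed graph and follows it depth-first. The unit names and the dictionary layout are hypothetical, and the paper does not prescribe a particular algorithm.

```python
# Hypothetical first-order dependencies: unit -> what it directly depends on.
first_order = {
    "home_care": ["it_network", "fuel"],
    "it_network": ["electricity"],
    "fuel": ["road_transport"],
    "electricity": [],
    "road_transport": [],
}


def dependency_chains(graph, start):
    """Return all chains start -> ... -> end found by depth-first search.

    A chain ends when a node has no further dependencies that are not
    already on the chain, so cycles cannot cause endless recursion.
    """
    chains = []

    def walk(node, path):
        next_nodes = [n for n in graph.get(node, []) if n not in path]
        if not next_nodes:
            if len(path) > 1:
                chains.append(path)
            return
        for n in next_nodes:
            walk(n, path + [n])

    walk(start, [start])
    return chains


def higher_order_dependencies(graph, start):
    """Return dependencies reachable from start that are not first-order."""
    direct = set(graph.get(start, []))
    reachable = {n for chain in dependency_chains(graph, start) for n in chain[1:]}
    return reachable - direct


chains = dependency_chains(first_order, "home_care")
hidden = higher_order_dependencies(first_order, "home_care")
```

In this example the unit only reported its dependence on the IT network and fuel, while the aggregated graph reveals its indirect dependence on electricity and road transport.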


3.3.3 Identify undesired events 
Each municipal department performs an
identification of potential undesired events. Since
event identification is also a requirement for the
municipal RVA, the list of events from the
municipal departments can be used as an input,
although some type of categorisation is likely to
be needed.


3.3.4 Inform estimations of event consequences 
for the municipality as a whole 

An essential part of a risk and vulnerability assess- 
ment at a municipal level is the estimation of the 
negative consequences of undesired events for the 
municipality as a whole. In the assessments per- 
formed by the municipal departments no conse- 
quence estimation is carried out considering the 
total consequences for the municipality. However, 
the dependency information, assessment of capa- 
bilities and consequence estimations collected from 
each municipal department can provide a necessary 
partial input to the estimation of the consequences 
of undesired events at the municipal level. 

Note that it is unlikely that a mechanical proce- 
dure can be used for determining consequences at 
municipal level based on information from munici- 
pal departments. The reason is that the functions 
of each municipal department are typically tightly 
coupled together so there is no additivity in the 
consequences estimated by each municipal depart- 
ment. Notwithstanding, the information from 
the municipal departments is crucial in order to 
understand how the municipality as a whole will 
be affected. But additional information from other 
relevant actors must also be included and the con- 
sequence estimations should be made in a delib- 
erative process where the information from the 
municipal departments is used as an input. 

Furthermore, in order for the municipality to 
be able to make consequence estimations concern- 
ing some specific events it must be ensured that all 
municipal departments have included these events 
in their assessments. Therefore, the method for RVA
used by the departments will address both events
that each department has identified on its own and
a number of events that the RVA coordinating
unit has selected to be common for all departments.


3.3.5 Create a prioritized list of critical functions, 
dependencies and events 

In order to get the most effect from the resources 
available for risk reduction, it makes sense to iden- 
tify the most critical functions, dependencies and 
events for the municipality as a whole. By doing so 
the RVA coordinating unit could assist the munici- 
pal departments with a responsibility for the most 
critical functions, dependencies and events. This
assistance could be financial support, but it
could also be help in getting political attention.

Again, in order to establish these system-wide 
criticality lists, the scales for judging criticality are 
designed so they are common for all departments. 


3.3.6 Improve estimations of the likelihood of 
events 

Estimating the likelihood of undesirable events is
part of the municipal RVA. Each municipal
department makes likelihood estimations, although
it may waive the estimation if it judges its
expertise to be insufficient, and these estimations
could be used as a basis for the likelihood
estimations at the municipal level. Especially
those departments that have rated their knowledge
concerning the likelihood as being high could be
consulted. In order to enable this, the strength of
knowledge concerning the various estimations (see
e.g. Flage & Aven, 2009) is included in the RVA
method used by the municipal departments.


3.4 Provision of information and knowledge bases 
from system level to sub-system RVA 


In addition to providing risk information that can 
be used for improvements at the municipal level, 
the aggregation process can lead to insights that 
can be fed back to the RVAs in the municipal 
departments in order to improve the quality of 
these assessments. Below, a number of such pos- 
sibilities will be described. 

Feeding back information about dependency 
chains to the municipal departments can lead to 
insights regarding how reduced capability to per- 
form their functions can give rise to downstream 
effects. In that way their estimations of how criti- 
cal their functions are can be improved. Due to the 
same reasons, this can also improve the estimations 
of consequences of events. In addition, the munici- 
pal department’s understanding of how they may 
become affected by undesired events may also be 
improved since they get more knowledge about 
their upstream higher-order dependencies. 

Aggregating and categorizing the undesired
events that have been identified in each municipal
department can also give other municipal
departments insights into events that could occur
there as well, i.e. creating an “event repository”.

In a similar way the estimation of the likelihood
of events can be improved. As mentioned previously,
the departments can opt not to provide a
likelihood estimation if they judge their expertise
concerning this to be very limited. Information
from the aggregation process can then be fed back
to the municipal departments so that likelihood
estimations from departments that judge their level
of expertise as being high can be provided to
those departments that lack expertise.

Of course, it may be the case that basically no 
municipal departments have the required knowledge 
concerning some event likelihoods. In that case the 
RVA coordinating unit could conduct some additional
analytic activities where appropriate expertise
is identified and involved to obtain a well-founded
estimation that may be fed back to the municipal 
departments. This would be a much more efficient 
approach than forcing each department to do it on 
an individual basis. 
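
The feedback of likelihood estimations described above can be sketched as a simple selection rule. The Python fragment below is only an illustration: the data layout, the three-level strength-of-knowledge scale and the threshold for "high" knowledge are assumptions, not part of the described method.

```python
# Hypothetical input: estimates[event][department] = (likelihood rating or None,
# strength of knowledge on an assumed 1-3 scale). None means the estimate was waived.
estimates = {
    "flooding": {"dept_a": (3, 3), "dept_b": (None, 1), "dept_c": (2, 1)},
    "epidemic": {"dept_a": (None, 1), "dept_b": (None, 1)},
}

HIGH_KNOWLEDGE = 3  # assumed threshold for "high" strength of knowledge


def feedback_for(event_estimates):
    """Return (estimates from high-knowledge departments, departments lacking one)."""
    trusted = {
        dept: likelihood
        for dept, (likelihood, knowledge) in event_estimates.items()
        if likelihood is not None and knowledge >= HIGH_KNOWLEDGE
    }
    missing = sorted(
        dept for dept, (likelihood, _knowledge) in event_estimates.items()
        if likelihood is None
    )
    return trusted, missing


def needs_central_analysis(event_estimates):
    """True if no department has a high-knowledge estimate for the event."""
    trusted, _missing = feedback_for(event_estimates)
    return not trusted


flood_trusted, flood_missing = feedback_for(estimates["flooding"])
epidemic_central = needs_central_analysis(estimates["epidemic"])
```

Events flagged by the second function are those for which the coordinating unit would have to seek external expertise.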

In order for the municipal departments to be 
able to accurately assess how their functions will 
be affected in various events and the negative con- 
sequences this would entail, it is important that 
the departments have good knowledge about the 
potential direct impacts of those events (i.e. what 
areas of the city would be flooded? How many per- 
sons would be incapacitated in a major epidemic? 
etc.). Such knowledge is not necessarily something 
that each department would have; however, as in 
the case of likelihood estimations, the RVA coordi- 
nating unit could make an effort to find the appro- 
priate expertise and provide the departments with 
appropriate information. 


3.5 Provide support and guidance to 
sub-system level 


In order to both improve the quality of the RVAs 
in the municipal departments and to increase the 
possibility of accomplishing aggregation, the RVA 
coordinating unit should provide continuous sup- 
port and guidance to the municipal departments. 
In the case of Malmö this will be done through at 
least three annual workshops where each of the 
three steps of the method is addressed. In relation 
to each step, support needs should be indicated by
each department. In addition, support needs
should be identified through assessments of the 
quality of the department’s analyses performed 
by the coordinating unit. Key here is to identify 
whether significant mistakes or misinterpretations 
have been made in the analysis. 


4 DISCUSSION & CONCLUSIONS 


In the present paper a general model for perform- 
ing aggregation of Risk and Vulnerability Analy- 
ses has been suggested. This model has then been 
applied to sketch out some initial ideas on how to 
accomplish aggregation in the context of munici- 
pal RVA, with a special focus on how analyses for 
municipal departments can be aggregated to the 
municipality of Malmö. 

The model and the specific application in
Malmö need to be concretized and evaluated. This
will be done in the coming year as the Risk and
Vulnerability Analyses in the municipal departments
are conducted, which of course is a precondition
for the aggregation process.

It is argued that this aggregation is facilitated by
ensuring that key aspects of the risk assessments
in the municipal departments are harmonized. At
the same time, too much standardisation may also
reduce the utility of the assessments for the
municipal departments.

Another key point of the proposed aggregation
model is that sharing of risk information between
system levels must be two-way rather than only
bottom-up. As information is fed back to the
sub-systems, it is critical that it is presented in
a compact and easy-to-use way; otherwise it is
likely that it will not be used, as time and
resource constraints are a significant challenge
for municipal departments (Cedergren et al.,
forthcoming).

A subsequent aim of the aggregation process is 
also to be able to include relevant risk information 
from other actors, such as private actors and other 
public actors. Therefore, a study regarding what 
information these other actors are willing to share 
should be initiated. It is of utmost importance that 
trust relationships with these actors are built so 
they feel they are able to share potentially sensitive 
information with the municipality.


ABSTRACT: Quantitative risk assessment supports decision-making processes in an increasing variety
of contexts. Within the domain of environmental decision-making, the spatial distribution of impacts, vul- 
nerabilities and consequences associated to different risk-mitigation alternatives calls for a different frame- 
work, i.e. spatial multi-criteria risk analysis. In this paper, we propose the combined use of a hierarchical 
Bayesian modelling approach and Geographic Information Systems to integrate uncertainty and model 
probabilities into an overall risk map. The research aims at testing the operability of the integration between 
Bayesian modelling and spatial analysis in the context of spatial risk processes for resource allocation. To 
this end, we applied the model on a case study dealing with tanker oil spill risk in the Mediterranean Sea. 
The innovative contribution of the study stems from both the context of application point of view and the 
use of Bayesian modelling to calculate not only probabilities but an overall spatial risk measure. 


1 INTRODUCTION 


A rapidly emerging field of research within both 
the discipline of quantitative risk analysis (e.g. Fis- 
chhoff, 2015) and the now broad and consolidated 
literature on spatial decision support systems (e.g. 
Malczewski and Rinner, 2015) is spatial risk
analysis, as demonstrated by the recent special
issue on the topic in the Risk Analysis Journal
(http://www.sra.org/sites/default/files/pdf/SpecialissueproposalforRiskAnalysis_Jun20.pdf).
This growing interest may be explained by the
following factors:
(i) impacts of both positive and adverse events 
have heterogeneous spatial distributions across the 
territory, (ii) land vulnerability is characterized by 
spatial heterogeneity, (iii) risk-mitigation alterna- 
tives also lead to different spatial consequences, 
and (iv) recent advances in both decision model- 
ling and Geographic Information Systems (GIS) 
technologies make it possible to bring the two 
fields together towards the formalization of an 
analytical framework for spatial risk analysis (e.g. 
Ferretti and Montibeller, 2017). 

Despite several applications conducted in different
domains (e.g. natural hazards management,
health issues, etc.), the criteria they employ are
often risk factors with deterministic preference
modelling replacing probabilistic information.
Furthermore, evaluations of spatial risks have often
neglected the multi-dimensional nature of spatial
impacts, such as infrastructure damage, lost lives,
lost crops, etc., which typically occur in such
decision problems (e.g. Ferretti and Montibeller, 2017).

There are several reasons that can explain the 
above weaknesses. First, the lack of publicly acces- 
sible spatial data about vulnerability and prob- 
ability of occurrence spatial distributions; second, 
the need to involve risk analysts to facilitate prob- 
ability elicitation using appropriate protocols (e.g. 
structured expert judgment elicitation protocols, 
Dias et al. (2018)); third, the need to allocate 
enough time to the necessary preliminary phase 
of expert training in probabilities elicitation (e.g. 
Keeney and Winterfeldt, 1991). 

However, preliminary ideas and suggestions on 
how to formalize a framework for spatial risk anal- 
ysis and how to spatially elicit preference informa- 
tion have recently been proposed (e.g. Keller and 
Simon, 2017, Ferretti and Montibeller, 2017). 

In this paper, we propose the combined use 
of a hierarchical Bayesian modelling approach 
(e.g. Kalinina et al., 2016, Spada et al., 2014) and 
GIS to update prior probability maps and obtain 
an overall risk map. The research aims at testing 
the feasibility and operability of the integration 
between hierarchical Bayesian modelling and spa- 
tial analysis in the context of spatial risk processes 
for resource allocation. To this end, we applied the 
model on a case study dealing with tanker oil spill 
risk in the Mediterranean Sea. The main reasons 
for the selection of this context of application 
can be summarized as follows: (i) international
relevance of the phenomenon, (ii) availability of
historical data collected in the PSI’s Energy-related 
Severe Accident Database (ENSAD), and (iii) the 
strong spatial dimension of the phenomenon (e.g. 
the geographical heterogeneity of both vulnerabili- 
ties and impacts in the areas affected by oil spills). 

The innovative contribution of the present study 
stems from two aspects. First, the context of appli- 
cation, i.e. tanker oil spill risk distribution, is an 
innovative one for the use of a Bayesian modelling 
approach. Indeed, Bayesian modelling in the spatial 
domain has been mostly used for ecological studies 
(e.g. He et al., 2006, Tucker et al., 1997), health con- 
cerns (e.g. Saravana Kumar et al., 2017), archeo- 
logical analysis (e.g. Ford et al., 2009), and natural 
hazards (e.g. Liu et al., 2017) to name the most 
recurrent ones. In the area of maritime transpor- 
tation, few applications of the Bayesian approach 
do exist, but they are mainly non-spatial (e.g. Goer- 
landt and Montewka, 2015, Bouejla et al., 2014). 
The second reason why this study is innovative is 
linked to the Bayesian modelling approach. Indeed, 
while many applications use a Bayesian network 
approach (e.g. Landuyt et al., 2015), we test a hier- 
archical Bayesian model not only to update an ini- 
tial estimate of the probability of occurrence of 
tanker oil spills (e.g. Burgherr et al., 2015), using 
information concerning the distribution of past 
accidents in the geographical region under analysis, 
but also to generate an overall risk map. 

The remainder of the paper is organized as follows.
Section 2 provides an overview of the proposed
methodological approach. Section 3 explains
how the method has been applied to model tanker 
oil spill risk in the geographical region under anal- 
ysis and, finally, section 4 discusses conclusions 
and future developments arising from this study. 


2 METHOD 


This study is part of a larger research project 
funded by the Swiss National Science Foundation 
aiming at the development of a methodology able 
to integrate Bayesian modelling, Geographic Infor- 
mation Systems (GIS), multi criteria decision anal- 
ysis and structured expert judgement elicitation 
protocols to comprehensively assess spatial risks 
associated to adverse events. This paper focuses on 
the integration between the Bayesian inference and 
GIS and represents a preliminary test of the oper- 
ability of the proposed approach. 

Bayesian inference is an alternative to classical
statistical inference. In the latter, also known as
frequentist inference, only repeatable events have
probabilities, while in Bayesian inference
probability describes both epistemic and aleatory
uncertainty (e.g. O’Hagan, 2003). Indeed, Bayesian
analysis combines data, represented by the entire
likelihood function, with prior knowledge about the
parameters, which may come from other data sets
or the modeler’s experience and physical intuition
(e.g. Reis Jr and Stedinger, 2005). The a priori
distribution describes what is known before observing
any data, while the likelihood reflects the
information about the parameters contained in the
data. Parameter estimation is made through the
posterior distribution, which is computed using
Bayes’ Theorem (e.g. O’Hagan, 2003):


$$p(\theta \mid y) \propto L(y; \theta)\, p(\theta) \qquad (1)$$


where $p(\theta \mid y)$ is the posterior distribution for the
parameter $\theta$ given the observed data $y$, $L(y; \theta)$ is
the likelihood function, and $p(\theta)$ is the a priori
distribution of parameter $\theta$. The proportionality
($\propto$) indicates that the normalising constant is
omitted; the parameter values are instead sampled
directly from the posterior distribution, commonly
using a Markov Chain Monte Carlo (MCMC) method
(e.g. Andrieu et al., 2003).
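
To illustrate how Equation (1) is used in practice, the following Python sketch draws samples from a posterior with a random-walk Metropolis algorithm, one simple MCMC method. The Poisson likelihood, the Gamma(1, 1) prior, the count data and the tuning values are all assumptions chosen for illustration and are not taken from the case study.

```python
import math
import random


def log_posterior(theta, data, shape=1.0, rate=1.0):
    """Unnormalised log posterior: Poisson likelihood times Gamma(shape, rate) prior."""
    if theta <= 0.0:
        return -math.inf
    log_likelihood = sum(data) * math.log(theta) - len(data) * theta
    log_prior = (shape - 1.0) * math.log(theta) - rate * theta
    return log_likelihood + log_prior


def metropolis(data, n_iter=20000, burn_in=2000, step=0.5, seed=1):
    """Random-walk Metropolis sampler for the rate parameter theta."""
    rng = random.Random(seed)
    theta = 1.0
    current = log_posterior(theta, data)
    samples = []
    for i in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio); only the ratio is
        # needed, so the normalising constant never has to be computed.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if i >= burn_in:
            samples.append(theta)
    return samples


data = [2, 0, 1, 3, 1]                      # hypothetical yearly event counts
samples = metropolis(data)
posterior_mean = sum(samples) / len(samples)
# With a Gamma prior the exact posterior is Gamma(1 + sum(data), 1 + len(data)).
analytic_mean = (1.0 + sum(data)) / (1.0 + len(data))
```

The conjugate prior is used here only because it provides an exact posterior mean against which the sampled value can be checked.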

To combine the Bayesian inference and GIS, dif- 
ferent issues could arise due to the spatial dimen- 
sion of the problem: (i) lack of local data (i.e. spatial 
heterogeneity in the distribution and density of the 
data), and (ii) potential spatial correlation among 
locations, to name the two most relevant ones. To 
cope with these challenges, a spatial hierarchical 
Bayesian model is employed in this framework as 
will be explained in the following paragraphs (e.g. 
Cooley et al., 2007, DiMaggio, 2012, Eastwood 
et al., 2014, Juan et al., 2016). 

A spatial Bayesian hierarchical method produces
parameter ($\theta$) estimates for each individual analysis
unit (e.g. location) by borrowing information from
all analysis units (e.g. Eckle and Burgherr, 2013).
This procedure is known as the Bayesian “borrowing
of strength” effect (e.g. Zhu et al., 2006). In this
way, the approach compensates for the lack of data
for individual analysis units (e.g. Kalinina et al., 2016).

Based on this premise, equation (1) can be
rewritten as follows:


p(\theta \mid y) \propto L(y; \theta)\, p(\theta \mid \phi)\, p(\phi)    (2)


The hierarchical Bayesian approach adds an extra
level to the standard Bayes’ theorem. In this level,
the parameter ($\theta$) is described by a distribution
that is conditional on the hyperparameter ($\phi$).
The distribution of this hyperparameter, $p(\phi)$,
is a hyperprior distribution. Therefore, when
estimating the posterior for the parameter ($\theta$),
information from the hyperprior is used in
addition to the information from the prior and
likelihood.
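
For several analysis units, the hierarchical structure can be written out explicitly. The following expression is a sketch under the assumption that the $n$ units are conditionally independent given the hyperparameter (written here as $\phi$); the index $i$ and the count $n$ are introduced only for illustration.

```latex
p(\theta_1, \ldots, \theta_n, \phi \mid y)
  \;\propto\;
  \Big[ \prod_{i=1}^{n} L(y_i; \theta_i)\, p(\theta_i \mid \phi) \Big]\, p(\phi)
```

Because every $\theta_i$ shares the same $\phi$, data observed at one unit inform the estimates at all other units through the posterior of $\phi$.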

Finally, the spatial heterogeneity in the
distribution and density of the data and the
potential spatial correlation among locations are
included in the spatial Bayesian hierarchical model
as prior knowledge about the parameter ($\theta$).
This is possible because they are commonly
seen as “unexplained variance” in a model (e.g.
DiMaggio, 2012, Eastwood et al., 2014). Based on
this premise, for a set of locations indexed by $i$,
the parameter for each location of interest ($\theta_i$)
is transformed to a log scale (making relationships
additive rather than multiplicative) and is set equal
to an intercept term ($\alpha_i$) plus two random effects:
one non-spatial ($\rho_i$), i.e. the spatially unstructured
latent covariates, and one spatial ($\lambda_i$), i.e. the
spatially correlated latent covariates in the model
(e.g. Rodrigues and Assunção, 2012):


\log \theta_i = \alpha_i + \rho_i + \lambda_i    (3)


The spatially structured component is described
as a conditional autoregressive (CAR) Gaussian
process ($\lambda \sim \mathrm{CAR.normal}(W, \tau_\lambda)$), where the
conditional distribution of each $\lambda_i$, given all the
other $\lambda_j$'s, is normal with mean $\mu$ equal to the
average $\lambda$ of its neighbors and a precision ($\tau_\lambda$)
proportional to the number of neighbors. $W$ is the
matrix of neighbors that defines the neighborhood
structure. The non-spatial component of the model
($\rho_i$) is normally distributed with $\mu = 0$ and
precision ($\tau_\rho$). The model is completed by
assigning additional (hyperprior) distributions to
the precision terms $\tau_\rho$ and $\tau_\lambda$ (e.g. Clements
et al., 2006, Cooley et al., 2007). Each step is
discussed in more detail in the case study section.
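
The conditional structure of the CAR prior described above can be sketched as follows. The function name, the four-cell neighborhood and the numerical values are hypothetical; the sketch only computes, for each unit, the conditional mean (the average of its neighbors) and the conditional precision (a common precision `tau` multiplied by the number of neighbors).

```python
def car_conditionals(values, neighbors, tau):
    """Conditional mean and precision of each spatial effect under an
    intrinsic CAR prior, given the current values of its neighbors.

    neighbors[i] lists the indices of the units adjacent to unit i.
    """
    result = []
    for i, nbrs in enumerate(neighbors):
        n_i = len(nbrs)
        if n_i == 0:
            raise ValueError(f"unit {i} has no neighbors")
        mean_i = sum(values[j] for j in nbrs) / n_i  # average of neighbors
        precision_i = tau * n_i                      # grows with neighbor count
        result.append((mean_i, precision_i))
    return result


# Four hypothetical cells arranged on a line: 0 - 1 - 2 - 3
neighbors = [[1], [0, 2], [1, 3], [2]]
values = [0.2, -0.1, 0.4, 0.0]
cond = car_conditionals(values, neighbors, tau=2.0)
```

Cells with more neighbors receive a higher conditional precision, so they are smoothed more tightly towards the local average.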


3 CASE STUDY 


The proposed methodology has been applied to 
tanker oil spills due to the strong spatial nature of 
the phenomenon (e.g. Vieites et al., 2004, Burgherr, 
2007). 

Crude oil and its refined products are a driving 
factor of many economic activities. Furthermore, 
the oil demand is expected to increase worldwide 
in the coming decades and it will continue to have 
the highest share in the global primary energy mix 
(e.g. International Energy Agency (IEA), 2015). 
However, oil spills are one of the major causes of 
ocean pollution, producing ecological disasters of 
wide public concern. Furthermore, linked to the 
damage caused to the environment are the high 
costs to fisheries, related industries, and tourism in 
the affected areas. This is especially true along the 
major oil tanker transport routes (e.g. Burgherr, 
2007, Psarros et al., 2011). 

In this study, only accidental spills from tankers
were taken into account, whereas spills from acts of
war and operational spillages allowed by
international or national regulations, such as the
International Convention for the Prevention of
Pollution from Ships (MARPOL), were excluded
(e.g. Burgherr, 2007).


The geographical region under analysis is the
Mediterranean Sea including the Bosporus strait.
This area has been selected since (i) it is still the
shortest route from Asia to Europe; (ii) about 16%
of the global maritime traffic and 33% of the
global seaborne oil (almost 8 million oil barrels per
day) is carried through the Mediterranean Sea,
which represents only 0.8% of the ocean surface;
and (iii) it is one of the most affected areas, with a
large number of spill events (REMPEC, 2011).


3.1 Data 


The data used for the case study analysis come from
ENSAD, which is the most authoritative database
for energy-related accidents worldwide (e.g.
Burgherr et al., 2017). ENSAD comprehensively
collects information about accidents in the energy
sector and assigns them to energy chains and
activities within those chains. Accident data go back
to 1970 and cover fossil, nuclear, hydropower and new
renewables technologies. In contrast to databases
that rely on a single or few information sources, the
multitude of sources considered by ENSAD is
thoroughly verified, harmonized, and merged to ensure
(1) a worldwide coverage, (2) consistently high data
quality across regions and over time, and (3) a high
degree of completeness. ENSAD focuses on severe
accidents, which are distinguished from small
accidents based on seven criteria, e.g. ≥5 fatalities,
≥10 injured persons, or ≥10,000 metric tons (t) of
hydrocarbons released, etc. (e.g. Burgherr et al., 2017).
With reference to oil spills in the Mediterranean
Sea including the Bosporus strait, the ENSAD
database registered 106 events, including small
accidents (e.g. <10,000 t spilled oil), in the period
1970-2012. In this study, we updated the
information about oil spills in the Mediterranean Sea
until the end of 2016 by using the following sources:


• Analysis, Research and Information on
Accidents (ARIA) database

• Failure and Accidents Technical information
System (FACTS)

• Hazards Intelligence (HINT)

• Centre of Documentation, Research and
Experimentation on Accidental Water Pollution
(Cedre)

• European Maritime Safety Agency (EMSA)

• Regional Marine Pollution Emergency Response
Centre for the Mediterranean Sea (REMPEC)

• Center for Tankship Excellence (CTX)

• sigma by Swiss Re

• Other sources, such as newspapers, national and
local publications, technical reports, etc.


In addition to severe accidents, accidents
with smaller consequences (<10,000 t and ≥0.1 t)
have also been included. All the accidents collected
from the aforementioned sources were homogenized
prior to analysis in order to avoid possible double
counting. The final dataset comprises a total of
271 fully geo-referenced tanker oil spill accidents
in the Mediterranean Sea including the Bosporus
strait for the period 1970-2016.


3.2 Overview of tanker oil spills in the 
Mediterranean Sea 


Figure 1 displays accidental tanker spills
by year and severity for the period 1970-2016.
Most of the accidents (161) resulted in a release <7
t, followed by 59 and 57 accidents with a release
of 7-700 t and >700 t, respectively. According
to the severe accident definition in ENSAD
(section 3.1), only 16 accidents out of 271 can
be considered severe, 3 of which are even
extreme since they account for a release larger
than 100,000 t. Although the annual maxima of
historically observed spills exhibit some
variability, the data suggest a decreasing trend until
2000 followed by a generally constant spill rate in
the period 2000-2016, which is also confirmed by
decadal averages and is in accordance with previous
studies (e.g. Psarros et al., 2011, Burgherr, 2007).
Furthermore, the relatively constant numbers of
annual spills in the period 2000-2016 could be
explained by the introduction of Regulation
13F of Annex I of MARPOL (The Marine
Environment Protection Committee (MEPC), 2003),
which effectively mandated double hulls for newly
built oil tankers of 5000 deadweight tonnage
and above starting from the year 2000. However, the
general trend in Figure 1 should not be taken as
evidence that a catastrophic spill accident (e.g.
100,000 t or greater) can be excluded in the future;
rather, such an event may be expected
at a lower frequency (e.g. Burgherr et al., 2012).
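
The grouping of spills into the three size classes used in Figure 1 can be sketched as follows. The treatment of the class boundaries (7 t and 700 t both falling into the middle class) is an assumption, and the spill sizes in the example are hypothetical rather than taken from the dataset.

```python
from collections import Counter


def severity_class(spill_tonnes):
    """Assign a spill size (t) to one of the three classes of Figure 1.

    Boundary handling (7 t and 700 t in the middle class) is an assumption.
    """
    if spill_tonnes < 7:
        return "<7 t"
    if spill_tonnes <= 700:
        return "7-700 t"
    return ">700 t"


spills = [0.5, 3, 7, 150, 700, 701, 120000]  # hypothetical spill sizes (t)
class_counts = Counter(severity_class(s) for s in spills)
```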

Figure 2 depicts the spatial distribution of the
oil spills in the Mediterranean Sea including the
Bosporus strait for the period 1970-2016.
Historical observations show different “hot spots”, such
as the Gibraltar and Bosporus straits, the
Peloponnese, the North Tyrrhenian Sea, and the Malta
and Lebanon areas. These hot spots are not always
correlated with the average number of tankers per
year (in 2012, http://medgismar.rempec.org/)
navigating in a location, shown as a 0.5° x 0.5° grid.
Indeed, the Gibraltar and Bosporus straits as well as
the Malta and Peloponnese areas show rather large
average numbers of tankers per year, while the
corresponding numbers are relatively low for the
Northern Tyrrhenian Sea and Lebanon. The “hot spots”
in the latter cases could be explained by the fact that
most of the accidents happened in ports.

The final updated dataset is subdivided into two 
periods: 1970-2011 and 2012-2016. This has been 
done in order to use the former as input for the 
model presented in section 3.3 and the latter as a 
validation of the model. 


3.3 Model setup


Risk is the potential for realization of unwanted,
negative consequences of an event. In other words,
risk can be defined as the product of
probability/frequency and magnitude/severity of
consequences (e.g. Committee on Foundations of Risk
Analysis, 2015). In this study, the number of
accidents per year gives the frequency, while severity
measures the extent of the consequences of each
accident. Furthermore, the frequency of accidents
was normalized by the unit of tanker-year to allow
for comparative evaluation of the modeled parameters
among different areas in terms of accidents per
tanker (e.g. Burgherr, 2007). It is also important to
note that, as a first approximation, the frequency of
spills is not considered a function of the number of
tankers.

Figure 1. Distribution of tanker oil spills by year and severity in the Mediterranean Sea including the Bosporus strait.
The bold black line indicates decadal averages. The colors indicate the number of accidents with the same spill size
(t) in the same year.

Figure 2. Spatial distribution of tanker oil spills by severity in the Mediterranean Sea including the Bosporus strait
in the period 1970-2016. The average number of tankers per year (2012) navigating in the area is shown in a 0.5° x 0.5°
grid (modified from AIS data from REMPEC (http://medgismar.rempec.org/)). White cells indicate areas where
no tanker per year data are available.

Essentially, accidents can be considered rare,
independent events, so that the frequency can be
modeled with a Poisson distribution (e.g. Spada et al.,
2014). The frequency is therefore modeled by
applying the Bayesian procedure described in section 2,
with the likelihood in equation (2) given by the
Poisson model. In a standard Poisson model, the
variance is required to be equal to the mean. However,
in spatial statistics, count data often show
additional variance; the corresponding models are called
over-dispersed Poisson models (e.g. DiMaggio, 2012).
This extra variance is attributed to either spatially
correlated effects or heterogeneity effects (section 2).
The essential idea is that the value estimated at
any given location (e.g. the frequency in one area)
is conditional on the level of neighboring values
(frequencies in the neighboring areas).
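
Over-dispersion can be illustrated with a short simulation. The sketch below is not part of the study: it compares plain Poisson counts with counts whose rate is multiplied by a lognormal random effect (standing in for unexplained heterogeneity), using hypothetical values for the mean rate and the spread of the random effect. The dispersion index (variance divided by mean) is close to 1 in the first case and clearly above 1 in the second.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw one Poisson variate (Knuth's method, adequate for small means)."""
    limit = math.exp(-mean)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def dispersion_index(samples):
    """Sample variance divided by sample mean (1 for a Poisson model)."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return variance / mean


rng = random.Random(42)
# Plain Poisson counts with a fixed rate.
plain = [sample_poisson(3.0, rng) for _ in range(20000)]
# Same base rate, multiplied by a lognormal random effect per observation.
mixed = [sample_poisson(3.0 * math.exp(rng.gauss(0.0, 0.5)), rng)
         for _ in range(20000)]

plain_index = dispersion_index(plain)
mixed_index = dispersion_index(mixed)
```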

Based on these premises, the parameter of
interest, the frequency rate $\lambda$, is modeled as the
combination of a normal distribution and the spatial
random effects (section 2). For each area, the
normal distribution is described by a mean and a
standard deviation. This is the step where
information between areas is exchanged in the hierarchical
model. Both parameters of the hyper-distribution
are modeled with non-informative priors, so that
they have no influence before the data are introduced.
The parameters are modeled with normal and
gamma distributions (e.g. Eckle and Burgherr,
2013), respectively.

The spatial random effects are modeled using a 
combination of the intrinsic conditional autore- 
gressive (CAR) prior, which is smoothed based on 
the weight related to the number of the neighbor- 
ing areas and their tanker-year normalization val- 
ues, and a standard normal prior as described in 
section 2. 

Once the posterior distribution for the mean fre- 
quency in each area is estimated, it is normalized 
by the corresponding tanker-year resulting in the 
expected accidents per tanker. 

The severity is modeled as the absolute amount
of oil spilled (tons) per accident. The expected oil
spill in each area is modeled using a Lognormal
(LOGNO) distribution (e.g. Burgherr et al., 2015).
In this case, as a first approximation, the spill is
considered independent of the area; therefore, no
spatial random effects are included. The LOGNO
distribution is commonly described by two
parameters, namely the scale $\sigma$ and location $\mu$, which
are used to define the mean value and variance of the
distribution (e.g. Spada et al., 2014). In this study,
the LOGNO distribution employed for the Bayesian
hierarchical modeling is described by mean and
precision parameters, where the latter is the inverse
of the variance. Hierarchical models are defined
for the mean and standard deviation of the
posterior and are modeled using normal and gamma
distributions, respectively. All the hyperparameters
of both the mean and standard deviation are thus
modeled with non-informative distributions.
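
The relation between the mean/precision parameterization and the moments of the lognormal distribution can be sketched as follows, assuming that the mean and precision refer to the logarithm of the spill size. The function name and the example values are hypothetical.

```python
import math


def lognormal_moments(mu, precision):
    """Mean and variance of X when log X ~ Normal(mu, variance = 1/precision)."""
    if precision <= 0:
        raise ValueError("precision must be positive")
    sigma2 = 1.0 / precision  # precision is the inverse of the variance
    mean = math.exp(mu + sigma2 / 2.0)
    variance = (math.exp(sigma2) - 1.0) * math.exp(2.0 * mu + sigma2)
    return mean, variance


# Hypothetical values on the log scale.
mean, variance = lognormal_moments(mu=2.0, precision=4.0)
```

A higher precision (smaller variance on the log scale) moves the expected spill size closer to the median, exp(mu).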

The MCMC algorithm is run for 30,000
iterations, following a burn-in of 3,000 updates, which
is also used to train the model, for both the frequency
and severity cases. According to the Gelman-Rubin
diagnostic (e.g. Gelman and Rubin, 1992), the
simulated chains converged adequately for both cases.
Furthermore, the models for both frequency and
severity distributions have been validated using
the Deviance Information Criterion (DIC), which
tests how well the proposed model predicts a
replicate dataset that has the same structure as the
observed one, e.g. the historical observations (e.g.
Gelman, 2003).
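
The convergence check can be illustrated with a minimal implementation of the potential scale reduction factor of Gelman and Rubin (1992). This is a sketch of the basic statistic for equal-length chains, not the exact routine used in the study; the simulated chains are hypothetical. Values close to 1 indicate that the chains have mixed, whereas chains stuck in different regions give values well above 1.

```python
import random


def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for equal-length chains."""
    m = len(chains)
    n = len(chains[0])
    chain_means = [sum(c) / n for c in chains]
    grand_mean = sum(chain_means) / m
    # Between-chain variance B and mean within-chain variance W.
    between = n / (m - 1) * sum((cm - grand_mean) ** 2 for cm in chain_means)
    within = sum(
        sum((x - cm) ** 2 for x in c) / (n - 1)
        for c, cm in zip(chains, chain_means)
    ) / m
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(0)
# Three chains drawn from the same distribution (well mixed).
mixed_chains = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
# Three chains centred on different values (not converged).
stuck_chains = [[rng.gauss(shift, 1.0) for _ in range(2000)]
                for shift in (0.0, 3.0, 6.0)]

r_hat_good = gelman_rubin(mixed_chains)
r_hat_bad = gelman_rubin(stuck_chains)
```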

Finally, the risk in each area is estimated as
the product of the expected accidents per tanker
and the expected amount of oil spilled, giving the
expected spilled oil in tons/tanker for each of the
areas in Figure 3.
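
The final risk calculation can be sketched for a single area as follows. The function and argument names are hypothetical, and the sketch works with point values, whereas the study combines full posterior distributions.

```python
def risk_per_tanker(accidents_per_year, tanker_years, expected_spill_tonnes):
    """Expected spilled oil per tanker for one area.

    The frequency is first normalized by the tanker-year value of the area
    and then multiplied by the expected spill size per accident.
    """
    if tanker_years <= 0:
        raise ValueError("tanker_years must be positive")
    accidents_per_tanker = accidents_per_year / tanker_years
    return accidents_per_tanker * expected_spill_tonnes


# Hypothetical point values for one grid cell.
risk = risk_per_tanker(accidents_per_year=0.05,
                       tanker_years=2000.0,
                       expected_spill_tonnes=40.0)
```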


3.4 Results and discussion 


Figure 3 shows the risk of tanker oil spills in terms
of expected tons of spilled oil per tanker. Areas of
relatively high risk (1E-9 to 1E-8 tons/tanker) can
be identified in the Tyrrhenian, Adriatic and Ionian
Seas, on the north and west sides of Sardinia, in the
area of Mallorca, in the northern part of Libya and
on the north and east sides of Cyprus. In most
cases, these results are driven by comparatively
large historical spills in areas with relatively low
tanker traffic expressed in tanker-years (Figure 2).
Areas with a high spill risk (>1E-8 tons/tanker)
can be identified in the Bosporus strait and in
different port areas along the Mediterranean coasts,
such as in the Northern Tyrrhenian Sea, in
Southern France, on the northern coast of Algeria, on
the eastern side of the Peloponnese area and in Crete.
This reflects the risk during port operations such as
loading and unloading of the tanker and
potential accidents due to grounding or collisions.


Relatively low risk areas (<1E-10 tons/tanker)
can be found in the southern part of Spain,
in the northern part of Algeria, in the
Sicilian Sea, around Malta and in the Peloponnese
region excluding its eastern side. These results
are driven by a generally low number of (or even no)
accidents recorded in the area combined with
relatively high tanker-years (Figure 2). This result could
be explained by the fact that these are open sea areas,
and thus areas with limited possibilities of, for
example, groundings and collisions. Furthermore,
in the relatively low risk areas, the result should not
be taken as evidence that a spill accident can be
excluded in the future, but that a spill event may be
less expected than in areas of higher risk (e.g. the
ones with >1E-8 tons/tanker).

Finally, the remaining areas in the
Mediterranean Sea show a risk level of the same order of
magnitude (1E-10 to 1E-9 tons/tanker).

To validate our model, the oil spill accidents in
the period 2012-2016 have been considered. In this
period, a total of 19 oil spills were
recorded in the Mediterranean Sea, most of them
in the Greek area and in particular in the
Peloponnese and on the eastern Greek coasts.

The largest spills (>700 t) occurred in areas of
relatively extreme risk in the model (Figure 3), while
smaller spills (<7 t) occurred in areas of average
risk, i.e. 1E-10 to 1E-9 tons/tanker, with the
exception of an event in Cyprus (1 t released), which
falls into the relatively high-risk category (1E-9 to
1E-8 tons/tanker). Furthermore, one spill accident of
15 t in the Peloponnese area falls into the relatively
low-risk category (<1E-10 tons/tanker).
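
The risk categories used in this discussion can be expressed as a simple lookup. The thresholds follow the text, whereas the category labels, the handling of values lying exactly on a threshold and the example values are assumptions made for illustration.

```python
def risk_class(risk):
    """Map a risk value (tons/tanker) to the categories used in the text."""
    if risk < 1e-10:
        return "relatively low"
    if risk < 1e-9:
        return "average"
    if risk <= 1e-8:
        return "relatively high"
    return "high"


# Hypothetical modeled risk values at four spill locations.
modeled_risk = [5e-11, 5e-10, 5e-9, 5e-8]
classes = [risk_class(r) for r in modeled_risk]
```

Comparing the class of the cell in which each validation spill occurred with its observed size gives a quick qualitative check of the kind described above.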

Overall, historical observations in the period
2012-2016 are in good agreement with
the model in terms of spill sizes/tanker, thus
validating the proposed approach.

Figure 3. Spatial distribution of the expected oil spill (t)/tanker in the Mediterranean Sea including the Bosporus
strait. The oil spills (t) recorded in the period 2012-2016 used to validate the model are also shown.


4 CONCLUSIONS


In this study, we tested the operability of a spatial 
hierarchical Bayesian model by applying it to study 
the oil spill risk of tankers in the Mediterranean 
Sea. 

A comprehensive and up-to-date database of
oil spill accidents was used both as input to the
model and for validation purposes. In particular,
the results of the validation step confirm that the
proposed methodological approach represents
a promising tool for dealing with quantitative
spatial risk assessments. Indeed, the proper
consideration of the spatial component inherent in
accidents that can have consequences spread over
a geographic region plays a vital role in
supporting spatial monitoring procedures and risk
mitigation measures.

The main benefits associated with the proposed
methodological approach can be summarised as
follows: (i) ability to deliver quantitative results at
high spatial resolution, for example in
comparison to previous studies (e.g. Burgherr et al., 2015),
(ii) possibility to combine expert knowledge with
empirical data (e.g. Choy et al., 2009), (iii) ability
to account for uncertainties and lack of data, and
(iv) possibility of integrating sensitivity analysis
to improve the interpretability of the parameters
(e.g. Roos et al., 2015).

On the other hand, the following limitations 
should be acknowledged. First, we are using a sim- 
plified approach, which does not model the causal- 
ity between influencing parameters (e.g. size of the 
tanker, marine currents, wind, spill cause, type of 
location such as ports, open sea, narrow strait, etc.). 
Second, the model is assessing an average number of 
accidents per year, rather than taking into account 
the actual yearly trend (i.e. traffic patterns). 

Finally, we envisage several future developments 
of this research. The first one refers to the study of 
how structured expert judgement elicitation proto- 
cols could be included in the development of the 
spatial Bayesian model. Indeed, the facilitation of 
the quantitative expression of subjective judgement 
plays a crucial role in the context of probability and 
risk assessments. However, the presence of the spa- 
tial dimension, where basically every pixel of the 
map becomes an alternative to be assessed, makes 
this task a particularly challenging one (e.g. Ferretti 
and Montibeller, 2016), thus opening interesting 
avenues for future research. The second direction
for future developments concerns the combination
of the present methodological approach with
multi-criteria decision analysis to account for multiple
impacts associated with the adverse event (e.g.
Ferretti and Montibeller, 2017). The third direction of
further research concerns the inclusion
of data about spatial vulnerabilities within the
spatial multi-impact Bayesian approach.
ABSTRACT: A global dataset of refinery accidents for the years 1990-2016 was analyzed to evaluate
the capacity of 16 attributes to differentiate between accidents that do and do not cause fatalities. For this
purpose, a Dominance-based Rough Set Approach (DRSA) analysis was carried out. The quality of
approximation and accuracy measures confirmed that the established information table is able to distinguish
outcome levels in terms of fatalities. Furthermore, the suitability of the extracted rules to describe hidden
relationships in the accident dataset was demonstrated. Although the predictive capacity of the decision
rules was not satisfactory, the rules still proved useful for identifying the attributes that contribute most
to assigning an accident to the correct outcome class. In summary, this study provided a number of new and
substantial insights on worldwide refinery accidents, which complement and extend previous findings for
accident frequencies and associated trends as well as different types of consequences.


1 INTRODUCTION 


The field of risk assessment and management is a 
rather young scientific discipline, with most of its 
fundamental ideas, principles, concepts and appli- 
cations going back to the 1970s and 1980s (Aven, 
2016). One of the first articles addressing risk as a 
contemporary societal problem was published in 
Science in 1969 (Starr, 1969). The following publica- 
tions provide a broad discussion of the concept of 
risk (Aven, 2012) as well as foundational issues (Aven 
and Zio, 2014) and important developments in the 
field (Greenberg et al., 2012, Thompson et al., 2005). 

Early developments and applications of
probabilistic risk assessment can be traced back to NASA’s
space program in the 1960s (Cooke, 2009), the
aerospace industry (Keller and Modarres, 2005),
and the development of nuclear energy in the same
period (Otway and Pahner, 1976). In 1975, the
US Reactor Safety Study (WASH-1400) marked
a major milestone in probabilistic risk assessment
(US NRC, 1975), and the approach subsequently
spread to other disciplines and countries
(Rasmussen, 1981).

The need for systematic and consistent
comparative risk assessment of energy technologies has
been recognized since the 1980s (Fritzsche, 1989,
Inhaber, 1979). Since then, it has become a central
element both in the comprehensive evaluation of the
risk performance of energy technologies (Burgherr
and Hirschberg, 2014), and in the broader context
of sustainability assessment (Cinelli et al., 2014,
Hirschberg and Burgherr, 2015, Santoyo-Castelazo
and Azapagic, 2014). Recently, an increased interest
can be observed in assessing the frequencies and
consequences of accidents over time, and in comparing
them among energy chains, chain stages, activities and
infrastructure types as well as for different
configurations and locations. The systematic exploitation of
existing datasets to extract such useful information
is a very important and promising research avenue
(Burgherr et al., 2015, Burgherr et al., 2017, Cinelli
et al., 2017, Spada and Burgherr, 2016).

In the industrial realm, risk assessment and man- 
agement is often carried out according to ISO 31000 
(ISO, 2009a, Luko, 2013), and ISO 31010 provides 
a list of 31 risk assessment techniques (ISO, 2009b, 
Luko, 2014). Other studies present overviews of risk 
analysis methodologies for specific facilities such as 
industrial plants (Tixier et al., 2002) or distinguish 
between formal and informal risk handling strategies 
for utility companies (Mascini and Bacharias, 2012). 
In contrast, Jonkman et al. (2003) and Johansen and
Rausand (2014) address risk metrics from a more 
scientific perspective, but their discussion is also rel- 
evant for industry, authorities and other stakehold- 
ers. Lastly, loss modeling of industrial and insurance 
companies is often based on historic incident and 
accident data and the use of empirical and actuarial 
models and other approaches (Klugman et al., 2008, 
Daniell et al., 2018, Mannan, 2012). 

In this study, the risk of accidents in refineries 
is analyzed. Within the oil chain, accidents during 
transports of crude oil and refined products clearly 
dominate with a share of about 75%, whereas explo- 
ration & production (E&P) and refinery activities fol- 
low distantly with roughly 10% each (Burgherr and 
Hirschberg, 2014). Furthermore, E&P and refinery 
facilities usually represent multi-billion dollar assets, 
property damage losses in these two sectors are the 
dominant contributors in the hydrocarbon industry, 
and refineries exhibit an increasing trend in frequen- 
cies and extent of losses (Marsh, 2016). 

Recent publications address a variety of top- 
ics, including accidents at specific facilities (e.g., 
Chettouh et al., 2016, Mishra et al., 2014, Saleh 
et al., 2014), learning from past accidents (e.g.,
Moura et al., 2017a, Moura et al., 2017b, Russell 
Vastveit et al., 2015), occurrence of major accidents 
(Amyotte et al., 2016), and a sustainability metric 
for petroleum refinery projects (Hasheminasab 
et al., 2018). 

In a previous study, two of the authors analyzed 
the frequencies and consequences (i.e., fatalities, 
injured persons) of refinery accidents among four 
country clusters (Burgherr et al., 2016). This
provided interesting insights and conclusions on how
refinery configuration and regional differences
reflecting the mode of operation are important
factors potentially affecting overall refinery risk.

In the current investigation, this retrospective
analysis is extended to determine the combined
influence and relevance of selected attributes (e.g.,
country group, accident type, event chain steps,
etc.) of refinery accidents on the outcome level
(e.g., fatalities). Such knowledge on the potential
impact of an energy accident can be used as one
piece of information for the selection of response
mode and resource allocation. According to the
class of risk provided by the model, the Decision
Makers (DMs) can decide how many resources
to allocate to the response. Retrospective analysis for
pattern recognition and decision support model
development has recently been applied in different
studies, e.g., investigation of antimicrobial activity
(Patkowski et al., 2014), selection of sustainable
project portfolios (Zaras et al., 2012), and natural
gas accidents (Cinelli et al., 2017), among others.

In the current research, a dataset of refinery 
accidents extracted from PSI’s Energy-related 
Severe Accident Database (ENSAD) was analyzed 
to address three main objectives: 


1. Assess the information structure of accidents 
from ENSAD to distinguish the events that did 
cause or did not cause fatalities. 

2. Investigate the relationships between a set of 
descriptors (attributes) for the energy acci- 
dents and the outcome level (fatalities or no 
fatalities). 

3. Study the capacity of the attributes to
distinguish between accidents that do and do not
cause fatalities.


2 METHODS 


2.1 Accident data 


The refinery accident data used in this study were
extracted from PSI’s ENSAD. ENSAD was
first released in 1998 (Hirschberg et al., 1998)
and has since been regularly updated and extended
(Burgherr et al., 2015, Burgherr and Hirschberg,
2014). Currently, the database is being migrated from
a standalone MS-Access application to a new,
interactive, web-based GIS database named ENSAD v2.0
(Burgherr et al., 2017). In ENSAD, complete energy
chains are considered because accidents can occur
during all stages and activities, and not just during
the actual power and/or heat generation (Burgherr and
Hirschberg, 2008). In general, the focus of ENSAD
is on so-called severe accidents because accidents
with larger consequences are a major concern for
industry and authorities, and also receive the most
attention from the general public (Burgherr and
Hirschberg, 2014). To be classified as severe, an accident
has to fulfil at least one of seven threshold criteria,
for example ≥5 fatalities, ≥10 injured persons, etc.
(Burgherr and Hirschberg, 2008).

For the current study on global refinery
accidents, all accidents in ENSAD with at least one
fatality or one injured person during the years
1990-2016 were taken into account. The lower
fatality and injury thresholds for refinery
accidents were possible because a dedicated search
for smaller accidents was carried out to ensure a
sufficiently complete coverage. This larger
dataset allows a broader set of accident
attributes to be analyzed compared to the previous
study conducted in 2016 (Burgherr et al., 2016).

In total, the refinery accident dataset comprised 
698 accidents, of which 277 resulted in at least one 
fatality, 597 in at least one injured person, and 176 
had both consequences. 
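
These counts are mutually consistent, as can be verified with the inclusion-exclusion principle. In the following check, $F$ and $I$ denote the sets of accidents with at least one fatality and with at least one injured person, respectively; the set notation is introduced here only for illustration.

```latex
|F \cup I| = |F| + |I| - |F \cap I| = 277 + 597 - 176 = 698
```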


2.2 Refinery data 


Information on refinery configuration and charac- 
teristics was retrieved from the World Wide Refin- 
ery Survey (WWRS) of the Oil & Gas Journal (Penn 
Energy Research, 2010). The coupling between the 
ENSAD and WWRS databases was based on geo- 
referenced refinery locations (i.e., geographic coor- 
dinates). WWRS data include details on refinery 
units and processes as well as corresponding capaci- 
ties expressed in barrels per calendar day (bbl/cd), 
the Nelson Complexity Index (NCI) and the Equiv- 
alent Distillation Capacity (EDC). In some cases, an 
accident could not be assigned to a specific refinery 
in the WWRS database: (1) the accident description 
did not allow identifying the correct refinery if sev- 
eral are located in the same area (10 cases); (2) the 
refinery can be located, but is not included in the 
WWRS list, which is mostly the case for small, pri- 
vate refineries (32); or (3) the accident description is 
so incomplete that the refinery can only be assigned 
to a country, but no specific location (8). 

Additionally, five more refinery attributes were 
based on data from Swiss Re, providing additional 
information on regional differences in refinery haz- 
ards. First, each refinery accident was assigned to 
one of four regional country clusters, representing 
different plant operation styles (Pannatier, pers.
comm.). Generally, loss burden differs among
clusters, which is attributable to operational hazard
factors such as mode of operation, attitude towards
safety, turnaround period, maintenance, etc. The
four clusters were given the following names:


— “USA”: USA, Canada, UK, Australia 

— “Europe”: Europe, Singapore, South Korea, 
Japan, Saudi Arabia, Gulf States, Egypt 

— “Russia”: Russia, Former Soviet Union, Eastern 
Europe 

— “Other”: South America, Africa, Maghreb, 
other Middle East, rest of Asia 


Second, all refinery units were allocated to three
complexity classes G1, G2 and G3 to reflect
increasing fire and explosion hazard (see Table 1), based
on a Swiss Re expert categorization (Pannatier, pers.
comm.). The class G0 includes all units that do not
belong to one of the three other classes. An
overview of the refinery units considered and their
assignment to a complexity class is shown in Table 1.

Three refinery unit attributes were defined, 
namely (1) the unit in which the accident started; 
(2) the complexity class of the unit where the acci- 
dent started; and (3) the complexity class of the 
most hazardous unit(s) in a refinery. 

Third, a Swiss Re Hazard Index (SR HI) was 
calculated for all refineries, combining the capac- 
ity, complexity and toxic hazard of a refinery with 
weight factors three, two and one, respectively. 
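
The exact formula of the SR HI is not given here. Purely as an illustration of how the three weight factors could enter such an index, the sketch below assumes a simple weighted sum of three component scores; the function name, the score scale and the example values are hypothetical and should not be read as the Swiss Re definition.

```python
def hazard_index(capacity_score, complexity_score, toxic_score,
                 weights=(3, 2, 1)):
    """Hypothetical weighted-sum hazard index.

    The weights follow the text (capacity 3, complexity 2, toxic hazard 1);
    combining the scores by a plain weighted sum is an assumption.
    """
    w_capacity, w_complexity, w_toxic = weights
    return (w_capacity * capacity_score
            + w_complexity * complexity_score
            + w_toxic * toxic_score)


# Hypothetical component scores for one refinery.
index = hazard_index(capacity_score=2, complexity_score=3, toxic_score=1)
```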


2.3 Overview of accident attributes 


Table 2 shows an overview of the 16 attributes that 
were used to analyze the previously defined data- 
set of refinery accidents. Ten attributes concern 
specific accident characteristics, whereas the other 
six provide information about refinery configura- 
tion and operational hazard. Out of these, three 
were directly taken from the WWRS database, 
and three were established using information from 
Swiss Re and WWRS. 


2.4 Dominance-based Rough Set Approach 
(DRSA) 


The dataset of refinery accidents developed in 
this study can be seen as an information table, 
where characteristics of the accidents are condi- 
tion attributes (independent variables) and the 


Table 1. Assignment of refinery units to complexity classes (CC) G0 to G3. 

G0 G1 G2 G3 

Crude Unit Vacuum Distillation Catalytic Cracking Catalytic Hydro-cracking 
Coking Catalytic Reforming Isomerization Alkylation 

Thermal Cracking Catalytic Hydrotreating Polymerization 
Hydrogen Production Lubes 

Hydrogen Recovery 

Coke 

Sulphur 

Asphalt 
Table 2. Overview and description of refinery attributes. O-U: ordered, but direction unknown; O-K: ordered and direction is known; N-O: not ordered.

Attribute name | Description and values | Unit | Ordering
Year | Year in which an accident occurred: 1990-2016. | yr | O-U
Country Cluster | Country clusters represent different plant operation styles (see section 2.2): USA, Europe, Russia, Other. | - | N-O
Start Unit | Name of refinery unit in which the accident started (see Table 1). | - | N-O
Start CC | Complexity class (CC) of refinery unit where accident started: G0, G1, G2, G3. | - | O-K
High CC | Unit(s) with highest complexity class (CC) in a refinery. | - | O-K
No Acc | Number of accidents that occurred in an individual refinery during the period of observation. | - | O-U
Nelson Complexity Index (NCI) | A measure for refinery complexity, replacement costs, and values (Johnston, 1996 and references therein). | - | O-U
HI | Swiss Re Hazard Index (HI) (see section 2.2). | - | O-U
CRUDE | Crude capacity of refinery. | bbl/cd | O-U
EDC | Equivalent Distillation Capacity is another means of comparing refineries and refinery costs, for which a refinery’s atmospheric distillation capacity is multiplied by its overall complexity rating, resulting in complexity barrels (Johnston, 1996). | cbbl/cd | O-U
Acc Type | Type of accident: Technical Failure (TF), Human Factor (HF), Human-Technical (HT), Natural Hazard (NH), Intentional Attack (IA; not included in current analysis). | - | N-O
EC1 to EC5 | Event chain steps 1 to 5; sequence of events as given in available information sources. In total, 27 values are used, e.g., explosion, fire, release, etc. | - | N-O
Fatalities (i.e., outcome of the accident) | Number of fatalities that an accident caused. Two outcome levels are distinguished: no fatalities vs. fatalities. | - | O-K


presence or not of fatalities is the decision attribute 
or outcome level (dependent variable). To achieve 
the research objectives, Dominance-based Rough 
Set Approach (DRSA) analysis (Greco et al., 
2001, Słowiński et al., 2015) was applied to the
refinery information table using the jRS library 
and jMAF software package (Blaszczynski et al., 
2013). The method is well suited for this case study 
as it can handle quantitative and qualitative infor- 
mation without the need of transforming them 
into numerical or binary values, and it accepts 
inconsistencies in the dataset. 

Another important characteristic of the present dataset is that it is not always known a priori whether the condition attributes are of the gain- or cost-type in relation to the fatalities. Gain-type means that the greater the value of an attribute, the less likely the accident causes fatalities; conversely, cost-type means that the lower the value of the attribute, the less likely the accident causes fatalities. DRSA is capable of discovering this type of information, named global monotonicity (Blaszczynski et al., 2012). Furthermore, local monotonicity can also be discovered, meaning the interval of attribute values in which an attribute shows a gain- or cost-type behavior in the discovered pattern (see Table 3). The applied transformation is non-invasive, i.e., it does not bias the matter of discovered relationships. With reference to the set of refinery attributes, those without an indication of their preference order, for which global and local monotonicity were studied, are “Year”, “No Acc”, “NCI”, “HI”, “CRUDE”, and “EDC” (O-U type in column Ordering, Table 2).

The relationships between the values of the attributes of the energy accidents and the occurrence or not of fatalities were discovered by means of decision rules, which are objective cause-effect patterns hidden in the refinery dataset. A decision rule has the form E → H, namely “if E, then H”. E is the condition part (also called premise or evidence) and H is the conclusion part (also called decision part). Sets of decision rules, which are essential for this analysis, were induced using the Variable Consistency Dominance-based Learning from ExaMples (VC-DomLEM) algorithm (Blaszczynski et al., 2011b). Accidents that consistently cause either fatalities or not are assigned to what are called lower approximations. VC-DomLEM uses accidents that belong to such lower approximations as training examples to induce sets of certain decision rules, which may later be used to classify new accidents.

Sets of rules, induced by VC-DomLEM, may be 
used to construct more accurate ensemble classi- 
fiers in variable consistency bagging (Blaszczynski 
et al., 2009, Blaszczynski et al., 2010). Another 
Table 3. Selected examples of decision rules obtained from the DRSA of the refinery accident dataset for accidents causing fatalities or not. Mbbl: million barrels; Supp.: Support; Conf.: Confidence. For other abbreviations see Table 2.

Year | Country cluster | Start Unit | Start CC | No Acc | NCI | HI | CRUDE (Mbbl/cd) | EDC | Acc Type | Supp. | Conf.

Rules for accidents that did not cause fatalities 

>2002T Europe <54) 11 0.92 

>20107 Other 26.34 <1104 13 0.93 
USA <44 < 961 HT 11 0.92 
USA <0 <18.571 < 149.51 TF 15 0.82 

>20117 Russia z744 13 0.93 

Rules for accidents that caused fatalities 

[2002;2004] Alkylation HF 10 0.91 

[1996;2004] Crude 215214 HF 11 0.85 

< 2007 Î > 18.42 4 [126.9;189] HF 11 0.79 

< 20117 =2 > 21.281 TF 12 0.80 
USA >2 >10.651 = 20.01 TF 15 0.75 


extension of the bagging approach is applied when the analyzed dataset suffers from class imbalance. In our case, Neighbourhood Balanced Bagging (NBBag) was used to increase the predictive capabilities of the constructed rule classifiers (Blaszczynski and Stefanowski, 2015). Moreover, the evaluation strategy consisted of learning the decision rules from a subset of the original accident dataset and testing them on the remaining accidents to check whether they assign the correct fatality class.

The capacity of the attributes to distinguish between accidents causing or not causing fatalities was assessed through a relevance measure called confirmation (Blaszczynski et al., 2011a, Greco et al., 2001). It assesses the degree to which the presence of an attribute in the condition part of a rule confirms correct prediction (i.e., fatalities or no fatalities) on a subset of accidents the rules were not extracted from. The higher the value of the relevance measure (i.e., confirmation measure), the more important the attribute is for correct prediction.


3 RESULTS 


3.1 Quality of classification and accuracy


The ENSAD-based refinery accident dataset is close to unitary classification quality (0.99 on a 0-1 scale), which means that there are very few inconsistent accidents, i.e., accidents that have the same attribute values but differ in whether they caused fatalities. In other words, the 16 attributes selected to develop the information table are relevant to differentiate between accidents causing or not causing fatalities. This is confirmed by the accuracy for not causing or causing fatalities with values of 0.98 and 0.97 (also on a 0-1 scale), respectively.


3.2 Decision rules 


Decision rules represent unique patterns hidden 
within the dataset and they unveil the cause-effect 
relationships between the values of the attributes 
and the outcome level, i.e., the fatalities class in this 
case. Decision rules do not only include important 
condition attributes, but they also contain a minimal 
number of elementary conditions that are necessary 
for representation of the cause-effect relationships 
existing in the information table, from which all 
inessential and redundant information is removed. 
Decision rules are induced by DRSA using the 
VC-DomLEM algorithm. Table 3 shows ten repre- 
sentative rules, five for the description of accidents 
causing fatalities and five for not causing fatalities. 
These decision rules can be viewed as interesting 
patterns among all patterns that can be found to 
describe features of the refinery accident dataset. 
The rules are characterized by useful parameters, including the support (i.e., the number of accidents that support the rule) and the confidence level. All selected rules have a strong confidence level, close to or above 0.75. This indicates that roughly 75% or more of the accidents in the dataset with the stated values for the condition attributes lead to the stated outcome (fatalities or no fatalities). In the following, some rather simple explanatory examples for the interpretation of the rules given in Table 3 are briefly discussed. Accidents after 2002 in the cluster Europe with an NCI smaller than 5.4 caused no fatalities. For Russia the same was only true after 2011 and for a higher NCI threshold, indicating a generally better performance of Europe. The same combination of attributes for the cluster Other ranks between Europe and Russia, but additionally crude capacity has to be lower than 110,000 bbl/cd. In contrast, accidents with fatalities tend to have higher values for NCI, HI, crude capacity and EDC. Additionally, fatal accidents for combinations of these attributes generally appeared more likely to occur in the earlier years of the observation period. Finally, accidents starting in more hazardous units (i.e., higher complexity class) more often cause fatalities, even in more recent years.
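
To make the support and confidence figures of a rule "if E, then H" concrete, the following minimal Python sketch evaluates one rule on a handful of records. The records are invented for illustration and are not taken from ENSAD; the rule shape is loosely modelled on the first "no fatalities" rule discussed above, and the names `Rule` and `support_and_confidence` are hypothetical helpers, not part of jRS or jMAF. Support is counted as the number of accidents that satisfy both E and H, and confidence as the share of accidents satisfying E that also satisfy H.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Accident = Dict[str, object]


@dataclass
class Rule:
    """A decision rule 'if E, then H' over accident records (illustrative only)."""
    conditions: Dict[str, Callable[[object], bool]]  # attribute -> elementary condition
    conclusion: bool  # True = fatalities, False = no fatalities

    def matches(self, accident: Accident) -> bool:
        # The condition part E holds when every elementary condition holds.
        return all(test(accident[attr]) for attr, test in self.conditions.items())


def support_and_confidence(rule: Rule, accidents: List[Accident]) -> Tuple[int, float]:
    """Return (support, confidence) of the rule on the given accidents."""
    covered = [a for a in accidents if rule.matches(a)]
    supporting = [a for a in covered if a["fatalities"] == rule.conclusion]
    confidence = len(supporting) / len(covered) if covered else 0.0
    return len(supporting), confidence


# Toy records, invented for this example.
accidents = [
    {"year": 2005, "cluster": "Europe", "nci": 4.8, "fatalities": False},
    {"year": 2010, "cluster": "Europe", "nci": 5.1, "fatalities": False},
    {"year": 2012, "cluster": "Europe", "nci": 3.9, "fatalities": True},
    {"year": 1998, "cluster": "Europe", "nci": 4.0, "fatalities": True},
    {"year": 2008, "cluster": "USA", "nci": 5.0, "fatalities": False},
]

# Illustrative thresholds: year >= 2002, cluster Europe, NCI <= 5.4 -> no fatalities.
rule = Rule(
    conditions={
        "year": lambda y: y >= 2002,
        "cluster": lambda c: c == "Europe",
        "nci": lambda n: n <= 5.4,
    },
    conclusion=False,
)

support, confidence = support_and_confidence(rule, accidents)
print(support, round(confidence, 2))
```

On these toy records the rule covers three accidents, two of which caused no fatalities, so a confidence below 1 simply means that some covered accidents contradict the conclusion.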


3.3 Relevance of attributes 


The predictive capacity of the rules was tested by 
means of 10-fold cross-validation, but unfortu- 
nately the predictive accuracy of DRSA decision 
rules was not satisfactory. In fact, only 48% of the 
accidents causing fatalities and 53% of those not 
causing fatalities were assigned the correct class. 
Nonetheless, these rules can be used to extract 
the attribute relevance, which indicates to what 
extent attributes are relevant (useful) to distinguish 
between accidents causing fatalities or not. 

The confirmation measure used for the com- 
putation of the attributes relevance quantifies the 
extent to which the presence of an attribute in the 
rule suggests a correct class (Figure 1). Attributes 
CRUDE, Start Unit and the Country cluster are the 
most relevant for discerning between the accidents 
in terms of the outcome. Other attributes which 
are still relevant, though at a lower extent, for cor- 
rect prediction are the first and second event chain 
(EC1, EC2), the highest complexity class (High CC) 
of the unit(s) present in a refinery, the year when the 
accident took place, and the type of accident. 


4 CONCLUSIONS 


For the current analysis, refinery accident data from ENSAD for the period 1990-2016 were combined with selected refinery characteristics using data from the WWRS database and from Swiss Re. The results demonstrated that the application of the Dominance-based Rough Set Approach (DRSA) analysis to such a complex dataset is feasible and useful to gain detailed insights about the structure and to reveal hidden relationships between accident attributes and their outcome in terms of fatalities.

The quality of approximation and accuracy 
measures confirmed that the information table 
based on 16 attributes is useful to differentiate 
between accidents causing fatalities and those not causing fatalities. Furthermore, the extracted decision rules helped
to evaluate hidden relationships in the accident 
dataset that go beyond traditional measures 
such as aggregated risk indicators or frequency- 
consequence curves. In particular, aspects of the 
monotonicity of attributes could be addressed as 
well as how higher (gain-type) or lower (cost-type) 
attribute values contribute to a reduced likelihood 
that an accident results in fatalities. 

Finally, the decision rules were tested with regard to their predictive capacity, i.e., whether the set of rules can assign potential future accidents to the correct outcome level. Although the predictive accuracy was not satisfactory, the rules still proved useful for identifying the attributes that contribute most to assigning an accident to the correct outcome class.

In summary, the current study provided a number of new and substantial insights on global refinery accidents, thus extending and complementing a previous study by Burgherr et al. (2016).

Potential avenues of future research include 
options to enhance the predictive capacity of the 
rules, which could be achieved either by adjusting
the present attributes or by considering additional 
attributes. This is an essential aspect for decision mak- 
ers and other stakeholders. They are, for example, 
concerned with the development of low-probability 
high-impact scenarios, pre- and post-event strategies 
to mitigate potential consequences of future acci- 
dents or the improvement of general safety culture. 
ABSTRACT: Navigational risk is critical to the shipping industry, particularly on inland waterways because of their sensitive environment. To this end, this paper proposes a novel method for navigational risk analysis, incorporating the Analytic Hierarchy Process (AHP), interval type-2 fuzzy sets (IT2FSs), and a similarity measure. In light of a literature review, a hierarchical model for navigational risk analysis is developed with three levels: objective level, criteria level, and factor level. Then, weights of criteria and associated factors are obtained using AHP, and factors are evaluated by experts using IT2FSs. Next, the degrees of similarity between the aggregated experts’ evaluation and 7-member IT2FSs representing 7 grades of risk are obtained and used to determine the degree of navigational risk at the objective level. The application of the proposed method is illustrated by assessing the navigational risk of 5 vessels that cruised in the Yangtze River. The proposed method provides more flexibility in modeling and handling subjective uncertainties in navigational risk analysis. In addition, it identifies the navigational risk of multiple vessels, which is useful for enhancing safety situation awareness.


1 INTRODUCTION 


Risk assessment is a critical issue in the maritime transport system. Due to market demand and technological development, vessels in this system have become more specialized in function, much larger, much faster, and far more numerous. This transformation, however, poses an increasing threat to navigational safety through ship collisions, grounding accidents, etc., and to the ecosystem through emissions, oil spills, the discharge of ships’ ballast water and sediments, etc. It has been estimated that the potential loss from the sinking of a 19,000 TEU containership is up to US$ 1 billion and that it would take two years to remove all the containers from the foundered mega containership (Allianz 2015). Therefore, it is necessary to control and manage such risk. To this end, Formal Safety Assessment (FSA), proposed by the UK in 1993 (Wang 2000), was approved by the International Maritime Organization (IMO) in 2002 as a set of principles of risk management and a systematic process, which has been broadly used and developed in the maritime industry around the world (Görçün & Burak 2015, Hu et al. 2007, Montewka et al. 2014, Zhang et al. 2011, Zhang et al. 2013). The FSA guidelines were last updated in 2015 (IMO 2015); they describe an FSA framework of five modules: identification of hazards; risk analysis; risk control options; cost-benefit assessment; and recommendations for decision-making. The first two modules of FSA are the focus of this paper.

Traditionally, navigational risk has been assessed by statistical methods (Chen et al. 2015, Jin 2014, Kujala et al. 2009, Meng et al. 2014, Wang et al. 2013), simulation (Huang et al. 2017, Lušić & Čorić 2015, Montewka et al. 2010, Qu et al. 2011, Wang et al. 2009), or evaluation of risk factors (Kum & Sahin 2015, Thieme et al. 2017, Tian et al. 2013, Xu et al. 2016, Øien 2001). However, maritime transportation is a large-scale physical and socio-technological system. Because of this complexity, navigational risk assessment requires more specific methods that can effectively deal with epistemic uncertainties, e.g. discord in ship-ship or ship-shore communication, imprecision in ship identification, temporal and spatial differences, etc. These uncertainties have been handled by many methods, e.g. Bayesian Networks (Akhtar & Utne 2014, Brito & Griffiths 2016, Hänninen & Kujala 2012, Martins & Maturana 2013, Zhang et al. 2013), Dempster-Shafer theory of evidence (Li & Pang 2013, Talavera et al. 2013, Zhang et al. 2016), etc.

Navigational risk assessment also faces data uncertainty (Hänninen 2014), i.e. the lack of detailed data and the difficulty of evaluating the credibility and validity of the data, and, in some cases, ambiguity in seafarers’ decision making. To overcome these problems, experts’ judgements have been an appropriate alternative to objective data. Experts’ subjective opinions have been modeled in many studies using fuzzy set theory (Karahalios 2014, Zhao et al. 2009) or integrated fuzzy methods, e.g. Fuzzy Evidence Reasoning (Yang et al. 2014), Fuzzy Analytic Hierarchy Process (AHP) (Andrew et al. 2014, Beşikçi et al. 2016, Celik et al. 2009, Pak et al. 2015), etc. However, most of these studies are based on Type-1 fuzzy sets (T1FSs) introduced by Zadeh (1965), whose membership grade, or height, is a crisp number within [0, 1]. Determining an exact membership function for a fuzzy set is not always possible without a loss of information (Gorzalczany 1987), which often leads to biased conclusions (Liu et al. 2017). To overcome this, the concept of type-2 fuzzy sets (T2FSs) was introduced by Zadeh (1975) as an extension of T1FSs, characterized by primary and secondary membership. Nevertheless, T2FSs have not been widely applied because of their complicated computation. To reduce this heavy computational effort, interval type-2 fuzzy sets (IT2FSs) were proposed by Gorzalczany (1987), in which all values of the secondary membership of T2FSs are equal to 1. IT2FSs not only represent uncertainty better than T1FSs and simplify the computation compared with T2FSs (Hu et al. 2013), but also produce more accurate and robust results (Dereli & Altun 2013). Therefore, IT2FSs have been widely used in decision making, risk analysis, etc.

Regarding risk analysis using IT2FSs, there are 
mainly three methods which are using IF-THEN 
rules (Rahib et al. 2016), ranking IT2FSs (Bozdag 
et al. 2015), and measuring the degree of similar- 
ity (Chen & Chen 2008, Chen & Chen 2009, Chen 
& Sanguansat 2011, Sen et al. 2016, Wei & Chen 
2009). Among them, the last one is the most 
popular. However, selecting a reasonable similar- 
ity measure is an open subject, depending on the 
real application environments (Deng et al. 2011). 
The similarity measure proposed by Chen & San- 
guansat (2011) is adopted in this paper. 

This paper aims to assess navigational risk. A 
hierarchical structure for identifying navigational 
risk factors is constructed with three levels: objec- 
tive level, criteria level, and factor level. Values of 
relative importance of certain criteria with respect 
to the objective, and of relative importance of cer- 
tain factor with respect to corresponding criteria 
are quantified using Analytic Hierarchy Proc- 
ess (AHP) (Saaty 1980). To evaluate risk factors, 
linguistic terms and corresponding IT2FSs are 
utilized. Then, measuring the degree of similar- 
ity between the aggregated evaluation of factors 
and seven grades of risk represented by IT2FSs is 
used to transform the aggregated result into cor- 
responding linguistic term. 


The rest of this paper is organized as follows. 
Section 2 presents a model for navigational risk 
analysis incorporating three levels—objective level, 
criteria level, and factor level, and corresponding 
grades for describing the objective level and evalu- 
ating the factor level. Based on this model, Sec- 
tion 3 illustrates the proposed method. Using the 
proposed method, navigational risk of 5 vessels is 
assessed in Section 4. Section 5 concludes this
paper. 


2 NAVIGATIONAL RISK ANALYSIS 
MODEL 


2.1 Evaluation model for navigational risk 


For navigational risk analysis, a model with three 
levels is established based on Tian et al. 2013, 
Zhang et al. 2011, Zhang et al. 2013, and Zhang 
et al. 2016, and shown in Figure 1. The objective 
level aims to assess the navigational risk; the cri- 
teria level comprises four kinds of criteria—static 
information of vessel, dynamic information of 
vessel, environment, and management; the factor 
level includes 18 factors without directly taking 
into account human factor which is an important 
parameter for navigation safety (Fan et al. 2017, 
Kujala et al. 2009). The reason is that it is difficult 
to directly determine the risk of human because 
human is the source of success as well as failure 
(Hollnagel 2014). In contrast, human behavior is 
partially represented, i.e. dynamic information 
of vessel partially represents watch officers’ deci- 
sion; shipowner stands for organization factor; the 
maintenance situation of navigational aids partially indicates the administration level.

[Figure 1: hierarchy diagram. Objective level: navigational risk. Criteria level: static information of vessel; dynamic information of vessel; information of navigational environment; information of natural environment. Factor level: the 18 factors listed below.]

Figure 1. A hierarchical structure for navigational risk.

Factors in the factor level are as follows:


1. Static information of vessel includes 6 factors: 
shipowner, type, age, scale, cargo, and draught. 
For example, the attribute of cargo may lead to 
different navigational risk. The ratio of channel 
depth to vessel draught generally is larger than 
one, otherwise, the accident of grounding may 
happen. 

2. Dynamic information of vessel contains 4 fac- 
tors: speed, angle of turning, illegal records, 
and traffic situation. For example, if a ves- 
sel has many ship detention times per year by 
Port State Control officer or Flag State Control 
officer, this vessel may be critical from the view 
of department of administration. 

3. Information of navigational environment cov- 
ers 5 factors: density of traffic flow, aid facil- 
ity, channel width, channel bending radius, and 
complexity of channel. For example, vessels 
benefit a lot from navigational aids which are 
performed and maintained well, while little from 
a complicated channel with a lot of bridges over 
it. 

4. Information of natural environment incorpo- 
rates 3 factors: visibility, weather alerts, and 
currents. For example, good visibility normally 
contributes to broader horizon of watching. 
Higher weather alert information poses a sig- 
nificant threat to the navigation safety. 


2.2 Evaluation grades for risk of objective level 


Navigational risk is divided into 7 grades, ranging from very low to very high, described in Table 1 from the perspective of frequency and consequence of accidents. These 7 linguistic terms are represented by 7-member trapezoidal IT2FSs, $LT_k$, $k \in \{1, 2, \ldots, 7\}$, shown in Table 2, which are adapted from Chen & Chen 2008, Liu 2011, and Wei & Chen 2009 by deleting absolutely low and absolutely high. These two absolute terms are ignored because exactly defining absolute situations in navigational risk analysis is not only difficult from a systematic point of view, but also impractical in reality.


2.3 Evaluation grades for risk of factor level 


Factor risk is divided into 3 grades depending on the factor attribute, the ad hoc condition, and experts’ views. The 3 grades for the factors with respect to the 4 kinds of criteria are described in Table 3, Table 4, Table 5, and Table 6, respectively. These 3 grades are represented by the 3-member trapezoidal IT2FSs L, M, and H listed in Table 2. The reason of using these 3-member trapezoidal IT2FSs is for

Table 1. Grades of navigational risk. 


Linguistic term | Explanation of linguistic term
Very low (VL) | The risk is very low because the frequency or the consequence of accidents is very low.
Low (L) | The risk is low because the frequency or the consequence of accidents is low.
Fairly low (FL) | The risk is fairly low because the frequency or the consequence of accidents is fairly low.
Medium (M) | The risk is medium because the frequency or the consequence of accidents is medium.
Fairly high (FH) | The risk is fairly high because the frequency or the consequence of accidents is fairly high.
High (H) | The risk is high because the frequency or the consequence of accidents is high.
Very high (VH) | The risk is very high because the frequency or the consequence of accidents is very high.


Table 2. 7-member linguistic terms and their corre- 
sponding trapezoidal IT2FSs. 


k | Linguistic term $LT_k$ | Trapezoidal IT2FS
1 | Very low (VL) | [(0, 0, 0.02, 0.07; 1.0), (0, 0, 0.02, 0.07; 0.8)]
2 | Low (L) | [(0.04, 0.1, 0.18, 0.23; 1.0), (0.04, 0.1, 0.18, 0.23; 0.8)]
3 | Fairly low (FL) | [(0.17, 0.22, 0.36, 0.42; 1.0), (0.17, 0.22, 0.36, 0.42; 0.8)]
4 | Medium (M) | [(0.32, 0.41, 0.58, 0.65; 1.0), (0.32, 0.41, 0.58, 0.65; 0.8)]
5 | Fairly high (FH) | [(0.58, 0.63, 0.80, 0.86; 1.0), (0.58, 0.63, 0.80, 0.86; 0.8)]
6 | High (H) | [(0.72, 0.78, 0.92, 0.97; 1.0), (0.72, 0.78, 0.92, 0.97; 0.8)]
7 | Very high (VH) | [(0.93, 0.98, 1.0, 1.0; 1.0), (0.93, 0.98, 1.0, 1.0; 0.8)]


simplification of the experts’ evaluation of the risk of each factor for a specific vessel.


3 PROPOSED METHOD 


The motivation behind the development of the proposed method is to improve flexibility in navigational risk analysis. Flexibility is increased because IT2FSs better model and handle the uncertainty of experts’ judgements when the exact membership function of T1FSs is unknown. The four steps of the proposed method are as follows:




Table 3. Grades of risk of factor with respect to static information of vessel. 
Grade of risk 

Factor L M H 

Type general cargo dangerous cargo 

Age between 10 years and 15 other larger than 25 years or unknown 
years 

Draught largely less than channel normally less than channel slightly less than channel depth 
depth epth 

Cargo container general cargo dangerous cargo, unloaded, or 

overloaded 
Shipowner government corporation individual 
Scale largely matching the channel normally matching the channel slightly matching the channel 


Table 4. Grades of risk of factor with respect to dynamic information of vessel. 


Factor | Grade of risk: L | Grade of risk: M | Grade of risk: H
Speed | safety speed, or obey the local regulation | other | largely less or larger than safety speed, or seriously violate the local regulation
Traffic situation | no foreign vessels | one foreign vessels around own ships | multiple foreign vessels around own ships
Angle of turning | equal to or less than 15 degree per second | less than 30 degree but larger than 15 degree per second | equal to or larger than 30 degree per second
Illegal records | less than 2 times per year | 2 times per year | larger than 2 times per year or unknown

Table 5. Grades of risk of factor with respect to information of navigational environment.

Factor | Grade of risk: L | Grade of risk: M | Grade of risk: H
Navigational Aids | perfect in quality and enough in quantity | other | deficiency in quality and quantity
Complexity of channel | other | one obstructions in the channel | multiple obstructions in the channel
Channel bending radius | largely bigger than ship length | slightly larger than ship length | slightly larger than ship length
Channel width | largely bigger than ship length | slightly larger than ship length | slightly larger than ship width
Density of traffic flow | sparse | normal | dense

Table 6. Grades of risk of factor with respect to information of natural environment.

Factor | Grade of risk: L | Grade of risk: M | Grade of risk: H
Weather alerts | fourth level weather alert or no weather alert | second or third level weather alert | first level weather alert
Visibility | good | normal | bad
Currents | advection or no current | other | turbulent current

Step 1: Determine the evaluation model
Step 2: Calculate weights of criteria and factors
Step 3: Aggregate the degree of risk
Step 4: Measure the degree of similarity


Step 1: Determine the evaluation model 

A hierarchical structure model including objective, 
several criteria, and some factors, is established 
to analyze navigational risk in an area during a 
specified period, e.g. a three-level model shown in 
Figure 1. 


Step 2: Calculate weights of criteria and factors 

In the established model shown in Figure 1, the weights of the criteria with respect to the objective and the weights of the factors with respect to their corresponding criteria are obtained by AHP because of its simplicity of implementation, although AHP has some limitations (Ivanco et al. 2017). To this end, experts’ judgements are elicited. For details of the steps of AHP, please refer to Bian et al. 2017, Deng et al. 2014, Zhang 2011, and Zhou et al. 2017. Finally, the weight of a factor with respect to the objective, i.e. the global weight of that factor, is the product of the weight of the factor with respect to its criterion and the weight of that criterion with respect to the objective.
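
The weighting in Step 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pairwise comparison matrices below are hypothetical (they are not the experts' judgements used in this paper), and the priority vector is approximated by normalised row geometric means, a common stand-in for Saaty's principal-eigenvector computation. The names `ahp_weights`, `criteria_w`, `local_w` and `global_w` are introduced here for the example.

```python
import math


def ahp_weights(matrix):
    """Approximate the AHP priority vector by normalised row geometric means."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Hypothetical pairwise comparisons of the four criteria (Saaty 1-9 scale).
criteria = ["static", "dynamic", "nav_env", "nat_env"]
criteria_matrix = [
    [1, 1 / 2, 1 / 3, 1],
    [2, 1, 1 / 2, 2],
    [3, 2, 1, 3],
    [1, 1 / 2, 1 / 3, 1],
]
criteria_w = dict(zip(criteria, ahp_weights(criteria_matrix)))

# Hypothetical comparisons of the three natural-environment factors.
factors = ["visibility", "weather_alerts", "currents"]
factor_matrix = [
    [1, 2, 4],
    [1 / 2, 1, 2],
    [1 / 4, 1 / 2, 1],
]
local_w = dict(zip(factors, ahp_weights(factor_matrix)))

# Global weight of a factor = local weight x weight of its criterion.
global_w = {f: local_w[f] * criteria_w["nat_env"] for f in factors}

for name, weight in global_w.items():
    print(f"{name}: {weight:.4f}")
```

Because the local weights of the factors under one criterion sum to 1, their global weights sum to the weight of that criterion, and the global weights of all 18 factors sum to 1.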


Step 3: Aggregate the degree of risk 

Regarding vessel i, $i \in \mathbb{N}^+$, the degree of risk of
factor j, $\tilde{Z}_j$, j = {1, 2, ..., 18}, is evaluated using
L, M, and H shown in Table 2 and according to
the grades of risk defined in Table 3, Table 4,
Table 5, and Table 6. For example, given that an
unloaded 1000 GT oil tanker is cruising in the Wuhan
section of the Yangtze River during the dry season
at a speed of 10 kn and the visibility is over
1.5 km, the risk grades of some factors are as follows:
“Type” and “Cargo” are H according to Table 3;
“Speed” is L according to Table 4; “Navigational
Aids” is L, “Channel width” is M, “Channel bending
radius” is L, and “Complexity of channel” is H
according to Table 5; “Visibility” is M according
to Table 6.

After the 18 factors are evaluated, the degree of risk
of the specified vessel i, $\tilde{R}_i$, is obtained by aggregating
the global weights of the factors determined in Step 2 and
the evaluations of the factors $\tilde{Z}_j$, based on a trapezoidal
interval type-2 weighted averaging (TIT2-WAA)
operator (Hu et al. 2013). Note that the aggregation
result is also a trapezoidal IT2FS; for the proof,
please refer to Hu et al. (2013). Therefore, it is
necessary to further analyze what the aggregation
result stands for.
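The aggregation can be sketched as follows, assuming the usual arithmetic for trapezoidal IT2FSs: the four reference points are combined by a weighted sum and the heights by taking the minimum. This is an illustrative reading, not necessarily the exact operator of Hu et al. (2013). The linguistic terms below are hypothetical placeholders and not the values of Table 2; the weights and grades are those of vessel Fufa 888 as read from Tables 8 and 9 of the case study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrapezoidalIT2FS:
    """Upper and lower membership functions as (a1, a2, a3, a4, height)."""
    upper: tuple
    lower: tuple


# HYPOTHETICAL linguistic terms (placeholders, not the values of Table 2).
TERMS = {
    "L": TrapezoidalIT2FS((0.0, 0.1, 0.2, 0.3, 1.0), (0.0, 0.1, 0.2, 0.3, 0.8)),
    "M": TrapezoidalIT2FS((0.3, 0.4, 0.6, 0.7, 1.0), (0.3, 0.4, 0.6, 0.7, 0.8)),
    "H": TrapezoidalIT2FS((0.7, 0.8, 0.9, 1.0, 1.0), (0.7, 0.8, 0.9, 1.0, 0.8)),
}

# (factor, global weight from Table 8, grade of Fufa 888 as read from Table 9)
FUFA_888 = [
    ("Type", 0.0214, "M"), ("Age", 0.0086, "M"), ("Draught", 0.0335, "L"),
    ("Cargo", 0.0512, "H"), ("Shipowner", 0.0057, "M"), ("Scale", 0.0135, "M"),
    ("Speed", 0.0708, "L"), ("Traffic situation", 0.1524, "H"),
    ("Angle of turning", 0.0163, "L"), ("Illegal records", 0.0282, "H"),
    ("Navigational aids", 0.1298, "M"), ("Complexity of channel", 0.0484, "H"),
    ("Channel bending radius", 0.0251, "M"), ("Channel width", 0.0527, "L"),
    ("Density of traffic flow", 0.0484, "M"), ("Weather alerts", 0.1633, "M"),
    ("Visibility", 0.0327, "M"), ("Currents", 0.0980, "M"),
]


def weighted_average(evaluations, terms=TERMS):
    """Weighted average of trapezoidal IT2FSs (weights are renormalized)."""
    total = sum(weight for _, weight, _ in evaluations)

    def combine(side):
        points = tuple(
            sum(w * getattr(terms[g], side)[k] for _, w, g in evaluations) / total
            for k in range(4)
        )
        height = min(getattr(terms[g], side)[4] for _, _, g in evaluations)
        return points + (height,)

    return TrapezoidalIT2FS(combine("upper"), combine("lower"))


risk = weighted_average(FUFA_888)
```

Because the placeholder terms differ from Table 2, the numbers in `risk` are not expected to match Table 10; only the structure of the result (four reference points plus a height for each membership function) is the same.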


Step 4: Measure the degree of similarity 

To transform $\tilde{R}_i$ in Step 3 into the corresponding
linguistic term represented by $\widetilde{LT}_k$, k = {1, 2, ..., 7},
shown in Table 2, the degree of similarity between $\tilde{R}_i$
and $\widetilde{LT}_k$, $S(\tilde{R}_i, \widetilde{LT}_k)$, k = {1, 2, ..., 7}, is calculated
based on the measure proposed by Chen & Sanguansat
(2011).

According to Chen & Sanguansat (2011), the
larger the value of $S(\tilde{R}_i, \widetilde{LT}_k)$, the greater the
similarity between $\tilde{R}_i$ and $\widetilde{LT}_k$.


4 CASE STUDY 


The proposed method was verified through a
retrospective analysis of 5 vessels that cruised
in the Yangtze River in 2010; the 5 accidents
involving them are taken from CJMSA (2010). Table 7
shows that the 5 accidents in question differ in
grade and type.


[Step 1]: A three-level model is created as shown in
Figure 1, in which the objective is to evaluate the
navigational risk of the 5 vessels listed in Table 7
that cruised in the Yangtze River. Four kinds of
criteria and eighteen factors are derived from
Tian et al. 2013, Zhang et al. 2011, Zhang et al.
2013, and Zhang et al. 2016.

[Step 2]: Using AHP, a series of pairwise comparisons
with respect to the criteria and factors is
performed by a three-expert committee to calculate
the weights of the criteria with respect to the
objective and the weights of the factors with respect
to their corresponding criteria.

Taking the calculation of weights of criteria with 
respect to the objective as an example, the pairwise 
comparison matrix A is as follows: 


$$
A = \begin{pmatrix}
1 & 1/2 & 2/3 & 1/3 \\
2 & 1 & 4/3 & 2/3 \\
3/2 & 3/4 & 1 & 2 \\
3 & 3/2 & 1/2 & 1
\end{pmatrix}
$$


The largest eigenvalue of matrix A is 4.2492. 
The normalized eigenvector belonging to the larg- 
est eigenvalue of matrix A is: w = (0.1338, 0.2677, 


Table 7. Basic information of 5 vessels.

i  Vessel name        Time        Type of accident  Grade of accident
1  Laoxiahe 828       2010.01.01  Grounding         FH
2  Fufa 888           2010.01.04  Collision         H
3  Jinzhou 656        2010.01.14  Collision         VH
4  Yuhan 805          2010.02.10  Grounding         FH
5  Xinpingjiang 1013  2010.03.05  Grounding         M
Table 8. Weights of criteria and factors.

Criteria level                                    j   Factor level                  Local weight  Global weight
Static information of vessel (0.1338)             1   Type (VT)                     0.1596        0.0214
                                                  2   Age (VA)                      0.0641        0.0086
                                                  3   Draught (VD)                  0.2504        0.0335
                                                  4   Cargo (VC)                    0.3825        0.0512
                                                  5   Shipowner (SO)                0.0428        0.0057
                                                  6   Scale (VSP)                   0.1006        0.0135
Dynamic information of vessel (0.2677)            7   Speed (VSE)                   0.2643        0.0708
                                                  8   Traffic situation (TS)        0.5693        0.1524
                                                  9   Angle of turning (AT)         0.0609        0.0163
                                                  10  Illegal records (IR)          0.1055        0.0282
Information of navigational environment (0.3045)  11  Navigational aids (NA)        0.4263        0.1298
                                                  12  Complexity of channel (CC)    0.1591        0.0484
                                                  13  Channel bending radius (CB)   0.0823        0.0251
                                                  14  Channel width (CW)            0.1732        0.0527
                                                  15  Density of traffic flow (DT)  0.1591        0.0484
Information of natural environment (0.2940)       16  Weather alerts (WA)           0.5556        0.1633
                                                  17  Visibility (NV)               0.1111        0.0327
                                                  18  Currents (NC)                 0.3333        0.0980


0.3045, 0.2940)^T. Since the ratio of the consistency
index to the random consistency index, 0.0923, is less
than 0.1, the constructed pairwise comparison
matrix A is considered acceptable (Zhou et al.
2017), and the entries of the normalized eigenvector
are the weights of the criteria with respect to the
objective. Similarly, the weights of the factors with
respect to each criterion are obtained by AHP.
After that, the weight of a factor with respect to the
objective is obtained by multiplying the weight of
the factor with respect to its criterion by the
weight of that criterion with respect to the objective.
The weights of criteria and factors are shown in
Table 8.

[Step 3]: According to the retrospective analysis
of accident reports (CJMSA 2010), the factors in
[Step 2] are evaluated by the expert committee using
the 3-member linguistic terms L, M, and H shown
in Table 4. The evaluations of the factors with
respect to the 5 vessels are shown in Table 9.
Regarding vessel i, the degree of risk, $\tilde{R}_i$, is
obtained by aggregating the global weights of factors
tabulated in Table 8 and the evaluations of factors $\tilde{Z}_j$
shown in Table 9 using the TIT2-WAA aggregation
operator; the result is shown in Table 10.

Table 9. Evaluation of factors for each vessel.

Factor level  Laoxiahe 828  Fufa 888  Jinzhou 656  Yuhan 805  Xinpingjiang 1013
VT            M             M         M            M          H
VA            H             M         M            M          L
VD            M             L         L            M          H
VC            H             H         H            M          H
SO            M             M         M            M          M
VSC           M             M         H            H          M
VSP           H             L         L            H          L
TS            L             H         H            L          L
AT            H             L         L            H          H
IR            H             H         H            H          H
NA            H             M         M            H          H
CC            H             H         H            H          H
CB            L             M         H            L          L
CW            H             L         M            H          M
DT            M             M         L            L          L
WA            M             M         L            L          L
NC            M             M         H            M          M
NV            M             M         M            M          L

[Step 4]: Based on the similarity measure proposed by
Chen & Sanguansat (2011), the results of the degree
of similarity $S(\tilde{R}_i, \widetilde{LT}_k)$ with respect to each vessel
are shown in Figure 2. According to Chen &
Sanguansat (2011), $\tilde{R}_i$ can be transformed into the
linguistic term represented by $\widetilde{LT}_k$ that is most
similar to it. Therefore, the risk of these 5 vessels
is as follows: Laoxiahe 828, Medium; Fufa 888,
Medium; Jinzhou 656, Medium; Yuhan 805,
Medium; Xinpingjiang 1013, Medium.
Table 10. Aggregation of experts’ judgement for each vessel.

Vessel name        $\tilde{R}_i$
Laoxiahe 828       [(0.4327, 0.5052, 0.6471, 0.7054; 1.0), (0.4327, 0.5052, 0.6471, 0.7054; 0.8)]
Fufa 888           [(0.3836, 0.4600, 0.6060, 0.6669; 1.0), (0.3836, 0.4600, 0.6060, 0.6669; 0.8)]
Jinzhou 656        [(0.3937, 0.4612, 0.5888, 0.6438; 1.0), (0.3937, 0.4612, 0.5888, 0.6438; 0.8)]
Yuhan 805          [(0.3549, 0.4224, 0.5466, 0.6016; 1.0), (0.3549, 0.4224, 0.5466, 0.6016; 0.8)]
Xinpingjiang 1013  [(0.3112, 0.3763, 0.4913, 0.5447; 1.0), (0.3112, 0.3763, 0.4913, 0.5447; 0.8)]
Figure 2. Results of similarity between aggregated
experts’ evaluation and 7 grades of navigational risk.


5 CONCLUSION 


A novel method using IT2FSs for navigational risk
analysis is presented in this paper. The main features
of the proposed method are as follows: firstly,
weights of risk factors are obtained using AHP;
secondly, grades of the objective and factors are
represented by IT2FSs; thirdly, the degree of risk is
determined by a similarity measure.

The proposed model has limitations: it does not
directly consider human factors; the factors considered
are evaluated using only 3-member linguistic terms for
simplification; the weights of criteria with respect to
the objective and the weights of factors with respect to
criteria are not constant but vary from case to case; and
the resulting grades of risk do not exactly match those
observed in reality. Nevertheless, the case study of
5 vessels illustrates the application of the proposed
method.

In future work, we will specify measures to
alleviate or mitigate the identified risk in a timely
and effective manner, which is the third module of FSA.
dams in mountain regions of OECD countries
ABSTRACT: High safety standards for dams in Switzerland require continuous assessment of their
structural stability and of the effectiveness of their warning systems. Therefore, risk assessment needs to be
performed to estimate consequences, e.g. Life Loss (LL), in case of a dam accident. With the life-loss estimation
methods available nowadays, e.g. the LIFESim system, physical processes within the specific dam-failure
event can be simulated. This study demonstrated the importance of adjusting the LL rates in LIFESim
to reflect study-specific characteristics of the dam type and failure mode. In particular, for application to
Swiss dams, alternative LL rate distributions were built based on historical events of concrete and masonry
dams in the mountain regions of OECD countries. The alternative LL rate distributions had different
shapes and frequency ranges than those recommended in LIFESim. A simulation example of a hypothetical
dam failure showed that the alternative and recommended LL rates lead to different LL estimates.


1 INTRODUCTION 


Switzerland is the country with the highest density 
of dams in the world. Most of the dams were built 
to generate hydro-electricity (90% of all dams), but 
they also play an important role as flood control 
facilities. Swiss dams are constructed and operated 
under high safety standards. Historically, no failures
of Swiss dams have occurred, where a dam failure
is defined as a collapse or movement of part of a
dam or its foundation such that the dam can no
longer retain water (ICOLD, 2016). However, the
ageing of many facilities and increasingly strict
safety standards require dam engineers and operators
to regularly assess and update the potential risks
associated with dams. An example is the assessment
of consequences in terms of Life Loss (LL) due to a
hypothetical dam failure. The results of such a risk
assessment can help to justify, for example, costly 
facility upgrades or investments in the warning 
system required for risk mitigation. 

To provide a transparent and quantifiable way 
to determine consequences, e.g. LL, associated 
with dam failures and subsequent floods, various 
methods and models are available nowadays. 


1.1 Methods for life-loss estimation 


Methods for LL estimation differ in complexity 
and modeling principles. Most of the approaches 
are purely empirical and LL estimates are based 
on regressions of Population At Risk (PAR) as 
a function of the whole downstream population 


and heterogeneous Warning time (Wt) (e.g. Lee 
et al., 1986; Brown and Graham, 1988). Another 
approach by Graham (1999) is more sophisticated 
and provides LL rates for a mix of subgroups of 
PAR based on Wt, flood severity, and warning 
effectiveness. 

However, empirical approaches have limitations,
which can be summarized as follows (McClelland
and Bowles, 2002). Firstly, empirical methods do
not differentiate between characteristics of the
dam failure (e.g. breach propagation or instantaneous
failure) and flood severity, which might
lead to an underestimation of the impact of the
flood. Secondly, information on PAR, building
structures, flow quantities, etc. represents averages
and is not site-specific. This strongly affects the LL
results, because they depend on the gender and age
distribution of PAR (Salvati et al., 2018). Thirdly,
evacuation is not modelled and Wt is considered as 
a single value, which affects the number of people 
that are exposed to the flood and in turn the LL 
estimates. 

To overcome these limitations in LL estima- 
tion, it is necessary to simulate physical processes 
and interactions, which cannot be achieved with 
empirical methods only. For this purpose, complex 
Geographical Information Systems (GIS)-based 
models have been developed in recent years allow- 
ing the dynamic simulation of flood consequences, 
including estimation of LL due to a dam failure. 
In contrast to empirical methods, these models 
estimate LL using modules with databases about 
evacuation, warning time, and loss of shelter to 
consider site-specific conditions. The best-known
models are the Life Safety Model (LSM) (British
Columbia, 2006) and the HEC-LIFESim model
(USACE, 2017a). LSM is an agent-based model 
requiring detailed information for simulation, and 
thus it is more suitable for studies that model the 
behavior of the individual receptor (i.e. microscale 
simulation of the impact on a person or vehicle). 
In contrast, LIFESim scales up the simulation 
from the microscale to the mesoscale, i.e. simula- 
tion for the zone; therefore, it is more suitable for 
LL estimation for the specific area downstream of 
the dam. 


1.2 Study goals 


As discussed, the application of physically-based 
models in dam risk assessments could lead to bet- 
ter LL estimation. However, available models were 
developed by U.S. institutions that used LL rates 
empirically derived from historical data on flood 
events that mostly happened in the USA (see Sec- 
tion 2.2). The direct application of these LL rates to 
Switzerland would provide a potentially question- 
able approximation, since it has been shown that 
accident frequency and severity strongly depend 
on dam characteristics and location (Kalinina 
et al., 2017). 

Therefore, this study aims to develop LL-rate
distributions that can be considered representative
for the topographical conditions and characteristics
of dams in Switzerland. These LL-rate distributions
are used in modular LL estimation models
to indicate the proportion of life loss, P, for different
flood zones and the corresponding relative frequency
of exceeding this rate. To build these distributions,
historical failures of concrete and masonry
dams in mountain regions of OECD countries
were used, which can also be considered representative
for Switzerland. Furthermore, the calculated
LL rate distributions were used as input to the 
HEC-LIFESim model to simulate the LL result- 
ing from an instantaneous dam failure. In this way 
it was possible to demonstrate the robustness of 
the physically-based model and the sensitivity of 
LIFESim simulation results to different LL rates. 
Therefore, this study provides quantitative insights 
on the recommendation of McClelland and Bowles 
(2002) that the historical observations underlying 
the method for LL estimation should be adjusted 
according to the type of event that is likely for a 
particular study setting. 


2 METHOD 


In this section, the spatial modular software HEC- 
LIFESim (USACE, 2017a) is introduced. Then, 
the motivation for developing alternative LL rates 


to be applied for failures of large concrete and 
masonry dams in mountain areas is explained and 
the methodology for constructing the alternative 
LL-rate distributions is given. Finally, the simulation
example built to demonstrate the effect of
different LL rates on LL estimates is
described.


2.1 HEC-LIFESim software 


HEC-LIFESim (or LIFESim throughout this
article) is software developed by the Hydrologic
Engineering Center (HEC) of the U.S. Army
Corps of Engineers (USACE, 2017a). LIFESim is 
a spatial dynamic system for modeling life loss or 
economic consequences of a natural, dam or dike 
flood event. LIFESim is a modular system consist- 
ing of four modules built around databases. The 
modules exchange data through a geo-database 
with various information layers and tables, as 
shown in Figure 1. 

Within LIFESim the aforementioned data can 
be combined with other GIS layers such as ESRI 
maps (ESRI, 2017), which allows for simulations 
that can represent real world conditions in a more 
realistic and accurate manner. In other words, 
LIFESim can overcome some of the limitations of 
the purely empirical approaches for modeling LL 
of a dam failure. 

The four modules of the LIFESim system are
represented as blocks in Figure 1: 1) flood routing
module, 2) loss of shelter module, 3) warning and
evacuation module, and 4) loss of life module. The
flood routing module of LIFESim interfaces with
an existing flood routing model (e.g. HEC-RAS
5.0.3 (USACE, 2017c)); using a hydraulic and
timing editor, it imports time series of flow quantities
from the hydraulic data source, e.g. depth time
series at different points of the inundated area.

Figure 1. Simplified representation of the LIFESim
approach for LL estimation (modified from Bowles,
2007). Elements shown: flood routing module (external
software); flood flow quantities; evacuation routes;
population distribution; warning curves; warning and
evacuation module; loss of shelter module; flood zones;
population redistribution; loss of shelter level; life
loss rates; results: life loss estimates.

The loss-of-shelter module simulates the expo- 
sure of people in buildings during the flood event. 
For this, different flood zones are assigned to build- 
ings and levels of buildings in the inundated area. 
In each flood zone, the physical flood environment 
is different, which is reflected in different historical 
rates of life loss. The three flood zones are defined 
by McClelland and Bowles (2000) by the interplay 
between available shelter and local flood depths 
and velocities, and can be summarized as follows: 


- Chance zones, in which flood victims are typically
swept downstream or trapped underwater,
and survival depends largely on chance;

- Compromised zones, in which the available shelter
has been severely damaged by the flood,
increasing the exposure of flood victims to violent
floodwaters;

- Safe zones, which are typically dry, exposed to relatively
quiescent floodwaters, or exposed to shallow
flooding unlikely to sweep people off their feet.


As input for the loss of shelter module, 
LIFESim utilizes the datasets of flow quantities in 
the simulation domain and the structure invento- 
ries obtained, for example, from HAZUS MH data 
(Federal Emergency Management Agency, 2003). 
Stability criteria for structures are set by default in 
LIFESim and can be changed for the specific study. 

The warning and evacuation module simulates
the redistribution of the population at risk from
its initial distribution at the time when
the warning is issued to a new distribution with
assigned flood zones when the flood arrives. For
this module, the following information is required:
GIS information on the road layout, for example
from the Highway Capacity Manual (TRB, 2000);
information about the population, for example from
HAZUS MH data (Federal Emergency Management
Agency, 2003); and evacuation destinations and
emergency planning zones, which are location-specific
and available as shapefiles. Other evacuation
parameters are set by default in LIFESim and can
be changed for the specific study.

Finally, the loss-of-life module determines LL 
using the results of the aforementioned three mod- 
ules. Based on the assigned flood zone categories 
(the loss-of-shelter module) and the value of PAR 
in this category (defined by the interplay between 
the flood map and the building inventory data), 
life-loss estimates are assessed using LL-rate dis- 
tributions (McClelland and Bowles, 2002). For the 
simulations in this study, the recommended distributions
were replaced by alternative ones; this is
indicated in Figure 1 in green and explained further
in the text.
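The idea behind the loss-of-life module can be sketched as follows. This is not the LIFESim implementation; it is a minimal illustration that assumes each flood zone has a set of historical LL rates from which one rate is drawn per model run and multiplied by the PAR remaining in that zone. The zone populations are hypothetical, and the rate sets are illustrative values in the range of the P values of Table 2.

```python
import random

# HYPOTHETICAL population at risk remaining in each flood zone.
PAR_BY_ZONE = {"chance": 50, "compromised": 200, "safe": 1000}

# Illustrative LL rates per zone (assumed grouping, not LIFESim input files).
RATES_BY_ZONE = {
    "chance": [0.347, 0.42, 0.636, 0.941],
    "compromised": [0.0333, 0.0652, 0.132],
    "safe": [0.0],
}


def sample_life_loss(par_by_zone, rates_by_zone, runs=100, seed=0):
    """Return one life-loss estimate per model run."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    estimates = []
    for _ in range(runs):
        total = 0.0
        for zone, par in par_by_zone.items():
            rate = rng.choice(rates_by_zone[zone])  # draw one historical rate
            total += rate * par
        estimates.append(total)
    return estimates


losses = sample_life_loss(PAR_BY_ZONE, RATES_BY_ZONE, runs=100, seed=42)
```

Replacing the rate sets while keeping everything else fixed changes only the sampled rates, which is the kind of comparison the simulations in this study are designed to make.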


2.2 LL rates distributions recommended in 
LIFESim 


For the estimation of LL, LIFESim recommends 
LL rates distributions developed by McClelland 
and Bowles (2002). To calculate the rates, the total 
PAR was determined and further divided into 
subgroups (subPAR), which help to customize the 
model to local conditions and to have homogene- 
ous data for distinct areas. Three flood zones were 
then identified for each subPAR using the informa- 
tion about warning and flood severity. Finally, by 
estimating the ratio between the number of fatali- 
ties and the number of people in the particular 
flood zone of the particular subPAR, the P value 
was calculated for each case. Using the calculated P, 
the LL rates distributions were built for each flood 
zone (recommended distributions in Figure 3). 

To construct the LL rates distributions, 38 
unique flood events with 179 associated subPARs 
were used (Table 1). These events can be classified
into three types: natural floods, floods due to a
dam failure and floods due to a dike failure. Flood
events due to a dam failure can be further classified
into subgroups based on the dam type, among
which the subgroup of floods resulting from fail- 
ures of embankment dams is the largest. A similar 
prevalence of embankment dam failures was also 
found in datasets used to empirically estimate the 
dam-failure outflow in previous studies (Froehlich, 
1995; Costa, 1985). In both cases, this can be 
explained by the fact that these dams commonly 
failed gradually; thus, data on characteristics (flow 
quantities, people) could be recorded. In contrast, 
instantaneous failures common for concrete dams 
give no chance to record detailed data, resulting in 
8 events with 26 subPAR in Table 1. 


Table 1. Flood events used by McClelland and Bowles (2002).

Type of event                       Number of events  Number of subPAR  Topography
Flood (river, flash, alluvial fan)  10                23                -
Dike                                1                 1                 -
Dam:                                27                153               -
- Embankment                        16                121               -
- Buttress                          1                 1                 mountain
- Gravity                           2                 8                 mountain
                                    4                 11                open/relatively flat area
- Arch                              1                 6                 mountain
- Tailing                           2                 3                 -
- Mill                              1                 3                 -
Total                               38                179               -
Furthermore, since the failure mode differs
between embankment and concrete dams, it
affects the nature of the floods and the subsequent
impact on people. Thus, applying LL rates
derived from data that are dominated by
embankment dam failures can potentially bias LL
estimates in studies of concrete and masonry dams
(e.g. large dams in Switzerland). Therefore, the LL
rates distributions need to be adjusted to reflect
study-specific characteristics such as the dam type
and failure mode.


2.3 Alternative LL rates distributions 


To construct alternative LL rates distributions that 
are representative for large dams in Switzerland, 
a specific dam failure data set of large concrete 
and masonry dams in mountain regions of OECD 
countries was compiled. For this purpose, the 
historical experience contained in PSI’s Energy- 
related Severe Accidents Database (ENSAD) was 
searched for relevant events. 

The ENSAD database was developed at the Paul 
Scherrer Institute (PSI) in the 1990s (Hirschberg 
et al., 1998). Its goal is to enhance the compara- 
tive evaluation of different energy systems cover- 
ing human health, environmental and economic 
impacts. ENSAD covers a broad range of full 
energy chains and continuous data collection 
ensures up-to-date information. Data from 
ENSAD are used for comparative risk assessment 
of energy technologies, to detect weak points in the 
energy infrastructure, and ultimately to support 
decision-making processes concerning energy sup- 
ply options. For a detailed overview on ENSAD 
and its applications see Burgherr and Hirschberg 
(2014) or Burgherr et al. (2017). 

ENSAD comprises a worldwide dataset of
more than 1,000 historical dam accidents in the 
period 1798-2017, 70% of which were in OECD 
countries. Each accident record has a set of char- 
acteristics, which together provide an exhaustive 
description of the event. Categories for some char- 
acteristics, e.g. dam type, dimensions, are adopted 
from other databases (ICOLD, 1995), while for 
others they are created to meet specific needs of 
ENSAD. 

For this study, the ENSAD hydropower dam 
section was queried for dam failures that meet 
the following criteria: dams made of concrete 
and masonry; dams located in mountain areas 
(to be representative for the Swiss topography); 
dams in OECD countries (to ensure similar levels 
of safety as in Switzerland, see Hirschberg et al. 
(1998)). Consequently, calculated LL rates based 
on dam failure data fulfilling the above criteria 
can be considered a reasonable approximation for 
Switzerland. 


For each dam failure, data about the total
number of fatalities (i.e. life loss) and the population
in the downstream area were searched. The latter
is indicated in ENSAD only as the name of the
town nearest to the dam. However, for this study,
relatively homogeneous areas, subPAR, had to be
defined, i.e. the total PAR had to be subdivided into
areas that differ in terms of flood severity,
warning time, or flood severity understanding.
Therefore, additional information on the downstream
population was collected from local reports, interviews,
newspapers, etc. to be able to define homogeneous
subPAR. For example, for the Gleno dam
failure, a total of 356-500 fatalities were reported
by different sources; however, the total number
of people affected and the number of fatalities
were known with certainty only for Bueggio Village.
The availability of this information
was the reason for treating this village as
one homogeneous subPAR. Finally, the life-loss
rate, P, was determined as the ratio of LL to the
population in the subPAR. For some subPAR, P
was defined based on key words. For example, if
the town was “washed out”, then a P of 0.99 was
assumed, with a confidence interval between 0.9
(people can survive even a washout of the
supporting surface) and 1 (the resulting substantial
destruction can lead to the loss of the entire population).
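The ratio described above is simple enough to check by hand; the following sketch recomputes P for four subPAR whose fatalities and population are listed in Table 2. The function name and the input validation are choices made for this illustration.

```python
def life_loss_rate(fatalities, population):
    """Proportion of life loss P = LL / population in the subPAR."""
    if population <= 0:
        raise ValueError("population in the subPAR must be positive")
    if not 0 <= fatalities <= population:
        raise ValueError("fatalities must lie between 0 and the population")
    return fatalities / population


# (fatalities, population) per subPAR, as listed in Table 2.
SUB_PARS = {
    "Bueggio Village (Gleno)": (209, 500),
    "City of Lorca (Puentes)": (608, 4590),
    "Edison tent camp at Kemp (St Francis)": (89, 140),
    "Longarone town (Vajont)": (1269, 1348),
}

rates = {name: life_loss_rate(ll, pop) for name, (ll, pop) in SUB_PARS.items()}
```

Rounded to the precision used in Table 2, these ratios give 0.42, 0.132, 0.636, and 0.941, respectively.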

Furthermore, for each subPAR, the flood zone 
was defined based on warning time and flood 
severity. For this study, only one flood zone was 
assigned to each subPAR. 


2.4 Simulation example 


The simulation example was created to show that
different LL rates defined in the model lead to
different LL estimates, and thereby to demonstrate the
relevance of using LL rates that reflect the dam and
failure characteristics specific to the study, e.g.
studies for Swiss dams.

For the simulation, the failure of a hypotheti- 
cal dam was assumed. The simulation example was 
built using the well-documented example project 
provided for the LIFESim program (USACE, 
2017b). This project is not fully representative for 
the Swiss-dam study; however, certain data reflect 
the Swiss conditions. In particular, the dam-failure 
outflow hydrograph can be considered representa- 
tive, since it reflects the instantaneous failure mode 
common for failures of concrete and masonry 
dams. For the instantaneous failure, no advance 
warning was initiated, i.e. dam failure warning was 
not issued prior to the dam failure. The area down- 
stream of the dam is flat and open (see Figure 2), 
which corresponds to a hypothetical town at the 
end of a Swiss valley. Data for the structural inventories,
emergency planning zones, road networks
and evacuation destinations were provided in
the example project and originally taken from
the sources mentioned in Section 2.1. The data
for flow quantities were simulated in the HEC-RAS
program (USACE, 2017d) and provided by
USACE (2017b) as a two-dimensional map.

Figure 2. A map of the area flooded in the dam-failure
event and symbols used for components of the LIFESim
simulation (modified from USACE, 2017b).

Three simulations were created for this study. 
The first simulation corresponds to the application 
of the recommended LL rates by McClelland and 
Bowles (2002) (Section 2.2). The second simulation 
was run with the same model set-up except that the 
alternative LL rates were used (Section 2.3). For 
the third simulation, the entire downstream subPAR was
assumed to be a chance zone. This assumption was
based on the methodology for determining PAR
by the Swiss Federal Office of Energy (2017),
whose procedure is based on PAR comprising the
entire area affected by a dam-failure flood wave of
at least 2 m height and an intensity of at least 2 m²/s
in a period of 2 h after a complete failure. Furthermore,
this assumption was supported by McClelland
and Bowles (2002), who suggest that when the
buildings are about 6 meters high (i.e. one- or two-story
dwellings, which are common in Swiss towns),
the entire PAR (alternatively, subPAR) is considered
a chance zone.

For all three simulations, runs were performed for
several times of the day, namely 12 p.m. (midnight),
6 a.m., 12 a.m. (noon), and 6 p.m., using 100 model
runs for each combination to ensure convergence of
the results.


3 RESULTS 


3.1 Alternative LL rates distributions 


For the LL rates analysis, a dataset of 14 failures of 
concrete and masonry dams (buttress, gravity, and 


arch) located in mountain regions of OECD countries
was established (Table 2). Four events from
Table 1 were also considered, namely the Zerbino,
St. Francis, Vajont, and Vega de Tera dams.

In the established dataset, dam name, country, 
and the year of the accident are given. The fatali- 
ties are indicated as the total number of fatalities 
resulting from the dam failure and as the number 
of fatalities in the defined subPAR downstream 
of the dam. The population at risk is also given as 
the total number for the entire downstream area 
and as the number of people for the defined sub- 
PAR. All information sources are given in Table 2. 
Finally, the flood zones were assigned to the cal- 
culated P values using information about warning 
time and flood severity (Table 2). Assigned flood 
zones are specified for each subPAR as letter indi- 
ces (Table 2). 

For each flood zone, the alternative LL rates dis- 
tribution was constructed using the corresponding 
calculated P values (Figure 3). In addition, the con- 
fidence intervals were calculated for both alterna- 
tive distributions using all the P values calculated 
based on different numbers of fatalities and people 
at risk found in the literature. Due to the lack of 
data for the safe zone among the failures in the new 
dataset, no fatalities were assumed in the safe zone. 
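How such a distribution can be built from a set of P values is sketched below. The sketch assumes that the relative frequency of exceedance of a rate is the share of observations greater than or equal to it; the paper does not state whether the comparison is strict. The input list is illustrative and uses P values of the magnitude listed in Table 2, without implying a particular flood-zone assignment.

```python
def exceedance_distribution(p_values):
    """Return (P, share of observations >= P) pairs, sorted by increasing P."""
    n = len(p_values)
    if n == 0:
        raise ValueError("at least one life-loss rate is required")
    return [
        (p, sum(1 for q in p_values if q >= p) / n)
        for p in sorted(set(p_values))  # one point per distinct rate
    ]


# Illustrative life-loss rates (assumed sample for this sketch).
sample_rates = [0.347, 0.42, 0.6, 0.636, 0.941, 0.99, 0.99, 1.0]
distribution = exceedance_distribution(sample_rates)
```

Each pair gives one point of the empirical curve; repeating the calculation with the lower and upper P estimates found in the literature would give the bounds of a confidence band.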

For the alternative distribution of the chance 
zone, the range of P values is similar to the recommended
P, i.e. between 1 and 0.4, with values
of the alternative distribution potentially reaching 
0.28 within its confidence interval (Figure 3). The 
recommended curve was built on a dataset with 
more realizations of high P with respect to the 
alternative distribution proposed in this study. This 
is shown for high P values (e.g. 0.8), which have a 
0.9 and 0.7 frequency of exceedance for the recom- 
mended and alternative distributions, respectively. 
Furthermore, by considering the confidence inter- 
vals built for the alternative curve, the variability 
of possible P values is large for some P due to the 
limited data; thus, for the alternative distributions 
of the chance zone high P values can also increase 
to a frequency of 0.8. 

For the compromised zone, the range for the 
alternative LL rates is smaller than for the recom- 
mended rates and the highest P does not exceed 
0.13. However, taking into account the confidence 
intervals, realizations of 0.5 for P are also possible,
which is in line with the rates provided by
McClelland and Bowles (2002).


3.2 Simulation results 


Results for all three simulations are presented in 
Figure 4 and expressed in life loss as percentage 
of PAR (alternatively subPAR). The results are 
summarized using box plots showing the median 
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Table 2. Extended dataset for failures of concrete and masonry dams located in mountain regions of OECD countries. 
Proportion 
Dam name (country) & Fatalities in Population of Life Flood 
N Year subPAR name subPAR in subPAR Loss, P Warning Severity 
1 1923 Gleno (Italy) 356-500 12,631 — 
- Bueggio Village 
(Bureau of Reclamation, 2015) 209 500 0.42° no high 
2 1934 Granadillar (Spain) 
(Gonzalez and Santamarta, 8 — 0.992 
2012) 
3 1928 Komoro (Japan) T = = 
4 2012 Kopru (Turkey) 10 300 0.0333° 
(Boston.com, (Haberturk, 
2012) 2012) 
5 1891 Lynx Creek (USA) 0 — 0 
6 1959 Malpasset (France) (Graham, 400-550 6000 -= no 
1999) 
- 100 ft flood depth 30 30 I high 
- 10 ft high entering Frejus 391 6000 0.0652° medium 
7 1925 Moyie River (USA) 0 - 0 
Puentes (Spain) 680 — 
8 1802 -city of Lorca 608 4590 0.132? some 
(Saxena and (Smedley et al., 
Sharma, 1845; Murcia 
2004) Today) 
9 1928 St Francis (USA) 300-684 2250 - some 
- powerhouse N2 (McClelland 81 0.998 no high 
and Bowles, 2002) 
- Castaic Junction washed away 0.998 
(Wikipedia, 
2017b) 
- Edison tent camp at Kemp 89 140 0.636" no high 
(Rogers and James, —) 
10 1965  Torrejon-Tajo (Spain) 39 50 0.6" 
(Wikipedia, (Extremadura, 
2017a) 2016) 
11 1963 Vajont (Italy) (McClelland 1600-2600 3000 = no 
and Bowles, 2002) 
- Longarone town 1269 1348 0.941* 
- lakeside communities 158 i 
12 1959 Vega de Tera (Spain) (Graham, 140-153 415 0.347" no high 
1999) 
13 1944 Xuriguera (Spain) a - -= 
- farm house (La Vanguardia, 6 6 lig 
1944) 
14 1935 Zerbino (Italy) 130 - 


a: chance zone; b: compromised zone.


of the modeled distributions and the bottom and 
top edges of the box indicating the 25th and 75th 
percentiles, respectively. The whiskers extend to the 
maximum and minimum of the data not consider- 
ing outliers, and the outliers (i.e. points distant by 
twice or three times the standard deviation (Ruan 
et al., 2005)) are plotted individually. 

For simulations 2 and 3, the results were calculated
using the alternative distributions shown in
Figure 3 (solid lines). The confidence intervals of
these distributions could not be taken into account
due to specifics of the software settings.

The relative patterns of the results between dif- 
ferent times of the day are similar across all three 
simulations. In particular, the highest values of 
P were calculated at 12 p.m. and the lowest at 12 
a.m. For the former, this could be explained by 
the fact that most people are asleep and not aware
Figure 3. Historical LL rate distributions developed
by McClelland and Bowles (2002) and LL rate distributions
developed specifically for concrete & masonry dams
in mountain regions of OECD countries, with the
confidence intervals (minimal and maximal values).
Figure 4. Simulation results for Life Loss as percentage
of PAR: Simulation 1) using recommended life-loss rates 
by McClelland and Bowles (2002); Simulation 2) life-loss 
rates developed for dams in mountain areas; Simulation 
3) test case with merged chance and compromised zone 
and safe zone without LL. 


of a possible warning. On the other hand, at 12
a.m. most people are awake and going about their
duties, so they can react faster to a possible warning.
The uncertainty range is higher for the 6 a.m. and
6 p.m. results, because at those times of day people
are potentially traveling to or from home, and
it is quite uncertain how many people in traffic are
exposed to the flood.

Comparing results between the first two simula- 
tions, the P values calculated in the second simula- 
tion are in general 10% lower than those calculated 
in the first simulation. This is due to the fact that, in 
the chance zone, the same P values have lower prob- 
abilities in the alternative distribution than in the 
recommended one, and in the compromised zone, 
the possible range of P values is in general lower 
(see Figure 3). The median values of P in the second 
simulation are shifted to lower values of the results 
range (i.e. distributions are left skewed) with respect 
to the first simulation. This can be explained by
the fact that more severe P values in the chance
and compromised zones in Figure 3 have a lower
frequency of exceedance (e.g. for a frequency of
exceedance of 0.68, P is equal to 0.8 and 0.9 in the
chance zone for the alternative and recommended
distributions, respectively). On the other hand,
probabilities for lower values of P (i.e. closer to 0.4)
do not differ to the same extent.

Furthermore, uncertainty in the results of the
second simulation is generally lower than in the
first simulation, except for the simulation at
12 p.m. The generally lower uncertainty ranges
can be explained by the fact that the alternative
distributions are flatter (especially the one for the
compromised zone); sampling P values from the
alternative distribution can therefore produce a
sample with a smaller range of P values.
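
The sampling step discussed above can be illustrated with a minimal sketch. It assumes that an LL rate distribution is available as a piecewise-linear curve of (P, frequency of exceedance) points and draws P values by inverse-transform sampling. The curve points and the function names are hypothetical; they are not values read from Figure 3 and do not describe how LIFESim samples internally.

```python
import random

# Hypothetical exceedance curve for a fatality rate P, given as
# (P, frequency of exceedance) points with the frequency falling from 1 to 0.
# These points are illustrative only; they are not read from Figure 3.
EXAMPLE_CURVE = [(0.4, 1.0), (0.6, 0.8), (0.8, 0.5), (1.0, 0.0)]


def rate_at_exceedance(curve, u):
    """Return the rate P whose frequency of exceedance is u (linear interpolation)."""
    for (p0, f0), (p1, f1) in zip(curve, curve[1:]):
        if f1 <= u <= f0:
            if f0 == f1:
                return p0
            return p0 + (p1 - p0) * (f0 - u) / (f0 - f1)
    raise ValueError("u must lie within the frequency range of the curve")


def sample_rates(curve, n, seed=0):
    """Draw n rates by inverse-transform sampling with a seeded generator."""
    rng = random.Random(seed)
    return [rate_at_exceedance(curve, rng.random()) for _ in range(n)]


if __name__ == "__main__":
    rates = sample_rates(EXAMPLE_CURVE, 1000)
    print("sampled range of P:", min(rates), max(rates))
```

Because a uniform draw u is mapped to the P with exceedance frequency u, the sampled values follow the curve; the spread of the sample then depends on the shape of the curve from which it is drawn.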

Finally, the results of the third simulation show
that the calculated P values are in general the
highest among all simulations. Furthermore, the
uncertainty of the calculated values is significantly
higher than in the results of the first two simulations.
Moreover, the results show larger left skewness than
the other simulations, because the compromised zones
have now become chance zones and LL rates are higher
for all ranges of probabilities. Thus, neglecting the
existence of compromised zones, with their higher
potential for people to survive, leads to a potential
overestimation of the overall life loss.

In general, lower rates for life losses reflected 
in the alternative distributions and results of the 
simulations are supported by the following circum- 
stances. On the one hand, the alternative dataset of 
concrete and masonry dams is built exclusively for 
OECD countries, whereas the recommended LL 
rates include also events in non-OECD countries 
(e.g., China). Generally, higher population density
downstream of non-OECD dams, together with
differences in safety culture and awareness among
people living downstream of them, could potentially
result in higher life loss. On the other hand,
mountain topography could leave higher survival
chances for people located at higher altitudes in
the downstream area. In contrast, embankment
dam failures commonly affect a flat area downstream
of the dam, so that available shelter is often
far away.


4 CONCLUSIONS 


To maintain or further improve the high safety 
levels of dams in Switzerland it is important to 
evaluate the performance of existing or planned 
risk mitigation measures at dams, and to estimate 
potential LL consequences of selected dam failure 
scenarios. 

Generally, LL estimation in dam risk assessment
is a complex process that depends on many
parameters and circumstances. Available dynamic
modular systems for LL estimation (e.g. LIFESim)
can overcome known limitations of the purely
empirical methods, because they allow modeling of
physical processes within the dam-failure event, e.g.
evacuation and population distribution. These models
can also go beyond LL estimates and, for example,
identify the times of day with the highest risk for
PAR, as demonstrated in the current study.

Furthermore, this study demonstrated the 
importance of adjusting the LL rates distributions 
to reflect study-specific characteristics such as the 
dam type, failure mode, etc. The LL rates derived 
from historical failures of concrete and masonry 
dams in mountain regions of OECD countries had 
different shapes and frequency ranges than the 
generic ones in LIFESim. The recommended and
alternative LL rates gave different LL estimates
in the simulation example of the hypothetical dam
failure carried out in this study.

In summary, the importance of defining study- 
specific alternative LL rates distributions for dam 
risk assessment was demonstrated. Potential future 
extensions rely on the reduction of the width of 
the confidence intervals for the alternative LL-rate 
distributions considered representative for large 
concrete dams in Switzerland. To reduce the rather
large uncertainty ranges, continuous updating of
information is suggested. In particular, more dam
accidents need to be included, and better information
on subPAR, flood severity and warning availability
needs to be provided for the events in the existing
list of historical failures of concrete and masonry
dams. Finally, the concept will be refined and
applied in more detail to a real case study
reflecting Swiss conditions.

ABSTRACT: The fault tree linking method for building Probabilistic Safety Assessment (PSA) models
of nuclear power plants represents accident sequences (combinations of safety system failures following an
initiating event) by relatively small event trees. Failures of individual safety systems are modelled by fault
trees. Such a model allows us to analyze selected accident scenarios or scenarios leading to a specific
consequence. The analysis algorithm implemented in RiskSpectrum decomposes function events which fail
along a sequence into minimal cutsets and (optionally) summarizes successful function events in an
aggregate event, a so-called success module. First-order algorithms for quantification of a list of such minimal
cutsets yield an approximate result. The new MCS BDD algorithm implemented in RiskSpectrum aims
at improving this approximation. When computing resources suffice, it has the capability to quantify the
minimal cutset list exactly. We evaluate the performance of this algorithm on real-life models.


1 INTRODUCTION 


A Binary Decision Diagram (BDD) is a data 
structure for encoding Boolean functions (Bryant 
1986). It has been applied to fault tree analysis 
by Rauzy (1993) and Coudert & Madre (1993). 
Since then, it has been successfully used in many 
domains for a complete solution of fault trees 
(Rauzy 2006). However, the size of event/fault tree 
models emerging from Probabilistic Safety Assess- 
ment of nuclear power plants is prohibitive for an 
exact analysis by the means of BDDs. 

The current standard practice for solving large
fault tree models from the nuclear Probabilistic
Safety Assessment is to decompose the tree structure
into a list of minimal cutsets (MCS). First-order
algorithms, such as the rare event approximation
or the Min Cut Upper Bound (MCUB), are
then used to quantify this minimal cutset list.
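
As a minimal sketch of these two first-order quantifications, assuming independent basic events: the rare event approximation sums the cutset probabilities, and the MCUB combines them as if the cutsets were independent. The function names and the example probabilities are illustrative assumptions and are not taken from RiskSpectrum.

```python
from math import prod


def cutset_probability(cutset, p):
    """Probability of one cutset: product of its (independent) basic event probabilities."""
    return prod(p[event] for event in cutset)


def rare_event_approximation(mcs, p):
    """Direct sum of the cutset probabilities."""
    return sum(cutset_probability(c, p) for c in mcs)


def min_cut_upper_bound(mcs, p):
    """1 minus the product of the cutset non-occurrence probabilities."""
    return 1.0 - prod(1.0 - cutset_probability(c, p) for c in mcs)


if __name__ == "__main__":
    # Illustrative 2-out-of-3 failure logic with hypothetical probabilities.
    p = {"A": 0.1, "B": 0.1, "C": 0.1}
    mcs = [{"A", "B"}, {"A", "C"}, {"B", "C"}]
    print(rare_event_approximation(mcs, p), min_cut_upper_bound(mcs, p))
```

For this example the exact probability is 0.028 (at least two of three independent events with probability 0.1), so both values are conservative, with the MCUB slightly closer.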

The MCS BDD algorithm implemented in Risk- 
Spectrum presents a new method of minimal cut- 
set list quantification. A key to its efficiency and 
accuracy is a heuristic procedure which assigns 
certain nodes for the exact treatment and certain 
nodes for a treatment similar to a ZBDD (Minato 
1993, Jung et al. 2004). This provides us with an 
improved quantification for minimal cutset lists 
with high probability events and with a small set 
of high importance events. Also, it allows for event 
tree success quantification dependent on the failed 
events. In the optimal case, if the complete success 
information is generated, the algorithm has the 
capability to quantify success exactly. 


In this paper, we evaluate the current imple- 
mentation of the MCS BDD algorithm on real life 
models. The focus of the assessment lies in the fol- 
lowing aspects: 


Improved accuracy—for which types of analyses 
do we obtain a considerable decrease in con- 
servatism of the results? 

Success quantification—we demonstrate the capa- 
bility of the algorithm to quantitatively assess 
the event tree success. We also discuss sensitivity 
of the algorithm to various model parameters. 

Importance factors—we study effects of the new 
quantification method on the Risk Increase 
Factor. To what extent is this importance factor 
affected and where is the biggest gain? 

Efficiency—the algorithm allows for user control 
of the trade-off between the result accuracy 
and the calculation resources. We investigate 
the effect of different settings on the calcula- 
tion time and the MCS list value. Especially, we 
ask when one can rely on the built-in automatic 
parameter adjustments and in which situations 
one needs to steer the algorithm by fine-tuning 
the parameters manually. 


2 BACKGROUND 


A Binary Decision Diagram is a rooted binary 
directed acyclic graph with nodes labeled by deci- 
sion variables and leaves representing one of the 
two values: True or False. The MCS BDD algorithm
applies pivotal decomposition to the MCS
list in order to build a BDD. It selects a basic event
from the MCS list as the node decision variable 
and builds its child BDDs from smaller MCS lists 
with possibly smaller cutsets. A leaf is built when 
the MCS list to be processed is empty or it contains 
an empty cutset. Figure 1 schematically depicts 
one step in BDD building. 

One of the approximations in the analysis of 
large fault trees by the minimal cutset decompo- 
sition is the quantification of the resulting mini- 
mal cutset list. MCS BDD is an algorithm that 
has the potential to quantify the MCS list exactly. 
This would, however, lead to unacceptable calculation
times (and possibly exceed computer memory
capacities) for large real-life models like those from 
nuclear PSA. A complete BDD quantification of a 
MCS list with success modules can unfortunately 
be achieved only for very small cases. Therefore, 
we need to trade absolute accuracy for reasonable 
calculation times. 

The MCS BDD algorithm adopts a highly prag- 
matic approach. It searches for events where a pre- 
cise BDD quantification has the greatest effect and 
treats other events in an approximate way. First, 
we describe the most important heuristics in the 
MCS BDD algorithm and then we motivate them 
by describing types of models where this brings the 
greatest advantage. 

For presentation purposes, we make the following
assumptions on MCS lists (for the full description
of the algorithm see Bäckström et al. (2014) and
Bäckström et al. (2016)). We assume that an input
MCS list contains independent basic events and 
possibly also success modules. A success module 
is a summary characterization of function events 
which succeed along an event tree sequence. Tech- 
nically, it is also a MCS list containing cutsets 




Figure 1. The minimal cutset list with cutsets contain- 
ing the basic event A is decomposed into two smaller 
MCS lists which are then processed recursively. 


which fail one of these function events. A success 
module is then interpreted as ‘negated’ in cutsets. 
For a more detailed description of success mod- 
ules, see RiskSpectrum (2013). Cutsets are grouped 
according to the initiator (a frequency basic event) 
and a success module and then treated separately. 
This means that we can without loss of generality 
assume that there is only one initiator and only one 
success module in the MCS list. 


2.1 Main ingredients in MCS BDD 


The algorithm first defines the MCS list which 
shall be transformed into a BDD. Then, it builds 
the BDD. Finally, it computes the probability/fre- 
quency of the MCS list from the BDD. In the fol- 
lowing we detail the most important parts of the 
process and by this also explain the main heuristic 
ingredients: 


• Selection of the most important cutsets
• Exact and approximate nodes
• Success module quantification


2.1.1 Selection of important cutsets 

The greatest contribution to the MCS list probabil- 
ity/frequency typically comes from a small portion 
of the minimal cutsets. The first heuristic splits
the MCS list into two parts, based on the cutset
values. The part which represents almost the complete
MCS list value is treated by the MCS BDD
algorithm. The remaining part is quantified by the
Min Cut Upper Bound algorithm.

This heuristic works as a variant of a probabilistic
cutoff, but its effect is purely conservative.
The greater the part that is treated by the MCUB,
the greater the over-approximation of the MCS list
value and the easier it is to build the BDD.
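
The split can be sketched as follows. This is only one plausible reading: cutsets are sorted by value, and the highest-valued ones are kept until they cover all but a fraction `mcs_limit` of the summed cutset values. The function name, the parameter name and the exact selection rule are assumptions made for illustration; the RiskSpectrum implementation may differ.

```python
from math import prod


def split_by_value(mcs, p, mcs_limit):
    """Split an MCS list into (important, remainder) by cutset value.

    Assumed rule: keep the highest-valued cutsets until they cover at least
    (1 - mcs_limit) of the summed cutset values; the rest is the remainder,
    which would be quantified separately (for example by MCUB).
    """
    valued = sorted(
        ((prod(p[e] for e in c), c) for c in mcs),
        key=lambda item: item[0],
        reverse=True,
    )
    total = sum(value for value, _ in valued)
    important, remainder, covered = [], [], 0.0
    for value, cutset in valued:
        if covered < (1.0 - mcs_limit) * total:
            important.append(cutset)
            covered += value
        else:
            remainder.append(cutset)
    return important, remainder


if __name__ == "__main__":
    # Hypothetical probabilities: one dominant cutset and two minor ones.
    p = {"A": 0.1, "B": 0.1, "C": 0.001}
    mcs = [frozenset("AB"), frozenset("AC"), frozenset("BC")]
    print(split_by_value(mcs, p, mcs_limit=0.05))
```

A larger limit moves more cutsets into the remainder, which makes the BDD smaller at the price of a more conservative total.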


2.1.2 Exact and approximate nodes 

The algorithm produces BDD nodes and recur- 
sively creates their inputs from smaller MCS lists. 
Each node that is added is included either as an
exact or as an approximate node. In the RiskSpectrum
implementation, determining the node type (exact or
approximate) is as important as selecting the
pivotal element.

The exact method directly follows from Shan- 
non non-intersect decomposition. We split the 
MCS list (which is a probabilistic Boolean func- 
tion) by a pivot basic event A (a decision variable 
in the formula) into two parts: 


• Cutsets which are consistent with A. We remove
A from these cutsets, if it is there. This corresponds
to evaluating the Boolean formula with
A = True.

• Cutsets which are consistent with ¬A. This
removes all cutsets containing A. This corresponds
to evaluating the Boolean formula with
A = False.


Formula 1 shows how to calculate the probability
of a Boolean function f, where A is one of the
decision variables.


$$P(f) = P(A) \cdot P(f[A]) + P(\neg A) \cdot P(f[\neg A]) \qquad (1)$$


By f[A] and f[¬A] we denote the Boolean function f
with A = True and A = False, respectively.
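
The exact treatment can be sketched as a plain recursion over the MCS list that applies Formula 1 at every step, with the leaf rules described earlier (an empty list is False, a list containing an empty cutset is True). The sketch assumes independent basic events and a coherent MCS list without success modules; the function name and the alphabetical pivot choice are illustrative assumptions, and no BDD node sharing or caching is attempted.

```python
def exact_probability(mcs, p):
    """Exact probability of an MCS list by recursive pivotal decomposition."""
    mcs = [frozenset(c) for c in mcs]
    if not mcs:
        return 0.0  # empty MCS list: leaf False
    if any(len(c) == 0 for c in mcs):
        return 1.0  # an empty cutset is always satisfied: leaf True
    a = min(min(c) for c in mcs)  # simple deterministic pivot choice
    f_true = [c - {a} for c in mcs]            # A = True: remove A where present
    f_false = [c for c in mcs if a not in c]   # A = False: drop cutsets containing A
    return (p[a] * exact_probability(f_true, p)
            + (1.0 - p[a]) * exact_probability(f_false, p))


if __name__ == "__main__":
    # Illustrative 2-out-of-3 failure logic with hypothetical probabilities.
    p = {"A": 0.1, "B": 0.1, "C": 0.1}
    mcs = [{"A", "B"}, {"A", "C"}, {"B", "C"}]
    print(exact_probability(mcs, p))
```

Without sharing of identical sub-lists this recursion grows exponentially with the number of basic events, which is the cost that the approximate treatment is meant to reduce.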

The approximate treatment splits the MCS list 
by a pivot basic event A into two parts: 


• Cutsets which contain A. We remove A from all
of these cutsets.
• Cutsets which do not contain A.


This follows the same principle as ZBDD
quantification. The usual quantification yields the
same result as the rare event approximation—a 
direct sum of the cutset probabilities. The algo- 
rithm implemented in RiskSpectrum makes use of 
the independence assumption of basic events. It 
uses the Min Cut Upper Bound approximation on 
independent or positively correlated cutsets. The 
quantification in the approximate method is per- 
formed according to Formula 2. 


$$P(f) = P(A) \cdot \bigl(P(f_A) + P(f_{\neg A}) - P(f_A) \cdot P(f_{\neg A})\bigr) + (1 - P(A)) \cdot P(f_{\neg A}) \qquad (2)$$


By $f_A$ and $f_{\neg A}$ we denote the cutsets that contain A
and those that do not contain A, respectively. Moreover,
the event A is removed from the cutsets in $f_A$. This
formula quantifies both parts in the same recursive
way and then removes the product of their
probabilities. Note that this way of calculating the MCS
list probability gives a lower value than the MCUB 
approximation on the whole MCS list (and by this 
also a lower value than the rare event approxima- 
tion). Even if the whole MCS BDD was built just 
from approximate nodes, the result would be more 
precise than the MCUB quantification. 
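
A minimal sketch of a quantification that uses only approximate nodes is given below, together with the plain MCUB for comparison. It assumes independent basic events and a coherent MCS list without success modules; the function names and the alphabetical pivot choice are illustrative assumptions rather than the RiskSpectrum implementation.

```python
from math import prod


def approx_probability(mcs, p):
    """Quantify an MCS list with approximate nodes only (Formula 2 at each step)."""
    mcs = [frozenset(c) for c in mcs]
    if not mcs:
        return 0.0
    if any(len(c) == 0 for c in mcs):
        return 1.0
    a = min(min(c) for c in mcs)                      # simple deterministic pivot
    with_a = [c - {a} for c in mcs if a in c]         # cutsets containing A, A removed
    without_a = [c for c in mcs if a not in c]        # cutsets not containing A
    q1 = approx_probability(with_a, p)
    q0 = approx_probability(without_a, p)
    return p[a] * (q1 + q0 - q1 * q0) + (1.0 - p[a]) * q0


def min_cut_upper_bound(mcs, p):
    """MCUB on the whole list, for comparison."""
    return 1.0 - prod(1.0 - prod(p[e] for e in c) for c in mcs)


if __name__ == "__main__":
    # Illustrative 2-out-of-3 failure logic with hypothetical probabilities.
    p = {"A": 0.1, "B": 0.1, "C": 0.1}
    mcs = [{"A", "B"}, {"A", "C"}, {"B", "C"}]
    print(approx_probability(mcs, p), min_cut_upper_bound(mcs, p))
```

For this example the exact value is 0.028, the approximate-node value is about 0.02881, and the MCUB is about 0.02970, which matches the ordering stated above.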

There are conditions that have to be satisfied to 
guarantee that the approximate method does not 
yield an under-approximation of the exact value. 
A pivot basic event A cannot be treated by the
approximate method if $f_A$ and $f_{\neg A}$ are negatively
correlated due to the fact that the same event 
occurs in f (non-negated) and at the same time it 
appears in a success module. 

This has proved to be a significant drawback of
the approximate method, as it either adds a restric- 
tion on using approximate nodes or it limits the 
exact quantification of success modules. 

Exact nodes and approximate nodes can be 
mixed. Approximate nodes make the analysis very 
efficient, exact nodes give a better result and allow 


for more precise success quantification. The MCS 
BDD algorithm has to balance these desirable 
properties. It searches for the best candidates for 
the exact treatment and treats the remaining basic 
events approximately. 


2.1.3 Success module quantification 

Success modules are quantified by the same tech- 
nique. We build an MCS BDD for the success 
module and append it to the MCS BDD for the 
minimal cutset list. The success module BDD uses 
only exact nodes. Therefore, we can quantify its 
(independent) value exactly. 

Disregarding dependencies between the cutsets 
and the success module contributes to the con- 
servatism of the result. If we want to reduce this 
conservatism, we need to take these dependencies 
into account. However, this limits the possibilities 
of applying the efficient approximate treatment of 
nodes in the whole BDD. Therefore, we select only 
the most important events in the success module 
and disregard dependencies for the remaining ones. 


2.2 MCS BDD parameters 


Users can steer the search for events
to be treated exactly or approximately, as it affects the
calculation speed and the accuracy of the result. This
can be done by setting algorithm parameters manu- 
ally before building the BDD. The algorithm has 
the capability of an automatic adjustment of these 
parameters. This allows users to use default values 
for most (if not all) analysis cases in their models. 
The following parameters can be specified by a user: 


• MCS limit ranges from 0 to 1 and expresses the
percentage of the MCS list value above which
cutsets are treated by MCS BDD.

• Q limit and FV limit parameters are used to
determine the node treatment method (exact
or approximate) and the treatment of dependencies
between events in the success module and
the failure part. These parameters range from 0
(only exact) to 1 (only approximate).

• BDD nodes limit bounds the number of nodes
that a BDD can use and by this it limits the
complexity of the BDD generation.


If the algorithm cannot succeed in building a 
BDD with the given combination of parameters 
then it automatically adjusts the parameters MCS 
limit, Q limit and FV limit and attempts generating 
a BDD again. This is repeated until the algorithm 
successfully returns a BDD structure. 


2.3 Purpose of MCS BDD 


MCS list quantification by the new algorithm
improves the accuracy of the MCS list value.
Situations in which first order approximations suffer
from a significant over-approximation are:


• Events with high probabilities
• Success quantification


The goal of the experimental evaluation is to 
assess to what extent the implementation fulfils its 
purpose. Additionally, we evaluate the possibilities 
that the algorithm parameters give users in steering 
the quantification. 


3 EVENT TREE QUANTIFICATION 


Quantification of event tree success necessarily 
requires handling of non-coherent fault trees (Nus- 
baumer & Rauzy 2013, Bäckström et al. 2012). 
Function event success brings negated ‘events’ into 
minimal cutsets. RiskSpectrum summarizes suc- 
cessful function events by a new type of event—a 
success module, which is simply a minimal cutset 
list containing basic event combinations which 
cannot occur in quantified sequences. Moreover, 
not-logic in fault trees might introduce negated 
basic events into minimal cutsets. 

We have performed two types of assessment. 
The first one compares the sequence or conse- 
quence value calculated by the MCS BDD algo- 
rithm to the value computed by the Min Cut 
Upper Bound algorithm. By this, we analyze the
accuracy increase obtained from the new algorithm.
We investigate cases with significant differences
more closely in order to identify factors driving
the accuracy increase. Large real-life models are
used.

The second type of assessment compares quan- 
tification of sequences and consequences with suc- 
cess modules to reference values calculated by an 
analysis of structures which contain only function 
event failures (Nusbaumer & Rauzy 2013). Mod- 
els with newly built event trees based on industrial 
PSA models are used. 


3.1 Comparison to Min Cut Upper Bound


We have evaluated the accuracy of MCS BDD on 
15 real-life models. We split them into two catego- 
ries—models which quantify event tree success 
(by means of success modules) and models which 
use event tree success only to remove basic event 
combinations which do not belong to the analyzed 
sequence. The latter group of models disregards 
the success probability both in the first order 
quantification algorithm as well as in the MCS 
BDD algorithm. We have analyzed individual 
sequences and also sequences grouped according 
to the accident consequence (consequence analysis 
cases). 


3.1.1 Analyses without ET success quantification 
First, we present results from the group of mod- 
els which do not quantify Event Tree (ET) success. 
The results reflect only how efficiently the MCS 
BDD algorithm deals with dependencies between 
minimal cutsets. Table 1 shows the decrease of the
MCS list frequency from the value calculated by
MCUB, in percent of the new value calculated by
MCS BDD. It lists the number of cases analyzed in
the model, the maximal decrease, the number of
analysis cases with a decrease above 25%, the number
of analysis cases with a decrease above 10%, and the
minimal decrease (a negative value indicates an increase).
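
A short sketch of the measure used in the tables, under the reading that the decrease is expressed relative to the MCS BDD value: because the new (smaller) value is the denominator, decreases above 100% are possible, and a negative result means that the MCS BDD value is higher than the MCUB value. The function name and the example numbers are hypothetical.

```python
def decrease_percent(mcub_value, mcs_bdd_value):
    """Decrease from the MCUB value, in percent of the MCS BDD value."""
    return 100.0 * (mcub_value - mcs_bdd_value) / mcs_bdd_value


if __name__ == "__main__":
    # Hypothetical frequencies, not results from any of the test models.
    print(decrease_percent(4.0e-6, 1.0e-6))   # MCUB four times the MCS BDD value
    print(decrease_percent(0.98e-6, 1.0e-6))  # MCS BDD value slightly higher
```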

All of these results have been calculated with 
the default settings: the MCS limit has been set to 
1E-3, Q limit and FV limit have been set to 1E-2. 
The vast majority of the MCS BDD results end up 
between the second and the third order approxima- 
tion, which is an expected result. Exceptions rather 
indicate an improvement potential in the calcula- 
tion of the second/third order approximation than 
in the MCS BDD algorithm. 

Obtaining an increase of the MCS list value, 
even though very small, is unexpected. It could be 
traced to a less efficient quantification of mod- 
ules in the MCS BDD algorithm and should be 
resolved in future versions. Most of the increases 
can be removed when we increase the MCS BDD 
accuracy by the algorithm settings. 

Analysis cases with the greatest value decrease
can often be identified by one property: they contain
high probability events with high Fussell-Vesely
importance.


3.1.2 Analyses with ET success quantification 

As the next step, we present results from analy- 
ses where function event success in event trees is 
quantified by the means of success modules. Apart 
from dependencies between cutsets, the MCUB 
algorithm does not take into account dependen- 
cies between failed basic events in a cutset and 
basic events from the success module in this cutset. 


Table 1. A summary of the MCS list value decrease 
with the MCS BDD algorithm compared to the MCUB 
algorithm. Event tree success is not quantified. 


Test     Number     Max     # >     # >     Min
model    of cases   (%)     25%     10%     (%)
M-01     1200       28.8    1       15      -1.8
M-02     375        103     48      112     0
M-09     2700       89.8    160     320     0
M-10     24000      17.3    0       19      -0.83
M-11     640        309     223     323     -0.01
M-12     6000       42.9    318     712     0.09
M-13     3400       40.8    47      188     -0.99

This might result in additional conservatism of the
result.

Figure 2 illustrates this issue. Consider the 
second sequence. The function event “A AND 
B” succeeds and its corresponding success mod- 
ule contains the cutset {A, B} = SM. The second 
function event fails and produces the cutset {IE, 
A, SM}. The quantification method used in the 
MCUB algorithm calculates values of events and 
modules independently. The cutset value is then 
P(IE)*P(A)*(1-P(A)*P(B)). This is an overly
conservative quantification. We know that A has 
failed in this cutset. Therefore, the probability that 
the success module has not failed is only 1-P(B). 
This gives us the cutset value P(IE)*P(A)*(1-P(B)) 
which the MCS BDD algorithm returns (provided 
that the event A is considered as important by the 
algorithm heuristics). Moreover, the whole success 
module will be calculated by the means of MCS 
BDD. 
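
The two cutset values from this example can be checked with a minimal sketch, assuming independent events IE, A and B. The exact value of the sequence (IE occurs, A has failed, and "A AND B" has not failed) is obtained by enumerating the states of A and B; the function names and the example probabilities are illustrative assumptions.

```python
from itertools import product


def independent_value(p_ie, p_a, p_b):
    """Cutset {IE, A, SM} with the success module quantified independently."""
    return p_ie * p_a * (1.0 - p_a * p_b)


def dependent_value(p_ie, p_a, p_b):
    """Cutset value when the failure of A is taken into account in the success module."""
    return p_ie * p_a * (1.0 - p_b)


def enumerated_value(p_ie, p_a, p_b):
    """Exact probability of: IE and A failed and not (A and B failed)."""
    total = 0.0
    for a, b in product([True, False], repeat=2):
        if a and not (a and b):
            total += (p_a if a else 1.0 - p_a) * (p_b if b else 1.0 - p_b)
    return p_ie * total


if __name__ == "__main__":
    # Hypothetical probabilities chosen to make the difference visible.
    print(independent_value(1.0, 0.5, 0.5),
          dependent_value(1.0, 0.5, 0.5),
          enumerated_value(1.0, 0.5, 0.5))
```

With these numbers the independent quantification gives 0.375, whereas the dependent value and the enumeration both give 0.25.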

Table 2 shows the decrease of the MCS list
frequency from the value calculated by MCUB, in
percent of the new value calculated by MCS BDD.
It lists the number of cases analyzed in the model,
the maximal decrease, the number of analysis cases
with a decrease above 25%, the number of analysis
cases with a decrease above 10%, and the minimal
decrease (a negative value indicates an increase).


Figure 2. An event tree where the success quantification
of the first function event depends on the second
function event.


Table 2. A summary of the MCS list value decrease 
with the MCS BDD algorithm compared to the MCUB 
algorithm. Event tree success is quantified. 


Test     Number     Max     # >     # >     Min
model    of cases   (%)     25%     10%     (%)
M-02     800        27.1    2       17      -65.6
M-03     3          -2.4    0       0       -21.4
M-04     4200       469.3   360     625     -3.4
M-05     4000       867     120     331     -2.3
M-06     7000       157     80      530     -0.7
M-07     9700       107.8   100     350     -3.0
M-08     660        0.48    0       0       -5.6
M-11     1100       1132    350     500     -64.4
M-14     3700       5.1     0       0       -0.01
M-15     2100       300.6   53      377     -22.7
M-16     910        72.1    2       27      -0.1


In some cases the MCUB results may be signifi- 
cantly lower than the value calculated by the MCS 
BDD algorithm. The cases where this has been 
found underestimate the success module value in 
the MCUB calculations. This is because the MCS 
list in the success module is itself quantified by 
MCUB which gives a conservative estimate of 
a cutset list. This means that we overestimate its 
value and then calculate its ‘negation’ (1-P(SM)). 
This in turn means that the success module value
becomes underestimated. Normally, the conserva- 
tism from the MCUB is rather limited and other 
approximations applied during success module 
creation have more significant effect. The extensive 
evaluation revealed cases where the difference is 
greater than 50% of the MCS list value. For these 
cases, we have investigated the success module and 
verified that the MCUB algorithm underestimates 
the success module. Note that RiskSpectrum offers 
a possibility to identify candidates for this effect 
using the second order quantification of success 
modules. 
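
The effect can be reproduced with a small sketch: a success module whose cutsets share a high probability event is quantified once by MCUB and once exactly (by enumerating all event states), and the two 'negated' values 1-P(SM) are compared. The function names, the example module and its probabilities are hypothetical, and independent basic events are assumed.

```python
from itertools import product
from math import prod


def min_cut_upper_bound(mcs, p):
    """MCUB value of an MCS list."""
    return 1.0 - prod(1.0 - prod(p[e] for e in c) for c in mcs)


def enumerated_probability(mcs, p):
    """Exact value of an MCS list by enumerating all basic event states."""
    events = sorted({e for c in mcs for e in c})
    total = 0.0
    for states in product([True, False], repeat=len(events)):
        failed = {e for e, s in zip(events, states) if s}
        if any(set(c) <= failed for c in mcs):
            total += prod(p[e] if s else 1.0 - p[e] for e, s in zip(events, states))
    return total


if __name__ == "__main__":
    # Hypothetical success module: two cutsets sharing the event X.
    p = {"X": 0.5, "Y": 0.5, "Z": 0.5}
    success_module = [{"X", "Y"}, {"X", "Z"}]
    print("success via MCUB: ", 1.0 - min_cut_upper_bound(success_module, p))
    print("success, exact:   ", 1.0 - enumerated_probability(success_module, p))
```

Here the MCUB gives 0.4375 for the module instead of the exact 0.375, so the success value drops from 0.625 to 0.5625.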

In most cases the MCS BDD yields a lower result,
owing to the increased accuracy and, above all, to
the treatment of dependencies between failed events
and the success module. In some of the analysis
cases the MCS list value decreased several-fold.
This is typically the case when the success module
contains one of the important events in the failed
part of cutsets and this event occurs in combination
with high probability events in the success module.
This means that in all combinations where the event
is failed the success module will have a very low
probability.

The default settings (1E-3, 1E-2, 1E-2) seem 
appropriate also for analyses with quantified event 
tree success. 

All results have been studied, and a few analysis 
cases per model have been reviewed to ensure that 
the differences in results can be explained, and that 
the MCS BDD produces a better estimate of the 
MCS list. We summarize comments on a selection 
of analysis cases. 


Model 7 — Several sequences with MCS BDD 
results significantly lower than MCUB esti- 
mates have been studied in detail. The result is 
overestimated due to a number of high prob- 
ability events. The results without success mod- 
ules show the same behavior. Calculating the 
MCS list value up to the 6th order confirms this 
hypothesis. 

Model 7 — two sequences yield a very low result 
(zero) with the MCS BDD and non-zero in 
MCUB. The reason is that the success module 
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contains an event which is a part of all cutsets 
in the failed part of the MCS list. This event 
is combined with another event in the success 
module which has probability one. Since the 
event is failed the success module probability 
will be zero and the value of all cutsets becomes 
zero as well. 

Model 11 — Eight analyses with high differences 
were studied in detail. Four sequences with significantly 
different results contain many high probability events 
that recur in most of the important cutsets. This causes 
a very large overestimate by the MCUB algorithm. One 
analysis case shows a difference of more than 1000% of 
the top value. We have shown by a reduced MCS list that 
the MCUB analysis indeed yields a very large overestimate 
in this case due to high probability events. Four analysis 
cases whose results are more than 50% higher with the 
MCS BDD have strong dependencies between failed basic 
events and basic events in the success module. Dependent 
quantification of success modules can take these 
dependencies into account and thus reduce the 
conservatism in the MCS list value. 
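
The overestimate discussed for these models can be reproduced on a toy cutset list. The sketch below is only an illustration under stated assumptions: the events, probabilities and cutsets are hypothetical, the MCUB value is taken as 1 - prod(1 - P(MCS_i)), and the reference value is obtained by brute force enumeration of all event states rather than by a BDD, so it is not the RiskSpectrum implementation.

```python
from itertools import product
from math import prod


def cutset_probability(cutset, p):
    """Probability that every basic event in one cutset has failed."""
    return prod(p[event] for event in cutset)


def mcub(cutsets, p):
    """Min cut upper bound: 1 - prod_i (1 - P(MCS_i))."""
    return 1.0 - prod(1.0 - cutset_probability(c, p) for c in cutsets)


def exact_value(cutsets, p):
    """Exact probability that at least one cutset is fully failed.

    Enumerates all 2^n event states, so it is only usable for tiny lists.
    """
    events = sorted({event for c in cutsets for event in c})
    total = 0.0
    for states in product((False, True), repeat=len(events)):
        failed = {e for e, is_failed in zip(events, states) if is_failed}
        if any(set(c) <= failed for c in cutsets):
            total += prod(p[e] if e in failed else 1.0 - p[e] for e in events)
    return total


# Hypothetical data: three high probability events (A, B, C), each
# combined with the same shared event X.
p = {"X": 0.1, "A": 0.9, "B": 0.9, "C": 0.9}
cutsets = [("X", "A"), ("X", "B"), ("X", "C")]

mcub_value = mcub(cutsets, p)
reference = exact_value(cutsets, p)
print(f"MCUB:  {mcub_value:.4f}")
print(f"Exact: {reference:.4f}")
```

For this toy list the MCUB value is about 0.246 while the enumerated value is 0.0999, because the bound treats the three cutsets as independent although they all share the event X.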


3.2 Comparison to exactly calculated ET success 


A collection of models with non-coherent struc- 
tures have been set up for the assessment purpose. 
Most of them are based on real-life models. In 
each test model, the evaluated event tree has an 
initiating event (IE) and three function events (F1, 
F2, F3). Figure 3 depicts such an event tree. 

Success of each function event contributes to 
the success module, which means that cutsets for 
all sequences but the last one (failure of all func- 
tion events) contain success modules. Each func- 
tion event takes a fault tree as input. Table 3 lists 
relevant features affecting the evaluated test cases 
for each model. 

Results produced by the MCS BDD algorithm are compared 
to values obtained by the method which avoids function 
event success and quantifies only combinations of function 
event failures (Nusbaumer & Rauzy 2013). Function event 
failure combinations are also decomposed to MCS lists 
which are then quantified by means of MCS BDD. This 
allows us to assess the accuracy of function event success 
quantification in MCS BDD. Table 4 contains the 
comparison of these two methods. 


Figure 3. An example event tree used in assessment of 
non-coherent structure quantification. 


Table 3. Model characteristics for event tree 
quantification. 

Test model  Number of gates  Negations  High prob. events  Others 
EXACT1   7 
EXACT2   69 
EXACT3   22442  Y  Y 
EXACT4   7771   Y 
EXACT5   5272   Y  Y 
EXACT6   7719   Y  IE and FE have dependencies. 
EXACT7   5464   Y  Y 
EXACT8   7228 
EXACT9   13125 
EXACT10  14021  Y  Y 


Table 4. Comparison of the values computed by MCS 
BDD and reference values calculated by the method from 
Nusbaumer & Rauzy (2013). 

            Maximum difference     Maximum difference 
            in sequence analysis   in consequence analysis 
Test model  case results (%)       case results (%) 
EXACT1      0.0                    0.0 
EXACT2      1.1                    0.1 
EXACT3      62                     1.0 
EXACT4      0.0                    0.0 
EXACT5      0.0                    0.0 
EXACT6      1.0                    0.3 
EXACT7      0.1                    0.0 
EXACT8      125.1                  0.0 
EXACT9      2.0                    0.1 
EXACT10     0.1                    0.1 

For the test model EXACT3, three sequences differ by 14%, 
19% and 62%. This is due to the accuracy of success 
modules. Note that the accuracy of the success module can 
be adjusted, but this has not been done in this evaluation 
(only when results are investigated). Sequences with low 
frequency in the model EXACT4 get a lower value than the 
reference value. This is due to cutoff application during 
MCS generation. With a lower cutoff, the value rises 
slightly above the reference. One low frequency sequence 
in the model EXACT7 is below the reference value. This is 
the same cutoff issue. 

In the model EXACT8, two low probability sequences 
differ by 99% and 125%. This is due to success module 
accuracy. Increasing the success module accuracy gives 
differences of 0.7% and 2.7%, respectively. The model 
EXACT10 contains four sequences with values below the 
reference value by 0.1%. One consequence is below the 
reference value by 0.1%. This is a cutoff issue during 
generation of minimal cutsets. 

Parameters used to generate the MCS BDD were 
set to zero for small models (EXACT1, EXACT2), 
ensuring exact treatment of all nodes. For other 
models, the MCS limit has been set to 1E-5, Q limit 
and FV limit have been set to 1E-3. 


4 IMPORTANCE—RISK INCREASE 
FACTOR 


The new quantification procedure does not only 
affect MCS list values, but also other measures 
such as importance factors. One of these measures 
most affected by inaccuracies not handled by the 
first order calculation is the Risk Increase Fac- 
tor (RIF), also called Risk Achievement Worth 
(RAW). 

A RIF of a component is defined as the factor 
of power plant risk increase when the component 
is unavailable (Vesely et al. 1983). Whenever any 
failure mode of this component occurs in a fault 
tree, it needs to be considered as failed (Bäckström 
et al. 2016). The method implemented in Risk- 
Spectrum (RiskSpectrum 2016) calculates RIF for 
an object by setting the failure probability of all 
events in this object to one, recalculating the MCS 
list of the analysis case and dividing the obtained 
value by the nominal MCS list value. 
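
The RIF procedure described above (set the events of an object to one, requantify the list, divide by the nominal value) can be sketched as follows. This is a minimal illustration, not the RiskSpectrum code: the function names, cutsets and probabilities are hypothetical, and the MCUB formula is used only as a stand-in for whichever quantification method is applied to the MCS list.

```python
from math import prod


def mcs_list_value(cutsets, p):
    """Quantify an MCS list; MCUB is used here purely as a stand-in."""
    return 1.0 - prod(1.0 - prod(p[e] for e in c) for c in cutsets)


def risk_increase_factor(cutsets, p, failed_events):
    """RIF: list value with the events set to one, divided by the nominal value."""
    p_failed = dict(p)
    for event in failed_events:
        p_failed[event] = 1.0
    return mcs_list_value(cutsets, p_failed) / mcs_list_value(cutsets, p)


# Hypothetical basic events and cutsets.
p = {"A": 0.01, "B": 0.02, "C": 0.05}
cutsets = [("A", "B"), ("C",)]

rif_a = risk_increase_factor(cutsets, p, ["A"])
rif_group = risk_increase_factor(cutsets, p, ["A", "B"])
print(f"RIF(A)    = {rif_a:.3f}")
print(f"RIF(A, B) = {rif_group:.3f}")
```

The same function handles a single basic event and a group of events, which mirrors the distinction between basic event RIF and event group RIF made in this section.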

We have evaluated RIF calculations with the 
MCS BDD algorithm on three models (a sample 
model and two real-life models). For selected basic 
events and event groups, we have calculated the 
RIF value by re-running the analysis with the ana- 
lyzed basic events marked as failed in the model. 

The results for basic events are summarized in Table 5. 
We report the maximal decrease of RIF in percent of the 
new RIF value (Max) and the minimal increase of RIF in 
percent of the new RIF value (Min). 


Table 5. A summary of risk increase factor values for 
basic events in three evaluation models. Two models 
have been analyzed with and without event tree success 
quantification. 

            Number of     Max   # >   # <   Min 
Test model  basic events  (%)   10%   0%    (%) 
S           126           3.0   0     2     -0.2 
S (Succ)    126           3.3   0     4     -1.3 
R1          320           12    37    66    -4.8 
R1 (Succ)   320           12    37    52    -4.8 
R2          470           19    2     93    -1.9 

The changes in RIF values are rather moderate. 
The decrease of RIF values for many basic events 
is relatively surprising. This decrease is small, less 
than 5% in all cases. The explanation of this phe- 
nomenon lies in the increased accuracy also when 
calculating the nominal MCS list value—the value 
of the MCS list generated by the original analysis. 

Table 6 contains manually calculated RIF values 
for sample basic events from the model S and R1. It 
shows the RIF value calculated by the MCUB quan- 
tification, by the MCS BDD algorithm and manu- 
ally with newly generated results, quantified by MCS 
BDD. 

One can see that the MCS BDD quantification 
improves RIF values in the sense that it brings 
them closer to the actual RIF. This is even the case 
if the RIF value increases for a basic event, exem- 
plified by the basic event BE-01896. 

Apart from individual basic events, RiskSpectrum also 
analyzes RIF for groups of basic events (which could be 
defined in multiple ways: as components, according to 
attributes, or directly as groups of basic events). Table 7 
shows differences in RIF values for groups of basic events 
when calculated by the MCUB algorithm, MCS BDD or 
manually. 


Table 6. RIF values for sample basic events calculated 
by the MCUB quantification, MCS BDD and a manual 
calculation with a newly generated MCS list. 

Test model  Basic event   RIF (MCUB)  RIF (MCS BDD)  RIF (Manual) 
S (Succ)    ACP-DG02-M    1.62        1.60           1.60 
S (Succ)    ACP-DG01-M    1.62        1.60           1.60 
S (Succ)    EFW-TR01-M    1.49        1.47           1.47 
S           ACP-GT01-A    2.06        2.04           2.04 
S           FEED&BLEED    3.03        3.03           3.03 
R1 (Succ)   BE-00924      4.54        4.07           4.02 
R1 (Succ)   BE-00882      4.54        4.07           3.46 
R1          BE-01896      2020        2110           2110 
R1          BE-09400      3.86        3.46           2.20 


Table 7. RIF values for sample event groups calculated 
by the MCUB quantification, MCS BDD and a manual 
calculation with a newly generated MCS list. 

Test model  Basic event   RIF (MCUB)  RIF (MCS BDD)  RIF (Manual) 
S (Succ)    SYSTEM:RHR    367         23.3           23.3 
S (Succ)    SYSTEM:EFW    414         19.8           19.8 
S (Succ)    SYSTEM:ECC    43.5        6.56           6.56 
R3          EG-2          1.51E+6     1.15E+6        4.05E+5 
R3          EG-3          17914       3613           N/A 
R3          EG-4          2.47E+5     1.05E+4        N/A 
R3          EG-5          127         121            N/A 

The evaluation of the RIF values calculated by 
the MCS BDD algorithm shows that it can achieve 
lower (more exact) values for groups of basic 
events, in some cases one order of magnitude. For 
smaller and simpler models, it achieves the values 
one gets from a calculation re-generating the MCS 
list without the event(s) under study. 


5 MCS BDD SETTINGS 


The heuristics balancing scalability and accuracy 
of the MCS BDD algorithm can be steered by the 
following settings: 

MCS limit: specifies the part of the MCS list which 
should be converted into a BDD, while the rest of the 
MCS list is quantified by the MCUB algorithm. 

Q limit: specifies the probability from which events 
are treated exactly when building the BDD. 

FV limit: specifies the importance from which events 
are treated exactly when building the BDD. 

Node limit: absolute bound on the number of BDD nodes. 
If more nodes are needed, the algorithm increases other 
limits and restarts. 

Exact usage of these limits is described in RiskSpectrum 
(2016). In general, the lower the MCS, Q and FV limits 
are, the bigger and more precise the generated BDD is, 
within the size defined by the Node limit. At the same 
time, we can expect longer generation times for larger and 
more precise BDDs. 


Table 8. Differences in calculation times and MCS list 
values with varying calculation settings. Each cell gives 
Time (s) / avg. diff. to default (%) / Max or min diff. to 
default (%). 

Settings          M-02                 M-10               M-05 
Default           0:21 / N/A           1:19 / N/A         1:55 / N/A 
1E-1, 1E-2, 1E-2  0:09 / 2.2 / 42.4    0:28 / 0.5 / 11.1  0:45 / 0.6 / 79.9 
1E-5, 1E-2, 1E-2  0:46 / 7E-3 / -3.2   4:04 / 0.0 / -1.7  1:48 / 0.0 / -0.9 
1E-3, 1E-1, 1E-2  0:12 / 0.02 / 0.4    0:30 / 0.0 / 1.9   2:44 / 0.05 / 3.4 
1E-3, 1E-4, 1E-2  0:56 / -0.02 / -5.7  7:43 / 0.0 / -1.2  9:40 / -0.02 / -3.9 
1E-3, 1E-2, 1E-1  0:09 / 0.02 / 0.4    0:32 / 0.0 / 1.9   2:47 / 0.2 / 59.1 
1E-3, 1E-2, 1E-4  0:26 / -0.05 / -5.7  7:17 / 0.0 / -1.2  9:37 / -0.03 / -12.4 
1E-3, 0, 0        5:26 / 3E-3 / -5.8   N/A                N/A 

Table 8 shows the MCS list values and BDD 
generation times with different settings. Default 
settings are 1E-3, 1E-2, 1E-2 for the MCS limit, 
Q limit and FV limit, respectively. We always write 
setting values in this order. 

The evaluation shows surprisingly small differ- 
ences in the resulting MCS list values. This can be 
explained by the automatic parameter adjustment 
and it confirms the choice of default setting val- 
ues. Adjusting the settings manually makes sense 
in individual analyses when one suspects that 
the accuracy could be substantially increased by 
a larger BDD. In this case, one needs to accept 
longer calculation times, as it might require adjust- 
ment of the Node limit setting. 


6 CONCLUSIONS 


Experimental assessment of the MCS BDD algo- 
rithm implemented in RiskSpectrum leads to the 
following conclusions. 

Heuristics used in the quantification algorithm 
have proven to be very efficient. It is possible to 
derive results with relatively high accuracy. The 
frequency estimate of a minimal cutset list might 
be several times smaller for cases where first order 
approximations give overly conservative results. 
This is the case in presence of high probability 
events which occur multiple times in cutsets with 
a high contribution. 

More importantly, the MCS BDD algorithm 
allows for very accurate quantification of event 
tree success, both for individual sequences and 
(even more) for groups of sequences defined by 
a consequence to which they lead. Limitations in 
accuracy of several cases reported in this paper 
stem from the bounds on the success module accu- 
racy with default settings. Improving the success 
quantification by lifting these bounds will be inves- 
tigated in future work. 

The new quantification also improves estimates 
of importance factors. We have evaluated the effect 
on the Risk Increase Factor, where especially the 
values for basic event groups might improve dra- 
matically, in some cases by an order of magnitude. 

Finally, the combination of default setting values and 
the automatic setting adjustment results in acceptable 
calculation times in commercial applications. 
ABSTRACT: The concept of Social Vulnerability (SV) is characterized by its multidimensionality. In 
the present study, Social Vulnerability was analyzed and evaluated according to the methodology 
developed by the Center for Social Studies of the University of Coimbra, whose innovative feature is 
the incorporation of the Criticality and Support Capability components. Social Vulnerability was 
calculated for the 278 municipalities of mainland Portugal using factor analysis. The evaluation and 
calculation of Criticality was carried out using 22 variables, selected from an initial number of 90, and 
the calculation of Support Capability was performed using 12 variables, from an initial number of 145 
variables. The outputs obtained should serve as a working basis for managers and stakeholders, 
authorities at different levels, and the whole community, with the objective of adopting adaptation and 
mitigation measures for natural and technological risks. 


1 INTRODUCTION 


The concept of Social Vulnerability (SV) is characterized 
by its multidimensionality, encompassing not only the 
social characteristics of the individual, but also their 
social and economic relations, as well as the physical and 
social environment in which the individual lives (Tapsell 
et al., 2010). The differentiating characteristics of SV 
make it essential not only for characterizing and 
understanding the degree of exposure of communities, but 
also for assessing their capacity to resist and recover in 
the face of hazardous events. 

Historically, the concept of Social Vulnerability emerged 
as an explicit critique of the dominant and conventional 
paradigms of disaster analysis, with Hewitt (1983). The 
Sendai Framework for Disaster Risk Reduction takes up the 
concept of vulnerability as the conditions determined by 
the physical, social, economic and environmental factors 
or processes that increase the susceptibility of a 
community to the impact of hazards (UNISDR, 2015). Thus, 
the scientific community has recognized the need to 
consider social vulnerability as a particular dimension of 
vulnerability, developing distinct approaches for its 
measurement (e.g., Angeon and Bates, 2015; Rufat et al., 
2015; Fatemi et al., 2017). 

As noted by Wisner et al. (2004), vulnerability to hazards 
is a multidimensional process that consists of a 
multiplicity of components related to historical, 
political, economic, environmental and demographic 
factors, which produce inequalities, dynamic pressures 
such as rapid urbanization and social pressures, and 
unsafe living conditions that give rise to unequal 
exposure to risk. 

There are multiple and distinct methods of 
measuring vulnerability (Birkmann, 2006; Fuchs 
et al., 2012; Birkmann, 2013; Birkmann et al., 
2013). In the present work, Social Vulnerability 
to natural and technological risks was analyzed 
and evaluated according to the methodology 
developed by the Center for Social Studies of 
the University of Coimbra (CES) and its Risk 
Observatory (OSIRIS) (Mendes et al., 2011). 
According to Mendes et al. (2011) the concept of 
SV is associated with the degree of exposure to 
natural and technological hazards and extreme 
events, depending closely on the resilience of indi- 
viduals and communities. 

Social Vulnerability must be a planning tool, 
supporting the implementation of a territorial 
model in which decision-making on risk manage- 
ment would be more efficiently applied. 

The study is divided into 5 sections: a) Presenta- 
tion of the area of study; b) methodology for the 
calculation of Social Vulnerability and its compo- 
nents; c) results at municipal level; d) discussion of 
the results; e) conclusions of study. 


Figure 1. Location of the studied area: a) Continental 
Portugal (NUT I); b) Territorial organization in NUT II 
and LAU I (municipalities). 


2 STUDY AREA 


The present study is based on the calculation of SV for 
the 278 municipalities of mainland Portugal, with a total 
area of 89,089 km² and a resident population of 10,044,484 
inhabitants according to the 2011 Census (INE, 2012). In 
administrative terms, Portugal is organized according to 
the NUTS (Nomenclature of Territorial Units for 
Statistics) classification, which is subdivided into three 
levels defined according to population, administrative and 
geographical criteria, and into two LAU (Local 
Administrative Unit) levels, in accordance with Decree-Law 
244/2002, amended in 2015 by Regulation no. 868/2014. The 
work presented here carries out its analysis at the level 
of NUT III, which is composed of 23 territorial units, and 
LAU I, which is composed of 278 municipalities (Figure 1b). 


3 METHODOLOGY 


The principal objective of this work is to evaluate Social 
Vulnerability at the municipal level in mainland Portugal. 
This evaluation uses principal component analysis (PCA), a 
technique also used by authors such as Cutter et al. 
(2003), Schmidtlein et al. (2008), Mendes (2009) and 
Barros et al. (2015), with adaptations according to 
regional and local specificities, expressed in the type of 
variables and unit of analysis selected. The PCA was 
performed with the software SPSS®, version 23. The data 
supporting this evaluation were obtained from the 2011 
Census (INE, 2012) and the PORDATA database (PORDATA, 
2017). In this study the conceptual understanding of 
Social Vulnerability defined by Mendes et al. (2011) was 
adopted, where SV is composed of two components: 
Criticality and Support Capability. The evaluation of 
Social Vulnerability was based on PCA, where redundant 
variables are eliminated and the remaining ones are 
normalized and grouped into factors. The PCA was carried 
out based on a set of premises, of which the following 
stand out: a) the analysis of the Pearson correlation 
matrix; b) the variance rate (which should be greater than 
60%) and the Kaiser-Meyer-Olkin (KMO) measure of sampling 
adequacy (which should be greater than 0.6), with the 
purpose of eliminating redundant data (Comrey et al., 
2009) and selecting the most PCA-robust dataset; c) the 
use of Varimax rotation to better identify the principal 
components. This process is carried out for both 
Criticality and Support Capability. After obtaining the 
respective scores for each municipality, Social 
Vulnerability is calculated by combining the two 
components using the following equation: 


Social Vulnerability = Criticality × (1 − Support Capability)    (1) 


The results obtained are grouped into classes that range 
from very low to very high according to the standard 
deviation (SD), with the following categories: “very 
low,” < -1 SD; “low,” [-1, -0.5 SD]; “moderate,” 
[-0.5, +0.5 SD]; “high,” [0.5, 1 SD]; “very high,” 
> 1 SD (Cutter et al. 2003). 
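
Equation (1) and the standard deviation classes can be illustrated with a short sketch. It rests on assumptions that the text does not state: the component scores are taken to be scaled to [0, 1], deviations are measured from the mean of the Social Vulnerability values, the population standard deviation is used, and the treatment of values that fall exactly on a class boundary is a choice made here. The function names and input scores are hypothetical.

```python
from statistics import mean, pstdev


def social_vulnerability(criticality, support_capability):
    """Equation (1): SV = Criticality x (1 - Support Capability)."""
    return criticality * (1.0 - support_capability)


def classify(values):
    """Assign each value a class from its distance to the mean in SD units."""
    mu, sd = mean(values), pstdev(values)
    labels = []
    for value in values:
        z = (value - mu) / sd
        if z < -1.0:
            labels.append("very low")
        elif z < -0.5:
            labels.append("low")
        elif z <= 0.5:
            labels.append("moderate")
        elif z <= 1.0:
            labels.append("high")
        else:
            labels.append("very high")
    return labels


# Hypothetical scores for five municipalities, scaled to [0, 1].
criticality = [0.2, 0.4, 0.6, 0.8, 1.0]
support = [0.5, 0.5, 0.5, 0.5, 0.5]

sv = [social_vulnerability(c, s) for c, s in zip(criticality, support)]
for value, label in zip(sv, classify(sv)):
    print(f"{value:.2f} -> {label}")
```

With these inputs the five values fall one into each class, from very low to very high, which makes the class limits easy to check by hand.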


3.1 Criticality 


The calculation of Criticality for all municipalities of 
mainland Portugal was carried out using 22 variables 
grouped into seven groups (Table 1). PCA identified 6 
factors (FAC) based on the 22 explanatory variables. These 
factors present a variance rate of 73% for the 278 
municipalities under study, with a KMO of 0.726 and all 
communalities above 0.6. 


3.2 Support capability 


Support Capability was calculated using 
12 variables grouped into four groups (Table 2). 


Table 1. Groups of variables used in the calculation of 
municipal criticality. 

Groups                 Number of variables 
Social support 
Housing conditions 
Demography 
Economy 
Education 
Housing 
Health 
Table 2. Groups of variables used in the calculation of 
municipal support capability. 

Groups                      Number of variables 
Economy                     4 
Civil protection resources  4 
Building characteristics    2 
Health facilities           2 


Table 3. Criticality components. 

FAC  Name                      Explained variance (%) 
1    Risk groups               30 
2    Economic conditions       13 
3    Disadvantaged population  12 
4    Level of income           7 
5    Employment                6 
6    Dependent population      5 


Based on the variables presented in Table 2, three factors 
(FAC) were retained, presenting a variance rate of 
65% for the 278 municipalities under study, with a 
KMO of 0.705 and all communalities above 0.6. 


4 RESULTS 


4.1 Factors of criticality 


As mentioned above the Criticality assessment 
identified 6 factors with different percentages in 
the explained variance (Table 3). 


4.1.1 Factor 1: Risk groups 

The factor named “Risk Groups” explains 30% of the model 
variance, and the proportion of the population under 5 
years old is its dominant variable. This factor describes 
the most vulnerable population through that variable and 
is also explained by the following variables: proportion 
of population with difficulties; proportion of students 
per secondary educational establishment; and students per 
pre-school educational establishment. FAC 1 is also 
composed of variables related to housing, namely the 
proportion of rented accommodation and the proportion of 
seasonal housing. These characteristics are important 
because, according to Cutter et al. (2003) and Mendes et 
al. (2011), the type of accommodation in which an 
individual resides reflects, in most cases, their 
personal, social and economic characteristics. The last 
variable in this factor is the average value of social 
security pensions, which allows the identification of 
economically and financially fragile populations. 


4.1.2 Factor 2: Economic conditions 

Factor 2 explains 13% of the variance, and the proportion 
of employees working on behalf of others is its dominant 
variable. This FAC also comprises the following variables: 
proportion of self-employed workers as an isolated 
employer; proportion of self-employed workers; persons 
employed in the primary sector; average value of social 
protection pensions; proportion of seasonal households. In 
this FAC it is considered that the better the economic 
condition, the greater the capacity to face and recover 
from hazardous events. 


4.1.3 Factor 3: Disadvantaged population 

Factor 3 is related to disadvantaged people and accounts 
for 12% of the model variance. The dominant variable is 
beneficiaries of the Social Integration Income (RSI) and 
Minimum Guaranteed Income (RMG). The proportion of housing 
units with rent below 100 euros, the proportion of 
buildings built before 1919 and the proportion of the 
employed population in the primary sector are the other 
variables present in this FAC. This FAC represents, in 
most cases, the population with low income and low 
socio-professional status that is highly dependent, 
economically and socially, on institutional aid. 


4.1.4 Factor 4: Level of income 

Factor 4 explains 7% of the variance and is composed of 
the following variables: customer deposits in banks, 
savings banks and mutual agricultural credit, which is the 
dominant variable, and purchasing power ratio. This factor 
is related to the economic capacity of the population. 


4.1.5 Factor 5: Employment 

This factor explains 6% of the total variance and is 
composed of two variables: the proportion of the employed 
population in the tertiary sector (dominant variable) and 
the proportion of the population employed in the secondary 
sector. 


4.1.6 Factor 6: Dependent population 

Factor 6 explains 5% of the variance and is composed only 
of the variable proportion of social housing supported by 
social and supported income, and is directly related to 
the economic power of the population. 


4.2 Criticality factors’ cartography 


The analysis of factor 1 (Figure 2) shows that the highest 
values related to risk groups are located mainly in the 
municipalities of the central and inland areas of 
Portugal. In most cases this is directly related to areas 
with high percentages of elderly population and low 
percentages of young population. In factor 2, a clear 
distinction is observed between the north and south areas 
(highest values) and the center region. These highest 
values of criticality, related to poor economic 
conditions, are located essentially in the municipalities 
belonging to the NUT III units of Alto Tamega, Terras de 
Tras-os-Montes and Douro (in the north) and in the 
southern areas of Baixo Alentejo, Alentejo Litoral and 
Algarve, where the primary sector still plays a very 
important role in the regional economy. 

These are also areas with a significant proportion of 
self-employed workers as isolated employers and of 
self-employed workers in general, mostly related to the 
primary sector. Factor 3 is related to the disadvantaged 
population, and Figure 2 shows that the highest values of 
this factor emerge along the Douro river valley and south 
of the Tagus river, in municipalities with high 
percentages of the population that are beneficiaries of 
the RSI and RMG, live in low-rent housing and old 
buildings, and work in the primary sector. 

The cartography of factor 4, level of income, shows great 
territorial homogeneity in the variables that compose this 
factor. Factor 5 is related to employment, namely the 
population employed in the secondary and tertiary sectors. 
In this analysis we considered employment in the secondary 
sector to be more vulnerable. This is related to the 
predominance of small and medium enterprises, with value 
added (VAB) lower than in the tertiary sector and with 
greater fluctuation in productivity and employment in 
times of crisis. The cartography identifies the highest 
values in the coastal zone north of the Tagus river, 
highlighting the NUT III units of Região de Leiria, Região 
de Aveiro, Area Metropolitana do Porto, Tamega e Sousa, 
Cavado e Ave, which stand out as areas with strong 
industrial and commercial dynamism. The factor dependent 
population (factor 6) presents its highest values north of 
the Tagus river, mainly in the municipalities on the right 
bank of the Douro river. 


Figure 2. Cartography of the three factors that compose 
the criticality. 


4.3 Factors of support capability 


The Support Capability assessment identifies 3 
factors that resulted from PCA with different per- 
centages in the explained variance (Table 4). 


4.3.1 Factor 1: Civil protection resources 

Factor 1 explains 30% of the total variance and is related 
to the municipal civil protection capability. The dominant 
variable is the number of firefighter corporations per 
1000 inhabitants. The other variables of the model are: 
firefighters per 1000 inhabitants, average number of 
inhabitants per covered space (which represents shelter 
facilities), pharmacies per 10 000 inhabitants and density 
of the road network. 


4.3.2 Factor 2: Economic and environmental dynamic 

This factor explains 22% of the variance and is composed 
of the following variables: urban waste collected, in kg 
per inhabitant; proportion of collective households; ATMs 
per 1000 inhabitants; and accommodation capacity in hotel 
establishments per 1000 inhabitants, which is the dominant 
variable. 


Table 4. Support capability components. 

FAC  Name                                Explained variance (%) 
1    Civil protection resources          30 
2    Economic and environmental dynamic  22 
3    Logistics and services capacity     12 


4.3.3 Factor 3: Logistics and services capability 

This factor is related to economic dynamism and explains 
12% of the variance. The dominant variable is ATMs per 
1000 inhabitants. The other variables that compose factor 
3 are hospitals per 1000 inhabitants and insurance 
agencies per 1000 inhabitants. 


4.4 Support capability factors cartography 


Figure 3 shows the cartographic representation of each 
FAC expressing Support Capability. 

Factor 1 is related to civil protection resources, and 
Figure 3 shows that the lowest values are located in the 
Lisboa and Porto metropolitan areas, as well as in 
adjacent municipalities. This factor is directly related 
to population density, since a relatively small number of 
resources serves a greater number of inhabitants when 
compared with less urbanized areas. The factor economic 
and environmental dynamic (factor 2) expresses mainly the 
urban character of the different municipalities. The 
majority of the municipalities analyzed present moderate 
values, with the lowest values concentrated mainly on the 
northern bank of the Tagus river. 


Figure 3. Cartography of the three factors that compose 
the support capability. 


4.5 Criticality at municipal level 


Figure 4 presents the cartographic representation 
of Criticality for mainland Portugal. 

The lowest values of Criticality are mainly concentrated 
in the coastal area, especially in the Algarve region, and 
in the main regional capitals of Lisboa, Leiria, Coimbra 
and Porto, and their neighboring municipalities. The 
highest values arise predominantly in northern 
municipalities, namely in Alto Tamega, Tras-os-Montes and 
along the Douro river valley. We also observe high 
Criticality values in the central region of Portugal, 
notably in the municipalities surrounding Viseu (located 
in NUT III Viseu Dão Lafões), and along the border with 
Spain in the municipalities belonging to the NUT III units 
of Alto Alentejo, Alentejo Central and Baixo Alentejo. 


Figure 4. Criticality in mainland Portugal. 


4.6 Support capability at municipal level 


Figure 5 shows the cartographic representation of 
Support Capability for mainland Portugal. 

The analysis of Figure 5 shows that the metropolitan areas 
of Lisboa (with the exception of the municipalities of 
Lisboa and Oeiras) and Porto (with the exception of the 
municipality of Porto) present very low and low values of 
Support Capability. This is also noted in the majority of 
municipalities and NUT III units surrounding these areas. 
We can conclude that, in most cases, such low values are 
directly related to high population density. On the other 
hand, the highest values are predominantly located in the 
inland municipalities, especially in areas south of the 
Tagus river, in municipalities characterized by the 
availability of resources for a small number of 
inhabitants. 


Figure 5. Support capability in mainland Portugal. 


4.7 Social vulnerability at municipal level 


The application of equation 1, which combines Criticality 
and Support Capability, yields the Social Vulnerability of 
the 278 municipalities of mainland Portugal. The analysis 
shows that the highest values of Social Vulnerability are 
concentrated in the northern areas, namely in 
municipalities located along the Douro river valley, in 
the region of Tamega and Sousa, Ave, the southern area of 
the Porto metropolitan area, Alto Tamega, Terras de 
Tras-os-Montes and Viseu Dao Lafoes. 

The lowest values are concentrated in the southern part of 
the country, notably in the regions of Baixo Alentejo and 
Algarve, where the majority of municipalities have values 
of Social Vulnerability ranging from low to very low. 


5 DISCUSSION 


The analysis and evaluation of Social Vulnerability 
allows to conclude that we can divide, in general 
terms, the mainland Portugal in two areas: the area 
at north and the area at south of the Tagus river 
where the high and very high values are mainly 
located in the northern part. The reasons for this 
spatial distribution depends on several factors. 

In terms of Criticality, the most important factors
at the municipal level are those related to risk
groups, economic conditions and the disadvantaged
population. Of the 278 municipalities, 40% present
moderate Criticality, 30% present values from
very low to low and 30% from high to very high.

Regarding Support Capability, we observe a relation
between the lowest values and high population
density. The most important factors are associated
with variables related to civil protection resources
(factor 1) and variables related to economic and
environmental dynamics (factor 2). We also conclude
that 39% of the analyzed municipalities present
moderate Support Capability, 34% present values from
very low to low and 27% from high to very high.

In terms of Social Vulnerability, it is possible to
conclude that the final values are strongly influenced
by factors related to the weak economic
power of the resident population, the fragility of
its economic fabric and the presence of significant
percentages of dependent and disadvantaged
population.

The present methodology makes it possible to compare
and differentiate regions and municipalities whose
characteristics of criticality, support capability
and social vulnerability would not otherwise be
evident. The spatialization of each component and its
associated variables is important for the definition,
application and promotion of measures related to
social policies, housing, the distribution and
reinforcement of collective facilities, the
implementation of an economic development model that
is more balanced in terms of employment in the inland
areas, and urban planning policies. The implementation
and success of these measures are important for
reducing asymmetries between regions and
municipalities. For these measures to succeed, it is
important to promote and encourage inter-municipal
resource sharing, in keeping with the multidimensional
and multidisciplinary character of Social
Vulnerability and its associated components.


6 CONCLUSIONS 


This work presents the calculation of Social
Vulnerability for all 278 municipalities of
mainland Portugal in accordance with the methodology
presented by Mendes et al. (2011). The multidimensional
character of this methodology, which combines
Criticality and Support Capability, allows not only
the calculation of Social Vulnerability but also,
because of its strong territorial component, the
definition of the Territorial Vulnerability
of the analyzed areas.

The multidimensionality of this study, which is
based on an extended set of variables from various
dimensions such as social support, housing,
demography, economy, education and health, allows its
application in several risk governance dimensions.
Cross-referencing these data with existing
regional or local information may result in programs
that promote capacity and social cohesion.
The outputs of the present study allow observation
and comparison among different places. They can and
should serve as a working tool for analysis and
application by different stakeholders, from multiple
sectors and authorities at national,
regional and local level.

Knowledge and awareness of the territorial
distribution of Social Vulnerability and its
components (Criticality and Support Capability), as
well as their consideration in risk management, where
spatial planning instruments are a central part of
the process, are key to the definition and application
of multidisciplinary and multi-scale risk management
strategies that consider not only the physical aspects
of the territory but also all its social and
institutional dimensions. In fact, the implementation
of municipal and local measures that address high
Social Vulnerability (SV) contexts would first
require adequate institutional building, drawing
upon the best risk governance practices.


ABSTRACT: Complex systems are prone to catastrophic failure, as their complexity can itself cause the
system to collapse. Wind energy systems, a fast-growing source of electricity, are rapidly increasing in
complexity and size, which makes them inherently hazardous. Wind turbine failures have a significant
impact on public health and safety, productivity and the economy. Although design plays a major role in
developing safer and more reliable systems, achieving the desired operational safety and reliability
remains a difficult task for wind turbine manufacturers. To ensure better safety and reliability during
operation, effective design and maintenance measures need to be taken. Criticality analysis of the wind
turbine components or subsystems is one way to achieve these objectives. It helps to identify critical
failure modes or items, which in turn assists in formulating optimal design and maintenance procedures so
that better operational safety and reliability of wind turbines can be obtained. The conventional FMECA,
which is used for criticality analysis, takes care of the effect of failure on components but does not
consider the causal relations or interdependencies among failures. This paper presents a method of
criticality analysis of a wind turbine energy system using fuzzy-based digraph models and the matrix
method, taking into account the causal relations/interdependencies among failures. This helps to identify
the critical failure modes/units of the wind turbine energy system. The proposed method is useful in
criticality assessment of wind turbines in the design as well as the operation stages.


1 INTRODUCTION 


Wind energy is a clean and renewable resource that
offers several advantages. To capitalize on it, the
economically leading countries have been harvesting
wind energy for many decades [1]. However, the
efficiency of a wind turbine system is limited by the
failures of its elements [2-5]. The critical issues
related to reliability and maintenance of wind energy
systems have not been fully addressed, and they remain
major challenges in operating and maintaining wind
power systems [6]. A large wind energy system has
numerous elements at its various hierarchical levels,
and each element follows a different failure pattern
[7-8]. This makes the system more complex, leading to
hard-to-access faulty units, difficulty in maintenance
and poor reliability. The operation and maintenance
costs, which are directly affected by the
unreliability of the wind turbine elements, can be
reduced when the whole system continues to function
without failure, or with a reduced failure rate [9].
It is, therefore, crucial to achieve and sustain high
operational reliability. Although several researchers
have studied reliability aspects of wind turbine
systems [10-13], there are limited findings on how to
evaluate the criticality of wind turbine elements with
respect to failure characteristics such as failure
patterns and failure interdependencies. Tools like
FMEA may be good at assessing criticality but fall
short in expressing failure dependencies [14]. In
reality, failures, which are losses of function, are
interrelated: the failure (i.e., deteriorated
function) of one unit has a causal relation with the
failure of another unit to which it is connected
through the physical structure [15]. Studies that
ignore this will provide inaccurate results. It is,
therefore, essential to consider the failure
interdependencies through structure while evaluating
criticality. The well-known structural models, namely
graph/digraph models and matrix methods, which are
effective for modelling failure dependencies [15-16],
are extensively employed in various engineering
applications. Their application to criticality
analysis of wind turbine energy systems has not yet
been explored. FMEA-based models have been used in
recent years for criticality assessment of wind
turbine systems [14], but these are limited to the
modelling of failure modes and do not consider failure
interdependencies. Several studies have solved
engineering problems in a fuzzy environment using
fuzzy theory [17], but little literature is available
on the application of fuzzy theory to failure studies
of wind turbine systems. This paper presents a
methodology for criticality analysis of a wind turbine
energy system based on fuzzy decision making, digraph
models and matrix methods. The method is effective in
identifying critical units in a wind turbine energy
system, taking into account the failure modes and
their interdependencies.


2 METHODOLOGY 


The proposed method is based on structural models,
i.e., digraph models and matrices, and fuzzy
decision-making methods. The selected wind turbine
energy system is first described to obtain
structural knowledge, which helps to understand
the various units and their interconnections. The
common but important failure modes and their
causal relations/interdependencies are then identified
for the units and represented using a digraph
model. The resultant digraph, known as the causality
digraph, is converted into a causality matrix using
the matrix method. The matrix is further processed
for causality analysis and evaluation of a criticality
index for each unit of the wind turbine system. A unit
with a high criticality index is considered a critical
unit that will contribute to major breakdowns and
safety issues of the wind turbine energy system,
whereas a unit with a low criticality index is
considered a low-criticality unit. In this way, the
criticality of the wind turbine energy system is
evaluated and critical units are identified.
Subsequently, appropriate preventive/corrective
action can be taken.


3 SYSTEM DESCRIPTION 


A typical horizontal wind turbine energy system,
which is shown in Figure 1, is considered for the
analysis.


Figure 1. A typical horizontal wind turbine energy
system.


The system consists of the following units, i.e.,
components and subsystems:


1. Wind (fluid component): This sweeps over
the turbine blades at high velocity and imparts
a force on the blades, causing them to rotate.

2. Nose and rotor hub: The aerodynamic nose,
coupled with the rotor hub, streamlines and
distributes the wind over the blades.

3. Rotor blades: These are attached to the nose
and rotor hub, and spin at sufficient wind velocity.

4. Drive train: This is the combination of the main
turbine shaft and the support bearing mechanism,
which connects the blades to the gear box and
transfers the rotational energy to the gear box,
and then to the generator.

5. Mechanical brake: This is used to stop the
turbine in order to prevent mechanical failures
of the turbine components at high wind speed.

6. Gear box: This is used to increase the
rotational speed of the turbine shaft with varied
torque.

7. High-speed turbine shaft: This is the part of
the drive train that connects the gear box and
the generator.

8. Generator: This converts mechanical energy
from the gear box into electrical energy.

9. Wind speed sensor: This measures wind speed.

10. Electronic control system: When the wind
speed is undesirably high, the controller
receives a command from the wind speed sensor to
stop the rotating drive elements through the
mechanical brakes. It also helps to restart the
rotation at low wind speed.

11. Wind direction sensor (wind vane): This
measures the direction of the wind and sends the
command signal to the yaw drive to adjust the
facing of the turbine with respect to the wind
direction. Many other types of sensor are
used for measuring the position and speed of
all rotating elements. These are not shown in
Figure 1.

12. Yaw drive: This receives the command signal
from the wind vane and rotates the turbine.

13. Yaw motor (hydraulically operated): This
physically rotates the turbine based on the
instructions from the yaw drive.

14. Supporting parts (tower): This supports the
entire wind turbine system high in the air. It
also encloses the complete electrical wiring
system (not shown in Fig. 1).

15. Nacelle (housing): This encloses all the
turbine units.

16. Electrical system (not shown in Fig. 1): This
consists of an inverter to steady the output
variables, current and voltage, and a transformer
to raise or lower the voltage in the AC
transmission line.

17. Hydraulic system (not shown in Fig. 1): This
consists of a hydraulic pump, control valves,
a hydraulic motor and actuators. It is used to
actuate the yaw mechanism.


In order to develop the causality digraph, the most
common failure modes of the units and their
interdependencies are identified. This is discussed
in the following section.


4 FAILURE DATA COLLECTION AND 
FAILURE MODE IDENTIFICATION 


From the literature and extensive interaction with
wind turbine manufacturing/installation company
personnel, failure data were collected for the wind
turbine energy system. The percentage of failures for
each unit of the system is given in Figure 2, and the
down time per failure, in days, for all units
is shown in Figure 3.

The collected failures were categorized into four
major failure modes, which are listed below:


1. Surface crack, rupture and overloading (F_s)

2. Component looseness (F_l)

3. Circuitry failure (F_c)

4. Fuse blown (F_b)


The above failure modes contribute to the 
majority of the failures in the wind turbine energy 
system. This is shown in Figure 4. 

These failure modes are causally interrelated/
interdependent. This means that one failure mode
may be the result of the occurrence of another
failure mode, and vice versa. The failure modes and
their interdependencies are represented graphically
by developing digraph models. This
is described below.

Figure 2. Percentage of failures of wind turbine energy
system.

Figure 3. Down time per failure (Days).

Figure 4. Percentage of important failure modes of
wind turbine energy system.

Figure 5. Causality digraph for wind turbine energy
system.


5 CAUSALITY DIGRAPH 


The causality digraph, G = (V, E), for the wind
turbine energy system is developed with nodes V
representing the failure modes and edges E
representing the causal relations/interdependencies
among the four major failure modes identified
in Section 4. The developed causality digraph
is shown in Figure 5.

The causality digraph helps to perform visual 
analysis. In order to carry out criticality analysis 
of failure modes and criticality evaluation of the 
units of wind turbine energy system, the causality 
digraph will be converted into an equivalent matrix 
by using matrix method. This is described in the 
subsequent section, i.e. Criticality Evaluation. 
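
The digraph-to-matrix conversion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `digraph_to_matrix`, the edge list and all numbers are hypothetical, since the edges of Figure 5 are not reproduced in the text.

```python
# Hypothetical sketch: a causality digraph stored as node and edge weights,
# laid out as a square matrix (nodes on the diagonal, edges off the diagonal).
MODES = ["s", "l", "c", "b"]  # surface crack, looseness, circuitry, fuse blown


def digraph_to_matrix(node_severity, edge_severity, modes=MODES):
    """Entry [i][j] is the node severity when i == j, otherwise the
    severity of the causal edge i -> j (0.0 when the edge is absent)."""
    matrix = []
    for i in modes:
        row = []
        for j in modes:
            if i == j:
                row.append(node_severity[i])
            else:
                row.append(edge_severity.get((i, j), 0.0))
        matrix.append(row)
    return matrix


# Illustrative values only; they are not taken from the paper.
nodes = {"s": 0.5, "l": 0.5, "c": 0.5, "b": 0.5}
edges = {("s", "l"): 0.7, ("l", "s"): 0.3, ("c", "b"): 0.9}

wtcm = digraph_to_matrix(nodes, edges)
```

Because the edges are directed, the entry for s -> l may differ from the entry for l -> s, so the resulting matrix is generally not symmetric.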


6 CRITICALITY EVALUATION 


In this section, criticality analysis and evaluation
are performed by defining an equivalent matrix for
the causality digraph of the wind turbine
energy system. As discussed in Section 5, the
causality digraph represents the four major failure
modes and their causal relations. If there is a large
number of failure modes, the digraph becomes
complex and its visual analysis becomes difficult.
Moreover, quantification of the severity of the
failure modes and their causal relations is necessary
to identify the critical element of the wind turbine
system. It is, therefore, essential to convert the
digraph into an equivalent matrix for further
processing. The equivalent matrix, known as the
‘Wind Turbine Criticality Matrix’ (WTCM), E_cm,
is written as:


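\[
E_{cm} =
\begin{bmatrix}
F_s    & f_{sl} & f_{sc} & f_{sb} \\
f_{ls} & F_l    & f_{lc} & f_{lb} \\
f_{cs} & f_{cl} & F_c    & f_{cb} \\
f_{bs} & f_{bl} & f_{bc} & F_b
\end{bmatrix}
\tag{1}
\]

Rows and columns are indexed by failure mode in the order s, l, c, b.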


where the diagonal elements represent the failure
modes (F_s, F_l, F_c, F_b), while the off-diagonal
elements represent the causal relation/interdependency
between failure modes i and j (f_ij; i, j = s,
l, c, b; f_ij ≠ f_ji). The WTCM is analogous to the
permanent matrix in graph theory. The permanent
is a standard function used in combinatorial
mathematics [16]. The permanent of the WTCM is the
criticality expression for the wind turbine energy
system. The ‘Wind Turbine Criticality Expression’
(WTCE), which represents the severity of failure
modes and their causal relations/interdependencies
from combinatorial considerations, is obtained from
the WTCM (eqn. (1)) as:


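\[
\begin{aligned}
P(E_{cm}) ={} & F_s F_l F_c F_b \\
& + f_{sl} f_{ls} F_c F_b + f_{sc} f_{cs} F_l F_b + f_{sb} f_{bs} F_l F_c \\
& + f_{lc} f_{cl} F_s F_b + f_{lb} f_{bl} F_s F_c + f_{cb} f_{bc} F_s F_l \\
& + f_{sl} f_{lc} f_{cs} F_b + f_{sc} f_{cl} f_{ls} F_b
  + f_{sl} f_{lb} f_{bs} F_c + f_{sb} f_{bl} f_{ls} F_c \\
& + f_{sc} f_{cb} f_{bs} F_l + f_{sb} f_{bc} f_{cs} F_l
  + f_{lc} f_{cb} f_{bl} F_s + f_{lb} f_{bc} f_{cl} F_s \\
& + f_{sl} f_{ls} f_{cb} f_{bc} + f_{sc} f_{cs} f_{lb} f_{bl} + f_{sb} f_{bs} f_{lc} f_{cl} \\
& + f_{sl} f_{lc} f_{cb} f_{bs} + f_{sl} f_{lb} f_{bc} f_{cs} + f_{sc} f_{cl} f_{lb} f_{bs} \\
& + f_{sc} f_{cb} f_{bl} f_{ls} + f_{sb} f_{bl} f_{lc} f_{cs} + f_{sb} f_{bc} f_{cl} f_{ls}
\end{aligned}
\tag{2}
\]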


The WTCE (eqn. (2)) supports criticality analysis
from combinatorial considerations, as it accounts for
the severity of failure modes and all possible causal
relations among them. The WTCE, which characterizes
the causal relations between failure modes of the
wind turbine energy system, contains a number of
terms, each of which has a physical meaning. The
first term represents the severity of the four major
failure modes. Each term from the second to the
seventh represents a two-failure-mode causality loop
and the severity of the remaining two failure modes,
and each term from the eighth to the fifteenth
represents a three-failure-mode causality loop and
the severity of the remaining failure mode. Each
of the last nine terms represents the severity of
causal relations among all four major failure modes.
By substituting the severity values of failure modes
and their causal relations, one can carry out not only
the criticality analysis of the wind turbine but also
the evaluation of its criticality index.
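
The grouping of terms described above can be checked by enumeration, because each term of the permanent of a 4 x 4 matrix corresponds to one permutation of the four failure modes. The sketch below is an illustrative check, not part of the original method; the function name `cycle_lengths` and the group labels are assumptions introduced here.

```python
from collections import Counter
from itertools import permutations


def cycle_lengths(perm):
    """Sorted cycle lengths of a permutation given as a tuple,
    where perm[i] is the image of index i."""
    seen, lengths = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        length, node = 0, start
        while node not in seen:
            seen.add(node)
            node = perm[node]
            length += 1
        lengths.append(length)
    return tuple(sorted(lengths))


# One permutation of the four failure modes per term of the permanent.
counts = Counter(cycle_lengths(p) for p in permutations(range(4)))

groups = {
    "four severities, no loop": counts[(1, 1, 1, 1)],
    "one two-mode loop, two severities": counts[(1, 1, 2)],
    "one three-mode loop, one severity": counts[(1, 3)],
    "causal relations only": counts[(2, 2)] + counts[(4,)],
}
print(groups)
```

A fixed point of the permutation contributes a diagonal element (a failure-mode severity), while a cycle of length two or more contributes a causality loop, which is why the 24 terms split into groups of 1, 6, 8 and 9.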

For performing criticality analysis, each term
is examined for severity. The first term in the
expression contains the combination of all failure
modes, F_s F_l F_c F_b, which represents the severity
of all major failure modes. By examining the terms,
the design or maintenance engineer will be able
to identify critical failure modes or critical causal
relations between failure modes. This will prompt
the engineers to take appropriate corrective action.
For example, if the severity of the failure mode
F_l, component looseness, is higher in the electrical
subsystem, then additional care must be taken to
minimise its severity or remove the failure mode. In
a similar way, the other terms are used for
criticality analysis, and appropriate preventive or
corrective actions can be taken. To evaluate the
criticality index of the wind turbine energy system,
the severity values of the failure modes and of the
failure interdependencies are substituted in the
WTCE, which is then solved to obtain a numerical
index that represents the severity of the failure
modes of the wind turbine energy system.

The main objective of the work is to evaluate the
criticality index at subsystem level, so that the
critical subsystems can be selected and ranked based
on the criticality index. It is recommended that the
severity values be obtained from a wind turbine
shop-floor database or from experienced wind turbine
service/operational personnel. However, the severity
of the failure modes and their causal relations can
also be proposed based on field experience, as there
is no data source that provides severity data for
wind turbine components and no commonly accepted
method is available either. The failure modes
identified in Section 4 interact with each other,
forming causal relations of varying degrees.
Depending upon the severity of the failure modes
and the degree of their interaction, or the influence
of one failure mode on another, an appropriate
severity rating may be selected for each failure mode
and each causal relation. In the present work, the
severity ratings were obtained through extensive
interaction with service engineers and operational
personnel of wind turbine manufacturing industries.
Some technical reports and literature [1-10] were
also useful in this regard. The data obtained are
severity ratings in the form of qualitative terms,
e.g. Low, Average, or High. These terms are converted
into quantitative values, i.e., severity values,
for evaluating the criticality index.

To convert the severity ratings into severity
values, an appropriate and accurate method
needs to be selected. In this work, a fuzzy theory
based approach has been chosen for this conversion.
The severity values of the identified failure modes
and their causal relations are substituted in
eqn. (2), i.e., the WTCE, to obtain the criticality
index of the wind turbine energy system at its
various subsystem levels. Based on the criticality
index, the critical subsystem can be selected. The
following section describes the conversion of
severity ratings into severity values using fuzzy
theory, and the evaluation of the criticality index
of some selected units of the wind turbine
energy system.
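
Evaluating the WTCE numerically amounts to computing the permanent of the WTCM once severity values have been substituted. A minimal Python sketch follows; the function name `permanent` and the example matrix are assumptions made for illustration, and the numbers are not taken from Table 2.

```python
from itertools import permutations
from math import prod


def permanent(matrix):
    """Permanent of a square matrix: the same permutation terms as the
    determinant, but all added with a positive sign."""
    n = len(matrix)
    return sum(
        prod(matrix[row][perm[row]] for row in range(n))
        for perm in permutations(range(n))
    )


# Hypothetical WTCM: every failure-mode severity is 0.5 and only the
# causal relations s -> l and l -> s are non-zero.
example = [
    [0.5, 1.0, 0.0, 0.0],
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.5],
]

index = permanent(example)  # 0.5**4 + 1.0 * 1.0 * 0.5 * 0.5 = 0.3125
```

Brute-force enumeration is adequate here because a 4 x 4 matrix has only 24 permutation terms; for much larger matrices a dedicated algorithm would be preferable.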


7 FUZZY THEORY IN CRITICALITY 
EVALUATION 


Fuzzy theory provides a tool for directly
manipulating the linguistic terms that an analyst
employs in making a criticality assessment for a
failure modes, effects and criticality analysis
(FMECA). In the proposed approach, these parameters,
i.e. the severity ratings (linguistic terms), are
represented as members of a fuzzy set, fuzzified
using appropriate membership functions and evaluated
in a fuzzy inference engine, which makes use of a
well-defined rule base and fuzzy logic operations to
determine the crisp score that represents the
criticality/riskiness level of the failure. In other
words, the method converts linguistic terms into
fuzzy numbers and the fuzzy numbers into crisp
scores. The higher the crisp score, the greater the
risk; the lower the crisp score, the lesser the risk.

To demonstrate the method, a 5-point scale with
the linguistic terms low, below average, average,
above average, and high is considered (see
Fig. 6). The linguistic terms are converted into
fuzzy numbers, and then into crisp scores.

The crisp score of fuzzy number ‘M’ is obtained 
as follows: 


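\[
\mu_{\max}(x) =
\begin{cases}
x, & 0 \le x \le 1 \\
0, & \text{otherwise}
\end{cases}
\tag{3}
\]

\[
\mu_{\min}(x) =
\begin{cases}
1 - x, & 0 \le x \le 1 \\
0, & \text{otherwise}
\end{cases}
\tag{4}
\]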

The linguistic terms with their corresponding
crisp scores are taken from Figure 6 and given in
Table 1. The crisp score represents the severity
value of the failure modes and of the causal
relations.

The severity value is selected from Table 1 based
on the degree of severity of each failure mode and
causal relation. By substituting these values in
eqn. (2), the criticality analysis is carried out
and the criticality index of the wind turbine energy
system is subsequently obtained at various subsystem
levels. For example, let us examine the severity of
the major failure modes of the drive train unit. From
Table 1, the severity values for the major failure
modes F_s/F_l/F_c/F_b of the drive train unit are
taken as 0.695/0.895/0.495/0.295, respectively, and
substituted in the first term of eqn. (2).


Figure 6. Linguistic terms to fuzzy numbers conversion
(5-point scale).


Table 1. Conversion of linguistic terms (Severity rating)
into crisp score (Severity value).

Linguistic term      Fuzzy     Crisp number
(Severity rating)    number    (Severity value)
Low                  M1        0.115
Below average        M2        0.295
Average              M3        0.495
Above average        M4        0.695
High                 M5        0.895
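
Table 1 acts as a lookup from severity rating to severity value, which can be sketched as follows. The dictionary and helper names are illustrative assumptions; the ratings chosen in the usage line are simply those that correspond to the drive train severity values quoted in the text.

```python
# Table 1 as a lookup: linguistic severity rating -> crisp severity value.
SEVERITY_VALUE = {
    "low": 0.115,
    "below average": 0.295,
    "average": 0.495,
    "above average": 0.695,
    "high": 0.895,
}


def to_severity_values(ratings):
    """Convert a list of linguistic ratings into crisp severity values."""
    return [SEVERITY_VALUE[r.strip().lower()] for r in ratings]


# Ratings matching the drive train values 0.695/0.895/0.495/0.295.
drive_train = to_severity_values(
    ["Above average", "High", "Average", "Below average"]
)
```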


It is observed that the failure mode component
looseness, F_l, is the most severe, and suitable
corrective action needs to be identified and
implemented to maintain or redesign the drive train
with the severity of component looseness in mind.
Along similar lines, the severity values of the
failure modes and their causal relations are
substituted in place of the remaining terms and the
criticality analysis is performed for the drive train
unit. Subsequently, eqn. (2) is solved to obtain the
criticality index for the drive train unit, which is
0.6741. Similarly, criticality analysis is carried
out to identify critical failure modes/critical
causal relations between failure modes for the
remaining units by choosing appropriate severity
values from Table 1. In the example discussed here,
seven units (i.e., subsystems), namely the drive
train, generator, electronic control system, yaw
drive, electrical system, hydraulic system and gear
box, are considered for evaluating the criticality
index. The criticality indices for all subsystems are
evaluated and presented in Table 2. The electrical
system is identified as the critical unit as it has
the highest criticality index, 2.2314; it is
therefore assigned criticality rank 1. The electrical
system of the wind turbine energy system requires
more attention in terms of, for example, improved
design, reliability, maintainability, safety, etc. The


Table 2. Subsystem-wise severity values, criticality index, and criticality rank.

Sub-system                   CI        CR
Drive train                  0.6741    5
Generator                    1.9953    2
Electronic control system    1.6772    3
Yaw drive                    0.5422    6
Electrical system            2.2314    1
Hydraulic system             0.9383    4
Gear box                     0.2244    7

CI: Criticality Index; CR: Criticality Rank.
[Columns for the severity values of the failure modes and of the causal
relations among failure modes are not reproduced here.]


design activity should be oriented towards removing
the failure modes, reducing their probability of
occurrence, and minimising the severity of the failure.
Similarly, other units of the wind turbine energy
system are prioritized based on the criticality
index, and suitable preventive or corrective actions
are taken, be it in the design or the operating stage.
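
The prioritization step can be sketched as a simple sort on the criticality index. The values below are those reported in Table 2; the pairing of values with subsystems other than the drive train and the electrical system is read from the extracted table layout and should be checked against the original, and the function name is an assumption introduced here.

```python
# Criticality indices as read from Table 2.
criticality_index = {
    "Drive train": 0.6741,
    "Generator": 1.9953,
    "Electronic control system": 1.6772,
    "Yaw drive": 0.5422,
    "Electrical system": 2.2314,
    "Hydraulic system": 0.9383,
    "Gear box": 0.2244,
}


def rank_by_criticality(ci):
    """Assign rank 1 to the subsystem with the highest criticality index."""
    ordered = sorted(ci, key=ci.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


ranks = rank_by_criticality(criticality_index)
for name, rank in sorted(ranks.items(), key=lambda item: item[1]):
    print(rank, name, criticality_index[name])
```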


8 CONCLUSION 


In this paper, a methodology for criticality analysis
and evaluation of the criticality index for a wind
turbine energy system at various subsystem levels has
been proposed. The criticality index for various units
of the wind turbine energy system has been evaluated
using fuzzy digraph models. The criticality index is
useful in ranking and identifying the critical wind
turbine units that require immediate maintenance
action, redesign or modification.

This method will be helpful for system safety
and reliability analysts in taking corrective action
during the design and operational stages of complex
wind turbine energy systems.


ABSTRACT: Dynamic Positioning (DP) control systems are currently utilized for maintaining the position
of Floating Production Storage and Offloading (FPSO) units in a variety of ocean environment conditions.
However, an FPSO in the Arctic region, which operates under diverse ice conditions, should be designed with
an advanced mooring system together with the DP system, as it needs a safer system than normal DP systems
provide. A novel DP and mooring system for an FPSO operating in the Arctic region is now being developed
at KRISO in Korea. A platform shape of the FPSO that has minimum resistance and maximum operation
efficiency is also under development. Analyses and assessments of potential risks are therefore required,
considering that the system under development has many novelties compared to conventional DP or mooring
systems. Hazard identification study (HAZID) is one of the most widely used traditional methods to identify
hazards of a system; the method is simple to use and requires limited training. System Theoretic Process
Analysis (STPA), on the other hand, was recently developed for modern complex control systems and provides
a systematic analysis procedure based on systems theory. The main objective of this study is to suggest an
approach for identifying hazards associated with the design and operation of the DP and mooring system by
utilizing complementary hazard identification methods (HAZID and STPA). HAZID has been applied to the
structural part of the system, while STPA has been applied to the control system. The paper also includes a
comparison of the strengths and weaknesses of the selected methods and a discussion of the complementary
use of HAZID and STPA.


1 INTRODUCTION


1.1 Background


It took almost a decade from the discovery of oil in
Northern Alaska in 1969 to the first oil production.
Various R&D efforts for the effective and safe
operation of offshore plants have been conducted
under the extremely difficult environmental
conditions of the Arctic region, such as low
temperature, icebergs etc. To overcome the risk
factors in the Arctic sea, a variety of fixed type
offshore platforms have been considered so far.
However, the exploitation of natural resources is
taking place in deeper waters in the Arctic region,
so floating type offshore platforms are now
considered.

The most important factor for the design and
operation of a floating type offshore plant in the
Arctic area is that various techniques are required
to maintain stable operation and production under a
variety of ice conditions, e.g. icebergs, ice
floes (drift ice) etc. It is well known that Arctic
conditions are much harsher for the safety of the
operating vessel than normal sea conditions when
considering polar lows, sea ice, low temperature,
uncertainty of metocean data etc.

An improved Dynamic Positioning (DP) system
and a more stable mooring system are required to keep
the operating position in the Arctic region. The
DP system helps ice management during
operations when ice floes are approaching, and
the mooring lines make the offshore plant more
stable.

Equipment with advanced technologies is installed
on the offshore plant, but there are still hazards
(risks) during operation. For example, if ice
management is impossible due to an extremely large
iceberg, the offshore plant should be moved from the
operation site to a safer place based on a prepared
manual. Therefore, a variety of scenarios should be
prepared, considering many different risk
circumstances.

To identify hazards and hazardous situations
that may occur in the operation of DP and mooring
systems, it is important to carry out a systematic
hazard identification. Several hazard
identification methods have been developed, see e.g.
Rausand (2011). One of the most widely used hazard
identification methods is HAZard IDentification
(HAZID). The Preliminary Hazard Analysis
(PHA), a variant of HAZID, was developed by the
U.S. Army (MIL-STD-882D) to evaluate hazards
early in the life of a process, and has been
successfully used for safety analysis of machinery
and process plants (Rausand, 2011, CCPS, 2011).
HAZID is a rather simple and versatile technique
that requires limited training and can cover a
range of safety problems (Rausand, 2011). However,
this method is a brainstorming-oriented technique,
supported by checklists of known hazards, and the
results are strongly dependent on the knowledge and
expertise of the analysts (Molland, 2011). This is
a particular concern when systems increase in
complexity and have unidentified behavior due to
extensive use of ICT. These are typical attributes
of e.g. the DP system. Dedicated hazard
identification methods have therefore been
developed to handle more complex,
software-intensive, sociotechnical systems
(Leveson, 2012). One such example is the
Systems-Theoretic Process Analysis (STPA).

STPA was developed within the framework of the
Systems-Theoretic Accident Model and Processes
(STAMP) causality model (Leveson, 2012). STPA can
identify the hazards covered by traditional hazard
identification methods, but it can also identify
additional hazards that are not included or are
poorly handled in the traditional methods, such as
software errors, component interactions and
complex human errors (Leveson and Thomas, 2013).
STPA identifies unsafe control actions through a
systematic analysis of the functional control
structure of the system (Leveson and Thomas,
2013). However, STPA may not necessarily be suited
to the analysis of systems comprising only passive
components with no control actions.


1.2 Literature study 


DP and mooring systems have seldom been applied
in the Arctic Ocean. Therefore, DP and mooring
systems that have been applied and operated in
normal sea areas to keep the operating position
are investigated first. Because the DP system is
closely related to the control system, the DP
system is studied with STPA and the mooring system
with HAZID in this study.

Between 2001 and 2013, the most frequent
accidents on operating offshore plants (FPSOs)
came from mooring line failure, with thirteen (13)
cases; other cases comprised ten (10) chain
failures and five (5) wire rope disconnections. In
particular, failures at mooring line connections
were the main cause of catastrophic accidents
(Offshore Magazine, 2013).

Ma et al. (2013) and Kvitrud (2014) investigated
the causes of accidents involving the mooring and
DP systems of offshore structures and found that
the top chain, wire rope terminations and
connectors of the mooring system are the major
causes. Several incidents have been reported for
floating structures in the North Sea, where the
main cause of accidents was failure of mooring
lines overloaded during extreme weather
(Kvitrud, 2014). Several QRA studies exist for DP
systems during drilling operations in Arctic or
normal sea conditions (Pedersen, 2015,
Team Energy Resources Limited, 2002), but
HAZID applied to DP and mooring systems is
extremely rare.

Only a limited number of studies have applied
STPA to DP systems. Abrecht (2016) conducted an
STPA for the DP system of an offshore supply
vessel. After identifying unsafe control actions
(UCAs) and causal scenarios of the identified
UCAs, the study also conducted traditional hazard
identification analyses, such as Fault Tree
Analysis (FTA) and Failure Modes, Effects and
Criticality Analysis (FMECA), for the DP system
and compared the results to discuss the advantages
of STPA. Rokseth et al. (2017) also analyzed a
typical DP system using STPA and discussed the
applicability of STPA compared with FMECA.
However, no previous study has applied STPA to a
combined DP and mooring system.


1.3 Objectives 


The main objective of this study is to suggest an
approach for hazard identification associated with
the design and operation of DP and mooring
systems, by utilizing complementary hazard
identification methods (HAZID and STPA). HAZID has
been applied to the structural part of the system
(e.g. hull structure, mooring lines, turret system
etc.), while STPA has been applied to the control
system (e.g. DP system etc.). The paper also
includes a comparison of the strengths and
weaknesses of the selected methods and a discussion
of the complementary use of HAZID and STPA.


1.4 Structure of the paper


The remainder of this paper is organized as follows:
The DP and mooring system is introduced in Section 2.
Sections 3 and 4 analyze the hazards of the DP and
mooring system using HAZID and STPA, respectively.
Finally, results and discussion are presented in
Section 5.


2 INTRODUCTION TO ARC7 PROJECT 
AND DP AND MOORING SYSTEM 


The Korea Research Institute of Ships and Ocean
Engineering (KRISO) has initiated a five-year
long-term research project (ARC7 project) to
develop a hull form design for a year-round
floating type offshore structure with DP and
mooring system in Arctic conditions. In order to
design an offshore structure hull form for the
given operating condition in the Arctic region
(ARC7 condition¹), an ice performance evaluation
methodology that uses KRISO’s ice tank and
numerical analysis methods has been developed. In
this research, ice load estimation, hull form
design, and configuration design of the mooring
and DP systems are considered as core technologies.

The aim of the ARC7 project is to design a hull
form for an offshore plant structure operating in
ice-covered seas at water depths of over 200 m.
The safety of the designed DP and mooring system
for station keeping in ice conditions should be
proven, so that it meets the requirements of the
project.

The designed FPSO systems are applied to the
offshore plant structure, which has a ship-type
planform as shown in Figure 1. The systems under
development and the optimized hull form are
designed for minimized ice drag force and maximized
operational efficiency.

Thrusters for the DP system and mooring lines
for the mooring system are developed for the
different ice conditions based on the ice
management scenarios. The concept of the DP and
mooring system with ice management in the Arctic
ice sea is shown in Figure 2.

The design of the DP and mooring system should
be considered together with the hull form of the
offshore plant. The design process follows a design
spiral to find the best compromise within the given
boundaries such as environmental conditions, cost,
rules (regulations) etc. The design spiral developed
for the offshore plant structure with the DP and
mooring system in the Arctic Ocean is shown in
Figure 3.


Figure 1. Configuration of offshore plant system in the
Arctic region (Kim et al., 2017).


Figure 2. Concept of DP and mooring system with ice
management in Arctic sea (Keinonen et al., 2006).


Figure 3. Design procedure (design spiral) of the
offshore plant structure with the DP and mooring system
in Arctic sea.


¹ ARC7: One of the ice classes of the Russian Maritime
Register of Shipping (RMRS). The ice classes are divided
into non-Arctic, Arctic and icebreaker classes. The ice
class notation is followed by a number which denotes the
level of ice strengthening: Ice1 to Ice3 for non-Arctic
ships, Arc4 to Arc9 for Arctic ships, and Icebreaker6 to
Icebreaker9 for icebreakers. These ice classes can be
assigned in parallel with the Finnish-Swedish ice class
and/or the IACS Polar Class, provided the vessel complies
with all applicable rules. The selection of ice class is
based on the operating area in the Russian Arctic, time
of year, ice conditions, operating tactics, and whether
the vessel operates under icebreaker escort or
independently (RMRS, 2017).
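
The notation rule described in the ARC7 footnote can be
illustrated with a short sketch. The Python below is only
an illustration: the function name `parse_rmrs_ice_class`
and the category labels are assumptions introduced here,
and the level ranges are copied from the footnote, not
from the RMRS rules themselves.

```python
import re

# Level ranges as stated in the footnote (assumed complete for this sketch).
RMRS_ICE_CLASSES = {
    "Ice": ("non-Arctic", range(1, 4)),          # Ice1 to Ice3
    "Arc": ("Arctic", range(4, 10)),             # Arc4 to Arc9
    "Icebreaker": ("icebreaker", range(6, 10)),  # Icebreaker6 to Icebreaker9
}


def parse_rmrs_ice_class(notation: str) -> tuple[str, int]:
    """Split a notation such as 'Arc7' into (category, strengthening level)."""
    match = re.fullmatch(r"(Icebreaker|Ice|Arc)(\d)", notation.strip(),
                         flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not an RMRS ice class notation: {notation!r}")
    prefix = match.group(1).capitalize()  # 'ARC' -> 'Arc'
    level = int(match.group(2))
    category, levels = RMRS_ICE_CLASSES[prefix]
    if level not in levels:
        raise ValueError(f"level {level} is outside the {prefix} range")
    return category, level
```

For example, `parse_rmrs_ice_class("ARC7")` yields the
Arctic category with strengthening level 7, while a
notation such as "Arc3" is rejected because it falls
outside the range given in the footnote.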


3 HAZID FOR THE DP AND
MOORING SYSTEM


3.1 HAZID workshop for the DP and mooring
system in the Arctic region


HAZID work has been carried out for “the
year-round floating offshore structure hull form
development which has DP system and mooring system
as the means of station-keeping in the condition
of ARC7”. The HAZID workshop was performed
for the FPSO’s concept design (hull form, DP
system, mooring system, turret system etc.) for
operation in the Barents Sea.

Hazards and risks during installation,
operation, and maintenance in the Arctic sea were
identified, and their causes and consequences for
the systems were discussed in the HAZID workshop.

Preventive and mitigating safeguards to prevent
or minimize the identified hazards (risks) were
suggested in the workshop, and a HAZID
report/worksheet was prepared.


3.2 HAZID results 


The hazards (risks) in the three nodes (hull,
mooring and turret systems) related to failure of
the structure and systems were identified in the
HAZID workshop.

The main causes and consequences of hull
structural failure were identified, and
preventive/mitigating safeguards were suggested
during the HAZID workshop.

The main causes of hull structure failure are
collision, green water, wind, wave, current, sea
ice and low temperature. Mooring system failure
came from improper connection between mooring
lines, excessive tension, green water, wind, wave,
current, sea ice, low temperature, system
malfunction etc. The main causes of turret system
failure, on the other hand, were identified as
collision, green water, environment (wind, wave,
current), low temperature etc.


3.3 Recommendation and guideline


The hazards or risks identified in the HAZID
workshop were classified as shown in Table 1. The
hazards (risks) are mostly related to ice,
temperature, and the ocean environment in the
Arctic region, while personal risk and downtime
were the common consequences of the accident cases.
A working procedure document, safety fences and
rapid recovery were suggested as
preventive/mitigating safeguards for the currently
designed system.


Table 1. Summary of HAZID.

                                              Existing safeguards
Cause          Consequence                    Preventive safeguards        Mitigating safeguards
Collision      Capsizing & sinking            Robust design                Auto-ballast control system
                                              Increase S.F.
                                              Improve damaged stability
Sea ice        Ice impact                     Ice management               Auto DP system
                                              Ice avoidance
Low temp.      Brittle fracture               Winterization                Deicing
                                              Heat line
                                              Improve material property
                                              Mooring line upgrade
Green water    Impact on deck                 Green water protector        Safety plan
                                              Proper flare angle
                                              High freeboard
                                              Deflector
Sea ice        Large offset                   Ice management               Proper winch operation
                                              Ice avoidance
                                              High DP capacity
                                              Heading control
Current        DP overload                    High DP capacity             Proper winch operation
                                              Heading control
Wave           Large offset                   High DP capacity             Proper winch operation
                                              Heading control
Collision      Mooring disconnection          Redundancy design            Heading control
                                              Heading control
Sea ice        Fail to disconnect             Redundancy design            Rapid OSV support
                                              Auto system
Current/Wave   Risers failure and             Redundancy design            Oil spill recovery plan
               hydrocarbon release            Shut down valve


The major causes for the systems and the
recommendations (safeguards) from the HAZID
workshop are shown in Table 1.
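
A HAZID worksheet of this kind maps naturally onto a
small record structure. The sketch below is a hypothetical
illustration, not the worksheet format used in the
workshop: the class `HazidEntry` and both helper functions
are names introduced here, and the three rows are abridged
from Table 1 (only the first safeguard of each cell).

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class HazidEntry:
    """One worksheet row: cause, consequence and the two safeguard lists."""
    cause: str
    consequence: str
    preventive: list[str] = field(default_factory=list)
    mitigating: list[str] = field(default_factory=list)


# Rows abridged from Table 1 (first safeguard of each cell only).
WORKSHEET = [
    HazidEntry("Sea ice", "Ice impact", ["Ice management"], ["Auto DP system"]),
    HazidEntry("Sea ice", "Large offset", ["Ice management"], ["Proper winch operation"]),
    HazidEntry("Current", "DP overload", ["High DP capacity"], ["Proper winch operation"]),
]


def safeguards_by_cause(entries):
    """Collect the distinct preventive safeguards recorded for each cause."""
    grouped = defaultdict(set)
    for entry in entries:
        grouped[entry.cause].update(entry.preventive)
    return {cause: sorted(items) for cause, items in grouped.items()}


def entries_missing_safeguards(entries):
    """Rows that lack a preventive or a mitigating safeguard."""
    return [e for e in entries if not e.preventive or not e.mitigating]
```

Grouping by cause makes it easy to see which safeguards
recur across nodes, and the second helper flags rows that
the workshop would still need to complete.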


4 STPA FOR THE DP AND 
MOORING SYSTEM 


The STPA focused primarily on the control system
and its interaction with the ship, ship engine and
sensors, environment, DP operator, and bridge
officer. The STPA was carried out in three main
steps, following the STPA procedure: (1) establish
the system engineering foundation, (2) identify
Unsafe Control Actions (UCAs), and (3) identify
scenarios and safety constraints.


4.1 Establishing system engineering foundation 


The first task of STPA was to establish the
system engineering foundation, which includes
defining system-level accidents, hazards and
safety constraints, and establishing the
functional control structure.

The system-level accidents of an offshore vessel
with DP and mooring system are rupture of the
riser and structural damage of the vessel. The
former is caused by unintended drift of the vessel
beyond the riser disconnect limit, while collision
with other vessels or icebergs can cause the
latter. System-level accidents, hazards, and safety
constraints are summarized in Table 2.
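
The traceability from accidents to hazards to safety
constraints can be checked mechanically. The following
sketch is an assumption-laden illustration: the
dictionary layout and the function `untraced_items` are
introduced here, while the identifiers and the links
between them follow Table 2.

```python
# Identifiers follow Table 2; the data layout is an assumption of this sketch.
ACCIDENTS = {
    "SLA1": "Rupture of the riser",
    "SLA2": "Structural damage of the vessel",
}
# hazard id -> accident it can lead to
HAZARD_TO_ACCIDENT = {"SLH1": "SLA1", "SLH2": "SLA2", "SLH3": "SLA2"}
# safety constraint id -> hazard it addresses
CONSTRAINT_TO_HAZARD = {"SLSC1": "SLH1", "SLSC2": "SLH2", "SLSC3": "SLH3"}


def untraced_items(accidents, hazard_to_accident, constraint_to_hazard):
    """Return (accidents without a hazard, hazards without a constraint)."""
    covered_accidents = set(hazard_to_accident.values())
    covered_hazards = set(constraint_to_hazard.values())
    accidents_without_hazard = sorted(a for a in accidents if a not in covered_accidents)
    hazards_without_constraint = sorted(h for h in hazard_to_accident
                                        if h not in covered_hazards)
    return accidents_without_hazard, hazards_without_constraint
```

Both returned lists are empty for the data of Table 2,
which means every accident is linked to at least one
hazard and every hazard to at least one constraint.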

The second task of the system engineering
foundation was to establish the functional control
structure. A high-level control structure was
developed first, as shown in Figure 4, and a
detailed structure was then developed as shown in
Figure 5.

The DP and mooring system is controlled by the
DP Operator, who receives commands from the Bridge
Officer. The DP Control System calculates vessel
motion and provides control commands to the Power
System and Thrusters for position keeping of the
vessel. The responsibilities and process models of
each controller in the functional control structure
are defined in Table 3.
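
The command chain described above can be written down as
a small directed graph. This is a minimal sketch under
stated assumptions: only the command links named in the
text are modelled (feedback links from Figures 4 and 5
are left out), and `COMMAND_LINKS`, `control_actions` and
`command_path` are hypothetical names.

```python
from collections import deque

# Controller -> controlled processes, as named in the text (command links only).
COMMAND_LINKS = {
    "Bridge Officer": ["DP Operator"],
    "DP Operator": ["DP Control System"],
    "DP Control System": ["Power System", "Thrusters"],
    "Power System": [],
    "Thrusters": [],
}


def control_actions(links):
    """List every (controller, controlled process) pair in the structure."""
    return [(src, dst) for src, targets in links.items() for dst in targets]


def command_path(links, source, target):
    """Breadth-first search for the command chain from source to target."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in links.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # no command chain exists
```

Each pair returned by `control_actions` is a candidate
for the unsafe control action analysis of the next step.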


Table 2. System-level accidents, hazards and safety constraints.

System-level accident        System-level hazard                          System-level safety constraint
SLA1: Rupture of the riser   SLH1: The vessel fails to maintain its       SLSC1: The vessel must never drift
                             position and drifts outside riser            outside riser disconnect limit
                             disconnect limit
SLA2: Structural damage      SLH2: The vessel fails to maintain its       SLSC2: The vessel must never drift
of the vessel                position and collides with other vessels     toward other vessels
                             SLH3: The vessel fails to avoid icebergs     SLSC3: The vessel must avoid icebergs
                             that are beyond the structural strength      that are beyond the structural strength


Figure 4. High-level control structure of DP and mooring
system.


Figure 5. Detailed control structure of DP and mooring
system.


4.2 Identifying unsafe control actions


STPA considers that safety can be treated as a
dynamic control problem, rather than a component
failure problem (Leveson and Thomas, 2013), and
therefore the main focus of STPA is to identify
control actions that lead to unsafe situations of
a system. These control actions are called Unsafe
Control Actions (UCAs).


The UCAs can be identified by examining
combinations of the control commands and process
models identified in Section 4.1. Some examples of
UCAs identified in our study are shown in Table 4.
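
The systematic part of this step is a cross product of
control actions and UCA types. The sketch below
reproduces the numbering pattern of Table 4 under the
assumption that identifiers are assigned row by row; the
function name `enumerate_ucas` is introduced here for
illustration only.

```python
from itertools import product

# Control actions and UCA types as they appear in Table 4.
CONTROL_ACTIONS = ["Speed up", "Speed down"]
UCA_TYPES = ["Not provided", "Provided", "Too early",
             "Too late", "Too short", "Too long"]


def enumerate_ucas(actions, uca_types, prefix="UCA.DPC"):
    """Pair every control action with every UCA type and number the result."""
    rows = []
    for number, (action, uca_type) in enumerate(product(actions, uca_types), start=1):
        rows.append((f"{prefix}{number:02d}", action, uca_type))
    return rows
```

The enumeration only produces candidates. Whether a
candidate is actually unsafe still depends on the context
given by the process models, which is the part the
analysts must judge.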


Table 3. Responsibilities and process models.

Controller          Responsibilities                                   Process models
DP Operator         - Maintain or change position of the vessel        - Command from Bridge Officer
                      depending on commands from Bridge Officer          (keep/change position)
                                                                       - DP mode
                                                                         (position keeping/moving to other site)
                                                                       - Status of automatic DP control system
                                                                         (operating/malfunction)
                                                                       - Status of mooring system
                                                                         (moored/disconnected)
                                                                       - Environmental conditions
                                                                         (thrust force required/not required)
DP Control System   - Calculate vessel motion based on position,       - Command from DP Operator (maintain/
                      environmental conditions and mooring forces        change position, manual operation)
                    - Provide control commands to Power System to      - Position of vessel
                      generate required electric power to Thrusters    - Environmental conditions
                    - Provide control commands to Thrusters to         - Mooring forces
                      generate required thrust force to the vessel     - Power generation quantity
                                                                       - Engine load
                                                                       - RPM of each Thruster
                                                                       - Angle of each Thruster


Table 4. UCAs of DP control system.

No. 1, control action: Speed up Thruster no. 1

UCA type       UCA ID      Description
Not provided   UCA.DPC01   DP Control System does not provide Speed up command to Thruster when vessel needs more thrust force
Provided       UCA.DPC02   DP Control System provides Speed up command to Thruster when vessel needs to reduce or maintain thrust force
Too early      UCA.DPC03   DP Control System provides Speed up command to Thruster too early, before vessel needs more thrust force
Too late       UCA.DPC04   DP Control System provides Speed up command to Thruster too late, when vessel needs more thrust force immediately
Too short      UCA.DPC05   DP Control System provides Speed up command to Thruster too short, so Thruster cannot generate enough thrust force
Too long       UCA.DPC06   DP Control System provides Speed up command to Thruster too long, so Thruster generates excessive thrust force

No. 2, control action: Speed down Thruster no. 1

UCA type       UCA ID      Description
Not provided   UCA.DPC07   DP Control System does not provide Speed down command to Thruster when vessel needs less thrust force
Provided       UCA.DPC08   DP Control System provides Speed down command to Thruster when vessel needs to increase or maintain thrust force
Too early      UCA.DPC09   DP Control System provides Speed down command to Thruster too early, before vessel needs less thrust force
Too late       UCA.DPC10   DP Control System provides Speed down command to Thruster too late, when vessel needs less thrust force immediately
Too short      UCA.DPC11   DP Control System provides Speed down command to Thruster too short, so Thruster generates excessive thrust force
Too long       UCA.DPC12   DP Control System provides Speed down command to Thruster too long, so Thruster cannot generate enough thrust force


4.3 Identifying scenarios and safety constraints
of each UCA


The last step of STPA was to identify scenarios,
causal factors, and safety constraints for each
UCA. The scenarios and causal factors describe
why and how UCAs occur, and the safety constraints
suggest requirements or guidelines to prevent the
scenarios from occurring and, ultimately, to
prevent the occurrence of UCAs.

In contrast to the previous steps for identifying
UCAs, STPA does not provide structured guidance for
this step. The identification of scenarios, causal
factors, and safety constraints therefore relies
on brainstorming by the analysts. Some of the
results of this step are shown in Table 5.


Table 5. Scenarios, causal factors and safety constraints of UCA.DPC01.

UCA.DPC01: DP Control System does not provide Speed up command to Thruster when vessel needs more thrust force

Scenario                              Associated causal factors     Safety constraints
DP Control System receives wrong      Low accuracy of reference     SC.DPC.01.01 Accuracy of reference sensors must
measurement from reference sensors    sensors                       be tested periodically
                                                                    SC.DPC.01.02 Reference sensors must have 2oo3
                                                                    configuration
DP Control System receives no         No power supply to            SC.DPC.01.03 DP Control System must generate
measurement from reference sensors    reference sensors             an alarm when no signal is received from
                                                                    reference sensors
                                                                    SC.DPC.01.04 Reference sensors must be connected
                                                                    to UPS
                                      Broken signal wires from      SC.DPC.01.03 DP Control System must generate
                                      reference sensors             an alarm when no signal is received from
                                                                    reference sensors
                                                                    SC.DPC.01.05 Signal wires must be inspected
                                                                    periodically
DP Control System receives correct    Wrong logic inside PCS        SC.DPC.01.06 Logic of DP Control System to
measurement, but DP Control System                                  generate Speed up command must be fully
does not provide Speed up command                                   demonstrated during sea trial
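
Two of the safety constraints in Table 5 (redundant
reference sensors with voting, and an alarm when no
signal is received) lend themselves to a small
illustration. The sketch below is not the logic of any
actual DP control system: the function `vote_2oo3`, the
use of `None` for a missing signal, and the tolerance
value are all assumptions made for this example.

```python
from itertools import combinations
from typing import Optional, Sequence


def vote_2oo3(readings: Sequence[Optional[float]],
              tolerance: float) -> tuple[Optional[float], bool]:
    """Return (voted value, alarm) for three redundant sensor readings.

    None stands for "no signal received". The alarm is raised when any
    signal is missing or when no two readings agree within the tolerance.
    """
    if len(readings) != 3:
        raise ValueError("2oo3 voting needs exactly three readings")
    valid = [r for r in readings if r is not None]
    alarm = len(valid) < 3  # a missing signal always raises the alarm
    for first, second in combinations(valid, 2):
        if abs(first - second) <= tolerance:
            # Two sensors agree: use their mean as the voted measurement.
            return (first + second) / 2, alarm
    return None, True  # no agreeing pair, so no trustworthy measurement
```

With readings of 10.0, 10.5 and 55.0 and a tolerance of
1.0, the outlier is outvoted and no alarm is raised. If
one signal is lost, a value is still returned but the
alarm is set, so the operator is warned about the lost
signal.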


5 RESULTS AND DISCUSSION 


In this study, the hazards of a DP and mooring
system have been analyzed using two hazard
identification methods: HAZID and STPA. The scope
of HAZID was limited to the hazards related to
the mooring system, while STPA focused on the
control hazards of the DP system.

The advantages and limitations of each method,
introduced in Section 1, were confirmed in the
analyses of this paper. HAZID covered a wide range
of safety problems, such as hazards during
installation, operation and maintenance. This
method can be applied without limitation to static
(passive) systems that require no control actions,
like the mooring system, as well as to dynamic
(active) systems that require control actions,
like the DP system. However, HAZID provides a less
structured approach than STPA for identifying
hazards related to control problems. The results
of HAZID are therefore highly affected by the
knowledge and expertise of the analysts. In
contrast, STPA provides a well-structured,
systematic approach to identifying the hazards of
a control system, and this systematic approach
reduces reliance on the analysts’ knowledge or
expertise and supports a thorough hazard analysis
of a system. However, STPA may encounter problems
when applied to systems comprising only passive
components with no control actions, because STPA
is a specialized analysis method for control
problems. For instance, multiple mooring line
failure due to heavy wind and/or waves is a
critical hazardous event that can lead to rupture
of the riser or collision with other vessels, but
it might not be directly identified by STPA because
no control action is related to this hazardous
event. This hazardous event can only be included
indirectly in the analysis, as part of a process
model or as a scenario of some UCAs of the DP
system.

Previous studies on STPA indicate that STPA
can find a wider range of safety problems than
traditional hazard identification methods. In 2003,
STPA was applied to the U.S. Missile Defense
System, which had already been analyzed using
traditional hazard identification methods, and
STPA found so many additional flaws that the
project was delayed for six months to fix the
problems (Pereira et al., 2006). For an unmanned
spacecraft of the Japan Aerospace Exploration
Agency (JAXA), STPA found every hazard identified
by fault tree analysis as well as additional
hazardous scenarios related to system design flaws,
software, and so on (Ishimatsu et al., 2010). After
analyzing a typical DP system, Rokseth et al. (2017)
concluded that STPA can be considered complementary
to FMEA, providing a better risk picture of the DP
system, and the same applies to this case study.
However, at least for the specific case of this
paper, STPA may cover a narrower scope of safety
problems than a traditional hazard identification
method, HAZID. The well-structured and systematic
approach of STPA consequently resulted in limited
applicability. This limitation is not unique to
STPA. The HAZard and OPerability (HAZOP) study, for
instance, provides specific guidewords and process
parameters to identify hazards of a process system,
and consequently, the application of this method is
limited to process hazards (Rausand, 2011). To use
HAZOP for other kinds of hazards, the guidewords
and parameters must be modified accordingly.
The more guidance a method provides for identifying
hazards, the more restricted its scope may become.

For a thorough analysis of the hazards of a DP
and mooring system, complementary use of several
hazard identification methods may be required,
because each method has its own advantages and
limitations, as confirmed by this study. A
combined approach based on multiple hazard
identification methods, reinforcing the strengths
and compensating for the weaknesses of each method,
would be important further work.


ABSTRACT: This paper presents a study based on a specific risk analysis approach that highlights
the potential risks and accidents in a geothermal installation, in order to make the installation
more reliable and environmentally friendly. The study also aims to offer a different perspective
to the public benefiting from geothermal energy by positively influencing people's awareness of
this kind of facility. The geothermal fluid may contain several Non-Condensable Gases (NCG), such
as carbon dioxide (CO2) and hydrogen sulfide (H2S). Moreover, the presence of silica and boron in
the geothermal brine can be hazardous both to people and to the surface environment. The safety
analysis presented determines potential accidents in the storage phase and appropriate remediation
actions, in the case that the geothermal fluid reinjection methodology is used.


1 INTRODUCTION 


1.1 General 


In recent years, energy demand has been rising
continuously worldwide. This can be explained by
the sharp growth of the developing countries, as
well as the improvement of living standards in the
developed ones. Fossil fuels have been the main
energy source for most countries for several
decades. However, their limited reserves and the
need for energy independence have led most
countries to invest gradually in renewable energy
sources in order to reduce the share of
technologies based on fossil fuels. Renewable
resources have unlimited availability, are usually
equitably distributed around the world and are
characterized as clean technologies because they
produce very little waste and have a minimal
environmental impact. Moreover, they contribute to
the reduction not only of CO2 emissions but also
of other pollutant gas emissions, such as sulfur,
nitrogen oxides and VOCs (Volatile Organic
Compounds), fostering both environmental
protection and sustainable growth. This is in line
with the future socioeconomic and environmental
needs of the global economy according to the Kyoto
treaty objectives (Dincer, 2000 & deLlano-Paz et
al., 2015).
Geothermal energy offers an alternative in this
respect. It has been used commercially for
electricity generation since 1913. As a renewable
energy source, it can enhance a low carbon economy
and strengthen independence from imported fuels.
Geothermal power can be used as baseload renewable
energy 24/7 to generate electricity regardless of
weather variations. Moreover, geothermal output
can be varied according to demand and can flexibly
support intermittent renewable energy resources
such as wind and solar parks. In this context, it
can be used to stabilize the power grid, enhancing
the efficiency of the entire system and increasing
the security of energy supply against disruptions
for geopolitical reasons and the high volatility of
fossil fuel prices (deLlano-Paz et al., 2015).

The geothermal fluid itself may contain
several Non-Condensable Gases (NCG), such as
carbon dioxide (CO2) and hydrogen sulfide (H2S).
Moreover, the presence of silica and boron in the
geothermal brine can be hazardous both to people
and to the surface environment. For these reasons
geothermal fluid reinjection, a vital part of any
geothermal development, is used. The reinjection
plan should be developed as early as possible in
any geothermal development, taking into account
that the field characteristics are likely to change
with time.


1.2 Geothermal injection plants 


Few types of geothermal installations have been
used so far. The first injection effort was
implemented in 1970 at the Ahuachapán field in El
Salvador for environmental reasons, where a high
boron content (~50 ppm) was identified and surface
disposal was not admissible (Einarsson et al.,
1975). Nowadays, injection seems to be the most
favorable solution both environmentally and
economically. Geothermal fluid injection is
important to a geothermal project for a number of
reasons, such as (a) to avoid surface disposal,
which can cause environmental impact, (b) to
support the reservoir pressure, (c) to avoid any
ground subsidence, and (d) to benefit from rock
matrix heat.

Depending on the type of geothermal system,
reinjection can be infield, outfield or a mix of
both. For vapor-dominated systems, where the water
can run out, reinjection should be infield, while
for hot water and liquid-dominated systems a mix
of infield and outfield injection is recommended.
Infield reinjection provides pressure support and
consequently reduces drawdown and the potential
for subsidence. Outfield reinjection, however,
protects the production area from the risk of
cold water returns (Kaya et al., 2011).
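
The selection rule in this paragraph can be captured as
a lookup. The sketch below is only a restatement of the
text in code form: the names `REINJECTION_BY_SYSTEM`,
`TRADE_OFFS` and `recommended_reinjection` are introduced
here, and real site selection depends on many factors
that this sketch does not model.

```python
# Recommended reinjection locations per system type, as stated in the text.
REINJECTION_BY_SYSTEM = {
    "vapor-dominated": ("infield",),
    "hot water": ("infield", "outfield"),
    "liquid-dominated": ("infield", "outfield"),
}

# Main effect of each option, paraphrased from the same paragraph.
TRADE_OFFS = {
    "infield": "pressure support; less drawdown and subsidence potential",
    "outfield": "protects the production area from cold water returns",
}


def recommended_reinjection(system_type: str) -> tuple[str, ...]:
    """Look up the reinjection locations recommended for a system type."""
    key = system_type.strip().lower()
    if key not in REINJECTION_BY_SYSTEM:
        raise ValueError(f"unknown geothermal system type: {system_type!r}")
    return REINJECTION_BY_SYSTEM[key]
```

A call such as `recommended_reinjection("Vapor-dominated")`
returns only the infield option, and `TRADE_OFFS` gives
the reason stated in the text for each option.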

In the present study, infield reinjection of
geothermal fluid is the technology considered.


1.3 Types of hazards 


It has been noted that the lack of planning for
injection early in the development phase has
usually caused delays in putting power on line and
in reaching the planned generation level
(Arnorsson, 2004). Thus, as mentioned above, the
injection process is a vital part of any geothermal
development, directly affecting the success or
failure of any geothermal field development.
Chemical pollution, characterized as the most
adverse environmental effect of geothermal energy
utilization, arises both from gaseous components
in steam that are discharged into the atmosphere
and from aqueous components in spent water that
may mix with surface and ground waters. Geothermal
fluid may include CH4, CO2, B and H2S. The last is
a noxious gas that has an unpleasant smell when
present in low, harmless concentrations and can be
fatal if inhaled in high concentrations for a
longer period of time. In order to reduce chemical
pollution, both waste water and steam condensate
should be injected into drill holes (Arnorsson,
2004).


2 TECHNOLOGY USED 


2.1 General principles 


The single-flash steam technology is adopted when
the geothermal production wells deliver a mixture
of steam and liquid, in order to convert the
geothermal energy into electricity in a simple way.
After extraction, the geothermal fluid mixture
first passes through a cylindrical cyclonic
pressure vessel and is separated into distinct
steam and liquid phases with a minimum loss of
pressure. The siting of the separators is part of
the general design of the plant and there are
several possible arrangements, as shown in
Figure 1 (DiPippo, 1998).

A typical 30 MW single-flash power plant
includes 5-6 production wells and 2-3 injection
wells. These wells can be drilled at sites across
the field or from a single pad through directional
drilling in order to intercept a wider zone of the
reservoir. In either case, the geothermal fluids
are transported through a piping system from the
production wells to the powerhouse and then to the
disposal wells. Of course, the initial piping
system can be modified if new power units are added
later on.


2.2 Description of problems actually connected 
with the injection of waste geothermal fluid 


Geothermal fluid does not necessarily have to
be injected into the production geothermal
reservoir; it could be injected into a different
aquifer simply to avoid any environmental impact
owing to surface disposal. However, although
injection into an aquifer shallower than the
producing reservoir saves drilling cost, in that
case many problems can occur, such as ground
subsidence, seismicity or leakage of the injection
fluid to the surface due to the injection pressure.
On the other hand, injection into the production
reservoir can be beneficial, as mentioned above,
but also risky owing to the potential cooling of
the production well and the possible adverse impact
on the chemistry of the extracted geothermal fluid
(Sanyal et al., 1995).

The injection of waste geothermal fluid is usually
associated with a number of problems. The choice of
injection sites is critical for plant operation and
production, since injection within the same fault zone
as production can cause serious cooling. The developer
of such a system can therefore inject into shallow
groundwater aquifers, if the geothermal fluid is
environmentally benign, or inject deeper within the
fault zone, so that the fluid is heated up before it
mixes again with the production flow. Finally,
environmentally benign fluid can also be discharged
at the surface.


Figure 1. Simplified single-flash power plant design.

Cooling caused by injection seems to be the
most common problem in the geothermal industry.
According to Sanyal et al. (1995), there are
two causes of injection-induced cooling: a) a very
short distance between production and injection
wells, and b) “short-circuiting” of the injected fluid
to the production wells through a fault or fractured
zone. However, the cooling problem can be identified
by conducting a tracer test program, which can alert
the developer in time.

Groundwater contamination can be caused by the
injection of geothermal waste fluid into the
geothermal reservoir. The main factors are the
upflow of the injected water to the groundwater
aquifer through a fault, and leakage of the injected
fluid behind the casing caused by a poor cement bond
or by damage due to corrosion or mechanical causes.
The first cause can be avoided through careful
geologic modeling and by locating injection wells at
alternative sites.

Leakage of injected waste water to the surface
can occur in very shallow geothermal reservoirs
(a few hundred meters deep). In order to avoid this
kind of problem, injection should be deeper than the
production level. Moreover, micro-earthquakes can be
induced near injection sites by high-pressure
injection. More specifically, if the fluid pressure
is increased beyond the original pore pressure
and subsurface zones of weakness or active faults
exist near the injection area, seismic activity may
be induced. Thus, in order to avoid seismic activity,
injection wells should be located away from
known active faults, and the injection pressure
should be lower than the original pore pressure
of the system.


3 USE OF RISK ANALYSIS 


3.1 General 


Risk Analysis stemmed from the major industrial
accidents of the last decades involving dangerous
chemicals that pose a significant threat to
humans and the environment. It involves hazard
identification, hazard evaluation, the development
of potential risk reducing measures, and the
communication of risk information to decision makers
(Papazoglou et al., 1992). Risk analysis typically
involves the following key steps:


• Hazard identification (HAZID)
• Frequency analysis
• Consequence analysis
• Quantification of risks using output from frequency
  and consequence analysis
• Investigation of potential risk reducing measures
• Development of recommendations
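The quantification step in the list of key steps can be illustrated with a minimal sketch. It assumes that the individual risk at one location is the sum, over all accident scenarios, of the scenario frequency multiplied by the conditional probability of fatality at that location. The scenario frequencies and probabilities are hypothetical placeholders, not values from this study.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    frequency_per_year: float  # output of the frequency analysis (hypothetical value)
    p_fatality: float          # output of the consequence analysis at one location (hypothetical value)


def individual_risk(scenarios):
    """Sum frequency x conditional fatality probability over all scenarios."""
    return sum(s.frequency_per_year * s.p_fatality for s in scenarios)


scenarios = [
    Scenario("instantaneous tank failure", 1e-6, 0.5),
    Scenario("pipeline failure", 1e-5, 0.01),
]
risk = individual_risk(scenarios)
print(f"individual risk at this location: {risk:.2e} per year")
```

Repeating this sum for every mesh point around the installation gives the values from which isorisk curves can be drawn.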


The common methods used in Risk Analysis are
a) the Hazard and Operability Analysis (HAZOP),
b) the Failure Modes and Effects Analysis (FMEA),
c) the “What if” scenarios, d) the Fault Tree (FT)
Analysis, e) the Event Tree (ET) Analysis, f) the Risk
Matrix, and many other methodologies. These range
from purely qualitative to fully quantitative, or
a mix of the two, and are applied according to the
needs and resources of the plant to be analyzed.
Based on assumptions about the detailed engineering
of the reinjection method used, a series of accident
scenarios involving hydrogen sulfide (H2S) release
have been identified and are presented below.


3.2 Accident scenarios in the geothermal plant 


The gases present in the geothermal re-injection
fluid, namely carbon dioxide (CO2) and hydrogen
sulfide (H2S), are also present in the natural steam
and do not condense at the condenser temperatures.
These gases are removed through steam jet ejectors
with after-condensers (SE/C in Figure 1) and vacuum
pumps, because they would otherwise increase the
overall pressure in the condenser and lower the turbine
power output. If a series of failures occurs and
all protection systems (PSV, pressure control) fail
to operate, the pressure will rise above the normal
limits and a break in the tank may occur. Two accident
scenarios have been analyzed together with their
consequences:


a. Instantaneous failure of the tank full of H2S
b. Failure of a pipeline carrying H2S.
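The tank rupture scenario requires an overpressure event together with the failure of every protection system. A minimal sketch of the corresponding AND-gate calculation follows, assuming the protection layers fail independently of each other. The initiating frequency and the failure probabilities on demand are hypothetical illustration values, not data from the plant.

```python
import math


def top_event_frequency(initiating_frequency, failure_probabilities):
    """AND gate: the initiating event occurs and every protection layer fails on demand.

    Assumes the layers fail independently, which is an illustrative simplification.
    """
    return initiating_frequency * math.prod(failure_probabilities)


overpressure_per_year = 0.1  # hypothetical initiating event frequency
failure_on_demand = {"PSV": 1e-2, "pressure control": 1e-1}  # hypothetical values

rupture_frequency = top_event_frequency(overpressure_per_year, failure_on_demand.values())
print(f"tank rupture frequency: {rupture_frequency:.1e} per year")
```

If the protection systems share a common cause of failure, the independence assumption no longer holds and the product underestimates the frequency.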


The SOCRATES toolkit has been used for
these analyses; it is an in-house development of
NCSR “DEMOKRITOS” with gas outflow, dispersion and
consequence assessment codes based on TNO’s “Yellow”,
“Green” and “Purple” books, respectively (TNO, 2017).

The initial conditions given to this software for the
outflow calculation are presented in Table 1 below:


Table 1. Outflow data for plant damage states (a) and (b).

Outflow data           Plant damage state (a)   Plant damage state (b)
Type of installation   Tank                     Tank
Storage conditions     Gas, pressurized         Gas, pressurized
Pressure in tank       4 × 10^5 Pa              4 × 10^5 Pa
Temperature            290 K                    290 K
Diameter of tank       1.65 m                   10 m
Height of tank         10 m                     10 m
Height of orifice      5 m                      5 m
Diameter of orifice    1.65 m                   0.013 m
Duration               -                        1200 s
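To give a feel for the outflow calculation in plant damage state (b), the sketch below applies the textbook ideal-gas orifice equation to the Table 1 conditions. It is not the SOCRATES implementation. The discharge coefficient (0.62), the heat capacity ratio of H2S (about 1.32), and the reading of the tank pressure as absolute are assumptions made for illustration.

```python
import math

R = 8.314462618  # universal gas constant, J/(mol K)


def gas_outflow_rate(p0, t0, d_orifice, molar_mass, gamma, cd=0.62, p_ambient=101325.0):
    """Mass flow (kg/s) of an ideal gas through an orifice in a pressurized vessel."""
    area = math.pi * d_orifice ** 2 / 4.0
    critical_ratio = ((gamma + 1.0) / 2.0) ** (gamma / (gamma - 1.0))
    if p0 / p_ambient >= critical_ratio:
        # Choked (sonic) flow: the rate no longer depends on the ambient pressure.
        flow_factor = math.sqrt(gamma * (2.0 / (gamma + 1.0)) ** ((gamma + 1.0) / (gamma - 1.0)))
    else:
        # Subsonic flow: the rate depends on the pressure ratio across the orifice.
        r = p_ambient / p0
        flow_factor = math.sqrt(
            2.0 * gamma / (gamma - 1.0) * (r ** (2.0 / gamma) - r ** ((gamma + 1.0) / gamma))
        )
    return cd * area * p0 * math.sqrt(molar_mass / (R * t0)) * flow_factor


# Table 1, plant damage state (b): 4e5 Pa, 290 K, 0.013 m orifice, 1200 s duration.
rate = gas_outflow_rate(p0=4e5, t0=290.0, d_orifice=0.013, molar_mass=0.03408, gamma=1.32)
released_mass = rate * 1200.0  # assumes the tank pressure stays constant during the release
print(f"outflow rate: {rate:.3f} kg/s, released mass: {released_mass:.0f} kg")
```

Holding the tank pressure constant for the whole duration is conservative, since in practice the pressure and therefore the outflow rate decay as the tank empties.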


4 RESULTS 


The SOCRATES toolkit calculated the concentration
of Hydrogen Sulfide over a specified time and distance
on the appropriate mesh for each case study.
In both scenarios, the instantaneous and the
continuous release, Hydrogen Sulfide had enough
buoyancy owing to its temperature and dispersed
in the atmosphere in accordance with the Gaussian
model. The meteorological parameters, namely the
wind velocity, direction and stability class as well
as the ambient temperature, are mandatory inputs to
the Gaussian model for such dispersions, and the
uncertainty in their values has been taken into
consideration. More specifically, sixteen (16) cases
covering the various meteorological parameters were
simulated in the SOCRATES toolkit and correlated with
the prevailing meteorological conditions at the site.
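The Gaussian model mentioned here can be sketched for the continuous release as a steady-state plume with ground reflection. This is a generic textbook form, not the SOCRATES code. The Briggs open-country dispersion coefficients for neutral stability (class D) and the example wind speed and source strength are assumptions chosen for illustration.

```python
import math


def briggs_rural_sigmas_class_d(x):
    """Briggs open-country dispersion coefficients (m) for stability class D at downwind distance x (m)."""
    sigma_y = 0.08 * x / math.sqrt(1.0 + 0.0001 * x)
    sigma_z = 0.06 * x / math.sqrt(1.0 + 0.0015 * x)
    return sigma_y, sigma_z


def plume_concentration(q, u, x, y, z, h):
    """Steady-state Gaussian plume concentration (kg/m^3) with ground reflection.

    q: source strength (kg/s), u: wind speed (m/s), h: release height (m),
    x, y, z: downwind, crosswind and vertical receptor coordinates (m).
    """
    sigma_y, sigma_z = briggs_rural_sigmas_class_d(x)
    lateral = math.exp(-y ** 2 / (2.0 * sigma_y ** 2))
    vertical = (math.exp(-(z - h) ** 2 / (2.0 * sigma_z ** 2))
                + math.exp(-(z + h) ** 2 / (2.0 * sigma_z ** 2)))
    return q / (2.0 * math.pi * u * sigma_y * sigma_z) * lateral * vertical


# Hypothetical case: 0.08 kg/s released at 5 m height, 5 m/s wind, ground-level receptors.
centerline = plume_concentration(q=0.08, u=5.0, x=200.0, y=0.0, z=0.0, h=5.0)
off_axis = plume_concentration(q=0.08, u=5.0, x=200.0, y=50.0, z=0.0, h=5.0)
print(f"centerline: {centerline:.2e} kg/m^3, 50 m off axis: {off_axis:.2e} kg/m^3")
```

Running such a calculation for each combination of wind speed, direction and stability class, and weighting the results by how often each combination occurs at the site, mirrors the sixteen-case approach described in the text.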

Individual and Group Risk are calculated for
plant damage states (a) and (b) for each mesh point
in the area. The isorisk curves in Figures 2 and 3
present the individual risk for plant damage states
(a) and (b), respectively.


Figure 2. Isorisk curves and consequences zone for
plant damage state (a), Instantaneous release.


Figure 3. Isorisk curves and consequences zone for
plant damage state (b), Continuous release.


5 DISCUSSION OF RESULTS 


The Risk Analysis results presented in the previous
section demonstrate that the geothermal plant poses
only a minor risk to its surroundings, as the 10
isorisk curve remains within the battery limits of the
installation. Only plant operators are likely to suffer
some discomfort owing to Hydrogen Sulfide, should an
accidental release occur. However, personnel are
normally well trained and have sufficient Personal
Protective Equipment (PPE) to deal with this hazard.
The consequence findings obtained by running the
SOCRATES toolkit, and the interpretation of its
results, are promising for the low-risk operation of
such plants in the proximity of inhabited areas.


6 CONCLUSIONS 


In this paper, the results of a Risk Analysis of a
geothermal power system, addressing the risks caused
by the injection of geothermal fluid back into the
original reservoir, were presented. Environmental
threats specific to geothermal fluid injection sites
have also been discussed.

Specific accidental scenarios of Hydrogen 
Sulfide release have been studied through the 
SOCRATES toolkit and the consequences owing 
to its dispersion are considered as non-significant 
to nearby communities. Plant personnel are aware 
of possible threats and can take specific mitigation 
measures. 

Injection of geothermal fluid and condensate is a
solution to environmental impacts and also helps
maintain reservoir pressure, increasing both the
operational lifetime of the production well and the
reservoir lifetime. Waste fluid injection is preferable
to surface disposal of waste water, whose constituents
may have adverse effects on the environment and on
people.

In Greece, geothermal energy has been used since
the early 1990s, but only for direct utilization.
The Greek reservoirs are not used for electricity
production in the country. One of the reasons was
severe technical problems. These problems can now be
addressed with the new prevention and mitigation
techniques.

The final conclusion is that a reinjection plan
should be developed as early as possible in any
conceptual study of a geothermal development,
taking into account the field characteristics and,
preferably, the Risk Analysis results.


ABSTRACT: ADS-IDAC is a discrete dynamic PRA simulation platform in which the time-dependent
changes in the functional state and parameters associated with the system elements are traced to generate
scenarios by branching to new sequences at various time steps following a small set of general branching
rules. These model-based branching rules have been developed to obtain a more realistic and complete
solution space than the traditional static PRA methods, and to avoid the sequence explosion phenomenon
as the number of system states increases. This paper describes a new version of the ADS-IDAC simulation
platform that includes branching based on important human operator events (e.g., information
processing, decision-making, procedure-following, or action-taking types) and a full implementation of
Human Error Probability (HEP) quantification rules that explicitly account for HEP dependencies based
on shared performance shaping factors modeled using a dynamic Bayesian network.


1 INTRODUCTION 


The dynamic Probabilistic Risk Assessment (PRA) 
methodologies are model-based simulations used 
to generate risk scenarios and their associated prob- 
abilities. This is achieved through general rules of 
stochastic and deterministic behaviors and interac- 
tions of the system and its elements — e.g., process 
variables, hardware, human operators, and envi- 
ronmental conditions. The simulation engine of a 
dynamic PRA platform tracks possible changes in 
the functional state and parameters associated with 
the elements of the system as a function of time. 
The nature and impact of the interactions and 
interdependencies among the system elements are 
processed by the simulation engine to generate risk 
scenarios. Ultimately, depending on the method chosen
for scenario generation, probabilities of individual
scenarios or clusters of scenarios are calculated
for the system end states of interest. Dynamic
PRA methodologies are especially important when 
the system includes time-dependent and complex 
interactions between the process variables, hard- 
ware, human, and environmental conditions. They 
provide a natural framework to include physi- 
cal models, such as thermal-hydraulic codes for 
Nuclear Power Plants (NPPs), mechanistic models 
of hardware failure, cognitive models of human 
behavior, and those of natural hazards. One such
dynamic PRA simulation platform is ADS-IDAC:
the Accident Dynamics Simulator coupled with the
Information, Decision and Action in a Crew context
cognitive model and a realistic nuclear power
plant thermal-hydraulic model. It is one of the
most mature discrete dynamic platforms, with an
evolution that spans more than 25 years (Fig. 1).

In most Human Reliability Analysis (HRA) methods,
the Human Error Probability (HEP), defined as the
probability of an operator not completing a specific
task, is quantified as a function of the Performance
Shaping Factors (PSFs). In ADS-IDAC, the PSFs are
quantified in terms of their contextual parameters
(i.e. surrogates), and their impact on the cognitive
processes is implemented through manifestation nodes,
as illustrated in Figure 3 (Li, 2013).

Figure 1. ADS-IDAC development history.

Although this greatly improved the explicit modeling
of the impact of the PSFs on human performance,
ADS-IDAC still lacks a full implementation for
explicitly quantifying the HEPs based on the dynamic
nature of the PSFs. For each individual or team
activity, the behavioral effects of the PSFs can be
accounted for through an influence diagram. As in
its application in the Phoenix method (Ekanem,
2013), the Bayesian Belief Network (BBN)
approach can be used to estimate the probability
that a specific cognitive behavior occurs given
certain conditions.

The main objective of the research reported in
this paper was to introduce a set of comprehensive
quantification rules to enable dynamic calculation
of branch probabilities and complete risk scenario
probabilities. The Human Failure Event (HFE)
dependencies were explicitly accounted for through
the shared PSFs using a newly developed dynamic
Bayesian network, starting from a BBN model of PSFs
developed in the Phoenix method.


2 IDAC HUMAN BEHAVIOR ADJUSTED 
BY CONTEXT AND OPERATOR 
VARIABILITY 


During the simulation, the human operator behavior
in IDAC is adjusted based on the context through a
chain of surrogate nodes, Performance Shaping Factor
(PSF) nodes, and manifestation nodes (Fig. 2). At each
time step, the NPP state parameters are used to adjust
the surrogate node values; the surrogates (yellow
nodes) affect the dependent PSFs (blue nodes), and the
PSFs in turn affect the manifestation nodes (green
nodes). The relationships between these nodes are based
on empirical correlations, found through extensive
literature reviews, that correspond to the appropriate
human behavior mechanisms (Li, 2013).

Like all information processed by the operator
model, all the dynamic PSF values are based on
information perceived by the operator rather than
data obtained directly from the thermal-hydraulic
model or control panel. Perceived data may differ
from the actual parameter value in the thermal-hydraulic
model or control panel due to time lags
in updating perceived data and any distortions
introduced by perception filtering and biasing.

Figure 2. ADS-IDAC architecture.

The PSFs modeled in ADS-IDAC are: parameter
criticality, system criticality, information load, time
constraint load, cognitive task load, passive alarm
load, expertise, task complexity, stress, fatigue, and
problem-solving style. All of them are dynamic
PSFs, except expertise and problem-solving style.

The criticality of system condition dynamic PSF
represents the operator’s perception of the level of
degradation of key safety functions compared to
normal operation. The value of the system criticality
PSF corresponds to the aggregate deviation of
key safety parameters from a nominal value. Each
operator profile has its own parameters used to
calculate this PSF: the threshold limits associated
with each parameter, and the weighting factors
used to aggregate the parameter contributions. The
contribution from each identified parameter to the
overall criticality of system condition PSF value is
denoted as the parameter criticality. Given a set of
high and low threshold limits, the parameter criticality
corresponds to the magnitude of the parameter’s
deviation from a nominal safe condition.
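The description of parameter and system criticality can be sketched as follows. The text does not give the functional form used in ADS-IDAC, so the linear scaling between the nominal value and the threshold limits and the weighted average are assumptions. The parameter names, limits and weights in the operator profile are hypothetical.

```python
def parameter_criticality(value, nominal, low_limit, high_limit):
    """Deviation from nominal, scaled to 0..1 by the distance to the threshold limit on that side."""
    if value >= nominal:
        span = high_limit - nominal
    else:
        span = nominal - low_limit
    return min(abs(value - nominal) / span, 1.0)


def system_criticality(readings, profile):
    """Weighted aggregate of the parameter criticalities for one operator profile."""
    total_weight = sum(p["weight"] for p in profile.values())
    weighted = sum(
        p["weight"] * parameter_criticality(readings[name], p["nominal"], p["low"], p["high"])
        for name, p in profile.items()
    )
    return weighted / total_weight


# Hypothetical operator profile: threshold limits and weighting factors per parameter.
profile = {
    "pressurizer pressure (MPa)": {"nominal": 15.5, "low": 13.0, "high": 16.5, "weight": 2.0},
    "steam generator level (%)": {"nominal": 50.0, "low": 20.0, "high": 80.0, "weight": 1.0},
}
# Perceived values, which may lag behind or differ from the actual plant values.
readings = {"pressurizer pressure (MPa)": 14.25, "steam generator level (%)": 50.0}

criticality = system_criticality(readings, profile)
print(f"system criticality: {criticality:.3f}")
```

Because each operator profile carries its own limits and weights, two operators perceiving the same readings can arrive at different criticality values.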
The information load dynamic PSF represents
the operator’s mental workload associated
with the perception, processing, and communication
of information. All information available from
the NPP thermal-hydraulic model and crew communications
must first pass through the operator’s perception
filter before it can be memorized and used.
Consequently, the information flow rate through the
perception filter provides an appropriate measure of
each operator’s information processing workload.

The time constraint load dynamic PSF represents
the time available until a monitored NPP
parameter exceeds a critical threshold. Because
operators will normally monitor more than one
important parameter, the overall PSF value is based
on the most time-critical parameter. The knowledge
base profile for each operator includes data
defining how the time constraint load PSF value is
calculated, including a listing of NPP parameters
used to calculate the time constraint PSF value
along with the associated critical threshold values.
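The time constraint load can be illustrated with a short sketch that finds the most time-critical parameter. It assumes the time available is obtained by linearly extrapolating each perceived parameter at its current rate of change, which is a simplification not stated in the text. The parameter names, values and thresholds are hypothetical.

```python
import math


def time_to_threshold(value, rate_per_s, threshold):
    """Seconds until a linearly extrapolated parameter reaches its critical threshold."""
    gap = threshold - value
    if gap == 0:
        return 0.0
    if rate_per_s == 0 or (gap > 0) != (rate_per_s > 0):
        return math.inf  # parameter is steady or moving away from the threshold
    return gap / rate_per_s


def most_time_critical(perceived, thresholds):
    """Return (name, seconds) for the parameter that would reach its threshold first."""
    times = {
        name: time_to_threshold(value, rate, thresholds[name])
        for name, (value, rate) in perceived.items()
    }
    name = min(times, key=times.get)
    return name, times[name]


# Hypothetical perceived (value, rate of change per second) pairs and critical thresholds.
perceived = {"SG level (%)": (40.0, -0.05), "RCS pressure (MPa)": (15.0, -0.002)}
thresholds = {"SG level (%)": 25.0, "RCS pressure (MPa)": 13.0}

name, seconds = most_time_critical(perceived, thresholds)
print(f"most time-critical parameter: {name}, about {seconds:.0f} s available")
```

The shorter the available time, the higher the time constraint load; the mapping from seconds to a normalized PSF value is left out because the text does not specify it.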

The task load dynamic PSF is indicative of the 
actual task demand assigned to a person quantified 
in terms of the number and type of tasks in a time 
unit. NPP control room operations do not normally 
involve heavy physical work, so in ADS-IDAC only 
the cognitive task load is of interest. Simulation 
HRA models possess a unique advantage of track- 
ing each activity performed by the operator, which 
allows the code to count and to assess the workload 
specifically; therefore, the cognitive task load is also 
a dynamic PSF evaluated at each time step. 

The passive alarm load dynamic performance fac- 
tor embodies the number of salient stimuli that catch 
the operator’s attention automatically like the alarms 
in the control room. Most often passive information 
is intrusive and grabs the operators’ attention while 
interrupting their ongoing cognitive processes. Thus, 
too much passive information could be overwhelm- 
ing. In addition to causing mental stress, it shifts 
one’s attention and impedes the ability to refocus. 

Operator expertise facilitates operator’s coping 
with fast system dynamics in several ways: struc- 
turing and sorting the observations systematically, 
speeding the retrieval of knowledge for explaining 
the observation, and making connections between 
different pieces of information. 

The task complexity dynamic PSF represents 
a measure of interaction among system dynam- 
ics, diagnosis confusion, and operator expertise. 
Amongst the system dynamics tracked at each time 
step are parameter trend changes, component state 
changes, and alarm state changes. Diagnosis con- 
fusion represents the complexity induced by incon- 
sistent information and indicates the operator’s 
level of understanding of the current NPP status. 

The stress dynamic PSF combines various
stress-inducing PSFs into one factor: time constraint
load, passive alarm load, cognitive task load,
and task complexity. Each of the stressors has an
equal weight on the stress value.
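A minimal sketch of the equal-weight combination follows. It assumes that each stressor is already normalized to the range 0 to 1 and that the combination is a plain average; the text states only that the weights are equal, so the averaging itself is an assumption.

```python
def stress_level(time_constraint_load, passive_alarm_load, cognitive_task_load, task_complexity):
    """Equal-weight combination of the four stressors, each assumed normalized to 0..1."""
    stressors = (time_constraint_load, passive_alarm_load, cognitive_task_load, task_complexity)
    return sum(stressors) / len(stressors)


# Hypothetical stressor values at one time step.
stress = stress_level(
    time_constraint_load=0.8,
    passive_alarm_load=0.4,
    cognitive_task_load=0.2,
    task_complexity=0.6,
)
print(f"stress: {stress:.2f}")
```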

Because the NPP control room operators’ tasks do
not involve heavy physical work, only the
following three dimensions are considered in
calculating the fatigue dynamic PSF: mental fatigue,
sleepiness, and lack of motivation/activity. Fatigue
is evaluated based on an initial fatigue level at the
beginning of the shift, a prolonged effort component
due to performing tasks over a long period of time,
and a sustained effort component representing the
accumulation of fatigue by performing tasks. Moreover,
the sustained effort component of fatigue is
accelerated by the stress level.

The problem-solving style static PSF is reflected
in the model parameters and information processing
functions. In ADS-IDAC, three problem-solving styles
have been implemented: the Vagabond, Hamlet, and
Garden-Path styles. They affect various parameters
used to model the variation in diagnosis among
operators in the reasoning module: routine monitoring
time interval, maximal alarm stack length,
prioritization of investigation items, investigation
termination criteria, and accident awareness
thresholds.

This framework performs well for adjusting the 
behavior of human operators based on the context. 
However, it is incomplete as it does not include any 
HFE, Crew Failure Mode (CFM), or HEP quan- 
tification that must be included in the generated 
Discrete Dynamic Event Tree (DDET) events for 
its full quantification. 


3 NEW QUANTIFICATION MODEL 
FOR HUMAN ERROR 


3.1 Overview of HCL 


The Hybrid Causal Logic (HCL) methodology (Wang, 2007)
was developed for risk scenario analysis in PRAs of
technological systems. It considers not only the risks
associated with hardware components (also called
‘hard’ causes), but also the risks generated by human
activities, the physical environment, or the
socio-economic environment (also called ‘soft’ causes).
This methodology offers a multi-layered modeling
approach so that each individual domain of the system
is modeled with the most appropriate technique. The
three layers modeled in HCL are: the Event Sequence
Diagram (ESD) layer, used to model the risk context;
the Fault Tree (FT) layer, used to model the behavior
of the physical systems and quantify their impact
on the corresponding linked events in the ESD; and the
Bayesian Belief Network (BBN) layer, used to
model the causal relations between events that have
‘soft’ root causes (Groth, Wang, Mosleh, 2010).

The HCL library was coupled with ADS-IDAC in
a previous research effort for its FT-BBN
quantification capabilities. The FTs were dynamically
linked to the DDETs for modeling support systems and
their impact on the frontline systems (Diaconeasa,
2017).

In this research only the BBN layer of the HCL 
architecture was necessary. Its capabilities had to be 
expanded to include leaky noisy OR gates that are 
used to reduce the conditional probability table size. 
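The size reduction offered by a leaky noisy-OR gate can be seen in a small sketch. With n binary parents, a full conditional probability table needs 2**n entries, whereas the gate needs only one link probability per parent plus a leak term. The code uses the common parameterization in which the child stays false only if the leak and every active parent all fail to trigger it. The parent names and probabilities are hypothetical, and the sketch is not the HCL library implementation.

```python
from itertools import product


def leaky_noisy_or(parent_states, link_probs, leak):
    """P(child = true | parent states) under the leaky noisy-OR model."""
    p_not = 1.0 - leak
    for parent, is_active in parent_states.items():
        if is_active:
            p_not *= 1.0 - link_probs[parent]
    return 1.0 - p_not


def full_cpt(link_probs, leak):
    """Expand the n + 1 noisy-OR parameters into the full table with 2**n rows."""
    names = list(link_probs)
    return {
        states: leaky_noisy_or(dict(zip(names, states)), link_probs, leak)
        for states in product((False, True), repeat=len(names))
    }


# Hypothetical link probabilities for three PSF parents of one failure mode node.
link_probs = {"stress": 0.2, "time constraint": 0.3, "task load": 0.1}
cpt = full_cpt(link_probs, leak=0.01)
print(f"{len(link_probs) + 1} parameters expand to {len(cpt)} table rows")
```

With ten parents, as in a network where every PSF influences every failure mode, the same gate replaces 1024 table rows with 11 parameters.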


3.2 Overview of the Phoenix method


The Phoenix method is a static HRA method that
was developed out of the IDAC model. In the
Phoenix method, the quantification of HFEs is
performed using a BBN that models the effect of the
PSFs on the CFMs. The BBN was constructed with the
CFMs and PSFs as nodes and with arcs representing
the relationships of influence between them,
quantified through conditional probability tables.
BBNs provide numerous benefits, such as the ability
to incorporate both qualitative and quantitative
information from different sources for analysis, a
causal structure for modeling interdependencies among
its elements, the flexibility of updating the present
state of knowledge of the model to incorporate new
evidence as it becomes available, the capability of
reasoning under uncertainty, and the ability to
interface with existing conventional PRA models.


3.3 HFE quantification through a dynamic 
Bayesian network 


The starting point for developing a quantifica- 
tion framework of the ADS-IDAC HFEs was the 
Phoenix method briefly described in the previous 
section. It is a natural step to include the additional 
elements of the Phoenix method into the IDAC 
model as part of the full dynamic ADS-IDAC 
simulation environment. 

A BBN is valuable for problem domains or systems
where the variables do not change over time.
This assumption cannot always be made. For
example, NPP system parameters and human operators’
reasoning clearly change over time. In
these cases, a Dynamic Bayesian Network (DBN)
is necessary. A DBN is a BBN that is extended with
a temporal dimension to enable the modeling of
dynamic systems. The temporal extension of a BBN
does not necessarily mean that the network structure
or parameters change dynamically, but that a dynamic
system is being modeled. Hence, a DBN is a directed,
acyclic graphical model of a stochastic process. It
consists of time steps, with each time step containing
its own variable values. The basic idea in a DBN is to
specify how variables at time t influence variables at
time t + 1 and to replicate the structure of the model
for each time step (Fig. 4). This concept was used to
model the dependencies between the HFEs by replicating
the network structure to represent the dynamic system
and ultimately estimate the conditional HEP at each
time step. This structured, causal model integrated
into ADS-IDAC also helps improve the reproducibility
and transparency of results produced by different
HRA analysts for the same scenario.
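The idea of replicating one network slice per time step can be sketched with two dynamic binary nodes A and B that influence a node C in every slice, in the spirit of the simplified network of Figure 4. The sketch assumes that A and B evolve as independent two-state chains, and all probabilities are hypothetical. In ADS-IDAC the PSF values come from the simulation context rather than from fixed transition probabilities, so this is an illustration of the DBN concept only.

```python
def step(p_true, p_stay_true, p_become_true):
    """One temporal transition: P(X[t+1] = true) from P(X[t] = true)."""
    return p_true * p_stay_true + (1.0 - p_true) * p_become_true


def unroll(p_a, p_b, trans_a, trans_b, cpt_c, n_steps):
    """Replicate the same slice n_steps times and return P(C = true) for each slice.

    A and B evolve as independent chains; C depends on A and B within the same slice.
    """
    history = []
    for _ in range(n_steps):
        p_c = sum(
            cpt_c[(a, b)] * (p_a if a else 1.0 - p_a) * (p_b if b else 1.0 - p_b)
            for a in (False, True)
            for b in (False, True)
        )
        history.append(p_c)
        p_a = step(p_a, *trans_a)
        p_b = step(p_b, *trans_b)
    return history


# Hypothetical numbers: A and B could stand for two degraded PSF states, C for a failure mode.
cpt_c = {(False, False): 0.001, (False, True): 0.01, (True, False): 0.02, (True, True): 0.1}
history = unroll(p_a=0.1, p_b=0.2, trans_a=(0.9, 0.2), trans_b=(0.8, 0.1), cpt_c=cpt_c, n_steps=3)
for t, p in enumerate(history):
    print(f"t = {t}: P(C) = {p:.5f}")
```

The network structure and the conditional table for C are identical in every slice; only the parent probabilities change from one time step to the next.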

Construction of the DBN involves building the
structure of the network and defining the data
describing the causal relationships between the
network’s nodes, and the nodes that will change in time.
Unfortunately, the Phoenix method BBN cannot be
adopted without modifications, as some of its nodes
do not have an ADS-IDAC equivalent and they do
not cover all the HFEs modeled by ADS-IDAC.
The structure of the dynamic Bayesian network
contains two layers in which all the top layer nodes
influence all the bottom layer nodes (Fig. 5). Since
the primary purpose of this dynamic Bayesian network
is to model the effect of the PSFs on the crew
failure modes, the top layer contains the PSFs and
the bottom layer contains the crew failure modes. The
top layer contains the PSFs described in the previous
section: system criticality, information load, time
constraint load, cognitive task load, passive alarm
load, expertise, task complexity, stress, fatigue, and
problem-solving style. As the Phoenix method does
not have the same PSFs, an equivalence table based on
their definition and purpose was created to match the
PSFs used in ADS-IDAC and Phoenix (Table 1). All the
PSFs except expertise and problem-solving style are
dynamic; therefore, their values will change as the
simulation progresses, depending on the context. The
PSFs have also been normalized to have values between
0 and 1.


Figure 4. Simplified dynamic Bayesian network
developed on one time step, with two dynamic nodes A
and B influencing node C at every time step.


Figure 5. DBN of PSFs and HFEs.


Table 1. Mapping between the PSFs of ADS-IDAC
and Phoenix.

ADS-IDAC                 Phoenix
System criticality       Time constraint
Information load         Resources
Time constraint load     Time constraint
Cognitive task load      Task load
Passive alarm load       HSI
Expertise                Knowledge/Abilities
Task complexity          Procedures
Stress                   Stress
Fatigue                  Stress
Problem-solving style    Team effectiveness

The bottom layer of the dynamic Bayesian network
in Figure 5 is made up of crew failure modes.
As in the Phoenix method, the crew failure modes
specify the possible forms of human error in each of
the information pre-processing, decision-making,
and action execution phases.

The crew failure modes are also the generic 
functional modes of failure of the crew in its inter- 
actions with the NPP and represent the manifesta- 
tion of the crew failure mechanisms and proximate 
causes of failure. They are selected to cover the var- 
ious modes of crew response including procedure 
driven, knowledge driven, or a hybrid of both. To 
avoid double counting crew failure scenarios dur- 
ing the estimation of HEPs, the crew failure modes 
are defined as being mutually exclusive. 
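Because the crew failure modes are defined as mutually exclusive, the probability that any one of them occurs is the plain sum of their individual probabilities, with no overlap to subtract. The sketch below illustrates this; the failure mode probabilities are hypothetical, and the function name is introduced here only for illustration.

```python
def hfe_probability(cfm_probabilities):
    """Probability that any one of a set of mutually exclusive failure modes occurs."""
    total = sum(cfm_probabilities.values())
    if total > 1.0:
        raise ValueError("mutually exclusive probabilities cannot sum to more than 1")
    return total


# Hypothetical probabilities for the information phase failure modes.
information_phase = {
    "Perceive State Info": 0.002,
    "Gather State Info": 0.001,
    "Gather Info Mode": 0.003,
    "Gather New Info": 0.004,
}
p_failure = hfe_probability(information_phase)
print(f"probability of an information phase failure: {p_failure:.3f}")
```

If the modes could occur together, the sum would count the joint occurrences more than once, which is the double counting that the mutual exclusivity requirement avoids.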

The crew failure modes within the information
phase assume that the crew has failed in detecting,
noticing and understanding the plant function
they are supposed to be handling. Human failure
in this phase can be divided into two major groups:
failure to perceive passive information and
failure to actively gather information. The crew
failure mode that can occur while perceiving
passive information is “Perceive State Info”: the
crew fails to perceive the plant parameters
or states from the control panel. The crew failure
modes that can occur during the active gathering
of information are: “Gather State Info”, where the
crew unintentionally tries to collect the information
from the wrong source; “Gather Info Mode”, where the
crew decides to use old memorized information
instead of collecting updated information; and
“Gather New Info”, where the crew fails to gather
new information. The equivalence table between
the ADS-IDAC and Phoenix crew failure modes in
the information phase is given in Table 2.

Table 2. Mapping between the CFMs in the information
pre-processing phase of ADS-IDAC and Phoenix.

ADS-IDAC activity      Phoenix CFM
Perceive state info    Reading error
Gather state info      Wrong data source attended to
Gather info mode       Decision to stop gathering data
Gather new info        Team effectiveness


The crew failure modes within the decision-making
phase assume that there is a failure in situation
assessment, problem solving and decision-making
given correct information pre-processing.
Therefore, the assumption is made that the crew has
detected, noticed and understood the plant functions
they are supposed to be handling. However,
they have failed to make a correct assessment of
the plant condition, or to diagnose, decide and plan
the adequate response needed to solve the problem at
hand. Moreover, the decision-making operator has
the responsibility to communicate the appropriate
strategy to the action-taking operators. Ultimately,
failures in this phase result in implementing an
incorrect recovery strategy, hence failing the required
function. Therefore, the following crew failure
modes have been included: “Diagnosis”, where the
decision-maker reaches the wrong assessment of the
plant; “Strategy Selection”, where the decision-maker
takes the wrong strategy given the correct situational
assessment; “Strategy Communication”, where the
decision-maker fails to communicate the correct
strategy selected to the action-taker; “Goal
Selection”, where the decision-maker selects the wrong
immediate goal given the correct situational
assessment; “Goal Communication”, where the
decision-maker fails to communicate the correct
goal selected to the action-taker; and “Procedure
Transfer”, where the decision-maker switches to the
wrong procedure. The equivalence table between the
ADS-IDAC and Phoenix crew failure modes in the
decision-making phase is given in Table 3.

Table 3. Mapping between the CFMs in the
decision-making phase of ADS-IDAC and Phoenix.

ADS-IDAC activity         Phoenix CFM
Diagnosis                 Plant/system state misdiagnosed
Strategy selection        Inappropriate strategy chosen
Strategy communication    Information miscommunicated
Goal selection            Inappropriate strategy chosen
Goal communication        Information miscommunicated
Procedure Transfer        Inappropriate transfer to a procedure


The crew failure modes within the action execution
phase involve failure in action execution given
correct information pre-processing, situational
assessment, and decision-making. It is assumed that
the crew has detected, noticed and understood the
NPP function they are supposed to be handling.
Also, it is assumed they have made a correct
assessment of the NPP condition, and have diagnosed,
decided and planned the adequate response needed to
solve the problem. However, they fail in executing the
response or required action. It is assumed that the
crew failure modes in the action execution phase are
unintentional errors, that is, the operators are always
acting in the interest of recovering the NPP. The
following crew failure modes have been included:
“Mental Procedure”, where the crew fails to adapt the
instinctive response procedure to the current
situation; “Procedure Step”, where the crew skips or
pauses a procedure step in order to rely on their
knowledge; “Procedure Interpretation”, where the crew
misinterprets the procedure step expectation; and
“Maneuver Action”, where the action-taker does not
perform the requested action. The equivalence table
between the ADS-IDAC and Phoenix crew failure modes in
the action execution phase is given in Table 4.

Some of the crew failure modes fall into the
category of errors of commission, that is, they
result from the crew’s intended actions given a wrong
situational assessment of the NPP conditions.

The DBN structure defined by the PSFs and CFMs
given above was integrated into ADS-IDAC by linking
all the human event types simulated in ADS-IDAC to
the appropriate CFMs.

In the information pre-processing phase, the 
CFMs are linked to the human event types as fol- 
lows. The “Perceive State Info” crew failure mode is 
used to estimate the probability of the action-taker 
to correctly register the perceived information for 
an alarm state, a frontline system state, a support 
system state, a parameter value, and a parameter 
trend value from the control panel. The “Gather 
State Info” crew failure mode is used to estimate 
the probability of any of the human operators to 
collect the information from the correct source 
on the control panel: alarm state, frontline system 
state, support system state, or parameter value. The 
“Gather Info Mode” crew failure mode is used to 
estimate the probability that any of the crew members
succeeds in collecting updated information instead
of using old memorized information. The “Gather 
New Info” crew failure mode is used to estimate 
the action-taker’s or decision-maker’s probability 
of adding a parameter to the scan queue for gath- 
ering updated information. 

Table 4. Mapping between the CFMs in the action
execution phase of ADS-IDAC and Phoenix.

ADS-IDAC activity         Phoenix CFM
------------------------  -------------------------------------
Mental procedure          Failure to adapt procedures
Procedure step            Procedure step omitted (intentional)
Procedure interpretation  Procedure misinterpreted
Maneuver action           Incorrect operation on component


In the decision-making phase, the CFMs are
linked to the human event types as follows. The
“Diagnosis” crew failure mode is used to estimate
the decision-maker’s probability of reaching the
correct assessment of the NPP given their
understanding of the NPP conditions. Note that if the
operators do not correctly understand the NPP
conditions, they will still reach a diagnosis, even
if it is the wrong one. The “Strategy Selection”
crew failure mode informs the decision-maker’s
probability of selecting the appropriate strategy
given the correct situational assessment. The
supported strategy selections in ADS-IDAC are wait
and monitor, procedure following, hardwired 
diagnosis, and knowledge-based reasoning. The 
“Strategy Communication” crew failure mode is 
used to estimate the decision-maker’s probability 
of communicating to the action-taker the selected 
strategy. Moreover, if the action-taker is in the fol- 
low instruction strategy mode, the same crew fail- 
ure mode is used to estimate the decision-maker’s 
probability of communicating to the action-taker 
the appropriate instruction. The type of instruc- 
tion can be to obtain information about an alarm 
state, a frontline system state, a support system 
state, a parameter value, and a parameter trend 
value from the control panel or change their values. 
The “Goal Selection” crew failure mode is used to
estimate the decision-maker’s probability of selecting
the appropriate immediate goal given the correct
situational assessment. Selecting inappropriate
goals can lead to a delay in the appropriate recovery
actions. The “Goal Communication” crew failure 
mode is used to estimate the decision-maker’s prob- 
ability to communicate the correct goal selected to 
the action-taker and consultant. The “Procedure 
Transfer” crew failure mode is used to estimate the 
crew’s probability to switch to the correct written 
or mental procedure. The “Mental Procedure” crew 
failure mode is used to estimate the crew’s ability to 
perform the appropriate instinctive response pro- 
cedure based on the correct situational assessment. 
The “Procedure Step” crew failure mode is used to 
estimate the crew’s probability of correctly skipping
or pausing a procedure step in order to rely on
their knowledge. The “Procedure Interpretation”
crew failure mode informs the crew’s probability 
to correctly interpret the procedure step expec- 
tation. As in the case of perceiving information, 
the expectations can be related to an alarm state, 
a frontline system state, a support system state, a 
parameter value, or a parameter trend value from 
the control panel. 

In the action-taking phase, only one CFM is 
modeled as follows. The “Maneuver Action” crew 
failure mode is used to estimate the action-taker’s 
probability to complete an action communicated 
by the decision-maker or from procedures. 

After defining the PSFs, the CFMs with their 
mapping to the existing human events in ADS- 
IDAC, the next step was to obtain the data neces- 
sary to quantify the DBN. In order to achieve this, 
the estimated and calibrated parameters from the 
Phoenix method were adopted. The data sources 
used in the Phoenix method include German NPP 
operating experience data, other HRA methods 
(e.g. SPAR-H), and expert judgement. The advantage
of the Bayesian network is that when new human
performance data becomes available, be it qualitative
or quantitative, it can be easily integrated into the
model parameter estimation process using Bayesian
inference. Common sources
of information that can be used are experimental 
data (e.g. control room simulator data), operat- 
ing experience (e.g., licensee event reports), HRA 
databases, like the US NRC sponsored database 
project called the Scenario Authoring, Characteri- 
zation, and Debriefing Application (SACADA), in 
addition to expert judgement. 

The conditional probability table for each crew
failure mode node in the Bayesian network is used
to capture the strength of influence between each
crew failure mode and its parent PSF nodes. This
implies that the probability of the crew failure
mode given every possible combination of the
PSFs needs to be defined. This is a challenging
problem, as the number of conditional probabilities in
the conditional probability table grows exponentially
with the number of parent nodes and their states.

To reduce the conditional probability table size,
noisy-OR gates can be used to specify the DBN
and to build the conditional probability table for the
crew failure mode nodes. In relation to the DBN
quantification model included in ADS-IDAC, the
leaky noisy-OR gate also gives the advantage of
representing the probability that a crew failure can
occur even when there is no influence from any of
the PSFs. In other words, the leak factor provides a
way to include other PSFs that are not explicitly
represented in the DBN model as individual PSF nodes.
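
As a rough illustration of how a leaky noisy-OR gate reduces the number of parameters, the sketch below expands one link probability per PSF plus a leak term into the full conditional probability table of a single crew failure mode. It assumes binary PSF states; the PSF names and numerical values are hypothetical, and the code is not the ADS-IDAC implementation.

```python
from itertools import product

def leaky_noisy_or(link_probs, leak, active):
    """P(CFM occurs | the listed PSFs are active) under a leaky noisy-OR gate."""
    p_no_failure = 1.0 - leak
    for name in active:
        p_no_failure *= 1.0 - link_probs[name]
    return 1.0 - p_no_failure

def build_cpt(link_probs, leak):
    """Expand the n + 1 gate parameters into the full table with 2**n rows."""
    names = sorted(link_probs)
    cpt = {}
    for states in product([False, True], repeat=len(names)):
        active = [n for n, on in zip(names, states) if on]
        cpt[states] = leaky_noisy_or(link_probs, leak, active)
    return names, cpt

# Hypothetical PSFs and values, for illustration only.
link_probs = {"stress": 0.02, "time_constraint": 0.05, "procedure_quality": 0.01}
names, cpt = build_cpt(link_probs, leak=0.001)
```

With three binary PSFs the table has eight rows but is specified by only four numbers; the row in which no PSF is active equals the leak probability.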

Using the HAMMLAB empirical data and the 
HRA results from the international empirical study 
(Lois, 2009), the conditional probability table has 
been normalized. The normalization procedure 
was the following: HFE 1A, failure to isolate the 
steam generator in the simple SGTR scenario, 
has been quantified using the original conditional 
probability table. The probability obtained has 
been scaled down by two orders of magnitude such
that the simulated HFE 1A falls inside the band 
of results obtained in the international empirical 
study. 

The quantification of the successful human 
events captured by ADS-IDAC is a very critical 
aspect of the full DDET quantification. Now all 
the successful human events have a probability cov- 
ering the full unit interval instead of an assumed 
fixed probability of one. 

Compared to the conventional HRA methods, 
ADS-IDAC is able to quantify each individual 
activity that can lead to a particular HFE. There- 
fore, it is not only able to transparently predict the 
system and crew behavior, but quantify the prob- 
ability of succeeding or failing at each time step 
for each activity the crew is undertaking based on 
the actual context through the linked DBN as is 
illustrated in Figure 6. 


Figure 6. Graphical representation of an ADS-IDAC
generated DDET where human events quantified with a
simplified DBN are highlighted.


4 EXTENDED BRANCHING OF EVENTS 
IN GENERATION OF DDETS 


The ADS-IDAC simulation engine generates a
DDET by activating success, failure or partial
failure branching points when certain conditions are
met. The construction of the DDET is driven by
a rich contextual environment simulated by ADS-IDAC
and guided by the branching rules, which
allow the modeling of variability in system and
human operator behavior.

The DBN covering the effect of PSFs on the
CFMs at each time step was integrated into ADS-IDAC
by linking all the human event types simulated
in ADS-IDAC to the appropriate CFMs.
That is, all the human events are assigned a success
probability at each time step based on the current
state of the DBN. One consequence of this framework
can be seen with the following simple example.
During a diagnosis, the crew may need to check
the status of a component multiple times. Depending
on the context, and implicitly on the value of
the PSFs, the probability of the crew correctly
perceiving the status of that same component may
be different each time. Therefore, new branching rules
have been added to capture human performance
variability, and to quantify the human failure events.

The branching points that were linked to be 
quantified with the DBN are described below: 


• When a strategy is changed, based on the “Strategy
  Selection” crew failure mode, two branches
  are generated: one in which the crew continue
  the current strategy, and another in which they
  switch to the new strategy.

• When a procedure step indicates a transfer to
  another procedure, two branches are generated:
  one in which the crew switches to the new
  procedure, and another in which they continue the
  current procedure.

• When an accident diagnosis threshold is
  exceeded based on the knowledge-based reasoning,
  two branches are generated: one in which
  the crew take recovery actions based on their
  reasoning, and another in which they transfer to
  the appropriate procedure.

• When a mental belief activation threshold is
  exceeded based on the heuristic reasoning, two
  branches are generated: one in which the crew
  transfer to the mental belief, and another in
  which the crew bypass the mental belief and
  continue their activity.
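
The two-way split described by these rules can be sketched as follows. This is a minimal illustration under stated assumptions: the `Branch` class, the `split` helper, and the failure probability of 0.03 are hypothetical stand-ins for what the simulation engine and the DBN would provide at a given time step.

```python
from dataclasses import dataclass, field

@dataclass
class Branch:
    label: str
    probability: float            # cumulative probability of the sequence so far
    history: list = field(default_factory=list)

def split(parent, event, p_failure, success_label, failure_label):
    """Split one branch into two when a branching rule fires."""
    if not 0.0 <= p_failure <= 1.0:
        raise ValueError("p_failure must lie in [0, 1]")
    success = Branch(success_label,
                     parent.probability * (1.0 - p_failure),
                     parent.history + [(event, success_label)])
    failure = Branch(failure_label,
                     parent.probability * p_failure,
                     parent.history + [(event, failure_label)])
    return success, failure

root = Branch("initiating event", 1.0)
# Hypothetical "Procedure Transfer" failure probability for this time step.
switch, stay = split(root, "procedure transfer", 0.03,
                     "switch to new procedure", "continue current procedure")
```

Because each child carries the parent probability multiplied by a conditional branch probability, the probabilities of the two children always sum to that of the parent.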


These rules, together with the newly developed
quantification models, help to keep the simulation
space expansion under control, yet they also allow
sufficient degrees of freedom for the system and crew
to evolve into unexpected behaviors. For example,
given that the procedure step skipping probability is
quantified, either deterministically or stochastically,
at each time step containing written procedure
steps or mental procedures, the skipping of procedure
steps can be simulated and its impact analyzed
in a consistent and transparent way.


5 CONCLUSIONS 


A set of comprehensive quantification rules to 
enable dynamic calculation of branch probabili- 
ties and complete risk scenario probabilities was 
developed and implemented using a DBN. The 
HFE dependencies were explicitly accounted for 
through the shared PSFs by adapting the BBN 
model of PSFs developed in the Phoenix method 
to the dynamic environment of ADS-IDAC. 
ABSTRACT: Conventional Probabilistic Risk Assessment (PRA) methodologies and software tools
developed for a variety of risk-informed applications are also characterized as ‘static’, referring to the
fact that temporal and phenomenological aspects of risk scenarios are at best implicit in the models and
results. For instance, typical core melt cut sets are essentially logical combinations of contributing events,
without consideration of possible effects of different time ordering of the constituent events, and timing
of event initiation or termination. Over the past three decades a small community of researchers has
directed its efforts toward developing and exploring possible benefits of dynamic PRA methods and tools.
Dynamic methodologies provide a natural framework to include physical models, mechanistic models of
hardware failure or human operator behavior models. In this paper, the capabilities to link conventional
PRA platforms with dynamic PRA tools are described and the concept of cut set diffraction is introduced.
This facilitates the use of newer risk analysis methods while still using the existing probabilistic information
that, at least in the U.S., is available for every Nuclear Power Plant (NPP).


1 INTRODUCTION 


Conventional Probabilistic Risk Assessment (PRA) 
methodologies and software tools developed for a 
variety of risk-informed applications are also char- 
acterized as ‘static’, referring to the fact that tempo- 
ral and phenomenological aspects of risk scenarios 
are at best implicit in the models and results. For 
instance, typical core melt cut sets are essentially 
logical combinations of contributing events, with- 
out consideration of possible effects of different 
time ordering of the constituent events, and timing 
of event initiation or termination. Additionally, the 
basic events in classical PRAs are typically binary, 
a situation that can mask the impact of such things
as degraded component states. Furthermore, con- 
ventional PRAs have come under attack in terms 
of adequacy in providing important contextual 
information for proper modeling and analysis of 
operator errors. 

These are some of the reasons stated in support
of the need for simulation-based PRA, generally
known as dynamic PRA. Over the past three
decades a small community of researchers has
directed its efforts toward developing and exploring
possible benefits of dynamic PRA methods and tools.
These efforts show a significant diversity of
objectives and methodology. Among the stated
objectives are: study of the impact of timing and
sequencing of events, understanding the effects of
variations in the underlying physical processes, and
study of operator error, particularly errors of
commission. Methodological and computational works
cover the computational efficiency of generating
dynamic event trees (discrete and continuous
versions), scalability and convergence, post-processing
of generated results, software platform generality,
and user interface features.

With the significant advancement of dynamic
methodologies, computing power, and storage
capacity, the prospect of using dynamic PRA tools to
answer risk questions is more real now than ever
before. Despite the progress in the development of 
dynamic PRA methods and tools, the conventional 
PRA method and software platforms are expected 
to remain the method of choice for the current 
generation of NPPs. However, a strong case can be 
made that conventional PRAs can be augmented 
by dynamic approaches, especially for scenarios 
where dynamic characteristics are anticipated to 
be important. 

An efficient path to enable such augmentation 
is to develop protocols and capabilities to link con- 
ventional PRA platforms with dynamic PRA tools. 
This will facilitate the use of newer risk analysis 
methods while still using the existing probabilistic 
information that, at least in the U.S., is available 
for every NPP. 
2 CONVENTIONAL AND DYNAMIC PRA


2.1 Conventional PRA 


In the application of the defense in depth con- 
cept, Probabilistic Risk Assessment (PRA) is used 
to determine the probability for breaching each 
barrier. In general, PRA methods are employed 
to identify failure scenarios and to estimate their 
associated risk by answering three fundamental
questions (Kaplan and Garrick 1981): 1) “What can
happen?”, 2) “How likely is it that it will happen?”,
and 3) “If it does happen, what are the
consequences?” System performance reliability,
uncertainty analysis, and human reliability analy- 
sis methods are explicitly integrated into the PRA 
framework, thereby enabling the application of 
defense in depth, which would be impossible to do 
using deterministic analysis methods alone. 

PRA is a matter of scenarios and their likelihoods,
as illustrated in Figure 1, which shows an event
tree with the system actuation failure event
represented through a fault tree (Garrick, 2008). The
theory of structuring scenarios is rooted in the fields
of reliability engineering and systems analysis.
Reliability engineering produced graphical
representations that were very useful in describing how
systems perform. Fault tree analysis is a deductive
reasoning process for building failure models. Its
roots are switching algebra and circuit theory
coming out of Bell Labs. The inductive reasoning
process of the event tree concept has its roots in
decision theory and was developed into a systems
performance tool in the reactor safety study
WASH-1400 (Rasmussen, 1975). The theory of
probability as a measure of likelihood evolved in the
field of mathematics over hundreds of years. For events
that occur frequently, a frequency can be quantified
to reflect the number of occurrences per unit of
time. However, for events that rarely occur, or may
not occur at all, a probability is quantified instead.
Probability is synonymous with ‘credibility’,
as in the credibility of a hypothesis based on all of
the available evidence. Ultimately, for any defined
undesirable event, such as fatalities, frequencies
expressed in the form of a probability of frequency
are the main measure of safety risk for engineering
systems.

Figure 1. Conventional PRA.

A full-scope PRA involves the following
well-established steps (Garrick, 2008):


• define the system and its success state; this usually
  involves linearizing the system into logically
  progressive operating states, that is, fixing in time the
  top events based on the analyst’s understanding
  of the system operation,

• identify and characterize the hazards, also called
  threat assessment,

• develop and structure the ‘what can go wrong’
  scenarios that lead to undesired outcomes, also
  called vulnerability assessment; this is a creative,
  analyst-dependent exercise,

• quantify the scenarios; the uncertainties
  associated with the scenarios must be part of the
  answer,

• assemble results into measures of risk and
  importance measures,

• interpret the results for meaningful risk
  management.

Each sequence is a path to one of three types of
end state: success (S), failure (F), or partial failure
(PF). Each branch probability is a conditional
probability. The sequences are quantified by propagating
the probability density functions representing the
split fractions of the top events through their
corresponding paths up to their end states. One of the
end results is a probability of frequency curve for
each sequence end state, which can also be summarized
as a mean and median with a confidence interval.
However, the quantification results are as important
as the scenarios generated by the analysts and
their relative importance to the NPP safety.
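
A minimal Monte Carlo sketch of this propagation is given below. The distribution families (a lognormal initiating event frequency and beta-distributed split fractions) and all parameter values are assumptions chosen for illustration; they are not taken from the paper or from any particular PRA tool.

```python
import math
import random
import statistics

def propagate(initiator_median, initiator_sigma, split_betas,
              n_samples=10000, seed=1):
    """Sample a sequence frequency as initiator frequency times split fractions."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # Lognormal initiating event frequency (assumed distribution).
        freq = rng.lognormvariate(math.log(initiator_median), initiator_sigma)
        # Each top event contributes one conditional split fraction in [0, 1].
        for alpha, beta in split_betas:
            freq *= rng.betavariate(alpha, beta)
        samples.append(freq)
    samples.sort()
    return {
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "p05": samples[int(0.05 * n_samples)],
        "p95": samples[int(0.95 * n_samples)],
    }

# Hypothetical sequence: one initiator and two failed top events.
result = propagate(1e-3, 0.8, [(1, 99), (2, 198)])
```

The sorted samples approximate the probability of frequency curve for the sequence end state, and the returned dictionary gives the summary statistics mentioned above.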

Multiple conventional PRA software platforms
are readily available in academia, industry, or
regulatory agencies. The most common ones are
IRIS (Fig. 2), RISKMAN, CAFTA, RiskSpectrum
Suite, and SAPHIRE. In this work, the IRIS tool
was selected as the conventional PRA platform of
choice; however, the generic architecture used is
applicable to the other software platforms.


2.2 Dynamic PRA 


Dynamic PRA methodologies are generally those
that use a time-dependent phenomenological
model of system evolution and take into account
its stochastic behavior to estimate the risk
associated with the system response to an initiating
event (Aldemir, 2013). The system evolution model
keeps track of the current hardware status,
current level of process variables, current operator
assessment, scenario history, and time (Siu, 1990).
A graphical representation of the system evolution
space with its probabilities of occurrence is shown
in Figure 3.

Figure 2. IRIS platform for modeling ESDs, FTs, and

Dynamic PRA methods have been grouped into two main
categories: continuous-time (e.g. Continuous
Event Tree (CET)) and discrete-time (e.g. Dynamic 
Event Tree Analysis Method (DETAM), Accident 
Dynamic Simulator (ADS), Analysis of Dynamic 
Accident Progression Trees (ADAPT), and Risk 
Analysis Virtual Environment (RAVEN)). 

The discrete dynamic PRA methodologies use 
Discrete Dynamic Event Trees (DDETs) that 
are computationally generated based on a time- 
dependent model of system evolution and various 
branching conditions. 

Essentially, all discrete dynamic PRA methodologies
employ a simulation engine that generates
branches, with their associated probabilities, at
each user-specified time step or condition and
computes the probability of each scenario. As can
be seen in Figure 4, branching points can include 
system hardware states, physical variable changes, 
human actions, software failures or an end state if 
one of the stopping criteria is met. 

ADS-IDAC, the Accident Dynamics Simulator
with its operating crew simulation model
(Identification, Decision and Action in a Crew cognitive
context) and the thermal-hydraulic code
RELAP5/MOD3.3, was selected as the dynamic PRA
platform for scenarios that would require more
extensive operator response modeling. ADS-IDAC is a
simulation engine that includes a scheduler module,
a hardware reliability model, an indicator
module (the control panel), and the IDAC operator
response model coupled with the RELAP5/MOD3.3
thermal-hydraulic code (the system
model) to generate DDETs containing contextually
rich scenarios that could occur given an initiating
event. Its modular structure and the flow of
information between modules are shown in Figure 5.

Figure 3. Dynamic PRA (Mosleh, 2015).

Figure 4. Discrete dynamic PRA methodology.

A scheduler module coordinates the interac- 
tions between all the other modules and generates 
the DDETs. As is the case for traditional ETs, the 
probability of each scenario or sequence is the 
product of conditional probabilities of its con- 
stituent branches. The indicator module simulates 
the control panel indicators’ states driven by infor- 
mation from the system module. The HCL module 
models and quantifies the probability of Human 
Failure Events (HFEs) based on a Dynamic Baye- 
sian Network (DBN) of a range of Crew Failure 
Modes (CFMs) and Performance Shaping Factors 
(PSFs) that reflect the context conditions in which 
the operators create situational assessments and 
devise recovery strategies (Diaconeasa, 2017a). 
The hardware reliability module simulates the
failure probabilities of the system and control panel
components and, coupled with the HCL module,
can model the impact of support systems on the
frontline systems through dynamically linked FTs
(Diaconeasa, 2017b).
Figure 5. ADS-IDAC architecture.


Figure 6. IDAC cognitive model for one human
operator.


The IDAC model serves as the underlying
framework for operator behavior. IDAC decomposes
the operator’s cognitive flow into three main
processes: information processing, decision-making,
and action execution (Fig. 6). The domain of
applicability of IDAC is constrained to environ- 
ments characterized by high levels of training 
and explicit requirements to follow procedures. 
These constraints simplify the modeling by lim- 
iting the degrees of freedom from the broader 
human response spectrum. In IDAC, the crew is 
modeled as a team of individuals working on dif- 
ferent assigned tasks and communicating with one 
another. The individuals differ by the content of 
their memory, by their mental state, and by the 
goals and strategies they employ. IDAC can simu- 
late several decision-making and problem-solving 
strategies, including passive and active information 
gathering, diagnosis, skill-, rule-, and knowledge- 
based actions, and procedure-following. The model 
includes several dynamic and static PSFs as part of 
the set of factors and rules that simulate the Sen- 
ior Reactor Operator (SRO) and Reactor Operator 
(RO) responses. Each operator also has a unique 
knowledge base that defines his or her knowledge 
about nuclear plant systems and operations. 


Previous research efforts in developing ADS-IDAC
have shown that a small set of generic
branching rules is sufficient to capture complex
variations in system and crew-to-crew
performance (Coyne, 2009).


3 HYBRID PLATFORM ARCHITECTURE 


The hybrid static-dynamic PRA platform was 
designed to selectively feed conventional PRA 
results (e.g. cut sets) from IRIS into the ADS-IDAC
dynamic PRA platform for dynamic analysis. 

The typical ET/FT/BBN analysis of conven- 
tional PRAs is used to generate a list of cut sets. 
Each cut set is used in the dynamic PRA analy- 
sis to constrain the branching at every time step 
in creating the DDET driven by the system and 
crew evolutions. This is graphically illustrated in 
Figure 7. 

The optimum strategy that provides the broadest
range of applications and highest compatibility
with present conventional and dynamic platforms
(not only IRIS and ADS-IDAC) is to create
an interface between the platforms for selectively
passing information that complies with the
Open-PSA model exchange format. Given that the
Open-PSA model exchange format covers only ETs and
FTs, the standard will be extended to cover BBNs
in order to provide compatibility with all the layers
of the IRIS platform. In Figure 8, the hybrid
static-dynamic PRA platform is shown where the typical
inputs and outputs are listed for each module.
Figure 7. Hybrid static-dynamic framework.

Figure 8. Hybrid static-dynamic PRA platform.

The interface acts as a gate for passing information
from cut sets to branching conditions in ADS-IDAC.
Thus, in the ADS-IDAC model, boundary
conditions need to be set based on both the cut set
information and the components available
in the system model, either captured in the
RELAP5 thermal-hydraulic model or through
dynamically linked FTs.

Overall, all the branching points and the inter- 
mediate events associated with an initiating event 
make up the DDET during accident scenarios. The 
set of generic branching rules cannot create the 
DDET without modeling in parallel the dynamic 
system and human operator behaviors. Therefore, 
the construction of the DDET is driven by a rich 
contextual environment simulated by ADS-IDAC 
and guided by the branching rules, which allow 
the modeling of variability in system and human
operator behavior.

The need for branching rules in a dynamic
PRA simulation platform like ADS-IDAC arises from
the sequence explosion phenomenon. If the
simulation engine allowed branching at every
time step, the number of sequences to be
explored would grow exponentially and the
simulation time would become unrealistically long with
the current computational models and resources.
For the same computational reasons, sequence
termination conditions have been implemented
to stop the engine from exploring sequences after
a time period of interest, when certain physical
limits have been exceeded, or when the operators
enter certain procedures. For example, if the
interest of the simulation is the exploration of crew
variability in diagnosing an SGTR, two sequence
termination conditions could be set to stop the
simulation. One of them could be triggered when
the operators transfer to procedure E-2 “Isolation
of steam generator with secondary break” or E-3
“Tube rupture in one or several steam generators.”
Another could be triggered when the simulation time
exceeds a certain time period set based on
previous crew performance.
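
A simple count illustrates the scale of the sequence explosion. The sketch assumes binary branching, and the rule that fires every fifth time step is purely hypothetical.

```python
def sequences_if_branching_every_step(n_steps, branches_per_step=2):
    """Number of sequences when every time step is a branching point."""
    return branches_per_step ** n_steps

def sequences_with_rules(n_steps, rule_fires):
    """Number of sequences when a binary split occurs only where a rule fires."""
    count = 1
    for step in range(n_steps):
        if rule_fires(step):
            count *= 2
    return count

unconstrained = sequences_if_branching_every_step(20)
# Hypothetical rule: a branching condition is met every fifth time step.
constrained = sequences_with_rules(20, lambda step: step % 5 == 0)
```

Over only 20 time steps, unrestricted binary branching already yields more than a million sequences, whereas the rule-based count here is 16.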

At the same time, sequence termination
conditions can be set to calculate an overall failure
probability for an event of interest. For example,
a sequence termination condition can be set for the
fuel element cladding temperature exceeding the
2200°F acceptance criterion for emergency core
cooling systems for light water nuclear power
reactors (10 CFR Part 50.46). The summation of the end
state probabilities for all sequences that were
terminated by this condition would essentially estimate
an overall measure of core damage probability.
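
The summation can be sketched as follows. The sequence records, their field names, and their probabilities are hypothetical and serve only to show the bookkeeping; only the 2200°F limit comes from the text.

```python
PCT_LIMIT_F = 2200.0  # cladding temperature acceptance criterion (10 CFR 50.46)

def core_damage_probability(sequences, limit_f=PCT_LIMIT_F):
    """Sum end state probabilities of sequences that exceeded the limit."""
    return sum(s["probability"] for s in sequences
               if s["peak_clad_temp_f"] > limit_f)

# Hypothetical DDET end states, for illustration only.
sequences = [
    {"probability": 0.97,   "peak_clad_temp_f": 1100.0},
    {"probability": 0.0295, "peak_clad_temp_f": 1850.0},
    {"probability": 4.0e-4, "peak_clad_temp_f": 2350.0},
    {"probability": 1.0e-4, "peak_clad_temp_f": 2600.0},
]
cdp = core_damage_probability(sequences)
```

Only the last two sequences cross the temperature limit, so the estimate is the sum of their probabilities.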

Overall, the branching rules and the sequence
termination conditions help define the scope of
the intended ADS-IDAC analysis. Therefore, if the
set of branching rules and sequence termination
conditions does not cover all the models included
in ADS-IDAC, their variability is not included in
the generated DDET and, ultimately, the solution
space is not complete.

Branching rules that cover the failure of either
frontline or support system components have
been implemented. Nonetheless, implementing
branching rules alone does not mean the sequence
end state probability can be quantified. Each
branching point requires either a success or failure
probability. The ADS-IDAC hardware reliability
module covers modeling of both frontline and
support system failures during operation by
considering the failure rate, the number of failures desired
and the time interval between them. DDET
branching is modeled such that failures during operation
generate two branches: a success branch and a
failure branch. For a specific piece of equipment, if
more than one failure during operation is modeled,
only the subsequent success branches will allow
further failures, as on the failure branches this
equipment has already failed. This feature further
extends ADS-IDAC’s capability to dynamically
predict the importance of the timing of component
failures during operation for the overall safety of
the design in question. For example, small-break
LOCA scenarios in a PWR could be set up to simulate
the timing of the pressurizer power operated relief
valves (PORVs) stuck-open failure events and their
impact on the available time for recovery actions.
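
The failure-during-operation logic can be sketched as below. The exponential failure model used to turn a failure rate and a time interval into a branch probability is an assumption made for this example, as are the function name and the numerical values.

```python
import math

def failure_branches(failure_rate, interval, n_failures):
    """Return the failure branches and the probability of the surviving branch.

    Each failure branch is a (failure_time, probability) pair. Only the
    surviving (success) branch is split again at the next interval.
    """
    # Assumed exponential model for failure within one interval.
    p_fail = 1.0 - math.exp(-failure_rate * interval)
    branches = []
    p_surviving = 1.0
    for k in range(1, n_failures + 1):
        branches.append((k * interval, p_surviving * p_fail))
        p_surviving *= 1.0 - p_fail
    return branches, p_surviving

# Hypothetical component: 1e-3 failures per hour, checked every 0.5 h, 3 times.
branches, survivor = failure_branches(1.0e-3, 0.5, 3)
```

The failure branch probabilities decrease slightly with time because each one is conditional on the component having survived all earlier intervals.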

The crew’s recovery of failed frontline and support
system components can also be modeled
(Diaconeasa, 2017b). When the crew attempt to 
recover a component, two additional branches are 
generated: a recovery branch, in which the compo- 
nent is successfully recovered, and a permanent fail- 
ure branch, in which the component remains failed. 

The branching points that were quantified with
the dynamically linked FTs (Fig. 9), out of which a
success and a failure branch are created, are given
below:


• Frontline system component failures at a
  fixed time.

• Frontline system component failures on
  demand.

• Frontline system component failures during
  operation.

• Support system component failures at a fixed
  time.

• Support system component failures on
  demand.

• Support system component failures during
  operation.


Figure 9. DDET with quantified frontline and support
system branching points based on dynamically linked FTs.

Figure 10. DDET with quantified human events
branching points based on a DBN of PSFs and CFMs.

Using the DBN of PSFs and CFMs, a range 
of branching points for HFEs are implemented to 
model the crew’s variability (Diaconeasa, 2017a). 
The HFEs covered in the branching rules (Fig. 10)
can be of the following types: strategy selection,
procedure transfer, or diagnosis of accident
conditions based on either knowledge-based
reasoning or procedure following.


4 CUT SET DIFFRACTION 
PHENOMENON 


The hardware and human operator behavior varia- 
tions can act as a diffraction grating to create new 
scenarios starting from a single cut set, a process 
analogous to light diffraction in optics. This phe- 
nomenon is illustrated in Figure 11. 

The number of sequences obtained from the single
cut set depends on the number of events
in the cut set and on the following
variations: timing and order of failures, degree of
component degradation, time and degree of recovery,
human decisions and actions, human dependencies,
and physical variable thresholds.
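
To make the counting argument concrete, the sketch below enumerates the scenarios that one cut set can "diffract" into when only three of the listed variations are considered: order of failures, timing of failures and crew response. It is an illustrative assumption, not the platform's algorithm; the function name, the event labels, the discrete failure times and the crew response labels are all hypothetical.

```python
from itertools import permutations, product
from typing import List, Sequence, Tuple

# One scenario = ordered (event, failure time) pairs plus a crew response.
Scenario = Tuple[Tuple[Tuple[str, int], ...], str]


def diffract_cut_set(
    cut_set: Sequence[str],
    failure_times: Sequence[int],
    crew_responses: Sequence[str],
) -> List[Scenario]:
    """Enumerate scenarios generated from a single cut set (hypothetical helper)."""
    scenarios: List[Scenario] = []
    for order in permutations(cut_set):  # order of failures
        for times in product(failure_times, repeat=len(order)):  # timing
            # Keep only strictly increasing times so the order is meaningful.
            if any(a >= b for a, b in zip(times, times[1:])):
                continue
            for response in crew_responses:  # human decisions and actions
                scenarios.append((tuple(zip(order, times)), response))
    return scenarios


if __name__ == "__main__":
    result = diffract_cut_set(
        ["MFW fails", "AFW fails"],       # illustrative two-event cut set
        [0, 30, 60],                      # illustrative failure times (minutes)
        ["early F&B", "delayed F&B"],     # illustrative crew responses
    )
    print(len(result))  # 2 orders x 3 time pairs x 2 responses = 12
```

Even this small example turns one static cut set into twelve distinct dynamic scenarios; adding degradation levels, recovery times or physical thresholds multiplies the count further.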

Figure 11. Graphical representation of the cut set
diffraction where the input cut set is highlighted in the
DDET.

The timing and order of failures are important
in Fukushima-like scenarios, or hurricane
scenarios, where the reactor is tripped at some time
prior to the more severe plant response. How can 
initially smaller decay heat levels complicate the 
sequence of events operators are trained on? How 
can the order of events be shifted and how does 
this impact the crew response? There are scenar- 
ios where it is necessary to consider the timing of 
failures. For example, the failure of containment
spray at some time after it successfully starts will
prolong the capacity of the Refueling Water Storage
Tank (RWST) to supply suction before recir-
culation. On the other hand, the failure to secure 
running trains of spray per procedure would lessen 
the time to suction switchover. 

The degree of component degradation is impor- 
tant in cases similar to the Beaver Valley PRA 
sequences initiated by a loss of an instrument bus. 
When can an initially benign sequence degrade and 
become more of a challenge? Loss of this bus fails 
makeup to the volume control tank, but normal 
charging continues. If not recovered (by switching 
to an alternative instrument bus), suction is shifted 
to the RWST. Failure to recover makeup prior to 
switchover to RWST could make successful ter- 
mination dependent on a single check valve (in a 
borated environment). 

The time and degree of recovery were seen to be
important in the TMI-2 Loss Of Coolant Accident
(LOCA). Auxiliary Feedwater (AFW) was failed, 
and then subsequently recovered. Once recovered, 
High Pressure Injection (HPI) was secured. What 
was not appreciated was that AFW recovery came 
too late, as a LOCA had already occurred. 

The importance of decisions and actions of 
the crew can be illustrated by looking at scenarios 
involving the failure of Main Feedwater (MFW) 
and AFW. Procedures instruct the crew to recover 
MFW or AFW and after failure to do so, go to 
feed-and-bleed (F&B). Observations of different
crews at simulators suggest that crews
devote different amounts of time to the
recovery of MFW or AFW. When would the delay
result in loss of a potential success path? Also, 
recovery that requires steam generator depres- 
surization (e.g., using condensate) may influence 
recovery characteristics. This assessment will likely 
be plant-specific due to differing shutoff head
capacities of the HPI pumps. 

The hardware dependencies can have risk implications,
as is the case for shared equipment during
a multi-unit initiator. An example might be a Loss of
Off-Site Power (LOSP) at an NPP with shared diesels
and shared equipment. Initial plant states and
equipment failure conditions may be underappreciated,
although they could complicate plant recovery.

A consequence of the cut set diffraction phe- 
nomenon in the hybrid static-dynamic PRA plat- 
form is the consideration of non-core damage end 
states. PRAs are typically developed to trace the 
sequence of events that lead to core damage or 
fission products release. The frequency of such 
scenarios is one metric of interest in regulatory 
risk, as it is one common surrogate measure for 
public health risk. However, restricting the logic 
models to only determine core damage frequency, 
or large early release frequency, limits the breadth 
of information potentially available to decision 
makers. 


5 CONCLUSIONS 


A hybrid static-dynamic PRA platform was devel- 
oped to leverage the newer risk analysis methods 
available while still using the existing probabilistic 
information that, at least in the U.S., is available for 
every NPP. The IRIS and ADS-IDAC tools make
up the main elements of this platform; nevertheless,
similar PRA tools that conform to the Open-PSA
model exchange format could be used. Finally,
the cut set diffraction phenomenon was introduced
and illustrated with short examples.
A framework for assessment of Technological Readiness Level (TRL)
and Commercial Readiness Index (CRI) of asset end-of-life strategies 


I. Animah & M. Shafiee 
Cranfield University, Bedford, Bedfordshire, UK 


ABSTRACT: A substantial number of industrial assets within the manufacturing, power generation, 
transportation, oil and gas, petrochemical processing, mining and construction sectors are facing opera- 
tion beyond their anticipated design life and will be in need of intensive maintenance services in the 
coming years. At the end of an asset’s design lifetime, the operators must make a decision on either 
rejuvenating the components through life-extension solutions or decommissioning the asset. This means 
that life extension policies (e.g. remanufacturing, reconditioning, repurpose, retrofitting) and decommis- 
sioning strategies (e.g. recycling and disposal) will continue to play a crucial role in the future manage- 
ment of industrial assets. However, some of the End-of-Life Management Strategies (ELMS) or their 
emerging technologies may not be mature yet, and therefore application of such strategies can cause 
extensive uncertainties. A well-documented Technological Readiness Level (TRL) and Commercial Read- 
iness Index (CRI) for these strategies and related technologies will be a key in reducing the uncertainties 
involved in implementing ELMS in various industries. This paper aims to propose a systematic frame- 
work consisting of six different processes to help asset managers evaluate the TRL and CRI of different 
ELMS and their corresponding technologies. An essential part of developing this framework is the strong 
collaboration among academics and industrial experts with several years of experience in undertaking life 
extension and decommissioning projects. For the purpose of illustrating the model, a case study involving
end-of-life strategies of wind turbines is provided and the results are further discussed. The data required 
for this study is collected from various sources, including the published literature and industrial reports as 
well as by surveying academic and industrial experts. The results of this study indicate that TRL and CRI 
assessments are not only an effective means of evaluating the technological status of different ELMS but 
also a means for risk management decision making. 


1 INTRODUCTION


Over the past several decades, asset owners in the
manufacturing, power generation, transportation,
oil and gas, petrochemical processing, mining and
construction industries have focused on optimizing
the design of their systems, enhancing installation
techniques and improving production. However, in
recent years, many of the assets operating in the
above-mentioned industries are entering a new
phase of development where assets are expected
to reach their anticipated design lifetime. Hence,
the attention of industries is now shifting towards
how these ageing assets can be managed, to ensure
that they continue to deliver a high level of service
beyond their original design lifetime.

The two most popular end-of-life management
strategies (ELMS) are asset life extension and
decommissioning. Asset life extension involves the
application of technical and administrative procedures
to extend the useful life of engineering
structures, systems and components at the end of
their design lives, provided they are technically and
economically qualified. The benefits of extending
the service life of ageing assets are enormous.
For instance, extending the service life of a multi- 
million pound system could result in substantial 
return on investment. It also tends to
increase production volume and reduce CO2 emissions
due to a slowdown in the manufacturing of
new products.

On the other hand, decommissioning represents
the last stage of the asset life cycle. It involves the
total or partial removal of assets, which ensures
the restoration of a site (land or seabed) to a suitable
condition for other uses and also maximizes
material recovery from removed assets through
waste management. Despite the huge potential of
life extension and decommissioning for asset owners,
the life extension policies and decommissioning
strategies as well as their emerging technologies
are not yet mature. Therefore, the application of
these ELMS in many sectors often results in huge
technological and commercial uncertainties.

In order to minimize the technological and 
commercial uncertainties involved in implementing
ELMS in various industries, it is important
for asset owners to explore, assess and evaluate the 
maturity level of life extension policies and decom- 
missioning strategies for different systems and 
components, taking into account technological 
readiness level (TRL) and commercial readiness 
index (CRI) of related technologies. 

The TRL was first developed by NASA in 1974 as 
a benchmarking tool to assess and communicate the 
maturity levels of new technologies (Mankins, 2009). 
Since then, it has been applied in various industries 
to provide a measurement of technology maturity. 
Although the TRL concept is appropriate to help 
minimize technological uncertainties, there are often 
commercial or financial uncertainties characterising 
new programmes and technologies entering the mar- 
ket. Another accepted process that can be used for 
benchmarking the commercial maturity of new tech- 
nologies is the commercial readiness index (CRI). 
The CRI was developed by the Australian Renewable 
Energy Agency (ARENA) to evaluate the commer- 
cial readiness level of renewable energy technologies. 

Understanding and communicating the TRL 
and CRI for life extension policies and decommis- 
sioning strategies as well as their related technolo- 
gies will not only enable asset owners to reduce the 
uncertainties involved in implementing ELMS in 
various industries but also can help them to deter- 
mine how key assets could be competitive beyond 
their original design life. 
The aim of this paper is to propose a framework
to evaluate the maturity level of different life exten- 
sion policies and decommissioning strategies by 
taking into account the TRL and CRI of related 
technologies. The proposed framework is tested 
with a case study involving offshore wind turbine 
blades that have reached the end of their original 
design lives. Our results indicate that the proposed 
framework provides a powerful decision-making 
tool to support asset owners to efficiently manage 
engineering structures, systems and components 
when they reach the end of their original design life. 

The rest of the paper is organized as follows. 
Section 2 presents the conceptual framework and 
the steps required to aid in assessing the matu- 
rity levels of life extension policies, decommis- 
sioning strategies and the related technologies. 
In Section 3, the model is applied to a case study 
involving wind turbine blades and the results are 
analyzed in Section 4. Finally, the conclusions and 
future research directions are outlined in Section 5. 


2 THE PROPOSED FRAMEWORK 


The proposed framework for evaluating the maturity
level of life extension policies, decommissioning
strategies and emerging technologies is shown in
Figure 1, which includes five steps. These steps are
explained in detail as follows:
Figure 1. Steps for applying the proposed framework.


Step 1. Select the suitable industry and identify the 
trend of ageing facilities: As the purpose of this 
study is to evaluate the maturity level of differ- 
ent life extension and decommissioning policies 
as well as their related technologies to support 
end-of-life management of structures, systems 
and components within different industries, it is 
key to investigate the trends in the number of
facilities that are approaching, have reached or have
exceeded the end of their original design life within a
specific industry or company. This task can be
achieved by reviewing journal articles, conference
papers, technical reports and company dossiers,
and through face-to-face interviews with industrial
experts. The industries/companies that can
apply the proposed framework to support
end-of-life management of ageing assets include (but
are not limited to) the following: power generation
(nuclear energy, renewable energy and power
transmission and distribution), manufacturing,
transportation, oil and gas, petrochemical
processing, mining and construction.

Step 2. Select an ageing facility and break it down 
into subsystems and components: In this stage, 
an ageing facility is selected and decomposed 
into manageable units. Decomposing the ageing
facility in this way facilitates the mapping of
applicable life extension policies or decommissioning
strategies to the high-risk subsystems and
components of the facility. The task of breaking down a facility into
manageable units can be achieved through the 
use of product failure mode and effect analysis 
(FMEA). 

Step 3. Map applicable life extension policies and 
decommissioning strategies to subsystems and 
components: At this stage, all applicable life 
extension policies and decommissioning strate- 
gies are identified and mapped to critical sub- 
systems and components of the facility. The life 
extension policies and decommissioning strategies
can be identified through literature review
as well as consultations with industry experts. 
Examples of life extension policies include 
remanufacturing, reconditioning, repurpose, 
retrofitting, etc. while recycling, disposal and 
plugging and abandonment are examples of 
decommissioning strategies. For a more comprehensive
description of different life extension
policies, readers can refer to Shafiee and Ani- 
mah (2017). 

Step 4. Identify emerging technologies for imple- 
menting life extension policies and decommis- 
sioning strategies: The goal of this stage is to 
identify the technologies related to the imple- 
mentation of the applicable life extension poli- 
cies and decommissioning strategies for the 
critical subsystems and components. In order
to identify the related technologies, one must
understand the process sequence for each life 
extension policy or decommissioning strategy. 
Understanding the process sequence for differ- 
ent life extension policies and decommissioning 
strategies will help in identifying the appropri- 
ate technologies and their features and provide 
specific information that allows for the alloca- 
tion of TRL and CRI respectively. 


The process sequences involved in extending 
the useful life of an engineering structure, system 
or component include cleaning, disassembling 
and inspection, repair, reassembling and testing.
In contrast, cutting, lifting, removal and material
recovery/disposal may constitute the decommissioning
process sequence for systems and components,
especially within the offshore oil and gas and
offshore wind power industries.


Step 5. Evaluate the maturity level of the life exten- 
sion policies and decommissioning strategies: 
This task of the proposed framework is per- 
formed by utilizing the TRL and CRI assess- 
ment scales, shown in Figure 2. The TRL scale 
helps to assess the technical maturity of the life 
extension policies and decommissioning strate- 
gies as well as their related technologies whereas 
the CRI scale assists in evaluating the commer- 
cial readiness of the strategies and the related 
technologies. 


There are two rounds of activities at this stage. 
In the first activity, the TRL and CRI of the 
emerging technologies supporting the implementa- 
tion of life extension policies and decommission- 
ing strategies for subsystems and components are 
determined using Eq. (1) and (3) below: 


TRL_i = \sum_{j=1}^{n} w_{ij} \, TRL_{ij}    (1)

\sum_{j=1}^{n} w_{ij} = 1, \quad \forall i    (2)

where w_{ij} is the relative importance (weight) of
technology j for policy/strategy i, and

CRI_i = \sum_{j=1}^{n} v_{ij} \, CRI_{ij}    (3)

\sum_{j=1}^{n} v_{ij} = 1, \quad \forall i    (4)

where v_{ij} is the relative importance (weight) of
technology j for the policy/strategy i. The weights
w_{ij} and v_{ij} can be allocated by experts or estimated
using the Delphi-analytical hierarchy process. The
second activity involves evaluating the maturity
level of each life extension policy and decommissioning
strategy (M_i) using Eq. (5):

M_i = \alpha \, TRL_i + (1 - \alpha) \, CRI_i, \quad 0 < \alpha < 1    (5)

where \alpha and 1 - \alpha represent the relative importance
(weight) of TRL and CRI in relation to each other.


TRL scale:
TRL 10 | Well proven operations
TRL 9 | System operational
TRL 8 | Subsystem build and test
TRL 7 | Detailed design and assembly level build
TRL 6 | Preliminary design and prototype validation of the technology
TRL 5 | Conceptual design and prototype demonstration
TRL 4 | Technology demonstration
TRL 3 | Proof-of-concept
TRL 2 | Technology concept
TRL 1 | Technology research

CRI scale:
CRI 6 | Readily financial support of the technology by banks
CRI 5 | Market competition/industrial acceptance driving by widespread application of the technology
CRI 4 | Multiple commercial application of the technology
CRI 3 | Commercial scale up of the technology
CRI 2 | Commercial trials of the technology on a small scale
CRI 1 | Technically and commercially untested and unproven

Figure 2. Technology readiness level (TRL) and commercial readiness index (CRI) scale (ARENA, 2014 and Straub, 2015).
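
As a worked illustration of Eqs. (1), (3) and (5), the short Python sketch below computes the aggregated TRL and CRI of one policy/strategy from the levels of its supporting technologies, and then combines them into a maturity level. The function names are hypothetical, and the levels, weights and the value of alpha are invented for the example; they are not the values elicited from the experts in this study.

```python
from typing import Sequence


def weighted_level(levels: Sequence[float], weights: Sequence[float]) -> float:
    """Eq. (1) / Eq. (3): weighted sum of technology levels for one strategy."""
    if len(levels) != len(weights):
        raise ValueError("one weight is required per technology")
    # Eq. (2) / Eq. (4): the weights of one strategy must sum to 1.
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(weights, levels))


def maturity(trl: float, cri: float, alpha: float) -> float:
    """Eq. (5): maturity level as a weighted combination of TRL and CRI."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha * trl + (1.0 - alpha) * cri


if __name__ == "__main__":
    # Illustrative inputs for a strategy supported by three technologies.
    trl_levels = [10, 7, 5]
    cri_levels = [5, 1, 1]
    weights = [0.5, 0.3, 0.2]  # assumed equal for TRL and CRI in this example

    trl_i = weighted_level(trl_levels, weights)   # 0.5*10 + 0.3*7 + 0.2*5
    cri_i = weighted_level(cri_levels, weights)   # 0.5*5 + 0.3*1 + 0.2*1
    m_i = maturity(trl_i, cri_i, alpha=0.5)
    print(f"TRL_i={trl_i:.2f}  CRI_i={cri_i:.2f}  M_i={m_i:.2f}")
```

Strategies can then be ranked by sorting their maturity levels in descending order. Note that the sketch combines the two scales exactly as written in Eq. (5); whether the TRL and CRI scales should first be normalized to a common range is not stated in the text.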


3 APPLICATION TO WIND 
TURBINE BLADES 


The number of wind turbines reaching the end of 
their original design life of 20-25 years has been 
increasing in recent years, hence repowering, life 
extension and decommissioning activities are 
attracting the attention of both practitioners and 
scholars in the wind power industry. In order to 
illustrate the efficacy of the proposed framework, 
it is applied to offshore wind turbine blades which 
have reached the end of their original design lifetime.
The data used in this study was collected
from the literature, academics with expertise in the
wind energy industry and blade manufacturers. In
this section, the results of the application case are
presented and discussed.


Step 1. The wind power industry is the focus of 
this case study. 

Step 2. From the product FMEA of the wind tur- 
bine provided by the operator, the proposed 
framework is applied to the blades, which are 
made of composites. 


Step 3. The life extension policies and decommis- 
sioning strategies considered for the blades are 
briefly explained below. 


— Remanufacturing: It involves the use of mod- 
ern technologies and procedures to break 
assets down into core parts, and then engi- 
neering changes are made to the core parts in 
order to meet Original Equipment Manufac- 
turers (OEM) specifications and performance. 
In most cases, the cost of a remanufactured 
system is less than that of a brand new system 
but with an equivalent warranty to that of a 
brand new system (Animah et al., 2017).

— Reconditioning: According to Shafiee and 
Animah (2017) reconditioning involves tak- 
ing appropriate actions to restore a defective 
system to between “as good as new (AGAN)” 
and “as bad as old (ABAO)” condition, hence 
the output is less than that of OEM’s stated 
output. 

— Repurpose: It involves taking the necessary 
steps to create a new use or purpose for an 
existing system which was originally designed 
for a different purpose (Aguirre, 2010; Cough- 
lan et al., 2015). With this life extension policy, 
engineering actions and processes are applied
to transform key parts of a system.
The transformation is based on the technical
feasibility of the engineering processes,
environmental performance and economic 
viability of the system for future operations 
(Bauer et al., 2017). 

— Retrofitting: It involves the process of replac- 
ing old components of a system with modern 
Table 1. Overview of technologies used to support remanufacturing, reconditioning, repurpose and retrofitting process for blades.

Process sequence | Technologies | Technology features
Cleaning | Jet cleaning (water jet, abrasive (sand) blasting cleaning, dry ice cleaning etc.) | With this technology, rust, grease, oil and other contaminants are removed from surfaces of components through physical interaction of the medium (sand, water, dry ice etc.) accelerated by compressed air or high pressure water (Liu et al., 2013).
Cleaning | Organic solvent cleaning | This method of cleaning is performed by drenching/soaking components in organic solvent or spraying the organic solvent on component surfaces while cleaning takes place as a result of dissolution and chemical reaction (Kikuchi et al., 2011).
Cleaning | Ultrasonic cleaning | Ultrasonic cleaning involves the use of high frequency (20-400 kHz) sound waves to generate agitation in a liquid (Niemczewski, 2007). The cavitation bubbles produced as a result of the agitation act on contaminants on the surfaces for cleaning.
Disassembly | Disassembly embedded design | Disassembly embedded design integrates a disassembly mechanism into a system during design, e.g. snap fits to dislodge locked out of position (Soh et al., 2014).
Disassembly | Active disassembly | This technique makes use of external triggers such as temperature, magnetic force or pressure, for the release of fasteners. It includes utilizing smart materials, freezing elements, soluble elements, pneumo-elements and hydrogen storage alloy elements as a fastening technique (Duflou et al., 2006).
Repair/Modification/Retrofitting process | Laser repair technology (LRT) | LRT is suitable for the repair/rebuilding of worn out metallic parts, which are considered non-repairable using traditional welding or plating techniques. The process involves injection of metal powder into a focused beam of a high powered laser in a tightly controlled atmospheric condition.
Repair/Modification/Retrofitting process | Advanced mechanical machining processes | This technology is needed when the repair of existing components requires CNC milling, turning, drilling, tapping, cutting and other machining operations.
Repair/Modification/Retrofitting process | Industrial 3D printing | This is an additive manufacturing process which uses high powered laser fusion to produce components by building layer upon layer from a predefined 3D digital design.
Repair/Modification/Retrofitting process | Tubercle technology | This technology mimics the bumps on humpback-whale fins to develop more efficient wind turbine blades.
Repair/Modification/Retrofitting process | Plastic surgery | This is the use of plastic surgery to make old, smaller and less efficient wind turbine blades into bigger and efficient ones without replacing them at the end of life.
Testing | Shearography system | This is the use of optical measurement technology or a shearography sensor to detect defects such as wrinkles, delaminations, debondings and kissing bondings in wind turbine blades.


Table 2. Overview of technologies used to support recycling and disposal of wind turbine blades.

Process sequence | Technology | Technology features
Material recovery | Pyrolysis | This is the use of pyrolysis recycling technology (i.e. burning of the resin matrix with limited oxygen).
Material recovery | Mechanical grinding | This is the reduction of composite materials into suitable sizes and grinding them into different grades using mechanical processes such as hammer mill.
Material recovery | Solvolysis | This technology makes use of physico-chemical separation process for full material recovery of thermosets. The process uses water at sub-supercritical temperature to break down thermoset resin, in order to remove it from the fibre.
Material recovery | Incineration | This is the burning of end of life product for energy generation.
Disposal | Landfill | Burying the materials used in the manufacturing of the system in the ground.
equivalents in order to achieve higher functionality
and availability. An example of retrofitting
is the blade extension program for wind
turbines, which increases the swept area of
the rotor to increase power generation at low
speed.

— Recycle: Complex industrial equipment is
made of different materials. After decommissioning,
scrap from this infrastructure
is generated in the form of metals and non-metals.
Recycling processes are then used to
recover high grade materials from this scrap
for other uses, thereby limiting the amount of
waste generated at the end of life (Yang et al.,
2012). Moreover, recycling of the products can
reduce the exploitation of natural resources
and protect the environment.

— Disposal: Open burning and landfills are 
examples of the popular disposal options 
for components and materials from indus- 
trial systems. However, the economic value 
of reusing the system or recovering materials 
through recycling is lost when this strategy is 
implemented at the end of original design life. 


Step 4. A number of technologies have been pro- 
posed by both researchers and practitioners 
to support the process sequence for imple- 
menting life extension policies and decommis- 
sioning strategies across different industries. 
The technologies that can support the process 
sequence of remanufacturing, recondition- 
ing, repurpose and retrofitting of wind turbine 


Table 3. TRL and CRI of emerging technologies supporting the life extension and decommissioning strategies for wind blades.

Emerging technologies | TRL | CRI
Jet cleaning | 10 | 5
Organic solvent cleaning | - | -
Ultrasonic cleaning | - | -
Disassembly embedded design | 10 | 5
Active disassembly | 7 | 1
LRT | 5 | 1
Advanced mechanical machining processes for composites | 2 | 1
Tubercle | 7 |
Plastic surgery | 7 |
Shearography | 9 |
Industrial 3D printing | 2 |
Pyrolysis | 3 |
Mechanical machines (grinding, stripping, crushing etc.) | 3 |
Solvolysis | 3 |
Incineration | 9 |
Landfill | 10 |


blades are shown in Table 1. On the other hand, 
Table 2 shows the overview of the related tech- 
nologies that can support the material recovery/ 
disposal process sequence of recycle and dis- 
posal of wind turbine blades. 

Step 5. The maturity level of the life extension policies
and decommissioning strategies applicable
to wind turbine blades is evaluated using the TRL
and CRI scales in Figure 2. The TRL and
CRI of each technology were evaluated through
expert elicitation. First, the experts were asked
to allocate TRL and CRI values to the emerging
technologies, obtained from the literature, that
support the implementation of the life extension
policies and decommissioning strategies.
Table 3 shows the TRL and CRI of the emerging
technologies supporting the implementation
of life extension policies and decommissioning
strategies for wind turbine blades using Eqs. (1)
and (3). The weights assigned to evaluate TRL
and CRI for each technology were achieved
through a consensus reached by a panel of
experts. Second, the maturity level of each life
extension policy and decommissioning strategy
was evaluated using Eq. (5), and the results are ranked
in Table 4.


As shown in Table 4, disposal was chosen as the 
most appropriate ELMS for wind turbine blades. 
This is because landfill, as a technology for
disposing of blades made of composites, is
technically well proven, commercially
available and relatively cheap in many parts of the
world. The second and third most suitable ELMS
for the wind turbine blades are remanufactur- 
ing and reconditioning. Retrofitting, repurposing 
and recycling were identified as the fourth, fifth 
and sixth preferred ELMS for the blades. This is 
because the key technologies needed to support 
the implementation of these ELMS for products 
made of composite materials are still in the experimental
phase. This means that the technologies are
not technically mature enough for large scale commercial
applications. For instance, pyrolysis, mechanical 


Table 4. Ranking of maturity level of life extension 
policies and decommissioning strategies for wind turbine 
blades. 


Life extension policies/Decommissioning strategies | Ranking
Disposal | 1
Remanufacturing | 2
Reconditioning | 3
Retrofitting | 4
Repurpose | 5
Recycle | 6
grinding and solvolysis, which are considered the
most suitable recycling technologies for composite
products such as wind turbine blades, are at the
laboratory scale in terms of technological development
(Rybicka et al., 2016). Thus, landfill
and incineration are the most widely used technologies
to support end of life management of wind turbine
blades when they reach the end of their original
design life.


5 CONCLUSION 


In this study, a methodology was proposed to assist 
in understanding and communicating the maturity 
level of life extension policies and decommissioning 
strategies for industrial assets reaching the end of 
their original design lives. The proposed methodol- 
ogy considered TRL and CRI of emerging technol- 
ogies supporting the implementation of different life 
extension policies and decommissioning strategies, 
in order to rank the best ELMS. To the best of our 
knowledge, this was the first time the TRL and CRI 
scales were integrated to communicate the maturity 
level of different life extension policies and decom- 
missioning strategies for products reaching their end 
of life. For the purpose of clearly illustrating the 
efficacy of the proposed framework, it was applied 
to determine the maturity level of different life 
extension policies and decommissioning strategies
for wind turbine blades. The findings from the case 
study indicated that disposal of wind turbine blades 
through landfilling was considered as the most 
appropriate strategy for managing wind turbine 
blades when they reached the end of their original 
design life. This was followed by remanufacturing, 
reconditioning, retrofitting, repurpose and recycle. 
Engineering safety recommendations: Results from a survey
in aviation 
ABSTRACT: Taking into account the lack of uniform guidelines for the design and classification of
safety recommendations, a relevant framework was developed according to academic and professional 
literature. The framework includes nine design criteria for recommendations, it incorporates classifica- 
tions of their scope and expected effectiveness, and it was used to perform a questionnaire survey across 
aviation professionals involved in the generation of safety recommendations. The goal of the survey was 
to capture (1) whether practitioners are knowledgeable about the design criteria, (2) the degree to which 
they apply those criteria along with corresponding reasons, (3) perceptions of the expected effectiveness 
of types of controls introduced through recommendations, (4) the frequency of generating each control 
type and respective explanations, and (5) the extent to which practitioners focus on each of the categories 
of recommendations’ scope and the relevant reasons. Overall, the results showed: an adequate level of 
knowledge of the design criteria; a strong positive association of the knowledge on a particular criterion 
with the degree of its implementation; a variety of frequencies the recommendations are addressed to 
each of the scope areas; a reverse order of perception of the expected effectiveness of control types com- 
pared to the literature suggestions. A thematic analysis revealed a broad spectrum of reasons about the 
degree to which the design criteria are applied, and the extent to which the various types of recommenda- 
tions are generated. The results of the survey can be exploited by the aviation sector to steer its relevant 
education and training efforts and assess the need for influencing the direction safety recommendations 
are addressed. Similar research is suggested to be conducted by organizations and regional and interna- 
tional agencies of any industry sector by ensuring a larger sample. 


1 INTRODUCTION


Safety investigations play a crucial role in safety
improvements especially because they lead to the
formulation of recommendations to eliminate or
mitigate identified problems with the scope to
prevent similar occurrences in the future. To date,
although in aviation there are established guidelines
for the conduction of a safety investigation
(ICAO, 2003, 2008, 2011, 2015), there is yet little
guidance about the design of safety recommendations
(Pooley, 2013).

To fill this gap, researchers and students from
the Aviation Academy of the Amsterdam University
of Applied Sciences (Zonneveld, 2016, De
Vos, 2016, Kiefer, 2016) reviewed academic and
professional literature and proposed a relevant
framework that includes design criteria (Table 1),
scope (Table 2) and expected effectiveness
(Table 3).

The particular framework was used to perform
a questionnaire survey across safety practitioners
to explore the extent to which the framework
aspects are known and applied, and reveal any
underlying reasons.


Table 1. Design criteria for safety recommendations.

Criterion        Brief explanation                            Literature*
Specific         Addresses a particular problem               (Haughey, 2014, Gregson, 2017)
Measurable       Allows monitoring of its implementation      (Haughey, 2014)
Assigned         Addressed to specific responsible agent(s)   (Haughey, 2014, Gregson, 2017)
Realistic        Achievable within current boundaries         (Haughey, 2014, ...)
Time-bound       End dates defined                            (Haughey, 2014)
Review           Review dates defined                         (Haughey, 2014)
Objectives       What and not how to achieve                  (Johnson, 2003, ...)
Action-oriented  Actionable items are preferred over studies  (Johnson, 2003)
Non-blaming      Focus on problems, not individuals           (Johnson, 2003, Dekker, 2016)

*Indicative literature references.
Table 2. Scope of safety recommendations*.

Dimension              Category     Brief explanation
Aspect of operations   Process      Oriented to low-level tasks
                       Structure    (Re)design of system’s architecture and functionality
                       Culture      Change of norms and behaviours
                       Context      Focus on politics and the society
Stakeholders affected  Macro-level  Governments, associations etc.
                       Meso-level   Industry sector(s)
                       Micro-level  Organizations and individuals
Degree of renewal      Repair       Short fixes of local problems
                       Adaptation   Improvement of larger systems
                       Innovation   Creation of new solutions

*Adapted from ESREDA (2015).


Table 3. Expected effectiveness of safety recommendations.

Type of control introduced*  Brief explanation
Physical                     Prevent completely actions or access
Functional                   Use of technology to limit actions
Symbolic                     Means to alert for hazards or remind/train rules, procedures etc.
Incorporeal                  Strategies, general policies, legislation

*In descending order of robustness (Hollnagel, 1999).


2 METHODOLOGY 


Following the establishment of the theoretical frame- 
work about the design and classification of safety 
recommendations as presented above, the researcher 
aimed at assessing the degree to which the aspects of 
the framework are known and/or applied by safety 
investigators and professionals involved in the for- 
mulation of safety recommendations in general (e.g., 
safety and risk managers). The aspects of the frame- 
work comprised the topics of a questionnaire that 
was administered to safety managers, investigators 
and professionals. The data collected from the analy- 
sis with the tool and the questionnaire responses 
were statistically processed to obtain an overall pic- 
ture and examine differences across various variables. 


2.1 Survey questionnaire 


The survey instrument was designed with the
goal to capture the following information from
practitioners involved in the generation of safety
recommendations: whether they are knowledgeable
of the design criteria and the degree to which they
apply those in daily practice along with possible
reasons; perceptions about the order of effectiveness
of the control types referred to in the literature,
the frequency of proposing each control type as part
of their role, and explanations about the latter; and
the extent to which they focus on each of the
categories included in the three dimensions of
recommendations’ scope and respective reasons.

The questionnaire included an introductory sec- 
tion where the background and aims of the study 
were stated along with the voluntary character and 
anonymity of participation. Also, the particular 
section referred to the estimated time investment 
(i.e. up to 15 minutes) and the contact details in 
the case that the respondents wanted to provide 
feedback on the questionnaire, get informed about 
the results of the study or raise any other inquiry. 

To examine possible variations of the responses 
against characteristics of the sample, the subjects 
were asked to fill in their main job role at the time 
of participation (i.e. safety manager/officer, safety 
investigator, or other), the year they started get- 
ting involved actively in the generation of safety 
recommendations, the country in which they were
practicing their vocation at the time of participation, and
their highest level of education (i.e. High School, 
Associate Degree, Bachelor degree, Master degree, 
Doctoral level, and other). 

For each of the design criteria for safety
recommendations, the respondents were asked to state
whether they know the criterion (possible choices:
YES or NO), the extent to which they apply the
criterion when creating safety recommendations
(possible choices: 0-20%, 21-40%, 41-60%, 61-80%,
and 81-100%), and explanations about the latter.

Regarding the expected effectiveness of
recommendations, the corresponding section provided
a brief description and a few examples for each of
the control types (i.e. physical, functional, symbolic
and incorporeal) and asked the participants to rank
the controls in the order of their effectiveness, state
which control type they most frequently introduce
in their safety recommendations, and justify their
last answer. It is noted that the control types were
presented to the participants in a random order
rather than in the order of effectiveness outlined in
the literature (Hollnagel, 1999).

The last section of the instrument referred to
the scope of recommendations and included a short
description for each of the dimensions (i.e. aspects
of operations, stakeholders affected and degree of
renewal) and their values (see Table 2). The subjects
were prompted to choose the frequency with which
their recommendations focus on each of the
categories of the three dimensions (possible choices:
0-20%, 21-40%, 41-60%, 61-80%, and 81-100%)
and state respective reasons.
It is clarified that there were no obligatory
questions to be answered. The respondents could 
omit any demographic or safety recommendation 
related question. The draft version of the survey 
instrument was sent to four persons with relevant 
academic and professional background for their 
review. Following the revision of the questionnaire 
according to the remarks collected, its final version 
was designed online with the use of the Qualtrics 
platform. The functionality of the online question- 
naire was tested with the participation of the same 
four reviewers. The survey instrument was admin- 
istered through two main channels: (1) personal 
emails to contact persons of the network of the 
Aviation Academy of the Amsterdam University 
of Applied Sciences that covers various aviation 
organizations worldwide; (2) online messages to 
practitioners found on the LinkedIn platform and 
holding relevant positions (e.g., safety managers, 
safety investigators). Due to time constraints, three 
working days were devoted to the administration 
of the questionnaire and a three-week period was
set for the collection of responses. 


2.2 Sample and analysis of questionnaire 
responses 


In total, 42 questionnaires were completed. Because of
the snowball sampling strategy, the author could not
have any information about the number of persons
whom the instrument finally reached (e.g., unmonitored
or unread emails and messages). Therefore,
the response rate could not be estimated. Nevertheless,
since the scope of the whole study was an initial
assessment of the situation around safety
recommendations, the number of responses was deemed
sufficient. Table 4 presents the distribution of the
sample across its demographic characteristics. It is
clarified that, apart from the main job role, the
rest of the demographics were grouped due to the
small number of responses in some of the categories.


Table 4. Distribution of the sample.

Demographic variable         Values               Sample size  Valid percentage*
Main job role                Safety staff         18           42.9
                             Safety investigator  13           31.0
                             Other                11           26.1
Years involved in            <=4                  10           25.0
generation of                5-11                 11           27.5
recommendations              12-18                10           25.0
                             >=19                 9            22.5
Geographical region          Europe               30           71.4
                             Other                12           28.6
Highest level of             <=Bachelor           18           43.9
education received           >=Master             23           56.1

*Some demographic questions were not answered.
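
As a brief illustration of how the valid percentages in Table 4 relate to the sample sizes, the sketch below divides each count by the number of respondents who answered that question rather than by all 42 questionnaires. The function and variable names are illustrative assumptions and are not part of the study.

```python
def valid_percentages(counts):
    """Share of each category among respondents who answered the question."""
    answered = sum(counts.values())
    return {label: round(100 * n / answered, 1) for label, n in counts.items()}

# Counts taken from Table 4; the dictionary names are illustrative only.
years_involved = {"<=4": 10, "5-11": 11, "12-18": 10, ">=19": 9}  # 40 answers
education = {"<=Bachelor": 18, ">=Master": 23}                    # 41 answers

years_pct = valid_percentages(years_involved)
education_pct = valid_percentages(education)
```

With these counts the denominators are 40 and 41 rather than 42, which is consistent with the table note that some demographic questions were not answered.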

Regarding the analysis of data, the frequencies 
for each of the closed questions were calculated to
offer an overall view of the responses. Fisher’s exact 
tests were performed to reveal any associations of 
the knowledge of design criteria with the variables 
of Table 4. The same variables were also used to 
conduct Kruskal-Wallis or Mann-Whitney tests (i.e. 
depending on the number of categories of each var- 
iable) for the closed questions which corresponded 
to ordinal data (i.e. frequency of application of 
design criteria, frequency of focus of recommenda- 
tions on the areas defined, and level of effectiveness 
of controls the recommendations introduce). 
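
To make the logic of Fisher’s exact test concrete, the following is a minimal sketch for a 2x2 table, assuming two groups and a YES/NO answer. The counts are hypothetical and do not come from the survey; the study itself ran its tests in SPSS and some of its variables have more than two categories, which this simplified function (a name chosen here for illustration) does not cover.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def table_probability(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    observed = table_probability(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        p for p in map(table_probability, range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )

# Hypothetical counts (NOT survey data): rows are two experience groups,
# columns are "knows the criterion" / "does not know the criterion".
p_value = fisher_exact_2x2(3, 1, 1, 3)
```

The small tolerance in the comparison only guards against floating-point rounding when two tables are equally likely.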

It is noted that, to allow the execution of statistical 
tests, the frequency choices were translated to ordinal 
figures as follows: 1: 0-20%, 2: 21-40%, 3: 41-60%, 4: 
61-80%, and 5: 81-100%. The tests were run with the 
SPSS software version 22 (IBM, 2013). The function 
of Monte Carlo Exact Test under the settings Confi- 
dence Level: 99% and Number of Samples: 10.0000 
was chosen to strengthen the validity of the results. 
The level of statistical significance was set to 0.05. 
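
The coding and testing steps described above can be sketched as follows. The band-to-code mapping is the one stated in the text; everything else is an assumption made for illustration: the answers are hypothetical, the function names are invented here, and the Monte Carlo p-value is approximated by randomly relabelling the pooled answers, which is not necessarily the exact procedure SPSS applies. The sketch also omits the 99% confidence interval that SPSS reports around the Monte Carlo estimate.

```python
import random
from statistics import median

# Ordinal coding stated in the text: 1: 0-20%, ..., 5: 81-100%.
BAND_TO_CODE = {"0-20%": 1, "21-40%": 2, "41-60%": 3, "61-80%": 4, "81-100%": 5}

def encode(answers):
    """Translate frequency-band answers into ordinal codes, skipping blanks."""
    return [BAND_TO_CODE[a] for a in answers if a is not None]

def mann_whitney_u(x, y):
    """U statistic for x: pairs where x exceeds y, ties counted as one half."""
    return sum((a > b) + 0.5 * (a == b) for a in x for b in y)

def monte_carlo_p(x, y, n_samples=10_000, seed=0):
    """Two-sided p-value estimated by random relabelling of the pooled data."""
    rng = random.Random(seed)
    pooled = x + y
    centre = len(x) * len(y) / 2  # expected U when the groups do not differ
    observed = abs(mann_whitney_u(x, y) - centre)
    hits = 0
    for _ in range(n_samples):
        rng.shuffle(pooled)
        u = mann_whitney_u(pooled[:len(x)], pooled[len(x):])
        hits += abs(u - centre) >= observed
    return hits / n_samples

# Hypothetical answers (NOT survey data); None marks a skipped question.
group_a = encode(["81-100%", "61-80%", "81-100%", None, "41-60%"])
group_b = encode(["0-20%", "21-40%", "41-60%", "21-40%"])
u_stat = mann_whitney_u(group_a, group_b)
p_value = monte_carlo_p(group_a, group_b)
medians = (median(group_a), median(group_b))
```

The medians computed on the coded answers correspond to the "median rank" figures reported in the results tables.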

For the open-ended questions, a thematic
analysis was performed. The researcher individually
coded the answers, and the coding was afterwards
tested for reliability with two colleagues who were
not involved in the study. The comments of the
raters indicated areas of disagreement as well as
cases where the content of the answers had not been
captured by the initial codes. Based on these remarks,
the coding was revised and retested, resulting in
agreement levels ranging from 77% to 92% between
the researcher and each of the raters, as calculated
with Cronbach’s alpha tests. The finalization of the
list of coding themes was followed by the calculation
of frequencies per code for each of the questionnaire
topics.

It is clarified that in several qualitative responses 
the subjects did not provide explanations about 
their choices in the closed questions, but they 
restated the latter or made general comments not 
applicable to the particular question. These cases 
were excluded from the analysis. Due to the small 
number of valid responses, no statistical tests were 
conducted between the responses and the demo- 
graphic characteristics of the participants. 


3 RESULTS 


3.1 Results from analysis of closed questions 


Table 5 shows the frequencies with which the design
criteria of safety recommendations are known and the
extent to which they are applied by the survey
participants. The Non-blaming, Assigned and Realistic
criteria were the most known, whereas the Review and
Action-oriented criteria were the least known by the
respondents. Also, Spearman’s bivariate correlations
were performed between the knowledge percentages and
the medians. The test showed a significant and strong
association (N = 9, rs = 0.870, p = 0.002), meaning
that the higher the knowledge of a specific criterion,
the higher the degree of its implementation.


Table 5. Design criteria for safety recommendations.

Criterion        Median rank (application)  Frequency (%) the criterion is known
Specific         4.5                        90.5
Measurable       4.0                        87.8
Assigned         5.0                        97.6
Realistic        5.0                        95.2
Time-bound       4.0                        90.5
Review           4.0                        73.8
Objectives       4.0                        85.4
Action-oriented  4.0                        75.6
Non-blaming      5.0                        100.0
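
The reported correlation coefficient can be checked from the Table 5 figures. The sketch below is a plain-Python illustration, not the SPSS procedure used in the study: it ranks both columns with averaged ranks for ties and computes the Pearson correlation of the ranks. The function names are assumptions, and the p-value is not computed here.

```python
from math import sqrt

# Table 5 values, in row order from Specific to Non-blaming.
median_rank = [4.5, 4.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 5.0]
known_pct = [90.5, 87.8, 97.6, 95.2, 90.5, 73.8, 85.4, 75.6, 100.0]

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks

def spearman_rs(x, y):
    """Pearson correlation computed on the tie-averaged ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    return cov / sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))

rs = spearman_rs(median_rank, known_pct)  # rounds to 0.870, as reported
```

Because both columns contain ties (five criteria share a median of 4.0, and two share 90.5%), the tie-averaged form is used instead of the shortcut formula based on squared rank differences.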

Fisher’s exact tests between the design criteria
and the demographics of the sample yielded
significant results only for the association of years
of experience in the generation of safety
recommendations with the knowledge of the criteria
Realistic (N = 40, p = 0.046) and Time-bound
(N = 40, p = 0.008). The participants with 19 or
more years of experience in safety recommendations
declared less frequently that they knew about both
criteria compared to participants having fewer years
of experience. Regarding the frequency of application,
the Objectives criterion was applied less frequently
by safety managers/officers than by the rest of the
job roles (p = 0.039) and it was utilized more by the
participants with more years of involvement in the
generation of safety recommendations (p = 0.025).

Regarding the scope of recommendations, the
respective medians are reported in Table 6. The
statistics revealed that culture-focused
recommendations were made more frequently by
participants with roles other than safety staff and
investigators (p = 0.046). Recommendations addressed
to industry sectors (i.e. meso level) were generated
by safety investigators more than by other job holders
(p = 0.002). Meso- and macro-level types of
recommendations were made more frequently by subjects
working in Europe (p = 0.044 and p = 0.021
respectively) or having a high educational
background (p = 0.008 and p = 0.004 correspondingly).

Regarding the expected effectiveness of safety
recommendations, Table 7 presents the survey
findings on the perceived degree of effectiveness
for each type of control introduced through
recommendations. The statistics showed no
associations of the perceived effectiveness with
the demographic characteristics of the sample.


Table 6. Distribution of scope areas of generated recommendations.

Dimension              Category     Median rank (application)
Aspect of operations   Process      4.0
                       Structure    3.0
                       Culture      2.0
                       Context      2.0
Stakeholders affected  Macro level  2.0
                       Meso level   3.0
                       Micro level  4.5
Degree of renewal      Repair       4.0
                       Adaptation   4.0
                       Innovation   2.0


Table 7. Perceived effectiveness of safety recommendations.

Type of control introduced  Median rank
Physical                    2.0
Functional                  2.0
Symbolic                    3.0
Incorporeal                 3.0


3.2 Results from analysis of open-ended questions 


The codes derived from the thematic analysis and
their frequencies showed that in many cases the
subjects applied the design criteria because they
were mentioned in internal or external documentation
or had been seen as best practice. It was widely
recognized that the Specific criterion minimizes
ambiguity in the implementation of safety
recommendations, although according to a few
respondents some ambiguity is to be expected.
However, 3 out of the 20 participants declared that
some flexibility is required and recommendations
should not always be too specific.

The Measurable criterion was seen by most of
the subjects as often unfeasible, and a few
respondents stated that the effect of changes is more
important than their measurement, that monitoring
does not apply to simple recommendations, and that a
customization of relevant metrics to each
organizational level/function is necessary. About the
Assigned criterion, 3 out of the 18 participants
argued that recommendations might require the
engagement of more than one responsible person,
department, agency etc. The comments made
about the Realistic criterion showed that this helps
in increasing the credibility and feasibility of the
recommendation and in demonstrating that a balance
has been achieved between the resources required
for its realization and the anticipated benefits.
However, 5 out of the 23 answers pointed out that the
Realistic criterion might be difficult to meet due
to the diversity of perspectives and interests of the
stakeholders involved.

Regarding the Time-bound criterion, the
participants expressed a variety of views. Four
persons recognized that the specific criterion would
ensure the implementation of a recommendation,
whereas three persons stated their reservations
about the feasibility of the criterion. Also, two
persons did not know the particular criterion and
one person did not consider it important. A
similar picture was observed in the answers regarding
the knowledge and feasibility of the Review
and Objectives criteria. The latter criterion
was appreciated by three respondents because it
allows flexibility in the operationalization of the
recommendation.

A combination of actions and studies depend- 
ing on the type of the problem to be solved was 
proposed by most of the participants. Four out 
of the 17 subjects argued that the Action criterion 
increases the feasibility of a recommendation. The 
application of the Non-blaming criterion was seen 
as positive by most of the participants regarding 
the effects on the overall culture and increase of 
the effectiveness of safety recommendations. Only 
1 out of the 23 answers suggested that a focus on 
individuals/teams might be appropriate in cases of 
repeated unsafe behaviors. 

The answers regarding the type of control most
frequently introduced by the survey respondents
showed a preference for symbolic controls, followed
by incorporeal, functional and physical ones. The
explanations given emphasized the decreased
feasibility of controls of a technical nature (i.e.
physical and functional) due to the demands for more
resources for their realization. The views about the
effectiveness of technical or non-technical controls
(i.e. symbolic and incorporeal) were almost evenly
distributed. In three cases, the respondents argued
that symbolic and incorporeal controls are more
vulnerable and require more frequent repairs. One
respondent claimed that the focus on symbolic
controls stems from the pressure of authorities
who ask for more and better procedures.

Regarding the focus of recommendations on
specific aspects of operations, 4 out of the 17
answers indicated that the decision depends on
the context of the problem identified and four
of the respondents answered that it is a matter of
organizational focus. The views about the process
and culture aspects of operations were divided
into two equal parts. Four subjects claimed that
those aspects are difficult to change, whereas four
other subjects stated that the specific aspects are
easier and faster to change. Concerning the degree
of renewal, repairs and adaptations were seen as
equally sufficient to deal with problems, partly
because of the lower associated costs. Innovations
were seen by 4 out of the 14 participants
as either expensive or not appropriate to be
introduced through recommendations, and two
participants suggested that the required degree of
renewal depends on the problem under examination.

Lastly, regarding the level of affected stake- 
holders, 8 out of the 17 subjects stated that their 
focus on the micro level is dictated by the organi- 
zational priority to deal with local problems and 
the difficulty to affect the meso and macro levels. 
One respondent recognized that micro level inter- 
ventions are cheap and another respondent stated 
that interventions at the lowest level could lead to 
changes to the rest of the levels. Once more, many
subjects declared that the type of each problem is
a parameter in deciding which stakeholders will be
addressed in a recommendation.


4 DISCUSSION 


The overall results regarding the design of safety
recommendations suggest that most of the participants
are knowledgeable about the criteria identified in
the academic and professional literature and apply
these criteria to at least 60% of the recommendations,
although to different extents. Rather expectedly,
the knowledge of a specific criterion was positively
associated with the degree of its application when
generating safety recommendations. However, such
knowledge was obtained more through experience,
best practice or organizational documentation
than through international and regional standards.
The lack of guidelines in industry standards might
partially explain the fact that some of the criteria
were not known or not consistently applied by a
fraction of the survey participants. The statistical
tests showed a few differences across some criteria,
mainly linked to the years of experience in generating
recommendations. Two of the criteria were less
known, and another criterion was more frequently
applied, by participants with more experience.

The Non-blaming criterion was known and 
applied by all participants, thus indicating that the 
necessity for a just culture in aviation (Humpreys, 
2014, Michaelides-Mateou and Mateou, 2016, 
Quinn, 2007) and the role that recommendations 
can play in realizing a culture of fairness have been 
well communicated across the particular industry 
sector. The Assigned and Realistic criteria also
scored very high regarding the respondents’
familiarity with them and the degree of their
application, which the participants partially
attributed to the contribution of these criteria to
the increased credibility of recommendations. On the
other hand, the Review and Actions criteria were not
known by about one-quarter of the survey
participants. These criteria were seen as important but
sometimes not feasible or binary regarding their
application. In general, the respondents expressed
concerns about the strict satisfaction of the whole
set of criteria for every single recommendation and
argued that each case must be handled differently.

Concerning the type of controls introduced 
through safety recommendations, interestingly, 
their effectiveness as perceived by the participants 
is not aligned with the literature suggestions, the 
results showing a reverse order. Whereas the work 
of Hollnagel (1999) implies that physical and func- 
tional controls are more robust and effective, the 
respondents viewed the symbolic and incorporeal 
controls as such. The comments collected showed 
that the viewpoints about the effectiveness of tech- 
nology and non-technology based controls were 
evenly divided. According to the participants, the 
main factors driving the recommendation of a spe- 
cific control type are the resources available and 
the expectations of the authorities for improved 
procedures. Hence, the author considers that
the responses about the effectiveness of controls 
were affected by current practice and not a consid- 
eration of their technical characteristics or poten- 
tial to deal with a hazard or risk more successfully. 

Nevertheless, the positions of the participants
regarding the types of controls preferred
are in line with literature suggesting that the
amendment of rules or the introduction of new ones
is favored due to the low costs involved and the
timeliness and ease of implementation (Bourrier
and Bieder, 2013). Moreover, the viewpoints
of the respondents seem to confirm their efforts 
to generate recommendations that are realistic by 
taking into account the existing boundaries (e.g., 
resources, operational needs). 

The findings regarding the scope of
recommendations showed that the higher and wider the
level of operations, stakeholders and renewal, the less
frequently corresponding changes are suggested.
Recommendations that focus on the physical processes
and the work floor, specific organizations
and repair of problems had been more frequently
introduced by the participants. On the other hand,
interventions at the culture and context aspects of
the operational area, suggestions to the regulatory
and standardization levels and recommendations
for innovative solutions were least frequently
generated. The comments of the respondents seemed to
agree about the effects of cost and time limitations
along with political factors that can discourage
the formulation of recommendations addressing
wider and deeper systemic flaws. A few conflicting
statements were observed regarding the ease of
making process or cultural changes, which indicates
somewhat misaligned perceptions.


The statistics regarding differences of the rec- 
ommendations’ scope across the demographics 
included in the study showed that meso-levels of 
stakeholders (i.e. whole industry sectors) were
addressed more frequently by safety investiga- 
tors. This finding can be explained by the fact that 
aviation safety investigation standards (e.g., ICAO, 
2003, 2011) prompt the examination of latent fac- 
tors which in turn allows the formulation of rec- 
ommendations targeted beyond organizational 
boundaries. Furthermore, recommendations for 
stakeholders other than specific organizations and 
individuals were generated more frequently by 
respondents working mainly in Europe and having 
a higher level of education. These findings might 
reflect the effects of different regional and national 
cultures (e.g., ICAO, 2013, Stolzer et al. 2008) as 
well as an influence of educational breadth and 
depth on the tendency to adopt systemic views and 
address deficiencies at higher system levels. 


5 CONCLUSIONS 


The current study was a first initiative to map the
situation in the aviation industry around the
engineering of safety recommendations. Regarding the
degree to which professionals are knowledgeable
about the design criteria that can increase the quality
and effectiveness of recommendations and the extent
to which they apply these criteria in practice, the
results suggested an adequate level of knowledge and a
satisfactory frequency of application of the nine design
criteria included in the research. However, the
findings were attributed more to the employment of best
practice and the reference to such criteria in
organizational documentation than to regional or
international guidelines or training topics. Hence,
the inclusion of respective material
in standards and safety management/investigation
courses is highly recommended to achieve consistency
in the generation of safety recommendations
and establish a commonly referred framework for
their design. Nonetheless, such a framework must
be viewed as a guide and not a compliance-check
reference. As the study participants stated, each case
has a different context and the satisfaction of all
design criteria might not be feasible or proper for all
recommendations.

The findings concerning the classification of
recommendations based on the types of controls
they introduce showed a gap between the perceptions
of the practitioners and the suggestions of the
literature. The former viewed “soft” control types as
more effective than “hard” ones, whereas literature
implies the opposite. It is noted that the
categorization of controls does not aim at stating a
preference for any type over another but at making the
industry aware of the weaknesses and strengths of each
control type. Nonetheless, the conduction of further
studies is suggested as a means to provide empirical
evidence about the actual level of effectiveness
of each type of control and indicate any optimum
combinations of those.

When considering the focus areas of recommen- 
dations, the findings suggested an emphasis on the 
repair of lowest activity levels rather than systemic 
interventions and innovative solutions. A common 
characteristic of all topics examined in this study 
was the influence of resource and political bounda- 
ries on the generation of safety recommendations, 
which usually leads practitioners to suggest cheap
and easy to implement fixes. From a pragmatic view- 
point, the situation mentioned above is somewhat 
unavoidable, but this should not exclude the visible 
and explicit justification of choices when engineering 
safety recommendations. The recognition and docu- 
mentation of the boundaries imposed on each case 
can support the aggregation of data and their moni- 
toring to inform stakeholders accordingly and pos- 
sibly lift existing limitations when situations allow. 

The researcher would like to point out that the
small sample of the current study does not allow
generalization of the results. However, the
findings of this research can trigger the execution
of similar studies at larger scales at regional or
international levels regardless of industry sector.
ABSTRACT: The transportation of dangerous goods has been addressed in international law since
1957, when the international agreement on the transport of dangerous goods by road (ADR) was
created in Geneva. The SEVESO Directive was put into practice in 1982 because of the occurrence
of severe accidents involving dangerous substances in the 1970s. The SEVESO Directive was
gradually made more specific according to practice, and the REACH regulation was added in 2007.
Both pieces of legislation lay down requirements for handling hazardous substances from
manufacture, transportation and storage through to their use. A series of agreements on the
transport of dangerous goods by other means of transport followed. All agreements dealing with
the carriage of hazardous substances have one thing in common: dangerous substances are treated
as goods and all responsibility is borne by the carrier. Critical evaluation of the existing
rules and of traffic accidents with hazardous substances shows that in practice tools are
missing which would: effectively reduce the distortions of security measures that arise on the
part of carriers; ensure rapid response and the protection of other users of roads and
railways; protect people and the environment around the place of a traffic accident; and
support the recovery of the territory afflicted by the impact of traffic accidents with
hazardous substances. Analysis of a database of accidents involving dangerous substances shows
a number of examples where the recovery of territory after an accident with hazardous goods
took more than 10 years. Therefore, system tools for coping with those risks must be created.
A basic response strategy for major accidents during the transport of dangerous substances has
to be prepared at national level. Local governments need to determine critical points, places
with protected assets and dangerous goods transportation. Risk management and response plans
need to be prepared for critical points. The article shows tools that are being tested in
practice.


1 INTRODUCTION


The handling of dangerous substances has been studied since the 1950s. It received wider development and support, however, only at the beginning of the 1980s, after a series of accidents in industry. In practice, risk management associated with the production, storage and processing of large quantities of dangerous substances is addressed separately from risk management during transport by different modes of transportation.

In the case of industrial facilities, the prevention of accidents involving dangerous substances and of their impacts is addressed by technical, human and procedural measures in accordance with the relevant standards laid down in international treaties. For the case that these measures fail, emergency plans are prepared by industrial plant operators for the factory site and its near surroundings (on-site emergency plan) and by the public administration for the broader surroundings (off-site emergency plan).

The prevention of accidents during the transport of dangerous substances, or dangerous goods, as carrier companies call this commodity, also includes technical, human and procedural measures based on international standards. The methodology of emergency planning is, however, mostly missing. As in industrial plants, measures fail because of the human factor, a failure of technology or a natural disaster, and such a failure can also occur in areas with high population density. The article therefore deals with these phenomena.

The second chapter describes the nature of the occurrence of accidents involving dangerous substances, on the basis of an analysis of a database of such accidents and of statistical approaches. The third chapter is devoted to the assessment of the criticality of infrastructure from the perspective of its technical parameters and common accidents on one side, and of the amount and type of protected interests in the surrounding area on the other side.


2 NATURE OF MOBILE RISK OCCURRENCE


A logarithmic dependence of the number of realizations N of a certain phenomenon on the phenomenon intensity I, $N(I) \sim a^{I}$, is observed in the occurrence statistics of all known disasters in general (Figure 1). In the case of natural disasters, the prediction of the threat is based on the distribution of extreme values (Gumbel 1941).
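A log-linear relation of this kind can be checked with an ordinary least-squares fit of log10 N against I. The following is a minimal sketch under stated assumptions: the event counts are hypothetical and are not the data behind Figure 1, and the function name `fit_log_linear` is introduced here only for illustration.

```python
import math

def fit_log_linear(intensities, counts):
    """Least-squares fit of log10(N) = intercept + slope * I."""
    ys = [math.log10(n) for n in counts]
    n = len(intensities)
    mean_x = sum(intensities) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in intensities)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(intensities, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical numbers of events per intensity class (illustrative only).
intensities = [6, 7, 8, 9, 10]
counts = [10000, 1000, 100, 10, 1]

intercept, slope = fit_log_linear(intensities, counts)
print(f"log10 N = {intercept:.2f} + ({slope:.2f}) * I")
```

A slope close to constant over the whole intensity range supports the logarithmic character; systematic deviations at the low and high ends point to the data problems discussed in the text.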

The logarithmic character of F-N curves (frequency versus number of fatalities) is observed even in the case of technological accidents, when the intensity of the phenomenon is expressed as the number of human victims. The observed dependencies lose their logarithmic character only in the ranges of low and high intensities, where reported and actually occurring incidents differ (Hirschberg 1998) because of the non-homogeneity of the data set in the low-intensity range and the too short observation period in the high-intensity range. Results of mathematical models, such as the distribution of extreme phenomena, can therefore only be taken as a guide.
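To make the construction of an F-N curve concrete, the sketch below counts, for each fatality level N, how often accidents with at least N fatalities occurred per year. It is only an illustration: the accident record and the observation period are hypothetical, and `fn_curve` is a name chosen here, not a function of any database mentioned in the text.

```python
def fn_curve(fatalities_per_accident, observation_years):
    """Return [(N, F)], F = yearly frequency of accidents with >= N fatalities."""
    levels = sorted({n for n in fatalities_per_accident if n > 0})
    curve = []
    for level in levels:
        exceeding = sum(1 for n in fatalities_per_accident if n >= level)
        curve.append((level, exceeding / observation_years))
    return curve

# Hypothetical fatalities of individual accidents recorded over 20 years.
records = [1, 1, 2, 1, 5, 1, 3, 50, 2, 1]

for n, f in fn_curve(records, observation_years=20):
    print(f"N >= {n:3d}: F = {f:.2f} per year")
```

With so few records, the high-N end of the curve rests on a single event, which is exactly the short-observation-period problem noted above.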

The causes and circumstances of accidents in the shipping domain are similar to those in the technological domain (insufficient technical standards, faults in vehicle construction, bad maintenance, human factor, etc.). Therefore, we can also see the logarithmic character of the F-N curve (Boot 2013). However, we observe two differences. First, our control over the entire transport network is weaker than our control over the surroundings of an industrial area. For that reason, the nature of this risk changes dynamically, e.g. the market for dangerous goods may steeply increase or decrease, which influences the transport of these goods; the shipping of new dangerous goods may also start. The second difference is the risk mobility: an accident that occurs at one point on the transport route may happen at any other point of the route. As an example, we show three similar accidents on rail routes in North America:


1. June 19, 2009, Cherry Valley (Illinois, USA): a Canadian train with more than 2 million gallons of ethanol derailed at a road crossing, 1 dead (NTSB 2012).

2. July 6, 2013, Lac-Mégantic (Quebec, Canada): a train with 72 oil tanker cars derailed in a small town with about 6,000 inhabitants, about 50 dead, a large part of the town destroyed (NTSB 2014).

3. December 30, 2013, Casselton (North Dakota, USA): a train carrying a large amount of crude oil derailed; over 2,000 people had to be evacuated, but fortunately there was no loss of life (NTSB 2017).


Figure 1. Logarithmic dependence of earthquake occurrences on intensity in Central Europe, circle with a radius of 400 km (Prochazkova 2017). Horizontal axis: Intensity (MSK-64).


The three above-mentioned accidents have common features: in addition to the region, large-scale fires of large amounts of flammable substances. However, the impacts of the individual accidents were diametrically different because of the varying population density and the different concentration of protected interests in the immediate vicinity of the accident site.

The nature of mobile risks does not allow us to determine the real threats based only on site-specific events. We need to extrapolate a wide range of information on accidents involving dangerous substances across the entire transport route on which the accident occurred, or along which the substance in question is transported under similar conditions. Findings and lessons learned from the accident site alone are not sufficient.

In the case of mobile risks, it must be taken into account that it is only a matter of time before a major accident occurs, even in a densely populated area or in an area containing a high concentration of other protected interests, such as sources of drinking water, power stations and the like. Under certain circumstances, the transport of dangerous substances can be excluded from places with very high criticality, i.e. places with a large concentration of public assets; for economic reasons, however, the sign “Forbidden for vehicles carrying dangerous goods” can be used only exceptionally. In the Czech Republic, for example, the transport of dangerous goods is excluded on the highway between Praha and Brno between km 49 and km 90 because of a drinking water source for Praha.

However, a similar solution cannot always be used, or it is connected with many problems; its application is therefore justified only in places with especially high criticality. For less critical but still critical sites, it remains necessary to carry forward the lessons learned from all accidents involving transported dangerous substances and to establish plans for a possible response, at least on the basis of the requirements of crisis management, population protection and critical infrastructure protection.
It is necessary to introduce a role for local government in the transport of dangerous substances at the emergency management level, or at least at the crisis management level, in addition to the already established obligations of dangerous goods carriers. The first step for local government is to set up a team of experts which proposes a professional solution to the problem under consideration.

The head of the team needs to be a safety management expert. He/she can come from the municipal or regional Security Council, which is responsible for crisis management (Act No. 240/2010 Coll., on crisis management in the Czech Republic), or be an external professional from a Technical Support organization. Members from the police force and the fire brigade are also important. The nature of the problem also requires an expert from the chemical industry; the Czech chemical industry provides consulting services (TRINS 2015) for accidents with dangerous substances. The team can also be extended by a transport expert and a health care expert.

The team then needs to know the nature of the accidents that can occur in the territory under consideration, to determine the distribution of risks that can be realized there and to characterize the possible emergency situations. Special attention should be paid to dangerous substances that are transported in large quantities on the roads in the territory of the given municipality. This especially involves the vicinity of chemical plants and long-distance transit corridors important for the operation of specific industries.

Larger quantities of hazardous substances are transported by rail (Becherova 2017), so in this case a larger affected area than in road transport needs to be considered. Response plans also need to be prepared for pipeline routes, especially at locations where pipelines are placed above the surface (Hansler 2012). Pipeline surroundings can be monitored better, and pipelines are less associated with mobile risk issues.

The municipality, in cooperation with the regional industry, can identify the specific substances that are transported on regional roads. Fuel transports obviously have a special position, because fuels are transported always and everywhere. Accidents involving fuels dominate all transport accident statistics for dangerous substances (Procházková 2015a).

Local government also needs to take into account transit traffic, for which the type of hazardous substances is hard to predict. From the professional viewpoint, the types of response plan should be prepared with regard to the properties and hazards of substances associated with the individual hazard classes: fire hazards, explosion hazards, hazards of leakage into the environment, and combinations of these three factors (Figure 2).


Figure 2. Process model of a crash accident response for the transport of dangerous goods.


It is necessary to determine the nature and extent of the threat for both the fuels and the hazardous substances that are transported through the location under consideration. The character of the accident scenario is given by the chemical, physical and other properties of the transported substances, and also by the site and meteorological conditions at the time of the accident. The extent of the threat is determined not only by the properties of the dangerous substance, but especially by its amount. The amount is usually limited by the relevant legal and technical standards for every type of shipping.

The properties of a substance which predetermine the extent and nature of the threat are given in the safety data sheet. Everybody who works with hazardous substances during shipping is responsible for respecting the instructions in the safety data sheet. The shipping of below-limit amounts of hazardous substances is addressed in a separate chapter of the REACH regulation. The safety data sheet always needs to be available at the site where the substance is handled. In the case of transport, however, this is a common problem, e.g. the document is not available in the languages of all countries through which the shipment passes.

Experience from daily practice shows that people often underestimate the hazards given in the safety data sheet. They do not consider that the properties of each hazardous substance vary depending on the local conditions at the site of transportation. For example, the ignition temperature of unleaded petrol in the liquid state can be more than 250°C, whereas the ignition temperature of the vapour/air mixture is substantially lower (Ceska rafinérska 2012). The findings from the safety data sheet therefore need to be supplemented by lessons learned from studies of real-world accidents or by the results of practical experiments.
At present, many databases contain accidents with hazardous substances. One example is the energy accident database ENSAD, which also includes fuel transportation. This database itself draws on many other databases (Burgherr 2017). Compiling databases is very expensive, and therefore the availability of some databases is very limited. Many organizations then need to carry out their own investigations across relevant case studies (Prochazkova 2014).

Complete statistics of traffic accidents make it possible to determine the most common causes of accidents. Detailed study of major accidents shows the great number of combinations of phenomena that can occur in accidents, i.e. the prediction of an accident scenario is not easy. Lessons learned from both the statistics and the case studies allow us to evaluate only general data on the probability of an accident and the expected impacts at the sites under consideration; the real scenario will always be determined by the momentary local conditions at the given site.


3 MOBILE RISK AND EXPERT METHODS 


To compile the groundwork for determining and mastering the risks, it is necessary to put together a team of experts who know well the behaviour of hazardous substances under the different conditions that can be expected during transport, and to collect data on the dangerous substances and the common amounts that are transported. Then it is necessary to use suitable tools for risk analysis, namely the Checklist and the What-If analysis. In practice, many software tools are used. Therefore, we comment briefly on the widely used software for computing the dispersion of dangerous substances.

Determination of the affected area is a key step for the subsequent identification of impacts and the compilation of response plans. The form and size of the affected area are calculated using dispersion models. These models need to take into account the properties of the substances (lighter or heavier than air), atmospheric conditions (wind speed, temperature, humidity, inversion) and the local relief and topography (buildings, forest, terrain slope). The factors that affect the dispersion are many, so in many cases mathematical models have to use certain approximations, which reduce the accuracy of the outputs. Therefore, expert experience should be applied before the outputs are used in practice.
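As an illustration of the kind of approximation involved, the sketch below evaluates a textbook steady-state Gaussian plume with ground reflection. It is not the model implemented in any particular software package, and it is unsuitable for heavier-than-air gases, which generally require a dense-gas model. The dispersion coefficients are commonly quoted open-country fits for neutral stability and are taken here as an assumption; the release rate and wind speed are hypothetical.

```python
import math

def sigma_neutral(x_m):
    """Assumed open-country dispersion coefficients (m), neutral stability."""
    sigma_y = 0.08 * x_m / math.sqrt(1.0 + 0.0001 * x_m)
    sigma_z = 0.06 * x_m / math.sqrt(1.0 + 0.0015 * x_m)
    return sigma_y, sigma_z

def plume_concentration(q_g_s, wind_m_s, x_m, y_m=0.0, z_m=0.0, release_height_m=0.0):
    """Gaussian plume concentration (g/m^3) at (x, y, z), with ground reflection."""
    if x_m <= 0 or wind_m_s <= 0:
        raise ValueError("downwind distance and wind speed must be positive")
    sigma_y, sigma_z = sigma_neutral(x_m)
    lateral = math.exp(-y_m ** 2 / (2 * sigma_y ** 2))
    vertical = (
        math.exp(-(z_m - release_height_m) ** 2 / (2 * sigma_z ** 2))
        + math.exp(-(z_m + release_height_m) ** 2 / (2 * sigma_z ** 2))
    )
    return q_g_s / (2 * math.pi * wind_m_s * sigma_y * sigma_z) * lateral * vertical

# Hypothetical ground-level release of 1 kg/s in a 3 m/s wind.
for distance in (100, 500, 1000):
    print(distance, plume_concentration(1000.0, 3.0, distance))
```

The model ignores buildings, terrain slope and inversion layers, which is why the text recommends expert review of the computed affected area.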

At present, models are moving from analytical calculations to computer-based calculations. Computer technology can combine analytical and numerical calculations, but it is essentially based on the same formulas (Leksin 2015). The advantages of computer models are accuracy, since the model can take into account a larger number of input parameters, and accessibility, since the model can be used by the broad public without knowledge of its mathematical background. The disadvantage is the non-transparency of the computation, because the user often does not know which simplifications are made in the calculation (they are hidden in the concept of the software model).

Although computer computations allow us to use more complex models, it is necessary, as stated above, to use expert judgement when transferring the results to practice. In addition, each result is greatly affected by the input interface, where a number of geographic data cannot be included in the calculation. In some cases, geographic conditions are decisive (Pontiggia 2010). Despite all the uncertainties, dispersion models can still provide us with a lot of information. Figure 3 shows the dispersion behaviour for a chlorine tank under normal atmospheric conditions.

The example in Figure 3 is suitable for a preliminary identification of the width of the belt around the road under consideration in which we follow the impacts on protected interests. Common atmospheric conditions are sufficient for the response in normal and possibly emergency situations. For very extreme meteorological conditions, it is necessary to use the principles of crisis management, and for its success it is necessary to prepare both the scenarios for unfavourable up to extreme conditions and the response plans. For a real accident site, it is necessary to consider the properties of the affected area and the local conditions. Detailed investigation of critical sites, e.g. sites on roads with frequent traffic accidents, greatly improves the model results.

Figure 3. Dispersion of chlorine from the tank according to the ALOHA dispersion model under normal atmospheric conditions.

Densely populated areas, or areas with a high density of other public protected interests, are factors that should be taken into account when we select the critical sites for which, for safety reasons, preventive actions and the preparation of response plans are necessary. The probability of a traffic accident at the site is a further important factor, so investigation of the local road parameters is very important. Such an investigation can be carried out using a special Checklist.

The Checklist needs to take into account the physical parameters of the road (horizontal profile, vertical profile, surface properties), environmental characteristics (weather, climate), utilization of the road (traffic density, purpose of traffic) and other complications (tunnel, bridge, canyon) (CSA 2002). It is appropriate to add accident statistics to the overall criticality assessment, if they are monitored for the given section. Criticality can be assessed at selected sites of the road only if the assessment is performed cumulatively for road sections. An example of such a Checklist is given in Table 1, which assesses the section of highway in Central Bohemia that carries the Praha-Vienna (south-east) and Praha-Linz (south) traffic (E50, E55, E65).

To identify critical highway spots, it is necessary to determine the size of the risk at a given place for possible traffic accidents involving hazardous substances. The assessed size of the risk at a real location for a fixed event is influenced by the density of protected interests and by the likelihood of an accident. There are several methods for impact assessment, but most of them have a relatively narrow focus on the situation (input data) or on the protected assets. For the public administration, as the legal guardian of the territory, the most appropriate method is the “What-If analysis” performed by brainstorming with a team of experts, followed by a judgement of the losses on protected interests that follow from the determined impacts.

The team of experts first needs to collect information on possible events, i.e. the characteristics of the hazards connected with the leaked substances. Then it needs to determine the scenarios of affected areas under different meteorological conditions for the individual hazardous substances. Next, it is necessary to determine the acceptability of the impacts in the individual scenarios under different meteorological conditions.

The experts can carry out the assessment by brainstorming. The identified impacts are best structured by the area of protected assets:


— human lives, health and security, 

— property and welfare, 

— environment, 

— critical infrastructure and technologies. 
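The steps above amount to filling in one record per combination of substance, meteorological condition and protected asset. The sketch below only prepares such an empty worksheet for the brainstorming session; the substance names, weather labels and field names are hypothetical examples, not part of the described method.

```python
from itertools import product

# The four areas of protected assets listed in the text.
PROTECTED_ASSETS = [
    "human lives, health and security",
    "property and welfare",
    "environment",
    "critical infrastructure and technologies",
]

def build_worksheet(substances, weather_conditions):
    """One empty What-If record per substance, weather condition and asset."""
    return [
        {"substance": s, "weather": w, "asset": a, "impact": None, "acceptable": None}
        for s, w, a in product(substances, weather_conditions, PROTECTED_ASSETS)
    ]

# Hypothetical inputs for one critical site.
worksheet = build_worksheet(["petrol", "chlorine"], ["normal", "inversion", "strong wind"])
print(len(worksheet), "records to be judged by the expert team")
```

The "impact" and "acceptable" fields are deliberately left empty: they are the output of the expert judgement, not of a calculation.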


It is also suitable to consider the development of impacts over time, i.e. how they occur, apart from structuring them by type of protected interest. Such data enable a more accurate response plan to be built. It is also necessary to identify the primary accident impacts and the secondary impacts, which include e.g. the loss of supplies to the affected area, damage to drinking water sources etc. An example of such an analysis can be found in (Prochazkova 2015b). According to the tests performed, such an analysis takes about 3 hours; for a team of 5 experts, this amounts to 15 man-hours. As a rule, such an analysis needs to be performed for several different places. The result of this work is good preparation for mitigating the impacts of traffic accidents and a precise response plan that reduces losses on protected interests.


Table 1. Checklist for critical assessment of the first 21 km of the D1 motorway in Central Bohemia.

D1-km  Horizontal  Vertical  Complication      Technical      Traffic       Average    Climate    Total
       profile     profile   (tunnel, bridge,  condition of   intensity of  accident/  condition
                             canyon)           communication  50000/hour    1 month
 1     1.0         1.0       1.0               0.0            1.9           1.0        0.0        5.9
 2     0.0         0.0       1.5               0.0            1.9           2.0        0.0        5.4
 3     0.0         1.0       0.0               0.5            1.7           1.0        0.0        4.2
 4     1.0         1.0       0.0               0.5            1.7           1.0        0.0        5.2
 5     1.0         1.0       1.0               0.5            1.7           1.0        1.0        7.2
 6     0.0         2.0       1.0               0.0            1.7           1.0        0.0        5.7
 7     1.0         1.0       2.0               0.5            1.5           1.0        0.0        7.0
 8     0.0         1.0       0.0               0.5            1.5           1.0        0.0        4.0
 9     0.0         1.0       1.5               0.5            1.4           1.0        0.0        5.4
10     1.0         1.0       2.0               0.5            1.4           1.0        0.0        6.9
11     0.0         2.0       0.5               0.0            1.6           2.0        0.0        6.1
12     0.0         1.0       1.5               0.0            1.6           2.0        0.0        6.1
13     0.0         1.0       0.0               0.5            1.6           1.0        0.0        4.1
14     0.0         1.0       1.0               0.5            1.6           1.0        0.0        5.1
15     1.0         2.0       0.0               0.0            1.6           1.0        0.0        5.6
16     2.0         2.0       1.0               0.0            1.3           1.0        1.0        8.3
17     0.0         0.0       0.0               0.0            1.3           0.0        0.0        1.3
18     1.0         2.0       0.0               0.0            1.3           1.0        0.0        5.3
19     1.0         1.0       0.5               0.0            1.3           0.0        0.0        3.8
20     1.0         0.0       0.0               0.0            1.3           0.0        0.0        2.3
21     0.0         0.0       1.0               0.0            1.3           1.0        0.0        3.3
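The Total column of Table 1 is the plain sum of the seven partial scores of each kilometre, which makes the cumulative assessment easy to automate. The sketch below reproduces this for five rows of Table 1 and ranks the sections; the cut-off of 6.0 used to flag a section as critical is a hypothetical value chosen for illustration, not one given in the article.

```python
# Partial scores per D1 kilometre, in the column order of Table 1:
# horizontal profile, vertical profile, complication, technical condition,
# traffic intensity, average accidents per month, climate condition.
sections = {
    1: (1.0, 1.0, 1.0, 0.0, 1.9, 1.0, 0.0),
    10: (1.0, 1.0, 2.0, 0.5, 1.4, 1.0, 0.0),
    16: (2.0, 2.0, 1.0, 0.0, 1.3, 1.0, 1.0),
    20: (1.0, 0.0, 0.0, 0.0, 1.3, 0.0, 0.0),
    21: (0.0, 0.0, 1.0, 0.0, 1.3, 1.0, 0.0),
}

def total_scores(sections):
    """Sum the partial scores of each section, rounded to one decimal."""
    return {km: round(sum(scores), 1) for km, scores in sections.items()}

def rank_critical(sections, threshold):
    """Kilometres whose total reaches the threshold, most critical first."""
    totals = total_scores(sections)
    flagged = [km for km, total in totals.items() if total >= threshold]
    return sorted(flagged, key=lambda km: totals[km], reverse=True)

print(total_scores(sections))
print(rank_critical(sections, threshold=6.0))  # hypothetical cut-off
```

Equal weighting of the criteria follows the table as printed; a team could introduce weights, but that would be a change to the Checklist and not part of this sketch.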


4 RISK ASSESSMENT 


In risk assessment, the losses need to be quantified in adequate values. For public administration and managers, the most comprehensible expression of losses is financial. In the case of property and infrastructure, it is not a problem to determine the financial value of losses and damages. In the case of the environment, however, there are a number of methodologies, most of which focus only on the production function of the environment and ignore the whole area of its importance for humans.

Financial evaluation of losses of human lives and damage to health in euros includes a certain degree of cynicism; in some countries, an appropriate legal rule for its determination is missing. Therefore, it is important that the public administration sets clear instructions and methodologies for this case. These procedures are mostly based on local culture, legislation and the society's tolerance for different risks.

The risk can be assessed as soon as we know the extent of the damage and the probability of its realization. The risk can be expressed as the value of damage per unit of time, for example by the equation

$$R = D \times P = D \times A \times T \qquad (1)$$

where R is the risk, D is the impact (damage), P is the probability, A stands for the accident occurrence and T for the intensity of transport of dangerous substances.
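A worked instance of Equation (1) may help. The sketch below reads the accident term as an accident rate per shipment and the transport intensity as the number of shipments per year; this reading, the function name and all numbers are assumptions made for illustration only.

```python
def annual_risk(damage_eur, accident_rate_per_shipment, shipments_per_year):
    """Risk as expected damage per year: R = D * P, with P = A * T."""
    probability_per_year = accident_rate_per_shipment * shipments_per_year
    return damage_eur * probability_per_year

# Hypothetical site: 2 million EUR damage per accident,
# one accident per million shipments, 5,000 shipments per year.
risk = annual_risk(2_000_000, 1e-6, 5_000)
print(f"Expected loss: about {risk:,.0f} EUR per year")
```

The product is an expected value; it says nothing about how severe a single accident can be, which is why the text complements it with the criticality matrix.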
The criticality matrix can also be used to express the risk. It shows much more information and allows a more sophisticated approach to risk, especially in the ALARP (As Low As Reasonably Practicable) approach for hazardous substances. Figure 4 shows the combined form of the criticality matrix (impacts versus probability) and the F-N curve (number of deaths versus frequency) (Det Norske Veritas 2014). It determines which situations are acceptable, conditionally acceptable (ALARP) or unacceptable. Decisions on preventive measures and on response plans are subsequently made according to the position of the assessment result in the criticality matrix. The matrix given in Figure 4 only counts human lives, but it can be transformed to other protected interests.

Figure 4. Criticality matrix and F-N curve for accidents with dangerous substances involved (Det Norske Veritas 2014).

Figure 5. Risk Management in public space: Risk Assessment is done by experts, Risk Decision is executed by elected representatives, and Risk Mitigation is applied by technical experts.
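The three-band reading of a combined criticality matrix and F-N curve can be sketched as a comparison of a point (N, F) with two straight lines in the log-log plane. The anchor frequencies and the slope below are hypothetical placeholders and are not taken from Figure 4; real criteria have to come from the responsible authority.

```python
def classify_risk(frequency_per_year, fatalities,
                  intolerable_at_one=1e-3, negligible_at_one=1e-5, slope=-1.0):
    """Classify (N, F) as 'acceptable', 'ALARP' or 'unacceptable'.

    The two criterion lines are F = anchor * N**slope; all default
    values are hypothetical and serve only as an illustration.
    """
    if fatalities < 1:
        raise ValueError("fatalities must be at least 1")
    upper = intolerable_at_one * fatalities ** slope
    lower = negligible_at_one * fatalities ** slope
    if frequency_per_year > upper:
        return "unacceptable"
    if frequency_per_year < lower:
        return "acceptable"
    return "ALARP"

# Hypothetical scenarios with 10 expected fatalities each.
for frequency in (1e-2, 1e-5, 1e-7):
    print(frequency, classify_risk(frequency, fatalities=10))
```

The same function can be reused for other protected interests once the severity axis is redefined, as the text suggests for Figure 4.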

Establishing the position of the analysed situation in the criticality matrix in Figure 4 is the final part of the risk assessment. The team of experts passes the results of the risk assessment to the office responsible for territorial governance, which is responsible for the safety management of the territory and for correct decision-making. The risk management issue then moves to the next stage, Risk Decision (Figure 5). The decision on the further approach to the revealed risks and on measures for coping with the identified risks is the responsibility of elected representatives, who are not experts, which often influences the quality of decisions.

The experts return to their role at the final stage of Risk Management, i.e. Risk Mitigation, but the real measures are influenced by the limits set by public administration decision-making (e.g. it strongly influences the amount of finances that can be used for risk mitigation).


5 CONCLUSION 


Critical judgement of the present situation shows that, in practice, tools are missing that: effectively reduce violations of security measures on the carrier side; enable a fast reaction to an accident; enable the protection of all road and railroad users, including people and the environment in the vicinity of the traffic accident; and support the renovation of the affected area so that it can be used for human needs.

The article demonstrates a methodology for the assessment of risks associated with the transport of dangerous goods by road or rail. The main problem is the mobile nature of the risks associated with transport. Another problem is the risk dynamics. The methodology is primarily targeted at public administration, and it supplements well the measures of legal rules that are primarily targeted at the carriers of dangerous goods. The methodology consists of three steps: risk analysis, risk judgement and risk mitigation.

Identification and awareness of the nature of mobile risk phenomena in the area under consideration is the first step. The main problem during risk identification is neglecting the mobility and dynamics of the risk. A frequent error in practice is that experiences from accident sites in other regions are not considered. The analysis and judgement of emergency situations which may arise as a consequence of an accident with dangerous substances is the second step.

The recommended methodology seeks to combine practices using computational models with expert investigation based on knowledge of the real territory. The final step is risk mitigation and early response; the combination of the criticality matrix with the F-N curve is used. The methodology so far evaluates significance only for life-threatening impacts and needs to be adjusted to include other severe impacts, such as groundwater contamination or the destruction of other critical infrastructure elements.

The methodology described above includes procedures and tools for risk assessment by a team of experts. In co-operation with local experts, we determine step by step concrete results for individual public administrations (selection of appropriate preventive measures, preparation of response plans for very hazardous substances at the level of emergency and crisis management).
ABSTRACT: Autonomous transport systems in all modes, road (e.g. autonomous cars), aviation (e.g. drones), shipping and rail, are coming. Regulation and testing are ongoing in Norway. The risks of autonomous systems are uncertain because of missing data, emerging technology and variation in framework conditions. However, accidents of autonomous cars seem to be 1/3 or 1/2 of current levels. Incidents are different, sometimes requiring outside intervention. Based on a review of experiences across the modes and of regulations, we suggest agile and transparent learning in the whole autonomous ecosystem, between all modes. System certification is needed, and system responsibilities must be clarified. Structures for orchestrating transport (i.e. control of many autonomous vehicles with possible common failures) and for marking autonomous transport should be established. In the interfaces between humans and systems there are differences between autonomy as imagined and as performed, leading to new incidents and accidents. Emerging safety/security issues must be explored.


1 INTRODUCTION 


This paper discusses experiences of autonomous transport systems in order to establish a framework for risk-based governance. Risk and risk governance are based on the process described by Renn (2005), starting with problem framing, followed by risk appraisal (hazards and vulnerabilities), risk judgment, risk communication and risk management. The implementation of autonomy can reduce transport risks, but it can also introduce new risks in the interfaces between the autonomous system and the environment (such as humans). As discussed in Lund and Aarø (2004), risk reduction must be based on a broad set of actions such as regulation, technical design, training and awareness.

Based on involvement in the regulatory process in Norway and experiences of autonomous transport systems, we discuss new emerging risks and threats. We see the need for establishing a framework, such as regulatory actions and clarification of responsibilities, as autonomy is being implemented.

In the following, we define autonomous systems and concepts such as Levels of Automation (LOA), used to specify the degree of automation.


1.1 Definitions and terminology 


Safety is related to accidental harm, while security is related to intentional harm. Safety is defined as “the degree to which accidental harm is prevented, reduced and properly reacted to” (Firesmith 2003), and security as “the degree to which malicious harm is prevented, reduced and properly reacted to”.

In Parasuraman and Riley (1997), automation and autonomy are described as “The execution by a machine agent (usually a computer) of a function that was previously carried out by a human”. Automation can be achieved by various means, i.e. 1: Remote controlled (surveyed and/or externally controlled); 2: Autonomous (based on own sensors and systems); 3: Cooperative and connected (based on own sensors and other traffic information); or 4: A combination of 1-3. The terms autonomous and automated have been used interchangeably in some papers. We make a distinction. By autonomy we mean a system that is non-deterministic in that it has the freedom to make choices, and by automated we mean a system that is more deterministic in that it will do exactly what it is programmed to do. This is based on the taxonomy and discussion of autonomy from Vagia et al. (2016).

When trying to scope risks of autonomous 
systems we must include the regulation, risk gov- 
ernance, organizational framework, interfaces to 
humans and the autonomous system (a combina- 
tion of software components and cyber physical 
systems). The system is often a collection of sys- 
tems being developed by different stakeholders. 
Thus, we have used the concept of autonomous 
ecosystem (AEC). This is inspired by the concept of
Software Ecosystems (SEC). An SEC consists of components
developed by actors both inside and outside
the company, i.e. beyond the traditional organisational
borders, extending to a group of private persons and
actors. Manikas et al. (2013) defined a software
ecosystem as: “the interaction of a set of actors on 
top of a common technological platform that results 
in a number of software solutions or services. Each 
actor is motivated by a set of interests or business 
models and connected to the rest of the actors and 
the ecosystem as a whole with symbiotic relation- 
ships, while, the technological platform is structured 
in a way that allows the involvement and contribution
of the different actors...”. The argument for using
such a concept is the realization that development
increasingly takes place outside of organisational
silos, due to the need for speed of development,
the need for supporting applications, reduction
of development costs, and competition. This creates
the need to address governance challenges in
an ecosystem framework. 

An example of an autonomous ecosystem is 
Intelligent Transport Systems (ITS) consisting 
of autonomous vehicles, integrated with traffic 
control, electronic payments and other systems. 
Autonomous ecosystems handle information, but 
also actual critical processes such as transport (via 
automobiles, boats, drones and trams). These eco- 
systems must be safe and secure. The systems must 
be able to handle unanticipated events, breakdowns 
and be able to go to a safe and secure (end-)state. 

To explore the main risks of autonomous sys- 
tems, we need to clarify responsibilities i.e. LOA 
in task execution. LOA is described by steps going 
from no automation where the humans are fully in 
control to a fully automated system with no human 
interaction. Sheridan and Verplank (1978) intro- 
duced 10 steps of automation, going from LOA1: 
Fully Manual Control to LOA10: Fully Autonomous
Control. The LOA has been adapted to the
car industry by the Society of Automotive Engineers
(SAE), which describes six levels of autonomy in
driving (SAE, 2016), going from no autonomy
(level 0), through driver assistance, partial automation,
conditional automation and high automation,
to full automation (level 5). The design of the
autonomous transport system must ensure that the 
system maintains an accepted level of performance 
despite disturbances, including threats of an unex- 
pected and malicious nature. Our approach is to 
speed up learning and knowledge sharing between 
modes, since the autonomous systems have differ- 
ent maturity and experiences in aviation, rail, road 
and sea. 
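
As an illustration only, the six SAE levels listed above can be written down as an ordered enumeration. This is a minimal sketch: the class name `SAELevel` and the helper `human_monitors_environment` are hypothetical names introduced here, and the rule that the human monitors the driving environment up to level 2 is an added assumption, not something stated in the text.

```python
from enum import IntEnum


class SAELevel(IntEnum):
    """Six SAE driving automation levels, named as in the text above."""
    NO_AUTOMATION = 0           # called "no autonomy" in the text
    DRIVER_ASSISTANCE = 1
    PARTIAL_AUTOMATION = 2
    CONDITIONAL_AUTOMATION = 3
    HIGH_AUTOMATION = 4
    FULL_AUTOMATION = 5


def human_monitors_environment(level: SAELevel) -> bool:
    # Assumption (not stated in the text): up to level 2 the human driver
    # monitors the driving environment; from level 3 the system does.
    return level <= SAELevel.PARTIAL_AUTOMATION


# Print each level with the assumed monitoring responsibility.
for level in SAELevel:
    print(level.value, level.name, human_monitors_environment(level))
```

An ordered type makes it possible to state responsibility rules as simple comparisons between levels, which is one way of making the clarification of responsibilities explicit.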

The concept of resilience engineering is an 
important strategy to handle unanticipated inci- 
dents. Hollnagel, Woods and Leveson (2006) 


define resilience as “the intrinsic ability of a system 
to adjust its functioning prior to or following changes 
and disturbances, so that it can sustain operations 
even after a major mishap or in the presence of continuous
stress”. Handling unanticipated incidents
and continuing to operate safely is a key ability
of autonomous transport systems.

Based on the preceding introduction, the 
research questions (RQ) we want to explore are: 


• RQ1: What are the major risks introduced by
autonomous transport systems?

• RQ2: What regulatory issues should be prioritized
to handle these risks?

• RQ3: What is the way forward, i.e. main
approaches and issues needed to mitigate major
risks of autonomous transport systems?


2 SCOPE, CHALLENGES AND METHODS 


When discussing autonomous ecosystems, we 
include the organisational framework, regulation, 
human interactions and understanding in addition 
to the actual systems in autonomous systems and 
the infrastructure. This is described in Figure 1. 


2.1 Challenges and problems 


When introducing new technology such as auton- 
omous systems, one of the basic challenges is to 
understand emerging risks. Safety, security and 
resilience have often been identified late when 
vulnerabilities have been exploited and unwanted 
incidents have been published. There has been a 
tradition in the software industry that vendors sel- 
dom have to pay for these unwanted incidents even 
if they are due to poor quality, poor focus on safety, 
security or resilience. The consequences and costs 
have been given to users, organisations and society. 
In autonomous transport, the consequences can 
be loss of lives and/or environmental damage. In 
addition, when discussing vulnerabilities in autonomous
ecosystems, one challenge is that there is not
one single supplier, but a set of suppliers involved. 
It can be difficult to identify responsibilities and 
manage competencies, if framework conditions 
(regulation/responsibilities) are missing. 


Organizational framework, regulation, and governance 


Figure 1. Scope of autonomous ecosystems—AEC. 
2.2 Methodology and approach


We have based this paper on empirical data from 
users of autonomous transport systems, a tar- 
geted literature review of autonomy and safety in 
addition to discussion of suggested regulation of 
autonomous road transport in Norway. 

We have explored experiences of autonomous 
transport systems from St. Olav Hospital in Norway, 
where autonomous systems have been used from 2006 
to 2017. St. Olav has 10,500 employees, and covers an 
area of 200,000 m². We are involved in pilot projects
with self-driving shuttle buses in three Norwegian
cities. The trials address the feasibility of Mobility as a
Service (MaaS) linking up to public transport (first
and last mile). We are also involved in trials with eco-friendly
autonomous ships/vessels for cargo and passenger
travel along the Norwegian coastline.

We have performed a literature review based on 
a keyword search of autonomy, safety, security and 
resilience using SCOPUS, ACM Digital Library, 
IEEE Xplore, SpringerLink and ScienceDirect.

We have been involved in a hearing of regulation 
related to testing of autonomous vehicles in Norway 
from the Ministry of Transport and Communications,
MTC (2016). The suggested regulation was
distributed in December 2016, comments were due
by March 2017, and the regulation was proposed
to be approved as law in December 2017. Our
comments were based on the literature review, expe- 
riences from St. Olav’s and other public comments. 

The taxonomy used to register incidents has 
been based on Blanco et al. (2016). They collected 
a broad set of naturalistic accident data from 
autonomous driving, using a taxonomy of crash 
seriousness going from most serious at C1 to negligible
at C4.


C1: Crashes with airbag deployment, injury (need- 
ing doctor visit), rollover, more damage than 
$1,500, require towing, police reportable. 

C2: Minimum of $1,500 worth of damage, crashes 
such as large animal strikes and sign strikes. 

C3: Crashes involving physical conflict with another 
object, but with minimal damage. Includes most 
road departures, small animal strikes, all curb and 
tire strikes potentially in conflict with oncoming 
traffic and with higher risk potential if no curb. 

C4: Tire strike only with little or no risk element 
(e.g., clipping a curb during a tight turn), con- 
sidered to be of such minimal risk that most 
drivers would not consider these incidents to be 
crashes. 
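
The four categories above can be sketched as a small rule-based classifier. This is a hedged illustration, not the procedure used by Blanco et al. (2016): the function name `classify_incident`, its parameters, the order in which the rules are checked, and the treatment of the overlapping $1,500 threshold between C1 and C2 are all assumptions made for this example.

```python
def classify_incident(*, airbag_deployed=False, injury_needing_doctor=False,
                      rollover=False, towing_required=False,
                      damage_usd=0.0, tire_strike_only=False) -> str:
    """Return "C1".."C4" for an incident (hypothetical rule ordering)."""
    # C1: any of the most serious markers named in the taxonomy.
    if airbag_deployed or injury_needing_doctor or rollover or towing_required:
        return "C1"
    # C2: at least $1,500 damage without the C1 markers (assumed reading
    # of the overlapping damage threshold in the C1 and C2 definitions).
    if damage_usd >= 1500:
        return "C2"
    # C4: tire strike only, with little or no risk element.
    if tire_strike_only:
        return "C4"
    # C3: remaining physical conflicts with minimal damage.
    return "C3"


print(classify_incident(tire_strike_only=True))   # curb clipped in a tight turn
print(classify_incident(damage_usd=2000))         # e.g. a sign strike
```

Writing the rules down in this form shows where the taxonomy needs interpretation, for example how to treat an incident with high damage but none of the other C1 markers.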


3 RESULTS AND DISCUSSIONS 


In the following section, we have documented 
experiences from autonomous systems at St. Olav 


Hospital; some selected findings from our lit- 
erature review; and key issues discussed during 
regulation of testing of autonomous transport 
systems. 


3.1 Findings from autonomous systems 
at St. Olav 


St. Olav Hospital has installed an automated 
transport system called Transcar LTC2 Automated 
Guided Vehicle System (AGV) from Swisslog. 
They installed seven AGVs in 2006, and an additional
14 AGVs at the end of 2009. From 2010 to 2017
they have had 21 AGVs in operation. Each week
the 21 AGVs transport medicine, food, clothes and 
garbage, in total 70-80 tonnes. (Each AGV can 
transport a load of 500 kg, and is transporting 
3.6 tonnes each week). The speed is slow, moving 
at approximately 2 km/hour (maximum speed is 
5 km/h). The AGVs can send signals, open doors, 
and reserve elevators to deliver goods. There are 
different suppliers of door and elevator automa- 
tion. When there are conflicts that cannot be 
resolved, a signal is given to the operational cen- 
tre. The centre is manned by an operator that can 
intervene through the system, or go to the place 
where there is a conflict. 

The AGVs can communicate (i.e. deliver pre- 
programmed messages) such as “Please move—you 
are in my way”, or “Elevator is reserved—please 
move out of elevator”. This communication is a key
element in building awareness between the automated
transport system and humans: the automated system
needs to inform bystanders about what it perceives
and what it is going to do next. This helps staff,
patients and visitors learn to interact with the
AGVs and to anticipate their behaviour.

In the Transcar LTC2 Operations and Mainte- 
nance manual it is written “Always maintain a dis- 
tance of 1.5 meters between the vehicles and people 
or objects.” This safety guideline is not possible to 
implement at St. Olav due to space limitations. 

There are traces on the floor indicating that the 
AGVs are always following the same pathway, thus 
(new) common failures may happen. 

There has been a total of 100-130 minor incidents
per year (5-6 per AGV), categorised as C4 by
us. Minor repairs are done on the AGVs, changing
around 50 components per year. There are around
15 emergency stops each year, categorised as C3,
where components must be changed. We do not
have data indicating that there have been any incidents
of category C2 or C1. Reported incidents are
minor crashes due to faulty navigation, for example
due to undetected objects placed in the
travelled route.
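
The per-vehicle figures quoted above follow from simple normalisation by fleet size. The sketch below reproduces that arithmetic using only numbers from the text (21 AGVs, 100-130 minor incidents and about 15 emergency stops per year, 3.6 tonnes per AGV per week); the variable and function names are hypothetical.

```python
AGV_COUNT = 21                      # fleet size from 2010 to 2017
MINOR_INCIDENTS_PER_YEAR = (100, 130)
EMERGENCY_STOPS_PER_YEAR = 15
TONNES_PER_AGV_PER_WEEK = 3.6


def per_vehicle(count: float, fleet_size: int) -> float:
    """Normalise a yearly fleet-wide count to a per-vehicle figure."""
    return count / fleet_size


# Lower and upper bound of yearly minor (C4) incidents per AGV.
minor_per_agv = tuple(per_vehicle(c, AGV_COUNT) for c in MINOR_INCIDENTS_PER_YEAR)
# Yearly emergency stops (C3) per AGV.
stops_per_agv = per_vehicle(EMERGENCY_STOPS_PER_YEAR, AGV_COUNT)
# Total weekly load carried by the fleet.
weekly_tonnes = AGV_COUNT * TONNES_PER_AGV_PER_WEEK

print(f"minor incidents per AGV per year: {minor_per_agv[0]:.1f} to {minor_per_agv[1]:.1f}")
print(f"emergency stops per AGV per year: {stops_per_agv:.2f}")
print(f"fleet load per week: {weekly_tonnes:.1f} tonnes")
```

The results are consistent with the rounded values in the text: roughly five to six minor incidents per AGV per year, and a weekly fleet load within the stated 70-80 tonnes.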
When interviewing the users some incidents that
can be generalized were reported: 


• The AGVs have problems with pallets close to the
walls. The AGV uses the wall as reference in steering.
A misplaced pallet results in a lateral shift of
the AGV position and may sometimes end up in
a collision. Initially the operators used a great
deal of time to clear the transport road area (in
the basement) of clutter (i.e. parked bicycles,
pallets with supplies); this work has been reduced
now, but maintenance and design should take
these limitations of the AGVs into account.

• The AGV collided several times with the forklifts,
since the LiDAR sensor (light detection
and ranging) had a limited vertical field of view
and was seeing a free zone (space) under the
forklift. This was mitigated by placing a black
rubber skirt under the forklift. The same kind
of collisions happened when using stepladders
on the floor in the AGV's pathway, since the
LiDAR did not detect the object. Thus, one
issue has been the ability to see and identify
objects in relation to the AGV's sense of its own
size and position. This may be a general challenge
with autonomous transport systems. The
fatal accident of Joshua Brown, described by
NTSB (2017) and NHTSA (2017), was between
a Tesla and a trailer crossing the road: a white
trailer giving poor contrast and with substantial
height above the ground. This has some similarity with
the forklift problems at St. Olav. A rubber skirt
under the trailer may have increased the visibility/
visual signal of the trailer.

• The AGVs can open doors, reserve and use elevators.
Sometimes there have been conflicts between
the AGVs and the users, needing human intervention
through a central control.

• Software updates of AGVs, elevators and doors
have led to interface problems, thus there is a need
to look at the AGVs as a part of an ecosystem.


During the 11 years of operating the AGVs, no
human injuries have been reported at St. Olav.
However, at the AHUS hospital (AHUS, 2009),
which uses the same system, one incident happened in
2009, where a nurse sustained a minor injury when
colliding with the AGV (i.e. category C3 or C4).

In summary, the AGV system has had an
impressive safety record at St. Olav’s Hospital. Key
issues of safe operations are related to an ecosystem
approach, planning the interaction between
technology, organisation and humans. Safe operation
has been based on preparation through pilots, low speed,
and communication between the automated systems
and humans to inform surrounding people of the AGV’s
intended behaviour. The unexpected may still happen, thus there
was a need to establish a manned control centre
that can intervene during operations.


3.2 Key findings from literature review 


In Axelrod (2014) the focus is on software assur- 
ance of safety-critical and security-critical systems. 
The perception is that use of the current methods 
has not achieved the wished-for level of protection, 
and that there are missing security principles and 
standards. There seems to be a need for incentives or
regulations to implement protective and immunizing
measures in software. A requirement could be
that these measures are included in a certification
process. On governance, it is suggested to establish
software assurance standards at the United Nations
(UN) level; to have a risk-based approach; to share
best-of-breed methods; and to discuss liabilities
for damages occurring because of an attack
or security-related errors.

International governance of security of the infra- 
structure is addressed through several channels such 
as standard bodies (i.e. ISO, IEC) and international 
bodies such as OECD, EU, NATO and UN. Auton- 
omous systems are international—involving many 
actors with different agendas. In GCIG (2016) there 
is a discussion of governance of emerging technol- 
ogy as it is integrated into critical infrastructure, 
such as transport systems. It is suggested that manu- 
facturers should follow the principle of privacy and 
security by design, when developing new products. 
They must be prepared to accept legal liability for 
the quality of the technology they produce. Buy- 
ers should collectively demand that manufactur- 
ers respond effectively to concerns about privacy 
and security. Governments can play a positive 
role by incorporating minimum security standards 
in their procurement. It is suggested that govern- 
ment regulations should require routine, transpar- 
ent reporting of technological problems to provide 
the data required for a transparent market-based 
cyber-insurance industry. It is suggested to establish 
an agreement (a compact) based on collaboration 
between government, industry and private society 
supporting this evidence based decision making. 

In Koscher et al. (2010) vulnerabilities in cars 
are pointed out, such as the possibility to con- 
trol a wide range of automotive functions and 
completely ignore driver input from dashboard, 
including disabling the brakes, selectively brak- 
ing individual wheels on demand, stopping the 
engine, and so on. Attacks were easy to perform 
and the effects were significant. It is possible to
bypass rudimentary network security protections
within the car, and perform an attack that embeds
malicious code in the car, which will completely erase
any evidence of its presence (after a crash). There
is a discussion of the challenges in addressing these 
vulnerabilities in the existing ecosystem. 

In Lima et al. (2016) semi-autonomous and
fully autonomous cars are described as moving
from the development stage to operations. The
autonomous systems are creating safety and secu- 
rity challenges. These challenges require a holistic 
analysis, under the perspective of ecosystems of 
autonomous vehicles. These systems will become 
important critical information infrastructures, 
simultaneously featuring connectivity, autonomy 
and cooperation. Threat analyses and safety cases 
should include both (random) faults and (purpose- 
ful) attacks. 

In DHS (2015), there is a discussion of Cyber- 
Physical infrastructure risks in the future smart 
cities. Several examples of unwanted incidents are
described in transportation systems (i.e. autonomous
vehicles; trains), in electricity distribution
and management, and in the water and wastewater
systems sector. It is suggested that the regulator
work with standards and regulations, in addition to
communication and increased engagement by giving
direct assistance. Challenges mentioned are the
need to establish goal based standards and regu- 
lations as new technology is implemented and to 
focus on dissemination of best practices and sys- 
tematic education. 

In Cerrudo (2015) there is an empirical evaluation
of “smart cities” looking at a broad set of
technologies of traffic control, management of 
energy/water/waste and security. Known vulner- 
abilities are in traffic control systems, mobile appli- 
cations used by citizens, smart grids/smart meters 
and video cameras. The issues are lack of cyber 
security testing and approval, lack of encryp- 
tion, lack of City Computer Emergency Response 
Teams (CERT), and lack of cyber-attack emergency
plans. There is reason to anticipate
serious incidents if these
issues are not addressed and mitigated.

In Frei (2010) there is a discussion of the security
dynamics of general software ecosystems
(SEC), applicable to autonomous ecosystems. The study
examines 27,000 vulnerabilities in the period
1996-2008. The paper explores several policies such as
security through obscurity, responsible disclosure 
of vulnerabilities (a suggested policy) or security 
through transparency. One key insight is that 
secrecy prevents people from assessing their own 
risks, which contributes to a false sense of security. 
Responsible disclosure means that the researcher 
discloses full information to the vendor, expecting 
that mitigation is developed within a reasonable 
timeframe. An increasing number of organizations
have adopted some form of responsible disclosure.
A risk-based regulatory regime is dependent on
such an open discussion of the risks.

In summary, if we want systems that are safe,
secure and reliable, then safety, security and reliability
must be built in together. Several vulnerabilities
have been documented, and responsible
disclosure of vulnerabilities to the vendors seems
to be a beneficial policy. Some sort of community
of practice, and a CERT for autonomous systems,
should be established. International regulation,
or compacts based on private-public partnerships,
to ensure privacy, safety, security and resilience
is missing. Vendors must ensure this quality
by design, and must be prepared to accept legal
liability for the technology they produce. Regulations
should require routine, transparent reporting
of technological problems to provide data for a
transparent market-based cyber-insurance industry,
and a risk-based regulatory regime.


3.3 Key issues when discussing regulation 


3.3.1 Selected issues from all forms of transport 
Risks of autonomous transport are not well known 
at present. To increase knowledge and learning, 
experiences, taxonomies, regulations and relevant 
incidents should be gathered and disseminated 
from all modes—autonomous road systems (vehi- 
cles), air transport (i.e. drones), rail (unmanned 
metro and rail systems) and shipping. Accident 
investigators and rule-makers (such as “The Acci- 
dent Investigation Board in Norway”) should 
develop methods for investigation of accidents of 
autonomy and report their findings. 

Shipping: Completely unmanned ships seem
to give large benefits and enable new transport
systems; some of these issues are documented in
Rødseth (2017). There is a need for onshore control
centres to manage autonomous shipping
operations. Norway has focused on autonomy in 
sea transport. A network, the Norwegian Forum for
Autonomous Ships (NFAS) at nfas.autonomous-ship.org,
has been established. A more general
research program called Centre for Autonomous
Marine Operations and Systems (AMOS) has been
initiated at the Norwegian University of Science
and Technology, ref www.ntnu.edu/amos. The
Trondheimsfjord has been selected as a national 
testing area in collaboration with The Norwegian 
Maritime Authority and The Norwegian Coastal 
Administration. At the end of 2017, three testing
areas had been established in Norway (Trondheimsfjord,
Storfjord and Horten). Test areas have
also been established in Finland and China. Risk
levels of autonomous ships are influenced by exist- 
ing incidents and new incidents (i.e. caused by 
new automation, and former incidents mitigated 
by crew now being removed). Work is ongoing to 
explore safety of autonomous sea transport, and 
to explore a taxonomy of LOA for shipping, Rødseth
et al. (2017). In Trondheimsfjord an autonomous
passenger ferry is going to be tested in
2018-2019. The authorities need to set rules and 
requirements based on acceptable risk levels. There 
is an increased need for Human Factors knowledge
to improve the quality of interfaces (i.e. “human in 
the loop” control when needed) between humans 
and the autonomous systems. 

Aviation: To govern the use of Remotely Piloted 
Aircraft Systems (i.e., drones) in Norway, regula- 
tion has been established, Civil Aviation Author- 
ity—CAA (2016). The operator must be certified 
through an exam, CAA (2017). Experiences with
remotely piloted aircraft systems, Waraich et al.
(2013), document that mishaps do happen (i.e.
50 mishaps occur per 100,000 flight hours, vs.
human-operated aircraft, where there is one mishap
per 100,000 flight hours). The high mishap rate is
related to poor attention to human factors science 
and design in ground control centres, Waraich et 
al. (2013). Several pilot projects with drones are 
planned, transporting goods/persons. 

Rail/Metro systems: By automated metros 
(rail systems) we mean systems where there is no 
driver in the front cabin, nor accompanying staff, 
also called Unattended Train Operation (UTO). 
UTO has been in operation since 1980. In UITP
(2013), 674 km of automated metros are listed,
consisting of 48 lines in 32 cities. UTOs are
found in Barcelona, Copenhagen, Dubai, Kobe, 
Lille, Nuremberg, Paris, Singapore, Taipei, Tokyo, 
Toulouse and Vancouver. Wang et al. (2016), list 
the arguments for UTO as increased reliability, 
lower operation costs, increased capacity, energy 
efficiency and an impressive safety record. There 
is substantial infrastructure cost to ensure safe on 
and offloading of passengers and that the track 
is safe and isolated from other traffic. Four distinct
Grades of Automation (GoA) are defined: GoA1:
Non-automated train operation, with a driver in the 
cabin. GoA2: Automatic train operation system 
controls train movements, but a driver in the cabin 
observes and stops the train in case of a hazard- 
ous situation. GoA3: No driver in the cabin but an 
operation staff on board. GoA4: Unattended train 
operation, with no operation staff on board. We have 
at present not found normalized accident data for 
UTO (incidents based on person km), but no accidents
have been reported. It seems that UTO
has an exceptionally high safety level. However, more
systematic analysis and normalization of all international
UTO transport incidents are needed.
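
The four grades described above differ mainly in who is present on the train. A minimal sketch of that structure as a lookup table follows; the table name `GOA_STAFFING`, the helper `is_unattended`, and the convention that a driver counts as staff on board are assumptions made for this illustration.

```python
# Per grade: (driver_in_cabin, any_staff_on_board).
# Assumption: a driver in the cabin counts as staff on board.
GOA_STAFFING = {
    "GoA1": (True, True),    # non-automated operation, driver in the cabin
    "GoA2": (True, True),    # automatic train operation, driver observes and can stop
    "GoA3": (False, True),   # no driver, operation staff on board
    "GoA4": (False, False),  # unattended train operation (UTO)
}


def is_unattended(grade: str) -> bool:
    """True when neither a driver nor other operation staff is on board."""
    driver_in_cabin, staff_on_board = GOA_STAFFING[grade]
    return not driver_in_cabin and not staff_on_board


print([grade for grade in GOA_STAFFING if is_unattended(grade)])
```

Under these assumptions only GoA4 qualifies as unattended, which matches the use of UTO in the text.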

Road Transportation: Google’s self-driving cars, 
where the vehicle systems control all aspects of the 
driving, have been on public roads in the US since 
2009. The safety record has been impressive. However,
Google engineers are supervising and re-taking
vehicle control if necessary. The fatal accident in
2016 (Joshua Brown), with a Tesla in autonomous driving
mode, occurred when a tractor-trailer
made a left turn in front of the Tesla, and the car
failed to apply the brakes. The Tesla did not “see”


the trailer—it was all white and had poor contrast 
with the surrounding bright white sky. In addition, 
there was a high gap between the road and the 
trailer. The National Transportation Safety Board 
(NTSB, 2017) found that the system’s “operational
design” was a contributing factor to the crash
because it allowed drivers to avoid steering/watching
the road for periods of time that were “inconsistent”
with warnings. Tesla could have taken further
steps to prevent the system’s misuse. In addition, 
NTSB faulted the driver for not paying attention 
and “over-reliance on vehicle automation”. It also 
seems there is a need for better training of drivers 
related to autonomous systems—a part of driver 
education and driver license requirements. 

Safety data are scarce so far, but data from
the period 2009 to the end of 2015 have been collected
from Google’s cars, in Teoh et al. (2017). There
were three police-reportable accidents (denoted as
level C1) in California while driving 2,208,199 km,
giving an accident rate of 1.36 police-reportable
incidents per million km. This is 1/3 of the reportable
accidents of human-driven passenger vehicles in
the same area. Accidents involving autonomous
cars are different from those involving human-driven cars.
Google cars are more often rear-ended by other vehicles
while stopped or barely moving. There is an ele- 
ment of risk negligence in that the human driver 
does not fully anticipate the action of the self-driving
car. There are also challenges of sustained
human attention during lengthy periods of autonomous
driving, making it difficult for the human
operator to intervene, i.e. “Human in the loop”
challenges. Huffington (2017) documented that
Waymo’s human drivers had to take control from 
the automated system (i.e. “disengagement”) once 
for every 5,000 miles in 2016. “Backup” human 
drivers in Uber’s self-driving cars had to take over 
about once every mile as of March 8, ref Recode 
(2017). It is a challenge to get situational awareness 
after having been out of the driving control loop 
for 5,000 miles. The takeover time of the human 
driver varies from 2 to 26 seconds, ref Eriksson 
et al. (2017), challenging the design of autonomous 
systems to enable human intervention. 
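
The accident rate quoted from Teoh et al. (2017) is a count normalised by distance driven. The short sketch below reproduces that calculation and the contrast between the two disengagement figures; the function name `rate_per_million_km` and the variable names are hypothetical, and only the numbers come from the text.

```python
def rate_per_million_km(events: int, distance_km: float) -> float:
    """Events per million kilometres driven."""
    if distance_km <= 0:
        raise ValueError("distance_km must be positive")
    return events / distance_km * 1_000_000


# Three police-reportable accidents over 2,208,199 km (2009 to end of 2015).
google_rate = rate_per_million_km(3, 2_208_199)
print(f"{google_rate:.2f} police-reportable accidents per million km")

# Miles driven per human takeover, as reported for 2016 and March 2017.
waymo_miles_per_disengagement = 5_000
uber_miles_per_disengagement = 1
ratio = waymo_miles_per_disengagement / uber_miles_per_disengagement
print(f"Waymo drove about {ratio:.0f} times further between takeovers than Uber")
```

Normalising by distance in this way is what allows the comparison with human-driven vehicles in the same area, and the same approach would apply to the person-km normalisation that is still missing for UTO.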

Analysing all car accidents, it is suggested that 
80-90% of accidents are due to “human errors”, 
thus autonomous cars could reduce the level of 
accidents substantially. However, autonomy could 
introduce new types of accidents, due to automation
itself or due to human drivers not predicting
actions from the automation. In Blanco et
al. (2016) it is suggested that accident rates are
reduced to % of present, while Teoh et al. (2017)
document accident levels of autonomous systems
as 1/3 of human driver systems. The National 
Highway Traffic Safety Administration (NHTSA, 
2017) reported a reduction in vehicle crash-rate 
by almost 40% with Autosteer activated in Tesla
Model S and Model X, compared to before. In 
Cummings et al. (2014) it is suggested that the level 
of accidents could be reduced by 50%. More expe- 
riences must be gathered, but significant reduction 
of accidents is expected. 


3.3.2 Need for systematic open data reporting 

At present, data on incidents
(accidents and successful recoveries) related to
autonomous systems are missing. Open reporting must be
established, covering systematic safety records
and security stories, available to researchers
and industry actors, such as insurers. The scope
must cover actions from the autonomous system 
but also document perceptions and understanding 
from the involved human actors. The differences 
between espoused values (rule based actions/work 
as programmed in autonomous systems) and actual 
values (actions/work as being done by humans in 
interaction with autonomy) can create the basis 
for errors and accidents. It should be a key area 
of research to explore accidents because of poor 
design vs. blaming the human actors. Use of video
recording could help, based on regulation protecting
personal data (EU 2016/679). Data gathering
must be combined
with in-depth accident investigation. Accident
investigation boards should explore accidents of
autonomy to support rapid learning and change,
and should improve their methods to analyse
autonomy incidents.


3.3.3 System perspective and human factors 
Safety of autonomous systems is dependent
on newly designed technology, human factors and
organisational issues as discussed by Cummings 
et al. (2014). The perception should be that most 
accidents in autonomous systems are a conse- 
quence of poor design and poor testing, and that 
“human errors” are a consequence and not a cause 
as described by Dekker (2002). Moving trivial
functions (that can be programmed) to an autonomous
system means that tough decisions and
deviations must be handled by humans. Thus, the
science of Human Factors, knowing strengths and 
weaknesses in cognition and ergonomics, must get 
a significant position when automation is designed 
and implemented. 


3.3.4 Responsibilities and certification 

The autonomous system decides based on design 
approved by the manufacturer. Thus, product
responsibility for accidents and incidents must be
placed with the manufacturer (OEM). This is in line
with the view of the car OEMs Volvo, Google and 
Mercedes-Benz (Iozzio, 2016). This is also in line 
with the supervisory responsibility demanded in 


the Oil and Gas industry (i.e. where the operator 
is responsible for the chain of suppliers employed). 
This supervisor responsibility must be placed on 
the car OEMs, including the continued updating 
and adaptation of software in use. Certification 
is needed, such as the ISA/IEC 62443 scheme for
industrial control systems used since 2010. However,
certification is still being developed; a survey
documenting key issues is found in Martin et al.
(2015).


3.3.5 Security and risk-based regulation 

Security (for safety) must be included in the devel- 
opment of autonomous systems, and systematic 
testing (including penetration testing) must be 
done as a part of certification prior to product 
release. The precautionary principle must be estab- 
lished as a condition for autonomous transport 
systems, COMEST (2005). 


4 CONCLUSIONS 


Related to the research question RQ1 (major risks): 
The sensors and systems used in autonomous
systems do not have a perfect view of the surroundings,
and may also act in an uncoordinated way with
their surroundings, thus new types of accidents may
happen. There is a need to speed up learning from 
these incidents and to be aware of communication 
and information challenges in operations. 

Human control and assistance through control 
centres and via human machine interactions must 
be designed based on the science of Human Fac- 
tors in order to avoid higher levels of accidents as 
documented by Waraich et al. (2013). 

We continue to see vulnerabilities and exploitation
of software in the public and private sectors.
Different perspectives are used in security and
safety, due to different adversity models. The
security community addresses threats (directed,
deliberate, hostile acts) and the safety community
addresses hazards (undirected events). AEC are so
pervasive across all sectors that a silo approach is
no longer acceptable. To ensure that all actors in
the value chain understand this, the silo-based
“need to know” principle must be replaced by
transparent and open reporting. This can also
support a market-based cyber-insurance industry.

Related to the research question RQ2 (regulation):
There is a need for regulatory action from
government to set minimum standards, establish
responsibility, and follow up incidents/accidents.
Prescriptive and detailed rulemaking on a national
level is wanting, but should be replaced by a
functional approach demanding the same level of
risk in automated systems as in existing systems.
Vendors must have responsibility to ensure safety,
security and resilience by design, and must be
prepared to accept legal liability for the quality of
the technology they produce. Ideally, a formal
process of product acceptance and certification
(i.e. a safety case) should be established before a
product can be sold. The manufacturers should
establish a proactive focus on (best practice)
safety/security standards. There is a need to ensure
that there is a structured learning process (among
all relevant actors) when incidents happen.

Related to the research question RQ3 (way forward):
Innovative approaches, such as the perspective of
Autonomous Ecosystems (AEC), are needed to handle
the challenges of autonomous transport systems.
The science of Human Factors needs to be
prioritized to ensure that human intervention can
be designed into the system and can be performed in
actual operations, based on actual human
limitations and on human strengths in improvising
and handling unanticipated events.

Safety has been dependent on publicised acci- 
dents and a systematic learning loop between users, 
the regulator and industry. One component in the 
learning loop of complex software systems has 
been reporting and analysis of incidents through 
computer incident response teams (CERTS). There 
is a need to establish CERTS of AEC to help coor- 
dinate actions. 

Rules and mechanisms for updating software 
in autonomous systems will become more urgent 
as failures can lead to accidents, thus handling of 
updates must be addressed in a systematic manner. 

Communication between autonomous transport
systems and drivers and bystanders must be
improved. Autonomous systems are rule based while
humans are not; thus, there may be
misunderstandings and common failures, creating a
need for interventions through transport centres
controlling the flow of transport.

These AEC will be exposed to new strains; thus,
there must be a focus on how to handle surprises
through resilience, to ensure that new
demands/stress/failures do not impact
transportation in a catastrophic way.

ABSTRACT: Risk communication and information exchange have become more challenging in today’s
complex projects, which involve different stakeholders in different roles. Unstructured data
communication and flaws in knowledge exchange may lead to erroneous risk perception and evaluation.
The role of risk analysis has become more critical in this perspective. This paper proposes a new way
to enrich risk analysis with the help of knowledge from systems engineering. Focus is given to
establishing a Failure Mode, Effect and Criticality Analysis (FMECA) model using models from systems
engineering, which are more systematic than the models used in the traditional approach. A reference
case study is analyzed, inspired by the subsea laboratory at the Department of Mechanical and
Industrial Engineering (MTP) at the Norwegian University of Science and Technology (NTNU). The
investigation focuses on the series of relevant risks for the subsea gas boosting section in offshore
Oil & Gas installations. Results from the present study are discussed with emphasis on three aspects:
(1) cross-system effects, (2) reasonable and reachable risk reduction measures and (3) multiple system
dimensions. We find that the proposed method, based on systems thinking, can be used to construct a
system behavior model, which in turn can be used in FMECA development to gain a better understanding
of risks and to improve the overall performance of the system.


1 INTRODUCTION 


In the last decades, industry, from manufacturing
to the chemical and Oil & Gas (O&G) sectors, has
become more and more complex, notably by
incorporating innovative technologies requiring new
stakeholders. Risk analysts are responsible for the
identification and analysis of potential risks
arising from the performed activities and for
treating these risks according to established
acceptance criteria (ISO 31000, 2009). Many factors
affect the actual risk level in different ways. For
instance, a new human machine interface may cause
stress and therefore influence operator
performance. The risk posed by these factors raises
the importance of both the risk analyst and the
communication network with the other
professionals; consequently, the focus has moved to
risk management.


Risk management involves different disciplines
at different levels across the entire enterprise.
However, in today's industry, the coordination
between the different parties is not stressed
enough, although it should be properly maintained
(Rice & Spence, 2016). Bringing different areas of
expertise together in analyzing the potential risks
and identifying threshold criteria is still
challenging. Moreover, the methods adopted for
information collection and analysis are often
unstructured (Kirsch, Hineb, & Maybury, 2015). The
inconsistency in the jargon used and the difference
in the theoretical background of the different
stakeholders may lead to erroneous decision-making
and costly correction of inadequate and
inappropriate actions.

Failure Mode, Effect and Criticality Analysis
(FMECA) has been widely accepted as the risk
identification and characterization tool for
hardware components since the late 1960s. However,
today’s industry involves fast-moving technology
innovations and faces challenging operating
conditions. There is no formal procedure for
constructing an FMECA for such complex systems, so
the quality and content of an FMECA depend on the
competence and experience of the risk analysts. In
some practices, analysts start the FMECA without
establishing a baseline system concept. In this
situation, the solution is to structure coordinated
and distributed system information about what is
taking place and what is needed for proper risk
mitigation.

The existing frameworks for risk management in
the Oil and Gas industry (ISO 31000, 2009; NORSOK
Z-013, 2010) also highlight the need to establish
the system concept before starting risk analysis,
and to maintain and update the system concept
based on given indications. However, they provide
no detailed methods or approaches. This paper
suggests the use of often-cited approaches in
Systems Engineering (SE) to fulfil this need and to
avoid spending excessive resources on reaching
agreement on the system concept.

The main objective of this contribution is to
investigate how FMECA can benefit from SE. Many
companies have adopted SE as the systematic
approach for the design of complex systems
(Asbjornsen, 1992; Haskins, 2008). In fact, SE may
help in mediating information exchange among
professionals in a simple and concrete way by means
of different analyses and models at different
levels of detail. In this sense, it allows the
right person to access the right information at the
right time and use it. The adaptation and
exploitation of SE approaches and models can assist
in maintaining a unified context of the system and
ensuring that the interests of different
stakeholders are properly understood and
considered.

This paper shows, through a practical case, to what
extent and in which ways risk analysts consider SE
methods effective for supporting FMECA. The
remainder of this paper is organized as follows:
Section 2 describes the subsea gas boosting
laboratory, which still demands further
improvements from a risk perspective. Section 3
discusses the key features of linking and coupling
systems engineering and risk analysis models. In
Section 4, the proposed method is applied to the
practical case; the results are presented in
Section 5 and discussed in Section 6. Section 7
presents conclusions.


2 A PRACTICAL CASE 


The paper selects an accessible subsea laboratory
located at the Department of Mechanical and
Industrial Engineering (MTP) at the Norwegian
University of Science and Technology (NTNU) to
investigate a larger spectrum of risks relevant to
subsea gas boosting, as shown in Figure 1. As of
today, one challenge for subsea boosting is the
compression of wet gas (with a water fraction of
2-20%), since the stratified gas-liquid flow runs a
high risk of damaging a traditional compressor. The
laboratory is constructed as a pre-testing facility
to emulate the different existing solutions for
subsea wet gas compression on the Norwegian
Continental Shelf, i.e. Ormen Lange and Åsgard
(with pre-separation) and Gullfaks (without
pre-separation).

Figure 1. Subsea compression laboratory.

This laboratory includes two major modules to
test different characteristics of wet gas: the
separation module (bottom left in Figure 1) and the
compression module (top right). The mixed flow,
with a water fraction ranging from 0 to 20%, is
emulated by controlling the inlet flow. The
separation module can separate the water from the
mixed flow before it enters the compression module,
which allows studying how the working efficiency
and robustness of a separator influence the whole
compression process. The compression module
involves a compressor driven by a piston and a
compressor driven by a turbocharger, where the
piston compressor is very vulnerable to particles
and water. Three test scenarios are therefore
formulated by manually closing/opening ball valves:


• Wet gas compression with the separation module,
  where the piston is bypassed

• Wet gas compression without the separation
  module, where the piston is bypassed

• Dry gas compression, where the piston and
  turbocharger can compress in series
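
The three scenarios can be written down as a small configuration table, which makes it easy to check that no scenario exposes the piston compressor to wet gas. The sketch below is illustrative only: the class `Scenario`, its field names and the two helper functions are assumptions made for this example and do not come from the laboratory documentation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Scenario:
    name: str
    wet_gas: bool
    # None means the text does not state whether the separation module is used.
    separation_module: Optional[bool]
    piston_bypassed: bool


SCENARIOS = (
    Scenario("wet gas with separation", True, True, True),
    Scenario("wet gas without separation", True, False, True),
    Scenario("dry gas", False, None, False),
)


def active_compressors(s: Scenario) -> Tuple[str, ...]:
    """Compressors in use; tuple order does not imply flow order."""
    return ("turbocharger",) if s.piston_bypassed else ("piston", "turbocharger")


def protects_piston(s: Scenario) -> bool:
    """True if the piston compressor is never exposed to wet gas."""
    return s.piston_bypassed or not s.wet_gas
```

A valve configuration that sends wet gas through the piston compressor, for example `Scenario("invalid", True, False, False)`, fails the `protects_piston` check.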


The laboratory was almost completed in late 2016,
but still demands many improvements in several
respects. This paper focuses exclusively on
managing the risks that have emerged from the
current structure of the laboratory, and presents
the obtained results and knowledge for use in new
developments of industry-size gas boosting systems.


3 METHOD 


This paper aims to suggest a structured method to
enrich the scope of validity of risk assessment by
taking advantage of SE. The proposed method
propagates SE activities toward risk assessment
activities such as FMECA, as shown in Figure 2.
The SE workshop and the FMECA workshop, which focus
on very different objectives, are at the heart of
this method. This collaborative method of knowledge
transfer enables effective risk management, with
the objective of dispersing expert knowledge into
available tacit knowledge (Alavi & Leidner, 2001).

The starting point of the method is the preliminary
analysis, which includes defining the process goal
and defining the system along with its boundary,
environment and interactions. This gives an
overview of the process and clarifies what is
included and what is not. The SE workshop involves
experts (e.g. designers, operators, managers) to
create a static view of the structure of the system
from operational, functional and physical
perspectives. The SE workshop covers the interests
and the expertise of each contributing stakeholder.
It consists of two concurrent analyses: an SE-based
analysis that covers the operational, functional
and physical aspects, and a scenario approach that
borrows the Business Process Model and Notation
(BPMN) (White & Miers, 2008).

Figure 2. The conceptual map for the proposed method.

Operational analysis introduces the behavior of the
system, which supports performance-based
assessment. Conflicts among the behaviors of
different components can be represented. For
example, component C will go off if component A
shuts down, or one specific pump may trigger an
electric supply failure. The analysis resolves such
conflicts by considering the system function in an
ordered sequence of declarations.

Functional decomposition specifies how the
component functions realize the module functions
and how the module functions realize the overall
function. Functions that make up another function
are grouped together in sets based on ways of
achievement. Material, energy and signal flows are
viewed as the attributes of these interactions. The
functionality is allocated to the preferred parts
and layouts.

Physical decomposition is based on the layout of
components provided by the supplier, and shows the
connections between components. This enables us to
check the interdependency among system components.
The availability of the system, including
repairable elements, can be determined. If one
component is out of order, its effect on the system
or on other components can be found by following
the connections. Management personnel can set their
work orders easily by following the connections.
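
Following the connections from a failed component can be sketched as a graph traversal. The example below is a minimal illustration under stated assumptions: the connection map `CONNECTIONS` is hypothetical (an edge from A to B is taken to mean that B receives flow or supply from A) and does not reproduce the actual layout of the laboratory.

```python
from collections import deque

# Hypothetical connection map: key feeds each component in its list.
CONNECTIONS = {
    "gas intake": ["separator", "turbocharger compressor"],
    "separator": ["turbocharger compressor", "water pump"],
    "piston compressor": ["turbocharger compressor"],
    "turbocharger compressor": [],
    "water pump": [],
}


def affected_components(failed, connections):
    """Return every component reachable downstream of `failed` (breadth-first)."""
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for nxt in connections.get(current, []):
            if nxt != failed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

With this map, `affected_components("separator", CONNECTIONS)` returns the turbocharger compressor and the water pump, which is the kind of list a maintenance planner could use to order the work.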

Designing scenarios is an effective way of
correctly abstracting the concerns in design. The
scenario analysis of a given system is completed on
the basis of the static view of its structural
decomposition. BPMN can be considered a procedural
knowledge representation, which represents a set of
interconnected procedures (Ligeza & Potempa, 2012).
BPMN is therefore considered a feasible approach to
study scenarios generated from the combination of
critical failure, near-miss and even safe states of
each interconnected procedure. Both business
experts and process experts can easily understand
the semantics of BPMN, so this tool is considered
suitable for graphically representing the scenarios
of interest.

The result of the SE work is used to conduct the
FMECA. To carry out the FMECA, multidisciplinary
experts are invited to form the team. The team
analyses system components for failure modes. Then
potential causes and effects are determined, but
internal-related or external-related failure modes
tend to be forgotten (DNV-RP-D102, 2012). Involving
a scenario-based approach (e.g. BPMN) offers the
opportunity to enrich the traditional FMECA. The SE
workshop combines knowledge from various
disciplines in one common platform and captures
multidimensional knowledge in one frame. Experts
from different disciplines analyse the same issue
from different angles and may find different
solutions (Su & Dou, 2013). The workshop assures
universal agreement on a conflicting issue,
eliminates bias toward severity rankings, and
carries out a detailed analysis of the system
structure as well as its process. In addition, a
generic FMECA contains no dynamic features of the
system being analyzed. Using scenario-based
analysis as the baseline can assist the risk
engineers in clarifying the context of each failure
mode. A similar approach, called scenario-based
FMEA, was discussed in Issad, Kloul, & Rauzy (2017).

After completing the FMECA, one can carry out a
well-rounded risk assessment on the basis of the
FMECA to provide indications for further
improvements regarding risk mitigation, resources
and modifications of the system structure.

The proposed method is illustrated by the following
analysis of the presented case.


4 ANALYSIS 


4.1 Preliminary analysis 


The preliminary analysis defines the preliminary
system concept to describe what the system should
do, without specifying any functionality or
embodiment. The analysis covers all the elements
that cannot be modified from an engineering
perspective, such as the external environment of
the system, users, and the legal and regulatory
framework. The analysis paves the way for all the
possible technical solutions to the stated problem.

Figure 3 illustrates the operational context of the
subsea compressor laboratory, which only indicates
what the system does without specifying how to
achieve the goal. Different models can be developed
on the basis of the defined operational context of
the subsea compressor laboratory. For instance, a
state diagram can be made to check the operational
constraints by combining the states of the
laboratory and those of its external systems, such
as the energy supply system. The complete
preliminary analysis lays a solid foundation for
the SE workshop, which completes the system
concept.
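
Combining the states of the laboratory with those of an external system can be sketched as a product of two state sets filtered by a constraint. The state names and the single constraint below are assumptions chosen for illustration; the actual state diagram of the laboratory is not given in this paper.

```python
from itertools import product

# Hypothetical state sets for the laboratory and its energy supply system.
LAB_STATES = ("stopped", "running")
SUPPLY_STATES = ("available", "unavailable")


def is_allowed(lab_state, supply_state):
    """Assumed constraint: the laboratory cannot run without energy supply."""
    return not (lab_state == "running" and supply_state == "unavailable")


# Enumerate every combined state and keep those that satisfy the constraint.
allowed = [
    (lab, supply)
    for lab, supply in product(LAB_STATES, SUPPLY_STATES)
    if is_allowed(lab, supply)
]
```

Each excluded combination is a candidate operational constraint that the later workshops would need to confirm or reject.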


4.2 SE workshop 


The SE workshop executes the analysis described in
Section 3, including the functional decomposition
and the physical decomposition. Functional
decomposition defines the different levels of
functionality based on the system mission. Physical
decomposition is based on the Process &
Instrumentation Diagrams (P&ID) and the layout of
each component, to check physical interactions
among subcomponents and allocations. Figure 4
illustrates the physical decomposition of the
selected system, the turbocharger, to exemplify
some key activities in the SE workshop.

Figure 3. Operational contexts of subsea compressor laboratory.

Figure 5. Simplified BPMN for overpressure scenario.

The decomposition is rather easy to apply for
analyzing the functionality and physical structure
of the system. One can refine all the operational
contexts with sufficient detail and trace the
changes of functionality and structure through this
method. However, this method only describes the
system concept statically. To complete a
well-rounded risk analysis, we have to carry out a
scenario analysis that dynamically describes the
events that trigger the transitions between the
operational contexts.

As discussed before, BPMN models are convenient for
visualizing practical scenarios. Figure 5 presents
one selected accident scenario. In Figure 5, the
laboratory is divided into four lanes, including a
control unit that is not illustrated in Figure 1.
The interfaces (information exchange) between the
control unit and the other lanes are indicated by
the message flows, i.e. the dashed lines. Three
decision points are presented to accommodate
different missions. The identified operational
contexts are also reflected in the BPMN model.
Through BPMN, one can explicitly observe how the
control unit influences the whole process of subsea
gas compression and connects each object within
distinct activities. Here, focus is given to the
overall process, system interactions, interfaces
and the sequence of the workflow; no time/temporal
aspects are considered. Different consequences are
specified depending on the success of activities,
i.e. the availability of the corresponding objects.
By generating accident scenarios, designers are
able to identify additional needs and carry out
complementary analysis to improve the design
proposal; see also the summarized result in
Section 5.


4.3 FMECA workshop 


The results from the SE workshop are integrated in
the FMECA. One advantage is the explicit
identification and evaluation of the potential
failure modes across physical boundaries. The
effect of a failure on the subsystems can be
studied by analyzing the sequence flow. For
instance, the turbocharger compressor receives the
flow from the gas intake under test scenario 2. If
there is a failure or malfunction of the gas intake
component (e.g. leakage in the pipeline or
contamination with liquid from the surroundings),
the compressor can be damaged or experience a
temporary loss of efficiency. Such a risk can be
registered immediately when developing the FMECA.
Another advantage is that BPMN highlights the major
tasks within a block instead of a corpus of
components. In some practices, analysts produce the
FMECA based on checklists of physical components
(for example Figure 4) without analyzing the
entirety of the given system concept. By adopting
the BPMN model in the SE workshop, risk analysts are
able to create an FMECA that covers the most
significant accident scenarios, which saves a large
amount of repetitive and unbalanced work when
coordinating the contributions from different
design teams.
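
The gas intake example can be registered as FMECA rows and ranked by criticality. The sketch below is an assumption-laden illustration: the paper does not state its criticality scheme, so the product of severity and likelihood on 1 to 5 scales is assumed here, the numeric scores are placeholders rather than results of the workshops, and the pairing of each failure mode with one effect is chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class FmecaRow:
    component: str
    failure_mode: str
    effect: str
    severity: int    # assumed scale: 1 (negligible) to 5 (catastrophic)
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)

    @property
    def criticality(self) -> int:
        # Assumed ranking rule: severity multiplied by likelihood.
        return self.severity * self.likelihood


# Placeholder scores, for illustration only.
rows = [
    FmecaRow("gas intake", "leakage in pipeline",
             "temporary loss of compressor efficiency", 3, 2),
    FmecaRow("gas intake", "liquid contamination",
             "damage to turbocharger compressor", 4, 2),
]

# Highest criticality first.
ranked = sorted(rows, key=lambda row: row.criticality, reverse=True)
```

Keeping the affected component and the scenario in each row is what lets a failure mode that crosses a physical boundary be recorded against the subsystem it damages.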


5 RESULT 


Table 1 summarizes some key implications raised
by the two workshops. The differences between the
laboratory environment and the subsea environment
are considered and discussed in the last column of
Table 1.


Table 1. Improvement proposal for subsea gas boosting laboratory.

| Key issues | Description | Decisions | Relevance for industry-size case |
|------------|-------------|-----------|----------------------------------|
| Installation of additional valves, sensors or transmitter | Flow sensor in the water inlet | Must do. The humidity must be controlled to map the characteristics of flow. | Not relevant for the flow from the real gas field. |
| | Pressure sensor in the oil loop | Should do. High pressure may blow the pipe. | Relevant. |
| | Level transmitter in the oil reservoir | Should do. The implementation can assist in the maintenance (oil refill), especially when the laboratory is continuously run or stop using for a long time (volatilization of oil). | Highly relevant. |
| | Flow meter before compressor | Can consider for smooth operation. Compressor efficiency decreases with increased mass flow. | Highly relevant. |
| | Flow mixer | Can consider. The stratified flow runs a higher risk of damaging the compressor than a dispersed homogenous flow. | Not relevant, stratified flows are fairly common even in the real gas field. |
| | Pressure relief valve in the separator | Should do. The implementation will reduce the risk of separator blow. | The relevance depends on the type and size of separator. |
| Installation of additional components | Logic control unit connected to all sensors and controllers | Must do, to control the process easily and from a remote location. | Highly relevant. |
| | Additional filters in the inlet water line | Not necessary if have confidence that supply water is clean. | The relevance depends on the needs of removing sands or other particles, requiring expertise from reservoir management. |
| | A protective wall around the whole lab | Should do, which will prevent smoke dispersion and reduce fire spread. | Highly relevant. |
| | Protective housing for piston compressor, turbocharger, water pump after the separator | Can consider, as it will prevent from damage in case of water flooding. The cost analysis is needed. | Highly relevant. |
| Maintenance strategy | Leakage test before the operation | Must do, as it is the most cost-efficient mean to check the integrity of the system. | Only relevant for the site-acceptance test. The leakage sensors are implemented for this purpose. |
| | Periodic dust cleaning | Must do, as it will reduce the blockage of valves and pipe network. | Not relevant. |
| | Documentation of operation strategy of system and valves | Must do, as it will reduce human error in operation. | Mostly important. |


The results are obtained by considering the major
risks within the existing design. One limitation of
the current analysis is that the engineering
efforts behind each decision are not included. The
remaining work consists of complementary analyses,
such as life cycle cost analysis, to support
decision-making in real practice.


6 DISCUSSION 


This collaborative model of SE workshop and FMECA
workshop provides access to a comprehensive
knowledge network of practical experience and
expert understanding. This enables the
identification of potential hazards and the
implementation of appropriate measures for the
prevention of accidents. The proposed method
enables experts to capture and document the
experience they have gained throughout different
projects. It can be implemented both in the design
phase and in the modification phase.

In the traditional approach to FMECA, failure
analysis is mainly carried out at the component
level; functional interactions between the observed
components are not included (Bertsche, 2008).
Failure analysis is carried out for the individual
process steps. The entire production process is not
thoroughly analyzed; for example, the layout of
individual components is not considered (Bertsche,
2008). The benefits of developing the FMECA through
the identification of key scenarios are listed
below, based on experiences from practical cases:


• Cross-system effects

It is possible to achieve a holistic and systematic
view of the system through the SE workshop. The
functional analysis assists in comprehending the
effect of a failure on subsystem functions and
system functions.

• Reasonable and reachable risk reduction measures

Risk reduction measures cover several factors, e.g.
maintenance scheduling, decision-making support and
barrier management. The architectural analysis can
clarify the main constraints that limit the choice
of these factors. The coordination between the
FMECA workshop and the SE workshop is therefore
concerned with whether the selected structure
offers the best balance of these factors.

• More than one single dimension

The SE workshop involves experts (e.g. designers,
operators, managers) in performing BPMN,
operational, functional and physical analyses. This
allows the interests, expertise and needs of each
stakeholder to be included at the very beginning of
the risk analysis and reflected in the development
of the FMECA.

In the proposed method, important aspects such as
the interaction between components and
environmental effects are considered, which gives
more confidence in risk analysis and in decision
making. An unclear idea of a system introduces
uncertainty in the risk assessment process. The
consistent representation of knowledge and system
assures that the results of risk estimates can be
integrated in decision making with less doubt. Only
system-related uncertainties are dealt with here,
which can be mitigated with system knowledge based
on experience and expertise.

Presenting the whole system in a single diagram
such as a P&ID blurs the message being
communicated. This paper proposes structural
breakdown diagrams to represent the physical and
functional behavior of the system. The analyses
include state as a behavior attribute of the system
in the model, which describes interdependency and
supports performance-based assessment.

When the system is complex enough, and one
component serves different functions of the system,
it can never be edited as a whole or only through
physical or functional dissection. Applying both
functional and physical analysis gives a
hierarchical breakdown with branches that show the
connections between components. When one component
is out of order, the effect on other components or
on the whole system can be observed easily by
following the connections, so management personnel
can set their work orders easily.

If a component is added to or removed from the
system, editing the decomposition diagram is easy,
which in turn makes it easy to modify the FMECA. In
traditional approaches, a change of system often
means that risk assessors have to go through the
whole FMECA. Prior establishment of the physical
and functional analyses allows users to modify the
knowledge easily in case of a change of system.
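
One way to avoid going through the whole FMECA after a change is to tag each row with the components it depends on, then select only the rows that touch a changed component. The row identifiers and component tags below are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical FMECA rows, each tagged with the components it depends on.
FMECA_ROWS = [
    {"id": "FM-1", "components": {"gas intake", "turbocharger"}},
    {"id": "FM-2", "components": {"separator"}},
    {"id": "FM-3", "components": {"piston compressor"}},
]


def rows_to_review(rows, changed_components):
    """Return the ids of the FMECA rows that touch any changed component."""
    changed = set(changed_components)
    return [row["id"] for row in rows if row["components"] & changed]
```

A newly added component that no row mentions yet, such as a flow mixer, returns an empty list, which signals that new rows have to be written rather than existing ones revised.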


Effective knowledge transfer and integrated
knowledge management help to make a more resilient
and reliable system by reducing vulnerability. They
help to develop a better plan of proactive measures
to cope with emergencies. These workshops include
models for performance and reliability, previous
experience, realism of assumptions,
representativeness of scenarios along with the most
critical issues, thoroughness of analysis and
first-class deliverables.

Effective maintenance also requires an integrated
information and knowledge system from which the
maintenance team can get output from other
disciplines to make proper maintenance records and
work orders. Capturing system knowledge effectively
facilitates full retrieval of information to
implement preventive maintenance as opposed to
corrective maintenance. Failing to do so increases
cost significantly (Motawa & Almarshad, 2013).

Risk engineers often do the risk analysis only to
check whether the risk level of the specified
activity complies with the standard. In this
respect, a greater chance for improvement remains
out of scope and behind the paperwork. By arranging
the thinking around the systematic activities
described earlier, risk can be communicated
effectively and efficiently. The quality of risk
assessment can be improved by capturing and
identifying all possible issues systematically. By
sharing information with one another, it is
possible to get a better risk picture in more than
one single dimension (Su & Dou, 2013).

However, going through all details is time
consuming, as there remain many overlaps and
repetitions among functions. It is difficult to
define the scope of a subsystem when one single
component performs two functions. It is also
questionable whether the previous work can be
assumed to be reliable enough. Obtaining the
necessary expert and stakeholder opinions in a
timely manner needs proper planning. The execution
of the proposed method requires a high management
capability in the organization. An organization
should be committed to providing the resources,
freedom and time needed to acquire information.
Implementing the analysis in the development phase
makes it possible to identify weak spots and to
carry out comparative tests.


7 CONCLUSION 


Modern society deals with a larger spectrum of
risk and a larger spectrum of stakeholders, of
interest, value and knowledge. Recognizing the
wider scope of risk leads to a positive evolution
in managing risk. A structured communication model
can help in this respect. In this paper, we
presented how SE may help in capturing different
nodes of a system for effective risk evaluation,
through a basic analysis technique like FMECA. The
proposed method is checked with a subsea gas
boosting laboratory, where the case study gives
more confidence in making development proposals and
in risk mitigation. By doing this type of analysis
in the development phase, it is possible to
identify weak spots and to carry out comparative
tests. Modifications in the design phase save cost,
and preventive measures can be taken.

As a future improvement, a detailed consequence
study can be included, and the method should be
checked with other applications. The paper also
suggests AltaRica 3.0 to encode the FMECA for
quantitative risk assessment. This recent modelling
language has been brought to the forefront in the
risk analysis community; more details are given in
(Prosvirnova, 2014). The formalism combines a
structuring paradigm, S2ML (Batteux, Prosvirnova, &
Rauzy, 2015), with a sufficient mathematical
framework, GTS (Rauzy, 2008). With the support of
this modelling language, analysts can describe the
system structure as well as the operational process.

network process and matter element analysis method

ABSTRACT: This study proposes, for the first time, a model for evaluating flight landing quality based
on the Analytical Network Process (ANP) and the Matter Element Analysis Method (MEAM). To address the
complexity of synthetic landing quality evaluation and the incompatibility between single-factor
identification and hierarchical classification, the model builds matter element matrices, uses ANP to
calculate the weight of each parameter, and calculates correlation degrees to evaluate the landing quality.
The results of a case study show that the model can be applied to evaluate landing quality and that,
compared with the classic Comprehensive Weighting Method (CWM), its evaluation result is more objective,
accurate and reasonable.


1 INTRODUCTION 


Aircraft landing is one of the most important
stages of a flight. It is strongly affected by
environmental and human factors, and its potential
failure rate is the highest of all flight phases
(Feng Yachang et al. 1993). Fatal accident data show
that about 23 percent of all fatal accidents from
2003 to 2012 took place in the landing phase, although
this phase accounts for only about 1% of the whole
flight (X. H. Wang et al. 2009). It is therefore
important to analyze landing quality in order to find
ways to improve it and make landing safer.

Nowadays, the Quick Access Recorder (QAR) is
widely used for aircraft health status monitoring.
QAR data contain a variety of flight information and
are used in many methods to analyze landing status
and aid decision-making during the landing process.

There are few studies on landing quality. Zhang
Rongjia (2012) analyzed and discussed a heavy landing
of an A321 flight based on QAR data, and Feng Yachang
et al. (1993) used the loop separation parameter
method to evaluate landing quality. Although studies
on landing quality are few, there is a considerable
literature on landing in general. Since landing
quality is one aspect of landing, it is necessary to
review this literature. Julia A. Bennell et al. (2017)
considered the scheduling of aircraft landings on a
single runway, built algorithms for creating landing
schedules, and verified the algorithms with random
test data and real data from London Heathrow airport.
B.S. Girish (2016) proposed a hybrid particle swarm
optimization algorithm in a rolling horizon framework
to solve the Aircraft Landing Problem (ALP).

Both the Analytical Network Process (ANP) and the
Matter Element Analysis Method (MEAM) are common
methods. ANP describes the relationships among the
elements of a system by a network structure rather
than a hierarchical structure. The indexes in a
system may influence each other, and ANP is able to
capture the correlation among indexes. Zhou Lisha
(2009) used ANP to evaluate power customer
satisfaction. Wang Haolun et al. (2014) applied ANP to
SWOT analysis for enterprise strategy decisions. Xiang
Yong & Ren Hong (2014) evaluated the smart city based
on the ANP-TOPSIS method. MEAM is widely used in
evaluation: Huang Huiling et al. (2010) used MEAM to
evaluate land eco-security, and Huang Jian et al.
(2007) used it to evaluate power quality. ANP and
MEAM are both widely applied, yet they have not been
applied together. This paper integrates them for the
first time: ANP is applied to calculate the weight of
each index, and MEAM is applied to evaluate landing
quality. Five flights with their respective flight
data are studied to obtain reasonable analysis results
and a complete analysis method.

The rest of the paper is organized as follows.
Section 2 introduces the application of ANP to
calculate index weights. Section 3 introduces the
landing quality evaluation model using MEAM. Section 4
presents a case study. Finally, Section 5 draws
conclusions.


2 APPLYING ANP TO CALCULATE 
INDEX WEIGHT 


In 1996, Saaty proposed the Analytical Network
Process (ANP) on the basis of the Analytic Hierarchy
Process (AHP). ANP allows multiple indexes, whether
quantifiable or difficult to quantify, and considers
the association or feedback relationships between
elements at different levels and between elements
within the same element group. Compared with AHP,
ANP therefore reflects and describes the decision
problem more realistically. This study adopts the
method described by Sun Hongcai (2011), which has the
following steps:


2.1 Determining element hyper matrix 


According to the scale defined in Table 1, take
one element $P_s$ ($s = 1, 2, \ldots, m$) in the
control layer as the criterion and one element
$e_{jl}$ ($l = 1, 2, \ldots, n_j$) in the element
group $C_j$ ($j = 1, 2, \ldots, N$) as the
sub-criterion. The elements in each element group
$C_i$ ($i = 1, 2, \ldots, N$) are compared according
to their impact on $e_{jl}$, so that
$N \times \sum_{j=1}^{N} n_j$ judgement matrices are
obtained. The eigenvector corresponding to the
maximum eigenvalue of each matrix is calculated, and
a consistency test is performed. If the test is
passed, the eigenvector is normalized; if not, the
comparison matrix is constructed again. The
normalized eigenvectors form a super matrix of order
$\sum_{j=1}^{N} n_j$, which consists of $N \times N$
block matrices.


Table 1. Judgment scale definition.

Scale       Definition

1           Factor i is as important as factor j
3           Factor i is slightly more important than factor j
5           Factor i is obviously more important than factor j
7           Factor i is strongly more important than factor j
9           Factor i is extremely more important than factor j
2, 4, 6, 8  The median of the adjacent judgments above
Reciprocal  If the ratio of factor i to factor j is $a_{ij}$, the
            ratio of factor j to factor i is $a_{ji} = 1/a_{ij}$
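
The eigenvector and consistency steps of Section 2.1 can be illustrated with a minimal sketch. It is not the procedure of Sun Hongcai (2011) or of any particular software: the judgement matrix `A` is hypothetical, the eigenvector is approximated by power iteration, and the random index values and the 0.1 acceptance threshold are the ones commonly attributed to Saaty, which the paper itself does not state.

```python
def priority_vector(matrix, iterations=100):
    """Approximate the principal eigenvector (normalized to sum 1) and the
    maximum eigenvalue of a positive judgement matrix by power iteration."""
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(matrix[i][k] * w[k] for k in range(n)) for i in range(n)]
        total = sum(nxt)
        w = [x / total for x in nxt]
    # lambda_max is estimated as the mean of (A w)_i / w_i
    aw = [sum(matrix[i][k] * w[k] for k in range(n)) for i in range(n)]
    lam = sum(aw[i] / w[i] for i in range(n)) / n
    return w, lam


# Assumption: commonly tabulated random index (RI) values for n = 3, 4, 5
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12}


def consistency_ratio(lam, n):
    """CR = CI / RI with CI = (lambda_max - n) / (n - 1)."""
    ci = (lam - n) / (n - 1)
    return ci / RANDOM_INDEX[n]


# Hypothetical judgement matrix on the 1-9 scale of Table 1
A = [[1.0, 3.0, 5.0],
     [1.0 / 3.0, 1.0, 3.0],
     [1.0 / 5.0, 1.0 / 3.0, 1.0]]

w, lam = priority_vector(A)
cr = consistency_ratio(lam, len(A))
print([round(x, 3) for x in w], round(lam, 4), round(cr, 3))
```

A matrix whose consistency ratio exceeds the chosen threshold would be returned to the experts for revision, which corresponds to constructing the comparison matrix again.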


2.2 Determining the weight matrix of elements 


Take one element $P_s$ ($s = 1, 2, \ldots, m$) in
the control layer and one element group as the
criteria. The other element groups are compared with
respect to these criteria, and N judgement matrices
are constructed. The consistency of these judgement
matrices must also be checked and their eigenvectors
found; the normalized eigenvectors are then assembled
into a weight matrix of order N.


2.3 Determining weighted super matrices 


Multiplying each element of the weight matrix from
step 2.2 by the corresponding block of the super
matrix from step 2.1 forms the weighted super matrix.
The weighted super matrix reflects the control action
of the element groups on the elements and the
feedback of the elements to the element groups.


2.4 Calculating index weight 


After the weighted super matrix is obtained, the
relative ranking vector of the elements, that is, the
weight of each element, is determined by the
calculation method that corresponds to the type of
the matrix.
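
Steps 2.3 and 2.4 can be sketched as follows. This is an illustration under stated assumptions, not the calculation used in the paper: the two clusters, the block values in `BLOCKS` and the cluster weights in `CLUSTER_WEIGHTS` are hypothetical, and the weights are taken from the limit of the powers of the weighted super matrix, which is one common calculation for a strictly positive, column-stochastic matrix.

```python
def weighted_supermatrix(blocks, cluster_weights):
    """Scale block (i, j) of the super matrix by cluster weight a_ij and
    assemble all blocks into one flat matrix (a list of rows)."""
    rows = []
    for i, block_row in enumerate(blocks):
        for r in range(len(block_row[0])):      # rows inside cluster i
            row = []
            for j, block in enumerate(block_row):
                row.extend(cluster_weights[i][j] * x for x in block[r])
            rows.append(row)
    return rows


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def limit_weights(w, tol=1e-12, max_iter=1000):
    """Raise the matrix to successive powers until it stops changing;
    every column of the limit then holds the element weights."""
    cur = w
    n = len(w)
    for _ in range(max_iter):
        nxt = mat_mul(cur, w)
        diff = max(abs(nxt[i][j] - cur[i][j])
                   for i in range(n) for j in range(n))
        cur = nxt
        if diff < tol:
            return [row[0] for row in cur]
    raise RuntimeError("powers of the matrix did not converge")


# Hypothetical example: 2 clusters with 2 elements each.
# BLOCKS[i][j] holds the local priorities of the elements of cluster i
# with respect to each element of cluster j (columns sum to 1).
BLOCKS = [
    [[[0.6, 0.3], [0.4, 0.7]], [[0.5, 0.2], [0.5, 0.8]]],
    [[[0.7, 0.4], [0.3, 0.6]], [[0.1, 0.5], [0.9, 0.5]]],
]
# CLUSTER_WEIGHTS[i][j] is the weight a_ij of cluster i under cluster j.
CLUSTER_WEIGHTS = [[0.4, 0.7],
                   [0.6, 0.3]]

W = weighted_supermatrix(BLOCKS, CLUSTER_WEIGHTS)
weights = limit_weights(W)
print([round(x, 4) for x in weights])
```

Because every column of the weighted super matrix sums to 1, the resulting weights also sum to 1 and can be used directly as index weights.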


3 QUALITY EVALUATING MODEL 
OF LANDING USING MEAM 


MEAM is a set of rules and methods for studying
contradictory problems. It is an interdisciplinary
subject at the edge of systems science, mathematics
and thinking science, and it is widely applied in the
natural and social sciences. It rests on two
theories: matter element theory, and the mathematical
tools based on extension sets (Cai Wen 1999). In
matter element analysis, the ordered triple of
matter, characteristic and quantity is regarded as
the basic element for describing a matter, denoted as
R = (N, C, V), where N refers to the matter, C refers
to the characteristics that show the nature and
function of the matter and the relationship between
behavior and the matter, and V is the quantity of C,
which determines the scope of a characteristic. MEAM
studies the extension of matter elements by means of
extension sets, which are determined by correlation
functions and the specific conditions of the matter
element.


3.1 Index system and the level limit of each index 


Among the recorded data there are 6 indexes:
ground velocity, distance to be flown, roll angle,
angle of pitch, lateral acceleration and normal
acceleration. Ground velocity and distance to be
flown both reflect conditions related to the ground.
Similarly, roll angle and angle of pitch are angle
indexes, and lateral acceleration and normal
acceleration belong to the acceleration category.
Considering these characteristics and the use of the
Super Decisions software, which implements ANP, this
paper builds the index system shown in Figure 1.

Each landing quality index is rated as
“excellent quality (level 1)”, “quality between
excellent and good (level 2)”, “good quality
(level 3)”, “quality between good and qualified
(level 4)” or “qualified (level 5)”, recorded as
Q1, Q2, Q3, Q4 and Q5. Using 5 levels not only avoids
the excessive deviation caused by too few levels but
also keeps the amount of calculation within a certain
range (Xu Yonghai & Xiao Xiangning 2004). The level
limits of each index are determined by analyzing raw
flight data (see Table 2).


3.2 The establishment of matter element model 


For one flight, V is the quantification of index C,
and R = (N, C, V) is the basic element (Cai Wen et al.
1997). Angle of pitch, distance to be flown, roll
angle, normal acceleration, lateral acceleration and
ground velocity are used to describe the landing
quality. They are recorded as
$c_1, c_2, c_3, c_4, c_5, c_6$, and the corresponding
quantities are $v_1, v_2, v_3, v_4, v_5, v_6$, so the
matter element can be expressed as

$$
R(N, C, V) =
\begin{bmatrix}
N & c_1 & v_1 \\
  & c_2 & v_2 \\
  & \vdots & \vdots \\
  & c_6 & v_6
\end{bmatrix}
\tag{1}
$$

The classical field matter element of the evaluation
of landing quality is

$$
R_{0j}(N_{0j}, C_i, V_{0ij}) =
\begin{bmatrix}
N_{0j} & c_1 & v_{01j} \\
       & c_2 & v_{02j} \\
       & \vdots & \vdots \\
       & c_6 & v_{06j}
\end{bmatrix}
=
\begin{bmatrix}
N_{0j} & c_1 & \langle a_{01j}, b_{01j} \rangle \\
       & c_2 & \langle a_{02j}, b_{02j} \rangle \\
       & \vdots & \vdots \\
       & c_6 & \langle a_{06j}, b_{06j} \rangle
\end{bmatrix}
\tag{2}
$$

where $N_{0j}$ is level j of landing quality: j = 1
indicates that the quality of the landing is
excellent, j = 2 that it is between excellent and
good, j = 3 that it is good, j = 4 that it is between
good and qualified, and j = 5 that it is qualified.
$c_i$ are the six characteristics; i = 1, ..., 6
represent angle of pitch, distance to be flown, roll
angle, normal acceleration, lateral acceleration and
ground velocity, respectively. $v_{0ij}$ is the range
of $c_i$ at level j, namely the classical field. The
classical field is an interval that indicates the
basic range of variation of landing quality at that
level; it is recorded as
$v_{0ij} = \langle a_{0ij}, b_{0ij} \rangle$
($i = 1, 2, \ldots, 6$).

The segment field matter element is

$$
R_p(N_p, C_i, V_{pi}) =
\begin{bmatrix}
N_p & c_1 & v_{p1} \\
    & c_2 & v_{p2} \\
    & \vdots & \vdots \\
    & c_6 & v_{p6}
\end{bmatrix}
=
\begin{bmatrix}
N_p & c_1 & \langle a_{p1}, b_{p1} \rangle \\
    & c_2 & \langle a_{p2}, b_{p2} \rangle \\
    & \vdots & \vdots \\
    & c_6 & \langle a_{p6}, b_{p6} \rangle
\end{bmatrix}
\tag{3}
$$

where $N_p$ denotes all quality levels, $c_i$ are the
six characteristics, and $v_{pi}$ is the range of
$c_i$ over all levels, namely the segment field:
$v_{pi} = \langle a_{pi}, b_{pi} \rangle$
($i = 1, 2, \ldots, 6$). Obviously,
$v_{0ij} \subseteq v_{pi}$ ($i = 1, 2, \ldots, 6$).

Figure 1. Evaluation index system of landing quality.

Table 2. Level limit of landing quality index.

Indexes                      Q1               Q2               Q3               Q4               Q5
Angle of pitch/°             [0,0.800)        [0.800,1.600)    [1.600,2.400)    [2.400,3.200)    [3.200,4.000]
Distance to be flown/m       [0,50.00)        [50.00,100.00)   [100.00,150.00)  [150.00,200.00)  [200.00,250.00]
Roll angle/°                 [0,1.000)        [1.000,2.000)    [2.000,3.000)    [3.000,4.000)    [4.000,5.000]
Normal acceleration/m·s⁻²    [13.000,17.400)  [17.400,21.800)  [21.800,26.200)  [26.200,30.600)  [30.600,35.000]
Lateral acceleration/m·s⁻²   [0,1.200)        [1.200,2.400)    [2.400,3.600)    [3.600,4.800)    [4.800,6.000]
Ground velocity/m/s          [23.000,25.000)  [25.000,27.000)  [27.000,29.000)  [29.000,31.000)  [31.000,33.000]

For a flight to be evaluated, the measurement data
and analysis results are represented by the matter
element R, called the matter element of the landing
quality to be evaluated, where N stands for the
quality level and $v_i$ is the quantity of $c_i$:

$$
R(N, C, V) =
\begin{bmatrix}
N & c_1 & v_1 \\
  & c_2 & v_2 \\
  & \vdots & \vdots \\
  & c_6 & v_6
\end{bmatrix}
\tag{4}
$$


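An extension set is characterized by its correlation
function, whose range is the whole real axis.
Expressing the correlation function of the extension
set by an algebraic formula turns the qualitative
question into a quantitative one. The correlation
function value is calculated by formula (5):

$$
K_j(v_i) =
\begin{cases}
-\dfrac{\rho(v_i, v_{0ij})}{|v_{0ij}|}, & v_i \in v_{0ij} \\[2ex]
\dfrac{\rho(v_i, v_{0ij})}{\rho(v_i, v_{pi}) - \rho(v_i, v_{0ij})}, & v_i \notin v_{0ij}
\end{cases}
\tag{5}
$$

and

$$
\begin{aligned}
\rho(v_i, v_{0ij}) &= \left| v_i - \tfrac{1}{2}(a_{0ij} + b_{0ij}) \right| - \tfrac{1}{2}(b_{0ij} - a_{0ij}) \\
\rho(v_i, v_{pi}) &= \left| v_i - \tfrac{1}{2}(a_{pi} + b_{pi}) \right| - \tfrac{1}{2}(b_{pi} - a_{pi})
\end{aligned}
\tag{6}
$$

$(i = 1, 2, \ldots, 6;\ j = 1, 2, \ldots, 5)$, where
$|v_{0ij}| = b_{0ij} - a_{0ij}$ is the length of the
classical interval.

The comprehensive correlation degree is calculated
by formula (7):

$$
K_j(P_0) = \sum_{i=1}^{6} w_i K_j(v_i)
\tag{7}
$$

where $w_i$ represents the weight of each evaluation
index and $K_j(P_0)$ represents the correlation
degree between the landing quality and level j. If
$K_j(P_0) = \max_{k = 1, 2, \ldots, 5} K_k(P_0)$, the
quality of the landing $P_0$ to be evaluated is
level j.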
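
As a worked illustration of formulas (5) and (6), the sketch below evaluates the correlation function for a single index value. The function names are hypothetical, and treating both interval end points as inside the classical field is a simplifying assumption for values that fall exactly on a level limit. The input is the angle of pitch of flight No. 1 from Table 3 with the level limits of Table 2.

```python
def rho(v, a, b):
    """Extension distance between the point v and the interval <a, b>,
    as in formula (6)."""
    return abs(v - (a + b) / 2) - (b - a) / 2


def correlation(v, classical, segment):
    """Correlation function K_j(v_i) of formula (5).

    classical -- (a0, b0), classical field of the index at level j
    segment   -- (ap, bp), segment field of the index over all levels
    """
    a0, b0 = classical
    ap, bp = segment
    r0 = rho(v, a0, b0)
    if a0 <= v <= b0:                 # v lies inside the classical field
        return -r0 / (b0 - a0)
    return r0 / (rho(v, ap, bp) - r0)


# Angle of pitch: level limits from Table 2, value of flight No. 1
pitch_levels = [(0.0, 0.8), (0.8, 1.6), (1.6, 2.4), (2.4, 3.2), (3.2, 4.0)]
pitch_segment = (0.0, 4.0)
pitch_no1 = 1.387

k = [correlation(pitch_no1, level, pitch_segment) for level in pitch_levels]
print(round(k[0], 3), round(k[1], 3))   # -0.297 0.266
```

The value is positive only for level 2, the level whose classical field contains 1.387, and becomes more negative the further a level lies from the measured value.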


4 CASE STUDY 


4.1 Landing quality indexes of the subject to be 
assessed 


Considering the use of MEAM, this case study
selects 5 flights together with their data at the
time of landing (see Table 3).


Table 3. Flight data.

Evaluation index             No.1     No.2     No.3     No.4     No.5
Angle of pitch/°             1.387    1.200    3.041    3.159    1.939
Distance to be flown/m       39.67    210.57   230.44   88.50    27.46
Roll angle/°                 0.879    4.136    1.538    0.088    0.857
Normal acceleration/m·s⁻²    16.976   14.661   34.724   13.118   19.291
Lateral acceleration/m·s⁻²   1.102    0.157    0.315    0.472    3.307
Ground velocity/m/s          31.513   26.783   28.496   27.446   28.564


4.2 Matter element model for landing 
quality evaluation 


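The classical field matrices and the segment field
matrix are

$$
R_{01} = \begin{bmatrix}
Q_1 & c_1 & \langle 0, 0.800 \rangle \\
    & c_2 & \langle 0, 50.00 \rangle \\
    & c_3 & \langle 0, 1.000 \rangle \\
    & c_4 & \langle 13.000, 17.400 \rangle \\
    & c_5 & \langle 0, 1.200 \rangle \\
    & c_6 & \langle 23.000, 25.000 \rangle
\end{bmatrix}
\quad
R_{02} = \begin{bmatrix}
Q_2 & c_1 & \langle 0.800, 1.600 \rangle \\
    & c_2 & \langle 50.00, 100.00 \rangle \\
    & c_3 & \langle 1.000, 2.000 \rangle \\
    & c_4 & \langle 17.400, 21.800 \rangle \\
    & c_5 & \langle 1.200, 2.400 \rangle \\
    & c_6 & \langle 25.000, 27.000 \rangle
\end{bmatrix}
$$

$$
R_{03} = \begin{bmatrix}
Q_3 & c_1 & \langle 1.600, 2.400 \rangle \\
    & c_2 & \langle 100.00, 150.00 \rangle \\
    & c_3 & \langle 2.000, 3.000 \rangle \\
    & c_4 & \langle 21.800, 26.200 \rangle \\
    & c_5 & \langle 2.400, 3.600 \rangle \\
    & c_6 & \langle 27.000, 29.000 \rangle
\end{bmatrix}
\quad
R_{04} = \begin{bmatrix}
Q_4 & c_1 & \langle 2.400, 3.200 \rangle \\
    & c_2 & \langle 150.00, 200.00 \rangle \\
    & c_3 & \langle 3.000, 4.000 \rangle \\
    & c_4 & \langle 26.200, 30.600 \rangle \\
    & c_5 & \langle 3.600, 4.800 \rangle \\
    & c_6 & \langle 29.000, 31.000 \rangle
\end{bmatrix}
$$

$$
R_{05} = \begin{bmatrix}
Q_5 & c_1 & \langle 3.200, 4.000 \rangle \\
    & c_2 & \langle 200.00, 250.00 \rangle \\
    & c_3 & \langle 4.000, 5.000 \rangle \\
    & c_4 & \langle 30.600, 35.000 \rangle \\
    & c_5 & \langle 4.800, 6.000 \rangle \\
    & c_6 & \langle 31.000, 33.000 \rangle
\end{bmatrix}
\quad
R_p = \begin{bmatrix}
Q_p & c_1 & \langle 0, 4.000 \rangle \\
    & c_2 & \langle 0, 250.00 \rangle \\
    & c_3 & \langle 0, 5.000 \rangle \\
    & c_4 & \langle 13.000, 35.000 \rangle \\
    & c_5 & \langle 0, 6.000 \rangle \\
    & c_6 & \langle 23.000, 33.000 \rangle
\end{bmatrix}
$$

The matter element matrices of the flights to be
evaluated are

$$
R_1 = \begin{bmatrix}
N_1 & c_1 & 1.387 \\
    & c_2 & 39.67 \\
    & c_3 & 0.879 \\
    & c_4 & 16.976 \\
    & c_5 & 1.102 \\
    & c_6 & 31.513
\end{bmatrix}
\quad
R_2 = \begin{bmatrix}
N_2 & c_1 & 1.200 \\
    & c_2 & 210.57 \\
    & c_3 & 4.136 \\
    & c_4 & 14.661 \\
    & c_5 & 0.157 \\
    & c_6 & 26.783
\end{bmatrix}
$$

$$
R_3 = \begin{bmatrix}
N_3 & c_1 & 3.041 \\
    & c_2 & 230.44 \\
    & c_3 & 1.538 \\
    & c_4 & 34.724 \\
    & c_5 & 0.315 \\
    & c_6 & 28.496
\end{bmatrix}
\quad
R_4 = \begin{bmatrix}
N_4 & c_1 & 3.159 \\
    & c_2 & 88.50 \\
    & c_3 & 0.088 \\
    & c_4 & 13.118 \\
    & c_5 & 0.472 \\
    & c_6 & 27.446
\end{bmatrix}
$$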
$$
R_5 = \begin{bmatrix}
N_5 & c_1 & 1.939 \\
    & c_2 & 27.46 \\
    & c_3 & 0.857 \\
    & c_4 & 19.291 \\
    & c_5 & 3.307 \\
    & c_6 & 28.564
\end{bmatrix}
$$

Table 4. Correlation value of each flight and its respective indexes.

Correlation degree   No. 1    No. 2    No. 3    No. 4    No. 5
K1(v1)               -0.297   -0.25    -0.700   -0.737   -0.370
K2(v1)                0.266    0.5     -0.600   -0.647   -0.149
K3(v1)               -0.133   -0.25    -0.401   -0.468    0.424
K4(v1)                0.422   -0.5      0.199    0.051    0.192
K5(v1)               -0.567   -0.625   -0.142   -0.046   -0.394
K1(v2)                0.207   -0.803   -0.902   -0.303    0.451
K2(v2)               -0.207   -0.737   -0.870    0.23    -0.451
K3(v2)               -0.603   -0.606   -0.804   -0.115   -0.725
K4(v2)               -0.736   -0.211   -0.721   -0.41    -0.817
K5(v2)               -0.802    0.211    0.391   -0.558   -0.863
K1(v3)                0.121   -0.784   -0.259    0.088    0.143
K2(v3)               -0.121   -0.712    0.462   -0.912   -0.143
K3(v3)               -0.561   -0.568   -0.231   -0.956    0.572
K4(v3)               -0.707   -0.136    0.487   -0.971   -0.714
K5(v3)               -0.780    0.136   -0.616   -0.978   -0.786
K1(v4)                0.096   -0.378   -0.984    0.027   -0.231
K2(v4)               -0.096   -0.623   -0.979   -0.973    0.430
K3(v4)               -0.548   -0.811   -0.969   -0.987   -0.285
K4(v4)               -0.699   -0.874   -0.937   -0.991   -0.523
K5(v4)               -0.774   -0.906    0.063   -0.993   -0.643
K1(v5)                0.082    0.131    0.263    0.393   -0.439
K2(v5)               -0.812   -0.869   -0.738   -0.607   -0.252
K3(v5)               -0.541   -0.935   -0.869   -0.803    0.244
K4(v5)               -0.694   -0.956   -0.913   -0.869   -0.100
K5(v5)               -0.770   -0.967   -0.934   -0.902   -0.357
K1(v6)               -0.814   -0.320   -0.437   -0.355   -0.446
K2(v6)               -0.752   -0.109   -0.249   -0.091   -0.261
K3(v6)               -0.628   -0.054   -0.252    0.233    0.218
K4(v6)               -0.257   -0.370   -0.101   -0.259   -0.089
K5(v6)                0.257   -0.527   -0.357   -0.444    0.354


After the matter elements to be evaluated are input
into the matter element model, the correlation values
are obtained (see Table 4).


4.3 Applying ANP to determine index weight 


As shown in Section 2, the ANP calculation is very
complicated, so this study uses the Super Decisions
software (SD) to calculate the index weights. The 3
first-level indexes shown in Figure 1 are input to SD
as 3 clusters, and the 6 second-level indexes are
input as 6 nodes. Connections are then established
between clusters and between nodes, which completes
the ANP model (see Figure 2).

The direction of an arrow indicates the dominance
of one cluster over another. For example, ground
velocity affects the index distance to be flown, so
there is a loop on the cluster Ground.

This study uses the scale shown in Table 1 to
compare the indexes. The specific data, based on
expert judgment, are input into SD and the result is
calculated (see Figure 3).

The comprehensive correlation degrees are then
calculated by inputting the index weights and the
data of Table 4 into formula (7) (see Table 5).
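
The whole evaluation chain can be sketched for one flight. The code below is an illustration, not the authors' implementation: the helper names are hypothetical, the level limits are rebuilt from the segment fields of Table 2 on the assumption that each index is split into five equal intervals (which is what Table 2 shows), the weights are the limiting values reported in Figure 3, and the flight data are those of flight No. 3 in Table 3.

```python
def rho(v, a, b):
    """Extension distance between the point v and the interval <a, b>."""
    return abs(v - (a + b) / 2) - (b - a) / 2


def correlation(v, classical, segment):
    """Correlation function K_j(v_i) of formula (5)."""
    a0, b0 = classical
    r0 = rho(v, a0, b0)
    if a0 <= v <= b0:
        return -r0 / (b0 - a0)
    return r0 / (rho(v, *segment) - r0)


def five_levels(lo, hi):
    """Split the segment field <lo, hi> into five equal classical fields."""
    step = (hi - lo) / 5
    return [(lo + k * step, lo + (k + 1) * step) for k in range(5)]


# Segment fields (Table 2) and limiting weights (Figure 3)
SEGMENTS = {
    "angle of pitch": (0.0, 4.0),
    "distance to be flown": (0.0, 250.0),
    "roll angle": (0.0, 5.0),
    "normal acceleration": (13.0, 35.0),
    "lateral acceleration": (0.0, 6.0),
    "ground velocity": (23.0, 33.0),
}
WEIGHTS = {
    "angle of pitch": 0.102066,
    "distance to be flown": 0.082722,
    "roll angle": 0.153099,
    "normal acceleration": 0.265507,
    "lateral acceleration": 0.318067,
    "ground velocity": 0.078539,
}
# Flight No. 3 (Table 3)
FLIGHT_3 = {
    "angle of pitch": 3.041,
    "distance to be flown": 230.44,
    "roll angle": 1.538,
    "normal acceleration": 34.724,
    "lateral acceleration": 0.315,
    "ground velocity": 28.496,
}


def comprehensive_degrees(flight):
    """Formula (7): weighted sum of K_j(v_i) over the six indexes,
    returned as a list for levels j = 1..5."""
    degrees = []
    for j in range(5):
        total = 0.0
        for name, value in flight.items():
            segment = SEGMENTS[name]
            classical = five_levels(*segment)[j]
            total += WEIGHTS[name] * correlation(value, classical, segment)
        degrees.append(total)
    return degrees


degrees = comprehensive_degrees(FLIGHT_3)
best_level = max(range(5), key=lambda j: degrees[j]) + 1
print(best_level)   # 5
```

Under these assumptions the largest comprehensive correlation degree of flight No. 3 belongs to level 5, which agrees with the classification reported for this flight in Section 4.4.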


4.4 Analysis and comparison of the results 


As can be seen from Table 5, the quality level of
flight No. 1 is 1, namely excellent quality, and
flights No. 2 and No. 4 are at the same level as
No. 1. The quality level of flight No. 3 is 5, namely
qualified. The quality level of flight No. 5 is 2,
namely between excellent and good. The results
calculated by CWM are compared with the results
obtained from MEAM (see Table 6).

Figure 2. ANP model.
[Figure: clusters Ground (ground velocity, distance to be flown), Angle (angle of pitch, roll angle) and Acceleration (normal acceleration, lateral acceleration).]

Figure 3. Weight of each index.

Index                  Normalized by cluster   Limiting
angle of pitch         0.40000                 0.102066
roll angle             0.60000                 0.153099
ground velocity        0.48703                 0.078539
distance to be flown   0.51297                 0.082722
lateral acceleration   0.54503                 0.318067
normal acceleration    0.45497                 0.265507

Table 5. Comprehensive correlation degree.

Correlation degree   Q1       Q2       Q3       Q4       Q5
K(N1)                 0.008   -0.354   -0.517   -0.639   -0.673
K(N2)                 0.088   -0.581   -0.692   -0.686   -0.584
K(N3)                -0.397   -0.581   -0.659   -0.664   -0.386
K(N4)                 0.020   -0.647   -0.704   -0.740   -0.788
K(N5)                -0.218   -0.058   -0.083   -0.436   -0.543

Table 6. Evaluation results comparison.

Evaluation method   No.1   No.2   No.3   No.4   No.5
MEAM                Q1     Q1     Q5     Q1     Q2
CWM                 Q,     Q,     Q,     Q,     Q2

As shown in Table 6, the results of the two methods
differ for most flights; only the result for flight
No. 5 is the same. Nevertheless, the results are
similar and the discrepancy is small.

The reason is that the numerical values calculated
by the two methods are different. ANP calculates the
weight of each index more accurately. CWM only takes
into account the level of each index, whereas MEAM
calculates the correlation degree between each index
and each level. MEAM therefore considers more
information, and its comprehensive correlation degree
builds further on the individual correlation degrees.


5 CONCLUSIONS 


Firstly, based on extension mathematics, the
correlation function extends the evaluation interval
to $(-\infty, +\infty)$, so MEAM considers more
information and the result can be more objective.
Secondly, ANP and MEAM are suitable for evaluating
landing quality and give satisfactory results. The
analysis process and the results are helpful for
improving landing quality and guiding pilots.
Thirdly, although CWM is the classical method,
integrating ANP with MEAM is better for evaluating
landing quality. Finally, ANP and MEAM are seldom
used in the field of landing quality level
evaluation, so the selection of evaluation indexes
and the determination of the value ranges can be
further discussed and revised. In this study, the
division of levels is based on linear data. There are
many nonlinear data in real life, so the evaluation
of objects with nonlinear data can be studied further.

The reason why ANP is chosen rather than AHP is
that ANP can take the interaction between indicators
into account and calculate the index weights more
scientifically and objectively. However, ANP uses
expert scoring to construct the comparison matrices
between indexes. This method is very subjective, and
in future research we will try to find a more
objective and reasonable method for this problem.

ABSTRACT: Security risk analysis and management for infrastructures is a challenging task, as the
uncertainties regarding both the capabilities of security systems and the various threat scenarios are high.
In particular, cost-benefit analysis of investments in physical security systems to reduce the overall
vulnerability of infrastructures is a complex problem. This paper presents an approach based on a
quantitative model for vulnerability analysis previously introduced by the authors. Based on this model,
a Bayesian Decision Network (DN) is derived. The result of the DN is a Return on Security Investment
(ROSI) based on the principle of the weakest path. The ROSI can be used to find the best outcome among
different configurations, considering the mitigation of security risks and the required investments in
security measures. In a last step, the application of the developed approach to a simplified infrastructure
is presented. Finally, the results are summarized and discussed.


1 INTRODUCTION 


Security risk analysis and management of
infrastructures are important issues because of a
rising number of attacks with different aims, e.g.
financial interests, sabotage or terrorism. At the
same time they represent a challenging task, as the
uncertainties regarding both the capabilities of
security systems and the various threat scenarios are
high.

In particular, cost-benefit analysis of investments
in security systems to mitigate the overall
vulnerability of infrastructures subject to different
types of threats is a complex problem. Although
numerous approaches to (physical) security risk
assessment and analysis exist, only a few consider
uncertainty in the threats and in the abilities of
security systems. As existing methods for
cost-benefit analysis are based on these approaches,
they mostly do not consider uncertainties either.
Additionally, most methods focus on a single attack
scenario and lack a holistic view of the different
feasible attack paths in an infrastructure equipped
with a security system.

This paper presents an approach to a risk model
that is based on a quantitative model for
vulnerability analysis previously introduced by the
authors. The quantitative vulnerability model is
enhanced using Bayesian Decision Networks. Bayesian
Networks (BN) allow a graphical representation of the
security system and its function in the considered
infrastructure. Additionally, a decision network (DN)
based on an influence diagram is developed. The DN
enables the analysis of outcomes resulting from
different configurations, considering the mitigation
of security risks and the required investments in
security measures. Based on a return on security
investment (ROSI) analysis, the most valuable
configuration can be chosen.

To this end, the structure of the quantitative
vulnerability model is first transferred into BN
structures, and the probability density distributions
of the model are discretized to consider
uncertainties. In a second step, the paper explains
the implementation of the BN into the DN based on the
security risks and security investments resulting
from different configurations of the security system.
The last step describes the application of the
developed approach to a simplified infrastructure.
Finally, the results are summarized and discussed.


2 STATE OF THE ART 


2.1 Security risk assessment 


Security comprises a number of issues covering
different fields of expertise; therefore a
comprehensive view is needed to conduct a holistic
security assessment (Harnser Group 2010). Physical
security, as one part, deals with the protection of
infrastructures from intentional physical attacks
(Beyerer et al. 2010). The aim of physical security
measures is to prevent an attacker from reaching his
objective by different means of protection, detection
and intervention, and also to set up resilient
structures that mitigate the consequences of
successful attacks (Garcia 2008).

The corresponding security risk can be defined as
(Contini et al. 2012, Mc Gill et al. 2007):

$$
\text{Risk} = \text{Threat} \times \text{Vulnerability} \times \text{Consequence}
\tag{1}
$$

Based on a quantitative analysis, this definition
combines the consequences of attacks and the
probabilities of threat scenarios with the
probability of individual attacks being successful,
defined as vulnerability. The above quantitative
definition of risk may help to deduce acceptable
risks and the measures necessary to mitigate risks
(Broder & Tucker 2012). Inherent uncertainties
regarding the three risk factors should be cautiously
considered (Campbell & Stamp 2004).
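
A minimal sketch of how equation (1) might be evaluated per threat scenario is given below. The scenario names and all numbers are hypothetical, and point values are used for the three factors, so the uncertainties stressed above are deliberately ignored in this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    threat: float         # probability that the scenario is attempted
    vulnerability: float  # probability that an attempt succeeds
    consequence: float    # loss if the attack succeeds (monetary units)

    @property
    def risk(self) -> float:
        """Risk = Threat x Vulnerability x Consequence, equation (1)."""
        return self.threat * self.vulnerability * self.consequence


# Hypothetical scenarios for a single infrastructure
scenarios = [
    Scenario("sabotage", threat=0.05, vulnerability=0.40, consequence=2_000_000.0),
    Scenario("theft", threat=0.20, vulnerability=0.25, consequence=100_000.0),
]

for s in scenarios:
    print(f"{s.name}: {s.risk:.0f}")

total_risk = sum(s.risk for s in scenarios)
print(f"total: {total_risk:.0f}")
```

Replacing the point values by probability distributions is what a quantitative treatment of uncertainty would add to this calculation.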

Various approaches to security risk assessment
have been developed, which may be divided into
qualitative, quantitative and hybrid methods (Meritt
2008). Qualitative methods are mostly based on expert
knowledge, while existing quantitative methods use
discrete probabilities. The former are more
widespread because of their ease of use, but the
application of expert knowledge can lead to
inaccurate or even wrong results (Landoll 2011).
Additionally, some quantitative methods aiming at
cost-benefit analysis have been developed. Typically,
cost-benefit analyses of security measures account
for the potential financial losses resulting from an
attack, the probability of occurrence of various
attack scenarios and the vulnerability of the
security system (Flammini et al. 2009). Such analysis
yields accurate results but is more complex than
qualitative methods (Landoll 2011).


2.2 Cost-benefit analysis in security risk 
assessment 


Generally, a cost-benefit analysis for security
measures is difficult, as the benefits of the
measures are hard to evaluate (Butler 2002). Thus,
decision support for security investments is a
developing area (Abrahamsen et al. 2015).

Different approaches that build on methods and
models of vulnerability analysis have been proposed
in an IT-security context to include the
effectiveness of countermeasures in the general risk
assessment. For example, Bistarelli et al. (2006)
propose defense trees as an enhancement of attack
trees to calculate the effectiveness of possible
security measures. The SAEM method proposed by Butler
(2002) is based on expert knowledge and aims at
including the estimated effectiveness in the risk
assessment. An interesting approach is the definition
of a return on security investment (ROSI). The ROSI
is the ratio of the mitigated consequences of attacks
to the investment in the security measures needed to
achieve this mitigation (Sonnenreich et al. 2005).
Although it was proposed in an IT context, a more
general use in security risk assessment is
conceivable.
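
The ROSI idea can be sketched with the form in which it is commonly cited, (risk exposure x share of risk mitigated - cost) / cost. Whether this is exactly the formulation used later in this paper is an assumption; the configuration names, mitigation shares and costs below are hypothetical.

```python
def rosi(risk_exposure, share_mitigated, cost):
    """Return on security investment in its commonly cited form:
    (risk exposure * share of risk mitigated - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (risk_exposure * share_mitigated - cost) / cost


# Hypothetical annual risk exposure and candidate configurations:
# name -> (share of risk mitigated, cost of the measures)
RISK_EXPOSURE = 45_000.0
CONFIGS = {
    "fence only": (0.30, 5_000.0),
    "fence and cameras": (0.60, 15_000.0),
    "full perimeter system": (0.80, 40_000.0),
}

scores = {name: rosi(RISK_EXPOSURE, share, cost)
          for name, (share, cost) in CONFIGS.items()}
best = max(scores, key=scores.get)

for name, value in scores.items():
    print(f"{name}: {value:+.2f}")
print("best configuration:", best)
```

A negative value, as for the most expensive configuration here, means that the measures cost more than the risk they remove.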

In the area of physical security risk assessment,
only a few approaches provide strategies for
decision-making with respect to security measures.
Wyss et al. propose a security risk metric that,
similar to the IT-related methods, is based on
vulnerability analysis. The basic idea is the rather
general principle that every applied measure should
make the easiest path towards a successful attack as
difficult as possible within the constraints of costs
and of operational and programmatic restrictions
(Wyss et al. 2010). Another approach analyzes the
costs and benefits of different aviation security
measures (Stewart & Mueller 2013). Here, the current
state is compared with different security measures
that have the same goals by means of a break-even
analysis, which estimates the minimal probability of
successful attacks required for the benefits to equal
the investment costs.

In conclusion, cost-benefit analysis in security
risk assessment is mostly based on the vulnerability
assessment sub-part. A precise estimation of the
inherent benefits of security measures is difficult,
as even the overall abilities of security measures
are often uncertain. Existing methods therefore lack
a detailed coupling to the underlying vulnerability
assessment and instead supply general decision
strategies. Additionally, there are no methods of
cost-benefit analysis based on quantitative methods
that are able to consider the above-mentioned
uncertainties.


2.3 Vulnerability assessment in security risk 
assessment 


Quantitative vulnerability analysis as part of the 
quantitative risk analysis is mostly based on meth- 
ods adapted from reliability and general risk analy- 
sis. Here, the considered model is dependent on 
given attack scenarios (French & Gootzit 2011). 
This dependency is detrimental to a comprehen- 
sive analysis as knowledge about the behavior of 
a potential attacker may be insufficient (Cox Jr. 
2009). The different modeling approaches can 
be further split up into mainly analytical but also 
formal methods. An overview of approaches is 
given by Nicol et al. (Nicol et al. 2004). Analytical 
methods are often based on attack trees, which can 
be seen as a derivative of fault trees already intro- 
duced by reliability analysis. Attack trees were first 
used by Schneier (Schneier 1999) for IT-security 
analysis and have been further developed by differ- 
ent authors since then, summarized e.g. by Vintr 
et al. (Vintr et al. 2012). 
Contini et al. have introduced incoherent attack
trees to characterize the dynamic behavior of the 
considered system (Contini et al. 2008). Addi- 
tionally, they integrated simple probability distri- 
butions for protection into attack tree models to 
investigate the chronologic sequence of attacks. 
Hence, it is possible to analyze the security system’s 
ability for an attack intervention by comparing the 
probabilities of residual protection and system’s 
response (Contini et al. 2012). 

Garcia describes this relation (Garcia, 2008), 
and thus uses feasible attack paths as part of dif- 
ferent attack scenarios and corresponding barriers. 
The model is time-based and introduces the term 
of the critical detection point that is the latest pos- 
sible point of detection that ensures a successful 
intervention. 

In summary, the existing approaches to analytical
modeling and analysis of vulnerability lack the
consideration of uncertainties in the system
parameters and overall behavior. Additionally,
these approaches do not allow a scenario-spanning
analysis of the whole security system as
the analysis depends on specific scenarios.


2.4 Bayesian (Decision) networks and influence 
diagrams 


Bayesian Networks (BN) are based on the Bayesian
interpretation of probability as a degree
of belief. BN combine probability theory and
graph theory. A BN therefore quantifies
dependencies between various data, information
or knowledge considering uncertainties (Jensen &
Nielsen 2007). BN consist of nodes and connecting
edges in directed acyclic graphs (DAG) linking
parent and child nodes. BN therefore consist of
(Gribaudo et al. 2015):


• Variables (nodes) with a finite set of states
• Directed edges between the nodes
• A conditional probability table describing the
  result of each node


The joint probability distribution for a set of
nodes $U = \{A_1, \ldots, A_n\}$ is defined as:

$$P(U) = \prod_{i=1}^{n} P(A_i \mid \mathrm{parents}(A_i)) \qquad (2)$$
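As a small illustration of this factorization, the following sketch evaluates
the joint probability for a two-node network. The node names (`Detection`,
`Intervention`) and all probabilities are hypothetical and are not taken from
the paper.

```python
# Hypothetical two-node Bayesian network: Detection -> Intervention.
# All probabilities are illustrative assumptions, not values from the paper.
parents = {"Detection": (), "Intervention": ("Detection",)}

# Conditional probability tables: cpt[node][parent_states][state]
cpt = {
    "Detection": {(): {True: 0.7, False: 0.3}},
    "Intervention": {
        (True,): {True: 0.9, False: 0.1},
        (False,): {True: 0.0, False: 1.0},
    },
}


def joint_probability(assignment):
    """P(U) as the product of P(A_i | parents(A_i)) over all nodes."""
    p = 1.0
    for node, node_parents in parents.items():
        parent_states = tuple(assignment[q] for q in node_parents)
        p *= cpt[node][parent_states][assignment[node]]
    return p


# Probability that an attack is detected and the intervention succeeds.
p_both = joint_probability({"Detection": True, "Intervention": True})
```

Summing `joint_probability` over all four state combinations gives one, which
is a quick consistency check for hand-written probability tables.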


Influence diagrams extend the definition of BN 
to enable the consideration of decision problems by 
using the BN DAGs (Howard & Matheson 2005). 
The goal of these diagrams is to find the decision 
alternative that delivers the highest expected utility 
(Shachter 1986). 

As the structure is similar to BN, influence dia- 
grams also include chance nodes. Additionally three 
more types of nodes are added (Howard 1988): 


• Decision nodes allow decision alternatives that
  may be controlled by the decision maker.
• Deterministic nodes represent constant values
  that only depend on the states of their parent
  nodes.
• Value nodes represent the utility function
  implemented into the decision process.
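The interplay of these node types can be sketched with a minimal expected
utility calculation. The decision alternatives, probabilities, losses and costs
below are hypothetical values chosen for illustration only.

```python
# Hypothetical influence diagram with one decision, one chance node and one
# value node. All numbers are illustrative assumptions.
p_blocked = {"no_upgrade": 0.40, "upgrade": 0.80}   # chance node, per decision
utility = {"blocked": 0.0, "not_blocked": -100.0}   # value node (loss in k$)
cost = {"no_upgrade": 0.0, "upgrade": 25.0}         # deterministic node (k$)


def expected_utility(decision):
    """Expected utility of one decision alternative, net of its cost."""
    p = p_blocked[decision]
    return p * utility["blocked"] + (1 - p) * utility["not_blocked"] - cost[decision]


# The preferred alternative is the one with the highest expected utility.
best = max(p_blocked, key=expected_utility)
```

With these assumed numbers the upgrade reduces the expected loss by more than
it costs, so it is the preferred alternative.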


Influence diagrams are used in different fields,
where DN are set up to visualize the structure and
to support decision-making processes. Despite the
widespread use of influence diagrams and DN for
a wide range of decision problems, e.g. cost-benefit
investment decisions in financial disciplines, their
use in security risk assessments is not very common.
An approach that uses BN to describe the security
vulnerability of gas pipelines is described in
(Fakhravar et al. 2017).


3 APPROACH 


This paper presents an approach for cost-benefit 
analysis based on a quantitative model for vulner- 
ability assessment that uses probability density 
functions (pdfs) introduced already by the authors 
(Lichte & Wolf 2017). The approach discretizes the 
pdfs and sets up a model that reflects the described 
functional relations in a BN based on conditional 
probabilities. Thus, the characteristics of single 
barriers are defined as submodels that include the 
barrier vulnerability. These submodels are then 
assembled to attack paths resulting from the topol- 
ogy of a considered infrastructure at a higher level 
model. Within the higher level model it is possi- 
ble to derive the strength of the infrastructure to 
block possible attacks by combining the barrier 
vulnerability in the BN. The strength of a single 
attack path is considered as the complement of its 
vulnerability in accordance with the underlying 
model. Subsequently, the cost-benefit DN based
on an influence diagram is added. To this end, threat
and consequence as the decision utility are
implemented to establish a complete risk model. The
cost-benefit network calculates a return on security
investment (ROSI) of different configurations
based on an approach introduced in (Sonnenreich
et al. 2005).

Finally the proposed approach is applied to a 
simple exemplary infrastructure composed of four 
barriers that allow four partly overlapping attack 
paths. 


3.1 Basic assumptions 


The underlying model is based on four basic 
assumptions, which characterize the most relevant 
behavior of a security system in an infrastructure 
(Lichte & Wolf 2016). These assumptions are used 
in the probabilistic description of the systems’
relations. 


1. The weakest path of the security system deter- 
mines the system’s vulnerability as the chosen 
path of the attacker is uncertain. 

2. The combination of protection and observation 
at barriers is necessary as an attacker is always 
able to break through a barrier given infinite 
time without being detected. 

3. The detection of an attack is possible only if 
the protection is sufficient to prevent a break- 
through under observation until detection. 

4. After detection, an attack can be stopped only 
if the residual protection along the remaining 
attack path lasts long enough to prevent the 
attacker from reaching the asset until interven- 
tion is completed. 


3.2 Discretization of the input PDFs 


The model comprises the three main input param- 
eters protection P, observation O and intervention 
I, which are described as pdfs. To use these input 
parameters in a convenient BN with limited com- 
plexity, the describing pdfs have to be discretized. 

Figure 1 uses the example of a normal pdf: the
values for the intervals are derived by integrating
the pdf within the lower and upper boundaries $l$
and $u$ of the intervals.

$$\int_{l}^{u} f(x)\,dx \qquad (3)$$

The size of the intervals used in the BN should be
determined with respect to the considered
infrastructure and security measures as well as the
intended accuracy. The observation period for P and O
is equal and related to the single barriers, while the
period for intervention is usually longer as it is
related to the behavior of the whole system. For
calculation reasons, the discretization intervals
should be of the same size, so the number of intervals
$n_I$ in the description of I grows in relation to the
number of barriers.

$$n_I = n_B \cdot n_{O,P} \qquad (4)$$

Figure 1. Schematic discretization.


Hence, the nodes for the input parameters pro- 
tection, observation and intervention are built as 
prior distributions. 

With the distribution nodes built at an earlier 
stage it is possible to derive the BN for barrier vul- 
nerability based on the four assumptions about the 
behavior of a security system. 
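The discretization step can be sketched as follows. This is a minimal
illustration that assumes a normally distributed input parameter and equally
sized intervals; the function names and the way the tails are handled are
assumptions of this sketch, not part of the paper.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def discretize_normal(mu, sigma, t_max, width):
    """Probability mass per interval [l, u), i.e. the integral of the pdf.

    The mass outside [0, t_max) is folded into the first and last interval so
    that the discrete distribution sums to one (an assumption of this sketch).
    """
    n = int(round(t_max / width))
    probs = []
    for k in range(n):
        lower, upper = k * width, (k + 1) * width
        lo = normal_cdf(lower, mu, sigma) if k > 0 else 0.0
        hi = normal_cdf(upper, mu, sigma) if k < n - 1 else 1.0
        probs.append(hi - lo)
    return probs


# Example: a protection time with mean 180 and standard deviation 30 time
# units, discretized into twelve intervals of width 30.
protection = discretize_normal(180.0, 30.0, 360.0, 30.0)
```

The resulting list can be used directly as the prior distribution of a
discrete node; a smaller interval width increases accuracy at the price of
larger conditional probability tables.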


3.3 Bayesian network for barrier vulnerability 


Basically, a combination of three different rela- 
tions of the input parameters is needed to describe 
the characteristics of the security measures at a 
barrier of a security system: Detection D, residual 
protection R and timely intervention T. 

A detection D of an attacker is triggered with 
the probability that the protection measure at a 
barrier prevents an attacker from a break-through 
until an observation is completed with detection. 
This allocates the conditional probability 


$$D = P(t_P > t_O) \qquad (5)$$


In the context of BN this can be interpreted as a 
chance node. The distribution table represents the 
following probability condition: 


$$\forall k, l:\; P_D = P(D \mid t_{P,k}, t_{O,l}) = 1 \quad \text{if } k > l \qquad (6)$$

wherein $t_P$ and $t_O$ denote the time for protection
and observation, and $k$ and $l$ denote the running index
of the corresponding time interval.
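For two discretized distributions, the detection probability can be evaluated
by summing over all interval combinations in which protection outlasts
observation. The sketch below assumes independent protection and observation
times and treats times in the same interval as not detected; both are
assumptions of this example, and the input values are hypothetical.

```python
def detection_probability(p_protection, p_observation):
    """D = P(t_P > t_O) for two discretized, independent time distributions.

    p_protection[k] and p_observation[l] are the probabilities that the
    protection and observation times fall into interval k and l. The table
    entry is taken as 1 for k > l and 0 otherwise.
    """
    return sum(
        pk * pl
        for k, pk in enumerate(p_protection)
        for l, pl in enumerate(p_observation)
        if k > l
    )


# Hypothetical three-interval distributions (illustrative values only).
p_prot = [0.1, 0.3, 0.6]
p_obs = [0.5, 0.4, 0.1]
d = detection_probability(p_prot, p_obs)
```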

The second key relation in the vulnerability
model is the ability for a timely intervention T.
This parameter is based on the pdf I of the time
needed for intervention $t_I$ and the residual
protection R and is therefore defined by the
conditional probability given by:

$$T = P(t_{RP} > t_I) \qquad (7)$$


The residual protection is the sum of all protec- 
tion measures at the residual barriers of the system 
on the attack path. Figure 2 shows the BN for the 
vulnerability assessment of a barrier containing 
the nodes for detection and timely intervention. 

As depicted, the node for timely intervention
has the parent nodes “D”, “I” and “P” of barrier
i. Additionally, the protection nodes of n residual
barriers are connected as barrier nodes. The
resulting general distribution table reflects the
following conditional probability for timely
intervention:

$$P_T = P(T \mid D, t_{I,l}, t^{(i)}_{P,l}, \ldots, t^{(i+n)}_{P,l}) = \begin{cases} 1 & \text{if } D = 1 \text{ and } t_{I,l} < t^{(i)}_{P,l} + \ldots + t^{(i+n)}_{P,l} \\ 0 & \text{else} \end{cases} \qquad (8)$$

In (8), $t^{(i)}_{P}$ denotes the protection time of the i-th
barrier, so that $t^{(i)}_{P,l}$ describes the state of the
protection time of the i-th barrier falling into the l-th
time interval.

Figure 2. BN for barrier vulnerability.

According to the basic assumptions, the
strength of a barrier is determined by the
possibility of intervening in an attack in time,
given its detection. As vulnerability is the
complement of a barrier’s strength, the BN for
vulnerability is fully derived. In the next step,
the network for the system’s vulnerability and
risk assessment is set up.
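The timely intervention relation can be approximated on discretized
distributions as sketched below. The sketch makes two simplifying assumptions
that are not taken from the paper: the residual protection time is obtained by
convolving the interval distributions of the remaining barriers (so interval
indices are simply added), and detection is treated as independent of the
residual protection time. All input values are hypothetical.

```python
def convolve(p, q):
    """Distribution of the sum of two independent discretized times."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def timely_intervention(p_detect, protections, p_intervention):
    """P(detection) * P(residual protection time > intervention time)."""
    residual = [1.0]
    for p in protections:
        residual = convolve(residual, p)
    p_in_time = sum(
        pr * pi
        for r, pr in enumerate(residual)
        for i, pi in enumerate(p_intervention)
        if r > i
    )
    return p_detect * p_in_time


# Hypothetical values: detection probability 0.69, two residual barriers and
# an intervention time spread over the first two intervals.
t = timely_intervention(0.69, [[0.5, 0.5], [0.0, 1.0]], [0.2, 0.8, 0.0])
vulnerability = 1.0 - t  # vulnerability as the complement of strength
```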


3.4 System vulnerability and risk assessment
decision network


The risk assessment network consists of the barriers
of an attack path, an asset as the target of an attack,
a threat node and a cost utility describing possible
consequences of a considered attack. Both the
barrier nodes and the asset nodes comprise
subsystems. The subsystem in the barriers contains
the vulnerability model, while the subsystem of
the asset comprises the computation of the ability
of the attack path to block an attacker. Figure 3
depicts the DN for risk assessment.

Figure 3. DN for risk assessment.

The deterministic node “B” within the submodel
of the asset describes whether an attack is
blocked. The node depends on the parent nodes “T”
for the timely intervention of all n barriers on the
attack path. Additionally, the threat node “TH”
serves as a parent node. An attack is blocked when
the attacker is detected at a barrier and a timely
intervention, starting at the same barrier, is
possible. The conditional probability for the
evaluation of the deterministic function of the
node “B” is then given by:


$$P_B = P(B \mid TH, T^{(1)}, \ldots, T^{(n)}) = \begin{cases} 1 & \text{if } P_{TH} = 1,\; P_T = 1 \\ 0 & \text{else} \end{cases} \qquad (9)$$


The parent node “TH” describes the probability 
of the occurrence of a threat and serves as a factor 
for the calculation of the general probability of a 
blocked attack (node “B”). 

Finally, a value node “MC” is added with the
node “B” as a parent (see Fig. 4) to represent the
consequences in the risk function in (1). Herein,
the possible consequence is inserted as a constant
monetary value. The calculated result of the utility
function is the value of the consequences mitigated
by the installed security measures on the attack
path of the infrastructure. The linear utility
function is:

$$MC = P_B \cdot \mathrm{Value} \qquad (10)$$


Thus, the derived DN enables a computation of 
the mitigated consequences and the remaining risk 
on the considered attack path. Similarly, other fea- 
sible attack paths of a system can be set up using 
the same values for the different threat nodes “TH” 
and the utility nodes “MC” respectively. Following 
the principle of the weakest path, the overall risk 
of a system with n paths is obtained by: 


$$R_{system} = \max\left[(1 - MC^{(1)}), \ldots, (1 - MC^{(n)})\right] \qquad (11)$$
Figure 4. DN for cost benefit analysis.


3.5 Cost benefit decision network 


Proceeding from the mitigated consequences, a cost 
benefit DN is constructed by adding a decision node 
to enhance the DN with the capabilities of an influ- 
ence diagram. The decisions in the node symbolize 
the different configurations of the security system 
that are part of the analysis as shown in Figure 5. 
The decision node varies the values of the
parameters for protection and observation measures at
all barriers as estimated for the different
configurations. These values need to be inserted into
the “P” and “O” nodes in the subsystem of the
vulnerability analysis, where the same barriers on
different attack paths are assigned the same values
(see Fig. 4).

As depicted in Figure 4, a value node is additionally
added to reflect the costs of the different
configurations introduced by the decision node. Finally,
the value node for the return on security investment
(ROSI) is introduced. It combines the value of the
mitigated consequences, which depends on the
configuration and is calculated in the risk assessment
network, with the costs of the configurations. The
calculation of the ROSI is based on the principle of
the weakest path, as the weakest path is the decisive
path in case of an occurring attack. This yields the
following utility function for n attack paths:

$$ROSI = \frac{\min(MC^{(1)}, \ldots, MC^{(n)}) - Cost}{Cost} \qquad (12)$$


Figure 5. Structure and attack paths of the exemplary
infrastructure.


This utility function computes a ratio between
the mitigated consequences of the weakest path
and the investment needed to realize the required
security measures (Sonnenreich et al. 2005). To
decide whether to invest or not, the following
basic rules apply:

• Invest for $ROSI^{(i)} > ROSI^{(j)}, \ldots, ROSI^{(n)} > 0$
• Do not invest for $ROSI < 0$


Thus, the basic decision whether to invest or not 
as well as the decision which configuration should 
be realized are possible. 
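The ROSI utility function and the decision rules can be sketched as follows.
The configuration names, the mitigated consequences per attack path and the
costs are hypothetical values for illustration; treating a ROSI of exactly
zero as "do not invest" is an assumption of this sketch.

```python
def rosi(mitigated_consequences, cost):
    """ROSI = (min(MC_1, ..., MC_n) - cost) / cost, using the weakest path."""
    return (min(mitigated_consequences) - cost) / cost


def choose_configuration(configs):
    """Return the configuration with the highest positive ROSI, or None."""
    scores = {name: rosi(mc, cost) for name, (mc, cost) in configs.items()}
    best = max(scores, key=scores.get)
    return (best, scores) if scores[best] > 0 else (None, scores)


# Hypothetical mitigated consequences per attack path (k$) and costs (k$).
configs = {
    "small": ([80.0, 95.0, 110.0, 120.0], 50.0),
    "medium": ([255.0, 260.0, 270.0, 300.0], 150.0),
    "large": ([360.0, 400.0, 390.0, 420.0], 300.0),
}
best, scores = choose_configuration(configs)
```

Because only the minimum over the attack paths enters the calculation,
strengthening a path that is not the weakest one does not change the ROSI of a
configuration.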

In the next step the derived DN is applied to an 
exemplary simple infrastructure. 


3.6 Application of the DN to an exemplary 
infrastructure 


The DN is applied to a notional simplified infra- 
structure. Its structure is shown on the left part of 
Figure 5. The security system of the infrastructure 
consists of four barriers enabling four feasible 
attack paths. 

The four attack paths within the infrastructure
each consist of two barriers (see Figure 5, right).
Three different configurations are analyzed to decide
whether to invest and which configuration to
choose. The values of the barrier parameters, which
depend on the configuration, and the costs of
the configurations are listed in Table 1.

The intervention time does not depend on the 
configurations and is set to the values listed in 
Table 2. 

The DN based on the exemplary infrastructure 
is depicted in Figure 6. 

Finally, the vulnerability as well as the ROSI 
of the different configurations are calculated. The 
results are shown in Table 3. 

Based on the results, the medium configuration
(configuration 2) is the optimum decision according
to the DN and the ROSI formulation, as it shows the
best ratio between mitigated consequences and the
required investment. It is also visible that the
weakest attack path of the security system changes
depending on the configuration (see Table 3). The
definition of the ROSI utility function considers this
behavior of the vulnerability assessment model.

Figure 6. Overall DN for the exemplary infrastructure.


Table 1. Input parameters for P and O and needed
investment for different configurations.

Config      1 (50 k$)          2 (150 k$)         3 (300 k$)
[μ, σ]      P        O         P        O         P        O
Barrier 1   180,30   240,60    180,30   150,60    240,40   240,60
Barrier 2   120,20   150,20    150,20   150,20    120,20   90,20
Barrier 3   60,20    120,30    180,20   120,30    60,20    30,10
Barrier 4   60,20    180,60    60,20    90,30     240,30   180,60


Table 2. Input parameters for I.

            I [μ, σ]
Barrier 1   360,60
Barrier 2   30,5
Barrier 3   30,5
Barrier 4   180,30


Table 3. Results for ROSI and vulnerability.

Config   ROSI     V        Path
1        0.5692   0.4614   4
2        0.5909   0.4034   3
3        0.1950   0.1037   1


4 CONCLUSION 


This paper proposes a new approach regarding 
cost-benefit analysis in physical security analysis 
using DN based on BN as well as influence dia- 
grams. The approach enables the assessment of the 
vulnerability of the considered infrastructure and 
introduces a decision network that uses the ROSI 
as a utility function to support decision-making 
regarding a security investment. The vulnerability 
assessment is based on a model introduced by the 
authors using pdf-based parameter descriptions. 

In a first step, the pdfs used to describe the
input parameters are discretized for further use
in a BN. Subsequently, the BN submodel describing
the characteristics of a security system regarding
its vulnerability is derived, and a risk assessment
network based on attack paths is derived by
introducing a threat probability and a consequence
utility function. Next, a decision between different
configurations and related investment costs is
implemented to establish a DN. In a last step, the
ROSI utility function is derived based on the
mitigated consequences of the weakest path and
the costs of the different configurations. The ROSI
function supports decision-making between the
different configurations introduced by the decision
node. The application to an exemplary infrastructure
shows the use of this approach and the consideration
of the weakest path in the ROSI function.

Further research is needed to refine and further
develop the presented approach. A software
implementation would be useful to automate the
discretization of the pdfs and the model building.
Additionally, the definition of the threat probability
as well as the consequence utility function need
to be addressed in more depth. The investment
costs for security measures should also be analyzed
in greater detail. In particular, a measure that
relates the costs to a barrier-related effort,
e.g. the length of perimeter barriers
or observation distances, would be useful.
Risk prediction method of aircraft hard landing based on flight data

ABSTRACT: The paper aims to develop a risk prediction model for aircraft hard landing based on flight
data analysis. Statistical data show that nearly half of all incidents occur in the landing stage
and that hard landing is one of the main contributors. The paper first analyzes the possible factors of
hard landing and the features of flight data. Second, the flight data are preprocessed by height slicing
and Principal Component Analysis (PCA) to address the prediction accuracy and data redundancy
problems. Then the study builds a mathematical model whose objective function maximizes the
probability of hard landing accidents in historical samples. An algorithm based on the golden section
is provided, and threshold values of each index are found. Finally, the proposed method is validated
by empirical research. The results suggest that the proposed method is feasible for the hard landing
risk prediction problem.


1 INTRODUCTION 


Previous aviation safety management data show
that although the landing phase accounts on average
for only 1% of the total flight time, the
accident rate in this phase is the highest of all
flight phases. Therefore, flight risk control in the
landing stage plays a very important role in
assuring the flight safety of the aircraft, and the
analysis of aircraft hard landing events is of
great practical importance.

Hard landing, also known as heavy landing, is a
significant safety hazard during the landing stage.
A hard landing may damage the aircraft structure,
cause direct or indirect financial loss, reduce
passenger comfort and have other adverse
consequences. It can damage aircraft components
or systems under heavy loads (e.g. landing gear,
wings) and, in severe cases, lead to loss of the
aircraft and casualties. Boeing states that a landing
can be judged to be a hard landing when the
acceleration of the aircraft vertical to the ground
exceeds the specified limit at touchdown. Airbus
defines hard landing as a landing in which
the acceleration or speed of the aircraft vertical
to the ground exceeds the specified threshold.
Thus, the landing load (that is, the vertical
acceleration at touchdown) is the key indicator for
determining whether a landing is hard. Accurately
predicting the landing load prior to landing makes
it possible to identify the risk of a hard landing
event in time and to take appropriate measures
(such as a go-around), which can reduce
the frequency of hard landing events to a certain
extent and improve the safety of aircraft landing.

To ensure flight safety, real-time monitoring of
the aircraft flight status is required. The aircraft
usually carries a Quick Access Recorder (QAR) for
recording flight parameters. The flight parameters
reflect real-time status information for the entire
flight of the aircraft and have high application
value in performance testing, accident investigation,
flight training and assessment, equipment
maintenance and safety monitoring. However, due
to the complexity of the flight data and the
limitations of the data analysis methods, the
utilization of flight data in flight safety early
warning is still low, and the data have considerable
value that remains to be developed.

This article aims to establish a model based
on the flight parameters to predict the risk of a
hard landing. In a given environment, taking the hard
landing event as the object, we identify the
high-risk areas that can easily trigger QAR overrun
events in the “space” formed by the key factors
closely related to such overrun events. The
remaining part of this paper is organized as follows.
In section 2, the process of data preprocessing by
flight data slicing and dimension reduction analysis
is introduced. In section 3, the risk region
optimization model and the determination of the
region division points are presented. In section 4,
the model is demonstrated with real flight data.
2 DATA PREPROCESSING


2.1 Flight data slicing 


The original flight data are time series data, and
each flight parameter varies with flight time.
However, the study found that there is great
uncertainty in predicting the landing time before
the aircraft lands, and the prediction deviations
can be as high as several minutes.

In fact, the normal landing of an aircraft takes
only a few seconds. Within a prediction deviation of
several minutes, the aircraft is likely either to have
completed the landing or to be far from completing
it. It is therefore highly unreliable to predict
a hard landing according to the time trend.
On the other hand, the height of the aircraft above
the ground can be measured in real time by radio
signals and is accurately reflected in the flight data.
Therefore, in order to avoid inconsistency in
the flight height data intervals between flights,
we need to carry out height slicing of the
flight data before establishing the risk prediction
model of hard landing.

Slicing of flight data is a method of intercepting
a part of the flight data according to the flight
altitude. The intercepted data are used as the basis
for data analysis, and the rest of the data are
removed. We set the altitude of the flight data in
the range $[h_1, h_N]$ ($h_1 < h_N$), where N is
the number of slices. The slice heights are

$$h = (h_1, h_2, \ldots, h_N)$$

and satisfy the following formulas:

$$h_i = h_1 + (i - 1) \times d \qquad (1)$$

$$d = \frac{h_N - h_1}{N - 1} \quad (N \ge 2) \qquad (2)$$

where d stands for the height interval of a slice and $1 \le i \le N$.
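The slice heights of formulas (1) and (2) can be computed as sketched below.
The linear interpolation used to read a recorded parameter at a slice height
is an assumption of this sketch and is not described in the paper; the sample
values are hypothetical.

```python
def slice_heights(h_first, h_last, n_slices):
    """Return h_i = h_1 + (i - 1) * d with d = (h_N - h_1) / (N - 1)."""
    if n_slices < 2:
        raise ValueError("at least two slices are required")
    d = (h_last - h_first) / (n_slices - 1)
    return [h_first + i * d for i in range(n_slices)]


def interpolate_at(level, heights, values):
    """Linearly interpolate a parameter recorded at increasing heights."""
    pairs = list(zip(heights, values))
    for (h0, v0), (h1, v1) in zip(pairs, pairs[1:]):
        if h0 <= level <= h1:
            return v0 + (v1 - v0) * (level - h0) / (h1 - h0)
    raise ValueError("level outside the recorded height range")


# Eight slices between 2 m and 9 m radio height (interval d = 1 m).
levels = slice_heights(2.0, 9.0, 8)

# Hypothetical recording of one parameter at irregular heights.
recorded_heights = [1.0, 5.0, 10.0]
recorded_values = [0.0, 4.0, 9.0]
sliced = [interpolate_at(h, recorded_heights, recorded_values) for h in levels]
```

After slicing, every flight is described by the same number of values per
parameter, independent of how long the descent took.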


2.2 Dimension reduction analysis 


There are many flight parameter variables in flight
data. Too many variables can be strongly correlated.
On the one hand, this causes redundancy
in the model and reduces the efficiency of model
operation. On the other hand, useless information
also biases the model prediction. As a
method of statistical analysis, factor analysis
performs well in data dimension reduction.
Its basic principle is to integrate multiple variables
into a few indicators while losing as little of the
original information as possible, so that all aspects
of the information can be studied. The information
contained in the resulting indicators does not
overlap, i.e. the indicators are uncorrelated. For that
reason, this paper uses factor analysis
to reduce the dimension of the flight data in order to
extract key and valuable data from a large number
of flight parameter variables.

The factor analysis of flight parameters
must first standardize the original flight
parameters. Let the k standardized flight data
variables be $X = (X_1, X_2, \ldots, X_k)$ and the m common
factors of the flight data be $F = (F_1, F_2, \ldots, F_m)$,
with $m < k$. The model is as follows:

$$\begin{aligned} X_1 &= a_{11} F_1 + a_{12} F_2 + \ldots + a_{1m} F_m + \varepsilon_1 \\ X_2 &= a_{21} F_1 + a_{22} F_2 + \ldots + a_{2m} F_m + \varepsilon_2 \\ &\;\;\vdots \\ X_k &= a_{k1} F_1 + a_{k2} F_2 + \ldots + a_{km} F_m + \varepsilon_k \end{aligned} \qquad (3)$$

where $a_{ij}$ is the factor loading and
$\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_k$ are the residual terms.

The common factors of the flight data in the model,
“flight data factors” for short, are the few key
pieces of flight data information obtained after
dimensionality reduction of the original flight data.
This information serves as the modeling data basis
for the risk prediction model of hard landing.
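The standardization step that precedes the factor model, and the correlation
between variables that motivates the dimension reduction, can be sketched as
follows. The sketch does not perform the factor extraction itself, and the two
parameter series are hypothetical readings used only for illustration.

```python
import math


def standardize(column):
    """Z-score standardization: zero mean and unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / std for x in column]


def correlation(a, b):
    """Pearson correlation of two standardized columns of equal length."""
    return sum(x * y for x, y in zip(a, b)) / (len(a) - 1)


# Hypothetical readings of two strongly related flight parameters.
altitude_rate = [-1.0, -1.5, -2.0, -2.5, -3.0]
normal_accel = [9.9, 10.4, 11.1, 11.4, 12.0]

z1, z2 = standardize(altitude_rate), standardize(normal_accel)
r = correlation(z1, z2)  # close to -1: the two variables are largely redundant
```

Variables with a correlation close to plus or minus one carry nearly the same
information, which is the redundancy that a small number of common factors is
meant to absorb.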


3 MODEL AND ALGORITHM DESIGN 


Hard landing is an important hidden danger for
aircraft landing safety. Many factors affect
the landing safety of aircraft. They can be
divided into three main categories: human factors,
aircraft factors and meteorological factors. Human
factors include unskilled operation, misjudgments,
and operation errors caused by psychological
factors. Aircraft factors include
the maintenance support of the aircraft and its
level of reliability. Meteorological factors
include reduced visibility caused by rain
and snow and deviations of the flight attitude
caused by crosswind. The impact of these three
kinds of factors on aircraft landing safety is
concentrated in the flight data. A large number of
parameters are recorded in the flight data. They
reflect the flight status and index parameters
in real time and provide an important monitoring
basis for aircraft flight safety. In this paper, we
take the landing load overrun event in the landing
stage as the object and find the high-risk area
that easily triggers hard landing events in the
“space” of factors closely related to the event. The
so-called high-risk area refers to a combination of
certain factor ranges within the span of the
influencing factors for hard landing events. In the
subspace represented by these combinations, the
probability of a landing load overrun event tends
to 1.
3.1 Risk region optimization model


After reducing the dimension of the flight data,
the values of the m flight data factors at heights
between 9 m and 2 m before landing, and their
variation over this range, play a major role in the
occurrence of hard landing events. Therefore, the
high-risk areas are divided according to the space
of the m flight data factors and their change rates.
In the process of division, the number of samples
within the region must not be less than a certain
proportion of the total, so that the properties of
the region hold with a certain probability.
Then, the division points are optimized and their
locations adjusted to maximize the probability of
the hard landing event in the high-risk area.

The symbols are as follows:

A, B, C indicate the state values of the three factors
at the radio height of 9 m; ΔA, ΔB, ΔC respectively
indicate the change rates of the flight data factors
A, B, C from the radio height of 9 m to 2 m. N
represents the number of all samples; 0 < P < 1
indicates that the sample size of the region should be
kept at a certain level after the division; L*
represents the “high-risk area” threshold; L indicates
the proportion of landing load overrun samples in a
certain region relative to the total number of samples
in the region; R represents a “high-risk area”, which a
region is if and only if the proportion of overrun
samples in the region is higher than L*. $R_1$
represents the number of overrun samples in the
“high-risk area”; $R_2$ represents the number of normal
samples in the “high-risk area”; $H_1^i$, $H_2^i$
represent the division points on factor i, which are
the decision variables.

The optimization model for finding the high-risk area is:

$$\max L(H_1^i, H_2^i), \quad i = A, B, C, \Delta A, \Delta B, \Delta C \qquad (4)$$

s.t.

$$(R_1 + R_2)/N \ge P \qquad (5)$$

$$R_1/(R_1 + R_2) \ge L(H_1^i, H_2^i) \qquad (6)$$

$$H_1^i, H_2^i \in [\text{minimum value of factor } i,\ \text{maximum value of factor } i] \qquad (7)$$

Formula (4) is the objective function, which
maximizes the threshold of the “high-risk area”
over different values of the division points.
Formula (5) indicates that the proportion of samples
in the divided region is not less than the given
parameter P. Formula (6) indicates that the region
should satisfy the definition of a high-risk area;
formula (7) gives the range of the decision
variables.
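How a candidate region is scored against formulas (5) and (6) can be sketched
as follows. The data layout (a list of factor dictionaries with an overrun
flag), the function name and the sample values are assumptions made for this
illustration.

```python
def evaluate_region(samples, bounds, p_min):
    """Evaluate a candidate high-risk region.

    samples: list of (factors, is_overrun) with factors mapping a factor name
             to its value; bounds: factor name -> (lower, upper) division points.
    Returns (L, feasible) with L = R1 / (R1 + R2) and feasible meaning
    (R1 + R2) / N >= P.
    """
    inside = [
        overrun
        for factors, overrun in samples
        if all(lo <= factors[name] <= hi for name, (lo, hi) in bounds.items())
    ]
    if not inside:
        return 0.0, False
    r1 = sum(inside)          # overrun samples in the region
    r2 = len(inside) - r1     # normal samples in the region
    return r1 / (r1 + r2), (r1 + r2) / len(samples) >= p_min


# Hypothetical samples with one factor A and its change rate dA.
samples = [
    ({"A": 0.20, "dA": 0.10}, False),
    ({"A": 0.80, "dA": 0.70}, True),
    ({"A": 0.90, "dA": 0.60}, True),
    ({"A": 0.70, "dA": 0.90}, False),
    ({"A": 0.30, "dA": 0.80}, False),
    ({"A": 0.95, "dA": 0.75}, True),
]
ratio, feasible = evaluate_region(
    samples, {"A": (0.6, 1.0), "dA": (0.5, 1.0)}, p_min=0.5
)
```

An optimizer would vary the bounds per factor and keep the feasible region
with the highest ratio.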


3.2 Search of partition points 


The search for the division points adopts the
golden section method for one-dimensional
optimization. The golden section, also known
as the extreme and mean ratio, divides a line segment
into two parts such that the ratio of the longer
part to the whole segment equals the ratio of the
shorter part to the longer part. The ratio is an
irrational number whose value to three digits is
approximately 0.618, so the method is also called
the 0.618 method.

The order of iterative refinement affects the
result. Therefore, drawing on expert experience,
the factors are ordered according to the size of
their impact on the event, and the algorithm steps
are shown in Figure 1.

Sort: i = A, ΔA, B, ΔB, C, ΔC

$$H_1^i = i^- + 0.618\,(i^+ - i^-) \qquad (8)$$

The point symmetric to this point about the midpoint of the interval is

$$H_2^i = i^- + 0.382\,(i^+ - i^-) \qquad (9)$$

[Figure 1 flowchart: set P and initialize L* = number of overrun samples / total
number of samples; for each factor i, take the sample maximum i+ and minimum i- as
the initial upper and lower bounds of the high-risk region; starting with factor A,
compute the two division points by formulas (8) and (9) and divide [i-, i+] into
three intervals; compute L for each interval and eliminate the samples of the
interval with the minimum L; recompute L and the maximum and minimum values on the
remaining samples; if the result satisfies formula (5), update L* and the upper and
lower bounds; repeat for the next factor until the last factor (ΔC) has been
processed; output the maximum L* and its corresponding upper and lower bounds.]

Figure 1. The diagram of algorithm step.


4 EXPERIMENT ANALYSIS 


This research takes a certain type of UAV as the
test object. After more than 10 years of research
and testing, the UAV has completed dozens of
scientific research flight tests and entered batch
production. At present, dozens of Unmanned Aerial
Vehicles (UAVs) have been delivered to the troops
in several batches, with a cumulative flight time
of more than 200 hours. The paper uses flight data
from the descent and landing stage, comprising 19
flight data variables recorded in 45 sorties. The
19 variables are radio altitude, elevation angle,
pitch rate, roll angle, roll angle rate, course
angle, yaw angular velocity, aileron displacement,
rudder displacement, altitude rate, elevator
displacement, forward acceleration, normal
acceleration, lateral acceleration, engine speed,
atmospheric height, airspeed, lateral offset and
ground speed. The critical landing load for judging
whether a landing is hard is 18.0 m/s²: when the
landing load exceeds 18.0 m/s², the landing is
considered a hard landing. Among the 45 landing
sorties, hard landings occurred in sorties 4, 5, 6,
10, 14, 16, 17, 20, 24, 25, 26, 29, 30, 34, 35, 39,
40, 42 and 45. After removing the flight data
recorded after landing, the flight height range is
unified to 9 m to 2 m, and the flight data of each
landing are sliced every 0.5 m of flight height.
Missing data were filled with averages, 15 data
slices were obtained per sortie, and the slice data
were integrated. A total of 675 records (15
altitude values for each of 45 flight sorties) were
finally obtained, as shown in Table 1.
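
The slice count and the hard-landing rule can be checked with a short Python sketch. It assumes the slices are taken at both ends of the 9 m to 2 m range inclusive; the function names are hypothetical.

```python
from typing import List


def slice_altitudes(top: float = 9.0, bottom: float = 2.0, step: float = 0.5) -> List[float]:
    """Altitudes at which each landing is sliced, from top down to bottom inclusive."""
    count = int(round((top - bottom) / step)) + 1
    return [top - k * step for k in range(count)]


def is_hard_landing(landing_load: float, threshold: float = 18.0) -> bool:
    """A landing is hard when its landing load exceeds the threshold (m/s^2)."""
    return landing_load > threshold


altitudes = slice_altitudes()
n_sorties = 45
n_records = n_sorties * len(altitudes)
print(len(altitudes), n_records)
```

With 0.5 m steps the range yields 15 altitude values per sortie, which gives the 675 records reported for the 45 sorties.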


The 18 input variables of flight data were tested
with the KMO test. The result was 0.579, above the
0.5 threshold, which showed that the flight data
variables were suitable for factor analysis. The
results of the dimensionality reduction analysis of
the flight data parameters are shown in Table 2.

A component whose extracted eigenvalue is greater
than 1 is retained as a flight data factor. The
analysis results show that, when the 18 flight data
variables are reduced to 3 flight data factors,
87.278% of the raw information of the flight data
is still retained. Following the order of
components 1-3 in the table, the three flight data
factor variables are defined as A, B and C,
respectively. The dimensionality reduction
simplifies the model while preserving as much of
the original information in the flight data as
possible, so the expected effect is achieved.


Table 2. The results of the dimensionality reduction
analysis of the parameters of the flight data.

            Initial eigenvalue             Extraction sums of squared loadings
            Total  Variance   Cumulative   Total  Variance   Cumulative
Component*         ratio (%)  ratio (%)           ratio (%)  ratio (%)
1           5.194  57.711     57.710       5.194  57.711     57.710
2           1.646  18.293     76.004       1.646  18.293     76.004
3           1.015  11.274     87.278       1.015  11.274     87.278
4           0.441  4.901      92.180
5           0.348  3.863      96.042
6           0.256  2.848      98.890
7           0.070  0.782      99.672
8           0.026  0.258      99.957
9           0.004  0.043      100

*The method of extraction is principal component
analysis.
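
The retention rule can be reproduced from the eigenvalues listed in Table 2. The sketch below is plain Python; the variable names are illustrative, and the only inputs are the "Total" column values from the table.

```python
# Eigenvalues from the "Total" column of Table 2.
eigenvalues = [5.194, 1.646, 1.015, 0.441, 0.348, 0.256, 0.070, 0.026, 0.004]

total = sum(eigenvalues)

# Keep only components whose eigenvalue is greater than 1.
retained = [ev for ev in eigenvalues if ev > 1.0]

# Share of the total variance carried by the retained components, in percent.
cumulative_pct = 100.0 * sum(retained) / total

print(len(retained), round(cumulative_pct, 3))
```

Three components pass the eigenvalue-greater-than-1 rule, and their combined share of the total variance agrees with the cumulative ratio reported for component 3.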


Table 1. Data slicing processing results of flight data.

Landing  Flight    Pitch     Roll      Ground    Landing   Engine
sortie   altitude  angle     angle     speed     load      speed
1        9         -0.95859  2.90957   34.13035  27.76744  6544
1        8.5       0.97507   1.49968   34.17583  27.54345  6544
1        2         2.48848   -2.87851  31.07449  10.53457  6211
2        9         -1.16459  0.79653   33.85126  31.67865  5433
2        8.5       -1.05747  0.7416    33.86937  31.12323  5431
2        2         2.74392   -0.13733  30.37497  13.76453  4322
45       2         2.16987   4.05408   27.26024  21.45645  4854
Table 3. High-risk area*.

Parameter        A      AA      B      AB      C      AC
Upper boundary   0.62   -0.05   0.85   0.30    0.74   -0.64
Lower boundary   0.19   -0.41   0.92   -0.70   0.34   -0.23

*P=0.3.


Solving the model with the algorithm of Section 3
gives L* = 0.6. The high-risk area after
standardization is shown in Table 3, and the
standardization process is as follows:


x,= 


(10) 


where, x, represents the flight data. 

We randomly selected 10 samples for testing and
analysis. Four samples fell into the area, of which
three were QAR landing load overrun samples, an
overrun rate of 0.75 within the area.

According to the calculation results, the average
probability of landing load overrun in the analysis
sample is 0.412. The probability of landing load
overrun in the area is therefore higher than the
average probability, which indicates that the area
marked in Table 3 is a high-risk area for the hard
landing event.


5 CONCLUSION 


This paper establishes a risk prediction model
for aircraft hard landing by determining the
high-risk area through flight data analysis. By
slicing the original flight data by height, unified
flight data can be obtained at given heights. In
addition, factor analysis, as a tool for data
reduction, reduces the number of parameters while
retaining as much of the information in the data as
possible. We selected flight data (including human,
aircraft and environmental factors) between 9 m and
2 m to predict the risk of a hard landing event.
The division points are optimized in a
probabilistic sense: by adjusting the location of
the division points, the probability of a hard
landing event in the high-risk area is maximized.
The results show that the identified high-risk area
can identify landing load exceedances. In future
research, we will pay more attention to improving
recognition accuracy. In addition, the choice of
the prediction height range requires further
justification.
EU risk governance of migrants and refugees’ influxes: A realistic
foundation for crisis governance?


B.I. Kruke & C. Morsut 
Centre for Risk Management and Societal Safety, University of Stavanger, Stavanger, Norway 


ABSTRACT: Wars, political instability, poverty and ecosystem alterations force many people to look
for better living conditions and security. Europe has become a safe haven for migrants and refugees,
particularly since 2015, when thousands of people crossed the European borders on a daily basis. In this
paper, we aim at studying the development and the management of the 2015 migrant and refugee influx 
into Europe at the European Union (EU) level in terms of risk and crisis governance, mainly through the 
lens of the International Risk Governance Council (IRGC) Risk Governance Framework. The 2015 mass 
influx into Europe showed the EU’s inability to cope with such an event, with a subsequent fragmented 
response consisting of mainly national security initiatives. A main reason behind the inadequate overall 
joint crisis governance at EU level has been a weak supranational risk governance, mainly due to national 
political, economic, security and cultural differences. 


1 INTRODUCTION 


Migration is an old phenomenon. People have
always been on the move. Push and pull factors
will endure and people will continue to leave their
homes because they have to or because they want to.
Wars, permanent conflict, political instability and
the negative alteration of ecosystems due to
climate change are all push factors that force
people to leave and to seek refuge or better living
conditions. This kind of migration has given rise
to a series of challenges for Europe: the so-called
2015 migrant and refugee crisis, characterised by a
high influx of people crossing the Mediterranean
Sea, is an example. In 2015, more than a million
people reached Europe (UNHCR 2017) in perilous
ways, putting pressure on the aid and reception
mechanisms existing at national and European
Union (EU) levels. Rescuing people from the sea,
giving them first assistance, providing shelter and
distributing food were activities performed by a
variety of non-governmental, governmental and
supragovernmental actors in a complex setting.
National authorities, the EU, the United Nations
High Commissioner for Refugees (UNHCR),
the International and national Red Cross, several
NGOs, but also private citizens offered their
assistance in 2015. They continue these tasks, since
the so-called crisis has become a structural
occurrence rather than a one-off event.

A short terminological clarification of the terms 
migrant and refugee is needed at this point, before 
proceeding with the goal of our paper. The 1951
Refugee Convention defines a refugee as a person
crossing a national border seeking protection from 
political or other forms of persecution (UNHCR 
2010), while a migrant is a person who chooses to 
move mainly to improve his/her living conditions 
(a better job, education or family reunion are rea- 
sons behind the choice) (UNHCR 2016). Unlike 
refugees, who cannot safely return home, migrants 
face no such impediment. However, in describing 
the 2015 massive influx of people into Europe, the 
UNHCR employed both terms, since it was diffi- 
cult to differentiate between the two groups, those 
escaping from various political instabilities and 
those impelled by economic reasons (economic 
migrants). We will follow the UNHCR’s terminol- 
ogy in this paper. 

We aim at studying the development and man- 
agement of the 2015 migrant and refugee crisis 
at the EU level in terms of risk and crisis govern- 
ance, mainly through the lens of the International 
Risk Governance Council (IRGC) Risk Govern- 
ance Framework. Our analysis rests upon previous 
research we conducted on resilient crisis manage- 
ment (Morsut and Kruke 2014, Kruke and Morsut 
2015), on reliable crisis governance (Morsut and 
Kruke 2016), and on the relationship between risk 
and crisis governance (Morsut and Kruke 2017). 
This paper first presents the conceptual framework 
applied in our case. Secondly, by drawing from 
document analysis of EU policy and legal docu- 
ments, it outlines the EU risk and crisis governance 
towards migrants and refugees according to the 
IRGC Risk Governance Framework, to answer the
following question: to what extent did the EU crisis
governance follow a thorough risk governance
process in coping with the influx of refugees and 
migrants? 


2 CONCEPTUAL FRAMEWORK 


2.1 Risk 


Risk is related to a possible future state of affairs. 
Risk may be defined as “an uncertain conse- 
quence of an event or an activity with respect to 
something that humans value” (IRGC 2005: 19), 
or a situation or event where something of human 
value (including humans themselves) is at stake 
and where the outcome is uncertain (Rosa 1998, 
2003). The uncertain consequences—understood 
in terms of likelihood and severity—can be posi- 
tive or negative. The degree of ‘positiveness’ or 
‘negativeness’ depends very much on people’s
perceptions. However, most people relate risk to 
something negative. Kates et al. (1985: 21) describe 
risk as the possibility that an undesired state of 
reality (adverse effects) may occur as a result of 
natural events or human activities. This implies 
that there is a possibility for a negative result of 
a natural phenomenon or human activity. However,
the impact of this phenomenon or activity
is unknown, as is the probability of its occurrence.
Uncertainty characterises the phenomenon or 
activity in question. Uncertainty, together with val- 
ues, is central to Aven and Renn’s understanding of 
risk, which refers to uncertainty about and sever- 
ity of the consequences (or outcomes) of an activ- 
ity with respect to something that humans value 
(Aven and Renn 2009: 1). Thus, risk is inherently 
a subjective phenomenon, a social construction. 
Expert judgement is necessary, but not sufficient,
to understand and manage risk. In addition, risk 
perception, public values and stakeholders’ under- 
standing may also be important to consider. Thus, 
differences in approaching the risk between the 
public and the so-called experts are likely to be 
seen (Sjöberg 1999). 

Risk perception, in particular, has its roots in 
cognitive psychology and has given rise to a vast 
literature on the relationship between the per- 
ceived risk and the real foundation of risk (Slovic 
2000). Culture, gender and knowledge availability 
influence the perceived risk and the actions taken 
accordingly, both by individuals and by decision 
makers. Risk perception, as a social and cultural 
construct, reflects values, symbols, history and ide- 
ology (Weinstein 1980). Differences in risk percep- 
tion between the public and the so-called experts 
may hamper risk management approaches and 
rational decision-making. This conflict between
expert and public risk perception is at the basis of
the social dilemmas of risk management (ibidem) 
and of risk governance. 


2.2 Risk and crisis governance 


Governance lacks a univocal meaning, since it is
rooted in several disciplines, each of which
discusses governance from its own standpoint
(see Kjær 2004; Peters 2000;
Pierre 2000; Stoker 1998). Thus, our understand- 
ing of governance differs according to who exerts 
governance. Governance is often seen in relation to 
government and governability (Renn et al. 2011). 
Government may be understood as setting and 
administering the public policy, while governabil- 
ity is understood as the overall capacity for gov- 
ernance of any societal entity or system (Kooiman 
et al. 2008). Some scholars consider the state as the 
main actor exerting governance (Bevir et al. 2003). 
In this understanding, governance is closely related 
to government. Governance, on the other hand, is 
understood in terms of socio-political interaction 
patterns, where management and decision-making 
are conducted within a framework of institutional 
diversity (Morsut and Kruke 2017). This diversity 
is formed by several stakeholders—state authori- 
ties, trade associations, NGOs, civil society, private 
actors (Kooiman 2003; Krahmann 2003; Bogason 
1996), networks (Sorensen and Torfing 2007) and 
supranational organisations, such as the EU (Mar- 
cussen and Torfing 2007). Governance, therefore, 
entails stakeholders’ involvement (Renn 2008) to a 
much greater degree than government. 

Scholars define and characterise governance at 
various levels. At a national level, Nye and Dona- 
hue (2000) describe governance as structures and 
processes for collective decision-making involving 
governmental and non-governmental actors. The 
joint approaches between public and private actors 
are prominent in this understanding of govern- 
ance. The same is the case with Rosenau’s under- 
standing of governance at a global level (1992). 
Here, governance embodies a horizontally organ- 
ized structure of functional self-regulation, encom- 
passing state and non-state actors, bringing about 
collectively binding decisions without superior 
authority (ibidem). 

The term risk governance involves the trans- 
lation of the substance and core principles of 
governance to the context of risk-related decision- 
making (van Asselt and Renn 2011). A success- 
ful risk governance approach leads to a so-called 
dynamic non-event (Weick 2011). However, if the 
non-event becomes an event, in the form of a crisis, 
this crisis needs to be managed by a crisis govern- 
ance, defined as “to what extent the relationships 
among economic and political, formal and informal
institutions are able to manage crises” (Kruke
and Morsut 2015: 187). In this respect, there is a 
clear relation between the quality of risk govern- 
ance and an effective crisis governance. 


2.3 The IRGC risk governance framework


The IRGC has elaborated a Risk Governance 
Framework for a systemic risk governance approach. 
The Framework goes beyond a naive understanding 
of risk as an objective category and a relativistic 
perspective where all risk judgements are subjec- 
tive reflections of power and interests (Renn 2008). 
The Framework is characterised by two interlinked 
spheres: the assessment and the management sphere. 

The former deals with the generation of knowl- 
edge, whereas decisions and implementation of 
actions are conducted in the latter. Each sphere is 
divided into phases (IRGC 2005) briefly described 
here: 


Pre-assessment: 

The purpose is to capture both the variety of issues 
that stakeholders and society may associate with 
a certain risk as well as existing indicators, rou- 
tines, and conventions that may prematurely nar- 
row down, or act as a filter for, what is going to 
be addressed as risk. Typical activities in the pre- 
assessment are: 


— Problem Framing: different perspectives of how 
to conceptualize the issue; 

— Early warning: systematic search for new hazards; 

— Screening: establishment of procedures for 
screening hazards and risks; 

— Scientific conventions for risk assessment and
concern assessment: assumptions and parameters
of scientific modelling and evaluating
methods and procedures for assessing risks and
concerns.


Figure 1. IRGC framework (IRGC 2005: 13).


Risk Appraisal: 

Main activities in risk appraisal are the develop- 
ment and the synthesis of the knowledge base 
as a foundation for a decision on whether or not 
a risk should be taken. If the decision is to take 
the risk, then a follow-up activity is to map avail- 
able options for avoiding, mitigating, reducing or 
handling the risk. Risk appraisal comprises both 
a scientific risk assessment and a concern assess- 
ment. The scientific risk assessment deals with the 
risk’s factual, physical and measurable character- 
istics, including the probability of occurrence (or 
a probability distribution over a range of nega- 
tive consequences) (IRGC 2005: 14). The concern 
assessment is a systematic analysis of the associa- 
tions and perceived consequences (benefits and 
risks) that stakeholders, individuals, groups or dif- 
ferent cultures may associate with a hazard or a 
cause of hazard (ibidem). 


Tolerability and acceptability judgement: 


— Risk characterisation: Collecting and summa- 
rizing all relevant evidence necessary for making 
informed choice of tolerability and acceptability 
of the risk in question and suggesting potential 
options for dealing with the risk from a scientific 
perspective. 

— Risk evaluation: Applying societal values and 
norms to judge tolerability and acceptability 
and, consequently, to determine the need for 
risk reduction measures. 


Risk Management: 
Decision-making: 


— Option identification and generation: Identifica- 
tion of potential risk-handling options, particu- 
larly risk reduction (i.e. prevention, adaptation 
and mitigation, as well as risk avoidance, trans- 
fer and retention); 

— Option assessment: Investigations of the impacts 
of each option (economic, technical, social, 
political and cultural); 

— Option evaluation and selection: Evaluation of 
options (multi-criteria analysis). 


Implementation: 


— Realization of the most preferred option; 

— Monitoring and feedback: Observation of the 
effects of implementation (link to early warn- 
ing). Ex-post evaluation. 
In all phases, risk communication is essential to
explain the risk process to all relevant
stakeholders who are not directly involved,
including civil society, and to maintain trust
among them (IRGC 2005).


3 THE CASE: THE 2015 INFLUX OF 
MIGRANTS AND REFUGEES INTO 
EUROPE 


The UNHCR has collected data on migrants’ and
refugees’ Mediterranean Sea crossings since 2007.

As Figure 2 indicates, the statistics show a peak
in 2015 with more than one million people. Despite
a decrease in 2016, Europe still faces a substantial
flow of migrants and refugees across the
Mediterranean. Syria, Afghanistan and Iraq are the
top three countries of origin, due to prolonged
warfare (UNHCR 2016a). The challenges of reception
and management of refugees and migrants remain
the same, since permanent solutions are still not
in place.

Migration (by force or by choice) is a global
phenomenon as much as it is an old one. Refugees
and migrants moving to Europe are far fewer than
the people moving to other regions of the world.
The vast majority of migrants continue to be hosted
by developing countries, particularly those close
to the migrants’ and refugees’ countries of origin:
for instance, the bulk of the Syrian refugees are
hosted by Turkey (2.2 million), Lebanon (1.2
million) and Jordan (almost 630,000), according to
figures recorded in December 2015 (IOM 2017a). In
the case of Europe, the events of 2015 were
particularly challenging for frontline EU member
states like Italy and Greece. They received the
highest numbers of refugees and migrants (153,842
and 856,723 respectively; UNHCR 2017a). Transit
countries in Central and
Eastern Europe and major destination states, such 
as Austria, Belgium, Finland, Germany, Sweden 
and the Netherlands, experienced challenges in
handling the flow of migrants and refugees, as
well.


[Figure 2: plot. Axis label: Number of refugees and migrants.]

Figure 2. Overview of refugees and migrants crossing
the Mediterranean Sea 2008-2016 (UNHCR 2017).


3.1 The EU risk governance 


The EU’s legal and operational asylum structure, 
the Common European Asylum System (CEAS), 
aims at harmonising the national asylum policies 
of the EU member states. The CEAS was decided 
in 1999, at the European Council of Tampere, and 
completed in 2005, with revisions between 2011 
and 2013. The CEAS contains a series of legislative 
measures that cover all the phases of asylum seek- 
ing, from the entrance into the EU territory to the 
application process, from the rights to the duties an 
asylum seeker has once he/she is deemed qualified to 
stay. The rationale behind the CEAS is twofold: in 
the short term, the EU aims at achieving “common 
standards for fair and efficient asylum procedures 
in the Member States” (Council 2013: np). In the 
longer term, the EU wants “Union rules leading to a 
common asylum procedure in the Union” (ibidem). 

The EU has only partially succeeded in these
goals, since the reception of migrants and refugees
remains largely a national affair. Despite the
rather positive experience with the refugee flows
following the Balkan wars in the 1990s, the 2015
influx was met with unilateral responses by the
member states, which widened the distance between
the EU’s legal architecture and the national
approaches to immigration.

The CEAS rests on two pillars: Directives and 
Regulations. The main Directives are the Asylum 
Procedures Directive (Council 2013), the Qualifi- 
cation Directive (Council 2011) and the Reception 
Conditions Directive (Council 2013a). 

The Asylum Procedures Directive contains prin- 
ciples and criteria on how to apply for asylum in 
the EU territory. The member states have the main 
responsibility for the asylum procedure, while the 
EU mainly supports them through the European 
Asylum Support Office (EASO) (see below) and 
the European Refugee Fund. The Qualification 
Directive describes which kind of issues member 
states should take into account when processing an 
application (legislation from the country of origin, 
statement from the applicant and so on). In this 
case, as well, the member state has the primary 
task to deal with the application by verifying the 
validity of the information provided. The Reception
Conditions Directive specifies a minimum
and common standard of access to housing, food, 
clothing, health care, education and employment. 

Two Regulations (the EURODAC Regulation 
and the Dublin Regulation) complete the architec- 
ture. The former (Regulation 2013) established the 
EU asylum fingerprint database in 2003 and sup- 
ports the latter (Regulation 2013a), which sets the 
criteria for the examination of an asylum application.
The Dublin Regulation aims at preventing
the so-called asylum shopping (an applicant looks 
for the member state with the most indulgent asy- 
lum procedures) and the indiscriminate circula- 
tion of people that move inside the EU territory 
(the Schengen area) while waiting for the conclu- 
sion of the application process. These two criteria 
foresee that an asylum seeker should apply in the 
first country of entrance. In 2015, the high influx 
of migrants and refugees put under pressure two 
countries of entrance, namely Italy and Greece. 
The combination of high numbers and poor recep- 
tion capacities showed that the Dublin Regulation 
could not work in such conditions. 

Two agencies assist the member states in the 
implementation of the Directives and Regulations’ 
legal measures: (1) the European Asylum Support 
Office (EASO 2017), created in 2011; (2) FRON- 
TEX, which, since 2004, is the European Border 
and Coast Guard responsible for border monitor- 
ing and management (FRONTEX 2017). At the 
end of 2014, FRONTEX started Operations Triton 
in the Central Mediterranean and Poseidon in the 
Greek sea to control the EU maritime borders and 
to conduct search and rescue operations for people 
crossing the Mediterranean Sea (EEAS 2017). 

It is worth mentioning the 2001 Directive on 
temporary protection (Council 2001), which is not 
included in the CEAS. This Directive foresees a tem- 
porary protection status in all EU countries, pro- 
moting a balance of efforts between member states 
receiving migrants and refugees, following a Council 
Decision that confirms a mass influx of displaced 
people and states the groups in need for protection. 


3.2 The EU crisis governance 


The EU crisis governance of the 2015 influx of 
migrants and refugees did not develop accord- 
ing to the Common European Asylum System. 
The EU response was undermined by national 
fragmented responses consisting of different 
emergency measures such as the enforcement of 
border controls, containment of the number of 
people crossing national borders, also with the 
use of force, construction of fences, and the rejec- 
tion by force or the arrest of people who entered 
the national territory illegally (Morsut and Kruke 
2017). These measures were taken unilaterally, 
with no consultation among states, which, for 
example, share the same border, causing tensions 
between neighbours. Both the European Com- 
mission and the Council launched a series of new 
initiatives (contained in three implementation 
packages—see Morsut and Kruke 2017) calling 
for unity and solidarity among the member states, 
seeking to coordinate the response at supranational
level. The main attempt was to balance the
burden and commitment between the frontline 
member states Italy and Greece and the rest of the 
EU member states. 

The first package included a strengthening 
of operation Triton to handle search and rescue 
operations and combat smuggling in the Mediter- 
ranean Sea, and a temporary relocation scheme for 
asylum-seekers from Italy and Greece to the other 
Member States (European Commission 2015). 

The second package included an extended emer- 
gency relocation proposal and a permanent crisis 
relocation mechanism to be activated when the 
Commission determined that a national asylum 
system was under pressure due to a large and dis- 
proportionate influx of third-country nationals 
(European Commission 2015a).

The third package, highly influenced by the 
terrorist attacks in Paris in November 2015, con- 
tained new measures regarding the protection of 
EU external borders (Morsut and Kruke 2017). 

The relocation scheme in package two was offi- 
cially concluded in September 2017, but less than 
a fifth of the original target was relocated (IOM 
2017). The relocation mechanism recalls the tem- 
porary protection of the 2001 Directive (Council 
2001), which was not used to cope with the influx 
of migrants and refugees in 2015. 

In Italy and Greece migrants and refugees were 
taken to reception centres, to apply for asylum (as 
the Dublin Regulation foresees). However, both 
the Italian and the Greek reception systems were 
not able to absorb all the migrants and refugees 
giving them adequate shelter and the possibility 
to start the asylum process (UNHCR 2015). The 
EU intervened financially to implement hotspots 
in Greece and Italy, to identify, register and finger- 
print migrants and refugees and to provide assist- 
ance. However, their implementation took time, 
leaving people literally to fend for themselves. In other
parts of Europe, as well, assistance to migrants and 
refugees was limited: the camp called the Jungle on 
the outskirts of Calais was a striking example. In 
2015, between 6,000 and 10,000 refugees, asylum 
seekers, and migrants, including many unaccom- 
panied children, lived there under very poor condi- 
tions (Human Rights Watch 2017). 


4 DISCUSSION 


The application of the Risk Governance Framework
to our case raises some interesting issues,
especially related to the assessment sphere, but also
to the management sphere.

In the pre-assessment phase, framing and early 
warning “provide a structured definition of the 
problem and how it may be handled” (IRGC 
2005a: 8). The EU legal and operational architecture,
briefly described above, seems very promising
on paper. It captures the main issues related to
an influx of people into Europe and appears able 
to secure the ways to cope with this risk. It calls 
for a common and shared responsible immigra- 
tion management; clearly defines the obligations 
of the member states and which kind of support 
they receive from the EU; provides a fair balance 
of efforts between member states; follows the entire
asylum-seeking cycle from the application to the
granting of the protection, including the criteria 
to guarantee the person a full integration into the 
European society. The two maritime operations, 
Triton and Poseidon, seem relevant to decrease the 
risk of receiving a high number of migrants, but 
also for search and rescue of migrants. 

However, there are signs of an inability to cap- 
ture what stakeholders and society may associate 
with the risk (IRGC 2005) related to a massive 
influx of migrants and refugees. We argue that 
there may be a paradox between the Directives 
and the Dublin Regulation. The main goal of the 
Directives is to harmonise the national asylum sys- 
tems according to the principles of solidarity and 
fair burden sharing. The Dublin Regulation aims 
at preventing asylum shopping and indiscriminate 
circulation of people inside the Schengen area, 
while waiting for the conclusion of the application 
process. Nonetheless, as for the asylum shopping, 
this is an admission that there are differences in 
the national asylum systems, which the Directives 
clearly have not solved. As for the indiscriminate 
circulation, the Dublin Regulation does not ensure 
a sustainable sharing of responsibility across the 
member states, which is exactly the opposite of 
the principles of the CEAS (solidarity and fair 
burden sharing) as such. The EU responded with 
new measures (the relocation mechanisms and 
the hotspot implementation in Italy and Greece) 
contained in the implementation packages. This 
showed that it was difficult to sustain the CEAS. 

As for the FRONTEX operations Triton and 
Poseidon, their main contribution in 2015 (which 
continued in 2016 and 2017) became the rescue of 
migrants and refugees crossing the Mediterranean 
Sea with less focus on the control of the EU mari- 
time external borders (FRONTEX 2017). This is 
a reactive approach and an indication of an inability
to frame the problem, in other words to see the
pre-assessment, risk appraisal and risk management
of the situation in Europe in relation to the
root causes of migration. In this respect, the EU 
risk governance did not consider the wider politi- 
cal, security and economic instabilities making 
people leave their homes for Europe. Thus, monitoring
and feedback according to the Framework
(IRGC 2005) and, more specifically, the observation
of and feedback on the implementation of responses
in 2015 need to address the reasons behind the
EU’s reactive approaches and to develop proactive
approaches and awareness of root causes, since the
2015 influx was not a one-off event.

In general, the degree of compliance with the CEAS
during the events in 2015 was very poor, since the CEAS
seems inadequate to account for a dynamic societal
and political context, which is an important fac- 
tor for framing and early warning, according to 
the IRGC. This context is made of member states, 
which reacted in different ways (the national frag- 
mented responses) and undermined the principles 
of the CEAS, namely solidarity and fair burden 
sharing (the relocation mechanisms in the imple- 
mentation packages). This shows the extent to 
which the CEAS is not implemented at national 
level and the member states retain their sover- 
eignty on immigration policies. The views on the 
issue were clearly conflicting between the EU and 
the member states. 

In addition, the unclear divide between people 
in need of protection and economic migrants was 
a challenge. The CEAS is designed to manage a 
tidy, small-scale and easily recognisable group of
refugees, not the 2015 high number of people with 
different backgrounds crossing the Mediterranean 
Sea. Furthermore, the EU and its member states 
did not recognise and detect the risk of a massive 
influx in time. At supranational level, the EU crisis 
governance of the high influx was based more on
an incremental approach than on an implementation
of the CEAS. The CEAS was the result of a
risk governance approach based on an evaluation
of the risk which had not taken into account certain
features (high numbers, lack of cooperation, arrivals
in few countries, poor reception mechanisms
and so on). At national level, the crisis governance 
seemed to be directed by pure nationalistic choices. 

These reflections lead to the risk appraisal in 
our case. The EU did not consider the risk in all 
its characteristics and effects in the years prior to 
2015. Migration is not a risk per se, but it became 
a risk due to inadequate risk appraisal, by not con- 
sidering all issues related to migration according to 
a holistic view of the challenges. Concerns and per- 
ceptions at national level raised controversial issues 
in terms of solidarity, sustainability and capacities 
at EU level, which the EU was unable to address. 
For example, the reception facilities of Greece and 
Italy were not able to satisfy the needs of all the 
people coming. Unacceptable living conditions and 
slow bureaucratic national asylum processes may 
very easily lead people to become part of that sub- 
strate of the society, surviving through illegal expe- 
dients. This nourishes mistrust and fear towards 
immigrants, establishing a vicious circle where
immigrants are increasingly isolated by the host
society and the host society intensifies its intolerance
towards immigrants because they live at the
margins of the same society. In addition, mistrust
is also directed towards those segments of
society who assist irregular migrants and thereby also risk
legal prosecution (Human Rights Watch 2017).

A scientific risk assessment, of factual, physical 
and measurable characteristics of the risk (IRGC 
2005), using statistics at global level, could have 
scaled back societal negative perceptions, since 
developing countries are still the main recipient 
countries. A proper concern assessment of the 
perceived consequences (IRGC 2005) and a subsequent
risk communication strategy might have laid
the foundation for a relationship of trust among 
the different stakeholders, in particular the EU 
and member states. The EU seemed incapable of 
perceiving the concerns of the various stakeholders 
and the public. This led to opposition and protest 
against the EU risk governance, through national 
measures of crisis governance. 

Many stakeholders, and the public at large, are 
not formal parts of the process of addressing and 
handling risk. They, nevertheless, need to perform 
their own informed risk-related decision-making 
and choices about the risk in question, through a 
balance of factual knowledge based on risk percep- 
tion and personal interests. Reliable risk communi- 
cation may therefore bridge conflicting viewpoints 
and also increase the likelihood of shared risk
acceptance among different stakeholders: risk 
evaluators and managers, mostly involved in the 
risk process, researchers and policy makers, across 
academic disciplines and institutional barriers, and 
the people affected by the process (Renn 2008). 

Terror was another factor not taken into consid- 
eration in the pre-assessment phase of the EU risk 
governance, nor was it a part of the concern assess- 
ment in the risk appraisal phase prior to 2015. The 
threat of terror attacks from IS fighters and Euro- 
pean citizens returning to Europe after fighting for 
IS in Iraq and Syria influenced the last implemen- 
tation package in 2015 and turned many European 
citizens against refugees coming from those areas. 
This threat was a major social mobilisation factor 
influencing national policies and approaches. 


5 CONCLUSIONS 


The IRGC Risk Governance Framework has been 
a useful tool for studying the features of a supranational
system of risk governance such as the CEAS and
the way it was applied in our case. 

We can conclude that the EU crisis governance 
of the 2015 influx did not adhere to the EU risk 
governance approach previously put in place. The
responses in 2015 seemed to be more directed at
preserving the Dublin Regulation at all costs, when
it was clear that the system shaped by this Regulation
is not made for a mass influx situation. In
addition, fair burden sharing presupposes a
supranational approach, which was not followed by the
member states; they opted instead for a narrow national
approach.

We argue that the main reason behind the inad- 
equate overall joint crisis governance at EU level 
was a weak EU risk governance of the pre-crisis 
activities prior to 2015, which did not take into 
consideration the impact of national political, 
economic, security and cultural differences among 
European countries. It seems clear that the asylum 
shopping is an admission that the differences in the 
national asylum systems have not been harmonised 
by the Directives put in place before 2015. The EU
also did not adequately consider the social mobilisation
potential of a high number of migrants
and refugees, including a possible number of IS 
fighters. 

The analysis of the 2015 events through the 
pre-assessment and the risk appraisal phases has 
pointed out that the EU risk governance of the 2015 
events was based on fragile premises, since both the 
pre-assessment and the risk appraisal were built
on inadequate and incomplete
information. The use of big data and alternative
data (Facebook, Twitter, Instagram and mobile
phone data) could have contributed to a better
understanding of migration-related phenomena.

ABSTRACT: Many severe accidents such as Piper Alpha and Deepwater Horizon are mainly a result of
long event sequences, which have developed gradually for a significant period before reaching a point
of no return where control is lost and emergency preparedness has to take over. During the significant
build-up period, sometimes referred to as a ‘spiral to disaster’, there are often several opportunities where
control could have been regained, if awareness and understanding of the sequence of events had been
sufficient. However, since they were not, the opportunity to prevent the major accident was lost.
This paper presents and discusses some selected historical accidents and near-misses in the Norwegian oil
and gas industry, the aim being to improve the understanding of accident propagation and in particular
signals and warnings that could have been seen, but were not. The ambition is to achieve insight that can
be used to improve future risk assessments and the ability to detect unforeseen events in general.


1 INTRODUCTION 


As emphasized by Vinnem and Røed (2014), many 
accidents in the industry such as Piper Alpha 
(Cullen, 1990), Longford (Hopkins, 2000), Texas 
City (CSB, 2007) and Deepwater Horizon (Presi- 
dential commission, 2011) are mainly a result of 
long event sequences, which have developed gradually
for a significant period before reaching a
‘point of no return’ where control is lost and emergency
preparedness has to take over. During the
significant ‘build-up’ period, sometimes referred
to as the ‘spiral to disaster’, there are usually several
opportunities where control might have been
regained, if awareness and understanding of
the sequence of events had been sufficient.
But since they were not, the opportunity to prevent
the major accident was lost. The present
paper studies in detail some historical events and 
describes what contributed to the ‘spiral’. 

In the study, selected historical near-accident 
events in the Norwegian oil and gas industry have 
been reviewed. They were selected based on having 
a potential to escalate and result in a major accident,
often defined as an accident with more than
two fatalities (Vinnem and Røed, 2015) or with
other extensive consequences, for example to the 
environment or assets (PSA, 2016). 

The goal of the analysis is not to obtain more 
accurate probability estimation of major accident 
events, but to improve the overall risk management 
linked to such events. In particular we are moti- 
vated by the fact that involved personnel are often 
not able to see what is coming, although in hindsight
it may be concluded they should have: the
signals and warnings were there, but the system and
risk understanding was poor. Then, after the
accident has occurred, the accident propagation
becomes “obvious”, and the question is asked: “How
could this occur without being noticed?” The aim
of the study is to give increased understanding of 
how some accidental events have propagated— 
knowledge that can contribute to prevention of 
such events in the future. 

The paper is organized as follows: In Chapter 2
we introduce the selected events and the criteria
on which they were selected. In Chapter 3, the
results and implications of the results are pre- 
sented and discussed. Chapter 4 provides a discus- 
sion, and in Chapter 5 we draw some concluding 
remarks. 


2 THE SELECTED EVENTS 


The study includes 16 selected events in the Norwegian
oil and gas industry in the period 2008-2015,
as listed in Table 1. This includes all publicly
known events with an accident investigation
report available in the public domain and with a 
major accident potential, as explained in the previ- 
ous section. For two events, we have used company 
reports, and for the remaining 14 events, the acci- 
dent investigation was performed by the petroleum 
safety authorities in Norway. The above comprises 
12 unignited hydrocarbon leaks, one non-process 
fire, one oil spill to the sea, one loss of well control 
and one situation with loss of buoyancy/stability
on a floating production unit. Events before 2008
were excluded since they are considered less relevant
for today’s safety regime and since the quality
of the accident investigation reports has increased
over time and in particular after 2008.

Table 1. Events included in the study.

Date | Facility | Charact. | Description
10.01.2008 | Draugen* | Unign. HC | A hydraulic hose broke. Due to poor design, pressure surge occurred. This caused automatic release of an offloading hose, resulting in 6 m³ oil spill.
24.05.2008 | Statfjord A* | Unign. HC | Oil leak during hot tapping. Release of 156 m³ oil in the utility shaft and 70 m³ to sea. Flashing equal to 0.9 kg/s gas.
12.09.2008 | Oseberg C* | Unign. HC | A solenoid valve was replaced with a wrong spare. Due to lack of flushing, the process valve opened very quickly initiating a hammer effect, causing a 26 kg/s gas release with total amount 1500 kg.
19.05.2009 | Kollsnes** | Unign. HC | Flange bolts were installed with the wrong torque. This resulted in a 12 ton condensate leak with initial leak rate 22 kg/s.
05.11.2009 | Veslefrikk* | Well event | Release of 3450 m³ oil based cuttings and 93000 m³ oil containing slop from an injection well to the sea bed.
08.02.2010 | Mongstad** | Unign. HC | During installation of insulation, it was drilled through a line filled with LNG, resulting in a 0.08 kg/s gas leak with total release of 300 kg.
12.09.2010 | Mongstad** | Unign. HC | A wrench was used to stop an observed leak. Technical equipment loosened and was blown 30 meters by the pressure.
04.12.2010 | Gullfaks B* | Unign. HC | During leak testing and maintenance work on a choke valve, internal leaks resulted in release of 1.3 kg/s and total amount 800 kg.
13.07.2011 | Valhall* | Non-process fire | A fire occurred in a crane engine resulting in burning/glowing particles from the exhaust igniting a vent stack.
26.05.2012 | Heimdal* | Unign. HC | Valves were opened in the wrong sequence resulting in a pressure build-up. Due to poor design, a pipeline burst.
12.09.2012 | Ula* | Unign. HC | Repair of a seepage of produced water was postponed to the upcoming revision stop. Before being repaired, the bolts failed due to corrosion, resulting in release of 20 m³ oil and 1600 kg of gas.
07.11.2012 | Floatel Superior* | Loss of stab. | Anchor bolsters were not constructed to withstand expected weather conditions. This resulted in loose anchors hitting and penetrating the vessel, causing severe listing.
17.06.2013 | Oseberg A* | Unign. HC | Gas leak due to rupture in the blowdown line from the test separator. The line was not designed for sand in the well flow, and the segment was not sufficiently segregated from another segment.
05.01.2014 | Hammerfest LNG** | Unign. HC | Loss of seal liquid in a packing box resulting in a 0.1 to 0.3 kg/s HC gas leak with total amount 250-750 kg.
26.01.2014 | Statfjord C* | Unign. HC | Oil leak with initial leak rate 20.8 kg/s from sump tank to the cellar deck and to sea. 40 m³ oil was spilled to sea and 2 m³ oil was spilled on the installation.
18.01.2015 | Gudrun* | Unign. HC | The design included an insufficiently dimensioned control valve. This caused high vibrations resulting in a burst pipe. The result was a 8 kg/s release of 2.8 tons condensate.

In order to analyse each event at a sufficient 
level of detail, we have identified barrier elements 
that did not perform as intended during the acci- 
dent propagation, and then studied each of these 
failures more in-depth. During the categorization 
process, the lists of root causes presented in the 
investigation reports were used as a basis, although 
for some of the events, some of the root causes 
mentioned in the reports were combined and pre- 
sented in a simplified manner. To some extent, 
interpretation of the information in the accident 
investigation reports was needed during the iden- 
tification process. The number of barrier elements 
varied from 1 to 6 for each event with a total of 53 
barrier elements for all the 16 events together. 


3 SAFETY BARRIER PERFORMANCE 


3.1 Safety barrier performance categorisation 


The performance of safety barriers can be meas- 
ured in several dimensions such as capacity, reli- 
ability, availability, accessibility, efficiency, ability 
to withstand loads, integrity and robustness (PSA, 
2015). In the present paper we have considered 
integrity/availability (was the barrier ready to be 
used?), functionality (did it work as intended?) and 
robustness (did it “survive” the accidental condi- 
tions?). For example, lack of a mandatory risk 
analysis is considered loss of availability, since the 
activity was not carried out at all. Incorrect planning
is considered loss of functionality, since the
planning was carried out (available) but failed
and was inefficient. An example in the robustness
category is technical equipment that failed due to
vibration, i.e. the barrier element was vulnerable,
not able to withstand the accidental loads (vibration).
Out of the 53 barrier elements studied, 20
barrier elements had loss of integrity/availability,
31 had loss of functionality and two had loss of
robustness, as shown in Figure 1.

Figure 1. Barrier element performance categories (legend: loss of integrity/availability; loss of functionality; loss of robustness).

Figure 2. Technical and non-technical barrier elements (legend: technical barrier elements; non-technical barrier elements).

Figure 2 presents the number of technical and 
non-technical barrier element failures. There are 
numerous barrier definitions, see for example the 
discussions by Lauridsen et al. (2016) and Øien et al. 
(2015). In the Norwegian petroleum industry a distinction is
often made between operational, organizational
and technical barrier elements, sometimes
simplified to technical and non-technical barrier elements
as in Figure 2. Out of the 53 barrier elements
studied, only 7 were technical barrier elements while
46 were non-technical, i.e. operational or organizational.
This is in line with other research showing
that a high fraction of root causes in accident investigation
reports are non-technical, see for example
Vinnem and Røed (2015) and Mostue et al. (2014).
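
The tallies reported in this section can be cross-checked with a few lines of Python. This is only a minimal sketch: the counts are those stated in the text, while the dictionary names and the helper `shares` are hypothetical and introduced here for illustration.

```python
# Counts as reported in Section 3.1; all names below are illustrative only.
performance = {
    "loss of integrity/availability": 20,
    "loss of functionality": 31,
    "loss of robustness": 2,
}
element_type = {"technical": 7, "non-technical": 46}

TOTAL = 53  # barrier elements studied across the 16 events


def shares(counts):
    """Return each category's share of the total, in percent (one decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Both breakdowns must cover all 53 barrier elements.
assert sum(performance.values()) == TOTAL
assert sum(element_type.values()) == TOTAL

print(shares(performance))
print(shares(element_type))
```

Expressed this way, loss of functionality accounts for roughly 58% of the barrier elements studied, and non-technical elements for roughly 87%.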


3.2 Barrier element failures—did anyone suspect 
the failures 


The research literature sometimes distinguishes between known
knowns, unknown knowns and unknown unknowns.
US Secretary of Defense
Donald Rumsfeld explained these terms in a press
briefing on February 12, 2002 (later elaborated on in
Rumsfeld, 2011):


“...there are known knowns; there are things we
know we know. We also know there are known 
unknowns; that is to say we know there are some 
things [we know] we do not know. But there are 
also unknown unknowns—the ones we don’t know we 
don't know.” 


Inspired by the above terminology, Aven and 
Krohn (2014) introduced three categories of fail- 
ures: a) Events that were completely unknown to 
the scientific environment (unknown unknowns), 
b) Events that were not on the list of known 
events from the perspective of those who carried 
out a risk analysis (or another stakeholder), and 
c) Events on the list of known events in the risk 
analysis but found to represent a negligible risk. 

These categories correspond to unknown
unknowns, unknown knowns and known knowns,
respectively. As shown in Figure 3, out
of the 53 barrier elements studied, none were in 
the a) category, 14 were in the b) category, and 39 
were in the c) category. Examples in the b) category 
are a missing orifice that the personnel were not 
aware of and lack of identification of the major 
accident potential during a risk assessment. In 
the c) category, examples are overruled inspection 
intervals, known violations, work practice differ- 
ent than prescribed and required risk assessments 
not being performed. For the majority of the bar- 
rier element failures in category c), someone in the 
organization had knowledge about weaknesses 
indicating that the barrier element was, or could 
be, deteriorated. However, the probability of this 
resulting in a major accident hazard was consid- 
ered sufficiently low to accept the risk, although in 
most cases the risk was not formally assessed and 
accepted according to company procedures. 

The above classification was to some extent 
subjective, and in particular it was difficult to dis- 
tinguish between the b) and c) categories for some 
events. If the b) criterion “or another stakeholder”
had been interpreted as anyone in the organization,
there would be more c’s and fewer b’s, giving even
stronger evidence that known knowns were the most
frequent challenge for the studied events. From a risk
management perspective, it is good news that there
were no events in category a) and few events in category
b), since the knowledge and understanding of
the situation, and thus the ability to manage the risk
in a proper way, is stronger for the barrier failures in
category c) than for the ones in categories a) and b).

Figure 3. Event categorisation (legend: a) unknown unknowns; b) unknown knowns; c) negligible risk).


3.3 Barrier element failures—who knew about it 


So far, we have seen that for many of the barrier 
element failures, someone knew there were dete- 
riorated or weakened barrier elements, and since 
the activity was not stopped, this means the risk 
was formally or informally accepted. Now we will 
discuss who had this knowledge. We have distin- 
guished between i) personnel in the sharp end of 
the organization, such as process operators and 
maintenance personnel offshore, and ii) personnel 
in the blunt end, such as project planners and man- 
agers working onshore. The latter was only studied 
when criterion i) was not met.

As illustrated in Figure 4, for 28 out of the 53
barrier element failures, the weaknesses were known 
by personnel in the sharp end. Examples are lack 
of compliance, insufficient knowledge about pro- 
cedures, lack of risk assessment, insufficient main- 
tenance and known weaknesses in design. For 12 
of the barrier element failures, weaknesses were 
known to someone else in the organization. Exam- 
ples are insufficient design or fabrication, unclear 
responsibilities, poor quality of a risk assessment 
and insufficient maintenance. For the remaining 13 
barrier element failures, there is insufficient infor- 
mation in the accident investigation report to make 
a proper categorization; it is not explained in detail 
who knew what. This is not a surprise, as it is not
common to include such a level of detail in studies
of near-misses and non-catastrophic events such as the
ones studied in this paper. The above implies
that to some extent we have needed to read between
the lines, and sometimes we have made assumptions
based on the information available in the reports.

Figure 4. Who knew about the barrier element failures? (legend: known to personnel in the sharp end; known to others; not able to categorize).


3.4 Surprises 


For 36 of the 53 barrier elements there were sur- 
prises. We have distinguished between two situa- 
tions; i) it was a surprise that the barrier element 
actually failed (28 occasions) and ii) it was a sur- 
prise that the barrier element actually played a role 
as a safety barrier and was safety critical (8 occa- 
sions). The last category implies a lack of system 
understanding; the personnel involved did not
realize that something was actually safety
critical. Examples in the first category are unknown
built-in weaknesses, unclear responsibilities, insuf- 
ficient planning or assessments ahead and technical 
equipment that was believed to be similar but was 
in fact not. Examples in the second category are 
safety-critical technical equipment considered non- 
critical for safety before the accident occurred. The 
majority of the surprises are of the first kind. This 
implies that for the majority of the situations with 
surprises, it was realized that weaknesses in the bar- 
rier element could potentially result in an accident, 
but it was a surprise that the barrier element actually
failed and that the event actually occurred.

The above surprises are discussed from the viewpoint
of the time before the accident occurred. During our
work, we also wanted to find out whether, after the accident
had occurred, it was still a surprise that some barrier
elements contributed to the accident occurring.
No such surprises were found. At first glance this 
may seem surprising in itself, but in fact, it is not: 
Since we used accident investigation reports as our 
source of information, and the barrier elements in 
our study were chosen because they contributed to 
the event occurring, it is not surprising that each 
of the barrier elements’ contribution to the event 
occurring easily can be explained in hindsight. The 
above emphasizes the importance of realizing that 
the present study relies purely on information in 
accident investigation reports, and that this must be 
kept in mind when the results are interpreted. As 
elaborated on by Damnjanovic and Røed (2016),
investigation reports by definition provide infor- 
mation about the propagation of the events and 
work processes only when an accident occurs. It is 
a general industry challenge that we do not have a 
full understanding of the events and work flow in 
situations that succeed and where no accident occurs. This
challenge has inspired scientists working in the field
of resilience, as elaborated on by Nemeth and Hollnagel
(2014), Hollnagel et al. (2013) and others.


3.5 Early warnings 


For 29 of the barrier element failures there were 
early warnings indicating that something abnormal
was occurring. On 20 occasions, there were no
early warnings and for the remaining four cases,
we do not have sufficient information to make a 
proper categorization. We have divided the warn- 
ings into five groups based on their characteristics: 


i. A physical warning (phenomenon) occurred in
advance of the incident, with sufficient time to
recover if it had been recognized (6 occasions)

ii. A similar or near to similar situation was experienced
earlier on the same site (2 occasions)

iii. Uncommon solution, different from conventional
solutions, but implications of this were
not sufficiently recognized (5 occasions)

iv. Degradation was seen or could (should) have 
been seen by required maintenance/inspection 
(6 occasions) 

v. Insufficient governing documents or poor adher- 
ence to governing documents (10 occasions) 


The first category includes a physical warning in 
advance of the event, with sufficient time to recover
the situation. Examples are an increase or drop 
in pressure, anchors that slammed into the vessel 
in harsh weather and severe vibrations in process 
equipment. These are ‘strong’ warnings since they 
clearly indicate that something abnormal is going 
on. Also category iv) includes rather strong warn- 
ings with situations that could, and should, have 
been detected by maintenance and inspection programs.
Examples are visible degradation of a hose,
valves with observed internal leakage, insufficient 
maintenance of a crane engine and lack of testing 
according to the maintenance program. Category 
ii) includes occasions where a similar or near to 
similar situation had been experienced at the same 
site, for example a known challenge that technical 
equipment had been fabricated without a safety 
critical item installed. Category iii) includes situa- 
tions where a nonconventional solution is used, and 
where related implications were not fully under- 
stood or recognized. Examples are a change in the 
commissioning phase not being validated, and an 
emergency shutdown system for which parts of the 
system could be put out of operation. Category v) 
includes warnings in terms of an insufficient man- 
agement system or a culture with poor adherence 
to procedures or requirements specified in the man- 
agement system. Examples are errors and inconsist- 
encies on a work permit, lack of sufficient resources 
and competence related to a work task being per- 
formed and insufficient barrier strategy and barrier 
design. With 10 occasions, this is the most com- 
mon early warning category. Unfortunately, this 
category also includes the ‘weakest’ warnings, since 
there may be many such breaches without an acci- 
dent or near miss occurring. This means that it was 
difficult to realize the ‘spiral to disaster’ had started. 
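
The split between stronger and weaker warnings can be made explicit with a short calculation. The sketch below uses the counts given in the list above; treating groups i) and iv) as the "strong" ones is an assumption for illustration, based on the text describing them as strong or rather strong, and all variable names are hypothetical.

```python
# Early-warning counts as reported in Section 3.5 (group labels follow the text).
warning_groups = {
    "i": 6,    # physical warning in advance of the incident
    "ii": 2,   # similar situation experienced earlier on the same site
    "iii": 5,  # uncommon solution, implications not sufficiently recognized
    "iv": 6,   # degradation seen, or detectable by maintenance/inspection
    "v": 10,   # insufficient governing documents or poor adherence to them
}
NO_WARNING = 20       # barrier element failures without early warnings
NOT_CATEGORIZED = 4   # insufficient information in the reports

# Assumption for illustration: groups i and iv count as "strong" warnings.
STRONG = {"i", "iv"}

with_warning = sum(warning_groups.values())
strong = sum(n for group, n in warning_groups.items() if group in STRONG)
weak = with_warning - strong

# The three outcomes must add up to the 53 barrier element failures.
assert with_warning + NO_WARNING + NOT_CATEGORIZED == 53

print(f"strong: {strong} of {with_warning} ({strong / with_warning:.0%})")
print(f"weak:   {weak} of {with_warning} ({weak / with_warning:.0%})")
```

Under this assumed split, 12 of the 29 warnings fall in the stronger groups and 17 in the weaker ones.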

Nine of the barrier element failures included
poor quality of risk assessments or required risk
assessments not being performed at all, although
none of these deficiencies had early warnings. One
important objective of a risk assessment is to identify
potential hazards, and the above emphasizes the
importance of risk assessments actually being performed
with sufficient quality, and the importance
of having mechanisms in the organization ensur- 
ing that this is the case. If potential hazards are not 
identified at all, there may be no or few early warn- 
ings, and existing warnings may be overlooked. This 
may potentially result in poor risk management and 
acceptance of risks on an insufficient basis.
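
The coding scheme built up in Sections 3.1 to 3.5 can be summarized as one record per barrier element failure. The sketch below is a hypothetical encoding in Python, not the authors' tooling: all class and field names are assumptions, and the example record is invented for illustration rather than taken from the study data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Performance(Enum):
    AVAILABILITY = "loss of integrity/availability"
    FUNCTIONALITY = "loss of functionality"
    ROBUSTNESS = "loss of robustness"


class Knowledge(Enum):
    UNKNOWN_UNKNOWN = "a"  # unknown to the scientific environment
    UNKNOWN_KNOWN = "b"    # not on the analysts' list of known events
    NEGLIGIBLE_RISK = "c"  # known, but judged to represent a negligible risk


class Surprise(Enum):
    FAILED = "i"            # surprise that the barrier element failed
    SAFETY_CRITICAL = "ii"  # surprise that the element was safety critical


@dataclass(frozen=True)
class BarrierElementFailure:
    event: str
    description: str
    technical: bool                     # False means operational/organizational
    performance: Performance
    knowledge: Knowledge
    known_in_sharp_end: Optional[bool]  # None: report gives too little detail
    surprise: Optional[Surprise]        # None: no surprise recorded
    warning_group: Optional[str]        # "i" to "v", or None if no early warning


# Invented record for illustration only; not taken from the paper's data set.
example = BarrierElementFailure(
    event="illustrative event",
    description="process equipment failed due to vibration",
    technical=True,
    performance=Performance.ROBUSTNESS,
    knowledge=Knowledge.NEGLIGIBLE_RISK,
    known_in_sharp_end=None,
    surprise=Surprise.FAILED,
    warning_group="i",
)
print(example)
```

Keeping the uncertain dimensions as optional fields mirrors the cases in which the investigation reports did not contain enough detail for a proper categorization.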


4 DISCUSSION 


A traditional view on barrier management is 
to establish a sufficient number of barriers and 
ensure they perform as required. With reference 
to James Reason’s Swiss Cheese model (Reason,
1990), this means installing a sufficient number 
of cheese slices with no (or few) holes. However, 
as stated by the National Commission investigat- 
ing the Macondo accident (Presidential commis- 
sion, 2011), ‘Complex systems almost always fail 
in complex ways’. This statement emphasizes that 
barrier management is not simple at all: for the
Macondo accident, the lack of sufficient safety barriers
was easily pinpointed in retrospect, but it was
not evident until the accident occurred.

For the majority of the events studied in the 
present paper, more than one barrier element failed 
and contributed to the event occurring. The number 
of barrier element failures varies from 1 to 6 for each
of the accidents within the classification used in the
paper. As emphasized by Vinnem and Røed (2015),
if root causes had been studied in more detail, more 
factors would most likely have been identified. 

All events studied are near-misses; situations that 
potentially could have resulted in a major accident, 
but due to various circumstances did not. A question 
is then if the situations studied are relevant for learn- 
ing about major accident prevention. We believe 
the study is relevant since only events with a major 
accident potential have been included in the study, 
as explained in the introduction to the paper. Thus, 
all events included can be considered lagging major 
accident precursors as discussed by for example 
Hopkins (2009), Kjellén (2009) and Vinnem (2010). 
This implies that the events studied may bring relevant
information to the table when it comes to causes and 
contributing factors to major accidents, but limited 
information about potential consequences. 

It is important to be aware that the study con- 
siders a sample of selected events based upon the 
criteria explained earlier in the paper. It is not an 
empirical study with random events. This should 
be kept in mind when the results are interpreted. 
For example, it is a relevant challenge that the 
number of barrier element failures identified in the 
accident investigation reports may depend on the 
severity of the accident, since it is likely that severe 
accidents are investigated to a higher level of 
detail, and thus, more barrier element failures are 
highlighted in the reports. This may partly explain 
why the number of barrier failures varied between 
1 and 6 for the events studied. 


5 CONCLUSION 


In this paper, 53 barrier element failures related to 
16 historical events with a major accident potential 
in the Norwegian petroleum industry have been 
studied. The majority were non-technical barrier 
element failures associated with loss of function- 
ality. In many cases, someone in the organization, 
typically in the sharp end, knew about weaknesses 
indicating that the barrier element was, or could 
be, deteriorated. For many of the events there were 
surprises. For the majority of the cases, it was a sur- 
prise that the barrier element(s) actually failed, but 
for some cases, it was also a surprise that the barrier 
element that failed actually was safety critical. For 
more than half of the barrier element failures, there 
were early warnings. Approximately half of the 
warnings were ‘strong’ warnings, such as a physical 
phenomenon that occurred in advance of the incident, 
with sufficient time to recover if it had been 
recognized, or physical degradation that was seen 
or could have been seen through required maintenance 
and inspection. The remaining half were weak warnings, 
such as unconventional design or similar events 
having been experienced previously on the same site. 

Since only 16 historical events have been studied, 
care should be taken when interpreting the results. 
However, some general reflections have been made: 

As demonstrated in previous studies, the present 
paper confirms that it is important to pay atten- 
tion to non-technical barrier elements when risk is 
managed in organizations. The study also indicates 
that the main challenge is not events coming totally 
out of the blue. More often, someone in the organization, 
typically in the sharp end, is aware of 
barrier deteriorations and that these deteriorations 
may result in an accident. However, it is considered 
unlikely that an accident will actually occur. Since 
the activity is continued with deteriorated barrier 
elements, the increased risk is formally or informally 
accepted by someone in the organization. 


Analysis of 985 fire incidents related to oil- and gas production 
on the Norwegian continental shelf 


C. Sesseng, K. Storesund & A. Steen-Hansen 
RISE Fire Research AS, Trondheim, Norway 


ABSTRACT: Fire is a major threat in the petroleum industry. However, little has been published about 
the fire related incidents that have occurred in the Norwegian petroleum sector. To gain more knowledge, 
data from 985 incidents in the 1997-2014 period has been analysed. Examples of factors studied are 
type of facility involved, involved area or system, consequences and severity level. The analysis of the 
fire incidents reveals that even though many incidents are reported, the large majority of these have 
not imposed risks for severe fire accidents. It has also provided valuable information regarding possible 
dangerous situations, commonly involved areas, types of equipment as well as types of activity that were 
involved. Twenty-nine percent of the incidents were false alarms, which must be regarded as a high number 
in an industry where any production stop could be extremely costly. 


1 INTRODUCTION 


In Petroleum Safety Authority Norway’s (Ptil) 
assessment of the risk level in the petroleum indus- 
try on the Norwegian continental shelf, defined 
situations of hazard and accidents (DFUs) are 
defined and utilised. A DFU is an unplanned 
event which has led, or may lead, to loss of life 
and other values. Also, a DFU must be an observ- 
able event which it is feasible to measure accurately 
(Vinnem et al., 2006). Different DFUs are associ- 
ated with different risk areas. According to Ptil’s 
annual report on trends in risk level, there has been 
a declining trend in DFU occurrences associated 
with major accident risk from 2004 to 2014 
(Petroleum Safety Authority Norway, 2016). In the 
period there were no hydrocarbon fires, but there 
were hydrocarbon leakages with ignition potential. 
However, a fraction of the observed DFUs were 
fires and explosions not involving hydrocarbons. 
Even though fire occurrences are few compared 
with other DFUs, fires have disastrous potential, 
and a fire preventive focus should be maintained. 

The current study has a quantitative approach 
and looks in depth into the fire incident statistics, 
with false alarms included, to gain more knowledge 
about the incidents, where they occurred and what 
their outcomes were. The aim is to form a basis for 
future work on improving fire safety, improving 
detection reliability and preventing false alarms in the 
petroleum industry. 


2 DATASET 


The dataset which forms the basis for this study 
is an extract of Ptil’s fire incident statistics. RISE 
Fire Research received authorization from Ptil to 
analyse this set of data, and gained access to all rel- 
evant incidents over the defined period of time. The 
database contains fire incidents, small and large, 
which are systematically reported to the authority. 
The sample comprises 985 reported incidents from 
all facilities and operators in the areas within Ptil’s 
jurisdiction in the 1997-2014 period. The incidents 
included in our selection were reported as one of 
the following incident types: ignited hydrocarbon 
leakage, fire/explosion in other areas, fire/explo- 
sion in other areas (not hydrocarbon fire), and 
fire/explosion in other areas (not hydrocarbon 
explosion). 

The dataset comprises information about the time of 
the event, the facility at which it occurred and the area 
or system involved. Furthermore, the severity together 
with the actual and potential consequences of the 
incidents are classified and registered. There is also a 
free text field where each incident is briefly 
described. 

The severity of the incidents is assessed and 
reported according to a 5-point scale, where the 
different values are defined as follows: 1 = not 
notifiable, 2 = simpler follow-up, 3 = potential under 
minor changes, 4 = severe, 5 = large potential/ 
serious accident/death. 


3 METHOD 


In the current study, the analyses are based on fire 
statistics from Ptil. The study has a quantitative 
approach, which implies that the individual cases 
have not been studied in detail, except for informa- 
tion extracted from free text fields in the statistics 
database. The results from the analyses are pre- 
sented as descriptive statistics. 


4 RESULTS 


4.1 Sample description 


During the 1997-2014 period there were 985 
reported fire incidents on the Norwegian conti- 
nental shelf. From 14 incidents the first year of the 
period, there has been an increase over the period, 
ending on 66 incidents in 2014. In 2006, there was 
a peak with 84 reported incidents, see Figure 1. 

The figure shows the development in number of 
incidents, distributed over severity degree over the 
period. Most incidents (91.2%) are classified with 
severity degree 2, which are incidents which require 
minor follow-up, whereas the remaining incidents 
are classified as one of the other severity degrees. 
For readability, the incidents in these categories 
are presented as one bulk in Figure 1. Most of the 
incidents in this bulk were incidents with severity 
degree 4 (n = 63). In addition there were 6 serious 
accidents (degree 5). 
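
The severity distribution can be recomputed directly from the counts given in the caption of Figure 1. The following Python sketch is illustrative only; the enum and function names are assumptions introduced here and are not part of the reporting system.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The 5-point severity scale described in the dataset section."""
    NOT_NOTIFIABLE = 1
    SIMPLER_FOLLOW_UP = 2
    POTENTIAL_UNDER_MINOR_CHANGES = 3
    SEVERE = 4
    LARGE_POTENTIAL_OR_SERIOUS_ACCIDENT = 5


# Number of incidents per severity degree, as stated in the caption of Figure 1.
COUNTS = {
    Severity.NOT_NOTIFIABLE: 15,
    Severity.SIMPLER_FOLLOW_UP: 898,
    Severity.POTENTIAL_UNDER_MINOR_CHANGES: 3,
    Severity.SEVERE: 63,
    Severity.LARGE_POTENTIAL_OR_SERIOUS_ACCIDENT: 6,
}


def severity_shares(counts):
    """Return the percentage of all incidents that falls in each degree."""
    total = sum(counts.values())
    return {level: 100.0 * n / total for level, n in counts.items()}


shares = severity_shares(COUNTS)
for level in Severity:
    print(f"degree {level.value}: n = {COUNTS[level]}, {shares[level]:.1f}%")
```

The counts sum to N = 985, and degree 2 accounts for 91.2% of the incidents, in line with the figure quoted in the text.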

There is a leap in number of incidents from 2005 
to 2006. The average number of incidents for the 
years before the leap, i.e. 1997-2005, is 35, whereas 
the corresponding number for the 2006-2014 
period is 75. The later average is thus a little more 
than twice the earlier one (about 216% of it, i.e. an 
increase of roughly 116%). 
As will be demonstrated in section 4.5, this is 
related to an increase in the number of reported 
false alarms. 

Figure 1. Number of reported incidents, distributed 
over severity degrees. Degree 2 (n = 898) is presented 
as one category, whereas degrees 1 (n = 15), 3 (n = 3), 4 
(n = 63) and 5 (n = 6) are collapsed into one category for 
readability, N = 985. 

Figure 2. Distribution of incidents between different 
facility types, N = 985. 

Figure 3. Actual outcome of reported alarms, N = 985. 


Further, 2006 also stands out because the number 
of incidents with severity degree 4 is over twice as 
high as any other year in the period (15 incidents, 
compared with the year with the second highest 
number of degree 4 incidents: 7). The median over 
the period is 4. 
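
The leap from 2005 to 2006 can be expressed either as a ratio between the two yearly averages or as a relative increase, and the two are easily confused. The sketch below uses the rounded averages quoted in this section (35 and 75 incidents per year); the unrounded averages are not given, so the result only approximates the percentage stated in the text. The function name is introduced here for illustration.

```python
def ratio_and_increase(before, after):
    """Return (ratio in %, relative increase in %) between two averages."""
    ratio = after / before
    return ratio * 100.0, (ratio - 1.0) * 100.0


# Rounded yearly averages for 1997-2005 and 2006-2014.
ratio_pct, increase_pct = ratio_and_increase(35, 75)
print(f"later average is {ratio_pct:.0f}% of the earlier one")
print(f"relative increase is {increase_pct:.0f}%")
```

With the rounded inputs the ratio is about 214% and the relative increase about 114%, so a figure slightly above 200% describes the ratio rather than the increase.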

A vast majority of the reported incidents 
occurred on fixed installations (73%), whereas one 
out of five incidents was related to movable 
installations. A minor proportion (6%) occurred 
at onshore facilities. 

Over 70% of the incidents were real fire and 
explosion incidents. However, as Figure 3 
demonstrates, a vast majority of the incidents were 
not hydrocarbon fires, but rather fires in electrical 
systems, overheated machinery etc. Almost one 
third of the incidents were classified as false alarms. 


4.2 Type of area/system involved 


The incident reporting system on which this 
analysis is based is designed so that there is 
one variable to register both the area and system 
involved, i.e. one cannot discriminate between 
different systems in one specific area. For example, a fire 
may start in an electric installation, but it is not 
specified if the electric installation is in the living 
quarter, main process or other areas of the facility. 
In Figure 4, and the figures following, areas 
and systems have been split into two separate 
groups, where the first group represents areas and 
the second group represents systems. The areas and 
systems are sorted with descending number of 
reported incidents within each group. 

Figure 4. Distribution of incidents between different 
areas (top group) and systems (middle group). Incidents 
without recorded area or system are shown in the bottom 
group, N = 985. 

Further, incidents with no area or system 
recorded, or incidents in other areas or systems 
constitute a third group. The categories in the latter 
group are relatively large, with 18% and 14% of the 
incidents, respectively. The majority of the incidents 
within these categories (88%) were reported between 
1997 and 2002. Also, it is seen that there is a distinc- 
tion between the categories. While most of the inci- 
dents tagged with “Not recorded area/system” were 
reported before 2002, the main part of the incidents 
tagged with “Others” was reported after 2004. 

With reference to Figure 4, one sees that of all 
reported incidents, nearly one third occurred in 
ancillary systems. These systems comprise, among 
others, communication systems, electrical power 
supply systems and water treatment facilities, in 
short, systems not related to separation, production 
and transport of hydrocarbons. An equal proportion 
of incidents occurred in areas and systems 
not specified. The remaining incidents were 
distributed over main process (10%), electrical 
installations (10%), living quarters (7%) and others (9%). 


4.3 Consequences 


Of the 985 reported incidents in the period, only 
8% resulted in drilling downtime whereas 23% 
caused production stoppage. 

Not surprisingly, one fifth of all incidents causing 
drilling downtime occurred in the drilling and 
well area, see Figure 5. Of all incidents causing 
production stoppage, half occurred in the 
main process and one third occurred in the drilling 
and well area, see Figure 6. 

Figure 5. Incidents causing drilling downtime, distributed 
over the area or system the incident occurred in. 
Areas and systems with fewer than 20 reported incidents 
are excluded from the figure, n = 76. 

Figure 6. Incidents causing production stoppage, distributed 
over the area or system the incident occurred in. 
Areas and systems with fewer than 20 reported incidents 
are excluded from the figure, n = 231. 


4.4 Severity level 


The vast majority (91%) of the reported incidents 
were classified as incidents requiring simpler 
follow-up (severity degree 2), whereas only 
3 incidents (0.3%) had the potential of becoming 
a severe situation under minor circumstantial 
changes (degree 3). Six percent of the incidents 
were regarded as severe (degree 4) and <1% (6 incidents) 
had a large potential or were large accidents, 
but none resulted in fatalities. Three of these 
occurred at onshore facilities. In addition, there 
were 15 incidents (2%) which had the lowest severity 
degree (degree 1), even though these incidents 
are not notifiable. 

Figure 7. Incidents with severity degree 3 (n = 3), 4 
(n = 63) or 5 (n = 6), distributed over different areas and 
systems, total n = 72. 

Examples of incidents on severity level 3 are 
fires that were extinguished after a short period of 
time, either by automatic extinguishing systems or 
by manual effort. A fire in an HVAC module with 
smoke spread to the living quarter, on the other 
hand, was classified as severity degree 5. 

It is not straightforward to analyse trends in 
severity degree over time, since there are few incidents 
with severity degree other than 2. However, 
it is seen qualitatively that the proportion of incidents 
with severity degree 4 is lower in the period 
2008-2014 (average 3.4%) than it was in 
2001-2007 (average 12%). The same trend is seen 
when adjusting for the increase in false alarms after 
2006, which yields a decrease in degree 4 incidents 
from an average of 19.5% in the first period to 5.3% 
in the second period. Also, there has not been an 
incident with severity degree 5 since 2008. It therefore 
seems that there is a decline in the degree of 
severity of the incidents in the sample over time. 
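
One way to read the adjustment for false alarms is that the share of degree 4 incidents is computed relative to real incidents only, i.e. with false alarms removed from the denominator. This is an interpretation rather than a procedure spelled out in the paper, and the yearly counts in the sketch below are hypothetical, since the per-year data are not reproduced here.

```python
def share_of_degree4(n_degree4, n_incidents, n_false_alarms=0):
    """Percentage of degree 4 incidents among incidents that were not false alarms."""
    real_incidents = n_incidents - n_false_alarms
    if real_incidents <= 0:
        raise ValueError("no real incidents left after removing false alarms")
    return 100.0 * n_degree4 / real_incidents


# Hypothetical year: 80 reported incidents, 30 of them false alarms, 4 of degree 4.
raw_share = share_of_degree4(4, 80)
adjusted_share = share_of_degree4(4, 80, n_false_alarms=30)
print(f"raw share: {raw_share:.1f}%, adjusted share: {adjusted_share:.1f}%")
```

Removing false alarms shrinks the denominator, so the adjusted share is always at least as large as the raw share, which matches the direction of the adjusted averages reported above.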

When studying the distribution of the incidents 
with severity degree 3 or higher over the different 
areas and systems, it is seen that most incidents 
occur in the main process and drilling and well 
areas, see Figure 7. Correspondingly, one fourth of 
the incidents with this severity took place in ancil- 
lary systems and 15% in electrical systems. 

Even for these degrees of severity, the area or 
system has not been recorded for around one fifth 
of the incidents, including one incident classified 
as large potential/serious accident/death 
(degree 5). 


4.5 False alarms 


False alarms are alarms caused by other circumstances 
than fire and explosion. According to ISO/DIS 17755-2, 
a false alarm is an alarm for which no 
fire occurred or [...] due to accidental operation of 
fire alarm devices (ISO, 2010). The most frequent 
causes observed in the sample were malfunctioning 
detectors, misinterpretation of the situation by the 
detection system, technical errors and human errors. 
Examples of misinterpretations are sandblasting dust being 
detected as smoke, heat from a sauna detected as heat 
from fire, heated leakage of lubricating oil detected as 
smoke and steam from cleaning detected as smoke. 

The number of false alarms was quite low in 
the first half of the focus period, see Figure 8. Up 
until 2005, there were only a few cases, whereas in 
2006 there is a leap, and in the years following the 
average number of false alarms per year is 29. This 
is probably not a real increase in false alarms, but 
rather an effect of a new reporting scheme, where 
more incidents are included than before. 

From 2006 towards the end of the period, there 
is an apparent decrease in the number of false 
alarms. However, the trend has not been checked 
statistically or adjusted for changes in the petro- 
leum activity on the continental shelf. 

One third of all false alarms occur in relation 
to ancillary systems, and one fifth occur in the 
main process, see Figure 9. In fact, when taking 
into account the number of incidents in each area 
or system, it is seen that 60% of all reported incidents 
in the main process are false (Figure 10). This 
is almost twice as large a proportion of false alarms 
as in any other area or system (disregarding the 
incidents categorised as “other”, as this category 
most likely comprises numerous sub-categories). 
Table 1 presents an overview of how many of the 
false alarm incidents, in each area or system, 
caused either drilling downtime or production 
stoppage. The table should be read with caution, 
as some of the areas or systems have very few incidents, 
which may yield large percentages. 

Figure 8. Number of reported false alarms each year, 
n = 286. 

Figure 9. False alarms reported, distributed over areas 
and systems, n = 286. 

Figure 10. Proportion of real and false alarms for each 
area and system. Areas and systems with fewer than 20 
reported incidents are excluded from the figure. Main 
process n = 102, living quarters n = 66, drilling and well 
n = 42, ancillary system n = 305, electrical installations 
n = 95, compressor installations n = 29, not recorded 
n = 182, other n = 139, total n = 960. 


Table 1. The number of false alarms which caused 
either drilling downtime or production stoppage. 

                                   Caused drilling     Caused production 
                                   downtime            stoppage 
Area/system                        [freq]    [%]       [freq]    [%]       n 
Main process                       6         10%       39        65%       60 
Living quarters                    2         13%       1         6%        16 
Drilling and well                  1         8%        4         33%       12 
Subsea installation                0         0%        1         100%      1 
Ancillary system                   6         6%        30        31%       98 
Electrical installation            0         0%        1         8%        12 
Compressor installations           0         0%        7         70%       10 
Structures and maritime systems                                            0 
Lifting operations                 0         0%        0         0%        2 
Pipeline systems                                                           0 
Not recorded area/system           0         0%        3         33%       9 
Other                              5         8%        23        35%       66 
All                                20        7%        109       38%       286 

Nonetheless, 7% of the false alarms resulted 
in drilling downtime and 38% caused production 
stoppage. In addition, a total of 129 false alarm 
incidents (45%) resulted in personnel mustering to 
life boats. 

Furthermore, 65% of the false alarms occurring 
in the main process caused production stoppage. 
Similarly, 31% of the false alarms caused by ancil- 
lary systems had the same consequence. 

Fewer false alarm incidents caused drilling 
downtime. Again, false alarms occurring in the 
main process or ancillary systems are the main 
cause for drilling downtime. 
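
The percentages discussed in this section can be recomputed from the counts in Table 1 and the totals in the caption of Figure 10. The Python sketch below is a minimal illustration; the function names are introduced here, and applying the cut-off of 20 incidents (used in the paper for excluding areas from figures) as a small-sample flag on the false alarm counts is an assumption.

```python
# Per area/system: (false alarms n, caused drilling downtime, caused production stoppage).
# Values are taken from Table 1. The two rows with n = 0 are left out because
# no percentage can be computed for them.
FALSE_ALARMS = {
    "Main process": (60, 6, 39),
    "Living quarters": (16, 2, 1),
    "Drilling and well": (12, 1, 4),
    "Subsea installation": (1, 0, 1),
    "Ancillary system": (98, 6, 30),
    "Electrical installation": (12, 0, 1),
    "Compressor installations": (10, 0, 7),
    "Lifting operations": (2, 0, 0),
    "Not recorded area/system": (9, 0, 3),
    "Other": (66, 5, 23),
}

# All reported incidents per area/system, from the caption of Figure 10.
ALL_INCIDENTS = {
    "Main process": 102,
    "Living quarters": 66,
    "Drilling and well": 42,
    "Ancillary system": 305,
    "Electrical installation": 95,
    "Compressor installations": 29,
    "Not recorded area/system": 182,
    "Other": 139,
}

SMALL_SAMPLE = 20  # assumed cut-off for flagging percentages based on few incidents


def percent(part, whole):
    return 100.0 * part / whole


def consequence_rates(false_alarms, small_sample=SMALL_SAMPLE):
    """Share of false alarms that led to drilling downtime or production stoppage."""
    rates = {}
    for area, (n, drilling, production) in false_alarms.items():
        rates[area] = {
            "n": n,
            "drilling_pct": percent(drilling, n),
            "production_pct": percent(production, n),
            "small_sample": n < small_sample,
        }
    return rates


def false_alarm_share(false_alarms, all_incidents):
    """Share of all reported incidents in an area/system that were false alarms."""
    return {area: percent(false_alarms[area][0], total)
            for area, total in all_incidents.items()}


rates = consequence_rates(FALSE_ALARMS)
shares = false_alarm_share(FALSE_ALARMS, ALL_INCIDENTS)
for area, row in rates.items():
    note = " (few incidents, read with caution)" if row["small_sample"] else ""
    print(f"{area}: drilling {row['drilling_pct']:.0f}%, "
          f"production {row['production_pct']:.0f}%, n = {row['n']}{note}")
```

With these counts, about 59% of the reported main process incidents were false alarms, close to the 60% quoted in the text, and the 100% production stoppage rate for subsea installations rests on a single incident.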


5 DISCUSSION 


5.1 Incidents 


Over the 18-year period there were almost 1000 
reported fire incidents on the Norwegian continental 
shelf, and it is seen that there were about twice 
as many incidents in the second half of the period 
as in the first half. The increase is probably 
an effect of a shift in reporting regime, where more 
incidents (mostly false alarms) than before were 
included. There is therefore reason to believe that 
there were even more incidents in the first half of 
the period than what has been reported, but that 
most of the unreported incidents were false alarm 
incidents. 

The analysis of the fire incidents reveals that 
even though many incidents are reported, the 
large majority of these have not imposed risks 
for severe fire accidents. There also seems to be a 
positive trend regarding the severity degree of the 
fire incidents, as there is a decline in the number 
of severe incidents and major accidents. However, 
the current analysis has not adjusted for the activ- 
ity level in the Norwegian sector or other possible 
covariates. Also, in general one should be careful 
about drawing conclusions based upon the trend of 
a single indicator, as there may be other indicators 
not investigated in the current study which 
may affect the fire safety level negatively (Vinnem, 
2010; Vinnem et al., 2006). 

Also, since the current study is retrospective in 
nature, it is important to emphasize that a possible 
change in the conditions which may affect the fire 
safety level will not be observed in the incident sta- 
tistics until later, and that the apparent trend is only 
valid for the focus period (Vinnem et al., 2006). 

The current study does not conclude on the 
underlying causes of fires offshore. Future studies 
should therefore focus on revealing such causes 
in addition to triggering factors for the fire incidents. 
This can be done by examining investigation 
reports from the different incidents and by performing 
interviews with key personnel at the 
operators. A study with such a design, investigating 
the underlying causes of 35 fires in electrical 
equipment on offshore platforms in the Norwegian 
sector, is reported in Storesund et al. (2012). 
The study categorised the causes according to 
the Human-Technology-Organisation perspective, 
which may be helpful when trying to sort and 
reveal patterns in causes and find suitable and targeted 
measures. 


5.2 False alarms 


A great proportion of the reported incidents in the 
period were false alarms, and a relatively large fraction 
of these have been shown to cause production 
downtime and consequently economic losses. 
The classic ever-returning dilemma of smoke and 
fire detection is that increasing the detectors’ sensi- 
tivity to detect fires as early as possible also causes 
an increase in number of false alarms. Correspond- 
ingly, decreasing the sensitivity to eliminate false 
alarms affects fire detection time negatively. 
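
The trade-off between early detection and false alarms can be illustrated with a deliberately simple threshold model. In the Python sketch below, every number is hypothetical: a real fire is assumed to give a sensor reading that grows linearly with time, and nuisance sources such as steam or dust are assumed to give peak readings that follow a normal distribution. Neither assumption comes from the incident data.

```python
from statistics import NormalDist

# Hypothetical model parameters, chosen only to illustrate the trade-off.
NUISANCE_PEAK = NormalDist(mu=2.0, sigma=1.0)  # peak reading from steam, dust etc.
FIRE_GROWTH_RATE = 0.1                         # reading units per second in a real fire


def detection_time(threshold, growth_rate=FIRE_GROWTH_RATE):
    """Seconds until a linearly growing fire signal reaches the alarm threshold."""
    return threshold / growth_rate


def false_alarm_probability(threshold, nuisance=NUISANCE_PEAK):
    """Probability that a nuisance peak exceeds the alarm threshold."""
    return 1.0 - nuisance.cdf(threshold)


for threshold in (2.0, 3.0, 4.0, 5.0):
    print(f"threshold {threshold:.1f}: "
          f"detection after {detection_time(threshold):.0f} s, "
          f"false alarm probability {false_alarm_probability(threshold):.3f}")
```

Raising the threshold (lower sensitivity) lengthens the detection time while lowering the false alarm probability, and lowering it does the opposite, which is the dilemma described above.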

One of the main reasons for false alarms is that 
the detection system misinterprets the situation, 
and for instance takes steam or dust as smoke. 
However, detection systems have become smarter 
and there are several technologies available that 
help reduce the number of false alarms 
caused by misinterpretation of the situation. 

For example, studies have shown that multi-sensor 
detectors with a CO sensor can both decrease detection 
time for certain types of fires and reduce 
the number of false alarms (Cestari et al., 2005; 
Sesseng et al., 2016; Sesseng and Reitan, 2016). 
The mentioned studies have only investigated the 
residential case, and there may be areas where such 
detectors are not suitable. Still, there is reason to 
believe that many areas may take advantage of this 
technology, e.g. living quarters, workshops etc. 

The next main causes for false alarms are techni- 
cal errors and malfunctioning detection systems. At 
the same time, compared to other barrier elements, 
fire detection systems have the lowest failure rate 
when tested. Each year, some 50,000 tests of fire 
detectors are performed on offshore facilities in the 
Norwegian sector, and since the beginning of the 
reporting of these tests in 2002 the mean fail rate 
has been declining. In 2002, around 0.9% of the 
tested detectors failed the tests, whereas only 0.1% 
failed in 2015 (Petroleum Safety Authority Norway, 
2016, 2015, 2010; Vinnem, 2010). The trend 
is positive, and there should therefore be a continued 
focus on maintenance and testing. 

The last main cause for false alarms is human 
error. This could be due to work carried out in the 
proximity of a sensor, work during service and testing 
of the system or failure to comply with procedures. 
This is most likely best managed by improved routines 
for work and risk assessment as well as a focus 
on procedural compliance. 

Obviously, if the number of false alarms can be 
reduced there will be great economic benefits. The 
majority of installations on the Norwegian shelf are 
ageing, and anecdotal evidence suggests that some 
have old or outdated detection systems. It should 
be investigated whether an upgrade of the fire detection 
systems could reduce the number of false alarms 
and, consequently, the number of false alarms resulting 
in production stoppage and mustering. 


5.3 Reporting 


The current reporting scheme has certain short- 
comings. By registering information concerning 
system and area in the same variable, informa- 
tion is lost. The obvious consequence is that one 
would have to choose to register either area or 
system, which would be at the reporter’s discre- 
tion. Besides, a specific type of system, e.g. ancil- 
lary systems, could be found in several areas, but 
may constitute different risks in different areas, 
which makes the available information of limited 
value. The reporting scheme ought therefore to be 
changed such that more details regarding involved 
area and system are recorded. 
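
What such a change could look like is sketched below in Python. The record layout is hypothetical and is not Ptil's actual reporting format; the point is only that storing area and system in separate fields allows incidents to be counted per combination. The example records are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IncidentReport:
    """Hypothetical incident record with area and system stored separately."""
    year: int
    severity: int                 # 1 to 5, as in the severity scale of the dataset
    false_alarm: bool
    area: Optional[str] = None    # where on the facility the incident occurred
    system: Optional[str] = None  # which system was involved


def count_by_area_and_system(reports):
    """Count incidents for every (area, system) combination."""
    return Counter((report.area, report.system) for report in reports)


# Invented records, for illustration only.
reports = [
    IncidentReport(2010, 2, False, area="Living quarters", system="Electrical installation"),
    IncidentReport(2011, 2, True, area="Main process", system="Electrical installation"),
    IncidentReport(2012, 2, False, area="Living quarters", system="Electrical installation"),
    IncidentReport(2013, 4, False, area="Main process", system=None),
]

counts = count_by_area_and_system(reports)
for (area, system), n in counts.items():
    print(f"{area} / {system}: {n}")
```

With a single combined variable, the two electrical fires in the living quarters and the one in the main process would be indistinguishable; with separate fields they are not.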

A large number of incidents categorised as “Not 
recorded area/system” were reported before 2002. 
Almost half of the incidents categorised as “Other” 
were false alarms and could, through the free text 
description, be attributed to specific areas/systems. 
This shows that in some ways the procedure and 
culture of reporting seem to have improved over 
the years. At the same time it appears that it may 
be difficult to categorise false alarms. 


6 CONCLUSIONS 


The numbers show that there is room for improve- 
ment regarding fire safety in the petroleum pro- 
duction on the Norwegian shelf. There are many 
incidents, although with low degree of severity. 
Future work should focus on investigating the 
underlying causes and triggering factors of the fire 
incidents, to be able to find focused fire preventive 
measures. 

There is also a large number of false alarms, 
which may be quite costly if they cause production 
downtime. A more thorough investigation would 
be informative concerning what types of equipment 
are causing false alarms. This could form the basis 
for targeted measures decreasing the occurrence of 
false alarms and downtime and thus increasing the 
economic profit of the installations. 


The numbers also show that severe incidents do 
not occur often, something that may be explained 
by good control of barriers. However, there are still 
some incidents that occur that have the potential 
of developing into a severe incident. Hence, there 
must still be a focus on barriers preventing the 
consequences of an escalating incident. 


ABSTRACT: The Norwegian Atlantic salmon farming industry is exploring the possibility of running fish 
farms in more exposed locations. The severe wave and current conditions, irregular wind, sheer remoteness, 
and limited weather windows challenge the operational planning needed to avoid accidents. The objective of this 
paper is to present results from applying a generic list of safety-critical parameters to a net cleaning operation 
to assess their usefulness and relevance in aquaculture. The list was proposed based on implications from 
major accident causation theories in safety research. The case study demonstrated that the list is a useful 
operational planning tool to identify activity failure mechanisms that have potential to cause accidents. The 
results also have implications for how to use barrier principles in aquaculture in general. 


1 INTRODUCTION 


The economic value created from Norwegian aqua- 
culture is expected to reach 25.3 billion Euro by 
the year 2050, which means that the production 
will increase fivefold compared to the year 2010 
(Olafsen et al., 2012). Despite the positive prediction, 
the fish farming industry is facing the challenge of 
fewer available locations in the sheltered coastal 
environment and increasing negative ecological 
consequences due to sea lice, fish escapes and farm 
waste on the seabed (Holmer, 2010). One attempt to 
solve these challenges is to move fish farms to more 
exposed locations. This means that the fish farms 
have to deal with the amplified risk to both fish and 
humans due to severe wave and current conditions, 
irregular wind and sheer remoteness. Many techni- 
cal solution concepts are initiated based on extensive 
offshore experience from Norwegian oil and gas 
industry. The Norwegian Directorate of Fisheries, 
which is responsible for sustainable management 
of marine resources and marine environment, has 
approved several concepts that are based on offshore 
technology (Norwegian Directorate of Fisheries, 
2017). These include Ocean Farm 1 from Ocean 
Farming AS, Havfarm NSK 3417 from Nordlaks 
Oppdrett AS, and Aquatraz from MNH production. 
Ocean Farm 1 has been put into pilot operation in 
Frohavet, Norway (KYST, 2017). Some fish farmers 
have even started running test facilities by expanding 
existing systems towards exposed areas with few sig- 
nificant technological and operational changes. 


Despite the amplified risk due to exposed loca- 
tions, the motivation for performing risk assess- 
ments is relatively low in parts of the aquaculture 
industry (Holmen et al., 2017). Aquaculture differs 
from the offshore industry, where both authorities 
and companies have put significant efforts into 
systematic risk management. The similarity of 
the offshore locations and operating environment 
raises the question of whether aquaculture can or should 
learn safety practices from the offshore oil and gas 
industry. One distinction is that the offshore industry 
has been implementing system-based risk analysis, 
while aquaculture considers more day-to-day 
practical routines (Pettersen, 2017). System-based 
risk analysis is represented by Quantitative 
Risk Analysis (QRA). Take the hazardous event 
hydrocarbon leak as an example. The piping items 
(e.g., flanges, valves, instruments and process equipment, 
such as pumps and vessels) and technical safety 
barrier systems (e.g., gas detector, fire extinguishing 
system, and firewall) are modelled to derive a basic 
risk level from the design of the facility. The results 
are documented, and the technical conditions of 
these systems are followed up during operation. 

In a typical commercial fish farm today, the 
major operations include transferring the fish, 
delivery of feed to the fish farm, fish feeding, daily 
inspection, health and biomass control, oxygen 
measurement, net cleaning, removal of dead fish, 
delousing, and IMR (Inspection, Maintenance 
and Repair) (Bjelland et al., 2016). These routine 
operations in the fish farms dominate the variable 
risk on a daily basis. This means that not only the 
technical condition of the fish cage, but also the 
operational and organizational factors and external 
impacts, such as the operating environment and the 
involved working units (e.g., service vessel, workboat), 
should all be considered to keep the risk at an 
acceptable level (Yang and Haugen, 2016). 

Yang and Haugen (2018) have looked into impli- 
cations from major accident causation theories to 
activity-related risk analysis and proposed a generic 
list of safety-critical parameters to reveal possible 
activity failure mechanisms that may lead to major 
accidents. The list was generated from the principles 
and critical factors addressed in various accident 
causation perspectives, including the Energy-barrier 
perspective (Gibson, 1961), Man-made disasters 
theory (Turner, 1978), Conflicting objectives per- 
spective (Rasmussen, 1997), Normal accident theory 
(Perrow, 1984), System-Theoretic Accident Model 
and Processes (STAMP) (Leveson, 2012), High- 
Reliability Organization (HRO) (Roberts, 1990) and 
Resilience Engineering (RE) (Hollnagel et al., 2006). 

The objective of this paper is to present and 
discuss the results of applying these generic safety- 
critical parameters to daily fish farming operations 
to assess their practical usefulness and relevance in 
aquaculture. The net cleaning operation, which takes
place approximately every two weeks, is chosen for
the case study.


2 METHODOLOGY 


Net cleaning, a critical and frequent operation in
marine fish farms to remove biofouling (e.g.,
seaweed and mussels) from the net, was selected as
the case in this paper. A professional service company
provided a net cleaning procedure and the related
operational risk analysis document to the authors.
However, the issues covered in that analysis are
somewhat coarse. The Norwegian aquaculture
industry, in general, lacks complete accident/incident
statistics and in-depth analyses. Moreover,
the risk factors during operations are not systematically
identified and understood by the operators
(Holmen et al., 2017b), which makes it challenging
to collect safety-critical parameters
(SCPs) for the case study.

A preliminary list of SCPs was identified and 
collected from several reports (Fore and Lien, 
2014, Fore and Thorvaldsen, 2017, Holmen et al., 
2017a). The list served as input to interviews
with one expert in the net cleaning operation, one
captain, and one TQM (Total Quality Management)
manager from a marine operation service
company. A joint workshop was planned but could
not be arranged due to schedule conflicts among the
participants. The interviews were structured
around the following questions:


— Are the safety-critical parameters identified in the
preliminary list important? If not, what else has
a direct and significant influence on the different
dimensions of risk (cf. Section 3.2)?

— What are the major hazardous events in each 
task, and what could be proactive and reactive 
barriers? 


As a result, the preliminary list has been verified
by field experts and revised based
on their feedback.


3 CASE DESCRIPTION 


3.1 Net cleaning operation 


Biofouling can provide a suitable habitat for sea
lice larvae and may stop the oxygen supply to the
farmed salmon due to reduced flow and water
exchange between the inside and outside of the net.

The cleaning operation is performed in vari- 
ous ways among service companies. In this case, 
a Remote Operated Net Cleaner (RONC) is 
used, and two operators are assigned to the job 
(Figure 1). The cleaner is remotely controlled by 
monitoring its movement from a screen in the 
wheelhouse onboard a service vessel. 

The operation can be decomposed into five
tasks: arrival of the service vessel at the cage, launch of
the RONC, the cleaning operation, recovery of the RONC,
and departure to the next cage or return to the port.
The detailed procedures are described in Figure 2.


3.2 Operational planning from risk perspective 


The risk dimensions that need to be considered in 
fish farming operations are risk to personnel, risk 
to material assets, risk to the environment (e.g., fish 
escape and pollution), risk to fish welfare, and food 
safety (Yang et al., 2017). Prevention of fish escape,
which is considered a threat to the wild fish
population and environment, is an important focus
of the authorities. Personal safety has gradually gained
attention since the aquaculture industry was
recognized as one of the most dangerous professions
in Norway (Aasjord and Geving, 2009). Fish
health and fish welfare are other rising concerns, as 
sea lice have become a significant disease problem 
(IMR, 2017). The risk to marine assets has received 


Figure 1. 
Figure 2. Stepwise procedure for net cleaning operation
using RONC.


little attention. However, loss of service vessels or,
in a worst-case scenario, the fish farm, will cause
companies significant financial and reputational
damage. The last dimension, food safety, is strictly
controlled by the Norwegian Food Safety Authority,
and this aspect falls outside the scope
of this paper for short-term operational planning.
The exposed locations of the fish farms reduce
the weather window and amplify the operational
risk during net cleaning, especially for the personnel
and the fish. The narrow weather window requires
the operators to be well prepared and to foresee and
reduce the associated risks to be as low as reasonably
practicable during short-term planning.


4 APPLICATION OF THE SAFETY- 
CRITICAL PARAMETERS 


4.1 Safety-critical parameters 


Safety-Critical Parameters (SCPs) are the factors
that have a direct and significant influence on the
risk involved in performing an activity (Yang and
Haugen, 2016). They address possible failure mechanisms
and are used to avoid active failures and
latent errors that have major accident potential on
an activity level (Yang and Haugen, 2018). The proposed
generic parameters cover four aspects: input,
control, process, and interaction. These SCPs are
described briefly in Table 1. Readers are referred
to Yang and Haugen (2018) for a full description.


4.2 Adaptation to net cleaning operation 


The original SCP framework focuses mainly on
personal safety in the context of a hydrocarbon
leak in the oil and gas industry. When the framework
is applied to aquaculture, it is essential to also con- 
sider fish welfare, risk to the environment (e.g., 
fish escape), and risk to marine assets. Further 
identification of special hazardous events that 


Table 1. Safety-critical parameter categories and descriptions, adapted from Yang and Haugen (2018).

Aspect | SCP category | Short description
Input | Work instruction | Stepwise description of a job
Input | Operational limitations | E.g., weather, restricted area, restricted time, to ensure acceptable risk
Input | Plant drawings | Documents that demonstrate the technical systems
Input | Process input | Materials, tools, and spares that are consumed/used in the activity
Control | Proactive barrier systems | Barrier systems to prevent hazardous events from happening
Control | Recovery barrier systems | Barrier systems to reveal and recover from latent errors
Control | Reactive barrier systems | Barrier systems to mitigate the consequences of hazardous events
Control | Temporary barrier systems | Replacements for temporarily unavailable barrier systems
Control | Competence of operator - process model of the task | The competence of operators related to how to perform a specific task
Control | Competence of operator - process model of the system | The competence of the operators to interpret the current state of the controlled process
Control | Personnel exposure | The necessary number of workers
Process | Physical process system | The physical system that the operator interacts with
Interaction | Other concurrent control actions | Other on-going activities, which may have hazardous interactions with the activity
Interaction | Environmental disturbance | Undefined or out-of-range environmental disturbance while performing the activity

give rise to the above risk concerns (in addition to
personal safety) is necessary. The SCP categories
of barrier systems (i.e., proactive barrier systems,
recovery barrier systems and reactive barrier systems)
are removed from Table 1. Instead, the barrier
systems are illustrated in connection with each
hazardous event, as shown in Figure 3, to better
show their functions in preventing the hazardous
events and mitigating the consequences.
Figure 3. Illustration of a hazardous event, causes, consequences
and barrier systems.
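
The structure described here (a hazardous event linked to its causes, its consequences, and the barrier systems that act before and after it) can be written down as a small record. The following is a minimal sketch under stated assumptions, not the authors' tooling: the class name `HazardousEvent`, its field names, and the helper `missing_barrier_types` are hypothetical, and the example entry only encodes the "man overboard" barriers mentioned in Section 5.1.

```python
from dataclasses import dataclass, field


@dataclass
class HazardousEvent:
    """One hazardous event with its causes, barriers, and consequences.

    Hypothetical illustration of the event/barrier structure; the field
    names are assumptions, not terminology fixed by the paper.
    """
    name: str
    causes: list[str] = field(default_factory=list)
    proactive_barriers: list[str] = field(default_factory=list)  # prevent the event
    recovery_barriers: list[str] = field(default_factory=list)   # reveal and recover latent errors
    reactive_barriers: list[str] = field(default_factory=list)   # mitigate consequences
    consequences: list[str] = field(default_factory=list)

    def missing_barrier_types(self) -> list[str]:
        """Return the barrier types for which nothing has been recorded."""
        groups = {
            "proactive": self.proactive_barriers,
            "recovery": self.recovery_barriers,
            "reactive": self.reactive_barriers,
        }
        return [kind for kind, barriers in groups.items() if not barriers]


# Example entry: only the barriers named in Section 5.1 are recorded here,
# so the other types show up as gaps in this record (not necessarily in practice).
man_overboard = HazardousEvent(
    name="man overboard",
    causes=["mooring the vessel to the cage"],
    proactive_barriers=["anti-skid surface of the floating collar", "anti-skid shoes"],
)
print(man_overboard.missing_barrier_types())
```

A record like this makes it easy to see, per hazardous event, which barrier types have been documented and which still need to be asked about in interviews.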


Another adaptation can be found under the
category “process”. The fish farming industry
deals with live animals. During net cleaning,
the salmon and cleaner fish stay in the cage. So,
besides the “physical process system”, another
SCP, namely “fish”, is added to emphasize the risk
dimension of fish welfare.


5 RESULTS 


The five tasks are combined into three because of 
their similarities: service vessel arrival and depar- 
ture, RONC launch and recovery, and cleaning 
operation. This section describes the SCPs for 
these tasks in detail following the SCP categories 
proposed in Table 1 and Figure 3. 

The possible hazardous events and their rel- 
evance to each task are listed in Table 2, with the 
primary concern indicated with *. These hazardous 
events are identified based on the findings from 
Holen et al. (2017a), Holen et al. (2017b), Fore and 
Thorvaldsen (2017), Fore and Lien (2014) and the risk
analysis document provided by the service company.
Some hazardous events are relevant to several
tasks. Therefore, the reactive barriers that act upon the
occurrence of a hazardous event are presented in
Section 5.4, instead of being discussed separately
under each task.


5.1 SCPs for service vessel arrival and departure 


The major hazardous event that may happen
during service vessel arrival and departure is
“service vessel collides/contacts with the fish
farm”. Figure 4 summarizes the possible causes and the
existing proactive barriers (PB_i) and recovery barriers
(RecB_j), which are derived from discussions with
the company and the results in Yang et al. (2017).

Another hazardous event is “man overboard” 
while mooring the vessel to the cage. The anti-skid 
surface of the floating collar and shoes are the only 
existing barriers. To prevent oil spill (e.g., diesel oil, 
hydraulic oil) from the service vessel, the proactive 
barrier is also the 5-point check, which includes 
motor oil, hydraulic oil, coolant, seawater filter, 
and visual inspection. 


Table 2. Possible hazardous events in net cleaning operation
and their relevance to each task.

Task columns: Arrival and departure | RONC launch and recovery | Cleaning operation

Hazardous event | Relevance marks
Service vessel collides/contacts with fish farm | x* x
Holes in the net | x*
Oil spill from the service vessel | x x x
Dropping/swinging objects | x*
Loss of growth of fish |
Personal injury | x x
Man overboard | x x x

*Primary hazardous event of concern.


5.2 SCPs for RONC launch and recovery


The primary hazardous event that may happen during
launch and recovery of the RONC using a crane is
“dropping/swinging objects.” Figure 5 shows the
possible causes and the existing proactive barriers (PB_i)
and recovery barriers (RecB_j), which are derived from Holen
et al. (2017b) and discussions with the company.
Another hazardous event could be “holes in
the net” due to entanglement between the hook and
the net. However, this is relatively rare. “Man
overboard” is also a possible hazardous event. Another
concern is “personal injury” while using the
crane. The injuries include entanglement, crush,
and back injuries (Holen et al., 2017a). In this case,
no proactive barriers are set to prevent such injuries.


5.3 SCPs for cleaning operation


Figure 6 shows the primary hazardous event during
the cleaning operation, “holes in the net”, together with its
possible causes and existing barriers, which are
derived from Føre and Lien (2014), Fore and Thorvaldsen
(2017), Jensen et al. (2010) and discussions
with net cleaning operators.

“Personal injury” and “loss of growth of
fish” are also relevant to this task. The various
causes and barriers are described in Figure 7 and
Figure 8, respectively.
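
The task-by-task discussion in Sections 5.1 to 5.3 amounts to a lookup from each combined task to the hazardous events named for it. The sketch below is one hedged way to encode that lookup; the names `HAZARDS_BY_TASK`, `tasks_affected_by`, and `primary_hazard` are hypothetical, and the entries contain only the events named in the running text of those sections, not every mark in Table 2.

```python
# Hazardous events per combined task, as named in the text of Sections 5.1 to 5.3.
# Convention assumed for this sketch: the first entry of each list is the
# primary concern for that task.
HAZARDS_BY_TASK = {
    "arrival and departure": [
        "service vessel collides/contacts with the fish farm",
        "man overboard",
        "oil spill from the service vessel",
    ],
    "RONC launch and recovery": [
        "dropping/swinging objects",
        "holes in the net",
        "man overboard",
        "personal injury",
    ],
    "cleaning operation": [
        "holes in the net",
        "personal injury",
        "loss of growth of fish",
    ],
}


def tasks_affected_by(hazard: str) -> list[str]:
    """Return the tasks for which the given hazardous event is listed."""
    return [task for task, hazards in HAZARDS_BY_TASK.items() if hazard in hazards]


def primary_hazard(task: str) -> str:
    """Return the primary hazardous event recorded for a task."""
    return HAZARDS_BY_TASK[task][0]


print(tasks_affected_by("holes in the net"))
print(primary_hazard("cleaning operation"))
```

Querying by hazard rather than by task shows why the reactive barriers are treated once in Section 5.4: several events recur across tasks.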


5.4 Reactive barriers 


The hazardous events that are discussed above are 
further analysed to identify the barrier systems 
that aim to mitigate or reduce the possible conse- 
quences to the fish, the personnel, and the environ- 
ment (Figure 9). The existing barrier systems are 
collected by interviewing the service company. 
Table 3. SCPs for service vessel arrival and departure.

SCP category | SCPs
Work instruction | See Figure 2
Operational limitations (limitations to arrival) | Generic vessel limitations: stability, loading, maximum passenger, minimum crew; Weather limitation: wind, wind direction, current speed, wave height, visibility; Restricted mooring points; Restricted time of operation
Plant drawings (layout of the cage) | Actual positions of mooring lines of the cage; Actual positions of bridles of the cage and the net (due to deformation)
Process input (tools, materials, spare parts that are used) | Condition of navigation system; Condition of telecommunication system; Condition of mooring line
Competence of operator - process model of the task (competence of completing arrival and departure) | Competence of maneuvering the vessel; Knowledge of influences of the weather condition to maneuvering: season, rain, wind, wave, current, fog, icing condition, darkness; Competence of safe mooring
Competence of operator - process model of the system (knowing the vessel) | Knowledge of the condition of the vessel; how the vessel reacts upon maneuvering; Knowledge of the safety systems: emergency stop, rescue equipment, firefight system
Personnel exposure | Two operators in the wheelhouse
Physical process system (the vessel itself) | Technical condition of the vessel: propulsion system, power system, side propellers, etc.; The disinfection status of the vessel; The size of the vessel: length, width
Fish | Non-applicable
Other concurrent control actions | No other ongoing activities
Environmental disturbance | Unexpected big waves and strong current from, e.g., passing vessels; Changed wind speed, wind direction


Table 4. SCPs for RONC launch and recovery.

SCP category | SCPs
Work instruction | See Figure 2
Operational limitations (limitations in launch and recovery operation) | Weather limitation: wind, wind direction, current speed, wave height, visibility; Generic crane limitations: max. lifting moment, capacity, slewing torque, etc.
Plant drawings (layout of the cage) | Bird net, the height of the railing
Process input (tools, materials, spare parts that are used) | Technical condition of the crane; Interface with the RONC: crane fastening, crane lifting straps; Velocity of the crane tip; Condition of the technical communication system
Competence of operator - process model of the task (competence of completing launch and recovery) | Knowledge of the stability of the vessel: ballast tank, weight distribution, wind-exposed surface; Knowledge of influences of the weather condition to launch and recovery: rain, wind, wave, current, fog, icing condition, darkness; Knowledge of operating crane: relative movements between the vessel and the cage
Competence of operator - process model of the system (knowing the RONC) | Knowledge of the condition of the RONC
Personnel exposure | One operator on the deck, one operator in the wheelhouse
Physical process system (the RONC) | The condition of the RONC: weight, size, disinfection status
Fish | Non-applicable
Other concurrent control actions | No other ongoing activities
Environmental disturbance | Unexpected big waves from, e.g., passing vessels; Changed wind speed, wind direction


Figure 4. Causes of service vessel collision/contact
with the fish farm and existing barriers to prevent it.
[Diagram not reproduced; legible cause labels: inadvertent operations, unexpected high waves, strong current, mooring line breaks, net deformation.]
Table 5. SCPs for cleaning operation.

SCP category | SCPs
Work instruction | See Figure 2
Operational limitations (limitations in cleaning) | Weather limitation: current speed, current direction, visibility underwater; Generic RONC limitations: max. working water pressure, speed; Strength of the netting; Restricted time for operation
Plant drawings (layout of the cage) | Remaining systems in the cage; Actual position of the net (due to deformation)
Process input (tools, materials, spare parts that are used) | Technical condition of the equipment: RONC, high-pressure unit, inlets of the sea water, cables and hoses, power supply, sharp edges, cracks, loose screws; RONC spare parts: discs, nozzles
Competence of operator - process model of the task (skill of cleaning) | Knowledge of cleaning techniques; Knowledge of RONC untangling and retrieval techniques; Knowledge of disinfection techniques
Competence of operator - process model of the system (knowing the RONC and the net) | Knowledge of remotely maneuvering the RONC and of RONC behavior in different current speeds and directions; Knowledge of the status and actual position of the net; Knowing the actual position of the RONC during cleaning
Personnel exposure | One operator on the deck, one operator in the wheelhouse
Physical process system (the net itself) | The actual position of the net; The condition of the net: type, slackness, age, degree of wear, existing holes, vulnerable parts (e.g., side rope); Possible items on the net: lost knives, fish hooks, mussels
Fish | Oxygen level, type of fouling, possible disease, possible hydroid
Other concurrent control actions | No other ongoing activities
Environmental disturbance | Unexpected big waves from, e.g., passing vessels; Changed wind speed, wind direction

Figure 6. Causes of holes in the net and barriers to
prevent them.
[Diagram not reproduced; legible cause labels: RONC stuck in the net; sharp edge, crack; fishing hooks, lost knives, mussels on the net; slack net/ropes drawn into RONC; wear from cleaning discs; inadvertent operations.]

Figure 7. Causes of personal injury during cleaning
and barriers to prevent it.

Figure 9. Reactive barrier systems of various hazardous
events.


6 DISCUSSION


6.1 Applicability of the SCP list 


In today’s fish farms, it is common practice to hire
marine operation service companies to carry out
various fish farming operations, such as net cleaning,
delivery of feed, and delousing. Safe
operations are heavily dependent on the experience
of the operators. The rapid development of
the industry has been accompanied by fast employment growth in
these service companies. Concerns about employee
inexperience arise from both the industry and the
authorities. It is especially challenging to get skilled 
staff at exposed locations (Johannesen and Sæther, 
2017). The industry is also experiencing technological
innovations (Bjelland et al., 2015). The rapid
emergence of new systems requires operators to
learn about the systems and operational techniques
quickly to cope with unforeseen challenges. “Planning
in details is the keyword,” as reflected by
Marine Harvest, which has been running test facilities
in exposed locations (Johannesen and Sæther,
2017).

The generic list of safety-critical parameters
provides systematic guidance to identify the
vital information that needs to be collected for
detailed operational planning. The detailed information
can assist the decision-maker in approving
the operation by considering the risk to personnel,
the environment and the fish. The information
can also give the operators a reasonable anticipation
of what may happen and what the proper
reactions are or could be. The results from the case
study show a significant number of critical operational
risk factors that originate from experience and
understanding of the systems. These factors will
also facilitate the training and learning process of
junior operators to operate more safely.


6.2 Challenges regarding using the SCP list 


The main challenge with using the SCP list lies in 
acquiring the information under each parameter. 
The service companies have specialized equipment,
service vessels and expertise for performing the
operations. However, they may not have enough knowledge
about the specific site they are working on, such as
the design of the cages, locations of mooring lines,
systems installed inside the net, the health status
of the fish, and the possible presence of knives,
fishing hooks and so forth. This information
could be available upon request, but close cooperation
and effective communication between the
service companies and fish farmers are challenging,
as revealed by Fore and Lien (2014). The listed
SCPs can improve the communication by indicat- 
ing clearly what information the service company 
is seeking. 


It is also challenging to collect operational
limitations, which the Norwegian Maritime Authority
identifies as one of the critical underlying causes
of accidents (NMA, 2017). It is not common
for the service company to have written operational
limits regarding weather conditions. It is
subject to the captain’s judgment and experience to
decide whether the operation is feasible. Moreover,
on some sites, the weather changes quickly, which
makes it challenging to derive standard operational
limits.

Another challenge is to understand the influence
of the planned operation on fish welfare and
fish health, which has not yet received much attention.
For instance, hydroids in the washed-off
waste damage the salmon gills (Saue, 2017). Gill
diseases have been associated with reduced growth
and large-scale mortality (Mitchell and Rodger,
2011). This may remain unknown to the
service company until symptoms become relatively
severe. Acquiring such information requires a close
follow-up of fish health after the operation and
knowledge about the latest research findings.


6.3 Applying barrier principle in aquaculture 


Using a barrier perspective as a strategy to reduce
risk, especially of fish escapes, is gaining increasing
attention from both the industry and the government
(Ministry of Trade Industry and Fisheries,
2017). The barrier perspective is one of the major
accident causation theories behind the generic
SCP list. Identification of barriers is an essential
part of the case study, and the results show that
in current fish farms, barrier systems are
partially in place and functioning, even though no
explicit “barrier” label is given yet. For example,
supervision is fairly prevalent in the industry
as a response to the increasing use of junior
operators. The senior operators on the site can
correct possible latent errors made by juniors
during the operation. This correction is a typical
recovery barrier. In fact, compared to supervision,
certifications and procedures are not yet widely
adopted practices in the service companies.
Safe operations are heavily dependent on
operator experience, from which the SCPs are
derived. The results have several implications for
implementing a barrier perspective in the industry
in general.

First, clearly defined hazardous events are helpful
to identify which barriers are in place, and
which could be missing. Definition of hazardous
events can be subjective, and this will influence
what could be the proactive barriers or reactive
barriers. For example, both holes in the net and
fish escape can be defined as a hazardous event,
but the reactive barriers will differ. ROV or diver
inspection and repair of the holes can be reactive
barriers to “holes in the net”, while retrieval of
escaped fish is the reactive barrier to “fish escape”. It is beneficial to
have standard definitions of hazardous events in
the industry to facilitate a common understanding
of the reactive barriers.

Second, the performance of the barrier systems
should be evaluated in different operations. For
instance, holes in the net were the most direct cause
of fish escape in the period 2010-2016 (Fore and Thorvaldsen,
2017). The reactions (reactive barriers) to
the holes vary from operation to operation. During
and immediately after the net cleaning operation,
the camera mounted on the RONC can be used to
inspect whether holes are present. In some cases, the
RONC itself has been used as a temporary barrier
in front of the hole to prevent fish escape. For other
operations, such as delousing, ROVs or divers have
to be sent out separately to do the inspection. Even
though the inspection can be a reactive barrier in
both operations to prevent fish escape, the response
time differs between operations. In turn, this
influences the performance of the barrier systems,
which affects potential consequences (Sklet, 2006).

Third, it is necessary to be aware that all kinds
of functions, elements, and systems that are associated
with safety can be given the label “barrier”
(Rollenhagen, 2011). On the one hand, there is
a potential for setting up more effective proactive
and reactive barrier systems in the case study. For
example, absorbent booms/sheets can be carried
on board the vessel in case of an oil spill. On the
other hand, purely depending on barrier systems
may not be enough for preventing accidents, even
though operators’ knowledge about all the SCPs
(including operational limits) identified in the case
study can be labeled as “barriers.” Care
should be taken when defining the barrier systems.


7 CONCLUSION 


The case study in this paper illustrates that the proposed
safety-critical parameters provide a systematic
overview of the critical risk factors in specific
fish farming operations for operational planning,
considering the risk to personnel, fish escape, fish
welfare and marine assets. The results also have
implications for how to use barrier principles in
aquaculture in general. The information collected
by using the SCPs for fish farming operations can be
used to improve the learning process of junior
operators and to facilitate communication between
the fish farmer and service companies.

The case study opens up the possibility of
applying the SCP framework to other more
complex operations in aquaculture. Furthermore,
the priority of the SCPs should be evaluated based
on experience from senior operators and on relevant
accidents and near misses. Prioritization of
the risk factors is left as further work.
ABSTRACT: A computer-simulated microworld was used to study risk management in a critical
infrastructure context. Thirty-six students assumed the role of the Chief Executive
Officer (CEO) of an electric distribution company, deciding how to allocate resources between investments in risk reduction and other
investments. We studied the effect of differences in the incentives to invest in risk reduction and
in the extent to which the participants had access to a risk assessment. We found that both independent variables
influenced the total resources spent on risk reduction and the total losses due to storms. If the company had
to bear a larger part of the total losses due to a storm, the participants’ propensity to invest in risk reduction increased.
In addition, if the participants were provided with a simple risk assessment to support their decisions, they
invested more resources in risk reduction and were also more successful in limiting the total losses
due to storms.


1 INTRODUCTION 


The functioning of modern societies is dependent
on the services provided by an interconnected web
of critical infrastructures (CIs). Although what
counts as a CI might differ between countries,
telecommunication, electric power, transportation
and water supply systems are usually considered
to be CIs. Such systems can have a considerable
geographic extent, sometimes covering entire
nations, or even continents.

Two trends in the development of CIs are especially
important for the ability to manage risks associated
with them. First, they are becoming
increasingly interconnected, growing into a so-called
“system of systems” (Kröger, 2008, OECD, 2011),
thereby increasing the risk of transboundary
crises, i.e. crises that “affect multiple jurisdictions,
undermine the functioning of various policy sectors
and critical infrastructures, escalate rapidly
and morph along the way” (Ansell et al., 2010).
Thus, what might previously have resulted in a
local disruption of some services might nowadays
spread quickly and grow into a full-blown crisis.

Secondly, the planning, operation and management
of these systems have become increasingly
fragmented (Almklov & Antonsen, 2010,
de Bruijne & van Eeten, 2007). A system that
used to be run by one organization can nowadays
be managed by a multitude of different actors.
Thus, at the same time as the systems are growing
more interconnected, their management
is becoming more fragmented (de Bruijne &
van Eeten, 2007).

When CIs and their associated risks are becoming
increasingly connected, it is no longer viable
to approach risk management in a “silo-based”
fashion, focusing on one actor and one system at
a time. This conclusion has influenced the corporate
world (Gordon et al., 2009) as well as governmental
efforts to manage risk through so-called
‘all-hazards’ and ‘whole-of-society’ approaches
(OECD, 2009).

Notwithstanding the fact that there are efforts
to manage risk from a more holistic perspective,
it is very difficult to know if approaches such as
the ones exemplified above actually lead to more
effective management of risk (Rivera et al., 2017,
Hoyt & Liebenberg, 2011). Therefore, there is a need
to better understand risk management in general,
and specifically in an interconnected world.

There are many different methods one can use to
try to increase our knowledge of risk management.
Here we present the initial results from a study
making use of a method that has not been used
much in the present context: a computer-simulated
microworld. The focus of the study is on two aspects
that can influence risk management: differences in
incentives for investments in risk reduction and the
extent to which risk assessments are available when
making such investment decisions.


2 USING MICROWORLDS TO STUDY 
RISK MANAGEMENT 


There exists a multitude of definitions of risk (see
e.g. Aven and Renn, 2009), and there are also several
definitions of risk management. For example,
the SRA Glossary defines it as “Activities to handle
risk such as prevention, mitigation, adaptation or
sharing” (Society for Risk Analysis, 2015), and it is
also noted that risk management “...often includes
trade-offs between costs and benefits of risk reduction...”.
Other definitions, like the ones suggested
by the International Organization for Standardization
and the United Nations International Strategy for
Disaster Reduction (ISO, 2009, UNISDR, 2009),
also emphasize the fact that risk management is
about doing something, e.g. prevent, mitigate, etc.,
in order to achieve something, i.e. the handling of
risk.

To study risk management is thus to study
a purposeful activity, or a number of activities.
Such studies can be conducted in many different
contexts where the focus might be on one person,
a group, an entire organization, or even multiple
organizations. They can also be performed
within different scientific disciplines, have different
ambitions, and be based on different theoretical
frameworks. They might, for example, involve field
studies focusing on organizations (e.g. research on
High Reliability Organizations, as described by
Roberts (1990)), or laboratory research focusing
on individual decision making (e.g. Tversky and
Kahneman, 1974).

Here the focus is on using a computer simulation, 
a so-called microworld, to study risk management. 
Microworlds have been used for some time 
as a means to capture some of the complexity of 
real-life problem contexts. In a microworld study 
the participants interact with a computer simulation 
and receive continuous feedback on the 
effect of their actions. Brehmer & Dörner (1993) 
describe the use of microworlds as a way to escape 
“...both the narrow straits of the laboratory and 
the deep blue sea of the field study”. They refer to 
the fact that laboratory experiments are often criticized 
by field researchers for lacking relevance and, 
conversely, that laboratory researchers criticize field 
researchers for lacking control of confounding 
variables. A microworld is something in between a 
laboratory experiment and a field study. It retains 
the possibility to control the variables under study 
but at the same time offers more realistic conditions 
in terms of task complexity. 

In terms of realism, there are some clear char- 
acteristics of real risk management problems that 
microworlds are especially suitable to capture. Risk 
management usually requires a series of dependent 
decisions. For example, a decision to invest in risk 
reduction will be influenced by previous invest- 
ment decisions. The previous decisions determine, 
to a certain extent, the present level of risk. And, 
a lower level of risk will, in general, make it more 
difficult to find good investments to decrease the 
level of risk more (decreasing marginal utility of 
investments). Thus, the decisions that led up to the 
present situation (risk level) all have an influence 
on the current decisions. 

Moreover, the risk management problem will 
change continuously, both as a consequence of the 
actions taken by the risk manager and due to 
other factors. These characteristics are identical 
to those described for dynamic decision problems 
(Edwards, 1962, Brehmer, 1992). 

In terms of problem complexity, microworlds 
allow us to extend the study of risk management 
from using lottery-like situations where the focus 
is on the participants’ judgements, and how they 
make them, to more complex game-like situations 
where the focus is on control. To frame risk man- 
agement as a control problem is not new but has 
been proposed as a prerequisite for understand- 
ing it in a dynamic society (Rasmussen, 1997). In 
the microworld context, control refers to “...an 
attempt to achieve some desired state of affairs” 
(Brehmer, 1992). To try to achieve control in the 
present context of risk management means to 
achieve a suitable balance between the costs and 
benefits of risk reduction. Clearly, judgements, for 
example about the likelihood of various events, 
are an important part of trying to achieve control. 
But, there is much more to the control problem 
than merely judgements. 

Yet another feature of microworlds that makes 
them suitable for studying risk management is the 
fact that one can study how successful participants 
are in trading off the costs of risk reduction invest- 
ments with their associated benefits. This is usu- 
ally difficult to achieve in a field study, especially 
if one focuses on low-probability, high-impact events 
(Rivera et al., 2017). Nor can simple experiments 
that lack the dynamic nature of microworlds 
adequately capture it. 

In a critical infrastructure context, negative con- 
sequences can be described in terms of failure to 
maintain vital societal functions (see for example 
the definition of CI in the EU council directive 
2008/114/EC). Thus, a relevant risk management 
problem is how successful the operators of CIs 
are at trading off the costs of investments in risk 
reduction against the benefits the investments provide 
in terms of reduced occurrence and/or consequences 
of failures of vital societal functions. 
However, due to the interconnected nature of 
CIs, and their potentially devastating societal 
consequences, an equally interesting question is how 
their tradeoffs affect the consequences suffered by 
the users of the services. Part of the study presented 
here is aimed at investigating such consequences. 


3 A TWO-LOOP MODEL OF RISK 
MANAGEMENT 


One of the most straightforward control-loop 
models of risk management one can use is a model 
focusing on the relationships between risk management 
decisions and losses/consequences (measured 
in a suitable way) as illustrated in the left part of 
Figure 1. Such a model ignores many important 
aspects of risk management, for example, how 
people make decisions (as for example studied in 
the Naturalistic Decision Making or the Heuristics 
& Biases traditions, see Kahneman & Klein (2009) 
for a review of the two traditions), or why and 
how people identify something as a risk (see for 
example Boholm & Corvellec (2011)). Neverthe- 
less, it captures some of the important dynamics 
of risk management, i.e. past decisions may influ- 
ence future ones, and it can be used as a basis for 
studying differences in risk management arising 
from different incentives to invest in risk reduction. 
For example, it is reasonable to assume that if the 
negative consequences an actor might suffer due to 
a specific activity are reduced, the actor’s interest 
in investing in risk reduction will also be reduced. 
The one-loop model focuses on an actor’s concrete 
experiences in terms of investments in risk 
reduction, which might lead to a change in the 
system of interest, which then might lead to fewer 
negative consequences. If the likelihood of experiencing 
losses is relatively high one would expect 
the control problem to be relatively easy for a risk 
manager. He/she continuously gets feedback on the 
effect of various investments in risk reduction and 
can therefore find a balance between the cost of 
investments and the benefits they bring. However, 
if the focus is on low likelihood/high consequence 
events the problem is not so easy. Often there will 
be no losses at all, and therefore one might question 
the effectiveness of investments. Similarly, 
when severe losses eventually occur, it is very difficult 
to determine if the investments made so far 
had any effect on the outcome. In general, less frequent 
events generating losses will mean that the 
control loop is weaker. 

Figure 1. Illustration of the two-loop model of risk 
management. 

The one-loop model relies on what has previously 
happened in a system, but from a risk management 
perspective what can happen in the system is also 
very important. The question of what can happen 
together with judgements of how likely it is and the 
consequences thereof is an important part of risk 
assessment (Kaplan & Garrick, 1981, Aven, 2010). 
Therefore, we can complement the one-loop model 
with another loop that connects the risk manage- 
ment decision to a description of risk (Aven, 2010). 
The implication of this second loop is that a deci- 
sion to invest in risk reduction can be influenced 
by a description of risk from the system of interest 
(e.g. in the form of a risk analysis). The decision 
can, in turn, influence the future state of the system, 
which can then give rise to a new description of 
risk, and so on. 

Thus, this second loop focuses on what might 
happen in the system of interest. The description 
of risk will most likely also be influenced by previ- 
ous consequences in the system (and perhaps simi- 
lar systems), but that relationship is not explicitly 
illustrated in Figure 1. Moreover, the two loops 
represent two types of analytical processes previously 
termed experiential and analytic processing 
(Marx et al., 2007). 

The two-loop model is the basis for the experi- 
ments presented below. In a CI context, it describes 
the decisions that a CI operator might make to 
invest in risk reduction, and how the investment 
might influence the infrastructure (state of the sys- 
tem), which might then influence the occurrence 
and/or magnitude of negative consequences due 
to unwanted events (depending on the CI it might 
for example be storms, technical failures, floods, 
etc.). It also describes how the same decisions and 
the subsequent change of the CI might influence future 
descriptions of risk, which might in turn influence 
future decisions. 

The “other factors” are included in the model 
as a reminder that risk management is not a closed 
system and that there are many other factors than 
the ones included here that influence risk manage- 
ment in practice. For example, there are many other 
factors that influence the state of a CI system apart 
from investment in risk reduction, such as various 
performance objectives of the CI operator. 


4 THE EXPERIMENTS 


4.1 Overview 


The aim of the experiment was to investigate the 
effect of differences in incentives to invest in risk 
reduction, and whether such effects could be influenced 
by the extent to which risk assessments are available to 
support the investment decisions. The importance 
of incentives for risk management in CIs is salient 
in many sectors where CIs are operated. A specific 
form of incentives is related to the two-loop 
model presented above. It concerns the magnitude 
of the negative consequences affecting a CI operator 
in case of a failure to maintain the vital societal 
function in question (e.g. electric power supply). 
One can imagine two extreme situations in this 
respect. In practice the situation for most CIs is 
probably somewhere in between the extremes. On 
the one hand, one can consider a situation where 
an operator of a CI does not suffer any negative 
consequences at all due to a failure to uphold the 
function of interest. Clearly, there are negative 
consequences of such a failure, but they will solely 
affect the users of the function in question. On the 
other hand, one might have a situation where the 
CI operator will have to compensate the users of 
the function to cover for the full consequences of a 
failure. The net consequences suffered by the users 
would then be zero and the CI operator will have 
to bear the full consequences of the failure. 

In practice, it is difficult to imagine how these 
two extreme situations could occur. It would, for 
example, be very difficult to determine the amount 
needed to compensate the users of a function 
for “all” their losses. Nevertheless, the focus here is 
not to simulate real cases but rather to try to cap- 
ture some of the important mechanisms influenc- 
ing risk management in practice and investigate if 
they can be influenced in different ways. 


4.2 The microworld 


A microworld called MicroRisk was designed as 
part of several master’s thesis projects at Lund University. 
MicroRisk is a computer simulation that 
the participants in the experiment interact with 
using a regular computer and a web browser. 
To implement the two-loop model of risk 
management in a CI context, the particular study 
reported here used MicroRisk to put the participants 
in charge of a fictitious electric power company. 
A detailed description of the study, including 
the user interface, can be found in Lindström 
(2017). MicroRisk was run for a number of turns 
and after each turn the participants were asked 
to prioritize between investments to reduce the 
risk associated with the supply of electricity (e.g. 
strengthening the grid, investing in repair capability, 
spare parts, etc.) and other investments. The choice 
was made by dividing 100 resource units between 
“other investments” and “risk reduction investments”. 
Resource units (or simply units) were the 
measure of consequences (losses) and investments 
in the simulation. Each turn there might 
be a serious storm affecting the power grid, causing 
negative consequences (measured in units). The 
ultimate goal of the participants was to maximize 
the number of units spent on other investments 
minus the units lost due to storms. 

The consequences of a storm were determined 
by the amount of resources the participants spent 
on risk reduction. The more they spent, the smaller 
the consequences of storms. This function was 
implemented in MicroRisk by having the 
“state of the system” (Figure 1) represented 
by one variable called Level of risk. The Level of 
risk was a number between 1 and 1000. At the start 
of each game the Level of risk was set to 500. Each 
unit the participant spent on risk reduction 
reduced the Level of risk by one unit. 
However, each turn 50 units were added 
to the Level of risk independent of the decisions 
made by the participants. This represents the fact 
that a system that is not maintained will degrade 
over time. Thus, a participant who chose not to 
spend anything on risk reduction would quickly 
degrade the system, significantly increasing the 
consequences of storms. Changing the Level of risk 
affected only the potential consequences, not the 
likelihood of a storm occurring. The maximum 
negative consequence that could occur was a loss 
of 1000 units and the minimum was 1 unit. 1000 
units would be lost if a storm occurred when the 
Level of risk was at its highest value, and 1 unit 
would be lost if it was at its lowest value. However, 
the relationship between the Level of risk and the 
negative consequences was not linear. Investments 
when the Level of risk was close to the starting 
value of 500 were more effective than investments 
made when the Level of risk approached low values 
(diminishing marginal utility of investments). 
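
The turn-by-turn bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MicroRisk implementation: the order of operations (investment first, then degradation, then clamping to the range 1 to 1000) is assumed, and the quadratic `storm_loss` curve is a hypothetical stand-in chosen only because it is non-linear, maps Level of risk 1 to a loss of 1 unit and 1000 to a loss of 1000 units, and makes investments near 500 more effective than investments at low levels. The actual loss function is not given in the text.

```python
START_LEVEL = 500
MIN_LEVEL, MAX_LEVEL = 1, 1000
DEGRADATION_PER_TURN = 50  # units added to the Level of risk every turn


def next_level(level: int, risk_investment: int) -> int:
    """One turn: each invested unit lowers the level by one, then the
    system degrades by 50 units. Clamping to [1, 1000] is an assumption."""
    updated = level - risk_investment + DEGRADATION_PER_TURN
    return max(MIN_LEVEL, min(MAX_LEVEL, updated))


def storm_loss(level: int) -> float:
    """Hypothetical convex loss curve (NOT the MicroRisk function)."""
    fraction = (level - MIN_LEVEL) / (MAX_LEVEL - MIN_LEVEL)
    return 1 + 999 * fraction ** 2


if __name__ == "__main__":
    level = START_LEVEL
    for turn in range(1, 6):
        level = next_level(level, risk_investment=0)
        print(turn, level, round(storm_loss(level), 1))
```

Under these assumptions, investing exactly 50 units per turn keeps the Level of risk constant, while investing nothing raises it from 500 to the maximum of 1000 within ten turns.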

Games were played for at least 30 turns. After the 
30th turn, and after each subsequent turn, there was 
a 20% probability that the game would end. Most games 
ended after between 30 and 40 turns, but in the analysis 
below we only use the 30 turns that were completed 
in all games. The participants did not know 
how many turns they were going to play. Neither 
did they know the probability of a severe storm 
each turn. Instead they had to infer the likelihood 
based on previous experience (left side of the loop 
in Figure 1). The number of storms during the 
thirty turns was set to four. In each game, the storms 
were randomly assigned to four of the 
thirty turns. 
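
A minimal sketch of how the storm schedule and game length could be drawn is given below. The function names are hypothetical, and the text does not say how storms were handled in turns after the 30th, so that part is left out.

```python
import random

FIXED_TURNS = 30        # turns completed in every game
STORMS_PER_GAME = 4     # storms placed within the first 30 turns
END_PROBABILITY = 0.2   # chance the game ends after turn 30, 31, 32, ...


def draw_storm_turns(rng: random.Random) -> list[int]:
    """Pick four distinct turns (1..30) in which a storm occurs."""
    return sorted(rng.sample(range(1, FIXED_TURNS + 1), STORMS_PER_GAME))


def draw_game_length(rng: random.Random) -> int:
    """After turn 30 and each later turn, stop with probability 0.2."""
    turns = FIXED_TURNS
    while rng.random() >= END_PROBABILITY:
        turns += 1
    return turns


if __name__ == "__main__":
    rng = random.Random(2018)  # seed chosen arbitrarily for repeatability
    print(draw_storm_turns(rng), draw_game_length(rng))
```

Because the stopping rule is memoryless, participants could not tell from the turn number alone when a game was about to end.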

There were four versions of the game played by 
the participants. The four versions differed with 
respect to two factors, each having two possible 
states: Incentives (Limit/No Limit) and Risk assessment 
(RA/No RA). The incentive to invest in risk 
reduction was varied by having two possible game 
types. In one type (No Limit) the power company 
would bear all losses due to storms and the 
users none. In the other type (Limit) there 
was an upper limit (200 units) to the losses that 
the power company had to bear in full. Above 200 units, 
the company would only have to bear 15% of the 
exceeding losses (the remaining 85% of those losses 
affected the customers of the power company). The 
Total losses is the sum of the losses (in terms of units 
lost) affecting the power company and the customers. 
The Total losses does not depend on whether 
incentives are limited or not; it depends only on 
the state of the system (Level of risk), which is only 
affected by the decisions made by the participants. 
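
The division of a storm loss between the power company and its customers can be illustrated as follows. This sketch assumes that the 200-unit limit applies to each storm separately, which the text does not state explicitly; `split_losses` is a hypothetical helper name.

```python
LOSS_LIMIT = 200                  # units the company bears in full
COMPANY_SHARE_ABOVE_LIMIT = 0.15  # company's share of losses above the limit


def split_losses(total_loss: float, limited: bool) -> tuple[float, float]:
    """Return (company_loss, customer_loss) for a single storm."""
    if not limited or total_loss <= LOSS_LIMIT:
        return float(total_loss), 0.0
    excess = total_loss - LOSS_LIMIT
    company = LOSS_LIMIT + COMPANY_SHARE_ABOVE_LIMIT * excess
    return company, total_loss - company


if __name__ == "__main__":
    for total in (150, 500, 1000):
        print(total, split_losses(total, limited=False),
              split_losses(total, limited=True))
```

For the maximum loss of 1000 units, the company would bear 200 + 0.15 × 800 = 320 units in the Limit condition and the customers the remaining 680, whereas in the No Limit condition the company bears all 1000. The Total losses is the same in both cases.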

The Risk assessment factor was varied by either 
presenting a risk assessment to the participants 
to support their decisions (RA) or not (No RA). 
Thus, when no risk assessment was present, the 
loop on the right in Figure 1 was absent. The risk 
assessment was a simple text of the form “Given 
the present level of preparedness experts estimate 
the consequences of a severe storm to be a loss of 
somewhere between X and Y units”. The interval 
between X and Y was arrived at by first calculating 
the true consequences, $C_{\text{true}}$ (the consequences that 
would occur during a turn if a storm occurred), 
given the present Level of risk. Then a new value 
$C_{\text{new}}$ was drawn at random in the interval 
$[0.8 \cdot C_{\text{true}},\ 1.2 \cdot C_{\text{true}}]$. Then the interval between X 
and Y was calculated by multiplying $C_{\text{new}}$ by a factor 
$k$: $X = C_{\text{new}} - (C_{\text{new}} \cdot k)$, $Y = C_{\text{new}} + (C_{\text{new}} \cdot k)$. 
The factor $k$ was a random number (drawn every 
turn) between 0.1 and 0.5. Thus, the risk assessment 
would show the participants a relatively wide 
interval in the vicinity of $C_{\text{true}}$. 
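
The construction of the interval shown to the participants can be sketched as below. Uniform sampling is an assumption (the text only says the values were random), and the function and variable names are illustrative rather than taken from MicroRisk.

```python
import random


def risk_assessment_interval(c_true: float,
                             rng: random.Random) -> tuple[float, float]:
    """Return (X, Y) for the text 'a loss of somewhere between X and Y units'."""
    # Perturbed centre, within +/- 20% of the true consequences (assumed uniform).
    c_centre = rng.uniform(0.8 * c_true, 1.2 * c_true)
    # Relative half-width, redrawn every turn (assumed uniform).
    k = rng.uniform(0.1, 0.5)
    return c_centre - c_centre * k, c_centre + c_centre * k


if __name__ == "__main__":
    rng = random.Random(7)  # arbitrary seed
    for _ in range(3):
        x, y = risk_assessment_interval(400.0, rng)
        print(round(x), round(y))
```

Under this reading, X is never below 0.4 times the true consequences and Y never above 1.8 times. The interval is not guaranteed to contain the true value: a centre of 1.2 times the true consequences combined with k = 0.1 gives a lower bound of 1.08 times the true value.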


4.3 The participants 


Participants were students at Lund University 
who were recruited by posting information 
about the experiment at different 
parts of the university. The first 30 to enroll in the 
experiment were each given a movie ticket. In total, 
36 students participated in the experiment: 
16 (44.4%) women and 19 (52.8%) men (one person 
did not want his/her sex to be recorded). The 
students were predominantly studying (or had just 
graduated from) engineering programs (22), but 
there were also students from biology (4), medicine 
(3), psychology (1), audiology (1), speech therapy 
(1), nursing (1), systems science (1), graphic 
design (1), and criminology (1). 


4.4 Procedure 


We used a fully crossed, 2 Risk assessment (No RA/RA) 
× 2 Incentive (No Limit/Limit), within-subject 
design. Thus, each participant played all four ver- 
sions of the game. The order in which the games 
were played was randomized to minimize learn- 
ing effects. The participants played the game using 
one of three computers in a closed-off room at the 
Division of Risk Management and Societal Safety at 
Lund University. 

In each version of the game we measured how 
many resources each participant spent on investments 
in risk reduction (Total investments in risk 
reduction) and how large the losses suffered by the 
company and the customers were (Total losses). 


5 RESULTS 


Three of the students did not complete all versions 
of the game. Their data were removed from 
the results. A two-way analysis of variance was 
conducted on the influence of the two independent 
variables (Incentive, Risk assessment) on 
each of the dependent variables (Total losses, Total 
investments in risk reduction). 
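
As a plausibility check on the degrees of freedom, and assuming the analysis was run as a repeated-measures ANOVA (which the within-subject design suggests but the text does not state), the number of retained participants $n$ and the two levels of each factor ($a = b = 2$) give:

```latex
\[
n = 36 - 3 = 33, \qquad
df_{\text{effect}} = a - 1 = 1, \qquad
df_{\text{error}} = (n - 1)(a - 1) = 32
\]
\[
df_{\text{interaction}} = (a - 1)(b - 1) = 1, \qquad
df_{\text{error, interaction}} = (n - 1)(a - 1)(b - 1) = 32
\]
```

This agrees with the F(1, 32) statistics reported for both dependent variables.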


5.1 Total losses 


The mean value for Total losses in game condi- 
tion 1 (No RA, No Limit) was 1448 (SD = 1181), 
in condition 2 (RA, No Limit) it was 1183 
(SD = 856), in condition 3 (No RA, Limit) it was 
2004 (SD = 1324), and in condition 4 (RA, Limit) it was 1583 
(SD = 1125). The results are illustrated in Figure 2, 
where a box plot shows the median value and the 
25th and 75th percentiles (whiskers 
show the most extreme data points). The main 
effect for Incentive was significant, F(1, 32) = 7.47, 
p < 0.05, as was the main effect for Risk assessment, 
F(1, 32) = 5.58, p < 0.05. The interaction 
effect was not significant, F(1, 32) = 0.25, p = 0.62. 

Figure 2. Box plot showing Total losses for all four 
game types. 


5.2 Total investments in risk reduction 


The mean value for Total investment in risk reduc- 
tion in condition 1 (No RA, No Limit) was 1583 
(SD = 354), in condition 2 (RA, No Limit) it was 
1596 (SD = 254), in condition 3 (No RA, Limit) 
it was 1320 (SD = 475), and in condition 4 (RA, 
Limit) it was 1535 (SD = 382). The results are illus- 
trated in Figure 3. 
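
The cell means reported in Sections 5.1 and 5.2 can be used to compute how much the risk assessment changed each dependent variable under each incentive condition. The short script below only does arithmetic on the published means; it is a reading aid and does not reproduce the analysis of variance, and the helper name `ra_effect` is illustrative.

```python
# Cell means as reported in the text, keyed by (risk assessment, incentive).
investments = {
    ("No RA", "No Limit"): 1583,
    ("RA", "No Limit"): 1596,
    ("No RA", "Limit"): 1320,
    ("RA", "Limit"): 1535,
}
losses = {
    ("No RA", "No Limit"): 1448,
    ("RA", "No Limit"): 1183,
    ("No RA", "Limit"): 2004,
    ("RA", "Limit"): 1583,
}


def ra_effect(means: dict, incentive: str) -> int:
    """Difference in mean (RA minus No RA) within one incentive condition."""
    return means[("RA", incentive)] - means[("No RA", incentive)]


if __name__ == "__main__":
    for name, means in (("investments", investments), ("losses", losses)):
        for incentive in ("No Limit", "Limit"):
            print(name, incentive, ra_effect(means, incentive))
```

For Total investments in risk reduction the difference is 13 units without a limit and 215 units with a limit, so the risk assessment mattered far more when incentives were reduced. For Total losses the differences are -265 and -421 units, respectively.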

The main effect for Incentive was significant, F(1, 
32) = 5.24, p < 0.05, as was the main effect for Risk 
assessment, F(1, 32) = 7.42, p < 0.05. The interaction 
effect was also significant, F(1, 32) = 6.45, 
p < 0.05. 


5.3 Level of risk 


During the games, we measured how the Level of 
risk varied. Figure 4 shows the mean value of the 
participants’ Level of risk at different turns and during 
different versions of the game. 

Figure 4. Mean value of the Level of risk for all 
participants. 


6 ANALYSIS AND DISCUSSION 


The results from the study show that both independent 
variables (Incentive and Risk assessment) 
have a significant effect on both Total investments in risk 
reduction and Total consequences (i.e. Total losses). Taking the 
perspective of the users of vital societal functions, 
i.e. all actors dependent on the service in question 
and the public, it is especially interesting to note 
that the introduction of a simple risk assessment 
increased the investments in risk reduction. The 
increase was so strong that it essentially cancelled 
out the negative effect of reducing the participants’ 
incentive to invest in risk reduction. The mean 
value of Total investments in risk reduction is essen- 
tially the same for the cases (No RA/No Limit), 
(RA/No Limit) & (RA, Limit) in Figure 3, whereas 
the mean value for (No RA, Limit) is significantly 
lower. 

The results suggest that one can influence CI 
operators (and others) to invest more in robust 
services without using coercive measures, such as 
detailed regulations. Instead, one can influence 
their behavior in a positive way (for the users of 
the services) by introducing a risk assessment. This 
effect might be especially important in CIs where 
the consequences due to severe interruptions of 
services are small for the operator compared to 
those suffered by the users of the services. 

Moreover, while the results presented in 
Figure 2 illustrate the rise in Total consequences 
due to a reduction of incentives to invest in risk 
reduction (an expected effect), they also show something 
interesting from the perspective of the users of the service. 
They show that the introduction of a risk assessment 
not only led to more investments in risk reduction, 
it also had a significant effect on the Total consequences. 
Thus, the participants not only spent more 
on risk reduction, the investments were also effective 
in reducing Total consequences. One might 
assume that the reduction of Total consequences 
stems only from the fact that the participants spend 
more resources on risk reduction. In part, that is 
true, but comparing Figure 2 (No RA/No Limit, 
and RA/No Limit) to Figure 3 (No RA/No Limit, 
and RA/No Limit) also suggests that even without 
an increase in resources spent on risk reduction 
the introduction of a risk assessment leads to lower 
Total consequences, at least in contexts where the 
CI operator has to bear a large part of the negative 
consequences due to a service interruption. 

Microworld experiments allow researchers 
focusing on risk management to study how the 
state of a system, and the risk associated with 
it, change over time (turns). Thus, using microworlds 
one can study risk management phenomena 
where longitudinal studies of the state of a system 
are crucial. One example is phenomena associated 
with a gradual increase in risk that goes unrec- 
ognized resulting in a system drifting into failure 
(Dekker and Pruchnicki, 2013). Figure 4 shows 
the mean value of the Level of risk in the four 
versions of the microworld. It is clear that one of 
the versions stands out from the rest in terms of 
a gradual increase in risk (No RA/Limit). A ris- 
ing Level of risk due to economic pressure (in this 
case the incentive to select “other investments”) 
can be counteracted by one or both of the two 
loops in Figure 1. The left-side loop counteracts 
this tendency by making the participants 
concretely experience that they pay a price for allowing 
the Level of risk to become high, in that the losses 
when a storm occurs also become high. However, 
when there is a limit to the losses suffered by the 
power-company, the concrete consequences felt 
by the participants are smaller compared to when 
there are no such limits. Thus, the left-side loop is 
weaker in countering the economic pressure (which 
is the same in all cases) than when there are no 
limits. 

The right-side loop counteracts the economic 
pressure by making clear to the participants 
what the consequences can be if a storm occurs. 
Although this might not be as important as concretely 
experienced losses, it has the advantage 
of being present all the time (during all turns), as 
opposed to the concrete losses due to storms, which 
are only experienced in four of the thirty turns. This 
makes it easier to use the risk description as a basis 
for the risk management tasks. An increase in 
investments in risk reduction would, in this case, 
immediately be visible in the risk description pre- 
sented during the next turn. Thus, the participant 
would get a confirmation, although somewhat 
uncertain, that the investments have an effect. 

However, in the (No RA/Limit) condition of the 
experiment this feedback is lacking and the right-side 
loop is thus in its weakest form to counter the economic 
pressure. Moreover, in that condition the 
left-side loop is also in its weakest form, and that is the 
explanation offered by the two-loop model as to 
why the Level of risk is almost constantly increasing 
in one of the conditions in Figure 4, whereas 
in the other conditions it seems to stabilize at 
much lower levels. 


The relevance of the results to real problems 
of CI risk management is unclear and needs to be 
further studied. Nevertheless, based on the results 
it seems reasonable to focus on finding practical 
situations of CI risk management problems that 
might differ with respect to the conditions inves- 
tigated here. Thus, we should look for situations 
where there are differences in terms of the conse- 
quences suffered by the provider of a vital societal 
function in case of a failure to supply it (left loop). 
And we should also look for situations that differ 
with respect to how often failures of functions 
occur (left loop). Moreover, situations differing 
with respect to how (if) risk analyses are used as a 
basis for decision making should also be identified. 
Based on further detailed study of such practical 
situations one might be able to improve the micro- 
world studies linking them better to the practical 
cases and investigating whether the effects seen in 
the microworlds are also present in practice. 


7 CONCLUSIONS 


We have investigated the effect of different incen- 
tives for investing in risk reduction, and the effect 
of risk assessments, on risk management in a com- 
puter simulated microworld. The focus was on 
a situation where one actor supplies a service or 
function that is used by other actors. The risk man- 
agement problem is related to how much the sup- 
plying actor will spend on risk reduction, focusing 
on protecting the service in question, given that the 
actor will suffer a certain amount of negative con- 
sequences in the event of a service failure. 

Both factors had a significant effect on the 
dependent variables of the study. Manipulating the 
incentives to invest in risk reduction by changing 
the negative consequences suffered in case of a serv- 
ice failure affected both the amount of resources 
invested in risk reduction, and the total losses due 
to service interruptions. Moreover, a similar effect 
was observed when the extent to which the actor 
supplying the service in question had access to a 
simple risk assessment changed. Having access to 
a risk assessment (compared to not having such 
access) resulted in more resources spent on risk 
reduction, and also in lower total losses due to service 
interruptions. 

Thus, the results indicate that risk management 
actions can be influenced in a predictable way just 
by providing access to simple risk assessments. Supplying 
a decision maker with a risk assessment in 
the experiments led not only to an increased propensity 
to invest in risk reduction but also to 
more well-balanced decisions that ultimately led to 
lower total losses. The addition of a risk assessment 
can counter the effect of reducing the incentives to 
invest in risk reduction. This effect is especially clear 
when the incentives to invest in risk reduction are low. 

Although the practical implications of the 
results presented here are not clear, the results indi- 
cate that incentives and the extent to which risk 
assessments are used (and how they are designed) 
are two important factors for understanding risk 
management. 


Lessons learned from an unexpected uranium accumulation event 

U.S. Nuclear Regulatory Commission, Rockville, Maryland, USA 


ABSTRACT: On July 14, 2016, a nuclear fuel fabrication facility licensee notified the United States 
Nuclear Regulatory Commission (NRC) that significant amounts of uranium were discovered, poten- 
tially exceeding their Criticality Safety Evaluation (CSE) mass limits, during an annual inspection of a 
scrubber ventilation system. The licensee subsequently confirmed not only significant mass several times 
higher than the CSE mass limits in the scrubber and associated ventilation ductwork, but also significant 
concentrations of uranium. As part of the NRC’s platform of continuous improvement, a lessons-learned 
activity was initiated to explore opportunities for improving the NRC’s regulatory processes for early 
identification of facility operational issues and preventing such events in the future. This paper describes 
the event, some of the licensee’s root causes that led to this event, some of the reasons why the NRC 
did not identify this condition (and similar conditions at this and other facilities) through its regulatory 
processes prior to the event, and the improvements being considered to enhance these NRC regulatory 
processes. 


1 THE EVENT 


1.1 What happened? 


On May 28-29, 2016, a licensee conducted an 
annual inspection and cleaning of their scrubber 
ventilation system. The scrubber is one of the main 
air scrubbers for the nuclear fuel conversion proc- 
ess and is connected to numerous processes. When 
the scrubber ventilation system was inspected and 
cleaned, a large mass of material was found inside 
the large attached ventilation ducting and sub- 
sequently within the scrubber body itself. At the 
time, the licensee believed that the uranium con- 
centration of the material removed from the scrub- 
ber was low. The licensee removed the material and 
sent samples for analysis of the composition. The 
licensee received the results of the initial lab analy- 
sis on May 30, 2016, which indicated a significant 
concentration of uranium. The licensee did not 
fully consider the results and restarted operation 
of the system. Over a month later, on July 13, 2016, 
the licensee received the results of additional lab 
analyses that confirmed the earlier results, indicating 
that the concentration of uranium was almost 
fifty percent (50%) and that the accumulated 
uranium significantly exceeded the 
Criticality Safety Evaluation (CSE) mass limit for 
the process. The licensee reported the event to the 
US Nuclear Regulatory Commission (NRC) on 
July 14, 2016. 


1.2 Why is this important? 


This event did not result in a criticality. How- 
ever, because there were no physical controls or 
measures available to prevent a criticality (i.e., 
all controls and measures failed to prevent the 
accumulation of uranium significantly above the 
CSE mass limit), this event represented a signifi- 
cant safety concern. The subsequent discovery of 
similar conditions at this and other fuel fabrica- 
tion facilities has reinforced the need to address 
the concerns and weaknesses raised by this event in 
both the licensees’ and regulator’s processes. 


2 ROOT CAUSES FOR THE EVENT 


2.1 What led up to this event? 


Throughout a period of more than a decade before 
the event, a combination of process changes, anal- 
ysis assumptions, and operational approaches 
created the environment for this uranium accumu- 
lation event, including the licensee’s slow response, 
poor decision making, and delayed reporting. 

In 2002 this scrubber replaced another scrubber 
and over a number of years ventilation discharges 
from other processes were rerouted to this scrub- 
ber. The scrubber was originally designed to scrub 
acidic off-gas; however, many of the current feed 
streams contain ammoniated (basic) off-gas. 
The feed streams all tied together through a network 
of ventilation ductwork of various diameters 
to a large diameter section before entering the tran- 
sition section of this scrubber, reducing the linear 
velocity of the flow and allowing greater reaction 
time between the scrubber solution and the incom- 
ing feed streams. 

In June 2009 the licensee implemented a new 
safety basis for the scrubber ventilation system 
that lowered the CSE mass limit by more than a 
factor of 60 and installed expansion plenums on a 
vent line, which the licensee assumed would reduce 
the amount of particulates that would travel to the 
scrubber. However, the licensee never considered 
the potential for the uranium to accumulate in a 
chronic fashion within the scrubber ventilation 
system. Further, the licensee incorrectly assumed 
that only minor amounts of uranium powder were 
expected to accumulate in the scrubber ventilation 
system. In December 2009 the licensee identified 
significant accumulation and performed addi- 
tional modifications to remove an ammonia line. 
In 2010 the licensee instituted periodic cleaning 
of various processes. In April 2015 the licensee 
revised a procedure and included a note that based 
on “past experience the [percentage of uranium] of 
the trapped powder is approximately 45-48%.” 

Material buildup was still periodically observed 
and in April through May of 2016 large slabs of 
material would become dislodged during pressure 
washing and fall into the scrubber ventilation transi- 
tion section. The operators were directed to continue 
to pressure wash the material so it would dissolve. 
Though not the desired result, it was fortuitous that 
the material did not dissolve, because the insoluble 
ammonium-uranyl-fluoride mixture prevented the 
formation of a critical mass configuration. 


2.2 Why did the licensee choose not to report 
the event immediately? 


In accordance with the regulations, licensees 
should report to the NRC within one hour any event 
in which no Items Relied On For Safety (IROFS) remain 
available and reliable to perform their function, such 
that specified regulatory performance criteria are no 
longer met. On May 30, 
2016, the licensee received the results of a sample 
taken from the material removed from sections of 
the scrubber ventilation system that indicated high 
uranium concentrations. However, on May 31, 
2016, the nuclear criticality safety engineer, una- 
ware of the sample results and assuming low ura- 
nium concentration, declared that the accumulated 
material did not challenge the CSE mass limit. As 
a result, the licensee did not immediately perform a 
detailed evaluation to determine whether the mate- 
rial discovered could have exceeded the safety basis. 


On June 1, 2016, after completion of the clean- 
ing activities, the nuclear criticality safety engineer 
communicated to the process engineer that there 
were no issues from the nuclear criticality safety (NCS) 
group with restarting the scrubber ventilation system. 
Even though the process engineer was aware of the 
sample result that clearly indicated the CSE mass limit 
had been exceeded and that a detailed evaluation of the 
credited controls (i.e., IROFS) was needed (because they 
had failed to prevent the accumulation), the licensee 
restarted the system. Only after receiving additional 
lab results confirming the high uranium 
concentration did the licensee stop the process and 
report the event to the NRC on July 14, 2016. 
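
The timeline described above can be checked with simple date arithmetic. The following minimal sketch uses only the dates stated in this paper. The variable names are illustrative, and measuring the delay from the first lab result is an assumption made here for illustration, not a statement of when the regulatory reporting clock formally started.

```python
from datetime import date, timedelta

# Dates taken from the event description in this paper.
first_lab_result = date(2016, 5, 30)   # initial analysis indicated significant uranium
confirming_result = date(2016, 7, 13)  # additional analyses confirmed the earlier result
reported_to_nrc = date(2016, 7, 14)    # licensee reported the event to the NRC

# One-hour reporting window cited in the text for this class of event.
reporting_window = timedelta(hours=1)

# Elapsed time between each lab result and the report (assumed reference points).
delay_from_first_result = reported_to_nrc - first_lab_result
delay_from_confirmation = reported_to_nrc - confirming_result

print(delay_from_first_result.days)                 # 45
print(delay_from_confirmation.days)                 # 1
print(delay_from_first_result > reporting_window)   # True
```

Measured from the first lab result, the report came 45 days later, which illustrates the gap between the one-hour expectation and the licensee's actual response.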


2.3 What were some of the root causes 
for the event? 


Fundamentally, because the licensee’s configura- 
tion management program did not ensure that 
design and physical changes to the scrubber ven- 
tilation system and associated controls (i.e., the 
designated IROFS) were properly designed and 
implemented to prevent adverse impact to the 
scrubber ventilation system safety basis, material 
accumulated that reduced scrubber efficiency by 
increasing the amount of uranium carryover to the 
system and generating insoluble uranium bearing 
compounds. Complex chemical interactions from 
various input streams created ammonium uranyl 
fluoride, which is mostly insoluble in water and 
plated out on the scrubber ventilation surfaces and 
within the scrubber body. Over time, the assump- 
tions in the licensee’s safety basis became invalid. 

Furthermore, although the licensee conducted 
periodic inspections of the ventilation ductwork 
and was detecting material accumulation, they 
did not effectively use procedures to weigh and 
sample the uranium concentration in the material 
collected, undermining their ability to properly 
evaluate scrubber performance. Since scrubber 
ventilation system visual inspections did not effec- 
tively detect and remove significantly concentrated 
uranium from the system, eventually the estab- 
lished CSE mass limit was exceeded. 
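
The weigh-and-sample step mentioned above amounts to a one-line estimate: uranium mass equals the weighed material mass multiplied by the sampled uranium mass fraction. The sketch below illustrates how sensitive a mass-limit check is to the concentration assumption. The function names, the material mass, and the mass limit are hypothetical values chosen for illustration and are NOT data from the event; only the 48% figure echoes the upper end of the 45-48% range quoted from the licensee's 2015 procedure.

```python
def estimated_uranium_mass_kg(material_mass_kg: float, uranium_fraction: float) -> float:
    """Estimate uranium mass as weighed material mass times sampled mass fraction."""
    if material_mass_kg < 0:
        raise ValueError("material mass must be non-negative")
    if not 0.0 <= uranium_fraction <= 1.0:
        raise ValueError("uranium fraction must be between 0 and 1")
    return material_mass_kg * uranium_fraction


def exceeds_mass_limit(material_mass_kg: float, uranium_fraction: float,
                       mass_limit_kg: float) -> bool:
    """Return True if the estimated uranium mass is above the given limit."""
    return estimated_uranium_mass_kg(material_mass_kg, uranium_fraction) > mass_limit_kg


# Hypothetical numbers for illustration only (not taken from the event).
example_material_kg = 10.0
example_limit_kg = 1.0

assumed_low_fraction = 0.01   # hypothetical "low concentration" assumption
sampled_high_fraction = 0.48  # upper end of the 45-48% noted in the 2015 procedure

print(exceeds_mass_limit(example_material_kg, assumed_low_fraction, example_limit_kg))
print(exceeds_mass_limit(example_material_kg, sampled_high_fraction, example_limit_kg))
```

With the same weighed mass, the conclusion flips depending on whether an assumed or a measured concentration is used, which is why relying on visual inspection alone could not support the evaluation.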

The licensee completed its own root cause eval- 
uation in October 2016 and identified two root 
causes and two contributing causes for the event. 

Root Cause 1: Programmatic controls for con- 
figuration management did not have the rigor to 
mitigate increased uranium accumulation in the 
scrubber ventilation system when design changes 
were made to the system and when operational 
requirements for the scrubber spray system were 
changed in the procedure. 

Root Cause 2: Management did not scrutinize 
the content of the CSE and as-found conditions 
in the scrubber ventilation system with 
the questioning attitude and conservative bias 
required for a healthy nuclear safety culture. Fur- 
ther, management did not ensure the organization 
had sufficient procedures and training to recognize 
and respond to deviations from the safety basis 
described in the CSE. 

Contributing Cause 1: Operating experience 
and the corrective action processes were not effec- 
tively used to pursue the actions needed to detect, 
estimate, and mitigate deposited uranium in the 
scrubber ventilation system. 

Contributing Cause 2: The scope of licensee 
audits and assessments did not provide a compre- 
hensive review of the nuclear criticality safety pro- 
gram with an appropriate level of intrusiveness as 
is applied to higher risk activities. 

The licensee’s root cause analysis team also con- 
cluded that the event occurred due to long-stand- 
ing weaknesses in the safety culture at the facility. 
The organization did not exhibit the behaviors 
expected to recognize that nuclear work is unique 
and that complex technologies can fail in unpre- 
dictable ways, resulting in adverse latent condi- 
tions not being recognized. Weaknesses in this 
pattern of thinking contributed to invalid assump- 
tions and non-conservative decisions not being 
challenged. As a result, CSE mass limits were not 
well communicated and instructions for verifying 
the effectiveness of criticality controls were not 
well established. The licensee’s root cause analy- 
sis team also identified a number of corrective 
actions to prevent either recurrence or significant 
consequences. 


3 THE POTENTIAL FOR THIS EVENT WAS 
NOT FLAGGED BY THE REGULATORY 
PROCESSES 


While the facility conditions and the licensee’s ini- 
tial responses to the conditions indicate a break- 
down in their processes and programs, the NRC’s 
overall response to the event was appropriate and 
as expected. An Augmented Inspection Team 
(AIT) was chartered on July 28, 2016, to: 1) review 
the facts surrounding the failure to maintain the 
CSE mass limits and controls in the scrubber ven- 
tilation system and the potential for similar failures 
in other production areas using the same control 
protocols, 2) assess the licensee’s response to the 
failures, and 3) evaluate the licensee’s immediate 
and planned long-term corrective actions to pre- 
vent recurrence. Performance issues identified by 
the AIT were submitted for additional NRC inspec- 
tion follow-up and further review and enforcement 
activities followed normal regulatory processes. 

In addition, as part of the NRC’s overall 
platform of continuous improvement, NRC 
management initiated a lessons-learned activity to 
explore opportunities for improving NRC regula- 
tory processes in identifying facility operational 
issues and preventing such events in the future. The 
team was chartered on October 28, 2016, to evalu- 
ate five areas: the licensing process, the inspection 
program, the operating experience program, roles 
and responsibilities, and knowledge management. 
The first two areas (licensing and inspection) are 
specific programmatic areas that periodically inter- 
face with the licensee and their analyses and pro- 
grams. The other three areas (operating experience, 
roles and responsibilities, and knowledge manage- 
ment) support improving the capability, efficiency, 
and effectiveness of the regulatory staff in per- 
forming their responsibilities in the first two areas. 

The team reviewed numerous documents related 
to each of the evaluated areas, including licensing 
review staff guidance, inspection procedures, and 
management directives, and also reviewed docu- 
ments directly associated with the event, including 
the AIT report, an information notice, and a con- 
firmatory action letter. The team also conducted 
individual and group interviews of nearly all 
project managers, technical reviewers, inspectors, 
and managers within the NRC’s fuel fabrication 
arena, including the highest level of management 
within the region responsible for inspection of 
these facilities. 

Through this effort, the team made a number 
of specific observations and recommendations 
associated with each evaluation area. The team 
issued its report on January 31, 2017, and many of 
the observations are summarized in the following 
subsections. 


3.1 The first opportunity comes during 
facility licensing 


These facilities are typically large process facilities 
with numerous individual processes and associated 
analyses. There is significant review effort expended 
during licensing and license renewal. Much of this 
effort ties to fully understanding the facility and its 
processes and the review of the licensee’s identifica- 
tion and control of the multitude of hazards asso- 
ciated with these processes. A significant focus of 
the review is on the potential for criticality events, 
but detailed reviews are not performed for all areas. 
Instead, consistent with the licensing review staff 
guidance, reviewers primarily review the overarch- 
ing facility safety program (a “horizontal review”) 
and sample specific areas for more detailed (“verti- 
cal slice”) review. This prioritization of the scope, 
focus, and detail of review is based on many aspects, 
including operating experience and reviewer expe- 
rience, but also relies heavily on the perceived risk 
associated with the process as conveyed by the 
licensee’s Integrated Safety Analysis (ISA). In fact, 
the current licensing review staff guidance specifi- 
cally states that the reviewers should more closely 
review processes and systems with a relatively high 
unmitigated risk than processes and systems with 
low risk. In the context of this event, the scrubber 
ventilation system was considered low risk by the 
licensee based on the assumptions: 1) that only 
minor amounts of uranium powder were expected 
to accumulate in the scrubber ventilation system, 
2) low uranium concentration would be present 
within the scrubber ventilation system, 3) minimal 
amounts of small uranium particles were entrained 
within the intake ventilation ductwork, and 4) the 
scrubber constantly diluted the uranium concen- 
tration with the addition of makeup water during 
normal operation and anticipated upsets. These 
assumptions by the licensee are reflected in their 
ISA and established controls (i.e., IROFS). 

The NRC licensing review staff guidance does 
not establish the level of review for processes and 
systems determined by the licensee to be low risk. 
Further, there is no specific guidance for reviewing 
processes and systems determined to be low risk 
that rely heavily on licensee assumptions. This lack 
of guidance resulted in the reviewers not reviewing 
this system in any depth during the prior facility 
license renewal. As a result, during the prior license 
renewal and amendment reviews, the reviewers did 
not challenge the overall performance of the sys- 
tem and related controls, including the assumption 
of low accumulation. 


3.2 The inspection program complements 
licensing 


One of the main purposes of the inspection pro- 
gram is to confirm continued compliance with the 
regulations and conformance with the approved 
license. Similar to the license review process, it is not 
practical to perform entire facility inspections, but 
rather, inspectors use a sampling approach. This 
approach is particularly relevant for facilities that 
do not have resident inspectors, which is the case 
for the subject facility (i.e., NRC inspectors are not 
located at the facility on a daily basis). For these 
types of facilities, inspectors visit periodically over 
the year to inspect specific programmatic aspects of the 
license, such as plant modifications, fire protection, 
and operational safety. 

Similar to the license review process, the current 
inspection focus is on perceived high risk areas of 
the facility, which is based on the licensee’s ISA. 
Because the licensee considered the scrubber ven- 
tilation system to be low risk, as stated above, the 
NRC did not consider this system for detailed 
inspection. Several inspectors noted that had the 
system been part of a detailed inspection, the 
licensee’s deficiencies in the CSE and implementation 
of associated management measures and controls 
would likely have been identified. 

Various inspection procedures appear to recog- 
nize that inspectors should examine presumably 
low risk processes and systems, but again, very lim- 
ited guidance is provided on how to select samples 
from such processes and systems or the focus of 
such inspections. 


3.3 Operating experience could have provided 
insight and focus on this system 


Operating experience can be a valuable tool to 
help provide additional input to determining the 
appropriate focus and scope of facility areas to 
review and inspect. However, most license review- 
ers and facility inspectors did not rely upon the 
fuel fabrication facility operating experience pro- 
gram, which had previously been identified as 
needing to be improved. In fact, most inspectors 
and many reviewers were not aware of the fuel 
fabrication facility operating experience database 
or did not know how to access it. Those who 
were aware of the database observed that it 
contained only relatively recent, publicly 
available US data and were unsure whether it could 
trend events to support inspection planning. 
Furthermore, while a criticality inspection proce- 
dure had recently been revised to include the con- 
sideration of operating experience in inspection 
planning, other inspection procedures did not give 
any formal, structured guidance on considering 
operating experience. All of these conditions were 
considered to limit the usefulness of the operating 
experience database to the license reviewers and 
facility inspectors. 


3.4 Understanding roles and responsibilities 


Understanding individual and organizational roles 
and responsibilities is key to efficient and effective 
regulatory reviews and inspections. At the NRC, 
the licensing reviews are performed within one 
organization located near Washington, DC, while 
the inspections are performed within another 
organization located in Atlanta, Georgia. Communication 
and collaboration are essential in ensuring 
full understanding of licensing reviews and their 
implications for the inspection regime, especially 
when the organizations are physically separated by 
such a great distance. 

The licensed facilities are required to provide 
annual summaries that describe the prior year’s 
facility and process modifications and, separately, 
updates to their ISAs. These summaries can be, 
and are expected to be, used to inform inspection 
planning for the subsequent year. In the past, the 
NRC licensing organization primarily performed 
the review of these summaries and provided its 
input to the NRC inspection organization, but in 
2016 the NRC changed the lead role for the ISA 
summary reviews to the inspection organization 
to avoid overlapping efforts. However, the expecta- 
tion of obtaining insights from the facility licens- 
ing review project manager and technical staff 
in these annual submittal reviews was not clearly 
established. Likewise, it was recognized through 
the lessons learned effort that the licensing review 
staff guidance did not clearly establish an expec- 
tation for obtaining insights from the inspec- 
tion organization. In both cases, the potential for 
missing valuable insights was identified since the 
regulatory guidance did not establish a formal 
expectation for the various regulatory staff to col- 
laborate in these areas. 


3.5 Knowledge management 


It is recognized that knowledge management is 
inextricably linked to all the other areas evaluated 
by the lessons learned team. It is an element criti- 
cal to performing technical evaluations of licensee 
submittals, selecting relevant inspection samples, 
administering a successful operating experience 
program, clearly understanding respective roles 
and responsibilities, assessing the significance of 
an event, etc. Most of the lessons learned team 
recommendations involve some aspect of knowl- 
edge management. However, the lessons learned 
team did identify some fundamental knowledge 
management issues. 

The current licensing and inspection qualifi- 
cation programs rely heavily on documentation 
reviews supported with some coursework and site 
visits. Certain skills that are important to regula- 
tory staff success, however, are mostly left for the 
staff to pursue outside the qualification program, 
such as critical thinking, effective communication, 
and conflict resolution. All of these aspects require 
continuous practice and reinforcement and are 
invaluable when performing license reviews, con- 
ducting inspections, and interacting at all levels of 
the organization. 

In addition, ensuring all regulatory staff are kept 
informed of current (and periodically reminded 
of past) licensing, inspection, operational, and 
technical issues improves the understanding and 
ultimately, performance of the regulatory staff 
and organization as a whole. While the inspection 
organization held periodic knowledge management 
seminars of selected topics, such a program was 
not being fully implemented within the licensing 
organization. As a result, lessons learned by some 
regulatory staff were not being effectively shared 
among all the other regulatory staff. 


4 LESSONS LEARNED TEAM 
RECOMMENDED IMPROVEMENTS 
TO THE REGULATORY PROCESSES 


The lessons learned team recommended improvements 
in all five regulatory areas. Most recommended 
improvements are associated with the verification 
of the technical bases and assumptions in the licen- 
see’s ISA and improving the knowledge bases and 
resources used by the reviewers and inspectors. 

For the license review process, the lessons 
learned team identified for further evaluation the 
need to clarify the licensing review staff guidance 
to include guidance on the examination of the 
technical justification for processes and systems 
designated as low risk, especially those justifica- 
tions related to key analysis assumptions. 

For the inspection program, the team identified 
for further evaluation the need to modify the scope 
and focus of inspections so that all facility proc- 
esses and systems with the potential for interme- 
diate and high consequences are inspected within 
some periodicity, regardless of perceived risk sig- 
nificance. The team also suggested the development 
of additional guidance associated with reviewing 
and using the summaries of facility modifications 
and licensee ISA updates in support of inspection 
planning. Such additional guidance could also 
focus specific inspections on these analyses, with 
the intent of verifying the continuing validity of the 
technical bases and assumptions of the analyses. 

For the operating experience program, the 
team identified for further evaluation the need to 
improve the framework and guidance for the flow 
of information from this program to the licensing 
and inspection programs. Related to the fuel fab- 
rication facility operating experience database, the 
team suggested enhancing access to the database 
so that the information is more readily available 
to the licensing review staff and inspectors and to 
include legacy and international operating experi- 
ence so that the database is more complete. 

For the area of roles and responsibilities, the 
team suggested for further evaluation improv- 
ing the guidance related to using the licensee’s 
annual submittal of summary descriptions of 
facility modifications and ISA update summaries 
in inspection planning, setting the expectation to 
gain inspector facility knowledge and experiences 
within the licensing process, and providing rota- 
tional opportunities between the licensing review 
staff and inspectors to foster a better understand- 
ing of the diverse roles and responsibilities. 

Finally, the need for improving knowledge man- 
agement within these regulatory organizations 
is pertinent to all the above aspects. The team 
specifically identified for further evaluation the 
need to improve the qualification programs for the 
licensing review staff and inspectors, to implement 
continuous knowledge management activities, such 
as regularly scheduled seminars and debriefings 
on topics of interest, and to periodically perform 
systematic reviews of the licensing and inspection 
programs to identify gaps and support continuous 
improvement. 


5 CONCLUDING COMMENTS 


The NRC created an action plan to guide and 
track the evaluations of the recommended 
improvements identified by the lessons learned 
team and their subsequent implementation, as 
appropriate. Some activities, such as the oper- 
ating experience database, had previously been 
identified as needing to be improved and were 
already in the early implementation stages. Other 
activities involve additional considerations (e.g., 
priority, schedule, budget, and potential benefit) 
and are being evaluated and implemented, as 
appropriate. Through these efforts, the regulatory 
programs should improve, be more effective and 
efficient, and enhance the assurance of safety of 
the facilities. 


experience 


R. Filippini & P. Urschütz 
EBG MedAustron GmbH, Wiener Neustadt, Austria 


ABSTRACT: The Austrian facility MedAustron is one of the most advanced centers for ion beam ther- 
apy and research in the world. The MedAustron Particle Therapy Accelerator (MAPTA) is a CE-certified 
Class IIb medical device. Safety is the core attribute to be achieved, assessed and maintained during the 
entire system lifecycle. This paper describes the experience of certifying a large particle accelerator as 
a medical device, from the risk management point of view. Examples of the different risk management 
activities are given, in order to explain the way the risks have been analyzed, controlled and evaluated. 


1 INTRODUCTION 


The MedAustron Particle Therapy Accelera- 
tor (MAPTA) is a CE-certified Class IIb medi- 
cal device in compliance with the Medical Device 
Directive MDD 93/42/EEC [1]. The facility con- 
sists of four irradiation rooms, three for clini- 
cal operations (two horizontal and one vertical 
beam lines, one Gantry) and one for non-clinical 
research (one horizontal beamline). The layout of 
the accelerator complex is shown in Figure 1. The 
beam particles are generated in the ion source(s). 
They are pre-accelerated in the Linear Accelerator 
(Linac) and accelerated to the requested energies 
in the 80 meter-circumference Synchrotron ring, 
before being extracted and delivered into the irra- 
diation rooms. 

Figure 1. MAPTA layout. 

The maximum beam energies for patient treatment 
are 250 MeV/u and 400 MeV/u for proton and 
carbon ions respectively. For non-clinical research, 
beam energies up to 800 MeV/u can be generated. 
Compared to conventional radiotherapy with pho- 
tons (gamma rays, X-rays) or electrons, the treat- 
ment with protons and carbon ions has a more 
precise dose distribution (i.e. the Bragg peak) and 
reduces damage to the adjacent healthy tissues sig- 
nificantly. This makes it possible to deliver much 
higher doses, with greater benefits for the patient, 
but also presents safety concerns, which have to be 
addressed by risk management. 

In general, relatively little literature exists on risk 
management for particle therapy accelerators and particle 
accelerators, for several reasons. First, the relevance of 
risk depends on the size of the accelerator; for non-medical 
applications, only at high energies do the consequences 
of a malfunction become relevant for costs and human 
lives. Second, particle accelerators are often research 
facilities, which allows a larger degree of freedom in 
addressing safety without a formal risk management process. 
Third, even if risk management is in place, there are 
restrictions that limit the dissemination of results. 
As a result, most of the existing studies deal with 
reliability and availability, which are related to uptime 
and performance, and only a few of them include safety 
and risk in their scope. The CERN Large Hadron Collider 
is among these exceptions. The LHC operates at very high 
beam energies (7 TeV) and is designed to deal with 
potentially catastrophic scenarios, e.g. beam losses 
and quench of the superconducting magnets with 
damage to equipment and interruption of opera- 
tions for several months [1]. Several reliability 
and safety studies have accompanied its design 
and realization [3, 4, 5, 6]. Within the domain of 
particle therapy accelerators, a few studies exist, 
which are of great interest. A Probabilistic Risk 
Assessment (PRA) has been performed for the 
PROSCAN proton therapy accelerator of the Paul 
Scherrer Institute [7] and similar analyses for radi- 
ation sources can be found in the IAEA report [8]. 

This paper presents the risk management for the 
particle accelerator MAPTA of MedAustron. The 
content of the paper is organized into five sections. 
Following this introduction, Section 2 contains the 
functioning principles and the safety architecture 
of MAPTA. The risk management framework is 
in Section 3, including the regulations and stand- 
ards, the hazards of concern, the risk metrics and 
the acceptability criteria. Section 4 contains the 
description of the risk assessment activities, as well 
as several examples. A few concluding remarks are 
contained in Section 5. 


2 MAPTA SAFETY ARCHITECTURE 


2.1 Functioning principles 


MAPTA produces ion beams of a given energy 
and intensity for the irradiation of the tumor, as 
specified in the patient treatment plan. A complete 
patient treatment consists of between 20 and 
40 sessions, each with a total duration of about 
30 to 40 minutes including a few minutes of irradiation 
time, depending on the tumor indication. 

A treatment irradiation session starts after 
MAPTA has been handed over to the user (medical 
physicists and doctors), who has the responsibil- 
ity of supervising the operations from the control 
room of the irradiation room. The entire irradia- 
tion process is automated. The MAPTA Medical 
Frontend (MF) system administers the irradiation 
according to a predetermined timing sequence. 
The scanning of the tumor target volume and 
the dose delivery are conceptually simple: the MF 
system switches on the chopper kicker magnet 
and actuates the switching dipoles to let the beam 
travel from the synchrotron into the irradiation 
room and then through the nozzle into the target 
volume. The beam scanning is done in “iso-energy 
slices”, in order to deposit the required radiation 
dose at different positions and depths (higher 
energy = deeper penetration). During the irradia- 
tion, the MF system supervises the functioning of 
all components and monitors the beam parameters 
(intensity, position, energy, dose, etc.). Any 
malfunction, deviation beyond tolerable limits, 
or unauthorized action triggers the interruption 
of treatment for safety reasons. 


2.2 Safety architecture 


The safety architecture in MAPTA covers three 
different domains: 1) access protection, 2) interoperability 
of the irradiation rooms and 3) patient 
treatment. 

The MAPTA access protection system guaran- 
tees the “safe access and stay” in the accelerator 
areas and irradiation rooms for technical and med- 
ical personnel. Two states, “Green” and “Red”, are 
defined. When the state is Green, the respective 
area/room is accessible. All radiation sources are 
turned off and the beam stoppers (or equivalent 
devices) prevent the beam from reaching the accel- 
erator area or the irradiation room. When the state 
is Red, the beam can be sent into the respective 
area/room, and the accesses are locked. The vio- 
lation of the Red state (e.g. unauthorized access) 
triggers an interlock which activates the Green 
state for the respective area or irradiation room, 
immediately stopping the beam and shutting down 
all radiation sources. 
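
The Green/Red behavior described above can be read as a small two-state machine. The following is a minimal sketch under stated assumptions: the class, attribute, and method names are hypothetical and do not represent the MAPTA implementation, which is realized in dedicated safety hardware and software rather than in application code like this.

```python
from enum import Enum


class AreaState(Enum):
    GREEN = "green"  # area accessible; sources off; beam stoppers block the beam
    RED = "red"      # beam may be sent into the area; accesses are locked


class AccessProtectedArea:
    """Hypothetical model of one accelerator area or irradiation room."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._enter_green()  # start in the safe state

    def _enter_green(self) -> None:
        # Safe state: no beam, area can be entered.
        self.state = AreaState.GREEN
        self.beam_permitted = False
        self.accesses_locked = False

    def arm(self) -> None:
        # Lock the accesses first, then permit the beam.
        self.accesses_locked = True
        self.beam_permitted = True
        self.state = AreaState.RED

    def report_access_violation(self) -> bool:
        """Interlock: a violation of the Red state forces the Green state.

        Returns True if an interlock was triggered.
        """
        if self.state is AreaState.RED:
            self._enter_green()
            return True
        return False


room = AccessProtectedArea("irradiation room (hypothetical)")
room.arm()
interlocked = room.report_access_violation()
print(room.state, room.beam_permitted, interlocked)
```

The design choice illustrated here is that the only reaction to a violation is a transition to the state in which no beam can reach the area.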

The second safety domain covers the risks that 
are caused by the simultaneous use of the irradia- 
tion rooms. Two operation modes per room are 
defined: 1) physics and accelerator (e.g. for com- 
missioning, maintenance, and test) and 2) medi- 
cal (for clinical operation). For safety reasons, 
certain combinations of the operation modes are 
not allowed and they are vetoed. Moreover, cer- 
tain activities cannot be executed when in clinical 
operation. For example, if one irradiation room 
is in medical mode, none of the other irradiation 
rooms may be in the physics and accelerator mode. 
In case of violation of the veto, an interlock is trig- 
gered with subsequent beam stop and interruption 
of the treatment. 
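
The example veto given above can be expressed as a simple check over the operation modes of all rooms. This is a minimal sketch that encodes only the single rule stated in the text (a room in medical mode excludes the physics and accelerator mode in any other room). The room labels and function name are hypothetical, and the full set of vetoed combinations in MAPTA is not given in this paper.

```python
from enum import Enum
from typing import Mapping


class Mode(Enum):
    PHYSICS_AND_ACCELERATOR = "physics and accelerator"  # commissioning, maintenance, test
    MEDICAL = "medical"                                   # clinical operation


def veto_violated(room_modes: Mapping[str, Mode]) -> bool:
    """Return True if medical mode coexists with physics and accelerator mode.

    Each room has exactly one mode, so the rule from the text is violated
    whenever both modes appear among the rooms.
    """
    modes_in_use = set(room_modes.values())
    return Mode.MEDICAL in modes_in_use and Mode.PHYSICS_AND_ACCELERATOR in modes_in_use


# Hypothetical room labels, for illustration only.
allowed = {"IR1": Mode.MEDICAL, "IR2": Mode.MEDICAL}
vetoed = {"IR1": Mode.MEDICAL, "IR2": Mode.PHYSICS_AND_ACCELERATOR}

print(veto_violated(allowed))  # False
print(veto_violated(vetoed))   # True
```

In the architecture described in the text, a True result would correspond to an interlock with a subsequent beam stop.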

The third safety domain covers the risks dur- 
ing patient treatment. The goals are to guarantee 
that 1) MAPTA is correctly configured for the 
treatment plan, 2) the beam physics characteris- 
tics (energy, intensity, position, width) are within 
acceptable limits and 3) the dose is correctly deliv- 
ered (i.e. the correct dose at the right time, into the 
right irradiation room). Most of the risk control 
measures are implemented in the MF system or 
other systems that assure an equivalent safety level. 
In case of detected errors, the beam is sent into the 
beam dump and the treatment is interrupted. 


2.3 Safety principles 


The architecture of MAPTA meets the “single 
fault safe” principle, which states that “a combina- 
tion of two independent errors must not lead to a 
life-threatening situation” for the patient [9]. This 
principle requires a number of design features to 
be implemented in MAPTA, e.g.: 


• Fail-safe behavior,
• Independency of risk control measures,
• Acknowledgement of a safety action,
• Alternative means or independent back-ups,
• Integrity of risk control measures.


The fail-safe behavior makes it impossible for 
a failure to develop in an unsafe way. For exam- 
ple, the selection of an incorrect source is fail- 
safe because it is impossible for the synchrotron 
to accelerate an ion species different from the one 
for which it is configured (e.g. protons instead 
of carbon ions and vice versa). A power supply 
failure is fail-safe for all MF components, e.g. it 
causes the demagnetization of the chopper magnet 
and the resulting beam dump. A pressure leakage 
in the beam stopper cylinders drops the shutters 
into the beam line by gravity, which again is fail- 
safe. The logic levels of the electronics are fail-safe 
in the case of loss of signal, and so forth. 

Another important design feature is the inde- 
pendence of the risk control measure and the 
supervised system. This is guaranteed by hardware 
separation and/or software segregation, to avoid 
possible common causes of failure. 

The result of the execution of a risk control 
measure is acknowledged and, in case of a fault, 
other risk control measures are triggered to complete
the safety action, e.g. if the beam is still
detected after the actuation of the chopper mag- 
net, or if the position of the beam stoppers is still 
out of the beam path. The functioning of risk 
control measures is also verified before the start of 
every patient treatment session. This is guaranteed 
by functional tests and internal self-diagnostics. In 
addition, periodic checks are performed for the 
integrity of the risk control measure, e.g. by discov- 
ering dormant faults in redundant components. 


2.4 Hazardous situations and reaction times 


The majority of the hazardous situations in 
MAPTA develop in the order of milliseconds to a 
few seconds, depending on the causes and the cir- 
cumstances. As a consequence, every risk control 
measure must have a suitably fast reaction time. All 
risk control measures are implemented in electronics 
and electro-mechanical components, including the 
back-up measures, which intervene in case of fail- 
ure. The operators and the users are seldom called 
upon to respond to hazardous situations, unless 
they have a longer time of development for which 
the human reaction time is compatible and effective. 


3 THE RISK MANAGEMENT FRAMEWORK


3.1 Regulations and standards


Medical devices have to comply with a complex
and articulated set of regulations, norms and


standards. The Medical Device Directive MDD 
93/42/EEC stipulates the manufacturers’ responsi- 
bilities to consider all safety aspects and to be able 
to address them. 

The Directive states that: “the (medical) devices 
must be designed and manufactured in such a way 
that, when used under the conditions and for the pur- 
poses intended, they will not compromise the clinical 
condition or the safety of patients, or the safety and 
health of users or, where applicable, other persons”. 

It continues: “the solutions adopted by the man- 
ufacturer for the design and construction of the 
devices must conform to safety principles, taking 
account of the generally acknowledged state of the 
art. In selecting the most appropriate solutions, the 
manufacturer must apply the following principles in 
the following order: 


• eliminate or reduce risks as far as possible (inherently
  safe design and construction);

• where appropriate take adequate protection measures
  including alarms if necessary, in relation to
  risks that cannot be eliminated;

• inform users of the residual risks due to any shortcomings
  of the protection measures adopted.”


With regard to risk management, the above stipulations
are explained extensively in EN ISO 14971:2012 [10].
The safety requirements
are defined in medical device standards, such as the 
general standard IEC 60601-1 [9], the safety standard
IEC 60601-2-64 [11], the medical software
standard IEC 62304 [12], the standard for medical 
IT networks IEC 80001-1 [13] and the IEC 62366 
[14] for usability engineering. Industrial safety 
standards that are applied for MAPTA are the IEC 
61508 with the derived IEC 62061 and ISO 13489 
[15, 16 and 17]. The list is not complete and many 
other standards apply, which focus on a particular 
technical domain, e.g. the ÖVE/ÖNORM E-8001
[18] for electrical hazards. 


3.2 Scope of the risk management 


The risk management is performed throughout 
the entire life cycle of MAPTA, from the design 
to the decommissioning. While safety is the main 
goal, the manufacturer shall also take into account
reliability, which ultimately affects the performance,
i.e. uptime and patient throughput. This is a
well-known trade-off in safety engineering.
Risk management also takes into account
the interdependencies of MAPTA with the sys- 
tems at the interface, such as the patient position- 
ing system, the technical infrastructure and the IT 
network. The user activities related to the prepara- 
tion of the treatment plan, accommodation of the 
patient, imaging, and supervision of treatment are 
out of scope. 

The risk manager is responsible for the risk
management process. The outcomes of the risk 
management process are reviewed by a team of 
experts, which includes representatives of the top 
management, accelerator technologists, medical 
physicists, and radiation oncologists. 


3.3 Hazards and categories of risk at MedAustron


The EN ISO 14971 defines the applicable hazards 
and the categories of risk for MAPTA. The categories
of risk are the patient, the user of MAPTA,
the technical and medical personnel, the equip- 
ment, the environment and third persons (i.e. 
anybody not involved in the patient treatment, 
operation and/or service of MAPTA). The goal is 
to assure that all applicable combinations “hazard 
versus category of risk” are in scope of risk management.
Table 1 provides an excerpt of the list of
hazards (radiation, electrical, etc.) that apply to 
the categories of risk patient, personnel/user and 
environment for MAPTA, including the applicable 
standards. 


3.4 Risk metric and acceptance criteria 


The risk metric for MAPTA is the Risk Priority 
Number (RPN) [10]. The RPN is the product of 
the probability P and the severity S of a hazardous 
situation, i.e. RPN = P x S. Six frequency intervals
for P are defined in MAPTA: 


• P = 1: Incredible;
• P = 2: Unlikely;
• P = 3: Seldom;
• P = 4: Occasional;
• P = 5: Often;
• P = 6: Frequent.


and five severity levels for S:


• S = 1: negligible;
• S = 2: minor;
• S = 3: moderate;
• S = 4: severe;
• S = 5: catastrophic.


Table 1. Hazards and categories of risk.

Hazards      Patient       Personnel/user                     Environment
Radiation    ISO 14971     IEC 61508, IEC 62061, ISO 13489    UVP (RP)
Electrical   n.a.          ÖVE/ÖNORM E-8001                   n.a.
Software     IEC 62304     n.a.                               n.a.
IT           ISO 80001-1   n.a.                               n.a.
Mech.        ISO 14971     ISO 14971                          n.a.
Use          IEC 62366     IEC 62366                          n.a.


The frequency intervals for P and the levels for 
S are chosen in agreement with the risk manage- 
ment standard and the current state-of-the-art [10, 
19, 20]. 

The acceptability criteria are applied by defin- 
ing two risk thresholds. A risk is acceptable if 
RPN < 10, while it is not acceptable if RPN > 12. 
Risks with RPN between 10 and 12 (inclusive), 
and that cannot be reduced further, are evaluated 
by risk-benefit analysis. These risks are acceptable 
if the benefit for the patient outweighs the residual 
risk. 
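
The risk metric and the two acceptance thresholds above can be written down in a few lines. The following is a minimal sketch, not the MAPTA tooling: the function names and the returned labels are chosen for illustration only. Note that, according to the text, the middle band applies to risks that cannot be reduced further.

```python
def rpn(p, s):
    """Risk Priority Number: product of probability class P and severity S."""
    if not 1 <= p <= 6:
        raise ValueError("P must be between 1 (Incredible) and 6 (Frequent)")
    if not 1 <= s <= 5:
        raise ValueError("S must be between 1 (negligible) and 5 (catastrophic)")
    return p * s


def classify(p, s):
    """Apply the two risk thresholds (10 and 12) to a P x S cell."""
    value = rpn(p, s)
    if value < 10:
        return "acceptable"
    if value <= 12:
        # Only if the risk cannot be reduced further: acceptable when the
        # benefit for the patient outweighs the residual risk.
        return "risk-benefit analysis"
    return "not acceptable"


for cell in [(3, 3), (2, 5), (4, 3), (4, 5)]:
    print(cell, rpn(*cell), classify(*cell))
```

With six classes for P and five levels for S, the only RPN values that fall in the risk-benefit band are 10 and 12.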


4 RISK ASSESSMENT FOR MAPTA 


4.1 Scientific rationale 


The risk assessment includes the activities of risk 
analysis, risk control and the risk evaluation. These 
activities are performed in sequence, one after the 
other, as shown in Figure 2. The Failure Mode, 
Effects and Criticality Analysis, FMECA, is the 
methodology for risk assessment [21] for all systems
and components in MAPTA. The only exception
is the access protection system, for which a
different method applies, as well as a different risk
metric, the Safety Integrity Level (SIL) [15] instead
of the RPN.


Figure 2. The risk assessment process workflow.


The FMECA is an inductive system analysis 
method. It uses natural language to describe
the hazardous situations, which has the advantage 
of facilitating the communication between the risk 
analyst and the system experts. Another advantage 
is that the analysis with FMECA is semi-quantita- 
tive, i.e. instead of precise failure statistics, prob- 
abilities are chosen within predefined intervals. 
This feature makes the risk assessment relatively 
easy, but also turns out to be a limitation in com- 
parison with other methods of system/risk analy- 
sis, e.g. fault tree, which is more accurate. Another 
limitation exists for the analysis of common causes 
of failures. These drawbacks are known and have 
been addressed in MAPTA by ad-hoc risk assess- 
ment guidelines that assist the analyst during the 
compilation of the FMECA. The guidelines pro- 
vide a comprehensive methodological approach 
and guarantee the correctness and consistency of 
the results. 


4.2 Risk analysis 


The risk analysis consists of three activities: defini- 
tion of the hazardous situations, risk estimate and 
risk evaluation. 

The first activity describes the failure dynamics 
as a causal chain of events, “initial cause → error →
fault → failure → malfunction”, that, together
with the circumstances, leads to the hazardous situation.
The more accurate the description of the
causal chain is, the more effective the apportion- 
ment of the risk control measures. 

The second activity makes it possible to estimate
the initial RPN. The estimate is done
without considering the risk control measures. The 
MAPTA risk assessment guidelines provide empir- 
ical look-up tables with intervals of probabilities 
for the different classes of faults or errors (e.g. 
HW/SW faults, human errors, etc.). Analogous 
tables are also available for the severity. The ana- 
lyst chooses the initial RPN within the suggested 
probability and severity intervals, for the worst 
case scenario. The last activity of the risk analysis 
is the evaluation of risks against the acceptability
criteria. Those risks that are non-acceptable,
i.e. RPN ≥ 10, shall be mitigated by risk control
measures.


4.3 Risk control 


The EN ISO 14971 defines three types of risk
control options: 1) inherent safety, 2) preventive/
protective and 3) information for safety. Each of
these risk control options has a different effectiveness,
e.g. by avoiding, preventing or stopping a
hazardous situation, see Table 2. Inherent safety
is the most effective among the control options. It
prevents the hazardous situation from developing.
Preventive measures intervene while the hazardous
situation is developing, while protective measures
intervene when the hazardous situation has
already developed and, for example, represents a
real harm for the patient. Information for safety
is the least effective of the risk control options. It
includes organizational measures such as instructions
for operators and users of MAPTA.


Table 2. Risk control options in MAPTA.

Risk control measure          Effect on the hazardous situation
Inherent safety (physical)    Make it physically impossible
Inherent safety (logical)     Make it logically impossible
Inherent safety (failsafe)    Turn it into failsafe
Preventive                    Detect and stop development
Protective                    Detect and stop at occurrence
Information for safety        Contribute to prevent/avoid

The MAPTA guidelines contain the ration- 
ale for the application of the risk control meas- 
ures and the estimate of the risk reduction. This 
rationale follows a few general recommendations 
from the risk management standard and best prac- 
tice. The risk control measures have to be applied 
together and not in isolation (EN ISO 14971).
Secondly, a risk control measure has to be associ- 
ated with the respective failure event (as identified 
in the causal chain). In addition, they have to be 
applied in the right sequence, i.e. inherent safety 
first, then organizational measures, preventive and 
protective measures. Finally, failure dynamics and 
the risk control measures have to be considered 
(and analysed) as interrelated processes. The model 
that supports this description is an event sequence 
diagram. An illustrative example is shown in Fig- 
ure 3 for a hazardous situation with three risk con- 
trol measures. 

A risk control measure is “successful” if it pre- 
vents the hazard from developing and leads to the 
“end state”, while it is “unsuccessful” if the haz- 
ard can develop further. In terms of P and S this 
means: 


• Successful: P does not change, S is lower;
• Unsuccessful: P is lower, S does not change.


The amount of reduction of P depends on the 
type of risk control measure. The reduction of the 
severity S depends on the instant at which the risk 
control measure intervenes. If this is before the 
hazardous situation becomes a real harm, then the 
severity drops down to S = 1 (no harm), see End
states 1 and 2 in Figure 3. On the contrary, if it
occurs at a later stage, then the residual severity
could be higher, which is the case of End state 3.


Figure 3.


The residual risk (i.e. the final RPN) is the risk 
after all risk control measures have been applied. 
Because the risk is reduced by a certain amount 
after the execution of a risk control measure, the 
residual risk is the highest RPN associated with 
the end states of the last risk control measure, e.g. 
End states 3 and 4 in Figure 3. In the example, End 
state 3 has a residual severity related to the extra 
dose deposited, before the beam is stopped by the 
protective measure. The estimate of the residual 
severity shall take into account: 1) the type of par- 
ticles (e.g. carbon ions are heavier than protons), 
2) the detection threshold of the protective meas- 
ure, and 3) the reaction time, up to the complete 
beam stop. The limits for the maximum extra dose 
are defined in the standard IEC 60601-2-64, both 
under normal and fault conditions [11]. 

The amount of risk reduction is calculated 
in the risk assessment guidelines of MAPTA by 
look-up tables for every risk control measure. 
Figure 4 shows the risk reduction table for a pre- 
ventive risk control measure. The calculation of 
the risk reduction is straightforward. The initial 
RPN = P x S identifies the cell in the table with 
the residual risk after the application of the risk 
control measure, e.g. RPN with P = 4 and S = 5 
becomes RPN = 2 x 5. If more risk control meas- 
ures apply, then the output of one risk reduction 
table becomes the input of the next risk reduction 
table and so forth. 


Figure 4. The risk reduction table for preventive measures.
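
The chaining of risk reduction tables described above, where the output of one table is the input of the next, can be sketched as function composition over (P, S) cells. The tables themselves are not reproduced in the text, so the ones below are stand-ins: a reduction of P by two classes is assumed for the preventive measure only because it reproduces the single example given (P = 4, S = 5 becomes P = 2, S = 5), and the organizational table is purely hypothetical.

```python
from functools import reduce

P_MIN = 1  # lowest probability class (Incredible)


def make_p_reduction(steps):
    """Build a stand-in risk reduction table that lowers P by `steps` classes.

    The severity S is left unchanged, as for a measure whose failure lets
    the hazard develop further (P is lower, S does not change).
    """
    def table(cell):
        p, s = cell
        return (max(P_MIN, p - steps), s)
    return table


# Assumption: matches the example in the text, the real table may differ per cell.
preventive = make_p_reduction(2)
# Hypothetical second table, used only to demonstrate the chaining.
organizational = make_p_reduction(1)


def apply_measures(initial_cell, tables):
    """Feed the output cell of each table into the next one."""
    return reduce(lambda cell, table: table(cell), tables, initial_cell)


p, s = apply_measures((4, 5), [preventive])
print(p, s, p * s)  # 2 5 10
```

Because each table maps a cell to a cell, any number of measures can be chained in the order in which they are applied.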


4.4 Risk evaluation 


The risk evaluation is the final activity of the risk 
assessment. All individual risks are evaluated on 
the basis of the acceptability criteria. After this is 
done, the same individual risks are re-evaluated 
together in order to account for statistical cumu- 
lative effects. According to the risk assessment 
guidelines of MAPTA, the cumulative statistical 
effects are estimated by counting how many indi- 
vidual risks are in the same cell P x S. For every N 
individual risks in the same cell P x S, one cumula- 
tive risk is added in the cell (P+1) x S. The thresh- 
old N depends on the order of magnitude of the 
interval P. The P intervals in the risk matrix cor- 
respond to orders of magnitude 1 (P = 3, 4 and 
5) and 2 (P = 1, 2), which gives N = 10 and N = 100, respectively.
As an example, 12 individual risks with P = 3 and 
S = 3 would be statistically equivalent to two indi- 
vidual risks in 3 x 3 and one cumulative risk in 
4 x 3.
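
The counting rule for cumulative statistical effects can be sketched as follows. The thresholds are those stated above; the function name and the data layout (a list of (P, S) cells, one entry per individual risk) are assumptions made for this illustration. The text does not say how P = 6 is handled, so the sketch leaves those risks untouched, and it applies the rule in a single pass without propagating cumulative risks further.

```python
from collections import Counter

# Threshold N per probability class P, as given in the text:
# P = 3, 4, 5 -> N = 10; P = 1, 2 -> N = 100.
# P = 6 has no stated threshold and is left untouched (assumption).
THRESHOLD_N = {1: 100, 2: 100, 3: 10, 4: 10, 5: 10}


def add_cumulative_risks(cells):
    """Split risks into remaining individual risks and cumulative risks.

    cells: iterable of (P, S) tuples, one per individual risk.
    Returns two Counters keyed by (P, S): individual and cumulative.
    """
    individual = Counter()
    cumulative = Counter()
    for (p, s), n in Counter(cells).items():
        threshold = THRESHOLD_N.get(p)
        if threshold is None:
            individual[(p, s)] += n
            continue
        groups, rest = divmod(n, threshold)
        if rest:
            individual[(p, s)] += rest
        if groups:
            # Every full group of N risks counts as one risk in (P+1) x S.
            cumulative[(p + 1, s)] += groups
    return individual, cumulative


individual, cumulative = add_cumulative_risks([(3, 3)] * 12)
print(dict(individual))  # {(3, 3): 2}
print(dict(cumulative))  # {(4, 3): 1}
```

The printed result reproduces the example from the text: 12 individual risks in 3 x 3 become two individual risks in 3 x 3 and one cumulative risk in 4 x 3.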


4.5 A few results 


Figure 5 shows the results of the FMECA based on 
the functional description of MAPTA. The scope 
of this FMECA is the patient treatment session. A 
patient treatment session consists of four phases: 
1) request and allocation of the accelerator com- 
ponents and the irradiation room, 2) activation, 
3) irradiation, and 4) termination of treatment. 
In total, 132 individual hazardous situations have 
been identified and the respective risks have been 
analyzed, controlled and evaluated. The analysis 
also includes 26 risks related to common causes of 
failure. All residual individual risks (and the cumu- 
lative statistical effects) have been evaluated and 
they are acceptable i.e. all RPNs < 10. This result 
is obtained by applying 90 different risk control 
measures in MAPTA including: 


• 22 inherent safety measures,
• 39 preventive risk control measures,
• 9 protective risk control measures,
• 20 organizational measures.


Figure 5. Distribution of the residual risks.


Another seven risk mitigations have been identi- 
fied and transferred outside of MAPTA for their 
implementation. 

The analysis of the risk control measures pro- 
vides interesting insights on the MAPTA architec- 
ture for the patient safety. A risk control measure 
is often called for in more than one hazardous 
situation. Therefore, by counting the times this is 
required, it is possible to deduce its importance. 
The following statistics are obtained: the preven- 
tive measures are called for 143 times, the inher- 
ent safety measures 111 times, the organizational 
measures 49 times and the protective measures are 
called for 39 times. The preventive measures are the 
most required, followed by inherent safety, while 
the protective measures are the least required. This 
result is in good agreement with the general recom- 
mendation of the medical device standard, which 
states that hazardous situations shall be either 
avoided by design or prevented. Indeed, only a 
smaller percentage requires the intervention of 
protective measures. 
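
The call statistics quoted above can be turned into shares with a short calculation. The numbers are taken from the text; the variable names are illustrative only.

```python
# Number of risk control measures of each type (90 in total).
measures = {"inherent safety": 22, "preventive": 39,
            "protective": 9, "organizational": 20}

# Number of times each type is called for across the hazardous situations.
calls = {"preventive": 143, "inherent safety": 111,
         "organizational": 49, "protective": 39}

total_calls = sum(calls.values())
shares = {name: round(100 * n / total_calls, 1) for name, n in calls.items()}
ranking = sorted(calls, key=calls.get, reverse=True)

print(sum(measures.values()), total_calls)
for name in ranking:
    print(f"{name}: {calls[name]} calls, {shares[name]} %")
```

Under this count, preventive and inherent safety measures together account for roughly three quarters of all calls, while protective measures account for about one call in nine.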

The FMECA, based on the functional descrip- 
tion of MAPTA, is completed by other FMECAs, 
which deal with risks at the system and component 
level. In total, more than 1700 individual risks have 
been analyzed and controlled in MAPTA. The 
overall residual risks have been evaluated and they 
are acceptable. 


4.6 Final reports and post-production 


The outcomes of the risk management activities 
and the verification by tests of the risk control 
measures are included in the MAPTA risk manage- 
ment report, which accompanies the declaration 
of conformity for the CE certification as a medical 
device. All documents produced by the risk man- 
agement are organized in the respective file, which 
is periodically inspected by external auditors. 

The risk management also covers the post- 
production activities, such as product changes, 


maintenance, commissioning and project develop- 
ment. These activities often require the update of 
the existing documents or new risk analyses, and 
because of that, they are constantly monitored. 
Reporting adverse events, errors and near misses 
is also within the scope of risk management. This 
includes the validation of the risk estimates and the 
estimate of MAPTA uptime, based on the opera- 
tional data, as it was done, for example, in [22]. 


5 CONCLUSIONS 


This paper presented an overview of the risk man- 
agement for the particle accelerator MAPTA of 
MedAustron. All activities in scope of the risk 
management have been discussed, with several 
examples regarding the methodologies for the 
risk analysis, the risk control techniques and the 
outcomes of risk assessment. In total, more than 
1700 individual risks have been identified, analyzed 
and controlled in MAPTA. The residual risks have 
been evaluated and they are acceptable. 

The depth of this work is such that it can
barely be summarized within these pages. Nonetheless,
it is worthwhile sharing a few lessons learned.
The first lesson learned is related to the complex- 
ity of the particle therapy accelerator on one hand, 
and the demanding requirements of risk manage- 
ment for medical devices, based on an all hazards 
approach, on the other hand. Complexity has been 
managed by the definition of different frameworks, 
focused on a specific domain e.g. intended use and 
patient safety, access protection and radiation 
hazards, industrial safety, etc. The second lesson 
learned concerns the development of the know-how.
Standards for medical devices are general and
encompass different applications. It is the
responsibility of the manufacturer to interpret the
standard clauses, and organize risk management 
accordingly. A lot of groundwork has been done 
in this respect, with the preparation of scientific 
rationales, guidelines and safety concepts. Another 
lesson learned regards the safety culture. Risk man- 
agement is a discipline that cannot be performed 
in isolation, and requires competencies of spe- 
cialists from various fields. An essential requisite 
is to build adequate safety responsiveness within 
the different groups (engineers, accelerator and 
medical physicists) to be able to anticipate rather 
than react to potential adverse events. In order to 
attain these goals, risk management promotes reg- 
ular exchange and dissemination of information, 
including training and team work. The fourth les- 
son learned is about the CE certification process 
itself. The MedAustron particle accelerator stands 
out among the majority of the existing particle 
therapy accelerators, and one of the reasons is that 
risk management was part of the certification process.
Besides the illustration of the methodologies and 
the results, there is the unique experience of hav- 
ing dealt with a particle therapy accelerator of this 
size, which has challenged and possibly improved 
the state-of-the-art in this subject. This is one of 
the added values of the “MedAustron experience” 
in the domain of particle therapy accelerators. 

Risk management was executed successfully dur- 
ing the phases that accompanied the certification 
of the particle accelerator as a CE medical device,
and it is presently looking after post-production 
activities for clinical operations with irradiation 
rooms IR2 and IR3 (horizontal beamlines, pro- 
tons), and non-clinical operation in IR1. The verti- 
cal beamline of IR2 will be active in Spring 2018, 
while carbon ions are still under commissioning
for a future use. The proton Gantry is planned as 
the last step in the commissioning sequence. 


ABSTRACT: On-site real-time video streaming of traffic accidents, showing the condition of the injured
persons, can help emergency services to overcome the problem of under- and over-triage. A network of
Rescue Emergency Drones (REDs) that can transmit live video to emergency services is proposed, with the
drones stationed at sites prone to frequent accidents in Denmark. A risk mapping for the placement of RED
docking stations at suitable locations in the southern Danish city of Esbjerg and its outskirts has been
designed using the Geographical Information System (GIS) tools ArcMap and ArcGIS 10.5.1. The result
demonstrates the robustness of integrating REDs into emergency services: they provide high-quality
footage that helps to assess the crash scene faster than the existing standard procedure.


1 INTRODUCTION 


Lack of clarity about the condition of patients or
injured persons can lead to wrong decisions by the
Emergency Medical Dispatcher (EMD) and the
Emergency Medical Service (EMS). For instance,
in Denmark an ‘Unclear Problem’ at emergency
levels B-E was the chief complaint in cases involving
66 deaths in the period 2011-2012, because in many
cases appropriate resources were not dispatched
to handle the emergency, which results in a degree of
over- and under-triage. (Andersen et al 2014)

Under- and over-triage is a major problem that
causes not only financial losses but also the loss of human lives.
For instance, in Denmark 18 deaths could have 
been prevented if EMD had dispatched a targeted 
response. (Andersen et al 2014) 

Drones have many applications in emergency
services. For instance, drones can reach the
scene of a road accident faster than conventional
means of transportation. The aim of this
study is to explore the potential benefit of a drone
system that transmits live video footage of the
condition of the injured person/patient, which may
improve the decision making of the EMD. For the
application of drones in this regard, this study
considers traffic accidents, because they top the
statistics of non-natural causes of death. Every
year, around 1.25 million people lose their
lives and 20-50 million people suffer injuries due
to traffic accidents. (WHO 2017)

Although Denmark is a relatively safe country
for commuters, 211 people were killed,
1,796 suffered serious injuries and 1,432 suffered
slight injuries in traffic accidents in 2016.
(Statistics Denmark 2017)

Emergency medical dispatchers sometimes find it
difficult to comprehend the situation and condition
of the emergency. First responders usually
rush to the emergency site with limited information,
which can sometimes jeopardize the rescue operation.
Therefore, if the EMD and EMS can see and
assess, via live video footage, the severity of the
injuries of a person involved in a traffic accident,
a targeted response becomes easier.

The Danish Emergency Medical Communication
Center (EMCC) receives medical emergency
calls in order to respond to and rescue patients and
injured persons. EMCC staff classify the calls into
five categories according to the Danish Index for
Emergency Care. Category “A” represents a
life-threatening or potentially life-threatening
condition; therefore, it requires an immediate
response. Category “B” means that a patient or
injured person requires urgent help, but his/her
condition is not life-threatening, whereas category
“C” requires an ambulance in a non-urgent condition.
Under category “D” the EMD needs to send a patient
transport, while under category “E” no ambulance is
dispatched; instead, a taxi or other transportation
is advised.

Category “A” has a median pre-hospital time of 08:12
minutes, but sometimes the EMD can make a wrong
decision when dispatching a targeted response. In one
of the cases of an audit study, it was found that the EMD
had categorized an emergency as category B; however,
when the ambulance arrived at the scene, a Mobile
Emergency Care Unit (MECU) was summoned
due to the severe condition of the patient. The life
of the patient would have been saved if the ambulance
had been dispatched together with the MECU.
(Andersen et al 2013) Therefore, a fast response with
the right resources is crucial for saving human lives.
The Danish median pre-hospital time for all
emergencies is 10:27 minutes. (Andersen et al 2013)
The average minimum response time of the fire and
rescue services (FRS) is 10 minutes, and it can be
15-20 minutes depending on the location of the
accident site. (Sydvestjysk Brandvæsen 2015)

Rescue Emergency Drones (REDs) can reduce the
time needed for the on-site assessment of the condition
of an injured person by reaching the patient/injured
person faster than conventional means of transport
and by transmitting live video, which can help to
cope with the problem of under- and over-triage.


2 RED NETWORK IN DENMARK 


Providing visual aid by drone will improve the
pre-hospital process in case of a traffic accident;
this aim of the project meets the need to reduce
fatalities.

Incorporating REDs into emergency services has
many potential benefits, in particular for
decision making, as follows:


- A real-time visual feed from the crash scene
  will help the dispatcher to better assess the
  severity of the emergency.

- Emergency services will save a considerable
  amount of resources by overcoming the
  problem of over- and under-triage.

- A targeted and quick response will increase the
  survival rate of injured persons.

- A targeted response will improve the quality of
  life of casualties by decreasing the severity of
  their injuries, thereby allowing them to live
  without physical impairments.

- The dispatcher will be able to better guide the
  caller in handling the emergency properly while
  the ambulance is on its way.

- The dispatcher can calm down a panicked caller.


A map of Denmark was developed in GIS based on
traffic accident data for Denmark between
2012 and 2016.


In Figure 1 the black dots represent the accidents,
which are more frequent in the populated areas
of the country. A total of 87,787 accidents were
recorded in Denmark between 2012 and 2016 (Danish
Road Directorate 2017). The majority of the accidents
are reported in the bigger cities, for instance
Copenhagen, Odense, Aarhus and Aalborg.

Based on the audit study, it is assumed that
on-site live video streaming would help to mitigate
the consequences of accidents. To provide this visual
aid, a network of REDs is proposed for Denmark.
Esbjerg municipality is considered as a case study
for the RED network, with a broader application to the
rest of the country.


2.1 Esbjerg municipality 


This study is carried out in Esbjerg municipality,
which covers a total area of 794.5 km² (Sydvestjysk
Brandvæsen 2015). The total population of Esbjerg
municipality is 115,905. (Statistics Denmark 2017)
The population density is 116 individuals/km².
The municipality consists of both rural and
urban areas. A total of 2,515 traffic accidents were
recorded in Esbjerg municipality between 2012 and
2016. The accident data are extracted from the Danish
Road Directorate and the accident coordinates are
shown in the following map of Esbjerg municipality.

Most of the accidents were recorded in residential
areas. A total of n = 2,515 cases of traffic
accidents were reported in Esbjerg municipality
between 2012 and 2016. The traffic accident casualties
during this period are given in Table 1.

Table 1 shows that a total of 63 casualties were
recorded in 2016: six persons were killed, 38 were
seriously injured and 19 were slightly injured.
One death due to a traffic accident costs
Danish society up to 17.3 million DKK.
(Transportministeriet 2010). To avoid such a huge loss, it
is necessary to improve the pre-hospital response
time. There is robust evidence for an association
between short response time and survival rate for
traffic accidents. (Sanchez-Mangas 2010)


Figure 1. GIS mapping of traffic accidents coordinates
on Danish roads.


Table 1. Traffic accidents casualties in Esbjerg municipality.

Years   Casualties, total   Killed   Seriously injured   Slightly injured
2012    102                 7        49                  46
2013    61                  1        32                  28
2014    83                  2        47                  34
2015    86                  7        42                  37
2016    63                  6        38                  19


Figure 2. GIS mapping of traffic accidents coordinates
on Esbjerg municipality roads.

Currently in Denmark, the median total pre-hospital
time is 08:12 minutes for category “A”, 13:27 minutes
for category “B”, 16:05 minutes for category “C” and
19:46 minutes for category “D”. The Danish median
pre-hospital time for all emergencies is 10:27 minutes
(Anderson et al. 2014). The details of the pre-hospital
time are given in Figure 3.
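
The category totals can be cross-checked against the interval values that remain legible in Figure 3. The following is a minimal sketch, assuming that the total pre-hospital time is the simple sum of the EMD and EMS time intervals (medians do not add in general, so this is only a consistency check). The helper names are illustrative and not part of the original study.

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'MM:SS' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def to_mmss(total_seconds: int) -> str:
    """Format a number of seconds as 'MM:SS'."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes:02d}:{seconds:02d}"


# Interval values legible in Figure 3 (categories B and D only).
INTERVALS = {
    "B": {"emd": "03:27", "ems": "10:00"},
    "D": {"emd": "06:46", "ems": "13:00"},
}


def prehospital_total(category: str) -> str:
    """Sum the EMD and EMS intervals of one category (assumed additive)."""
    parts = INTERVALS[category]
    return to_mmss(to_seconds(parts["emd"]) + to_seconds(parts["ems"]))
```

Under this assumption, `prehospital_total("B")` and `prehospital_total("D")` give 13:27 and 19:46, which agree with the pre-hospital times quoted in the text.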

Moreover, the fire and rescue service (FRS) also has
a crucial role in saving human lives as a first
responder alongside EMS. The time the FRS needs to
reach an accident site in Esbjerg municipality is
depicted in Figure 4.

The fire and rescue station at Vibevej 18, 6705
Esbjerg Ø, is mainly responsible for the urban area
of Esbjerg municipality. The emergency team comprises
7 rescue workers on 3 vehicles: an incident commander
vehicle, a fire truck and a rescue truck (Sydvestjysk
Brandvæsen 2015).

The green area of the map shows a response time of
10 minutes, the yellow area represents a response
time of 15 minutes, and the rest of the area
represents a response time of 20 minutes (Sydvestjysk
Brandvæsen 2015).


[Figure 3 content: timeline ending at arrival on scene,
divided into an EMD time interval and an EMS time interval.
EMD time interval: B 03:27 min; D 06:46 min.
EMS time interval: B 10:00 min; D 13:00 min.
Pre-hospital time: A 08:12 min; B 13:27 min; C 16:05 min;
D 19:46 min.]
Figure 3. Current pre-hospital emergency time. 


Figure 4. Esbjerg fire station response time to emer- 
gency calls. 


RED can assist in emergency operations by reducing
the time needed for the FRS onsite assessment: it
streams video from the scene and arrives faster than
the FRS, which needs 10 to 20 minutes depending on
the distance of the accident from the FRS station.


2.2 Identification of RED placement 


For optimal placement of RED networks, a spatial
analysis was performed using the geographic
information system (GIS) tools ArcMap and ArcGIS
10.5.1 to analyze and visualize the results (Law &
Collins 2015).

For the RED network, the DJI M210 Matrice drone is
considered, which is one of the most advanced drones
to date, with broad industrial applications. The
cruise speed of this drone in A mode is 82.8 km/h.
Its vertical ascent speed is 5 m/s and its vertical
descent speed is 3 m/s (DJI 2017).

This drone can transmit footage from cameras such
as the Zenmuse X4S, the Zenmuse X5S and the
Zenmuse Z30. The range of the drone is 7 km.
Because of its agility, water resistance and other
specifications, it is well suited to serve as the RED
for building a network that quickly assesses crash
sites and supports EMD and EMS in making correct
and quick decisions. RED can also assist the fire and
rescue team via video streaming.

Esbjerg municipality is considered for this 
explorative study. Esbjerg municipality is divided 
into urban and rural areas. 

Figures 5 and 6 show the optimal locations for
placement of the RED network. To cover the urban
area of Esbjerg municipality, five placements are
identified for mounting the RED network (Fig. 5).
Similarly, five placements are identified for the
rural area (Fig. 6). Each placement is the center
point of a circle shown on the map. The drone range
is 7 km; therefore, each circle has a radius of 7 km.
The center of each circle is the docking station of
the drone. As each UAV location covers


Figure 5. RED Network placement across Esbjerg 
municipality urban area. 


Figure 6. RED network placement across Esbjerg
municipality rural area.


a radius of 7 km, several traffic accident cases in
the analysis overlap, i.e. they lie within the range
of more than one location. A total of n = 2,515
traffic accidents were reported in Esbjerg
municipality between 2012 and 2016. Of these, 2,029
were reported in the urban area and 486 in the rural
area of the municipality. The median, maximum and
minimum times for each urban and rural location are
given in Tables 2 and 3.
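
The coverage and overlap described above can be sketched in a few lines. This is a minimal illustration, assuming that accident and docking-station positions are available as planar coordinates in metres (for example after map projection). The station names, coordinates and function names are hypothetical; the study itself performed this step in ArcGIS.

```python
import math
from collections import Counter

RANGE_M = 7_000  # drone range used in the study (7 km)


def stations_in_range(accident, stations, radius=RANGE_M):
    """Return the names of stations whose circle contains the accident."""
    ax, ay = accident
    return [
        name
        for name, (sx, sy) in stations.items()
        if math.hypot(ax - sx, ay - sy) <= radius
    ]


def coverage_summary(accidents, stations, radius=RANGE_M):
    """Count accidents per station, plus overlapping and uncovered cases."""
    per_station = Counter({name: 0 for name in stations})
    overlapping = 0
    uncovered = 0
    for accident in accidents:
        hits = stations_in_range(accident, stations, radius)
        per_station.update(hits)
        if len(hits) > 1:
            overlapping += 1  # accident lies in more than one circle
        elif not hits:
            uncovered += 1    # accident lies outside every circle
    return dict(per_station), overlapping, uncovered


# Hypothetical layout: two stations 10 km apart, four accidents on a line.
stations = {"S1": (0, 0), "S2": (10_000, 0)}
accidents = [(1_000, 0), (5_000, 0), (9_000, 0), (20_000, 0)]
per_station, overlapping, uncovered = coverage_summary(accidents, stations)
```

In this toy layout each station covers two accidents, one accident is shared by both circles and one lies outside both, which is the kind of double counting the overlap remark refers to.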

Each location is identified based on the number of
accidents within a circle of 7 km radius. The center
of the circle is the location of the drone placement.
The distance is measured from the drone placement to
each accident that occurred within the circle. The
following formula is used to calculate the distance
between two points given by their longitudinal and
latitudinal coordinates.


$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \tag{1}$$


Given the speed of the drone, the flight time between
the two locations is calculated, and these flight
times are then used to calculate the median, maximum
and minimum time for each location. The preparation
time for launching the drone is not considered, as it
is approximately 3 seconds (Claesson et al. 2017);
only the airborne time of the drone is considered in
the median time calculation.
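
The distance and time calculation can be illustrated with a short sketch. It assumes that Eq. (1) is the Euclidean distance between two points in planar coordinates given in metres, that the drone flies in a straight line at the quoted cruise speed of 82.8 km/h, and that launch, ascent and descent are ignored. The function names and sample values are hypothetical.

```python
import math
import statistics

CRUISE_SPEED_MS = 82.8 / 3.6  # 82.8 km/h expressed in m/s (about 23 m/s)


def distance_m(x1, y1, x2, y2):
    """Euclidean distance between two planar points, in metres."""
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)


def flight_time_s(dist_m, speed_ms=CRUISE_SPEED_MS):
    """Airborne time in seconds for a straight-line flight."""
    return dist_m / speed_ms


def summarise(times_s):
    """Median, maximum and minimum of a list of flight times in seconds."""
    return statistics.median(times_s), max(times_s), min(times_s)


def mmss(seconds):
    """Format seconds as MM:SS, rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"
```

Applying `flight_time_s` to the distance of every accident inside a circle and passing the results to `summarise` yields the three statistics reported per location.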

The maximum time for both urban and rural locations
is approximately 5 minutes 57 seconds, whereas the
minimum and median times vary across the locations.


Table 2. RED network median time to reach the scene of
an accident in the urban area.

Location (urban)   Median time   Maximum time   Minimum time
1                  03:25         05:57          00:59
2                  04:48         05:57          01:59
3                  05:01         05:57          01:07
4                  04:57         05:57          01:03
5                  04:54         05:57          01:31

Table 3. RED network median time to reach the scene of
an accident in the rural area.

Location (rural)   Median time   Maximum time   Minimum time
1                  03:21         05:55          01:05
2                  04:22         05:56          00:53
3                  04:53         05:57          01:39
4                  04:03         05:56          01:01
5                  04:05         05:57          01:47


3 DISCUSSION


Real-time video from the crash scene is a powerful
tool for supporting quick and correct decisions
(Fig. 7). For many reasons, a bystander often cannot
clearly describe the health status of the injured
persons. Danish medical staff support the concept of
live video streaming to deal with the problem of
over- and under-triage (Gerdstrom 2017).

To operate live video streaming via RED safely,
precautionary measures need to be considered. There
should not be any safety concerns for bystanders or
any harm to the surrounding environment. The DJI
M210 has collision avoidance sensors; however,
bystanders onsite must be informed that the RED is
approaching them. Moreover, the rotors of the RED
should be shut down once the EMS or FRS reach the
accident site.

Building and integrating a RED network in Denmark
may bring new challenges for the emergency services,
such as training staff, implementing the system, and
managing the interaction between the EMD and
bystanders or injured persons at the crash site.

There are some risks associated with this novel idea
of a RED network, such as public perception of drone
technology, the need for the public to distinguish
emergency drones from other drones (colour/
appearance), the risk of a drone falling, the risk of
a drone docking unit being stolen or damaged, the
risk of data or information theft, charging issues
with the drone, bad weather, and the environmental
effects of drone technology.

Nevertheless, the pre-hospital phase of emergency
services would benefit from RED through live, shared
visual inspection of the emergency. The real-time
video feed will help cut the cost of resources that
are not needed at the crash site. For instance, the
procedural protocol for responding to a traffic
accident alert involves dispatching a fire truck, an
incident commander vehicle and a rescue truck, which
may or may not be needed. Inter-departmental and
intra-departmental communication of emergency
services is expected to improve. Another benefit
worth mentioning is that implementing the system for
traffic accidents will pave the way for scaling it
and applying it to other emergencies.


[Figure 7 content: timeline ending at arrival on scene.
Improvement:
* EMD (quick and right decision)
* MECU (better preparation)
* Hospital selection (quick and right hospital to treat
  specific injury)
* Doctor at hospital (visuals will reduce time for doctor
  to treat injured person)]

Figure 7. RED network assessment of emergency scene
via live video.


4 LIMITATIONS 


Beyond Visual Line of Sight (BVLOS) flight operations
of drones are not allowed in Denmark.

It is important to know how acceptable drone
technology is to the local population, which calls
for a comprehensive risk perception study.

The DJI M210 drone was considered, with a range of
7 km and thus a coverage diameter of 14 km per
location; many accidents in the analysis therefore
overlap. A different drone, with a different
configuration and range, would have produced
different results.

The real test flights are yet to be performed. 


5 CONCLUSION 


The application of the GIS model results in the
identification of appropriate placements of RED
networks across Denmark. Real-time video
transmission via RED networks can enable emergency
services to make the right decision immediately and
dispatch a targeted response to treat injured
persons. Therefore, RED could be the key to
overcoming the problem of under-triage and
over-triage, saving lives while also cutting costs.


ABBREVIATIONS 


BVLOS: Beyond Visual Line of Sight; EMCC:
Emergency medical communication center; EMD:
Emergency Medical Dispatcher; EMS: Emergency
medical services; FRS: Fire and Rescue Services;
GIS: Geographical information systems; GPS:
Global positioning systems; MECU: Mobile emergency
care unit; RED: Rescue emergency drone.


ABSTRACT: With its growing dependence on electricity, modern society faces the risk of cascading
failure of interconnected societal functions. To protect societal functions during a power shortage,
Sweden has implemented a multi-level planning process called STYREL, which involves national-,
regional- and local-level actors. As part of the Swedish crisis management system, the regional body
operates as a co-ordinator that organises co-operation and interaction between private and public
actors. This study examines the role of the regional hub in STYREL and the collaboration and
co-operation between planning levels. It focuses on the co-ordinator’s perspective and presents
evidence from interviews and a survey among planners at County Administrative Boards, which are
entrusted with the supervision and execution of STYREL within their regional area of responsibility.
This paper indicates that the regional co-ordinator lacks the awareness, knowledge and resources to
fulfil its core function in the national planning for critical infrastructure protection.


1 ELECTRICITY AND THE SWEDISH 
CRISIS MANAGEMENT SYSTEM 


1.1 Background 


Electricity is a vital resource in today’s society, 
which largely depends on electricity for maintain- 
ing critical social functions. It can be argued that 
the reliable distribution of electricity is crucial for 
private households, businesses, and public opera- 
tions to function and survive (Cohen 2010, Ghanem 
et al. 2016, Rinaldi et al. 2001). This dependency is 
likely to increase over time due to the continuous 
developments in important infrastructure such as 
railways and electric cars (Cedergren et al. 2015). 

The power grid is vulnerable to various types 
of events, such as extreme weather conditions (e.g. 
storms and floods), technical failures due to out- 
dated infrastructure and aging components, cyber- 
attacks and destruction. Disturbances in the grid 
can have severe consequences for society (Gheo- 
rghe et al. 2006, Pescaroli & Alexander 2016). For 
example, in Sweden, the storms Gudrun, Per, Dag- 
mar and Ivar caused major problems that in some 
cases lasted for more than a month (EA 2006, 
2007a, 2007b). 

In the future, there is a risk that such extreme 
conditions will increase in number and magnitude 
due to the changing climate (Birkmann et al. 2016). 
Given the serious effects of such events on society, 


creating the necessary conditions for sustainable 
power supply during a crisis is an important func- 
tion of the Swedish Energy Agency (EA). In order 
to ensure undisturbed power supply to important 
users in society, i.e. critical infrastructure (CI), the 
EA has developed a planning process called STYREL 
(an acronym for control of power supply to priori- 
tized electricity users), to provide critical infrastruc- 
ture protection (CIP) against short-term power 
shortages. 


1.2 Aim of the study 


The County Administrative Board (CAB) plays a
central role in the Swedish STYREL process as
co-ordinator (EA 2014). The aim of this paper is to
examine the role of the regional hub of STYREL and
the collaboration and interaction between the
planning levels that are included in the process. The
focus is on the differences between CABs regarding
their performance as co-ordinators in STYREL.


1.3 The Swedish crisis management system


The Swedish crisis management system is based on
three principles. The first is the principle of
responsibility, which implies that actors who are
responsible for an activity or a process in everyday
life are also responsible for it during a crisis. Next,
the principle of parity implies that societal functions
during a crisis should as far as possible be carried
out in the same way as they are during normal
conditions. Third, the principle of proximity states
that the actors closest to the event handle the crisis
when it occurs; this means that a municipality or
county/region should primarily handle a crisis. If
local resources are insufficient, the state can act
through the CAB (MSB 2014, Pramanik et al. 2015,
Tehler et al. 2012). In practice, this means that the
CAB is responsible for co-ordinating between relevant
actors in its county (MSB 2014). The co-ordinating
role may involve some problems, as there is no
explicit process for resolving possible conflicts
within the Swedish crisis management system.

A study of the Swedish defence directors at the
21 CABs in Sweden has identified some of these
problems (Wimelius & Engberg 2015). According
to the study, clearer governance, improvement in 
network management and increase in resources are 
measures that can help to improve co-operation 
among the various players in the county. Several 
defence directors expressed the view that the Swed- 
ish crisis management system is characterised by 
weak governance and lack of continuity (Wimelius 
& Engberg 2015). A study of the river groups in 
Northern Sweden further substantiated this view. 
The river groups exchange information in events 
such as floods and high flows through co-opera- 
tion via networks. However, vague instructions 
from the Swedish Rescue Services Agency have 
resulted in the different river groups working dif- 
ferently, having different objectives, and involving 
different actors (Olausson & Nyhlén 2017). All 
these reports point to the need for a more inte- 
grated and standardised system when it comes to 
crisis management in Sweden. 


2 THEORETICAL FRAMEWORK 


This study focuses on the planning process for
power shortages, STYREL, which involves both public
and private actors (Große 2017). Pierre & Peters
(2000) consider the management of society as a 
continuum that extends from traditional top-down 
control, at the one end, to self-organisation (auto- 
poesis) and networks at the other end. The concept 
of governance is the common element of the entire 
continuum. In social sciences, the concept of gov- 
ernance has no clear definition, in which regard 
Pierre & Peters (2000) note: 


‘..Sufficiently vague and inclusive that it can be 
thought to embrace a variety of different approaches 
and theories, some of which are even contradictory’ 
(Pierre & Peters 2000: 37). 


Governance can be regarded as a policy instrument
in the context of institutionalism, rational choice,
and network and policy communities, or it can be
analysed based on neo-Marxist and critical theories.
The concept of governance describes how a society is
organized and governed, and who is involved in
dialogue, participation, and networking. According to
both governance and public policy theories, networks
are an important phenomenon (e.g. Christopoulos &
Ingold 2011, Henry 2011, McGinnis 2011, Petridou
2014). In this study of STYREL, we use the definition
of governance as a policy instrument and subsequently
as a network for steering. STYREL can also be related
to the concept of risk governance, which considers
legal, institutional, social and economic contexts as
well as the actors involved in each of these contexts
(Renn 1998).

Governance or policy networks can be either 
self-organized or created and co-ordinated by 
the state (Sorensen & Torfing 2005). Individual 
organizations often use networks to achieve their 
strategic and operative objectives, to maximize 
their influence over outcomes or to avoid depend- 
ence on other actors in the system. From this per- 
spective, governance involves managing networks 
(Rhodes 1996). 

This study examines material from interviews and a
survey of planners at CABs to portray the CABs’
central role in the Swedish planning system. The
analysis is based on the concept of complex systems
governance, the aim of which is to ensure control,
communication, co-ordination and integration of a
complex system by several metasystem functions
(Keating et al. 2014). In particular, the focus is on
two functions of complex systems governance:


• Policy and Identity
• Information and Communications


The aim of focusing on these two functions is 
to inform other functions of complex systems 
governance, such as learning and transformation 
and the operational performance of the Swedish 
crisis management system and its governance, i.e. 
the metasystem (Keating et al. 2015, Keating et al. 
2017, Keating & Bradley 2015). 


• Policy and Identity

The role of policies is to provide direction and
identity to the system components, e.g. the planners
in the Swedish STYREL process, and to represent the
system to external constituents, e.g. the Swedish
crisis management system and the wider public.


• Information and Communications

Secure and reliable information paths are
particularly important in national planning for CIP.
However, access to relevant information for
decision-making is similarly vital for the performance
of the planning system, as is the consistent
interpretation of available information throughout
multi-level planning, such as in the case of STYREL.

This study examines the available evidence in light
of these governance functions and highlights problems
in the design, execution and evolution of the Swedish
multi-level planning system for CIP, in order to
inform further development of this complex system and
its governance.


3 METHOD AND SELECTION OF CASES 


In this study, we use interviews with co-ordinators
at the CABs in three counties in Sweden: one in the
rural north, one that includes one of the three major
cities in Sweden, and one with some heavy industry
close to the capital of Sweden.

This study further includes a survey of all the
co-ordinators at the 21 CABs in Sweden, carried out
in October 2017. To date, 15 of these co-ordinators
have responded to the survey, which corresponds to a
participation rate of 71.4%. These 15 participants
provided answers to 34 questions on their perceptions
of the effectiveness and efficiency of the planning
in general and on the proceedings during the last
planning process iteration within their area of
responsibility in particular. The survey has an
overall response rate of 62.2%; the remaining
questions were answered with ‘do not know’ (N/A).
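
As a quick check of the participation figure, the rate follows directly from the number of respondents and the number of CABs. A minimal sketch follows; the function name is illustrative and not taken from the study.

```python
def participation_rate(respondents: int, population: int) -> float:
    """Share of the population that responded, in percent (one decimal)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return round(100 * respondents / population, 1)


# 15 responding co-ordinators out of 21 CABs.
rate = participation_rate(15, 21)
```

With 15 respondents out of 21 CABs, `rate` evaluates to 71.4, matching the figure stated in the text.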

A document study complemented the interviews 
and the survey and provided important back- 
ground information, which allowed for data trian- 
gulation (Gerring 2007). The documents for study 
included a handbook for the planning process (EA 
2014), evaluations of the pilot study in 2008 (Lans- 
styrelsen Blekinge 2009, Dalarna 2009) and evalu- 
ations of the first round of planning in 2010 at the 
national level (EA 2012) and in Stockholm County 
(Lansstyrelsen Stockholm 2012). Moreover, a 
report on the grid operator’s plans for manual load 
shedding (MFK) completed the document study 
(Veiback et al. 2013). We conducted the inter- 
views after the document study, which deepened 
the information gained from the documents and 
allowed for verification of the evidence from the 
documents in the interviews. 


4 SWEDISH PLANNING FOR 
CRITICAL INFRASTRUCTURE 
PROTECTION-STYREL 


In Sweden, different actors are responsible for
energy supply at the national level. The EA is
responsible for creating the conditions for efficient,
resilient, and sustainable energy use and
cost-effective distribution of Swedish energy (EA
2012). The Swedish Energy Markets Inspectorate (EI)
is responsible for supervision, regulation and
licensing in the energy market. The Swedish Civil
Contingencies Agency (MSB) bears the overall
responsibility for the crisis management system and
the measures taken before, during and after an
emergency or crisis. Finally, Svenska kraftnät (SvK)
is nationally responsible for the power grid. When a
power shortage occurs, the SvK is responsible for
ensuring that MFK is carried out in as informed and
socially efficient a way as possible. The mission is
to ensure that local and regional power grid
operators can perform such MFK within 15 min (EA
2012, Veiback et al. 2013).

In order to enable the national, regional and local
grid operators to carry out MFK without affecting
critical social functions, the four national-level
actors (the EA, the EI, the MSB and the SvK) have
developed the planning process STYREL. This planning
and prioritisation process for power shortages was
used in 2010 and repeated in 2014. The next planning
process iteration will take place in 2019. In STYREL,
the CAB acts as co-ordinator between governmental
agencies and municipalities, on the one hand, and
between municipalities and power grid companies, on
the other, as Figure 1 depicts.

During the recent planning in 2014, the fol- 
lowing multi-level process was agreed upon (EA 
2014): 

With the aid of an eight-class scale for the
prioritisation of CI (see Table 1), national agencies
identify and prioritise the CI that each of them
operates.

In step (1) (see Fig. 1), each agency sends a portion
of these ranked objects to the CAB of the regional
area of responsibility in which the CI object is
located. Each CAB merges the received

Figure 1. Actors and information paths in the Swedish
multi-level planning process for CIP against power
shortages.

Table 1. Priority classes of critical infrastructure.

Class  Description
       Electricity consumers that have/represent:
1      significant impact on life and health, short-term (hours)
2      significant impact on society’s functionality, short-term (hours)
3      significant impact on life and health, long-term (days)
4      significant impact on society’s functionality, long-term (days)
5      significant economic value
6      significant importance for the environment
7      significant importance for social and cultural values
8      others


lists of prioritised objects and divides them into 
portions that correspond with each municipality’s 
area of responsibility. In step (2), the CAB for- 
wards these portioned lists to each municipality. In 
step (3), the municipalities make an inventory of 
locally important infrastructure and prioritise the 
objects in accordance with the list in Table 1. 

In step (4), the municipalities exchange information
on the prioritised consumers with each locally
operating power grid provider, which provides
information on the technical feasibility of control.
The CI objects are then merged into controllable
power lines. The spreadsheet used for this purpose
performs an additive aggregation of the objects’
ranking scores, which yields another list that
contains the ranking of the power lines. After a
final evaluation, the municipalities send this latter
list back to the CAB in step (5). Each CAB merges the
lists from the municipalities in its jurisdiction,
resolves conflicts between lines that cross municipal
or regional borders and makes the final decision
about the ranking of power lines. In step (6), the
CABs send the final document to the SvK and dedicate
portions of it to each power grid provider that
operates in the region.
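
The aggregation in step (4) can be sketched as follows. This is only an illustration: the text states that object ranking scores are added per power line, but not how a priority class maps to a score. The mapping used here (class 1 scores 8 points, class 8 scores 1 point) and all object and line names are assumptions.

```python
from collections import defaultdict

NUM_CLASSES = 8  # number of priority classes in Table 1


def class_score(priority_class: int) -> int:
    """Hypothetical score: a lower-numbered class scores more points."""
    if not 1 <= priority_class <= NUM_CLASSES:
        raise ValueError(f"unknown priority class: {priority_class}")
    return NUM_CLASSES + 1 - priority_class


def rank_power_lines(objects):
    """Sum object scores per power line and sort lines by total score.

    `objects` is an iterable of (object_name, priority_class, power_line).
    Returns a list of (power_line, total_score), highest score first.
    """
    totals = defaultdict(int)
    for _name, priority_class, line in objects:
        totals[line] += class_score(priority_class)
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical CI objects assigned to three controllable power lines.
objects = [
    ("hospital", 1, "line A"),
    ("water works", 2, "line A"),
    ("data centre", 5, "line B"),
    ("museum", 7, "line B"),
    ("school", 4, "line C"),
]
ranking = rank_power_lines(objects)
```

One property of any additive scheme is visible even in this toy example: a line with several lower-priority objects can accumulate a higher total than a line with a single higher-priority object, as "line B" does relative to "line C" here.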


5 RESULTS OF THE STUDY 


5.1 Analysis of the reference process model 


The Swedish planning process for CIP involves actors
from a large number of national agencies, all the
CABs and municipalities, and locally, regionally and
nationally operating power grid providers. In
Figure 1, the CAB makes two appearances as
co-ordinator of the proceedings. The STYREL process
can therefore be considered a multi-agency planning
process (Alexander 2015, Bharosa et al. 2010) that
occurs at multiple hierarchical levels (Allouche &
Berger 2011). This Swedish multi-level planning
system consists of three hierarchical levels: the
local, the regional and the national level.

The STYREL process can be decomposed into 
single problems at each level, at which respon- 
sible planners act on behalf of public or private 
organisations in sequential order, while the CABs 
play a central role as co-ordinators of the planning 
decisions. This role is directed top down and bot- 
tom up, but the latter role is incomplete because 
the procedure lacks co-ordination at the national 
level. 

In the top-down part of the sequence (steps (1) &
(2)), a CAB receives information on the
electricity-dependent CI that national agencies
operate in the CAB’s regional area of responsibility.
The survey results in Tables 2 and 3 indicate that
although the CABs perceived the collaboration with
national agencies as good, 84.6% of the CABs stated
the need for a more structured process for this
activity, particularly for the consistent
interpretation of priority classes. Further on in the
process, the CAB portions the information on national
objects and


Table 2. Co-ordinators’ perceptions of STYREL.

Population: 21, Response rate: 71.4%, n = 15
Participation STYREL: never: 58.3%, once: 25%, twice: 16.7%
Proceedings of STYREL: Knowledge: 42.5%, Perception: 78.9%

                             Median   0   1   2   3   4   5
Importance of STYREL         4.64     0   0   0   1   3   10
Usefulness in crisis mngt    3.80     0   2   0   2   6   5
CIP in power shortages       3.00     1   0   0   3   3   0
Collaboration with
• National agencies          3.11     0   1   1   3   4   0
• Municipalities             4.00     0   0   1   3   4   5
Trust in
• National agencies          2.90     0   1   2   4   3   0
• Municipalities             3.55     0   1   0   3   6   1
• Energy Agency              3.90     0   1   2   4   3   0
Impact of CABs’ work         2.45     0   3   2   4   1   0
Knowledge of STYREL          2.36     3   2   1   4   3   1
Level of system control      3.23     1   1   2   2   4   3
Good information access      3.22     0   1   2   2   2   2
Good information security    2.75     2   1   2   7   5   1
Clear information paths      3.10     1   0   1   3   5   0
Good resource access         2.67     1   2   1   6   0   2

Note: Scale running from 0 (don’t agree) to 5 (totally
agree).

Table 3. Co-ordinators’ experiences with STYREL.

                                 Yes     Mostly   No      N/A
Request for clearer processes
• with national agencies         84.6%   15.4%
• with municipalities            69.2%   30.8%
• with power grid providers      53.9%   38.5%
Followed the handbook            16.7%   41.7%    0.0%    41.7%
Regular meetings                 15.4%   42.9%    0.0%    38.5%

... was handled by               Municip.  Collaboration  CAB     N/A
National/regional CI             7.7%      30.7%          15.4%   46.2%
Final compilation                0.0%      25%            38.6%   33.4%
Cross-local lines                0.0%      25%            25%     50%
Cross-county lines               0.0%      30.7%          0.0%    58.3%


diction that host such assets. In addition, according 
to the reference process model, the CABs should
provide training and guidance to their municipalities
during subsequent planning at the local level. Since
the questions on concrete proceedings had a low
response rate of 57.5% in the survey, it is possible
that knowledge within the planning system is limited.
In addition, 58.3% of the CABs have not participated
in the planning process before. This may influence
their ability to co-ordinate the proceedings and to
provide guidance to the municipalities. Nevertheless,
the CABs’ responses with regard to collaboration with
the municipalities were slightly positive and
indicated that they tended to trust the
municipalities. However, the reference process does
not provide any measures to evaluate the correctness
of the planning decisions, so aside from
communication, the CABs have no means of assessing
the information they receive, neither in the top-down
nor in the bottom-up phase.
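
The summary score of a row in Table 2 can be related to its response counts with a small calculation. The sketch below assumes that the score is the count-weighted mean of the scale values 0 to 5; this is an assumption made for illustration (the column is labelled "Median" in the table), and the function name is hypothetical.

```python
def weighted_mean(counts):
    """Count-weighted mean of the scale values 0 .. len(counts) - 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no responses")
    return sum(value * n for value, n in enumerate(counts)) / total


# Response counts for 'Importance of STYREL' on the scale 0..5 (Table 2).
importance = [0, 0, 0, 1, 3, 10]
score = round(weighted_mean(importance), 2)
```

For the first row of Table 2 (counts 0, 0, 0, 1, 3, 10) this gives 4.64, the value listed in the table for the importance of STYREL.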

In the second part of the sequence (steps (5) & 
(6)), the information flow is in the bottom-up direc- 
tion. Information about local prioritisation also com- 
prises the national CI assets, but they are masked. 
During the recent iteration of the planning process, 
information exchange was limited to power lines and 
the number of objects per priority class. Even though 
this reduction in information may ensure a certain 
level of information security, it makes regional—or 
national-level co-ordination impossible. The study 
results indicate that the CABs used meetings as a 
means to gain more information and to align the 


prioritisations in their area of responsibility. Never- 
theless, 69.2% of the CABs stated that they require 
a more structured process for collaboration with 
municipalities. Moreover, each CAB must also merge 
the lists from the municipalities and then decide 
upon the regional ranking of power lines. Half of the 
CABs that answered this question decided to merge 
the lists entirely on their own. Further, one CAB 
performed the merging by itself and announced the 
changes to the concerned municipalities. The remain- 
ing respondents stated that they co-operated with the 
municipalities to align the results and to compile the 
final ranking list. Finally, the CAB divides this list 
into bundles of power lines that correspond to each 
local power grid operator in the region and sends this 
list to each of them. In addition, each CAB delivers 
a complete list to the national power grid provider. 
Interestingly, even though the process does not neces- 
sitate intensive collaboration of the CAB with the 
grid providers, 53.9% of the CABs stated that they 
required a more structured process anyway. This is 
probably because the CABs do not receive any feed- 
back from the power grid providers about next-level 
planning for MFK because of national information 
security concerns. 

Due to the immense information processing and 
process management involved, CABs bear a dou- 
ble burden—as participants in the process and as 
regional co-ordinators. Hence, the CAB represents 
the central hub in the current multi-level STYREL 
planning process in Sweden. 


5.2 Organisation and execution of STYREL 


Each CAB is responsible for co-ordinating work
related to crisis management in its own county in
Sweden. Therefore, the CAB is also responsible for
co-ordinating the execution of STYREL, in which the
CAB plays the central role in the planning approach,
but with little influence on the quality of the
process outcome. The evaluations after the pilot and
the first round of planning showed that, overall, the
CABs perceive STYREL as an important planning process
for identifying CI. The survey substantiates this
perception, as 92% of respondents chose agree or
strongly agree. However, due to the limited influence
of the CABs on the outcomes of the STYREL planning and
the subsequent MFK planning, the CABs expressed some
doubt about the usefulness of STYREL’s outcomes for
crisis management. Further, they expressed
considerable doubt about whether STYREL can provide
the intended protection for society during a power
shortage.

The interviews show that the three CABs organ- 
ised their work according to the reference model. 
All three CABs emphasise the importance of 
working within existing networks. In particular,
the CABs used networks that already existed for
ordinary work with crisis management and
emergency response. No new networks emerged
in the three regions. Evaluation of the first run in 
2010 indicates that the CABs acknowledged the 
STYREL process’ contribution to improved coopera- 
tion within existing networks (EA 2012). However, 
the organisation of these networks differs between 
the three counties, and the counties’ size seems to be 
the main reason for the differences. The two smaller 
counties worked more closely together, e.g. meetings
included representatives from all municipalities.
The larger county also used existing network
meetings. In this case, the county was divided into four
or five different groups: north-east, north-west,
south-east, south-west, and the large city. This division
ensured a smoother planning process in the
region, but it also made it difficult for the munici- 
palities to have an understanding of the region 
as a whole. Instead, individual municipalities had 
only experienced the discussion in their part of the 
region, which could lead to differences in princi- 
ples and priorities among the four parts. Accord- 
ing to the evaluation of the first run in 2010, the 
major challenge was to find a common view on the 
prioritisations among municipalities in a region. 
In this context, how to deal with the dependence chains
and to what extent an analysis of these chains is
appropriate seems unclear (EA 2012). In the rural
north county and the county close to Stockholm,
all municipalities participated in the discussions
on principles and priorities. In the latter,
the municipalities made notes in the planning document,
which made it easier for the CAB to identify
the objects along the line. This could also have
an impact on the result: ‘For the result then ... because
we have a bundle of power lines, without knowing 
what is on them, it is extremely difficult. Because 
you could, in theory, cut off the hospital using a few 
ICA stores, or some water pumps’. The notes made 
it possible for the CAB to identify such effects. 
This study evinces that the CAB in general has 
followed the planning model as stated in the hand- 
book for STYREL. However, there were some devia- 
tions from the model due to lack of time. Since some 
actors did not follow the predetermined schedule, 
the CABs ran out of time for their part of the proc- 
ess. Although the other actors in the process caused 
this delay, the three CABs perceive the initial plan as 
too optimistic. The co-ordinators argued that there
was a risk that such a compressed schedule, which
speeds up the work of municipalities and CABs,
would lead to widespread copy-and-paste behaviour in
the municipalities: ‘It may be necessary to give more
time because it became very stressful when it became
so delayed in the first line from government agencies’.
Evaluation of the STYREL planning process in
2010 revealed that there were only a few, if any,
contacts between CABs and private actors, except
for contacts with the power grid providers (EA
2012). The interviews with the three CABs indicated
that, due to time constraints, no other private
actors or actors representing civil society have
been involved in the current STYREL planning.

Between the two rounds of planning, the CABs’ 
role in the process changed. In the first one, the 
CAB participated more actively in assessing and 
balancing the priorities of the CI objects at the 
county level, whereas in the second round, the
CAB only compiled the results from the municipalities.
One of the counties did not fully apply this
change; instead, the municipalities, the region and 
the CAB made the final ranking list together. The 
participating municipalities were, according to the 
co-ordinator, unanimous about this departure from 
the official planning process: ‘Yes, in what other way
would we do? It’s just like a damn long list’. 


5.3 Integration and governance 


STYREL is an integrated part of the Swedish crisis 
management system. As stated, the three principles 
of the system are responsibility, parity, and proxim- 
ity. The CAB is responsible for co-ordinating work 
with the system at the regional level. Critique from 
CABs against the STYREL process mainly includes 
problems with the process itself and the lack of 
feedback during the process in the multi-level 
system. 

In the interviews, the co-ordinators at the CABs 
all agreed that it is important to identify CI objects, 
i.e. societally important objects, in advance in order 
to ensure that there is as much power supply as 
possible to these CI objects in the event of a power 
shortage. Therefore, there are certain elements 
of STYREL that are important for the functioning 
of society. However, all three CAB co-ordinators 
interviewed are critical about the design of the 
reference process model and process execution in 
the two rounds. They are also critical about the
limits of the usefulness of the planning process.
Today, the process to some extent stands on its own;
therefore, the co-ordinators regret the absence of
a holistic, integrated view on STYREL. One co-ordinator
envisioned that integration and transition
of the planning process of STYREL would be an 
important pay-off to the Swedish crisis manage- 
ment system at subsequent planning levels, such as 
preparedness and contingency planning. 

In two of the counties, the co-ordinator at the 
CAB described the process as smooth without any 
major conflicts between the included parties. The 
problem was primarily that the CAB, according 
to changes in the process in the second round of 
planning, could not access information about the 
objects themselves, but only the lines. All three
co-ordinators at the CABs emphasised the problems
of this change. Since the co-ordinators did not get
information about individual objects along
high-priority lines, there is a risk that important objects
are down-prioritised due to the process design. In
all three cases, the co-ordinators preferred to have 
more information about the objects in order to 
ensure the quality of the process outcome. 

There were solutions to deal with the problems. 
In one case, the CAB and the municipalities first 
discussed how to grade certain objects. Then the
municipalities made notes in the planning docu- 
ment indicating which objects are located along the 
line. In another case, the CAB, the municipalities, 
and the region made the final ranking together. In 
the third case, they only used initial discussions, 
but there were no discussions on individual objects. 
In theory, this could imply a down-prioritisation of
the line for the major hospital in favour of other
lines. Finally, before submitting the final ranking
list, the CAB ensured that the hospital was along one of
the highest-prioritised lines.


6 DISCUSSION AND CONCLUDING 
REMARKS 


6.1 Policy and identity 


From this study, it appears that STYREL is part of
a reliable energy supply plan, even in the event of 
power shortages. However, the findings also indicate
that the execution of STYREL does not follow
the process created by the EA. The risk of
‘copy-and-paste’ behaviour can particularly affect 
the prerequisites for reliable power supply during 
a power shortage. In accordance with the concept 
of resilience, this study on the co-ordinating func- 
tion of the CAB highlights that there is a risk that 
society cannot maintain important social func- 
tions. The implementation of the process does not 
provide any guarantee for a resilient power sup- 
ply. Further, any form of systematic co-operation 
between the system components, such as private 
and public actors, seems to be absent in the current 
STYREL process. Whether any systematic co-operation
occurs at the municipal level remains to be studied.

The importance of private-public co-operation 
in networks for enabling actors (i.e. the municipali- 
ties, regions, CABs and power grid providers) to 
identify and prioritise CI objects is further empha- 
sised by the evaluation of the three pilot studies 
in 2009. The results from our current study reveal 
that none of the three CABs formed new networks
for the process. 

The findings signify the underrepresentation of 
private actors and actors representing civil society 
in the planning process and its reference model, 
developed by the EA. However, the deliberately 
vague definition of the reference model allows 
municipalities and CABs to include private actors 
to obtain as much information as possible for the
ranking of CI objects. Thus, the system permits
components to adapt to local and regional commitment
to improve the process. This means that the
regional outcomes of a process instantiation can 
vary distinctly, which questions the national char- 
acter of the planning. Particularly, since STYREL 
prescribes neither an over-regional nor a national 
alignment of CI and the power lines, it remains 
uncertain how local and regional proceedings dur- 
ing the planning affect CI objects of over-regional 
and national importance. 

Although the policy accepts alternative proceedings,
the CABs used already existing networks,
which only include public actors, mainly at the
municipal level, making the proceeding more effective.
However, such an approach carries the risk
that important information from private actors,
such as private care providers, is lost and that
proper risk communication to society is ignored.


6.2 Information and communications 


This study implies that it is important that specific 
public actors, such as persons responsible for cri- 
sis management, have authorised access to crucial 
information on power lines that ensure power sup- 
ply to CI objects, such as different care providers. 
It seems that there is no guarantee that the actors 
update available information, due to the earlier mentioned
‘copy-and-paste’ behaviour. Moreover, due to
the limited information content in the received lists,
the CAB cannot control the correctness and com- 
pleteness of the CI objects; instead, it has to rely on 
the performance and commitment of other actors. 

STYREL can contribute to the maintenance of CI 
during power shortages, but there is no proof of 
its success in this role due to the absence of any 
assessable success factors. This presupposes that 
actors at the municipal level execute the planning 
in accordance with the national strategic objec- 
tives. However, the interviews in this study reveal 
that in some cases, individual interpretations of 
these objectives resulted in an adapted, time-sav- 
ing behaviour, i.e. ‘copy-and-paste’ of local results 
from the first planning round. Since there is no way 
of ensuring that the available data on CI objects 
from the previous planning also applies four years 
later, there emerges a risk that the results of STYREL 
do not properly reflect the intentions and priorities 
of municipalities and agencies. 

In addition, the absence of specific feedback 
from power grid providers on the planned pro- 
ceedings during a power shortage hampers further 
reliable integration of STYREL in regional crisis 
management. These preconditions illustrate that 
the regional co-ordinator cannot rely on the results 
of STYREL planning for CIP in subsequent plan- 
ning processes, such as preparedness and continu- 
ity planning. 




7 CONCLUSIONS 


STYREL as planning process does not necessarily 
contribute to the creation of a reliable energy sup- 
ply as stated by the governmental guidelines of the 
EA. STYREL can contribute to the maintenance of 
CI and societally vital services, but it is difficult 
to gauge this in the absence of assessable success 
factors. 

It appears that there is no integration of the
STYREL process into the Swedish crisis management 
system. Such integration may further improve the 
effectiveness and efficiency of this complex multi- 
level planning system. In particular, such integra- 
tion could facilitate the further development of 
co-ordinated information paths and directed com- 
munication. In turn, such development can assist 
with ensuring adequate national and international 
information security with regard to sensitive infor- 
mation about CI. Therefore, it is necessary for 
authorised persons to designate and monitor the 
necessary information with confidentiality, integ- 
rity and availability to fulfil strategic and operative 
objectives in the context of national CIP. 

The present analysis shows that the results of 
the STYREL process, implemented in a Swedish 
multi-level planning system, rely on the commitment
of the CABs as co-ordinators for achieving
a common understanding of the criticality of
infrastructure and for mediating regional collabo- 
ration. The level of trust between the different lev- 
els of the planning system seems likely to further 
influence the resulting emergency response plan. 
Moreover, the planner’s perceptions regarding the 
significance of the planning task, the likelihood of 
a power shortage situation and the crisis manage- 
ment capability of a county can have an impact on 
the effectiveness of the complex multi-level plan- 
ning system in a crisis. 

This paper also indicates that there is a lack of 
awareness at the regional level about the function 
of core players in the Swedish STYREL approach. In
addition, the regional hub lacks the knowledge and
resources to adequately fulfil its dedicated function
in the national planning process for protecting CI 
objects from the consequences of a power outage. 

With insights from the Swedish case, this paper 
highlights the regional core of STYREL and contrib- 
utes thereby to international discussions on the iden- 
tification, prioritisation and protection of CI objects. 
ABSTRACT: A risk analysis should provide decision makers with information regarding relevant haz-
ards. The initiating phase, where the risk analysts identify hazards to be included in the risk analysis, lays the 
foundation for the rest of the analysis. This phase is, therefore, of great importance. In this paper, we exam- 
ine how risk analysts in a municipal setting identified potential adverse events and how they chose which 
ones to analyse in the risk analysis. The municipalities under study had important similarities with respect 
to exposure to hazards and government regulation. With these similarities as a starting point and studying 
how the initiating phase took place, the paper focuses on impact regarding the uniformity of adverse events. 
Looking at events included in the Comprehensive Risk and vulnerability Analyses (CRAs) seems to reveal
a predominance of uniformity. This is reasonable given the previously mentioned similarities. It is arguably
also a result of many risk analysts using the same sources to retrieve ideas of potential hazards. The latter
is alarming when considering risks not listed in these sources, like emergent or local risks.


1 INTRODUCTION 


Communities at different levels of society strive 
to attain safety and security. To do so, they try 
to identify hazards and threats that pose a risk. 
This is a starting point in preparedness for emer- 
gency response and risk and vulnerability reduc- 
tion (Perry & Lindell 2003). Risk Analyses (RAs) 
are the prominent methods in which risks and 
vulnerabilities are identified and assessed. They 
are formal and analytical (Renn 1998; Rausand & 
Utne 2009) and used by organizations to prepare 
for misfortune. They do so by providing decision 
makers with information about relevant hazards 
and threats, the likelihood of these adverse events 
and their potential consequences. RAs enable deci- 
sion makers to make informed decisions regarding 
reduction of risks and vulnerabilities (Aven 2011). 

The mission of RAs is in other words (Rausand 
& Utne 2009): 


— To figure out what kind of adverse events might 
happen 

— To figure out the likelihood of the events 

— To figure out the consequences of the events 

— To describe the risks (Aven 2015) 


The bulleted list clearly illustrates that beneficial 
outcomes of RAs depend on the initiating phase, 
which is identifying the adverse events of relevance 
to the RA. 


“This is one of the most important steps in the 
risk analysis. If a hazard source or an adverse event 
is not detected, it will not be included in the analy- 
sis” (Rausand & Utne 2009, p. 86). 

Likewise, Cameron et al. (2017, p. 53) describe 
the identification of hazards as “the first and 
most crucial step in any risk assessment”. Accord- 
ing to Renn (2008), stages of risk assessments 
vary depending on risk domains and risk sources. 
Regardless of that, hazard identification is one of 
three core elements in risk assessment (Renn 2008). 
Not being able to identify hazards properly can
result in accidents or adverse events (Cameron et
al. 2017).

Risk analysis has received a lot of academic 
attention. A December 2017 search for “risk analysis”
in the Academic Search Premier database
resulted in approximately 58,800 academic articles.
Comparatively, searching for “hazard identifica- 
tion” or “identification of hazards” resulted in 
1100 and 1250 hits. This paper is a supplement to 
the studies of this highly important element of risk 
analysis. 

This paper focuses on the initiating phase of 
RAs. It presents how risk analysts in 12 munici- 
palities identified and chose hazards to be analysed 
in greater detail in their RAs. These municipalities 
had several similarities (they are presented in sec- 
tion 3). With these similarities as a starting point, 
can we categorise the risk analysts who carried out 
the work as either copycats at one end of the scale,
or explorative analysts at the other? 

The main focus, though, is the impact of the 
approaches used in the initiating phase. This 
impact is identified and discussed, restricted to 
uniformity of risks included in the RAs. To be 
more precise: we have studied the processes when 
conducting so-called Comprehensive Risk and vul- 
nerability Analyses (CRAs). 

Municipalities are exposed to both hazards and 
threats. The nuances between the two terms are not 
of importance in our study. So, for convenience, we 
use “hazard” as a common term for both. Further, 
we use the terms hazards and adverse events in an 
interchangeable manner, even though hazards do 
not necessarily lead to adverse events. 


1.1 The CRAs 


The objective of the Civil Protection Act 2011 and 
secondary law is to ensure that municipalities safe- 
guard the safety and security of the population 
(Directorate for Civil Protection 2017). According 
to these legal requirements, the Norwegian munici- 
palities must have CRAs. 

The secondary law lists a few minimum require- 
ments for the CRA. Two of them are of impor- 
tance for the initiating phase of CRAs (Directorate 
for Civil Protection 2017): 


— First, a CRA must address both existing and 
future risks in the municipality, as well as exter- 
nal risks of relevance to the municipality. 

— Secondly, critical functions in society and critical 
infrastructure must be addressed. Loss of elec- 
tricity or water can be examples. 


Beyond that, the legal requirements do not 
specify what kind of adverse events should be included
in CRAs. Risk analysts in the municipalities must 
identify the potentially adverse events based on idi- 
osyncratic risks in their communities. 

There is a variety of methods for risk analysis 
(Rausand & Utne 2009). Analysts in the munici- 
palities are free to choose, but preliminary RAs 
are the common method in the municipal domain. 
In preliminary RAs, potential adverse events are 
identified, then the identified events are analysed 
separately regarding causes, likelihood and conse- 
quences (Aven 2006). 

In addition to the requirements in the Civil Pro- 
tection Act focusing on the risks from a holistic 
perspective, the municipalities face regulation at 
the sector-level. 


2 THEORETICAL APPROACH 


Several elements are of importance in the initiating 
phase of RAs. Based on our point of interest, we 
focus on some theoretical considerations related to 
the method of risk/hazard analysis, supplemented 
with some perspectives when suited. 

Due to the scope of the paper, some elements
of importance are excluded, though. For instance,
risk perception, i.e. people’s judgement of hazards
(Renn 2008), is not explicitly addressed. Neither is
safety culture addressed, even though culture can
focus attention on some specific hazards
while other hazards go unnoticed
(Pidgeon & O'Leary 2000; Pidgeon 1998;
Douglas & Wildavsky 1983).


2.1 Method and regulation 


A preliminary risk analysis is suited for both major 
and minor hazards. However, risk analysts might 
be restricted by a decision that the process of iden- 
tifying hazards should be limited to regulatory 
requirements (Baybutt 2014). Such restrictions 
could, in extreme cases, result in a CRA of rhe- 
torical value, symbolizing control (Clarke 1999), 
risking that hazards of importance or interest are 
omitted from the analysis. 


2.2 Imagination 


Cole (2012, p. 12) uses the phrase “broaden the 
mind-set of responders” as an argument for sur- 
prise scenarios in exercises. It is also requisite to 
broaden the mind-set of risk analysts when identi- 
fying potentially adverse events. 

Imagination and creativity contribute to the 
identification of scenarios that would otherwise not 
necessarily have been identified. Hence, imagination 
and creativity are required, but analysts might lack
these characteristics (Cameron et al. 2017). Besides,
even if risk analysts are imaginative, this does not
guarantee that all hazards are identified (Baybutt 2014).

A boundary for imagination might be the 
ontological status of hazards and risks. They are 
not fixed. Risks can be viewed in different ways:
as objective properties or as socially constructed 
(Aven & Renn 2010). Risks pre-exist in the former 
view, and risks can in principle be identified and 
measured (Lupton 2013, p. 13). Socially con- 
structed risks, on the other hand, are the prod- 
uct of rhetorical processes (Lupton 2013, p. 46). 
Potentially this induces discussions or interpreta- 
tions among risk analysts about which hazards to 
consider in the initiating phase of CRAs. 


2.3 Cognitive biases 


Thinking can be divided into two systems: fast and
slow (Kahneman 2011). The fast mode is instinctive
and the slow is deliberate. The risk analysis
method presupposes deliberate thinking. However,
risk analysts are humans. Therefore, they are
not necessarily as rational as could be expected
(Aakvaag 2008).

Cognitive biases are results of heuristics (Kah- 
neman & Tversky 1982). The biases stem from 
the unconscious influence on human judgements 
and decisions (Baybutt 2016). They are deviations 
from the rationality of thinking (Meissner & Wulf 
2013, p. 802). Researchers have found many cognitive
biases, e.g. the availability bias, groupthink
or the framing bias, to mention a few. We will not
go into details in this paper. The point here is that
cognitive biases among risk analysts can result in
missed hazard scenarios (Baybutt 2016). Therefore,
the negative effects of cognitive biases need to be
addressed. This is very difficult due to the uncon- 
scious processes involved (Baybutt 2016). However, 
knowledge, information and awareness can reduce 
biases. Another strategy is to use a devil’s advo- 
cate in the risk analyst team. An appointed devil’s 
advocate can initiate debates that might challenge 
the mind-set of others (Baybutt 2016). Addition- 
ally, scenario planning can alter biases (Meissner & 
Wulf 2011). 


2.4 Filtering risks 


There must be a limit to the number of adverse 
events to analyse in the CRA. It is simply a mat- 
ter of resources. This implies that the number of 
identified adverse events in the brainstorming 
process must be reduced. Rausand & Utne (2009) 
argue that hazards where the risks are small, due to 
low likelihood and/or insignificant consequences, 
could be filtered here. 
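
The filtering rule described above can be sketched in a few lines of code. This is a minimal illustration, not a method prescribed by Rausand & Utne (2009): the five-point ordinal scales, the product score and the threshold value are assumptions, and the hazard entries are made-up examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hazard:
    name: str
    likelihood: int   # assumed ordinal scale: 1 (very unlikely) to 5 (very likely)
    consequence: int  # assumed ordinal scale: 1 (insignificant) to 5 (catastrophic)


def filter_hazards(hazards, min_score=4):
    """Split hazards into those kept for analysis and those filtered out.

    The score is likelihood * consequence; hazards scoring below
    min_score are treated as small risks and set aside.
    """
    kept, dropped = [], []
    for hazard in hazards:
        score = hazard.likelihood * hazard.consequence
        (kept if score >= min_score else dropped).append(hazard)
    return kept, dropped


# Hypothetical outcome of a brainstorming session.
candidates = [
    Hazard("Power outage", likelihood=3, consequence=4),
    Hazard("Minor road flooding", likelihood=2, consequence=1),
    Hazard("Nuclear accident", likelihood=1, consequence=5),
]
kept, dropped = filter_hazards(candidates)
```

Returning the dropped hazards alongside the kept ones is a deliberate choice in this sketch: it keeps a record of what was filtered out, so that the decision can be revisited when circumstances change.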

Power and interest are also important. Interests 
can be invested in which adverse events should be 
emphasised and de-emphasised (Aven 2011; Dekker 
& Nyce 2014). This also applies to the brainstorm- 
ing phase. Being able to handle interests requires 
the capacity to exercise power. There are several 
sources of power, e.g. information, expertise, con- 
trol over agenda and resources (Antonsen 2009). 


2.5 Standardization and uniformity 


Recipes and checklists can be beneficial. They pro- 
vide advice and save time for risk analysts (Hale & 
Swuste 1998). Checklists can also mitigate a lack of 
imagination among risk analysts (Baybutt 2014). 

The purpose of recipes for how to do things is
that things are done in the same way. Ergo, uniformity
is reasonable (Brunsson 2000). However, standardization can
cause blindness to possible adverse events unsuited 
to the recipes (Hale & Swuste 1998). For instance, 
Baybutt (2014) holds that elements unlisted in 
checklists might be left out. 


3 METHODS 


Data were gathered in twelve of nineteen munici- 
palities in a county in Arctic Norway. The main 
criterion for including municipalities in the study 
was location. They are all located in the same 
geographical region, and therefore to a certain 
extent exposed to the same hazards. Another cri- 
terion was time. The Civil Protection Act came 
into force in 2011. Requirements in the law set a 
new framework for CRAs. The CRAs and CRA- 
processes included in the study are from the time- 
span 2011-2017, ensuring that the municipalities 
had been subject to the same legal requirements. 
Their geographical location also meant that they 
had been subject to the same supervision by the 
same County Governor. A third criterion was the 
availability of the informants during the data col- 
lection period. 

The data were collected via interviews and analyses
of the CRAs, a qualitative approach. Twelve
semi-structured interviews were conducted, one
informant per municipality. A question guide with
open-ended questions was used. The informants
all played pivotal roles in the CRA process in their 
respective municipalities. All of them had partici- 
pated actively in the process of making the CRAs 
which this paper focuses on. Hence, they had first- 
hand knowledge of the process and the choices 
that were made. 

The contents of interviews and CRAs were ana- 
lysed and compared, so data coherence could be 
checked. 

In a Norwegian context, the municipalities 
spanned from small to medium population size. 

Next, we will present findings from the proc- 
esses of brainstorming and filtering. 


4 THE BRAINSTORMING PROCESS 


The identification of potential hazards in the 
municipalities is called the brainstorming process 
in this paper. The term here refers to a process of 
creativity, imagination, structure and mapping. 
Next, we will present the “who’s” and the “how’s” 
in this process. 

The municipalities had their own unique brain- 
storming processes. However, there were similari- 
ties. Aggregated, the processes either involved 


— municipal representatives (M) 

— a combination of M and external representatives
(E) 

— consultants who involved either M or M+E 


In eight of the municipalities, both internal and 
external representatives participated. 

It is hard to conclude unambiguously in what 
way the legal requirements and other regulative 
attempts to influence affected the process of iden-
tifying hazards. For some it seems as if regulative 
involvement broadened the scope of hazards to 
consider. The internal focus in one of the munici- 
palities was not motivated by legal compliance. 
The intention was to heighten organizational 
competence in this area of municipal responsibil- 
ity. A contrast is the municipality with the lowest 
involvement in this study. Here the primary objec- 
tive was a “good enough” CRA. 

A devil’s advocate formally appointed to chal- 
lenge assumptions or stimulate ideas was not used. 
Ascribing such a formal role to a participant in the 
processes seems to be unfamiliar to the risk ana- 
lysts. However, in two of the municipalities, the 
risk analysts responsible for the local processes 
deliberately sought counter-arguments. They were 
self-appointed informal devil’s advocates. 

A common trait for the municipalities is that 
they based identification of hazards on a combi- 
nation of information sources. In some cases, their 
imagination was not sufficient, and other sources 
provided valuable inspiration and ideas. Nobody 
referred to checklists etc. as means to save time or 
resources. 

In addition to their own previous municipal 
CRAs, most analysts used regional or national 
sources to assist them when identifying hazards: 
typically, national and regional RAs and a govern- 
ment guideline for CRAs. The analyses and the 
guideline served as checklists. At the local level, 
other municipal CRAs were sources of informa- 
tion too, but to a lesser extent than for instance the 
regional RA. 

Another influence was the urge from national 
and regional government to take specific adverse 
events into consideration, for instance deliberate 
adverse events in schools etc. (threats, use of weap- 
ons), and quite recently, arrival of refugees in large 
numbers. 

Media-coverage was also a source of informa- 
tion and inspiration to some risk analysts. 

The informants also referred to a recently estab- 
lished regional arena for risk analysts. Here the 
analysts could exchange ideas. For instance, the 
hazard related to cruise tourism had been addressed 
by one of the municipalities. This was also relevant 
for some of the other municipalities. The influence 
from this arena will probably be apparent in future 
CRAs. 

Arguably, these sources facilitate uniformity if 
not reflected on. 

Interestingly, in two of the municipalities, repre- 
sentatives from residents were invited to participate 
in the process, potentially providing a local focus. 
Representatives from the municipality and external 
actors who had been invited to contribute could, 
of course, also add a new perspective. 


The analysts were asked about the usefulness of 
external information sources and potential nega- 
tive effects. The majority found such sources very 
helpful, providing ideas and serving as some sort 
of quality control as to the content of CRAs. The 
potential negative effects seem to be eliminated by 
the usefulness of such sources. 


5 THE FILTERING PROCESS 


After having identified potential hazards, the 
municipalities chose which hazards to include in 
the CRA. In this paper, this process is called filter- 
ing. Some of them had identified many potential 
adverse events, others had few. Most of the munic- 
ipalities structured the identified events by merging 
related events, thus reducing the number of events. 
In addition, the approach of filtering out small 
risks was applied in several municipalities. 

One of the municipalities in fact included all of 
the identified hazards in the CRA as they were, 
without filtering. 

All in all, the filtering processes passed without 
much controversy, according to the risk analysts. 
Issues for debate were the severity of hazards, not 
their ontological status. The participants in the 
process came to an agreement. External actors 
without representation in the CRA work, such 
as representatives from local industry, showed no 
interest in trying to influence this, or other, proc- 
esses. This lack of interest is interesting per se, but 
beyond the scope of this paper. 


6 IMPACT 


Looking at the type of risks included in the CRAs, 
there is a high degree of uniformity regarding 
events that are mandatory to address; e.g. critical 
infrastructure. 

However, the analysts have not analysed all types 
of critical infrastructure. They have chosen the ones 
relevant to them. Electricity and electronic com- 
munication are the focus of attention, followed by 
water supply. Transportation is also addressed in 
some of the CRAs. Here, local circumstances are 
obviously of importance. For example, municipalities with 
only one main road are more vulnerable than those 
with several. 

Pandemics, nuclear accidents and extreme 
weather, the transboundary risks, are also included 
to a high degree in the CRAs. Fires, accidents, 
emissions or spills of dangerous substances are 
also addressed to a high degree. These are events 
of relevance to all municipalities, regardless of 
location. However, the detailing and the objects at 
risk vary from one CRA to the other. Municipalities 
with a coastline have analysed accidents at sea, 
for example. 

Several of the above-mentioned hazards are reg- 
ulated in sector legislation: e.g. water-supply, fires 
and nuclear accidents. 

There is also high uniformity at an aggregated 
level regarding deliberate adverse events. They are 
included in the CRAs. The types of adverse events 
differ, though. Events like threats and “minor” vio- 
lence are addressed by almost everybody. Terror, a 
disastrous event, is covered in fewer CRAs. Here, 
analysts have varied between the events, most 
likely based on their assessment of relevance to 
their municipality. Only two CRAs include Cyber- 
attack. This type of hazard has not been on the 
public agenda for very long, and the two CRAs 
have recently been revised. 

In addition, the CRAs encompass a few adverse 
events of a strictly local character: flooding and 
the breaking of dikes. 

Finally, a few of the CRAs contain unique 
adverse events; such as substance abuse among 
municipal employees, animal diseases, violence 
and sexual abuse against children and breaches in 
information security. 

Identifying adverse events is a challenging task, 
taking uncertainty about what the future holds into 
consideration. In two of the municipalities this was 
addressed by including “the unknown event”. 


7 DISCUSSION AND CONCLUSION 


7.1 Discussion 


The preliminary risk analysis method and the 
legal requirements per se are neutral regarding the 
number of people involved. The variety among 
the studied municipalities is vast with respect to 
involvement of personnel. At one end of the con- 
tinuum only the consultant and a single municipal 
employee took part in the process. Here the rhetor- 
ical value of the CRA was the primary objective, 
i.e. having a CRA in compliance with regulations 
(Clarke 1999). At the other end of the continuum 
a bottom-up approach was applied with all munici- 
pal departments being mobilized. The variety of 
involvement does not seem to have a bearing on the 
number of adverse events included in the CRAs. 

If other parameters were considered, like how 
well founded the CRAs are in the municipality, or 
reduction of risk, then the verdict regarding choice 
of processes might shift. Important hazards might 
be missing in the CRA (Clarke 1999). 

There is a predominance of uniformity in the 
events that were included in the CRAs. Obviously, 
this can be ascribed to legal requirements and that 
the municipalities face the same hazards to a rela- 
tively large degree. In addition, the formal framework 
for CRAs confers some uniformity. The 
brainstorming processes also induced uniformity, 
because many risk analysts in the study used the 
same sources to retrieve ideas of potential hazards. 
With recipes and a framework like this, some uni- 
formity is reasonable (Brunsson 2000). 

Still, we would argue that this is alarming with 
regard to hazards not listed in these sources, like 
emergent, novel or local hazards. They might be left 
out, as implied by Baybutt (2014) and Hale & Swuste 
(1998). For instance, fish diseases were not included 
in one of the often-used sources, the County-RA. 
This hazard might be of relevance in several munici- 
palities because fishery is a prominent part of the 
industrial base in these communities. Only one CRA 
included this hazard. Using sources to retrieve ideas, 
the risk analyst might miss hazards if the process 
resembles copying and lacks imagination 
(Cameron et al. 2017; Baybutt 2014). 

Most of the processes took place without major 
disagreement. A devil’s advocate was not used in 
most of the municipalities. Hence, the processes 
lacked a participant who systematically could have 
challenged the mind-set of others (Baybutt 2016). 
Given the uncertainty about future adverse events 
and local susceptibility, the processes could have 
benefited from critical voices challenging both the 
premises for analysing risks, i.e. the formal frame- 
work for CRAs, and the local processes. 

This could perhaps have provided more diver- 
sity in novel or unique adverse events. However, 
diversity is not an objective per se. The CRAs were 
in fact diverse in having variations within types of 
hazards. For instance, some included car accidents, 
others included bus accidents. 

Having argued in this paper that uniformity can 
be alarming, a final reflection should be added 
regarding crisis management. Even if all hazards 
have not been identified and analysed, in crisis 
management many of the same features occur 
regardless of the hazard involved (Nilsen 2017). 
Therefore, uniformity need not be too serious in 
that respect. The disadvantage of uniformity is the 
decreased possibility of reducing or eliminating 
unknown hazards in advance, making crisis man- 
agement redundant. 


7.2 Conclusion 


The initiating phase of any risk analysis, where 
adverse events are identified and filtered, is very 
important. This paper has addressed how risk 
analysts in 12 municipalities have carried out this 
phase and the impact of uniformity of adverse 
events analysed in municipal CRAs. The main con- 
clusion is that there is a predominance of uniform- 
ity in CRA-events, as could be expected due to the 
circumstances and the brainstorming processes. 
Diversity is not an objective per se. However, the 
chance of identifying the unknown adverse event is 
lessened by copycats. That is alarming, and awareness 
needs to be raised. 

ABSTRACT: Design documentation, safety and security analysis, environmental studies, studies on 
organizational factors, product characterization, etc., constitute the knowledge base that each process 
plant uses, in greater or lesser detail, for plant management. 

Most of this knowledge is often lost inside an accumulation of formal documents that are not made 
available for practical use, while it should be disclosed and exploited within a living model of the plant 
(updated in real time), to which the various actors should refer to make their decisions throughout the 
lifecycle of the installations. 

How can a shared representation of the factory (state, history, behavior) be given in order to improve 
the reliability and flow of decision-making, investment, prevention, protection and crisis management? 
A risk monitoring and knowledge management system, to be integrated in the architecture of the company 
IoT, has been proposed, developed and tested at the French national institute for industrial environment 
and risks (INERIS). 

The initial risk modelling embedded in the knowledge management system is based on the bow-tie 
methodology, which identifies the barriers for the critical sequences leading to major accidents and 
assesses their availability for use in decision making. It has here been integrated with the Integrated 
Dynamic Decision Analysis in order to obtain the critical sequences of events, which include the operator 
contribution (in terms of errors and recovery), the barrier effectiveness and the plant behavior. 

The representation of the plant in the shape of sequences allows a more user-friendly management of 
the information and thus a simplified control of the coherence of the risk assessment modelling with the 
real plant behavior, and an enhanced decision-making support in the definition of plant control measures, 
both technical and operational. It also allows an easier integration of the data coming from the field, with 
traditional or new technologies, such as virtual and augmented reality. 

The proposed solution is exemplified through the application to an ammonia storage plant. 


1 INTRODUCTION 


In this paper a risk monitoring and knowledge 
management tool proposed, developed and 
tested at INERIS is described, named MIRA 
(Monitoring Intégré des Risques Actualisés, 
Integrated Risk Monitoring Updated). The MIRA 
system was developed in the Virtualis project, 
updated in the TOSCA project and tested in different 
companies. It was developed with the purpose of 
making the risk assessment in major hazard process 
plants a living analysis, making available to 
the different actors (operators, managers, control 
authorities) the information from the risk assessment 
and management to support risk-based 
decision making along the plant life cycle. 

The tool is developed to be used by two main 
users, the plant managers and the authorities in 
charge of the inspections and authorization for 
plant activities, in order to make available all the 
information about risk management and to guide 
the decision making with reference to major accident 
prevention. 

As discussed in Demichela et al. (2004) and 
Demichela & Piccinini (2006), the strict correlation 
between the risk assessment and the safety 
management systems allows summarizing their 
relations and the links with the inspection activities 
as in Figure 1. According to this view, the tool 
developed at the French national institute for industrial 
environment and risks (INERIS) manages the 
following set of data: 

• The risk analysis data, both the probabilistic 
and the consequence analysis, in order to allow 
checking the hypotheses, the methodologies and 
the results; 

• The prescriptions descending from the safety 
management system inspections, both internal 
and by the local authorities, that can affect the 
risk analysis and the related decision making; 

• The operational control of the plant and its 
risks, designed also on the basis of the risk results, 
which allows a tolerable level of risk to be 
maintained through the safety management system; 

• The records of the safety management system 
procedures and instructions; 

• The results of the inspections and the consequent 
plans for improvements. 

Figure 1. Processes included in the tool. 


2 THE KNOWLEDGE MANAGEMENT 
SYSTEM 


The heart of the knowledge management system 
is the management of the “living risk assessment”, 
based on the updating of the risk analysis according 
to the modification of the processes—production 
processes, procedures and tasks—impacting on 
the possible accident sequences—probabilities of 
occurrence, accident scenarios, impact assessment. 

The initial level of risk estimated in the “safety 
case” or “safety report” required by the Seveso 
Directive, Seveso III (Directive 2012/18/EU), 
when tolerable, should be maintained over time by 
maintaining the conditions that led to the 
assessed level of risk: process hazards, potential 
accident sequences and barriers. 

The safety management performance control 
allows a retrospective updating of the initial level 
of risk. According to Figure 2, the updating of the 
risk assessment will take into account the following 
aspects: 

• The critical tasks. As an example, through task 
analysis, missing operational controls on critical 
equipment or protection systems can be 
traced, affecting reliability and availability of 
the systems. 

• The analysis of technical failures and occurred 
accidents or near-misses. The analysis of the 
failures and of their treatment allows identifying 
the unavailability of a system or a missed critical 
task. Summing up the unavailability days over a 
given observation period allows estimating the 
actualized availability, e.g., of a safety barrier. 

Figure 2. Risk modeling. 


The calculation of the “level of confidence” of 
the system under study is carried out according to 
INERIS methods. 

Comparing the initial risk level and the actualized 
one allows the inspectors to verify whether the 
hypotheses made during risk assessment by the 
company management are realistic, given the real 
plant behavior. 
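
The retrospective update described above, in which unavailability days are summed over an observation period and the result is compared with the value assumed in the risk assessment, can be sketched in a few lines of Python. This is only an illustrative sketch and not the INERIS method: the function names, the outage records, the 365-day observation period and the assumed availability of 0.99 are all hypothetical.

```python
from typing import Iterable


def actualized_availability(unavailable_days: Iterable[float],
                            observation_days: float) -> float:
    """Availability over the observation period: 1 - downtime / period."""
    if observation_days <= 0:
        raise ValueError("observation period must be positive")
    downtime = sum(unavailable_days)
    if downtime < 0 or downtime > observation_days:
        raise ValueError("downtime must lie within the observation period")
    return 1.0 - downtime / observation_days


def hypothesis_is_realistic(assumed: float, actualized: float) -> bool:
    """True when the observed availability is at least the assumed one."""
    return actualized >= assumed


# Hypothetical records: three outages of a safety barrier in one year.
outages = [2.0, 0.5, 4.0]          # days of unavailability (illustrative)
observed = actualized_availability(outages, observation_days=365.0)
assumed_in_safety_case = 0.99      # illustrative value, not from the paper

print(f"actualized availability: {observed:.4f}")
print("assumption holds:",
      hypothesis_is_realistic(assumed_in_safety_case, observed))
```

In this made-up example the observed availability falls below the assumed one, which is the situation in which the safety case or the management of the critical tasks would be revised.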

In case of a difference, the plant management 
can revise the safety case, improve the management 
of the critical tasks and of the critical activities. 

During inspections, the plant management has 
to make available the input data to the inspections, 
both from the risk assessment and from the opera- 
tional management. The inspectors have to control 
the coherence of the different elements. 


3 THE CASE STUDY 


The case study refers to a company whose activities 
can be assimilated to those carried out at the 
AZF company, where on 21 September 2001 an 
explosion of ammonium nitrate killed 31 people 
and injured thousands (Paltrinieri 
et al., 2012). 

In this case, the company produces and stores 
different types of solid fertilizers, using for their 
production raw materials with different hazardous 
properties controlled by the Seveso Directive: 

• Flammable gases (natural gas and ammonia); 
• Toxic gases (ammonia); 
• Oxidising liquids (ammonium nitrate); 
• Flammable liquids (oil); 
• Toxic liquids (oil); 
• Gases under pressure (acetylene); 
• Substances dangerous for the environment (oil, 
ammonia); 
• Oxidising solids (fertilizers). 

Figure 3. Sample bow-tie for the ammonia release case. 


Within the risk assessment exercise, the com- 
pany developed all the bow-ties needed to repre- 
sent the possible accidental events and the barriers 
available for their protection (Aneziris et al., 2017). 
The barriers, whose availability has to be kept 
under control over time, being critical systems, 
could be technical or operational (Ahmad et al., 
2015). 

In Figure 3 one of the bow-ties developed for 
the ammonia piping system is shown as a dem- 
onstration. It refers to the release of a toxic cloud 
of ammonia (NH3) and takes into account all the 
possible initiating events, internal and external, 
identifying the barriers, operational and technical, 
that should interrupt the possible event sequences 
towards an accident. 

The barriers numbered in Figure 3 are: 

1. Protection against lightning 
2. Security guard 
3. Design of the rack against earthquakes 
4. Rack lay-out design 
5. Site firefighting system 
6. Manual decompression of the line 
7. Line section isolation for low pressure 
8. Building isolation for low pressure 


The value of barrier availability is justified in 
the safety case—not detailed in this paper—and 
can be updated according to the maintenance 
tasks records and the inspections’ results thanks to 
the knowledge management system adopted. 

The systematic adoption of this approach allows 
both the plant managers and the inspectors to easily 
identify the critical items and the protection systems 
that have to be kept under control in order to 
maintain over time the availabilities assumed in the 
risk assessment. 


4 ONE STEP BEYOND 


The present version of the tool, despite allowing a 
retrospective risk update, does not allow the risk to 
be updated in real time. Some experiences towards 
a similar purpose can be found in Paltrinieri et al. 
(2012), where the early warnings analysis is dis- 
cussed and in Villa et al. (2016) where the dynamic 
risk assessment is introduced. In Raoni et al. (2018) 
the integration of probabilistic analysis and proc- 
ess simulation is also discussed and exemplified. 

In general terms, to reach this goal the knowledge 
management tool should be interfaced with the control 
system of the process unit in order to collect the 
field data and update the risk figures at run time. 

In this way it could constitute an overarching 
real-time optimization and scheduling system, 
controlling and monitoring the operations of the 
whole plant together with the optimization of 
analytical tasks and the interpretation and the 
enhancement of the data used for risk assessment 
and control. 




At the moment the representation of the potential 
accident through bow-ties does not allow this 
kind of integration. Thus, in this seminal work, 
an attempt has been made to shift toward the 
representation in “stories” (sequences of events) 
allowed by the Integrated Dynamic Decision Analysis 
(IDDA). 

Originally developed by Galvagni, the method 
was tested on process case studies several times: 
by Demichela & Piccinini (2008) on a simple case 
study of a tank overflow; by Turja & Demichela 
(2011) and Demichela & Camuncoli (2014) for the 
risk-based design of an allyl-chloride production 
plant, where it allowed the risk analysis to be carried 
out in a dynamic way, taking into account process 
time-dependent occurrences; by Baldissone et al. (2016) 
for the analysis of competing technologies for 
VOC treatment; and by Gerbec et al. (2016), who 
compared the result of the IDDA analysis of an LPG 
tank cold water pressure test procedure with the 
results obtained with a Bayesian network. 

The IDDA methodology, which can be seen as 
an enhanced event tree, is built on the interaction 
between a logical-probabilistic model and a 
phenomenological model of the plant under analysis. 

Its use as a decision-making supporting tool in 
risk assessment is summarized in Figure 4, as also 
discussed in Baldissone et al. (2017). 

The logical-probabilistic model is a logical 
description of the plant behavior (integrating 
human and technological aspects, where relevant) 
based on the general logic theory. 

In particular, when the operational aspect is 
relevant, a good starting point to build the 
logical-probabilistic model is the Task Analysis, or a 
functional analysis for the technological part, as in 
Balfe et al. (2017). 

Figure 4. Conceptual scheme of the IDDA use as risk based decision-making support tool. 

Table 1. Sequence n. 133 from the IDDA logical-probabilistic model elaboration. 

Level  Answer  Prob.        Cumulative prob.  Description 
1      1       1.00e-03     1.0000e-03        Operational error? Yes 
2      1       9.00e-01     9.0000e-04        Enough space available? No 
3      1       1.00e-01     9.0000e-05        Pump trip failure? Yes 
10     1       1.00e-01     9.0000e-06        Level transmitter failure? Yes 
20     0       1-1.00e-01   8.1000e-06        Extraction works? Yes, diluted cloud 
101    0       1-1.67e-05   8.0999e-06        Crane falling down? No 
102    1       1.00e-03     8.0999e-09        Fire below the pipe (Domino effect)? Yes 
103    1       1.00e-02     8.0999e-11        Safety valve works? No 
105    0       1-1.00e-01   7.2899e-11        Manual decompression? Yes 
120    1       1.00e-03     7.2899e-14        Pipe support 1 failure? Yes 
121    1       1.00e-03     7.2899e-17        Pipe support 2 failure? Yes 
125    #1                   7.29e-17          Both supports fail? Yes 
150    #1                   7.29e-17          Pipe leaks? Yes, NH3 release 
200    #2                   7.29e-17          Result of the case 1? Overload, release managed 
201    #1                   7.29e-17          Result of the case 2? Toxic release from the pipe 

Sequence probability: 7.28988e-17. 

The elaboration of the logical-probabilistic model 
leads to the development of all the possible 
sequences the system could undergo, with an accuracy 
dependent on the level of knowledge disclosed 
in the model itself. 

Each sequence of events is coupled with its 
probability of occurrence, dependent on the 
occurrence probability of each event contained in 
the sequence, and on the logical and probabilistic 
constraints. 
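
As a worked illustration of how a sequence probability follows from the probabilities of its events, the short Python sketch below recomputes the cumulative column of Table 1 for sequence n. 133. It assumes that each value in the "Prob." column is the conditional probability of the branch taken at that level and that the sequence probability is simply their product; the rows governed by logical constraints (levels 125 to 201) carry no probability of their own and are left out. The variable and function names are illustrative and not part of IDDA.

```python
# (level, branch taken, conditional probability), transcribed from Table 1.
SEQUENCE_133 = [
    (1, "Operational error? Yes", 1.00e-03),
    (2, "Enough space available? No", 9.00e-01),
    (3, "Pump trip failure? Yes", 1.00e-01),
    (10, "Level transmitter failure? Yes", 1.00e-01),
    (20, "Extraction works? Yes", 1 - 1.00e-01),
    (101, "Crane falling down? No", 1 - 1.67e-05),
    (102, "Fire below the pipe? Yes", 1.00e-03),
    (103, "Safety valve works? No", 1.00e-02),
    (105, "Manual decompression? Yes", 1 - 1.00e-01),
    (120, "Pipe support 1 failure? Yes", 1.00e-03),
    (121, "Pipe support 2 failure? Yes", 1.00e-03),
]


def cumulative_probabilities(branches):
    """Running product of the branch probabilities along the sequence."""
    running = 1.0
    result = []
    for _level, _description, probability in branches:
        running *= probability
        result.append(running)
    return result


cumulative = cumulative_probabilities(SEQUENCE_133)
sequence_probability = cumulative[-1]

for (level, description, probability), total in zip(SEQUENCE_133, cumulative):
    print(f"{level:>4}  {probability:.4e}  {total:.4e}  {description}")
print(f"Sequence probability: {sequence_probability:.5e}")
```

Under these assumptions the final product agrees with the sequence probability reported below Table 1.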

The phenomenological model is the mathematical 
description of the physical behavior of the 
system. It assesses the status and trend of each 
relevant process variable with reference to the failure 
sequence identified and allows for a more realistic 
scenario in both probabilistic and phenomenological 
terms; it allows the description not 
only of the unwanted events, but also of the resilience 
capacity built into the system (recovery possibilities). 

The full set of alternatives allows for the com- 
plete spectrum of possible probability-consequence 
conditions to be used as a basis for decisions in risk 
reduction and control with a compound knowl- 
edge encompassing system description. 

The IDDA logical-probabilistic model has been 
developed for the bow-ties related to the accident 
scenarios involving ammonia and returned a set of 
sequences the system could undergo. 

The events represented in Figure 3, described 
through IDDA, led to the generation of 
220 alternative sequences, 90 of which lead 
to a toxic release. The probability of occurrence 
obtained for the Top Event is comparable to the 
one obtained through the bow-tie. 


In Table 1 a sequence (“story”) is represented 
according to the IDDA elaboration, as an example of 
the results obtainable. 

The representation in stories, coupled with the 
phenomenological behavior of the system, allows 
comparing the data collected from the field (from 
different data sources: e.g. from the plant DCS/PLC 
or smart distributed sensors; from the registration 
of the maintenance activities, as through a 
Risk Register such as the one proposed in Leva et al. 
(2017)) with the sequences possibly occurring in 
the plant, thus allowing run-time follow-up of the 
risk in the plant. 
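
The paper does not specify how field data are compared with the stories. One possible reading, sketched below in Python purely as an assumption, is to discard the stories contradicted by the observed event states and to renormalize the probabilities of the remaining ones. The story contents, event names and probabilities are hypothetical and chosen only to keep the example small.

```python
from typing import Dict, List, Tuple

Story = Dict[str, bool]              # event name -> did the event occur?
WeightedStory = Tuple[Story, float]  # (story, probability)


def is_compatible(story: Story, observed: Dict[str, bool]) -> bool:
    """A story is kept unless it contradicts an observed event state."""
    return all(story[event] == state
               for event, state in observed.items()
               if event in story)


def update_stories(stories: List[WeightedStory],
                   observed: Dict[str, bool]) -> List[WeightedStory]:
    """Drop contradicted stories and renormalize the remaining ones."""
    kept = [(s, p) for s, p in stories if is_compatible(s, observed)]
    total = sum(p for _, p in kept)
    if total == 0.0:
        return []
    return [(s, p / total) for s, p in kept]


# Hypothetical three-story model (numbers are illustrative only).
stories = [
    ({"operational_error": True, "pump_trip_failure": True}, 1e-4),
    ({"operational_error": True, "pump_trip_failure": False}, 9e-4),
    ({"operational_error": False}, 0.999),
]

# Field data, e.g. from the control system, report an operational error.
updated = update_stories(stories, {"operational_error": True})
for story, probability in updated:
    print(f"{probability:.3f}  {story}")
```

With each new observation the set of stories still possible shrinks, which is one simple way of following the risk at run time; a real implementation would also need the phenomenological model to interpret process variables.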

The application of the enhanced method has 
also identified some dependencies among events 
that can hardly be caught through the bow-tie 
approach adopted, thus possibly leading to 
non-conservative evaluations. 


5 CONCLUSIONS 


The risk assessment in process plants has often 
been regarded as a static document to be prepared 
for authorizations more than a living document for 
the day-to-day decision making in process plants. 

The seminal work here described would like 
to overcome this view and include in a knowl- 
edge management system—already developed at 
INERIS—a way to trace the behavior of the plant 
towards an accident in order to guide the main- 
tenance, inspection and even emergency prepared- 
ness in a living system. 

This opportunity has been obtained by integrating 
in the model the representation in sequences 
of the Integrated Dynamic Decision Analysis, which 
could be used to compare the plant management 
and behavior with its model contained in the risk 
assessment. 

It was demonstrated that including this approach 
allows some interdependencies among events to be 
identified, an identification otherwise impossible, 
and that this constitutes a good support for the 
further developments described above. 

ABSTRACT: Social interaction is regarded as central in prevention, planning, handling and evaluation of 
unforeseen (UN) events. Hence, the social interaction factors may vary in different phases and conditions. 
This is also the case when conditions are unpredictable, compared to situations of low or no 
risk. We therefore apply a UN-oriented «Bow tie» model that focuses on three main stages related to the 
development of a serious event: 


• Phase 1: Preparation/identification of hazard signals and barrier development 
• Phase 2: Occurrence of an unexpected event/accident 
• Phase 3: Measures/stabilization 


The purpose of this paper is to investigate social interaction under risk, both the term and the phenomenon, 
and how social interaction factors behave under such conditions. How do social interaction factors relate to 
the different phases given by the «Bow tie» model? And what significance can this have for 
education and training, as well as for the competence of a complex and interdependent organization? We will 
in this paper argue that social interaction plays a crucial role in meeting the unforeseen and discuss this 
accordingly. 


1 INTRODUCTION 


Organizations increasingly rely on their ability 
to adapt to and manage multifaceted, demand- 
ing situations (Herberg et al., 2018, Brozus, 2016, 
Roux-Dufort, 2007, Weick, 2015), particularly 
when facing sudden and unexpected risk events 
(Barnett, 2004, Bechky & Okhuysen, 2011, Cunha 
et al., 2006, Fornette et al., 2016). Little is known 
regarding how an organization can methodically 
identify relevant factors that influence the out- 
come of an event (Kaarstad & Torgersen, 2017). 
The current paper is based on the work of Torg- 
ersen (2018a) and Torgersen (2015). The starting 
point is the Norwegian/Nordic concept of social 
interaction (interaction). Traditional perceptions 
of the concept of social interaction are challenged 
and new models are introduced with the ques- 
tions: What are the basic structures of the con- 
cept of interaction under/during risk, and further, 
how can interaction be created when the condi- 
tions are unpredictable? This scientific anthology 
“Social interaction” (Torgersen, 2018a) may be 
of relevance to anyone who works with warning 
signs, handling and stabilizing unforeseen events 
within most professions in society and emer- 
gency management, as well as students, teachers 
and researchers in education, strategy, human 
resources and organizational management sub- 
jects. The work was started by Torgersen & Steiro 
(2009). In 2015, the term unforeseen (UN) was 
introduced and defined as: 


“Something that occurs relatively unexpected and 
relatively low probability or predictability for those 
who experience and must deal with it.” (Kvernbekk 
et al., 2015:30). 


In this article we will provide an overview of 
the research and explain the importance of 
social interaction linked to the unforeseen. We 
will argue that social interaction plays a crucial 
role in handling the unforeseen, can provide 
useful and applicable knowledge to the field of 
risk management, and may add to the dynamic 
capabilities (Eisenhardt et al. 2000) of the 
organization. 


2 THE DEFINITION OF SOCIAL 
INTERACTION 


Based on a study of 15 organizations (Torgersen & 
Steiro, 2009), this definition of social interaction 
is developed: 


“Social interaction is an open and mutual commu- 
nication and development between participants, who 
develop skills and complement each other in terms 
of expertise, either directly, face-to-face, or medi- 
ated by technology or by hand power. It involves 
working towards common goals. The relationship 
between participants at any given time relies on 
trust, involvement, rationality and industry knowl- 
edge.” (Torgersen & Steiro, 2009:130). 


Social interaction is primarily a way to work or 
“act”. Central to interaction is “action”, first and 
foremost a targeted action. This action is shared or 
exchanged expertise—often extensive, specialized, 
and used in a complementary manner (Torgersen 
& Steiro, 2018, Steiro & Torgersen, 2013, Torgersen 
& Steiro, 2009). The focus on complementariness 
can also be seen in the work of Miles & Watkins 
(2007), supporting the notion that interaction is 
more than the sum of its parts. For social interac- 
tion to occur, one must also be aware that each par- 
ticipant contributes with their unique situational 
understanding (“shared situational awareness”) or 
shared repertoire of practice (Wenger), based partly 
on their own perspective and position in the organ- 
ization, and their experiences, culture, knowledge, 
attitudes, emotions and job satisfaction, including 
recommendations to the interaction process (Torg- 
ersen & Steiro, 2018, Sandeland & Boudens, 2000). 


3 THE THEORETICAL FOUNDATION 
FOR SOCIAL INTERACTION AND RISK 


In the previous section, complementary elements 
were pointed out as important. An illustrative 
example of joining forces and playing complementary 
roles is the dome of the Florence Cathedral 
(the Duomo in Firenze, Italy), designed by Filippo 
Brunelleschi between 1417 and 1434. Brunelleschi had 
the bricks laid in a herringbone pattern to sup- 
port the inner dome (Illustration 1). King (2000) 
explains this as an action and reaction between the 
bricks. We can argue that they are the same bricks 
but assigned different roles, and that the action 
and reaction creates interaction, redistributing the 
forces of pressure outwards and downwards. This 
prevents the dome from collapsing inwards (Torgersen 
& Steiro, 2018). This is shown in Illustration 1. 


Illustration 1. The herring-bone pattern of brickwork 
designed by Filippo Brunelleschi (1377-1446) (Photo: 
Trygve Steiro, 2017). 


The herring-bone pattern of brickwork 
designed by Filippo Brunelleschi (1377-1446), for 
the inner dome of Florence Cathedral, which effec- 
tively divides the pressure downwards and out- 
wards, avoiding an inward collapse (grey arrows). 
The white arrows symbolize action and reaction, 
thereby creating an interaction, and an illustration 
of social interaction as something that happens in the 
action. It also illustrates complementary aspects 
(Miles & Watkins, 2007). 

The bow-tie diagram has been used extensively 
for risk management; here it is based on and modified 
from the model of Primrose, Bentley, van der Graaf & 
Sykes (1996) (Torgersen, 2018b). 

The bow-tie diagram is modified after Primrose 
et al. (1996) in order to understand the interplay 
between the risk of the unforeseen through different 
phases and the importance of social interaction. 
The closer to the impact, the more demanding 
social interaction will be because of the actions 
required. However, Carlström (2018) stresses that 
social interaction can occur vertically, horizontally 
and synchronously. Scholtens (2008) emphasizes that 
operative staff, in most cases, make the right decisions 
and act in an optimal way if they are allowed 
to act autonomously. Martin et al. (2016) have 
investigated crisis management and underlined 
the importance of the four Cs: Communication, 
Cooperation, Coordination and Collaboration. 
Building on this work, we also put social interaction 
closest to the action. 

The illustration is based on the work of Scholtens
(2008), Torgersen & Steiro (2009), Martin et al.
(2016) and Carlström (2018). Carlström (2018)
further argues that in times of uncertainty,
flexible and decentralized solutions should be
sought. The same was claimed by Torgersen &
Steiro (2009)
Figure 1. (Torgersen, 2018). Diagram label: The
Unforeseen.

Figure 2. 4Cs plus social interaction in order to meet
the unforeseen (Torgersen & Steiro, 2018). Diagram
labels: Communications; need for training; effect for
handling unforeseen events.


building on the classical work by Burns & Stalker
(1961). Carlström uses the military concept
“Auftragstaktik” as an example. For a further and
elaborated discussion see Carlström (2018), Bergh
& Boe (2018) and Krabberød & Jacobsen (2018).
Brady (2011) has looked into the Battle of
Stalingrad and stressed that while the German
commander Paulus stuck to the plan and doctrines
too rigidly, the opposing Russian general
improvised and was allowed to improvise by the
Russian high command, Stavka. It has been argued
that new technology will contribute to a more
decentralized chain of command (Zuboff, 1988;
Albert & Hays, 2003). Albert & Hays introduced
the term “strategical corporal”. On the use of
plans and military doctrines, see also Carlsten,
Torgersen, Steiro and Haugdal (2018). We are not
claiming that plans are unimportant, since they
obviously are important. When foreseen
interruptions occur, plans can be followed more
strictly. However, when an unforeseen event
appears, plans need to be used as a framework
and theory, and local action is apparently more
important. An illustrative example of plans and
social interaction is the Hudson


River landing in 2009 with no fatalities (Eisen &
Savel, 2009). On 15 January 2009, US Airways
Flight 1549 hit a flock of geese shortly after
takeoff from LaGuardia Airport (New York, NY),
causing both engines to lose power. The first
officer was in control of the aircraft. Captain
Chesley Sullenberger took control of the airplane
and the radio communications, and then
instructed the first officer to run an engine
restart checklist, which was unsuccessful. Without
engine power, Sullenberger decided that he was
unable to reach either LaGuardia or the Teterboro
airport in New Jersey. The flight crew then
decided that an emergency landing in the Hudson
River was a viable option. Pariés (2011) points
out that a bird strike hitting and destroying both
engines had been thought of beforehand. However,
nobody seems to have predicted or foreseen the
effect of losing both engines below an altitude of
800 meters. This was the case for Flight 1549.
Below are excerpts of the communication between
Flight 1549 and air traffic control at LaGuardia.

The excerpts are based on the account of National
Transportation Safety Board senior official
Kathryn O. Higgins and on a transcript of the
communication (Federal Aviation Administration,
2009). It is also worth noting that within the
cockpit, Captain Sullenberger ordered “My
aircraft”, and First Officer Skiles replied “Your
aircraft” (Eisen & Savel, 2009). Eisen and Savel
(2009) write that this is according to the


Time     Event

1524:54  Tower cleared flight 1549 for takeoff
1525:51  Pilot informs control tower that they were at
         700 feet
1527:01  Radar detected that 1549 hit primary targets
1527:36  Pilot, “Ah, this is Cactus 1539 hit birds, we
         lost thrust in both engines. We’re turning back
         towards LaGuardia”
1527:49  Controller advised LaGuardia to stop departures
1528:05  Controller, “Cactus 1529, if we can get it to
         you do you want to try to land runway one three?”
1528:11  Pilot, “We're unable. We may end up in the
         Hudson”
1528:50  Pilot, “I am not sure if we can make any runway.
         Oh, what’s over to our right anything in New
         Jersey; maybe Teterboro”
1529:02  Controller, “Do you want to try and go to
         Teterboro”
1529:03  Pilot, “Yes”
1529:25  Pilot, “We can’t do it”
1529:28  Pilot, “We’re going to be in the Hudson”
1530:30  Touchdown in Hudson River
Table 1. The 7T terms in Norwegian and their English
translation.

English term   Norwegian word

Trust          tillit
Assurance      trygghet
Well-being     trivsel
Belonging      tilhørighet
Clarity        tydelighet
Time           tid
Tolerance      toleranse


Table 2. Sources of influence and competencies for
“samhandling” or “social interaction”.

Competencies
for samhandling   Sources of influence

Trust             Torgersen & Steiro (2009)
Assurance         Torgersen & Steiro (2009)
Well-being        Torgersen & Steiro (2009)
Belonging         Torgersen & Steiro (2009)
Clarity           Weick (1987); LaPorte & Consolini (1991);
                  Weick & Sutcliffe (2001); Lefdali (2014);
                  Steiro, Johansen, Andersen & Olsvik (2013);
                  Fredriksen & Moen (2013);
                  Eggen & Nyrønning (1999);
                  Simensen (2005); Leitao (2010)
Time              Weick (1993); Steiro et al. (2013)
Tolerance         Kant (1795/1991); Derrida (2005a; 2005b; 2000);
                  Torgersen & Steiro (2009); Steiro et al. (2013);
                  Steiro & Torgersen (2018)


procedures and in accordance with Crew Resource
Management training, which is a well-established
concept for fostering interaction within aviation
crews. When Sullenberger gave the order, “Brace for
impact,” the flight attendants chanted repeatedly,
“Brace, heads down, stay down” (Eisen & Savel,
2009:912). Pariés (2011) further reports that
Captain Sullenberger was very focused on the tasks.
The air traffic controller also reported after the
accident: “During the emergency itself, I was hyper
focused, I had no choice but to think and act quickly, 
and remain calm. I was flexible and responsive. I 
listened to what the pilots said, and made sure to 
give him the tools he needed. I stayed calm and in 
control” (Pariés, 2011:16). 

All this communication can be seen as precise
communication, which is so important for effective
social interaction, particularly when time is so
short (Torgersen & Steiro, 2009). The relevance of
competence brings us to further findings regarding
competence for social interaction. Torgersen &
Steiro (2018) have identified the following central
competencies as important for social interaction.
The first four were identified in Torgersen &
Steiro (2009); in Norwegian they are “tillit,
trygghet, trivsel and tilhørighet”, which were
summed up as the 4Ts. In the study of Torgersen &
Steiro (2018), “tydelighet” (English: clarity, both
in role and communication), “toleranse” (English:
tolerance) and “tid” (English: time) were added.
For Norwegian readers, these would then be the 7Ts
(Table 1). The alliteration is not important in
itself other than for educational reasons, and it
has no relevance for a non-Nordic audience.

Table 2 presents the theoretical and empirical
sources of influence for the 7T model.


4 MEASURING SOCIAL INTERACTION 
IN RELATION TO THE UNFORESEEN 


Herberg et al. (2018) found that social interaction
combined with general self-efficacy and social
support can account for a considerable proportion of
the variance in preparedness for the unforeseen in
different organizations. The study used a
self-completion questionnaire answered by personnel
from the Norwegian Armed Forces during a three-month
period in the winter of 2016-2017. The participants
came from all branches of the military and included
commissioned and non-commissioned officers, military
academy students and conscripts. The questionnaire
was distributed to 16 units, departments and military
academies throughout Norway. A total of 624 personnel
participated in the study, and the response rate was
77%. The sample consisted of 525 male (85%) and
92 female (15%) respondents with a mean age of
25.7 years and an average military experience of
5.5 years (Herberg et al., 2018). Social interaction
was found to be the most important predictor of
preparedness for the unforeseen. The results of the
study indicate that it is possible to prepare for
unforeseen events by implementing measures that
improve social factors in particular (Herberg et al.,
2018). Organizations should develop a work
environment where managers and colleagues provide
both moral and emotional support. Furthermore,
communication in this environment should be
characterized by people listening to each other and
creating trust. The authors further suggest creating
an environment with a common understanding of the
situation, having useful and relevant routines with
partners, and focusing on exchanging and
complementing employee skills and knowledge
(Herberg et al., 2018).
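
The sample figures quoted above can be cross-checked with a short calculation. The sketch below is only an illustration: the variable names are introduced here, and the reading that the percentages refer to respondents who stated their gender is an assumption, not something reported by Herberg et al. (2018).

```python
# Figures as quoted in the text from Herberg et al. (2018).
total_participants = 624
males = 525
females = 92

# Respondents with a stated gender; the remainder is assumed
# (not confirmed by the source) to have left the item unanswered.
stated_gender = males + females
unstated_gender = total_participants - stated_gender

# Shares relative to respondents with a stated gender, rounded to whole percent.
male_share = round(100 * males / stated_gender)
female_share = round(100 * females / stated_gender)

print(stated_gender, unstated_gender)  # 617 7
print(male_share, female_share)        # 85 15
```

Under that assumption, the rounded shares agree with the 85% and 15% given in the text.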

These findings lend support to the factors
suggested by Torgersen & Steiro (2018) and listed in
Table 1. However, Herberg et al. (2018) point to a
need for more research to better understand the
relations between these factors.
5 TRAINING AND EDUCATING FOR
SOCIAL INTERACTION


We have seen in Section 4 that social interaction
plays an important role. In another study, Nyhus,
Steiro & Torgersen (2018) studied mentoring and
coaching in a joint operations course held at the
Norwegian Defence College. Personnel from all
service branches participated (Air Force, Army and
Navy). The purpose of the course was to increase
understanding of joint operations (Andersen, 2016;
FHS, 2015). Since conflicts and wars rarely follow a
familiar pattern, the education tries to incorporate
unpredictable factors (Heier, 2015). The course is
run by more experienced officers, who guide less
experienced officers and soldiers through given
military cases and scenarios that are often based on
experiences from actual events. The theoretical
framework applied was therefore “apprenticeship
learning”. Apprenticeship is rooted in sociocultural
learning theory (Säljö), which emphasizes that
knowledge is constructed through social interaction
in a context and not primarily through individual
processes (Lave & Wenger, 1991; Maguire, 1999;
Dysthe, 2001). Ten hours of observation were
conducted in the sessions. In addition, five group
interviews were conducted with a total of 23
informants (out of a total of 100 students). The
interviews were stopped once sufficient data had
been obtained (saturation). A thematic analysis was
adopted when interpreting the interview material.
This is a suitable method for identifying,
analysing, and reporting patterns within data
(Braun & Clarke, 2006). The analysis reveals that
some supervisors mainly emphasize the product, and
others focus more on how the group reached the
result, i.e. the process. The product said something
about the specific deliveries that the group arrived
at. Furthermore, the product focus referred, among
other things, to supervisors who made sure that the
group followed certain structured patterns of action
and worked towards a goal that was embodied in
doctrines and drills. Nyhus et al. (2018) further
found that mentors who focused heavily on the
product in the preparation phase did not take the
unforeseen into account. A process-oriented
supervisor with sufficient authority can create a
shared commitment and understanding of situations.
By virtue of experience and competence, the
supervisor can pave the way in preparation for
meeting the unforeseen. In the impact phase, we see
how some mentors successfully combine equality with
such a focus on results. It is seen as an advantage
if the supervisor allows a focus on the result, but
at the same time dares to move away from the role of
an instructor who provides a right or wrong answer.
In the final consequence phase, the supervisor
should represent a counterweight to pure
instruction, where logical conclusions about results
control the conversation. Nyhus et al. (2018) found
that the supervisor should not maintain a tight
structure with a given focus on a blueprint product.
Supervisors should investigate, challenge and test
the group’s knowledge and understanding, and should
ask follow-up questions where they, together with
the group, assess the answers. The supervisor
occupies a coaching role (Nyhus et al., 2018). The
authors further suggest that guided student groups
within the apprenticeship learning tradition are an
alternative to traditional education, one that
provides insights and valuable learning
experiences.


6 CONCLUSION 


We have demonstrated in this paper that social
interaction offers something different from
collaboration and cooperation. In a world facing the
unforeseen, rather than insisting on the prediction
of unexpected events, organizations should therefore
investigate how to deal with and learn from
unanticipated events, and how this ability can be
developed. A change in focus and mind-set should
highlight the relevance of social interaction as a
basic and generic core competence. We have defined
social interaction and explained the underlying
elements that make it unique. At the same time,
trust, assurance, well-being, clarity, time and
tolerance are, as we have argued, central elements
in establishing and developing good social
interaction. It does not come for free, but the
potential benefits are expected to be large. We have
also argued that social interaction plays a key role
when facing risk and meeting the unforeseen, as
demonstrated in Section 4. We have used the bow-tie
diagram to distinguish between social interaction
before, during and after an incident or accident. We
need further research that can contribute to the
analysis of incidents and accidents within this
framework. It would also be interesting to examine
the development of social interaction and to follow
up on social interaction in different organizations.
ABSTRACT: Existing resilience management and assessment frameworks for Critical Infrastructure
(CI), as well as resilience indicator indexes, define categories and methodologies that provide useful
conceptual models. However, there is a lack of specific guidance on how to implement such conceptual
models in practice, guidance that would enable stakeholders to conduct assessment, auditing, and
consulting initiatives in a consistent manner. In this paper, we use an existing CI resilience management
framework (based on the ISO 31000 risk management process) and present guidance for implementing
such a framework, based on generalized COBIT5 best practice. This is done by illustrative demonstration
with a defined risk scenario that includes Information and Communications Technology (ICT) aspects,
using a specific process from an existing CI resilience assessment framework.


1 INTRODUCTION 


Assessing and managing the resilience of Critical
Infrastructure (CI) presents many conceptual
and implementation problems that CI operators,
regulators, assessors, auditors, and consultants
must address. High conceptual complexity arises
when hybrid threats, dependencies, and
time-variable cascading effects are considered.
Practical implementation issues arise because
crises may span interdependent sectors and
sovereign borders, calling for frameworks and
standards that promote interoperability and
facilitate communication, cooperation, and
collaboration among diverse stakeholders. Finally,
the growing importance of the cyber dimension
calls for an integrated people-cyber-physical
approach to help solve the socio-technical
challenges of resilience management.

Significant research and development invest- 
ments have been made in recent years to address 
such challenges (IMPROVER 2017, CIPRNet 
2017, DRIVER 2017, FORTRESS 2017, PRE- 
DICT 2017, SmartResilience 2017, ERNCIP 
2017). 

In this paper we use the term resilience as 
defined by UNISDR (2017), i.e. “The ability of a 
system, community or society exposed to hazards 
to resist, absorb, accommodate, adapt to, transform 


and recover from the effects of a hazard in a timely 
and efficient manner, including through the preserva- 
tion and restoration of its essential basic structures 
and functions through risk management”. From this 
definition of resilience, three important aspects 
may be highlighted: 


— The concept of resilience goes beyond the 
domains of security and safety (related to
protection) to include additional goals and
functions related to crisis management, such as
response, recovery, and adaptation;

— Risk management is an important function to 
ensure achieving the desired ability, towards a 
defined optimal level; 

— Since UNISDR defines resilience as an ability, 
resilience management models and frameworks 
may take the point of view of optimizing sets 
of capabilities and capacities, as well as assessing
resilience based on related indicators and
measurement models.


Existing resilience management and assessment 
frameworks for Critical Infrastructure (CI), as 
well as resilience indicator indexes, define meth- 
odologies and categories that provide useful con- 
ceptualizations. A recent review of such resilience 
methodologies and frameworks can be found in 
Rod et al. (2017), Lange et al. (2017a, 2017b), and 
IMPROVER (2016). Such conceptual models help
address complexity, since they are simplified rep- 
resentations of the systems, intended to promote 
understanding within the relevant domain of 
discourse. 

However, there is a lack of specific guidance on
how to implement such conceptual models in
practice. Such guidance would enable stakeholders
to conduct assessment, auditing, planning, and
implementation initiatives in a consistent and
effective manner.

Due to the growing importance of the cyber
dimension, it is important to note that the security
and resilience of CI require consideration and
integration of cyber and physical aspects; a
cyber-physical approach is therefore needed
(Choras 2017, NIST 2014).

For addressing the governance and management
aspects of the cyber dimension, COBIT5
provides state-of-the-art conceptual models and
guidance. COBIT5 is used, e.g., in the NIST
Cybersecurity Framework to provide governance and
management best practice (NIST 2014).

In this paper, we provide implementation guid- 
ance for resilience management and assessment of 
critical infrastructure, by developing, integrating, 
and demonstrating the two functional aspects: 


— Resilience management: we reuse an existing CI 
resilience management framework from Lange 
et al (2017a, 2017b), which is based on the ISO
31000 risk management process, and present guidance
for implementing such a framework, based on
generalized COBIT5 best practice. By itself, the
COBIT5 framework only addresses part of the
problem, namely the enterprise governance of
information and related technologies. However,
as we propose and demonstrate in this paper,
some of the key concepts, models, methods, and
guidance of the COBIT5 framework may be
generalized to encompass the broader scope of
CI operator objectives, activities, and supporting
technologies. Such an approach allows for a
holistic solution to the resilience management
problem, ensuring by design that the cyber
dimension is integrated using a state-of-the-art
governance framework for information and
related technologies.

— Resilience assessment: based on the resilience 
assessment framework from Cadete et al (2017), 
we define generic requirements for resilience 
assessment models, to enable seamless integration
with the COBIT5 assessment and measurement
models.


To illustrate the proposal, we provide a dem- 
onstration that uses a risk scenario that includes 
information and communications technologies 
(ICT) aspects, using as an example the resilience 
assessment framework from Cadete et al (2017). 


2 RELATED WORK 


In Rod et al (2017), a selection of existing
resilience assessment methodologies is described
and evaluated in the context of the IMPROVER 
project (IMPROVER 2017). The authors conclude 
that there is a need for a CI resilience assessment 
framework that is sufficiently well defined, but at 
the same time may be flexible to account for idi- 
osyncrasies of the different types of CIs and their 
operators. Also, such a framework should remain 
compatible with the current guidelines for risk 
assessment of the European Union (EU) Member 
States and should integrate the paradigm of resil- 
ience into the risk assessment process according to 
ISO 31000. 

Lange et al (2017a, 2017b) addressed those 
needs, proposing a framework for resilience 
assessment of CI, which integrates the resilience 
paradigm into the risk assessment (RA) process 
according to ISO 31000, while maintaining com- 
patibility with the current European guidelines for 
national RA applied by the EU Member States 
(Fig. 1). Starting from definitions used in ISO 
31000 for risk assessment, the authors propose a 
conceptual framework that maps these risk con- 
cepts to the resilience assessment domain: 


— Resilience assessment: Resilience assessment 
is the overall process of resilience analysis and 
evaluation. 

— Resilience analysis: Resilience analysis is the 
process of determining the level of resilience. 

— Resilience evaluation: Resilience evaluation is the 
process of comparing the results of resilience 
analysis with criteria or objectives to determine 
whether resilience level is acceptable and identify 
areas for improvement. 

— Resilience treatment: the process to modify resil- 
ience, focusing on the absorptive, adaptive or 
restorative capacity. 

— Resilience management: Coordinated activities 
to direct and control an organization regarding 
its resilience, including the above processes. 


According to these conceptual mappings, the 
authors propose that some of the expected out- 
puts of a risk analysis could contribute to some 
of the indicators required to carry out a resilience 
analysis, i.e. the outputs may be utilized in the 
form of a single/multi-hazard engineering analysis
to provide input to a suitable resilience analysis
methodology.

Dependencies (on other CI and suppliers)
accounted for during the risk analysis stage may be
used in the CI resilience analysis. The authors also
propose that the maturity of organizational
processes (see Fig. 1) may be used to provide more
input to the resilience analysis methodology.
Figure 1.


The “establishing the context” activity (Fig. 1) 
provides input to the overall CI risk assessment 
process, requiring e.g. knowledge about best prac- 
tices in the industry in question, national legis- 
lation, sector specific methods of RA, as well 
as relevant hazards identified in national RA 
standards. 

However, the authors conclude that future work 
is needed to implement the proposed framework, 
namely to develop “a resilience evaluation method- 
ology which is compatible with a resilience analy- 
sis methodology and which allows a comparison 
between the results of a resilience analysis and real 
performance objectives based on the needs and toler- 
ances of the society or community which relies on 
the service provided by the CI”.

Regarding resilience assessment models, Cadete 
et al (2017) propose a resilience assessment frame- 
work, based on rating the capability levels of 
organizational processes, that allows for a direct 
comparison between the capability levels that result 
from an assessment and the desired capability levels,
thus enabling CI resilience evaluation. Both
sets of capability levels (i.e. assessed and desired) 
are determined using the same goals cascade meth- 
odology, thus allowing for straightforward CI resil- 
ience evaluation, i.e. for: 


— comparing the results of resilience analysis with 
criteria or objectives to determine whether the 
assessed resilience level is acceptable, and; 

— identify areas for improvement. 
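
The evaluation step described above, comparing assessed capability levels with desired ones and listing areas for improvement, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of Cadete et al (2017): the function name `evaluate_resilience`, the process names and the integer levels are hypothetical.

```python
from typing import Dict, List, Tuple


def evaluate_resilience(
    assessed: Dict[str, int], desired: Dict[str, int]
) -> Tuple[bool, List[Tuple[str, int]]]:
    """Compare assessed capability levels with desired ones.

    Returns (acceptable, gaps). gaps lists every process whose assessed
    level is below the desired level together with the shortfall,
    largest shortfall first.
    """
    gaps = []
    for process, target in desired.items():
        # Assumption: a process missing from the assessment counts as level 0.
        shortfall = target - assessed.get(process, 0)
        if shortfall > 0:
            gaps.append((process, shortfall))
    # Sort by descending shortfall, then by name for a stable order.
    gaps.sort(key=lambda item: (-item[1], item[0]))
    return (not gaps, gaps)


# Hypothetical capability levels for three illustrative processes.
assessed_levels = {"identify": 2, "protect": 3, "respond": 1}
desired_levels = {"identify": 3, "protect": 3, "respond": 3}

acceptable, improvement_areas = evaluate_resilience(assessed_levels, desired_levels)
print(acceptable)         # False
print(improvement_areas)  # [('respond', 2), ('identify', 1)]
```

Because both sets of levels use the same scale, the comparison reduces to a per-process difference; how the desired levels are derived from stakeholder goals is outside this sketch.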


Figure 1. CI risk assessment (RA), incorporating CI resilience assessment. Taken from Lange et al (2017b).
Diagram labels: CI resilience treatment (improving absorptive, adaptive and restorative capacity);
Monitoring and review.


The proposed framework consists of the follow- 
ing components: 


— A Goals Cascade Methodology (GCM), for 
ensuring that stakeholder needs are met, as well 
as for tailoring the framework usage for each 
CI organizational setting. The methodology is
derived from, and consistent with, the COBIT5
goals cascade, thus enabling consideration of ICT
concerns, which are increasingly important for all
CI sectors. It allows for customizing the framework
for different organizational settings, recognizing
that each critical infrastructure organization has
its own objectives and context. The goals cascade
corresponds to the COBIT5 framework principle
of meeting stakeholder needs. It provides value- 
based discernment for decision-making at several 
management levels, namely the discernment cri- 
teria for assessing risk, as well as for selecting risk 
indicators. Similarly, in this framework, the stake- 
holder needs translate to a cascade of inter-related 
goals at different levels, starting from high-level 
governance goals, cascading through all required 
management levels, down to relevant operational 
levels. Therefore, the GCM allows for tailoring 
the assessment rationale to any sector-specific, 
country-specific, as well as disaster-specific sce- 
narios. Interestingly, the recently released NIST 
Community Resilience Planning Guide for Build- 
ing and Infrastructure Systems (NIST 2015) also 
approaches resilience planning activities based on
understanding individual and social needs, which 
are then mapped to community goals and desired 
recovery performance goals, in recognition that 
“the community's social and economic needs and 
functions should drive goal-setting for how the built 
environment performs” (NIST 2017). 

— A Process Reference Model (PRM), is a set of 
interrelated processes that enables the govern- 
ance and management of CI organizations, from 
a disaster risk management viewpoint. This refer- 
ence model is integrated in the broader govern- 
ance and management of the organization, thus 
embedding disaster risk management concerns in 
business-as-usual activities. The proposed model 
presents a disaster risk management viewpoint, 
and includes relevant concerns from COBIT5
and the NIST Cybersecurity Framework (NCF). 
The NCF provides a common language for ena- 
bling communication and cooperation in cyber- 
security risk management, for the information 
technology (IT) and industrial control systems 
(ICS) environments. The NCF core defines a set 
of relevant activities that help in the analysis, pri- 
oritization and implementation of cybersecurity 
countermeasures. Security functions are defined 
at the highest level of abstraction, helping to 
express activities for management of cybersecu- 
rity risk, and defined as follows: 

— Identify: develop the organizational understand- 
ing to manage cybersecurity risk to systems, 
assets, data, and capabilities; 

— Protect: develop and implement the appropriate 
safeguards to ensure delivery of critical infra- 
structure services; 


— Detect: develop and implement the appropriate 
activities to identify the occurrence of a cyberse- 
curity event; 

— Respond: develop and implement the appropri- 
ate activities to act regarding a detected cyberse- 
curity event; and 

— Recover: develop and implement the appropri- 
ate activities to maintain plans for resilience and 
to restore any capabilities or services that were 
impaired due to a cybersecurity event. 


A diagram of the PRM from Cadete et al. 
(2017) is shown in Figure 2. The governance and 
management enabling processes are aggregated in 
three interconnected groups: 


1. Crisis management: a central process group for
enabling specific disaster management func- 
tions, following the NCF terminology: identify, 
protect, detect, respond, and recover. 

2. Generic enablers: a process group for cross-func- 
tional (generic) enablement, improvement, learn- 
ing, and adaptation, for continuously enhancing 
disaster management capabilities, including 
generic governance and management capabilities. 

3. Risk: processes for risk governance and manage- 
ment, conveying the idea that risk management 
should be continually performed on all other 
enabling capabilities for disaster management, 
thus promoting a clear disaster risk reduction 
perspective. Note that the risk-related processes 
refer to organization-wide processes (i.e. are not 
specific to disaster management), thus ensur- 
ing that disaster risk management concerns are 
embedded in the broader enterprise risk man- 
agement (ERM) activities. 
Figure 2. Process Reference Model (PRM) for the 26 disaster risk management processes. These organizational
processes cover risk management (govern risk and manage risk), disaster management (identify, protect, detect,
respond, and recover), as well as generic governance and management processes. Taken from Cadete et al (2017).
— A Process Assessment Model (PAM) and a Process
Measurement Model (PMM). These models are
based on the ISO/IEC 33000 series guidelines. As
in COBIT5, the assessment results are expressed
according to the capability level ratings achieved
for each enabling process. Note that a certain
capability level rating “does not guarantee that
an organization will perform its processes at any
given process capability level, simply that it is
capable of performing its processes at that level”
(ISO 2015). The capability levels follow the ISO/IEC
33020 definitions, ranging from “Incomplete” to
“Innovating” (the latter maps to the “Optimizing”
COBIT5 capability level, based on ISO 15504).


The authors discuss why process assessment 
and measurement models are useful for conduct- 
ing risk and resilience assessments, noting that risk 
is defined, in generic terms, as “the effect of uncer- 
tainty on objectives” (ISO 2009) and that processes 
“have clear business reasons for existing, accounta- 
ble owners, clear roles and responsibilities around the 
execution of the process, and the means to measure 
performance” (ISACA 2012a). Note also that the 
first process capability level is named “Performed” 
and addresses precisely the achievement of organi- 
zational objectives in a certain process area, thus 
serving as a useful indicator for risk and resilience 
management. The achievement of higher capabil- 
ity levels provides stronger assurances regarding 
risk reduction, although generally implying higher 
implementation costs (Cadete et al. 2017). 
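
The ordered capability scale discussed above can be represented as a small ordered enumeration. The sketch below is illustrative only: the text names the end points (“Incomplete”, “Innovating”) and the first level (“Performed”); the intermediate level names are given as they are commonly listed for ISO/IEC 33020 and should be checked against the standard, and the helper `achieves_objectives` is a hypothetical name introduced here.

```python
from enum import IntEnum


class CapabilityLevel(IntEnum):
    """Process capability levels on an ordered 0-5 scale."""
    INCOMPLETE = 0
    PERFORMED = 1
    MANAGED = 2      # intermediate names: assumption, verify against the standard
    ESTABLISHED = 3
    PREDICTABLE = 4
    INNOVATING = 5


def achieves_objectives(level: CapabilityLevel) -> bool:
    """True from "Performed" upwards, the first level at which the
    process achieves its objectives in the sense used in the text."""
    return level >= CapabilityLevel.PERFORMED


print(achieves_objectives(CapabilityLevel.INCOMPLETE))  # False
print(achieves_objectives(CapabilityLevel.MANAGED))     # True
```

Using an ordered integer enumeration keeps comparisons between assessed and desired levels straightforward.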


3 RESEARCH PROBLEM 


Although Lange et al (2017a, 2017b) propose a
CI risk assessment process model incorporating
CI resilience assessment, and the resilience
assessment framework from Cadete et al (2017) allows
for establishing a link between resilience analysis
and resilience evaluation (based on measuring
and comparing capability levels of disaster risk
management processes), we currently lack a
methodology for guiding the implementation of such
conceptual models in practice.

Without such guidance, stakeholders will not be 
able to conduct assessment, auditing, and consulting 
initiatives in a consistent manner. Also, results from 
different CI assessment initiatives (e.g. different sec- 
tors, operators, auditing and consulting companies) 
will hardly be comparable, hindering the effective- 
ness of enterprise, policy, and regulatory initiatives. 


4 PROPOSAL 


In this paper we seek to contribute to solving
this problem by providing guidance on how
to manage, assess, and measure resilience-related
capabilities which are expressed in the form of
capability levels of organizational processes. As
stated before, this approach is important because
risk may be defined as “the effect of uncertainty on
objectives” (ISO 2009c), and because processes “have
clear business reasons for existing, accountable
owners, clear roles and responsibilities around the
execution of the process, and the means to measure
performance” (ISACA 2012a). It is also important
to clarify that governance and management proc- 
esses have an overarching role in risk management, 
as stated in CISM guidance (ISACA 2016): “most 
security failures can ultimately be attributed to fail- 
ures of management, and it must be remembered 
that management problems typically do not have 
technical solutions.” 

To this end, we propose to use generalized
COBIT 5 guidance for risk management, based on
COBIT 5 for Risk (COBIT 2013c) best practice.
Note that COBIT 5 provides a framework that
assists enterprises in achieving their objectives
for the governance and management of enterprise
information technology (IT). As such, its IT
scope is narrower than what we need to address
the broader CI resilience assessment and management
problem. However, the framework fundamentals
may be easily generalized towards creating
higher-order generic governance and management
frameworks.

The extensive COBIT 5 body of knowledge is
based on the following objectives and principles 
(ISACA 2012b): 


— Governance objective: enterprises exist to create 
value for their stakeholders. Consequently, any 
enterprise, commercial or not, has value creation 
as a governance objective. Value creation means 
realizing benefits at an optimal resource cost 
while optimizing risk. Note that, in COBIT 5,
IT risk is treated as business risk, therefore the
COBIT 5 recommendations may be generalized
to other risk types.

— Principle 1: Meeting Stakeholder Needs: the 
purpose of risk governance and risk manage- 
ment is to help ensure that enterprise objectives 
are achieved throughout the goals cascade. Opti- 
mizing risk is one of the three components of 
the overall value creation objective for an enter- 
prise. Note that this principle is aligned with 
NIST guidance, that approaches resilience plan- 
ning activities based on understanding individ- 
ual and social needs, which are then mapped to 
community goals and desired recovery perform- 
ance goals, in recognition that “the community’s 
social and economic needs and functions should 
drive goal-setting for how the built environment 
performs” (NIST 2017). 
— Principle 2: Covering the Enterprise End-to-end:
COBIT 5 for Risk covers all governance
and management enablers in its scope and 
describes all required phases of risk governance 
and risk management. COBIT 5 does not focus 
on only the IT function, but treats information 
and related technologies as assets that need to 
be addressed like any other asset, by everyone 
in the enterprise. Also, governance of enterprise 
IT is integrated into enterprise governance. This 
principle is relevant to ensure that resilience 
management and assessment frameworks are 
able to cover all cross-sector concerns engaged 
in crisis management (especially due to CI 
interdependencies). 

— Principle 3: Applying a Single Integrated Frame- 
work: COBIT 5 for Risk aligns with major risk 
management frameworks and standards, such 
as ISO 31000, ISO/IEC 27005, and COSO ERM 
(COSO 2017). 

— Principle 4: Enabling a Holistic Approach: 
COBIT 5 for Risk identifies all interconnected 
elements of the enablers that are required to 
adequately provide risk governance and man- 
agement, presenting a holistic and systemic 
approach towards risk. 

— Principle 5: Separating Governance from Man- 
agement: COBIT 5 distinguishes between risk 
governance and risk management activities. 
Good governance means that risk optimization 
is part of the governance arrangements that are 
put in place and risk information is included in 
the decision-making process, at the highest levels 
of leadership and related accountability. 


A benefit of using a risk management approach
that applies generic COBIT 5 rationale is that the
cyber and informational dimensions of CI are
directly mapped to the information and related
technologies content of COBIT 5. Addressing the
full people-cyber-physical challenge requires the
seven categories of enablers of COBIT 5, as per
Principle 4:


— Principles, Policies and Frameworks; 

— Processes; 

— Organizational Structures; 

— Culture, Ethics and Behaviour; 

— Information; 

— Services, Infrastructure and Applications; 
— People, Skills and Competencies. 


An adequate process reference model, such as
the one presented in Figure 2, covers the “Processes”
enabler dimension directly. Since COBIT 5 is
a holistic body of knowledge, the other six enablers
are covered indirectly:


— The people dimension is covered by the enablers
“People, Skills, and Competencies”, “Culture,
Ethics, and Behaviour”, and “Organizational
Structures”, as well as informed and enforced by the
enabler “Principles, Policies, and Frameworks”.

— The enablers “Information” and “Service, Infra- 
structure, and Application” should be understood 
in the broad informational and technological scope 
of CI operator activity, i.e. including IT, opera- 
tional technology (OT), material, facilities, and 
other infrastructural and technological artifacts. 

— The enablers “Information”, “Service, Infra- 
structure, and Application”, and “People, Skills 
and Competencies” correspond to organizational 
resources. This entails that the concept of capa- 
bility should include measures of (resource) 
capacity. 


Considering the above COBIT 5 guidance, as
well as COBIT 5 for Risk (ISACA 2013b), we may
provide implementation guidance for CI resilience
management, as described in the following
sub-sections. In the remainder of this section we
assume that GCM and PAM/PMM models are
used, such as the one proposed in Cadete et al
(2017).


4.1 Establishing the context 


Using the GCM, we can translate stakeholder 
needs into specific, actionable, and customized 
goals cascade, from high-level enterprise goals, 
down to enabler goals. 

For risk assessment, these enabler goals will 
later be associated with best practice, standards, 
and compliance requirements from relevant areas 
of concern. 

During this risk management activity, the exter- 
nal and internal contexts should be established, and 
risk criteria—i.e. terms of reference against which 
the significance of a risk is evaluated—should be 
developed (ISACA 2013b). 


4.2 CI Risk Identification 


COBIT 5 recommends an approach based on risk
scenarios. A risk scenario is a description of a 
possible event that, when occurring, will have an 
uncertain impact on the achievement of the enter- 
prise’s objectives. 

Risk scenarios can be identified and developed 
using two different mechanisms (ISACA 2013b):


— A top-down approach: starting from the overall 
enterprise objectives, identify business objectives 
and scenarios with highest impact on achieve- 
ment of business objectives. 

— A bottom-up approach: a list of generic scenar- 
ios is used to define a set of more relevant and 
customized scenarios, applied to the individual 
enterprise situation. 
Note that using risk scenarios, as a specific
method for identifying risk, does not exclude using 
other risk techniques e.g. from ISO 31010, to assist 
risk identification, as well as risk analysis activities. 


4.3 CI risk analysis


In this activity, the frequency and impact of risks 
are estimated. Risk factors should be considered, 
i.e. those conditions that influence the frequency 
and/or business impact of risk scenarios. They can 
also be interpreted as causal factors of the sce- 
nario that is materializing, or as vulnerabilities or 
weaknesses. 

Scenario analysis may be based on past expe- 
rience, known current events, and also possible 
future circumstances (ISACA 2013b). 
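
As a minimal sketch of this estimation step, the following Python fragment ranks risk scenarios by an expected annual loss. It assumes the common convention that risk is the product of frequency and impact, which the text above does not prescribe; the scenario names, the numeric estimates, and the `risk_factor_multiplier` field are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass
class RiskScenario:
    """One risk scenario with illustrative (hypothetical) estimates."""
    name: str
    frequency_per_year: float            # estimated occurrences per year
    impact_per_event: float              # estimated loss per occurrence, arbitrary units
    risk_factor_multiplier: float = 1.0  # > 1 if risk factors raise the frequency

    def expected_annual_loss(self) -> float:
        # Assumed convention: risk = adjusted frequency x impact.
        return (self.frequency_per_year
                * self.risk_factor_multiplier
                * self.impact_per_event)


def rank_scenarios(scenarios):
    """Return the scenarios ordered from highest to lowest estimated risk."""
    return sorted(scenarios, key=lambda s: s.expected_annual_loss(), reverse=True)


# Hypothetical scenarios and estimates, not taken from any assessment.
scenarios = [
    RiskScenario("sensor network outage", 4.0, 10.0),
    RiskScenario("contamination of drinking water", 0.5, 200.0,
                 risk_factor_multiplier=2.0),
]
ranked = rank_scenarios(scenarios)
for s in ranked:
    print(f"{s.name}: {s.expected_annual_loss():.1f}")
```

A risk factor is modelled here as a simple multiplier on frequency; an analyst could equally let it act on impact, as the text allows for both.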


4.4 CI resilience analysis


Using a PAM and a PMM as assessment and meas- 
urement models, the analyst can relate the risk sce- 
narios with current and forecasted organizational 
capability levels (including resource capacities), 
to calculate current risks and residual risks (equal 
to current risks with additional risk responses 
applied) (ISACA 2013b). 
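
The comparison of current and desired capability levels can be sketched as follows. This is an illustrative fragment, not part of the cited frameworks: the level names follow the ISO/IEC 33020 scale (“Managed” and “Predictable” are not named in the text above and are taken from that standard), while the function `capability_gap` is a hypothetical helper.

```python
from enum import IntEnum


class CapabilityLevel(IntEnum):
    """Process capability levels on the ISO/IEC 33020 scale."""
    INCOMPLETE = 0
    PERFORMED = 1
    MANAGED = 2
    ESTABLISHED = 3
    PREDICTABLE = 4
    INNOVATING = 5


def capability_gap(current: CapabilityLevel, target: CapabilityLevel) -> int:
    """Number of levels still to be achieved (0 if the target is already met)."""
    return max(0, int(target) - int(current))


# Hypothetical assessment result for a single process area.
current = CapabilityLevel.ESTABLISHED
target = CapabilityLevel.INNOVATING
gap = capability_gap(current, target)
print(f"{current.name} -> {target.name}: gap of {gap} level(s)")
```

A gap greater than zero indicates that additional risk responses are needed before the residual risk can be compared against the risk criteria.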


4.5 CI resilience evaluation and treatment


In this activity we compare the results of resilience
analysis with criteria to determine whether the
resilience level is acceptable and to identify areas for
improvement. The purpose of defining a risk response
(avoidance, acceptance, sharing/transfer, or mitigation)
is to bring risk in line with the defined risk
appetite for the enterprise (ISACA 2013b).


4.6 CI risk evaluation and treatment


The CI risk and resilience risk responses should be 
integrated in a common decision-making process 
and contribute to a single security, safety, and resil- 
ience strategy, although the implementation of spe- 
cific resilience controls may be realized by specialized 
initiatives or projects, within a coherent portfolio. 


5 DEMONSTRATION 


To illustrate how to implement the proposed
guidance for conducting resilience management
and assessment, we will address the ERNCIP
Project Platform concerns (ERNCIP 2017) and
adapt a demonstration from Cadete et al (2017).


— Establishing the context: we situate ourselves
in the context of the ERNCIP Project Platform,
addressing the concerns of the Thematic
Group (TG) on Chemical and Biological (CB)
Risks to Drinking Water. Adopting the framework
from Cadete et al (2017) as our reference
model, we use the GCM and define that the enabling
process area “Monitoring, Detection, and
Reporting” should achieve capability level
“Innovating process” (ISO 33020), given the ERNCIP
requirements for high levels of risk control and
innovation.

— Communication and Consultation: note that
the ERNCIP Project Platform itself provides 
a forum for communication and consultation 
purposes, i.e. addresses “continual and iterative 
processes that an organization conducts to pro- 
vide, share or obtain information and to engage in 
dialogue with stakeholders regarding the manage- 
ment of risk” (ISO 2009). 

— CI Risk Identification: using a top-down
approach, we address the ERNCIP TG work 
programme goals and choose e.g. a bio-terror- 
ism risk scenario. 

— CI Resilience Analysis: using the PAM/PMM
from Cadete et al (2017) as assessment and
measurement models, we assume that the analyst
(we may also assume external auditing validation)
assessed the current capability level of the
“Monitoring, Detection, and Reporting” process
as “Established process”. Note that this means
that the current capability level is below the
desired capability level of “Innovating process”.

— CI Resilience Evaluation and Treatment: mitigation
is proposed as risk response, meaning
designing and executing a project to move from
“Established process” to “Innovating process”, to
meet risk appetite criteria. Using ICT best practice
from COBIT 5, this project entails achieving
a capability level of “Innovating process” for the
COBIT 5 processes “DSS01 Manage Operations”
and “DSS02 Manage service requests and incidents”.
Further detailed guidance can be found
in COBIT 5 documentation (ISACA 2013a).

— CI Risk Evaluation and Treatment: finally, after
designing the project and estimating related
implementation costs, the plan is submitted to
the approval of top management. A decision is
made to include the new project in the ongoing
organizational security programme, with a mandate
to address all possible portfolio synergies,
as well as to report to governance bodies accordingly,
for monitoring and review purposes.

— Monitoring and Review: as assumed in the previous
paragraph, monitoring and review should
include the governance bodies, for the purposes
of ensuring that controls are effective and efficient
in both design and operation, obtaining
further information to improve risk assessment,
analyzing and learning lessons, detecting changes
in the external and internal context, and identifying
emerging risks (ISO 2009).


6 CONCLUSION 


The main contribution of this paper is to provide 
implementation guidance for resilience manage- 
ment and assessment of critical infrastructure, 
using the resilience management framework from 
Lange et al (2017a, 2017b). Also, we demonstrated 
practical implementation of this guidance, to 
facilitate understanding and provide validation, 
using the context of a real-world resilience initia- 
tive (ERNCIP 2017), a fictional risk scenario, and 
addressing ICT concerns using a state-of-the-art 
governance and management framework for infor- 
mation and related technologies (ISACA 2012b). 

The ultimate goals of this contribution are to 
facilitate conducting assessment, auditing, and 
consulting initiatives in a consistent manner, as
well as to enable comparing assessment results from
initiatives performed in different CI settings (e.g. 
different operators, sectors, or countries), using a 
standards-based approach. 

Note that the resilience assessment framework 
discussed in the paper (Cadete et al. 2017) is not 
essential to the proposal, and may be replaced by 
another similar framework if the following generic 
requirements are met: 


1. Approach: is aligned with the ISO 31000 risk 
management process, adopts a cyber-physical 
approach, and provides direct or indirect coverage
for the seven COBIT 5 enablers.

2. Governance: is aligned with the governance 
objective and principles, as described in the 
proposal. 

3. GCM: provides a goals cascade methodology, 
so that risk may be addressed as “the effect of
uncertainty on objectives” (ISO 2009), and we 
can translate stakeholder needs into a specific, 
actionable, and customized goals cascade. 

4. PRM: provides a reference model based on 
organizational processes, covering all relevant 
resilience management and assessment areas of 
concern. 

5. PAM/PMM: is aligned with ISO 33020 or a sim- 
ilar standard (such as ISO 15504). 


Regarding limitations, note that this paper does 
not claim that the proposed guidance is the best 
approach for resilience management of critical 
infrastructure, instead assuming the framework 
from Lange et al (2017a, 2017b) is a reasonable 
approach. However, we have demonstrated that the 
framework from Lange et al (2017a, 2017b) can be 
used to derive an approach that is consistent with 
ISO 31000 and COBIT 5 principles. Also, we have
demonstrated that it can be used consistently with
process-based resilience assessment frameworks.
Impact of human factors on threats in sewage treatment plants

ABSTRACT: Physical, biological and chemical processes take place during wastewater treatment. Any
disruption of any process, even the smallest, may pose a risk to the operation of the treatment plant. For each
type of wastewater, an appropriate purification method is established. It depends on the type
of sewage, the population equivalent and the type of receiver. The paper compares the threats that may
pose a risk in municipal sewage treatment plants. The sewage treatment plants selected for risk analysis are
located in the same city. They were designed with the same technological chain due to similar Population
Equivalent (PE) and comparable conditions for discharging purified wastewater to the receiver. Previous risk
analyses have shown the importance of the human factor, in the form of management, work organization,
documentation and databases, technical knowledge and staff training. Employees of sewage treatment
plants can, in their daily work, prevent negative events or cause them to occur. Because the selected
wastewater treatment plants are located in the same city area, they are managed by the same person and
operated in the same way. This ensures a similar response to the same events, similar risk elimination during
operation and similar quality of databases. These conditions allow for the unification of the human factor
when comparing the threats at the municipal sewage treatment plants.


1 INTRODUCTION 


Man can influence the environment in which he 
lives and all processes that take place in his sur- 
roundings in both positive and negative ways. His 
decisions and actions during exploitation of tech- 
nical objects may contribute to the emergence of 
potentially dangerous situations and events which 
result in losses, both material and immaterial, or 
in other words may cause risks. Therefore, the
human factor can be seen as one of the risk factors.
It affects the effectiveness of business manage- 
ment, work organization and keeping of records 
and databases. It is also worth considering in terms 
of risk for occupational safety, which is caused by 
low level or total lack of employees’ knowledge in 
this field, their reluctance to change and improve 
the work process and their unfamiliarity with OSH 
rules [1]. Speaking about the risk in terms of the 
human factor the authors want to focus on the 
human impact on municipal sewage treatment 
plants operation process. 

The basic stage of risk analysis is its identifica- 
tion and classification, which enables its manage- 
ment in the company [2, 3, 10]. Obtaining reliable 
results depends on the accuracy of conducting the 
process and the professionalism of the person who 
identifies threats. Of great importance are also 
exactness and diligence in maintaining a database. 
In other words, reliable risk identification also 
depends on the human factor. 


2 HUMAN FACTOR IN A MUNICIPAL 
SEWAGE TREATMENT PLANT 


Sewage treatment plants are specific technical 
objects in which physical, biological and chemical 
processes take place. Constant supervision of the 
technological process and introduction of moni- 
toring and control systems are not able to exclude 
the emergence of failures and malfunctions of 
individual devices [4, 5, 11]. This happens because 
adverse situations are fortuitous events and it is 
impossible to predict when and where they may 
occur. Correctly conducted threat identification 
and competent risk management in the object 
allows the elimination of adverse events before 
they may arise, and the development of operat- 
ing instructions for dealing with a given phenom- 
enon [12]. Yet the effectiveness of these activities 
depends on the employees who will execute them. 
Therefore it can be stated that proper functioning 
of a treatment plant is largely dependent on the 
human factor. 


3 CHARACTERISTICS OF EXAMINED 
TREATMENT PLANTS 


Risk identification was performed for two municipal
sewage treatment plants in the Upper Silesian
Industrial District (GOP). The examined objects
are characterized by similar technological lines, PE
value and the method of sewage discharge to the
receiver. The characteristics of the plants in question
are presented in Table 1.

Table 1. Characteristics of the examined treatment plants.

Characteristic | Plant A | Plant B
Population Equivalent (PE) | 52 000 | 62 500
Treatment technology | Activated sludge | Activated sludge
Elements of technological line: expansion chamber | Yes | Yes
Elements of technological line: screening | Bar screens | Sifters
Elements of technological line: horizontal grit-removal tank, aerated, two-chamber | Yes | Yes
Elements of technological line: biological dephosphatation chamber | Yes | Yes
Elements of technological line: activated sludge chamber | Yes | Yes
Elements of technological line: secondary settling tanks, radial | Yes | Yes
Difference in the quantity of technological line elements | 3 secondary settling tanks | 2 secondary settling tanks
Possibility of sewage delivery with the use of sewage truck | Yes, cast station present | Yes, cast station present
Industrial objects in the catchment area | Meat processing plant, electroplating plant | None
Receiver | Watercourse | Watercourse

The analyzed sewage treatment plants are 
supervised by the same manager, so they both have 
identical treatment plants’ worksheets, in which all 
works carried out during the shift and any failures 
or adverse events are recorded. This should stand- 
ardize one of the human factor components in 
threats identification. 


4 METHODOLOGY 


First step in identifying threats of analyzed sew- 
age treatment plants was the thorough familiariza- 
tion with the technology and the way of operating 
objects, based on submitted operating instructions. 
The recognition of adverse events was preceded by 
a thorough interview with the manager of the object 
and shift managers. Researchers got acquainted 
with current operating problems and previous meth- 
ods of their elimination. Operating instructions [6,7] 
for the treatment plant obligate the staff to carefully 
keep the operating log, which consists of: 


• Worksheets of the treatment plant: all work
carried out during the shift, failures and
observations must be recorded;

• Equipment repair and maintenance cards: contain
records of repair and maintenance works
and failures;

• Technological notebook: contains the results of
laboratory analyses and information
about the amounts of waste generated in the
treatment plant.

Further stages of the work are based on these
documents.


Next, after analyzing the worksheets, a list 
of risk factors was prepared together with the 
frequency of their occurrence. Any doubts and 
inaccuracies in the records were verified with the 
operator’s personnel. The next stage was to get 
acquainted with the equipment repair and main- 
tenance card in order to determine risk factors 
being eliminated during maintenance of individual 
objects. This allowed identification of latent risk. 

The final stage of the work was to determine 
which of adverse events actually cause a risk in 
the operation of a sewage treatment plant. For 
this purpose, records of laboratory tests results 
of treated sewage at the outflow during and after 
the occurrence of a potential threat were checked 
with the use of technological notebook. In the case 
of incorrect values of any quality indicator, the 
occurrence of a given adverse event was treated as 
a threat causing losses—an emergence of a risk. 
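
The decision rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the indicator names (BOD5, COD, TSS), the limit values, and the measurements are hypothetical and are not taken from the examined plants or their permits.

```python
from typing import List, Mapping


def exceeded_indicators(measured: Mapping[str, float],
                        limits: Mapping[str, float]) -> List[str]:
    """Names of quality indicators whose measured value exceeds its limit."""
    return sorted(name for name, value in measured.items()
                  if name in limits and value > limits[name])


def is_threat(measured: Mapping[str, float],
              limits: Mapping[str, float]) -> bool:
    """An adverse event counts as a threat if any indicator is exceeded."""
    return bool(exceeded_indicators(measured, limits))


# Hypothetical limits and outflow results (mg/l), for illustration only.
limits = {"BOD5": 15.0, "COD": 125.0, "TSS": 35.0}
after_event = {"BOD5": 12.0, "COD": 140.0, "TSS": 20.0}
normal_day = {"BOD5": 8.0, "COD": 60.0, "TSS": 10.0}

print(is_threat(after_event, limits), exceeded_indicators(after_event, limits))
print(is_threat(normal_day, limits))
```

In practice the measurements would come from the technological notebook entries recorded during and after the adverse event.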

Identification of threats in the examined treat- 
ment plants was performed for a period of 3 years 
of object’s exploitation. This period fell on the 
years 2014-2016. 


5 THREATS IDENTIFIED IN MUNICIPAL 
SEWAGE TREATMENT PLANTS 


5.1 Identified risk factors [4] 


Risk factors in municipal sewage treatment plants 
may be divided into: 


• internal: caused by the impact of treatment
plant objects,

• external: caused by the influence of factors
from outside the treatment plant,

• ordinary: easy to predict, often occurring during
current operations,

• extraordinary: unpredictable fortuitous events,
unusual situations,

• explicit (existing): occurring in the past, often
appearing periodically,

• latent: potentially possible, but not yet occurred
(in a given object).


5.2 Identified risk types [4] 


During the risk identification process of municipal 
sewage treatment plants one can distinguish the 
following types of risk: 


• qualitative: results in lowering the level of sewage
treatment or in digestibility of individual
devices,

• operational: results in a decrease of the efficiency
of technological line cleaning, and in
extreme cases even in a breakdown of the
process,

• ecological: results in a negative impact on the
receiver and environment,

• financial: results in a financial outlay which
must be borne by the sewage treatment plant
as a consequence of the arisen threat.


6 SEWAGE TREATMENT PLANT “A” 


In the analyzed three-year operational period of
the sewage treatment plant, a total of 36 different
threats were identified. Some of them were isolated
events, others occurred several times. It was
found that a total of 111 situations could have been
risk-causing events. Table 2 presents the identified
threats for individual objects of the treatment plant.

The percentage share of risk-posing events 
(Figure 1) indicates that devices exposed to the 
highest risk are the bar screen and activated sludge 
chamber. Description of the most frequent events 
and specification of type of factors and type of 
risk they cause are presented in Table 3. 
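
The percentage shares can be recomputed directly from the event counts in Table 2. The short Python sketch below does this; the counts are copied from Table 2, while the function name `percentage_shares` is an illustrative choice.

```python
# Event counts per device, copied from Table 2 (treatment plant "A").
events_plant_a = {
    "Cast station for delivered sewage": 1,
    "Local sewage pumping station": 1,
    "Bar screen": 30,
    "Grit chamber": 18,
    "Dephosphatation chamber": 15,
    "Activated sludge chamber": 23,
    "Clarifier": 17,
    "Sewage treatment plant": 6,
}


def percentage_shares(counts):
    """Share of each device in the total number of events, in percent."""
    total = sum(counts.values())
    return {device: 100.0 * n / total for device, n in counts.items()}


shares = percentage_shares(events_plant_a)
# The two devices with the largest share of risk-posing events.
top_two = sorted(shares, key=shares.get, reverse=True)[:2]

for device in sorted(shares, key=shares.get, reverse=True):
    print(f"{device}: {shares[device]:.1f}%")
```

With these counts the bar screen (30 of 111 events) and the activated sludge chamber (23 of 111 events) come out on top, in line with the statement above.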


Table 2. Identified risk-posing factors for individual objects of treatment plant “A”.

Device | Number of events
Cast station for delivered sewage | 1
Local sewage pumping station | 1
Bar screen | 30
Grit chamber | 18
Dephosphatation chamber | 15
Activated sludge chamber | 23
Clarifier | 17
Sewage treatment plant | 6
Total | 111

Figure 1. Percentage share of events that may cause risks
in individual devices of the A plant’s technological line
(pie chart “Threats on objects”; legend: bar screen, activated sludge
chamber, grit chamber, clarifier, dephosphatation chamber).

Table 3. Examples of risk factors for selected elements.

Device | Event | Factor | Type of risk | Effect | Action taken/proposed
Bar screen | clogging of bars | external | qualitative | impeded sewage flow | cleaning of bars
Bar screen | large fat and meat dump | external | qualitative, operational, financial, ecological | greasing and clogging of bars, greasing of grit-removal tank | removal of excess fat, cleaning of bars
Bar screen | failure of bar screen automatics | ordinary | qualitative | closing of bar screen channel | repair of bar screen controlling automatics
Grit-removal tank | failure of grit-removal tank regulation | ordinary | qualitative | increased amount of sludge in chambers | repair of failure
Grit chamber | dump of greasy wastewater | external | qualitative | large amount of fats in the flotation area | pumping-out of fats
Dephosphatation chamber | sludge floating on the surface of dephosphatation chamber and activated sludge chamber | internal | qualitative, operational | bad sedimentation of sludge | lime dosage
Activated sludge chamber | freezing of activated sludge chamber surface | external | qualitative | ice formation on sludge coat | breaking the ice
Activated sludge chamber | formation of ice blockage | external | qualitative, operational | inability to drain sewage | removal of blockage
Activated sludge chamber | problems with sludge draining, sludge putrefaction | internal | qualitative, operational | formation of scum layer | removal of scum layer and removal of putrefied sludge
Activated sludge chamber | emergence of filamentous bacteria | internal | qualitative | formation of scum layer | breaking the scum layer and bacteria removal
Secondary settling tank | freezing of the settling tank surface | external | qualitative, operational | formation of ice layer | breaking the ice layer
Secondary settling tank | auxiliary devices failure | ordinary | qualitative, operational | minor disturbance in the settling tank operation | repair of auxiliary devices
Sewage treatment plant | electrical power outage | external | qualitative, operational | lack of power for electrical powered devices | connection to the emergency power supply

Table 4. Identified risk-posing factors for individual objects of treatment plant “B”.

Device | Number of events
Cast station for delivered sewage | 2
Inflow to the plant | 1
Sifters | 9
Grit chamber | 10
Activated sludge chamber | 72
Clarifier | 18
Sewage treatment plant | 2
Total | 114


7 SEWAGE TREATMENT PLANT “B” 


Over the same time period, 32 different threats that 
could have caused risks were identified in treat- 
ment plant B. The number of risk-posing events 
was 114. Table 4 presents the identified threats 
and the number of their occurrences in individual 
objects of the “B” treatment plant. 

Figure 2 shows that the most risk-posing device
in the B plant was the activated sludge chamber. The
reason for this situation may be the attempt to
improve the effectiveness of sewage treatment in
the year 2015 and the resulting interference in the
technology used, as a result of which filamentous
bacteria and a scum layer appeared in the activated
sludge chamber. The most frequent threats are shown in
Table 5.


Figure 2. Percentage share of events that may cause 
risks in individual devices of the B plant’s technologi- 
cal line. 


8 COMPARISON OF THREAT 
IDENTIFICATION RESULTS IN THE 
EXAMINED TREATMENT PLANTS A 
AND B 


Among the identified threats, 9 events occurred in 
both sewage treatment plants with different inten- 
sity (Table 6). 

Large discrepancies in quantities of individual 
phenomena occurring in examined treatment 
plants may be caused by various circumstances, for 
instance: 
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Table 5. 


Examples of risk factors for selected elements. 


Action taken/ 


Device Event Factor Type of risk Effect proposed 
Sifters sifter scraper ordinary qualitative clogging of sifter repair of scrapper 
failure 
Grit chamber large dump of external qualitative clogging of grease unclogging the 
greasy sewage chamber outflow outflow 
Activated emergence of internal qualitative, formation of scum breaking the scum 
sludge filamentous operational layer layer and actions 
chamber bacteria aimed at stop- 
ping bacteria 
development 
Activated momentary change External qualitative, high concentration turning on mam- 
sludge in quality of ordinary operational of ammonia in moth rotors 
chamber inflowing sewage aeration chamber 
Activated steering ordinary qualitative, incorrect readings manual steering 
sludge malfunction operational from activated and repair of 
chamber sludge chamber automatics 
Activated freezing of activated external qualitative formation of breaking the ice 
sludge sludge chamber ice layer layer 
chamber surface 
Secondary malfunction of ordinary qualitative cleaning of discharge reapair, cleaning of 
settling discharge flume flume impossible flumes 
tank cleaning brush 
Secondary auxiliary devices ordinary qualitative, minor disturbance repair of auxiliary 
settling failure operational in the settling devices 
tank tank operation 
Sewage electrical power external qualitative, no power for connection to 
treatment outage operational electrical powered emergency power 
devices supply 
Sewage failure of devices ordinary qualitative, automatic steering repair of visualiza- 
treatment visualization internal operational of all devices tion system 
impossible (aeration 
chambers operate in 
manual mode) 
Table 6. Frequency of the same phenomena occurrence in sewage treatment plants A and B.

Columns: Device | Event | Treatment plant “A” | Treatment plant “B”
Devices: Cast station for delivered sewage; Grit chamber; Activated sludge chamber; Secondary settling tank; Sewage treatment
Events: No flow; Freezing of grit chamber elements; Large dump of greasy sewage; Emergence of filamentous bacteria; Freezing of activated sludge chamber surface; Mammoth rotor malfunction; Power cut of operation sensors; Auxiliary devices failure; Electrical power outage
(The frequency values for plants A and B are illegible in the source.)


• the dump of greasy sewage was influenced by the meat-processing industry in the catchment area of treatment plant A;

• the emergence of filamentous bacteria was the effect of attempts to improve the efficiency of treatment plant B, i.e. technological changes;

• momentary power outages in treatment plant A were caused by earthworks near the main power cable; treatment plant A was notified beforehand in five out of six situations.


The repeated occurrence of only 9 adverse events in
both analyzed plants does not mean that the remaining
events are specific to a given treatment plant.
The same situations may be described differently in
the worksheets depending on who makes the record
on a given day. During the identification process,
efforts were made to unify the records in operating
logs, but this was not always possible.

The main operational problem of both treatment
plants is fibrous substances, which are not included in
the above list. In treatment plant A, the cleaning of
fibers from probes (activated sludge chamber, secondary
settling tank) was called “a maintenance
of individual objects”. Because of this problem,
the initially assumed period between
inspections was shortened from 6 to 3 months. In
treatment plant B, on the other hand, an event called
“removing of rags” has been recorded on individual
objects (activated sludge chamber, secondary settling
tank) every 3 months on average. Only after
conversations with the manager and employees of the
sewage treatment plant did it become clear that both
records describe the same operational activity.

The above examples show that the human factor is 
of great importance during identification of threats. 


9 CONCLUSIONS 


In both treatment plants, phenomena appeared
that may become risk factors under unfavorable
conditions. In the analyzed period there were no
exceedances of the indicators examined at the outflow, so
it can be concluded that potentially risk-posing
situations did not adversely affect the process of
sewage treatment. Employees’ behavior, their decisions
and the actions taken when a risk-posing factor
emerges have an impact on the level of sewage
treatment attained by each treatment plant.

The human factor plays a decisive role in risk
identification. Actions aimed at standardization of
this factor (comparison of threats in two municipal
sewage treatment plants with a similar PE and similar
technological lines) have emphasized its role not
only in the identification process but also in risk
management. The methods of dealing with a given threat
depend on the training and expertise of the staff.


ABSTRACT: The paper demonstrates the importance of tacit knowledge in coping with various
situations during field work in the high arctic. Two cases from field work at the University Centre in Svalbard
(UNIS) are used to exemplify this: one boat trip and one snow mobile trip with researchers and
students. Successful field operations depend heavily on technicians from the UNIS Field Safety Section, who
are responsible for assisting in the planning and execution of every type of field work. Due to rapidly
changing conditions, local variations, extreme weather conditions, and lack of access to infrastructure and
communication, successful safety performance is accomplished through individuals’ ability to adapt to
situations. The paper demonstrates that this ability is to a large extent a function of the tacit knowledge of
the technicians. To improve the tacit knowledge of each technician, systems and practices of experience
feedback must be run to ensure individual and organizational learning from both failures and
successes. This is particularly important in systems with great variability in climatic conditions and systems
with organizational changes.


1 INTRODUCTION 


The University Centre in Svalbard (UNIS) has 
been operating in Longyearbyen, Svalbard since 
1993. Longyearbyen is a small town located on 
the west side of Spitsbergen, a part of the Sval- 
bard archipelago in the high arctic at 78 degrees 
north. UNIS educates more than 800 students and 
supports close to one hundred research projects on 
an annual basis. In total, UNIS has close to 12 000
field days per year. Education and research are
field based, and the season lasts from January to
December.

Operations in the high arctic prove to be chal- 
lenging. Challenges encountered include, but are 
not limited to: lack of infrastructure, harsh and 
variable weather, darkness, and rapidly changing 
natural hazards. These are conditions that have to 
be handled from day to day to ensure safe opera- 
tions for students and scientists. 

In the last five years natural hazards have been
changing at an increased rate (e.g. avalanche
danger, melting sea ice, high levels of precipitation,
rapid fluctuations in air temperature etc.).
For UNIS this implies that established operational
procedures based on many years of experience
are no longer valid. The practitioners responsible
for planning and guiding students and scientists
in the field often experience that the performance
deviates from the plan. They constantly have to choose
to deviate in order to maintain a safe performance
of the activity. A challenge is thus to have a system
or process that secures feedback to the decision
makers.

To keep up with the changing risk picture, experience
feedback and tacit knowledge are becoming
more and more important for maintaining safety
management and safe operations.

The purpose of the paper is to exemplify how
experience feedback and tacit knowledge are used
to manage safety for operations in the high arctic
and to discuss how experience feedback can ensure
organizational learning for field operations.


2 EXPERIENCE FEEDBACK 


Safety management is based on the principle of 
experience feedback, i.e. the process by which 
information about the results of an activity is fed
back to decision makers as new input to modify
and improve subsequent activities (Kjellén and 
Albrechtsen, 2017). Kamsu Foguem et al. (2008) 
have a similar interpretation: experience feedback 
is a process whereby experience at an operational, 
tactical or strategic level is disseminated in such 
a way that the knowledge is used to improve the 
organization’s performance. 

The purpose is to use information about expe- 
rienced or expected safety performance as a basis 
for decisions that prevent accidents and reduce 
accident risk. 

Experience feedback is based on principles from
quality management such as Juran’s (1989) persistent
feedback control and Deming’s (1993) cycle.
Kjellén and Albrechtsen (2017) present a safety
information system based on principles of experience
feedback that consists of collection of data
about experienced and expected safety performance;
analysis and storage of the data; distribution
of analyzed data to decision-makers; and
decision-making and implementation of safety measures. This
system facilitates systematic improvement
of safety based on experiences (incidents,
non-conformities, observations, etc.); identification
of current performance (inspections, audits); and
expected safety performance and challenges (risk
assessments).

Another important principle of experience 
feedback is organizational learning and knowledge 
sharing. The process of organizational learning 
involves an organizational unit changing itself or 
its knowledge base as a result of experience (Cyert 
& March, 1963). The unit can learn directly from 
its own experiences, or from the experiences of 
other units (Levitt & March, 1988). Argyris and 
Schön (1996) claim that organizations learn only if 
the product is a change in behavior and governing 
variables in the organization. 

Figure 1. Knowledge creation in organizations (Nonaka and Takeuchi, 1995).


Although the literature distinguishes between individual
and organizational learning, there is a clear
relationship between the two (Crossan et al., 1999).
Nonaka and Takeuchi (1995) demonstrate how
transitions between tacit and explicit knowledge lead
to organizational knowledge (Figure 1). Through
these processes, knowledge is converted from individual
knowledge to shared knowledge that can be
utilised by the whole organisation. The transitions
are continuous processes that lead to a learning
spiral. Nonaka and Takeuchi (1995) propose four
basic processes whereby knowledge is converted:


— Externalisation takes place when tacit knowl- 
edge is made explicit, for example when an 
unwanted occurrence is observed and reported 
by a worker at the sharp end. 

— Combination takes place when explicit knowl- 
edge is combined with other explicit knowledge, 
for example when a reported unwanted occur- 
rence is compared with other reported unwanted 
occurrences in an effort to identify similarities. 

— Internalisation takes place when explicit knowl- 
edge becomes tacit knowledge. The point is to 
see the importance of making practical use of 
knowledge through converting the explicit to 
practical, effective and correct actions. 

— Socialisation takes place when tacit knowledge is 
spread as tacit knowledge to other members of 
the organisation, who learn over time through 
seeing what others do. 


The principles of organizational learning imply
that safety management based on experience feedback
depends on both formal and informal
processes of knowledge sharing among practitioners,
safety staff and managers.


3 SAFETY CHALLENGES IN FIELD 
OPERATIONS AT THE UNIVERSITY 
CENTRE IN SVALBARD 


Every person going through the UNIS system 
expects access to teaching, learning or research 
in the field. Safety technicians at UNIS have the 
responsibility to assist in the planning and execu- 
tion of every type of field work. This includes the 
safe transport of groups to their desired field loca- 
tions, and then further technical assistance and 
general safety at the field site. Safety technicians at 
UNIS benefit from the experience of years of living 
and working in the Arctic. This includes hundreds 
of hours of work in the field on an annual basis. 

At UNIS there are two distinct field seasons.
The winter field season normally runs from January
to May, while the summer field season extends from
June to October. November and December are typically
slower, due to lack of snow and light.

Winter field season is characterized by snow 
mobile travel. Other forms of transportation may 
include travel by beltwagon or larger sea going ves- 
sels. Typical hazards encountered during winter 
season include: snow mobile driving, avalanche
terrain, sea ice, glaciers, harsh weather conditions,
and polar bear encounters. 

Field operations in the summer season are mainly
carried out by boat. A variety of vessels are used
including: zodiacs, polar circles, tourist boats and 
large cruise vessels. Typical hazards encountered 
during summer season include: harsh weather, 
rough ocean conditions, camp challenges, polar 
bear encounters, and hazards encountered when 
travelling in steep mountainous terrain. 

The top priority for the technicians in both the winter
and summer field seasons is to ensure that the
group has safe transportation from Longyearbyen
to its field location. In the winter season this means
controlling a group of 10-30 students and teachers on
snow mobiles as they drive through variable terrain
to their destination. In summer it means transporting
up to 12 students and teachers at a time by polar
circle to their destination.

Ideally the technicians should only operate 
under favorable conditions, where there are few 
hazards and thus low risk. Due to the environment 
and related hazards and challenges, all activities 
involve some kind of risk, and the plan often devi- 
ates from what is expected. 

Two examples will be presented which illustrate
the challenges associated with field travel in both
winter and summer.


3.1 Case 1: Boat travel, summer season 


The objective of the field travel was to drop off one
group in Colesbukta, and to drop off camp equipment
and perform water sampling in Dicksonfjorden
(see map in Figure 2). The two trips were originally
scheduled to take place on different days, but this 
had to be changed when an attempt to drop off 
the first group in Colesbukta the day prior was 
unsuccessful. Technician A had taken three scien- 
tists, plus a boat full of field equipment towards 
Colesbukta the previous afternoon. The group had 
traveled for approximately one hour in bad condi- 
tions, heading straight into the oncoming waves 
which were crashing over the bow of the boat at a 
height of over two meters. The technician decided 
to turn around for several reasons: 


— If the technician maintained the same speed, an
hour-long round trip would have taken two to
four hours in conditions which were not expected
to improve

— The wind direction was not favourable for landing
in the desired location

— The scientists had a lot of heavy equipment
which would take a long time to unload

— The desired drop-off point was extremely shallow,
with lots of known objects in the water.
With heavy waves and wind, the chances of either
grounding the boat or hitting the propeller/engine
on an object were increased

— The boat was heavy and had a broken anchor
winch, making it not ideal for shore landings,
especially in the given conditions


Figure 2. Map showing the routes described in the following
tasks. Summer hazards are where shallows and
rough seas are normally encountered. Winter hazards are
where avalanche, glacier, sea ice and open water hazards
are normally encountered. Basemaps © Norwegian Polar
Institute.


If the technician had encountered only one of the
above-mentioned factors, the trip would probably
have been completed. A combination of all of the
factors, which the technician was able to identify 
during the trip, forced the decision to turn around. 
There was no protocol for this, but due to past 
experiences the technician was able to determine 
that in this particular situation, the risk was not 
worth the potential consequences. 

The next day the winds and sea had calmed 
making it realistic to complete both trips. Due to 
a combination of factors it was decided that two 
technicians and two boats were needed to make 
this a successful operation. A request was made by 
Technician A to Technician B for assistance. Fac- 
tors which were considered for this trip included: 


— Colesbukta and Dicksonfjorden are in oppo- 
site directions, so if only one boat went, the trip 
would take a significantly longer amount of time 

— The drop-off spot for the camp in Dicksonfjorden
is dependent on tides. A lot of gear needed
to be unloaded, which is time sensitive in order
not to ground the boat

— The boat going first to Dicksonfjorden needed to 
pick up scientists from the camp for sampling 

— Wind was coming from east, making the travel 
to Dicksonfjorden fine, but more challenging on 
the way back to Longyearbyen when heading 
directly into the wind 


The plan was then executed safely and success- 
fully. The two boats left at the same time, with Boat 
B heading first to Colesbukta to drop off scien- 
tists and gear, and then bring the rest of the gear 
to Dicksonfjorden to meet Boat A in time to drop 
off equipment before the tide started falling. Boat 
A went straight to Dicksonfjorden and was able 
to complete the sampling and drop off the other 
equipment. Boat A and Boat B were then able to 
drive back to Longyearbyen from Dicksonfjorden 
together, which was optimal because the winds 
began to pick up creating unfavorable conditions 
for driving alone. Boat B has a covered cabin, making
it more favorable for driving in big waves. Boat
A could then drive in the wake of Boat B, so as not to
take as many waves into the boat on the way back
to Longyearbyen.

This task ended successfully for several reasons. 
The two technicians had combined experience 
from driving boats in Isfjorden. They both knew 
the challenges with landing in the two areas, and 
were able to understand the implications that the 
weather conditions and tides would have on the 
locations they needed to get to and tasks they 
needed to complete. Flexibility played a huge part 
in that both technicians were willing to change 
plans when it proved necessary in order to com- 
plete a safe and successful trip. The challenges and 
potential hazards encountered during this scenario 
are not uncommon or unknown. Shallows, tides,
waves, wind, boat problems, etc., are all encountered
by boat drivers in Svalbard. The challenges
can be anticipated, but the knowledge needed
to deal with them and to operate around them
is gained only through experience.


Figure 3. Example of vessel used during field trips,
UNIS Polaris beached.


3.2 Case 2: Snow mobile travel, winter season 


The objective of the field travel was to transport a 
group of students and professors from Longyear- 
byen to Svea (a small mining settlement located 60 
km away from Longyearbyen) in early February 
by snow mobiles. Travel between Longyearbyen 
and Svea is common during the winter and spring 
months by snow mobile. Transporting a group of 
up to 25 students and professors to do work on the 
sea ice close to Svea is normally not challenging. 
However, in recent years, Svalbard has experienced 
more precipitation, higher temperatures and less 
sea ice. This leads to more challenging conditions 
for winter field work. The technicians anticipated 
the following risk factors for this particular task: 


— 25 people with little to no snow mobile experi- 
ence must drive over 60 km through challenging 
conditions 

— Driving through avalanche terrain 

— Driving in places with open water 

— Driving over terrain which is icy and rocky 

— Driving over glaciers with crevasses 

— Working on sea ice 


Therefore, special precautions had to be taken, 
and here it is the job of the technicians to use their 
knowledge and expertise to ensure the group can 
travel and work safely in Svea. One should always 
expect and be prepared to encounter hazards doing 
this kind of work, but in this particular situation 
there were several risk factors which needed special 
attention from the technicians to ensure safe and 
successful travel: 


— The season up to this point had brought unu- 
sual conditions which included: a lot of snow, 
followed by heavy rain and temperatures up to 
+7°C. Temperatures then quickly fell to well 
below 0°C. 

— The normal route to Svea includes traveling 
through narrow valleys surrounded by steep 
mountains which leads to both avalanche haz- 
ards and water hazards 

— The route also includes travel across a wide 
open valley which is vulnerable to wind which 
can blow all of the snow away leading to icy and 
rocky conditions 

— The route includes a glacier crossing which cre- 
ates a potential crevasse hazard 

— Due to the decreased activity in Svea, the route,
which is normally well maintained, is much more
unknown and unpredictable

— Sea ice, which normally forms close to Svea,
is affected by warmer air and sea temperatures.
Extra precaution must be taken when working
on this ice.


Instead of following the normal procedure of sending out
one technician to follow the group,
several scouting trips were undertaken
in order to identify all of the possible hazards the
group might encounter, and to confirm that safe
travel was possible. Three separate scouting trips
were completed until the technicians were satis- 
fied that they were comfortable sending the group 
through the terrain and had identified all possi- 
ble hazards and deemed them manageable for the 
group. The technicians identified open water, ava- 
lanche conditions, blue ice and rocks. They were 
able to deviate slightly from the normal route, in 
order to find a route which was as safe as possible 
for the students and staff. 

On the day when the students and staff were 
supposed to travel to Svea, two technicians joined 
to ensure safe travel. The trip was completed suc- 
cessfully and the students and staff were able to do 
their work in Svea. 


4 DISCUSSION 


4.1 Adaptation and flexibility 


Both field trip examples described in the previous
section were successful in terms of safety because
the technicians were able to anticipate the risks
of the trips and thus adapt to the situation.
Rankin et al. (2014) present a framework for
understanding coping mechanisms (adaptions) used to
respond to variations in a dynamic environment,
see Figure 5. Adaptions are a function of 1) objectives,
i.e. the outcome that the adaption aims to
achieve, which is related to identifying demands,
pressures and conflicting goals; 2) the context in
which the adaption is carried out; and 3) the resources
and conditions necessary for successful implementation
of the coping mechanism, including both
“hard” and “soft” conditions such as availability
of knowledge. The adaption in itself consists of
1) the four cornerstones of resilience (Hollnagel
et al., 2007): anticipating, monitoring, responding
and learning; and 2) interactions between the sharp
end and the blunt end.


Figure 4. Snow mobile travel to Svea.

The successful adaptions among the technicians
on the field trips can be described in such a framework.
Their adaption is a function of the context
of the action that consists of their ability to monitor
and anticipate the situation. The technicians’
tacit knowledge is a key contributor to their ability
to adapt to the situation, not least because the success
of the actions depends on the decisions made
at the sharp end, owing to the lack of communication
infrastructure with the blunt end.

The tacit knowledge among the technicians and 
their ability to adapt to the situation are essential 
in both cases for maintaining safe travels in continuously
changing, variable conditions. Scenario 1 has several
elements which made it more difficult than a normal
drop-off or pick-up in the field. Tacit knowledge from
the technicians which
is acquired through multiple seasons and hundreds 
of hours of driving boats around the Isfjorden 
area was vital to complete this task in a safe and 
effective way. The snow mobile trip in scenario 2 
was successful for many of the same reasons that 
the boat trip was successful. The technicians were 
able to use their past experience and knowledge to 
anticipate the hazards and then act accordingly. 
Flexibility plays an important role. The techni- 
cians were able to put in much more work than is 
normal for this kind of trip. When many different 
hazards exist and combinations of hazards are not 
always predictable, tacit knowledge and experience 
are essential. 

The two scenarios also show the connection 
between the sharp end—the practitioner, and the 
blunt end—the management. For both scenarios 
a key word is flexibility. Based on the experience 
in the sharp end from similar operations, the man- 
agement can see the need for flexibility and use of 
extra time to adapt to the situation. Experience in 
the sharp end is building situational understanding 
in the blunt end. 


4.2 Experience feedback at an organizational 
level to improve coping mechanisms 


The knowledge available to the technicians at
the sharp end during field trips could be improved by
principles of experience feedback, as illustrated in
Figure 5. The framework by Rankin et al. (2014)
is rooted in principles of resilience engineering.
Resilience engineering is centred around four main
abilities: to respond, to monitor, to anticipate and
to learn. The ability to learn from both failures and
successes is key to improving the knowledge among
sharp-end practitioners. By learning from experiences
on field trips, both individually and among
colleagues, the resources that contribute to adaptions
will improve.


Figure 5. Framework for analysis of coping mechanisms
(based on Rankin et al. 2014) and the importance
of experience feedback.

The technicians who participate in fieldwork 
with students and employees acquire new experi- 
ence every day and thus maintain and develop their 
tacit knowledge. In addition, their tacit knowledge
is generated when they continuously close non-conformities
during field trips, as shown in the two
cases in the previous section.

Individual learning is also important learning
for the organisation. Feedback from experiences
related to everyday tasks, i.e. what works and what
does not work, should be shared in the
organization.

Nonaka and Takeuchi (1995) emphasize how
different transitions of tacit and explicit knowledge
create shared knowledge in organizations.
Socialization (from tacit to tacit knowledge) is
one of the transitions that contributes to organisational
learning. The technicians at UNIS start
each morning with a meeting to go through the
tasks/duties to be performed on that day. Ekman
(2012) highlights the importance of informal conversations
in making tacit knowledge visible in the
organisation, and of facilitating arenas that encourage
small talk. The morning meetings follow a set
agenda where events from the day before are discussed
and the technicians inform each other of
changes related to snow conditions, weather, etc.
The technicians generally meet up for a cup of coffee
together before this meeting. A lot of information
that should be raised during the formal meeting
is shared during this five-minute period. As
a result, important information is not raised at
the meeting with the management because it has
already been shared during the small talk over coffee
before the meeting. Tacit knowledge among
those talking is improved, but there is additional
potential for organizational learning if this knowledge
is shared among more people.

Transition from tacit knowledge to explicit
knowledge (externalization) will also contribute
to organizational learning (Nonaka and Takeuchi,
1995). Within safety management, systems for
reporting unwanted occurrences are an important
contribution to externalization of tacit knowledge,
but also to combination of explicit knowledge as
well as internalization (from explicit knowledge to
tacit knowledge) (Kjellén and Albrechtsen, 2017).

Such learning among technicians will happen in
communities of practice (Wenger, 1998). A community
of practice is a group of people that has a
mutual engagement, common goals and activities
and a common repertoire of actions and resources
(Wenger, 1998). Among the primary parts of learning
in communities of practice we find social participation,
sharing stories, apprenticeship learning,
and the view that learning is a complex social phenomenon
dependent on context.

How can one use and systemise the informal
“coffee break” to strengthen learning in the organisation?
Ekman (2012) refers to the importance of
horizontal meeting places for tacit knowledge,
where everyday experiences can be shared. It is
also possible to learn from conversations about a
completely normal day. Such conversations provide an
opportunity to test out the prevailing knowledge and
create new learning. A traditional view in the field of safety
is that one learns from mistakes and incidents.
However, in more recent times, it has become
more common to focus on learning from successful
tasks (Hollnagel, 2014), which after all make up most
of the tasks one performs during a working day.
By learning from everyday events, it is possible
to test the prevailing knowledge and, in doing so,
uncover practices that are unsafe, even though no
accidents have occurred. “Learning from successful
operations is not only about identifying and
promoting good practice, it is also about detecting
the instances where no accident occurred in spite of
unsafe practices or unsafe systems” (Rosness et al.,
2016).


4.3 Contextual changes that affect
experience feedback


UNIS is experiencing increased student production and
rapid changes in the natural environment. Ashby’s
(1961) law of requisite variety states that control of
a system is achieved only when the variety of countermeasures
matches the variety and changes of
the system. This implies that the field technicians
must acquire new knowledge and put this into
effect to deal with the contextual changes of their
field activities. Systems and practices for experience
feedback would enable improved knowledge
to handle new situations.

Growth is not only a matter of increasing staff- 
ing to deal with the increased activity. One must 
also make structural changes to ensure that the 
environment for learning from tacit knowledge 
and making this visible is as favourable as possi- 
ble. If one looks at organisations that manage to 
exploit tacit knowledge, facilitating communica- 
tion is a key factor. Structurally, one can facilitate 
the rapid spreading of knowledge and spend time 
on systematic training and review the composition 
of the work group to attain a mentor effect. 


5 CONCLUSION 


Successful field operations at the University Centre 
in Svalbard depend heavily on safety technicians 
that have the responsibility to assist in the planning 
and execution of every type of field work. Due
to changing conditions, local variations, extreme
weather conditions, and lack of access to infrastructure
and communication, successful safety performance
is created by individuals’ ability to adapt to situations.
This paper has demonstrated that this ability
is to a large extent a function of the tacit knowledge
of the technicians. To improve the tacit knowledge
of each technician, systems and practices of experience
feedback must be run to ensure individual
and organizational learning from both failures
and successes. This is particularly important in
systems with great variability in climatic conditions
and systems with organizational changes.


Automation of the rail—removing the human factor?


T.M. Stene 
SINTEF, Trondheim, Norway 


ABSTRACT: Automated vehicles will be increasingly used as transport in the future. However, it is
unclear whether this implies full autonomy or different levels of automation. A unified definition of autonomy in
transport is missing. The SAREPTA project (Safety, autonomy, remote control and operations of industrial
transport systems) was established in 2017 and covers safety challenges of future intelligent transport
systems that are autonomous, remotely controlled and normally not manned. The project covers
road, sea, aviation and rail. This paper focuses on issues related to rail transport, including both metros
and railway. The purpose of the paper is to describe current rail accidents as a basis for questioning
whether future digitalisation will improve safety. The paper will discuss the autonomy concept in relation
to grades of automation. Relevant questions are: What is automation and which accidents may be prevented
by automation? To what degree do automation and remote control imply removal of the Human
Factor? And from a safety perspective: What is the safety potential of future automation, and how can
humans contribute to safety in future intelligent transport systems?


1 INTRODUCTION 


Digitalization is a global change affecting a variety 
of social conditions and businesses. In addition to 
changing products and services in businesses and 
the labour market, digitalization will also create 
radically new business models in many industries 
(Stene et al 2017). 


1.1 Safety and automation of transport systems 


Safety and environmental challenges of future
intelligent transport systems are addressed in a
newly established project funded by the Norwegian
Research Council for 2017-2021. The SAREPTA
(Safety, autonomy, remote control and operations
of industrial transport systems) project focuses on
systems that are autonomous, remotely controlled
and/or periodically not manned.

In the project, four thematic areas of autono- 
mous systems are central: (1) Risk identification 
and risk levels, (2) Infrastructure vulnerabilities 
and threats, (3) Technical, human and operational 
barriers to mitigate system risks, and (4) Organi- 
zational and human factors, and regulatory meas- 
ures. The project includes road, sea, aviation and 
rail. This paper focuses on the rail. The purpose 
of the paper is to describe current rail accidents 
as a basis for questioning whether future digitali- 
sation will improve safety. Relevant questions are: 
What is automation and which accidents may be 
prevented by automation? To what degree do auto- 
mation and remote control imply removal of the 
Human Factor? And from a safety perspective: What
is the safety potential of future automation, and
how can humans contribute to safety in future
intelligent transport systems?


1.2 Current rail transport safety—fatal 
and frequent accidents 


European railways are the safest mode of land 
transport and the safety level has improved over 
the last decades (EU ERA (European Railways 
Agency) 2016). However, accidents have heavy 
impact on confidence in the system. Further, every 
accident represents a significant business cost in a 
highly competitive environment. It is argued that 
emphasis needs to be on human factors as well as 
on new technology which can be both an opportu- 
nity and a threat. 

Compared to other transport modes, the fatality
risk for an average train passenger (0.12 per billion
km) is at least twice as high as for commercial
aircraft passengers (EU ERA 2017). However, the risk
is higher for passengers travelling by bus/coach (the
train passenger risk is one third of this) and by sea
vessels (nearly three times as high as for train
passengers). Further, individual transport on the
road is the most risky. Car occupants are at least
20 times more likely to die than train passengers.

Even if rail transport is statistically safer than
road transport, some large rail accidents have
occurred. The rate of fatal train accidents (five or
more killed; 362 in total) fell substantially from
1980 to 2009 on Europe’s main line railways (Evans
2011). The fatality risk per million train-km (system
risk) in the period 2010-2014, based on persons
involved, was 0.28 killed per million train-km at
the EU level (EU ERA 2016). For rail passengers,
this was 0.14 killed passengers per billion train-km.

Although rail transport safety has steadily
improved over the years, the number of accidents
started increasing in 2014 and 2015 (Eurostat
2017). Still, the number of victims (killed or
injured persons) continues to decline. Table 1
shows the number of persons killed and injured
in rail transport accidents in Europe in 2016. Two
types of accidents are dominant, (1) rolling stock
in motion and (2) level-crossings, followed by
(3) train collisions and (4) derailments.

The majority are accidents to persons caused by
rolling stock in motion. These persons are hit
either by a railway vehicle or by an object attached
to it. Persons who fall from railway vehicles are
included, as well as persons who fall or are hit by
loose objects when travelling on board vehicles.

Fatal level crossing accidents are more numer- 
ous and account for more fatalities than fatal train 
collisions and derailments (EU ERA 2016). Fur- 
ther, in contrast to collisions and derailments, the 
rate per train-kilometre remained unchanged in 
1990-2009. Thus, level crossing accidents represent 
an increasing proportion of serious accidents. 

The estimated accident rate in 2016 is 1.07 fatal 
collisions or derailments per billion train-kilometres, 
which represents a fall of 73% since 1990 (Evans 
2011). This gives an estimated mean number of fatal 
accidents in Europe in 2016 of 4.7. In contrast to 
fatal train collisions and derailments, the rate per 
train-kilometre of severe accidents at level crossings 
fell only slowly and not statistically significantly in 
1990-2016. There are statistically significant dif- 
ferences in the fatal train accident rates and trends 
between the different European countries. 
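The relation between the quoted rate and the quoted mean number of accidents can be checked with simple arithmetic. The sketch below assumes that the expected number of accidents is the rate multiplied by the traffic volume; the traffic volume is not stated in the text and is only backed out here, so it should be read as an implied figure rather than a reported one.

```python
# Figures quoted in the text (attributed there to Evans 2011).
RATE_PER_BILLION_TRAIN_KM = 1.07  # fatal collisions or derailments
EXPECTED_ACCIDENTS = 4.7          # estimated mean number in one year
FALL_SINCE_1990 = 0.73            # fractional fall of the rate since 1990


def implied_volume_billion_train_km(expected: float, rate: float) -> float:
    """Traffic volume implied by: expected = rate * volume (assumption)."""
    return expected / rate


def rate_before_fall(current_rate: float, fall_fraction: float) -> float:
    """Rate at the start of the period, given the fractional fall since then."""
    return current_rate / (1.0 - fall_fraction)


volume = implied_volume_billion_train_km(EXPECTED_ACCIDENTS,
                                         RATE_PER_BILLION_TRAIN_KM)
rate_1990 = rate_before_fall(RATE_PER_BILLION_TRAIN_KM, FALL_SINCE_1990)

print(f"Implied traffic volume: {volume:.2f} billion train-km per year")
print(f"Implied 1990 rate: {rate_1990:.2f} per billion train-km")
```

Under these assumptions the figures imply roughly 4.4 billion train-km per year and a 1990 rate of about 4 fatal collisions or derailments per billion train-km.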

Overall, the most common cause of fatal accidents
is signals passed at danger, followed by
signalling/dispatching errors and violation of the
speed limit. Further, small numbers are caused by
train fires and by groups of persons struck by
trains, mostly track workers.


Table 1. Number of persons killed and seriously injured
in rail transport accidents by type of accident in
Europe 2016 (Eurostat 2017).

                                      Number of persons
Type of accident                      Killed  Seriously injured  Total
Collisions                                44                 77    121
Derailments                               11                 27     38
Accidents involving level-crossings      256                220    476
Accidents to persons caused by
  rolling stock in motion                651                438   1089
Others                                     2                 16     18
Total                                    964                778   1742
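To make the dominance of the two largest accident types concrete, the fatality column of Table 1 can be turned into shares. The sketch below is only an illustration: it uses the "Killed" figures exactly as printed in the table, and the dictionary keys are shortened labels introduced here.

```python
# "Killed" column of Table 1 (Eurostat 2017), labels shortened for the sketch.
killed = {
    "Collisions": 44,
    "Derailments": 11,
    "Level-crossings": 256,
    "Rolling stock in motion": 651,
    "Others": 2,
}

total_killed = sum(killed.values())

# Share of all fatalities per accident type, in percent.
shares = {kind: 100.0 * n / total_killed for kind, n in killed.items()}

for kind, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{kind:<25s} {share:5.1f} %")
```

Accidents to persons caused by rolling stock in motion and level-crossing accidents together account for well over nine in ten of the fatalities in the table, while collisions and derailments together account for under six percent.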

The causes of level crossing accidents differ from
those of train collisions and derailments. The most
frequent cause of fatal train collisions and
derailments is signals passed at danger. The
majority of level crossing accidents are caused by
errors or violations by road users. Most major
crossings in Europe
have automatic warnings (lights, barriers and bells) 
operated by approaching trains. Most minor cross- 
ings have fixed warning signs only, with no indi- 
cation when trains are approaching. The primary 
responsibility for operational safety thus rests with 
road users, either in obeying warnings or checking 
that no train is approaching before they cross. 


1.3 Animals along the track—a current challenge 


Less severe accidents and incidents strongly out- 
number fatal accidents (EU ERA 2016). However, 
these occurrences are not collected at the EU level, 
and great benefits could be made from reporting 
them to identify and manage risks. 

While the number of people killed or injured in 
rail accidents is well-documented, little research 
has been done to analyse the number of animal 
casualties on international railways (Gray 2015). 
High-speed trains often cut through sensitive wild- 
life habitats. Accidents involving various species 
are detrimental to local wildlife, are costly and a 
danger to travellers. 

In Norway, nearly 2000 collisions with animals
are recorded on the railway each year, which is a
doubling of the frequency over 20 years (Roaldsen
et al. 2015). Reduction of crashes, even by a few
percent, can contribute to significant
socio-economic savings and reduced suffering for
both humans and animals.

From 1991 to 2014, the Norwegian National Rail
Administration registered nearly 26 000 events
with one or more animals (nearly 36 000 animals)
being hit by a train. Over 90 percent involve moose
(57%), roe deer (15%), sheep (9%) and domesticated
reindeer (8%). Topography and landscape influence
the presence of animals in areas near the rail, and
thus the accident risk. Important factors are
related to food, shelter, visibility and animal
corridors. Further, weather conditions such as
snow and rain affect where the animals are.


2 TRANSPORT TECHNOLOGY INNOVATION


2.1 Digitalization of the rail


Digital technology may be defined as the use of
ICT (computing capacity + telecommunication)
to gather, transfer and process data to provide the
communication backbone for all users of the
network (BearingPoint 2017).

Rail 4.0 may be considered a parallel concept to
Industry 4.0 (Stene et al 2017). The concept refers
to four industrial revolutions, starting at the end
of the 18th century with the introduction of
(1) mechanical manufacturing and continuing with
(2) mass production, (3) computers and automation
(also labelled the digital revolution) and (4) the
Internet. Four key components in Industry 4.0 are:
CPS (Cyber-Physical Systems), IoT (Internet of
Things), Smart Factory (e.g. traffic management
sites) and IoS (Internet of Services).

Further, Davidsson et al (2016) divide the digital
period into four waves: (1) the introduction of
computers in the 80s, (2) the Internet in the 90s,
which made it easy to access and share information,
(3) mobile Internet, which makes this possible
regardless of where you are, and (4) the Internet
of Things (IoT), in which, in addition to people,
different types of entities (vehicles, machinery)
may also have access to and share information.

In the rail sector, ERTMS (European Rail
Traffic Management System) is a common signalling
system that is to be introduced in all EU countries
by 2030. A standardized system will improve
the interoperability between networks and systems.
ERTMS includes ETCS (European Train Control
System), GSM-R (Global System for Mobile
Communication-Railway, which is radio communication
between train and signalling), and common European
traffic regulation. Common trans-border railway
transport allows trains to travel in any European
country which has ERTMS implemented both in the
rail infrastructure and in the train itself.

ERTMS has many similarities with CBTC
(Communication-Based Train Control), which
is the preferred signalling solution for automated
subways and metros. One difference is that
ERTMS is standardized, while CBTC is supplier
specific. CBTC is a signalling system making use
of telecommunication between train and track
equipment (wayside) for traffic management. By
determining the position of each train more
exactly, the system makes it possible to reduce
time intervals between trains. The main objective
is increased capacity.
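Why more exact train positions allow shorter intervals can be illustrated with a simplified headway calculation. This is a minimal sketch under stated assumptions, not a description of any CBTC product: it assumes constant deceleration, constant speed and no station dwell time, and all function names and numerical values are hypothetical.

```python
def braking_distance_m(speed_ms: float, decel_ms2: float) -> float:
    """Distance needed to stop from speed_ms at constant deceleration."""
    return speed_ms ** 2 / (2.0 * decel_ms2)


def headway_s(speed_ms: float, decel_ms2: float, train_length_m: float,
              margin_m: float, block_length_m: float = 0.0) -> float:
    """Minimum time separation between two trains running at the same speed.

    block_length_m = 0 models separation based on exact train positions;
    a positive value adds a fixed block that the follower must keep clear.
    """
    separation_m = (braking_distance_m(speed_ms, decel_ms2)
                    + block_length_m + train_length_m + margin_m)
    return separation_m / speed_ms


def trains_per_hour(headway_seconds: float) -> float:
    """Theoretical line capacity for a given headway."""
    return 3600.0 / headway_seconds


# Hypothetical metro-like values, chosen only for illustration.
SPEED = 20.0    # m/s (72 km/h)
DECEL = 1.0     # m/s^2, service braking
LENGTH = 100.0  # m, train length
MARGIN = 50.0   # m, safety margin

fixed = headway_s(SPEED, DECEL, LENGTH, MARGIN, block_length_m=400.0)
exact = headway_s(SPEED, DECEL, LENGTH, MARGIN)
print(f"fixed 400 m blocks: {fixed:.1f} s, {trains_per_hour(fixed):.0f} trains/h")
print(f"exact positions:    {exact:.1f} s, {trains_per_hour(exact):.0f} trains/h")
```

The absolute capacities are far higher than real lines achieve, because dwell times at stations and other operational margins are ignored; the point of the sketch is only the relative effect of removing the fixed block from the separation distance.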


2.2 Automatic Train Operation (ATO) 


Generally, autonomy is often related to attributes 
like self-government, freedom to act or function 
independently. For vehicles, autonomy is generally 
understood as the ability to make decisions about 
actions to take, e.g. course or speed, independent 
of a human operator. Levels of autonomy or auto- 
mation describe the successive shifting of respon- 
sibility from the driver to the vehicle. Different 
concepts are used to describe vehicle automation
in each transport mode/domain.

In addition to concepts used in each domain,
Ponsard et al (2017) present a comparative
overview of the responsibility between system and
human (driver/pilot) at different levels of
automation (see Table 2). In rail, the concept
Grades of Automation (GoA) is used. Notice the
double line in the table; it marks the shift in
responsibility from the driver to the system from
GoA-3.


Table 2. Comparison of automation levels at road, rail and air. Based on Ponsard et al (2017).

Rail: Grades of automation
GoA-0  On-sight train operation
GoA-1  Manual train operation. Automated train protection
GoA-2  Semi-automated train operation (STO). Automatic train operation (ATO)
=====  (double line)
GoA-3  Driverless train operation (DTO). Automated control (ATC). Some control by attendant (operating doors, emergencies)
GoA-4  Unattended train operation (UTO). Automated doors. Platform screen doors

Road: SAE levels
L0  No automation
L1  Driver assistance. Park assist/cruise control
L2  Partial automation. Traffic jam assist
L3  Conditional automation
L4  High automation. Highway traffic jam system
L5  Full automation (all situations)

Aircraft: Levels of automation
Level 1  Raw data, no automation at all
Level 2  Assistance. Flight director. Auto-throttle
Level 3  Tactical use. Autopilot
Level 4  Strategic. Flight management system
Beyond   Uninterrupted autopilot project (Boeing). Drones (unmanned)

Responsibility, in order of increasing automation
Human   Drives all time; monitors all time; ready to take back control; may not take back control; not required
System  Warn, protect; guide, assist; manage movements within limits; drives itself, may give back control; drives itself with graceful degradation; all time

Compared with road vehicles, rail and airplanes
have already achieved much higher levels of
automation (Ibid). However, this is only true for
some rail line types. Several fully autonomous
metros exist. The following sections go into more
detail.
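The rail grades in Table 2 can be written down as a small lookup, which makes the shift of responsibility at GoA-3 explicit. This is only an illustrative sketch: the class and function names are hypothetical, and the staffing rules are a simplified reading of the table and of the GoA descriptions in this paper.

```python
from enum import IntEnum


class GoA(IntEnum):
    """Grades of Automation for rail, numbered as in Table 2."""
    GOA0 = 0  # on-sight train operation
    GOA1 = 1  # manual train operation with automated train protection
    GOA2 = 2  # semi-automated train operation (STO)
    GOA3 = 3  # driverless train operation (DTO)
    GOA4 = 4  # unattended train operation (UTO)


def system_responsible(grade: GoA) -> bool:
    """True from GoA-3, where responsibility shifts from driver to system."""
    return grade >= GoA.GOA3


def driver_in_cab(grade: GoA) -> bool:
    """A driver is present up to and including GoA-2."""
    return grade <= GoA.GOA2


def staff_on_board(grade: GoA) -> bool:
    """GoA-3 still has an attendant on board; GoA-4 has no staff."""
    return grade <= GoA.GOA3


for grade in GoA:
    print(f"GoA-{int(grade)}: driver={driver_in_cab(grade)}, "
          f"staff={staff_on_board(grade)}, "
          f"system responsible={system_responsible(grade)}")
```

GoA-3 is the only grade in this sketch where staff are on board while the system carries the responsibility for driving, which is the situation the human factor questions in this paper are most concerned with.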


2.3 New technology on the main line railway 


The differences between signalling and control
systems across European railways are significant,
and until 1980, 14 national standards were in
practical use (Tao & Jing 2014). ETCS (European
Train Control System) is designed to replace these
incompatible safety systems, and the first version
was published in 2000.

As mentioned above, the GoA concept describes
levels of automation in rail. Figure 1 illustrates
the presence of a driver at the different grades.
Further, the operations are described at each
grade, i.e. management agents and actions to be
taken.

Implementation of ERTMS at GoA-1 implies
that signal information is shown on a panel inside
the cabin. The driver may use this as a replacement
for a traditional light signal outside at the
track. The signal tells whether the driver may drive
into the next block or not. At GoA-2 the train is
operated by automated control based on signals
from sensors along the track. In addition to being
responsible for monitoring the speed and position,
the driver may take control in case of any incident
or emergency.

A lot of the literature on transport autonomy
focuses on train automation, i.e. the interaction
and division of responsibility between vehicle and
driver (see Figure 2). The inner control loop is
responsible for executing the production plan (Rao
& Montigel 2017), and the focus is on driving
performance by providing driver assistance or
introducing train automation.


Figure 1. Levels of automation (Brodeo 2016). GoA1,
controlled manual driving: the driver manages all
aspects of driving the train manually. GoA2: the
train is driven using automated controls; the driver
is in charge of opening and closing doors,
authorises the start-up of the train, monitors the
track and handles unexpected situations. GoA3: a
person (not the driver) is on board to open and
close doors and handle incidents. GoA4: no staff
aboard; the control system manages all operations,
supervised remotely by the control centre.


Figure 2. Train automation: control of onboard train
operation.


Figure 3. Traffic management: control of traffic and
infrastructure. Panel labels: Traffic Management
Centre; Onboard Train Operation. (Based on Rao &
Montigel, 2017).

Rao (2015) presents a holistic approach to the 
main line railway. In addition to (1) train automa- 
tion, the focus is also on (2) traffic management, 
and the relationship between the two areas (see 
Figure 3). The outer control loop supervises the 
status of traffic and infrastructure, detects devia- 
tions and conflicts, and develops a new schedule 
(rescheduling) and transmits it to train operation. 

Automation depends on two supports: onboard
support, such as the Automatic Train Protection
(ATP) system, which provides overspeed protection
and keeps a safe headway between trains, and
infrastructure support, such as Automatic Train
Supervision (ATS), which provides dynamic traffic
regulation to avoid traffic conflicts (Rao et al
2016).

Even at GoA-4, trains are not autonomous
in the sense that no control is needed. Traffic
management focuses both on the outer control loop
(improving efficiency for the dispatcher by
providing resolutions for traffic conflicts) and
the inner loop (improving driver performance or
assisting the driver). Thus, reducing human failure
is central in both control loops.

ETCS (European Train Control System) is a
signalling, control and train protection system
used on the main railway lines. The train detection
equipment sends information on position, speed
limitations, signal status etc. (Venticinque et al
2014). Three levels define the use of the train
control system: communication from track to train
(level 1), continuous communication between the
train and the Traffic Management Centre (level 2),
and future implementation of moving block
technology (level 3). Several main rail tracks
operate at level 2, which includes two main
subsystems: (a) a ground system that collects and
transmits track data to (b) an onboard subsystem.

ETCS-2 uses digital radio transmission of sig- 
nals along the trackside (Tao & Jing 2014). With 
its onboard positioning equipment, the train can 
automatically report its exact position and direc- 
tion of travel at regular intervals, in addition to 
motion (stop/go) signals. Balises on the track detect
trains and send the position to the control centre 
(Venticinque et al 2014). Based on the position 
of all trains, the centre determines the new move- 
ment authority (MA) and sends it to the train. The 
onboard computer calculates its speed profile from 
the MA and the next braking point. This informa- 
tion is displayed to the driver. 
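How an onboard computer might derive a speed profile and a braking point from a movement authority can be sketched with a constant-deceleration model. This is a simplified illustration and not the ETCS braking curve model, which is considerably more detailed; the function names, the deceleration value and the line speed are assumptions made for the example.

```python
import math


def braking_point_m(speed_ms: float, decel_ms2: float) -> float:
    """Distance before the end of the MA at which braking must start."""
    return speed_ms ** 2 / (2.0 * decel_ms2)


def permitted_speed_ms(distance_to_end_m: float, line_speed_ms: float,
                       decel_ms2: float) -> float:
    """Highest speed from which the train can still stop within the MA."""
    if distance_to_end_m <= 0.0:
        return 0.0
    braking_limit = math.sqrt(2.0 * decel_ms2 * distance_to_end_m)
    return min(line_speed_ms, braking_limit)


# Hypothetical values: 180 km/h line speed, 0.5 m/s^2 service braking.
LINE_SPEED = 50.0  # m/s
DECEL = 0.5        # m/s^2

print(f"braking point: {braking_point_m(LINE_SPEED, DECEL):.0f} m before end of MA")
for distance in (5000.0, 2500.0, 900.0, 100.0, 0.0):
    speed = permitted_speed_ms(distance, LINE_SPEED, DECEL)
    print(f"{distance:6.0f} m to end of MA -> permitted {speed * 3.6:5.1f} km/h")
```

With these assumed values the train must start braking 2.5 km before the end of its authority, which illustrates why the movement authority has to be issued well ahead of the train at main line speeds.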


2.4 Autonomous metros 


In metro systems, automation refers to the process 
by which responsibility for operation management 
of trains is transferred from the driver to the train 
control system (UITP 2017). 

Experience with automated metros spans more than
30 years. The first lines were high capacity, but
today there is also a growing trend towards
mid-capacity trains. Between 2014 and 2015 Europe
will lead in terms of growth (Hernandez 2014). Asia
and Europe together hold 75% of the km of fully
automated metro lines.

For metros, many use the term CBTC as a synonym
for an automated driverless system. However, in its
most basic form the system provides automatic
protection (ATP) only. Fully automated systems
also include ATO (Automatic Train Operation)
and ATS (Automatic Train Supervision).

A semi-autonomous train (GoA-2) may manage
movements, but a human needs to be onboard to
start the train, open doors etc. (Lufkin 2015).
There are also trains that can operate completely
without humans. Only 6% of the world’s transit
rails operate those trains. Several cities are
aiming for automation.

There are 55 fully automated metro lines in
37 cities around the world (UITP 2016a). Fully
automated metro lines are defined as those metro
lines in which trains can be operated without staff
onboard; a defining characteristic is the absence
of a driver’s cabin on the train. This type of
operation is also known as Unattended Train
Operation (UTO), or Grade of Automation 4 in
standard IEC 62267.


2.5 Metro automation and safety 


The positive experience of decades of automated 
operation highlights one of the major elements to 
consider in this success story: safety (UITP 2016b).
There have been no significant accidents, in par- 
ticular none involving casualties, in any automated 
metro line in the world. 

Copenhagen Metro is one example of a sys- 
tem running fully automated, consisting of auto- 
matic train protection, operation and supervision. 
Although no serious accidents have occurred, inci- 
dents and accidents may point out some risk areas. 
The station area stands out in particular. The safety
of the platform/track interface is crucial for fully 
automated metro lines. 

The dominant safety measure is the installation of
platform screen doors (detection systems)
preventing persons and objects from falling on the
track. Currently, nearly 80% of stations in fully
automated metro lines in operation in the world are
equipped with such doors (UITP 2016).

Platform and track incidents aside, there has
only been one operational incident with UTO
systems; in Osaka at the end of the 80s a train did
not stop at the terminus and hit a bumper stop,
injuring a few dozen passengers (UITP 2017).


2.6 Open surroundings—challenging 
the main railway 


Since the main railway has a much more complicated
infrastructure, train automation is currently
applied mainly in metro railways (Rao et al 2016).

The open surroundings of current main rail
traffic challenge safety. Driverless trains
generally run on closed-off networks, i.e.
underground. Thus, no one can fall onto the tracks,
and there are no points where the trains cross with
others.


3 DISCUSSION 


3.1 Rail 4.0 — Opportunities and challenges? 


The purpose of intelligent systems is to make the
technologies in the human environment more
“people-friendly” (Tokody & Flammini 2017). This
means that infrastructural systems should be
sustainable, safe, economic and easy to use. The
development of intelligent, autonomous systems may
ensure sustainability and safety.

Future IoS (Internet of Services) in a rail
context will focus on offering services to the
general public or specific target groups such as
passengers. For example, a dynamic system for the
Copenhagen metro will automatically optimize train
frequency depending on passenger numbers and
changes in these numbers (Razeto & Corsanego 2017).
Likewise, in Switzerland, a new Trip Planner app
using voice control will let customers compare,
combine and book a journey with multiple modes of
transport including taxi (SWI 2017b).

Integrated mobility is an example of Smart 
Management. According to the Federal Railways 
in Switzerland, integrated mobility is a central 
field of innovation, and thus they are developing 
a door-to-door service to the general public (“SBB 
Green Class”). 

One example of utilizing IoT is goods transport
in Switzerland, where various sensors are being
installed in carriages. Instruments will measure
temperature, vibrations and the wagon’s position.
Customers may get information on the status,
location and arrival time of the goods. In Japan,
high-speed rail uses in-ground sensors in
quake-prone zones that activate the emergency
brakes seconds after the initial quake waves are
detected.

However, one of the future challenges is related
to telecommunication and traffic management. ITS
(Intelligent Transport Systems) includes telematics
and all types of communications in vehicles,
between vehicles and between vehicles and a fixed
location (Brodeo, 2016). As even more transport is
being digitalized, the radio frequencies used for
signalling systems may come into conflict or become
overloaded. Several EU countries already use
radio communication systems in the same range, all
on a limited duration licensing scheme.


3.2 Scenarios — Can automation prevent 
future rail accidents? 


For more than three decades, rail transport safety
has generally improved, presumably due to
a wide range of safety measures like automatic
train protection, improved signalling systems and
improved operational management. The question is
whether new technology may help prevent the most
serious and frequent accidents: (1) Rolling stock
in motion, (2) Level-crossings, (3) Collisions,
(4) Derailments and (5) Animals along the track.


1. Rolling stock in motion. The engine (rolling
stock) is heavy, and as such needs a long distance
to stop in case of an incident or unexpected
objects on the track. A driverless train needs
equipment that detects obstacles and stops the
train automatically. Rail research and innovation
in Europe include safety-related technology
development: automatic obstacle-detection systems
for railway vehicles, regenerative braking,
monitoring systems and satellite-based positioning
systems (Tokody & Flammini 2017).

However, passenger comfort is also highly
valued. An efficient and powerful braking system
may cause great discomfort and passenger injuries.
This is true for passenger trains, but should be
less of a problem for freight trains, even though
automated trains may still include some staff
onboard.

Even though capacity is the main objective
of CBTC systems used in automated metros,
maintaining safety is a major requirement. In
addition to distance, calculations cover speed,
curves and position, thus controlling acceleration,
deceleration and stops at stations. At lower
speeds, the distance between trains may be shorter.
A challenge is to calculate the block length for
maximum capacity while ensuring safety.


2. Level-crossings. Road user errors or violations
contribute to most fatal accidents, either in
failing to obey warnings or in failing to check
that no train is approaching before crossing (EU
ERA 2016). The authors point out countermeasures
like those for road accidents, particularly
education and enforcement. However, more autonomous
vehicles may also contribute to preventing rail
accidents.

Autonomous obstacle detection systems
may be beneficial for road and rail transport.
The German SMART project focuses on rail
freight and automation of railway cargo haul
(Shift2rail 2016), including development of (1)
a prototype of an autonomous obstacle detection
system and (2) a real-time marshalling yard
management system. The first system will use
night vision technologies, a multi stereo vision
system and a laser scanner to create a fusion
system for short (up to 20 m) and long range (up
to 1000 m) obstacle detection during day and
night operation, as well as during operation in
impaired visibility. The second system will
provide optimisation of available resources and
planning of marshalling operations.


3. Collisions. Related technology developments
which may contribute to accident prevention are
automatic obstacle-detection systems for railway
vehicles, traction transformers, energy storage
technologies, regenerative braking, monitoring
systems, satellite-based positioning systems, and
smart railway technologies (Tokody & Flammini 2017).

As mentioned in relation to rolling stock in
motion, passenger comfort is highly valued,
and unexpected intense braking may conflict with
its purpose as a safety measure. Acceleration and
deceleration are essentially limited by the
wellbeing and safety of the passengers (Gary 2016).


4. Derailments. One serious accident on a main
line using ERTMS was the derailment of a
high-speed train in Spain in 2013. Initial reports
cited driver error as the sole cause, but a deeper
study of the accident says the lack of a
functioning onboard ETCS system was a crucial
factor (Puente 2015). The high-speed train derailed
while travelling at 180 km/h (speed limit 80 km/h)
through a curve, resulting in the death of 79
people and injuring more than a hundred.

The line was equipped with ERTMS/ETCS
Level 1, except for the first and the last
kilometre, with a national signalling system used
as a backup. However, the onboard ETCS system
had been switched off in 2012 due to alleged
operating problems. The train driver should have
reduced the speed manually, but when the train
entered the low-speed section the driver was
speaking on the phone to staff at the train
company (Johnsen 2015).

If onboard ETCS had been working, the following
would have happened at the ETCS exit boundary
4 km before the curve where the accident occurred
(Puente 2015): (a) a text message announcing the
transition would have appeared on the Driver
Machine Interface (DMI) of the train, which was
travelling at 200 km/h, (b) the DMI would have
shown a message with a yellow flashing frame and
would have emitted an acoustic signal asking the
driver to acknowledge the transition by tapping on
the screen, and (c) if the driver failed to
acknowledge the message within 5 seconds, service
braking would have been applied continuously until
the driver had acknowledged the transition or the
train had stopped.
5. Animals along the track. Current
countermeasures include building fences around the
worst affected rail lines, removal of vegetation
and warning systems (Roaldsen et al. 2015).
The implemented strategies include installation
of warning signs for train drivers, night
patrols along the tracks and introducing staff
to assist animal crossings. Warning signs are
the most widespread accident prevention
measure (Gray 2015). Most are warnings to humans,
but acoustic signals that create fear in animals
(preventing them from approaching the tracks) have
also been tried. As an example, Norwegian reindeer
owners often warn about animals near the rail, so
that train drivers may reduce speed and thereby
the probability of incidents (Busengdal et al
2014). More general models have also been developed
to predict the occurrence of animals (Gundersen &
Andreassen 1998). Gray (2015) argues that manned
assistance along high-speed tracks across the world
is not a practical solution and that better
alternatives are needed. Deutsche Bahn Netz AG
and OptaSense provide one example of testing new
warning technology. Distributed Acoustic Sensing
(DAS) technology uses heat and motion
sensors in various areas of operation, including
to detect and alert train drivers of animals
approaching the tracks.
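The acknowledgement sequence described for the ETCS exit boundary in the derailment case above can be sketched as a small decision rule. This is a minimal illustration of the behaviour as reported in the text (Puente 2015), not an implementation of the ETCS specification; the class, field and function names are hypothetical.

```python
from dataclasses import dataclass

ACK_TIMEOUT_S = 5.0  # acknowledgement window quoted in the text


@dataclass
class TransitionState:
    seconds_since_prompt: float  # time since the DMI asked for acknowledgement
    acknowledged: bool           # driver has tapped the screen
    speed_kmh: float             # current train speed


def service_brake_demanded(state: TransitionState) -> bool:
    """Brake if the prompt is unacknowledged after the timeout.

    Braking is released once the driver acknowledges, and is no longer
    needed once the train has stopped.
    """
    if state.acknowledged:
        return False
    if state.speed_kmh <= 0.0:
        return False
    return state.seconds_since_prompt >= ACK_TIMEOUT_S


# A driver who never acknowledges: braking starts once 5 s have passed.
for t in (0.0, 3.0, 5.0, 10.0):
    state = TransitionState(seconds_since_prompt=t, acknowledged=False,
                            speed_kmh=200.0)
    print(f"t = {t:4.1f} s -> service brake: {service_brake_demanded(state)}")
```

The rule shows that the protection does not depend on the driver noticing the speed restriction; an inattentive driver leads to braking by default, which is the barrier that was missing when the onboard system was switched off.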


3.3 Will automation remove the human factor?


Automated systems are often designed to relieve 
humans of tasks that are repetitive. However, 
the more reliable the system, the more likely it is
that humans in charge will “switch off” and lose 
their concentration, implying greater likelihood 
of unexpected factors and a potential catastrophe 
(Vedantam 2009). Technology replacing or assist- 
ing the driver can become crutches. Accidents hap- 
pen when unusual events come together. No matter 
how clever designers of automated systems might 
be, they simply cannot account for every possible 
scenario, which is why it is so dangerous to elimi- 
nate human “interference.” 

The on-board personnel may be unprepared to
take control and drive manually. Regular training
exercises that require operators to turn off their
automated systems and run everything manually
are useful in retaining skills and alertness (Ibid).
In addition to helping detect system failure,
understanding how automated systems are designed to
work also allows operators to recognize when a
system is on the brink of failure.

As the system cannot cope with all situations,
the driver must be ready to resume operations
when instructed (Ponsard et al 2017). The authors
address issues such as situational awareness (the
system should make sure that the driver’s decisions
are based on the right mental picture), human
reaction capabilities (e.g. alarms may cause
confusion, a deficient view of the entire
situation, or panic), warning annoyance (trust in
the system in case of e.g. frequent/inappropriate
alarms) and task inversion (focus on monitoring
alarms and lack of attention to real world
situations). The authors claim that machine
learning techniques can play an important role in
making sure the driver and the system are operating
optimally together.


3.4 How to cope with unexpected scenarios? 


The concept of black swans refers to extremely
rare, catastrophic and unpredictable events that
have never been encountered before (Taleb 2007).
In principle, black swans cannot be anticipated.
However, the fact that a catastrophe was not
predicted does not mean that the event could not
have been prevented (Murphy 2016).
When new technology and autonomous transport are
implemented, black swans will occasionally occur.
We have to prepare both to cope with alternative
scenarios and to handle completely unexpected
situations accompanied by high stress and emotions.
Thus, in addition to training to identify clues of
and handle anomalous situations, training should
cover completely unexpected and catastrophic
events with an extremely high emotional state.
Experiential training may be necessary for coping
with unexpected events, especially to handle
personal high stress and to communicate with others
(Stene et al 2016).

Emergencies are events which happen suddenly
and may disrupt normal operations. Despite the
presence of an automated metro operation control
system, emergency management is still heavily
dependent upon the capabilities of dispatchers at
the management centre (Wang & Fang 2014). The
system may lose part of its automated safety
protection function. Thus, human error behaviours
during emergencies cannot be ignored. Competent
humans in transport control centres may represent 
a safety barrier, preventing incidents and accidents 
(Stene et al 2017). Machines may be excellent in 
detecting signs and signals, but humans have to 
evaluate and decide action based on the context 
and complexity of the actual situation. 


4 CONCLUSION 


4.1 Future automated trains and metros 


With more people living in urban areas than ever 
before, metro systems around the world will need 
to adapt (Lufkin 2015). The next generation of 
subways will develop from cities that are already 
at the cutting-edge, e.g. the super-fast speeds of 
Japan’s shinkansen or the punctual, low-cost driv- 
erless trains of Copenhagen. 

Self-driving trains are already being used in 
some countries, with varying degrees of autonomy. 
Autonomous driving on a complex rail system
with both passenger trains and freight trains is more
difficult than on a subway, but it is possible (Gary
2016). Several pilots are currently running. On
a test field in Germany, trains will be fitted with
cameras and other technologies to detect obstacles
on the track and stop the train if necessary. In the
AutoHaul project in Australia, a long-distance
railway system is intended to transport iron ore
from 15 mines.

Switzerland will test self-driving trains on a
main line without too many people, while still getting a
feel for how they would work in public (SWI 2017a).
The trains will be fitted with sensors that should 
detect objects on the rails and bring the train to 
a stop. If rolled out, a system to automate train 
traffic is assumed to increase passenger and freight 
capacity by 30%. 


4.2 The human factor in future rail systems 


Technology can improve safety, but there may
be examples where human interaction is necessary
(Gary 2016). The main purposes of implementing
a common European railway signalling
system are: (1) Maintaining a safe distance
between following trains on the same track, (2)
Safeguarding the movements at junctions, and
(3) Regulating the movements of trains according
to the service density and the speed required
(Abel, 2010).

The development relies too heavily on old inertia,
meaning too much emphasis on technology.
More attention should be paid to the organization, 
the passengers and the infrastructure (Malla 2014) 
and passenger evacuation procedures (Hernandez 
2014). 

Factors contributing to the likelihood of cata- 
strophic rail accidents are system complexity, a 
trend towards higher travel speed, growing infra- 
structure capacity constraints and the constant 
cost pressures on risk management activities 
(EU ERA 2017). Accident investigations should
continue to report on both the success and the failure of
systemic risk management methods, e.g. high-reliability
organisations, redundancy, and robust regulatory
and enforcement regimes.

Based on experiences from operating both automated
and conventional metro lines, one conclusion
is that the human factor is the key to the
success of an automated line (UITP 2016b). Rail
is far from being autonomous, in the sense of
being independent of a human operator. Humans
will still be a necessary resource to manage transport
and cope with unexpected incidents.


Revitalization of risk management in the Norwegian petroleum sector

ABSTRACT: The PSA has launched a risk management project. The purpose is to stimulate a revitalization
of risk management in the Norwegian petroleum sector. The project is ongoing, which means
that no conclusions, findings or reports are final yet. In this paper we describe the reasons for initiating
this project. We also describe the process, and the preliminary focus areas of the project. Three messages
to decision makers are given special attention. First, that risk management has to be an integral part of all
organizational processes, and part of decision making. Second, uncertainty is a main component in the
risk concept, leading to a strong focus on the need for robustness. Third, the management culture at all
times absolutely has to be characterized by a sincere wish to reduce risk. Finally, we describe our intention
to provide examples of practical risk management challenges, as well as some ideas for how these
challenges can be handled.


1 INTRODUCTION 


1.1 Background and the reasons to initiate this project


The PSA (Petroleum Safety Authority Norway) 
has launched a risk management project in order 
to stimulate a revitalization of the risk manage- 
ment in the Norwegian petroleum sector. We 
aim to contribute to preservation and further 
development of the industry’s risk management. 
We address key issues from the viewpoint of the 
PSA, based on input from the stakeholders in the 
industry. 

The primary target group for the project is man- 
agers and decision makers. Other groups that have 
important roles in safety management will hope- 
fully also benefit from the content. 

The need to preserve and further develop 
risk management is described in several sources. 
This has resulted in the PSA’s decision to initi- 
ate a project on risk management. The purpose 
of the project is both process oriented and goal 
oriented. The process shall give new and even 
more appropriate attention to risk management, 
and a memorandum on risk management shall be 
issued. 

Several events and reports have led to the initiation
of this project. The most relevant of these
are briefly presented below:

The annual report “Risk Level Norwegian 
Petroleum” (RNNP) attempts to contribute to a 
common perception of the risk level between the 
stakeholders. The RNNP indicates that there have 
been significant improvements in several areas 
over a number of years, but also opportunities for 
further improvement (PSA, 2017). 


The Norwegian government appointed an 
expert group to assess the current safety regime. In 
the “Engen report” (2013) it was concluded that 
the current regime is well-functioning. However, 
the report also indicated that there was a need for 
further development of risk management, espe- 
cially related to major accident risk. 

The Deepwater Horizon accident in 2010 showed 
that there is a need to review the principles and 
methods of risk management, and also how they 
are carried out in practice. Following this accident, 
both the supervisory authorities and the industry 
have taken a number of initiatives in the fields of 
risk management, barrier management and man- 
agement follow-up (PSA 2014, Norwegian Oil and 
Gas Association 2012). The PSA (2014) concludes 
that such initiatives must be subject to continuous 
development in order to achieve lasting effects. 

A working group at the Norwegian Oil and Gas 
Association (2015) has reviewed today’s practice to 
identify improvement areas within risk informed 
decision making. They pointed out that in many 
cases the information is not available soon enough, 
and identified possible paths for improvement. 
The Norwegian Oil and Gas Association also pub- 
lished a report “Black Swans” (2017), which points 
to the need for an expanded perspective on risk, 
where knowledge building, experience transfer and 
learning become even more central. 

The risk concept in the regulations was clarified in 
2015. Here, attention to uncertainty was made even 
more explicit, and the PSA issued a memorandum 
in 2016 describing what we aim to achieve regard- 
ing this topic. The memo from 2016 highlighted the 
risk term, while the project described in this paper 
highlights the management aspect of risk. 


1.2 Why is risk management essential?


Good risk management will enable the industry 
to find a reasonable balance between safety and 
economic profit. The operators on the Norwegian 
continental shelf have been given a great degree of 
freedom to find efficient ways of doing business as 
long as a certain safety level is achieved, through 
a “functional based regulatory framework”. This 
framework both encourages and requires a cer- 
tain mindset. The intention is to both encourage 
and require that the nature of the operations is 
taken into account, as well as local and operat- 
ing conditions. A prerequisite for a functional 
based regulation is that operators and duty hold- 
ers take responsibility and implement suitable 
risk management processes. There are therefore 
requirements for risk management and reduction 
processes in the HSE regulations, including con- 
tinuous improvement. 

Good risk management should also provide 
an opportunity to use resources in a way that has 
the best effect on safety and value creation. This 
implies that the industry must integrate risk man- 
agement in the decision making processes. Also, 
a proper understanding of how risk management 
can be performed well in practice is required. 


1.3 Limitations of this paper and project 


This paper should not be read as the viewpoint of 
the PSA. It is merely the viewpoint of the authors 
of the paper. 

“Risk management” is here limited to the PSA’s 
authority area, and is based on major accident 
risk. It may also be relevant for working environ- 
ment, natural environment, health, security and so 
on. The project is not intended to introduce any 
new requirements. 

We cover only selected themes, and the project 
should be seen in conjunction with other areas 
highlighted by the PSA. See for instance the Bar- 
rier memo, the “Book about learning” and the 
“HSE and Culture” pamphlet. 


2 THE PROCESS 


Members of the PSA’s Risk management group 
have been running the project. In the planning 
phase of the project, some keys to success became 
clear for such an ambitious project. There was a 
need for discussions with the stakeholders, as well 
as professional experts. Such discussions between 
the stakeholders (regulators, trade unions and 
industry associations) are already a formalized 
part of the Norwegian petroleum safety regime. 
See for instance the Safety Forum (PSA, 2017). 
Therefore, in the early stages we discussed the
project with the tripartite Safety Forum and the
tripartite Regulatory Forum.

Furthermore, we emphasize that risk manage- 
ment is not a stand-alone activity for risk manage- 
ment experts. Instead, the various departments 
need to use risk management as an integral part 
of their activities. Therefore, we have taken great 
care making sure that the risk management project 
is not run solely by risk management experts. We 
are consulting with all the PSA departments to 
get their input on what is the key to successful risk 
management. This idea of consulting with dif- 
ferent technical disciplines has also been applied 
while discussing with the industry. 

We have had several meetings with operating 
companies, contractors, labor unions and risk 
management experts. In these meetings we have 
tried to make sure that the discussions are driven 
by the needs of the decision makers and managers 
in the industry, rather than the opinions of the risk 
management experts. 

Additionally, we have had several meetings with 
the other relevant safety regulators in the Norwe- 
gian petroleum sector, the Norwegian Board of 
Health Supervision and the Norwegian Environ- 
ment Agency. 

Typically, the discussions have centered on top- 
ics such as: 


— What is necessary for a well-functioning risk 
management? 

— Which tools are useful? 

— What are the conditions and principles that are 
necessary for risk to be managed as an integral 
part of the activities instead of a stand-alone 
activity after the main decisions are made? 

— How can we make sure that there is sufficient 
agreement between short-term and long-term 
objectives in various areas (HSE, profits, project 
progress, departments), at various levels and 
between various participants in the activities? 

— What are the main challenges? 

— What are the necessary criteria of success? 


The feedback from these meetings has been
very valuable in gaining an understanding of challenges
and good practices. However, we need to
emphasize that the conclusions in the project will 
be our conclusions, and not merely a summary of 
the statements of the parties. 

A well-known challenge for any organization 
when dealing with major accident risk, is to make 
sure that even if you have a strong track record on 
safety you still must remain vigilant. This is very 
relevant for our project, since the stakeholders to 
a large degree have presented how they act when 
they are performing well. Hence, the intention is 
that this project can help in passing on such knowl- 
edge of best practice. Further, it is our intention 
that the project also serves as a reminder that one
major accident is one too many.

In the span of the project, our focus area has 
shifted slightly based on the meetings and discus- 
sions we have had. The project is not completed, so 
instead of any final conclusions we can only give 
an overview of how the project has developed so 
far. When the project started, the following main 
themes were identified: 


1. Risk management as an integral part of the 
management processes 

2. Risk reduction processes 

3. Closing the “control loop” by learning 

4. Risk analysis and acceptance criteria


We identified theme number 1 above as the main
theme, while there was no “ranking order” between
the three other themes. At the time of writing,
theme number 1 is still the main theme. As this
paper will show, the others are still included but 
have been somewhat restructured. 

At the time of writing this paper, we have identi- 
fied three areas that are of importance to our target 
audience, where we believe that further develop- 
ment is necessary. In the next section we describe 
these three areas. We point out that being excel- 
lent in one of these areas is not sufficient. Instead, 
these three areas must complement each other. 


3 PRELIMINARY FOCUS AREAS IN THE PROJECT


3.1 Risk informed management 


ISO 31000 describes risk management as “coordi- 
nated activities to direct and control an organiza- 
tion with regard to risk”. It is further stressed that 
risk management should be a holistic and inte- 
grated part of all processes in the organization. 

The reality that we too often see is that risk
analysis is done after the decisions have been taken
and is used to justify them, in contrast to the
“direct and control” role described in ISO 31000.
This has been recognized by many organizations,
and e.g. the Norwegian Oil & Gas Producers
Organization (NOROG) has produced several reports
(NOROG 2015) on the importance of having risk
information in time for the decisions and on how
to produce the risk information in time.

Integrated risk management in ISO 31000
implies that risk management is an integrated part
of the organization’s processes and activities. It
further implies that risk management is a part of the
management’s responsibility.

Holistic risk management in ISO 31000 implies
that risk management is able to take into
account how a decision will affect and interact with
other related areas of the organization’s activity,
and how they relate as a whole. This implies that
there is a need to collate the risk information from
all related parts of the organization and balance
these in the decision. It is therefore essential that
risk management becomes aligned and integrated
with the strategic planning processes. Further, it
should enable the involvement of all stakeholders,
internal and external, capturing their opinions on
the company’s operations and critical issues. Risk
management according to ISO 31000 should equip
upper management with tools that help them make better
and more informed decisions.

The Petroleum Safety Authority Norway does
see examples of what we perceive as good risk
management. In these cases, the industry describes
processes where appropriate risk information is
available when the decisions are being made. The
decision makers take an active role in obtaining
relevant information before making decisions, and
they have fruitful discussions about the acquired
information and the strength of knowledge behind
it. In essence, they live according to the proverb
that “Doubt is the key to knowledge”. Finally,
they take the strength of their knowledge into
account in the decisions.

These examples of holistic and integrated risk
management in accordance with ISO 31000 are what
we in this project and paper call “Risk informed
management”.

In risk informed management, one must start
by understanding the organization’s activity as
a whole and the goal of the specific activity and
decision (what is to be delivered). Here it is important
to understand the context in which this is
done and what requirements are set for the activity.
Furthermore, one needs to identify risks, the possible
reasons why things may go wrong, and the
consequences of this. The decision on how to perform
the activity must consider this understanding.
An evaluation of the robustness of these plans is
needed, in case of disturbances and changes. Obviously,
the decision makers need to be well qualified
to make the decision.

The execution must be carried out as planned 
and it is important that those who perform the 
activity have the necessary understanding of the 
context as well as the procedures for the task. It 
is also important that they have understood the 
basis for decisions, consequences and uncertainties 
so that they can react properly if disturbances and 
changes occur. 

To improve, it is important that one assesses 
the delivery and the execution. Any learning from 
this evaluation should be communicated to the 
organization. 


3.2 Uncertainty and robustness


Before making decisions, the responsible party shall 
ensure that issues related to health, safety and the 
environment are adequately considered. The basis 
for decision-making must have the necessary quality, 
where different alternatives and consequences have 
been studied and relevant experts, departments and 
user groups have been involved. Consideration of
uncertainty has to be included in this decision process.

How can uncertainty be considered adequately?
This is a research topic that has been given quite
a bit of attention in the last few years, for instance
at previous ESREL conferences. We will not
go into the details of the possible methods here, but we
stress the following point that has been made in
the literature.

Say that a decision maker concludes that it is 
unlikely that a particular event will occur within 
a given time frame. Such statements are typically 
based on data, information, testing, analysis, argu- 
mentation, theory, models, assumptions, discussions 
with stakeholders and more. These statements may 
be more or less strong, and taking uncertainty into 
account includes clarifying what this knowledge 
consists of and how strong it is. If the knowledge 
is weak, the decision will have a weak foundation. 

Obviously, if the knowledge is weak, corrective
measures are necessary. In our project we stress the
importance of robustness as a measure
to provide extra margins, in addition to similar concepts
like resilience and the cautionary approach.

Requirements for robustness are used because 
deviations, unexpected changes and surprises can 
occur. Robustness must be emphasized especially 
for events with high potential. 

As unforeseen events may occur, robustness is
needed to ensure that the business can be operated
in a safe manner. This implies that the organization
is still able to operate in the event of disturbances
slightly beyond the stresses it is expected to be
exposed to. Surprises, unexpected changes and
disturbances can happen, and events that are considered
unlikely can still happen if the assessments are
based on assumptions that are incorrect.

A high degree of uncertainty must lead to a 
cautionary approach, for instance through barrier 
requirements and robust solutions or application 
of principles such as reducing risk as far as possible 
without significant disproportion between cost and 
effect. If there is insufficient knowledge concerning 
the effects of a preventive measure, further measures 
should be taken according to the HSE regulations. 


3.3 Management and culture


How can risk management become a good tool
for reducing risk, rather than mere documentation
and reporting of risk? The attention and priority
of the management greatly contribute to the culture.
Risk management processes and systems are
important tools to achieve good risk management,
but humans and organizations are preconditions
for success. The management culture always must
be characterized by a sincere wish to reduce risk.

Further, knowledge, involvement and commitment
must be core values that shape the decision-making
processes in every aspect of the organization,
even in times of pressure on the industry such as
delays, changes and increased cost consciousness.

Based on Reason (1997) we describe a good
HSE culture as just, reporting, learning and flexible.
A belief needs to be established that you
will not be met by sanctions when you report an
issue. This trust is easily eroded, and is difficult to
reestablish. A key to earning and keeping the trust
is to display consistency between HSE promises
and action, especially when finding the balance
between safety and other priorities. Here, communication
skills are important, as well as establishing
dialogue with all stakeholders instead of solely
managing by commanding. This brings trust,
commitment and valuable knowledge into the risk
management process.


3.4 Examples of risk management challenges 


In the coming memorandum from this project, the 
PSA also intends to highlight some practical risk 
management challenges. This should include some 
ideas for how these challenges can be handled. 

However, the PSA is careful to point out that 
the risk owner is responsible for their risk. Thus, it 
is neither possible nor advisable that the authori- 
ties give detailed requirements or solutions that 
can be interpreted as requirements. Therefore, the 
PSA is careful not to spell out detailed solutions, 
but instead we intend to nudge the industry in a 
certain direction when we see such a need. 

In this paper we have highlighted the impor- 
tance of considering uncertainties before making 
decisions. It has been discussed in the literature 
how uncertainties can be handled in practice when 
using Risk Acceptance Criteria and Risk Reduction
Processes. In summary, we note that the Black
Swans report describes that Risk Acceptance Criteria
are fulfilled if:


— The calculations are within the criteria while the 
knowledge is strong, or 

— The calculations are within the criteria with a 
large margin, while the knowledge is not weak. 
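
The two conditions above can be read as a simple decision rule. The following Python sketch is only an illustration under stated assumptions: the three-level knowledge scale, the function name and the numeric large_margin factor are hypothetical, and are taken neither from the Black Swans report nor from the regulations.

```python
from enum import Enum


class Knowledge(Enum):
    """Hypothetical three-level scale for the strength of knowledge."""
    WEAK = 1
    MEDIUM = 2
    STRONG = 3


def acceptance_criteria_fulfilled(calculated_risk, criterion, knowledge,
                                  large_margin=10.0):
    """Return True if the risk acceptance criterion is considered fulfilled.

    calculated_risk and criterion must use the same unit (for example an
    annual frequency). large_margin is an assumed factor: the calculated
    risk must be at least this many times lower than the criterion to
    count as being within the criterion "with a large margin".
    """
    within = calculated_risk <= criterion
    within_large_margin = calculated_risk * large_margin <= criterion

    # Condition 1: within the criterion while the knowledge is strong.
    if within and knowledge is Knowledge.STRONG:
        return True
    # Condition 2: within the criterion with a large margin,
    # while the knowledge is not weak.
    if within_large_margin and knowledge is not Knowledge.WEAK:
        return True
    return False
```

In practice, judging the strength of knowledge is a qualitative assessment made by the decision makers, so a rule like this can at most support the discussion, not replace it.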


Similarly, the required risk reduction processes 
beyond the Acceptance Criteria (ALARP) might 
incorporate the effect of uncertainties in a manner 
where: 


— If the cost is low, the proposal is implemented.

— If the cost does not present a significant dis- 
proportion to the risk reduction, the proposal is 
implemented. 

— If there are other important aspects, the proposal
is considered for implementation. Other aspects
can be significant uncertainty, need for robustness
and barriers, and more.

— Furthermore, specific regulatory requirements
and established minimum solutions in the
industry cannot be set aside based on
arguments about risk informed cost benefit
assessments.
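
The screening steps in this list can be sketched as a small function. This is a minimal illustration, not a method prescribed by the PSA or by the Black Swans report: the parameter names, the monetary representation of risk reduction, the low_cost_limit threshold and the disproportion_factor are all assumptions introduced for the example.

```python
def alarp_decision(cost, risk_reduction_benefit, low_cost_limit,
                   disproportion_factor, other_aspects=False,
                   required_by_regulation=False):
    """Screen a proposed risk reducing measure.

    cost and risk_reduction_benefit are assumed to be expressed in the
    same monetary unit. Returns "implement", "consider" or "not required".
    """
    # Specific regulatory requirements and established minimum solutions
    # cannot be set aside by cost benefit arguments.
    if required_by_regulation:
        return "implement"
    # Low-cost measures are implemented without further analysis.
    if cost <= low_cost_limit:
        return "implement"
    # Implement unless the cost is significantly disproportionate
    # to the risk reduction obtained.
    if cost <= disproportion_factor * risk_reduction_benefit:
        return "implement"
    # Other aspects (significant uncertainty, need for robustness and
    # barriers) mean the measure is flagged for further consideration.
    if other_aspects:
        return "consider"
    return "not required"
```

For example, alarp_decision(100.0, 5.0, low_cost_limit=2.0, disproportion_factor=3.0, other_aspects=True) returns "consider": the cost is disproportionate on paper, but significant uncertainty keeps the measure on the table.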


We note that these bullet points are well known 
in the literature and in the industry, but that uncer- 
tainties are not always systematically considered. 

When successful, the decision makers are actively
involved and are observant as to whether the
knowledge base is strong or if there is a need to
obtain more information. In good examples, statistics
constitute a necessary part of the decision-making
basis, in addition to other sources of
information.

Other typical management challenges include:


— To balance the different needs. For instance, 
regarding ventilation, where working environ- 
ment and avoiding gas collection should be bal- 
anced. Other examples include keeping aware of 
the possibility of conflicts between short-term 
and long-term objectives in various areas (HSE, 
profits, project progress, departments), at vari- 
ous levels and between various participants in 
the activities. 

— The uncertainties we have discussed above lead
us to conclude that risk assessments can often
be “massaged” to provide a justification of 
decisions that are already made. That is not the 
intention behind the regulation, as it does not 
provide any incentive to improve safety. 

— Placing great emphasis on identifying all relevant 
risk factors. Quality is ensured by considering 
specific aspects, local conditions and opera- 
tional conditions. Involvement, local knowledge 
and broad knowledge are keys to success. The 
opposite would be generic hazard identifica- 
tions. Generic lists of hazard situations can be 
used as a starting point for brainstorming, but
will not be sufficient to provide an overall and 
nuanced basis for decision making. 

— To execute according to the decisions, with a
good understanding of the risk while vigilantly
detecting possible changes and deviations. Personnel
at the sharp end need sufficient knowledge
of the task and the risk picture, and need to
understand how the activity is planned and
which surrounding factors have to be taken
into account. Furthermore, major accident risk
is a natural part of the various forms of risk
visualization, such as Safe Job Analyses and Work
permits.

— To avoid plans being made without involving
the personnel at the sharp end, and to ensure
that the reasoning behind the decisions is
communicated. If not, any adjustments that need to
be made at the sharp end might not be considered
and handled as a change. Another challenge arises if
the work is performed rigidly according
to the plan, but without understanding of the
risk picture and possible changes in conditions.
Rigid processes and procedures can also lead to
a quietly accepted practice of non-compliance.

— Transfer of experience might be the most dif- 
ficult part of the management loop, especially 
to make sure that the experience is accessible to 
anyone who needs the information later. In our 
“Book about learning” (PSA 2013), we empha- 
size that organizational learning is a prerequisite 
for safety. The book can inspire actors to find 
good learning solutions themselves. The report 
“Black Swans” (Norwegian Oil and Gas 2017) 
places great emphasis on better ways to learn, 
and can help actors find practical ways to close 
the control loop. 

— The regulations require that one should improve 
safety by continuously identifying where it is 
needed. A typical pitfall is a “good enough” 
philosophy without being conscious about when 
and where improvement is needed. 


4 CONCLUSIONS 


We have described the need for well-functioning 
risk management to create a reasonable balance 
between safety and economic profit. Addition- 
ally we have described that several reports indicate 
a need to revitalize the risk management of the 
Norwegian petroleum sector. 

Our project aims to contribute to these goals. 
The project is ongoing, and at the time of writing 
this paper our draft project report is subject to a 
hearing at several government agencies. 

We have involved the stakeholders in the project, 
and aim to continue this approach in the coming 
phases. 

The preliminary message is: 


— There is a plethora of risk management tools 
available. These may seem intricate and difficult 
to use for management, which can create confu- 
sion and apathy. However, we note that all good 
tools are based on the traditional “Control loop” 
(Plan-Do-Check-Act) which is the standard way 
to operate for managers. 

— Risk management is the responsibility of the 
decision makers, meaning that the role of a risk 
management department is to give advice to
provide the best possible knowledge background 
for the decision makers. 

— Risk management must be an integral part of 
decision making, as described in ISO 31000. 
When “risk management” is performed sepa- 
rately from (and after) other processes, the inten- 
tions behind the regulatory regime will not be 
fulfilled. Instead, we note that when the stake- 
holders describe successful risk management, 
they describe a standard control loop where the 
necessary information about risk is available at 
the right time. 

— Relevant information needs to be available at the 
time of decision-making. This can be challeng- 
ing. We observe that the industry is currently 
focused on this topic, and possible improved 
methods are examined. 

— Uncertainties need to be acknowledged when
making decisions regarding major accident risk.
We note that success is contingent on decision
makers who take an active role in requiring relevant
information, including questioning the
strength of knowledge. Due to uncertainties,
there is a need for a cautionary approach where
the industry ensures sufficiently robust solutions.

— Decision makers need to be aware of the poten- 
tial for conflicting aims. Examples include 
project progress, safety and department aims. 
To ensure balance between such various goals, a 
holistic approach is necessary. 

— Finally, the framework described above is depend- 
ent on a management culture that at all times is 
characterized by a sincere wish to reduce risk. 


In essence, the main findings so far from the
project are that:


1. risk has to be managed by risk informed
management,

2. decisions need to take uncertainty into account,
in the spirit of the proverb “Doubt
is the key to knowledge”,

3. robustness is a necessity and a prerequisite for
providing margins for deviations, unexpected
changes and surprises,

4. safety concerned leadership and safety culture
are the foundation of risk management.


ABSTRACT: A system called SMARTD ID CARD is an employee location and working time registration
system. Among other things, the data obtained in this system allow one to monitor whether employees
constantly use personal protective equipment. The research aimed at verifying the effectiveness of
detecting whether a worker used personal protection means. To this end, a robot simulated some human
movements. A comparison between the Pearson, Kendall’s Tau, Spearman’s and BiSerial correlation detection
algorithms has been performed. The results obtained proved that the system was highly effective in detecting
the use of personal protective equipment. The system was more effective at detecting that the personal
protective equipment had not been used, which is important from the safety point of view. Additionally,
the detection time of the lack of personal protection was relatively short, about 2 minutes. Based
on the results obtained, the authors recommend the use of the SPEARMAN or BISERIAL algorithm.


1 INTRODUCTION 


With the development of new technologies, more
and more efficient machines have been designed.
In particular, the development of Information
Technology (IT) and telecommunication techniques
has made it possible to design intelligent
manufacturing systems. In such systems the human
contribution to the production process has been
significantly reduced by automation of the entire
process. The computational capacity of these
systems is applied in the first place to monitoring
the manufacturing process. However, their use in
monitoring the safety of system operators has
increased significantly.

In modern systems, sensor devices are increasingly
used to implement safety functions. Such systems
detect the position of special labels, which in turn
allows for localisation of the objects on which
these labels have been installed
(Dzwiarek 2015, Reiner et al. 2013). An example
of such equipment is the Real Time Location System
(RTLS), which detects the position of the label
(Gomez et al. 2013, Guyoun et al. 2013). RTLS
systems employ a variety of location technologies,
e.g., active identification using radio waves, optical
location and ultrasonic location. A sample application
of a location system to improve safety, by
monitoring employee activities, is shown in
(Seppa 2012). That is especially useful in detecting
situations requiring medical attention
(Sachs et al. 2014).

At the same time, electronic systems for working
time registration are developing rapidly. Currently,
the most common devices contain radio-frequency
identification (RFID) (Roberts 2006), bar code or
magnetic strip card readers, in which the working
time is registered after the card has been brought
close to or inserted into the reader. The major
drawback of these systems is that they display
the information only at the moment of identification
(card approach or insertion). Afterwards, the user
has no access to the working time registration
data until the card is used again.


2 MATERIALS AND METHODS 


2.1 SMART ID CARD system 


The system developed at TENVIRK Sp. z o.o., called
SMART ID CARD, combines the advantages of location
and time recording systems. The system consists of:


• control units forming a radio Mesh Topology
Network (MESH) (Huang et al. 2008) and
connected with the cloud computing service called
“eAttendance” via Ethernet or General Packet
Radio Service (GPRS) (Walke et al. 1991),

• smart cards which can be pinned to the worker’s
clothing, equipped with a set of sensors that
may include, at most, the following:

— two independent accelerometers,
— magnetometer,
— moisture sensor,
— pressure sensor,
— ambient light sensor,
— temperature sensor,
— measurement of battery voltage,

• system software of class Enterprise Resource
Planning/Customer Relationship Management
(ERP/CRM) (Meller 2005) in the cloud, which
collects real time data.


The computer control system is called “eAttendance”
and is available in the cloud in the Software as a Service
(SaaS) model (Haolong et al. 2015). The data from
sensors are collected in the radio MESH network
(Fig. 1) and submitted to the eAttendance cloud via
internet links. They are then processed and stored.

The smart cards are located in the MESH network
based on the nearest router address. From the
card’s Link Quality Indicator (LQI) (Qinab et al.
2013) and Received Signal Strength Indication
(RSSI) (Yanga et al. 2015) values, the system
calculates the theoretical distance from the nearest
router. Knowing the location of the router, one can
specify the card position. Additional localisation
functions employ frame time-of-flight measurement
or phase shift. Location accuracy is estimated as 1 m.
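The paper does not state which propagation model the system uses to turn RSSI into a distance. As a minimal sketch, the widely used log-distance path-loss model is assumed here; the function name, the reference RSSI at 1 m and the path-loss exponent are all hypothetical placeholders that would need calibration on site.

```python
def rssi_to_distance(rssi_dbm, rssi_at_ref_dbm=-45.0,
                     path_loss_exponent=2.5, ref_distance_m=1.0):
    """Estimate the card-to-router distance from one RSSI reading.

    Assumed model (not taken from the paper): log-distance path loss,
        RSSI(d) = RSSI(d0) - 10 * n * log10(d / d0)
    solved for d. All default parameter values are illustrative only.
    """
    exponent = (rssi_at_ref_dbm - rssi_dbm) / (10.0 * path_loss_exponent)
    return ref_distance_m * 10 ** exponent


# A weaker signal maps to a larger estimated distance.
for reading in (-45.0, -60.0, -70.0):
    print(reading, "dBm ->", round(rssi_to_distance(reading), 2), "m")
```

In practice a single RSSI reading is noisy, so an estimate of this kind would normally be averaged over several frames before it is combined with the known router position.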

The system provides: 


• automatic recording of all entries and exits,
with no need for any action by either employee
or employer,

• automatic detection and signalling of all misuses,
such as fake presence at work of an absent
employee, illegitimate absence from the workstation,
improper exchanging of cards etc.,

• monitoring of presence at work both inside
and outside the building, e.g. on a construction
site, with no need to install control
devices or construct additional entrance gates,

• prompt, real-time information about the working
time, delivered not only to employers but
also to employees,

• additional information about detected accidents
at work, i.e., fall or impact,

• detection of whether PPE is used,

• automatic registration of the working time
devoted to specific tasks, projects and clients,

• prompt production of corporate employee
cards,

• significant reduction of costs as compared to
classic systems.


Figure 1. Sample visualisation of the network in
the eAttendance cloud.


2.2 Research aims and methodology 


Among other things, the data from the
SMART ID CARD system make it possible to monitor
whether employees use Personal Protective Equipment
(PPE) constantly. To this end, the correlation
between signals representing the movements of cards
worn by workers and signals from cards
installed at the centre of the PPE (e.g., helmet, glove,
mask, glasses, etc.) was examined.

The purpose of the research was to verify the
effectiveness of the detection mechanisms
used in the SMART ID CARD system for showing
whether PPE was used or not.

Specific objectives were:


1. To check which correlation algorithms are most
suitable for detecting the use of PPE.

2. To find how the parameters determining the
correlation affect the detection efficiency.


The research was conducted in the Laboratory for
Safety Techniques in Control Systems in accordance
with the research methodology developed within the
framework of the project “Principles of the use of
monitoring techniques for employee localisation with
the use of ultra-wideband (UWB) communication to
ensure machine safety” (Dzwiarek 2015). To simulate
human movements, a portal robot was used (Fig. 2).


Figure 2. Portal robot used in human movement
simulation.

Figure 3. Robot’s head movement trajectory.

The displacements of the robot’s working head
represented human movements, as well as those of
the PPE. The card representing the worker was mounted
on the robot’s head. PPE use was simulated
through a flexible connection of the second card,
so that the movements of both cards were synchronised,
but not identical. The robot head was
moved between consecutive points at a speed of
0.6 m/s. At each point of the trajectory the following
sequence of movements was performed:


• 3 bottom-up movements,

• a movement representing inactivity (a distance of
0.15 m along the Y axis, travelled at a speed of
0.01 m/s),

• 2 bottom-up movements,

• a return movement representing inactivity.


At the last point of the trajectory the flexible
connection of the labels was broken and the smart card fell
down (representing the PPE being taken off). The
trajectory of robot motion, which consisted of twenty
randomly selected points, is shown in Fig. 3. The
experiment was repeated 3 times with a connection
length of 0.10 m, 0.50 m and 1 m, respectively.


3 RESULTS 


In the course of the experiment the signals from the
accelerometers situated in the cards were registered.
These were signals indicating:


• moving times,
• number of movements,
• motion sensitivity.


The data were stored in the eAttendance cloud.
Sample data recorded during one experiment in
the eAttendance cloud are shown in Fig. 4. The
vertical lines represent, respectively:


• the start of the experiment,
• the connection fall,
• the end of the experiment.


Then, the data obtained were analysed.
During the analysis the effectiveness of the following
correlation algorithms was compared:


• the Pearson product-moment correlation
coefficient (Buda & Jarynowski 2004),

• Spearman's rank correlation coefficient, which
is one of the non-parametric measures of monotonic
statistical dependence between random
variables (Kowalczyk 2015),

• Kendall's Tau coefficient (Kowalczyk 2015),

• the BiSerial correlation coefficient
(Linacre 2008).
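For orientation, three of the four coefficients listed above can be written in a few lines of plain Python. This is a textbook sketch, not the implementation used in the SMART ID CARD system: the function names are illustrative, Kendall's coefficient is shown in its simplest tau-a form (tied pairs contribute zero), and the BiSerial variant is omitted because the paper does not describe how the signals were dichotomised.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation; 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)
```

The rank-based coefficients react only to the ordering of the samples, so a monotone but non-linear relation between the two card signals still scores 1, whereas Pearson's coefficient drops below 1.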


The analysis was based on the data registered
by the SMART ID CARD system and written in
the form of the variables “quantity of movement” (IR),
“time of movement” (CR) and “movement sensitivity”
(CzR). During the tests the signals were compared
with each other using different functions of these
variables, from which the value of the correlation
coefficient was determined:


1. The product of movement sensitivity and time
of movement: CzR × CR.

2. The product of movement sensitivity and the
logarithm of time of movement: CzR × log(CR).

3. The product of the logarithm of movement
sensitivity and the logarithm of time of movement:
log(CzR) × log(CR).

4. The product of movement sensitivity and
quantity of movement: CzR × IR.

5. The product of movement sensitivity and the
ratio of time of movement to quantity of
movement: CzR × CR/IR.
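The five candidate functions can be evaluated for one sample as sketched below. The function name and dictionary keys are illustrative; the base of the logarithm is not given in the paper, so the natural logarithm is assumed, and the inputs are assumed to be strictly positive so that the logarithms and the ratio are defined.

```python
from math import log


def correlated_variables(czr, cr, ir):
    """Return the five candidate correlated variables for one sample.

    czr: movement sensitivity, cr: time of movement, ir: quantity of movement.
    Assumptions: natural logarithm, all inputs strictly positive.
    """
    return {
        "CzR*CR": czr * cr,
        "CzR*log(CR)": czr * log(cr),
        "log(CzR)*log(CR)": log(czr) * log(cr),
        "CzR*IR": czr * ir,
        "CzR*CR/IR": czr * cr / ir,
    }


# One made-up sample, used only to show the call.
print(correlated_variables(czr=2.0, cr=1.0, ir=4.0))
```

Applying one of these functions to every sample of both cards yields the two series that are then passed to the chosen correlation algorithm.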


To find the optimal values the Authors per- 
formed calculations for all control parameters: 


a. width of the correlation window, 

b. correlation coefficient threshold to identify the 
current situation, 

c. type of correlation algorithm, 

d. function of the measured data used as corre- 
lated variable. 


During the experiment, two different phases can
be distinguished:


1. Periods during which both cards were
attached to the robot head, i.e., their movements
were correlated.

2. Periods during which one card remained
motionless, so the movements of the cards were not
correlated.


Figure 4. Example of experimental data obtained (Pearson correlation coefficient).

In the first case, the correlation coefficients should
be as high as possible (close to unity), while in the
second case no correlation should be observed, and
the value of the correlation coefficient should be
close to zero.

To check that the detection of correlation loss
was effective, the following two probabilities
were determined:


• the probability that before the card release the
correlation coefficient was not less than the
assumed value 0.4,

• the probability that after the card release the
correlation coefficient falls to zero after a certain
period of time.
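The role of the correlation window width and of the threshold can be illustrated with a short sliding-window sketch. It is an assumption-laden illustration rather than the system's algorithm: Pearson's coefficient stands in for whichever algorithm is selected, the signals are synthetic, and the helper names are hypothetical. Only the threshold of 0.4 comes from the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; 0.0 when one window is constant (card at rest)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def windowed_correlation(a, b, width):
    """Correlation of the last `width` samples, for every full window."""
    return [pearson(a[i - width:i], b[i - width:i])
            for i in range(width, len(a) + 1)]


def fraction_at_least(coefficients, threshold):
    """Share of windows whose coefficient is not less than the threshold."""
    return sum(c >= threshold for c in coefficients) / len(coefficients)


def detection_delay(a, b, width, threshold, release_index):
    """Samples elapsed after the release until a window drops below threshold."""
    for end in range(max(width, release_index), len(a) + 1):
        if pearson(a[end - width:end], b[end - width:end]) < threshold:
            return end - release_index
    return None  # loss of correlation never detected


# Synthetic signals: the second card follows the first, then lies still.
worker = [float(i % 5) for i in range(60)]
ppe = [2.0 * v + 1.0 for v in worker[:40]] + [0.0] * 20
before = windowed_correlation(worker[:40], ppe[:40], width=10)
print(fraction_at_least(before, 0.4))
print(detection_delay(worker, ppe, width=10, threshold=0.4, release_index=40))
```

A wider window gives steadier coefficients while the PPE is worn, but it also lengthens the time needed before a window contains mostly post-release samples, which is the trade-off behind the search over the control parameters.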


The results are presented in Table 1. 


4 CONCLUSIONS 


Test results have shown that the SMART ID
CARD system was highly effective in detecting the fact
that PPE was worn; e.g., the number of cases for
which the value of the correlation coefficient
fell below the threshold value reached 90%. The
system is much more effective at detecting that PPE
was not used, which is important from the safety
point of view. In addition, the time necessary for
detecting the lack of personal protection is
relatively short, approximately 2 minutes. Taking
into account all the results obtained in the course of
the research, the use of the SPEARMAN or BISERIAL
algorithm for detecting that the PPE has been taken
off is strongly recommended. The differences between
these two algorithms are so small that the choice can
be made depending on the computational complexity the
application involves.

An important innovative aspect was the
applied test procedure. On-line registration of all
relevant test parameters in the eAttendance cloud
made it possible both to improve the conducted
experiments and to organise their course considerably
better. It is an excellent example of how the
“Internet of things” technique can improve research.
A Monte Carlo method for evaluating dependability of mission
repairable items


H. Cheng, J. Huang & Y. Zhang 
NAA, Beijing, China 


ABSTRACT: When evaluating or predicting the dependability of products in reliability engineering practice,
we may face a kind of problem in which the product is repairable and the mission time is flexible, i.e. repairs
are allowed during the mission only if the accumulated delay time is less than or equal to a specified
duration. Analytic solutions to this kind of problem are difficult or impossible to derive.
To solve such problems, the relevant terminology is first listed and analyzed. Then, the occurrence
processes of the probabilistic events corresponding to mission success or failure are analyzed. After
this, a numerical method based on Monte Carlo simulation is proposed to compute the dependability of
repairable items when the mission time is flexible. Finally, two examples are presented. The first one illustrates
the use of the proposed method to assess or predict the dependability of an exponentially distributed item.
The second one shows the use of the method to compare the dependability of two products when the mission
time is specified, but repairs are allowed within a limited accumulated time during the mission. Although, for
simplicity, products in the two examples are assumed to follow exponential distributions, the method is
applicable to any item with one or more failure mechanisms that follow the distributions common in the
reliability field, such as the exponential, normal, log-normal and Weibull distributions.


1 INTRODUCTION 

Normally, mission reliability $R_m$ is used to identify
the probability that a product finishes a specific
mission successfully (Zeng et al 2011, Murthy
& Rausand 2008). Meanwhile, availability is used
instead of reliability for repairable products. However,
the conventional concepts and theoretical methods
are not applicable to some special problems in reliability
engineering practice (Porry 1973, Krivtsov 2000).
One of these problems is to determine the success
probability, namely the dependability, of a repairable
product during a mission (Yang 2007). In such a case,
some of the failures occurring during the mission can
be repaired by the operators, and the mission time
is somewhat flexible, namely, maintenance that does
not exceed a specified time is allowed.

To solve this kind of problem, a method based on
Monte Carlo simulation (Lemieux 2008) is proposed by
considering the sequence of probabilistic events:
failure occurrence, down for repair, successful
repair, unsuccessful repair, and maintenance time
exceeding the limit.

Examples are presented to show the applicability
of the proposed Monte Carlo method to dependability
prediction, dependability assessment, and
dependability comparison for various products or
designs.


2 TERMINOLOGY 


Before establishing the simulation process, it is
useful to review the related terminology.
Dependability is related to the reliability and
maintainability of a product. Since the issue studied
here is dependability, i.e., the success probability of
a mission, we pay attention only to critical failures
resulting in mission stop or mission failure.
The related notions are described below.


1. Critical failure: a failure causing the mission
to break off.

2. Mean Time Between Critical Failures (MTBCF):
the ratio between the accumulated operating time T
and the number of critical failures r during a specific
mission profile.

3. Mean Time To Repair (MTTR): the ratio
between the accumulated maintenance time and
the number of failures r during a specific mission
profile.

4. Dependability: an item’s ability to successfully
finish an intended mission.


3 SIMULATION MODEL 


The proposed method is useful regardless of the
distributions of the failure and maintenance times.
For simplicity, however, the exponential distribution is
used to show the procedure.

Assume an item with n kinds of failure
mechanisms, all of which follow independent
exponential distributions with distribution
parameters $\{\theta_i\}$, $i = 1, 2, \ldots, n$. The number of
repairable mechanisms is k out of n, and n − k is the
number of unrepairable mechanisms.

Let T be the time duration of the mission. The item
can be repaired during the mission, but the accumulated
maintenance time must be less than or equal
to a specified time $T_M$, where $T_M < T$. Since the item
can be under repair for up to $T_M$ hours, the smallest
operating time of the item during the mission will be
$T - T_M$. Therefore, the item can conduct the mission
successfully if one of the following events happens:

Event A: the item never fails during the
mission.

Event B: the item fails at $t_0$, but $T - t_0 \le T_M$.

Event C: the item fails $l$ times during the
mission, all failures are repairable, and the accumulated
maintenance time is less than or equal to $T_M$.

Based on the analysis above, the Monte Carlo
simulation algorithm is established and shown in
Figure 1.

The simulation steps in Figure 1 are explained
as follows:


Step 1: For an item with n types of failure mechanisms,
every failure mechanism may follow a different
probability distribution. The distribution
parameters for every type of failure should be
obtained before establishing the simulation program.
The parameters of the failure distributions are
denoted by $\theta_i$, $i = 1, 2, \ldots, n$, regardless of whether
the case is one-parameter, say the exponential
distribution, or multi-parameter, say the Weibull
distribution.

Step 2: Simulate the failure time $t_i$ for the ith failure
mechanism with distribution parameter $\theta_i$
described in step 1. Conduct the simulation for
all the failure mechanisms to get the failure time
series $\{t_i\}$ corresponding to $\{\theta_i\}$, $i = 1, 2, \ldots, n$.

Step 3: Find the minimum value $t_k$, $1 \le k \le n$, in $\{t_i\}$.
Denote the minimum failure time $t_k$ as $t_{\min}$ and $t_{k0}$;
both are used in the subsequent steps.


Step for judgement: Judge whether the failure
mechanism corresponding to $t_{\min}$ is repairable
during the mission. If it is repairable, continue with
the next step; if it is unrepairable, the mission fails.

Step for judgement: Compare the values of $t_{\min}$
and $T - T_M$. If


$t_{\min} \ge T - T_M$ (1)


then the continuous operating time of the item
reaches the smallest operating time $T - T_M$, and the
mission succeeds. Otherwise, go on to
step 4.

[Flowchart: Start; 1. get the distribution parameters $\{\theta_i\}$, $i = 1, 2, \ldots, n$, of the n types of failure mechanisms; 2. simulate the random failure time series $\{t_i\}$ from $\{\theta_i\}$; 3. find the minimum $t_{\min} = \min(\{t_i\})$; 4. simulate the repair time $t_{mk}$ from the distribution parameter $\mu_k$ of the maintenance time; 5. simulate the next failure time; 6. renew the failure time series; 7. repeat step 2 to step 6 N times, N = S + F; 8. the dependability is R = S/N]


Figure 1. The Monte Carlo simulation algorithm for
calculating dependability.


Step 4: Assume that the distribution parameter
of the maintenance time is $\mu_k$ (the distribution of
maintenance time could be any type normally
used), corresponding to the failure mechanism
of $t_k$ in step 1. Simulate the maintenance time
$t_{mk}$ from the distribution with parameter $\mu_k$. The
subscript “m” indicates “maintenance”, and
“k” indicates the kth failure mechanism.

Step for judgement: Compare the accumulated
maintenance time $\Sigma t_{mk}$ ($\Sigma$ indicates the cumulative
sum) with the specified $T_M$. If


$\sum t_{mk} > T_M$ (2)


this means that the maintenance time used has
exceeded the maximum maintenance time allowed,
and the mission fails. If


$\sum t_{mk} \le T_M$ (3)


then go on to step 5.


Step 5: Simulate the next failure time $t_{k1}$ based on
the distribution parameter $\theta_k$ corresponding to
the kth failure mechanism. In addition, for a given
practical problem, we must first know whether the
repair restores the item to an as-good-as-new or an
as-bad-as-old condition. Produce the next
failure time $t_{k1}$ based on this precondition.

Step 6: Get a new failure time series by replacing $t_k$ in
$\{t_i\}$ produced in step 2 by $t_{k0} + t_{k1}$. Denote
the new time series by $\{t_i\}$, $i = 1, 2, \ldots, n$, as well.


Go back to step 3 and continue the loop
from step 3 to step 6 until the mission either
fails or succeeds.


Step 7: Repeat the simulation procedure (from
step 2 to step 6) N times, where N is determined
based on the required accuracy (Murthy
& Rausand 2008).

Step 8: Compute the dependability of the item based
on the simulation result,


$R = S/N$ (4)


where N is the total number of simulated tests
shown in Figure 1, and S is the number of successful
cases out of the N simulations.
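Steps 1 to 8 can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' code, and it rests on several stated assumptions: failure and maintenance times are exponential with $\theta_i$ and $\mu_k$ interpreted as mean values, time is counted as operating time so that a repaired mechanism's next failure is placed at $t_{k0} + t_{k1}$, and the time check of Eq. (1) is applied before the repairability check so that Events A and B count as successes. All function and argument names are hypothetical.

```python
import random


def simulate_mission(theta, repairable, mu, T, T_M, rng):
    """Simulate one mission; return True on success, False on failure."""
    required = T - T_M  # smallest operating time the item must reach
    # Step 2: one random failure time per mechanism (theta = mean life).
    t = [rng.expovariate(1.0 / th) for th in theta]
    repair_total = 0.0
    while True:
        # Step 3: the mechanism that fails first.
        k = min(range(len(t)), key=t.__getitem__)
        if t[k] >= required:          # Eq. (1): operated long enough
            return True
        if not repairable[k]:         # unrepairable critical failure
            return False
        # Step 4: draw the maintenance time (mu = mean repair time).
        repair_total += rng.expovariate(1.0 / mu[k])
        if repair_total > T_M:        # Eq. (2): repair budget exceeded
            return False
        # Steps 5-6: next failure of mechanism k, counted in operating time.
        t[k] += rng.expovariate(1.0 / theta[k])


def dependability(theta, repairable, mu, T, T_M, n_runs, seed=0):
    """Steps 7-8: repeat the mission n_runs times and return R = S / N."""
    rng = random.Random(seed)
    successes = sum(simulate_mission(theta, repairable, mu, T, T_M, rng)
                    for _ in range(n_runs))
    return successes / n_runs


# Parameters of item A from Example 1 (Table 1); None marks "no repair".
r_a = dependability(theta=[400, 500, 1000, 900],
                    repairable=[True, True, False, False],
                    mu=[2, 3, None, None],
                    T=100, T_M=10, n_runs=20000)
print(round(r_a, 4))
```

Because the exponential distribution is memoryless, the as-good-as-new and as-bad-as-old repair assumptions of step 5 coincide in this sketch; for other distributions the line that draws the next failure time is where the two cases would differ.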


4 EXAMPLES 


The Monte Carlo simulation method presented in
section 3 can be used to predict or assess the
dependability of a specified item. It is also useful
for comparing two items in terms of reliability
and maintainability. Two examples are presented
to show the applicability of the proposed method.
For simplicity, the failure time and maintenance
time are both assumed to follow exponential
distributions. The simulation process is the same
when the failure times and maintenance
times of different failure mechanisms
follow different kinds of distributions.


4.1 Example 1 


The mission time is T = 100 h for item A. The
maximum allowed accumulated maintenance time is
$T_M$ = 10 h. There are four critical failure mechanisms
which can cause a mission pause for item A,
and all the related parameters used in the
simulation procedure are tabulated in Table 1. The
first and second failure mechanisms are repairable
during the mission, and the third and fourth are not.


The dependability R of item A is calculated based
on the simulation process shown in Figure 1. The
change of R with the number of simulations N (the
maximum number is 100000) is shown in Figure 2.

It can be seen from Figure 2 that the dependability
R of item A converges to a fairly stationary
level as N increases. The maximum
value of R is 1, and the minimum is 0.5,
corresponding to N = 1 and N = 2 respectively.

To show the change of R clearly, the
abscissa scale is limited to the interval [0, 10000].
The result is shown in Figure 3.


Table 1. The distribution parameters for failure mechanisms
of item A.

Distribution parameter       Repairable        Distribution parameter of
of failure time $\theta_i$   during mission    maintenance time $\mu_i$

$\theta_1 = 400$             Yes               $\mu_1 = 2$
$\theta_2 = 500$             Yes               $\mu_2 = 3$
$\theta_3 = 1000$            No                /
$\theta_4 = 900$             No                /


[Plot: dependability R versus number of simulations N (×10^4), axis range 0 to 10]

Figure 2. The change of R with simulation number N
(global).

Figure 3. The change of R with simulation number N
(local).
From Figure 3, it can be seen that R becomes
relatively stationary when N = 1107. Furthermore,
when N reaches 40000 and above, R is very
stationary, and the variation is acceptable.

To analyze the stationarity of the simulation-based
result, the values of R corresponding
to N = 100000 are divided into 1000 sections with
100 consecutive simulation results in each section.
N’ denotes the index of the section. The
coefficient of variation C, i.e. the ratio of standard
deviation to mean (Ronald et al. 2007), is calculated
for each section. The change of C with the section
number N’ is shown in Figure 4.
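The sectioning described above can be sketched as follows. The helper names are illustrative, the population form of the standard deviation is assumed because the paper does not say which form was used, and the Bernoulli outcomes in the usage lines are synthetic stand-ins for real simulation results.

```python
import random
import statistics


def running_estimate(outcomes):
    """R after each simulation: cumulative successes divided by run count."""
    estimates, successes = [], 0
    for n, ok in enumerate(outcomes, start=1):
        successes += ok
        estimates.append(successes / n)
    return estimates


def section_cv(estimates, size=100):
    """Coefficient of variation (std / mean) of each consecutive section."""
    cvs = []
    for start in range(0, len(estimates) - size + 1, size):
        chunk = estimates[start:start + size]
        mean = statistics.mean(chunk)
        # Guard against a section whose running estimates are all zero.
        cvs.append(statistics.pstdev(chunk) / mean if mean else float("nan"))
    return cvs


# Synthetic demonstration: 100000 missions with success probability 0.8.
rng = random.Random(0)
outcomes = [rng.random() < 0.8 for _ in range(100000)]
cv = section_cv(running_estimate(outcomes), size=100)
print(len(cv), cv[0], cv[-1])
```

Since each running estimate averages over all earlier runs, the spread inside a section shrinks as the section index grows, which is the behaviour the coefficient of variation is meant to expose.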

As shown in Figure 4, the maximum value of C
is 0.1046. The coefficient of variation C converges
quickly as N’ increases. To show the result
clearly, the ordinate axis is limited to the interval
[0, 0.0001], as shown in Figure 5.
Figure 4. The change rule of coefficient of variation C
of R with the divided section number N’.


[Plot: coefficient of variation C versus N’ (N/100), axis range 0 to 1000; marked data point X: 400, Y: 4.553e-05]

Figure 5. The change rule of coefficient of variation C
of R with the divided section number N’ (local).


From Figure 5, we can see that the coefficient of
variation C is less than $10^{-4}$ when the simulation
number N reaches 400 × 100 = 40000. This implies
that the variation of R is already extremely small.

Based on the analysis above, we accept the
simulation results of R from N = 40000 to N = 100000
in this example, as shown in Figure 6.

The dependability of item A is shown in
Figure 6; the mean value is


$R = 0.8132$ (5)


4.2 Example 2 


Item B and item C are compared in terms of
dependability. The mission time is T = 200 h, and
the maximum allowed accumulated maintenance
time is $T_M$ = 12 h.

There are four critical failure mechanisms which
can cause a mission pause for item B, and all the
related parameters used in the simulation
procedure are tabulated in Table 2. The first,
second and third failure mechanisms are repairable
during the mission, and the fourth is not.

There are three critical failure mechanisms
which can cause a mission pause for item C, and all
the related parameters used in the simulation
procedure are tabulated in Table 3.
[Plot: R versus number of simulations N (×10^4)]

Figure 6. Simulation result of R of item A (N = 40000-100000) and its variation.


Table 2. The distribution parameters for failure mechanisms
of item B.

Distribution parameter       Repairable        Distribution parameter of
of failure time $\theta_i$   during mission    maintenance time $\mu_i$

$\theta_1 = 500$             Yes               $\mu_1 = 0.5$
$\theta_2 = 1000$            Yes               $\mu_2 = 1.5$
$\theta_3 = 1200$            Yes               $\mu_3 = 1.0$
$\theta_4 = 1300$            No                /


Table 3. The distribution parameters for failure mechanisms
of item C.

Distribution parameter       Repairable        Distribution parameter of
of failure time $\theta_i$   during mission    maintenance time $\mu_i$

$\theta_1 = 800$             Yes               $\mu_1 = 3$
$\theta_2 = 1200$            Yes               $\mu_2 = 4$
$\theta_3 = 800$             No                /


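Based on the data in Table 2, the MTBCF of
item B through the traditional method is


$\mathrm{MTBCF} = \dfrac{1}{\frac{1}{500} + \frac{1}{1000} + \frac{1}{1200} + \frac{1}{1300}} = 217\ \mathrm{h}$ (6)


The MTTR of the mission repairable failure
mechanisms of item B is


$\mathrm{MTTR} = \dfrac{\frac{0.5}{500} + \frac{1.5}{1000} + \frac{1.0}{1200}}{\frac{1}{500} + \frac{1}{1000} + \frac{1}{1200}} = 0.8696\ \mathrm{h}$ (7)


Meanwhile, the MTBCF of item C is


$\mathrm{MTBCF} = \dfrac{1}{\frac{1}{800} + \frac{1}{1200} + \frac{1}{800}} = 300\ \mathrm{h}$ (8)


The MTTR of the mission repairable failure
mechanisms of item C is


$\mathrm{MTTR} = \dfrac{\frac{3}{800} + \frac{4}{1200}}{\frac{1}{800} + \frac{1}{1200}} = 3.4\ \mathrm{h}$ (9)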

Based on the analysis above, item C is more reliable
than item B from the viewpoint of the traditional
method.
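The traditional figures in Eqs. (6) to (9) can be checked with a few lines of Python. The function names are illustrative; the formulas are the ones used above, with each repairable mechanism's repair time weighted by its failure rate $1/\theta_i$.

```python
def mtbcf(theta):
    """MTBCF of a series of exponential mechanisms: 1 / sum of failure rates."""
    return 1.0 / sum(1.0 / th for th in theta)


def mttr(theta_repairable, mu):
    """Failure-rate-weighted mean repair time of the repairable mechanisms."""
    rates = [1.0 / th for th in theta_repairable]
    return sum(r * m for r, m in zip(rates, mu)) / sum(rates)


# Item B (Table 2) and item C (Table 3).
print(round(mtbcf([500, 1000, 1200, 1300]), 2))            # Eq. (6)
print(round(mttr([500, 1000, 1200], [0.5, 1.5, 1.0]), 4))  # Eq. (7)
print(round(mtbcf([800, 1200, 800]), 2))                   # Eq. (8)
print(round(mttr([800, 1200], [3, 4]), 4))                 # Eq. (9)
```

These two numbers describe reliability and maintainability separately; neither accounts for the limit on accumulated maintenance time, which is why a mission-level measure is needed.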

However, it is more reasonable to consider both
the reliability and maintainability of a product
when calculating its ability to finish a specific
mission. The comparison of the dependability of item B
and item C based on the proposed Monte Carlo
method in Figure 1 is shown in Figure 7. The
total simulation number N is 100000.

The simulation result becomes stationary when
N is larger than 40000, as shown in Figure 7.
Furthermore, we have analyzed the variation of the
simulation results of item B and item C quantitatively
from N = 40000 to N = 100000, as shown in
Figure 8 and Figure 9, respectively.
Figure 7. The comparison of simulation based dependability of item B and of item C.

Figure 8. Simulation result of R of item B (N = 40000-100000) and its variation.

Figure 9. Simulation result of R of item C (N = 40000-100000) and its variation.
Table 4. The comparison of item B and item C in the
aspects of reliability, maintainability and dependability.

Item   MTBCF (h)   Mission repairable proportion   MTTR (h)   Dependability
                   of critical failure

B      217.27      75%                             0.8697     0.8655
C      300.00      67%                             3.400      0.7726

Based on the result shown in Figure 8, the
dependability of item B is


$R_B = 0.8655$ (10)

The standard deviation of $R_B$ is

$\mathrm{Std}(R_B) = 3.4241 \times 10^{-4}$ (11)

The coefficient of variation of $R_B$ is

$C(R_B) = 3.9562 \times 10^{-4}$ (12)

The coefficient of variation is very small; this
means that the result is stationary and believable.


Based on the result shown in Figure 9, the
dependability of item C is


$R_C = 0.7726$ (13)

The standard deviation of $R_C$ is

$\mathrm{Std}(R_C) = 5.9918 \times 10^{-4}$ (14)

The coefficient of variation of $R_C$ is

$C(R_C) = 7.7554 \times 10^{-4}$ (15)


The coefficient of variation is very small; this
means that the result is stationary and believable.

Based on the calculation above, the comparison
of item B and item C in the aspects of reliability,
maintainability and dependability is listed in Table 4.

As listed in Table 4, the inherent reliability of item
C is better than that of item B. However, the
maintainability design of item B is comparatively
better. Finally, as shown in this example, the
dependability (probability of finishing the given
mission) of item B is 12% higher than that of item C,
although its MTBCF is 28% lower.


5 CONCLUSIONS 


In this study, we focus on calculating the
dependability of a repairable item when the mission
time is flexible, and reach the following conclusions:


1. A Monte Carlo simulation algorithm is established
based on the analysis of the occurrences
of the random events during a mission in which
the item can be repaired within a limited
accumulated maintenance time.

2. The use and applicability of the proposed method
in predicting or assessing the dependability of
a product are demonstrated and verified in the first
example.

3. Based on the process and result in example 2,
the dependability of an item in the mission
scenario analyzed in this study is determined
by both reliability and maintainability,
and better maintainability can even make up
for a disadvantage in reliability.
ABSTRACT: In this publication, the authors present a proposal to use a fuzzy expert inference system
for evaluating the reliability of a selected aircraft on-board unit. The unit under consideration is a military aircraft gun.
The article is a continuation of a 2015 ESREL publication, which presented the initial analysis; simulations of
the system's operation are presented here. The project aims to ensure reliable operation (reliability of the selected unit,
together with improvements to the maintenance process and safety). It was verified by practical simulations. The research
confirms the possibility of using fuzzy expert inference systems in the reliability evaluation of aircraft on-board units.


1 INTRODUCTION 


Prof. Lotfi Zadeh, in 1965, proposed the use of
multi-valued logic in inference systems. This technology
has been and is used in many branches of science,
including research on system reliability. Reliability
analysis of systems used by people with traditional
mathematical and statistical methods is difficult
because of their complexity. Research centres
from all over the world have started to search for
alternative approaches to evaluating object reliability,
for example in the automotive industry (ABS system,
Mauer 1995), reliability analysis (Azadeh et al. 2009),
aviation (Grzesik 2004, 2012), medicine and even music.
The selected unit was the M61A1 gun mounted on F-16
class aircraft. Many different factors affect gun
reliability. Some of them change slowly and
some dynamically (Wrona 2013). This can lead to
damage to gun parts. A properly designed fuzzy expert
inference system may help to decrease
gun maintenance costs and time. The publication
presented at the ESREL 2015 Conference
(Zurek & Grzesik 2014) described the process of
designing a fuzzy expert inference system in Matlab
with the Fuzzy Logic Toolbox software (Mrozek &
Mrozek 1998), together with the initial analysis.
Dynamic system simulation of the designed models with
the use of Simulink software is presented as the next
step of the project development. The aim of the research
is to increase the reliability of the system's operation.

An extensive literature analysis and preliminary studies of the very wide range of fuzzy logic applications gave the authors the necessary foundation to continue the project for aircraft on-board units such as the selected one. Moreover, several publications available on the web describe similar problems, but not for military aircraft.

That is why the authors carried out a further analysis to verify in practice, through the dynamic simulations presented in this article, that the system works properly.

One of the most important factors in achieving the main goal is the accurate selection of the knowledge base experts during the design of such systems. The experts were responsible for determining the criteria and principles of operation of these systems (creating the inference rules). A general overview of the expert selection process is presented in Zurek & Grzesik (2014).

This kind of reliability evaluation method supports classical methods (Adamski 2006, Jasztal et al. 2007, 2008, Tomaszek et al. 2011, Zio 2009) and can be used when there is no statistical information or the information is insufficient, when the prognostic parameters describing the object change and the changes are not well known, or when long-term prognoses are determined with combined mathematical and heuristic methods (Idziaszek & Grzesik 2014).

A typical fuzzy inference system is built on fuzzy sets, each described by a membership function. These sets and membership functions are characteristic of fuzzy logic and make it possible to use the data in fuzzy inference systems.
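
As a small illustration of what a membership function does, the sketch below evaluates a triangular fuzzy set. The function name `triangular` and the example set "medium number of fired rounds" with corner points 10000, 25000 and 40000 are assumptions made for illustration only; they are not taken from the authors' model.

```python
def triangular(x, a, b, c):
    """Degree of membership of x in a triangular fuzzy set.

    a and c are the feet of the triangle and b is its peak (a < b < c).
    """
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)  # rising edge
    return (c - x) / (c - b)      # falling edge


# Hypothetical fuzzy set "medium number of fired rounds".
MEDIUM_ROUNDS = (10000, 25000, 40000)

for rounds in (0, 17500, 25000, 32500):
    print(rounds, triangular(rounds, *MEDIUM_ROUNDS))
```

A crisp input such as 17500 fired rounds therefore belongs to the set only partially (degree 0.5 in this example), which is what lets the inference rules fire with graded strength.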

This publication is an integral part of the authors' scientific research.


2 FUZZY EXPERT INFERENCE SYSTEM 
FOR SELECTED AIRCRAFT ON-BOARD 
UNIT RELIABILITY EVALUATION 


The project is a Mamdani-type non-adaptive fuzzy inference system with two inputs and one output (MISO: many inputs, single output), Wrona (2014). The project consists of two evaluators, shown in Fig. 2 (first evaluator, presented at the ESREL 2015 Conference) and Fig. 3 (second evaluator).


Figure 1. Fuzzy inference system (typical).


Figure 2. Fuzzy expert inference system for selected aircraft on-board unit reliability evaluation (first evaluator).


Figure 3. Fuzzy expert inference system for selected aircraft on-board unit reliability evaluation (second evaluator).

According to the technical documentation, reliable operation of the gun system depends on the number of fired rounds, barrel corrosion (barrel reliability) and the hydraulic pressure in the hydraulic drive. That is why the authors analyzed the influence of those factors on the gun damage probability.
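
To make the Mamdani scheme concrete, the sketch below implements a two-input, one-output inference step in plain Python: fuzzification, min/max rule evaluation, min implication, max aggregation and centroid defuzzification. The membership functions and the two rules are hypothetical placeholders chosen for brevity; they are not the authors' rule base, and the outputs are not expected to match the tables in this paper.

```python
def ramp_up(x, lo, hi):
    """Membership rising linearly from 0 at lo to 1 at hi."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


def ramp_down(x, lo, hi):
    """Membership falling linearly from 1 at lo to 0 at hi."""
    return min(1.0, max(0.0, (hi - x) / (hi - lo)))


def damage_probability(rounds, corrosion, steps=1000):
    """Mamdani inference with two hypothetical rules.

    Rule 1: IF rounds is low AND corrosion is low THEN damage is low.
    Rule 2: IF rounds is high OR corrosion is high THEN damage is high.
    """
    # Fuzzification of the crisp inputs (assumed ranges).
    r_low, r_high = ramp_down(rounds, 0, 50000), ramp_up(rounds, 0, 50000)
    c_low, c_high = ramp_down(corrosion, 0.0, 0.1), ramp_up(corrosion, 0.0, 0.1)

    # Rule firing strengths: AND -> min, OR -> max.
    w_low = min(r_low, c_low)
    w_high = max(r_high, c_high)

    # Clip output sets, aggregate with max, defuzzify by centroid.
    num = den = 0.0
    for i in range(steps + 1):
        y = 100.0 * i / steps  # damage probability axis in percent
        mu = max(min(w_low, ramp_down(y, 0, 100)),
                 min(w_high, ramp_up(y, 0, 100)))
        num += y * mu
        den += mu
    return num / den if den else 0.0
```

The centroid step is what turns the aggregated fuzzy output back into a single crisp damage probability; a real rule base would use more linguistic terms per variable and rules elicited from experts.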


3 REAL-TIME WORK SIMULATIONS OF 
THE SYSTEM 


In order to carry out the simulation, the fuzzy expert inference systems designed in Matlab with the Fuzzy Logic Toolbox software need to be transferred to Simulink. The Simulink model consists of two signal blocks (“Signal Builder”) and a fuzzy inference system block (“Fuzzy Logic Controller”, Fig. 4). In addition, each block is connected to a block showing the results of the simulation (“Scope”). The simulations were performed using the same models designed in the Fuzzy Logic Toolbox, because the structure of the test models is the same. The simulation time is not relevant to the results obtained; it was chosen so that retrieving parameters from individual samples is easier. In the simulations, 11 samples were collected at intervals of 10 seconds, so the simulation time is 100 seconds (Wrona 2014).


> Simulation analysis of the M61A1 gun barrels damage probability evaluation


Three system simulations were performed in order to check that the project works properly. The input signals are “Fired Rounds”, expressed as the total number of shots, and “Barrel Corrosion”. The output signal shows how the gun barrels damage probability changes with the input parameters. The input signal builders show the changes of the system's input values.


Simulation I 


The first simulation assumes that the input parameters change linearly from the minimum to the maximum value (Figs. 5, 6).
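
The linear input signals of simulation I can be reproduced with a few lines of code. The sketch below is not the Simulink Signal Builder itself; `linear_samples` is a hypothetical helper that generates 11 samples at 10-second intervals, with the input ranges (0 to 50000 fired rounds, 0 to 0.1 in corrosion) taken from Table 1.

```python
def linear_samples(start, stop, n_samples=11, step_s=10):
    """Return (time_s, value) pairs for a signal rising linearly from start to stop."""
    return [
        (i * step_s, start + (stop - start) * i / (n_samples - 1))
        for i in range(n_samples)
    ]


fired_rounds = linear_samples(0, 50000)
barrel_corrosion = linear_samples(0.0, 0.1)

for (t, rounds), (_, corrosion) in zip(fired_rounds, barrel_corrosion):
    print(f"t = {t:3d} s  rounds = {rounds:7.0f}  corrosion = {corrosion:.2f} in")
```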


Figure 4. The Simulink model of fuzzy expert inference 
system. 


Table 1. Signal sample parameters used in simulation I.

Sample   Fired    Barrel           Probability of
number   rounds   corrosion [in]   damage [%]

1        0        0                3
2        5000     0.01             12.4
3        10000    0.02             21.8
4        15000    0.03             28.6
5        20000    0.04             33.3
6        25000    0.05             45
7        30000    0.06             52.4
8        35000    0.07             60
9        40000    0.08             63.4
10       45000    0.09             68.1
11       50000    0.1              97
Figure 5. “Fired Rounds” linear input function in the Simulink signal builder.


Figure 6. “Barrel Corrosion” linear input function in 
the Simulink signal builder. 
Figure 7. Gun barrels damage probability linear output function calculated in Simulink.


Fig. 7 shows that, as the input values increase simultaneously and proportionally, the gun barrels damage probability increases almost linearly. From 90 seconds onwards, the damage probability rises sharply to 97%.


Simulation II 


The “Fired Rounds” input signal in the second simulation remained unchanged (Fig. 8) compared with the previous simulation. However, it was assumed that the corrosion parameter changes only very slightly and stays close to a small value of 0.01 in (Fig. 9). Under the simulation II assumptions, the probability of gun barrels damage stays at a low level for the first 20 seconds (Fig. 10). Then, until about 70 seconds, the damage probability remains at a medium level. From 70 seconds onwards, the damage probability increases rapidly and reaches a high level, because the rule antecedents with high and very high fired-rounds values are activated.


Table 2. Signal sample parameters used in simulation II.

Sample   Fired    Barrel           Probability of
number   rounds   corrosion [in]   damage [%]

1        0        0.008            3.56
2        5000     0.0084           12.4
3        10000    0.0088           12.8
4        15000    0.0092           26.7
5        20000    0.0096           29.4
6        25000    0.01             29.7
7        30000    0.0104           30
8        35000    0.0108           33.5
9        40000    0.0112           49.6
10       45000    0.0116           61.5
11       50000    0.012            83.1


Figure 8. “Fired Rounds” linear input function in the Simulink signal builder.


Figure 9. “Barrel Corrosion” linear input function in the Simulink signal builder.


Figure 10. Gun barrels damage probability linear output function calculated in Simulink.


Simulation III


The third simulation imitated a situation with an intense increase in barrel corrosion and a relatively small number of fired rounds.


Table 3. Signal sample parameters used in simulation III.

Sample   Fired    Barrel           Probability of
number   rounds   corrosion [in]   damage [%]

1        5000     0.06             29.4
2        6000     0.064            31.1
3        7000     0.068            32.7
4        8000     0.072            38
5        9000     0.076            44.4
6        10000    0.080            49.6
7        11000    0.084            55.4
8        12000    0.088            59.3
9        13000    0.092            49.6
10       14000    0.096            70.6
11       15000    0.1              83.7


Figure 11. “Fired Rounds” linear input function in the 
Simulink signal builder. 


Figure 12. “Barrel Corrosion” linear input function in 
the Simulink signal builder. 


Figure 13. Gun barrels damage probability linear out- 
put function calculated in the Simulink. 
Figure 14. The Simulink model of fuzzy expert inference system.


The analysis of the gun barrels damage probability in the third simulation (Fig. 13) shows that, even with a relatively small number of fired rounds, a rapid growth of barrel corrosion significantly increases the barrels damage probability.


> Simulation analysis of the M61A1 hydraulic drive damage probability evaluation


Two system simulations were performed in order to check that the project works properly. The input signals are “Firing Rate”, expressed as the number of shots per minute, and “Hydraulic Pressure” (Fig. 14). The input signal builders show the changes of the system's input values. The output signal shows how the gun hydraulic drive damage probability changes with the input parameters.


Simulation I 


The first simulation assumes that both input parameters are within the range of optimal values.

According to the simulations performed, and taking into account the project assumptions, when the input parameters oscillate within the nominal operating parameters of the system, the probability of gun hydraulic drive damage is small or very small (Fig. 17).
Table 4. Signal sample parameters used in simulation I.

Sample   Firing rate     Hydraulic        Probability of
number   [rounds/min.]   pressure [psi]   damage [%]

1        5800            3000
2        5880            2900
3        5960            2800
4        6040            2700
5        6120            2600
6        6200            2500
7        6060            2560
8        5920            2620
9        5780            2680
10       5640            2740
11       5500            2800


Figure 15. “Firing Rate” linear input function in the Simulink signal builder.


Figure 16. “Hydraulic Pressure” linear input function in the Simulink signal builder.


Figure 17. Gun hydraulic drive damage probability linear output function calculated in Simulink.


Simulation II 


Simulation II assumes that the gun drive supply pressure drops continuously to an unacceptable value. The firing rate remains the same for some time and then drops below the limit value. Fig. 20 shows that, at the optimal gun firing rate, the drop in gun hydraulic drive pressure does not affect its probability of damage. When the gun firing rate and the gun drive pressure are at a medium level, the probability of gun damage is also at a medium level. A drop of the firing rate below 3,900 rounds/min leads to a sharp increase in the gun damage probability, and when the pressure reaches a value of approximately 1000 psi, the probability is at a level of about 97%.


Table 5. Signal sample parameters used in simulation II.

Sample   Firing rate     Hydraulic        Probability of
number   [rounds/min.]   pressure [psi]   damage [%]

1        6200            3000             3
2        6200            2830             10
3        6200            2660             12.1
4        5825            2500             13
5        5450            2280             13.3
6        5075            2070             13.5
7        4700            1860             25
8        4325            1640             29.1
9        3950            1430             31.5
10       3575            1210             63
11       3200            1000             96.4


Figure 18. “Firing Rate” linear input function in the Simulink signal builder.


Figure 19. “Hydraulic Pressure” linear input function in the Simulink signal builder.


Figure 20. Gun hydraulic drive damage probability linear output function calculated in Simulink.


4 DISCUSSIONS 


Fuzzy expert systems are increasingly used as a tool for solving all sorts of scientific problems. These systems combine high accuracy with mathematical simplicity and an uncomplicated structure. In recent years, research into the use of inference systems for the analysis and evaluation of object reliability and maintenance has been intensifying and leading to improvements (Kacprzyk & Yager 1985, Sergaki & Kalaitzakis 2002, Yager 2004). Moreover, these systems are used more and more in aircraft on-board systems. That is why the authors decided to demonstrate the possibility of using fuzzy expert inference systems in the reliability evaluation of aircraft on-board systems (based on the M61A1 gun damage probability evaluation systems). Some simplifying assumptions were made during the design process. However, they did not obscure the benefits of using the technology in the reliability analysis of complex objects.

An advantage of fuzzy expert inference systems is that they can be used to assess the damage probability of objects whose reliability depends both on parameters that change rapidly in time (such as hydraulic pressure and firing rate) and on parameters that change over the course of operation (such as fired rounds and barrel corrosion, Wrona 2013).

Another advantage of fuzzy systems is the ability to perform simulations based on parameters changing in real time, from which it would be possible to determine when individual maintenance procedures are necessary and when they can be done later. This could reduce the number of unnecessary maintenance procedures and lower operating costs.


5 CONCLUSIONS 


The possibilities offered by fuzzy expert systems are very flexible and depend on the creativity of the engineers designing such systems. The results obtained in Simulink are satisfactory at this stage of the design process. Further development of the project would require the collection of data on M61A1 gun damage and malfunctions.

The main contribution of the publication is to show that fuzzy logic (fuzzy expert inference systems) can be used in the reliability evaluation of selected aircraft on-board systems/units. This approach is pioneering in the field of system reliability evaluation.
ABSTRACT: Developing safety critical systems requires long years of planned investment, broad theoretical knowledge and domain experience. Data interchange between CPUs, synchronization, computation speed and diagnostic measures shall be evaluated exhaustively, along with the effects of the parameters used in the reliability and safety calculations, ex tunc. This study focuses on the effects of the calculation parameters for different architectures. Special attention is paid to the architecture 1oo2D regarding its model and normative definition. It is also revealed that there are correlations between some parameters which seem independent. An advisory route map is created to distill which methods can be applied to decrease the hazard rates. To make some concepts concrete and to share field experience, the railway domain is selected; however, the study is fully applicable to other domains because the norm IEC 61508 is followed throughout the paper.


1 INTRODUCTION 


Murthy et al. (2008) mention that regulatory requirements, customer requirements and technical requirements are to be fulfilled when developing a safety instrumented system (SIS). For safety related systems in railway, automotive, nuclear power etc., the domain specific norms dealing with system, hardware (HW), software (SW) and transmission are derived from the core norm IEC 61508 (Functional safety of electrical/electronic/programmable electronic safety-related systems), as depicted in Figure 1. The CENELEC norms are the railway norms derived from this standard.

The norm IEC 61508 defines in Part 1 four discrete safety integrity levels (SIL) for the low demand mode and for the high demand or continuous mode. If the yearly demand rate is lower than one, the low demand mode applies; otherwise the high demand or continuous mode should be considered. For the low demand mode, the SIL is determined according to the probability of a dangerous failure on demand (PFD), explained as the safety unavailability of an E/E/PE safety-related system to perform the specified safety function when a demand occurs. For the high demand mode, the SIL is determined according to the average frequency of a dangerous failure per hour (PFH), explained as the average frequency of a dangerous failure of an E/E/PE safety related system to perform the specified safety function over a given period of time.


Figure 1. Derivation of standards from IEC 61508 (railway: EN 50126, EN 50128, EN 50129, EN 50159; automotive: ISO 26262; nuclear power: IEC 61513, IEC 60987, IEC 60880).

A similar approach is also applied in EN 50129 (Railway applications - Communication, signalling and processing systems - Safety related electronic systems for signalling). IEC 61508 and the CENELEC norms describe both qualitative and quantitative requirements to be applied for these integrity levels (IL). Architecture is relevant to both the qualitative and the quantitative requirements. For instance, hardware fault tolerance (HFT) is defined as a qualitative requirement and PFH_G is allocated as a quantitative requirement for the pertinent IL. Here, the subscript “G” represents the group of voted channels, and it is used in this manner in the remainder of the paper.

Table 1. SIL vs PFD_avg and PFH.

SIL   PFD_avg                   PFH [1/h]

1     1E-1 > PFD_avg ≥ 1E-2     1E-5 > PFH ≥ 1E-6
2     1E-2 > PFD_avg ≥ 1E-3     1E-6 > PFH ≥ 1E-7
3     1E-3 > PFD_avg ≥ 1E-4     1E-7 > PFH ≥ 1E-8
4     1E-4 > PFD_avg ≥ 1E-5     1E-8 > PFH ≥ 1E-9


Although several formulas are provided in the norms, and stochastic processes can be utilized for the calculation of PFH_G, it is not easy to judge the effects of the parameters in these formulas. However, while developing an on-board computer, it was realized that such a study would be very useful for deciding the design path. Chen et al. (2008) simulated the RAMS of a triple-modular-redundant system and a dual-modular-duplex-redundant system and compared them using estimates of actual hardware failure rates. King (2014) analyzed the hazardous event frequency per year with respect to the demand rate. However, neither the effect of the parameters in Table 2 nor the crucial dependent failures were discussed in these studies. Smith & Gruhn (1995) showed the dependency of the safety system on some parameters to some extent; however, that study is not very detailed, and the results were shown in bar charts with only two values. Moreover, the architectures 1oo2D and 1oo3 were not covered and the high demand/continuous mode was not evaluated, although the behavior of the parameters in high demand mode is very different, since the algebraic equations for the same architecture differ essentially between low and high demand mode. There are some studies, like the one by Liu and Rausand (2016) about the proof testing effect or the one by Ilavsky et al. (2013) about the β factor effect, but no detailed work covering all parameters and all plausible architectures has been found in the literature. Therefore, it is believed that a study examining different architectures to find out the influence of the parameters on reaching very low PFH_G could be very beneficial and interesting. These very low, challenging values are required since the result of a possible hazard in systems like nuclear power plants, autonomous vehicles, high speed trains or airplanes could be catastrophic. For instance, the European rail traffic management system/European train control system (ERTMS/ETCS) defines the core hazard for the ETCS on-board in Subset-091, clause 4.2.1.8, as exceedance of the safe speed or distance as advised to ETCS, with a tolerable hazard rate of 2 x 1E-09 (1/h) for the entire system, which results in a PFH_G requirement for the vital train computer of less than about 1E-12 (1/h). A similar or even lower quantitative target for the on-board vital computation unit is obtained for metro lines considering the safety requirement in the CBTC standard IEEE 1474.1. It sets the requirement that the CBTC wayside and train-borne equipment located within any contiguous portion of a one-way route (including the maximum number of other trains that can be located in this contiguous portion of a one-way route under the specified peak operating headway) that can be traversed by a train traveling at the specified maximum authorized speed for one hour or less shall have a total calculated aggregate MTBHE (total of all critical and catastrophic hazards) of at least 1E+09 operating hours. This information shows that quantitative values that are a thousand times better than SIL 4 are to be obtained for several constituents like the vital train computer or the vital wayside Automatic Train Protection computer. Another novelty of the paper is a route map providing structural information for developing a safety critical constituent that reaches very low dangerous failure rates.
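
The PFH bands of Table 1 can be expressed as a simple lookup, which also makes the "thousand times better than SIL 4" remark easy to check. This is only an illustrative sketch; the function name `sil_from_pfh` is hypothetical, and returning `None` for values outside the tabulated bands is a choice made here, not something prescribed by the standard.

```python
# PFH bands per SIL for high demand / continuous mode (lower bound inclusive).
PFH_BANDS = [
    (1e-9, 1e-8, 4),
    (1e-8, 1e-7, 3),
    (1e-7, 1e-6, 2),
    (1e-6, 1e-5, 1),
]


def sil_from_pfh(pfh):
    """Return the SIL whose PFH band contains pfh, or None if outside all bands."""
    for lower, upper, sil in PFH_BANDS:
        if lower <= pfh < upper:
            return sil
    return None


print(sil_from_pfh(5e-9))   # inside the SIL 4 band
print(sil_from_pfh(3e-7))   # inside the SIL 2 band
print(1e-9 / 1e-12)         # ratio of the SIL 4 lower bound to a 1E-12 target
```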

This paper is organized as follows. The introductory part gives brief information about the norms, functional safety and the motivation of the study. Section II distills the technical background of the architectures and the calculation methodologies. In Section III, the simulation results for PFH_G as a function of different parameters are described, highlighting the most interesting results for the pertinent architecture in the form of deductions. Furthermore, the architectures are compared in this part. Section IV provides a route map as a proposal for reaching very low PFH_G ranges. Finally, some concluding remarks are given.


2 SAFETY ARCHITECTURES 


Karydasa & Brombacherb (1999) explain that architectural modeling deals with the development of a detailed block diagram of the programmable electronic system, identifying each subsystem and the interconnections related to the safety function under consideration. IEC 61508 defines six architectures and provides formulas, depending on the parameters defined in Table 2, to calculate PFH_G. However, according to IEC 61508, part 2, Table 3, for SIL 4 the HFT must be at least one if the diagnostic coverage (DC) is over 99%, and two if the DC is between 90% and 99%. The architectures 1oo1 and 2oo2 are therefore eliminated from this analysis. Chen et al. (2008) point out that triple modular redundancy and dual-duplex modular redundancy are the main architectures used for safety-critical computer systems. Furthermore, according to our investigations in the railway industry, 1oo2, 1oo2D and 2oo3 are the main vital computer architectures for Communication Based Train Control (CBTC) systems used in metro lines and for ERTMS/ETCS systems used in high speed lines. If availability requirements are high, redundant 1oo2 and 1oo2D architectures are selected. The 1oo3 architecture is usually used in transmitters and electromechanical relays, but it is not preferred for computing units because of its relatively low availability, since the unavailability of the computing unit would result in the unavailability of the entire system.


Table 2. Parameters and their explanations.

λ       Total failure rate (1/h) of a channel in a subsystem
λ_S     Safe failure rate (1/h)
λ_SD    Detected safe failure rate (1/h)
λ_D     Dangerous failure rate (1/h)
λ_DD    Detected dangerous failure rate (1/h)
λ_DU    Undetected dangerous failure rate (1/h)
DC      Diagnostic coverage
β       The fraction of undetected failures that have a common cause
β_D     Of those failures that are detected by the diagnostic tests, the fraction that have a common cause
MTTR    Mean time to restoration (h)
MRT     Mean repair time (h)
K       Fraction of the success of the autotest circuit in the 1oo2D system
T1      Proof test interval (y)

The reliability block diagram paradigm is used for deriving the equations in IEC 61508. Another widely accepted method for sophisticated architectures is Markov diagrams, a stochastic process. In this work, the formulas in IEC 61508 are utilized. Applying different paradigms will surely result in different algebraic outcomes; however, as Zhang et al. (2003) show, the difference is not crucial when they are applied correctly.

The calculation formulas and architecture explanations provided in IEC 61508 are distilled in the following. In these algebraic formulations, the β-factor paradigm is selected for modeling the dependent failures.

2oo3 Architecture: This architecture consists of three channels connected in parallel with a majority voting arrangement.

$$\mathrm{PFH}_G = 6\big((1-\beta_D)\lambda_{DD} + (1-\beta)\lambda_{DU}\big)(1-\beta)\lambda_{DU}\,t_{CE} + \beta\lambda_{DU} \tag{1}$$

1oo3 Architecture: This architecture is similar to 2oo3. It consists of three channels connected in parallel with a voting arrangement such that one channel is enough to perform the safety function.

$$\mathrm{PFH}_G = 6\big((1-\beta_D)\lambda_{DD} + (1-\beta)\lambda_{DU}\big)^2(1-\beta)\lambda_{DU}\,t_{CE}\,t_{GE} + \beta\lambda_{DU} \tag{2}$$

1oo2 Architecture: This architecture consists of two channels connected in parallel, such that either channel can process the safety function. It is assumed that any diagnostic testing would only report the faults found and would not change any output states or change the output voting.

$$\mathrm{PFH}_G = 2\big((1-\beta_D)\lambda_{DD} + (1-\beta)\lambda_{DU}\big)(1-\beta)\lambda_{DU}\,t_{CE} + \beta\lambda_{DU} \tag{3}$$

$$t_{CE} = \frac{\lambda_{DU}}{\lambda_D}\left(\frac{T_1}{2} + MRT\right) + \frac{\lambda_{DD}}{\lambda_D}\,MTTR \tag{4}$$


Figure 2. 1oo2, 2oo3 and 1oo2D architectural descriptions.

1oo2D Architecture: This architecture consists of two channels connected in parallel. During normal operation, both channels need to demand the safety function before it can take place. In addition, if the diagnostic tests in either channel detect a fault, then the output voting is adapted so that the overall output state follows that given by the other channel. If the diagnostic tests find faults in both channels, or a discrepancy that cannot be allocated to either channel, then the output goes to the safe state. In order to detect a discrepancy between the channels, either channel can determine the state of the other channel via a means independent of the other channel. The channel comparison/switch-over mechanism may not be 100% efficient; therefore K represents the efficiency of this inter-channel comparison/switch mechanism, i.e. the output may remain on the 2oo2 voting even with one channel detected as faulty. The parameter K will need to be determined by an FMEA.

$$\mathrm{PFH}_G = 2(1-\beta)\lambda_{DU}\big((1-\beta)\lambda_{DU} + (1-\beta_D)\lambda_{DD} + \lambda_{SD}\big)\,t_{CE}' + 2(1-K)\lambda_{DD} + \beta\lambda_{DU} \tag{5}$$

$$t_{CE}' = \frac{\lambda_{DU}\left(\frac{T_1}{2} + MRT\right) + (\lambda_{DD} + \lambda_{SD})\,MTTR}{\lambda_{DU} + \lambda_{DD} + \lambda_{SD}} \tag{6}$$

Although this formula does not make a distinction regarding the DC between detected and undetected dependent failures, Hokstad (2005) showed how the DC is influenced when comparison of channels is applied as part of the diagnostic testing for multiple channel systems, and suggests as an alternative defining two betas, i.e. β for DU failures and β_D for DD failures. Nevertheless, this paper uses the formulation provided in IEC 61508 part 6.
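
As a worked illustration of Equations (1), (3) and (4), the sketch below evaluates PFH_G for the 1oo2 and 2oo3 architectures. The function names are hypothetical, the numeric inputs are the fixed values listed in Table 9 (λ_DD = 4.95E-08 1/h, λ_DU = 5E-10 1/h, β = 0.02, β_D = 0.01, MTTR = MRT = 8 h), and a proof test interval of T1 = 1 year is converted to 8760 h, which is an assumption about the unit conversion.

```python
HOURS_PER_YEAR = 8760.0


def t_ce(l_du, l_dd, t1_h, mrt, mttr):
    """Channel equivalent mean down time, Equation (4), in hours."""
    l_d = l_du + l_dd
    return (l_du / l_d) * (t1_h / 2 + mrt) + (l_dd / l_d) * mttr


def pfh_voted(factor, l_du, l_dd, beta, beta_d, t1_h, mrt, mttr):
    """Common form of Equations (1) and (3); factor is 6 for 2oo3, 2 for 1oo2."""
    independent = (1 - beta_d) * l_dd + (1 - beta) * l_du
    return (factor * independent * (1 - beta) * l_du
            * t_ce(l_du, l_dd, t1_h, mrt, mttr)
            + beta * l_du)


def pfh_1oo2(**params):
    return pfh_voted(2, **params)


def pfh_2oo3(**params):
    return pfh_voted(6, **params)


# Values taken from Table 9; T1 = 1 year is assumed to mean 8760 h.
PARAMS = dict(l_du=5e-10, l_dd=4.95e-8, beta=0.02, beta_d=0.01,
              t1_h=1 * HOURS_PER_YEAR, mrt=8.0, mttr=8.0)

print("t_CE [h]    :", t_ce(5e-10, 4.95e-8, HOURS_PER_YEAR, 8.0, 8.0))
print("PFH_G 1oo2  :", pfh_1oo2(**PARAMS))
print("PFH_G 2oo3  :", pfh_2oo3(**PARAMS))
```

With these inputs the common cause term βλ_DU is by far the largest contribution for both architectures, so the two results differ only slightly.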


3 SIMULATION RESULTS AND 
DEDUCTIONS 


In this section, the effects of the parameters are explained. At each step, one parameter, and of course any parameters derived from it, are varied while the remaining ones are fixed. A table is provided to track the values of the variables in the equation. The parameter of interest is shown as “PoI” and a varying value as “v”. After the simulation results, a deduction is provided to summarize the impact of the pertinent parameter. It is found that setting plausible values for the fixed variables is essential. Particular importance should be given to the value of K used for the calculation of the architecture 1oo2D. In IEC 61508, part 6, an example value of 0.98 is given for K. However, according to our field experience, it is possible to reach 0.9999 for K, and with this value the results for 1oo2D change substantially. In the simulations, attention is drawn to cases where a change of K from 0.98 to 0.9999 has a great impact on the results.


Table 3. Parameters used for the analysis of λ effect.

λ (PoI)          λ_S  λ_SD  λ_D  λ_DD  λ_DU  DC    SFF    β     β_D   MTTR  MRT  K            T1
1E-07<λ<1E-05    v    v     v    v     v     0.99  0.995  0.02  0.01  8     8    0.98/0.9999  1


Figure 3. The influence of λ (K = 0.98 for 1oo2D).


Table 4. Parameters used for the analysis of β effect.


Figure 4. The influence of λ (K = 0.9999 for 1oo2D).


Figure 5. The influence of β (K = 0.98 for 1oo2D).


Figure 6. The influence of β (K = 0.9999 for 1oo2D).
Table 5. Parameters used for the analysis of β_D effect (for β = 0.02).

Figure 7. The influence of β_D while β is 2%.


3.1 The influence of λ (Table 3 and Figures 3 and 4)


To compare 1oo2D with the other architectures correctly, K is chosen first as 0.98 and then as 0.9999.

The simulation results for the above parameters are shown in Figures 3 and 4.

Deduction: There is a positive correlation between λ and PFH_G for all architectures. 1oo2D is the most affected architecture. K should be selected correctly in order to judge the results and compare the architectures.


3.2 The influence of β (Table 4 and Figures 5 and 6)


Deduction: An exhaustive effort to decrease β tenfold affects 1oo2, 2oo3 and 1oo3 almost linearly, such that a ten times better PFH_G can be reached. The same holds for 1oo2D if K is relatively high (0.9999). Otherwise, the effect of β on 1oo2D is negligible.
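
A rough order-of-magnitude check of this deduction is sketched below for 1oo2 and for 1oo2D with K = 0.98. It assumes that the fixed values listed in Table 9 (λ_DU = 5E-10 1/h, λ_DD = 4.95E-08 1/h, β = 0.02, β_D = 0.01, MRT = MTTR = 8 h) also apply here, since the body of Table 4 is not reproduced in this text, and that T1 = 1 year corresponds to 8760 h. The terms are those of Equations (3), (4) and (5).

```latex
\begin{align*}
t_{CE} &= 0.01\,(4380 + 8) + 0.99 \cdot 8 = 51.8\ \mathrm{h} \\
\text{1oo2, independent term:}\quad
  & 2\big(0.99 \cdot 4.95\times10^{-8} + 0.98 \cdot 5\times10^{-10}\big)
    \big(0.98 \cdot 5\times10^{-10}\big)\, t_{CE}
    \approx 2.5\times10^{-15}\ \mathrm{h^{-1}} \\
\text{common cause term:}\quad
  & \beta\lambda_{DU} = 0.02 \cdot 5\times10^{-10}
    = 1.0\times10^{-11}\ \mathrm{h^{-1}} \\
\text{1oo2D } (K = 0.98):\quad
  & 2(1-K)\lambda_{DD} = 2 \cdot 0.02 \cdot 4.95\times10^{-8}
    = 1.98\times10^{-9}\ \mathrm{h^{-1}}
\end{align*}
```

Under these assumed values PFH_G ≈ βλ_DU for 1oo2, which is linear in β, while for 1oo2D with K = 0.98 the β-independent term 2(1 − K)λ_DD is about two hundred times larger than βλ_DU, so changing β hardly moves the result.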


3.3 The influence of β_D


The influence of β_D is examined for two different β factors to see the difference between the cases with relatively high and low β values. First a β value of 0.02 and afterwards a β value of 0.2 is handled.

3.3.1 The influence of β_D for β = 0.02 (= 2%) (Table 5 and Figure 7)

3.3.2 The influence of β_D for β = 0.2 (= 20%) (Table 6 and Figure 8)


Figure 8. The influence of β_D while β is 20%.


Deduction: It is found that a change in the β_D fraction has almost no effect on PFH_G, which is very strange, because a lot of effort has to be spent on increasing the diagnostics and the return on this endeavor is almost zero. Moreover, when compared with DC, a measure for detecting independent failures λ, this behavior is again strange, since in this case too the detected failures are ones that would cause a hazard if they were not revealed. We believe this equation should be reconsidered from this perspective. Based on this simulation result, we decided to scrutinize this issue in more detail in our next study.


3.4 The influence of DC (Table 7 and Figures 9 and 10)


Deduction: An increase of DC causes a decrease in PFH_G for 1oo2, 2oo3 and 1oo3. If the DC is approximately doubled from 0.5 to 0.99, the PFH_G of these architectures falls by half. For 1oo2D, the situation is again strange. It is found that when K is 0.98, there is a positive correlation between PFH_G and DC, such that if DC goes up from 0.5 to 0.99, PFH_G increases by about 33%. The question arises here why DC would affect PFH_G negatively. On the other hand, if K is 0.9999, the behavior is similar to that of the other architectures. It is revealed that when K is about 0.99, PFH_G is not influenced by DC. Based on this simulation result, we again decided to scrutinize this issue in more detail in our next study.


Table 7. Parameters used for the analysis of DC.


Figure 9. The influence of DC (K = 0.98 for 1oo2D).


Figure 10. The influence of DC (K = 0.9999 for 1oo2D).


Table 8. Parameters used for the analysis of MTTR.


Figure 11. The influence of MTTR.


Figure 12. The influence of T1.


Table 9. Parameters used for the analysis of T1.

λ      λ_S    λ_SD      λ_D    λ_DD      λ_DU      DC    SFF    β     β_D   MTTR  MRT  K       T1 (PoI)
1E-07  5E-08  4.95E-08  5E-08  4.95E-08  5.00E-10  0.99  0.995  0.02  0.01  8     8    0.9999  1/(365.24) < T1 < 50


Table 10. Parameters used for the analysis of K.


Figure 13. The influence of K.


3.5 The influence of MTTR (Table 8 and Figure 11) 


Deduction: MTTR is usually selected to be no more
than 8 hours. In this simulation, as no effect was
found within 8 hours, the time range was extended up
to 1000 hours. Even in this case, for all
architectures, the MTTR has almost no impact on the
PFH.


3.6 The influence of T1 (Table 9 and Figure 12)


Deduction: T1 can be selected from one day to many
years according to the type of the unit. For vital
computers it can be the lifetime, such as 20 years.
However, for both very short and very long intervals,
T1 has almost no effect on the PFH. Besides, when
doing proof tests, the effectiveness of the tests
shall also be considered. Therefore, taking into
account the effort needed for proof tests and also
the effectiveness of such tests, the lifetime can be
selected as the proof test interval, such that no
proof test is required during the normal lifetime of
the constituent.

Table 10, recovered row fragment: 0.98 < K < 0.9999   1


3.7 The influence of K (Table 10 and Figure 13) 


Deduction: The parameter K has an important
impact on the PFH for 1oo2D. If it is increased
from 0.95 to 0.99, a five times better PFH can be
obtained. If it changes from 0.98 to 0.99, a two
times lower PFH could be obtained; from 0.98 to
0.999, twenty times lower; and from 0.98 to 0.9999,
a hundred times lower.
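
As a rough plausibility check of these factors, the sketch below assumes that the K-dependent part of the 1oo2D PFH scales with (1 - K). This proportionality is an assumption made here for illustration and is not stated in the text; the function name is hypothetical.

```python
def improvement_factor(k_old: float, k_new: float) -> float:
    """Ratio PFH(k_old) / PFH(k_new), ASSUMING PFH is proportional to (1 - K)."""
    return (1.0 - k_old) / (1.0 - k_new)


# K pairs quoted in the deduction above.
for k_old, k_new in [(0.95, 0.99), (0.98, 0.99), (0.98, 0.999), (0.98, 0.9999)]:
    factor = improvement_factor(k_old, k_new)
    print(f"K {k_old} -> {k_new}: factor {factor:.0f}")
```

Under this assumption the first three pairs give the reported factors of five, two and twenty. The last pair gives about two hundred rather than the reported hundred, which would be consistent with terms that do not depend on K starting to dominate at very high K.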

Besides the deductions mentioned above, the
simulation results surprisingly show that the
architectures 1oo2, 1oo3 and 2oo3 have very similar
PFH values and similar characteristics, although
there are major differences in the HW/SW design
and implementation of these architectures. At first
glance, one would expect 1oo3 to have much better
safety performance than 1oo2, as there is one
additional computation unit; however, the results
show that this is not the case.

Taking into account these architectural highlights,
we have eliminated the 1oo3 architecture and decided
to select the 2oo3 architecture for the ERTMS/ETCS
on-board vital computer and 2 × 1oo2 (cold stand-by
redundancy) for the CBTC on-board vital computer,
since for CBTC systems full redundancy of sensors
and actuators is required due to very high traffic
density. We favored 1oo2 over 1oo2D, because with
this selection the time and cost needed to reach
very high K values could be avoided. On the other
hand, instead of developing two different systems
and going through the exhaustive certification
processes for both, we have unified the requirements
and decided to develop a 2 × 1oo2D (cold stand-by
redundancy) platform with a relatively high K value,
since this effort has been evaluated as less than
going through a second independent assessment
process. We have also chosen 1oo2D in order to
utilize cross-channel comparison. We have further
concluded for our projects that it is plausible to
use diverse HW to reduce dependent failures and
obtain better hazard rates, while this also helps to
reduce systematic failures. Moreover, according to
the analyses, a hot stand-by design would cause
additional failures in comparison to cold stand-by
and would be much more complex; as cold stand-by can
also cover the availability requirements, the cold
stand-by system is selected.


4 ROUTE MAP TO GET BETTER PFH
WITH REGARD TO THE RELIABILITY
AND SAFETY PARAMETERS


As mentioned in the introduction, today's challenging
technologies require a PFH of the computation unit
of less than about 1E-12 (1/h). Given these tough
requirements, an ordinary development procedure
would not be sufficient to reach the quantitative
requirements. Based on the results and deductions
provided in the previous section, a route map is
developed for this purpose in this section. Note that
the β factor is used for modeling dependent failures
in this study, so these clauses are valid when the β
factor is used. If this is not the case, the
simulations are to be repeated to ensure the
correctness and consistency of the results.


i. Decrease λ as much as possible for each potential
architecture. A linear relationship with slope one
holds between λ and the hazard rate. To decrease
this reliability parameter, one can select higher
quality parts, set the correct duty cycle, apply
derating, consult engineering experience to adjust
data, and use field data. For instance, when the
duty cycle was reduced from 100% to 75%, we
obtained 23.47% better failure rates. Moreover,
the reliability calculations can be performed
using different methods such as MIL 217F,
Bellcore, 217Plus etc. Sometimes ten times better
results can be obtained, as shown in the study of
the RIAC (2010). In our experience, 217Plus
developed by RIAC (2010) gave 5.29 times better
results than MIL HDBK 217F N2 developed by USA
Department of Defence (1995). If there are no
special requirements, the most effective handbook
can be used. Besides, some COTS products in the
industry are claimed to work in both mobile and
ground environments, yet only one failure rate and
one MTBF (Mean Time Between Failures) value are
provided. According to our calculations, however,
if the environment changes from ground fixed to
ground mobile, the failure rate increases by
83.71%. As a result, if the constituent is designed
to work in different environments, with different
duty cycles etc., the quantitative values should be
provided for these different operating conditions.

ii. Decrease the common cause factor β as much as
possible. For the 1oo2, 1oo3 and 2oo3
architectures, a ten times better β results in an
approximately ten times better PFH; however, the
effect for 1oo2D is very limited if K is relatively
low, such as 0.98, where a ten times better β
results in only a two times better PFH. Diversity
is decisive for achieving low dependent failures.
Another approach would be to try modeling
dependent failures with a method other than the
β factor. Fleming (1987) explains the details of
dependent failure models such as binomial failure
rate, multiple Greek letters, alpha factor etc.,
which can be alternatives to β factor modeling.

iii. Detecting the common cause failures, namely βD,
has almost no effect on the PFH for any
architecture; hence do not spend much effort on
detecting common cause failures. On the other
hand, as explained previously, we believe that
some issues are overlooked for this parameter in
the equations of IEC 61508, and we will
investigate this in more detail in future studies.

iv. Increase DC for all architectures except 1oo2D
if K is less than 0.99. Similar to the previous
item, the behavior of DC for 1oo2D with a K
of less than 0.99 is also strange. Therefore, we
recommend raising K to at least 0.99 to
get plausible results.

v. Do not spend much effort on decreasing
MTTR for the safety performance, as it has
almost no effect on the PFH for any architecture.

vi. Similar to the case with MTTR, do not spend
much effort on decreasing T1 for the safety
performance, as it has almost no effect on the
PFH for any architecture.

vii. If 1oo2D is chosen, then increase K as much
as possible. If K changes from 0.98 to 0.9999,
a hundred times better PFH could result.


5 CONCLUSION 


In this study, the effects of calculation parameters
for several architectures are simulated. Attention is
drawn to the surprising and interesting aspects of
the results, while detailed deductions are provided
for each parameter and architecture. It has been
observed that some parameters, like K or DC for the
architecture 1oo2D, are crucial, while other
parameters, like MTTR or T1, have almost no effect
on the results of any architecture. The simulations
can be utilized when designing safety critical
systems, subsystems or equipment. The architectures
selected for the on-board computers to be used in
the ERTMS/ETCS and CBTC domains are shared, together
with the most important judgements. An advisory
route map is newly created, including project
examples, to distill what kind of measures can be
taken to decrease the dangerous failure rate, which
is at challengingly low levels in today's high-tech
mission critical systems like high speed trains or
driverless vehicles performing vital safety
functions.
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ABSTRACT: 


We extend the equal load-sharing model of cascades to investigate the abrupt breakdown
behavior of coupled distribution grids. In particular, we mimic the effects of the ever-increasing customer
demand in a foreseen scenario where energy hubs interconnect the different energy vectors. In the load
growth scenario we find evidence of first order transitions (i.e. abrupt breakdowns of the system) due to
the long-range nature of the flows. Our results indicate that the foreseen increase in the couplings between
the grids has two competing effects: on the one hand, it increases the safety region where grids can operate
without withstanding systemic failures; on the other hand, it increases the possibility of total failure of all
the interconnected systems at once.


1 INTRODUCTION 


In this paper we will consider a simple mean-field
model of cascading that applies both to single and
to interdependent networks. To highlight the
possibility of emergent behavior, we will first
abstract PNIS in order to understand the basic
mechanisms that could drive systemic failures; in
particular, we will consider finite capacity networks
where a commodity (a scalar quantity) is produced at
source nodes, consumed at load nodes and distributed
as a Kirchhoff flow (i.e. fluxes are conserved), and
introduce a simplified model that is amenable to
a self-consistent analytic solution. Subsequently,
we will extend such a model to the case of several
coupled networks and study the cascading behavior
of such a model under increasing stress (i.e.
flow magnitudes).


2 MODEL 


Let's consider a weighted network $G = (V, E, c)$
where $V = \{1 \le i \le |V|\}$ is the node set,
$E \subseteq V \times V$ is the set of edges and
$c = \{c_{(ij)}\}$ is the vector characterizing the
capacities of the edges $(i, j)$. We associate to
the nodes a vector $s = \{s_i\}$ that characterizes
the production ($s_i > 0$) or the consumption
($s_i < 0$) of a commodity. We further assume that
there are no losses in the network (i.e.
$\sum_i s_i = 0$); hence, the total load on the
network is

$$L = \sum_{i:\, s_i > 0} s_i$$

The distribution of the commodity is described
by the fluxes $f = \{f_{ij}\}$ on the edges
$(i, j) \in E$ and is supposed to respect Kirchhoff
equations, i.e.

$$\sum_j f_{ij} = s_i \qquad (1)$$

The relation among fluxes and demand/load is
described by constitutive equations

$$f = F(s, G) \qquad (2)$$

where in general Eq. (2) is non-linear but satisfies
Eq. (1).

More often, constitutive equations rely on a
relation among the fluxes on the lines and the values
of a physical field $\phi$ defined on the nodes. As
an example, in networks transporting fluid
commodities (e.g. gas, water) the constitutive
equations take the form

$$f_{ij}^{\gamma} = \phi_i - \phi_j \qquad (3)$$

where $\gamma \approx 2$ and $\phi_i$ is the pressure
at the $i$th node. In the case of DC currents, the
relation is linear ($\gamma = 1$), $f_{ij}$ is the
electric current and $\phi_i$ the voltage; the same
linear equations hold in the case of steady-state DC
power flow in electric networks, where now $f_{ij}$
is the power flow and $\phi_i$ the phase angle. For
further details, see (Scala 2017) and references
therein.


3 METHODS 


Cascading failures represent a critical
vulnerability in our world, where network
infrastructures grow both in complexity and
interdependencies (D'Agostino and Scala 2014,
D'Agostino and Scala 2015). When detailed
information about the topology is at hand,
centrality measures can be used to identify
vulnerable subsets of the infrastructural system
(Scala et al. 2016). However, in this section we
will concentrate on cascading due to the violation
of the link capacities while disregarding the
effects of shocks due to strong transients.

For the flow networks described in the previous
section, the finite capacity $c_{(ij)}$ of a link
$(i, j)$ constrains the maximum flux on such a link,

$$|f_{ij}| < c_{(ij)}$$

while above such flux the link will cease
functioning. As an example, power lines are tripped
(disconnected) when power flows go beyond a certain
threshold. Since flows redistribute after a link
failure, other lines may get above their flow
threshold and consequently fail, eventually leading
to a cascade of failures. A typical algorithm to
calculate the consequences of an initial set of line
failures $F^0 = \{(i,j)\ \text{failed}\}$ is
Algorithm 1. Here $F(s, G \mid F)$ calculates the
flows subject to the constraint that flows are zero
in the failure set of edges $(i, j) \in F$.


Algorithm 1 Network cascading
    Set initial failures F^0
    t ← 0
    repeat
        t ← t + 1
        Calculate flows f^t ← F(s, G | F^(t-1))
        Calculate new failures ΔF^t ← {(i, j) : |f^t_ij| > c_(ij)}
        F^t ← F^(t-1) ∪ ΔF^t
    until ΔF^t = ∅
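
The loop of Algorithm 1 can be sketched in Python as follows. The flow solver is passed in as a callable, because the constitutive equations depend on the network type. The toy solver used here (parallel lines sharing a load equally) and all names are illustrative assumptions, not part of the original model.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Iterable, List, Set

Edge = Hashable
FlowSolver = Callable[[FrozenSet[Edge]], Dict[Edge, float]]


def network_cascade(capacities: Dict[Edge, float],
                    flow_solver: FlowSolver,
                    initial_failures: Iterable[Edge]) -> Set[Edge]:
    """Iterate flow recalculation and link tripping until no new link fails."""
    failed: Set[Edge] = set(initial_failures)
    while True:
        # Flows on the surviving links, given the current failure set.
        flows = flow_solver(frozenset(failed))
        new_failures = {edge for edge, flow in flows.items()
                        if edge not in failed and abs(flow) > capacities[edge]}
        if not new_failures:
            return failed
        failed |= new_failures


def parallel_lines_solver(total_load: float, edges: List[Edge]) -> FlowSolver:
    """Hypothetical toy solver: surviving parallel lines share the load equally."""
    def solve(failed: FrozenSet[Edge]) -> Dict[Edge, float]:
        working = [e for e in edges if e not in failed]
        if not working:
            return {}
        share = total_load / len(working)
        return {e: share for e in working}
    return solve


capacities = {"a": 1.0, "b": 0.7, "c": 0.5}
solver = parallel_lines_solver(1.2, list(capacities))
print(sorted(network_cascade(capacities, solver, [])))
print(sorted(network_cascade(capacities, solver, ["a"])))
```

In this toy case the intact system carries the load without failures, while tripping line "a" overloads "c" and then "b", so the cascade takes down all three lines.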


To develop a general model that helps us understand
the class of failures that can affect Kirchhoff-like
flow networks, let's start from rewriting
Eq. (1) in matrix form


B'f =s (4) 


using the incidence matrix $B$ that associates to
each link $(i, j)$ its nodes $i$ and $j$ and
vice-versa. $B$ is a $|V| \times |E|$ matrix where
each column corresponds to an edge $(i, j)$; its
columns are zero-sum and the only two non-zero
elements have modulus 1 and are on the $i$th and on
the $j$th row.

The matrix $B$ is related to the Laplacian $B'B$ of
the system; in particular, it shares the same right
eigenvalues and the same spectrum (up to a squaring
operation); hence, it is a long-range operator,
since a perturbation on a node of the system can be
reflected on nodes far away on the network (Pahwa
et al. 2014).

Due to the long range nature of Kirchhoff's
equations, to understand the qualitative behavior
of such networks we can resort to a mean field
model of flow networks where one assumes that
when a link fails, its flow is re-distributed equally
among all other links. Such a model, introduced
in (Pahwa et al. 2014), is akin to the fiber-bundle
model (Peirce 1926, Daniels 1945) and has been
considered in more detail in (Scala and Lucentini
2016, Yagan 2015) for the case of a single system.
In this model, each time lines trip, flows are
recalculated, the lines above their threshold trip
in turn, their flows are re-distributed and so on,
up to convergence; recalling that $L$ is the total
load of the system and assuming that each link
$(i, j)$ has an initial flux $f = L/|E|$, we can
describe such a model by Algorithm 2.


Algorithm 2 Mean Field cascading
    t ← 0
    F^t ← 0                       initial number of failed links
    repeat
        t ← t + 1
        M ← |E| − F^(t-1)         number of working links
        l ← L/M                   average flux on the working links
        F^t ← |{(i, j) : l > c_(ij)}|
    until F^t = F^(t-1)
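
A minimal Python sketch of Algorithm 2, operating on an explicit list of link capacities. The guard for the case where no working link remains is an addition made here to avoid a division by zero, and the example capacities are arbitrary illustrative values.

```python
from typing import Sequence


def mean_field_cascade(capacities: Sequence[float], total_load: float) -> int:
    """Return the final number of failed links under equal load sharing."""
    n_links = len(capacities)
    failed = 0  # initial number of failed links
    while True:
        working = n_links - failed
        if working == 0:
            # Total breakdown: no link is left to carry the load.
            return n_links
        load_per_link = total_load / working
        # Count every link whose capacity is below the current average flux.
        new_failed = sum(1 for c in capacities if load_per_link > c)
        if new_failed == failed:
            return failed
        failed = new_failed


capacities = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for load in (20, 30, 40):
    print(load, mean_field_cascade(capacities, load))
```

With these capacities the cascade stops after a few links for the two smaller loads and runs to total breakdown for the largest one, which illustrates the abrupt character of the transition.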


Such an algorithm can be cast in the form of a single
equation in the case where the system is composed
of a large number of elements with capacity $c$. In
fact, in such a limit we can describe the links'
population by the probability function $p(c)$ of
their capacities. Indicating with $M = |E|$ the
initial number of links, we see that if we apply an
overall load $L$ to the system, all the links will be
initially subject to a flow $l^0 = L/M$. Thus, a
fraction of links $f^1 = \int_0^{l^0} p(c)\,dc$ would
immediately fail, since their thresholds are lower
than the flux $l^0$ they should sustain. After the
first stage of a cascade, there will be
$M^1 = (1 - f^1) M$ surviving links and the new load
per link is $l^1 = L/M^1$. The following cascade
stages follow analogously; we can thus write the mean
field equations for the $(t+1)$th stage of the
cascade:

$$f^{t+1} = P\!\left(\frac{l}{1 - f^{t}}\right) \qquad (5)$$

where $l = L/M$ is the initial load per link and
$P(x) = \int_0^x p(c)\,dc$ is the cumulative
distribution function of link capacities; the initial
conditions are $f^{t=0} = 0$. The fix-point $f^*$ of
Eq. (5) satisfies the equation

$$f^* = P\!\left(\frac{l}{1 - f^*}\right) \qquad (6)$$

and represents the total fraction of links broken
at the end of the cascading stages (Pahwa et al.
2014).
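
The iteration of Eq. (5) can be sketched as below. As an illustrative assumption, capacities are taken to be uniform on [0, 1], so that P(c) = c there and the fix-point of Eq. (6) satisfies f*(1 - f*) = l. This has a solution below 1 only for l ≤ 1/4, which gives a simple closed form to compare against. The function names are hypothetical.

```python
import math
from typing import Callable


def uniform_cdf(x: float) -> float:
    """CDF of capacities uniformly distributed on [0, 1] (illustrative choice)."""
    return min(max(x, 0.0), 1.0)


def cascade_fixed_point(load_per_link: float,
                        cdf: Callable[[float], float] = uniform_cdf,
                        tol: float = 1e-12,
                        max_iter: int = 100_000) -> float:
    """Iterate f <- P(l / (1 - f)) from f = 0 until it stops changing."""
    f = 0.0
    for _ in range(max_iter):
        if f >= 1.0:
            return 1.0  # every link has failed
        f_next = cdf(load_per_link / (1.0 - f))
        if abs(f_next - f) < tol:
            return f_next
        f = f_next
    return f


for load in (0.1, 0.2, 0.3):
    print(load, cascade_fixed_point(load))

# Closed form for the uniform case, valid for l <= 1/4.
print((1.0 - math.sqrt(1.0 - 4.0 * 0.2)) / 2.0)
```

For loads below 1/4 the iteration settles on the smaller root of f(1 - f) = l, while for larger loads it runs to f = 1, i.e. total breakdown.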

The behavior of $f^*$ depends on the functional form
of $p(c)$. In particular, by defining
$\pi(c) = 1 - P(c)$ and $x = l^{-1}(1 - f)$, we have
that

$$f = \int_0^{1/x} p(c)\,dc = 1 - \pi\!\left(\frac{1}{x}\right)$$

and can rewrite Eq. (5) as

$$l\,x^{t+1} = \pi\!\left(\frac{1}{x^{t}}\right)$$

(see Fig. (1)). This equation has a trivial fix-point
$x^* = 0$ (representing a total breakdown of the
system) since $\pi(\infty) = 0$. Such a fix-point is
unstable for $l \to 0$ and becomes stable for
$l > \partial_x \pi(1/x)|_{x \to 0}$. We notice that
if $P(c)$ does not change convexity (i.e. has no
bumps) and the transition is first order, the system
will break down directly to the total collapsed state
$f = 1$.

Figure 1. Graphical solution of the fixpoint equation.
For $L = L_c$ there is only one solution, corresponding
to a critical point. For $L < L_c$ the stable fixpoint
$x^*$ corresponds to a small fraction of broken links
$f^*$; in the limit of $l \to 0$, $f^* = 0$. On the
other hand, for $L > L_c$ the only stable solution is
$x^* = 0$, corresponding to the situation $f^* = 1$
where all the links are broken and hence the whole
system has failed.

Figure 2. The behavior of the fix-point $x^*$ depends on
the tail of the distribution of link capacities $p(C)$
and is known to present a first order transition for a
wide family of curves; in particular, in case (iii),
where $C^2 p(C) \to 0$ for $C \to \infty$, the
transition is discontinuous, with a jump at a critical
value $L_c$ and a power-law divergence of $dM/dL$ as
$L \to L_c$.

In general, the behavior of the fix-point $x^*$
depends on the tail of the distribution $p(c)$ and is
known to present a first order transition for a wide
family of curves (da Silveira 1998) (see Fig. (2)); in
particular, the fixpoint behavior is dictated by the
behavior of $p(C)$ for $C \to \infty$, i.e.

• for $p(C) \sim C^{-\gamma}$ with $1 < \gamma < 2$,
no transition occurs

• for $p(C) \sim C^{-2}$, the transition is continuous
but there is a critical point $L_c$ with a power-law
divergence of $dM/dL$ as $L \to L_c$

• for $C^2 p(C) \to 0$ when $C \to \infty$, the
transition is discontinuous, with a jump at a critical
value $L_c$ and a power-law divergence of $dM/dL$ as
$L \to L_c$.

Depending on the functional form of $p(c)$, Eq. (6)
can sometimes be solved analytically. Otherwise,
the fix-point of Eq. (6) can be found numerically,
either by iterating Eq. (5) or by finding the
zeros of Eq. (6) by Newton-Raphson iterations.
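
A sketch of the Newton-Raphson alternative, applied to g(f) = f - P(l/(1 - f)), whose derivative is g'(f) = 1 - p(l/(1 - f)) l/(1 - f)^2. The uniform capacity distribution on [0, 1] and the function names are again illustrative assumptions. Above the critical load there is no root below f = 1, and this simple version is not expected to converge.

```python
import math
from typing import Callable


def uniform_cdf(x: float) -> float:
    """CDF of capacities uniform on [0, 1] (illustrative choice)."""
    return min(max(x, 0.0), 1.0)


def uniform_pdf(x: float) -> float:
    """PDF of capacities uniform on [0, 1] (illustrative choice)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def newton_fixed_point(load_per_link: float,
                       cdf: Callable[[float], float],
                       pdf: Callable[[float], float],
                       f0: float = 0.0,
                       tol: float = 1e-12,
                       max_iter: int = 100) -> float:
    """Find a zero of g(f) = f - P(l / (1 - f)) by Newton-Raphson."""
    f = f0
    for _ in range(max_iter):
        x = load_per_link / (1.0 - f)
        g = f - cdf(x)
        dg = 1.0 - pdf(x) * load_per_link / (1.0 - f) ** 2
        step = g / dg
        f -= step
        if abs(step) < tol:
            return f
    raise RuntimeError("Newton-Raphson did not converge")


print(newton_fixed_point(0.2, uniform_cdf, uniform_pdf))
print((1.0 - math.sqrt(1.0 - 4.0 * 0.2)) / 2.0)  # closed form, uniform case
```

Starting from f = 0 below the critical load, the iterations approach the smaller root, i.e. the same fix-point that the direct iteration of Eq. (5) reaches, but in far fewer steps.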


4 RESULTS 


Commodities are defined as substitutable when they
can be used for the same aim; when commodities
are substitutable, they can be expressed in the same
units. An example of such commodities is electricity
and gas, since both can be used for domestic
heating. Hence, an increase in the cost of gas
(such as the one recently experienced by
Ukraine) could provoke stress on the electric
network of the country, since most customers will
possibly switch to the cheaper energy vector. To
account for such effects, we will extend the model
described by Eq. (5) to the case of several coupled
systems that transport substitutable commodities.

We will consider $n$ coupled systems, assuming that
when a system $a$ is subject to some failures, it
sheds a fraction $T_{ab}$ of the flow increase due to
such failures on system $b$. In other words, upon
failure system $a$ decreases its load by a quantity
$l_a f_a \sum_b T_{ab}$ and increases the load of all
systems $b \neq a$ by $l_a f_a T_{ab}$. Thus, the $n$
coupled systems are described by a set of $n$
equations of the form of Eq. (5)

$$f_a^{t+1} = P_a\!\left(\frac{l_a^{t}}{1 - f_a^{t}}\right) \qquad (7)$$

where $l_a^t$ is the load per link experienced by
system $a$ at the $t$th stage of the cascade and
$P_a(x) = \int_0^x p_a(c)\,dc$ is the cumulative of
the probability distribution function $p_a$ for the
capacities of the $a$th system. Equations (7) are not
independent, since the systems' coupling is reflected
by the dependence of $l_a^t$ on the fractions $f_b^t$
of failed links in all the other systems, i.e.

$$l_a^{t} = l_a\Big(1 - f_a^{t}\sum_b T_{ab}\Big) + \sum_b l_b f_b^{t}\, T_{ba} = l_a + \sum_b \mathcal{L}_{ab}\, l_b f_b^{t} \qquad (8)$$

where $\mathcal{L}_{ab} = (1 - \delta_{ab})\, T_{ba} - \delta_{ab} \sum_c T_{ac}$
has again the form of a Laplacian operator. Thus, the
full equations for $n$ coupled systems are

$$f_a^{t+1} = P_a\!\left(\frac{l_a + \sum_b \mathcal{L}_{ab}\, l_b f_b^{t}}{1 - f_a^{t}}\right) \qquad (9)$$
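
The coupled iteration can be sketched as follows. The load update follows the verbal description above: each system sheds l_a f_a T_ab to every other system b and receives l_b f_b T_ba in return. The uniform capacity distribution, treating a fully failed system as staying failed, and all names are assumptions made here for illustration.

```python
from typing import Callable, List, Sequence


def uniform_cdf(x: float) -> float:
    """CDF of capacities uniform on [0, 1] (illustrative choice)."""
    return min(max(x, 0.0), 1.0)


def coupled_cascade(loads: Sequence[float],
                    coupling: Sequence[Sequence[float]],
                    cdf: Callable[[float], float] = uniform_cdf,
                    tol: float = 1e-12,
                    max_iter: int = 100_000) -> List[float]:
    """Iterate the coupled mean field equations; coupling[a][b] plays the role of T_ab."""
    n = len(loads)
    f = [0.0] * n
    for _ in range(max_iter):
        f_next = []
        for a in range(n):
            if f[a] >= 1.0:
                f_next.append(1.0)  # assumption: a collapsed system stays collapsed
                continue
            others = [b for b in range(n) if b != a]
            shed = loads[a] * f[a] * sum(coupling[a][b] for b in others)
            received = sum(loads[b] * f[b] * coupling[b][a] for b in others)
            load_a = loads[a] - shed + received
            f_next.append(cdf(load_a / (1.0 - f[a])))
        if max(abs(x - y) for x, y in zip(f_next, f)) < tol:
            return f_next
        f = f_next
    return f


no_coupling = [[0.0, 0.0], [0.0, 0.0]]
symmetric = [[0.0, 0.5], [0.5, 0.0]]
print(coupled_cascade([0.2, 0.2], no_coupling))
print(coupled_cascade([0.2, 0.2], symmetric))
print(coupled_cascade([0.3, 0.05], no_coupling))
```

With equal loads the shed and received terms cancel, so the symmetric coupling reproduces the single-system fix-point, in line with the symmetric-solution argument for two identical systems.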


For simplicity, we will consider the case of
two identical systems with a uniform distribution
of link capacities and solve the fix-point of Eq.
(7) numerically. We show in Fig. (3) the cascading
behavior of two coupled systems; we observe
that, as in the single system case, transitions are
in the form of abrupt jumps, i.e. are first order.
Let's rewrite Eq. (9) in the case of symmetric
couplings $T_{12} = T_{21} = T$ and same probability
distribution for the capacities

$$f_1^{t+1} = P\!\left(\frac{l_1 (1 - T f_1^{t}) + l_2 T f_2^{t}}{1 - f_1^{t}}\right), \qquad f_2^{t+1} = P\!\left(\frac{l_2 (1 - T f_2^{t}) + l_1 T f_1^{t}}{1 - f_2^{t}}\right) \qquad (10)$$

If the two systems described by Eq. (10) are
stressed at the same pace (i.e. $l_1 = l_2 = l/2$),
we get the case

$$f_1^{t+1} = P\!\left(\frac{l}{2}\,\frac{1 - T\,\Delta f^{t}}{1 - f_1^{t}}\right), \qquad f_2^{t+1} = P\!\left(\frac{l}{2}\,\frac{1 + T\,\Delta f^{t}}{1 - f_2^{t}}\right)$$

with $\Delta f^t = f_1^t - f_2^t$; from the symmetric
solution $\Delta f = 0$ we see that the breakdown of
both systems happens at the same critical load as the
uncoupled systems. Such a situation is shown in the
left panel of Fig. (3).

In general, only one of the systems will be
the first one to break down (i.e. the fraction of
broken links jumps to $f = 1$); correspondingly, the
other systems will also experience a jump in the
number of broken links. Let's consider the symmetric
case described by equations (10) and suppose that
$l_1 > l_2$, so that system 1 is the first to break
down (i.e. $f_1 = 1$); hence, the equation for the
fix-point of the second system becomes

$$f_2^* = P\!\left(\frac{l_2 (1 - T f_2^*) + T\, l_1}{1 - f_2^*}\right)$$

i.e. the system behaves like a single system starting
with a renormalized load $\tilde{l} > l_2$. Thus, if
$\tilde{l} < l^c$, the critical value of Eq. (5),
system 2 will break down at higher values of the
stress. Such a situation is shown in the right panel
of Fig. (3).

Figure 3. Behavior of the number of failed nodes with
respect to the total stress $l = l_1 + l_2$ of the
systems (curves: System 1, System 2). For simplicity,
we present the case of two identical systems with a
flat distribution of link capacities and symmetric
couplings $T_{12} = T_{21} = 0.5$. We show the result
of increasing the total stress $l$ in the two systems
along the lines $l_1/l_2 = \text{const}$. Left panel:
the case $l_1/l_2 = 1.1$, where both systems are
subject to a similar stress while increasing $l$. In
such a case both systems break down together at the
same critical load $l_1^c = l_2^c$; in the region
$l > l_1^c = l_2^c$ both systems are failed. Right
panel: the case $l_1/l_2 = 4$, where system 1 is more
stressed than system 2 when increasing $l$. In this
case, the breakdown of system 1 at the critical load
$l_1^c$ induces a jump in the number of failures of
system 2, but system 2 is still able to sustain stress
and will break down only at higher values of $l$. With
respect to the $l_1 \approx l_2$ case, there is now a
region $l_1^c < l < l_2^c$ where only system 1 is
failed.

Figure 4. Phase diagrams of two identical coupled
systems with symmetric interactions
($T_{12} = T_{21} = T$). The plane of initial loads
$l_1$ and $l_2$ is separated into four different
regions by critical transition lines. The labels $B_i$
($i = 1, 2$) mark the areas where only system $i$
suffers systemic cascades ($f_i = 1$, $f_{j \neq i} < 1$),
while the label $B_{12}$ marks the area where both
systems suffer system wide cascades ($f_1 = f_2 = 1$).
The label S marks the (safe) area near the origin
where no systemic cascades occur. Left panel: the case
$T = 0$ corresponds to two uncoupled systems; thus,
each system suffers systemic failure at $l_i > l^c$
(where $l^c$ is the critical load for an isolated
system), and both systems are failed in the $B_{12}$
area corresponding to the quadrant
($l_1 > l^c$, $l_2 > l^c$). Central panel, right panel:
when couplings are introduced, each system is able to
discharge stress on the other one and the area S where
both systems are safe increases. On the other hand,
the area $B_{12}$ where both systems are failed also
increases.

In Fig. (4) we show the full phase diagrams of
two coupled systems while varying the coupling
among them. As discussed before, due to the
symmetry all the systems have a transition point
$(l^c, l^c)$ along the $l_1 = l_2$ line, where $l^c$
is the critical load of the single system. According
to the initial loads, we can distinguish an area S
near the origin where the system is safe and three
separate cascade regimes: $B_1$, $B_2$, where either
system 1 or 2 fails, and $B_{12}$, where both systems
fail. We notice that, by increasing the coupling among
the systems, both the area S where the two systems are
safe and the area $B_{12}$ where they fail together
grow; accordingly, the areas $B_i$ where only one
system fails shrink.


5 DISCUSSION 


In this paper we have introduced a model for cascade
failures due to the redistribution of flows upon
overload of link capacities. For such a model, we
have developed a mean field approximation both
for the case of a single network and for the case of
coupled networks. Our model is inspired by a possible
configuration for future power systems where
network nodes are the so-called energy hubs (Geidl
et al. 2007), i.e. points where several energy vectors
converge and where energy demand/supply can be
satisfied by converting one kind of energy into
another. Hubs condition, transform and deliver energy
in order to cover consumer needs (Favre-Perrod 2005).
In such configurations, one can alleviate the stress
on a network by using the flows of the other energy
vectors; on the other hand, transferring loads from
one network to another can trigger cascades that
can eventually backfire.

By analyzing the case of two coupled systems
and by varying the strength of the interactions
among them, we have shown that at low stresses
coupling has a beneficial effect, since some of the
loads are shed to the other systems, thus postponing
the occurrence of cascading failures. On the
other hand, with the introduction of couplings the
region where both systems fail together also
increases. The higher the couplings, the more the two
systems behave like a single one and the more the
area where only one system is failed shrinks. Notice
that our model in the present form does not apply to
islanding strategies in power systems, where some
sub-networks can even enhance their reliability upon
failure of part of the remaining system (Mureddu et
al. 2016); this subject deserves further
investigation.


6 CONCLUSIONS 


It is worth noting that while fault propagation
models do predict a general lowering of the threshold
for coupled systems (Wang et al. 2013), in the present
model a beneficial effect due to the existence of
the interdependent networks is observed for small
enough overloads, while the expected cascading
effects take place only for large initial
disturbances. This picture is consistent with the
observed phenomena for interdependent electric
systems. Moreover, the existence of interlinks among
different networks may increase their synchronization
capabilities (Martin-Hernandez et al. 2014).
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ABSTRACT: Design and correct dimensioning of oil and gas terminal facilities is often a challenge 
due to a variety of operational uncertainties as well as the volatility in supply and demand. Various
factors come into play to ensure smooth uninterrupted operation, as well as optimal and effective resource
utilization. This paper discusses the challenges of optimization and presents an integrated approach to
optimize terminal logistics and dimensioning, where parts of the objective function are solved by discrete
event simulation. A case study of the new Veidnes terminal, to be commissioned in the north of Norway,
is also presented. This terminal shall serve as a central hub for export of oil from the already producing
Goliat field and the upcoming offshore facilities in the Barents Sea, including Johan Castberg as well as
the later producers Alta Gotha and Wisting. A variety of factors including weather disturbances,
production profiles, shuttle tankers, jetties, etc. are included in the analysis to help provide a
decision-basis on the number and size of tanks required at the terminal.


1 INTRODUCTION 


Oil produced offshore is usually transported to 
terminals onshore either via pipelines or by oil 
tankers. These terminals typically have facilities 
for storage and export of oil, and some also have 
facilities for further treatment of oil, such as frac- 
tionation. Export of oil from the terminal can be 
done via pipeline, ship, train or trucks. In practice, 
an oil terminal represents a buffer in the middle 
of the supply chain, reducing the consequences of 
disturbances in the transportation. 

During the design process of such terminals, 
important decisions are made. One such decision 
is related to the dimensioning, where the difference 
between good and bad decisions can be worth tens 
or even hundreds of millions of dollars. Dimen- 
sioning of these terminals is difficult as the required 
capacity may vary due to high volatility in supply 
and demand. Tank storage space is expensive in 
terms of CAPEX (Capital Expenditures) but too 
little storage can turn out to be much more expen- 
sive in terms of deferred production and delays in 
the supply chain. This can lead to the design of an 
extra buffer, which in practice is expensive overca- 
pacity. Also, tank suppliers may have a tendency to 
recommend more volume than necessary. 

Despite the expense of bad decisions in design,
it is not uncommon to base such decisions on
experience, general knowledge, and gut feeling.
Experience is certainly always valuable, but is not
sufficient since each terminal is unique. Some sort
of simplified calculation is usually involved as
well in the design or planning phase. However, a
common misconception in the industry is that
complexities and uncertainties make the problem
impossible to fully approach mathematically.

This paper discusses how a mathematical model 
may be used to approach the optimization prob- 
lem. It discusses an integrated approach involving 
an objective function and discrete event simula- 
tion. The paper also describes a specific case to 
demonstrate the application of the mathematical 
approach for a Norwegian oil terminal. 


2 PROBLEM AND CASE DESCRIPTION 


2.1 The optimization problem 


Mathematical optimization constitutes a large
area of applied mathematics. It includes finding
the 'best' value of some objective function within
a given domain. The 'best' value depends on
the context but will typically be the one which
maximizes or minimizes the value of the objective
function.

Terminal dimensioning can be viewed as such an 
optimization problem, where the aim is to minimize 
overall costs over a defined period. Both CAPEX 
(Capital Expenditures) and OPEX (Operational 
Expenditures) are important to include. Another
important aspect is the revenue from exported oil.
Lost revenue due to deferred production as a con- 
sequence of shortage in storage capacity, can be 
regarded as an expense. Alternatively, the objective 
function can contain all revenue from exported oil, 
subtracting the CAPEX and OPEX, in which case 
the function should be maximized. In that case, the 
objective function would in a simplified form look 
something like this. 


$$\max_{\mathbf{x}} \; f(\mathbf{x}) - g(\mathbf{x}) - h(\mathbf{x}) \qquad (1)$$


where f = revenue from exported oil; g = CAPEX;
and h = OPEX. x is a vector of elements such as
number and size of offshore tanks, number and
size of shuttle tankers, pump capacities, etc.

CAPEX is usually the easiest function of the 
three, consisting of sums of unit costs multiplied 
by number of units. But it may be hard to express 
the cost as a continuous function, since e.g. tank 
producers offer a fixed selection of designs with a 
certain size. Uncertainties may also be present at 
an early stage, but these are quite small compared 
to the other parts of the objective function. 
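The CAPEX term can be illustrated with a minimal sketch. It assumes a hypothetical cost list and a hypothetical catalogue of tank sizes; none of the names or figures come from the Veidnes project. It shows the sum of unit costs times unit counts, and why a fixed catalogue turns cost into a step function of required volume.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostItem:
    """One line in a CAPEX estimate (names and figures are hypothetical)."""
    name: str
    unit_cost: float  # cost of one unit, in any consistent currency
    count: int        # number of units


def capex(items):
    """CAPEX as the sum of unit cost multiplied by number of units."""
    return sum(item.unit_cost * item.count for item in items)


def smallest_sufficient_size(required_volume, catalogue):
    """Pick the smallest catalogue tank size covering the required volume.

    Only the listed sizes can be bought, so the chosen size (and hence the
    cost) jumps in steps as the required volume grows.
    """
    feasible = [size for size in catalogue if size >= required_volume]
    if not feasible:
        raise ValueError("no catalogue size covers the required volume")
    return min(feasible)


if __name__ == "__main__":
    items = [CostItem("storage tank", 10.0, 3), CostItem("jetty", 50.0, 1)]
    print(capex(items))
    print(smallest_sufficient_size(420_000, [300_000, 400_000, 500_000]))
```

In this toy catalogue, a requirement of 420,000 barrels already forces the 500,000 barrel design, which is one way overcapacity can arise.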

OPEX is more complex as these are running 
costs over a long period of time, i.e. the expected 
lifetime of the terminal, which in itself is a large 
uncertainty factor. On the other hand, OPEX may
not be strongly influenced by the size of many
of the elements in x. For example, maintenance of a
500,000 barrel tank is not necessarily much more expensive
than maintenance of a 400,000 barrel tank. But for
both, the Life Cycle Cost (LCC) is associated with
high uncertainty.

The revenue from exported oil or the revenue 
loss from deferred export due to shortage in stor- 
age capacity, is the most difficult parameter to 
assess. There are many factors associated with 
large uncertainties. In order to understand what 
could cause deferral in oil production or export, 
we must understand the dynamics of the whole 
supply chain. These dynamics contain a lot of 
time-dependent and stochastic elements and are 
difficult to represent with equations. 

Consider the following example: An oil tanker 
is on its way to an offshore installation to load oil. 
The approximate loading operation and sailing 
times are known. However, it turns out that the 
waves are too high for a safe loading operation and 
the tanker must wait. If we are unlucky, the bad 
weather period lasts for so long that the oil storage 
tank on the platform gets full and there is a forced 
production shutdown. When the weather
improves, the tanker loads and heads to the terminal.
This late arrival may in turn delay the export,
or cause export of off-spec oil quality, because the export
tanker was depending on the load from the shuttle
tanker getting there in time.


One could say that the weather was the cause of 
two financial loss factors; the offshore disruption 
and the delayed or off-spec export load. However, 
this could have been avoided with e.g. larger tanks 
offshore and onshore. Weather is just one of sev- 
eral factors influencing the oil export, and as the 
example illustrates, the dynamics or timing is an 
important element. To express revenue as part of 
the objective function is thus a very challenging 
task. As will be discussed in Section 3, discrete
event simulation is a suitable way to solve the
problem, since the dynamics are well captured and
evaluating the objective function is largely reduced to
drawing random numbers from given distributions.


2.2 Case description 


2.2.1 Veidnes terminal development 

Statoil are building a new terminal in Veidnes in 
the northern parts of Norway, which will receive 
oil from various offshore installations in the Bar- 
ents Sea by a fleet of winterized shuttle tankers. Oil 
stored at Veidnes will be shipped by export tank- 
ers to various destinations in continental Europe. 
The sizing and number of tanks located at Veidnes 
is important, since overcapacity is expensive and 
insufficient storage may lead to expensive losses. 
Figure 1 contains an overview of the Veidnes terminal
with associated offshore fields.

Optimal sizing and number of tanks is influ- 
enced by the sizing and number of shuttle tank- 
ers and export tankers, as well as several other 
parameters. In addition, exported oil is required to 
have a certain quality, measured in API (American 
Petroleum Institute) gravity value. The storage and 
mixing of different oil qualities, as well as require- 
ments for a certain API value of exported oil, also 
influence the optimal size and number of tanks. 

One of the biggest challenges is the volatile 
nature of supply. In the case of Veidnes, one of the 
offshore fields, Goliat, has just started producing. 
The plan for Veidnes is to start operation when 
also Johan Castberg, with larger oil reserves than 
Goliat, starts producing. Later, additional fields,
Alta/Gotha and Wisting, will start production and
the winterized shuttle tankers will transport oil
from four different fields to Veidnes.

Figure 1. Location of the Veidnes terminal and associated oil fields.

In addition to the four fields starting produc- 
tion at different times, their production profiles are 
dynamic, and their forecasts are associated with a 
high degree of uncertainty. There are also other 
reserves in the area, which could be relevant for oil 
production and shipping to Veidnes at some point 
in the future. However, these factors are not con- 
sidered in the first stage of designing the terminal. 


2.2.2 Phased development 

When the Wisting and Alta/Gotha fields begin 
production, the terminal needs to be expanded. 
For now, we consider the three phases listed in 
Table 1. 

In a scenario like this, there are two main 
options. The first is to account for the highest sup- 
ply somewhere in phase 3 (when the sum of all four 
fields reaches peak production) and dimension the 
terminal capacity accordingly from the start. In 
this case, there is excess terminal capacity in the 
first two phases. The second option is to build a 
smaller terminal at first and then expand capacity 
in the later phases. Which of the options is best 
also requires analysis, since it can be argued that 
building a large terminal from the start is cheaper 
than doing it in steps. But there is also a potential 
NPV (Net Present Value) gain in delaying some 
of the construction for later. This evaluation was 
already made on a high level, with the decision 
to construct in two phases, one for the first two 
phases and then add additional tanks and another 
jetty for the third phase. Since some decisions had 
been made already, the case study was both a veri- 
fication, as well as an optimization study. 
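The trade-off between building everything at once and deferring part of the construction can be sketched with a simple discounting calculation. The costs, the delay and the discount rate below are hypothetical placeholders, and the sketch ignores any premium for building in steps unless it is included in the later-stage cost.

```python
def present_value(amount, years_from_now, discount_rate):
    """Discount a future amount back to today."""
    return amount / (1.0 + discount_rate) ** years_from_now


def compare_build_strategies(full_cost_now, first_stage_cost,
                             later_stage_cost, delay_years, discount_rate):
    """Return the present-value cost of both options (hypothetical inputs)."""
    phased = first_stage_cost + present_value(
        later_stage_cost, delay_years, discount_rate)
    return {"build_all_now": full_cost_now, "phased": phased}


if __name__ == "__main__":
    # Hypothetical: phased construction costs more in nominal terms
    # (100 + 60 = 160 versus 150) but the second stage is paid 2 years later.
    print(compare_build_strategies(150.0, 100.0, 60.0, 2, 0.10))
```

With these made-up numbers the phased option comes out slightly cheaper in present-value terms even though its nominal cost is higher, which is the potential NPV gain referred to above.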

In other words, for larger planned increases in 
supply, such as the start-up of a new producing 
field or plateau production of fields, the existing 
structure can be increased. For smaller and highly 
uncertain increases in production profile, it is not 
feasible to add to existing capacity. 

For predicted reductions in supply such as end 
of plateau production, reductions in capacity are 
of course not feasible either, as it hardly affects the 
costs, since the investment has already been made. 


Table 1. Phases considered for the Veidnes terminal project.

Phase   Period             Involved fields
1       2022 Q4-2023 Q4    Goliat, Johan Castberg
2       2023 Q4-2024 Q3    Goliat, Johan Castberg, Alta/Gotha
3       2024 Q3-2030 Q4    Goliat, Johan Castberg, Alta/Gotha, Wisting


Figure 2. Illustration of supply, uncertainties and
phased development. (Legend: planned Veidnes capacity;
predicted supply profile, mean value; predicted supply
profile, P10 and P90 values.)


Figure 2 illustrates the main elements as dis- 
cussed. The planned capacity should account for 
the peak supply in all phases. But planned capac- 
ity must take uncertainties in supply into account. 
The optimization approach consists of finding 
the right balance between minimizing the risk and 
associated financial loss of too little capacity and 
the investment in sufficient capacity, even though 
overcapacity is inevitable in certain periods. 


3 APPROACH 


3.1 Discrete event simulation 


As mentioned in Section 2.1, the dynamic nature 
of the terminal activities and resulting oil export, 
presents a significant challenge to express the 
f(x) function from equation 1. Monte Carlo 
simulation, however, is suitable for dealing with 
dynamics. The progression of time is simulated 
and stochastic events can be dealt with by condi- 
tions implemented in the simulation algorithm. 

Monte Carlo simulation has the advantage of 
simplifying a mathematical problem significantly. 
If a stochastic element is represented by a prob- 
ability distribution function, the simulator draws 
just one number at a time from this distribution, 
making calculations much easier. But if this proc- 
ess is repeated a sufficient number of times, the 
numerical result still approaches the theoretical 
result. Hence, the challenge shifts from mathe- 
matical complexity to computational power in the 
case of Monte Carlo simulation. This challenge 
becomes smaller as the computational power in 
general keeps increasing. 
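The convergence argument can be illustrated with a small sketch: single draws from a distribution are averaged, and the average approaches the theoretical mean as the number of draws grows. The exponential distribution and its 2-hour mean are arbitrary assumptions chosen for illustration only.

```python
import random

TRUE_MEAN_HOURS = 2.0  # hypothetical mean of the stochastic element


def estimate_mean_delay(n_draws, seed=42):
    """Average n_draws samples from an exponential distribution."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        # expovariate takes the rate, i.e. 1 / mean
        total += rng.expovariate(1.0 / TRUE_MEAN_HOURS)
    return total / n_draws


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, round(estimate_mean_delay(n), 3))
```

The estimate from a handful of draws can be far off, while a large number of draws lands close to the theoretical value; the price is computation time rather than mathematical effort.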

Discrete event simulation is a branch of Monte
Carlo simulation which is suitable for solving problems
like these, because they are characterized by
events occurring at discrete points in time. Events
are generated at simulation start or initialization. 
These are exemplified by the arrival of a shuttle 
tanker, the tank getting full or empty based on cur- 
rent loading/unloading rate, the shift from waves 
under the critical threshold value to above the criti- 
cal threshold value, etc. These events are sorted in 
an event queue, according to time of occurrence. 
Instead of moving forward in time in pre-defined 
small incremental steps, which is the traditional 
Monte Carlo approach, the simulator skips to the 
next event in the queue. At the event, calculations 
are done and new events drawn from distributions, 
which might change the existing event queue. 
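The event-queue mechanism described above can be sketched in a few lines. This is a generic illustration using a priority queue, not the ExtendSim implementation; the event names and the fixed 12-hour offloading time are hypothetical.

```python
import heapq
import itertools


class EventQueue:
    """Minimal discrete event engine: events are kept sorted by time."""

    def __init__(self):
        self._heap = []
        self._tie_breaker = itertools.count()  # keeps insertion order on ties
        self.now = 0.0

    def schedule(self, time, name):
        heapq.heappush(self._heap, (time, next(self._tie_breaker), name))

    def run(self, handler, until):
        """Jump from event to event until the queue is empty or time is up."""
        log = []
        while self._heap and self._heap[0][0] <= until:
            time, _, name = heapq.heappop(self._heap)
            self.now = time  # the clock skips straight to the next event
            log.append((time, name))
            handler(self, time, name)  # may schedule new events
        return log


def demo_handler(queue, time, name):
    # Hypothetical rule: every arrival triggers an offloading of 12 hours.
    if name == "shuttle_arrival":
        queue.schedule(time + 12.0, "offload_done")


def demo():
    queue = EventQueue()
    queue.schedule(5.0, "shuttle_arrival")
    queue.schedule(30.0, "shuttle_arrival")
    return queue.run(demo_handler, until=100.0)


if __name__ == "__main__":
    for time, name in demo():
        print(time, name)
```

In a full model the handler would draw durations from distributions instead of using a constant, and could also cancel or reschedule events already in the queue.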

In a case like this, there are elements of both discrete
events and continuous flow. When e.g. a shuttle
tanker offloads at Veidnes, there is a steady flow
into the tank and the tank level increase is continu- 
ous in time. However, it is not necessary to monitor 
the tank level continuously. With a discrete event 
simulation model, the next event can be the time 
when the tank is full or the shuttle tanker is empty. 
If the flow rate varies in time, e.g. full flow at the 
start of the offloading operation and then slowing 
down when the tank approaches full, the point in 
time changing to a lower flow rate can be regarded 
as an event. 
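A continuous transfer can be reduced to a single future event as follows. The sketch assumes a constant flow rate during the interval considered; the function name and the volumes are hypothetical.

```python
def next_transfer_event(tank_level, tank_capacity, cargo_remaining, flow_rate):
    """Return (hours_until_event, event_name) for an ongoing offloading.

    With a constant flow rate, the level need not be tracked continuously:
    it is enough to compute which limit is reached first.
    """
    if flow_rate <= 0:
        raise ValueError("flow_rate must be positive")
    hours_to_full = (tank_capacity - tank_level) / flow_rate
    hours_to_empty = cargo_remaining / flow_rate
    if hours_to_full < hours_to_empty:
        return hours_to_full, "tank_full"
    return hours_to_empty, "tanker_empty"


if __name__ == "__main__":
    # 300,000 bbl in a 500,000 bbl tank, 150,000 bbl left on the tanker,
    # pumping at 10,000 bbl per hour (all figures hypothetical).
    print(next_transfer_event(300_000, 500_000, 150_000, 10_000))
```

A change in flow rate is handled by scheduling it as an event of its own and calling the function again with the new rate.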

In models with both elements of continuous 
flow and discrete events, it is generally more effi- 
cient to create a discrete event simulation model 
and include the continuous flow elements, than 
vice versa. 

The difficult part of the revenue function is two- 
fold. It is expressed as a product of the amount 
of exported oil and the oil price. The problem 
with the oil price is simply the fact that it is highly 
uncertain, looking over a future 10-year period. It 
should be discounted as well and the discount rate 
adds to the uncertainty. 
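One possible way to write the discounted revenue, offered as an assumption rather than the formulation used in the project, is a sum over years of price times exported volume. Here $p_t$ is the oil price in year $t$, $q_t(\mathbf{x})$ the exported volume produced by the simulation for configuration $\mathbf{x}$, $r$ the discount rate and $T$ the horizon; all four symbols are introduced for illustration.

```latex
f(\mathbf{x}) \;\approx\; \sum_{t=1}^{T} \frac{p_t \, q_t(\mathbf{x})}{(1+r)^{t}}
```

Both $p_t$ and $r$ are uncertain inputs, while $q_t(\mathbf{x})$ is the part that has to come from the simulation model.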

The difficulties about the amount of exported 
oil have been explained in Section 2.1. It is a time- 
dependent function which is difficult to express due 
to many parameters and high complexity. Hence, 
this part is covered by the simulation model. 

In order to assess the amount of exported oil, 
it was important to consider the supply chain and 
evaluate all significant influencing factors. The sim- 
ulation model includes the following parameters: 


— Production profiles 

— Capacity of pumps, tanks and tankers 

— Number of tanks, jetties, and tankers 

— Oil quality (API grade) 

— Weather 

— Criteria for critical wave heights and oil quality 
blend 

— Travel speeds and distances 

— Berthing and de-berthing times 


— Export strategy (blended or segregated oil 
quality) 


Some of these parameters are dynamic and 
change during the simulation. For instance, the
number of jetties increases from one to two at a certain
point in time, when more offshore fields start
producing.


3.2 Simulation tool and algorithms 


The simulation software, ExtendSim, was used to 
create and run the model of the Veidnes terminal. 
It belongs to the class of generic simulation tools, 
not specifically built for a particular industry or 
simulation paradigm. These are typically very flex- 
ible and can be used for almost any modelling of 
production, processing, logistics, resource manage- 
ment, etc. or combinations of these. 

ExtendSim consists of different blocks, which 
perform various functions, e.g. drawing a random 
number from a distribution, creating an item, 
assigning an attribute to an item, performing an 
activity, acting as a storage tank, and even imple- 
menting a custom-made function. These existing 
blocks cover the necessary functionality for the 
basics of the model, i.e. producing oil offshore, 
loading shuttle tankers, transporting to Veidnes 
and offloading, storage at Veidnes, loading export
tankers and transporting the oil to various destinations.
It also enables the specification of oil quality 
requirements and the inclusion of a weather model 
and its interaction with loading and unloading 
operations. 

Despite much existing functionality, the custom- 
made function block has been used extensively for 
implementing algorithms for specific operational 
rules. For instance, rules must be implemented to 
determine which offshore field the next available 
shuttle tanker will go to and when. This requires 
some calculations and functions. In this case, it was 
decided that the next available shuttle tanker (the 
one that just finished offloading at Veidnes) goes 
to the offshore field that requires offloading first. 
Loading starts when the tank will be full in x hours,
where x is a parameter that can be varied in the sensitivity cases.
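The dispatch rule can be sketched as below. It interprets "requires offloading first" as the field whose storage tank will fill up soonest at the current production rate, which is an assumption; the field data and the name `lead_hours` (the x of the text) are hypothetical.

```python
def hours_until_full(field):
    """Time until the offshore storage tank is full at the current rate."""
    return (field["capacity"] - field["level"]) / field["production_rate"]


def pick_next_field(fields):
    """Send the next available shuttle tanker to the most urgent field."""
    return min(fields, key=hours_until_full)


def hours_until_loading_starts(field, lead_hours):
    """Loading starts when the tank will be full in lead_hours hours."""
    return max(0.0, hours_until_full(field) - lead_hours)


if __name__ == "__main__":
    fields = [
        {"name": "A", "capacity": 1000.0, "level": 400.0, "production_rate": 10.0},
        {"name": "B", "capacity": 800.0, "level": 600.0, "production_rate": 5.0},
    ]
    target = pick_next_field(fields)
    print(target["name"], hours_until_loading_starts(target, lead_hours=12.0))
```

A real rule set would also account for sailing time and for tankers already en route.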

Another challenging issue which needed hard- 
coding of algorithms, is the rule-set of loading and 
offloading at Veidnes. In order to know if there is 
capacity to receive or even to know if an export 
tanker can be filled with the right API blend, it 
is necessary to keep track of the total volume of 
each API value and the used and potential volume 
for each tank at Veidnes. This is fairly straight-for- 
ward. However, with the introduction of a second 
jetty in phase 3, one can easily end up in a situation 
where e.g. an export tanker is about to load when 
the shuttle tanker arrives at the other jetty. Hence,
it is not enough to know the status at Veidnes;
one must also know the remaining volume and API grade of
a tanker at the jetty in the middle of a loading/
offloading process. Operational rules implemented
in the algorithm are illustrated in Figure 3.

Figure 3. Algorithm for prioritization of shuttle and export tankers.
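Checking whether a blend meets an API requirement can be sketched as follows. API gravity is defined from specific gravity (SG) as API = 141.5/SG - 131.5, so the sketch blends on specific gravity by volume and converts back. It assumes ideal mixing with additive volumes; the function names, parcels and specification limits are hypothetical.

```python
def api_to_sg(api):
    """Specific gravity from API gravity."""
    return 141.5 / (api + 131.5)


def sg_to_api(sg):
    """API gravity from specific gravity."""
    return 141.5 / sg - 131.5


def blend_api(parcels):
    """API gravity of a blend of (volume, api) parcels.

    Assumes volumes are additive, so the blend's specific gravity is the
    volume-weighted mean of the parcels' specific gravities.
    """
    total_volume = sum(volume for volume, _ in parcels)
    if total_volume <= 0:
        raise ValueError("total volume must be positive")
    blended_sg = sum(volume * api_to_sg(api) for volume, api in parcels) / total_volume
    return sg_to_api(blended_sg)


def meets_spec(parcels, min_api, max_api):
    """True if the blended cargo falls inside the required API window."""
    return min_api <= blend_api(parcels) <= max_api


if __name__ == "__main__":
    cargo = [(100_000, 30.0), (100_000, 40.0)]
    print(round(blend_api(cargo), 2), meets_spec(cargo, 34.0, 36.0))
```

Note that an equal-volume blend of API 30 and API 40 comes out slightly below 35, since the blending is linear in specific gravity rather than in API.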

Yet another operational rule must deal with the 
loading and offloading in case of bad weather. 
Critical wave height limits are typically established 
for given vessels. The critical limit for start of load- 
ing operation is lower than the limit for abortion, 
which may occur if the wave height increases dur- 
ing the operation. If the operation is aborted due 
to increased wave height during the operation, 
exceeding the critical limit, it must also be decided 
if one should take the current load to the destina- 
tion or stand-by for a new weather window to com- 
plete the operation. 
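The two-limit rule amounts to a hysteresis: an operation may only start below the lower limit, but once started it continues until the higher abort limit is exceeded. A minimal sketch, with hypothetical limits and wave heights, and without the decision on what to do after an abort:

```python
def operating_states(wave_heights, start_limit, abort_limit):
    """Return, per time step, whether the loading operation is running.

    start_limit: waves must be below this to start (or resume) loading.
    abort_limit: an ongoing operation is aborted above this height.
    """
    if start_limit > abort_limit:
        raise ValueError("start_limit must not exceed abort_limit")
    operating = False
    states = []
    for height in wave_heights:
        if operating:
            if height > abort_limit:
                operating = False
        elif height < start_limit:
            operating = True
        states.append(operating)
    return states


if __name__ == "__main__":
    waves = [2.0, 3.8, 4.2, 5.0, 4.2, 3.0]  # significant wave height, metres
    print(operating_states(waves, start_limit=3.5, abort_limit=4.5))
```

In the example, a 4.2 m sea state is tolerated while loading is under way but is too high to restart after the abort.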

As for the weather itself, we obtained weather 
data for the last 25 years. There are several pos- 
sibilities to apply these data to model the weather 
over the next 15 years. A common approach would 
be to fit the duration of the good weather and bad 
weather periods to probability distributions and 
draw a number from those distributions during the 
simulations. Good and bad weather would then 
be defined by the critical wave height limits. In the 
Veidnes project, another approach was used. The 
weather data are stored in a database linked to the 
simulation model. At the beginning of a new year 
in the simulation, a random historical weather year 
is drawn. Then the weather in the model replicates 
the weather in that particular year. 
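The resampling of historical weather years can be sketched as below. The year range and the function name are hypothetical, and the actual database link is not modelled; the sketch only shows the draw of one historical year per simulated year.

```python
import random


def draw_weather_years(historical_years, n_simulated_years, rng):
    """Pick one historical year, with replacement, per simulated year."""
    if not historical_years:
        raise ValueError("no historical weather years available")
    return [rng.choice(historical_years) for _ in range(n_simulated_years)]


if __name__ == "__main__":
    history = list(range(1993, 2018))  # 25 hypothetical record years
    rng = random.Random(7)
    print(draw_weather_years(history, 15, rng))
```

Replaying whole years keeps seasonal patterns and the persistence of storms intact, which fitted duration distributions may not reproduce as directly.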


3.3 Sensitivity cases 


In order to evaluate the objective function, f(x)
must be simulated, while g(y) and h(z) can be
calculated. However, the vectors x, y and z
contain many of the same elements, so they cannot
be optimized independently. For instance, the
number and size of the tanks at Veidnes contribute
to CAPEX, OPEX and revenue. Because we
simulate revenue rather than express f(x)
analytically, it is more difficult to genuinely find
an optimum: the function can only be sampled,
not differentiated. When simulating the
oil export, a specific tank configuration must be
selected. Hence, each simulation produces results
for a specified vector, x. In theory it can take an
infinite range of values, making the task of finding
an optimal value impossible by just selecting
arbitrary values for x and running simulations.
In practice, there are of course restrictions with 
regards to many of the parameters. Not only is 
the tank size restricted to a positive value, which 
cannot exceed the size of the largest tank available 
on the market, but the tank supplier usually has a 
finite set of products to offer. 

Still, there are many parameters and thus a
large number of combinations of values, even if
each parameter can take only a finite set of values
or at least values in a finite interval. Simulation thus
impedes the calculation of an optimum value somewhat,
but this is the price to pay in order to be able to
realistically model the revenue from exported oil.

In the Veidnes case, sensitivity analyses are eas- 
ily done, with all essential model input parameters 
listed in MS Excel with a link to the simulation 
model. Each Excel worksheet represents a unique 
sensitivity case. By changing parameter values 
and running sequential simulations for all cases, 
results are exported back to Excel for each case for 
comparison. 

There are two main cases dealing with the differ- 
ent oil qualities. One alternative is to keep all API 
grades segregated. The other alternative is to blend 
them to a mix with a given API. There is also a ques- 
tion of how to blend, since it can be done either in 
the Veidnes tanks or in the export tanker. Only the 
second option has been considered in this analysis, 
i.e. the tanks at Veidnes never contain more than 
one oil quality and the blending is done when load- 
ing the export tanker from different tanks. 

Except for the ‘segregated’ and ‘blended’ cases, 
all other sensitivity cases are defined by changes in 
parameter values. Thus, the model becomes a tool 
for Design of Experiments (DOE), to contribute 
to the economic optimization. That is, the model 
is suitable for analyzing the effect of changes in 
all the parameters listed in Section 3.1. The DOE 
includes all of these, as well as the timing for the 
time-dependent parameters. 

With the objective function calculated for each
sensitivity case rather than derived analytically, we obtain a
set of results. Of these we can select the one which
gives the highest profit. If we are uncertain about 
this being the optimal value, new simulations can be 
run for values in the vicinity of the previous ones. 
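Selecting the best of a finite set of sensitivity cases can be sketched as an exhaustive search over a parameter grid. The function `toy_profit` is a hypothetical stand-in for a full simulation run plus the CAPEX and OPEX calculation, and the grid values are made up.

```python
import itertools


def best_case(parameter_grid, evaluate):
    """Evaluate every combination in the grid and return (case, profit)."""
    names = sorted(parameter_grid)
    best = None
    for values in itertools.product(*(parameter_grid[name] for name in names)):
        case = dict(zip(names, values))
        profit = evaluate(case)
        if best is None or profit > best[1]:
            best = (case, profit)
    return best


def toy_profit(case):
    """Hypothetical profit: revenue saturates at demand, cost grows linearly."""
    capacity = case["tanks"] * case["tank_size"]
    revenue = 10 * min(capacity, 900)
    cost = 2 * capacity
    return revenue - cost


if __name__ == "__main__":
    grid = {"tanks": [2, 3, 4], "tank_size": [200, 300]}
    print(best_case(grid, toy_profit))
```

Refining the search then means building a new, denser grid around the best case and running the simulations again.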


4 RESULTS 


Detailed results of the case study are not presented 
in this section due to confidentiality. Also, only 
preliminary results have been produced so far, due 
to high uncertainties, especially in the economic
parameters. Several interesting observations have 
been made, however. First off, it was observed that 
the assumed tank volume at Veidnes is sufficient 
and could even be somewhat reduced, thus reducing 
CAPEX, if one is willing to accept a few deviations. 
That is, in the segregated oil quality case, one must 
accept that a small fraction of export tankers is not 
fully loaded. For the blended case, one must accept 
that a small fraction of export tanker loads is off- 
spec with respect to oil quality. Due to variations in 
the production profiles for the different fields and 
the batch-like nature of the supply to Veidnes, situ- 
ations like these are difficult to avoid entirely. If one 
has zero tolerance of these issues, the number of Vei- 
dnes tanks must increase significantly, which gener- 
ally will result in over-capacity except for those few 
instances. In order to conclude, we need to obtain 
the cost associated with off-spec tanker loads. 

A few offshore disruptions or so-called tank tops 
are also expected over a 9-year period. The simula- 
tion model verified that the planned number of shut- 
tle tankers is sufficient and disruptions are not due to 
late arrivals, but solely long periods of bad weather. 
The only way to avoid these, is changes in design of 
shuttle tankers and loading facilities, enabling higher 


tolerance for waves. However, the number of disrup- 
tions seemed acceptable to the project. 

In the base case, it was assumed that an increase
to three shuttle tankers and two jetties was needed
in phase 3. This seems to lead to some overcapacity.
For instance, average jetty occupation is
just below 30%. It would seem sufficient with one
jetty occupied 60% of the time. However, this aver- 
age is calculated over the period, 2022 to 2030. In 
the peak year, 2026, the jetty occupation is close to 
40%. With one jetty this would be 80% and lead to 
much waiting time. 
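Why 80% occupation is problematic while 60% is not can be illustrated with a textbook queueing result. The sketch assumes a single-server M/M/1 queue (random arrivals, exponentially distributed jetty times), which is a simplification and not the model used in the study. With occupation $\rho$ and mean jetty time $1/\mu$, the mean waiting time $W_q$ relative to the mean jetty time is:

```latex
\rho_{\text{one jetty}} = 2 \times 0.40 = 0.80, \qquad
\frac{W_q}{1/\mu} = \frac{\rho}{1-\rho}
\quad\Rightarrow\quad
\frac{0.60}{1-0.60} = 1.5, \qquad \frac{0.80}{1-0.80} = 4.0
```

Under this assumption, moving from 60% to 80% occupation raises the average wait from 1.5 to 4 times the mean jetty time, so waiting grows much faster than occupation.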


5 CONCLUSIONS 


This paper has demonstrated that mathematical 
optimization of terminal dimensioning and logis- 
tics is difficult but possible to a large extent. Drivers 
of production deferral in a complex supply chain 
are difficult to identify using analytical approaches, 
but discrete event simulation is a helpful tool as it
is suitable for capturing the dynamics and reducing
mathematical complexity. This does, however,
require a flexible tool, allowing for algorithms to 
represent operational rules in a realistic way. 

Even though results are preliminary, there are 
several key learning points for the project. E.g. 
optimum tank configuration depends on both the 
blending strategy and the size of the export tanker. 

It has been demonstrated that results from simulation
of DOE cases lead to very valuable insight by quantifying
important parameters and revealing how they
are influenced by changes in input values. Although
a single optimum is difficult to conclude from these
results alone, they provide a solid basis for
cost-benefit evaluations.


An integrated Bayesian network and cost-benefit analysis model for
blowout preventer configuration selection in deepwater offshore fields


E.M. Enjema, M. Shafiee & A. Kolios 
Cranfield University, Bedford, Bedfordshire, UK 


ABSTRACT: Due to the capital-intensive nature, limited supply quantities and uncertain viability
of other emergent energy sources, among several other setbacks, huge importance continues
to be placed on Blowout Preventers (BOPs), the principal defense mechanism against blowouts during
any drilling/workover operation in the oil and gas sector. Particularly after the Macondo disaster,
BOPs have been the center of regulatory change and sector development. BOP availability and reliability
become even more important as drilling advances into deep and ultra-deep water offshore fields. The
BOP configuration choice for such variable environments will have far-reaching consequences. Reliability,
though vital, is only one of several criteria that operators must use for determining
the most cost-effective configuration, as the cost of accidents in deeper waters increases proportionately.
In the current paper, an integrated framework for the selection of the most appropriate BOP configuration
in deep and ultra-deep water conditions is proposed. The framework captures all evaluation criteria
such as BOP reliability, handling/deployability, overall weight and CAPEX/OPEX ratio. Appropriate
mathematical and evaluation tools such as Bayesian Network (BN) and Lifecycle Cost Analysis (LCCA)
are employed to evaluate different configurations. The models are applied to commonly used Class
VII subsea BOPs in deeper waters. The results indicate that configuration 1 (with 2 annular, 2 pipe rams,
1 blind shear ram, 1 casing shear ram) is slightly less reliable than configuration 2 (with 1 annular, 2 pipe
rams, 1 blind shear ram, 2 casing shear rams); however, the operation and maintenance (O&M) costs are
higher for the latter configuration. Our framework can serve as a valuable decision-making tool for BOP
stakeholders as varying facets of information regarding the device are obtained.


1 INTRODUCTION 


The increasing world energy demand has made
the Blowout Preventer (BOP)
a crucial and indispensable complex technical
system. Coupled with improved safety awareness
and increasingly stringent environmental protection
policies, the safe operation of the device cannot
be overemphasized. However, depletion of shallow
oil reserves has forced drilling and oil exploration
into deeper, erratic sea environments. Such
deep and ultra-deep water developments rely on
new technology, which is yet to be field proven.
This brings increased uncertainty related to the occurrence
of unforeseen events, as well as higher capital expenditure,
costs of production interruption and subsea
intervention costs (Enjema et al., 2017). A combination
of unpredictable and erratic environmental
conditions as well as increased associated costs is a
major challenge in this sector.

Several studies have been carried out on different
individual aspects of the BOP system, including
its reliability (see, e.g. Cai et al., 2012; Holand
& Skalle, 2011; Holand, 2001), cost analyses (see,
e.g. American Petroleum Institute (API), 2015),
etc. Moreover, aspects such as maintenance, repair
and inspection, or configuration and classification
are well regulated and documented by regulatory
bodies. However, there is great dependence and
interconnectivity among these aspects, and evaluating
them holistically provides a better view of the
entire system's functioning and operation.

Given the above-mentioned research gap, this 
study proposes a framework that considers many 
of these factors simultaneously and the interrela- 
tionship between them. The model is comprehen- 
sive and built upon modern intuitive techniques 
such as Bayesian Network (BN) for technical 
reliability analysis, Cost Benefit Analysis (CBA) 
for economic analysis and eventually, a compari- 
son scheme for selection of the best solution. 
The proposed approach is validated with a case 
study of a class VII BOP system and the results 
are subsequently discussed and evaluated. The 
generic nature of the proposed framework makes 
it applicable to various other complex engineering 
systems. 

The rest of the paper is organized as follows. Sec- 
tion 2 presents the selection framework developed in 
this research. In Section 3, the model is applied to a 
case study and the results are analysed in Section 4. 
Finally, the research is concluded in Section 5.


2 PROPOSED FRAMEWORK


In this section, a conceptual framework is developed
for the purpose of analysing and selecting a
suitable BOP configuration for varying operational
conditions. Comparative analyses of the reliabilities
of various system configurations under these conditions
and a cost-benefit evaluation of the configurations
are performed. This simplifies decision making based
on economies of scale and suitability for particular
fields of operation. Information used
in the concept development emanates from published
literature in the offshore oil and gas sector,
face-to-face semi-structured interviews and correspondence
with BOP experts. The proposed framework,
as shown in Figure 1, consists of three main
phases. Phase 1 is a preparatory stage in which the 
premise for selection is determined. The system 
requirements, functionality and safety levels are 
investigated. This provides further information on 
the related parameters involved in operating the 
system. Evaluating and discretising the operating 
condition(s) is achievable at this stage. All related 
correlations are then analysed and data is collected. 
Phase 2 is the core of the framework, in which math- 
ematical and technical determination of all correla- 
tions related to BOP selection is performed. The 
final phase captures the comparative analyses and 
selection process. The entire framework will provide 
a modern comparative and selection tool, given 
several interrelated parameters. Key tasks in each 
phase are described in the following subsections. 


Figure 1. The proposed integrated framework.


2.1 Preparation


2.1.1 Setting premise
Setting the ground rules for the technical, operational
and economic aspects is the first step. BOPs
are deemed the most safety critical component in
a driller’s toolbox but also the single largest agent
of unproductive time (Sattler, 2013). BOPs are no
doubt complex, multi-configuration and multi-phase
devices, synchronising several individual
components of hydraulic, electrical and mechanical
nature. Aside from the failsafe function of monitoring
and maintaining well integrity, the BOP system’s
primary functions are:


— to confine or seal off well fluids in the well bore 

— to provide means of adding or withdrawing 
controlled volumes of fluid to and from the well 
bore. 

— to shut or ‘kill’ the well and seal the wellhead. 


Typical BOPs are huge pieces of equipment, with
some current units weighing about 450 tons and standing
60 ft. in height (Tulimilli et al., 2014). Principal factors
affecting safety and reliability of BOPs, particularly
in high pressure high temperature (HPHT)
operations, include (Montgomery, 1995):


— Manufacturing specifications.

— MIT (Maintenance, Inspection and Testing)
techniques.

— Temperature and pressure (and corresponding
fluid effects).

— Stack configurations.


Deep and ultra-deep water developments rely 
on new technology, which is yet to be field proven. 
The uncertainty related to occurrence of unforeseen 
events increases as novel technologies introduced for 
deep waters are not encountered in shallow waters. 
Deeper waters are also characterized by large capi- 
tal expenditures with relatively high operational 
expenditures and high sustainable production rates 
— hence large losses for production interruption. 
Furthermore, subsea interventions become more 
expensive and are associated with longer waiting 
times for the required mobilization of intervention 
vessels. The costs of subsea well system repairs and the economic
penalty for delayed/lost production also soar, driven
by long delays related to availability,
particularly in ultra-deep water environments.

Some BOP configurations have been shown to
be more reliable than others (see Cai et al., 2012;
Sattler & Gallander, 2010). Individual components’
reliability affects overall system reliability due to
variations in numbers, position and even type
within the stack configuration. Common-cause failures
are frequent within BOP systems, with a dominant
impact on accidents (Cai et al., 2012). Redundant
components (control panels, control pods, annular
preventers, ram preventers, valves, regulators, etc.)
may fail from a single event. Historically, multiple
low-frequency failures have led to blowouts (Whooley
et al., 2011). A failure in a BOP system with series-fashion
connectivity will affect the entire system
functionality. It is therefore clear that reliability
is directly related to configuration, which tends to
have a direct bearing on size/weight and hence total
cost of operation and maintenance (O&M).

Safety Integrity Level (SIL) requirements for the BOP are specified
in the IEC 61508 and IEC 61511 standards, which are widely accepted
as the basis for the design and operation of Safety Instrumented
Systems (Pinker, 2012). The Norwegian petroleum industry accepts a
minimum SIL 2 requirement for the Safety Instrumented Function (SIF)
of subsea BOP systems (NOG 070, 2004).


2.1.2 Evaluation conditions 

Technology progressed significantly, and by the 1990s drilling had
advanced into deeper waters. Deepwater drilling refers to water
depths between 400 m and 1500 m. The depletion of shallow-water
reserves, coupled with huge advances in technology, has seen many
drilling companies venture into depths greater than 3000 m. Though
the risks in shallow-water drilling are considerably lower, the
economics of deep-water production are highly attractive and worth
the risks (Latham, 2002). Areas such as the Atlantic Margin in
Europe, the Gulf of Guinea in West Africa, the Gulf of Mexico and
the Campos Basin in Brazil are currently being drilled and explored
in deep-water environments of over 2000 m (Oyeneyin, 2009). These
future oil and gas development regions are burdened with complex
geological features and profiles. Conditions that affect the
operation and the maintenance, inspection and testing (MIT) of the
BOP are also considered. These new ventures present varying
oceanographic and geological environments, with high gas-oil ratios,
HPHT regions, elevated tides and wave currents, difficult formations
and even a lack of experienced personnel (Skogdalen & Vinnem, 2012).
Specific data on some of these conditions provide a foundation for
this study. Erratic conditions are limited to HPHT, where ‘high’ is
defined based on API TR 1PER15K-1, API 17TR8 and expert literature
regarding current operating limits (Lehr & Collins, 2015):


— High pressures ≥ 15000 psi (~1000 bar)
— High temperatures ≥ 350°F (175°C)
— Water depths > 5000 ft or 1500 m


Wells are classified as HPHT when the expected wellhead shut-in
pressure ≥ 10,000 psi (690 bar) or the pore pressure gradient
> 1.81 bar/10 m, and as high temperature when the reservoir or
wellhead temperature > 150°C (Masi et al., 2011).


2.1.3 Data collection 

Obtaining data for analyses, particularly for safety-critical
technical systems in the oil and gas industry, is a constant
challenge (Khakzad et al., 2014), and the accuracy and integrity of
the data are sometimes questioned. To overcome this problem, subject
experts with many years of experience are selected to provide the
required information. Expert judgements are obtained through
rigorous elicitation techniques (see Clemen & Winkler, 1999;
Keeney & Winterfeldt, 1991) and, owing to the dynamic nature of the
proposed framework, can be updated when new and coherent data become
available. Moreover, different types and forms of data are sourced
from different systems, trends are observed, and the data are
aggregated where possible to minimise ambiguity. In the second
phase, data about system design, component interactions and
interrelationships, and operational procedures are required. Primary
qualitative data are obtained via questionnaires completed by
subject experts. More precise qualitative assertions and
quantitative information, such as failure rates, conditional failure
probabilities, load margins and corresponding effects, are also
gathered. Interviews and questionnaires are employed to control the
quality and ensure the specificity of the data obtained. Specific
primary data provide greater confidence in the results. Additional
secondary data are available in the literature; however, greater
focus is placed on the primary data obtained from questionnaires.


2.2 Determination 


2.2.1 Operational parameter (Reliability ) 

A few studies, such as Holand (2001), Holand & Skalle (2011) and
Holand & Awan (2012), have considered the reliability of the BOP in
deep-water conditions using fault tree analysis (FTA) and
reliability theory. Other studies, e.g. Cai et al. (2012) and
Cai et al. (2013), use a more contemporary technique called the
Bayesian network (BN). With this technique, the effects of
characteristics inherent to complex technical systems, such as
common-cause failures, can be captured alongside important
reliability attributes such as maintenance and repair. A BN is a
compact representation of a multivariate statistical distribution
function (Langseth & Portinale, 2007). Its qualitative part is a
directed acyclic graph with nodes (representing random variables)
and arcs between the nodes (representing dependencies), which
together define the joint probability distribution over all random
variables (Boudali & Dugan, 2006). Nodes represent causes and
effects in real-world situations, and arcs connect the nodes. The
quantitative part, made up of conditional probability tables, can
readily be elicited from a domain expert (Langseth & Portinale,
2007).

BNs perform both predictive and diagnostic analyses and are now
considered a viable alternative technique (Khakzad et al., 2013).
Predictive analysis is used to compute the reliability of systems in
a process called marginalization. Qualitative and quantitative
parameters, discrete and continuous data, equations and data sets
can be modelled into systems or processes through BNs. Erratic
conditions such as extreme temperature and pressure, and their
immediate effects, are easily incorporated into complex technical
system analyses via this methodology. A simple step-by-step frame is
presented in Figure 2.
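
As an illustration of marginalization, the following minimal sketch computes the reliability of a redundant pair of components that share a common cause. The three-node network, the failure logic and all probabilities are hypothetical assumptions chosen for illustration; they are not taken from the study.

```python
from itertools import product

# Hypothetical network: a common-cause event C influences two redundant
# components A and B. All numbers below are illustrative assumptions.
P_C = {True: 0.01, False: 0.99}              # P(C occurs), P(C does not occur)
P_FAIL_GIVEN_C = {True: 0.50, False: 0.02}   # P(component fails | C)


def system_reliability():
    """Marginalise the joint distribution P(C, A, B) over all states.

    The redundant pair is assumed to fail only if both A and B fail.
    """
    p_system_fails = 0.0
    for c, a_fails, b_fails in product([True, False], repeat=3):
        p_a = P_FAIL_GIVEN_C[c] if a_fails else 1.0 - P_FAIL_GIVEN_C[c]
        p_b = P_FAIL_GIVEN_C[c] if b_fails else 1.0 - P_FAIL_GIVEN_C[c]
        joint = P_C[c] * p_a * p_b           # chain rule of the network
        if a_fails and b_fails:
            p_system_fails += joint
    return 1.0 - p_system_fails


def naive_reliability():
    """Same marginal component failure probability, dependency ignored."""
    p_fail = sum(P_C[c] * P_FAIL_GIVEN_C[c] for c in P_C)
    return 1.0 - p_fail ** 2
```

Comparing `system_reliability()` with `naive_reliability()` shows how ignoring the common cause overstates the reliability of the redundant pair under these assumed numbers.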


2.2.2 Economic parameter (cost-benefit ratio) 
Several factors influence the choice of BOP employed 
in any drilling or workover development and this is 
common with other complex technical systems which 
are associated with high turnovers and productivity. 
Cost-benefit analysis (CBA) as used in economics 
enables long-term decision making by comparing 
present value of costs with the present value of ben- 
efits. An activity is considered worthwhile when the 
sum of its benefits outweighs the sum of its costs, i.e. 
benefit/cost ratio is greater than 1. It becomes nec- 
essary to identify the associated benefits and costs. 
Based on experts’ input, a summary of potential ben- 
efits and costs associated with BOP operation and 
functioning are shown in Table 1. 
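
The benefit/cost criterion can be sketched as follows. This is a minimal example assuming yearly cash flows and a constant discount rate; the function names and the discounting convention (first cash flow at the end of year 1) are assumptions for illustration.

```python
def present_value(cash_flows, rate):
    """Discount yearly cash flows (end of year 1, 2, ...) to present value."""
    return sum(cf / (1.0 + rate) ** t
               for t, cf in enumerate(cash_flows, start=1))


def benefit_cost_ratio(benefits, costs, rate):
    """Present value of benefits divided by present value of costs."""
    return present_value(benefits, rate) / present_value(costs, rate)


def is_worthwhile(benefits, costs, rate):
    """An activity is considered worthwhile if the ratio exceeds 1."""
    return benefit_cost_ratio(benefits, costs, rate) > 1.0
```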
Figure 2. Bayesian network framework.


Table 1. Associated BOP costs and benefits.

Costs               Cost elements                             Benefits
Capital costs       Purchase, upgrades                        Production revenue
Installation costs  Labour, logistics                         Increased safety
Operating costs     Maintenance, royalties, taxes, logistics  Time savings


2.2.3 Others 

Given that the proposed framework is a living tool, 
stakeholder decision-making is facilitated as new 
data is obtained. Though reliability, of paramount 
importance, is the focus of the study carried out, 
other related factors which affect the selection 
process are compared. The economic benefits are 
determined in CBA whilst other criteria such as 
associated weight/size are obtained from varying 
available data sources and expert opinion. 


2.3 Selection 


In the case of different BOP configurations under 
study, the most appropriate option is chosen on 
the basis of a number of criteria such as reliability, 
cost, weight/size, technical handling/deployment 
issues, etc. For this purpose, a simple weighted 
decision matrix is used to rank options. Buying or 
rental and MIT expenses may be of utmost impor- 
tance to an operator but reliability and availability 
affect several other stakeholders, the environment 
and beyond. Adding weights to these criteria 
will enable clear decision making and eventual 
optimisation. 
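
A simple weighted decision matrix can be sketched as below. The weights and ratings are the values reported in Table 4; the dictionary keys and function names are illustrative assumptions.

```python
# Criterion weights and ratings as reported in Table 4.
WEIGHTS = {"reliability": 0.4, "weight_size": 0.2,
           "technical": 0.1, "economic": 0.3}
RATINGS = {
    "configuration_1": {"reliability": 2, "weight_size": -2,
                        "technical": 1, "economic": 1},
    "configuration_2": {"reliability": 2, "weight_size": 1,
                        "technical": -1, "economic": -1},
}


def weighted_score(ratings, weights):
    """Sum of rating times weight over all criteria."""
    return sum(weights[c] * ratings[c] for c in weights)


def rank_options(options, weights):
    """Return (option, score) pairs sorted from best to worst."""
    scores = {name: weighted_score(r, weights) for name, r in options.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With these inputs, `rank_options(RATINGS, WEIGHTS)` reproduces the totals of 0.8 and 0.6 in Table 4 (up to floating-point rounding).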


3 APPLICATION AND RESULTS 


In this section, the proposed model is applied to a class VII BOP.
This class is typically used in deep-water applications and
comprises either a dual annular, five (5) ram arrangement or a
single annular, six (6) ram configuration. The required data were
collected from four BOP experts and two BOP vendors and supported by
an extensive literature review. With the appropriate premise set,
the proposed model is used to calculate the reliability of both
class VII configurations. The selected tool incorporates a
continuous, discretised evaluation range for both temperature and
pressure over a one-year period, and the results are presented in
Table 2.

The Bayesian network analysis indicates that configuration 1 (with
2 annulars, 4 pipe rams and 1 blind shear ram) is slightly less
reliable than configuration 2 (with 1 annular, 4 pipe rams, 1 blind
shear ram and 1 casing shear ram). The increase in reliability may
be due to


Table 2. Reliability approximation of different configurations.

                          Reliability
Configuration             Temperature model   Pressure model
1) 2 annular, 5 rams      ≈ 99.7              ≈ 97.2
2) 1 annular, 6 rams      ≈ 99.8              ≈ 97.5
Table 3. Preventer type and overall configuration weight.

                                      Configuration 1      Configuration 2
BOP type               Weight (lbs)   Quantity   Total     Quantity   Total
Annular                40,632         2          81,264    1          40,632
Pipe (single)          23,300         4          93,200    4          93,200
Casing shear (single)  28,783         -          -         1          28,783
Blind shear (single)   28,783         1          28,783    1          28,783
Total                                            203,247              191,398


Table 4. Final selective decision matrix.

Criteria                    Weight/Rating   Configuration 1   Configuration 2
A: Reliability              0.4              2                 2
B: Weight/size footprint    0.2             -2                 1
C: Technical feasibility    0.1              1                -1
D: Economic feasibility     0.3              1                -1
Total                       1                0.8               0.6

Feasibility scale:
 2 = better than average.
 1 = slightly better than average.
-1 = slightly worse than average.
-2 = worse than average.


the casing shear ram. The temperature models also seem more tolerant
than their pressure counterparts, as pressure has a direct bearing
on the operation of the rams. Costs and benefits are also evaluated
for each configuration. Approximate purchase prices for single
annular, pipe and shear preventers are $29500, $39500 and $105000
respectively. The Net Present Value (NPV) can be estimated by
calculating the Net Cash Flow (NCF), given by Liu and Ford (2008)
as:


NCF = Revenue − CAPEX − OPEX − Tariffs − Tax
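
The net cash flow relation and a standard NPV calculation can be sketched as follows. The discounting convention (year 0 undiscounted, constant rate) is an assumption; the source does not state how the NPV is derived from the NCF.

```python
def net_cash_flow(revenue, capex, opex, tariffs, tax):
    """NCF = Revenue - CAPEX - OPEX - Tariffs - Tax for one period."""
    return revenue - capex - opex - tariffs - tax


def net_present_value(yearly_ncf, rate):
    """Discount a list of yearly NCF values; index 0 is year 0 (undiscounted)."""
    return sum(ncf / (1.0 + rate) ** t for t, ncf in enumerate(yearly_ncf))
```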


The analysis captures most of the factors, parameters and risks
involved in operating a complex technical system. This allows a
holistic analysis and suitability assessment of the individual
configurations for the loading parameters involved. As mentioned
earlier, configurational changes affect the weight and size of the
entire BOP stack. This has a bearing on maintenance, as the number
of components may increase or decrease. Moreover, the larger the
stack, the more complicated installation, deployment and
decommissioning will be. Weight and size depend largely on the
manufacturer and on whether the rams are of the studded, flanged or
hub type. Table 3 provides details of the different BOP types
available on the market.

Finally, based on expert opinion, weights are assigned to the
parameters and the best alternative for the period and conditions
considered is chosen.


4 CONCLUSIONS 


This study developed a framework for BOP selection, particularly in
deep and ultra-deep water environments. The model provides a
decision-making tool that incorporates the effects of erratic or
extreme loading conditions and selects the set of system modules
while considering several related factors. Application to a class
VII BOP configuration was also shown, and the results demonstrated
the validity of the proposed framework. Correlating several aspects
of the selection process facilitated decision making and provided a
better overview of the many intricacies involved in deep and
ultra-deep water exploration and drilling. Technical, operational
and economic selection aspects were analysed and incorporated
simultaneously. Once completed, the framework yielded varied
information on system operation, design (redundancy configuration),
loading condition boundaries and economic considerations. The
proposed model can be applied to other modular complex technical
systems in general, as it is comprehensive, intuitive and dynamic.
Its application to other operational conditions is a possible future
step. More appropriate and coherent data will also provide refined
results in further analyses. Other future prospects are the
development of appropriate modelling tools to reduce uncertainty in
cost assessment and of a software-based selection mechanism after
the evaluation of the necessary correlates.
ABSTRACT: Simulation methods are widely used in different industrial sectors like economics, logistics
and also in the field of reliability engineering. In general, simulation pursues the target of imitating
the operation of a real-world system over time in order to gain knowledge of its behavior which is
transferable to reality. It is especially applied when the testing of the real-world system is too expensive,
risky or too complex. The aim of this paper is to present different simulation modelling techniques which are
used in reliability engineering and other subject areas and to evaluate them regarding the specific
application of Data-driven Facility Simulation (DFS). After presenting the techniques, important issues of DFS
are discussed. Subsequently, the presented modelling methods are evaluated with respect to a wind park
consisting of wind turbines as an illustrative example. They are discussed regarding different advantages
and disadvantages as well as limitations.


1 INTRODUCTION 


Simulation is widely used in different industrial 
sectors like economics, logistics and also in the 
field of reliability engineering. It pursues the target 
of imitating the operation of a real-world system 
over time in order to gain knowledge of its behav- 
iour, which is transferable to reality (Banks 2005). 
It is especially applied when the testing of a real- 
world system is too expensive, risky or too complex 
(Kolonko 2008). 

One of the first steps in performing a simulation study is to
develop a (mathematical) simulation model, which consists of objects
and their mutual relationships. It is expressed in a mathematical,
logical or symbolic way and can be either static or dynamic
(Banks 2005).

In reliability engineering, various simulation approaches have been
used, e.g. for estimating performance characteristics of system
components, system availability and safety, or system degradation.
For example, Monte Carlo Simulation (MCS) and its specific
application, Discrete Event Simulation (DES), are used for
evaluating the reliability characteristics of multicomponent
technical systems and for estimating the reliability of mechanical
structures (VDI 1999). The simulation results gathered can be used
for reengineering activities or for further improvement of the
system.

The aim of this paper is to present selected, established simulation
modelling methods that are used in reliability engineering and other
subject areas. These modelling methods are evaluated regarding the
specific application of DFS, using a wind park consisting of wind
turbines as an illustrative example. They are discussed with regard
to possible applications, advantages and disadvantages, and
limitations.


2 MONTE CARLO SIMULATION IN 
RELIABILITY ENGINEERING 


The MCS is a well-known and established method for simulating
stochastic systems. Within MCS, random numbers are used for solving,
or rather estimating solutions of, a deterministic or stochastic
problem (Law 2007; Zio 2013). It is often applied when a system is
too complex or impossible to evaluate by analytical methods. In
reliability engineering, MCS is used for determining reliability
characteristics of technical systems and can be performed in the
following three steps (cf. VDI 1999):


1. Building a stochastic simulation model $y = \varphi(x)$, which
contains random variables (e.g. component state variables $x_i$).
The model should describe the behavior of the system under study
in a sufficient way.

2. Implementing the model in a computer program for simulating
system failure processes.

3. Simulating the model with a sufficient number of simulation runs
$N$. Within a simulation run, input random variables $x_i$,
$i = 1, \ldots, n$, are generated according to their specified
distributions (e.g. by inverse transform sampling or rejection
sampling) and the model output (the simulated state of the system
$y$) is calculated. Lastly, statistical parameters of the
generated output data $y_k$, $k = 1, \ldots, N$, are calculated
(e.g. mean value, variance).
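
The three steps can be sketched for a small example. The system structure (one component in series with a parallel pair), the exponential lifetimes and all names below are assumptions chosen for illustration.

```python
import math
import random


def sample_exponential(rate, rng):
    """Inverse transform sampling: solve F(t) = u for the exponential CDF."""
    u = rng.random()                      # u is uniform on [0, 1)
    return -math.log(1.0 - u) / rate


def system_survives(lifetimes, mission_time):
    """Assumed structure: component 0 in series with the parallel pair (1, 2)."""
    up = [t > mission_time for t in lifetimes]
    return up[0] and (up[1] or up[2])


def estimate_reliability(rates, mission_time, n_runs, seed=0):
    """Estimate system reliability as the fraction of surviving runs."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_runs):
        lifetimes = [sample_exponential(r, rng) for r in rates]
        successes += system_survives(lifetimes, mission_time)
    return successes / n_runs
```

For this simple structure the estimate can be checked against the analytical value $e^{-\lambda_0 t}\,[1 - (1 - e^{-\lambda_1 t})(1 - e^{-\lambda_2 t})]$; simulation becomes worthwhile when no such closed form is available.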


According to VDI (1999), MCS can be performed on different types of
models, which can be static or dynamic. In the literature, MCS is
often associated with simulating only static models, whereas for
discrete dynamic models the terms discrete event simulation and
discrete time simulation (DES and DTS) are used. However,
definitions of MCS vary in the literature.


3 METHODS FOR SIMULATION 
MODELLING 


This section deals with selected dynamic simula- 
tion modelling techniques, which are used in reli- 
ability engineering and are mostly based on MCS. 
The methods are presented and instances of appli- 
cations in the field of reliability are shown. 


3.1 Discrete event simulation 


Discrete Event Simulation (DES) describes a technique of system
modelling in which the system’s state variables can change only at
discrete points in time (Banks 2005), namely when an event occurs.
Law (2007) defines an event “as an instantaneous occurrence that may
change the state of the system”. Once a model has been established,
it can be run to produce an artificial history in the form of
generated data (Banks 2005). DES can be seen as a particular
application of MCS if events occur at random.

A typical application of DES in reliability engineering is the
simulation of a model (e.g. a block diagram model) over a specific
operation time $t$. For each component, failure and repair events
are generated from its time-to-failure or time-to-repair
distribution. At each point in time at which an event occurs, the
system state (success or failure) is determined. Such simulations
can also consider different constraints (e.g. repair strategies and
capacities or operation phase switching) (cf. VDI 1999). With a
sufficient number of simulation runs $N$, reliability measures (e.g.
MTTF) can be estimated. DES is used in other applications as well,
for instance for the reliability and availability assessment of
civil engineering structures (Juan 2009). Moreover, Sharda & Bury
(2008) developed a DES model for understanding the effects of
facility failures on a chemical plant’s production capability.
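
A minimal event-driven sketch of such a simulation is given below. It assumes a series system, exponentially distributed times to failure and to repair, and an independent repair crew per component; these choices and all names are illustrative assumptions.

```python
import heapq
import random


def simulate_availability(failure_rates, repair_rates, horizon, seed=0):
    """Fraction of [0, horizon] during which a series system is up."""
    rng = random.Random(seed)
    n = len(failure_rates)
    up = [True] * n
    # Event list of (event time, component index), ordered by time.
    events = [(rng.expovariate(failure_rates[i]), i) for i in range(n)]
    heapq.heapify(events)
    clock, uptime = 0.0, 0.0
    while events:
        t_event, i = heapq.heappop(events)
        t_event = min(t_event, horizon)
        if all(up):                       # series system: all must be up
            uptime += t_event - clock
        clock = t_event
        if clock >= horizon:
            break
        up[i] = not up[i]                 # failure or completed repair
        rate = failure_rates[i] if up[i] else repair_rates[i]
        heapq.heappush(events, (clock + rng.expovariate(rate), i))
    return uptime / horizon
```

The state only changes when an event is taken from the event list, so the clock jumps from event to event instead of advancing in fixed steps.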


3.2 Discrete time models 


According to Zeigler (2010), in discrete time models time advances
stepwise in discrete increments (e.g. in one-second intervals). At
each point in time, the model is in a certain state. The model is
executed at each point in time and the next state is determined for
the subsequent time step. In this subsection, selected discrete time
modelling methods are presented and discussed.


3.2.1 Discrete-time Markov chain models
Discrete-Time Markov Chains (DTMC) can be used for modelling and
simulation studies (cf. Nfaoui, Essiarab & Sayigh 2004;
Papaefthymiou & Klockl 2008; Sahin & Sen 2001; Shamshad 2005).
A discrete-time stochastic process $X_t$, $t \in T$,
$T = \mathbb{N}_0$, can be described by a Markov chain if, for all
states $i_0, \ldots, i_{n-1}, i, j \in I$, the following condition
(1) is valid (Waldmann & Helm 2016):

$$P(X_{n+1} = j \mid X_0 = i_0, \ldots, X_{n-1} = i_{n-1}, X_n = i) = P(X_{n+1} = j \mid X_n = i) \qquad (1)$$

$I$ denotes the state space and $p_{ij} = P(X_{n+1} = j \mid X_n = i)$
the one-step transition probabilities, which can be represented in a
one-step transition matrix $P$.
The simulation of the stochastic process is carried out for each
time step and can be performed as follows (cf. Sigman 2007):


1. Select an initial state $X_1 = i_1$ and set $n = 1$.

2. Generate $X_{n+1}$ by sampling from the conditional distribution
$P(X_{n+1} \mid X_n)$ and set $n = n + 1$.

3. Take the sampled value as the current state $X_n$ and go back to
step 2 until the desired number of steps $n$ is reached.
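
The procedure can be sketched as follows, assuming states are numbered 0, 1, ... and the one-step transition matrix is given as a list of rows; the function name and interface are illustrative.

```python
import random


def simulate_markov_chain(transition_matrix, initial_state, n_steps, seed=0):
    """Generate a state path of length n_steps + 1 from a DTMC.

    Row i of transition_matrix holds the probabilities of moving
    from state i to each state j in one step.
    """
    rng = random.Random(seed)
    states = list(range(len(transition_matrix)))
    path = [initial_state]
    for _ in range(n_steps):
        current = path[-1]
        # Sample the next state from the row of the current state.
        nxt = rng.choices(states, weights=transition_matrix[current], k=1)[0]
        path.append(nxt)
    return path
```

For a long path, the relative frequency of each state approaches the stationary distribution of the chain, which offers a simple plausibility check of a fitted transition matrix.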


Markov chain models are also used in the field of reliability.
Skalny (2013) used Discrete-Time Markov Chains in combination with
the MCS method for estimating the probability that companies fail to
fulfil an order for industrial partners. Markov chains have also
been applied in modelling reliability structures (cf. Koutras 1996).


3.2.2 ARMA models 

Autoregressive Moving Average Models (ARMA 
models) (Box & Jenkins 1979) are used for time 
series forecasting and simulation (especially for 
wind speed time series (cf. Kamal & Jafri 1997; 
Sfetsos 2000)). In ARMA models, it is assumed 
that the present value of a variable is the result of 
a linear function of a specific number of past val- 
ues and a random error term. It can be represented 
generally in the following form (see equation 2): 
$y_t$ denotes the actual value and $e_t$ the random error at time
$t$, which is assumed to be i.i.d. with a mean of zero and constant
variance $\sigma^2$. $\phi_i$ ($i = 1, \ldots, p$), $i \in \mathbb{N}$,
are the autoregressive and $\theta_j$ ($j = 1, \ldots, q$),
$j \in \mathbb{N}$, the moving average model parameters; $p$ and $q$
are referred to as the order of the model. ARMA models can only be
used to model stationary time series. If a time series to be
modelled is not stationary, it has to be differenced $d$ times or
transformed. The model (model orders $p$, $d$, $q$ and model
parameters $\theta_j$ and $\phi_i$) can be determined with the Box
and Jenkins methodology (Box & Jenkins 1979).

The simulation process can be done by perform- 
ing the following three steps: 


1. Select an initial value $y_t$, $n = 1$.

2. Generate a value of $e_t$ by sampling from the fitted normal
distribution function for $e_t$ with MCS.

3. Calculate a new value of $y_t$ by equation 2, shift all former
values one step into the past and go back to step 2.
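
The simulation loop can be sketched as below. The sketch assumes one common ARMA(p, q) convention, $y_t = \sum_{i=1}^{p} \phi_i y_{t-i} + e_t - \sum_{j=1}^{q} \theta_j e_{t-j}$, with normally distributed errors and zero initial values; the sign convention of the moving average terms differs between references, so this is an assumption rather than the form used in the paper.

```python
import random


def simulate_arma(phi, theta, sigma, n_steps, seed=0):
    """Simulate y_t = sum(phi_i*y_{t-i}) + e_t - sum(theta_j*e_{t-j})."""
    rng = random.Random(seed)
    y_hist = [0.0] * len(phi)      # past values y_{t-1}, ..., y_{t-p}
    e_hist = [0.0] * len(theta)    # past errors e_{t-1}, ..., e_{t-q}
    series = []
    for _ in range(n_steps):
        e_t = rng.gauss(0.0, sigma)               # step 2: sample the error
        y_t = (sum(p * y for p, y in zip(phi, y_hist))
               + e_t
               - sum(th * e for th, e in zip(theta, e_hist)))
        series.append(y_t)
        # Step 3: shift all former values one step into the past.
        y_hist = ([y_t] + y_hist[:-1]) if y_hist else []
        e_hist = ([e_t] + e_hist[:-1]) if e_hist else []
    return series
```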


ARMA models are also applied in reliability engineering. For
instance, Ho & Xie (1998) used ARMA models for forecasting the
failure reliability of repairable systems. Further examples are
given by Karki, Hu & Billinton (2006) and Billinton, Chen & Ghajar
(1996), who used an ARMA model for simulating wind speeds in
combination with a power generation model for the reliability
evaluation of wind power systems.


3.2.3 Artificial neural network models 

Another technique for simulation modelling is the artificial neural
network (ANN) (Kruse 2012). ANNs are used in a wide range of
applications (e.g. wind speed forecasting). An ANN maps input
vectors to output vectors by learning from examples, without
assuming a specific relationship between them (Li & Shi 2010).
Different types of ANNs are available (e.g. feed forward networks,
radial basis function networks or recurrent neural networks). This
section deals only with feed forward ANN models (FNN) (Kruse 2012).
The mapping is realized by interconnected neurons arranged in layers
called the input, hidden and output layers, with no feedback loops
between the layers. The strength of each connection from a neuron
$j$ to a neuron $i$ is expressed by a connection weight. An
activation level $v_i$ is calculated for each neuron $i$ in the
network as follows (Mohandes 1998):


$$v_i = \sum_j w_{ij} x_j - w_{i0} \qquad (3)$$


$w_{ij}$ denotes the connection weights and $x_j$ the input values
to neuron $i$. The bias value $w_{i0}$ denotes the shift of the
function $\varphi(v_i)$. This function determines the output of a
neuron from the calculated activation level and is a nonlinear
function (e.g. a sigmoidal function, see equation 4).


$$\mathrm{sig}(x) = \frac{1}{1 + e^{-x}} \qquad (4)$$
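
The forward pass of a small feed forward network can be sketched as follows. The sketch assumes the standard logistic sigmoid as activation function and a bias that is subtracted from the weighted input sum; the data layout and function names are illustrative assumptions.

```python
import math


def sigmoid(x):
    """Standard logistic sigmoid, assumed here as the activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, bias):
    """Activation level v = sum(w_j * x_j) - bias, passed through the sigmoid."""
    v = sum(w * x for w, x in zip(weights, inputs)) - bias
    return sigmoid(v)


def feed_forward(inputs, layers):
    """Propagate inputs through the layers without feedback loops.

    layers is a list of layers; each layer is a list of
    (weights, bias) tuples, one tuple per neuron.
    """
    values = inputs
    for layer in layers:
        values = [neuron_output(values, w, b) for w, b in layer]
    return values
```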


In order to get adequate results by the FNN for 
a given input, the connection weights of the net- 
work need to be trained (Welch 2009). For training, 
the well-known back propagation algorithm (BP) 
(Kruse 2012) can be applied. Hereby, an iterative 
training process minimizes the mean square error 
of the network output in comparison to the real 
output by adjusting the connection weights which 
are assigned with random values initially (Li 2010; 
Mohandes 1998). 

For time series simulation or forecasting, an ANN describes a
nonlinear function $f(\cdot)$ that relates past recorded values of a
time series to future values (see equation 5) (Zhang 2003).


$$y_t = f(y_{t-1}, \ldots, y_{t-p}, w) + \varepsilon_t \qquad (5)$$


$\varepsilon_t$ describes the error term of the model, whereas $w$
denotes the vector of model parameters (connection weights between
the neurons). After training the ANN, the simulation can be
performed by selecting initial values of $y_{t-1}, \ldots, y_{t-p}$
and calculating a new value of $y_t$. All former values are then
shifted one step into the past and the simulation process is
repeated until a specified number of simulation runs is reached. ANN
models have also been applied in reliability engineering. They are
widely used in structural reliability analysis (cf. Chojaczyk 2015;
Hurtado 2001). Moreover, Rajpal & Shishodia (2006) used ANNs for
modelling and simulating the behaviour of a complex, repairable
system (a helicopter transport facility) under various constraints.
The results can be used for optimizing the operation of the system.


4 MODELLING METHODS FOR DATA- 
DRIVEN FACILITY SIMULATION 


In recent years, the availability and functionality of condition
monitoring and diagnostic systems for a wide range of technical
facilities have increased. These systems can record field data of
different variables during field operation, which can be used to
identify damage and the facility state history. The data are
acquired by different sensors integrated within a plant and are
available in the form of multivariate time series.

They can be used for prediction purposes such as estimating the
Remaining Useful Life (RUL) or the future facility state. However,
the amount of recorded data is often small when the field duration
is short, which hinders prognoses. Additionally, the gathered data
often represent just a small time frame within the usage phase
compared with the whole expected facility service life. Long-term
predictions of the facility state can therefore be subject to large
errors, because aspects such as changing environmental conditions,
degradation behavior and changes in facility usage, which also
change the facility load scenarios, cannot be included. To address
these problems, the behavior of the facility can be simulated over a
desired period of time by generating synthetic operation data. The
simulation can be performed under assumptions based on expert
knowledge (e.g. changing usage behavior after a specific operation
time), which makes more accurate predictions possible when
conditions change during the facility life.


4.1 Important issues of data-driven facility 
simulation 


For performing DFS, several issues (i.e. require- 
ments on data structure) have to be taken into 
account which are stated as follows: 


Data structure and recording 

For performing DFS, multivariate input time series data need to be
recorded with a constant sampling rate and consistent time stamps
across all variables. This is needed, for instance, for identifying
cross-correlation structures between the variables. Another
important issue is the amount of recorded data. For instance, if a
simulation model is built in the form of a Markov chain, the amount
of recorded data needs to be sufficient for determining state
transition probabilities.

A further issue is the occurrence of incorrect measurements and
missing values within the recordings. It is recommended to treat
implausible and obviously incorrect data as missing values. Missing
values in time series can then be replaced with the help of data
imputation methods within a data preparation process. In contrast,
outliers are treated as rare events by the simulation model and thus
do not need to be rejected.
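
One possible data preparation step is sketched below. It assumes that implausible readings can be identified by physical plausibility bounds and that linear interpolation is an acceptable imputation method; both are illustrative assumptions, since the text does not prescribe a specific method. The bounds are meant to catch impossible values only, not statistical outliers, which are kept.

```python
def mark_implausible(values, lower, upper):
    """Replace readings outside [lower, upper] by None (treated as missing)."""
    return [v if v is not None and lower <= v <= upper else None
            for v in values]


def interpolate_missing(values):
    """Fill None gaps linearly between the nearest valid neighbours.

    Gaps at the start or end are filled with the nearest valid value.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return filled
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            frac = (i - left) / (right - left)
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled
```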


Data analysis 

In order to perform a DFS, a thorough data analysis has to be
carried out. Its results serve as input for building the simulation
model as well as for the evaluation process. The data, which exist
in the form of time series, can be analyzed in the following ways,
among others:


Statistical parameters: Calculating general statis- 
tical parameters for evaluation purposes (mean, 
variance, percentiles, min/max, etc.) 

Probability density functions: Fitting probability 
density functions (i.e. Weibull density function) to 
the time series data. This can be done either for the 
whole recorded time series (i.e. also for evaluation 
purposes) or within a system state (i.e. a specified 
range of wind speed) for modelling transitions into 
other system states. 

Autocorrelation: An analysis of the autocorrelation structure of the
recorded time series needs to be performed so that the series
chronology is considered within the simulation process (Feijoo &
Villanueva 2016). This can be done with the following equation given
in Shamshad (2005) for a specified number of lags $k$:


$$\rho_k = \frac{\frac{1}{N-k} \sum_{i=1}^{N-k} (x_i - \bar{x})(x_{i+k} - \bar{x})}{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2} \qquad (6)$$


where $x_1, \ldots, x_N$ is the analyzed time series and $\bar{x}$
is its mean. The autocorrelation structure describes the correlation
of time series values with their past values at different lags. The
calculated autocorrelation structures of simulated and real data can
be compared, and thus the performance of the simulation model can be
evaluated.
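
A lag-k autocorrelation estimate can be sketched as follows. The normalisation used here (lag-k covariance averaged over N − k terms, divided by the variance averaged over N terms) is an assumption; other references normalise slightly differently.

```python
def autocorrelation(x, k):
    """Sample autocorrelation of the series x at lag k (0 <= k < len(x))."""
    n = len(x)
    mean = sum(x) / n
    # Lag-k covariance, averaged over the N - k available pairs.
    cov = sum((x[i] - mean) * (x[i + k] - mean) for i in range(n - k)) / (n - k)
    # Variance of the whole series, averaged over N values.
    var = sum((v - mean) ** 2 for v in x) / n
    return cov / var


def autocorrelation_structure(x, max_lag):
    """Autocorrelation values for lags 0, 1, ..., max_lag."""
    return [autocorrelation(x, k) for k in range(max_lag + 1)]
```

Applying `autocorrelation_structure` to both the recorded and the simulated series gives two curves that can be compared lag by lag.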

Event identification: Technical facilities behave differently in
different operation phases, which is reflected in the data
recordings and must be considered during simulation. For instance,
wind turbine data recordings clearly differ between start-up phases,
shutdown phases and idle phases (e.g. during maintenance). It is
therefore important to label the recorded time series so that data
can be allocated to different operation phases and better simulation
results can be obtained. Additionally, the occurrence of events
needs to be analyzed regarding frequency, sequences and occurrence
probability.

Dwell times and idle times: Another important issue
within data analysis is the determination of dwell
times in system states. The time a facility spends in
different load states is an important factor for
degradation estimation and must be reproduced
adequately in the simulation. To increase the
accuracy of the simulated data and to improve RUL
estimation, idle time durations also have to be
determined.
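
As an illustration of the dwell-time and idle-time determination described above, the following sketch labels each 10-minute record with a state and collects the length of every uninterrupted run. The binning rule in `load_state` (idle at or below zero power, 500 kW wide load levels) is a hypothetical choice made for this example and is not taken from the data set.

```python
from collections import defaultdict
from itertools import groupby


def load_state(active_power_kw, bin_width_kw=500.0):
    """Map a power reading to 'idle' or a load-level index (hypothetical binning)."""
    if active_power_kw <= 0.0:
        return "idle"
    return int(active_power_kw // bin_width_kw)


def dwell_times(states, step_minutes=10):
    """Return {state: [durations in minutes]}, one entry per uninterrupted run."""
    durations = defaultdict(list)
    for state, run in groupby(states):
        # groupby yields consecutive equal states, i.e. one dwell period each
        durations[state].append(sum(1 for _ in run) * step_minutes)
    return dict(durations)
```

The resulting duration lists per state can then be summarised or fitted with a distribution and compared between simulated and recorded data.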


4.2 Evaluation of simulation modelling methods 
for data-driven facility simulation 


The reviewed simulation modelling methods shall 
be evaluated regarding DFS by means of a wind 
farm as an illustrative example. The evaluation is 
based on a data set consisting of operational time 
series data of wind turbines located in the western
part of Germany. The data set consists of
10-minute values. Its essential structure is shown
in Table 1.


Discrete event simulation 

The operational time series variables 3 to 7 and
12 (see Table 1) change their value at almost every
point in time or remain at a value level only
briefly. This leads to an enormous number of system
changes which are not events in the proper sense.
DES is usually not applied here because a system
change takes place at almost every discrete point in
time and cannot be assigned to a specific cause.
However, DES can model sequences of clearly
specified events concerning a wind turbine in which
the behaviour of the operational time series
variables changes. Examples of such events are
maintenance activities, facility shutdowns caused by
shadowing, and facility failures. An advantage of
simulating a wind park with DES on an event basis is
that it can help to evaluate maintenance and repair
strategies of the plant or to estimate the plant’s
availability. However, a major disadvantage is that
a plant-specific model has to be developed. The wind
turbine data set includes a high number of different
types of facility states, so a complex model has to
be established if DES is applied. For DFS, a DES
model can be combined with a DTS model (cf. Cha &
Roh 2010). This would take into account potential
impacts of the facility states (e.g. maintenance
actions) on the operational time series data as well
as their interdependencies. Establishing such a
model requires expert knowledge of the facility.
Another drawback is that DES models are usually
very process-specific.


Table 1. Structure of the wind farm data set.

No.  Variables                           Available data
1    Timestamp                           Date, hour, minute
2    Plant number                        1 to 14
3    Wind speed [m/sec]                  Mean, min, max
4    Rotor speed [1/min]                 Mean, min, max
5    Active power [kW]                   Mean, min, max
6    Reactive power [kW]                 Mean, min, max
7    Nacelle position [°]                Mean
8    Blade angle A,B,C [°]               Mean
9    Rainfall [mm/min]                   Mean, min, max
10   Visual range [km]                   Mean, min, max
11   Ambient brightness [lux]            Mean
12   Temperature of machine parts [°C]   Mean
13   Facility status                     Nominal status values


Discrete time models—ARMA models

The operational time series variables of the wind
park data set (variables 1 to 12) show a strong
autoregressive behaviour, indicated by a slowly
decaying ACF. This is a strong indicator that the
time series are non-stationary, i.e. that they do
not have invariant characteristics. Invariant
characteristics of a (weakly) stationary time series
would be a constant mean, a constant and finite
variance over time, and a covariance structure which
depends only on the lag and not on time (Davies
2014). Since ARMA models can only model stationary
series, they are not suited to model operational
time series variables of a wind park unless the
series is first made stationary (e.g. by data
transformation or decomposition). Differencing
would lead to a stationary series, but the
associated loss of information makes it difficult to
simulate a time series with characteristics similar
to the observed one. With decomposition techniques,
the time series can be divided into a trend-cycle
component and a stationary component. In contrast to
the latter, the former still cannot be modelled by
ARMA.

The advantage of ARMA is that only a few
parameters are needed to generate a time series with
a specified ACF. However, the probability density
function of the generated series generally differs
from that of the measured data, which can lead to
wrong estimations (Papaefthymiou 2008). Furthermore,
for wind turbines, it is assumed that ARMA models are
not suited for simulating idle times of a facility
within a variable (e.g. active power) or dwell times
(e.g. at a specific temperature level of a machine
part), because the simulated trajectory is driven by
sampling from an error distribution (see equation
2). Another disadvantage of ARMA models is their
linear nature, which makes them potentially not well
suited for modelling operational time series data of
wind turbines (cf. Chen & Yu 2014).
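
To make the point about error-driven trajectories concrete, the sketch below simulates an ARMA(1,1) process. It assumes the standard textbook recursion x_t = phi * x_(t-1) + e_t + theta * e_(t-1) with Gaussian errors, which may differ in notation from the equation referenced above. The function name and parameters are illustrative.

```python
import random


def simulate_arma11(n, phi, theta, sigma, x0=0.0, seed=None):
    """Simulate n values of x_t = phi*x_(t-1) + e_t + theta*e_(t-1), e_t ~ N(0, sigma)."""
    rng = random.Random(seed)
    x_prev, e_prev = x0, 0.0
    series = []
    for _ in range(n):
        e = rng.gauss(0.0, sigma)  # every step draws a fresh error term
        x = phi * x_prev + e + theta * e_prev
        series.append(x)
        x_prev, e_prev = x, e
    return series
```

Because a new error is drawn at every step, a simulated series with sigma > 0 practically never stays at exactly the same value for several consecutive steps. This is why sustained idle periods (for example zero active power) and long dwell times are hard to reproduce with this model class alone.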


Discrete time models—ANN models 
ANN models are another way of simulating time
series, as described in section 3.2.3. In contrast to
ARMA models, they can approximate various
nonlinearities in data (Zhang 2003). Moreover, they
can model a large number of functions with high
accuracy. They have been successfully applied in
wind speed simulation as well as in the simulation
of solar radiation series (cf. Li 2010; Mihalakakou
2000). However, there is no rule for determining the
ANN topology and the number of neurons needed to
model a function. This has to be examined for each
time series in a parameter study.

A disadvantage of ANN lies in modelling time
series in which values persist for a certain
duration. Consider the wind park data set: during an
idle time no output power is generated, so the
future value following past zero values is also zero
as long as the facility is in idle mode. After a
while, however, the facility is restarted, which
leads to future values different from zero. This
cannot be modelled solely by an ANN as described in
section 3.2.3: the ANN would remain in the idle
state for the rest of the simulation. A possible way
to address this problem is to use hybrid models. For
example, the ANN model can be trained to simulate
only the changes between value levels, while the
dwell time within a level is simulated by a Monte
Carlo approach. Another way of applying ANN models
in DFS is regression. For instance, the variables
rotor speed and active power are highly correlated.
After simulating one of these variables, the other
can be determined by modelling the relationship
between the two. This makes it possible to generate
multivariate correlated time series.


Discrete time models—Markov chains

Instead of ARMA models, Discrete-Time Markov
Chains can be used to model the operational time
series data of the wind park. The methodology is
suited for modelling the non-stationary and
non-linear time series present in the wind park data
set. Moreover, it considers the underlying ACF as
well as the probability distribution function of the
control sample series. It is also able to simulate
dwell times (e.g. of temperature levels of machine
parts of the wind turbine) as well as idle times
associated with the corresponding state. However,
the stochastic process needs to be discretized by
defining a number of states (Papaefthymiou 2008),
which leads to a loss of information. According to
Papaefthymiou (2008), applying MC requires a
trade-off between model accuracy and complexity
(represented by the number of states and
parameters). A high number of states represents the
process better, but transition probabilities are
difficult to estimate when the data volume is low.
For wind speed series, higher-order MC are applied
in the literature, which leads to more adequate
results: the simulated series matches the ACF of the
real data better and retains its probability density
function better (Feijóo 2016). However, with a
higher-order MC the model complexity as well as the
required amount of data increase enormously, which
limits usability (cf. Aksoy 2004).

The MC approach is also not able to simulate 
a trend. As the series of the data set don’t show a 
trend over the observation time, this restriction is 
negligible. 
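
The Markov chain approach discussed in this subsection can be sketched as follows, assuming a first-order chain and equal-width bins for the discretization. Both are simplifying assumptions for illustration, and all function names are hypothetical. The rule that a state with no observed outgoing transitions is treated as a self-loop is likewise an assumption of this sketch.

```python
import random


def discretize(series, n_states, lower, upper):
    """Map each value to a state index 0..n_states-1 using equal-width bins."""
    width = (upper - lower) / n_states
    return [min(n_states - 1, max(0, int((x - lower) / width))) for x in series]


def transition_matrix(states, n_states):
    """Estimate first-order transition probabilities from observed state pairs."""
    counts = [[0] * n_states for _ in range(n_states)]
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    matrix = []
    for i, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            # No outgoing transition observed: assume the state keeps itself
            matrix.append([1.0 if j == i else 0.0 for j in range(n_states)])
        else:
            matrix.append([c / total for c in row])
    return matrix


def simulate_chain(matrix, start, n_steps, seed=None):
    """Generate a state sequence by sampling each next state from the current row."""
    rng = random.Random(seed)
    state, path = start, [start]
    for _ in range(n_steps):
        state = rng.choices(range(len(matrix)), weights=matrix[state])[0]
        path.append(state)
    return path
```

The number of parameters grows with the square of the number of states for a first-order chain, and faster for higher orders, which illustrates the trade-off between accuracy and data volume mentioned above.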


5 CONCLUSIONS 


In this paper, different simulation modelling
techniques were reviewed with regard to reliability
engineering. After discussing important issues of
data-driven facility simulation, the presented
modelling methods were evaluated regarding their
applicability in DFS and in general. Their
advantages, disadvantages and limitations were
pointed out using a wind park as an illustrative
example. The models discussed (ARMA models, DES, ANN
and Discrete-Time Markov Chains) can be seen as
tools for establishing a comprehensive concept for
simulating the operational time series of a
technical facility. In future studies, these
modelling methods shall be applied to the wind park
data, taking the identified issues of DFS into
account. Furthermore, the methods shall be applied
in combination to perform the DFS.


ABSTRACT: The trend towards automation in industrial production has led to massive use of autonomous
robots. In classical approaches, safety is usually guaranteed by isolating robots from humans. Collaborative
robotics, i.e., humans and robots working together, is expected to increase both productivity and
performance. However, removing fences and letting the robot work in collaboration with humans creates
new hazardous situations. Therefore, a proper risk assessment should be performed to avoid those hazardous
situations without compromising productivity. We present an automated warehouse where autonomous
robots load trucks with products while sharing the same environment with human workers. In this
position paper we propose a safety strategy that is modeled on dynamic safety fields around the
robot, which is consistent with important guidelines in collaborative robotics (i.e., ISO/TS 15066). We propose
three different safety levels of dynamic fields: red (critical), yellow (warning) and green (clear). Instead of
completely stopping in the presence of humans, the robot can keep performing its operations with some
enforced constraints for safety reasons. We also propose a risk assessment of hazardous situations based
on proprioceptive and exteroceptive data. This evaluation generates different warnings or actions to be
performed based on those safety levels and is responsible for changing the size of the dynamic fields.


1 INTRODUCTION 


In the present phase of industrialization,
automation plays an important role, leading to
extensive use of robots not only in factories and
manufacturing industries but also in other business
areas, such as large logistics systems and
warehouses. The traditional safety strategy is to
separate robots and humans completely by isolating
robots behind fences and to stop the robot
immediately if anything goes wrong, thus minimizing
the contact between robots and workers, as in the
Kiva project (Guizzo 2008). Collaborative robotics,
in which robots and humans collaborate to accomplish
tasks, introduces new advantages in terms of
productivity and efficiency, but also new hazards,
such as an increased possibility of collision with
workers.

Avoiding hazardous situations is an absolute
prerequisite for autonomous robots. Because the
barriers around the robot are eliminated in this new
collaborative situation, a robot has to interact
with other robots and workers at different levels.
It is crucial to ensure the correct and safe
operation of the robot so that it cannot cause
injuries to the workers or damage to other objects
or to itself (Robla-Gomez et al. 2017). This issue is
aggravated in the warehouse scenario, where mobile
robots can move freely in the presence of other
moving robots and workers. The regulations that
address robot-related risks for workers include the
international standard ISO 10218 (ISO 2011a, ISO
2011b). A recent technical specification, ISO/TS
15066:2016 (ISO 2016), for collaborative robots
introduces new concepts, such as collaborative
operation, collaborative workspace (a shared space
where worker and robot can perform tasks
concurrently), and collaborative robot, which are a
direct focus of this work. Please note that in the
rest of the paper, we use human and worker
synonymously.

Many safety approaches for collaborative
warehouses rely only on sensor data (e.g., from
cameras and LiDARs) to perform high-precision
detection and tracking of the workers (Krug et al.
2016, Sabattini et al. 2017). Knowing the position of
the workers, robots simply steer away from them as
they come closer. Although such a reactive approach
may be sufficient in some situations, there are
other solutions in which robot actions are
determined from the robot’s proximity to the
obstacle and the type and nature of this obstacle.
This leads to the safety field concept, which
creates a virtual circle around the robot; the robot
reaction changes according to the region of the
circle that is occupied (Magnanimo et al. 2016,
SafeLog 2017).

In this position paper, we present a two-fold
safety strategy and a detailed architecture
including all the components required to implement
safety for collaborative operations within an
automated warehouse. We base our safety analysis on
creating three-layered safety fields around the
robot: red (critical), yellow (warning) and green
(clear). The analysis is based on the speed and
separation monitoring and the power and force
limiting approaches of ISO/TS 15066:2016. An
advantage of using different safety fields is the
increased performance of the collaborative
operation. Instead of completely stopping in the
presence of humans, the robot can keep performing its
operations at a decreased speed. Our two-fold safety
strategy consists of:


1. An offline safety analysis: performs a
simulation-based quantitative safety assessment of
the scenarios. It evaluates the high-level plans
(high-level task descriptions for the robots)
generated by the system before sending them to the
robots.

2. An online safety analysis: creates the safety
fields around the robots within the robot control
loop at runtime. It uses the sensor data and performs
a risk assessment to calculate the risks. Based on
the calculated risk values, it generates
three-layered safety fields around the robot. These
fields are dynamic in size; their sizes are computed
based on the risk assessments and on both the
environmental and the operational context of the
robot.


Paper Outline: Related work is presented in
Section 2. Section 3 describes details of the
automated warehouse, its scenarios and the proposed
architecture. Our idea for a three-layered safety
strategy is presented in Section 4, and finally,
Section 5 concludes the paper with a description of
ongoing and future work.


2 RELATED WORK 


The problem of robot safety in warehouses is a
relatively recent issue, though the safety problem
involving robots has already been discussed for
decades (Robla-Gomez et al. 2017). One of the first
initiatives in warehouse automation was the Kiva
project (Guizzo 2008), which used omnidirectional
mobile robots to load shelves and bring them close
to workers. This enables workers to pick products
from the inventory and complete the orders
efficiently. In this scenario, humans are placed in
an area separate from the robots and an alarm is
triggered when someone enters the robot area. In
contrast to this traditional approach, our approach
allows collaborative operations inside the working
space and hence its safety requirements are less
restrictive than keeping the robot within a limited
area.
The problem of automated picking and placing
of objects using mobile robots is addressed in Krug
et al. (2016), where the robots operate in the same
area as the workers. The robots are equipped with a
set of safety LiDARs and camera sensors for safety
assessment. Workers wear reflective vests (ANSI/ISEA
107-2004 standard) to facilitate human detection
and make it possible even in low light conditions.
The robot determines the position, velocity and body
configuration (e.g., sitting, standing) of the
worker through its sensors and uses this information
for safe navigation. The mobile robot conforms to
the EN 1525 standard (CSN 1998), which tolerates
contact with a human. The warehouse robot solution
proposed by Sabattini et al. (2017) relies on 3D
object detection using the fusion of four LiDARs and
two cameras. The LiDARs ensure 360° coverage of the
scene. Object detection and tracking are performed
by both camera and LiDAR. The outputs from these
sensors are used by the robot to steer away from
obstacles and workers during navigation.

In both approaches (Krug et al. 2016, Sabattini
et al. 2017), robots share a collaborative workspace
with humans and a safe distance is kept to avoid
collisions with objects and humans. However, the
distance is calculated based on sensor data only;
no risk assessment or safety analysis of the kind we
propose in this paper is performed.

SafeLog is a more recent initiative which deals
directly with safety in a collaborative scenario
between robots and humans inside a warehouse
(SafeLog 2017). It uses three safety levels (A, B
and C), which create a three-layered virtual circle
around the worker. Safety level A creates a virtual
circle around the worker which repulses the robot,
similar to potential fields (Taquet et al. 2017).
Robots should stop when entering the region covered
by level A, which is the one closest to the worker.
In safety level B, robots should notify the worker
about the risk and the travel speed can be limited.
Finally, in safety level C, the safety system
re-plans the robot and human paths to avoid close
encounters between them. Safety vests are also used
to facilitate the localization and detection of
humans.

Another collaborative scenario creates two-level
dynamic safety fields around the moving robot
(Magnanimo et al. 2016). The protection field
behaves like a cage; that is, the robot immediately
halts when detecting a human inside this field,
while detecting any object inside the warning field
results in a less rigorous action such as reducing
the robot speed. The sizes of the fields are changed
dynamically based on sensor data.

Both SafeLog and the work of Magnanimo et al.
(2016) use collaborative scenarios and are more
similar to our proposed approach. In contrast to
them, however, we introduce a three-layered safety
field in which the size of each field changes
dynamically based on 1) the environment conditions
and 2) the results of the risk assessment performed
on the captured data. We also present an offline
safety analysis performed before plans are sent to
the robots. Moreover, in the approach of Magnanimo
et al. (2016), the warning field is intended for
human safety assessment and does not focus on
interaction with other robots, as our approach
does.


3 WAREHOUSE LOGISTICS 
MANAGEMENT USING 
COLLABORATIVE ROBOTS 


The first part of this section presents details of our 
logistics use case within an automated warehouse 
and describes collaborative and non-collaborative 
scenarios to be performed safely inside the ware- 
house. The second part of this section describes 
the proposed architecture and its components. 


3.1 Collaborative scenarios 


Our use case is an automated warehouse where
autonomous robots and humans work together
in a shared environment to perform the logistics
management operations. The high-level task is to
load trucks with products. Multiple robots pick
up products from the shelves and deliver them to
conveyor belts, which in turn take the products to
the trucks. Robots can move freely in the
environment to perform the tasks and are equipped
with a robotic arm for pickup operations. Humans
interact with shelves by placing or moving products
on them. Thus, the shelves and warehouse floor are
shared among humans and robots, forming a
collaborative workspace and a collaborative
scenario, as shown in Figure 1. Here, the human and
the robot can come into close interaction with each
other, which leads to severe safety risks. Other
situations include human intervention when a
product is dropped by the robot and a human comes to
clean up, or the maintenance of a broken robot.
Proper procedures must be adopted and safety must be
ensured during these operations.


Figure 1. Illustration of robots in the warehouse and 
the dynamic safety fields. The blue cube is the recharging 
station, the small boxes in red, green and brown are prod- 
ucts on the shelf, and a white square on the floor next to 
the conveyor belt is a way-point. The green (clear), yellow 
(warning) and red (critical) circles are the safety zones 
around the robot. 


A robot can also come close to another robot 
in multiple situations. Encounters can happen dur- 
ing navigation from shelves to conveyor belts and 
when the robot is coming back to the shelf after 
delivering the product. Another encounter can 
occur when two or more robots are picking up 
products from a different row or column of the 
same shelf. Although the plan provided to each
robot ensures (through the offline safety analysis)
that two or more robots will not be at the same row
and column at a particular time, a situation can
still arise in which the products are placed close
together and there is a chance of colliding with
nearby robots.


3.2 System architecture 


This section describes the basic architecture model 
with some core functionalities necessary to provide 
all the features of an automated logistics ware- 
house with collaborative robots. Figure 2 presents 
an overview of the basic architecture. 

Some physical components of Figure 2 and their 
respective functions within the automated ware- 
house are described in Section 3.1, i.e., products at 
the shelves, robots picking up the products from the 
shelves and placing on the conveyor belt. A human 
worker is responsible for placing products on the 
shelves. 

To fully automate the warehouse scenario and
for simulation purposes, a digital representation of
all these physical components/actors is required.
We use digital twins for this purpose. A digital
twin is a digital/virtual representation of an
actor/component in the system. It provides a
well-known communication interface and resource
description, thereby hiding the specific complexity
of heterogeneous devices/resources. A Warehouse
Controller component performs the main control of
the system. It is responsible for all initial
configuration and for setting up the discovery
mechanisms. When a resource comes online, it needs
to register itself with the system. This in turn
triggers the spawning of a digital twin associated
with that resource; from that point on, all
interactions between the system and the actual
resource are performed through the digital twin.
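
The registration mechanism described above can be illustrated with a small sketch. The class and method names (`DigitalTwin`, `WarehouseController`, `register`, `dispatch`) are hypothetical and only show the idea that registration spawns a twin and that all later interaction goes through it. The paper does not specify an interface.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalTwin:
    """Uniform stand-in for one physical resource (hypothetical interface)."""
    resource_id: str
    kind: str  # e.g. "robot", "shelf", "conveyor"
    state: dict = field(default_factory=dict)

    def send(self, command: str) -> str:
        # A real twin would translate the command into the device's own protocol.
        return f"{self.kind}:{self.resource_id} <- {command}"


class WarehouseController:
    """Keeps one digital twin per registered resource."""

    def __init__(self):
        self._twins = {}

    def register(self, resource_id: str, kind: str) -> DigitalTwin:
        if resource_id in self._twins:
            raise ValueError(f"resource {resource_id!r} is already registered")
        twin = DigitalTwin(resource_id, kind)  # registration spawns the twin
        self._twins[resource_id] = twin
        return twin

    def dispatch(self, resource_id: str, command: str) -> str:
        # The controller never talks to the device directly, only to its twin.
        return self._twins[resource_id].send(command)
```

The benefit of this indirection is that the planner and the safety components can address robots, shelves and conveyors in the same way, regardless of the underlying device.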

The system needs to perform task planning and
monitor the execution of the tasks. Thus, a
planning service is implemented, employing the
traditional PDDL technique (Mcdermott et al. 1998) to
generate an overall plan for the whole system.¹

¹ Please note that the details of the warehouse controller,
planning service, and digital twins are presented to depict
a complete picture of the architecture and are beyond the
scope of this paper.

Figure 2. System architecture of the collaborative warehouse. The safety components are highlighted in red dotted
boxes and arrows.

While dealing with the collaborative scenarios,
there is a strong requirement to assure the safety
of human workers during operation. Our two-fold
safety strategy is represented by the dotted
components and arrows in Figure 2. First, there is an
initial assessment of the whole computed plan,
performed within the Offline safety analysis
element. If the plan is deemed safe, the tasks are
sent to the respective resources for execution.
Second, during the entire operation, a local Online
safety analysis is performed within the robots to
guarantee safety. It consists of multiple
components, namely feasibility analysis, decision
making, obstacle avoidance, safe locomotion and safe
pickup. All these safety-related functionalities are
described in the next section.


4 SAFETY STRATEGY 


A proper safety analysis is required for multiple 
collaborative scenarios in the automated ware- 
house (described in Section 3.1) where a robot can 
come close to humans, other robots or objects. 
For collaborative scenarios, ISO/TS 15066:2016
(ISO 2016) establishes a series of guidelines for
situations where robot and human worker share the
same workspace. First, the technical specification
requires that robots used in collaborative
operations meet the safety requirements defined in
ISO 10218:2011 (ISO 2011a, ISO 2011b). Second, the
document requires robots to support at least one of
the following methods: safety-rated monitored stop,
hand guiding, speed and separation monitoring, and
power and force limiting. In the first method, the
robot can operate normally while the human is
outside the collaborative workspace. When human and
robot are in the workspace together, the robot must
be in a safety-rated monitored stop, resulting in a
stop category 2 (actuators are still powered on) as
defined in IEC (2016). If any of the previous
conditions is violated, the robot must issue a
protective stop resulting in a stop category 0
(actuators are powered off). In hand guiding mode,
the human must control the robot using a device
located at or near the robot’s end-effector. Before
the human enters the workspace, the robot must be in
a stop category 2. The third method allows human and
robot to move while both are inside the workspace,
but the robot shall keep a protective distance from
the human (ISO 2016). This distance varies with
parameters such as velocity, robot reaction time,
robot stopping distance and operator velocity; to
keep that distance, the robot’s velocity can be
restricted. The power and force limiting method
allows intentional and unintentional contact
between robot and human. In this mode, the robot
should keep the values of force, pressure and energy
transfer within limits that depend on the part of
the human body and the contact situation.
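
The dependence of the protective distance on these parameters can be illustrated with a simplified calculation. The sketch assumes the commonly cited constant-speed form of the protective separation distance, i.e. the sum of the human's travel during reaction and stopping time, the robot's travel during its reaction time, the robot's stopping distance, an intrusion allowance and position uncertainties. The parameter names and the example values are illustrative, and the specification itself should be consulted for the normative formula.

```python
def protective_separation_distance(v_human, v_robot, t_reaction, t_stop,
                                   stopping_distance, intrusion=0.0,
                                   z_human=0.0, z_robot=0.0):
    """Simplified protective distance in metres (speeds in m/s, times in s)."""
    s_human = v_human * (t_reaction + t_stop)  # human keeps approaching until standstill
    s_robot = v_robot * t_reaction             # robot still moves while reacting
    return (s_human + s_robot + stopping_distance
            + intrusion + z_human + z_robot)


def must_reduce_speed(current_distance, required_distance):
    """True if the measured separation is below the required protective distance."""
    return current_distance < required_distance
```

For example, with a human approaching at 1.6 m/s, a robot moving at 1.0 m/s, a reaction time of 0.1 s, a stopping time of 0.4 s and a stopping distance of 0.2 m, the required distance is about 1.1 m. Lowering the robot speed shortens its stopping time and distance, which is why restricting velocity helps to maintain the separation.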

Our proposed safety approach is based on both the
speed and separation monitoring and the power and
force limiting methods. Although distance
monitoring is performed throughout the process, most
safety measures during navigation are based on
limiting the speed of the robot. During robotic arm
operations, on the other hand, all safety measures
impose limitations on power and force.

The following sections present our main
approach to achieving safety. Our key aim is to
enhance safety through a dynamic system capable of
adapting the robot’s behavior based on its current
context. To this end, we present a three-layered
safety strategy in Section 4.1. Sections 4.1.1 and
4.1.2 present how tactical and operational aspects
of safety will be handled by different components at
different moments. In the context of an automated
warehouse where all devices are interconnected
(i.e., share information), make decisions and
perform actions, a robust safety analysis should be
performed. Section 4.2 briefly discusses the need
for a real-time implementation of the risk
assessment approach to achieve robust hazard
management. Another important aspect of effective
and safe co-working in the automated warehouse is
the human psychological state. This aspect of
establishing human trust is discussed in Section
4.3.


4.1 Three-layered safety strategy 


Our safety strategy is based on dynamic safety fields 
around the robot. The strategy defines three dis- 
tinct levels of safety: critical (red), warning (yellow) 
and clear (green). The levels classify the “degree of 
safety” in which the robot is currently operating in 
relation to outside elements such as humans and 
other robots while also taking into account the nature 
of the operation being performed by the robot itself. 

The green level corresponds to a clear state, i.e.,
even if the robot is detecting obstacles (i.e., human,
robot, warehouse infrastructure), those are at a safe
distance from the observing robot. Therefore, the
robot is clear to continue working with its current
setup and parameters.

The yellow level is a warning state. Here, the
robot has detected an obstacle closer than a certain
threshold and has to adapt its behaviour to
guarantee that it is operating in a safe state. The
extent and impact of this adjustment depend on its
current operation and the nature of the detected
obstacles. If the robot is moving inside the
warehouse, it might be necessary to reduce its
velocity, but if it is performing an operation with
its manipulator, it might be necessary not only to
decrease the speed of the joints but also to change
the areas within which the arm is allowed to move.

Finally, the red level is the critical level. When
the robot is at this level, a human or another robot
is very close and safety is under threat. Therefore,
every action needs to ensure that neither the
human, the other robot nor the robot itself is
injured or damaged. In most cases, the natural
decision is to perform a category 2 stop on the
robot, i.e., to stop the movement of the arm or the
robot platform completely in a controlled manner.

To classify the safety state (green, yellow or
red) and alter the robot’s behavior accordingly, the
nature of the detected object (i.e., human, robot,
infrastructure) and the current context (i.e., the
operation being performed) should be taken into
consideration. To help the human establish trust in
the robot, it is also important to inform the nearby
detected human about the robot’s safety state. This
can be done by triggering a cue and/or by using
augmented reality devices that present alerts and
detailed information about the robot, similar to the
approach of SafeLog (2017).
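
The classification logic of the three levels can be sketched as follows. The field radii per obstacle type and the textual reactions are hypothetical placeholders chosen for illustration. In the proposed strategy the radii would be computed dynamically from the risk assessment rather than read from a fixed table.

```python
from enum import Enum


class SafetyLevel(Enum):
    GREEN = "clear"
    YELLOW = "warning"
    RED = "critical"


# Hypothetical (red, yellow) field radii in metres per obstacle type.
FIELD_RADII = {
    "human": (1.0, 2.5),
    "robot": (0.5, 1.5),
    "infrastructure": (0.3, 1.0),
}


def classify(distance_m, obstacle):
    """Return the safety level for an obstacle of the given type and distance."""
    red, yellow = FIELD_RADII[obstacle]
    if distance_m <= red:
        return SafetyLevel.RED
    if distance_m <= yellow:
        return SafetyLevel.YELLOW
    return SafetyLevel.GREEN


def reaction(level, operation):
    """Map level and current operation ('navigation' or 'manipulation') to an action."""
    if level is SafetyLevel.RED:
        return "category 2 stop"
    if level is SafetyLevel.YELLOW:
        if operation == "navigation":
            return "reduce platform speed"
        return "reduce joint speed and restrict arm workspace"
    return "continue"
```

With these placeholder radii, an obstacle at 0.8 m is critical if it is a human but only a warning if it is another robot, which reflects the idea that both the nature of the object and the current operation determine the response.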


Additionally, in a truly collaborative scenario,
situations may arise where the human and the robot
have to almost touch each other in order to complete
some task, for instance when a robot needs to hand
an item to a human or the other way around. Here,
the human will most likely be very close to the
robot and, depending on the nature of the
interaction, completely stopping the robot might
not be applicable. Although the robot needs to be
allowed to perform some actuation to conclude the
hand-over, other types of actuation should be
extremely limited if not forbidden. For example, the
robot needs to be able to actuate the gripper and
perhaps some other parts of the manipulator itself,
but these movements should be performed under
severe constraints, while the overall movement of
the robotic platform should be forbidden.

We envision that an advantage of using different
safety fields will be the increased performance
of collaborative operations. Instead of completely
stopping the robot in the presence of humans, it
would be better if the robot could keep performing
its operations in a constrained manner. Further, because
the field sizes change dynamically based on
the input data and calculated risk², the robot could
maneuver efficiently even in small places, e.g., by
reducing field sizes when no risk is present.


4.1.1 Offline safety analysis 

To analyze the safety of the plan generated
by the planning service from a tactical perspective,
an offline safety analysis is performed before
sending the plan to the robots. This analysis is implemented
by running the candidate plan in a realistic
simulated environment (i.e., using a 3D physics-based
robotics simulator such as V-REP (Rohmer, Singh,
and Freese 2013) or Gazebo (2017)). The purpose
of this step is to reduce the possibility of unforeseen
situations, e.g., robots moving too close to one
another, which could lead to unnecessary recalculations
by the planning service.

The simulation reproduces all objects, robots and
other devices³, and receives as input a starting state
for the warehouse and a plan to be implemented.
The offline safety analysis module returns a feasibility
status (whether the plan is feasible/safe or not)
and a level of risk associated with following the given
plan. To calculate this level of risk, the following
measurements can be used: the number of
robots passing close to other robots or obstacles,
the waiting time for other robots to move away in
order to pick up or drop products, and the number
of times robots have to access the same shelf.

²The risk assessment module will calculate the risk level for
situations in real time, as described in Section 4.2.

³At this moment we are not considering modeling
humans in the simulated scenario due to the complexity
and unpredictability of human behavior. Future work
could address this point to bring the safety evaluation
closer to that of the real scenario.
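To make the two outputs of the offline analysis concrete, the sketch below combines the three measurements into a single risk value and a feasibility flag. The weights, the acceptance limit and all names are assumptions made for illustration; the paper does not specify how the measurements are aggregated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanMetrics:
    close_crossings: int        # robots passing close to robots or obstacles
    waiting_time_s: float       # total time robots wait for others to move away
    shared_shelf_accesses: int  # times robots have to access the same shelf


# Hypothetical weights and acceptance limit (not taken from the paper).
WEIGHTS = {"crossing": 1.0, "waiting": 0.01, "shelf": 0.5}
MAX_ACCEPTABLE_RISK = 10.0


def plan_risk(m):
    """Weighted sum of the three measurements collected in simulation."""
    return (WEIGHTS["crossing"] * m.close_crossings
            + WEIGHTS["waiting"] * m.waiting_time_s
            + WEIGHTS["shelf"] * m.shared_shelf_accesses)


def analyse(m):
    """Return (feasible, risk), mirroring the module's two outputs."""
    risk = plan_risk(m)
    return risk <= MAX_ACCEPTABLE_RISK, risk
```

A plan rejected here would be returned to the planning service before any robot receives it.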


4.1.2 Online safety analysis at robot 

While the offline analysis is responsible for dealing 
with safety at a tactical level, robots are responsible 
for handling most of the safety requirements at the 
operational level. The robots must be capable of not 
only detecting static and dynamic obstacles but also 
identifying the human workers. To detect the human 
workers, a variety of techniques and sensors can be 
applied. When a worker is detected within a minimal 
distance, the robot must enter a mode where its power 
and force are limited to a threshold to ensure no phys- 
ical harm to the worker, but may still allow physical 
contact between them (ISO 2016). This method of 
operation avoids a complex tracking of all human 
movements and distances from the robot parts, yield- 
ing a simpler and more straightforward system. 

Interactions that happen strictly among robots
have fewer safety requirements. In these situations,
robots must be aware of their poses, and collision
avoidance techniques such as those of Singh and Krishna
(2013) and Belkhouche (2017) can be applied. When
dealing with static objects (such as shelves, conveyor
belts, walls and non-interactive equipment),
the robot can follow classical collision avoidance
behavior such as that of Khatib (1986), since in those
scenarios safety requirements are less demanding.

Each robot in our scenario is modeled as an
autonomous cognitive agent which is capable
of high-level decision-making⁴, i.e., it has its own
goals, perception, actuation and decision-making
capabilities. Even though our control strategy includes
a centralized planning service, which can
perform planning at the level of the whole warehouse
and send tasks to each robot, a minimum
level of autonomy at each robot is necessary to
deal with unforeseen events. For example, when the
robot receives a high-level task, its cognitive agent
performs a local safety analysis of the
received task/instructions. If the instructions are
deemed infeasible or too risky, the robot tries to
find an alternative way of accomplishing the task.
A diagram that illustrates the connection between
the robot, its cognitive agent, its digital twin and the
environment can be seen in Figure 3.

Each robot must be able to acquire information 
about its current environment from a multitude of 
sources, of both proprio- and exteroceptive nature, 
and be able to fuse this information to build a com- 
plete perspective of its state and the world around it. 


⁴We are implementing the high-level control of robots as
cognitive agents (Laird 2012), which are capable of reactive
and deliberative decision-making. We believe such an
approach can produce more resilient robotic agents that
are able to adapt to unforeseen scenarios.

Figure 3. Cognitive control at each robot is responsible
for monitoring the environment, with instructions from the
warehouse controller coming via the digital twin. The robot
then reasons about the most appropriate way to perform
those instructions given what it knows about the local
environment through its sensors’ input.


For environment perception, we choose cameras and
LiDARs to detect static and dynamic obstacles in the
vicinity of the robot. Cameras can additionally
be used to recognize special vests/tags worn by
the humans (e.g., ANSI/ISEA 107-2004) to facilitate
human detection. The robots’ setup also
includes wheel encoders and an IMU for self-localization.
As position estimation is a key component
in avoiding close encounters between robots and workers,
we consider the use of radio triangulation,
through technologies such as Wi-Fi, Bluetooth Low
Energy and Ultra-Wideband, together with the encoder
and IMU data (Jiménez and Seco 2017).


4.2 Risk assessment 


We are considering implementing a risk assessment algorithm
(e.g., fuzzy logic, neural networks, or neuro-fuzzy
as described in (Viharos and Kis 2015)) to
calculate the risk level (i.e., high, medium, low) and
then to calculate the sizes of the dynamic safety fields
based on that risk level. The algorithm will
be implemented within the cognitive agent’s decision-making
module. Proprioceptive and exteroceptive
data from the chosen sensors are input to the module,
and from these data the risk assessment algorithm
calculates the current risk level for the robot.

Depending on the level of risk calculated by the
agent, the module 1) recalculates the sizes of
the fields and 2) modifies the robot’s behavior
in order to increase safety accordingly, as shown
in Figure 4.

For example, if no object is detected within the
warning or critical fields, and the area where the robot
has to move is narrow, then the size of the fields can
be reduced so that the robot can move safely within
the area. If a human/object is detected moving fast
towards the robot, then the robot should not only reduce
its speed but also increase the size of the fields
around it to avoid any hazardous situation.
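The coupling between risk level, field sizes and robot speed can be illustrated with a small sketch. The nominal radii, scale factors, reaction time and function names are all hypothetical; a fuzzy or neuro-fuzzy module, as considered above, would replace the simple lookup tables.

```python
BASE_FIELDS_M = {"warning": 3.0, "critical": 1.0}       # hypothetical nominal radii
RISK_SCALE = {"low": 0.5, "medium": 1.0, "high": 1.5}   # hypothetical scale factors
SPEED_FACTOR = {"low": 1.0, "medium": 0.5, "high": 0.2}  # hypothetical speed factors


def field_sizes(risk_level, approach_speed_mps=0.0, reaction_time_s=0.5):
    """Scale the warning and critical radii with the assessed risk level.

    An object approaching at approach_speed_mps covers
    approach_speed_mps * reaction_time_s metres before the robot reacts,
    so that distance is added as a margin to both radii.
    """
    scale = RISK_SCALE[risk_level]
    margin = max(approach_speed_mps, 0.0) * reaction_time_s
    return {name: radius * scale + margin
            for name, radius in BASE_FIELDS_M.items()}


def speed_limit(risk_level, nominal_mps=1.5):
    """Reduce the platform speed as the risk level increases."""
    return nominal_mps * SPEED_FACTOR[risk_level]
```

With these assumed numbers, a low risk level shrinks both fields, which is what allows the robot to pass through a narrow aisle, while a fast-approaching object enlarges them.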

For safety reasons, in our scenario the arm is not
used while the robot’s base is in movement. The arm
stands still at a position within the base’s bounding
box. When the robot’s base is standing still, the
robotic arm is then able to move to perform its
pick-up and drop actions. This scenario is illustrated
in Figure 5, where red dotted lines represent the
areas considered to be critical for human interaction.

Figure 4. Workflow to execute the proposed safety
strategy.

Figure 5. Difference in risk assessment for the cases
when the robot is moving (to the left) and when it is
standing still and moving its arm around (to the right).
Green arrow indicates movement.


4.3 Discussion on establishing human trust 


It is necessary to ensure that the worker feels comfortable
and safe when cooperating with a robot,
and that the mental strain associated with such tasks
is bearable. In (Arai, Kato, and Fujita 2010), three factors
influencing the worker’s mental strain (distance, speed and
warnings of motion) were varied in order to define
design criteria to improve human worker comfort.
Suitable training of the worker is addressed in ISO/TS 15066:2016
(ISO 2016). Training clearly has an influence on the worker’s
confidence and stress levels as well as on their safety.

We propose that the robot signals the detection
of a human or another robot in its safety zones
through a visual cue (for example,
an LED panel on the robot), a sound alert or augmented
reality. For example, if visual cues are used, a
red light can be displayed when a human is identified
inside the red zone, a yellow light for the yellow zone
and a green light otherwise, to indicate that the robot
is performing its operation normally. In this way
the human is informed about the robot’s operation
and feels safe while performing collaborative tasks.
The robot performs similar actions when another
robot comes closer to it.

Most works focus on increasing
the trust of humans in machines. However, human
trust in the automated system should not be blind.
Excessive trust could be as harmful as a lack of it, or even more so.
Therefore, the concept of “calibrated
trust” should be explored (Lee and See 2004).

Calibrated trust tries to minimize the mismatches
between trust and the capabilities of automated
systems. Over-trust means poor calibration
in which trust exceeds what the system is capable
of delivering, and distrust means not trusting
the system enough, thinking it is not capable of
delivering what in fact it can. Both scenarios are
undesirable because they could generate safety
issues (e.g., over-trusting an autonomous car with
level 3 autonomy to behave like one with level 4
autonomy; see Automated driving (...), weblink in the
‘References’ section) and reduced efficiency (e.g., a
worker insisting on doing by hand something that
is already automated).

With that in mind, we intend to increase the 
level of human-machine mutual understanding 
(Azevedo, Raizer, and Souza 2017), by making clear 
to the human user the intentions and motivations 
that led the automated system to make its decisions. 


5 CONCLUSIONS AND FUTURE WORK 


In this position paper, we have presented a detailed
architecture to realize a safe collaborative automated
warehouse scenario. We have proposed a
safety strategy that combines the safety analysis
performed globally by the warehouse controller
(offline safety analysis) and locally by the robot
itself (online safety analysis). The simulation-based
offline analysis will check the safety of the
high-level plans for the robots, and
will ensure that collisions will not happen.

Considering the dynamism of the warehouse, 
unexpected changes in the robot and human plans
may occur at run-time. To address unforeseen 
events, we have presented a three-layered safety 
field approach to be applied by the robots to adjust 
their behavior locally. 

Currently, we are implementing the proposed
architecture in a simulated scenario using the V-REP
simulator. As future work, we will extend the
simulation with our proposed three-layered safety
strategy. We will also implement it using physical
robots. This will consist of the implementation of
the risk assessment module and the three-layered
safety strategy. Results will be evaluated using a set
of Key Performance Indicators (KPIs) based on
safety requirements and the overall performance
of the warehouse. We also intend to imbue each
robot with learning capabilities, so it can adapt
to situations that are particular to the warehouse.
For instance, suppose robot A is picking up a product at
a shelf while robot B is waiting to do the same.
Robot B could learn that the pick-up action at this particular
shelf takes longer than at others, and adjust
its behavior accordingly (e.g., should it keep waiting
or should it move to another shelf?). Another
future direction could be to extend the simulation
with a human model and work on trust aspects to
bring the safety evaluation closer to the real world.

ABSTRACT: The paper presents a versatile simulation model of the operation process of a complex
technical system. Among other features, the model assumes dependencies between the system’s components
and a priority-based maintenance/repair policy implemented by limited personnel. Over
recent decades various maintenance models were investigated with the aim of optimizing
maintenance costs. Most researchers pursued an analytical approach, hence the obtained results hold true under
quite restrictive assumptions imposed for the purpose of analytical tractability (mutual independence of
components, exponential time-to-failure and time-to-repair, no waiting time for repair, etc.). The model
described herein is free from such limitations, as it is designed to compute the system’s reliability or
performance indices by way of simulation rather than analytically. The following features make it more
realistic than many other models from the relevant literature: 1) the components are mutually dependent,
i.e. a component’s state change affects the failure rates of some other components, 2) repairs can be
perfect, imperfect, or minimal, 3) due to limited maintenance personnel, a failed component may have
to await its repair in a priority queue, where the priority level depends on the component’s importance, 4)
a component state may be unobservable, in which case hidden failures are revealed by inspections. Using
such a model, the optimal or near optimal parameters of an assumed maintenance policy can be found
by repeated simulation. Any performance index expressed as an a priori known function of these parameters
can be subject to optimization. The author’s main result is a non-trivial simulation algorithm encompassing
the model’s comprehensiveness. Clearly, the simulation approach, although time-consuming, makes it possible
to avoid elaborate analytical derivations which, in the absence of simplifying assumptions, become unreasonably
complicated or impossible as the system’s complexity increases.


1 INTRODUCTION

Over the recent decades numerous researchers
investigated various maintenance models with the
aim of optimizing the maintenance policies of
the modeled systems. They mainly used an analytical
approach, hence most of the obtained results hold
true under quite limiting assumptions imposed for
the purpose of analytical tractability (mutual independence
of components, exponential TTF/TTR,
no waiting time for repair). See Couchan et al.
(2013), Nakagawa (2008a), Nakamura et al. (2017),
Sarkar et al. (2011), and Wang (2002) for surveys
of more or less recent works on the subject.

The model presented in the current paper is free
from such limitations, as it is constructed for the purpose
of simulating the system’s behavior rather than
analytically computing its reliability or performance
indices. This model assumes that the system is composed
of two-state mutually dependent components,
which means that the aging rate of a component may
depend on the states of other components. When
failed, a component is either replaced or repaired,
depending on its wear-related age measured by integrating
the component’s changeable aging factor
over time up to the moment of failure. A component
is also replaced on reaching the age limit for
its operation, even if it is still in operating condition
(preventive replacement). Repairs and replacements
are performed by limited maintenance personnel,
hence a failed component may have to wait in
a maintenance queue to be serviced. There are
multiple queues corresponding to various priorities
assigned to individual components. The components
waiting in a higher priority queue have precedence
over those in the lower priority queue, and
each queue is a FIFO sequence.

Based on the above model, a simulation framework
for the reliability analysis of a complex
technical system is constructed. By assumption,
the simulation begins at time $t_0 = 0$, when all the
components are new and operable. The simulation
procedure produces a sequence of time instants at
which changes of components’ states occur, i.e. if
$t_k$ is the time when a component changes its state,
then $t_{k+1}$ is the time of the subsequent state change
of the same or another component, $k \ge 1$. More
accurately, $t_1$ is the time of the first component
failure, and one of the following events takes
place at each subsequent $t_k$, $k > 1$:


- a component fails

- a component’s hidden failure is detected

- a component’s age reaches the age limit for
operation

- a component’s repair or replacement is
completed


Depending on a component, its failures can be 
self-revealing or hidden. In the first case a failure 
and its detection are simultaneous events; in the 
second case a failure can only be detected at the 
next inspection or preventive replacement, which- 
ever comes first, and until then the component 
remains in the state of undetected failure. A binary 
vector defined in the next section is used to distin- 
guish between the two types of components. Upon 
a failure detection, either the component’s repair or 
replacement is started (if at least one maintenance 
team is available), or (if all the teams are busy) the 
component is placed in the respective maintenance 
queue, following the rules given in section 4. 

The proposed simulation model has a number 
of changeable (adjustable) parameters. One of the 
key parameters is the age limit on a failed compo- 
nent’s repair; if a component’s failure is detected 
before its operating age has reached that limit, then 
the component is scheduled for repair, otherwise it 
is scheduled for replacement. Another key param- 
eter is the age limit for a component’s operation; 
when this limit is reached, the component is put 
out of operation and scheduled for replacement 
regardless of its state. Clearly, the value of the 
former parameter should be smaller than the value 
of the latter one. 

Each failure, repair, and replacement incurs a 
cost, and so does a sojourn in a failed state. The 
system operator’s aim is to minimize the overall 
operating cost over a certain period of time, or 
the average operating cost per unit time over an 
indefinitely long period. This aim can be achieved 
by appropriate adjustment of the aforementioned 
changeable parameters (decision variables). This
can only be done by trial and error
combined with repeated simulation, as
the model’s complexity practically excludes analytical
optimization methods.
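The trial-and-error search can be illustrated on a deliberately reduced case: a single component with one decision variable, the age limit for operation. The sketch below assumes a Weibull lifetime and made-up costs (none of which come from the paper), estimates the cost per unit time by simulating many replacement cycles, and keeps the cheapest candidate.

```python
import random


def cost_rate(age_limit, n_cycles=20000, seed=1,
              shape=3.0, scale=1.0, c_fail=10.0, c_prev=1.0):
    """Estimate the long-run cost per unit time of one component.

    Each cycle ends either with a failure (cost c_fail) or with a
    preventive replacement at age_limit (cost c_prev), whichever is first.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)  # same seed for every candidate (common random numbers)
    total_cost = 0.0
    total_time = 0.0
    for _ in range(n_cycles):
        failure_age = rng.weibullvariate(scale, shape)
        if failure_age < age_limit:
            total_cost += c_fail
            total_time += failure_age
        else:
            total_cost += c_prev
            total_time += age_limit
    return total_cost / total_time


def best_age_limit(candidates):
    """Trial and error: simulate every candidate and keep the cheapest."""
    return min(candidates, key=cost_rate)
```

In the full model the same loop would run over combinations of all decision variables, with the complete system simulation in place of `cost_rate`.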


2 ACRONYMS AND NOTATION 


CDF - cumulative distribution function

PDF - probability density function

TTF - time-to-failure

TTR - time-to-repair

IFR - increasing failure rate

n - number of components

$e_1, \ldots, e_n$ - the individual components

Q - number of priority levels (maintenance
queues) for maintenance scheduling

R - number of maintenance teams

D(i) - set of components whose states affect the
aging rate of $e_i$; let us note that $i \notin D(i)$

$X_i(t)$ - the reliability state of $e_i$ at time t; $X_i(t) = 1$
if $e_i$ is in operation, and $X_i(t) = 0$ if $e_i$ is failed or
under repair

$Y_i(t)$ - the operational state of $e_i$ at time t; $Y_i(t) \le -1$
when $e_i$ is awaiting repair, $Y_i(t) = 0$ when $e_i$ is
undergoing repair, $Y_i(t) = 1$ when $e_i$ is in operation,
and $Y_i(t) = 2$ when $e_i$ is in the state of undetected
failure

$a_i(t)$ - the aging rate of $e_i$ at time t; $a_i(t)$ can differ
during $e_i$’s lifetime and, if $Y_i(t) = 1$, it is dependent
on the states of $e_j$, $j \in D(i)$

$A_i(t)$ - the age of $e_i$ at time t, defined as $\int_0^t a_i(s)\,ds$,
where t is counted from the last moment when
$e_i$ was first put in operation or replaced

$T_i(t)$ - the prospective sojourn time of $e_i$ in the
state $Y_i(t)$, as counted from t, provided that the
aging rate after t is constant and equal to $a_i(t)$. The
prospective and actual sojourn times may differ,
because $e_i$’s aging rate may change before the prospective
time elapses.

V[i] - age limit for $e_i$’s repair, i.e. if t is the time
of $e_i$’s failure (or its detection) and $A_i(t) \ge V[i]$ then
$e_i$ will be replaced rather than repaired

W[i] - age limit for $e_i$’s operation, i.e. $e_i$ will be
replaced at time t, regardless of its state, when $A_i(t)$
reaches W[i]. Clearly, W[i] > V[i]

dtc[i] - a binary variable stipulating whether
failures of $e_i$ are self-revealing; if so, then dtc[i] = 1,
otherwise dtc[i] = 0

S - time between consecutive inspections that are
necessary to reveal failures of $e_i$ for which dtc[i] = 0

$A_i^*$ - age of $e_i$ at its failure or preventive replacement,
whichever comes first; $A_i^*$ is simulated each
time when $e_i$ enters state 1, i.e. when a new, replaced,
or repaired $e_i$ is put in operation

sim(i, 1, A) - a function simulating the random
age of $e_i$ at which its failure will occur, given that $e_i$ is
in state 1 and its already accumulated age equals A

sim(i, 0, A) - a function simulating the random
duration of $e_i$’s repair ($A < V[i]$) or replacement
($A \ge V[i]$), given that A is the already accumulated
age of $e_i$

prt[i] - the maintenance priority level of $e_i$

len[q] - the current length of the repair queue no. q

ind[q, r] - the index of the component awaiting
its turn in the r-th place in the q-th queue

avl_r - the number of currently available repair
teams

$C_f[i]$, $C_{rep}[i]$, $C_{rpl}[i]$ - the costs of $e_i$’s failure,
repair, or replacement respectively

$C_{und}[i]$ - the unit cost of $e_i$’s sojourn in state 2
(undetected failure)

$C_{idl}[i]$ - the unit cost of $e_i$’s sojourn in idle ($\le 0$)
state (awaiting or under maintenance)

$C_{team}$ - the unit cost of employing one team

c, — average total operating cost per unit time 
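The definitions of the age and of the prospective sojourn time can be illustrated with a short sketch. It assumes, as the model does between state changes, that the aging rate is piecewise constant; the function and argument names are hypothetical.

```python
def accumulated_age(intervals):
    """Age as the integral of a piecewise-constant aging rate.

    intervals is a list of (duration, aging_rate) pairs covering the time
    since the component was put in operation or replaced.
    """
    return sum(duration * rate for duration, rate in intervals)


def prospective_sojourn(target_age, current_age, aging_rate):
    """Time needed to reach target_age if aging_rate stays constant."""
    if aging_rate <= 0:
        raise ValueError("aging_rate must be positive for an operating component")
    return (target_age - current_age) / aging_rate
```

For instance, a component that aged at rate 1 for 2 time units and at rate 2 for 3 time units has age 8; if its simulated age at failure is 10 and the rate stays at 2, it is expected to remain in operation for 1 more time unit.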


3 A COMPONENT’S LIFE CYCLE


The functioning of a single component can be
modeled by a stochastic process with the state
space {..., −1, 0, 1, 2}. The assignment of individual
states is shown below:

−1 or less: a component placed in a repair queue,

0: a component under repair,

1: an operable component,

2: a component with undetected failure.

If a component is in a “negative” state, then the
absolute value of this state determines the component’s
place in the respective repair queue. Fig. 1
illustrates all possible inter-state transitions. Let us
note that all the “negative” states are grouped into
one.

The above diagram can be simplified if a com- 
ponent’s failures are self-revealing, i.e. it never 
enters state 2. In such a case the node representing 
the state 2 can be deleted along with the adjoining 
links. 

The diagram in Fig. 2 illustrates the transitions
to or from the “negative” states, i.e. placements in
the repair queue, forward moves in the queue, and
repair completions. It is assumed that multiple
repairs are never completed simultaneously (simultaneous
completion is an event of probability 0).
Otherwise, transitions from $-k$ to $\min(-k+s, 0)$,
where $k \ge 2$ and $2 \le s \le r$, would also be possible,
which would increase the diagram’s complexity.

Figure 1. The diagram of a component’s inter-state
transitions.

Figure 2. A diagram of transitions to or from the “negative”
states.
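The state encoding of this section can be summarized in a few lines of code. This is only an illustrative sketch with hypothetical function names; it covers the meaning of each state and the single forward move in a queue that occurs when the component at the head of the queue starts its repair.

```python
def describe_state(y):
    """Translate an operational state value into words."""
    if y <= -1:
        return f"awaiting repair, place {-y} in its queue"
    return {0: "under repair", 1: "operable", 2: "undetected failure"}[y]


def move_forward(y):
    """Advance a waiting component by one place; state -1 moves on to repair (0)."""
    if y >= 0:
        raise ValueError("only components in negative states wait in a queue")
    return y + 1
```

Because simultaneous repair completions are excluded, a waiting component never advances by more than one place at a time.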


4 THE MAINTENANCE POLICY RULES 


A failure of $e_i$ is revealed immediately if dtc[i] = 1.
In turn, if dtc[i] = 0, then $e_i$’s failure is revealed at
the time of the next inspection. When this happens,
$e_i$’s maintenance is performed immediately
provided that the maintenance personnel is available,
otherwise $e_i$ must wait for its turn. Furthermore,
when the age of $e_i$ reaches W[i], it is either
replaced or scheduled for replacement, no matter
what its state is (if $e_i$ is still operating, its operation
is stopped). The components awaiting maintenance
are placed in Q (logical) queues, corresponding to
different maintenance priority levels. The levels
assigned to individual components are stored in the
one-dimensional array prt[·], where prt[i] is the
level assigned to $e_i$. Components on priority level q
are placed in the q-th queue, $1 \le q \le Q$. Each queue
has the FIFO property, i.e. components with equal
priority are placed in the respective queue in the
order of their failure times: the most recently failed one
is placed at the end of the queue. The components
in the q-th queue are scheduled for maintenance
ahead of those in the (q+1)-st queue, $1 \le q \le Q-1$,
i.e. the lower the component’s priority level, the
sooner its maintenance will start. Summing up,
the component $e_i$ awaits repair in the queue whose
number is given by prt[i], and the (negative) state
of $e_i$ determines its place in this queue.
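The queueing rules above can be sketched with one FIFO queue per priority level. The class and method names are hypothetical, and the sketch stores component labels rather than the arrays len[q] and ind[q, r] used in the paper.

```python
from collections import deque


class MaintenanceQueues:
    """Q FIFO queues; queue 1 is served first (illustrative sketch)."""

    def __init__(self, n_levels):
        self.queues = [deque() for _ in range(n_levels)]

    def enqueue(self, component, level):
        """Append component to the end of queue `level` (1-based).

        Returns the component's negative state, i.e. minus its place.
        """
        self.queues[level - 1].append(component)
        return -len(self.queues[level - 1])

    def next_for_maintenance(self):
        """Pop the head of the first non-empty queue, or None if all are empty."""
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None

    def state_of(self, component, level):
        """Current negative state of a waiting component."""
        return -(self.queues[level - 1].index(component) + 1)
```

Scanning the queues in increasing order of level implements the rule that a lower priority level means earlier maintenance, while `popleft` preserves the FIFO order within a level.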

If $e_i$ is repairable and the time of its maintenance
arrives, then $e_i$ undergoes repair, provided
that its age has not reached the limit V[i]. If $e_i$’s
age is greater than or equal to V[i], then it undergoes
replacement rather than repair. It is necessary to
be aware of the difference between V[i] and W[i].
V[i] is the age limit for $e_i$’s repair. If $e_i$’s age reaches
or exceeds V[i] and $e_i$ fails, then $e_i$ is only subject to
replacement. W[i] is the age limit for $e_i$’s operation.
If $e_i$’s age reaches or exceeds W[i], then $e_i$ will be
replaced even if it is in the operating state.
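The difference between the two age limits can be stated as a small decision function. The function name, its arguments and the returned labels are hypothetical; `repair_limit` plays the role of V[i] and `operation_limit` that of W[i].

```python
def maintenance_action(age, failed, repair_limit, operation_limit):
    """Return "repair", "replace" or None for one component."""
    if repair_limit >= operation_limit:
        raise ValueError("the repair limit must be smaller than the operation limit")
    if age >= operation_limit:
        return "replace"   # preventive replacement, whether failed or not
    if failed:
        # a failed component is repaired only below the repair limit
        return "repair" if age < repair_limit else "replace"
    return None            # still operating and below both limits
```

A failed component of age 12 with limits 10 and 20 is therefore replaced, whereas an operating component of the same age is left alone until its age reaches 20.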


5 THE SIMULATION ALGORITHM 


Based on the maintenance model presented
in the previous sections, an algorithm
simulating the system’s operation is now presented.
The core of this algorithm is the procedure
generating the sequence $(t_k, k \ge 1)$ which, as indicated
in the Introduction, is the sequence of time
instants at which components change their operational
states. The algorithm operates in a loop consisting
of the following steps:


1. At $t_0 = 0$ all variables characterizing the maintenance
process, defined in the Notation section,
are given their initial values. The respective
pseudocode is given below.

k ← 0; t_0 ← 0; len[q] ← 0, q ∈ {1,...,Q};
avl_r ← R;
for i = 1,...,n do {
  Y_i(t_0) ← 1; A_i(t_0) ← 0;
  A_i* ← min( sim(i,1,0), W[i] );
  T_i(t_0) ← A_i*/a_i(t_0)
}

Since all components are in state 1 at $t_0$, their
prospective sojourn times in this state are simulated
with the use of sim(i, 1, 0). In order to obtain
$T_i(t_0)$ the simulated age has to be divided by $a_i(t_0)$,
as follows from the definitions of $A_i(t)$ and $T_i(t)$.
2. Increase k by 1.

3. Compute $t_k$ as $t_{k-1}$ plus the smallest $T_i(t_{k-1})$ over
all $e_i$ that are in non-negative states at $t_{k-1}$; the
components in negative states, i.e. those awaiting
repair, are irrelevant to the system behavior
in the interval $[t_{k-1}, t_k)$. Also, add $a_i(t_{k-1}) \cdot (t_k - t_{k-1})$
to the age of each $e_i$ that is not in state 0 at $t_{k-1}$,
which yields $A_i(t_k)$ (the aging rate of a component
under maintenance equals 0).

4. Assign $Y_i(t_{k-1})$ to $Y_i(t_k)$ for each i, so that each
component that does not change its state at $t_k$ will
remain in the same state in which it was at $t_{k-1}$.

5. Compute $Y_i(t_k)$ for each $e_i$ that is in a non-negative
state at $t_{k-1}$ and changes its state at $t_k$. Each such $e_i$
is identified by checking if its sojourn time in the
state $Y_i(t_{k-1})$, as counted from $t_{k-1}$, equals $t_k - t_{k-1}$.¹
For any $e_i$ fulfilling this condition we have:

1. if $Y_i(t_{k-1}) = 1$, then $e_i$ changes its state to 2
at $t_k$ provided that dtc[i] = 0 and inspection
or preventive replacement does not coincide
with $t_k$; otherwise (i.e. dtc[i] = 1, or dtc[i] = 0
and inspection or preventive replacement is
performed at $t_k$) $e_i$ is queued for maintenance,
i.e. $e_i$ enters an appropriate negative state at
$t_k$,

2. if $Y_i(t_{k-1}) = 2$, $e_i$ is queued for maintenance at
$t_k$,

3. if $Y_i(t_{k-1}) = 0$, $e_i$’s maintenance is completed at
$t_k$, i.e. $e_i$ changes its state to 1 at $t_k$.

¹Possible state changes at $t_k$ of the components awaiting
repair in the interval $[t_{k-1}, t_k)$ are secondary w.r.t. state
changes of those which are in non-negative states in this
interval. The former components’ states at $t_k$ are computed
in step 6.

In case (3) one maintenance team is released,
hence avl_r is increased by 1. Moreover, if $e_i$’s age
was greater than or equal to V[i] when $e_i$ entered state 0,
$e_i$ underwent replacement in state 0, hence its age is
set to 0 at $t_k$. In view of the above analysis, step 5 is
implemented in the following way:


for each i such that Y_i(t_{k-1}) ≥ 0 and T_i(t_{k-1}) = t_k − t_{k-1} do {
  if (Y_i(t_{k-1}) = 1) then {
    if (dtc[i] = 0 and ⌈t_k/S⌉·S ≠ t_k and A_i(t_k) ≠ W[i])
      then Y_i(t_k) ← 2
      else {q ← prt[i]; len[q] ← len[q] + 1; Y_i(t_k) ← −len[q]}
  }
  if (Y_i(t_{k-1}) = 2)
    then {q ← prt[i]; len[q] ← len[q] + 1; Y_i(t_k) ← −len[q]}
  if (Y_i(t_{k-1}) = 0)
    then {Y_i(t_k) ← 1; avl_r ← avl_r + 1;
          if A_i(t_k) ≥ V[i] then A_i(t_k) ← 0}
}

6. If avl_r > 0, set to 0 the states at $t_k$ of at most
avl_r components placed at the head of the
repair queue, then update avl_r and the repair
queue accordingly.

7. Compute the cost incurred during $(t_{k-1}, t_k]$ and
add it to the total cost. The respective pseudocode
is given below.

C ← C + (t_k − t_{k-1})·R·C_team;
for i = 1,...,n do {
  if (Y_i(t_{k-1}) ≤ 0) then
    C ← C + (t_k − t_{k-1})·C_idl[i];
  if (Y_i(t_{k-1}) = 2) then
    C ← C + (t_k − t_{k-1})·C_und[i];
  if (Y_i(t_{k-1}) ≠ 0 and Y_i(t_k) = 0) then
    if (A_i(t_k) < V[i])
      then C ← C + C_rep[i]
      else C ← C + C_rpl[i];
  if (Y_i(t_{k-1}) = 1 and Y_i(t_k) ≠ 1 and fail[i] = 1) then
    C ← C + C_f[i];
}

Remark 1: The cost of repair or replacement is
added when the respective action starts, because it
is only then known whether $e_i$’s age has reached
V[i].

Remark 2: The last “if” adds $C_f[i]$ to the total
cost if $e_i$ fails at $t_k$; fail[i] is defined in the explanation
to step 8.

8. Determine $T_i(t_k)$ if $e_i$ changes its state to a non-negative
one at $t_k$ or $e_i$ remains in state 1, but at
least one $e_j$ in D(i) changes its state. Otherwise
assign $T_i(t_{k-1}) - (t_k - t_{k-1})$ to $T_i(t_k)$. The binary
value fail[i] is set to 1 if $e_i$ enters state 1 at $t_k$ ($e_i$’s
repair or replacement is completed at $t_k$) and the
age of $e_i$ at its failure (simulated at $t_k$) doesn’t
exceed W[i] ($e_i$ fails before or when it reaches its
planned replacement age); fail[i] is used in cost
calculation in step 7. Step 8 is implemented by the
following pseudocode:

for i = 1,...,n do {
  fail[i] ← 0;
  if (Y_i(t_k) ≠ Y_i(t_{k-1})) then {
    if (Y_i(t_k) = 0) then
      T_i(t_k) ← sim(i, 0, A_i(t_k));
    if (Y_i(t_k) = 1) then {
      A_i* ← min( sim(i, 1, A_i(t_k)), W[i] );
      T_i(t_k) ← (A_i* − A_i(t_k))/a_i(t_k);
      if (A_i* < W[i]) then fail[i] ← 1
    }
    if (Y_i(t_k) = 2) then
      T_i(t_k) ← min( ⌈t_k/S⌉·S − t_k, (W[i] − A_i(t_k))/a_i(t_k) )
  }
  else if (Y_i(t_k) = Y_i(t_{k-1}) = 1
           and Y_j(t_k) ≠ Y_j(t_{k-1}) for at least one j ∈ D(i))
    then T_i(t_k) ← (A_i* − A_i(t_k))/a_i(t_k);
  else
    T_i(t_k) ← T_i(t_{k-1}) − (t_k − t_{k-1})
}

Remark 1 (branch for state 2): cf. step 5 for the conditions of entering
state 2.

Remark 2 (branch for unchanged state 1): If any $e_j$, $j \in D(i)$, changes its state
at $t_k$ then $e_i$’s aging factor also changes, hence
$a_i(t_k) \ne a_i(t_{k-1})$ and $T_i(t_k) \ne T_i(t_{k-1}) - (t_k - t_{k-1})$, thus
$T_i(t_k)$ must be calculated using the new aging
factor.

The function sim(i, 1, A) simulates, using a random number generator, the age of e_i at which its failure will occur, provided that e_i is in state 1 and its already accumulated age equals A. This function is called when e_i is put in operation, and then A = 0 for a new or replaced e_i, or 0 < A < V[i] for a repaired e_i. The simulated value is used to calculate A_i* which, reduced by the current age and divided by e_i’s aging rate, yields the prospective sojourn time of e_i in state 1 from the moment when e_i enters state 1, provided that e_i’s aging rate remains unchanged during e_i’s sojourn in state 1. Since e_i’s aging rate depends on the states of e_j, j ∈ D(i), it is necessary to recalculate T_i(t), using a_i(t), at each time t when any e_j, j ∈ D(i), changes its state during e_i’s sojourn in state 1.

The function sim(i, 0, A) simulates the duration of e_i’s repair or replacement, depending on whether A < V[i] or A ≥ V[i] respectively. If e_i enters state 2, i.e. Y_i(t_k) = 2, then T_i(t_k) is calculated as min( ⌈t_k/S⌉·S − t_k, (W[i] − A_i(t_k))/a_i(t_k) ), because e_i must wait in state 2 until the time of the next inspection or until e_i’s age reaches W[i], whichever comes sooner.


Figure 3. The simulation algorithm’s block diagram.


9. Go to step 2 

For illustration, the algorithm’s block diagram is 
shown in Fig. 3. 
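
The residual-time computations of step 8 for states 1 and 2 might be sketched as below. This is a hedged illustration, not the author's program: all function names are hypothetical, and the stand-in for sim(i, 1, A) assumes a Weibull time to failure conditioned on survival up to age A, with the scale and shape values of the example in Section 6 as defaults.

```python
import math
import random


def weibull_failure_age(age, scale=0.05, shape=1.5, rng=random):
    """Hypothetical stand-in for sim(i, 1, A): failure age, given survival to `age`."""
    u = 1.0 - rng.random()                       # uniform on (0, 1], avoids log(0)
    # Conditional inverse transform for the CDF 1 - exp(-(scale*t)**shape)
    return ((scale * age) ** shape - math.log(u)) ** (1.0 / shape) / scale


def residual_time_state1(age, W, aging_rate, rng=random):
    """Return (T, A_star, fail) for a component that enters state 1."""
    a_star = min(weibull_failure_age(age, rng=rng), W)
    fail = 1 if a_star < W else 0                # 1: failure precedes planned replacement
    return (a_star - age) / aging_rate, a_star, fail


def residual_time_state2(t, S, age, W, aging_rate):
    """Time until the next inspection or until age W is reached, whichever is sooner."""
    return min(math.ceil(t / S) * S - t, (W - age) / aging_rate)
```

With an inspection period S = 7 and current time t = 10, the next inspection is at 14, so a component of age 5 with W = 30 and unit aging rate stays in state 2 for 4 time units.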


6 THE ILLUSTRATIVE EXAMPLE


We will now present the numerical results obtained by the program based on the algorithm from the previous section. As the complete version of the program is still under preparation, the results come from its simplified version developed according to the following assumptions:


— The components are independent, i.e. a component’s aging rate does not depend on the other components’ behavior and is constant during a component’s lifetime (T_i(t_k) need not be recalculated with the new aging rate when e_i remains in state 1 and another component changes its state at t_k).

— The network is built of non-repairable components; if a component fails, it is always replaced, irrespective of its age. TTF and TTR of e_i are simulated by the functions sim(i,1) and sim(i,0) respectively. They do not take the argument A, because the age of a new or replaced component equals 0, and the age of an old component is irrelevant to its replacement time. In consequence, there is no age limit for e_i’s repair, thus the only age limit, W[i], is imposed on e_i’s operation time.

— The failures of all components are self-revealing (dtc[i] = 1 for i = 1,...,n), thus no inspections need to be performed and the set of component states does not include state 2.


The example system has the following 
parameters: 


Number of components: 5 

Type of distribution of a component’s TTF: 
Weibull 

Scale parameter for TTF: 0.05 

Shape parameter for TTF: 1.5 

Type of distribution of a component’s TTR: 
Weibull 

Scale parameter for TTR: 0.5 

Shape parameter for TTR: 1.5 

Time unit: 1 day 

Number of maintenance queues: 2 

Components with priority 1: e,, €,, €, 

Components with priority 2: e,, €; 

Number of maintenance teams: 1, 2, 3 

C_rplc[i] = 1000

C_fail[i] = 4000

c_idl[i] = 900

c_team = 40

For the sake of this paper the following definition is adopted: a (two-parameter) random variable T has the Weibull distribution if the CDF of T is given by the following formula:


Pr(T ≤ t) = 1 − exp[−(λt)^α]    (1)


where λ and α are the scale and shape parameters respectively. It should be noted that in our example α is equal to 1.5, which is greater than 1. This is due to the fact that, as follows from reliability theory, planned replacements are only cost-effective if the distribution of a component’s TTF has the IFR property (see Barlow & Proschan (1975)). The Weibull distribution has this property only if α > 1. Distributions of TTF and TTR other than Weibull are discussed in Barlow & Proschan (1975) and O’Connor & Kleyner (2011).


Table 1. The results of cost calculation for various W and R (* denotes a local minimum of the cost per unit time).

        k_fin = 10,000              k_fin = 100,000
W       R = 1   R = 2   R = 3       R = 1   R = 2   R = 3
60      1815    1732    1801        1792    1747    1786
50      1778    1721    1783        1790    1743    1777
45      1807    1715    1776        1784    1737    1787
40      1774    1725    1788        1774*   1739    1772
35      1776    1734    1777        1782    1725    1765
34      1799    1743    1769        1778    1726    1763
33      1763    1759    1750        1794    1722    1757
32      1771    1739    1762        1786    1724    1752
31      1769    1721    1764        1774*   1724    1759
30      1771    1717    1753        1781    1713*   1753
29      1766    1704    1758        1779    1720    1748
28      1776    1714    1728        1782    1718    1746
27      1777    1702    1739        1784    1720    1759
26      1779    1727    1754        1788    1719    1742*
25      1776    1704    1746        1789    1709*   1750
24      1781    1709    1758        1787    1719    1748
23      1795    1693    1792        1800    1716    1747
22      1814    1690    1763        1807    1717    1742*
21      1790    1681*   1725        1808    1714    1754
20      1796    1737    1743        1813    1717    1759
19      1835    1725    1760        1829    1729    1756
18      1846    1730    1772        1844    1732    1759
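
As an illustration of how TTF values could be drawn from a Weibull distribution whose CDF has the form 1 − exp[−(λt)^α], the following Python sketch uses inverse-transform sampling. The function names are hypothetical and the sketch is not taken from the author's program; it only assumes the scale and shape values listed among the example parameters.

```python
import math
import random


def weibull_sample(scale, shape, rng=random):
    """Draw T with Pr(T <= t) = 1 - exp(-(scale*t)**shape) by inverse transform."""
    u = 1.0 - rng.random()                  # uniform on (0, 1], avoids log(0)
    return (-math.log(u)) ** (1.0 / shape) / scale


def weibull_mean(scale, shape):
    """Closed-form mean for the same parameterisation."""
    return math.gamma(1.0 + 1.0 / shape) / scale


rng = random.Random(2018)
ttf = [weibull_sample(0.05, 1.5, rng) for _ in range(100_000)]
# Sample mean and closed-form mean, both close to 18 days
print(sum(ttf) / len(ttf), weibull_mean(0.05, 1.5))
```

Under this parameterisation the mean TTF is about 18 days and the mean TTR (scale 0.5) about 1.8 days, which gives a feeling for the range of W sampled in the example.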

Our aim is to find the optimal age limit for a component’s operation, i.e. the optimal age at which its preventive replacement should take place. The objective function to be minimized is the total operating cost per unit time. Let W*[i] denote the optimal age for e_i. As the components are stochastically identical, W*[i] are equal for all five components, i.e. W*[1] = ... = W*[5] = W*. The results of the cost calculation for various values of W = W[1] = ... = W[5] are given in Table 1.


The cost values in the first column are obtained for k_fin = 10,000, where k_fin is the number of cycles of the algorithm’s main loop (k_fin can serve as a measure of the simulation’s accuracy). At this accuracy we conclude that the cost per unit time attains its minimum for W = 21 and R = 2. However, it is advisable to perform a more accurate simulation, because the cost fluctuates in the sampled interval 18 ≤ W ≤ 60 and for each R it has more than one local minimum there, therefore a more precise analysis of its behavior is indicated. The computations for k_fin = 100,000 show that the cost per unit time as a function of W is not unimodal (has more than one local minimum) for each R = 1, 2, 3. The respective local minima are marked with asterisks. The optimal number of maintenance teams and the optimal value of W are equal to 2 and 25 respectively. Also, let us note that the intervals 30 < W < 45 and 20 < W < 32 for R = 1 and R = 3 respectively are “uncertain” as regards determining the optimum value of W, but these values of R are of no practical interest, as the optimum number of maintenance teams equals 2.
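
The selection of the optimum from the tabulated results can be reproduced with a few lines of Python. The sketch below simply transcribes the k_fin = 100,000 columns of Table 1 and searches for the lowest cost; the names COST_100K and best_setting are hypothetical.

```python
# W -> cost per unit time for R = 1, 2, 3 (k_fin = 100,000), transcribed from Table 1
COST_100K = {
    60: (1792, 1747, 1786), 50: (1790, 1743, 1777), 45: (1784, 1737, 1787),
    40: (1774, 1739, 1772), 35: (1782, 1725, 1765), 34: (1778, 1726, 1763),
    33: (1794, 1722, 1757), 32: (1786, 1724, 1752), 31: (1774, 1724, 1759),
    30: (1781, 1713, 1753), 29: (1779, 1720, 1748), 28: (1782, 1718, 1746),
    27: (1784, 1720, 1759), 26: (1788, 1719, 1742), 25: (1789, 1709, 1750),
    24: (1787, 1719, 1748), 23: (1800, 1716, 1747), 22: (1807, 1717, 1742),
    21: (1808, 1714, 1754), 20: (1813, 1717, 1759), 19: (1829, 1729, 1756),
    18: (1844, 1732, 1759),
}


def best_setting(table):
    """Return (cost, W, R) with the lowest cost per unit time."""
    return min(
        (cost, W, R)
        for W, row in table.items()
        for R, cost in enumerate(row, start=1)
    )


print(best_setting(COST_100K))   # (1709, 25, 2)
```

Because the simulated costs fluctuate by a few units between neighbouring values of W, such a search only identifies the best sampled point, not a guaranteed global optimum.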


7 CONCLUSION AND FURTHER 
RESEARCH 


A newly developed algorithm simulating the maintenance process of a complex technical system, and the results of a computer program implementing this algorithm, have been presented. The author aimed at developing a software tool that would enable the system operator to assess the total cost incurred over a long operating time, and to minimize it by properly adjusting the decision variables, i.e. R, S, and V[i], W[i], i = 1,...,n, as defined in Section 2. Due to the complexity of the system model, analytical computation and optimization methods definitely cannot be used, thus Monte Carlo simulation combined with heuristic optimization is the only feasible solution.

The algorithm demonstrated in the current 
paper bears resemblance to the one presented in 
George-Williams & Patelli (2015). However, our 
approach focuses on the simulation of time points 
in which any system component changes its opera- 
tional state. The time axis is divided into state- 
invariant intervals (no component changes its state 
in any such interval) thus allowing for very accu- 
rate analysis of the system operation process. It is 
also important that the components are mutually 
dependent and a component’s aging rate depends 
on other components’ states. The component
states distinguished in the last-mentioned paper
are different from those in the present one, but the model
considered herein is adjustable in this respect.
Also, since failed components may await repair in 
a priority queue, our model makes it possible to 
analyze the system from the queuing theory (QT) 
viewpoint, enabling its operator to optimize the 
relevant (QT related) parameters. 

The current paper defines the system opera- 
tion cost as the sum of individual components’ 
operation costs. However, the state of the whole 
system, expressed as an a priori defined function 
of the components’ states (the structure function) 
can also be included in the overall cost calculation. 


Apart from the cost, other performance charac- 
teristics (e.g. the system availability) can be easily 
calculated, in a similar way as in step 7 in section 5. 

Clearly, the main drawback of stochastic simulation is its high time complexity. The simplified version of the program, based on the assumptions from Section 6, executes in approximately 12 seconds if k_fin = 10,000, and 120 seconds if k_fin = 100,000 (on a PC with an Intel® Core™ i5 CPU). This shows that, as expected, the execution time increases linearly with k_fin. Admittedly, the above execution times are relatively short as far as Monte Carlo simulation is concerned. However, this is due to the small number of components (5). If their number is changed to 10, the execution time increases to 22 seconds (k_fin = 10,000), which means that the algorithm’s time complexity increases with n somewhat slower than at a linear pace. This is explained by the fact that steps 3 through 8 of the algorithm, except step 6, are implemented as “for” loops with n cycles, while step 6 requires avl_r (i.e. at most R) compound operations, and in a real system R is much smaller than n. Obviously, if the components are not identical (unlike in the provided example) and their number is high, then the program may have to be run a large number of times in order to find the optimum values of the decision variables. However, the components in many systems can be divided into a number of groups such that the components in one group have identical parameters, which significantly reduces the number of decision variables and, as a consequence, decreases the time complexity of the optimization task.

Summing up, we have endeavored to develop 
a possibly comprehensive algorithm that would 
encompass a wide range of maintenance models 
found in the relevant literature, and would also be 
applicable for practical purposes. It seems that this 
goal has been at least partly achieved, but there 
still remains considerable conceptual work to be 
done. Further research will mainly concentrate
on properly defining the dependence relations
among the system’s components, i.e. how
the states of components in the set D(i) affect the
aging rate of e_i. The issue of component dependence
is considered in the following works: Zhang &
Horigome (2001), Dukhovny & Marichal (2012),
Yang et al. (2013), Nakamura et al. (2017), Zhang
& Wilson (2017). Also, the algorithm should support
various failure modes. Then, whether a failure
is self-revealing or not may depend both on its
mode and the failed component’s type. Last but
not least, the impact of external conditions (e.g. 
ambient temperature and humidity) on the com- 
ponents’ aging rates should be taken into account 
in constructing future versions of the presented 
algorithm. 
ABSTRACT: Recent field data indicate that pitch systems account for a substantial part of a wind turbine’s downtime. Reducing downtime means increasing the total amount of energy produced during its lifetime. Both electrical and fluid power pitch systems are employed, with a roughly 50/50 distribution. Fluid power pitch systems generally show higher reliability and have been favored on larger offshore wind turbines. Still, general issues such as leakage, contamination and electrical faults make current systems work sub-optimally. Current field data for wind turbines present overall pitch system reliability and the reliability of component groups (valves, accumulators, pumps etc.). However, the failure modes of the components and, more importantly, the root causes are not evident. The root causes and failure mode probabilities are central for changing current pitch system designs and operational concepts to increase reliability. This paper presents a feasibility study of estimating pitch system reliability based on a failure rate prediction method for generic fluid power components. Special attention is given to the use of computer simulations for assessing working conditions such as flow, pressure, work cycle, fluid contamination concentration etc. The fluid power pitch system is co-simulated with the 5 MW NREL wind turbine implemented in the FAST software. The estimated failure rates are compared to field data, and comments are given on the correlation and discrepancies based on the uncertainties of the simulated conditions.


1 INTRODUCTION 


Pitch systems are today employed on all modern 
multi-megawatt turbines and enable the turbine 
blades to rotate along their longitudinal axis in 
order to facilitate aerodynamic braking. This is 
used at wind speeds above rated and also for ena- 
bling safe emergency stopping of hub rotation. 
Multiple studies on turbine reliability and down- 
time have indicated the pitch system to be the most 
unreliable sub-system of the turbine (Wilkinson 
and Hendriks 2010, Carroll et al. 2015). Contrib- 
uting to over 20% of total downtime, pitch systems 
not only introduce high risk, they also cause a sig- 
nificant loss of power production during the life- 
time of turbines. 

Typically, modern turbines use either electrical or fluid power pitch systems; this paper focuses on reliability estimation of the latter. Currently, the most detailed publicly available field data on pitch system failures show the failure rate distribution among system components such as valves, pump, accumulators, cylinders, etc. (Carroll et al. 2015). Yet, the failure modes and root causes are not evident. Such information is crucial in order to identify critical areas of the system and to enable development of more reliable and safe concepts. Also, precise reliability estimates of system components allow for strategic maintenance planning, which potentially reduces maintenance time and costs. In an attempt to reveal the high-risk areas of the pitch system, Liniger et al. conducted a systematic qualitative study on identifying critical components (Liniger et al. 2017). While the study showed promising results, the occurrence of failure modes was qualitatively determined using expert knowledge of generic fluid power systems. Thus, actual failure rates and operation-dependent failure mechanisms of the pitch system in a wind turbine were not directly considered. Two quantitative studies have been conducted with the purpose of modeling fluid power pitch system reliability (Yang et al. 2011, Han et al. 2012). While these studies have aimed at creating the model basis for calculating reliability, neither the origin of failure modes nor the failure rates have been fully covered. The only publicly available source of failure rate estimation for fluid power components known to the authors is the Handbook of Reliability Prediction Procedures for Mechanical Equipment (Jones 2011). This source presents failure rate models as functions of component dimensions, material properties and operating conditions.

The main contribution of this paper is a feasibility study of estimating pitch system reliability using the empirical failure rate models of Jones (2011). Estimation of failure rates has the potential to close the gap in knowledge between the qualitative and quantitative studies while also incorporating turbine operating conditions into the framework. Operating conditions are generated using a simulation model of a fluid power pitch system and the 5 MW National Renewable Energy Laboratory (NREL) wind turbine implemented in the Fatigue, Aerodynamics, Structures, and Turbulence (FAST) software. The estimated failure rates are compared to the most detailed and recent field failure rates available.


2 PITCH SYSTEM DESCRIPTION 


The pitch system configuration used in this analysis is depicted in Figure 1, and the component labels are described in Table 1. The system consists of a supply located in the nacelle of the turbine and the actuation located in the rotating hub. The supply consists of a fixed-displacement, fixed-speed pump where pressure and flow are conditioned by dump valve V2 and relief valve V1. The supply connects to the rotating hub through rotary union R1. The actuation is a conventional fluid power cylinder drive, where cylinder position is controlled closed-loop and flow is metered by the proportional valve V6. Accumulators A1 and A2 store energy which is used for extending the cylinder C1 in the event of an emergency shutdown.

The fluid power pitch system presented in Figure 1 is very similar to the system analyzed in the previous qualitative study (Liniger et al. 2017). The results of the previous study were shown to correlate well with field failure data, which indicates that the system presented here is similar to the real-life systems, which are confidential. Note that conventional fluid power pitch systems also employ a locking circuit for keeping the blade pitch angle fixed when the turbine is shut down. The locking circuit and all system transducers are omitted in this analysis as their failure rates are negligibly small compared to those of the other system components (Carroll et al. 2015).


Figure 1. Fluid power pitch system diagram with indication of supply and actuation circuit locations in the wind turbine.


Table 1. Description of component labels.

Label         Description                    Type
V1            Relief valve                   Cartridge, poppet
V2            Solenoid dump valve            Cartridge, poppet
V3, V4        Check valves                   Cartridge, poppet
V5, V7-V10    Solenoid valves                Cartridge, poppet
V6            Proportional solenoid valve    Module, spool
C1            Differential cylinder
H1-H5         Flexible hoses
A1, A2        Emergency accumulators         Gas charge, piston
A3            Pump accumulator               Gas charge, piston
P1            Fixed displacement pump        Internal gear
Table 2. Main data for the wind turbine and pitch system simulation model.

Turbine data                          Value
Nominal power                         5 [MW]
Nominal hub speed                     12 [RPM]
Tower height                          90 [m]
Blade length                          63 [m]
Wind speed (rated)                    11.4 [m/s]
Turbulence model                      Normal Turbulence Model (NTM)

Pitch system data                     Value
Pitch cylinder (rod/piston/stroke)    Ø90/Ø140/1350 [mm]
Pump flow (rated)                     20 [l/min]
System pressure (rated)               250 [bar]


3 WIND TURBINE SIMULATION MODEL 


A simulation model of the fluid power pitch system operating in a wind turbine is utilized for generating operating conditions subsequently used for reliability estimation. The wind turbine model is based on the open-source data for a 5 MW NREL turbine implemented in the FAST software (Jonkman and Buhl 2005). The main specifications are given in Table 2.

It is noted that the power capacity of the simulated turbine is above the 2-4 MW range covered by the field data. The simulated operating loads may, therefore, be larger than those found in the real-life systems, possibly yielding higher estimated failure rates.

The dynamical model of the fluid power pitch system is based on the layout described in the previous section and was developed in a previous study by the authors (Pedersen et al. 2015). The dynamical model is implemented in Matlab/Simulink and co-simulated with FAST. The model incorporates the compressibility of fluid in the cylinder chambers, proportional valve dynamics and kinematics of the cylinder-blade coupling. The pitch angle is controlled closed-loop using a gain-scheduled PI compensator. The main specifications for the pitch system are given in Table 2.


4 OPERATING CONDITIONS 


The operating conditions for the real-life turbines used in this feasibility study are unknown. Thus, the simulated system is operated under a wide range of conditions which are considered to fully cover the conditions of the real-life turbines. The operating conditions considered are mean wind speed, turbulence intensity, ambient temperature and fluid temperature. The full-field wind is generated using TurbSim (Jonkman 2009) and is based on Design Load Case (DLC) 1.2 of the IEC 61400-1 wind turbine design standard (IEC 2006). This load case is used for evaluating fatigue loads of wind turbines during normal operation. While the pitch system is used for both stopping and starting the turbine, normal operation constitutes the majority of the system lifetime. Based on the availability of the real-life turbines (Carroll et al. 2015) and considering out-of-range wind speeds, the utilization percentage can be assumed to be 90%.

The mean wind speed is normally described using the Weibull probability density distribution, which is defined by a shape and a scale parameter (Hansen 2008). A 20-year baseline distribution from the Østerild location near the Danish shore, shown by blue bars in Figure 2, is selected for the study. The black bars show the range of wind speed distributions considered by selecting shape and scale parameters ±20% from the baseline values. The wind distribution is discretized in twelve wind bins from cut-in to cut-out wind speed of the 5 MW NREL turbine. To simplify the analysis, the wind direction is assumed to be ideal, that is, orthogonal to the turbine. The estimated failure rate for each wind bin is multiplied by the corresponding probability, and the products are summed to yield values comparable to the field data.
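
The weighting of per-bin failure rates by the wind distribution can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the paper: the Weibull scale and shape values and the per-bin failure rates are placeholders, the bins are taken as equal in width, and the cut-in and cut-out speeds of 3 and 25 m/s are the commonly quoted values for the NREL 5 MW reference turbine. The CDF uses the parameterisation conventional for wind speeds, F(v) = 1 − exp[−(v/A)^k].

```python
import math


def weibull_cdf(v, scale, shape):
    """Wind-speed Weibull CDF, F(v) = 1 - exp(-(v/scale)**shape)."""
    return 1.0 - math.exp(-((v / scale) ** shape))


def bin_probabilities(v_in, v_out, n_bins, scale, shape):
    """Probability mass of n_bins equal-width bins between cut-in and cut-out."""
    width = (v_out - v_in) / n_bins
    edges = [v_in + k * width for k in range(n_bins + 1)]
    return [weibull_cdf(hi, scale, shape) - weibull_cdf(lo, scale, shape)
            for lo, hi in zip(edges, edges[1:])]


def weighted_failure_rate(rates_per_bin, probs):
    """Sum of per-bin failure rates weighted by the bin probabilities."""
    return sum(r * p for r, p in zip(rates_per_bin, probs))


# Placeholder inputs: scale, shape and the per-bin rates are hypothetical
probs = bin_probabilities(v_in=3.0, v_out=25.0, n_bins=12, scale=9.0, shape=2.0)
rates = [1.0e-6] * 12
print(sum(probs), weighted_failure_rate(rates, probs))
```

The bin probabilities sum to less than one because wind speeds below cut-in and above cut-out are excluded, which is in line with a utilization below 100%.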

According to DLC 1.2, turbulence intensity is categorized as high (class A, 16%), medium (class B, 14%) or low (class C, 12%) for the Normal Turbulence Model (NTM). All three turbulence intensity classes are considered in the feasibility study.

The pitch systems are located in the hub and nacelle of the turbine, which in most modern turbines are conditioned to be within a desirable operating temperature interval. The ambient operating temperatures considered are T_amb = [0 20 60]°C.

Lastly, the fluid temperature is controlled during normal operation of the pitch system. However, it is not uncommon for the fluid temperature to differ locally from the desired value. T_fluid = [30 50 70]°C is selected to cover the expected range of temperatures.
Figure 2. Mean wind speed distribution used in simulations. The black bars indicate the considered range.


5 FAILURE RATE ESTIMATION


Failure rates are estimated using the Handbook of Reliability Prediction Procedures for Mechanical Equipment (Jones 2011). The failure rate models are constructed such that an empirically determined base failure rate, λ_B, for a generic component is multiplied by several non-dimensional factors, C_1, C_2, C_3, ..., describing material, dimensions and operating conditions, to estimate the failure rate.
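
The multiplicative structure of such a failure rate model can be sketched in a few lines of Python. The function name and all numbers below are hypothetical placeholders chosen only to show the mechanics; they are not values from the handbook or from this study.

```python
import math


def failure_rate(base_rate, factors):
    """Base failure rate scaled by the product of non-dimensional factors."""
    return base_rate * math.prod(factors.values())


# Hypothetical base rate and multiplication factors
example_factors = {"pressure": 0.5, "leakage": 4.0, "surface_finish": 0.5}
print(failure_rate(1.0e-6, example_factors))
```

Keeping the factors in a dictionary keyed by name makes it easy to inspect which operating condition or dimension dominates a given estimate.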

Some failure rate estimates depend on operating temperature. The operating temperature for components in direct fluid contact is set to the fluid temperature. The temperature of solenoids that are operated is set to the ambient temperature plus a constant offset accounting for Joule heating. The offset is 100°C for valves that are continuously on during normal operation. Valve V2 is operated intermittently, which reduces the temperature offset to 70°C. The temperature offsets are confirmed by measurements (Liniger et al. 2018).

Failure rate estimates are influenced by the amount of allowable leakage for seals and valves in the system. For pitch systems, external leakage is in most cases much more critical than internal leakage due to the environmental contamination hazard to the turbine surroundings. External leakage is generally set to a very low value of 2·10^-7 l/min, corresponding to a few drops a month. Allowable internal leakage is set to 10^-4 l/min for seat valves and seals. Proportional spool type valves are normally associated with higher internal leakage, and the allowable limit for valve V6 is therefore set to 2 l/min.

Operating cycles of the valves, cylinders and accumulators are used for assessing the failure rates. Operating cycles for on/off valves are simply determined from the number of activations during normal operation. For proportional valve V6, cylinder C1 and the accumulators, the operating cycles are determined using rain-flow counting and a minimum travel threshold. The minimum threshold for C1 and the accumulators is set to 2 mm. For valve V6, the minimum threshold is 0.2 mm. Due to the uncertainty of the threshold values, a sensitivity study is conducted in the results, Section 5.2.

Contamination concentration in the fluid is also considered in the failure rate estimation. The contamination concentration at each component N is determined by the particles generated from upstream components according to the rates specified in (Jones 2011). Additionally, a particle filtration size of C_0 = 3 μm for filter F1 is utilized in the calculations.

To simplify the description, the details of the failure rate estimation are given for one component only in the following section. The cartridge poppet type valve V2 is selected since valves are the most used component type in the system. Also, the failure rate estimation procedure for the parts in valve V2 is similar to that for most of the remaining components. All component specifications are similar to those found in actual pitch systems working in turbines with a power capacity similar to that of the real-life systems.


5.1 Cartridge poppet valve V2 


Cartridge Valve V2 is shown in Figure 3. Valve V2 
consists of several parts where dimensions, mate- 
rial and specifications are given in Table 3. All 
notations follow (Jones 2011) and pressures and 
flows are denoted according to the diagram in 
Figure 1. The operating time f,,is given in hours and 
Np denote the operating cycles of valve V2. Note 
that both the factors for surface finish (roughness) 
for the seat valves and Young’s modulus for O-rings 
increase with time. The estimated failure rates for 
these parts are therefore increasing with time. 

As an example, the multiplication factors for the main poppet are given in Table 4, with values determined from a nominal operating scenario. The nominal operating scenario covers one calendar year of operation at ambient temperature T_amb = 20°C, fluid temperature T_fluid = 50°C, rated wind speed and turbulence class B. The pressure and flow rate multiplication factors depend on simulated pressure and flow time series. The multiplication factor values C_p and C_w presented in Table 4 are mean values. Due to length limitations, the time series are not shown but can be found in the work by Pedersen et al. (2015).

The multiplication factors C_p, C_q, C_v, C_n, C_s and C_w are seen to cover operating conditions for pressure and flow. C_f, C_dt and C_sw depend on the dimensions and manufacturing of the valve. The main poppet failure rate for the nominal operating scenario is determined according to:


λ_{V2,mp} = λ_{B,SV} · C_p · C_q · C_f · C_v · C_n · C_s · C_dt · C_sw · C_w    (1)

Figure 3. Cartridge poppet type valve V2 with part dimensions.
Table 3. Cartridge poppet type valve V2 data related to failure rate estimation.


Part Dimension Material Specification 
Main Seat diameter D,=15 mm Steel Seat valve base failure 
. failure rate Ay sy = 1.410" 
: cycle 
Poppet Seat width D,,, = 2mm Seat pressure drop AP,, = P,- P [bar] 
Rated flow FR „= 140 l/min 
Allowable leakage O,,,= 104 l/min 
Surface finish 15.105- N,, + 0.2[ 4m] 
F- for N,,, <= 4000 cycles 
m 1.54m 
for N,, > 4000 cycles 
Pilot Seat diameter D,=5mm Steel Rated flow FR, = 1 l/min 
Poppet Seat width D,,, = 0.5 mm Allowable leakage Q,,= 10“ I/min 
Spring Coil diameter D,=15 mm Steel Spring base failure 1. =238-10* failure 
rate B Aa hour 
Wire diameter D,,.=2mm Operating cycle rate N,, [ cycl e] 
t, L hour 
Active coils N,=7 
Compression length C,=5mm 
Outer Inner diameter D, =30 mm NBR-70 Seal base failure rate failure 
á i Ag sg = 2.4: 10* —— 
(static) BSS hour 
O-ring O-ring diameter D,,,=2.62 mm Allowable leakage 0,,,,= 2; 107 l/min 
Seal pressure drop AP „= P, — P.,,,, [bar] 
Youngs modulus Engr = 59 - 10% ; ty + 6.2 [MPa] 
Mating surface F „=0.8 um 
finish 
Inner Inner diameter D, =28 mm NBR-70 Allowable leakage Q= 10* l/min 
O-ring O-ring diameter D,,, = 1.78 mm Seal pressure drop AP, =P,- P, [bar] 
Solenoid Insulation Solenoid base failure failure 
rate Azs = 2.17-10° 
f cycle 
class-H Coil temperature Tot = Lamy + 70 [PC] 


Operating cycle 
rate 


Np l ac | 
t hour 


H 


The failure rates of the main poppet and the remaining parts of valve V2 are given in Table 5.

Clearly, the solenoid contributes the highest failure rate of the valve. The lowest failure rate is found for the inner o-ring. The main and pilot poppets and the spring are seen to yield similar failure rates.


5.2 Results 


The feasibility study of the described estimation method is performed by comparing estimated failure rates to field failure rates. The field failure rates are divided into six component groups, namely accumulators, valves, pump, cylinder, rotary union, and hoses. The system presented in Figure 1 is likewise divided into the six component groups. The accumulator group consists of A1-A3. The valve group covers V1-V10 and the hose group is constructed from hoses H1-H5. Each estimation case utilizes all combinations of the wind speed profiles given in Figure 2 and turbulence intensity classes A, B and C.

Figure 4 shows the comparison between the esti- 
mated and field failure rates for a range of ambi- 
ent temperatures. The black error bars indicate the 
Table 4. Failure rate estimation and multiplication factors for main seat valve in V2 under nominal operating conditions.
Factor name Description Value 
Pressure C, = (4.8 - 10% - AP, 0.44 
Allowable leakage c [910-0 for Q,, > 4.9- 10 I/min 3.72 
© |42-4.8-10-0, 
Surface finish é (39.4: F 0.46 
í 353 
Fluid viscosity C, (Tia) SAE 10 fluid look-up table 0.47 
Fluid contamination c.\* 0.024 
C, = ( S) Fon N mo 3.79 
Contact pressure 9000 Ls 0.31 
C, = 0.26- 
i 3: AP -1.5-10" 
Seat diameter C,=1.1-D,,- 0.04 + 0.32 0.97 
Land width o a [355-097 D+ TB: Ding — 86° Diy 2.0 
= 0.25 for D,,, > 1.34107 
Flow rate Q : 1.0 
C, =1+ = 
R 
Operating cycle rate N,, cycle 
ta hour 
Table 5. Part failure rate for valve V2 under nominal 102 y 7 + = y - 
operating. (Field data 
IE 
failure _10'F T, -20 C 3 
Part Failure rate] cycle g Ere- | cs 
Bs yh 4 
Main poppet Ayam = 2.4 - 107 £ ii 
Pilot poppet Map = 1.9- 107 F 
Spring Ars = 1.5- 107 10°F 
Outer o-ring Arau = 1.0 + 10° 
Inner o-ring Azio = 2-7 - 10° 
Solenoid g= 1.2- 10% ef £ SS SS Ff & 
Valve V2 Aya = 1.8 - 10-6 ev FF FF * 


variation due to the considered wind speeds, tur- 
bulence intensities and wear models. The colored 
bars are mean values for each estimation case. Fea- 
sible failure rates estimation means that the field 
data must fall within the error bars. From Figure 4, 
the estimated failure rates are seen to be at least an 
order of magnitude larger than the field failure rates 
for component groups other than hoses. Increasing 
the ambient temperature slightly reduces the esti- 
mated failure rates. The only component group to 
fall within the estimated range is hoses. 
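
The feasibility criterion used here (field data must fall within the error bars of the estimate) can be written as a simple interval check. The sketch below is only an illustration: the `Estimate` class, the function name and all numbers are hypothetical and are not taken from Figure 4.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    low: float   # lower end of the error bar [failures per year]
    high: float  # upper end of the error bar [failures per year]


def is_feasible(estimate, field_rate):
    """An estimate is feasible if the field failure rate lies within its error bar."""
    return estimate.low <= field_rate <= estimate.high


# Hypothetical values chosen only to show both outcomes.
estimates = {
    "valves": Estimate(low=2.0, high=9.0),
    "hoses": Estimate(low=0.05, high=0.6),
}
field_rates = {"valves": 0.3, "hoses": 0.2}

feasible = {
    group: is_feasible(estimates[group], field_rates[group])
    for group in estimates
}
print(feasible)
```

With these made-up numbers the valve estimate is infeasible (the field value lies below the error bar) while the hose estimate is feasible, which mirrors the qualitative pattern described for Figure 4.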


Figure 4. Component group failure rates for varying
ambient temperatures. Other conditions are fluid
temperature T_fluid = 50°C and low travel threshold.


The estimated failure rates for varying fluid 
temperatures are seen in Figure 5. Generally, the 
estimated failure rates increase with increasing 
fluid temperature. A significant change is seen for 
the failure rates of valves. At the lowest fluid temperature,
the estimated failure rate of the rotary union covers the
field data.

Figure 5. Component group failure rates for varying
fluid temperatures. Other conditions are ambient
temperature T_amb = 20°C and low travel threshold.

[Figure axis label: Failures per year]

Figure 6. Component group failure rates for varying
travel thresholds. Other conditions are T_amb = 20°C and
T_fluid = 50°C.


The travel threshold for counting operating
cycles is associated with a high degree of uncertainty.
Thus, the effect of three threshold levels is
analysed. The low levels are described in Section 5.
The medium and high thresholds are double and triple
the low values, respectively. Increasing the threshold
lowers the number of operating cycles, which in Figure 6
is seen to have a minor decreasing effect on the failure rates.
At the low threshold, the failure rates of both the valves
and cylinder C1 are increased significantly.

The results are generally not satisfying, and the
large discrepancy between estimated and field values
is related either to wrongful modeling assumptions
or to unrepresentative field data. The estimated
failure rates being larger than the field data could
be a consequence of the simulated wind turbine
being 5 MW rather than the 3-4 MW range of the
real-life systems, and of non-modeled repairs or
replacements of components. Also, over 30% of field
failures are known to be insufficiently documented
and are therefore not considered in this comparison
(Liniger et al. 2017). On the other hand, the real-life
systems are known to contain more components than
presented in Figure 1, which potentially could further
increase the estimated failure rates.

In spite of these facts, the large discrepancies 
are most likely caused by the estimation proce- 
dure being over-conservative. This is indicated by 
the tendency seen in Figure 6 for low threshold, 
where the failure rates for all component groups 
except hoses follow the field data with an offset. 
As evident from the references in the Handbook 
of Reliability Prediction Procedures for Mechani- 
cal Equipment (Jones 2011), the empirical base 
failure rates date back to the late 1960s. Recent
developments in manufacturing processes, component
design, and fluid properties are therefore not
considered in the estimation procedure. A sugges- 
tion for increasing the precision of the estimation 
procedure could, therefore, be to adjust the base 
failure rates to more current data and evaluate if 
the multiplication factor can be used as is. 


6 CONCLUSION 


A feasibility study of estimating fluid power pitch 
system reliability has been conducted using an 
estimation procedure for generic fluid power com- 
ponents. The estimated values were determined 
based on operating conditions obtained from a 
simulation model of a pitch system in normal 
operation in a turbine. The estimated failure rates 
were compared to failure rates of real-life turbines. 
Large parameter variations related to wind speed, 
turbulence intensity, and system temperatures were 
performed since the operating conditions of the 
real-life turbines were unknown. 

The estimated failure rates were over-conserva- 
tive in relation to the field failure rates for accu- 
mulators, valves, pump, cylinder and rotary union. 
The method yielded feasible results only for failure 
rates of hoses. 

While being over-conservative, the estimated
failure rates followed the same tendency as the
field data for accumulators, valves, pump, cylinder
and rotary union. This similar tendency indicates
that the base failure rates are incorrect for
components in modern pitch systems and should be
updated using more recent test data.

ABSTRACT: Norway is currently ranked as one of the top nations in regard to road safety. However,
continued efforts are made as we strive towards a goal of zero deaths and serious injuries in road traffic
accidents. In this paper we explore whether Norwegian driver education could benefit from simulator training.
Possible advantages are cost effectiveness, environmentally friendly training, repeatability, access to
different scenarios (accident scenarios and dangerous situations, darkness and snow outside of winter,
difficult weather conditions and extreme road traffic density), the possibility to make errors in a safe
environment, and interaction with new technology such as advanced driver assistant systems. However, there
are challenges, such as how to increase the number of simulators in Norway, and legal obstacles, as current
legislation requires all mandatory parts of the Norwegian driver education to be conducted on the road.
Our overall impression is that driver education in Norway could benefit from a more
systematic approach to simulator training.


1 INTRODUCTION 


The purpose of this paper is to investigate the use 
of simulator training in driver education in Nor- 
way, discuss the potential gains and challenges 
and look at the possibility of increasing the avail- 
ability and use of driving simulators. In Norway, 
like in many other countries, the public authorities 
have established a formal theoretical and practical 
driver education (NPRA 2017), based on scien- 
tific and policy factors, where professional driving 
teachers employed by approved driving schools 
are the main responsible bodies to conduct the 
education. The driver learner program is an exten- 
sive and systematic module based program with a 
comprehensive syllabus. The program is to a large 
degree based on the Goals of Driver Education- 
matrix (GDE-matrix; Keskinen 1996 in Hatakka 
et al. 2002; Keskinen et al. 2010). In this program it 
is estimated that the average learning period, from 
novice to the issuing of the driver’s license, is two 
years. The authorities recommend that training 
starts at the age of 16 in order to get the driver’s 
license at the age of 18 — which is the lower limit 
for receiving a car driver’s license in Norway. To 
reduce accident risks in novice drivers, elements of 
the driver training are carried out in real life situ- 
ations where driver learners are accompanied by 
skilled driver instructors. Additionally, in Norway
it is legal and recommended for experienced drivers
(normally parents) to provide driver learners with
extra practice. The only requirements are that the driver
learner has completed an introductory course
and that the experienced driver has held
their driver’s license for a minimum of five years
without receiving any penalties or driver’s license
endorsements (FOR 2017). Such additional training
is meant to increase the driver learners’ experience
behind the wheel prior to their exams and the
license issuance. Our question is whether driving
simulators could be a training platform in Norway
to increase driver learners’ driving experience, and
whether they can complement or even substitute some
of the more traditional learning methods used in
today’s education.

In many industries where human errors are 
likely to have critical outcomes, such as aviation, 
hospital medicine and commercial nuclear power, 
simulator training is frequently used as part of 
training. Simulator training can be cost efficient 
and can provide training in situations that are 
rarely seen (e.g. accident scenarios; Bye et al. 2011; 
McGaghie et al. 2010; Salas, Bowers & Rhodenizer
1998). Currently, driving simulators are not
the standard way of learning how to drive. However,
in some European countries, such as the
Netherlands and the United Kingdom, simulator
training has gained some acceptance as part
of the driver education (Baten & Bekiaris 2003),
and there are reports showing an increased use of
simulators in Germany (Stiegler & Vennefrohne
2017) and France (Goepp 2017). There are several
factors explaining why simulator training is more
common in other industries where human errors
are likely to have critical outcomes than in driving.
In medical surgery for instance, the risk of letting 
unskilled personnel practice on people is consid- 
ered too high, so simulation has become a natural 
way of acquiring skills. Doing simulator training 
means that there is room to learn from mistakes. 
The same is seen in aviation. Additionally, the
costs and emissions of flying a large aircraft are
so substantial that doing all the training necessary
to obtain a commercial pilot license in an actual
aircraft is not considered economically or
environmentally beneficial. Even though developing,
building and operating a simulator also incurs costs,
it is far less expensive than training in airplanes.
In aviation, as well
as industries such as commercial nuclear power, 
simulators can be used to train personnel to avoid 
serious accidents and to minimize the overall con- 
sequences if unwanted events occur. 

It is our impression that all of the reasons men- 
tioned above, concerning reduced risk through 
extra training, prevention of fatalities and injuries, 
handling accident scenarios, and reduced cost and 
emissions, can be used as reasons to introduce car 
driver training in simulators. In this paper we will 
attempt to clarify current usage and potential gains 
of simulators in driving education in Norway (sec- 
tion 2). This is discussed in the light of the rapid 
technological development in today’s automobile 
industry, and how new technology can be included 
in simulators. This is followed by a discussion on 
the structural and practical obstacles in imple- 
menting an increase in simulator training in driver 
education (section 3). 


2 POTENTIAL GAINS IN SIMULATOR 
TRAINING 


It has never been common to use driving simulators
as part of the driver education in Norway.
Currently, only 5-10 out of 1033 driving schools
offer simulator training for driving-license category
B driver training (vehicle weight less than 3500 kg),
and the simulators are mainly used for learning
the basic introductory elements of handling and
maneuvering a car. These schools seem to lack a
systematic pedagogical or educational plan in their
simulator use. Additionally, the Norwegian Public
Roads Administration is rather strict on what
is allowed to be taught in a simulator only. Any
topic that is mandatory in the education will not
be approved using only a simulator (NPRA 2017),
despite research indicating, for instance, that
the mandatory dark driving demonstrations have
the same learning outcome whether taught in real life
or in a simulator (Mikkonen 2007; Robertsen et al.
2017). A different approach is taken in Finland,
where dark driving sessions are approved using a
simulator, so these aspects are not internationally
agreed upon.

There have not been many empirical studies
measuring and discussing the learning outcomes
from using simulators in driving education. We
found only one published study on the use of
simulator training in driver education in Norway
(Robertsen et al. 2017). This study compared the
theoretical learning outcome of traditional training
and simulator-based training for the dark driving
demonstration. Dark driving is a part of the first
module (basic handling of the car) in the
Norwegian driving education program. The study
showed no significant differences between the two
groups in theoretical knowledge of dark driving.
According to some of the international empirical
studies concerning driving simulators, it seems that
simulator training could be useful in driving
education. In a study carried out in the Netherlands,
de Winter et al. (2009) found that better driving
simulator performance increased the actual driving
skills on the roads and the chance of passing the
final driving test. Additionally, Crundall et al. (2010)
found that commentary training in a driving simulator
has beneficial effects on driving behavior in the
UK. For instance, it was found to improve
responsiveness to hazards on the roads. Wang et al. (2010)
have pointed out that road hazard performance
was significantly higher for a simulator-trained
group of novice drivers than for others. Divekar et al.
(2016) also report that PC-based simulator training
increases novice drivers’ awareness and driving
skills in real-life operations. Additionally, a German
study showed that the training period could be
reduced by 21 days when using a simulator instead
of traditional training with a driver instructor
(Reindl, Gunther, & Wottge 2016). However, in all
these empirical studies there are methodological
challenges in isolating and measuring the learning
outcome from simulator training, and in determining
the transferability from improvements in the
simulator to improved driving on real roads. Another
common challenge in these studies is the difficulty
of measuring the long-term effects on the drivers’
skills and behavior. Nevertheless, there seems to be
agreement in these empirical studies that novice
drivers in particular have a significant short-term
positive learning outcome from training specific
elements in simulators.

Based on this earlier research, a systematic
offer of simulator training in the official driving
education might make it easier to learn the basic
skills in handling a car and to train soon-to-be
drivers in adjusting their road traffic behavior
to the circumstances on the road. Hence,
it seems likely that driving in a simulator,
accompanied by other training methods, could be used
to improve the various driving and safety skills
during the phase of learning to maneuver a car.
In particular, in order to reduce the risks of young
drivers, training with professional driving instructors
combined with simulator training seems to be
gradually becoming accepted as a useful tool in
developing driving skills.

The gains of supplying adequate simulator
training are also related to the possibility to train
in a secure environment where the negative
consequences of making mistakes are eliminated.
Simulator training is also environmentally friendly
and flexible, and could train driver learners in
different road traffic environments at any time of
the year. For instance, Norway has a large variance
in road traffic density. This means that
different mandatory training scenarios, such as
urban and rural driving, are fairly easy to obtain in
the cities and other densely populated areas.
However, in some rural areas access to urban driving
might entail a long journey. Widespread access
to driving simulators could potentially reduce the
number of those long journeys. Due to the long
dark winter in Norway, it is particularly important
to learn how to drive in the dark and handle the
challenges of darkness. However, due to the lack
of darkness in summer, the mandatory dark driving
demonstration can only be conducted from the
end of October until mid-March. With a simulator
of sufficient quality, dark driving demonstrations
can be given all year round. Other environmental
challenges are seen in other countries, such as
southern France (Goepp 2017), where it is not
unlikely to go through the entire driver education
without facing rain. The possibility for all driver
learners to experience different weather conditions
and road traffic densities is a good argument for
using a simulator. Additionally, simulator training
has the potential to be cheaper for the driver
learner if sufficient instructions are given virtually,
thereby removing the necessity of one driving
instructor per driver learner.

Simulator training should also be seen in con- 
nection to the rapid technological development 
seen in the automobile industry. When a driver 
learner has finished the driving education, should 
he/she only be able to handle the basic technology 
found in every car, or should he/she have learned 
how to use and interact with new technology 
introduced in new cars to assist the driver? Stay- 
ing in touch with the technological race while 
using a traditional training approach would entail 
a very frequent replacement of vehicles. Using a 
simulator approach might be easier as a software 
update could potentially provide the new technol- 
ogy or features to be used without replacing the 
simulator. 


Research from other industries has shown that
training for new and more automated technology
is important in order to avoid unwanted
incidents (Salas et al. 2006; Sætren & Laumann
2015). It would be beneficial to have a system
for training drivers who buy new cars with new
Advanced Driver Assistant Systems (ADAS)
technology. Research shows that buyers have very
limited knowledge of the new technology in their new
cars and of how it can be used. One of the reasons
could be that only 24% reported that they received
instructions regarding the ADAS technology from
the manufacturer when buying the car (Harms &
Dekker 2017). Training in simulators could provide
an important alternative for learning how to
drive with ADAS technology, for instance for drivers
who already hold a driver license but need
training for new technology.


3 CHALLENGES IN SIMULATOR 
TRAINING 


In order to implement a broader use of simulator
training in driver education, there are many
challenges and obstacles that have to be discussed and
solved. First, there are technological challenges in
developing adequate hardware and software that
make the simulators relevant for learning to drive on
the road. For instance, to what degree could and
should simulators be designed to give the driver
learners a “car-like” experience when training,
and should the simulators be designed to make it
possible to adjust for differently equipped cars? A
software challenge would be to design adequate
road traffic situations that train the driver learners
to handle and control the vehicle on the road in
different road traffic settings.

One of the main challenges in simulator-based
training, if it is to be used for more
advanced driver education, is to adjust the training
to the GDE-matrix. As mentioned, the hierarchical
GDE-matrix is an important base for the Norwe- 
gian driver education. The GDE-matrix originally 
consisted of four levels, where the first level is vehi- 
cle maneuvering, second level is mastering road 
traffic situation, the third level is goals and context 
of driving, and fourth level is goals for life and skills 
for living (Keskinen 1996 in Hatakka et al. 2002) 
and later a fifth level, social skills, was added (Kes- 
kinen et al. 2010). The skills are learned through 
theoretical and practical teaching in addition to 
individual and group work. Reaching levels four and
five takes time and maturity; thus, the authorities
recommend starting at age 16 in order to have a
driver license at age 18. The main reason for this
is that adequate psycho-motoric skills and
physiological functions are found not to be sufficient
for good and safe driver performance. For instance,
when the lowest levels of the hierarchy are learned,
they are applied under guidance of higher level
objectives. Hence, the training of basic skills is
important, but the driver learner should also be
able to deal with goals higher in the hierarchy, such
as dealing with social pressure (Hatakka et al.
2002). There is little doubt that simulator training
could be of help for the lower levels in the
GDE-matrix, but for the higher levels, including
self-evaluation, simulator training might
not be optimal. Thus, simulator training cannot
completely replace the traditional driver’s
education, but can be a supplement.

Increasing the use of simulators in Norway has
specific legal challenges. The mandatory driver’s
education is regulated such that it must be given
by professional driver instructors while the driver
learners are sitting behind the wheel of an actual
car. Hence, training in simulators can only be seen
as an additional part of an education program and
not a part of the mandatory education. That being
said, learning how to drive entails a large amount
of training outside of the mandatory elements,
allowing driving schools to use simulators to a
substantial degree if they want to. The main obstacle
in Norway seems to be the lack of simulators. There
might be several reasons for this, but it seems that, in
general, the driving schools do not consider it
economically beneficial to offer simulator training.
In addition to the investment cost of a simulator, the
driving schools have to handle the cost of software
updates, maintenance, and training staff in simulator
handling. Without simulators, the main income
of a driving school is hours spent on the road with
driver learners. When introducing a simulator
(particularly one that is cheaper than traditional training),
the school has to change its business model from
selling man-hours to also selling simulator-hours.
For the driving instructors this would undermine
their occupation. This is further underlined by the
argument for having simulators in countries such
as France, where it was emphasized that in a school
with 5-6 instructors, one driver instructor can be
replaced with a simulator (Goepp 2017). Unlike in
Germany, for instance (Stiegler & Vennefrohne
2017), there is no shortage of driving instructors in
Norway; thus, the Norwegian market does not create
a need for simulators to improve efficiency.

Another challenge is simulator sickness. Simulator
sickness is a subset of motion sickness, and many
experience nausea after only a short while in a
driving simulator, which limits the usefulness of
simulator training (de Winter et al. 2012). Simulator
sickness is due to the perceived discrepancies
between the motion expected by the participant and
the motion displayed in the simulator. This has been
a problem in a wide range of simulators and virtual
reality applications, but it is gradually decreasing
as the simulations improve in terms of both
responsiveness and reduced delays.
Research shows that younger individuals are less
prone to simulator sickness than older individuals
(Brooks et al. 2010). This could be beneficial for driver
learners, but might be a hindrance for using
simulators to maintain driving skills for those who
have had a driving license for some time.


4 CONCLUDING REMARKS 


Our impression is that the car driving education 
in Norway could have advantages in using simula- 
tors more systematically than what has been done 
until now. There are many possible advantages 
from introducing simulators in driving schools 
such as cost effectiveness, environmentally friendly 
training, repeatability, accessibility to different 
scenarios (accident scenarios and dangerous situ- 
ations, darkness and snow outside of winter, dif- 
ficult weather conditions and extreme road traffic 
density), the possibility to make errors in a safe 
environment, and interaction with new technology 
such as advanced driver assistant systems. However,
there are challenges, such as how to increase
the number of simulators in Norway, and legal
obstacles, as current legislation requires all mandatory
parts of the Norwegian driver education to be
conducted on the road.

The experiences from other European countries 
and the few empirical studies that exist provide 
some insight into the potential for simulators to 
be used in driving education in Norway. However,
more research should be done to find out which
parts of the driving education could be performed
in a simulator, and how the simulator could
be set up to optimize the learning outcomes.

ABSTRACT: Natural gas supply systems are complex pipeline networks, whose features can cause
severe consequences under unexpected scenarios. In this work, we present ongoing research
to develop a method for efficiently measuring and locating pipelines which are critical to gas distribution.
Graph theory, thermal-hydraulic simulation, optimization and the network flow method are integrated to
calculate the pipeline criticality with respect to the supply of gas. Graph theory is applied to model the
supply-transmission-demand systems. A capacity model, superposed on the graph model, and a
combination of thermal-hydraulic simulation and optimization are used to simulate the system operation under
different scenarios and estimate the different supply performances of the pipelines. This allows the
criticality of the pipelines to be assessed by the network flow method. For demonstration, the method is applied to a
relatively complex gas pipeline network. The results of the application show that the method can provide
analytical information useful to improve the system robustness and perform a more efficient and effective
protection of the gas supply system.


1 INTRODUCTION 


Reliable supply service from natural gas pipeline
networks is important for economic development
and social stability. Although many efforts
have been made, potential vulnerabilities still exist
because of uncertain environments, system structure
complexity, demand fluctuation and resource
limitation. When unexpected events occur, e.g., political
crises, terrorist attacks, third-party activities,
extreme weather events, etc., severe consequences
may follow (Su et al., 2017; Enrico Zio, 2016).

In many application areas, the vulnerability 
analysis of complex transmission networks has 
been given increasing attention, e.g., transportation 
systems (Hong, Ouyang, Peeta, He, & Yan, 2015; 
Mattsson & Jenelius, 2015), supply chain (Thekdi 
& Santos, 2016), power grids (Cadini, Agliardi, & 
Zio, 2017; E. Zio & Golea, 2012; Enrico Zio, 2014; 
Enrico Zio & Sansavini, 2013), etc. However, the
vulnerability of natural gas transmission networks
has not yet received sufficient attention.


Vulnerability is a term which is applied with several
different definitions in the literature (Kröger
& Zio, 2011). In this paper, vulnerability is defined
as the inherent system defects in absorbing the effects
of failures or in restoring the system to normal
condition. Vulnerability analysis focuses on the inherent
properties of the system rather than on the environment
and probabilities, which are the focus of reliability and
risk analysis. Applying vulnerability analysis to
evaluate the criticalities of components
of gas pipeline network systems can help to find
the “bottlenecks” of a natural gas pipeline network
system, improve the ability to withstand unexpected
damages and reduce the potential risk of loss.

There are several common methods to quantify
the element criticality with respect to vulnerability.
In a transmission network system,
element criticality evaluation is usually performed
based on the consequences of failures or the system
structure. In the consequence-based methods, the
criticality of an element is evaluated according to the
direct, and sometimes indirect, effect of its failure
(Johansson & Hassel, 2010). From the system structure
perspective, some identification approaches
have been proposed based on different considerations,
e.g. flow-based performance (Fang, Pedroni,
& Zio, 2015; Nicholson, Barker, & Ramirez-Mar- 
quez, 2016a; Enrico Zio & Piccinelli, 2010), network 
efficiency (Deng, Li, & Lu, 2015; Han. F; E. Zio, 
2014; E Zio, 2007), accessibility (Demirel, Kompil, 
& Nemry, 2015), etc. In this paper, we use a measure 
of network performance (network flow) rather than 
the graph-theory-based measure to evaluate the crit- 
icality. Several works have been carried out recently 
to compare the topological model and the flow- 
based model for power grids (Ouyang, Zhao, Pan, 
& Hong, 2014) and the results show similarities. 

The flow-based element criticality analysis
requires a model that is capable of capturing the
system behaviours, the amount of gas requested
at demand sites and the supply capacity. System
dynamics methods and thermal-hydraulic methods
can analyse the consequences in detail, but they
are not suitable for element criticality evaluation
because of the high computational cost of the
detailed numerical simulations. Hence,
it is important to develop a computationally efficient
model which can reflect system behaviours and
calculate the consequences of element failures.

In this paper, a method of element criticality
analysis of natural gas supply pipeline
networks is developed considering the system
vulnerability, based on the performance of network
flow. In Section 2, a consequence analysis model
is developed based on network flow theory,
combined with an optimization part. The model can
simulate the system responses to element failures
and estimate the supply capacities of both
the demand sites and the global system, taking into
account the constraints of physical limitations.
Section 3 provides three criticality measures
emphasizing the transmission performance based
on different considerations. Finally, in Sections 4
and 5, the method is applied to a complex
natural gas pipeline network and the results
are analysed in detail.


2 NATURAL GAS PIPELINE NETWORK 
MODEL FOR CONSEQUENCE 
CALCULATION 


For the development of the consequence assess- 
ment model, only the components relating to gas 
supply capacity in the pipeline networks are con- 
sidered, i.e. pipelines, compressor stations, gas 
sources and consumers. This is because the objec- 
tive of the assessment is supply service. The devel- 
opment of the model of capacity/consequence 
analysis has followed the steps in Figure 1. 
The inputs of the model include: 


Figure 1. Process of modeling for consequence
analysis.


A. Pipeline parameters: diameter, length, rough- 
ness etc. 

B. Parameters of compressor stations; 

C. Properties of natural gas; 

D. Information of gas sources: locations, types, 
capacities; 

E. Information of consumers: locations, demands; 

F. Topology information of the pipeline network 
structure. 


2.1 Natural gas supply-transmission-demand 
system abstraction 


The pipeline network is abstracted in the form of a 
directed weighted network. The pipeline junctions, 
consumers, gas sources are represented as nodes 
and pipelines are represented as arcs. The weights 
on the arcs denote the capacities of the pipelines. 
The capacities are calculated by thermal-hydraulic 
analysis, based on the above inputs A, B and C. 
When the pipeline capacities change, the weights in 
the weighted network change accordingly. 

If an unexpected event occurs, operators will
take actions to reduce the negative impacts on the
system. These actions generally include adjusting
the supply of gas sources and re-distributing gas.
In this model, all these actions can be swiftly
performed by changing the weights on the arcs. The
optimization process is introduced in the
following sections.


2.2 Modeling of gas transmission optimization 


In the optimization process, supply distance and 
economic efficiency are considered as the most 
relevant attributes to gas transmission planning. 
Firstly, a “standard cost” matrix C is developed 
based on network topology and the network ele- 
ment cost is calculated by equation 1. This so- 
called “standard cost”, which is determined by the 
cost of transmission and pipeline length, repre- 
sents a factor for the optimization of the transmis- 
sion path, not a real cost: 


C_{ij} = \alpha L_{ij} \times \beta (Q_{ij} \times L_{ij} \times c) \qquad (1)


where C_ij represents the optimization factor
combining distance and cost ($). Although the unit
of the “standard cost” is $, it is different from the
actual cost. Q_ij represents the designed quantity
of natural gas transported from i to j (MCM); L_ij
represents the length of the pipeline from node i
to node j (km); α and β, ranging from 0 to 1,
represent the importance weights of distance and
cost of transmission, respectively; c represents
the cost of gas transportation ($/(km·MCM)).

Then, “standard cost” vectors are calculated
based on matrix C. Each vector contains the
“shortest paths”, in terms of “standard cost”,
between a gas source and all remaining nodes
(except the other gas sources), found by the Floyd
algorithm (Ahuja, Magnanti, & Orlin, 1993). To
find the sequence of nodes that the gas flow from
each source should follow along its transmission
path, the “standard cost” vectors are sorted in
ascending order. For a specific gas source,
priority is given to the nodes with lower
“standard cost”, which means lower cost and
relatively higher efficiency.

The algorithm also checks whether the supply
capacity of the source is exhausted: if there is
still residual capacity and there are unsatisfied
customers, the algorithm searches for the next
unsatisfied demand node in the node sequence. This
process continues until all demands are satisfied
or the supply capacity of the source is exhausted.
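
The ordering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 4-node cost matrix, the function names (floyd_warshall, supply_order, serve_in_order) and the demand values are all hypothetical, and the serving loop deliberately ignores pipeline capacities, which the model handles later through the max-flow step.

```python
import math

INF = math.inf


def floyd_warshall(cost):
    """Return the all-pairs minimum "standard cost" for a square matrix.

    cost[i][j] is the standard cost of arc (i, j), or INF if no arc exists.
    """
    n = len(cost)
    dist = [row[:] for row in cost]
    for i in range(n):
        dist[i][i] = 0.0
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


def supply_order(dist, source, sources):
    """Reachable non-source nodes, sorted by ascending standard cost from `source`."""
    nodes = [j for j in range(len(dist))
             if j not in sources and dist[source][j] < INF]
    return sorted(nodes, key=lambda j: dist[source][j])


def serve_in_order(order, demand, capacity):
    """Serve unsatisfied demand nodes in priority order until the source is exhausted.

    Simplification (assumption): pipeline capacities are ignored here.
    """
    served = {}
    for node in order:
        if capacity <= 0:
            break
        need = demand.get(node, 0.0)
        if need > 0:
            served[node] = min(need, capacity)
            capacity -= served[node]
    return served


# Hypothetical network: node 0 is the only gas source.
cost = [
    [INF, 2.0, 5.0, INF],
    [INF, INF, 1.0, 4.0],
    [INF, INF, INF, 1.0],
    [INF, INF, INF, INF],
]
dist = floyd_warshall(cost)
order = supply_order(dist, source=0, sources={0})
served = serve_in_order(order, demand={1: 1.0, 3: 2.0}, capacity=2.5)
print(order, served)
```

In this toy case the source reaches node 2 more cheaply through node 1 than directly, so the sorted sequence follows the indirect route.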


2.3 Supply capacity calculation 


On the basis of the transmission directions
determined in Section 2.2, we proceed to optimize
the plan of natural gas distribution in the
pipeline network (volume of gas supplied by
different sources and gas flow in the pipelines).
This problem is converted to a max-flow problem in
graph theory and solved by the Ford-Fulkerson
algorithm.

In this process, two constraints are to be 
respected: 


• The sum of the flows exiting a node is equal to
the sum of those entering the node (except for
the source and sink nodes).

• The flow in the arcs is within the capacity
limitations.


The Ford-Fulkerson algorithm is carried out by the
following steps (Ahuja et al., 1993):


• Searching for paths that connect the sources and
the sinks with available capacities on the arcs.

• Repeating the search process until no additional
flow can be added along any path.


In general, gas pipeline networks have more than
one source or sink, whereas the Ford-Fulkerson
algorithm solves the “single source and sink”
problem. To convert the “multiple sources and
sinks” problem to a “single source and sink”
problem, we assume a “super source” and a “super
sink” connected to all the sources and customers,
respectively, by arcs with unlimited capacities.


Figure 2. Flowchart of the capacity/consequence
analysis model.

Finally, the model outputs the supply capacities
of the overall system and of the customers in the
gas pipeline network. By comparison with the
demands requested, the number of unsatisfied
customers and the gaps between supply and demand
are calculated as inputs of the element criticality
analysis. The flowchart of this process is shown in
Figure 2.
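
The conversion to a single-source, single-sink max-flow problem can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' code: it uses the breadth-first (Edmonds-Karp) variant of Ford-Fulkerson, and, unlike the unlimited-capacity super arcs described above, it caps each super-source arc at the source limit and each super-sink arc at the customer demand so that both bounds are enforced inside the flow problem. The node labels "S*" and "T*", the function names and the three-node example are hypothetical.

```python
from collections import deque


def max_flow(capacity, s, t):
    """Edmonds-Karp max flow. capacity: {(u, v): cap}. Returns (value, arc flows)."""
    residual, adj = {}, {}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0.0) + c
        residual.setdefault((v, u), 0.0)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0.0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and residual[(u, v)] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[arc] for arc in path)
        for (u, v) in path:
            residual[(u, v)] -= push
            residual[(v, u)] += push
        total += push
    flow = {arc: max(0.0, c - residual[arc]) for arc, c in capacity.items()}
    return total, flow


def supply_capacity(pipes, sources, demands):
    """Total and per-customer supply using a super source "S*" and super sink "T*"."""
    arcs = dict(pipes)
    for node, limit in sources.items():
        arcs[("S*", node)] = limit      # assumption: capped at the source limit
    for node, demand in demands.items():
        arcs[(node, "T*")] = demand     # assumption: capped at the demand
    total, flow = max_flow(arcs, "S*", "T*")
    delivered = {node: flow[(node, "T*")] for node in demands}
    return total, delivered


# Hypothetical example: one source "a", customers "b" and "c" (MCM/d).
pipes = {("a", "b"): 3.0, ("b", "c"): 2.0}
total, delivered = supply_capacity(pipes, sources={"a": 5.0},
                                   demands={"b": 2.0, "c": 3.0})
shortage = {n: d - delivered[n] for n, d in {"b": 2.0, "c": 3.0}.items()}
print(total, delivered, shortage)
```

The per-customer gap between demand and delivered flow is the quantity passed on to the criticality analysis.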


3 CRITICAL PIPELINE ANALYSIS OF GAS 
SUPPLY SERVICE 


Because of its structural complexity, a large
natural gas pipeline network may contain unknown,
perhaps previously unimagined, system weaknesses.
Critical element analysis focuses on identifying
the weaknesses that contribute to the supply
vulnerability of the gas pipeline network. A
pipeline, or a combination of pipelines, is a
critical component depending on how essential it
is for the supply service of the pipeline network
system. Compressor stations can also affect the
supply capacity, but the impacts of their failures
eventually amount to degradation of pipeline
transmission capacities.

Generally, the critical component analysis is
performed by estimating the consequences of
failures of a single component or of multiple
components (Fang & Zio, 2013; Nicholson, Barker, &
Ramirez-Marquez, 2016b; E. Zio & Golea, 2012).
This method is performed exhaustively up to a
given number (from 1 to P) of simultaneous
failures. The value of P is usually from 3 to 5.
In this range, every possible failure or failure
combination should be simulated with the
consequence analysis model developed in Section 2.
This kind of analysis gives a comprehensive
picture of the system vulnerability within the
range of simultaneous failures considered.

However, in a complex natural gas network, the
attack-consequence analysis method can be
infeasible because of the huge computational
burden. To overcome this problem, a flow-based
method, which combines topology properties and
flow properties of the gas pipeline network, is
applied in our research to measure the
criticalities of the pipelines.

Several works have explored the measure of arc
criticality in a network by graph theory and
network flow theory. Because the importance of a
pipeline depends on its role in the network
topology and its flow capacity, an index named
weighted flow capacity rate (WFCR) (Nicholson
et al., 2016a) is chosen to evaluate the
criticalities of the pipelines. WFCR is the
combination of flow capacity rate (FCR) (Nicholson
et al., 2016a) and flow centrality (FC) (Freeman,
Borgatti, & White, 1991). FCR is the sum of the
percentages of arc flows to arc capacity for all
s-d max flow problems: c_ij represents the
transmission capacity of pipeline (i, j); N is the
number of s-d pairs. FC represents the sum of flow
in pipeline (i, j) for all possible s-d pair
max-flow problems divided by the sum of all pairs
max flows: MF_sd(i, j) denotes the flow on pipeline
(i, j) when the max flow MF_sd is from s to d
(s represents gas sources and d represents demand
sites).
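
As an illustration of the two base indices, the sketch below computes FC and FCR from already-solved s-d max-flow problems. It is a hedged sketch, not the formulation of Nicholson et al.: the input layout (pair_results), the function names and the example values are hypothetical, and dividing FCR by the number of s-d pairs N is an assumption about how N enters the definition. The weighted combination WFCR is not reproduced here.

```python
def flow_centrality(pair_results):
    """FC per arc: summed arc flow over all s-d pairs / summed max flow of all pairs.

    pair_results: {(s, d): (max_flow_value, {arc: flow_on_arc})}
    """
    total_max_flow = sum(value for value, _ in pair_results.values())
    summed = {}
    for _, arc_flows in pair_results.values():
        for arc, flow in arc_flows.items():
            summed[arc] = summed.get(arc, 0.0) + flow
    return {arc: flow / total_max_flow for arc, flow in summed.items()}


def flow_capacity_rate(pair_results, capacity):
    """FCR per arc: flow-to-capacity ratios summed over s-d pairs.

    Assumption: the sum is normalised by N, the number of s-d pairs.
    """
    n_pairs = len(pair_results)
    rate = {arc: 0.0 for arc in capacity}
    for _, arc_flows in pair_results.values():
        for arc, flow in arc_flows.items():
            rate[arc] += flow / capacity[arc]
    return {arc: value / n_pairs for arc, value in rate.items()}


# Hypothetical two-arc chain a -> b -> c with two s-d pairs.
capacity = {("a", "b"): 4.0, ("b", "c"): 2.0}
pair_results = {
    ("a", "c"): (2.0, {("a", "b"): 2.0, ("b", "c"): 2.0}),
    ("a", "b"): (4.0, {("a", "b"): 4.0}),
}
fc = flow_centrality(pair_results)
fcr = flow_capacity_rate(pair_results, capacity)
print(fc, fcr)
```

Here arc (a, b) carries all of the flow in both problems, so it has the highest FC, while arc (b, c) is saturated whenever it is used.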

The FC provides the contribution of pipeline
(i, j) to the transmission capacity of the pipeline
network, and FCR accounts for its potential
criticality due to capacity constraints. Therefore,
by weighting each term in equation 3 by the FC
value, WFCR represents the expected impact of
pipeline (i, j) on the system performance of
supply. In the notation above, FC reads:

FC_{ij} = \frac{\sum_{s,d} MF_{sd}(i,j)}{\sum_{s,d} MF_{sd}}

Figure 3. Topology of the gas network.


Table 1. Properties of the gas sources of the gas
network.

Location   Type            Limit (MCM/d)
9          LNG terminal    4
10         Pipeline        31
15         LNG terminal    10
18         Pipeline        25
50         LNG terminal    7.1


Table 2. Demands of customers of the gas network.

Location   Demand (MCM/d)   Location   Demand (MCM/d)
4          1.43             34         1.00
5          1.57             35         1.00
6          1.66             36         1.74
9          1.46             37         1.30
12         4.40             38         1.00
16         1.54             40         2.00
17         0.50             41         1.40
20         1.50             42         0.50
24         1.60             44         1.06
25         1.80             46         1.82
27         2.50             47         0.68
29         2.00             48         1.17
32         0.80             49         2.00
33         0.80             51         0.98


4 CASE STUDY 


We consider a complex natural gas pipeline network, 
assuming coherent and reasonable data and infor- 
mation of the pipeline network, including customer 
demands, pipeline parameters, compressor station 
parameters and gas source capacity. The pipeline 
network is shown in Fig. 3. Assumptions about 
locations, capacities and types of gas sources are 
reported in Table 1, whereas the demands are listed 
in Table 2. 


5 RESULTS OF CRITICAL COMPONENT
ANALYSIS


Firstly, to estimate the criticalities of
pipelines, the consequences obtained with the
“direct attack” method were analysed. The
exhaustive analyses were performed for N-1 to N-3
simultaneous failures. All the possible
consequences are sorted from high to low and
presented in Figure 4. The pipelines or pipeline
combinations with high consequences have
relatively high criticalities.

However, for a complex gas pipeline network
with hundreds, or even thousands, of pipelines, the
direct attack method requires a very large number
of simulations and it is impossible to perform
exhaustive analyses under high-order scenarios.
Hence, the weighted flow capacity rate (WFCR) of
Section 3, given in equation 2, is used to measure
the criticalities of pipelines. The pipelines are
sorted from highest to lowest according to WFCR
in Table 3. For further comparison, FCR, which
measures the pipelines’ potential criticality due
to capacity limitation, and FC, which measures the
contributions of pipelines to system transmission
capacity, were also calculated. The sequences of
pipeline criticalities based on FC and FCR are
also shown in Table 3.

In Table 3, the pipelines which are high-ranked 
by FCR are more prone to become bottlenecks 
because of the low margins between their capaci- 
ties and loads. The pipelines with higher FC val- 
ues have heavier burdens during the process of 
gas transmission in the gas pipeline network. The 
WFCR index combines the concepts of both FCR 
and FC, and the pipelines with higher WFCR val- 
ues are more critical for the gas supply capacity of 
the gas transmission network. 

The effectiveness of the proposed method was
verified by a “random attack & preparedness
policy” method. In this method, random attacks
were first carried out on the pipeline network.
The numbers of random attacks for N-1, N-2 and N-3
simultaneous failures are 100, 10000 and 500000,
respectively. When pipelines are sampled, their
capacities are reduced to zero. Secondly, six
preparedness policies were performed. The first
policy selects no pipeline to harden. The second
policy randomly selects 15% of the pipelines to
harden. The 3rd to 5th policies harden the top 15%
critical pipelines based on FC, FCR and WFCR,
respectively. The 6th policy selects the top 15%
critical pipelines or pipeline combinations
according to the consequences of direct attacks.
Under the preparedness policies, the selected
pipelines maintain 70% of their normal capacities
after an attack. The consequences are expressed as
“100% × (shortage/normal demand)”. The results of
the “random attack & preparedness policy”
simulations are shown in Figures 5-7 in the form of
box plots. Supplementary statistical information
is listed in Table 4.


Figure 4. Distribution of shortage for N-1, N-2, N-3
simultaneous failures.


Table 3. Pipeline ranking according to importance
calculation based on different methods.

     FC            FCR           WFCR
From   To     From   To     From   To
10     2      10     2      10     2
16     11     18     19     10     11
18     19     17     39     10     49
10     11     43     5      18     19
10     49     16     33     16     11
11     12     23     24     16     33
19     20     31     29     11     51
16     33     10     11     11     12
2      3      20     27     20     27
20     27     18     22     17     39
2      45     9      52     23     24
18     22     9      53     18     22
11     51     11     51     10     42
10     42     14     13     18     17
18     17     31     32     33     17
33     17     39     40     43     5
19     21     46     43     19     21
21     23     33     17     21     23
23     24     10     42     2      4
17     39     18     17     3      47
3      46     22     21     50     6
27     28     19     21     31     29
28     31     21     23     22     21
45     43     15     33     9      53
13     12     29     32     14     13
43     5      2      45     39     40
31     29     16     11     3      46
21     20     21     20     2      45
22     21     10     49     29     32
2      4      2      4      2      3
3      47     3      47     21     20
22     17     5      34     15     33
50     6      30     29     46     35
9      53     9      53     19     20
14     13     32     37     24     25

According to the information in Table 4 and
Figures 5-7, all the criticality-based policies
reduce the loss significantly compared with the
“random policy” and “do nothing”. The combination
index, WFCR, is more effective than FCR and FC,
and slightly less effective than the criticalities
based on direct consequence calculation. However,
the computational burden of the flow-based method
is far lower than that of the direct attack method.
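
The attack and hardening rules of the “random attack & preparedness policy” simulation can be sketched as follows. This is only an illustrative sketch of the rules stated in the text (failed pipelines drop to zero capacity, hardened ones keep 70%, consequence = 100% × shortage/normal demand); the function names, the seeded random generator and the example values are hypothetical, and the delivered supply would in practice come from the consequence model of Section 2.

```python
import random


def sample_attack(capacity, k, rng):
    """Draw k distinct pipelines to fail simultaneously (an N-k scenario)."""
    return rng.sample(sorted(capacity), k)


def attacked_capacities(capacity, failed, hardened, retained=0.7):
    """Capacities after an attack: hardened pipelines keep `retained` of normal capacity."""
    after = dict(capacity)
    for arc in failed:
        after[arc] = capacity[arc] * retained if arc in hardened else 0.0
    return after


def shortage_percent(delivered, demand):
    """Consequence measure: 100% x (shortage / normal demand)."""
    total_demand = sum(demand.values())
    shortage = sum(max(0.0, demand[n] - delivered.get(n, 0.0)) for n in demand)
    return 100.0 * shortage / total_demand


# Hypothetical example with two pipelines, one of them hardened.
capacity = {("a", "b"): 10.0, ("b", "c"): 5.0}
failed = sample_attack(capacity, k=2, rng=random.Random(0))
after = attacked_capacities(capacity, failed, hardened={("a", "b")})
loss = shortage_percent(delivered={"b": 1.0, "c": 0.5},
                        demand={"b": 1.0, "c": 2.0})
print(after, loss)
```

Repeating the sampling many times per failure order and per policy gives the distributions that the box plots summarise.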


Figure 5. Supply vulnerability by different
preparedness policies (1 pipeline failure).

Figure 6. Supply vulnerability by different
preparedness policies (2 pipeline failures).

Figure 7. Supply vulnerability by different
preparedness policies (3 pipeline failures).


Table 4. Supplementary information of vulnerability of
the gas pipeline network by preparedness policy.

                1 pipeline failure   2 pipeline failures   3 pipeline failures
Policy          Mean (%)   Var       Mean (%)   Var        Mean (%)   Var
None            1.77       10.68     3.56       21.44      5.43       31.17
Random          1.64       9.93      3.32       19.73      5.19       31.06
FC              1.18       7.54      2.63       15.94      4.13       24.29
FCR             1.22       7.16      2.49       15.54      3.85       23.49
WFCR            1.01       4.41      1.89       10.23      2.70       12.89
Direct attack   0.93       4.61      1.37       4.10       2.19       8.30


6 CONCLUSIONS 


In this work, we develop a flow-based method to
identify the critical pipelines in a natural gas
pipeline network from the vulnerability
perspective. Compared with the large computational
cost of the direct attack method when it is used
for large complex pipeline networks, the
flow-based criticality measurement used in this
work is much more efficient, with a small loss of
effectiveness. The flow-based method combines the
topology and the capacity contribution of a
pipeline to represent its criticality to the supply
service of the gas transmission system. In the case
study, both approaches have been performed and
their effectiveness has been compared by the
“random attack & preparedness policy” simulation.
The results show that the accuracy of the
flow-based method is slightly lower than that of
the direct “attack-consequence” method, but the
computational burden of the latter is far higher
than that of the former.


ABSTRACT: UK customers visited community pharmacies to receive NHS prescriptions 1.104 billion
times in 2016. One study of dispensing errors found an error rate of 3.3%. Severe dispensing
inaccuracies often receive a high level of media attention; however, lower level errors could also be
causing significant inefficiencies in the delivery of primary healthcare. This paper presents a modelling
approach for analysing the reliability and efficiency of community pharmacy performance using a
Coloured Petri Net (CPN) methodology. The model considers how single prescriptions are processed,
the use of staff resources, and the occurrence of errors. The CPN evaluates performance over a set of
key performance indicators. Results are validated, where possible, against published studies of
community pharmacies.


1 INTRODUCTION 


1.1 Background 


Over the past 50 years there has been a growing 
awareness that healthcare systems are capable of 
inflicting harm to patients, and this harm should 
be reduced (Health Foundation, 2011). Two key 
reports by the US Institute of Medicine (Mullan 
et al, 2001) and the UK Department of Health (DoH, 
2000) helped to spread the message that iatrogenic 
patient harm within healthcare systems is an impor- 
tant issue. Notably, if the community pharmacy 
dispensing error rate of 3.3% (Franklin & O’Grady, 
2007) is considered, this could mean that around 36 
million UK prescriptions per year contain errors. 

As well as safety concerns, studies have shown 
that patient satisfaction with pharmacy services is 
linked to waiting times (Afolabi & Erhun, 2003). 
Extended waiting times have been given as a rea- 
son why patients will not return to a particular 
pharmacy (Somani & Daniels, 1982), and content 
customers are increasingly likely to return to their 
specific healthcare provider (Dansky & Miles, 1997). 


1.2 Reliability engineering 


Reliability engineering techniques are used by
many industries and it has become common for
complex systems to be subjected to risk assessment
processes (Andrews, 2009). These assessments have
historically been carried out in conventional high
risk industries, such as the aviation (Netjasov &
Janic, 2008), nuclear (Hsueh & Mosleh, 1996) and
space sectors (Garrik, 1988), where effects of
failure can be catastrophic.

Fault trees and event trees are examples of
widely used reliability engineering techniques.
They use combinatorial logic to combine events
to produce both qualitative and quantitative
analysis of failures (Vesely et al, 2002). Fault tree
analysis requires that the occurrence of events is
independent.

Markov models are memoryless processes capable
of modelling more complex systems, which might
typically contain repair strategies and dynamic
behaviour (Boyd, 1998). A key limitation to
implementing a Markov model for a given system
arises from the fact that the number of system
states to consider grows exponentially with the
number of components in the system.

Petri Nets are an effective tool for modelling
processes or systems exhibiting concurrency
(Schneeweiss, 1999). Since the publication of Carl
Adam Petri’s thesis in 1961, a number of
extensions of the basic technique have been
developed. Two important examples of Petri Net
extensions are timed and Coloured Petri Nets
(Jensen, 1996). Timed nets use either deterministic
or stochastic delay timings to control the timing
of transitions. This gives the opportunity to model
temporal processes. Meanwhile, incorporating token
colour sets into Petri Net modelling enables token
specific information to be propagated around the
net. This can then be used to control and
manipulate the net’s behaviour. Coloured Petri
Nets have been utilized to model complex systems
in a wide range of areas (Liu, 2017).

The healthcare sector, primary care especially, 
represents a relatively new area for reliability mod- 
elling. Previous healthcare modelling studies have 
been centred in secondary healthcare settings. In 
this field, Petri Nets have been used to model hos- 
pital departments (Dotoli et al, 2010), hospital 
information systems (Darabi & Galanter, 2009), 
and mental health care services (Dammasch & 
Horton, 2007). Michael R. Cohen et al utilized 
fault trees to conduct a risk assessment of dispens- 
ing in community pharmacies (Cohen et al, 2012), 
and their error probabilities are also used in this 
paper. 

The novelty of the proposed approach in this 
paper is the ability to perform safety and efficiency 
evaluation within the framework of a single mod- 
elling technique. Therefore, a timed CPN model is 
developed and a wide range of performance indi- 
cators is obtained, using simulations. Model out- 
puts can be used to support resource management 
and safety improvement decisions. The community 
pharmacy dispensing process is presented in sec- 
tion 2, section 3 outlines how the model is built, 
section 4 presents results and analysis and sec- 
tion 5 concludes the paper. 


2 COMMUNITY PHARMACY 
DISPENSING PROCESS 


2.1 The main stages of dispensing


A standard community pharmacy dispensing process
is described in this section. The six key stages
of the community pharmacy dispensing process
are given in Figure 1 (Langley & Belcher, 2009;
NPSA, 2007; Waterfield, 2008).

To begin with, prescriptions must be received
by a member of staff as and when patients bring
them into the pharmacy. Prescriptions are then
legally and clinically checked, to ensure that the
prescription is clinically appropriate before
continuing. After being received, the
prescriptions’ labels are generated. The labels
include key information about the medicine. The
next stage of the process is bringing the
constituent parts of the prescription together to
create the final product. First, the set of items
included on the prescription is gathered together
from the pharmacy stock. After this, an
intermediate accuracy check is recommended, before
applying the labels to medicines. After the
prescription is fully assembled, it is passed on to
either a pharmacist or an ACT (Accredited Checking
Technician) to perform a final accuracy check on
the prescription. The final accuracy check is the
last opportunity for a pharmacy to intervene if a
prescription has been dispensed incorrectly at some
point in the process. The accuracy check involves
making sure that the prescription being provided by
the pharmacy exactly matches what has been written
on the prescription form. This includes checking
that the labels, items, doses, quantities and form
of medication are all correct before handing the
prescription out. Any mistakes that go unnoticed at
the final accuracy check are likely to reach
patients. Each stage of the process can be
completed by a single member of staff, although
only pharmacists or ACTs are qualified to perform
the final accuracy check.


Figure 1. Dispensing process flow chart.


2.2 Resources 


A typical community pharmacy staff team consists 
of a group of pharmacists, ACTs and dispensers, 
but the number of staff varies between pharma- 
cies. Larger stores can have teams of up to 12 peo- 
ple, while the smallest independent store may be 
run by a single pharmacist. However, for a phar- 
macy to be allowed to dispense prescriptions, there 
must be a responsible pharmacist present during 
all hours of operation. 

The full list of resources used in the dispens- 
ing process is as follows: prescriptions, dispensers, 
pharmacists, medicines, labels, labelling stations 
and a private room. 


2.3 Non-dispensing tasks 


As well as completing dispensing tasks, there are
a number of non-dispensing tasks in pharmacies
that members of staff are required to complete
(Davies et al, 2014). These non-dispensing tasks
include stock management, patient counselling,
advanced pharmacy services, non-prescription
services, staff training, and general housekeeping.
Advanced services are a set of 6 services offered in
pharmacies, one example of which is the smoking
cessation service.

In this study, the set of non-dispensing tasks
to be completed by staff is limited to stock
management, advanced services and patient
counselling. Although not strictly a task, lunch
hours for dispensers are also included in the
model.


2.4 Failure modes


Dispensing correct prescriptions reliably and in a 
time that is convenient for customers are the two 
main goals of community pharmacies. Therefore, the 
dispensing process can be considered to fail if either: 


1. A prescription is incorrect when handed/deliv- 
ered to a patient. 

2. A prescription takes an extended amount of 
time to be dispensed, causing the patient to 
decide not to return in the future. 


Prescriptions can be incorrect in a number of
different ways. For example, the labels may
indicate to take too much or too little of the
medicine. This would be classified as a labelling
error. Another example is the inclusion of items
which are different to those prescribed. This
would be classified as a contents error, and it can
be due to wrong dose, wrong volume, or being a
completely different medicine. Additionally, it may
be the case that the labels and items were
generated and picked correctly, but they are mixed
up when applying the labels; this is classified as
a label application error.

If one of the above errors makes it through
the final accuracy check and is handed out to a
patient, this is then classified as a dispensing
error. If, however, the error is spotted and
rectified at the final accuracy check, this is
classified as a near miss (Chua et al, 2003).


2.5 Definitions: Process reliability and efficiency 


Reliability of the dispensing process, R, is defined
in Equation (1) as:


R = \frac{p_{\mathrm{correct}}}{p_{\mathrm{total}}} \qquad (1)


where p_correct is the number of prescriptions
dispensed which are completely correct, and p_total
is the total number of prescriptions dispensed.
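
As a small worked illustration of this ratio, the sketch below evaluates R for hypothetical counts; the function name and the numbers are assumptions chosen only to mirror the 3.3% error rate quoted in the introduction, not results of the model.

```python
def reliability(p_correct, p_total):
    """Dispensing reliability R: completely correct prescriptions / total dispensed."""
    if p_total <= 0:
        raise ValueError("p_total must be positive")
    if not 0 <= p_correct <= p_total:
        raise ValueError("p_correct must lie between 0 and p_total")
    return p_correct / p_total


# Hypothetical counts: 967 of 1000 dispensed prescriptions completely correct.
r = reliability(967, 1000)
print(f"R = {r:.3f}")
```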

Process efficiency is commonly defined as the
ratio between an output gained and the level of
resources needed to maintain the process. Since the
cost of resources is not factored into this study, a
set of efficiency indicators is used instead. Two
examples of efficiency indicators are the total
number of prescriptions completed and the average
time to dispense walk-in prescriptions. Results for
all performance indicators can be found in Table 4.
The ideal outcome of the process in terms of
efficiency is a high number of prescriptions
completed quickly.


Figure 2. A CPN model for community pharmacy dispensing.


Table 1. Places.

Place                    Description                                                            Type
1                        Walk-in task generator.                                                e
2                        Customer at counter.                                                   e
3                        Delivery task generator.                                               e
4                        Staff receiving.                                                       w
5                        Prescriptions to be dispensed.                                         p, e
6, 10                    Labelling stations available.                                          e
7, 9                     Staff member choosing prescription.                                    w
8                        Staff available for primary tasks.                                     w
11, 12                   Staff member generating labels.                                        w, p
13, 15, 17, 19, 21, 23   These places are used to separate staff into parallel work streams.    w, p
14, 16, 18, 20, 22, 24   Staff are assembling, and applying labels to prescriptions.            w, p
25                       Prescriptions waiting for secondary dispensing tasks.                  p, e
26                       Pharmacists available to complete secondary tasks.                     w
27, 30, 33               Pharmacist allocated to complete secondary tasks for a prescription.   w
28, 31, 34               Pharmacist is checking a prescription.                                 w, p
29, 32, 35               Pharmacist is handing out/storing for delivery.                        w, p
36                       All completed prescriptions.                                           p
37                       Advanced service being completed.                                      w
38                       Advanced service waiting.                                              e
39                       Advanced service task generator.                                       e
40                       Stocking task generator.                                               e
41                       Stocking waiting.                                                      e
42                       Stocking task being completed.                                         w
43                       Dispenser lunch break generator.                                       e
44                       Lunch break ready to be taken.                                         e
45                       A dispenser is on their lunch break.                                   w


Table 2. Transitions.

Transition    Description                                   (Y/N)*
1             Walk in generation: Exp(0.0033)               N
2             Receive a prescription: Uni(30, 60)           N
3             Move staff to counter: Det(0)                 N
4             Delivery generation: Det(6000)                N
5, 6          Staff choose prescription: Uni(5, 10)         N
7, 8          Allocate a staff member: Det(ε)               N
9, 10         Label generation: Det(15)                     Y
11-21         Spreaders: Det(ε)                             N
22-27         Filling & label application: N(50, 10)        Y
28, 32, 36    Pharmacist allocation: Det(ε)                 N
29, 33, 37    Choose prescription: U(10, 15)                N
30, 34, 38    Final accuracy check: Uni(5, 10)              Y
31, 35, 39    Hand out and counsel: Exp(0.025)              N
31, 35, 39    Store for delivery: Exp(0.05)
40            Allocate to advanced service: Det(ε)          N
41            Complete advanced service: Uni(300, 600)      N
42            Advanced service generator: Exp(0.00006)      N
43            Move pharmacist primary: Det(10)              N
44            Stocking task generator: Det(6600)            N
45            Allocate to stocking: Det(ε)                  N
46            Finish stocking: Uni(300, 900)                N
47            Begin triggering of lunch break: Det(7200)    N
48            Allocate dispenser to lunch: Det(ε)           N
49            Dispenser finished lunch: Det(3600)           N

*This column designates transitions as processors.


3 MODELLING APPROACH 


3.1 Overview 

This section of the paper presents the development
of a Coloured Petri Net (CPN) for modelling the
dispensing process. The dispensing process being
modelled in this study is that of a manual
dispensing pharmacy, as opposed to automated
dispensing. Figure 2 shows the CPN model of a
community pharmacy. Overall, the model is built
according to the process flow, considering
resources and errors. Model outputs are obtained
by simulating the CPN model.


3.2 Places and transitions 


Table 1 shows the description of each place and 
the type of token that may occupy the places. 
Note that the net uses three token types: e (basic), 
w (staff), and p (prescriptions). 


Overall, some places are used to keep track of 
resources, and others are used as task generators, 
controlling when new tasks arrive. 

Table 2 shows the description and distribution
of each transition. Note that Det(x) stands for a
deterministic delay. Some transitions directly
represent the community pharmacy dispensing tasks
seen in Figure 1. Other transitions are purely used
to move tokens around the net. The types of
distributions and their parameter values have been
assumed in this paper.

In Table 2 each transition is also designated 
as either a ‘processor’ transition, or not. A proc- 
essor transition represents a task that is affected 
by the number of items in the prescription. For 
example, the transition, modelling generating 
labels, is a processor transition, since it will take 
longer to generate labels for a large prescription. 


3.3 Model assumptions


Tasks in the model are separated into primary and
secondary tasks, where primary tasks may be
completed by all staff, whereas secondary tasks may
only be completed by pharmacists. In addition, a
number of assumptions about staff behaviour and
pharmacy specification are made. Below is a list
of modelling assumptions about how staff behave.


• Staff complete tasks in an identical way, i.e. the
same probability distributions are used to
determine how long tasks take, and to generate
error probabilities for different staff.

• Dispensers may only complete primary tasks,
and pharmacists prioritise secondary tasks.
Pharmacists are able to move to primary tasks
if they are idle.

• Once primary work is begun on a prescription,
the same member of staff continues working on
it until the primary tasks are finished.

• Upon a customer arriving with a walk-in, the
first member of staff to become available for
primary tasks goes to serve them.

• Dispensers have a lunch hour. It is assumed that
pharmacists fit their lunch in during moments
when they are not working.


Below are assumptions about the labelling sta- 
tions, pharmacy opening hours, and prescriptions. 


• The pharmacy is open from 9 am to 5 pm.

• Walk-in prescriptions are prioritised over deliveries.
Within the same type, prescriptions are handled in
first come, first served order. The times between
walk-in arrivals follow an Exponential distribution,
as shown in Table 2.

• Delivery prescriptions arrive at the pharmacy in
a single large bulk, at 10 am, 1 hour after the
pharmacy opens.

• The pharmacy has 2 labelling stations capable
of generating labels for prescriptions.

• Walk-ins taking longer than 15 minutes to be
dispensed are classed as delayed.
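
The queueing assumptions above can be illustrated with a short Python sketch. It is not the CPN model itself: the function names are hypothetical, and the mean inter-arrival time passed to `walk_in_arrivals` is a placeholder, since the actual parameter is the one listed in Table 2.

```python
import random

WALK_IN, DELIVERY = 0, 1     # lower value = served first
CLOSE_MIN = 480.0            # 9 am to 5 pm, in minutes after opening
DELAY_LIMIT_MIN = 15.0       # walk-ins slower than this count as delayed


def walk_in_arrivals(mean_gap_min, rng):
    """Arrival times (minutes after opening) with exponential increments."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(1.0 / mean_gap_min)
        if t >= CLOSE_MIN:
            return times
        times.append(t)


def next_prescription(waiting):
    """Among waiting jobs: walk-ins first, then first come, first served."""
    return min(waiting, key=lambda p: (p["kind"], p["arrival"]))


def is_delayed(prescription, finish_min):
    """Only walk-ins are classed as delayed, using the 15 minute limit."""
    waited = finish_min - prescription["arrival"]
    return prescription["kind"] == WALK_IN and waited > DELAY_LIMIT_MIN
```

For example, with a delivery that arrived at minute 60 and walk-ins that arrived at minutes 75 and 70 all waiting, `next_prescription` returns the walk-in from minute 70.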


3.4 Prescription modelling 


In the CPN model, prescription tokens each have 8 
colour fields which represent: 


1. Delivery or walk-in

2. The number of items

3. Time taken to dispense

4. Number of iterations to complete

5. The overall outcome

6. Label error

7. Content error

8. Label application error


Table 3. Error probabilities.

Task                   Error probability
Labelling              0.06
Filling                0.05
Label application      0.03
Final accuracy check   0.05


In particular the number of iterations to com- 
plete is determined by how many times a phar- 
macist has had to send the prescription to be 
corrected after a final accuracy check. The overall 
outcome is one of 3 outcomes: completely correct, 
near miss, or dispensing error. The last 3 colours, 
labels, contents and label application, are Boolean 
variables, which indicate whether an error of each 
type is contained within the prescription. 

Upon arrival, every prescription is allocated 
a random number of items by sampling from a 
Geometric (0.35) random variable (mean = 2.86). 
This was chosen using two assumptions. Firstly, 
patients with a prescription will have at least 1 
item on the prescription. Secondly, prescriptions 
with more items are increasingly less likely to 
occur than those with fewer. This number of items 
is then used to determine how long the processor 
transitions, designated in Table 2, take to fire. For 
example, a prescription containing 5 items will 
use the sum of 5 samples from the distribution 
that describes the duration of label generation. 
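
As a minimal sketch of this sampling scheme, the Python below draws the number of items from a Geometric(0.35) variable on {1, 2, ...} and sums one task-time sample per item. The function names are hypothetical, and the exponential task time with a 30 second mean is only a placeholder for the distributions listed in Table 2.

```python
import random

P_ITEM = 0.35  # Geometric parameter quoted in the text


def sample_items(rng):
    """Geometric(p) on {1, 2, ...}: trials up to and including the first success."""
    n = 1
    while rng.random() >= P_ITEM:
        n += 1
    return n


def processor_duration(n_items, sample_task_time):
    """A processor transition takes the sum of one sample per item."""
    return sum(sample_task_time() for _ in range(n_items))


rng = random.Random(1)
mean_items = 1.0 / P_ITEM  # 2.857..., the 2.86 quoted in the text
# Placeholder task-time distribution (assumption, not from Table 2).
label_time_5_items = processor_duration(5, lambda: rng.expovariate(1.0 / 30.0))
```

Defining the Geometric variable on {1, 2, ...} matches the first assumption (at least 1 item), and its decreasing probability mass matches the second.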


3.5 Failures 


Failures are modelled using Bernoulli random var- 
iables. At three points of the process, label genera- 
tion, prescription assembly and label application, 
an error can occur. The error probabilities were 
taken from Cohen et al. (2012) and are
shown in Table 3. 

The outcome of the final accuracy check 
depends on the state of the prescription being 
checked. It is assumed that prescriptions that are 
correct will always pass through the check. If there 
is an error present in the prescription, the phar- 
macist will spot it with probability 0.95, otherwise 
they will fail to spot it with probability 0.05. 
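
The combined effect of these probabilities can be worked through in a few lines of Python. This is a back-of-the-envelope sketch rather than the CPN model: it assumes that the three error types occur independently and that a prescription sent back after a failed check is re-dispensed from scratch with the same error probabilities.

```python
# Error probabilities from Table 3.
P_LABEL, P_FILL, P_APPLY = 0.06, 0.05, 0.03
P_MISS = 0.05  # final accuracy check fails to spot an existing error

# Probability that one dispensing pass introduces at least one error.
p_err = 1 - (1 - P_LABEL) * (1 - P_FILL) * (1 - P_APPLY)

# A pass with an error is either caught (sent back) or missed (handed out).
p_caught = p_err * (1 - P_MISS)
p_missed = p_err * P_MISS

# Geometric series over repeated passes: sum over k of p_caught**k * p_missed.
p_dispensing_error = p_missed / (1 - p_caught)
reliability = 1 - p_dispensing_error

print(f"P(error in one pass) = {p_err:.4f}")
print(f"reliability          = {reliability:.4f}")
```

Under these assumptions the reliability comes out at about 0.992, and the result does not depend on staffing, since no staff-related quantity enters the calculation.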


4 PHARMACY SIMULATION SCENARIOS 
AND THEIR ANALYSIS 


4.1 Scenario specification 


This paper uses three pharmacy scenarios to demon- 
strate the ability to evaluate performance using the 
CPN model. These three scenarios have been chosen 
to demonstrate the impacts, or efficiency improve- 
ments, of adding an additional staff member. 


a. Scenario 1
Staff: 1 pharmacist, 2 dispensers.
Failures: Chance of failure in labelling, filling,
label application and final accuracy check stages.
Advanced services: Included.
Stocking: Pharmacist must do 4 stints of stock
management, each period lasting 5-15 mins.
Lunch hours: 1 hour for each dispenser, taken
sequentially (only 1 dispenser may be off at a
time).

b. Scenario 2

Same as scenario 1, but with 1 pharmacist and
3 dispensers.

c. Scenario 3

Same as scenario 1, but with 2 pharmacists and
2 dispensers.


4.2 Results and analysis 


A 9-5 day of pharmacy operation was simulated a 
total of 6000 times for each scenario. A test for con- 
vergence was conducted to find whether 6000 was a 
large enough number to reach convergence. A fur- 
ther 1000 simulations were carried out for each sce- 
nario, then the indicator values for the set of 7000 
simulations were compared to the values calculated 
for 6000 simulations. Every field was the same 
between the two sets of data to 2 significant figures. 
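
The convergence test described above can be sketched as a comparison of indicator values after rounding to 2 significant figures. The helper names are hypothetical and the indicator dictionaries stand in for whatever the simulation produces.

```python
from math import floor, log10


def round_sig(x, sig=2):
    """Round x to the given number of significant figures."""
    if x == 0:
        return 0.0
    return round(x, sig - 1 - floor(log10(abs(x))))


def converged(indicators_a, indicators_b, sig=2):
    """True if every indicator agrees between the two runs to sig figures."""
    return all(
        round_sig(indicators_a[name], sig) == round_sig(indicators_b[name], sig)
        for name in indicators_a
    )
```

Here `indicators_a` would hold the values computed from 6000 simulations and `indicators_b` those from 7000, keyed by indicator name.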

Results of key performance indicators for each 
scenario are shown in Table 4. 

Since walk-in (WI) prescriptions are given prior- 
ity over delivery prescriptions, walk-in prescriptions 
get completed first, but a smaller pharmacy which 
takes longer to dispense prescriptions is unable to 
complete all their deliveries. This can be seen in sce- 
nario 1, where 39 of the 150 delivery prescriptions 
are unfinished. In both scenarios 2 and 3, having 
an additional staff member of either type (phar- 
macist or dispenser) improved the efficiency of the 
pharmacy sufficiently so that on average almost 
all the deliveries were being completed. This sug- 
gests that the pharmacy may be able to complete 
a larger number of delivery prescriptions when 
employing 4 staff. The average time to dispense was 
also improved by more staff in scenarios 2 and 3. 
A large decrease (of 217s) in the average time to 
dispense walk-ins was seen when introducing an 
extra pharmacist in scenario 3. A smaller decrease 
(of only 75s) was gained by introducing an extra 
dispenser to the pharmacy team in scenario 2. 

Table 4. Simulation results.

          Efficiency                                                           Reliability
          Deliveries  Total      Advanced services           WI dispense time         Near    Dispensing
Scenario  completed   completed  completed          Delayed  mean (sec)        R      misses  errors
1         111.1       211        1.8                25.0     711               0.992  29.7    1.6
2         149.4       250        1.8                         636               0.992  32.8    1.9
3         149.5       250        1.8                         494               0.992  32.9    1.9

Previous studies have reported near miss rates
of between 0.024% (Knudsen et al, 2007) and
1.84% (Sanchez, 2013), and dispensing error rates
of between 0.014% (Knudsen et al, 2007) and 3.3%
(Franklin & O’Grady, 2007). There are many more
near misses occurring during the simulations than
have been seen in previous studies of errors, i.e. all
3 scenarios had near misses occurring in over 10%
of all prescriptions being dispensed. This may be
due to underreporting of near misses in studies
based on self-reporting, or to the final accuracy
check failure probability being set too low in the
model. The dispensing error rate produced by the
simulations fell within the reported range.

These simulations suggest that the simulated
dispensing process has good reliability. The
reliabilities for scenarios 1, 2 and 3 were as follows:
R_1 = 0.992, R_2 = 0.992, R_3 = 0.992. The reliability
is the same for all three scenarios because the
error rates do not depend on the type of staff
or the pharmacy set-up.


4.2.1 Distribution of time to dispense

Figure 3 shows how the distribution of the time
to dispense walk-ins depends on the scenario. The
durations of 600,000 walk-in prescriptions were
used for comparison, i.e. around 100 walk-in
prescriptions from each of the 6,000 simulations. It
can be seen in Figure 3 that all scenarios have a
similar underlying distribution. However, the skewness
decreases with each additional member of staff. A
larger decrease in skew is seen when an additional
pharmacist is added. Note that the dashed vertical
line represents the 15 min dispensing time limit.

Figure 3. Distributions of the time taken to dispense walk-in prescriptions.


4.2.2 Causes of delays

A prescription could be delayed for one of
many reasons, such as prescriptions containing
more items taking longer to dispense, a large
number of walk-ins already being processed or
waiting in the queue when a patient arrives,
members of staff being busy with non-dispensing
activities, or a near miss that has been picked up
at the final accuracy check.

Table 5 shows that, within a single scenario, the
average size of prescriptions and the average number
of iterations required to complete them both increase
for increasingly delayed prescriptions. This appears to
confirm that prescriptions which contain more items,
or which need to be dispensed multiple times, are
more likely to be delayed.


Table 5. Causes of delays.

                           t<15     15<t<20  20<t<25  25<t<30  30<t<35  35<t<40  40<t
Scenario                   mins     mins     mins     mins     mins     mins     mins

1         % of total       75.08    12.64    6.15     3.01     1.49     0.754    0.884
          Avg items        2.37     3.83     4.20     4.49     4.76     5.02     5.79
          Avg iterations   0.074    0.226    0.349    0.498    0.673    0.838    1.24

2         % of total       80.66    10.40    4.72     2.17     1.03     0.495    0.533
          Avg items        2.40     4.10     4.52     4.88     5.25     5.70     6.70
          Avg iterations   0.0803   0.276    0.425    0.606    0.796    0.997    1.33

3         % of total       91.74    5.27     1.77     0.66     0.295    0.141    0.122
          Avg items        2.50     5.70     6.46     6.57     7.33     8.21     9.51
          Avg iterations   0.104    0.485    0.765    1.027    1.232    1.37     1.73


Comparing scenarios, it can be seen that
scenarios 2 and 3 offer an improvement in the
number of walk-in prescriptions being completed
on time (t < 15 mins in Table 5). Relative to
scenario 1, scenario 3 increased the percentage of
prescriptions being completed on time by about 16
percentage points, while scenario 2 managed an
increase of only about 5.5 percentage points.


5 CONCLUSION 


In conclusion, this paper has demonstrated the 
use of CPNs as an effective tool for modelling the 
community pharmacy dispensing process. CPN 
is a suitable tool to evaluate efficiency and safety 
in one model. Pharmacy dispensing complexity is 
captured through: the inclusion of all major dis- 
pensing stages, their duration, and a variety of 
staff roles, errors and remedial action. Adding a 
pharmacist improved the pharmacy efficiency 
more than adding a dispenser. Dispensing errors 
are within the range reported in the literature, 
whereas near misses are overestimated. 


Process reliability remained constant in all sce- 
narios. By assigning staff wage costs to scenarios, 
this model could support decisions related to the 
cost-benefit of employing an extra staff member.

Future work will focus on optimizing a phar- 
macy dispensing process. This would involve find- 
ing the optimal choice of how many staff should 
work in the pharmacy, given the working conditions
and cost of staff wages. Metaheuristics, such as
genetic or ant colony optimisation algorithms, are
promising methodologies for this purpose. In
addition, in-field data collection would be carried 
out, and ethical approval has been granted by the 
University of Nottingham. Other routes for future 
research could include constructing an alternative 
model capable of comparing the performance of 
automated and manual dispensing pharmacies. 
Future iterations of the model could be designed 
to include the dependency between the overall 
state of the pharmacy, and staff error rates. For 
example, if a pharmacy is busy, with many patients 
waiting for walk-ins to be dispensed, this could put
pressure on staff, who may then be more likely
to make errors. Another possible improvement to
the model could be to consider how errors of each 
type, labelling, contents or label application, can 
actually occur in each item in a prescription.


ABSTRACT: Advances in Information and Communication Technologies (ICT), pervasive computing
and Artificial Intelligence (AI) are affecting all domains of daily human life in multiple ways. Safety-critical
transport systems are being massively affected by this technological shift. Autonomous Vehicles (AVs)
are considered the most promising new element in this future transport paradigm. AVs are expected
to bring relevant benefits to society, mainly by reducing accidents and increasing accessibility.
Given that any new safety-critical system, concept or technology is placed in operation only if its
benefits outweigh the safety risks, assuring that these future transport systems will be safe during their
operation is mandatory. This paper proposes a safety analysis Framework to be applied to the future
intelligent Roadway Transport Systems (RTS) paradigm. Based on combined fast-time and real-time simulation
approaches, the Framework allows analyzing, in a broad sense, the impacts of concepts, technologies and
procedures on RTS safety and efficiency. A Proof-of-Concept (PoC) is provided, where representative
RTS scenarios are modeled, simulated and analyzed using OpenDS, Matlab and SUMO + Veins + OMNeT++.
The RTS scenario elements (environment, vehicles, roadways, and driver) interact with each other.
Autonomous and ‘non-autonomous’ vehicles share the RTS infrastructure, and an AV is modeled by combining
the pair of RTS elements “driver + vehicle”, with the control algorithms embedded in the autonomous driver
element to manage AV movement. In this PoC, both the autonomous vehicle behavior and the impact of its
embedded autonomous control algorithms on RTS safety and efficiency are analyzed. In conclusion,
the Framework is capable of dealing with representative future RTS scenarios, mainly regarding AVs,
analyzing the impact of concepts at component level (e.g. AI-based algorithms) on system-level
properties (e.g. safety) and the relationship among properties at various levels of the RTS.


1 INTRODUCTION


Advances in Information and Communication
Technologies (ICT) and the pervasive use of computing
and Artificial Intelligence (AI) are affecting all
domains of daily human life, including those
that could lead to deaths and to health, environmental
and property damage. One of these safety-critical
domains, transport systems, is being directly
and massively affected by this new wave of
technological change. Autonomous Vehicles (AVs), also
called self-driven or robot cars, could be considered
the most notable new element in this future
paradigm of Intelligent Transport Systems (ITS)
(PARLIAMENT & UNION 2010). Regarding
Road Transport Systems (RTS), it is expected
that AVs will bring significant improvements in
traffic safety and efficiency for all modes of transport.
For example, they are expected to significantly
reduce the number of traffic accidents when the need
for a human driver is eliminated, or even reduced,
given that more than 90% of accidents
are attributed to human error (Singh 2015). Besides,
AVs would increase access to individual
transport for people with disabilities, elderly people
and children, and would even allow vehicles to run
without people on board (e.g. delivery or rental
fleet reallocation).

As with any new safety-critical system, concept
or technology, the AV will be incorporated into
daily human life only if its benefits outweigh the
safety risks. Therefore, it is mandatory to assure
that future safety-critical AVs, as well as
the whole transport system, will be safe
enough during their operation on public roads in
the real-world environment. Given that information
and communication systems and networks (i.e.
ICT) and Machine Intelligence (MI) are key enablers
of the future ITS, mainly of autonomous
vehicles, it is necessary to evaluate how ICT, MI
and other technologies and concepts will impact
on (and contribute to) autonomous transport
safety risks and, consequently, how to apply them
in an innovative, cost-effective and safe way.

The U.S. National Highway Traffic Safety
Administration (NHTSA) released guidance to
help developers of Automated Driving Systems
(ADS), AVs included, to “analyze, identify, and
resolve safety considerations prior to deployment
using their own, industry, and other best practices”
(NHTSA 2017). It recommends a combination
of simulation, test track and on-road testing to
validate ADS safety performance. However,
current AV safety evaluation and validation are
supported by track and on-road testing. Virtual
testing approaches are a recent field, still in
development, that is of interest to manufacturers
(Kim et al. 2017).

Tests are useful to improve confidence in
the system. However, it is impossible to cover all
possible system conditions (behaviors) using
tests. Kalra and Paddock (2016) show that ‘fully
autonomous vehicles would have to be driven hundreds
of millions of miles and sometimes hundreds
of billions of miles to demonstrate their reliability in
terms of fatalities and injuries’, which is infeasible
in terms of cost and other available resources.
They therefore recommended that innovative,
alternative methods of demonstrating safety and
reliability, such as “virtual testing and simulation”,
be developed to supplement real-world testing
in order to assess autonomous vehicle safety.

Thus, there is a need to develop new virtual testing 
and simulation approaches, mainly using acceler- 
ated time scales, in order to analyze the safety levels 
of AVs on a Roadway Transport System (RTS), as 
well as the safety levels of the RTS as a whole. This 
paper proposes a simulation-based safety analysis 
Framework to be applied to the RTS in this future 
ITS paradigm. Based on combined fast-time and
real-time computer simulation approaches, this
Framework allows analyzing the impacts of con- 
cepts, technologies and procedures on the future 
RTS safety and efficiency, in a broad sense. 

In the proposed Framework, a specific RTS in
predefined traffic scenarios (a ‘Use Case’) is
modeled, simulated and analyzed. A Use Case
represents a complete road traffic scenario under
analysis, including the RTS architecture and
specification, the traffic configuration (vehicle
trajectories and the behavior of other traffic
elements) and the procedures to be applied during
the analysis (simulations). By means of a Use Case,
it is possible to analyze the influence of any RTS
element inserted in a specific traffic configuration
on the system safety metrics (e.g. collision rates)
and on the way hazardous situations evolve (system
behavior). In particular, it is possible to analyze
how AVs impact the safety and efficiency of the
whole transport system they belong to, and to
identify hazardous situations that emerge from the
behavior of AVs in transport systems.

In addition to proposing and presenting the 
safety analysis Framework, this paper provides a 
Proof-of-Concept (PoC), where a representative 
Use Case is modeled, simulated and analyzed. The 
RTS elements interact with each other, and auton- 
omous and ‘non-autonomous’ vehicles are sharing 
the same RTS infrastructure and running in the 
predefined traffic scenarios. Further, the validation 
of the Framework is presented using an AI-based 
autonomous control approach (Naufal et al. 2017) 
that manages risks related to the operations of AVs 
in a simulated RTS scenario. 

This paper is structured in 5 sections. Section 2 
presents details about the development of the 
proposed simulation-based safety analysis Frame- 
work. Section 3 provides a Framework PoC, in 
which a representative RTS scenario (Use Case) is 
implemented (modeled) and simulated by real-time 
and fast-time Open Source Simulation Tools. Sec- 
tion 4 presents the results obtained from the PoC 
and used to validate the proposed Framework. 
Finally, Section 5 presents conclusions about this 
work and some future directions. 


2 THE SAFETY ANALYSIS FRAMEWORK 
FOR THE ROADWAY TRANSPORT 
SYSTEMS 


Our proposed simulation-based safety analysis
Framework is intended to be applied to the future,
intelligent RTS paradigm. It is based on the
theoretical and conceptual bases of safety and
resilience for Automotive (road and vehicle) Cyber
Physical Systems (CPS) in the RTS domain. The Automotive CPS
(ACPS) concept is a fusion among RTS, CPS and 
ITS concepts, as detailed in section II and II of 
Naufal et al. (2017). These theoretical and conceptual
bases were used to understand the future, intelligent
road transport system context into which AVs will
be inserted. Consequently, they were used to identify
the elements and characteristics required to
execute a simulation-based safety analysis of an RTS.

The developed Framework is composed of
five modules, as illustrated in Figure 1:


1. the “ACPS Elements” module, a repository
(class library) containing the classes of ACPS-based
elements used to model an ACPS-based
Roadway Transport System (RTS);

2. the “Use Case elaboration” module, in which
Framework users design a specific Use Case using
the classes and attributes available in the ACPS
Elements module. The ACPS-based RTS architecture,
specification and safety analysis procedures (safety
metrics and parameters) are then submitted to the
simulation-based safety analysis approach;

3. the “Fast-Time Simulation (FTS)” module, where
the Use Case is instantiated and simulated,
following the safety analysis procedures, using
the open source fast-time computer-based
simulation tools available in this Framework;

4. the “Real-Time Simulation (ReTS)” module, where
the Use Case is instantiated and simulated,
following the safety analysis procedures, using
the open source real-time computer-based
simulation tools available in this Framework;

5. the “RESULTS” module, in which the simulation
reports with the simulation results are analyzed.


Figure 1. The simulation-based safety analysis
framework.


2.1 ACPS elements 


Based on these ACPS conceptual bases, four classes
of elements were identified: Roadway (the physical
place where vehicles run), Vehicle (the central
traffic element, moving people and goods), Driver
(manages the Vehicle movement according to the
traffic rules and vehicle model) and Environment
(weather conditions, e.g. rain, fog, snow, and
obstacles other than Vehicles, e.g. pedestrians,
animals, roadway blockages). Table 1 presents these
ACPS elements (classes) and their characteristics
(attributes). These are the safety-related (physical)
elements of a transport system, and together they
represent a real RTS in the ITS context (US-DoT 2017).
Regarding the ACPS Elements, the
‘DRIVER’ class has no defined characteristics
(attributes) because the DRIVER, in this Framework,
only executes actions (methods). Roadway
and Environment were included in the same class
of elements (ROADWAY). Figure 2 illustrates the
interactions among the ACPS elements used to model
the Use Cases in the proposed Framework. Unlike
a typical RTS, in which Drivers get information
directly from the Environment and Roadway
(by observing them), DRIVERs monitor the Environment
and Roadway using the sensors and other
technologies embedded in the VEHICLE that they
are driving. This enables modeling future highly
automated and fully autonomous vehicles, given
that they will take driving decisions using data
that they themselves obtain from the Environment.


Table 1. RTS main elements and characteristics.

ACPS Elements (Classes)   Characteristics (Attributes)

ROADWAY                   Interdicted area
(Environment included)    Direction of the road
                          Obstacles on the road
                          Physical characteristics of the road
                          Signs
                          Weather conditions

VEHICLE                   Turn right command
                          Turn left command
                          Brake command
                          Speed up command
                          Distance from detected Vehicles/
                          Obstacles on the Road (DVOTR)
                          Vehicle size
                          Maximum braking rate
                          Maximum vehicle speed
                          Maximum acceleration rate
                          Mechanical vehicle system status
                          Steering wheel status
                          Accelerator pedal status
                          Brake pedal status
                          Vehicle Speed status
                          Direction of the vehicle

DRIVER                    (no attributes)


Figure 2. Relationship among ACPS elements.

In the proposed Framework, these ACPS ele- 
ments embody a repository (a class library) con- 
taining classes of ACPS-based elements (elements, 
attributes, methods and interfaces). These elements 
represent an ACPS-based RTS and, consequently, 
this repository is used by Framework users to 
model any specific ACPS-based road traffic sce- 
nario (Use Case) and to develop a simulation- 
based safety analysis. 
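
To make the idea of a class library concrete, the following Python sketch models the three ACPS classes with a handful of the attributes from Table 1. It is an illustration only: the attribute subset, units, default values and the braking rule in `Driver.act` are assumptions, and the actual Framework is built on the simulation tools described in Section 2.3.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Roadway:
    """ROADWAY class; Environment attributes are folded in, as in Table 1."""
    direction: str = "north"
    weather: str = "clear"
    obstacles: List[float] = field(default_factory=list)  # positions in metres


@dataclass
class Vehicle:
    position: float = 0.0          # metres along the road
    speed: float = 0.0             # m/s
    max_speed: float = 20.0        # m/s
    max_braking_rate: float = 6.0  # m/s^2
    brake_command: float = 0.0     # 0 = released, 1 = full braking

    def dvotr(self, roadway: Roadway) -> float:
        """Distance to the nearest obstacle ahead (the DVOTR attribute)."""
        ahead = [o - self.position for o in roadway.obstacles if o >= self.position]
        return min(ahead) if ahead else float("inf")


class Driver:
    """No attributes, only actions; it senses the road through the vehicle."""

    def act(self, vehicle: Vehicle, roadway: Roadway) -> None:
        # Toy rule (assumption): brake fully within 30 m of an obstacle.
        vehicle.brake_command = 1.0 if vehicle.dvotr(roadway) < 30.0 else 0.0
```

The Driver reads the distance through `Vehicle.dvotr` rather than inspecting the Roadway directly, which mirrors the relationship shown in Figure 2.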


2.2 Use case elaboration 


During the Use Case elaboration, the RTS architecture,
specification and safety analysis procedures
(safety metrics and parameters) are specified and
modeled. The Use Case (ACPS-based RTS model
and the analysis procedures) is instantiated and
simulated using real-time and fast-time approaches
by specific, open source simulation tools. Results
(Reports) are obtained from these tools (safety
analysis) and used to analyze the impacts of concepts,
technologies and procedures on RTS safety and
efficiency.

The Use Case shall contain the ACPS-based
RTS architecture and specification (i.e. the
configuration of the characteristics of the ACPS
elements Roadway, Vehicle and Driver, which represent
the specific use case scenario), as well as the safety
analysis procedures, i.e. the safety metrics and
parameters that will be analyzed in this scenario.
For an efficient safety analysis, the Use
Case should represent a road traffic scenario that
maximizes the occurrence of conflict points,
hazardous situations and accidents. Besides, it is
possible to analyze the influence of ACPS elements
and their characteristics (Use Case Specification)
on system safety metrics (e.g. collision rates) and
on the evolution of hazardous situations (behavior).


2.3 Fast-Time and Real-Time simulation modules


The Use Case is implemented (instantiated),
simulated and analyzed using Fast-Time and Real-Time
computer-based simulation approaches in
Framework modules 3 (FTS) and 4 (ReTS), using
two different sets of open source tools. In the FTS
(Framework module 3), Veins is used. It is an
event-based network simulator that allows simulating
vehicular communication, and integrates
with SUMO (Simulation of Urban MObility), a
road traffic simulator, and OMNeT++ (Objective
Modular Network Testbed in C++). In the ReTS
(Framework module 4), OpenDS™, a driving simulator
primarily intended for research and driving
training, is used together with Matlab™.

Molina et al. (2017) assessed the suitability of
these two open-source tool sets against the
modeling and simulation capabilities demanded by
the analysis Framework, and evaluated whether they
are capable of modeling and simulating, using
fast-time and real-time simulation approaches, the
characteristics of the Road Transport System
(indeed, ACPS) elements required by the Framework.
Besides, the tools’ best features were highlighted
and the process of adapting these tools for safety
analysis purposes was presented.

Figure 3 and Figure 4 illustrate the high-level
architecture of the FTS and ReTS modules,
respectively. Roadway, Vehicles and Driver are
represented in both FTS and ReTS. A functional
Vehicle (AV) is modeled by combining a pair of
“Driver + Vehicle” elements; the vehicle control
algorithms are embedded in the Driver element,
which manages the Vehicle’s movement. In ReTS,
Vehicles are represented by ‘Traffic Vehicles’ and
‘Autonomous Cars’.


Figure 3. FTS high-level architecture.

Figure 4. ReTS high-level architecture.

Using the FTS approach, it is possible to run
several simulations over a Use Case scenario, allowing
the (statistical) estimation of risk metrics, as well
as the identification of worst-case scenarios. On the
other hand, using the ReTS approach, system behavior
during specific scenario conditions (including failure
modes), especially the worst-case scenarios
identified by the fast-time simulation, can be
analyzed in detail.
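
The division of labour between the two modules can be sketched as follows. This is a schematic Python outline under stated assumptions: `run_fast_time` is a hypothetical stand-in that fakes one simulator run with random numbers, the 2 m collision threshold is invented, and no real Veins, SUMO or OpenDS call is shown.

```python
import random
from statistics import mean


def run_fast_time(seed):
    """Stand-in for one fast-time run; returns (collision flag, minimum gap in m)."""
    rng = random.Random(seed)
    min_gap = rng.uniform(0.0, 50.0)  # placeholder for simulator output
    return (1 if min_gap < 2.0 else 0), min_gap


def fast_time_campaign(seeds):
    """Estimate a risk metric over many runs and pick the worst case."""
    results = {seed: run_fast_time(seed) for seed in seeds}
    collision_rate = mean(flag for flag, _gap in results.values())
    worst_seed = min(results, key=lambda seed: results[seed][1])
    return collision_rate, worst_seed


collision_rate, worst_seed = fast_time_campaign(range(200))
# worst_seed identifies the scenario to replay in detail in real time.
```

The statistical estimate comes from the whole campaign, while the single worst run is the one handed over for detailed real-time analysis.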


2.4 Results 


Finally, ‘RESULTS’ (module 5) represents the 
safety analysis reports generated by the simulation 
tools. These reports contain the safety and efficiency 
metrics, as well as any other relevant information, 
obtained by simulating the Use Case scenarios 
according to the Safety Analysis Procedures. These 
reports may contain the status of the system elements 
when a collision (or any safety-related event) occurs, 
as well as safety risk metrics and other parameters 
specified by the analyst in the Use Case. Therefore, 
using information provided in these reports, it is pos- 
sible to systematically identify the potential causes of 
hazards or failure modes and identify how failures 
or defects in the system elements might contribute to 
hazards or accidents (e.g. collisions) — i.e. performing 
a Safety Analysis (MoD 2017).


3 PROOF-OF-CONCEPT (PoC)
FRAMEWORK IMPLEMENTATION 


This section presents a Proof-of-Concept (PoC) of
the proposed Framework. The PoC is guided by a
Use Case composed of a representative number of
ACPS Elements to effectively test and validate the
Framework. Besides, the Use Case presents a road
traffic scenario that maximizes the occurrence of
hazardous situations and accidents (‘Vehicle to
Anything Else’, or V2X, collisions). Thus, it is
possible to analyze both the influence of ACPS
elements (autonomous vehicles included) on RTS
safety and efficiency metrics (e.g. collision rates
and traffic flow and capacity) and the evolution of
hazardous situations (system behavior).


3.1 AI-based Autonomous Vehicle (AV)


The proposed Framework is used to analyze the
capabilities of an AI-based autonomous vehicle
(AV) control approach in managing safety risks
related to the operation of an AV in a simulated
ACPS-based RTS scenario. The analyzed AI-based AV
control approach was proposed by Naufal et al.
(2017) and named “A²CPS Engine”. It is a run-time,
software-based, vehicle-centric risk management
approach whose algorithm was developed using
Fuzzy Logic (an AI technique). The A²CPS Engine
algorithms monitor the environment and the AV
behavior and control AV actions, mitigating hazards
and reducing the severity of accidents related to
AV operation. Table 2 presents the risk management
rules performed by the analyzed AV Control
algorithms. Depending on the calculated Risk Level
and its Risk Criteria, a specific Treatment (action
to be executed by the AV) is performed.

The AV Control algorithms were embedded in
the autonomous vehicle DRIVER, monitoring and
controlling the autonomous VEHICLE. Figure 5
illustrates the AI-based AV Control algorithms
(based on the A²CPS Engine proposal) embedded in
the autonomous DRIVER and the I/O interface with
the Autonomous VEHICLE. In this implementation,
the AV Control algorithms receive the Speed of the
Autonomous Vehicle and the Distance from detected
Vehicles and Obstacles on the Road (DVOTR), and
control the brake intensity applied.


Table 2. Risk management rules performed by the AV
control.

                                              Action
Risk level   Risk criteria   Risk treatment   (brake intensity)
No risk      Negligible      Do nothing       No brake
Low          Undesirable     Speed reduction  Small
Moderate                                      Medium
High         Intolerable     Stop vehicle     High
Very high                                     Very high


Figure 5. AI-based AV control inputs/output interface.
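
The rules of Table 2 and the interface of Figure 5 can be written down as a small lookup. This Python sketch is a simplification, not the A²CPS Engine: blank cells in Table 2 are assumed to repeat the row above, and `risk_level` replaces the fuzzy inference with crisp, invented time-to-obstacle thresholds purely for illustration.

```python
# Table 2 as a lookup: risk level -> (risk criteria, treatment, brake intensity).
# Assumption: blank (merged) cells repeat the value from the row above.
RULES = {
    "no risk":   ("negligible",  "do nothing",      "no brake"),
    "low":       ("undesirable", "speed reduction", "small"),
    "moderate":  ("undesirable", "speed reduction", "medium"),
    "high":      ("intolerable", "stop vehicle",    "high"),
    "very high": ("intolerable", "stop vehicle",    "very high"),
}


def risk_level(speed_mps, dvotr_m):
    """Crisp stand-in for the fuzzy inference; all thresholds are invented."""
    if speed_mps <= 0.0:
        return "no risk"
    seconds_to_obstacle = dvotr_m / speed_mps
    for limit, level in ((2.0, "very high"), (4.0, "high"),
                         (6.0, "moderate"), (8.0, "low")):
        if seconds_to_obstacle < limit:
            return level
    return "no risk"


def av_control(speed_mps, dvotr_m):
    """Speed and DVOTR in, brake intensity out, as in Figure 5."""
    _criteria, _treatment, brake = RULES[risk_level(speed_mps, dvotr_m)]
    return brake
```

For instance, a vehicle travelling at 10 m/s with an obstacle 50 m ahead is 5 seconds away, which these placeholder thresholds class as moderate risk, so `av_control` returns "medium".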

The proposed safety analysis Framework is used
to analyze both (i) the autonomous vehicle
behavior in the road traffic scenarios and (ii) the
impact of the AI-based autonomous control algorithms
(autonomous vehicle behavior) on RTS safety
and efficiency levels. It is worth noting that this
analysis considers both safety and efficiency
because they are competing system characteristics in
transportation systems. Traffic efficiency (e.g.
traffic flow and capacity) is often impaired when
system safety is prioritized, so a tradeoff between
safety and efficiency (availability included) is required.
Consequently, a Use Case is designed to analyze: 


1. Both safety metrics (#collisions) and traffic
efficiency metrics (average speed and standard
deviation, E[v] and VAR[v]; average total distance
traveled). These metrics were obtained as a function
of the maximum vehicle speed (Vmax) and the Incident
Zone Warning (IZW) configuration. The IZW is a
main characteristic of the AI-based AV Control
approach (A2CPS Engine), representing a
dynamic zone of protection around the vehicle
(Naufal et al. 2017). These metrics make it
possible to analyze the influence of AV Control
on the autonomous vehicle behavior and how it
affects the global relationship between transport
system safety and efficiency (i.e., analysis objective
(ii)). Given the statistical nature of these metrics,
they were obtained by the Fast-Time Simulation
(FTS) approach (Framework module 3);

2. The autonomous vehicle Stopping Distance
(SD): the distance from a static obstacle at which
the autonomous vehicle stops. SD was obtained as a
function of the maximum vehicle speed (Vmax) and
the IZW configuration. This metric makes it
possible to analyze how the AV Control algorithms
affect the autonomous vehicle braking
behavior (i.e., analysis objective (i)). This
analysis was performed by the Real-Time Simulation
(ReTS) approach (Framework module 4).


3.2 Use case details 


The ACPS-based RTS high-level architecture considered
in the Use Case is illustrated in Figure 6.


Figure 7. Use case road traffic scenario (FTS).


The AI-based AV Control (A2CPS Engine) algorithms
are embedded in the Autonomous Driver,
managing the Autonomous Vehicle. The pair 
“Driver + Vehicle” models the Autonomous Vehi- 
cle (VEHICLE 1), which runs along the Roadway 
together with other Vehicles (2, 3, ...). 

The RTS traffic scenario considered in the FTS 
approach (Framework module 3) is illustrated in 
Figure 7. Four (4) Vehicles (V1 to V4) are travel- 
ling, in a closed traffic circuit, with pre-defined 
trajectories and speed profiles through a Roadway 
composed of 2 roads (T1 and T2) — a two-way road 
(T1.1 and T1.2) and a one-way road (T2). One of the 
four vehicles (V1) is an autonomous vehicle, and its 
(autonomous) Driver has the embedded AI-based 
AV control approach under analysis. The Road- 
way architecture, including each road direction, 
is represented by arrows. V1 to V4 trajectories are 
shown, where V1 (the AV) runs through all the roads 
(T1.1, T1.2 and T2) following the signed trajectory; 
V2 runs only on T1.2; V3 runs only on T2; and V4 
runs only on T1.1. In this road traffic scenario, the V1
trajectory exposes V1 itself and the other vehicles (V2 to
V4) to the occurrence of harmful events (collisions),
most of them at the crossing points. Consequently,
the occurrence of hazardous situations and accidents
is maximized, improving the efficiency of the safety
analysis.

The RTS traffic scenario considered in the ReTS 
approach (Framework module 4) is illustrated in 
Figure 8. The autonomous vehicle runs along a
straight trajectory at its maximum speed


Figure 8. Use case road traffic scenario (ReTS).


Figure 9. IZW fuzzy logic variable (x1 to x4 parameters).


(Vmax). A static obstacle is placed at the end of this
trajectory. During the simulation, the time to stop,
the speed and the distance from the autonomous
vehicle to the obstacle when the vehicle stops (the
stopping distance, SD) are measured for different IZW
configurations and Vmax values. Using these values, it
is possible to analyze whether the autonomous vehicle
keeps a safe and efficient distance from other vehicles.


3.3 RTS and FTS modules


The Use Case traffic scenarios, including the RTS
architecture and specification, and the AI-based
AV control algorithms are modeled (instantiated),
simulated and analyzed in the FTS and ReTS
modules of the Framework using its open-source tool
sets (Figure 3 and Figure 4, respectively). The Use
Case attributes configured for the RTS elements
were: VEHICLEs = {maximum acceleration = 3.3 m/s²;
maximum braking rate = 7.6 m/s²; maximum
speed = 120 km/h}; ROADWAY = {track geometry =
uniform, without spines}. Drivers managed
both longitudinal and lateral movement of their
vehicles. Vehicles 2, 3 and 4 (non-autonomous
vehicles) were driven by the native control embedded
in the FTS simulation tool. In FTS, the total
simulation time was 10,000 s.
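As a rough point of reference for the vehicle attributes above, the shortest possible braking distance under constant deceleration is v²/(2a). The sketch below evaluates it for the stated maximum braking rate at a few illustrative speeds up to the 120 km/h maximum. It is only a kinematic lower bound under the assumption of full braking from the first instant; the fuzzy controller modulates brake intensity and need not behave this way. The function names are hypothetical.

```python
MAX_BRAKING_RATE = 7.6  # m/s^2, maximum braking rate from the Use Case attributes


def kmh_to_ms(v_kmh: float) -> float:
    """Convert a speed from km/h to m/s."""
    return v_kmh / 3.6


def min_braking_distance(v_ms: float, decel: float = MAX_BRAKING_RATE) -> float:
    """Distance covered while braking at constant deceleration: v^2 / (2a)."""
    return v_ms ** 2 / (2.0 * decel)


if __name__ == "__main__":
    for v_kmh in (30, 60, 120):
        d = min_braking_distance(kmh_to_ms(v_kmh))
        print(f"{v_kmh:>3} km/h: at least {d:.1f} m to stop")
```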

In the FTS approach, three maximum speeds (Vmax)
were considered for the autonomous vehicle
(V1): Vmax = {15; 20; 30} m/s = {54; 72; 108} km/h.
Vehicles 2, 3 and 4 ran at a constant speed of 5 m/s
(18 km/h). In the ReTS approach, three maximum
speeds (Vmax) were considered for the autonomous
vehicle (V1): Vmax = {30; 60; 120} km/h. Both simulation
approaches (FTS and ReTS) considered the same
4 configurations of the IZW fuzzy logic variable
parameters (Figure 9): {{20; 60; 95; 180}; {20; 60;
150; 180}; {10; 30; 47.5; 90}; {5; 15; 23.75; 45}}.
The IZW_1 version ({20; 60; 95; 180}) was defined during
the model calibration and was the base configuration
for the other tests. The IZW_2 version ({20; 60;
150; 180}) changed only one of the parameters
and was used to analyze the impact of the IZW shape on
safety and efficiency. The IZW_3 ({10; 30; 47.5; 90}) and
IZW_4 ({5; 15; 23.75; 45}) configurations
were obtained by dividing IZW_1 by 2
and 4, respectively. Parameter ‘x4’ physically represents
the size of the protection zone (45 m, 90 m and
180 m).
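The configuration values quoted above can be checked with a few lines of arithmetic. The sketch below derives IZW_3 and IZW_4 from IZW_1 and converts the FTS speeds to km/h; the helper names are hypothetical and the tuples simply hold the x1 to x4 parameters in order.

```python
IZW_1 = (20.0, 60.0, 95.0, 180.0)   # x1..x4 of the base configuration
IZW_2 = (20.0, 60.0, 150.0, 180.0)  # same as IZW_1 except for x3


def scale_izw(izw, divisor):
    """Divide every IZW parameter by the same factor."""
    return tuple(x / divisor for x in izw)


def ms_to_kmh(v_ms):
    """Convert a speed from m/s to km/h."""
    return v_ms * 3.6


IZW_3 = scale_izw(IZW_1, 2)
IZW_4 = scale_izw(IZW_1, 4)

# Index of the single parameter in which IZW_2 differs from IZW_1.
changed = [i for i, (a, b) in enumerate(zip(IZW_1, IZW_2)) if a != b]

fts_vmax_kmh = [ms_to_kmh(v) for v in (15, 20, 30)]
```

The last element of each tuple is the protection zone size, so the three distinct zone sizes (180 m, 90 m and 45 m) follow directly from the scaling.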


4 RESULTS AND DISCUSSION 


The results obtained by the FTS approach are
presented in Table 3. For all four IZW configurations
(IZW_1 to IZW_4), three (3) autonomous
vehicle maximum speeds (Vmax) were considered.
The number of collisions, average speed, standard
deviation and total distance travelled by the AV
(during the simulation length of 10,000 s) were
obtained by fast-time simulation.

The results obtained by the ReTS approach are 
presented in Table 4. Just as it was adopted for the 


Table 3. FTS results (Metrics x Vmax x IZW).

                               Vmax (AV) [m/s]
Metrics                  15         20         30

IZW_1 = {20; 60; 95; 180}
#collisions              28         28         28
Average speed [m/s]      5.0        5.0        5.0
Standard deviation       3.1        3.1        3.1
Total distance [m]       50,050     50,049     50,049

IZW_2 = {20; 60; 150; 180}
#collisions              30         30         30
Average speed [m/s]      5.0        5.0        5.0
Standard deviation       2.5        2.5        2.5
Total distance [m]       50,072     50,072     50,072

IZW_3 = {10; 30; 47.5; 90}
#collisions              24         24         24
Average speed [m/s]      5.0        5.0        5.0
Standard deviation       2.3        2.3        2.3
Total distance [m]       50,115     50,115     50,115

IZW_4 = {5; 15; 23.75; 45}
#collisions              184        195        315
Average speed [m/s]      10.0       11.5       15.9
Standard deviation       4.4        6.1        8.4
Total distance [m]       99,855     114,398    142,327

Table 4. ReTS results (SD x Vmax x IZW).

                    Stopping Distance (SD) [m]
Vmax (AV) [km/h]    IZW_1     IZW_2     IZW_3     IZW_4
30                  10.00     10.10     1.9       collide
60                  7.47      9.92      collide   collide
120                 6.67      8.80      collide   collide


FTS approach, three (3) autonomous vehicle maximum
speeds (Vmax) were considered for each
of the 4 IZW configurations. The Stopping Distance
(SD) was obtained by real-time simulation.

Safety is measured by the number of collisions
observed during 10,000 seconds (about 3 h) of traffic
execution. Efficiency is measured by means
of average speed and standard-deviation, and 
total distance traveled during the simulation. The 
autonomous vehicle braking behavior is analyzed 
by means of the Stopping Distance (SD) metric 
(Figure 8). It is observed how these metrics change 
when Vmax and IZW configuration are modified. 

Considering the ReTS results (Table 4), it is possible
to analyze the capability of four AI-based
AV Control (A2CPS Engine) algorithm versions
(IZW_1 to IZW_4) in identifying a potential collision
condition, assessing the related level of risk
during runtime execution and stopping the autonomous
vehicle safely (before a collision) and efficiently
(as close as possible to the obstacle). The
IZW_4 version could not stop the AV safely at any
speed. Consequently, the AV does not behave safely
with this A2CPS Engine version. This is confirmed
by the FTS results (Table 3), in which IZW_4 has the
largest number of collisions among all
IZW versions. Therefore, it is possible to conclude
that the smallest tested IZW (protection zone no
larger than 45 m) cannot protect the AV against collisions.

The IZW_3 version produces an AV protection zone
twice as large as that of IZW_4. However, the IZW_3
version protects the AV against collisions only at low
speed (30 km/h), although with high efficiency (the
AI-based AV Control algorithms stop the vehicle close
to the obstacle, at about 2 m). IZW_1 and
IZW_2, with a protection zone twice as large as that of
IZW_3 and four times as large as that of IZW_4,
can protect the AV against collisions at any tested
speed (from 30 km/h to 120 km/h). This observation
is consistent with the FTS results (Table 3), in which
the speed has no influence on traffic safety (the
same number of collisions was observed for IZW_1
and for IZW_2 at every speed). The difference between
the IZW_1 and IZW_2 versions lies in
the braking efficiency: SD was between 6.67 m
and about 10 m (roughly five times less efficient
than IZW_3 at 30 km/h), and the worst
braking efficiency is observed in the IZW_2 version.
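The comparison above can be reproduced from the Table 4 values. The following sketch encodes the stopping distances (with `None` standing for a collision, an encoding chosen here for illustration) and derives which versions stop safely at every tested speed, as well as the efficiency ratio at 30 km/h. All names are hypothetical.

```python
# Stopping distances from Table 4 in metres; None marks a collision.
SD = {
    "IZW_1": {30: 10.00, 60: 7.47, 120: 6.67},
    "IZW_2": {30: 10.10, 60: 9.92, 120: 8.80},
    "IZW_3": {30: 1.9, 60: None, 120: None},
    "IZW_4": {30: None, 60: None, 120: None},
}


def safe_speeds(config):
    """Speeds [km/h] at which the given version stopped before the obstacle."""
    return sorted(v for v, sd in SD[config].items() if sd is not None)


def always_safe():
    """Versions that avoided a collision at every tested speed."""
    return [c for c in SD if len(safe_speeds(c)) == len(SD[c])]


# How much farther from the obstacle IZW_1 stops than IZW_3 at 30 km/h.
ratio_30 = SD["IZW_1"][30] / SD["IZW_3"][30]
```

With these values, `ratio_30` is slightly above 5, which matches the "roughly five times" statement in the text.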

The internal shape of the linguistic variable IZW
(Figure 9) influences the SD (braking) efficiency,
but it has little influence on traffic safety (the
number of collisions is similar in both versions,
28 and 30 in Table 3) or on traffic efficiency (same
average speed and nearly the same total distance).

Finally, it is possible to observe that the best AI-based
AV Control (A2CPS Engine) algorithm
version in terms of traffic safety (lowest number
of collisions) and efficiency is IZW_3. However,
according to the ReTS results (Table 4), this
version protects the AV against collisions only at
low speed (30 km/h). The average speed
and standard deviation show that the AV travelled in this
scenario only at low speeds. Therefore, in this scenario,
this version is safe and efficient in terms of traffic
flow/productivity.


5 CONCLUDING REMARKS 


This paper proposed a simulation-based safety analysis
Framework to be applied to the future, intelligent
Roadway Transport Systems (RTS) paradigm.
Based on combined Fast-Time Simulation (FTS)
and Real-Time Simulation (ReTS) approaches,
implemented with the open-source tools OpenDS,
Matlab, SUMO, Veins and OMNeT++, this Framework
allows analyzing the impacts of concepts,
technologies and procedures on RTS safety and
efficiency. For demonstration and validation purposes,
a Framework Proof-of-Concept (PoC) was
provided. It was applied to analyze the capabilities
of an AI-based autonomous vehicle control
approach, the A2CPS Engine proposed by Naufal
et al. (2017), in managing safety risks related to the
operation of an autonomous vehicle in a simulated
RTS scenario.

The PoC was useful to demonstrate the Framework
capability of modeling and simulating real-world,
representative ACPS-based Road Transport
System scenarios, in which Vehicles, Roadway 
(environment included) and Driver elements are 
interacting with each other. It enables modeling 
roadway traffic scenarios where both autonomous 
(unmanned) and manned vehicles share the same 
roadway infrastructure. The autonomous vehi- 
cles are modeled combining a 2-tuple of ACPS 
elements “Driver + Vehicle”. The autonomous 
vehicle control algorithms, even those algorithms 
developed by a third-party (not by Framework 
user), are embedded in the autonomous Driver 
element and they manage the (autonomous) Vehicle
movement. Using these approaches, it is possible
to analyze the behavior of the autonomous
vehicle in traffic scenarios and evaluate
the impact of autonomous vehicles on RTS
safety properties.

Representing an Autonomous Vehicle by the
combination of an ‘off-the-shelf’, general purpose
Vehicle element whose behavior is managed by an
autonomous control algorithm embedded in the
Driver element is consistent with the state of the art
adopted by autonomous vehicle developers.
Today, AV developers are using current manned 
vehicles from the consolidated automotive indus- 
try worldwide and embedding in them systems, 
software and machine intelligence. Therefore, this 
Framework can be used efficiently to evaluate 


the Autonomous ‘Driver’, the part that makes a
vehicle ‘autonomous’, as well as its impacts on
system safety when operating in a simulated, but 
complete, RTS scenario. 

Using the proposed Framework, it is possible to
observe the impact of a new concept/technology
(component level) over several system parameters 
(in this case, traffic safety and efficiency) and, at 
the same time, understand and analyze the rela- 
tionship among several system properties (some of 
them intrinsic to systems elements, others intrinsic 
to the system). ‘Safety’ and efficiency are compet- 
ing metrics in transport systems, where the traf- 
fic efficiency (e.g., traffic capacity or flow rate) is 
often impaired when system safety is prioritized. 
Therefore, a trade-off between safety and efficiency
(availability included) is required for these
safety-critical systems.

In conclusion, this work has made progress in addressing
the question ‘how to assure that the future safety-critical
autonomous transport systems are going
to be safe enough during their operation?’. Relevant
ACPS-based RTS elements and characteristics to
be considered in a safety analysis were identified.
As future work, the authors are extending this
Framework by implementing both environment/system
changes during runtime simulation and collaborative
communication among system elements
(V2X). This Framework evolution
will allow resilience analysis of ACPS-based
Road Transport Systems, advancing the development
of new approaches to safety assurance.


ABSTRACT: Effective coordination of Distributed Energy Resources (DERs) in power systems via
control strategies mitigates the frequency fluctuations stemming from stochastic renewables and uncer- 
tain demand. Recently, open communication networks are used to deploy Load Frequency Control 
(LFC) strategies to overcome the lack of dedicated communication infrastructures and the ubiquity of 
DERs system. However, open networks are exposed to communication degradation and can reduce the 
LFC performance. This work investigates the real-time performance and reliability of the integrated DER 
system and open communication networks, i.e. the cyber-physical microgrid system, with reference to 
LFC against communication degradation. In particular, LFC is provided by a discrete PID controller 
tuned via particle swarm optimization. The cyber-physical microgrid system is implemented on a real-time 
platform simulating various MAC protocols and open-communication-network architectures, developed 
in the Truetime simulator. The impact of communication degradation on LFC performance is assessed. 
Simulation results demonstrate that transmission delays and packet dropouts jeopardize the ability of 
cyber-physical microgrid systems to maintain system frequency deviations within tolerance bounds. In 
particular, the use of Ethernet ensures higher reliability as compared to 802.11 b/g. Moreover, the impact 
of interfering traffic and of the percentage of used bandwidth on the LFC performance reduction is evaluated.
The optimized PID controller is able to compensate for communication degradation and uncertainty
of the microgrid, and ensures robust LFC against unknown network configurations.


1 INTRODUCTION 


The power sector is experiencing a structural 
trend towards decentralization stemming from the 
integration of large shares of Renewable Energy 
Resources (RERs) (Driesen and Katiraei, 2015). 
This is fostered by Distributed Energy Resources 
(DERs), which require the integration of power 
generation means located at or near the end-user 
side (Kumar et al., 2017). However, the stochastic 
nature of RERs and of the load demand induces 
system frequency fluctuations (Shotorbani et al., 
2012). An effective control strategy is needed to 
keep the system frequency to its nominal value by 
balancing power generation and demand in real 
time. To this aim, Automatic Generation Con- 
trol (AGC) schemes are developed for damping 
frequency oscillations in Distributed Generation 
Systems (DGS) (Shotorbani et al., 2012; Lee and 
Wang, 2008). 

Recently, the AGC has been integrated with the 
open communication network, due to low cost, 
high speed, simple structure and flexible access. 
Data exchanges among PMUs, generators and 
control center are provided by the open communication
network in the form of time-stamped data
packets (Kuzlu et al., 2014). Stable AGC depends 
heavily on the performance of the open communi- 
cation network (Ahmadi and Aldeen, 2017). 

However, open communication networks are 
exposed to various types of degradation processes, 
i.e. network-induced time delays (Lee and Wang, 
2008) and packet dropouts (Mo et al., 2016). As 
a result, the measurement signals (control signals) 
received by the control center (ESSs or generators) 
degrade, effective AGC cannot be carried out and 
the system frequency response worsens (Pan and 
Das, 2016). Studying the performance of open 
communication networks is critical for under- 
standing the occurrence of time delays and packet 
dropouts. 

Time delays are variable, challenging to predict, 
deteriorate the AGC performance and reduce the 
stability region (Pan and Das, 2016). Packet dropouts
refer to lost messages, which occupy network
bandwidth but cannot reach their destination. They
affect the operations of DERs and the reduction
of frequency fluctuations, particularly in uncertain
network environments. Optimal feedback AGC
regulators for DERs are investigated in numerous
works that assume perfect communication networks, so the
impact of transmission delays and packet dropouts
on the controller cannot be captured (Ghoshal,
2004). Robust PID controllers against constant or
uniformly distributed time delays are designed to 
cope with perturbations of the control parameters 
(Pan and Das, 2016). However, constant or uni- 
formly distributed time delays cannot be generally 
assumed in realistic communication networks. 

In this work, we investigate (a) the operations of 
the integrated DER system and open communica- 
tion network, and (b) the design of optimal AGC 
strategies in the face of communication degrada- 
tion. The ability of the integrated system to main- 
tain system frequency within tolerance margins 
quantifies system reliability, and is evaluated by 
Monte Carlo simulation (MCS). Stochastic time 
delays are modelled by generating random con- 
gestion based on MAC protocols. Congestion of 
network channels depends on the activity level 
of the interfering traffic, which is the root cause 
of network-induced delays and packet dropouts 
(Peng and Han, 2016). The open communication 
network model is implemented via Truetime simu- 
lator testing different MAC protocols (Grenier and 
Navet, 2004; Cervin et al., 2010). Multiple activ- 
ity levels of the interfering traffic are simulated 
via the disturbance node, which sends random 
interfering packets over the network. Packet drop- 
out is described by Bernoulli-distributed variables 
(Minero et al., 2009). To stabilize system frequency 
against RER, demand variability and communica- 
tion degradation, a discrete PID controller is used 
(Pan and Das, 2016). Particle Swarm Optimization 
(PSO) is adopted to minimize the stochastic objec- 
tive function and achieve the optimal PID control- 
ler for various architectures and conditions of the 
open communication network (Ghoshal, 2004; 
Pan and Das, 2016). Finally, the robustness of the 
optimum-PID-controlled AGC against communi- 
cation degradation is assessed. 

The rest of the work is organized as follows. Sec- 
tion 2 describes the DGS model. The open network 
and the communication degradation models are 
described in Section 3. Section 4 defines the reli- 
ability of the integrated system, and introduces the 
PSO-based PID controller. Section 5 presents the 
simulation results. Section 6 concludes the work. 


2 MODEL OF DGS WITH DERS 


The schematic of the integrated system, which is
made of a cyber layer and a physical layer, is illustrated
in Fig. 1. The structure of the physical layer is
general, representative of DGS, and widely adopted
in the literature (Pan and Das, 2016). It models a
hybrid microgrid, which consists of conventional
generators (Diesel Engine Generator, DEG), RERs
(wind turbine generator, WTG, and photovoltaic
generator, PV), ESSs (battery and flywheel energy
storage system, BESS and FESS) and power-to-gas
technology (aqua electrolyzer, AE, and fuel cell,
FC). The cyber layer, i.e. the open communication
network, provides data exchange between the control
center and the controllable components, i.e. the
ESSs and DEG, in the physical layer. PMU measurements
and control signals are transmitted via
the shared and open communication network. The
cyber layer is detailed in Section 3. Power imbalance
is the root cause of system frequency fluctuations.
The control center remotely monitors the ESS and
the diesel generator to reduce the imbalance and to
ensure good AGC performance.


Figure 1. Schematic of the integrated system.

The small signal stability analysis of the hybrid 
microgrid in Figure 1 is based on time-domain
simulations, i.e. transfer function models. In the 
AGC, the WTG, PV, AE, FC, DEG, FESS and 
BESS are described by first order transfer func- 
tions with specified gain and time constant. A cen- 
tralized controller is used by Pan and Das (2016), 
as opposed to multiple decentralized controllers 
for each controllable component (Ray et al., 2011). 
It enables easier maintenance and reduces wiring
cost, and makes the AGC design problem tractable
by reducing the number of controller parameters.
On the other hand, the centralized controller
impacts the AGC performance, because a unique 
control signal is used by all the components. Nev- 
ertheless, current studies show that the centralized 
controller can ensure acceptable real-time AGC 
performance (Pan and Das, 2016). The transfer
functions $G_{WTG}(s)$, $G_{PV}(s)$, $G_{FC}(s)$ and $G_{DEG}(s)$
of the WTG, PV, FC and DEG, respectively, are
expressed as

$$G_{WTG}(s) = \frac{K_{WTG}}{1 + sT_{WTG}} = \frac{P_{WTG}}{P_{w}} \tag{1}$$

$$G_{PV}(s) = \frac{K_{PV}}{1 + sT_{PV}} = \frac{P_{PV}}{P_{sol}} \tag{2}$$

$$G_{FC}(s) = \frac{K_{FC}}{1 + sT_{FC}} = \frac{P_{FC}}{P_{AE}} \tag{3}$$

$$G_{DEG}(s) = \frac{K_{DEG}}{1 + sT_{DEG}} = \frac{P_{DEG}}{u_{DEG}(t)} \tag{4}$$

where $K_{WTG}$, $K_{PV}$, $K_{FC}$ and $K_{DEG}$ are the gains, and
$T_{WTG}$, $T_{PV}$, $T_{FC}$ and $T_{DEG}$ are the time constants of
the WTG, PV, FC and DEG. $P_{WTG}$ and $P_{PV}$ are the
electrical power produced from the RERs, i.e. from wind
power $P_{w}$ and solar power $P_{sol}$. $P_{AE}$ and $P_{FC}$ are the
output of the AE and the power produced by the FC.
The DEG is controlled by the control signal $u_{DEG}(t)$
sent by the remote control center and it generates
power only when the RERs cannot meet the demand.
In the DGS, the AE is used to absorb the rap- 
idly fluctuating output power from the WTG and 
PV by producing hydrogen (Das et al., 2012). The 
hydrogen is stored and used as fuel in the FC to 
feed power to the grid. The dynamic property of the
AE is described by the transfer function $G_{AE}(s)$

$$G_{AE}(s) = \frac{K_{AE}}{1 + sT_{AE}} = \frac{P_{AE}}{(P_{WTG} + P_{PV})(1 - K_n)} \tag{5}$$

where $K_{AE}$ and $T_{AE}$ are the gain and time constant
of the AE. $1 - K_n$ denotes the fraction of the power
generated by the WTG and PV that is used to produce
hydrogen in the AE. $K_n$ is equal to $P_t/(P_{WTG} + P_{PV})$
and is set to 0.6 (Pan and Das, 2016).

The ESSs are critical in eliminating frequency
fluctuations due to their fast response to the control
signal. Based on Lee and Wang (2008), the
transfer functions $G_{FESS}(s)$ and $G_{BESS}(s)$ of the
FESS and BESS are given as

$$G_{FESS}(s) = \frac{K_{FESS}}{1 + sT_{FESS}} = \frac{P_{FESS}}{u_{FESS}(t)} \tag{6}$$

$$G_{BESS}(s) = \frac{K_{BESS}}{1 + sT_{BESS}} = \frac{P_{BESS}}{u_{BESS}(t)} \tag{7}$$

where $K_{FESS}$ and $K_{BESS}$ are the gains, $T_{FESS}$ and
$T_{BESS}$ are the time constants, $P_{FESS}$ and $P_{BESS}$ are
the output powers, and $u_{FESS}(t)$ and $u_{BESS}(t)$ are the
control signals of the FESS and BESS.

Remark 1: The DEG, FESS and BESS have
rate constraints, i.e. $|P_{DEG}| < \bar{P}_{DEG}$, $|P_{FESS}| < \bar{P}_{FESS}$
and $|P_{BESS}| < \bar{P}_{BESS}$, where $\bar{P}_{DEG}$, $\bar{P}_{FESS}$ and $\bar{P}_{BESS}$
are the maximum rated output power of the DEG,
FESS and BESS, respectively.
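All the component models above share the first-order form K/(1 + sT), optionally with an output limit as in Remark 1. A minimal time-domain sketch is given below; it uses forward-Euler integration and clamps the output to a symmetric limit, which is one possible reading of the constraint rather than the authors' implementation. The gain, time constant, step size and limit are hypothetical values chosen for illustration.

```python
def simulate_first_order(u, gain, tau, dt, p_max=None):
    """Forward-Euler simulation of P(s)/U(s) = gain / (1 + s*tau).

    u     : sequence of input samples, one per time step
    p_max : optional symmetric output limit, enforcing |P| <= p_max
    """
    p = 0.0
    out = []
    for u_k in u:
        # tau * dP/dt + P = gain * u
        p += dt * (gain * u_k - p) / tau
        if p_max is not None:
            p = max(-p_max, min(p_max, p))
        out.append(p)
    return out


# Unit step input for 2 s with hypothetical parameters.
step = [1.0] * 2000
response = simulate_first_order(step, gain=0.5, tau=0.1, dt=0.001)
limited = simulate_first_order(step, gain=0.5, tau=0.1, dt=0.001, p_max=0.3)
```

The unconstrained response settles at the gain value, while the limited one saturates at `p_max`.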

The transfer function $G_{sys}(s)$ of the hybrid
power system (HPS) models the relationship
between the power imbalance, i.e. $\Delta P_s - \Delta P_d$, and
the system frequency deviation $\Delta f(t)$

$$G_{sys}(s) = \frac{1}{Ms + D} = \frac{\Delta f}{\Delta P_s - \Delta P_d} \tag{8}$$

where $M$ and $D$ are the inertia constant and
damping constant of the HPS (Pan and Das, 2016),
$P_s$ is the total power generated, given by
$K_n(P_{WTG} + P_{PV}) + P_{FC} + P_{DEG} + P_{FESS} + P_{BESS}$, and
$P_d$ is the power demand. The models for $P_{WTG}$, $P_{PV}$
and $P_d$ are detailed in (Lee and Wang, 2008).
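In the time domain, a transfer function of the form 1/(Ms + D) corresponds to M d(Δf)/dt + D Δf = ΔP, where ΔP is the power imbalance. The sketch below integrates this equation with forward Euler for a constant imbalance. The numerical values of M, D, the step size and the imbalance are hypothetical and only serve to show that the deviation settles at ΔP/D.

```python
def simulate_frequency(delta_p, inertia, damping, dt):
    """Forward-Euler integration of inertia * d(df)/dt + damping * df = delta_p."""
    df = 0.0
    trace = []
    for dp in delta_p:
        df += dt * (dp - damping * df) / inertia
        trace.append(df)
    return trace


M, D = 0.2, 0.012             # hypothetical per-unit inertia and damping constants
imbalance = [0.01] * 20000    # constant 0.01 p.u. imbalance over 200 s
trace = simulate_frequency(imbalance, M, D, dt=0.01)
steady_state = 0.01 / D       # expected final deviation, delta_p / D
```

The time constant of the response is M/D, so a low damping constant makes the frequency deviation slow to settle and large in steady state, which is why active control is needed.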


3 THE OPEN COMMUNICATION 
NETWORK 


The model of the data transmission across the open 
communication network accounts for the composi- 
tion of network-induced delays and for the stochastic 
packet dropout. We assess two general architectures 
for the communication network used in power sys- 
tems, i.e. the Ethernet (Grenier and Navet, 2004) and 
a mix of Ethernet and WLAN (henceforth called 
hybrid network) (Pan and Das, 2016), and implement 
them via Truetime simulator (Cervin et al., 2010). 


3.1 Model of time delay and packet dropout 


The time delay in the communication channel
consists of four components, i.e. the preprocessing
time $T_{pre}$, the waiting time $T_{wait}$, the time
for traveling across the channel $T_{tx}$, and the
postprocessing time $T_{post}$ (Ghoshal, 2004). Therefore,
the total time delay $T_d$ can be expressed as

$$T_d = T_{pre} + T_{wait} + T_{tx} + T_{post} \tag{9}$$

where $T_{pre}$ and $T_{post}$ depend on the processing
speed of the device firmware. $T_{tx}$ depends on the
physical bandwidth, propagation speed and transmission
distance. $T_{wait}$ is the major source of the
time delay and it is influenced by the MAC protocol.

The time delays in the forward channel and in the
feedback channel are determined by Eq. (9), and
their impact on the transfer function from the
control signal $U(s)$ to the system output $Y(s)$ is
described as

where $G(s)$ denotes the transfer functions of the
DERs.

A second source of disturbance in the source- 
destination communication is packet dropout, 
which occurs for three major reasons, i.e. the net- 
work disconnection, time-out transmission and 
time-out re-transmission (Cervin et al., 2010). 
Binary switching sequences are usually applied,
which specify the expected packet dropout probability.
The stochastic parameter $\gamma_k$ of the binary
switching sequence is Bernoulli distributed (Liang
et al., 2010), taking value 0 or 1 with

$$P\{\gamma_k = 0\} = L_p \tag{11}$$

where $0 < L_p < 1$ is the packet loss probability.
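The Bernoulli dropout model can be sketched in a few lines. The code below draws a binary switching sequence with a given loss probability and applies it to a signal. Holding the last received value when a packet is lost is an assumption made here for illustration; the text does not state how the receiver handles a dropout. Function names and the seed are hypothetical.

```python
import random


def dropout_sequence(n, loss_prob, seed=None):
    """Draw n Bernoulli indicators: 0 = packet lost, 1 = packet delivered."""
    rng = random.Random(seed)
    return [0 if rng.random() < loss_prob else 1 for _ in range(n)]


def hold_last_value(samples, gamma, initial=0.0):
    """Receiver-side signal when a lost packet is replaced by the last one received."""
    received, last = [], initial
    for x, g in zip(samples, gamma):
        if g == 1:
            last = x
        received.append(last)
    return received


gamma = dropout_sequence(10_000, loss_prob=0.2, seed=1)
empirical_loss = gamma.count(0) / len(gamma)
```

Over many samples, `empirical_loss` approaches the configured loss probability.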


3.2 Network architecture and implementation 


The architecture for the AGC of DERs via Ethernet
and hybrid network is illustrated by the cyber
layer in Figure 1. The data exchange among the
control center, DERs, interference node and PMU
is wired in the Ethernet architecture. In the hybrid
architecture, the data exchange between the routers
and the RTU (BESS, FESS, DEG and PMU)
is wireless and provided by the WLAN. The data
exchange among routers is wired and provided by
the Ethernet. Low product prices make WLAN
more convenient and cost-effective than
the traditional LAN. As such, WLAN has become
an efficient approach to provide flexible data communication
between routers and RTU, and is
employed for monitoring and controlling DERs,
offshore wind farms and smart home energy management
systems (Pan and Das, 2016).

The interference node simulates an interfering
user in the open network, who sends
disturbing traffic over the channels and causes
congestion. Generally, the length of the data traffic
is constant (i.e. 80 bytes in this work), and the
interference node sends it to the network at every
time period $T_i$ if

$$UN_i < BWShare \tag{12}$$

where $UN_i$ is a uniformly distributed random
number in the interval [0,1] sampled at time period $T_i$,
and $BWShare$ is the expected ratio of
the network bandwidth used by the interference
node.
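The sending rule of the interference node can be illustrated as follows. This is a standalone sketch of the rule in Eq. (12) in plain Python, not TrueTime code; the function name, the number of periods and the seed are hypothetical.

```python
import random

PACKET_BYTES = 80  # interfering packet length stated in the text


def interference_traffic(n_periods, bw_share, seed=None):
    """Bytes sent in each period: one packet whenever a uniform draw is below bw_share."""
    rng = random.Random(seed)
    return [PACKET_BYTES if rng.random() < bw_share else 0 for _ in range(n_periods)]


sent = interference_traffic(10_000, bw_share=0.3, seed=7)
active_fraction = sum(1 for b in sent if b) / len(sent)
```

On average the node transmits in a fraction `bw_share` of the periods, so raising `bw_share` raises the channel load and therefore the likelihood of congestion.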

The architectures of the real-time AGC of
DERs via the Ethernet and the hybrid network both
consist of:


• PMU node (time-driven): the PMU takes measurements
of the system frequency at every sampling
interval $T_s = 0.01$ s and sends $\Delta f(kT_s)$ to
the control center over the network. The size
of the data packets is 80 bytes and the phase delay
caused by the filter of the PMU is 0.006 s (Martin
et al., 1998).

• Control center node (event-driven): when a data
packet from the PMU reaches the control center,
the controller takes 0.002 s to compute the control
signal $u(t_k)$ and sends it to the DERs. The
length of the control signal is 500 bytes.

• DERs nodes (event-driven): the DEG, BESS
and FESS adjust their operations based on the
control signal received.

• Interference node (time-driven): it sends disturbing
traffic over the network with period $T_i$
and causes congestion, to generate different scenarios
of time delay and packet dropout.


4 THE PSO-TUNED PID CONTROLLER 


4.1 Discrete-time PID controller 


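Due to periodic PMU measurements and packet
dropouts, the AGC process is discrete. Therefore,
a discrete-time PID controller is adopted to compute
the control signal

$$u(t_k) = -\left[K_p\,\Delta f(t_k) + K_i\,T_s\sum_{j=0}^{k-1}\Delta f(t_j) + K_d\,\frac{\Delta f(t_k) - \Delta f(t_{k-1})}{T_s}\right] \tag{13}$$

where $K_p$, $K_i$ and $K_d$ are the proportional,
integral and derivative gains of the controller, and
$\Delta f(t_k) - \Delta f(t_{k-1})$ is the change of the frequency
deviation between two consecutive samples.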

For the implementation of the discrete-time PID
controller, the role of a first-order pole filter on the
derivative action should be considered [36]. Therefore,
the transfer function of the discrete-time PID
controller in Eq. (13) is

$$C(z) = -\left[K_p + K_i\,a(z) + K_d\,\frac{N_d}{1 + N_d\,b(z)}\right] \tag{14}$$

where $N_d$ is the filter coefficient indicating the
location of the pole in the derivative filter and
$a(z) = b(z) = T_s/(z - 1)$.
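A controller of the form −[K_p + K_i a(z) + K_d N/(1 + N b(z))] with a(z) = b(z) = T_s/(z − 1) can be realised with two forward-Euler integrator states. The sketch below is one such realisation, written under that assumption; the class name and the parameter values used in any call are hypothetical, and it is not the authors' implementation.

```python
class DiscretePID:
    """Discrete PID with filtered derivative and forward-Euler integrators.

    Realises C(z) = -[Kp + Ki*a(z) + Kd*N/(1 + N*b(z))] with
    a(z) = b(z) = Ts/(z - 1).
    """

    def __init__(self, kp, ki, kd, n, ts):
        self.kp, self.ki, self.kd, self.n, self.ts = kp, ki, kd, n, ts
        self.integ = 0.0  # state of the integrator a(z)
        self.filt = 0.0   # state of the integrator b(z) inside the derivative filter

    def step(self, delta_f):
        """Return the control signal for one frequency-deviation sample."""
        # Derivative branch: d = N * (Kd * e - filt), where filt integrates d.
        deriv = self.n * (self.kd * delta_f - self.filt)
        u = -(self.kp * delta_f + self.ki * self.integ + deriv)
        # Forward Euler: states are updated after the output is computed.
        self.integ += self.ts * delta_f
        self.filt += self.ts * deriv
        return u
```

For a constant input, the derivative branch decays by a factor (1 − N·T_s) per sample, so N·T_s must stay below 2 for the filter to remain stable in this discretisation.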


4.2 Reliability of the integrated system 


Frequency values outside the tolerance band compromise
the components in the physical layer,
reduce their useful lifetime and deteriorate the
AGC performance. In this work, the reliability is
defined as the ability of the integrated system to
maintain system frequency within a predefined
tolerance band, in the face of stochastic RERs,
uncertain load demand, variable time delays and
random packet dropouts. The value of the reliability
$R$ measures the ability of the integrated system
to provide adequate AGC performance, and is
computed as

$$R = T_r / T \tag{15}$$

where $T_r$ is the total time in which the system
frequency remains within the predefined tolerance
band and $T$ is the total operating time of the
AGC.
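When the frequency deviation is sampled at a fixed interval, the ratio of in-band time to total time reduces to the fraction of samples inside the tolerance band. The sketch below computes this ratio for one trace and averages it over several independent trials. The function names and the tolerance value in any call are hypothetical.

```python
def reliability(delta_f, tolerance):
    """Fraction of equally spaced samples with |delta_f| <= tolerance."""
    if not delta_f:
        raise ValueError("empty frequency trace")
    inside = sum(1 for df in delta_f if abs(df) <= tolerance)
    return inside / len(delta_f)


def expected_reliability(traces, tolerance):
    """Monte Carlo estimate: average reliability over independent trials."""
    if not traces:
        raise ValueError("no trials given")
    return sum(reliability(t, tolerance) for t in traces) / len(traces)
```

For instance, a trace with three of four samples inside the band gives a reliability of 0.75.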

Due to the uncertainties in the DGS and the 
random communication degradation effects, the 
AGC performance evolves through stochastic 
trajectories. Thus, the reliability in Equation 15 and 
the objective function in Equation 16 are stochastic 
and are generally evaluated as the expected values of 
a stochastic process (Cervin et al., 2010). The 
reliability and the objective function are therefore 
estimated via the MCS method (Pan and Das, 2016). 
In order to achieve a statistically acceptable estimate, 
the reliability is computed via 200 MCS-based samples, 
i.e. the integrated system is evaluated for 200 trials 
to compute the expected reliability (each sample 
simulates a time frame of 100 s and requires 
around 4 s on a 64-bit Windows desktop with 32 GB 
memory and an Intel(R) Xeon(R) E5-1650 v3 @ 
3.50 GHz CPU). 
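
The Monte Carlo estimate described above can be sketched as a sample mean with its standard error. The function `toy_trial` is a hypothetical stand-in for one 100 s closed-loop simulation and is not the authors' simulator; only the sample count of 200 comes from the text. 

```python
import random
import statistics


def expected_reliability(simulate_once, n_samples=200, seed=0):
    """Monte Carlo mean of the reliability and its standard error."""
    rng = random.Random(seed)
    values = [simulate_once(rng) for _ in range(n_samples)]
    mean = statistics.fmean(values)
    stderr = statistics.stdev(values) / n_samples ** 0.5
    return mean, stderr


def toy_trial(rng):
    # Hypothetical surrogate: reliability scattered around 0.98, clipped to [0, 1].
    return min(1.0, max(0.0, rng.gauss(0.98, 0.01)))


mean_r, stderr_r = expected_reliability(toy_trial)
```

The standard error gives a rough indication of whether 200 samples are enough for the precision required. At around 4 s per sample, one such estimate takes roughly 800 s when the trials are run sequentially. 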


4.3 Performance of the PID-controlled AGC 


The AGC performance of the integrated sys- 
tem depends on the discrete-time PID control- 
ler. Therefore, the PID controller is optimized to 
mitigate the communication network disturbances, 
and offer optimum AGC performance by reducing 
system frequency fluctuations. As a result, system 
reliability is also optimized. The objective function 
for the optimization of the PID controller is an 
integral performance index over the total operating 
time $T$, which quantifies frequency and control 
signal deviations (Pan and Das, 2016): 


$$J = \int_0^T \left[ \eta_1 (\Delta f)^2 + (1-\eta_1)\,\eta_2\,(\Delta u)^2 \right] dt \qquad (16)$$


where $\eta_1$ indicates the relative importance of the 
two terms and $\eta_2$ is the normalizing constant that 
scales both terms to a uniform range and is set to 
0.002. The first term is directly related to the 
reliability of the integrated system and the second 
term measures the disturbance rejection ability of 
the controller, which is the total control effort to be 
minimized. The total process time $T$ is set to 100 s. 
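
As an illustration, the integral performance index can be evaluated on sampled trajectories with the trapezoidal rule. This is a sketch under stated assumptions: the weight `eta1 = 0.5` and the constant trajectories are hypothetical (the text does not report the weight), while `eta2 = 0.002` and the 100 s horizon come from the text. 

```python
def objective(t, delta_f, delta_u, eta1, eta2=0.002):
    """Trapezoidal approximation of the weighted integral of squared deviations."""
    integrand = [
        eta1 * df ** 2 + (1.0 - eta1) * eta2 * du ** 2
        for df, du in zip(delta_f, delta_u)
    ]
    return sum(
        0.5 * (integrand[k] + integrand[k + 1]) * (t[k + 1] - t[k])
        for k in range(len(t) - 1)
    )


# Hypothetical constant deviations over a 100 s horizon sampled every 0.5 s.
t = [0.5 * k for k in range(201)]
j = objective(t, [0.1] * 201, [10.0] * 201, eta1=0.5)
```

For constant deviations the integrand is constant, so the result equals the integrand multiplied by the horizon length, which gives a quick check of the implementation. 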


Many heuristic algorithms, e.g. the genetic 
algorithm (Das et al., 2016) and the PSO (Pan and 
Das, 2016), are appropriate in such a case to find 
near-optimum solutions, taking into account the 
stochastic nature of the objective function. The 
fitness value of the stochastic population-based 
algorithms is the expectation of the stochastic objective 
function obtained from MCS 


$$E[J] = \frac{1}{N} \sum_{i=1}^{N} J_i$$


where $J_i$ is the value of the objective function in 
Equation 16 for the $i$th sample and N = 200 
samples. We use PSO to solve the optimization 
problem for $x \in \mathbb{R}^3$ 


$$\underset{x \in G}{\text{minimize}} \; E[J(x)] \qquad (17)$$

where $J(x): \mathbb{R}^3 \to \mathbb{R}$ is defined in Equation 16. 
The variables $x$ are the controller parameters, i.e. 
$K_P$, $K_I$ and $K_D$, and the 3-dimensional search 
space $G \subset \mathbb{R}^3$ is pre-specified as $G = [0, 100]^3$ to 
widely span the optimization range of the controller 
design. 
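
The optimization loop can be sketched with a basic global-best PSO. The inertia and acceleration coefficients (`w`, `c1`, `c2`), the noisy quadratic surrogate objective and its optimum are hypothetical choices for illustration; they are not the settings or the simulator used by the authors. Only the swarm size, the iteration count and the search box follow the text. 

```python
import random


def pso(fitness, bounds, n_particles=30, n_iter=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best particle swarm minimization inside box bounds."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp each coordinate to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def make_expected_objective(n_samples=20, seed=1):
    """Hypothetical surrogate: Monte Carlo mean of a noisy quadratic."""
    rng = random.Random(seed)
    target = (36.0, 16.0, 0.2)  # illustrative optimum, not a result of the paper

    def expected_j(x):
        base = sum((xi - ti) ** 2 for xi, ti in zip(x, target))
        return sum(base + rng.gauss(0.0, 0.1) for _ in range(n_samples)) / n_samples

    return expected_j


best_x, best_val = pso(make_expected_objective(), [(0.0, 100.0)] * 3)
```

Because the fitness is itself a noisy estimate, the stored best value can be optimistic; re-evaluating the final solution with fresh samples is a common safeguard. 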


5 RESULTS AND DISCUSSION 


This section investigates the reliability of the 
integrated energy-communication system equipped 
with an optimum PID controller to enhance the AGC 
performance and reduce system frequency fluctuations. 
We assess the impact of the two architectures, 
i.e. Ethernet and hybrid, and of various configurations 
of the open communication network on the 
system reliability and on the AGC performance. 


5.1 Specifications for the integrated system 


The physical system operates in nominal conditions, 
i.e. the stochastic wind speed, the variable sun 
irradiance and the uncertain load, affecting, respectively, 
$P_{WTG}$, $P_{PV}$ and $P_L$, are given by Das et al. (2016). The 
coupled algebraic and ordinary differential equations 
for the DGS and the open communication 
network in Fig. 1 are numerically integrated using 
the Dormand-Prince method implemented in the 
Matlab ode45 function with a fixed step size of 0.005 s. 
The parameters of the transfer functions for the 
DERs in Figure 1 are provided in Das et al. (2016). 
Additionally, the physical layer has the following 
specifications: 


• The base value for the apparent power is 150 kW. 

• The rated apparent power of the WTG is 
150 kW, the rated electrical frequency is 60 Hz 
and the rated voltage for the induction machine 
is 440 V. 

• The structure of a 150 kW PV power generation 
has 93 parallel strings, each consisting 
of 7 series-connected modules (SunPower 
SPR-230E-WHT-D). 

• Based on Remark 1, $0 < \Delta P_{DEG} < 0.7$ pu, $|\Delta P_{FESS}| < 0.3$ pu 
and $|\Delta P_{BESS}| < 0.3$ pu provide the largest 
rated output of the DEG, FESS and BESS 
(Pan and Das, 2016). 
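
The PV array size in the list above can be checked with simple arithmetic. The sketch assumes a nominal module power of 230 W, which the module designation suggests but the text does not state. 

```python
parallel_strings = 93
modules_per_string = 7
module_power_w = 230.0  # assumed nominal power per module, not stated in the text

array_power_kw = parallel_strings * modules_per_string * module_power_w / 1000.0
# 93 * 7 = 651 modules, i.e. about 149.7 kW, close to the 150 kW rating.
```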

The open communication network is implemented 
in the TrueTime simulator (Cervin et al., 2010). 
The data rate of the communication architectures 
is 800 kbit/s, and the WLAN is characterized by a 
transmission power of 20 dBm, a receiver signal threshold 
of -48 dBm, an ACK timeout of 0.04 ms and a retry limit of 5. 


5.2 Reliability of the integrated system 


For illustration purposes, the frequency tolerance 
band is set to ± 0.05. The parameters of the PID 
controller are tuned as $K_P = 37.78$, $K_I = 17.24$ and 
$K_D = 0.16$ (Pan and Das, 2016). Five combinations 
of BWShare, $T_i$ and $L_p$ are taken into account 
to investigate multiple communication scenarios in 
the Ethernet and in the hybrid network. For 
comparison purposes, the reliability of the integrated 
system with perfect communication, i.e. no time 
delays and no packet dropouts, is R = 98.36% and 
the value of the objective function is J = 16.36. 
Figure 2 illustrates the relationship between the 
reliability of the integrated system with Ethernet 
and the communication parameters. Figure 2 shows 
that the AGC becomes unstable and the system 
reliability degrades if $L_p \ge 40\%$: the controller does not 
have sufficient observations of the system frequency 
measurement and is unable to derive correct control 
signals. The case (BWShare = 0.1 & $T_i$ = 0.01) has the 
largest reliability due to light congestion. In this case, 
data collisions between the PMU and the interference 
node are fewer than for other cases with smaller 
$T_i$. In the case (BWShare = 0.2 & $T_i$ = 0.005), the 
interference node sends out disturbing traffic every 
0.005 s and consumes 20% of the network bandwidth; 
therefore, data collisions between the PMU and the 
interference node increase. As a result, increased 
time delays degrade system reliability as compared 
to the other two cases. Increasing the activity level of 
the interference user and the used bandwidth leads 
to increasing data collisions, and therefore network 
congestion, which causes longer time delays. Thus, 
system reliability decreases, as seen by comparing case 
(BWShare = 0.1 & $L_p$ = 0.1) with case (BWShare = 0.2 
& $L_p$ = 0.1), and case ($T_i$ = 0.01 & 
$L_p$ = 0.1) with case ($T_i$ = 0.002 & $L_p$ = 0.1), 
respectively. Additionally, if $T_i$ = 0.01, the competition for 
sending data packets between the interference node 
and the PMU reduces, network conditions improve 
and thus the system reliability increases. 

Table 1 provides the estimated reliability of the 
integrated system with Ethernet and with hybrid 
network. The values of the communication parameters 
are selected because they generate 10 ms to 
2 s time delays and losses of 1% to 10% of the 
overall packet stream, which are representative of 
real communication networks (Cervin et al., 2010). 
A negative correlation between system reliability 
and E[J(x)] can be inferred from Table 1. This 
is expected because when the system reliability is 
high, most of the system frequency measurements 
are within the tolerance band, which leads 
to a small value of the objective function. Table 1 
shows that the Ethernet architecture ensures higher 
reliability of the integrated system as compared to the 
hybrid architecture. This is expected because the 
data exchange between the AP and the RTU in the 
hybrid architecture is provided by WLAN, which 
introduces additional time delays and packet 
dropouts, and results in lower system reliability. 


Figure 2. System reliability as a function of BWShare, 
$T_i$ and $L_p$. 


Table 1. Reliability of the integrated system for different 
network configurations [BWShare $T_i$ $L_p$] of the two 
architectures. 

Configurations    $R_E$    $E[J_E(x)]$    $R_H$    $E[J_H(x)]$ 

* $R_E$ and $E[J_E(x)]$ are the reliability and objective value 
of the integrated system with Ethernet. 

* $R_H$ and $E[J_H(x)]$ are the reliability and objective value 
of the integrated system with hybrid network. 


5.3 Optimal PID controller for the AGC 
of DERs 


PSO is employed to design the optimal PID 
controller. In the PSO, the number of particles and 
the number of iterations are set to 30 and 50, 
respectively. For each particle, the stochastic objective 
function E[J(x)] is estimated from 200 MC samples. The 
performance of the adopted PSO is discussed by 
Pan and Das (2016) with respect to the convergence 
rate. The PSO can achieve the local optimal 
PID controller after 50 iterations (Pan and Das, 
2016). 
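
The computational budget implied by these settings can be tallied as a short worked example. It assumes sequential execution and the figure of around 4 s per MC sample reported in Section 4.2, and it ignores any additional evaluation of the initial swarm. 

```python
n_particles = 30
n_iterations = 50
n_mc_samples = 200
seconds_per_sample = 4.0  # approximate figure reported in Section 4.2

n_evaluations = n_particles * n_iterations      # objective evaluations
n_simulations = n_evaluations * n_mc_samples    # closed-loop simulations
total_seconds = n_simulations * seconds_per_sample
total_days = total_seconds / 86400.0
# 1500 evaluations, 300000 simulations, roughly 14 days if run one after another.
```

Under these assumptions a single controller design costs about two weeks of sequential computing time. Since the MC samples are independent, they are natural candidates for parallel evaluation. 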

Table 2 lists the optimal control parameters 
[$K_P$ $K_I$ $K_D$] for the integrated system with the two 
network architectures, which minimize E[J(x)] 
under various configurations of [BWShare $T_i$ $L_p$]. 
Different combinations of BWShare, $T_i$ and $L_p$ 
generate different levels of network traffic in 
different scenarios of a day, i.e. light, medium and 
heavy [27]. Large BWShare, small $T_i$ and large $L_p$ 
simulate intense communication flows in the open 
communication network. For the perfect 
communication, the optimal control parameters are 
$K_P = 36.06$, $K_I = 16.06$ and $K_D = 0.23$, the 
reliability is R = 99.37% and E[J(x)] = 9.99. Table 2 
quantifies the amount of system reliability and 
control effectiveness that is lost due to the effects 
of communication degradation with respect to 
the perfect communication case. As an example, 
E[J(x)] = 17.47 is the minimum expected value 
of the objective function for the Ethernet network 
configuration [0.2 0.005 5%], which implies that the PID 
controller can provide the best AGC performance 
and highest system reliability for this configuration. 
The increase in BWShare and $L_p$, and the decrease in 
$T_i$, cause the reduction of R and the increase of 
E[J(x)], which is in line with common intuition. Such 
changes deteriorate the performance of the 
communication network, and lead to inaccurate control of the 
DERs and large system frequency fluctuations. 


5.4 Impact of communication degradation 
on optimum AGC operations 


The degradation of the communication network 
performance can affect the AGC operations even 
for optimally-controlled DER components. We 
quantify the extent of this impact for three 
architectures, i.e. perfect communication, Ethernet 
and hybrid network, under the same WTG output, 
PV output and load demand. The three integrated 
systems are equipped with the optimum PID 
controllers for the communication configuration 
[0.2 0.005 5%], given in the first row of Table 2. Figure 3 
shows the power curves of FESS, BESS and DEG, 
the power imbalance $\Delta P_G - \Delta P_L$, and the system 
frequency $\Delta f$ in the three integrated systems 
for $t \in [49.5\,\mathrm{s}, 60\,\mathrm{s}]$ of one MC simulation. The 
DER system experiences a sudden load increase 
at t = 50 s, and $\Delta P_L$ rises from 0.8 pu to 0.95 pu. 
In the perfect communication case, at t = 52 s, the 
output of the FESS, BESS and DEG rises, 
respectively, from 0.026 pu to 0.124 pu, from 0.008 pu 
to 0.038 pu, and from 0.009 pu to 0.029 pu. As a 
result, $\Delta P_G - \Delta P_L$ is negative and very small, and, 
therefore, a negative and small frequency deviation 
$\Delta f = -0.013$ occurs. 


Table 2. Optimal PID controllers [$K_P$ $K_I$ $K_D$] for different 
network configurations [BWShare $T_i$ $L_p$] of the two 
architectures. 

Configuration      Optimal PID and performance   Ethernet              Hybrid 
[0.2 0.005 5%]     [K_P K_I K_D]                 [37.37 17.24 0.16]    [43.02 17.63 0.12] 
                   R                             98.22                 98.15 
                   E[J(x)]                       17.47                 17.96 
[0.3 0.005 5%]     [K_P K_I K_D]                 [39.54 23.57 0.13]    [32.99 21.68 0.09] 
                   R                             97.64                 94.43 
                   E[J(x)]                       -                     31.22 
[0.1 0.005 5%]     [K_P K_I K_D]                 [37.96 16.02 0.15]    [36.12 16.55 0.16] 
                   R                             98.30                 98.20 
                   E[J(x)]                       16.75                 17.71 
[0.2 0.0025 5%]    [K_P K_I K_D]                 [32.54 17.78 0.15]    [36.83 21.24 0.12] 
                   R                             96.94                 94.17 
                   E[J(x)]                       19.5                  30.57 
[0.2 0.01 5%]      [K_P K_I K_D]                 [37.76 18.51 0.15]    [41.00 15.85 0.12] 
                   R                             98.27                 98.22 
                   E[J(x)]                       16.99                 17.51 
[0.2 0.005 10%]    [K_P K_I K_D]                 [41.04 18.73 -]       [41.50 23.09 0.13] 
                   R                             98.05                 97.04 
                   E[J(x)]                       18.0                  18.79 
[0.2 0.005 1%]     [K_P K_I K_D]                 [39.71 22.65 0.16]    [43.04 17.72 0.14] 
                   R                             98.26                 98.21 
                   E[J(x)]                       -                     17.64 


Figure 3. Evolution of FESS, BESS and DEG output, 
power imbalance $\Delta P_G - \Delta P_L$ and system frequency for a 
sudden load increase (0.15 pu) at t = 50 s. Positive values 
of FESS and BESS power result in net injection into the 
grid. 

For the integrated system with Ethernet and 
hybrid network, the FESS, BESS and DEG output, 
$\Delta P_G - \Delta P_L$ and $\Delta f$ converge to the same values 
as the AGC with perfect communication, but with 
smaller convergence rates due to time delays and 
packet dropouts. The strong effects of communication 
degradation introduced by the WLAN in 
the hybrid network yield the smallest convergence 
rate in Figure 3. Additionally, Figure 3 highlights 
delays in the response of the controllable components 
to the control signal for the integrated systems 
with imperfect communication. For example, 
the output of FESS in the hybrid architecture and 
the output of BESS in the Ethernet architecture 
do not change in the time interval [50 s, 50.2 s], as 
compared to the perfect communication case. 
FESS and BESS keep absorbing (releasing) the 
same power from (to) the grid even in unbalanced 
conditions because the data packets containing the 
control signals do not reach the destination. The 
control center lacks enough frequency measurements 
from the PMU, and issues inaccurate commands. 
Delays in component response and missing 
control signals ultimately cause the FESS, BESS 
and DEG to respond late and inaccurately, 
leading to large frequency oscillations. 


Figure 4. Evolution of system frequency for a sudden 
load increase (0.15 pu) at t = 50 s under increasing 
communication degradation. 

Finally, we investigate the robustness of the 
PID-controlled AGC operations optimized for 
the configuration [0.2 0.005 5%] (first row of Table 2) 
when subjected to increasing communication 
degradation. Figure 4 plots the system frequency $\Delta f$ in 
the three integrated systems for $t \in [49.5\,\mathrm{s}, 60\,\mathrm{s}]$ 
of one MC simulation. Four configurations are 
shown: the optimization conditions, 
BWShare increased from 0.2 to 0.3, $T_i$ decreased 
from 0.005 to 0.0025, and $L_p$ increased from 5% 
to 10%. As in Figure 3, $\Delta P_L$ rises from 0.8 pu 
to 0.95 pu at t = 50 s. Figure 4 reveals the ability 
of the various controllers to provide satisfactory AGC 
performance for unknown communication 
configurations. The optimized controller for Ethernet 
is still robust to degraded communication, but 
small frequency fluctuations occur. The reliability 
in the four configurations is, respectively, 98.22%, 
96.17%, 95.84% and 95.63%. Conversely, the 
optimized controller for the hybrid network cannot provide 
robust AGC operations under degraded communication, 
and the system frequency becomes unstable. 
The reliability in the four WLAN network 
configurations is, respectively, 98.15%, 71.58%, 
85.9% and 73.85%. The analysis of the PID tuning 
indicates that the PSO produces different optimal 
PID controllers for different runs of the optimization 
algorithm [10]. Nonetheless, even if PID controllers 
tuned using heuristics have different values 
of R and E[J(x)], they are capable of stabilizing 
the system frequency quickly in the analyzed 
network configurations. 
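
The robustness figures quoted above can be compared directly by computing the loss of reliability, in percentage points, relative to the optimization conditions. The reliability values are those reported in the text; the helper name and the short labels are illustrative. 

```python
# Reliability (%) in the four configurations, in the order given in the text:
# optimization conditions, larger BWShare, shorter interference period, higher loss.
ethernet = [98.22, 96.17, 95.84, 95.63]
hybrid = [98.15, 71.58, 85.9, 73.85]


def losses(values):
    """Percentage points lost with respect to the first (reference) value."""
    return [round(values[0] - v, 2) for v in values[1:]]


ethernet_loss = losses(ethernet)
hybrid_loss = losses(hybrid)
worst_ethernet = max(ethernet_loss)
worst_hybrid = max(hybrid_loss)
```

The Ethernet controller loses at most about 2.6 percentage points, whereas the hybrid controller loses more than 26 points when BWShare increases from 0.2 to 0.3. 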


6 CONCLUSION 


This work proposes a system-of-systems framework 
for investigating the operations of integrated DER 
systems and open communication networks, and 
the design of optimal AGC strategies in the face 
of communication degradation. The results show 
that the activity level of interference traffic and the 
expected percentage of bandwidth used by the 
interference user have a significant impact on system 
reliability. Low system reliability is observed for large 
BWShare and small $T_i$. The AGC based on a 
discrete-time PID controller is developed to suppress 
the frequency oscillations in the integrated system 
and enhance reliability. Optimization results show 
that the PSO-based PID controller is capable of 
reducing system frequency fluctuations and improving 
AGC performance. The communication degradation 
delays the component response to the control 
signal and causes the component outputs to remain 
unchanged even in the presence of power imbalance until an 
updated packet is received. Finally, the AGC with 
Ethernet shows the largest reliability and robustness 
against unknown communication configurations, 
and results in small frequency oscillations. 

Future work will focus on time-delay 
compensation by designing a real-time Smith predictor 
based on the estimated communication delay. 
Additionally, the effect of packet dropout can be 
mitigated by reconstructing the missing data to 
form a complete data set. 


ABSTRACT: Degradation analysis is widely used for the reliability analysis of products 
with long life and high reliability. Non-competing multiple degradation is common in 
some mechanical products, and a non-competing relationship multiple degradation model is proposed in this 
paper to describe it. The degradation process of the products considered in this paper has two 
degradation indicators, and the gamma process is used to characterize the degradation process. A Bayesian fusion 
framework of degradation analysis is presented to deal with the small sample size problem of the products, 
and the MCMC method is used to simulate the samples of model parameters. A case study of the spool 
valve is discussed to demonstrate the proposed model and method. 


1 INTRODUCTION 


With the development of science and technology, 
complex systems in fields such as industry, 
transportation and the military are generally required 
to perform with a high reliability level (Ye et al., 
2014). Methods such as degradation analysis have 
been developed to enhance the reliability analysis 
of these systems (Chen and Tsui, 2013, Nelson, 
1981). Degradation analysis methods have been 
widely used in academic research and industrial 
production. In general, complex products 
with long life and high reliability may have multiple 
performance degradation processes. Multiple 
degradation processes based on the competing relationship 
have been widely studied (Peng et al., 2016, Pan et 
al., 2011). In competing relationship multiple 
degradation models, the product fails 
when any one of the performance indicators 
reaches a predefined threshold. However, 
non-competing multiple degradation is common 
in mechanical products (Yang et al., 2016), and 
the competing relationship multiple degradation 
models may not be suitable for this situation. To 
address the problem, a modified multiple degradation 
process model is proposed. A classic example, 
i.e., the spool valve, is investigated in this paper to 
demonstrate the proposed method. The wear of 
the spool and the wear of the sleeve are two 
non-competing performance indicators of the spool valve. 

Earlier spool valve reliability models did not 
consider the small sample size problem in reliability 
analysis, but in some cases the small sample 
size problem is inevitable. A modified model that 
considers the small sample problem is introduced 
in this paper. In order to obtain more accurate 
analysis results under limited degradation observations, 
it is imperative to take full advantage of the 
degradation information from Original Equipment 
Manufacturers (OEMs) and user plants. Degradation 
information from different sources shares a 
common mathematical form, so it can be fused during 
the reliability analysis (Feng and Zhou, 2008). 

The objective of this paper is to present a hierarchical 
Bayesian method based on the gamma process model 
for degradation process modeling and analysis. 
The hierarchical Bayesian method can fuse 
degradation information from different sources. 
The Markov chain Monte Carlo (MCMC) method 
is used to estimate the model parameters. 

The remainder of this paper is organized as 
follows. Section 2 introduces the gamma process 
model for the non-competing relationship multiple 
degradation model. In Section 3, a Bayesian 
information fusion method is presented. Section 4 
presents the degradation analysis of a spool valve to 
illustrate the proposed method. Section 5 summarizes 
the work and outlines prospects for future 
work. 


2 THE DEGRADATION MODELS 


2.1 Gamma process model 


The performance degradation process of a product 
is usually an evolutionary process, which a stochastic 
process model can describe well. 
The gamma process is a stochastic process generally 
used to describe the performance degradation 
of mechanical products. Because the increments of 
the gamma process are non-negative, it is consistent 
with the performance evolution of mechanical 
products. 

The gamma process is a stochastic process 
$\{X(t), t \ge 0\}$ with shape parameter $\alpha(t)$ and 
scale parameter $\lambda$, denoted as $X(t) \sim Ga(\alpha(t), \lambda)$. 
The gamma process has the following properties: 
$X(t)$ is an independent-increment process 
with $X(0) = 0$, and the degradation increments 
$\Delta X(t) = X(t + \Delta t) - X(t)$ follow gamma 
distributions $\Delta X(t) \sim Ga(\Delta\alpha(t), \lambda)$ with 
$\Delta\alpha(t) = \alpha(t + \Delta t) - \alpha(t)$. The mean and variance 
of $X(t)$ are $\alpha(t)/\lambda$ and $\alpha(t)/\lambda^2$, respectively, 
where $\alpha(t)$ is a right-continuous non-decreasing 
function on $[0, \infty)$ with $\alpha(0) = 0$ (Abdel-Hameed, 
1975). 
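
A degradation path with these properties can be simulated by accumulating independent gamma increments. The sketch below assumes a linear shape function $\alpha(t) = \alpha t$ and uses hypothetical parameter values. Since the stated mean is $\alpha(t)/\lambda$, $\lambda$ enters `random.gammavariate` through the scale argument as $1/\lambda$. 

```python
import random


def simulate_gamma_path(alpha, lam, dt, n_steps, rng):
    """Simulate X(0), X(dt), ..., X(n_steps * dt) for a linear shape alpha * t."""
    x, path = 0.0, [0.0]
    for _ in range(n_steps):
        # Increment ~ Gamma(shape = alpha * dt, scale = 1 / lam): mean alpha * dt / lam.
        x += rng.gammavariate(alpha * dt, 1.0 / lam)
        path.append(x)
    return path


rng = random.Random(1)
path = simulate_gamma_path(alpha=1.5, lam=2.0, dt=1.0, n_steps=10, rng=rng)

# The sample mean of X(10) over many paths should approach alpha * t / lam = 7.5.
finals = [simulate_gamma_path(1.5, 2.0, 1.0, 10, rng)[-1] for _ in range(2000)]
mean_final = sum(finals) / len(finals)
```

Every simulated path is non-decreasing, which reflects the non-negative increments that make the gamma process suitable for wear-type degradation. 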

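The probability density function (PDF) $g(x)$ 
of the gamma process is defined by 


$$g(x \mid \alpha(t), \lambda) = \frac{\lambda^{\alpha(t)}}{\Gamma(\alpha(t))}\, x^{\alpha(t)-1} \exp(-\lambda x)\, I_{(0,\infty)}(x) \qquad (1)$$

where 

$$I_{(0,\infty)}(x) = \begin{cases} 1, & x \in (0,\infty) \\ 0, & x \notin (0,\infty) \end{cases} \qquad (2)$$


and $\Gamma(a) = \int_0^\infty x^{a-1} e^{-x}\,dx$ is the gamma function 
(Noortwijk, 2009). 

Suppose that a product's degradation process is 
characterized by $\{X(t), t \ge 0\}$ and that it fails when the 
performance indicator reaches a predefined threshold 
$C$. Thus, the lifetime of the product is defined as 
$T = \inf\{t \mid X(t) \ge C\}$. The reliability $R(t)$ of the 
product or system at time $t$ can be expressed as: 


$$R(t) = \Pr\{\sup_{s \le t} X(s) < C\} \qquad (3)$$


$$R(t) = P\{T > t\} = P\{X(t) < C\} = \int_0^C g(x \mid \alpha(t), \lambda)\,dx = \frac{1}{\Gamma(\alpha(t))} \int_0^{C\lambda} u^{\alpha(t)-1} e^{-u}\,du \qquad (4)$$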
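
Equation (4) is the regularized lower incomplete gamma function evaluated at $(\alpha(t), C\lambda)$. The following sketch evaluates it with the standard series and continued-fraction expansions using only the standard library. The linear shape $\alpha(t) = \alpha t$ and the numerical values are assumptions for illustration. 

```python
import math


def reg_lower_gamma(a, x, tol=1e-12, max_iter=10000):
    """Regularized lower incomplete gamma function P(a, x) for a > 0."""
    if x <= 0.0:
        return 0.0
    log_pref = a * math.log(x) - x - math.lgamma(a)
    if x < a + 1.0:
        # Series expansion, which converges quickly when x < a + 1.
        term = 1.0 / a
        total = term
        n = a
        for _ in range(max_iter):
            n += 1.0
            term *= x / n
            total += term
            if abs(term) < abs(total) * tol:
                break
        return total * math.exp(log_pref)
    # Continued fraction (modified Lentz method) for the upper tail Q(a, x).
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, max_iter):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < tol:
            break
    return 1.0 - h * math.exp(log_pref)


def reliability(t, alpha, lam, threshold):
    """R(t) = P{X(t) < C} for a gamma process with shape alpha * t (assumed linear)."""
    if t <= 0.0:
        return 1.0
    return reg_lower_gamma(alpha * t, threshold * lam)


# Hypothetical parameters: alpha = 1, lam = 1, threshold C = 5.
r1, r5, r10 = (reliability(t, 1.0, 1.0, 5.0) for t in (1.0, 5.0, 10.0))
```

For a shape of exactly 1 the gamma distribution reduces to the exponential distribution, so `reliability(1.0, 1.0, 1.0, 5.0)` can be checked against $1 - e^{-5}$. 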


2.2 The non-competing relationship multiple 
degradation model 


Suppose that the $L$ performance indicators 
of the product are mutually independent, and 
$D_p(t)$ is the degradation data of the $p$th 
degradation process at time $t$, with $p = 1, 2, \ldots, L$. 
All degradation processes lead to one failure mode, so the 
reliability $R(t)$ at time $t$ can be defined as: 


$$R(t) = \Pr\{D_1(t) + D_2(t) + \cdots + D_L(t) < C\}$$

If $D_p(t)$ follows a gamma process with shape 
parameter $\alpha_p(t)$ and scale parameter $\lambda$, and 
the shape parameter function is 
$\alpha_p(t) = \alpha_p t$, then the degradation increment 
$\Delta D_p$ follows the gamma distribution: 


$$\Delta D_p \sim Gamma(\alpha_p \Delta t, \lambda) \qquad (5)$$


Based on the additivity of the gamma distribution, 
the following relationship can be obtained: 


$$\sum_{p=1}^{L} \Delta D_p(t) \sim Gamma\left(\sum_{p=1}^{L} \alpha_p \Delta t, \lambda\right) \qquad (6)$$


According to the properties of the gamma process, 
$D_1(t) + D_2(t) + \cdots + D_L(t)$ follows a gamma process 
with shape parameter function $\alpha(t) = \sum_{p=1}^{L} \alpha_p t$ 
and scale parameter $\lambda$. The reliability $R(t)$ at time 
$t$ can be defined as: 


$$R(t) = \frac{1}{\Gamma\left(\sum_{p=1}^{L} \alpha_p t\right)} \int_0^{C\lambda} u^{\sum_{p=1}^{L} \alpha_p t - 1} e^{-u}\,du \qquad (7)$$
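
The additivity property behind Equation (6) can be checked numerically by comparing the sample moments of summed increments with the moments of a single gamma distribution whose shape is the sum of the individual shapes. The parameter values in this sketch are hypothetical. 

```python
import random
import statistics


def summed_increment_moments(alphas, lam, dt, n_draws=20000, seed=0):
    """Sample mean and variance of the sum of independent gamma increments."""
    rng = random.Random(seed)
    sums = [
        sum(rng.gammavariate(a * dt, 1.0 / lam) for a in alphas)
        for _ in range(n_draws)
    ]
    return statistics.fmean(sums), statistics.variance(sums)


alphas, lam, dt = (0.6, 0.9), 1.3, 1.0   # hypothetical values
mean_mc, var_mc = summed_increment_moments(alphas, lam, dt)

# Moments of Gamma(sum(alphas) * dt, lam), with lam acting as the inverse scale.
mean_th = sum(alphas) * dt / lam
var_th = sum(alphas) * dt / lam ** 2
```

The property relies on all indicators sharing the same $\lambda$; with different values the sum would no longer follow a gamma distribution. 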


3 RELIABILITY ANALYSIS BASED 
ON DEGRADATION MODEL 


3.1 Basic Bayesian framework of degradation 
process 


The Bayesian method has two advantages for 
degradation analysis: (1) its capability for data and 
information fusion makes the fusion analysis of the 
OEMs' and user plants' performance degradation 
data possible (Peng et al., 2016); (2) its capacity 
to handle uncertainty supports the results of the 
performance degradation analysis and the reliability 
analysis (Efron, 2013). These two points are the 
basic principles used to construct the basic framework 
of Bayesian performance degradation analysis and 
reliability analysis. 

The framework uses the Bayesian method to 
combine the degradation data of the OEMs and the 
user plant, and includes the acquisition 
of the prior distribution and the solution of 
the posterior distribution. 

According to the different sources of degradation 
data, the framework divides the acquisition 
of the prior distribution into two methods. For the 
OEMs, the method of subjective information 
quantization is used to obtain the prior distribution. 
The prior distribution of the degradation analysis of the 
user plant is directly transformed from the posterior 
distribution of the degradation analysis of the 
OEMs. 

In order to estimate the posterior distribution, 
this paper constructs the posterior distribution of the 
parameters in the degradation process model based 
on the Bayesian method, and samples the posterior 
distribution given the degradation data through the 
MCMC method. 


3.2 Degradation process analysis method 


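In this paper, we analyze the non-competing 
relationship multiple degradation model with two 
degradation indicators. Suppose that the degradation data 
of $D_1(t)$ and $D_2(t)$ are observed for $N$ units. Let 
$D_1(t_{ij})$ and $D_2(t_{ij})$ denote the $j$th observation for unit 
$i$ at time point $t_{ij}$, with $j = 1, \ldots, M$ and $i = 1, \ldots, N$. Let 
$\Delta d_{1ij} = D_1(t_{ij}) - D_1(t_{i,j-1})$ and $\Delta d_{2ij} = D_2(t_{ij}) - D_2(t_{i,j-1})$ 
be the degradation increments, with $\Delta t_{ij} = t_{ij} - t_{i,j-1}$. 
According to the gamma process model, the increments 
follow the gamma distribution 
$\Delta d_{pij} \sim Gamma(\alpha_p \Delta t_{ij}, \lambda)$, $p = 1, 2$. 

The performance degradation data of the 
non-competing relationship multiple degradation model 
include the OEMs' data $D^O$ and the user plants' data 
$D^U$. When the performance degradation data $D^O$ 
and $D^U$ are unified as $D$, with $v = (\alpha_1, \alpha_2, \lambda)$, the 
likelihood function can be presented as: 


$$L(D_1, D_2 \mid v) = \prod_{i=1}^{N} \prod_{j=2}^{M} g(\Delta d_{1ij} \mid \alpha_1 \Delta t_{ij}, \lambda)\, g(\Delta d_{2ij} \mid \alpha_2 \Delta t_{ij}, \lambda)$$

$$= \prod_{i=1}^{N} \prod_{j=2}^{M} \frac{\lambda^{\alpha_1 \Delta t_{ij}}}{\Gamma(\alpha_1 \Delta t_{ij})}\, \Delta d_{1ij}^{\alpha_1 \Delta t_{ij} - 1} \exp(-\lambda \Delta d_{1ij}) \cdot \frac{\lambda^{\alpha_2 \Delta t_{ij}}}{\Gamma(\alpha_2 \Delta t_{ij})}\, \Delta d_{2ij}^{\alpha_2 \Delta t_{ij} - 1} \exp(-\lambda \Delta d_{2ij}) \qquad (8)$$


where $g(\cdot)$ is the PDF of a gamma distribution. 

Suppose that the prior information about the 
samples is quantified and obtained as the joint prior 
distribution for the parameters of the degradation 
process model, $\pi(v) = \pi(\alpha_1, \alpha_2, \lambda)$. Following 
the Bayesian method, the joint posterior distribution 
of the parameters can be obtained as: 


$$p(\alpha_1, \alpha_2, \lambda \mid D_1, D_2) \propto \pi(\alpha_1, \alpha_2, \lambda)\, L(D_1, D_2 \mid \alpha_1, \alpha_2, \lambda)$$

$$= \pi(\alpha_1, \alpha_2, \lambda) \prod_{i=1}^{N} \prod_{j=2}^{M} \frac{\lambda^{\alpha_1 \Delta t_{ij}}}{\Gamma(\alpha_1 \Delta t_{ij})}\, \Delta d_{1ij}^{\alpha_1 \Delta t_{ij} - 1} \exp(-\lambda \Delta d_{1ij}) \cdot \frac{\lambda^{\alpha_2 \Delta t_{ij}}}{\Gamma(\alpha_2 \Delta t_{ij})}\, \Delta d_{2ij}^{\alpha_2 \Delta t_{ij} - 1} \exp(-\lambda \Delta d_{2ij}) \qquad (9)$$


Equation (9) combines the prior information 
with the information contained in the degradation 
data. 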
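
In practice the likelihood of Equation (8) is evaluated on the log scale to avoid numerical underflow. The sketch below assumes that the data are stored as nested lists indexed by unit and observation; the function names and the small data set are hypothetical. 

```python
import math


def gamma_logpdf(x, shape, rate):
    """Log-density of a gamma distribution with the given shape and rate, x > 0."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(x) - rate * x)


def log_likelihood(alpha1, alpha2, lam, times, d1, d2):
    """Log-likelihood of two indicators; times[i][j], d1[i][j], d2[i][j] per unit i."""
    total = 0.0
    for t_i, d1_i, d2_i in zip(times, d1, d2):
        for j in range(1, len(t_i)):
            dt = t_i[j] - t_i[j - 1]
            # Each increment must be strictly positive for the log-density to exist.
            total += gamma_logpdf(d1_i[j] - d1_i[j - 1], alpha1 * dt, lam)
            total += gamma_logpdf(d2_i[j] - d2_i[j - 1], alpha2 * dt, lam)
    return total


# Hypothetical single unit observed at t = 0 and t = 1.
ll = log_likelihood(1.0, 1.0, 0.5, times=[[0.0, 1.0]], d1=[[0.0, 2.0]], d2=[[0.0, 3.0]])
```

With a shape of exactly 1 each factor is an exponential density, so the example value equals $2\ln 0.5 - 2.5$. 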


3.3 Bayesian information fusion method 


According to the degradation process modeling 
and the Bayesian method, the degradation data of the 
OEMs and the user plants are analyzed to obtain 
the joint posterior distribution of the model 
parameters. The realization process is as follows: 


$$p_O(\alpha_1, \alpha_2, \lambda \mid D^O) \propto \pi(\alpha_1, \alpha_2, \lambda) \prod_{i=1}^{N} \prod_{j=2}^{M} g(\Delta d^O_{1ij} \mid \alpha_1 \Delta t_{ij}, \lambda)\, g(\Delta d^O_{2ij} \mid \alpha_2 \Delta t_{ij}, \lambda) \qquad (10)$$

$$p_U(\alpha_1, \alpha_2, \lambda \mid D^U, D^O) \propto p_O(\alpha_1, \alpha_2, \lambda \mid D^O) \prod_{i=1}^{N} \prod_{j=2}^{M} g(\Delta d^U_{1ij} \mid \alpha_1 \Delta t_{ij}, \lambda)\, g(\Delta d^U_{2ij} \mid \alpha_2 \Delta t_{ij}, \lambda) \qquad (11)$$


where $D_1^O$ and $D_2^O$ are the degradation data 
from the OEMs, $g(\cdot)$ is the PDF of a gamma 
distribution, $\pi(\alpha_1, \alpha_2, \lambda)$ is the prior distribution 
quantified from the prior information for the products, and 
$p_O(\alpha_1, \alpha_2, \lambda \mid D^O)$ is the posterior distribution of 
the model parameters obtained by combining 
the prior information and the information 
contained in the degradation data of the OEMs. $D_1^U$ 
and $D_2^U$ are the degradation data from the user 
plants, and $p_U(\alpha_1, \alpha_2, \lambda \mid D^U, D^O)$ is the posterior 
distribution of the model parameters, which 
combines the prior information with the information 
contained in the degradation data from the OEMs and 
the user plants. 

Because Equations (10) and (11) are computationally 
intractable, the MCMC method 
is used to estimate the model parameters. In 
most actual applications of Bayesian methods, 
it is difficult to obtain the posterior distribution 
in closed form. The MCMC method constructs a Markov 
chain whose invariant distribution is the posterior 
distribution to be estimated. In this paper, we assume 
that the prior distributions of the model parameters 
are non-informative, and OpenBUGS 
is used to perform the Gibbs sampling, after which the 
model parameters are estimated. 
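
To illustrate how a Markov chain can target this posterior, the sketch below uses a random-walk Metropolis sampler. This is a substitute technique chosen for brevity and is not the Gibbs sampling performed by OpenBUGS. It assumes equal time steps `dt`, U(0, 100) priors as in Section 4, and synthetic increments generated from hypothetical parameter values. 

```python
import math
import random


def sufficient_stats(increments):
    """Return (count, sum of x, sum of log x) for strictly positive increments."""
    return len(increments), sum(increments), sum(math.log(x) for x in increments)


def log_posterior(theta, stats1, stats2, dt):
    a1, a2, lam = theta
    if not all(0.0 < p < 100.0 for p in theta):   # U(0, 100) priors
        return -math.inf
    lp = 0.0
    for a, (n, sx, slx) in ((a1, stats1), (a2, stats2)):
        k = a * dt                                 # shape of each increment
        lp += n * (k * math.log(lam) - math.lgamma(k)) + (k - 1.0) * slx - lam * sx
    return lp


def metropolis(inc1, inc2, dt, n_iter=20000, burn_in=5000, step=0.05, seed=0):
    """Random-walk Metropolis draws of (alpha_1, alpha_2, lambda) after burn-in."""
    rng = random.Random(seed)
    s1, s2 = sufficient_stats(inc1), sufficient_stats(inc2)
    theta = [1.0, 1.0, 1.0]
    lp = log_posterior(theta, s1, s2, dt)
    samples = []
    for it in range(n_iter):
        proposal = [p + rng.gauss(0.0, step) for p in theta]
        lp_new = log_posterior(proposal, s1, s2, dt)
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            theta, lp = proposal, lp_new
        if it >= burn_in:
            samples.append(tuple(theta))
    return samples


# Synthetic increments from hypothetical parameters (not the spool valve data).
data_rng = random.Random(42)
inc1 = [data_rng.gammavariate(0.65, 1.0 / 1.35) for _ in range(100)]
inc2 = [data_rng.gammavariate(0.92, 1.0 / 1.35) for _ in range(100)]
draws = metropolis(inc1, inc2, dt=1.0)
post_mean = [sum(d[i] for d in draws) / len(draws) for i in range(3)]
```

Because the posterior in Equation (11) is proportional to the prior multiplied by both likelihoods, concatenating the OEM and user plant increments before calling `metropolis` targets the same fused posterior under these assumptions. 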


4 ILLUSTRATIVE EXAMPLE 


The reliability of hydraulic systems is determined 
by various subsystems, and the spool valve is one 
of the most important subsystems. The reliability 
of the spool valves directly affects the reliability of 
the hydraulic system. The main failure mode of the 
spool valves is the wear degradation of spools and 
sleeves. When the total wear degradation of the spool 
and sleeve reaches a predetermined threshold, the 
spool valve is considered to have failed. The Bayesian 
information fusion method of Section 3 is applied to this example. 

To obtain information about the degradation of 
spool valves in the hydraulic system, the wear 
degradation of the spools and sleeves of six spool valves 
is monitored. The wear degradation paths of the 
OEMs are presented in Figure 1, and the wear 
degradation paths of the user plants are shown in 
Figure 2. 

As discussed in Section 3, the wear degradation 
increment of sleeves is $\Delta d_{s,ij} = D_s(t_{ij}) - D_s(t_{i,j-1})$, 
which is modeled as $Ga(\alpha_s \Delta t_{ij}, \lambda)$, and the 
wear degradation increment of spools is 
$\Delta d_{sp,ij} = D_{sp}(t_{ij}) - D_{sp}(t_{i,j-1})$, which is modeled as 
$Ga(\alpha_{sp} \Delta t_{ij}, \lambda)$. Owing to the limitation on the 
available prior information, a non-informative 
prior distribution, quantized from the subjective 
information, is used for the parameters of the 
degradation process model of the OEMs. It is given as: 

$\alpha_s \sim U(0, 100)$, $\alpha_{sp} \sim U(0, 100)$, $\lambda \sim U(0, 100)$ 

where $U(a, b)$ is a uniform distribution. 


Figure 1. The wear degradation paths of OEMs. 

Figure 2. The wear degradation paths of user plants. 

The MCMC method is used to simulate the 
samples of the model parameters, and OpenBUGS 
is used to generate 20000 samples. 

The posterior PDFs of the model parameters for the 
degradation process of the OEMs are presented 
in Figure 3. Based on the framework discussed 
in Section 3, these posterior distributions are 
transformed into the prior distributions of the degradation 
process model parameters of the user plants. 

The results for parameter estimation given in 
Table 1 are obtained from the generated posterior 
samples. The predefined degradation threshold is 
C = 120, and the reliability of the spool valves can be 
obtained from the posterior distribution of the model 
parameters as presented in Figure 4. 


Table 1. The results for parameter estimation. 

                                             Confidence interval 
Parameter        Mean      Standard deviation    2.5%      97.5% 
$\alpha_s$       0.6542    0.06317               0.5374    0.7827 
$\alpha_{sp}$    0.9189    0.0886                0.7538    1.102 
$\lambda$        1.348     0.1308                1.104     1.62 


Figure 4. Reliability of the spool valves (horizontal axis: 
number of strokes, in units of 10 thousand). 
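
A rough plug-in version of this calculation can be sketched by inserting the posterior means of Table 1 into Equation (7) and estimating the probability by simulation. This is an approximation under stated assumptions: it ignores the posterior uncertainty of the parameters, it assumes that time is measured in the units of Figure 4 (10 thousand strokes), and the function name is hypothetical. 

```python
import random

ALPHA_S, ALPHA_SP, LAM = 0.6542, 0.9189, 1.348   # posterior means from Table 1
THRESHOLD = 120.0                                 # degradation threshold C


def reliability_mc(t, n_draws=20000, seed=0):
    """Monte Carlo estimate of P{D_s(t) + D_sp(t) < C} with plug-in parameters."""
    rng = random.Random(seed)
    shape = (ALPHA_S + ALPHA_SP) * t              # summed shape of the two indicators
    below = sum(
        1 for _ in range(n_draws)
        if rng.gammavariate(shape, 1.0 / LAM) < THRESHOLD
    )
    return below / n_draws


r_50 = reliability_mc(50.0)
r_100 = reliability_mc(100.0)
```

Averaging the same probability over the posterior samples, rather than plugging in the means, would propagate the parameter uncertainty into the reliability estimate. 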


5 CONCLUSIONS 


In this paper, a reliability analysis model for the 
non-competing relationship multiple degradation 
process of products with two performance indicators 
is presented. The degradation processes are 
modeled by the gamma process. 
A hierarchical Bayesian method is then presented 
to fuse the degradation information from 
different sources. The Bayesian method 
is used to estimate the model parameters and the 
MCMC method is used to simulate the samples of the 
model parameters. A case study of a spool valve is 
provided to demonstrate the proposed model and 
method. 

Some open questions remain for future work. To account for the individual heterogeneity of the products, a random effect could be introduced into the non-competing multiple degradation process. The different working environments of the products at the OEMs and at the user plant could also be considered in the modelling of the degradation process.
ABSTRACT: Polymer composites used in engineering structures are exposed to various types of mechanical stress. The most commonly used methods of assessing their strength are based on static or dynamic testing; attempts to assess the reliability of composites under a given load are rare. The influence of the manufacturing technique on the mechanical strength of composites is also well known. Producing fibre-reinforced composites in an autoclave yields high-strength composites with a very small number of structural defects. However, this method is expensive and of limited availability. Much more often, composites are produced by cheaper methods such as hand lay-up, infusion or vacuum bag. The aim of this study is to determine the influence of the technique of making selected polymer composites on the probability of failure before a certain tensile strength threshold is reached. Composite materials reinforced with carbon and glass fabrics were prepared, as these are the composites most commonly used in aviation. The composites were made by hand lay-up, vacuum bag and pressing. Static tensile tests were performed, followed by a statistical analysis and a reliability analysis using the Weibull model. It was found that, in terms of tensile strength, the differences between the composites made by the pressing method and the manual method are negligible. Vacuum bag composites exhibit the lowest tensile strength of all the composites. The reliability analysis, on the other hand, indicates that the highest probability of maintaining structural continuity under load is shown by the glass-fibre composites made by the pressing technique.


1 INTRODUCTION


Reliability, which is understood quite well intuitively, is a synonym for self-assurance in operation. Since the very beginning of constructing broadly understood technical objects, this term has been comprehended in this sense. The intention was always to maximize suitability for use, generally described by the quality Q. It needs to be stressed that the reliability R is one of the constituent elements of quality, along with e.g. efficiency, usefulness or accuracy. As a narrower concept, it describes the ability of an object to perform a specific task under certain conditions and at a given time (Godziszewski 1983, Bentley 1999).

The theory of reliability has two main objectives (Szopa 2009):

• the formulation of theoretical grounds for the description of the laws underlying the occurrence of damage and malfunction in technical systems,
• seeking methods of design and rules of systems operation so as to minimize the possibility of damage at certain expenditures.

[...], however, is geared towards (Smalko):

• exploring physical processes of the occurrence of damage,
• removing damage,
• the ability to predict the occurrence of damage and counteract it.


Thus, the science of reliability is of multidiscipli- 
nary nature. The stimulus for its development was 
the necessity to meet increasingly higher demands 
for modern technical objects. At the beginning of 
its existence, it served to project the expected behav- 
iour of complex objects during their operation (e.g. 
aircraft, computers). Additionally, taking into 
account modern trends, research is conducted on 
the methodology of designing objects whose opera- 
tion is reliable in a defined time (Stowinski 2002). 
The assessment of the reliability of composite materials is based on the analysis of physical changes which affect them. The factors which influence the behaviour of the laminate can be divided into three main categories:


• mechanical load,
• material factors,
• environmental factors.


One of the greatest threats in maintaining reli- 
ability of the composite strength is the separa- 
tion of the matrix and the reinforcement. On the 
boundary of the phases there occur forces of adhe- 
sion and adsorption, which account for their over- 
all strength (Hart-Smith 1990, Godzimirski 2008). 

The foreseeable duration of composite util- 
ity will be specified by means of the likelihood 
of damage occurrence, which is determined on 
the basis of static and dynamic investigations. 
The most trustworthy results are those of tensile 
strength and bending (Krzyzak 2015, Esfandiari 
2008, Wu, Cheng & Kang 2000). 

Damage is defined as a transition from a fully- 
operational state to the state of deficiency or fail- 
ure to comply with at least one major parameter 
(Godziszewski 1983). 


2 METHODOLOGY 


The main objective of this study is to examine the 
influence of a manufacture technique of selected 
polymer composites upon the probability of dam- 
age prior to reaching a specified threshold ten- 
sile strength. In aviation, the autoclave method 
is used for the manufacture of advanced aircraft 
parts. However, the hand lay-up and vacuum bag 
are the most common methods used for repair. In 
the research, the authors proposed an alternative 
pressing method. 

The research concerned the composites most commonly used in aviation, with a particular focus on polymer-matrix laminates. The authors examined four-layer composites reinforced with commonly available woven fabrics of glass fibre (G) and carbon fibre (C). The matrix was the epoxy resin C.E.S. R70, based on bisphenol A/F, of density 1.16 g/cm³, combined with the C.E.S. H72 hardener of density 1.02 g/cm³ in the ratio of 100:54.

The polymer composites were produced with three methods: hand lay-up (hand), pressing (press) and vacuum bag (bag). During the hand lay-up lamination, particular attention was paid to the complete saturation of the reinforcement fabric with resin, and to arranging the reinforcement so that the fibres ran parallel. The subsequent layers, measuring 0.6 × 1 m, were pressed together with a roller. To ensure complete curing of the resin, the composites were left for 72 hours.

Another batch of samples was made using the Mecamaq PDM 50S hydraulic press. The initial production process was identical to that of the hand lay-up method. After the reinforcement layers had been laid and saturated with resin, the sheets were placed in the press and subjected to a compressive load of 5,000 kg (49,033 N) for 2 hours. Subsequently, they were removed and left for another 48 hours. During pressing, the composite was subjected to a pressure of 0.082 MPa.
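The quoted pressure can be checked from the stated load and sheet size. The short sketch below assumes that the load acts uniformly on the full 0.6 × 1 m sheet and uses standard gravity to convert kilograms to newtons; both are assumptions of this check, not statements from the text.

```python
# Pressing pressure = force / laminate area (numbers taken from the text).
G = 9.80665                 # standard gravity in m/s^2 (assumed conversion factor)
force_n = 5000 * G          # 5,000 kg load expressed in newtons
area_m2 = 0.6 * 1.0         # sheet size, assuming the whole sheet is loaded

pressure_mpa = force_n / area_m2 / 1e6
print(f"force = {force_n:.0f} N, pressure = {pressure_mpa:.3f} MPa")
```

Under these assumptions the result agrees with the force and pressure given above to the quoted precision.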

The third series of the pieces was produced by 
the vacuum bag method. The reinforcement layers 
were saturated manually with resin and placed in a 
tape-sealed working area. Next, the delamination 
fabric was placed for an uninterrupted separation 
of the manufactured laminate from the auxiliary 
materials. Another layer of the perforated film was 
to allow removing excess resin. Next, lignin, which 
was responsible for the resin absorption, was laid. 
Finally, everything was tightly covered with a poly- 
mer film. The negative pressure was achieved by 
the TW-1A 1/6 HP pump. 

The pump generated a negative pressure of 
approximately 1 bar, thus it equalled 0.1 MPa when 
calculated for the laminate surface which pressed 
the reinforcement layers. 

Similarly to pressing, after 24 hours, the lami- 
nate was extracted and left for 48 hours. Due to 
practical limitations associated with the availability 
of a proper carbon fabric, it was merely possible to 
manufacture glass-reinforced samples by means of 
the vacuum bag method. 

Test pieces of 10 × 100 mm were cut from the obtained boards. Special attention was paid to the cutting direction, so as to preserve the assumed direction of the reinforcement. The cutting was performed with a high-pressure stream of water and garnet abrasive, which made it possible to obtain smooth, even edges and high dimensional accuracy.


3 RESULTS 


3.1 Tensile strength 


The static tensile tests were performed on a Zwick/Roell fatigue testing machine equipped with pneumatic grips, in accordance with the DIN EN ISO 527-1 standard, at a constant crosshead speed of 2 mm/min. The gauge length was 50 mm and the length held in each grip was 25 mm. The samples were fitted in the grips, paying particular attention to
Table 1. Basic statistics of composite tensile strength.

Composite   Number of   Mean tensile      Standard          Coefficient
material    samples     strength [MPa]    deviation [MPa]   of variation
C (hand)    20          457.38            54.70             11.96%
C (press)   20          452.31            66.47             14.69%
G (hand)    20          275.64            14.93             5.42%
G (press)   20          291.33            22.38             7.68%
G (bag)     18          240.89            21.62             8.97%


their vertical orientation. Even minor deviations in 
this respect would negatively affect the reliability 
of the results obtained. 

Among the materials reinforced with carbon fabric, the highest average tensile strength was exhibited by the hand-made laminate C (hand), 457 MPa, which only slightly exceeded that of the pressed product C (press), 452 MPa. For the glass composites the situation is different: the highest tensile strength was exhibited by G (press), 291 MPa, followed by G (hand), 276 MPa, while the laminate manufactured with the vacuum bag method, G (bag), reached the lowest tensile strength, 241 MPa. The standard deviation, the standard error of the mean and the coefficient of variation all indicate that the repeatability of the results is considerably greater for the glass laminates, irrespective of the manufacturing method, than for the carbon-reinforced composites, whose coefficients of variation (12-15%) are roughly twice those of the glass laminates (5-9%).
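The scatter comparison can be reproduced from the tabulated means and standard deviations. The sketch below recomputes the coefficient of variation for each series from the Table 1 values; small differences in the last digit relative to the table are expected because the inputs are already rounded.

```python
# (mean, standard deviation) in MPa, copied from Table 1.
table1 = {
    "C (hand)": (457.38, 54.70),
    "C (press)": (452.31, 66.47),
    "G (hand)": (275.64, 14.93),
    "G (press)": (291.33, 22.38),
    "G (bag)": (240.89, 21.62),
}


def coefficient_of_variation(mean, std):
    """Return the coefficient of variation in percent."""
    return 100.0 * std / mean


cv = {name: coefficient_of_variation(m, s) for name, (m, s) in table1.items()}
for name, value in cv.items():
    print(f"{name}: {value:.2f}%")
```

A lower coefficient of variation means better repeatability, so the series can be ranked directly from the printed values.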

In the case of composites, due to their two-phase structure and the occurrence of tiny matrix cracks that contribute to permanent deformations, it is difficult to determine a typical elastic limit (Mallick 2007, Zhang, Li, Is & Yu 2013). An analysis of the increase in force with elongation during the tensile test suggests, however, that the test pieces deformed purely elastically only in the initial phase of the test. Thereafter, until fracture, they behaved like brittle bodies (Figure 1). Given the characteristics of composite materials and their two-phase structure, it can be deduced that the emerging matrix microcracks were responsible.

The relative elongation of the carbon composites during the tensile test was on average equal to 2.8%, and the differences between C (hand) and C (press) were negligible. The situation for the glass samples made by hand and by pressing was similar: the two types showed similar relative deformations of 4.35% and 4.22%, respectively. Taking into account the standard deviation of the results, this small discrepancy may be accidental. It was noted, however, that the relative elongation of the samples produced by the vacuum bag method was significantly lower than that of the samples obtained by the other methods: the average value of this parameter was 3.57%.

Figure 1. Tensile strength as a function of relative elongation for composites made by different methods (examples). Curves: C hand, C press, G hand, G press, G bag; horizontal axis: relative elongation [%].


3.2 Permissible value of mechanical strength 


The results obtained and the descriptive statistical analysis make it possible to determine the permissible tensile strength of the examined composites. A safety coefficient is usually determined on the basis of, among other things, numerous studies and statistical analysis; in the case of composite materials, however, there are no rigid guidelines for its adoption. This is explained by the variety of factors determining the strength and structural characteristics of composites; in selecting the coefficient, both computations and the designer's experience play a large role.

The authors therefore adopted as the permissible strength value n the average strength of the composites made by hand, reduced by a safety coefficient of 5%. For the composites with carbon reinforcement made by manual lamination, the permissible tensile strength equalled 434.51 MPa (0.95 × 457.38 MPa), whereas for the composites with glass reinforcement it was 261.86 MPa (0.95 × 275.64 MPa). It was assumed that the largest structural defects occur in composites made by manual lamination, so the above permissible strength is also adequate for composites produced with more advanced methods.
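The permissible values follow directly from the hand lay-up means in Table 1. A minimal sketch of that arithmetic, with variable names chosen here for illustration:

```python
SAFETY_REDUCTION = 0.05  # 5% reduction stated in the text

# Mean tensile strength of the hand-made composites in MPa (Table 1).
mean_strength_hand = {"carbon": 457.38, "glass": 275.64}

permissible = {
    fibre: round(mean * (1.0 - SAFETY_REDUCTION), 2)
    for fibre, mean in mean_strength_hand.items()
}
print(permissible)
```

The same permissible value is then applied to every manufacturing method within a fibre type, as assumed in the text.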


3.3 Analysis of reliability 


The Weibull reliability analysis was conducted. The authors first determined the regression functions of the reliability likelihood and the coefficients of determination R².

The approximating function, fitted by the method of least squares in the simple form y = ax + b, was obtained from the following formulas:

$$a = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad b = \bar{y} - a\bar{x}$$

where:
$x_i$ - predictor value,
$y_i$ - dependent variable value,
$\bar{x}, \bar{y}$ - mean values,
N - number of observations.


The coefficient of determination indicates how well the accepted model explains the collected measurement data. The better the fit of the model, the closer its value is to one. It is calculated using the formula:

$$R^2 = \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$

where: $\hat{y}_i$ - value of the variable predicted by the regression model.
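As an illustration of the least-squares step, the sketch below fits y = ax + b and evaluates the coefficient of determination as the ratio of explained to total variation. The function names and the four data points are hypothetical and serve only to show the calculation.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates (a, b) for the line y = a*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    a = sxy / sxx
    b = y_mean - a * x_mean
    return a, b


def r_squared(xs, ys, a, b):
    """Coefficient of determination: explained variation / total variation."""
    y_mean = sum(ys) / len(ys)
    explained = sum((a * x + b - y_mean) ** 2 for x in xs)
    total = sum((y - y_mean) ** 2 for y in ys)
    return explained / total


# Hypothetical data points, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
a, b = fit_line(xs, ys)
r2 = r_squared(xs, ys, a, b)
print(f"a = {a:.3f}, b = {b:.3f}, R^2 = {r2:.4f}")
```

For a least-squares line with an intercept, this ratio is equivalent to one minus the residual sum of squares divided by the total sum of squares.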


The figures (Figure 2, Figure 3) show the approximation of the probability distribution of maintaining strength with regard to the type of reinforcement and the method of manufacturing the composite.

The approximated probability distribution functions indicate only slight differences between the examined carbon composites. The situation is different for the laminates reinforced with glass. Here the obtained functions clearly indicate that damage occurs at the lowest tensile stresses in the materials manufactured by the vacuum bag method. The pressed composites, on the other hand, exhibited the greatest strength, demonstrating a significant advantage over the laminates produced manually.
Figure 2. Approximation of the probability distribution of failure when stretching carbon-reinforced composites.

Figure 3. Approximation of the probability distribution of failure when stretching glass-reinforced composites.


The obtained values of the coefficient of determination show that the regression function and the experimental results agree well. This also means that the modelling of the results rests on proper assumptions regarding the testing and the adopted distribution. It can be concluded that the obtained function can be used to interpolate the results within the range of the variable factors used in the experimental studies.

When the investigated parts become damaged mainly due to sudden wear (cracks, fractures, etc.), it is recommended to use the Weibull distribution (Saghafi, Mirhabibi & Yari 2009). It is characterized by a variable intensity of damage, which is monotone (non-decreasing) in nature. The distribution is applied to describe the fatigue service life of different materials and machinery. The reliability function of the two-parameter model has the form:

$$R(\sigma) = \exp\left[-\left(\frac{\sigma}{\sigma_0}\right)^{m}\right]$$

where:
$\sigma$ - expected operating time of objects, in strength investigations predefined as the permissible strength of materials,
$\sigma_0$ - scale parameter of the Weibull model [MPa],
$m$ - shape parameter of the Weibull model [-].


The Weibull modulus m is equal to the slope a of the fitted regression line. In other words, it corresponds to the inclination of the regression line to the X-axis (Maksymiuk 2003, Yadav, Singh & Goel 2006). As the Weibull modulus increases, the scatter of the loads at which critical damage is likely to occur decreases.

The scale parameter $\sigma_0$ is the stress at which the cumulative probability of failure equals (Warszynski 1988):

$$p = 1 - \frac{1}{e} \approx 0.632$$
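The link between the regression line and the two Weibull parameters can be sketched as follows. Taking logarithms of the failure function twice gives ln(-ln(1 - F)) = m ln(σ) - m ln(σ_0), so the slope of the fitted line is m and the intercept yields σ_0. The plotting positions F_i = (i + 0.5)/N and the strength readings below are assumptions made for this illustration; the paper does not state which estimator or raw data were used.

```python
import math


def weibull_failure_probability(stress, m, sigma0):
    """Two-parameter Weibull CDF: probability of failure at or below `stress`."""
    return 1.0 - math.exp(-((stress / sigma0) ** m))


def fit_weibull(strengths):
    """Estimate (m, sigma0) by least squares on the linearised Weibull CDF."""
    data = sorted(strengths)
    n = len(data)
    xs = [math.log(s) for s in data]
    # Plotting positions F_i = (i + 0.5) / n are an assumption of this sketch.
    ys = [math.log(-math.log(1.0 - (i + 0.5) / n)) for i in range(n)]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    # ln(-ln(1 - F)) = m*ln(sigma) - m*ln(sigma0), hence m = slope.
    return slope, math.exp(-intercept / slope)


# Hypothetical strength readings in MPa, for illustration only.
sample = [389.0, 412.0, 431.0, 446.0, 455.0, 470.0, 498.0, 520.0]
m_hat, sigma0_hat = fit_weibull(sample)
print(f"m = {m_hat:.2f}, sigma0 = {sigma0_hat:.1f} MPa")
# At stress equal to sigma0 the failure probability is 1 - 1/e.
print(weibull_failure_probability(sigma0_hat, m_hat, sigma0_hat))
```

Probabilities computed from fitted parameters in this way need not coincide with values read from empirical failure curves, so the sketch should not be expected to reproduce the tabulated probabilities exactly.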


The fitted parameters of the Weibull model are listed in Table 2.

The graphs (Figure 4, Figure 5) show the failure functions, i.e. the cumulative probability distributions of damage occurrence. They mark the predetermined permissible tensile strength values σ = n, which constitute the adopted limit of safe operation (vertical dotted line). The probability read at the intersection of the curve with this line indicates the fraction of the population in which decohesion occurred at a stress below the permissible one. The continuous horizontal line marks the probability level p = 0.632. Above this value, catastrophic damage accumulates across the whole population. The second quadrant of the plane delimited by these two lines is the area in which the material is excluded from operation: a probability of damage above p = 0.632 at a stress lower than σ = n is tantamount to the exclusion of the composite from use.


Table 2. Parameters of the Weibull model.

Fibre    Method   m [-]    σ_0 [MPa]
Carbon   hand     9.84     480.89
         press    7.39     484.53
Glass    hand     22.23    280.12
         press    15.44    300.23
         bag      12.80    241.18


Figure 4. Empirical function of failure of carbon composites due to tensile strength.

Figure 5. Empirical function of failure of glass composites due to tensile strength.

An analysis of the above diagrams shows that the polymer composites manufactured by pressing and by the manual method, both carbon and glass, are highly reliable and can be safely used at the adopted values of tensile stress. Laminates manufactured using the vacuum bag do not satisfy the conditions of operation at the given load level with the assumed reliability.

It is assumed that fulfilling the condition n < σ_0, for which the probability of failure at the permissible strength stays below 0.632, indicates a high probability of uninterrupted service life of a part manufactured with a given technique.

Table 3 shows the probability p(n) of the strength of the material falling below the predetermined permissible value. The authors adopted the following gradation of the risk of substantial weakening:


p(n) < 0.02 - very low,
0.02 < p(n) < 0.2 - low,
0.2 < p(n) < 0.5 - average,
0.5 < p(n) < 0.8 - high,
p(n) > 0.8 - very high.
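The gradation can be written as a small lookup, applied here to the probabilities reported in Table 3. The treatment of values that fall exactly on a threshold (assigned to the higher grade) is an assumption of this sketch, since the inequalities in the text leave the boundaries open.

```python
def risk_grade(p):
    """Map a failure probability to the risk grades adopted by the authors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p < 0.02:
        return "very low"
    if p < 0.2:
        return "low"
    if p < 0.5:
        return "average"
    if p < 0.8:
        return "high"
    return "very high"


# Probabilities p(n) as reported in Table 3.
table3 = {
    "C (hand)": 0.34,
    "C (press)": 0.40,
    "G (hand)": 0.19,
    "G (press)": 0.09,
    "G (bag)": 0.74,
}
grades = {name: risk_grade(p) for name, p in table3.items()}
for name, grade in grades.items():
    print(f"{name}: {grade}")
```

With these inputs only the vacuum bag series falls into the "high" grade.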


In accordance with the earlier observations, 
only materials produced by the vacuum method 
indicate a high probability of a decline in strength 
below the permissible safety value. The composites 
produced by hand and pressed are characterized 
by average and high reliability. 


Table 3. Characteristic strength n and lack of reliability p(n) of polymer composites manufactured with various techniques.

Fibre    n [MPa]   Method   σ_0 [MPa]   p(n)
Carbon   434.51    hand     480.89      0.34
                   press    484.53      0.40
Glass    261.86    hand     280.12      0.19
                   press    300.23      0.09
                   bag      241.18      0.74
4 CONCLUSIONS


The technique of making polymer composites influences the likelihood of damage before a certain threshold tensile strength is reached, as is clearly shown by the results for the glass laminates. The smallest probability of such damage is seen in the pressed samples. The probable reason is that the pressure exerted on the material during production yields a composite structure that is more compact and freer from air bubbles than that of the samples made manually. Improved matrix continuity allows an even distribution of stresses over the reinforcement throughout the whole volume of the element.

It was also found that the composite materi- 
als produced by the vacuum bag method (bag) 
showed by far the biggest likelihood of the occur- 
rence of destructive damage prior to reaching the 
set threshold strength. This is surprising since the 
pressure exerted on the laminate in the production 
process should theoretically lead to a decrease in 
such a likelihood, similarly to the pressing method. 
However, these two techniques varied. Due to the 
pressure exerted on the laminate, the matrix may 
have been squeezed out of the reinforcement layers 
and its excess was absorbed by the lignin. This may 
have led to an insufficient volume share of resin in 
the final product. The consequence might be the 
failure to distribute tensions over the whole of the 
reinforcement, which could explain a reduction in 
strength. In the pressing method, resin could not 
find its way out of the manufactured composite. 
Taking into account the peculiarity of the vacuum bag technique, it can therefore be assumed that reducing the negative pressure generated by the pump, or not using lignin, could reduce the probability of sample decohesion by reducing the outflow of the matrix from the laminate at the production stage.

Because the approximated probability distribution functions indicate only slight differences between the examined carbon composites, and because a relatively small number of tests was performed, any conclusion that the manufacturing technique improves or worsens the properties of the carbon laminates would be too far-reaching. The pressure exerted by the press appeared to be insufficient to obtain effects similar to those in the glass composites, owing to the more compact structure of the carbon fabric used as reinforcement. The results obtained therefore do not contradict the previously formulated conclusions, but they cannot confirm them either. Presumably, it is necessary to make composites by the pressing method again, this time with a higher compressive load.
ABSTRACT: The lock mechanism is one of the important and problematic components of an aircraft, and its performance and reliability directly affect the mission and even the safety of the aircraft. However, lock mechanisms are characterized by complex forms, few samples and high reliability, so traditional statistics-based reliability assessment and reliability test methods are not appropriate for them. In addition, damage to the lock components increases with working time, and this damage degrades the performance of the lock mechanism or even causes failure. The reliability of the lock mechanism is therefore a function not only of random variables but also of working time. In this paper, a time-variant reliability analysis method based on physics of failure is put forward. The method considers both time-invariant and time-variant factors, and the degradation model of the components can be obtained either from practical experiments or from a damage mechanism model. Taking a lock as the object of study, the failure modes and failure mechanisms of the lock mechanism were analyzed in combination with its dynamic responses. The Response Surface Method (RSM) was then used to obtain the relationship between the design variables and the dynamic responses. Component damage was regarded as a random process; once the damage degradation law had been obtained, and considering the randomness of all the parameters, the reliability of the lock mechanism as a function of working time was obtained and the reliable life was predicted.


1 INTRODUCTION 


Numerous mechanical systems on aircraft have lock mechanisms, such as the landing gear system and the cargo door system. The reliability of the lock mechanisms has a great influence on mission accomplishment and on the safety of the aircraft. If the lock of the landing gear door does not open at landing time, the landing gear door cannot be put down; if the cargo door lock mechanism cannot open, the airborne or airdrop mission will fail. Several studies have focused on the reliability of the lock mechanism. Ouyang (1994) analyzed the wear depth of the lock hook surface based on Archard's wear model and obtained the trend of the wear reliability with the number of opening and closing cycles. Gu et al. (1995) used the fault tree method to assess the reliability of the landing gear door lock from two aspects, startup and running. Ouyang (1994) focused only on the local wear of the lock mechanism, without considering its impact on the overall performance. Although Gu et al. (1995) analyzed lock reliability with the overall performance in consideration, the degradation process and the physical processes that cause the degradation were ignored. Reliability problems of other mechanisms similar to the lock mechanism have received extensive attention. These studies mainly focused on kinematic accuracy (Tsai et al. 2008, Wang et al. 2011), the joint clearance effect (Rhee & Akay 1996, Erkaya & Yzmay 2012), motion seizure (Ballu 2008) and other aspects. Regarding failures caused by performance degradation, Wang & Chen (2011) analyzed the time-variant reliability of a gear with multiple failure modes and presented the relation between reliability and working time. Jin et al. (2013) used performance degradation data from actual physics-of-failure tests to establish, by the failure-physics method, a model describing the failure process of the MW, and finally predicted its working lifetime. There is little literature on component failure or function failure caused by component damage. In summary, the existing literature does not provide an in-depth study of the time-variant reliability of mechanisms.

The reliability of the lock mechanism is affected by many factors, such as manufacturing and assembly errors, material properties, external load and other time-invariant factors. The lock mechanism therefore has some possibility of failure in the early service period, although in general the design safety margin ensures that the mechanism has high reliability in early service. However, there is another kind of factor whose parameters vary with service time, such as joint wear, material stress relaxation, gasket aging and corrosion. In the early service period their influence on the lock mechanism is not significant, but as service time increases these factors degrade the performance of the mechanism. For example, joint wear increases the hinge clearance and worsens the contact condition, which increases friction; stress relaxation and creep reduce the motion accuracy; corrosion and aging of the sealing parts may degrade the performance of the hydraulic system and reduce the driving ability. The failure probability is therefore a function not only of the random variables but also of service time. As service time grows, the reliability of the mechanism continues to deteriorate until failure occurs.

In this paper, failure modes and failure mechanisms are analyzed, with a lock as the object of study. The paper is organized as follows. The failure modes and failure mechanisms are analyzed based on dynamic simulation results. The failure-physics method is used to establish the damage model of the lock part, and the response surface method (RSM) is used to obtain the deterministic function of the system, which reveals the relationship between system performance and the design variables. The part damage results are then used to update the simulation model. Finally, considering the randomness of the variables and using statistical methods, the reliability degradation law is obtained and the lifespan of the mechanism is predicted.


2 RELIABILITY ESTIMATION METHOD 
BASED ON PHYSICS OF FAILURE 

2.1 Reliability estimation framework based on physics of failure


Unlike electronic systems, mechanisms in aircraft are characterized by complex forms, small sample sizes, long life and high reliability requirements. The traditional statistics-based reliability assessment method is therefore not suitable for aircraft mechanisms, and reliability assessment based on testing requires too much time and cost. To solve these problems, a time-variant reliability assessment framework based on failure physics is presented, as shown in Figure 1.

The main work of time-variant reliability assessment based on the failure-physics method consists of the following aspects:


1. Failure modes and mechanism analysis

In this stage, the failure modes and failure mechanisms are obtained from working principle analysis, failure history and other test information. Failure mode analysis is the foundation of reliability analysis; the common methods are Failure Modes and Effects Analysis (FMEA) and Fault Tree Analysis (FTA). FMEA is a bottom-up method, which begins from component failures and follows a certain logic to find their effect on the system, and then identifies the crucial components and the main failure modes. It is very accurate at the component level but becomes progressively weaker as the fault effect propagates further away from the component into subsystem-level and system-level effects. FTA is a top-down method, which starts from a high-level failure and finds the causes of the fault.

2. Simulation model establishment

A mechanism is a mechanical system consisting of several components and the related joints. For some complex mechanisms, establishing and solving the dynamic models is very difficult. Existing commercial software such as MSC ADAMS and LMS Virtual.Lab, however, offers an easy way to model a mechanism, and the dynamic results can be accurate provided that the simulation model is reasonable. In this paper, the dynamic simulation model of the mechanism was therefore established in LMS Virtual.Lab and then modified and validated against the performance test, so that it gives accurate results compared with the real mechanism. In this way we obtain the physical model that reflects the relationship between the inputs and outputs of the mechanism system.

3. Physics of failure model 

The physics-of-failure model of a mechanism contains a performance function, which reflects the relationship between the inputs and outputs of the mechanism system, and a damage model, which reflects how the geometric and physical parameters change with working time. In the previous step the performance function is implicit. If the simulation model is used directly for reliability analysis, every sample has to be simulated, which is very time consuming: for example, if a single simulation costs 20 seconds, 10000 samples need 200000 seconds (about 56 hours). To obtain an explicit performance function, the response surface method is used, so that only a few samples need to be run in the simulation model. If Bucher's sampling method is applied, the number of samples is 2n+1, where n is the number of random variables.
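
The saving in simulation runs can be illustrated with a small sketch. It assumes the axial design usually associated with Bucher's method, in which the centre point and the points mean ± f·sigma along each variable axis are simulated. The function name `bucher_design`, the factor `f = 3` and the example means and standard deviations are illustrative assumptions, not values prescribed by this paper.

```python
def bucher_design(means, sigmas, f=3.0):
    """Return the 2n+1 axial sample points: the centre plus mean +/- f*sigma per variable."""
    points = [list(means)]  # centre point
    for j, (mu, sd) in enumerate(zip(means, sigmas)):
        for sign in (+1.0, -1.0):
            p = list(means)
            p[j] = mu + sign * f * sd  # move only variable j off the centre
            points.append(p)
    return points

# Hypothetical example with n = 3 random variables.
means = [0.2, 60.0, 5.0]
sigmas = [0.0166, 0.3333, 0.3333]
design = bucher_design(means, sigmas)

seconds_per_run = 20                              # single-run cost quoted in the text
direct_cost_h = 10000 * seconds_per_run / 3600    # direct sampling with 10000 runs, in hours
rsm_cost_s = len(design) * seconds_per_run        # response surface design, in seconds
print(len(design), round(direct_cost_h, 1), rsm_cost_s)
```

With three variables the design needs only seven simulation runs, compared with the 10000 runs of direct sampling.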

Damage in the mechanism appears as joint wear, plastic deformation, corrosion and aging of seal parts, spring stress relaxation and so on. The corresponding damage models can be obtained by fitting degradation data or from existing damage models. Many scholars have done a lot of research in these domains (Wen Tsinghua University press, Su, Tianjin University press). In the reliability analysis process, we can directly choose the appropriate model to apply.

4. Reliability assessment and life prediction

For a given mechanism, it is likely that two or more failure modes exist at the same time, and the failure modes may be related in series, in parallel or in a mixed arrangement. To analyze reliability, a proper reliability model needs to be established. The failure criteria and the distribution parameters of the variables are then substituted into the response surface to obtain the reliability and sensitivity. Finally, a life prediction of the mechanism is made based on the reliability result: the operation time at which the reliability of the mechanism falls below the required reliability is the lifespan of the mechanism.


2.2 Time-variant reliability analysis 
method for mechanism 


Unlike the static response of a structure, the dynamic responses of a mechanism change with its configuration, so the mechanism performance can be described by a set of performance curves, namely $F = F(t)$. These curves can describe position, velocity, acceleration, load, etc. In addition, time-variant factors make the response differ from one motion cycle to the next, namely $F = F(t,n)$, where $t$ represents the running time of the mechanism within a single cycle and $n$ represents the cycle number. In the reliability assessment process it is not necessary to consider the response over the whole cycle, only the maximum value, the minimum value or the value at a certain time of each cycle, namely $F(n) = \max(F(t,n))$, $F(n) = \min(F(t,n))$ or $F(n) = F(t,n)|_{t=t_0}$. When any $F(n)$ leaves its allowed domain, the mechanism fails.

In general, mechanism performance can be represented by a set of response expressions $F_i(n) = \max(F_i(t,n))$, $i = 1,2,\ldots,s$. The performance is mainly influenced by $m$ factors $x_j$, $j = 1,2,\ldots,m$, among which $p$ factors are time-invariant and $q$ factors are time-variant.

In this paper, a linear response surface function is used to describe the relationship between the inputs and outputs of the system. Its form is

$$F_i(\mathbf{x}) = b_0 + \sum_{j=1}^{m} b_j x_j = B\,[1\ \ \mathbf{x}]^{T},\quad i = 1,2,\ldots,s \tag{1}$$


Assume that the time-invariant variables follow normal distributions, the time-variant variables follow normal processes, and the distribution parameters of both are known, i.e.

$$x_j \sim N(\mu_j, \sigma_j^2),\quad j = 1,2,\ldots,p$$
$$x_j \sim N(\mu_j(n), \sigma_j^2(n)),\quad j = 1,2,\ldots,q$$
$$p + q = m$$


Generally, $F(n)$ is a monotonic function. Here it is assumed to be monotonically increasing, i.e. $F(n)$ increases as the mechanism performance degrades. The failure probability, reliability and lifespan of each failure mode can then be represented as

$$P_i(n) = P\{F_i(n) > L_i\}$$
$$R_i(n) = P\{F_i(n) < L_i\} = \int_{-\infty}^{L_i} f(F_i)\,dF_i \tag{2}$$
$$N_i = \inf\{n \mid F_i(n) > L_i,\ n \ge 0\}$$

where $L_i$ represents the failure threshold of each dynamic response and $\inf\{\cdot\}$ is the infimum function.

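All the failure modes of the mechanism can be regarded as being in series, so the failure probability of the system can be written as

$$P = P\Big(\bigcup_{i=1}^{s}\{F_i > L_i\}\Big) = \sum_{i=1}^{s} P_i - \sum_{1\le i<j\le s} P_{ij} + \sum_{1\le i<j<k\le s} P_{ijk} - \cdots + (-1)^{s-1} P_{12\cdots s} \tag{3}$$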


Previously, the performances of a mechanism system were treated independently and the correlations between different failure modes were ignored (Leira et al. 2005, Franchin et al. 2003), i.e. it was assumed that $\rho_{ij} = 0$. The failure probability of the mechanism can then be described as

$$P(n) = \sum_{i=1}^{s} P\{F_i(n) > L_i\} = \sum_{i=1}^{s} P_i \tag{4}$$


However, all components in a mechanism are subjected to a common load environment. Multi-factor coupling and multi-body coupling make the failure modes of a mechanism dependent on each other. The previous method, which ignores the correlations between different failure modes, therefore overestimates the system probability of failure and underestimates the system reliability (Wang et al. 2007, Levitin 2001).

If the correlations between different failure modes are considered, then for a mechanism with two or three failure modes equation (4) can be written as:

$$\text{For } s = 2,\quad P = P_1 + P_2 - P_{12} \tag{5}$$

$$\text{For } s = 3,\quad P = P_1 + P_2 + P_3 - P_{12} - P_{13} - P_{23} + P_{123} \tag{6}$$


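For the functions of Eq. (1), the correlation coefficient $\rho$ and the covariance matrix $C$ between different failure modes can be described as

$$\rho_{F_iF_j} = \frac{\mathrm{Cov}(F_i, F_j)}{\sqrt{D F_i}\,\sqrt{D F_j}} \tag{7}$$

$$C = \begin{bmatrix} \sigma_{F_1}^2 & \rho_{F_1F_2}\sigma_{F_1}\sigma_{F_2} & \cdots & \rho_{F_1F_s}\sigma_{F_1}\sigma_{F_s} \\ \rho_{F_2F_1}\sigma_{F_2}\sigma_{F_1} & \sigma_{F_2}^2 & \cdots & \rho_{F_2F_s}\sigma_{F_2}\sigma_{F_s} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{F_sF_1}\sigma_{F_s}\sigma_{F_1} & \rho_{F_sF_2}\sigma_{F_s}\sigma_{F_2} & \cdots & \sigma_{F_s}^2 \end{bmatrix} \tag{8}$$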


According to probability theory, the probability that $k$ failure modes occur at the same time can be obtained by integrating the joint density function of the multi-dimensional random variables over the failure domain $G$:

For $k = 2$,
$$P_{ij} = \iint_{G} f(F_i, F_j)\,dF_i\,dF_j \tag{9}$$

For $k = 3$,
$$P_{ijk} = \iiint_{G} f(F_i, F_j, F_k)\,dF_i\,dF_j\,dF_k \tag{10}$$


The multivariate normal cumulative distribution function in MATLAB can also be used to calculate the failure probability, in the form

$$P_{ij\ldots} = \mathrm{mvncdf}(L, \mu_F, C) \tag{11}$$
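
The two-mode case of Eq. (5) can be sketched numerically as follows. This is a minimal illustration, not the authors' MATLAB code: it assumes two jointly normal responses, evaluates the bivariate normal CDF by one-dimensional Simpson integration using only the standard library, and uses hypothetical thresholds, means, standard deviations and correlation. The helper names `bvn_cdf` and `system_failure_two_modes` are assumptions of this sketch.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bvn_cdf(a, b, rho, steps=4000):
    """P(Z1 <= a, Z2 <= b) for standard normals with correlation rho (|rho| < 1).

    Uses P = integral over z <= a of phi(z) * Phi((b - rho*z) / sqrt(1 - rho^2)),
    evaluated with Simpson's rule on [-10, a] (steps must be even).
    """
    lo = -10.0  # lower truncation; the neglected tail is negligible
    if a <= lo:
        return 0.0
    h = (a - lo) / steps
    s = math.sqrt(1.0 - rho * rho)
    g = lambda z: phi(z) * Phi((b - rho * z) / s)
    total = g(lo) + g(a)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * g(lo + k * h)
    return total * h / 3.0

def system_failure_two_modes(L, mu, sigma, rho):
    """Return (P1 + P2 - P12, P1 + P2), with failure of mode i defined as F_i > L_i."""
    z1 = (L[0] - mu[0]) / sigma[0]
    z2 = (L[1] - mu[1]) / sigma[1]
    p1 = 1.0 - Phi(z1)
    p2 = 1.0 - Phi(z2)
    # Joint exceedance P(F1 > L1, F2 > L2) equals the joint CDF at (-z1, -z2) by symmetry.
    p12 = bvn_cdf(-z1, -z2, rho)
    return p1 + p2 - p12, p1 + p2

# Hypothetical inputs: thresholds, means, standard deviations, correlation.
p_corr, p_sum = system_failure_two_modes((10.0, 10.0), (8.0, 8.5), (1.0, 1.0), 0.6)
print(p_corr, p_sum)
```

The first returned value includes the joint term of Eq. (5); the second is the plain sum of Eq. (4), which is always at least as large.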


Because the distribution parameters of the time-variant variables change with operation time, the failure probability calculated from equation (6) is a function of the operation cycle. Assuming that the reliability of the system is required to be higher than $R_c$, the lifespan $N$ can be calculated as


$$N = \inf\{n \mid R(n) < R_c,\ n \ge 0\} \tag{12}$$
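
As a minimal sketch of the lifespan search, the function below scans the cycle count and returns the first cycle at which the reliability falls below the required value. The function name `reliable_life`, the search limit `n_max` and the exponential reliability curve in the usage line are hypothetical and only serve to show the mechanics.

```python
import math

def reliable_life(reliability, r_required, n_max):
    """Return the first cycle n in [0, n_max] with reliability(n) < r_required, or None."""
    for n in range(n_max + 1):
        if reliability(n) < r_required:
            return n
    return None

# Hypothetical, monotonically decreasing reliability curve.
curve = lambda n: 0.96 * math.exp(-1.0e-5 * n)
print(reliable_life(curve, 0.9, 20000))
```

A linear scan is adequate here because each evaluation of the response-surface-based reliability is cheap; for a monotonic curve a bisection search would give the same cycle with fewer evaluations.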
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3 FAILURE MECHANISM OF A LOCK 
MECHANISM 


3.1 Working principle of the lock mechanism 


The lock mechanism consists of eleven parts, numbered 1 to 11; the closed state is shown in Figure 3. The opening process can be divided into three phases: (1) the hydraulic system pushes part 2, which makes part 3 rotate clockwise. At the same time, the wheels assembled on part 3 push part 4 and make it rotate clockwise until part 4 and part 5 separate; (2) once part 4 is open, the wheels roll along the upper surface of part 4 and push part 5; at the same time, force is transferred through part 6 to part 7, and part 7 rotates anticlockwise until points A, B and C are in a line; (3) after that, under the action of compression spring 11, part 8 opens automatically, and under the force of spring 10, part 4 rotates anticlockwise and keeps the lock in the open state. The closing process is the reverse of the opening process.


3.2 Failure modes and mechanism analysis of lock 
mechanism 


The lock mechanism has many potential failure modes. To find the most probable ones, a dynamic simulation was carried out; the dynamic responses of the main structure are shown in Figure 2.

The maximum contact force occurs in the first period, and the maximum link force and driving force occur in the second period.

Force analysis for the first period is shown in 
Figure 3, part 4 bear five forces of F, Fa, Fio Fins 
F, among which, only F,, help to open the lock 
and all others prevent it from opening. Here F, is 
determined by lock hook force F,, F,„ is supplied 
by F,, and Fa is determined by F, and friction 
Figure 2. Dynamic simulation results.

Figure 3. Force analysis of the first period during the opening phase.

coefficient f,;. From the direction of F,, which is 
the composition force of F Fa» it can be seen that 
with the increase of f,, the arm of F, to the rota- 
tion axis decreased when the friction coefficient 
J, increased. And the force arm decrease to very 
small value or even zero if f, increases to a certain 
extent. At the same time, wear and impact make the slope surface of the key rough and uneven, which directly affects the magnitude of the normal pressure and the friction force, as well as the direction of the resultant force.

In the performance test, we recorded the hydraulic pressure in the cylinder actuator and the force in the link. After several operation cycles, the driving force and the link force had increased compared with the initial trial. The lock mechanism was then carefully examined, and wear and extrusion marks were found on the slope surface of part 4, while the other parts showed no obvious change. The damage modality of part 4 is shown in Figure 5.

In the second period of lock opening process, 
part 7 contrarotate and make point C move from 
underside to upside of the line made of point A and 
point B, during this process, part 7 suffer F, and F,, 
force analytical model is shown in Figure 4. Here 
F, supplied by hydraulic system act as the driv- 
ing force and F, determined by Fp act as the drag 
force. As point C gradually close to the line form 
underside, distance between point A and point C 
increased, this result in Fp increased because much 
large force needed to conquer the deformation of 
the port. And the radius of the part 8 also influ- 
ences Fp if F, is too large, the hydraulic pressure 
will not large enough to open the lock, this time 
the lock mechanism will get seizure. 

According to the above analysis, the lock mechanism has two main potential failure modes: 1) in the first period, the roller load becomes larger and larger; when the maximum load exceeds the ultimate load, i.e. the maximum load the structure can bear, the structure is damaged, which manifests as roller plastic bending or link buckling; 2) in the second period, link buckling or seizure of the lock mechanism occurs. The failure mechanism is shown in Figure 6.

Figure 6. Failure mechanism of the lock mechanism.

To sum up, the failure modes of the lock mechanism are roller plastic bending in the first period and mechanism seizure in the second period.

Further, as Figure 7 shows, under the effect of degradation factors the characteristic parameters of the lock mechanism increase with working time and cause the reliability to degrade.
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Figure 7. Sketch map for performance degradation. 


4 TIME-VARIANT RELIABILITY 
ANALYSIS OF THE LOCK MECHANISM 


4.1 Parameter characterization of local damage 


When the lock begins to open, the wheel moves quickly along the X direction under hydraulic action, and the wheel assembled on the axis then impacts part 4. Since the hardness of the axis is greater than that of part 4, repeated impacts generate an impact pit on the surface of part 4. The curvature radius of the pit is approximately equal to the wheel radius R; a sketch of the damage modality is shown in Figure 8. Here h represents the depth of the impact pit, and its value is equal to the difference between R and D.


4.2 Reliability model of lock mechanism and 
distribution parameters of variables 


4.2.1 Reliability model of lock mechanism 

Section 3.2 shows that the potential failure modes of the lock mechanism are roller failure, manifested as roller plastic bending in the first period, and mechanism seizure in the second period. The analysis shows that the critical force $F_{01}$ leading to roller plastic bending is 8346 N and the maximum driving force $F_{02}$ is 10000 N. To simplify the analysis, the dispersion of these values is ignored. The failure criteria of the failure modes can then be described as


$$g_1(X,n) = F_{01} - F_1(X,n) < 0 \tag{13}$$


$$g_2(X,n) = F_{02} - F_2(X,n) < 0 \tag{14}$$


According to Eq. (5), the failure probability of the lock mechanism can be written as


$$P_f(n) = P\{g_1 < 0\} + P\{g_2 < 0\} - P\{g_1 < 0 \cap g_2 < 0\} \tag{15}$$


In the above equations, $F_1(X,n)$ is the maximum contact force in the nth cycle and $F_2(X,n)$ is the force needed to open the lock in the nth cycle.


Figure 8. Depiction of the damage modality.


4.2.2 Distribution parameters of variables 
The failure analysis of the lock in Section 3 shows that four factors affect the maximum roller force in the opening process: the hook load F, the friction coefficient between key and roller f1, the friction coefficient between key and rocker f2, and the slope of the key α. All variables are assumed to follow normal distributions.

The hook load is a random variable affected by the flight state of the aircraft and is unrelated to the number of lock operations.

The performance test shows almost no wear between key and rocker, so the friction coefficient between key and rocker can be regarded as a time-invariant variable.


4.3 Performance function modeling 
based on response surface method 


For a complex mechanism, it is impractical to establish and solve the dynamic equations, or to obtain the dynamic responses and their relationships with the design parameters through tests, because of limitations of time and cost. Dynamic simulation is a good way to overcome these difficulties.

The lock simulation model is established on the LMS Virtual.Lab platform, and test data are then used to modify the simulation model until the results are precise enough, which yields an effective simulation model.

According to the distribution parameters of the variables, the Bucher sampling method is adopted to generate the samples; all samples are then run in the simulation model to obtain the corresponding results.

From equation (1), the performance functions of the maximum contact force in the first period and the maximum driving force in the second period are obtained as


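$$F_1(X,n) = B_1\,[1\ \ f_1\ \ f_2\ \ \alpha\ \ F\ \ h]^{T} \tag{16}$$


$$F_2(X,n) = B_2\,[1\ \ \Delta R\ \ F]^{T} \tag{17}$$


where


$$B_1 = [-6295.1208\ \ 4418.1333\ \ 8495.65\ \ 78.8655\ \ 363.0785\ \ 191.5783]$$
$$B_2 = [8291.466\ \ 132.1717\ \ 34.691]$$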

From $B_1$ it can be seen that all the coefficients are positive except the constant term, which means that the contact force increases with each parameter. The same holds for $B_2$. This conclusion agrees with intuition.

The measured values and fitted values are compared in Figure 9. The two sets of values almost coincide, which shows that the linear response surface is precise enough to replace the simulation model.
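
Evaluating the linear response surfaces at the mean inputs gives a quick plausibility check. The sketch below is an illustration only: the assignment of the nine listed coefficients to the variables (six for the contact force, three for the driving force) is an assumption inferred from the layout of the coefficient list, and the mean inputs at n = 0 are taken from Table 1. Under that assumption the results are close to the constant terms 2128.14 and 8464.921 of the mean values reported in Section 4.4.

```python
# Coefficient vectors of the linear response surfaces (numbers from the text;
# which number belongs to which variable is an assumption of this sketch).
B1 = [-6295.1208, 4418.1333, 8495.65, 78.8655, 363.0785, 191.5783]  # [1, f1, f2, alpha, F, h]
B2 = [8291.466, 132.1717, 34.691]                                    # [1, dR, F]

def response(B, x):
    """Evaluate B . [1, x]^T for a list of input values x."""
    return B[0] + sum(b * xi for b, xi in zip(B[1:], x))

# Mean inputs at n = 0 cycles: f1 = 0.04, f2 = 0.2, alpha = 60 deg, F = 5 kN, h = 0 mm
mean_contact_force = response(B1, [0.04, 0.2, 60.0, 5.0, 0.0])
# Mean inputs for the driving force: dR = 0 mm, F = 5 kN
mean_driving_force = response(B2, [0.0, 5.0])
print(round(mean_contact_force, 2), round(mean_driving_force, 3))
```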


4.4 Reliability analysis and life-span prediction 


Substituting the distribution parameters into the limit state equations, the mean value and standard deviation of each response can be obtained.


$$\mu_{F_1}(n) = 2128.14 + 0.4418n + 0.0192n^{1/3}$$


O(N) 
= 2.0189 x 10° +13.98687 +0.1748n? + 0.003677? 


$$\mu_{F_2}(n) = 8464.921$$
$$\sigma_{F_2}(n) = 881.2249$$


The distributions of the driving force and contact force in different work cycles are shown in Figure 10. The driving force distribution is constant during the lifetime, while the contact force distribution changes continuously: its mean value increases markedly with the number of work cycles, and its deviation also increases slowly.

Using the reliability analysis method presented above, the reliability of the lock mechanism was obtained, as Figure 11 shows. The reliability of failure mode II (motion seizure in the second period) is a constant 0.9592 during the lifetime, while the reliability of failure mode I (wheel axis damage) decreases as the number of work cycles increases; the change becomes much more obvious after 1000 work cycles.
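
The constant reliability of failure mode II can be checked with a short calculation. The sketch assumes that the force needed to open the lock is normally distributed with the mean 8464.921 N and standard deviation 881.2249 N given above, and that failure occurs when it exceeds the 10000 N maximum driving force of Section 4.2.1; the variable names are illustrative.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

F02 = 10000.0        # maximum driving force, N
mu_F2 = 8464.921     # mean of the force needed to open the lock, N
sigma_F2 = 881.2249  # standard deviation of that force, N

# Reliability of mode II: probability that the required force stays below F02.
reliability_mode_2 = normal_cdf((F02 - mu_F2) / sigma_F2)
print(round(reliability_mode_2, 4))
```

Under these assumptions the result agrees with the value of about 0.959 quoted for failure mode II.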

Another conclusion is that at the beginning of the service time failure mode II is the main failure mode, while after 1000 work cycles failure mode I becomes the main failure mode.
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Table 1. 


Summary statistics of the basic variables in the model.

Variables   Mean, μ            Standard deviation, σ   Max    Min     Average
f1          0.04+0.0001*n      0.1*μ                   0.1    0.04    0.07
f2          0.2                0.0166                  0.25   0.15    0.2
α/deg       60                 0.3333                  61     59      60
ΔR/mm       0                  0.00667                 0.02   −0.02   0
F/kN        5                  0.3333                  6      4       5
h/mm        0.0001*n^(1/3)     0.1*μ                   0.3    0       0.15

Figure 9. Comparison of measured values and fitted values (contact force and driving force versus sample number).

Figure 10. Driving force and contact force distribution (force in N; driving force, and contact force at the beginning and after 500 work cycles).

Figure 11. Degradation law of lock reliability (reliability versus work cycle for failure mode I, failure mode II, and the system with and without considering the correlation between failure modes).

From the system reliability curve in Figure 11, it can be seen that the lock reliability is 0.9592 at the beginning and decreases slowly over the following 1000 work cycles, after which it decreases quickly. Given the requirement that the reliability must be higher than 0.9, the reliable life is 5027 work cycles.


5 CONCLUSIONS


1. To solve the problems of the traditional statistical method and test method for lock mechanism reliability analysis, a time-variant reliability analysis method was presented, which is based on virtual simulation and takes component damage into consideration.

2. Based on the dynamic simulation results, the failure modes and failure mechanism were analyzed. The two main potential failure modes of the lock mechanism were found to be wheel axis failure and motion seizure, and the factors influencing these failure modes were then analyzed.

3. The performance functions were obtained through RSM, and the time-variant reliability of the lock mechanism was evaluated taking lock key damage and the randomness of the variables into consideration. The results show that the lock reliability is 0.9592 in the early stage. After 1000 operations, the reliability decreases to 0.9451. As the reliability is required to be higher than 0.9, the reliable operation life is 5027 operations.
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A metal-oxide-semiconductor devices reliability assessing method 
based on physics of failure 


Hantian Gu, Ming Zhu, Wei Zhang, Lei Zhang, Hengjing Zhu & Min Tang 
China Aerospace Components Engineering Center, CASC, Beijing, P.R. China 


ABSTRACT: With the development of Metal-Oxide-Semiconductor (MOS) devices, reliability is becoming a key differentiator in a competitive market. Considering the different stress types and levels during application, an algorithm is proposed to evaluate the reliability of MOS devices under complicated operation conditions. The algorithm is based on the theory of Physics of Failure (PoF). Failure modes, mechanisms and effects analysis is conducted to identify the potential failure mechanisms and physics models. Then, through modeling and simulation, the thermal, mechanical and electrical parameters of MOS devices under different conditions are obtained. Using the physics models, cumulative damage theory and a random sampling algorithm, the matrix of time to failure of each unit is obtained. Finally, a competing failure model is applied to obtain the time to failure of MOS devices under working conditions. The algorithm provides a new PoF-based approach for evaluating the reliability of MOS devices under complex environments.


1 INTRODUCTION 


Reliability is an important requirement for almost all users of integrated circuits (ICs). Scaling for enhanced performance and cost reduction has pushed the materials of existing MOS devices much closer to their intrinsic physical and reliability limits. Besides, a newly developed semiconductor technology node cannot be released to users without going through rigorous reliability evaluation (Saeidi et al., 2013, Hava et al., 2013). All of this demonstrates that reliability evaluation for MOS devices is necessary and faces a huge challenge.

For reliability evaluation, a common and realistic practice is to perform accelerated tests under conditions that are much more severe than the operation conditions. The severe conditions in accelerated tests generally mean a much higher stressing temperature, a much higher stressing current, or both. After the time-to-failure and failure distribution under severe conditions are obtained, the reliability under normal operation is evaluated by extrapolation through acceleration models. However, this approach has two disadvantages. The first is that only one failure mechanism is considered during an accelerated test, so the extrapolated time-to-failure under normal operation is associated only with that mechanism. The second is that the devices are always operated at a constant stress level, which does not match actual operating conditions. In other words, the results are qualitative and cannot be used to evaluate the lifetime of devices in the actual environment.

Therefore, this paper puts forward a new algorithm to evaluate the reliability of MOS devices, which considers the different failure mechanisms and stressing conditions over the entire life of the devices.

In the next section of the paper, the basic theories associated with the algorithm are introduced. In the third section, the procedure of reliability evaluation for MOS devices is given and the detailed algorithms are described. The fourth section is a case study, in which each step is described in detail for a simple MOS device and the simulation results are discussed. Finally, the paper is concluded and the future of this methodology is considered.


2 BASIC THEORIES 


The methodology put forward in this paper is based on Physics-of-Failure (PoF). The concept of PoF, also known as Reliability Physics, involves the use of degradation algorithms that describe how physical, chemical, mechanical, thermal, or electrical mechanisms evolve over time and eventually induce failure (Borgarino et al., 2001, Pecht and Gu, 2009). In this paper, PoF is used to establish the relationship between the operation environment and the degradation of the performance parameters of MOS devices. The failure modes considered fall into three groups according to their origin: (i) failure modes associated with the chip; (ii) failure modes resulting from the leads; (iii) failure modes due to overstress. We first analyze the failure modes and mechanisms that have a higher probability of occurrence and a greater impact on the performance of the MOS device. Then we calculate the stress of the device under operation conditions through simulation analysis, to find the causes that may lead to failure. Finally, the time-to-failure (TTF) is predicted by PoF models. Hence, all of the work in this paper is built on PoF.


3 ALGORITHM 


3.1 General algorithm 


As shown in Figure 1, the input of this algorithm consists of four parts: parameters of structure, material properties, potential failure mechanisms and local stress under operation conditions.


1. Parameters of structure include dimensions of 
packaging and die; 

2. Material properties describe the deformation of 
material under load; 

3. Failure Mode, Mechanism and Effects Analysis (FMMEA) helps to understand the potential failure modes, mechanisms and models of MOS devices under operation conditions.

4. Local stress under operation conditions will be 
obtained from simulation analysis, including 
thermal analysis, random vibration analysis and 
electronic parameters analysis. 


Based on these inputs, the main part of the algorithm consists of random sampling, cumulative damage calculation and data processing. Considering the uncertainty of size, material properties and operation conditions in practice, we treat the uncertain parameters as random variables described by appropriate statistical distributions, which may be sampled using Monte Carlo methods. After sampling we get the matrix of the uncertain parameters for the failure mechanisms. In practice, MOS devices experience variable amplitude loading, and cumulative damage theory allows us to calculate the damage for each cyclic loading. We thus obtain the matrix of TTF of the simulation units for each critical failure mechanism under operation conditions. Finally, the TTF of the MOS device is calculated from the matrix of TTF of the simulation units by data processing.

Figure 1. The workflow of reliability evaluation by simulation (inputs: parameters of structure, material properties, potential failure mechanisms and local stress under operation conditions, followed by random sampling).

In the following sections, we discuss this methodology in detail and give the algorithms for each phase.


3.2 Input information 


The input information is divided into the four classes listed in Section 3.1: parameters of structure, material properties, potential failure mechanisms and local stress under operation conditions.

Parameters of structure

Different approaches are used to obtain the structure parameters. Table 1 shows the approaches used to extract them.


3.3 Material properties 


Material properties include the density, elastic modulus, Poisson's ratio, conductivity, coefficient of thermal expansion and so on. They are easy to obtain from handbooks.


Table 1. Approaches of parameters extraction.

Parameters                 Sources
Parameters of packaging    Datasheets; measurement
Parameters of die          Design files of chip layout; measurement
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3.4 Potential failure mechanisms 


Considering the interaction among performance, physical characteristics of products, materials and environmental stress, failure mode, mechanism and effects analysis (FMMEA) can be used to determine the potential failure mechanisms and models. In addition, the priority of the failure mechanisms can be determined based on their severity and probability of occurrence.

After FMMEA, the critical failure mechanisms, i.e. those with a higher probability of occurrence and a greater impact on the performance of the specific MOS device, and their models are determined. They are the focus of analysis and calculation in the following phases.


3.4.1 Local stress under operation conditions 
Since the operation condition parameters describe the environment in which the system works, which differs from the working environment of the MOS devices themselves, simulation modeling and analysis are necessary. The local stress of the MOS devices is given by simulation modeling and analysis. In this phase, three simulations are indispensable: thermal analysis, random vibration analysis and electrical parameters analysis.


• Thermal analysis

Thermal analysis is used to obtain the local temperature distribution under specific operation conditions. The solution procedures in this paper are based on computational fluid dynamics (CFD) techniques, which are concerned with the numerical simulation of fluid flow, heat transfer and related processes.

• Random vibration analysis

Random vibration analysis is used to obtain the equivalent stress, equivalent strain and model under specific operation conditions. The solution procedures in this paper are based on finite element analysis (FEA) techniques, which are concerned with the force-balance equation, deformation compatibility equation and material properties.

• Electrical parameters analysis

Electrical parameters analysis is used to obtain the related electrical parameters of the MOS structure under a specific bias voltage, such as the current density for electromigration and the oxide field for TDDB.


3.5 Algorithm of random sampling 


Considering the uncertainty of size, material properties and operation conditions in practice, we treat the uncertain parameters as random variables described by appropriate statistical distributions, which may be sampled using Monte Carlo methods. Here, the parameters of PoF models are subject to various kinds of uncertainty, which may include: uncertainty of material properties, the effect of variations in the manufacturing process and the uncertainty associated with stochastic fluctuations of operational stresses. To generate samples that fit a given distribution, the algorithm for random sampling is:


1. Calculate the inverse function of the cumulative distribution function (CDF), $F^{-1}(x)$;

2. Generate samples of the uniform distribution on the interval (0,1), $a = (a_1, a_2, \ldots, a_n)$;

3. Calculate $x = F^{-1}(a)$; the vector $x$ contains the required samples.
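
The three sampling steps can be sketched as follows. The Weibull distribution is chosen here only because its inverse CDF has a closed form; the distribution, its scale and shape values and the function names are assumptions of this sketch, not choices stated in the paper.

```python
import math
import random

def weibull_inverse_cdf(u, scale, shape):
    """Step 1: inverse of F(x) = 1 - exp(-(x/scale)^shape)."""
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)

def sample_by_inverse_cdf(inverse_cdf, n, rng):
    """Steps 2 and 3: draw u ~ U(0,1) and map each draw through the inverse CDF."""
    return [inverse_cdf(rng.random()) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = sample_by_inverse_cdf(lambda u: weibull_inverse_cdf(u, 100.0, 2.0), 1000, rng)
print(len(samples), min(samples) >= 0.0)
```

Any distribution with a computable inverse CDF can be substituted by passing a different `inverse_cdf` function.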


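The matrix of the uncertain parameters for a specific failure mechanism is


$$\boldsymbol{\alpha} = \begin{bmatrix} \alpha_{11} & \alpha_{12} & \cdots & \alpha_{1n} \\ \alpha_{21} & \alpha_{22} & \cdots & \alpha_{2n} \\ \vdots & \vdots & & \vdots \\ \alpha_{m1} & \alpha_{m2} & \cdots & \alpha_{mn} \end{bmatrix} \tag{1}$$

Here, $\alpha_{mn}$ means the nth sample of uncertain parameter $\alpha_m$.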

3.6 Algorithm of cumulative damage 


In practice, MOS devices do not experience just a few separate stress levels during their lifetime, but a life profile in which the different stress levels are arranged in a certain order. Calculating the TTF under each stress level separately cannot describe the actual situation in application, so a cumulative damage analysis is necessary. In this paper, we introduce two approaches: the acceleration factor method and the cumulative damage rule method.


• Acceleration Factor Algorithm (AFA)
The acceleration factor is a multiplier that relates a product's life at an accelerated stress level to its life at the use stress level. AFA is suitable for failure mechanisms such as electromigration, hot carrier injection and gate-oxide time-dependent dielectric breakdown. To account for the influence of a life profile with several stress levels, we first use the acceleration factor to convert it into a new profile with only one stress level, which we call the reference stress level. In terms of temperature, the algorithm is as follows:


1. Start;

2. Input the temperature profile as two vectors $t = [t_1, t_2, \ldots, t_n]$ and $T = [T_1, T_2, \ldots, T_n]$, the reference temperature $T_0$, and set t_trans = 0;

3. $i = 1$;

4. If $T_{i+1} = T_i$, $t' = (t_{i+1} - t_i) \times AF(T_i)$; otherwise, $t' = \int_{t_i}^{t_{i+1}} AF(T(s))\,ds$. In these formulas, $AF(T)$ is the acceleration factor of temperature $T$ relative to $T_0$;

5. Calculate t_trans = t_trans + t';

6. $i = i + 1$;

7. Repeat steps 4) to 6) until $i = n - 1$.

Figure 2. A profile of temperature used to help describe AFM.


The TTF under the operation condition of the new profile is then calculated, and it is easily transferred to the TTF under the original profile.

For example, consider the profile in Figure 2. In this profile there are two different stress levels, T1 and T2, with dwell times t1 and (t3 − t2), respectively. The two stages with temperature change last (t2 − t1) and (t4 − t3). Assuming that we transfer them to the stress level T0, the formula is


$$t_{\mathrm{trans}} = t_1 \times AF(T_1) + \int_{t_1}^{t_2} AF(T(t))\,dt + (t_3 - t_2) \times AF(T_2) + \int_{t_3}^{t_4} AF(T'(t))\,dt \tag{2}$$


Here, t_trans is the equivalent time of the original profile under the stress level T0; AF(Ti) is the acceleration factor of temperature Ti which has been transferred to T0; T(t) and T’(t) are the functions of T vs. t during the two temperature-change stages.

Calculate the TTF under temperature T0; then the TTF under the original profile is,


$$TTF_o = \left\lfloor \frac{TTF}{t_{\mathrm{trans}}} \right\rfloor \times t_4 \qquad (3)$$

Here, $\lfloor * \rfloor$ denotes the largest integer that is no more than *; $TTF$ is the time-to-failure of the specific MOS structure under T0 in terms of a certain mechanism; $TTF_o$ is the time-to-failure of the specific MOS structure under the original profile.
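
The conversion described by equations (2) and (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not give the form of AF(T) at this point, so an Arrhenius-type factor is assumed here, ramps are taken as linear in time, and the function names (`arrhenius_af`, `equivalent_time`, `ttf_original_profile`) are hypothetical.

```python
import math

K_B = 8.617e-5  # Boltzmann constant in eV/K


def arrhenius_af(temp_c, ref_c, ea_ev):
    """Acceleration factor of temp_c relative to ref_c (assumed Arrhenius form)."""
    t, t0 = temp_c + 273.15, ref_c + 273.15
    return math.exp(ea_ev / K_B * (1.0 / t0 - 1.0 / t))


def equivalent_time(times, temps, ref_c, ea_ev, steps=200):
    """Equivalent duration of one profile period at the reference temperature.

    times and temps are the breakpoints of a piecewise-linear profile.
    """
    t_trans = 0.0
    for i in range(len(times) - 1):
        dt = times[i + 1] - times[i]
        if temps[i + 1] == temps[i]:
            # Dwell: constant temperature, so t' = dt * AF(T_i).
            t_trans += dt * arrhenius_af(temps[i], ref_c, ea_ev)
        else:
            # Ramp: integrate AF(T(s)) ds with the trapezoidal rule.
            total = 0.0
            for s in range(steps + 1):
                temp = temps[i] + (temps[i + 1] - temps[i]) * s / steps
                weight = 0.5 if s in (0, steps) else 1.0
                total += weight * arrhenius_af(temp, ref_c, ea_ev)
            t_trans += total * dt / steps
    return t_trans


def ttf_original_profile(ttf_ref, t_trans, period):
    """Equation (3): whole profile periods survived, times the period length."""
    return math.floor(ttf_ref / t_trans) * period
```

For a profile held at the reference temperature throughout, the equivalent time equals the real duration, which is a quick sanity check on any implementation of this step.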


• Cumulative Damage Rule Algorithm (CDRA)
For mechanisms such as random vibration fatigue, CDRA is more appropriate for solving the cumulative damage problem. Considering the order and interaction effects of variable amplitude loading, the Corten-Dolan approach is chosen to calculate the TTF of these failure mechanisms. The formula is,


$$N_f = \frac{N_1}{\sum_{i=1}^{m} \alpha_i \left( \dfrac{\sigma_i}{\sigma_1} \right)^{d}} \qquad (4)$$

Here, $N_f$ is the cycles to failure for variable amplitude loading; $N_1$ is the cycles to failure for the maximum amplitude loading; $\sigma_i$ is the equivalent stress of leads for the i-th amplitude loading, which has been obtained from the random vibration analysis of phase 4; $\sigma_1$ is the equivalent stress of leads for the maximum amplitude loading; $d$ is a constant that is decided by materials; $m$ is the number of stress levels; $\alpha_i$ is a percentage that can be given from,

$$\alpha_i = \frac{\sigma_i}{\sigma_1 + \sigma_2 + \cdots + \sigma_m}, \quad i = 1, 2, \cdots, m \qquad (5)$$

According to the life profile of random vibration, $N_f$ is easy to transfer to TTF.
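
As a rough illustration of the Corten-Dolan rule, the following Python sketch evaluates equation (4). It assumes the weights of equation (5) are read as each stress level's share of the summed equivalent stresses, and that the largest entry of `stresses` is the maximum-amplitude level; the function name and arguments are hypothetical.

```python
def corten_dolan_cycles(n1, stresses, d):
    """Cycles to failure under variable-amplitude loading (sketch of equation (4)).

    n1: cycles to failure at the maximum amplitude loading
    stresses: equivalent stress of each amplitude level
    d: material constant
    """
    sigma_max = max(stresses)
    total = sum(stresses)
    # Assumed reading of equation (5): alpha_i = sigma_i / (sigma_1 + ... + sigma_m)
    alphas = [s / total for s in stresses]
    denom = sum(a * (s / sigma_max) ** d for a, s in zip(alphas, stresses))
    return n1 / denom
```

With a single stress level the denominator is 1 and the result reduces to the constant-amplitude life `n1`; adding lower-amplitude levels increases the predicted cycles to failure.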

Then we can calculate the TTF using the algorithm of PoF models and cumulative damage theory (Hall and Strutt, 2003, Haggag et al., 2000).

Assuming a specific PoF model can be expressed 
as, 


TTF = f(S, M, E) (6) 


Here, TTF is short for time-to-failure, and S, M and E denote the parameters of structure, the properties of materials and the operation conditions, respectively. So the vector of TTF of a simulation unit for a specific failure mechanism under a certain stress level is,


TTF = f(S, M, E) (7) 


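Calculate the TTFs for each critical failure mechanism obtained from phase 2; the matrix is,

$$\mathbf{TTF}_{k \times n} = \begin{bmatrix} TTF_{11} & TTF_{12} & \cdots & TTF_{1n} \\ TTF_{21} & TTF_{22} & \cdots & TTF_{2n} \\ \vdots & \vdots & & \vdots \\ TTF_{k1} & TTF_{k2} & \cdots & TTF_{kn} \end{bmatrix} \qquad (8)$$

Here, k is the number of critical failure mechanisms of the MOS device (row i holds the samples for the i-th mechanism), and n is the number of samples.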

According to matrix (8), choose different cumulative damage approaches for different failure mechanisms; the matrix of the TTF of the simulation unit for each critical failure mechanism under the life profile is,

$$\mathbf{TTF}_o = \begin{bmatrix} TTF_{o11} & TTF_{o12} & \cdots & TTF_{o1n} \\ TTF_{o21} & TTF_{o22} & \cdots & TTF_{o2n} \\ \vdots & \vdots & & \vdots \\ TTF_{ok1} & TTF_{ok2} & \cdots & TTF_{okn} \end{bmatrix} \qquad (9)$$


3.7 Algorithm of data processing 


Now the analysis objects are simulation units, 
which are parts of MOS devices, such as MOS 
structure. In order to evaluate the TTF of MOS 
devices, the following work is necessary. 


• TTFs for single mechanism to TTFs for multiple mechanisms

At first, we need to transfer the TTF vectors for single mechanisms of a given simulation unit into a vector for multiple mechanisms of that simulation unit. The algorithm for distribution fitting and testing is as follows:

1. Start;
2. Set $i = 1$;
3. Extract the vector $TTF_o(i,:) = [TTF_{oi1}, TTF_{oi2}, \cdots, TTF_{oin}]$ from matrix (9);
4. Assume that the data of the vector obey a certain distribution;
5. Calculate the distribution parameters $\theta_i = [\theta_{i1}, \theta_{i2}, \theta_{i3}]$;
6. Carry out hypothesis testing and repeat steps 4) and 5) until the null hypothesis is accepted;
7. $i = i + 1$;
8. Repeat steps 3) to 7) until $i > k$;
9. Then we get the matrix of distribution parameters of the simulation unit,

$$\boldsymbol{\theta} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \\ \vdots & \vdots & \vdots \\ \theta_{k1} & \theta_{k2} & \theta_{k3} \end{bmatrix} \qquad (10)$$

   here, if the number of distribution parameters of some failure mechanism is p and p < 3, the parameters $\theta_{iq} = 0$ $(q = p + 1, \cdots, 3)$;
10. Set $j = 1$;
11. Sample randomly according to the distribution parameters $\theta(i,:) = [\theta_{i1}, \theta_{i2}, \theta_{i3}]$ to get the vector $t_j = [t_{1j}, t_{2j}, \cdots, t_{kj}]$;
12. Apply the competing failure model to get the minimum value, $t_{mj} = \min[t_{1j}, t_{2j}, \cdots, t_{kj}]$;
13. $j = j + 1$;
14. Repeat steps 11) and 12) until $j > n$;
15. Now we get the TTF vector for multiple mechanisms of the simulation unit, $t_m = [t_{m1}, t_{m2}, \cdots, t_{mn}]$;
16. End.
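
Steps 10 to 15 amount to drawing one time-to-failure per mechanism and keeping the smallest. A minimal Python sketch follows, assuming for illustration that every mechanism has already been fitted with a two-parameter Weibull distribution (the paper allows other distribution types); the function name and parameter layout are hypothetical.

```python
import random


def competing_failure_ttf(weibull_params, n_samples, seed=0):
    """Sample one TTF per mechanism and keep the minimum, n_samples times.

    weibull_params: list of (scale, shape) pairs, one per failure mechanism.
    """
    rng = random.Random(seed)
    ttf = []
    for _ in range(n_samples):
        # One random draw per failure mechanism (step 11).
        draws = [rng.weibullvariate(scale, shape) for scale, shape in weibull_params]
        # Competing failure model: the earliest failure decides (step 12).
        ttf.append(min(draws))
    # Sorted in ascending order, ready for distribution fitting.
    return sorted(ttf)
```

The returned vector plays the role of the multi-mechanism TTF vector of step 15; fitting and hypothesis testing would then be applied to it with a statistics package of choice.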


• TTFs for each simulation unit to TTFs for MOS device
In order to obtain the TTFs for MOS devices, repeat the algorithm listed above. At last, we obtain the vector of TTF for the MOS device,

$$\mathbf{TTF}_d = (TTF_{d1}, TTF_{d2}, \cdots, TTF_{dn}) \qquad (11)$$

The last task is distribution fitting and testing for the vector $\mathbf{TTF}_d$. The vector is sorted into ascending order and fitted to a suitable distribution to obtain the TTF of the MOS device. Assuming that the vector is fitted to a distribution f(t), the TTF of the MOS device is,


TTF = O (12) 


4 CASE STUDY 


In this paper, we select a MOS device in a plastic dual in-line package (DIP), which provides the system designer with a direct implementation of the NOR function, with a weight of 1.1 g and a power dissipation of 200 mW. In the following, we describe the process of reliability evaluation by simulation in detail with this device.


4.1 Input information 


4.1.1 Parameters of structure 

At first, it is necessary to extract the parameters of structure. According to the datasheet and design files, they are listed in Table 2.


1. Structural parameters 


4.1.2 Material properties 

This part includes the parameters of weight, power dissipation, material properties and so on.

Figure 3. The MOS device which is applied in the case study.

Table 2. Structural parameters for MOS device.

Group                 Parameter                     Value
Packaging parameters  MATERIAL                      Plastic
                      LENGTH                        19.3 mm
                      WIDTH                         6.2 mm
                      THICK                         3.3 mm
Lead parameters       MATERIAL                      Copper alloy
                      I/O                           14
                      PITCH                         2.4 mm
                      L,                            0.6 mm
                      R                             0.3 mm
                      L,                            2.4 mm
                      L,                            3.2 mm
                      t                             0.2 mm
                      W,                            1.7 mm
                      W,                            0.5 mm
Die                   DIE_LENGTH                    4.3 mm
                      DIE_WIDTH                     [illegible]
                      DIE_THICKNESS                 0.2 mm
                      [illegible] INTERCONNECTS     [illegible]
                      THICKNESS OF INTERCONNECTS    0.8 µm
Solder parameters     Solder Height                 1.6 mm
                      Solder Joint Bond Area        1.58 mm²

Table 3. Material properties of MOS device.

Material              Property                                           Value
Plastic of Packaging  Density                                            1206 kg/m³
                      Elastic Modulus in X direction                     15900 MPa
                      XY Poisson’s Ratio                                 0.25
                      Conductivity                                       0.67 W/m*°C
                      Coefficient of Linear Thermal Expansion            15e-006/°C
Copper Alloy (C197)   Density                                            8360 kg/m³
                      Elastic Modulus in X direction                     118410 MPa
                      XY Poisson’s Ratio                                 0.3
                      Conductivity                                       0.67 W/m*°C
                      Coefficient of Linear Thermal Expansion            16.8e-006/°C
                      Yield Strength                                     40 MPa
                      Tensile Strength                                   450 MPa
                      Activation Energy                                  1.64 eV
[illegible]           Elastic Modulus in X direction                     17200 MPa
                      XY Poisson’s Ratio                                 0.11
                      Conductivity                                       0.2 W/m*°C
                      Coefficient of Linear Thermal Expansion            17.6e-006/°C
                      Tensile Strength                                   276 Pa
                      Impact Amplification Factor                        1.0
                      Activation Energy of Interconnects                 0.59 eV
                      Thermal Activation Energy of MOS Structure (TDDB)  0.1 eV
                      Activation Energy (HCI)                            0.05 eV

Figure 4. This figure illustrates the structural parameters: a) for packaging parameters and b) for lead parameters.

According to the results of FMMEA, the potential failure mechanisms which have a greater impact on the performance of the specific MOS device include: solder joint thermal fatigue,


random vibration fatigue, corrosion, gate-oxide 
time dependent dielectric breakdown (TDDB), 
electromigration, hot carrier injection (HCI) and 
shock. So in the following steps, we focus on these 
seven failure mechanisms and evaluate the reliabil- 
ity with them. In Table 4, the failure mechanisms 
and PoF models are listed. 


4.1.3 Local stress under operation conditions 
At first, we need to introduce the operation conditions. The MOS device is located in the middle of a printed circuit board (PCB) with dimensions of 120 mm × 80 mm × 2 mm. The operation conditions consist of three types of stress: ambient temperature, power spectral density of random vibration and relative humidity. The profile of the MOS device during its lifetime is given in Figure 5, Table 5 and Table 6.

The relative humidity is always 20%.

According to the parameters of structure and operation conditions, the CAD, CFD and FEA models are shown in Figure 6.

We carry out thermal analysis, random vibration analysis and electrical parameters analysis; the results are listed in Table 7.


4.2 Random sampling 


Considering the uncertainty of parameters, the Monte Carlo approach is widely used. The distribution types of the different parameters are given in Table 8.

In the following, we introduce how to carry out the Monte Carlo simulation to obtain the vector of TTF for electromigration.

At first, obtain 10000 samples of the uncertain parameters (W, d, T and j) by the Monte Carlo approach as follows,


[ 0.790114832, 0.798267848, 0.79834634, 0.800647454, ..., 0.799123104 ]
[ 0.981133085, 1.007929821, 1.008361467, 1.000597051, ..., 1.015567806 ]
[ 55.84119123, 56.5393662, 55.23497947, 56.09943779, ..., 55.60634757 ]
[ 0.298663927, 0.302170728, 0.301507547, 0.299965692, ..., 0.299706913 ]

Table 4. PoF models for potential failure mechanisms.

Failure mechanism                          Formula                                                                Remark
Shock                                      $Z_{\mathrm{allow}} \ge Z_{\max}$                                      -
Solder Joint Thermal Fatigue               $N_f = 0.5\,(\Delta\gamma / 2\varepsilon'_f)^{1/c}$                    Coffin-Manson
Random Vibration Fatigue                   $N_f = C\,[z_1 / (z_2 \sin(\pi x)\sin(\pi y))]^{1/b}$
Electromigration                           $MTTF = \dfrac{W d\,T^{m}}{C j^{n}} \exp(E_a/kT)$                      Black’s equation
Gate-Oxide Time Dependent Dielectric       $MTTF = \tau_0 \exp(-\gamma E_{ox}) \exp(E_a/kT)$                      E model
Breakdown (TDDB)                           $MTTF = \tau_0 \exp(G/E_{ox}) \exp(E_a/kT)$                            1/E model
Hot Carrier Injection (HCI)                $MTTF = B\,(I_{sub})^{-N} \exp(E_a/kT)$                                For nMOS
                                           $MTTF = B\,(I_{gate})^{-M} \exp(E_a/kT)$                               For pMOS
Corrosion                                  $MTTF = A\,(RH)^{-n} \exp(E_a/kT)$

Figure 5. The profile of MOS device under operation conditions.

Table 5. Parameters of profile for temperature.

No.   Ambient temperature   Dwell in minutes
1     25                    60
2     60                    60

Table 6. Parameters of profile for random vibration.

No.   Frequency (Hz)   PSD (g²/Hz)
1     5                0.0266
      75               0.4
      200              0.4
      2000             0.004
2     5                0.0133
      75               0.2
      200              0.2
      2000             0.002

Figure 6. Three models of MOS device that is located in the middle of board and PCB: a) is CAD model, b) is CFD model and c) is FEA model.

Table 7. Results of simulation analysis.

Analysis                         Part   Parameters                          Result
Thermal Analysis                 Lead   Ambient Temp. 25°C                  44.5°C
                                        Ambient Temp. 40°C                  61.7°C
                                 Die    Ambient Temp. 25°C                  55.7°C
                                        Ambient Temp. 40°C                  72.3°C
Random Vibration Analysis        Lead   Min Natural Frequency               317 Hz
                                        Max PSD                             8.135e-9 g²/Hz
                                        Max PSD                             2.683e-9 g²/Hz
Electrical Parameters Analysis   MOS    Current Density in Interconnects    0.3 MA/cm²
                                        Oxide Field                         8.9 MV/cm
                                        Peak Substrate Current              22 pA

Table 8. The distribution type of each uncertain parameter.

Failure mechanism              Parameter                                                          Distribution
Solder Joint Thermal Fatigue   L: LENGTH                                                          Triangular Distribution
                               h: Solder Height                                                   Triangular Distribution
                               ΔT: Cyclic Temperature Swing                                       Weibull Distribution
                               Tsj: Mean Cyclic Temperature of the Solder in Degrees C            Normal Distribution
                               Coefficients of Linear Thermal Expansion for Component             Triangular Distribution
                               and Substrate
Random Vibration Fatigue       B: Length of the PCB Edge Parallel to the Component                Normal Distribution
                               Located at the Center of the Board
                               L: LENGTH                                                          Triangular Distribution
                               t: Thickness of PCB                                                Normal Distribution
                               Minimum Natural Frequency                                          Normal Distribution
Electromigration               W: Width of Interconnects                                          Normal Distribution
                               d: Thickness of Interconnects                                      Normal Distribution
                               T: Temperature                                                     Normal Distribution
                               j: Current Density                                                 Triangular Distribution
TDDB                           Eox: Oxide Field                                                   Triangular Distribution
                               T: Temperature                                                     Normal Distribution
HCI                            T: Temperature                                                     Normal Distribution
Corrosion                      RH: Relative Humidity                                              Normal Distribution
                               T: Temperature                                                     Normal Distribution

4.3 Cumulative damage

In terms of the seven potential failure mechanisms, we divide them into depletion-type and overstress-type. For the overstress-type, it is only necessary to confirm whether the device is strong enough to withstand the specific stress under operation conditions, while for the depletion-type it is necessary to consider the cumulative damage.

According to the result of FMMEA, shock is the only overstress-type potential failure mechanism, and the stress is not enough to make the device fail. Besides, the change of temperature is already considered by solder joint thermal fatigue itself, so there is no need to calculate its cumulative damage. So for the other five potential failure mechanisms, the methods of cumulative damage analysis and the relative formulas are listed in Table 9.

Table 9. The algorithms of cumulative damage analysis and relative formula.

Failure mechanism          Method   Formula
Electromigration           AFA      $AF = \dfrac{MTTF_{T_0}}{MTTF_{T}} = \left(\dfrac{T_0}{T}\right)^{m} \exp\left[\dfrac{E_a}{k}\left(\dfrac{1}{T_0} - \dfrac{1}{T}\right)\right]$
TDDB                       AFA      $AF = \dfrac{MTTF_{T_0}}{MTTF_{T}} = \exp\left[\dfrac{E_a}{k}\left(\dfrac{1}{T_0} - \dfrac{1}{T}\right)\right]$
HCI                        AFA      $AF = \dfrac{MTTF_{T_0}}{MTTF_{T}} = \exp\left[\dfrac{E_a}{k}\left(\dfrac{1}{T_0} - \dfrac{1}{T}\right)\right]$
Corrosion                  AFA      $AF = \dfrac{MTTF_{T_0}}{MTTF_{T}} = \exp\left[\dfrac{E_a}{k}\left(\dfrac{1}{T_0} - \dfrac{1}{T}\right)\right]$
Random vibration fatigue   CDRA     $N_f = N_1 \Big/ \sum_{i=1}^{m} \alpha_i \left(\dfrac{\sigma_i}{\sigma_1}\right)^{d}$

For electromigration, according to the PoF model and the matrix of samples, the vector of TTF is,

TTF = [11297.991, 11113.816, 12055.017, 11445.712, ..., 11954.772]

For each element of the vector of TTF, calculate the time-to-failure considering the cumulative damage theory under the profile given in section 4.1. The result is,

t = [6312.42, 6209.51, 6735.38, 6394.95, ..., 6679.37]

Then calculate the TTFs of the other potential failure mechanisms and form the matrix of TTF for one MOS structure,

[ 6312.42, 6209.51, 6735.38, 6394.95, ..., 6679.37 ]
[ 8944.57, 8418.34, 9010.22, 8262.84, ..., 9900.97 ]
[ 11396.04, 11361.27, 11360.94, 11351.17, ..., 11357.64 ]
[ 9344.20, 10396.72, 9835.80, 9241.89, ..., 9546.54 ]

Since the potential failure mechanisms of solder joint thermal fatigue and random vibration fatigue will never happen in MOS device, the matrix of TTF only includes four failure mechanisms.


4.4 Data processing 


Calculate the matrix of TTF for the different simulation units and carry out data fitting and hypothesis testing. At last we transfer them to a vector of TTF for the MOS device,

T = [6435.21, 5547.34, 6859.25, 6378.53, ..., 6678.67]

The data are well fitted by a Weibull distribution and the TTF of the MOS device is 6415.9 h under the operation conditions given in section 4.1.


5 CONCLUSION 


With the development of semiconductor technology, the field of reliability engineering increasingly faces the problem of reliability evaluation under complex operation conditions. A methodology is therefore put forward in this paper to evaluate the reliability of MOS devices. First, a large number of parameters of the MOS device are extracted as input for the following steps. Then FMMEA is used to determine the potential failure mechanisms, which are the focus of the analysis. The third step is simulation modeling and analysis to obtain the stress parameters under operation conditions. Then, considering the cumulative damage and the uncertainty of parameters, we obtain the TTF of each potential failure mechanism for every simulation unit by MCS. At last, we evaluate the TTF of the MOS device according to the data obtained before. At the end of this paper, a typical MOS device is selected and evaluated as the case study of this methodology. The TTF of the MOS device under the given operation conditions is 6415.9 h.


ABSTRACT: When modelling critical infrastructure, it is important to account for the effects of inter-
dependencies. Although there is much literature on how to account for interdependencies, there is a short- 
age of real-world examples. In an effort to increase the number of real-world examples, a model of the 
interdependent power and water systems of the Caribbean island of St Kitts has been developed. System 
dependencies arise due to electrically powered water pumps within the water system. Given the location 
of the island, it is not uncommon for it to encounter tropical storms, which may result in disruptions to 
the island’s power system. Depending on the severity of the disruption to the power system, the effects 
may be able to propagate, through the dependencies, into the water system. The developed model uses the 
track and wind speed of past and simulated hurricanes, or up to date weather predictions, to simulate possible disruptions to the island’s power system. Any propagation of these disruptions to the water system through the interdependencies is also simulated. The recent occurrence of hurricane Maria provides a
useful case study to compare the output of the coupled system model with the actual effects that resulted 
due to the hurricane. The models are run with the known track and wind speeds of hurricane Maria. The 
predicted disruptions to the power system, as well as the cascading effects throughout the water system 
are then compared to the actual disruptions exhibited in the interdependent systems. The results can be 


used to validate the model and give an indication if improvements to the current model can be made. 


1 INTRODUCTION 


When modelling disruptions to critical infrastruc- 
ture, it is agreed that any interdependencies present 
need to be taken into account (Buldyrev et al. 2010). 
There is much literature available on the various 
methods to model infrastructure interdependen- 
cies, however there is a lack of examples which 
model real interdependent infrastructure systems 
(Ouyang 2014). Some examples which look at real 
interdependent systems include Johansson and 
Hassel (2010) and Dueñas-Osorio et al. (2007). 

To provide another real example of interdependent infrastructure modelling, a model of the coupled water and electrical power system on the Caribbean island of St Kitts has been developed by the Guikema Research Group. The model was developed with the intention of seeing how tropical storms affect the island’s coupled power and water systems. The coupled water and power system of St Kitts provides a good opportunity to model a coupled infrastructure system, given that the systems are relatively simple and self-contained as they service only the island of St Kitts.


In the next section, a brief overview of the 
island of St Kitts and the natural hazards, such 
as Hurricane Maria, that affect the island will be 
given. Section 3 will provide more detail about the 
model of the island’s power system, water system 
and the dependency that exists between the two 
systems. The results of the model when simulating 
the effects of Hurricane Maria will be presented 
in Section 4. This will be followed by a discussion in Section 5 about the challenges associated with developing and validating this model and, more generally, any interdependent infrastructure model that aims to simulate real systems. The final section, Section 6, will then draw the conclusions of the paper.


2 THE ISLAND OF ST KITTS


St Kitts is the larger of the two islands of the Federation of St Christopher (St Kitts) and Nevis, which are part of the Leeward Island group in the Eastern Caribbean (The Commonwealth, The Official Website of St Kitts and Nevis). The total population of St Kitts and Nevis is estimated to be just over 50,000. St Kitts is of volcanic origin and thus the centre of the island is mountainous, with its highest peak at Mount Liamuiga. Therefore the majority of the population live along the coastline of the island (The Official Website of St Kitts and Nevis).

The total area of St Kitts is 69 square miles or 
roughly 180 square kilometers and thus is a rela- 
tively small island. The small size of St Kitts means 
that modelling infrastructure systems of the island 
is feasible, without having to overly simplify the 
system. These systems are also self-contained due 
to the geographic constraints that they service only 
the island of St Kitts. 

Due to the location of St Kitts, tropical storms are an annual threat to the island. The most notable storm to hit St Kitts in recent years was Hurricane Georges in September 1998, for which the estimated damage costs to St Kitts and Nevis were 445 million United States Dollars (USD) (Relief Web 2002). Other storms that
have had great impacts recently on St Kitts include 
Hurricane Lenny in 1999 with estimated damage 
costs of around 41 million USD, Hurricane Hugo 
which hit the island in 1989 and caused damage to 
roughly 20 percent of the poles in the electricity 
power system, and Hurricane Luis in 1995 which 
caused great damage to the power and water sys- 
tems of the island (Relief Web 1995, Relief Web 
1999, US Aid 1990). 


2.1 St Kitts and Hurricane Maria 


The National Oceanic and Atmospheric Adminis- 
tration (NOAA) first issued a tropical storm watch 
on 16th September 2017, with the first advisory 
report being issued at 11:00 Atlantic Standard 
Time (AST). By 17:00 AST, NOAA were recording tropical storm force surface winds of what was then Tropical Storm Maria. At this time a hurricane watch was in effect for St Kitts and Nevis along with Antigua, Barbuda and Montserrat. By the time of Advisory 6 of Maria, at 17:00 AST, hurricane force winds had been recorded and the storm had been upgraded to a hurricane. The St Kitts
hurricane watch had been upgraded to a hurri- 
cane warning. As of 19th September 05:00 AST, St 
Kitts was experiencing tropical storm force winds 
as Hurricane Maria passed by the south west 
coast of the island (NOAA). The eye of Hurricane 
Maria was reported as passing within 90 miles of 
the island (St Kitts & Nevis Observer 2017). Advi- 
sory 15, issued on 19th September at 17:00 AST 
was the last advisory to issue a hurricane warning 
to the island of St Kitts (NOAA). 

Figure 1 shows the path of the centre of Hurricane Maria as it passed by St Kitts, as well as the wind speed severity category bands (The New York Times 2017). From Figure 1 it can be seen that St Kitts experienced tropical storm severity wind speeds as Hurricane Maria passed by the south-west of the island.

Figure 1. Track of Hurricane Maria (The New York Times, 2017).


3 COUPLED POWER AND WATER 
SYSTEM OF ST KITTS 


The island of St Kitts is a relatively small island 
with an area of 69 square miles. The power and 
water systems which provide electricity and clean 
water to the island are self-contained as they do 
not provide utilities to other islands and can be 
modelled without having to be over simplified. The 
water system of St Kitts contains 30 wells which 
rely on the electricity system to pump the water 
from the wells into the water system. This depend- 
ence can be modelled along with the power and 
water systems to develop a model of the coupled 
power and water system for St Kitts. 

The water system model was developed using 
information provided by St Kitts Water Depart- 
ment. The model encompasses the entire clean 
water distribution system, including the supply 
sources and demand nodes. Figure 2 shows the 
modelled water system of St Kitts, including the 
pipelines, reservoirs and wells. The blue lines in 
Figure 2 represent the water pipes, the red dots are 
the water nodes (wells and junctions) and the black 
nodes are the reservoirs and tanks within the water 
system. 

The water system was modelled using EPANET 2.0, a program that can “perform extended period simulation of hydraulic and water quality behaviour within pressurised pipe networks” (EPANET 2000).

The power system was modelled using the limited information available on the St Kitts Electricity Company Limited (SKELEC) website. The main lines of the power system were modelled; however, due to lack of available information the smaller distribution lines of the power system could not be included in the model. As stated in Section 2, most of the island’s population resides in the coastal areas of St Kitts. Therefore the 3 main power lines that surround the island are modelled. The model contains 157 nodes, representing 157 wooden electricity poles that the power cables are attached to. The three main power lines can be seen in Figure 3. The northern line is represented by red squares and contains 70 poles, the southern line is shown as blue dots and contains 39 poles and the western line is shown as purple triangles and contains 48 of the 157 poles.

Figure 2. Model of St Kitts water system.

Figure 3. St Kitts electricity power system model.

For each pole, a fragility curve giving the prob- 
ability of failure given the on-island wind speed 


has been determined based on historical damage 
reports of hurricanes that were readily available to 
the public. If one pole fails in a power line, then all 
the poles downstream of the failed pole will also be 
modelled as inoperable. 

The water system depends on the power system 
at each of the 30 wells within the system to get the 
water from the well into the water distribution sys- 
tem. To model the water system’s dependency on 
the power system, each well is dependent on the 
closest electricity node, or pole, to provide power to 
the well. During the simulation, if a node that a well depends on becomes inoperable, that well is removed from the water system model. Figure 4 shows the modelled electricity
lines, as black triangles, and the water wells as red 
squares and other water nodes as blue dots. Sixteen 
wells depend on nodes within the northern power 
line, 7 wells depend on electricity from the western 
line and 8 wells depend on the southern line. 

The track as well as the maximum wind speed 
recorded every 6 hours for the centre of Hurricane 
Maria from 17th to 28th September 2017 was used 
to estimate the maximum speed of the hurricane 
over St Kitts using a model previously developed 
by Guikema et al. (2014). This provided the prob- 
ability of failure for each of the 157 poles within 
the electricity power system. These probabilities 
were then used in the MATLAB model developed within the Guikema Research Group. To look at the effects of Hurricane Maria on the coupled power and water systems of St Kitts, the MATLAB model was run for 60000 iterations. Each iteration first models which poles fail or break due to the hurricane, and then models the cascading effects of these failures down each power line and through to the water system.
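
The simulation loop described above can be sketched as follows. The original model was written in MATLAB and is not reproduced in the paper, so this Python version is only an illustrative stand-in under simplifying assumptions: a single radial line whose poles are ordered from the supply end, independent pole failures, and hypothetical names (`simulate_line`, `well_outage_frequency`, `well_to_pole`).

```python
import random


def simulate_line(fail_probs, rng):
    """Return the operability of each pole on one line, ordered from the supply end.

    A pole is inoperable if it, or any pole upstream of it, has failed.
    """
    operable = []
    upstream_ok = True
    for p in fail_probs:
        if rng.random() < p:
            upstream_ok = False  # initial failure; everything downstream is lost
        operable.append(upstream_ok)
    return operable


def well_outage_frequency(fail_probs, well_to_pole, iterations, seed=0):
    """Fraction of iterations in which each well loses power.

    well_to_pole maps a well name to the index of its closest pole.
    """
    rng = random.Random(seed)
    outages = {well: 0 for well in well_to_pole}
    for _ in range(iterations):
        operable = simulate_line(fail_probs, rng)
        for well, pole in well_to_pole.items():
            if not operable[pole]:
                outages[well] += 1
    return {well: count / iterations for well, count in outages.items()}
```

In the full model, the wells found to be without power in each iteration would be removed from the hydraulic model before node pressures are evaluated; that step depends on the EPANET network and is outside this sketch.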


Figure 4. Coupled power and water system of St Kitts.


4 RESULTS


4.1 Electricity power system 


For each of the 60000 iterations that the model 
completes, each pole is recorded as failing due to 
the wind or not. Therefore, the model gives the fre- 
quency of initial pole failure, which can be seen in 
Figure 5. The frequency, given as a percentage, of 
each pole failing due to Hurricane Maria is shown 
using the colour scale as shown in Figure 5. 

There are a few poles with relatively low fre- 
quency of failing in the southern line, closest to 
the centre of the island. However, the southern 
line is mostly unaffected by Hurricane Maria. On 
the northern power line there is one pole, around a 
third of the way along the line that has a high frequency of failure. Failure of this pole would cause disruptions to the rest of the line up to the north-east of the island, and thus would also disrupt any wells that depend on nodes on this section of the northern line. On the western power line there
are more poles within the line that have initial fail- 
ures due to the hurricane. This is expected as Hurri- 
cane Maria passed by the south-west of the island. 
The most frequent initial failure in the western line 
is a pole that lies around halfway along the western 
coast. This would cause disruptions along the rest 
of the line, all along the western coastline, and these 
disruptions would cascade to any wells that depend 
on this section of the western power line. 


4.2 Actual power outages 


When looking for data on what effects Hurricane Maria had on the electricity system of St Kitts, the only information found came from the SKELEC website. This information was a restoration update that named residential areas that were experiencing power outages due to Hurricane Maria. The press release stated “Most of our feeders remained intact and online during the passage of the storm. Some like the Canada feeder which services Conaree, Halfmoon and Canada Estates came offline, also Basseterre North Buckley’s to Trinity also is fully offline” (SKELEC 2017).

Figure 6 shows the areas that SKELEC reported 
to have experienced outages due to Hurricane 
Maria. The three areas to the north of the island 
are Canada, Halfmoon and Conaree. Canada has 
been filled red rather than just outlined as the 
Canada feeder was the feeder identified as caus- 
ing these outages to the north of the island. These 
three outlines were not specified by SKELEC 
but are the outline of the three areas named by 
SKELEC in the press release. Due to no further 
information being found in relation to power outages due to the hurricane, this is the best representation of possible affected areas that can be given.
The larger section on the south-west of the island 
represents the area from Basseterre North Buck- 
ley’s to Trinity. The outline encases all residential 
areas between, and including, Basseterre North 
Buckley’s to Trinity. Again, the press release from 
SKELEC is the only information on which to base 
the actual areas affected and thus the areas cannot 
be more specific. 

Figure 7 shows the results of the model for frequency of initial pole failure compared to the areas that SKELEC reported to experience power outages due to Hurricane Maria. When looking at the predictions and actual areas to exhibit outages on the northern power line, the high frequency of a pole break on the northern side of the island given by the model coincides with the press release that the northern side was affected such that a feeder came offline. When looking at the south-west area that experienced outages, if the main line was affected by the hurricane, it would be understandable to see more outages along the line. However, these outages could have been caused by failures or disruptions within the smaller distribution lines that are not included in the model.

Figure 5. Frequency of initial pole failure.

Figure 6. Areas of St Kitts that experienced power outages due to Hurricane Maria.

Figure 7. Frequency of initial pole failure compared to areas to experience power outages.


4.3 Water system 


When looking at disruptions to the water system, 
the frequency that nodes exhibit pressure below 20 
psi as well as 0 psi are both considered. The pressure of the nodes within the water system is important to measure because a negative pressure can cause the water to flow in the wrong direction, or cause pipes to break. Both of these can cause the water to become contaminated. Pressures lower
than 20 psi are investigated as in the USA this is 
the minimum pressure required to use a water sys- 
tem for firefighting purposes (EPA 1992). 

Figure 8 shows the frequency that water nodes 
exhibited pressures lower than 20 psi. Comparing 
Figure 8 to Figure 5, the low pressures towards the lower end of the island coincide with the areas in which the initial pole failures occurred. The low
pressures exhibited towards the north-western part 
of the island are due to the initial failures in both 
the northern and western power lines cascading to 
the north of the island. 

Figure 9 shows the frequency that the water 
nodes exhibited pressures less than 0 psi. There are 


Figure 8. Frequency of water nodes with pressure lower 
than 20 psi. 


Figure 9. Frequency of water nodes with pressure lower 
than 0 psi. 


few instances of nodes exhibiting pressure lower 
than 0 psi, with a few occurring towards the 
north-west end of the island. Although the initial 
aim was to compare the results of the model with 
how Hurricane Maria affected the coupled power and 
water infrastructure system of the island, it 
was challenging to find any data related to 
disruptions to the water system. 

The location of wells compared to the actual 
outage data was also considered to see if any 
wells would have been without power during the 
outages. 

When looking at Figure 10, there are three 
wells that are situated in the south-west area of 
Figure 10. Frequency of water wells with pressure lower 
than 0 psi and reported power outages. 


St Kitts that experienced power outages. To further 
test the water system model, these three wells 
were removed from the model, as if they had lost 
power. However, all other nodes in the water system 
model maintained a pressure above 0 psi; thus, 
even if these three wells were without power, the 
model indicates that the water system would not 
have experienced disruptions. 

Although no information directly related to 
disruptions to the water system was found, the 
Prime Minister of St Kitts and Nevis stated that 
the island’s infrastructure such as the electricity 
and water systems “sustained extensive damage” 
(Relief Web 2017). 


5 CHALLENGES WHEN MODELLING 
AND VALIDATING REAL 
INFRASTRUCTURE MODELS 


When modelling the coupled power and water 
infrastructure system of St Kitts, information 
regarding the water system was provided by the St 
Kitts Water Department. This, along with ongoing 
communication with the water department while 
developing the model, allowed an extensive model 
of the system to be developed. However, when 
developing the electricity model, the only 
information available was that in the public 
domain, primarily on the SKELEC website. This 
meant that the model was an oversimplified 
representation of the main power lines in the 
island’s electricity system. 

Accessing and collecting data to form the basis 
of infrastructure models is a well-documented 
issue when constructing models of independent 
as well as interdependent infrastructures 
(Ouyang 2014, Johansson and Hassel 2010, Rinaldi 
et al. 2001). Another problem highlighted by 
Ouyang (2014) is that infrastructures are large, 
complex systems that are constantly changing, and 
thus the data used to produce the model may not be 
relevant for very long after the model has been 
developed. Any major changes to the infrastructure 
would have to be reflected in the model. 

When trying to validate the model, information 
on the disruptions to both the water and power 
systems due to Hurricane Maria was also hard to 
acquire. The electricity company did have some 
information available; however, this is aimed at 
the customers of the infrastructure and is there 
to reassure them that the company is aware of the 
problems and is fixing the issues. The information 
relates only to general areas, and not to the 
specific parts of the system that would be useful 
when validating the model. 

Within the USA, outage data from power companies 
is becoming increasingly available to the public. 
Most power utility companies now show daily outage 
data on their websites; see the website Power 
Outage for more information on which utilities 
provide outage data. However, other utilities in 
the USA, such as those that manage water 
infrastructure, are still less willing to share 
outage data. 


6 CONCLUSION 


The aim of the paper was to compare the results 
of the coupled power and water system model of St 
Kitts with the disruptions the systems exhibited 
during Hurricane Maria. Although the electricity 
model is a simplified representation of only the 
island’s main power lines, the areas that 
experienced power outages did coincide with poles 
that had a high frequency of initial failure. This 
indicates that the model is quite accurate in 
estimating the maximum wind speed for the different 
areas of the island. However, as the smaller 
distribution lines could not be included in the 
model and limited outage information is available, 
it is difficult to assess the accuracy of the model. 
Improvements could be made by increasing the 
complexity of the model, if the necessary 
information can be obtained from St Kitts 
Electricity Company. 

Although the water system model is a good 
representation of the island’s water system, the 
lack of data relating to disruptions due to 
Hurricane Maria again means it is difficult to 
validate the model. The lack of available 
information relating to infrastructure systems is 
a well-recognised challenge when developing models 
of such systems. 
ABSTRACT: Availability is a core attribute of a Global Navigation Satellite System (GNSS) and many 
topics are studied in this area. However, most studies address only one part of GNSS: single-satellite 
availability, navigation constellation availability or the availability of ground stations. Very few studies are 
concerned with the availability of the whole system. In view of this, availability simulation modeling 
of the whole GNSS, including the space segment and the ground segment, was studied in this paper. The 
influence of satellite failure, interruption, coverage, and ground station fault and repair on system operation 
was considered in establishing the model. The simulation logic was then developed to simulate the 
operation of the navigation system. Finally, based on STK/MATLAB hybrid simulation, the service availability 
of GPS for Beijing was analyzed. The simulation results verify the applicability of the model and the 
simulation algorithm. 


1 INTRODUCTION 


Global Navigation Satellite System (GNSS) is used 
in various fields and brings great convenience to 
people’s lives, where it has become indispensable. 
Availability is a core parameter of GNSS and 
reflects the degree to which the needs of users are 
met. Therefore, the level of availability of GNSS 
is directly related to the quality of the service 
it provides. At present, many scholars at home and 
abroad have carried out relevant research on it and 
achieved some results. 

Regarding the availability of ground stations, 
Joo built a series model of availability considering 
the early stage and the stabilization stage, and 
obtained the availability of the ground control 
sections from actual operational data (Joo et al. 
2007). Li took the coverage performance of the 
ground station as the basis of its availability and 
put forward an availability model of the monitoring 
station (Li et al. 2010). Regarding the availability 
of the navigation constellation, Xiang established 
the constellation system reliability model and 
analyzed the impacts of the reliability of the 
satellite, the constellation configuration and the 
number of backup satellites on the reliability of 
the constellation system (Xiang et al. 2007). Zhou 
analyzed the impact of changes in single-satellite 
reliability on constellation availability for the 
network supplement mode of terrestrial backup 
satellites (Zhou et al. 2014). Zhao proposed a 
Markov model for calculating per-slot availability 
considering the influence of the standby satellite 
on slot availability (Zhao et al. 2013). Hou 
calculated the availability of the navigation 
constellation based on the failure of subsystem 
components and built a service availability 
calculation model based on constellation state 
probabilities (Hou et al. 2014). Regarding the 
availability of the navigation system, Zhao 
established a multidimensional parameter system for 
the availability of the satellite navigation system 
(Zhao et al. 2014). Liang modeled and simulated the 
system structure of the ground system of a remote 
sensing satellite based on DoDAF (Liang et al. 
2017). Yang proposed a system efficiency modeling 
and analysis method that comprehensively considers 
various factors, such as the randomness of different 
types of interruption and lossy failure performance, 
and used a Markov chain to build the availability 
model of the navigation satellite (Yang et al. 
2017). 

In summary, the classifications and models of 
GNSS availability are diverse, covering accuracy 
availability and integrity availability, single-point 
availability and service area availability, and 
single-satellite availability and navigation 
constellation availability. At present, research on 
the availability of GNSS is more focused on local 
issues. For example, some scholars consider the 
availability of a single satellite, while others 
study the availability of navigation constellations 
in terms of reliability and maintainability, or only 
consider the availability of ground stations. None 
of these approaches readily supports a comprehensive 
analysis and evaluation of the overall availability 
of GNSS. In view of this, based on the composition, 
mission requirements, usage patterns and user 
requirements of GNSS, a systematic and comprehensive 
availability simulation model was built that 
considers the operation, interruption, maintenance 
and support of GNSS. The overall simulation logic 
flow was developed in order to simulate GNSS 
operation, networking, network supplement, 
interruption and other processes, and to analyze 
its availability level. At the same time, the 
results can provide a basis for the backup satellite 
program to shorten the satellite supplement waiting 
time and improve the availability of the navigation 
system. 


2 FRAMEWORK OF AVAILABILITY 
SIMULATION MODEL OF GNSS 


Availability is the description of the working 
state of a system. According to the different 
mission requirements of GNSS, its availability can 
be divided into instantaneous availability, 
single-point availability, service area availability 
and moving target availability. As GNSS is affected 
by internal failures and refurbishment as well as by 
external weather and environment, it cannot be 
available at all times. The availability of GNSS is 
mainly affected by the following factors: (1) the 
reliability of equipment, (2) equipment maintenance, 
(3) system structure and (4) backup strategy. These 
four factors were considered comprehensively and the 
availability simulation model of GNSS was 
established in this paper. 

According to the overall structure and opera- 
tion characteristics of the navigation system, in 
order to describe the whole process of composi- 
tion, operation, maintenance and support of 
GNSS completely, the framework of the overall 
availability simulation model was established, as 
shown in Figure 1. 

The model consists of seven parts: the structural 
model, the state model, the mission requirement 
model, the interruption model, the maintenance 
model, the support model and the relationship model 
among them. The structural model is mainly used to 
describe the constituent elements of the system, 
their hierarchical structure and their reliability 
characteristics. The state model is used to describe 
the state of each satellite in order to calculate 
its coverage area. The mission requirement model is 
used to describe different types of missions for 
different user groups. The orbit interruption model 
is used to describe the repair process of the space 
segment of GNSS, including the type of outage, 
repair time and the station information. The 
maintenance model is used to describe information 
including the purpose of maintenance, the type of 
maintenance, the time required for maintenance and 
the resources required for maintenance. The support 
model is used to describe the relationship among 
the maintenance support factors and operational 
support factors, including the storage and supply 
of satellites, the support equipment, and the 
support facilities and personnel. The relationship 
model is used to describe the relationship among 
the system structure, state, missions, interruption, 
maintenance and support. Together, these seven 
models cover all the elements needed to describe 
the operation, interruption and maintenance of GNSS. 

[Figure 1 diagram: the availability simulation model of GNSS comprises 
the structural model (unit data model, hierarchical structure model); 
the state model (non-geostationary orbit satellite state model, 
geostationary orbit satellite state model, ground station position model); 
the mission requirement model; the interruption model (planned and 
unplanned orbit interruption models); the maintenance model (preventive 
maintenance, corrective maintenance and maintenance resource requirement 
models); the support model (space segment support, support station and 
support resource models); and the relationship model.] 
Figure 1. Framework of GNSS availability simulation 
model. 


3 DESCRIPTION OF AVAILABILITY 
SIMULATION MODEL 


3.1 Structural model 


The structural model mainly describes the organi- 
zational relationship of the constituent units of 
the system at the functional structure level, and 
provides reference and support for the system 
composition and reliability data of the simulation 
modeling. 

GNSS is a typical complex system with a variety 
of equipment types, such as space segment equipment, 
ground segment equipment and user segment equipment, 
each with its own composition and characteristics. 
Therefore, in this paper the functional structure of 
GNSS was divided into layers from top to bottom: the 
large system layer, system layer, subsystem layer, 
module layer and equipment layer. Taking into 
account the availability of equipment fault 
information, the space segment was decomposed to the 
satellite level and the ground segment to the power 
supply equipment level, as shown in Figure 2. The 
data structures are shown in Tables 1 and 2. Table 1 
describes the unit fault modes and basic reliability 
data, and Table 2 describes the main relationships 
between units, including structure and quantity. 

Figure 2. GNSS structural model. 
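
As an illustration only, the two data structures of the structural model could be held as simple records. The sketch below mirrors the data items of Tables 1 and 2; the Python field names, the example rows and the helper children_of are assumptions made for this example and are not part of the paper's STK/MATLAB implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class UnitData:
    """One row of the unit data table (Table 1)."""
    unit_name: str
    fault_mode: str
    parameter_name: str       # e.g. "MTBF" or "failure rate"
    distribution_type: str    # e.g. "exponential" or "normal"
    parameter_1: float
    parameter_2: float
    initial_working_time: float


@dataclass
class UnitHierarchy:
    """One row of the unit hierarchical structure table (Table 2)."""
    unit_name: str
    parent_unit_name: str
    unit_matching_number: int
    unit_type: str            # "system", "subsystem", "module" or "unit"


def children_of(rows: List[UnitHierarchy]) -> Dict[str, List[str]]:
    """Group subunit names under their parent unit, in input order."""
    tree: Dict[str, List[str]] = {}
    for row in rows:
        tree.setdefault(row.parent_unit_name, []).append(row.unit_name)
    return tree


# Hypothetical hierarchy rows, for illustration only.
rows = [
    UnitHierarchy("space segment", "GNSS", 1, "system"),
    UnitHierarchy("ground segment", "GNSS", 1, "system"),
    UnitHierarchy("satellite", "space segment", 24, "unit"),
]
tree = children_of(rows)
```

Keeping the hierarchy as flat parent and child rows, as Table 2 does, makes it straightforward to rebuild the layered structure at simulation start.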


3.2 State model 


The state model is used to describe the initial 
state of each unit. The time-varying longitude, 
latitude and coverage of each satellite are 
calculated from the satellite operation rules, so 
as to complete the calculation of the coverage of 
the mission area. The satellite navigation system 
includes space satellites and ground station 
equipment. The space satellites can be subdivided 
into non-geostationary orbit satellites and 
geostationary orbit satellites. The operating rules 
and status of each unit are different. According to 
the different operating states, this paper divides 
the state model into the non-geostationary orbit 
satellite state model, the geostationary orbit 
satellite state model and the ground station state 
model. The data structures are shown in Table 3. 

From these three groups of state models, it can 
be seen that the operational state model of the 
satellite determines the sub-satellite point and 
coverage area at each moment in time, and the state 
of the ground station determines its geographical 
position, so that it can be determined whether the 
coverage of the ground by the satellite meets the 
mission conditions. 
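
The coverage check described above can be sketched geometrically. The example assumes a spherical Earth and interprets the "field angle" of the state model as the half-angle of the satellite's viewing cone measured from nadir; both are assumptions of this sketch, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical-Earth assumption


def coverage_central_angle_deg(orbit_height_km: float,
                               half_field_angle_deg: float) -> float:
    """Earth central angle from the sub-satellite point to the coverage edge."""
    ratio = EARTH_RADIUS_KM / (EARTH_RADIUS_KM + orbit_height_km)
    eta = math.radians(half_field_angle_deg)
    # A cone wider than the horizon-limited nadir angle is clamped to it.
    eta = min(eta, math.asin(ratio))
    # Law of sines in the triangle Earth centre / satellite / ground point.
    cos_elevation = min(1.0, math.sin(eta) / ratio)
    elevation = math.acos(cos_elevation)
    return math.degrees(math.pi / 2 - elevation - eta)


def central_angle_deg(lat1: float, lon1: float,
                      lat2: float, lon2: float) -> float:
    """Great-circle angle between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    c = (math.sin(p1) * math.sin(p2)
         + math.cos(p1) * math.cos(p2) * math.cos(dlon))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))


def covers(sub_lat: float, sub_lon: float, orbit_height_km: float,
           half_field_angle_deg: float, lat: float, lon: float) -> bool:
    """True if the ground point lies inside the satellite's coverage cap."""
    limit = coverage_central_angle_deg(orbit_height_km, half_field_angle_deg)
    return central_angle_deg(sub_lat, sub_lon, lat, lon) <= limit


# Horizon-limited coverage for an orbit height of 20200 km (illustrative).
max_cap = coverage_central_angle_deg(20200.0, 90.0)
```

A minimum elevation mask, which practical visibility analyses usually apply, would shrink the coverage cap further and is omitted here for brevity.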


3.3 Mission requirement model 


Mission requirement model is mainly used to 
describe the mission requirements and the logi- 


Table 1. Unit data. 

Data item                      Definitions 
Unit name                      The unit contained in the system 
Fault mode                     Different types of fault of the unit 
Parameter name                 Fault parameters, such as failure rate, MTBF, etc. 
Distribution type              Parameter distribution type, including the exponential distribution and normal distribution 
Parameter 1                    Parameter corresponding to the distribution type 
Parameter 2                    Parameter corresponding to the distribution type 
Initial working time of unit   The start of service of each satellite or equipment 

Table 2. Unit hierarchical structure. 

Data item              Definitions 
Unit name              Subunit identification 
Parent unit name       Parent unit identification 
Unit matching number   Matching quantity of the unit that the current parent unit is affiliated with 
Unit type              Determine whether the unit is a system, subsystem, module, or unit 
Table 3. State model. 

Non-geostationary orbit satellite state    Geostationary orbit satellite state    Position of ground station 
Unit ID                                    Unit ID                                Unit ID 
Orbit semi-major axis                      Longitude                              Longitude 
Eccentricity                               Orbit height                           Latitude 
Orbital inclination                        Field angle                            Ground height 
Right ascension of the ascending node 
Argument of perigee 
True anomaly 
Field angle 


cal relationship between different submissions. 
The main function of GNSS is to use satellites to 
provide users with quick navigation and positioning, 
short digital message communication and timing 
services. According to these functions, its main 
missions can be divided into the following three 
aspects: (1) high-precision navigation, (2) 
high-precision positioning and (3) high-precision 
timing. For different types of missions, and 
targeting different user groups, the geographical 
location information, movement information and 
time information of stationary users, mobile 
users and regional users shall be provided. In view 
of this, the data structure of the mission 
requirement model was established as shown in 
Table 4 to describe the requirements of different 
missions. 

The three typical missions of GNSS are completed 
jointly by the space segment, the user segment and 
the ground segment. It is assumed in this paper that 
the user segment is intact. Therefore, the 
navigation system is represented as a series logical 
relationship between the space segment and the 
ground segment. A Reliability Block Diagram (RBD) is 
used to express the success criterion of the 
mission, as shown in Figure 3: when the functions of 
the different devices satisfy the logic of the 
diagram, the navigation system mission succeeds. 
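
A success criterion of this kind can be evaluated with simple series and parallel operators. In the sketch below, the series relationship between the space segment and the ground segment follows the text, but the internal arrangement of the blocks (the two positioning systems in parallel, the three ground stations in series) is an assumption made only for illustration, since the exact topology of Figure 3 is not recoverable from the text. The block names are hypothetical keys.

```python
from typing import Dict


def series(*blocks: bool) -> bool:
    """A series structure works only if every block works."""
    return all(blocks)


def parallel(*blocks: bool) -> bool:
    """A parallel structure works if at least one block works."""
    return any(blocks)


def mission_succeeds(state: Dict[str, bool]) -> bool:
    # ASSUMED topology: either positioning system suffices for the space
    # side, while all three ground stations are required.
    space_ok = parallel(state["active_positioning"],
                        state["passive_positioning"])
    ground_ok = series(state["master_control"],
                       state["monitoring"],
                       state["injection"])
    # Series relationship between space and ground segments, as in the text.
    return series(space_ok, ground_ok)


all_up = {
    "active_positioning": True,
    "passive_positioning": True,
    "master_control": True,
    "monitoring": True,
    "injection": True,
}
injection_down = dict(all_up, injection=False)
active_down = dict(all_up, active_positioning=False)
```

Under this assumed topology, losing one positioning system is tolerated, whereas losing any ground station fails the mission.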


Table 4. Mission requirement model. 

Data item            Definitions 
Mission type         Divided into navigation, positioning and timing 
Coordinates 1 to 4   For positioning and timing, the coordinates refer to the longitude and latitude of the spatial location of the mission initiation point or the latitude and longitude coordinates of the mission initiation region. For navigation, the coordinates can express a point, which refers to the spatial location of the mission initiation point, or a distance, which refers to that from the mission initiation point coordinate to the mission end point coordinate. 
Height               The level of the ground surface where the user is 
Speed                For the navigation mission, the client speed affects the satellite service area switching, so the speed parameter needs to be given 
Accelerated speed    For navigation users, in addition to the user speed, the user status (acceleration, deceleration or uniform speed) shall be described 
Time to start work   Time to start work 
Time to end work     Time to end work 


[Figure 3 blocks: active positioning system, passive positioning system, 
master control station, monitoring station, injection station.] 
Figure 3. RBD for success criterion of the mission. 


3.4 Orbit interruption model 


The orbit interruption model is used to describe 
the repair process for orbit interruptions in the 
space segment of the satellite navigation system, 
including the type of outage, the time of 
interruption and information on the orbit in which 
the satellite is located. The interruption time and 
repair measures differ depending on the cause of the 
orbit interruption. Therefore, two types of space 
orbit interruption models are established in this 
paper: the planned orbit interruption model and the 
unplanned orbit interruption model. The former 
describes the situations in which the satellite 
orbit is out of service and the satellite cannot 
provide service because of satellite operation and 
maintenance activities or satellite replacement; 
its data structure is shown in Table 5. The latter 
describes the repair measures and repair time after 
the orbit goes out of service because of short-term 
or long-term hard faults caused by accidental 
failures; its data structure is shown in Table 6. 
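
In a simulation, an unplanned interruption time would typically be drawn from the distribution recorded for the unit. The sketch below assumes the two distribution types named in Table 1 (exponential and normal) and assumes that parameter 1 is the mean and parameter 2 the standard deviation; the paper does not state these conventions, so they are labelled here as assumptions. The function name sample_duration is hypothetical.

```python
import random


def sample_duration(distribution_type: str, parameter_1: float,
                    parameter_2: float, rng: random.Random) -> float:
    """Draw one interruption (or repair) duration for a unit."""
    if distribution_type == "exponential":
        # ASSUMPTION: parameter_1 is the mean duration; parameter_2 is unused.
        return rng.expovariate(1.0 / parameter_1)
    if distribution_type == "normal":
        # ASSUMPTION: parameter_1 is the mean, parameter_2 the standard
        # deviation; negative draws are truncated to zero.
        return max(0.0, rng.gauss(parameter_1, parameter_2))
    raise ValueError(f"unsupported distribution type: {distribution_type}")


rng = random.Random(0)  # fixed seed so that a run is repeatable
exp_samples = [sample_duration("exponential", 10.0, 0.0, rng)
               for _ in range(20000)]
exp_mean = sum(exp_samples) / len(exp_samples)
normal_samples = [sample_duration("normal", 5.0, 1.0, rng)
                  for _ in range(1000)]
```

The same routine could serve the MTTR fields of the corrective maintenance table, since they follow the same type-plus-two-parameters pattern.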


Table 5. Planned orbit interruption. 

Data item                Definitions 
Outage unit              The name of the unit that the planned interruption is in, given by the structural model, as defined in Table 1 
Planned outage type      The types of outages corresponding to planned interruption include operation and maintenance and end of life 
Calendar time interval   Calendar time intervals of preventive maintenance 
Repair time              The time consumed for the planned interrupt operation 
Name of orbit            Orbit plane that the satellite is in 


Table 6. Unplanned orbit interruption. 

Data item                Definitions 
Outage unit              The name of the unit that the unplanned interruption is in, given by the structural model 
Non-planned outage type  The types of outages corresponding to unplanned interruption include short-term hard faults and long-term hard faults 
MTTI distribution type   The distribution function type of the average interrupt time of the unit 
MTTI parameter 1         Shape parameter 1 of interruption time distribution function 
MTTI parameter 2         Shape parameter 2 of interruption time distribution function 
Name of orbit plane      Orbit plane that the satellite is in 
3.5 Maintenance model 


The maintenance model is primarily used to 
describe maintenance activities on equipment that 
fails in the ground segment, including preventive 
maintenance activities and corrective maintenance 
activities. In order to fully describe the 
maintenance activities, data structures for 
preventive maintenance, corrective maintenance and 
the maintenance resource requirement were 
established in this paper; details can be seen in 
Tables 7-9. 


Table 7. Preventive maintenance. 

Data item                     Definitions 
Maintenance unit name         The name of the unit that the preventive maintenance operation is in, as defined in Table 1 
Preventive maintenance type   Refer to the type of preventive maintenance operations, including operations and maintenance 
Calendar time interval        Calendar time interval of preventive maintenance 
Duration                      Time consumed for this preventive maintenance operation 
Maintenance station name      Maintenance site where the unit is in 


Table 8. Corrective maintenance. 

Data item                  Definitions 
Maintenance unit name      The name of the unit performing the corrective maintenance 
MTTR distribution type     The distribution function type of the unit repair time 
MTTR parameter 1           Shape parameter 1 of repair time distribution function 
MTTR parameter 2           Shape parameter 2 of repair time distribution function 
Maintenance station name   Maintenance site where the unit is in 


Table 9. Maintenance resource requirement. 

Data item               Definitions 
Maintenance unit name   The name of the unit performing the corrective maintenance 
Maintenance method      Describe the type of maintenance of the maintenance unit 
Resource name           Maintenance resource for the unit to perform the corresponding level of maintenance work 
Resource quantity       Maintenance resource quantity for the unit to perform the corresponding level of maintenance work 


3.6 Support model 


The support process is complex and involves 
various dynamic factors. The support model describes 
the support system of the navigation system, 
including the storage and supply of spare parts, 
support equipment and facilities. According to the 
objects being supported, a space segment support 
model and a ground segment support model were 
established. The former describes the backup 
satellite configuration at all levels and the ground 
network supplement time, as shown in Table 10. The 
latter mainly describes the ground support stations 
and the support resources, as shown in Tables 11 
and 12. 


3.7 Relationship model 


Based on the above models, establishing a complete 
availability model of the satellite navigation 
system requires establishing the relationships 
between the models, which lays the foundation for 
the subsequent simulation logic. The relationship 
model mainly describes the relationships among 
system structure, status, mission, interruption, 
maintenance and support, as shown in Figure 4. 


1. Status and missions 

The operating status of the satellite determines its 
sub-satellite point and coverage area, and the status 
of the ground station determines its coverage. During 
the mission process, the coverage of the satellite and 

Table 10. Space segment support. 

Data item                          Definitions 
Orbit plane name                   The orbit plane that the repair orbit is in 
Level                              Describe the support level of station, usually refer to the orbit plane level 
Storage quantity in orbit          Quantity of backup satellites in the orbit 
Network supplement time in orbit   Refer to the time interval from when the back-up satellite in the orbit reaches the assigned orbit and is debugged completely to the start of work 
Superior name                      Describe the support organization relationship between the orbit plane and the ground 
Ground storage quantity            Refer to the quantity of backup satellites in the ground launching base 
Ground network supplement time     Describe the time required for network supplement, launch, and commissioning between the ground and the orbit plane 
Table 11. Support station. 

Data item                       Definitions 
Station name                    The station used to maintain the equipment 
Station level                   Describe the support level of the station, including the orbit plane level, station level and launch base level 
Superior station level          Represent the superior station of this station, and describe the support organization relationship between stations 
Support time between stations   Describe the transit time from the superior station 
Table 12. Support resource. 

Data item           Definitions 
Station name        The station used to maintain the equipment 
Resource name       Resources required in the maintenance process, including manpower and equipment 
Resource quantity   Resource quantity provided in this station 


Figure 4. Relationship model. 


ground station in the mission area is calculated, 
and the success criterion of the mission is then 
used to judge whether the mission succeeds. 

2. Structure and status 

When a unit fails in the course of carrying out 
the mission, the failure is fed back to the state 
model. When a satellite needs to be supplemented 
during the repair process, the updated unit is also 
fed back to the state model. 

3. Maintenance interruption and mission 

During the mission, the mission may be affected 
when the unit is interrupted or repaired. The mis- 


sion process can also have some impact on the 
planned maintenance. 

4. Interruption and support 

During the repair process, it is first determined 
whether a replacement satellite is needed. When the 
orbit plane cannot supply the required satellite, 
the interruption repair event is put on standby; 
when the quantity of backup satellites in the orbit 
plane is less than the set value, an application is 
made to the ground station for satellite supplement. 

5. Maintenance and support 

In the maintenance process, spare parts and other 
support resources are needed. When the resources in 
the maintenance station are insufficient, the 
maintenance event is put on standby. After a 
maintenance task is completed, its resources are 
released and become available for the maintenance 
of other faulty parts. Each standby maintenance 
event is then re-checked against the maintenance 
conditions to decide whether it can start or must 
continue to wait. 
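
The standby behaviour described for maintenance and support can be sketched as a small resource pool with a waiting queue. The class name MaintenanceStation, the first-come re-check order and the resource name "crew" are assumptions of this example rather than details given in the paper.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Need = Dict[str, int]


class MaintenanceStation:
    """Resource pool in which events wait on standby when resources run out."""

    def __init__(self, resources: Need) -> None:
        self.available: Need = dict(resources)
        self.standby: Deque[Tuple[str, Need]] = deque()
        self.active: Dict[str, Need] = {}

    def request(self, event_id: str, need: Need) -> bool:
        """Start the event if resources suffice; otherwise put it on standby."""
        if all(self.available.get(name, 0) >= qty
               for name, qty in need.items()):
            for name, qty in need.items():
                self.available[name] -= qty
            self.active[event_id] = need
            return True
        self.standby.append((event_id, need))
        return False

    def complete(self, event_id: str) -> List[str]:
        """Release resources and re-check standby events in arrival order."""
        for name, qty in self.active.pop(event_id).items():
            self.available[name] += qty
        waiting, self.standby = self.standby, deque()
        # Events that still cannot start are re-queued by request().
        return [eid for eid, need in waiting if self.request(eid, need)]


station = MaintenanceStation({"crew": 1})
started_a = station.request("A", {"crew": 1})   # starts immediately
started_b = station.request("B", {"crew": 1})   # waits on standby
resumed = station.complete("A")                 # B can now start
```

Re-checking in arrival order is one simple policy; a priority rule could be substituted without changing the structure.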


4 AVAILABILITY SIMULATION LOGIC 


4.1 The main simulation logic 


Taking the components of the navigation system as 
the subject of the simulation, and using mission 
events as the driver that triggers failure events 
and all events in the repair, interruption and 
support models, the main process of the overall 
availability simulation for GNSS was established 
following the discrete event simulation approach, 
as shown in Figure 5. As the simulation events 
advance, it is determined one by one whether the 
failure and maintenance events affect the 
implementation of the mission, and the corresponding 
maintenance activities are carried out until the end 
of the mission, after which the availability 
statistics are computed. 

Based on the main simulation logic, each event 
has a detailed process. Eight events and their 
simulation algorithms were developed in this paper. 
The list of simulation events is shown in Table 13. 
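
The event-driven idea behind the main logic can be illustrated with a deliberately reduced sketch: a single repairable unit with only two event types, advanced through a time-ordered event queue, with availability taken as the fraction of the horizon spent in the working state. The exponential fault and repair times, the parameter values and the function name are assumptions of this example; the paper's simulation handles eight event types and is built on STK/MATLAB.

```python
import heapq
import random


def simulate_availability(mtbf: float, mttr: float, horizon: float,
                          rng: random.Random) -> float:
    """Discrete event simulation of one unit; returns uptime / horizon."""
    events = [(rng.expovariate(1.0 / mtbf), "fault")]  # (time, kind) heap
    up, last_time, uptime = True, 0.0, 0.0
    while events:
        time, kind = heapq.heappop(events)
        if time >= horizon:
            break
        if up:
            uptime += time - last_time
        last_time = time
        if kind == "fault":
            up = False
            heapq.heappush(
                events, (time + rng.expovariate(1.0 / mttr), "repair_done"))
        else:
            up = True
            heapq.heappush(
                events, (time + rng.expovariate(1.0 / mtbf), "fault"))
    if up:
        uptime += horizon - last_time  # working until the end of the horizon
    return uptime / horizon


estimate = simulate_availability(mtbf=100.0, mttr=10.0, horizon=200000.0,
                                 rng=random.Random(1))
```

For these illustrative parameters the estimate should lie close to the steady-state value MTBF / (MTBF + MTTR), which is about 0.909.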


4.2 The simulation process of typical events 


In view of the limitation of pages, only several sim- 
ulation processes of typical events are presented in 
this paper. 


4.2.1 Simulation process of mission event 

The mission event is the key to implementing the 
availability simulation of the satellite navigation 
system. Driven by the mission, the equipment failure 
and maintenance processes are simulated, and the 
impact on the mission is determined. In this paper, 
satellite visibility is used as the evaluation 
criterion for 
Figure 5. Main simulation logic. 


Table 13. Simulation events. 

Serial number   Simulation events 
1               Mission event 
2               Unit fault event 
3               Unplanned interrupt repair event 
4               Planned interrupt repair event 
5               Preventive maintenance event 
6               Corrective maintenance event 
7               Support event for satellite-replacing in the space 
8               Ground support resource 


the availability of the navigation system. The mis- 
sion simulation process is shown in Figure 6. 

First, the mission type is determined, since the 
navigation, positioning and timing missions are 
subject to different success criteria, and then the 
mission implementation is determined at each 
simulation step. According to the mission 
requirements of the navigation system, the state of 
each unit in the state model is read to determine 
whether the covered multiplicity satisfies the 
coverage required by the user in the area, that is, 
whether the coverage of the mission point by the 
space units is 

[Figure 6 flowchart: the mission starts; the mission type is selected 
(navigation, positioning or timing); the covered multiplicity is calculated 
by selecting the position (Xi, Yi) of the sub-satellite point of the current 
unit at time ti and counting the task points (P, Q) within (Xi + R, Yi + R); 
if the covered multiplicity meets the requirements the mission is executed, 
otherwise the phase mission fails; the loop repeats until the mission ends, 
after which the parameter statistics are computed.] 
Figure 6. Simulation process of mission. 


met. For a given mission, the location information of the
mission point is defined in advance. The dynamic position
information in the structural unit is therefore read to
determine the covered multiplicity. Since the minimum
number of units required for each mission type is constant,
the phase mission is judged successful when the number of
units covering the mission point reaches the minimum value
specified by the mission. The phase mission fails if the
number of available units is insufficient.
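
The success test described above can be sketched as follows. This is only a minimal illustration, not the implementation used in the paper: it assumes planar coordinates and a square coverage window of half-width R around each unit, and the function names `covered_multiplicity` and `phase_mission_succeeds` are hypothetical.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def covered_multiplicity(mission_point: Point, units: Iterable[Point], r: float) -> int:
    """Count the units whose square window of half-width r contains the mission point.

    The square window is an assumption made for this sketch.
    """
    p, q = mission_point
    return sum(1 for (x, y) in units if abs(x - p) <= r and abs(y - q) <= r)


def phase_mission_succeeds(mission_point: Point, units: Iterable[Point],
                           r: float, minimum_units: int) -> bool:
    """The phase mission succeeds when the covered multiplicity reaches the minimum."""
    return covered_multiplicity(mission_point, units, r) >= minimum_units


# Hypothetical positions of the currently available units.
units = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (-0.5, 0.5), (0.2, -0.8)]
multiplicity = covered_multiplicity((0.0, 0.0), units, r=1.0)
```

With the positions above, four of the five units cover the mission point, so a mission that requires four units succeeds and one that requires five fails.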


4.2.2 Simulation process of unplanned interrupt repair event

Maintenance events are also an important part of the
availability simulation and are necessary for simulating
the real operation of the navigation system. The simulation
process of the unplanned interrupt repair event for the
space segment is shown in Figure 7.

If a satellite is interrupted, it is first determined
whether the satellite can receive the ground signal. If it
cannot, the interruption is judged to be a long-term hard
fault of the unplanned long-term interrupt type, and the
satellite needs to be replaced. It is then necessary to
determine whether a backup satellite is available to
replace it.
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repair. 


If no backup satellite is available, a request for
satellite replenishment is made to the ground. If the
replenishment fails, the system enters the state of waiting
for satellite replenishment from the ground. If the
replenishment succeeds, the satellite-replenishment-complete
event is sent out after the time T, and the state unit is
updated at the same time. If a backup satellite is
available, the satellites in orbit undergo orbital position
adjustment, a request for satellite replenishment is made
to the ground, and the system then enters the state of
waiting for satellite replenishment from the ground. If the
satellite can receive the ground signal, the fault is a
short-term interruption, that is, a short-term unplanned
interruption caused by a short-term hard fault. In this
case the satellite attempts self-repair. If the repair
fails, the satellite is checked again to determine whether
it can receive ground signals. If the repair succeeds, the
repair-complete event is sent out after the time T, and the
state unit is updated at the same time.
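
The first branching of this process can be summarised in a short sketch. It is an assumption-laden simplification: the function name and the action labels are hypothetical, and the retry loop of the self-repair and the delay T before the completion events are not modelled.

```python
from typing import List


def unplanned_interruption_actions(receives_ground_signal: bool,
                                   backup_in_orbit: bool) -> List[str]:
    """Return the ordered actions triggered by an unplanned interruption.

    Labels are illustrative only; they paraphrase the steps in the text.
    """
    if receives_ground_signal:
        # Short-term hard fault: the satellite tries to repair itself.
        return ["short-term interruption", "self-repair"]
    # Long-term hard fault: the satellite has to be replaced.
    actions = ["long-term interruption"]
    if backup_in_orbit:
        actions.append("adjust orbital position of satellites in orbit")
    actions.append("request satellite replenishment from ground")
    actions.append("wait for satellite replenishment from ground")
    return actions


short_term = unplanned_interruption_actions(True, backup_in_orbit=False)
long_term = unplanned_interruption_actions(False, backup_in_orbit=True)
```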


5 CASE ANALYSIS 


Taking GPS as the target system, the Beijing area was
selected for a single-point positioning mission, and the
STK/MATLAB hybrid simulation method was used to simulate
and analyze the service availability of GPS in Beijing. The
latitude and longitude coordinates of Beijing are (40, 116),
and the ground elevation is 20 m.

The GPS data were derived from (U.S. Department of Defense
2008). Failure data and maintenance data were assigned
based on experience. It is assumed that there is one backup
satellite in each of the B, D and F orbits and two backup
satellites at the ground launch base. The availability
simulation model of GPS was built according to the method
described in this paper. The STK/MATLAB hybrid simulation
was conducted with 10 simulation runs, a simulated period
from July 01 2007 to June 30 2015, and a simulation step of
1 minute. The trajectory of each unit is shown below:


Figure 8.
state.
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Figure 9. Quantity of visible GPS satellite in Beijing. 
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During the simulation, the available time sec- 
tion of each satellite to the ground mission area 
can be obtained. The satellite coverage multiplicity over
Beijing in each time section is then obtained
statistically, as shown in Figure 9. Finally, the list of
coverage multiplicity data for each event is obtained. The
single-point service availability of the system in each
state is computed by MATLAB simulation.

The availability statistics model is as below: 


$$
a_i = \begin{cases} 0, & F - N < 0 \\ 1, & F - N \ge 0 \end{cases} \qquad (1)
$$

$$
A = \frac{\sum_{i=1}^{n} a_i \, \Delta t_i}{T} \qquad (2)
$$

where F is the actual coverage multiplicity, N is the
minimum coverage multiplicity required for the success of
the mission, and a_i is the available state function of
the unit.
The availability of the GPS satellite navigation system
obtained from these statistics was 99.95%.
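
The statistic can be illustrated with a few lines of code. The sketch below assumes that the state function equals 1 whenever F - N >= 0 and 0 otherwise, and that all time sections have the same length; the function name `availability` and the sample values are hypothetical and are not data from the simulation.

```python
from typing import Sequence


def availability(coverage: Sequence[int], minimum: int, step: float = 1.0) -> float:
    """Fraction of the simulated time in which the coverage multiplicity F
    reaches the required minimum N (equal time steps are assumed)."""
    if not coverage:
        raise ValueError("at least one time section is required")
    total_time = len(coverage) * step
    available_time = sum(step for f in coverage if f - minimum >= 0)
    return available_time / total_time


# Hypothetical coverage multiplicities for four one-minute sections, N = 4.
example = availability([5, 4, 3, 6], minimum=4)
```

Here three of the four sections satisfy F - N >= 0, so the sketch yields an availability of 0.75.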


6 CONCLUSION 


Based on the characteristics of satellite navigation
systems, and considering the operation, interruption,
maintenance and support of the navigation system, the
framework of an availability simulation model of GNSS was
established, and the detailed models, including the space
and ground state, interruption, maintenance and support
models and the relationship model, were described. The
availability simulation logic flow based on the operational
process was also developed, which supports the availability
simulation of GNSS. Combined with the simulation analysis
data, it can provide a basis for improving the operational
program and for developing the system dynamic interrupt
plan and in-orbit backup scheme, in order to increase the
system availability.
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ABSTRACT: Air transportation is essential for society and it is increasing gradually. Furthermore, new 
technologies that improve the airspace operation in terms of safety are under development, such as the
Unmanned Aircraft Systems (UAS). In the past few years, the number of UAS operating in segregated
airspace has grown. However, there are many challenges to be faced in order to integrate these
autonomous aircraft into the non-segregated airspace. In this context, the relationship between UAS and
Air Traffic Controller (ATCo) is an important aspect because the presence of UAS in the non-segregated
airspace may represent an increase in ATCo workload and, ultimately, a reduction in safety levels. This
increase results from the lack of familiarity of the ATCo with the UAS, which leads these professionals
to perform additional activities (e.g. cognitive activities) in order to conduct the air traffic safely.
In fact, the Technology Maturity Level (TML), which is a measurement system that models the level of
familiarity of ATCo with a particular aircraft, shows the impacts of different aircraft on workload levels.
The main goal of this research is to propose a flow-management method for balancing the workload of
sectors in order to reduce the impact of the UAS presence in the non-segregated airspace on safety levels.
This workload balancing method considers the TML of Manned Aircraft (MA) and UAS, which may
increase (in the case of UAS) or decrease (in the case of MA) the ATCo workload. The results show that this
method can be employed for balancing workload effectively, even considering the UAS presence, and that
integrating a small number of UAS into non-segregated airspace may present acceptable safety levels from
the ATCo workload perspective.


1 INTRODUCTION 


The increasing importance of air transportation for society
(Marquart, Ponater, Mager, & Sausen 2003) has been
considered an enabler for the development of technologies
that improve the airspace operation, such as Unmanned
Aircraft Systems (UAS) and Decision Support Tools (DST) for
Air Traffic Controllers (ATCos) (e.g. Arrival and Departure
managers) (Noskievič & Kraus 2017). These new technologies
are proposed for improving airspace operation from many
perspectives, such as efficiency and capacity. The DSTs,
for instance, assure the Air Traffic Controller (ATCo) that
the decisions made are effective, which leads to a
reduction in his/her workload (i.e. the time spent on
controlling aircraft) and, ultimately, in airspace
complexity from the ATCo perspective (Majumdar & Ochieng
2002). Although these technologies are used in many cases,
they may bring uncertainties because the personnel (e.g.
ATCos) are not familiar with them.

Furthermore, in the past few years, the number of UAS
operating in segregated airspace has grown (Guerin 2015).
These aircraft are composed of subsystems such as the
Unmanned Aircraft Vehicle (UAV), payload, control station
and communication subsystems (Fasano, Accado, Moccia, &
Moroney 2016) (Austin 2011) and have several military and
civil applications (e.g. firefighting). However, the
integration of these aircraft into the non-segregated
airspace may create new ways of reaching well-known unsafe
states. For instance, mistakes in piloting that a human
would otherwise make can instead be caused by a software
bug. As the acceptance of this new technology is increasing
due to its advantages compared to manned aircraft (e.g.
efficiency), it stands out as a challenge for ATCos due to
the lack of familiarity.

Moreover, the workload of the ATCo results from the
interaction of several factors, including the Air Traffic
Control (ATC) complexity (Mogford, Guttman, Morrow, &
Kopardekar 1995), and challenging situations for ATCos are
related to a higher workload (Meckiff, Chone, & Nicolaon
1998). That is, ATCo workload is related to safety (Neto,
Baum, Hernandez-Simes, Almeida, Camargo, & Cugnasca 2017)
as well as to complexity, which is one of the main factors
that impact ATCo workload (Majumdar & Ochieng 2002).

In this sense, the Air Traffic Flow Management (ATFM)
enables the Air Traffic Management (ATM) to be effective in
terms of safety, efficiency, cost-effectiveness,
environmental sustainability and interoperability of ATM
systems (ICAO 2014). The ATFM is conducted by an ATC cell.
For instance, in European airspace, ATFM activities are
carried out by Eurocontrol's Network Manager Operations
(Alam, Chaimatanan, Delahaye, & Féron 2017). However, there
are many challenges in terms of flow management, such as
the unpredictability of weather conditions and unexpected
delays. Furthermore, the interaction between the UAS and
the ATCo may affect the safety of the airspace. Considering
a high air traffic density and that a higher workload level
leads to a lower airspace capacity (Majumdar & Polak 2001),
failing to adapt the sectors' capacity to the presence of
the UAS during the ATFM process may compromise the safety
levels.

The main goal of this research is to propose a
flow-management method for balancing the workload of
sectors in order to reduce the impact of the UAS
integration in the non-segregated airspace on safety
levels. This workload balancing method considers the level
of familiarity of ATCos with Manned Aircraft (MA) and UAS,
which is lower for UAS and higher for MA and therefore
increases (in the case of UAS) or decreases (in the case of
MA) the ATCo workload.

This paper is organized as follows. Firstly, Section 2
presents the related works. Secondly, Section 3 presents
the aspects of flow management in the airspace. Thirdly,
Section 4 presents the aspects of UAS, including the ATCo
workload that results from the relationship between ATCo
and UAS. After that, Sections 5, 6 and 7 present,
respectively, the workload balancing model, the method and
the case studies adopted in this research. Then, Section 8
presents a discussion on the results achieved in the
experiments. Finally, Section 9 shows the conclusions of
this research.


2 RELATED WORKS 


In (Alam, Chaimatanan, Delahaye, & Féron 2017), the authors
present a distributed air traffic flow management model for
addressing four-dimensional trajectory planning over the
European Functional Airspace Blocks (FAB), a concept
adopted in European airspace to allow cooperation for
improving the air traffic flow in an efficient and safe
manner. This distributed approach to ATFM enables
information sharing between airspace blocks in strategic
planning, minimizing interaction between trajectories.
Furthermore, this method is implemented and tested with
real air traffic data over the European airspace, and
interaction-free 4D trajectories are produced in short
computational time (according to the time constraints of
this operation). The results showed that this distributed
method is viable and that the interaction is reduced. In
addition to this contribution, we propose an ATFM method
for balancing the ATCo workload over the ATC system.
Furthermore, we consider the presence of UAS in the
non-segregated airspace.

In (Ivanov, Netjasov, Jovanović, Starita, & Strauss 2017),
the authors propose a two-level mixed-integer optimization
model for solving the en-route demand-capacity imbalance
problem in order to explore the possibility of controlling
the ATFM delay distribution. Thus, the delay propagated to
subsequent flights can be minimized. Considering that this
research comes from a practical background and that the
proposed model is compatible with the models used in the
real world, tests are conducted in realistic experiments.
In order to mitigate the
effects of delays in flights, aircraft operators usu- 
ally embed a buffer time in their schedules. The 
current practice for allocating ATFM delays does 
not consider, though, if flights are able to absorb 
ATFM delay and still reduce delay propagation 
to subsequent flights, i.e., if the flights have any 
remaining schedule buffer. The results show that 
the proposed method reduces the delay propa- 
gated to subsequent flights and improves airport 
slot adherence. However, although this is an inter- 
esting contribution in terms of ATFM and delay 
reduction, this research does not consider the UAS 
presence. 

The authors in (Neto, Baum, Almeida, Camargo, & Cugnasca
2017) present a simulation tool that aims to evaluate
safety (from the workload perspective) and efficiency in
aircraft sequencing in the final sector considering the UAS
presence. In this paper, a
novel approach for measuring the impacts of the 
integration of UAS into the non-segregated air- 
space is proposed. This approach, called the Tech- 
nology Maturity Level (TML), models the level of 
familiarity of ATCo with these aircraft. The real- 
istic results achieved in the experiments showed 
that, depending on the familiarity of ATCo with 
the UAS, these aircraft may present a consider- 
able impact in the workload levels and, ultimately, 
in the airspace capacity. However, although the 
authors present an approach for integrating the 
UAS into the non-segregated airspace, aspects of 
Air Traffic Flow Management (ATFM) are not 
taken into account. 

The authors in (Clothier, Denney, & Pai 2017) propose a way
to create a Risk Informed Safety Case (RISC) applied to the
context of safety assurance of UAS operation. This approach
aims to facilitate safe and cost-effective operations of
small UAS (sUAS) by presenting the comprehensive measures
considered in order to eliminate, reduce, or control the
safety risk. The proposed RISC is composed of barrier
models of safety, which support the
development of safety measures, and structured 
arguments, which provide assurance of safety in 
operations (through, for instance, appropriate evi- 
dence). The authors also propose a model for UAS 
operational risk, which considers, for instance, spe- 
cific hazards (e.g. collision) and operational risks 
(depending on the UAS). Ultimately, this paper 
shows key safety-related assurance concerns to be 
addressed and the development of a layered frame- 
work for reasoning about those concerns, which 
can be useful for regulators and various stakehold- 
ers in justifying confidence in operational safety in 
the context of the absence of the relevant aviation 
regulations for UAS. 


3 AIR TRAFFIC FLOW MANAGEMENT 
(ATFM) 


According to (ICAO 2014), the ATFM is “an enabler of Air
Traffic Management (ATM) efficiency and effectiveness. It
contributes to the safety, efficiency, cost-effectiveness,
and environmental sustainability of an ATM system”. ATFM
enables the airspace to operate smoothly and resiliently,
considering the possible difficulties that may be faced
(e.g. bad weather conditions). The planning of traffic
movements (e.g. scheduling) helps the ATC units to
collaboratively adapt the airspace to current needs from a
global view.

Among the objectives of the ATFM, we can highlight (ICAO
2014): (1) improvement of the safety of the ATM system; (2)
optimization of flow in all phases of flights; (3)
simplification of collaboration between stakeholders (e.g.
ATC units and airlines); (4) a better understanding of the
ATM system resource constraints in terms of economic and
environmental priorities.

The approach of involving more stakeholders in the planning
is a strategic way of predicting future problems (e.g.
unexpected delays and conflicts), mitigating risks (e.g. by
using a delay buffer in each flight, regions with bad
weather conditions can be avoided by the aircraft) and
avoiding unsafe states. Thus, traffic planning enables a
balanced definition of the trade-off between safety and
efficiency. As efficiency is, naturally, a metric to be
maximized (due to, for instance, profit increase), the flow
is planned with this aim. On the other hand, the impacts on
the safety levels of the airspace must be taken into
account in the decision-making process. Finally, safety
represents the most important restriction in this context,
i.e., efficiency should be maximized but safety must remain
at an acceptable level.


4 UNMANNED AIRCRAFT SYSTEM 
(UAS) INTEGRATION INTO THE NON- 
SEGREGATED AIRSPACE 


UAS, in which the interest of the engineering community has
increased in the past few years (Fasano, Accado, Moccia, &
Moroney 2016), is an autonomous system composed of
subsystems (e.g. communication system and control station)
(Fasano, Accado, Moccia, & Moroney 2016) (Austin 2011). The
UAS provides many advantages on a large scale (e.g.
airspace efficiency improvement) and on a small scale (e.g.
reduction of the risks to pilots in applications such as
firefighting).

Nowadays, the UAS being built vary in size and can be
classified into three categories in terms of weight (Romero
& Gomez 2017): (1) Small UAS; (2) Medium UAS; (3) Large
UAS. Class One (small UAS) represents the UAS that have
many applications in smaller scenarios, with a weight less
than or equal to 149 kg. Class Two (medium UAS) weighs up
to 600 kg. Finally, large UAS weigh more than 600 kg. In
this research, as we consider a futuristic scenario and the
impacts of UAS are measured from the workload perspective
instead of the performance perspective (e.g. speed
difference between large aircraft and small UAS), large UAS
with size and performance similar to commercial aircraft
are adopted.

In order to control the aircraft, both UAS and Manned
Aircraft (MA), throughout the airspace, a set of activities
must be performed by the ATCo (Neto, Baum, Hernandez-Simes,
Almeida, Camargo, & Cugnasca 2017). In this research, the
activities are those proposed by specialists of the Safety
Analysis Group (GAS) of the University of Sao Paulo. These
specialists are actual Air Traffic Controllers and have
more than 10 years of hands-on experience. The activities
performed by the ATCo in each sector and their durations
(in seconds) are: (1) First Contact (10 s); (2) Instruction
(15 s); (3) Surveillance (10% of the mean time spent by the
aircraft within the sector); (4) Communication with
adjacent ATC unit (10 s).

In this context, the Technology Maturity Level (TML), which
is a measurement system that measures the familiarity of
the ATCo with the aircraft, is employed (Neto, Baum,
Almeida, Camargo, & Cugnasca 2017). Aircraft with higher
TMLs are related to operations with lower workload levels,
whereas aircraft with lower TMLs are related to operations
with higher workload levels. For instance, nowadays, it is
reasonable to consider that UAS have a lower TML, whereas
Manned Aircraft have a higher TML, because the familiarity
of the ATCo with MA is higher than with UAS: ATCos are
currently used to dealing with MA but not with UAS (which
do not operate in the non-segregated airspace) (Złotowski,
Yogeeswaran, & Bartneck 2017). In this context, it is
reasonable to consider that the UAS has a TML equal to 0
and the MA a TML equal to 10. Thus, the time spent in the
activities performed by the ATCo in MA operation is
multiplied by 1, whereas the time spent in the activities
performed by the ATCo in UAS operation is multiplied by 2
(Neto, Baum, Almeida, Camargo, & Cugnasca 2017).
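
As an illustration of how these figures combine, the sketch below computes the workload that one aircraft adds to one sector. It assumes that the four activity durations are simply summed and then scaled by the TML multiplier (1 for MA, 2 for UAS); the function and constant names are hypothetical, and the mean time of 624 s is used only as an example value.

```python
# Durations of the fixed activities, in seconds: first contact,
# instruction and communication with the adjacent ATC unit.
FIXED_ACTIVITIES_S = 10 + 15 + 10
# Surveillance takes 10% of the mean time spent within the sector.
SURVEILLANCE_SHARE = 0.10
# Multipliers stated in the text for TML 10 (MA) and TML 0 (UAS).
TML_MULTIPLIER = {"MA": 1.0, "UAS": 2.0}


def aircraft_workload(kind: str, mean_time_in_sector_s: float) -> float:
    """Workload, in seconds, that one aircraft of the given kind adds to a sector."""
    base = FIXED_ACTIVITIES_S + SURVEILLANCE_SHARE * mean_time_in_sector_s
    return TML_MULTIPLIER[kind] * base


ma_workload = aircraft_workload("MA", 624)    # about 97.4 s
uas_workload = aircraft_workload("UAS", 624)  # about 194.8 s
```

Under these assumptions, a UAS costs the controller twice the time of a manned aircraft crossing the same sector.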


5 WORKLOAD BALANCING 
MODEL (WBM) 


This section presents the main contribution of this 
research, the ATCo Workload Balancing Model 
(AWBM) for safe Air Traffic Flow Management 
(ATFM), from the workload perspective, consid- 
ering the UAS presence. Firstly, the flow network 
employed in the AWBM is presented. Then, aspects 
of the allocation process are highlighted. Finally, 
the final considerations are shown. 


5.1 Flow network 


The airspace is composed of (but not limited to) sectors
and tracks. The aircraft that operate in the non-segregated
airspace use these tracks in order to fly to different
regions. Naturally, these flights lead the aircraft to
cross a set of sectors, which are controlled by ATC units.
Each ATC unit is responsible for one sector, i.e., although
the work is collaborative, the metrics of each ATC unit
(e.g. ATCo workload) are independent and, in order to
maintain the safety levels, the ATCo workload of a given
sector must not exceed the established maximum acceptable
workload. Finally, a reasonable maximum acceptable
workload, i.e., a reasonable workload threshold, is 80% of
an hour (Majumdar & Polak 2001) (Majumdar, Ochieng,
Bentham, & Richards 2005). The safety levels may then be
compromised if the current workload exceeds the workload
threshold.

Figure 1 illustrates a region of the airspace and its
respective set of sectors. Each sector has its own ATC unit
and, consequently, its own workload threshold. Note that
there are points (b_1, b_2, b_3, ..., b_n) that connect the
sectors according to the tracks. These points represent the
sector boundaries, which are logical elements that identify
the region in which the aircraft changes from one sector to
another. In order to abstract this scenario, a flow network
is
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Figure 2. Flow network used for representing the sce- 
nario presented in Figure 1. 


appropriate due to its suitability for dealing with
problems in which goods (in this case, aircraft) must be
sent from one point to another, considering specific
constraints (in this case, workload and safety). In order
to model the problem considered in this research, the flow
network can be viewed as a graph with a flow, a capacity,
source and sink nodes, goods (i.e. aircraft) to be
delivered and specific constraints (Ford & Fulkerson 1962).

Suppose an undirected graph G with nodes b_1, b_2, ..., b_n
and edges e_1, e_2, ..., e_m. The nodes represent the
sector boundaries, i.e., each node is a logical element
that marks the point at which the aircraft changes from one
sector to another. One should note that this point refers
to the change of the ATC unit that controls the aircraft,
since each ATC unit is responsible for one sector.
Furthermore, the edges represent the direct trajectory from
one sector to another. Each trajectory has a particular
length. Finally, a fleet composed of a set of aircraft (MA
and UAS), a_1, a_2, ..., a_p, must be allocated to paths
that connect the source and the target of the network,
i.e., the origin and destination of each aircraft.

Figure 2 illustrates the flow network used by the AWBM for
representing the scenario presented in Figure 1. In this
scenario, each trajectory has a specific length. One should
note that, as a trajectory is within a sector and a sector
has a limited workload threshold¹, there is a limited
number of aircraft that can operate simultaneously in each
trajectory.


5.2 Allocation process 


The scheduling problem faced in this research consists of
allocating the set of aircraft to selected paths. This
allocation process aims to balance the ATCo workload levels
in the airspace in order to maintain a proper safety level
in all sectors. For instance, if all aircraft are assigned
to a single path, that path is expected to have a high ATCo
workload level whereas the other possible paths have a low
ATCo workload level, i.e., the workload is unbalanced and
some sectors may present critical scenarios in terms of
safety. Firstly, in order to select a path for a given
aircraft going from one node to another, all possible paths
between them are found in the graph. Secondly, the ATCo
workload of the sectors that compose each path is evaluated
in order to identify which path presents the lowest
workload level. Finally, the path with the lowest sum of
ATCo workload is selected and the aircraft is allocated to
that path.
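
The three steps above can be sketched as a greedy procedure. This is a minimal illustration under stated assumptions rather than the authors' implementation: the network is small enough to enumerate every simple path, ties are broken by the lexicographic order of the paths, thresholds are not checked, and the names `simple_paths`, `allocate` and `cost` are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Track = Tuple[str, str]


def simple_paths(adj: Dict[str, List[str]], source: str, target: str) -> List[List[str]]:
    """Enumerate every simple path from source to target (iterative depth-first search)."""
    paths, stack = [], [[source]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == target:
            paths.append(path)
            continue
        for nxt in adj.get(node, []):
            if nxt not in path:
                stack.append(path + [nxt])
    return sorted(paths)


def allocate(fleet: Sequence[Tuple[str, str]], tracks: Sequence[Track],
             source: str, target: str, cost: Callable[[str], float]):
    """Assign each aircraft to the path whose sectors currently sum to the lowest workload."""
    adj: Dict[str, List[str]] = {}
    for u, v in tracks:  # undirected graph: register both directions
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    workload: Dict[FrozenSet[str], float] = {frozenset(t): 0.0 for t in tracks}
    paths = simple_paths(adj, source, target)
    if not paths:
        raise ValueError("the network does not connect source and target")
    assignment: Dict[str, List[str]] = {}
    for name, kind in fleet:
        best = min(paths, key=lambda p: sum(workload[frozenset(e)] for e in zip(p, p[1:])))
        assignment[name] = best
        for e in zip(best, best[1:]):  # update every sector crossed by the chosen path
            workload[frozenset(e)] += cost(kind)
    return assignment, workload


# Hypothetical diamond network with two alternative paths from s to t.
tracks = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t")]
fleet = [("MA1", "MA"), ("MA2", "MA"), ("UAS1", "UAS")]
assignment, workload = allocate(fleet, tracks, "s", "t",
                                cost=lambda kind: 2.0 if kind == "UAS" else 1.0)
```

In this toy network the two manned aircraft are spread over the two paths, and the UAS is then placed on one of the equally loaded paths.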

Once the aircraft is allocated, the attributes of the path
are updated. For each sector, the ATCo workload is
increased by the workload related to controlling that
aircraft, which depends on its Technology Maturity Level
(TML). This research is focused on the planning of flights,
i.e., the flights assigned to paths are expected to
generate a certain level of ATCo workload and, thus,
controlling the amount of traffic that flies in each path
is an effective way of balancing the workload.

Thus, this challenge can be posed as an optimization
problem: Equation 1 shows the function that must be
minimized in order to balance the workload in the airspace.
In fact, this equation measures how different the workloads
of the sectors are from each other, i.e., scenarios with
higher variations of workload present a higher value in
this equation. Note that the minimum value is 0, which is
achieved when the ATCo workloads of all sectors are the
same. The function W measures the workload of a given
sector, which is represented by an edge e_i. Finally, this
equation sums the absolute values of the pairwise
differences between the workloads of all sectors. However,
the restrictions of this problem must be respected.
Equation 2 illustrates a restriction stating that the
actual workload of a given sector, represented by the edge

1. The workload threshold is the maximum workload 
supported by a given sector in order to maintain the 
safety levels. 


e_i, must be less than or equal to its workload threshold
(Wt). This restriction aims to ensure that the safety
levels, from the workload perspective, are maintained at an
acceptable level. Furthermore, Equation 3 shows the
restriction that each aircraft must be assigned to exactly
one path. Solutions of this optimization problem,
considering specific scenarios, represent solutions of the
ATFM considering the UAS presence.


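$$
\min f = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \left| W(e_i) - W(e_j) \right| \qquad (1)
$$

$$
W(e_i) \le Wt(e_i), \quad i = 1, 2, 3, \dots, m \qquad (2)
$$

$$
Paths(a_k) = 1, \quad k = 1, 2, 3, \dots, p \qquad (3)
$$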
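
The objective and the workload restriction can be evaluated directly for a candidate allocation. The sketch below is illustrative only: `imbalance` computes the sum of absolute pairwise workload differences described for Equation 1, `within_thresholds` checks the restriction of Equation 2, and both names and the sample values are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def imbalance(workloads: Sequence[float]) -> float:
    """Sum of |W(e_i) - W(e_j)| over every unordered pair of sectors."""
    return sum(abs(wi - wj) for wi, wj in combinations(workloads, 2))


def within_thresholds(workloads: Sequence[float], thresholds: Sequence[float]) -> bool:
    """True when every sector workload is at or below its workload threshold."""
    return all(w <= t for w, t in zip(workloads, thresholds))


balanced = imbalance([100.0, 100.0, 100.0])   # identical workloads
unbalanced = imbalance([1.0, 3.0, 6.0])       # |1-3| + |1-6| + |3-6|
feasible = within_thresholds([1.0, 3.0, 6.0], [2.0, 3.0, 10.0])
```

Identical workloads give the minimum value of 0, whereas the second example evaluates to 2 + 5 + 3 = 10.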


5.3 Final considerations


Finally, the ATCo Workload Balancing Model (AWBM) makes
some assumptions about the scenarios modeled. Firstly, the
lengths of the paths are expected to be similar. Secondly,
the performance of the aircraft is expected to be similar
and bad weather conditions are not present in the
scenarios. Finally, unsafe states are not expected, i.e.,
the aircraft minimum separation is respected during all
flights.


6 METHOD 


This section presents the steps that guide the application
of our proposal to different problems. Note that this
approach can be applied to scenarios of different sizes
(e.g. the number of sectors) and types of aircraft (e.g. MA
and UAS). This process is illustrated in Figure 3.

Firstly, the input data is provided. This input data is
composed of the list of aircraft with the same source and
target nodes. Furthermore, the type of each aircraft must
be provided, i.e., each aircraft must be identified as
Manned Aircraft (MA) or Unmanned Aircraft System (UAS).
This information is used in the workload evaluation.
Secondly, the scenario is described. This description is
based on the transformation of a scenario into a flow
network, which contains sectors (represented by edges) and
sector boundaries

[Flowchart: Input Data, Description, Adjustment, Scenario Building,
Validation (if not valid, return to Adjustment), Experiments,
Evaluation.]

Figure 3. Method adopted in this research.
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(represented by nodes). In fact, this phase identi- 
fies the characteristics of the situation that is used 
in the model. 

Thirdly, the adjustment process is conducted in order to
make sure that the description represents a viable
scenario, i.e., that the flow network considered connects
the source to the sink. After that, the scenario building
is conducted. This process aims to build a flow network
pragmatically based on the description provided. This
scenario is then validated by checking whether all aspects
of the model are properly defined. If the validation
succeeds, the experiments can be conducted; otherwise, the
adjustment is conducted again. The experiments constitute
the process of allocating the aircraft to the available
tracks in a suitable manner. In this phase, the workload
levels of each sector are estimated. Finally, the
evaluation aims to draw insights from the results in order
to show the impact of our proposal on the workload
balancing.
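
The viability check of the adjustment step, namely whether the flow network connects the source to the sink, can be sketched with a breadth-first search. This is an assumption about how such a check might be written, not a description of the tool used by the authors; the function name `connects` and the node labels are hypothetical.

```python
from collections import deque
from typing import Dict, List, Sequence, Tuple


def connects(tracks: Sequence[Tuple[str, str]], source: str, sink: str) -> bool:
    """Return True when the sink can be reached from the source over undirected tracks."""
    adj: Dict[str, List[str]] = {}
    for u, v in tracks:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            return True
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


viable = connects([("b1", "b2"), ("b2", "b3")], "b1", "b3")
broken = connects([("b1", "b2"), ("b3", "b4")], "b1", "b4")
```

A description that fails this check would be sent back to the adjustment step before any experiment is run.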


7 CASE STUDIES 


This section presents the case studies considered in this
research. Firstly, a simpler and abstract scenario is
presented. After that, a more complex scenario is shown.
Finally, a realistic scenario is presented.


7.1 Case study I 


The main goal of this case study is to show the
applicability of our proposal in a simpler scenario, in
which the fleet is reduced. We consider a fleet of 5
aircraft, of which 2 are UAS (UAS_1 and UAS_2) and 3 are MA
(MA_1, MA_2 and MA_3). In this scenario, illustrated in
Figure 4, the aircraft must be sent from node b_1 to node
b_8.

The characteristics of the tracks of this scenario are
presented in Table 1. For each track, the table gives the
distance, in nautical miles, and the mean time T of the
aircraft in that track (or sector) and the ATCo workload
threshold, both in seconds. The

Figure 4. Scenario adopted in case study I.


Table 1. Characteristics of tracks in case study I. 
Track Distance (nm) T (s) Threshold (s) 
b-b, 52 624 681.8
b-b, 45 540 623 
b-b, 51 612 673.4 
b, - b; 60 720 749 

b, - bs 58 696 732.2 
b,- b; 57 684 723.8 
b; - b; 48 576 648.2 
b,- bg 43 516 606.2 
b,-b 51 612 673.4 




Table 2. Solution proposed in case study I. 
Aircraft Path 

MA, b, - b, - b; - by 
MA, b, - b, - b; - by 
MA, b, - b, - b,- by 
UAS, b, - b, - b, - bg 
UAS, b, - b, - b - bg 


workload threshold represents the maximum
ATCo workload supported by each sector.

The solution obtained by the allocation process
is given in Table 2. In this solution,
the aircraft are allocated in a balanced manner
throughout the network in order to share the
ATCo workload.


7.2 Case study II 


The main goal of this case study is to show the
applicability of our proposal in a more complex
scenario, in which the fleet is larger than the one
considered in case study I. We consider a fleet
of 10 aircraft, of which 5 are UAS (UAS1, UAS2,
..., UAS5) and 5 are MA (MA1, MA2, ..., MA5).
In this scenario, illustrated in Figure 5, the aircraft
must be sent from node b1 to node b10.

Furthermore, the characteristics of the tracks of
this scenario are given in Table 3. As in
case study I, the table lists for each track the
distance, in nautical miles, the mean time T an
aircraft spends in that track (or sector) and the
ATCo workload threshold, both in seconds. The
workload threshold represents the maximum ATCo
workload supported by each sector.

The results achieved in the allocation process,
given in Table 4, show that the aircraft are
sent from node b1 to node b10 in a balanced
manner, i.e., many paths composed of different
tracks are explored.

Figure 5. Scenario adopted in case study II.

Figure 6. Scenario adopted in case study III, adapted
from (DECEA 2017).

Table 3. Characteristics of tracks in case study II. 
Track Distance (nm) T (s) Threshold (s) 
b-b, 52 624 1353,8 
b-b, 45 540 623 
b, - b, 57 684 723.8 
b, - b; 55 660 597.8 
b, - b; 42 504 707 
b, - bs 49 588 656.6 
b,-b, 49 588 656.6 
b,- by 80 960 917 
b,- b, 57 684 723.8 
b,- b, 57 684 723.8 
b,- b; 57 684 723.8 
b, - bio 65 780 791 
b; - bio 62 744 765.8 
b,-b 58 696 732.2 




Table 4. Solution proposed in case study II. 


Aircraft Path 

MA, b, - b, - b, - b, - by - bio 
MA, b, - b, - b; - b,- bw 
MA, b, - b, - b4 - b, - b; - bio 
MA, b, - b, - b, - by - bio 
MA, b, - by ~ by ~ by - bio 
UAS, b, ~ by ~ by ~ by - Dyy 
UAS, b, - b, - bs- b; - bio 
UAS; b, - by ~ by- bh= bio 
UAS, b, - b, - b; - b,- bg - bw 
UAS; b, - b,~b,-b,- Dy 


7.3 Case study III


The third case study presents a realistic scenario
taken from the Brazilian airspace. We consider a
fleet of 20 aircraft, of which 10 are UAS (UAS1,
UAS2, ..., UAS10) and 10 are MA (MA1, MA2, ...,
MA10). In this scenario, illustrated in Figure 6, the
aircraft must be sent from the node Belém (bl) to
the node Santarém (sr). In this realistic scenario,
there are 11 tracks and their characteristics are


Table 5. Characteristics of tracks in case study III.

Track    Distance (nm)  T (s)  Threshold (s)
bl - mp  178            2136   1740.2
mp - dr  108            1296   1152.2
dr - sr  162            1944   1605.8
dr - tv  60             720    749
tv - sr  124            1488   1286.6
bl - ap  150            1800   1505
ap - dr  105            1260   1127
bl - tv  223            2676   2118.2
bl - pe  126            1512   1303.4
pe - at  106            1272   1135.4
at - sr  162            1944   1605.8

Table 6. Solution proposed in case study III.

Aircraft  Path
MA1       bl - mp - dr - sr
MA2       bl - pe - at - sr
MA3       bl - pe - at - sr
MA4       bl - pe - at - sr
MA5       bl - ap - dr - sr
MA6       bl - ap - dr - sr
MA7       bl - mp - dr - sr
MA8       bl - pe - at - sr
MA9       bl - pe - at - sr
MA10      bl - pe - at - sr
UAS1      bl - ap - dr - tv - sr
UAS2      bl - tv - sr
UAS3      bl - mp - dr - sr
UAS4      bl - tv - sr
UAS5      bl - pe - at - sr
UAS6      bl - tv - sr
UAS7      bl - ap - dr - sr
UAS8      bl - tv - sr
UAS9      bl - mp - dr - sr
UAS10     bl - tv - sr

listed in Table 5 (the distance, the mean time of the
aircraft in the track and the workload threshold).

The results of the allocation, presented in
Table 6, show that all available tracks are used,
i.e., the ATCo workload is shared among the 11
sectors.
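
The claim that every track is used can be checked directly from the paths in Table 6. The following is a minimal sketch, not the authors' allocation method: it only counts how many aircraft traverse each track. The sequential aircraft numbering, the node code "pe" and the helper names tracks_of and track_usage are assumptions made for this illustration.

```python
from collections import Counter

# Paths as printed in Table 6 (case study III).
# ASSUMPTION: aircraft are numbered sequentially in table order.
PATHS = {
    "MA1": "bl-mp-dr-sr", "MA2": "bl-pe-at-sr", "MA3": "bl-pe-at-sr",
    "MA4": "bl-pe-at-sr", "MA5": "bl-ap-dr-sr", "MA6": "bl-ap-dr-sr",
    "MA7": "bl-mp-dr-sr", "MA8": "bl-pe-at-sr", "MA9": "bl-pe-at-sr",
    "MA10": "bl-pe-at-sr",
    "UAS1": "bl-ap-dr-tv-sr", "UAS2": "bl-tv-sr", "UAS3": "bl-mp-dr-sr",
    "UAS4": "bl-tv-sr", "UAS5": "bl-pe-at-sr", "UAS6": "bl-tv-sr",
    "UAS7": "bl-ap-dr-sr", "UAS8": "bl-tv-sr", "UAS9": "bl-mp-dr-sr",
    "UAS10": "bl-tv-sr",
}


def tracks_of(path):
    """Split a path such as 'bl-tv-sr' into its consecutive tracks."""
    nodes = path.split("-")
    return list(zip(nodes, nodes[1:]))


def track_usage(paths):
    """Count how many aircraft traverse each track."""
    usage = Counter()
    for path in paths.values():
        usage.update(tracks_of(path))
    return usage


usage = track_usage(PATHS)
for (start, end), count in sorted(usage.items()):
    print(f"{start}-{end}: {count} aircraft")
```

The counts say nothing about the workload itself, which depends on the workload model defined earlier in the paper; they only confirm which sectors receive traffic.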


8 DISCUSSION 


In the first case study, the ATCo workload level
of each track is illustrated in Figure 7. Note that
the workload levels of all tracks are much lower
than their respective workload thresholds, i.e., the
sectors could support more traffic (Figure 7 and
Table 1). In addition, distributing the aircraft in
this balanced manner keeps the different sectors
at similar workload levels. Finally, the result of the
unbalancing measure (Equation 1) is 3094.8 s. Note
that, for instance, if the only path used is
b, - b, - b, - b, and the ATCo workload threshold
is respected, the unbalancing measure is 21390.6 s,
which is considerably higher than the value achieved
by the proposed solution and, consequently, indicates
possible safety issues in those sectors.

The second case study presents a higher number 
of tracks and the workload of each track is illus- 
trated in Figure 8. Although the path b, — bio pre- 
sented a workload (756s) similar to its workload 
threshold (791s), the method proposed in this 
research ensures that the threshold is respected. 
Finally, the result of the unbalancing measure 
(Equation 1) is 30614.1s. One should note that this 
value may be considerably higher depending on 
the paths the aircraft are allocated to, i.e., if the 
workload levels of the tracks are very different, 
this measure increases. 

Finally, the third case study also presented bal- 
anced results, which are illustrated in Figure 9. 


Figure 7. ATCo workload level of each track (case
study I). [Plot: ATCo workload (s) for each route/sector.]

Figure 8. ATCo workload level of each track (case
study II).

Figure 9. ATCo workload level of each track (case
study III).


Note that the values vary considerably. However,
this solution is balanced compared to solutions
that use only a few of these paths.
Furthermore, every sector presents a workload level
below its respective workload threshold (presented
in Table 5). Finally, the unbalancing measure of
this case study is 51096 s. Note that other
solutions may present a considerably higher
unbalancing measure: for instance, if all aircraft are
sent through the path bl - tv - sr while respecting the
ATCo workload threshold, the unbalancing measure is
72018.8 s, which may indicate problems,
from the safety perspective, in those sectors.


9 CONCLUSION 


In this paper, a flow-management method
for integrating UAS safely into non-segregated
airspace was presented. The method aims
to distribute the workload among the sectors in
order to maintain the safety levels. The results
achieved in the proposed experiments showed that
our method distributes the workload among the
sectors effectively, even in the presence of UAS.
As future work, the authors aim to consider
aspects such as delay buffers and resilience in
case of failures in the UAS operation (e.g. failures
in the command and control link).


Active power dispatch strategy of wind farms under generator faults

ABSTRACT: The generator is a critical component of the wind turbine. It transforms mechanical energy
into electric energy and transmits electricity to the grid. Based on previous experience, the fault rate
of the generator is fairly high, which increases the downtime of the turbine and has a significant effect on the
performance and the economic benefit of the wind farm. The traditional wind farm controller adopts the
proportional dispatch strategy and does not consider the fault severity. In this paper, a power dispatch
strategy for wind farms that focuses on the fault condition of the generator is proposed. The main faults are divided
into three levels according to their severity. The wind farm controller dispatches power references to wind
turbines on the basis of fault level, wind condition and power demand from the grid. When the fault level is
high, the strategy gives priority to protecting the generator. If the fault level is not high, the wind farm should
follow the power demand. In addition, the wake effect is an important factor that must be taken into
account in wind farm control because it directly affects power production. To verify the advantage
of this strategy, it is compared with the proportional dispatch strategy in simulation. The results show
that the proposed strategy achieves a good trade-off between component protection and power production.


1 INTRODUCTION 


Wind energy is one of the most widely used renewable
energies. With the rapid development of
technology and industrial application, it plays an
important role in the world energy market.
By 2016, wind energy had overtaken coal to become
the second largest form of power generation capacity
in Europe (Europe 2016). The Wind Farm (WF) is
seen as a mature and effective way to utilize wind
energy for commercial operation, and the large-scale
offshore WF is the development trend of wind
energy. This also implies that wind energy penetration
in the grid keeps growing. However,
wind energy has inherent drawbacks. One of
them is the high failure rate of the Wind Turbine
(WT) due to the harsh operating environment and
the highly turbulent wind speed. Therefore, the
stability and reliability of WFs are increasingly
important to the grid, and WT faults should not
be neglected in the operation of a WF.
In recent years, wind farm control has drawn increasing
attention from researchers. Conventional Wind
Turbine Control (WTC) alone can no longer guarantee
good WF performance during operation,
whether in terms of power production,
fatigue loads, etc. For the coordination, control
and management of all WTs in the entire WF,
Wind Farm Control (WFC) is more
direct and effective. The research on WFC can be
divided into two categories depending on whether
it intervenes in the individual turbine control. One
achieves the control objectives of the WF by
adjusting WTC from the WF level (Ebrahimi et al.
2016, Tian et al. 2017, Gonzalez et al. 2013). The
other only coordinates and distributes
variables that are not involved in WTC,
such as start/stop and power reference (Soleimanzadeh
and Wisniewski 2011, Soleimanzadeh et al.
2012, Zhang et al. 2015). The former method seems
able to achieve better results because it has more
controllable variables. However, the latter method does
not need to change the original controller from the
manufacturer, so it is easier to apply in practice.
Normally, the objectives of WFC include power
maximization, fatigue and wake reduction, frequency
response, voltage control, and following
specific requirements from the Transmission System
Operator (TSO).

Generator faults influence the power
output of the WT directly and can even result in an
emergency stop. Although the failure rate of the
generator is not the highest among all components, the
average fault-removal time is quite long and the
maintenance cost is high as well. Most research on
generator faults focuses on Fault Detection and
Diagnosis (FDD). The common methods
measure physical quantities such as voltage, current
and power and then analyze the signals using
model-based or signal-processing techniques (Balasubramanian
and Muthu 2017, Qiao and Lu 2015). However,
few articles have considered generator faults in
WFC. Some similar research on blade
and drive-train faults can be found in (Badithi et al. 2015,
Odgaard et al. 2009).

The analysis of generator faults shows that some 
faults can be mitigated by reducing load, such as 
turn-to-turn short circuit (Lešić et al. 2013). The 
down-regulation of WT can not only prevent the 
faulty generator from further damage but also 
reduce the downtime. 

The contribution of this paper is a strategy that
distributes the power demand to the individual WTs
reasonably when a generator fault occurs. According to
their severity and mechanism, we divide the faults of the
Doubly-Fed Induction Generator (DFIG) into
three levels. The proposed power dispatch strategy
applies a different distribution rule depending on
the fault level and tries to fulfil the power demand
under the premise of ensuring the WT’s safety, assuming
that all faults can be detected.

This paper is organized as follows. Section 2
briefly describes DFIG faults and the fault
classification. The WF model with wake effect is
presented in Section 3. Section 4 gives the active power
dispatch strategy of the WF. Section 5 provides the
simulation results of different cases.


2 FAULT CLASSIFICATION 


DFIG is a widely used generator in WTs because of
its low capital cost and good energy yield (Hansen
and Hansen 2007). The components with a high
fault rate include the slip ring/brush, bearing, cooling
system, winding insulation, encoder, etc. (Shipurkar
2015). Although DFIG has many different
fault modes, the only control variable in this paper
is the power reference, which adjusts the electrical
load of the WT. We therefore focus on the fault
modes that can be mitigated by reducing the load.
The fault characteristics obtained from Fault
Tree Analysis (FTA) and Failure Modes and Effects
Analysis (FMEA) are used for reference. In addition,
because the safety of the WT is always the most
important concern, fault severity is another factor
that needs to be considered. The fault classification
is as follows:

Fault level 1 (FL1): this fault level includes
faults that are not serious or cannot be mitigated
by down-regulation, for example a fault of
a redundant sensor, minor rotor misalignment
or bearing vibration.

Fault level 2 (FL2): this fault level includes
early inter-turn short-circuit faults of the stator
and rotor, as well as the initial fault of the cooling
system. The characteristic of this level is that
the overheating caused by the fault can have a more
serious effect on the generator. However, down-regulation
can reduce generator heating
by decreasing the current. To protect the generator,
a power limit $P_{\mathrm{limit},f}$ is defined. This value
depends on the specific fault mode and means that
the load of the WT should not exceed it. The specific
value of $P_{\mathrm{limit},f}$ can be estimated according to the
fault mechanism and lifetime estimation.

Fault level 3 (FL3): this level has the highest
severity, such as a phase-to-phase short circuit of the
stator or rotor. Such a fault leads directly to generator
failure. Therefore, the WT must be shut down when a
fault of this level is detected.


3 WIND FARM MODEL 


A WF consists of several WTs. However, simply taking
the ambient wind speed as the wind speed of
all the WTs and then adding the power outputs of all
WTs to obtain the power output of the WF is not accurate
because of the aerodynamic interaction between turbines.
To show the results of the power dispatch strategy, a WF
model that takes the wake effect into account is used.
The WF model includes
three parts: WT model, wake model and WF layout.


3.1 Wind turbine model 


A 5 MW DFIG WT model is adopted here.
According to the Betz theory, the mechanical
power extracted by the turbine from wind energy
can be calculated by the following equations:

$$P_m = \frac{1}{2}\rho\pi R^2 v^3 C_p(\lambda,\beta) \qquad (1)$$

$$\lambda = \frac{\omega_r R}{v} \qquad (2)$$

$$F_t = \frac{1}{2}\rho\pi R^2 v^2 C_t(\lambda,\beta) \qquad (3)$$

where $P_m$ is the mechanical power extracted by the
turbine; $\rho$ is the air density; $R$ is the radius of the
rotor; $v$ is the wind speed; $C_p$ is the power coefficient,
which depends on $\lambda$ and $\beta$; $\lambda$ is the tip speed
ratio; $\beta$ is the pitch angle; $\omega_r$ is the rotor speed;
$F_t$ is the thrust force; $C_t$ is the thrust coefficient,
which also depends on $\lambda$ and $\beta$. The surfaces of
$C_p$ and $C_t$ are shown in Figure 1.
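
As a rough numerical illustration of Equations 1 to 3, the sketch below evaluates the tip speed ratio, mechanical power and thrust force at a single operating point. The air density of 1.225 kg/m^3 and the coefficient values 0.48 and 0.8 are assumptions chosen for the example; in the paper the coefficients come from the surfaces in Figure 1. The rotor radius, rated wind speed and maximum rotor speed are taken from the turbine parameter table in Section 5.

```python
import math

RHO_AIR = 1.225  # kg/m^3, ASSUMED air density (not stated in the text)


def tip_speed_ratio(omega_r, radius, v):
    """Equation 2: lambda = omega_r * R / v (omega_r in rad/s)."""
    return omega_r * radius / v


def mechanical_power(radius, v, c_p, rho=RHO_AIR):
    """Equation 1: P_m = 0.5 * rho * pi * R^2 * v^3 * C_p, in watts."""
    return 0.5 * rho * math.pi * radius ** 2 * v ** 3 * c_p


def thrust_force(radius, v, c_t, rho=RHO_AIR):
    """Equation 3: F_t = 0.5 * rho * pi * R^2 * v^2 * C_t, in newtons."""
    return 0.5 * rho * math.pi * radius ** 2 * v ** 2 * c_t


R = 126.0 / 2.0                        # rotor radius from the 126 m diameter
V_RATED = 11.4                         # rated wind speed, m/s
OMEGA_MAX = 12.1 * 2.0 * math.pi / 60  # 12.1 rpm converted to rad/s

lam = tip_speed_ratio(OMEGA_MAX, R, V_RATED)
p_m = mechanical_power(R, V_RATED, c_p=0.48)  # C_p = 0.48 is hypothetical
f_t = thrust_force(R, V_RATED, c_t=0.8)       # C_t = 0.8 is hypothetical
print(f"lambda = {lam:.2f}, P_m = {p_m / 1e6:.2f} MW, F_t = {f_t / 1e3:.0f} kN")
```

With these assumed coefficients the mechanical power comes out slightly above the 5 MW rating, which is plausible given the 94.4% generator efficiency listed for the turbine.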

The pitch system adjusts the aerodynamic characteristics
of the blades by controlling the pitch angle. It
can change the efficiency of energy capture and
provide aerodynamic braking. The gearbox provides the
necessary speed conversion for the DFIG. The generation
system includes a DFIG and a partial-scale power
electronic converter.

These subsystems are the major parts of the DFIG
WT and are simplified as an inertial system.

The control strategies in normal operation are
Maximum Power Point Tracking (MPPT) and
Constant Power in the low and high wind speed regions,
respectively. The down-regulation strategy should
be emphasized here because it can affect the wake.
The strategy adopted here is Max-$\omega$, which is the
most widely used in industry (Mirzaei et al. 2014).

Figure 1. Surfaces of $C_p$ and $C_t$.


3.2 Wake model 


The wake effect can decrease the power production of
the WF and increase the fatigue load of downwind WTs.
Therefore, it should be considered in WFC. Several
studies focus on how to describe the wake
effect accurately (Göçmen et al. 2016). In
engineering applications, the Jensen wake model is
widely used because of its practicality and simplicity.
In this paper, we adopt the Jensen wake model for
the single wake and the quadratic summation method
for multiple wakes (Katic, Højstrup, & Jensen
1986). The velocity deficit can be expressed as:

$$1 - \frac{v}{u} = \frac{1-\sqrt{1-C_t}}{\left(1 + 2\alpha X/D\right)^2} \qquad (4)$$

where $v$ is the downwind wind speed at position
$X$; $u$ is the ambient free wind speed; $X$ is the distance
between the upwind and downwind WT; $D$ is the
diameter of the rotor; $\alpha$ is the decay constant, chosen as
0.05 for an offshore WF.

To calculate the velocity deficit of the $j$th WT
in multiple wakes, the following equation can be
used:

$$\left(1 - \frac{v_j}{u}\right)^2 = \sum_{i=1}^{n}\left(1 - \frac{v_{ji}}{u}\right)^2 \qquad (5)$$

where $v_j$ is the wind speed of the $j$th WT; $n$ is the
number of wakes that the $j$th WT is in; $v_{ji}$ is the wind
speed of the $j$th WT under the influence of the wake
of the $i$th WT.
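
The single-wake deficit and the quadratic summation can be sketched for a row of turbines in full wake, as follows. This is an illustration under stated assumptions rather than the authors' simulation model: every upwind turbine is given the same constant thrust coefficient (0.8, a hypothetical value), whereas in the paper the thrust coefficient depends on each turbine's operating point. The decay constant 0.05 and the 6.5-diameter spacing are the values quoted in the text.

```python
import math

ALPHA = 0.05      # decay constant quoted in the text for offshore WFs
SPACING_D = 6.5   # turbine spacing in rotor diameters (Section 3.3)
C_T = 0.8         # ASSUMED constant thrust coefficient for every turbine


def single_wake_deficit(c_t, x_over_d, alpha=ALPHA):
    """Jensen deficit 1 - v/u at a distance of x_over_d rotor diameters."""
    return (1.0 - math.sqrt(1.0 - c_t)) / (1.0 + 2.0 * alpha * x_over_d) ** 2


def row_wind_speeds(u, n_turbines, c_t=C_T, spacing_d=SPACING_D):
    """Wind speed at each turbine of a straight row in full wake.

    The deficits from all upwind turbines are combined by quadratic
    summation (square root of the sum of squared single-wake deficits).
    """
    speeds = []
    for j in range(n_turbines):
        deficits = [single_wake_deficit(c_t, (j - i) * spacing_d)
                    for i in range(j)]
        total = math.sqrt(sum(d * d for d in deficits))
        speeds.append(u * (1.0 - total))
    return speeds


speeds = row_wind_speeds(u=10.0, n_turbines=5)
print([round(v, 2) for v in speeds])
```

The first turbine sees the ambient speed and each further turbine sees a lower speed, because every additional upwind wake adds a positive term under the square root.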


3.3 Wind farm layout


In order to reflect the wake effect in the WF and to
simplify the calculation, five WTs are arranged in a
straight line. The selected wind direction places the
downwind WTs in the full wake of the upwind
WTs. This layout allows the study of the worst-case
wake effect. The distance between two WTs is
6.5 times the diameter of the rotor. The layout is
shown in Figure 2.

Figure 2. Wind farm layout.


4 ACTIVE POWER DISPATCH STRATEGY 
WITH FAULT ACCOMMODATION 


The traditional active power dispatch strategy is the
proportional dispatch strategy (Grunnet, Soltani,
Knudsen, Kragelund, & Bak 2010). The WF controller
distributes the power demand to the WTs in proportion
to the available power of each WT as follows:

$$P_{a,i} = \frac{1}{2}\rho\pi R^2 v_i^3 C_{p,\max} \qquad (6)$$

$$P_{r,i} = \frac{P_{a,i}}{\sum_k P_{a,k}} P_{\mathrm{dem}} \qquad (7)$$

where $P_{a,i}$ is the available power of the $i$th WT;
$C_{p,\max}$ is the maximum power coefficient of the WT;
$P_{r,i}$ is the power reference of the $i$th WT; $P_{\mathrm{dem}}$ is the
power demand of the WF from the TSO.
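
The proportional rule of Equation 7 amounts to a few lines of code. The sketch below is illustrative only: the function name proportional_dispatch and the available-power values (in MW) are hypothetical, not taken from the paper's simulation.

```python
def proportional_dispatch(available, demand):
    """Equation 7: share the demand in proportion to available power."""
    total = sum(available)
    if total <= 0:
        return [0.0] * len(available)
    return [demand * p / total for p in available]


# Hypothetical available powers (MW) for five WTs and a 12 MW demand.
refs = proportional_dispatch([4.0, 3.0, 3.0, 2.5, 2.5], demand=12.0)
print([round(r, 2) for r in refs])
```

Because the shares are proportional, the references always sum to the demand, regardless of whether any individual turbine is faulty; this is the limitation the proposed strategy addresses.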

The traditional proportional strategy considers neither
the impact of the wake effect on power
production nor the faults of the WTs.
The proposed strategy is also based on the proportional
dispatch strategy, but it takes the generator fault
classification and the wake effect into consideration
and tries to follow the power demand with
the WT’s safety as the precondition.

According to the fault level, the whole strategy is
also divided into three parts:


1. Strategy for FL1: A fault in FL1 is not severe
and cannot be mitigated by down-regulating the
WT. Therefore, the active power dispatch
maintains the original proportional strategy for
both healthy and faulty WTs.

2. Strategy for FL2: This strategy is the focus of
this paper as it involves the trade-off between
fault protection and following the power demand.
The general idea includes three steps.


Step 1. If the healthy WTs alone
can meet the power demand, the faulty turbine is
allowed to shut down. The power
demand is then distributed as follows:

$$P_{r,f,i} = 0, \qquad P_{r,h,i} = \frac{P_{a,h,i}}{\sum_k P_{a,h,k}} P_{\mathrm{dem}} \qquad (8)$$

where $P_{r,f,i}$ is the power reference of the $i$th faulty
WT; $P_{r,h,i}$ is the power reference of the $i$th healthy
WT; $P_{a,h,i}$ is the available power of the $i$th healthy WT.

Step 2. If the healthy WTs alone
cannot meet the power demand, the minimum
power reference of the faulty WT that still fulfils
the power demand, $P_{r,f,i,\min}$, is found by an exhaustive
search. There are two possible reasons why the
power demand can still be followed while reducing the
power of a particular WT. One is that the available
power of the WF is higher than the power demand. The
other is that, due to the wake effect, the down-regulation
of an upwind WT increases the wind
speed at the downwind WTs, so that the power lost
through down-regulation is compensated by
the downwind WTs. The wind speeds of the WTs are
highly coupled through the wake effect and depend on
the power references of all the WTs. Therefore, the
exhaustive method is the simplest and quickest way.

Figure 3. Flowchart of the proposed active power dispatch strategy.


Step 3. If $P_{r,f,i,\min}$ is higher than $P_{\mathrm{limit},f}$, the power
reference of the faulty WT is set to $P_{\mathrm{limit},f}$, so the
power output of the WF must be less than the power
demand. This power deviation is unavoidable if the
WT is to be protected. The power distribution is as
follows:

$$P_{r,f,i} = P_{\mathrm{limit},f}, \qquad P_{r,h,i} = \frac{P_{a,h,i}}{\sum_k P_{a,h,k}}\left(P_{\mathrm{dem}} - P_{\mathrm{limit},f}\right) \qquad (9)$$

If $P_{r,f,i,\min}$ is smaller than $P_{\mathrm{limit},f}$, the power
reference of the faulty WT is set to $P_{r,f,i,\min}$. In this way,
the strategy can follow the power demand and reduce
the load of the faulty WT as much as possible. The
power distribution is as follows:

$$P_{r,f,i} = P_{r,f,i,\min}, \qquad P_{r,h,i} = \frac{P_{a,h,i}}{\sum_k P_{a,h,k}}\left(P_{\mathrm{dem}} - P_{r,f,i,\min}\right) \qquad (10)$$

3. Strategy for FL3: A generator fault with high
severity is a serious threat to the WT’s safety. Therefore,
a WT with an FL3 fault must be shut down as
soon as possible. The power distribution is the
same as in Equation 8. The whole strategy is
shown in the flowchart in Figure 3.
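
The three steps of the FL2 strategy can be sketched as below for a single faulty WT. This is a simplified illustration, not the authors' implementation: it assumes the available powers are fixed, i.e. wake coupling is ignored, so the minimum faulty-WT reference has a closed form and the exhaustive search of Step 2 is not needed. Capping each healthy reference at its available power is also an assumption of this sketch. The function name fl2_dispatch and all numerical values are hypothetical.

```python
def fl2_dispatch(avail_healthy, demand, p_limit):
    """Return (faulty_ref, healthy_refs) for one WT with an FL2 fault.

    ASSUMPTION: available powers are fixed (no wake coupling), so the
    minimum faulty-WT reference is simply the unmet part of the demand.
    """
    healthy_total = sum(avail_healthy)
    if healthy_total <= 0:
        return min(demand, p_limit), [0.0] * len(avail_healthy)

    if healthy_total >= demand:
        faulty_ref = 0.0                      # Step 1: shut the faulty WT down
    else:
        p_min = demand - healthy_total        # stands in for the Step 2 search
        faulty_ref = min(p_min, p_limit)      # Step 3: never exceed the limit

    remaining = demand - faulty_ref
    # Proportional share of the remaining demand, capped at available power.
    healthy_refs = [min(p, remaining * p / healthy_total)
                    for p in avail_healthy]
    return faulty_ref, healthy_refs


# Hypothetical example (MW): the healthy WTs fall short, and the limit binds.
faulty, healthy = fl2_dispatch([3.5, 3.5, 3.5, 3.5], demand=17.5, p_limit=2.0)
print(faulty, healthy, faulty + sum(healthy))
```

In the example the faulty WT is held at the 2 MW limit and the farm delivers 16 MW against a 17.5 MW demand, which mirrors the unavoidable power deviation described in Step 3.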


5 SIMULATION 


To verify the advantage of the proposed active
power dispatch strategy, the WF model described
in Section 3 is used. The parameters of the 5 MW
DFIG WT are shown in Table 1. Figure 4 shows the
distribution of wind speed under the wake effect in this
WF. The ambient wind speed at WT1 is 10 m/s. All
power references are set to 5 MW, as in the MPPT strategy.
The distribution shows that the wind speed at the
downwind WTs drops due to the wake effect. The
following simulations also use this WF model.

The strategies for FL1 and FL3 are not simulated
here because they are easily understood
and do not differ from the
existing protection measures. To illustrate the fault
protection, the phase current of the faulty WT, Iys, is
used in the comparison because all of these faults
are related to the current. For
the simulation of the strategy for FL2, $P_{\mathrm{limit},f}$ is set to
2 MW. Three cases are studied here to show the
simulation results in different situations.

In Case 1, the wind speed is 12 m/s and the power
demand is 15 MW. An FL2 fault occurs on WT3.
The healthy WTs alone can meet the
power demand, so the faulty WT can be
shut down without affecting the power demand
tracking. The power output of each WT is shown
in Figure 5. The red and blue bars represent the
power output under the Traditional Proportional Strategy
(TPS) and the Proposed Proportional Strategy (PPS),
respectively. The power references, the power output
of the WF and the phase current of the faulty WT are

Table 1. Parameters of a 5 MW wind turbine.

Parameter                          Value
Rated Power                        5 MW
Rotor Diameter                     126 m
Min. and Max. Rotor Speed          6.9 rpm, 12.1 rpm
Cut-in, Rated, Cut-out Wind Speed  3 m/s, 11.4 m/s, 25 m/s
Gearbox Ratio                      97:1
Synchronous Frequency              50 Hz
Electrical Generator Efficiency    94.4%
Number of Pole-pairs               3


Figure 4. Distribution of wind speed in WF. [Plot: wind speed (m/s) against WT number.]


shown in Table 2. The results show that the faulty
WT is shut down, but the WF can still produce enough
power to follow the power demand. The faulty WT
is also protected from further deterioration.

In Case 2, the wind speed is 12 m/s and the power
demand is 17.5 MW. An FL2 fault occurs on
WT5. The healthy WTs cannot supply enough
power. Through the exhaustive method, the value
of $P_{r,f,i,\min}$ is found to be 2.28 MW in this case, which
is higher than $P_{\mathrm{limit},f}$. According to the proposed
strategy, although the WF power cannot follow the power
demand, the faulty WT has to be limited to 2 MW in order
to protect the generator. The result is shown
in Table 3. The phase current of the faulty WT is
decreased from 2.70 kA to 1.71 kA.

In Case 3, the wind speed is 10 m/s and the power
demand is 10 MW. An FL2 fault occurs on WT3.
The healthy WTs cannot fulfil the power demand
either. The value of $P_{r,f,i,\min}$ in this case is 0.73 MW,
much lower than $P_{\mathrm{limit},f}$. Therefore, the faulty
WT can be down-regulated to 0.73 MW while
the power demand is still followed. The result

Table 2. Comparison of two strategies in Case 1.

           TPS    PPS
Pr,1 (MW)  3.19   4.01
Pr,2 (MW)  3.16   3.71
Pr,3 (MW)  2.93   0
Pr,4 (MW)  2.87   3.97
Pr,5 (MW)  2.85   3.31
PWF (MW)   15.06  15.06
Iys (kA)   2.50   0

Table 3. Comparison of two strategies in Case 2.

           TPS    PPS
Pr,1 (MW)  4.08   3.49
Pr,2 (MW)  3.15   1.81
Pr,3 (MW)  3.31   0.73
Pr,4 (MW)  3.20   2.26
Pr,5 (MW)  3.16   1.71
PWF (MW)   17.57  10.00
Iys (kA)   2.70   1.71

Table 4. Comparison of two strategies in Case 3.

           TPS    PPS
Pr,1 (MW)  2.77   3.49
Pr,2 (MW)  1.94   1.81
Pr,3 (MW)  1.82   0.73
Pr,4 (MW)  1.77   2.26
Pr,5 (MW)  1.75   1.74
PWF (MW)   10.04  10.00
Iys (kA)   1.55   0.62

Figure 5. [Plot: simulation result of Case 1; WT3 is shut down in PPS.]


can be found in Table 4. The phase current of faulty 
WT can be decreased from 1.55kA to 0.62kA. 

Figure 5 shows the power output of WTs in 
these three cases. From the simulation results, it 
can be seen that the proposed active power dis- 
patch strategy can effectively ensure the safety of 
WT in the event of a generator fault, and follow 
the power demand as much as possible. 


6 CONCLUSIONS 


In this paper, an active power dispatch strategy for
WFs under generator faults is proposed. It implements
power distribution based on the
fault level. Compared with the traditional
proportional strategy, the simulation results show that
the proposed strategy strikes a good balance
between generator protection and power production.
It not only reduces the downtime caused by
non-severe faults, but also follows the power demand
as closely as possible under the premise of ensuring the
WT’s safety. The general idea can also be extended
to other faults in the WT.

However, a shortcoming is the lack of detailed
research on the specific fault modes and on
reliability evaluation. The fault model, the lifetime
estimation of the generator and the reliability model
should be studied in further research in order to combine
failure, condition, power production and reliability.


ABSTRACT: The large-scale use of distributed electricity generation may lead to increasing
grid congestion; hence, Distributed Generation (DG) curtailment is envisioned to resolve such congestion
in affected power systems. By resorting to curtailment as part of an Active Network Management
(ANM) scheme under real-time supervision/control of the DG units, more distributed generation can be
accommodated in a distribution grid while keeping the power system secure. In this paper, a method
based on an importance sampling scheme that systematically targets electrically challenging scenarios is
given to assess the curtailment risk associated with DG unit connections to the grid. It simultaneously
resorts to correlated sampling in order to estimate the risk evolution due to “perturbed” situations, i.e.
due to a newly connected DG unit. This significantly reduces the computation time of the risk indicators.
The effectiveness of the proposed method is demonstrated on a test power grid.


1 INTRODUCTION 


The European Union (EU) has set very aggressive
emission reduction targets, establishing a 20%
reduction in greenhouse gases with respect to 1990
levels by 2020, and an 80% reduction and 100%
clean electricity by 2050 (European Climate Foundation
Roadmap 2050, 2010). In this context, a large
share of distributed electricity generation is expected to
be part of the future power systems. This requires
additional investments in the network, including 
expensive and time-consuming reinforcements and 
modernization of the power system infrastructure. 
In addition, the variations in power generation 
and the interconnection may constitute barriers 
to achieving an optimal system for connecting dis- 
tributed sources to the grid. Therefore, the adapta- 
tion of current grid planning to the new situation 
in order to facilitate the integration of Distributed 
Generation (DG) units is more affordable (Do, 
M.T., Francois, B. 2017). 

However, the connection of large amounts of 
DG units to the power system introduces many 
technical and economic challenges. Reverse power 
flows, leading to line congestions and voltage prob- 
lems, are likely to take place when the non-locally 
consumed power is injected to the high-voltage 
(HV) grid (Faghihi, F., Labeau, P.E. et al. 2014). 
As a consequence, Transmission System Operators
(TSOs) will need to operate the grid by means of
an Active Network Management (ANM) scheme
with (almost) real-time control (Jarventaustac, P.
et al. 2009), possibly curtailing DG production
in case of grid congestion in order to maintain grid
security and reliability.

In general, power system reliability assess- 
ment focuses on the evaluation of some indica- 
tors (such as system average interruption duration 
index, interruption frequency index and expected 
energy not supplied) to reflect the ability to supply 
adequate electric service on the long term, while 
Probabilistic Risk Assessment (PRA) aims to esti- 
mate the probability or frequency of disturbances 
to system operation and their consequences, both 
of these two elements being the constituents of 
the risk (Rocchetta, R. et al. 2015). In this study, 
indicators related to the expected curtailment are 
envisioned, by propagating on a grid model the 
uncertainties on the loads and productions. As a 
direct Monte Carlo Sampling (MCS) turns out 
to be extremely time-consuming, an alternative 
approach was proposed, consisting in decoupling 
the probabilistic assessment of possibly challeng- 
ing situations from the corresponding curtailment 
assessment using an Optimal Power Flow (OPF) 
(Labeau, P.E., Faghihi, F. et al. 2014). 

This methodology rests on the concept of net
balance (algebraic sum of all productions minus
total load in a substation). The net balance state
space associated with the grid under study is discretized
and the cases corresponding to this mesh are
first systematically analyzed in order to identify
unsafe cells with respect to possible grid congestion.
Then, this database of unsafe cells is updated
based on possible congestion in the MV/HV transformers.
The curtailment of DG units is calculated,
based on a targeted (systematic) importance sam- 
pling of only those detailed variants (all individual 
productions and loads) likely to lead to congestion, 
i.e. those corresponding to net balances belonging 
to unsafe cells. Risk indices can therefore be calcu- 
lated efficiently. However, this approach only allows 
calculating the risk indices of interest for a global 
set of q DG units already connected to the grid. 
The present work aims to develop an accurate way 
of estimating the impact of a (q + 1)th DG unit to 
be connected to a node of the grid, on the previ- 
ous value of the risk indices for the grid with q DG 
units. Resorting to correlated sampling, one com- 
putation will simultaneously bring the risk estima- 
tion for a reference case (with q + 1 DG units) and 
for both a first perturbed case (the previous situa- 
tion with q DG units) and for a second perturbed 
case (corresponding to a different installed capacity 
of the (q + 1)th unit), for each season (associated to 
a given rating of the elements) and each grid topol- 
ogy considered (N and N-1 conditions). 

The paper is organized as follows. Section 2 
defines the risk indices and the mathematical the- 
ory supporting their estimation. Section 3 presents 
the application of this methodology to the grid. A 
test case with realistic power grid data is used to 
illustrate the approach. Results and analyses are 
given in Section 4 and conclusion is in Section 5. 


2 RISK INDEX CALCULATION


The classical definition of risk is adopted as the 
product of the probability of occurrence of the 
undesired event (i.e. contingency) and the related 
consequence (i.e. severity). To take into account 
all undesired events, the definition is extended by 
summing all contributions as: 


$$RI = \sum_{i} P(E_i) \times f(E_i) \qquad (1)$$


where $P(E_i)$ is the probability of occurrence of the
undesired event $E_i$ and $f(E_i)$ is the severity of the
related consequence. In probabilistic risk assessment,
contingency frequencies are used as probabilities
and severity functions as consequences.


2.1 Risk index in a grid with DG units 


In the context of power systems, a contingency is
defined as the unexpected loss of one or more electric
elements (e.g. line, cable, or transformer) that the
power system is made of. Overloads, which are related
to the feeders' thermal limits and influenced by the
seasons, and bus voltage magnitudes, which are related
to frequency and system balance, are both indicators
of power system stress and correspond to the
consequences for the risk calculation (Gooding, P.A.
et al. 2014). Thus, the risk index associated with one
contingency can be expressed as follows for the whole
power network:


$$RI(C_k) = \sum_{l=1}^{n_s} q_l \times P(C_k \mid x_l) \times f(C_k, x_l) \qquad (2)$$


where $x$ is the set of all seasonal conditions (e.g.
summer, inter-season, winter), $x_l$ being the $l$th
one. $C_k$ is the $k$th contingency (e.g. N/N-1
configuration). $q_l$ is the probability of the $l$th
season. $f(C_k, x_l)$ is the performance function under
congestion for electric element i in the conditions of
the $l$th season. $P(C_k \mid x_l)$ is the probability
of occurrence of congestion $k$. $n_s$ is the total
number of seasonal conditions, $n_c$ is the total
number of configurations. Let $p_k$ be the probability
of configuration $k$. The risk index due to all
contingencies is, then, obtained as:


$$RI = \sum_{k=1}^{n_c} \sum_{l=1}^{n_s} p_k \times q_l \times P(C_k \mid x_l) \times f(C_k, x_l) \qquad (3)$$
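As an illustration, the double sum over configurations and seasons can be sketched in a few lines of Python. This is only a sketch under the assumption that each term is the product of the configuration probability, the season probability, the conditional congestion probability and the severity; the function name and all numerical values are hypothetical and are not taken from the test grid.

```python
def total_risk(p_config, q_season, p_congestion, severity):
    """Weighted double sum over configurations k and seasons l.

    p_config[k]        : probability of configuration k (N or an N-1 case)
    q_season[l]        : probability of season l
    p_congestion[k][l] : probability of congestion in configuration k, season l
    severity[k][l]     : severity (e.g. curtailed power in kW) for that pair
    """
    return sum(
        p_config[k] * q_season[l] * p_congestion[k][l] * severity[k][l]
        for k in range(len(p_config))
        for l in range(len(q_season))
    )


# Hypothetical inputs: 2 configurations (N, one N-1) and 3 seasons.
p_config = [0.9, 0.1]
q_season = [0.25, 0.5, 0.25]
p_congestion = [[0.02, 0.01, 0.0], [0.2, 0.1, 0.05]]
severity = [[100.0, 80.0, 0.0], [500.0, 400.0, 300.0]]

ri = total_risk(p_config, q_season, p_congestion, severity)
print(round(ri, 3))
```

Keeping the conditional terms in a table indexed by (configuration, season) makes it easy to report conditional risks per situation before weighting them.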


Suppose that the power grid under study comprises
$n$ substations, to which wind farms (WFs)
and/or combined heat and power (CHP) units are
connected. The total load in substation $i$ is denoted
$L_i$, while the total production of DG units connected
to node $j$ is denoted by $P_j$. The total risk under all
contingencies in a grid can be expressed as:


$$RI = \sum_{k=1}^{n_c} \sum_{l=1}^{n_s} p_k\, q_l \iint f_{kl}(L, P)\, \varphi_{kl}(L, P)\, dL\, dP \qquad (4)$$


where $P$ and $L$ are the vectors of DG unit generation
and load, respectively. $\varphi_{kl}(L, P)$ is the joint
probability density function (pdf) of the variants in
configuration $k$ and season $l$. $f_{kl}(L, P)$ represents
the severity function associated to the curtailment
solving a possible congestion. For a given situation
$kl$, Eq. (4) can be simplified as:


$$RI = \iint f_{kl}(L, P)\, \varphi_{kl}(L, P)\, dL\, dP \qquad (5)$$


In order to identify possible congestions more easily,
the concept of net balance $NB$ is used again
(see Section 1). Suppose that $n_i$ is the total number
of DG units connected to node $i$. The production
of DG unit $j$ connected to node $i$ is denoted $P_{ij}$.
The risk index formula now reads:


$$NB_i = \sum_{j=1}^{n_i} P_{ij} - L_i \qquad (6)$$


$$RI = \iint f(L(NB, P), P)\, \delta(NB)\, \varphi(L(NB, P), P)\, dNB\, dP \qquad (7)$$


where $\delta(NB)$ is an indicator function: if $NB$
causes a grid congestion, it is equal to 1; otherwise
it is 0. This divides the net balance space into
two regions, a safe region where there is no congestion
and an Unsafe Region (UR) where congestion
can occur. The UR is delimited based on a discretization
of the accessible region in the net balance
space. Hence the UR is bounded by $n_{cell}$ unsafe cells
(UC). Therefore, Eq. (7) can be written as:


$$\int_{UR} \cdot \; dNB = \sum_{UC=1}^{n_{cell}} \int_{UC} \cdot \; dNB \qquad (8)$$


$$RI = \int_{UR} dNB \int f(L(NB, P), P)\, \varphi(L(NB, P), P)\, dP \qquad (9)$$


$$NB^{UC}_{i,\min} \le P_{w,i} + P_{CHP,i} - L_i \le NB^{UC}_{i,\max} \qquad (10)$$


$$RI = \sum_{UC=1}^{n_{cell}} \int_{UC} dNB \int f(L(NB, P), P)\, \varphi(L(NB, P), P)\, dP \qquad (11)$$


where $P_{w,i}$ is the total production of WFs connected
to substation $i$; $P_{CHP,i}$ is the total generation
of CHP units at node $i$; $L_i$ is the total load at
substation $i$. $NB^{UC}_{i,\min}$ and $NB^{UC}_{i,\max}$ are the extreme
values of net balance for unsafe cell UC in the
dimension corresponding to node $i$. The loads at
the different nodes are correlated, and so are the
wind productions, but correlations between wind
productions and loads are negligible. CHP productions
are independent of the other variables.
Therefore, our risk index is calculated as follows:


$$\varphi(P_w, P_{CHP}, L) = \varphi_w(P_w)\, \varphi_L(L) \prod_{i=1}^{n} \varphi_i(P_{CHP,i}) \qquad (12)$$


$$RI = \sum_{UC} \iiint_{UC} f(P_w, P_{CHP}, L)\, \varphi(P_w, P_{CHP}, L)\, dP_w\, dP_{CHP}\, dL \qquad (13)$$


where $\varphi(P_w, P_{CHP}, L)$ is the total joint pdf of all
variants, consisting of the product of the joint pdf
$\varphi_w(P_w)$ of all wind productions, the individual pdf's
$\varphi_i(P_{CHP,i})$ of all CHP generations and the joint pdf
$\varphi_L(L)$ of the loads. The detailed variant sampling
procedure will be illustrated in Section 3.3.


2.2 Risk index under perturbed cases 


Considering that a new DG unit of a specific type
and installed capacity is likely to be connected to
a substation, its impact on the previously assessed
risk indices should be estimated. As already mentioned,
the risk indices presented above correspond
to a global set of q DG units connected to the grid.
The UR related to (q + 1) DG units is an extension
of that obtained with q units, in the dimension
corresponding to the node where this new unit could
be connected. Additionally, a different installed
capacity for this new DG unit means changing this
extension range, see Eq. (14) and Eq. (15). It is thus
possible to resort to correlated sampling to assess
more accurately the difference in the risk indices
between the three situations, considering the case
with (q + 1) units of a given capacity as the
reference situation (ref), the case with q units
as perturbed situation no. 1 (per1), and the case
with a newly connected DG unit of smaller
installed capacity as perturbed situation no. 2
(per2).


$$NB^{(ref)}_{\min,i} = NB^{(per1)}_{\min,i} + \sum_{j=1}^{n_i} P_{\min,j} \qquad (14)$$


$$NB^{(ref)}_{\max,i} = NB^{(per1)}_{\max,i} + \sum_{j=1}^{n_i} P_{\max,j} \qquad (15)$$


Of course, the joint pdf’s of the uncertain loads 
and generations are affected by the addition of 
the new DG unit. The risk indices will thus be 
estimated according to the previous procedure 
for the reference case, and those associated to 
the perturbed cases are derived from a correlated 
sampling procedure. We can then write for the risk 
index definitions and Monte Carlo estimates on N 
sampled variants: 


$$RI^{(reference)} = \iiint f(P_w, P_{CHP}, L)\, \varphi^{ref}(P_w, P_{CHP}, L)\, dP_w\, dP_{CHP}\, dL \qquad (16)$$


$$RI^{(perturbed)} = \iiint f(P_w, P_{CHP}, L)\, \frac{\varphi^{per}(P_w, P_{CHP}, L)}{\varphi^{ref}(P_w, P_{CHP}, L)}\, \varphi^{ref}(P_w, P_{CHP}, L)\, dP_w\, dP_{CHP}\, dL \qquad (17)$$


$$\widehat{RI}^{(perturbed)} = \frac{1}{N} \sum_{m=1}^{N} f(P_w, P_{CHP}, L)\, \frac{\varphi_w^{per}(P_w)\, \varphi_L^{per}(L) \prod_{i=1}^{n} \varphi_i^{per}(P_{CHP,i})}{\varphi_w^{ref}(P_w)\, \varphi_L^{ref}(L) \prod_{i=1}^{n} \varphi_i^{ref}(P_{CHP,i})} \qquad (18)$$


where the sum runs over the N variants sampled from
the reference pdf. Therefore, the ratio
$\varphi^{per}(P_w, P_{CHP}, L) / \varphi^{ref}(P_w, P_{CHP}, L)$
is the correction factor ('statistical weight') of the
various contributions to the risk indices of the
reference situation, allowing obtaining in the same
simulation an estimate of the risk indices in the
perturbed situations. In fact, the correction factor
calculation is specially defined for two scenarios:


a. the added DG unit is a WF and the joint pdf of
WFs, $\varphi_w(P_w)$, has been changed; the correction
factor then simplifies to $\varphi_w^{per}(P_w) / \varphi_w^{ref}(P_w)$;

b. the added DG unit is a CHP unit and the corresponding
correction factor is represented by
$\varphi^{per}(P_{CHP}) / \varphi^{ref}(P_{CHP})$.
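The reweighting idea behind correlated sampling can be sketched with a one-dimensional toy problem. The example below is not the grid model of this paper: it assumes a single uncertain production following a normal law (hypothetical means and standard deviation), a hypothetical 12 MW limit above which power is curtailed, and it only uses the Python standard library. Variants are sampled once from the reference pdf, and the perturbed estimate is obtained by weighting each contribution by the ratio of the perturbed to the reference pdf.

```python
import random
from statistics import NormalDist


def correlated_estimates(severity, ref, per, n_samples, rng):
    """Estimate the reference and perturbed risk from the same samples.

    Samples are drawn from the reference pdf only; each contribution to the
    perturbed estimate is multiplied by the correction factor per.pdf/ref.pdf.
    """
    total_ref = 0.0
    total_per = 0.0
    for _ in range(n_samples):
        x = rng.gauss(ref.mean, ref.stdev)      # variant sampled from the reference pdf
        weight = per.pdf(x) / ref.pdf(x)        # correction factor (statistical weight)
        s = severity(x)
        total_ref += s
        total_per += s * weight
    return total_ref / n_samples, total_per / n_samples


def curtailment(production_mw, limit_mw=12.0):
    """Toy severity: power above a hypothetical 12 MW limit is curtailed."""
    return max(0.0, production_mw - limit_mw)


ref_pdf = NormalDist(mu=10.0, sigma=3.0)   # hypothetical reference case
per_pdf = NormalDist(mu=8.0, sigma=3.0)    # hypothetical perturbed case (smaller unit)

ri_ref, ri_per = correlated_estimates(
    curtailment, ref_pdf, per_pdf, n_samples=20000, rng=random.Random(1)
)
print(ri_ref, ri_per)
```

Because both estimates share the same variants, their difference is less affected by sampling noise than the difference of two independent runs, and the costly severity evaluation is performed only once per variant.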


3 ASSESSMENT METHODOLOGY 


According to Eq. (13), the risk index is computed
from the probability of the load-generation patterns
leading to a specific congestion in the grid, the
curtailed power of each DG unit and the correction
factor of the reference variables. Therefore, the main
structure of this practical methodology consists
of three parts: curtailment evaluation, probabilistic
assessment and correction factor estimation,
as shown in Figure 1. The details of the risk
calculation process are introduced in the following
paragraphs.


3.1 Preprocess 


The preprocess stage, which structures the variant
sampling, aims to partition the net balance space into
safe and unsafe cells. A SCADA system performs the
network monitoring and collects, every 15 minutes,
multiple low-frequency signals including
the existing generations and the loads. Based on
this data collection, the net balance domain can
be determined. Variants within the net balance
domain can either lead to an acceptable solution
of the load flow calculations, applied to both the
N situation and all the N-1 ones, or not. The net
balance variants can be grouped based on their
similarity in causing congestion on the grid. As
shown in Figure 2, a security boundary divides the
net balance space into the corresponding safe and
unsafe regions. Adding a DG unit capacity to node
1 corresponds to extending the net balance domain
by the yellow area, without affecting the security
boundary.

Figure 1. Framework flowchart of the proposed
methodology.

In order to obtain an approximation of this
security boundary between the safe and unsafe
regions of state space, a mesh is defined on the NB
domain, creating cells. If all the corners of a cell are
safe (i.e. if they cause no congestion on the grid),
then this cell is considered as a safe cell; otherwise
it is considered as an unsafe cell, see Figure 2.

Figure 2. Preprocess: (a) safe and unsafe regions in the
net balance space; (b) discretization of the net balance
space in safe and unsafe cells.
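The corner rule can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the load flow calculation is replaced by a hypothetical stand-in (`toy_congestion_check`, which flags congestion when the summed net balance exceeds an assumed 30 MW limit), and the mesh edges are arbitrary.

```python
from itertools import product


def classify_cells(edges, is_congested):
    """Label each cell of a rectangular mesh on the net balance domain.

    edges[d] is the sorted list of mesh edges along dimension d (one
    dimension per node). A cell is safe only if none of its corners
    causes congestion; otherwise it is unsafe.
    """
    labels = {}
    for cell in product(*(range(len(e) - 1) for e in edges)):
        bounds = [(edges[d][i], edges[d][i + 1]) for d, i in enumerate(cell)]
        corners = product(*bounds)  # all 2^n corner points of the cell
        unsafe = any(is_congested(corner) for corner in corners)
        labels[cell] = "unsafe" if unsafe else "safe"
    return labels


def toy_congestion_check(net_balances, limit_mw=30.0):
    """Hypothetical stand-in for the N and N-1 load flow calculations."""
    return sum(net_balances) > limit_mw


# Two nodes, each net balance axis meshed into three 10 MW wide cells.
edges = [[0.0, 10.0, 20.0, 30.0], [0.0, 10.0, 20.0, 30.0]]
labels = classify_cells(edges, toy_congestion_check)
unsafe_cells = sorted(c for c, lab in labels.items() if lab == "unsafe")
print(unsafe_cells)
```

The rule is conservative: a cell crossed by the security boundary is labelled unsafe as a whole, so refining the mesh tightens the approximation at the cost of more corner evaluations.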


3.2 The framework of curtailment 


Assessing the curtailment of the different DG
units is not an easy task. For such a problem, a
Monte Carlo algorithm is an appropriate solution.
Through Monte Carlo Sampling inside the unsafe
cells, the input samples are produced and the
probabilistic model is built.

At this stage, the curtailment calculation of each 
DG unit considers the curtailment cost of each 
specific DG type as well as its type of access. The 
Principles of Access (PoA) correspond generally to 
considering that the last connected producer is the 
first one to be curtailed. An OPF tool is applied 
to the sampled variants, both in N and N-1 situ- 
ations, in order to calculate the (possible) curtail- 
ments of all DG units via the objective function. In 
this research, the objective function is the minimi- 
zation of the total cost of curtailment. 
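The "last connected, first curtailed" principle can be illustrated in isolation with a very small sketch. It is not an OPF: network constraints and curtailment costs are ignored, the excess power to be removed is simply given, and the unit names and values are hypothetical.

```python
def lifo_curtailment(productions, excess_mw):
    """Curtail units in reverse connection order until the excess is removed.

    productions: list of (unit_name, production_mw) in connection order,
                 the first element being the first connected producer.
    excess_mw:   power that must be removed to solve the congestion.
    Returns a dict unit_name -> curtailed power in MW.
    """
    curtailed = {}
    remaining = max(excess_mw, 0.0)
    for name, power in reversed(productions):
        cut = min(power, remaining)   # a unit cannot be curtailed below zero output
        curtailed[name] = cut
        remaining -= cut
    return curtailed


# Hypothetical units, listed in their order of connection to the node.
units = [("WF_A", 12.0), ("CHP_B", 5.0), ("WF_new", 8.0)]
result = lifo_curtailment(units, excess_mw=10.0)
print(result)
```

In the study itself the curtailments result from the OPF objective; a cost ordering that penalizes earlier connections more heavily would reproduce this behaviour while still respecting the grid constraints.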

In parallel with the curtailment analysis, the 
Capacity Factors (CF) and utilization factors (UF) 
are computed for each DG unit. The UF is defined 
as the ratio of the actual energy (after the curtail- 
ment) that can be produced in one year over the 
corresponding theoretical maximum at peak value, 
while CF is the ratio with no curtailment. Both are
direct indicators of the economic performance that
a producer can expect under the reference and
perturbed cases.
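The two factors can be computed from a time series of available and curtailed power, as in the sketch below. The function name, the 15-minute step and the four-point series are illustrative assumptions; a real evaluation would cover one full year of data.

```python
def capacity_and_utilization(available_mw, curtailed_mw, installed_mw, step_hours=0.25):
    """Return (CF, UF) for one DG unit.

    available_mw: power the unit could produce at each time step (no curtailment)
    curtailed_mw: power curtailed at each time step
    installed_mw: installed capacity (peak value)
    step_hours:   duration of one time step (0.25 h for 15-minute data)
    """
    energy_max = installed_mw * len(available_mw) * step_hours
    energy_available = sum(available_mw) * step_hours
    energy_delivered = sum(a - c for a, c in zip(available_mw, curtailed_mw)) * step_hours
    return energy_available / energy_max, energy_delivered / energy_max


# Hypothetical 1-hour series for a 20 MW unit.
cf, uf = capacity_and_utilization(
    available_mw=[10.0, 20.0, 5.0, 15.0],
    curtailed_mw=[0.0, 5.0, 0.0, 0.0],
    installed_mw=20.0,
)
print(cf, uf)
```

The gap between CF and UF is the share of the theoretical maximum energy lost to curtailment.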


3.3 Probabilistic assessment 


The risk analysis of congestion/curtailment is only 
associated to these unsafe cells, and the variants 
corresponding to the safe cells have no contribu- 
tion to the risk indices. Thus, the systematic MCS 
based on a probabilistic modeling using historical 
data is performed by focusing on the unsafe cells. 
The stochastic input variables, i.e. distributed gen- 
eration and loads, are modeled by their joint pdf’s. 

For WFs, the pdf that has been most used to
represent the wind speed distribution is the Weibull
law, which is mathematically described as:


$$f(v) = \frac{k}{c} \left(\frac{v}{c}\right)^{k-1} \exp\left[-\left(\frac{v}{c}\right)^{k}\right] \qquad (20)$$


where $v$ is the wind speed; $k$ and $c$ are the shape
and scale parameters, respectively. An example is
illustrated in Figure 3. Based on these marginal
distributions, their joint pdf is constructed thanks
to a Gaussian Copula. Sampled wind speeds are
converted to electric power using the power curve.
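A minimal sketch of this sampling chain for two wind farms is given below, using only the Python standard library. The Weibull parameters, the copula correlation and the piecewise power curve (cut-in, rated and cut-out speeds) are hypothetical values chosen for illustration, not those fitted on the test grid.

```python
import math
import random
from statistics import NormalDist


def sample_wind_speeds(k1, c1, k2, c2, rho, n_samples, rng):
    """Sample correlated wind speeds at two sites.

    Marginals are Weibull(k, c); dependence is a Gaussian copula with
    correlation rho between the underlying standard normal variables.
    """
    std_normal = NormalDist()
    samples = []
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        speeds = []
        for z, k, c in ((z1, k1, c1), (z2, k2, c2)):
            u = min(std_normal.cdf(z), 1.0 - 1e-12)          # uniform marginal, kept below 1
            speeds.append(c * (-math.log(1.0 - u)) ** (1.0 / k))  # inverse Weibull CDF
        samples.append(tuple(speeds))
    return samples


def wind_power(v, rated_mw, cut_in=3.0, rated_speed=12.0, cut_out=25.0):
    """Hypothetical power curve: cubic rise between cut-in and rated speed."""
    if v < cut_in or v >= cut_out:
        return 0.0
    if v >= rated_speed:
        return rated_mw
    return rated_mw * (v ** 3 - cut_in ** 3) / (rated_speed ** 3 - cut_in ** 3)


speeds = sample_wind_speeds(k1=2.0, c1=8.0, k2=2.2, c2=7.0, rho=0.8,
                            n_samples=5000, rng=random.Random(42))
powers = [(wind_power(v1, 20.0), wind_power(v2, 15.0)) for v1, v2 in speeds]
print(speeds[0], powers[0])
```

The copula keeps each marginal exactly Weibull while the correlation between sites is controlled separately, which is why the marginals and the dependence structure can be fitted in two independent steps.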
Figure 3. Wind speed marginal distribution.

Figure 4. CHP sampling from the truncated pdf.


The load model is obtained by normalizing the
historical load data of each year to the maximum
consumption of the corresponding substation in
that year. These normalized data implicitly contain
a more informative consumption pattern in
the different substations. A copula is used as well
to define the joint pdf.

For the CHP units, their productions are independent
and historical data are fitted to obtain generation
profiles. In the sampling procedure, after
sampling wind and load from the corresponding
joint pdf's, significant values of CHP productions
correspond to values likely to make the net balance
fall in an unsafe cell. In Figure 4, the net balance
limited to wind and load, i.e. $(\sum_j P_j - L)_i$, is
presented. For each unsafe cell UC, a compatible
value of the CHP production at each node $i$ is sampled
from the corresponding truncated pdf in this
UC, if possible, which is then given by:

$$\varphi_i^{UC}(P_{CHP,i}) = \varphi_i(P_{CHP,i}) / P_{UC,i} \qquad (21)$$

where $\varphi_i^{UC}(P_{CHP,i})$ is the truncated pdf of the
CHP generation inside an unsafe cell UC at the $i$th
substation, while $\varphi_i(P_{CHP,i})$ represents the pdf
of the CHP generation at node $i$. $P_{UC,i}$ is the probability
of having the CHP production at node $i$ in the right
range.
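The truncated sampling step can be sketched for a single node as follows. The sketch assumes, purely for illustration, that the CHP production follows a normal law (the study fits historical data instead), derives the compatible CHP range from the cell bounds and the sampled wind-minus-load balance, and samples by inverting the CDF on that range. All names and numbers are hypothetical.

```python
import random
from statistics import NormalDist


def compatible_chp_range(nb_min, nb_max, wind_minus_load, capacity_mw):
    """CHP values that bring the net balance of the node inside [nb_min, nb_max]."""
    low = max(nb_min - wind_minus_load, 0.0)
    high = min(nb_max - wind_minus_load, capacity_mw)
    return low, high


def sample_truncated(dist, low, high, rng):
    """Sample dist restricted to [low, high] by inverse-CDF sampling.

    Returns (sample, mass), where mass is the probability of the range
    under the untruncated pdf; returns (None, 0.0) if the range is empty.
    """
    if high <= low:
        return None, 0.0
    p_low, p_high = dist.cdf(low), dist.cdf(high)
    mass = p_high - p_low
    if mass <= 0.0:
        return None, 0.0
    u = p_low + rng.random() * mass
    u = min(max(u, 1e-12), 1.0 - 1e-12)   # inv_cdf requires 0 < u < 1
    return dist.inv_cdf(u), mass


chp_pdf = NormalDist(mu=6.0, sigma=2.0)            # assumed CHP production law (MW)
low, high = compatible_chp_range(nb_min=4.0, nb_max=6.0,
                                 wind_minus_load=-3.0, capacity_mw=12.0)
chp_value, mass = sample_truncated(chp_pdf, low, high, random.Random(7))
print(low, high, chp_value, mass)
```

The returned mass plays the role of the probability of having the CHP production in the right range: it is the factor by which the truncated pdf exceeds the original pdf inside the cell, and it must be carried along as a weight in the risk estimate.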

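The risk index for the perturbed case is then
expressed as:


$$\widehat{RI}^{(per)} = \sum_{m=1}^{N} \frac{\varphi_w^{per}(P_w)\, \varphi_L^{per}(L) \prod_{i=1}^{n} \varphi_i^{per}(P_{CHP,i})}{N\, \varphi_w^{ref}(P_w)\, \varphi_L^{ref}(L) \prod_{i=1}^{n} \varphi_i^{ref}(P_{CHP,i})} \sum_{UC} f(P_w, P_{CHP}, L) \prod_{i=1}^{n} P_{UC,i}(P_w, L) \qquad (22)$$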


3.4 Correction factor calculation 


In correlated sampling, the correction factor is 
the ratio between the probability of the perturbed 
variant and that of the reference case in the cor- 
responding unsafe cells. The calculation of this 
factor uses the variants from the reference proba- 
bilistic assessment, see Figure 5. In the meantime, 
a correlated sampling, expressed in these 3 steps, is 
adopted so as to decrease the curtailment compu- 
tation time for the perturbed cases. 


a. Identify the perturbed pdf’s and the type of the 
added DG unit (WF or CHP); 

b. Sample variants from the joint pdf of the
reference case;

c. Calculate the corresponding correction factors 
for each reference variable. 


Figure 5 illustrates the calculation of the correction
factor from the reference situation to the case of
a new WF with a different installed capacity
connected to the grid. It is necessary to evaluate
the Weibull parameters of the new WF in this
perturbed case and to filter the reference data that
exceed the installed capacity of this new WF.
For the case of a new CHP unit accepted to the grid,
the CHP sample relies to a large extent on the joint
pdf's of the wind-load pattern data and on its own
historical data; the reference wind-load correlated
sampling therefore has to be executed first in order
to obtain the perturbed CHP correlated sampling.

However, for correlated sampling to be applied
efficiently, the correlated sample size must be large
enough.

Figure 5. Flowchart of correction factor calculation.


3.5 Risk estimation 


Finally, the risk indices for the reference and perturbed
cases can be estimated by averaging the product of
the curtailment magnitude and the corresponding
correction factor, which is linked to the probability
of the corresponding unsafe cells.

The weighting factor in each unsafe cell represents
the probability of that cell. It is the ratio of the
number of CHP samples taken from the truncated pdf
inside the cell to the total number of random samples
drawn from the CHP pdf. The weighting factor must be
calculated for each situation (e.g. season condition,
N/N-1 contingency).

A case study on a test power grid using this
methodology is treated in the following section.

4 TEST RESULTS AND DISCUSSION 


In this section, the effectiveness of this practical
methodology is illustrated on a test power grid. As
shown in Figure 6, the reference case corresponds
to connecting a new WF to NODE 2. The grid is
composed of 4 wind farms (57 MW), 2 CHP units
(15 MW), 1 PV unit (1 MW) and a newly connected
wind farm (20 MW). The transmission network
is operated at two voltage levels and comprises 4
high-voltage substations and 11 transmission lines
(including transformers) connected to the MV
side. Data were collected for this power grid every
15 min from January 1, 2015 to December 31, 2015.

As for the perturbed case no. 1, there are 7 DG
units (without accounting for the added WF)
connected to the substations in the power grid, while
perturbed case no. 2 corresponds to a new WF with
10 MW installed capacity accepted at NODE 2.
Assuming that all DG units are connected to the MV
side, the risk indices are estimated for the N and
N-1 configurations (i.e. with the HV/MV transformer
at node 2 out of service), and for the different seasons
leading to different line/transformer ratings. Thanks
to the correlated sampling method used, the risk
indices for both the reference and perturbed cases
are obtained simultaneously in one computation.

Figure 7 shows the Wind2A conditional risk
under different cases. The probabilities of the seasons
and of the N/N-1 configurations are not considered. The
solid rectangle represents the conditional risk in
the reference situation, while the dotted rectangle
and the dotted-line rectangle correspond to the per1
and per2 cases, respectively. The detailed values
are provided in Table 1. They show that
the conditional risk of Wind2A power curtailment
is smaller in the reference case, due to the
addition of a new WF to NODE 2. However, if
the system lacks the transformers connected to
NODE 2, more curtailment is needed than for
the other N-1 cases. The power curtailment of the
newly connected WF at NODE 2 increases with
a larger installed capacity. Owing to the seasonal
influence on wind energy, curtailment decreases
in winter.

Figure 7. Wind2A conditional risk in Summer (vertical
axis: risk of power curtailment, in kW; horizontal axis:
line/transformer out of service).

In order to integrate the different energy curtailments
under the different N-0/N-1 rules, they are multiplied
by the electrical elements rating to obtain the total
risk of energy curtailment, see Figure 8. It clearly
shows the influence of the installed capacity
of the added DG unit. The larger capacity
case produces more curtailment for this
new DG unit, but the curtailment of the other DG
units at NODE 2 decreases. Therefore, it seems that
the reference case needs more total energy
curtailment.

As shown in Table 2, the utilization factors
confirm a decrease for the first and second
curtailed producers under both the reference and
perturbed 2 cases. For the reference case in
particular, the impact is clearly visible: both the
capacity and utilization factors of Wind2A fall
compared with the perturbed 1 case, where no new DG
unit is connected. However, the perturbed 2 case has
higher capacity and utilization factors than the
reference case, since the installed capacity of the
newly added WF is more suitable for NODE 2.


Table 1. Conditional risk of different cases (in kW).

                          Reference            Perturbed 1   Perturbed 2
Season        (N-1)_i     Wind2A    Wind2C     Wind2A        Wind2A    Wind2C
Summer        (N-1)       0.25      176.42     0             0.31      14.18
              (N-1)       0         105.51     4.47          0         0
              (N-1)       52.35     1291.81    248.71        65.87     515.13
              (N-1)       52.18     1292.33    248.26        65.68     515.49
Inter-season  (N-1)       0.25      151.85     0             0.31      10.64
              (N-1)       0         34.66      4.47          0         0
              (N-1)       52.11     1283.13    247.13        65.53     515.49
              (N-1)       51.53     1282.09    245.67        64.68     511.10
Winter        (N-1)       0.25      38.28      0             0         0
              (N-1)       0         4.93       4.47          0         0
              (N-1)       52.12     1009.43    247.65        65.69     168.91
              (N-1)       51.83     1003.23    246.40        65.23     166.95
Table 2. Capacity and utilization factors.

          Reference            Perturbed 1   Perturbed 2
          Wind2A    Wind2C     Wind2A        Wind2A    Wind2C
CF (%)    27.19     27.97      27.34         27.34     28.59
UF (%)    27.19     27.96      27.33         27.33     28.58

Figure 8. Total risk energy curtailment (reference,
perturbed 1 and perturbed 2 cases, per DG unit).


5 CONCLUSION 


The large-scale use of distributed electricity
generation may be associated with increasing grid
congestion; hence, Distributed Generation (DG)
curtailment is envisioned to solve those congestions
in affected power systems. By resorting to
curtailment as part of an Active Network Management
(ANM) scheme under real-time supervision/control
of the DG units, more distributed
generation can be accommodated in a distribution
grid. In order to keep the power system secure
and to adapt the grid code, any possible
new connection should be able to accept a certain
risk level in long-term planning.

In the work presented in this paper, an efficient
methodology is proposed for the probabilistic
assessment of the risk of connecting a new DG unit
to an existing grid, using an ANM scheme, for a
defined season and configuration of the grid (N and
N-1 conditions). Furthermore, the use of correlated
sampling significantly reduces the computation time,
so that one computation simultaneously provides the
risk estimation for both the reference case and the
perturbed cases, in order to choose a more suitable
installed capacity for the DG unit.
The concept of net balance in the preprocess stage
on the one hand accounts for the dimensions of the
problem (i.e. number of DG units and substations),
and on the other hand allows for a rather quick
identification of the unsafe region by performing
a set of load flow calculations on the corners of
the cells in the constructed mesh. A probabilistic
wind model has been implemented, resorting to a
Gaussian Copula to build the joint pdf of all WFs
at the different nodes of the grid; in the meantime,
the compatible CHP variants inside each unsafe cell
were produced. Thus, the variants belonging to an
unsafe cell acted as input variables of the optimal
power flow (severity function), based on the
curtailment cost, for congestion/curtailment
calculations.

Last but not least, the advantage of this practical
methodology lies in the fast estimation of the impact
of the possible acceptance of new DG units with
different installed capacities into an existing grid,
which provides a foundation for future grid planning
and its reinforcements.


REFERENCES 


Do, M.T., Francois, B. 2017. Probabilistic approach to
evaluate the cost and constraints of the renewable pro-
duction curtailment in MW network. In Proceedings of
IFAC2017, Toulouse (France).

European Climate Foundation, Roadmap 2050: Practi-
cal guide to a prosperous, low carbon Europe 2010.
http://www.roadmap2050.eu/attachments/files/Road-
map2050-AllData-MinimalSize.pdf (last consulted on
19/12/2017).

Faghihi, F., Labeau, P.E., Maun, J.C., De Wilde, V. & Verg-
nol, A. 2014. Towards a probabilistic risk assessment of
distributed generation curtailment in saturated TSO/
DSO networks. In Proceedings of CIRED2014, Rome
(Italy).

Gooding, P.A., Makram, E. and Hadidi, R. 2014. Prob-
ability analysis of distributed generation for island sce-
narios utilizing Carolinas data. Electric Power Systems
Research 107: 125-132.

Hagspiel, S., Papaemannouil, A., Schmid, M. and Ander-
son, G. 2012. Copula-based modeling of stochastic
wind power in Europe and implications for Swiss power
grid. Applied Energy 96: 33-44.

Jarventaustac, P., Repo, S., Rautiainen, A. and Partanen,
J. 2009. Smart grid power system control in distributed
generation environment. 6th IFAC symposium on power
plants and power systems control, Finland.

Labeau, P.E., Faghihi, F., Maun, J.C., De Wilde, V. and
Vergnol, A. 2014. Mathematics of PRA applied to Dis-
tributed Generation curtailment in saturated grids. In
Proceedings of Esrel2014, Wroclaw (Poland): Balkema.

Rocchetta, R., Li, Y.F. and Zio, E. 2015. Risk assessment
and risk-cost optimization of distributed power genera-
tion systems considering extreme weather conditions.
Reliability Engineering and System Safety 136: 47-61.




Safety and Reliability - Safe Societies in a Changing World — Haugen et al. (Eds) 
© 2018 Taylor & Francis Group, London, ISBN 978-0-8153-8682-7 


Integrated deterministic and probabilistic safety assessment 
of the cooling circuit of a superconducting magnet for nuclear 
fusion applications 


R. Bellaera, R. Bonifetto, N. Pedroni, L. Savoldi & R. Zanino 
NEMO group, Dipartimento Energia, Politecnico di Torino, Torino, Italy 


F. Di Maio & E. Zio 
Dipartimento di Energia, Politecnico di Milano, Italy 


E. Zio 
Chair on Systems Science and the Energetic Challenge, European Foundation for 
New Energy—Fondation EdF, CentraleSupelec, France 


ABSTRACT: In recent years, there has been a growing interest in nuclear fusion as an energy source due
to its several principle advantages over fission, which include reduced radioactivity in operation and in
waste, large fuel supplies, and increased safety. The most promising configuration of a nuclear fusion
system is currently the tokamak, the largest of which (ITER) is under construction
in Cadarache, France. The safety of nuclear fusion systems has to be proved and verified by a
systematic analysis of the system behavior under normal transient and accidental conditions. One challenge
to the analysis is that the operation of tokamaks presents complex dynamic features as it is based on
the transformer principle: in particular, they employ superconducting magnets, a subset of which operates
with variable current to generate one of the components of the magnetic field needed to confine the
plasma in the chamber where nuclear fusion reactions occur. In the present paper, we apply techniques
of Integrated Deterministic and Probabilistic Safety Assessment (IDPSA), which combine phenomenological
models of system dynamics with stochastic process models, taking for the first time as reference
system the cooling circuit of a superconducting magnet for fusion applications, subject to a
Loss-Of-Flow-Accident (LOFA).


1 INTRODUCTION 


The need to satisfy a growing demand for energy,
while protecting the environment and reducing the
dependence on fossil fuels, has recently increased
the interest in the use of nuclear fusion as an energy
source. Nuclear fusion presents several advantages
over fission, which include reduced radioactivity 
in operation and in waste, large fuel supplies, and 
increased safety (EC, 2004). 

The most promising configuration for electrical 
energy production from fusion is the tokamak, the 
largest of which (ITER) is now under construc- 
tion at Cadarache (France), under an international 
collaboration between seven member entities (i.e.,
China, the European Union, India, Japan, Korea,
Russia and the United States) (ITER, 2014).

Nuclear fusion reactors will use superconducting
(SC) magnets to generate a powerful magnetic
field needed to confine the plasma in the shape
of a torus, where D-T fusion reactions occur at a
temperature of the order of $10^8$ K. On the other
hand, all the SC coils need to be cooled to cryogenic
temperatures in order to avoid the quench of
the magnets (i.e., the loss of their superconducting
state) during operation: for example, the ITER SC
coils are cooled by supercritical helium (SHe) at a
pressure of 0.5-0.6 MPa and temperature of about
4.5 K (ITER, 2014). Dedicated cryogenic cooling
loops remove the heat load from the magnets,
releasing it to saturated liquid helium (LHe) pools,
acting as thermal buffers in the transfer of the load
to the refrigerator (Hoa et al., 2012; Zanino et al.,
2013).

The safety of nuclear fusion systems has to be 
proved and verified by a systematic analysis of the 
system behavior under normal transient and acci- 
dental conditions (Taylor and Cortes, 2014; Rivas 
et al., 2015; Perrault, 2016; Wu et al., 2016), for 
two main reasons. First, the presence of radioac- 
tive sources (e.g., tritium and materials activated 
by the neutrons produced by the fusion reactions) 
imposes a careful study to avoid the contamination
of the workers, the public and the environ-
ment (Taylor, 2015; Taylor et al., 2017). Second, 
in view of the cost of the SC magnet system 
(Mitchell et al., 2008 and 2012), its protection 
and integrity should be guaranteed (ITER, 2014; 
Savoldi Richard et al., 2014; Savoldi et al., 2017 
and 2018). 

One challenge to the related safety analyses is 
that the operation of tokamaks presents complex 
dynamic features, as it is based on the transformer 
principle. In particular, they employ SC magnets, 
a subset of which operates with variable current to 
generate one of the components of the magnetic 
field needed to confine the plasma in the chamber 
where nuclear fusion reactions occur (Zohm, 2014). 
The order and timing of failure events occurring 
along an accident scenario, the magnitude of fail- 
ures and the values of the process variables at the 
time of event occurrence are critical in determin- 
ing the evolution of the system response (Aldemir, 
2013; Kirschenbaum et al., 2009; Zio and Di Maio, 
2009, 2010; Zio et al., 2010; Zio, 2014; Turati et al., 
2017 and 2018). 

To perform the safety assessments we resort 
to the Integrated Deterministic and Probabilis- 
tic Safety Assessment (IDPSA) framework (Di 
Maio et al., 2016) to combine (deterministic) 
phenomenological models of system dynamics 
with (probabilistic) stochastic process models. 
The IDPSA methodology is here used for the 
first time to analyze the response—to abnor- 
mal transient conditions—of the cooling system 
of a SC magnet, namely a single ITER Central 
Solenoid Module (CSM) in a reference (cold) 
test facility (Spitzer et al., 2015). In particular, 
a Loss-Of-Flow Accident (LOFA) is considered 
as reference abnormal scenario. The 4C code 
(Savoldi Richard et al., 2010) is employed for the 
(deterministic) simulation of the system behavior.
Multiple Value Logic (MVL) is adopted for
the description of the component failures (Garibba
et al., 1985). MVL makes it possible to describe
components that fail at any time along
the scenario, with different (discrete) magnitudes
(Di Maio et al., 2015; Zio, 2013).

The paper is organized as follows. In Section 2, 
the single ITER CSM in the cold test facility is 
presented together with the corresponding simula- 
tor. In Section 3, the different regimes of system 
operation are described. The components failures 
causing the deviations from the nominal CSM 
conditions in the test facility are described in 
detail in Section 4, together with the use of MVL 
for generating accident scenarios. In Section 5, 
the results obtained are illustrated and analyzed. 
Finally, conclusions are drawn and summarized in 
Section 6. 


2 DESCRIPTION OF THE SYSTEM AND 
PRESENTATION OF THE SIMULATOR 


The ITER CS allows inducing the current in the 
plasma and maintaining it during long plasma 
pulses. It is composed of 6 CS modules (CSM)
stacked in the vertical direction; each of them is
being manufactured and will be independently
tested. Each module is composed of 7 pancake-wound
conductors, namely 6 hexa-pancakes and
1 quad-pancake. Each pancake is cooled in parallel,
resulting in 40 parallel cooling channels per
module, each featuring 14 turns. The He inlets are 
located at the coil bore, while the outlets are at the 
outer side of the magnet. All the pancakes of each 
module are electrically connected in series through 
suitable joints (Libeyre et al. 2015). 

The main components of a reference facility for 
the CSMs cold tests (which is the subject of the 
present analysis) are the He refrigerator, producing 
the supercritical He (SHe) at 4.5 K, which is used 
to cool the magnets during the tests, and the test 
chamber where the module will be put into a cryo- 
stat. Besides these two main components, several 
manifolds, pipelines, heat exchangers (HXs), control
valves (CV) of the cryoplant and a liquid He
(LHe) thermal buffer will connect the refrigerator 
to the coil, and a dedicated power supply system 
will provide a current up to ~50 kA. 

From the hydraulic point of view, the analy- 
sis reported here is focused on the loop cooling a 
single CSM. This closed loop, much simpler than 
that foreseen in ITER for the cooling of the CS, 
provides SHe at ~4.5 K to the inlets of the 40 
hydraulic paths, collects the (warmer) SHe at their 
outlets and by means of a cold circulator drives it 
to two HXs, where the heat removed from the coil 
is transferred to a LHe buffer. The LHe evaporated 
in the buffer is, then, extracted and cooled down by 
the refrigerator. 

The 4C thermal hydraulic code (Savoldi Rich- 
ard et al. 2010) is used here to model both the coil 
and its SHe cooling loop. Figure 1 shows a scheme
of the loop model. The SHe at the outlet of the 
cold circulator is cooled in HX1 to remove the heat 
generated by the compression process. Then, it is 
driven to the coil inlets through CV1 (fully open 
in normal operation) and a supply cryoline (cry- 
oline 1). At the coil outlet, the SHe reaches HX2 
through a return cryoline (cryoline 2) and CV2 
(fully open in normal operation). The main input 
parameters of each component model are reported 
in Table 1. For the details of the model of each 
component, please refer to (Bonifetto et al. 2012) 
and (Zanino et al. 2013). A realistic characteristic 
of the cold circulator has been implemented in the 
circuit model, as described in detail in (Zanino 
et al. 2013) for another system. 

Figure 1. Scheme of the SHe cooling loop of the CSM.
Solid red circles are pressure taps, open cyan triangles are
flow meters (see the text for other abbreviations).


Table 1. Main input parameters of the circuit component
models (L = length, D = diameter, Kv = flow coefficient).

Component                L [m]   D [mm]   # of parallel pipes
HX1,2                    31      20       11
Cryoline1                28      46
Cryoline2                24      46

Component                Kv [m3/h]
CV1,2, BV, SVin, SVout   71


With reference to a Loss-Of-Flow Accident 
(LOFA) (scenario of interest in the present paper, 
see the Introduction), the strategy adopted by the 
control system is assumed here to be similar to that 
adopted in ITER and described in (Savoldi et al., 
2018) for the Toroidal Field (TF) coils. In particu- 
lar, in order to protect the CS, a controlled discharge 
of the CS circuits is carried out in ITER, consist- 
ing of a current ramp down of about 30s (ITER_D_ 
K7G8GN v2.1, 2014), driven by the plasma control 
system. Obviously, the current variation causes AC 
losses, which induce a (possibly significant) heat 
deposition in the conductor. In the reference system
at hand, a similar fast controlled discharge is
assumed to be performed by the control system in case
of LOFA. As far as the cryoplant is concerned, the
basic circuit control in case of a LOFA includes the 
isolation of the circulator from the coil by means 
of the full closure of both CVs and the opening 
of the by-pass valve (BV) to equalize the pressure 
at the circulator suction and discharge (preventing 
any damage to the pump itself as a fail-safe condition).
In the present work, this action is taken (by
controller C1 in Figure 1), when a SHe mass flow 
rate below 10% of the nominal value is measured 
both at the inlet and at the outlet of the CSM, 
after a validation time of 1s (Savoldi et al., 2018). 
Notice that the “nominal value” of the mass flow 
rate here considered corresponds to the nominal 
CSM conditions during a test in the reference facil- 
ity, not to the “future” normal operating conditions 
in ITER. In case of excessive pressurization at the 
coil boundaries, two Safety Valves (SV) at the CSM 
inlet and outlet open, driven by the PID controller 
C2 (gain = 1 e-’Pa™', integration time = 0.2 s, deri- 
vation time = 1 s), with set-point 1.8 MPa (Savoldi 
Richard et al., 2012); the controller parameters and 
set point have been assumed here equal to those in 
(Savoldi et al., 2018). When the SV opens, the He is 
released in a Quench Tank (QT) by means of suit- 
able Quench Lines (QL). 
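
As an illustration of the detection rule described above (flow below 10% of the nominal value at both the inlet and the outlet, confirmed over a validation time of 1 s), the following minimal Python sketch scans sampled flow signals. The function name, the sampled-signal interface and the reset-on-recovery behaviour are assumptions made for illustration; they are not taken from the 4C code or from the ITER control system.

```python
def lofa_trip_time(times, inlet_flow, outlet_flow, nominal_flow,
                   threshold_fraction=0.10, validation_time=1.0):
    """Return the time at which the LOFA action would be triggered, or None.

    The trip fires once BOTH flow measurements have stayed below
    threshold_fraction * nominal_flow for at least validation_time seconds.
    """
    limit = threshold_fraction * nominal_flow
    low_since = None  # start of the current uninterrupted low-flow period
    for t, m_in, m_out in zip(times, inlet_flow, outlet_flow):
        if m_in < limit and m_out < limit:
            if low_since is None:
                low_since = t
            if t - low_since >= validation_time:
                return t
        else:
            # Assumption: any recovery of either signal restarts the validation window.
            low_since = None
    return None
```

In this sketch a trip would correspond to closing both CVs and opening the BV; how the real controller handles signal noise or partial recovery is not specified in the text.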

The detailed CSM model solves the 1D transient 
mass, momentum and energy conservation equations,
computing the temperature, pressure and
velocity distribution, in each of the two regions 
(cable bundle and central, low impedance channel) 
of each pancake, as described in detail in (Zanino 
et al. 1995). Then, the inter-turn and inter-pancake 
thermal coupling between adjacent turns and 
pancakes, respectively, is computed considering 
the insulation as a thermal resistance to evaluate 
the heat transfer between neighboring conductors 
(Savoldi et al. 2000). 


3 SYSTEM OPERATION REGIMES 


During the tests, the facility and the CSM will be 
operated in different regimes, which can be briefly 
described as: 


a. Cold mode standby operation (e.g. during night 
or weekend), when the CSM is not charged and 
kept at nominal ~4.5 K; no dangerous transients 
are expected from a LOFA in these conditions, 
so this regime is not analyzed here. 

b. Cold mode experimental operation (i.e. dur- 
ing tests), when the CSM is charged at full or 
partial current (which is the case analyzed in 
the present paper). In this regime, an accidental 
temperature increase up to (or above) the so-called
current sharing temperature T_CS may lead
to a quench, i.e., to a loss of the superconducting
state (notice that T_CS is defined as the temperature
above which the current starts to flow also
in the Cu matrix of the SC strands and in the 
pure Cu strands, developing a non-zero voltage 
and causing Joule heat generation in the cable). 
The consequent fast local Joule heat deposition 
can induce thermal stresses that may seriously 
damage the conductor, causing a degradation
of its performance or, in the worst case, the loss
of integrity of the conductor. The presence of a
normal (non-SC) zone and, to some extent, variations
in the conductor temperature above T_CS
can be detected by measuring the voltage at the
extremities of each pancake.


The coil inlet temperature is here taken as the 
nominal one. The objective of the analysis is to study 
the response of the system described above (in the 
operating mode b.) to abnormal conditions, e.g., to 
accident scenarios driven by stochastic components 
failures (see the following Section 4). In particular, 
our interest is mainly devoted to LOFAs (see the 
previous Section 2). Indeed, in the absence of helium
coolant flow, the heat deposition induced by AC 
losses (caused by the controlled discharge of the CS 
circuits) may a priori lead to a dangerous increase 
in the conductor temperature (with possible quench 
of the magnet), even in a cold test configuration like 
the one analyzed here. Two variables are monitored 
during each transient as “critical indicators” of the 
state of the system: the cooling helium pressure at 
the inlet and outlet of the CS magnet and the voltage
measured at the coil extremities. Indeed, if the
pressure in the conductor exceeds 25 MPa, the conductors
can be damaged (ITER_D_2NBKXyY v1.2,
2009); also, if the voltage on a single pancake goes
above 0.1 V for more than 1 s, it means that the conductor
temperature has exceeded the current sharing
temperature T_CS, i.e., that the superconducting
state of the magnet is lost.
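
The two critical indicators can be checked on a simulated transient with a few lines of code. The sketch below is only an illustration: the thresholds are the ones quoted in the text (25 MPa, 0.1 V held for more than 1 s), while the function name, the sampled-signal interface and the returned dictionary are hypothetical.

```python
PRESSURE_LIMIT_PA = 25e6  # conductor damage threshold quoted in the text
VOLTAGE_LIMIT_V = 0.1     # per-pancake voltage threshold quoted in the text
VOLTAGE_HOLD_S = 1.0      # time above the voltage threshold that flags a quench


def check_indicators(times, pressures, voltages):
    """Flag over-pressure and loss of superconductivity on sampled signals."""
    over_pressure = any(p > PRESSURE_LIMIT_PA for p in pressures)
    quench = False
    above_since = None  # start of the current period above the voltage limit
    for t, v in zip(times, voltages):
        if v > VOLTAGE_LIMIT_V:
            if above_since is None:
                above_since = t
            if t - above_since > VOLTAGE_HOLD_S:
                quench = True
                break
        else:
            above_since = None
    return {"over_pressure": over_pressure, "quench": quench}
```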


4 FAILURE SCENARIOS GENERATION 
BY MULTIPLE VALUED LOGIC 


The following component failures can occur at 
random times in the time horizon [0,600] seconds: 


1. the Centrifugal Pump (CP) reduces exponen- 
tially the rotational speed, directly affecting the 
mass flow rate that can be reduced to i) 75%; ii)
50%; iii) 25%; iv) 0% of the nominal mass flow 
rate, i.e. in the last case down to a total loss of 
pumping capacity. 

2. the two Control Valves (CVs) can fail in three 
different modes: i) stuck (open) at the nominal 
position; ii) stuck closed at 50% of the nominal 
position; iii) stuck totally closed. 

3. the By-pass Valve (BV) can fail in three different 
modes: i) stuck (closed) in nominal position; ii)
stuck open at 50% of the flow area; iii) stuck 
totally open. 

4. the two Safety Valves (SVs) can fail in three different
modes: i) stuck (closed) in nominal position;
ii) stuck open at 50% of the flow area; iii)
stuck totally open. 


It can be shown that the system reaches a steady
state condition in ~100 seconds, irrespective of
the failure that occurred. Therefore, we set a mission
time T_M = 700 s.

A Multiple Value Logic (MVL) scheme has been 
adopted to generate the accidental scenarios, con- 
sidering the stochastic (discrete) time interval (t) of 
occurrence of component failures, their (discrete) 
magnitude (m) and the order (ord) of events along 
the sequence. The random realizations of the discretized
time and magnitude values are included
into a sequence vector that represents a generated
scenario to be simulated, [m_CP, t_CP, ord_CP, m_CV1, t_CV1,
ord_CV1, m_CV2, t_CV2, ord_CV2, m_BV, t_BV, ord_BV, m_SV1, t_SV1,
ord_SV1, m_SV2, t_SV2, ord_SV2] (Di Maio et al., 2017). The following
values are considered:


• time (t) discretization: we use the label t = 1, 2, 3,
4, 5 and 6, for failures occurring in the intervals
[0, 100] s, [101, 200] s, [201, 300] s, [301, 400] s,
[401, 500] s, [501, 600] s, respectively; t = 0 means
that the component does not fail within the T_M
of the scenario and the value “NaN” is used to
identify the respective (non-)failure order (ord)
in the sequence vector of the accidental scenario.
Notice that, even if the failure order (ord) may 
seem redundant in the MVL representation, it is 
actually used to discriminate between scenarios 
where different components fail in the same time 
interval t. Also, it is worth highlighting that, once 
a time interval t is identified by MVL, the actual 
time of component failure (used in the determin- 
istic simulation of the accident scenario) is ran- 
domly sampled within the interval. 

• magnitude (m) discretization:

- the CP magnitude is indicated with the label
m_CP = 1, 2, 3 or 4 for failure states corresponding
to an exponential decrease of the rotational
speed down to 75%, 50%, 25% and 0%
of the nominal value, respectively; if m_CP = 0,
the component does not fail;

- for each CV, the magnitude is indicated by the
label m_CV = 1, 2 or 3 if the component stays
stuck (open) at the nominal position, stuck
closed at 50% of the nominal position and
stuck totally closed, respectively; if m_CV = 0, the
component works correctly;

- the BV magnitude is indicated by the label
m_BV = 1, 2 or 3 if the component stays stuck in
(closed) nominal position, stuck open at 50%
of nominal flow area and stuck totally open,
respectively; if m_BV = 0, the component does not
fail;

- for each SV, the magnitude is indicated by the
label m_SV = 1, 2 or 3 if the component stays
stuck (closed) in nominal position, stuck open
at 50% of nominal flow area and stuck totally
open, respectively; if m_SV = 0, the component
works correctly in that scenario.
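
The discretization above can be turned into a simple scenario sampler. The following Python sketch is an illustration only: the per-component failure probability `p_fail`, the uniform sampling of magnitudes and time labels, and all names are assumptions that do not come from the paper. Only the interval bounds, the number of magnitude levels and the sampling of the actual failure time within the selected interval follow the text.

```python
import random

# Time labels and the corresponding failure intervals in seconds (from the text).
TIME_INTERVALS = {1: (0, 100), 2: (101, 200), 3: (201, 300),
                  4: (301, 400), 5: (401, 500), 6: (501, 600)}
# Number of non-zero failure magnitudes per component (from the text).
N_MAGNITUDES = {"CP": 4, "CV1": 3, "CV2": 3, "BV": 3, "SV1": 3, "SV2": 3}


def sample_scenario(rng, p_fail=0.5):
    """Draw one scenario as {component: (m, t, ord, failure_time)}.

    p_fail is a hypothetical probability that a component fails at all.
    """
    draws = {}
    for comp, n_mag in N_MAGNITUDES.items():
        if rng.random() < p_fail:
            m = rng.randint(1, n_mag)       # discrete magnitude label
            t = rng.randint(1, 6)           # discrete time label
            lo, hi = TIME_INTERVALS[t]
            draws[comp] = (m, t, rng.uniform(lo, hi))  # actual failure time
        else:
            draws[comp] = (0, 0, None)      # component does not fail
    # The order of events follows the sampled failure times.
    failed = sorted((c for c in draws if draws[c][0] > 0),
                    key=lambda c: draws[c][2])
    order = {c: k + 1 for k, c in enumerate(failed)}
    return {c: (m, t, order.get(c, float("nan")), ft)
            for c, (m, t, ft) in draws.items()}
```

Calling `sample_scenario(random.Random(0))` returns one such scenario; the explicit order label distinguishes components that fail within the same time interval.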

As an example, the accidental sequence vector
[4, 6, 5, 0, 0, NaN, 1, 2, 1, 2, 4, 3, 2, 5, 4, 1, 3, 2] represents
a scenario where: the CP fails completely
to 0% of the nominal value at a time in [501,600]
s (fifth event occurring along the sequence); the
CV1 correctly works throughout T_M; the CV2
fails stuck (open) at the nominal position at a 
time in [101,200] s (first event occurring along the 
sequence); the BV fails stuck open at 50% of the 
nominal flow area (third event along the sequence) 
at a time in [301,400] s; the SV1 fails stuck open 
at 50% of the flow area at a time in [401,500] s 
(fourth event along the sequence); finally, the SV2 
fails stuck (closed) in nominal position at a time in 
[201,300] s (second event along the sequence). 
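
The reading of such a sequence vector can be made explicit with a small decoder. The sketch below assumes the component ordering CP, CV1, CV2, BV, SV1, SV2 with one (m, t, ord) triplet per component, which is the ordering implied by the worked example above; the function and dictionary names are hypothetical.

```python
COMPONENTS = ["CP", "CV1", "CV2", "BV", "SV1", "SV2"]
INTERVALS = {1: (0, 100), 2: (101, 200), 3: (201, 300),
             4: (301, 400), 5: (401, 500), 6: (501, 600)}
MAGNITUDES = {
    "CP": {1: "speed decays to 75% of nominal", 2: "speed decays to 50% of nominal",
           3: "speed decays to 25% of nominal", 4: "speed decays to 0% of nominal"},
    "CV": {1: "stuck open at nominal position", 2: "stuck closed at 50% of nominal position",
           3: "stuck totally closed"},
    "BV": {1: "stuck closed in nominal position", 2: "stuck open at 50% of flow area",
           3: "stuck totally open"},
    "SV": {1: "stuck closed in nominal position", 2: "stuck open at 50% of flow area",
           3: "stuck totally open"},
}


def decode(vector):
    """Return the failure events of an 18-entry MVL vector, sorted by order."""
    if len(vector) != 3 * len(COMPONENTS):
        raise ValueError("expected one (m, t, ord) triplet per component")
    events = []
    for k, comp in enumerate(COMPONENTS):
        m, t, order = vector[3 * k:3 * k + 3]
        if m == 0:
            continue  # component does not fail (t = 0, ord = NaN)
        kind = comp.rstrip("12")  # CV1/CV2 -> CV, SV1/SV2 -> SV
        events.append((order, comp, MAGNITUDES[kind][m], INTERVALS[t]))
    return sorted(events)
```

Applied to the example vector, the decoder lists CV2, SV2, BV, SV1 and CP in this order, matching the description in the text.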


5 RESULTS 


In principle, the MVL accidental scenario genera- 
tion procedure described above would entail the 
creation of 4 × 10^8 scenarios. We report here only
the “bounding analysis” of the system response, 
analyzing a reduced set of MVL sequences, which 
are expected to be the most challenging for the sys- 
tem safety. Notice that none of the selected acci- 
dent sequences turns out to be critical for the CS 
module integrity. In particular, the voltage at the 
extremities of the SC coil will remain far below 
the threshold of 0.1 V (i.e., the temperature of the
conductor does not exceed T_CS): this means
that the magnet maintains its superconductive 
properties. 

Let us consider the MVL sequence [4, 1, 1, 0, 0, 
NaN, 0, 0, NaN, 0, 0, NaN, 0, 0, NaN, 0, 0, NaN], 
in which the pump fails according to an exponential
decrease of the rotational speed until complete
stop; the other components work correctly during 
the scenario. This sequence has been chosen because 
in this system no CP redundancies are designed and, 
therefore, a CP unavailability might lead the CSM 
to critical conditions. Notice that when the loss of 
flow is detected (Section 2), in order to protect the 
CS, a controlled discharge of its electrical circuits is 
carried out, consisting of a current ramp down to 0 
in about 30 s. Significant heat is, thus, generated in
the conductors, due to the AC losses produced by 
the fast current variation. Figure 2 shows that when 
the LOFA occurs, the pressure of the coil both at 
the inlet (solid line on Figure 2) and outlet (dashed 
line) increases up to 50% with respect to the nomi- 
nal value, without however exceeding the pressure 
safety threshold of 1.8 MPa. 

Figure 2. Pressure at the inlet and outlet of the coil (following
the CP failure at 0% of the nominal mass flow rate).

Two additional MVL sequences that might deserve attention
are [0, 0, NaN, 0, 0, NaN, 0, 0, NaN, 0, 0,
NaN, 3, 1, 1, 0, 0, NaN] and [0, 0, NaN, 0, 0, NaN,
0, 0, NaN, 0, 0, NaN, 0, 0, NaN, 3, 1, 1]. These
sequences entail the failure of safety valves SV1
and SV2, located respectively at the inlet and outlet
of the coil, that have the function of keeping the
pressure of the system below the critical value of 
1.8 MPa in accidental situations. In normal opera- 
tion, the pressure at the inlet of SV1 is equal to 
0.433 MPa and at the outlet is 0.42 MPa, corre- 
sponding to the assumed pressure of the quench 
tank. If component SV1 fails, helium starts flow- 
ing in the quench line leading to the reduction of 
the pressure at the inlet of the coil (solid line in 
Figure 3): both the inlet and the outlet pressure of 
the coil decrease, the mass flow rate remains at its 
nominal value and a new steady state situation is 
reached with large safety margin with respect to 
the pressure safety threshold 1.8 MPa. 

Sequence [0, 0, NaN, 0, 0, NaN, 0, 0, NaN, 0, 
0, NaN, 0, 0, NaN, 3, 1, 1], instead, entails SV2 
failure. In Figure 4, it can be seen that the pressure 
at the inlet of the valve is equal to 0.376 MPa (solid 
line) whereas at the outlet (dashed line) it is equal 
to the assumed pressure of the quench tank (i.e., 
0.42 MPa). When valve SV2 opens, the pressure 
at the outlet of the coil increases up to 0.42 MPa. 
Since the CP is working correctly, the nominal 
pressure drop between the inlet and the outlet of 
the component is maintained. As a consequence, 
the increase in the outlet coil pressure is followed 
by an increase in the inlet coil pressure. Also in this 
case, a new steady state condition is reached with 
a large safety margin with respect to the pressure 
safety threshold of 1.8 MPa. 

Figure 3. Pressure at the inlet and outlet of the coil following
SV1 failure (stuck totally open).

Figure 4. Pressure at the inlet and outlet of the coil following
SV2 failure (stuck totally open).

Figure 5. Pressure at the inlet and outlet of the coil
along the sequence of failures: SV1 stuck totally open;
SV2 stuck totally open.

The relevance of the dynamic features is witnessed
by the outcome of sequence [0, 0, NaN,
0, 0, NaN, 0, 0, NaN, 0, 0, NaN, 3, 1, 1, 3, 1, 2],
whose evolution is shown in Figure 5. The first
failure is represented by the SV1 stuck open. This
causes a slight decrease in the pressure at the inlet
(solid line) and at the outlet (dashed line) of the
coil. Then, also safety valve SV2 fails stuck open, 
causing an increase in the pressure at the outlet of 
the coil up to the value of 0.42 MPa, the pressure 
assumed for the quench line (this is due to the fact 
that the pressure at the outlet of the coil is lower 
than that in the quench line). As a result, the pres- 
sure at the inlet and at the outlet of the coil equal- 
izes: the LOFA happens (and is detected) and the 
current is reduced to zero producing AC losses, 
which lead to heat deposition. Throughout the 
rest of the transient, the value of the pressure at 
the boundaries of the coil stabilizes at 0.42 MPa. 
This response of the system justifies the necessity 
of a dynamic approach. In fact, if we consider the 
single failures of the SV1 and SV2 at the most con- 
servative magnitude (see, e.g., Figures 3 and 4), a 
new steady state condition is reached without any 
problem from the point of view of the availability 
of the CSM in the test facility; on the other hand, 
the combination of failures considered just now 
leads to a different end state. 

In all the scenarios reported here, the pressure 
at the boundaries of the SC coil is kept below the 
safety threshold of 1.8 MPa, i.e., with a positive 
safety margin. In other words, as for the analysis 
here presented, no critical situations are expected 
and the coil is not damaged. In addition, the volt- 
age always stays well below the threshold of 0.1 V, 
which means that the current sharing temperature
T_CS is not exceeded.


6 CONCLUSIONS 


In this paper, we have considered the safety analysis 
of the simplified cooling system of a single mod- 
ule of the ITER Central Solenoid (CS) in a cold 
test facility, subject to a Loss-Of-Flow Accident 
(LOFA). For this, the deterministic 4C code has 
been employed to simulate the system behavior and 
a Multiple Value Logic (MVL) has been adopted 
for building a comprehensive set of combinations 
of times and (discrete) magnitudes of components 
failures to run stochastic accident scenarios. The 
cooling helium pressure at the inlet and outlet of 
the SC coil, and the voltage measured across the 
pancakes have been selected as “critical” safety 
parameters to monitor during each transient. 

The application of the MVL has generated a 
list of more than 10^8 possible scenarios. However,
for the sake of brevity, we have carried out only 
a “bounding analysis” of the system response by 
inspecting a reduced set of MVL sequences, in par- 
ticular, those expected to be most challenging for 
the system safety. Results have shown that these 
scenarios are not critical for the CS module integ- 
rity: in particular, in all the cases considered, the 
pressure at the boundaries of the SC coil is below
the safety threshold of 1.8 MPa, i.e., with a positive
safety margin. Also, the voltage remains well below
the threshold of 0.1 V, which means that the temperature
of the conductor does not exceed the current
sharing temperature T_CS and the magnet does
not lose its SC properties.

However, a final remark is in order with respect 
to the positive results obtained. The CS magnet 
has been analyzed here in “cold mode experimental 
operation” (see Section 3), with the coil inlet tem- 
perature taken as the nominal one. In some cases, 
this temperature can be artificially increased, in 
order to perform specific tests (for example, the T_CS
measurement (Savoldi et al. 2000)). This situation
would reduce the temperature margin (namely, the
difference between the operating cable temperature
and T_CS), shifting the operating point closer to
T_CS and, thus, reducing the amount of heat
needed to induce a quench. The analysis of such
“more severe” operating conditions may be
considered in future research.

ABSTRACT: In this work we perform Reliability-Based Design Optimization (RBDO) by classifying
the limit states by using soft non-linear Support Vector Machines (SVM). By adopting the kernel trick in
the dual formulation, by using e.g. the Gaussian kernel, we classify non-linear states of fail or safe obtained
from design of experiments. The Most Probable Point (MPP) of the SVM is established in the physical 
space where the distance is minimized in the metric of Hasofer-Lind. The solution to the corresponding 
optimality conditions is obtained by using Newton’s method with an inexact Jacobian and a line-search of 
Armijo type. At the MPP, we perform Taylor expansions of the SVM using intermediate variables defined 
by the iso-probabilistic transformation. In this manner, we derive a Quadratic Programming (QP) problem
which is solved in the standard normal space. This is done for several probability distributions such
as lognormal, Gumbel, gamma and Weibull. The optimal solution to the QP problem is mapped
back to the physical space and new Taylor expansions of the SVM are derived and a new QP problem is 
formulated and solved. This procedure continues in sequence until we obtain convergence of our RBDO 
problem. The steps presented above constitute our proposed FORM-based sequential QP approach for 
RBDO by using SVM. The target of reliability appearing in the FORM-based QP problem might also be 
adjusted using different SORM formulas such as Breitung, Hohenbichler or Tvedt, or by applying
importance-based Halton or Hammersley sampling. A nice feature of the proposed SVM-based RBDO 
approach is that several limit state functions can be represented simultaneously by only one single SVM. 
Thus, the proposed SVM-based RBDO methodology might be considered to be a rational approach for 
the treatment of RBDO problems including system reliability. This is demonstrated by solving established
RBDO benchmarks.


1 INTRODUCTION 


The soft non-linear support vector machine (SVM) 
introduced by Cortes & Vapnik (1995) defines a 
paradigm shift in machine learning and the paper 
has been cited more than 15000 times. By adopting 
the kernel trick and the soft penalization, we are 
able to classify non-linearly separable data including
misclassified data points. In this work, we suggest
using a single SVM to represent several limit
state functions simultaneously when performing 
reliability based design optimization. For readers 
not familiar with SVM, an excellent introduction 
to this machine learning discipline is found in the 
textbook by Hamel (2009). 

Although the number of papers on SVM-based 
RBDO is small, the idea of using SVM in order 
to represent limit state functions in RBDO is not 
new. Basudhar et al. (2008) solved RBDO prob- 
lems using SVM and a particle swarm algorithm. 
Song et al. (2012) performed sampling-based 
RBDO by using probabilistic sensitivity analysis 
and virtual support vector machines. Khatibinia 
et al. (2013) suggested a gravitational search algorithm
with a weighted least squares support vector
machine for RBDO. Wang et al. (2015) proposed
a new SORA-based RBDO method using SVM 
as a surrogate model for the limit state function. 
Most recently Liu et al. (2017) used SVM-based 
sampling in order to improve Kriging models 
for RBDO. Another recent paper is by Yang & 
Husada (2017), who study seven state of the art 
methods from data mining in order to improve 
accuracy and efficiency of a single-loop RBDO 
method. 

In this work, we will adopt the SORM-based 
sequential quadratic programming (SQP) approach 
for RBDO recently suggested by Strömberg (2017), 
where we now represent the limit state functions by 
using SVM instead of analytical expressions. The 
main idea of the proposed method is to represent 
several limit state functions simultaneously with 
only one single SVM. In this manner, an RBDO
problem with several reliability constraints boils
down to a problem with only one constraint. We
also think that this approach is a step towards a 
rational method for treating problems with system 
reliability. Reliability analysis without optimiza- 
tion using a similar idea was recently investigated 
by Li et al. (2016). 

The proposed method might be considered to
be a metamodel-based approach for RBDO even
though the SVM does not represent the overall
behaviour of the reliability constraint function but
is only a representation of the limit state or a classification
of states as fail or safe. Metamodel-based
RBDO is a powerful approach for treating
the reliability constraints of complex models
such as non-linear finite element models. Popular
metamodels for this kind of application are, e.g.,
Kriging (Hu et al. 2016), artificial neural
networks (Zhu et al. 2011) and radial basis function
networks (Lv et al. 2015). In Strömberg (2016),
FORM- and SORM-based RBDO of an underrun
protection profile was performed using a new type
of radial basis function networks with a priori bias 
suggested by Amouzgar & Strömberg (2017). 

The outline of the paper is as follows: in the next
section we review the FORM- and SORM-based
SQP approach for reliability based design optimization
that was recently suggested in Strömberg
(2017); in section 3 we present the theory of the
soft non-linear support vector machine by deriving
the dual formulation of the maximum margin
problem using the Karush-Kuhn-Tucker (KKT)
conditions and introducing the soft penalization.
In section 4 two RBDO examples are solved using
the proposed SVM-based methodology; in particular,
an established benchmark with three reliability
constraints is treated using a single SVM and we
also suggest how to apply SVM-based adaptive
sampling in order to improve the solution. Finally,
some concluding remarks are presented.


2 SQP-BASED RBDO 


Most recently a SORM-based SQP approach for 
RBDO was suggested in Strömberg (2017). A 
brief presentation of this approach is given in this 
section. 

Let us consider the following RBDO problem: 


$$\min_{\mu} f(\mu) \quad \text{s.t.} \quad \Pr[g(X) \le 0] \ge P_s, \qquad (1)$$

where $X$ is a vector of $N_{VAR}$ uncorrelated random
variables $X_i$. The mean value $\mu_i$ of each variable $X_i$
is collected in $\mu$. The functions $f = f(\mu)$ and $g = g(X)$
represent the objective function and constraint,
respectively. Thus, the reliability constraint reads
that the probability that $g \le 0$ must be no less than
the target of reliability $P_s$. In this work, we treat
this reliability constraint by representing the limit 
state g = 0 by support vector machines, see the next 
section. The SVM can also be used to classify states
as fail or safe, but that is not the main purpose of
using SVM in this work. The main idea is to represent
several limit states simultaneously by using a
single SVM and then to perform RBDO following
the method presented in this section.

The cumulative distribution of each variable is
given by

$$F_i(x;\theta_i) = \int_{-\infty}^{x} \rho_i(\xi;\theta_i)\,d\xi, \qquad (2)$$

where $\rho_i = \rho_i(x;\theta_i)$ is the probability density
function for distribution parameters collected in
$\theta_i = \theta_i(\mu_i)$. So far, the following distributions have
been implemented: normal, lognormal, Gumbel,
gamma and Weibull.
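
To make the role of these cumulative distributions concrete, the following Python sketch maps a physical value to the standard normal space through the composition of the inverse standard normal distribution with the cumulative distribution, which is the iso-probabilistic transformation used in this section. It relies only on the standard library. The parameterizations (log-mean and log-standard deviation for the lognormal, location and scale for the Gumbel) and all function names are assumptions chosen for illustration, not the implementation used in the paper.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def lognormal_cdf(x, m, s):
    """CDF of a lognormal variable whose logarithm is N(m, s^2)."""
    return STD_NORMAL.cdf((math.log(x) - m) / s)


def gumbel_cdf(x, loc, scale):
    """CDF of a Gumbel (maximum) variable with location loc and scale scale."""
    return math.exp(-math.exp(-(x - loc) / scale))


def to_standard_normal(x, cdf, *params):
    """Iso-probabilistic transformation y = Phi^{-1}(F(x; params))."""
    return STD_NORMAL.inv_cdf(cdf(x, *params))
```

For example, `to_standard_normal(math.e, lognormal_cdf, 1.0, 0.5)` maps the median of that lognormal variable to the origin of the standard normal space.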

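The problem in (1) is solved by the SQP approach
as outlined below. At an iterate $k$ with mean values
collected in $\mu^k$, we perform Taylor expansions of $f$
and $g$ in intermediate variables $Y_i$ defined by the
iso-probabilistic transformation, i.e.

$$Y_i = Y_i(X_i) = \Phi^{-1}\left(F_i(X_i;\theta_i(\mu_i^k))\right), \qquad (3)$$

where $\Phi = \Phi(x)$ is the cumulative distribution of
the standard normal distribution. In addition, the
Taylor expansion of $g$ is done at the most probable
point $x^{MPP}$ on the limit surface. The Taylor expansion
of $f$ becomes

$$f(\eta) \approx f(\mu^k) + \sum_{i=1}^{N_{VAR}} \left.\frac{\partial f}{\partial X_i}\right|_{X=\mu^k} \frac{\phi(Y_i(\mu_i^k))}{\rho_i(\mu_i^k;\theta_i^k)}\,\eta_i + \frac{1}{2}\sum_{i=1}^{N_{VAR}}\sum_{j=1}^{N_{VAR}} H_{ij}\,\eta_i\eta_j, \qquad (4)$$

where $\eta$ is the mean of $Y$ and

$$H_{ij} = \left.\frac{\partial^2 f}{\partial X_i \partial X_j}\right|_{X=\mu^k} \frac{\phi(Y_i(\mu_i^k))}{\rho_i(\mu_i^k;\theta_i^k)}\,\frac{\phi(Y_j(\mu_j^k))}{\rho_j(\mu_j^k;\theta_j^k)}. \qquad (5)$$

Here, $\phi$ is the density of the standard normal distribution and $\theta_i^k = \theta_i(\mu_i^k)$.
Furthermore,

$$g = g(Y) \approx \sum_{i=1}^{N_{VAR}} \left.\frac{\partial g}{\partial X_i}\right|_{X=x^{MPP}} \frac{\phi(y_i^{MPP})}{\rho_i(x_i^{MPP};\theta_i^k)}\,(Y_i - y_i^{MPP}), \qquad (6)$$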

where $y_i^{MPP} = Y_i(x_i^{MPP})$ is the most probable point
defined by

$$\min_{X}\; Y(X)\cdot Y(X) \quad \text{s.t.} \quad g(X) = 0. \qquad (7)$$

In this work, we solve (7) when the limit state
$g(X) = 0$ is represented by a soft non-linear SVM as
presented in the next section. This is in turn done
by solving the necessary optimality conditions by
using Newton’s method with an inexact Jacobian.
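
The most probable point problem, i.e. finding the point on the limit surface closest to the origin of the standard normal space, can be illustrated with a few lines of Python. Note that the sketch below does not implement the Newton method with an inexact Jacobian used in this work; it swaps in the classical Hasofer-Lind/Rackwitz-Fiessler (HL-RF) fixed-point iteration, works directly in the standard normal space, and assumes that the limit-state function and its gradient are supplied as callables. All names are hypothetical.

```python
import math


def hlrf_mpp(g, grad_g, n, tol=1e-10, max_iter=100):
    """Approximate the point on g(y) = 0 closest to the origin (HL-RF iteration).

    g and grad_g take a list of n standard normal coordinates.
    Returns the point and its distance to the origin (reliability index).
    """
    y = [0.0] * n
    for _ in range(max_iter):
        gv = g(y)
        dg = grad_g(y)
        norm2 = sum(d * d for d in dg)
        # Project onto the linearization of the limit state at the current point.
        factor = (sum(d * yi for d, yi in zip(dg, y)) - gv) / norm2
        y_new = [factor * d for d in dg]
        step = math.sqrt(sum((a - b) ** 2 for a, b in zip(y_new, y)))
        y = y_new
        if step < tol:
            break
    beta = math.sqrt(sum(yi * yi for yi in y))
    return y, beta
```

For a linear limit state the iteration converges in one step; for strongly non-linear limit states HL-RF may fail to converge, which is one motivation for more robust schemes such as line-search Newton methods.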

Finally, by inserting (4) and (6) into (1), we
derive the following QP-problem:

$$\min_{\eta} f(\eta) \quad \text{s.t.} \quad \mu_g \le -\beta_t\,\sigma_g, \quad -\epsilon \le \eta_i \le \epsilon, \qquad (8)$$

where

$$\mu_g = \sum_{i=1}^{N_{VAR}} \left.\frac{\partial g}{\partial X_i}\right|_{X=x^{MPP}} \frac{\phi(y_i^{MPP})}{\rho_i(x_i^{MPP};\theta_i^k)}\,(\eta_i - y_i^{MPP}), \qquad \sigma_g = \sqrt{\sum_{i=1}^{N_{VAR}} \left(\left.\frac{\partial g}{\partial X_i}\right|_{X=x^{MPP}} \frac{\phi(y_i^{MPP})}{\rho_i(x_i^{MPP};\theta_i^k)}\right)^2}. \qquad (9)$$

Here, $\beta_t = \Phi^{-1}(P_s)$ is the target reliability index
which can be corrected by any SORM approach
or Monte Carlo sampling. So far, four SORM
approaches have been implemented, e.g. Breitung,
Hohenbichler and Tvedt. In addition, Halton- and
Hammersley-based importance sampling at the
MPP are also implemented. The optimal solution
to (8), denoted $\eta^*$, is mapped back from the standard
normal space to the physical space using

$$\mu_i^{k+1} = \mu_i^k + \frac{\phi(Y_i(\mu_i^k))}{\rho_i(\mu_i^k;\theta_i^k)}\,\eta_i^*.$$

Then, a new QP-problem is generated around
$\mu^{k+1}$ and this procedure continues in sequence until
convergence is obtained. The QP-problem in (8) is
solved using quadprog.m in Matlab.
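
The link between the target reliability and the target reliability index can be illustrated for a limit state that is linear in independent unit-variance normal variables. In that case the limit state is itself normally distributed, so the probability of being on the safe side is the standard normal distribution evaluated at minus the ratio of its mean to its standard deviation. The Python sketch below shows this equivalence. The linear form `c + sum(a_i * Y_i)`, the function names and the argument names are assumptions made for illustration; the sketch does not reproduce the QP solved with quadprog.m.

```python
import math
from statistics import NormalDist


def linear_reliability(a, c, eta):
    """Mean, standard deviation and Pr[g <= 0] for g = c + sum(a_i * Y_i).

    The Y_i are assumed independent normal with means eta_i and unit variance.
    """
    mu_g = c + sum(ai * ei for ai, ei in zip(a, eta))
    sigma_g = math.sqrt(sum(ai * ai for ai in a))
    prob = NormalDist().cdf(-mu_g / sigma_g)
    return mu_g, sigma_g, prob


def satisfies_target(a, c, eta, p_target):
    """Check the reliability constraint through the target reliability index."""
    beta_t = NormalDist().inv_cdf(p_target)  # target reliability index
    mu_g, sigma_g, _ = linear_reliability(a, c, eta)
    return mu_g <= -beta_t * sigma_g
```

Checking the constraint through the reliability index, as in `satisfies_target`, gives the same answer as comparing the probability returned by `linear_reliability` with the target, but keeps the constraint linear in the mean of the limit state.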


3 SUPPORT VECTOR MACHINE 


In this section, we present the dual formulation of 
the soft non-linear support vector machine. First 
we introduce the original linear SVM, which actually
was suggested already in the 1960s by Vapnik,
then we apply the kernel trick and, finally, regularize
the problem.

Let us consider $N$ sampling points $x^i$, which take
values $y^i = 1$ (fail) or $y^i = -1$ (safe). Furthermore,
we assume that there exist hyper-planes

$$w \cdot x + b = 0, \qquad (10)$$

which separate these sampling points into two
subsets; one that only takes values $y^i = 1$ and the
other one with values $y^i = -1$. This is shown in
Figure 1, where the SVM-based RBDO methodology
is illustrated. We also assume that the following
constraints are satisfied:


$$y^i (w \cdot x^i + b) \ge 1, \quad i = 1, \ldots, N. \quad (11)$$

Figure 1. Illustration of the proposed SVM-based
RBDO approach. [Plot over $x_1$ showing the SVM-based limit surface, the points with $y^i = 1$ (fail) and the support vectors.]


The shortest distance $\gamma^i$ to $x^i$ from a hyper-plane
defined in (10) is given by

$$x^i = x + \gamma^i \frac{w}{\|w\|}. \quad (12)$$

(12) inserted in (11) yields

$$y^i (w \cdot x + b + \gamma^i \|w\|) \ge 1. \quad (13)$$

By utilizing (10), one obtains

$$\gamma^i \ge 1/\|w\| \quad \text{for } y^i = 1, \quad (14a)$$

$$\gamma^i \le -1/\|w\| \quad \text{for } y^i = -1. \quad (14b)$$

Thus, the lower bound on the shortest distance
$|\gamma^i|$ is maximized by minimizing $\|w\|$. This is
the key idea of the original linear support vector
machine formulation, which reads

$$\min_{(w,b)} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad 1 - y^i (w \cdot x^i + b) \le 0, \quad i = 1, \ldots, N. \quad (15)$$


Obviously, the closest sampling points to the
optimal hyper-plane

$$w^* \cdot x + b^* = 0 \quad (16)$$

are obtained when

$$y^i (w^* \cdot x^i + b^*) = 1. \quad (17)$$

Sampling points satisfying (17) are called support
vectors, see Figure 1. It is also obvious that
the region between the optimal hyper-plane defined
by (16) and the support vectors is empty of sampling
points. Thus, the support vector machine formulation
in (15) finds a hyper-plane that maximizes
the size of this region. This region is augmented
with sampling points in the SVM-based adaptive
sampling approach discussed in the next section.

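The Karush-Kuhn-Tucker conditions of the
support vector machine in (15) are given by

$$w = \sum_{i=1}^{N} \lambda_i y^i x^i, \quad (18a)$$

$$0 = \sum_{i=1}^{N} \lambda_i y^i, \quad (18b)$$

$$\lambda_i \ge 0, \quad (18c)$$

$$1 - y^i (w \cdot x^i + b) \le 0, \quad (18d)$$

$$\lambda_i \left(1 - y^i (w \cdot x^i + b)\right) = 0. \quad (18e)$$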


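The corresponding Lagrangian function is

$$L = L(\lambda, w, b) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{N} \lambda_i \left(1 - y^i (w \cdot x^i + b)\right). \quad (19)$$

Furthermore, the dual formulation of the support
vector machine in (15) reads

$$\max_{\lambda \ge 0}\; \min_{(w,b)}\; L(\lambda, w, b). \quad (20)$$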


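By inserting (18a) in (19), one obtains

$$L(\lambda, w(\lambda), b) = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \lambda_j y^i y^j\, x^i \cdot x^j + \sum_{i=1}^{N} \lambda_i - b \sum_{i=1}^{N} \lambda_i y^i. \quad (21)$$

In addition, the latter part is zero by (18b). In
conclusion, the dual support vector machine formulation
is given by

$$\min_{\lambda}\; \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \lambda_j y^i y^j\, x^i \cdot x^j - \sum_{i=1}^{N} \lambda_i \quad \text{s.t.} \quad \sum_{i=1}^{N} \lambda_i y^i = 0, \quad \lambda_i \ge 0, \; i = 1, \ldots, N. \quad (22)$$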


From the optimal solution $\lambda^*$ of the dual support
vector machine in (22), we obtain the corresponding
support vector machine solution from
the Karush-Kuhn-Tucker conditions in (18) as

$$w^* = \sum_{i=1}^{N} \lambda_i^* y^i x^i \quad (23)$$

and

$$b^* = 1/y^i - w^* \cdot x^i \quad (24)$$

for any $\lambda_i^* > 0$. Notice that, by using (23), the optimal
hyper-plane in (16) can be written as

$$\sum_{i=1}^{N} \lambda_i^* y^i\, x^i \cdot x + b^* = 0. \quad (25)$$

Notice also that the summation only needs to run
over the support vector indices, because otherwise
$\lambda_i^*$ equals zero by the KKT conditions. This is utilized
in the implementation in order to speed up
the evaluation of the SVM.
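
The recovery of the hyper-plane from the dual solution can be checked on a toy problem. The sketch below uses three hand-picked points for which the dual solution can be derived by hand (two support vectors, one inactive point); the data set and all function names are assumptions made for illustration only. It recovers $w^*$ and $b^*$ as in (23) and (24) and evaluates the left-hand side of (25) by summing over the support vectors only.

```python
def dot(u, v):
    """Scalar product of two vectors given as sequences."""
    return sum(a * b for a, b in zip(u, v))


def recover_primal(lam, ys, xs):
    """Recover (w*, b*) from the dual multipliers, cf. (23) and (24)."""
    dim = len(xs[0])
    w = [sum(l * y * x[d] for l, y, x in zip(lam, ys, xs)) for d in range(dim)]
    # (24) may use any support vector, i.e. any index with lam_i > 0.
    i = next(k for k, l in enumerate(lam) if l > 0.0)
    b = 1.0 / ys[i] - dot(w, xs[i])
    return w, b


def decision(x, lam, ys, xs, b):
    """Left-hand side of (25), skipping all points with lam_i = 0."""
    return sum(l * y * dot(xi, x) for l, y, xi in zip(lam, ys, xs) if l > 0.0) + b


# Toy data: two support vectors and one point far from the hyper-plane.
xs = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
ys = [1, -1, 1]
lam = [0.25, 0.25, 0.0]  # hand-derived dual solution for this toy set
w, b = recover_primal(lam, ys, xs)
values = [decision(x, lam, ys, xs, b) for x in xs]
```

For this data the third point has a zero multiplier, so it never enters the sum, which is the speed-up mentioned above.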

For non-separable sets of sampling points $x^i$,
the support vector machine approach presented
above will of course not work. However, one might
transform the sample set to a new space where it
becomes separable, let us say by $\xi = \xi(x)$. In this new
space, the only difference in the derivations of the
dual support vector machine in (22) and (25) is
the appearance of a new scalar product $\langle \xi^i, \xi^j \rangle$
instead of $x^i \cdot x^j$. Thus, we do not have to know the
explicit expression of the transformation $\xi = \xi(x)$,
but only the expression of the scalar product of
the new space. The explicit expression of this scalar
product is known as the kernel function, i.e.

$$k(x^i, x^j) = \langle \xi(x^i), \xi(x^j) \rangle. \quad (26)$$

Consequently, by using an appropriate kernel
function in (22) instead of $x^i \cdot x^j$, e.g. the Gaussian
kernel

$$k = k(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right), \quad (27)$$

the sample set can be separated by

$$\sum_{i=1}^{N} \lambda_i^* y^i k(x^i, x) + b^* = 0. \quad (28)$$

Another kernel function is the polynomial
kernel, i.e.

$$k = k(x, z) = (1 + x \cdot z)^p. \quad (29)$$
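
The two kernels can be written as short functions. The sketch below assumes the common parametrizations of the Gaussian kernel (with a width parameter `sigma`) and of the polynomial kernel (with a degree `p`); these parameter names and the helper `gram_matrix` are assumptions for illustration, since the exact parametrization used in the implementation is not stated here.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2)); sigma is an assumed width."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def polynomial_kernel(x, z, p=2):
    """Polynomial kernel (1 + x.z)^p; the degree p is an assumed parameter."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** p


def gram_matrix(xs, kernel):
    """Matrix of kernel values k(x^i, x^j) that replaces x^i . x^j in the dual."""
    return [[kernel(xi, xj) for xj in xs] for xi in xs]


xs = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
K = gram_matrix(xs, gaussian_kernel)
```

Only the matrix of kernel values enters the dual problem, so switching kernel amounts to passing a different function to `gram_matrix`.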


Even if we perform a suitable kernel trick, we
might have some misclassified points such that
(22) does not converge to a solution. This can be
treated by applying a regularization of (22). The
established soft regularization of (15) is

$$\min_{(w,b,\nu)} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \nu_i \quad \text{s.t.} \quad 1 - y^i (w \cdot x^i + b) - \nu_i \le 0, \quad \nu_i \ge 0, \; i = 1, \ldots, N. \quad (30)$$


The Karush-Kuhn-Tucker conditions in (18)
are then modified by adding the following conditions:

$$C - \lambda_i \ge 0, \quad (31a)$$

$$\nu_i \ge 0, \quad (31b)$$

$$\nu_i (C - \lambda_i) = 0. \quad (31c)$$


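The corresponding Lagrangian becomes

$$L = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \lambda_j y^i y^j\, x^i \cdot x^j + \sum_{i=1}^{N} \lambda_i - \sum_{i=1}^{N} \lambda_i \nu_i + C \sum_{i=1}^{N} \nu_i, \quad (32)$$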


where the two latter terms cancel out due to (31c). 
Thus, the only difference of the dual support vec- 
tor machine in (22) for this regularization is the 
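appearance of an upper bound on $\lambda_i$, i.e.

$$\min_{\lambda}\; \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \lambda_j y^i y^j k(x^i, x^j) - \sum_{i=1}^{N} \lambda_i \quad \text{s.t.} \quad \sum_{i=1}^{N} \lambda_i y^i = 0, \quad 0 \le \lambda_i \le C, \; i = 1, \ldots, N. \quad (33)$$

Finally, $0 < \lambda_i < C$ must be satisfied in order
for (24) to be valid. Here, we have also introduced
the kernel $k$ in the objective function. The soft
non-linear SVM in (33) is solved using quadprog.m
in Matlab.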
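
For readers without access to a QP solver, the box-constrained dual can also be solved approximately by pairwise updates of the multipliers. The sketch below is not the quadprog-based implementation used in this work; it swaps in a simplified SMO-type iteration (in the spirit of Platt's sequential minimal optimization), with a Gaussian kernel of assumed width `sigma` and a small hand-made data set. All names and parameter values are assumptions for illustration.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def train_soft_svm(xs, ys, C=10.0, kernel=gaussian_kernel, tol=1e-3, max_sweeps=1000):
    """Approximately solve the soft kernel dual by pairwise (SMO-type) updates."""
    n = len(xs)
    K = [[kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)]
    lam, b = [0.0] * n, 0.0

    def g(i):  # current SVM value at training point i
        return sum(lam[k] * ys[k] * K[k][i] for k in range(n)) + b

    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            e_i = g(i) - ys[i]
            violated = (ys[i] * e_i < -tol and lam[i] < C) or (ys[i] * e_i > tol and lam[i] > 0.0)
            if not violated:
                continue  # KKT conditions hold for point i within tol
            for j in range(n):
                if j == i:
                    continue
                e_j = g(j) - ys[j]
                # Bounds that keep 0 <= lam <= C and sum(lam * y) unchanged.
                if ys[i] != ys[j]:
                    lo, hi = max(0.0, lam[j] - lam[i]), min(C, C + lam[j] - lam[i])
                else:
                    lo, hi = max(0.0, lam[i] + lam[j] - C), min(C, lam[i] + lam[j])
                eta = 2.0 * K[i][j] - K[i][i] - K[j][j]
                if lo >= hi or eta >= 0.0:
                    continue
                new_j = min(hi, max(lo, lam[j] - ys[j] * (e_i - e_j) / eta))
                if abs(new_j - lam[j]) < 1e-7:
                    continue
                new_i = lam[i] + ys[i] * ys[j] * (lam[j] - new_j)
                # Update the offset so that the KKT conditions hold for i or j.
                b1 = b - e_i - ys[i] * (new_i - lam[i]) * K[i][i] - ys[j] * (new_j - lam[j]) * K[i][j]
                b2 = b - e_j - ys[i] * (new_i - lam[i]) * K[i][j] - ys[j] * (new_j - lam[j]) * K[j][j]
                if 0.0 < new_i < C:
                    b = b1
                elif 0.0 < new_j < C:
                    b = b2
                else:
                    b = 0.5 * (b1 + b2)
                lam[i], lam[j] = new_i, new_j
                changed = True
                break
        if not changed:
            break
    return lam, b


# Hand-made training set: y = 1 (fail) and y = -1 (safe).
xs = [(2.0, 2.0), (3.0, 2.0), (2.0, 3.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
ys = [1, 1, 1, -1, -1, -1]
lam, b = train_soft_svm(xs, ys, C=10.0)


def svm_value(x):
    """SVM value at x; its sign classifies x as fail (+) or safe (-)."""
    return sum(l * y * gaussian_kernel(xi, x) for l, y, xi in zip(lam, ys, xs)) + b
```

Each pairwise step changes two multipliers along the line that preserves the equality constraint, so the constraints of the dual hold throughout the iteration; a dedicated QP solver is still preferable for larger sample sets.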


4 EXAMPLES 


In this section, we demonstrate by solving two 
examples that the soft non-linear SVM presented 
in the previous section can represent the limit states 
properly such that the SQP-based RBDO methodology
can be applied for solving RBDO problems
with SVM-based limit state functions. The
first problem was considered in Strömberg (2016),
where SLP-based RBDO was performed by adopting
radial basis function networks (RBFN) as metamodels.
The second example is a well-known
benchmark for evaluating new RBDO approaches,
see e.g. Youn and Choi (2004).
The first example reads 


4 mink ofi 2 woof E 2] 
4 4h (34) 


i Pr[(X, -0.5)* +(X, -0.5)* <2] 2 P, 
S.t. ~ g 
lsu, <4, 


where $P_s = 0.999$ and $\mathrm{VAR}[X_i] = 0.1^2$. The deterministic
solution is (1.5,1.5) and the minimum of
the unconstrained objective function is found at
(2,2). The constraint $g > 0$ is plotted to the left
in Figure 2. The solution to (34) obtained by our 
SQP-based RBDO approach is (1.2705,1.2705). 
The corresponding reliability is 99.9%. 

Now, let us consider the problem in (34) as a
“black-box”, which we represent with metamodels
for a sample set of 100 Hammersley points as
depicted in Figure 4 for the next example. In particular,
we let the objective function be represented
by a radial basis function network as in Strömberg
(2016) and the limit state is represented by an SVM
$g^{svm}$. In Figure 2, $g^{svm} > 0$ is plotted. Notice that $g^{svm}$
only represents the limit state properly but not the
overall behaviour. Of course, the classification of
fail or safe is obtained by taking sign($g^{svm}$). Instead
of solving (34), we solve the metamodel-based
representation of (34), i.e. the radial basis function
network is taken as our objective function and $g^{svm}$ is the
limit state function. The corresponding solution
is (1.3410, 1.2315) and the reliability for this solution
becomes 99.7%, slightly lower than the target
of 99.9%. Histograms for the objective function
and the constraint are plotted in Figure 3. Notice
the non-symmetric characteristic of the solution.
This, as well as the reliability of the solution, can be
improved by adopting adaptive sampling, which is
a topic for future work. In fact, the SVM is most
appropriate for improving the sample set with data
along the limit state in the margin of the SVM.
Some ideas of how this can be done are outlined for
the next example.

Figure 2. The left plot shows $g > 0$ and in the right plot $g^{svm} > 0$ is given. Although the overall behaviour is not captured
by the SVM, the limit state is represented properly.

Figure 3. Histogram of the objective function $f$ and the constraint $g$ for the solution obtained by using the SVM. Red
lines show the corresponding normal distribution, and green lines mark the $\mu \pm 3\sigma$ intervals.

The second example is a well-known
benchmark with three constraints and reads

$$\min_{\mu}\; (\mu_1 + \mu_2)^2 \quad \text{s.t.} \quad \begin{cases} \Pr[g_1 = 20 - X_1^2 X_2 \le 0] \ge \Phi(3), \\ \Pr\left[g_2 = 1 - \dfrac{(X_1 + X_2 - 5)^2}{30} - \dfrac{(X_1 - X_2 - 12)^2}{120} \le 0\right] \ge \Phi(3), \\ \Pr[g_3 = X_1^2 + 8X_2 - 75 \le 0] \ge \Phi(3), \\ 1 \le \mu_i \le 7, \end{cases} \quad (35)$$

where $\mathrm{VAR}[X_i] = 0.3^2$. A small modification of
the original problem is done by taking the square
of the objective function. This problem was
recently considered in Strömberg (2017) by using
a SORM-based SQP approach for reliability based
design optimization. The example was in that work
also generalized to 50 variables and 75 constraints
for five different distributions simultaneously (normal,
lognormal, Gumbel, gamma and Weibull).
The solution for two variables with normal
distribution is obtained to be (3.4525,3.2758)
with our RBDO algorithm.

In this work we consider (35) to be a “black-box”
for which we set up a design of experiments using
Hammersley sampling with 100 points as shown in
Figure 4. From this sampling we define a training
set $y^i$ for our support vector machine in the following
manner:

$$y^i = \begin{cases} 1 & \text{if any } g_j > 0, \\ -1 & \text{otherwise.} \end{cases} \quad (36)$$

For this training set, we obtain an SVM $g^{svm}$
according to Figure 4. The plot to the right clearly
shows that the limit state functions are well
represented by the global SVM $g^{svm} = 0$. The corresponding
classification of fail or safe using
sign($g^{svm}$) is given in Figure 6. As in the previous
example, we represent the sampling data for the
objective function with an RBFN. Thus, our RBDO
problem to be solved is this RBFN-based objective
function with the reliability constraint on the SVM
$\Pr[g^{svm} \le 0] \ge \Phi(3)$. In this manner, we reduce the
problem size from three constraints to a problem
with only one constraint. For this problem, the
RBDO algorithm produces the following solution:
(3.4247,3.2218). The corresponding reliability indices
for the two first active constraints are 2.84 and
2.87, respectively. Histograms for these constraints
are given in Figure 5.

Figure 4. In the left plot the Hammersley sampling of 100 points is shown together with the explicit constraints $g_i$.
To the right the corresponding SVM is depicted. The limit state surface in cyan corresponds well to the actual limit
surfaces $g_i$.

Figure 5. Histograms for the two first active constraints $g_1$ and $g_2$, respectively. Red lines show the corresponding
normal distribution, and green lines mark the $\mu \pm 3\sigma$ intervals.

Figure 6. Classification in fail (red) or safe (blue) using
the SVM.

Instead of solving (35) using a uniform sampling
as presented in Figure 4, we adopt a sequential
adaptive sampling approach by distributing points
in the margin of the SVM along the limit states as
well as adding the deterministic optimum and the
most probable point obtained in each sequence.
For instance, we start from a set of
50 Hammersley points and then add 10 points (8
SVM-based points, 1 deterministic optimum, 1
MPP) sequentially five times following this SVM-based
sampling approach in order to generate the
set of 100 points shown in Figure 7. In this figure
the corresponding SVM is also plotted. Comparing
the SVMs in Figure 4 and Figure 7, one can see
that the latter one better captures the local behavior
of the limit surface close to the optimal solution.
The optimal solution for this SVM-based formulation
of the example is (3.4805,3.2585) and the
corresponding reliability indices are 3.08 and 2.95,
respectively, for the two active constraints. This is
rather close to the target of 3. It is clear that this
is an improvement compared to the previous solution
presented above for the same example.

Figure 7. SVM-based adaptive sampling augmented
with deterministic optimum and MPP.

Finally, we solve (35) for all three constraints 
modelled by three separate SVM models using the 
adaptive DoE used above. Then, we obtain the 
following solution (3.4304,3.3053), and the cor- 
responding reliability indices for the active con- 
straints: 2.97 and 3.12. 
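
The construction of the training set for the second example can be sketched as follows. The code generates a two-dimensional Hammersley point set and labels each point as fail or safe from the three limit state functions. The constraint expressions follow the benchmark of Youn and Choi (2004) written with the convention that $g_j \le 0$ is safe; the sampling box $[0, 10]^2$ and all function names are assumptions made for illustration.

```python
def radical_inverse(i: int, base: int = 2) -> float:
    """Van der Corput radical inverse of the integer i in the given base."""
    inv, f = 0.0, 1.0 / base
    while i > 0:
        inv += (i % base) * f
        i //= base
        f /= base
    return inv


def hammersley_2d(n, lower, upper):
    """n Hammersley points scaled to the box [lower, upper] in each variable."""
    pts = []
    for i in range(n):
        u1, u2 = i / n, radical_inverse(i)
        pts.append((lower + (upper - lower) * u1, lower + (upper - lower) * u2))
    return pts


def constraints(x1, x2):
    """The three limit state functions of the benchmark; g_j <= 0 is safe."""
    g1 = 20.0 - x1 ** 2 * x2
    g2 = 1.0 - (x1 + x2 - 5.0) ** 2 / 30.0 - (x1 - x2 - 12.0) ** 2 / 120.0
    g3 = x1 ** 2 + 8.0 * x2 - 75.0
    return g1, g2, g3


def label(x1, x2):
    """Training label: 1 (fail) if any g_j > 0, otherwise -1 (safe)."""
    return 1 if any(g > 0.0 for g in constraints(x1, x2)) else -1


points = hammersley_2d(100, 0.0, 10.0)  # the box [0, 10]^2 is an assumption
labels = [label(x1, x2) for x1, x2 in points]
```

Because a single label merges all three constraints, one SVM trained on `labels` represents the boundary of the joint safe region, which is how the problem is reduced to one reliability constraint.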


5 CONCLUDING REMARKS 


In this work SVM-based RBDO is investigated by
implementing a soft non-linear SVM together with an
SQP-based RBDO method recently developed in
Strömberg (2017). The implementation is done so
far for two random variables and the idea is to represent
the limit state functions with a single SVM.
It is demonstrated that the proposed approach
works well for an established benchmark with
three reliability constraints. It is also demonstrated
how the SVM can be utilized in adaptive sampling.
For future work it would be interesting to investigate
the approach for more than two variables with
several constraints.


Crisis management in extreme situation: The Model of Resilience
in Situation (MRS) as a support to observe the organization with
simulation

ABSTRACT: The traditional approach to safety engineering for a nuclear reactor is mainly focused
on its technical dimension, while the Human Factors approach focuses on how to optimize the value
added by humans to the reliability of the system. Safety management seeks to organize the work as well
as possible, train employees, develop their safety culture, and so on. This juxtaposition of approaches to
reliability illustrates the recurring dilemma faced by at-risk organizations in choosing between the technical
anticipation of pre-defined situations and the optimization of the management of the situations by
people who, in real time, through their skills and understanding of the ongoing situation, will adapt their
strategy and actions according to that particular situation. Over the last three years, we have performed a
series of Extreme Situation Simulations on Full Scale Simulators in order to test the resilience of the Crisis
Organization of EDF during accidents with characteristics similar to the accident of Fukushima. The
realization of these tests was an organizational challenge which required up to 80 people for the simulation,
and about 5 months for both the preparation and the analysis of the tests.


1 OBJECTIVE OF THE STUDY 


Following the Fukushima accident, EDF implemented
additional crisis management measures
(organizational, material, etc.) to respond to an
accident in an Extreme Situation (ES). An Extreme
Situation is a situation in which a nuclear site is isolated
and inaccessible as a consequence of a large-scale
external event which has an impact on all the
reactors of the site, and during which the operating
teams have limited means of communication.
The objective of our work was to study the design
of the operating teams in Extreme Situations and
of the National Crisis Management Organization of
EDF, in order to identify the strengths and the
areas for improvement in crisis management.

To do this, we observed simulations involving 
the entire crisis management organization of EDF, 
which would operate in an ES, from the operat- 
ing teams on-site, to the experts from the National 
Technical Support Team, along with members of 
the National Direction Command Post. External 
stakeholders (Public Powers, Regulatory Author- 
ity...) have not been taken into account in the 
simulation. 

We applied a multidisciplinary approach to
these simulations, cross-referencing the analyses of
experts in Cognitive Ergonomics, Human Reliability
and Nuclear Safety. Our method of analysis of
these simulations is based on the Model of Resilience
in Situation (Le Bot & Pesme, 2010), which
explains how a socio-technical system such as the
one we have observed can continue to operate
safely using a process of anticipation and a process
of adaptation. Therefore, beyond the operational
objective of studying the organizational system
anticipated in an ES, we examined more closely the
link between resilience and crisis management.
The purpose of this paper is to present
the method we have developed for the implementation
of the Extreme Situation Tests and the one
used for their analysis. We will first see why the specific
context of an ES is complex to simulate, and
how we managed to achieve it. We will then detail
the method of observation and analysis of the tests,
focusing on the functional analysis of the resilience
in accordance with the Model of Resilience in Situation.
Finally, we will present some of the conclusions
we have drawn from the analyses and the first
lessons for the implementation of simulations.


2 TESTS IN EXTREME SITUATIONS 


2.1 What is an extreme situation? 


The concept of Extreme Situations emerged fol- 
lowing the accident at Fukushima in 2011. The 
situation we considered as representative for the 
study of the crisis organization is a situation with 
characteristics similar to this one.
We suppose that a nuclear site is hit by a
major earthquake leading to the loss of internal
and external power supplies on at least two reactors,
with the isolation of the site preventing the
immediate arrival of the local on-call emergency
response teams and the Safety Engineer.

Furthermore, nearly all internal and external
communication systems of the site have been rendered
unavailable by the event: in the first tests,
the control room could not communicate remotely
with the field operators while they were carrying out
field manoeuvres. In later tests, an autonomous
communication system was available for the
teams to communicate with the field operators.
In the short term, the team in the control room
can only receive support from the National Emergency
Response Team by communicating with the
“National Direction Command Post” via an emergency
satellite telephone.


2.2 How to simulate an extreme situation? 


During our study, we observed two series of three 
tests, each involving complete operating teams on 
full scale simulators. Four tests were carried out on 
two simulators in parallel, representing a beyond- 
design-basis “multi-unit” site accident, simultane- 
ously affecting two reactors on one site. During 
one of these exercises, the two simulated units were 
of different technologies. The two remaining tests 
were focused on thermohydraulic design basis acci- 
dents in situations combined with another event 
(fire or flooding), in order to study the design of 
the team in a non-isolated site situation. 

A multi-disciplinary work group, consisting 
of representatives of the site operations depart- 
ments, training instructors on simulators and 
experts on ergonomics, human reliability and 
safety, prepared the tests. This group produced the 
test protocol and defined a scenario able to pro- 
vide sufficient data to understand crisis organiza- 
tion in a beyond-design-basis extreme situation. 
The accident scenarios were tested and validated 
in technical and documentary terms, and lastly, 
for each test, a prior “dummy” test was used to 
check the entire simulation system with site oper- 
ating teams. 

This study meant that, for the first time within the
company, tests could be performed on two full scale
simulators in parallel, both built in the training centre of
the sites: a “hardware” one reproducing the control
room exactly, and a “digital” one with touch
screens representing the devices of the control room.
During one of the tests, a third unit was simulated
on paper, which was a world first. The simulation of
several units raises numerous problems. For example,
certain units are paired and share common
equipment, whereas the full scale simulators are
technically independent. Thus,
during tests simulating two paired reactors sharing
common systems, the unavailability of shared
equipment needed to be simulated for one simulated
reactor if the other simulated reactor used it.

Furthermore, following the Fukushima acci- 
dent, it was decided to equip each reactor with 
complementary equipment for facing extreme 
situations, such as emergency unit cooldown die- 
sel generator sets: these were not yet taken into 
account in the simulators, or in the training of the 
operating teams at the time of the tests. Since the
objective of the simulations was to study the operation
of the crisis organization as it would be with
this equipment deployed, these devices were provisionally
simulated for the tests and the operating
teams participating in the tests were specially
trained in their use. The validation of the new
procedures was not an objective of the tests.

Lastly, it emerged from our preliminary analyses
of the Fukushima accident (Baudard, 2017) that
the field actions had played a very important role
in the crisis management. We therefore involved
the field operators of each operating team in the
simulation by asking them to simulate the completion
of the actions requested by the control room
in a degraded environment (poor lighting, access
path blocked, etc.). This participation by the field
operators helped to ensure a more realistic simulation,
even though they did not simulate their
actions directly in the field.


2.3. The observation system 


To limit as far as possible any technical contingencies
in the progress of the test, each scenario was
played out twice with different teams before being
observed during the final test. The aim of these precautions
was to seek possible faults in the procedures
which were being designed, but also to allow the
trainers to adapt to this unusual scenario.

Table 1. Data collection methods.
In-situ observation | Ergonomics | Human reliability
Note taking | Chronologies, actions performed, decision making, communication, difficulties observed.
Video & audio recording | Detailed subsequent analysis of sequences chosen | Used only in case of doubt
Types of instrumented collection | Process evolution; logbooks
Post-simulation debriefings | Debriefing focused on the Notable Events in the organization, noted during the observation and discussed between observers during the preparation of the debriefing

The tests involved up to 80 people, from their
preparation through to the implementation:


— two complete operating teams and one observer 
per team member 

— field operators for each team and one observer 
per group 

— the experts of the national Technical Support 
Team and two dedicated observers 

— members of the National Direction Command 
Post and two dedicated observers 

— Scenario creators and trainers 

— In-house specialists and others from outside of 
the company coming to observe the method of 
data collection 

— etc. 


The system observed is adapted to the crisis
organization which would take place in an ES
in order to represent it as accurately as possible.
However, entities from outside of the company
(prefecture, regulatory agency, media, etc.) were
not simulated. The system is characterized by
six observation posts (two simulators, two teams
of field operators, the National Technical Support
Team (NTST) and the National Direction Command
Post) across which the ergonomics, human
reliability and safety observers were spread. Each
simulation took place over a period of five hours,
followed by an on-the-spot debriefing and post-analysis
of the simulator logbook. The data collection
methods are detailed in Table 1.

Five work themes guided the organization, the 
observation and the analysis of the Extreme Situa- 
tion Simulations: 


— The design of the operating team 

— Field actions management 

— Information Exchange with the National Crisis 
Organisation 

— The use of the tools and resources available in 
an ES 

— The resilience of the organisation, which we will 
look at in more detail later. 


3 DATA ANALYSIS 


3.1 Analysis method 


The main stages in the analysis method are sum- 
marized in Table 2 below: 

Following the observation of a test, the multidisciplinary
analysis group identifies the favorable
and unfavorable factors for the organizational
resilience of the socio-technical system in Extreme
Situations. The common document base for analysts
is as follows:


— The exhaustive chronologies by group (operating
team, Technical Support Team, etc.) are
reconstructed from the notes taken by each
observer, distinguishing the actions carried out
by the group (application of procedures, launching
field actions, etc.) from events independent of
the group (equipment failure, action carried out
by an independent group, etc.).

— Identification of “Notable Events” (NE).
A Notable Event is an event, or the repetition of
an event, which reveals an action, the absence of
an action, a decision made, a collective or individual
initiative, or a fact that can either strengthen the
reliability and robustness of the organisation of the
operating team and/or the emergency response team
and facilitate sensemaking, or on the contrary make
the socio-technical system less reliable, make safety
barriers more fragile and damage sensemaking.
The NE are identified during the in situ observation,
subsequently during the group debriefing,
or during the reconstruction of the overall chronology.
NE are the central elements in the analysis
of the resilience of the socio-technical system.


The NE are then categorized according to the
groups they refer to (control room 1 or 2, emergency
response team, management control unit,
field operators), and a first level of analysis is used
to identify:


• the Technical NE, resulting from operations on
the process

• the Expertise NE on the assistance that the
experts provide for the operations

• the Organizational NE on the operation of the
group as a whole.


Table 2. Contribution of the different disciplines to the analysis of the situations.
Discipline | Contributions
Observers | Chronology by observer, then by group, then overall chronology
Human reliability + ergonomics | Raw analysis from observations and debriefings; Notable Events; technical points; summary of Notable Events by group and overall
Human reliability | Monacos chronologies (progress of operations); functional analysis of resilience
Ergonomics (cognitive analyses and analyses of group operations) | Timing charts of control room activity; themed analyses of debriefings and Site/National interactions; analysis by hypothesis


Lastly, the Human Reliability analysts link 
together the Notable Events, the characteristics of 
the system observed, and the resilience functions 
defined in the MRS. 


3.2 Model of resilience in situation 


The Model of Resilience in Situation (Le Bot &
Pesme, 2010) is an empirical model which was built
out of analyses of real or simulated accidents, based
on the theoretical framework of Social Regulation
(Reynaud, 1997). The model serves to reconcile
two seemingly opposite rationales: the anticipation
of potential situations on the one hand and adaptation
to the situation on the other. It is used to
dynamically describe the operational management
of a crisis situation, alternating between periods of
constructing ad hoc operating rules and periods of
application of these rules. Should the rules being
applied become obsolete, the process will repeat.

The MRS applies to crisis organization as a 
whole, as a dynamic network of work groups 
(operating team, experts, etc.) that interact, coop- 
erate, collaborate and coordinate themselves. Each 
of these groups, taken within its own environment 
(procedures, HMI, etc.), is considered a distributed 
cognitive interactive system. The overall resilience 
of the organization therefore results from interac- 
tions within the groups and interactions between 
groups. 

The model links together two processes, the execution
process and the adaptation process, as shown
in Figure 1.

In stable operating conditions, the system exe- 
cutes the operation rules (EXECUTION). The 
functions to be performed continuously are: 


— INFORMATION: selection and sharing of 
information based on the surveillance criteria 

— ACTION: act based on the objectives and their 
priorities with the corresponding resources 

— CONTROL: ensure that the action complies 
with the operating rules 


If the continuous VERIFICATION detects that
the rules are not appropriate or are obsolete (objective
achieved), the system initiates a Rupture phase
after a RECONFIGURATION: interruption of
the rules which are not relevant, mobilization of
resources, then ADAPTATION in order to readjust
the operating rules by carrying out DIAGNOSIS,
PROGNOSIS, SELECTION of relevant
procedures and parameters, PRIORITISATION
of objectives, COLLABORATION to negotiate
and define the operating rules, and COOPERATION
to distribute the tasks and the resources.

The operating rules are validated (VALIDATION)
and shared so that the parties involved can make
sense of the situation (SENSEMAKING) while
implementing them, linking their past experience,
present actions and future projections.


Figure 1. The model of resilience in situation. [Diagram: prescriptions and resources feed the ADAPTATION block (diagnosis, prognosis, selection, prioritization, collaboration, cooperation, validation), which produces the operating rules for the EXECUTION block (information, control, verification); reconfiguration and sensemaking link the two blocks.]


3.2.1 Definitions of the MRS functions 
Adaptation process: 

The adaptation process involves redefining opera- 
tion: objectives, strategy and resources to achieve 
the objectives. This means: 


— Anticipating the behaviour of the installation 
and the actions to be carried out (diagnosis and 
prognosis functions) 

— Selecting the relevant information, procedures, 
instructions and means/resources (selection 
function) 

— Collaborating to adapt them autonomously if 
necessary, to define the new strategy with the 
operational objectives and resources (collabora- 
tion and prioritisation functions) 

— Validating the implementation of the rule by the 
authorised party (validation function). 


Execution process: 
The execution process involves robustly imple- 
menting the agreed strategy: 


— The robustness is obtained by the execution
(action function) and the control of the ongoing
actions (control function).

— The group constantly checks that this strategy
remains appropriate with regard to the situation
(verification function).

— These functions are carried out by the acquisition
and the sharing of information (information
function), and guided by sensemaking
(sensemaking function).


3.3 Functional analysis of resilience 


During our analyses of the tests, we seek to relate 
Notable Events with these resilience functions and 
estimate whether they are favorable or unfavora- 
ble. For example, the contribution of a member 
from outside of the operating team to the produc- 
tion of the rules to be followed is a favorable fac- 
tor for COLLABORATION between groups. On 
the contrary, the workload of parties involved in 
Extreme Situations slowed down the management 
of field actions, which was an unfavorable factor 
for the ACTION and CONTROL functions to be 
carried out, as well as for the PRIORITISATION 
of actions. 
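
The bookkeeping behind this functional analysis can be sketched as a small data structure. The example below is a hypothetical illustration, not the tool used by the analysts: it records each Notable Event with its group, its type, the MRS functions it relates to and whether it is favorable, and then counts favorable and unfavorable events per function. The class names, field names and the group assigned to the two sample events are assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class NEType(Enum):
    TECHNICAL = "technical"
    EXPERTISE = "expertise"
    ORGANIZATIONAL = "organizational"


@dataclass(frozen=True)
class NotableEvent:
    description: str
    group: str          # e.g. control room 1, field operators
    ne_type: NEType
    functions: tuple    # MRS functions the event relates to
    favorable: bool


def summarize(events):
    """Count favorable and unfavorable events per MRS function."""
    counts = Counter()
    for ev in events:
        for fn in ev.functions:
            counts[(fn, ev.favorable)] += 1
    return counts


# The two examples given in the text; the group is a placeholder.
events = [
    NotableEvent("Outside member contributes to producing the rules",
                 "control room 1", NEType.ORGANIZATIONAL,
                 ("COLLABORATION",), True),
    NotableEvent("Workload slows down the management of field actions",
                 "control room 1", NEType.ORGANIZATIONAL,
                 ("ACTION", "CONTROL", "PRIORITISATION"), False),
]
summary = summarize(events)
```

Such a tally only supports the qualitative analysis; the judgment of whether an event is favorable remains with the multidisciplinary analysis group.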

Our analysis can be summarized by the diagram in Figure 2. Taking the example of the use of a new field actions management tool, the Field Actions Monitoring Device, we obtain the analysis shown in Figure 3. 

This tool was introduced in the second series of tests to resolve the difficulties observed. More specifically, its use in extreme situations allowed us to observe that the tool ensured that the field actions to be carried out were managed correctly. With minimal preparation in the use of this new tool, the team was able to implement an organization ensuring effective management of field actions, which is particularly important in an ES. Furthermore, this tool facilitates the prioritization of field actions, the monitoring of field operators and their optimization in a situation requiring many field actions. Therefore, the Field Actions Monitoring Device was a favorable factor in controlling the state of the reactor in a degraded situation. 


4 GENERAL DISCUSSION 


The results of our analyses show that the design 
basis of the crisis management organization in 
Extreme Situation was not called into question. It 
also appears that some observed difficulties require 
more consideration in the preparation of the oper- 
ating teams in order to strengthen resilience. 


4.1 Management of field actions 


The main observation relates to the management of field actions and of the “field operator” resources. In a design basis accident, if electrical power sources are lost, the necessary safety-related equipment items of the facility are backed up by Emergency Diesel Generators and can still be controlled remotely from the control room. Field actions then involve trying to return systems to service or checking the shutdown of stopped systems, and therefore these actions are generally not essential for the operation of the reactor. In an Extreme Situation with a beyond-design-basis loss of electrical power sources, these backups may be lost and the operation of the installation may require direct field action to control certain equipment, try to return it to service or gather information. The field operators are therefore in high demand. They are sent into the field on a case-by-case basis, depending on each action requested by the team in the control room, with one or more field action sheets. Their actions are not carried out instantly, since they need to travel to the premises before executing their sheet, and the action itself must often be executed manually. Each sheet therefore requires human resources: one or more field operators will need time to perform the action requested. 

Figure 3. Example of functional analysis of resilience method: Field Actions Monitoring Device. 

Prioritizing and re-prioritizing the field action sheets becomes a condition for the success of the crisis management in the control room. This activity places considerable cognitive demands on the person responsible for it, and has an impact on the collective operation of the team. During our tests, this activity took up a lot of the supervisor’s time, leaving them less available for the team and for their own supervision tasks. 

To help manage the field actions and the field operators, a prototype of a Field Actions Monitoring Device has been designed as part of an agile process (De la Garza et al., 2016). This is a triptych panel providing an overview of the pending actions so that they can be prioritized, then showing those which are in progress and who is carrying them out, and finally the successes and failures. This system supports ES operation, making it easier to monitor and ensure the safety of the field operators who leave to carry out actions. It also allows information to be shared within the operating team and becomes an area for exchanges, or even collective problem-solving. 


4.2 Reflections on the ES simulation system 


The organization and analysis of an Extreme Situ- 
ation test is costly in terms of time and resources: 
60 to 80 people from different entities (operators, 
engineering, trainers, R&D, etc.) working on the 
organization and progress of the tests requiring 
4 to 6 months of preparation, then 4 to 6 months 
of analysis... We have adopted a procedure of con- 
tinuous improvement in the organization of the 
tests, progress of which has been spread over more 
than three years, getting the various stakeholders 
in the preparation involved earlier in the process, 
and optimizing the analysis method. 

However, questions remain pending and there are areas for improvement in the organization of simulations on such a large scale, in particular around the four points discussed below, which relate to the preparation, the creation of scenarios and the control of certain variables. 


4.2.1 Unpredictability of the scenario 

The scenario we designed for the tests was a succession of equipment failures (Loss of Off-site Power, Loss of On-site Power...) similar to the damage that the Fukushima Daiichi NPP faced during the crisis. In order to successfully manage the accident, the operating teams have to apply rarely used procedures. 

Before the final test, we had to validate the technical aspects of the simulation, such as the behavior of the Full Scale Simulator in a situation where it had to cope with many equipment failures, or how well the procedures matched the situation. 

The problem is that, because of the autonomy 
we gave to the operating teams during the simula- 
tions, if we want to validate the adequacy of the 
procedures, we have to make sure that they are 
correctly applied by the operating team: operators 
might choose another procedure to apply, if they 
consider it more appropriate. Instructors however 
are trained to strictly apply the procedures, so we 
preferred having them for the technical validation 
of the scenarios. 

Even though the technical validation is necessary, it does not protect against problems during the test. But since the objective of the tests was to study the resilience of the teams, we left them a lot of autonomy in the operations, and we would not have stopped the simulation had they applied a procedure we had not expected. 


4.2.2 Simulating equipment which is scheduled 
but not yet installed 

How can the players be prepared to manage a 
simulation which will require equipment and an 
organization which are not yet scheduled as part 
of their training, and at the same time make the 
situation as close as possible to the target? Here we 
have a modification of an existing situation. 

To train the teams, we offered them preparatory 
information meetings, during which the objective 
of the tests, the scheduled progress for the day and 
the new systems scheduled for crisis management 
were presented, without giving them any indica- 
tion of the scenario which would be played out. 

Research is in progress at EDF on rapid means 
of prototyping control room interfaces, which can 
be combined with the simulator design codes. The 
equipment scheduled as part of the post-Fuku- 
shima project could therefore be integrated into 
the simulators during extreme situation tests. 


4.2.3 Getting the players involved 
How can the players get involved in the simula- 
tion of a faulted condition, in a highly-degraded 
environment? 

It is of course impossible to recreate in the control room or in the field the degraded conditions encountered by the operators during the Fukushima accident. However, using field operators, we regularly provided the control room with information on the degraded state of the facilities. For example, on returning from a simulated field inspection, the field operator drew up for the supervisor the list of inaccessible premises, damaged equipment, etc. 

We have seen tests in which the teams attached 
great importance to the safety of the participants 
in the field, avoiding sending them for field actions 
which are irrelevant given the situation, or sending 
them in pairs. These observations were interpreted 
as a good understanding of the situation in hand. 

It was not always clear to the teams, however, that the site was isolated, to the extent that some were waiting for the arrival of the on-call teams before launching important actions. The question arises of how to simulate the consequences of the external hazard on the environment of the plant, which have a major impact on the operators and their actions. 


4.2.4 Importance of multi-reactor accident 
simulations 

Our studies into the Fukushima accident high- 
lighted the interactions which took place between 
the different damaged units on the site (Baudard, 
2017): immobilization of resources, focus on one 
reactor at the expense of the others, transfer of 
experience, etc. 

Thanks to the autonomy we left to the teams, they were able to perform actions outside the procedures, and we identified transfer of experience mechanisms during the simulations. For example, one operating team benefited from the experience of the team from the neighboring plant unit in restoring the power supply using the backup means scheduled in the post-Fukushima provisions. 

Our work on the ES tests highlighted favorable factors allowing these beyond-design-basis crisis situations to be managed, but there are also factors which will require progress. Since these simulations of multi-unit accidents are relatively recent, we think that it would be helpful, for understanding and improving resilience, to develop these simulation situations further, but also to simplify them. 


4.3 Our proposition for future preparations 


Indeed, even though the organization, observation 
and analysis of the tests have been highly benefi- 
cial for the development of knowledge on Extreme 
Situations management by the parties involved 
within the company, these tests are too costly and 
too time consuming. 

We have already started investigating lighter simulation methods for the teams involved in crisis management (through Serious Games or Storytelling, for example), focusing on one specific group rather than the whole organization, and simulating its environment. A prototype has been tested with the National Technical Support Team (Alengry et al., 2018). 


ABSTRACT: Road tunnel management is oriented towards protecting tunnels from fire accidents that can evolve into disasters causing several human losses and extensive infrastructure destruction. In order to enhance tunnels’ safety, analysts usually employ Quantitative Risk Assessment (QRA) models for predicting human losses. However, human behaviour introduces significant uncertainties during the evacuation process. This paper aims to address this problem by proposing a stochastic-based evacuation model. Employing a Monte Carlo approach, a set of simulations is conducted in which users’ behaviour modelling is based on data from the existing theoretical framework, post-accident reports, conducted experiments and legislation requirements. Applying the model to a typical Greek tunnel, the outcome highlights a significant proportion of scenarios that exceed the losses estimated by traditional approaches, also revealing potential fallacies. The model’s contribution rests on the provision of a stochastic-based simulation that is closer to describing reality than a simple deterministic model, as far as users’ evacuation is concerned. 


1 INTRODUCTION 


Road tunnels are regarded as a key element of the road network, as they connect remote regions by creating short-cuts through mountain ranges. In addition, the growing concentration of population in large urban areas has inevitably led to the extended use of underground road tunnels in order to relieve the increasing traffic volumes (PIARC, 2016). 

Due to the aforementioned uses, road tunnels are considered critical infrastructures. A critical infrastructure can be defined as “... facilities of key importance to public interest whose failure or impairment could result in detrimental supply shortages, substantial disturbance to public order or similar dramatic impact” (Gheorghe, et al., 2006, p. 6). This criticality was confirmed by the adverse effects, in terms of human losses and infrastructure destruction, of various past accidents. For instance, the Mont Blanc tunnel accident in 1999, the worst accident in Europe, cost the lives of 39 people and around 350 million Euros for the repair (AADT, 1999). Likewise, in the Sasago tunnel accident in 2012, the worst tunnel accident in Japanese history, the tunnel ceiling collapsed, causing a fire and costing the lives of 9 people. Meanwhile, the road network of the region remained closed for almost two months (Maskura, et al., 2015). In this respect, road tunnels, like any complex socio-technical and critical system, must be designed to ensure an acceptable level of safety (Kirytopoulos & Kazaras, 2011). For tunnels specifically, this means protection against fire accidents, given that these can evolve into disasters. 

In order to enhance tunnels’ level of safety, safety analysts usually apply the risk assessment process based on various Quantitative Risk Assessment (QRA) models (Kirytopoulos, et al., 2010). The ultimate goal of a QRA model is to estimate the tunnel level of safety in case of a fire accident, focusing primarily on predicting human losses amongst trapped-users. In this respect, QRA models often examine a subset of crucial fire scenarios in a deterministic approach based on the specific guidelines and regulations that each country has adopted (PIARC, 2012). However, post-accident reports and full-scale and virtual experiments on human behaviour under emergency situations have illustrated significant uncertainties regarding the self-evacuation process of the trapped-users. Despite that, national guidelines as well as QRA models do not take this information into account in sufficient detail during the analysis. 

This paper aims to address this problem by proposing a model with a stochastic-based approach. The proposed model employs a Monte Carlo approach for conducting a set of simulations in which users’ behaviour during the discrete stages of their self-evacuation process is based on probabilistic data. The probabilistic data used in the model were sourced from findings from existing theories, post-accident reports, full-scale as well as virtual experiments, and legislation requirements. The outcome provides safety analysts with a prediction of the distribution of both successful and non-successful self-evacuations, so that they can estimate the overall tunnel safety and select additional measures, if required. Furthermore, by surfacing the impact of self-evacuation on the evolution of the accident, it points out the importance that authorities and tunnel managers should place on educating tunnel users to react appropriately in such accidents. 


2 EVACUATION PROCESS IN TUNNEL 
FIRE ACCIDENTS 


2.1 General principles of the evacuation models 


The safety ground of the road tunnel system encompasses all the crucial elements of the system and classifies them into five basic categories, namely: (a) the facilities, (b) the infrastructure, (c) the traffic, (d) the users and (e) the vehicles (Kirytopoulos, et al., 2017). Risk assessment methods deal with these categories, which combine to determine the tunnel level of safety, with the aim of making it as high as reasonably possible. 

Users’ losses are the representative parameter that indicates the preparedness of the tunnel system in confronting fire accidents. Thus, each of the risk assessment methods focuses primarily on estimating the possible losses amongst trapped-users and is concerned with how to reduce them (PIARC, 2012). 

Indeed, tunnel users constitute the most vulnerable factor of the system. They are the first to be confronted with the consequences of a fire in a tunnel, in most cases without adequate experience of such circumstances. Moreover, they do not have appropriate equipment with them and often lack the training that other groups, such as the members of the rescue teams, have on how to react in critical situations (Kirytopoulos, et al., 2017). 

Furthermore, a fire in a tunnel behaves very differently from a fire on the open road. In particular, it develops rapidly, remains at the maximum Heat Release Rate (HRRmax) for longer and releases much more smoke (Beard & Carvel, 2012; PIARC, 2017). As a result, trapped-users have to evacuate themselves within a strictly limited time interval, which does not allow for any delay in the beginning of the self-evacuation process in anticipation of external rescue teams. 

In line with the principles of risk management, in which the user factor must be taken into account, users’ behaviour during the evacuation is at the centre of attention in order to assess and enhance road tunnels’ level of safety. 

The majority of the evacuation models, such as engineering hand calculations and computer tools, are used to calculate the time it takes for trapped users to evacuate the tunnel, walking away from the fire environment and heading as soon as possible towards a safe place such as the emergency exit doors or the tunnel portals (depending on the fire location). An important prerequisite in this direction is the establishment of two basic time parameters, the Available Safe Egress Time (ASET) and the Required Safe Egress Time (RSET). ASET is defined as the time which is actually available to trapped users between fire ignition and the time point at which conditions become untenable for human life because of the high levels of pollutant concentration and radiation. RSET, on the other hand, refers to the time that trapped users actually need for a successful outcome of their self-evacuation. The aim of the safety analysts is to ensure that ASET is greater than RSET (Kinateder, et al., 2015). To do so, they have to forecast two main aspects of the evacuation process: the actions that people take and the time needed for these actions to be performed. 
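
The relation between the two time parameters can be illustrated with a minimal Python sketch. It assumes that RSET is simply the sum of a delay before moving and a walking time at constant speed, which is a simplification; the function names and numerical values are illustrative and are not taken from the paper.

```python
def required_safe_egress_time(no_moving_s: float, distance_m: float, speed_m_s: float) -> float:
    """RSET in seconds: delay before moving plus walking time to the safe place."""
    if speed_m_s <= 0:
        raise ValueError("walking speed must be positive")
    return no_moving_s + distance_m / speed_m_s


def is_evacuation_successful(aset_s: float, rset_s: float) -> bool:
    """Self-evacuation succeeds when the available time exceeds the required time."""
    return aset_s > rset_s


# Illustrative values only: 90 s delay, 200 m to the exit at 1 m/s, ASET of 360 s.
rset = required_safe_egress_time(no_moving_s=90.0, distance_m=200.0, speed_m_s=1.0)
print(rset, is_evacuation_successful(aset_s=360.0, rset_s=rset))
```

In this toy case the user needs 290 s and has 360 s available, so the evacuation is counted as successful; a longer delay or a lower walking speed erodes that margin.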

However, most risk assessment methods as well as evacuation models implement oversimplified assumptions about the different stages of the evacuation process and the behaviour of trapped-users. Furthermore, they have to follow the country-specific regulatory requirements. These approaches might increase the uncertainty of evacuation models, which subsequently leads to considerable uncertainty in the whole safety approach. Hence, it is important to design models that take into account information or clues from both real accidents and studies of human behaviour in fire evacuation, and to adapt the models accordingly. If so, the uncertainties regarding the overall level of tunnel safety could be diminished. 


2.2 Human behaviour in evacuation process 


2.2.1 Theoretical framework 
Human behaviour in the fire evacuation process depends on which phase of the process the user is in. The overall evacuation process in tunnels can be separated into three main phases (Kuligowski, 2013). 

The first phase is the ignition phase. This phase involves some variability, as not all users perceive the fire promptly or at the same time. For this reason, another accident, such as a crash, may occur upstream of the fire location. The second phase is the pre-evacuation phase. In this phase the user receives, collects and processes the cues in order to decide on his own self-evacuation strategy. The progress of this phase depends on both the knowledge of the user and the user’s risk perception. These two phases constitute the no-moving period, and the time that elapses can prove fatal for the users. Experiments showed that under certain circumstances even a delay of a couple of seconds can incapacitate the user (Ntzeremes, et al., 2018). The last phase of the evacuation process is the moving period. In this phase, trapped-users begin to walk away from their positions inside the tunnel, heading towards a safe location. In tunnels, depending on the fire location, emergency exit doors and/or the tunnel portals are considered safe locations. However, post-accident reports indicated that not all users were willing to abandon their cars, either hoping that rescue teams would extinguish the fire soon or underestimating the fire behaviour (AADT, 1999). 

With regard to the aforementioned phases of the evacuation process, there are some common theories for interpreting trapped-users’ behaviour (Fridolph, et al., 2013). The first theory is the behaviour sequence model, which separates human behaviour into four distinct mental and physical actions: a user first receives cues, then interprets them, subsequently prepares and finally acts. The next theory is the role-rule model. This model assumes that every user in a critical situation behaves according to the set of rules stemming from his role. A further theory is the affiliation model, which assumes that a person heads to places, or follows people, that are familiar to him. The last theory deals with social influence. This approach considers that the presence of other people affects the user’s evacuation process. 


2.2.2 Evidence from real and virtual cases 

Alongside the theoretical framework, a valuable source on human behaviour in fire accidents is the evidence from real and virtual accidents, both for the no-moving period and for the moving period. 

Regarding the post-accident reports, in the most disastrous tunnel accident in history, the Mont Blanc accident, one of the most important observations is that a significant number of users (27) delayed the evacuation process considerably or did not even start it, remaining in their vehicles. By contrast, in the Burnley tunnel accident in Melbourne in 2007, the quick reaction of both the users and the emergency response units reduced the RSET significantly, reducing both the casualties and the injuries (Fridolph, et al., 2013). 

Furthermore, studies based on virtual accidents can also provide analysts with information about users’ behaviour. For instance, some studies illustrate the important factor of social influence during evacuation (Kinateder, et al., 2014). Other studies explored users’ risk perception and the relationship between fire and awareness of a successful evacuation process (Ronchi, et al., 2015). Still others explored users’ behavioural intentions and knowledge in case of fire accidents. The outcomes were far from the expected ones, as users often totally underestimated the criticality of the accidents (Kirytopoulos, et al., 2017). This information illustrates the high variability in the evidence on how trapped-users might behave during fire accidents. Thus, there is a lack of data available for use by evacuation models as far as the no-moving period of users’ self-evacuation process is concerned. 

However, similar variability and lack of data exist in the estimation of the moving period, too. One of the most important factors when assessing safety in tunnels is the users’ walking speed through smoke, which is the basic evacuation performance characteristic. Smoke from a fire initially flows in a stratified layer along the tunnel tube; the stratification is later disrupted as the ceiling and walls absorb heat. Subsequently, the smoke starts to diffuse inside the tunnel, making evacuation, rescue and fire extinguishing activities extremely difficult. Studies have shown that walking speed does not stem only from the physical ability of the user but is also affected by the speed of information exchange, the experience and knowledge of the user, and social influence (Ronchi, et al., 2015). 

Furthermore, a number of full-scale experiments estimating walking speed have been conducted. In such experiments, researchers examine different groups of participants in different smoke situations (irritant and non-irritant smoke). The outcome shows a strong interrelation between the extinction coefficient, which shows how easily the air can be penetrated by a beam of light, and the estimated walking speed in normal and emergency situations (Seike, et al., 2016). 


2.2.3 Country-specific regulations 

Last but not least, the country-specific regulations usually determine the context in which a safety analyst operates, thereby affecting the analysis. Often, in these regulations, the walking speed of the user in relation to opacity is provided as a deterministic number, which does not always reflect reality (PIARC, 2012). 


3 METHODOLOGY 


In order to estimate the tunnel level of safety, a simulation-based approach is used. This approach can aid a better understanding of the phenomenon, namely the outcome of the self-evacuation process of trapped-users in a fire accident, and has the advantage that it includes and integrates available data from existing theories and experiments. The structure of the process is as follows: 

Initially, the fire scenario along with the fire 
location are selected. Trapped-users’ evacuation is 
illustrated as a trajectory, depicted as line segments 
depending on the walking speed, with regard to 
tunnel length and time. Fig. 1 provides an example 
of the evacuation process. 
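
The idea of a trajectory made of line segments can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the fixed time step and the `speed_at` callback (a hypothetical stand-in for an opacity-dependent walking speed) are not part of the paper's model.

```python
from typing import Callable, List, Tuple


def evacuation_trajectory(
    start_m: float,
    exit_m: float,
    no_moving_s: float,
    speed_at: Callable[[float, float], float],
    dt_s: float = 1.0,
    t_max_s: float = 1800.0,
) -> List[Tuple[float, float]]:
    """Return (time, position) points of one user's piecewise-linear trajectory."""
    t, x = 0.0, start_m
    points = [(t, x)]
    direction = 1.0 if exit_m >= start_m else -1.0
    while t < t_max_s and x != exit_m:
        t += dt_s
        if t > no_moving_s:  # the user only walks once the no-moving period is over
            step = speed_at(x, t) * dt_s
            remaining = abs(exit_m - x)
            # Clamp the last step so that the user stops exactly at the exit.
            x = exit_m if step >= remaining else x + direction * step
        points.append((t, x))
    return points


# Illustrative use: constant 1 m/s, 60 s delay, exit 50 m away from the start.
path = evacuation_trajectory(350.0, 300.0, 60.0, lambda x, t: 1.0)
print(path[-1])
```

Each pair of consecutive points is one line segment in the length-time plane; a flat segment corresponds to the no-moving period and the slope of the later segments is the walking speed.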

Road tunnels generally have simple and straight geometries, in which the relative distance of the users from the fire source, the fire characteristics and the time spent inside the tunnel are the main factors affecting life safety. In order to estimate the impact of fire on trapped-users, the tunnel’s airflow conditions (the air temperature and the air opacity) must first be estimated. The Computational Fluid Dynamics (CFD) process is performed with the aid of Camatt 2.0 software. The outcome provides two sets of data, the air temperature and the air opacity, with regard to the tunnel length and time. Subsequently, these data are used to estimate the effect of heat and pollution along the evacuation trajectories of the users. 

Figure 1. Evacuation process with fire location at 350 m. 

Trapped-users’ safety is affected by the accumulated heat from both radiation and convection. The Fractional Effective Dose of heat ($FED_{heat}$) at every time step and location is estimated by the following equation: 

$FED_{heat} = \sum \left( \frac{1}{t_{Iconv}} + \frac{1}{t_{Irad}} \right) \Delta t$ (1) 

where $t_{Iconv}$ is the time duration for convective heat, which is calculated as: 

$t_{Iconv} = (5 \times 10^{7})\, T^{-3.4}$ for light clothing (2) 

where $t$ is the time in minutes and $T$ is the temperature in °C, and where $t_{Irad}$ is the time duration for burning of skin by radiation, which is calculated as: 

$t_{Irad} = 4\, q^{-1.56}$ (3) 

where $t$ is the time in minutes and $q$ is the radiant heat flux in kW/m² (for q > 2.5 kW/m²; for q < 2.5 kW/m² this time is equal to 30 min). 

Along with the effect of heat, trapped-users are also affected by the accumulated toxicity from the concentration of pollutants. Carbon monoxide (CO) is the toxic gas on which the toxicity assessment is based, in line with the national standards. The Fractional Effective Dose of CO ($FED_{CO}$) at every time step and location is estimated as follows: 

$FED_{CO} = \sum \frac{0.00083 \cdot CO^{1.036} \cdot \Delta t}{D}$ (4) 

where $\Delta t$ is the exposure time in minutes, $CO$ is the CO concentration in ppm and $D$ is the concentration at incapacitation. 
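
The accumulation of the two doses along a trajectory can be sketched in Python as below. This is a hedged illustration, not the authors' implementation: the function names are invented here, the default constants are assumptions that should be checked against the standard the paper relies on, and the incapacitation value `d_incap = 30.0` is an assumed figure that the text does not state.

```python
from typing import Sequence


def t_conv_min(temp_c: float, coeff: float = 5e7, exponent: float = -3.4) -> float:
    """Tolerance time (min) to convective heat; default constants are assumptions."""
    if temp_c <= 0:
        raise ValueError("temperature must be positive (deg C)")
    return coeff * temp_c ** exponent


def t_rad_min(q_kw_m2: float, coeff: float = 4.0, exponent: float = -1.56,
              threshold_kw_m2: float = 2.5, t_below_min: float = 30.0) -> float:
    """Tolerance time (min) to radiant heat; default constants are assumptions."""
    if q_kw_m2 <= threshold_kw_m2:  # treating q == threshold as "below" is an assumption
        return t_below_min
    return coeff * q_kw_m2 ** exponent


def fed_heat(temps_c: Sequence[float], fluxes_kw_m2: Sequence[float], dt_min: float) -> float:
    """Sum the convective and radiant dose fractions over equal time steps."""
    return sum((1.0 / t_conv_min(temp) + 1.0 / t_rad_min(q)) * dt_min
               for temp, q in zip(temps_c, fluxes_kw_m2))


def fed_co(co_ppm: Sequence[float], dt_min: float, d_incap: float = 30.0,
           k: float = 0.00083, exponent: float = 1.036) -> float:
    """Sum the CO dose fractions over equal time steps; d_incap is an assumed value."""
    return sum(k * c ** exponent * dt_min / d_incap for c in co_ppm)


# Illustrative exposure: five one-minute steps at 80 deg C, 1 kW/m2 and 800 ppm CO.
print(fed_heat([80.0] * 5, [1.0] * 5, 1.0), fed_co([800.0] * 5, 1.0))
```

A dose of 1 or more is conventionally read as incapacitation, so in a model of this kind a trajectory would be counted as a loss once either running sum reaches that level; that stopping rule is likewise an assumption of this sketch.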

In order to estimate the FED of heat and of pollutant concentration, further information is needed regarding the evacuation trajectory of the users. In the presented model, in contrast to the commonly used deterministic approach, the stochastic variables of the evacuation process are defined by drawing on both the current literature and the national regulations. 

Initially, a time that represents the no-moving period is selected. The no-moving period, consisting of the ignition and the pre-evacuation phases, is defined by the stochastic variable of how much time elapses before the moving period begins. In Purser’s study, this time for the Mont Blanc accident was specified as between 40 s and 90 s. Another study from Japan specified this time as between 30 s and 120 s (Seike, et al., 2016). Greek regulations, by contrast, do not determine this time, but the time most commonly used in such analyses is approximately 90 s. Synthesising the existing data with the Greek provisions, a uniform distribution with 60 s as the lower limit and 120 s as the upper limit is adopted in order to be on the safe side. 
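
Drawing the no-moving time from this uniform distribution can be sketched with Python's standard library. The function name and the fixed seed are illustrative choices, not part of the paper.

```python
import random


def sample_no_moving_time(rng: random.Random, low_s: float = 60.0, high_s: float = 120.0) -> float:
    """Draw one no-moving time (s) from a uniform distribution on [low_s, high_s]."""
    return rng.uniform(low_s, high_s)


rng = random.Random(42)  # fixed seed only to make the sketch reproducible
samples = [sample_no_moving_time(rng) for _ in range(10_000)]
print(min(samples), max(samples), sum(samples) / len(samples))
```

The sample mean lies close to 90 s, the value most commonly used in deterministic analyses, while individual draws cover the whole 60 s to 120 s range.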

In addition, the stochastic variable that characterises the moving period is the walking speed of the users. To represent reality as closely as possible, the process employs the results of existing studies on the connection between air opacity and walking speed (Seike, et al., 2016; Ntzeremes, et al., 2018). Synthesising these data with the Greek provisions, three moving intervals are defined (Table 1). 

Table 1. Users’ walking speed (m/s). 

Opacity (m⁻¹)     (0, 0.50)    [0.50, 0.70]    [0.70, ...) 
Distribution      Normal       Normal          Normal 
Mean              1            0.5             0.2 
St. Dev.          0.067        0.067           0.05 

Table 2. Tunnel’s attributes. 

Designing features        One dimension, single sector, rural 
                          Total length                                           3,000 m 
                          Slope                                                  -1.5% 
                          Number of exit doors                                   6 
                          Number of traffic interruptions                        7 
                          Starting time of traffic lights after fire ignition    5 min 
Mechanical ventilation    Number of jet-fan arrays                               8 
                          Starting time                                          2 min 
Traffic conditions        Vehicle flux                                           55 veh./hr 
                          Proportion of HGVs                                     30% 

At the beginning, the process estimates human losses using the deterministic values of the Greek provisions (AAT, 2011). These values are the mean points of the aforementioned distributions. This provides the outcome that safety analysts would estimate according to the current deterministic approach. 
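
Sampling a walking speed from the three opacity intervals of Table 1 can be sketched as follows. The half-open interval boundaries (opacity 0.70 falls in the third band, opacity 0 in the first) and the small positive floor on the sampled speed are assumptions of this sketch, as are the function names.

```python
import math
import random

# (lower opacity bound, upper opacity bound, mean speed m/s, standard deviation m/s)
SPEED_BANDS = [
    (0.0, 0.50, 1.0, 0.067),
    (0.50, 0.70, 0.5, 0.067),
    (0.70, math.inf, 0.2, 0.05),
]


def speed_parameters(opacity_per_m: float):
    """Return (mean, standard deviation) of the walking speed for a given opacity."""
    for low, high, mean, std in SPEED_BANDS:
        if low <= opacity_per_m < high:
            return mean, std
    raise ValueError("opacity must be non-negative")


def sample_walking_speed(opacity_per_m: float, rng: random.Random, floor_m_s: float = 0.01) -> float:
    """Draw a walking speed (m/s) from the normal distribution of the matching band."""
    mean, std = speed_parameters(opacity_per_m)
    # The floor avoids non-physical zero or negative speeds from the normal tail.
    return max(floor_m_s, rng.gauss(mean, std))


rng = random.Random(1)
print([round(sample_walking_speed(o, rng), 3) for o in (0.2, 0.6, 0.9)])
```

Using the band means alone (1, 0.5 and 0.2 m/s) reproduces the deterministic case, which makes the two approaches directly comparable.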

After building the model, the simulation is executed in Matlab 2017b
for 1000 iterations, using the time distribution of the no-moving
period and the elements of Table 1 for the moving period. This step
forms the “60-120” scenario. Once the distribution of losses has been
created, the results of the deterministic and the stochastic approaches
are compared. Afterwards, the process is repeated twice more. First,
the upper limit of the no-moving period is reduced by 30 s, creating a
new uniform distribution between 60 s and 90 s (“60-90” scenario).
Subsequently, the delay time of trapped users is reduced overall,
creating a second new uniform distribution between 30 s and 90 s
(“30-90” scenario). These two scenarios correspond to two user training
scenarios, one aiming to reduce the upper reaction limit and the other
to reduce both the upper and the lower limits.

In the end, the outcome provides a prediction of the distribution of
non-successful self-evacuations.
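
The sampling step of this procedure can be sketched in Python as follows. This is a minimal illustration and not the authors’ Matlab model: it only draws the stochastic inputs, namely the no-moving period from a uniform distribution and the walking speed from the normal distributions of Table 1. The function names, the fixed seed, the assignment of the shared 0.70 boundary to the slowest band and the truncation of negative speeds at zero are all assumptions. The step that converts a sampled trajectory into a loss is not reproduced here.

```python
import random

# Table 1: (upper opacity bound in 1/m, (mean, std. dev.) of walking speed in m/s).
# Assumption: an opacity of exactly 0.70 falls into the slowest band.
SPEED_BY_OPACITY = [
    (0.50, (1.0, 0.067)),
    (0.70, (0.5, 0.067)),
    (float("inf"), (0.2, 0.05)),
]

# Bounds (s) of the uniform no-moving period for the three scenario sets.
SCENARIOS = {"60-120": (60.0, 120.0), "60-90": (60.0, 90.0), "30-90": (30.0, 90.0)}


def sample_speed(opacity, rng):
    """Draw one walking speed (m/s) for the given smoke opacity."""
    for upper, (mean, sd) in SPEED_BY_OPACITY:
        if opacity < upper:
            # Assumption: negative draws are truncated at zero.
            return max(0.0, rng.gauss(mean, sd))


def sample_inputs(scenario, opacity, n=1000, seed=0):
    """Return n pairs (no-moving period in s, walking speed in m/s)."""
    rng = random.Random(seed)
    low, high = SCENARIOS[scenario]
    return [(rng.uniform(low, high), sample_speed(opacity, rng)) for _ in range(n)]


samples = sample_inputs("60-90", opacity=0.6, n=1000, seed=1)
```

Each pair would then be passed to the evacuation model to decide whether the corresponding trajectory is fatal.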


4 ILLUSTRATIVE CASE 


4.1 Case description 


A de-identified road tunnel in Greece, fulfilling both the Greek and
European safety requirements, is selected. The examined fire scenario,
which is a standardised fire scenario of the Greek regulations,
involves a fire of HRRmax 100 MW starting from an HGV that is not
carrying dangerous goods (PIARC, 2017; Ntzeremes et al., 2018). The
location of the fire, 350 m from the entrance of the tunnel, is part
of the entry area, which is regarded as the most common and one of the
most vulnerable locations of the tunnel (PIARC, 2017, p. 12). The
tunnel’s attributes are shown in Table 2.


4.2 Results 


In order to simulate the trajectories of trapped users, their initial
locations are required. In this respect, the analysis assumes uniform
traffic conditions and a uniform distribution of the stopped vehicles.
So, at the time the fire breaks out, the traffic simulation results in
10 trapped vehicles forming one row with 10 m between them. Hence, the
model estimates the losses corresponding to the locations where the
vehicles have stopped (e.g. Fig. 1), taking into account that each
vehicle has two users.

On the one hand, the analysis following the traditional deterministic
approach, using the mean points of the aforementioned distributions,
estimates eight losses (there are four fatal trajectories,
corresponding to the fire location and the three consecutive locations
upstream of the fire). However, with a different choice of input
values the outcome shows fewer losses. For instance, if the evacuation
is set to start at 60 s and the users’ speeds remain the same, only
two losses are estimated (only the trajectory at the fire location is
fatal). This outcome highlights how expert judgment or an incorrect
threshold can alter the estimated safety level.

On the other hand, the stochastic approach gives a very different
outcome. Fig. 2 presents a wider distribution of the scenarios.

Fig. 2 shows that, for the “60-120” scenario set, only 18.8% of the
scenarios include eight losses. The outcome provides an approximately
equal probability amongst the scenarios of four, six, eight, ten and
twelve losses, giving rise to significant uncertainty. This outcome
raises serious concerns about the safety of the tunnel. There is a
significant probability of 41.2% that a scenario has ten or more
losses (up to fourteen). Although there is an almost equal probability
of 40% that a scenario has fewer than eight losses, this cannot
counterbalance the probability of very adverse consequences. A case of
fourteen losses requires quite different additional measures than a
case of only four, as this discrepancy corresponds to a lethal
backlayering length that is 30 m longer.

Figure 3. Cumulative function of evacuation scenarios.

Furthermore, the “60-90” simulation shows that if the evacuation
starts no later than 90 s, there is an 83% probability of a more
favourable outcome than that of the deterministic approach, compared
with 40% in the “60-120” scenario. In this outcome, the trapped users
corresponding to the fifth, the sixth and the seventh stopped vehicles
are considered totally safe. However, as in the “60-120” case, all the
scenarios still include losses.

Only the last case (“30-90” scenario set) has a 37.5% probability of
including scenarios without losses. However, there is an equal
probability of having up to six losses, which indicates that
additional measures are still needed.

However, the results also indicate the importance of users’ attitude
in fire accidents. In this respect, when the upper limit of the
no-moving period is reduced, trapped users located before the first
40 m evacuate the tunnel successfully. Nevertheless, the safety
problem of the tunnel still remains. So, if the users are educated
enough to start the evacuation process immediately, the outcome shows
that approximately 40% of the scenarios have no losses without
changing any other parameter of the system.

In order to illustrate the safety level of the tunnel, the cumulative
diagram of losses is employed. Based on this, the safety level can be
assessed against a strict threshold (e.g. the probability of 2 losses
should not exceed 20%) or can be estimated based on the as low as
reasonably practicable (ALARP) principle.
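
A threshold check of this kind can be sketched in a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, the reading of the criterion as "the share of scenarios with more than a given number of losses must not exceed a given probability" is an assumption, and the list of losses is made-up toy data, not the simulation results of this study.

```python
from collections import Counter


def loss_distribution(losses):
    """Return {number of losses: share of scenarios}, sorted by number of losses."""
    counts = Counter(losses)
    return {k: counts[k] / len(losses) for k in sorted(counts)}


def exceedance_probability(losses, limit):
    """Share of scenarios with more than `limit` losses."""
    return sum(1 for x in losses if x > limit) / len(losses)


def meets_threshold(losses, limit, max_probability):
    """True if the exceedance probability stays within the allowed value."""
    return exceedance_probability(losses, limit) <= max_probability


# Toy scenario outcomes for illustration only (not results from the paper).
toy_losses = [0, 0, 2, 2, 2, 4, 4, 6, 0, 2]
distribution = loss_distribution(toy_losses)
p_exceed = exceedance_probability(toy_losses, 2)      # 3 of 10 scenarios -> 0.3
acceptable = meets_threshold(toy_losses, 2, 0.20)     # False for the toy data
```

With real simulation output, the same functions would reproduce the points of the cumulative diagram and flag whether a chosen strict threshold is satisfied.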

In addition, the effect of users’ education is also reflected in
Fig. 3. Through the simulation one can observe the wider spread of
losses in contrast to the deterministic approach. It is remarkable how
a small reduction of only 30 s in the range of the evacuation starting
time can benefit the overall safety level. Thus, considerable effort
must be devoted to raising the safety awareness of authorities and
tunnel users.


5 CONCLUSIONS 


The presented study illustrates that the current safety approach
towards road tunnels, which constitute a critical infrastructure of
the road network, by applying QRA models should focus further on the
users’ factor. The existing literature provides safety analysts with
the theoretical background to interpret human behaviour in fire
accidents. Furthermore, the cues from previous accidents, as well as
full-scale and virtual reality experiments, help scientists to
identify crucial parameters of the evacuation process.

However, there is a lack of data that would help evacuation models to
give outcomes with lower uncertainties. In the presented illustrative
case, the use of the deterministic approach together with the
country-specific regulatory requirements cannot estimate the tunnel’s
actual level of safety. The former gives four losses, whereas the
stochastic approach places 60% of the scenarios above this value.
Nevertheless, the outcome provides safety analysts with a prediction
of the distribution of non-successful self-evacuations in order to
estimate the overall tunnel safety and select additional measures, if
required. Hence, this approach is closer to describing reality than a
simple deterministic model, as far as users’ evacuation is concerned.
In a nutshell, it helps risk analysts make better-informed decisions.

Furthermore, the change in the parameter of the beginning of the
evacuation process shows the importance of educating tunnel users in
dealing with these accidents. The outcome showed that a reduction of
time by 30 s could result in 40% of the scenarios being free of
losses, whereas in the basic case all scenarios have from two to
fourteen losses.
ABSTRACT: This paper describes a newly developed damage-based fatigue life model for the long-term
reliability assessment of drawn steel wires and wire ropes. The methodology is based on the local stress
field in the critical trellis contact zone of a stranded wire rope, computed by FE simulation, and on the
degradation of the Young’s modulus of the drawn steel wires. The fatigue damage model is based on
Lemaitre’s damage equations for quasi-brittle material with a damageable microinclusion embedded in an
elastic mesoelement. The incremental fatigue damage calculation employing the load-cycle block is
described. The routine is integrated into commercially available Finite Element Analysis (FEA) software.
A case study using a single strand (1 x 7) steel wire rope with 5.43 mm-dia. drawn wires is employed to
illustrate the damage-based fatigue life prediction procedure. The simulated tensile fatigue cycles consist
of a load range of 145 kN and a load ratio, R = 0.1. The peak applied load corresponds to 50% of the
maximum breaking load of the steel wire rope. The FE-calculated results indicate that the von Mises
stress, the maximum principal stress and the contact pressure cycle in-phase, with a stress ratio identical
to that of the applied axial load. The trellis contact point is relatively small and experiences elastic
stresses, thus the fatigue damage prediction for the fretting fatigue condition is appropriate. The damage
initiation life at the trellis contact along the core wire is calculated at N_0 = 1050 cycles. An additional
and improved data set of the damage model parameters of the drawn steel wires is required to achieve an
accurate and validated life prediction model of the wire ropes.


1 INTRODUCTION 


Steel wire ropes have found numerous applications in marine and civil
construction, including mooring systems for FPSOs in the deep sea,
suspension elements for suspended and stayed bridge structures, and
lifting cranes. In these applications, the wire ropes are subjected to
complex loading, often involving tension-tension fatigue induced by
both the lifted weight and the self-weight of the component,
bending-over-sheave fatigue, free bending fatigue due to transverse
loads such as ocean currents, and torsion fatigue. The sea operating
environment further amplifies the reliability issue through stress
corrosion cracking of the wire rope elements. The rated or operating
load for the wire rope is often a fraction of the Maximum Breaking
Load (MBL) of the wire rope and induces stresses in the elastic range.
However, the combination of the relatively high stress amplitude and
the positive load ratio, particularly at the trellis contact region,
could initiate damage and nucleate a fatigue crack at that location.
The subsequent crack propagation leads to fracture of the critical
drawn wire, redistribution of stresses around the fracture and failure
of the neighboring drawn wires, thus leading to premature fatigue
failure of the wire rope. Considering the high cost of installation
and the cost and difficulty of replacing a wire rope (Raoof & Davies
2008), an efficient and accurate predictive tool is required for
reliability assessment of the wire ropes over their typical design
life of 20 years.

The common causes of fatigue failure of the wire rope are contact
pressure and slip (Waterhouse 2003). A relatively high contact
pressure develops at the trellis contact between the layers of the
strand wires in a small contact region. Consequently, this induces
partial slip with small displacement amplitude, causing fretting
fatigue failure. In a wire rope design having a large lay angle, a
line contact is established between the wires in the strand, causing a
large slip contact region. The relatively large amplitude of
displacement results in gross slip, and failure occurs by fretting
wear of the wire rope. In both cases, the development of a failure
model should account for both the mechanics of deformation and the
mechanism of failure of the wire rope. The work described in this
paper is limited to wire ropes failing under the fretting fatigue
mode.

Fatigue life prediction of the wire rope is a challenging task because
the wire rope is a system consisting of many drawn wires arranged in
different windings and simultaneously enduring the applied load.
Design against failure due to a tensile load that exceeds the breaking
strength of the wire rope has been established (Prawoto & Mazlan
2012). Fatigue testing of wire rope samples to generate the
fatigue-life curve seems a straightforward way of life prediction.
Most of the available fatigue-life data for wire ropes are generated
from tests at zero depth (in-air environment). In this respect,
extensive axial fatigue-life test data on sheathed steel spiral
strands have been established (Raoof & Davies 2008). That work also
quantifies the effects of external hydrostatic pressure, representing
the deep-water loading condition, on the fatigue performance of the
sheathed wire ropes. Fatigue design of wire ropes that considers both
the high-cycle fatigue of the stranded wires and the inherent fretting
effects has been examined (Winkler et al. 2015, Wang et al. 2013,
Cruzado et al. 2012, Sasaki et al. 2007). The fatigue-life data for
the wire ropes are based on data taken from available standards such
as DNV OS-E301 (2004) and the open literature (Raoof & Davies 2008,
Alani & Raoof 1997, Birkenmaier 1980, Suh & Chang 2000). However, the
reported fatigue strength of these wire ropes is relatively low at
(San / Sy) = 0.15-0.3 (Kao & Byrne 1982). Continuum-based high-cycle
fatigue life models such as the modified Goodman and Soderberg
approaches could only predict the terminal life of the most critical
wire, whereas fatigue failure of the wire rope is often defined by the
area fraction of the fractured wires. A reliability model based on the
local stress concentration due to the erosion of the wire material at
the contact locations has also been proposed. However, the evolution
of the local geometry of the area is difficult to establish.

This paper describes the framework for the 
development of a damage-based fatigue life model 
for the steel wire ropes. The methodology for dam- 
age and failure assessment of the wire rope materi- 
als under the high-cycle fretting fatigue condition 
is discussed. An incremental calculation routine
based on cyclic degradation of the elastic modulus
of the drawn wires is described. A preliminary
simulation for the case of a single strand (1 x 7) 
steel wire rope under tensile fatigue loading is 
demonstrated. 


2 FRAMEWORK FOR THE DAMAGE- 
BASED FATIGUE LIFE PREDICTION 


The reliability assessment of the steel wire ropes 
is aimed at the prediction of the fatigue life of 
the wire rope when subjected to the general load- 
ing conditions. Figure 1 illustrates the framework 
for the development of a validated damage-based 
fatigue life prediction model for the wire ropes. 
The fatigue life of the wire rope is governed by 
many factors including the design and material 
of the drawn wires, fabrication process, operating 
load conditions and environment. The wire rope 
design also indicates if fretting fatigue at trel- 
lis contact or fretting wear causing gross slip is 
dominating the failure scene. The fretting fatigue 
failure is addressed in this paper. Fatigue
tests of the wire rope establish the corresponding
fatigue life, N_f, represented by Task (A). Task
(B) consists of the tension tests of the drawn
steel wires to establish the mechanical properties 
of the material. Fatigue tests of the drawn wires 
are performed to establish the strength-life and 
fatigue limit of the wire material. Damage param- 
eters are also extracted from the resulting S-N 
curve. The damage-based fatigue life model of the 
drawn wire material is developed in Task (C). The 
properties and damage model parameter values 
obtained in Task (B) are employed. The damage 
model is then incorporated in the finite element 
analysis (FEA) through user subroutine UMAT 
for the Abaqus FEA software employed in this 
study, in Task (D). The FE-computed fatigue life 
is compared with the measured life from Task (A) 
to establish the validity of the fatigue life predic- 
tion model. 



Figure 1. 
2.1 Fatigue damage model


Damage in a material can originate from the 
debonding of atoms or the nucleation, growth and 
coalescence of microcavities or microcracks. It leads 
to the gradual degradation of stiffness and strength 
of the material. Although the damage process is discontinuous
in nature, it can be modelled by a continuous
variable at the microscale (Lemaitre 1985,
Kachanov 1958). In a representative volume element
(RVE) of the material, the damage parameter
is defined by scaling the damaged area, A_D, by the
total cross-sectional area, A, of the RVE such that:

\[ D = \frac{A_D}{A} \tag{1} \]


The damage parameter is continuous and represents
the failure of microdefects over the mesoscale
volume element. The value of the scalar damage
variable, D, is bounded by 0 ≤ D ≤ 1, where D = 0
represents the undamaged state while rupture (or
separation of the material point) occurs at the
value of D = 1. Following damage initiation, the
effective stress tensor, σ̃_ij, can be represented by:

\[ \tilde{\sigma}_{ij} = \frac{\sigma_{ij}}{1 - D} \tag{2} \]

where σ_ij is the Cauchy stress tensor.

The evolution of this damage variable that 
depends on the expected value of the micro-defect 
density can then represent the history of the inelas- 
tic strain in the material. In the high-cycle fatigue 
where the amplitude of the loading is low, the 
amplitude of the plastic strain is relatively small or 
negligible at the mesoscale when compared to the 
elastic strain amplitude. For the drawn steel wires 
that are assumed to behave in a quasi-brittle man- 
ner, the behavior is brittle at the mesoscale but local- 
ized damage growth occurs at the microscale. Thus, 
the material is modeled as a damageable microin- 
clusion embedded in an elastic mesoelement, as 
illustrated in Figure 2. While the mesoscale matrix
is elastic, the microscale inclusion experiences
elasticity, plasticity and damage. In addition, it is assumed
that the inclusion is subjected to the same strain
state (or strain rate) as the mesoscale matrix.

Figure 2. Representative Volume Element (RVE) for
the two-scale model (adapted from Lemaitre (1985)).
Mesoscale matrix: elastic. Microscale inclusion:
elastic-plastic and damage.

The prediction of the rupture or the separation of 
the material point requires the coupling between the 
plastic flow and damage at the material constitutive 
level. A detailed derivation of the damage equation is
found in the work of Lemaitre (1996) and was employed
by the authors (Salleh et al. 2017). Damage initiation
occurs when the equivalent strain reaches the
damage strain threshold, p_D, defined as:


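[Equation (3): damage strain threshold p_D expressed through the
yield stress, tensile strength and fatigue limit; the expression is
illegible in the source.]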


where σ_y, σ_u and σ_f are the yield stress, tensile
strength and fatigue limit of the material, respectively.
The number of cycles, N_0, needed for the
equivalent strain to reach the damage threshold,
p_D, is given by:

\[ N_0 = \frac{p_D}{2\Delta\varepsilon} = \frac{E\,p_D}{2\Delta\sigma} = \frac{E\,p_D}{4(\sigma_{\max} - \bar{\sigma})} \tag{4} \]

where E is the elastic modulus of the material, σ_max
denotes the maximum stress and σ̄ is the cyclic
mean stress in the cycle.
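
As a small numerical illustration, eqn. (4) can be evaluated directly once it is read as N_0 = E p_D / [4(σ_max − σ̄)]. The Python sketch below does this for a constant-amplitude cycle. The function names are hypothetical, and the values of p_D and σ_max in the usage lines are arbitrary placeholders chosen only to exercise the formula; only E = 202 GPa (Table 1) and R = 0.1 are taken from this paper.

```python
def mean_stress(sigma_max, R):
    """Mean stress of a constant-amplitude cycle with ratio R = sigma_min / sigma_max."""
    return 0.5 * sigma_max * (1.0 + R)


def cycles_to_damage_initiation(E, p_D, sigma_max, sigma_mean):
    """Eqn. (4): N_0 = E * p_D / (4 * (sigma_max - sigma_mean)).

    E, sigma_max and sigma_mean must share the same stress unit.
    """
    amplitude = sigma_max - sigma_mean
    if amplitude <= 0.0:
        raise ValueError("sigma_max must exceed the mean stress")
    return E * p_D / (4.0 * amplitude)


# Placeholder inputs (assumptions, not values reported in the paper),
# apart from E = 202 GPa from Table 1 and R = 0.1.
E_MPa = 202_000.0
sigma_max_MPa = 1000.0
p_D = 10.0

sigma_bar = mean_stress(sigma_max_MPa, R=0.1)
N_0 = cycles_to_damage_initiation(E_MPa, p_D, sigma_max_MPa, sigma_bar)
```

For a fixed threshold p_D, the expression shows that N_0 scales inversely with the stress amplitude σ_max − σ̄, so the most highly stressed material points initiate damage first.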

Based on the kinetic damage law for the inclusion,
and considering the relatively large plastic
strain compared to the elastic component of
a micro-volume, μ, the evolution of damage is
expressed as:

\[ \dot{D} = \frac{Y^{\mu}}{S}\,\dot{p}^{\mu} \tag{5} \]

The superscript μ refers to quantities for the
microelement and S is a material damage parameter.
It is desirable to express the damage strain
energy release rate, Y^μ, and the accumulated plastic
strain rate, ṗ^μ, as functions of the macroscopic
quantities such as strain and stress. The
plastic strain rate, ṗ^μ, of the elastic-perfectly plastic
microvolume is equal to the equivalent strain
rate of the elastic mesoscale matrix.

We consider pure elasticity at the mesoscale and
further approximate the damage rate as Ḋ = 0 if the
equivalent stress σ_eq < σ_f, the fatigue limit of
the material. In addition, an increase in the mean
stress causes a decrease in the corresponding stress
amplitude needed to induce fatigue failure at a
certain number of cycles. Under these conditions,
eqn. (5) can be conveniently written in incremental
form as (Salleh et al. 2017):
[Equation (6): increment of fatigue damage ΔD over ΔN load cycles; the
expression is illegible in the source.]


The term ΔD represents the increment of the
fatigue damage accumulated over the elapsed load
cycles, ΔN, while E and ν are the elastic modulus
and Poisson’s ratio of the material, respectively. It
should be acknowledged that the Young’s modulus
of the material is likely to degrade with the accumulated
load cycles. In addition, the rate of the
stiffness degradation is dependent on the magnitude
of the acting stress cycles and the stress ratio,
as described in the next section. The separation or
rupture of the critical material point would occur
when the accumulated fatigue damage variable
reaches a critical level (denoted by D_c in Table 1).


2.2 Incremental fatigue damage 
calculations routine 


Fatigue damage is estimated based on the gradual degradation of
strength and modulus of the drawn steel wire over the applied load
cycles. Each simulated load cycle represents a block of a specified
number of cycles on the wire rope. During this load increment, the
decrease in the modulus of the drawn steel wire is prescribed as
observed experimentally. The calculation of the fatigue damage that
occurs at a material point during the given increment of load cycles,
ΔN, is described by the process flow illustrated in Figure 3. This
constitutes a user-defined SUBROUTINE UMAT for the commercially
available Abaqus FEA software employed in this work. In STEP1, the
current stress tensor, σ_ij, fatigue damage level, D, accumulated
fatigue cycles, N, residual stiffness, E(N), properties of the drawn
steel wire (S, ν, σ_y, σ_u, σ_f) and the load-cycle block size, ΔN,
are transferred from the main FE program.

STEP2 calculates the simulated load cycle parameters, namely the
maximum stress, σ_max, and the mean stress, σ̄. The number of load
cycles, N_0, required to initiate fatigue damage at this material
point is estimated. The calculations proceed to the subsequent steps
if the material point experiences damage initiation.

Table 1. Properties and damage model parameters for the drawn steel wires.

Parameter           Symbol         Value
Elastic modulus     E (GPa)        202
Yield strength      σ_y (MPa)      1690
Tensile strength    σ_u (MPa)      2164
Poisson’s ratio     ν              0.28
Fatigue limit       σ_f (MPa)      305
Damage parameter    S (MPa)        6
Critical damage     D_c            0.8

SUBROUTINE UMAT

STEP1   Transferred from main program:
        σ_ij, D, N, E(N), ΔN
        S, ν, σ_y, σ_u, σ_f

STEP2   Increment N = N + ΔN
        Determine σ_max, σ̄
        Calculate N_0 (eqn. (4))
        IF N ≥ N_0, calculate the damage
        ELSE RETURN

STEP3   Determine (N_f, σ_max) from S_max-N curve
        (as corrected using Miner’s rule)

STEP4   Determine residual E(N)
        corresponding to N cycles (eqn. (7))

STEP5   Compute ΔD (eqn. (6))

STEP6   Update for N:
        D = D + ΔD
        σ̃_ij = σ_ij / (1 - D)

RETURN

Figure 3. Flow chart of the incremental damage calculations over a
load-cycle block.

In STEP3, the fatigue strength-life curve (see Figure 5) is corrected
since the material point has been overstressed for a finite number of
cycles (Miner 1945). The number of cycles to failure, N_f,
corresponding to σ_max is established. It is used in eqn. (7) to
establish the residual elastic modulus, E(N), of the drawn steel wire
at the end of the load-cycle block in STEP4. Next, in STEP5, the
increment of the damage parameter, ΔD, over the increment of the load
cycles is established using eqn. (6).

Finally, in STEP6 the damage variable and the stress tensor are
updated to correspond to the end of the load-cycle block. These
updated variables are returned to the main program for the next
load-cycle block.
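
The control flow of one load-cycle block can be sketched in Python as follows. This is not the authors’ UMAT (which is an Abaqus user subroutine) but a language-neutral illustration of STEP1 to STEP6 for a single material point. The function and key names are hypothetical. The three callables stand in for the corrected S-N curve, eqn. (7) and eqn. (6), whose exact forms are not reproduced here. Using the initial modulus in eqn. (4) and capping D at the critical damage D_c are further assumptions.

```python
def update_load_cycle_block(state, props, delta_N, sigma_max, sigma_mean,
                            cycles_to_failure, residual_modulus, damage_increment):
    """Advance one material point by a block of delta_N load cycles.

    state: dict with 'stress' (list of Cauchy stress components), 'D', 'N', 'E'
    props: dict with 'E0' (initial modulus), 'p_D' (damage threshold), 'D_c'
    cycles_to_failure, residual_modulus, damage_increment: placeholder
        callables for STEP3, STEP4 (eqn. 7) and STEP5 (eqn. 6).
    """
    new = dict(state)                                   # STEP1: data from main program
    new["N"] = state["N"] + delta_N                     # STEP2: increment cycle count
    # Eqn. (4); assumption: the initial modulus E0 is used here.
    N_0 = props["E0"] * props["p_D"] / (4.0 * (sigma_max - sigma_mean))
    if new["N"] >= N_0:                                 # damage has initiated
        N_f = cycles_to_failure(sigma_max)              # STEP3
        new["E"] = residual_modulus(new["N"], N_f)      # STEP4
        dD = damage_increment(new["E"], sigma_max, sigma_mean, delta_N)  # STEP5
        # STEP6; assumption: damage is capped at the critical value D_c.
        new["D"] = min(state["D"] + dD, props["D_c"])
    # Effective stress of eqn. (2), returned with the updated state.
    new["effective_stress"] = [s / (1.0 - new["D"]) for s in state["stress"]]
    return new


# Toy stand-ins for the material laws, for illustration only.
toy_laws = dict(
    cycles_to_failure=lambda s_max: 1.0e5,
    residual_modulus=lambda N, N_f: 202000.0 * (1.0 - 0.1 * N / N_f),
    damage_increment=lambda E, s_max, s_mean, dN: 1.0e-5 * dN,
)
props = {"E0": 202000.0, "p_D": 10.0, "D_c": 0.8}
state0 = {"stress": [100.0, 0.0], "D": 0.0, "N": 0, "E": 202000.0}

state1 = update_load_cycle_block(state0, props, 1000, 1000.0, 550.0, **toy_laws)
state2 = update_load_cycle_block(state1, props, 1000, 1000.0, 550.0, **toy_laws)
```

In the toy run the first block ends before damage initiation, so only the cycle counter changes. The second block passes N_0 and updates the modulus, the damage variable and the effective stress.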


3 ILLUSTRATIVE CASE STUDY 


A case study employing a single strand (1 x 7) steel wire rope
construction subjected to axial fatigue loading is considered to
illustrate the characteristic fatigue damage evolution and the life
prediction of the wire rope. Although the 1 x 7 wire rope is commonly
used as a stay cable, the validated fatigue life model could be used
for other wire rope designs and loading conditions. The chemical
composition of the drawn wires is (in wt.%) 0.83C, 0.917, 0.717Mn,
0.0124P, 0.00318, 0.015Cu, the remainder being Fe. The observed
preferred orientation of the grains along the drawing direction
provides the superior strength of the wires in the axial direction.
Since the damage model is developed for fretting fatigue failure, this
failure mechanism should be reproduced by the steel wire rope under
study.


3.1 Finite element modeling 


The geometry of the wire rope model is illustrated in Figure 4, along
with the design parameters. The core wire is straight. The total
length of 330 mm of the wire rope model accounts for a full pitch
length of 230 mm in the central region, while the remaining lengths at
both end regions are used to apply the load and boundary conditions.
One end of the wire rope is fixed in both translational
(U_x = U_y = U_z = 0) and rotational (UR_x = UR_y = UR_z = 0)
displacements, while the other, free end is subjected to an applied
axial load, but without rotation (UR = 0). An axial fatigue load
consisting of the load range, ΔP = 145 kN, and load ratio of
minimum-to-maximum load, R = 0.1, is applied. The peak load of the
cycle is at 50% of the maximum breaking load (i.e. 50% MBL) of the
wire rope. The contact condition of the wires is prescribed to follow
Coulomb’s law with an assumed coefficient of friction, cof = 0.5,
considering the surface roughness of the drawn wires. Based on the
outcome of the mesh convergence study, the single strand (1 x 7) wire
rope geometry is discretized into 203208 8-node continuum elements.
Abaqus FEA software with user-defined SUBROUTINE UMAT is employed for
the analysis.

Figure 4. Geometry and the geometrical properties of
the single strand (1 x 7) wire rope.
  Core diameter: 5.43 mm
  Layer diameter: 5.23 mm
  Pitch length: 230 mm
  Wire rope diameter: 15.80 mm
  MBL [100%]: 290 kN


3.2 Mechanical testing of the drawn steel wires 


The tension test is performed on the drawn steel wire specimens in the
as-fabricated condition. The gage length is 100 mm. The resulting
mechanical properties are listed in Table 1. The true fracture
strength of the drawn steel wires, σ_u = 2.164 GPa, and the
corresponding true plastic strain at fracture of 6.78% are
established.

Fatigue life tests were performed on the 5.43 mm-diameter drawn steel
wires in the as-fabricated condition at the load ratio of
minimum-to-maximum load, R = 0.1, and a loading frequency of 30 Hz.
The resulting fatigue strength-life (S-N) curve is shown in Figure 5
(circle symbols). The solid line shows, for comparison, the published
fatigue life of 0.9 mm eutectoid steel wires. The effect of size (wire
diameter) on fatigue life is consistent in that the larger diameter
wires display a relatively shorter fatigue life. This is attributed to
the greater number of manufacturing defects or microcracks inherent in
a larger volume of the material within the gage section. The fatigue
strength of the drawn steel wires is determined at σ_f = 305 MPa,
corresponding to N = 2.5 x 10° cycles (R = 0.1). The damage model
parameter values are approximated based on the limited test data on
alloy wires, as shown in Table 1 (Lemaitre 1996).
Figure 5. Fatigue strength-life (S-N) curve of the drawn
steel wires, d = 5.43 mm, R = 0.1.

Figure 6. Normalized residual Young’s modulus of the
drawn steel wires.


The characteristic degradation of the Young’s modulus of the drawn
steel wires is established through interrupted fatigue tests of the
wire specimens. In each test, the wire specimen is subjected to a
specified fatigue loading (ΔP, R). The test is interrupted at every
block of fatigue cycles, ΔN, and the sample load-displacement response
up to the maximum load of the cycle is recorded. The interrupted
fatigue test is continued until the wire fractures. The loading and
unloading residual Young’s modulus as a function of the accumulated
load cycles is then extracted from the slope of the resulting
stress-strain curves. Three different combinations of the fatigue
loading are employed. The residual Young’s modulus data could be
presented in terms of the normalized quantities as (Adam et al. 1989,
Gathercole et al. 1994):


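[Equation (7): residual modulus E(N) as a function of the normalized
cycle ratio (log N - log 0.5) / (log N_f - log 0.5), with two
curve-fitting constants; the full expression is illegible in the
source.]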


where α and β are curve-fitting constants. The
effect of different load ratios could be incorporated
through the different fatigue life, N_f, of the respective
specimen. The trend of the normalized residual
Young’s modulus curve is illustrated in Figure 6.
The interrupted fatigue tests conducted by the
authors for the drawn steel wires are still ongoing.


4 RESULTS AND DISCUSSION 


Figure 7. The maximum principal stress distribution
across the critical section containing the highest contact
pressure (left) and similar normalized distribution in the
core wire (right).

Figure 8. Distribution of the fatigue variable, N_0, in part
of the core wire showing the critical section (left) with the
trellis contact region (marked A).

A typical distribution of the maximum principal stress in the wire
rope material corresponding to the peak applied load of the cycle is
shown in Figure 7. In the single strand wire rope under
tension-tension loading cycles, the core wire experiences the greatest
axial displacement under the prescribed iso-strain end condition. This
is due to the core wire element having the shortest length within the
pitch distance of the wire rope. The surrounding wires stretch around
it, inducing contact pressure at selected locations along the core
wire during the tensile stressing. However, the computed highest
magnitude of 1.27 GPa is artificially induced at the localized end
region of the wire rope model due to the applied boundary conditions,
and thus should not be considered in the fatigue analysis. The highest
principal stress occurring in the relatively small trellis contact
region of the core wire is about 83% of the yield strength of the
drawn wires (1.690 GPa), as shown in the figure. Thus, the fretting
fatigue model is appropriate for the analysis. Each material point in
each drawn wire of the wire rope experiences a different stress
amplitude but the same stress ratio, R = 0.1, as the applied load
cycles. In addition, it is noted that the equivalent stresses (von
Mises and maximum principal stress) and the contact pressure evolve
in-phase with the applied sinusoidal axial load. The peak von Mises
stress in the trellis region is 1.240 GPa with a stress range of
1.116 GPa.
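
As a quick consistency check, added here and using only the peak von Mises stress of 1.240 GPa and the stress range of 1.116 GPa quoted for the trellis region, the minimum stress and the local stress ratio follow as:

\[
\sigma_{\min} = \sigma_{\max} - \Delta\sigma = 1.240 - 1.116 = 0.124\ \text{GPa},
\qquad
R = \frac{\sigma_{\min}}{\sigma_{\max}} = \frac{0.124}{1.240} = 0.1
\]

The local stress ratio therefore equals the load ratio of the applied axial load, in line with the in-phase evolution noted in the text.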

The calculated number of cycles required to initiate fatigue damage,
N_0, for the applied load cycle is shown in Figure 8. The left
cross-section represents the critical section of the core wire
containing the highly stressed trellis contact region. Fatigue damage
is predicted to initiate first in this localized critical region
(marked A) of the core wire at N_0 = 1050 cycles.

The grey region represents material points with N_0 > 1200 cycles.
Following damage initiation, the damage evolves with the continuous
redistribution of the stresses in the wire rope to satisfy force
equilibrium and strain-displacement compatibility. The path of damage,
and thus of fatigue crack propagation, could reasonably be inferred,
at this early stage of the analysis, from the high-to-low gradient of
N_0 as illustrated in the figure.

Once the critical drawn wire has fractured, the load
carried by the wire is transferred to the neighboring
intact wires in the strand. In a wire rope design
consisting of a large number of drawn wires, the
fatigue life of the wire rope could be established
based on the prescribed area fraction of intact-to-
total load bearing area of the wires.
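
An area-fraction criterion of this kind can be illustrated with a short Python sketch. The function name and the 0.90 failure threshold are hypothetical, and the wire cross-sections are taken normal to each wire axis, ignoring the lay angle. The wire diameters are those listed with Figure 4.

```python
import math


def intact_area_fraction(wire_diameters_mm, fractured):
    """Ratio of intact to total wire cross-sectional area.

    wire_diameters_mm: diameter of each wire in the rope
    fractured: set of indices of wires treated as broken
    """
    areas = [math.pi * d ** 2 / 4.0 for d in wire_diameters_mm]
    intact = sum(a for i, a in enumerate(areas) if i not in fractured)
    return intact / sum(areas)


# 1 x 7 strand: one 5.43 mm core wire (index 0) and six 5.23 mm layer wires.
strand = [5.43] + [5.23] * 6

fraction = intact_area_fraction(strand, fractured={0})   # core wire broken
rope_failed = fraction < 0.90   # 0.90 is a hypothetical threshold
```

For the seven-wire strand, losing the core wire alone leaves roughly 85% of the metallic area intact, which shows why such a criterion is more informative for ropes with many wires than for a single strand.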


5 CONCLUSIONS 


The framework for the development of a validated damage-based fatigue life model of steel wire ropes has been presented and discussed. The fatigue damage model is based on the gradual degradation of the strength and Young's modulus of the drawn steel wires. FE simulation of the single-strand (1 x 7) wire rope under axial fatigue loading (ΔP = 145 kN, R = 0.1) indicates that:


— the von Mises, maximum principal stress and 
contact pressure at the trellis contact point evolve with an identical stress ratio as the applied axial load.

— the trellis contact point is relatively small and 
experiences elastic stresses, thus the fatigue dam- 
age prediction for the fretting fatigue condition 
is appropriate. 

— the critical material point at the trellis contact in 
the core wire experiences the first damage initia- 
tion event at N, = 1050 cycles. 
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An efficient computational strategy for robust maintenance 
scheduling: Application to corroded pipelines 


E. Patelli & M. de Angelis

ABSTRACT: The ability to predict correctly the remaining lifetime of components is of paramount importance for improving the safety and reliability of systems and networks via an effective maintenance policy. However, simplifications and assumptions are usually adopted to compensate for lack of data, imprecision and vagueness; these cannot be justified completely and may thus lead to biased results. To overcome these issues, an imprecise probabilities approach is proposed for reliability analysis and risk-based maintenance strategy. A novel, efficient computational approach is proposed for identifying robust maintenance strategies. The optimal solution is obtained through only one reliability assessment based on Advanced Line Sampling, reusing the outcome of maintenance activities in a forced Monte Carlo approach. The proposed methodology removes the huge computational cost of reliability-based optimization, making the analysis of industrial-size problems feasible. The applicability of the approach is demonstrated by identifying the optimal maintenance policy of buried pipelines, and it is shown how this approach can improve current industrial practice.


1 INTRODUCTION 


One of the most important degradation mechanisms affecting the long-term reliability and integrity of metallic pipelines is corrosion. Corrosion that leads to metal loss is the most prevalent time-dependent threat to the integrity and safe operation of oil and gas pipelines, and the most prevalent cause of their failure (Caleyo et al. 2002). The corrosion process is affected by large uncertainty, making the assessment of pipelines a complex and challenging task (Bazan and Beck 2013, Qian et al. 2011). For instance, uncertainties arise from variation in operational data, from the randomness of the environment, from imperfect measurements of pipeline geometry, from the material strength and from the ageing processes of the pipeline.

The remaining strength of a pipeline with corrosion defects can be assessed using one or more of the international design codes, viz. B31G (ASME 1991), B31Gmod (ASME 2012), Battelle (Leis and Stephens 1997a, Leis and Stephens 1997b), DNV-101 (AS 2015) and Shell-92 (Klever et al. 1995). The associated methods use deterministic values for load and resistance variables, thereby assuming no uncertainty. In light of the inherent uncertainties in the corrosion process, the results obtained are coarse approximations that may deviate significantly from reality. A key challenge in this regard is the probabilistic modelling, which relies on the substantial information and data required to define parameter distributions. Sahraoui et al. (2013) proposed Bayesian modelling to take into account imperfect inspection results, while Li et al. (2017) suggested a Bayesian network and Markov process approach to develop an optimal maintenance strategy for corroded subsea pipelines.

However, the amount of data required to define those distributions unequivocally might not be available in practice, so assumptions and simplifications are applied that often cannot be justified completely. To resolve this conflict, the use of imprecise probabilities (Beer and Ferson 2013, Beer and Patelli 2015) is proposed to reflect realistically the vagueness of the available information in the probabilistic model. Since these assumptions and simplifications can be quite decisive, an imprecise probabilities approach provides a promising pathway towards a robust maintenance strategy. This paper therefore proposes the use of a novel reliability metric redefined within the framework of imprecise probabilities.

Another challenging task is the identification of the optimal inspection interval in order to reduce the overall costs of pipelines, including the costs of inspection, repair and failure. For instance, areas needing repairs must be accurately pinpointed so as to minimise excavations for verification. Likewise, early observations of failure mechanisms and estimates of the likelihood of failure of the pipeline must be readily available.

The identification of the optimal maintenance schedule requires, in turn, the evolution of the model reliability, which can be computationally expensive to evaluate (Gomes et al. 2013). Approximate methods, e.g. FORM, may not be sufficiently accurate or applicable for large-scale problems, so simulation-based methods are required.

In this paper, a novel and efficient computational technique is proposed for the identification of a robust maintenance schedule that takes into account uncertainty and imprecision. More specifically, the proposed approach determines the optimal inspection interval and the repair strategy that maintain an adequate reliability level throughout the service life of the pipeline, using only one reliability assessment. The reliability analysis is performed using Advanced Line Sampling (de Angelis et al. 2015). This allows reliability bounds to be estimated with only one simulation and, in addition, its efficiency is independent of the reliability level. Hence, the proposed approach is applicable to the analysis of industrial-size problems. The proposed reliability strategies are implemented in the general-purpose software OpenCossan (Patelli et al. 2018, Patelli 2016, Patelli et al. 2012) and are freely available.


2 MODELLING CORRODED PIPELINE 


Metal losses due to corrosion affect the ultimate resistance, safety and serviceability of the structure and cause changes in its elastic and dynamic properties. These are major concerns in the structural reliability assessment of existing structures and infrastructures, and in the prediction of the safe and serviceable life of both new and existing structures.


2.1 Failure criteria 


The failure modes considered here are the loss of structural strength of pipelines through reduction of the remaining pressure strength and of the pipe wall thickness caused by corrosion defects. The failure pressure is assessed using four international design codes: the Shell-92, B31G, DNV-101 and Modified B31G models. A summary of the failure pressure models is given in Tables 1 and 2. In Table 1, W is the pipe wall thickness, L is the longitudinal length of the defect, D is the outside diameter of the pipe and M is the Folias' factor. In Table 2, $F_p$ is the failure pressure and d represents the defect depth; $\sigma_y$ and $\sigma_u$ are the material yield stress and the ultimate tensile strength, respectively.

The assumptions and limitations of these models are reflected in the individual flow stresses, which


Table 1. Flow stress and Folias' factor according to different international design codes.

Model           Flow stress         Folias' factor (M)
B31G            1.1 SMYS            $\sqrt{1 + 0.893\,L^2/(DW)}$
Modified B31G   SMYS + 68.95 MPa    $\sqrt{1 + 0.6275\,L^2/(DW) - 0.003375\,L^4/(DW)^2}$ for $L \le \sqrt{50DW}$; $0.032\,L^2/(DW) + 3.3$ for $L > \sqrt{50DW}$
Shell-92        SMTS                $\sqrt{1 + 0.893\,L^2/(DW)}$
DNV-101         SMTS                $\sqrt{1 + 0.31\,L^2/(DW)}$

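Table 2. Failure pressure of pipelines according to different international design codes.

Model           Defect (area and shape)   Failure pressure expression ($F_p$)
B31G            2/3 dL, parabolic         $F_p = \frac{2(1.1\sigma_y)W}{D}\,\frac{1 - \frac{2d}{3W}}{1 - \frac{2d}{3W}M^{-1}}$
Modified B31G   0.85 dL, arbitrary        $F_p = \frac{2(\sigma_y + 68.95)W}{D}\,\frac{1 - 0.85\frac{d}{W}}{1 - 0.85\frac{d}{W}M^{-1}}$
DNV-101         dL, rectangle             $F_p = \frac{2\sigma_u W}{D - W}\,\frac{1 - \frac{d}{W}}{1 - \frac{d}{W}M^{-1}}$
Shell-92        dL, rectangle             $F_p = \frac{1.8\sigma_u W}{D}\,\frac{1 - \frac{d}{W}}{1 - \frac{d}{W}M^{-1}}$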


are the measure of the strength of steel in the presence of a defect. The Folias' factor, M, is a geometry correction factor that accounts for the stress concentration due to radial deflection of the pipe surrounding a defect. Failure is assumed to be governed by the flow stress, which is defined in terms of the yield strength (in the B31G and Modified B31G codes) or the ultimate tensile strength (in DNV-101 and Shell-92). Further considerations and assumptions on different shapes and areas of corrosion defect can also be made, which might lead to different definitions of the failure criteria.

The failure criterion is defined as the difference between the failure pressure, $F_p$, of the pipeline and the maximum allowed operating pressure (MAOP):

$$g = F_p - \mathrm{MAOP} \qquad (1)$$

where g is the so-called limit state function; failure corresponds to g < 0.
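
As an illustration of Eq. (1), the limit state can be evaluated for one of the codes. The sketch below assumes the B31G form $F_p = \frac{2(1.1\sigma_y)W}{D}\,\frac{1 - 2d/(3W)}{1 - 2d/(3W)\,M^{-1}}$ with $M = \sqrt{1 + 0.893\,L^2/(DW)}$, and uses the nominal pipeline values of Tables 4 and 5. The function names are illustrative and are not taken from OpenCossan.

```python
import math

def folias_b31g(L, D, W):
    """Folias' factor M for B31G (assumed form, coefficient 0.893)."""
    return math.sqrt(1.0 + 0.893 * L**2 / (D * W))

def failure_pressure_b31g(d, L, D, W, sigma_y):
    """B31G failure pressure in MPa for a defect of depth d and length L (mm)."""
    M = folias_b31g(L, D, W)
    flow_stress = 1.1 * sigma_y      # flow stress = 1.1 SMYS
    ratio = 2.0 * d / (3.0 * W)      # parabolic defect: 2/3 * d/W
    return 2.0 * flow_stress * W / D * (1.0 - ratio) / (1.0 - ratio / M)

def limit_state(d, L, D, W, sigma_y, maop):
    """Eq. (1): g = F_p - MAOP; g < 0 indicates failure."""
    return failure_pressure_b31g(d, L, D, W, sigma_y) - maop

# Nominal (mean) values: d = 3 mm, L = 200 mm, D = 609.6 mm, W = 9.52 mm
g0 = limit_state(d=3.0, L=200.0, D=609.6, W=9.52, sigma_y=358.0, maop=4.96)
print(f"g at nominal values: {g0:.2f} MPa")
```

A positive value indicates that, at the nominal values, the corroded pipe still carries the operating pressure with a margin; the margin shrinks as the defect depth d grows.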

The easiest way to estimate the reliability of pipelines is based on safety factors (also known as level I analysis) calculated using the capacity equations or codes presented in Table 2. Such analyses do not explicitly model the uncertainties that might have arisen and increased over the years of pipeline service. The effects of the uncertainty are considered in terms of safety margins and factors. A worst-case scenario is used for the loads and capacity of the structural system; this might lead to greater safety and reliability, but also to huge costs associated with the overdesign or over-maintenance of pipelines.

Level II analysis, based on partial safety factors, includes the first and second moments of the parameter distributions. Partial safety factors account for the uncertainties in defect depth and failure pressure (burst) capacity. For instance, the DNV-101 code uses analytical expressions to derive the values of the standard deviation of the relative corrosion defect depth, $\sigma_{d/W}$, and of the failure pressure.

In modern engineering systems and critical infrastructures, an explicit quantification of the uncertainty must be performed to assure an adequate level of safety and reliability. A full probabilistic approach (level III analysis) requires the evaluation of the multidimensional integral shown in Eq. (2). The probability of failure, $P_f$, is defined as:

$$P_f = P(g < 0) = \int_{g(\boldsymbol{\theta}) < 0} f(\boldsymbol{\theta})\, d\boldsymbol{\theta} \qquad (2)$$

where $f(\boldsymbol{\theta})$ represents the multivariate distribution function of the uncertainty vector $\boldsymbol{\theta}$. In realistic applications a large number of uncertainties need to be considered. Hence, analytical and approximate methods such as FORM and SORM prove inadequate for solving Eq. (2) (Valdebenito et al. 2010). Monte Carlo simulation methods are then required to evaluate the integral of Eq. (2). However, when dealing with rare events, plain Monte Carlo simulation might become infeasible due to the large number of samples required to achieve a specific level of accuracy. To overcome this limitation, advanced Monte Carlo techniques such as Line Sampling (de Angelis, Patelli, & Beer 2015) and Subset Simulation (Au & Patelli 2016) can be adopted for analysing complex real-world problems.
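
For orientation, the integral of Eq. (2) can be approximated by plain Monte Carlo as the fraction of samples with g < 0. The sketch below is not the Advanced Line Sampling method used in this paper. It assumes the B31G failure pressure, the distributions of Table 5, linear growth of defect depth and length at the sampled corrosion rates, and that a through-wall defect (d >= W) also counts as failure; these modelling choices and all function names are assumptions made for illustration.

```python
import math
import random

def failure_pressure_b31g(d, L, D, W, sigma_y):
    """B31G failure pressure (assumed form, Folias coefficient 0.893)."""
    M = math.sqrt(1.0 + 0.893 * L**2 / (D * W))
    ratio = 2.0 * d / (3.0 * W)
    return 2.0 * 1.1 * sigma_y * W / D * (1.0 - ratio) / (1.0 - ratio / M)

def lognormal(rng, mean, cov):
    """Sample a log-normal variable from its mean and coefficient of variation."""
    s2 = math.log(1.0 + cov**2)
    return rng.lognormvariate(math.log(mean) - 0.5 * s2, math.sqrt(s2))

def estimate_pf(t_years, n_samples=5000, seed=1):
    """Plain Monte Carlo estimate of P_f at time t_years (no maintenance)."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_samples):
        D = rng.gauss(609.6, 0.02 * 609.6)
        W = rng.gauss(9.52, 0.02 * 9.52)
        sigma_y = rng.gauss(358.0, 0.07 * 358.0)
        d0 = rng.gauss(3.0, 0.10 * 3.0)
        l0 = rng.gauss(200.0, 0.1 * 200.0)
        p_op = lognormal(rng, 4.96, 0.10)
        vd = lognormal(rng, 0.5, 0.10)
        vl = lognormal(rng, 0.5, 0.10)
        d = d0 + vd * t_years          # assumed linear growth in depth
        L = l0 + vl * t_years          # assumed linear growth in length
        if d >= W:                     # assumed: through-wall defect is a failure
            failures += 1
        elif failure_pressure_b31g(d, L, D, W, sigma_y) - p_op < 0.0:
            failures += 1
    return failures / n_samples

for t in (0.0, 10.0, 20.0):
    print(f"t = {t:4.1f} yr: P_f ~ {estimate_pf(t):.4f}")
```

The number of samples needed grows roughly as the inverse of the failure probability, which is why the text resorts to advanced sampling schemes for highly reliable pipelines.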


2.2 Maintenance strategy 


In order to understand the status of pipelines, different inspection tools can be used, characterised by different quality and sensitivity. Depending on their quality and associated cost, the inspection activities may assess the damage incorrectly or may not detect any damage at all.

The most common tools for metal loss and crack inspection are based on Magnetic Flux Leakage or ultrasonic techniques (Version 2009). Pigging data are gathered through in-line inspection activities using a Magnetic Flux Leakage intelligent pig, whereby the values of the model parameters result from the operation and inspection histories of the pipeline. Geometry tools are available for detecting and sizing deformations, and mapping tools for localising a pipeline and/or pipeline features (Version 2009).

In this paper the probability of detection (PoD) associated with the non-destructive inspection techniques is modelled as (Pandey 1998):

$$\mathrm{PoD} = 1 - \exp(-q\,d) \qquad (3)$$

where d represents the defect depth and q the quality of inspection.

Following an inspection, if a defect is detected, it may or may not be repaired. Repairing a buried pipeline is an expensive process because it requires excavation and the replacement of part of the pipeline. For this reason, the repair is performed immediately after an inspection only if the pipe defects produce a failure pressure safety factor lower than a prescribed threshold; otherwise the pipe is left unrepaired. In the latter case the remaining useful life is estimated and preventive maintenance can be scheduled. The threshold level of the safety factor is between 1.25 and 1.5 (Pandey 1998). These values are in agreement with the level of integrity established by actual pipeline hydro testing, and correspond to the repair factor for a class 2 pipeline in the Canadian code (Association 2007).


3 MODELLING THE UNCERTAINTIES 


A full probabilistic analysis requires the proper characterisation of the uncertainties. In other words, each variable is associated with an appropriate probability distribution function. For instance, it is common practice to describe the variability of measurements as a Gaussian distribution characterised by its mean and standard deviation. A proper estimation of the characteristics of the distribution (or even of the shape of the distribution itself) requires the availability of data. When the amount of data is not enough for an unambiguous characterisation of the uncertainties, expert judgement and often unjustified assumptions are adopted. This is the case in many practical situations where very limited data are available. To avoid the inclusion of subjective hypotheses, the imprecision and vagueness of the data can be treated by combining probabilistic and set-theoretical components in a unified construct, allowing the identification of bounds on the probabilities of the events of interest in order to give a different perspective on the results (Beer and Patelli 2015).

For the treatment of imprecise knowledge, non-consistent information and both epistemic and aleatory uncertainty, multiple mathematical concepts can be used, including intervals (Augustin 2004), probability boxes (Ferson et al. 2003), normalized fuzzy sets (also known as possibility distributions) (Verma et al. 2007), Dempster-Shafer structures (Dempster 1967, Shafer 1976), Bayesian frameworks (Faber 2005) and Random Set theory. In particular, Random Set theory is a general framework suited to modelling uncertainty represented as cumulative distribution functions (CDFs) without making any implicit or explicit assumptions. Explanatory examples of such flexible frameworks are provided in (Patelli et al. 2015, Rocchetta et al. 2018).

In this paper, the concept of probability boxes (P-boxes) is used (Ferson, Kreinovich, Ginzburg, Myers, & Sentz 2003). P-boxes can be seen as a generalization of the Dempster-Shafer structures where the sets are represented by distributions. Hence, P-boxes are sets of Cumulative Distribution Functions (CDFs) for which lower and upper bounds are assigned, $[\underline{F}, \overline{F}]$. The probability distribution associated with the random variable x can be either specified or not. The former are generally named distributional P-boxes, or parametric P-boxes; the latter are named distribution-free P-boxes, or non-parametric P-boxes. In the literature the upper bound on probability is referred to as plausibility and the lower bound as belief.

Distributional p-boxes arise when there is indeterminacy in the parameters of a given CDF. These parameters are imprecisely specified as intervals. For instance, consider a quantity that is known to be Gaussian with mean within the interval [1, 2] and standard deviation somewhere in [3, 4]. All CDFs that are normal and have means and standard deviations inside these respective intervals belong to this probability box. The upper and lower CDF bounds $\overline{F}$ and $\underline{F}$ of the p-box enclose many non-normal distributions, but these are excluded from the p-box by specifying the normal CDF as the parent distribution family.
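
The Gaussian example above can be made concrete with a short sketch that evaluates the envelope of the p-box at a given x. For a fixed x the normal CDF is monotonic in the mean and, for a fixed mean, monotonic in the standard deviation, so the extreme values over the parameter box are attained at its four corners. The function names are illustrative only.

```python
import math
from itertools import product

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def pbox_bounds(x, mu_interval=(1.0, 2.0), sigma_interval=(3.0, 4.0)):
    """Return (lower, upper) CDF bounds at x for the Gaussian p-box."""
    # Evaluate the CDF at the four corners of the (mean, std) box.
    values = [normal_cdf(x, mu, s) for mu, s in product(mu_interval, sigma_interval)]
    return min(values), max(values)

for x in (-4.0, 0.0, 1.5, 6.0):
    lower, upper = pbox_bounds(x)
    print(f"x = {x:5.1f}: F_lower = {lower:.4f}, F_upper = {upper:.4f}")
```

Any normal CDF whose parameters lie inside the two intervals falls between the two bounds at every x.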


The calculation of the bounds of a quantity of interest, such as the probability of failure, requires significant computational resources. This is because the integral of Eq. (2) must be estimated for all the probabilistic models considered before the bounds of the response can be identified. Fortunately, in many engineering applications the response of the model is monotonic with respect to the imprecision of the input parameters. In general, this allows the bounds of the probability of failure to be estimated with only two full probabilistic analyses (Rocchetta et al. 2018). The Advanced Line Sampling method (de Angelis et al. 2015) can further reduce the computational cost, allowing the bounds of the probability of failure to be estimated with only one efficient probabilistic analysis (de Angelis et al. 2014).
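
The role of monotonicity can be seen in a deliberately simple case with a closed-form solution. The sketch below assumes a linear limit state g = R - S with independent normal resistance R and load S, where only the mean of R is known as an interval; all numerical values are hypothetical. Because the failure probability decreases monotonically with the mean resistance, two evaluations at the interval endpoints give the bounds.

```python
import math

def pf_normal(mu_r, sigma_r, mu_s, sigma_s):
    """P(R - S < 0) for independent normal resistance R and load S."""
    beta = (mu_r - mu_s) / math.sqrt(sigma_r**2 + sigma_s**2)
    return 0.5 * math.erfc(beta / math.sqrt(2.0))   # equals Phi(-beta)

def pf_bounds(mu_r_interval, sigma_r, mu_s, sigma_s):
    """Bounds of P_f when the mean resistance is only known as an interval."""
    mu_low, mu_high = mu_r_interval
    # P_f is monotonically decreasing in mu_r, so the endpoints give the bounds.
    return (pf_normal(mu_high, sigma_r, mu_s, sigma_s),
            pf_normal(mu_low, sigma_r, mu_s, sigma_s))

# Hypothetical values in MPa: mean failure pressure in [9, 11], load mean 4.96
pf_low, pf_high = pf_bounds((9.0, 11.0), sigma_r=1.0, mu_s=4.96, sigma_s=0.5)
print(f"P_f in [{pf_low:.3e}, {pf_high:.3e}]")
```

In the general case each endpoint evaluation is itself a full reliability analysis, which is the cost that Advanced Line Sampling is reported to reduce further.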


4 ROBUST MAINTENANCE STRATEGY 


Inspection and monitoring of pipelines are necessary to ensure their continued fitness for purpose, which entails protection from time-dependent degradation processes such as corrosion. Pipeline failures also have a significant impact on the economic, environmental and social aspects of society. Therefore, the proper assessment and maintenance of such structures are crucial; negligence will lead to loss of serviceability and to failure, and might lead to catastrophic environmental and financial consequences. On the other hand, maintenance is an expensive activity, and the availability of a robust maintenance schedule is of paramount importance. The basis for these decisions is supplied by a reliability estimate that incorporates the impact of inspection scheduling and repair activities over the pipeline's service life.


4.1 Optimization problem 


In reliability-based optimization, the total expected cost of maintenance and failure is the objective function to be minimised, see e.g. (Beer et al. 2014). The times and number of the inspections represent the design variables of the optimization problem, while the expected monetary costs associated with inspection, repair and failure form the objective function, which can be formulated as:

$$\underset{N_I,\, q,\, t_i}{\arg\min}\; E\left[C_T(N_I, q, t_i)\right] \qquad (4)$$

where $N_I$, q, $t_i$ and $C_T$ denote the number of inspections, the quality of the inspections, the i-th time of inspection and the total cost, respectively. The expected total costs are defined as:

$$E\left[C_T(N_I, q, t_i)\right] = E\left[C_I(N_I, q, t_i)\right] + E\left[C_R(N_I, q, t_i)\right] + E\left[C_F(N_I, q, t_i)\right] \qquad (5)$$

where $C_I$, $C_R$ and $C_F$ are the costs of inspection, repairs and failure, respectively.

In addition, the optimisation problem must satisfy some constraints. For instance, it might be necessary to guarantee a minimum level of reliability, i.e.:

$$R(t) = 1 - P_f(t) \ge R_{\min} \quad \forall\, t \in [0, T_m] \qquad (6)$$

where $P_f(t)$ is the probability of failure at time t, $R_{\min}$ is the minimum required reliability and $T_m$ represents the so-called mission time. Hence, a reliability-based maintenance strategy requires the evaluation of the reliability over time, as summarised in Section 2.1. Constrained optimisation techniques are then adopted to identify the minimum of the objective function, Eq. (4), satisfying the constraint of Eq. (6).


4.2 Inspection costs 


The expected inspection costs $E[C_I]$ are calculated as the product of the unit inspection cost, $c_I$, which depends on the quality of inspection q, corrected by the discount rate, r, and the probability that the inspection will take place at time $t_i$, $1 - P_f(t_i)$. In other words, the pipeline must not have failed before the i-th inspection time, scheduled at $t_i$. These expected costs are expressed in mathematical form as:

$$E[C_I] = \sum_{i=1}^{N_I} c_I(q)\, \frac{1 - P_f(t_i)}{(1 + r)^{t_i}} \qquad (7)$$


4.3 Repair costs 


The evaluation of the expected costs associated with repair, $E[C_R]$, is quite challenging since it depends on the probability of performing a repair after the i-th inspection, $P_R(t_i)$. This, in turn, depends on the probability of detection (PoD), i.e. the probability of detecting a defect. The expected repair costs are modelled as:

$$E[C_R] = \sum_{i=1}^{N_I} c_R\, \frac{P_R(t_i)}{(1 + r)^{t_i}} \qquad (8)$$

where $c_R$ is the unit cost of a repair. The probability of repair is calculated by a reliability analysis of the pipeline in which the repair threshold represents the limit state function, weighted by the probability of detecting the crack, i.e.

$$P_R(t_i) = \int_{1.25 < g(\boldsymbol{\theta}, t_i) < 1.5} f(\boldsymbol{\theta})\, \mathrm{PoD}(d, t_i)\, d\boldsymbol{\theta} \qquad (9)$$


4.4 Failure costs 


The total capitalized expected costs due to failure, $E[C_F]$, are the costs associated with failure over the region of the corresponding demand functions. Hence, the calculation of the failure costs requires the estimation of the probability of failure of the pipeline over time. The computational strategy proposed in the next Section allows these costs to be estimated by performing a single reliability analysis and then reusing the results in the optimisation loop.

The cost of failure at a specific inspection time is calculated as the cost of failure at the i-th time $t_i$ (which is proportional to $P_f(t_i)$) minus the cost of failure at the previous time $t_{i-1}$ (proportional to $P_f(t_{i-1})$). This takes into account the fact that the system has survived until time $t_{i-1}$. Taking into account all the inspection times and the discounted costs, the expected failure cost becomes:


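$$E[C_F] = \sum_{i=1}^{N_I} c_F\, \frac{P_f(t_i) - P_f(t_{i-1})}{(1 + r)^{t_i}} \qquad (10)$$

where $c_F$ is the unit cost of failure.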
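
The three cost terms of Sections 4.2 to 4.4 can be combined in a few lines. The sketch below is one reading of the verbal descriptions, not necessarily the authors' exact formulation: each term is discounted by $(1 + r)^{t_i}$, inspection costs are weighted by the survival probability, repair costs by the probability of repair, and failure costs by the increment of the failure probability since the previous inspection. The unit costs and discount rate are those of Table 5; the probability values in the usage lines are hypothetical.

```python
def expected_costs(times, pf, pr, c_i=0.018, c_r=0.243, c_f=36.55, r=0.05):
    """Discounted expected inspection, repair and failure costs.

    times: inspection times t_1..t_N in years
    pf:    failure probabilities P_f(t_i), assumed non-decreasing
    pr:    repair probabilities P_R(t_i)
    """
    cost_i = cost_r = cost_f = 0.0
    pf_prev = 0.0                        # assumed P_f = 0 at the start of service
    for t, p_f, p_r in zip(times, pf, pr):
        discount = (1.0 + r) ** t
        cost_i += c_i * (1.0 - p_f) / discount      # inspection happens only if not failed
        cost_r += c_r * p_r / discount              # repair after the i-th inspection
        cost_f += c_f * (p_f - pf_prev) / discount  # failure between t_{i-1} and t_i
        pf_prev = p_f
    return cost_i, cost_r, cost_f

# Hypothetical probabilities for three inspections at 10, 20 and 30 years
ci, cr, cf = expected_costs([10.0, 20.0, 30.0],
                            pf=[0.001, 0.004, 0.010],
                            pr=[0.05, 0.08, 0.10])
print(f"inspection {ci:.4f}, repair {cr:.4f}, failure {cf:.4f}, total {ci + cr + cf:.4f}")
```

The sum of the three returned values is the objective of Eq. (4) for one candidate schedule.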


5 COMPUTATIONAL STRATEGY 


5.1 Reliability analysis 


The estimation of the probability of failure generally requires significant computational effort. In particular, for highly reliable pipelines the number of model evaluations easily exceeds the computational resources available. In addition, the presence of imprecision adds another level of complexity, because the propagation of intervals and p-boxes requires an additional optimization loop, making the required computational cost quite challenging. For this reason, the Advanced Line Sampling method (de Angelis et al. 2015) is adopted to estimate the bounds of the probability of failure. One of the key features of this approach is the ability to estimate different probabilities of failure (associated with different levels of the performance function) with only one analysis. For instance, a single reliability analysis can be used to estimate both the bounds of the probability of failure due to imprecision in the inputs and the probability of repair.


5.2 Robust maintenance 


The robust maintenance is computed adopting a novel computational strategy that allows the results of the reliability analysis to be reused in the optimisation problem.

Figure 1. Effect of maintenance (repairs) on the weights associated with realisations of the evolution over time of the failure pressure.

In order to explain the simulation approach, we first consider a simplified model without imprecision, solved using the plain Monte Carlo method. The idea is first to simulate the evolution of defects/cracks in pipelines without considering inspections and repairs. This is done by sampling the parameters of the model and then solving the equations in Tables 1-2 up to the time of interest. At this point we have a number of possible crack evolutions (or failure pressures) over time. Then, we add the effect of maintenance and update the corresponding pipeline reliability, as shown in Figure 1, by calculating the weights associated with each possible event outcome. The procedure is repeated for all the simulated crack evolutions. The computed weights are then used to calculate the probability of failure at the time of interest. For instance, the probability of failure at time $t_i$ is estimated as the sum of the weights associated with the failure events divided by the total number of simulations.

Finally, an optimisation tool is used to identify the number and times of inspections that minimise Eq. (4). When a new inspection time is proposed, it is necessary to recompute the weights starting from the original simulation, but this step does not require the re-analysis of the model (i.e. evaluating the evolution of the crack/defect up to the time of interest).
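
The reuse of a single simulation can be sketched for the simplified plain Monte Carlo case described above. The code below is an illustration under several assumptions that are not stated in the paper: the B31G failure pressure, linear defect growth, a yearly time grid, failure when the safety factor drops below 1 or the defect becomes through-wall, repair whenever a detected defect has a safety factor below 1.5, and a repaired pipe that does not fail afterwards. Each trajectory carries a weight equal to the probability that its defect has been missed at every inspection so far.

```python
import math
import random

Q_INSPECTION = 2.42   # inspection quality quoted in the example application
SF_REPAIR = 1.5       # assumed repair threshold on the safety factor
MISSION_TIME = 50     # years

def failure_pressure_b31g(d, L, D, W, sigma_y):
    M = math.sqrt(1.0 + 0.893 * L**2 / (D * W))
    ratio = 2.0 * d / (3.0 * W)
    return 2.0 * 1.1 * sigma_y * W / D * (1.0 - ratio) / (1.0 - ratio / M)

def lognormal(rng, mean, cov):
    s2 = math.log(1.0 + cov**2)
    return rng.lognormvariate(math.log(mean) - 0.5 * s2, math.sqrt(s2))

def simulate_trajectories(n, seed=0):
    """Run the degradation model once; store (depth, safety factor, failed) per year."""
    rng = random.Random(seed)
    trajectories = []
    for _ in range(n):
        D = rng.gauss(609.6, 0.02 * 609.6)
        W = rng.gauss(9.52, 0.02 * 9.52)
        sigma_y = rng.gauss(358.0, 0.07 * 358.0)
        d0 = rng.gauss(3.0, 0.3)
        l0 = rng.gauss(200.0, 20.0)
        p_op = lognormal(rng, 4.96, 0.10)
        vd = lognormal(rng, 0.5, 0.10)
        vl = lognormal(rng, 0.5, 0.10)
        path = []
        for year in range(MISSION_TIME + 1):
            d = d0 + vd * year
            if d >= W:                      # assumed: through-wall defect is a failure
                path.append((W, 0.0, True))
                continue
            sf = failure_pressure_b31g(d, l0 + vl * year, D, W, sigma_y) / p_op
            path.append((d, sf, sf < 1.0))
        trajectories.append(path)
    return trajectories

def failure_probability(trajectories, inspection_years):
    """Re-weight the stored trajectories for one schedule; no new model runs."""
    total = 0.0
    for path in trajectories:
        weight = 1.0
        for year in inspection_years:
            d, sf, failed = path[year]
            if failed:
                break                       # nothing to inspect after failure
            if sf < SF_REPAIR:
                pod = 1.0 - math.exp(-Q_INSPECTION * d)
                weight *= 1.0 - pod         # keep only the branch where the defect is missed
        if path[MISSION_TIME][2]:
            total += weight
    return total / len(trajectories)

def equally_spaced(n_inspections):
    step = MISSION_TIME / (n_inspections + 1)
    return [round(step * (i + 1)) for i in range(n_inspections)]

trajectories = simulate_trajectories(500)
for n in range(0, 9):
    schedule = equally_spaced(n)
    print(n, schedule, round(failure_probability(trajectories, schedule), 4))
```

Changing the schedule only changes the lookups into the stored trajectories, so an optimiser can test many candidate schedules at negligible cost compared with the initial simulation.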


6 EXAMPLE APPLICATION 


The optimal maintenance scheduling of a pipeline 
with characteristics shown in Table 4 is performed. 


6.1 Reliability analysis of pipelines 


First, the probability of failure of the pipeline as a function of time has been computed using the DNV-101, Shell-92, B31G and B31Gmod codes without considering inspection and maintenance. The uncertainties are modelled as shown in Table 5. Different levels of imprecision on the parameter values have also been considered.

Advanced Line Sampling (de Angelis et al. 2015) is adopted to estimate the reliability of the pipelines with 20 lines, resulting in 120 model evaluations. Advanced Line Sampling is able to deal with imprecision in the parameter values and allows the bounds of the reliability to be computed. In addition, the number of model evaluations is independent of the reliability level. As expected, the probability of failure of the corroded pipeline increases with time, as shown in Figure 2. The figure shows the lower and upper bounds of the probability of failure when 10% imprecision is considered on the input variables. Shell-92 and DNV-101 are the most conservative models, followed by Modified B31G; the least conservative is the B31G model. This is in accordance with results from the literature obtained without considering imprecision (see e.g. Caleyo et al. (2002)). The results of the analysis are also summarised in Table 3.


6.2 Robust maintenance 


Maintenance is a very effective way to improve the safety of corroded pipelines. The aim of this section is to identify the optimal number of inspections that minimises the overall costs. Maintenance is only performed if a defect is detected. The typical minimum detectable depth of a high-resolution Magnetic Flux Leakage tool for uniform corrosion is 0.1 W with a PoD of 0.9 (Version 2009). Using these values and the pipeline wall thickness W = 9.52 mm, the quality of inspection can be estimated as q = 2.42 (from Eq. 3). However, if the length of the defect is l < 30 mm we have a pitting defect. In this case the quality of inspection is reduced to q = 1.61.
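
The calibration of the inspection quality can be checked directly by inverting Eq. (3), assuming the form PoD = 1 - exp(-q d) with d in millimetres. The helper names below are illustrative.

```python
import math

def pod(depth_mm, q):
    """Probability of detection, Eq. (3): PoD = 1 - exp(-q * d)."""
    return 1.0 - math.exp(-q * depth_mm)

def quality_from_pod(depth_mm, target_pod):
    """Invert Eq. (3) for q given one (depth, PoD) calibration point."""
    return -math.log(1.0 - target_pod) / depth_mm

W = 9.52                                     # wall thickness in mm
q_uniform = quality_from_pod(0.1 * W, 0.9)   # PoD of 0.9 at a depth of 0.1 W
print(f"q for uniform corrosion: {q_uniform:.2f}")
print(f"PoD at 0.1 W with q = 1.61: {pod(0.1 * W, 1.61):.3f}")
```

With a detectable depth of 0.952 mm and a PoD of 0.9 this gives q = -ln(0.1)/0.952, which rounds to the value of 2.42 quoted in the text.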



Figure 2. Lower and upper bounds of the probability of failure of a pipeline with 10% of imprecision on the variables using Shell-92, B31G, Modified B31G and DNV-101 failure pressure models.

In this example, it is assumed that the inspection times are equally spaced from the initial time to the final time of 50 years (mission time). Figure 3 shows the pipeline failure probability at mission time against the number of inspections for different models and parameter imprecision. From the results presented in Figure 3, it can be deduced that, using the B31G model, 3 inspections suffice to reduce the probability of failure of the pipeline below 10%. However, when the Modified B31G model is used, more than 6 inspections are required (using the upper bounds of the parameters). These results allow the minimum number of inspections required to guarantee a prescribed level of safety to be identified. The very small probabilities of failure have been calculated adopting the approach presented in Section 5.

Figures 4 and 5 show the total expected cost as a function of the number of inspections obtained using the DNV-101 model. As expected, the costs of inspection increase with the number of inspections performed during the lifetime of the pipeline, while the costs of failure decrease with the number of inspections. For a very small number of inspections the total costs are governed by the costs associated with failure, while for a large number of inspections the total maintenance costs are dominated by the costs associated with repairs. The optimal number of inspections is always a trade-off between costs
of failure and costs of repairs. Using the DNV-101 model, the optimal number of inspections is between 4 and 6.

Figure 3. Probability of failure (at mission time) for a pipeline as a function of the number of inspections.

Figure 4. Expected total maintenance cost as a function of the number of inspections using the DNV-101 model, with a mission time of 50 years and upper bounds of the imprecise parameters.

Figure 5. Expected total maintenance cost as a function of the number of inspections using the DNV-101 model, with a mission time of 50 years and lower bounds of the imprecise parameters.

Table 3. Bounds of the relative corrosion defect for different safety levels with 10% imprecision on model parameters.

Safety level   B31G               DNV-101            Mod-B31G           Shell-92
10^-3          [0.4799, 0.9273]   [0.5035, 0.7389]   [0.4799, 0.7860]   [0.4093, 0.6683]
10^-4          [0.4093, 0.8566]   [0.4564, 0.7154]   [0.4328, 0.7389]   [0.3622, 0.6212]
10^-5          [0.3151, 0.8095]   [0.4328, 0.6683]   [0.3858, 0.7154]   [0.3151, 0.5977]
10^-6          [0.3151, 0.7625]   [0.4093, 0.6447]   [0.3151, 0.6683]   [0.3151, 0.5741]
10^-7          [0.3151, 0.7154]   [0.3622, 0.6212]   [0.3151, 0.6212]   [0.3151, 0.5270]

Table 4. Pipeline characteristics.

Parameter                 Value
Transported substance     Crude oil
Pipe outlay               Below ground
Outside diameter          609.6 mm
Material                  Class X52: SUTS 496 MPa, SMYS 358 MPa, MAOP 4.96 MPa
Nominal wall thickness    9.52 mm

Table 5. Stochastic model used for the corroded pipeline and monetary unit cost for operation (i.e. multiplicative factors).

Parameter                     Symbol and unit      Distribution   Mean    CoV
Diameter                      D [mm]               Normal         609.6   0.02
Defect depth                  d [mm]               Normal         3       0.10
Wall thickness                W [mm]               Normal         9.52    0.02
Ultimate tensile strength     $\sigma_u$ [MPa]     Log-Normal     496     0.07
Pipe yield stress             $\sigma_y$ [MPa]     Normal         358     0.07
Defect length                 l [mm]               Normal         200     0.1
Operating pressure            Op [MPa]             Log-Normal     4.96    0.10
Radial corrosion rate         vd [mm/yr]           Log-Normal     0.5     0.10
Longitudinal corrosion rate   vl [mm/yr]           Log-Normal     0.5     0.10

Parameter         Symbol   Value
Inspection cost   $c_I$    0.018
Repair cost       $c_R$    0.243
Failure cost      $c_F$    36.55
Discount rate     r        0.05


7 CONCLUSIONS 


In this paper, an efficient numerical approach for robust optimal scheduling of pipeline inspection times has been proposed. It determines the optimal inspection interval and the repair strategy that maintain adequate reliability throughout the pipeline service life. The computational framework takes into account the uncertainties of the model and the imprecision in the knowledge of the model parameters. The proposed approach is efficient since it allows reliability-based optimisation to be performed with only one reliability analysis.
ABSTRACT: The reliability and safety of mooring chains have caused concern in recent years.
Accordingly, some efforts have been made to assess the structural integrity of mooring chains.
However, a fully successful mooring chain condition monitoring technique has not been achieved to date.
This is largely because mooring chains are submerged in water and the currently available
non-destructive testing technologies are difficult to apply in a wet environment. To overcome this
issue, a new mooring chain condition monitoring method based on thermography is studied in this paper.
The research rests on two premises: (1) the mooring chain material has a much higher thermal
conductivity than water, so when the mooring chain is heated, the thermal energy is transmitted mostly
inside the chain rather than dispersing in the water; (2) defects in the mooring chain disturb the
heat flow inside the chain and consequently change the temperature distribution in the adjacent area.
To demonstrate the effectiveness of the proposed method, both numerical and experimental studies are
conducted in this paper. The results show that thermography is indeed valid for detecting the
integrity of mooring chains.


1 INTRODUCTION 


A variety of non-destructive testing techniques
have been developed for addressing structural
integrity testing and assessment issues in
different fields (see Blitz 2012, Sanjeev et al.
2013, Amenabar 2011 and Garcia-Martin et al. 2011).
However, few of them are applicable to monitoring
the structural integrity of mooring chains, because
the chains are fully submerged in water in harsh
marine and offshore environments.

To tackle this issue, much effort has been made by
researchers and industry in recent years, although
a cost-effective and technically effective mooring
chain condition monitoring technique has not yet
been achieved. This is because almost all the
existing mooring chain condition monitoring
techniques and systems (see AVT Reliability 2017,
Seatools 2017, Lugsdin 2017), with the exception of
the ultrasonic guided wave technique developed by
TWI (see TWI 2013), were originally designed for
detecting a broken mooring line rather than for
detecting and monitoring the growth of defects in
it. In reality, a defective mooring chain may, but
does not necessarily, lead to breakage of the
mooring line when it is subjected to extreme loads.
For example, Remotely Operated Vehicle (ROV)
inspection has been widely adopted for inspecting
typical damage and loss of integrity of marine
structures, and it can equally be applied to the
inspection of mooring chains. However, ROV
inspection only provides a snapshot of the surface
of the mooring chains, which may be covered by a
thick layer of marine growth. Therefore, visual
inspection via ROV is unable to provide the
operator with reliable information about the actual
structural health condition of mooring chains.
Moreover, because ROV inspection is limited by
weather windows and access, it is unlikely to
achieve continuous monitoring of the mooring chain.
In order to obtain continuous monitoring data from
mooring chains, AVT Reliability used strain gauges
to monitor the integrity of mooring chains and
applied the resulting strain gauge based
measurement system to assess the integrity of the
9 mooring chains installed on an 870,000 barrel oil
storage tanker (AVT Reliability 2017).
The novelty of such a system is that it does not
directly measure the chain tensions but instead
monitors the stresses in the buoy structural
steelwork that reacts those same chain tensions.
This has the advantage that the instrumentation can
be mounted inside the buoy in a clean, dry
environment. However, the stresses measured in the
buoy structural steelwork depend not only on the
integrity of the mooring chains but also on the
motions of the storage tanker and the external
loads acting on it. Therefore, the AVT system is
effective in detecting a broken mooring line but
ineffective in detecting and monitoring incipient
defects in it.
Apart from AVT Reliability, other companies have
also developed mooring chain integrity monitoring
systems using different techniques. For example,
Seatools developed a mooring chain inclination
measurement technique (Seatools 2017) and Tritech
International Ltd developed a multi-beam sonar
technique (Lugsdin 2017). However, all of these are
designed for detecting whether the mooring lines
are well connected to the floating structures,
rather than for detecting defects in the chains.
To tackle this issue, TWI developed an automated
ultrasonic guided wave technique for monitoring
mooring chains (TWI 2013). Laboratory tests have
shown that the technique improves the accuracy,
consistency and repeatability of inspection
results. However, it requires some, albeit minimal,
surface preparation before inspection, which is
very difficult to implement in practice.
Additionally, the high cost of the associated
robotic delivery system limits the wider
application of the ultrasonic guided wave
technique. In view of this, a new mooring chain
condition monitoring technique based on
thermography is studied in this paper. The details
of the numerical and experimental research are
given below.


2 HYPOTHESES 


According to the fundamental theory of thermo- 
dynamics, the rate of heat flow can be described as 
(Borgnakke et al. 2003): 


$$\frac{Q}{t} = \frac{\kappa A \,\Delta T}{d} \qquad (1)$$


where Q = the amount of heat transferred in a
time t; κ = the thermal conductivity constant of
the material; A = the cross-sectional area of the
material transferring heat; d = the thickness of the
material; and ΔT = the difference in temperature
between one side of the material and the other.
From (1), two hypotheses can be inferred: (1) since
the thermal conductivity κ of steel is
46 W/(m·°C), which is much higher than that of
water (κ for water is only 0.58 W/(m·°C)), heat
will be transferred much faster in steel than in
water. Accordingly, when the steel mooring chain
is heated from one end, the majority of the heat
flow will be transferred inside the steelwork of
the chain rather than being dissipated in the water
around it; (2) equation (1) shows that, for a given
heat flow rate, the cross-sectional area A is
inversely proportional to the temperature
difference ΔT. That means that when the same amount
of heat is transferred inside the chain, a
defect-induced change in the cross-sectional area A
will be indicated by a change in temperature or
temperature distribution. Meanwhile, the heat will
flow preferentially along the path with the smaller
thermal resistance.

If the above two hypotheses are true, the integrity
of the mooring chain can be assessed by observing
the temperature distribution along the chain. In
other words, a discontinuity in the temperature
distribution may indicate the presence of a defect
in the structure of the mooring chain.
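
The two hypotheses can be illustrated numerically with equation (1), read as Q/t = κ A ΔT / d. The following minimal Python sketch uses the two conductivities quoted above; the cross-sectional area, conduction length and temperature difference are hypothetical values chosen only for illustration, and the function names are not taken from the paper.

```python
K_STEEL = 46.0  # W/(m.degC), value quoted in the text
K_WATER = 0.58  # W/(m.degC), value quoted in the text


def heat_flow_rate(kappa, area, delta_t, thickness):
    """Equation (1): Q/t = kappa * A * delta_T / d, in watts."""
    return kappa * area * delta_t / thickness


def temperature_difference(rate, kappa, area, thickness):
    """Equation (1) solved for delta_T at a fixed heat flow rate."""
    return rate * thickness / (kappa * area)


# Hypothetical geometry: 50 mm^2 section, 24 mm path, 100 degC difference.
area, length, delta_t = 50e-6, 0.024, 100.0

# Hypothesis 1: the same geometry conducts far more heat in steel than in water.
q_steel = heat_flow_rate(K_STEEL, area, delta_t, length)
q_water = heat_flow_rate(K_WATER, area, delta_t, length)
ratio = q_steel / q_water

# Hypothesis 2: at a fixed heat flow rate, halving the area doubles delta_T.
dt_intact = temperature_difference(q_steel, K_STEEL, area, length)
dt_cracked = temperature_difference(q_steel, K_STEEL, area / 2.0, length)

print(f"steel/water heat flow ratio: {ratio:.1f}")
print(f"delta_T intact: {dt_intact:.1f} degC, half section: {dt_cracked:.1f} degC")
```

Under these assumptions the ratio depends only on the two conductivities, so the choice of geometry does not affect the comparison between steel and water.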


3 NUMERICAL RESEARCH 


In order to test the aforementioned two hypotheses
and investigate how heat flows inside a mooring
chain when it is heated, a numerical model of
mooring chains was developed in ANSYS 15. In each
chain, the center-to-center distance is 24 mm and
the width is 18 mm. The radius of the side circle
is 4 mm. Different types of defects were then
introduced into the model by changing its geometry.
The mooring chains with cracks of different sizes,
each with 0.5 mm clearance, are shown in Figure 1.

In ANSYS 15, steady-state thermal analyses were
conducted. In the engineering data section of the
analysis, steel was chosen as the material for the
mooring chain. The mooring chains were meshed, and
mesh refinement was used to build finer meshes in
the vicinity of the defects, as in the example
given in Figure 2.

In the numerical calculations, an environmental
temperature of 22 °C and a heating temperature of
300 °C are specified in the steady-state thermal
settings. The convection film coefficient is also
defined in this section; its value is taken as
22 W/(m²·°C). When the heating temperature is
applied at one end of the mooring chain while the
other end is fixed and no temperature is applied,
the numerical results for a partially cracked
mooring chain are as shown in Figure 3.

Figure 1. The mooring chains with different integrity conditions.

Figure 2. Mesh and mesh refinement of the mooring chains.

Figure 3. Numerical result obtained when a partially cracked mooring chain is heated from one end.

From Figure 3, it is found that: (1) the
discontinuity of the temperature observed in the
upper part of the defective mooring chain shows
that the defect does disturb the heat flow, and may
thus cause a visible temperature difference in the
vicinity of the defect; (2) the asymmetric
distribution of the temperatures in the upper and
lower parts of the defective mooring chain shows
that heat flows more easily along the path with the
smaller thermal resistance; (3) although in the
numerical simulation the mooring chain is assumed
to be placed in air rather than in water, the
visible temperature differences suggest that when
the steel mooring chain is heated from one end, the
majority of the heat flow is transferred inside the
steelwork of the chain rather than being dissipated
outside it, since the thermal conductivities of
both air and water are much smaller than that of
the steel mooring chain; and (4) the profile of the
temperature distribution around the defect
indicates the size of the defect. In order to
further verify the findings from Figure 3, a fully
cracked mooring chain is heated from both ends, as
shown in Figure 4.

Figure 4. Numerical result obtained when a fully cracked mooring chain is heated from both ends.

From Figure 4, it is clearly seen that the
aforementioned four findings remain valid when the
mooring chain is completely cracked and heated from
both ends. This indicates that, in theory,
thermography has the potential to be applied to
detecting and monitoring defects in mooring chains.


4 EXPERIMENTAL RESEARCH 


In order to demonstrate physically the findings of
the numerical research, experiments were carried
out in the laboratory. The perfect chain and the
defective chains with cracks of different sizes are
shown in Figure 5. The cracks were made
artificially using a hacksaw.

In the experiment, the influence of the cracks on
the temperature distribution in their vicinity is
investigated in two scenarios in order to find the
heating method that leads to the more reliable
condition monitoring result. In the first scenario,
the mooring chain being investigated is heated from
both ends, while in the second scenario it is
heated from only one end. The experimental results
obtained in the first scenario are shown in
Figure 6.


Figure 5. Mooring chains used in the experiments. Panels: (a) perfect chain (welds marked); (c) chain with partial-depth crack; (d) chain with through crack.

Figure 6. Experimental results obtained when the chains are heated from both ends. Panels: (c) chain with partial-depth crack; (d) chain with through crack.


From Figure 6, it is found that when the mooring
chain is perfect and has no defect inside the chain
structure, the temperature is distributed evenly
and smoothly over the chain except at the welds,
where the temperature is noticeably lower than
elsewhere due to the much higher thermal
conductivity of the welding material. When a crack
is present in the chain, however, the even
distribution of the temperature over the chain is
interrupted. Consequently, a concave profile can be
observed at the crack in the thermal image.
Moreover, the larger the crack, the deeper the
concave profile tends to be. This is because the
air or water in the clearance of the crack has a
lower thermal conductivity than the mooring chain
material. Therefore, the air or water temperature
in the crack clearance is lower than that of the
steelwork of the chain.
From this observation, it can be inferred that a
partial crack partially blocks the heat flow,
although heat can still be transferred through the
un-cracked section. When the crack continues to
propagate and finally becomes a full through crack,
the heat flow is significantly limited by the
crack. In this worst case, heat is transferred
across the crack only via the air or water in the
clearance of the full through crack. In general,
from the experimental results obtained in the first
scenario, it can be concluded that a crack in the
mooring chain can be readily detected using the
thermographic technique. Moreover, the size of the
crack can be approximately estimated by observing
the depth of the concave in the temperature profile
in the vicinity of the crack. However, a more
accurate evaluation of the crack is difficult to
achieve due to the limitations of observation.

In order to further improve the accuracy of the
crack evaluation, the experiment is repeated in the
second scenario, in which the mooring chains being
investigated are heated from only one end. The
corresponding experimental results are shown in
Figure 7.

From Figure 7, it is found that phenomena similar
to those observed in the first scenario (see
Figure 6) can also be observed. However, the crack
is not easy to assess from the depth of the concave
in the temperature profile, because the concave
feature caused by the crack cannot be clearly
observed in the second scenario. Accordingly, a
quantitative assessment method is developed below
using the temperature reading function of the
thermographic camera. A temperature value is
displayed at the top left of each picture: 80.8°C
in Figure 7a, 26.6°C in Figure 7b, 25.2°C in
Figure 7c, and 33.8°C in Figure 7d. These values
indicate the temperatures at the positions marked
by the circle with a cross. With the aid of this
temperature reading function, the temperature at
any position in the thermal image can be readily
obtained. The temperatures at two specific
positions are then measured for developing the
quantitative assessment criterion. One is the
temperature T_h measured at the heating source
position; the other is the temperature T_c measured
at the other side of the crack. Since the
temperature at the heating source position is the
highest temperature in the thermal image, the value
of T_h is usually the maximum value shown in the
grey scale of the image, i.e. T_h = 216°C for
Figure 7a, 192°C for Figure 7b, 263°C for
Figure 7c, and 99.6°C for Figure 7d. Therefore, T_c
is in fact the only value that needs to be
measured, by moving the circle with a cross to the
other side of the crack. Once the value of T_c is
obtained, the following quantitative assessment
criterion can be calculated:


$$C = \frac{T_h - T_c}{T_h} \times 100\% \qquad (2)$$


In essence, the criterion C measures how
significantly the crack disturbs the heat transfer
along the mooring chain. When the crack increases
in size, the local heat transfer ability of the
chain decreases due to the reduced contact area of
the metal. This leads to a small value of T_c and
consequently a large value of criterion C when the
heating temperature T_h is constant. Therefore, the
criterion C can be used to quantitatively assess
the size of a transverse crack in the mooring
chain. In the experiment, the values of T_c have
been read from the images shown in Figure 7. They
are T_c = 97°C for Figure 7a, 85°C for Figure 7b,
111.5°C for Figure 7c, and 40°C for Figure 7d.
Substituting the values of T_h and T_c into (2),
the criterion C is calculated and the results are
shown in Figure 8.

Figure 7. Experimental results obtained when the chains are heated from one end.

Figure 8. Quantitative assessment of the crack occurring in the mooring chain.
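
The calculation of the criterion can be sketched in a few lines of Python. The sketch assumes that criterion (2) has the form C = (T_h - T_c) / T_h × 100%, where T_h is the heating source temperature and T_c is the temperature on the other side of the crack; this form is a reconstruction from the surrounding description, and the function name is illustrative. The temperature readings are those quoted in the text for Figures 7a to 7d.

```python
def crack_criterion(t_heat, t_cold):
    """Assumed form of criterion (2): C = (T_h - T_c) / T_h * 100, in percent."""
    return (t_heat - t_cold) / t_heat * 100.0


# (T_h, T_c) readings in degC quoted in the text for each panel of Figure 7.
readings = {
    "7a": (216.0, 97.0),
    "7b": (192.0, 85.0),
    "7c": (263.0, 111.5),
    "7d": (99.6, 40.0),
}

criteria = {panel: crack_criterion(th, tc) for panel, (th, tc) in readings.items()}

for panel, value in criteria.items():
    print(f"Figure {panel}: C = {value:.1f}%")
```

With these readings the assumed criterion rises monotonically from panel 7a to panel 7d, from roughly 55% to roughly 60%, which agrees with the gradual increase described for Figure 8.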

From Figure 8, it is clearly seen that, as
expected, the value of criterion C increases
gradually with the depth of the crack. This means
that the larger the crack, the more significant its
influence on the heat transfer process. This
demonstrates that the proposed quantitative
evaluation method works in assessing cracks in the
mooring chain.


5 CONCLUSIONS 


In order to explore a feasible method for
monitoring the health condition of mooring chains,
both numerical and experimental studies were
conducted in this paper to investigate the
potential of thermography for detecting transverse
cracks in mooring chains. From the work reported
above, the following conclusions can be drawn:


1. When the mooring chain is structurally intact,
the temperature is smoothly distributed over the
chain except at the welds, where the temperature is
lower than in other parts of the chain due to the
higher thermal conductivity of the welding material
compared with the steel of the mooring chain;

2. When the mooring chain is heated from both ends,
a clear concave appears in the temperature profile
in the presence of a transverse crack. Moreover,
the larger the crack, the deeper the concave tends
to be. Therefore, the concave in the temperature
profile is a good indicator of the crack and its
propagation when the mooring chain is heated from
both ends;

3. Compared with heating from both ends, a more
accurate assessment of the crack can be achieved
when the mooring chain is heated from only one end.
The experiment has shown that the proposed
quantitative evaluation method can successfully
indicate the presence and growth of the crack when
the chain is heated from only one end;

4. The numerical and experimental studies have
demonstrated that thermography has the potential to
be applied to the condition monitoring of mooring
chains. However, further research is still required
to develop an appropriate method of heating the
mooring chain in a completely wet environment.
ABSTRACT: This paper presents an efficient approach for Bayesian analysis of corroding gas pipelines
containing metal-loss corrosion defects subjected to internal pressure. The methodology considers
stochastic dependence among individual defects. The Nataf model is used to model the interdepend- 
ence (correlation) among the defects. The generation of new defects and the growth of existing ones 
are included in the stochastic modelling, by employing a Non-Homogeneous Poisson Process (NHPP) 
and a Homogeneous Gamma Process (HGP), respectively. Information on a corroding pipeline obtained 
through multiple In-Line Inspections (ILIs) is used for the Bayesian updating of stochastic models. The 
Probability of Detection (PoD) and measurement error of the inspection tools are also taken into consid- 
eration. The Bayesian updating is conducted by using Structural Reliability Methods (SRM), a recently 
proposed method in literature that sets an analogy between Bayesian updating and a reliability problem. 
The SRM adopted herein is Subset Simulation (SuS) and the whole analysis is referred to as BUS-SuS. 
The updating is conducted in conjunction with the Data Augmentation (DA) technique which accounts 
for both detected and undetected defects, by treating the undetected ones as the missing data. Multiple 
simulated corrosion defects from different ILIs are employed for the implementation and validation
of the methodology. A parametric study that examines the impact of correlations among defects
on the posterior stochastic models is also included in the numerical example.


1 INTRODUCTION 


Metal-loss corrosion is the most predominant 
gradual deterioration process for gas pipelines, 
based on historical failures. Reliability-based 
integrity management programs are increasingly 
adopted by pipeline operators to ensure the safe 
operation of pipelines against metal-loss corrosion 
(Khan & Tee 2016). In the literature, stochastic 
growth models are considered the most efficient 
modelling option for metal-loss corrosion in gas 
pipelines (Maes et al. 2009, Zhang & Zhou 2013, 
Pesinis & Tee 2017). However, a comprehensive 
analysis requires the modelling of the generation of 
new crack features too. Moreover, the dependence 
(or correlation) among individual defects should 
be considered, to express the spatial dependence 
among them. This is typically due to the similar 
corrosive environment, similar pipe properties at 
the defects’ location and the fact that defects are 
subjected to the same loading conditions (Zhou 
et al. 2012, Khemis et al. 2016). 

In Maes et al. (2009) and Zhang & Zhou (2013), a
gamma process was employed to characterize the
growth of corrosion defects on the pipeline. In Qin
et al. (2015), the generation of new metal-loss
corrosion defects was taken into consideration in
the analysis, using a stochastic process-based
model. In all three studies, the hierarchical
Bayesian analyses updated the stochastic models
based on ILI data by means of Markov Chain Monte
Carlo (MCMC) simulation, without considering
correlations among defects. In fact, there is only
limited information available on modelling
correlations of deterioration in energy pipelines
(Qian et al. 2013, Zhou et al. 2017) and, to the
best of the authors’ knowledge, none when it comes
to Bayesian updating. Furthermore, with MCMC
simulation it is uncertain whether the final
samples have reached the posterior distribution,
and the method has limited capacity for quantifying
small failure probabilities (Straub et al. 2016).
An alternative to MCMC is Bayesian Updating with
Structural reliability methods (BUS), which sets an
analogy between Bayesian updating and a reliability
problem (Straub & Papaioannou 2014). This
formulation enables the use of established
Structural Reliability Methods (SRM) to conduct the
Bayesian updating. The SRM adopted in this study is
Subset Simulation (SuS).

Thus, hierarchical Bayesian updating is con- 
ducted herein, by using BUS-SuS in conjunction 
with the Data Augmentation (DA) technique. 
Simulated data corresponding to a gas pipeline
subjected to internal pressure loading are used
to illustrate and validate the proposed method- 
ology. The growth of multiple metal-loss corro- 
sion defects is characterized through adopting 
a Homogeneous Gamma Process (HGP) model 
and incorporating it into the hierarchical Bayesian 
framework based on multiple ILI data, account- 
ing for the associated measurement errors as well. 
Furthermore, the interdependence among different 
defects is considered using the Nataf model, also 
known as the Gaussian copula. The generation of 
new metal-loss corrosion defects is realised in the 
analysis by means of Non-Homogeneous Poisson 
Process (NHPP). At the end, the impact of differ- 
ent dependence scenarios is investigated. 

The contributions of this paper are twofold. First,
a robust hybrid hierarchical Bayesian framework is
developed for updating both the generation and the
growth stochastic models of metal-loss corrosion.
The proposed methodology, based on BUS-SuS and the
DA technique, eliminates the uncertainty regarding
whether the final samples have reached the
posterior distribution. Second, the methodology
efficiently accounts for different spatial
correlation scenarios among individual defects,
which has not been done before in Bayesian updating
for energy pipelines. The contents of this paper
are structured as follows. Section 2 defines the
uncertainties concerning the inspection data. The
formulations of the NHPP and HGP models for the
defect generation and growth, respectively, are
presented in Section 3. The hierarchical Bayesian
method for updating the model parameters by means
of BUS-SuS and DA is presented in Section 4. An
application based on simulated ILI data
corresponding to a gas pipeline is presented in
Section 5, in order to illustrate and validate the
proposed methodology. Finally, conclusions are
drawn in Section 6.


2 UNCERTAINTIES OF INSPECTION 
DATA 


2.1 Measurement error 


Inspection data for corrosion defects on gas pipe- 
lines are subject to random measurement errors 
because of the accuracy limitations of the ILI tool. 
Thus, the measured size of a defect is expected to 
differ from its actual size. Based on a measured size, 
the actual defect size can be evaluated through the 
following (Qin et al. 2015, Zhang and Zhou 2013): 


$$d_{ik} = a_i + b_i c_{ik} + \varepsilon_{ik} \qquad (1)$$


for a corrosion defect k (k = 1, 2, ..., n) that is
detected in the ith (i = 1, 2, ..., m) inspection,
where d_ik and c_ik are the measured and actual
depths, a_i and b_i denote the constant and
non-constant biases, respectively, associated with
the defect depth and ε_ik is the random scattering
error associated with the measured depth. It should
be noted that if a_i = 0 and b_i = 1 the tool is
unbiased. The random scattering errors associated
with different defects for a given ILI tool were
assumed to be mutually independent and those
associated with different ILI tools for a given
defect were also assumed to be mutually independent
(Stephens & Nessim 2006).

Given a measured defect k, the probability
distribution of the random scattering error ε_ik in
the ith inspection can be determined from tool
specifications that characterize the sizing
accuracy in terms of the probability p that the
error will fall within prescribed bounds e_min and
e_max. The mean value (μ) and standard deviation
(σ) of the random scattering error can then be
determined for any distribution type. Assuming that
the errors follow a multivariate normal
distribution, they are given by:


$$\mu = (e_{\min} + e_{\max}) / 2 \qquad (2)$$


$$\sigma = \frac{e_{\max} - \mu}{\Phi^{-1}\left((1 + p)/2\right)} \qquad (3)$$


where Φ^{-1} is the inverse standard normal
distribution function (Stephens & Nessim, 2006).
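
As an illustration, the two moments can be computed from a tool specification with Python's standard library. The sketch assumes a normally distributed error and takes equation (3) in the form σ = (e_max - μ) / Φ^{-1}((1 + p)/2), where p is the probability that the error falls within the bounds; this form follows from the normality assumption and is not quoted from the tool specification. The example specification (±10 %wt with 80% probability) is hypothetical.

```python
from statistics import NormalDist


def scattering_error_moments(e_min, e_max, p):
    """Mean and standard deviation of a normal error that lies in
    [e_min, e_max] with probability p (bounds symmetric about the mean)."""
    mu = (e_min + e_max) / 2.0
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)  # inverse standard normal CDF
    sigma = (e_max - mu) / z
    return mu, sigma


# Hypothetical specification: error within +/-10 (%wt) with 80% probability.
mu, sigma = scattering_error_moments(-10.0, 10.0, 0.80)

# Consistency check: probability mass of N(mu, sigma) inside the bounds.
error = NormalDist(mu, sigma)
coverage = error.cdf(10.0) - error.cdf(-10.0)

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}, coverage = {coverage:.2f}")
```

The coverage check recovers the specified probability, which confirms that the assumed form of equation (3) is consistent with the stated definition of the bounds.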


2.2 Probability of detection 


Probability of detection (PoD) is related to the 
capacity of an ILI tool to detect a metal-loss cor- 
rosion defect (Zhang & Zhou 2013). It can be 
defined as a function of the defect size together 
with constants that refer to the tool’s accuracy. The 
following exponential function can be defined (Qin 
et al. 2015): 


$$\mathrm{PoD}(c) = 1 - e^{-q (c - c_{th})} \quad \text{for } c > c_{th} \qquad (4)$$


The value of q refers to the inherent tool
detection capacity and can be specified from
vendor-supplied tool specifications. In addition, c
denotes the actual defect depth and c_th is the
detection threshold, i.e. the minimum detectable
defect depth.
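
A minimal Python sketch of such a PoD curve is given below. It assumes the exponential form PoD(c) = 1 - exp(-q (c - c_th)) for depths above the threshold and a PoD of zero at or below it; both the shifted exponent and the zero value below the threshold are assumptions made for this illustration, and the values of q and the threshold are hypothetical.

```python
import math


def probability_of_detection(depth, q, threshold):
    """Assumed exponential PoD: zero at or below the detection threshold,
    1 - exp(-q * (depth - threshold)) above it."""
    if depth <= threshold:
        return 0.0
    return 1.0 - math.exp(-q * (depth - threshold))


# Hypothetical tool: q = 2.3 per mm, detection threshold 0.5 mm.
q, threshold = 2.3, 0.5
depths = [0.25, 0.5, 1.0, 1.5, 3.0]
pods = [probability_of_detection(c, q, threshold) for c in depths]

for c, pod in zip(depths, pods):
    print(f"depth {c:.2f} mm -> PoD {pod:.3f}")
```

In practice q would be calibrated so that the curve passes through a point stated in the vendor specification, for example a given PoD at a given depth.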


3 DEFECT GENERATION AND GROWTH 


3.1 Stochastic defect generation 


A non-homogeneous Poisson process (NHPP) was
employed to model the generation of new defects on
a pipe segment over time (Qin et al. 2015, Tee &
Pesinis 2017). The total number of defects N(t) in
a time interval [0, t] is assumed to follow a
Poisson distribution with a probability mass
function (PMF), f(N(t)|Λ(t)):


$$f(N(t)\,|\,\Lambda(t)) = \frac{\Lambda(t)^{N(t)}\, e^{-\Lambda(t)}}{N(t)!} \qquad (5)$$


where Λ(t) is the expected number of defects
generated over the time interval [0, t] and
λ(τ) = λ_0 τ^δ is the instantaneous generation
rate:


$$\Lambda(t) = \int_0^t \lambda_0 \tau^{\delta}\, d\tau \qquad (6)$$


with the parameters λ_0 and δ quantified from the
ILI data.

In the case of m ILIs that have taken place over a
certain period, it was assumed that each inspection
can detect new and existing defects, in terms of
their spatial positions (Qin et al. 2015). The
total number of defects X_i at the time t_i of the
ith inspection (i = 1, 2, ..., m) can be divided
into those defects that initiated prior to the
(i-1)th inspection, X_i^o, and those that initiated
between the (i-1)th and ith inspections, X_i^n. The
value of X_i^n can be evaluated by assuming that it
follows a Poisson distribution with PMF given by:


$$f(X_i^{n}\,|\,\lambda_0, \delta) = \frac{(\Delta\Lambda_i)^{X_i^{n}}\, e^{-\Delta\Lambda_i}}{X_i^{n}!}, \quad \Delta\Lambda_i = \int_{t_{i-1}}^{t_i} \lambda_0 \tau^{\delta}\, d\tau \qquad (7)$$


The number of detected defects is typically smaller
than the actual number because of the imperfect
detectability of the ILI tool. Let X_i^d and X_i^u
be the detected and undetected parts, respectively,
of the total number X_i^n. Following the Poisson
splitting property (Qin et al. 2015, Kulkarni
1995), X_i^d and X_i^u are assumed to follow
Poisson distributions with the respective PMFs:


$$f(X_i^{d}\,|\,\lambda_0, \delta) = \frac{(\overline{\mathrm{PoD}}_i\, \Delta\Lambda_i)^{X_i^{d}}\, e^{-\overline{\mathrm{PoD}}_i\, \Delta\Lambda_i}}{X_i^{d}!} \qquad (8)$$

$$f(X_i^{u}\,|\,\lambda_0, \delta) = \frac{\left[(1 - \overline{\mathrm{PoD}}_i)\, \Delta\Lambda_i\right]^{X_i^{u}}\, e^{-(1 - \overline{\mathrm{PoD}}_i)\, \Delta\Lambda_i}}{X_i^{u}!} \qquad (9)$$


where PoD_i with the overline refers to the average
PoD that corresponds to the X_i^n defects and can
be evaluated from:


$$\overline{\mathrm{PoD}}_i = \int \mathrm{PoD}(x)\, f_{X_i}(x)\, dx \qquad (10)$$


where f_{X_i}(x) denotes the probability density
function (PDF) of the depths of the X_i^n defects
at time t_i.
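
The generation model and the Poisson splitting step can be sketched as follows. The sketch assumes a power-law generation rate λ_0 τ^δ, so that the expected number of defects in [0, t] is λ_0 t^(δ+1) / (δ+1); this closed form, the parameter values, the inspection times and the average PoD are assumptions chosen for illustration, and the function names are not from the paper.

```python
import math


def expected_defects(t, lam0, delta):
    """Expected number of defects in [0, t] for an assumed rate lam0 * tau**delta."""
    return lam0 * t ** (delta + 1.0) / (delta + 1.0)


def poisson_pmf(n, mean):
    """Poisson PMF evaluated in log space for numerical stability."""
    return math.exp(n * math.log(mean) - mean - math.lgamma(n + 1))


def split_means(t_prev, t_now, lam0, delta, mean_pod):
    """Poisson splitting: means of detected and undetected new defects."""
    increment = expected_defects(t_now, lam0, delta) - expected_defects(t_prev, lam0, delta)
    return mean_pod * increment, (1.0 - mean_pod) * increment, increment


# Hypothetical values: inspections at years 10 and 15, average PoD of 0.8.
lam0, delta = 0.5, 0.2
detected_mean, undetected_mean, increment = split_means(10.0, 15.0, lam0, delta, 0.8)

# Probability of detecting exactly three new defects in the second inspection.
p_three = poisson_pmf(3, detected_mean)
pmf_total = sum(poisson_pmf(n, detected_mean) for n in range(200))

print(f"new defects expected: {increment:.2f}")
print(f"detected mean: {detected_mean:.2f}, undetected mean: {undetected_mean:.2f}")
print(f"P(3 detected) = {p_three:.3f}")
```

The detected and undetected means always add up to the expected number of new defects, which is the property that lets the undetected defects be handled separately as missing data.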


3.2 Stochastic defect growth 


The depth c(t) of a defect at year t (t = 0
corresponds to the installation year of the
pipeline) was assumed to follow a homogeneous gamma
process (Zhang & Zhou 2014). The PDF of c(t) is
gamma distributed:


$$f(c(t)\,|\,\gamma(t), \omega) = \frac{\omega^{\gamma(t)}\, c(t)^{\gamma(t)-1}\, e^{-\omega c(t)}}{\Gamma(\gamma(t))}\, I_{(0,\infty)}(c(t)) \qquad (11)$$


with γ(t) representing the time-dependent shape
parameter, which is a linear function of time:

$$\gamma(t) = \alpha (t - t_0) \qquad (12)$$

where t_0 denotes the initiation time of a defect.
In addition, ω (ω > 0) is the time-independent rate
parameter or the inverse of the scale parameter,
Γ(·) denotes the gamma function and
I_(0,∞)(c(t)) is the indicator function (i.e. it is
equal to unity if c(t) > 0 and zero otherwise). The
quantity γ(t)/ω represents the mean of the defect
depth. The parameter α was assumed to be common for
all defects, whereas t_0 and ω were assumed to be
defect-specific. t_0z and ω_z were defined with
respect to the initiation time and rate parameter
for the zth defect, as opposed to the index k that
was used to characterise detected defects, i.e. z
runs over the total number of defects (both
detected and undetected).

The growth Δc_iz of the zth defect between the
(i-1)th and ith inspections is gamma distributed
with a time-dependent shape parameter Δγ_iz as
follows:


$$\Delta\gamma_{iz} = \alpha (t_i - t_y) \qquad (13)$$


where t_y is a quantity that satisfies
t_{i-1} < t_y < t_i. The depth c_iz of each defect
z at the time of each inspection i is the sum of
consecutive incremental depths between the (i-1)th
and ith inspections:


$$c_{iz} = c_{i-1,z} + \Delta c_{iz} \qquad (14)$$
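
The incremental construction of the defect depth can be illustrated with a short simulation. The sketch assumes that each increment between consecutive time points is gamma distributed with shape α Δt and rate ω, and that a defect has zero depth before its initiation time; the parameter values and inspection times are hypothetical, and the function name is illustrative. Python's `random.gammavariate` takes a shape and a scale, so the scale is passed as 1/ω.

```python
import random


def simulate_depths(alpha, omega, t_init, inspection_times, rng):
    """Simulate depths of one defect at the inspection times, summing
    independent gamma increments with shape alpha * dt and rate omega."""
    depths = []
    depth = 0.0
    t_prev = t_init
    for t in inspection_times:
        if t > t_prev:
            shape = alpha * (t - t_prev)
            depth += rng.gammavariate(shape, 1.0 / omega)  # scale = 1 / rate
            t_prev = t
        depths.append(depth)  # stays zero before the initiation time
    return depths


rng = random.Random(2018)
alpha, omega = 0.5, 2.0          # hypothetical growth parameters
inspections = [10.0, 15.0, 20.0]  # hypothetical inspection years

path = simulate_depths(alpha, omega, 4.0, inspections, rng)
late_path = simulate_depths(alpha, omega, 12.0, inspections, rng)

# Sample mean of the depth at the first inspection over many defects.
samples = [simulate_depths(alpha, omega, 4.0, inspections, rng)[0] for _ in range(5000)]
sample_mean = sum(samples) / len(samples)

print("depths:", [round(c, 3) for c in path])
print("late initiation:", [round(c, 3) for c in late_path])
print(f"sample mean at first inspection: {sample_mean:.3f}")
```

The sample mean at the first inspection should be close to α (t - t_0) / ω, the mean implied by the shape and rate parameters, and every simulated path is non-decreasing because gamma increments are positive.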


4 BAYESIAN ANALYSIS 


4.1 Likelihood functions 


Given the number of detected defects in a total of
m inspections and assuming a defect k is first
detected in the ith (i = 1, 2, ... or m)
inspection, d_k = (d_ik, d_{i+1,k}, ..., d_mk) and
c_k = (c_ik, c_{i+1,k}, ..., c_mk) can be defined
that denote the vector of the ILI-reported depths
of defect k and the vector of the corresponding
actual depths, respectively. Considering the
measurement error, the likelihood of d_k
conditional on c_k can be defined as follows:


$$L(d_k\,|\,c_k) = (2\pi)^{-\frac{m-i+1}{2}}\, |\Sigma_{E_k}|^{-\frac{1}{2}} \exp\left[-\tfrac{1}{2}\,(d_k - a - b\,c_k)^{T}\, \Sigma_{E_k}^{-1}\, (d_k - a - b\,c_k)\right] \qquad (15)$$


where a = (a_i, a_{i+1}, ..., a_m), b is an
(m-i+1) × (m-i+1) diagonal matrix with the jth
diagonal element equal to b_j, and Σ_{E_k} is the
covariance matrix of the random scattering errors.

The updating of the number of detected defects inherently contains the possibility that some of the defects newly detected in the ith inspection were generated prior to the (i-1)th inspection but remained undetected until then. However, it was assumed that the defects newly detected in the ith inspection initiated between the (i-1)th and ith inspections. This is a conservative approach, since it leads to an overestimation of the instantaneous rate of the generation model (Qin et al. 2015). The likelihood function for the newly detected defects in m inspections, $X_i^{d}$ (i = 1, 2, ..., m), was defined as follows:


$L(X_i^{d} \mid \lambda_0, \delta) = \ldots$ (16)


where $\overline{PoD}_i$ accounts for both detected and undetected defects in the ith inspection. The undetected defects were treated as missing data, and the data augmentation (DA) technique was employed to incorporate them in the Bayesian analysis (Tanner & Wong 1987). This technique is described in more detail in Section 4.3. Thus, the PoD serves as the link between the generation and growth models in the Bayesian updating.


4.2 Prior distributions 


The gamma distribution was selected as the prior distribution of the parameters $\alpha$ and $\omega_z$ of the growth model and of the parameters $\lambda_0$ and $\delta$ of the generation model. A truncated normal distribution, with an upper bound equal to the time of the inspection in which each defect is detected for the first time and a lower bound equal to the time of the previous inspection, was chosen as the prior distribution of $t_{0z}$. The prior distributions of $t_{0z}$, $\alpha$ and $\omega_z$ for individual defects were assumed to be mutually independent. Each gamma prior distribution, for $\alpha$, $\omega_z$, $\lambda_0$ and $\delta$, was defined by its own shape and rate parameters. Dependence among defect behavior is modelled through correlation coefficients among the HGP model parameters. In this study, the model parameters $\omega_k$ were considered equi-correlated among all defects (k = 1, 2, ..., n) with correlation coefficient $\theta$. The correlation coefficient represents the statistical dependence due to common fabrication quality, common material characteristics and common loading characteristics. The joint distribution of all model parameters is subsequently modelled through a Nataf (Gaussian copula) model (Qian et al. 2013).


4.3 Updating with BUS-SuS and DA technique 


Bayes' rule enables updating the joint prior distribution of the parameters $\eta$, which correspond to both the generation and growth models, based on the inspection data C. The prior distribution $f(\eta)$ is converted into a posterior distribution $f(\eta \mid C)$, based on the following (Straub & Papaioannou 2014):

$f(\eta \mid C) = \dfrac{L(C \mid \eta)\, f(\eta)}{\int_{\Omega} L(C \mid \eta)\, f(\eta)\, d\eta}$ (17)

where $\Omega$ denotes the domain of definition of $\eta$.

In this study, the Bayesian updating was conducted with BUS-SuS combined with the aforementioned DA technique. Together they constitute a robust hybrid simulation technique for evaluating numerically the complex denominator of Eq. 17, for which an analytical evaluation is not feasible. Bayesian updating with BUS-SuS is an extension of the classical rejection sampling approach to Bayesian analysis (Straub & Papaioannou 2014, Straub et al. 2016). It is advantageous over MCMC, which is typically used in the energy pipeline literature, because it reduces the uncertainty about whether the final samples have reached the posterior distribution and therefore provides more accurate samples of the posterior. The simple rejection sampling algorithm, however, is quite inefficient. Subset simulation, which can compute very small probabilities, is therefore employed. Thus, the Bayesian updating becomes very efficient while maintaining the advantages of the simple rejection sampling algorithm (Straub & Papaioannou 2014, Straub et al. 2016).

A set of random samples of $\eta$ is generated from the prior distribution, and some of them are then accepted as samples of the posterior distribution. Each prior sample is accepted with probability $p = cL(\eta)$, where c is a positive constant that ensures $cL(\eta) \le 1$ for all $\eta$. If c is selected too small, the efficiency of the method might decrease, since the acceptance rate is a linear function of c. If c is too large, the resulting samples might not follow the posterior distribution. An optimal choice of c is considered to be (Straub & Papaioannou 2014):

$c = \dfrac{1}{\sup L(\eta)}$ (18)

For a single ILI with error $\varepsilon$, $\sup L(\eta)$ is equal to the maximum of the PDF of $\varepsilon$. For the multiple ILIs of this study, $\sup L(\eta)$ was taken to be equal to the maximum of the multivariate PDF of the error $\varepsilon$. This error is spatially independent and follows a multivariate normal distribution with zero mean and known covariance matrix $\Sigma_{\varepsilon}$ associated with the total number of inspections m.
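The acceptance rule with the constant c of Equation 18 can be illustrated on a toy problem for which the posterior is known in closed form. The sketch below uses plain rejection sampling, not SuS, and the scalar normal prior, the single observation and the error standard deviation are hypothetical choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setting (assumed): prior eta ~ N(0, 1), one measurement y_obs = eta + error,
# with error ~ N(0, sigma^2)
y_obs, sigma = 1.0, 0.5

def likelihood(eta):
    """Normal measurement-error likelihood L(eta) of the observation y_obs."""
    return np.exp(-0.5 * ((y_obs - eta) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# sup L equals the maximum of the error PDF, so c = 1 / sup L
c = sigma * np.sqrt(2 * np.pi)

n = 200000
eta = rng.standard_normal(n)      # samples from the prior
p = rng.uniform(size=n)           # standard uniform auxiliary variable
accepted = eta[p <= c * likelihood(eta)]

# Conjugate closed-form posterior for comparison
post_mean = y_obs / (1.0 + sigma**2)
post_var = sigma**2 / (1.0 + sigma**2)
print(accepted.mean(), post_mean)
print(accepted.var(), post_var)
print(accepted.size / n)          # acceptance rate
```

In this toy case roughly 30% of the prior samples are accepted. The acceptance rate drops quickly as more data are included, which is the inefficiency that motivates the use of subset simulation.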

After defining the augmented outcome space $[\eta; p]$ and the domain $\{p \le cL(\eta)\}$, a structural reliability problem results, with the posterior distribution obtained by censoring the joint distribution of p and $\eta$ to $\{p \le cL(\eta)\}$ and taking the marginal distribution of $\eta$:

$f(\eta \mid C) \propto \int_0^1 I\{p \le cL(\eta)\}\, f(\eta)\, dp$ (19)

where I is an indicator function which takes the value 1 if $p \le cL(\eta)$ and 0 otherwise (p is a standard uniform random variable). In the structural reliability convention, the domain $\{p \le cL(\eta)\}$ describes an observation event (in terms of the Bayesian updating) through a limit state function r, such that it corresponds to the domain $\{r(p, \eta) \le 0\}$:

$r(p, \eta) = p - cL(\eta)$ (20)


The probability of the observation event can be efficiently computed by a robust method such as SuS. It is typically efficient to apply SuS in the standard normal space. Therefore, in this study the outcome space of the original random variables p and $\eta$ was transformed to a space of independent standard normal random variables V (p and $\eta$ are independent and thus they can be transformed separately). The transformed observation domain is described by a function U:

$U(v) = v - \Phi^{-1}\big(cL(T(v))\big)$ (21)

where $\Phi$ is the standard normal CDF.
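The equivalence between the observation domain in the original space and in the standard normal space can be checked numerically. In the sketch below the value of cL is drawn at random as a hypothetical stand-in for the scaled likelihood, and the function names are illustrative only. The check relies solely on the monotonicity of the inverse standard normal CDF.

```python
import random
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution

def limit_state_original(p, c_l):
    """r = p - c*L, evaluated for a given value c_l of the scaled likelihood."""
    return p - c_l

def limit_state_normal(v, c_l):
    """U = v - Phi^{-1}(c*L), where v = Phi^{-1}(p) is the normal image of p."""
    return v - PHI.inv_cdf(c_l)

random.seed(0)
agree = []
for _ in range(10000):
    p = random.uniform(1e-6, 1.0 - 1e-6)   # standard uniform auxiliary variable
    c_l = random.uniform(0.01, 0.99)       # hypothetical value of c*L
    v = PHI.inv_cdf(p)                     # transformation to standard normal space
    same_side = (limit_state_original(p, c_l) <= 0) == (limit_state_normal(v, c_l) <= 0)
    agree.append(same_side)

print(all(agree))
```

Both limit state functions therefore define the same observation event, which is what allows SuS to operate entirely in the standard normal space.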

An algorithm based on SuS can subsequently generate samples from the transformed observation domain. SuS is a well-established method and a detailed step-by-step presentation of the method can be found in Au & Beck (2001) and Straub & Papaioannou (2014). It should be noted that the final quantity of interest in this study is the set of samples that belong to the posterior distribution. As a result, a final step should be defined. In that extra step, additional samples conditional on the output domain $\{p \le cL(\eta)\}$ are produced, so that (at least) the required total number of samples from the posterior distribution is obtained.

Both the detected and undetected defects were considered for the Bayesian updating in this study. The depths of the detected defects were related to the ILI-reported depths through the likelihood function given by Equation 15, while the real depths of the undetected defects were treated as missing data and imputed using the DA technique (Tanner & Wong 1987). As a result, the joint posterior distribution of the model parameters was evaluated from the depths of the total defect population (both detected and undetected). It should be noted that it is straightforward to couple DA with BUS-SuS. DA is an iterative process in which each iteration contains two steps: the imputation step and the posterior step. The imputation step generates samples of the missing data from their probability distribution conditional on the current state of the model parameters, whereas the posterior step generates new samples of the model parameters from their posterior distributions conditional on both the observed and the missing data. More information and details of the DA process can be found in Tanner & Wong (1987), Rubin (2004) and Little & Rubin (2014).
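The two DA steps can be sketched on a deliberately simplified problem. The setting below is an assumption made for illustration and is not the model of this study: defect depths are taken as exponentially distributed, defects shallower than a fixed threshold are undetected, the number of undetected defects is assumed known, and the rate parameter has a conjugate gamma prior so that the posterior step can be sampled directly instead of with BUS-SuS.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical synthetic data: exponential depths, detection threshold tau
true_rate, tau, n_total = 1.0, 0.5, 500
depths = rng.exponential(1.0 / true_rate, size=n_total)
detected = depths[depths > tau]          # observed data
n_und = n_total - detected.size          # number of undetected defects (assumed known)

a, b = 1.0, 1.0                          # gamma prior shape and rate (assumed)
rate = 2.0                               # arbitrary starting value
samples = []
for _ in range(2000):
    # Imputation step: draw the missing depths from an exponential distribution
    # truncated to (0, tau), conditional on the current rate (inverse CDF method)
    u = rng.uniform(size=n_und)
    missing = -np.log1p(-u * (1.0 - np.exp(-rate * tau))) / rate
    # Posterior step: draw the rate from its gamma posterior conditional on
    # the observed and the imputed depths
    post_shape = a + n_total
    post_rate = b + detected.sum() + missing.sum()
    rate = rng.gamma(post_shape, 1.0 / post_rate)
    samples.append(rate)

posterior = np.array(samples[500:])      # discard the first iterations
print(posterior.mean(), posterior.std())
```

With these assumptions the posterior mean of the rate settles close to the value used to generate the synthetic data, even though the depths below the threshold are never observed.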


5 APPLICATION 


5.1 Case study 


Simulated data are used to illustrate and validate the methodology. A gas pipe segment is considered, with 609 mm diameter and 7.9 mm wall thickness, made of pipe material API 5L grade X52. Five future ILIs were assumed to take place in years t = {4, 7, 9, 11, 15} after the installation of the pipeline. The growth of individual defects was assumed to follow a linear random variable corrosion growth model, with the growth rate of each defect selected from a uniform distribution with lower bound 0.29 mm/year and upper bound 0.50 mm/year. Metal-loss corrosion defects were assumed to initiate based on the NHPP model, with the parameters set deterministically to $\lambda_0$ = 0.2 and $\delta$ = 1.0, and to grow with the aforementioned individual random rate.

Furthermore, a depth detection threshold of 1 mm and a PoD of 98.3% were considered. The measurement error was assumed to fall within the bounds of ±1 mm with 90% probability. Therefore g = 4.0785 in Equation 4 and the measurement error follows a multivariate normal distribution with mean zero and standard deviation $\sigma_{\varepsilon}$ = 7.29%wt. The constant and non-constant biases included in the measurement error given by Equation 1 were assumed to be equal to zero and unity, respectively, for all inspections, while the random scattering errors associated with different inspections were defined as mutually independent with the same standard deviation of unity. Table 1 summarises the simulated data.


5.2 Validation of Bayesian formulation 


The Bayesian analysis was carried out for the defect generation and growth models, based on the ILI data. The shape and scale parameters of the gamma prior distributions for the model parameters $\alpha$, $\omega_z$ and $\lambda_0$, $\delta$ were all set to unity. A total of 20,000 samples were generated to evaluate the probabilistic characteristics of the model parameters. The means, medians and standard deviations of the posterior marginal distributions of the parameters $\lambda_0$ and $\delta$ of the NHPP model, together with the average PoD for the defects generated prior to the first inspection year and between the subsequent inspection years, respectively, are summarised in Table 2. Furthermore, the same information for the HGP parameter $\alpha$, which is common for all defects, is also presented in Table 2. When compared with the actual (i.e. simulated) values, the posterior mean and median values of $\delta$ and $\lambda_0$ are considered to be in good agreement, which validates the Bayesian formulation described in Section 4. This is further illustrated in Figure 1 and Figure 2 for the NHPP and HGP models, respectively.

In Figure 1, results are presented for the period from the time the generation of new defects initiates to the last inspection year (i.e. year 15). The mean values of the number of generated defects were estimated with the values of $\lambda_0$ and $\delta$ set equal to their corresponding posterior medians. For comparison, $\Lambda(t)$ evaluated with the actual values of $\lambda_0$ and $\delta$, the simulated total number (i.e. detected and undetected) of defects, and the simulated numbers of detected defects in the five ILIs are also illustrated in Figure 1. The predicted $\Lambda(t)$ agrees very well with the actual mean, which supports the validity of the Bayesian model.

Table 1. Summary of the input simulated ILI data.

Time of inspection                    Year 4   Year 7   Year 9   Year 11   Year 15
Number of detected defects            0        3        53)      868)      146)
Measured depth (mm), mean             0        1.32     2.31     2.66      2.74
Measured depth (mm), std. deviation   0        0.07     0.86     1.14      1.48

Table 2. Posterior data of the model parameters.

            NHPP                 PoD
Parameter   λ0 (0.2)   δ (1.0)   1      2      3      4      5
Mean        0.30       0.81      0.03   0.99   0.41   0.72   0.73
Median      0.30       0.80      0.03   0.97   0.41   0.72   0.73
Std         0.28       0.17      0.08   0.12   0.07   0.07   0.07

Figure 1. Comparison of predicted and actual number of detected and undetected defects.

Figure 2. Comparison of the predicted depths and actual depths of defects at year 15.

Next, the depths of the detected defects at year 15 were evaluated and compared with the corresponding actual defect depths. Each predicted defect depth was set equal to the corresponding mean derived from the HGP growth model, with the parameters of the model (i.e. $\alpha$, $\omega_k$ and $t_{0k}$) equal to the respective posterior medians. In the Bayesian updating, the constant c of Equation 18 was evaluated for each inspection year, based on the procedure described in Section 4.3, and the respective values are listed in Table 3. From Figure 2 it can be observed that the model predictions are in good agreement with the actual values. In fact, all predicted defect depths fall within ±10%wt of the corresponding actual depths, which is a commonly adopted confidence interval for the accuracy of ILI tools in industry (Zhang & Zhou 2013). In summary, the methodology is validated against the corresponding actual data and its accuracy is verified.

Figure 3. Predicted growth paths of defects 1 and 4 based on the HGP model.

For illustration purposes, two defects were chosen for analysis and their mean depths were estimated, based on the results of BUS-SuS for the same 15-year period. The mean and standard deviation of the defect depths were defined as $\alpha(t - t_{0k})/\omega_k$ and $(\alpha(t - t_{0k})/\omega_k^2)^{0.5}$, respectively, for each defect, based on the HGP corrosion growth model. The parameters $\alpha$, $\omega_k$ and $t_{0k}$ were set equal to the deterministic median values of their corresponding marginal posterior distributions derived from BUS-SuS. Also, 80 random realisations of the HGP model are illustrated for each defect. It is observed that for both defects, the BUS-SuS predicted depths and the actual depths are in very good agreement, even though the ILI data are not close to the actual depths. Thus, the proposed model can successfully account for the imperfections of the inspection data and capture the actual initiation times and growth rates of the defects.

Table 3. Constant c values for different ILI years.

Time of inspection   Year 4   Year 7   Year 9   Year 11   Year 15
Constant c           0        6.2796   4.3460   3.0084    2.0840

Figure 4. Comparison of the predicted depths and actual depths at year 15 for different correlation coefficients among defects.

Furthermore, in order to investigate the effect of the correlations among defects on the growth model, different dependency scenarios were examined. It was assumed that $c_{ik}$ (i = 1, 2, ..., m), (k = 1, 2, ..., n) are identically distributed and equi-correlated, with the corresponding coefficient $\theta$ set equal to 0.25, 0.5, 0.75 and 0.0 (0.0 corresponds to independent and identically distributed samples).

Figure 4 compares the predictions of the growth model (i.e. HGP) at year 15 for the different correlation scenarios. The Mean Squared Error of Prediction (MSEP), defined as:

$MSEP = \dfrac{1}{n}\sum_{k=1}^{n} (c_{pk} - c_{ak})^2$ (22)

was used to evaluate quantitatively the prediction accuracy of the growth model, where $c_{pk}$ and $c_{ak}$ denote the predicted and actual depths of the kth defect. The MSEP values are shown in brackets in Figure 4. The lower the MSEP value, the higher the accuracy. It is observed that the most accurate correlation scenario is $\theta$ = 0.0. Nevertheless, neither the depth predictions nor the MSEP results indicate a clear trend with respect to the correlation scenarios. In fact, for almost half of the defects, the depth predictions differ negligibly among the correlation scenarios. However, for all correlation scenarios the methodology proposed for the updating of the growth model is validated against the actual data. It should be noted that for all correlation scenarios the posterior NHPP generation model results remained unchanged. Therefore, it is thought that correlations do not significantly affect the updating itself, but the selection of a specific correlation structure is likely to affect the accuracy of the growth model upon updating.
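As a small worked example of Equation 22, the MSEP can be computed as follows. The depth values are hypothetical and serve only to show the calculation.

```python
import numpy as np

def msep(predicted, actual):
    """Mean squared error of prediction between predicted and actual depths."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    if predicted.shape != actual.shape:
        raise ValueError("predicted and actual depths must have the same shape")
    return float(np.mean((predicted - actual) ** 2))

# Hypothetical predicted and actual depths (mm) of three defects
predicted_mm = [2.1, 3.4, 1.8]
actual_mm = [2.0, 3.0, 2.0]
value = msep(predicted_mm, actual_mm)
print(value)  # (0.1^2 + 0.4^2 + 0.2^2) / 3, approximately 0.07
```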


6 CONCLUSIONS 


A stochastic process-based hierarchical Bayesian methodology was proposed for corrosion management of gas pipelines. The generation of metal-loss corrosion defects was characterised by an NHPP and their growth by an HGP. Furthermore, the Nataf model was used for the spatial dependence among defects. The proposed hierarchical Bayesian framework captures the imperfect detectability of the ILI tool, as defined by the PoD, as well as the measurement errors associated with the ILI data. The Bayesian updating was performed by employing BUS-SuS in conjunction with the data augmentation technique. The proposed model was applied to simulated defects and multiple ILIs. Different defects were assumed to be identically distributed and equi-correlated, with the coefficient $\theta$ equal to 0.25, 0.5, 0.75 and 0.0. Results from the Bayesian updating of the generation and growth models indicate that the predicted overall defect population corresponding to the base case (i.e. $\theta$ = 0.0) is in good agreement with the actual defect population, which validates the proposed methodology. The other three correlation scenarios also lead to predictions in good agreement with the actual data, and therefore the Bayesian methodology is accurate irrespective of the correlation scenario. Nonetheless, the MSEP identifies some discrepancies in the accuracy of the predictions among the different scenarios. The most accurate of the four scenarios was the one with $\theta$ equal to 0.0.

ABSTRACT: In the deep geological disposal facility for radioactive waste to be built in France, it is planned to encapsulate high-level waste in carbon-steel overpacks before inserting them into horizontal micro-tunnels. The main function of the overpack is to isolate the waste from the environment long enough for its radio-toxicity and heat to decrease significantly. Several phenomena affect the overpack during its lifetime. The prediction and modelling of these phenomena are subject to many uncertainties due to their complexity and their extrapolation over long time periods (several centuries). This paper presents an overall framework for the estimation of the failure probability of the overpacks over time. The failure scenario considered is the fracture of the overpack. The analysis is based on a parameterized finite element model taking into account the evolution of the geometry due to the non-uniform corrosion process, the evolution of the mechanical loading and their variability. In addition, the manufacturing process may induce defects, which are modeled as cracks and included in the model of uncertainty. The whole process is run for several time steps until the complete failure of the overpack. The reliability estimation is based on a two-level Monte Carlo analysis and on a fracture mechanics criterion based on a probabilistic critical stress intensity factor.


1 INTRODUCTION 


Nuclear plants are one of the principal sources of electricity in many countries and are responsible for most of the electricity production in France. The nuclear industry produces 60% of French radioactive waste, while 27% comes from research activities, 9% from defense, 3% from industry and 1% from medical activities (ANDRA 2015). Radioactive waste is categorized by its level of hazardousness. The European Council Directive 2011/70/EURATOM states that, at this time, deep geological storage is the safest option as the end point of the management of high-level waste. Cigeo is the French project for a deep geological disposal facility for radioactive waste. Safety and reliability are major goals of the project, and it is necessary to guarantee that the radioactive material will not be in contact with water during the early period of storage. In the project, high-level waste is planned to be conditioned in carbon-steel overpacks before being inserted into horizontal micro-tunnels. The main function of the overpack is to isolate the waste from the environment long enough for its radio-toxicity and heat to decrease significantly. This period is estimated at 500 years. Therefore, the reliability of the overpacks is an important element in the safety assessment of the overall Cigeo project.

For this project, studies are being conducted on the evolution of corrosion processes (type and kinetics), the properties of the steel grades and their behavior in repository conditions, the thermo-hydro-mechanical properties and behavior of the rock, and the prediction and characterization of the failure modes of the steel structures (buckling of the lining, cracking of the overpack). Reliability methods have not yet been explored in this project, although they are already widely used in other fields with comparable constraints, such as the oil and gas industry (Dundulis et al. 2016). The aim of this study is to estimate the failure probability of the overpacks, taking into account the aging process and the uncertainties of the system. The failure probability is investigated using a fracture mechanics criterion in order to take into account a possible Stress Corrosion Cracking (SCC) sensitivity of the steel.

The analysis presented in this paper proceeds as follows: a realization of the random variables is generated and the finite element analysis of the degraded model is performed for all the time steps over the lifetime. The stress field is recovered and the stress intensity factor is calculated at each time step for numerous randomly simulated cracks. The reliability of the system over time is estimated by a Monte Carlo method using two levels of simulation. At the first level, a limited sample of the random variables affecting the finite element analysis is evaluated. For each of these realizations (second level), the stress intensity factor of a random crack requires only post-processing. This strategy allows a large number of crack configurations to be analyzed with moderate numerical effort. The reliability is then evaluated at every time step against a random critical stress intensity factor criterion.
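The structure of the two-level Monte Carlo estimation can be sketched as follows. The stress function standing in for the finite element analysis, the orientation reduction of the stress, the geometry factor and the lognormal load factor are all hypothetical assumptions of this sketch. Only the crack depth of 2 mm, the uniform angles on [0, 180] degrees and the normal critical stress intensity factor (mean 40 MPa√m, coefficient of variation 0.1) follow values stated in this paper.

```python
import numpy as np

def toy_stress_field(load_factor, position_deg):
    """Hypothetical stand-in for the finite element result: tensile stress (MPa)."""
    return 600.0 * load_factor * np.abs(np.cos(np.radians(position_deg)))

def stress_intensity(stress_mpa, orientation_deg, depth_m=0.002, geometry_factor=1.12):
    """K = Y * sigma_n * sqrt(pi * a); the orientation reduction is an assumption."""
    normal_stress = stress_mpa * np.sin(np.radians(orientation_deg)) ** 2
    return geometry_factor * normal_stress * np.sqrt(np.pi * depth_m)

def two_level_failure_probability(n_outer, n_inner, rng):
    """Estimate P(K > K_crit) with an outer (model) and inner (crack) sampling loop."""
    failures = 0
    for _ in range(n_outer):
        # Level 1: one realization of the variables affecting the costly model
        load_factor = rng.lognormal(mean=0.0, sigma=0.2)
        # Level 2: many random cracks, evaluated by cheap post-processing
        position = rng.uniform(0.0, 180.0, n_inner)
        orientation = rng.uniform(0.0, 180.0, n_inner)
        k_crit = rng.normal(40.0, 4.0, n_inner)
        k = stress_intensity(toy_stress_field(load_factor, position), orientation)
        failures += np.count_nonzero(k > k_crit)
    return failures / (n_outer * n_inner)

pf = two_level_failure_probability(50, 2000, np.random.default_rng(3))
print(pf)
```

The point of the nesting is that the expensive model is evaluated only n_outer times, while the failure criterion is checked n_outer x n_inner times.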


2 STATEMENT OF THE PROBLEM 


The repository will be located 500 meters below ground in an impermeable argillaceous rock (Callovo-Oxfordian claystone) able to contain radioactivity over a very long period (ANDRA 2005). High-Level Waste (HLW) disposal cells consist of horizontal micro-tunnels cased with a steel liner (Fig. 1). A cement-based filling material that imposes corrosion-limiting environmental conditions is also injected between the rock and the lining. HLW is vitrified and sealed in stainless steel containers conditioned in low-alloy overpacks, which are inserted in the lining (Figs. 2-3). The whole system may be compared to Russian nesting dolls: in the center is the waste, then the container, the overpack, the lining, the filling material and finally the rock.

The corrosion and the evolution of the mechan- 
ical loading could affect the structural integrity 
and the reliability of the overpack. Therefore, it is 
important to describe and model the in-situ condi- 
tions of the system along its lifetime, the possible 
variations of this scenario and the possible failure 
causes. 

Overpacks are made of low-alloy steel to promote general corrosion in repository conditions. Water coming from the host rock will progressively fill the HLW cell, implying that two corrosion rates (either under or above water) affect the overpack. These two corrosion rates lead to a non-uniform corroded thickness around the overpack.

Figure 1. Overview of the storage facility.

Figure 2. Schematic cross-section of the system studied.

Figure 3. R7-T7 vitrified waste disposal package.

The evolution of the water level over time is the equilibrium result of several phenomena that are hard to predict, involving the saturation and water flow in the rock and the gas production by the corrosion processes. Therefore, uncertainties remain about the long-term stabilized height of the water level. A liquid water extraction system prevents the micro-tunnel from filling with water during the 100-year operating phase, so that the water level is zero for this period. The equilibrium of the gas and liquid phases has been studied for the intermediate-level waste disposal facilities of the same project, i.e. in the same rock (Croisé et al. 2011; Brommundt et al. 2014). These studies state that once the repository is sealed, the rock saturation will increase and water will flow in. After a while, an equilibrium between water and gas pressure will be reached and the water level will stabilize. This study focuses only on the case of the water level stabilizing at an intermediate value, filling half of the micro-tunnel (Fig. 4).

Corrosion also affects the steel lining, which is under the pressure of the rock. The reduction of its operational thickness will eventually cause buckling. The lining will then come into contact with the overpack and transmit the pressure. The contact area will progressively expand until it completely covers the overpack, which will then be directly subjected to the pressure of the rock. Once the micro-tunnel is sealed, it is also subjected to a fluid pressure.

Figure 4. Water level over time.

The overpack is made of 3 forged parts welded by electron beam. Each overpack is inspected for defects from either the welding or the forging process, but defects such as surface cracks may remain undetected if they are too small for the precision of detection. In this study, only the case of cracks with a constant depth over time is considered.

Various processes affect the geometry of the 
overpack and the mechanical loading. It is there- 
fore important to model all of these processes 
with their variability to use them as input param- 
eters of a finite element model. The study focuses 
only on the reliability facing a fracture criterion, 
to take into account a potential SCC sensitivity 
of the steel which is a conservative approach. To 
this end, a parameterized finite element model of 
the overpack has been developed. The corrosion 
process and its variability have been modeled as 
well as the mechanical loading over time. Cracks 
are positioned and numerically simulated in the 
highest tensile stress concentration area and the 
stress intensity factor is estimated using analytical 
equations. 


3 MODEL OF UNCERTAINTY 


Various uncertainties should be considered because 
of the novelty of the whole project, the complexity 
of the system and the long periods of time studied. 
These uncertainties can affect significantly the life- 
time of the overpack. Therefore, it is important to 
quantify and to properly model them. 


Two different strategies have been used to 
describe the variability of the system. Uncertain 
parameters are modelled by random variables. For 
corrosion rates and water level over time, the evo- 
lutions of the phenomena are uncertain because of 
the inherent complexity of the whole system. Sets 
of possible scenarios are described and each case 
can be studied individually. 

The corrosion rates are the result of several chemical processes which are hard to predict. The in situ corrosion process of the overpack has been studied by Schlegel et al. (2014) and Necib et al. (2017). These studies state that the corrosion rates decrease as a result of the pseudo-passivation of the metal surface. However, the precise path that this decrease follows is uncertain, and this study focuses only on a very severe estimated scenario (Fig. 5). In this scenario the corrosion rates are steady for 100 years because of a potential oxygen inflow from the access gallery during the operating phase of the HLW cell. Once the micro-tunnel is closed and the access drift backfilled, the corrosion rates decrease until they reach a new equilibrium. The uncertainty considered in this paper is related to the overall chemical activity of the system, where both corrosion rates are fully correlated. From expert opinion, the corrosion rates have a nominal value c and the interval which is the most likely to contain their actual value is [c/1.5, 1.5c]. Therefore, the uncertainty has been modelled by a multiplying factor A following a lognormal distribution with the median equal to one and affecting both corrosion rates. This way, the probability for the variable to be larger than 1.5 is equal to the probability for it to be less than 1/1.5. The standard deviation is then fitted so that 99.7% of the realizations are in the interval.
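The fitting of the lognormal multiplying factor can be illustrated with a short calculation. The sketch below assumes that the fitted quantity is the standard deviation of ln(A) and that the 99.7% coverage is central, i.e. split equally between the two tails. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def lognormal_sigma(upper_factor, coverage):
    """Standard deviation of ln(A) such that a lognormal factor with median one
    falls in [1 / upper_factor, upper_factor] with the given central coverage."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)   # about 2.97 for 99.7%
    return math.log(upper_factor) / z

sigma_ln = lognormal_sigma(1.5, 0.997)

# Check: ln(A) ~ N(0, sigma_ln^2), so the coverage of [1/1.5, 1.5] should be 99.7%
log_a = NormalDist(mu=0.0, sigma=sigma_ln)
inside = log_a.cdf(math.log(1.5)) - log_a.cdf(-math.log(1.5))
print(sigma_ln, inside)
```

Because the median is one, the interval is symmetric on the logarithmic scale, which is why a single bound (1.5) is sufficient to fix the distribution.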

The contact pressure P transmitted by the lining after buckling is uncertain. Three different studies have been carried out on this contact pressure. The dispersion of their results has been simplified into the confidence interval [0.5 P_nominal, 2 P_nominal], which is used for this study. The same strategy as described for the corrosion rates has been used to model the contact pressure with a lognormal distribution.

Figure 5. Evolution of the two corrosion rates (CR) over time.

The reliability analysis is performed using a fracture mechanics
criterion. A semi-elliptical crack with uncertain position and
orientation is simulated on the external surface. The size of the
crack is taken as deterministic, with both crack depth and length
equal to 2 mm. These values have been chosen according to the
largest acceptable crack for the best quality class described in
the European standard EN 10228-3 on non-destructive testing of
forged parts. To reduce the calculation time, the crack is
considered at a deterministic coordinate along the axis of the
cylinder ($z_0$). Preliminary simulations have highlighted a
tensile stress concentration localized at this specific position.
The position of the crack on the cross-section profile at
$z = z_0$ is defined by the angle $\theta_1$, the angle with
respect to the circumferential direction. The orientation of the
crack on the surface is defined by the angle $\theta_2$. It is
assumed that the crack does not have any preferential position or
orientation, and that all configurations are equally likely.
Therefore, both parameters are taken as independent random
variables following a uniform distribution on the interval
[0°; 180°]. Early studies (Necib et al. 2017) have been performed
to estimate the fracture toughness of the overpack under
representative corrosion conditions, $K_{ISCC}$. CT specimens
have been loaded to 40 MPa√m for 4000 hours, exhibiting very
limited propagation (< 150 µm). Further studies are in progress
but no statistical data are currently available. By default, a
normal distribution has been defined by the following parameters:


• mean value of $K_{ISCC}$ = 40 MPa√m
• coefficient of variation of 0.1
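
The crack-related random variables described above can be sampled as in the following minimal sketch. The function and variable names are hypothetical; only the distributions (two independent uniform angles on [0°; 180°] and a normal toughness with mean 40 MPa√m and coefficient of variation 0.1) come from the text.

```python
import random

def sample_crack_variables(rng, k_mean=40.0, k_cov=0.1):
    """Draw one crack: position angle, orientation angle (degrees), toughness (MPa*sqrt(m))."""
    theta_position = rng.uniform(0.0, 180.0)
    theta_orientation = rng.uniform(0.0, 180.0)
    toughness = rng.gauss(k_mean, k_cov * k_mean)  # standard deviation = CoV * mean
    return theta_position, theta_orientation, toughness

rng = random.Random(42)
draws = [sample_crack_variables(rng) for _ in range(10_000)]
mean_toughness = sum(d[2] for d in draws) / len(draws)
```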


4 FINITE ELEMENT MODEL 


The reliability analysis is based on a finite element model that
should be representative of the geometry and loading of the
system in operational conditions throughout its lifetime. To
reduce calculation time, the finite element analysis is limited
to the elastic behavior of the material. The goal is to retrieve
the stress field in the whole area where cracks will be
simulated. This model is evaluated at time steps of 100 years
until complete failure, which occurs when the corroded thickness
at one location reaches the initial thickness. All these
evaluations correspond to one realization of the random variables
affecting the finite element model. In order to perform a
reliability analysis, many of these realizations are required.
Therefore, the model must be as fast as possible and completely
parameterized so that it can be generated automatically.

Corrosion is taken into account as a reduction 
of the thickness, implying that the geometry of 
the overpack is changing over time. The corroded 
thickness is non-uniform around the overpack and 
depends on the time, the history of the water level, 
and the corrosion rates. The exact thickness of the 
cross-section can be evaluated at any point of the 
profile. However, the water level is assumed to be steady
after a certain period and the corrosion process is highly
simplified. These simplifications lead to strong
thickness discontinuities implying sharp edges on 
the corroded cross-section profile. These sharp 
edges are not representative of the real geometry 
of the corroded overpack and would lead to stress 
concentrations in the finite element analysis. To 
define a continuous smooth cross-section profile 
representative of the thickness loss by corrosion, 
the exact corroded thickness is calculated for 16 
points on the profile, every 22.5° as described in 
Equation (1). 


$$\mathrm{CT}(i,t) = \int_0^t \left( I^a(i,\tau)\,\mathrm{CR}_a(\tau) + I^u(i,\tau)\,\mathrm{CR}_u(\tau) \right) d\tau \qquad (1)$$


with $\mathrm{CT}(i,t)$ the corroded thickness at the $i$-th point
at time $t$; $\mathrm{CR}_a(t)$ and $\mathrm{CR}_u(t)$ are the
values of the corrosion rates above and under water at time $t$;
$I^a$ and $I^u$ are defined as follows:


$$I^a(i,t) = \begin{cases} 1 & \text{if } h_w(t) < h(i) \\ 0 & \text{if } h_w(t) \ge h(i) \end{cases} \qquad (2)$$

$$I^u(i,t) = 1 - I^a(i,t) \qquad (3)$$


with $h_w(t)$ the height of the water level at time $t$ and
$h(i)$ the height of the $i$-th point. The radius of each point
is reduced by the corroded thickness and a closed spline curve is
defined passing through the points (Fig. 6). For each time
increment, the spline is calculated and used as the corroded
cross-section profile to generate the geometry of the part.
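
The evaluation of the corroded thickness at the 16 profile points can be sketched as follows. This is a minimal illustration under stated assumptions: the integral is approximated by a left Riemann sum, and the radius, water level history and corrosion rates are hypothetical constants chosen for the example. The spline fitting step is not included.

```python
import math

def corroded_thickness(heights, water_level, cr_above, cr_under, t_end, dt=100.0):
    """Approximate Eq. (1) by a left Riemann sum; one thickness per profile point."""
    thickness = [0.0] * len(heights)
    steps = int(round(t_end / dt))
    for s in range(steps):
        t = s * dt
        for i, h in enumerate(heights):
            above_water = water_level(t) < h          # indicator of Eq. (2)
            rate = cr_above(t) if above_water else cr_under(t)
            thickness[i] += rate * dt
    return thickness

radius0 = 0.3                                          # hypothetical initial radius [m]
angles = [math.radians(22.5 * i) for i in range(16)]   # 16 points, every 22.5 degrees
heights = [radius0 * math.sin(a) for a in angles]      # height of each point
ct = corroded_thickness(
    heights,
    water_level=lambda t: 0.0,    # hypothetical: water steady at mid-height
    cr_above=lambda t: 1.0e-6,    # hypothetical rate above water [m/year]
    cr_under=lambda t: 5.0e-6,    # hypothetical rate under water [m/year]
    t_end=1000.0,
)
radii = [radius0 - c for c in ct]  # reduced radii, input to the closed spline
```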

The mechanical loading is composed of a constant fluid pressure
and a contact pressure. The contact pressure over time depends on
the date at which the lining buckles. This date is calculated by
an empirical equation determined from previous experimental and
numerical studies led at Andra (Nguyen 2017). The equation is
based on the rock pressure value and the corrosion rates. The
contact area where the pressure is applied is defined by a
contact angle $\alpha$ (Fig. 7). The angle defines two contact
areas modelling the case of two-lobe buckling. The evolution of
the contact angle is very difficult to predict due to the
complexity of the system (post-buckling behavior of the lining,
long-term mechanical loading applied to the rock). As a first
approximation, it is assumed that it increases linearly with time
and reaches complete contact in about 3000 years. The pressure
profile over the contact surfaces is taken as parabolic, such
that the pressure is equal to zero at the edges and nominal at
the middle. The nominal pressure is governed by the random
variable $P_c$, as discussed in section 3. Once $\alpha$ reaches
180°, it is necessary to adapt the pressure profile such that its
evolution is smooth. At this time, the two contact areas are
joined and the pressure is zero at the two edges separating them.
The overall pressure profile is close to an ellipse. However, the
long-term pressure profile is a circle because the minor
horizontal and vertical stresses in the rock are almost equal.
Therefore, to ensure continuity of the loading over time,
$\alpha$ is considered to keep increasing after 180° and the
pressure profile is updated following the same evolution. The
difference is that the pressure profile is still applied on the
two 180° areas; this way the stress values at the edges increase
and the pressure profile gets closer to a circle (Fig. 8).

Figure 7. Schematic drawing of the contact angle.
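
One possible reading of this loading model is sketched below. It assumes that the parabola is always defined over the full (possibly larger than 180°) angle but only applied on at most 180° per lobe, and that the angle is measured from the middle of a lobe. The function names, the 3000-year constant and this interpretation of the post-180° behavior are assumptions for illustration, not the authors' implementation.

```python
def contact_angle(t_since_buckling, full_contact_time=3000.0):
    """Contact angle in degrees, growing linearly and not capped at 180."""
    return 180.0 * t_since_buckling / full_contact_time

def contact_pressure(phi, alpha, p_nominal):
    """Pressure at angular offset phi (degrees) from the middle of one lobe."""
    if alpha <= 0.0:
        return 0.0
    half_width = min(alpha, 180.0) / 2.0   # a lobe never covers more than 180 degrees
    if abs(phi) > half_width:
        return 0.0
    # Parabola: nominal at the middle, zero at +/- alpha / 2
    return p_nominal * (1.0 - (2.0 * phi / alpha) ** 2)
```

With this reading, the pressure at the lobe edges is zero while the angle is below 180° and rises progressively afterwards, so the profile flattens toward a uniform (circular) distribution.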

The finite element analysis is performed on the uncracked model;
the stress field around the crack position is then analyzed to
estimate the stress intensity factor as described in (Pommier,
Sakae, and Murakami 1999). These equations are a generalization
of the work of Newman & Raju (1981). In these equations the
stress field around the crack is defined as a polynomial of the
geometrical coordinates. The order and parameters of the
polynomial are input parameters of the equations. They provide
good evaluations of $K_I$ for semi-elliptical surface cracks with
a wide range of shapes given by the two input parameters, the
crack depth and length. The stress field is exported from the
finite element analysis on a grid with five points in the axial
direction centered on $z_0$, five points in the radial direction
and 500 points in the circumferential direction. Once the
position of the crack associated with $\theta_1$ is determined,
the stress field at several points around the crack and the
coordinates of these points are retrieved. The area where the
data are retrieved is the slice of the scan area that is centered
on the crack position (5 axial points, 5 radial points, 6
circumferential points). The components of the stress field are
modelled with a linear regression in the global coordinate
system. Then, the parameters of the fitted linear model of the
stress field are calculated in the local coordinate system
attached to the crack. These parameters are used to estimate the
stress intensity factor associated with the opening mode along
the crack edge. This way, all positions and orientations of
cracks can be simulated with a single finite element analysis.

Figure 8. Pressure profile for several time steps after buckling.


5 RELIABILITY ANALYSIS 


The results of the finite element model are used to perform the
reliability analysis using Monte-Carlo simulations. The failure
probability $P_f$ is expressed as follows (Lemaire 2013):


$$P_f = \int I(X)\, f(X)\, dX \qquad (4)$$


where $f$ is the joint probability density function of the random
variables $X$ and $I$ is defined as:


$$I(X) = \begin{cases} 1 & \text{if } g(X) < 0 \\ 0 & \text{if } g(X) \ge 0 \end{cases} \qquad (5)$$


where $g$ is the performance function, an auxiliary function
introduced to define the failure event. The failure domain is
associated with the negative values of $g$, as expressed in
Equation (5). In Monte-Carlo simulations, realizations of the
random variables are generated and the integral expressed in
Equation (4) is estimated by:


$$P_f \approx \frac{1}{N} \sum_{i=1}^{N} I\left(X^{(i)}\right) \qquad (6)$$


where $N$ is the number of simulations and $X^{(i)}$ denotes
samples of the uncertain parameters obtained using a
(quasi-)random number generator. This approximation is valid for
$N$ sufficiently large, usually for $N > 100/P_f$.
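
The Monte-Carlo estimator can be illustrated on a toy problem with a known answer. The sketch below is not the overpack model: the performance function (a normal resistance minus a normal load), the sample size and the seed are assumptions chosen so that the estimate can be compared with the exact value.

```python
import math
import random

def failure_probability(g, sampler, n, rng):
    """Crude Monte-Carlo estimate: fraction of samples with g(x) < 0."""
    failures = sum(1 for _ in range(n) if g(sampler(rng)) < 0.0)
    return failures / n

def sampler(rng):
    # Hypothetical resistance ~ N(5, 1) and load ~ N(3, 1), independent
    return rng.gauss(5.0, 1.0), rng.gauss(3.0, 1.0)

def g(x):
    resistance, load = x
    return resistance - load

rng = random.Random(1)
pf_mc = failure_probability(g, sampler, 200_000, rng)

# g ~ N(2, sqrt(2)), so P(g < 0) = Phi(-sqrt(2)) = 0.5 * erfc(1)
pf_exact = 0.5 * math.erfc(1.0)
```

Here the exact probability is about 0.079, so the rule of thumb N > 100/P_f asks for roughly 1300 samples; the sketch uses far more to keep the comparison tight.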

In this study, random variables are managed following two
different strategies depending on how time-consuming they are to
evaluate. The random variables affecting the finite element model
($A$, $P_c$) are associated with considerable numerical effort
because, for each realization, the finite element analysis has to
be performed for the whole lifetime. The random variables
affecting only the crack ($\theta_1$, $\theta_2$, $K_{ISCC}$) are
cheap to evaluate because they only require post-processing
without further evaluation of the finite element model.
Therefore, to maximize the number of simulations and to take
advantage of the variables associated with moderate numerical
effort, the Monte-Carlo simulations are carried out on two levels
with two different numbers of simulations. The random variables
are divided into two sets, $X = (X_1, X_2)$, where
$X_1 = (\theta_1, \theta_2, K_{ISCC})$ and $X_2 = (A, P_c)$.
Equation (4) can then be written as:


$$P_f = \iint I(X_1, X_2)\, f(X_1, X_2)\, dX_1\, dX_2 \qquad (7)$$


The Monte-Carlo simulation is applied twice in order to transform
both integrals into sums:


$$P_f \approx \frac{1}{N_2} \sum_{k=1}^{N_2} \frac{1}{N_1} \sum_{j=1}^{N_1} I\left(X_1^{(j)}, X_2^{(k)}\right) \qquad (8)$$


In this study, 100 realizations of the variables related to the
finite element model ($N_2$) have been evaluated and 100,000
cracks ($N_1$) have been simulated for each of them. Therefore
the failure probability is expressed as follows:


$$P_f(t) \approx \frac{1}{N_1 N_2} \sum_{k=1}^{N_2} \sum_{j=1}^{N_1} I\left(X_1^{(j)}, X_2^{(k)}, t\right) \qquad (9)$$


with $I$ defined through the performance function $g$:


$$g\left(X^{(j,k)}, t\right) = K_{ISCC}^{(j)} - K_I\left(\theta_1^{(j)}, \theta_2^{(j)}, A^{(k)}, P_c^{(k)}, t\right) \qquad (10)$$


Once complete failure is reached (i.e. corrosion of the whole
thickness of the overpack), the finite element model can no
longer be created. Therefore, the stress intensity factor and the
sign of $g$ cannot be evaluated. The system is then considered
failed regardless of the sign of $g$ and the failure probability
over time is estimated with:


if $t > t_f$ then $I(X_1, X_2, t) = 1$

where $t_f$ denotes the time at which complete failure is reached.
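
The two-level sampling strategy described above can be sketched as a nested loop in which the expensive model is evaluated once per outer sample and reused for many cheap inner samples. Everything model-specific here is a stand-in: `expensive_model` and `stress_intensity` are hypothetical toy functions, and the distributions of the outer variables are assumptions for illustration only.

```python
import math
import random

def expensive_model(scale, pressure):
    """Stand-in for one finite element run: returns a stress amplitude."""
    return scale * pressure

def stress_intensity(stress, theta_position, theta_orientation):
    """Toy post-processing step; not the formula used in the paper."""
    return (stress
            * math.cos(math.radians(theta_position))
            * math.cos(math.radians(theta_orientation)))

def two_level_failure_probability(n_outer, n_inner, rng):
    failures = 0
    for _ in range(n_outer):
        # Outer level: variables that require a new (expensive) model run
        scale = rng.lognormvariate(0.0, 0.2)
        pressure = rng.lognormvariate(math.log(35.0), 0.3)
        stress = expensive_model(scale, pressure)   # evaluated once per outer sample
        for _ in range(n_inner):
            # Inner level: crack variables, post-processing only
            theta_position = rng.uniform(0.0, 180.0)
            theta_orientation = rng.uniform(0.0, 180.0)
            toughness = rng.gauss(40.0, 4.0)
            k_i = stress_intensity(stress, theta_position, theta_orientation)
            if toughness - k_i < 0.0:               # performance function g < 0
                failures += 1
    return failures / (n_outer * n_inner)

pf = two_level_failure_probability(n_outer=50, n_inner=2000, rng=random.Random(7))
```

The design choice is that the total number of samples is the product of the two loop sizes, while the number of expensive evaluations is only the outer loop size.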


6 RESULTS 


The stress intensity factor $K_I$ has been evaluated every 100
years until complete failure for the $N_1 \times N_2$
simulations. The results are represented by the median curve and
the bounds of the 95% confidence interval (Fig. 9). The
estimation of the stress intensity factor gives negative values
in response to compressive stresses. These values have been set
to zero, which does not affect the median curve or the estimation
of the failure probability. The results show that the dispersion
increases with time while the median value remains almost
unchanged. This result can be explained by three phenomena.
First, as time increases, the average thickness shrinks and the
overall stresses slowly increase. Second, the stress
concentration discussed in section 3 is only local in the area
where cracks are simulated, but it may lead to high stress
intensity factors. This area grows with time and the stress
values involved increase rapidly, thus increasing the extreme
values of the stress intensity factor. Finally, depending on
their orientation, the cracks are almost as likely to be
subjected to tension as to compression. Therefore, even with
increasing stresses, the median value of the stress intensity
factor remains balanced. Because $K_I$ cannot be evaluated once
complete failure is reached, as discussed in section 5, the
number of samples available to estimate the variability of $K_I$
decreases with time, which explains the increasing irregularities
of the upper bound of the confidence interval.

The failure probability over time estimated by the Monte-Carlo
simulation is represented in Fig. 10. The first failures occur
after 1100 years and the failure probability at this date is
$10^{-7}$, the smallest non-zero value that the
$N_1 \times N_2 = 10^7$ simulations can resolve. Therefore, the
failure probability at 500 years, which is the minimum required
lifetime, is less than $10^{-7}$.

Figure 10. Estimated failure probability over time.


7 CONCLUSION 


The failure probability of the overpacks for HLW is estimated
taking into account the evolution of the operational conditions
and the aging of the system. The presented study is based on
assumptions made to ensure conservatism and reduce computation
time. The failure probability at 500 years is smaller than
$10^{-7}$, which is in good agreement with the lifetime
requirement of the overpack. This study gives a first estimation
of the reliability of the design choices made for the Cigeo
project. The reliability and safety of the infrastructures are
major issues of the project and these results are relevant for
the validation and improvement of the design choices.

This study could be completed by more simulations and the
implementation of a metamodel to reduce numerical effort. A
further improvement would be to take into account a wider range
of the parameters involved in the problem. For example, the other
corrosion cases and water level evolution scenarios could be
evaluated, as well as cracks on a wider surface area and the
possibility of internal cracks in the overpack thickness.
ABSTRACT: Three sizes are important for the characterization of the propagation of fatigue cracks:
initial size, detectable size and acceptable size. The theoretical model of fatigue crack progression can
be based on linear elastic fracture mechanics. Depending on the location of the initial crack, the crack may
propagate in a structural element in a way that can be described by calibration functions. A single
edge-cracked steel element with rectangular cross-section and a relatively short edge crack under pure
tension, pure bending, and three- and four-point bending has been chosen to apply the theoretical solution
suggested in the studies. Once the required level of reliability is determined, it is possible to specify the time
of the first inspection of the structure, which will focus on the fatigue damage. Using conditional
probability and a Bayesian approach, times for subsequent inspections can be determined based on the
results of the previous inspection. For the probabilistic calculation of fatigue crack progression, an original
and new probabilistic method is used: the Direct Optimized Probabilistic Calculation (DOProC), a
purely numerical approach based on optimized numerical integration without any simulation techniques
or approximation approaches. This provides more accurate solutions to probabilistic tasks and, in some
cases, considerably speeds up the computations while taking into account the statistical
dependence of random input variables.


1 INTRODUCTION


Fatigue is one of the main factors influencing the life of steel
structures and bridges subjected to cyclic loading. A substantial
increase in the overall weight load from vehicle axles and
crossing frequencies leads to higher fatigue damage than
considered during the design of bridges. For these reasons, it is
highly relevant to develop methods for the calculation and
assessment of the residual fatigue life and time-dependent
analysis of the reliability of existing steel structures, e.g.,
Soliman et al. (2016), Maljaars & Vrouwenvelder (2014) and
Partov & Kantchev (2014). Numerous numerical methods, mostly
based on the Finite Element Method (FEM), e.g., Major et al.
(2017), Nemec et al. (2017), Vican et al. (2015), Cajka (2013)
and Kormanikova and Kotrasova (2011), have been developed to aid
in the understanding of the behavior of fatigue phenomena.

In the design of structures, information related to
time-variable load and detection of cracks from measurement
during the operational period of the structure should be
incorporated into reliability calculations. The essential tools
for these calculations are provided by fracture mechanics and
reliability theory, e.g., Hradil et al. (2017) and Michalcova &
Lausova (2017). Some of the approaches used for fatigue crack
prediction are based on stochastic methods, e.g., Kralik (2016),
Antucheviciene et al. (2015) and Fedorik et al. (2015). Insight
into the stochastic interactions among random factors (load,
geometric and material characteristics), e.g., Krivy & Konecny
(2013), affecting the reliability of steel bridges is essential
to understanding the progress of the failure probability of steel
structures over time, e.g., Schneider, Thons, & Straub (2017).
Moreover, due to the presence of significant uncertainties
associated with crack initiation and propagation, inspection,
monitoring and/or repair actions should be planned and applied to
prevent sudden failures of damaged structural components and
their associated consequences, e.g., Lotsberg, Sigurdsson,
Fjeldstad, & Moan (2016).

The paper focuses on the probabilistic approach based on
optimized numerical integration, the newly developed Direct
Optimized Probabilistic Calculation method (DOProC), published in
detail, e.g., in Janas et al. (2017). The DOProC method is
distinguished by higher accuracy than the other probabilistic
methods. Another advantage is the easy implementation on
platforms with multiple processing units or cores, enabling
parallel computing of this probabilistic procedure, e.g., on
supercomputers (Krejsa et al. 2016). This new probabilistic
approach has made it possible to describe and implement a very
precise methodology for stochastic prediction of fatigue damage
of steel structures and bridges exposed to cyclic loading.
Probabilistic modeling of fatigue crack progression leads to
designing a system of regular inspections of structures and is
based on linear elastic fracture mechanics and Paris-Erdogan’s
law, e.g., Seitl et al. (2017). This article describes this
approach with a particular focus on demonstrating the use of the
above-mentioned methodology for the single edge-cracked steel
element with a rectangular cross-section and a relatively short
edge crack.


2 THEORETICAL BACKGROUND 


2.1 Fatigue crack propagation 


When investigating the propagation, the fatigue crack that
deteriorates a certain area of the structural component is
described with one dimension only: the fatigue crack length $a$.
To describe the propagation of the crack, linear elastic fracture
mechanics is typically used. This method uses Paris-Erdogan’s
law, which defines the relation between the propagation rate of
the crack size $a$ and the range of the stress intensity factor
$\Delta K$ at the tip of the crack:


$$\frac{da}{dN} = C \cdot \Delta K^m, \qquad (1)$$


where $C$, $m$ are material constants that are determined
experimentally, $N$ is the number of loading cycles and
$\Delta K$ is the range of the stress intensity factor in front
of the crack tip, defined as follows:


$$\Delta K = \Delta\sigma \cdot \sqrt{\pi a} \cdot F\left(\frac{a}{h}\right), \qquad (2)$$


where $\Delta\sigma$ is the constant stress range (the value of
$\Delta\sigma$ corresponding to each way of loading, see Fig. 1,
is shown in Table 1), $h$ is the height of the rectangular
cross-section of the component and $F(a/h)$ is the calibration
function, which represents the course of propagation of the crack
(e.g., at the edge or on the surface of the component) and
various boundary conditions.
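
The number of cycles needed to grow a crack between two sizes follows from integrating Paris-Erdogan’s law. The sketch below does this numerically with the midpoint rule and checks it against the closed form that exists when the calibration function is constant. All numerical values (C, m, stress range, crack sizes) and the constant calibration function are hypothetical and serve only as a self-consistent test case.

```python
import math

def cycles_to_grow(a_start, a_end, c, m, delta_sigma, calib, h, steps=20_000):
    """Integrate dN = da / (C * dK^m) from a_start to a_end (midpoint rule)."""
    da = (a_end - a_start) / steps
    cycles = 0.0
    for i in range(steps):
        a = a_start + (i + 0.5) * da
        delta_k = delta_sigma * math.sqrt(math.pi * a) * calib(a / h)
        cycles += da / (c * delta_k ** m)
    return cycles

# Hypothetical inputs: C in units consistent with MPa*sqrt(m) and metres
C, M = 1.0e-11, 3.0
DELTA_SIGMA, H = 50.0, 0.1          # stress range [MPa], section height [m]
A0, A1 = 2.0e-4, 2.0e-3             # crack grows from 0.2 mm to 2 mm

n_cycles = cycles_to_grow(A0, A1, C, M, DELTA_SIGMA, lambda ratio: 1.0, H)

# Closed form for F = 1 and m = 3: integral of a^(-3/2) is 2 * (a0^-0.5 - a1^-0.5)
n_closed = 2.0 * (A0 ** -0.5 - A1 ** -0.5) / (C * (DELTA_SIGMA * math.sqrt(math.pi)) ** 3)
```

A non-constant calibration function can be passed as `calib` without changing the integration routine.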

Three sizes are important for describing the propagation of
fatigue cracks. The fatigue crack will propagate in a stable way
only if the initial crack $a_0$ exists in the place where the
stress is concentrated. The existence of the crack during its
propagation should be revealed, e.g., during inspections, once it
reaches the detectable length $a_d$. The crack propagates in a
stable way until it reaches the third important size, the
acceptable length of the crack $a_{ac}$, which is a limit for the
required reliability.


2.2 Stochastic reliability assessment 


The main assumption is that the primary design should take into
account the effects of the extreme loading and that the fatigue
resistance should be assessed. Probabilistic methods should be
used to investigate the propagation rate of the fatigue crack
until the acceptable size is reached, because the input variables
include uncertainties and reliability should be taken into
account. The resistance of the structure can be evaluated using
Eqs. (1) and (2) as:


$$R(a_{ac}) = \int_{a_0}^{a_{ac}} \frac{da}{\left(\sqrt{\pi a} \cdot F(a/h)\right)^m}. \qquad (3)$$


If the upper integration limit $a_d$ is used, the resistance of
the structure $R(a_d)$ can be specified similarly. Similarly, it
is possible to define the cumulated effect of loads, which equals:


$$E(N) = \int_{N_0}^{N} C \cdot \Delta\sigma^m \, dN = C \cdot \Delta\sigma^m \cdot (N - N_0), \qquad (4)$$


where $N$ is the total number of oscillations of $\Delta\sigma$
for the change of the length from $a_0$ to $a_{ac}$ and $N_0$ is
the number of oscillations at the time of initiation of the
fatigue crack (typically, this number of oscillations is zero).
For details see, e.g., (Krejsa, Koubova, Flodr, Protivinsky, &
Nguyen 2017).
Figure 1. Static scheme for single edge-crack steel specimens: (a) loaded by tension, (b) loaded by pure bending, (c)
loaded by three-point bending, (d) loaded by four-point bending.


Table 1. Stress range $\Delta\sigma$ and the acceptable length of
the crack $a_{ac}$ for single edge-crack steel specimens and
various loads according to Fig. 1; $l$ is the span of the element,
$w$ and $h$ are the width and height of the rectangular
cross-section, $n$ is flat load [N/m²], $N$ is axial force and
$f_y$ is yield stress.


Type of load                 Stress range Δσ                      Acceptable length of the crack a_ac
Tension                      $N / (w h)$                          $h - N / (w f_y)$
Pure bending                 $6 M / (w h^2)$                      $h - \sqrt{6 M / (w f_y)}$
Three-point bending (3PB)    $3 F_{3PB}\, l / (2 w h^2)$          $h - \sqrt{3 F_{3PB}\, l / (2 w f_y)}$
Four-point bending (4PB)     $2 F_{4PB}\, l / (w h^2)$            $h - \sqrt{2 F_{4PB}\, l / (w f_y)}$

The probability of failure $P_f$ equals:


$$P_f = P\left(G_{fail}(X) < 0\right) = P\left(R(a_{ac}) - E(N) < 0\right), \qquad (5)$$


where $G_{fail}$ is the reliability function and $X$ is a vector
of random physical properties such as mechanical properties,
geometry of the structure, load effects and dimensions of the
fatigue crack.


2.3 Inspections planning 


When the probability of failure $P_f$ according to Eq. (5)
exceeds the specified design probability $P_d$, an inspection
should be performed. On the basis of the results of the first
inspection, a system of subsequent inspections can be established
using conditional probability, as e.g. in Krejsa et al. (2017).

Because it is not certain in the probabilistic calculation
whether the initial crack exists and what its size is, and
because other inaccuracies influence the calculation of the crack
propagation, a special inspection is necessary to check the size
of the measurable crack at a specific time. The acceptable crack
size influences the time of the inspection. If no fatigue cracks
are found, the analysis of the inspection results gives the
conditional probability of their occurrence.

While the fatigue crack is propagating, it is possible to define
the following random phenomena that are related to the growth of
the fatigue crack and may occur at any time $t$ during the
service life of the structure:


• $U(t)$ phenomenon: No fatigue crack failure has been revealed
within the time $t$ and the fatigue crack size $a(t)$ has not
reached the detectable crack size $a_d$. This means:


$$a(t) < a_d, \qquad (6)$$


• $D(t)$ phenomenon: A fatigue crack failure has been revealed
within the time $t$ and the fatigue crack size $a(t)$ is still
below the acceptable crack size $a_{ac}$. This means:


$$a_d \le a(t) < a_{ac}, \qquad (7)$$


• $F(t)$ phenomenon: A failure has been revealed within the time
$t$ and the fatigue crack size $a(t)$ has reached the acceptable
crack size $a_{ac}$. This means:


$$a_{ac} \le a(t). \qquad (8)$$


If the crack is not revealed within the time $t$, this may mean
that there is no fatigue crack in the structural element. It
might also be the initiation phase of nucleation of the fatigue
crack (when a crack appears in the material); this phenomenon is
not taken into account in fracture mechanics. Even if the fatigue
crack is not revealed, it is likely that it exists but its size
is so small that it cannot be detected under existing conditions.

Using the phenomena above, it is possible to define the
following probabilities:


• The probability that the failure is not detected within the
time $t$, i.e. the probability that the fatigue crack size $a(t)$
is below the measurable crack size $a_d$:


$$P(U(t)) = P(a(t) < a_d), \qquad (9)$$


• The probability that the failure detected within the time $t$
has a crack size $a(t)$ that is less than the acceptable size
$a_{ac}$:


$$P(D(t)) = P(a_d \le a(t) < a_{ac}), \qquad (10)$$


• The probability that the failure occurs within the time $t$,
i.e. the probability that the fatigue crack size $a(t)$ reaches
the acceptable size $a_{ac}$:


$$P(F(t)) = P(a_{ac} \le a(t)). \qquad (11)$$


Those three phenomena cover the complete spectrum of phenomena
that might occur at the time $t$. This means:


$$P(U(t)) + P(D(t)) + P(F(t)) = 1. \qquad (12)$$
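
The partition into the three phenomena can be illustrated by classifying a set of crack sizes and counting relative frequencies. This is a minimal sketch: the crack sizes are drawn from a hypothetical lognormal distribution and the thresholds are illustrative values, and plain sampling is used here instead of the DOProC procedure described in the paper.

```python
import random

def classify(a_t, a_d, a_ac):
    """Return 'U' (undetected), 'D' (detected, acceptable) or 'F' (failure)."""
    if a_t < a_d:
        return "U"
    if a_t < a_ac:
        return "D"
    return "F"

def event_probabilities(crack_sizes, a_d, a_ac):
    counts = {"U": 0, "D": 0, "F": 0}
    for a_t in crack_sizes:
        counts[classify(a_t, a_d, a_ac)] += 1
    n = len(crack_sizes)
    return {event: count / n for event, count in counts.items()}

rng = random.Random(3)
sizes = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]  # hypothetical sizes [mm]
probs = event_probabilities(sizes, a_d=2.0, a_ac=30.0)         # hypothetical thresholds [mm]
```

Because every crack size falls into exactly one class, the three estimated probabilities always sum to one.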


In order to specify the time for the next inspection, it is
necessary to determine the conditional probabilities, which can
be expressed using the law of total probability:


$$P(F(T) \mid U(t_I)) = \frac{P(F(T)) - P(F(t_I))}{P(U(t_I))} - \frac{P(D(t_I))}{P(U(t_I))} \cdot P(F(T) \mid D(t_I)) \qquad (13)$$

and

$$P(F(T) \mid D(t_I)) = \frac{P(F(T)) - P(F(t_I))}{P(D(t_I))} - \frac{P(U(t_I))}{P(D(t_I))} \cdot P(F(T) \mid U(t_I)) \qquad (14)$$


where $t_I$ is the time of the inspection and $T > t_I$ is the
time for which the failure probability is evaluated. For details
see Fig. 2 and, e.g., (Krejsa, Brozovsky, & Mikolasek 2017),
(Krejsa, Kala, & Seitl 2016).


Figure 2. Probabilities of failure $P_f$ calculated according to
Eqs. (5) and (13) with the times of 5 inspections.


3 APPLICATION 


The use of the introduced methodology is demonstrated for the
single edge-cracked steel element with rectangular cross-section
and a relatively short edge crack, defined according to the
scheme in Fig. 1(c). Calibration curves $F(a/h)$ were
experimentally derived for specimens with relative crack length
$a/h$ of 0.01-0.3 loaded by tension, pure bending, three-point
bending and four-point bending.

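The resulting calibration function, e.g., for the 3PB test and
ratio $l/h = 2$ is, according to Seitl et al. (2017):


$$F_{3PB}\left(\frac{a}{h}\right) = 1.0259 - 1.4659 \cdot \left(\frac{a}{h}\right) + 4g318 \cdot \left(\frac{a}{h}\right)^2 - 2.4637 \cdot \left(\frac{a}{h}\right)^3, \qquad (15)$$
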
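for ratio $l/h = 4$ it is:

$$F_{3PB}\left(\frac{a}{h}\right) = 1.0691 - 1.3496 \cdot \left(\frac{a}{h}\right) + 5.1865 \cdot \left(\frac{a}{h}\right)^2 - 3.3509 \cdot \left(\frac{a}{h}\right)^3, \qquad (16)$$

for ratio $l/h = 8$ it is:

$$F_{3PB}\left(\frac{a}{h}\right) = 1.0963 - 1.3052 \cdot \left(\frac{a}{h}\right) + 5.2829 \cdot \left(\frac{a}{h}\right)^2 - 3.5972 \cdot \left(\frac{a}{h}\right)^3, \qquad (17)$$
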
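for ratio $l/h = 16$ it is:

$$F_{3PB}\left(\frac{a}{h}\right) = 1.1079 - 1.2328 \cdot \left(\frac{a}{h}\right) + 50ss1 \cdot \left(\frac{a}{h}\right)^2 - 3.2837 \cdot \left(\frac{a}{h}\right)^3, \qquad (18)$$

and for ratio $l/h = 80$ it is:

$$F_{3PB}\left(\frac{a}{h}\right) = 1.1180 - 1.1964 \cdot \left(\frac{a}{h}\right) + 5176 \cdot \left(\frac{a}{h}\right)^2 - 3.3127 \cdot \left(\frac{a}{h}\right)^3. \qquad (19)$$


The value of the calibration function $F_{3PB}(a/h)$ for
intermediate values of the ratio $l/h$ is determined
by linear interpolation.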
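
The linear interpolation between tabulated ratios can be sketched as follows. Only the two coefficient sets for l/h = 4 and l/h = 8 are used, as read from the text above; the function names and the restriction to these two ratios are choices made for this illustration.

```python
# Polynomial coefficients (constant, linear, quadratic, cubic) as read from
# the calibration functions for l/h = 4 and l/h = 8.
COEFFS = {
    4.0: (1.0691, -1.3496, 5.1865, -3.3509),
    8.0: (1.0963, -1.3052, 5.2829, -3.5972),
}

def calibration(ratio_a_h, coeffs):
    """Cubic calibration polynomial in a/h."""
    c0, c1, c2, c3 = coeffs
    return c0 + c1 * ratio_a_h + c2 * ratio_a_h ** 2 + c3 * ratio_a_h ** 3

def calibration_interpolated(ratio_a_h, ratio_l_h, table=COEFFS):
    """Linearly interpolate F(a/h) between the two neighbouring tabulated l/h."""
    keys = sorted(table)
    if not keys[0] <= ratio_l_h <= keys[-1]:
        raise ValueError("l/h outside the tabulated range")
    for lo, hi in zip(keys, keys[1:]):
        if lo <= ratio_l_h <= hi:
            w = (ratio_l_h - lo) / (hi - lo)
            f_lo = calibration(ratio_a_h, table[lo])
            f_hi = calibration(ratio_a_h, table[hi])
            return (1.0 - w) * f_lo + w * f_hi

f_mid = calibration_interpolated(0.1, 6.0)  # a/h = 0.1, l/h halfway between 4 and 8
```

Further ratios can be added to the table without changing the interpolation routine.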

Eqs. (3), (4) and (5) allow for effective computation of the
inspection times for each single edge-cracked steel component
using the DOProC method and the FSCProbCalc code with exactly
defined random input quantities.

The allowable crack size $a_{ac}$ for the single edge-crack steel
specimen loaded by three-point bending can be expressed by the
relationship in Table 1, considering the derived weakening of the
cross-sectional area of the element (with the limit length
defined by the ratio $a/h = 0.3$), similarly as in (Wang, Zhai,
Duan, & Wang 2015):


$$a_{ac} = h - \sqrt{\frac{3 \cdot F_{3PB} \cdot l}{2 \cdot w \cdot f_y}}. \qquad (20)$$


Deterministic and random input quantities are given in Table 2
and Table 3.

If a period of time $t$ is specified and the time step is 1 year,
it is possible to determine the resistance of the structure,
$R(a_d)$ and $R(a_{ac})$, pursuant to Eq. (3) (see Figs. 3 and
4), the load effects $E(N)$ pursuant to Eq. (4) (see Fig. 5), and
the reliability function $G_{fail}(X)$ according to Eq. (5) (see
Fig. 6), as well as the probabilities of the elementary phenomena
$U$, $D$ and $F$ pursuant to Eqs. (9) through (11) for each year
of structural operation (see Fig. 7), which are the basis for the
specification of inspection times.


Table 2. Overview of deterministic input quantities.

Quantity                                       Value
Material constant m                            3
Material constant C                            2.2 · 10^-13 MPa^m m^((m/2)+1)
Height of the rectangular cross-section h      0.1 m
Width of the rectangular cross-section w       0.01 m
Span of the element l                          0.4 m
Target probability of failure P_d              0.02277 (β = 2)

Table 3. Overview of random input quantities expressed
as bounded histograms.

Quantity                                           Type of parametric           Mean       Standard
                                                   probability distribution     value      deviation
Total number of stress peaks per year N            Normal                       10^6       10^5
Yield stress f_y                                   Lognormal                    200 MPa    20 MPa
Loading force in three-point bending test F_3PB    Normal                       6 kN       0.6 kN
Initial size of the crack a_0                      Lognormal                    0.2 mm     0.05 mm
Smallest detectable size of the crack a_d          Normal                       2 mm       0.2 mm

Figure 3. Resulting histograms of the structural resistance $R(a_d)$.

When the probability of failure $P_f$ according to Eq. (5)
exceeds the specified design probability $P_d$, an inspection
should be performed. Table 4 includes numerical values of the
resulting inspection times: the first inspection and the
subsequent inspections resulting from the conditional probability
pursuant to Eq. (13).
Figure 4. Resulting histograms of the structural resistance $R(a_{ac})$.


Figure 5. Resulting histogram of the load effect E(N) of
the calculation for t = 35 years of structural operation.


Figure 6. Resulting histogram of the reliability function
$G_{fail}(X)$ of the calculation for t = 35 years of structural
operation.


Table 4. Calculated times for the first five
inspections of the structural element.


Inspection no.    Time of inspection [years]
#1                35
#2                46
#3                48
#4                50
#5                51


4 CONCLUSION 


The article demonstrates the probabilistic prediction of fatigue
damage of a relatively short edge crack under various loading
using the newly developed Direct Optimized Probabilistic
Calculation (DOProC), which appears to be a very efficient tool
for the probabilistic assessment of structural reliability on the
basis of an exact definition of the acceptable size of the
fatigue crack. The theoretical model of fatigue crack progression
is based on linear elastic fracture mechanics and the
Paris-Erdogan law. The probabilities were obtained for three
basic phenomena related to the propagation of fatigue cracks. On
the basis of those data, the probability of failure can be
calculated for each year of operation of the structural element.
Once the required degree of reliability is determined, it is
possible to specify the time of the first inspection of the
structure, which will focus on the fatigue damage. Using
conditional probability, times for subsequent inspections can be
determined. The article describes how to design the system of
regular structural inspections for a simple demonstration
example.
Figure 7. Resulting probabilities of random events U, D and F for the first 60 years of structural operation under various loads.


The DOProC method and its application in probabilistic prediction
of fatigue crack damage can considerably improve the estimation
of maintenance costs for structures and bridges subject to cyclic
loads. This methodology is being developed further. A particular
goal of further investigation is the application of Bayesian
networks in the computational model, such as e.g. in Mahadevan
et al. (2001), to describe the propagation of fatigue cracks in
the system.
Reliability analysis of structural health monitoring systems


E. Etebu & M. Shafiee 
Cranfield University, College Road, Bedfordshire, UK 


ABSTRACT: Structural Health Monitoring (SHM) systems are comprised of a grid of sensors installed 
at a fixed location on structures to detect the presence of defect, localize the detected defect, quantify its 
severity, and estimate the Remaining Useful Life (RUL). SHM system performance is currently assessed 
based on Probability of Detection (POD) of defects, which is a function of defect size. This performance 
parameter was inherited from Non-Destructive Testing (NDT), where a human operator performs inspec- 
tion on a structure at a given location, with mobile sensors. For SHM systems, POD and Probability-of- 
False-Alarm (PFA) are a measure for only detection of defects. Furthermore, these parameters could 
vary over time as sensors degrade. This paper presents a methodology to characterize the performance 
of SHM systems with respect to damage detection, localization, and assessment. Probability theory is 
used to characterize uncertainties associated with the SHM process, and Bayes theorem is employed to 
determine its reliability. The methodology is then tested on a vibration-based modal strain energy SHM 
technique applied to a numerical Finite Element Analysis (FEA) study conducted on an offshore energy 
structure. 


1 INTRODUCTION 


Structural health monitoring (SHM) is a technique 
used to assess the integrity of in-service structures 
that are exposed to operational and environmen- 
tal loads on a continuous basis, with the goal of 
improving the monitored structure’s reliability, 
reducing inspection cost, and emergency repair 
expenditure (Doebling et al. 1998). SHM systems 
typically comprise a grid of sensors 
installed at fixed locations on the monitored structure. 
Signals measured from the structure’s 
response are transferred through a communication 
network and processed to extract damage-sensitive 
features, after which a damage detection algorithm is 
employed to determine the structure’s integrity. 
Overall, the four functions of SHM systems are 
detection of a defect, localization of the detected defect, 
quantification of its severity, and estimation of the 
structure’s remaining useful life (RUL). 

SHM system performance is currently assessed 
based on probability of detection (POD), which 
describes the probability of detecting defects as 
a function of their size at the time of inspection 
(Annis, 2009). This performance parameter was 
inherited from non-destructive testing (NDT), 
where POD is a function of replicability achieved 
by the human operator, instruments and sensors 
used at a given location. Moreover, the instruments 
and sensors can be transported to multiple locations 
for inspection. This is vastly different from 
SHM, where a human operator is absent and instruments 
and sensors are placed at an unchanging 
location after installation. For a SHM system, the 
replicability of POD is influenced by the environ- 
ment, measurement noise, deterioration of sensors 
and communication network. 

Statistical characterization of SHM system performance 
based on POD was established in earlier 
works (see Thompson, 2007; Aldrin, 2010; Aldrin 
et al., 2011; Etebu and Shafiee, 2017) to reduce 
the number of experiments associated with SHM 
operational parameters, by implementing a validated 
model to assess the changes in SHM outcome 
at varying damage severity, location, and 
operational conditions. Outcomes from these 
models are used to generate POD curves from 
hit/miss data fitted to a binary regression model, 
where the POD curve is generally plotted with a 95% 
confidence interval (CI). 

In addition to POD, the probability of false 
alarm (PFA) has also been applied to evaluate 
SHM performance, where PFA measures the 
probability of indicating a defect when no defect 
is present [3]. Although these two parameters are 
indicators of SHM performance, POD and PFA 
are associated only with damage detection. Hence, 
these parameters do not evaluate SHM performance 
with respect to localization and damage 
assessment. 

In this paper, a methodology is presented to 
characterize the performance of SHM systems 
with respect to damage detection, localization, and 
assessment using the Model Assisted Probability 
approach, and a vibration-based SHM technique to 
assess varying levels of damage on multiple members 
of a fixed offshore platform in operational 
conditions. 

The rest of the paper is organized as follows. 
Section 2 develops an assessment framework for 
SHM system performance. In Section 3, the pro- 
posed approach is applied to a numerical study and 
the results are analysed in Section 4. Finally, the 
research is concluded in Section 5. 


2 SHM PERFORMANCE: DAMAGE 
DETECTION, LOCALIZATION AND 
ASSESSMENT 


The statistical characterization of SHM system 
performance based on POD and PFA has been 
established by previous studies (Annis, 2009; 
Aldrin et al., 2011; Thompson, 2008), where the 
output $O_d$ of a SHM damage detection system 
either generates an event that indicates the occurrence 
of damage $D_d$ in a structure, or no damage 
detected $D_{nd}$. The classification of damage occurrence 
$O_d$ for SHM is based on a minimum detectable 
damage size known as the damage threshold 
$D_{th}$, which separates $D_d$ and $D_{nd}$. POD and PFA are 
then expressed as 


$$POD(t) = P(O_d \mid D_d - D_{th} \geq 0), \tag{1}$$ 


$$PFA(t) = P(O_d \mid D_d - D_{th} < 0). \tag{2}$$ 


In order to address the performance of SHM 
systems with respect to localizing damage and 
estimating its severity, two additional perform- 
ance functions are required to be developed. The 
performance of a SHM system with respect to 
locating damage on a structure—in the event of 
damage being detected—is characterized using 
the Probability of Accurate Localization (POAL), 
which is constructed based on the zone of true 
damage location (ZTDL) and the predicted loca- 
tion of damage occurrence (PLD). The zone of 
true damage location is characterized by the true 
damage location (TDL) and an allowable localiza- 
tion tolerance (ALT) corresponding to a specific 
TDL. Based on the given definition, the limit state 
function (LSF) for POAL is expressed mathemati- 
cally as: 


$$g(x, y, z, t) = |TDL - PLD| - ALT. \tag{3}$$ 


When the LSF is less than zero, an accurate 
location of damage is attained from the 
SHM system. An event where the LSF is greater 
than zero indicates the occurrence of an error in 
damage localization; this error indicates either a 
false positive or a false negative location. The 
Probability of Accurate Localization (POAL) can be 
expressed in Cartesian coordinates using the 
localization LSF as follows: 


$$POAL(t) = P\left[ O_d = D_d \mid \{ |TDL(x, y, z, t) - PLD(x, y, z, t)| \} - ALT(x, y, z, t) < 0 \right]. \tag{4}$$ 
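
The localization limit state function can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that TDL and PLD are Cartesian points, that |TDL − PLD| is the Euclidean distance, and that ALT is a scalar distance; the paper itself derives ALT from neighbouring members. The function names and the four sample cases are hypothetical, and the hit fraction is only a crude empirical stand-in for the probability.

```python
import math

def localization_lsf(tdl, pld, alt):
    """Limit state g = |TDL - PLD| - ALT (Euclidean distance is an assumption)."""
    return math.dist(tdl, pld) - alt

def empirical_poal(cases):
    """Fraction of detected-damage cases with g < 0 (accurate localization)."""
    hits = [localization_lsf(tdl, pld, alt) < 0 for tdl, pld, alt in cases]
    return sum(hits) / len(hits)

# Hypothetical (TDL, PLD, ALT) triples in metres, for illustration only.
cases = [
    ((0.0, 0.0, 10.0), (0.5, 0.0, 10.0), 1.0),  # g < 0: hit
    ((0.0, 0.0, 10.0), (3.0, 4.0, 10.0), 1.0),  # g > 0: miss
    ((2.0, 2.0, 0.0), (2.0, 2.0, 0.4), 0.5),    # g < 0: hit
    ((2.0, 2.0, 0.0), (2.0, 5.0, 0.0), 0.5),    # g > 0: miss
]
poal = empirical_poal(cases)
```

A negative value of `localization_lsf` corresponds to a predicted location inside the zone of true damage location.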


Evaluation of damage severity is conducted by 
either quantifying or qualifying its magnitude. The 
error between the predicted damage severity (PDS) 
and actual damage severity (ADS) is implemented 
to construct a damage assessment LSF. In order to 
construct a suitable range of error associated with 
damage assessment, an acceptable damage sever- 
ity threshold (DST) is required. The Probability of 
Accurate Assessment (POAA) is constructed using 
conditional probability as: 


$$POAA(t) = P\left[ O_d = D_d \mid \{ |ADS - PDS| \} < DST \right]. \tag{5}$$ 


3 SIMULATED CASE STUDY 


The proposed approach is applied to a numerical 
study in order to evaluate SHM performance. 
The vibration-based modal strain energy (MSE) method of 
Stubbs et al. (1995) is employed to assess the integrity 
of a fixed offshore platform with varying degrees 
of damage in its structure members. The topology 
and foundation stiffness of the fixed offshore platform 
are given in detail by Karadeniz (2001); the 
operational and environmental loads applied are 
presented in Table 1. 

A finite element model of the fixed offshore platform 
was constructed using beam elements, where 
damage was modeled by reducing the stiffness of a 
given structure member. Steel was assigned as the 
material of all members in the structure. 
The baseline response was first obtained with 
no damage present in the fixed offshore platform, after 
which cracks were simulated at 8 different locations, 
with 10 levels of damage severity 
simulated at each location. In this work only the 
first 3 modes of vibration are used in 
the SHM process. 


Table 1. Environmental and operational loads applied 
in the numerical simulation. 

Mass of deck              4800 ton 
Water depth               50 m 
Significant wave height   2.5 m 
Drag coefficient          1.3 
Inertia coefficient       2.0 
Sea spectrum              Pierson-Moskowitz 

Damage detection is conducted by MSE based 
on the observed change in each structure member 
$j$ at vibrational mode $i$ using the following 
equation: 


[Equation (6), defining $MSE_j$ as a sum over the modes $i = 1, \ldots, 3$, is not legible in the source; see Stubbs et al. (1995).] 


At each mode, $\lambda_i$ represents the eigenvalue, $\phi_i$ 
represents the mode shape, $K_j$ represents the stiffness 
matrix of a structure member, and $K$ is the 
global stiffness matrix (Stubbs et al., 1995). 

Damage localization in the structure is based on 
the normalization of the modal strain energy by 
its mean $\overline{MSE}$ and standard deviation $\sigma_{MSE}$ 
as (Li et al., 2016): 


$$ZMSE_j = \frac{MSE_j - \overline{MSE}}{\sigma_{MSE}}. \tag{7}$$ 
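
The normalization in Equation (7) can be sketched in a few lines of Python. This is an illustrative sketch only: the use of the population standard deviation, the function names, and the sample MSE values are assumptions, while the threshold of 1 follows the damage threshold used in this study.

```python
from statistics import mean, pstdev

def zmse(mse_values):
    """Standardize member-wise MSE indices: (MSE_j - mean) / standard deviation."""
    mu = mean(mse_values)
    sigma = pstdev(mse_values)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in mse_values]
    return [(m - mu) / sigma for m in mse_values]

def flag_damaged(mse_values, threshold=1.0):
    """Indices of members whose normalized index exceeds the threshold."""
    return [j for j, z in enumerate(zmse(mse_values)) if z > threshold]

# Hypothetical MSE indices for five members; member 4 stands out.
example_mse = [1.0, 1.0, 1.0, 1.0, 6.0]
example_z = zmse(example_mse)
example_flags = flag_damaged(example_mse)
```

Members returned by `flag_damaged` are the candidates for the predicted location of damage (PLD).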


Assessment of damage in each element is conducted 
by quantifying the damage magnitude using 
the cross modal relationship in Equation (8), written 
in matrix form. Parameter $\alpha$ is the reduction 
factor in the stiffness of a given structure member. 
The magnitude of $\alpha$ varies from 0 to −1, where −1 
indicates a total loss in stiffness (Li et al., 2007): 
The values of $O_d$ attained from $ZMSE_j$ at different 
levels of damage are used to create a POD 
curve based on a fixed value of $D_{th}$, which was set 
as 1. A hit/miss graph was plotted and fitted to the 
logistic binary regression model to calculate POD. 
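
A hit/miss fit of this kind can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a two-parameter logit model in the base-10 logarithm of damage severity, fitted by Newton-Raphson, and the hit/miss data are invented for demonstration. Confidence bounds are not computed.

```python
import math

def sigmoid(z):
    """Logistic function, with the argument clamped to avoid overflow."""
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, iterations=50, tol=1e-10):
    """Maximum-likelihood fit of P(hit) = sigmoid(b0 + b1 * x) by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p            # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w                # negative Hessian entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:
            break
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

def pod(severity, b0, b1):
    """Fitted probability of a hit at a given damage severity (percent)."""
    return sigmoid(b0 + b1 * math.log10(severity))

# Hypothetical hit/miss outcomes against stiffness reduction in percent.
severity = [0.5, 1, 2, 5, 10, 15, 20, 30, 40, 50]
hit = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]
b0, b1 = fit_logistic([math.log10(s) for s in severity], hit)
```

Note that the maximum-likelihood fit only exists when hits and misses overlap in severity; perfectly separated data (or all hits, as for the constant POD reported in this study) would require a different treatment.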

The performance function POAL was also 
assessed by defining ALT as a function of the 
braced members sharing the same node with the 
damaged member. Furthermore, non-connected 
load sharing members on the same floor as the 
damaged member were also used to calculate 
ALT. The position of PLD was obtained from 
$ZMSE_j$ and, in conjunction with TDL and ALT, 
was used to generate a hit/miss graph, which was 
fitted to obtain a POAL curve. A similar process 
was applied to obtain the POAA curve, where 
PDS was quantified using $\alpha$, and DST was set at 
20% of ADS. 
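
The assessment criterion |ADS − PDS| < DST with DST set at 20% of ADS can be written as a small hit/miss classifier. The sketch below is illustrative; the function name and the sample severity values (expressed as stiffness reduction factors between 0 and −1) are hypothetical.

```python
def assessment_hit(ads, pds, dst_fraction=0.20):
    """True when the predicted severity lies within DST = dst_fraction * |ADS|."""
    return abs(ads - pds) < dst_fraction * abs(ads)

# Hypothetical (ADS, PDS) pairs as stiffness reduction factors.
pairs = [(-0.10, -0.09), (-0.50, -0.35), (-0.05, -0.045)]
assessment_hits = [assessment_hit(ads, pds) for ads, pds in pairs]
```

The resulting list of hits and misses is the kind of binary data to which the logit model is fitted to obtain the POAA curve.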


4 RESULTS 


The undamaged structure’s modal frequencies 
were 7.328 Hz, 7.328 Hz and 33.193 Hz for the 
first 3 modes respectively. The first two modes 
exhibited translational 
motion in the diagonal direction of the X-Y plane, 
while the third mode displayed torsional motion. 
The 80 damage scenarios assessed with the 
indices of Stubbs et al. (1995) each resulted in at least 
one member element with a ZMSE value greater 
than 1, thus indicating the presence of damage for 
all damage scenarios. The performance of damage 
localization and severity estimation was observed to 
depend on the extent of damage, the type of structure 
member, and the orientation of the damaged structure 
member. Platform legs were the structure members 
most sensitive to damage; a 0.5% decrease in 
stiffness was accurately localized. Furthermore, 
the estimated loss of stiffness was within 80% of 
the true stiffness value (see Figure 1). For diagonal 
structural members lying horizontal to the floor 
plane, damage localization and severity estimation 
were unsuccessful across all damage scenarios. 
Damage in vertical bracings was accurately localized 
and estimated for damage scenarios where the 
reduction in member stiffness was 5% or greater. 
For horizontal bracings, a reduction in member 
stiffness of 15% resulted in accurate localization 
and estimation of damage. 

Figure 1. ZMSE outcome on structure member 5 
(platform leg) at (top) 0.5% (middle) 10% (bottom) 50% 
reduction in member stiffness. 

Figure 2. PDS outcome on structure member 14 
(platform leg) at (top) 0.5% (middle) 10% (bottom) 50% 
reduction in member stiffness. 

Further analysis of all structure members with 
simulated damage indicated a correlation between 
the orientation of mode shape excitation and 
the location of the damaged structure member. 
Damaged structure members positioned in the diagonal 
planes of mode shape excitation were correctly 
localized with an acceptable PDS at lower magnitudes 
of ADS in comparison to equivalent structure 
members on the same floor (see Figure 2). 

The data compiled from all 80 outcomes of $ZMSE_j$ 
and $\alpha$ were used to obtain model-assisted POAL 
and POAA curves via the hit/miss technique. A logit model 
was employed to generate both probability curves. 
The hit/miss data for damage localization 
and damage severity were equivalent in this study: 
when damage was not localized within the ALT, the 
magnitude of PDS was less than 80% of ADS. 

The POAL and POAA curves shown 
in Figure 3 indicate that the vibration 
modal strain energy method was able to correctly 
localize and assess 18% of damage that results in 
a 0.5% reduction in stiffness, as well as 75% of 
damage that generates a 50% reduction in element 
stiffness. 

Figure 3. Probability of Accurate Localization (POAL), 
and Probability of Accurate Assessment (POAA) generated 
from hit/miss data. 


5 CONCLUSIONS 


Two performance functions were presented for 
SHM systems with regard to their ability to accurately 
predict the location of damage and the 
damage severity. For a given damage magnitude, 
the Probability of Accurate Localization (POAL) 
quantifies the SHM system’s ability to localize 
damage, while the Probability of Accurate Assessment 
(POAA) is a measure of the predicted damage 
severity in comparison to the actual damage severity. 
These performance functions work in conjunction 
with the widely used probability of detection 
(POD) performance function to characterize three 
of the four major objectives for SHM systems. 

Numerical studies were conducted to detect, 
localize and quantify damage on a fixed offshore 
platform under operational loads at varying damage 
magnitudes and locations using the modal 
strain energy (MSE) method. The results indicate that MSE 
was able to detect damage in all simulated scenarios, 
thus leading to a constant POD curve across all 
damage extents. The POAL and POAA curves 
were identical, as all damage correctly localized 
also generated a satisfactory estimate of damage 
severity. The performance of the MSE method with 
respect to POAL and POAA varied with the 
type of member with an assigned damage, the damage 
magnitude, and its location with respect to the 
orientation of mode shape excitation. 

The current numerical study did not include 
errors associated with the eigenvalues and eigenvectors 
obtained from the finite element model vibration 
data. However, noise is always present in real-world 
measurements of vibration data. Hence, 
future studies will include the addition of noise in 
vibration parameters. 
ABSTRACT: The reliability of structures in the serviceability limit states is analysed. The Eurocode 
EN 1990 for basis of structural design provides general recommendations which should be further 
supplemented, including classification of irreversible serviceability limit states with respect to assumed 
consequences of construction works. Further harmonisation of National Annexes to Eurocodes and other 
prescriptive documents is needed. The standards recommend almost identical serviceability criteria for 
similar environments, which are then compared with differently calculated characteristic values of 
stresses, deflections or crack widths based on different analytical models and combinations of actions. An 
example of verification of the serviceability limit states of crack width for a reinforced concrete member 
indicates that the resulting crack width might be found in a rather broad range. The probabilistic methods 
are applied for the verification of the reliability of the structural member and also of the crack width model. 
It is shown that the reliability level of the structural member designed according to Eurocodes fulfils the 
recommended target reliability level given in EN 1990. 


1 INTRODUCTION 


Construction works are designed using methods 
recommended in national, European or interna- 
tional standards. The first generation of Eurocodes 
allows the national selection of about 1600 Nation- 
ally Determined Parameters (NDPs) including 
alternative design approaches, load combinations, 
values of partial factors for actions and mate- 
rial properties and other reliability elements, and 
also serviceability constraints. Therefore, the reli- 
ability of a designed structure depends on applied 
national codes or specified parameters NDPs in 
the National Annexes to nationally implemented 
Eurocodes in CEN Member States (MS). It is 
expected that in the second generation of Euroco- 
des the number of NDPs will be reduced and most 
procedures harmonised. 

The load-bearing capacity and serviceability of 
a structure designed in accordance with national 
codes or nationally implemented Eurocodes could 
be expected within a broad range. The actual struc- 
tural resistance depends not only on used theoreti- 
cal models and selected partial factors and other 
reliability elements, but also on prescriptive rules 
including structural detailing. Moreover, in some 
cases the theoretical models given in various stand- 
ards for determining structural resistance provide 
different probability of over-crossing the specified 
design value. 


It is shown that the reliability of structural 
members designed for the serviceability limit states 
according to current national standards, Eurocodes 
and also the fib Model Code has a considerable scatter, 
which should be analysed and further harmonised. 


2 DESIGN PROCEDURES IN CODES 


The partial factor method which is a basic method 
for the structural design in Eurocodes, in interna- 
tional standards including ISO 2394 (2015) and also 
in many national standards, deals with uncertainties 
of basic variables by means of design values assigned 
to the variables. The design limit state function may 
be expressed in terms of the set of design values of 
a vector x of basic variables given as 


$$g(x_d) = g(F_d, f_d, a_d, \theta_d, C_d) \tag{1}$$ 


where $F_d$ = the design value of action, $f_d$ = the 
design material property, $\theta_d$ = the design value of 
model uncertainty, $a_d$ = the design value of geometrical 
quantity and $C_d$ = the serviceability constraint. 
The design condition is given as 


$$g(x_d) \geq 0 \tag{2}$$ 


for the design vector of basic variables in the limit 
state function, which is commonly obtained on the 
basis of the characteristic values of basic variables 
and a set of partial factors for actions and material 
properties. 

For the verification of the serviceability limit 
states, the following inequality is given 


$$E_d \leq C_d \tag{3}$$ 


determined on the basis of the relevant combination 
of actions following EN 1990 (2002). The limiting 
serviceability constraints are not defined here; 
however, they are newly proposed in the final draft 
of the revised prEN1990 (2017). 

For the verification of various serviceability 
requirements, the characteristic or quasi-permanent 
combinations of actions are often applied, for 
which some serviceability constraints are provided 
in material oriented Eurocodes. It should be noted 
that further guidance for serviceability constraints 
is commonly given in the National Annexes of 
nationally implemented Eurocodes of CEN MS, 
taking into account the effects of short-term and 
long-term duration of actions and various design 
situations. However, the serviceability constraints 
(e.g. the limit deflection, the limit crack width) 
are themselves subject to uncertainties, and 
should therefore be included in the probabilistic 
assessment. 


3 RELIABILITY ANALYSIS 


The knowledge of the reliability level of the struc- 
ture designed according to the national codes or 
nationally implemented Eurocodes and also the 
reliability (credibility) of prescriptive analytical 
models can be used for optimisation of design 
procedures or for further harmonisation of 
standards. 

The structural member or theoretical model 
may be considered as reliable if the condition 
$p_f \leq p_t$ is satisfied, where the failure probability $p_f$ is 
given as 


$$p_f = \int_{g(x) < 0} \varphi(x)\, dx \tag{4}$$ 


The failure probability can be expressed by the 
reliability index $\beta = -\Phi^{-1}(p_f)$, where $\Phi$ is the 
distribution function of the standardised normal variable. 
The failure probability $p_t$ is the target value that 
should not be exceeded during the intended reference 
period. 
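
The relation between the failure probability and the reliability index can be checked numerically. The following Python sketch uses the standard library's `statistics.NormalDist`; the function names are illustrative, and decimal points are used in the code where the text uses decimal commas.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standardised normal variable

def beta_from_pf(pf):
    """Reliability index beta = -Phi^{-1}(p_f)."""
    return -_STD_NORMAL.inv_cdf(pf)

def pf_from_beta(beta):
    """Failure probability p_f = Phi(-beta)."""
    return _STD_NORMAL.cdf(-beta)

# Failure probabilities corresponding to a few reliability indices.
pf_table = {beta: pf_from_beta(beta) for beta in (0.0, 1.3, 1.5, 1.7)}
```

A reliability index of 0 corresponds to a failure probability of 0,5, and an index of 1,5 to roughly 0,067.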

The reliability differentiation of structures in 
EN 1990 (2002) is based on three different levels of 
failure consequences with respect to the ultimate 
limit states (Consequence Classes CC1 to CC3). 
However, for the serviceability limit states a similar 
differentiation has not been provided yet. In 
some cases this differentiation of structures in 
serviceability limit states might also be useful for 
distinguishing potential consequences. 

EN 1990 (2004) recommends for the reversible 
serviceability limit states the target reliability index 
$\beta_t = 0$ and for the irreversible limit states $\beta_t = 1,5$ 
(for the fifty year reference period). Some further 
recommendations for the target values of reliability 
indices in the serviceability limit states are given in 
the JCSS Probabilistic Model Code (2014) where 
three consequence classes are proposed. Therefore, 
for structures categorized to reliability class RC1 
the target reliability index $\beta_t = 1,3$ and for structures 
in class RC3 the target value $\beta_t = 1,7$ could 
be considered. 

The reliability analysis of structural members 
for the ultimate or serviceability limit states can 
be carried out through the probability $p_{f1}$ of the 
action effects $E(X)$ randomly exceeding the structural 
resistance $R(X)$ according to the following 
relationship 


$$p_{f1} = P\{ \xi_R R(X) - \xi_E E(X) < 0 \} \tag{5}$$ 


where $X$ = vector of basic variables, $\xi_R$ and 
$\xi_E$ = coefficients of model uncertainty of resistance 
and action effects. 
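
A probability of the form in relationship (5) can be estimated by crude Monte Carlo simulation. The sketch below is an illustration under stated assumptions, not the method used in the paper: the distribution types and all numerical parameters of the resistance, the action effect, and the two model uncertainty coefficients are hypothetical.

```python
import random

def estimate_pf(sample_margin, n, seed=0):
    """Crude Monte Carlo estimate of P{margin < 0} from n samples."""
    rng = random.Random(seed)
    failures = sum(1 for _ in range(n) if sample_margin(rng) < 0)
    return failures / n

def example_margin(rng):
    """One sample of xi_R * R - xi_E * E (all distributions hypothetical)."""
    xi_r = rng.lognormvariate(0.0, 0.05)  # model uncertainty of resistance
    xi_e = rng.lognormvariate(0.0, 0.10)  # model uncertainty of action effects
    r = rng.gauss(10.0, 1.0)              # resistance R(X)
    e = rng.gauss(7.0, 1.0)               # action effect E(X)
    return xi_r * r - xi_e * e

pf1_estimate = estimate_pf(example_margin, 50_000)
```

For the small failure probabilities typical of ultimate limit states, crude sampling needs very many samples, so FORM/SORM or importance sampling are usually preferred; for serviceability limit states with reliability indices between 0 and about 2, crude Monte Carlo is often adequate.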

The reliability (credibility) of theoretical models 
given in standards can be analysed by means of 
the credibility of the specified design value $v_d(x_d)$ (e.g. 
deformation, crack width) determined on the vector 
of design variables. 

The probability $p_{f2}$ of exceeding the design value 
$v_d(x_d)$ specified according to relevant theoretical 
formulae and recommendations of the prescriptive 
design procedure can be analysed as 


$$p_{f2} = P\{ v_d(x_d) - \xi\, v(X) < 0 \} \tag{6}$$ 


The serviceability requirements in the limit states 
of crack width are analysed for an example of a 
selected reinforced concrete member as follows. 


4 CRACK WIDTH VERIFICATION 


Cracking in reinforced concrete members due 
to the load effects can be controlled by applying 
a theoretical (analytical) crack width model recommended 
in Eurocodes or other codes, or by fulfilling 
appropriate practical rules. In common cases the 
prescriptive rules for detailing and stress control 
are applied according to nationally implemented 
Eurocodes. However, in some cases the calculation 
of crack width is needed, e.g. for the design of 
water retaining structures. 
Presently various theoretical models exist for 
predicting the characteristic crack width $w_k$. A 
crack width model may be found in nearly any 
standard for the design of concrete structures. 
The preliminary ENV Eurocode for the design 
of concrete structures and also the Eurocode 
EN 1992-1-1 (2004) provide different models for 
the assessment of crack width. The Model Code 
2010 (2012) introduces a different theoretical 
crack width model than the previous MC 
1990 (1998). A new crack width model is 
proposed in the second generation of EN 1992-1-1, 
which is presently under development. Evidently 
the theoretical models have not yet been well 
established. 

The design of the structural members for the 
serviceability limit states of crack width is based 
on the inequality between the analyzed crack width 
and the crack width limit given as 


$$w_k(x_k) \leq w_{lim} \tag{7}$$ 


where the characteristic value of crack width $w_k(\cdot)$ 
for a vector of characteristic values $x_k$ of basic 
variables is specified on the basis of a prescriptive 
formula and $w_{lim}$ is the crack width limit. 

For the analysis of crack width, the basic vari- 
ables are commonly taken into account as deter- 
ministic ones where the geometric properties are 
mostly considered by nominal values, the actions 
and material properties by characteristic values. 
Therefore, the calculated values of crack width 
might have different statistical meaning (mean, 
characteristic, extreme). The variability of the 
basic variables may considerably influence the 
resulting value of crack width. It appears that the 
design value of crack width may be estimated only 
with certain reliability. 

Moreover, the specification of the required limit 
$w_{lim}$ for crack width is mostly based on past experience 
without an appropriate scientific background. 
The crack width limits should be determined on 
the basis of given performance requirements, taking 
into account the commonly adverse environment 
of the structure and possible consequences of 
excessive cracking. 


5 ANALYSIS OF SLAB CRACK WIDTH 


As an example, for the analysis of the theoretical 
models of crack width, a reinforced concrete mem- 
ber subjected to bending moment due to external 
forces is presented. A simply supported reinforced 
concrete slab having thickness from 0,19 to 0,29 m, 
span of 5 m is considered to be loaded by perma- 
nent and recommended imposed load of category 
B (office areas) given in EN 1991-1-1 (2002). 


The slab is designed for the Ultimate Limit 
States (ULS) according to the Eurocodes, but 
in some cases considering also rules of selected 
national standards (in those cases the national values 
of partial factors are applied). The limit states 
of crack width of the concrete slab are verified with 
respect to the theoretical crack width models given 
in Eurocodes, Model Code 2010 and also in some 
national standards. The common crack width limit 
$w_{lim}$ = 0,3 mm is considered here. It may be noted 
that the requirement of various standards for limit 
crack width is almost identical for the similar type 
of environment. However, different combinations 
of actions are in some cases recommended in 
national standards for verification of serviceability 
limit states of crack width (e.g. quasi-static or 
serviceability loads combinations). Thus, different values 
of crack width $w_k$ determined on the basis of a 
broad range of prescriptive recommendations are 
being compared with the same limiting value $w_{lim}$. 

The resulting values of crack width of a reinforced 
concrete slab based on different models 
given in the Eurocode EN 1992-1-1 (2006) and its 
prestandard ENV, in Model Code 2010 (2012) and 
also in the previous document Model Code 1990, in 
ACI 318 (1989), BS 8110 (1989) and CSN 73 1201 
(1986) are illustrated in Figure 1, partly based 
on the previous works of Holicky (2007) and Markova & 
Holicky (2014). It should be noted that the names 
of standards are shortened in all Figures (a standard 
in parentheses indicates that the slab is designed 
for the ultimate limit states according to national 
recommendations with respect e.g. to thickness of 
concrete cover and load combination). 

It is shown that the value of crack width is influenced 
by the selection of the theoretical crack width model 
and also by the previous design of the structural member for 
the ultimate limit state. Prescriptive documents 
give different requirements for material properties, 
geometrical parameters, loading, structural detailing, 
stress limitation, minimum area of reinforcement 
and its cover thickness. 

[Plot: $w_k$ [mm] versus $h$ [m], for $h$ from 0,19 to 0,29 m.] 
Figure 1. Characteristic values of crack width $w_k$ of 
a reinforced concrete slab based on selected models in 
Eurocodes, Model Code documents and selected national 
standards. 


6 PROBABILISTIC CRACK WIDTH 
VERIFICATION 


The time-independent reliability analysis of the 
slab for the limit state of crack width deals 
with the probability $p_{f1}$ of the random crack width 
$w(X)$ over-crossing the required constraint $w_{lim}$ 
expressed by 


$$p_{f1} = P\{ \xi_{lim} w_{lim} - \xi_w w(X) < 0 \} \tag{8}$$ 


where $X$ is a vector of basic variables and $\xi_{lim}$, $\xi_w$ 
are the model uncertainties of the requirements 
on the crack width limit and the model of crack 
width, respectively. The structure is assumed to be 
reliable if the inequality is satisfied 


$$p_{f1}(X) \leq p_t \tag{9}$$ 


where $p_t$ = the target probability of failure that 
should not be exceeded during the design working 
life of the structure. Three different levels of the 
serviceability performance of structures should be 
distinguished: irreversible, reversible and long-term. 

The probabilistic models of basic variables enter- 
ing the equation (6) are based on recommendations 
of PMC (2014) and previous reliability analyses 
developed in the Klokner Institute CTU. Some of 
the models applied in the reliability analyses are 
assumed to be deterministic values, while the others 
are considered as random variables having the nor- 
mal, lognormal, beta or gamma distribution. Sta- 
tistical properties of basic variables are described 
using the moment characteristics (by mean, stand- 
ard deviation), lower and upper bounds. 

The theoretical models assume different probabilities 
of exceeding the characteristic value of 
crack width, or the maximum crack spacing. 
The probability of over-crossing the characteristic 
crack width $w_k$ is 5% according to Eurocodes 
[1,3], Model Code 2010 and CSN 73 1201 (1986), 
10% in CEB-FIP Model Code 1990 and 20% in BS 
8110 (1989). The limit state function is based on 
the theoretical model of mean crack width. Therefore, 
the effect of the theoretical models assuming 
different probabilities of over-crossing the 
design value of crack width, or the specified maximum 
distance of cracks, is excluded. 
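
The effect of the assumed exceedance probability on a characteristic value can be illustrated numerically. The sketch below assumes, purely for illustration, a normally distributed crack width with a hypothetical mean and standard deviation; the codes themselves do not prescribe this distribution, and the function names are illustrative.

```python
from statistics import NormalDist

def fractile_factor(p_exceed):
    """Standard normal fractile u such that P(U > u) = p_exceed."""
    return NormalDist().inv_cdf(1.0 - p_exceed)

def characteristic_value(mean, std, p_exceed):
    """Value exceeded with probability p_exceed (normal distribution assumed)."""
    return mean + fractile_factor(p_exceed) * std

# Hypothetical mean crack width 0.20 mm with standard deviation 0.05 mm.
w_k_by_probability = {
    p: characteristic_value(0.20, 0.05, p) for p in (0.05, 0.10, 0.20)
}
```

With the same mean model, a 5% exceedance probability gives a larger characteristic crack width than a 10% or 20% exceedance probability, which is one reason characteristic values from different standards are not directly comparable.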

The results of reliability analysis of the rein- 
forced concrete slab for the limit state of crack 
width are illustrated in Figure 2. 


Figure 2. Reliability of reinforced concrete slab of 
height $h$ for the limit states of crack width. 


Analysis of the sensitivity factors $\alpha$ of basic 
variables indicates that the significant basic variables 
influencing the design crack width include 
permanent and variable loads, thickness of the 
cover of reinforcement, tensile strength of concrete 
and influence coefficients (e.g. expressing the 
bond strength, duration of the loading, shape of 
the strain across cross-section). 


7 CREDIBILITY ANALYSIS OF CRACK 
WIDTH MODELS 


The credibility (reliability) of the specified characteristic 
value of crack width $w_k$ is verified. The 
probability of the random variable $w(X)$ exceeding 
the crack width $w_k$ determined in accordance with 
the relevant theoretical model of a particular standard 
is expressed as 


$$p_{f2} = P\{ w_k(x_k) - \xi\, w(X) < 0 \} \tag{10}$$ 


where $x_k$ is the vector of characteristic values of 
basic variables and the coefficient $\xi$ represents 
uncertainties of action effects and inaccuracy of 
the crack width model. 

The probability $p_{f2}$ of the random crack width $w(X)$ 
over-crossing the crack width $w_k$ according to 
relationship (7) for the slab, expressed here by the reliability 
index $\beta$, is illustrated in Figure 3. 

Analysis of the credibility of the specified crack 
width $w_k$ indicates that the reliability index $\beta$ 
determined for a reinforced concrete slab appears 
to be low for the theoretical models introduced 
in national American and British standards and 
rather high in the Czech standard. The credibility 
of theoretical models seems to be sufficient in 
EN 1992-1-1 (2004), and slightly higher than 
the credibility of the model provided in Model 
Code 2010. 

Figure 3. The credibility of the determined crack width
$w_k$ for selected theoretical models.


8 CONCLUDING REMARKS 


Presently the basic Eurocode EN 1990 for the 
design of structures introduces general recommen- 
dations for the verification of the serviceability 
limit states. The development of supplementary 
provisions is needed including classification of 
structures based on the consequences of failure. 
Deterministic methods of structural analysis com- 
monly used for the verification of structures in the 
serviceability limit states do not enable objective 
evaluation of structural reliability. The probabi- 
listic methods facilitate comprehensive analysis 
of the reliability of a structure and assessment of 
credibility of used theoretical models. 

An example of verification of a reinforced concrete slab with
respect to the limit state of crack width indicates that the
same limiting serviceability constraint (crack width limit) is
compared with characteristic values of crack width that are
obtained from a broad range of normative recommendations and
that differ in their statistical meaning.

The methods of structural reliability enable realistic analysis
of concrete members with respect to the crack width. Reliability
indices $\beta$ assessed in the analysis of the credibility of
the analytical crack width formulae and of the reliability of the
reinforced concrete slab with respect to the limit crack width
have a significant scatter and in some cases seem to be rather
low.

The credibility of theoretical models for crack width is
independent of the previous design of a reinforced concrete slab
for the ultimate limit states of Eurocodes or relevant national
standards and may be applied for calibration purposes.

Analysis of sensitivity factors $\alpha$ indicates the
significant basic variables influencing the reliability index
and the credibility of the design crack width, including
permanent and variable loads, thickness of the cover of
reinforcement, tensile strength of concrete and influence
coefficients (expressing the bond strength, duration of loading).

It appears that the probabilistic methods can be effectively
used for the development and calibration of new theoretical
models applied in design and verification of structures. They
may also be applied for further harmonisation of the Nationally
Determined Parameters (NDPs) in Eurocodes.

ABSTRACT: The soft-packed power battery is becoming a promising power source because of its high
energy density and light weight. The durability of the sealing has a significant impact on the service life of the
soft-packed battery. However, there is currently no effective test method or theoretical model to evaluate
its sealing life. In this paper, an Accelerated Degradation Test (ADT) is designed and performed to investigate
the sealing strength degradation. The tests measure the degradation rates and residual tearing strengths
of standard specimens under different accelerated stresses. Based on the testing data and membrane theory,
a modified Cohesion Zone Model (CZM) is proposed to describe the tearing curve and strength degradation.
The model agrees well with the testing data.


1 INTRODUCTION 


The soft-packed power battery, as the core component of electric
vehicles, has become a focus of study because of its light weight
and high energy density compared with the traditional hard-shell
battery (Gallagher, K.C. 2016, Lu, L. 2013, Lee, H. 2014).
However, the sealing of the soft-packed battery still faces some
challenges. The sealing durability is one of the key factors
restricting the reliability and safety of the battery, because
the sealing is constantly subjected to a tearing load due to the
slow and steady increase of the inner gas pressure. This paper is
therefore devoted to investigating the tearing behavior of the
sealing and developing an effective testing method and analytical
model for its life prediction.

Many methods have been proposed to describe film tearing
behavior, such as the Kim model (Kim, K.S. 1988), the
Wei-Hutchinson model (Wei, Y. 1998) and the cohesion zone model
(Barenblatt 1959, Dugdale 1960). Among them, the cohesion zone
model (CZM) is a more complete and important mathematical tool
for characterizing crack propagation. In the study of elastic and
plastic materials, it can effectively simulate plastic
deformation and creep behavior ahead of the crack tip (Barenblatt
1959, Dugdale 1960, Needleman 1987, 1990). Needleman proposed a
unified theoretical framework for the cohesion zone from initial
debonding to complete detachment (Needleman 1987). Kent used the
trapezoidal cohesion model to study the composite cracking
process of the film adhesive layer (Kent 2008). Zhao used the CZM
to simulate the tearing process between a metal film and a
ceramic substrate (Zhao, H.F. 2008). However, these methods
describe only the instantaneous tearing strength. They do not
consider the strength degradation due to viscous deformation of
the polymer material under a constant load over a long time
period. The theoretical analysis here is concerned with how the
tearing strength degrades with time and with its magnitude at
any moment. The models therefore need to be modified to account
for the degradation characteristics of product performance, and
a CZM-based method for the tearing strength degradation of the
sealing is worth developing.

To obtain degradation information, an accelerated degradation
test (ADT) is adopted in this study. It is applicable to products
with long life and strength degradation. ADT accelerates
performance degradation by increasing the stress level while
keeping the degradation mechanism unchanged, and collects
performance degradation data (Fallou, B. 1979). These data are
then used to estimate the reliability of the product and to
predict its life at normal stress levels. Nelson first studied
the accelerated degradation test, analyzed accelerated
degradation test data of an insulation material, and obtained the
service life of the insulation material under normal stress
(Nelson, W. 1990). The constant stress accelerated degradation
test is widely applied in product life assessment; it is simple
to operate and its statistical methods are mature (Ma, X.B.
2011). It is therefore suitable for lithium-ion batteries, which
are long-life and highly reliable products (Meeker, W.Q. 1995).
The decrease of the residual strength of the sealing is a clear
degradation characteristic: the polymer layer of the heat-sealed
soft package undergoes viscous deformation and is eventually torn
under continuous stress. Most studies still focus on the
correlation between strength and displacement of the sealing, so
a mature life evaluation method for the sealing is lacking. There
is no study of the degradation of the sealing tearing strength.

In this study, first, the behavior and mecha- 
nism of sealing tearing strength degradation are 
analyzed. A constant stress ADT is designed and 
implemented. Then, a modified CZM is developed 
to evaluate strength degradation using testing 
data. Finally, some conclusions and discussions 
are given based on the current research. 


2 TESTING 


2.1 Specimens and experiment set-up 


The product tested in this study is the outer aluminum-plastic
packaging film of the soft-packed lithium-ion battery, a
PA/AL/CPP composite film. The innermost layer is CPP (cast
polypropylene), which has a high thermal viscosity and fuses when
the upper and lower sealing heads are pressed together at high
temperature. The middle layer is AL (aluminum foil), which is the
carrier of the heat-sealing material and prevents water
penetration. The outermost layer is PA (nylon), which provides a
certain puncture resistance and plays a decorative role. Fig. 1
and Fig. 2 show the product and the product stress diagram,
respectively. The battery produces gas inside during storage or
use. The gas pressure causes the sealing of the aluminum-plastic
film to be torn, which results in the failure of the battery.

The standard specimens are cut out of the seal 
side with a spill area as shown in Fig. 3. The width 
W is 8 mm and the length L is 40 mm. 

The test system consists of two parts, an in-situ fatigue test
bench and an optical measurement microscope, shown in Fig. 4.
The test stage, manufactured by Care Test Company, is mounted on
the stage of a measuring microscope and the sample is clamped by
the stage. During loading and unloading, images are captured at
high resolution using an industrial miniature camera. At the same
time, experimental data are recorded and analyzed by the
Care-Test-fatigue system.

Figure 1. CPP soft-package lithium-ion battery.

Figure 2. Sealing tearing process caused by the inner
gas pressure.

Figure 3. The illustration of specimen (adhesive zone: CPP;
length 40 mm).


2.2 Experimental procedure 


2.2.1 Strength curve measurement

To standardize the tearing strength of the sealing and define the
stretch rate of the accelerated degradation test, a direct
tearing test is performed first. The specimen is fixed as shown
in Fig. 5. The fatigue stretch test machine is run under
displacement control, and several tests are performed at
different stretch rates. The tearing direction is perpendicular
to the adhesive area and the tearing angle is 90 degrees. It is
known that the tearing process of polymeric materials is
influenced by temperature and load (Zhang, G.J. 1996). The test
is usually carried out at room temperature, and the load is the
more sensitive factor. Therefore, only the tearing behavior at
room temperature and the strength degradation behavior under
different constant loads are studied in this paper. The test
temperature is controlled at 25 °C and the maximum tensile load
is recorded. According to GB/T 22638.7-2008 (Test Method for
Aluminum Foils, Part 7: Tearing strength), the tearing strength
is defined as the maximum tensile load at a fixed width. Five
specimens were tested at each tearing speed, and the average
value is taken as the tearing strength at that speed.

Figure 4. In-situ optical microscopy fatigue testing system
(optical measuring microscope and fatigue testing system).

Figure 5. Specimen installation method.

The load-displacement curves are obtained by the
Care-Test-fatigue system. The tearing curves converge when the
stretch rate drops to 1 N/mm; the ADT stretch rate is therefore
defined as 1 N/mm. Fig. 6 shows that the load-displacement curves
obtained from the five sets of data differ, so the test results
are averaged. Fig. 7 shows the process of the sealing tearing.
The maximum loads in the five data sets are 45.37, 44.92, 44.96,
45.61 and 44.14 N, with an average of 45.00 N. The corresponding
displacements are 2.03, 2.18, 1.99, 2.00 and 1.80 mm, with an
average of 2.00 mm. To illustrate the general mechanics of the
sealing adhesive interface, the five groups of curves are fitted
to a standard curve using the least squares method, as shown in
Fig. 6.

Figure 6. Schematic illustration of curve fit to test results
(load-displacement curve).

Figure 7. The physical photo of sealing tearing process.
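
The averaging of the five direct tearing tests reported above can be checked directly. The following Python sketch recomputes the mean peak load and the mean displacement from the values quoted in the text; the variable names are illustrative only.

```python
from statistics import mean

# Peak loads (N) and corresponding displacements (mm) of the five
# direct tearing tests, as quoted in the text.
peak_loads_n = [45.37, 44.92, 44.96, 45.61, 44.14]
displacements_mm = [2.03, 2.18, 1.99, 2.00, 1.80]

mean_peak_load = mean(peak_loads_n)
mean_displacement = mean(displacements_mm)

print(f"mean peak load:    {mean_peak_load:.2f} N")
print(f"mean displacement: {mean_displacement:.2f} mm")
```

Both means round to the values used in the rest of the analysis, 45.00 N and 2.00 mm.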

There is almost no change in the displacement when the load is
less than 5 N. When the load is between 5 and 20 N, the
displacement increases approximately linearly, which can be
regarded as the elastic stretching stage. When the load exceeds
20 N, the rate of displacement change gradually increases. After
the load reaches a maximum of 45 N, the sealing tears rapidly
until it is completely torn. The sealing tearing strength is
therefore calibrated as 45 N. According to these test results,
combined with the principle of degradation, the constant load of
the ADT can be set between 5 and 20 N, and tests under different
load conditions are carried out.


2.2.2 Accelerated degradation test 

Without changing the failure mechanism of the product, the test
conditions are intensified to accelerate the failure of the
specimens, and the performance degradation data under the high
stress level are used to extrapolate the service life of the
product under normal use stress. This is the basic principle of
the ADT. The specific method is to measure the degradation rate
of the sealing tearing strength under different loads (5, 10 and
20 N) and different durations (20 h and 40 h). From the
relationship between time and tearing displacement, the
regression curve equation and the coefficient of determination
are obtained. The curves of tearing strength degradation of the
sealing under different constant loads for a duration of 20 h are
shown in Fig. 8. It can be seen that the sealing is slowly torn
with time even if the load is constant, which shows that the
tearing strength has degraded. As the constant load increases,
the degradation rate gradually increases.
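
As a sketch of how a degradation rate and a coefficient of determination can be obtained from a displacement-time record, the following Python code fits a straight line by ordinary least squares. The time and displacement values are hypothetical placeholders, not the measured data of Fig. 8, and the function name `fit_line` is an assumption made for illustration.

```python
def fit_line(times, displacements):
    """Ordinary least-squares fit of displacement = intercept + rate * time.

    Returns (rate, intercept, r_squared).
    """
    n = len(times)
    mean_t = sum(times) / n
    mean_d = sum(displacements) / n
    s_tt = sum((t - mean_t) ** 2 for t in times)
    s_td = sum((t - mean_t) * (d - mean_d)
               for t, d in zip(times, displacements))
    rate = s_td / s_tt
    intercept = mean_d - rate * mean_t
    ss_res = sum((d - (intercept + rate * t)) ** 2
                 for t, d in zip(times, displacements))
    ss_tot = sum((d - mean_d) ** 2 for d in displacements)
    r_squared = 1.0 - ss_res / ss_tot
    return rate, intercept, r_squared


# Hypothetical record: holding time in hours, tearing displacement in mm.
hours = [0.0, 5.0, 10.0, 15.0, 20.0]
displacement_mm = [0.10, 0.21, 0.29, 0.41, 0.50]

rate, intercept, r2 = fit_line(hours, displacement_mm)
print(f"degradation rate = {rate:.4f} mm/h, R^2 = {r2:.4f}")
```

The slope plays the role of the degradation rate for one constant load; repeating the fit for each load level gives the rate-versus-load data needed later.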

To measure the residual strength after degradation, a
displacement-controlled tearing test was performed after the ADT.
The result is shown in Fig. 10, and Fig. 9 shows the process of
the sealing tearing.

As can be seen from Fig. 10, for the same holding time the
maximum tearing strength gradually decreases as the constant load
increases: when the holding time is 20 h and the constant load is
5 N and 10 N, the corresponding tearing strength peaks are 43 N
and 38 N, respectively. For the same constant load, the maximum
tearing strength decreases as the holding time increases: when
the constant load is 10 N and the holding time is 20 h and 40 h,
the peak tearing strength corresponding to complete tearing is
38 N and 33 N, respectively. This shows that the cumulative
effect of load has a significant effect on the sealing tearing: a
constant load applied to the sealing for a period of time causes
the tearing strength to decrease.

Figure 8. Schematic illustration of relationship between
the displacement and time.

Figure 9. The physical photo of sealing tearing process.

Figure 10. Schematic illustration of curve fit to test results
(load-displacement curves: tearing test; constant load = 5 N,
20 h; constant load = 5 N, 40 h; constant load = 10 N, 20 h;
constant load = 10 N, 40 h).


3 THEORETICAL ANALYSIS 


3.1 Exponent cohesion zone model 


The mechanical response of polymer materials always shows a
combination of elastic and viscous properties, which is called
viscoelasticity (Bower, D.I. 2002). CPP is subjected to
thermomechanical treatment, such as stretching, at a temperature
below the melting point and above the glass transition
temperature (the glass transition temperature of the
polypropylene material is −35°C). In this case, orientation of
the polymer may occur. As a result of the orientation, the
tensile strength and flexural fatigue strength of the polymer
material increase significantly in the orientation direction. At
the same time, the strength perpendicular to the orientation
direction decreases significantly. Macroscopically, this appears
as tearing.

The CZM is based on elastic-plastic fracture mechanics and is
used to investigate the elastic-plastic region at the crack tip.
The constitutive relation of the cohesion zone is defined by the
cohesive force and the opening displacement at the interface. On
the basis of the test, the interface is characterized as a thin
layer obeying an exponential constitutive relation, and the
exponential CZM is chosen to simulate the separation of the
sealed boundary. The tearing strength is given by:

$$T = T_{\max}\,\frac{\Delta}{\Delta_c}\,\exp\!\left(1 - \frac{\Delta}{\Delta_c}\right) \qquad (1)$$

Figure 11. Tearing strength-displacement curve.

In Fig. 11, the displacement characteristic value $\Delta_c$ and
the tearing strength value $T_{\max}$ are two independent
parameters. However, the energy required for separation per unit
area is often used as a parameter instead. It is equal to the
area under the tearing strength-displacement curve, expressed as
follows:

$$G_c = \int_0^{\infty} T(\Delta)\, d\Delta \qquad (2)$$

It can be determined by a tearing test. If the energy required
for separation $G_c$ is used as a model parameter, then equation
(1) can be expressed as:

$$T = \frac{G_c}{\Delta_c}\,\frac{\Delta}{\Delta_c}\,\exp\!\left(-\frac{\Delta}{\Delta_c}\right) \qquad (3)$$
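
The link between equations (1) and (3) follows from evaluating the integral in equation (2) for the exponential law. The short derivation below is added as a worked step; it writes $T_{\max}$ for the peak tearing strength and $\Delta_c$ for the characteristic displacement, and uses the substitution $x = \Delta/\Delta_c$, which is not part of the original text.

```latex
G_c = \int_0^{\infty} T_{\max}\,\frac{\Delta}{\Delta_c}\,
      \exp\!\left(1-\frac{\Delta}{\Delta_c}\right) d\Delta
    = e\,T_{\max}\,\Delta_c \int_0^{\infty} x\,e^{-x}\,dx
    = e\,T_{\max}\,\Delta_c ,
\qquad\text{so}\qquad
e\,T_{\max} = \frac{G_c}{\Delta_c}.
```

Substituting $e\,T_{\max} = G_c/\Delta_c$ into equation (1) gives equation (3).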


In this paper, the required energy $G_c$ (N/mm) and the
displacement characteristic value $\Delta_c$ (mm) are the two
independent constitutive parameters. The relationship between
tearing strength and displacement was measured in the test: the
peak strength was 45.00 N at a displacement of 2.0 mm, and the
calculated interfacial fracture energy was 244.66 N/mm. Fig. 12
compares the CZM prediction with the experimental data on the
specimens; the CZM gives a satisfactory prediction. The CZM
expression for the sealing is as follows:

$$T = \frac{0.24466}{2.0}\,\frac{\Delta}{2.0}\,\exp\!\left(-\frac{\Delta}{2.0}\right) = f(\Delta) \qquad (4)$$

Figure 12. Comparison of tearing curve in CZM model
and test result.
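
A minimal Python sketch of the exponential CZM, assuming equation (1) has the form $T = T_{\max}(\Delta/\Delta_c)\exp(1-\Delta/\Delta_c)$ with the measured peak of 45.00 N at 2.0 mm. The function names and the trapezoidal integration are illustrative choices, not the authors' implementation. The sketch checks that the peak sits at $\Delta = \Delta_c$ and that the area under the curve matches the closed form $e\,T_{\max}\,\Delta_c$, which is close to the fracture energy quoted in the text.

```python
import math

T_MAX = 45.0     # peak tearing strength from the test, N
DELTA_C = 2.0    # displacement at the peak, mm


def traction(delta, t_max=T_MAX, delta_c=DELTA_C):
    """Exponential cohesive law: tearing strength (N) at displacement delta (mm)."""
    ratio = delta / delta_c
    return t_max * ratio * math.exp(1.0 - ratio)


def fracture_energy(upper=60.0, steps=60000):
    """Area under the traction curve by the trapezoidal rule (N*mm).

    The upper limit truncates the infinite integral; the tail beyond
    30 characteristic displacements is negligible.
    """
    h = upper / steps
    total = 0.5 * (traction(0.0) + traction(upper))
    total += sum(traction(i * h) for i in range(1, steps))
    return total * h


peak = traction(DELTA_C)
energy = fracture_energy()
closed_form = math.e * T_MAX * DELTA_C

print(f"T(delta_c) = {peak:.2f} N")
print(f"numerical G = {energy:.2f}, e*T_max*delta_c = {closed_form:.2f}")
```

Using the energy as the second parameter instead of the peak strength only rescales the same curve, which is why equations (1) and (3) describe one and the same law.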


3.2 Modified CZM 


The change in the displacement of the adhesive area under a
constant load is shown in Fig. 9. To account for strength
degradation, time and constant load are taken as degradation
factors to modify equation (4). The tearing process can be
divided into three stages: reaching the constant load, holding
the constant load, and rapid tearing. Assume that the constant
load is $T_0$ (N); the displacement at which the constant load is
reached is $f^{-1}(T_0)$; the holding time under constant load is
$t$ (s); the degradation rate, determined from the degradation
rate curve, is $v$; and the displacement at the start of rapid
tearing is $f^{-1}(T_0) + vt$. Then the modified CZM can be
written as:

A. reaching the constant load: $T = f(\Delta)$

B. holding the constant load: $T = T_0$

C. rapid tearing: $T = f(\Delta) - f(vt)$

$$T(t, T_0, \Delta) = \begin{cases}
f(\Delta), & 0 < \Delta \le f^{-1}(T_0) \\
T_0, & f^{-1}(T_0) < \Delta \le f^{-1}(T_0) + vt \\
f(\Delta) - f(vt), & f^{-1}(T_0) + vt < \Delta
\end{cases} \qquad (5)$$
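
The three-stage law can be sketched in Python as follows. This is a literal transcription of equation (5) as printed, not the authors' code. It assumes that $f$ is the exponential law with a peak of 45 N at 2.0 mm, that $f^{-1}(T_0)$ is taken on the rising branch of $f$ and found by bisection, and that the degradation rate and the holding time are supplied in mutually consistent units. The numerical rate used in the example is a hypothetical placeholder.

```python
import math

T_MAX = 45.0    # peak tearing strength of the undegraded seal, N
DELTA_C = 2.0   # displacement at the peak, mm


def base_traction(delta):
    """Undegraded exponential cohesive law f(delta), in N."""
    ratio = delta / DELTA_C
    return T_MAX * ratio * math.exp(1.0 - ratio)


def inverse_rising(load, tol=1e-12):
    """Displacement on the rising branch where f(delta) = load (bisection)."""
    if not 0.0 <= load <= T_MAX:
        raise ValueError("load must lie between 0 and the peak strength")
    lo, hi = 0.0, DELTA_C
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if base_traction(mid) < load:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def modified_traction(delta, load, rate, duration):
    """Piecewise law of equation (5), transcribed as printed."""
    d_hold = inverse_rising(load)   # f^{-1}(T_0)
    creep = rate * duration         # v * t
    if delta <= d_hold:
        return base_traction(delta)             # stage A
    if delta <= d_hold + creep:
        return load                             # stage B
    return base_traction(delta) - base_traction(creep)  # stage C


# Hypothetical example: 10 N held for 20 h at an assumed rate of 0.01 mm/h.
for d in (0.1, 0.3, 2.0):
    print(f"delta = {d:.1f} mm -> T = {modified_traction(d, 10.0, 0.01, 20.0):.2f} N")
```

Keeping the rate as an explicit argument makes it easy to plug in a fitted rate-load relation once its units are known.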


In the ADT, different constant loads change the rate of tearing
strength degradation. The degradation rate under each constant
load is obtained as the slope of the corresponding straight line
in Fig. 8; the results are given in Table 1. The fitted
relationship between degradation rate and constant load is shown
in Fig. 13 and expressed by the following equation:

$$v = 0.3939\,T_0 = g(T_0) \qquad (6)$$

Table 1. Tearing strength degradation rate under different
constant load.

Figure 13. Schematic illustration of curve fit to test data.


Substituting Equation (6) into Equation (5), the modified CZM can
be expressed as:

$$T(t, T_0, \Delta) = \begin{cases}
f(\Delta), & 0 < \Delta \le f^{-1}(T_0) \\
T_0, & f^{-1}(T_0) < \Delta \le f^{-1}(T_0) + 0.3939\,T_0\,t \\
f(\Delta) - f(0.3939\,T_0\,t), & f^{-1}(T_0) + 0.3939\,T_0\,t < \Delta
\end{cases} \qquad (7)$$


4 MODEL VALIDATION


In this section, experimental data on specimens under different
constant loads and durations are used to validate the proposed
modified CZM.

In Figure 14, the model predictions are compared with testing
data under constant loading and duration. The horizontal axis is
the tearing displacement and the vertical axis is the tearing
strength. The red line is the testing data and the blue line
represents the predictions of the proposed model. Panel (a) shows
the case with a constant load of 5 N and a duration of 40 h,
panel (b) a constant load of 10 N and a duration of 40 h, and
panel (c) a constant load of 20 N and a duration of 20 h. The
predictions are in good agreement with the testing data, and the
modified model gives satisfactory predictions of the residual
strength. Overall, the model describes the strength degradation
caused by constant loading and its duration.

Figure 14. Comparison of tearing curve in modified
CZM model and test result.


5 CONCLUSION AND FUTURE WORK


In this paper, a modified CZM that accounts for degradation is
proposed to evaluate the tearing strength of the sealing and to
calculate the residual strength from the ADT. The modified model
matches the testing data well. It helps to quantify the effect of
constant loading on the real-time tearing strength of the
sealing, so that many safety problems in use can be avoided.
Based on the current research, the following conclusions can be
drawn:

1. An ADT is designed to simulate the tearing process of the
sealing by using standard specimens on the in-situ tensile
machine. The tearing strength of the sealing degrades under a
small static tensile load, and the sealing is torn over a long
period of time. A positive correlation is observed between the
degradation rate and the load.

2. A modified exponential CZM considering degradation is proposed
to describe the tearing strength degradation curve and the
residual strength curve of the sealing. Different strength
degradation data are used for model validation. Good agreement is
seen between the testing data and the model predictions.

The current study considers stress only; the effects of
temperature and electrolyte will be investigated in the future.

Probabilistic analyses of existing power producing facilities

ABSTRACT: Assessment of existing components of power producing facilities is based on the probabilistic
methods of the theory of structural reliability provided in Eurocodes and ISO standards. An
example of quick-closing valves in a selected hydroelectric power plant illustrates the assessment of
reliability and the prediction of the remaining working life of a structural component for the considered
model of corrosion. Moreover, assessment of the main steel structural members of an existing industrial
bridge in the power plant, for two alternatives of cleaning machinery, reveals a rather low reliability level
of some significantly deteriorated members, so additional measures need to be applied.


1 INTRODUCTION 


Various existing construction works in power- 
plants built in the Czech Republic are gradually 
deteriorating due to the effects of adverse factors 
having significant impact on their reliability. The 
main factors include adverse surrounding envi- 
ronment, chemical attacks, physical effects and 
repeated actions due to technological processes. 

Actual technical state of power producing facili- 
ties needs to be regularly evaluated enabling opti- 
mal decision regarding their repairs, strengthening 
or replacement. Simplified conservative procedures 
for the reliability assessment of existing structures 
based on the methods commonly used in current 
codes for the design of new structures may lead to 
their expensive repairs. 

That is why the assessment of existing structures of power plants
is based on the probabilistic methods of the theory of structural
reliability given in the international standards ISO 13822 (2010)
and ISO 13823 (2008).

The probabilistic assessment of existing power facilities is
motivated by the need to assess their current reliability level,
to estimate their remaining working life and to accommodate a
change of technological procedure that calls for a more effective
but heavier machine. A selected hydroelectric power plant in the
Czech Republic is used here as an example to show the procedure
for estimating the reliability, bearing capacity and residual
life of existing structures for assumed corrosion models.


2 VERIFICATION OF WORKING LIFE 


The reliability requirements for existing structures as well as
for new ones may be expressed in terms of the failure probability
$P_f$ or the reliability index $\beta$. The relationship between
the two reliability measures is given as

$$\beta = -\Phi^{-1}(P_f) \qquad (1)$$

where $\Phi(\cdot)$ denotes the standardized normal distribution
function. The following reliability condition is required

$$P_f \le P_t \quad \text{or} \quad \beta \ge \beta_t \qquad (2)$$

where $P_t$ and $\beta_t$ are the target values of the failure
probability and reliability index.
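
The conversion between failure probability and reliability index can be sketched with the Python standard library. The function names below are illustrative assumptions; the target indices in the loop are the indicative values listed in Table 1.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def beta_from_pf(p_f):
    """Reliability index for a given failure probability, beta = -Phi^{-1}(P_f)."""
    return -_STD_NORMAL.inv_cdf(p_f)


def pf_from_beta(beta):
    """Failure probability for a given reliability index, P_f = Phi(-beta)."""
    return _STD_NORMAL.cdf(-beta)


def is_reliable(beta, beta_target):
    """Reliability condition (2) expressed through the reliability index."""
    return beta >= beta_target


for beta_t in (1.5, 2.3, 3.1, 3.8, 4.3):
    print(f"beta_t = {beta_t:.1f}  ->  P_t = {pf_from_beta(beta_t):.2e}")
```

Because $\Phi$ is monotonic, the two forms of condition (2) are equivalent, so a check on $\beta$ is sufficient.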

The target reliability level which should be used for the
verification of existing structures may be based on calibrations
taking into account the concept of the minimum expected costs and
social risks. The recommended values of the target reliability
index $\beta_t$ for the verification of various limit states
given in ISO 13822 (2010) are illustrated in Table 1.

More detailed target reliability indices depending on the
consequences of failures and the relative costs for safe design
are provided in ISO 2394 (2015).

EN 1990 (2002) gives principles of reliability differentiation in
Annex B. Three reliability classes RC are recommended and the
target values of reliability indices $\beta_t$ are proposed,
including examples of construction works. Presently,
hydro-technical construction works are not within the scope of
the current generation of Eurocodes and they need to be based on
supplementary national prescriptive provisions.

Table 1. Indicative target reliabilities for existing structures.

Limit states                            Target reliability index $\beta_t$
Serviceability, irreversible            1.5
Ultimate with failure consequences
  very low                              2.3
  low                                   3.1
  medium                                3.8
  high                                  4.3

Table 2. Reliability differentiation of hydro-technical
structures according to CSN EN 1990 (2011).

Reliability class   Examples of construction works
RC3                 Dam, weir higher than 5 m, main conduit of
                    potable water to town agglomeration
RC2                 Sewage disposal plant, accumulating pumping
                    plant, water reservoir, aqueduct,
                    hydro-electric power station, weir height
                    up to 5 m, nautical channel
RC1                 Swimming pool, melioration structure

Table 3. Partial factors for actions, CSN 75 6303 (2012).

Type of action            RC1    RC2    RC3
Hydrostatic pressure      1.0    1.1    1.2
Hydrodynamic pressure     1.2    1.3    1.4
Permanent action          1.2    1.35   1.4
Variable action           1.35   1.5    1.65

Reliability differentiation of hydro-technical con- 
struction works is illustrated in Table 2 as recom- 
mended in the recently developed National Annex 
to CSN EN 1990 (2011) of the Czech Republic. 

Partial factors for hydrostatic and hydrody- 
namic pressures, permanent and variable actions 
recommended in CSN 75 6303 (2012) for reliability 
classes RC1 to RC3 are given in Table 3. 

Application of probabilistic methods for specification of the
working life of an existing structure is illustrated in Figure 1,
Holicky & Markova (2007).

Figure 1. Probabilistic assessment of remaining working life.

It is assumed that the assessment (inspection) of an existing
component of a power producing facility is performed at a given
time after completion of the structure. Provided that the
time-dependent resistance $R(t)$ of a component and the load
effects $E(t)$ are known, the remaining working life of the
component may be specified.

The residual working life $t_{res}$ of the component is estimated
from the expression

$$P_f(t_{res}) = P\{R(t_{res}) - E(t_{res}) < 0\} = P_t \qquad (3)$$

which facilitates the decision about its repair or replacement.
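
The idea behind equation (3) can be illustrated with a deliberately simplified Python sketch. It assumes independent, normally distributed resistance and load effect, a mean resistance that decreases linearly in time, and constant coefficients of variation. All of these modelling choices and all numerical values are hypothetical and are not taken from the paper; only the target index of 3.8 corresponds to the medium-consequence value in Table 1. The residual life is the time at which the reliability index falls to the target.

```python
import math


def reliability_index(t, mu_r0, loss_rate, cov_r, mu_e, cov_e):
    """Reliability index at time t for normal, independent R(t) and E.

    The mean resistance is assumed to decay linearly: mu_r0 * (1 - loss_rate * t).
    """
    mu_r = mu_r0 * (1.0 - loss_rate * t)
    sigma = math.hypot(cov_r * mu_r, cov_e * mu_e)
    return (mu_r - mu_e) / sigma


def residual_life(beta_target, t_max, **model):
    """Largest time in [0, t_max] at which beta(t) >= beta_target (bisection)."""
    if reliability_index(0.0, **model) < beta_target:
        return 0.0
    if reliability_index(t_max, **model) >= beta_target:
        return t_max
    lo, hi = 0.0, t_max
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if reliability_index(mid, **model) >= beta_target:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical model parameters (consistent units, e.g. kNm and years).
model = dict(mu_r0=20.0, loss_rate=0.004, cov_r=0.08, mu_e=10.0, cov_e=0.10)
t_res = residual_life(beta_target=3.8, t_max=200.0, **model)
print(f"residual working life = {t_res:.1f} years")
```

Bisection is adequate here because the reliability index of this simplified model decreases monotonically as the mean resistance drops; a full analysis would replace `reliability_index` with a FORM or Monte Carlo evaluation of the actual limit state.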

The general requirements for the assessment of structural
reliability and safety for users are applied in the reliability
analysis of quick-closing steel valves and of an existing
industrial steel bridge in a hydroelectric power plant that has
been operating for about 50 years.


3 VERIFICATION OF QUICK-CLOSING 
STEEL VALVES 


3.1 Inspection of state of the quick-closing valves 


The inspection of the quick-closing valves of 
hydroelectric power plant revealed deterioration 
of the components including steel cover plates as 
illustrated in Figure 2. 


3.2 Reliability analysis based on partial 
factor method 


Firstly, the partial factor method given in EN 1990 (2002) and
ISO 13822 (2010) is applied for verification of a cover plate of
the quick-closing steel valves, which is one of their main
components.

The steel cover plate, 0.018 m thick, is supported by stiffeners
spaced at 0.8 m.

Figure 2. Deterioration of a steel cover plate.

Hydrostatic and hydrodynamic components are considered for
determination of the water pressure on the plate, given as

$$q = q_{hs} + q_{hd} = \rho g h + \rho Q v / A \qquad (4)$$

where $\rho$ = water density; $g$ = gravitational acceleration;
$h$ = depth; $Q$ = flow rate; $v$ = water speed; $A$ = area of
the cover plate.

The basic condition $M_{Rd} \ge M_{Ed}$ between the design values
of bending moments for resistance and for effects of actions
should be satisfied. The reliability of the structural component
is verified with respect to the ultimate limit state given as


Mpgra=% b UG w) (5) 


where $\gamma_r$ = coefficient of model uncertainty of
resistance, $d$ = plate thickness, $b$ = plate length, $f_{yk}$ =
characteristic yield strength, $\gamma_{M0}$ = material factor.

The characteristic yield strength is taken as $f_{yk}$ = 235 MPa,
the partial factor as $\gamma_{M0}$ = 1.15 and the coefficient as
$\gamma_r$ = 0.85 according to CSN 75 6303 (2012).

The maximum water pressure is considered for determination of the
load effects on the continuously supported cover plate, given as

$$M_{Ed} = \delta\,(\gamma_{q,hs}\, q_{hs,k} + \gamma_{q,hd}\, q_{hd,k})\, L^2 / 12 \qquad (6)$$

where $\gamma_{q,hs}$ = partial factor for hydrostatic pressure
and $\gamma_{q,hd}$ = partial factor for hydrodynamic pressure,
both taken here from Table 3 for structures in class RC2. The
dynamic amplification factor $\delta = 1.3$ is considered here as
recommended in CSN 75 6303 (2012).
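
A small Python sketch of the design bending moment, assuming equation (6) reads $M_{Ed} = \delta(\gamma_{q,hs} q_{hs,k} + \gamma_{q,hd} q_{hd,k})L^2/12$. The partial factors 1.1 and 1.3 are the RC2 values from Table 3 and the dynamic factor 1.3 is the value stated in the text. The characteristic pressures passed in the example are hypothetical placeholders, since the paper does not list them separately, and the function name is illustrative.

```python
def design_moment(q_hs_k, q_hd_k, span,
                  gamma_hs=1.1, gamma_hd=1.3, dyn_factor=1.3):
    """Design bending moment of a continuously supported plate strip.

    q_hs_k, q_hd_k : characteristic hydrostatic / hydrodynamic load, kN/m
    span           : stiffener spacing, m
    Returns the moment in kNm.
    """
    design_load = gamma_hs * q_hs_k + gamma_hd * q_hd_k
    return dyn_factor * design_load * span ** 2 / 12.0


# Hypothetical characteristic loads on a 1 m wide strip, 0.8 m span.
m_ed = design_moment(q_hs_k=120.0, q_hd_k=20.0, span=0.8)
print(f"M_Ed = {m_ed:.2f} kNm")
```

Keeping the partial factors as keyword arguments makes it straightforward to repeat the check for classes RC1 or RC3 from Table 3.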

The serviceability limit state of the component 
is also verified 


Mg S Ca (7) 


where C, = limiting deformation. 


Results of time-dependent reliability analyses of the steel
component with respect to both the ultimate (ULS) and
serviceability (SLS) limit states are illustrated in Figure 3.
The reliability of the cover plate decreases significantly with
the decrement $\Delta d$ of plate thickness due to steel
corrosion.

When the partial factor method is applied for verification of an
existing structural member in the common consequence class CC2,
its bearing capacity is found to be exhausted after about 53
years and some measures need to be taken, e.g. strengthening or
replacement.


3.3 Application of probabilistic methods 


The probabilistic methods are applied for evaluation of the
reliability level of the steel component, Holicky (2009).

The time-dependent corrosion depth $d_{corr}(t)$ of the steel
cover plate is based on the relationship

$$d_{corr}(t) = A\, t^{B} \qquad (8)$$

where $A$ and $B$ are parameters of the analytical model; $A$ is
considered in a range from 0.03 to 0.13 and $B$ from 0.6 to 0.7.
For non-uniform corrosion it is assumed here that

$$A = 0.06 \text{ mm}; \quad B = 0.7 \qquad (9)$$

which is based on the results of inspection of the actual state
of the steel quick-closing valves. Pitting corrosion with pits up
to 1 mm is shown in Figure 4.

It is considered that the steel cover plate of the quick-closing valves
(about 50 years old) is deteriorating due to non-uniform corrosion, with
an average one-side decrease of up to 1 mm and about 30% probability of
simultaneous weakening at the opposite side of the cover plate. The
corrosion model of equation (8) with the parameters of equation (9) leads
after 50 years to an actual average depth of corrosion of 1.3 mm.


Figure 3. Progressive reduction of the component bearing capacity and
serviceability with decrease of slab thickness $\Delta d$.


Figure 4. Pitting corrosion of a steel cover plate.


Table 4. Probabilistic models of basic variables.

Basic variable           Symbol   Distr.   Mean    CV
Yield strength [MPa]     f_y      LN       280     0.08
Plate span [m]           L        DET      0.75    -
Plate thickness [m]      d        N        0.018   0.03
Plate width [m]          b        DET      1       -
Water pressure [kN/m]    q        N        157.6   0.1
Dynamic factor           δ        DET      1.3     -
Parameter A [mm]         A        N        0.06    0.10
Parameter B              B        DET      0.7     -
Resistance uncertainty   ξ_R      DET      0.85    -
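
The power-law corrosion model $d_{corr}(t) = A t^B$ can be evaluated directly. The following minimal sketch uses the parameters stated above (A = 0.06 mm, B = 0.7); the factor 1.3 applied for the roughly 30% chance of attack on the opposite face is only an assumption about how the two-sided loss might be combined, and the function name is illustrative.

```python
def corrosion_depth_mm(t_years, a_mm=0.06, b=0.7):
    """One-side corrosion depth d_corr(t) = A * t**B, returned in mm."""
    return a_mm * t_years ** b


one_side_50 = corrosion_depth_mm(50)

# Assumption: simultaneous weakening of the opposite face (about 30%
# probability) is represented by scaling the one-side depth by 1.3.
two_side_50 = 1.3 * one_side_50

print(f"one-side depth after 50 years: {one_side_50:.2f} mm")
print(f"depth scaled by 1.3:           {two_side_50:.2f} mm")
```

Under these assumptions the one-side depth stays just below 1 mm, in line with the inspection finding, while the scaled value comes out somewhat below the 1.3 mm quoted in the text.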

The ultimate limit state is expressed as the difference $\Delta M(t)$
between the time-dependent resistance moment $M_R(t)$ and the bending
moment $M_E$ due to the water pressure $q$


$\Delta M(t) = \xi_R b (d - 1.3 d_{corr}(t))^2 f_y / 4 - \delta q L^2 / 12$ (10)


The probabilistic models of basic variables are 
given in Table 4 assuming normal (N) or lognor- 
mal (LN) distributions. Some of the basic variables 
are considered to be deterministic (denoted DET). 

The reliability of the steel cover plate decreases in time due to the
reduction of its thickness by $d_{corr}(t)$. The target reliability index
$\beta(80) = \beta_t = 3.7$ for the considered 80-year working life of the
cover plate is specified on the basis of the required reliability index
for a reference period of one year, $\beta(1) = 4.7$, given as


$\Phi(\beta(80)) = [\Phi(\beta(1))]^{80} = [\Phi(4.7)]^{80} = \Phi(3.7)$ (11)
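
The conversion between reference periods can be checked numerically. The sketch below evaluates $\Phi(\beta_n) = [\Phi(\beta_1)]^n$ with the standard normal distribution from the Python standard library; the function name is illustrative, and the relation assumes independent annual failure events.

```python
from statistics import NormalDist


def beta_for_reference_period(beta_annual, years):
    """Reliability index for an n-year period from the annual index.

    Implements Phi(beta_n) = Phi(beta_1) ** n, i.e. it assumes that
    failures in different years are independent.
    """
    std_normal = NormalDist()
    return std_normal.inv_cdf(std_normal.cdf(beta_annual) ** years)


beta_80 = beta_for_reference_period(4.7, 80)
print(f"beta(80) = {beta_80:.2f}")
```

With an annual index of 4.7, the 80-year index comes out close to the target value of 3.7 used in the assessment.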

The probabilistic reliability assessment of the steel member is based
on the authors' own software tool developed in Mathcad.

The initial reliability of the cover plate unaffected by corrosion
(reliability index $\beta$ = 5.2) satisfies the target reliability level
$\beta_t$ = 3.7 for the considered reference period of 80 years. For
one-side corrosion of 1 mm and a cover plate 53 years old (current state),
the reliability index decreases to $\beta$ = 4, see Figure 5.

It appears that the working life of the cover plate may be estimated at
approximately 70 years. Thus, the residual working life of the cover
plate is considered to be approximately 20 years.

If a more effective protection level of the cover plate is provided (as
currently planned), a lower rate of corrosion may be considered (e.g.
parameter $B$ = 0.6 only, see Case 1 in Figure 6).

A new estimation of the residual working life is shown in Figure 6 for
the case that a new protective coating of the steel member is provided,
extending the estimated working life from 70 to 90 years.


Figure 5. Time-dependent reliability index $\beta(t)$ and one-side
corrosion depth $d_{corr}(t)$.


Figure 6. Time-dependent reliability index $\beta(t)$ and one-side
corrosion depth $d_{corr}(t)$ for the considered increased maintenance
level (Case 1) and common maintenance level (Case 2).


4 VERIFICATION OF INDUSTRIAL 
BRIDGE 


4.1 Inspection of the state of bridge 


An existing industrial steel bridge supported by massive concrete
columns, used for running the cleaning machine for trash racks in the
power station, is assessed for possible further use assuming two
alternatives of a new machine (Fig. 7).

The industrial steel bridge is composed of two main types of structural
members: longitudinal main rolled steel beams (I 280) stiffened by cross
beams (I 140).

The longitudinal main beam is stiffened at the concrete columns by two
supporting lateral diagonals (rolled U profiles) built into the columns.
The beams are supported at their ends by U profiles embedded in the
abutments, which are not well suited to this purpose. Visual inspection
of the steel structures revealed significant deterioration of the load
bearing members, mainly at their joints and at the connections with the
steel bridge deck made of corrugated steel plate (Fig. 8).


Figure 7. Existing industrial bridge with a cleaning machine.


Figure 8. Significant deterioration of bridge structural members.


Figure 9. Illustration of new cleaning machinery, alt. B (machinery with
container; axle spacing 4800 mm; wheel loads for 1. LS / 2. LS / 3. LS:
44 / 30 / 26 kN per wheel on the left and 46 / 65 / 69 kN per wheel on
the right).


Table 5. Load effects and resistances of main structural
members of the industrial bridge, in kNm.

Basic member          E      R       R (reduced by 30%)
Alt. A, main beam     76.6   141.5   99
Alt. B, main beam     73.4   141.5   99
Alt. A, cross beam    14.6   19.8    13.9
Alt. B, cross beam    13.3   19.8    13.9


Reliability of the main structural members of 
the industrial bridge is verified for two alternatives 
of the new cleaning machinery for trash racks, see 
Fig. 9 where the alternative B is illustrated. Firstly, 
the reliability is verified by partial factor method 
and commonly applied procedure according to 
current standards CSN ISO 13822 (2010), CSN 
73 0038 (2014) and CSN EN 1990 (2004) (see also 
3.1), secondly applying probabilistic methods and 
the Probabilistic Model Code (PMC 2014) of the 
Joint Committee on Structural Reliability (JCSS). 

Results of the reliability verification based on the partial factor
method are given in Table 5, considering two alternatives of machinery
with different self-weight, total load effects and geometry including
supports.


4.2 Verification by partial factor method 


The reliability of the existing industrial bridge 
is verified by the partial factor method given in 
CSN ISO 13822 (2010) and CSN 73 0038 (2014). 

Self-weight, permanent loads and imposed loads induced by the new
machinery are considered. The load bearing capacity of the main
structural members is taken into account both without any deterioration
and with an estimated actual 30% reduction of resistance.

The results indicate that the structures are less loaded when machinery B
is used in comparison to machinery A. For the non-deteriorated structure,
and also for the partly deteriorated structure (assumed reduction of
resistance), the basic reliability condition is fulfilled.


Table 6. Reliability of basic bridge steel members.

Basic member          β      β_20%   β_30%
Alt. A, main beam     6.5    5.6     4.7
Alt. B, main beam     6.7    5.8     4.9
Alt. A, cross beam    4.7    3.5     2.8
Alt. B, cross beam    5.1    3.9     3.3

However, the inclined struts supporting the longitudinal beams in the
regions of the concrete columns show a low reliability level. The
requirement on the limiting deflection of the steel longitudinal main
beams is also not satisfied.


4.3 Probabilistic structural analyses 


Probabilistic methods are applied for structural verification based on
the principles of the Probabilistic Model Code (PMC 2014) and expressions
(1) and (2). Reliability analyses of two basic members, a longitudinal
main beam and a cross beam, are carried out for the two alternative
machines A and B in three cases: a structural member without any
deterioration, an estimated 20% reduction of resistance ($\beta_{20\%}$)
and an estimated 30% reduction of resistance ($\beta_{30\%}$). The results
of the reliability analysis are given in Table 6.

In case that the main structural members of the 
industrial bridge are without any deterioration, the 
resulting reliability indices fulfil the required target 
reliability index given in EN 1990 (2002). 

However, if a 20% or 30% reduction of resistance is assumed due to the
corrosion of steel members, the reliability level of such deteriorated
cross beams might be rather low and additional measures (repair,
strengthening) should be undertaken.

Moreover, joints of steel members should be 
checked and repaired if needed. 


5 CONCLUDING REMARKS 


It is shown that application of the partial factor method for the
assessment of existing structures might lead in some cases to
conservative estimates.

The probabilistic assessment of existing structures makes it possible
to estimate effectively the remaining working life of structures and to
plan their maintenance and the required economic resources.


The assessment of the quick-closing valves has shown that their
remaining lifetime is about 20 additional years, provided that regular
maintenance is carried out. Protective layers of steel components should
be renovated and regularly inspected. When the reliability index
decreases below the target reliability index of 3.7 (estimated to occur
in about 20 years), a new reliability assessment of the cover plates
should be made on the basis of updated material characteristics.

The assessment of the existing industrial bridge for the new cleaning
technology reveals the need for regular maintenance. Machinery B appears
to be more suitable for cleaning of the racks.

Probabilistic methods represent a suitable tool for deciding between
alternative technological procedures and for evaluating the actual
condition of structures.


ABSTRACT: Solder joint thermal fatigue is a major failure mechanism of electronic products. Previous
studies have proposed several physics-of-failure models for thermal fatigue life prediction based on stress,
strain, or energy. However, few models consider fatigue crack propagation. From that point of view,
this paper presents a novel fatigue crack propagation model for thermal fatigue life prediction of ball grid
array package solder joints. Because crack length is difficult to measure, an experiment was conducted
to explore the relationship between fatigue crack propagation and the resistance growth of daisy chains,
where the daisy chains were created by the solder joint network. Furthermore, finite element simulation was
used to extend this model to solder joints with different dimensions and materials, and the final form of
the fatigue crack propagation model was obtained.


1 INTRODUCTION 


In an electronic package structure, solder joints provide the electrical
connection, the mechanical connection between chips and substrate, and
a path for thermal dissipation. The reliability of the solder joints has
a large impact on the function of the whole electronic product. Due to a
mismatch of the coefficients of thermal expansion, an electronic solder
joint experiences cyclic shear strain, which ultimately leads to fatigue
failure. Predictions of solder joint thermal fatigue life are mainly
based on statistical methods or Physics-of-Failure (PoF) models. However,
the statistical data are often confidential or difficult to obtain for
aerospace electronic products. Consequently, PoF models are widely used
for predicting solder joint life in engineering.

The earliest low-cycle fatigue model was the Coffin-Manson model, which
gave an expression for fatigue lifetime by exploring the relationship
between stress cycle damage and plastic deformation (Manson 1965). Smith
et al. (1970) improved the Coffin-Manson model by taking energy changes
into account, raising its accuracy and effectiveness. The
Engelmaier-Wild model considered the combined influence of stress
relaxation, plastic deformation, and creep, and has been widely used
in the standards IPC-SM-785 and IPC-D-279 (Engelmaier 1983, Socie 1987).

These models were mainly based on stress, strain, or energy. The
Darveaux model took hysteresis energy theory and fracture mechanics
theory into account to predict the crack growth rate, from which the
number of cycles to failure could be obtained (Darveaux 1997, Darveaux
2002). However, the parameters of the model could not be obtained
directly, which limited its applicability (Akay et al. 1997, Perkins
2007). New strategies need to be developed. In this study, a fatigue
crack propagation model has been developed for Ball Grid Array (BGA)
package solder joint life prediction. Experimental and simulation
methods were adopted to determine the parameters and to extend the model
to more general conditions.

The remainder of this paper is organized as fol- 
lows. Section 2 presents the fatigue crack propa- 
gation model considering stress intensity factor. 
Section 3 discusses the experimental procedures 
and results of data analysis to form the complete 
crack propagation model. Section 4 investigates 
the relationship between solder joint dimension 
and material with the proposed fatigue crack prop- 
agation model via Finite Element Analysis (FEA) 
simulation. Section 5 provides conclusions as well 
as directions for future work. 


2 FATIGUE CRACK PROPAGATION 
MODEL 


The thermal failure of electronic devices is a fault phenomenon that
mainly occurs in the interconnected parts under long-term temperature
cycling. Typical failure mechanisms include thermal fatigue of solder
joints and thermal fatigue of plated-through holes, both of which are
mainly caused by cyclical switching of circuits and periodic changes of
ambient temperature.

For BGA package solder joints, fatigue can be divided into two phases
during the failure process:


— During the forming phase of the fatigue source, 
the initial crack appears at the solder joint sur- 
face with the maximum stress; 

— Under the shear stress, the initial crack tip expe- 
riences repeated plastic deformation and contin- 
uous expansion until fatigue fracture occurred. 


Previous studies (Forman & Shivakumar 1986, Manson et al. 1966) have
shown that there is an exponential integral relationship between the
crack length difference $\Delta l$ and the stress intensity difference
$\Delta k$ during the crack propagation stage, and that the exponent is
related to the cycle number to fatigue failure $N_f$:


$\Delta l = \int_{l_0}^{l_f} (\Delta k)^{a N_f^b} \, dl$ (1)


where $l$ = crack length; $l_0$ = initial crack length;
$l_f$ = crack length when fatigue failure occurs;
$k$ = stress intensity factor; $a$ and $b$ are constants
to be determined.

The stress intensity factor $k$ is a parameter that reflects the
strength of the stress field at the crack tip of an elastic object, and
can be defined as follows:


$k = \sigma \sqrt{M}$ (2)


where $\sigma$ = shear stress and $M$ = bending moment of the crack cross
section. For BGA package solder joints, the tip shear stresses
corresponding to different crack lengths are different; however, the
bending moment for the same crack location is constant. Consequently,
the stress intensity difference can be defined as follows:


$\Delta k = \Delta\sigma \sqrt{M}$ (3)

Through fitting of a large amount of statistical data, it is found that
$\Delta\sigma$, $M$, and the failure cycle number $N$ are related by:


$\Delta\sigma \sqrt{M} = N_l / N_f$ (4)


where $N_l$ = cycle number when the crack length reaches $l$ and
$N_f$ = cycle number to fatigue failure. Then Equation (1) can be
replaced by:


$\Delta l = \int_{l_0}^{l_f} (N_l / N_f)^{a N_f^b} \, dl$ (5)


After the integral is simplified:


$l = l_0 + (l_f - l_0) (N_l / N_f)^{a N_f^b}$ (6)


When the solder joint falls off, the crack length is equal to the
diameter of the solder joint, which denotes failure in principle.
However, in order to leave a safety margin, the solder joint is
considered to have failed when the crack length reaches 70% of the
diameter of the solder joint, based on GB/T 6398. Generally, initial crack length
1,=10* ~107°m, which is small compared to 
the expansion crack length; consequently, we assumed that $l_0 = 0$.
Then the general form of the fatigue crack propagation model can be
defined as follows:


$l = 0.7 D (N_l / N_f)^{a N_f^b}$ (7)


where $D$ = solder joint diameter. $N_l$ and $N_f$ can be obtained via
experiment or simulation; however, the constants $a$ and $b$ cannot be
obtained directly because crack length is difficult to measure.
Therefore, the following sections present a new approach to this
problem, through which the fatigue crack propagation model of Equation (7)
is determined.
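
To illustrate the shape of the general model $l = 0.7 D (N_l / N_f)^{a N_f^b}$, the following minimal sketch evaluates it for a few cycle counts. The constants used here are placeholders chosen only for illustration, not fitted values, and the function name is hypothetical.

```python
def crack_length(n_l, n_f, diameter, a, b):
    """Crack length from l = 0.7 * D * (N_l / N_f) ** (a * N_f ** b)."""
    if not 0 <= n_l <= n_f:
        raise ValueError("expected 0 <= N_l <= N_f")
    exponent = a * n_f ** b
    return 0.7 * diameter * (n_l / n_f) ** exponent


# Placeholder inputs for illustration only (not fitted values).
A_DEMO, B_DEMO = 0.7, 0.4
DIAMETER_MM, N_FAILURE = 0.5, 500

for cycles in (0, 250, 400, 500):
    length = crack_length(cycles, N_FAILURE, DIAMETER_MM, A_DEMO, B_DEMO)
    print(f"N_l = {cycles:3d}  ->  l = {length:.4f} mm")
```

Because the exponent $a N_f^b$ is large for realistic cycle counts, the crack length stays small for most of the life and rises steeply towards 0.7D as $N_l$ approaches $N_f$.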


3 EXPERIMENT 


This section presents the experimental procedures for obtaining $N_l$
and $N_f$, as well as the new approach that replaces crack length
measurement in the data fitting used to calculate the coefficients $a$
and $b$ for a real-life case.


3.1 Temperature profile 


The temperature profile used in this study is that of an engine control
system of a helicopter, which involves parking, taxi, takeoff, ascent,
level-flight, descent, and landing phases. On a hot day, the ground
temperature can reach 55°C; since the electronic device is installed
inside, a further temperature difference of 15°C brings this to 70°C.
During the flight phase, the temperature drops to 20°C due to a
combination of various factors. The resulting life temperature profile
is shown in Figure 1, where the off-time is omitted because of its small
impact on system life.

The detailed values of the temperature profile are shown in Table 1.
The lower dwell was at 20°C for 115 min, while the upper dwell was at
70°C for 30 min. The overall cycle duration was 173 min, and the rate of
temperature change was 3.5°C/min.


Figure 1. Temperature profile.


Table 1. Detailed values of the temperature profile.

Step   Temperature (°C)   Duration time (min)
1      70                 15
2      Decreasing         14
3      20                 115
4      Increasing         14
5      70                 15
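
The cycle figures quoted above follow from Table 1 by simple arithmetic, as the short sketch below shows. The data structure is illustrative; ramps are marked with `None` because Table 1 gives no single temperature for them.

```python
# One thermal cycle from Table 1: (dwell temperature in deg C, duration in min).
# A temperature of None marks a ramp between the two dwell levels.
steps = [(70, 15), (None, 14), (20, 115), (None, 14), (70, 15)]

total_duration = sum(minutes for _, minutes in steps)
upper_dwell = sum(minutes for temp, minutes in steps if temp == 70)
lower_dwell = sum(minutes for temp, minutes in steps if temp == 20)
ramp_rate = (70 - 20) / 14  # deg C per minute during each ramp

print(f"cycle duration: {total_duration} min")
print(f"upper dwell: {upper_dwell} min, lower dwell: {lower_dwell} min")
print(f"ramp rate: {ramp_rate:.2f} deg C/min")
```

The ramp rate evaluates to slightly more than the 3.5°C/min quoted in the text, which appears to be a rounded figure.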


3.2 Test specimen 


Because crack length is difficult to measure, an experiment was
conducted on BGA package solder joints to explore the relationship
between fatigue crack propagation and the resistance growth of daisy
chains, where the daisy chains were created by the solder joint network.

The first step was to produce test specimens. Chips were surface-mounted
on printed circuit boards (PCBs) with 0.5 mm diameter Sn62Pb36Ag2 solder
joints. There were two test boards (A1 and A2), each with six BGA package
chips (U1-U6) and an electrical resistance network that could be
monitored for failure, as shown in Figure 2. The solder joint network
created three daisy chains named T1, T2, and T3. The resistance of a
daisy chain was equal to the sum of the resistances of all solder joints
contained within it. The solder joints of different daisy chains were
under different thermal expansion mismatch conditions. Among them, the
daisy chain under the most severe condition would be the first one to
fail, and its resistances represented the fatigue crack lengths of the
solder joints.


Figure 2. Test specimen and three daisy chains.


3.3 Device setup and parameter conversion


The temperature cycling profile was realized with a temperature chamber,
and the resistance values of the daisy chains were measured with a
resistance tester. The experiment and test devices were arranged as
shown in Figure 3.

Then the thermal fatigue failure cycle numbers $N$ and daisy chain
resistances $r$ were obtained. IPC-9701 has shown that there is an
exponential relationship between the daisy chain resistance $r$ and the
fatigue crack length $l$ of BGA package solder joints, which after
deduction and simplification reads:


$r = m i (D - l)^{-n}$ (8)


where $i$ = number of solder joints within the daisy chain; $D$ = solder
joint diameter; $m$ and $n$ are constants to be determined. The
relationship between the daisy chain resistance $r$ and the thermal
fatigue failure cycle number $N$ can then be derived by combining
Equations (7) and (8):


$r = m i D^{-n} [1 - 0.7 (N_r / N_f)^{a N_f^b}]^{-n}$ (9)


where $N_r = N_l$.


3.4 Results and discussion


Twelve resistance values (two test boards with six 
chips each) of each type of daisy chain had been 
obtained within this experiment. The results of T1 
daisy chains are shown in Figure 4. 

Interconnecting failure was defined as a 20% increase in the nominal
resistance of the daisy chains, which means that at least one of the
solder joints had failed ($l_f = 0.7D$). Figure 4 shows that the
resistance of the T1 daisy chain reached a 20% increase at about 500
cycles. Consequently, the subsequent data were not used for data fitting
because they lie beyond the safety margin. Generally, these twelve groups
of data showed good uniformity. The median value of these twelve groups
of data was used to analyze the fatigue crack propagation model.


Figure 4. Resistance of the T1 daisy chain in twelve chip samples.


Figure 5 shows the resistances of daisy chains T1, T2, and T3 of the
experiment after integration.

The initial resistance values of the three daisy chains differed because
of the different numbers of solder joints they contain. Figure 5 shows
that daisy chain T1 was the first one to fail. The reason is that daisy
chain T1 was located on the outermost side of the chip in Figure 2,
which has the maximal strain under the same thermal stress. As mentioned
in section 3.2, daisy chain T1 was under the most severe thermal
expansion mismatch conditions, and its resistances represented the
thermal fatigue lives of the solder joints of the chips. The resistances
and failure cycles of daisy chain T1 were recorded for data fitting
purposes.


3.5 Data fitting 


The experimental data were verified and the values of $N$ and $r$ were
extracted, as listed in Table 2.

Table 2 shows that when the resistance $r$ first reached a 20% increase,
the cycle number $N$ was 508, which represents the fatigue failure cycles
$N_f$ in Equation (9). The diameter of the solder joints is 0.5 mm and
the number of solder joints within daisy chain T1 is 56.

Data fitting was carried out with Matlab; the resulting values of $a$,
$b$, $m$, and $n$ in Equation (9) are shown in Table 3.

Table 2. Failure cycles and resistance values of daisy chain T1.

No.   N (cycles)   r (Ohm)
1     0            0.1942
2     16           0.1946
3     32           0.1948
4     48           0.1950
5     60           0.1952
6     76           0.1955
7     100          0.1963
8     128          0.1969
9     160          0.1976
10    220          0.1985
11    240          0.1990
12    288          0.2000
13    304          0.2028
14    320          0.2031
15    352          0.2098
16    396          0.2197
17    424          0.2211
18    460          0.2298
19    508          0.2394


Table 3. Values of a, b, m, and n and goodness of fitting.

Parameter           Value
a                   0.6637
b                   0.4053
m                   0.0032
n                   0.1795
SSE                 0.0240
R-square            0.9397
Adjusted R-square   0.9305
RMSE                0.0736


Here, SSE represents the sum of squares due to error, while R-square
represents the coefficient of determination. Adjusted R-square
represents the degree-of-freedom adjusted coefficient of determination,
and RMSE represents the root mean squared error.
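
For reference, the four goodness-of-fit measures named above can be computed as in the following sketch. It assumes the usual curve-fitting conventions, in which the residual degrees of freedom equal the number of data points minus the number of fitted coefficients; the function name and the small data set are purely illustrative and are not taken from the experiment.

```python
import math


def goodness_of_fit(observed, predicted, n_coeffs):
    """Return SSE, R-square, adjusted R-square and RMSE for a fitted model."""
    n_points = len(observed)
    mean_obs = sum(observed) / n_points
    sse = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    sst = sum((y - mean_obs) ** 2 for y in observed)
    dof = n_points - n_coeffs  # residual degrees of freedom
    return {
        "sse": sse,
        "r_square": 1 - sse / sst,
        "adj_r_square": 1 - (sse / dof) / (sst / (n_points - 1)),
        "rmse": math.sqrt(sse / dof),
    }


# Illustrative data only: five points and a two-coefficient model.
stats = goodness_of_fit([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0], n_coeffs=2)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```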

Then, the relationship between the daisy chain resistance $r$ and the
thermal fatigue failure cycle number $N$ of BGA package Sn62Pb36Ag2
solder joints with 0.5 mm diameter can be written as:


$r = 0.0032 i D^{-0.1795} [1 - 0.7 (N_r / N_f)^{0.6637 N_f^{0.4053}}]^{-0.1795}$ (10)


where $r$ = resistance of the first failed daisy chain;
$i$ = number of solder joints within the daisy chain;
$D$ = solder joint diameter; $N_r$ = cycle number when
the resistance reaches $r$ and $N_f$ = cycle number to
fatigue failure. The fatigue crack propagation model for this particular
solder joint is then obtained:


$l = 0.7 D (N_l / N_f)^{0.6637 N_f^{0.4053}}$ (11)


where $l$ = crack length and $N_l = N_r$ = cycle number
when the crack length reaches $l$.
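
As a plausibility check, the fitted resistance model can be evaluated at the start of the test and at the failure cycle and compared with the first and last rows of Table 2. The sketch below assumes that the exponent of the bracketed term is negative (so that resistance grows as the crack grows) and that $D$ is entered in millimetres; both are assumptions made here, and the function name is hypothetical.

```python
def daisy_chain_resistance(n_cycles, n_f, joints, diameter_mm,
                           m=0.0032, n=0.1795, a=0.6637, b=0.4053):
    """Resistance r = m*i*D**(-n) * (1 - 0.7*(N/N_f)**(a*N_f**b))**(-n).

    Default constants are the fitted values reported in Table 3.
    Assumption: D in mm and a negative exponent -n on the bracket.
    """
    crack_fraction = 0.7 * (n_cycles / n_f) ** (a * n_f ** b)
    return m * joints * diameter_mm ** (-n) * (1 - crack_fraction) ** (-n)


r_start = daisy_chain_resistance(0, 508, joints=56, diameter_mm=0.5)
r_fail = daisy_chain_resistance(508, 508, joints=56, diameter_mm=0.5)

print(f"r at N = 0:   {r_start:.4f} Ohm (measured 0.1942)")
print(f"r at N = 508: {r_fail:.4f} Ohm (measured 0.2394)")
print(f"relative increase: {r_fail / r_start - 1:.1%}")
```

Under these assumptions the model gives a resistance increase between start and failure of a little over 20%, consistent with the failure criterion used in the experiment.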


4 FEA SIMULATION 


Experimental results are the most realistic; however, both the time and
the cost of experiments are considerable for aviation products because
of the large number of cycles to failure. Therefore, simulation is still
the main technology applied in engineering. This section presents the
simulation method for solder joint thermal fatigue and explores the
relationship of the parameters $a$, $b$, $m$, and $n$ with the material
and dimension of the solder joint, from which the final form of the
fatigue crack propagation model is obtained.


4.1 Modeling 


A typical BGA package was modeled with the finite element method using
ANSYS; the results are shown in Figure 6. The model contained the solder
joint, chip, substrate, encapsulation layer, and PCB layer. In ANSYS,
element type Visco107 was used to describe the solder joint material.

To evaluate the impact of solder joint material and dimension on the
crack propagation model, four types of solder joint material were
selected: Sn62Pb36Ag2, Sn96.5Ag3Cu0.5, Sn60Pb40, and Pb97.5Sn2.5. The
solder joint diameters were selected as 0.25 mm, 0.4 mm, 0.5 mm, and
0.6 mm.


4.2 Simulation results 


The solder joint strain could be calculated via FEA. With the
relationship between strain and thermal damage resistance, the daisy
chain resistance could be obtained. Figure 7 shows the resistance of T1
obtained from the experiment and from FEA.

Figure 7 shows that the simulation result was not as smooth as the
experimental result; however, the error remained within 5%, well below
the 20% limit considered acceptable.


Figure 6. FEA model and solder joint model.


Figure 7. Resistance of the T1 daisy chain from the experiment and
from FEA.


4.3 Crack propagation model considering
material and dimension


Using these simulation results, the parameters $a$, $b$, $m$, and $n$ in
Equation (9) could be calculated for four material types and four
diameters of solder joints; the results are shown in Table 4. The
material of the solder joint was characterized by Young's modulus $E$ and
the fatigue strength factor $\sigma_f$, which are the most critical
material parameters for fatigue failure. Figure 8 shows how the
parameters $a$, $b$, $m$, and $n$ depend on the diameter and material of
the solder joint.


Table 4. Parameter values of different solder joint diameters
and materials.

Material         D (mm)   E (GPa)   σ_f (GPa)   a       b       m        n
Sn62Pb36Ag2      0.25     30        1.97        0.649   0.391   0.0033   0.181
Sn62Pb36Ag2      0.4      30        1.97        0.652   0.384   0.0031   0.179
Sn62Pb36Ag2      0.5      30        1.97        0.650   0.404   0.0036   0.180
Sn62Pb36Ag2      0.6      30        1.97        0.648   0.391   0.0032   0.179
Sn96.5Ag3Cu0.5   0.25     35        2.25        1.178   0.411   0.0262   0.183
Sn96.5Ag3Cu0.5   0.4      35        2.25        1.175   0.406   0.0262   0.179
Sn96.5Ag3Cu0.5   0.5      35        2.25        1.180   0.412   0.0259   0.178
Sn96.5Ag3Cu0.5   0.6      35        2.25        1.180   0.401   0.0258   0.182
Sn60Pb40         0.25     32        2.01        0.872   0.408   0.0110   0.179
Sn60Pb40         0.4      32        2.01        0.875   0.401   0.0114   0.178
Sn60Pb40         0.5      32        2.01        0.873   0.411   0.0112   0.178
Sn60Pb40         0.6      32        2.01        0.874   0.407   0.0111   0.182
Pb97.5Sn2.5      0.25     40        2.55        1.567   0.389   0.0448   0.180
Pb97.5Sn2.5      0.4      40        2.55        1.564   0.406   0.0450   0.180
Pb97.5Sn2.5      0.5      40        2.55        1.569   0.394   0.0451   0.182
Pb97.5Sn2.5      0.6      40        2.55        1.564   0.387   0.0450   0.180


Figure 8. Relationships of parameters with diameter
and material of solder joints.


Figure 8 shows that the material and dimension of the solder joints have
no effect on parameters $b$ and $n$; their mean values over the tested
cases are 0.400 and 0.180, respectively. Figure 8 furthermore shows that
parameters $a$ and $m$ change only with the material. From the FEA
simulation results, the relationships of parameters $a$ and $m$ with
Young's modulus $E$ and fatigue strength factor $\sigma_f$ were fitted
in Matlab:


$a = 0.1109 E - 0.3123 \sigma_f - 2.0448$ (12)


$m = 0.0035 E + 0.0118 \sigma_f - 0.1246$ (13)


Consequently, Equation (9) can be expressed as:


$r = m i D^{-0.18} [1 - 0.7 (N_r / N_f)^{a N_f^{0.4}}]^{-0.18}$
$m = 0.0035 E + 0.0118 \sigma_f - 0.1246$
$a = 0.1109 E - 0.3123 \sigma_f - 2.0448$ (14)


And the final form of the fatigue crack propagation model can be
obtained:


$l = 0.7 D (N_l / N_f)^{a N_f^{0.4}}$
$a = 0.1109 E - 0.3123 \sigma_f - 2.0448$ (15)


where Young's modulus $E$ and fatigue strength factor $\sigma_f$ can be
found directly in a material handbook.
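
The general model can be sketched in a few lines: the material enters only through the coefficients $a$ and $m$, while the exponents 0.4 and 0.18 are fixed. The function names below are illustrative, $E$ and $\sigma_f$ are assumed to be given in GPa as in Table 4, and the example reuses the Sn62Pb36Ag2 values and the failure cycle number from the experiment.

```python
def coefficient_a(e_gpa, sigma_f_gpa):
    """Fitted coefficient a = 0.1109*E - 0.3123*sigma_f - 2.0448."""
    return 0.1109 * e_gpa - 0.3123 * sigma_f_gpa - 2.0448


def coefficient_m(e_gpa, sigma_f_gpa):
    """Fitted coefficient m = 0.0035*E + 0.0118*sigma_f - 0.1246."""
    return 0.0035 * e_gpa + 0.0118 * sigma_f_gpa - 0.1246


def crack_length_general(n_l, n_f, diameter, e_gpa, sigma_f_gpa):
    """Crack length l = 0.7*D*(N_l/N_f)**(a*N_f**0.4) for a given material."""
    a = coefficient_a(e_gpa, sigma_f_gpa)
    return 0.7 * diameter * (n_l / n_f) ** (a * n_f ** 0.4)


# Sn62Pb36Ag2 from Table 4: E = 30 GPa, sigma_f = 1.97 GPa, D = 0.5 mm.
a_sn62 = coefficient_a(30, 1.97)
m_sn62 = coefficient_m(30, 1.97)
print(f"a = {a_sn62:.3f}, m = {m_sn62:.4f}")

for cycles in (254, 406, 508):
    length = crack_length_general(cycles, 508, 0.5, 30, 1.97)
    print(f"N_l = {cycles}: l = {length:.4f} mm")
```

For this material the two linear fits return values close to the simulated coefficients listed in Table 4.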


5 CONCLUSION AND FUTURE WORK 


This paper presented a fatigue crack propagation model for solder joint
thermal fatigue life prediction from the perspective of the stress
intensity factor. To determine the constants within the model, an
experiment was conducted and a new approach based on daisy chain
resistance was used to overcome the difficulty of measuring crack
length. The model for a particular solder joint was obtained. To study
the relationship of the coefficients of the fatigue crack propagation
model with the dimension and material of the solder joint, the finite
element method was used to simulate more cases, and a more general form
of the model was obtained.

Further study may focus on additional FEA simulations and on other
factors that may affect fatigue crack propagation models, such as
different mounting styles of the chip and different conductivities of
the solder joint.


ABSTRACT: This paper is an outgrowth of an investigation to determine the main source of
failure in Polish operational aviation. A proper understanding of failure data is valuable for predicting
future needs over a specified planning horizon or for specified operational hours. The major
effort of the study was the reliability analysis for two types of aircraft used in Polish operational
aviation. As the main result, the number of hours between failures is presented and its
distribution is tested against several mathematical models. The description of the data by the Weibull
distribution, one of the most frequently used functions in reliability analysis, has been tested. The same
test has then been performed for other mathematical distributions. It has been observed that the best
description for the first type of aircraft is given by a different distribution than for the second type, and
that the standard Weibull distribution is not always the best reliability model. The potential causes of this
behavior are discussed in the article. Additionally, the change in the number of failures as a function of
time is presented. This behavior is observed to vary between the types of aircraft.


1 INTRODUCTION 


1.1 Formulation of the problem 


Aircraft systems may be designed with redundancy of components in order
to improve their overall reliability. Such an approach can increase the
reliability of the whole system; however, improved reliability is not
the only factor that contributes to the effectiveness of task
performance. If the system fails and has to be repaired, then the time
to repair is also an important factor in effectiveness. One of the most
important factors for highly efficient use of the system is an
appropriate estimate of the expected time to failure, together with
proper planning and efficient performance of inspections and any
possible repairs (Tloczynski 2017b). The time to failure can be
estimated from knowledge of the behavior of data from the exploitation
process. The data can be described by different statistical models.
Proper selection of the model is extremely important, as will be shown
in this contribution by comparing the data with the selected models. The
advantage of this method over the calculation of the mean and standard
deviation is the reduction of repair costs. The failure rate function
shows exactly when early failures occur; a constant failure rate
describes random failures, whereas an increasing failure rate indicates
wear-out failures (Babiarz 2016).


2 FORMULATION OF THE PROBLEM 


2.1 Basic definitions


A failure rate function is defined as the limit, if
it exists, of the ratio of the conditional probability
that the instant of time, $T$, of a failure of an
item falls within a given time interval $(t, t + \Delta t)$ to the
length of this interval, $\Delta t$, when $\Delta t$ tends to zero,
given that the item is in an up state at the beginning
of the time interval, which can be described as
(Koucky and Valis 2007, Valis et al. 2014):


$$\lambda(t) = \lim_{\Delta t \to 0^+} \frac{P\{t < T < t + \Delta t \mid T > t\}}{\Delta t} \tag{1}$$


where $T$ is a continuous positive random variable
of device operation time.

If $T$ has a density $f(t)$ and a distribution function $F(t)$,
equation (1) takes the form:


$$\lambda(t) = \frac{f(t)}{1 - F(t)} \tag{2}$$


where $F(t) = \int_0^t f(u)\,du = P\{T < t\} = 1 - P\{T > t\}$.
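
As a minimal sketch of the relation between the failure rate, the density and the distribution function (failure rate = density divided by the survival probability $1 - F(t)$), the following Python fragment evaluates the ratio for an exponential lifetime. The helper name `hazard` and the rate of 0.05 failures per hour are illustrative assumptions, not values from the analysed data.

```python
import math


def hazard(pdf, cdf, t):
    """Failure rate f(t) / (1 - F(t)) for a given density and distribution function."""
    survival = 1.0 - cdf(t)
    if survival <= 0.0:
        raise ValueError("survival probability is zero at t; failure rate undefined")
    return pdf(t) / survival


# Hypothetical example: exponential lifetime with a made-up rate per flight hour.
RATE = 0.05


def exp_pdf(t):
    return RATE * math.exp(-RATE * t)


def exp_cdf(t):
    return 1.0 - math.exp(-RATE * t)


for t in (1.0, 10.0, 50.0):
    # For the exponential distribution the failure rate is constant and equal to RATE.
    print(t, hazard(exp_pdf, exp_cdf, t))
```

The constant output illustrates the "random failures" regime; distributions with a shape parameter can instead produce decreasing or increasing failure rates.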


2.2. The Weibull distribution 


The Weibull distribution is one of the most widely
used lifetime distributions in reliability engineering
(Tloczynski 2017a). It is a versatile distribution that
can take on the characteristics of other types of distributions,
depending on the value of the shape parameter, $\beta$.

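Weibull Probability Density Function can be
described by the following formula (Hinz, Hienzsch,
& Bracke 2017a):


$$f(t) = \frac{\beta}{\eta}\left(\frac{t-\gamma}{\eta}\right)^{\beta-1} e^{-\left(\frac{t-\gamma}{\eta}\right)^{\beta}} \tag{3}$$


where:
$\eta$ - scale parameter, or characteristic life,
$\beta$ - shape parameter (or slope),
$\gamma$ - location parameter (or failure free life).
The Weibull Failure Rate Function is given by:


$$\lambda(t) = \frac{\beta}{\eta}\left(\frac{t-\gamma}{\eta}\right)^{\beta-1} \tag{4}$$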

2.3. The Burr distribution 


In probability theory, statistics and econometrics,
the Burr Type XII distribution, or simply the Burr
distribution, is a continuous probability distribution
for a non-negative random variable. It is also
known as the Singh-Maddala distribution and is
one of a number of different distributions sometimes
called the generalized log-logistic distribution.

Burr Probability Density Function can be
described by the following formula (Mueller, Hinz,
& Bracke 2017):


$$f(t) = \frac{\alpha k \left(\frac{t-\gamma}{\beta}\right)^{\alpha-1}}{\beta\left[1 + \left(\frac{t-\gamma}{\beta}\right)^{\alpha}\right]^{k+1}} \tag{5}$$


where:

$k$ - continuous shape parameter ($k > 0$),

$\alpha$ - continuous shape parameter ($\alpha > 0$),

$\beta$ - continuous scale parameter ($\beta > 0$),

$\gamma$ - continuous location parameter ($\gamma = 0$ yields
the three-parameter Burr distribution).

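The Burr Failure Rate Function is given by:


$$\lambda(t) = \frac{f(t)}{1 - F(t)} = \frac{\alpha k \left(\frac{t-\gamma}{\beta}\right)^{\alpha-1}}{\beta\left[1 + \left(\frac{t-\gamma}{\beta}\right)^{\alpha}\right]} \tag{6}$$

where $F(t) = 1 - \left[1 + \left(\frac{t-\gamma}{\beta}\right)^{\alpha}\right]^{-k}$ is the Burr distribution function.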
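
The Burr failure rate can be checked numerically against the ratio of density to survival probability. The sketch below uses the standard four-parameter Burr (Type XII) forms; the function names and the parameter values are hypothetical and chosen only for illustration.

```python
def burr_cdf(t, k, alpha, beta, gamma=0.0):
    """Burr (Type XII) distribution function 1 - (1 + z ** alpha) ** (-k), z = (t - gamma) / beta."""
    if t <= gamma:
        return 0.0
    z = (t - gamma) / beta
    return 1.0 - (1.0 + z ** alpha) ** (-k)


def burr_pdf(t, k, alpha, beta, gamma=0.0):
    """Burr density alpha * k * z ** (alpha - 1) / (beta * (1 + z ** alpha) ** (k + 1))."""
    if t <= gamma:
        return 0.0
    z = (t - gamma) / beta
    return alpha * k * z ** (alpha - 1.0) / (beta * (1.0 + z ** alpha) ** (k + 1.0))


def burr_hazard(t, k, alpha, beta, gamma=0.0):
    """Closed-form Burr failure rate alpha * k * z ** (alpha - 1) / (beta * (1 + z ** alpha))."""
    if t <= gamma:
        raise ValueError("failure rate is only defined for t > gamma")
    z = (t - gamma) / beta
    return alpha * k * z ** (alpha - 1.0) / (beta * (1.0 + z ** alpha))


# Hypothetical parameters (not fitted values from the paper).
K, ALPHA, BETA, GAMMA = 1.5, 2.0, 15.0, 0.0
for t in (5.0, 15.0, 40.0):
    closed_form = burr_hazard(t, K, ALPHA, BETA, GAMMA)
    from_ratio = burr_pdf(t, K, ALPHA, BETA, GAMMA) / (1.0 - burr_cdf(t, K, ALPHA, BETA, GAMMA))
    print(t, closed_form, from_ratio)  # the two columns agree up to rounding
```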


2.4 The Dagum distribution 


The Dagum distribution is a continuous probability
distribution defined over positive real numbers.
It is named after Camilo Dagum, who proposed
it in a series of papers in the 1970s. The Dagum
distribution arose from several variants of a new
model on the size distribution of personal income
and is mostly associated with the study of income
distribution.

Dagum Probability Density Function can be
described by the following formula (Vintr & Valis
2011):


$$f(t) = \frac{\alpha k \left(\frac{t-\gamma}{\beta}\right)^{\alpha k-1}}{\beta\left[1 + \left(\frac{t-\gamma}{\beta}\right)^{\alpha}\right]^{k+1}} \tag{7}$$

where:
$k$ - continuous shape parameter ($k > 0$),
$\alpha$ - continuous shape parameter ($\alpha > 0$),
$\beta$ - continuous scale parameter ($\beta > 0$),
$\gamma$ - continuous location parameter ($\gamma = 0$ yields
the three-parameter Dagum distribution).
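The Dagum Failure Rate Function is given by:


$$\lambda(t) = \frac{f(t)}{1 - F(t)} = \frac{\alpha k \left(\frac{t-\gamma}{\beta}\right)^{\alpha k-1}}{\beta\left[1 + \left(\frac{t-\gamma}{\beta}\right)^{\alpha}\right]^{k+1}\left(1 - \left[1 + \left(\frac{t-\gamma}{\beta}\right)^{-\alpha}\right]^{-k}\right)} \tag{8}$$

where $F(t) = \left[1 + \left(\frac{t-\gamma}{\beta}\right)^{-\alpha}\right]^{-k}$ is the Dagum distribution function.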
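
A quick consistency check of the Dagum formulas is to differentiate the distribution function numerically and compare the result with the density. The sketch below assumes the standard four-parameter Dagum forms; all parameter values are invented for illustration.

```python
def dagum_cdf(t, k, alpha, beta, gamma=0.0):
    """Dagum distribution function (1 + z ** (-alpha)) ** (-k), z = (t - gamma) / beta."""
    if t <= gamma:
        return 0.0
    z = (t - gamma) / beta
    return (1.0 + z ** (-alpha)) ** (-k)


def dagum_pdf(t, k, alpha, beta, gamma=0.0):
    """Dagum density alpha * k * z ** (alpha * k - 1) / (beta * (1 + z ** alpha) ** (k + 1))."""
    if t <= gamma:
        return 0.0
    z = (t - gamma) / beta
    return alpha * k * z ** (alpha * k - 1.0) / (beta * (1.0 + z ** alpha) ** (k + 1.0))


def dagum_hazard(t, k, alpha, beta, gamma=0.0):
    """Failure rate as density divided by survival probability."""
    return dagum_pdf(t, k, alpha, beta, gamma) / (1.0 - dagum_cdf(t, k, alpha, beta, gamma))


# Hypothetical parameters (not fitted values from the paper).
K, ALPHA, BETA, GAMMA = 0.8, 3.0, 15.0, 0.0
T, H = 12.0, 1e-4

# Central difference of the distribution function approximates the density.
numeric_pdf = (dagum_cdf(T + H, K, ALPHA, BETA, GAMMA) - dagum_cdf(T - H, K, ALPHA, BETA, GAMMA)) / (2.0 * H)
print(numeric_pdf, dagum_pdf(T, K, ALPHA, BETA, GAMMA))
print(dagum_hazard(T, K, ALPHA, BETA, GAMMA))
```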


2.5 The log-logistic distribution 


In probability and statistics, the log-logistic dis- 
tribution (known as the Fisk distribution in eco- 
nomics) is a continuous probability distribution 
for a non-negative random variable. It is used in 
survival analysis as a parametric model for events 
whose rate increases initially and decreases later, 
for example mortality rate from cancer following 
diagnosis or treatment. It has also been used in 
hydrology to model stream flow and precipitation, 
in economics as a simple model of the distribution 
of wealth or income, and in networking to model 
the transmission times of data considering both 
the network and the software. 

The log-logistic distribution is the probability
distribution of a random variable whose logarithm
has a logistic distribution. It is similar in shape to
the log-normal distribution but has heavier tails.
Unlike the log-normal, its cumulative distribution
function can be written in closed form.

Log-logistic Probability Density Function can
be described by the following formula (Hinz, Hienzsch,
& Bracke 2017b):


$$f(t) = \frac{\alpha}{\beta}\left(\frac{t-\gamma}{\beta}\right)^{\alpha-1}\left[1 + \left(\frac{t-\gamma}{\beta}\right)^{\alpha}\right]^{-2} \tag{9}$$


where:

$\alpha$ - continuous shape parameter ($\alpha > 0$),

$\beta$ - continuous scale parameter ($\beta > 0$),

$\gamma$ - continuous location parameter ($\gamma = 0$ yields
the two-parameter Log-logistic distribution).

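The Log-logistic Failure Rate Function is given
by (Valis, Zak, Pokora, & Lansky 2016):


$$\lambda(t) = \frac{\frac{\alpha}{\beta}\left(\frac{t-\gamma}{\beta}\right)^{\alpha-1}}{1 + \left(\frac{t-\gamma}{\beta}\right)^{\alpha}} \tag{10}$$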

3 EXPERIMENTAL STUDY OF 
STATISTICAL METHODS 




3.1 Input data 


This analysis uses data on one type of aircraft used
in the Polish Air Force for a selected group of pilots.
Data were obtained from operation and maintenance.
The analysed data are presented in the
form of graphs of average flight hours between failures
as a function of years in Figures 1 and 2. These
figures show the data for the years 2011 to 2017.
It was noted that the data for 2013 are concentrated at
lower values, while those for 2017 are concentrated at
higher values. Such behavior may indicate that at some point there was an
improvement of procedures to control the aircraft,
which resulted in fewer aircraft failures. It may also
be the result of increased maintenance.

Based on Figures 1 and 2, a division into two
groups of data can be observed: the first covers
the years 2011-2015, for which the average time
between failures is about 15 hours; the second contains
the years 2016-2017, for which the average time
between failures doubles.

Figure 1. Average flight hours between failure for 4th
generation fighter aircraft type A as a function of years.

Figure 2. Average flight hours between failure for 4th
generation fighter aircraft type A as a function of years.


A comparison of the fitted probability
density functions with the data obtained during the
operation process is shown in Figure 7. In addition,
Figure 8 summarizes the fitting parameters
for the failure rate function distributions. It can be
observed that for aircraft type A the shape
is narrower. The Weibull distributions for
both types of aircraft have similar shapes, but for
type B the most probable value is shifted to a lower
value. Because no significant differences between
the failure times of the two types of aircraft have been
observed, the fit has also been
done for all collected data together, for better precision.


3.2 Quality of the fit 


To determine the quality of the fit and to determine
whether the analysed distribution describes
the data, the following tests were used:


• Kolmogorov-Smirnov,
• Anderson-Darling,
• Chi-squared,


which are based on the empirical distribution function
(EDF).


3.2.1 The Kolmogorov-Smirnov statistic
The Kolmogorov-Smirnov statistic is defined as:


$$D = \sup_x \left| F_n(x) - F(x) \right| \tag{11}$$


where $F_n(x)$ is the EDF of the sample and $F(x)$ is the fitted
distribution function. The Kolmogorov-Smirnov statistic belongs to
the supremum class of EDF statistics (Chakravarti,
Laha, & Roy 1967). This class of statistics is based
on the largest vertical difference between $F_n(x)$ and
$F(x)$. The Kolmogorov-Smirnov statistic is computed
as the maximum of $D^+$ and $D^-$, where $D^+$ is
the largest vertical distance between the EDF and
the distribution function when the EDF is greater
than the distribution function, and $D^-$ is the largest
vertical distance when the EDF is less than the
distribution function.
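
A minimal Python sketch of this computation follows. It evaluates the fitted distribution function at the ordered observations, giving values $U_{(1)} \le \dots \le U_{(n)}$, and uses the standard computing form in which $D^+$ is the largest of $i/n - U_{(i)}$ and $D^-$ is the largest of $U_{(i)} - (i-1)/n$. The function name and the three-point sample are illustrative assumptions.

```python
def ks_statistic(sample, cdf):
    """Return (D_plus, D_minus, D) for a sample against a fitted distribution function."""
    # Because cdf is non-decreasing, sorting the cdf values equals evaluating cdf at the sorted data.
    u = sorted(cdf(x) for x in sample)
    n = len(u)
    if n == 0:
        raise ValueError("sample must not be empty")
    # enumerate() is 0-based, so (i + 1) / n is i/n and i / n is (i-1)/n in 1-based notation.
    d_plus = max((i + 1) / n - u_i for i, u_i in enumerate(u))
    d_minus = max(u_i - i / n for i, u_i in enumerate(u))
    return d_plus, d_minus, max(d_plus, d_minus)


def uniform_cdf(x):
    """Distribution function of the uniform distribution on [0, 1]."""
    return min(1.0, max(0.0, x))


# Toy sample tested against the uniform distribution.
d_plus, d_minus, d = ks_statistic([0.7, 0.1, 0.4], uniform_cdf)
print(d_plus, d_minus, d)  # approximately 0.3, 0.1 and 0.3
```

For real failure data, `cdf` would be the fitted Weibull, Burr, Dagum or log-logistic distribution function.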
Figure 3. Weibull probability density function fit for
4th generation fighter aircraft type A.

Figure 4. Weibull probability density function fit for
4th generation fighter aircraft type B.

Figure 5. Burr probability density function fit for 4th
generation fighter aircraft type A.

Figure 6. Dagum probability density function fit for
4th generation fighter aircraft type B.

Figure 8. Failure rate comparison.

Figure 9. Log-logistic probability density function fit
for 4th generation fighter aircraft.

Figure 10. Weibull probability density function fit for
4th generation fighter aircraft.


Table 1. Goodness-of-fit summary for 4th generation fighter aircraft type A.

                          Test
                          Kolmogorov Smirnov    Anderson Darling      Chi-Squared
 #   Distribution         p-value    Rank       p-value    Rank       p-value    Rank
 1   Dagum                0.053      1          0.207      2          1.518      8
 2   Gen. Logistic        0.062      2          0.212      3          2.240      16
 3   Wakeby               0.062      3          0.187      1          3.230      28
 4   Gen. Extreme Value   0.062      4          0.231      4          1.843      13
 5   Burr                 0.064      5          0.318      10         1.611      9
 6   Log-Logistic(3P)     0.067      6          0.263      7          1.110      2
 7   Frechet(3P)          0.068      7          0.250      5          1.645      10
 8   Pearson5(3P)         0.068      8          0.259      6          2.648      22
19   Weibull(3P)          0.090      19         0.511      18         1.455      4
22   Weibull              0.094      22         0.481      16         2.812      25
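In terms of the ordered values $U_{(1)} \le \dots \le U_{(n)}$, where $U_{(i)} = F(X_{(i)})$ is the
fitted distribution function evaluated at the $i$-th ordered observation, the
Kolmogorov-Smirnov statistic is computed as:

$$D^+ = \max_i \left( \frac{i}{n} - U_{(i)} \right), \qquad D^- = \max_i \left( U_{(i)} - \frac{i-1}{n} \right) \tag{12}$$

$$D = \max(D^+, D^-)$$


3.2.2 The Anderson-Darling statistic

The Anderson-Darling statistic belongs to the
quadratic class of EDF statistics (Stephens 1974).
The Anderson-Darling statistic is defined as:


$$A^2 = n \int_{-\infty}^{+\infty} \left( F_n(x) - F(x) \right)^2 \left[ F(x)\left(1 - F(x)\right) \right]^{-1} dF(x) \tag{13}$$


Here the weight function is $\psi(x) = [F(x)(1 - F(x))]^{-1}$.
The Anderson-Darling statistic is computed as:


$$A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} \left[ (2i-1) \log U_{(i)} + (2n+1-2i) \log\left(1 - U_{(i)}\right) \right] \tag{14}$$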
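
The computing form of the Anderson-Darling statistic translates directly into code: with $U_{(i)}$ the fitted distribution function at the $i$-th ordered observation, $A^2 = -n - \frac{1}{n}\sum_i [(2i-1)\log U_{(i)} + (2n+1-2i)\log(1-U_{(i)})]$. The sketch below is a plain-Python illustration; the function name, the exponential model and the sample values are made up.

```python
import math


def anderson_darling(sample, cdf):
    """Anderson-Darling statistic A^2 of a sample against a fitted distribution function."""
    u = sorted(cdf(x) for x in sample)  # U_(1) <= ... <= U_(n)
    n = len(u)
    if n == 0:
        raise ValueError("sample must not be empty")
    total = 0.0
    for i, u_i in enumerate(u, start=1):
        # Every U_(i) must lie strictly between 0 and 1, otherwise the logarithms diverge.
        total += (2 * i - 1) * math.log(u_i) + (2 * n + 1 - 2 * i) * math.log(1.0 - u_i)
    return -n - total / n


# Hypothetical times between failures (hours), tested against an exponential model
# with a made-up mean of 15 hours.
MEAN_HOURS = 15.0


def exp_cdf(t):
    return 1.0 - math.exp(-t / MEAN_HOURS)


times = [3.2, 7.5, 9.1, 12.4, 14.8, 18.0, 22.3, 30.6, 41.9]
print(anderson_darling(times, exp_cdf))
```

Because of the weight $[F(x)(1-F(x))]^{-1}$, deviations in the tails contribute more to $A^2$ than deviations near the median.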


3.2.3 The Chi-Squared statistic

The Chi-Squared statistic belongs to the
quadratic class of EDF statistics (Snedecor &
Cochran 1989). This class of statistics is based
on the squared difference $(F_n(x) - F(x))^2$.
Quadratic statistics have the following general
form:
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Table 2. 


Goodness-of-fit summary for 4th generation fighter aircraft type B.

                          Test
                          Kolmogorov Smirnov    Anderson Darling      Chi-Squared
 #   Distribution         p-value    Rank       p-value    Rank       p-value    Rank
 1   Burr(4P)             0.044      1          0.125      1          0.940      1
 2   Burr                 0.046      2          0.155      2          1.711      4
 3   Dagum(4P)            0.047      3          0.163      4          1.576      3
 4   Dagum                0.049      4          0.171      5          2.406      10
 5   Wakeby               0.049      5          0.163      3          1.126      2
 6   Log-Logistic(3P)     0.051      6          0.183      7          2.402      9
 7   Gen. Logistic        0.052      7          0.181      6          2.093      5
 8   Gen. Extreme Value   0.056      8          0.219      8          2.155      6
26   Weibull              0.096      26         1.440      28         7.397      28
30   Weibull(3P)          0.110      30         1.174      26         7.517      30
Table 3. Goodness-of-fit summary for 4th generation fighter aircraft.

                          Test
                          Kolmogorov Smirnov    Anderson Darling      Chi-Squared
 #   Distribution         p-value    Rank       p-value    Rank       p-value    Rank
 1   Log-Logistic(3P)     0.040      1          0.207      3          0.838      2
 2   Wakeby               0.042      2          0.193      1          2.000      9
 3   Gen. Logistic        0.042      3          0.203      2          0.913      3
 4   Burr(4P)             0.043      4          0.224      4          0.680      1
 5   Gen. Extreme Value   0.048      5          0.309      7          1.097      4
 6   Dagum(4P)            0.052      6          0.269      5          1.205      6
 7   Pearson5(3P)         0.053      7          0.346      8          2.609      11
 8   Pearson6(4P)         0.054      8          0.364      9          2.590      10
19   Weibull              0.078      19         1.180      20         5.379      17
26   Weibull(3P)          0.097      26         1.371      23         9.820      25
$$Q = n \int_{-\infty}^{+\infty} \left( F_n(x) - F(x) \right)^2 \psi(x)\, dF(x) \tag{15}$$


The function $\psi(x)$ weights the squared difference
$(F_n(x) - F(x))^2$.

Tables 1-3 show that the goodness-of-fit
tests do not reject the null hypothesis that the number
of hours between failures can be approximated by
the listed distributions, because the
p-value is greater than 0.01.


4 CONCLUSION


This contribution stands at the beginning of a
research program on risk projection at the Air Force
Institute of Technology for 4th generation fighter
aircraft, and subsequent developments will be the
subject of future papers. The risk profile of an
aircraft is changing and, as such, it is essential
that we continue to define and monitor desired
performance outcomes in line with the risk profile.
In order to be able to create such a profile, it is necessary
to select a probability distribution that will
facilitate the construction of reasonably precise
probability statements of the type that one wishes
to make. In this analysis, the number of hours
between failures has been used as the main variable
for drawing conclusions about the dependability
of the aircraft. It was concluded, from the quality
of the fits, that the Burr and Dagum distributions
are the best descriptions of the gathered
data. The Weibull distribution, commonly used in reliability,
did not meet the expected requirements
in this particular scenario. This can be the result
of the relatively small number of recorded incidents for
the 4th generation fighter aircraft.
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Buffered environmental contours 
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ABSTRACT: The main idea of this paper is to use the notion of buffered failure probability from 
probabilistic structural design, first introduced by Rockafellar & Royset (2010), to introduce buffered 
environmental contours. Classical environmental contours are used in structural design in order to obtain 
upper bounds on the failure probabilities of a large class of designs. The purpose of buffered failure probabilities
is the same. However, in contrast to classical environmental contours, this new concept does not
just take into account failure vs. functioning, but also to which extent the system is failing. For example,
this is relevant when considering the risk of flooding: We are not just interested in knowing whether a
river has flooded. The damage caused by the flooding greatly depends on how much the water has risen
above the standard level.


1 INTRODUCTION 


Environmental contours are widely used as a 
basis for e.g., ship design. Such contours allow the 
designer to verify that a given mechanical structure 
is safe, i.e., that the failure probability is below a
certain value. A realistic model of the environmen- 
tal loads and the resulting response is crucial for 
structural reliability analysis of mechanical con- 
structions exposed to environmental forces. See 
Winterstein et al. (1993) and Haver & Winterstein 
(2009). For applications of environmental contours 
in marine structural design, see e.g., Baarholm et 
al. (2010), Fontaine et al. (2013), Jonathan et al. 
(2011), Moan (2009) and Ditlevsen (2002). 

The traditional approach to environmental con- 
tours is based on the well-known Rosenblatt trans- 
formation introduced in Rosenblatt (1952). This 
transformation maps the environmental variables
into independent standard normal variables.
Using the transformed environmental variables a 
contour with the desired properties can easily be 
constructed by identifying a sphere centered in the 
origin and with a suitable radius. More specifically, 
the sphere can be chosen so that any non-overlap- 
ping convex failure region has a probability less 
than or equal to a desired exceedence probability. 
The corresponding environmental contour in the 
original space can then be found by transforming 
the sphere back into the original space. 

Alternatively, an environmental contour can 
be constructed directly in the original space using 
Monte Carlo simulation. See Huseby et al. (2013), 
Huseby et al. (2015a) and Huseby et al. (2015b). 
Contours constructed using this approach will 
always be convex sets. This yields a more straight- 


forward interpretation of the contours. Another 
advantage of this approach is a more flexible 
framework for establishing environmental con- 
tours, which for example simplifies the inclusion 
of effects such as future projections of the wave 
climate related to climatic change. See Vanem & 
Bitner-Gregersen (2012). 

In the present paper we introduce a new con- 
cept called buffered environmental contours. This 
concept is based on the notion of buffered failure 
probability from probabilistic structural design, 
first introduced by Rockafellar & Royset (2010). 
Contrary to classical environmental contours, this
new concept does not just take into account failure
vs. functioning, but also to which extent the system
is failing. For example, this is relevant when considering
the risk of flooding: We are not just interested
in knowing whether a river has flooded. The
damage caused by the flooding greatly depends
on how much the water has risen above the standard
level.

The structure of this paper is as follows: In Sec- 
tion 2, we recall the classic definition of failure 
probability in probabilistic structural design and 
compare this to the concept of buffered failure 
probability, as defined in Rockafellar & Royset 
(2010). Furthermore, we recall some of the argu- 
ments favoring the buffered failure probability 
over the regular failure probability. Then, in Sec- 
tion 3, we recall the concept of environmental 
contours and how such contours are used in struc- 
tural design in order to find upper bounds on the 
failure probabilities of a large class of designs. In 
Section 4, we introduce the new concept of buff- 
ered environmental contours, and argue that these 
contours are better suited than the classical ones in 
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cases where the level of malfunctioning is impor- 
tant. Finally, in Section 5, we apply the proposed 
contours to a real life example, and compare the 
contours to the classical environmental contours. 


2 STRUCTURAL DESIGN AND THE 
BUFFERED FAILURE PROBABILITY 


In probabilistic structural design, it is common to
define a performance function¹ $g(x, V)$ depending
on some design variables $x = (x_1, x_2, \dots, x_m)'$
and some environmental quantities²
$V = (V_1, V_2, \dots, V_n)' \in \mathcal{V}$, where $\mathcal{V} \subseteq \mathbb{R}^n$. The design
variables can be influenced by the designer of the
structure, and may represent material type or layout.
The quantities are usually random, and cannot
be directly impacted by the designer. Hence,
they may describe environmental conditions, material
quality or loads. To emphasize the randomness
of the quantities, we denote them by capital letters.
In contrast, the design variables are controlled by
the designer and hence denoted by small letters.

For a given design $x$, $g(x, V)$ represents the
performance of the structure, and is called the
state of the structure. A given mechanical structure
can withstand environmental stress up to a
certain level. The failure region of the structure
is the set of states of the environmental variables
that imply that the structure fails. The performance
function is defined such that if $g(x, V) > 0$,
the structure is failed, while if $g(x, V) \le 0$, the
structure is functioning. Moreover, for a given $x$ the
set $F(x) = \{v \in \mathcal{V} : g(x, v) > 0\}$ is called the failure
region of the structure³.


2.1 The failure probability, reliability and 
approximation methods 


The failure probability, denoted by $p_f(x)$, of the
structure is the probability that the structure is
failed. That is, $p_f(x) = P(g(x, V) > 0)$. If $f_V(v)$ is
the joint probability density function for the random
vector $V$, the failure probability is given by:


$$p_f(x) = \int_{F(x)} f_V(v)\, dv. \tag{1}$$


1. The performance function is sometimes called the
limit-state function.

2. Environmental quantities should here be understood
in a broad sense. E.g., for marine structures such
quantities typically include wave height and period. For
other types of structures, one may consider e.g., material
quality, effects of erosion or corrosion as environmental
quantities.

3. In some papers, such as Huseby et al. (2013), the failed
states are defined as the states such that $g(x, V) < 0$. This
is just a matter of choice of notation.


For a given $x$ the reliability, $R(x)$, of the system
is defined as the probability that the system is functioning,
i.e.:


$$R(x) = 1 - p_f(x) \tag{2}$$


A classic problem is to compute the reliability of
the system. In order to do so, we need to compute
the integral (1). In many cases it is difficult to obtain
an analytical solution to this. To overcome this
issue various approximation methods have been
proposed. Two traditional methods for doing this
are the first-order reliability method (FORM) and
the second-order reliability method (SORM). The
basic idea of the first-order reliability method is to
approximate the failure boundary at a specific point
by a first order Taylor expansion. The idea behind
SORM is similar, but using a second order Taylor
expansion instead. In both cases, the approximated
failure probability can be used to optimize the
structural design, i.e. determine a feasible design
which has an acceptable failure probability.
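
When the integral for the failure probability is not available in closed form, a crude Monte Carlo estimate is another common option. The sketch below is not FORM or SORM; it simply counts how often a sampled state has positive performance function value. The performance function $g(x, V) = V - x$ with a standard normal load $V$ and capacity $x = 2$ is a made-up toy case chosen because its exact failure probability is known.

```python
import random
from statistics import NormalDist


def estimate_failure_probability(g, sample_v, n_samples, seed=0):
    """Crude Monte Carlo estimate of P(g(V) > 0) from n_samples independent draws of V."""
    rng = random.Random(seed)
    failures = sum(1 for _ in range(n_samples) if g(sample_v(rng)) > 0.0)
    return failures / n_samples


X = 2.0  # hypothetical design variable (capacity)


def g(v):
    """Toy performance function: the structure fails when the load v exceeds X."""
    return v - X


def sample_v(rng):
    """Draw one standard normal environmental load."""
    return rng.gauss(0.0, 1.0)


p_hat = estimate_failure_probability(g, sample_v, 200_000)
p_exact = 1.0 - NormalDist().cdf(X)  # exact value for this toy case
print(p_hat, p_exact, 1.0 - p_hat)   # estimate, exact failure probability, estimated reliability
```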


2.2 Return periods 


As is common in structural design models, we view 
V as representing the average value of the relevant 
environmental variables in a suitable time inter- 
val of length L. Based on this and knowledge of 
the performance function g it is possible to com- 
pute the so-called return period. This is done as 
follows: 

We consider the environmental exposure of
the given design from time $t \ge 0$. The time axis
is divided into intervals of some specified length
$L$, and we let $V_i$ denote the average environmental
quantity in the $i$th period, $i = 1, 2, \dots$. It is common
to assume that $V_1, V_2, \dots$ are independent
and identically distributed. This is a fairly strict
assumption, but as it is so frequently used in structural
design, we assume this as well. We then let
$T = \min\{i : g(x, V_i) > 0\}$. By the assumptions it
follows that $T$ is geometrically distributed with
probability $p_f = P(g(x, V) > 0)$. The return period
is defined as $E[T] = 1/p_f$. Thus, the return period
can be interpreted as a property of the distribution
of $g(x, V)$. Hence, it suffices to analyze this distribution,
which is what we will focus on in this paper.
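
The relation between the per-period failure probability and the return period can be checked by simulation. In the sketch below each period fails independently with the same probability, so the index of the first failing period is geometrically distributed and its mean should be close to the reciprocal of that probability. The value 0.05 and the function names are illustrative assumptions.

```python
import random


def simulate_first_failure(p_f, rng):
    """Index of the first period in which failure occurs (periods fail independently with probability p_f)."""
    i = 1
    while rng.random() >= p_f:  # random() < p_f happens with probability p_f
        i += 1
    return i


def mean_return_period(p_f, n_runs, seed=0):
    """Average first-failure index over n_runs independent simulated histories."""
    rng = random.Random(seed)
    return sum(simulate_first_failure(p_f, rng) for _ in range(n_runs)) / n_runs


P_F = 0.05  # hypothetical per-period failure probability
simulated = mean_return_period(P_F, 50_000)
print(simulated, 1.0 / P_F)  # simulated mean versus the theoretical return period 1 / p_f
```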


2.3 The buffered failure probability 


The approximations made by FORM and SORM 
can sometimes be too crude and ignore serious 
risks. Therefore, we will consider the buffered 
failure probability, introduced by Rockafellar & 
Royset (2010) as an alternative to the failure proba- 
bility. This concept relates closely to the conditional 
value-at-risk (also called expected shortfall, average 
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value-at-risk or expected tail loss), which is a notion 
frequently used in mathematical finance and finan- 
cial engineering, see Pflug (2000), Rockafellar 
(2007) as well as Rockafellar & Uryasev (2000). 

Recall that for any level of probability $\alpha$, the
$\alpha$-quantile of the distribution of a random variable
is the value of the inverse of its cumulative distribution
function at $\alpha$. For the random variable $g(x, V)$,
we let $q_\alpha(x)$ denote its $\alpha$-quantile. Similarly, for
any probability level $\alpha$, the $\alpha$-superquantile of
$g(x, V)$, $\bar{q}_\alpha(x)$, is defined as:


$$\bar{q}_\alpha(x) = E\left[ g(x, V) \mid g(x, V) > q_\alpha(x) \right] \tag{3}$$


That is, the $\alpha$-superquantile is the conditional
expectation of $g(x, V)$ when we know that its value
is greater than the $\alpha$-quantile (for a continuous distribution
this is the same as conditioning on being greater than or equal to it). Rockafellar
& Royset (2010) then define the buffered failure
probability, $\bar{p}_f(x)$, as follows:


$$\bar{p}_f(x) = 1 - \alpha, \tag{4}$$


where $\alpha$ is chosen so that $\bar{q}_\alpha(x) = 0$. Note that
from the previous definitions we have:


$$\bar{p}_f(x) = P\left( g(x, V) > q_\alpha(x) \right) = 1 - F\left( q_\alpha(x) \right) \tag{5}$$


where $F$ denotes the cumulative distribution function
of $g(x, V)$.

In order to show how to calculate the buffered
failure probability $\bar{p}_f(x)$, we consider the plot
shown in Figure 1. The curve in the plot represents
the cumulative distribution function of the
performance function, $g(x, V)$. As an example we
have chosen a Gaussian distribution with mean
value $-2.5$ and standard deviation $1.5$. For this
distribution we have $F(0) = 0.952$, as can also be seen
in the figure by considering the right-most vertical
dashed line starting at 0 on the x-axis, and the corresponding
upper horizontal dashed line starting at
0.952. Hence, we get that $p_f(x) = 1 - F(0) = 0.048$.
In the figure $p_f(x)$ is the distance between the 100%-line
and the upper horizontal dashed line.

Using e.g., Monte Carlo simulation it is easy to
estimate $q_\alpha(x)$, and we find that $q_\alpha(x) = -0.743$.
In the figure $q_\alpha(x)$ is represented by the leftmost
vertical dashed line. By following this line until
it crosses the cumulative curve, we find that
$\alpha = F(q_\alpha(x)) = 0.879$. Finally, the buffered failure
probability is found to be $\bar{p}_f(x) = 1 - \alpha = 0.121$. In
the figure $\bar{p}_f(x)$ is the distance between the 100%-line
and the lower horizontal dashed line.

Figure 1. Buffered failure probability calculation where:
$p_f(x) = 0.048$, $q_\alpha(x) = -0.743$, $\alpha = F(q_\alpha(x)) = 0.879$, and
$\bar{p}_f(x) = 1 - \alpha = 0.121$.
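
For a Gaussian performance function the calculation can also be sketched without simulation, because the superquantile of a normal variable has the closed form $\mu + \sigma \varphi(z_\alpha)/(1-\alpha)$ with $z_\alpha$ the standard normal $\alpha$-quantile. The Python fragment below searches by bisection for the level $\alpha$ at which the superquantile is zero. It is a sketch using this closed form rather than the Monte Carlo estimate mentioned in the text, so small differences from the quoted figures are to be expected; the function names are assumptions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_superquantile(dist, alpha):
    """Mean of a normal variable above its alpha-quantile: mu + sigma * phi(z) / (1 - alpha)."""
    z = STD_NORMAL.inv_cdf(alpha)
    return dist.mean + dist.stdev * STD_NORMAL.pdf(z) / (1.0 - alpha)


def buffered_failure_probability(dist, tol=1e-10):
    """Return (1 - alpha, q_alpha), where alpha makes the alpha-superquantile of dist equal to zero."""
    lo, hi = 1e-9, 1.0 - 1e-9
    # The superquantile increases with alpha, so bisection on its sign locates the root.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if normal_superquantile(dist, mid) < 0.0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    return 1.0 - alpha, dist.inv_cdf(alpha)


g_dist = NormalDist(mu=-2.5, sigma=1.5)  # the Gaussian example from the text
p_f = 1.0 - g_dist.cdf(0.0)              # ordinary failure probability
p_buf, q_alpha = buffered_failure_probability(g_dist)
print(p_f, q_alpha, p_buf)  # should land close to the 0.048, -0.743 and 0.121 quoted in the text
```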

It is easy to see that we always have $q_\alpha(x) \le 0$,
and thus, it follows that $\alpha = F(q_\alpha(x)) \le F(0)$. This
implies that:


$$\bar{p}_f(x) = 1 - \alpha \ge 1 - F(0) = p_f(x).$$


Hence, it follows that the buffered failure 
probability is more conservative than the failure 
probability. See Rockafellar & Royset (2010) for a 
detailed discussion of this. 

Rockafellar & Royset (2010) present several 
advantages of using the buffered failure probabil- 
ity instead of the regular failure probability. The 
following are some of the key arguments: 


• In general, the failure probability $p_f(x)$ cannot be
computed analytically, and the techniques commonly
used to approximate it, such as FORM
or Monte Carlo methods, can sometimes ignore
serious risks. This makes it problematic to apply
standard non-linear optimization algorithms
in connection to structure design. In contrast,
non-linear optimization algorithms are directly
applicable when using the buffered failure probability
instead.

• The buffered failure probability contains more
information about the tail behaviour of the distribution
of $g(x, V)$ than the failure probability.

• The buffered failure probability can lead to
more computational efficiency in design optimization
when the performance function $g(x, V)$ is
expensive to evaluate.


The buffered reliability, $\bar{R}(x)$, of the structure
is defined as $\bar{R}(x) = 1 - \bar{p}_f(x)$. Since
$p_f(x) \le \bar{p}_f(x)$, it follows that $R(x) \ge \bar{R}(x)$. That
is, the reliability of the system is greater than or
equal to the buffered reliability. Again, this essentially
says that the buffered reliability is more conservative
than the reliability.


3 ENVIRONMENTAL CONTOURS 


Environmental contours are typically used during
the early design phases where the exact shape
of the failure region is typically unknown. At this
stage it may not be possible to express a precise
functional relationship between a set of design
variables $x$ and the performance of the structure.
Instead we skip $x$ in the notation and let the design
options be embedded in the performance function
$g(V)$ itself. In particular we denote the failure
region simply by $F$, while the corresponding failure
probability, $P(V \in F)$, is denoted by $p_f(F)$.

Although $F$ is unknown, it may still be possible
to argue that $F$ belongs to some known family, $\mathcal{E}$,
of failure regions. As in the previous sections we
consider cases where the environmental conditions
can be described by a stochastic vector $V \in \mathbb{R}^n$
with a known distribution. An important part of
the probabilistic design process is then to make
sure that $P(V \in F)$ is acceptable for all $F \in \mathcal{E}$.

In order to avoid failure regions with unacceptable
probabilities, it is necessary to put some restrictions
on the family $\mathcal{E}$. This is done by introducing
a set $B \subseteq \mathbb{R}^n$ chosen so that for any relevant failure
region $F$ which does not overlap with $B$, the
failure probability $P(V \in F)$ is small. The family
$\mathcal{E}$ is chosen relative to $B$ so that $F \cap B \subseteq \partial B$ for
all $F \in \mathcal{E}$, where $\partial B$ denotes the boundary of $B$.
This boundary is then referred to as an environmental
contour. See Figure 2.

Following Huseby et al. (2017) we define the
exceedence probability of $B$ with respect to $\mathcal{E}$ as:


$$P_e(B, \mathcal{E}) = \sup \{ p_f(F) : F \in \mathcal{E} \}. \tag{6}$$


For a given target probability $P_e$ the objective is
to choose an environmental contour $\partial B$ such that:


$$P_e(B, \mathcal{E}) = P_e.$$


We observe that the exceedence probability
defined above represents an upper bound on the
failure probability of the structure assuming that
the true failure region is a member of the family
$\mathcal{E}$. Of particular interest are cases where one can
argue that the failure region of a structure is convex.
That is, cases where $\mathcal{E}$ is the class of all convex sets
which do not intersect with the interior of $B$. In the
remaining part of the paper we will assume that $\mathcal{E}$
satisfies this.

Figure 2. An environmental contour $\partial B$ and a failure
region $F$.


3.1 Monte Carlo contours 


There are many possible ways of constructing 
environmental contours. In this paper we focus on 
the Monte Carlo based approach first introduced 
in Huseby et al. (2013), and improved in Huseby et 
al. (2015a) and Huseby et al. (2015b). 

Let $\mathcal{U}$ be the set of all unit vectors in $\mathbb{R}^n$,
and let $u \in \mathcal{U}$. We then introduce a function $C(u)$
defined for all $u \in \mathcal{U}$ as:


$$C(u) := \inf \{ C : P(u'V > C) \le P_e \} \tag{7}$$


Thus, $C(u)$ is the $(1 - P_e)$-quantile of the distribution
of $u'V$. Given the distribution of $V$, the
function $C(u)$ can easily be estimated by using
Monte Carlo simulation. Thus, let $V_1, \dots, V_N$ be a
random sample from the distribution of $V$. We then
choose $u \in \mathcal{U}$, and let $Y_r(u) = u'V_r$, $r = 1, \dots, N$.
These results are sorted in ascending order:


$$Y_{(1)} \le Y_{(2)} \le \dots \le Y_{(N)}.$$


Using the sorted numbers we first estimate $C(u)$.
Since $C(u)$ is the $(1 - P_e)$-quantile in the distribution,
a natural estimator is:


$$\hat{C}(u) = Y_{(k)},$$


where $k$ is determined so that:


$$\frac{k}{N} \approx 1 - P_e.$$


Note, however, that this estimator can be
improved considerably by using importance sampling.
See Huseby et al. (2015b) for details.
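
The plain (non-importance-sampled) estimator can be sketched as follows: project the sample onto a unit vector, sort the projections, and take the order statistic whose rank is about $(1 - P_e)N$. The two-dimensional independent standard normal environment, the sample size and the target probability 0.05 are assumptions made so that the result can be compared with a known quantile; they are not taken from the paper.

```python
import math
import random
from statistics import NormalDist


def estimate_C(samples, u, p_e):
    """Estimate the (1 - p_e)-quantile of the projection u'V from a list of sample vectors."""
    y = sorted(sum(u_i * v_i for u_i, v_i in zip(u, v)) for v in samples)
    n = len(y)
    k = min(n, max(1, math.ceil(n * (1.0 - p_e))))  # rank with k / n close to 1 - p_e
    return y[k - 1]  # order statistics are 1-based, Python lists are 0-based


rng = random.Random(1)
samples = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(20_000)]
P_E = 0.05

estimates = {}
for theta_deg in (0, 45, 90):
    theta = math.radians(theta_deg)
    u = (math.cos(theta), math.sin(theta))
    estimates[theta_deg] = estimate_C(samples, u, P_E)
    print(theta_deg, estimates[theta_deg])

# For independent standard normal components, u'V is standard normal for every unit vector u,
# so each estimate should be close to the standard normal (1 - P_E)-quantile.
reference = NormalDist().inv_cdf(1.0 - P_E)
print(reference)
```

Repeating this over a fine grid of directions gives the supporting hyperplanes from which the contour is assembled.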

For each $u \in \mathcal{U}$, we also introduce the halfspaces:


$$\Pi^-(u) = \{ v : u'v \le C(u) \},$$
$$\Pi^+(u) = \{ v : u'v > C(u) \}.$$


We then define the environmental contour as
the boundary $\partial B$ of the convex set $B$ given by:


$$B = \bigcap_{u \in \mathcal{U}} \Pi^-(u) \tag{8}$$

It follows that the exceedence probability of $B$
with respect to $\mathcal{E}$ is given by:

$$\begin{aligned}
P_e(B, \mathcal{E}) &= \sup \{ p_f(F) : F \in \mathcal{E} \} \\
&= \sup \{ p_f(\Pi^+(u)) : u \in \mathcal{U} \} \\
&= \sup_{u \in \mathcal{U}} P(u'V > C(u)) = P_e,
\end{aligned}$$


where the second equality follows since we have
assumed that every $F \in \mathcal{E}$ is convex and hence contained in
$\Pi^+(u)$ for some $u \in \mathcal{U}$. In fact, for all $u \in \mathcal{U}$ we have
$\Pi^+(u) \in \mathcal{E}$ as well, and these halfspaces are the maximal
sets within $\mathcal{E}$. Moreover, the last equality follows
by the definition of $C(u)$ given in (7). Thus, we
conclude that the contour $\partial B$ indeed has the correct
exceedence probability with respect to $\mathcal{E}$. See Huseby
et al. (2017) for further details regarding this.


4 BUFFERED ENVIRONMENTAL 
CONTOURS 


In this section, we introduce a new concept called 
buffered environmental contours. This combines 
the ideas behind buffered failure probabilities and 
environmental contours. Before we introduce the 
main results we review a result on superquantiles
which will be essential in our approach (see Rockafellar
(2007)).


Proposition 1. Let $g_1$ and $g_2$ be two performance
functions such that $g_1(V) \le g_2(V)$ almost surely,
and let $\bar{q}_{\alpha,1}$ and $\bar{q}_{\alpha,2}$ denote the $\alpha$-superquantiles of
$g_1$ and $g_2$ respectively. Then $\bar{q}_{\alpha,1} \le \bar{q}_{\alpha,2}$.


As a corollary of this result we get the following
result on buffered failure probabilities:


Corollary 2. Let $g_1$ and $g_2$ be two performance functions
such that $g_1(V) \le g_2(V)$ almost surely, and let
$\bar{p}_{f,1}$ and $\bar{p}_{f,2}$ denote the buffered failure probabilities
of $g_1$ and $g_2$ respectively. Then $\bar{p}_{f,1} \le \bar{p}_{f,2}$.


For a given performance function $g$ its failure
probability, $p_f$, can be computed based on the failure
region of $g$ alone. In contrast, computing the
buffered failure probability, $\bar{p}_f$, requires more
detailed information about the distribution of $g$.
We indicate this by expressing $\bar{p}_f$ as a function of
$g$, denoted $\bar{p}_f(g)$.

Just as for classical environmental contours,
a buffered environmental contour is the boundary
$\partial \bar{B}$ of some suitable set $\bar{B} \subseteq \mathbb{R}^n$. We shall now
describe how the set $\bar{B}$ can be constructed. As in
the previous section we let $\mathcal{U}$ be the set of all unit
vectors in $\mathbb{R}^n$, and let $u \in \mathcal{U}$. Moreover, we let $P_e$ be
a given target probability, and let $C(u)$ be defined
by (7). In order to introduce buffering, we let:


$$\bar{C}(u) = E\left[ u'V \mid u'V > C(u) \right]. \tag{9}$$


Given the distribution of $V$, the function $\bar{C}(u)$ can easily be estimated by using Monte Carlo simulation. As in Subsection 3.1, we let $V_1, \ldots, V_N$ be a random sample from the distribution of $V$, and choose $u \in \mathcal{U}$. Based on the sorted values $Y_{(1)} \le Y_{(2)} \le \cdots \le Y_{(N)}$ we first estimate $C(u)$ by $Y_{(k)}$ as previously explained. We then estimate $\bar{C}(u)$ by computing the average value of the sampled values which are greater than $Y_{(k)}$. Thus, we estimate $\bar{C}(u)$ by:

$$\hat{\bar{C}}(u) = \frac{1}{N-k} \sum_{r>k} Y_{(r)}.$$
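The estimator above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `estimate_c_and_cbar` is hypothetical, and the rule $k = \lceil N(1 - P_e) \rceil$ for choosing the order statistic is an assumption standing in for the procedure of Subsection 3.1.

```python
import math
import random


def estimate_c_and_cbar(samples, u, p_e):
    """Return (C_hat, Cbar_hat) for the direction u from samples of V."""
    # Project every sample onto u: Y_r = u'V_r, then sort the projections.
    y = sorted(sum(ui * vi for ui, vi in zip(u, v)) for v in samples)
    n = len(y)
    # Assumed rule: about a fraction p_e of the projections exceeds Y_(k).
    k = math.ceil(n * (1.0 - p_e))
    k = min(max(k, 1), n - 1)      # keep at least one value in the tail
    c_hat = y[k - 1]               # Y_(k) as a 1-based order statistic
    tail = y[k:]                   # Y_(k+1), ..., Y_(N)
    cbar_hat = sum(tail) / (n - k)
    return c_hat, cbar_hat


# Illustrative use with a hypothetical bivariate standard normal V.
rng = random.Random(0)
demo = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(20000)]
c_demo, cbar_demo = estimate_c_and_cbar(demo, (1.0, 0.0), 0.05)
print(c_demo, cbar_demo)
```

Since the tail average is taken over values that are at least as large as $Y_{(k)}$, the buffered estimate can never fall below the classical one.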


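For each $u \in \mathcal{U}$, we also introduce the halfspaces:

$$\bar{\Pi}^-(u) = \{v : u'v \le \bar{C}(u)\}, \qquad \bar{\Pi}^+(u) = \{v : u'v > \bar{C}(u)\},$$

similar to what we did in the previous section. Finally, we define the buffered environmental contour as the boundary $\partial\bar{\mathcal{B}}$ of the convex set $\bar{\mathcal{B}}$ given by:

$$\bar{\mathcal{B}} = \bigcap_{u \in \mathcal{U}} \bar{\Pi}^-(u). \qquad (10)$$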


We observe that by (9) we obviously have $\bar{C}(u) > C(u)$. By comparing (8) and (10), it is easy to see that this implies that:

$$\mathcal{B} \subseteq \bar{\mathcal{B}}.$$


Thus, given that the same target probability $P_e$ is
used to construct both contours, the buffered envi- 
ronmental contour is more conservative than the 
classical environmental contour. 
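As a simple illustration (not taken from the paper), suppose that $u'V$ is standard normal for some direction $u$. Both quantities are then available in closed form, with $\Phi$ and $\varphi$ denoting the standard normal distribution function and density:

```latex
C(u) = \Phi^{-1}(1 - P_e), \qquad
\bar{C}(u) = E[u'V \mid u'V > C(u)] = \frac{\varphi(C(u))}{P_e}.
```

For $P_e = 0.05$ this gives $C(u) \approx 1.645$ and $\bar{C}(u) \approx 2.063$, so in this direction the buffered supporting hyperplane lies clearly outside the classical one.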

The next step is to identify a family $\mathcal{G}$ of performance functions defined relative to the set $\bar{\mathcal{B}}$ such that $\bar{p}_f(g) \le P_e$ for all $g \in \mathcal{G}$. We recall that for the classical environmental contour we chose to let $\mathcal{E}$ be the family of all convex failure regions which do not intersect with the interior of $\mathcal{B}$. Thus, one might think that the natural counterpart for buffered environmental contours would be to let $\mathcal{G}$ be the family of performance functions with convex failure regions which do not intersect with the interior of $\bar{\mathcal{B}}$. In this case, however, we need more control over the distributions of the performance functions. In order to do so we choose $u \in \mathcal{U}$ and introduce the performance function $\Gamma(u, \cdot)$ given by:


$$\Gamma(u, V) = u'V - \bar{C}(u).$$

By (9) we have:

$$E[\Gamma(u,V) \mid \Gamma(u,V) > C(u) - \bar{C}(u)] = E[u'V \mid u'V > C(u)] - \bar{C}(u) = 0.$$

Moreover, by (7) we have:

$$\bar{p}_f(\Gamma(u, \cdot)) = P(\Gamma(u,V) > C(u) - \bar{C}(u)) = P(u'V > C(u)) = P_e.$$


Since the unit vector $u$ was arbitrarily chosen, we conclude that the performance function $\Gamma(u, \cdot)$ has the desired buffered failure probability $P_e$ for all $u \in \mathcal{U}$.

We will use these performance functions as a basis for constructing the family $\mathcal{G}$, where the $\Gamma(u, \cdot)$-functions serve as maximal elements in this family. Note that the $\Gamma(u, \cdot)$-functions now play a similar role as the halfspaces $\Pi^+(u)$ played in the construction of the family $\mathcal{E}$. Thus, we let $\mathcal{G}$ be the family of all performance functions $g$ for which there exists a $u \in \mathcal{U}$ such that $g(v) \le \Gamma(u, v)$ for all $v \in \mathcal{V}$. By the above discussion the following result is immediate:


Theorem 3. For all $g \in \mathcal{G}$ we have $\bar{p}_f(g) \le P_e$.


Proof: Assume that $g \in \mathcal{G}$. Then there exists a $u \in \mathcal{U}$ such that $g(V) \le \Gamma(u, V)$ almost surely. Hence, by Corollary 2 and the above calculations we have:

$$\bar{p}_f(g) \le \bar{p}_f(\Gamma(u, \cdot)) = P_e.$$


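Having constructed both the set $\bar{\mathcal{B}}$ and the family $\mathcal{G}$ we are now ready to introduce the buffered exceedence probability of $\bar{\mathcal{B}}$ with respect to $\mathcal{G}$ defined as:

$$\bar{P}_e(\bar{\mathcal{B}}, \mathcal{G}) := \sup\{\bar{p}_f(g) : g \in \mathcal{G}\}. \qquad (11)$$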

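We note that by the definition of $\mathcal{G}$ it follows that $\Gamma(u, \cdot) \in \mathcal{G}$ for all $u \in \mathcal{U}$. Hence, since $\bar{p}_f(\Gamma(u, \cdot)) = P_e$ and $\bar{p}_f(g) \le P_e$ for all $g \in \mathcal{G}$ by Theorem 3, we get:

$$\bar{P}_e(\bar{\mathcal{B}}, \mathcal{G}) = \sup\{\bar{p}_f(g) : g \in \mathcal{G}\} = P_e.$$

Thus, we conclude that the contour $\partial\bar{\mathcal{B}}$ indeed has the correct buffered exceedence probability with respect to $\mathcal{G}$.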

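If $g \in \mathcal{G}$ and $g(v) \le \Gamma(u, v)$ for all $v \in \mathcal{V}$, we have:

$$\mathcal{F}(g) \subseteq \mathcal{F}(\Gamma(u, \cdot)) = \{v : u'v - \bar{C}(u) > 0\} = \{v : u'v > \bar{C}(u)\} = \bar{\Pi}^+(u).$$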


Thus, the failure region of a performance function $g \in \mathcal{G}$ does not overlap with the interior of the set $\bar{\mathcal{B}}$, but is contained within a halfspace supporting $\bar{\mathcal{B}}$. This is similar to the relation between failure regions in the family $\mathcal{E}$ and the set $\mathcal{B}$ for the classical environmental contours. However, as already pointed out, knowledge about the failure region of a performance function is not sufficient to ensure that the performance function has the correct buffered failure probability.

It may be argued that the choice of the $\Gamma(u, \cdot)$-functions as maximal elements in the family $\mathcal{G}$ is too restrictive. In order to have a more flexible framework, it is possible to consider a slightly more general approach where we define:

$$\bar{C}_a(u) = E[a\,u'V \mid u'V > C(u)] = a\,\bar{C}(u), \qquad (12)$$

where $a$ is a positive constant. By increasing the $a$-factor, the contour may be inflated so that it can be used for steeper performance functions.

On the other hand, it should be noted that to ensure that a given performance function $g$ has the correct buffered failure probability, it is not necessary that $g(v)$ is dominated by some $\Gamma(u, \cdot)$-function for all $v \in \mathcal{V}$. It is sufficient that this holds for $v$-values corresponding to the upper tail area of $g$.


5 NUMERICAL EXAMPLE 


In this section we illustrate the proposed
method by considering a numerical example intro- 
duced in Vanem & Bitner-Gregersen (2015). More 
specifically, we consider joint long-term models for 
significant wave height, denoted by H, and wave 
period denoted by T. A marginal distribution is 
fitted to the data for significant wave height and 
a conditional model, conditioned on the value of 
significant wave height, is subsequently fitted to 
the wave period. The joint model is the product of 
these distribution functions: 


$$f_{T,H}(t, h) = f_H(h)\, f_{T|H}(t \mid h).$$


Simultaneous distributions have been fitted to 
data assuming a three-parameter Weibull distribu- 
tion for the significant wave height, H, and a log- 
normal conditional distribution for the wave period, 
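T. The three-parameter Weibull distribution is parameterized by a location parameter, $\gamma$, a scale parameter, $\alpha$, and a shape parameter, $\beta$, as follows:

$$f_H(h) = \frac{\beta}{\alpha}\left(\frac{h-\gamma}{\alpha}\right)^{\beta-1} \exp\left[-\left(\frac{h-\gamma}{\alpha}\right)^{\beta}\right], \quad h \ge \gamma.$$
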
The lognormal distribution has two parameters, the log-mean $\mu$ and the log-standard deviation $\sigma$, and is expressed as:

$$f_{T|H}(t \mid h) = \frac{1}{t\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln t - \mu)^2}{2\sigma^2}\right), \quad t \ge 0,$$

where the dependence between $H$ and $T$ is modelled by letting the parameters $\mu$ and $\sigma$ be expressed in terms of $H$ as follows:

$$\mu = E[\ln(T) \mid H = h] = a_1 + a_2 h^{a_3},$$
$$\sigma = SD[\ln(T) \mid H = h] = b_1 + b_2 e^{b_3 h}.$$


The parameters $a_1, a_2, a_3, b_1, b_2, b_3$ are estimated using available data from the relevant geographical location. In the example considered here the
parameters are fitted based on a data set from 
North West Australia. We consider data for two 
different cases: swell and wind sea. The param- 
eters for the three-parameter Weibull distribution 
are listed in Table 1, while the parameters for the 
conditional log-normal distribution are listed in 
Table 2. In all the examples we use a return period of 25 years. The models are fitted using sea states representing periods of 1 hour, which gives 24 data points per 24 hours. Thus, the desired exceedence probability is given by:

$$P_e = \frac{1}{25 \cdot 365.25 \cdot 24} = 4.5631 \cdot 10^{-6}.$$
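The arithmetic behind this target probability can be reproduced with a short script. The helper name `exceedance_probability` is hypothetical; the only inputs are the return period and the sea state duration stated above.

```python
def exceedance_probability(return_period_years, sea_state_hours):
    """Target probability per sea state for a given return period."""
    # Number of sea states per year, using 365.25 days per year as in the text.
    states_per_year = 365.25 * 24.0 / sea_state_hours
    return 1.0 / (return_period_years * states_per_year)


p_e = exceedance_probability(25.0, 1.0)
print(f"{p_e:.4e}")
```

With a 25-year return period and 1-hour sea states there are 219150 sea states in the return period, so $P_e$ is the reciprocal of that count.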

For more details about these examples we refer 
to (Vanem & Bitner-Gregersen 2015). 

The classical environmental contours are esti- 
mated based on the methods presented in Huseby 
et al. (2013). More specifically, we have used 
Method 2 presented in that paper. The buffered environmental contours are estimated in exactly the same way, except that $C(u)$ is replaced by $\bar{C}(u)$ for all $u \in \mathcal{U}$.
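To make the simulation step concrete, the following sketch draws sea states from the joint model using the swell parameters of Tables 1 and 2 and estimates $C(u)$ and $\bar{C}(u)$ for a few directions. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the order-statistic rule $k = \lceil N(1 - P_e) \rceil$ is assumed, and a demonstration probability of 0.01 is used because a 25-year target would need a far larger sample. The step that turns these support values into contour points (Method 2 of Huseby et al. (2013)) is not reproduced here.

```python
import math
import random

# Swell parameters as listed in Table 1 (Weibull) and Table 2 (lognormal).
SWELL = {"alpha": 0.450, "beta": 1.580, "gamma": 0.132,
         "a": (0.010, 2.543, 0.032), "b": (0.137, 0.000, 0.000)}


def sample_sea_state(par, rng):
    """Draw one (T, H) pair from the joint Weibull/lognormal model."""
    # Inverse-CDF sampling of the three-parameter Weibull for H.
    w = rng.random()
    h = par["gamma"] + par["alpha"] * (-math.log(1.0 - w)) ** (1.0 / par["beta"])
    a1, a2, a3 = par["a"]
    b1, b2, b3 = par["b"]
    mu = a1 + a2 * h ** a3               # log-mean of T given H = h
    sigma = b1 + b2 * math.exp(b3 * h)   # log-standard deviation of T given H = h
    return rng.lognormvariate(mu, sigma), h


def support_values(samples, angle, p_e):
    """Estimate (C(u), C-bar(u)) for u = (cos(angle), sin(angle))."""
    ux, uy = math.cos(angle), math.sin(angle)
    y = sorted(ux * t + uy * h for t, h in samples)
    n = len(y)
    k = min(max(math.ceil(n * (1.0 - p_e)), 1), n - 1)  # assumed rule for k
    return y[k - 1], sum(y[k:]) / (n - k)


rng = random.Random(2018)
samples = [sample_sea_state(SWELL, rng) for _ in range(20000)]
P_E_DEMO = 0.01  # illustrative only, much larger than the 25-year target
for j in range(8):
    angle = 2.0 * math.pi * j / 8
    c, c_bar = support_values(samples, angle, P_E_DEMO)
    print(f"angle={angle:5.2f}  C={c:8.3f}  C_bar={c_bar:8.3f}")
```

In every direction the buffered value is at least as large as the classical one, which is the numerical counterpart of the inclusion of the classical contour in the buffered contour.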

In Figure 3 and Figure 4 the resulting environmental contours are shown. As expected, the classical environmental contours are located inside their respective buffered contours. Thus, since the target probability $P_e$ is the same for both types of contours, the buffered contours are more conservative than the classical contours.


Table 1. Fitted parameters for the three-parameter Weibull distribution for significant wave heights.

            α        β        γ
Swell       0.450    1.580    0.132
Wind sea    0.605    0.867    0.322

Table 2. Fitted parameters for the conditional log-normal distribution for wave periods.

                   i = 1    i = 2    i = 3
Swell      a_i     0.010    2.543    0.032
           b_i     0.137    0.000    0.000
Wind sea   a_i     0.000    1.798    0.134
           b_i     0.042    0.224    -0.500




Figure 3. Buffered environmental contour (black) and 
classical environmental contour (gray) for North West 
Australia Swell with return period 25 years. 


Figure 4. Buffered environmental contour (black) and classical environmental contour (gray) for North West Australia Wind sea with return period 25 years.


6 CONCLUSIONS AND FUTURE WORK 


In the present paper we have introduced the con- 
cept of buffered environmental contours, and 
shown how such contours can be estimated using 
Monte Carlo simulations. Such contours do not 
just take into account the probability of failure, 
but also the consequences of a failure. This is 
relevant e.g., when analysing the risk of flooding 
at a given location. While it may not be possible 
to prevent floodings from occurring, the damage 
caused by such an event can vary a lot depending 
on how much the water has risen above the normal 
level. In some cases only minor damages may be 
the result. In other cases the consequences can be 
catastrophic. 

For a given target probability, $P_e$, buffered environmental contours are generally more conservative than the classical environmental contours.
However, in cases where the consequences are more 
important than the triggering event itself, a higher 
target probability might be acceptable as long as 
the damages are manageable. Thus, in real-life 
applications a buffered environmental contour may 
not be so conservative after all. At the same time 
these contours provide much more information
about the tail area of the environmental variables.
This may be very useful when a design is optimized. 

The buffered environmental contours proposed 
in this paper are the natural extension of the Monte 
Carlo contours introduced in Huseby et al. (2013). 
In particular both contour types are boundaries of 
convex sets. Sometimes this restriction may lead to 
contours which include areas of very low probabil- 
ity. Thus, it would be of interest to investigate other 
ways of constructing buffered contours. In particu- 
lar, it is possible to modify contours obtained by 
using the Rosenblatt transformation so that they 
include buffering. To make this work, however, 
evaluating the resulting contours becomes very 
important. The evaluation framework described in 
Huseby et al. (2017) may serve as a starting point. 

Future work in this area also includes the use of 
buffered environmental contours in design optimi- 
zation, but with additional design constraints. The 
question is how such additional constraints can be 
dealt with. An initial idea is to apply a Lagrange 
duality method in order to transform the problem 
into a previously known form. 

It would also be interesting to compare buffered 
environmental contours to the conservative envi- 
ronmental contours defined by Leira (2008). The 
contours defined in Leira (2008) are typically larger 
sets than the environmental contours considered in 
Section 3, which means that they are more conserv- 
ative when it comes to classifying structures as safe. 

Another idea which requires further investigation 
is how time can be introduced into this model in a 
less restrictive way. As mentioned in Subsection 2.2, we consider average stochastic environmental conditions $V_1, V_2, \ldots$ over some specified time intervals and assume independence and identical distributions of the $V_i$'s. A more realistic approach would be to introduce a stochastic process in continuous time modelling the environmental situation. It would be interesting to see how this affects the model and what consequences this has for the design optimization.

Technical service life prediction of deteriorating structures

ABSTRACT: The technical service life as a quantitative durability parameter of deteriorating structural
members is studied. The probabilistic analysis and prediction of durability of deteriorating structures 
subjected to recurrent extreme service and climate actions is discussed. The strategy of this prediction is 
based on the concept that not only a performance but also a safety margin of deteriorating members of 
load-carrying structures are time-dependent random variables. The effect of coincident recurrent extreme 
actions on their survival probabilities is analysed. The instantaneous and time-dependent survival prob- 
abilities of particular members may be assessed by the method of transformed conditional probabilities. 
The technical service life ¢,, as a quantitative durability parameter of deteriorating members is related with 
the target value of the generalized reliability index. The presented methodology on durability prediction of structures may help engineers to calculate the technical service life of deteriorating structures.


1 INTRODUCTION 


The load-carrying structures of buildings and con- 
struction works must be designed, constructed, 
erected and operated in such a way that they 
maintain safety and all quality parameters dur- 
ing an explicit or implicit period of time without 
requiring unforeseen costs for their maintenance, 
repair and reconstruction. Higher serviceability 
and durability requirements are applied to structures whose routine or preventive maintenance and repair require great effort (Lukoševičienė and Kudzys 2009). Timely maintenance and repairs may prolong their technical service life effectively.
Qualitative and quantitative inspections in a stand- 
ard format present the performance of materials 
and components. The contemporary standard 
ISO/CD 15686 (1997) recommends using inspec- 
tions data for assessment and prediction of the 
durability of deteriorating and ageing members by 
deterministic and semi-probabilistic methods. 

Stewart and Val (1999), Czarnecki and Nowak 
(2008), Cheung et al. (2009) developed probabilis- 
tic design approaches and models, describing grad- 
ually deteriorating concrete and steel structures 
of construction works. The random gradual dete- 
rioration of load-carrying systems and decrease in 
their structural reliability are caused by heterogene- 
ous actions of dynamically changing environmen- 
tal service loads, and aggressive accumulation of 
chemicals on concrete or steel surfaces of system 
members. Besides, a predictive reliability analysis 
of deteriorating members and systems is required 
to prevent structures from premature damage and 
to avoid losses and accidents. 


But the technical service life, ¢,, as basic durabil- 
ity factor of structural members has a big random 
scatter and should be treated as a stochastic vari- 
able which values may be assessed by probability- 
based approaches and methods. Besides, using 
deterministic and semi-probabilistic approaches, it 
is inconceivable to fix a real reliability index of a 
deteriorating member the failure domain of which 
changes with time. A durability as time-dependent 
probabilistic reliability of elements may be pro- 
longed by repairs of materials and components. 
The standard differentiation in the reliability of 
structures is based only on classes of failure conse- 
quences (EN 1990 2002). However, the methodol- 
ogy of sustainable durability predictions requires 
taking into account future repair and replacement 
abilities of deteriorating structural members. The minimum values of the reliability index, $\beta$, are associated with the structures or structural members.

The purpose of this study is to suggest approaches to the probability-based prediction of the time-dependent safety, reliability index and technical service life of deteriorating members of structures when they are affected by aggressive actions and recurrent extreme loads.


2 RELIABILITY ANALYSIS OF 
DETERIORATING STRUCTURES 


2.1 Deterioration of load-carrying structures 


The probability-based reliability of structural systems (frameworks, trusses, carcasses) or their subsystems, i.e. structural members (beams, slabs, columns, joints) as physically distinguishable parts of a building or civil engineering work, may be objectively assessed and predicted only by knowing the survival probabilities of particular members (normal, cross or oblique sections, connections, deflections). The reliability prediction of deteriorating members and their systems must account for all extreme action combinations. Predicted durability parameters for deteriorating structures depend on chemical or physical diagnosis and the acceptable risk of serviceability failure associated with the damage levels and losses.

Aggressive actions induced by concrete carbonation, chloride penetration and other chemical attacks are the main factors determining the deterioration of concrete members and their systems (Lukoševičienė and Kudzys, 2009). Aggressive environmental conditions cause the degradation of concrete covers and reinforcing bars, which leads to the irreversible limit state of load-bearing members. A process of concrete
carbonation leading to global depassivation of rein- 
forcement bars is initiated by carbon dioxide. The 
highly alkalinous environment with the pH value 
larger than 12 protects steel members from corro- 
sion, when carbon dioxide, CO,, reaches the reinforc- 
ing bars and the corrosion product, Fe(OH),, begins 
to cover their surfaces (CEB Bulletin 238, 1997). 
The values of deterministic durability parameters 
of members due to their deteriorating are assessed 
and presented at design codes or standards by deter- 
ministic implicit recommendations and related with 
long-term experience of designers (EN 1990, 2002; 
ISO 2394, 1998; EN 12500, 2000). 

According to JCSS (2000), it is expedient to divide the life cycle $t_n$ of deteriorating structures into the initiation, $t_{in}$, and propagation, $t_{pr}$, periods (Figure 1). Using hierarchical models, the time-dependent resistance and action effect are presented in Figure 1 (b). The resistance of particular members in the propagation period may be defined as (Mori and Ellingwood, 1993)

$$R(t) = \varphi(t) R_{in}, \qquad (1)$$

where $\varphi(t)$ denotes the deterioration function and $R_{in}$ is the initial value of member resistance.

Figure 1. Degradation function $\varphi(t)$ (a) and dynamic model (b) for time-dependent reliability analysis: 1 - unloaded members, 2 - loaded columns, 3 - loaded beams (Kudzys & Lukoševičienė, 2009).

The deterioration function of particular members caused by corrosion may be presented in the form:

$$\varphi(t) = 1 - a(t - t_{in})^b \le 1, \qquad (2)$$

where $a$ is a degradation intensity factor; $b$ defines the non-linearity of the deterioration function; $t$ is the time being considered; $t_{in}$ is the time of corrosion initiation (Mori & Ellingwood, 1993; Zhong & Zhao, 2005). The deterioration function is close to rectilinear ($b = 1$) and parabolic ($b = 2$) when the corresponding degradation mechanisms are steel corrosion and aggressive environmental attacks. The marine corrosion process of steel structures is not a linear function of time (Melcher et al., 2008). The deterioration functions given by Eq. (2) are shown in Figure 2 for $b = 1$ and $a = 0.00125, 0.0025, 0.00375$ (straight lines 2, 3, 4), respectively. The coefficient of variation of
a member resistance may be expressed 
as )=[8R,, + 8%9(t) | “. Its components 
ko = o nmm and 89(t)=o9(t)/¢g,,(t) = 
=a„(t- 1,) x ba/[1- a„(t-t„)]| characterize the 
variations of initial resistance R,, and rectilin- 
ear deterioration function from Eq. (2). When 
the parameters of a factor a of this function 
a,, < 0.005, 8a =0.3-0.5 and 6R,, =0.08 -0.20 
the random value BR(c,, insignificantly exceeds 
the coefficient of variation 6R,,. Therefore the 
variances of time-dependent and initial resistances 
of deteriorating particular members are very close 
in their values, i.e. 6°R(t) = 6?R,. 

According to the studies of Stewart and Val (1999) and Enright and Frangopol (1998) on reinforcement corrosion of reinforced concrete bridge beams, several resistance deterioration functions (Figure 2) have been proposed that relate to low, medium and high deterioration (curves 1, 5, 6), respectively:

$$\varphi(t) = 1 - a_1 t + a_2 t^2, \qquad (3)$$

when

$a_1 = 0.0005$, $a_2 = 0$, $t_{in} = 10$ years as low;
$a_1 = 0.005$, $a_2 = 0$, $t_{in} = 5$ years as medium;
$a_1 = 0.01$, $a_2 = 0.00005$, $t_{in} = 2.5$ years as high,

where $a_1$, $a_2$ are degradation intensity parameters. It is assumed that $a_1$ is a random variable with the coefficient of variation equal to 0.2.

Figure 2. Deterioration function calculated using curves (1, 5, 6) from Eq. (3) and straight lines (2, 3, 4) from Eq. (2).
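The two families of deterioration functions in Eqs. (2) and (3) can be evaluated with a short script. This is a sketch under assumptions: the function names and the `LEVELS` dictionary are hypothetical, the function of Eq. (2) is taken as 1 before corrosion initiation, and the coefficients of Eq. (3) are read from the garbled list above as $(a_1, a_2)$ = (0.0005, 0), (0.005, 0) and (0.01, 0.00005).

```python
def deterioration_power(t, a, b, t_in):
    """Eq. (2): phi(t) = 1 - a * (t - t_in)**b after corrosion initiation."""
    if t <= t_in:
        return 1.0  # assumption: no resistance loss before initiation
    return 1.0 - a * (t - t_in) ** b


# (a1, a2) pairs for Eq. (3); values as read from the text, labels hypothetical.
LEVELS = {"low": (0.0005, 0.0), "medium": (0.005, 0.0), "high": (0.01, 0.00005)}


def deterioration_quadratic(t, level):
    """Eq. (3): phi(t) = 1 - a1 * t + a2 * t**2 for a named deterioration level."""
    a1, a2 = LEVELS[level]
    return 1.0 - a1 * t + a2 * t * t


for years in (0, 25, 50):
    row = [deterioration_power(years, 0.0025, 1, 10)]
    row += [deterioration_quadratic(years, name) for name in ("low", "medium", "high")]
    print(years, [round(value, 4) for value in row])
```

Multiplying the initial resistance by the returned factor gives the time-dependent resistance of Eq. (1).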

The deterioration function in initiation and 
propagation periods of reinforced concrete and 
steel structures is shown in Figure 1 (a). 


2.2 Random safety margin of particular members 


Using the hierarchical dynamic model for time- 
dependent reliability analysis, the time-dependent 
safety margin of deteriorating particular members 
exposed to permanent and variable action effects 
may be written in the form: 


t),0| = 6,R(t)- 8E,- OE, - (4) 
-6,E,,(t)- 6E,(t)— 6,£,(t) 
here X(¢) and @ are the vectors of basic physical and 
additional variables, representing random com- 
ponents (resistances and action effects) and their 
model uncertainties; R(t) — time-dependent resist- 
ance of deteriorating members; E,, E > Ego Es 
and E, are the action effects caused by permanent, 
sustained, extraordinary service (live), snow and 
wind loads, respectively; 6r, 6,,6,,.4,,.9, are 
additional random variables containing the design 
model uncertainties associated with resistances and 
action effects of particular members. The mean 
values and standard deviations of additional vari- 
ables are: &,, =0.99-1.10, 68, =0.05—-0.10 and 
Pan =F m= Opm n= Pym * 1.00, 00,= 06, = 60, 
= 00, = 60, ~0.10 (Hong and Lind 1996, 1cs8 
2000); 

The resistance, as a static or dynamic structural response, has a probability distribution that can be described by normal or lognormal distribution laws (ISO 2394 1998; EN 1990 2002). According to the recommendations of JCSS (2000), EN 1990 (2002), ISO 2394 (1998) and Mori and Kato (2003), a Gaussian distribution law is to be used for permanent actions, while lognormal, Gaussian, Weibull and gamma distributions may be assumed for sustained live loads. Imposed intermittent extraordinary service and industrial actions may be assumed to follow an exponential law (Vrouwenvelder 2002). The Gumbel cumulative distribution function is quite appropriate for the probability analysis of structures subjected to annual extreme climate actions such as wind and snow loads (ISO 4354 1998, JCSS 2000).

The structural safety analysis of deteriorating members may be based on the limit state criteria:

$$Z_k = R_k - E_k, \quad k = 1, 2, \ldots, n-1, n, \qquad (5)$$

where $R_k = \varphi_k \theta_R R_{in}$ is the resistance of the deteriorating member at the sequence cut $k$ and $E_k$ is the action effect at the same sequence cut $k$.

The means and variances of these components 
of the safety margin Z, given by Eq. (5) may be 
expressed: 


(eR), = Brn [12 Ap (t =? Rin (6) 
O (ORR) = nO R + RB", (7) 
Bry Em (8) 
CE=6,,0°E, + E20 0,. (9) 


When extreme action effects are caused by two stochastically independent variable actions, a failure of members may occur not only in the case of their coincidence but also when the value of one out of two effects is extreme. Therefore, three stochastically dependent safety margins should be considered as follows:

$$Z_{1k} = R_k - E_{1k}, \quad k = 1, 2, \ldots, n_1, \qquad (10)$$

$$Z_{2k} = R_k - E_{2k}, \quad k = 1, 2, \ldots, n_2, \qquad (11)$$

$$Z_{12k} = R_k - E_{12k} = R_k - E_{1k} - E_{2k}, \quad k = 1, 2, \ldots, n_{12}, \qquad (12)$$

where $n_{12}$ is the number of sequence cuts.

The durations of extreme floor and climate 
actions are: d, = l-14 days for merchant and 
1-3 days for other buildings, d, = 14-28 days and 
d,, = 8-12 hours. Renewal rates of annual extreme 
actions are equal to A = 1/year. Therefore, the 
recurrence number of two joint extreme actions
during the design working life of structures, $t_n$, in
years, may be calculated by the formulae: 
M =t,(d,+d,)AA4, (13) 


where 4=4 =1/t, are the renewal rates of 
extreme loads (LukoSeviciené and Kudzys 2009). 


2.3 Instantaneous and time-dependent survival 
probabilities of particular members 


The instantaneous survival probability of particular members at the k-th extreme situation, assuming that they were safe at the situations $1, 2, \ldots, k-1$, may be expressed as:

$$P(S_k) = P(Z_k > 0) = P\{R_k - E_k > 0\}, \quad k = 1, \ldots, n. \qquad (14)$$

The values of the instantaneous survival probability of particular members may be calculated using analytical, numerical integration and Monte Carlo simulation methods. In the structural safety design of ductile particular members, the resistance $R_k$ and the single extreme action effect $E_k$ may be treated as statistically independent variables of their random safety margins. Therefore, the instantaneous survival probability of deteriorating members can be expressed by the convolution integral:

$$P(S_k) = \int_0^{\infty} f_{R_k}(x)\, F_{E_k}(x)\, dx, \qquad (15)$$

where $f_{R_k}(x)$ is the density function of the resistance of a member and $F_{E_k}(x)$ is the cumulative distribution function of its extreme action effect.
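The convolution integral of Eq. (15) can be approximated numerically. The sketch below is an illustration under assumptions, not the Matlab program referred to in the text: it takes a normally distributed resistance and a Gumbel distributed extreme action effect, integrates over the mean plus or minus eight standard deviations of the resistance with the trapezoidal rule, and uses hypothetical input values.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329


def gumbel_cdf(x, mean, std):
    """CDF of a Gumbel (maximum) distribution given its mean and standard deviation."""
    scale = std * math.sqrt(6.0) / math.pi
    loc = mean - EULER_GAMMA * scale
    z = -(x - loc) / scale
    if z > 700.0:  # exp(z) would overflow; the CDF is numerically zero here
        return 0.0
    return math.exp(-math.exp(z))


def survival_probability(r_mean, r_std, e_mean, e_std, steps=20000):
    """Trapezoidal approximation of the integral of f_R(x) * F_E(x)."""
    resistance = NormalDist(r_mean, r_std)
    lo, hi = r_mean - 8.0 * r_std, r_mean + 8.0 * r_std
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * dx
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * resistance.pdf(x) * gumbel_cdf(x, e_mean, e_std)
    return total * dx


# Hypothetical inputs: resistance N(100, 20) and Gumbel action effect (mean 70, std 20).
print(survival_probability(100.0, 20.0, 70.0, 20.0))
```

A Monte Carlo estimate obtained by sampling the resistance and the action effect independently should agree with this value up to sampling error, which gives a simple cross-check of either method.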

The computer program (LukoSeviciené and 
Kudzys, 2009) was written in the Matlab environ- 
ment and adapted to predictions of time-depend- 
ent instantaneous survival probabilities of ductile 
autosystems of the reinforced concrete or steel 
members. So, the time-dependent survival prob- 
ability of particular member as stochastic autosys- 
tem with n elements may be defined as follows: 


P,(S) 
-psx fi (resia asf] 
(16) 
where n is the number of extreme situations; 
. 1 ie 
Paty [3a] (17) 


is the bounded conventional correlation factor of a 
quadratic matrix between safety margins of system 


elements as a member of conventional correlation 
vector written in the form: 


p=[1. ag, Ah | yi 


When the vector p = 0 or p = 1, the proposed 
Eqs. (16), (22) and (23) give accurate solution. 

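Basic correlation coefficients between system elements may be calculated by the equation:

$$\rho_{kl} = \frac{\mathrm{Cov}(Z_k, Z_l)}{\sigma Z_k \cdot \sigma Z_l}, \qquad (19)$$

where $\mathrm{Cov}(Z_k, Z_l)$ and $\sigma Z_k$, $\sigma Z_l$ are the covariance and standard deviations of the safety margins $Z_k$ and $Z_l$ calculated by Eq. (5).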

The bounded index of a conventional correla- 
tion factor may be expressed as: 


x, = P(S,)x[(4.5+49,)/(1-0.982,) |" (20) 


where its index 


2, P(S,) 
V, = = TE (21) 
© (k=1)x(k+28-158, +30)” 


helps us evaluate an effect of survival probabilities 
of system elements P(S,) = ®(,) on the bounded 
index by Eq. (16) when a reliability index of k-th 
element is equal to £, 

The relation between bounded correlation fac- 
tor and correlation coefficient for systems consist 
of four and ten elements are presented in Figure 3. 


2.4 Technical service life prediction 


The technical service life as the lifetime at preset 
target reliability index of deteriorating members 
is the period for which it can actually perform, 
according to the service requirements based on 
an intended purpose, without major repairs. In 
any case, the technical service life, t£, of members 
comes to an end before the beginning of concrete 
spalling or steel cracking process at an attack 
phase (LukoSeviciené and Kudzys, 2009). The 
design service life of structures is their working 
life used in the design process taking into consid- 
eration its probable dispersion since it is a time- 
dependent random quality. The design service life 
should ensure the required level of safety with 
respect to target service life (Poukhonto, 2003). 
Target service life of buildings and structures is 
specified by the client or owner in accordance with 
general rules (EN 1990, 2000; ISO 2394, 1998). 
The requirements for the technical service life
Figure 3. The bounded correlation factor versus basic
correlation coefficient. 


depending on the level under consideration, may 
be specified for structural integrity of the building, 
load carrying capacity and strength of the materi- 
als. Most of these requirements have been included 
in codes and standards (Trbojevic, 2009). 

The durability prediction of structures should be 
considered for beams, columns, slabs, piles, joints and 
other structural members as auto systems represent- 
ing their multicriteria failure mode due to various 
action effects and responses of particular members. 
Illustrations of series, series-parallel and parallel systems are shown in Figure 4. For example, continuous beams are characterised using stochastically dependent conventional elements in series-parallel connections (Figure 4 (c)). Due to system redundancy, a limit state in any one normal section 1 or 2 of the beams does not mean their failure. In contrast, the failure of beams in any oblique section 3 implies the failure of the autosystem.

According to concepts of transformed condi- 
tional probabilities, the total survival probabilities 
of structural members (beams, columns, plates, 
trusses) as series, series-parallel and parallel sys- 
tems may be expressed: 


P(S) 

= (5)«TH (rodia odi 
(22) 

PCS) 

~1- P(F)x HI ( ne] il |} 
(23) 


Figure 4. Illustrations of series (a), series-parallel (c) and parallel (b) systems.


P(S,) from Eq. (22) and P(F,) from Eq. (23) 
denote the survival and failure probabilities of 
ductile autosystem k = 1...m; where m is the 
number of random discrete failure modes; Ply 
is the bounded conventional correlation factor 
assessed by Eq. (17). 

The generalized reliability indices of series and 
parallel systems are expressed by: 


p20" [ 205)] (24) 


B,,=>| P.(S)] (25) 


where $\Phi^{-1}[\cdot]$ is the inverse cumulative distribution function of the standard normal distribution, applied to the survival probabilities given by Eq. (22) and Eq. (23) (Kudzys and Lukoševičienė 2016).
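The mapping between a survival probability and its generalized reliability index can be sketched with the Python standard library. The function name `reliability_index` is hypothetical, and the survival probability is assumed to have been computed beforehand, for instance by Eq. (22) or Eq. (23).

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def reliability_index(p_survival):
    """Generalized reliability index: inverse standard normal CDF of P(S)."""
    if not 0.0 < p_survival < 1.0:
        raise ValueError("survival probability must lie strictly between 0 and 1")
    return STANDARD_NORMAL.inv_cdf(p_survival)


# Round trip for an illustrative index of 3.8.
p = STANDARD_NORMAL.cdf(3.8)
print(p, reliability_index(p))
```

Because the index is a monotone function of the survival probability, comparing indices with a target value is equivalent to comparing survival probabilities with the corresponding target probability.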

The design documents recommended by the 
Joint Committee on Structural Safety (JCSS, 2000) 
are acknowledged as the progressive probabilistic 
model code in the design of structural members 
and their systems. The reliability index of structures 
is related to the consequences of failure classes 
when the target value is equal to p, = 3.3-4.3 (EN 
1990, 2002; ISO 2394, 1998). Particular elements 
and members of the structure may be designed on

Figure 5. Determination of the technical service life of a structural member using the time-dependent reliability index curve.

Figure 6. Scheme of single storey building.

Table 1. The failure probabilities in normal random vector space of the concrete structure under the effect of carbonization and chloride ion aggressive actions.

Sampling    Failure probability             Coefficient of
time        carbonization   chloride ions   correlation      P(F1 ∩ F2)     P(F1 ∪ F2)      Method
20th year   198 × 10^-7     107 × 10^-7     0.93             57 × 10^-7     248 × 10^-7     Zhong & Zhao (2005)
                                                             56.8 × 10^-7   248.2 × 10^-7   TCPM
40th year   0.04847         0.00914         0.95             0.00908        0.04853         Zhong & Zhao (2005)
                                                             0.00709        0.05052         TCPM
60th year   0.4053          0.1335          0.94             0.1334         0.4054          Zhong & Zhao (2005)
                                                             0.1263         0.4125          TCPM


the same, higher or lower reliability index as for the entire structure. The reliability classes may be defined by the reliability index $\beta$ concept. Three reliability classes RC1, RC2 and RC3 may be associated with the three consequences classes CC1, CC2 and CC3.

The technical service lives, as a quantitative durability parameter of deteriorating structural members, may be calculated from Eqs. (22)-(23) by increasing the time $t$ up to their life cycle. The computation is iterated until the value $t_t$ is reached that corresponds to the target probability $P(T \ge t_t) = \Phi(\beta_t)$ (Figure 5).

The value of the technical service life $t_t$ at a preset reliability index can be defined by analytic-graphic approaches.


3 NUMERICAL EXAMPLES 


3.1 The time-dependent structural safety of 
deteriorating reinforced concrete members 


The time-dependent structural safety of deteriorating reinforced concrete members was investigated by Zhong and Zhao (2005), who proposed an algorithm for failure probability analysis. This method, interesting for practical use, helps engineers to determine approximately the failure probability of a structure with different failure modes. The sufficient accuracy of the numerical values obtained by the TCP method, in comparison with this algorithm and MCS methods, is demonstrated in Table 1.


3.2 Technical service life prediction of single 
storey building members 


The procedure of technical service life prediction is applied to the knee-joints of a single-storey reinforced concrete building (Figure 6).

The degradation process of the knee-joints is caused by concrete carbonation, and the degradation function is $\varphi(t) = 1 - 0.004(t - t_{in})$, where $t_{in} = 12$ years is the initiation period. The additional variables have means equal to 1.0 and variances equal to 0.

The mean and variance of the shear resistance in this period are $R_m = 387.6$ kN and $\sigma^2 R = (0.128 \times 387.6)^2 = 2461.4$ (kN)$^2$.

The means and variances of the shear forces caused by permanent and snow loads are $V_{gm} = 77.7$ kN, $\sigma^2 V_g = (0.10 \times 77.7)^2 = 60.4$ (kN)$^2$; $V_{sm} = 30.4$ kN, $\sigma^2 V_s = (0.60 \times 30.4)^2 = 332.7$ (kN)$^2$.
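
The variances quoted above follow from the relation variance = (coefficient of variation x mean)^2. The short sketch below only re-checks that arithmetic; the function name and the dictionary labels are hypothetical and are not taken from the paper.

```python
# Illustrative arithmetic check: variance = (coefficient of variation * mean)^2.
# Names below are hypothetical labels chosen for this sketch.

def variance_from_cov(mean: float, cov: float) -> float:
    """Return the variance of a variable given its mean and coefficient of variation."""
    return (cov * mean) ** 2

# (mean in kN, coefficient of variation), as quoted in the text
inputs = {
    "shear resistance": (387.6, 0.128),
    "permanent-load shear force": (77.7, 0.10),
    "snow-load shear force": (30.4, 0.60),
}

variances = {name: variance_from_cov(m, c) for name, (m, c) in inputs.items()}
for name, var in variances.items():
    print(f"{name}: variance = {var:.1f} kN^2")
```

The three results agree with the values 2461.4, 60.4 and 332.7 (kN)^2 given in the text to the quoted precision.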

The time-dependent survival probability $P(T \ge t)$ of deteriorating knee-joints was calculated using the method of transformed conditional probabilities expressed by Eq. (22) and numerical integration methods. The time-dependent drop of the reliability indexes $\beta(t) = \Phi^{-1}[P(T \ge t)]$ of knee-joints is demonstrated in Figure 7.

Figure 7. Technical service lives $t_t$ of knee-joints from the graph of time-dependent reliability indexes by TCPM (1) and numerical integration (2).

According to Figure 7, the technical service life of the considered knee-joints for structures of class RC2 is about 42 or 44 years using the method of transformed conditional probabilities (curve 1) or numerical integration (curve 2), respectively.

The illustrative examples show that the results 
of system analysis obtained by the proposed TCP 
method are very close to the probabilistic data 
computed by exact but sophisticated numerical 
integration. 


4 CONCLUSIONS 


The probabilistic technical service life concept 
related to target reliability indexes of deteriorating 
particular and structural members helps us to rep- 
resent their quantitative durability. The probabil- 
istic technical service life of structural members as 
series, series-parallel and parallel systems may be 
predicted by analytic-graphic method. 

The probability-based analysis and durability 
prediction of deteriorating members as systems of 
extreme events may be related to the autosystem 
concept. The survival and failure probabilities of 
sustainable series, parallel and series-parallel sys- 
tems may be calculated using the method of trans- 
formed conditional probabilities. 

The presented probability-based approaches and design models for the durability prediction of structural members may be convenient for many practitioners and can stimulate designers to predict the technical service life of deteriorating structures more actively and effectively.

Reliability quantitative analysis method for mechanical system by using extended fault tree

ABSTRACT: Fault tree analysis is an important top-down method in safety and reliability engineering. It maps the relationship between complex events, such as system-level failures, and basic events, such as component-level failures, by creating a logic diagram of the overall system. For the reliability analysis of mechanical systems in the design stage, it is extremely important to obtain the quantitative relationship between the failure and the design parameters, in addition to the logical relationship between the top event and the basic events. However, such a relationship is hard to build with conventional fault tree analysis. This paper therefore extends the conventional fault tree by developing a new arborescence under the basic events of the fault tree. A custom gate is defined within the new method. The custom gate builds the quantitative relation between the process variables (force, deformation and other performance features), which characterize the basic failure mode, and the basic random variables (dimension parameters, material parameters, load parameters, etc.). The custom gate can differ according to the failure model. The limit state function of the basic event can then be expressed on the basis of a physical model. The procedure of the method is demonstrated in this paper, and different reliability models for mechanical systems can be represented clearly. A Monte Carlo method involving dependent variables is provided to calculate the probability of the top event. An illustrative example of a lock mechanism is presented to demonstrate the method.


1 INTRODUCTION 


With the rapid development of the aerospace industry, the demand for high-reliability mechanical systems is increasing, and the reliability evaluation of mechanical systems is of great importance. However, the operating environment of mechanical systems is becoming more and more complex, which makes the reliability assessment of these systems harder.

Although the reliability evaluation of electronic products, based on probability and statistics theories as well as on failure mechanisms, is becoming much more mature, the reliability assessment of mechanical systems is still very hard because of the diversity of mechanical components and the complexity of the failure mechanisms. The conventional statistics-based reliability assessment does not suit mechanical systems very well, as it is hard to associate the basic random variables (for example, the component dimensions and the materials) with the failure modes. As a result, research on the reliability of mechanical systems has in recent years focused more on methods based on physical causal mechanisms. These methods combine probability theory with physical models [1-2]; one instance is Fault Tree Analysis (FTA).

In fault trees, the logical connections between faults and their causes are represented graphically. FTA is deductive in nature, meaning that the analysis starts with a top event (a system failure) and works backwards from the top of the tree towards the leaves to determine the root causes of the top event [3]. FTA was first put forward in 1961 by Watson [4] of Bell Laboratories and was used in the development of the Minuteman missile. It was then applied in the nuclear industry in 1975 [5]. In 1977, Lapp and Powers developed a method that uses a computer to create fault trees automatically [6-7]. To overcome some shortcomings of conventional FTA, e.g. in handling uncertainties, allowing the use of linguistic variables, and integrating human error in the failure logic model, fuzzy FTA [8] was developed. Fuzzy FTA provides a framework in which basic notions such as similarity, uncertainty and preference can be modeled effectively. In 1994, Sawyer [9] published a paper on the fuzzy fault tree analysis of mechanical systems. Lindhe [10] used fault tree analysis on an integrated level to carry out a probabilistic risk analysis of a large drinking water system in Sweden. Mao [11] applied a fuzzy fault tree to analyze the automatic water supply system of a fire control system. Later, the dynamic fault tree (DFT) [12-13] was developed. DFT extends the conventional fault tree by defining additional gates, called dynamic gates, to model complex interactions such as sequence- and functional-dependent failures, spares and dynamic redundancy management [14]. Yuan [15] applied a quantitative reliability analysis method to the warm spare gate based on DFT. Nadjafi [16] investigated the reliability of an Emergency Detection System by combining fuzzy Monte Carlo simulation and DFT.

Although plenty of work has been done on FTA, the present studies cannot comprehensively show the failure mechanism of mechanical systems. They cannot trace the lowest-level items that contribute to the failure modes. The statistical information on the failure rate of the bottom events is usually insufficient, which makes the quantitative analysis inaccurate. Besides, the correlation between the failure modes and the correlation between their causes are usually ignored. All these correlations make the reliability assessment intractable. Since the conventional fault tree is not accurate enough for the reliability assessment of complex mechanical systems, it is necessary to develop an efficient method.

We improve the conventional FTA method and extend it into a new method. The method combines physical models with system reliability theory and, most importantly, associates the basic variables with the failure modes of mechanical systems in a systematic way. As a result, the connections between faults and the lowest-level items are represented. In addition, the correlation of the failure modes can be easily identified with this method, so that correlation analysis can be carried out. Besides, the method is convenient to implement programmatically.

The rest of this paper is organized as follows. Section 2 introduces the conventional FTA method and then our extended fault tree method. Section 3 describes the main process of mechanical system reliability assessment. Section 4 gives an engineering example to demonstrate our method. Section 5 provides a brief conclusion, along with a summary of our future work.


2 FAULT TREE ANALYSIS 


In this section, we first give a general description of the structure and function of the conventional fault tree. Then we introduce our new method, which is based on the conventional fault tree.


2.1 Conventional fault tree


Fault tree is a common method in system reliabil- 
ity analysis. Conventional fault tree is a kind of 
graphic method using logic gates to describe the 
relationship between a system-level failure and 
basic events such as component-level failures. 


A fault tree contains four kinds of elements, 
which are described as follows: 


1. The top event, which is at the root of the fault tree, represents the failure of the system. For example, the top event in the fault tree shown in Figure 1 is “S-Rel”.

2. The bottom events, which are at the tips of the fault tree, represent the failure modes that cannot be resolved any further. For example, in Figure 1, the elements from $g_1 < 0$ to $g_3 < 0$ represent the bottom events.

3. The intermediate events lie between the top event and the bottom events and represent the failure modes that can be the consequence of bottom events or of other intermediate events. For example, in the fault tree shown in Figure 1, element $M_1$ is the intermediate event.

4. The logic gates. In a fault tree, these elements connect events and describe the logical relationship between them. The commonly used logic gates are the AND gate and the OR gate. The OR gate means that the event above it happens if any one of the events below it happens. In other words, the events below an OR gate are in a series relationship.

The AND gate means that the event above it happens only when all the events below it happen. In other words, the events below an AND gate are in a parallel relationship.

Using a fault tree, we can estimate the importance of each event and, on the assumption that the probability of each failure mode is known, ultimately evaluate the probability of system failure.
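
As a rough illustration of how a conventional fault tree is quantified, the sketch below evaluates a small tree with the standard gate semantics (an OR gate output occurs if any input occurs; an AND gate output occurs only if all inputs occur). It assumes statistically independent bottom events, which is exactly the simplification the extended method later relaxes. The tree topology and all probability values are hypothetical.

```python
from math import prod

def and_gate(probs):
    """Output event occurs only if all input events occur (independent inputs assumed)."""
    return prod(probs)

def or_gate(probs):
    """Output event occurs if at least one input event occurs (independent inputs assumed)."""
    return 1.0 - prod(1.0 - p for p in probs)

# Hypothetical bottom-event probabilities, for illustration only.
p_g1, p_g2, p_g3 = 0.01, 0.05, 0.02

p_m1 = and_gate([p_g2, p_g3])   # intermediate event below an AND gate
p_top = or_gate([p_g1, p_m1])   # top event below an OR gate

print(f"P(M1) = {p_m1:.6f}, P(top) = {p_top:.6f}")
```

When the bottom events are correlated, these product formulas no longer hold, which motivates the sampling-based treatment used later in the paper.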


2.2 Extended fault tree 


Focusing on the complex characteristics of the failure modes in mechanical systems, we improve and extend the conventional FTA. The new method is divided into two parts: the system calculating tree and the failure mode calculating tree.

The system calculating tree is the same as the conventional fault tree. Each bottom event of the system calculating tree is the top node of a failure mode calculating tree.

For the system shown in Figure 1, the calculating tree of failure mode $g_1 < 0$ is shown in Figure 2.

As shown in Figure 2, the failure mode calculating tree includes four kinds of elements, which are described in detail as follows:


1. The top node

The top node is put on the top of our new arborescence. This element represents a specific failure mode, which is one of the bottom events in the system calculating tree. For example, in the failure mode calculating tree shown in Figure 2, the element $g_1 < 0$ is the top node. For convenience of later calculation, we attach the reliability calculation methods to this node. These methods include FOSM, AFOSM, Monte Carlo, etc. As shown in Figure 3, we use a circle to represent the top node.

Figure 1. System calculating tree.

Figure 2. Failure mode calculating tree.

2. The bottom node

For a mechanical system, this kind of element represents the basic variables. The basic variables can be the geometric sizes or material properties of mechanical parts. In the failure mode calculating tree shown in Figure 2, the elements from $x_1$ to $x_4$ are the bottom nodes. Because basic design variables are random in actual engineering, we attach their statistical parameters to these bottom nodes. In special conditions, we give the cumulative distribution function (CDF) and the probability density function (PDF) of the basic design variables. As shown in Figure 4, we use a water-drop shape to represent the bottom nodes.

Figure 3. The top node.

Figure 4. The bottom node.

Figure 5. The intermediate node.

3. The intermediate node

In Figure 2, the elements $y_1$ and $y_2$ are the intermediate nodes. For a mechanical system, the intermediate nodes represent the process variables, which depend on the basic design variables. We use the shape in Figure 5 to represent the intermediate node.

4. The custom gate

The failure mode calculating tree is developed to establish the connection between a specific failure mode and the basic design variables. For this purpose, we put forward a kind of custom gate, modeled on the function of the logic gate in the conventional fault tree. We define a custom gate as having input data, output data and a built-in function. The built-in function depends on the physical mechanism of the failure mode.

The custom gate is represented by a triangle in the diagram and has a different definition in different positions.


1. For the connection between basic variables and a specific process variable, we define two kinds of custom gates, the explicit gate and the implicit gate, both of which use basic variables as input data and process variables as output data.

When the process variables are obtained by using specific software or an algorithm to process the basic variables, the implicit gate is used; it is represented by a blue triangle. The built-in function of an implicit gate can be an instruction to call the proper software or algorithm. For the implicit gate in Figure 2, the built-in function is:

$y_1 = f_1(x_1)$    (1)

When there is an analytic mathematical model between the process variable and the basic variables, the explicit gate is used; it is represented by a white triangle. The built-in function of the explicit gate in Figure 2 is:

$y_2 = f_2(x_2, x_3)$    (2)

The explicit gate and the implicit gate are shown in Figure 6.

2. For the connection between process variables and a specific failure mode, we define a specific gate, the limit state gate. As shown in Figure 7, this kind of gate uses process variables and some basic variables as input data and outputs the probability of the failure mode.

This gate contains a limit state equation. The limit state equation is defined by setting the performance function, which is based on the physical mechanism model of the failure mode, equal to zero. For example, the limit state equation in Figure 2 can be expressed as:

$g(y_1, y_2, x_4) = 0$    (3)

Attaching the failure mode calculating trees to the bottom events of the system calculating tree, we obtain our extended fault tree, as shown in Figure 8.

Figure 6. The implicit gate and explicit gate.

Figure 7. The gate of performance function.

Figure 8. The extended fault tree.

Based on the extended fault tree described above, we can first use the failure mode calculating trees to obtain the probability of each failure mode and then use the system calculating tree to obtain the reliability of the system.

The extended fault tree has two main advantages over the conventional one. First, the relationship between the basic variables and the failure of the system can be described in a clear and systematic way. Second, for a mechanical system with many complex failure modes, our method can obtain the probability of the failure modes more accurately than the conventional fault tree by using the implicit gate and the explicit gate.
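
One possible way to represent the custom gates in a program is sketched below. This is only an illustration under stated assumptions: the `Gate` class, the `external_solver` stand-in and every function body are hypothetical, and the limit state gate here returns the value of the performance function for one sample rather than a failure probability.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Values = Dict[str, float]

@dataclass
class Gate:
    """A custom gate: named inputs, one named output and a built-in function."""
    inputs: List[str]
    output: str
    func: Callable[..., float]
    kind: str = "explicit"  # "explicit", "implicit" or "limit_state"

    def apply(self, values: Values) -> None:
        """Evaluate the built-in function and store the result under the output name."""
        values[self.output] = self.func(*(values[name] for name in self.inputs))

def external_solver(x1: float) -> float:
    """Hypothetical stand-in for a call to simulation software (implicit gate)."""
    return 2.0 * x1 + 1.0

# Gates listed bottom-up; the functions are placeholders, not the paper's models.
gates = [
    Gate(["x1"], "y1", external_solver, kind="implicit"),
    Gate(["x2", "x3"], "y2", lambda x2, x3: x2 * x3, kind="explicit"),
    Gate(["y1", "y2", "x4"], "g", lambda y1, y2, x4: x4 - (y1 + y2), kind="limit_state"),
]

def evaluate_tree(sample: Values) -> float:
    """Propagate one sample of the basic variables up to the performance function g."""
    values = dict(sample)
    for gate in gates:
        gate.apply(values)
    return values["g"]

g_value = evaluate_tree({"x1": 1.0, "x2": 2.0, "x3": 3.0, "x4": 10.0})
print("g =", g_value, "-> failure" if g_value < 0 else "-> safe")
```

Keeping the gate type as data makes it straightforward to swap an analytic function for a solver call without changing the code that walks the tree.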


3 PROCESS OF ANALYSIS


The quantitative analysis of the extended fault tree 
is presented in this section. For a complex mechan- 
ical system, in the process of quantitative reliabil- 
ity analysis, the main problem is correlation of the 
failure modes. To solve the problem of correlation 
and calculate the reliability of system, a method 
based on Monte Carlo simulation is provided. The 
main process is shown in Figure 9. 
The process of analysis in detail is as follows: 


1. Calculate the minimal cut sets

Cut sets are the unique combinations of failure modes that can cause system failure. Specifically, a cut set is said to be a minimal cut set if, when any basic event is removed from the set, the remaining events collectively are no longer a cut set [17]. The minimal cut sets can be obtained with the method proposed by Fussell [18], which is based on the function of the logic gates in the system calculating tree. For the system shown in Figure 8, there are two minimal cut sets: $\{(g_1 < 0)\}$ and $\{(g_2 < 0), (g_3 < 0)\}$. Once the minimal cut sets are acquired, the structure function of the system calculating tree can be expressed as:


Figure 9. Flowchart of reliability assessment of the mechanical system using Monte Carlo simulation. The flowchart steps are, in order: calculate the minimal cut sets of the system calculating tree; obtain the performance functions of the bottom events; draw samples of the basic variables; calculate the process variables through the explicit gate or implicit gate; substitute the process variables into all the performance functions; calculate the probability of the mechanical system.


$\psi = \bigcup_{j=1}^{N_k} K_j = \bigcup_{j=1}^{N_k} \prod_{i \in K_j} x_i$    (4)

In Equation (4), $N_k$ is the number of minimal cut sets, $K_j$ represents the $j$th minimal cut set, and $x_i$ is the $i$th bottom event in the $j$th minimal cut set:

$K_j = \prod_{i \in K_j} x_i$    (5)

As a result, for the system shown in Figure 8, the probability of system failure can be expressed as:

$P_f = P\{(g_1 < 0) \cup [(g_2 < 0) \cap (g_3 < 0)]\}$    (6)
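
The step from logic gates to minimal cut sets can be sketched with a simple recursive expansion. This is a generic top-down expansion written for illustration, not necessarily the algorithm of reference [18]; the tree encoding, the labels `g1`, `g2`, `g3` and the assumed topology (an OR gate over one bottom event and an AND gate over two others) are assumptions chosen to reproduce a system with one single-event and one two-event minimal cut set.

```python
from itertools import product

def cut_sets(node):
    """Return the cut sets of a fault tree node as a list of frozensets."""
    if isinstance(node, str):          # bottom event
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":                   # any child's cut set causes the event
        return [cs for sets in child_sets for cs in sets]
    if gate == "AND":                  # one cut set from every child is needed
        return [frozenset().union(*combo) for combo in product(*child_sets)]
    raise ValueError(f"unknown gate: {gate}")

def minimal_cut_sets(node):
    """Drop duplicates and any cut set that strictly contains another cut set."""
    unique = set(cut_sets(node))
    minimal = [cs for cs in unique if not any(other < cs for other in unique)]
    return sorted(minimal, key=lambda cs: (len(cs), sorted(cs)))

def top_event(cuts, failed):
    """Structure function: the top event occurs if all events of some cut set failed."""
    return any(cs <= failed for cs in cuts)

# Hypothetical tree: top = OR(g1, AND(g2, g3))
tree = ("OR", ["g1", ("AND", ["g2", "g3"])])
cuts = minimal_cut_sets(tree)
print([sorted(cs) for cs in cuts])
```

With the cut sets in hand, `top_event` plays the role of the structure function: it can be evaluated for each sampled state of the bottom events.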


2. Obtain the performance function of the bottom event

The performance function of each bottom event is obtained from the failure mode calculating tree, which associates the bottom nodes or process variables with the bottom event. The common methods used in this process include the response surface method (RSM), the Kriging method, etc.

3. Probability calculation of top event in system calculating tree based on Monte Carlo simulation

The probability of the top event in the system calculating tree is calculated by


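$P(T) = P(K_1 \cup K_2 \cup \cdots \cup K_{N_k}) = \sum_{i=1}^{N_k} P(K_i) - \sum_{i<j=2}^{N_k} P(K_i K_j) + \sum_{i<j<k=3}^{N_k} P(K_i K_j K_k) + \cdots + (-1)^{N_k - 1} P(K_1 K_2 \cdots K_{N_k})$    (7)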


Because of the correlation between the bottom events and the correlation between the minimal cut sets, Equation (7) cannot be solved analytically, so Monte Carlo simulation is adopted. Each of the bottom nodes is considered as a random variable with a given expected value and variance. We draw samples of these variables and calculate the values of the process variables through the explicit gate or implicit gate. The probabilities $P(K_1 K_2 \cdots K_{N_k})$ in Equation (7) are then estimated from the process variables. Finally, the probability of the top event is obtained.
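
A minimal sketch of this sampling procedure is given below. All distributions, gate functions, thresholds and the cut-set structure are hypothetical and chosen only for illustration. Because two performance functions share the basic variable `x1`, the corresponding failure modes are correlated; counting the top event sample by sample accounts for this automatically, without evaluating the individual terms of the inclusion-exclusion expansion.

```python
import random

# Minimal cut sets assumed for this sketch: {g1} and {g2, g3}.
CUT_SETS = [("g1",), ("g2", "g3")]

def performance_functions(x1, x2, x3):
    """Hypothetical gates and limit states; a failure mode occurs when g < 0."""
    y1 = 2.0 * x1   # stand-in for an implicit gate (e.g. a simulation call)
    y2 = x1 + x2    # explicit gate; shares x1 with y1, so g1 and g2 are correlated
    return {"g1": y1 - 17.0, "g2": y2 - 13.5, "g3": x3 - 6.5}

def simulate(n_samples=50_000, seed=0):
    """Estimate failure-mode and top-event probabilities by direct sampling."""
    rng = random.Random(seed)
    counts = {"g1": 0, "g2": 0, "g3": 0, "top": 0}
    for _ in range(n_samples):
        # Draw the basic variables (hypothetical normal distributions).
        x1 = rng.gauss(10.0, 1.0)
        x2 = rng.gauss(5.0, 0.5)
        x3 = rng.gauss(8.0, 1.0)
        g = performance_functions(x1, x2, x3)
        failed = {name for name, value in g.items() if value < 0.0}
        for name in failed:
            counts[name] += 1
        # Top event: every event of at least one minimal cut set has failed.
        if any(all(event in failed for event in cut) for cut in CUT_SETS):
            counts["top"] += 1
    return {name: count / n_samples for name, count in counts.items()}

result = simulate()
for name, p in result.items():
    print(f"P({name}) ~ {p:.4f}")
```

In practice the number of samples has to be large relative to the reciprocal of the failure probability being estimated, which is why rare failure modes call for sample sizes in the millions.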


4 CASE STUDY 


In this section, we use our new method to analyze the reliability of a lock mechanism used to control the opening and closing of an aircraft landing gear cabin door.
The mechanism is shown in Figure 10. The system includes six components: a cylinder, a piston, a lock hook, a rocker arm and two connecting rods. There are six revolute joints in the system, which are represented by R0-R5 in Figure 10. The piston is connected to the rocker arm by a spring.

The system has three main failure modes: the start-up failure, the movement failure and the positioning failure.

Each of these failure modes can lead to the failure of the system. As a result, the system calculating tree is as shown in Figure 11.

In Figure 11, we use $g_1 < 0$ to represent the start-up failure, $g_2 < 0$ to represent the movement failure and $g_3 < 0$ to represent the positioning failure.

The start-up failure occurs when the starting force is smaller than a certain value. As a result, the starting force $F$ is considered as the failure index.

The performance function of the start-up failure is shown in Equation (8):

$g_1 = F - [F]$    (8)


Figure 10. The lock mechanism. Labelled parts: piston, rocker arm, micro connecting rod, connecting rod, lock hook; friction is also indicated.

Figure 11. System calculating tree of the lock mechanism.

where $F$ represents the starting force and $[F]$ represents the limit state value of the starting force.

The movement failure occurs when the time to unlock is shorter than a certain value. As a result, the unlocking time is considered as the failure index, and the performance function of the movement failure is shown in Equation (9):

$g_2 = t - [t]$    (9)

where $t$ represents the unlocking time and $[t]$ represents the limit state value of the unlocking time.

For both the start-up failure and the movement failure, there are five influence factors, which are considered as random variables. The distributions of these variables are shown in Table 1.

In Table 1, $X_1$ represents the elastic coefficient of the spring; $X_2$ represents the damping coefficient of the spring; $X_3$ represents the decoupling angle between the lock hook and the lock ring; $X_4$ represents the maximum contact force between the lock hook and the lock ring; $X_5$ represents the change rate of the driving force applied to the piston. There are no explicit functions for the two failure modes. As a result, the implicit functions of these two failure modes are obtained with the dynamics simulation software LMS Virtual.Lab, as shown in Figure 12.

As a result, the start-up force and unlocking time can be expressed as follows:

$F = F(X_1, X_2, X_3, X_4, X_5)$    (10)

$t = t(X_1, X_2, X_3, X_4, X_5)$    (11)

Table 1. The distributions of random variables in the mechanism.

Variable/Unit   Distribution   Mean value   Standard deviation
X_1/(N/m)       Normal         5620         100
X_2/(kg/s)      Normal         478          10
X_3/deg         Normal         46.5         0.5
X_4/N           Normal         4000         400
X_5             Normal         14           0.02

Figure 12. Simulation model.



As a consequence, the failure mode calculating trees of the start-up failure and the movement failure are as shown in Figure 13.

For the positioning failure, to establish the geometric model, we simplify Figure 10 into Figure 14.

Based on the geometric model shown in Figure 14, the deflection $h$ is considered as the index of the positioning accuracy of the mechanism. The performance function of the positioning failure can be expressed as

$g_3 = h - [h]$    (12)


where $h$ represents the deflection and $[h]$ represents the limit state value of the deflection.

According to Figure 14, the deflection $h$ can be written as:


$h(a, b, f) = \arcsin \dfrac{a^2 + f^2 - b^2}{2 \cdot a \cdot f}$    (13)


Figure 13. Failure mode calculating trees of start-up failure and movement failure.

Figure 14. Schematic diagram of the system. Labelled elements: lock hook, connecting rod, rocker arm, micro connecting rod, surface of lock hook.


In Equation (13), there are three independent variables: the length of the lock hook, represented by $a$; the length of the connecting rod, represented by $b$; and the length of the piston, represented by $f$. These independent variables are random because of manufacturing errors and assembly tolerances. The distributions of their errors are shown in Table 2.

The failure mode calculating tree of the positioning failure is then as shown in Figure 15.

According to the method described in this paper, the extended fault tree of the lock mechanism is shown in Figure 16.

From Figure 16, it can be concluded that the start-up failure and the movement failure are dependent, because both depend on the same basic variables. The positioning failure is independent of the other two failure modes.

There are three minimal cut sets: $\{(g_1 < 0)\}$, $\{(g_2 < 0)\}$ and $\{(g_3 < 0)\}$. According to the process of analysis in Section 3, 1,000,000 samples of the variables ($X_1$, $X_2$, $X_3$, $X_4$, $X_5$, $a$, $b$, $f$) are drawn. The samples are substituted into Equations (10), (11) and (13), which correspond to the three failure modes. For each of the three failure modes, failure occurs if the value of its performance function is less than or equal to zero. The probabilities of the three failure modes are 0.0035, 0.0045 and 0.0082, respectively. The failure probability of the lock system is 0.0141.

Table 2. The distributions of length error of parts.

Part    Initial length   Error: lower limit   Error: upper limit   Error: mean value   Error: standard deviation
a/mm    47.566           -0.226               0.011                -0.1075             0.0395
b/mm    56               -0.240               0.048                -0.0960             0.0480
f/mm    0                -0.250               0.250                0                   0.0833

Figure 15. Failure mode calculating trees of positioning failure.

Figure 16. Extended fault tree of the lock mechanism.
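
A quick plausibility check of the reported system value can be made from the three failure-mode probabilities alone. The sketch below is not the authors' computation; it brackets the union probability between two limiting assumptions about the dependent pair (start-up and movement failure), with the positioning failure taken as independent of both in each case.

```python
p1, p2, p3 = 0.0035, 0.0045, 0.0082   # reported failure-mode probabilities
reported = 0.0141                     # reported system failure probability

# Case A: all three failure modes treated as independent.
p_independent = 1.0 - (1.0 - p1) * (1.0 - p2) * (1.0 - p3)

# Case B: extreme positive dependence, where the start-up failure occurs only
# together with the movement failure, so their union has probability max(p1, p2).
p_nested = 1.0 - (1.0 - max(p1, p2)) * (1.0 - p3)

print(f"independent modes: {p_independent:.4f}")
print(f"nested dependent pair: {p_nested:.4f}")
print(f"reported value inside the bracket: {p_nested < reported < p_independent}")
```

The reported value lies between the two limits, which is consistent with a positive but incomplete correlation between the start-up failure and the movement failure.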


5 CONCLUSIONS 


In this paper, a new method for the quantitative reliability analysis of mechanical systems using an extended fault tree is introduced. Focusing on the failure modes of mechanical systems, we improve and extend the conventional FTA. The new method is divided into two parts: the system calculating tree and the failure mode calculating tree. A custom gate is defined in the new method to associate the basic random variables with the bottom events. A procedure for the reliability evaluation of the mechanical system based on the extended fault tree is provided, in which the correlation of the bottom events is considered. A lock mechanism is investigated to demonstrate the method. The result shows that the extended fault tree can be used effectively to assess the reliability of the mechanical system.

In the course of our study, we found that the correlation that exists in mechanical systems is extremely complex. Consequently, in the future we will focus more on the correlation among failure modes and develop software based on our extended fault tree.

Research on kinematic reliability of flapping mechanism for flapping wing flight

ABSTRACT: The flapping mechanism is the core component of a flapping wing aircraft, and its kinematic reliability directly influences flight accuracy and efficiency. The joint clearance caused by wear is the most important factor affecting the kinematic accuracy of the mechanism, which directly affects the control accuracy and the flight. This paper analyzes the kinematic reliability of a flapping wing mechanism with different hinge clearances, based on the dynamics of mechanisms with clearance. The variations of the flutter angle, flutter angular velocity and angular acceleration of the wingtip with different clearances are calculated. Multiple joint clearances are considered in the flapping wing mechanism. The synchronization accuracy of the two wings is also analyzed. The results show the relationships between the flutter angle, flutter angular velocity and angular acceleration and the joint clearance. The results can be used to optimize the structural parameters of the flapping mechanism.


1 INTRODUCTION 


A flapping wing micro air vehicle (MAV) is a kind of aircraft that imitates the flying creatures of nature and relies on flapping wings to generate lift and thrust simultaneously. It has the unique advantages of high flight efficiency, high maneuverability and compact structure. The flapping mechanism is the core component, and its kinematic reliability, in particular its kinematic accuracy and synchronism, directly influences the flight accuracy and efficiency. The joint clearance caused by wear is the most important factor affecting the kinematic accuracy of the mechanism, which directly affects the control accuracy and the flight. Therefore, the kinematic accuracy and synchronization of a flapping wing mechanism with hinge clearance should be investigated.

Tsai (2004) presented an effective method to analyze the transmission performance of linkages. Equivalent kinematic pairs were used to model the freedom of motion caused by the joint clearances, and the mechanism positions were solved. Ting (2000) analyzed the influence of joint clearance on position accuracy. Flores (2012, 2012, 2010, 2007) published a series of articles on the modeling of mechanisms with joint clearance and also developed solution algorithms.

Liu (2006, 2007) analyzed the force-displacement relationship of spherical joints with clearances based on Hertz theory.


This paper aims to analyze the kinematic reliability of a flapping wing mechanism with different hinge clearances, based on the dynamics of mechanisms with clearance. The flutter angle, flutter angular velocity and angular acceleration of the wingtip with different clearances are calculated. Multiple joint clearances are considered in the flapping wing mechanism. The synchronization accuracy of the two wings is also analyzed. The relationships between the flutter angle, flutter angular velocity and angular acceleration and the joint clearance are determined.


2 EVALUATION OF MECHANISM 
SYNCHRONIZATION 


2.1 Synchronization 


Assume that achieving some function requires $n$ mechanisms to carry out their design motions synchronously, that the allowable maximum time difference for the mechanisms to finish the motion is $\varepsilon$, and that $t_i$ is the time that mechanism $i$ needs to finish the motion. Then the synchronization requirement of the mechanisms is expressed by the following equation:

$\max(t_i) - \min(t_j) \le \varepsilon \quad (1 \le i, j \le n)$    (1)


If the mechanisms that require motion synchronization are parallel mechanisms, which means that the mechanisms have the same design motion and working conditions, then the motion synchronization of the mechanisms can be expressed in another way: the displacement difference of these mechanisms must not be greater than the prescribed tolerance at the same moment. Assume that achieving some function requires $n$ mechanisms to carry out their design motions synchronously, that the allowable maximum position difference at the same moment is $\varepsilon$, and that $q_i$ is the position of mechanism $i$. Then the synchronization requirement of the mechanisms is expressed by the following equation:

$\max(q_i) - \min(q_j) \le \varepsilon \quad (1 \le i, j \le n)$    (2)


2.2 The evaluation method of the mechanism 
synchronization based on the interval 
estimation 


The traditional probability-based evaluation 
method of mechanism synchronization can solve 
the synchronization problems with same probabil- 
ity distribution types and parameters. However, 
the probability distribution types and parameters 
are rarely the same in real engineering problems. 
Therefore, it is better to apply the evaluation 
method of the mechanism synchronization which 
is based on the interval estimation. 


For a system with n mechanisms, the motion times of the mechanisms, denoted $x_i$ ($i = 1, \dots, n$), are random variables, and the probability density function of $x_i$ is $f_i(x_i)$, $i = 1, \dots, n$.

The allowable maximum time difference for the mechanisms to finish the motion is $\varepsilon$. For any two mechanisms in the system, this requirement can be transformed into a probability requirement: $P\{|x_1 - x_2| \le \varepsilon\}$.

For the two mechanisms, once the motion time $x_1$ of one of the mechanisms is determined, the motion time of the second mechanism must lie in $[x_1 - \varepsilon, x_1 + \varepsilon]$ to satisfy the synchronization requirement. As shown in Fig. 1, the probability of the two mechanisms' synchronization can be expressed by the following formula:


$$R_2 = \int_0^{\infty} f_1(x_1) \left[ \int_{x_1-\varepsilon}^{x_1+\varepsilon} f_2(x_2)\, dx_2 \right] dx_1 \tag{3}$$


Figure 1. Interval schematic diagram of two mecha- 
nism synchronization. 


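Based on the analysis of two mechanisms, the motion synchronization of three mechanisms can be analyzed. If two mechanisms satisfy the synchronization requirement, then the motion time of the third mechanism must lie between $\max(x_1, x_2) - \varepsilon$ and $\min(x_1, x_2) + \varepsilon$. If $x_1 > x_2$, the range of $x_3$ is $[x_1 - \varepsilon, x_2 + \varepsilon]$; if $x_1 < x_2$, the range of $x_3$ is $[x_2 - \varepsilon, x_1 + \varepsilon]$. The time interval of synchronization is shown in Fig. 2 and the synchronization probability can be expressed by the following formula:


$$R_3 = \int_0^{\infty} f_1(x_1) \left[ \int_{x_1-\varepsilon}^{x_1} f_2(x_2) \left( \int_{x_1-\varepsilon}^{x_2+\varepsilon} f_3(x_3)\, dx_3 \right) dx_2 \right] dx_1 + \int_0^{\infty} f_1(x_1) \left[ \int_{x_1}^{x_1+\varepsilon} f_2(x_2) \left( \int_{x_2-\varepsilon}^{x_1+\varepsilon} f_3(x_3)\, dx_3 \right) dx_2 \right] dx_1 \tag{4}$$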


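Accordingly, the formula for the synchronization probability of n mechanisms can be deduced:


$$R_n = \int_0^{\infty} f_1(x_1) \left[ \int_{x_1-\varepsilon}^{x_1+\varepsilon} f_2(x_2) \left[ \cdots \int_{\max(x_1, \dots, x_{n-1})-\varepsilon}^{\min(x_1, \dots, x_{n-1})+\varepsilon} f_n(x_n)\, dx_n \cdots \right] dx_2 \right] dx_1 \tag{5}$$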


As can be seen in Formula (5), the synchronization probability of n mechanisms can be expressed by an n-fold integral. In this integral, the upper and lower limits of the outermost variable $x_1$ span the range of the system's motion time, normally $[0, \infty)$, while the upper and lower limits of every inner variable are determined by the relative magnitudes of the outer variables.

The inner integrals cannot be solved directly, as their upper and lower limits depend on the relative magnitudes of the outer variables. Therefore, the interval method is used to transform the non-deterministic limits into deterministic limits for the inner integrals, after which Formula (5) can be solved.

As can be seen in the above analysis, when two mechanisms are considered, there is one independent integral interval; when three mechanisms are considered, there are two independent integral intervals; when four mechanisms are considered, there are six independent integral intervals; when n mechanisms are considered, the integral interval of every variable is determined by the relative magnitudes of all variables before it. Therefore, for a system with n mechanisms, there are N = (n - 1)! independent integral intervals. The relationship between the number of mechanisms n and the number of independent integral intervals N is listed in Table 1.

Figure 2. Interval schematic diagram of three mechanism synchronization.


Table 1. The relationship of mechanisms number n and independent integral interval number N.

Number of mechanisms n               2   3   4   5    ...   n
Number of positions of x_i           1   2   3   4    ...   n-1
Number of independent intervals N    1   2   6   24   ...   (n-1)!

From the above analysis, it can be seen that the synchronization probability of n mechanisms can be expressed as the sum of (n - 1)! n-fold integrals. As the (n - 1)! integrals are independent of each other, the synchronization probability of n mechanisms can be deduced:


$$R_n = \sum_{i=1}^{(n-1)!} R_i \tag{6}$$


For most of the probability distributions, though 
their probability density functions are elementary 
functions, their integrals are not elementary func- 
tions. Therefore, numerical methods are needed to 
calculate the synchronization probability of mech- 
anisms, which mainly include numerical integra- 
tion methods and Monte Carlo Method. 


1. Solving the n-fold integrals by Numerical Integration Method


If the n-fold integrals cannot be expressed by elementary functions, then the Newton-Leibniz formula cannot be used to solve the integrals. Therefore, the integral is written as a limit:


$$I = \int_a^b f(x)\, dx = \lim_{\max_i \Delta x_i \to 0} \sum_{i=1}^{n} f(\xi_i)\, \Delta x_i \tag{7}$$


In Formula (7), $\Delta x_i = x_i - x_{i-1}$, $x_{i-1} \le \xi_i \le x_i$, and $a \le x_0 < x_1 < \dots < x_n \le b$. Then an approximation method can be used as shown in the formula:


$$I = \int_a^b f(x)\, dx \approx \sum_{k=0}^{n} A_k f(x_k) = I_n \tag{8}$$


In the above formula, $\{x_k\}_{k=0}^{n}$ are the integration nodes, and $\{A_k\}_{k=0}^{n}$ are the quadrature weights.

The frequently used numerical integration methods are: interpolatory quadrature formulas, composite quadrature formulas, the Romberg quadrature formula and the Gauss quadrature formula.

The n-fold integral can be solved by repeatedly solving single integrals using the above methods.
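
The repeated single-integral approach can be sketched for the two-mechanism case of Formula (3). The following is only an illustrative sketch under stated assumptions, not the authors' implementation: the motion times are assumed to be independent and normally distributed, the composite Simpson rule stands in for whichever quadrature formula is preferred, and the outer range is truncated to the mean plus or minus eight standard deviations instead of $[0, \infty)$. All function names and parameter values are hypothetical. For independent normal motion times a closed form exists, which gives a convenient check.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution (assumed model for a motion time)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def simpson(func, a, b, n_intervals):
    """Composite Simpson rule on [a, b]; n_intervals must be even."""
    h = (b - a) / n_intervals
    total = func(a) + func(b)
    for k in range(1, n_intervals):
        total += (4 if k % 2 else 2) * func(a + k * h)
    return total * h / 3.0


def sync_probability_two(mu1, s1, mu2, s2, eps, n_outer=400, n_inner=100):
    """Nested quadrature for R_2: the inner integral runs over [x1 - eps, x1 + eps]."""
    def inner(x1):
        return simpson(lambda x2: normal_pdf(x2, mu2, s2),
                       x1 - eps, x1 + eps, n_inner)

    # Assumption: the outer range is truncated to mu1 +/- 8 standard deviations.
    lo, hi = mu1 - 8.0 * s1, mu1 + 8.0 * s1
    return simpson(lambda x1: normal_pdf(x1, mu1, s1) * inner(x1),
                   lo, hi, n_outer)


def sync_probability_exact(mu1, s1, mu2, s2, eps):
    """Closed form for independent normals: x1 - x2 is again normal."""
    s = math.hypot(s1, s2)
    d = mu1 - mu2
    return normal_cdf((eps - d) / s) - normal_cdf((-eps - d) / s)


if __name__ == "__main__":
    args = (1.0, 0.1, 1.05, 0.12, 0.1)  # hypothetical means, deviations, tolerance
    print(sync_probability_two(*args), sync_probability_exact(*args))
```

For three or more mechanisms the same pattern nests one level deeper per mechanism, with the inner limits taken from the interval analysis above.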

2. Solving the n-fold integrals by Monte Carlo Method

Assume that D is a region in the n-dimensional space $\mathbb{R}^n$ and $f: D \subset \mathbb{R}^n \to \mathbb{R}$. The n-fold integral over the region D can be expressed as:


$$I = \int_D f\, d\mu \tag{9}$$


where I can be regarded as the measure of the region D multiplied by the expectation of the function f on D. The fundamental Monte Carlo Method is to find a hypercube (with known measure M) which contains the region D, generate n sample points that are uniformly distributed in the hypercube, and count the number of sample points in the region D. Assuming that m out of the n sample points lie in the region D, the measure of the region D is approximated as:


$$\mu(D) \approx \frac{m}{n} M \tag{10}$$

The expectation of the function f is approximated as:

$$\bar{f} \approx \frac{1}{m} \sum_{x_i \in D} f(x_i) \tag{11}$$
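
The hypercube sampling of Formulas (10) and (11) can be sketched for two mechanisms as follows. This is a minimal sketch under assumptions that are not in the paper: the motion times are taken as independent normal variables, the region D is the band $|x_1 - x_2| \le \varepsilon$ inside a square box, and all names, box limits and parameter values are hypothetical. The closed form for independent normals is included only as a reference value.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution (assumed model for a motion time)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def mc_sync_probability(mu1, s1, mu2, s2, eps, lo, hi,
                        n_samples=400_000, seed=1):
    """Estimate the integral of f1*f2 over D = {|x1 - x2| <= eps} in [lo, hi]^2."""
    rng = random.Random(seed)
    box_measure = (hi - lo) ** 2      # M, the known measure of the hypercube
    hits = 0                          # m, the number of points that fall in D
    f_sum = 0.0
    for _ in range(n_samples):
        x1 = rng.uniform(lo, hi)
        x2 = rng.uniform(lo, hi)
        if abs(x1 - x2) <= eps:
            hits += 1
            f_sum += normal_pdf(x1, mu1, s1) * normal_pdf(x2, mu2, s2)
    if hits == 0:
        return 0.0
    measure_d = box_measure * hits / n_samples   # Formula (10)
    mean_f = f_sum / hits                        # Formula (11)
    return measure_d * mean_f


def exact_sync_probability(mu1, s1, mu2, s2, eps):
    """Closed form for independent normals, used only as a reference."""
    s = math.hypot(s1, s2)
    d = mu1 - mu2
    return normal_cdf((eps - d) / s) - normal_cdf((-eps - d) / s)


if __name__ == "__main__":
    est = mc_sync_probability(1.0, 0.1, 1.05, 0.12, 0.1, lo=0.3, hi=1.8)
    ref = exact_sync_probability(1.0, 0.1, 1.05, 0.12, 0.1)
    print(est, ref)
```

The box must be wide enough that the probability mass outside it is negligible; here it extends more than six standard deviations from both assumed means.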


2.3. The analysis of mechanism synchronization 


For the system with n mechanisms, the motion time of every mechanism is a random variable, expressed as $x_i$, $i = 1, \dots, n$, respectively. The total load of the system is also a random variable, expressed as z. As the motion time of mechanism i is related to several influence factors, such as mechanism parameters, working loads and environmental conditions, assume that the function relating the motion time to the influence factors is:

$$x_i = f(y_{i1}, y_{i2}, \dots, y_{im}, \omega_i, z) \tag{12}$$

In this function, $y_{i1}, y_{i2}, \dots, y_{im}$ are m random variables that influence the motion time of mechanism i, and $\omega_i$ is the ratio of the load on mechanism i to the total load, which is related to the mechanism parameters:

$$\omega_1 + \omega_2 + \cdots + \omega_n = 1 \tag{13}$$

The synchronization requirement for any two mechanisms can be expressed as:


$$\max_{0 < i, j \le n,\; i \ne j} |x_i - x_j| \le \varepsilon \tag{14}$$


For the synchronization evaluation of two mechanisms, a copula function $C_\theta(u, v)$ can be used, in which $\theta$ is the correlation parameter of the two mechanisms. The copula function can be obtained by sample fitting, and the synchronization probability can be expressed as:


$$R_2 = \int_0^{\infty} f_1(x_1) \left[ \int_{x_1-\varepsilon}^{x_1+\varepsilon} \frac{\partial^2 C_\theta(u, v)}{\partial u\, \partial v}\, f_2(x_2)\, dx_2 \right] dx_1 \tag{15}$$


Here u and v are the marginal distribution functions of $x_1$ and $x_2$ evaluated at the integration variables.

For the synchronization evaluation of three or more mechanisms, first use sample fitting to obtain the copula function, then obtain the joint probability density function $f(x_1, x_2, \dots, x_n)$ from the marginal probability density functions $f_i(x_i)$; the synchronization probability of multiple mechanisms can then be computed in combination with the interval evaluation method.

For a system in which the mechanisms are not independent, the motion times are correlated with each other and the correlation can hardly be expressed in mathematical form, so it is very difficult to obtain the synchronization probability using only mathematical methods. For these non-independent cases, a method combining dynamic simulation and sampling calculation is widely used to obtain the synchronization probability of the mechanisms in the system.
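
The sampling part of such a combined approach can be sketched as follows. This is only an illustration under loudly stated assumptions: the function `simulate_motion_times` is a hypothetical stand-in for a dynamic simulation, the linear dependence of the motion time on the shared load z and all numerical values are invented for the example, and the load weights are chosen so that they sum to one as in Formula (13).

```python
import random


def simulate_motion_times(weights, rng):
    """Hypothetical stand-in for a dynamic simulation run.

    All mechanisms share the random total load z, which makes their
    motion times correlated; each one also has its own scatter.
    """
    z = rng.gauss(100.0, 10.0)                       # assumed total load
    return [1.0 + 0.002 * w * z + rng.gauss(0.0, 0.02) for w in weights]


def sync_probability(weights, eps, n_samples=20_000, seed=0):
    """Fraction of samples for which max(x_i) - min(x_i) <= eps."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(n_samples):
        times = simulate_motion_times(weights, rng)
        if max(times) - min(times) <= eps:
            ok += 1
    return ok / n_samples


if __name__ == "__main__":
    load_weights = [0.3, 0.3, 0.4]                   # sums to one
    for eps in (0.05, 0.10):
        print(eps, sync_probability(load_weights, eps))
```

Because the same seed is reused, the estimate cannot decrease when the tolerance is enlarged, which is a quick sanity check on any such sampling routine.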


3 SYNCHRONISM ANALYSIS
OF FLAPPING MECHANISM

3.1 Flapping mechanism


The structure of the flapping wing is shown in Fig. 3. The principle of the flapping mechanism is shown in Fig. 4. The crank of the drive mechanism maintains uniform rotation and drives the wing tip to flutter through the connecting rod.


Figure 3. The structure of flapping wing.

Figure 4. The principle of flapping mechanism.


Table 2. Dimension of revolute joint with clearance.

Revolute joint    Diameter of shaft    Diameter of sleeve
B1                3                    2.6
C1                12                   11.2
D1                12                   11.2
E1                5.5                  5


3.2 Dynamic simulation of flapping wing 
mechanism 


LMS Virtual Lab is used to construct the dynamic model. For a kinematic pair with joint clearance, a spherical contact combined with a planar pair is substituted for the revolute pair; the dimensions of the revolute joints with clearance are shown in Table 2. The variation of the wingtip angle, angular velocity and angular acceleration with time can then be calculated. The results are shown in Fig. 5 to Fig. 9.

From these figures we can conclude that the variations of the wingtip angle with joint clearance are obvious, especially at the limit positions; the variations of the wingtip angular velocity with joint clearance are larger and fluctuate around the ideal values; and the variations of the wingtip angular acceleration with joint clearance are extreme, so the wingtip is under extreme shock load.

Figure 5. The variation of wingtip angle with time.

Figure 6. The variation of wingtip angular velocity with time.

Figure 7. The variation of wingtip angular acceleration with time.

Figure 8. The velocity difference of left and right wing tip with time.

Figure 9. The difference of left and right wingtip angle with time and clearance position.


3.3 Influence of clearance on motion synchronism 


To analyze the variation of the wingtip angle with joint clearance, the clearances of joints B2, C2, D2 and E2 are varied from 0.2 mm to 1.2 mm, and the difference between the left and right wingtip angles is calculated. The results are shown in Fig. 9.

As Fig. 9 shows, the difference between the left and right wingtips increases as the clearance increases. The influence of the clearance of revolute joint E2 is the largest, and the influence of the clearance of revolute joint B2 is the smallest.


4 CONCLUSIONS 


This paper analyzes the kinematic reliability of a flapping wing mechanism with different joint clearances, based on the dynamics of mechanisms with clearance. The results can be used to optimize the structural parameters of the flapping mechanism. The conclusions are as follows.


> The variations of the wingtip angle with joint clearance are obvious, especially at the limit positions; the variations of the wingtip angular velocity with joint clearance are large and fluctuate around the ideal values; the variations of the wingtip angular acceleration with joint clearance are extremely large, so the wingtip is under extreme shock load.

> As the clearance increases, the difference between the left and right wingtips increases. The influence of the clearance of revolute joint E2 is the largest, and the influence of the clearance of revolute joint B2 is the smallest.


K. Breitung


Engineering Risk Analysis Group, Technical University of Munich, Munich, Germany 


ABSTRACT: In the last fifteen years the subset sampling method has often been used in reliability problems as a tool for calculating small probabilities. This method extrapolates from an initial Monte Carlo estimate for the probability content of a failure domain defined by a suitable higher level of the original limit state function. Then conditional probabilities are estimated iteratively for failure domains decreasing to the original failure domain. Here it is explained why this concept is very problematic.


1 INTRODUCTION 


A basic problem of structural reliability is the calculation of failure probabilities given by n-dimensional integrals in the following form


$$P = \int_{g(x) \le 0} f(x)\, dx \tag{1}$$


Here f(x) is an n-dimensional PDF (probability density function) of a random vector X and g(x) is the LSF (limit state function) giving the failure condition. During the development of structural reliability methods it was found to be favorable to transform the random vector X into a standard normal random vector U with independent components. So the problem was then in this standardized form:


$$P = (2\pi)^{-n/2} \int_{g(u) \le 0} \exp\left(-\frac{|u|^2}{2}\right) du \tag{2}$$


Such transformations into the standard normal space for random vectors with independent components were first described by Rackwitz & Fiessler (1977). For random vectors with dependent components the Rosenblatt transformation has been proposed, but with the exception of the example given in Hohenbichler & Rackwitz (1981) no applications of this transformation concept are known to the author. A practically applicable method seems to be the Nataf transformation described in Der Kiureghian & Liu (1986).

A newer method for the calculation of failure probabilities is the subset simulation concept. It will be outlined here that there is an intrinsic flaw in it.


2 GLOBAL AND LOCAL EXTREMA 


A few basic facts about local and global extrema are briefly outlined here before proceeding. Let a function $f: D \to \mathbb{R}$ be given. A local minimum (maximum) is a point $x_0 \in D$ for which there exists a neighborhood U of $x_0$ such that for all $x \in U \cap D$ one always has


$$f(x) \ge f(x_0) \quad (\text{resp. } f(x) \le f(x_0)) \tag{3}$$


In a similar way a global minimum (maximum) for this function is defined as a point $x_0 \in D$ such that for all $x \in D$ always


$$f(x) \ge f(x_0) \quad (\text{resp. } f(x) \le f(x_0)) \tag{4}$$


The first is a local property of the function; it depends only on its behavior in an arbitrarily small neighborhood of the point. The second is global; one has to know the values of the function over the whole domain of definition. Therefore methods to find local extrema are plentiful and described in all textbooks about numerical analysis, whereas in these textbooks the problem of finding global extrema is covered only in a marginal way (see e.g. Nocedal & Wright (1999)).

In handling optimization problems it is important to be aware whether one searches for a global or a local extremum. Mostly a global one should be found. But this is often done by using methods for determining local extrema and then concluding in some way that the obtained local extremum is also a global one.

The most usual naive way to do this is to start a local minimum search several times from random starting points and record all local minima obtained. The smallest of these can then be considered a global one with some confidence, if enough runs have been made.
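
The naive multi-start procedure just described can be sketched in a few lines. The example below is purely illustrative: the test function, the fixed-step gradient descent used as the local search, and all names and parameter values are assumptions chosen for the sketch, not part of the paper. The function has two local minima, of which only the left one is global.

```python
import random


def f(x):
    """Hypothetical test function with one global and one local minimum."""
    return x ** 4 - 3.0 * x ** 2 + x


def df(x):
    """Derivative of f."""
    return 4.0 * x ** 3 - 6.0 * x + 1.0


def local_minimize(x0, step=0.01, n_iter=5000):
    """Plain gradient descent; it converges to whichever minimum is downhill."""
    x = x0
    for _ in range(n_iter):
        x -= step * df(x)
    return x


def multistart(n_starts=20, lo=-2.0, hi=2.0, seed=0):
    """Run the local search from random starting points and keep the best result."""
    rng = random.Random(seed)
    minima = [local_minimize(rng.uniform(lo, hi)) for _ in range(n_starts)]
    return minima, min(minima, key=f)


if __name__ == "__main__":
    found, best = multistart()
    print(sorted({round(m, 3) for m in found}), round(best, 3))
```

A single run started on the right-hand side ends in the local minimum only; the confidence that the best value is global comes entirely from the number and spread of the starting points.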

In finding global extrema a common misunderstanding is that if a function depends continuously on a parameter, then with a continuous change of the parameter the positions of the global extrema also move continuously. But in fact these points jump even in non-pathological cases. In Figure 1(a) the difference between local and global extrema is shown for a simple one-dimensional example. In Figure 1(b) the global minima for a function depending on a parameter are shown, the lower ones by solid curves and the upmost function by a dotted curve. For the three lower functions the global minima are on the left side, for the two upper ones the global minima are on the right side of the figure. The second function from above has two minima. The upmost function, denoted by a dotted curve, has its global minimum on the left side of the figure. This shows that the position of a global minimum of a function depending on a parameter does not necessarily vary continuously if the parameter is varied continuously.


(a) Local and global extrema of a function

(b) The global minima of a function depending on a parameter

Figure 1. Global and local extrema of functions: minima are shown by squares, maxima by circles, filled symbols are global extrema.


Figure 2. Constrained minimal distance points for the function given in equation (5).

Also in global optimization under constraints 
the minimum points tend to jump if the problem 
is non-trivial. This will be shown in the following 
simple example. Let be given a LSF defined by 


2 2 
> Mtui , 


g(u.u,)= -u uz 


raw (5) 
= B-u2- 


( wui + u3) 


b? 


In Figure 2 the positions of the global minima of |u| under the constraint g(u) = c are on the red line segments. On reaching the black circle they jump. In Zhigljavsky & Zilinskas (2008) the problem of distinguishing between a local and a global maximum is explained in detail in the first chapter. This is one of the main problems in global optimization. Failing to understand this and accepting local extrema as global ones might lead to erroneous results.


3 ASYMPTOTIC ANALYSIS AND FAILURE 
PROBABILITIES 


With concepts of asymptotic analysis one can obtain asymptotic approximations for failure probabilities. Here only an extremely shortened version is given; for more details see Breitung (1994). This concept considers the failure probability as a function of the distance $\beta$ of the failure domain to the origin, defined by


$$\beta = \min_{g(u) \le 0} |u| \tag{6}$$


For smooth LSF's $\beta$ is the distance of one or more points on the limit state surface $\{g(u) = 0\}$ to the origin. If the failure domain F is embedded into a family of expanding surfaces one has asymptotically


$$P(F) \sim \Phi(-\beta) \prod_{i=1}^{n-1} (1 - \beta \kappa_i)^{-1/2} \tag{7}$$


with the $\kappa_i$ the main curvatures of the limit state surface at the point u, called the design point. In the case of several such points their contributions are added. So looking at the example in Figure 3, the curvatures at the two points are needed for the asymptotic approximation. Does one need the curvatures? Not necessarily. The lemma of Hohenbichler (Breitung (1994), p. 53) says that an approximation can also be obtained if one can calculate the probability content of the neighborhoods of the design points. So one has (see Figure 4)


$$P(F) \approx P(A_1) + P(A_2) \tag{8}$$


Figure 3. Asymptotic approximation with curvatures. 


Figure 4. Asymptotic approximation calculating the 
probability content of the neighborhoods. 


4 THE SUBSET SIMULATION CONCEPT 


The subset simulation (SuS) algorithm is a variant of Monte Carlo methods; it tries to avoid the large number of data points needed in standard Monte Carlo by using an iterative procedure instead. It can be subsumed under the generic term of stochastic optimization procedures.

While importance sampling methods try to improve the efficiency of Monte Carlo by identifying regions with high probability content and moving more data points there, SuS starts from an enlarged failure domain whose design points are much nearer to the origin and then moves step by step towards the original failure domain. These regions are defined here by domains of the form $F_k = \{g(u) \le a_k\}$ with the $a_k$'s being positive and decreasing to 0. The basic thought of the method (see (Au & Beck 2001) and (Au & Wang 2014)) is now to write the failure probability P(F) as a product of conditional probabilities


$$P(F_m) = P(F_1 \mid F_0)\, P(F_2 \mid F_1) \cdots P(F_m \mid F_{m-1}) = \prod_{k=0}^{m-1} P(F_{k+1} \mid F_k) \tag{9}$$


Here $\mathbb{R}^n = F_0 \supset F_1 \supset F_2 \supset \dots \supset F_m = F$. In Figure 5 such a standard case is shown; the design points for the various LSF's are shown by black squares. Since the respective (suitably chosen) conditional probabilities are relatively large compared with the failure probability P(F) which has to be estimated, such an approach to the problem has the advantage that these conditional probabilities can be estimated more efficiently with much smaller sample sizes. The details of how these samples are produced with Markov chain Monte Carlo (MCMC) can be found in the references given above. SuS can be seen as a stochastic continuation method (this concept is explained in Allgower & Georg (2003)). Such methods try to extrapolate from a point x which solves an equation system $F(x, \tau) = 0$ for a given value of the parameter $\tau$ to find solutions for other values of $\tau$.


Figure 5. Typical SuS example.
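
To make the product form of equation (9) concrete, a compact sketch of a subset simulation run is given below. It is not the implementation used for the examples in this paper: the component-wise Metropolis step with a uniform proposal, the choice of the intermediate level as an order statistic of the current sample, and all names and default values are assumptions made for the illustration.

```python
import math
import random


def subset_simulation(g, dim, n_samples=1000, p0=0.1, max_levels=20, seed=0):
    """Estimate P(g(U) <= 0) for a standard normal vector U (illustrative sketch)."""
    rng = random.Random(seed)
    n_seeds = int(round(p0 * n_samples))
    chain_len = n_samples // n_seeds
    # Step 0: plain Monte Carlo sample from the standard normal distribution.
    samples = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_samples)]
    values = [g(u) for u in samples]
    prob = 1.0
    for _ in range(max_levels):
        order = sorted(range(len(values)), key=values.__getitem__)
        threshold = values[order[n_seeds - 1]]     # intermediate level a_k
        if threshold <= 0.0:
            break                                  # original failure domain reached
        prob *= p0                                 # factor P(F_{k+1} | F_k)
        seeds = [(samples[i], values[i]) for i in order[:n_seeds]]
        samples, values = [], []
        for u, gu in seeds:
            for step in range(chain_len):
                if step > 0:                       # the seed is the first chain state
                    cand = list(u)
                    for k in range(dim):
                        xi = u[k] + rng.uniform(-1.0, 1.0)
                        if rng.random() < math.exp(0.5 * (u[k] ** 2 - xi ** 2)):
                            cand[k] = xi
                    g_cand = g(cand)
                    if g_cand <= threshold:        # otherwise the chain stays put
                        u, gu = cand, g_cand
                samples.append(u)
                values.append(gu)
    n_fail = sum(1 for v in values if v <= 0.0)
    return prob * n_fail / len(values)


if __name__ == "__main__":
    estimate = subset_simulation(lambda u: 4.0 - u[0], dim=2)
    exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))
    print(estimate, exact)
```

For this linear limit state function the estimate can be compared with the exact value $\Phi(-4)$. Note that the sketch only sees the values of g at the sampled points, so its behavior depends on the shape of g away from the limit state surface.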


5 SUBSET SIMULATION AND 
STOCHASTIC MINIMIZATION 


It seems not to be quite clear how to classify SuS in mathematics; it uses MCMC as a tool, but MCMC is not the objective. The goal of the approach is to find an estimator $\hat{p}$ for the unknown probability $p = P(F)$. Now in a Bayesian setting estimation procedures are mostly based on using a loss function and then determining the estimator as a minimizer of the expected value of the loss function. This is outlined in Box & Tiao (1973), appendix A5.6, p. 308. For a loss function L(.) the estimator $\hat{\theta}$ for a parameter $\theta$ is then chosen such that

$$\mathbb{E}\left(L(\hat{\theta} - \theta)\right) \to \min \tag{10}$$

Therefore a parameter estimation is always transformed into a minimization problem. Since there is a stochastic component, it is a stochastic minimization, and since one wants the best estimator, it is a global stochastic minimization (see e.g. Spall (2004)).


In SuS the chosen loss function is the squared coefficient of variation for the estimator $\hat{p} = \hat{P}(F)$:

$$\mathbb{E}\left[\left(\frac{\hat{p} - p}{p}\right)^2\right] \to \min \tag{11}$$

Under the assumption of unbiasedness and independence of the estimators $\hat{p}_k = \hat{P}(F_{k+1} \mid F_k)$ this can be approximated by

$$\mathbb{E}\left[\left(\frac{\hat{p} - p}{p}\right)^2\right] \approx \sum_{k} \mathbb{E}\left[\left(\frac{\hat{p}_k - p_k}{p_k}\right)^2\right] \tag{12}$$

Then the approximately optimal estimator is given by choosing for the $\hat{p}_k$'s the values of the empirical conditional distribution function at the points $c_k$, since these are unbiased minimal variance estimators for the $p_k$'s (see e.g. Lehmann & Casella (1998)). This shows that the estimation in SuS is a minimization procedure. And the goal is clearly to find an estimator which achieves the global minimum value for the estimation error in equation (12).

6 EXAMPLES 


All the examples with the exception of the last one were calculated with the SuS algorithm given in Li & Cao (2016). For the last example the algorithm given in Uribe (2016) was used, since it records the seeds. The parameters were 500 samples per step, an acceptance probability of 0.1 and a chain length of ten. This setup was used in Au & Wang (2014) for estimates of the content of two-dimensional domains. The design points are marked by red circles and the SuS data points by green points.

The focus here is on the geometric structure of 
the failure domains and their design points, on 
purpose the different probability estimates which 
can be computed are excluded. Only two-dimen- 
sional cases are considered, since only for them it is 
possible to produce informative diagrams showing 
the behavior of the algorithm. 

The claim of SuS proponents is that the method 
is much more general than FORM/SORM and 
makes no use of design point concepts. But this 
should mean that for such simple two-dimen- 
sional cases, where it is essential to find the design 
points for getting useful estimates of the reliability 
index and/or the failure probability, also the SuS 
approach should give a fortiori good results. 

In the examples, cases are shown where the data points of the SuS algorithm move not to the design point but to points further away from the origin than the design point, leading to an overestimation of the reliability index and an underestimation of the failure probability. In Breitung (2016) and Breitung (2017) similar examples were studied.


6.1 Basic example 


A simple example will show the misunderstanding in the SuS algorithm about global and local minimization. Let two LSF's $g_1$ and $g_2$ be given, and together with them define a third LSF g as the minimum of both:


g,(U,U,) = 6 a 
g( u1) = 5+ t4, (13) 
g(u%,u,) = min(g,,g)) 


Consider now two reliability problems. In the first the failure domain is given by


$$F_1 = \{g_1(u_1, u_2) \le 0\} \tag{14}$$

In the second the failure domain F is given by


$$F = \{g(u_1, u_2) \le 0\} \tag{15}$$


(a) The design points (black filled circles) for the LSF's $g_1(u_1, u_2) = 0.5, 0.25, 0$

(b) The design points (black filled circles) for the LSF's $g(u_1, u_2) = 0.5, 0.25, 0$

Figure 6. The design points for the LSF's $g_1$ and g.


(a) SuS algorithm for the LSF $g_1$: a) solid line: contour $g_1 = 0$, dash-dotted lines: further contours for $g_1 > 0$; b) graph of $g_1$; green dots show SuS data points

(b) SuS algorithm for the LSF g: a) contour line g = 0; b) graph of g; green dots show SuS data points

Figure 7. SuS for the LSF's $g_1$ and g.


In both cases it is a series system which fails if at 
least one of the two functions is less than zero. The 
LSF’s are shown in Figure 7. The SuS algorithm 
works correctly in the first case, see Figure 7a, but 
fails for the second, see Figure 7b. 


6.2 Invariance 


An important reason that the Hasofer-Lind index was adopted as a measure for reliability is its invariance under reformulations or reparametrizations of the underlying reliability problem (see Hasofer & Lind (1974)). Also the convergence proofs for the beta point search algorithms do not depend on the specific form of the LSF. But clearly one has to start the search algorithm from different points to obtain all global minimal distance points, i.e. design points. Consider now a series system consisting of two independent components, so failure occurs if at least one fails. The first component fails if $u_1 > 5$ and the second component if $u_2 < -4$. Now, this limit state surface can be the zero set of different LSF's. For example, one has


$$g(u_1, u_2) = \min\{5 - u_1,\; 4 + u_2\} \tag{16}$$


Here both LSF's are linear; with SuS one obtains as expected an estimate for the asymptotic failure probability approximation $P(F) \approx \Phi(-4) = 3.17 \cdot 10^{-5}$. The performance of the SuS method is shown in Figure 8.

Figure 8. Series system defined by LSF in equation (16).


Assume now that the LSF for the second random variable is given not by a linear but by a logistic function in the form:


$$g_2(u_2) = \frac{1}{1 + \exp(-2(u_2 + 4))} - \frac{1}{2} \tag{17}$$


Then the LSF $g^*(u_1, u_2)$ given by


$$g^*(u_1, u_2) = \min\left\{5 - u_1,\; \frac{1}{1 + \exp(-2(u_2 + 4))} - \frac{1}{2}\right\} \tag{18}$$


defines the same limit state surface as before, but the shape of the LSF is different and the contour lines of these functions are different in the safe and unsafe domain; both LSF's have only the zero level contour in common. Here, with the LSF defined in equation (18), the points in SuS converge towards the point (5,0) and one gets as probability estimate a value of $\Phi(-5) = 2.87 \cdot 10^{-7}$, whereas the true failure probability is approximately equal to $\Phi(-4) = 3.17 \cdot 10^{-5}$, as shown in Figure 9. So, here the different forms of the LSF's influence the result of the method. The reason is that the structure of the LSF in the neighborhood of the origin is different from its form near the limit state surface.
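
The two probability values quoted above, and the size of the resulting error, can be checked with a few lines. This is a small numerical aside; the variable names are chosen only for this illustration, and the exact series-system probability is computed from the independence of the two components stated in the text.

```python
import math


def std_normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


p_second = std_normal_cdf(-4.0)    # component with u_2 < -4
p_first = std_normal_cdf(-5.0)     # component with u_1 > 5

# Series system of two independent components: failure if at least one fails.
p_series = p_first + p_second - p_first * p_second

# Factor by which the failure probability is underestimated when only the
# point (5, 0) is found.
underestimation = p_series / p_first

print(f"{p_second:.3e} {p_first:.3e} {p_series:.3e} {underestimation:.0f}")
```

The series-system probability is dominated by the nearer design point, so an estimate concentrated at (5,0) is too small by roughly two orders of magnitude.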

The same limit state surface can be described by a plethora of different LSF's. Their specific forms will influence the behavior of the SuS algorithm. Especially for more complicated LSF's for series or parallel systems it might be useful to clarify to what extent this can create convergence problems or lead to incorrect results. Certainly there will be cases where the result will not depend on the changing structures of the LSF's, but as the example above shows, it would be overoptimistic to assume that this is generally so.

Figure 9. Series system defined by LSF in equation (18).


6.3 Several design points

Let the LSF be


$$g_3(u_1, u_2) = \frac{\beta^2}{2} - |u_1 u_2| \tag{19}$$


Due to the symmetry of the LSF there are four beta points. In a FORM/SORM analysis one obtains the following asymptotic approximation for the failure probability:


$$P(g_3(u_1, u_2) \le 0) \sim 2\sqrt{2}\, \Phi(-\beta), \quad \beta \to \infty \tag{20}$$


Here clearly in a SORM analysis the beta point 
search algorithms have to be started several times 
to find all beta points. 

If this problem is examined now with SuS 
the possible outcomes of runs are shown in 
Figure 10. In fifty runs of SuS in one case only 
one beta point was detected, in 11 two, in 29 three 
and only in nine cases all four were found. This 
might lead to a systematic underestimation of the 
failure probability when not all beta points are 
found. If now several runs are combined, there 
will still be a bias, the failure probability will be 
underestimated. It is unclear to the author how 
to get a good estimator of the failure probabil- 
ity here without making some sort of geometric 
analysis similar to FORM/SORM. Also in the 
FORM/SORM approach the search algorithm for 
design points can end in local minimum distance 
points. But this is well known and by running the 
search from many different starting points one 
can expect to find all relevant design points (see 
e.g. Ditlevsen & Madsen (1996), section 5.2 and 
Melchers & Beck (2018), section 4.3.7). 


6.4 Stationarity of seeds 


A recurring claim in the SuS literature is that the seeds for the next step which lie in the failure domain $F_i$ already have a stationary distribution over this domain, i.e. perfect sampling. This is true only for the first step, where the seeds come from a Monte Carlo sample; in all further steps the seeds do not have this stationary distribution. This was discussed in Botev & Kroese (2012), section 7 and in remark 4.4.3 in Botev (2009), but it seems that these findings remained unnoticed. That the claim holds for the first step can be shown using theorem 2.4.1, p. 24 in Arnold et al. (2008). As explained in Geyer (2011), for proving that an MCMC produces exactly stationary data points, the random mechanism creating those must be studied.

(d) SuS detects four design points

Figure 10. SuS for the LSF $g_3(u_1, u_2) = 15 - |u_1 u_2|$.

Here a simple argument is given that the seeds in higher steps do not have this stationary distribution. Consider the simple case that F_i is a semi-infinite interval [b_i, ∞). An explanation that the claim cannot be correct is the fact that in a Markov chain generated with the MCMC algorithm proposed there, the next point can be the same point as before if the point found by the iteration algorithm is rejected. So the probability that the smallest seed on the interval F_i is equal to the left limit point b_i of the interval is larger than zero. But for a stationary distribution it is zero.

Figure 11. Occurrence of seeds with distance zero from the lower end point.

To demonstrate this for the problem with g(u) = 5 − u, one thousand SuS runs were made with 500 points per step, an acceptance probability of 0.1 and 10 samples per chain. Figure 11 records how often the smallest seed had distance zero from the next lower order statistic. For the first step this never happens, but from the second step onwards it happens quite often. This contradicts the claim that the seeds have a stationary distribution over F_i = [c_i, ∞) for i > 1. It is important to note that if the distributions were stationary, a duplicate would occur with probability zero, so even one such event happening during a numerical simulation makes the claim of stationarity invalid. The fact that the number might be lower for different SuS implementations and numbers of points per step does not change the conclusion, since it is a qualitative, not a quantitative, statement.
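The experiment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used for Figure 11: the function name `sus_duplicate_counts`, the uniform random-walk proposal of half-width 1, and the choice of the threshold c_i as the midpoint between the smallest seed and the next lower order statistic are all assumptions of this sketch. With that threshold, a tie between these two order statistics is the same event as the smallest seed lying exactly on the lower end point of F_i.

```python
import numpy as np


def sus_duplicate_counts(n_runs=200, n=500, p0=0.1, n_steps=5, seed=0):
    """Count, per SuS step, the runs in which the smallest seed ties
    with the next lower order statistic (g(u) = 5 - u, u standard normal)."""
    rng = np.random.default_rng(seed)
    n_seeds = int(n * p0)        # 50 seeds per step
    chain_len = n // n_seeds     # 10 samples per chain, seed included
    counts = np.zeros(n_steps, dtype=int)
    for _ in range(n_runs):
        u = rng.standard_normal(n)           # step 1: plain Monte Carlo sample
        for step in range(n_steps):
            u_sorted = np.sort(u)
            seeds = u_sorted[-n_seeds:]      # largest u, i.e. smallest g
            below = u_sorted[-n_seeds - 1]   # next lower order statistic
            if seeds[0] == below:
                counts[step] += 1
            c = 0.5 * (seeds[0] + below)     # assumed threshold definition
            # Metropolis chains targeting the standard normal restricted to [c, inf)
            x = seeds.copy()
            chains = [x]
            for _ in range(chain_len - 1):
                cand = x + rng.uniform(-1.0, 1.0, size=n_seeds)
                log_ratio = 0.5 * (x**2 - cand**2)
                accept = (np.log(rng.random(n_seeds)) < log_ratio) & (cand >= c)
                x = np.where(accept, cand, x)   # a rejected move repeats the point
                chains.append(x)
            u = np.concatenate(chains)
    return counts


if __name__ == "__main__":
    print(sus_duplicate_counts())
```

Entry 0 of the returned array refers to seeds taken from the Monte Carlo sample, the later entries to seeds taken from MCMC-generated samples, where repeated points can occur.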


7 PROBLEM STRUCTURE AND SUS 


Another important point which is problematic in 
SuS is the ignoring of the problem structure. This 
is advertised as a special feature. For example in 
Zuev et al. (2012) it is written: 


Subset Simulation provides an efficient stochastic simulation algorithm for computing failure probabilities for general reliability problems without using any specific information about the dynamic system other than an input-output model. This independence of a system's inherent properties makes Subset Simulation potentially useful for applications in different areas of science and engineering where the notion of “failure” has its own specific meaning,...

In contrast to this opinion, there is the following quote from Monahan (2011), p. 394:


For MCMC, an extremely naive user can generate
a lot of output without even understanding the
problem. The lack of discipline of learning about 
the problem that other methods require can lead to 
unfounded optimism and confidence in the results. 


8 CONCLUSIONS 


The subset simulation approach to structural reliability analysis ignores the problem of how to distinguish between local and global constrained minimum distance points on the limit state surface. It finds minimum points without checking whether there are other minimum points which give a smaller value. Certainly this analysis is restricted to cases where design points exist, but for these cases, too, SuS should give correct results. This problem of local minima is known in FORM/SORM methods and is avoided by running design point searches from different starting points. There is therefore a considerable danger of obtaining erroneous results, since SuS ignores the possibility of finding only local minima instead of global ones. This misunderstanding of the basic problem can be seen in all papers about SuS as far as the author knows: it is always assumed that there is no need to ascertain that the found solution is not based on a local minimum. But avoiding local minima and escaping from them towards global minima is the essential problem of global minimization.

It is obvious that similar problems will appear for examples in higher dimensions. The simple examples here were chosen because they allow a graphical illustration of the behavior of the method, which is not possible in higher dimensions. Maybe each of these primitive examples can be solved by adding to SuS some specific adhockeries for the example, but in this way the principal problem of avoiding local extrema is not addressed. And this has to be resolved for SuS to be accepted as a meaningful reliability method.
ABSTRACT: The subject addressed in this paper concerns the constantly developing methods of determining the fatigue strength of industrial pipelines. The description of semi-elliptical crack propagation proposed in the paper applies a deterministic model based on a modified Paris’ formula, which became the basis for developing a probabilistic model, being the subject of this publication. Based on a differential equation, a Fokker-Planck parabolic partial differential equation was derived, which describes crack development in a probabilistic sense, at the same time taking into account the deterministic model relationships. A solution of the equation is a two-dimensional normal distribution of crack propagations. A manner of estimating the distribution parameters is also stated. Having a probability distribution with known parameters, and for an assumed probability of exceeding the permissible crack length, the fatigue life of model elements cut out from industrial pipelines was estimated.


1 INTRODUCTION 


Forecasting the propagation of fatigue cracks in 
elements of industrial pipelines is an important 
and still not fully addressed issue, which is of fun- 
damental significance due to severe outcomes of a 
potential damage. The subject approached in this 
paper concerns the constantly developing methods 
of determining the fatigue strength of such objects 
(Mazur 2002, Song et al. 2002, Sniezek & Goss 
2006, Śnieżek et al. 2007, Śnieżek & Stępień 2007). 
Current solutions in this field consider the growth of surface cracks that originate at free edges of the body and propagate along the pipeline surface as well as into the material, through the pipe wall thickness (Boukharouba & Pluvinage 1999, Kim & Hwang 1997, Song et al. 2002, Sniezek & Stepien 2007, Sniezek et al. 2006). The randomness of the load, material properties, geometry, etc., prompts many scientists to apply probabilistic crack propagation models (Ahammed 1997 & 1998, Ahammed & Melchers 1997, Caleyo 2002, Kocanda et al. 1999). However, some of them (Baranowski & Małachowski 2015, Kocanda et al. 1999, Sniezek & Stepien 2007, Sniezek et al. 2006, Tomaszek et al. 2008), when applying a probabilistic description of crack growth dynamics, take into account the relationships of a deterministic model, which physically reflects the aspects of fatigue crack growth. Therefore, the description of semi-elliptical crack propagation suggested in this paper includes a probabilistic model based on a deterministic model published by the authors (Zieja et al. 2017). The physical description of crack growth utilizes a modified Paris’ formula, taking into account two mutually dependent crack directions, i.e., along the minor and major axes of the ellipse. The development of the model was based on experimental tests on overhead industrial pipelines after thirty years of operation (made from 1H18N9T steel), as well as sections of new pipelines, prior to handing them over to operation (made from 1.4541 steel). Samples for fatigue tests were cut out from these pipelines and then subjected to flat (non-zero-pulsating) bending at different values of stress amplitude. An analysis of many variants of the mutual interconnection of cracking velocity in both directions showed a surprising conformity between the quotient of cracking velocities in both directions and the quotient of crack lengths (for different samples and load values) (Zieja et al. 2017). The conducted analysis indicates that for the surface length of the crack it is possible to apply the classic Paris’ formula, whereas for cracking towards the depth a correction factor had to be applied. Next, fatigue life calculations were conducted based on the proposed modified variants of the Paris’ formula; the obtained fatigue lives were similar to, and at the same time “safer” than, the actual fatigue life obtained during fatigue tests. This confirmed the possibility of utilizing the presented variants of the deterministic model. The deterministic model developed in the paper (Zieja et al. 2017) became the basis for developing the probabilistic model, which is the main subject of this article.


2 BASIC ASSUMPTIONS OF THE 
PROBABILISTIC MODEL 


In order to describe fatigue cracking in the general case of a random load, an equation proposed by Prof. Henryk Tomaszek was used (Kocanda et al. 1999, Sniezek & Stepien 2007, Tomaszek et al. 2008):

$$\frac{\partial U(a,c,t)}{\partial t} = -C(a,c)\,U(a,c,t) - \frac{\partial}{\partial a}\bigl[b_1(a,c)\,U(a,c,t)\bigr] - \frac{\partial}{\partial c}\bigl[b_2(a,c)\,U(a,c,t)\bigr] + \frac{\partial^2}{\partial a\,\partial c}\bigl[\mu(a,c)\,U(a,c,t)\bigr] + \frac{1}{2}\frac{\partial^2}{\partial a^2}\bigl[d_1(a,c)\,U(a,c,t)\bigr] + \frac{1}{2}\frac{\partial^2}{\partial c^2}\bigl[d_2(a,c)\,U(a,c,t)\bigr] \tag{1}$$

where $U(a,c,t)$ = density function of the crack length c and depth a at time t; $C(a,c)$ = a factor characterizing the possibility of a catastrophic crack event when the crack depth is a and the length is c; $d_1(a,c)$ and $d_2(a,c)$ = mean squares of the crack increment in the relevant directions, in the adopted unit of time; $b_1(a,c)$ and $b_2(a,c)$ = mean values of the crack increment in the relevant directions, in the adopted unit of time; and $\mu(a,c)$ = correlation moment in the assumed unit of time:

$$\mu(a,c) = r\sqrt{d_1(a,c)\,d_2(a,c)},$$

where r is the correlation factor of crack length and depth.

Further analysis of the equation (1) is a difficult 
issue, since it lacks an analytical solution. However, 
by adopting certain assumptions, it is possible to 
simplify the above equation. The assumptions for 
the probabilistic model have the following form: 


a. there are such lengths of fatigue cracks (in both, 
mutually perpendicular directions) that within a 
certain interval (or for a given number of load 
cycles), the probability of a catastrophic ele- 
ment crack is equal to zero; 

b. in the deterministic approach, the velocities of 
fatigue cracking are often defined by appro- 
priately modified Paris’ formulas (Zieja et al. 
2017); 


c. load cycles with a duration of Δt may not
appear in a continuous manner, but rather randomly,
with an intensity of λ, i.e. λΔt < 1.


The following differential equation describing two-dimensional crack growth in the probabilistic sense may be formed under these assumptions:

$$U_{a,c,t+\Delta t} = (1-\lambda\Delta t)\,U_{a,c,t} + \lambda\Delta t\,U_{a-\Delta a,\,c-\Delta c,\,t} \tag{2}$$

where $U_{a,c,t}$ = the probability that at time t the crack length is c and the depth is a; Δa and Δc = crack increments into and along the surface of the material over a time interval Δt; and λ = load cycle appearance intensity.


3 THE DETERMINATION OF CRACK 
LENGTH DISTRIBUTION PROBABILITY 


From equation (2), after a transition to functional notation, we obtain:

$$U(a,c,t+\Delta t) = (1-\lambda\Delta t)\,U(a,c,t) + \lambda\Delta t\,U(a-\Delta a,\,c-\Delta c,\,t) \tag{3}$$

where $U(a,c,t)$ = crack length density function.

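Expanding the terms of the equation into a Taylor series, retaining terms up to the first derivative for t and up to the second derivatives for a and c, the following was obtained:

$$U(a,c,t+\Delta t) = U(a,c,t) + \frac{\partial U(a,c,t)}{\partial t}\,\Delta t,$$

$$U(a-\Delta a,\,c-\Delta c,\,t) = U(a,c,t) - \frac{\partial U(a,c,t)}{\partial a}\,\Delta a - \frac{\partial U(a,c,t)}{\partial c}\,\Delta c + \frac{1}{2}\frac{\partial^2 U(a,c,t)}{\partial a^2}\,\Delta a^2 + \frac{1}{2}\frac{\partial^2 U(a,c,t)}{\partial c^2}\,\Delta c^2 + \frac{\partial^2 U(a,c,t)}{\partial a\,\partial c}\,\Delta a\,\Delta c \tag{4}$$

Equation (2), after conversions, assumes the following form:

$$\frac{\partial U(a,c,t)}{\partial t} = -\lambda\Delta a\,\frac{\partial U(a,c,t)}{\partial a} - \lambda\Delta c\,\frac{\partial U(a,c,t)}{\partial c} + \frac{1}{2}\lambda\Delta a^2\,\frac{\partial^2 U(a,c,t)}{\partial a^2} + \frac{1}{2}\lambda\Delta c^2\,\frac{\partial^2 U(a,c,t)}{\partial c^2} + \lambda\Delta a\,\Delta c\,\frac{\partial^2 U(a,c,t)}{\partial a\,\partial c} \tag{5}$$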


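The Δa and Δc increments should be determined as functions of time from the relevant relationships of the deterministic model presented in a previous paper (Zieja et al. 2017). For known increments, we can write:

$$\frac{\partial U(a,c,t)}{\partial t} = -\dot b_1(t)\,\frac{\partial U(a,c,t)}{\partial a} - \dot b_2(t)\,\frac{\partial U(a,c,t)}{\partial c} + \frac{1}{2}\dot d_1(t)\,\frac{\partial^2 U(a,c,t)}{\partial a^2} + \frac{1}{2}\dot d_2(t)\,\frac{\partial^2 U(a,c,t)}{\partial c^2} + \dot\mu(t)\,\frac{\partial^2 U(a,c,t)}{\partial a\,\partial c} \tag{6}$$

where $\dot b_1(t) = \lambda\Delta a(t)$; $\dot b_2(t) = \lambda\Delta c(t)$; $\dot d_1(t) = \lambda\Delta a^2(t)$; $\dot d_2(t) = \lambda\Delta c^2(t)$; and $\dot\mu(t) = \lambda\Delta a(t)\,\Delta c(t)$.

Here, a dot over a variable denotes a derivative with respect to time. The solution of equation (6) has the following form:

$$U(a,c,t) = \frac{1}{2\pi\sqrt{d_1(t)\,d_2(t)}\,\sqrt{1-r^2}}\exp\left\{-\frac{1}{2(1-r^2)}\left[\frac{(a-b_1(t))^2}{d_1(t)} - \frac{2r\,(a-b_1(t))(c-b_2(t))}{\sqrt{d_1(t)\,d_2(t)}} + \frac{(c-b_2(t))^2}{d_2(t)}\right]\right\} \tag{7}$$

where $b_1(t) = \int_0^t \lambda\Delta a(w)\,dw$; $b_2(t) = \int_0^t \lambda\Delta c(w)\,dw$; $d_1(t) = \int_0^t \lambda\Delta a^2(w)\,dw$; $d_2(t) = \int_0^t \lambda\Delta c^2(w)\,dw$; $\mu(t) = \int_0^t \lambda\Delta a(w)\,\Delta c(w)\,dw$; and $r = \mu(t)/\sqrt{d_1(t)\,d_2(t)}$.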

4 THE DETERMINATION OF CRACK 
LENGTH DISTRIBUTION PARAMETERS 


Defining the parameters of the presented distributions requires the determination of the increments of length “c” and depth “a” of a crack within one load cycle. For this purpose, we shall utilize the relationships of the previously developed deterministic model (Zieja et al. 2017). Based on the relationships (8), (9), (10) describing the cracking velocity in three variants, and taking into account the solutions of the deterministic models shown below, the determination of the required crack length and depth increments was attempted.

The 1st variant: only the exponent m of the Paris’ formula is common to both cracking directions:

$$V_a = \frac{da}{dN} = C_a\,\Delta K_a^m = C_a\left(M_{ka}\,\Delta\sigma\sqrt{\pi a}\right)^m, \qquad V_c = \frac{dc}{dN} = C_c\,\Delta K_c^m = C_c\left(M_{kc}\,\Delta\sigma\sqrt{\pi c}\right)^m \tag{8}$$

where a = crack depth; c = half the length of a surface crack; $\Delta K_a$ = the range of the stress intensity factor for crack depth a; $\Delta K_c$ = the range of the stress intensity factor for surface length c; N = number of load cycles; Δσ = stress range; $C_a$, $C_c$, m = Paris’ formula factors; $M_{kc}$ and $M_{ka}$ = correction factors, taking into account the finiteness of element dimensions, determined for the crack length c and depth a, respectively.

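The 2nd variant: correction of the stress intensity factor with the c/a quotient, for the velocity in the direction of the crack depth:

$$V_a = \frac{da}{dN} = C_a\,\Delta K_a^m = C_a\left(M_{ka}\,\frac{c}{a}\,\Delta\sigma\sqrt{\pi a}\right)^m, \qquad V_c = \frac{dc}{dN} = C_c\,\Delta K_c^m = C_c\left(M_{kc}\,\Delta\sigma\sqrt{\pi c}\right)^m \tag{9}$$

The 3rd variant: correction of the stress intensity factor with the square of the c/a quotient, for the velocity in the direction of the crack depth:

$$V_a = \frac{da}{dN} = C_a\,\Delta K_a^m = C_a\left(M_{ka}\left(\frac{c}{a}\right)^2\Delta\sigma\sqrt{\pi a}\right)^m, \qquad V_c = \frac{dc}{dN} = C_c\,\Delta K_c^m = C_c\left(M_{kc}\,\Delta\sigma\sqrt{\pi c}\right)^m. \tag{10}$$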

The below notations include the probability 
P, of an event in which a stress range Ao exceeds 
the threshold value and a crack grows. Moreover, 
it was assumed that N = At and additional factors 
were introduced: 


m 


Qa, =C, Mh, Ao we (11) 


a, =C.MpP,AO" 7. 


The determined two-dimensional crack length and depth increments are given below. The solutions are presented in two variants: for the values of the exponent m = 2 and m ≠ 2 (conducting calculations for specific values of the exponent is dictated only by the limitations of the mathematical apparatus and has no physical justification).

For m=2: 

Variant 1: 


Aa = ag e% 
Ace cre M (12) 
Variant 2: 


2 52a,At 
Aa = CoE i 


(13) 


Variant 3: 


4 4Q,At 
Ae i 


Aa = 
4 
[Zaer] (14) 
a, 
Ac = qa ert, 
For m #2: 
Variant 1: 
m 
2-m \2-m 
Aa = @, „At + dy 
m (15) 
pS = 2-m 
Ac = 02" ; 
Variant 2: 
jy 2m \ 
al a At+ ce? ) 
Aa m 
a 2-m 2m | Zm 2+m a, 2+m | ~ 
— aAt+ ce? +a? ——*c,? 
a, 2 a, 


(16) 
Variant 3: 
Aa 
4m 
2-m Zm \ 2-m 
VA Att C, 
2 
243m oer 
a[l 2-m 2-m T m 3m2 g 243m 
| aAt+c,? | +4? —-—*c,? 
2 a. 
Ac 
p 2-m \ oon 
=a, ( —— a At+ e? 
(17) 
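As an illustration of why the cases m = 2 and m ≠ 2 are treated separately, the following Python sketch integrates a growth law of the Variant 1 type, dx/dt = α x^(m/2), in closed form and cross-checks it numerically. It is a sketch under stated assumptions: the function names, the use of continuous time with the load intensity absorbed into `alpha`, and all numerical values are hypothetical and are not taken from the paper.

```python
import math


def crack_size(t, x0, alpha, m):
    """Closed-form solution of dx/dt = alpha * x**(m/2) with x(0) = x0."""
    if math.isclose(m, 2.0):
        # m = 2: linear ODE, exponential growth
        return x0 * math.exp(alpha * t)
    k = (2.0 - m) / 2.0
    base = k * alpha * t + x0 ** k
    if base <= 0.0:
        raise ValueError("t lies beyond the finite blow-up time of the growth law")
    # m != 2: separation of variables gives x**k = x0**k + k*alpha*t
    return base ** (1.0 / k)


def crack_size_numeric(t, x0, alpha, m, steps=20000):
    """Classical RK4 integration of the same ODE, used only as a cross-check."""
    def rate(x):
        return alpha * x ** (m / 2.0)

    h = t / steps
    x = x0
    for _ in range(steps):
        k1 = rate(x)
        k2 = rate(x + 0.5 * h * k1)
        k3 = rate(x + 0.5 * h * k2)
        k4 = rate(x + h * k3)
        x += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return x


if __name__ == "__main__":
    # hypothetical values: initial size 1.0, alpha = 1e-3 per unit time
    for m in (2.0, 3.0):
        print(m, crack_size(500.0, 1.0, 1e-3, m), crack_size_numeric(500.0, 1.0, 1e-3, m))
```

For m > 2 the exponent k is negative, so the closed form has a finite blow-up time; the sketch raises an error instead of returning a meaningless value beyond it.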


Next, the parameters of relevant probability 
distributions were determined. It required the cal- 
culation of integrals present in the formula (7). 
The results obtained for the considered calculation 
variants are presented below. 

For m=2: 

Variant 1: 


b(t) = ae%*; 
b(t) = qe%*; 


2 

d,(t) 2. pages (18) 
Variant 2 

a, 2a@,At 2. 
aS ae A -1)+ a}; 
b(t = ee"; 

Ma (ea 1)+ at 

a| æ, 
d( t)= > 
| ic a Jmf Le o ( eA 1)+ a? 


(19) 


a 


€ 


b(t)= te cf e444 — 1) + af; 


etA 1)+ as _ 


—c5( 
a(t) =% 77 a) po œ 
[eaten 1)+ a3 


€ 


d(t)= 


=Car, 


2 


For m #2: 
Variant 1: 


ai= Bacar a i "; 


5 2+m (21) 
2a, a == 2-m 
d,(t) = Tes Att ay? ; 
2 2- 2-m =m 
( ) : = GAt+ Cy? > 
° 2+m 2 
Variant 2: 


2326 


2+m 2+m 
a, 2-m 2m | om 
= w At+ ce? + 
-Ja 2 ; 
b ( t) ~ E E 
am gy em 
+a,? —-—¢,? 
a, 
| 2-m 2m | In 
b,(t)= a At+ c? : 
= 2+m 
2 4 m2, 2 -m 2-m | 5m 
d,( t) = azm aga a Att c? ; 
2+m 2 
24m 
2æ | 2-m om | 2m 
d,(t) = — aAt+ c? : 
2+m 2 
(22) 
Variant 3: 
24+3m sacs 
al2-m 2m | 2m 
al a At+ c, F] + 
-| @ 2 é 
b,( t) = 3 
m+ 2 243m 
+a, 2 — zo 2 
2 -m 2-m pe 
YODI a At+ c? 
j 2 
2+m 
4 3m-2[ _ 2-m ] tom 
d,( t) = QL3m+2 Ose? a Att e? ; 
2+m 2 
2+m 
2a 2-m om | 2-m 
d,(t)=—= a Att c? 
2+m 2 
(23) 


Determining the variance $d_1(t)$ for any value of the exponent m in the 2nd and 3rd variants of the model required the evaluation of integrals whose analytical calculation is significantly difficult. It was decided to simplify the integrands by omitting the constants, whose value (especially for large t) has little impact on the calculation results. In addition, such a simplification increases the variance, which translates to a more conservative estimate of the fatigue life. As a result of the conducted calculations, expressions (18) to (23) were obtained, which define the expected value and variance of the crack length and depth for the individually considered calculation variants.


5 ESTIMATING THE FACTORS 
OCCURRING IN EXPRESSIONS 
DEFINING DISTRIBUTION 
PARAMETERS 


In order to present the distribution parameters in an explicit form, it is necessary to estimate the values of the coefficients $\alpha_a$, $\alpha_c$, m and the correlation coefficient r. The actual crack growth during operation depends on many factors of a random character, e.g. the course of a random load. Determining an element's fatigue life in the probabilistic approach requires knowledge of the crack growth covering a full probabilistic characteristic of the crack propagation and, therefore, providing information on the random cracking factors. Let it be given in general notation, in the following form:


[CRARC RAR Are läst, )] 
[esta sAGish) (east enol ent.) (24) 


Based on experimental data concerning crack growth (24), the values of the required factors should be estimated. It was decided to determine the value of the coefficient m in the manner commonly adopted in the deterministic approach, from the relationship linking the crack velocity over the material surface $V_c$ with the value of the stress intensity factor. The values of the remaining coefficients were determined by the maximum likelihood method. Because the method is commonly known, its description is omitted. The form of the obtained estimates of the distribution parameter coefficients is presented below.

The correlation coefficient relationship for 
the adopted experimental data has the following 
form: 


[arn -a,)-b F At, ) |x 


lo dee -¢,)-& 3 (Atri = At, )] 
r = A K=0 Abies = At, , (25) 
d; Jd; 
where 
bf ae; 
At, 
t e C ė 
A 


nl (Gear - a,) = b (Atoa E At,)) i 
Abey T At, l 


ke (ler = é) -b; (Ab Zz EAI 


d, = 0; = 
g g Ati Z At, 


M k=0 
The expressions used to estimate the factors $\alpha_a$, $\alpha_c$
have the following form:

For m =2: 

Variant 1: 


(26) 


(27) 


Variant 3: 
4_ 74 
= l a,- Ca. 
h > 


4_ 74 
At, Cn Co Co 


(28) 


For m #2: 
Variant 1: 


eer aes 
“ At,m—-2| =” 


: ý (29) 
aafia 1 
E At, m-2 m-2 m-2 


Variant 2: 


go Aji a 
“At, m-2| "5 


ff St 1 
A m-2| = = 


Variant 3: 


3m+2 3m+2 
= 2 

a, A 

3m+2 3m+2 ? 


1 2 1 1 
Qa, = m-2 m-2 
At, m—2 = = 


G1) 


6 FATIGUE LIFE ESTIMATION METHOD 


Using the developed form of the crack length distribution with known parameters, it is possible to determine the probability R(t) of not exceeding the permissible crack lengths:

$$R(t) = P(a \le a_d,\; c \le c_d,\; t) = \int_{-\infty}^{a_d}\int_{-\infty}^{c_d} U(a,c,t)\,dc\,da. \tag{32}$$


According to the assumptions of the probabilistic model, the permissible crack lengths $a_d$ and $c_d$ should be determined in such a manner that the risk of immediate element destruction is sufficiently low. The mutual interconnection of both cracking directions hinders the calculations associated with the two-dimensional normal distribution. The information about the link between the random variables of the crack length is provided by the value of the correlation coefficient. With the correlation coefficient estimated according to equation (25), the coordinate system should be rotated so that the correlation coefficient of the converted random variables is equal to zero. The coordinates are converted according to the relationship:

$$a' = a\cos\alpha + c\sin\alpha, \qquad c' = -a\sin\alpha + c\cos\alpha, \tag{33}$$

where the angle α is determined from the formula:

$$\tan 2\alpha = \frac{2r\sqrt{d_1 d_2}}{d_1 - d_2}. \tag{34}$$

With such a conversion, the modified crack lengths a' and c' are independent random variables for each fixed t. The distribution parameters should also be converted according to the above rule:

$$b_1'(t) = b_1(t)\cos\alpha + b_2(t)\sin\alpha, \qquad b_2'(t) = -b_1(t)\sin\alpha + b_2(t)\cos\alpha;$$

$$d_1'(t) = \left(\sqrt{d_1(t)}\cos\alpha + \sqrt{d_2(t)}\sin\alpha\right)^2, \qquad d_2'(t) = \left(-\sqrt{d_1(t)}\sin\alpha + \sqrt{d_2(t)}\cos\alpha\right)^2. \tag{35}$$
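The effect of the coordinate rotation can be checked numerically. The sketch below is an illustration only: it assumes a bivariate normal distribution with means b1, b2, variances d1, d2 and correlation r, and it uses the general covariance transformation for a rotation from linear algebra, which is an assumption of this sketch rather than the conversion rule for the variances quoted in the text. The function names and the numerical values are hypothetical.

```python
import math


def decorrelating_angle(d1, d2, r):
    """Rotation angle for which the rotated variables are uncorrelated."""
    mu = r * math.sqrt(d1 * d2)              # covariance of (a, c)
    # atan2 also covers the case d1 == d2, where tan(2*alpha) is unbounded
    return 0.5 * math.atan2(2.0 * mu, d1 - d2)


def rotate_moments(b1, b2, d1, d2, r, alpha):
    """Means, variances and covariance of a' = a cos(alpha) + c sin(alpha),
    c' = -a sin(alpha) + c cos(alpha)."""
    mu = r * math.sqrt(d1 * d2)
    co, si = math.cos(alpha), math.sin(alpha)
    b1_rot = b1 * co + b2 * si
    b2_rot = -b1 * si + b2 * co
    d1_rot = d1 * co**2 + 2.0 * mu * si * co + d2 * si**2
    d2_rot = d1 * si**2 - 2.0 * mu * si * co + d2 * co**2
    mu_rot = (d2 - d1) * si * co + mu * (co**2 - si**2)
    return b1_rot, b2_rot, d1_rot, d2_rot, mu_rot


if __name__ == "__main__":
    # hypothetical moments: depth a and length c with correlation 0.855
    alpha = decorrelating_angle(d1=4.0, d2=1.0, r=0.855)
    print(alpha, rotate_moments(2.0, 5.0, 4.0, 1.0, 0.855, alpha))
```

The last returned value is the covariance of the rotated variables; it vanishes up to rounding for the angle returned by `decorrelating_angle`, while the sum of the two variances is preserved by the rotation.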


In the new coordinate system, the relationship (32) shall assume the following form:

$$R(t) = P(a' \le a_d',\, t)\; P(c' \le c_d',\, t) \tag{36}$$

where $P(a' \le a_d', t)$ and $P(c' \le c_d', t)$ are determined from the one-dimensional normal distribution. Using normal distribution tables requires additional standardization of the random
variables. Assuming that R(t) < Ry, where: R, is the
minimum permissible probability of not exceeding 
the permissible crack lengths a, and c,, for which 
an element’s fatigue life may be determined. At the 
same time, probability / — R, will be the probability 
of exceeding permissible operational durations. 
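The way a fatigue life follows from a reliability threshold can be sketched for a single crack dimension. The example below is a simplified, one-dimensional illustration under stated assumptions: the mean and variance functions, the threshold value and all names (`reliability`, `fatigue_life`, `r_min`) are hypothetical, and the reliability is assumed to decrease monotonically on the search interval.

```python
import math
from statistics import NormalDist


def reliability(t, x_perm, mean_fn, var_fn):
    """P(x <= x_perm) at time t for a normally distributed crack size."""
    z = (x_perm - mean_fn(t)) / math.sqrt(var_fn(t))   # standardization
    return NormalDist().cdf(z)


def fatigue_life(x_perm, mean_fn, var_fn, r_min, t_hi, tol=1e-6):
    """Largest t in (0, t_hi] with reliability >= r_min, found by bisection."""
    if reliability(t_hi, x_perm, mean_fn, var_fn) >= r_min:
        return t_hi
    lo, hi = 0.0, t_hi          # t = 0 itself is never evaluated (zero variance)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if reliability(mid, x_perm, mean_fn, var_fn) >= r_min:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # hypothetical model: exponential mean growth, variance linear in time
    mean_fn = lambda t: 1.0 * math.exp(1e-3 * t)
    var_fn = lambda t: 1e-4 * t
    print(fatigue_life(3.0, mean_fn, var_fn, r_min=0.9, t_hi=2000.0))
```

In the two-dimensional case the same search would be applied to the product of the two one-dimensional probabilities in the rotated coordinates.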


7 CALCULATION EXAMPLE 


The example shall utilize the data from experimen- 
tal tests presented in papers (Sniezek & Stepien 
2007, Zieja et al. 2017). The sample designations 
assumed in the aforementioned elaborations were 
kept. As per the previously presented assumptions 
for three variants of the model, calculations of fac- 
tors present in expressions for crack length distri- 
bution parameters were made, and their results are 
shown in Table 1 below. 

Having the values of calculated coefficient (as 
per the previously presented procedure), the esti- 
mation of the fatigue life for individual samples was 
commenced. It was decided to present the obtained 
results in groups concerning a given material and a 
given loading manner. For all samples, a minimum 
probability of not exceeding the permissible crack 
parameters R, = 0,9 was assumed. 

Group I: samples cut out from a new pipeline, prior to putting it into operation, made from 1.4541 steel (acc. to EN 10088-3), subject to flat non-zero-pulsating bending with a frequency of [...] with subsequent stress nominal amplitude values given in Table 2. Permissible crack lengths in individual directions: [...].

Group II: samples cut out from a pipeline after 30 years of operation, made from 1H18N9T steel (acc. to PN-71/H-86020), subject to flat non-zero-pulsating bending with a frequency of 20 Hz (cycle asymmetry coefficient R = 0) with subsequent stress nominal amplitude values given in Table 4. Permissible crack lengths in individual directions: $a_d$ = 3 mm; $c_d$ = 9 mm.

The obtained fatigue life calculation results, based on the three developed model variants, are similar to the actual life obtained during fatigue tests of bent samples. At the same time, in the second and third calculation variants, “safe” fatigue lives were achieved in comparison to the actual life, however less conservative from the point of view of preventive forecasting of fatigue life than the fatigue life results for the first variant. This confirms the earlier remarks regarding the possibility of applying the probabilistic model to estimate fatigue life.


Table 1. Calculated values of distribution parameters’ factors. 
Variant 1 Variant 2 Variant 3 
Sample r m α_a α_c α_a α_c α_a α_c 
1 0.855 5.8417 182x107 4.51x10°% 7.16x10° 451x108 1.51x10°? 4.51 x 10° 
2 0.888 5.8358 2.48x107 6.16x10% 9.82x10 6.16x10% 2.0810? 6.16x 10% 
3 0.853 7.8811 2.51x107 2.84x10° 1.54x 10°? 2.84x10° 3.76x10 2.84x 10° 
4 0.819 5.3271 696x107 212x107 44x10° 212x107 1.06x10"' 2.12107 
5 0.821 5.3677 1.73x10% 5.17x107 1.08x10% 5.17x107 3.86107 5.17107 
6 0.852 6.1564 149x106 3.27x107 44x10° 3.27x107 6.6410? 3.27x107 
7i 0.868 4.8690 6.13x10° 1.76x10° 48x10° 1.76x10° 2.94x10'° 1.76x 10° 
8 0.878 5.0368 2.72«10° 7.21x 107 118x108 7.21«107 925x10™ 1.76 x 10% 
9.18 0.877 5.0133 1.36x10° 3.51x107 8.82x10° 3.51x107 457x10" 3.51 «107 


Table 2. Values of nominal stress amplitudes for 1.4541 steel samples.

Sample designation    Stress range Δσ [MPa]
Sample 1              170
Sample 2              180
Sample 3              200
Sample 4              200
Sample 5              250
Sample 6              250


Table 3. Fatigue life calculation results acc. to the developed probabilistic model.

                      Actual life    Projected life [cycles]
Sample designation    [cycles]       Variant 1    Variant 2    Variant 3
Sample 1              3 689 125      3 543 791    3 657 527    3 657 84?
Sample 2              2 705 250      2 477 217    2 573 436    2 673 900
Sample 3              2 217 000      2 186 162    2 212 251    2 212 263
Sample 4              1 025 375        974 936    1 013 023    1 013 197
Sample 5                411 875        354 515      402 535      402 742
Sample 6                433 625        415 798      430 378      430 408


Table 4. Values of nominal stress amplitudes for 1H18N9T steel samples.

Sample designation    Stress range Δσ [MPa]
Sample 7              300
Sample 8              250
Sample 9.18           200


Table 5. Fatigue life calculation results acc. to the developed probabilistic model.

                      Actual life    Projected life [cycles]
Sample designation    [cycles]       Variant 1    Variant 2    Variant 3
Sample 7                152 000        148 731      150 088      150 162
Sample 8                338 400        329 867      333 899      333 934
Sample 9.18             712 670        698 102      704 266      704 319


8 CONCLUSIONS 


The probabilistic description of semi-elliptical crack propagation proposed in the paper utilizes a previously developed deterministic model (Zieja et al. 2017), based on the modified Paris’ formula. On the basis of the differential equation, the Fokker-Planck parabolic partial differential equation, which describes crack development in a probabilistic sense, was derived, at the same time taking into account the deterministic model relationships. A solution of the equation is a two-dimensional normal distribution of crack propagations. A manner of estimating the distribution parameters is also stated. Having the probability distribution with known parameters, and for an assumed risk of exceeding the permissible crack length, the fatigue life of model elements cut out from industrial pipelines was estimated.

The conducted analyses took into account the interconnection of cracking directions, which is the main advantage of the two-dimensional model. The forecast fatigue life depends on two permissible crack lengths. Taking both critical conditions into account makes it possible to ensure a higher safety level than with a single condition.

Experimental tests regarding the fatigue life of model elements with a propagating semi-elliptical crack provided the calculation model with the necessary data describing the development character of the considered two-dimensional cracks. The fatigue lives of samples forecast by the model and those recorded during the experimental tests show satisfactory agreement. However, the fatigue lives forecast on the basis of the developed mathematical models are lower than the experimental ones, hence “safe”, and, in addition, less conservative than the results obtained on the basis of the deterministic model and presented in the paper (Zieja et al. 2017).
Environmental contours for design of ice-capable vessels

ABSTRACT: The environmental contour method is an established method for design of ships and marine structures. In this method, environmental contours for a given return period are developed in order to identify the environmental conditions which are associated with the most critical structural responses, and then the long-term extreme response for the same return period can be effectively estimated based on the response statistics of the most critical (short-term) responses. In this work, the environmental contour method is described for design of ice-capable vessels in Arctic regions. In particular, for vessels sailing in Arctic regions, the most dangerous condition is associated with ice ridges. Based on the probability distributions of key parameters of the first-year ice ridges which determine the ice ridge loads, the environmental contours are developed. The effects of the correlation between different environmental variables are discussed. A simple case is proposed to illustrate application of the environmental contour method for design of ice-capable vessels.


1 INTRODUCTION 


For ships and marine structures subjected to envi- 
ronmental loads, such as wind, wave and ice forces, 
etc., evaluation of the extreme response over their 
specific lifetime is necessary and important at the 
design stage. The full long-term analysis which 
accounts for the structural response from each 
short-term environmental condition and the occur- 
rence rate of each short-term condition is recog- 
nized as the most reliable and accurate approach. 
However, such long-term analysis is time con- 
suming, especially for large and complex systems 
(Leira, 2008). 

In order to improve the computational efficiency, the long-term analysis can be simplified either by reducing the computational cost of the short-term analysis (Chai et al., 2016) or by developing approximate methods that require fewer short-term analyses, such as the IFORM (inverse first order reliability method) (Giske et al., 2017) and the environmental contour method (Haver and Winterstein, 2009).

The environmental contour method offers a simplified and fast way to estimate the long-term extreme response and has been widely used for design of ships and marine structures at the early design stage (Li et al., 2016, DNV, 2010, Baarholm et al., 2010). An environmental contour is the locus of all combinations of environmental parameters corresponding to a given annual exceedance probability, along which extreme responses with the corresponding return period should lie. Traditionally, establishment of such an environmental contour for a given return period is based on the IFORM and the joint distribution of the environmental parameters.

The main advantage for the environmental 
contour method lies in the fact that the structural 
response is uncoupled from the description of 
the environmental variables. As a result, the long- 
term extreme response for a given return period 
can be approximately estimated on the basis of 
response statistics associated with the most criti- 
cal short-term environmental conditions along 
the environmental contour, which corresponds to 
a given return period. Therefore, in this method, 
short-term analyses are performed only for a few 
conditions on the environmental contour and then 
numerical simulations or experiments are per- 
formed in order to identify the most critical short- 
term response for further application (Winterstein 
et al., 1993). 

In this work, the concept of environmental 
contour is applied for design of ice-capable ves- 
sels operating in Arctic regions. For ships sailing 
in sea ice areas without icebergs, ice ridges are 
assumed to pose a major threat to the vessel since 
they determine and govern the design loads for the 
vessel (Høyland, 2014). The key parameters of the ice
ridge that determine the main loads in the ship-ice
ridge interaction process are identified first.
Then, probabilistic models are applied to describe 
the distributions of these key parameters in order 
to develop the environmental contours. 

The procedure of extreme response prediction
based on the environmental contour method is
briefly described in this work. Finally, a simple
case is given to illustrate the development of an
environmental contour for the design of
ice-capable vessels.


2 ENVIRONMENTAL CONTOUR METHOD


In this section, the construction of the environmental
contour for a given return period and the application
of the environmental contour method are presented.
Assume that for the N-year return period, the
corresponding failure probability of the structural
response is $P_F$. In the traditional IFORM, to
generate the environmental contour with an N-year
return period, an n-dimensional hypersphere (n is the
dimensionality of the environmental parameter vector)
with radius $\beta_F$ is first created in the U space,
with the value of $\beta_F$ given by (Haver and
Winterstein, 2009):

$\beta_F = \Phi^{-1}(1 - P_F)$ (1)

where $\Phi$ represents the cumulative distribution
function of the standard normal distribution.

Then, the n-dimensional hypersphere is trans- 
formed into the physical parameter space by 
consideration of the joint distribution of the envi- 
ronmental parameters in order to get the environ- 
mental contour. Figure 1 presents an example of 
an environmental contour in the physical param- 
eter space. 

After establishment of the environmental con- 
tour with the N-year return period, a limited set 
of design conditions, i.e. short-term environmental 
conditions, are selected on the environmental con- 
tour (e.g. see Figure 1). Time consuming numerical 
calculations (or expensive experiments) for struc- 
tural responses are only needed for these selected 
environmental conditions. The most critical short- 
term response is identified in order to estimate the 
long-term extreme response with the N-year return 
period. The main process of applying the
environmental contour method for long-term extreme
response approximation is presented in Figure 2.

Figure 1. Environmental contour in the physical
parameter space and selected environmental design
conditions.
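The steps described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two environmental parameters are taken as independent and Weibull-distributed, so the transformation to the physical space reduces to the marginal inverse CDFs, and `short_term_response` is a hypothetical placeholder for an expensive simulation or experiment. None of the numbers below come from this work.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def weibull_quantile(p, scale, shape):
    """Inverse CDF of a two-parameter Weibull distribution."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)

def contour_points(p_f, marginals, n_points=36):
    """Map a circle of radius beta_F in U space to the physical space.

    Assumes independent parameters, so each u_i is mapped through its
    own marginal inverse CDF.
    """
    beta_f = STD_NORMAL.inv_cdf(1.0 - p_f)  # equation (1)
    points = []
    for k in range(n_points):
        angle = 2.0 * math.pi * k / n_points
        u = (beta_f * math.cos(angle), beta_f * math.sin(angle))
        s = tuple(weibull_quantile(STD_NORMAL.cdf(ui), scale, shape)
                  for ui, (scale, shape) in zip(u, marginals))
        points.append(s)
    return beta_f, points

def short_term_response(s1, s2):
    """Hypothetical stand-in for a costly short-term analysis."""
    return s1 * math.sqrt(s2)

# Hypothetical marginals (scale, shape) and failure probability.
MARGINALS = [(1.0, 2.0), (3.0, 1.5)]
beta_f, contour = contour_points(1.0e-4, MARGINALS)

# Only a few design conditions on the contour are evaluated.
design_conditions = contour[::4]
critical = max(design_conditions, key=lambda s: short_term_response(*s))
extreme_response = short_term_response(*critical)
```

The design choice worth noting is that the response model is called only for the selected design conditions, never for the full joint distribution.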


3 FIRST-YEAR ICE RIDGE 


A ridge is a line or wall of broken ice forced up
by pressure or shear. When first formed, an ice ridge
is simply a pile of unconsolidated ice blocks. These
blocks may then become partly consolidated by
refreezing. Figure 3 illustrates a typical ice ridge,
which consists of two parts: the sail and the keel.
The sail is above the water and has pores filled with
air and snow. The keel is the underwater part and can
be further separated into an upper, completely frozen
layer called the consolidated layer, which is always
thicker than the surrounding level ice, and a lower
unconsolidated part of loose blocks, partially
refrozen together, with water trapped between them
(ISO, 2010).

Generally, ice ridges are subdivided into first-year
and older ice ridges. During its first winter and
summer, an ice ridge is called a first-year ice ridge
(e.g. Figure 3). Consolidation of the keel progresses
with time, and the keel is close to fully consolidated
if the ridge has survived one summer's melt. Ridges
that survive one or more summers are referred to as
old ice ridges.

In this work, first-year ice ridges are considered
for the design of ice-capable vessels. First, the ice
along the commercial Arctic shipping routes, such as
the Northern Sea Route, is mostly first-year ice, and
little ice is present in the summer season. Second,
fewer studies have been made of old ice ridges than
of first-year ice ridges, and data on the mechanical
and physical properties of multi-year ice ridges are
very limited.


[Flowchart: Establish the joint distribution of the
environmental parameters; create a sphere with radius
$\beta_F$ in the U space and transform it into the
physical parameter space to derive the environmental
contour for a given return period; select a few design
conditions on the environmental contour for engineering
calculations or experiments; find the most critical
(short-term) structural response and estimate the
long-term extreme response.]

Figure 2. Flowchart of the environmental contour
method used to approximate the long-term extreme
response.

Figure 3. First-year ice ridge with some key
parameters: sail draft $h_s$, sail width $w_s$,
consolidated layer thickness $h_c$, surrounding level
ice thickness $h_i$, keel draft $h_k$ and keel width
$w_k$.


Figure 4. Illustration of a ship interacting with a
typical first-year ice ridge.


For the scenario of an ice-capable ship interacting
with a first-year ice ridge, the ship structure can be
simplified as a downward-sloping structure. The effect
of the ridge sail can be neglected since the volume of
the sail is small compared with that of the keel
(ISO, 2010). Earlier studies of ridge failure against
the Confederation Bridge, whose piers are designed as
sloping ice-breaking cones, have shown that failure of
the consolidated layer is the dominant term in
determining the keel loads arising from ridge
interaction with a sloping structure. They also found
no correlation between the keel depth $h_k$ and the
keel loads. Based on these studies of first-year ice
ridges interacting with sloping structures, the
ship-ice ridge interaction process is illustrated in
Figure 4. Loose rubble from the unconsolidated layer
is assumed to be cleared away by the local water
current. The action of the consolidated layer can be
approximated as that of level ice of equal thickness
interacting with the vessel, and the mechanical
properties of the consolidated layer are assumed to be
close to those of level ice. The ship-ridge
interaction process is therefore further simplified as
a ship-level ice interaction event for the purpose of
preliminary design.

The ship-level ice interaction process is initiated
by localized crushing of the ice edge. As the ship
advances and penetrates the ice, both the contact area
between the ship and the ice sheet and the crushing
force increase. The ice sheet eventually deflects, and
the bending stresses cause a flexural failure at a
certain distance from the crushing region (Jordaan,
2001). The ice thickness, the crushing strength and
the flexural strength of the (equivalent) level ice
are therefore considered the key parameters for
determining the loads due to the ship-ice ridge
interaction process at the early design stage.


4 PROBABILISTIC MODELS 


In this section, probabilistic models are applied to
describe the distributions of the above-mentioned key
parameters, which determine the loads due to ship-ice
ridge interaction. These key parameters are believed
to depend on geographical location, season and other
factors. In this work, first-year ice ridges in the
Barents Sea are considered, since many field
experiments have been performed in this area by
Norwegian and Russian research institutes.

The average thickness of the consolidated layer, 
based on measurements which are collected by 
mechanical drilling, can be described by a Gamma 
distribution (Strub-Klein and Sudom, 2012): 


$f(h) = \dfrac{h^{k-1} e^{-h/\theta}}{\Gamma(k)\,\theta^{k}}$ (2)

where $k$ and $\theta$ are the shape and scale
parameters of the Gamma distribution, respectively.
Based on the experimental data, these two values are
determined to be 2.97 and 0.54.
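As a quick numerical check, the Gamma density can be evaluated with the reported parameters. The sketch below assumes the standard shape-scale form of the Gamma density and that the thickness is expressed in metres; the integration range and step are arbitrary choices. The product of shape and scale gives a mean of about 1.60 m, close to the mean value of 1.59 m used later in this work.

```python
import math

K, THETA = 2.97, 0.54  # shape and scale reported in the text

def gamma_pdf(h, k=K, theta=THETA):
    """Gamma probability density in shape-scale form; zero for h <= 0."""
    if h <= 0.0:
        return 0.0
    return h ** (k - 1.0) * math.exp(-h / theta) / (math.gamma(k) * theta ** k)

mean_thickness = K * THETA            # mean of a Gamma distribution
std_thickness = math.sqrt(K) * THETA  # standard deviation

# Simple numerical check that the density integrates to about one.
step = 0.001
area = sum(gamma_pdf(i * step) for i in range(1, 15000)) * step
```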

The fitted and sample probability density functions
of the consolidated layer thickness are plotted in
Figure 5. However, sample data for the flexural
strength $\sigma_f$ and the crushing strength
$\sigma_c$ of the consolidated layer are very limited.
The mechanical properties of the surrounding level sea
ice can serve as effective alternatives (Timco et al.,
2000). On the basis of the experimental data for the
flexural strength of level ice in the Barents Sea, the
marginal PDF (probability density function) of the
flexural strength is described by a two-parameter
Weibull distribution (Krupina and Kubyshkin, 2007):

$f(\sigma_f) = \dfrac{\beta}{\alpha}\left(\dfrac{\sigma_f}{\alpha}\right)^{\beta-1}\exp\left[-\left(\dfrac{\sigma_f}{\alpha}\right)^{\beta}\right]$ (3)

where the scale parameter $\alpha = 0.274$ and the
shape parameter $\beta = 3.167$ are obtained from
actual data. The marginal PDF and the empirical
histogram of the flexural strength are presented in
Figure 6.

Figure 5. Probability distribution for the
consolidated layer thickness.

Figure 6. Probability distribution for the flexural
strength $\sigma_f$.

Figure 7. Probability distribution for the crushing
strength $\sigma_c$.
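The Weibull model for the flexural strength can be explored with a short sketch. It assumes the standard two-parameter Weibull form with the reported scale and shape; the function names are illustrative only, and the unit of the strength is whatever unit the scale parameter was fitted in.

```python
import math

ALPHA, BETA = 0.274, 3.167  # scale and shape reported in the text

def weibull_pdf(x, scale=ALPHA, shape=BETA):
    """Two-parameter Weibull density; zero for x <= 0."""
    if x <= 0.0:
        return 0.0
    z = x / scale
    return (shape / scale) * z ** (shape - 1.0) * math.exp(-z ** shape)

def weibull_cdf(x, scale=ALPHA, shape=BETA):
    """Two-parameter Weibull cumulative distribution function."""
    return 1.0 - math.exp(-(max(x, 0.0) / scale) ** shape)

def weibull_quantile(p, scale=ALPHA, shape=BETA):
    """Inverse CDF, needed when mapping from the U space."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)

mean_strength = ALPHA * math.gamma(1.0 + 1.0 / BETA)
median_strength = weibull_quantile(0.5)
```

The closed-form quantile is what makes the Weibull marginal convenient when a contour in the U space has to be mapped to the physical space.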

Full-scale measurements have been performed to
collect data on the crushing (or compressive)
strength of level ice in the Barents Sea. The
distribution of the crushing strength for vertically
loaded samples can be described by the Gamma
distribution given in equation (2) with a shape
parameter of 2.5 and a scale parameter of 1.40
(Strub-Klein, 2017). The marginal PDF of the crushing
strength and the corresponding empirical histogram
based on experimental data are shown in Figure 7.


5 ENVIRONMENTAL CONTOUR


Assume that the desired ice-capable ship mainly 
sails and operates in the Barents Sea with an annual 
voyage length of 5000 km in ice ridge fields. The 
ice ridge density is 2/km along the route. Therefore, 
for a 50-year return period, the failure probability
$P_F$ is determined as:

$P_F = 1/(50 \cdot 5000 \cdot 2)$ (4)

Correspondingly, the radius of the sphere in the
U space, $\beta_F$, is equal to 4.611 according to
equation (1).
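The arithmetic behind these two numbers can be reproduced directly. The sketch below uses only the values stated above and the inverse standard normal CDF from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

RETURN_PERIOD_YEARS = 50
ANNUAL_DISTANCE_KM = 5000
RIDGES_PER_KM = 2

# Number of ridge encounters during the return period.
ridge_encounters = RETURN_PERIOD_YEARS * ANNUAL_DISTANCE_KM * RIDGES_PER_KM

p_f = 1.0 / ridge_encounters              # equation (4)
beta_f = NormalDist().inv_cdf(1.0 - p_f)  # equation (1)
```

With 500 000 encounters the failure probability per encounter is 2 x 10^-6, and the corresponding radius agrees with the value of 4.611 quoted above.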

Transformation of the sphere in the U space 
into the environmental contour in the physical 
parameter space is generally based on the inverse 
Rosenblatt transformation (DNV, 2010). However, 
in order to perform such a transformation, the 
joint distribution of the environmental parameters 
as described by the conditional modeling approach 
is required. This necessitates a great amount of 
sampled data. Due to the limitation of experi- 
mental data for the first-year ice ridge statistics, 
only the marginal PDFs of the abovementioned 
key parameters can be obtained. Nevertheless, the 
joint distribution of the environmental parameters 
with consideration of the correlations between 
these environmental variables can be approximated 
by the Nataf distribution model (Silva-Gonzalez 
et al., 2013). Then, the environmental contours can 
be obtained by the Nataf transformation. 

Let the variables $S_1$, $S_2$ and $S_3$ represent the
consolidated layer thickness, the flexural strength
and the crushing strength, respectively. The symbols
$\rho_{12}$, $\rho_{13}$ and $\rho_{23}$ denote the
correlation coefficients between these variables. The
Nataf transformation model is given as:

$F_{S_1}(s_1) = \Phi(u_1)$ (5)

$F_{S_2}(s_2) = \Phi\left(u_2\sqrt{1-\rho_{12}'^{2}} + \rho_{12}' u_1\right)$ (6)

$F_{S_3}(s_3) = \Phi\left(u_3\sqrt{1-\rho_{13}'^{2}-\dfrac{(\rho_{23}'-\rho_{12}'\rho_{13}')^{2}}{1-\rho_{12}'^{2}}} + u_2\dfrac{\rho_{23}'-\rho_{12}'\rho_{13}'}{\sqrt{1-\rho_{12}'^{2}}} + \rho_{13}' u_1\right)$ (7)

where $U_1$, $U_2$ and $U_3$ represent independent
standard normal variables and their values can be
obtained based on the following equation:

$\sum_{i=1}^{3} u_i^{2} = \beta_F^{2}$ (8)

The coefficients $\rho_{ij}'$ ($i, j = 1, 2, 3$;
$i \neq j$) are the corresponding (equivalent)
correlation coefficients used in the Nataf
transformation and their relationship with $\rho_{ij}$
can be approximated by a semi-empirical equation,
which is given as:

$\rho_{ij}' = \xi\,\rho_{ij}$ (9)

where a description of how to determine the function
$\xi$ can be found in Liu and Der Kiureghian (1986).

Few previous studies exist on the correlation between
these three key parameters for sea ice. For
simplicity, we assume that the correlation
coefficients $\rho_{ij}$ ($i, j = 1, 2, 3$;
$i \neq j$) are all equal to 0.5. The corresponding
50-year contour surface for the three key parameters
is obtained by the Nataf model described by equations
(5)-(9) and is plotted in Figure 8.
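A minimal sketch of how such a contour surface could be generated is given below. It is not the implementation used in this work. It assumes that the equivalent correlation coefficients equal the assumed value of 0.5 (i.e. a correction factor of one in equation (9)), uses a coarse spherical grid, and inverts the Gamma CDF by bisection on a series expansion; all function names are illustrative.

```python
import math
from statistics import NormalDist

PHI = NormalDist()
BETA_F = 4.611  # sphere radius in the U space
RHO = 0.5       # assumed equivalent correlation (correction factor of one)

def gamma_cdf(x, k, theta):
    """Regularized lower incomplete gamma function by series expansion."""
    if x <= 0.0:
        return 0.0
    y = x / theta
    term = total = 1.0 / k
    for n in range(1, 400):
        term *= y / (k + n)
        total += term
        if term < 1e-16 * total:
            break
    return total * math.exp(-y + k * math.log(y) - math.lgamma(k))

def gamma_quantile(p, k, theta):
    """Invert gamma_cdf by bisection."""
    lo, hi = 0.0, theta * (k + 40.0)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, k, theta) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def weibull_quantile(p, scale, shape):
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)

def cholesky_3x3(r12, r13, r23):
    """Lower-triangular factor of the 3x3 correlation matrix."""
    l22 = math.sqrt(1.0 - r12 ** 2)
    l32 = (r23 - r12 * r13) / l22
    l33 = math.sqrt(1.0 - r13 ** 2 - l32 ** 2)
    return [[1.0, 0.0, 0.0], [r12, l22, 0.0], [r13, l32, l33]]

def nataf_point(u, chol):
    """Map a point u on the sphere to (thickness, flexural, crushing)."""
    z = [sum(chol[i][j] * u[j] for j in range(3)) for i in range(3)]
    p = [PHI.cdf(zi) for zi in z]
    return (gamma_quantile(p[0], 2.97, 0.54),      # consolidated layer thickness
            weibull_quantile(p[1], 0.274, 3.167),  # flexural strength
            gamma_quantile(p[2], 2.5, 1.40))       # crushing strength

CHOL = cholesky_3x3(RHO, RHO, RHO)
surface = []
for i in range(1, 12):       # polar angle, poles excluded
    polar = math.pi * i / 12
    for j in range(12):      # azimuth
        azimuth = 2.0 * math.pi * j / 12
        u = (BETA_F * math.sin(polar) * math.cos(azimuth),
             BETA_F * math.sin(polar) * math.sin(azimuth),
             BETA_F * math.cos(polar))
        surface.append(nataf_point(u, CHOL))
```

The rows of the triangular factor reproduce the coefficients of the independent normal variables in the Nataf transformation, so changing `RHO` is enough to repeat a correlation study on the same grid.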

The correlation coefficients in this simple case
are chosen somewhat arbitrarily because of the limited
amount of information in the literature. The influence
of the correlation coefficient on the shape of the
environmental contour is therefore studied
parametrically. For simplicity, a two-dimensional
contour line on the contour surface is selected for
this sensitivity study. The consolidated layer
thickness is fixed at its mean value of 1.59 m, and
the corresponding two-dimensional contour lines for
different values of the correlation coefficient
between the flexural strength and the crushing
strength, $\rho_{23}$, are plotted in Figure 9. In
this sensitivity study, $\rho_{12}$ and $\rho_{13}$
are kept fixed at 0.5.

Figure 9 shows that the correlation coefficient has
an important effect on the shape of the contour lines.
The maximum values of the flexural strength and the
crushing strength on the contour line do not change
with the correlation coefficient. However, the
critical region with large values of both the flexural
strength and the crushing strength becomes narrower as
the correlation coefficient increases. A narrower
critical region implies that large values of the
flexural strength are more likely to be accompanied by
large values of the crushing strength, which would
cause a severe structural response. The correlation
coefficients therefore have a significant influence on
the extreme response of an ice-capable vessel sailing
in an ice ridge field.

Figure 8. The 50-year contour surface for the
consolidated layer thickness, flexural strength and
crushing strength.

Figure 9. Influence of the correlation coefficient on
the two-dimensional contour lines.


6 CONCLUSIONS 


In this work, the environmental contour method
was proposed for the design of ice-capable vessels
sailing in areas with ice ridges. Environmental
contours for the key parameters of the first-year
ice ridge have been established by applying the
Nataf model.

Based on the simple case above, it is found that the
correlation coefficients between the environmental
parameters have a very important influence on the
shape of the environmental contour as well as on the
extreme response of the vessel. A reliable numerical
model is therefore important in order to investigate
this influence.

Furthermore, better knowledge of the correlations
between these key parameters, provided either by
experiments or by numerical simulations, will promote
the development of reliability-based design of
ice-capable vessels in Arctic regions.

Partial factors for fatigue loads in the Eurocode system for road bridge
design

ABSTRACT: In the Eurocode system, for fatigue design of bridges, the recommended partial factor for
fatigue traffic loads is set to 1. In this paper, the adequacy of this approach is investigated by performing 
a reliability analysis on two types of welded joint in a main girder of a steel motorway bridge. For this 
purpose, a weigh in motion measurement dataset belonging to a main Dutch motorway has been com- 
pared with the fatigue load model 4 of Eurocode EN 1991-2 with respect to the stress spectrum and the 
fatigue damage of two structural steel details. Several structural schemes have been considered to study 
the effect of the shape and length of the influence line. The distributions of the stochastic variables such 
as dynamic amplification, accuracy of the structural model, and future traffic trends have been estimated 
or taken from literature. Partial factors for fatigue loads have then been calibrated in such a way that the 
target reliability is obtained. The influence of each stochastic variable on partial factors has been studied 
by derivation of the sensitivity factors. The results show that a considerably higher fatigue partial factor
is required for fatigue loads on road bridges than the value of 1 currently recommended in EN 1991-2.


1 INTRODUCTION


Most of the parameters governing the load effects
on a structure, as well as the structure's response,
are uncertain in nature. To avoid the complexity
of dealing with these random variables, standards
such as the Eurocode provide a deterministic model
for structural design, wherein characteristic values
are provided for the load and the resistance and
these values are corrected by partial factors to
arrive at the design values, the latter giving the
desired reliability level. In other words, partial
factors are introduced to link the deterministic
models used for practical design to the required
reliability level. The general equation for the design
of a structural component according to the Eurocode
standard (EN1990, 2002) is:

$\dfrac{R_k}{\gamma_M} - \gamma_F \times E_k \geq 0 \;\Rightarrow\; R_d \geq E_d$ (1)

where $R_k$ and $R_d$ are the characteristic and design
values of the material resistance respectively, $E_k$
and $E_d$ are the characteristic and design values of
the load effect, $\gamma_M$ is the partial factor for
the resistance and $\gamma_F$ is the partial factor for
the load effect. In the Eurocode system for fatigue
design of bridges (EN1991-2, 2003) (EN1993-1-9, 2005),
a partial factor larger than 1 is recommended only on
the resistance side of the limit state ($\gamma_{Mf}$)
and the partial factor on the load side ($\gamma_{Ff}$)
is recommended as 1. The factor $\gamma_{Mf}$ depends on
the choice of fatigue assessment method as well as the
consequences of failure. For the safe-life fatigue
assessment method and a structure with high
consequence of failure, $\gamma_{Mf}$ is recommended as
1.35. In this paper, the adequacy of these partial
factors is investigated with a probabilistic approach
in which all influential random variables are modelled
as close to reality as possible. For this purpose,
several structural schemes are defined and, for each
of them, two structural details are designed to meet
the Eurocode's safety requirements. Subsequently,
the reliability of each designed case at the end of
its service life (100 years) is evaluated by the
probabilistic approach and compared with the safety
requirement of the Eurocode (EN1990, 2002).
Partial factors are calibrated in such a way that the
deterministic design with partial factors arrives at
the target reliability level.


2 METHODS 


To consider the effects of the shape and length of
the influence lines on the outcome of the reliability
analysis, four different structural schemes (Figure 1)
with three different span lengths (L = 5 m, 20 m
and 100 m) are considered. These are the same
structural schemes that were used for the calibration
of the fatigue load models in the Eurocode.
Furthermore, two common structural details in bridges,
a transverse butt welded joint and a cover plate
(Figure 2), are chosen to study the effect of
different S-N curves on the partial factors.

Figure 1. Structural schemes used in the analysis.

Figure 2. Left: Transverse butt welded joint (Category
80), Right: Cover plate (Category 50).


2.1 Deterministic approach 


The first step in the reliability analysis is to design
a structure for fatigue according to the Eurocode
system. For this purpose, the fatigue load model 4
(FLM4) for road bridges is applied to the selected
structural scheme (influence line), and the
characteristic stress spectrum for a fictitious cross
section (section modulus) is obtained using the
rainflow cycle counting method. FLM4 is selected
because it is the most accurate fatigue load model in
the Eurocodes. It consists of a set of five heavy
vehicles, each having a certain axle configuration,
load distribution and fraction of the total traffic
volume. These are moved across the influence line,
resulting in a stress history. A rainflow counting
procedure subsequently provides the characteristic
stress spectrum, which is multiplied by a partial
factor $\gamma_{Ff}$ to arrive at the design stress
spectrum. The design stress spectrum is to be compared
with the fatigue resistance. However, because the
recommended value of $\gamma_{Ff}$ is 1, the
characteristic and design spectra are the same. The
cumulative stress spectrum obtained for structural
scheme 3 with a span length of 100 m is shown as a
solid black curve in Figure 3.

The fatigue resistance refers to the capability of a
specific structural detail to withstand repetitive
loads. It can be studied in laboratory tests and
presented by the so-called S-N curve. The number of
load cycles (N) that the detail can resist under
repetitive loading with stress range $\Delta\sigma$ is
recorded during the experiment, and the S-N curve is
the curve fitted to these points. In the Eurocode
(EN1993-1-9, 2005), this curve is tri-linear in
log-log scale and can be represented by the following
equation:

$\log(N) = \log(a_i) - m_i \log(\Delta\sigma)$ (2)

where $a_i$ (the N-axis intercept) and $m_i$ (the
negative inverse slope) are properties depending
on the detail type, and the index $i$ indicates the two
sloping branches of the tri-linear curve. The first
line ($i = 1$) descends with $m_1 = 3$ until the point
known as the constant amplitude (CA) fatigue limit
(CAFL), which is defined at $N = 5 \times 10^6$ cycles.
To consider the effect of variable amplitude (VA)
loading, the second line ($i = 2$) is extended from the
CAFL with $m_2 = 5$ until the fatigue cut-off limit at
$N = 10^8$ cycles. Damage below this cut-off value is
ignored. The characteristic values of $a_1$ and $a_2$
can be calculated using (EN1993-1-9, 2005) for each
detail type.

The characteristic S-N curves are divided by
$\gamma_{Mf}$ to obtain the design curves. These S-N
curves are shown in blue in Figure 3 for the cover
plate.

Given the design stress spectrum and the design
S-N curves, the fatigue damage (D) can be calculated
using the Palmgren-Miner damage accumulation rule
(Miner, 1945). According to this rule, all stress
cycles cause proportional fatigue damage, which is
linearly additive:

$D_n = \sum_i d_i = \sum_i \dfrac{n_i}{N_i}$ (3)

where $D_n$ is the damage due to $n = \sum_i n_i$
cycles, $d_i$ is the damage caused by all stress cycles
$n_i$ in the design stress spectrum that have the same
range $\Delta\sigma_i$, and $N_i$ is the number of
cycles to failure for that same stress range obtained
from the S-N relation in equation (2).

For structures designed according to the Eurocode,
$D_n$ should be equal to or smaller than 1 at the end
of life. By adjusting the section modulus in such a
way that $D_n$ is equal to 1, a structure is obtained
that just meets the requirement for fatigue.

Figure 3. Deterministic cumulative stress spectra and
S-N curves for cover plate.
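The deterministic design step can be sketched as follows. The sketch assumes the EN 1993-1-9 convention that the detail category is the stress range in MPa at 2 x 10^6 cycles, from which the characteristic intercepts of both branches follow. The stress spectrum is hypothetical, and scaling all stress ranges by a common factor stands in for adjusting the section modulus.

```python
import math

def sn_constants(category):
    """Characteristic a_1, a_2, CAFL and cut-off for a detail category (MPa)."""
    a1 = 2.0e6 * category ** 3
    caf_limit = (a1 / 5.0e6) ** (1.0 / 3.0)  # CAFL at N = 5e6
    a2 = 5.0e6 * caf_limit ** 5
    cut_off = (a2 / 1.0e8) ** (1.0 / 5.0)    # cut-off limit at N = 1e8
    return a1, a2, caf_limit, cut_off

def cycles_to_failure(stress_range, category, gamma_mf=1.0):
    """Cycles to failure on the design S-N curve."""
    a1, a2, caf_limit, cut_off = sn_constants(category)
    # Multiplying the stress range by gamma_Mf is equivalent to
    # dividing the characteristic curve by gamma_Mf.
    s = stress_range * gamma_mf
    if s <= cut_off:
        return math.inf
    if s >= caf_limit:
        return a1 / s ** 3
    return a2 / s ** 5

def miner_damage(spectrum, category, gamma_mf=1.0, gamma_ff=1.0):
    """Linear damage sum over (stress range, cycle count) pairs."""
    return sum(n / cycles_to_failure(gamma_ff * s, category, gamma_mf)
               for s, n in spectrum)

def calibrate_scale(spectrum, category, gamma_mf=1.35, target=1.0):
    """Find the stress scale factor that gives a damage equal to target."""
    lo, hi = 1.0e-3, 1.0e3
    for _ in range(200):
        mid = math.sqrt(lo * hi)  # bisection in log space
        scaled = [(mid * s, n) for s, n in spectrum]
        if miner_damage(scaled, category, gamma_mf) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Hypothetical spectrum: (stress range in MPa, number of cycles).
SPECTRUM = [(60.0, 2.0e5), (40.0, 2.0e6), (25.0, 2.0e7), (10.0, 2.0e8)]
scale = calibrate_scale(SPECTRUM, category=50)
damage = miner_damage([(scale * s, n) for s, n in SPECTRUM], 50, gamma_mf=1.35)
```

Bisection is used because the damage grows monotonically with the scale factor but is not smooth, owing to the slope change at the CAFL and the jump at the cut-off limit.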

In the next step, to simulate the reality more 
accurately, a Weigh In Motion (WIM) data set 
belonging to a main Dutch highway (A16) is 
used instead of FLM4. WIM is a traffic measure- 
ment system which is able to record the speed, the 
number of axles, the axle loads and the axle dis- 
tance of a passing vehicle as well as the distance 
between consecutive vehicles. In the considered 
WIM dataset, these traffic properties are recorded 
for a traffic flow in one direction and for the dura- 
tion of one month. Thus, to simulate the traffic 
flow for the structure’s entire service life of 100 
years, the recorded number of 2°10° heavy vehicles 
should be multiplied by 1200 if trends are absent. 
The analysis proceeds by applying the measured
traffic to the previously adjusted influence line for
each structural scheme and detail. This resembles
the situation in which a bridge is designed following
the Eurocode system and is loaded by actual traffic.
The stress spectra for actual traffic are calculated
as explained for FLM4 and are referred to in this
paper as 'WIM' spectra. The WIM cumulative stress
range spectrum for structural scheme 3 with a span
length of 100 m is shown as a dashed black curve in
Figure 3, alongside the FLM4 spectrum.


2.1.2 Probabilistic approach

There are several sources of uncertainty on the load
side, such as the dynamic amplification factor (DAF),
the trend amplification factor (t) and the load effect
model uncertainty (B). On the material resistance
side, uncertainties can be represented by the scatter
of the fatigue test data. In the following, these
random variables are described in more detail.

A DAF should be introduced to compensate for
the absence of dynamic vehicle-bridge interaction
in the WIM database. Theoretical studies to
determine the DAF are usually aimed at the ultimate
limit state and give an exaggerated effect for
a fatigue assessment. In practice, this factor can
therefore be calculated as the ratio between the
maximum stresses recorded when a test vehicle crosses
at high speed (e.g. 80 km/h) and at low speed
(e.g. 20 km/h). Based on strain gauge measurements
carried out on some Dutch motorway bridges, the DAF in
this study is assumed to be distributed according to
Table 1. However, this distribution could be improved
by collecting more measurement data.

The trend amplification factor takes into
account possible changes in load over time. The
design life of important bridges is usually 100
years, and in that period the number and weight of
vehicles can change significantly. Clearly, there is a
large uncertainty in estimating trends; nevertheless,
they have to be estimated based on either assumptions
or extrapolation of very limited available data. The
Dutch standard for assessment of existing structures
(NEN8701, 2011) suggests an increase of 20% in axle
loads in 100 years, i.e. a linear trend in time with
an average annual increase of 0.2% with respect to the
design year (Figure 4). An uncertainty in this trend
that increases with time, with a standard deviation of
0.05 after 100 years, is assumed. The linear trend is
approximately equivalent to an average increase of 10%
in axle loads over the entire life with a standard
deviation of 9%. The annual loads are thus multiplied
by a factor (t), which is statistically distributed as
shown in Figure 4 and presented in Table 1. A
possible change in the number of vehicles is not
considered. The background of this assumption is that
the considered highway, which has 3 lanes per traffic
direction, is believed to have reached its maximum
capacity. A higher number of vehicles is expected
to result in significant traffic jams and thus no
larger number of passing vehicles.

Table 1. Distribution of random variables.

Variable        Description                                      Distribution  Mean   STD
DAF             Dynamic amplification factor                     Log-normal    1      0.05
t               Trend amplification factor                       Normal        1.1    9E-4
B               Load effect model uncertainty                    Log-normal    1      0.1
log10(a_1) (a)  Material parameter, 1st line (butt weld joint)   Normal        12.42  0.25
log10(a_2) (a)  Material parameter, 2nd line (butt weld joint)   Normal        16.07  0.32
log10(a_1) (a)  Material parameter, 1st line (cover plate)       Normal        11.69  0.18
log10(a_2) (a)  Material parameter, 2nd line (cover plate)       Normal        15.02  0.3
D_cr            Critical fatigue damage                          Log-normal    1      0.3

(a) a_1 and a_2 are fully correlated.

Figure 4. Trend amplification factor distribution.

Section 2.1 demonstrates that an influence
line is required to determine the stress ranges.
In practice, such an influence line is extracted
from a structural model of the bridge. The model
may contain approximations and errors, so this
influence line might differ from reality.
The uncertainty related to this issue is known as
the load effect model uncertainty. It can be evaluated
by comparing the calculated influence line with the
influence line derived from measurements. In this
study, the JCSS (JCSS, 2001) recommendation is used
for the distribution of the load effect model
uncertainty (Table 1).

The fatigue resistance is known for its significant
scatter. This is mainly due to uncontrollable
differences between test samples. For simplicity, the
slopes of the fitted lines $m_1$ and $m_2$ are assumed
to be constant and the scatter is represented by the
random variables $a_i$ in equation (2). EN 1993-1-9
includes the characteristic lines having 95%
probability of survival, and the scatter is not
mentioned. Therefore, in this research, the standard
deviations of the logarithm of $a_i$,
$\mathrm{Log}(a)_{STD,i}$, are taken from the British
standard (BS7608, 2014). Starting with the
characteristic values $a_1$ and $a_2$ from the
Eurocode (EN1993-1-9, 2005), their mean values can be
obtained as:

$\mathrm{Log}(a)_{mean,i} = \mathrm{Log}(a)_{characteristic,i} + 1.645\,\mathrm{Log}(a)_{STD,i}$ (4)
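The mean values listed in Table 1 can be checked against this relation. The sketch below assumes the EN 1993-1-9 convention that the detail category (80 for the butt weld, 50 for the cover plate) is the characteristic stress range in MPa at 2 x 10^6 cycles, with the second line starting at 5 x 10^6 cycles; the function name is illustrative.

```python
import math

Z_95 = 1.645  # standard normal quantile for 95% probability of survival

def mean_log_a(category, std_log_a, line):
    """Mean of log10(a) for the first (m = 3) or second (m = 5) S-N line."""
    log_a1 = math.log10(2.0e6) + 3.0 * math.log10(category)
    if line == 1:
        log_a_char = log_a1
    else:
        # Stress range at the CAFL, where the two lines meet (N = 5e6).
        log_caf_limit = (log_a1 - math.log10(5.0e6)) / 3.0
        log_a_char = math.log10(5.0e6) + 5.0 * log_caf_limit
    return log_a_char + Z_95 * std_log_a

butt_weld = (mean_log_a(80, 0.25, 1), mean_log_a(80, 0.32, 2))
cover_plate = (mean_log_a(50, 0.18, 1), mean_log_a(50, 0.30, 2))
```

Under these assumptions the four values agree with the means in Table 1 to within about 0.01, which supports reading the relation with a plus sign.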


Having the distribution parameters for all ran- 
dom variables, reliability analysis can be performed 
to evaluate the safety status of the designed details 
at the end of the service life. Figure 5 shows the 
WIM cumulative stress spectrum and its lower (5% 
fraction) and upper (95% fraction) bounds for the 
structural scheme 3 with span length of 100 m as 
well as the mean S-N curve and S-N curves with 
5% and 95% probability of survival for a cover 
plate detail including all random variables intro- 
duced above. 


2.2 Reliability analysis 


The limit state function is defined as:

$g(X) = D_{cr} - D_n(X)$ (5)

where the random variable $D_{cr}$ is the critical
damage, $D_n$ is the fatigue damage at the end of life
and $X$ is the vector of all previously mentioned
random variables. $D_{cr}$ follows a lognormal
distribution with parameters according to Table 1
(JCSS, 2001). Failure is defined as the situation in
which $g(X) < 0$. Therefore, the probability of
failure is defined as:

$P_f = P[g(X) < 0] = \int_{g(X) < 0} f_X(x)\,dx$ (6)

where $f_X(x)$ is the multivariate probability density
function of $X$. The First Order Reliability Method
(FORM) is used to approximate the integral in
equation (6). FORM, developed by Hasofer & Lind
(1974), is based on the idea that in the standard
normal space, the reliability index $\beta$ is the
shortest distance from the origin to the limit state
surface $g(U) = 0$, where $U$ is the vector of
normalized random variables. The point $u^*$ on the
limit state surface that has the shortest distance to
the origin is known as the design point or the most
probable failure point. Finding the design point is an
iterative process and, once it has been found, the
reliability index is obtained. Another direct outcome
of FORM is the sensitivity factor of each random
variable, $\alpha_i$, which is a measure of the
relative importance of the standard deviation of the
$i$-th random variable to the reliability index. It
can be calculated as:

$\alpha_i = -u_i^* / \beta$ (7)

Figure 5. Probabilistic cumulative stress spectra and
S-N curves for cover plate including all random
variables.
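The iterative search for the design point can be illustrated with the Hasofer-Lind / Rackwitz-Fiessler update, which is one common scheme and not necessarily the one used in this study. The sketch assumes a limit state already expressed in the standard normal space and uses a forward-difference gradient; the linear limit state is a hypothetical example chosen because its exact solution is known.

```python
import math

def form_hlrf(g, n_vars, tol=1e-6, max_iter=100, h=1e-6):
    """Hasofer-Lind / Rackwitz-Fiessler iteration in standard normal space."""
    u = [0.0] * n_vars
    for _ in range(max_iter):
        g0 = g(u)
        grad = []
        for i in range(n_vars):  # forward-difference gradient
            shifted = list(u)
            shifted[i] += h
            grad.append((g(shifted) - g0) / h)
        norm2 = sum(gi * gi for gi in grad)
        factor = (sum(gi * ui for gi, ui in zip(grad, u)) - g0) / norm2
        u_new = [factor * gi for gi in grad]
        converged = max(abs(a - b) for a, b in zip(u_new, u)) < tol
        u = u_new
        if converged:
            break
    beta = math.sqrt(sum(ui * ui for ui in u))
    alpha = [-ui / beta for ui in u]  # sensitivity factors
    return beta, u, alpha

def g_linear(u):
    """Hypothetical linear limit state in U space."""
    return 3.0 - 0.6 * u[0] - 0.8 * u[1]

beta, design_point, alpha = form_hlrf(g_linear, 2)
```

For this linear case the iteration settles after two passes; nonlinear limit states generally need more iterations and a transformation of the physical variables to the standard normal space first.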


Sensitivity factors are used to derive partial factors according to the following equation:

$$\gamma_i = \frac{\left[1 - \alpha_i \beta_t\, \mathrm{STD}(E_i)\right] \times \mathrm{mean}(E_i)}{\mathrm{characteristic}(E_i)} \qquad (8)$$

where $\beta_t$ is the target reliability index (Section 2.1.4), $E_i$ is the $i$-th random variable and $\gamma_i$ is the required partial factor for $E_i$.

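The fatigue design factor (FDF) is defined as the product of the partial factors of the resistance and load effect random variables:

$$FDF = \prod_i \gamma_{R,i} \prod_j \gamma_{S,j} = \gamma_{Mf} \times \gamma_{Ff} \qquad (9)$$

where $\gamma_{Mf}$ and $\gamma_{Ff}$ are the fatigue resistance and fatigue load partial factors, respectively.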


The accuracy of FORM is checked by a Crude 
Monte Carlo (CMC) simulation with 4 x 10° 
number of samples. In this method, the probability of failure is calculated as the ratio of the number of failure cases ($g(X) < 0$) to the total number of CMC samples. Given the probability of failure, the reliability index can be calculated as

$$\beta = -\Phi^{-1}(P_f) \qquad (10)$$

where $\Phi^{-1}(\cdot)$ is the inverse of the standard normal cumulative distribution function.
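The crude Monte Carlo check described above can be sketched as follows. The limit state $g = R - S$ with normally distributed $R$ and $S$ is a hypothetical stand-in for the fatigue limit state of this study, and the function and variable names are illustrative assumptions; only the counting of failures and the conversion $\beta = -\Phi^{-1}(P_f)$ follow the text.

```python
import random
from statistics import NormalDist


def crude_monte_carlo(limit_state, sampler, n_samples, seed=0):
    """Estimate P_f as the fraction of samples with g(X) < 0, then convert to beta."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_samples):
        if limit_state(sampler(rng)) < 0.0:
            failures += 1
    if failures in (0, n_samples):
        # inv_cdf is undefined at 0 and 1; more samples are needed.
        raise ValueError("P_f estimate is 0 or 1; change n_samples")
    p_f = failures / n_samples
    beta = -NormalDist().inv_cdf(p_f)
    return p_f, beta


# Hypothetical example: resistance R ~ N(5, 1), load effect S ~ N(2, 1).
def sample_rs(rng):
    return rng.gauss(5.0, 1.0), rng.gauss(2.0, 1.0)


def g_rs(x):
    resistance, load = x
    return resistance - load


p_f, beta = crude_monte_carlo(g_rs, sample_rs, n_samples=200_000, seed=1)
print(f"P_f = {p_f:.4e}, beta = {beta:.3f}")
```

For this hypothetical case the exact reliability index is $3/\sqrt{2} \approx 2.12$, so the estimate can be compared against a known value. The number of samples needed grows quickly as the target failure probability decreases, which is why FORM is attractive for high reliability indices.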


2.3 Target reliability index 


The target reliability index ($\beta_t$) is the answer to the question “What level of safety is sufficient?”. Several factors play a role in this answer, including the consequence of failure in terms of both loss of human life and economic aspects, the cost required to improve safety, the structure’s planned service life and the type of limit state considered. In the Eurocode, $\beta_t$ for the fatigue limit state ranges between the target values of the reliability indices for the ultimate and serviceability limit states. For structures such as bridges, which can be categorized into consequence class 3 (EN1990, 2002), with details that have large consequences of failure and that are designed according to the safe life concept, the target reliability index for fatigue is set equal to the ultimate limit state value of 4.3.
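To give a feeling for what a target of 4.3 means, the relation between reliability index and failure probability, $P_f = \Phi(-\beta)$, can be evaluated directly. The helper names below are illustrative assumptions; only the standard normal relation itself is taken from the paper.

```python
from statistics import NormalDist


def failure_probability(beta):
    """P_f = Phi(-beta) for a standard normal distribution."""
    return NormalDist().cdf(-beta)


def reliability_index(p_f):
    """beta = -Phi^{-1}(P_f)."""
    return -NormalDist().inv_cdf(p_f)


p_target = failure_probability(4.3)
print(f"beta_t = 4.3 corresponds to P_f = {p_target:.2e}")
print(f"round trip: beta = {reliability_index(p_target):.3f}")
```

The resulting failure probability is slightly below $10^{-5}$, which illustrates why a very large number of crude Monte Carlo samples is needed to verify designs near the target.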


3 RESULTS AND DISCUSSIONS 


Bearing in mind that the results of this study depend strongly on the distributions assigned to the random variables, the reliability indices at the end of life for both the cover plate and the transverse butt welded joint, and for all structural schemes, are shown in Figure 6. For all considered cases, the reliability index is lower than the target value when the recommended values of the partial factors are applied. In addition, it can be observed that the value of $\beta$ depends on the shape and length of the influence line. This is caused by the approximations and simplifications in the fatigue load model FLM4. Furthermore, it can be observed that the cover plate results in slightly higher reliability at the end of life than the transverse butt welded joint, due to the differences in scatter of the fatigue resistance (Table 1).

The corresponding fatigue design factors for 
each detail and structural scheme and under the 
WIM spectra (structure designed with character- 
istic FLM4 and loaded by WIM measured traffic) 
are shown in Figure 7. 

For the span length of 100 m, and especially for Model 2A, the values of FDF are much higher than the value resulting from the recommended partial factors and are not in line with the other values. The reason is the inability of FLM4 to simulate the real traffic flow on bridges with very long spans. In the Dutch national annex to EN 1991-2 (EN1991-2/NB, 2011), this flaw is addressed by introducing a second heavy vehicle on the bridge in 20% of the heavy vehicle crossings for continuous positive or negative influence lengths larger than 60 m. The center-to-center distance between these two vehicles is set to 50 m.

Figure 6. Reliability indices at the end of life.

Figure 7. Required fatigue design factors to reach the target reliability index at the end of service life.

Repeating the reliability analysis with FLM4a according to the Dutch NA, the values of FDF for the span length of 100 m are reduced to the values presented in Figure 8. These values can still be as high as 1.92 for Model 2A, but fatigue is usually not the dominant failure mode at the location of the intermediate support of a two-span bridge. This example, and the still significant differences between the various structural schemes, demonstrate the necessity of updating the fatigue load models in the Eurocode.

Now that the FDF is determined, the next step is to distribute it over $\gamma_{Mf}$ and $\gamma_{Ff}$. A direct determination of the partial factors based on the FORM sensitivity factors is problematic, because:


— for fatigue the three-branch S-N curve uncer- 
tainty is related to the number of cycles and not 
to the stress range, and; 

— the entire stress range spectrum influences the 
reliability instead of a single stress value. As 
demonstrated in Figure 3, the shape of the stress 
range spectrum of FLM4 is not in agreement 
with that of the WIM data. 


For this reason, an intermediate step is taken in which the section modulus is determined using the S-N curve as the resistance model and the WIM database as the load model. In this case, the uncertainty on the load side is captured in the product of the DAF, t and B. This results in a single multiplication factor for the stress spectrum and hence enables the determination of a modified partial factor for the load side, $\gamma_{Ff}^*$, via the influence factors $\alpha_{DAF}$, $\alpha_{t}$ and $\alpha_{B}$. Note that $\gamma_{Ff}^*$ considers the uncertainties related to the load side but not the inaccuracy of FLM4 in representing the WIM measurements. The same analysis gives the modified fatigue design factor FDF*, and the partial factor on the resistance side can now be determined as $\gamma_{Mf} = FDF^*/\gamma_{Ff}^*$. Finally, the partial factor on the load side, including the difference between the actual WIM data and the FLM4 model, is determined with the FDF of the original analysis of Figures 7 and 8, via $\gamma_{Ff} = FDF/\gamma_{Mf}$.

Figure 8. Required FDF after modification of FLM according to the Dutch national annex.

Table 2. Sensitivity factors of random variables.

Variable    DAF       t          B         log10(a_1)   log10(a_2)   D_cr
alpha       -0.268    -0.0044    -0.534    0.0739       0.728        0.329

Figure 9. Fatigue load partial factors.

As expected, the factors FDF*, $\alpha_{DAF}$, $\alpha_{t}$, $\alpha_{B}$ and $\gamma_{Mf}$ are almost independent of the structural models, with a variation of less than 2%. This demonstrates the consistency of the approach.

Table 2 presents the sensitivity factors of all random variables as well as $\gamma_{Ff}^*$ values for the transverse butt welded joint located on Model 2B with a span length of 20 m.

For the transverse butt welded joint, the value calculated for $\gamma_{Mf}$ is 1.42 (i.e. 5% higher than the Eurocode’s recommendation) and for the cover plate it is equal to 1.37 (1.8% higher than the Eurocode’s recommendation). The calculated values of $\gamma_{Ff}$ are presented in Figure 9. It can be observed that, for all considered cases, $\gamma_{Ff}$ has a value larger than the Eurocode’s recommended value of 1. Considering all studied structural schemes with all assigned span lengths and for both structural details, $\gamma_{Ff}$ takes an average value of 1.16 with a standard deviation of 0.13.
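The two-step split of the fatigue design factor can be written out as a short calculation. The sketch below assumes that FDF is the product of the resistance and load partial factors, as defined in Section 2.2. The inputs to the split (FDF, FDF* and the modified load factor) are hypothetical numbers chosen for illustration, whereas 1.35, 1.0, 1.42, 1.37 and 1.16 are the values quoted in the text. Because 1.16 is an average over all schemes and both details, the resulting products are indicative only.

```python
def split_fatigue_design_factor(fdf, fdf_star, gamma_ff_star):
    """Distribute FDF over the resistance and load partial factors.

    gamma_mf = FDF* / gamma_ff*   (resistance side, WIM data as load model)
    gamma_ff = FDF / gamma_mf     (load side, includes the FLM4 model inaccuracy)
    """
    gamma_mf = fdf_star / gamma_ff_star
    gamma_ff = fdf / gamma_mf
    return gamma_mf, gamma_ff


def fatigue_design_factor(gamma_mf, gamma_ff):
    """FDF as the product of resistance and load partial factors."""
    return gamma_mf * gamma_ff


# Hypothetical inputs, for illustration only.
gamma_mf, gamma_ff = split_fatigue_design_factor(fdf=1.92, fdf_star=1.704, gamma_ff_star=1.2)
print(f"gamma_Mf = {gamma_mf:.3f}, gamma_Ff = {gamma_ff:.3f}")

# Indicative comparison using the values quoted in the text.
print(f"Eurocode recommendation: FDF = {fatigue_design_factor(1.35, 1.0):.2f}")
print(f"Butt weld (average load factor): FDF = {fatigue_design_factor(1.42, 1.16):.2f}")
print(f"Cover plate (average load factor): FDF = {fatigue_design_factor(1.37, 1.16):.2f}")
```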




4 CONCLUSION 


In this paper, the necessity of increasing the values of the partial factors for fatigue traffic load models in the Eurocode, when designing a steel bridge with the safe-life assessment approach, has been studied. For this purpose, two different structural details located in twelve structural schemes have been designed to meet the Eurocode’s fatigue safety requirements. After studying the random variables that influence the fatigue limit state and assigning a distribution to each of them, each designed detail has been loaded by measured traffic and a reliability analysis has been performed to evaluate the safety status of the detail at the end of its life. It has been observed that, in all cases, the reliability index is lower than the minimum acceptable value set by the standard. The reliability depends on the structural detail as well as on the shape and length of the influence line. The study then proceeded by calculating the values of the partial factors that are needed to reach the standard’s target reliability level. The calculated fatigue resistance partial factors are 1.37 and 1.42 for the two steel details considered, and these values are slightly higher than the recommended factor of 1.35 in the Eurocodes. The main finding of this study is that the value of the fatigue load partial factor recommended by the Eurocodes is too low for fatigue load model FLM4. For all studied structural schemes and for both chosen details, the required load partial factors $\gamma_{Ff}$ are higher than the factor of 1 recommended by the Eurocodes. The factor $\gamma_{Ff}$ strongly depends on the shape and length of the influence lines and varied between 1.02 and 1.58 for the considered systems.

The values presented should not be considered as ‘the truth’ because they are subject to the choices of the distributions of the load related variables. Nonetheless, it is clear that the recommended partial factor of 1 in the current Eurocodes is too small, and the large differences in the calculated factors demonstrate the necessity of improving the fatigue load models in the Eurocode.

A sensitivity analysis to study the effect of the most influential load related random variables on the partial factors is another outcome of this study. It has been found that the load effect model uncertainty has a large impact on the results, followed by the load trend amplification factor. The dynamic amplification factor has a lower influence than the other variables. Based on this analysis, future studies can be planned to evaluate the distribution parameters of these random variables.


Reliability analysis in the presence of aleatory uncertainty

ABSTRACT: This paper proposes a method for characterizing a system’s response given data. This
response might prescribe the failure domain needed to assess the reliability of such a system. We focus 
on the case in which not all uncertain parameters affecting the response are observable and the measure- 
ments are corrupted by noise. In this setting, the system response is not given by a function but instead 
by a random process. In this paper we use a staircase random predictor model to characterize such a 
process. Consequently, the resulting failure probability is not a scalar but a random variable. This variable 
accounts for the aleatory contributions of the model-form uncertainty and the measurement noise affect- 
ing the system’s response. Furthermore, we propose a framework that enables trading off the system’s per- 
formance, measured by the extension of an acceptable range of operating conditions, against the system’s 
reliability, measured by an admissible range of failure probabilities. The risk incurred by such a practice 
is ignoring a (small) percentage of the predicted worst-case system responses. These ideas are illustrated 
by performing the reliability analysis of an aeroelastic structure subject to flutter instability. Furthermore, 
this paper puts forth a means to quantify the error resulting from having a dataset of limited size when performing the above analysis.


1 INTRODUCTION 


Metamodeling (Simpson, Peplinski, Koch, & Allen 
2001) refers to the process of creating a mathe- 
matical representation of a phenomenon based on 
input-output data. This paper uses a metamodeling 
technique for constructing computational models 
describing the distribution of a continuous output 
variable. These models are called Random Predictor 
Models (RPMs) because the predicted output cor- 
responding to any given input is a random variable. 
One common example of an RPM is a Gaussian 
Process (GP) model (Rasmussen & Williams 2006). 
In contrast to GP models, which only lead to unimodal and symmetric responses, we focus on RPMs having a bounded support set and prescribed values for the first four moments. The manipulation of these
functions enables the generation of predictors that 
accurately describe possibly skewed and multimodal 
responses typical of many physical phenomena. 

This paper extends the developments on RPMs 
made by the authors to account for sampling error 
in the moment estimates. As an application, we 
use RPMs for the reliability and risk analysis of 
a flexible structure. To make the paper self-con- 
tained, essential concepts are presented. Supple- 
mental information is available at (Crespo, Giesy, 
& Kenny 2017a). 


2 PROBLEM STATEMENT 


A Data Generating Mechanism (DGM) is postulated to act on a vector of input variables, $x \in \mathbb{R}^{n_x}$, to produce an output, $y \in \mathbb{R}^{n_y}$. In this article the focus will be on the single-output ($n_y = 1$) multi-input ($n_x \geq 1$) case. The dependency of the output on the input is arbitrary. This covers the case in which $y$ is a function of $x$ with all components of $x$ available (so there is only one output value for each available input), the case in which $y$ is a function of $x$ but not all components of $x$ are available (so there might be infinitely many outputs for each measured input), and the case in which $y$ is an arbitrary random process of $x$. Assume that $N$ Independent and Identically Distributed (IID) input-output pairs are obtained from a stationary DGM, and denote by $D = \{x^{(i)}, y^{(i)}\}$, for $i = 1, \ldots, N$, the corresponding data sequence. The main objective of a predictor model is to generate a computational representation of a DGM based on the data in $D$. Two types of predictors will be used hereafter. An Interval Predictor Model (IPM) yields a bounded interval of output values at any value of the input. The desired IPM is a narrow interval wherein unobserved data will fall with high probability. Conversely, a Random Predictor Model (RPM) yields a random variable at any value of the input. The desired RPM accurately describes the distribution governing the DGM.


3 PRELIMINARIES


Consider the continuous random variable $z$ with support set $\Delta_z = [z_{\min}, z_{\max}]$, Probability Density Function (PDF) $f_z: \Delta_z \subseteq \mathbb{R} \to \mathbb{R}^+$, and Cumulative Distribution Function (CDF) $F_z: \Delta_z \to [0,1]$. Denote by $m_r$ the $r$-th central moment of $z$, which is defined as

$$m_r = \int_{\Delta_z} (z - \mu)^r f_z(z)\,dz, \quad r = 0, 1, 2, \ldots \qquad (1)$$

where $\mu$ is the expected value of $z$. Note that $m_0 = 1$, $m_1 = 0$, $m_2$ is the variance, $m_3$ is the third-order central moment, and $m_4$ is the fourth-order central moment. Where reference is made to the $r$-th moment of a random variable, we assume that the corresponding integral in (1) converges for that distribution.

The random variables of interest will be constrained to have a bounded support set and given values for $\mu$, $m_2$, $m_3$, and $m_4$. The bounded support constraint is $\Delta_z \subseteq \Omega_z$, where $\Omega_z = [\underline{z}, \overline{z}]$ with $\overline{z} \geq \underline{z}$ given, whereas the moment constraints are given by (1). The parameters of these constraints will be grouped into the variable $\theta_z \in \mathbb{R}^6$ given by

$$\theta_z = [\underline{z}, \overline{z}, \mu, m_2, m_3, m_4]. \qquad (2)$$

Any random variable $z$ having a support set contained in $[\underline{z}, \overline{z}]$ with moments $\mu$, $m_2$, $m_3$, and $m_4$ must satisfy the feasibility conditions $g(\theta) \leq 0$ given in (Crespo, Giesy, & Kenny 2017b). The realizations of $\theta$ satisfying these conditions constitute the $\theta$-feasible domain, $\Theta$, defined as

$$\Theta = \{\theta : g(\theta) \leq 0\}. \qquad (3)$$

A member of $\Theta$ will be called $\theta$-feasible. Determining membership in $\Theta$ is a distribution-free assessment applicable to possibly infinitely many random variables satisfying the desired constraints.

A particular class of random variables that can realize most of $\Theta$ is proposed in (Crespo, Giesy, & Kenny 2017b). This class is called Staircase because the PDF of its members is piecewise constant over bins of equal width. Staircase variables are calculated by solving the convex optimization program

$$\min_{\ell} \left\{ J(\theta, n_b) : A(\theta, n_b)\,\ell = b(\theta),\ \ell \geq 0 \right\}, \qquad (4)$$

where $J$ is the cost function used for optimization, $n_b$ is the number of bins partitioning $\Omega_z$, $\ell \in \mathbb{R}^{n_b}$ are the values of the PDF at the bin centers, and $A\ell = b$ are moment matching constraints. Staircase variables enable modeling complex phenomena efficiently. Staircase variables will be denoted as

$$z \sim S_z(\theta_z, n_b, J). \qquad (5)$$

When the cost is chosen to be the entropy, $E$, we obtain a maximal entropy staircase variable. The points $\theta \in \Theta$ for which a staircase variable with $n_b$ bins exists constitute the staircase feasible domain, $S(n_b)$. As expected, $S(n_b) \subseteq \Theta$. Supplemental information is available at (Crespo, Giesy, & Kenny 2017b).
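Because a staircase PDF is piecewise constant over bins of equal width, its mean and central moments follow from closed-form integrals over each bin. The sketch below evaluates them for given bin heights; it does not solve the optimization program that selects the heights, and the function name and example heights are illustrative assumptions.

```python
def staircase_moments(z_lo, z_hi, heights):
    """Mass, mean and central moments m2..m4 of a piecewise-constant PDF.

    heights[i] is the PDF value on the i-th of len(heights) equal-width
    bins partitioning [z_lo, z_hi]. The heights are assumed to be
    non-negative and to integrate to one.
    """
    n_bins = len(heights)
    width = (z_hi - z_lo) / n_bins
    edges = [z_lo + i * width for i in range(n_bins + 1)]
    bins = list(zip(heights, edges, edges[1:]))

    mass = sum(h * (b - a) for h, a, b in bins)
    mean = sum(h * (b * b - a * a) / 2.0 for h, a, b in bins)

    def central(r):
        # Exact integral of (z - mean)^r over each bin.
        return sum(
            h * ((b - mean) ** (r + 1) - (a - mean) ** (r + 1)) / (r + 1)
            for h, a, b in bins
        )

    return {"mass": mass, "mean": mean, "m2": central(2), "m3": central(3), "m4": central(4)}


# A uniform density on [0, 1] and a hypothetical bimodal staircase density.
uniform = staircase_moments(0.0, 1.0, [1.0, 1.0, 1.0, 1.0])
bimodal = staircase_moments(0.0, 1.0, [1.6, 0.4, 0.4, 1.6])
print(uniform)
print(bimodal)
```

The bimodal example concentrates probability near both ends of the support, so its variance exceeds that of the uniform density while the mean and third central moment are unchanged.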


3.1 Staircase estimation 


This section focuses on the estimation of the hyper-parameter $\theta_z$ of a staircase variable $S_z$ from the samples $z^{(1)}, \ldots, z^{(N)}$. Expert opinion should be used to prescribe $n_b$ and the bound to the support $\Omega_z$. Whereas $\Omega_z$ must¹ contain $\dot{\Delta}_z = [\min\{z^{(i)}\}, \max\{z^{(i)}\}]$, the moments $\mu$, $m_2$, $m_3$, and $m_4$ can be chosen to be the sampling moments $\dot{\mu}$, $\dot{m}_2$, $\dot{m}_3$ and $\dot{m}_4$. The developments that follow account for the error incurred by using these estimates.

Let us first focus on the error incurred by using $\Omega_z \supset \dot{\Delta}_z$. Finite values of $N$ often make $\dot{\Delta}_z$ a subset of the support set $\Delta_z$. Scenario optimization (Campi & Garatti 2008) enables bounding the probability of the tails of the PDF of $z$ extending beyond $\dot{\Delta}_z$. In particular,


$$P[z \notin \dot{\Delta}_z] = \kappa\, q^{\top} \ell \leq \hat{\epsilon}, \qquad (6)$$
where 
ê= [= ee BN-1) (7) 


$\beta$ is the confidence parameter, and $q \in \mathbb{R}^{n_b}$ is:


1 if (Z,2%,;]E Q, NÁ, 
0 (zal EA, 
ZU)’ z 
q= miniz i z, < min {20} z, 


otherwise. 


Equation (6) ensures that the staircase variable 
conforms to the tails probability bound from con- 
vex scenario theory. 

We now focus on the error in the sample moments. This error, fully prescribed by the corresponding sampling distributions, can be quantified by using bootstrapping techniques. Alternatively, an asymptotic approximation to the sampling distribution, grounded in the central limit theorem, can be used instead. Such a distribution is given by the normal variables

$$\mu \sim N\!\left(\dot{\mu}, \sqrt{\dot{m}_2/N}\right), \quad m_2 \sim N\!\left(\dot{m}_2, \sqrt{(\dot{m}_4 - \dot{m}_2^2)/N}\right), \quad m_3 \sim N\!\left(\dot{m}_3, \sqrt{t_3/N}\right), \quad m_4 \sim N\!\left(\dot{m}_4, \sqrt{t_4/N}\right), \qquad (8)$$

conditional on $\theta \in \Theta$, where the second argument of $N(\cdot,\cdot)$ is the standard deviation, $\dot{m}_k$ for $k = 5, 6, 7, 8$ are the sample fifth, sixth, seventh and eighth central-order moments, $t_3 = \dot{m}_6 - \dot{m}_3^2 - 6\dot{m}_4\dot{m}_2 + 9\dot{m}_2^3$ and $t_4 = \dot{m}_8 - \dot{m}_4^2 - 8\dot{m}_5\dot{m}_3 + 16\dot{m}_2\dot{m}_3^2$. These expressions correspond to an arbitrarily distributed variable $z$ for a sufficiently large value of $N$ (Kendall & Stuart 1969). For small values of $N$, bootstrapping techniques often yield a more accurate approximation.

¹ Sampling estimates will be denoted with a dot-superscript.
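The asymptotic confidence intervals of the first four moments can be computed directly from a sample. The sketch below uses the standard large-sample variance expressions for sample central moments (Kendall & Stuart 1969) with the sample moments plugged in. The function names and the default 95% level (z-score 1.96) are illustrative choices, and for small samples a bootstrap would usually be preferable, as noted above.

```python
import math


def central_moment(sample, mean, order):
    """Sample central moment of the given order (divides by N)."""
    return sum((z - mean) ** order for z in sample) / len(sample)


def moment_confidence_intervals(sample, z_score=1.96):
    """Asymptotic confidence intervals for the mean and m2, m3, m4."""
    n = len(sample)
    mean = sum(sample) / n
    m = {k: central_moment(sample, mean, k) for k in range(2, 9)}

    # Large-sample variance terms for the third and fourth central moments.
    t3 = m[6] - m[3] ** 2 - 6.0 * m[4] * m[2] + 9.0 * m[2] ** 3
    t4 = m[8] - m[4] ** 2 - 8.0 * m[5] * m[3] + 16.0 * m[2] * m[3] ** 2

    std_err = {
        "mean": math.sqrt(m[2] / n),
        "m2": math.sqrt(max(m[4] - m[2] ** 2, 0.0) / n),
        "m3": math.sqrt(max(t3, 0.0) / n),
        "m4": math.sqrt(max(t4, 0.0) / n),
    }
    estimate = {"mean": mean, "m2": m[2], "m3": m[3], "m4": m[4]}
    return {
        key: (estimate[key] - z_score * std_err[key], estimate[key] + z_score * std_err[key])
        for key in estimate
    }


# Small symmetric sample, for illustration only.
intervals = moment_confidence_intervals([1.0, 2.0, 3.0, 4.0, 5.0])
for name, (low, high) in intervals.items():
    print(f"{name}: [{low:.3f}, {high:.3f}]")
```

The intervals shrink in proportion to $1/\sqrt{N}$, which is the mechanism by which additional data tightens the admissible range of moments.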

To account for sampling error in the calculation of a staircase variable, the moment matching constraints are replaced by the polynomial inequality constraints

$$\underline{\mu} \leq \mu(\ell) \leq \overline{\mu}, \qquad (9)$$
$$\underline{m}_2 \leq m_2(\ell) \leq \overline{m}_2, \qquad (10)$$
$$\underline{m}_3 \leq m_3(\ell) \leq \overline{m}_3, \qquad (11)$$
$$\underline{m}_4 \leq m_4(\ell) \leq \overline{m}_4, \qquad (12)$$

where $\mu(\ell) = r_1\ell$, $m_2(\ell) = r_2\ell - \mu^2$, $m_3(\ell) = r_3\ell - 3\mu m_2 - \mu^3$ and $m_4(\ell) = r_4\ell - 4\mu m_3 - 6\mu^2 m_2 - \mu^4$ are the moments realized by the staircase variable, $r_i$ is the $i$-th row vector of $A$ in (4), and the moment bounds are the $1 - \alpha$ confidence intervals corresponding to the sampling distributions, e.g., $\dot{\mu} - 1.96\sqrt{\dot{m}_2/N} \leq \mu(\ell) \leq \dot{\mu} + 1.96\sqrt{\dot{m}_2/N}$ for a 95% confidence interval. Note that the box of moments defined by (9-12) might not be fully contained in $\Theta$.

Sampling error can be accounted for by solving for a maximal entropy staircase variable constrained by Equations (6) and (9-12). The resulting staircase variable will not account for the manner in which the sampling distribution allocates probability within the box of moments. This consideration can be taken into account by using a likelihood-dependent cost, such as

$$J(\ell) = -E(\ell) - \log\{L(\ell)\}, \qquad (13)$$

where $L(\ell) = N_{\mu}(\mu(\ell))\, N_{m_2}(m_2(\ell))\, N_{m_3}(m_3(\ell))\, N_{m_4}(m_4(\ell))$ is the likelihood function corresponding to the sampling distribution.

In summary, staircase variables provide (i) the 
ability to represent a wide range of density shapes, 
(ii) the ability to represent most of the feasible 
space ©, (iii) the ability to account for the effects of 
having a limited number of observations, and (iv) 
the low-computational cost required to efficiently 
perform various uncertainty quantification tasks. 


4 PREDICTOR MODELS 


4.1 Interval predictor models 


This section presents a means to calculate the sup- 
port set of an RPM. This will be carried out by find- 
ing a baseline IPM using the same data sequence 
D that will be used to construct the RPM. 

An IPM assigns to each instance vector $x \in X \subseteq \mathbb{R}^{n_x}$ a corresponding outcome interval in $Y \subseteq \mathbb{R}$. That is, an IPM is a set-valued map, $I_y: x \to I_y(x) \subseteq Y$, where $I_y(x)$ is the prediction interval. Depending on context, the term IPM will refer to either the function $I_y$ or its graph $\{(x, y): x \in X, y \in I_y(x)\}$ in $X \times Y$. A nonparametric IPM is given by

$$I_y(x) = \{y : \underline{y}(x) \leq y \leq \overline{y}(x)\}, \qquad (14)$$

where the functions $\underline{y}(x)$ and $\overline{y}(x)$ are the lower and upper boundaries of the IPM respectively. A parametric IPM is obtained by associating to each $x \in X$ the set of outputs $y$ that result from evaluating the function $y = M(x, p)$ at all values of $p$ in the set $P$, so

$$I_y(x, P) = \{y = M(x, p),\ p \in P\}. \qquad (15)$$

Attention will be limited to the case in which the output depends linearly on $p$ and arbitrarily on $x$, so $y = p^{\top}\varphi(x)$, where $\varphi(x) \in \mathbb{R}^{n_p}$ is an arbitrary basis. Several IPM types can be calculated within this framework. In this paper we use the technique used in (Crespo, Kenny, & Giesy 2016).

Example 1: Next we use an analytically described DGM of which we have full knowledge. Figure 1 shows the corresponding one percentiles. Note that the support set, moments and modality of the DGM are strongly nonlinear functions of the input. The high concentration of percentile lines at the edge of the support indicates a bimodal structure. N = 1000 IID observations were drawn from the DGM to form the data sequence $D$. This data, shown in Figure 1, was then used to construct an IPM with $n_p = 20$ terms and a Gaussian basis structure. That is, each basis function $\varphi_i(x)$ is a Gaussian function of $x$, where $u_i \in \mathbb{R}$ is a center and $v_i \in \mathbb{R}$ is a length-scale parameter, for $i = 1, \ldots, n_p$. Centers are uniformly distributed over $X = [-\pi, \pi]$, whereas the length scale parameters are made all equal to $\pi/5$. Figure 1 also shows the resulting IPM. Notice that the IPM tightly encloses all the data as intended. An IPM will be used to describe the support of the DGM, whereas RPMs, introduced next, will be used to characterize its distribution.
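For a model that is linear in the parameters, the prediction interval is easy to evaluate once the parameter set is known. The sketch below assumes a hyper-rectangular set $P$ (componentwise bounds on $p$) and a Gaussian basis of the form $\exp(-((x - u_i)/v)^2)$; both are assumptions made for illustration, as are the numerical bounds. It only evaluates the interval for a given $P$ and does not perform the data-driven calculation of $P$ referenced in the text.

```python
import math


def gaussian_basis(x, centers, length_scale):
    """Gaussian basis functions; the exact functional form is an assumption."""
    return [math.exp(-((x - u) / length_scale) ** 2) for u in centers]


def ipm_interval(x, centers, length_scale, p_lo, p_hi):
    """Interval of y = p^T phi(x) over the box p_lo <= p <= p_hi.

    Each term is minimised (maximised) independently by picking the bound
    of p_i that matches the sign of phi_i(x).
    """
    phi = gaussian_basis(x, centers, length_scale)
    lower = sum((lo if f >= 0.0 else hi) * f for f, lo, hi in zip(phi, p_lo, p_hi))
    upper = sum((hi if f >= 0.0 else lo) * f for f, lo, hi in zip(phi, p_lo, p_hi))
    return lower, upper


# Basis layout as in Example 1: 20 centers on [-pi, pi], length scale pi/5.
n_p = 20
centers = [-math.pi + 2.0 * math.pi * i / (n_p - 1) for i in range(n_p)]
length_scale = math.pi / 5.0

# Hypothetical parameter bounds, for illustration only.
p_lo = [-0.1] * n_p
p_hi = [0.2] * n_p

lower, upper = ipm_interval(0.0, centers, length_scale, p_lo, p_hi)
print(f"I_y(0) = [{lower:.3f}, {upper:.3f}]")
```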


4.2 Random predictor models 


An RPM is a mapping that assigns to each input vector $x \in X$ a corresponding random variable $R_y(x)$. A non-parametric RPM is the random variable-valued map given by

$$R_y(x) = \{f_{y|x}(y(x)),\ y(x) \in \Delta_y(x)\}, \qquad (16)$$

where $f_{y|x}$ is the PDF of $y$ at $x \in X$ having the support set $\Delta_y(x) = [\underline{y}(x), \overline{y}(x)] \subseteq Y$. By contrast, a parametric RPM is obtained by associating to each $x \in X$ the set of outputs $y$ corresponding to all values of $p$ described by a random vector with joint PDF $f_p(p)$ supported in $\Delta_p$, so

$$R_y(x, f_p) = \{y = M(x, p),\ p \sim f_p(p),\ p \in \Delta_p\}. \qquad (17)$$
The RPMs used below assume a staircase structure. As such, they require prescribing an input-dependent hyper-parameter consisting of $n_b(x)$ and $\theta_{y(x)}$ for all $x \in X$. Hereafter we will assume that $n_b(x)$ is a fixed constant, and focus on the prescription of $\theta_{y(x)}$. Define as

$$\dot{\theta}_{y(x)} = [\underline{y}(x), \overline{y}(x), \dot{\mu}(x), \dot{m}_2(x), \dot{m}_3(x), \dot{m}_4(x)]$$

a set of target functions prescribed according to the data, and by $\theta_{y(x)}$ the set of functions realized by an RPM. An RPM that accurately represents the DGM must make $\theta_{y(x)}$ close to $\dot{\theta}_{y(x)}$. The strategy for prescribing $\dot{\theta}_{y(x)}$ according to the data sequence $D$ presented in (Crespo, Giesy, & Kenny 2017b) will be used here. $\dot{\theta}_{y(x)}$ is based on a weighted average of values $y^{(i)} \in D$ for $x^{(i)}$ close to $x$.

Figure 1. One-percentiles of the DGM (black lines), N = 1000 observations (red x), and IPM limits (blue lines).

A non-parametric RPM with a staircase structure can be readily calculated from the target functions $\dot{\theta}_{y(x)}$ by making the prediction at any input value $x \in X$ realize the corresponding target. Once a set of staircase-feasible target functions $\dot{\theta}_{y(x)}$ is obtained, the RPM $S_{y(x)}(\dot{\theta}_{y(x)}, n_b, J)$ can be readily evaluated at any value of the input. The resulting RPM will conform to the target regardless of the choice of $J$.

Example 2: Next we use the data sequence of the DGM in Example 1 to derive an RPM. Figure 2 shows the functions in $\theta_{y(x)}$ corresponding to the DGM (solid lines) along with the target functions $\dot{\theta}_{y(x)}$ (dashed lines). Note that the target functions approximate the DGM well in spite of only using N = 1000 observations. The difference between the two sets of functions is caused by the limited amount of data available and by using neighboring data to calculate $\dot{\theta}_{y(x)}$. The targets $\dot{\theta}_{y(x)}$ in Figure 2 were used to build the RPM $S_{y(x)}(\dot{\theta}_{y(x)}, 500, E)$. This RPM is staircase-feasible throughout $X$. This was also the case for values of $n_b$ as small as 100. The plot at the top of Figure 3 shows the 1-percentiles of this RPM. This figure was generated by calculating staircase variables over a uniform grid of input values in $X$, sampling them, smoothing the corresponding empirical CDF using a Gaussian kernel (Silverman 1986), and grouping the points belonging to the same percentile line. The moment functions attained by the RPM are indistinguishable from the targets shown in Figure 2. The comparison between the DGM, shown in Figure 1, and the moment-matching RPM indicates excellent agreement despite only using N = 1000 data points. Note that the RPM describes well the bimodal structure of the DGM by replicating the regions where probability is highly concentrated, i.e., the regions in the upper and lower limit of the support where many percentile lines coalesce. Furthermore, the skewness of the probability mass in the interior of the support set follows the same oscillatory patterns present in the DGM.

Figure 2. Support set (green), mean (blue), variance (red), third-order- (black), and fourth-order-central moment (magenta) functions corresponding to the DGM (solid) and to the target (dashed-dotted).

Figure 3. Moment-matching RPM (top), moment-bounded RPM of maximal entropy (middle), and moment-bounded RPM (bottom) based on Equation (13).

Example 3: Next we study the effects of the sampling error on the empirical target $\dot{\theta}_{y(x)}$ and on the resulting staircase RPM. The observations prescribing the target functions at $x$ are weighted according to their separation from such a point (Crespo, Giesy, & Kenny 2017a). The weight is the greatest when the datum is at $x$, and it approaches zero as its separation from $x$ increases. For the functions shown in Figure 2, the number of observations having a non-negligible weight ranges from 110 to 261. To quantify the sparsity of the dataset we define the equivalent number of observations, $n_e(x)$, as


(18) 


The effects of the sparsity in the data will be quantified using the developments in Section 3.1. In particular, we will generate a maximal entropy RPM satisfying the constraints (6) and (9-12) for $N = n_e(x)$. The value of $n_e$ is inversely proportional to the size of the box of moments. For this data sequence, the value of $n_e$ ranges between 47.15 and 112. This indicates that the dataset is sparse. In contrast to the RPM at the top of Figure 3, the resulting RPM will not match moments estimated from the data but will instead realize moment functions bounded by their sample uncertainty. As such, we will refer to this RPM as a moment-bounded RPM. Figure 4 shows the 95% confidence intervals of the four moment functions. Note that the range of these intervals exhibits oscillations, reaching their largest spread near x = 0. These regions contain the moments realized by the staircase variables comprising the moment-bounded RPM, which are shown as solid lines. The first three moments vary in the interior of their confidence intervals, whereas the fourth moment stays on the lower limit. The uncertainty in the sample moments increases the expected entropy $E_x[E]$ of the RPM from -0.3642 to 0.7484.

Figure 4. Optimal mean (blue), variance (red), third-order- (black), and fourth-order-central moments (magenta) corresponding to a moment-bounded RPM using maximal entropy (solid lines), Equation (13) (dotted lines), and corresponding sampling error ranges (shaded regions).

Figure 3 shows the moment-matching RPM as well as moment-bounded RPMs of maximal entropy. Note that the most prominent features of the process, such as the peaks at the boundaries of the support set and the patterns of the lines in its interior, have faded in the latter predictor. Furthermore, note that the derivative discontinuities in the moment functions of Figure 4 yield derivative discontinuities in the percentile lines. Such discontinuities can be eliminated by using a kernel smoother or by using another cost function. Alternatively, we can calculate a moment-bounded RPM having the cost function in Equation (13). The corresponding moment functions are shown as dotted lines in Figure 4. In contrast to the moments for the maximal entropy formulation, the new moments have continuous derivatives throughout $X$. The resulting RPM, shown at the bottom of Figure 3, exhibits smooth percentile lines. This is achieved at the expense of a minor entropy reduction to $E_x[E] = 0.73$. As $n_e$ increases, the width of the confidence intervals reduces, making moment-bounded RPMs converge to the moment-matching RPM. The PDF-matching or the maximum-likelihood staircase formulations are preferable when $n_e$ is sufficiently large.

Example 4: Next we consider the reliability 
analysis (Rackwitz 2001) of an airfoil subject to 
aeroelastic flutter (Mahler, Touze, Doare, Habib, 
& Kerschen 2017). During flutter the pitch and 
plunge dynamics are coupled yielding a self-sus- 
taining limit cycle oscillation that might compro- 
mise the structural integrity of an aircraft. The 
onset of flutter depends on the free stream air- 
flow speed v, as well as inertial, geometrical, and 
material properties of the wing. These parameters 
include the plunge and pitch stiffnesses, the aero- 
dynamic lift, the location of the center of mass, 
and the location of the elastic axis. In this context, 
a reliability analysis seeks to quantify the prob- 
ability of flutter instability (i.e., probability of 
failure) given probabilistic prescriptions for the 
parameters. 

The objectives of this example are two-fold. First, we use RPMs to characterize the system response. Measurement error and model-form uncertainty make the data and the response aleatory, thereby justifying such a modeling choice. A deterministic response model along with a probabilistic description of the parametric uncertainty would enable the calculation of the failure probability. However, when the response is intrinsically aleatory, the failure probability can only be determined to lie within a range of values. Second, we explore the reduction in the range of failure probabilities resulting from ignoring a (small) percentage of the responses predicted by the RPM.

The stability of the system is evaluated by calculating the damping coefficient, $y$ (i.e., the output in the context of this paper), of the time response to a given flow speed. As in (Canor, Caracoglia, & Denoel 2015), damping is related to the real part of the eigenvalues of a linear dynamic model. Non-negative values of $y$ denote an unstable system, whereas negative values correspond to a stable system. Hence, the failure domain is defined as

$$F = \{x : y(x|v) \geq 0\}. \qquad (19)$$


The measured output $y$ depends on measurement errors, as well as epistemic and aleatory uncertainties. In this example we will divide the uncertain parameters into two groups. The primary group consists of measurable parameters having a strong influence on the output (e.g., stiffnesses), whereas the secondary group consists of parameters that are either weakly important, unmeasurable, or unknown to the analyst (e.g., measurement error, unsteady aerodynamics, etc.). In this study the primary group of parameters constitutes the input $x$. Note that variations in the secondary parameters make $y(x)$ aleatory. Hence, in contrast to a standard reliability analysis for which a parametric model explicitly prescribes the dependency of the limit state on the uncertain parameters (Sun, Wang, Rui, & Tong 2017), the limit state we are aiming to identify will not be a deterministic function of all the parameters, but instead a random process depending on the primary parameters. This implies that the failure probability will range over an interval whose spread depends upon the manner in which the RPM describing $y(x)$ crosses zero (failure boundary).

For simplicity in the analysis, x will be assumed 
to be a single non-dimensional parameter describ- 
ing the ratio of the pitch and plunge stiffnesses. 
Independent input-output pairs $D = \{x_i, y_i\}$ for $i = 1, \ldots, N$ were obtained by simulating the flutter dynamics of N = 2500 individual airfoils. The
randomness in the output stems from variability 
of not only x but also of the secondary aerody- 
namic and structural parameters affecting y. It is 
expected that such parameters mostly vary within 
10% of their nominal value. 

Figure 5 shows the input-output data for the free stream airflow speeds v = 0.75, v = 0.79, v = 0.83, v = 0.87, v = 0.91, and v = 0.95. Each chosen airfoil was evaluated at these speeds. Note that the number of data points falling into the failure domain $y \geq 0$ increases with v. In all cases, however, the y = 0 manifold is crossed by the nominal response near x = 2.8. The response of a calibrated deterministic model, to be referred to as a nominal response, is shown as a solid curve. Whereas the nominal system for all but the greatest speed crosses into the instability region at x = 2.8, somewhere in $v \in [0.91, 0.95]$ the curve flips to the opposite side of y = 0 (see the bottom plots in Figure 5). Reliability analyses using the nominal responses as the limit states will yield an abrupt discontinuity in the failure probability at that speed. This sudden change in the response is caused by a bifurcation. The manner by which the system transitions into instability (e.g., the region in x becoming unstable first) cannot be inferred from studying the nominal system.

Figure 5. Nominal system response (solid line), and N = 2500 observations (blue x) for several airflow speeds.

The data for all 6 speeds was processed and 
the resulting moment functions were calculated. 
Figure 6 shows these functions and their cor- 
responding uncertainty ranges. Note the high 
sensitivity of the functions to v. For instance, 
the variance varies considerably throughout x 
converging to a small value when x is large. Fur- 
thermore, the third-order central moment at x = 
2.15 goes from being practically zero to being 
large, first positive and then negative. Features like these, driven by the dynamics of the system, cannot be accurately described by a GP model. These functions were then used to calculate a maximal entropy RPM with 300 bins. Figure 7 shows the one-percentile curves for the resulting RPMs. The excursion of individual percentiles into the instability region prescribes the severity (i.e., what is the failure probability for a fixed value of x and v) and the manner (i.e., which x region transitions to instability first) by which flutter occurs.

Figure 6. Mean (blue), second- (red), third- (black), and fourth-order (magenta) central moments along with their sampling error ranges (shaded areas). Panels correspond to v = 0.75, 0.79, 0.83, 0.87, 0.91, and 0.95.

Figure 7. RPMs for several airflow speeds.

The crossing $y(x \mid v) = 0$ occurs over a range of x values. Let us formalize this notion by defining the $\tau$-percentile of the RPM as


$y_\tau(x \mid v) = F^{-1}_{y(x \mid v)}(\tau/100),$ (20)


where $F_{y(x \mid v)}$ is the distribution of the RPM for speed v, and $\tau \in [0, 100]$ is the percentile of interest. Hence, $y_{100}(x \mid v)$ is the upper limit of the RPM and $y_0(x \mid v)$ is the lower limit.

The failure domain associated with the $\tau$-percentile at airspeed v is $F_{\tau,v} = \{x : y_\tau(x \mid v) \geq 0\}$, whereas the corresponding failure probability is


$P[F_{\tau,v}] = \int_{F_{\tau,v}} f_x(x)\,dx,$ (21)


where $f_x(x)$ is the PDF of x. $f_x(x)$ can be prescribed according to the available data or to expert opinion*. Modeling the response as an RPM yields the failure probability range


$r(v) = \left[ P[F_{0,v}],\; P[F_{100,v}] \right].$ (22)


This range can be readily computed for any $f_x(x)$. For instance, if x is a Beta random variable with hyper-parameters 3 and 3 and support [0, 3], we obtain r(0.75) = [0.0040, 0.0187], r(0.77) = [0.0059, 0.8814], and r(0.79) = [0.0024, 0.9948]. These ranges account for all predicted outputs regardless of their likelihood. The upper limit of some of these ranges is distressingly close to one. However, it is possible that a very small portion of the predicted responses is responsible for most of the spread in the failure probability range. At this point the analyst might want to contemplate the following questions: If we are willing to ignore some of the worst predicted outputs, what will be the corresponding reduction in the failure probability range? How large should such a reduction be (if any) to justify taking such a risk? By worst-case outputs we mean those leading to the largest decrease in the upper limit of the failure probability range. In this context, the risk, to be denoted as $\epsilon \in [0, 100]$, is the percentage of the worst-case predicted responses the analyst is willing to neglect. The analyst might be willing to accept a small risk provided that the corresponding reduction in the failure probability is sufficiently large. This will be the case when the limits of r(v) are prescribed by extreme, low-probability events occurring at the long tail of a distribution. For a given $\epsilon$, the failure probability range is given by


$r(v, \epsilon) = \left[ P[F_{0,v}],\; P[F_{100-\epsilon,v}] \right].$ (23)


*If the realizations of x in D were controlled to ensure a good coverage of the response function over the domain of interest X, they should not be used to prescribe a naturally occurring $f_x(x)$, e.g., variations resulting from a manufacturing process.

Figure 8. Failure probability vs. risk for several airflow speeds.

Figure 8 shows $P[F_{100-\epsilon,v}]$ as a function of the risk $\epsilon$ for several airflow speeds. This figure
enables making informed risk-based decisions 
regarding the reliability assessment of the system. 
For instance, the range of failure probabilities 
corresponding to a zero risk for v = 0.77 is r(0.77, 
0) =[0.0059, 0.8814]. These two values correspond 
to the points on the v = 0.77 curve for which the 
risk is 100 (smallest failure probability) and zero 
(largest failure probability). This level of risk 
implies that all predicted outcomes are accounted 
for. If the analyst is willing to ignore the worst 2 
percent of the outputs, Figure 8 leads to r(0.77, 2) 
= [0.0059, 0.0193]. These two values correspond 
to the points on the v = 0.77 curve for which the 
risk is 100 (smallest failure probability) and 2 
(largest failure probability). Therefore, a risk of 
2 percent decreases the largest failure probability 
by 0.8621. This illustrates that a small percentage 
of the predicted responses is responsible for most 
of the spread in the range of failure probabilities, 
and that a risk averse approach might yield an 
overly conservative prediction. Whereas the zero- 
risk interval contains all predicted responses, the 
small-risk interval contains most of the responses 
more tightly. This information enables the analyst 
to avoid making overly conservative assessments 
driven by extreme low-probability events rarely 
seen in practice. 
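
The failure probability in Eq. (21) is an integral of the input PDF over a failure domain. The following Python lines are a minimal sketch of that calculation for the Beta(3, 3) input on [0, 3] used above. The failure intervals are hypothetical placeholders chosen for illustration; they are not the domains produced by the RPM in this paper, so the printed range will not reproduce the values quoted in the text.

```python
def beta33_cdf(u):
    """CDF of a Beta(3, 3) variable on [0, 1]: 10u^3 - 15u^4 + 6u^5."""
    u = min(max(u, 0.0), 1.0)  # clamp to the unit interval
    return 10 * u**3 - 15 * u**4 + 6 * u**5


def failure_probability(intervals, upper=3.0):
    """Integrate the Beta(3, 3) PDF, rescaled to [0, upper], over a union
    of disjoint x-intervals that make up a failure domain."""
    return sum(beta33_cdf(b / upper) - beta33_cdf(a / upper)
               for a, b in intervals)


# Hypothetical failure domains for the lower and upper limits of the RPM
# (assumed values, NOT taken from the paper).
lower_limit_domain = [(2.9, 3.0)]
upper_limit_domain = [(2.6, 3.0)]

failure_range = (failure_probability(lower_limit_domain),
                 failure_probability(upper_limit_domain))
print(failure_range)
```

Because the failure domain of a higher percentile contains that of a lower one, the first entry of the range never exceeds the second.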


5 CONCLUSIONS


This paper illustrates the use of staircase RPMs by 
applying them to the reliability analysis of an aeroe- 
lastic structure. The ability of the staircase varia- 
bles to describe skewed and multimodal responses 
over an input-dependent interval makes them well 
suited for structural dynamics and controls appli- 
cations. We consider the case in which the predic- 
tor is designed to match sample moments exactly 
(a setting applicable to large datasets), as well as 
the case in which the predictor accounts for the 
uncertainty in those estimates (a setting applicable 
to sparse datasets). The versatility and low computational cost of the proposed framework make it appropriate for a wide range of applications in science and engineering.

ABSTRACT: In reliability engineering, load-sharing is typically associated with a system in parallel configuration. Examples include bridge support structures, electric power supply systems, multiprocessor computing systems, etc. We consider a reliability maximization problem for a high-voltage commutation device, wherein the total voltage across the device is shared by the components in series configuration. Here, the increase of the number of load-sharing components increases component-level reliability (as the voltage load per component reduces) but may decrease system-level reliability (due to the increased number of components in series). We review optimal solutions for the proportional hazard and accelerated life models with the underlying exponential and Weibull distributions and elaborate on the log-linear, power, and Eyring laws used in the life-load models.


1 INTRODUCTION 


A load-sharing system is typically associated with a system in parallel configuration. In their renowned text on System Reliability Theory, Rausand and Hoyland (2009) write “Consider a parallel system with two identical components. The components share a common load.” Another famous text by Kapur and Lamberson (1977) also treats load-sharing systems exclusively as parallel systems.

Examples of parallel load-sharing systems include but are not limited to: civil engineering [e.g., structures (Chen & Lui 2005)], materials engineering [e.g., composites with fiber bundles (Phoenix & Tierney 1983)], power engineering [e.g., distributed generation systems (Marwalli & Keyhani 2004)], and computer/network engineering [e.g., multiprocessor computing systems (Eager et al. 1986)]. Load-sharing systems are also often discussed in the context of (still parallel) systems with k-out-of-n configuration, e.g., Huang & Xu (2010) and Amari & Bergman (2008).

In many applications of electrical and power engineering, namely in high voltage power equipment, one often runs into the problem of switching high voltages by solid-state devices, where the system voltage (10-100 kV) well exceeds the operating voltages of individual switching components (1-3 kV). One of the commonly practiced ways to increase the voltage capability of high voltage switching devices is to put single switching components in a series configuration. Shown in Figure 1 is a high voltage thyristor switch (Gurevich & Krivtsov 1991), wherein the total switching voltage is shared by the serially connected thyristors.

In this case, the increase of the number of thyristors increases thyristor-level reliability (as the voltage load per thyristor reduces) but may decrease system-level reliability (due to the increased number of components in series). Clearly, the system reliability function in this case should have a maximum associated with an optimal number of thyristors in series.

We derived (Krivtsov et al. 2017) optimal solutions to the two popular life-load models: the proportional hazard and the accelerated failure time with the underlying exponential and Weibull distributions. In this paper, we elaborate on the log-linear, power and Eyring laws used in the aforementioned life-load models.

Figure 1. A serial load-sharing system, wherein total commutating voltage is shared by the serially connected SCRs (thyristors).


2 PRELIMINARIES 

Let $\lambda(t; L)$ be the load-dependent failure rate of a component. Two commonly used models relating failure time to a load level are the Proportional Hazard Model (PHM) and the Accelerated Failure Time Model (AFTM).


2.1 Proportional hazard model

Under the PHM, we have:

$h(t; L) = \psi(L) \cdot \lambda(t; L_0) = \psi(L) \cdot \lambda_0(t),$ (1)

where $\lambda_0(t)$ is the baseline failure rate and $\psi(L_0) = 1$.

The commonly used model for $\psi(L)$ is the log-linear (exponential) law: $\psi(L) = \exp(\alpha_0 + \alpha_1 L)$. In fact, the exponential term can be replaced by any known positive, non-decreasing function. For example, for the log-linear law:

$\psi(L) = \exp(\alpha_0 + \alpha_1 L) = \delta \cdot \exp(\alpha_1 L)$ (2)

For the power law:

$\psi(L) = \delta \cdot L^{\alpha}$ (3)

For the linear law:

$\psi(L) = \alpha_0 + \alpha_1 L$ (4)

For the Eyring law:

$\psi(L) = L^{-1} \exp(\alpha_0 + \alpha_1 / L)$ (5)

The baseline failure rate can follow any time-varying function. Further, for the cumulative hazard and the reliability functions, we have:

$H(t; L) = \psi(L) \cdot H_0(t)$ (6)

$R(t; L) = \exp\{-H(t; L)\} = [R_0(t)]^{\psi(L)}$ (7)


2.2 Accelerated failure time model 


Under this model, the effect of the load is multiplicative in time. The reliability function is expressed as:

$R(t; L) = R_0(t \cdot \phi(L))$ (8)

where $R_0(\cdot)$ is the reliability function at the baseline load. Function $\phi(L)$ represents the acceleration factor at load L. Without loss of generality, we can assume that $\phi(L_0) = 1$. When there is only one type of load, commonly used forms of $\phi(L)$ include the power law:

$\phi(L) = \delta \cdot L^{\alpha}$ (9)

the log-linear law:

$\phi(L) = \delta \cdot \exp(\alpha L)$ (10)

and the Eyring law:

$\phi(L) = L^{-1} \exp(\alpha_0 + \alpha_1 / L)$ (11)

Finally, for the hazard functions we have:

$H(t; L) = H_0(t \cdot \phi(L))$ (12)

$h(t; L) = \phi(L) \cdot h_0(t \cdot \phi(L))$ (13)


It must be noted that if the baseline distribution 
is Weibull (or Exponential) and the multiplicative 
factor (acceleration factor) follows the power law, 
then the AFTM and PHM coincide. However, 
in general, there is no direct duality between the 
models. 
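
The coincidence of the two models for a Weibull baseline can be checked numerically. The sketch below assumes a Weibull baseline with cumulative hazard $(t/\eta)^{\beta}$ and a power-law acceleration factor; the function names and parameter values are illustrative assumptions, not taken from the paper. It relies on the identity $R_0(t\phi) = [R_0(t)]^{\phi^{\beta}}$, so the matching PHM multiplier is $\psi = \phi^{\beta}$, which is again a power law in the load.

```python
import math


def weibull_r0(t, eta, beta):
    """Baseline Weibull reliability R0(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))


def reliability_phm(t, psi, eta, beta):
    """PHM: R(t; L) = R0(t) ** psi(L)."""
    return weibull_r0(t, eta, beta) ** psi


def reliability_aftm(t, phi, eta, beta):
    """AFTM: R(t; L) = R0(t * phi(L))."""
    return weibull_r0(t * phi, eta, beta)


def power_law(load, delta, alpha):
    """Power-law factor delta * L^alpha."""
    return delta * load ** alpha


# Illustrative (assumed) parameter values.
eta, beta = 1.0, 1.5
phi = power_law(2.0, delta=0.8, alpha=1.2)  # AFTM acceleration factor
psi = phi ** beta                           # equivalent PHM multiplier

pairs = [(reliability_aftm(t, phi, eta, beta),
          reliability_phm(t, psi, eta, beta)) for t in (0.1, 0.5, 1.0)]
print(pairs)
```

Each pair printed by the sketch holds the AFTM and PHM reliabilities at the same mission time; they agree to floating-point precision.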


2.3 System reliability under series load-sharing


Consider a load-sharing series system with n components. The total load on the system is $L_s = V$. The load is equally distributed between the components. Hence, the load on each component is:

$L = \frac{L_s}{n} = \frac{V}{n}$ (14)

and the system’s reliability is:

$R_s(t; V, n) = [R(t; V/n)]^n$ (15)

Note that for fixed t, function $R(t; V/n)$ is an increasing function in n. However, function $R^n$ is a decreasing function in n. Hence, there is an optimal value of n that maximizes the system reliability.

Alternatively, one can consider a logarithmic function of system reliability:

$\ln R_s(t; V, n) = n \cdot \ln R(t; V/n)$ (16)

The right hand side of the above equation has two product terms. The first term increases with n and the second term decreases in magnitude with n. Further, it implies that

$H_s(t; V, n) = n \cdot H(t; V/n)$ (17)

To maximize the system reliability, we need to minimize the corresponding cumulative hazard function. Again, it has two product terms. The first term increases with the number of components and the second term decreases with the number. Hence, there exists an optimal value that minimizes the cumulative hazard function and maximizes the system reliability.


3 OPTIMAL NUMBER OF COMPONENTS 
IN A SERIES LOAD-SHARING SYSTEM 


3.1 Load-sharing under PHM 


The system reliability in this case is:

$R_s(t; V, n) = [R(t; V/n)]^n = \left[ R_0(t)^{\psi(V/n)} \right]^n = [R_0(t)]^{n \cdot \psi(V/n)}$ (18)

It follows that for fixed t, $R_0(t)$ is also fixed. Hence, for fixed t, to maximize the system reliability, we need to minimize $g(n) = n \cdot \psi(V/n)$. Thus, the optimal n is independent of the form of the underlying failure time distribution $R_0(t)$ and is also independent of mission time t.

Under the log-linear law, we have:

$\psi(L) = \exp(\alpha_0 + \alpha_1 L) = \delta \cdot \exp(\alpha_1 L)$ (19)

For notational simplicity, hereafter we'll use $\alpha$ in the place of $\alpha_1$. Hence,

$\psi(L) = \psi(V/n) = \delta \cdot \exp(\alpha \cdot V/n)$ (20)

$g(n) = n \cdot \psi(V/n) = n\delta \cdot \exp(\alpha \cdot V/n)$ (21)

The minimum of g(n) can be obtained as

$\frac{d}{dn} g(n) = \delta \cdot \exp\left(\alpha \frac{V}{n}\right) + n\delta \cdot \exp\left(\alpha \frac{V}{n}\right) \cdot \left(-\frac{\alpha V}{n^2}\right)$

$\frac{d}{dn} g(n) = 0 \Rightarrow n = \alpha \cdot V$ (22)

Thus, the optimal n that maximizes system reliability is

$n = \|\alpha \cdot V\|$ (23)

where $\|\cdot\|$ is the nearest integer function.

Figure 2 shows system-level reliability as a function of the number of the load-sharing components in series (with $\delta = 1$, $\alpha = 2$ and V = 3) for various values of component reliability. As expected, the optimal number does not depend on the latter.
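
The closed-form result can be cross-checked by brute force. The sketch below evaluates the PHM system reliability $[R_0(t)]^{n\psi(V/n)}$ under the log-linear law for the parameter values quoted for Figure 2 and searches over integer n. The function names, the search limit `n_max`, and the three baseline reliabilities are assumptions made for illustration.

```python
import math


def system_reliability_phm(n, r0, V, delta, alpha):
    """Series load-sharing under PHM with the log-linear law:
    R_s = r0 ** (n * psi(V/n)), with psi(L) = delta * exp(alpha * L)."""
    psi = delta * math.exp(alpha * V / n)
    return r0 ** (n * psi)


def optimal_n(r0, V, delta, alpha, n_max=50):
    """Integer n in 1..n_max that maximizes the system reliability."""
    return max(range(1, n_max + 1),
               key=lambda n: system_reliability_phm(n, r0, V, delta, alpha))


# Parameter values quoted for Figure 2; baseline reliabilities are assumed.
best = {r0: optimal_n(r0, V=3.0, delta=1.0, alpha=2.0)
        for r0 in (0.9, 0.99, 0.999)}
print(best)
```

For these values the search returns the same n for every baseline reliability, and it agrees with the nearest integer to $\alpha \cdot V$.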

It can be shown that for a PHM in the power law form, depending on model parameters, the optimal n will be either 1 or $\infty$. The linear law has a similar behavior. In both cases, a cost function can be used as regularization.
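
The power-law and linear-law statements follow from the sign of the derivative of $g(n) = n \cdot \psi(V/n)$. A short worked derivation, assuming $\delta > 0$, $V > 0$ and treating n as a continuous variable:

```latex
% Power law: \psi(L) = \delta L^{\alpha}
g(n) = n\,\delta \left(\frac{V}{n}\right)^{\alpha} = \delta V^{\alpha}\, n^{1-\alpha},
\qquad
\frac{d}{dn} g(n) = (1-\alpha)\,\delta V^{\alpha}\, n^{-\alpha}.
% The derivative never changes sign for n > 0:
%   \alpha > 1: g decreases in n, so the minimizer is n \to \infty;
%   \alpha < 1: g increases in n, so the minimizer is n = 1;
%   \alpha = 1: g is constant and every n is equivalent.

% Linear law: \psi(L) = \alpha_0 + \alpha_1 L
g(n) = n\left(\alpha_0 + \alpha_1 \frac{V}{n}\right) = \alpha_0\, n + \alpha_1 V,
\qquad
\frac{d}{dn} g(n) = \alpha_0.
% Again monotone in n: n = 1 if \alpha_0 > 0, and n \to \infty if \alpha_0 < 0
% (as long as \psi stays positive).
```

In both cases g(n) is monotone, which is why an interior optimum only appears once an additional term, such as a cost that grows with n, is added.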


3.2 Series load-sharing under AFTM 
The system reliability in this case is:

$R_s(t; V, n) = [R(t; L)]^n = [R_0(t \cdot \phi(L))]^n = [\exp\{-H_0(t \cdot \phi(L))\}]^n$ (24)

$R_s(t; V, n) = \exp\{-n \cdot H_0(t \cdot \phi(L))\} = \exp\{-H_s(t; V, n)\}$ (25)

where

$H_s(t; V, n) = n \cdot H_0(t \cdot \phi(L))$ (26)

From the equation above, it follows that a) maximizing the system reliability is equivalent to minimizing the cumulative hazard function and b) unlike the PHM, the optimal value depends on the form of the underlying failure time distribution.

Figure 2. System-level reliability as a function of the number of the load-sharing components in series (with $\delta = 1$, $\alpha = 2$ and V = 3) for various values of component reliability. Optimal n is marked with the asterisk.

3.2.1 AFTM with power law and the underlying 
exponential distribution 

For the exponential distribution, we have:

$H_0(t) = \lambda t$ (27)

Hence,

$H_s(t; V, n) = n \cdot \lambda t \cdot \phi(L) = n \cdot \lambda t \cdot \phi(V/n)$ (28)

In this case, maximizing the system reliability is equivalent to minimizing $g(n) = n \cdot \phi(V/n)$. The optimal n is independent of the mission time or even the baseline hazard function.

Now, for the power law:

$\phi(L) = \delta \cdot L^{\alpha}$ (29)

and

$g(n) = n \cdot \phi(V/n) = n\delta \cdot (V/n)^{\alpha}$ (30)

Similar to the PHM with the power law, depending on model parameters, the optimal n will again be either 1 or $\infty$. The cost function can be used to regularize this case.


3.2.2 AFTM with log-linear law and the 
underlying exponential distribution 
For the log-linear law:

$\phi(L) = \delta \cdot \exp(\alpha L)$ (31)

and

$g(n) = n \cdot \phi(V/n) = n\delta \cdot \exp(\alpha \cdot V/n)$ (32)

Note that the functional form of g(n) is the same as in the PHM model with the log-linear law. Thus, the optimal n that maximizes system reliability is

$n = \|\alpha \cdot V\|$ (33)


3.2.3. AFTM with Eyring law and the underlying 
exponential distribution 


@(L) = L'exp(a@, +a IL) (34) 


The minimum of g(n) is arg (Lg (n) =0). 


yet X(n =exp(a,+a@,4). Then with <g(n)= 
xW (2n +@,), it becomes easy to show that e opti- 
mal n that maximizes system reliability is 
| als 
2 (36) 


3.2.4 AFTM with power law and the underlying 
Weibull distribution 
For the Weibull distribution, we have:

$H_0(t) = (t/\eta)^{\beta}$ (37)

Hence,

$H_s(t; V, n) = n \cdot (t/\eta)^{\beta} \cdot [\phi(V/n)]^{\beta}$ (38)

From the equation above, it follows that a) maximizing the system reliability is equivalent to minimizing $g(n) = n \cdot [\phi(V/n)]^{\beta}$, and b) the optimal value of n is independent of the mission time and the scale parameter of the Weibull distribution.

For the power law:

$\phi(L) = \delta \cdot L^{\alpha}$ (39)

and

$g(n) = n \cdot [\delta \cdot (V/n)^{\alpha}]^{\beta}$ (40)

Depending on model parameters, the optimal n will again be either 1 or $\infty$. The cost function can be used to regularize this case.


3.2.5 AFTM with log-linear law and the underlying Weibull distribution


For the log-linear law:

$\phi(L) = \delta \cdot \exp(\alpha L)$ (41)

and

$g(n) = n \cdot [\phi(V/n)]^{\beta} = n \cdot \delta^{\beta} \cdot \exp\{\alpha \beta V / n\}$ (42)

$\frac{d}{dn} g(n) = 0 \Rightarrow n = \alpha \cdot \beta \cdot V$ (43)

Thus, the optimal n that maximizes the system reliability is:

$n = \|\alpha \cdot \beta \cdot V\|$ (44)

Figure 3. System-level reliability as a function of the number of the load-sharing components in series (with $\alpha = 2$ and V = 3) for various values of Weibull shape parameter with the scale parameter of $\eta = 1$ and mission time t = 0.1. Optimal n is marked with the asterisk.
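
As a numerical cross-check of Eq. (44), the sketch below minimizes the system cumulative hazard $n \cdot (t\,\phi(V/n)/\eta)^{\beta}$ over integer n for several Weibull shape parameters. The values $\alpha = 2$, V = 3, $\eta = 1$ and t = 0.1 are those quoted for Figure 3; $\delta = 1$, the list of shape parameters, the search limit and the function names are assumptions made for illustration.

```python
import math


def system_cumulative_hazard(n, t, V, delta, alpha, eta, beta):
    """H_s = n * (t * phi(V/n) / eta) ** beta with the log-linear law
    phi(L) = delta * exp(alpha * L) and a Weibull baseline."""
    phi = delta * math.exp(alpha * V / n)
    return n * (t * phi / eta) ** beta


def optimal_n(t, V, delta, alpha, eta, beta, n_max=60):
    """Integer n in 1..n_max that minimizes the system cumulative hazard,
    which is equivalent to maximizing the system reliability."""
    return min(range(1, n_max + 1),
               key=lambda n: system_cumulative_hazard(
                   n, t, V, delta, alpha, eta, beta))


shapes = (0.5, 1.0, 1.5, 2.0)  # assumed Weibull shape parameters
best = {b: optimal_n(t=0.1, V=3.0, delta=1.0, alpha=2.0, eta=1.0, beta=b)
        for b in shapes}
print(best)
```

Working with the cumulative hazard rather than the reliability avoids floating-point underflow for small n, where the per-component load is very high.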

Figure 3 shows system-level reliability as a function of the number of the load-sharing components in series (with $\alpha = 2$ and V = 3) for various values of Weibull shape parameter with the scale parameter of $\eta = 1$ and mission time t = 0.1.

ABSTRACT: In an engineering system, multiple components may fail simultaneously due to a shared cause or Common Cause (CC). This kind of failure is referred to as a Common-Cause Failure (CCF), and it contributes greatly to the system unreliability. Due to the insufficiency of historical data and system complexities, uncertainty is an inevitable problem in system reliability analysis. This paper proposes a Valuation-Based System (VBS) method to incorporate CCFs explicitly in system reliability analysis considering parametric uncertainty (related to reliability data of components) and model uncertainty (related to the system structure). The method can model different kinds of uncertainties, has no limitation on the type of failure distributions of system components, and allows the relationship between CCs to be s-independent, s-dependent or mutually exclusive.


1 INTRODUCTION 


Systems can be subject to Common-Cause Failures (CCFs), where multiple components fail or malfunction simultaneously due to a shared cause or Common Cause (CC). For example, redundancy is often used to improve system reliability, but it can also have negative effects because identical components may fail simultaneously due to a CC. The presence of CCFs makes the system more prone to failure. Therefore, it is important to take into account the contributions of CCFs in system reliability analysis. There are two types of CCFs in engineering systems: external CCFs, which are caused by external factors such as extreme weather and human errors, and internal CCFs, which are caused by propagated failures within the system such as the functional dependencies or physical dependencies between components (Misra 2008, Xing & Levitin 2013). In this paper, only external CCFs are taken into account.

CCFs have received considerable attention in system reliability analysis (Vaurio 2003, Misra 2008). Xing et al. (2007) summarized the limitations of existing CCF models, including being concerned with a specific system structure (Xie, Zhou, & Wang 2005), being applicable only to exponential time-to-failure distributions (Anderson & Agarwal 1992), having combinatorial explosion problems (Dai, Xie, Poh, & Ng 2004), limiting components to belonging to at most one Common-Cause Group (CCG) (Vaurio 1998), having a single CC (Amari, Dugan, & Misra 1999), and defining CCs as being s-independent or mutually exclusive (Vaurio 2003). Motivated by the limitations of existing methods, Xing et al. (Xing, Meshkat, & Donohue 2007, Xing 2008, Xing, Shrestha, Meshkat, & Wang 2009) proposed a Fault Tree (FT)-based Efficient Decomposition and Aggregation (EDA) approach, which uses the Total Probability Theorem to decompose an original reliability problem with CCFs into a number of reduced reliability problems that need not consider the effect of CCFs. However, in the EDA approach, models of some reduced reliability problems share common sub-models, and these sub-models are stored and evaluated several times. To improve the efficiency of the EDA approach, Mo and Xing (2013) extended it by proposing an enhanced Decision Diagram-based (DD-based) method in which a single compact DD is generated to model all reduced reliability problems sharing their isomorphic sub-DDs.

Uncertainty analysis is challenging in reliability and risk analysis of complex systems (Qiu et al. 2014, Qiu et al. 2015, Qiu et al. 2018). Different kinds of uncertainties are present in reliability studies for many reasons, such as the randomness in phenomena and the insufficiency of data. As summarized in Pate-Cornell (1996), uncertainty is usually classified into two types: aleatory uncertainty and epistemic uncertainty. Aleatory uncertainty is due to the inherent variation in physical systems. It is also called irreducible uncertainty because it cannot be reduced. Epistemic uncertainty arises from the lack of knowledge. It is also called reducible uncertainty because it can be reduced by acquiring knowledge. Reliability parameters of components generally come from statistics, experiments, expert opinions, similar components, etc. Therefore, uncertainties related to reliability parameters of components may be aleatory or epistemic. In previous CCF studies, only aleatory parametric uncertainty quantified by probability was taken into account. When it is difficult to measure the probabilities of CCs accurately, parametric models such as the beta-factor model and the binomial failure rate model have been widely used to quantitatively analyze CCFs (Misra 2008). However, besides the aleatory uncertainty coming from randomness, there may exist epistemic uncertainty coming from the insufficiency of data and information in real systems. Because probability alone cannot distinguish between aleatory and epistemic uncertainties, probabilistic methods are not appropriate for analyzing CCFs when both are present. Therefore, non-probabilistic methods need to be developed for CCF analysis considering aleatory and epistemic uncertainties.

Our proposed non-probabilistic method is based on the Valuation-Based System (VBS). VBS was first introduced by Shenoy (1989) as a framework for representation of and reasoning with knowledge under uncertainty. Different types of uncertainties can be modeled and quantified in the framework of VBS using different uncertainty theories, including probability theory, imprecise probability theory, belief functions theory, possibility theory, etc. The proposed VBS method is suitable for analyzing CCFs considering both aleatory and epistemic uncertainties, has no limitations on the type of failure distributions of system components, and allows the relationship between CCs to be s-independent, s-dependent, or mutually exclusive. In the proposed VBS method, n CCs are modeled as n basic events and their three kinds of relationships are also considered in the model.

The structure of this paper is organized as fol- 
lows. Section 2 presents the basic notions of the 
VBS. Section 3 presents an illustrative compu- 
ter system. Section 4 presents the proposed VBS 
method and applies it to the illustrative example. 
Section 5 gives some conclusions and perspectives. 


2 VALUATION-BASED SYSTEM 


Shenoy (1989) introduced VBS as a framework for representation of and reasoning with knowledge under uncertainty. This framework belongs to the family of graphical models. In a VBS, a set of variables and a set of valuations assigned to subsets of the variables are used to represent knowledge. Reasoning with this knowledge means finding the marginals of the joint valuation for the variables of interest using two operators called combination and marginalization. VBS can represent different types of uncertainties in different domains including probabilities, basic probability assignments (bpas), possibility values, etc. The detailed definitions and features of VBS in reliability analysis are discussed as follows (Qiu et al. 2017, Qiu et al. 2017).


2.1 Basic probability assignments (bpas) 


A reliability analysis problem can be modeled using a finite set of variables that represent the components and system states. For a variable X, the frame of discernment $\Omega_X$ denotes the set of all the possible values that X can take. The mapping $m^{\Omega_X} : 2^{\Omega_X} \to [0, 1]$ is called a bpa if

$\sum_{A \in 2^{\Omega_X}} m^{\Omega_X}(A) = 1.$

The set A can be an event or a subset of events. Indeed, a bpa $m^{\Omega_X}$ assigns a mass to each element A of $2^{\Omega_X}$ such that $m^{\Omega_X}(A)$ represents the subjective probability assigned to the information which exactly supports A.

The lower bound of the probability over a set A on $\Omega_X$ represents the sum of all bpas of subsets that imply A. It is computed as follows (Shafer 1976)


$\underline{P}(A) = \sum_{B \mid B \subseteq A} m^{\Omega_X}(B), \quad \forall A, B \subseteq \Omega_X$ (1)


The upper bound of the probability over A on $\Omega_X$ is defined as the total amount of bpas of subsets that are consistent with A. It is computed as follows


$\overline{P}(A) = \sum_{B \mid B \cap A \neq \emptyset} m^{\Omega_X}(B), \quad \forall A, B \subseteq \Omega_X$ (2)


The following example explains the meaning of m, $\underline{P}$ and $\overline{P}$. Suppose that variables X and Y represent the states of Component 1 and Component 2. These variables may have two qualitative values: failed or working. The frames of discernment $\Omega_X$ and $\Omega_Y$ of X and Y are given by $\Omega_X = \Omega_Y = \{F, W\}$.

An expert assigns bpas to the two values of X as follows: $m^{\Omega_X}(\{W\}) = 0.7$, $m^{\Omega_X}(\{W, F\}) = 0.2$, $m^{\Omega_X}(\{F\}) = 0.1$. The expert also assigns bpas to the two states of Y as follows: $m^{\Omega_Y}(\{W\}) = 0.6$, $m^{\Omega_Y}(\{W, F\}) = 0.4$.

According to Eq. 1 and Eq. 2, the probability of the event A: “Component 1 is in the working state” is included in $[\underline{P}(A), \overline{P}(A)] = [m^{\Omega_X}(\{W\}),\; m^{\Omega_X}(\{W\}) + m^{\Omega_X}(\{W, F\})] = [0.7, 0.9]$. The value 0.7 represents all the information that implies the event A, whereas the value 0.9 represents all the information that is consistent with the event A according to the expert. The length of the interval $\overline{P}(A) - \underline{P}(A) = 0.2$ represents the imprecision about A.
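
The bounds in Eqs. 1 and 2 are short sums over focal sets. The following Python lines are a minimal sketch that reproduces the interval for Component 1; storing a bpa as a dictionary from `frozenset` to mass, and the function names, are choices made here for illustration.

```python
def lower_probability(bpa, event):
    """Eq. 1: total mass of the non-empty focal sets contained in the event."""
    return sum(mass for focal, mass in bpa.items()
               if focal and focal <= event)


def upper_probability(bpa, event):
    """Eq. 2: total mass of the focal sets that intersect the event."""
    return sum(mass for focal, mass in bpa.items() if focal & event)


# bpa assigned by the expert to X (state of Component 1)
m_x = {
    frozenset({"W"}): 0.7,
    frozenset({"W", "F"}): 0.2,
    frozenset({"F"}): 0.1,
}

working = frozenset({"W"})
interval = (lower_probability(m_x, working),
            upper_probability(m_x, working))
print(interval)
```

Up to floating-point rounding, the printed pair matches the interval [0.7, 0.9] obtained in the text.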


2.2 Joint basic probability assignments 


Let X and Y be two variables defined on frames $\Omega_X$ and $\Omega_Y$. Let $\Omega_X \times \Omega_Y$ be the Cartesian product of $\Omega_X$ and $\Omega_Y$. A bpa $m^{\Omega_X \times \Omega_Y}$ defined on $\Omega_X \times \Omega_Y$ is called a joint bpa, and can be seen as an uncertain relation between variables X and Y. To evaluate the VBS, joint bpas defined on the Cartesian product of variables are used to show the degree (or strength) of the relationship between variables.

We return to the same example, and we aim to evaluate the probability of the event “The system composed of components 1 and 2 is in the working state”, which is represented by a variable S. The variable S may have two qualitative values: failed or working, and its frame of discernment is given by $\Omega_S = \{W, F\}$. The variable S depends on the variables X (state of Component 1) and Y (state of Component 2). This relationship will be represented by a joint bpa $m_3$. The expert gives the following joint bpa $m_3$ in order to define the degree of relationships between the variables:


$m_3^{\Omega_X \times \Omega_Y \times \Omega_S}(\{(W_X, W_Y, W_S), (F_X, F_Y, F_S)\}) = 0.75$
$m_3^{\Omega_X \times \Omega_Y \times \Omega_S}(\{(F_X, W_Y, W_S), (W_X, F_Y, W_S)\}) = 0.2$
$m_3^{\Omega_X \times \Omega_Y \times \Omega_S}(\Omega_X \times \Omega_Y \times \Omega_S) = 0.05$


Fig. 1 represents the VBS of the example. As we can see, we have three nodes which represent the variables X, Y and S. We also have three bpas: $m_1$, $m_2$, and $m_3$, which respectively represent the bpas assigned to X, Y, and the relationship between X, Y and S.

Figure 1. VBS example.


2.3 Inference in VBS 


After constructing the VBS and assigning the bpas
of events (variables) and the joint bpas of relationships
between events (variables), we have to use inference
in order to obtain the probability of the variable of
interest. Making inference means finding marginals
for the variables of interest using the combination and
marginalization operators.
The combination of two bpas $m_1^{\Omega_X}$ and $m_2^{\Omega_Y}$,
respectively defined on $\Omega_X$ and $\Omega_Y$, is performed
as follows (Smets & Kennes 1994):

$$(m_1 \otimes m_2)^{\Omega_X \times \Omega_Y}(H) = \frac{\sum_{A \cap B = H} m_1^{\Omega_X \times \Omega_Y}(A)\, m_2^{\Omega_X \times \Omega_Y}(B)}{1 - \sum_{A \cap B = \emptyset} m_1^{\Omega_X \times \Omega_Y}(A)\, m_2^{\Omega_X \times \Omega_Y}(B)}, \quad \forall A, B \subseteq \Omega_X \times \Omega_Y \tag{3}$$


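Note that in order to combine two bpas respectively
defined on $\Omega_X$ and $\Omega_Y$, they must first be defined
on the same frame of discernment $\Omega_X \times \Omega_Y$. The
operation which extends a bpa $m^{\Omega_X}$
defined on $\Omega_X$ to $\Omega_X \times \Omega_Y$ is called the vacuous
extension and is defined as follows:

$$m^{\Omega_X \uparrow (\Omega_X \times \Omega_Y)}(B) = \begin{cases} m^{\Omega_X}(A) & \text{if } B = A \times \Omega_Y,\ \forall A \subseteq \Omega_X \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$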


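On the other hand, the marginalization operator
is used to focus the information contained in
a valuation onto a smaller domain. The
marginal (or projection) of the bpa $m^{\Omega_X \times \Omega_Y}$ on the
frame $\Omega_X$ is defined by

$$m^{(\Omega_X \times \Omega_Y) \downarrow \Omega_X}(A) = \sum_{\{B \subseteq \Omega_X \times \Omega_Y \,\mid\, \mathrm{Proj}(B \downarrow \Omega_X) = A\}} m^{\Omega_X \times \Omega_Y}(B), \quad \forall A \subseteq \Omega_X \tag{5}$$

where $\mathrm{Proj}(B \downarrow \Omega_X) = \{x \in \Omega_X \mid \exists y \in \Omega_Y,\ (x, y) \in B\}$.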

To summarize, for a variable of interest Z that
depends on two other variables X and Y, $(\otimes m)^{\downarrow \Omega_Z}$
is computed by marginalizing (projecting) onto $\Omega_Z$
the joint valuation $\otimes m$ obtained by combining
the bpas $m^{\Omega_X}$, $m^{\Omega_Y}$, and the joint bpa
$m^{\Omega_X \times \Omega_Y \times \Omega_Z}$.

We return to the same example in order to compute
the upper and lower probabilities of the
event "the system composed of components 1
and 2 is in the working state". We first combine
the bpas $m^{\Omega_X}$, $m^{\Omega_Y}$, and the joint bpa $m^{\Omega_X \times \Omega_Y \times \Omega_S}$
using Eq. 3 (after vacuous extension of each bpa
using Eq. 4). Then, we marginalize the obtained
bpa $(m_1 \otimes m_2 \otimes m_3)^{\Omega_X \times \Omega_Y \times \Omega_S}$ on the frame $\Omega_S$
using Eq. 5. Finally, using Eqs. 1 and 2, we obtain
the upper and lower probabilities of the event "the
system composed of components 1 and 2 is in the
working state" from the bpa $m^{(\Omega_X \times \Omega_Y \times \Omega_S) \downarrow \Omega_S}$ as follows:
$[\underline{P}(W_S), \overline{P}(W_S)] = [0.9, 1]$.
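
The three operators can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation: focal sets are represented as frozensets of (x, y, s) tuples, all function names are hypothetical, and the input bpas for X and Y (both components known to work) are assumed values chosen to keep the arithmetic easy to follow; they are not the expert's bpas from the example. The joint bpa uses the masses 0.75, 0.2 and 0.05 given in Section 2.2.

```python
from itertools import product

def vacuous_extension(m, axis, frames):
    """Extend a bpa on frames[axis] to the product of all frames."""
    out = {}
    for focal, mass in m.items():
        cylinder = frozenset(t for t in product(*frames) if t[axis] in focal)
        out[cylinder] = out.get(cylinder, 0.0) + mass
    return out

def combine(m1, m2):
    """Normalized conjunctive combination of two bpas on the same frame."""
    raw, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        h = a & b
        if h:
            raw[h] = raw.get(h, 0.0) + wa * wb
        else:
            conflict += wa * wb          # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("totally conflicting bpas")
    return {h: w / (1.0 - conflict) for h, w in raw.items()}

def marginalize(m, axis):
    """Project every focal set onto the coordinate `axis`."""
    out = {}
    for focal, mass in m.items():
        proj = frozenset(t[axis] for t in focal)
        out[proj] = out.get(proj, 0.0) + mass
    return out

def bel_pl(m, event):
    """Lower (belief) and upper (plausibility) probability of `event`."""
    bel = sum(w for f, w in m.items() if f <= event)
    pl = sum(w for f, w in m.items() if f & event)
    return bel, pl

frames = [("W", "F")] * 3                 # frames of X, Y and S
m_x = {frozenset({"W"}): 1.0}             # assumed: component 1 surely works
m_y = {frozenset({"W"}): 1.0}             # assumed: component 2 surely works
joint = {
    frozenset({("W", "W", "W"), ("F", "F", "F")}): 0.75,
    frozenset({("F", "W", "W"), ("W", "F", "W")}): 0.20,
    frozenset(product(*frames)): 0.05,
}
combined = combine(
    combine(vacuous_extension(m_x, 0, frames), vacuous_extension(m_y, 1, frames)),
    joint,
)
m_s = marginalize(combined, 2)
bel, pl = bel_pl(m_s, frozenset({"W"}))
print(bel, pl)
```

With these assumed inputs the second focal set of the joint bpa conflicts with the evidence on X and Y, so its mass of 0.2 is removed by normalization and the resulting interval for "S works" is approximately [0.9375, 1].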


3 AN ILLUSTRATIVE EXAMPLE 


Here we use an example of a computer system 
subject to CCFs proposed by Mo and Xing (2013). 
This system consists of three processors ($P_1$, $P_2$,
and $P_3$), two buses ($B_1$ and $B_2$), and three memory
units ($M_1$, $M_2$, and $M_3$). It works if at least two of
the three processors, at least one of the two buses, 
and at least one of the three memory units work. 
Fig. 2 shows the reliability block diagram of the 
example computer system. 

Two CCs are assumed to affect the components
in the example computer system. $CC_1$
affects processor $P_1$ and memory unit $M_1$; $CC_2$
affects processor $P_1$, bus $B_1$ and memory unit $M_2$.
Thus, the two CCGs are $CCG_1 = \{P_1, M_1\}$ and
$CCG_2 = \{P_1, B_1, M_2\}$. There are three kinds of
relationships between $CC_1$ and $CC_2$: s-independent,
s-dependent, or mutually exclusive.

Assume that all components and the system
are binary, and their frames of discernment are
$\Omega = \{0, 1\}$, where '0' denotes the failed state and '1'
denotes the working state. The frame of discernment
of each CC is $\Omega_{CC_i} = \{0_i, 1_i\}$, where '$0_i$' denotes the
non-occurrence of $CC_i$ and '$1_i$' denotes the occurrence
of $CC_i$.

Probabilities used in this example are given as
follows (Mo & Xing 2013):


• Failure probabilities of components: $q_{P_1} = q_{P_2} = q_{P_3} = 0.002$,
$q_{B_1} = q_{B_2} = 0.001$, $q_{M_1} = q_{M_2} = q_{M_3} = 0.003$.
Reliabilities of components are hereinafter
denoted by p.

• Occurrence probabilities of CCs:
$\Pr\{CC_1\} = 0.001$ and $\Pr\{CC_2\} = 0.0015$.

• If CCs are s-dependent, the conditional occurrence
probabilities are $\Pr\{CC_2 \mid CC_1\} = 0.0025$
and $\Pr\{CC_2 \mid \neg CC_1\} = 0.0015$.


Figure 2. Reliability block diagram of the example
computer system.


4 PROPOSED VBS METHOD 


This section presents the explicit VBS method to 
incorporate CCFs considering only aleatory uncer- 
tainty represented by probability, and applies it to 
the example computer system given in Section 3. 


4.1 Proposed method 


The proposed VBS method contains the following 
five steps: 


Step 1: Develop the VBS model of the studied sys- 
tem. Each CC is modeled by a basic event, and 
the relationship between CCs is modeled by a 
basic event. 

Step 2: Establish the truth table that represents
the relationship between CCs and transform
the truth table into a bpa. There are three kinds
of relationships between CCs: s-independent,
s-dependent, or mutually exclusive. In this
work, n CCs occurring s-dependently means
that $CC_i$ depends on $CC_{i-1}$. A variable R is used
to describe the three kinds of relationships.
Its frame of discernment is $\Omega_R = \{1_R, 2_R, 3_R\}$,
where '$1_R$' denotes s-independent, '$2_R$' denotes
s-dependent, and '$3_R$' denotes mutually exclusive.
The occurrence of CCs is influenced by
their relationship, e.g., if n CCs are mutually
exclusive, no two CCs can occur simultaneously.
Table 1 is the truth table for CCs under different
relationships with the corresponding conditional
probabilities.

Step 3: Establish truth tables that represent the
relationships between CCs and the components
they affect, and transform the truth tables into bpas.
A component in the system may be affected by
several CCs. For example, if a component x is
affected by $CC_1, CC_2, \ldots, CC_k$, Table 2 is the
truth table representing the relation between x
and these CCs.

Step 4: Establish truth tables that represent the 
relationships between components, subsystems, 
the system, and transform truth tables into 
bpas. For example, for a system consisting of n 
components in parallel, its truth table is shown 
in Table 3. 

Step 5: Declare bpas for leaf nodes according to 
their reliabilities, calculate the joint bpa and 
marginalize it over the frame of System. 


4.2 Illustrative example 


In this subsection, the proposed VBS method is 
used to model and evaluate the reliability of the 
example computer system. 


Table 1. Truth table for CCs under different relationships.

R     CC_1   CC_2   ...   CC_{n-1}   CC_n   Pr{CC_1 & ... & CC_n | R}
1_R   0_1    0_2    ...   0_{n-1}    0_n    Pr{¬CC_1}Pr{¬CC_2}...Pr{¬CC_{n-1}}Pr{¬CC_n}
1_R   0_1    0_2    ...   0_{n-1}    1_n    Pr{¬CC_1}Pr{¬CC_2}...Pr{¬CC_{n-1}}Pr{CC_n}
1_R   0_1    0_2    ...   1_{n-1}    0_n    Pr{¬CC_1}Pr{¬CC_2}...Pr{CC_{n-1}}Pr{¬CC_n}
1_R   0_1    0_2    ...   1_{n-1}    1_n    Pr{¬CC_1}Pr{¬CC_2}...Pr{CC_{n-1}}Pr{CC_n}
...
1_R   1_1    1_2    ...   1_{n-1}    1_n    Pr{CC_1}Pr{CC_2}...Pr{CC_{n-1}}Pr{CC_n}
2_R   0_1    0_2    ...   0_{n-1}    0_n    Pr{¬CC_1}Pr{¬CC_2|¬CC_1}...Pr{¬CC_{n-1}|¬CC_{n-2}}Pr{¬CC_n|¬CC_{n-1}}
2_R   0_1    0_2    ...   0_{n-1}    1_n    Pr{¬CC_1}Pr{¬CC_2|¬CC_1}...Pr{¬CC_{n-1}|¬CC_{n-2}}Pr{CC_n|¬CC_{n-1}}
2_R   0_1    0_2    ...   1_{n-1}    0_n    Pr{¬CC_1}Pr{¬CC_2|¬CC_1}...Pr{CC_{n-1}|¬CC_{n-2}}Pr{¬CC_n|CC_{n-1}}
2_R   0_1    0_2    ...   1_{n-1}    1_n    Pr{¬CC_1}Pr{¬CC_2|¬CC_1}...Pr{CC_{n-1}|¬CC_{n-2}}Pr{CC_n|CC_{n-1}}
...
2_R   1_1    1_2    ...   1_{n-1}    1_n    Pr{CC_1}Pr{CC_2|CC_1}...Pr{CC_{n-1}|CC_{n-2}}Pr{CC_n|CC_{n-1}}
3_R   0_1    0_2    ...   0_{n-1}    0_n    1 - \sum_{i=1}^{n} Pr{CC_i}
3_R   0_1    0_2    ...   0_{n-1}    1_n    Pr{CC_n}
3_R   0_1    0_2    ...   1_{n-1}    0_n    Pr{CC_{n-1}}
...
3_R   0_1    1_2    ...   0_{n-1}    0_n    Pr{CC_2}
3_R   1_1    0_2    ...   0_{n-1}    0_n    Pr{CC_1}


Table 2. Truth table of the relation between x and CCs.

CC_1   ...   CC_{k-1}   CC_k   x     Pr{x | CC_1 & ... & CC_k}
0_1    ...   0_{k-1}    0_k    1_x   p_x
0_1    ...   0_{k-1}    0_k    0_x   q_x
0_1    ...   0_{k-1}    1_k    0_x   1
0_1    ...   1_{k-1}    0_k    0_x   1
0_1    ...   1_{k-1}    1_k    0_x   1
...
1_1    ...   1_{k-1}    1_k    0_x   1


Table 3. Truth table of a parallel system.

x_1    ...   x_{n-1}    x_n    System   Pr{System | x_1 & ... & x_n}
0      ...   0          0      0_S      1
0      ...   0          1      1_S      1
0      ...   1          0      1_S      1
...
1      ...   1          1      1_S      1


Step 1: Fig. 3 shows the valuation network of the
example computer system in the proposed VBS
method. 15 variables and 14 bpas are used to
model the example computer system. The two CCs
and the relationship between them are modeled
separately as basic events.

Step 2: The relationship between the two CCs can be
s-independent, s-dependent, or mutually exclusive.
Table 4 is the truth table for the CCs under
different relationships with the corresponding
conditional probabilities.
The truth table in Table 4 is transformed into
the following bpa $m^{\Omega_R \times \Omega_{CC_1} \times \Omega_{CC_2}}$:

$$m(\{(1_R, 0_1, 0_2), (2_R, 0_1, 0_2), (3_R, 0_1, 0_2)\}) = 0.9975$$
$$m(\{(1_R, 0_1, 1_2), (2_R, 0_1, 1_2), (3_R, 0_1, 1_2)\}) = 0.0014985$$
$$m(\{(1_R, 0_1, 0_2), (2_R, 0_1, 0_2), (3_R, 0_1, 1_2)\}) = 0.0000015$$
$$m(\{(1_R, 1_1, 0_2), (2_R, 1_1, 0_2), (3_R, 1_1, 0_2)\}) = 0.0009975$$
$$m(\{(1_R, 1_1, 0_2), (2_R, 1_1, 1_2), (3_R, 1_1, 0_2)\}) = 0.000001$$
$$m(\{(1_R, 1_1, 1_2), (2_R, 1_1, 1_2), (3_R, 1_1, 0_2)\}) = 0.0000015$$

Step 3: $P_1$ is affected by $CC_1$ and $CC_2$. Table 5 shows
the truth table representing the relationship
between $P_1$ and the two CCs.
The truth table in Table 5 is transformed into
the following bpa $m^{\Omega_{CC_1} \times \Omega_{CC_2} \times \Omega_{P_1}}$:

$$m(\{(0_1, 0_2, 1_{P_1}), (0_1, 1_2, 0_{P_1}), (1_1, 0_2, 0_{P_1}), (1_1, 1_2, 0_{P_1})\}) = 0.998$$
$$m(\{(0_1, 0_2, 0_{P_1}), (0_1, 1_2, 0_{P_1}), (1_1, 0_2, 0_{P_1}), (1_1, 1_2, 0_{P_1})\}) = 0.002$$

The bpas of $B_1$, $M_1$ and $M_2$ are obtained in the same way.
Table 4. Truth table for two CCs under different
relationships.

R     CC_1   CC_2   Pr{CC_1 & CC_2 | R}
1_R   0_1    0_2    Pr{¬CC_1}Pr{¬CC_2} = 0.9975015
1_R   0_1    1_2    Pr{¬CC_1}Pr{CC_2} = 0.0014985
1_R   1_1    0_2    Pr{CC_1}Pr{¬CC_2} = 0.0009985
1_R   1_1    1_2    Pr{CC_1}Pr{CC_2} = 0.0000015
2_R   0_1    0_2    Pr{¬CC_1}Pr{¬CC_2|¬CC_1} = 0.9975015
2_R   0_1    1_2    Pr{¬CC_1}Pr{CC_2|¬CC_1} = 0.0014985
2_R   1_1    0_2    Pr{CC_1}Pr{¬CC_2|CC_1} = 0.0009975
2_R   1_1    1_2    Pr{CC_1}Pr{CC_2|CC_1} = 0.0000025
3_R   0_1    0_2    1 - Pr{CC_1} - Pr{CC_2} = 0.9975
3_R   0_1    1_2    Pr{CC_2} = 0.0015
3_R   1_1    0_2    Pr{CC_1} = 0.001


Table 5. Truth table of the relationship between P_1 and
CCs.

CC_1   CC_2   P_1       Pr{P_1 | CC_1 & CC_2}
0_1    0_2    1_{P_1}   p_{P_1} = 0.998
0_1    0_2    0_{P_1}   q_{P_1} = 0.002
0_1    1_2    0_{P_1}   1
1_1    0_2    0_{P_1}   1
1_1    1_2    0_{P_1}   1


Table 6. Truth table of the processor subsystem.

P_1       P_2       P_3       P     Pr{P | P_1 & P_2 & P_3}
0_{P_1}   0_{P_2}   0_{P_3}   0_P   1
0_{P_1}   0_{P_2}   1_{P_3}   0_P   1
0_{P_1}   1_{P_2}   0_{P_3}   0_P   1
0_{P_1}   1_{P_2}   1_{P_3}   1_P   1
1_{P_1}   0_{P_2}   0_{P_3}   0_P   1
1_{P_1}   0_{P_2}   1_{P_3}   1_P   1
1_{P_1}   1_{P_2}   0_{P_3}   1_P   1
1_{P_1}   1_{P_2}   1_{P_3}   1_P   1


Step 4: The relationships between components,
subsystems and the system depend on the system
structure. For example, the processor subsystem
works if at least two out of the three
processors work; the truth table of the
processor subsystem is thus given in Table 6.

The truth table in Table 6 is transformed into
the following bpa $m^{\Omega_{P_1} \times \Omega_{P_2} \times \Omega_{P_3} \times \Omega_P}$, which assigns
the whole mass to the single focal set formed by the rows of Table 6:

$$m(\{(0_{P_1}, 0_{P_2}, 0_{P_3}, 0_P), (0_{P_1}, 0_{P_2}, 1_{P_3}, 0_P), (0_{P_1}, 1_{P_2}, 0_{P_3}, 0_P), (0_{P_1}, 1_{P_2}, 1_{P_3}, 1_P), (1_{P_1}, 0_{P_2}, 0_{P_3}, 0_P), (1_{P_1}, 0_{P_2}, 1_{P_3}, 1_P), (1_{P_1}, 1_{P_2}, 0_{P_3}, 1_P), (1_{P_1}, 1_{P_2}, 1_{P_3}, 1_P)\}) = 1$$

The bpas of the bus subsystem, the
memory subsystem, and the entire system
are obtained in the same way.

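Step 5: The bpas $m^{\Omega_{P_2}}$, $m^{\Omega_{P_3}}$, $m^{\Omega_{B_2}}$, $m^{\Omega_{M_3}}$ represent the knowledge
about the states of $P_2$, $P_3$, $B_2$, $M_3$. According to
their failure probabilities, we have

$$m^{\Omega_{P_2}}(\{0_{P_2}\}) = 0.002, \quad m^{\Omega_{P_2}}(\{1_{P_2}\}) = 0.998$$
$$m^{\Omega_{P_3}}(\{0_{P_3}\}) = 0.002, \quad m^{\Omega_{P_3}}(\{1_{P_3}\}) = 0.998$$
$$m^{\Omega_{B_2}}(\{0_{B_2}\}) = 0.001, \quad m^{\Omega_{B_2}}(\{1_{B_2}\}) = 0.999$$
$$m^{\Omega_{M_3}}(\{0_{M_3}\}) = 0.003, \quad m^{\Omega_{M_3}}(\{1_{M_3}\}) = 0.997$$

The bpa $m^{\Omega_R}$ represents the knowledge about the
relationship of CCs. There are three kinds of
relationship between CCs:

- $m^{\Omega_R}(\{1_R\}) = 1$ if CCs are s-independent;
- $m^{\Omega_R}(\{2_R\}) = 1$ if CCs are s-dependent;
- $m^{\Omega_R}(\{3_R\}) = 1$ if CCs are mutually exclusive.

The system reliability is computed by combining
all 14 bpas and marginalizing the result on the
frame $\Omega_S$ of the system, $(m_1 \otimes m_2 \otimes \ldots \otimes m_{14})^{\downarrow \Omega_S}$.
If CCs are s-independent, the
obtained system unreliability is 2.44843e-05.
If CCs are s-dependent, the obtained system
unreliability is 2.44883e-05. If CCs are mutually
exclusive, the obtained system unreliability
is 2.44859e-05.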
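
As a consistency check on Step 2, the joint bpa over R, $CC_1$ and $CC_2$ can be conditioned on each value of R and compared with Table 4. The sketch below is only illustrative: the tuple encoding and names are hypothetical, and the mass of the third focal set is assumed to be 0.0000015, the value for which the six masses sum to one.

```python
# Each focal set picks one (CC1, CC2) configuration for every relationship
# value r = 1 (s-independent), 2 (s-dependent), 3 (mutually exclusive).
focal_sets = [
    ({1: (0, 0), 2: (0, 0), 3: (0, 0)}, 0.9975),
    ({1: (0, 1), 2: (0, 1), 3: (0, 1)}, 0.0014985),
    ({1: (0, 0), 2: (0, 0), 3: (0, 1)}, 0.0000015),  # assumed value, see lead-in
    ({1: (1, 0), 2: (1, 0), 3: (1, 0)}, 0.0009975),
    ({1: (1, 0), 2: (1, 1), 3: (1, 0)}, 0.000001),
    ({1: (1, 1), 2: (1, 1), 3: (1, 0)}, 0.0000015),
]

def conditional(r):
    """Pr{CC1 & CC2 | R = r}: add the masses of the focal sets containing (r, c1, c2)."""
    dist = {}
    for choice, mass in focal_sets:
        dist[choice[r]] = dist.get(choice[r], 0.0) + mass
    return dist

# Conditional probabilities as listed in Table 4.
table4 = {
    1: {(0, 0): 0.9975015, (0, 1): 0.0014985, (1, 0): 0.0009985, (1, 1): 0.0000015},
    2: {(0, 0): 0.9975015, (0, 1): 0.0014985, (1, 0): 0.0009975, (1, 1): 0.0000025},
    3: {(0, 0): 0.9975, (0, 1): 0.0015, (1, 0): 0.001},
}

total_mass = sum(mass for _, mass in focal_sets)
matches = {
    r: all(abs(conditional(r).get(cfg, 0.0) - pr) < 1e-12 for cfg, pr in table4[r].items())
    for r in (1, 2, 3)
}
print(total_mass, matches)
```

Combining this bpa with $m^{\Omega_R}(\{r\}) = 1$ keeps exactly one triple of each focal set, which is why conditioning reduces to the simple sum used in `conditional`.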


The obtained results differ from the
results in Mo and Xing (2013), because the conditional
probabilities $\Pr\{\text{System failure} \mid CCE_i\}$ in
the compared paper are incorrect.
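
Because every bpa in this subsection is either Bayesian or a deterministic truth table, the three reported unreliabilities can be cross-checked by plain total-probability enumeration. The following brute-force sketch is an assumption-laden illustration rather than the VBS computation itself: it assumes $CCG_1 = \{P_1, M_1\}$ and $CCG_2 = \{P_1, B_1, M_2\}$ as read from Section 3, that a component hit by an occurring CC fails with certainty, and all names are hypothetical.

```python
from itertools import product

Q = {"P1": 0.002, "P2": 0.002, "P3": 0.002,   # component failure probabilities
     "B1": 0.001, "B2": 0.001,
     "M1": 0.003, "M2": 0.003, "M3": 0.003}
CCG = {1: {"P1", "M1"}, 2: {"P1", "B1", "M2"}}  # assumed common-cause groups
PR_CC1, PR_CC2 = 0.001, 0.0015
PR_CC2_GIVEN_CC1, PR_CC2_GIVEN_NOT_CC1 = 0.0025, 0.0015

def cc_joint(relation):
    """Joint distribution of (CC1, CC2) occurrence for one relationship."""
    if relation == "independent":
        return {(a, b): (PR_CC1 if a else 1 - PR_CC1) * (PR_CC2 if b else 1 - PR_CC2)
                for a, b in product((0, 1), repeat=2)}
    if relation == "dependent":
        out = {}
        for a, b in product((0, 1), repeat=2):
            p2 = PR_CC2_GIVEN_CC1 if a else PR_CC2_GIVEN_NOT_CC1
            out[(a, b)] = (PR_CC1 if a else 1 - PR_CC1) * (p2 if b else 1 - p2)
        return out
    if relation == "exclusive":
        return {(0, 0): 1 - PR_CC1 - PR_CC2, (0, 1): PR_CC2, (1, 0): PR_CC1, (1, 1): 0.0}
    raise ValueError(relation)

def system_works(up):
    """2-out-of-3 processors, 1-out-of-2 buses, 1-out-of-3 memory units."""
    processors = up["P1"] + up["P2"] + up["P3"] >= 2
    buses = up["B1"] or up["B2"]
    memories = up["M1"] or up["M2"] or up["M3"]
    return bool(processors and buses and memories)

def unreliability(relation):
    names = sorted(Q)
    total = 0.0
    for (c1, c2), p_cc in cc_joint(relation).items():
        forced = (CCG[1] if c1 else set()) | (CCG[2] if c2 else set())
        free = [n for n in names if n not in forced]
        for bits in product((0, 1), repeat=len(free)):
            up = dict.fromkeys(forced, 0)      # components failed by a CC
            up.update(zip(free, bits))
            p = p_cc
            for name, bit in zip(free, bits):
                p *= (1 - Q[name]) if bit else Q[name]
            if not system_works(up):
                total += p
    return total

results = {r: unreliability(r) for r in ("independent", "dependent", "exclusive")}
for relation, value in results.items():
    print(f"{relation}: {value:.5e}")
```

Under these assumptions the enumeration agrees with the values reported in Step 5 to the printed precision, which also supports reading $P_1$ as the processor shared by both common-cause groups.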

The proposed explicit VBS method is more
efficient than the explicit FT-based methods
(Vaurio 2003, Wang, Xing, & Levitin 2014). In the
explicit FT-based methods, if there are n CCs in
which $CC_i$ affects $m_i$ components, then $\sum_{i=1}^{n} m_i$
basic events and logical gates are needed to model
CCs in the expanded FT models. In the proposed
explicit VBS method, only n basic events are
needed to model CCs. Moreover, as explained in Wang
et al. (2014), the explicit FT-based method is only
applicable to systems subject to s-independent
CCs, whereas the proposed explicit VBS method
is applicable to systems subject to s-independent,
s-dependent, or mutually exclusive CCs by introducing
a variable representing the relationship
between CCs.


Figure 3. Valuation network of the example computer
system subject to CCFs in the proposed VBS method.


4.3 Epistemic model uncertainty 


This subsection takes epistemic model uncertainty
into account. Epistemic model uncertainty
means that the analyst cannot give the precise
system structure due to a lack of knowledge or
information. For example, in the example computer
system, the analyst has difficulty in judging the
structure of the processor subsystem due to a
lack of relevant knowledge. He/she cannot judge
whether it is a parallel system or a 2-out-of-3: G
system (the subsystem works if and only if at least
two out of the three processors work). The epistemic
uncertainty of the analyst can be described
by the truth table in Table 7. Items in bold represent
the epistemic model uncertainty.

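The truth table in Table 7 is transformed into
the following joint bpa $m^{\Omega_{P_1} \times \Omega_{P_2} \times \Omega_{P_3} \times \Omega_P}$:

$$m(\{(0_{P_1}, 0_{P_2}, 0_{P_3}, 0_P), (0_{P_1}, 0_{P_2}, 1_{P_3}, 0_P), (0_{P_1}, 0_{P_2}, 1_{P_3}, 1_P), (0_{P_1}, 1_{P_2}, 0_{P_3}, 0_P), (0_{P_1}, 1_{P_2}, 0_{P_3}, 1_P), (0_{P_1}, 1_{P_2}, 1_{P_3}, 1_P), (1_{P_1}, 0_{P_2}, 0_{P_3}, 0_P), (1_{P_1}, 0_{P_2}, 0_{P_3}, 1_P), (1_{P_1}, 0_{P_2}, 1_{P_3}, 1_P), (1_{P_1}, 1_{P_2}, 0_{P_3}, 1_P), (1_{P_1}, 1_{P_2}, 1_{P_3}, 1_P)\}) = 1$$


Table 7. Truth table of the processor subsystem under
model epistemic uncertainty.

P_1       P_2       P_3       P          Pr{P | P_1 & P_2 & P_3}
0_{P_1}   0_{P_2}   0_{P_3}   0_P        1
0_{P_1}   0_{P_2}   1_{P_3}   0_P, 1_P   1
0_{P_1}   1_{P_2}   0_{P_3}   0_P, 1_P   1
0_{P_1}   1_{P_2}   1_{P_3}   1_P        1
1_{P_1}   0_{P_2}   0_{P_3}   0_P, 1_P   1
1_{P_1}   0_{P_2}   1_{P_3}   1_P        1
1_{P_1}   1_{P_2}   0_{P_3}   1_P        1
1_{P_1}   1_{P_2}   1_{P_3}   1_P        1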


All other bpas remain unchanged. The proposed
VBS method is applied to model and evaluate
the system unreliability under this epistemic
model uncertainty. Finally, the system unreliability
obtained by the VBS method is
[2.57036e-06, 2.44843e-05].
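
The interval can be reproduced with a small belief/plausibility sketch in which the processor subsystem state is set-valued whenever exactly one processor works. This is a hedged illustration, not the authors' code: it assumes s-independent CCs (the upper bound equals the s-independent result of Section 4.2), the common-cause groups $\{P_1, M_1\}$ and $\{P_1, B_1, M_2\}$, and hypothetical names throughout.

```python
from itertools import product

Q = {"P1": 0.002, "P2": 0.002, "P3": 0.002,
     "B1": 0.001, "B2": 0.001,
     "M1": 0.003, "M2": 0.003, "M3": 0.003}
# (occurrence probability, affected components) for each CC; assumed s-independent
CCS = [(0.001, {"P1", "M1"}), (0.0015, {"P1", "B1", "M2"})]

def processor_states(n_up):
    """Possible processor-subsystem states for a given number of working processors."""
    if n_up >= 2:
        return {1}
    if n_up == 1:
        return {0, 1}   # parallel structure says working, 2-out-of-3 says failed
    return {0}

def failure_bounds():
    names = sorted(Q)
    bel = pl = 0.0
    for occurred in product((0, 1), repeat=len(CCS)):
        p_cc, forced = 1.0, set()
        for hit, (pr, group) in zip(occurred, CCS):
            p_cc *= pr if hit else 1 - pr
            if hit:
                forced |= group
        free = [n for n in names if n not in forced]
        for bits in product((0, 1), repeat=len(free)):
            up = dict.fromkeys(forced, 0)
            up.update(zip(free, bits))
            p = p_cc
            for name, bit in zip(free, bits):
                p *= (1 - Q[name]) if bit else Q[name]
            rest_ok = (up["B1"] or up["B2"]) and (up["M1"] or up["M2"] or up["M3"])
            n_up = up["P1"] + up["P2"] + up["P3"]
            system_states = {s if rest_ok else 0 for s in processor_states(n_up)}
            if system_states == {0}:   # failure is certain: counts towards belief
                bel += p
            if 0 in system_states:     # failure is possible: counts towards plausibility
                pl += p
    return bel, pl

lower, upper = failure_bounds()
print(f"[{lower:.5e}, {upper:.5e}]")
```

The lower bound corresponds to the most favourable structure (parallel processors) and the upper bound to the least favourable one (2-out-of-3), so the width of the interval reflects only the analyst's uncertainty about the structure.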


5 CONCLUSION 


In this paper, a VBS method is proposed to analyze
CCFs considering aleatory and epistemic
uncertainties. In the proposed VBS method, n
CCs are modeled as n basic events and their three
kinds of relationships are also considered in the
model. The proposed VBS method is suitable for
analyzing CCFs under both aleatory and
epistemic uncertainties, places no limitations on
the type of failure distributions of system
components, and allows the relationship between CCs
to be s-independent, s-dependent, or mutually
exclusive. In the future, the proposed VBS
approach will be extended to analyze Belief
CCFs, in which the occurrence of a CC
may result in failures of different components
with different degrees of belief.
ABSTRACT: The article presents a simple method for allocating reliability and maintainability
requirements, which makes it possible to specify preliminary requirements for individual system parts in
the early phases of system development and design. The method can be used for a system made of
subsystems arranged in a serial structure, where a failure of any subsystem leads to a fault of the whole
system. No separate requirements are set for the reliability and maintainability of the whole system;
instead, a certain overall level of availability has to be achieved. The method is based on the rational
presumption that the reliability and maintainability requirements for individual subsystems should be
determined so that subsystems in which more frequent failures are expected have stricter maintainability
requirements than subsystems with a higher reliability level. The suggested method is illustrated
through a practical example.


1 INTRODUCTION 


From a dependability point of view, a decisive 
property of numerous technical systems is their 
availability defined as the ability to be in a state to 
perform as required. What is really important then 
is not the system reliability or maintainability level 
itself, but their mutual combination expressed by 
availability. 

Availability is of major importance for systems
used in non-stop operation (manufacturing
technology, electric power generation and distribution,
etc.), or for systems involved in relatively long
missions whose interruption is undesirable
(means of transport, weapon systems, etc.). During
the design of a complex weapon system, a
requirement was specified for its availability level,
and the allocation of reliability and maintainability
to individual subsystems was already required
while the system conception was being formed.

The performed allocation was expected to be 
only preliminary and the results were to serve as a 
part of input information to decide about the con- 
ceptual solution of the whole system and all its sub- 
systems. Therefore, it was required that the method 
should be easy to apply and provide only general 
information about single subsystems. Based on the 
overall requirement for the whole system availabil- 
ity level, the aim of the method was to determine 
general requirements for a single subsystems reli- 


ability and maintainability level. The method was 
also expected to be easy and fast to apply since dur- 
ing a development and design process a number of 
possible system conceptions are assumed to be for- 
mulated, therefore the method should enable us to 
assess quickly their advantages and disadvantages 
also from the dependability point of view. 

In the available literature there are different 
methods and procedures dealing with the alloca- 
tion of availability requirements. A certain part 
of these methods is based on the application of 
differently defined importance coefficients (Jigar 
et al. 2016; Barabady & Kumar 2007), or they 
use purposefully selected weighted parameters 
(Chang et al. 2009; Hagmark & Virtanen 2006). 
Genetic algorithms and their modifications,
which enable a multi-criteria decision process
to be carried out, are also used when allocating
availability requirements (Elegbede & Adjallah 2003;
Huang et al. 2009; Oliveto 1999). The multi-criteria
approach is also applied by other methods
(Chiang & Chen 2007; Mohamed et al. 2002), 
which, when allocating the requirements, observe 
whether certain optimizing aims are fulfilled, e.g. 
minimization of system ownership costs (Kumar 
et al. 2007). There are also the methods based on 
the application of fuzzy sets in combination with 
a hierarchy process (Wang et al. 2012), for exam- 
ple, or in the form of intuitionistic fuzzy optimiza- 
tion (Song et al. 2015). 
All the introduced approaches are based on the
assumption that relatively accurate information
about the system design and functions is available. In
view of this fact, the possibilities of using these methods
and procedures while a basic system conception is
being designed are very limited, since in this
phase of system development and design not all the
necessary information is available yet.

Another problem is that the methods are relatively
complex; their practical application is
by no means trivial and might be time consuming
if it is carried out for a larger number of system
arrangement variants.

In view of these facts, we have suggested a sim- 
ple method of availability requirements alloca- 
tion which enables us to specify very quickly the 
requirements for single subsystems reliability and 
maintainability with only limited input informa- 
tion about the system design. 

The requirements allocation performed with the 
use of the suggested method is only general and 
its results serve as the basis for evaluating different 
conceptions of the suggested system and selecting 
the best version. 

After the system conception is clarified, it is nec- 
essary during the system development and design 
to perform repeated allocation of availability 
requirements using sophisticated allocation meth- 
ods which will result in achieving final require- 
ments for single subsystems and their components 
reliability and maintainability. 


2 SOLUTION ASSUMPTIONS 


Solution assumptions result from the above men- 
tioned mission and the nature of the system for 
whose development and design the allocation 
method was suggested. The analysed system is 
characterized by the following properties: 


• The considered system consists of n mutually
independent subsystems (from a reliability point of
view).

• All subsystems are arranged into a serial structure
(from a reliability point of view).

• The failure of any subsystem results in the fault
of the whole system and its operation is interrupted.
At any moment only one subsystem can be in a
fault state.

• The system gets into an available state after the
restoration of the relevant subsystem.

• Possible administrative, logistic and technical
delays are not considered.

• Times between failures and times to restoration
have exponential distributions.


The inherent availability of a system characterized
in this way can be expressed by the following
formula:

$$A = \frac{MTBF}{MTBF + MTTR} = \frac{\mu}{\lambda + \mu} \tag{1}$$

where MTBF is the Mean Operating Time Between
Failures, MTTR is the Mean Time to Restoration, $\lambda$
is the failure rate and $\mu$ is the repair rate, for which the
following formulas apply:

$$\lambda = \frac{1}{MTBF} \tag{2}$$

$$\mu = \frac{1}{MTTR} \tag{3}$$


For the relation between the failure rate
of the whole system and the failure rates of the individual
subsystems of the system defined above, the following
formula applies (Vintr & Holub 2001):

$$\lambda = \sum_{i=1}^{n} \lambda_i \tag{4}$$

where $\lambda_i$ is the failure rate of the i-th subsystem and n
is the total number of subsystems. The relation
between the repair rate $\mu$ of the whole system and
the repair rates of the individual subsystems can be
expressed in a similar way. If the failure rates of the
individual subsystems are known, the formula for the
given system can be written as follows (Vintr
& Holub 2001):

$$\mu = \frac{\sum_{i=1}^{n} \lambda_i}{\sum_{i=1}^{n} \dfrac{\lambda_i}{\mu_i}} \tag{5}$$

where $\mu_i$ is the repair rate of the i-th subsystem.

The final equation characterizing the relation
between the whole system availability and the
measures of individual subsystems' reliability and
maintainability can be obtained by substituting
equations (4) and (5) into equation (1):

$$A = \frac{1}{1 + \sum_{i=1}^{n} \dfrac{\lambda_i}{\mu_i}} \tag{6}$$

This equation can also be transformed so that
it expresses the dependence of the whole
system availability A on the availabilities $A_i$ of the
individual subsystems (Vintr & Holub 2001):

$$A = \frac{1}{1 - n + \sum_{i=1}^{n} \dfrac{1}{A_i}} \tag{7}$$
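
The equivalence of the two availability formulas can be checked numerically. The sketch below assumes that equations (6) and (7) have the forms $A = 1/(1 + \sum \lambda_i/\mu_i)$ and $A = 1/(1 - n + \sum 1/A_i)$ with $A_i = \mu_i/(\lambda_i + \mu_i)$; the three failure and repair rates are hypothetical values chosen only for illustration, and the function names are not from the paper.

```python
def availability_from_rates(failure_rates, repair_rates):
    """Series-system availability from subsystem failure and repair rates (form of Eq. 6)."""
    return 1.0 / (1.0 + sum(l / m for l, m in zip(failure_rates, repair_rates)))

def availability_from_subsystems(availabilities):
    """Series-system availability from subsystem availabilities (form of Eq. 7)."""
    n = len(availabilities)
    return 1.0 / (1.0 - n + sum(1.0 / a for a in availabilities))

# Hypothetical input data, not taken from the paper.
failure_rates = [2.0e-5, 5.0e-5, 1.0e-5]
repair_rates = [0.05, 0.10, 0.02]

# Inherent availability of each subsystem, A_i = mu_i / (lambda_i + mu_i).
subsystem_availabilities = [m / (l + m) for l, m in zip(failure_rates, repair_rates)]

a_rates = availability_from_rates(failure_rates, repair_rates)
a_subsystems = availability_from_subsystems(subsystem_availabilities)
print(a_rates, a_subsystems)
```

Both routes give the same value because $\lambda_i/\mu_i = 1/A_i - 1$ for every subsystem, so the two denominators are identical.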
3 BASIC ALLOCATION PRINCIPLE


The nature of the solved task required that within 
the allocation not only the requirements for single 
systems availability would be specified, but also 
the requirements for each subsystem reliability and 
maintainability level would be determined. These 
requirements made the task solution rather com- 
plicated since at the time of preliminary allocating 
the requirements only general systems conception 
and general information about functions to be car- 
ried out by the system were known. 

This information provides at least a general idea 
of the position and the task of each subsystem and 
its presumed design complexity. When it comes to 
the maintainability of single systems, however, it 
is very difficult to draw any conclusions from this 
information. In order to overcome this obstacle, a 
general principle expressing the required relation 
between a reliability and maintainability subsys- 
tems level has been formulated. 

This principle is based on the rational requirement
that subsystems with a lower reliability
level (more frequent failures can be
expected) should have a higher maintainability
level (they will be repaired more often) than
subsystems with a higher reliability level (they will
be repaired less often). Generally, this principle
says that the mean time to restoration, as a subsystem
maintainability measure, should be directly proportional
to the mean operating time between failures,
as a subsystem reliability measure. When applying
this principle, the following should hold:

$$\frac{MTTR_i}{MTBF_i} = \frac{\lambda_i}{\mu_i} = s \tag{8}$$

where s is the required ratio between MTTR and
MTBF, which is the same for all subsystems.

After adapting equation (6) using the relation
expressed by equation (8), we
obtain the following formula:

$$A = \frac{1}{1 + ns} \tag{9}$$

Rearranging this equation gives
the formula for calculating the ratio
s from the required system availability level
A and the number of subsystems n:

$$s = \frac{1 - A}{nA} \tag{10}$$


4 ALLOCATION PROCEDURE 


The initial step of the whole process is the expert 
determination of the requirements for a whole sys- 


tem reliability level in the form of a specific value 
MTBF or a failure rate $\lambda$. This is based on the
nature of the suggested system, the way of its appli- 
cation, and the evaluation of the consequences of 
a possible operation interruption caused by a fail- 
ure. Also the information about the behaviour of 
similar systems can be used. 

The next step is to introduce the weight factor,
which represents the ratio between the failure rate
of an individual subsystem and the required failure
rate of the whole system:

$$\omega_i = \frac{\lambda_i}{\lambda} \tag{11}$$

By substituting equation (11) into
equation (4), we obtain the basic condition which
has to be fulfilled by the sum of the individual weight
factors:

$$\sum_{i=1}^{n} \omega_i = 1 \tag{12}$$

Next, the weight factors have to fulfil the following
condition:

$$\omega_i \geq \frac{\lambda_{i(\min)}}{\lambda} \tag{13}$$

where $\lambda_{i(\min)}$ represents the maximum acceptable
requirement for a subsystem reliability level, i.e. the
lowest failure rate that may be required of a subsystem. This
requirement is again determined by an expert decision
and ensures that the subsystem reliability
requirements are realistically achievable with respect
to the technologies used. In fact, the maximum
acceptable requirement for subsystem reliability has to
be determined so that $n\lambda_{i(\min)} < \lambda$ applies.

The introduced method requires that in the next 
step the weight factors for all subsystems are to 
be determined using an expert decision. However, 
the weight factors have to fulfil the requirements 
resulting from the equations (12) and (13), there- 
fore this is a rather difficult task. In order to make 
it simple, a method of subsystems point aided esti- 
mation has been used (Vintr & Holub 2001). 

When applying this method, a point value
$a_i \in [1; a_{i(\max)}]$ is allocated to each subsystem. This point
value characterizes the required subsystem reliability
level. The higher the required subsystem reliability
level when compared with other subsystems,
the higher the point value allocated to the subsystem.
The range of the point scale used, i.e. the maximum
point value $a_{i(\max)}$, is determined so that the
requirements for individual subsystems can be finely
differentiated.

For the transformation of the determined point
values into weight factor values, the following
equation is used:

$$\omega_i = a_i k + q \tag{14}$$

where the real numbers k and q are linear transformation
coefficients. Using this formula ensures that the
determined weight factors fulfil the conditions
expressed by equations (12) and (13). If formula
(14) is to be applied in practice, it is necessary
to determine the values of the coefficients k and q.
With respect to the relation between the point
evaluation of a subsystem and its reliability
level described above, the following equation
must apply:

$$\omega_{i(\min)} = a_{i(\max)} k + q \tag{15}$$

With respect to formula (11), this equation
can be further adapted as follows:

$$\frac{\lambda_{i(\min)}}{\lambda} = a_{i(\max)} k + q \tag{16}$$

If we substitute the expression from equation (14)
for $\omega_i$ in equation (12) and rearrange it
appropriately, we get the following equation:

$$k \sum_{i=1}^{n} a_i + nq = 1 \tag{17}$$

Equations (16) and (17) form a system of
two equations in two unknowns, and after
solving them, we can determine the values of the
coefficients k and q:

$$k = \frac{n\lambda_{i(\min)} - \lambda}{\lambda \left( a_{i(\max)} n - \sum_{i=1}^{n} a_i \right)} \tag{18}$$

$$q = \frac{\lambda a_{i(\max)} - \lambda_{i(\min)} \sum_{i=1}^{n} a_i}{\lambda \left( a_{i(\max)} n - \sum_{i=1}^{n} a_i \right)} \tag{19}$$

The resulting relation for determining the weight
factor is obtained by substituting the coefficients
expressed in this way into equation (14):

$$\omega_i = \frac{a_i \left( n\lambda_{i(\min)} - \lambda \right) + \lambda a_{i(\max)} - \lambda_{i(\min)} \sum_{i=1}^{n} a_i}{\lambda \left( a_{i(\max)} n - \sum_{i=1}^{n} a_i \right)} \tag{20}$$

Once the weight factor is known, it is possible
to determine both the requirement for each subsystem
reliability level:

$$\lambda_i = \omega_i \lambda \tag{21}$$

and also the maintainability level requirement,
using equations (8) and (10):

$$\mu_i = \frac{nA}{1 - A} \lambda_i \tag{22}$$
5 EXAMPLE OF THE METHOD 
APPLICATION 


The practical application of the suggested method 
will be demonstrated by the following example. 
It is required that the preliminary allocation of 
reliability and maintainability requirements for 
the system consisting of eight subsystems is to 
be performed. The system fulfils the initial pre- 
requisites for the application of this method (see 
Chapter 2). 

For the whole system, an availability level of
A = 0.997 is required and the required failure rate was
determined as $\lambda = 2.5 \cdot 10^{-4}$. The maximum acceptable
reliability level for individual subsystems was determined
in the form of a minimum acceptable value
of the subsystem failure rate, $\lambda_{i(\min)} = 1.25 \cdot 10^{-5}$.

A point evaluation of the required reliability level
of the individual subsystems was performed
using the point scale $a_i \in [1; 10]$. The
results of the point evaluation are given in Table 1.
The coefficients k and q have been calculated using
equations (18) and (19), and the following
results have been obtained: k = -0.01875 and
q = 0.2375.

The requirements for the single subsystems 
reliability and maintainability level were deter- 
mined on the basis of the point evaluation using 
the equations (20), (21) and (22). The results 
of the performed calculations are also put in 
Table 1. 


Table 1. Results of calculation.

Subsystem   Point      Weight         Failure        Repair
number i    value a_i  factor ω_i     rate λ_i       rate μ_i

1           1          2.19E-01       5.47E-05       1.45E-01
2           10         5.00E-02       1.25E-05       3.32E-02
3           7          1.06E-01       2.66E-05       7.06E-02
4           8          8.75E-02       2.19E-05       5.82E-02
5           6          1.25E-01       3.13E-05       8.31E-02
6           5          1.44E-01       3.59E-05       9.55E-02
7           2          2.00E-01       5.00E-05       1.33E-01
8           9          6.88E-02       1.72E-05       4.57E-02
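
The whole allocation can be reproduced with a short script. The sketch below is an illustrative implementation under the assumption that equations (10), (14), (18), (19), (21) and (22) have the forms used in the comments; the function and variable names are hypothetical, and the inputs are the values of the example (A = 0.997, λ = 2.5e-4, λ_i(min) = 1.25e-5, point scale up to 10).

```python
def allocate(points, availability, lam, lam_min, a_max):
    """Return k, q and per-subsystem (a_i, weight, failure rate, repair rate)."""
    n = len(points)
    total = sum(points)
    denom = lam * (a_max * n - total)
    k = (n * lam_min - lam) / denom               # Eq. (18)
    q = (lam * a_max - lam_min * total) / denom   # Eq. (19)
    s = (1.0 - availability) / (n * availability) # Eq. (10): common MTTR/MTBF ratio
    rows = []
    for a in points:
        weight = a * k + q                        # Eq. (14)
        lam_i = weight * lam                      # Eq. (21)
        mu_i = lam_i / s                          # Eq. (22): mu_i = n*A/(1-A) * lam_i
        rows.append((a, weight, lam_i, mu_i))
    return k, q, rows

points = [1, 10, 7, 8, 6, 5, 2, 9]                # point values a_i from Table 1
k, q, rows = allocate(points, availability=0.997, lam=2.5e-4,
                      lam_min=1.25e-5, a_max=10)

print(f"k = {k:.5f}, q = {q:.4f}")
for i, (a, weight, lam_i, mu_i) in enumerate(rows, start=1):
    print(f"{i}  {a:2d}  {weight:.2E}  {lam_i:.2E}  {mu_i:.2E}")

# Back-check: the allocated rates should return the required availability.
achieved = 1.0 / (1.0 + sum(lam_i / mu_i for _, _, lam_i, mu_i in rows))
```

Changing `points`, `lam_min` or the required availability and re-running the script gives a quick way to compare alternative system conceptions, which is the intended use of the method.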
6 CONCLUSIONS


The suggested method enables us to perform rela- 
tively simple preliminary allocation of reliability 
and maintainability requirements with very limited 
information about the analysed system. Therefore, 
the application of this method is suitable mainly at 
early stages of the system development and design 
when different system conceptions are taken into 
consideration. In this situation the method enables 
us to specify very quickly general requirements for 
single subsystems reliability and maintainability 
and provides inputs used for evaluating different 
conceptions of the suggested system and selecting 
the most appropriate variant. 

The whole allocation model can easily be implemented in
any spreadsheet application, which makes it easy to
evaluate how changes in the allocation input
parameters influence the results.

A drawback of this method is that most of the 
input parameters have to be determined using an 
expert decision, and the expert’s knowledge and his 
experience level inevitably influence the results of 
the performed allocation. Therefore, when deter- 
mining allocation input parameters, it is advisable 
not to involve an individual, but a group of expe- 
rienced experts. 

In conclusion, it is necessary to emphasize that
the suggested allocation method is by its nature
only preliminary; once the system conception has been
clarified, it is necessary during development
and design to repeat the allocation of availability
requirements using more sophisticated allocation
methods.
ABSTRACT: Accelerated Reliability Testing (ART) consists of tests designed to determine reliability
information about a product, such as failure intensity, failure or survival probability, or time to failure.
ART is divided into three phases: preparation, implementation and evaluation.
This paper describes the methodology for the preparation of ART of electronic components
used in combat vehicles. Before ART is performed, it is extremely important to do test planning including 
sample selection, test condition selection, ART method selection, determination of stresses level and test 
duration, etc. The successful design of an ART requires good knowledge of the intended use condition, 
environmental impact profile, operation and design of the product. Electronic components in combat 
vehicles often operate in harsh environments and they are generally expected to be subject to higher levels 
of stress than common commercial electronic components. 


1 INTRODUCTION 


Reliability is the ability to perform as required,
without failure, for a given time interval, under
given conditions. The time interval duration can be
expressed in units appropriate to the item concerned,
e.g. calendar time, operating cycles, distance run,
etc., and the units should always be clearly stated.
Given conditions include aspects that affect
reliability, such as mode of operation, stress levels,
environmental conditions, and maintenance
(IEC 60050-192). Because the reliability of a system
or device depends mainly on the reliability of its
components, evaluating component reliability is very
important for understanding the reliable life of the
overall system or device.
The reliability of electronic components is estimated
by using standard handbooks, by statistical analysis
of operation and maintenance data, or by performing
reliability testing experiments (Varde 2010). In
general, the use of standard handbooks, such as the
MIL-HDBK-217 approach, has some inherent
limitations: it does not allow simulations with
projected component load profiles; relatively large
uncertainties are associated with various parameters
and hence with the final results; it has no provisions
for assessing the root cause of component failure;
and it is not effective in predicting the reliability
of new components or new designs of conventional
components. Hence, another approach to determining
the reliability of electronic components is to use
accelerated test methods.

An accelerated test is a test in which the stress
level, or the rate of stress application, exceeds that
occurring under specified operational conditions, in
order to reduce the duration required to produce a
stress response. The objective of these methods is to
identify potential design weaknesses, to provide
information on item dependability, or to achieve the
necessary reliability/availability improvement, all
within a compressed or accelerated period of time
(IEC 62506).

Accelerated testing methods help investigate the
reliability of electronic components with regard to
certain dominant failure mechanisms under normal
operating conditions. They provide details about the
various degradation mechanisms and thereby improve
understanding of the root cause of failure
(Varde 2010).

Accelerated tests are commonly used in the
automotive and electrotechnical industries to assess
or demonstrate component and system reliability, to
detect failure modes that may occur during the life
of the product, to compare different manufacturers,
etc. For military purposes, accelerated tests can be
applied to the electronic components in combat
vehicles.

Common causes of failure in electronic components
are environmental contaminants and conditions, such
as temperature, thermal cycles, humidity, and other
stresses (e.g. vibration, ripple voltage, and
overvoltage). Based on the cumulative damage model,
expected test information, and assumptions about
product usage, accelerated test methods can be
divided into three types (IEC 62506):


• Qualitative accelerated tests (type A);

• Quantitative accelerated tests (type B);

• Quantitative time and event compressed tests
(type C).


Qualitative accelerated tests (type A) are used
primarily to identify failures and failure modes
without attempting to make any predictions about
the product’s life under use conditions (ReliaSoft
Corporation 2001). Commonly used qualitative test
methods are HALT (highly accelerated limit test),
HAST (highly accelerated stress test), and HASS/HASA
(highly accelerated stress screening/audit). In
qualitative tests the specimens are subjected to a
single severe level of stress, to multiple stresses,
or to a time-varying stress (e.g. stress cycling,
cold to hot, etc.). The results of the tests are then
used to increase the strength margin of the design.
Moreover, they provide valuable information, such as
the types and levels of stress to employ during a
subsequent quantitative test.

Quantitative accelerated tests (type B) are
designed to quantify the life characteristics of the
product, component or system under use conditions,
and thereby provide reliability information
(ReliaSoft Corporation 2001). The purpose of
quantitative accelerated testing is to estimate one
or more reliability measures, such as failure
intensity, failure probability, survival probability,
or time to failure. It can also be used to assist in
risk assessments or design comparisons.

Quantitative time compressed tests (type C1) are
achieved by excluding the “shutdown time”, i.e. by
focusing the test only on the switch-on time.
However, focusing only on the operating time can
ignore damage that occurs during downtime.
Quantitative event compressed tests (type C2) repeat
events at a greater intensity than occurs in
practical product usage. A drawback of this type of
testing is that the continuous stress can cause
failures that would not occur under normal
conditions.

Accelerated testing of type A is used in the product
design phase (development of new electronic
components, modernization of existing components by
applying new technologies). Accelerated testing of
types B and C is used to quantify the reliability
parameters of electronic components, thereby
reducing the lack of reliability information on
electronic components used in combat vehicles.


This paper presents the process of designing a 
quantitative accelerated test applied to electronic 
components in combat vehicles. 


2 ELECTRONIC COMPONENTS IN
MODERN COMBAT VEHICLES


Today’s modern combat vehicles achieve the
parameters of their major military characteristics
(tactical-technical parameters) through a high
proportion of electronic components with digital
control. The digitization of military technologies is
one of the key requirements for individual
armaments, newly purchased combat vehicles, and
weapon systems. These electronic components
contribute significantly to the reliability of the
vehicles. In-vehicle electronic systems can be
divided into control systems, protection systems and
information systems (Chloupka 2012). Electronic
control systems control mechanical, hydraulic, and
other functions (e.g. fuel injection, steering,
braking). In modern combat vehicles, the engine,
transmission, active suspension and brake systems
are controlled by electronic control units (ECUs).

Protection systems work with digital and analog
information and are not directly connected to
control links; examples are diagnostic, navigation,
security, and detection systems. Data collected by
protection systems are shared with other systems or
directly trigger protective elements. The function
of these systems is governed by a control algorithm.

Information systems work with data unrelated to
control links. The crew can track and share
information from different vehicles, which is
aggregated and shown on a display (e.g. the
vehicle’s own position on a map). The crew can also
distribute information to defined systems in the
combat vehicle (weapon systems, protection systems)
and then make decisions in combat activities.

Electronic components of in-vehicle electronic
systems are generally expected to be subject to
higher levels of stress than commercial electronic
components, mainly temperature, vibration, shock,
voltage, dust and humidity. The basic requirements
for electronic components in combat vehicles are
shown in Table 1.

Electronic components behave in the same way
whether they are used for military or civilian
purposes. Therefore, the standardized accelerated
test procedures already available for commercial
electronic components can be applied to, or modified
for, the electronic components of combat vehicles
(Vintr et al. 2013).


Table 1. The basic requirements for electronic
components in combat vehicles.

Parameter                       Value
MTTF                            15 years
Power supply                    18 V to 33 V
Ambient operating temperature   -40°C to 70°C
Storage temperature             -40°C to 80°C
Vibration                       10 G, 40 to 500 Hz
Mechanical shock                500 G, 0.2 ms to 2 ms
Repeated mechanical shock       20 G, 1 ms to 15 ms
Relative humidity               98% at 25°C
Chemical resistance             1 mg/m³
Dust resistance                 2 g/m
Impulse voltage resistance      70 V/3 ms


3 MODELS FOR ACCELERATED 
RELIABILITY TESTING 


These models describe the stress experienced during
the life of a product when the damage accumulated
per unit of test time is accelerated by increasing
the stress level. The underlying assumption when
using any of these models is that components
operating under normal conditions experience the
same failure mechanisms as those occurring under
the accelerated stress conditions. It is also assumed
that the time-scale transformation, i.e. the
acceleration factor, is constant, which implies
linear acceleration (Chaluvadi 2008).
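
Under the constant acceleration factor assumption, test time maps linearly onto equivalent use time. The following is a minimal Python sketch of that mapping; the function names and the numerical values are illustrative assumptions, not taken from the paper.

```python
def equivalent_use_time(test_time_h: float, acceleration_factor: float) -> float:
    """Use-condition time represented by a given accelerated test time."""
    if acceleration_factor <= 0:
        raise ValueError("acceleration factor must be positive")
    return acceleration_factor * test_time_h


def required_test_time(use_time_h: float, acceleration_factor: float) -> float:
    """Accelerated test time needed to represent a given use-condition time."""
    if acceleration_factor <= 0:
        raise ValueError("acceleration factor must be positive")
    return use_time_h / acceleration_factor


if __name__ == "__main__":
    # Illustrative values only: 1000 h of testing at an assumed factor of 20.
    print(equivalent_use_time(1000.0, 20.0))
```

Because the factor is assumed constant, doubling the test time doubles the use time it represents; a stress-dependent or time-dependent factor would break this proportionality.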


3.1 Arrhenius model 


Temperature is commonly used as an environmental
stress for testing electronic devices. It is generally
modeled using the Arrhenius reaction rate, which is
used for constant temperature stresses and is based
on the assumption that certain failure mechanisms
are driven by absolute temperature. The Arrhenius
reaction rate equation is given by (IEC 62506):

$$R(T) = A\,e^{-\frac{E_a}{k_B T}} \tag{1}$$

where $R(T)$ is the speed of reaction; $A$ is a
constant (which is not a function of temperature);
$E_a$ is the activation energy (eV); $k_B$ is the
Boltzmann constant, $k_B = 8.617385 \times 10^{-5}$
eV/K; $T$ is the absolute temperature (K). The
Arrhenius life-stress relationship is given by:

$$L(T) = C\,e^{\frac{D}{T}} \tag{2}$$

The relationship is linearized by taking the natural
logarithm of both sides of the Arrhenius equation:

$$\ln[L(T)] = \frac{D}{T} + \ln(C) \tag{3}$$

where $L(T)$ represents a quantifiable life measure
such as MTTF, characteristic life, etc.; $\ln(C)$ is
the intercept of the line; $D$ is the slope of the
line ($D = E_a/k_B$).

Since the Arrhenius relationship is a physics-based
model derived for temperature dependence, it is used
for temperature accelerated tests. For the same
reason, temperature values must be in absolute units
(kelvin), even though the Arrhenius equation is
unitless.
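
The linearized form, ln L(T) = D/T + ln(C), suggests estimating D and C by a straight-line fit of ln(life) against 1/T. The sketch below does this with ordinary least squares on synthetic data; the data points, the assumed activation energy of 0.7 eV and the function name are illustrative assumptions rather than results from the paper.

```python
import math

K_B = 8.617385e-5  # Boltzmann constant in eV/K, as quoted in the text


def fit_arrhenius(temps_k, lives):
    """Least-squares fit of ln(L) = D * (1/T) + ln(C); returns (C, D)."""
    x = [1.0 / t for t in temps_k]
    y = [math.log(life) for life in lives]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    d = sxy / sxx                       # slope of the line
    c = math.exp(y_mean - d * x_mean)   # intercept is ln(C)
    return c, d


# Synthetic life data generated from an assumed E_a = 0.7 eV and C = 1e-6 h.
assumed_ea = 0.7
temps = [358.15, 378.15, 398.15]  # test temperatures in kelvin
lives = [1e-6 * math.exp(assumed_ea / (K_B * t)) for t in temps]

c_hat, d_hat = fit_arrhenius(temps, lives)
ea_hat = d_hat * K_B  # slope D = E_a / k_B, so E_a = D * k_B
print(f"estimated E_a = {ea_hat:.3f} eV")
```

With real test data the points scatter around the line, and the fit would normally be accompanied by confidence bounds on the slope.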

The acceleration factor is the ratio of the stress
response rate of the test specimen under the
accelerated conditions to the stress response rate
under specified operational conditions:

$$A_{F\_Arr} = \frac{R(T_{test})}{R(T_{use})} = e^{\frac{E_a}{k_B}\left(\frac{1}{T_{use}} - \frac{1}{T_{test}}\right)} \tag{4}$$

where $T_{use}$ and $T_{test}$ are the use temperature
and the test temperature. The Arrhenius model can be
combined with a variety of statistical distributions
in reliability analysis. It is applicable where the
thermal stress is an exposure to a constant
temperature. The model is not applicable to damage
caused by low temperature. For such types of damage,
it is recommended that a test to failure be
performed to determine a specific model.
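
As a worked illustration of the Arrhenius acceleration factor, exp[(E_a/k_B)(1/T_use - 1/T_test)], the sketch below evaluates it for an assumed activation energy of 0.7 eV, a use temperature of 40°C and a test temperature of 85°C. These numbers and the function name are illustrative assumptions.

```python
import math

K_B = 8.617385e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration_factor(e_a_ev, t_use_k, t_test_k):
    """A_F = exp[(E_a / k_B) * (1/T_use - 1/T_test)], temperatures in kelvin."""
    if t_use_k <= 0 or t_test_k <= 0:
        raise ValueError("temperatures must be absolute (kelvin) and positive")
    return math.exp((e_a_ev / K_B) * (1.0 / t_use_k - 1.0 / t_test_k))


# 40 degC use and 85 degC test, converted to kelvin.
af = arrhenius_acceleration_factor(0.7, 313.15, 358.15)
print(f"Arrhenius acceleration factor: {af:.1f}")
```

Passing Celsius values by mistake changes the result drastically, which is why the function rejects non-positive temperatures as a basic sanity check.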


3.2 Eyring model 


This model is most often used when thermal stress
(temperature) is the acceleration variable, in which
case it is similar to the Arrhenius model. However,
the Eyring model is also used for other stress
variables, such as humidity. The relationship is
given by (IEC 62506):

$$L(S) = \frac{1}{S}\,e^{-\left(A - \frac{B}{S}\right)} \tag{5}$$

where $L(S)$ represents a quantifiable life measure,
such as MTTF, characteristic life, median life, etc.;
$S$ is the stress level (temperature values are in
absolute units, kelvin); $A$ and $B$ are model
parameters to be determined by test or approximated
by literature values. The acceleration factor in
this model is:

$$A_F = \frac{L(S_{use})}{L(S_{test})} = \frac{S_{test}}{S_{use}}\,e^{B\left(\frac{1}{S_{use}} - \frac{1}{S_{test}}\right)} \tag{6}$$

where $S_{use}$ and $S_{test}$ are the use stress level
and the accelerated stress level.

The Eyring model can be used for all types of 
mathematical probability distributions that are nor- 
mally used in reliability analysis. Mathematical levels 
of reliability can be determined for individual param- 
eters or functions based on appropriate statistics. 
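
A short sketch of the Eyring relationship follows. It assumes the commonly cited form L(S) = (1/S) exp(-(A - B/S)), from which the acceleration factor (S_test/S_use) exp[B(1/S_use - 1/S_test)] follows; the parameter values and function names are hypothetical and serve only to check that the two expressions agree.

```python
import math


def eyring_life(s, a, b):
    """Life measure under the assumed Eyring form L(S) = (1/S) * exp(-(A - B/S))."""
    return (1.0 / s) * math.exp(-(a - b / s))


def eyring_acceleration_factor(s_use, s_test, b):
    """Ratio L(S_use) / L(S_test); the parameter A cancels out."""
    return (s_test / s_use) * math.exp(b * (1.0 / s_use - 1.0 / s_test))


# Hypothetical parameters; stress is temperature in kelvin.
A_PARAM, B_PARAM = 10.0, 5000.0
ratio = eyring_life(313.15, A_PARAM, B_PARAM) / eyring_life(358.15, A_PARAM, B_PARAM)
print(ratio, eyring_acceleration_factor(313.15, 358.15, B_PARAM))
```

Only B influences the acceleration factor, which is why the text singles out differing values of B across materials as the practical difficulty.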


3.3 Inverse Power Law (IPL) model


This model is commonly used to test electronic
devices when the stresses are dynamic, such as shock
and vibration, or climatic, such as temperature
cycles, temperature changes, and humidity. The
relationship is given by (IEC 62506):

$$L(S) = \frac{1}{C\,S^{n}} \tag{7}$$

where $L(S)$ represents a quantifiable life measure,
such as MTTF, characteristic life, etc.; $S$ is the
stress level; $C$ is a model parameter to be
determined; $n$ is a model parameter, also to be
determined, that depends on the behavior of the
stress.

The parameter n in the inverse power relationship
is a measure of the effect of the stress on the life.
The greater the absolute value of n, the greater the
effect of the stress. Negative values of n indicate
an increasing life with increasing stress. An
absolute value of n approaching zero indicates a
small effect of the stress on the life, with no
effect (constant life with stress) when n = 0.

The IPL relationship appears as a straight line when
plotted on log-log paper. The equation of the line
is given by:

$$\ln[L(S)] = -n\ln(S) - \ln(C) \tag{8}$$

For the IPL relationship, the acceleration factor
is given by:

$$A_F = \frac{L(S_{use})}{L(S_{test})} = \frac{\dfrac{1}{C\,S_{use}^{n}}}{\dfrac{1}{C\,S_{test}^{n}}} = \left(\frac{S_{test}}{S_{use}}\right)^{n} \tag{9}$$

where $S_{use}$ and $S_{test}$ are the use stress level
and the accelerated stress level.
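
The inverse power law and its acceleration factor can be sketched in a few lines of Python. The parameter values below (C = 1e-6, n = 4) and the function names are illustrative assumptions; the point is that the ratio of lives reduces to (S_test/S_use)^n regardless of C.

```python
def ipl_life(s, c, n):
    """Life measure under the inverse power law, L(S) = 1 / (C * S**n)."""
    return 1.0 / (c * s ** n)


def ipl_acceleration_factor(s_use, s_test, n):
    """A_F = L(S_use) / L(S_test) = (S_test / S_use)**n."""
    return (s_test / s_use) ** n


# Doubling the stress with an assumed exponent n = 4.
life_ratio = ipl_life(10.0, 1e-6, 4) / ipl_life(20.0, 1e-6, 4)
print(life_ratio, ipl_acceleration_factor(10.0, 20.0, 4))
```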


3.4 Coffin-Manson model


This model is used to test electronic devices when
the stress is thermal cycling and the failure
mechanism is thermal cracking. The equation of the
model is (Cui 2005):

$$N = \frac{C}{\Delta T^{\gamma}} \tag{10}$$

where $\Delta T = T_{max} - T_{min}$ is the temperature
range; $T_{max}$ is the high extreme temperature;
$T_{min}$ is the low extreme temperature; $N$ is the
number of cycles to failure; $C$ and $\gamma$ are
properties of the material and test setup. The
acceleration factor is given by:

$$A_{F\_CM} = \left(\frac{\Delta T_{test}}{\Delta T_{use}}\right)^{\gamma} \tag{11}$$

where $\Delta T_{use}$ and $\Delta T_{test}$ are the
temperature range under normal operation and the
temperature range under test operation.
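
A minimal sketch of the Coffin-Manson relations, assuming the form N = C / ΔT^γ and the acceleration factor (ΔT_test/ΔT_use)^γ. The constants (C = 1e7, γ = 2) and the temperature ranges are hypothetical values chosen only for illustration.

```python
def cycles_to_failure(delta_t, c, gamma):
    """Coffin-Manson cycles to failure, N = C / delta_T**gamma."""
    if delta_t <= 0:
        raise ValueError("temperature range must be positive")
    return c / delta_t ** gamma


def coffin_manson_acceleration_factor(delta_t_use, delta_t_test, gamma):
    """A_F = (delta_T_test / delta_T_use)**gamma."""
    return (delta_t_test / delta_t_use) ** gamma


# A 50 K swing in use against a 100 K swing in test, with an assumed gamma = 2.
af_cm = coffin_manson_acceleration_factor(50.0, 100.0, 2.0)
n_use = cycles_to_failure(50.0, 1e7, 2.0)
n_test = cycles_to_failure(100.0, 1e7, 2.0)
print(af_cm, n_use / n_test)
```

The ratio of cycles to failure equals the acceleration factor, so each test cycle stands for `af_cm` use cycles under these assumptions.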


4 DESIGN OF ACCELERATED TEST OF 
ELECTRONIC COMPONENTS 


Before performing tests, it is essential to plan the
test and determine the characteristics of the sample
and the stress scheme (De Carlo et al. 2014).
Planning activities must respect the specifications,
the budget and the time constraints. The purpose of
ART is to provide objective and reproducible data on
the reliability of the object. This requires that the
test conditions described in the test plan and the
methods used to process the test results are as
reproducible as possible and that the test samples
used are representative. The steps for designing the
ART of electronic components in combat vehicles are
as follows:


• Select the electronic components in combat
vehicles;

• Determine the operating conditions of the
electronic components;

• Decide the stress types, levels and numbers;

• Decide the number of samples;

• Calculate the acceleration factors and test
duration.


4.1 Electronic components selection 
in combat vehicles 


If the electronic components have not been
identified beforehand, a representative electronic
component for the implementation of ART can be
identified from the results of the following
analysis:


• Review each electronic component used in the
combat vehicle;

• Analyze the importance of electronic components
for the overall functions of the combat vehicle;

• Analyze the tactical and technical data of the
individual electronic components, their design and
technological level (processor technology, processor
boards, and displays);

• Analyze the location of electronic components
in combat vehicles;

• Specify the applicability of the results obtained
by the accelerated test to other similar electronic
components;

• Analyze the reliability data of the electronic
component, either as specified by the manufacturer
or as obtained for the purpose of the accelerated
test.


4.2 Operating conditions of electronic 
components 


The conditions for performing the ART are typically
determined on the basis of the normal operation of
combat vehicles. Modern combat vehicles have a
diagnostic system that archives the operating time
of selected components, which allows the accelerated
test parameters to be set. Military equipment is
often used as follows (Vintr et al. 2013):


• Free state, during which the electronic
components are switched off: up to 90% of operating
life time;

• Operating state, during which the electronic
components are switched on: up to 10% of operating
life time.
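
The split between the two states gives the switch-on and switch-off times used later in the test calculation. The sketch below assumes, purely for illustration, a lifetime of 15 years of calendar time and the upper-bound 10% operating fraction; the function name and constants are hypothetical.

```python
HOURS_PER_YEAR = 8760.0  # calendar hours, ignoring leap years


def split_operating_time(lifetime_h, on_fraction=0.10):
    """Split a lifetime into switched-on and switched-off hours."""
    if not 0.0 <= on_fraction <= 1.0:
        raise ValueError("on_fraction must lie between 0 and 1")
    t_on = lifetime_h * on_fraction
    t_off = lifetime_h - t_on
    return t_on, t_off


# Illustrative only: 15 years of life with 10% of the time switched on.
t_on, t_off = split_operating_time(15 * HOURS_PER_YEAR)
print(f"t_on = {t_on:.0f} h, t_off = {t_off:.0f} h")
```

In practice the fraction would come from the operating times archived by the vehicle's diagnostic system rather than from the upper bound quoted above.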


4.3 Stress types, level and number 


In the automotive and electrotechnical industries,
the stresses most commonly used in accelerated tests
are temperature, temperature cycling, humidity,
vibration/shock, voltage (overvoltage/low voltage),
or a combination of these stresses. The different
types of stress have different effects on the load of
the electronic component and on its failure rate,
which is reflected in the value of the acceleration
factor and in the accelerated test time. By combining
different types of stress during an accelerated test,
it is possible to achieve a higher overall
acceleration factor, which means a substantial
shortening of the cumulative test time. Accordingly,
it is necessary to understand the behavior of
electronic components under different types of
stress by performing tests on statistical samples.
In general, procedures designed for commercial
electronic components can also be applied to
electronic components of combat vehicles. However,
it must be remembered that military components do
not necessarily behave like commercial electronic
components. For this reason, the accelerated test
procedure for a commercial electronic component
must be analyzed before it is applied to electronic
components of combat vehicles.


When performing the ART of electronic components in
combat vehicles, four types of stress should be the
focus, namely temperature cycling, temperature,
humidity and vibration. Temperature is the basic
stress acting on electronic components. It is
sometimes said that temperature is the enemy of the
reliability of all electronic components. Whether
they are used for commercial or military purposes,
these components are stressed by the environmental
temperature when turned off and by the operating
temperature when turned on. Temperature cycling,
which particularly stresses the solder joints of
electronic components, can result in a total failure
of the components (for example breaking of solder
joints, loss of conductive contact, and deformation
of the processor board). The basic factors that
affect the failure rate of electronic components
under temperature cycling are:


• Type of solder used and its properties;

• Temperature rise in electronic components;

• Minimum temperature difference (e.g., on/off).


In the switched-off state, the electronic components
of combat vehicles are stressed by the ambient
temperature. Depending on the season of the year,
the idle state does not have a profound effect on an
average annual temperature.

During the operation of military equipment, the
electronic components are switched on and generate
their own heat as they function. After being
switched on, the equipment reaches its operating
temperature. The resulting temperature of an
electronic part then depends on the ambient
temperature at which the part is switched on. To
find out the operating temperatures of electronic
parts placed in combat vehicles, an experiment has
to be performed.

When determining temperature cycling, it is
necessary to find out the times and the number of
switch-on/switch-off events of the electronic
components during operation. Modern combat vehicles
with electronic components are able to record and
archive the time of each switch-on/switch-off event,
including the hour, day and year.

For other stresses, such as humidity and vibration,
the stress level is also based on the operating
conditions and operating environment of combat
vehicles. Experiments are then performed with
selected military electronic components. Based on
this information, it is possible to specify the
types of stress, or combinations of stresses, that
have a major influence on the reliability or failure
rate of the electronic components. The stress level
in the test depends on the operating limits of the
electronic components and on the technical
parameters of the test equipment, such as
temperature range, rate of temperature change, and
humidity range. The information required for the
test calculation is shown in Table 2.


4.4 The number of samples in test 


The goal of the ART affects the number of samples,
the extent of the accelerated test, the types of
damage, the possibilities for diagnosing failures,
and the method of failure resolution.

In qualitative accelerated tests (type A), the
sample size is determined by the number of stresses
and the number of failures detected. For example,
the classic HALT requires one sample for low
temperature, one for high temperature, one for
vibration, one for temperature cycles and one for
the combined test with temperature cycles and
vibration, i.e. 5 samples in total. To account for
more than one failure mode, it is recommended to
use another 2 to 5 samples, so the recommended total
sample size is 7 to 10 objects (IEC 62506).

In quantitative accelerated tests (types B and C),
the number of samples is determined mainly by
estimating the constant failure rate or the time to
failure. A typical sample size for accelerated
testing is 77 samples for 1000 h (JESD 47G). In the
case of the exponential distribution, standards for
test plans such as IEC 61123 and IEC 61124 can be
used. For a test using the Weibull distribution, at
least 5 to 10 failures are expected. Because a
Weibull test is often stopped when one-third of the
objects under test have failed, the sample size is
15 to 30 objects (IEC 62506).


Table 2. Operating conditions and types of stresses.

Parameter                              Symbol     Value (unit)
Operation time
  Switch on                            t_on       h
  Switch off                           t_off      h
Temperature
  Switch on                            T_on       K
  Switch off                           T_off      K
Test temperature                       T_test     K
Temperature change                     ΔT_use     K
Temperature change in test             ΔT_test    K
Total number of cycles                 N_use      -
Speed of temperature change            c_use      K/min
Speed of temperature change in test    c_test     K/min
Vibration                              W_use      m/s²
Vibration in test                      W_test     m/s²
Relative humidity                      RH_use     %
Humidity in test                       RH_test    %
Required lifetime                      t_0        h
Required reliability                   R(t)       -
Activation energy                      E_a        eV


4.5 Acceleration factor and test duration 


4.5.1 Temperature 
When the stress is temperature, either the Arrhenius
reaction model or the Eyring model can be used to
calculate the acceleration factor. The Eyring model
contains the constants A and B, which for military
electronic components have to be determined by
additional tests; this requires extensive
experiments on a statistically significant number of
electronic components. For products of medium
complexity, precise accelerated tests may become
problematic because different components and
materials have different values of the constant B;
therefore, accelerated tests of electronic
components often use the Arrhenius reaction model.
The standard GS 95003-1-2010 also prescribes the
Arrhenius reaction model for vehicle electrical and
electronic components. When using the Arrhenius
reaction model, it is necessary to set the value of
the activation energy $E_a$, which directly affects
the magnitude of the acceleration factor. The values
of activation energy given by different standards
are shown in Table 3. Figure 1 shows the effect of
the activation energy $E_a$ on the acceleration
factor.

Table 3. The values of activation energy (Chloupka 2012).

Electronic      MIL217F   RDF       MIL217 plus   FIDES
components      (1996)    (1996)    (1996)        (2009)
Bipolar logic   0.4 eV    0.4 eV    0.8 eV        0.7 eV
CMOS logic      0.35 eV   0.3 eV    0.8 eV        0.7 eV
BICmos logic    0.5 eV    0.4 eV    0.8 eV        0.7 eV
Linear          0.65 eV   -         0.8 eV        0.7 eV
Memories        0.6 eV    -         0.8 eV        0.7 eV
VHSIC           0.4 eV    -         0.8 eV        0.7 eV


Figure 1. Acceleration factor with different values of $E_a$.


The acceleration factor is given by the Arrhenius
reaction model:

$$A_T = e^{\frac{E_a}{k_B}\left(\frac{1}{T_{on}} - \frac{1}{T_{test}}\right)} \tag{12}$$

The time spent at the switch-off temperature is
normalized to the switch-on temperature:

$$t_{on\_N} = t_{on} + t_{off}\,e^{-\frac{E_a}{k_B}\left(\frac{1}{T_{off}} - \frac{1}{T_{on}}\right)} \tag{13}$$

The test duration of the ART is:

$$t_{T\_test} = k\,\frac{t_{on\_N}}{A_T} = k\,t_{on\_N}\,e^{-\frac{E_a}{k_B}\left(\frac{1}{T_{on}} - \frac{1}{T_{test}}\right)} \tag{14}$$

where k is the multiplier of the actual duration of
the stress, which is determined from the graph in
the standard IEC 62506.
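
The temperature test calculation can be sketched as follows. The sketch assumes the Arrhenius forms A_T = exp[(E_a/k_B)(1/T_on - 1/T_test)], t_on_N = t_on + t_off exp[-(E_a/k_B)(1/T_off - 1/T_on)] and t_T_test = k t_on_N / A_T. The multiplier k defaults to 1.0 here only as a placeholder, since its real value has to be read from IEC 62506; all numerical inputs and function names are illustrative assumptions.

```python
import math

K_B = 8.617385e-5  # Boltzmann constant in eV/K


def normalized_on_time(t_on_h, t_off_h, temp_on_k, temp_off_k, e_a_ev):
    """Switch-on time plus switch-off time converted to the switch-on temperature."""
    weight = math.exp(-(e_a_ev / K_B) * (1.0 / temp_off_k - 1.0 / temp_on_k))
    return t_on_h + t_off_h * weight


def temperature_test_duration(t_on_n_h, temp_on_k, temp_test_k, e_a_ev, k=1.0):
    """Test duration k * t_on_N / A_T (k = 1.0 is a placeholder assumption)."""
    a_t = math.exp((e_a_ev / K_B) * (1.0 / temp_on_k - 1.0 / temp_test_k))
    return k * t_on_n_h / a_t


# Illustrative inputs: 13140 h on at 333.15 K, 118260 h off at 288.15 K,
# tested at 398.15 K with an assumed activation energy of 0.7 eV.
t_on_n = normalized_on_time(13140.0, 118260.0, 333.15, 288.15, 0.7)
t_test = temperature_test_duration(t_on_n, 333.15, 398.15, 0.7)
print(f"t_on_N = {t_on_n:.0f} h, test duration = {t_test:.0f} h")
```

Because the switch-off temperature is lower than the switch-on temperature, the switched-off hours contribute only a small fraction of their length to the normalized time.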


4.5.2 Temperature cycling 


The acceleration factor is given by Coffin-Manson 
model: 


=a -1 

AT se e % 
Ar =| 7r z 
AT es: Gai 

where n is the temperature cycle exponent (a typical
value of n is around 2).

The number of cycles in the test is given by:


n vA 
eean AB | (Se 
"o Are "MATa \ Si 


TC test 


(15) 


(16) 


Figure 2 shows the schematic of the temperature
cycle profile. The profile can be characterized by
the high extreme temperature ($T_{max}$), the low
extreme temperature ($T_{min}$), the temperature
change ($\Delta T$), the ramp rates, and the dwell
times at the extreme temperatures (Cui 2005). The
test duration of a temperature cycle is:


ATi. t esi 
e =e a 3 t 
test 


test 


t Ft 


T _low 


(17) 


where $t_{T\_low}$ is the dwell time at the lower
temperature extreme.


4.5.3 Humidity

During the humidity test, the acceleration is 
achieved by increasing the relative humidity during 
the test as well as by increasing the test tempera- 
ture above the expected values during use. 


Figure 2. Schematic of temperature cycle profile. 


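Thermal acceleration is the same as in the
temperature exposure test, applied for the full
equivalent time $t_{on\_N}$ at the switch-on
temperature. The acceleration factor for the
humidity test is given by the Hallberg-Peck model:

$$A_{RH} = \left(\frac{RH_{test}}{RH_{use}}\right)^{h} e^{\frac{E_a}{k_B}\left(\frac{1}{T_{on}} - \frac{1}{T_{RH}}\right)} \tag{18}$$

where $T_{RH}$ is the temperature during the humidity
test and h is the exponent of the acceleration
factor caused by humidity according to the IPL
model.

The test duration of exposure to humidity is given
by:

$$t_{RH\_test} = k\,\frac{t_{on\_N}}{A_{RH}} = k\,t_{on\_N}\left(\frac{RH_{use}}{RH_{test}}\right)^{h} e^{-\frac{E_a}{k_B}\left(\frac{1}{T_{on}} - \frac{1}{T_{RH}}\right)} \tag{19}$$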
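
The humidity calculation can be sketched under the assumption that the Hallberg-Peck acceleration factor takes its usual form, a humidity power-law term multiplied by an Arrhenius temperature term. The exponent h = 3, the activation energy, the humidity levels and the function names below are illustrative assumptions, and k = 1.0 is again a placeholder for the multiplier taken from IEC 62506.

```python
import math

K_B = 8.617385e-5  # Boltzmann constant in eV/K


def hallberg_peck_acceleration_factor(rh_use, rh_test, temp_on_k, temp_rh_k, e_a_ev, h):
    """A_RH = (RH_test / RH_use)**h * exp[(E_a / k_B) * (1/T_on - 1/T_RH)]."""
    humidity_term = (rh_test / rh_use) ** h
    thermal_term = math.exp((e_a_ev / K_B) * (1.0 / temp_on_k - 1.0 / temp_rh_k))
    return humidity_term * thermal_term


def humidity_test_duration(t_on_n_h, a_rh, k=1.0):
    """Test duration k * t_on_N / A_RH (k = 1.0 is a placeholder assumption)."""
    return k * t_on_n_h / a_rh


# Illustrative inputs: 50% RH in use, 85% RH at 358.15 K in test.
a_rh = hallberg_peck_acceleration_factor(50.0, 85.0, 333.15, 358.15, 0.7, 3.0)
print(f"A_RH = {a_rh:.1f}, test duration = {humidity_test_duration(13500.0, a_rh):.0f} h")
```

Keeping the two terms separate makes it easy to see how much of the acceleration comes from humidity and how much from the elevated test temperature.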


4.5.4 Vibration 
When the stress is vibration, the IPL model can be
used to calculate the acceleration factor:

$$A_W = \left(\frac{W_{use}}{W_{test}}\right)^{-w} \tag{20}$$

where w is the exponent of the acceleration factor
caused by vibration according to the IPL model. In
the absence of a test-specific constant, it is
usually assumed that w = 4.

The test duration is given by:

$$t_{w\_test} = k\,\frac{t_{w\_use}}{A_W} = k\,t_{w\_use}\left(\frac{W_{use}}{W_{test}}\right)^{w} \tag{21}$$

where $t_{w\_use}$ is the operation time in the
vibration test. Typically, for electronic components
in automotive applications, one hour of
non-accelerated vibration testing is estimated to
correspond to approximately 1600 km of distance
traveled; therefore, $t_{w\_use} = D/1600$ (h), where
D is the total distance traveled by the vehicle in
kilometers.


Table 4. Calculation results for ART.

Stress test: Temperature
Acceleration model: Arrhenius
T_on    T_off   t_on   t_off   T_test   A_T   t_T_test
[K]     [K]     [h]    [h]     [K]      [-]   [h]

Stress test: Temperature cycling
Acceleration model: Coffin-Manson
ΔT_use  ΔT_test  c_use    c_test   N_use  A_TC  N_test
[K]     [K]      [K/min]  [K/min]  [-]    [-]   [-]

Stress test: Temperature Humidity Bias
Acceleration model: Hallberg-Peck
RH_use  RH_test  T_RH  A_RH  t_RH_test
[%]     [%]      [K]   [-]   [h]

Stress test: Vibration
Acceleration model: Inverse power law
W_use   W_test  t_w_use  A_W  t_w_test
[m/s²]  [m/s²]  [h]      [-]  [h]

Combined stresses
t_0   R(t)  A    t_test
[h]   [-]   [-]  [h]
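
The vibration relations of Section 4.5.4 can be sketched in Python as follows, assuming the IPL acceleration factor (W_test/W_use)^w with the default w = 4 mentioned in the text and the 1600 km per hour equivalence for the non-accelerated test. The vibration levels, the distance and the function names are illustrative assumptions, and k = 1.0 is a placeholder for the multiplier taken from IEC 62506.

```python
def vibration_acceleration_factor(w_use, w_test, exponent=4.0):
    """A_W = (W_test / W_use)**w, equivalently (W_use / W_test)**(-w)."""
    return (w_test / w_use) ** exponent


def vibration_use_time(distance_km, km_per_test_hour=1600.0):
    """Non-accelerated vibration test time, t_w_use = D / 1600 in hours."""
    return distance_km / km_per_test_hour


def vibration_test_duration(t_w_use_h, a_w, k=1.0):
    """Test duration k * t_w_use / A_W (k = 1.0 is a placeholder assumption)."""
    return k * t_w_use_h / a_w


# Illustrative inputs: 160000 km travelled, vibration level doubled in test.
t_use = vibration_use_time(160000.0)
a_w = vibration_acceleration_factor(10.0, 20.0)
print(t_use, a_w, vibration_test_duration(t_use, a_w))
```

With w = 4 the test time is very sensitive to the chosen vibration level, so the test level should stay within the operating limits of the component.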


4.5.5 Combined stresses 

To determine the overall acceleration factor, it will 
be assumed that temperature cycles and vibration 
cause the same failure mode while temperature and 
humidity cause another failure mode. The overall 
acceleration factor is given by: 


$$A = \frac{A_{TC}\,A_W + A_T\,A_{RH}}{2} \tag{22}$$

Then the total test duration is: 
t 
° (23) 


The calculation results for the ART of electronic
components are shown in Table 4.


5 CONCLUSIONS 


This article presents the steps for designing a
quantitative ART of electronic components in combat
vehicles. Commonly used stress types for electronic
components are temperature, temperature cycling,
vibration, humidity, or a combination of these
stresses. In practice, however, it is necessary to
analyze the type of electronic components, the
operating conditions, the operating environments,
and the capabilities of the existing test equipment
in order to select the stress types characteristic
of the failure modes of the electronic components.
The ART methods used for electronic components in
commercial industry can be applied to electronic
components in combat vehicles after detailed
analysis. The next step is to conduct the ART and
evaluate the results under actual operating
conditions.


Approximation method for reliability of one-unit repairable system with
time redundancy


Xiaoyue Wu & Haiyue Yu 
School of Systems Engineering, National University of Defense Technology, Changsha, China 


ABSTRACT: For some engineering repairable systems, such as the spaceflight Telemetry, Tracking and
Control (TT&C) system, a mission can be executed within a given time interval and is regarded as
successful so long as the system remains in the operational state for a minimum length of time. Such a
system therefore has time redundancy in mission execution and higher mission reliability. This paper
presents a discrete approximation method for numerically calculating the mission reliability of this kind
of system with one repairable unit and binary states. The time window is divided into a number of time
slices, and the mission reliability of the system is then derived by solving two groups of recursive
discrete time equations. All these equations are established by decomposing the corresponding random
event into a number of disjoint events at discrete time points with reduced time lengths. A numerical
example is provided for a one-unit system. The results of the example are compared with those
previously obtained by other solution methods. It is shown that the proposed method is computationally
efficient, and that the approximated mission reliability converges to the analytical result as the width
of the time slices decreases.


1 INTRODUCTION 


In engineering practice, there are systems that have
time redundancy in mission execution. Specifically,
such a system needs to work continuously for a
minimum length of time to accomplish prescribed
critical missions within a given time interval
(called the time window). A typical example is the
spaceflight Telemetry, Tracking and Control (TT&C)
system, which provides services such as orbit
tracking, remote control and data transmission to
spacecraft while they fly over the ground stations
(Wu 2014, Guest 2013). Sometimes, the TT&C service
requires only a short time within the time window
during which the spacecraft flies over, so time
redundancy exists in such cases.
Although much research has been done on the mission
reliability of systems, not enough attention has
been given to the mission reliability of systems
with time redundancy. Many existing works concern
phased-mission systems (PMS), which accomplish
missions in consecutive phases and require the
system to remain in the operational state for the
whole duration of each mission phase (Xing & Amari
2008). The existing methods for mission reliability
generally include analytical methods and simulation
methods. Models based on Boolean algebra, Binary
Decision Diagrams (BDD), fault trees, and Continuous
Time Markov Chains (CTMC) are among the most
commonly used analytical methods (Kim & Park 1994,
Xing & Amari 2008). To avoid the strict restrictions
on modeling assumptions and the computational
complexity, simulation models such as Petri nets and
Monte Carlo simulation can provide alternative
options (Wu & Wu 2015, Wu & Hillston 2016), but they
may have precision and computation time issues.
To evaluate the mission reliability of such systems, the effect of time redundancy should be
taken into consideration to avoid underestimating the mission reliability. For repairable one-unit
systems, Wu presented a model that decomposes the mission success event into a number of disjoint
events, and gave general formulas for its computation (Wu 2014). Later, for semi-Markov systems
with multiple units, Wu and Hillston gave an analytical approach based on matrix integral
equations for mission reliability evaluation (Wu & Hillston 2015). The integral matrix equations
are computed numerically by a time discretization approximation. However, both of these analytical
methods involve complicated integral expressions and numerical computations. Monte Carlo methods
have also been applied to the reliability evaluation of such systems (Wu & Guo 2017, Wu &
Hillston 2016). The main purpose of this paper is to present a new analytical approach for
evaluating the mission reliability of repairable one-unit systems with time redundancy. As the
first step of the mission reliability evaluation, we divide the time window into a number of time
slices, and then build recursive equations for numerical computation. The approach thus differs
from our previous approaches in the order of discretization: we discretize time first and build
the reliability evaluation model afterwards. The new approach is expected to be more effective and
more computationally efficient.


2 MISSION RELIABILITY MODEL 


2.1 Basic assumptions 
In this paper, we make the following assumptions:

• the system has one unit that is repairable and has two states: the operational state (or up
  state) and the failure state (or down state).

• the time to failure and the repair time both follow exponential distributions.

• the unit after repair is "as good as new".

• the system is required to accomplish its mission within the time window $[0, T]$.

• the mission is successfully accomplished if the system stays in the up state continuously for
  a time duration no less than $T_d$.


2.2 Equations for mission reliability 


Let the interval $[0, T]$ be divided into $N$ intervals as follows:

$0 = w_0 < w_1 < \cdots < w_{N-1} < w_N = T$   (1)

where

$w_i = i\delta, \quad i = 1, \ldots, N$   (2)
$\delta = T/N$
Let
$d = \arg\min_{i} |T_d - i\delta|$   (3)


Assume that the time to failure of the system $W$ and the repair time $Z$ follow exponential
distributions with failure rate $\lambda$ and repair rate $\mu$ respectively, namely
$W \sim \mathrm{Exp}(\lambda)$, $Z \sim \mathrm{Exp}(\mu)$.

Now we introduce the following notation.

$P^u_k$: the probability that the system starts from the up state and remains in the up state
continuously until it enters the down state at time $w_k$.

$P^d_k$: the probability that the system starts from the down state and remains in the down state
continuously until it enters the up state at time $w_k$.

$P_u(n,d)$: the probability that the system has an operational span not less than $T_d$ within
the time interval $[0, T_n]$, $T_n = n\delta$, given that the system starts in the up state.

$P_d(n,d)$: the probability that the system has an operational span not less than $T_d$ within
the time interval $[0, T_n]$, given that the system starts in the down state.


From the renewal point of view, we obtain the following equations:

$P_u(n,d) = \sum_{k=d}^{\infty} P^u_k + \sum_{k=1}^{d-1} P^u_k \cdot P_d(n-k,d)$   (4)

$P_d(n,d) = \sum_{k=1}^{n-d} P^d_k \cdot P_u(n-k,d)$   (5)

The boundary conditions are

$P_u(n,d) = 0, \quad n < d$   (6)
$P_d(n,d) = 0, \quad n < d$   (7)

The mission reliability can be obtained by calculating $P_u(N,d)$, which is the probability that
the system starts in the up state and has a continuous operational duration of at least $T_d$
during the time interval $[0, T]$, $T = \delta N$.


2.3 Probabilities of sojourn time 


To implement the model established above, we need a method for calculating the sojourn time
probabilities corresponding to the different starting states of the system.

Since both the working time and the repair time of the system follow exponential distributions,
we can describe the behavior of the system with a binary-state continuous time Markov chain
(CTMC) model.

By CTMC theory (Ross 2010), for $k > 0$, we have

$P^u_k = \exp(-\lambda w_k)\,(1 - \exp(-\lambda\delta)) = \exp(-\lambda k\delta)\,(1 - \exp(-\lambda\delta))$   (8)

$P^d_k = \exp(-\mu w_k)\,(1 - \exp(-\mu\delta)) = \exp(-\mu k\delta)\,(1 - \exp(-\mu\delta))$   (9)
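Equations (8) and (9) are easy to tabulate. The following is a minimal Python sketch, not the authors' implementation; the function name `sojourn_probabilities` and the choice N = 100 are illustrative assumptions, while the rates are those of the numerical example in Section 4. It also checks the telescoping identity $\sum_{k=1}^{n} P_k = e^{-r\delta} - e^{-r(n+1)\delta}$, which follows directly from the form of (8) and (9).

```python
import math

def sojourn_probabilities(rate, delta, n_max):
    """Return [p_1, ..., p_n_max] with p_k = exp(-rate*k*delta) * (1 - exp(-rate*delta))."""
    step = 1.0 - math.exp(-rate * delta)
    return [math.exp(-rate * k * delta) * step for k in range(1, n_max + 1)]

lam, mu = 1 / 60, 1 / 10      # failure and repair rates from the numerical example
T, N = 100.0, 100             # N = 100 is an arbitrary illustrative choice
delta = T / N

p_up = sojourn_probabilities(lam, delta, N)     # P^u_k, equation (8)
p_down = sojourn_probabilities(mu, delta, N)    # P^d_k, equation (9)

# Each term equals exp(-r*k*delta) - exp(-r*(k+1)*delta), so the sum telescopes.
closed_form = math.exp(-lam * delta) - math.exp(-lam * (N + 1) * delta)
print(abs(sum(p_up) - closed_form) < 1e-12)
```

The same telescoping argument, taken to the limit, gives the closed form of the tail sum used for the constant C in equation (14).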


3 ALGORITHM FOR SOLVING EQUATIONS 


3.1 Recursive equations 


For brevity, we define the following notation:

$G_n = P_u(n,d)$   (10)
$F_n = P_d(n,d)$   (11)
$A_k = P^u_k$   (12)
$B_k = P^d_k$   (13)
$C = \sum_{k=d}^{\infty} A_k = \exp(-\lambda w_d) = \exp(-\lambda d\delta)$   (14)

Then equations (4) and (5) can be rewritten as

$G_n = C + \sum_{k=1}^{d-1} A_k \cdot F_{n-k}$   (15)

$F_n = \sum_{k=1}^{n-d} B_k \cdot G_{n-k}$   (16)

The associated boundary conditions are

$G_n = 0, \quad n < d$   (17)
$F_n = 0, \quad n < d$   (18)

which make equations (15) and (16) more specific:

$G_n = C + \sum_{k=1}^{M(n,d)} A_k \cdot F_{n-k}$   (19)

$F_n = \sum_{k=1}^{n-d} B_k \cdot G_{n-k}$   (20)

where $M(n,d) = \min\{d-1, n-d\}$.
Thus, we obtain recursive equations for finding the mission reliability $G_N = P_u(N,d)$.
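Because every sum in equations (19) and (20) starts at k = 1, $G_n$ and $F_n$ depend only on values with smaller indices. One way to evaluate the recursion is therefore a single forward pass over n, sketched below in Python. This is an illustrative alternative to the iterative procedure of Section 3.2, not the authors' code; the function name `mission_reliability` and the rounding used to obtain d are assumptions.

```python
import math

def mission_reliability(lam, mu, T, T_d, N):
    """Evaluate G_N from equations (19) and (20) in one forward pass over n."""
    delta = T / N
    d = round(T_d / delta)   # assumed reading of equation (3): nearest grid index
    up = 1.0 - math.exp(-lam * delta)
    down = 1.0 - math.exp(-mu * delta)
    # A[k] and B[k] follow equations (12), (13); index 0 is unused padding.
    A = [0.0] + [math.exp(-lam * k * delta) * up for k in range(1, N + 1)]
    B = [0.0] + [math.exp(-mu * k * delta) * down for k in range(1, N + 1)]
    C = math.exp(-lam * d * delta)                      # equation (14)
    G = [0.0] * (N + 1)                                 # boundary condition (17)
    F = [0.0] * (N + 1)                                 # boundary condition (18)
    for n in range(d, N + 1):
        F[n] = sum(B[k] * G[n - k] for k in range(1, n - d + 1))      # equation (20)
        m = min(d - 1, n - d)
        G[n] = C + sum(A[k] * F[n - k] for k in range(1, m + 1))      # equation (19)
    return G[N]

# Parameters of the numerical example in Section 4, with an illustrative N.
print(mission_reliability(1 / 60, 1 / 10, 100.0, 60.0, 1000))
```

The cost is O(N^2) arithmetic operations. When the time window equals the required duration (N = d), the function returns exactly C, the probability of no failure before $T_d$.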


3.2 Solution procedure 


Based on the previous results, the mission reliability of the system can be computed with the
following procedure.

Step 1: Initialization.
Set $N$, $d$, $\delta$.
Set $G^{old}_n = 0$, $F^{old}_n = 0$, $n = 0, \ldots, N$.

Step 2: Calculate $C$ and $A_k$, $B_k$, $k = 1, \ldots, N$.

Step 3: For $n = 1$ to $n = N$, calculate

$G^{new}_n = C + \sum_{k=1}^{M(n,d)} A_k \cdot F^{old}_{n-k}$   (21)

$F^{new}_n = \sum_{k=1}^{n-d} B_k \cdot G^{old}_{n-k}$   (22)

Step 4: If the difference between $G^{new}$ and $G^{old}$ is small enough, then stop. The value of
$G^{new}_N$ is the mission reliability.
Otherwise, set $G^{old} \leftarrow G^{new}$, $F^{old} \leftarrow F^{new}$ and go back to Step 3.
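Steps 1 to 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' program: each sweep is assumed to use only values from the previous sweep, the names `iterate_mission_reliability`, `tol` and `max_sweeps` are hypothetical, and the stopping test compares both the G and the F sequences, since with this update order G alone can stay unchanged for one sweep before the iteration has converged.

```python
import math

def iterate_mission_reliability(lam, mu, T, T_d, N, tol=1e-10, max_sweeps=1000):
    """Repeat sweeps of equations (21), (22) until G and F stop changing."""
    delta = T / N
    d = round(T_d / delta)   # assumed reading of equation (3)
    up = 1.0 - math.exp(-lam * delta)
    down = 1.0 - math.exp(-mu * delta)
    A = [0.0] + [math.exp(-lam * k * delta) * up for k in range(1, N + 1)]
    B = [0.0] + [math.exp(-mu * k * delta) * down for k in range(1, N + 1)]
    C = math.exp(-lam * d * delta)
    G_old = [0.0] * (N + 1)          # Step 1: initialization
    F_old = [0.0] * (N + 1)
    for sweep in range(1, max_sweeps + 1):
        G_new = [0.0] * (N + 1)      # entries with n < d stay zero (boundary conditions)
        F_new = [0.0] * (N + 1)
        for n in range(d, N + 1):    # Step 3
            m = min(d - 1, n - d)
            G_new[n] = C + sum(A[k] * F_old[n - k] for k in range(1, m + 1))
            F_new[n] = sum(B[k] * G_old[n - k] for k in range(1, n - d + 1))
        change = max(
            max(abs(a - b) for a, b in zip(G_new, G_old)),
            max(abs(a - b) for a, b in zip(F_new, F_old)),
        )
        if change < tol:             # Step 4
            return G_new[N], sweep
        G_old, F_old = G_new, F_new
    raise RuntimeError("no convergence within max_sweeps")

value, sweeps = iterate_mission_reliability(1 / 60, 1 / 10, 100.0, 60.0, 100)
print(value, sweeps)
```

Each sweep costs O(N^2) operations, so the total cost grows with both the number of time slices and the number of sweeps.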


4 NUMERICAL EXAMPLE 


To verify the proposed model and the solution algorithm, we use an example of a one-unit system
(Wu 2014). The time to failure and the repair time of the unit are both exponentially
distributed. The failure rate and repair rate of the unit are $\lambda = 1/60$ and $\mu = 1/10$
respectively. The system is required to work continuously for $T_d = 60$ within the given time
interval $[0, T]$, $T = 100$. For this example, the mission reliability obtained by other methods,
including analytical and simulation methods, is 0.533 (Wu 2014, Wu & Hillston 2015).

Table 1. Mission reliability for different N.

N       R = G_N     N       R = G_N
100     0.5093      2000    0.5322
200     0.5208      2500    0.5324
500     0.5283      3000    0.5326
800     0.5302      3500    0.5327
1000    0.5308      4000    0.5328
1200    0.5313      4500    0.5329
1500    0.5317      5000    0.5329

Table 2. Mission reliability after different iterations.

Iteration    3         5         8         10        15
R = G_N      0.5100    0.5313    0.5328    0.5329    0.5329

The proposed method is used to compute the mission reliability. To study the approximation
precision of the time discretization, different values of N for the fixed time window $[0, T]$
are used to compute the mission reliability. The results are shown in Table 1.

To show the convergence performance of our recursive algorithm for solving equations (19) and
(20), the mission reliability calculated after different numbers of iterations is provided in
Table 2, where the number of time slices is set to N = 4500.

From Table 2, we can see that our algorithm converges quickly. After about 8 iterations of the
recursive equations, the value of the mission reliability is sufficiently close to the analytical
result 0.533.

Comparing the mission reliability results in the two tables, we find that the number of time
slices in the time window has more influence on the precision of the results than the number of
iterations. When the length of a time slice becomes about 1/50 of the length of the time window,
the obtained mission reliability can be expected to reach satisfactory precision.


5 CONCLUSIONS 


For one-unit repairable systems with time redundancy, which require only a minimum length of
operational time within a time window for mission success, we present an analytical approximation
approach to calculate the mission reliability. Based on the system behavior at discrete time
points, the recursive equations for mission reliability are established for different system
starting states and remaining time durations. Moreover, an iterative numerical computation
procedure is presented for obtaining the system mission reliability. This approach differs from
our previous approach in that time is discretized before the recursive equations are built. The
results of the numerical example show that the proposed method provides an efficient way of
evaluating the mission reliability of systems with time redundancy. As future research, we will
try to extend the model and algorithm to more complicated cases.
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ABSTRACT: In this paper, a new reliability analysis method for complex mechanical system which 
contains hybrid uncertainties is proposed. Hybrid uncertainties, comprising both random and interval variables,
are considered in the limit-state function. A Middle Point Limit-State (MPLS) model is first
proposed for calculating the reliability index under hybrid uncertainties; this model makes the index
interval narrower and more accurate. For complex mechanical systems, a system hierarchy technique is used in
reliability modeling, and a composite limit-state function is then obtained to calculate the system reliability. By
constructing optimization models of the composite limit-state function, the reliability index interval
can be obtained, and the artificial bee colony algorithm is employed to solve these optimization models.
The reliability index interval obtained by the proposed method can be more precise
than that of existing techniques. The reliability indexes, comprising the interval and the MPLS index, may be
a more useful tool for assessing the reliability of mechanical systems with hybrid uncertainties. A numerical
case is carried out to illustrate the method, and an engineering case of a satellite driving system is
used as a typical complex mechanical system to demonstrate the proposed analysis method.


1 INTRODUCTION 


The complexity of modern mechanical systems generally shows in two ways. First, the system
structure and its components are complex: mechanical systems are typical multi-level systems, and
there are many elements at every level. Second, the uncertain information of a mechanical system
is complex. Depending on the amount of statistical information, some variables or parameters have
sufficient statistical information and are treated as probabilistic variables; others lack
sufficient statistical information, so only the upper and lower values of an interval can be
given for them.

To deal with hybrid uncertainties of interval and probability, several lines of research have
emerged in the field of structural reliability analysis. One technique transforms the interval
uncertainties into random uncertainties and then calculates the reliability with classical random
reliability theory such as the First Order Reliability Method (FORM) or the Second Order
Reliability Method (SORM) (Hurtado et al. 2012, Hurtado 2013). Another assumes that the interval
variables obey a uniform distribution over the interval range, and uses the non-probabilistic
reliability index to estimate the reliability of structures (Jiang et al. 2013, Wang et al. 2010,
Qiu et al. 2008). These methods are based on the assumption that the interval variables obey
random distributions. This assumption neglects the objective fact that, for interval variables,
no information is available beyond the upper and lower boundaries. To overcome this deficiency,
other techniques such as the two-step method have been developed for structural systems with
hybrid uncertainties (Jiang et al. 2011, Du 2007, Guo et al. 2009). This technique is not
suitable for system reliability analysis; the main problem is the interval extension caused by
the Interval Arithmetic Method (IAM). As the mathematical process becomes more complex, the range
of the output interval becomes too large to be acceptable. The interval-truncation approach (Lu
et al. 2002, Zhao et al. 2008) was developed to reduce interval extension. It works for simple
interval arithmetic equations, but is not suitable for complex problems, especially complex
mechanical systems. Other epistemic uncertainty theories such as fuzzy possibility (Singh et al.
2008) and Dempster-Shafer evidence theory (Flage et al. 2011, Zio et al. 2012) can also deal with
interval uncertainty, but these theories are suited to specific uncertainty structures such as
fuzzy sets and basic belief assignments. Uncertainty measures such as possibility, belief and
plausibility are then used for reliability assessment.

This paper aims to develop an efficient and relatively accurate reliability analysis method for
complex mechanical systems with hybrid uncertainties. A Middle Point Limit-State (MPLS) model is
first introduced to deal with the hybrid uncertainties of interval and probability, and the
composite limit-state function and optimization model for complex mechanical system reliability
analysis are also proposed. First, the system reliability model is built with the hierarchy model
of the system, and then the composite limit-state function of the system is obtained according to
the different levels and logic gates of the hierarchical system. The optimization model for
calculating the system reliability index is constructed, and the Artificial Bee Colony (ABC)
algorithm is employed to solve it. The optimization model method has been used in fuzzy
industrial system analysis (Fales 2010, Harish et al. 2012, Sharma et al. 2009, Garg et al.
2013). The main contribution of this method is that the Optimization Model Based Method (OMM),
instead of IAM, is used to calculate the system reliability, so that the final reliability
interval is more precise. It should be noted that the mechanical systems in this paper are
considered to have two states, and that the basic logic gates "AND" and "OR" are used.

The remainder of this paper is organized as follows. Random distributions with interval
parameters are employed to quantify the uncertainty, and the MPLS model is introduced for this
case in section 2; section 3 establishes the reliability model of a complex mechanical system;
section 4 presents the reliability analysis method based on the optimization model and the ABC
algorithm; a numerical case and an engineering case are carried out in section 5 to demonstrate
the method. Finally, a conclusion is given in section 6.


2 MIDDLE POINT LIMIT-STATE MODEL 
FOR HYBRID UNCERTAINTIES 


Define the limit-state function $Z$ of a mechanical system as follows:

$Z = g(X, Y)$   (1)

where $X = \{x_1, x_2, \ldots, x_n\}$ are random variables, $Y = \{y_1, y_2, \ldots, y_m\}$ are
interval variables, and the upper and lower boundary values of $Y$ are given:

$Y = [Y_{\min}, Y_{\max}]$   (2)

The mean value and deviation of $Y$ are defined as:

$Y_m = \frac{Y_{\min} + Y_{\max}}{2}$   (3)

$Y_r = \frac{Y_{\max} - Y_{\min}}{2}$   (4)

Because of the interval variables $Y$, the limit state $Z = 0$ is a region bounded by two
surfaces in the probabilistic space, as shown in Figure 1.

For reliability analysis of the hybrid uncertainty limit-state function $Z$, this paper
constructs a "middle point limit-state (MPLS)" model to calculate the reliability index $\beta$.
In the MPLS model, a new limit-state function is constructed by fixing all the interval variables
at the middle points of their ranges. This limit-state function is defined as:

$Z_m = g(X, Y_m) = 0$   (5)

When using FORM or SORM, equation (1) can be rewritten as:

$Z = g(X, Y) = g(T(U), Y) = G(U, Y)$   (6)

where $U = \{u_1, u_2, \ldots, u_n\}$ are the standard normal variables transformed from $X$.

The reliability index $\beta^m$ of $Z_m$ can be obtained by solving the following optimization
model:

$\beta^m = \min_U \|U\|$
s.t. $G(U, Y_m) = 0$   (7)

For the reliability index interval $\beta^I = [\beta^{\min}, \beta^{\max}]$, the following two
optimization problems are solved:

$\beta^{\min} = \min_U \|U\|$
s.t. $\min_Y G(U, Y) = 0$   (8)

and:

$\beta^{\max} = \min_U \|U\|$
s.t. $\max_Y G(U, Y) = 0$   (9)

Figure 1. State space of performance function including interval variables.

Redefinition of the reliability index interval $\beta^I$. Through the above work, three
reliability indexes $\beta^{\min}$, $\beta^{\max}$ and $\beta^m$ are obtained. The reliability
index interval $\beta^I$ of $Z$ can then be defined according to the following cases:

case 1: if $\beta^m \le \beta^{\min}$, then $\beta^I = [\beta^m, \beta^{\max}]$.

case 2: if $\beta^m \ge \beta^{\max}$, then $\beta^I = [\beta^{\min}, \beta^m]$.

case 3: if $\beta^{\min} < \beta^m < \beta^{\max}$, then $\beta^I = [\beta^{\min}, \beta^{\max}]$,
and $\beta^m$ is an accessory reliability index.

The final reliability index is defined as:

$\beta = \{\beta^I, \beta^m\}$   (10)

The reliability is:

$R = \{R^I = \Phi(\beta^I), \; R^m = \Phi(\beta^m)\}$   (11)

3 COMPOSITE LIMIT-STATE FUNCTION 
METHOD FOR MECHANICAL SYSTEM 
RELIABILITY ANALYSIS 


3.1 Hierarchy model of mechanical system 


For a complex mechanical system, reliability analysis is challenging because of the complexity
of both the structure and the uncertainty information, especially when the number of elements is
very large. For this reason, the system hierarchy technique is an efficient way to analyze the
reliability of complex mechanical systems.

The system hierarchy method is shown in Figure 3. In this method, the mechanical system is
divided into different levels according to its structure and components. Generally, the top level
is the system and the bottom level contains the variables and parameters of the basic elements.
The middle levels are sub-systems or parts of the mechanical system. Different levels are joined
by logic gates.

In the hierarchy model of the system reliability, the logical operations of the gates are defined
as follows.

For the "AND" logical gate, the reliability is:

$R_{AND} = \prod_{i} R_i$   (12)

For the "OR" logical gate, the reliability is:

$R_{OR} = 1 - \prod_{i} (1 - R_i)$   (13)

It should be noted that all the components or failure modes are assumed to be independent of
each other.
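The two gate operations of equations (12) and (13) translate directly into code. The following Python sketch is illustrative only; the function names `r_and` and `r_or` and the input values are hypothetical.

```python
from math import prod

def r_and(reliabilities):
    """Equation (12): reliability of an "AND" gate over independent inputs."""
    return prod(reliabilities)

def r_or(reliabilities):
    """Equation (13): reliability of an "OR" gate over independent inputs."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

# Hypothetical two-input gates.
print(r_and([0.9, 0.8]))
print(r_or([0.9, 0.8]))
```

Nesting these two functions according to the gates of a hierarchy model gives the reliability of the whole system as a function of the bottom-level reliabilities.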



Figure 2. Reliability index of hybrid uncertainty. 


Figure 3. The hierarchy model of mechanical system. 


3.2 Composite limit-state function and 
calculation of system reliability 


In section 3.1, the complex mechanical system is modeled by different levels and logic gates
according to the system hierarchy method. This section describes the construction of the
composite limit-state function and the calculation of the system reliability. Figure 3 is used
to illustrate the method. In this figure, the bottom level contains the basic variables and
parameters, and the variables are both probabilistic and interval in this paper. Every element
in level 3 then has limit-state functions based on its failure modes and variables, and these
limit-state functions are defined as:

$Z_{level3} = \{Z^{1}_{level3}, Z^{2}_{level3}, \ldots, Z^{n}_{level3}\}$   (14)
$Z^{i}_{level3} = g(X_i, Y_i)$   (15)

The index $Z^{i}_{level3}$ represents the limit state of the ith element in level 3. The indexes
$X_i$ and $Y_i$ represent the probabilistic variables and interval variables of the ith element.

For level 2, the limit-state functions of part 1, part 2 and part 3 can be obtained from the
indexes $Z^{i}_{level3}$ and the logic gates. These limit-state functions are:

$Z_{level2} = \{Z^{1}_{level2}, Z^{2}_{level2}, Z^{3}_{level2}\}$   (16)
$Z^{i}_{level2} = L(Z^{j}_{level3}, \ldots, Z^{k}_{level3})$   (17)

in which $L(\cdot)$ is a logic function. For level 1, the limit-state function of the mechanical
system can be obtained from the indexes $Z^{i}_{level2}$ and logic gate 1:

$Z_{sys} = L(Z_{level2})$   (18)

and $Z_{sys}$ is defined as the composite limit-state function of the mechanical system. However,
the composite limit-state function is not suitable for direct use, especially for large systems
with many variables, and the limit-state functions in this paper are hybrid uncertainty
limit-state functions. This paper therefore employs the reliability index instead of the
limit-state function. According to the MPLS model in section 2, every limit-state function of an
element in level 3 yields a reliability index interval $\beta^I$ and a reliability interval
$R^I$. The reliability of the composite limit-state function $Z_{sys}$ can then be defined as:

$R^{I}_{sys} = L(R^{I}_{1}, R^{I}_{2}, \ldots, R^{I}_{n})$   (19)

where the function $L$ follows the same principle as in equation (18).

After calculation, the index $R^{I}_{sys}$ is also an interval, defined as:

$R^{I}_{sys} = [R^{\min}_{sys}, R^{\max}_{sys}]$   (20)

where $R^{\min}_{sys}$ and $R^{\max}_{sys}$ are the lower and upper boundaries of the interval
respectively.


4 RELIABILITY ANALYSIS OF COMPLEX 
MECHANICAL SYSTEM 


4.1 Optimization models for calculating the 
reliability 

For the computation of $R^{I}_{sys}$, the optimization model based method (OMM) is used instead
of the interval arithmetic method (IAM). The boundary values of $R^{I}_{sys}$ in equation (20)
can be obtained by solving the following optimization problems.

For $R^{\min}_{sys}$:

Minimize $g(R^{I}_{1}, R^{I}_{2}, \ldots, R^{I}_{n})$
Subject to $R^{\min}_{i} \le R^{I}_{i} \le R^{\max}_{i}$

For $R^{\max}_{sys}$:

Maximize $g(R^{I}_{1}, R^{I}_{2}, \ldots, R^{I}_{n})$
Subject to $R^{\min}_{i} \le R^{I}_{i} \le R^{\max}_{i}$

According to (Garg & Sharma 2012), the narrower the interval, the more meaningful it is for
system reliability decisions. For a complex mechanical system, the optimization model is a
highly nonlinear problem, so an efficient optimization algorithm is needed. The artificial bee
colony (ABC) algorithm has been shown to be effective (Peng et al. 2013), and this paper uses it
to solve the optimization models and obtain the system reliability.
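The structure of the two optimization problems can be illustrated without a metaheuristic. The Python sketch below is not the ABC algorithm; it substitutes plain enumeration of the corner points of the box constraints, which gives the exact bounds only when the system reliability function is monotone in each element reliability, and which costs 2^n evaluations for n elements. The three-element system, the interval values and the function names are hypothetical.

```python
from itertools import product

def system_reliability(r):
    """Hypothetical system: element 0 AND (element 1 OR element 2)."""
    return r[0] * (1.0 - (1.0 - r[1]) * (1.0 - r[2]))

def reliability_bounds(func, intervals):
    """Evaluate func at every corner of the box and return (minimum, maximum)."""
    values = [func(corner) for corner in product(*intervals)]
    return min(values), max(values)

# Hypothetical element reliability intervals (lower, upper).
intervals = [(0.95, 0.99), (0.80, 0.90), (0.85, 0.95)]
low, high = reliability_bounds(system_reliability, intervals)
print(low, high)
```

Because the whole system function is optimized at once, each element interval enters the objective as a single variable, which is the property that distinguishes the optimization formulation from term-by-term interval arithmetic.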

The complete steps of the reliability analysis method proposed in this paper are as follows:

Step 1: analyze the structure and components of the mechanical system, and build the reliability
model with the system hierarchy technique;

Step 2: identify the bottom elements of the system, and collect and extract the uncertainty
information of these elements;

Step 3: build the reliability limit-state functions of these elements; if a limit-state function
is implicit, the RSM is needed. Then the composite limit-state function of the system is
identified;

Step 4: calculate the reliability index interval $\beta^I$ and $R^I$ of the elements based on the
MPLS method proposed in section 2, and then construct the optimization models for solving the
boundaries of $R^{I}_{sys}$ of the composite limit-state function $Z_{sys}$;

Step 5: solve the models with the ABC algorithm to obtain the mechanical system reliability
indexes $R^{I}_{sys}$ and $R^{m}_{sys}$.

The Artificial Bee Colony (ABC) algorithm was proposed by Karaboga for optimizing numerical
problems (Karaboga 2007, 2008). The algorithm simulates the intelligent foraging behavior of
honeybee swarms. It is a simple, robust, population-based stochastic optimization algorithm.


5 CASE STUDY 


5.1 The numerical case 


Assume the hierarchy model of a mechanical system as shown in Figure 4. After system hierarchy
analysis, there are 5 levels in the mechanical system S, and the bottom level contains the basic
variables. There are 9 elements in level 4, and all these elements are considered to have hybrid
uncertainties. Their basic variables and limit-state functions are given. Of the 5 logic gates
$G_1$~$G_5$, three are "OR" gates and two are "AND" gates. We now calculate the reliability
index of S with the method proposed in this paper.
Denote the reliability performance functions of the bottom elements $E_1$~$E_9$ as follows:


2(X,Y),, =x,-y,/x? -0.8x,/5, 

g(X,Y), =567-x,-x,-0.5y3 

s(X.Y),. =x +V2/x, =F 

(X,Y), =x,- y300x +1.92y} 

g(X.Y),_ = exp[0.4(y, +2) + 5.02] 
—exp[0.3y, + 500] — 200x,,x2 

g(X.Y),. 
= y, — 6x, yA . Xa) — 6x; -V (xa -x2) 

2(X.Y 


= ig ‘Vio’ Yaf Ms + V2x2,/ Vio 7 326 


g(X,¥),. 
= 0.01 -(48x,, + 32)/(18x,, +3) Ya (VaV) 


g(x we = 4M Xo [Vis Xa Vig 


(21) 
The uncertainties of all variables are shown in Table 1 and Table 2. Based on the MPLS model
proposed in this paper, the reliability indexes $\beta^I$ and $\beta^m$ of the bottom elements of
the system can be obtained; the results are listed in Table 3.
The composite limit-state function of the system S is then constructed. According to the system
structure and its logic gates, the composite limit-state function can be obtained as:


Gys (X,Y) = Lor (Em 8m,) 


Zin = Lor (ss ~ gr) 
Zm, = Lann (Sar Su, ) (22) 
Zm, = Lor (ge, Sr, ) 
Zm, = Lanp (ge, Sr, EE ) 
where L is the function defined by the logic operations (12) and (13). Similarly, the system
reliability $R^{I}_{sys}$ can be obtained according to equation (23).


Figure 4. Hierarchical model of the system S.


Ry = Re R R Ry 
[1-1-8 R) (1 


R,)-(I- R): (1- R )] 
(23) 


The system reliability $R^{m}_{sys}$ is calculated directly from equation (23):
$R^{m}_{sys} = 0.94960402$.

The lower boundary value of the system reliability, $R^{\min}_{sys}$, is obtained from

minimize $R_{sys}$
subject to $R_{E_i} \in R^{I}_{E_i}$

The upper boundary value of the system reliability, $R^{\max}_{sys}$, is obtained from

maximize $R_{sys}$
subject to $R_{E_i} \in R^{I}_{E_i}$

The ABC algorithm is used to solve the above models. It has been implemented in Matlab
(MathWorks), and the program has been run on a T6400 @ 2 GHz Intel Core(TM) 2 Duo processor with
2 GB of Random Access Memory (RAM). The selected values of the parameters for ABC are:

Colony size (CS) = 20 x number of subunits.
Limit for scout = (CS x D)/2, where D is the dimension (number of variables).
Number of generations = 500.

The termination criterion has been set to either a maximum number of generations or an order of
relative error equal to 10%, whichever is achieved first.

The results of the ABC algorithm are 0.93429182 (the lower boundary) and 0.96392416 (the upper
boundary). The optimization process is shown in Figure 5 and Figure 6.


6 RESULTS AND DISCUSSION 


According to the reliability analysis method proposed in this paper, when all variables take
their mean values, the reliability of the system is $R^{m}_{sys} = 0.94960402$. The interval
$R^{I}_{sys}$ is as follows: using IAM directly, the reliability interval is
[0.91771156, 0.96653623]; using OMM with the ABC algorithm, it is [0.93429182, 0.96392416].
Compared with the IAM result, the interval obtained through the OMM is narrower, especially at
the lower boundary: the lower boundary is tightened by about 0.01658026, while the upper boundary
is almost the same as the IAM result, being reduced by about 2.61207e-3. This is because the
reliability intervals of all the bottom elements are smaller than 1. It can also be seen that the
interval reduction is not pronounced in this case, which can be explained by the system not being
complex enough. The spread widths relative to the index $R^{m}_{sys}$ are as follows: through
IAM, the left spread width is 0.03189246 and the right spread width is 0.01693221; through OMM,
the left spread width is 0.0153122 and the right spread width is 0.01432014. If the index
$R^{m}_{sys}$ is taken as the most sensitive point of the interval $R^{I}_{sys}$, then the
interval obtained by OMM is more sensitive than the one obtained by IAM.

To illustrate the advantages of the ABC algorithm, other intelligent optimization algorithms,
namely the particle swarm optimization (PSO) algorithm and the genetic algorithm (GA), are also
applied to the same optimization models. The results are listed in Table 4. They show that the
ABC algorithm and the PSO algorithm obtain almost the same results, which demonstrates that both
have good global searching ability in the optimization process. However, PSO needs more steps
(177) than ABC (102) to converge to a stable solution. The GA converges very quickly (only 77
steps), but its optimization results are much rougher than those of ABC and PSO.

Table 1. Uncertain information of random variables.

Variables  Distribution  Mean   Variance    Variables  Distribution  Mean   Variance
x1         Normal        26     0.1         x11        Normal        1000   100
x2         Normal        3.2    0.05        x12        Normal        100    15
x3         Weibull       7.2    0.07        x13        Lognormal     200    20
x4         Lognormal     0.6    0.131       x14        Weibull       10     0.1
x5         Lognormal     2.18   0.03        x15        Normal        4      0.085
x6         Normal        2.2    0.01        x16        Extreme I     0.3    0.003
x7         Normal        12.5   0.1         x17        Lognormal     0.1    0.001
x8         Extreme I     48     3           x18        Normal        152    0.01
x9         Normal        1.0    0.16        x19        Normal        60     0.085
x10        Weibull       1.0    0.001       x20        Normal        1.6    0.02

Table 2. Uncertain information of interval variables.

Variables  Upper boundary  Lower boundary    Variables  Upper boundary  Lower boundary
y1         ...             ...               y9         ...             ...
y2         33.78           31.72             y10        46              4.2
y3         222             230               y11        1.2             0.8
y4         22              18                y12        20.5            20
y5         2               0                 y13        2.003e6         2e6
y6         1.5             0.5               y14        0.021           0.018
y7         380             360               y15        2B              20
y8         53              47                y16        2.4             2

Table 3. Reliability of different parts.

Part   Reliability index $\beta^I$   Reliability index $\beta^m$
E1     [2.4119, 3.1729]              2.8013
E2     [1.7269, 2.3996]              2.0679
E3     [1.8211, 1.9646]              1.8925
E4     [4.4907, 6.1194]              5.3166
E5     [2.9614, 3.4732]              3.3013
E6     [5.8321, 5.8923]              5.8772
E7     [3.0761, 3.4638]              3.3587
E8     [2.7731, 3.4253]              2.8196
E9     [2.8845, 3.2076]              3.0313

Figure 5. Optimization process of the upper boundary.

Figure 6. Optimization process of lower boundary.

6.1 Engineering case


A satellite driving system is composed of a stepper 
motor, a harmony reducer, a photoelectric encoder 
and a microswitch. The CAD model of the satel- 
lite driving system is as shown in Figure 7. After 
structure analysis, the hierarchy model of the driv- 
ing system as shown in Figure 8. The event in top 
level is the failure of driving system. 

Through the calculation and statistical experi- 
ments, in the driving system, the reliability indexes 
of the bottom elements are shown in Table 5. 

Based on the hierarchy model in Figure 9, the 
system reliability is: 


R; = Ra “Re, “Re, Re, Ry, ` 
[i -(1 -R,,)-(1 - Ry.) |- Re, “Re, “Re, “Ry, 
(24) 


The upper and lower boundary values of the 
system reliability could be obtained by solving fol- 
low optimization models. 


minimize/maximize Rs 
subject to R, € [ Re, | 


Based on the data in Table 5, the midpoint reliability R^m can be obtained directly: it is 0.97313495. For the interval reliability R^I, the result of the OMM is R^I = [0.9485038, 0.9911342] and the result of the IAM is R^I = [0.9338912, 0.9933270]. The interval obtained by the OMM is clearly narrower than the interval obtained by the IAM, which means that the OMM can reduce the interval extension in this problem. Figure 10 shows that the ABC algorithm remains stably within the optimum margin for a highly nonlinear problem. The OMM based on the ABC algorithm can deliver a precise interval result for a complex system, which is meaningful for decision making. The left and right spread widths relative to R^m are shown in Table 6. The spread widths of the OMM are smaller than those of the IAM. Compared with the IAM, the left spread width of the OMM is reduced by about 0.0145918 (37.1825%), and the right spread width by 0.0021858 (10.82878%).
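As a quick numerical check of the midpoint value, the sketch below evaluates a series system with one redundant pair using the R^m column of Table 5. It assumes that the second and third elements of the table form the redundant pair; this assignment is an assumption made here because it is consistent with the reported midpoint reliability, and the name `system_reliability` is illustrative only.

```python
from math import prod

# Midpoint reliabilities R^m of the eleven bottom elements, in the row order of Table 5.
R_MID = [0.9998, 0.99984, 0.999835, 0.99981, 0.999826, 0.999858,
         0.99994, 0.999, 0.9996, 0.9906, 0.9845]


def system_reliability(r, pair=(1, 2)):
    """Series system containing one redundant (parallel) pair.

    `pair` holds the zero-based positions of the two redundant elements.
    ASSUMPTION: the default (second and third rows of Table 5) is an
    illustrative choice, not read from the hierarchy model itself.
    """
    i, j = pair
    series = prod(x for k, x in enumerate(r) if k not in pair)
    parallel = 1.0 - (1.0 - r[i]) * (1.0 - r[j])
    return series * parallel


r_mid = system_reliability(R_MID)
print(f"midpoint system reliability: {r_mid:.8f}")
```

The same function can be evaluated at other points inside the element intervals to explore how the system reliability varies.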


Table 4. Results of different algorithms.

Methods   Lower boundary value   Upper boundary value   Iteration steps
ABC       0.93429182             0.96392416             102
PSO       0.93420751             0.96398842             177
GA        0.92682667             0.96738521             74


Figure 7. CAD model of the driving system and the key components.


Figure 8. Hierarchy model of the driving system. 

Table 5. Reliability of the bottom elements.

Elements   R^I                    R^m
P1         0.9998                 0.9998
P2         [0.9998, 0.9999]       0.99984
P3         [0.9998, 0.9999]       0.999835
P4         [0.99972, 0.99986]     0.99981
P5         [0.9980, 0.9985]       0.999826
P6         [0.99982, 0.99987]     0.999858
P7         [0.9999, 0.99998]      0.99994
P8         0.999                  0.999
P9         [0.999, 0.9998]        0.9996
P10        [0.980, 0.999]         0.9906
P11        [0.9575, 0.9975]       0.9845


Figure 9. Optima results and the interval math results. 
Table 6. Left and right spread widths relative to R^m.

Methods   Left spread width   Right spread width
IAM       0.0392437           0.0201851
OMM       0.0246311           0.0179993


7 CONCLUSION 


In this paper, a reliability analysis method is proposed for complex mechanical systems. The method overcomes the difficulty of handling uncertain information that arises from differences in the completeness of the statistical information. The reliability index β^I, together with β^m of the midpoint limit-state function, is introduced to comprehensively assess the reliability of a system with hybrid uncertainty. The method avoids both subjective assumptions about interval variables and the waste of information on random variables.

A composite limit-state function is used to build the reliability model of the complex mechanical system, which makes it possible to handle a large number of variables and a large amount of information. An optimization-model-based technique is employed to solve for the interval of the system reliability index, and the advanced intelligent ABC algorithm is used in the optimization models. This optimization-model-based method can overcome the difficulties brought by the complexity of the system structure, and can also avoid the interval extension induced by the IAM. With this method, the higher-sensitivity zone of the reliability interval can be obtained, which is useful and meaningful for reliability estimation and decision making for complex mechanical systems.
Common cause failures and cascading failures in technical systems:
Similarities, differences and barriers


L. Xie, M.A. Lundteigen & Y.L. Liu 
NTNU, Trondheim, Norway 


ABSTRACT: Many technical systems continue to increase in size and complexity, with more interactions and interdependencies between components. Dependent failures, such as common cause failures and cascading failures, are becoming important concerns for system reliability. Both failure types may lead to the unavailability of multiple components at the same time or within a short time interval. Although many researchers have studied common cause failures and cascading failures separately, there is little comparison of the two concepts. This paper investigates the similarities and differences of these two failure groups, with a focus on the conditions and nature of the initiation and propagation of such failures. Moreover, a comparison is made of suitable barrier strategies that can either prevent such failures or reduce their consequences. The paper concludes the study with a demonstration of reliability modeling for common cause failures and cascading failures.


1 INTRODUCTION 


Technical systems, such as railway systems, processing systems in chemical and petroleum plants, and power grids, are becoming increasingly complex. These systems include many physical components, with a huge number of interactions and interdependencies. Sometimes, failures occurring in multiple components result from these interconnections. We refer to such failures as dependent failures. Within the category of dependent failures, there are two sub-categories of specific interest: common cause failures (CCFs) and cascading failures (Rausand and Lundteigen, 2014). In the chemical and process industry, cascading processes are called domino effects (Abdolhamidzadeh et al., 2010, Abdolhamidzadeh et al., 2009, Landucci et al., 2016).

Past accidents and near misses have shown that dependent failures are one of the main threats to a complex system. For example, CCFs are main contributors to failures in safety systems of the oil and gas industry (Smith and Simpson, 2004, Lundteigen and Rausand, 2007). Fires in the chemical and process industry highlight the severe cascading consequences (Landucci et al., 2016, Cozzani and Reniers, 2013). The blackouts in the United States and Canada in 2003, and in Europe in 2006, are also examples of cascading failures (Kotzanikolaou et al., 2013, Andersson et al., 2005). Many other infrastructure systems, such as water distribution networks and transportation systems, also often suffer from cascading failures (Lin et al., 2014, Shuang et al., 2014, Ouyang, 2014).


So far, most attention seems to have been directed to CCFs, specifically for safety-critical systems where redundancy is actively used to enhance reliability (Paula et al., 1991, Humphreys and Jenkins, 1991, Lundteigen and Rausand, 2007, IEC61508, 2010, Mosleh et al., 1998). Two main strategies have been suggested for incorporating defenses against CCFs in design. One is to carry out analyses to identify and remove causes, and the other is to introduce measures to reduce the effects of CCFs in case they occur. Suggested methods include cause-defense matrices, common cause analysis, and zonal analysis (Humphreys and Jenkins, 1991, Paula et al., 1991).

The defenses against CCFs are typically identified in design; however, measures in the operational phase are also important (Lundteigen and Rausand, 2007). Even for an excellent system design, there will always remain a risk of CCFs. It is therefore required to include the contribution of CCFs in the quantitative analyses used to demonstrate adequate reliability. A large number of models have been introduced for this purpose (Vesely, 1977, Fleming, 1975, Evans et al., 1984, Mosleh and Siu, 1987). The standard beta factor model is perhaps the most widely adopted, due to its simplicity (Fleming, 1975, IEC61508, 2010). The PDS method (Hauge et al., 2015) is an extension of the standard beta factor model, where a second parameter is added to account for voting, e.g. 2-out-of-3 and 1-out-of-3.

As for cascading failures, it is of interest to consider efficient means to avoid such failures or to reduce the vulnerability to them in the system design, and to quantify cascading failures. An important task in these analyses is to study interdependencies, and many analysis approaches in the literature are based on the topology of complex networks (Motter and Lai, 2002, Wang, 2012, Albert and Barabasi, 2002). One kind of cascading failure occurs when a heavily loaded component fails and its load is redistributed to other components, resulting in loads on them that exceed their capacities. State-based approaches such as Markov processes, approaches based on Bayesian network models, and Monte Carlo simulation have been used to analyze cascading failures (Iyer et al., 2009, Calvino et al., 2016, Erp et al., 2017).
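The load-redistribution mechanism described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the load of a failed component is shared equally among all surviving components; this is a simplification and not the model of any of the cited references. The function name `simulate_cascade` and the numbers in the usage lines are hypothetical.

```python
def simulate_cascade(loads, capacities, first_failure):
    """Return the sorted indices of failed components.

    ASSUMPTION: the load shed by failed components is split equally
    among all components that are still working.
    """
    loads = list(loads)
    failed = {first_failure}
    to_shed = loads[first_failure]
    loads[first_failure] = 0.0
    while to_shed > 0.0:
        alive = [i for i in range(len(loads)) if i not in failed]
        if not alive:
            break  # nothing left to carry the load
        share = to_shed / len(alive)
        for i in alive:
            loads[i] += share
        # Components pushed beyond their capacity fail in this round.
        newly_failed = [i for i in alive if loads[i] > capacities[i]]
        to_shed = sum(loads[i] for i in newly_failed)
        for i in newly_failed:
            loads[i] = 0.0
        failed.update(newly_failed)
    return sorted(failed)


# Tight capacity margins: the first failure overloads every neighbor.
print(simulate_cascade([8, 5, 5, 5], [10, 7, 7, 7], first_failure=0))
# Larger margins: the extra load is absorbed and the cascade stops.
print(simulate_cascade([8, 5, 5, 5], [10, 9, 9, 9], first_failure=0))
```

The two calls differ only in the capacity margins, which shows how spare capacity can act as a barrier against propagation.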

In fact, many technical systems can be subject to both CCFs and cascading failures, so it is important to consider both failure categories in reliability analysis. Unfortunately, very limited attention has been directed to comparing the two types of dependent failures and their corresponding defense strategies. Kotzanikolaou et al. (2013) highlight that CCFs may have cascading effects, but do not go into much detail.

The objective of this paper is therefore to make a comprehensive comparison of the concepts, causes, and mechanisms of the two failure types, and to provide some suggestions on analysis and defense strategies. In this paper, we use the term barrier to denote a specific defense measure.

The rest of the paper is organized as follows: In section 2, we discuss the definitions and interpretations of CCFs and cascading failures. Sections 3 and 4 present the similarities and distinctions of the two failure types. In section 5, we clarify the barriers against the two failure types. A small example is then employed in section 6 to illustrate the effects of CCFs and cascading failures. Conclusions and discussions are given in section 7.


2 DEFINITIONS AND INTERPRETATIONS 


According to Humphreys and Jenkins (1991), dependent failures refer to failures whose probability cannot be expressed as the simple product of the unconditional probabilities of the individual events. Dependencies in a technical system may derive from the sameness of the types of components, exposure to the same environment, the use of shared resources, functionality, common shocks and the inability to resist certain hazardous events (Rausand, 2013).
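In probabilistic terms, this definition can be sketched for two failure events A and B (symbols introduced here for illustration only): the joint failure probability must be written with a conditional probability, because it no longer equals the product of the individual probabilities.

```latex
\Pr(A \cap B) = \Pr(A \mid B)\,\Pr(B) \neq \Pr(A)\,\Pr(B)
```

For independent failures, the conditional probability reduces to Pr(A) and the product form is recovered.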

People in different industrial sectors define CCFs in their own ways. The nuclear sector defines a CCF as two or more component fault states that exist at the same time, or within a short interval, because of a shared cause (Mosleh et al., 1988). The generic standard on the design and operation of electrical, electronic, and programmable electronic safety-related systems, IEC 61508, defines a CCF as a failure that is the result of one or more events, causing concurrent failures of two or more separate channels in a multiple channel system, leading to system failure (IEC61508, 2010). Both definitions emphasize that CCFs involve at least two failures that are due to a shared or common cause.

A cascading failure may be understood as multiple failures initiated by the failure of one component in the system that results in a chain reaction, the so-called domino effect (Rausand and Øien, 1996). In power systems, a cascading failure refers to a sequence of dependent failures of individual components that successively weakens the system (Baldick et al., 2008). This differs from the definition used for infrastructures, which limits a cascading failure to the propagation of failures between components (Rinaldi et al., 2001). In general, the definitions share some common elements: cascading failures are multiple failures initiated by a single failure, and a sequential effect occurs.

From the perspective of failure causes, both CCFs and cascading failures result from some common vulnerabilities of more than one component. These two types of failures are interrelated in some cases (Laprie et al., 2007, Kotzanikolaou et al., 2013). However, they are still two distinct categories of dependent failures. As Smith and Watson (1980) explained, CCFs are 'first in line' failures, which means that the failures depend only on the causes, not on each other.

In the following sections, we elaborate on the similarities and differences between the two failure types.


3 SIMILARITIES 


We categorize the similarities between CCFs and cascading failures into three aspects: multiplicity, timeliness and root causes.


3.1 Multiplicity 


Both CCFs and cascading failures obviously involve more than one component. For both categories of failures, we are concerned with the effect of the failure of several components and functions.


3.2 Timeliness 


For both CCFs and cascading failures, the time from the first failure to the existence of multiple failures is often short. With insufficient mitigation measures, the collapse of an entire system may occur very soon. For example, in the Three Mile Island accident in 1979, caused by CCFs, the radiation level in the primary coolant water was around 300 times the expected level after only 2 hours (Hasani, 2017). The power blackout in India in 2012, due to cascading failures, spread across 22 states within 12 hours and affected more than 620 million people (Russel, 2012).


3.3 Root causes 


The root causes of both CCFs and cascading failures lie in the common vulnerability of more than one component in a system. For CCFs, coupling factors between components can explain why multiple components are destroyed by a common hazardous event, e.g. cold temperature, extreme snowfall or electrical failure. For cascading failures, couplings can likewise explain why multiple components are affected by the faults of related components. For example, the unavailability of one processing unit increases the workload of another unit.


4 DIFFERENCES 


The differences between the two types of failures fall into two categories: initiation and propagation of failures, as shown in Table 1.

Initiation of failures

As seen in Table 1, the initiating event of a CCF can be either replicated or occur simultaneously for several components. The effects of CCFs arise from shared causes, and may be simultaneous failures or failures some time apart. A cascading failure always starts with a single preceding component failure, as the effect of an initiating event.


Table 1. Differences between CCFs and cascading failures.

Difference    Characteristics        CCFs                                                   Cascading failures
Initiation    Triggering condition   Shared causes                                          Conditional on preceding failures
              Occurrence             Simultaneously or during a critical time of interest   Sequence
Propagation   Sequence               First in line                                          Series
              Consequence            Finite                                                 Possibly infinite
              Pathway                Cause-components                                       Connected/dependent components
Figure 1. CCF and cascading failures: (a) CCF, (b) cascading failure. [Labels in the figure: High Temp., Overload, High Pressure.]


Figure 2. Comparisons of CCFs and cascading failures in terms of impact and effect: (a) sequence, (b) consequence, (c) pathway.
To illustrate these differences, we introduce two small examples, as shown in Figure 1. High temperature is the initiating event of both a CCF and a cascading failure in this case. In Figure 1(a), all four components are exposed to high temperature, so all or some of the components fail simultaneously or within a short interval. However, in the case of a cascading failure in Figure 1(b), only component 1 is exposed to high temperature and fails due to this initiating event. Then, the failure of component 1 triggers the failures of other components for diverse reasons. Even in the same cascading sequence, the failure causes can differ between components.


4.1 Propagation of failures 


Propagation of failure in this context means the evolution of multiple failures after the initiating event has already manifested. Figure 2 illustrates the differences in the propagation of CCFs and cascading failures. CCFs are first in line failures, a notion that excludes dependent failures from the CCF definition (Smith and Watson, 1980) and implies that CCFs are directly linked to the failure causes. In contrast, the propagation of a cascading failure follows a series of interactions. CCFs differ most from cascading failures in the manner of propagation. As shown in Figure 2(a), for CCFs, the first in line failures only occur on components 1 and 2. Regarding the consequence of failure propagation, as shown in Figure 2(b), a cascading failure can escalate and result in worse impacts on the other parts of a system, such as more serious disruptions, overload on neighbors and longer recovery time. CCFs highlight a direct cause-effect relationship between the cause and the failed components (Rausand and Lundteigen, 2014), whereas the pathway of cascading failures involves the interactions or dependencies between related components, see Figure 2(c).


5 BARRIERS 


Barriers are employed to prevent, control or mitigate undesired events or accidents (Sklet, 2006). Barriers are sometimes also called defenses, protection layers or countermeasures. In general, a barrier function can be realized by many different means, such as a technical or physical system, human actions or procedures.

In the design phase of a system, it is possible to introduce barriers against potential failures, such as separation, diversity, quality control and simplicity of design. Some of them are effective in reducing the probability of CCFs, and some are more useful for protecting the system from cascading failures. Considering the similarities and differences of CCFs and cascading failures, we can categorize barriers into three groups: barriers against both failures, barriers against CCFs and barriers against cascading failures.

Figure 3. Barriers for CCF and cascading failures: (a) barriers against CCFs, (b) barriers against cascading failures.


• Barriers efficient for both failures: Such barriers should be designed in consideration of the similarities of CCFs and cascading failures, such as their root causes and coupling factors. One way of barrier design is therefore to mitigate and reduce the vulnerability to root causes. Simplicity, for example, can be regarded as a barrier that reduces system complexity, which is one important source of vulnerability. Another way of barrier design is to decrease the degree of coupling among components. Spatial and temporal separations are examples of this. In practice, firewalls in a process plant are effective barriers to prevent fire disasters.

• Barriers against CCFs: The purpose of such barriers is to isolate components from failure causes, as shown in Figure 3(a). One example is diversity of design. Diverse components will often have different failure modes, and are therefore less likely to be affected by the same common cause. However, diversity is not effective in mitigating cascading failures: the failure of one component brings a higher workload to its neighbors and increases their failure probabilities, no matter whether the components are identical or not.

• Barriers against cascading failures: The main purpose of this kind of barrier is to stop or slow down failure propagation, as shown in Figure 3(b). An example of this class of barriers is a process shutdown valve that can isolate related process segments. If abnormal events occur in the upstream facility, the shutdown valve can stop or limit the flow between the two facilities, and thereby halt the failure propagation.


In the next section, we will use a small example 
to illustrate the quantitative analyses for CCFs and 
cascading failures, and the effects of barriers. 
6 CASE STUDY


Consider a system comprising two parallel components. The effects of failures and the corresponding barriers are studied separately for the two types of dependent failures, as illustrated in Figure 4.

For modeling CCFs, a new independent “CCF” event is added, following the standard beta factor model with beta factor β. The parameter β can be interpreted as the conditional probability that a failure of a channel is in fact a common cause failure:

$\beta = \Pr(\text{CCF} \mid \text{failure of a channel})$  (1)

With the inclusion of CCFs, the total system reliability can be obtained as:

$R(t) = \left( 2R^{1-\beta} - R^{2(1-\beta)} \right) R^{\beta}$  (2)

where R = 0.8 and β = 0.1.

For modeling cascading failures, it is necessary to consider the effects of functional dependency between the two components; a Bayesian network model is the approach used here. The conditional failure probability is a measure of dependency that differs from the conditional probability β for CCFs. The conditional probability for cascading failures can be defined as:

$\Pr(\text{Comp. B fails} \mid \text{Comp. A fails}) = \dfrac{F_{B|A}}{F_A}$  (3)

Here, $F_A$ and $F_B$ denote the individual failure probabilities of components A and B, and $F_{B|A}$ denotes the failure probability of component B on the condition that component A has failed. The total system reliability with cascading failures can be obtained as:

$R(t) = 1 - F_A (F_B + F_{B|A} - F_B F_{B|A}) = 1 - (1 - R)^2 - (1 - R)^2 R P_r$  (4)

where $P_r$ denotes the conditional probability between components A and B and is assigned the value 0.1 ($P_r = 0.1$).

Figure 4. Case study for CCF and cascading failures.

Figure 5. Reliability with cascading failures & CCFs (plot title: System reliability with CCFs, cascading failures and barriers).

As shown in Figure 5, the total system reliability with CCFs becomes 0.946, whereas it is 0.957 with the effects of cascading failures at the same time. This implies that CCFs may have more influence on the reliability performance than cascading failures in this case, under similar assumptions about the probability of additional failures once a first failure has occurred.
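The two reported values can be checked with a short sketch. It assumes one reading of the static models that is consistent with the reported numbers: for CCFs, each channel's reliability R is split into an independent part R^(1-β) and a common cause part R^β; for cascading failures, the conditional term is taken as P_r times the failure probability of component A. The function names are illustrative and not part of the paper.

```python
def reliability_with_ccf(r, beta):
    """Two parallel components under the standard beta factor model.

    ASSUMPTION: R is split into an independent part R**(1 - beta) and a
    common cause part R**beta.
    """
    r_ind = r ** (1.0 - beta)
    return (2.0 * r_ind - r_ind ** 2) * r ** beta


def reliability_with_cascade(r, p_r):
    """Two parallel components where a failure of A may cascade to B.

    ASSUMPTION: the conditional term F_B|A equals p_r * F_A, and both
    components have the same failure probability F = 1 - r.
    """
    f = 1.0 - r
    f_b_given_a = p_r * f
    return 1.0 - f * (f + f_b_given_a - f * f_b_given_a)


print(round(reliability_with_ccf(0.8, 0.1), 3))      # 0.946
print(round(reliability_with_cascade(0.8, 0.1), 3))  # 0.957
```

Setting `beta` or `p_r` to zero recovers the reliability of two independent parallel components, 0.96 for R = 0.8.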

We now introduce time-dependent probabilities for the reliability analysis, and assume that the time to failure is exponentially distributed, with a failure rate of λ = 1E-04 per hour for each component. For the system with CCFs, the total system reliability can be obtained as:

$R(t) = \left[ 2e^{-(1-\beta)\lambda t} - e^{-2(1-\beta)\lambda t} \right] e^{-\beta \lambda t}$  (5)

For the system with cascading failures, the total system reliability can be obtained as:

$R(t) = 1 - (1 - e^{-\lambda t})^2 - (1 - e^{-\lambda t})^2 e^{-\lambda t} P_r$  (6)


Figure 5 illustrates the calculated system reliability as a function of time, considering the effects of the two failure types. We can see that, in this case, the two failure types seem to have comparable effects on the system reliability.
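A rough way to reproduce such curves is to tabulate both time-dependent models for exponentially distributed times to failure. The sketch below uses the failure rate of 1E-04 per hour and the value 0.1 for both β and P_r from the case study; the functional forms are one reading of the models (independent and common cause parts for CCFs, a cascade term proportional to the failure probability of component A) and should be treated as assumptions, as should the names `r_ccf` and `r_cascade` and the chosen time points.

```python
import math

LAMBDA = 1e-4  # failure rate per hour for each component
BETA = 0.1     # beta factor for CCFs
P_R = 0.1      # conditional probability for cascading failures


def r_ccf(t, lam=LAMBDA, beta=BETA):
    """System reliability with CCFs (beta factor model, two parallel components)."""
    ind = math.exp(-(1.0 - beta) * lam * t)   # independent failure part
    return (2.0 * ind - ind ** 2) * math.exp(-beta * lam * t)


def r_cascade(t, lam=LAMBDA, p_r=P_R):
    """System reliability with cascading failures (assumed form, see lead-in)."""
    r = math.exp(-lam * t)
    return 1.0 - (1.0 - r) ** 2 - (1.0 - r) ** 2 * r * p_r


# Tabulate both models over an illustrative range of operating hours.
for hours in (0, 1000, 2000, 5000, 10000):
    print(hours, round(r_ccf(hours), 4), round(r_cascade(hours), 4))
```

Plotting the two columns against time gives a picture comparable in kind to Figure 5, without the barrier curve.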

For CCFs, the function of barriers is to separate shared root causes from the components. The function of the barriers against cascading failures is to prevent the propagation of failures between components A and B. The reliability of the system with barriers is illustrated by the blue line in Figure 5, implying that the system reliability increases when barrier functions are performed against the failures.


7 CONCLUSION AND FURTHER WORK 


Exploring the similarities and differences between CCFs and cascading failures helps us to answer the following questions: 1) why such dependent failures initiate, 2) how dependent failures contribute to disruptions in systems, and 3) what kind of barriers are needed and how they should be implemented. In this paper, we find that CCFs and cascading failures may have comparable influences on the performance of a simple system. More probabilistic and quantitative analyses are required to evaluate the impacts of cascading failures in larger and more complex systems (Erp et al., 2017).

Our further work will involve modeling the 
interdependent systems with cascading failures 
and CCFs, and developing tools to evaluate reli- 
ability for complex systems. It is also of interest 
to identify different failure modes and perform 
barrier analysis for both of the failures, which can 
help to allocate barriers and thereby optimize bar- 
rier functions. 


REFERENCES 


Abdolhamidzadeh, B., Abbasi, T., Rashtchian, D. & Abbasi, S.A. (2010) A new method for assessing domino effect in chemical process industry. Journal of Hazardous Materials, 182, 416-426.

Abdolhamidzadeh, B., Rashtchian, D. & Ashuri, E. 
(2009) A new methodology for frequency estimation 
of second or higher level domino accidents in chemi- 
cal and petrochemical plants using monte carlo simu- 
lation. Iranian Journal of Chemistry and Chemical 
Engineering (IJCCE), 28, 21-28. 

Albert, R. & Barabasi, A.-L. (2002) Statistical mechan- 
ics of complex networks. Reviews of modern physics, 
74, 47. 

Andersson, G., Donalek, P., Farmer, R., Hatziargyriou, 
N., Kamwa, I., Kundur, P., Martins, N., Paserba, J., 
Pourbeik, P. & Sanchez-Gasca, J. (2005) Causes of 
the 2003 major grid blackouts in North America and 
Europe, and recommended means to improve system 
dynamic performance. IEEE Transactions on Power 
Systems, 20, 1922-1928. 

Baldick, R., Chowdhury, B., Dobson, I., Dong, Z., 
Gou, B., Hawkins, D., Huang, H., Joung, M., Kir- 
schen, D. & Li, F. (2008) Initial review of meth- 
ods for cascading failure analysis in electric power 
transmission systems IEEE PES CAMS task force 
on understanding, prediction, mitigation and res- 
toration of cascading failures. Power and Energy 
Society General Meeting-Conversion and Delivery of 
Electrical Energy in the 21st Century, 2008 IEEE. 
IEEE. 

Calviño, A., Grande, Z., Sánchez-Cambronero, S., Gallego, I., Rivas, A. & Menéndez, J.M. (2016) A Markovian-Bayesian network for risk analysis of high speed and conventional railway lines integrating human errors. Computer-Aided Civil and Infrastructure Engineering, 31, 193-218.

Cozzani, V. & Reniers, G. (2013) Historical background and state of the art on domino effect assessment. Domino Effects in the Process Industries: Modelling, Prevention and Managing. Elsevier, Amsterdam, The Netherlands.

Erp, N.V., Linger, R., Khakzad, N. & Gelder, P.V. (2017) 
Report on risk analysis framework for collateral 
impacts of cascading effects. RAIN—Risk Analysis 
of Infrastructure Networks in Response to Extreme 
Weather. TU Delft. 

Evans, M., Parry, G. & Wreathall, J. (1984) On the treat- 
ment of common-cause failures in system analysis. 
Reliability engineering, 9, 107-115. 

Fleming, K. (1975) Reliability model for common mode 
failures in redundant safety systems. Modeling and 
simulation. Volume 6, Part 1. 

Hasani, F. (2017) Calculation and Analysis of Reliabil- 
ity with Consideration of Common Cause Failures 
(CCF)(Case Study: The Input of the Dynamic Posi- 
tioning System of a Submarine). International Journal 
of Industrial Engineering & Production Research, 28, 
175-187. 

Hauge, S., Hoem, A., Hokstad, P., Håbrekke, S. & Lundteigen, M.A. (2015) Common Cause Failures in Safety Instrumented Systems. SINTEF Technology and Society, Trondheim.

Humphreys, P. & Jenkins, A.M. (1991) Dependent fail- 
ures developments. Reliability Engineering & System 
Safety, 34, 417-427. 

IEC61508 (2010) Functional safety of electrical/electronic/programmable electronic safety related systems. International Electrotechnical Commission.

Iyer, S.M., Nakayama, M.K. & Gerbessiotis, A.V. (2009) A Markovian dependability model with cascading failures. IEEE Transactions on Computers, 58, 1238-1249.

Kotzanikolaou, P., Theoharidou, M. & Gritzalis, D. 
(2013) Cascading effects of common-cause failures 
in critical infrastructures. International Conference on 
Critical Infrastructure Protection. Springer. 

Landucci, G., Argenti, F., Spadoni, G. & Cozzani, V. (2016) Domino effect frequency assessment: The role of safety barriers. Journal of Loss Prevention in the Process Industries, 44, 706-717.

Laprie, J.-C., Kanoun, K. & Kaâniche, M. (2007) Modelling interdependencies between the electricity and information infrastructures. Computer Safety, Reliability, and Security, 54-67.

Lin, Y., Li, D., Liu, C. & Kang, R. (2014) Framework 
design for reliability engineering of complex systems. 
Cyber Technology in Automation, Control, and Intelli- 
gent Systems (CYBER), 2014 IEEE 4th Annual Inter- 
national Conference on. IEEE. 

Lundteigen, M.A. & Rausand, M. (2007) Common cause 
failures in safety instrumented systems on oil and gas 
installations: Implementing defense measures through 
function testing. Journal of Loss Prevention in the 
process industries, 20, 218-229. 

Mosleh, A., Rasmuson, D.M. & Marshall, F.M. (1998) Guidelines on modeling common cause failures in probabilistic risk assessment.

Mosleh, A., Fleming, K., Parry, G., Paula, H., Worledge, 
D. & Rasmuson, D.M. (1988) Procedures for treating 
common cause failures in safety and reliability stud- 
ies: Volume 1, Procedural framework and examples. 
Pickard, Lowe and Garrick, Inc., Newport Beach, CA 
(USA). 




Mosleh, A. & Siu, N. (1987) A multi-parameter common 
cause failure model. Transactions of the 9th interna- 
tional conference on structural mechanics in reactor 
technology. Vol. M. 

Motter, A.E. & Lai, Y.-C. (2002) Cascade-based attacks 
on complex networks. Physical Review E, 66, 065102. 

Ouyang, M. (2014) Review on modeling and simulation 
of interdependent critical infrastructure systems. Reli- 
ability engineering & System safety, 121, 43-60. 

Paula, H.M., Campbell, D.J. & Rasmuson, D.M. (1991) 
Qualitative cause-defense matrices: Engineering 
tools to support the analysis and prevention of com- 
mon cause failures. Reliability Engineering & System 
Safety, 34, 389-415. 

Rausand, M. (2013) Risk assessment: theory, methods, 
and applications, John Wiley & Sons. 

Rausand, M. & Lundteigen, M.A. (2014) Reliability of 
safety-critical systems: theory and applications, John 
Wiley & Sons. 

Rausand, M. & Øien, K. (1996) The basic concepts of 
failure analysis. Reliability Engineering & System 
Safety, 53, 73-83. 

Rinaldi, S.M., Peerenboom, J.P. & Kelly, T.K. (2001) Identifying, understanding, and analyzing critical infrastructure interdependencies. IEEE Control Systems, 21, 11-25.

Russel, H.S. a. R. (2012) 620 million without power in India after 3 power grids fail.

Shuang, Q., Zhang, M. & Yuan, Y. (2014) Node vulner- 
ability of water distribution networks under cascading 
failures. Reliability Engineering & System Safety, 124, 
132-141. 

Sklet, S. (2006) Safety barriers: Definition, classification, 
and performance. Journal of loss prevention in the 
process industries, 19, 494-506. 

Smith, A.M. & Watson, I.A. (1980) Common cause fail- 
ures—a dilemma in perspective. Reliability Engineer- 
ing, 1, 127-142. 

Smith, D.J. & Simpson, K.G. (2004) Functional Safety: A straightforward guide to applying IEC 61508 and related standards, Routledge.

Vesely, W. (1977) Estimating common cause failure prob- 
abilities in reliability and risk analysis: Marshall-Olkin 
specializations. Nuclear systems reliability engineering 
and risk assessment, 2. 

Wang, J. (2012) Mitigation of cascading failures 
on complex networks. Nonlinear Dynamics, 70, 
1959-1967. 




Safety and Reliability - Safe Societies in a Changing World — Haugen et al. (Eds) 
© 2018 Taylor & Francis Group, London, ISBN 978-0-8153-8682-7 


Industry 4.0 and complexity: Markov and Petri net based calculation of 
PFH for designated architectures and beyond 


M. Albert 
SICK AG, Waldkirch, Germany 


M. Dorra 


Institute for Occupational Safety and Health of the German Social Accident Insurance (IFA), 
Sankt Augustin, Germany 


ABSTRACT: Industry 4.0 opens up a new area of industrial automation and challenges established methods for the assurance of safety. Traditionally, the architecture is a major criterion for the determination of safety levels, and so-called designated architectures serve as a basis for the calculation of the PFH (EN ISO 13849-1, 2015, IEC 62061, 2015). This easy approach is, however, limited to systems which can be directly mapped to the assumptions. For modern, software-intensive systems, the extended possibilities for online diagnostics make this mapping particularly difficult. To make the PFH calculation more flexible and transparent, we have derived Markov-based analytic equations for the designated architectures. In this paper, we sketch our approach for a single-channel system with diagnostics. We compare the result to Petri net simulations, analyze the influence of individual parameters and argue how these models can be extended to more complex systems, even beyond the realm of the designated architectures.


1 INTRODUCTION 


In machinery the demand for safety and flexibil- 
ity at the same time has led to the development of 
a large variety of safety functions. According to 
the risk associated with particular hazards in the 
application, these functions are intended to assure 
an appropriate risk reduction. Accordingly, the 
integrity of the safety function is typically speci- 
fied as a Safety Level, e.g. as the “Safety Integrity 
Level” (SIL) of IEC 61508 (IEC 61508-1, 2010), 
the SIL claim of IEC 62061 (IEC 62061, 2015), 
or the Performance level (PL) of ISO 13849-1 
(EN ISO 13849-1, 2015). 

These standards directly link the achievable 
safety level to a target probability of dangerous 
failures. For safety in machinery (usually operated 
in high demand mode), this is the average frequency 
of dangerous failures of the safety function PFH 
(IEC 61508-4, 2010, Innal et al. 2010). 

Most implementations of safety functions involve
electrics and/or electronics, but the complete
functional chain often also employs pneumatics,
hydraulics or mechanics. Irrespective of the
technologies, the safety-related reliability in terms
of the PFH value can be kept sufficiently low by
using diagnostics and redundancy. The standards
IEC 61508, IEC 62061 and ISO 13849 (IEC 61508-1,
2010, IEC 62061, 2015, EN ISO 13849-1, 2015)
governing the functional safety of machinery
address a couple of simple and generic system
architectures but pursue different ways of
evaluating the PFH value: IEC 61508 and IEC 62061
provide equations for the calculation of the PFH,
ISO 13849-1 provides tables. However, the
underlying approaches of these standards as well as
the assumptions made result in several shortcomings
concerning the applicability of these simple
approaches to real systems. On the other hand,
all of the standards suggest stochastic modeling
as a suitable method to determine the PFH
value of systems which cannot directly be mapped
to their designated architectures.

Among these methods, Markov models have 
proven to be a suitable tool for the calculation 
of safety-related reliability measures (Brissaud & 
Oliveira 2012, Dutuit et al. 2008). 

Unlike numerical methods (stochastic Petri nets, 
Monte Carlo simulation), they enable the deriva- 
tion of analytic equations (Dutuit et al. 2008). 
Their major drawbacks are the limitation to expo- 
nentially distributed processes (constant transition 
rates) and the (exponential) explosion of the state 
space with the number of elements. These draw- 
backs make the application to more complex sys- 
tems with non-exponentially distributed processes 
and many states difficult (IEC 61508-6, 2010). 

In the first part of this paper, we will argue how
the limitation to exponentially distributed processes
can be overcome for a single channel system
with diagnostics (1oo1D) without significantly
reducing the precision of the results. Subsequently,
a very general analytic formula for the calculation
of the PFH for the 1oo1D systems will be derived.

The second part will show how the analytic
results can be reproduced by Petri net based
Monte-Carlo simulations and will argue how the
corresponding Petri net can be extended to more
complex systems.


2 SINGLE-CHANNEL SYSTEM WITH 
DIAGNOSTICS 


Figure 1 shows the block diagram of the single
channel system with diagnostics (1oo1D) considered
in this paper.

The functional channel F with the dangerous
failure rate $\lambda_{FD}$ performs the safety function.
The diagnostic channel M (monitor) with
the dangerous failure rate $\lambda_{MD}$ tests F with the
diagnostic coverage $DC$ and the diagnostic rate
$r_t$. With respect to F, “dangerous failure” means
loss of the safety function, whereas with respect
to M, “dangerous failure” means loss of the
diagnostic function which M should execute for
F. $\beta$ is the “common cause factor” as defined
by IEC 61508-6, 2010, Annex D. It represents
a measure for the susceptibility of F and M to
dangerous failures due to the same causes. If a
dangerous failure of F is detected, M initiates a
safe state. The diagnostic channel M is not tested
by any diagnostics. Failures of the diagnostic
channel will hence not be detected. The system
is subject to regular demands of the safety function
that occur at random points in time. $r_d$ represents
the mean demand rate. At least one demand
occurs per year (high demand mode of operation,
see IEC 61508-4, 2010).


Figure 1. Block diagram of a single channel system
with diagnostic channel.


2.1 State transition model of the single 
channel system 


Modelling the above-noted features of the 1oo1D
systems leads to the state transition model depicted
in Figure 2. All system states shaded in light grey
are dangerous states because the safety function
cannot be executed due to a failure of channel F. A
demand on the safety function will lead to a
hazardous event if the system is in one of these states.

Assuming time-constant failure rates, the transitions
representing those failures are exponentially
distributed functions (solid arrows). Testing of F
and demand of the safety function are supposed
to be uniformly distributed (dashed arrows), which
corresponds to a constant frequency of occurrence.
These processes are either driven by automatic
periodic procedures (especially the tests) or their
frequency is estimated (especially the demands). In
the latter case the estimation typically comprises
an average frequency only, and there is no basis
for the assumption of any other distribution than
the uniform distribution as an adequate representation
of regularity. In contrast, the repair
process complies with a jump distribution (dotted
arrows, labeled with the repair rate $r_r$). Failures of
the functional channel F and of the monitor M
due to the same cause (“common cause failures”)
are associated with the common cause failure rate
$\lambda_{CC}$. A common cause failure can only occur in the
OK state, from where it leads directly to the state
“F DU, M D”.

Since not all of the transitions have exponential 
distribution functions, this graph does not repre- 
sent a Markov model. 


Figure 2. State transition model for the single channel
system with diagnostics; “F DD”: F failed dangerously
detectable, “F DU”: F failed dangerously undetectable,
“M D”: M failed dangerously (unable to detect faults of F).
Legend: solid arrows are exponentially distributed, dashed
arrows uniformly distributed and dotted arrows
jump-distributed transitions; the PFH measurement points
are marked.

Applying a demand of the safety function while
the system is in one of the dangerous states will
result in a “hazardous event”. Therefore, according
to Figure 2, the PFH is made up of the absolute
flow (events per time unit) associated with the
demand-driven transitions. This is where the PFH
measurement points are located.


2.2 Principles for model simplification 


In order to transform the model into a simpler and 
Markov-compatible form, several simplifications 
can be carried out. 

The “F DU” state can be merged with the 
“F DU, M D” state without any influence on 
PFH, because the sum of their probabilities is not 
affected by the transition between them. It is, how- 
ever, the probability sum only which determines 
the PFH contribution of these states or, respec- 
tively, of the merged state, since there is no other 
exit than via the demand process. Additionally, the 
ratio of the transition rates allows for some signifi- 
cant model simplifications without considerable 
loss of PFH accuracy. For instance, by neglect- 
ing the Mean Repair Time (MRT) the “operating 
inhibition” and “hazardous event” states can be 
merged with the “OK” state. 

On the other hand, the most interesting state of
the model is “F DD”. It is fed by the very low rate
$DC \cdot (\lambda_{FD} - \lambda_{CC})$ (detectable failures of F without
common cause failures) but cleared by the diagnosis
or the demand process on a regular basis with
a much higher rate. As a consequence, “F DD” is
nearly empty at any time and the probability of the
state can be neglected. (It will, however, be used to
calculate the outflow rates of this state, see below.)
Moreover, the “F DD” state is additionally
discharged by failures of the monitor M (rate $\lambda_{MD}$).
The “F DD” state can be eliminated by making use
of its very low probability and by determining the
outflow rates as a function of the inflow rate.

This situation is sketched in Figure 3, where the 
related state is labelled “FPN”, as it serves as “flow 
partitioning node”. 


Figure 3. Flow partitioning node; $\lambda_1$ and $\lambda_2$ are nominal
rates of exponentially distributed failure processes, $T$ is
the mean time between two clearings of FPN, $\Lambda_1$, $\Lambda_2$ and
$\Lambda_3$ denote the corresponding absolute mean flow rates,
$\bar{\lambda}_2$ and $\bar{\lambda}_3$ are the resulting mean surrogate outflow
rates of the FPN node.


At “FPN”, phases of duration $T$ with feeding
by the absolute rate $\Lambda_1$ and a small discharge by
the rate $\lambda_2$ are interrupted by a complete clearing.
“Absolute rate” means probability flow per time
unit. In case of an exponentially distributed process,
the absolute rate $\Lambda$ is given by the probability of
the source state, multiplied by the nominal rate $\lambda$ of
that process. During the interval $[0, T]$ of the time
$t$, the probability of the state satisfies

$$P_{FPN}(t + \Delta t) = P_{FPN}(t) + \Lambda_1 \Delta t - P_{FPN}(t)\, \lambda_2 \Delta t \tag{1}$$

For $\Delta t \to 0$ this yields:

$$\dot{P}_{FPN}(t) = \Lambda_1 - \lambda_2 P_{FPN}(t) \tag{2}$$

With $0 < t < T$ and the initial condition
$P_{FPN}(0) = 0$, the solution of this differential
equation is:

$$P_{FPN}(t) = \frac{\Lambda_1}{\lambda_2} \left(1 - e^{-\lambda_2 t}\right) \tag{3}$$

The mean probability is calculated by integration
over a time interval of duration $T$ and division by $T$:

$$\bar{P}_{FPN} = \frac{1}{T} \int_0^T P_{FPN}(t)\, dt = \frac{\Lambda_1}{\lambda_2} \left(1 - \frac{1 - e^{-\lambda_2 T}}{\lambda_2 T}\right) \tag{4}$$

By replacing the absolute inflow rate $\Lambda_1$ by the
nominal rate $\lambda_1$, Equation 4 allows the nominal
inflow rate $\lambda_1$ of “FPN” to be partitioned into the
two fractions $\bar{\lambda}_2$ and $\bar{\lambda}_3$ (the mean outflow through
the failure process is $\lambda_2 \bar{P}_{FPN}$, the remainder leaves
through the clearing):

$$\bar{\lambda}_2 = \lambda_1 \left(1 - \frac{1 - e^{-\lambda_2 T}}{\lambda_2 T}\right) \tag{5}$$

$$\bar{\lambda}_3 = \lambda_1\, \frac{1 - e^{-\lambda_2 T}}{\lambda_2 T} \tag{6}$$

Note that the splitting of $\lambda_1$ into $\bar{\lambda}_2$ and $\bar{\lambda}_3$
according to Equations 5 and 6 considers the exponential
distribution function associated with $\lambda_2$ and the
uniform distribution function associated with the
clearing process with rate $1/T$.

The flow partitioning node “F DD” of Figure 2
is cleared by testing as well as by demand on the
safety function. Hence, the following substitution
in Equations 5 and 6 is necessary:

$$T = \frac{1}{r_t + r_d} \tag{7}$$

According to Figure 2, $\lambda_1$ and $\lambda_2$ are given by
$DC \cdot (\lambda_{FD} - \lambda_{CC})$ and $\lambda_{MD}$, respectively.
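
The flow partitioning of Equations 5 and 6 can be sketched numerically. The following Python fragment is a minimal illustration, not part of the original derivation: the function name `partition_inflow`, the series branch for small arguments and the example parameter values (MTTF of 30 a, DC = 90%, common cause factor 2%, demand rate 360 per hour, test rate 100 times higher) are assumptions chosen for illustration. It reads the clearing interval as the reciprocal of the summed test and demand rates.

```python
import math


def partition_inflow(lam1, lam2, T):
    """Split the nominal inflow rate lam1 of a flow partitioning node.

    Returns (lam2_bar, lam3_bar): the surrogate rate leaving through the
    failure process with nominal rate lam2 (Equation 5) and the surrogate
    rate leaving through the periodic clearing (Equation 6).
    """
    x = lam2 * T
    if x < 1e-4:
        # Series of 1 - (1 - exp(-x)) / x; avoids cancellation for tiny x.
        failed_fraction = x / 2.0 - x * x / 6.0
    else:
        failed_fraction = 1.0 - (-math.expm1(-x)) / x
    lam2_bar = lam1 * failed_fraction
    lam3_bar = lam1 - lam2_bar
    return lam2_bar, lam3_bar


# Illustrative (assumed) parameters, all rates per hour.
HOURS_PER_YEAR = 8760.0
lam_fd = 1.0 / (30 * HOURS_PER_YEAR)      # functional channel, MTTF = 30 a
lam_md = lam_fd                           # monitor channel, same MTTF
lam_cc = 0.02 * min(lam_fd, lam_md)       # common cause failure rate
lam1 = 0.9 * (lam_fd - lam_cc)            # inflow of "F DD" for DC = 90 %
T = 1.0 / (36000.0 + 360.0)               # clearing by test or demand

lam2_bar, lam3_bar = partition_inflow(lam1, lam_md, T)
print("share lost to monitor failures:", lam2_bar / lam1)
print("share cleared by test or demand:", lam3_bar / lam1)
```

With such parameters the share diverted by monitor failures is many orders of magnitude below one, which is the numerical counterpart of the statement that “F DD” is nearly empty at any time.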


2.3 Resulting model and analytic result 


Applying the above-mentioned simplifications to the
model of Figure 2 yields the model shown in Figure 4.

The simple two-state model of Figure 4 is finally
a Markov model since all occurring transitions are
exponentially distributed. The related differential
equation system can be solved for the initial
conditions $p_{OK}(0) = 1$ and $p_{MD}(0) = 0$, whereupon,
according to Figure 4, the instantaneous PFH value can
be calculated by

$$pfh(t) = \bar{\lambda}_3\, \frac{r_d}{r_t + r_d}\, p_{OK}(t) + \left[(1 - DC)(\lambda_{FD} - \lambda_{CC}) + \bar{\lambda}_2 + \lambda_{CC}\right] p_{OK}(t) + \lambda_{FD}\, p_{MD}(t) \tag{8}$$

Calculating the mean value of $pfh(t)$ over the
mission time $T_M$ eventually yields

$$PFH = \lambda_{FD} - DC \cdot TRTE \cdot \frac{\lambda_{FD} - \lambda_{CC}}{(\lambda_{FD} + \lambda_{MD} - \lambda_{CC})^2\, T_M} \left\{ \lambda_{FD} (\lambda_{FD} + \lambda_{MD} - \lambda_{CC})\, T_M + (\lambda_{MD} - \lambda_{CC}) \left(1 - e^{-(\lambda_{FD} + \lambda_{MD} - \lambda_{CC})\, T_M}\right) \right\} \tag{9}$$

where the common cause failure rate of the
channels F and M is given by

$$\lambda_{CC} = \beta \cdot \min(\lambda_{FD}, \lambda_{MD}) \tag{10}$$

and $TRTE$ is the time-related test efficiency. Making
use of $\lambda_{MD} \ll r_t + r_d$, this quantity can be calculated by


$$TRTE = \frac{r_t}{r_t + r_d} \tag{11}$$

Figure 4. Graph of the Markov model for the single
channel system with diagnostics.

Figure 5. Time-related test efficiency (TRTE) as a function of
$r_t/r_d$.


TRTE is a measure for the test effectiveness
under consideration of a temporal “race” of testing
and demand on the safety function. System
design should aspire to avoid such a race so as not
to diminish the effect of testing. This is accomplished
if the test is always executed in time, and in
this case TRTE can be set to its maximum value of
1. TRTE = 0 as well as DC = 0 lead to an untested
channel F and $PFH = \lambda_{FD}$.

If time-optimal testing is not feasible, TRTE
should be made as high as possible by implementing
an adequate test rate. The model of Figure 2
assumes the absence of any synchronization of test
and demand. Under this assumption, Equation 11 states
the dependence of the time-related test efficiency TRTE
on the ratio of the test rate $r_t$ and the demand rate
$r_d$. This dependence is depicted in Figure 5.

IEC 61508, IEC 62061 and ISO 13849 recommend
a ratio $r_t/r_d$ of ≥100 (IEC 61508-2, 2010,
IEC 62061, 2015, EN ISO 13849-1, 2015). According
to Figure 5 or Equation 11, this ensures a
time-related test efficiency of >0.99.
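
The closed-form result can be evaluated directly. The following Python fragment is a minimal sketch of Equations 9 to 11, assuming they are read as the dangerous failure rate of F reduced by a term proportional to DC and TRTE; the function names `trte` and `pfh_1oo1d`, the argument names and the swept example values are hypothetical and not taken from the standards. All rates are per hour and the mission time is in hours.

```python
import math

HOURS_PER_YEAR = 8760.0


def trte(r_t, r_d):
    """Time-related test efficiency (Equation 11)."""
    return r_t / (r_t + r_d)


def pfh_1oo1d(lam_fd, lam_md, dc, beta, r_t, r_d, t_m):
    """Mean PFH of the 1oo1D system over the mission time t_m (Equation 9)."""
    lam_cc = beta * min(lam_fd, lam_md)      # Equation 10
    s = lam_fd + lam_md - lam_cc             # relaxation rate of the two-state model
    x = s * t_m
    braces = lam_fd * x + (lam_md - lam_cc) * (-math.expm1(-x))
    # s * x equals (lam_fd + lam_md - lam_cc)^2 * t_m, the denominator of Eq. 9.
    return lam_fd - dc * trte(r_t, r_d) * (lam_fd - lam_cc) * braces / (s * x)


# Assumed example: equal MTTF for F and M, beta = 2 %, r_d = 360 / h,
# r_t = 100 * r_d, mission time 20 a.
r_d = 360.0
r_t = 100.0 * r_d
t_m = 20 * HOURS_PER_YEAR
for mttf_years in (3, 10, 30, 100):
    lam = 1.0 / (mttf_years * HOURS_PER_YEAR)
    for dc in (0.6, 0.9):
        pfh = pfh_1oo1d(lam, lam, dc, 0.02, r_t, r_d, t_m)
        print(f"MTTF = {mttf_years:3d} a, DC = {dc:.0%}: PFH = {pfh:.3e} per hour")
```

Setting `dc` to zero returns the bare failure rate of F, which matches the limiting case of an untested channel stated above.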


2.4 Other architectures 


The designated architectures of machine controls 
addressed by the standards for functional safety 
also comprise the two-channel architecture. There- 
fore, by applying the principles described in sec- 
tion 2.2, an analytic Markov model-based solution 
for the PFH was also derived for the asymmetric 
redundant 1oo2D structure. Due to limited space 
it is not reported here. A manuscript regarding this 
matter has been proposed for usage within
standardization (Dorra, M., 2017).


3 PETRI NET BASED MONTE
CARLO SIMULATION


Finite state machines are an efficient way to model
the behavior and the dynamics of safety related
systems (EN 61508-6, 2010). Since the 1960s Petri
nets have been used for this purpose and since
then have proven to be a useful and versatile tool.
While in the beginning (timed) Petri nets were
mostly used to synthesize Markov graphs, Monte
Carlo simulation of such nets became increasingly
popular to investigate their dynamic behavior and
to overcome the problem of the exploding state
space of the Markovian approach (EN 61508-6,
2010). In case of Petri nets, the size of the model
scales linearly with the number of modeled elements.
Besides their scaling properties, their easy graphical
representation and the possibility to simulate the
system graphically make Petri nets a powerful tool
for the modelling of safety-related systems. Results
for the calculation of the average probability of
dangerous failures on demand in case of the low
demand mode (<1 demand/year) obtained with
Markov models and Petri nets have proven to be
similar (Brissaud & Oliveira 2012).


3.1 Model 


The basic Petri net model of the 1oo1D architecture
of Figure 1 is depicted in Figure 6. The model
consists of:

— Two places “FOK” and “FD” representing the
working and the failed state of the functional
channel, respectively.

— Two places “MOK” and “MD” representing
the working and the failed state of the monitor
channel.

— The transitions “Failure F” and “Failure M”
representing the failures of the functional and
the monitor channel, respectively. Whenever
the “Failure F” transition is fired, an assertion
is used to generate a random number to classify
detectable and undetectable failures (see description
of the transition “Failure Detected”).

— A transition “Failure CC” representing the
common-cause failures of function and monitor
channel.

— The transition “Demand” representing the
demand of the safety functions (after a failure of
the function channel). Any firing of “Demand”
will reset the Petri net, assuming that the system
will be brought to an “as-good-as-new” state
if a dangerous failure has occurred prior to a
demand of the safety function. If the monitor
channel is also in the failed state, it will be reset
by the transition “Repair M on demand”.

— The cyclic transition “Failure Detected”
representing the periodic testing of the functional
channel by the monitor. Predicates are used to
distinguish detectable and undetectable failures
by comparing the random number generated by
the “Failure F” transition to the DC.

— Additional assertions and predicates are used to
ensure the correctness of the model while keeping
the graphical representation as simple as
possible.

Figure 6. Petri net for single channel system with
diagnostics. For further details see text.


The Petri net model is investigated using the 
Monte-Carlo simulation engine of the GRIF soft- 
ware-framework (Satodev, 2017). 

As argued earlier in this paper, any demand on 
the safety function from a failure state of “F” will 
contribute to the PFH. In case of the Petri net, the 
PFH of the system can hence be determined by the 
average frequency at which the “Demand” transi- 
tion fires (Innal et al. 2010). 

As the monitor channel itself is not monitored, 
a failure of this channel will not be detected and a 
subsequent failure of the functional channel will 
likewise remain undetected. 
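
The idea of estimating the PFH as the average frequency of hazardous demands can be illustrated with a small Monte Carlo sketch. The following Python fragment is not the GRIF Petri net of Figure 6: as a simplifying assumption it samples the two-state model of section 2.3 (monitor working or failed) with exponential holding times and counts hazardous events per hour of mission time. The function name `simulate_pfh`, its arguments and the example values are hypothetical.

```python
import math
import random


def simulate_pfh(lam_fd, lam_md, dc, beta, r_t, r_d, t_m, histories, seed=0):
    """Estimate the PFH as hazardous events per hour over many mission times."""
    rng = random.Random(seed)
    lam_cc = beta * min(lam_fd, lam_md)
    trte = r_t / (r_t + r_d)
    # Hazardous-event rate while the monitor works: undetectable failures,
    # common cause failures and detectable failures hit by a demand before
    # the next test.
    haz_ok = lam_fd - dc * trte * (lam_fd - lam_cc)
    mon_fail = lam_md - lam_cc   # monitor fails alone and silently
    haz_md = lam_fd              # monitor failed: every failure of F is hazardous
    events = 0
    for _ in range(histories):
        t, monitor_ok = 0.0, True
        while True:
            total = (haz_ok + mon_fail) if monitor_ok else haz_md
            t += rng.expovariate(total)
            if t >= t_m:
                break
            if monitor_ok and rng.random() < mon_fail / total:
                monitor_ok = False      # undetected failure of the monitor
            else:
                events += 1             # demand meets a failed channel F
                monitor_ok = True       # system restored as good as new
    return events / (histories * t_m)


# Assumed example: MTTF = 3 a for both channels, DC = 90 %, beta = 2 %,
# r_d = 360 / h, r_t = 100 * r_d, mission time 20 a.
lam = 1.0 / (3 * 8760.0)
estimate = simulate_pfh(lam, lam, 0.9, 0.02, 36000.0, 360.0,
                        20 * 8760.0, histories=5000, seed=1)
print(f"simulated PFH: {estimate:.3e} per hour")
```

Because this sketch samples the simplified Markov model rather than the full net, it can only confirm the averaging step behind Equation 9. Periodic tests, non-exponential timing and the tracking of failure sequences need the Petri net itself.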


3.2 Comparison 


Figure 7 shows the results of the Monte-Carlo
simulations of the Petri net and compares them to
the solution of the analytic Markov model (Equation 9)
for the same sets of parameters. The results of the
two methods are very similar and reproduce the
table values of ISO 13849-1, Annex K, both for
DC = 90% and for DC = 60% (EN ISO 13849-1,
2015). For the calculation, the following assumptions
were made: demand rate: $r_d = 360\ \mathrm{h}^{-1}$,
rate of diagnosis: $r_t = 100 \cdot r_d$, same $MTTF_D$ for
function channel and monitor channel:
$MTTF_{FD} = MTTF_{MD}$, common cause failure rate:
$\lambda_{CC} = \beta \cdot \min(1/MTTF_{FD}, 1/MTTF_{MD})$, with $\beta = 2\%$,
mission time: $T_M = 20$ a. These assumptions are in
accordance with the prerequisites and assumptions
of ISO 13849-1 and allow for a direct comparison
of the results to this standard (EN ISO 13849-1,
2015).

Figure 7. Comparison of PFH values calculated with the
analytic Markov formula (see Equation 9) to the values
of the table in ISO 13849-1, Annex K (EN ISO 13849-1,
2015), and the Petri net based Monte Carlo simulation.
The two sets of curves correspond to DC = 90% (upper)
and DC = 60% (lower). The results of the Markov and
the Petri net approach are very similar and reproduce
reasonably well the values of ISO 13849-1, Annex K.


3.3 Failure sequences 


The properties of the 1oo1D system were further
studied by a systematic investigation of the relevant
parameters of the model. The Petri net model
allows for the investigation of the relevance of the
failure sequences to the overall frequency of
dangerous failures.

In the simulation, the order of the firing of
the transitions was tracked by assertions. The clusters
of firing sequences in the following list
are investigated, where the order of the firing of
transitions is indicated by arrows. Detectable and
undetectable failures are distinguished by assertions
and predicates on the transition, as described
above. To simplify reading, this implicit distinction
is indicated by adding detectable/undetectable to
the Failure transition in brackets:


1. $F_{DD}$-sequences:

— Failure F (detectable) → Demand

— Failure F (detectable) → Failure M → Demand

2. $F_{DU}$-sequences:

— Failure F (undetectable) → Demand

— Failure F (undetectable) → Failure M → Demand

3. $M_D$-sequences:

— Failure M → Failure F (detectable) → Demand

— Failure M → Failure F (undetectable) → Demand

4. CCF-sequence:

— CCF → Demand


The demand is explicitly modeled in order to 
investigate the residual contribution of detectable 
failures to the overall failure rate due to the finite 
frequency of the diagnosis by the monitor channel. 

Figure 8 shows the relative contribution of the
major failure sequences to the PFH as a function
of the $MTTF_D$ values for different DC values. The
following assumptions were made for the system:
The mean time to failure of the monitor channel
was assumed to be half the $MTTF_D$ of the function
channel (this ratio is suggested as a lower
bound for the $MTTF_D$ of the monitor channel in
EN ISO 13849-1, 2015). The fraction of common-cause
failures of the functional channel and the
monitor channel is assumed to be $\beta = 2\%$. The rate
of demands of the safety function is $r_d = 360\ \mathrm{h}^{-1}$,
corresponding to a demand every 10 seconds. The
mission time is $T_M = 20$ a.

While for DC = 60% the PFH is dominated
by $F_{DU}$ for $MTTF_D$ values above approx. 48 years,
the $M_D \to F_D \to$ Demand sequence is dominant for
lower $MTTF_D$ at DC = 60% and for all $MTTF_D$
values up to 100 years at DC = 90% and DC = 99%.

Besides the influence of the finite frequency of
the diagnosis on the PFH discussed earlier in this
paper (see Equation 11), this shows how large the
impact of failures of the monitor channel is and
how important it is to integrate these failures
correctly into PFH calculations. Furthermore, one
could improve the system by introducing a
diagnostic on the monitor channel itself, e.g. by the
functional channel. The necessary extension to
the model and the results of the simulation will be
discussed in the following section.


3.4 Complex systems beyond designated 
architectures 


The properties of monitor channels become even 
more important for more complex systems which 
cannot easily be mapped onto designated archi- 
tectures underlying the simplified formulas of 
established standards (EN ISO 13849-1, 2015, 
IEC 62061, 2015). 

Traditionally, the architecture of a system was
the basic criterion for the achievable safety level
of industrial safety systems (EN 954, 1996). Even
though the target measures for dangerous failures
have been established as the defining criterion for
Safety Levels (EN 61508-1, 2010) and have been
introduced in the generic standards for safety of
machinery (EN ISO 13849-1, 2015, IEC 62061,
2015), the architecture is still an important
criterion.

Figure 8. Relative contribution of the individual failure
channels. Dotted line: $F_{DD}$, dashed line: $F_{DU}$, solid line:
Common-Cause failure, dotted dashed line: $M_D$ followed by $F_D$.

Figure 9. Petri net model of a system with a monitor
channel which is monitored by the functional channel.

However, mapping modern software-intensive
systems to the designated architectures of the
established standards is not always easy. Programmable
electronics introduce various levels of diagnostics
into the systems, which can hardly be mapped to the
assumptions of the designated architectures. Some
standards (EN ISO 13849-1, 2015, IEC 62061,
2015, EN 61508-2, 2010) suggest the application of
stochastic methods for these cases. However, due to
the explosion of the state space for complex systems
with many constituents, Markov models are difficult
to apply. Furthermore, the timing of monitor
channels frequently does not follow exponential
probability distributions. Monitor channels may
in some cases even be synchronized to the potential
demands of the safety function, especially in
systems with cyclic working processes, and the
stochastic modelling of these systems becomes
increasingly complex. Petri nets offer a very versatile tool
to model even complex scenarios where various
monitor channels work concurrently, possibly on
different time scales, with multiple different ranges
of coverage and test conditions.

An example of such a system is depicted in
Figure 9. The model is a slight extension to the
established designated architecture of a 1oo1D system.
The system comprises a monitor channel with a
relatively high diagnostic coverage ($DC_F = 99\%$), which
is in turn monitored by the functional channel
itself with a diagnostic coverage of $DC_M = 99\%$. The
$MTTF_D$ of both channels is assumed to be equal.

Monte Carlo simulation of the system reveals
a substantial reduction of the PFH by the
additional diagnosis of the monitor channel itself. In
Figure 10 the relative reduction of the PFH is
depicted as a function of the $MTTF_D$ (assumed to
be equal for function and monitor channel). For
low $MTTF_D$ values, the reduction exceeds 90%,
while for a high $MTTF_D$ value of 100 a, the
reduction still amounts to 69%.

The rate of the diagnosis of the functional channel
is assumed to be 100 times the demand rate
$r_d = 360\ \mathrm{h}^{-1}$. The monitor channel itself is checked
with a frequency of $r_{t,M} = 1/(24\ \mathrm{h})$.

Figure 10. Effect of the additional diagnosis of the
monitor channel itself. Depicted is the relative reduction
of the PFH by the additional diagnosis on the monitor
for a diagnostic coverage of the monitor of $DC_M = 99\%$
as compared to $DC_M = 0\%$ (no diagnostics on the monitor
channel).

Figure 11. Dependency of the PFH on the rate at which
the monitor channel is tested, $r_{t,M}$. The four curves
correspond to various $MTTF_D$ values (assumed to be equal for
function and monitor channel): $MTTF_D$ = 10 a (dashed
line), $MTTF_D$ = 30 a (dashed-dotted line), $MTTF_D$ = 3 a
(solid line) and $MTTF_D$ = 100 a (dotted line).

The variation of this frequency is, however, less
critical for the PFH value than the frequency of the
diagnosis of the functional channel by the monitor
channel. While the latter frequency has to be
substantially higher than the demand rate to ensure
the detection of failures of the function channel
before any demand, the test of the monitor channel
is efficient as long as the test rate is substantially
higher than the failure rate of the functional
channel. The dependency of the PFH on the rate at
which the monitor channel is tested, $r_{t,M}$, is depicted
in Figure 11 for various $MTTF_D$ values (assumed
to be equal for function and monitor channel). It is
obvious that the diagnosis of the monitor is efficient
even for a very low $MTTF_D$ = 3 a and a test
rate as low as r,,,= 10 h~. Higher test rates will 
not substantially decrease the PFH for the case of
$MTTF_D$ = 3 a.


4 SUMMARY 


In this paper we presented two approaches for
the determination of the PFH for single channel
systems with diagnostics, one based on a Markov
model, the second using Petri net based Monte-Carlo
simulation. We introduced a detailed
(non-Markovian) state transition model of the 1oo1D
architecture with non-exponential distributions
and sketched how this model can be transformed
to an approximate 2-state Markov model by valid
approximations and appropriate merging of states,
and by use of a so-called flow partitioning node.
For this 2-state Markov model we presented a
general analytic formula for the calculation of the
PFH, which includes important features of 1oo1D
systems, e.g. it takes into account the effect of
non-time-optimal testing of the functional channel
performing the safety function.

We furthermore demonstrated how the same 
system can be modelled by a Petri net and obtained 
excellent conformity of the calculation results of 
the PFH between the Markov-based analytic for- 
mula and a Monte-Carlo simulation of the Petri 
net. Both approaches are also in excellent agree- 
ment with the PFH values given in ISO 13849-1 
(ISO 13849-1, 2015) as table values. 

The Petri net model was further used to inves- 
tigate the importance of the contributions of the 
various failure sequences to the overall PFH. It 
revealed the large influence of sequences in which 
the Monitor fails prior to the Functional channel. 

In the last part of the paper we argued how the
Petri net model of 1oo1D could be extended to deal
with architectures which are not covered by the
designated architectures of standards on functional
safety. We showed how additional diagnostics on
the monitor itself could be used to substantially
improve the PFH of the modeled system.

With Industry 4.0, modern, software-intensive
systems with their versatile possibilities for online
diagnostics will play an important role for the
safety of machinery. Our results open the way to a
transparent and flexible determination of the PFH
for these systems.


ABSTRACT: Reliable failure rate estimates of safety critical equipment are crucial for verifying
performance requirements and for trending the safety performance of the equipment. Joint industry efforts, like
OREDA, Exida and PDS handbooks, supported by international standardization on data collection such
as ISO 14224, publish generic failure rates for selected equipment commonly used in the oil and gas
industry. Such generic data builds on field experience and ensures a transfer of knowledge from the operational
phase to new design projects. Currently, most generic data is generated for a specific equipment group,
and not for separate inventory attributes, such as size, type/fabricate, service and flow medium. Some
standards and methods, like MIL-HDBK-217F, suggest how to account for effects of inventory attributes,
but these approaches are also generic, and do not account for sector specific experience, i.e. from
operation of oil and gas facilities. In light of the increased focus on digitalization, it is expected that the access
to data will be improved, and it is therefore important to utilize these data more efficiently in the business
sector in question. The PDS forum in Norway, which gathers most actors involved in the Norwegian oil and
gas industry, initiated a study to analyze operational data of safety critical equipment, with the purpose
of studying more specific and recent effects of inventory attributes. Data from several oil and gas facilities in
Norway, both offshore and onshore, have been systematized and analyzed. The purpose of this paper is
to present the approach used to analyze these data, including data collection and statistical methods, as
well as the final results of the study. The starting point was inventory attributes suggested by expert
judgments, and their effects were investigated on the basis of the collected data. Information from the operating
companies’ maintenance systems has been too sparse to support all suggested inventories, and the choice
of inventory attributes was narrowed down for some selected equipment groups: fire and gas detectors,
level transmitters, shutdown valves and pressure safety valves.


1 INTRODUCTION (OREDA 2015a, OREDA 2015b) and Exida hand- 
books (Exida 2015). 
Safety Instrumented Systems (SISs) perform 


L1 Background safety critical functions such as to shut down the 


The Petroleum Safety Authority (PSA) in Norway 
requires (in Management Regulations, section 19) 
that the operators shall collect, process and use 
field-based reliability data to ensure that the safety 
systems perform according to specified require- 
ments. A key task of safety management is there- 
fore to register equipment failures and to use this 
information to verify that the systems are suffi- 
ciently reliable. For the oil and gas industry, it is 
vital to share field experience with new projects, to 
make realistic assumptions about the performance 
of new equipment in known operating environ- 
ment. Equipment failures are therefore collected 
from several facilities under the framework of ISO 
14224 (ISO 2016a), and generic failure rates based 
on operational experience are presented in PDS 
handbooks (SINTEF 2013a), OREDA handbooks 


plant or isolate ignition sources. IEC 61508 (IEC 
2010) and IEC 61511 (IEC 2016), which are man- 
datory standards to use for design and operation 
of SISs, suggest a risk-based approach to the for- 
mulation of reliability requirements. The starting 
point is a risk analysis, which defines the necessary 
risk reduction for each Safety Instrumented Func- 
tion (SIF) carried out by a SIS. This risk reduction 
is translated into a Safety Integrity Level (SIL) 
requirement. Four SILs are defined (SIL 1 to 
SIL 4), and a required reliability performance 
interval is specified for each level. Probabilistic 
calculations are needed to demonstrate 
that each SIF meets the given SIL requirement. 
Since the risk analysis considers what is acceptable 
risk at a given facility, it is important to use 
“realistic” reliability data when the performance 
of the SIFs is estimated. Realistic in this context 
means to consider historic field experience data, 
rather than data obtained strictly from analyses 
and/or from testing in a laboratory with a controlled 
environment, which does not cover all possible failure 
causes experienced in operation. In other words, 
the reliability calculated in design should as far 
as possible reflect the reliability that is experi- 
enced under typical conditions in the operational 
phase. This is a requirement that is also empha- 
sized and strengthened in the new edition of IEC 
61511 (IEC 2016): operators must ensure that 
data are credible, traceable, documented, 
and justified. 
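
As a minimal illustration of the SIL concept described above, the sketch below maps an average PFD to a SIL using the low-demand PFD bands of IEC 61508, and uses the common single-component approximation PFD ≈ λ_DU · τ / 2. The function names, the example failure rate and the test interval are assumptions chosen for illustration; they are not values from this study.

```python
from typing import Optional

# Low-demand average PFD bands per SIL (IEC 61508):
# lower bound inclusive, upper bound exclusive.
SIL_BANDS = {
    4: (1e-5, 1e-4),
    3: (1e-4, 1e-3),
    2: (1e-3, 1e-2),
    1: (1e-2, 1e-1),
}


def pfd_single(lambda_du_per_hour: float, test_interval_hours: float) -> float:
    """Approximate average PFD of one periodically tested component."""
    return lambda_du_per_hour * test_interval_hours / 2.0


def sil_from_pfd(pfd: float) -> Optional[int]:
    """Return the SIL whose PFD band contains pfd, or None if outside all bands."""
    for sil, (low, high) in SIL_BANDS.items():
        if low <= pfd < high:
            return sil
    return None


# Hypothetical component: DU failure rate 1.0e-6 per hour, tested once a year.
example_pfd = pfd_single(1.0e-6, 8760.0)
example_sil = sil_from_pfd(example_pfd)
```

The sketch only covers the quantitative PFD band; a real SIL demonstration involves further requirements that are outside the scope of this example.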

A practical challenge with generic data is the 
time that passes from when data are collected until 
updated handbooks and databases are published. The time lag is 
often five years or more, and it is therefore of interest 
for operators to collect and systemize their own 
operational experience. Many operators in Nor- 
way now carry out regular (e.g. annual) reviews of 
reported failures. The results from these reviews 
are then used to monitor reliability performance 
in operation, to give feedback to manufacturers 
(about problems experienced) and to make deci- 
sions about changes in functional test intervals 
(Hauge & Lundteigen 2008). 

Reviews of failures reported on various Norwegian 
oil and gas facilities indicate that failure 
rates for similar types of equipment can be quite 
different between facilities, even if the operating 
environment is more or less the same. This result 
may be explained by some variations in technology 
used (e.g. detection principle for gas detectors), 
process medium, the external environment, qual- 
ity and frequency of maintenance and inspection, 
etc. Care should therefore be taken when using generic 
failure rates in studies for reliability demonstration 
without considering such influencing factors. Con- 
sequently, it is desirable to supplement generic fail- 
ure rates (that represent an “average performance” 
for comparable equipment) with equipment char- 
acteristic parameters that can identify more specific 
values of failure rates, i.e. inventory attributes of 
the equipment (e.g. size of valve, type of detector, 
etc.). The term inventory attribute is introduced 
for equipment attributes particularly important for 
the reliability performance. For example, it may be 
of interest to distinguish between failure rates for 
different sizes of shutdown valves. 

We foresee at least two applications for fail- 
ure rates based on specific inventory attributes: 
1) Monitoring the reliability of existing SIFs, 
allowing for the specific characteristics of the 
equipment, and 2) Calculating the reliability of 
new SIFs, or existing SIFs considering the influ- 
ences of design, operation, environment and main- 
tenance characteristics. 


1.2 Objective and scope of paper 


The main objective of this paper is to identify 
which inventory attributes may be relevant for 
the four equipment groups: fire and gas detectors, 
level transmitters, shutdown valves and pressure 
safety valves, and to analyze which of the sug- 
gested inventory attributes (if any) are significant 
based on systemized field experience gathered by 
SINTEF in the period 2006-2016. 

The content of this paper is based on work per- 
formed as a continuation of a research project 
funded by the Norwegian Research Council and 
the members of the PDS forum (www.sintef.no/ 
pds). SINTEF has previously systematized fail- 
ure data for six offshore and onshore oil and gas 
facilities (a total of more than 13000 maintenance 
notifications) in Norway. These failure reports, 
supplemented by additional expert judgements, 
have been used for selecting possible inventory 
attributes influencing the failure rates of some 
selected equipment types. Then, data for these 
inventory attributes have been collected (as completely 
as possible) to be able to analyze the possible 
impact of selected inventory attributes. 


2 FAILURE DATA COLLECTION 
AND DATA FORMAT 


2.1 Generic data sources 


Generic failure rates are mainly derived from data 
collected by an organization and published in 
handbooks or as computerized databases (Rausand 
2014). The failure rates can often be regarded as 
an average of the experienced performance for spe- 
cific equipment groups. 

The oil and gas industry has collected failure 
data over many years and for several offshore facili- 
ties, mainly on the Norwegian continental shelf. 
Relevant generic data sources are the OREDA 
handbooks, ref. OREDA (2015a) and OREDA 
(2015b), PDS handbooks, ref. SINTEF (2013a) and 
SINTEF (2013b), and the safety equipment reliabil- 
ity handbook (SERH) published by Exida (2015). 


2.2 Operator’s data 


ISO 20815 (ISO 2008) on production performance 
assurance emphasizes that the systematic 
collection and treatment of operational experi- 
ence is considered as an investment and means 
for improvement of production and safety critical 
equipment. The oil and gas industry has been in 
the forefront of developing international standards 
on data collection with ISO 14224 (ISO 2016a). 
Reliability data can help operators to plan 
the preventive maintenance, e.g. to optimize test 
intervals, avoid unscheduled stops and reduce the 
amount of corrective maintenance. Aggregated 
data, used to determine generic values of failure 
rates, represents an experience transfer from oper- 
ation to analyses needed for new facilities and for 
installations of new systems. 

Several operators are continuously working on 
systemizing their failure records, to have their own 
“preferred” data set. This data set can be used to 
estimate an average performance of equipment 
for a single facility or for several facilities with the 
same operator. If the amount of data is extensive, 
one can also estimate separate failure rates for spe- 
cific equipment attributes. 


2.3 Failure data from operational reviews 


Failures revealed during operation and mainte- 
nance are reported by maintenance notifications. 
A notification allows some free text description of 
the failure and about the measures implemented to 
correct the failure. In addition, it is also possible to 
characterize the failure, by ticking off in lists of pre- 
defined classes of failure causes, failure modes, and 
detection methods. Most maintenance systems are 
aligned with ISO 14224 for data collection, and addi- 
tional effort is needed to further classify the failures 
into Dangerous Detected (DD) failures, Danger- 
ous Undetected (DU) failures and safe (S) failures, 
for alignment with IEC 61508 and IEC 61511 tax- 
onomy. Information about inventory attributes of 
interest may be partly available in notifications and 
partly in SIS related documents, such as the Safety 
Requirements Specification (SRS), safety manuals, 
and Safety Analysis Reports (SARs). 

The operating companies themselves can per- 
form regular operational reviews, or they can use 
assistance from consultants or research institutes. 
In either case, it is important to involve personnel 
from key disciplines such as automation, safety 
and maintenance from the specific facility and 
company in question. The main purpose of the 
reviews is to verify the performance of SIL rated 
equipment and to give recommendations related to 
maintenance and testing. A secondary purpose of 
the review, as suggested in this paper, is to analyze 
such data in more detail to investigate the perform- 
ance for various inventory attributes. 


2.4 Selection of equipment groups 


Several types of safety critical equipment are used 
in SIFs on an oil and gas facility. Operational 
reviews of safety critical equipment cover about 
twenty different equipment groups; however, some 
groups consist of few equipment units. To limit the 
scope of the analyses in the PDS project, it was 
decided to extract groups of equipment where: 


— a certain amount of data (both failures and a 
certain amount of aggregated operational time) 
has been gathered. 

— the equipment group is represented on several 
facilities. 

— the equipment group is represented in several 
SIFs on a facility. 

— the equipment types have some attributes that 
are considered as significant with respect to the 
failure rate. 


The selection of equipment groups and inventory 
attributes was discussed with experts among 
the PDS forum participants in an expert meeting. 
The recommendation was to focus on fire and gas 
detectors, level transmitters, shutdown valves and 
pressure safety valves, since these groups both contain 
a significant amount of equipment and contain 
possibly significant inventory attributes. 


2.5 Uncertainty & data collection challenges 


A major challenge when splitting up the failure 
rates according to inventory attributes is to obtain 
sufficient statistical confidence. If no DU failures 
have been experienced in an observation period, 
the statistical confidence is lower even if the obser- 
vation time is quite extensive. 
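
The confidence problem can be made concrete with a small sketch. It assumes DU failures follow a Poisson process with a constant rate, which is a standard assumption for this kind of estimate but is not stated in the paper, and computes an exact two-sided interval for the failure rate by bisection using only the standard library. The function names and the example numbers are hypothetical.

```python
import math


def poisson_cdf(k: int, mu: float) -> float:
    """P(N <= k) for N ~ Poisson(mu)."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total


def _solve_decreasing(f, target: float) -> float:
    """Find mu >= 0 with f(mu) = target, for f decreasing in mu."""
    lo, hi = 0.0, 1.0
    while f(hi) > target:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def rate_interval(n_failures: int, hours: float, confidence: float = 0.90):
    """Return (lower, point estimate, upper) for the failure rate per hour."""
    alpha = 1.0 - confidence
    upper_mu = _solve_decreasing(lambda mu: poisson_cdf(n_failures, mu), alpha / 2)
    if n_failures == 0:
        lower_mu = 0.0
    else:
        lower_mu = _solve_decreasing(
            lambda mu: poisson_cdf(n_failures - 1, mu), 1.0 - alpha / 2
        )
    return lower_mu / hours, n_failures / hours, upper_mu / hours


# Hypothetical example: 1.0e6 aggregated operating hours.
no_failures = rate_interval(0, 1.0e6)
five_failures = rate_interval(5, 1.0e6)
```

With zero observed failures the point estimate and the lower bound are both zero, so only the upper bound carries information, however long the observation time is.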

Quality of data is another challenge. It is impor- 
tant that we can rely on the information given in 
the notification, e.g. that the failure mode and 
the detection method have been classified cor- 
rectly. Data collection is both time consuming and 
demanding; it is seldom straightforward to iden- 
tify all relevant information from the maintenance 
records. To obtain correct information about the 
actual failure, it is often necessary to discuss indi- 
vidual notifications with operators and mainte- 
nance personnel. Such work is time consuming, 
but nevertheless rewarding, e.g. to avoid repeating 
(and thereby costly) failures. Many operational 
reviews reported repeating failures, where seem- 
ingly insufficient measures had been implemented 
to remove the cause of failure. For the purposes 
of analyzing inventory attributes, we have removed 
repeating failures, so that these are not given too 
much weight in the overall results. 

From the operational reviews, we saw that the 
effects of local facility conditions were important 
and, based on our experience, should be 
considered. One facility may experience specific 
problems (e.g. icing) for some particular type of 
equipment, which are not observed at other facilities. 
Local conditions seem to be of particular 
interest for the occurrence of Common Cause Failures 
(CCFs), i.e. failures that are dependent due 
to a shared cause and which occur close in time. 
An example of a local problem which turned out 
to be defined as a CCF was a number of failures 
of shutdown valves caused by the wrong type (here 
viscosity) of hydraulic oil. Some of the failures 
related to specific problems at one facility (such as 
the hydraulic oil problem) which are not likely to 
occur at other facilities, have been removed from 
our data set for the analyses of inventory attributes. 
Other local problems that were defined as CCFs 
were icing problems. Icing is often more challenging 
for facilities in the Barents Sea than in the 
North Sea; however, unfortunate design solutions 
may also allow icing to occur. Failures related 
to such conditions that may occur on several facili- 
ties, have not been removed from our data set for 
the analyses of inventory attributes. 


3 STATISTICAL METHODS 


For the analyses of data from operational reviews, 
the focus has been on DU failures since these fail- 
ures will influence the most important perform- 
ance requirements related to SIF equipment, such 
as the Probability of Failure on Demand (PFD) 
for a SIF. The total data set for analyses comprises 
all equipment units that have been involved in the 
operational reviews, i.e. have been part of equipment 
groups considered in the reviews. For each 
unit, data on inventory attributes have been collected 
together with information about whether the 
unit has experienced a DU failure in a predefined 
observation period. 

It was decided to introduce a Generalized Linear 
Model (GLM) based on a binomial distribution, 
where only two possible outcomes are considered. 
The response variable is a discrete variable with two 
possible outcomes, 0 and 1, i.e. “No DU failure” or 
“DU failure”, such that the GLM can predict the failure 
probability and assess the effect of the predefined 
inventory attributes on this probability. Other 
methods, such as lifetime modelling, were also considered. 
However, since the observation period for 
some of the facilities does not cover the entire life- 
time of the item and the time an item was put into 
operation is unknown for most of the components, 
such methods were disregarded. 

GLM describes the statistical relationships 
between response variables $Y_1, Y_2, \ldots, Y_N$ and 
explanatory variables $x_1, x_2, \ldots, x_n$ by estimating the 
corresponding coefficients $\beta_1, \beta_2, \ldots, \beta_n$. 
An explanatory variable is a type of independent 
variable that can affect response variables, which 
may be fixed by the experimental design. GLMs are 
mostly based on maximum likelihood estimation 
and allow for regression modeling when response 
variables are distributed as one of the members of 
the exponential family. The model is given by: 


$$g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_n x_{in}$$ 

Here, $g(\mu)$ is called the link function. Further, 
let $Y_i \sim \mathrm{Binomial}(n, p_i)$ express the response variables 
with failure probability $p_i$, i.e. the likelihood 
for an item to fail at a given time. Then the GLM 
model based on the binomial distribution is given as: 


$$\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_n x_{in} \qquad (3)$$ 


In this GLM, inputs are related to inventory 
attributes as well as failure data. Outputs of the 
model are related to failure probabilities that are 
used to check whether there is statistical signifi- 
cance. The formula of failure probability is: 


$$p_i = \frac{\exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_n x_{in})}{1 + \exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_n x_{in})} \qquad (4)$$ 
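
The step from the logit model (3) to the failure probability (4) can be written out explicitly. The symbol $\eta_i$ for the linear predictor is introduced here only for brevity and is not part of the original notation.

```latex
\begin{aligned}
\eta_i &= \beta_0 + \beta_1 x_{i1} + \cdots + \beta_n x_{in}, \\
\log\left(\frac{p_i}{1 - p_i}\right) = \eta_i
  &\;\Longrightarrow\; \frac{p_i}{1 - p_i} = e^{\eta_i}
  \;\Longrightarrow\; p_i = e^{\eta_i}\,(1 - p_i)
  \;\Longrightarrow\; p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}} .
\end{aligned}
```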


We identify the parameters that significantly 
impact the reliability performance by checking 
for variables in the regression model with small 
p-values and large coefficients. A small p-value 
suggests that changes in the explanatory variable 
are associated with the changes in the response 
variable, i.e. the inventory attribute is significant 
and does influence the DU failure rate (based on 
our data and under the given assumptions). The 
exponentiated coefficient represents the change in the 
odds of a DU failure when changing the category of 
one inventory attribute while holding the other explanatory 
variables constant. 
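
The following sketch illustrates how such a binomial GLM with a logit link can be fitted and how an exponentiated coefficient is read as an odds ratio. It is written in Python with the standard library only, using plain Newton-Raphson on the log-likelihood; it is not the R implementation used in the study, the data are invented, and significance testing (p-values) is left out.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_logistic(rows, y, iterations=25):
    """Maximum likelihood fit of log(p / (1 - p)) = x . beta."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, yi in zip(rows, y):
            eta = sum(b * v for b, v in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += (yi - p) * x[a]
                for c in range(k):
                    hess[a][c] += w * x[a] * x[c]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


# Invented data: each row is [intercept, is_large_valve] for one unit-period.
# 200 small valves with 10 DU failures, 200 large valves with 30 DU failures.
rows = [[1.0, 0.0]] * 200 + [[1.0, 1.0]] * 200
y = [1] * 10 + [0] * 190 + [1] * 30 + [0] * 170

beta0, beta1 = fit_logistic(rows, y)
odds_ratio = math.exp(beta1)  # odds of a DU failure, large versus small valves
```

For a single two-level attribute the fitted odds ratio equals the ratio of the observed odds in the two categories, which gives a simple check on the fit.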


4 DATA COLLECTION OF INVENTORY 
ATTRIBUTES 


4.1 General 


Collection of data for inventory attributes turned 
out to be a rather time-consuming activity. Only 
parts of the relevant information were (easily) 
found in the operator’s maintenance system. It was 
necessary to supplement with information from 
other sources, such as process and instrument 
diagrams (P&IDs), data sheets and manufacturer 
specifications together with discussions with tech- 
nical advisors and process engineers. 

The manufacturer name was the most straight- 
forward inventory attribute to obtain from the 
maintenance system. However, within an equip- 
ment group we would find some (“small”) manu- 
facturers that had not delivered more than a few 
equipment units each. Thus, to be able to achieve 
meaningful results, the number of manufacturers 
was kept to a minimum by grouping all the 
“small” manufacturers into an “other” group. 
Grouping the outcomes represented by a small 
number of units into a common “other” category 
was also performed for other attributes with several 
outcomes/categories. 
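
A minimal sketch of this grouping step is shown below. The threshold of ten units and the manufacturer labels are assumptions for illustration; the paper does not state the cut-off that was used.

```python
from collections import Counter


def group_rare(values, min_units, other_label="Other"):
    """Replace categories represented by fewer than min_units units."""
    counts = Counter(values)
    return [v if counts[v] >= min_units else other_label for v in values]


# Hypothetical manufacturer labels for 70 equipment units.
manufacturers = ["A"] * 40 + ["B"] * 25 + ["C"] * 3 + ["D"] * 2
grouped = group_rare(manufacturers, min_units=10)
```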


4.2 Fire and gas detectors 


The expert review meeting suggested the follow- 
ing inventory attributes for fire and gas detectors: 
Manufacturer, measuring principle, i.e. which phys- 
ical principle the detection is based on, and model 
type. Table 1 shows in more detail the inventory 
attributes for point gas detectors. The same types 
of inventory attributes were selected for line gas 
detectors, smoke detectors and flame detectors. 

For the analyses of detectors, we were in the 
fortunate situation of having access to data from more 
than the six facilities where SINTEF was involved in 
operational reviews. For this particular equipment 
group, it was possible to add the inventory attribute 
“facility”, due to the extensiveness of data, to allow 
comparison in failure rates and effects of inventory 
attributes between different facilities. 


4.3 Level transmitters 


Level transmitters are often placed in a group called 
“process transmitters”, together with temperature 
transmitters and pressure transmitters. In our analyses, 
we wanted to focus on the level transmitters 
alone, mainly because they are more dependent on 
measuring principle and operating conditions (various 
media, foaming, calibration challenges, etc.) 
than the other transmitters. Measuring principles 
for level transmitters are divided into the categories 
displacer, pressure, radar (guided wave radar) 
and others (nuclear, ultrasonic, servo, capacitance 
and magnetostrictive). 

Table 1. Inventory attributes - Point gas detectors. 

Attributes           Incl.  Examples of categories 
Manufacturer         YES    Autronica, Drager, Simtronics... 
Measuring principle  YES    IR, Wireless, Acoustic...* 
Model                YES    HC200, PIR 7000, GD10... 

*Catalytic gas detectors have been removed from the data 
set due to significantly more DU failures than the rest of 
the measuring principles. 


Table 2. Inventory attributes - Level transmitters. 

Attributes            Incl.  Examples of categories 
Manufacturer          YES    Vega, Fisher-Rosemount... 
Measuring principle   YES    Displacer, Pressure, Radar... 
No. of medium phases  YES    1, 2 or 3 
Type of medium        NO     Hydrocarbon, Water, Chemical... 
Type of vessel        NO     Separator, Scrubber, Tank... 
Special problems      NO     Foaming, Sand, Scale... 


Table 2 shows the list of inventory attributes 
agreed upon among the experts to include in the 
analyses. However, due to lack of details in collected 
data, we were left with three credible inventory 
attributes: manufacturer, measuring principle 
and number of medium phases. Regarding the number 
of medium phases, we made the following assumptions 
based on the type of vessels in which the 
transmitters were installed: Level transmitters for 
1st stage separators and test separators normally 
measure three types of medium (e.g. oil, MEG/ 
water and gas), while level transmitters in scrubbers, 
2nd stage separators and 3rd stage separators 
are assumed to measure two types 
of medium, e.g. liquid/gas. In case this information 
about vessel type was not evident, the number of 
vessel outlet lines was checked against e.g. P&IDs 
to obtain the number of fluid phases inside the vessels. 
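
The vessel-type assumption can be expressed as a simple lookup. In the sketch below, the mapping follows the text, while the fallback that equates the number of outlet lines with the number of phases is a simplifying assumption, and the function and variable names are hypothetical.

```python
from typing import Optional

# Number of medium phases assumed per vessel type, as described in the text.
PHASES_BY_VESSEL = {
    "1st stage separator": 3,
    "test separator": 3,
    "scrubber": 2,
    "2nd stage separator": 2,
    "3rd stage separator": 2,
}


def number_of_phases(
    vessel_type: Optional[str], outlet_lines: Optional[int] = None
) -> Optional[int]:
    """Return the assumed number of phases, or None if it cannot be derived."""
    if vessel_type is not None:
        key = vessel_type.strip().lower()
        if key in PHASES_BY_VESSEL:
            return PHASES_BY_VESSEL[key]
    # Fallback (assumption): one outlet line per phase, read from the P&ID.
    if outlet_lines is not None and 1 <= outlet_lines <= 3:
        return outlet_lines
    return None
```

Returning `None` keeps transmitters with missing information visible as "unknown" instead of silently assigning them a category.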

Despite the effort, we were left with several 
transmitters where information was missing. E.g., 
measuring principle was not identified for 10% of 
the transmitters and manufacturer was not identi- 
fied for 11%. 


4.4 Shutdown valves 


The data collected through operational reviews 
contains in total 1245 Emergency Shutdown 
(ESD) and Process Shutdown (PSD) valves. 
Table 3 shows the inventory attributes selected in 
the expert review meeting. Unfortunately, it was 
necessary to remove all data from one of the facili- 
ties due to very sparse information on the inven- 
tory attributes manufacturer and (valve) size. 

Categories for manufacturer, size and type of 
valve were obtained from the maintenance system 
and equipment and facility specific information 
such as data sheets and P&IDs. 

The process medium to which the valves are exposed was 
assessed by experts at one of the facilities: It was 
suggested that for each system at the facility a cor- 
responding (typical) medium could be assumed, 
and this “mapping” between system and medium 
was adopted for the rest of the facilities. 

Table 3. Inventory attributes - PSD and ESD valves. 

Attributes           Incl.  Examples of categories 
Actuation principle  NO     Electric, Hydraulic, Pneumatic... 
Manufacturer         YES    Tai Milano, Swagelok, BIS... 
Medium               YES    Gas, HC liquid, Water... 
Size                 YES    0-1”, 1-3”, 3-18” and >18” 
Special problems     NO     Corrosion, Icing... 
Type                 YES    Ball, Gate, Butterfly, Other 


Table 4. Inventory attributes - Pressure safety valves. 

Attributes              Incl.  Examples of categories 
Actuation principle     NO     Pilot, Spring, Pressure-vacuum... 
Dirty or clean service  YES    Yes or No 
Manufacturer            YES    Petrolvalves, O.M.S.,... 
Medium                  NO     Gas, HC liquid... 
Size                    NO 


To avoid too many categories for valve size, 
size intervals were decided together with an 
expert. Criteria for these intervals were based on 
the valve and process characteristics, rather than 
having equally sized categories: E.g., valves smaller 
than 1” have been defined as a separate category 
since they normally are water-based and associated 
with lower risk than bigger valves. Thus, 
the number of valves in each size category varies. 
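
The size categories in Table 3 can be assigned with a small helper such as the one sketched below. The text only fixes that valves smaller than 1” form their own category; how valves of exactly 3” and 18” are assigned is an assumption made here, as is the function name.

```python
def size_category(size_inches: float) -> str:
    """Map a nominal valve size in inches to a size category label."""
    if size_inches <= 0:
        raise ValueError("valve size must be positive")
    if size_inches < 1:
        return '0-1"'
    if size_inches < 3:  # assumption: 3" belongs to the next category
        return '1-3"'
    if size_inches <= 18:  # assumption: 18" belongs to this category
        return '3-18"'
    return '>18"'
```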

The categories for the inventory attribute “actuation 
principle” were not straightforward to retrieve. 
This would require time-consuming manual information 
gathering, and was only performed for one of the 
facilities. Hence, the actuation principle inventory 
attribute was removed from the analyses. Also, the 
inventory attribute “special problems” (corrosion, 
icing and temperature changes) was removed, as 
this information required manual and rather time-consuming 
effort. 

Table 3 lists the examples of categories assigned 
to selected inventory attributes. Also for this equip- 
ment group, information was missing. For exam- 
ple, we found that 14% of the valves had unknown 
manufacturer, 13% had unknown size and 8% had 
unknown type. 


4.5 Pressure safety valves 


The inventory attributes that were selected for pressure 
safety valves (PSVs) based on the expert meeting 
are presented in Table 4. Manufacturer and size of 
valve were obtained from the maintenance system 
and equipment and facility specific information 
such as data sheets and P&IDs. However, we faced 
major problems with missing category informa- 
tion. E.g. for about 50% of the PSVs, information 
about the valve size was not found in the mainte- 
nance system. Thus, the inventory attribute “size” 
was not part of the PSV analyses. Unfortunately, 
we also had to remove data for PSVs from one of 
the facilities due to missing information. 

Dirty or clean service, i.e. if the medium flow- 
ing through a PSV is “dirty” (e.g. including sand, 
crude oil, etc.) or “clean” (e.g. pure gas), was 
together with the actuation principle pointed out 
by the experts as a possible significant inventory 
attribute for the PSVs. To simplify the analyses, it 
was assumed that all PSVs installed in the same 
system had the same category (either “dirty” or 


“clean”). The inventory attribute “medium” was 
not included, since it would be partly correlated to 
the inventory attribute “service”. 

The actuation principle of the PSVs was iden- 
tified to some extent in the maintenance system. 
As for ESD and PSD valves, it was very time-con- 
suming to extract this information and this has not 
yet been performed. Hence, the actuation principle 
was not part of the analyses. 

Table 4 summarizes the selected inventory 
attributes. Note that examples of categories are not 
provided for the inventory attribute “size” since it 
was decided to omit this one from the analyses. 


5 ANALYSES AND RESULTS 


5.1 Assumptions 


The data for all the finally selected inventory 
attributes were combined, for the analyses, with 
information about how many DU failures had 
been registered for the equipment group and the 
aggregated time in operation. In the data set, we 
removed DU failures that had been repeated for 
the same equipment, to avoid double counting of 
the same failure event. 

For some of the inventory attributes, e.g. 
medium for PSD and ESD valves and dirty or 
clean service for PSVs, the categories are based 
on the assumption that all equipment installed in 
one particular system share the same medium and 
service; e.g. all valves in system number 43 (Flare 
system) are assumed to share the medium “gas” and 
“clean” service. 

The number of predefined categories and, of 
course, how they are defined, e.g. size intervals and 
which categories belong to the “other” category, 
will also impact the results. 

Some assumptions have also been made regard- 
ing the analyses and for the data to fit the statistical 
analyses as described in section 3. E.g., the DU failures 
are assumed to be identically distributed and 
stochastically independent. It is also assumed 
that inadequate information and missing data do 
not have any effect on the results of the analyses. 

The observation periods from the facilities 
are not equal and vary from two to 11 years for 
those facilities included in the data set. Thus, some 
assumptions about observation periods had to be 
made to get observation periods as equal as possible 
and to utilize all DU failures: The final periods 
should not be so short that there would 
be very few observation periods with failures compared 
to observation periods without failures, since 
it would then be more difficult to obtain significant 
results. On the other hand, the periods should not 
be so long that multiple failures of the same 
component would often occur within the same 
period, since we would then not utilize all the DU failures. 
Also, different observation period intervals 
were used in the data analyses for fire and gas 
detectors compared to other equipment groups: 

For those equipment groups with data from five 
or six facilities (PSD and ESD valves, level transmitters 
and PSVs), three to four years was regarded 
as one observation period. Then, for a facility with 
an observation period between three and four years, 
each equipment unit was counted once in the total 
data set. For a facility with 11 years of operational 
experience, the inventory data was counted 
three times in the total data set (and the DU 
failures were assigned to the correct observation 
period based on the notification date). 

For fire and gas detectors, where data from several 
more facilities were included, many of them with 
shorter observation periods, two to three years was 
regarded as one observation period. 
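
The construction of the binary data set can be sketched as follows. The rule used here for choosing the number of periods (the largest number of equal periods that are each at least three years long) is an assumption that reproduces the two examples in the text; the paper does not state the exact rule. The unit tag, the failure times and the function names are hypothetical.

```python
import math


def n_periods(total_years: float, min_period_years: float = 3.0) -> int:
    """Number of equal observation periods for one facility (assumed rule)."""
    return max(1, math.floor(total_years / min_period_years))


def unit_period_rows(unit_id, total_years, failure_years, min_period_years=3.0):
    """One row per (unit, period) with a 0/1 response for DU failure."""
    k = n_periods(total_years, min_period_years)
    length = total_years / k
    flags = [0] * k
    for t in failure_years:  # years since start of observation
        if 0.0 <= t <= total_years:
            flags[min(int(t // length), k - 1)] = 1
    return [
        {"unit": unit_id, "period": i + 1, "du_failure": flags[i]}
        for i in range(k)
    ]


# Hypothetical valve observed for 11 years with three DU failure notifications.
rows = unit_period_rows("ESV-001", 11.0, failure_years=[1.2, 1.9, 9.5])
```

In this example the two failures in the first period collapse into a single "DU failure" outcome, which is the loss of information the text warns about when periods are too long.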

The GLM was implemented in R, 
a free software environment for statistical computing 
and graphics. 


5.2 Results 


The aim of the analyses was to identify, for each 
equipment group, which inventory attributes and 
related categories (if any) were statistically 
significant. 

Table 5 shows the results of the analyses for 
each equipment group, listing the most significant 
inventory attributes and associated categories 
with respect to the DU failure rate. Note that 
attributes and categories that were not found to be 
significant are not listed. Neither 
are those categories less significant compared 
to two or more other categories. “Significant” 
implies that the inventory attribute and its associated 
category(s) contribute to either a significantly 
higher or a significantly lower failure rate 
compared to the other attributes/categories. 

From Table 5 we see that the manufacturer, 
typically represented by one or two of the largest 
manufacturers, is significant for most of the equipment 
groups. For ESD and PSD valves, the largest 
valves seem to have a higher failure rate compared 
to small valves. Also, the medium may be important 
for the failure rate for ESD/PSD valves. 


Table 5. List of most significant inventory attributes 
and categories. 

Equipment group      Attribute               Category 
ESD/PSD valves       Size                    >18” 
                     Medium                  Gas, Water, ... 
                     Manufacturer            Confidential 
Line gas detectors   Manufacturer            Confidential 
Point gas detectors  Measuring principle     IR 
PSVs                 Manufacturer            Confidential 
                     Dirty or clean service  Dirty, clean 

One inventory attribute suggested by experts, 
and that we were able to analyze, was not found 
to be significant in our analyses: “measuring 
principle” for level transmitters. This may be partly 
explained by an inadequate level of detail concerning 
inventory attributes in the applied data. 

For the fire and gas detectors, where data was 
available from several facilities, the facility 
was also included as a separate attribute. The results 
showed that some facilities contribute 
to significantly higher or lower failure rates compared 
to the others. 


6 RECOMMENDATIONS 
AND FURTHER WORK 


Measures and means for improving the quality of 
data that are recorded in the maintenance system 
are an important area for further research. Today, 
the recording is mainly manual, and there is no 
systematic way of consistently recording information 
for more automatic extraction and analyses. 
Based on the results of this study, it is possible 
to suggest more specific categories of information 
to be recorded. It may be necessary to further 
investigate the implications of assumptions that 
were made for our analyses, both those that were 
made to overcome practical obstacles, e.g. due to 
lack of information related to selected inventory 
attributes, and those made to simplify the selection 
of categories, e.g. about the relationship between 
categories for inventory attributes (e.g. clean service) 
and system number (e.g. flare system). It is also 
possible to perform other types of analyses, e.g. 
“big data” analyses, to identify significant inven- 
tory attributes particularly when the amount of 
data, inventory attributes and categories increases. 

Operational experience indicates that similar 
equipment performs differently between facilities 
with a comparable operating environment. It is 
therefore desirable to supplement the generic data 
with inventory attributes that can explain the vary- 
ing performance, and enable the reliability analyst 
to better predict the variations. Due to the limited 
information about inventory attributes, it is rec- 
ommended that the operators increase the amount 
of relevant inventory information in their main- 
tenance systems, in particular for safety critical 
equipment that is part of operational reviews. Then, the 
failure data and data for inventory attributes can 
be combined to perform in-depth analyses. 

Data collection is becoming increasingly impor- 
tant both with respect to quantity and quality. It 
is an important activity to provide feedback on 
experience from the operational phase to designers
of new systems and for monitoring the operational
performance of safety barriers. It is also an
important activity in relation to the increasing
trend of lifetime extension for existing facilities.
SINTEF and the PDS forum are also working on
means for enhancing the digitalization of failure 
reporting, classification, and analyses, to update 
the generic failure rates more frequently and to 
reduce the manual effort in this process. A higher 
level of automatic analyses of data can help when 
prioritizing resources needed to improve the over- 
all quality of data. 
ABSTRACT: A significant fraction of current research in telecommunications investigates new paradigms
to guarantee high flexibility in deploying network infrastructures. Network Function Virtualization
(NFV) is probably the most effective paradigm: it adapts to telecommunication networks the virtualization
concepts originally conceived in the computer world. According to the NFV specifications, network
elements such as switches or routers can be implemented as virtual machines called Virtual Network Functions
(VNFs), and deployed in the field by timely and cost-effective operations. Some telco operators are already
taking advantage of these virtualized resources by aggregating several VNFs in order to provide new services.
One example is the IP Multimedia Subsystem (IMS), which provides multimedia delivery services and can
be suitably deployed by means of interconnected VNFs. We consider a virtualized implementation of
an IMS system (vIMS) that we characterize from an availability standpoint. First, we describe a vIMS
system as a chain of virtualized network elements modeled by three components: hardware, hypervisor,
and application. Subsequently, we model the probabilistic behavior of the network nodes by Stochastic
Reward Nets, which account for the failure and repair events characterizing each node. Innovating on previous
formulations, part of the analysis is carried out by adopting non-Markovian models, thus allowing for
more realistic (non-exponentially distributed) times between some state transitions. As final results, we
determine the optimal redundant vIMS configuration able to guarantee a steady-state availability not less
than 0.99999, and we provide a sensitivity analysis useful to evaluate the system robustness to variations of
parameters from their nominal values.


1 INTRODUCTION


Network Function Virtualization (NFV) (ETSI 2012) represents one of the most innovative
paradigms within the fifth generation (5G) of telecommunication systems. Basically, it has
been designed to boost the deployment of new network services by exploiting virtualization
concepts. Within an NFV domain, traditional network equipment (e.g. firewalls, routers,
switches, etc.) is replaced by virtual counterparts named Virtualized Network Functions
(VNFs). According to the NFV logic, every VNF relies on a decoupled structure typically made
of: i) a hardware part accounting for physical components; ii) a hypervisor part acting as
an abstraction layer between hardware and application; iii) an application part that includes
the software logic and runs on top of a VNF. An architecture that can profitably benefit
from an NFV environment is the IP Multimedia Subsystem (IMS) (3GPP 2001). IMS exploits the
all-IP-based paradigm to provide a plethora of multimedia services (audio/video sessions,
online messaging, presence, IP TV, etc.) by taking advantage of the Session Initiation
Protocol (SIP) (Rosenberg et al. 2002). In the present work, the new virtualized IMS (vIMS)
framework (namely, IMS in an NFV environment) has been characterized in terms of its
availability. Accordingly, we find the optimal system configuration respecting the so-called
“five nines” availability requirement, which corresponds to tolerating a maximum system
downtime of about 5.26 minutes (roughly 5 minutes and 15 seconds) per year. The paper is
structured as follows. Section 2 contains a brief overview of relevant and pertinent works.
Section 3 offers an overview of possible vIMS deployments. An availability analysis is then
presented in Section 4, where details about the Stochastic Reward Net (SRN) and Reliability
Block Diagram (RBD) methodologies are provided. Section 5 presents a numerical experiment,
where characteristic system parameters such as the repair and failure rates of the vIMS
components have been set in accordance with the scientific literature and expert hints.
Finally, concluding remarks along with considerations about possible future work are drawn
in Section 6.
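
As a quick check of the “five nines” figure quoted in this section, the following minimal sketch converts an availability target into the maximum tolerated downtime per year. The function name and the choice of a 365-day year are assumptions made here for illustration.

```python
# Sketch: yearly downtime budget implied by an availability target.
# Assumption: a 365-day (non-leap) year.
MINUTES_PER_YEAR = 365 * 24 * 60


def max_downtime_minutes(availability: float) -> float:
    """Maximum downtime (minutes per year) compatible with the given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


five_nines = max_downtime_minutes(0.99999)
whole_minutes = int(five_nines)
seconds = (five_nines - whole_minutes) * 60.0
print(f"{five_nines:.3f} min/year = {whole_minutes} min {seconds:.1f} s")
```

The same function gives the budget for other targets, for example `max_downtime_minutes(0.999999)` for a “six nines” requirement.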


2 RELATED WORK 


Recently, the dependability and availability of novel
telecommunication infrastructures have become
critical issues, since network operators are committed
to strict Service Level Agreements. Accordingly,
such issues have attracted attention in the scientific
and technical literature. The work presented in (de S.
Matos et al. 2012) is one of the first papers devoted
to coping with the availability issues of virtualized
infrastructures. In particular, the authors propose
a three-level model of a generic system (hardware, 
software, hypervisor) analyzed by combining Con- 
tinuous Time Markov Chains (CTMCs) and fault 
tree formalisms. A combination of approaches is
also used in (Fernandes et al. 2012), where a depend- 
ability assessment of virtual networks is presented 
by exploiting both combinatorial and state-based 
models. A mathematical framework to model a restoration
mechanism for virtual resources affected by
failures is instead proposed in (Taleb et al. 2016).
A stochastic model-driven approach to evaluate 
the availability of an Infrastructure-as-a-Service 
(IaaS) cloud system is presented in (Ghosh et al. 
2014). The failure events have been managed by 
considering migration of physical machines among 
three types: hot (running machine), cold (turned off 
machine), and warm (turned on, but not yet ready 
machine). Some algorithms aimed at solving the 
Minimum Total Failure Removal (MTFR) problem
have been proposed in (Liu et al. 2016), where
a reliability evaluation of an NFV environment has
been addressed. The present work takes inspiration from
some recent contributions proposed by the authors, 
see (Di Mauro et al. 2016), (Di Mauro et al. 2017), 
and, basically, provides two original developments. 
The first one concerns the stochastic modeling of 
a virtualized IMS (called vIMS) framework by considering
a typical telecommunication network scenario.
The second one pertains to the proposal of
a more realistic model relaxing the assumption of
exponentially distributed failure and repair times
that characterizes Markovian models: the choice of some
non-exponentially distributed transition times
influences the transient analysis of systems.


3 OVERVIEW OF THE IMS 
ARCHITECTURE 


IP Multimedia Subsystem was born as a framework able to provide access to a plethora of
multimedia IP-based services with guaranteed quality of service (Camarillo and Garcia-Martin
2008). The IMS architecture supports a broad range of services by exploiting the flexibility
of the SIP protocol, including multimedia and real-time sessions (such as phone calls), web
messaging and enriched communications. The signaling flows are managed by the Call Session
Control Function (CSCF) servers, which communicate mainly by exchanging SIP messages. The
CSCF functionalities are distributed among three servers. The Proxy CSCF (P-CSCF) is a SIP
proxy, and acts as an interface between the user device and the IMS network. The
Interrogating CSCF (I-CSCF) forwards SIP requests or responses within the domain. The
Serving CSCF (S-CSCF) is in charge of performing some core functions such as session and
routing control and user registration management. Another key element of the IMS
infrastructure is the Home Subscriber Server (HSS), an advanced database containing users’
profiles that can be queried by means of a specific protocol called Diameter. Such nodes are
interconnected to provide basic and advanced services. An example is offered in Fig. 1,
where a (simplified) Registration procedure (needed before exploiting IMS services) is
depicted. Initially, a user device contacts the P-CSCF via a Register message (1). Such a
message is passed to the I-CSCF (2) that, in turn, sends it to the HSS (3) in order to
retrieve the address of the S-CSCF in charge of the current registration. Once the
information has been obtained from the HSS (4), the message is forwarded to the correct
S-CSCF (5). Finally, an OK message indicating a correct device registration is
back-propagated to the user device (6), (7), (8). Once the registration procedure is
completed, the device is ready to use IMS services such as a real-time audio/video session.
It is worth noting that Fig. 1 depicts the signaling flow only; typically, media flows
between two user devices (i.e. the content of an audio/video call) traverse a different path.
Figure 1. A simplified registration procedure in IMS
domain. The Register message is propagated from device 
to S-CSCF. A 200 OK message is back-propagated to 
device if the procedure ends correctly. 
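
The registration walk-through of this section can be restated as a small data structure, which also makes explicit which nodes must all be working for the Registration service. This is an illustrative sketch only: the `Hop` type, the label `UE` for the user device, and the message labels used for steps (3) and (4) are assumptions introduced here, and the return path of steps (6) to (8) is assumed to retrace the forward path.

```python
from typing import List, NamedTuple, Set


class Hop(NamedTuple):
    step: int
    src: str
    dst: str
    message: str


# Simplified registration flow; labels for steps 3-4 are hypothetical placeholders.
REGISTRATION_FLOW: List[Hop] = [
    Hop(1, "UE", "P-CSCF", "REGISTER"),
    Hop(2, "P-CSCF", "I-CSCF", "REGISTER"),
    Hop(3, "I-CSCF", "HSS", "S-CSCF lookup (Diameter)"),
    Hop(4, "HSS", "I-CSCF", "S-CSCF address (Diameter)"),
    Hop(5, "I-CSCF", "S-CSCF", "REGISTER"),
    Hop(6, "S-CSCF", "I-CSCF", "200 OK"),
    Hop(7, "I-CSCF", "P-CSCF", "200 OK"),
    Hop(8, "P-CSCF", "UE", "200 OK"),
]


def is_contiguous(flow: List[Hop]) -> bool:
    """Each hop must start where the previous one ended."""
    return all(a.dst == b.src for a, b in zip(flow, flow[1:]))


def network_nodes(flow: List[Hop]) -> Set[str]:
    """Network-side nodes traversed by the flow (the user device is excluded)."""
    nodes = {h.src for h in flow} | {h.dst for h in flow}
    return nodes - {"UE"}


print(sorted(network_nodes(REGISTRATION_FLOW)))
```

Since every listed network node appears in the flow, a failure of any one of them interrupts registration, which is why a series arrangement of node types is the natural availability model for this scenario.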
3.1 Virtualized IMS domain


We recall that, in our proposal, the IMS framework
is assumed to be integrated in an NFV environment.
Such a hybrid solution has recently been attracting
attention both in the technical literature (Duan et al. 2017)
and in the industrial world (ETSI 2015). Accordingly,
the involved nodes (P-CSCF, I-CSCF, S-CSCF,
HSS) are modeled by VNFs composed, in turn, of
three layers:


• Hardware: represents the aggregate of physical
subsystems (CPU, RAM, Storage, etc.) that are
often deployed in server farms;

• Application: represents the software logic
deployed on top of a specific VNF (P-CSCF,
HSS, etc.);

• Hypervisor (or Virtual Machine Monitor): an
intermediate (software-based) level that allows
deploying one (or more) virtual nodes on the same
hardware and managing the resource consumption
of each VNF.


In our setting, the hardware and hypervisor lay- 
ers are supposed to be the same for CSCF servers 
and for the HSS node, while the application layer is 
modeled separately for CSCF nodes and HSS. 


4 AVAILABILITY MODEL 


The system availability analysis is performed by means of a two-level hierarchical model
combining two formalisms: Reliability Block Diagrams (RBDs) and Stochastic Reward Networks
(SRNs). The former constitutes the first level, and is useful to model the system in terms
of interconnections among nodes (subsystems) of the considered vIMS infrastructure. Figure 2
shows the RBD representation derived from the Registration scenario reported in Fig. 1. A
series model is necessary to represent the IMS, since all network functionalities must be
active to guarantee the Registration service to users, whereas a parallel configuration of
replicas for each node is useful to ensure a certain degree of redundancy in case of
failures. Furthermore, we assume that the HSS node is deployed in a k-out-of-n:G
configuration, where k represents the number of HSS replicas that must work for the HSS
node to work. Henceforth, we assume k = 2, in accordance with most actual deployments.


Figure 2. The Reliability Block Diagram representation
of the virtualized IMS domain, where HSS is deployed in a
2-out-of-$n_H$ redundancy configuration.
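
The k-out-of-n:G rule mentioned above has a standard closed form when the replicas are independent and identical: the block works if at least k of the n replicas work. The sketch below is a generic illustration under that independence assumption; the function name and the replica availability used in the example are hypothetical and not taken from the paper.

```python
from math import comb


def k_out_of_n_availability(k: int, n: int, a: float) -> float:
    """Probability that at least k of n independent replicas, each with availability a, are up."""
    return sum(comb(n, i) * a**i * (1.0 - a) ** (n - i) for i in range(k, n + 1))


# Hypothetical replica availability, for illustration only.
replica = 0.999
print(k_out_of_n_availability(2, 3, replica))  # 2-out-of-3 block
print(k_out_of_n_availability(1, 2, replica))  # k = 1 reduces to a plain parallel block
```

With k = 1 the expression reduces to the usual parallel formula, while k = n gives a series arrangement of the replicas.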

On the other hand, the second level of the hier- 
archical model based on SRN is used to describe 
the internal behavior of a single node by char- 
acterizing the relationships among three layers 
(hardware, hypervisor, application). In the fol- 
lowing subsection, after a brief description of the 
SRN formalism, we provide a specific model of a 
generic vIMS node. For further reading on the use
of SRN for availability evaluation, refer to (Mup- 
pala et al. 1994). 


4.1 Stochastic reward networks approach 


The SRN model derives from the Markov Reward Model (MRM), which enhances the traditional
Continuous-Time Markov Chain (CTMC) by adding a reward rate to each state. With state-space
based models (such as MRMs), a classic problem in modeling real-world systems is the growth
of the state space. By contrast, an SRN-based representation allows a more compact
description of the underlying system by identifying repetitive structures. In this way, it
is possible to automatically generate the underlying MRMs (Bolch et al. 1998). The SRN model
adopts a bipartite directed graph representation where: i) places (represented by circles)
specify a condition (e.g., the system is down or up), and ii) transitions (represented by
rectangles) denote actions (e.g., a system crash). Places and transitions are connected by
arcs denoted by directed edges. In the SRN formalism, the transition times are assumed to be
exponentially distributed, since they model a probabilistic delay. A place typically
contains one or more tokens, which represent a holding condition. When a condition changes,
a transition fires, and the token is moved from one place to another. A measure of interest
is the distribution of tokens, called marking, which denotes a possible assignment of tokens
to all places and corresponds to a state of the underlying Markov model; it is useful to
capture the dynamics of the overall system.

In the SRN context, the reward function, say $X(t)$, has a crucial role. It is a
(non-negative) random process that represents system conditions, namely $X(t)$ varies with
time $t$ in accordance with the desired measures, for instance availability, dependability,
or performance (Muppala et al. 1996). Being interested in availability evaluation, we define
$X(t) = 1$ if the system is working (up condition) at time $t$, and $X(t) = 0$ otherwise
(down condition). Accordingly, it is possible to define the instantaneous availability
$A(t)$ as the expected reward at time $t$, i.e.

$$A(t) = \Pr\{X(t) = 1\} = E[X(t)] = \sum_{i \in S} r_i \, p_i(t), \qquad (1)$$

where $S$ denotes the state space (namely, the set of markings within the SRN), which can be
partitioned into a subset of up states $S_u$ (reward rate $r_i = 1$) and a subset of down
states $S_d$ (reward rate $r_i = 0$), whereas $p_i(t)$ is the probability of the system
being in state $i$ at time $t$.
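
To make the expected-reward definition (1) concrete, the sketch below evaluates it for the simplest possible case: a single component with two states (up and down), exponential failure and repair times, starting in the up state. The two-state model and its closed-form state probabilities are a textbook simplification used here only for illustration; it is not the SRN of the paper. The numeric rates reuse the CSCF software values of Table 1 (3000 hours to failure, 1 hour to repair).

```python
import math


def instantaneous_availability(t: float, failure_rate: float, repair_rate: float) -> float:
    """A(t) of a two-state CTMC (up/down) that is up at t = 0, computed as an expected reward."""
    total = failure_rate + repair_rate
    # Closed-form transient state probabilities of the two-state chain.
    p_up = repair_rate / total + (failure_rate / total) * math.exp(-total * t)
    probabilities = {"up": p_up, "down": 1.0 - p_up}
    rewards = {"up": 1.0, "down": 0.0}  # reward rate 1 for up states, 0 for down states
    # Expected reward: sum over the state space of r_i * p_i(t).
    return sum(rewards[s] * probabilities[s] for s in rewards)


lam = 1.0 / 3000.0  # failure rate (per hour)
mu = 1.0            # repair rate (per hour)
for hours in (0.0, 1.0, 10.0, 1e6):
    print(hours, instantaneous_availability(hours, lam, mu))
```

For large t the value approaches the familiar steady-state ratio mu / (lam + mu).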

The SRN of a generic vIMS node replica (CSCF
or HSS) is shown in Fig. 3, while the overall vIMS
infrastructure is described by the RBD model presented
in Fig. 2.

By inspection of Fig. 3 it is possible to distin- 
guish the following entities: 


• Places (circles): the group of places $P_{upHW}$,
$P_{upVMM}$, and $P_{upAPP}$ takes into account the
working conditions of the hardware, hypervisor
(the VMM subscript stands for Virtual Machine
Monitor), and application layers, respectively.
The numbers inside the three places indicate
the corresponding initial (working) conditions.
Conversely, the group of places $P_{dnHW}$, $P_{dnVMM}$,
and $P_{dnAPP}$ represents the failure conditions
of the hardware, hypervisor, and application layers,
respectively.

• Timed Transitions (unfilled rectangles): such
transitions take into account the behavior of the
various layers; in particular, $T_{fAPP}$ [$T_{rAPP}$],
$T_{fVMM}$ [$T_{rVMM}$] and $T_{fHW}$ [$T_{rHW}$] denote the
failure [repair] events of the application, the
hypervisor, and the hardware, respectively.

• Immediate Transitions (thin and filled rectangles):
such transitions take into account the instantaneous
actions, namely, actions characterized by
zero transition time. In the proposed SRN, two
immediate transitions appear: $t_{APP}$ and $t_{VMM}$.


Figure 3. SRN representation of one generic node replica of the vIMS network infrastructure.


4.2 Evolutionary model of SRN


In this section we analyze the dynamics of the system,
namely, the conditions arising when events
such as failures or repairs occur. In particular, we
focus on the evolution of the SRN model of a single
vIMS node. For the sake of simplicity, it is useful
to consider an initial fully working condition for
the node, characterized by a token in each $P_{up}$ place of the SRN. If an application
failure happens, it means that the software function representing the logic of a vIMS node
(a CSCF node or the HSS node) breaks. In this case, the transition $T_{fAPP}$ is fired, and
the token leaves place $P_{upAPP}$ to enter place $P_{dnAPP}$. Once the application gets
repaired (sometimes a trivial reboot procedure could solve the problem), the transition
$T_{rAPP}$ is fired, and the token comes back to the initial place $P_{upAPP}$. In case of
hypervisor failure, instead, the transition $T_{fVMM}$ is fired, and the token is moved from
place $P_{upVMM}$ to place $P_{dnVMM}$. It is worth noting that the place $P_{upVMM}$ is
connected to the immediate transition $t_{APP}$ by an inhibitory arc (the segment with a
small circle close to $t_{APP}$). When $P_{upVMM}$ is empty, such an arc forces $t_{APP}$
to fire in order to model an application failure. The application layer, in fact, needs a
working hypervisor layer to work correctly. On the contrary, when the hypervisor gets
repaired, the token is moved back from $P_{dnVMM}$ to $P_{upVMM}$, and the inhibitory arc
is now ineffective. A similar reasoning holds for the failure of hardware. The token passes
from $P_{upHW}$ to $P_{dnHW}$ as transition $T_{fHW}$ gets fired. In such a case, an
inhibitory arc (between $P_{upHW}$ and $t_{VMM}$) forces $t_{VMM}$ to move the token from
$P_{upVMM}$ to $P_{dnVMM}$, since the hypervisor layer needs an underlying functioning
hardware to work properly. It is interesting to note that another inhibitory arc connects
$P_{dnHW}$ and $T_{rVMM}$. Such an arc inhibits the firing of transition $T_{rVMM}$ (and
consequently the firing of transition $T_{rAPP}$) until the hardware layer gets repaired.
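
The token movements described above can be replayed with a few lines of code. The following is a minimal, untimed sketch of the failure and repair logic of one node replica: it tracks only the marking and ignores firing delays. The place and transition names (`upHW`, `fHW`, and so on) are shorthand chosen here, and the guard that blocks the application repair while the hypervisor is down is an assumption of this sketch standing in for the "and consequently" remark in the text.

```python
LAYERS = ("HW", "VMM", "APP")


def initial_marking() -> dict:
    """Fully working node: one token in every 'up' place, none in the 'dn' places."""
    marking = {f"up{layer}": 1 for layer in LAYERS}
    marking.update({f"dn{layer}": 0 for layer in LAYERS})
    return marking


def _move(marking: dict, src: str, dst: str) -> None:
    marking[src] -= 1
    marking[dst] += 1


def _fire_immediate(marking: dict) -> None:
    """Fire the immediate transitions (zero delay) until none is enabled."""
    changed = True
    while changed:
        changed = False
        # t_VMM: inhibitor arc from upHW, so it fires when the hardware is not up.
        if marking["upHW"] == 0 and marking["upVMM"] == 1:
            _move(marking, "upVMM", "dnVMM")
            changed = True
        # t_APP: inhibitor arc from upVMM, so it fires when the hypervisor is not up.
        if marking["upVMM"] == 0 and marking["upAPP"] == 1:
            _move(marking, "upAPP", "dnAPP")
            changed = True


def enabled(marking: dict, transition: str) -> bool:
    """Timed transitions are named 'f<LAYER>' (failure) or 'r<LAYER>' (repair)."""
    kind, layer = transition[0], transition[1:]
    if kind == "f":
        return marking[f"up{layer}"] == 1
    if marking[f"dn{layer}"] == 0:
        return False
    if layer == "VMM":
        return marking["dnHW"] == 0   # inhibitor arc from dnHW
    if layer == "APP":
        return marking["dnVMM"] == 0  # assumption of this sketch
    return True


def fire(marking: dict, transition: str) -> None:
    if not enabled(marking, transition):
        raise ValueError(f"transition {transition} is not enabled")
    kind, layer = transition[0], transition[1:]
    src, dst = (f"up{layer}", f"dn{layer}") if kind == "f" else (f"dn{layer}", f"up{layer}")
    _move(marking, src, dst)
    _fire_immediate(marking)


def is_up(marking: dict) -> bool:
    """Reward condition: the node is up when the application place holds a token."""
    return marking["upAPP"] == 1
```

Starting from `initial_marking()`, firing `"fHW"` drives all three layers down through the immediate transitions, and the repairs can then only be fired in the order `"rHW"`, `"rVMM"`, `"rAPP"`.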

Let now $r_i$ be the reward rate assigned to marking $i$ (the $i$-th distribution of tokens),
and $p_{i,j}(t)$ the probability that a generic node replica $j$, modeled by the SRN depicted
in Fig. 3, is in marking $i$ at time $t$. Since the markings are mutually exclusive, it is
possible to express the instantaneous availability $A^{(j)}(t)$ according to (1), namely

$$A^{(j)}(t) = \sum_{i \in I} r_i \, p_{i,j}(t), \qquad (2)$$

with $I$ identifying the set of so-called tangible markings (markings with no immediate
transitions enabled). The reward rate $r_i$ associated with the tangible marking $i$ is
given by:

$$r_i = \begin{cases} 1 & \text{if } \#P_{upAPP} = 1, \\ 0 & \text{otherwise.} \end{cases}$$

It is worth noting that such a reward rate condition does not need to account for the up
condition of hypervisor and hardware, because it is intrinsically embedded in the SRN model
in Fig. 3, where inhibitory arcs prevent a working application layer from being coupled with
failed hypervisor and/or hardware layers. A steady-state analysis for the SRN-based model of
the generic vIMS node replica $j$ can be simply obtained from (2) for long runs
($t \to \infty$), and the corresponding steady-state availability $A^{(j)}$ has the form:

$$A^{(j)} = \lim_{t \to \infty} A^{(j)}(t) = \sum_{i \in I} r_i \, p_{i,j}, \qquad (3)$$

where $p_{i,j}$ represents the steady-state probability given by
$p_{i,j} = \lim_{t \to \infty} p_{i,j}(t)$.

We recall that the overall vIMS system can be modeled as a series/parallel arrangement of
independent subsystems as shown in Fig. 2, and the availability of all subsystems is derived
from (3). Accordingly, by exploiting the power of the RBD representation, the overall vIMS
steady-state availability can be expressed as

$$A_{vIMS} = \left[1 - \prod_{j=1}^{n_P} \left(1 - A_P^{(j)}\right)\right] \left[1 - \prod_{j=1}^{n_S} \left(1 - A_S^{(j)}\right)\right] \left[1 - \prod_{j=1}^{n_I} \left(1 - A_I^{(j)}\right)\right] \sum_{i=k}^{n_H} \binom{n_H}{i} A_H^{i} \left(1 - A_H\right)^{n_H - i}, \qquad (4)$$

where $A_P^{(j)}$, $A_S^{(j)}$, and $A_I^{(j)}$ are the steady-state availabilities of the
$j$-th replica of nodes P-CSCF, S-CSCF and I-CSCF, respectively, while the steady-state
availability of the HSS node replicas is $A_H^{(j)} = A_H, \ \forall j$; the numbers of
redundant subsystems of each network functionality are $n_P$, $n_S$, $n_I$ and $n_H$,
respectively. The vIMS being a series system of network functionalities, the vIMS
steady-state availability (4) is a product of single node availabilities, where the P-CSCF,
S-CSCF and I-CSCF nodes are in parallel configuration and provide the first three factors in
(4). The last factor, instead, takes into account the k-out-of-n:G configuration of the HSS
node, where $k = 2$ and $n = n_H$.
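
The structure of (4), three parallel blocks in series with one k-out-of-n block, can be sketched in a few lines. The code below is only an illustration of that composition: the function names are chosen here, and the replica availabilities in the example are hypothetical placeholders rather than the SRN-derived values used in the paper.

```python
from math import comb, prod
from typing import Sequence


def parallel(availabilities: Sequence[float]) -> float:
    """Parallel block: it fails only if every replica fails."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def k_out_of_n(k: int, n: int, a: float) -> float:
    """At least k of n identical, independent replicas are up."""
    return sum(comb(n, i) * a**i * (1.0 - a) ** (n - i) for i in range(k, n + 1))


def vims_availability(a_p: Sequence[float], a_s: Sequence[float], a_i: Sequence[float],
                      a_hss: float, n_hss: int, k: int = 2) -> float:
    """Series of the P-CSCF, S-CSCF and I-CSCF parallel blocks and the HSS k-out-of-n block."""
    return parallel(a_p) * parallel(a_s) * parallel(a_i) * k_out_of_n(k, n_hss, a_hss)


# Hypothetical per-replica availabilities, for illustration only.
A_CSCF = 0.999
A_HSS = 0.999
base = vims_availability([A_CSCF] * 2, [A_CSCF] * 2, [A_CSCF] * 2, A_HSS, n_hss=3)
more_hss = vims_availability([A_CSCF] * 2, [A_CSCF] * 2, [A_CSCF] * 2, A_HSS, n_hss=4)
print(base, more_hss)
```

Because the blocks are in series, the system availability can never exceed that of its weakest block, which is why adding replicas to the least available functionality is the most effective lever.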


5 NUMERICAL ANALYSIS 


This section presents a numerical analysis of the proposed IMS framework over an NFV
environment by exploiting two software tools: SHARPE (Symbolic Hierarchical Automated
Reliability and Performance Evaluator) (Sahner and Trivedi 1987) and TimeNet (German et al.
1995). It aims to single out the minimal-cost redundant configuration of the virtualized IMS
system able to guarantee the “five nines” requirement for telecommunication systems
availability. The parameters used for this analysis, representative of the mean times to
failure and repair of the various components (hardware, hypervisor, application), are
reported in Table 1. The adopted values are derived in part from the experience of
telecommunication experts, and in part from technical literature. For the sake of simplicity
we assume that: i) hardware and hypervisor are the same for all the nodes; ii) all the
software instances running on CSCF nodes (P-CSCF, I-CSCF, S-CSCF) are characterized by the
same failure and repair times. On the contrary, the software instance running on the HSS is
supposed to have a different mean time to failure, the database being a more delicate and
failure-prone element. As common in literature, the symbols $\lambda$ and $\mu$ denote
failure and repair rates, respectively. To characterize the stationary availability of the
vIMS system in long runs, a steady-state analysis is carried out by considering some
exemplary settings, as shown in Table 2. The first column of Table 2 indicates the setting
identifier ($S_1, \ldots, S_5$). The second column indicates the considered redundancy level;
for example, the setting $S_3$ is characterized by two (arbitrary) CSCF nodes having
redundancy of 2, the remaining CSCF node having redundancy of 3, and the HSS node having
redundancy of 3. The third column indicates the steady-state availability value of the whole
virtualized IMS system in Fig. 2. In order to visualize the steady-state availability
results more conveniently, we show in Fig. 4 the unavailability of the virtualized IMS
system, $1 - A_{vIMS}$. It is possible to observe that each setting $S_i$ satisfies the
“five nines” requirement, namely, each bar lies below the horizontal dashed line at
$1 - A_{vIMS} = 10^{-5}$. In the case of setting $S_5$, the system is even able to satisfy
the more challenging “six nines” requirement, the corresponding bar lying below the
horizontal dashed line at $1 - A_{vIMS} = 10^{-6}$. Among the considered settings, $S_1$
entails the minimum number of deployed replicas, namely, 2 replicas for each CSCF node and 3
replicas for the HSS node. Consequently, setting $S_1$ represents the optimal redundant
configuration in terms of minimum number of deployed replicas while fulfilling the desired
high availability requirement.


Table 1. Input parameters for the hardware subsystems.

Parameter             Description                            Value

$1/\lambda_{HW}$      mean time for hardware failure         60000 hours
$1/\lambda_{VMM}$     mean time for hypervisor failure       5000 hours
$1/\lambda_{CSCF}$    mean time for CSCF node failure        3000 hours
$1/\lambda_{HSS}$     mean time for HSS node failure         2000 hours
$1/\mu_{HW}$          mean time for hardware repair          8 hours
$1/\mu_{VMM}$         mean time for hypervisor repair        2 hours
$1/\mu_{CSCF}$        mean time for CSCF software repair     1 hour
$1/\mu_{HSS}$         mean time for HSS software repair      1 hour


Table 2. Availability results of the whole virtualized
IMS.

Setting   Redundancy level              $A_{vIMS}$

$S_1$     CSCF = [2,2,2], HSS = 3       0.99999416
$S_2$     CSCF = [2,2,2], HSS = 4       0.99999756
$S_3$     CSCF = [2,2,3], HSS = 3       0.99999497
$S_4$     CSCF = [3,3,3], HSS = 3       0.99999659
$S_5$     CSCF = [2,3,3], HSS = 4       0.99999918


Figure 4. Steady-state unavailability for different settings
$S_1, S_2, \ldots, S_5$. The upper horizontal dashed line represents
the required unavailability: $1 - A_{vIMS} = 10^{-5}$.
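
The selection of the optimal setting can be replayed directly from the values reported in Table 2. The sketch below filters the settings that meet a given unavailability threshold and picks the one with the fewest replicas; the dictionary layout and the function names are choices made here for illustration.

```python
# Settings of Table 2: (CSCF redundancy levels, HSS redundancy level, steady-state availability).
TABLE_2 = {
    "S1": ((2, 2, 2), 3, 0.99999416),
    "S2": ((2, 2, 2), 4, 0.99999756),
    "S3": ((2, 2, 3), 3, 0.99999497),
    "S4": ((3, 3, 3), 3, 0.99999659),
    "S5": ((2, 3, 3), 4, 0.99999918),
}


def replica_count(cscf_levels, hss_level) -> int:
    """Total number of deployed replicas, used here as the cost of a setting."""
    return sum(cscf_levels) + hss_level


def settings_meeting(max_unavailability: float) -> dict:
    """Map each setting that satisfies the requirement to its replica count."""
    return {
        name: replica_count(cscf, hss)
        for name, (cscf, hss, availability) in TABLE_2.items()
        if 1.0 - availability <= max_unavailability
    }


five_nines = settings_meeting(1e-5)
six_nines = settings_meeting(1e-6)
optimal = min(five_nines, key=five_nines.get)
print(five_nines, six_nines, optimal)
```

Using the replica count as the cost is consistent with the criterion stated in the text; a deployment with different per-node costs would need a weighted sum instead.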


5.1 Transient non-Markovian analysis


The performed steady-state (regime) analysis is useful to evaluate the system behavior as
$t \to \infty$, but it cannot capture the dynamics of the system when, for example, some
node instances are changing their states from failed to repaired. We approach this issue by
analyzing the dynamics of the system when the software instances (namely the application
parts) of both P-CSCF node replicas are down and ready to be repaired. This is the typical
case of a simultaneous (and often unplanned) update of the operating system that forces both
software instances to be rebooted.

Actually, in order to consider a more realistic scenario, we replace the classical
hypothesis of exponentially distributed software repair transition times with a
Weibull-distributed one. In particular, the transient analysis evaluates the behavior of the
instantaneous availability $A_{vIMS}(t)$ and the interval availability $\bar{A}_{vIMS}(t)$
of the vIMS system in $(0, t]$, the latter being defined as

$$\bar{A}_{vIMS}(t) = \frac{1}{t} \int_0^t A_{vIMS}(u) \, du. \qquad (5)$$
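
The time average in (5) can be approximated numerically once the instantaneous availability is known. The sketch below does this with the trapezoidal rule for a deliberately simple stand-in: a single two-state component with exponential failure and repair times that is down at t = 0, mirroring the "instances down and ready to be repaired" starting point. This stand-in, the function names, and the step count are assumptions of the sketch; the rates reuse the CSCF software values of Table 1.

```python
import math


def instantaneous(t: float, lam: float, mu: float) -> float:
    """A(t) of a two-state component (failure rate lam, repair rate mu) that is down at t = 0."""
    total = lam + mu
    return (mu / total) * (1.0 - math.exp(-total * t))


def interval(t: float, lam: float, mu: float, steps: int = 10_000) -> float:
    """Trapezoidal approximation of (1/t) * integral of A(u) du over (0, t]."""
    h = t / steps
    area = 0.5 * (instantaneous(0.0, lam, mu) + instantaneous(t, lam, mu))
    area += sum(instantaneous(i * h, lam, mu) for i in range(1, steps))
    return area * h / t


lam, mu = 1.0 / 3000.0, 1.0  # per hour
for hours in (1.0, 5.0, 50.0):
    print(hours, instantaneous(hours, lam, mu), interval(hours, lam, mu))
```

In this toy case the interval availability stays below the instantaneous one during the recovery, because the average still carries the initial downtime.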


The interval availability represents the time average of the instantaneous availability
function over the interval $(0, t]$. Figures 5 and 6 show the behavior of $A_{vIMS}(t)$ and
of $\bar{A}_{vIMS}(t)$ when a Markovian process (all exponential transition times) and a
non-Markovian process (Weibull software repair transition times) are considered,
respectively. In the case of Weibull software repair transition times, we consider the same
mean repair time used in the case of exponentially distributed software repair transition
times. Therefore, we set the shape parameter of the Weibull distribution to $\alpha = 3$ (as
reported in (Guida et al. 2013)), while the scale parameter turns out to be 1.1198. Besides,
Fig. 7 highlights the different behavior of the instantaneous availability $A_{vIMS}(t)$
during the transient in the cases of exponential and Weibull repair times (of the
application part). It is worth noting that the transient of $A_{vIMS}(t)$ in the Weibull
case is slower than the transient of $A_{vIMS}(t)$ in the exponential case. This behavior is
compatible with real-world effects (Ayers 2012). In both cases, as expected, the interval
availability $\bar{A}_{vIMS}(t)$ in (5) converges more slowly to the steady-state
availability than the instantaneous availability $A_{vIMS}(t)$.


Figure 5. Instantaneous ($A_{vIMS}(t)$) and interval ($\bar{A}_{vIMS}(t)$)
availability in case of exponential repair times of P-CSCF
application part.

Figure 6. Instantaneous ($A_{vIMS}(t)$) and interval ($\bar{A}_{vIMS}(t)$)
availability in case of non-Markovian process with a
Weibull repair time of P-CSCF application part.

Figure 7. Instantaneous availability comparison in case
of Markovian process (all exponential transition times)
and non-Markovian process (Weibull repair time of
P-CSCF application part).
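
The quoted scale parameter follows from matching the Weibull mean to the exponential mean repair time: for shape $\alpha$ and scale $s$, the Weibull mean is $s \, \Gamma(1 + 1/\alpha)$. The sketch below recomputes the scale for a mean of 1 hour and shape 3, and compares the probability of completing a repair early under the two distributions. Function names are chosen here for illustration.

```python
import math


def weibull_scale_for_mean(mean: float, shape: float) -> float:
    """Scale parameter that gives the requested mean for a Weibull with the given shape."""
    return mean / math.gamma(1.0 + 1.0 / shape)


def weibull_cdf(t: float, scale: float, shape: float) -> float:
    """Probability that a Weibull-distributed repair is completed within time t."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def exponential_cdf(t: float, mean: float) -> float:
    """Probability that an exponentially distributed repair is completed within time t."""
    return 1.0 - math.exp(-t / mean)


MEAN_REPAIR_HOURS = 1.0  # software repair time of Table 1
SHAPE = 3.0
scale = weibull_scale_for_mean(MEAN_REPAIR_HOURS, SHAPE)
print(f"scale = {scale:.4f}")
print(weibull_cdf(0.5, scale, SHAPE), exponential_cdf(0.5, MEAN_REPAIR_HOURS))
```

With shape 3, very short repairs are much less likely than under the exponential law with the same mean, which is consistent with the slower initial transient reported for the Weibull case.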


5.2 Sensitivity analysis 


As the last part of the considered experiment, we perform a sensitivity analysis with
respect to deviations of two system parameters from their nominal values:
$\lambda_{P\text{-}CSCF}$ and $\lambda_{HSS}$. Such an analysis has been carried out by
considering the minimal-cost setting $S_1$ obtained as a result of the steady-state
analysis. In Fig. 8, the influence of the P-CSCF application part failure time is
considered. The nominal value amounts to 3000 hours (see Table 1), but if we relax such a
value to about 1500 hours, we are still able to satisfy the “five nines” requirement. At
$1/\lambda_{P\text{-}CSCF} = 1500$, in fact, the value of $A_{vIMS}$ is approximately
0.9999935. Figure 9, instead, shows the influence of the HSS application part failure time.
In this case, the nominal value amounts to 2000 hours, and no side effects are notable
unless $1/\lambda_{HSS}$ falls below approximately 1000 hours. In this case, the horizontal
dashed line at 0.999990 is useful to identify the breakpoint across the high availability
requirement.


Figure 8. Influence of the failure time $1/\lambda_{P\text{-}CSCF}$ on system
availability, in case of optimal setting $S_1$.

Figure 9. Influence of the failure time $1/\lambda_{HSS}$ on system
availability, in case of optimal setting $S_1$.
6 CONCLUSIONS


Network Function Virtualization and IP Multimedia Subsystem represent two fundamental
concepts within the 5G telecommunication world. The former is exploited to virtualize
network functions into building blocks that can be connected or chained in several ways.
The latter is designed to provide advanced IP-based multimedia services. By combining these
two paradigms it is possible to derive a virtualized IMS (vIMS) infrastructure that, in this
work, has been characterized in terms of availability. Each virtual block of the vIMS
framework has been modeled as a three-layer structure considering: hardware, software (or
application) and hypervisor. From a macroscopic viewpoint, the vIMS has been described by
exploiting the Reliability Block Diagram (RBD) representation, useful to capture the
relationships among the blocks. On the other hand, each virtual block has been modeled by a
Stochastic Reward Net (SRN), able to characterize the probabilistic functioning of the
underlying system in terms of failure and repair events. Such a modeling phase is
preparatory to performing: i) a steady-state availability analysis aimed at finding the
minimal-cost redundant configuration that guarantees the “five nines” availability
requirement; ii) a transient analysis focused on capturing the vIMS dynamics, where a more
realistic repair distribution (Weibull) for the software layer has been taken into account;
iii) a sensitivity analysis aimed at evaluating the robustness of vIMS in case of some
fluctuation of the critical parameters with respect to nominal values, or of estimation
errors in real data sets. Future work will be devoted to the availability analysis of a more
realistic vIMS infrastructure, where some network nodes are co-located and common mode
failures arise.
ABSTRACT: In order to bridge the gap between Model-Based Systems Engineering and Model-Based
Safety Assessment, we propose in this paper a language transformation between SysML semi-formal mod- 
els and the formal language AltaRica 3.0. Meta-data of SysML Block Definition Diagram and Internal 
Block Diagram that describe system architecture as well as meta-data of SysML State Machine Diagram 
that represent system behavior (in a limited formalism with respect to AltaRica’s Guarded Transitions 
System) are used to generate AltaRica classes, blocks, events, transitions, etc. Flow port directions and 
connectors are used to create flow propagation assertions. The object and prototype-oriented paradigm 
of AltaRica 3.0 with class, composition, inheritance, etc. will be respected since SysML and AltaRica’s 
System Structure Modeling Language share commonalities in structuring constructs. It is obvious that 
one modeling language cannot be replaced by the other because their goals and domains are different, but 
the mapping between languages such as SysML and AltaRica allows better understanding and communi- 
cation between systems engineers and safety experts. Once the preliminary AltaRica 3.0 code is generated 
with structural and behavioral information, safety experts will complete and validate the code with stochastic
models, synchronization, common cause failures and redundancy mechanisms to carry out safety
assessment based on the expressive power of the language, thanks to its mathematical framework.


1 INTRODUCTION 


Model-Based Systems Engineering (MBSE) 
(INCOSE 2015) is a common approach support- 
ing the design and management of complex sys- 
tems. Various modeling tools and languages can be 
used according to the different domains involved 
in the system, the level of detail, the system aspects 
to be modeled, etc. SysML (OMG 2015) is a
general-purpose modeling language adapted to systems
engineering, since it can express the main
concepts inherent to the different aspects of
system development. It provides a unified standard
for specifying, analyzing, designing, and validating 
complex systems. The language also allows a multi- 
viewpoint model and building traceability links. 
Model-Based Safety Assessment (MBSA) aims 
at using high level modeling languages to integrate 
risk analysis with system architecture. Classical 
safety modeling formalisms such as Fault Trees, 
Reliability Block Diagrams, Event Trees, Markov
Chains and Stochastic Petri Nets can provide
efficient assessment algorithms and/or expressive
power, but they are distant from the architecture
of the system. AltaRica (Point 2000) is a
formal language that supports the MBSA approach.
An earlier version of the language, AltaRica Data
Flow (ADF) (Arnold et al. 2000), has already been
embedded in some commercial integrated modeling
environments such as Safety Designer, Cecilia
OCAS and Simfia. In ADF, a model is composed
of nodes that are characterized by their reachable 
states, in and out flows, events, transitions and 
assertions. Once a system model is specified in the 
AltaRica language, it can be compiled into a lower 
level formalism such as finite-state machines, fault 
trees and stochastic Petri Nets and different safety 
assessments can be performed. 

A more recent version of AltaRica, AltaRica 3.0 
(Prosvirnova 2014) (Batteux et al. 2015) improves 
the language with the underlying mathematical 
framework Guarded Transition Systems (GTS) 
and new structure constructs via System Structure 
Modeling Language (S2ML). GTS (Rauzy 2008)
is a general states/events formalism which handles
looped systems by using a fixed-point calculation
mechanism. Compositions such as free product
and synchronization between GTS make it possible
to build hierarchical and modular systems. Meanwhile,
S2ML assembles constructs coming from object 
and prototype-oriented modeling languages such 
as classes, prototypes, composition, inheritance, 
etc. A flattening algorithm is needed to collapse 
the hierarchy of nested blocks and instances of 
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classes into a single program in order to compile 
and execute AltaRica 3.0 models. 

The first objective of this paper is to study 
the existing work concerning the automatic 
translation between SysML and AltaRica in 
order to integrate the reliability analysis with the 
design process. Second, we introduce our
approach to generate AltaRica 3.0 code from
SysML models, which takes into account the new
features of the language while remaining consistent
with previous work so as not to reinvent the
wheel. The paper is organized as follows.
Section 2 presents some related work on SysML and
AltaRica mapping. The new version of AltaRica 
is introduced briefly in Section 3 with an example 
taken from AltaRica 3.0 training material. Sec- 
tion 4 describes SysML elements used for AltaR- 
ica 3.0 mapping as well as the generated code for 
the given example. Discussions and future work 
are given in Section 5. 


2 RELATED WORK 


The link between SysML and AltaRica has been 
studied in several works, such as David et
al. (2009), Belmonte and Soubiran (2012), Ruin et 
al. (2012), Yakymets et al. (2013) and Hecht et al. 
(2015). 

David et al. (2009) proposed, as a continuation 
of their MéDISIS methodology integrating sys- 
tems engineering and safety analysis, a mapping 
between SysML models and AltaRica Data Flow 
(ADF) language, so that existing tools to quantify 
reliability indicators such as the global failure rate, 
the mean time to failure, etc., can be used directly 
on the failure modes identified in the previous 
step of the methodology. An architecture trans- 
lation was first carried out by exploiting SysML 
architectural view through Block Definition Dia- 
gram (BDD) and Internal Block Diagram (IBD). 
A SysML block with its properties is translated 
into an ADF node with its sub nodes (parts), flow 
variables (ports) and state variables (values). The 
synchronization of flow exchange is performed 
by IBD analysis. Limited variable types and uni- 
directional flow directions in ADF obliged the 
authors to propose some adaptations to perform 
an automatic translation. For the behavioral part,
an exhaustive list of SysML elements that depict
the behavior (operations, actions, messages,
etc.) is proposed to guide the creation of events,
transitions, guards, assignments or assertions in
ADF. However, SysML state machine diagrams,
whose mathematical framework is close to the
transitions/events mechanism of AltaRica and which
could facilitate the automatic translation, are not
considered in that work.


Belmonte and Soubiran (2012) proposed a
translation from Obeo Designer’s Domain Specific
Modeling Language (DSML) for Preliminary Hazard
Analysis (PHA) and Failure Mode and Effects
Analysis (FMEA), as well as from SysML for system
functional models, into AltaRica to enable formal
verification. A node in an AltaRica model can be
expressed as an octuple (l, SI, F, S, E, I, T, A)
whose components correspond respectively to the
identifier, subnode instances, flow definitions,
state variable definitions, events, initial state,
guarded transitions and Boolean assertions.
Different transformation rules are given concerning
the current operational context, the environment
nodes (functional activities or operations), the
FMEA nodes (dysfunctional specifications) and
the PHA node (top level node). However,
no real system has been studied yet to
demonstrate the scalability of the method.

In the context of Complex Maintenance Pro- 
gram Quantification, Ruin et al. (2012) proposed 
a framework using SysML to model the static 
and interaction part of production systems, and 
AltaRica Data Flow formal language to model 
the concept behavior part. The model building 
language and the model execution language are 
related to each other by a transformation step. The 
SysML Sequence Diagram representing interac- 
tions between blocks and the Parametric Diagram 
describing relationships between attributes of 
blocks are used to generate ADF parts made up 
with nodes, states, events, initial states and transi- 
tions. According to their algorithm: a node is cre- 
ated for each lifeline in the sequence diagram; an 
event corresponds to a reflexive message (looped 
back message) to the same lifeline; and when a 
transition takes place, the state variable will change 
from one “condition mark” to another. Synchro- 
nization is made via messages sent from one life- 
line to the other and the looped back messages. 
However, no algorithm is given for the paramet- 
ric diagram and the case study in the paper is not 
an industrial scale system. As cited in Ruin et al. 
(2012), ADF presents some restrictions like the 
impossibility to model looped systems and acausal 
connection where the direction of the flow propa- 
gation depends on the states. 

Yakymets et al. (2013) presented a safety mod- 
eling framework for fault tree generation and 
analysis SMF-FTA. This framework includes 
meta-models, profiles, model transformation, veri- 
fication, and Fault Tree Analysis (FTA) tools. In 
this approach, several steps are needed. First, the 
system to be analyzed is designed and its structural 
models are built using the SysML BDD and IBD 
diagrams. Second, these models are annotated 
with failure behavior. Then the entire model is con- 
verted into AltaRica language by using transfor- 
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mation rules. A SysML block is translated into an 
AltaRica node containing flows that correspond to 
the block’s ports. Expressions showing how output 
port errors can be derived from internal failures of 
the block and/or possible deviations in the input 
ports are stored in SysML opaque expressions 
and used to generate AltaRica internal component 
transitions. Connection between components via 
the connectors are used to create assertions. An 
algorithm already existing in the ARC tool ana- 
lyzes the AltaRica model and derives the different 
minimal cut sets from the model. These cut sets are 
assembled to form the final fault tree. The resulting
fault tree can be represented either with
open-Probabilistic Safety Assessment (open-PSA) or
with a dedicated SysML profile.

Hecht et al. (2015) generated FMEA from
AltaRica code, which is itself created
from a series of interacting SysML state machines.
However, no detail is given about the mapping 
elements, except an example of text export from 
SysML models showing: i) a block and its ports 
coming from an IBD and ii) states, events and 
transitions deriving from a state machine diagram 
of the corresponding block. This SysML output 
parameter file is transformed into an AltaRica 
input file where the specific block becomes a node 
with its own state variable, events and transitions. 

No related work yet addresses the latest version
of AltaRica, for which a new theoretical framework
and new structural constructs have been
established. Generating AltaRica 3.0 code directly from
SysML models, while remaining consistent with existing
related work, is the next step to verify the
feasibility of the approach.


3 ALTARICA 3.0 LANGUAGE 


AltaRica 3.0 is currently under specification in the 
framework of the OpenAltaRica project (http:// 
openaltarica.fr). In this paper, we are only focused 
on the main concepts of the language. Advanced 
notions such as functions, operators and records 
are not considered at the moment. 

A block in AltaRica encapsulates a
finite state automaton whose state variables
have a type and an initial value. A transition is made of
an event, a guard and an action to be performed
when the transition is fired. To illustrate the
language, an example taken from the OpenAltaRica
training document is given in Fig. 1. The system
has an environment with a true input value and an
observer which is the output of the system. The
five components a, b, c, d and e are identical, each
one having an input and an output flow, and are
connected to each other. Each component can fail and be
repaired; its state machine is given in Fig. 2.


Figure 1. Case study from OpenAltaRica training
session (environment input: true; system output: observer).


Figure 2. Component finite state machine (events: failure,
repair; while working, output := input).


We can define a class in AltaRica code that rep- 
resents a generic component as in Fig. 3. The class 
Component has a boolean state variable “work- 
ing” and two flow variables “input” and “output”. 
The two events “failure” and “repair” will change 
the value of the component state from true to false 
and inversely. The assertion represents the internal 
transfer function of the component: if the compo- 
nent is working, the output is equal to the input. 
Otherwise, it is false. 

The case study system can be defined by using 
the instances of the class Component. Since the 
model of the whole system is unique, it is denoted 
by a structural construct block that represents a 
prototype (Fig. 4). An observer is added to model 
the system output. The assertion part describes 
the relations between components. Since there are 
redundant sub-systems, component e loses its input only if
there is no output from both c and d (e.input :=
c.output or d.output).
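
The flow propagation described above can be traced by hand with a short sketch. The following Python fragment is only an illustration of how the assertions of Figs. 3 and 4 combine; the function name `observer_out` and the dictionary of component states are assumptions made for this example and are not part of AltaRica or the OpenAltaRica platform.

```python
# Illustrative evaluation of the CaseStudy flow assertions (Figs. 3 and 4).
# The function name and the dict layout are assumptions made for this sketch.

def observer_out(working):
    """Return the observer value given a dict mapping component name -> working?"""

    def component(name, value_in):
        # Transfer function of class Component:
        # output := if working then input else false
        return value_in if working[name] else False

    a_out = component("a", True)             # a.input := true
    b_out = component("b", a_out)            # b.input := a.output
    c_out = component("c", b_out)            # c.input := b.output
    d_out = component("d", a_out)            # d.input := a.output
    e_out = component("e", c_out or d_out)   # e.input := c.output or d.output
    return e_out                             # observer out = e.output


all_ok = dict.fromkeys("abcde", True)
print(observer_out(all_ok))                               # nominal state
print(observer_out({**all_ok, "b": False}))               # branch b-c lost, d still feeds e
print(observer_out({**all_ok, "b": False, "d": False}))   # both branches lost
```

Because the assertions here are acyclic, a single pass in dependency order is enough; a looped model would need the fixed-point update that AltaRica 3.0 provides.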

For the events, we can add stochastic or deter- 
ministic delays as well as parameters used in the 
corresponding law to define different failure rates 
for the components as in the updated version in 
Fig. 5. Each event can thus be associated with a
delay, a memory policy and a weight in order to
obtain a stochastic timed model. To generate fault
trees from the AltaRica model, we need to declare 
the top event with the observer. Also, a common 
cause failure (CCF) on components b and c can 
be defined with its dedicated parameter and used 
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class Component 
Boolean working (init = true); 
Boolean input, output (reset = false); 
event failure, repair; 
transition 
failure: working -> working := false; 
repair: not working -> working := true; 
assertion 
output := if working then input else false; 
end 


Figure 3. Component code: version 1. 


block CaseStudy
Component a, b, c, d, e;
observer Boolean out = e.output;
assertion
a.input := true;
b.input := a.output;
c.input := b.output;
d.input := a.output;
e.input := c.output or d.output;
end


Figure 4. System code: version 1. 


class Component 

Boolean working (init = true); 

Boolean input, output (reset = false); 
parameter Real pLambda = 1.0e-5; 

parameter Real pMu = 1.0e-2; 

event failure (delay = exponential(pLambda));
event repair (delay = exponential(pMu)); 
transition 


end 


Figure 5. Component code: version 2. 


block CaseStudy 
Component a, b, c, d, e;
Component d (pLambda = 1.0e-6);
parameter Real pLambdaCCF = 1.0e-6;
event eventCCF (delay = exponential(pLambdaCCF));
observer Boolean topEvent = e.output==false; 
transition 
eventCCF: ?b.failure & ?c.failure; 
assertion 


end 


Figure 6. System code: version 2. 


in the transition to synchronize the components. 
The prefixes ? and ! mean that the event is optional or
mandatory, respectively. The updated code for the
system is given in Fig. 6. With this information,
we can compute the different minimal cut sets, the
probability of the top event for a given mission
time, etc. with the OpenAltaRica platform.

AltaRica 3.0 also supports more advanced concepts
such as inheritance between classes, acausal
connections for bidirectional flows and cold
redundancy. The mechanism to update flows uses a
fixed-point algorithm, which makes it possible to detect
modeling errors at run time if no fixed point is reached.


4 SYSML MODELS AND ALTARICA CODE 
GENERATION 


Table 1 shows our proposal of basic elements map- 
ping from SysML to AltaRica. This model-to- 
model transformation is based on previous work 
in Section 2 and takes into account new concepts 
of the current version of AltaRica. 

We have modeled the AltaRica 3.0 case study 
with SysML. Figures 7, 8 and 9 show respectively 
the Block Definition Diagram, Internal Block Dia- 
gram and State Machine Diagram of the example. 

To generate code for the Component class, the 
different states in the state machine will become an 
enumeration ComponentState {working, failed}, 
with a specific initial value. The two flow ports
“input” and “output” automatically become typed
flow variables in AltaRica. Here they are of
Boolean type, with reset set to false by default.


Table 1. SysML and AltaRica 3.0 mapping. 


SysML                                     AltaRica

generic block with several instances      class
specific block with unique occurrence     block
block inheritance                         extends
flow port                                 flow variable
states                                    a state variable with enumeration domain
initial state                             init
transition                                event
guard                                     guard
connector                                 assertion


Figure 7. Case study block definition diagram.
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Figure 8. Case study internal block diagram. 


Figure 9. Component state machine diagram
(ComponentStateMachine, with the failure and repair transitions).


domain ComponentState {working, failed}


class Component
ComponentState stateVar (init = working);
Boolean input, output (reset = false);
event failure, repair;
transition
failure: stateVar == working -> stateVar := failed;
repair: stateVar == failed -> stateVar := working;
assertion
output := if (stateVar == working) then input else false;
end


Figure 10. Generated component code. 


The events and the transitions in AltaRica will be 
created automatically according to the information 
in the state machine. Boolean expressions for the 
guard conditions are added if available. The trans- 
fer function of the component that calculates the 
value of the output flow variable from the value 
of the state variable and the input flow variable is 
generated automatically in the assertion part. The 
generated code for the Component class is given 
in Fig. 10. 
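
The generation step for a class can be sketched as a small text template. The Python fragment below is a minimal illustration under stated assumptions: the function name `generate_component_class`, its parameters and the dictionary layout of the transitions are hypothetical and do not describe the authors' tool or any SysML tool API. It only shows how the states, the initial state, the flow ports and the transitions of a state machine could be turned into text resembling Fig. 10.

```python
# Hypothetical generator: the data layout and the function name are
# assumptions for illustration, not the authors' implementation.

def generate_component_class(name, states, initial, flows, transitions, assertion):
    """Emit AltaRica 3.0-style text for one block and its state machine."""
    domain = name + "State"
    state_list = ", ".join(states)
    event_list = ", ".join(t["event"] for t in transitions)
    lines = [
        "domain " + domain + " {" + state_list + "}",
        "",
        "class " + name,
        # States become one state variable over an enumeration domain.
        domain + " stateVar (init = " + initial + ");",
        # Flow ports become Boolean flow variables, reset to false.
        "Boolean " + ", ".join(flows) + " (reset = false);",
        "event " + event_list + ";",
        "transition",
    ]
    for t in transitions:
        # Each SysML transition yields an event guarded by its source state.
        lines.append(
            f"{t['event']}: stateVar == {t['source']} -> stateVar := {t['target']};"
        )
    lines += ["assertion", assertion, "end"]
    return "\n".join(lines)


code = generate_component_class(
    name="Component",
    states=["working", "failed"],
    initial="working",
    flows=["input", "output"],
    transitions=[
        {"event": "failure", "source": "working", "target": "failed"},
        {"event": "repair", "source": "failed", "target": "working"},
    ],
    assertion="output := if (stateVar == working) then input else false;",
)
print(code)
```

In this sketch the transfer function is passed in as a ready-made string; deriving it automatically from the model is the part that depends on the mapping rules of the actual transformation.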

The information from IBD is used to gener- 
ate code for the whole system which is made of 5 
components. The interactions between the com- 
ponents are modeled through the connectors and 
the flow direction (in, out and inout). The input 
flow of a component depends on the output flow 
of another component if there is a directed link 
between them. Since AltaRica 3.0 supports acausal 
components, i.e. components for which inputs and 


block CaseStudy 
Component a, b, c, d, e;
observer Boolean out = e.output; 


assertion 

a.input := true; 

b.input := a.output; 

c.input := b.output; 

d.input := a.output; 

e.input := c.output or d.output; 
end 
Figure 11. Generated system code. 


outputs are decided at run time like the inout port 
of SysML, we can also create an acausal connec- 
tion in the assertion part. For an input port that 
receives information from different output ports 
(the case of the component e that is linked to two 
components c and d), we have to verify if they 
come from redundant sub-systems or not. We have 
proposed in another paper (Nguyen, Mhenni, & 
Choley 2016) the Redundancy Profile, a SysML 
extension which allows integrating redundancy- 
relevant properties in the system model in order to 
better represent system architecture. These data
can thus be used to generate the assertion e.input :=
c.output or d.output. When there is
no redundancy, the equation should be e.input :=
c.output and d.output. An observer has been created
to observe the final output of the whole system;
it takes the value of e.output. The generated
code for the case study is shown in Fig. 11.
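
The rule for turning connectors into assertions can be illustrated as follows. This is a sketch under assumptions: connectors are given as (source, target) pairs, and the set `redundant_targets` stands in for the information that the Redundancy Profile would provide; both the data layout and the function name `generate_assertions` are hypothetical.

```python
# Hypothetical sketch: connectors are (source, target) pairs read from the IBD,
# and redundant_targets lists the input ports fed by redundant sub-systems.

def generate_assertions(connectors, redundant_targets=frozenset()):
    """Return one AltaRica-style assertion line per input port."""
    sources_by_target = {}
    for source, target in connectors:
        sources_by_target.setdefault(target, []).append(source)

    lines = []
    for target, sources in sources_by_target.items():
        # Several sources on one input: "or" if they are redundant, "and" otherwise.
        operator = " or " if target in redundant_targets else " and "
        lines.append(f"{target} := {operator.join(sources)};")
    return lines


case_study = [
    ("true", "a.input"),
    ("a.output", "b.input"),
    ("b.output", "c.input"),
    ("a.output", "d.input"),
    ("c.output", "e.input"),
    ("d.output", "e.input"),
]
for line in generate_assertions(case_study, redundant_targets={"e.input"}):
    print(line)
```

Calling the same function without `redundant_targets` yields `e.input := c.output and d.output;`, which corresponds to the non-redundant case discussed above.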


5 DISCUSSION AND FUTURE WORK 


Currently, our transformation handles only the simple
concepts of the AltaRica 3.0 language for which
a direct mapping can be performed. The next step of our
work is the completion of the Safety Profile (Mhenni
et al. 2016), which contains the Redundancy Profile, in
order to integrate information concerning stochastic
models (delays, parameters, expectations) as well as
redundancy mechanisms. AltaRica code can then be
generated with more precise data. The model-to-model
transformation will be implemented in a
proof-of-concept tool to validate the approach.
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Failure behavior analysis of hot standby system based on BDD method 
ABSTRACT: Hot standby technology has become a popular method for improving the reliability of crucial
components and important systems. Considering the failure mechanisms of components, this paper
proposes a binary decision diagram-based method for modeling the failure behaviors of hot standby systems.
The failure probability of the component that fails first among the primary and hot standby components
is discussed. We find that, given that one component fails first, the transient and cumulative
probabilities that the primary or the hot standby component is the first to fail change over time.
Besides, after the first component fails, the reliability of the system decreases sharply and depends
entirely on the statuses of the remaining functional components. With the help of the method proposed
in this paper, more practical work on maintainability and design optimization can be done.


1 INTRODUCTION 


In some safety- or mission-critical applications, such
as nuclear power plants, aerospace and
telecommunications, redundancy is a
main method for ensuring the reliability or safety
of systems. Redundancy can be
divided into three types, hot, cold and warm standby,
according to the working conditions of the
redundant components (Levitin et al. 2015). In
hot standby systems, a hot standby component is
exposed to the same operational and environmental
stresses as the primary component. When
the primary one fails, a hot standby component
is switched into the system immediately and
starts to deliver its function so that the mission
can be completed successfully. With the application
of hot standby systems in engineering products,
the study of how hot standby systems fail
is becoming more and more critical.

When the lifetime distributions of all the sys- 
tem’s components follow exponential distribu- 
tions or the failure rates are known, Ren & Zhang 
(2009) modeled and analyzed the hot standby sys- 
tem’s structure by using Markov process method. 
Though the Markov process method can be 
applied for the reliability analysis easily, it still has 
a strict limitation on the lifetime distribution of the 
components (Xing et al. 2015). Ebrahimipour et al. 
(2010) obtained the reliability expression of k-out- 
of-n multi-state series-parallel systems by using the 
universal generating function. Levitin et al. (2015) 
analyzed the reliability of 1-out-of-n heterogeneous
standby systems and derived a corresponding
mathematical expression. Besides, Levitin et al. also
considered two different kinds of failure
propagation, selective and global propagation, in the
reliability modeling. But when the system’s scale is
large or its logical relations are complicated, that
kind of analysis method becomes extremely
difficult to apply.

The modeling or analysis methods for hot
standby systems are mainly based on the assumption
that the probability density functions (pdfs) or the
cumulative distribution functions (cdfs) of the
components are known (Levitin et al. 2015, Ardakan
& Hamadani 2015, Levitin et al. 2014). Most
papers pay attention to the final results of the
systems’ reliability, not to how the components’ failures
develop and affect the failure of the entire system. To
overcome the difficulty of collecting components’
lifetime data and to find the nature of failure,
physics of failure (PoF) was proposed (Hassan &
Aldemir 1990). Because the lifetimes of failure
mechanisms simulated by the PoF method are constant,
Probabilistic Physics of Failure (PPoF) was
developed to obtain the probability distributions
of those lifetimes (Zoran & Vlado 2008, Hall
& Strutt 2003). Based on the key theories of
PoF and PPoF, Chen et al. (2015) proposed five
correlations among failure mechanisms:
competition, trigger, acceleration, inhibition and
accumulation. Chen et al. also provided an effective
tool, the failure mechanism tree (FMT), for
analyzing the failure behaviors of components and
systems.

In Section 2, this paper introduces the failure
behaviors of a hot standby system and gives a
series of expressions for all behaviors. In Section 3,
a hot standby system in the supply module of an
electronic control system is analyzed as an
example to show the characteristics of the hot
standby system.


2 FAILURE BEHAVIOR OF A HOT 
STANDBY SYSTEM 


2.1 Equations for the reliability of a hot 
standby system 


Failure behavior describes how the failure
mechanisms of components develop under the
effect of the correlations among them and
eventually cause the failure of the entire system. This
paper focuses on analyzing the failure behaviors of a
hot standby system consisting of one primary
component and one hot standby component. The
switch that brings the hot standby
component into operation after the primary
component fails is assumed to be perfect. The
failure probabilities of components are assumed to be
independent of usage or switching intensity.

Because of the same operational conditions that 
both primary and hot standby components are 
exposed to, the failure mechanisms of the compo- 
nents develop and influence each other since the 
system starts to work. The failure sequences of 
those components are probabilistic and not deter- 
mined till some components fail, which also means 
each component has the possibility to fail earlier 
than the other. So, according to whether the fail- 
ures of components happen, the working phase of 
a hot standby system can be divided into two parts, 
phase I and II. 

In phase I, no component fails. The failure mecha- 
nisms start to develop since the system begins its 
working, but none of them reaches its failure thresh- 
old. So, the reliability of the entire system is influ- 
enced by the developments of failure mechanisms and 
in-time reliability of both components. Only if both 
of the components fail, the failure of a hot standby 
system will occur. The reliability of the system can be 
described as (1). 


$$
R(t) = 1 - F_P(t) \cdot F_S(t)
     = 1 - \int_0^t f_P(t_P)\,dt_P \cdot \int_0^t f_S(t_S)\,dt_S \tag{1}
$$

where $f_P(t)$ and $f_S(t)$ are the probability
density functions of the failures of the primary and hot
standby components, respectively.

In phase II, the primary or hot standby component
fails at time $t_f$, but the other one survives and is
switched into operation so that the system can
complete its function. The entire system fails when the
surviving component fails. So, the reliability of the system is
equal to that of the surviving component, as shown in (2).

$$
R(t) =
\begin{cases}
1 - F_P(t) = 1 - \int_0^t f_P(t_P)\,dt_P, & \text{a hot standby component fails at } t_f \\
1 - F_S(t) = 1 - \int_0^t f_S(t_S)\,dt_S, & \text{a primary component fails at } t_f
\end{cases} \tag{2}
$$


From the discussion above, we know that the
reliability level of the system differs depending on
whether the component failing at $t_f$ is the primary or the
hot standby component. So, studying the distribution
function of $t_f$ is important for understanding the
changes of the system’s reliability and adjusting the
maintenance strategy. $t_f$ equals the minimum of the
failure times of the primary and hot standby
components, as shown in (3).

$$
t_f =
\begin{cases}
t_S, & t_P > t_S \\
t_P, & t_P \le t_S
\end{cases} \tag{3}
$$

where $t_P$ and $t_S$ are the failure times of the primary
and hot standby component, respectively.

When the primary component fails first, which
means $t_P \le t_S$, $t_f$ equals the failure time of the primary
component. At time $t$, the transient probability
that the primary component fails while the hot
standby component survives can be obtained as
(4). Besides, the cumulative probability that the
primary component fails first is (5).

$$
P_P(t) = P(t_f = t_P = t,\ t_S \ge t)
       = f_P(t)\,dt \cdot \int_t^{\infty} f_S(t_S)\,dt_S \tag{4}
$$

$$
P(t_f = t_P \le t) = \int_0^t f_P(t_P) \int_{t_P}^{\infty} f_S(t_S)\,dt_S\,dt_P \tag{5}
$$


When the hot standby component fails first,
which means $t_P > t_S$, $t_f$ equals the failure time of the
hot standby component. So at time $t$, the transient
probability that the hot standby component
fails while the primary component survives can be derived as
(6). The cumulative probability that the hot
standby component fails first is (7).

$$
P_S(t) = P(t_f = t_S = t,\ t_P > t)
       = f_S(t)\,dt \cdot \int_t^{\infty} f_P(t_P)\,dt_P \tag{6}
$$

$$
P(t_f = t_S \le t) = \int_0^t f_S(t_S) \int_{t_S}^{\infty} f_P(t_P)\,dt_P\,dt_S \tag{7}
$$
For further discussion, if a component
fails first at time $t$, the probability that the
failed component is the primary component is
$P_{Pf}(t)$, and the probability that the failed
component is the hot standby component is $P_{Sf}(t)$,
which can be calculated as (8) and (9).

$$
P_{Pf}(t) = \frac{P_P(t)}{P_P(t) + P_S(t)}
= \frac{f_P(t)\,dt \cdot \int_t^{\infty} f_S(t_S)\,dt_S}
       {f_P(t)\,dt \cdot \int_t^{\infty} f_S(t_S)\,dt_S + f_S(t)\,dt \cdot \int_t^{\infty} f_P(t_P)\,dt_P} \tag{8}
$$

$$
P_{Sf}(t) = \frac{P_S(t)}{P_P(t) + P_S(t)}
= \frac{f_S(t)\,dt \cdot \int_t^{\infty} f_P(t_P)\,dt_P}
       {f_P(t)\,dt \cdot \int_t^{\infty} f_S(t_S)\,dt_S + f_S(t)\,dt \cdot \int_t^{\infty} f_P(t_P)\,dt_P} \tag{9}
$$


When one component fails, the reliability of the
system decreases and equals the reliability of the
surviving component. Whether the system’s reliability
after $t_f$ reaches the desired level or not will
influence the optimal choice of actions for preventing
and reducing the losses caused by the system’s
failure. So, the cumulative probability of $t_f$ can be
obtained as (10).

$$
P(t_f \le t) = \int_0^t f_P(t_P) \int_{t_P}^{\infty} f_S(t_S)\,dt_S\,dt_P
             + \int_0^t f_S(t_S) \int_{t_S}^{\infty} f_P(t_P)\,dt_P\,dt_S \tag{10}
$$
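
As a numerical check of the quantities in this subsection, the sketch below evaluates the cumulative first-failure probabilities by a midpoint rule. It assumes exponential lifetimes for both components, with purely hypothetical rates `lam_p` and `lam_s`; these values and the function name are illustrative and are not taken from the case study. For independent lifetimes, the sum of the two first-failure probabilities should equal $1 - (1 - F_P(t))(1 - F_S(t))$, which the code compares against.

```python
import math


def first_failure_probabilities(lam_p, lam_s, t, steps=20000):
    """Midpoint-rule estimate of P(primary fails first by t) and
    P(standby fails first by t) for exponential lifetimes (an assumption)."""
    h = t / steps
    primary_first = 0.0
    standby_first = 0.0
    for k in range(steps):
        tau = (k + 0.5) * h
        f_p = lam_p * math.exp(-lam_p * tau)   # pdf of the primary lifetime
        f_s = lam_s * math.exp(-lam_s * tau)   # pdf of the standby lifetime
        surv_p = math.exp(-lam_p * tau)        # integral of f_P from tau to infinity
        surv_s = math.exp(-lam_s * tau)        # integral of f_S from tau to infinity
        primary_first += f_p * surv_s * h
        standby_first += f_s * surv_p * h
    return primary_first, standby_first


lam_p, lam_s, t = 1.0e-3, 2.0e-3, 500.0        # hypothetical rates and mission time
p_first, s_first = first_failure_probabilities(lam_p, lam_s, t)

cdf_p = 1.0 - math.exp(-lam_p * t)
cdf_s = 1.0 - math.exp(-lam_s * t)
some_failure = 1.0 - (1.0 - cdf_p) * (1.0 - cdf_s)   # P(t_f <= t) in closed form
phase_one_reliability = 1.0 - cdf_p * cdf_s          # equation (1)

print(p_first, s_first)
print(p_first + s_first, some_failure)
print(phase_one_reliability)
```

With exponential lifetimes the share of first failures attributed to the primary component is constant in time, `lam_p / (lam_p + lam_s)`; other distributions, such as Weibull, make this share time dependent.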


2.2 Binary Decision Diagram 
(BDD) for simulation 


The binary decision diagram (BDD) is based on
the Shannon decomposition theorem (Xing L. et al.
2012). The BDD model has
been widely applied for simplifying logical analysis
of large static fault trees. A Boolean function
$f$ defined on a set of Boolean variables
$x_1, x_2, \ldots, x_n$ can be equivalently decomposed
by the Shannon expansion rule as (11).

$$
f = x_i \cdot f_{x_i=1} + \bar{x}_i \cdot f_{x_i=0}
  = \mathrm{ite}(x_i, f_{x_i=1}, f_{x_i=0}) \tag{11}
$$

where $f_{x_i=b}$ represents $f$ when $x_i = b$, and $b$ is a
Boolean constant 0 or 1; ite denotes the compact
if-then-else format. Equation (11) can also be
illustrated as in Fig. 1.

To build a BDD model of a fault tree, an efficient 
recursive generating algorithm through a depth first 
left traversal of the fault tree is presented as (12). 


Figure 1. Shannon decomposition and ite format of x
(0-edge: else edge; 1-edge: then edge).


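$$
G \diamond H = \mathrm{ite}(x, G_1, G_0) \diamond \mathrm{ite}(y, H_1, H_0)
=
\begin{cases}
\mathrm{ite}(x, G_1 \diamond H_1, G_0 \diamond H_0), & \mathrm{index}(x) = \mathrm{index}(y) \\
\mathrm{ite}(x, G_1 \diamond H, G_0 \diamond H), & \mathrm{index}(x) < \mathrm{index}(y) \\
\mathrm{ite}(y, G \diamond H_1, G \diamond H_0), & \mathrm{index}(x) > \mathrm{index}(y)
\end{cases} \tag{12}
$$

where $G$ and $H$ are two Boolean functions
representing the structure functions of the traversed
subtrees, and $G_i$ and $H_i$ ($i = 1, 0$) are the subfunctions
of $G$ and $H$. $\diamond$ indicates the logic operation AND or
OR, and index(x) and index(y) denote the argument
indices of the variables x and y.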
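
The recursion in (12) can be written compactly in code. The following Python sketch is an illustration only: it represents a BDD node as a tuple (index, then-branch, else-branch) with Python booleans as terminals, omits the node sharing and caching that practical BDD packages use, and all function names are assumptions made for this example.

```python
import itertools
import operator

# A BDD is either a bool terminal or a tuple (index, high, low),
# read as ite(x_index, high, low). Representation chosen for this sketch.


def ite(index, high, low):
    """Build an ite node, collapsing it when both branches are identical."""
    return high if high == low else (index, high, low)


def var(index):
    """BDD of the single variable x_index."""
    return (index, True, False)


def apply_op(op, g, h):
    """Combine BDDs g and h with the Boolean operator op, following rule (12)."""
    if isinstance(g, bool) and isinstance(h, bool):
        return op(g, h)
    if isinstance(g, bool):                      # only h has a top variable
        y, h1, h0 = h
        return ite(y, apply_op(op, g, h1), apply_op(op, g, h0))
    if isinstance(h, bool):                      # only g has a top variable
        x, g1, g0 = g
        return ite(x, apply_op(op, g1, h), apply_op(op, g0, h))
    x, g1, g0 = g
    y, h1, h0 = h
    if x == y:
        return ite(x, apply_op(op, g1, h1), apply_op(op, g0, h0))
    if x < y:
        return ite(x, apply_op(op, g1, h), apply_op(op, g0, h))
    return ite(y, apply_op(op, g, h1), apply_op(op, g, h0))


def evaluate(bdd, assignment):
    """Follow then/else edges for a dict index -> bool until a terminal."""
    while not isinstance(bdd, bool):
        index, high, low = bdd
        bdd = high if assignment[index] else low
    return bdd


# Example: each component fails if either of its two mechanisms occurs (OR);
# the hot standby system fails only if both components fail (AND).
primary = apply_op(operator.or_, var(1), var(2))
standby = apply_op(operator.or_, var(3), var(4))
system = apply_op(operator.and_, primary, standby)

for values in itertools.product([False, True], repeat=4):
    assignment = dict(zip([1, 2, 3, 4], values))
    print(assignment, evaluate(system, assignment))
```

The variable order 1 < 2 < 3 < 4 is an arbitrary choice for the example; the size of the resulting diagram generally depends on this order.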

This paper models the failure behavior of
the hot standby system with adjusted BDD models.
The basic and important issue is dealing with the
correlations among failure mechanisms. According
to the method for building an FMT model (Chen
et al. 2015), the FMT models of all components
in hot standby systems can be generated and
then transformed into BDD models with the
approaches listed in Table 1.

In hot standby systems, the entire system fails
only if both the primary and hot standby
components fail, which means that the logic relation
between the primary and hot standby components
is a logic AND. On the other hand, a
logic OR may exist among different components.
The fault tree (FT) model of the different
components can therefore be generated, and the methods
for building the FT and transforming it into
a BDD model are shown in Table 2 (Levitin et al.
2015).

In Table 2, $A_i$ (i = 1, 2, ..., n) represents the
components of a system.

To analyze the BDD model, a Monte Carlo
based method is used in this paper. First, a large
number of random lifetimes of failure mechanisms
are generated. Second, the logical
expressions for the failure times of the components
and of the entire system are derived;
the expressions are listed in Table 3. Finally,
a large number of random failure times are computed
and the corresponding reliability curves are
generated.
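
The three steps above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' simulation code: each component is given a hypothetical list of Weibull (scale, shape) mechanism parameters, the mechanisms within a component are treated as competing (minimum lifetime), and the primary and hot standby components are combined by logic AND (maximum lifetime).

```python
import random


def component_lifetime(rng, mechanisms):
    """Competition: the component fails when its first mechanism fails."""
    return min(rng.weibullvariate(scale, shape) for scale, shape in mechanisms)


def simulate(n_samples, primary, standby, seed=1):
    """Return a list of (first failure time, system failure time, primary first?)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        t_p = component_lifetime(rng, primary)
        t_s = component_lifetime(rng, standby)
        # Logic AND: the hot standby system fails when both components have failed.
        samples.append((min(t_p, t_s), max(t_p, t_s), t_p <= t_s))
    return samples


def reliability(samples, t):
    """Fraction of samples whose system failure time exceeds t."""
    return sum(1 for _, t_sys, _ in samples if t_sys > t) / len(samples)


# Hypothetical (scale, shape) parameters for two mechanisms per component.
primary_mechanisms = [(8000.0, 2.0), (12000.0, 1.5)]
standby_mechanisms = [(9000.0, 2.0), (15000.0, 1.5)]

samples = simulate(20000, primary_mechanisms, standby_mechanisms)
curve = [(t, reliability(samples, t)) for t in range(0, 20001, 2000)]
share_primary_first = sum(1 for s in samples if s[2]) / len(samples)
print(curve)
print(share_primary_first)
```

Other correlations such as trigger, acceleration or accumulation would replace the `min` in `component_lifetime` by the corresponding expression of Table 3.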

In Table 3, T, is the lifetime of the failure mech- 
anism M_, T, (i= 1,2,...,n) is the lifetime of the fail- 


ure mechanism M, (i = 1,2,...,2), T? (i = 1,2,...,2) 
Table 1. FMT and BDD models of mechanism
correlations.


Mechanism correlation    FMT model    Corresponding BDD model

(rows: competition, trigger, acceleration or inhibit, accumulation;
model diagrams not recoverable from the source)


is the lifetime of the failure mechanism M, 
(i = 1,2,...,n) after being accelerated or inhibited, 
T ¿(i= 1,2,...,n) is the lifetime of the component A, 
(i= 1,2,...,n), tois the time when event C happens. 


Table 2. FT and BDD models of logic AND and OR. 


FT model Corresponding BDD model 


Table 3. Logical expressions for the mechanism correlations, logic AND and logic OR.

Mechanism/logic correlation | Logical expression
Competition | $t = \min\{T_1, T_2, \ldots, T_n\}$
Trigger | [expression not legible in the source]
Acceleration or inhibition | $t = t_C + \min\{\cdots\}$ [arguments not legible in the source]
Accumulation | [expression not legible in the source]
Logic AND | $t = \max\{T_{A_1}, T_{A_2}, \ldots, T_{A_n}\}$
Logic OR | $t = \min\{T_{A_1}, T_{A_2}, \ldots, T_{A_n}\}$
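The competition, logic AND and logic OR expressions of Table 3 can be evaluated directly on arrays of sampled lifetimes. The following is a minimal sketch, not the authors' implementation: the function names, the array layout (one row per mechanism or component, one column per Monte Carlo sample) and the Weibull parameters are assumptions made for illustration. The trigger, acceleration or inhibition, and accumulation expressions are left out of the sketch.

```python
import numpy as np


def competition(mechanism_lifetimes):
    """Competition: the earliest failure mechanism determines the failure time."""
    return np.min(mechanism_lifetimes, axis=0)


def logic_and(component_lifetimes):
    """Logic AND: the output fails only when every input has failed."""
    return np.max(component_lifetimes, axis=0)


def logic_or(component_lifetimes):
    """Logic OR: the output fails as soon as any input fails."""
    return np.min(component_lifetimes, axis=0)


# Hypothetical data: 3 mechanisms (rows) and 5 Monte Carlo samples (columns).
rng = np.random.default_rng(0)
lifetimes = 1000.0 * rng.weibull(2.0, size=(3, 5))

print("competition:", competition(lifetimes))
print("logic AND  :", logic_and(lifetimes))
print("logic OR   :", logic_or(lifetimes))
```

Working column-wise keeps each Monte Carlo sample independent, so the same functions can be nested to follow the structure of a BDD.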


3 A CASE STUDY 


3.1 Description of the case 


This section takes a supply system in the electronic control system of an aircraft as an example for discussing the failure behavior of a hot standby system. The supply system contains two identical supply modules, each consisting of 6 components; a simplified module is shown in Fig. 2. The meanings of all components in Fig. 2 are given in Table 4.

The components in the primary and hot standby supply subsystems are of the same types but differ in the parameters of their lifetime distributions. According to the FMMEA result of the example supply system, the main failure mechanisms, the correlations among those mechanisms, and their effects on every component are shown in Table 5.

The distributions and parameters of those failure mechanisms' lifetimes in the different components, obtained from historical data or the PPoF method, are also shown in Table 5.


Figure 2. The supply module in the supply system. [Block diagram not reproduced; legible block labels: voltage conversion module (+15 V), filter buffer module.]


Table 4. Components in the supply module.

Component symbol | Definition and details
L | Inductance
R | Resistance
V | Thyristor
IC | DC-DC supply converter (PWM switching mode power supply)
VP | Optocoupler
X | Perfect socket


In Table 5, TF is thermal fatigue, EC is electrolytic corrosion, MM is metal migration, HCI is hot carrier injection, VF is vibration fatigue, EM is electrical migration, TDDB is time-dependent dielectric breakdown, and EB is electrical breakdown.


Table 5. Failure mechanisms and correlations in components, and distribution types and parameters of mechanisms.

Component | Failure mechanism | Mechanism symbol | Failure effect | Failure mechanism correlation | Correlation symbol | Effect correlation | Distribution type | Primary β (μ) | Primary η (σ) | Hot standby β (μ) | Hot standby η (σ)
L | TF | Lf | Open | / | / | / | Weibull | 2.1 | 6225 | 17 | 3172
R | EC | Rf | Resistance increase | Parameter union | MR1 | / | Lognormal | 8.38 | 0.7 | 971 | 13
 | MM | Rf | Resistance increase | | | | Weibull | 1.3 | 4715 | 24 | 7863
V | HCI | Vf | Short | / | / | Competition | Weibull | 1.73 | 6429 | 7.3 | 4217
 | VF | Vf | Open | Damage accumulation | MV1 | | Weibull | 64 | 8523 | 5.2 | 9726
 | TF | Vf | Open | | | | Weibull | 6.7 | 9318 | 4.9 | 8179
IC | EM | Cf | Open | / | / | Competition | Weibull | 3.1 | 5621 | 2.72 | 6923
 | TDDB | Cf | Short | / | / | | Lognormal | 11.37 | 1.5 | 12.79 | 1.4
 | VF | Cf | Open | Damage accumulation | MC1 | | Weibull | 5.7 | 8374 | 2.91 | 6741
 | TF | Cf | Open | | | | Weibull | 4.9 | 7956 | 3.57 | 5357
VP | EB | Pf | Short | / | / | / | Weibull | 5.6 | 4682 | 3.4 | 7423

Note. The numeric subscripts of the mechanism symbols are not legible in the source.


3.2 Failure behavior modeling


With the method of generating an FMT model (Chen Y. et al. 2015), the FMT model of the supply system can be generated as shown in Fig. 3.

After transforming the FMT into a BDD model, the failure behavior model is obtained as shown in Fig. 4.


Figure 3. The FMT model of the entire system. [Diagram not reproduced; legible top-level events: The Entire System Fail, Primary Subsystem Fail, Hot Standby Subsystem Fail.]


Figure 4. The BDD model of the failure behavior of the example supply system. [Diagram not reproduced.]


3.3 Simulation 


Using the Monte Carlo method, 2,000,000 random lifetimes of all mechanisms are generated. From the BDD model of the example system in Fig. 4, the logical expressions for the failure times of the primary and hot standby subsystems and of the entire system are given in Eqs. (13)-(15).

[Eq. (13), which gives the subsystem failure time as a nested min/max expression over the mechanism lifetimes of Fig. 4, is not legible in the source.]

$$T_{\text{first-failing}} = \min\{T_{\text{primary}},\ T_{\text{standby}}\} \qquad (14)$$

$$T_{\text{system-failing}} = \max\{T_{\text{primary}},\ T_{\text{standby}}\} \qquad (15)$$


Evaluating these logical expressions on the random lifetimes gives the failure times of the system. The reliability curves of the primary subsystem, the hot standby subsystem, and the entire system are shown in Fig. 5.
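The simulation step can be illustrated with a reduced example. This is a sketch under stated assumptions rather than the case study itself: each subsystem is lumped into a single Weibull lifetime with hypothetical parameters, the sample size is reduced, and `reliability` is an illustrative helper name. Only the logic AND of Eq. (15) is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 200_000  # the case study uses 2,000,000; fewer here to keep the sketch fast

# Hypothetical Weibull parameters (shape 2.0, scale in hours), one lumped lifetime per subsystem.
t_primary = 6000.0 * rng.weibull(2.0, n_samples)
t_standby = 4000.0 * rng.weibull(2.0, n_samples)

# Logic AND of the two subsystems: the system fails when the later one fails.
t_system = np.maximum(t_primary, t_standby)


def reliability(failure_times, t_grid):
    """Empirical reliability R(t) = P(T > t) evaluated on t_grid."""
    sorted_times = np.sort(failure_times)
    failed = np.searchsorted(sorted_times, t_grid, side="right")
    return 1.0 - failed / sorted_times.size


t_grid = np.linspace(0.0, 12000.0, 25)
r_primary = reliability(t_primary, t_grid)
r_standby = reliability(t_standby, t_grid)
r_system = reliability(t_system, t_grid)

for t, rp, rs, r in zip(t_grid[::6], r_primary[::6], r_standby[::6], r_system[::6]):
    print(f"t={t:7.0f} h  primary={rp:.3f}  standby={rs:.3f}  system={r:.3f}")
```

Because the system failure time is the sample-wise maximum, the empirical system reliability can never fall below either subsystem curve, which is the ordering reported for Fig. 5.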

In Fig. 5, the reliability of the primary subsystem is higher than that of the hot standby subsystem at the same time, and both are lower than the reliability of the entire supply system. This means that the hot standby technology used in this system works, and that the hot standby subsystem is likely to fail first.

Figure 5. The reliability curves of the primary and hot standby subsystems and the entire system.

Figure 6. The transient probability curves that the primary or the hot standby subsystem fails first.


Besides, given that one subsystem fails first, the transient probability curves that the primary or the hot standby subsystem is the first to fail are shown in Fig. 6, and the corresponding cumulative probability curves are shown in Fig. 7.

In Fig. 6, during the early period of the system's lifetime (0 < t < 1000 hours), the transient probability that the hot standby subsystem fails first is about 75% to 100%, and it decreases as time goes by. In other words, if a supply module fails first in this period, the failed module is most likely the hot standby module. During 1000 < t < 2500 hours, this probability is nearly stable at 75%, and it decreases after 2500 hours. During the last period of the system's lifetime (t > 4000 hours), the probabilities for the hot standby and primary subsystems are both about 50%.

Figure 7. The cumulative probability curves that the primary or the hot standby subsystem fails first.

Figure 8. The pdf curve of the first-failing subsystem.

In Fig. 7, we see that overall the hot standby supply subsystem is the more likely to fail first, with a probability of nearly 77%. Moreover, the pdf curve of the failure time of the first-failing supply module is shown in Fig. 8.

Figure 9. The reliability of the supply system after one subsystem fails at t = 1719 hours.

From Fig. 8, the probability that one of the subsystems fails is highest at about t = 2000 hours. Around that time, the reliability of the supply system deserves particular attention, and the working conditions of the two subsystems should be checked more frequently. Besides, the mean failure time of the first-failing subsystem is about 1719 hours. If a subsystem fails at t = 1719 hours, the subsequent reliability of the entire system depends strongly on which subsystem has failed, as shown in Fig. 9.

If the primary subsystem fails at t = 1719 hours, the system reliability equals that of the hot standby subsystem; if the hot standby subsystem fails at t = 1719 hours, the system reliability equals that of the primary subsystem. The decline is dramatic and the reliability requirement may no longer be met, so the hot standby subsystem should be replaced.
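The first-failure statistics discussed above (which subsystem fails first, and when) can be estimated from the same kind of samples. The sketch below is illustrative only: the two lumped Weibull lifetimes and their parameters are hypothetical, so the printed values will not match the 77% and 1719 hours of the case study, and the time bins are an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(7)
n_samples = 200_000

# Hypothetical lumped subsystem lifetimes (Weibull, shape 2.0, scale in hours).
t_primary = 6000.0 * rng.weibull(2.0, n_samples)
t_standby = 4000.0 * rng.weibull(2.0, n_samples)

# Time of the first subsystem failure, and which subsystem caused it.
t_first = np.minimum(t_primary, t_standby)
standby_first = t_standby < t_primary

# Cumulative view: overall probability that the hot standby subsystem fails first.
p_standby_first = standby_first.mean()
# Mean failure time of the first-failing subsystem.
mean_first_failure = t_first.mean()

# Transient view: among first failures falling in each time bin,
# the fraction caused by the hot standby subsystem.
edges = np.linspace(0.0, 8000.0, 9)
all_counts, _ = np.histogram(t_first, bins=edges)
standby_counts, _ = np.histogram(t_first[standby_first], bins=edges)
transient = np.divide(
    standby_counts,
    all_counts,
    out=np.full(all_counts.shape, np.nan),
    where=all_counts > 0,
)

print(f"P(hot standby fails first) = {p_standby_first:.3f}")
print(f"mean first-failure time    = {mean_first_failure:.0f} h")
print("transient fraction per bin =", np.round(transient, 3))
```

With equal shape parameters the transient fraction is roughly constant over time; a time-varying curve such as the one described for Fig. 6 requires subsystems whose hazard rates evolve differently.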


4 CONCLUSION AND FUTURE WORK 


This paper discussed the failure behavior of a hot standby system and proposed a BDD-based method for evaluating it. We derived functions describing the transient and cumulative probabilities that the primary or the hot standby component fails first, which may help optimize maintenance allocation and functional design. The sharp decrease in the reliability of a hot standby system after some components fail first was also analyzed.

Next, we will optimize this analysis method and try to apply it to the failure behavior analysis of more complex systems, such as 1-out-of-n hot standby systems.


Dependability analysis of a product line using its model


ABSTRACT: The objective of this study is to introduce safety studies as soon as the engineering information is available. Safety studies require the use of formalisms. SysML is becoming a good vehicle for System Engineering activities, and the Feature Model appears to be an excellent candidate for describing a product line. To perform a safety analysis, the required information is extracted from the Feature Model/SysML models of the product line. To reduce the number of studies, we provide support for a product line FMECA (Failure Modes, Effects and Criticality Analysis), called parametric FMECA, which allows an analysis to be conducted at the level of the product line and provides a rapid analysis synthesis for each product. In this study, we introduce a new process dedicated to product lines in the MéDISIS method. We design a System Engineering meta-model to help manage the information related to product line variability from the functional and organic points of view. We define the parametric FMECA, which carries all relevant information from the models. It allows the decisions and choices made during the safety analysis to be capitalized, especially the impact of product variability on dependability. The new MéDISIS process automatically generates a parametric FMECA of the product line from both models. Finally, the Dependability Engineer completes the process using the MéDISIS consolidation tool to generate the final FMECA. The synthesis method of a parametric FMECA from models is presented. In particular, we discuss how a variability, by its presence or absence, can influence a dependability analysis, and how the rules used to define variabilities are taken into account in the final FMECA synthesis.


1 INTRODUCTION 


In the current commercial and industrial context, industrial strategies are often based on the definition of product lines. The objective is to reduce costs and delays while better adapting the offer to market expectations and increasing the quality of products or systems. Such an industrial strategy naturally affects organizations, in particular the processes and methods of System Engineering (SE). Indeed, one of the qualities of SE is that it allows the validated, convergent solutions produced during product creation to be capitalized. SE thus offers natural support for product line approaches, which characterize a set of products having common elements and variabilities.

A product line defines a set of products that share common characteristics and a common architecture in terms of components and functions, together with associated variabilities that are necessary to satisfy a range of needs.

Capitalizing knowledge about the common parts is therefore one of the drivers of the rationalization of studies, whereas the artifacts stemming from the management of the variabilities and of their consistency act as a brake.


The work carried out during the definition of the MéDISIS method (Method of Integration of safety analyses in the process of System Engineering) and its platform (David et al. 2010, Cressent et al. 2013) established the basis for exploiting Model Based Systems Engineering (MBSE) to partially generate dependability analysis models. More recently, Kajdan & Idasiak (2015) rationalized the techniques for generating a pre-FMECA by defining a minimal set of MBSE information. This rationalization also made it possible to follow, in a single FMECA, the modifications carried out on a system model. The variant part of the SE models of a product line, however, escapes this type of generation process. Our objective is therefore to adapt the methodology to products designed as a product line, and thus to allow generic dependability studies to be built.

For the SE analysis and design steps, working with a product line means identifying the elements common to all products in the range in order to make significant gains through the capitalization and reuse of the artifacts produced by the SE processes: reduced cost, reduced development and testing time and, in our case, reduced dependability analysis effort. But every product in the product line has to adapt to a wide variety of needs, which are differentiated by specificities. The set of specificities characterizes the product line and constitutes the variabilities. The use case in Section 4 will support our explanations.

While the common part requires few changes to modeling techniques and methods, the same is not true for variability. Product line models provide a way to clearly identify the set of common elements, the set of variant elements, and their dependencies. In fact, the variation models also introduce a logic for expressing the configuration of each product, which uses only a subset of the variants.

In this article, we propose two approaches to dependability analysis based on product line models. We extend the pre-FMECA generation process of the MéDISIS method in order to provide, for each product line model, an FMECA that can be derived for each product. Section 2 presents the system engineering concepts used for the product line model. Section 3 presents the standard MéDISIS process and the proposed process. Section 4 presents a case study conducted to evaluate the two processes. Section 5 presents the conclusion.


2 THE PRODUCT LINE AND
ENGINEERING SYSTEM MODEL


The concepts of product line modeling are known in computer science as SPL (Software Product Line) (Lee et al. 2009), which introduced the Feature Model. The Feature Model (FM) modeling technique is based on logical relations between the model's features. It is used to define similarity and variation in models and to support consistent variability management.

Some authors have defined their own variability management in the field of aeronautics (Berrebi 2013), in particular at the level of the architectural organization of the model. The variant management they propose is an alternative to the variation point and has fewer logic operators than its FM equivalent. Moreover, Sierla et al. (2014) show the applicability of FM in areas other than software, including the HiP-HOPS approach (based on the dysfunctional model) for product line safety analysis (De Oliveira et al. 2014). The approach proposed by De Oliveira et al. (2014) uses the Hephaestus/Simulink tool (Steiner et al. 2013) to manage variability in a product line model.

(Le put 2016) presents a proposal for integrating the definition and conceptualization steps of the product line into SE. Here the product line is presented as a generic product that is reusable and scalable according to need. The definition of this method is based on the properties present in the FM.

Some SE tools and MéDISIS use a modeling paradigm such as SysML (Friedenthal 2014) to represent the model of a product or system. Among the works that use the SysML language to model a product line, we note in particular (Grönniger et al. 2008). This work presents an approach to modeling an automotive system using FM and an internal organic model defined by the SysML internal block diagram (IBD). The authors propose a translation of a product line model from the FM technique into the SysML language, which makes it possible to model the variants of automotive systems in internal block models in order to analyze the generic system and its variability. In (Höfig et al. 2014), the authors propose structuring data by instantiation with a meta-model to obtain a reusable FMECA. However, they do not define the constraints between the variant elements. Following this analysis, we are interested in using the FM and the SysML modeling language to represent a product line model in different technology areas. We must now define the nature of the information needed to build an FMECA.

Kajdan & Idasiak (2015) introduce a support for system model analysis that defines sets of artifacts (needs, functions, components, and requirements) and the dependency relationships between them. This hierarchical view, necessary for a functional or organic analysis, must be adapted to take into account the artifacts introduced by FM. Similarly, David et al. (2010) define the meta-model supporting the generation of the pre-FMECA; it will be adapted to build the parametric FMECA associated with a product line.

(Mazo 2014) shows the possibilities offered by the FM syntax. Indeed, FM can be likened to the superimposition of different product models in which the common artifacts and the variants are highlighted, as well as the relationships between the variants, which are located at the variation points according to (Le put 2016). These relationships are constraints on the configurations that a product can have.


2.1 Model concepts 


Figure 1 shows the different concepts needed in a system model to carry out a dependability analysis. These concepts are: the need, represented in the system need view; the function realized by the system, represented in the functional view; the component that makes up the system, represented in the component view; and the system requirement that can be applied to it (requirement view). They are identified and matched when a model is established through the following relationships: Stated by (Need StB Function), Satisfied by (Function SaB Requirement), Realized by (Component ReB Function), and Refined by (Concept(x) R Concept(x)). In addition to these relationships, we introduce or modify six relationships to define the product line SE model.

Figure 1. SE mathematical model for product line. [Diagram not reproduced; legend: StB: Stated by, SaB: Satisfied by, ReB: Realized by.]

Let $f_i, f_j \in F$ be functions used in the model:


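• Refined by (R or MA): $f_i \xrightarrow{\text{R}} f_j$ (also noted $f_i$ R $f_j$): means that $f_i$ is necessarily refined by $f_j$.

• Optional (OP): $f_i \xrightarrow{\text{OP}} f_j$ (also noted $f_i$ OP $f_j$): means that $f_i$ may be refined by $f_j$.

• Requires (RE): $f_i \Rightarrow f_j$ (also noted $f_i$ RE $f_j$): means that the use of $f_i$ implies the use of $f_j$. Note: $f_i$ does not refine $f_j$ and $f_j$ does not refine $f_i$.

• Excludes (EX): $f_i \Rightarrow \overline{f_j}$ (also noted $f_i$ EX $f_j$): means that the use of $f_i$ excludes the use of $f_j$. Note: $f_i \Rightarrow \overline{f_j}$ is equivalent to $f_j \Rightarrow \overline{f_i}$, or to $\overline{(f_i \wedge f_j)}$.

• Alternative (XOR): $f_i \xrightarrow{\text{XOR}} f_j$ (also noted $f_i$ XOR $f_j$): means that only one of $f_i$ and $f_j$ may be used. Note: $f_i \Rightarrow \overline{f_j}$ and $\overline{f_i} \Rightarrow f_j$.

• OR gate (OR): $f_i \xrightarrow{\text{OR}} f_j$ (also noted $f_i$ OR $f_j$): means that at least one of $f_i$ and $f_j$ is used.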
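The orthogonal relations above act as constraints on which sets of features form a valid product. As a hedged illustration, the sketch below checks a candidate configuration against RE, EX, XOR and OR constraints and enumerates the valid configurations by brute force. The function names, the feature names `f0` to `f3` and the constraint encoding are assumptions made for this example, and XOR is interpreted as "exactly one of the group".

```python
from itertools import product


def is_valid(selected, requires=(), excludes=(), xor_groups=(), or_groups=()):
    """Check a set of selected features against the orthogonal constraints."""
    for a, b in requires:  # a RE b: using a implies using b
        if a in selected and b not in selected:
            return False
    for a, b in excludes:  # a EX b: a and b are never used together
        if a in selected and b in selected:
            return False
    for group in xor_groups:  # XOR: exactly one feature of the group is used
        if sum(f in selected for f in group) != 1:
            return False
    for group in or_groups:  # OR: at least one feature of the group is used
        if not any(f in selected for f in group):
            return False
    return True


def enumerate_configurations(mandatory, optional, **constraints):
    """List every valid configuration: mandatory features plus a subset of optional ones."""
    configurations = []
    for choice in product([False, True], repeat=len(optional)):
        selected = set(mandatory) | {f for f, keep in zip(optional, choice) if keep}
        if is_valid(selected, **constraints):
            configurations.append(frozenset(selected))
    return configurations


# Hypothetical product line: f0 is mandatory, f1..f3 are optional.
configs = enumerate_configurations(
    mandatory=["f0"],
    optional=["f1", "f2", "f3"],
    requires=[("f1", "f2")],  # f1 RE f2
    excludes=[("f2", "f3")],  # f2 EX f3
)
for config in configs:
    print(sorted(config))
```

Brute-force enumeration grows exponentially with the number of optional features, so it only suits small models; it is used here because it makes the semantics of each relation explicit.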


We have thus identified the SE relationships needed between the concepts that make up the product line model in order to conduct a dependability analysis on that model. These new relations allow us to describe a new meta-model of the database, which helps to generate the FMECA of a product line (PL) and to formalize its semantics.

This meta-model (Figure 2) builds on that presented by (Cressent et al. 2013) and coordinates three categories of information useful for constructing a product line FMECA: the information elements derived from product SE, product line SE, and safety analysis activities. The failure mode part makes it possible to capitalize the decisions and choices of the dependability analysis associated with the studied elements.


2.2 Product line composition element 


Modeling system variants is a basic technique of model based systems engineering for product lines. We need to model variants in order to analyze design alternatives, evaluate variants via configurations, and model product families. The challenge is therefore to separate the variant part from the common part and to manage the dependencies.

The product line elements are decomposed into three parts as shown in Figure 3:


— Core part: contains the invariant elements of the system, i.e. all elements that are used in all system configurations. These elements do not depend on any variant element;

— Variant-Core part: contains the invariant elements of the system which are related to at least one variant element;

— Variants part: contains the variant elements of the system, which occur only in some configurations.


Figure 3. Flowchart describing the proposed process in MéDISIS methodology.


This architecture makes it possible to set up a safety analysis strategy on the product line model, with the prioritization of the data flows present in the architecture.


2.3 Product line safety analysis 


From the dependability standpoint, the system has three types of elements, which form the triplet of a line in the FMECA:


— $E_S$: studied element;
— $E_C$: failure cause element;
— $E_E$: failure effect element.


The variation in the product line model may change the analysis for each product configuration, because it changes the functional or organic structure of a product. To study the impact of variations on a product, it is necessary to classify the FMECA lines according to the type of triplet studied. Indeed, some of these lines can be analyzed early in the design process, whereas those linked to the variability can be analyzed only once the variability choices have been made.

The product line FMECA contains 8 types of lines. Each line type has a specific characteristic, given in Table 1. Lines of types T1 and T8 are each composed of elements from a single set. Such lines are analyzed with the same method, which saves time in the safety analysis using the standard MéDISIS methodology.

The variation in the product line appears in line types T2, T3, T4, T5, T6 and T7. Such a line includes in its triplet ($E_S$, $E_C$, $E_E$) both variant and core elements. T1 and T8 FMECA lines can be analyzed as soon as the core elements or the variant elements are designed, even before the product line is completely defined, whereas lines T2 to T7 can be analyzed only once the variant elements have been associated with the core elements.


Table 1. Product line FMECA composition lines.

$E_S$ | $E_C$ | $E_E$ | Line type | Set type
Core | Core or IC | Core | T1 | {Core}
Core | Core or IC | Variant | T2 | {Core} ∪ {Variant}
Core | Variant | Core | T3 | {Core} ∪ {Variant}
Core | Variant | Variant | T4 | {Core} ∪ {Variant}
Variant | Core | Core | T5 | {Core} ∪ {Variant}
Variant | Core | Variant | T6 | {Core} ∪ {Variant}
Variant | Variant or IC | Core | T7 | {Core} ∪ {Variant}
Variant | Variant or IC | Variant | T8 | {Variant}

Note. IC: Internal Cause (component-specific).
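The classification of Table 1 amounts to a lookup on the (studied, cause, effect) triplet. The sketch below is one possible encoding, assuming that an internal cause (IC) is treated as belonging to the same part as the studied element, as the "Core or IC" and "Variant or IC" entries suggest. The function names and the string labels are illustrative choices, not part of MéDISIS.

```python
LINE_TYPES = {
    ("Core", "Core", "Core"): "T1",
    ("Core", "Core", "Variant"): "T2",
    ("Core", "Variant", "Core"): "T3",
    ("Core", "Variant", "Variant"): "T4",
    ("Variant", "Core", "Core"): "T5",
    ("Variant", "Core", "Variant"): "T6",
    ("Variant", "Variant", "Core"): "T7",
    ("Variant", "Variant", "Variant"): "T8",
}


def classify_line(studied, cause, effect):
    """Return the Table 1 line type of a (studied, cause, effect) triplet.

    `studied` and `effect` are "Core" or "Variant"; `cause` may also be "IC".
    """
    if cause == "IC":
        # Assumption: an internal cause belongs to the studied element itself.
        cause = studied
    return LINE_TYPES[(studied, cause, effect)]


def can_be_analyzed_early(line_type):
    """T1 and T8 lines involve a single element set and need no variability choice."""
    return line_type in {"T1", "T8"}


for triplet in [("Core", "IC", "Core"), ("Core", "Variant", "Core"), ("Variant", "IC", "Variant")]:
    line_type = classify_line(*triplet)
    print(triplet, "->", line_type, "| early analysis:", can_be_analyzed_early(line_type))
```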


3 PRODUCT LINE DEPENDABILITY 
ANALYSIS PROCESS 


This section compares two dependability analysis processes based on a product line model. For that, we must deploy a sequence of processing steps that allows the information held by the system engineer and the dependability engineer to be contributed progressively; this is what is called a MéDISIS process. These processes are designed to uphold the following principles:


— As soon as information is available during an SE step, it must be usable in a step of the MéDISIS dependability process;

— As soon as information is required, the process clearly identifies it so that the expert can provide it.


3.1 Standard process 


The description of the product line model must be complete in FM modeling before the SE process begins. The FM concepts represent the variation in the model. Our SE process, shown in Figure 3, begins with FM modeling of the product line, which yields a variant model. Among the results obtained from this model is a list of the possible product configurations in the line. The configuration list is composed of two sets of elements: a set of common elements, represented by the ‘Mandatory’ relation in the FM model, and a set of variant elements, represented by the ‘Optional’ relation and the orthogonal relations such as ‘OR’, ‘XOR’ and ‘EX’. Selecting variant elements specifies a valid model among the models in the product line. After this step, these configurations are used for the system modeling of the products of the line in SysML, a modeling language specific to system engineering that models different views of the system. A model produced in SysML corresponds to one configuration (one product) of the FM model. The SysML model must respect the basic modeling rules of (Kajdan & Idasiak 2015) to ensure consistency between the artifacts used in the model.

The next step is to generate a preliminary FMECA (pre-FMECA) from each product model (the SysML “Functional, Structural and Behavioral View” model). The pre-FMECA is generated automatically by MéDISIS as a transformation of the product model.

Each pre-FMECA undergoes a set of operations directed by the dependability engineer in order to obtain the final FMECA: the pre-FMECA is finalized using the MéDISIS consolidation tool. The consolidation process consists of linking several lines of the pre-FMECA. Each merged line makes it possible to follow the propagation of a failure, from the internal cause of a studied function to the final system effect. This process contains several tasks:


— Annotation and filtering of the pre-FMECA;
— Generation of the failure propagation tables;

— Filling of the failure propagation tables;

— Consolidation of the pre-FMECA.


The first task is to filter the pre-FMECA. The System Engineer chooses which lines he considers relevant, and he may add comments to record the reasons for his choices or to provide additional information, so that they are taken into account in the other consolidation tasks (versioning tool).

The second task is to generate the failure propagation tables from the annotated and filtered pre-FMECA. There is one propagation table for each element of the pre-FMECA. The dependability engineer must choose from the database a failure mode that will be propagated to a studied element linked by a functional flow to an upstream element.

The final task of the process is carried out using the propagation tables and the annotated and filtered pre-FMECA. To produce the consolidated pre-FMECA, the process performs two main steps:

— search for the possible paths of failure propagation, i.e. each consolidated FMECA line is one path of the propagation graph;

— for each path (i.e. FMECA line), concatenate the pre-FMECA lines corresponding to each step of the path.
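The two consolidation steps can be sketched as a path search over a propagation graph followed by a concatenation of lines. This is an illustrative sketch, not the MéDISIS consolidation tool: the graph encoding (a dictionary from each element to its downstream elements), the element names and the content of the pre-FMECA lines are all hypothetical.

```python
def propagation_paths(graph, start, path=None):
    """Enumerate the simple paths from `start` to elements with no further downstream element."""
    path = (path or []) + [start]
    # Skip elements already on the path so that loops in the graph cannot recurse forever.
    successors = [node for node in graph.get(start, []) if node not in path]
    if not successors:
        return [path]
    paths = []
    for node in successors:
        paths.extend(propagation_paths(graph, node, path))
    return paths


def consolidate(pre_fmeca, path):
    """Concatenate the pre-FMECA lines that correspond to each step of one path."""
    return [pre_fmeca[element] for element in path]


# Hypothetical functional flows: F1 feeds F2 and F3, which both feed F4.
graph = {"F1": ["F2", "F3"], "F2": ["F4"], "F3": ["F4"], "F4": []}
# Hypothetical pre-FMECA: one line (here a short description) per studied element.
pre_fmeca = {
    "F1": "F1: internal cause -> loss of F1",
    "F2": "F2: loss of F1 -> degraded F2",
    "F3": "F3: loss of F1 -> degraded F3",
    "F4": "F4: degraded input -> system effect",
}

paths = propagation_paths(graph, "F1")
for p in paths:
    print(" | ".join(consolidate(pre_fmeca, p)))
```

Each printed row plays the role of one consolidated line: it follows a failure from the internal cause of the first element to the final effect on the last one.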


The MéDISIS database is set up as part of this process. It ensures the consistency of the dependability analyses thanks to the architecture of its meta-model, which links functional and dysfunctional information to one another. The database is involved in all stages of the process, from filtering through the propagation tables to the consolidation of the FMECA.

The database is used to complete the pre-FMECA of a new product in the product line when a previous product has already led to the generation of an FMECA. Indeed, the database is designed to be initialized during the first pre-FMECA generation, after completion by the expert. Subsequently, with each successive generation, the database makes it possible:


— to restore the expert's filtering for the elements of the functional analysis that have not been modified;

— to highlight the data that have been modified since the previous version.


This database follows the main principles of the ISO 26550 ‘Asset Base’. It thus offers consistent information throughout the evolution of the products in the line, and it is specifically designed to facilitate the work of the dependability expert and to guide the fault analysis.

At the end of the process, we obtain a specific FMECA for each product in the line.


3.2 Proposed process 


This process is also based on the FM model of the product line. The model represents the different elements of the whole set of products in the line. It thus makes it possible to identify the common and variant parts that constitute the product line.

However, this model does not represent the data flows between the elements, which are necessary to perform a functional analysis of the system that represents the product line. Although FM modeling manages the variation in the product line, this limitation requires completing the FM model with SE concepts (cf. Figure 1).

To ensure a sound basis for the study, we used the SysML language to complete the product line model. The product line model in SysML contains all the elements defined in the FM model and describes them from a functional and/or organic point of view. It represents the system model of the product line, containing all the variations identified in the FM model. This model is composed of two parts: a Core part, which represents the elements common to all products of the range and identified by the relation ‘MA’ in the FM model, and a variant part, which represents the elements that vary across the products of the range and are identified by the relation ‘OP’ and the orthogonal relations (‘Alternative’ and ‘OR’).

The next step is to generate a preliminary FMECA (pre-FMECA) from the product line model. The pre-FMECA is generated automatically by MéDISIS. It must be supplemented with the relations from the FM model in order to show the constraints identified in the modeling of the product line. The pre-FMECA obtained thus contains the whole variant part of the product line. This part is identified by the relation ‘OP’ in the “Hierarchical relationship” column of the pre-FMECA. Similarly, the common part is identified by the relation ‘MA’.

The concepts added from the FM model of the product line appear in the last two columns of the parametric pre-FMECA and express the association between the variants and the valid configuration scenarios. These concepts are the dependence relationships (hierarchical and orthogonal), which make it possible to build a pre-FMECA that corresponds to a model of the product line. The hierarchical relationship column holds the optionality ‘OP’ or the refinement ‘MA’ (mandatory). These two properties identify the invariant elements, marked ‘MA’, which belong to the Core part, and the variant elements, marked ‘OP’, which belong to the variant part. The orthogonal relations column holds the implication ‘RE’ and exclusion ‘EX’ relations between the studied (variable) elements. These two properties expose the dependencies between the elements. The relations ‘OR’ and ‘XOR’ are also considered orthogonal relations.

With this overall representation of the FMECA, we
can perform the dependability analysis on products
derived from the product line without using their
individual models. The existing MéDISIS FMECA
process could be applied to each of these models,
but an identical part of the analysis would then
have to be redone for every product, whereas we
seek to capitalize on FMECA-type analyses across
the product line. These relationships make it
possible to represent constraints between a set of
elements at the same variation point.

Parallel elements that have the same input elements
and output elements can be implicitly identified,
since the principle of redundancy expressed by the
parallel structure provides several resources to
achieve the same function. This principle is a
sufficient condition for variant modeling. The
‘XOR’ relation complements this principle so that
it becomes a necessary and sufficient condition in
the analysis and modeling of the product line.

The next step is to generate the FMECA from the
product line, following the same tasks as in the
first process. The filtering task consists of
selecting the FMECA lines used for the analysis
(all the elements that make up the lines of this
pre-FMECA are taken into account in the filtering
task). After filtering the product line pre-FMECA,
the next task generates the propagation tables
under the same conditions as in the standard
process. The association between failure modes and
the element under study, as well as their causes
and effects, may be specific to each configuration.
Once the product line FMECA is obtained, the final
step is to configure it using configuration lists
to obtain the FMECA of a specific product.


4 CASE STUDY 


The Winch-PL is a lifting system that has an 
Operational Safety Brake for Hoist Winch (FOST) 
(Chieb et al. 2017). This system was chosen to 
evaluate the two approaches based on a proposed 
product line model to support the automatic 
analysis of the dependability of several products 
in the line. The results of applying the proposed 
approach to Winch-PL products were used as 
proof of concept. 

The system consists of three parts. The transmission
subsystem includes elastic couplings 1 and 2, a
torque limiter that is an alternative to elastic
coupling 1, the chain reducer and the drum. The
control subsystem includes the electric motor and
two speed sensors (one for the drum and one for the
motor); this part contains no variation. The safety
subsystem comprises a safety automaton that triggers
braking when the speeds of the drum and the motor
diverge, together with the standard brake and the
safety brake.




Figure 4. The Winch-PL system in FM. 


From the product line model in Figure 4, we obtain
six valid configurations by applying the constraints
and relationships of the product line model. We
restrict our study to the three main configurations.
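
The way constraints prune the configuration space can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four variant elements and the two relations (an XOR between elastic coupling 1 and the torque limiter, and a Require from the safety automaton to sensor bis) are taken from the relations mentioned in this case study, but the element names, the function names and the reduction of the feature model to these two relations are hypothetical simplifications, not the authors' tooling.

```python
from itertools import product

# Hypothetical variant elements (names are illustrative assumptions).
OPTIONAL = ["elastic_coupling_1", "torque_limiter",
            "safety_automaton", "sensor_bis"]


def is_valid(cfg):
    """Check the two assumed orthogonal relations for one configuration."""
    # XOR: exactly one of the two alternative elements is present.
    xor_ok = cfg["elastic_coupling_1"] != cfg["torque_limiter"]
    # Require: the safety automaton implies the presence of sensor bis.
    requires_ok = (not cfg["safety_automaton"]) or cfg["sensor_bis"]
    return xor_ok and requires_ok


def valid_configurations():
    """Enumerate every on/off combination and keep the valid ones."""
    configs = []
    for choice in product([False, True], repeat=len(OPTIONAL)):
        cfg = dict(zip(OPTIONAL, choice))
        if is_valid(cfg):
            configs.append(cfg)
    return configs


configs = valid_configurations()
print(len(configs))
```

Under these assumed relations the XOR leaves two choices and the Require leaves three, so the enumeration yields six configurations, which is consistent with the count reported above.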

We began our case study by applying the standard
process to the system under study. We defined the
Winch-PL product line model in FM and then created
a system model for each of the three configurations
selected from the configuration list. For each
model, we generated a pre-FMECA in order to carry
out the consolidation tasks (cf. section 3.1). In
the end, we obtained an FMECA for each product in
the product line. Using the database, we generated
a product line FMECA that brings together all
product FMECAs.

The time saved when producing the first product
FMECA is on the order of 50% thanks to the MéDISIS
tools. When producing the subsequent FMECAs, the
versioning tool recovers the information and
elements present in lines of types 1 and 8. The
number of lines of types 2 to 7 depends on the
number of variabilities that differ between the two
products and on their functional or organic
connection with the core elements.

In the second part of our case study, we applied
the new process and followed each of its steps. At
the end of the process, a parametric FMECA of the
product line is obtained. Table 2 gives an overview
of the parametric FMECA of the Winch-PL system. It
contains the Id column, which holds the line
identifier, the Element column, which holds the
name of the studied element, and the cause, local
effect and failure mode columns. The parametric
FMECA also comprises the Note column for selecting
the valid lines, the HR and OR columns for the
relationships related to variability and, finally,
the Line type column for classifying the lines of
the product line.

To obtain the FMECA of a product from the
parametric FMECA of the product line, we select the
lines that are specific to that product and then
proceed to the failure propagation step.
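
The line selection step can be illustrated with a minimal Python sketch. The rows are loosely modelled on the parametric FMECA of the case study, keeping only the Id, Element and HR columns. The `variant` field, which names the variant element a line depends on, is a hypothetical addition made for this illustration, as are the function name and the rule itself: keep every ‘MA’ line and every ‘OP’ line whose variant element is selected.

```python
# Rows loosely modelled on the parametric FMECA; "variant" is an assumed
# helper field naming the variant element each optional line depends on.
PARAMETRIC_FMECA = [
    {"id": "01", "element": "brake", "hr": "MA", "variant": None},
    {"id": "02", "element": "e_coupling1", "hr": "OP", "variant": "e_coupling1"},
    {"id": "03", "element": "sensor_bis", "hr": "OP", "variant": "sensor_bis"},
    {"id": "04", "element": "spd_sensor", "hr": "MA", "variant": None},
    {"id": "05", "element": "ele motor", "hr": "OP", "variant": "e_coupling1"},
    {"id": "06", "element": "trq_limiter", "hr": "OP", "variant": "trq_limiter"},
]


def select_lines(fmeca, selected_variants):
    """Keep mandatory lines and optional lines whose variant is selected."""
    kept = []
    for line in fmeca:
        if line["hr"] == "MA" or line["variant"] in selected_variants:
            kept.append(line)
    return kept


# Hypothetical product that uses elastic coupling 1 and no optional sensor.
product_lines = select_lines(PARAMETRIC_FMECA, {"e_coupling1"})
print([line["id"] for line in product_lines])
```

The selected lines would then feed the propagation step; checking that the chosen variants respect the XOR and Require relations is deliberately left out of this sketch.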

Table 2. A product line FMECA overview.


Id  Element      Cause      Local effect  Failure mode  Note  HR  OR          Line type
01  brake        IC         drum          LF            ok    MA  /           T1
02  e_coupling1  IC         ch_reducer    LF            ok    OP  ec1 xor tl  T7
03  sensor_bis   IC         saf_automat   LF            ok    OP  as=>sb      T8
04  spd_sensor   IC         automat       LF            ok    MA  /           T1
05  ele motor    IC         e_coupling1   LF            ok    OP  ec1 xor tl  T2
06  trq_limiter  ele motor  ch_reducer    LF            ok    OP  ec1 xor tl  T6

Figure 5. Failure propagation graph for the Winch-PL system.


For example, consider the “Require” relation, which
is used to manage the variation of the optional
elements ‘safety automaton’ and ‘sensor bis’. The
impacted FMECA lines have several types (T6, T7,
and T8). These elements must be optional in order
to apply this relationship. The alternative choice
between two elements is represented in the
pre-FMECA of our case study by the relation “XOR”.
For example, the presence of the element ‘elastic
coupling 1’ excludes the presence of the element
‘torque limiter’ and vice versa. Indeed, for a given
product configuration, the analysis is done on a
single element among the set of alternative
elements. The alternative relationship is applied
to optional elements.

Figure 5 shows the failure propagation graph for
the Winch-PL system. For our study, we chose only
one failure mode, “loss of function”, caused by the
internal failure of the element, which is
represented by its internal cause. A color code was
established to distinguish each part. This graphic
representation is automatically generated from the
pre-FMECA or the consolidated FMECA. It is
synchronized with the standard tabular
representation and helps the analyst to make
decisions faster.


5 CONCLUSION


When products are defined in a product line, it is
difficult not to repeat the same FMECA-type study
for the common elements of these products. In order
to optimize the time needed to produce the FMECA
for each product, we first studied how product line
models influence the information needed for SE
modeling and then for the dependability analysis.

We have adapted the MéDISIS method to take into
account a product line model. We compared two
analysis processes automated by the MéDISIS tools.
The first uses the MéDISIS versioning tools but
relies on product models. The second starts the
analysis from a product line model, which makes it
possible to carry out the dependability analysis
earlier.

These results yield a significant reduction in
analysis time and allow parts of the analysis to be
reused for another product of the same product
line.

In this first experiment, we observe an
acceleration of the dependability studies compared
with the standard FMECA. Without further feedback,
it is difficult to quantify this gain because it
depends on the number and nature of the
variabilities. Moreover, producing an FMECA at the
product line stage allows dependability studies to
start earlier.


Preliminary safety assessment of circular variable nacelle inlet concepts
for aero engines in civil aviation


S. Kazula, D. Grasselt, M. Mischke & K. Höschler

ABSTRACT: A safe design process and its application are introduced to a concept study for circular
variable aero engine inlets. The paper highlights the tasks of inlets, the compromise in designing them and 
how using variable inlets could solve this compromise and allow for faster and more efficient commercial 
aircraft. However, high safety and reliability requirements bring up disadvantages. Tackling these disad- 
vantages, a systems engineering approach is complemented by a safety assessment process, according to 
Aerospace Recommended Practice ARP 4754A. Safety methods that are applicable during early phases 
of the product development process are presented and applied to develop feasible variable inlet concepts. 
Hence, safety requirements, potential failure events and resulting failure modes are systematically identified,
assessed and mitigated. The mitigation of a failure condition by the means of redundancy within the
adjustment control system is presented.


NOMENCLATURE 

AMC   Acceptable Means of Compliance
ARP   Aerospace Recommended Practice
CCA   Common Cause Analysis
CMA   Common Mode Analysis
CS    Certification Specification
DD    Dependence Diagram
EASA  European Aviation Safety Agency
FAST  Function Analysis System Technique
FHA   Functional Hazard Assessment
FMEA  Failure Modes & Effects Analysis
FTA   Fault Tree Analysis
MA    Markov Analysis
PRA   Particular Risk Analysis
PSSA  Preliminary System Safety Assessment
RTO   Rejected Take-off
SAE   Society of Automotive Engineers
SSA   System Safety Assessment
TRL   Technology Readiness Level
US    United States
VDI   Verein Deutscher Ingenieure (The Association of German Engineers)
ZSA   Zonal Safety Analysis


1 INTRODUCTION 


Improving efficiency, and thus reducing emissions,
as well as increasing travelling speed are major
goals in civil aviation (European Commission 2001,
European Commission 2011). These goals also depend
on the aero engine and its subsystems. One of these
subsystems is the inlet that supplies the aero
engine with air. The geometry of these inlets is
designed as a fixed compromise regarding
aerodynamic drag. Significant work on the
optimisation of the surface geometry of rigid
inlets has been carried out (Luidens et al. 1979,
Pierluissi et al. 2011, Albert & Bestle 2014,
Schnell & Corroyer 2015). However, a rigid inlet
can only achieve the best compromise between
optimal geometries for different flight conditions.

Using aero engine inlets with a variable lip and
duct geometry for different flight conditions is
perceived as a possibility to reduce aerodynamic
drag and therefore to have a positive effect on
aircraft efficiency and speed (Baier 2015).
Therefore, studies, e.g. Kondor & Moore (2004),
da Rocha-Schmidt et al. (2014) and Ozdemir et al.
(2015), have been conducted on concepts for
variable inlets for commercial aircraft.
Additionally, first patents for variable inlets,
e.g. US 4075833 and US 5000399, exist.

The only commercial aircraft to have used variable
inlet systems are the Concorde and the Tupolev
Tu-144. Their variable inlet systems consisted of
movable ramps and flaps for supersonic flight.
Compared to subsonic aircraft, both aircraft models
had huge deficits concerning range, efficiency and
noise, which is why they have been retired.

Within the scope of this study, in contrast to
inlets with ramps and flaps, variable inlets with
a closed contour that can adjust the lip and duct
geometry are investigated. From the beginning, the
development process is accompanied by a safety
and reliability process. The aim of this study is to
achieve technically feasible concepts for this kind
of variable inlet, corresponding to Technology
Readiness Level (TRL) 3.

Although such variable inlets have been
investigated in initial research studies, they have
not yet found their way into modern civil aviation.
This can be explained by the trade-off between the
positive and negative effects of the usage of
variable inlets on commercial aircraft. Besides
additional weight and production costs, negative
effects are the increased complexity and thus
potential safety and reliability issues. These
issues are related to the numerous requirements and
boundary conditions regarding inlets.

In contrast to earlier studies, it is reasonable 
to supplement the utilised systems engineering 
approach for product development with a safety 
assessment process to identify and fulfil the safety 
requirements in aviation right from the beginning. 
Thus, possible events as well as resulting failure
conditions can be identified and possible safety
issues remedied during early phases of the product
development process. Such an approach has been
carried out successfully within a project
concerning coupled actuation systems for thrust
reversers and variable nozzles of aero engines
(Grasselt & Höschler 2015, Grasselt et al. 2017).
As a result,
variable aero engine inlets could be utilised in civil 
aviation, whereby these aircraft could be more effi- 
cient and faster, while maintaining high safety and 
reliability. 

Therefore, this paper deals with the safety
assessment process within concept development and
preliminary design of variable inlets in civil
aviation. First, the inlet system and its necessary
functions are described. Afterwards, the utilised
safety assessment process referring to ARP 4761 is
introduced and applicable safety analysis methods,
e.g. Functional Hazard Assessment (FHA) and Fault
Tree Analysis (FTA), are presented. Some selected
results of the applied methods are shown and
discussed. These results contain the identification
of safety-relevant requirements, failure conditions
and events. Finally, the influence of this
assessment on the design of the investigated
concepts is shown.


2 AERO ENGINE INLETS 


2.1 Tasks and implementation


The main purpose of nacelle inlets for aero engines
is to divide the free stream in front of the aero
engine, depending on the capture stream tube as a
function of operating conditions, into an internal
and an external airflow. The external airflow shall
flow over the nacelle surface while avoiding flow
separation and other sources of drag (Mattingly
2006). The objective of the internal airflow is to
supply the aero engine during each operating
condition with the correct quantity of air at a
desired flow velocity (Rolls-Royce Plc 2015). The
axial flow velocity that is required by the
compressor system of the aero engine is around
Mach 0.5 (MacIsaac & Langton 2011). Hence, a
deceleration of the internal airflow is required at
flight speeds above Mach 0.5 to ensure a highly
efficient and safe operation of the compressor
system. Thus, the inner contour of the inlet duct
has to be designed as a diffuser (Farokhi 2014).
The efficiency and operational stability of the
compressor system also depend on the uniformity of
the airflow. Therefore, flow separations should be
avoided under all conditions, as they can lead to
vibration excitations, rotating stall and engine
surge, accompanied by a loss of thrust and reduced
aero engine durability.
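
To give a feeling for the amount of diffusion involved, the following Python sketch evaluates the standard isentropic area-Mach relation for one-dimensional flow of an ideal gas. It is only a rough, idealised illustration and not part of the study: it assumes loss-free quasi-one-dimensional flow, a ratio of specific heats of 1.4, and it compares the stream tube cross-section at an assumed free-stream Mach number of 0.8 with the cross-section needed at Mach 0.5. The function names are hypothetical.

```python
GAMMA = 1.4  # ratio of specific heats for air (ideal gas assumption)


def area_ratio_to_sonic(mach, gamma=GAMMA):
    """Isentropic area ratio A/A* for one-dimensional flow at a Mach number."""
    term = (2.0 / (gamma + 1.0)) * (1.0 + 0.5 * (gamma - 1.0) * mach ** 2)
    exponent = (gamma + 1.0) / (2.0 * (gamma - 1.0))
    return term ** exponent / mach


def diffuser_area_ratio(mach_in, mach_out):
    """Area growth needed to decelerate isentropically from mach_in to mach_out."""
    return area_ratio_to_sonic(mach_out) / area_ratio_to_sonic(mach_in)


# Assumed example: deceleration from Mach 0.8 to Mach 0.5.
ratio = diffuser_area_ratio(0.8, 0.5)
print(round(ratio, 2))
```

Under these idealised assumptions the flow cross-section has to grow by roughly 29% between the two stations. In a real inlet, part of this deceleration takes place in the stream tube ahead of the lip, so the figure should not be read as a duct design value.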

Moreover, the fan and compressor induced 
noise emissions have to be reduced by the inlet, 
which is achieved by integrating acoustic treatment 
into the diffuser wall, see Figure 1. 

Furthermore, probes for pressure and temperature
measurement at the fan level can be part of the
inlet. These measured data must then be provided to
a flight control system.

Additionally, the inlet has to protect itself and 
the compressor system from icing and its conse- 
quences, e.g. impact damage and flow separation. 


Figure 1. Typical design of a rigid subsonic nacelle inlet.
Labelled elements: engine axis, intake lip, bleed air anti-icing
system, bleed air, outer planking, diffuser wall with acoustic
liners, structural components, level of unperturbed air, engine
intake area, engine throat area, fan level.

An anti-icing system is installed in the inlet to
ensure this. Most commonly, electrical or bleed air
anti-ice systems are used (Rolls-Royce Plc 2015).
Bleed air anti-ice systems transfer hot air from the
compressor to the inlet lip to prevent icing.
Aluminium is typically used for the inlet lip due to
its good heat conductivity. Furthermore, aluminium
is light and resilient to foreign object damage,
sand erosion, hail and bird strikes. It must be
proven during certification that thrust can be
maintained at a certain level, so that the flight
can be safely continued after a single bird strike
(Hedayati & Sadighi 2016).

The outer planking as well as the inner bound- 
ary with the acoustic linings is made of compos- 
ites, which minimises weight. Thereby, the inlet 
should withstand stipulated loads and be robust 
against damage (EASA 2016). 

This way, incidents like the Air France Flight 
AF-66, where the fan and the engine inlet were 
separated from the aircraft during flight, should 
occur at a very low rate to minimise the risk for 
passengers and crew (Aviation Herald 2017). 


2.2 Trade-off in design and variable inlets 


Ensuring reliable operation during all flight phases 
must be unified with aerodynamic requirements 
during the geometric design process of nacelle 
inlets (Seddon & Goldsmith 1999). On the one 
hand, the inlet should be highly efficient at high 
flight velocities above Mach 0.8 during cruise 
operation. On the other hand, it is necessary to 
avoid flow separations and hazardous events dur- 
ing take-off and climb operation up to Mach 0.3. 
Figure 2 presents optimal aerodynamic contours
of an inlet for different flight conditions.
Optimal efficiency at high flight velocities can be 
achieved by a thin or sharp lip contour combined 
with a small entry area A, to minimise wave and 
spillage drag (Farokhi 2014). As the entry area is 
reduced, a longer diffuser is required at high veloci- 
ties to avoid flow separations. Sharp inlet lips can 
be used for flight Mach numbers up to 1.6 without 
significant losses (Farokhi 2014). However, at low 
aircraft velocities, where high angles of incidence 
and crosswind can occur, a sharp or thin lip
contour is prone to flow separations and their
potential negative consequences. For these operating
conditions, a round and thick inlet lip with a large inlet
area is optimal. Such a ‘blunt’ lip geometry causes 
higher drag and thus less efficiency during opera- 
tion at higher flight Mach numbers (Braunling 
2015). Hence, conventional rigid subsonic inlets can 
only accomplish a compromise geometry that pro- 
duces increased losses during high flight velocities. 
Using a variable inlet, which applies the optimal
contour for each flight condition, can improve
efficiency and maximum aircraft speed during cruise
flight, while ensuring reliable operation during
take-off and climb. However, it needs to be
considered that the variation of the inlet presents
an additional function, which can entail further
reliability and safety issues (SAE Aerospace 2010).

Figure 2. Tendencies of optimal inlet contours for
different flight phases and velocities. Panels:
Ma 0.3 take-off (blunt lip, large inlet area, short
diffuser); Ma 0.9 continental cruise (thin lip,
medium inlet area, medium diffuser length); Ma 1.6
transcontinental cruise (sharp lip, small inlet
area, long diffuser).

Therefore, variable nacelle inlets for Mach num- 
bers up to 1.6 are investigated focussing on safety 
and reliability in the context of an internal research 
project at the chair of Aero Engine Design at the 
Brandenburg University of Technology. Within 
the scope of that project, a methodical safe design 
approach is developed and utilised to perform a 
feasibility study for concepts for variable inlets 
up to TRL 3. The safe design approach for vari- 
able inlet concepts has been presented in Kazula 
& Höschler (2017). These concepts can be divided 
into three geometry adjusting mechanism groups: 
movement of rigid segments, deformation of elas- 
tic surface material and boundary layer control. 
The preliminary safety assessment in chapter 4 
focusses primarily on the first group. 


3 SAFE DESIGN APPROACH


3.1 Systems engineering 


When designing complex systems, it is favourable 
to utilise a systems engineering approach, e.g. 
Design for Six Sigma and VDI Guideline 2221. 
Methodical design approaches allow for improved 
requirements, interface and risk management, com- 
plexity reduction as well as more efficient solving 
of the design task. Furthermore, weaknesses dur- 
ing development can be minimised. Most methodi- 
cal design approaches are based on a common 
iterative structure: 


— starting with analyses concerning requirements 
and functions of the desired product, 

— continuing with the allocation of solution prin- 
ciples to functions, preliminary design and 
preselection of potential solution architectures 

— and concluding with detailed design, as well as 
validation and evaluation of the design. 


As the desired product functions in modern 
industries become increasingly complex, particular 
safety efforts, such as safety assessments and tests, 
should be considered to ensure safety and reliabil- 
ity (Bertsche & Lechner 2004). 


3.2 Safety and reliability engineering 


The safety and the reliability of a product play 
important roles in various industries for reasons of 
efficiency and business sustainability up to social 
acceptance. These industries, all of them using a 
separate safety approach, include the power, rail, 
shipping, automotive and aviation industry (Verein 
Deutscher Ingenieure 2000). The most effective 
time to improve product safety and reliability is 
during early development phases (Bertsche & Lech- 
ner 2004). On the one hand, this can be achieved 
by using a mature design approach according to 
design guidelines, which contains a precise and 
complete requirements document and early testing. 
On the other hand, analytical methods can be uti- 
lised to forecast reliability and to find weaknesses in 
design. Analytical methods are divided into
qualitative methods and quantitative methods.
Whereas quantitative methods, e.g. Markov Analysis,
are used to predict the probability of faults,
qualitative methods like the FHA are utilised to
identify and assess potential failure events and
resulting conditions. The safety and reliability of
a product can be positively influenced by utilising
appropriate safety and reliability methods during
each step of the development process. For this
purpose, multiple authors, e.g. Bertsche & Lechner
(2004) and Meyna & Pauli (2010), as well as
organisations, e.g. the International Organization
for Standardization (2011) in ISO 26262 for the
automotive industry, propose safety processes for
different areas of application.


3.3 Safety process in aviation 


In aviation, failures could lead to fatal accidents. The 
risk for accidents can be reduced by improvements 
in the areas of airplane design, flight operations, 
maintenance, air traffic management, regulations 
and design methodologies (Hasson & Crotty 1997). 
Since the Chicago Convention in 1944, local avia- 
tion authorities have been publishing regulations 
to ensure safe operation. The European Aviation
Safety Agency (EASA), for instance, releases
Certification Specifications (CS), e.g. CS-25
(Large Aeroplanes). Paragraph CS-25 AMC 25.1309
describes the safety assessment process in aviation
that is based on the process in ARP 4754A (SAE
Aerospace 2010) and the methods in ARP 4761
(SAE Aerospace 1996). The design approach for
variable inlets of Kazula & Höschler (2017) utilises
safety methods of ARP 4761, see Figure 3.


3.4 Safety methods for variable inlet development 


The first of the methods presented in Figure 3 is 
the preparation of a requirements document that is 
as complete and accurate as possible. Therefore, all 
requirements that could be introduced by the dif- 
ferent stakeholders should be identified and quan- 
tified (Sadraey 2013). A product with high safety 
and reliability as well as low development costs can 
be achieved during early stages of development by 
focussing on the Type Certificate Program and its 
entailed requirements. These requirements are set 
by the aviation authorities and have been consid- 
ered within this study. However, the creation of a 
requirements document is an iterative process as 
some safety requirements are only identifiable dur- 
ing the safety assessment. 

To comply with a requirement, a product must 
fulfil a function. A function is defined as the con- 
version of input material, energy or data into 
desired output (Roth 2001). Within the scope of 
the development process it is usual to perform a 
functional structure analysis. This way, the neces- 
sary main and secondary functions are identified 
and broken down up to elementary functions like 
‘convert’ or ‘increase’ (Koller & Kastrup 1998). 
A detailed functional structure can contribute to
high product safety by preventing design flaws.
However, an overly detailed breakdown should be
avoided, as this limits the solution variety during
development (Verein Deutscher Ingenieure 1993).
Hence, it is reasonable to start with a simple
functional structure and iteratively increase the
level of detail within the development process
(Roth 2000). The most common means to create a
functional structure are the FAST diagram (Function
Analysis System Technique), the function net and
the function tree. While all of these methods are
quite similar, function trees are the recommended
choice because their structure synergises well with
the Functional Hazard Assessment (FHA) (Verein
Deutscher Ingenieure 2000).

Figure 3. Suitable safety and reliability methods for the separate phases of the methodical design approach.

The FHA is a qualitative method that should be 
performed at the beginning of the safety process. 
At this point, it is reasonable to carry out qualita- 
tive safety methods, as they support the systematic 
investigation of failure conditions, causes and con- 
sequences. During later phases of the development 
process these qualitative methods can be replaced 
by quantitative methods to investigate reliability in 
more detail (SAE Aerospace 2010). The main
objective of the FHA is to systematically assess the
functions of a system and to identify and classify
failure conditions as well as their effects. It is
performed on aircraft, system and, if required,
subsystem level. Similar to the function tree, a
disadvantage of this method is that inexperienced
users may find it difficult to apply appropriately,
as it can easily produce enormous tables and the
classification of hazards can be rather subjective.
Kritzinger (2016) describes the advantages of
an FHA, one of them being the provision of top 
level events for the following Preliminary System 
Safety Assessment (PSSA). 

The PSSA is used to analyse which single or
multiple system, subsystem or component failures
lead to the functional hazards that have been
identified within the FHA. This way, safety-related
design requirements can be determined and concept
designs can be evaluated. A PSSA is performed by
utilising the Fault Tree Analysis (FTA), the
Dependence Diagram (DD) or the Markov Analysis (MA),
supplemented by a Common Cause Analysis (CCA)
(SAE Aerospace 1996). While the most commonly used
FTA, which is an iterative top-down method,
presents the relationship between failures through
logic gates, the DD uses paths and the MA uses
time-dependent probability functions. On the one
hand, FTA, DD and MA have many advantages, which
are discussed for instance in Kritzinger (2016) and
SAE Aerospace (1996). On the other hand, all of
these methods lack a systematic procedure that
assures completeness (Verein Deutscher Ingenieure
2000).
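
How logic gates combine failure probabilities in a quantitative FTA can be sketched as follows. The gate formulas assume statistically independent basic events, which is exactly the assumption a Common Cause Analysis is meant to challenge. The event names and per-flight-hour probabilities are purely hypothetical values chosen for illustration, not results of this study.

```python
from math import prod


def or_gate(probabilities):
    """Output event occurs if at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def and_gate(probabilities):
    """Output event occurs only if all independent input events occur."""
    return prod(probabilities)


# Hypothetical basic-event probabilities per flight hour (assumptions).
ACTUATOR_FAILS = 1e-4
CONTROL_CHANNEL_FAILS = 1e-3

# Single control channel: either failure leads to the top event.
single = or_gate([ACTUATOR_FAILS, CONTROL_CHANNEL_FAILS])

# Two redundant control channels: both must fail to lose control.
both_channels = and_gate([CONTROL_CHANNEL_FAILS, CONTROL_CHANNEL_FAILS])
redundant = or_gate([ACTUATOR_FAILS, both_channels])

print(single, redundant)
```

With these assumed numbers, duplicating the control channel shifts the dominant contribution to the actuator, which shows why a fault tree is useful for deciding where redundancy pays off.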

Therefore, within the System Safety Assess- 
ment (SSA) it is reasonable to combine the top 
down method FTA with the bottom up method 
Failure Modes and Effects Analysis (FMEA), that 
is simple to use but iterative and time consuming 
(Kritzinger 2016). The SSA evaluates the compli- 
ance of the investigated system with the safety 
requirements from the PSSA by applying analyses 
and test methods. In addition to the FMEA, the 
SSA includes quantitative FTAs and a CCA. A CCA
comprises a Zonal Safety Analysis (ZSA), a
Particular Risk Analysis (PRA) and a Common Mode
Analysis (CMA). A PRA is utilised to identify
external events and a ZSA to determine individual
failure modes that can cause hazards. A CMA is used
to verify independence of functions, as this is not
done within the FHA (SAE Aerospace 1996). Finally,
tests must be performed to validate the results
from earlier analyses and to comply with the
requirements set by the aviation authorities. It is
reasonable to assume that the less experience the
product developer has, the more tests should be
performed for a successful certification.


4 SELECTED RESULTS 


4.1 Function trees 


The aircraft has to fulfil different first-level
(top-level) functions according to SAE Aerospace
(2010) and Kritzinger (2016), one of them being
‘control thrust’, see Figure 4. Thrust control is
achieved by generating, adjusting, ensuring and
determining thrust. The inlet system influences all
of these functions, e.g. to generate thrust, the
inlet system has to provide the aero engine with an
airflow.
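
A function tree of this kind maps naturally onto a nested data structure. The Python sketch below encodes the ‘control thrust’ branch described above and lists the lowest-level functions, which could serve as input rows for an FHA. The nesting of ‘provide the aero engine with an airflow’ under ‘generate thrust’ follows the example in the text; the dictionary layout and the function name are assumptions made for illustration.

```python
# Nested dictionary: each key is a function, each value holds its sub-functions.
FUNCTION_TREE = {
    "control thrust": {
        "generate thrust": {
            "provide the aero engine with an airflow": {},
        },
        "adjust thrust": {},
        "ensure thrust": {},
        "determine thrust": {},
    }
}


def leaf_functions(tree, path=()):
    """Yield the full path to every function that has no sub-functions."""
    for name, children in tree.items():
        current = path + (name,)
        if children:
            yield from leaf_functions(children, current)
        else:
            yield current


leaves = list(leaf_functions(FUNCTION_TREE))
for leaf in leaves:
    print(" > ".join(leaf))
```

Keeping the full path for each leaf preserves the link to the top-level function, so an effect found at inlet level can be traced back to the aircraft function it degrades.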




The function tree that is presented in Figure 5
contains a few important functions of the nacelle
inlet system. Different flight phases require
different geometries to achieve minimal drag and
an internal airflow with high uniformity and the
appropriate velocity. An inlet geometry adjustment
system can make these functional requirements
easier to achieve. Additionally, an adjustment
system could be used to prevent negative effects of
icing by detaching ice. Furthermore, it could
utilise existing systems for data and energy
transfer. The influence of the variability on other
inlet subsystems has to be investigated, as the
inlet would no longer be a rigid enclosed structure
and the installation space for acoustic treatment
could be reduced. The essential functions of an
inlet adjustment system are changing and locking
the inlet geometry as well as gathering,
transferring and processing geometry data.


4.2 FHA 


The function trees can be used to determine the
input for the FHA. For clarity, only the function
‘adjust inlet geometry’ and its impact on the
aircraft are presented in the following. After
identifying the functions, associated failure
conditions and their effects must be determined
regarding single and multiple failures during
normal conditions, e.g. take-off, climb and cruise,
as well as special conditions, e.g. windmilling.
The main failure condition of the adjustment system
is that an undesired geometry is adjusted, which
could affect a single engine or all engines.

A further categorisation is possible concerning: 


— which geometry is adjusted,

— whether the system is unable to maintain the
desired geometry, or unable to adjust the geometry
in time or at all,

— whether the failure is announced or unannounced,
and whether it occurs abruptly or can be anticipated.


Then, the effects of these failure conditions 
must be identified and classified. Possible effects 
of the presented failure conditions are ‘loss of 
thrust’ or ‘reduced efficiency’. The classification 
is conducted according to CS-25 AMC 25.1309 
regarding the severity of an effect on aircraft, crew 
or occupants and is divided into ‘catastrophic’, 
‘hazardous’, ‘major’, ‘minor’ and ‘no safety effect’. 
Each of these classifications entails a specific
probability requirement concerning the occurrence of
a failure mode. Table 1 presents the failure
classification for an undesired, abrupt but announced
adjustment of a sharp, thin cruise geometry on a
single engine.


Table 1. Failure classification for the event ‘undesired geometry
is adjusted’ on a single engine.

Flight phase   Classification     Probability requirement
                                  (events per flight hour)

Start          No Safety Effect   No Probability Requirement
Idle           No Safety Effect   No Probability Requirement
Taxi           No Safety Effect   No Probability Requirement
Take-off       Minor              <1.0E-03
RTO            No Safety Effect   No Probability Requirement
Climb          Minor              <1.0E-03
Cruise         No Safety Effect   No Probability Requirement
Descent        No Safety Effect   No Probability Requirement
Approach       No Safety Effect   No Probability Requirement
Go-around      Minor              <1.0E-03
Landing        No Safety Effect   No Probability Requirement
Brake          No Safety Effect   No Probability Requirement


Figure 6. Simplified fault tree for the inlet adjustment
system.


Commercial aircraft must be able
to fly with one engine inoperative. A failure of 
the adjustment system of a single engine can only 
cause surge and therefore loss of thrust on a sin- 
gle engine. Hence, this event only leads to a slight 
reduction of the aircraft functional capability and 
a slight increase in workload for the flight crew. 
However, a failure of the adjustment system on 
multiple engines could lead to loss of thrust on 
more than one engine. For instance, during take- 
off this can result in a rejected take-off (RTO). 
This is classified as a hazardous mode with a prob- 
ability requirement of less than 1.0E-07 events per 
flight hour (EASA 2016). 
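
The link between a severity classification and its probability requirement can be sketched as a small lookup. This is only an illustration: the values for ‘minor’ and ‘hazardous’ are the ones quoted in the text, whereas the values for ‘major’ and ‘catastrophic’ are the commonly cited AMC 25.1309 objectives and are added here as an assumption. The function name `meets_requirement` is hypothetical.

```python
from typing import Optional

# Probability objectives in events per flight hour.
# 'minor' and 'hazardous' are quoted in the text; 'major' and 'catastrophic'
# are assumed here from the commonly cited AMC 25.1309 objectives.
REQUIREMENT = {
    "no safety effect": None,   # no probability requirement
    "minor": 1.0e-3,
    "major": 1.0e-5,
    "hazardous": 1.0e-7,
    "catastrophic": 1.0e-9,
}


def meets_requirement(classification: str, events_per_flight_hour: float) -> bool:
    """Return True if a failure rate satisfies the objective of its class."""
    limit: Optional[float] = REQUIREMENT[classification.lower()]
    return limit is None or events_per_flight_hour < limit
```

For example, a single-engine event classified as ‘minor’ with a rate of 5.0E-04 per flight hour would satisfy its objective, while the same rate would not be acceptable for a ‘hazardous’ multi-engine event.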


4.3 FTA and failure prevention 


The fault tree in Figure 6 shows how the hazardous
event ‘loss of thrust’ results from the loss of thrust
on all engines. This is caused by the inability to
provide sufficient air because an undesired inlet
geometry has been adjusted, which in turn could be
caused by a single control system failure.

Single failure conditions can be avoided by a
redundant design, achieved through design
modifications. The adjustment system is therefore
protected by a locking system that is independent of
the actuator control electronics, see Figure 7.
Furthermore, the individual engines should be
controlled separately.
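
The effect of these design modifications on the top event can be illustrated with elementary fault tree gates. The sketch below assumes statistically independent basic events, which is exactly what a CCA would have to confirm, and both failure rates are hypothetical placeholders rather than values from this study.

```python
def or_gate(*probabilities: float) -> float:
    """Probability that at least one independent input event occurs."""
    none_occurs = 1.0
    for p in probabilities:
        none_occurs *= 1.0 - p
    return 1.0 - none_occurs


def and_gate(*probabilities: float) -> float:
    """Probability that all independent input events occur."""
    all_occur = 1.0
    for p in probabilities:
        all_occur *= p
    return all_occur


# Hypothetical rates per flight hour, chosen only for illustration.
P_CONTROL = 1.0e-5   # control electronics command an undesired geometry
P_LOCK = 1.0e-4      # independent locking system fails to hold the geometry

# One shared control system without locking: a single failure hits all engines.
shared_control = P_CONTROL

# Independent lock per engine: control AND lock must fail on that engine.
per_engine = and_gate(P_CONTROL, P_LOCK)

# Two separately controlled engines: both engines must be affected.
separate_control = and_gate(per_engine, per_engine)
```

With these placeholder numbers the shared, unprotected design would miss the 1.0E-07 objective for a hazardous event, while the locked and separately controlled design would meet it by a wide margin.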


5 CONCLUSIONS 


Although variable inlets can improve the speed and
efficiency of modern aircraft, they have not yet found
their way into commercial aviation because of
potential safety and reliability issues. These issues
can be reduced by applying a coupled design and
safety process that is integrated in the early phases
of the product development process. The methods of
this process were discussed, and an FHA as well as an
FTA were applied to variable inlet concepts. This
resulted in the identification of failure
probability requirements and design adjustments.

It remains impossible to ensure that every failure
is identified during development, but the
probability of design faults can be reduced significantly.
During the investigation of variable inlets, the
bottom-up FMEA method should be applied to
complement the top-down approach of the FTA.
A CCA, especially a PRA, should be performed
to investigate independence and external events.
Finally, validation and verification tests must
be carried out in consultation with the aviation
authorities.


It should be considered that some safety regula- 
tions, guidelines and expert opinions could be too 
conservative and thus limit the solution variety of 
new technologies. Furthermore, safety and reliability
are often neglected during academic studies.

Following the safe design approach of this study,
function trees, an FHA and an FTA have been
described in this paper. This results in the
identification of safety requirements and design faults.
These faults can be mitigated by means of
redundancy within the adjustment control system.
In this way, the negative effects of variable inlets
can be reduced during the early stages of product
development. Hence, variable inlets could be
applied in commercial aviation. This could allow
for faster and more efficient aircraft while
maintaining safety and reliability. Moreover, the safe
design approach could be applied to other product
developments in aviation or other industries, as
there is currently no general safe design approach.
ABSTRACT: Multi-State Phased-Mission Systems (MS-PMSs) are Multi-State Systems (MSSs) that
accomplish different missions in a series of consecutive and non-overlapping time durations, called phases.
Many practical PMSs are non-repairable, such as spacecraft operating in outer space. In a non-repairable
MS-PMS, the dependence among phases is more complicated; for example, a component’s state
cannot improve in later phases. The paths of the MMDD model generated by the traditional MMDD
manipulation rules cannot account for this kind of dependency, and some of the paths may be
self-conflicting. To solve this problem, an MMDD algorithm for the MS-PMS and a PMS-MMDD model are
proposed. With the PMS-MMDD model, the system MMDD model can be generated without any additional
steps. Finally, Monte Carlo simulation is used to verify the proposed PMS-MMDD model.


1 INTRODUCTION


PMSs are systems that need to complete different
missions in multiple, consecutive, non-overlapping
time durations, known as phases (Xing and
Amari, 2008). PMSs are common in the aerospace
industry; a spacecraft, for example, has a lifetime
that can be divided into several phases: launching,
orbit transfer, on-orbit operation and return to earth.
When the systems or components of a PMS can exhibit
multiple performance levels or states, the system is
called an MS-PMS. The reliability of an MS-PMS is
defined as the probability that the system state stays
above the failure states in all phases. The challenges
in analyzing an MS-PMS are mainly due to two aspects
(Shrestha et al., 2011):


1. Phase dependence: the state of a component at the
end of one phase must equal its state at the beginning
of the next phase, and a component that fails in one
phase remains in the failure state in the following
phases;

2. Dynamic behaviors: the working components, the
system structure and the working environment differ
between phases, so a distinct model is necessary for
each phase.


The existing PMS analysis methods can be
divided into two classes: analytical methods
and simulation methods. The analytical methods
can be further categorized into three types:
(1) combinatorial methods, such as the Binary
Decision Diagram (BDD) (Zang et al., 1999) or the
Multi-state Multi-valued Decision Diagram (MMDD)
method (Shrestha et al., 2011); (2) state space
models, such as the Continuous Time Markov
Chain (CTMC) (Wu et al., 2012); (3) the modular
method (Ou and Dugan, 2004), which combines
the combinatorial methods and the state space
models and possesses the advantages of both. However,
most of the existing methods are limited to
binary-state systems and components. On the other hand,
considerable research effort has been devoted to the
analysis of Multi-State Systems (MSSs) (Liu et al.,
2008, Liu and Huang, 2010), but none of these works
considers multi-phased system behavior.

Research on the MS-PMS has been scarce so far
because of the complexity of both PMSs and
MSSs. The multi-state behavior further
complicates the phase dependence and the dynamics,
leading to the state explosion problem. Recently,
the MMDD method was used to assess the reliability of
an MS-PMS with multi-state repairable components
(Shrestha et al., 2011). The system MMDD model
is generated by the general MMDD manipulation
rules and all the path probabilities are evaluated
by the CTMC. Shrestha et al. (2011) consider only
the repairable MS-PMS and only the first kind of
phase dependence (the state of a component at the
beginning of one phase must be identical to its state
at the end of the previous phase). The paths generated
by this method may therefore be self-conflicting in
a non-repairable PMS. In this paper, an MMDD
algorithm for the MS-PMS and a PMS-MMDD
model are proposed to evaluate the system state
probabilities of the MS-PMS. With the MMDD
algorithm for the MS-PMS, the self-conflicting paths
are cancelled automatically during model generation
without any additional steps, which makes
the system modelling more efficient.

This paper is organized as follows. Section 2
introduces the basic concepts of the MMDD
model. Section 3 introduces a simple example in
detail. Section 4 presents the proposed MMDD
algorithm and the model generation process of the
PMS-MMDD model. The path probability evaluation
method and the verification of the results
are shown in Section 5. Section 6 gives
the conclusions and future work.


2 BASIC CONCEPTS OF MMDD MODEL 


The MMDD model, an extension of the BDD,
was proposed in (Shrestha et al., 2011) and applied
to analyze the MSS and MS-PMS. Like the BDD,
the MMDD is based on Shannon’s decomposition
and the manipulation of multi-valued logic
functions. The multi-valued logic function F of a
component A with m states is,


\[
F = \mathrm{case}(A, F_1, F_2, \ldots, F_m)
  = A_1 F_{A=1} + A_2 F_{A=2} + \cdots + A_m F_{A=m} \tag{1}
\]


There are two kinds of nodes in the MMDD
model: (1) non-sink nodes, each labeled with
a multi-valued variable $x_A$ that represents the
behavior of component A and has one outgoing edge
per state; (2) the sink nodes ‘0’ and ‘1’, which
represent whether or not the system is in one
specific state. Each disjoint path from the root node
to the sink node ‘1’ represents the system being in
that specific state.

The generation of the MMDD model is based 
on the manipulation rule (Xing and Dai, 2009), 


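\[
g \diamond h = \mathrm{case}(x, G_1, \ldots, G_m) \diamond \mathrm{case}(y, H_1, \ldots, H_m)
= \begin{cases}
\mathrm{case}(x, G_1 \diamond H_1, \ldots, G_m \diamond H_m), & \mathrm{index}(x) = \mathrm{index}(y) \\
\mathrm{case}(x, G_1 \diamond h, \ldots, G_m \diamond h), & \mathrm{index}(x) < \mathrm{index}(y) \\
\mathrm{case}(y, g \diamond H_1, \ldots, g \diamond H_m), & \mathrm{index}(x) > \mathrm{index}(y)
\end{cases} \tag{2}
\]

where $\diamond$ represents a logic operation (OR, AND) and
the index represents the predefined variable order.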
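
The manipulation rule can be sketched as a short recursive procedure. This is a minimal illustration and not the authors' implementation: the names `Node`, `make` and `apply` are hypothetical, sink nodes are represented by the integers 0 and 1, and the children of a node are listed from state 1 to state m (an assumed convention).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple, Union


@dataclass(frozen=True)
class Node:
    """Non-sink node: a multi-valued variable with one child per state."""
    var: str
    children: Tuple["MDD", ...]


MDD = Union[int, Node]  # sink nodes are the integers 0 and 1


def index(f: MDD, order: Dict[str, int]) -> float:
    """Position of the root variable in the ordering; sinks sort last."""
    return order[f.var] if isinstance(f, Node) else float("inf")


def make(var: str, children: Tuple[MDD, ...]) -> MDD:
    """Build a node, collapsing it when all children are identical."""
    first = children[0]
    return first if all(c == first for c in children) else Node(var, children)


def apply(op: Callable[[int, int], int], g: MDD, h: MDD,
          order: Dict[str, int]) -> MDD:
    """Combine two diagrams with a logic operation, following the three cases."""
    if isinstance(g, int) and isinstance(h, int):
        return op(g, h)
    ig, ih = index(g, order), index(h, order)
    if ig == ih:   # same variable: combine children state by state
        kids = tuple(apply(op, gc, hc, order)
                     for gc, hc in zip(g.children, h.children))
        return make(g.var, kids)
    if ig < ih:    # x comes first: push h below every child of g
        return make(g.var, tuple(apply(op, gc, h, order) for gc in g.children))
    return make(h.var, tuple(apply(op, g, hc, order) for hc in h.children))


def OR(a: int, b: int) -> int:
    return a | b


def AND(a: int, b: int) -> int:
    return a & b
```

As a usage example, with `order = {"A": 0, "C": 1}`, the events "A in state 3" as `Node("A", (0, 0, 1))` and "C in state 2" as `Node("C", (0, 1))`, the call `apply(OR, a3, c2, order)` yields a diagram rooted at A whose first two children test C and whose third child is the sink 1.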


3 A SIMPLE EXAMPLE 


In this paper, the non-repairable MS-PMS is studied,
which is commonly seen in the aerospace industry.
A simple example system is used to demonstrate the
proposed method. The system consists of three
components, A, B and C, whose state transition
graphs are shown in Figure 1. $\lambda_{i,j}$ denotes
the rate at which a component transits from
state i to state j. The parameters are shown in
Table 1.

The example PMS consists of three consecutive
mission phases: $T_1$, $T_2$ and $T_3$. Mission $T_1$ needs
component A in state 3 or component C in state
2; mission $T_2$ needs component A in state 2 or above,
or component B in state 3; mission $T_3$ needs
component B in state 2 or above, or component C in
state 2. The entire mission is successful if all three
consecutive missions are completed successfully. The
system structure function is given in Equation (3).


Figure 1. The state transition graphs for the components
in the example system: (1) component A, (2) component B,
(3) component C.


Table 1. The parameters of components in the example
system.

                   A       B       C
$\lambda_{3,2}$    1/100   1/120
$\lambda_{3,1}$    1/150   1/160
$\lambda_{2,1}$    1/200   1/200   1/160
Figure 2. The MFT model for the example system.


\[
\phi = \phi_1 \cdot \phi_2 \cdot \phi_3
     = (A_{1,3} + C_{1,2}) (A_{2,3} + A_{2,2} + B_{2,3}) (B_{3,3} + B_{3,2} + C_{3,2}) \tag{3}
\]


According to the description of the system
structure above, the multistate fault tree (MFT)
model for the example system is shown in Figure 2.

Each bottom event in Figure 2 represents
one component in one specific phase and state. For
example, $A_{1,3}$ means that component A stays in
state 3 in phase 1.


4 MMDD MODEL FOR MS-PMS 


4.1 The MMDD algorithm for PMS 


In the paper (Shrestha et al., 2011), only the
repairable MS-PMS is studied and the MMDD model
is generated by the general MMDD manipulation
rules, but some of the resulting paths are
self-conflicting in a non-repairable PMS. For example, in
the path ‘Nio > N; > N, > N, > l’ of Figure 6 in
the paper (Shrestha et al., 2011), component
c is in state 1 in phase 3 and in state 0 in phase 1
simultaneously, which cannot happen in a
non-repairable MS-PMS. To address this problem, an
MMDD algorithm for the MS-PMS is proposed, extended
from the BDD algorithm for the binary non-repairable
PMS (Zang et al., 1999). First, a phase algebra for
the MS-PMS is proposed based on the dependency
among phases in the non-repairable MS-PMS, as shown
in Table 2.

Table 2. The phase algebra for the MS-PMS.

$A_{i,m} \cdot A_{j,m} \to A_{j,m}$        $A_{i,m} + A_{j,m} \to A_{i,m}$
$A_{i,1} \cdot A_{j,1} \to A_{i,1}$        $A_{i,1} + A_{j,1} \to A_{j,1}$
$A_{i,S_1} \cdot A_{j,S_2} \to 0$          $A_{i,S_1} + A_{j,S_2} \to 1$
$(1 \le S_1 < S_2 \le m)$                  $(1 \le S_1 < S_2 \le m)$


The physical meanings of the elements in Table 2
are:


1. $A_{i,m} \cdot A_{j,m} \to A_{j,m}$: component A being in the
perfect state m in both phase i and phase j implies that
component A is in state m in phase j.

2. $A_{i,1} \cdot A_{j,1} \to A_{i,1}$: component A being in the
worst state 1 in both phase i and phase j implies that
component A is in the worst state 1 in phase i.

3. $A_{i,S_1} \cdot A_{j,S_2} \to 0$ $(1 \le S_1 < S_2 \le m)$: component A
being in state $S_1$ in phase i and in state $S_2$ in phase
j simultaneously cannot occur.

4. $A_{i,m} + A_{j,m} \to A_{i,m}$: component A being in the
perfect state m in phase i or phase j implies that
component A is in state m in phase i.

5. $A_{i,1} + A_{j,1} \to A_{j,1}$: component A being in state 1 in
phase i or phase j implies that component A is in
state 1 in phase j.

6. $A_{i,S_1} + A_{j,S_2} \to 1$ has no physical meaning if the
components are not repairable.

4.2 The MMDD operation for the MS-PMS 


With the above phase algebra for the MS-PMS, the
phase dependence operation (PDO) for the MS-PMS
can be derived in two forms: (1) the forward PDO,
in which the order of the variables is the same as
the phase order; (2) the backward PDO, in which the
order of the variables is the reverse of the phase
order (Zang et al., 1999).

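Assume that component A with m states works
in both phase i and phase j (i < j), and that the case
formats of the component in phase i and phase j, $E_i$
and $E_j$, can be written as,

\[
E_i = \mathrm{case}\left(A_i, (E_i)_{A_{i,m}=1}, \ldots, (E_i)_{A_{i,1}=1}\right)
    = \mathrm{case}(A_i, G_m, \ldots, G_1) \tag{4}
\]

\[
E_j = \mathrm{case}\left(A_j, (E_j)_{A_{j,m}=1}, \ldots, (E_j)_{A_{j,1}=1}\right)
    = \mathrm{case}(A_j, H_m, \ldots, H_1) \tag{5}
\]

where $A_{j,m} = 1$ means component A is in state
m in phase j and $A_{j,m} = 0$ means component A
is not in state m in phase j.

For the forward PDO ($\mathrm{index}(A_i) < \mathrm{index}(A_j)$), the
proposed MMDD manipulation rule for the MS-PMS is,

\[
\mathrm{case}(A_i, G_m, \ldots, G_1) \diamond \mathrm{case}(A_j, H_m, \ldots, H_1)
= \mathrm{case}(A_i, G_m \diamond E_{j,m}, \ldots, G_n \diamond E_{j,n}, \ldots, G_1 \diamond H_1) \tag{6}
\]

where

\[
E_{j,n} = \mathrm{case}(A_j, 0, \ldots, 0, H_n, \ldots, H_1), \quad 1 < n \le m
\]

For the backward PDO ($\mathrm{index}(A_i) > \mathrm{index}(A_j)$), the
proposed MMDD manipulation rule for the MS-PMS is,

\[
\mathrm{case}(A_i, G_m, \ldots, G_1) \diamond \mathrm{case}(A_j, H_m, \ldots, H_1)
= \mathrm{case}(A_j, G_m \diamond H_m, E_{i,m-1} \diamond H_{m-1}, \ldots, E_{i,1} \diamond H_1) \tag{7}
\]

where

\[
E_{i,n} = \mathrm{case}(A_i, G_m, \ldots, G_n, 0, \ldots, 0), \quad 1 \le n < m
\]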
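
The restricted case formats used in the two PDO rules can be illustrated with list operations. The sketch below assumes that the children of a case format are listed from state m down to state 1 and that branches ruled out by the non-repairable constraint are replaced by the sink 0. Both the zero placement and the function names are a reading of the rules, not the authors' code.

```python
from typing import List, Sequence, Union

Child = Union[int, str]  # a sink value or a label for a sub-diagram


def forward_restrict(h: Sequence[Child], n: int) -> List[Child]:
    """Children of the later-phase node when the earlier phase is in state n.

    h lists the later-phase children for states m, ..., 1. States above n
    are unreachable for a non-repairable component and become 0.
    """
    m = len(h)
    return [0] * (m - n) + list(h[m - n:])


def backward_restrict(g: Sequence[Child], n: int) -> List[Child]:
    """Children of the earlier-phase node when the later phase is in state n.

    g lists the earlier-phase children for states m, ..., 1. States below n
    cannot precede state n for a non-repairable component and become 0.
    """
    m = len(g)
    return list(g[:m - n + 1]) + [0] * (n - 1)
```

For a three-state component, `forward_restrict(["H3", "H2", "H1"], 2)` gives `[0, "H2", "H1"]`: once the component is in state 2 in the earlier phase, the branch for state 3 in the later phase is removed.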


5 SYSTEM EVALUATION 


In this section, the system evaluation process using the
proposed MMDD model and the CTMC is shown in
detail. It consists of two steps: model generation
and path probability evaluation.


5.1 Model generation 


First, the basic events of the MFT model in Figure 2
are transferred into MMDD basic events. For
example, $\mathrm{case}(A_i,1,0,0)$, $\mathrm{case}(A_i,0,1,0)$ and
$\mathrm{case}(A_i,0,0,1)$ represent component A in state 3,
state 2 and state 1 in phase i, respectively. Then, with
the MMDD basic events and the system structure shown
in Figure 2, the MMDD models for each phase are
generated by the general MMDD manipulation rules. The
MMDD models for the phases of the example system
are shown in Figure 3. Finally, the system MMDD
model is generated by the proposed MMDD algorithm
for the MS-PMS, as shown in Figure 4.

The grouped solid lines and dashed lines in
Figure 4 represent the working states and failure
states of the corresponding non-sink nodes.
According to the definition of the MMDD model,
each path from the root node to the sink node ‘1’
represents the system staying in one specific state
in all phases. From Figure 4, all the paths in which
the example system stays in the working state in all
three phases can be obtained. There are nine paths
in total:


Figure 3. The MMDD model for each phase: (Phase 1), (Phase 2), (Phase 3).


Figure 4. The grouped MMDD model for all three 
phases of the example PMS. 


[Equation (8): the nine paths $\pi_1, \ldots, \pi_9$ from the root node to the sink node ‘1’; the individual path expressions are not legible in the source.] (8)


5.2 Path probability evaluation 


In this section, the probabilities of the disjoint
paths generated in the last section are evaluated
by the CTMC. All the components are independent
of each other, so the probability of each path
can be evaluated by multiplying the components’
transition probabilities in the different phases. For
example, the probability of a path $\pi_k$ can be
evaluated as,

\[
\Pr(\pi_k = 1) = \Pr(C_{1,2} C_{3,1} B_{3,3} A_{1,3})
= \Pr(C_{1,2} C_{3,1}) \Pr(B_{3,3}) \Pr(A_{1,3}) \tag{9}
\]

In this paper, only non-repairable components
are studied, so the state of a component
cannot improve in later phases and its state
at the beginning of one phase must equal its state at
the end of the previous phase. Let $S_i$ represent the
state of a component in phase i ($1 \le i \le m$). Due to the
memoryless property of the homogeneous CTMC
(Lisnianski and Levitin, 2003), the path probability
of the component can be evaluated as,

\[
\Pr(S_1 = c_1, S_2 = c_2, \ldots, S_m = c_m)
= p_{N,c_1}(T_1) \, p_{c_1,c_2}(T_2) \cdots p_{c_{m-1},c_m}(T_m) \tag{10}
\]

where $c_i$ is the specific state of the component
at the end of phase i, $c_i \in \{1, 2, \ldots, N\}$, and the
component starts in state N.
For example, in path $\pi_k$ component C is in state 2
in phase 1 and transits to state 1 in phase 3,

\[
\Pr(C_{1,2} C_{3,1}) = p^{C}_{2,2}(T_1) \, p^{C}_{2,1}(T_3) \tag{11}
\]

Component B is in state 3 in phase 3 and it
works in both phase 2 and phase 3, so component
B is in state 3 in both phase 2 and phase 3,

\[
\Pr(B_{3,3}) = p^{B}_{3,3}(T_2 + T_3) \tag{12}
\]

Component A is in state 3 in phase 1,

\[
\Pr(A_{1,3}) = p^{A}_{3,3}(T_1) \tag{13}
\]

where $p^{A}_{i,j}(t)$ represents the probability that
component A transits from state i to state j during
the time interval $[0,t]$. $p_{i,j}(t)$ can be evaluated
by the CTMC with the state transition rates $\lambda_{i,j}$,

\[
[p_{i,j}(t)] = \exp(Qt) \tag{14}
\]

\[
Q = [\lambda_{i,j}] \tag{15}
\]

Then, the probability of path $\pi_k$ can be computed as,

\[
\Pr(\pi_k = 1) = \Pr(C_{1,2} C_{3,1}) \Pr(B_{3,3}) \Pr(A_{1,3}) = 0.0228 \tag{16}
\]

Similarly, the probabilities of the other eight paths
can be evaluated, and the reliability of the example
system at the end of phase 3 can be computed as,

\[
\sum_{i=1}^{9} \Pr(\pi_i = 1) = 0.5819 \tag{17}
\]
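
The evaluation of one path probability can be sketched in pure Python. The sketch makes several assumptions that are not taken from the paper: the matrix exponential is computed by uniformization (a swapped-in numerical technique), the diagonal of Q is set to the negative row sum of the off-diagonal rates, the rates follow one reading of Table 1, and the phase durations `T1`, `T2`, `T3` are hypothetical because they are not stated in the text. The result is therefore not expected to reproduce the value 0.0228.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q: Matrix, t: float, terms: int = 100) -> Matrix:
    """P(t) = exp(Q t) by uniformization; adequate while rate * t is small."""
    n = len(q)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    rate = max(-q[i][i] for i in range(n))
    if rate == 0.0 or t == 0.0:
        return identity
    u = [[identity[i][j] + q[i][j] / rate for j in range(n)] for i in range(n)]
    result = [[0.0] * n for _ in range(n)]
    power = identity                      # U^k, starting with k = 0
    weight = math.exp(-rate * t)          # Poisson weight for k = 0
    for k in range(terms):
        for i in range(n):
            for j in range(n):
                result[i][j] += weight * power[i][j]
        power = matmul(power, u)
        weight *= rate * t / (k + 1)
    return result


def generator3(l32: float, l31: float, l21: float) -> Matrix:
    """Three-state component; index 0 is state 3, index 2 is state 1."""
    return [[-(l32 + l31), l32, l31], [0.0, -l21, l21], [0.0, 0.0, 0.0]]


def generator2(l21: float) -> Matrix:
    """Two-state component; index 0 is state 2, index 1 is state 1."""
    return [[-l21, l21], [0.0, 0.0]]


# Rates as read from Table 1 (row labels are an assumption).
Q_A = generator3(1 / 100, 1 / 150, 1 / 200)
Q_B = generator3(1 / 120, 1 / 160, 1 / 200)
Q_C = generator2(1 / 160)
T1, T2, T3 = 10.0, 20.0, 30.0  # hypothetical phase durations


def example_path_probability() -> float:
    """Product of the three component terms of the example path."""
    p_c = transition_matrix(Q_C, T1)[0][0] * transition_matrix(Q_C, T3)[0][1]
    p_b = transition_matrix(Q_B, T2 + T3)[0][0]
    p_a = transition_matrix(Q_A, T1)[0][0]
    return p_c * p_b * p_a
```

Because the probability of staying in the best state depends only on the total exit rate, the terms for A and B in this path are unaffected by how the two exit rates are assigned to the table rows.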


5.3. MC simulation validation 


In the previous sections, the reliability of the
example PMS is evaluated by the PMS-MMDD model
and the CTMC. To verify the correctness of these
methods, Monte Carlo (MC) simulation (Zio, 2013) is
applied. The simulation procedure for the components’
transition times and the system failure time
realizations is shown in Figure 5.

Then, the system reliability can be evaluated by
statistically analyzing the system failure time
realizations. In this paper, the number of realizations
is Nmax =2 10°, and the system reliability evaluated
by the simulation method is,

\[
R_{\mathrm{MC}} = 0.5817 \tag{18}
\]

[Flowchart: set the number of realizations n = 0; sample the state
transition time $T_{i,j}$ for each component according to its state
transition rates and competing failure behavior; from the components’
failure time samples and the system structure, determine the failure
time of this realization; set n = n + 1; repeat until the required
number of realizations is reached; end.]

Figure 5. The grouped MMDD model for all three
phases of the example PMS.

Furthermore, the calculation times of the analytical
approach and the simulation approach are 0.07 s
and 74 s, respectively, which illustrates that the
analytical approach is more computationally efficient.
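
A simulation of this kind can be sketched as follows. The sketch rests on several assumptions: the transition rates follow one reading of Table 1, a component only degrades in the phases in which it is used (A in phases 1 and 2, B in phases 2 and 3, C in phases 1 and 3), mission success is checked at the end of each phase, and the phase durations passed by the caller are hypothetical. It is not the authors' simulation code.

```python
import math
import random
from typing import Dict, Sequence

# Exit rates per component and state (row assignment of Table 1 is assumed).
RATES = {
    "A": {3: {2: 1 / 100, 1: 1 / 150}, 2: {1: 1 / 200}},
    "B": {3: {2: 1 / 120, 1: 1 / 160}, 2: {1: 1 / 200}},
    "C": {2: {1: 1 / 160}},
}
ACTIVE = {1: ("A", "C"), 2: ("A", "B"), 3: ("B", "C")}  # components per phase


def evolve(component: str, state: int, duration: float,
           rng: random.Random) -> int:
    """Let one component degrade for `duration` with competing transitions."""
    elapsed = 0.0
    while state in RATES[component]:
        out = RATES[component][state]
        elapsed += rng.expovariate(sum(out.values()))
        if elapsed > duration:
            break
        # The target state is chosen in proportion to the competing rates.
        state = rng.choices(list(out), weights=list(out.values()))[0]
    return state


def phase_succeeds(phase: int, s: Dict[str, int]) -> bool:
    """Success criteria of the three missions of the example system."""
    if phase == 1:
        return s["A"] == 3 or s["C"] == 2
    if phase == 2:
        return s["A"] >= 2 or s["B"] == 3
    return s["B"] >= 2 or s["C"] == 2


def simulate_reliability(durations: Sequence[float], realizations: int,
                         seed: int = 0) -> float:
    """Fraction of realizations in which all three missions succeed."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(realizations):
        s = {"A": 3, "B": 3, "C": 2}  # all components start in the best state
        ok = True
        for phase in (1, 2, 3):
            for component in ACTIVE[phase]:
                s[component] = evolve(component, s[component],
                                      durations[phase - 1], rng)
            if not phase_succeeds(phase, s):
                ok = False
                break
        successes += ok
    return successes / realizations
```

Checking success only at the end of a phase is sufficient here because the component states can only degrade and the success criteria are monotone in the states.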


6 CONCLUSIONS AND FUTURE WORKS 


In this paper, non-repairable MS-PMSs are
studied. In a non-repairable MS-PMS, the
components’ states cannot improve in later phases.
To address this dependency in the system modelling,
an MMDD algorithm for the PMS and a PMS-MMDD
model are proposed to analyze the reliability
of the MS-PMS with non-repairable components.
With the proposed method, the self-conflicting paths
are cancelled automatically and the system
MMDD model can be generated without any
additional steps. With the paths from the MMDD
model, the system reliability is evaluated by adding
all the path probabilities evaluated by the CTMC.
Finally, the result is verified by MC simulation,
and the comparison also illustrates the
computational efficiency of the proposed analytical
method.

Similar to the PMS-BDD model, the scale of
the MMDD model is heavily dependent on the
predefined variable order. As part of future
work, the relationship between the variable order
and the model scale of the PMS-MMDD model
will be explored. In addition, non-exponential
transition time distributions are more general
in practice, so reliability evaluation considering
non-exponential transition time distributions will
also be studied.
Reliability modeling for dependent competing failure processes between 
component degradation and system performance deterioration 
ABSTRACT: Mechanical systems are usually subject to multiple dependent competing failure
processes. These processes are categorized into two types of failure: soft failure caused by continuous
component degradation together with additional abrupt degradation due to a shock process, and hard
failure caused by the instantaneous stress from the same shock process. Models in the open literature
have simulated these processes. In practice, however, some mechanical systems exhibit the phenomenon
that all the components are still in good states, but the system functions cannot meet the performance
requirements. This paper studies the reliability of multi-component systems subject to the dependent
competing risks of component degradation, random component stress shocks and system performance
deterioration. These failure processes are dependent in three respects: 1) component stress shocks over a
certain critical threshold affect the component degradation process, 2) the component degradation impacts
the component catastrophic failure threshold level, and 3) the component degradation influences the system
performance. Based on component degradation, component random shock and system performance
modeling, a reliability model for systems considering three dependent failure patterns is derived. An
application example using an electro-mechanical system then illustrates the effectiveness of the proposed
approach with a sensitivity analysis. The results with and without consideration of system performance
deterioration are compared and discussed.


1 INTRODUCTION 


A mechanical system is subjected to external random
loads during operation. At the same time, internal
wear, fatigue and aging degradation processes take
place. The system therefore undergoes multiple failure
processes, including catastrophic failures due to
external shock loads and degradation failures due to
degradation processes. The two failure processes
are dependent and compete against each other.
Dependence means that there is a mutual influence
between the two failures, and competition means
that either type of failure can result in system failure.
The catastrophic failure mode (fracture, yielding,
plastic deformation, etc., also called hard failure
mode) and the degradation failure mode (wear,
corrosion, fatigue, etc., also called soft failure mode)
compete with each other, and whichever
occurs first may make the system fail. Because of this
competition, both failure modes cannot occur
simultaneously (Li & Pham 2005).

In the literature, there are five typical
categories of random shock models and many
mixed models based on them. An extreme shock
model (Gut & Hüsler 1999, Cirillo & Hüsler
2011) supposes that system failure occurs when the
magnitude of any shock load exceeds a critical
value. A cumulative shock model (Sumita &
Shanthikumar 1985, Montoro-Cazorla & Pérez-Ocón
2015) assumes that system failure occurs when
the cumulative damage from external shock loads
exceeds a specified threshold. A run shock model
(Mallor & Omey 2001, Eryilmaz 2016) supposes that
system failure occurs when there is a run of m
consecutive shocks exceeding a critical magnitude. A
δ-shock model (Rangan & Tansu 2010, Wang &
Peng 2017) assumes that system failure occurs when the
time interval between two successive shocks is less
than a specified threshold δ. A mixed shock model
(Gut 2001, Rafiee et al. 2015, 2017) includes two
or more basic shock models, in which the system is
supposed to fail either because of one large shock
or as a result of many smaller ones.
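
The failure criteria of these shock models can be written as simple predicates over a recorded shock sequence. The sketch below is only illustrative; the function names and the representation of shocks as lists of magnitudes and arrival times are assumptions, and the mixed model shown is one possible combination (extreme or cumulative).

```python
from typing import Sequence


def extreme_failure(magnitudes: Sequence[float], critical: float) -> bool:
    """Extreme shock model: any single shock exceeds the critical value."""
    return any(w > critical for w in magnitudes)


def cumulative_failure(damages: Sequence[float], threshold: float) -> bool:
    """Cumulative shock model: accumulated damage exceeds the threshold."""
    total = 0.0
    for d in damages:
        total += d
        if total > threshold:
            return True
    return False


def run_failure(magnitudes: Sequence[float], critical: float, m: int) -> bool:
    """Run shock model: m consecutive shocks exceed the critical magnitude."""
    run = 0
    for w in magnitudes:
        run = run + 1 if w > critical else 0
        if run >= m:
            return True
    return False


def delta_failure(arrival_times: Sequence[float], delta: float) -> bool:
    """Delta-shock model: two successive shocks arrive less than delta apart."""
    return any(later - earlier < delta
               for earlier, later in zip(arrival_times, arrival_times[1:]))


def mixed_failure(magnitudes: Sequence[float], critical: float,
                  threshold: float) -> bool:
    """One possible mixed model: one large shock or many smaller ones."""
    return (extreme_failure(magnitudes, critical)
            or cumulative_failure(magnitudes, threshold))
```

For the shock magnitudes `[2.0, 3.0, 3.5, 1.0]`, the extreme model with a critical value of 4.0 reports no failure, whereas the cumulative model with a threshold of 8.0 reports a failure after the third shock.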

Gut & Hüsler extended the shock model further by
considering that some shocks may harm the system
and change the threshold, so that the failure rate
also changes (Gut & Hüsler 2005). Jiang et al.
supposed that shocks with different magnitudes might
have different impacts on the failure of the system
and proposed a zone shock model (Jiang et al. 2015).

The degradation models can be classified into
discrete processes and continuous processes. Xue & Yang
extended the 2-state reliability analysis method
to multi-state systems; the multi-state system
reliabilities are analyzed by combining Markov processes
and s-coherent multistate system structure functions
(Xue & Yang 1995). Li & Pham developed a generalized
multi-state degraded system reliability model
in which the operating condition of the multi-state
system is characterized by a finite number of states
(Li & Pham 2005). Zhang et al. introduced multiple
discrete function theory, considering the monotonicity
and coherence of the multi-state system, to describe
the structure function of the system state. De Morgan's
law and a new block diagram algorithm are
employed to simplify the expression for the system
reliability (Zhang et al. 2017).

For continuous degradation models, there are
two categories that describe the process: degradation
path models and stochastic process models.

Bagdonavičius et al. used a general nonparametric,
nonlinear path model to simulate the degradation
process, supposing that the intensities of
failures depend on the degradation level (Bagdonavičius
et al. 2005). Xu et al. proposed a class of general
path models that incorporate time-varying usage
and environmental variables for modeling
degradation paths. They used nonlinear functions to
describe the degradation paths, and random effects
to describe unit-to-unit variability (Xu et al. 2016).

Yan et al. proposed an improved time-variant
reliability model that considers the stochastic process
of strength degradation based on a semi-stochastic
process. The parameters are determined based on
the model of a Gamma degradation process with
independent increments. This method avoids
redundant and complex calculations and obtains
accurate reliability results (Yan et al. 2016). Huang
et al. proposed an adaptive skew-Wiener model to
simulate the degradation drift of industrial devices
and predict the remaining useful life. A two-stage
algorithm is adopted to estimate the unknown
parameters (Huang et al. 2017).

Catastrophic failure and degradation failure of a
system may compete with each other.

Peng et al. developed a reliability model based
on degradation and random shock modeling for
systems subject to Multiple Dependent Competing
Failure Processes (MDCFP). Two dependent
failure processes, soft failures and hard failures,
were considered in the model. Hard failure occurs
when the size of an external shock exceeds its
threshold. Degradation is composed of pure
degradation and instantaneous increases caused by shocks.
Soft failure happens if the degradation exceeds its
specified threshold. The two failures compete
with each other, and the system reliability is
analyzed on this basis (Peng et al. 2010). Jiang et al.
supposed that a system withstanding shocks is
deteriorating and that its resistance to failure is
weakening. They introduced shifting failure thresholds
into the competing failure model (Jiang et al. 2012).
Similarly, Rafiee et al. proposed a reliability model for
systems subject to dependent competing failure
processes of degradation and random shocks with
a changing degradation rate. The degradation rate
changes when the system becomes more
susceptible to fatigue and deteriorates faster (Rafiee
et al. 2014). In addition, Song et al. studied the
reliability of multi-component systems subject to
multiple dependent competing failure processes,
considering dependent shock damage on the failure
processes among components. They developed reliability
models considering different patterns of dependent
shock damage for a more realistic and accurate
prediction of system reliability (Song et al. 2016).

Many other researchers have studied competing
failure models. To the best of our knowledge, the
degradation threshold is used only for the soft failure
of the component. There is no constraint on system
performance parameters; system failure is assumed to
occur when a component’s degradation exceeds a
critical magnitude. This assumption is reasonable for
a system composed of structural parts, but it causes
problems for a system that contains mechanisms.

Firstly, a soft failure of a component does not
mean that the system must fail. For example, a simple
slider-crank mechanism contains a frame, a crank, a
connecting rod and a slider. Usually, multiple wear
degradation processes exist in the kinematic pairs of
the mechanism. The performance parameters of the
system are affected by each wear degradation process.
When a wear degradation value has exceeded its
specified threshold, the performance parameter value
of the system may exceed the system requirement, or
it may stay within the requirement and remain in a
normal state. The system’s state is determined only
by the performance of the system.

Second, in some mechanisms all components are still in a good state, yet the system function cannot meet the performance requirements. Wear degradation may increase the coefficient of friction. Although no component of the system has suffered a soft failure, the increased friction coefficient can lead to stagnation failure of the mechanism.

Finally, the degradation of some system components is difficult to measure directly, especially in a mechanism, whereas the performance value of the mechanism output is easy to acquire.

To address these problems, we study the reliability of multi-component systems (especially mechanisms) subject to dependent competing degradation of components, random stress shocks on components and performance deterioration of the system. A reliability model for systems with three dependent failure patterns is derived based on component degradation, component random shocks and system performance modeling. An application example then illustrates the effectiveness of the proposed approach, together with a sensitivity analysis. The results obtained with and without system performance deterioration are compared and discussed.
2 SYSTEM FAILURE DESCRIPTION


As shown in Figure 1, the components of a mechanical system degrade over time. The degradation trajectory may be linear or non-linear. In addition, stress shocks on component i that exceed a certain critical threshold add abrupt degradation damage to the component degradation process. At the same time, the degradation of the component lowers its hard failure threshold, so the hard failure threshold degrades accordingly. When the strength of the component falls below the load stress, catastrophic failure of the component occurs.

For a mechanism, the degradation of components influences the system performance through the topological relationships of the mechanical composition. As shown in Figure 2, the system performance of the mechanism degrades according to the degradation values of components 1, 2, ..., n and their topological function. When the system performance leaves the threshold range, system performance degradation failure occurs.

Figure 1. Dependent degradation processes for compo-

Figure 2. Dependent competing failure processes between components’ catastrophic failure and mechanism’s performance degradation failure.

The component catastrophic failure and the system performance degradation failure are dependent because they share the same load stress environment. The two types of failure also compete: whichever occurs first causes the mechanism to fail.


3 RELIABILITY MODELLING FOR 
MULTI-COMPONENTS SYSTEM 


3.1 System load stress analysis 


Typically, the mechanism is always under its necessary working loads. However, the load magnitudes are not fixed values, for several reasons including the randomness of shocks and the vibration of the system. The system load is transferred to the component level in specific ways, which are determined by the basic structural form and the total degradation of all components of the mechanical system.
Suppose the transfer law is expressed by

$$S_i = f_i(B, L, F, t), \quad i = 1, 2, \ldots, n \tag{1}$$

where $S_i$ is the stress magnitude of the ith component; $f_i(\cdot)$ is the transfer function between the ith component stress and the system loads; $B$ indicates the basic structural form of the mechanical system; $L$ is the total degradation vector of all components; $F$ is the system load spectrum vector; and $t$ is the time variable.

The transfer law may take an explicit or an implicit form. For a simple mechanism with an explicit transfer function, an analytical method can be used to obtain the component-level stress. For a complex mechanism without an explicit transfer function, multi-flexible-body dynamics can be used to obtain the stress.

Large stresses deserve the most attention, because the larger the stress, the more dangerous it is for the component. A large stress may cause additional abrupt degradation of the component.

Owing to the randomness of the external loads on the system, the component stresses whose magnitude exceeds a fixed threshold follow a Poisson process. Concretely, the stress shocks of the ith component arrive according to a Poisson process $\{N_i(t) \mid S_i(t) > H_i,\ t \ge 0\}$ with rate $\lambda_i$. The probability that the number of stress shocks of the ith component equals $k$ is

$$P(N_i(t) = k) = \frac{\exp(-\lambda_i t)\,(\lambda_i t)^k}{k!} \tag{2}$$

where $H_i$ is the abrupt degradation threshold of component i.
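
The shock-count probability of a homogeneous Poisson process can be evaluated directly. The following is a minimal sketch; the function name and the numerical values are illustrative assumptions and are not taken from the paper.

```python
import math


def shock_count_pmf(k: int, rate: float, t: float) -> float:
    """P(N(t) = k) for a homogeneous Poisson process with the given arrival rate."""
    mean = rate * t
    return math.exp(-mean) * mean ** k / math.factorial(k)


# Hypothetical values, for illustration only.
rate = 2.5e-4   # assumed shock arrival rate per working cycle
t = 10_000      # assumed number of working cycles
probabilities = [shock_count_pmf(k, rate, t) for k in range(6)]
```

The probabilities over all k sum to one, which gives a quick sanity check on any chosen rate and time.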

The magnitudes of the component stresses that exceed the fixed threshold can be obtained statistically. The type and parameters of the stress shock distribution may be obtained using the maximum likelihood method.


3.2 Reliability modeling for component 


As mentioned above, components are subject to two kinds of processes: the catastrophic failure process and the degradation process. Catastrophic failure happens when the stress on the component exceeds its strength threshold. This threshold changes with time and is related to the degradation of the component.


3.2.1 Degradation process modeling 

The component degradation consists of two parts: the pure degradation $X_i(t)$ and the abrupt degradation $Z_i(t)$. An exponential degradation path is used to model the pure degradation:

$$X_i(t) = \varphi_i + \gamma_i t^{\xi_i} + \varepsilon(t) \tag{3}$$

where $\varphi_i$ is the initial value; $\gamma_i$ and $\xi_i$ are the degradation parameters; and $\varepsilon(t)$ is a standard Brownian motion, $\varepsilon \sim B(0, \sigma_i t)$.

The cumulative abrupt degradation $Z_i(t)$ caused by random stress shocks is

$$Z_i(t) = \sum_{j=1}^{N_i(t)} Y_{ij} \tag{4}$$

where $N_i(t)$ is the total number of stress shocks of the ith component up to time $t$, and $Y_{ij}$ is the abrupt degradation caused by the jth stress shock on component i.

The abrupt degradation value $Y_{ij}$ depends on the component stress shock: the larger the shock, the larger the abrupt degradation is likely to be. We assume a positive linear relation between the abrupt degradation value and the component stress shock:

$$Y_{ij} = \alpha_i S_{ij} \tag{5}$$

where $S_{ij}$ is the jth stress shock on component i that exceeds $H_i$ up to time $t$, which can be determined by equation (2), and $\alpha_i$ is the degradation conversion coefficient.

Thus, the total degradation of component i is

$$L_i(t) = X_i(t) + Z_i(t) = \varphi_i + \gamma_i t^{\xi_i} + \varepsilon(t) + \sum_{j=1}^{N_i(t)} \alpha_i S_{ij} \tag{6}$$
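
One sample of the total degradation (gradual part plus shock-induced part) can be drawn as sketched below. This is only an illustration under stated assumptions: the gradual trend is taken as a power law in time, the Brownian term is given variance sigma squared times t, and the shock sizes come from a caller-supplied sampler. None of the names or numerical values are from the paper.

```python
import math
import random


def simulate_total_degradation(t, phi, gamma, xi, sigma, rate, alpha, draw_shock, rng):
    """Draw one sample of total degradation at time t (gradual + abrupt)."""
    # Gradual part: assumed power-law trend plus Gaussian noise with variance sigma**2 * t.
    gradual = phi + gamma * t ** xi + rng.gauss(0.0, sigma * math.sqrt(t))
    # Count Poisson shock arrivals on [0, t] via exponential inter-arrival times.
    n_shocks = 0
    arrival = rng.expovariate(rate)
    while arrival <= t:
        n_shocks += 1
        arrival += rng.expovariate(rate)
    # Abrupt part: every shock of size S adds alpha * S.
    abrupt = sum(alpha * draw_shock(rng) for _ in range(n_shocks))
    return gradual + abrupt


# Hypothetical parameters, for illustration only.
rng = random.Random(1)
sample = simulate_total_degradation(
    t=1.0e4, phi=0.0, gamma=3.0e-7, xi=1.05, sigma=1.0e-4,
    rate=2.5e-4, alpha=0.1,
    draw_shock=lambda r: r.gauss(4000.0, 120.0), rng=rng,
)
```

Repeating the call many times gives an empirical distribution of the degradation at time t.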


3.2.2 Catastrophic failure process modeling 
Catastrophic failure occurs once the stress on a component exceeds its strength threshold. The probability that no catastrophic failure happens is

$$P(S_i < D_i(t)) = F_i(D_i(t)) \tag{7}$$

where $S_i$ is the random component stress and $D_i(t)$ is the current strength threshold, which changes with time owing to the degradation of the component.

The current strength threshold is related to the total degradation of the component: the greater the total degradation, the smaller the residual strength. We therefore assume that the current strength threshold is a linear function of the total degradation of the component, as shown in Figure 3:

$$D_i(t) = D_{0i} + \beta_i L_i(t) \tag{8}$$

where $\beta_i$ is the threshold conversion coefficient and $D_{0i}$ is the initial threshold value.

Because shock arrivals follow a Poisson process, any number of shocks, from 0 to $\infty$, may occur up to time $t$. For illustration, assume that there are $j$ shocks up to time $t$ and that the arrival time of the jth shock is denoted by $T_j$. The probability that no catastrophic failure occurs for the ith component is then

$$F_i(D_i(t)) = P(N_i(t) = 0) + \sum_{j=1}^{\infty} P\Big(\bigcap_{k=1}^{j} \big(S_{ik} < D_i(T_k)\big) \,\Big|\, N_i(t) = j,\ t_1 = T_1,\ t_2 = T_2, \ldots, t_j = T_j\Big) \cdot P(N_i(t) = j) \tag{9}$$

Figure 3. Catastrophic failure process for component i.
3.3 Reliability modeling for system


3.3.1 System performance degradation modeling 
A mechanism is designed for a specific performance, such as load transmission or displacement shifting. The system performance is achieved through the combined action of all units, including all components, pairs and loads.

Typically, the system performance value is a function of the relevant units:

$$P_s = f_s(B, L, F, t) \tag{10}$$

where $P_s$ is the system performance value and $f_s(\cdot)$ is the function relating the component parameters to the system performance.

This function has the same form as the system load transfer function (1), and it needs to be solved iteratively together with the load transfer function.


3.3.2 System performance degradation 
failure modeling 

As analyzed above, the components undergo degradation, which may induce deterioration of the system performance. In other words, the performance value also degrades as the components degrade. The performance value is therefore also a function of time $t$, denoted $O_s(t)$.

For the mechanical system to meet its requirements, the performance value must conform to the ideal value; that is, the error between the performance value and its ideal value must stay below a threshold. The performance reliability is

$$P = P\big(|O_s(t) - O_{s0}| < \delta\big) \tag{11}$$

where $O_{s0}$ is the ideal value of the performance and $\delta$ is the maximum allowable performance error.
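
When the performance value can only be sampled (for example from a simulation), the probability that the performance error stays below the allowable limit can be estimated by simple Monte Carlo counting. The sketch below assumes an absolute-error criterion and a caller-supplied sampler; both the function name and the example numbers are hypothetical.

```python
import random


def performance_reliability(sample_performance, ideal, delta, runs=5000, seed=0):
    """Estimate P(|O - ideal| < delta) from repeated samples of the performance O."""
    rng = random.Random(seed)
    hits = sum(abs(sample_performance(rng) - ideal) < delta for _ in range(runs))
    return hits / runs


# Hypothetical example: performance scattered around its ideal value.
estimate = performance_reliability(lambda r: r.gauss(100.0, 2.0), ideal=100.0, delta=5.0)
```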


Figure 4. The composition of GDLM (labelled parts: rocker arm, piston rod, latch loop).


3.4 System reliability modeling 


In summary, a mechanical system may fail through two kinds of failure: catastrophic failure of components and degradation failure of system performance. These two failures are dependent and compete with each other.

The system remains in a good state only if neither catastrophic failure nor performance failure has happened. The reliability is expressed by

$$R(t) = P \prod_{i=1}^{n} F_i(D_i(t)) \tag{12}$$


4 NUMERICAL EXAMPLES 
AND RESULTS 


An aircraft gear door lock mechanism (GDLM) is used to illustrate the competing reliability model. The GDLM consists of seven parts: a cavity, piston rod, latch hook, rocker arm ABD, rod BC, rod DE and spring, as illustrated in Figure 4. They are connected by six revolute joints (A, B, C, D, E and F). In the opening process, the piston rod is pushed to the right by the pressure of the hydraulic oil and the rocker arm ABD rotates anti-clockwise. Meanwhile, the latch hook rotates anti-clockwise until it separates from the latch loop. The opening process is investigated in this paper. It has two main failure modes: component fracture and mechanism seizure.

The latch hook carries the gear door load from the latch loop when the mechanism is in the closed state. In the opening process, the latch hook rotates anti-clockwise, and friction forces and friction moments arise at the joints. They are determined by the contact stress and the friction coefficients, and the magnitude of the contact stress can be determined from the dynamics of the mechanism.

When the driving force from the piston rod cannot overcome the resistance of the mechanism, the mechanism seizes.

Hence, the failure criterion for seizure is:

$$O_s - O_{s0} < 0 \tag{13}$$

where $O_s$ denotes the maximum force that the piston rod can provide and $O_{s0}$ represents the resistance of the mechanism, which is 2700 N.

It is difficult to give an explicit expression for the performance function $O_s$ because of the complexity of the GDLM and the diversity of the influence factors. Therefore, a multi-body dynamic simulation model of the GDLM is established in the Virtual.Lab software, and the relationship between the unlocking force $O_s$ and the influence factors is constructed by the response surface method.


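$$\begin{aligned}
O_s ={} & -4416.10 + 187.16 X_1 + 0.3378 X_2 + 1382.69 X_3 + 0.0617 X_4 \\
& - 2.02 X_1^2 + 4.9339 \times 10^{?} X_2^2 - 261.53 X_3^2 + 2.7067 \times 10^{?} X_4^2 \\
& - 0.00239 X_1 X_2 - 23.93 X_1 X_3 - 3.9136 \times 10^{?} X_1 X_4 \\
& + 0.3236 X_2 X_3 - 1.062 \times 10^{?} X_2 X_4 - 0.02387972 X_3 X_4
\end{aligned} \tag{14}$$

where $X_1$, $X_2$, $X_3$ and $X_4$ are the influence factors; the meanings and values of the parameters are listed in Table 1.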

Typically, the total load is distributed to each component according to the construction of the mechanism. For the GDLM, the load distributed to each component can be obtained by dynamic analysis. For simplicity, we assume that the proportion of stress carried by each component is constant. The values are listed in Table 2.

According to equations (9), (11) and (12), the reliability of the GDLM is calculated at various times. Because no analytic solution is available, a numerical method based on Monte Carlo simulation is used for the calculation.
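
A Monte Carlo estimate of this kind can be sketched as follows. This is not the authors' implementation: it treats a single component, omits the Brownian noise term, uses a strength threshold that decreases linearly with total degradation and abrupt damage proportional to shock size, and relies on a hypothetical linear performance function. All parameter values are placeholders.

```python
import random


def survives(t, comp, perf, perf_limit, rng):
    """One run: True if no hard failure occurs and performance is acceptable at time t."""
    # Poisson shock arrival times on [0, t].
    times = []
    arrival = rng.expovariate(comp["rate"])
    while arrival <= t:
        times.append(arrival)
        arrival += rng.expovariate(comp["rate"])
    abrupt = 0.0  # cumulative shock-induced degradation
    for tk in times:
        stress = rng.gauss(comp["s_mean"], comp["s_std"])
        gradual = comp["phi"] + comp["gamma"] * tk ** comp["xi"]
        strength = comp["d0"] + comp["beta"] * (gradual + abrupt)
        if stress >= strength:
            return False  # catastrophic (hard) failure
        abrupt += comp["alpha"] * stress
    total = comp["phi"] + comp["gamma"] * t ** comp["xi"] + abrupt
    return perf(total) >= perf_limit  # performance (soft) failure check


def estimate_reliability(t, comp, perf, perf_limit, runs=2000, seed=0):
    """Fraction of simulated runs that survive to time t."""
    rng = random.Random(seed)
    return sum(survives(t, comp, perf, perf_limit, rng) for _ in range(runs)) / runs


# Hypothetical component and performance function, for illustration only.
comp = {"phi": 0.0, "gamma": 3.2e-7, "xi": 1.05, "alpha": 0.1, "beta": -1.0,
        "d0": 12000.0, "s_mean": 4000.0, "s_std": 120.0, "rate": 2.5e-4}
reliability = estimate_reliability(1.0e4, comp, lambda total: 3000.0 - 0.1 * total, 2700.0)
```

Evaluating `estimate_reliability` on a grid of times gives an empirical reliability curve; dropping the performance check shows the effect of ignoring performance deterioration.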

The reliability and failure probability density results are shown in Figures 5 and 6. The horizontal axis is the working time, expressed as the number of working cycles. The vertical axis is the reliability and the failure probability density, respectively, of the GDLM with performance deterioration considered.

Figure 6 shows that component catastrophic failure is less probable than system performance failure. The total system reliability depends on the outcome of the competition between component catastrophic failure and system performance failure. When system performance deterioration is considered, reliability decreases quite quickly at high reliability levels.

This shows that deterioration seriously affects equipment that requires high reliability. For mechanical systems that require precision and dependability, it is vitally important to control the deterioration and accidental shock loads.

Figure 7 presents a sensitivity analysis of the total system reliability with respect to the latch hook friction coefficient. From Figure 7, as $\gamma_3$ increases from $3 \times 10^{-7}$ to $7 \times 10^{-7}$, $R(t)$ shifts to the left. At reliability level 0.8, the number of working cycles decreases from 1.04 × 10^? to 0.55 × 10^?, by nearly half.

Table 1. Values of the parameters used in the GDLM reliability analysis.

      Parameter                            Distribution  Mean        Std. dev.  Degradation parameters
X1    Angle when latch hook contacts loop  normal        45°         0.5°
X2    Latch hook load shock                normal        10000 N     300 N      λ = 2.5 × 10^?
X3    Latch hook friction coefficient      normal        0.05        0.001      φ3 = 0; α3 = 1.0; γ3 = 5 × 10^-7; ξ3 = 1.05; σ3 = 5.5 × 10^?
X4    Stiffness of spring                  normal        7558.6 N/m  378 N/m    φ4 = 0; α4 = -1.0; γ4 = -9 × 10^?; ξ4 = 1.05; σ4 = 0.4
X5    Threshold of joint A of rocker arm   normal        1.2 GPa     0.06 GPa   β5 = -1.0; α5 = 0.1; γ5 = 3.2 × 10^-7; φ5 = 0; ξ5 = 1.05; σ5 = 5.5 × 10^?
X6    Threshold of joint B of rod BC       normal        1.2 GPa     0.06 GPa   β6 = -1.0; α6 = 0.1; γ6 = 1.2 × 10^-7; φ6 = 0; ξ6 = 1.05; σ6 = 5.5 × 10^?
X7    Threshold of joint C of rod BC       normal        1.2 GPa     0.06 GPa   β7 = -1.0; α7 = 0.1; γ7 = 1.2 × 10^-7; φ7 = 0; ξ7 = 1.05; σ7 = 5.5 × 10^?
X8    Threshold of joint D of rocker arm   normal        1.2 GPa     0.06 GPa   β8 = -1.0; α8 = 0.1; γ8 = 2.8 × 10^-7; φ8 = 0; ξ8 = 1.05; σ8 = 5.5 × 10^?
X9    Threshold of joint E of latch hook   normal        1.2 GPa     0.06 GPa   β9 = -1.0; α9 = 0.1; γ9 = 2.8 × 10^-7; φ9 = 0; ξ9 = 1.05; σ9 = 5.5 × 10^?
X10   Threshold of joint F of latch hook   normal        1.2 GPa     0.06 GPa   β10 = -1.0; α10 = 0.1; γ10 = 5.6 × 10^-7; φ10 = 0; ξ10 = 1.05; σ10 = 5.5 × 10^?

Table 2. Ratio of each component stress.

Components     Joint A  Joint B  Joint C  Joint D  Joint E  Joint F
Stress ratio   0.4      0.15     0.15     0.35     0.35     0.7

Figure 5. The reliability of GDLM at various time.

Figure 7. Sensitivity analysis of R(t) on latch hook friction coefficient.

It can be inferred that an increased degradation coefficient $\gamma_3$ decreases the system reliability. The comparison shows that the degradation rate parameters have a marked impact on the reliability of the system.


5 CONCLUSIONS 


A reliability model has been developed for multi-component systems subject to dependent competing risks of component catastrophic failure and system performance deterioration failure. These failure processes are dependent in three respects: 1) component stress shocks above a certain critical threshold affect the component degradation process, 2) the component degradation affects the component catastrophic failure threshold level, and 3) the component degradation influences the system performance.

This reliability model, intended mainly for mechanisms, considers the relationship between the system and its components. Through this relationship and the degradation models of the components, a system performance deterioration failure model has been proposed. The system performance deterioration failure process and the component catastrophic failure process are dependent and competing, and the total system reliability is the outcome of the competition between these two processes.

The model is more realistic, and also more difficult, because the failure processes are dependent not only through a common load but also through constraints between components. We presented a numerical example for an aircraft gear door lock mechanism to demonstrate the effectiveness of the proposed approach and to examine the reliability with a sensitivity analysis. The results obtained with and without system performance deterioration are compared and discussed.

For greater accuracy, the reliability model needs to consider the actual degradation rate of the latch hook friction coefficient. This is left for future work.
ABSTRACT: This paper proposes a new quantitative reliability analysis method for repairable systems with multiple correlations based on the Goal Oriented (GO) method. First, the solution methods for repairable systems with shutdown correlation, maintenance correlation, standby correlation and common cause failure in the GO method are presented, and the reliability analysis process of the proposed GO method is formulated. Then, the hydraulic oil supply system of a heavy vehicle is taken as an example for reliability analysis by the proposed GO method. Finally, to verify the feasibility and advantage of the proposed GO method, its analysis result is compared with those obtained by GO methods for a system without considering correlations and for a system considering a single kind of correlation. Overall, this study not only improves the theory of the GO methodology but also provides a new reliability analysis approach for repairable systems with multiple correlations.


1 INTRODUCTION 


The reliability of a repairable system is a prerequisite for its normal operation. Correlations, such as shutdown correlation, maintenance correlation, standby correlation and common cause failure, are a universal characteristic of repairable systems and affect system reliability directly. If these correlations are ignored, the reliability analysis result will have a large bias. Unlike Fault Tree Analysis (FTA), Failure Mode and Effects Analysis (FMEA) and Monte Carlo Simulation (MCS), the Goal Oriented (GO) methodology is a success-oriented method for system reliability analysis. It has three obvious advantages [1], as follows:


i. The GO model is developed directly from the system principle diagram, flowchart or engineering drawing, so it is more objective.

ii. The GO method can easily be combined with other technologies and thereby extended to solve various kinds of practical engineering problems, such as multi-fault modes [2-4], closed-loop feedback [5-7], multifunction [8, 9], multi-state [10-12], and so on.

iii. Both accurate quantitative analysis results and qualitative analysis results can be obtained by the GO method.


The GO method has become increasingly popular in recent years because of its advantages in establishing system models and its strong analysis power. Indeed, a large number of engineering applications have established its value [1, 13]. It is an efficient approach to reliability analysis [14], reliability optimization allocation [15, 16] and reliability assessment [17, 18]. Although the basic theory of the GO method has been improved so that it can handle some single correlations, such as standby correlation [19], the existing GO methods are not suitable for repairable systems with multiple correlations, namely shutdown correlation, maintenance correlation, standby correlation and common cause failure. The main contributions of this study are as follows:


i. A new quantitative reliability analysis method based on the GO method is expounded in detail for repairable systems with multiple correlations, namely redundancy structures with shutdown correlation, maintenance correlation, standby correlation and common cause failure.

ii. The reliability analysis process of the proposed GO method is formulated.

iii. The hydraulic oil supply system of a heavy vehicle is taken as an example, and its reliability is analyzed by the proposed GO method for the first time.


The remainder of the paper is organized as fol- 
lows. The GO method for quantitative reliability 
analysis of repairable systems with multiple cor- 
relations is proposed in Section 2. Section 3 illus- 
trates a practical example, which is a hydraulic 
oil supply system of heavy vehicle, based on the 
proposed GO method. Section 4 provides results 
analysis in order to verify the feasibility, advantage 
and reasonability of this paper’s GO method. Sec- 
tion 5 provides some conclusions on the findings 
of the research. 


2 GO METHOD FOR REPAIRABLE 
SYSTEMS WITH MULTIPLE 
CORRELATIONS 


The quantitative reliability analysis result of the GO method is obtained from the GO model by applying the GO algorithm according to the reliability analysis process. The GO algorithm and the reliability analysis process are therefore the key factors in conducting GO analysis. In this section, the GO algorithm for redundancy structures with multiple correlations and the reliability analysis process of the proposed GO method are presented in turn.


2.1 GO algorithm for repairable systems with 
multiple correlations 


The redundancy structures are often used in 
complex repairable systems, and the shutdown 
correlation, maintenance correlation, standby 
correlation and common cause failure affect the 
reliability of such structure directly. 


1. GO algorithm for redundancy structure with shutdown correlation, maintenance correlation and standby correlation

In quantitative GO analysis, a redundancy structure with shutdown correlation, maintenance correlation and standby correlation is treated as an equivalent unit represented by a Type 1 operator, which describes a two-state unit. The reliability parameters of this unit are obtained by the GO algorithm for redundancy structures with shutdown correlation, maintenance correlation and standby correlation [20], as shown in Eqs. (1)-(5).


_[(M-i+1)4, J=0OR M-i+l<K 
| KA, J=1OR M-i+1>K 
te eg: (1) 
b= 
Lh, i>k 
p=Py]” (2) 
i 70 ab 


P -Fr/5? (3) 


Ar = nek’ Gy kl / > P, (4) 
i=0 
I 
LR = Pre ET > P, (5) 
i=M-K+1 


where $M$ is the number of units in the redundancy structure; $K$ is the number of operating units; $I$ is the number of faulty units; $J$ is the standby indicator, with $J = 0$ meaning that the redundancy structure has no standby unit and $J = 1$ meaning that it has standby units; $P_0$ is the success probability of the redundancy structure when all units are operating; $P_i$ is the state probability of the redundancy structure; and $P_R$, $\lambda_R$ and $\mu_R$ are the success probability, failure rate and maintenance rate of the equivalent unit for the redundancy structure, respectively.


2. GO algorithm for repairable systems with common cause failure

In quantitative GO analysis, the quantitative analysis result can be obtained by the GO algorithm for repairable systems with common cause failure [21], as shown in Eq. (6).

$$R_S = R_0 + \sum_{m=1}^{M} C_m (R_{0m} - R_{1m}) \tag{6}$$

where $R_S$ is the system availability obtained by GO operation according to the basic GO algorithm [22]; $R_0$ is the system availability without considering common cause failure, obtained by GO operation according to the basic GO algorithm [22]; $C_m$ is the common cause failure probability of the mth redundancy structure with common cause failure, $m = 1, 2, \ldots, M$; and $R_{0m}$ and $R_{1m}$ are the system availabilities when the availabilities of all units in the mth redundancy structure with common cause failure are set to 0 and 1, respectively.


2.2 Reliability analysis process of proposed GO method


The reliability analysis process is the criterion and prerequisite for conducting quantitative GO analysis. However, the reliability analysis process of the basic GO method is not suitable for repairable systems with multiple correlations [23]. The reliability analysis process of repairable systems with multiple correlations based on the proposed GO method is therefore formulated, and its process diagram is shown in Fig. 1.

1. Analyzing system
Besides the contents of system analysis in the existing GO methodology [21], the redundancy structures with multiple correlations should be determined.

2. Developing GO model
According to the results of system analysis, function GO operators and logical GO operators are selected to represent the units and the logical relationships in the system. Then, the GO model of the system is established by connecting the GO operators with signal flows.

3. Obtaining reliability parameters
a. Obtain the reliability parameters of each unit, mainly its failure rate, maintenance rate and availability.
b. Obtain the reliability parameters of each redundancy structure with shutdown correlation, maintenance correlation and standby correlation according to Eqs. (1)-(5).
c. Obtain the reliability parameters of each redundancy structure with common cause failure, mainly the parameters of the common cause failure model, the common cause failure probability and the availability.

4. Conducting quantitative GO analysis
According to Eq. (6), the system availability can be obtained. If no shared signal exists in the GO model, the GO operation can be conducted by the direct algorithm. If shared signals exist in the GO model, the GO operation can be conducted by the exact algorithm with shared signals [22].

Figure 1. Reliability analysis process diagram of proposed GO method.


3 EXAMPLE


To illustrate the use of the proposed GO method, a quantitative analysis of a hydraulic oil supply system for a military vehicle is conducted.


3.1 Analyzing hydraulic oil supply system


1. To analyze the structure and function of the hydraulic oil supply system
The hydraulic oil supply system supplies working oil to the variable speed control system, steering control system, hydraulic torque converter and lubrication system. Its schematic diagram is shown in Fig. 2.

2. To determine the redundancy structures with multiple correlations
In the hydraulic oil supply system, the LF1 group and LF2 group are redundancy structures with shutdown correlation, maintenance correlation and standby correlation. The P1 group, LF1 group and LF2 group are subject to common cause failure caused by external disturbances, such as extremely heavy impact, oversized contaminant particles in the oil, and so on.

3. To define the success rule of the hydraulic oil supply system
According to the above analysis, the success rule of the hydraulic oil supply system is that the system can supply working oil to the variable speed control system, steering control system, hydraulic torque converter and lubrication system of a military vehicle under the high steering speed condition, without considering overload protection.

Figure 2. Schematic diagram of hydraulic oil supply system.
3.2 Developing GO model of hydraulic oil supply system


According to the above analysis, the GO model of the hydraulic oil supply system is developed as follows:

1. To select GO operators
There are 5 kinds of GO operators. The GO operators corresponding to the units and logical relations are presented in Table 1.

2. To develop the GO model
According to the analysis result of the hydraulic oil supply system and Table 1, the GO model of the hydraulic oil supply system is established, as shown in Fig. 3. In each operator, the former number and the latter number are the type and the serial number of the GO operator, respectively. The number on a signal flow is the serial number of that signal flow. Signal flow 37 is the system output.


3.3 Obtaining reliability parameters 


1. To obtain the reliability parameters of units, as presented in Table 2. In Table 2, the failure rate and maintenance rate of each unit are determined according to engineering statistics.

Table 1. Operator type of unit.

No.                Unit               Type  Description
1                  Oil pan                  Input unit
2, 3               LF1                      Two-state unit
5, 6               P1                 6     Unit controlled by two control signals
7                  Pump group power   5     Input unit
9, 10              LF2                1     Two-state unit
12, 18             LF2B, LF3B         1     Two-state unit
14                 Pressure oil tank  1     Two-state unit
15                 P2                 6     Unit controlled by two control signals
16                 LF3                1     Two-state unit
17, 24             CV2, CV3           1     Two-state unit
20                 P4                 6     Unit controlled by two control signals
21, 35, 36         RV1, RV2, RV3      1     Two-state unit
22                 P5                 6     Unit controlled by two control signals
23                 P5 power           5     Input unit
26                 P3                 6     Unit controlled by two control signals
27                 DRV                1     Two-state unit
29                 TC                 1     Two-state unit
30                 TCB                1     Two-state unit
32                 HE                 1     Two-state unit
33                 HEB                1     Two-state unit
4, 8, 11, 25, 28                      2     “OR” logical relation
37                                    10    “AND” logical relation
13, 31, 34, 19                        18    “Standby” logical relation

Figure 3. GO model of hydraulic oil supply system.

Table 2. Reliability parameters of unit.

Unit         Failure rate (/hour)  Repair rate (/hour)  Availability
1            0.00075               3.0000               0.999750062
2, 3         0.00605               1.0021               0.993999202
5, 6         0.00075               1.4933               0.999497999
7            0.02180               1.6274               0.986781433
9, 10        0.00305               1.0083               0.996984123
12, 18       0.00119               0.8865               0.996984123
14           0.00005               0.5000               0.999900010
15           0.00075               1.4933               0.999497999
16           0.00015               1.2000               0.999875015
17, 24       0.00075               1.0995               0.999318341
20           0.00075               1.4933               0.999497999
21, 35, 36   0.00060               1.5652               0.999616814
22           0.00075               1.4933               0.999497999
23           0.02180               1.6274               0.986781433
26           0.00075               1.4933               0.999497999
27           0.00110               1.0009               0.998902171
29           0.00050               0.0600               0.991735537
30           0.00119               0.8865               0.998659395
32           0.00040               0.0500               0.992063492
33           0.00119               0.8865               0.998659395
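
Several availabilities in Table 2 appear consistent with the standard steady-state relation for a two-state repairable unit, availability = repair rate / (failure rate + repair rate). This relation is an inference from the tabulated numbers and is not stated in the text; the sketch below simply checks it for a few rows.

```python
def steady_state_availability(failure_rate: float, repair_rate: float) -> float:
    """Steady-state availability of a two-state repairable unit."""
    return repair_rate / (failure_rate + repair_rate)


# (failure rate, repair rate, tabulated availability) for units 1, 29 and 32 of Table 2.
rows = [
    (0.00075, 3.0000, 0.999750062),
    (0.00050, 0.0600, 0.991735537),
    (0.00040, 0.0500, 0.992063492),
]
differences = [abs(steady_state_availability(lam, mu) - a) for lam, mu, a in rows]
```

Rows whose repair rate is printed to only four decimals agree less closely, which suggests the table was computed from unrounded rates.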


2. To obtain the reliability parameters of the LF1 group and LF2 group by Eqs. (1)-(5). In the LF1 and LF2 groups, M = 2, K = 2, I = 2, L = 2 and J = 0. The reliability parameters of the LF1 group are illustrated in detail in Table 3.

In the same way, the reliability parameters of the LF2 group can be obtained: its availability, failure rate and maintenance rate are 0.999981809, 0.000018341533 and 1.008264462, respectively.


3. To obtain the reliability parameters of the redundancy structures with common cause failure, as presented in Table 4. In this case, we adopt the common cause failure β model [23].


Table 3. Reliability parameters of LF1 group.

State number            0 (all units    1 (one unit    2 (all units
                        operating)      faulting)      faulting)
Fault unit number       0               1              2
Operating unit number   2               1              0
Standby unit number     0               0              0
a_i                     -               0.0121         0.00605
b_i                     -               1.002149395    1.002149395
P_i                     1               0.012074048    0.000072891
P_R (LF1 group)         0.9999279830
λ_R (LF1 group)         0.0000721765
μ_R (LF1 group)         1.0021493950
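The P_i row of Table 3 can be reproduced from the a_i and b_i rows if each state probability is taken relative to state 0 as a running product of a_i/b_i, and the group availability is the normalised probability of the states that still have an operating unit. The sketch below assumes this reading of the table; Eqs. (1)-(5) are not repeated here, and the function name is illustrative.

```python
def relative_state_probabilities(a, b):
    """Relative probabilities P_0..P_n with P_0 = 1 and P_i = P_(i-1) * a_i / b_i."""
    probs = [1.0]
    for a_i, b_i in zip(a, b):
        probs.append(probs[-1] * a_i / b_i)
    return probs


# a_i and b_i for states 1 and 2 of the LF1 group, copied from Table 3.
a = [0.0121, 0.00605]
b = [1.002149395, 1.002149395]

p = relative_state_probabilities(a, b)
# States 0 and 1 still have an operating unit; state 2 is the group failure state.
group_availability = (p[0] + p[1]) / sum(p)
print([round(value, 9) for value in p], round(group_availability, 8))
```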


Table 4. Reliability parameters of redundancy structure with common
cause failure.

Structure   Failure rate   β       A             C
P1          0.00075        0.3     0.999648546   7.5332e-05
LF1         0.0000721765   0.13    0.999937345   3.7407e-04
LF2         0.0000183415   0.024   0.999982246   2.4794e-05


Table 5. Quantitative analysis by GO method.

Item                        Value
R,                          0.98449496
Non-R, item (P1 group)      -7.4169e-05
Non-R, item (LF1 group)     -3.6829e-04
Non-R, item (LF2 group)     -3.2723e-08
Rs                          0.9840524683


3.4 Conducting quantitative GO analysis 
of hydraulic oil supply system 


According to the definition of a shared signal [22], signal flows 1, 4,
7, 8, 14, 15, 16, 25 and 31 are shared signals in Fig. 3. Thus, the
exact algorithm with shared signals is adopted to conduct the GO
operation, and the system availability is obtained by Eq. (6), as
presented in Table 5.


4 RESULT ANALYSIS 


In order to verify the feasibility and advantage of the proposed GO
method, its analysis result is compared with those of GO methods for
the system without considering correlations, the system considering
redundancy structure with shutdown correlation, maintenance correlation
and standby correlation, and the system considering common cause
failure. The quantitative analysis results are presented in Table 6.


Table 6. Quantitative analysis results by different GO methods.

Method                                                   System availability
This paper’s method                                      0.9840524683
Method without considering correlations                  0.9845210823
Method considering redundancy structure with shutdown
correlation, maintenance correlation and standby
correlation                                              0.9840869188
Method considering common cause failure                  0.9844857454


According to Table 6, we can see:


1. The analysis result by this paper’s method is smaller than the
result by the GO method without considering correlations. This agrees
with engineering practice. It shows that if the correlations in
systems are neglected, the reliability analysis result will be biased.

2. The analysis result by this paper’s method is smaller than the
result by the GO method considering redundancy structure with shutdown
correlation, maintenance correlation and standby correlation. This
agrees with engineering practice. It shows that the number of shut-down
units, the number of maintenance personnel and the standby structure
affect the system reliability directly.

3. The analysis result by this paper’s method is smaller than the
result by the GO method considering common cause failure. This agrees
with engineering practice. It shows that the reliability analysis of a
system with redundancy structure should consider common cause failure.


The reliability analysis process of the example shows that the GO model
is developed according to the system principle, system structure and
functional composition, and that the quantitative analysis result is
obtained by multiple GO operations. It indicates that the GO method has
clear advantages in establishing the system model and conducting
quantitative analysis.


5 CONCLUSION 


This study proposes a new quantitative reliability analysis method for
repairable systems with multiple correlations based on the GO method.
First, the solving methods for repairable systems with multiple
correlations in the GO method are presented, namely the GO algorithm
for repairable systems with common cause failure and the GO algorithm
for redundancy structure with shutdown correlation, maintenance
correlation and standby correlation. On this basis, the reliability
analysis process of the proposed GO method is formulated. Then, the
hydraulic oil supply system of a military vehicle is taken as an
example to conduct reliability analysis by this paper’s GO method.
Finally, its quantitative analysis result is compared with those of the
GO method for the system without considering correlations and for the
system considering a single kind of correlation. The analysis results
show that the correlations affect the system reliability directly, and
the reliability analysis process of the example shows that the GO
method has clear advantages in establishing the system model and
conducting quantitative analysis. All in all, this study not only
improves the theory of GO methodology, but also provides a new
reliability analysis approach for repairable systems with multiple
correlations.


ABSTRACT: Network topological optimization, i.e. developing the scheme of pipeline connections,
needs to find the best layout of all links in the urban sewage system so that the whole system is
cost-effective while satisfying some special criteria. In this paper, the application of constraints
based on prior knowledge, including the requirement to link the components in network systems,
substantially reduces the solution space of the optimization, unlike most of the literature, which
focuses on improving the efficiency of optimization algorithms through modification of the
algorithm itself. In order to evaluate it, the network topological optimization of a sewage system
serving two different cities is investigated. The performance of the solution based on the
application of prior knowledge is validated through the sewage system. The results indicate that
the application of prior knowledge can make the solution more efficient.


1 INTRODUCTION 


Urban sewage systems are important infrastructures for modern cities.
A successful design of the urban sewage system underpins its
availability, economy and reliability. As the critical part of the
design, developing the scheme of pipeline connections needs to find the
best layout of all links in the urban sewage system, which makes the
whole system cost-effective while satisfying some special criteria.
This design problem is defined as the optimization of network topology
[1].

The objective of network topological optimization is to find the
optimal layout of links among components in the system while meeting
requirements such as network performance and transmission delay. This
paper considers the network topology optimization of the urban sewage
system. In particular, its design should have a minimal cost under the
constraint that the reliability of the system based on the design is
not less than a given system reliability.

In general, the study of network topological optimization concentrates
on three aspects: the choice of a measure of reliability, the algorithm
adopted to calculate the system reliability and the solution of the
optimization model. The measures of reliability in network systems
include the connectivity measure [2], the connectivity measure with the
consideration of the performance [3], and the measure based on capacity
[4]. At the same time, the exact algorithms based on these measures of
system reliability are widely investigated [5]. However, the increased
size of network systems poses a new challenge to the computation of
system reliability. Approximation algorithms are adopted to address
this challenge [6-8].

The solution of the optimization model, unlike the choice of a measure
of reliability and the algorithm adopted to calculate the system
reliability, is mainly about how to solve the optimization model
effectively. In this respect, many optimization algorithms have been
proposed for network topological optimization [9-17]. However, most of
the literature focuses on improving the efficiency of optimization
algorithms through modification of the algorithm itself.

In this paper, the application of constraints based on prior knowledge,
including the requirement to link the components in network systems,
reduces the solution space of the optimization substantially. In order
to evaluate it, the network topological optimization of a sewage system
serving two different cities is investigated. The constraint in the
optimization is that the system reliability should be larger than a
given threshold, and the objective is to minimize the total cost of the
sewage system. In addition, the performance of the solution based on
the application of prior knowledge is validated through the sewage
system. The results indicate that the application of prior knowledge
can be helpful for the solution.


2 MODELING FOR RELIABILITY 
BASED TOPOLOGY OPTIMIZATION 
DESIGN PROBLEM 


In most research on reliability based topology optimization design, the
following are usually given: (a) the locations of sewage treatment
plants in different levels, (b) the reliability of sewage treatment
plants, (c) the cost of pipelines that connect sewage treatment plants,
(d) the constraint on the reliability of the sewage treatment system.
The layout of the sewage treatment system is actually determined by the
connection relationship among cities and sewage treatment plants in
different levels. The connection relationship is often defined by a
topological relation, and it is represented by a graph in graph theory.
The objective of reliability based topology optimization design is to
find the optimal graph that gives the sewage treatment system the
minimal cost and the required reliability. Therefore, an optimization
model is adopted for reliability based topology optimization design. In
the optimization model, the decision variable is the graph representing
the topological relation of the sewage treatment system. The objective
function evaluates the cost of the sewage treatment system; the
constraint guarantees that the reliability of the sewage treatment
system meets the requirement.


2.1 Representation of the topological 
characteristic of the sewage treatment system 


This paper adopts a graph to represent the layout of the sewage
treatment system. A graph reveals the topological characteristics of
the sewage treatment system, in particular the connection relationship
among cities and sewage treatment plants in different levels. In
addition, the adjacency matrix, a representation of the graph, is a
very useful data structure for the Genetic Algorithm to solve the
problem.

In research on reliability based topology optimization design, the
graph is used to represent the layout of systems such as social
networks and power networks. A graph g consists of the node set V and
the arc set E. The elements of V are the nodes of the graph, and the
components of the sewage treatment system, such as cities and sewage
treatment plants, are usually represented by nodes. The elements of the
set E represent the pipelines in the sewage treatment system.


Adjacency Matrix = [0/1 matrix of the graph in Fig. 1; entries illegible in the source] (1)

Although graphs show the layout of a sewage treatment system very
intuitively, it is hard for a graph to take part in the operations of
an optimization algorithm. To address this challenge, the adjacency
matrix, a representation of the graph, is adopted in this paper. The
adjacency matrix is a very useful data structure, and it enables the
graph to participate in the operations of the optimization algorithm.
The dimension of an adjacency matrix is determined by the number of
nodes in the graph. The row number or column number is consistent with
the node number. For example, the adjacency matrix of the graph in
Fig. 1 is shown in Eq. (1). The elements of an adjacency matrix can be
0 or 1. The element 0 means that there is no connection between the
node whose number equals the row number and the node whose number
equals the column number, and the element 1 means that there is a link
between them. Therefore, the adjacency matrix is equivalent to the
graph, and it can be adopted to represent the layout of the sewage
treatment system.


Figure 1. The graph corresponding to the adjacency matrix in Eq. (1).
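The row/column convention described above can be illustrated with a short sketch. It assumes 1-based node numbers and directed arcs; the four-node layout is hypothetical and is not the graph of Fig. 1.

```python
def adjacency_matrix(n_nodes, arcs):
    """Return an n x n 0/1 matrix; arcs are (from_node, to_node) pairs, 1-based."""
    matrix = [[0] * n_nodes for _ in range(n_nodes)]
    for i, j in arcs:
        # Element (i, j) = 1 means a link from node i to node j.
        matrix[i - 1][j - 1] = 1
    return matrix


# Hypothetical 4-node layout: 1 -> 2, 2 -> 3, 3 -> 4.
arcs = [(1, 2), (2, 3), (3, 4)]
for row in adjacency_matrix(4, arcs):
    print(row)
```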


2.2 Solution space in optimization model 


In fact, the objective of solving the optimization model is to find an
optimal solution in the solution space that satisfies the constraints.
Therefore, defining a reasonable solution space helps the optimization
algorithm to solve the model efficiently. In this paper, the solution
space consists of all possible layouts of the sewage treatment system
that meet the design requirements. These requirements concern the
necessary components of the whole sewage treatment system and the
connection sequence of cities and sewage treatment plants. The normal
sewage treatment system of a city includes a primary sewage treatment
plant, a secondary sewage treatment plant and a tertiary sewage
treatment plant. Therefore, the sewage treatment system must contain
all three plants. In addition, the connection sequence of cities and
sewage treatment plants must be correct: the sewage of cities must be
treated first by the primary sewage treatment plant and then by the
secondary sewage treatment plant, and the process finishes at the
tertiary sewage treatment plant.

According to the design requirements of the sewage treatment system,
this paper proposes a method that restricts the values of elements in
the adjacency matrix in order to limit the solution space. The
locations of cities and sewage treatment plants are usually given;
therefore the number of nodes in the graph is known, and the dimension
of the adjacency matrix can be determined. On the other hand, no sewage
treatment plant is connected to itself; therefore the diagonal elements
of the adjacency matrix are 0. Based on this information, the number of
unknown elements in the adjacency matrix is decreased. The adjacency
matrix can be represented as in Eq. (2).


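$$
\begin{array}{c|ccccccc}
      & 1 & 2 & 3 & \cdots & n-2 & n-1 & n \\ \hline
1     & 0 & a_{1,2} & a_{1,3} & \cdots & a_{1,n-2} & a_{1,n-1} & a_{1,n} \\
2     & a_{2,1} & 0 & a_{2,3} & \cdots & a_{2,n-2} & a_{2,n-1} & a_{2,n} \\
3     & a_{3,1} & a_{3,2} & 0 & \cdots & a_{3,n-2} & a_{3,n-1} & a_{3,n} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
n-2   & a_{n-2,1} & a_{n-2,2} & a_{n-2,3} & \cdots & 0 & a_{n-2,n-1} & a_{n-2,n} \\
n-1   & a_{n-1,1} & a_{n-1,2} & a_{n-1,3} & \cdots & a_{n-1,n-2} & 0 & a_{n-1,n} \\
n     & a_{n,1} & a_{n,2} & a_{n,3} & \cdots & a_{n,n-2} & a_{n,n-1} & 0
\end{array}
\tag{2}
$$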


In fact, the number of unknown elements is decreased from $n^2$ to
$n^2 - n$, which means that the solution space is limited. Therefore,
the solution space is determined by the $n^2 - n$ elements, and they
can be represented by $X = (x_1, x_2, \ldots, x_{n^2-n})$.

2.3 Objective function in optimization model 


The objective function calculates the cost of the sewage treatment
system. In this paper, the construction cost of the sewage treatment
system is ignored; therefore the cost of the sewage treatment system is
determined by the connections among the plants, and it is mainly
affected by the length of the connecting pipelines. According to the
design requirements, the primary sewage treatment plant is near the
cities, but the secondary sewage treatment plant and the tertiary
sewage treatment plant are far from the cities. Fortunately, the costs
of the different connections are usually known, and they are often
represented by a matrix called the cost matrix A, as shown in Eq. (3).


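$$
A =
\begin{array}{c|ccccccc}
      & 1 & 2 & 3 & \cdots & n-2 & n-1 & n \\ \hline
1     & c_{1,1} & c_{1,2} & c_{1,3} & \cdots & c_{1,n-2} & c_{1,n-1} & c_{1,n} \\
2     & c_{2,1} & c_{2,2} & c_{2,3} & \cdots & c_{2,n-2} & c_{2,n-1} & c_{2,n} \\
3     & c_{3,1} & c_{3,2} & c_{3,3} & \cdots & c_{3,n-2} & c_{3,n-1} & c_{3,n} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
n-2   & c_{n-2,1} & c_{n-2,2} & c_{n-2,3} & \cdots & c_{n-2,n-2} & c_{n-2,n-1} & c_{n-2,n} \\
n-1   & c_{n-1,1} & c_{n-1,2} & c_{n-1,3} & \cdots & c_{n-1,n-2} & c_{n-1,n-1} & c_{n-1,n} \\
n     & c_{n,1} & c_{n,2} & c_{n,3} & \cdots & c_{n,n-2} & c_{n,n-1} & c_{n,n}
\end{array}
\tag{3}
$$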


The cost of each connection is given by the cost matrix A, and the
connections that exist in the layout are given by the decision variable
X. Therefore, the objective can be represented by Eq. (4).


$$
\text{Min:}\quad c = F(A, X) \tag{4}
$$


2.4 Constraint in optimization model 


The constraints in the optimization model guarantee that the layout
obtained from the model meets the reliability requirement and the
design requirements.

When the layout of the sewage treatment system is determined, its
reliability can be calculated using a computational method of network
reliability. At present, there are four such methods: 1) calculating
the probability of connectivity, 2) calculating the probability of
connectivity with the consideration of the network capacity, 3)
calculating the probability of connectivity with the consideration of
the network performance, 4) calculating the reliability based on
mission. This paper adopts the probability of connectivity to calculate
the reliability of the sewage treatment system. The result of this
method is also called s-t reliability, and it is determined by the
probability of connectivity of all the paths from the source node to
the terminal node. In this paper, the source node is the city, and the
terminal node is the tertiary sewage treatment plant. In addition, the
reliability of pipelines is not considered; therefore the reliability
of the sewage treatment system is determined by its components. The
reliability of the components is represented by
$R = (R_1, R_2, \ldots, R_n)$, and these values are given in most
studies. Using the reliability of components R and the connection
information X, the constraint is defined in Eq. (5), where R* is the
system reliability threshold in the requirement.


$$
\text{s.t.}\quad g(R, X) > R^{*} \tag{5}
$$


According to the content above, the optimization model is defined in
Eq. (6).


$$
\begin{aligned}
\text{Min:}\quad & c = F(A, X) \\
\text{s.t.}\quad & g(R, X) > R^{*}
\end{aligned}
\tag{6}
$$
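The paper does not spell out how g(R, X) is evaluated. As one possible illustration, the sketch below computes an s-t reliability by brute-force enumeration of node states, assuming perfectly reliable pipelines, independent node failures and directed links. The function name `st_reliability` and the example networks are hypothetical; this is not presented as the authors' algorithm.

```python
from itertools import product


def st_reliability(adj, rel, source, target):
    """Probability that `target` is reachable from `source` through working nodes.

    adj: 0/1 adjacency matrix (list of lists), 0-based indices.
    rel: node reliabilities; links are assumed perfectly reliable.
    """
    n = len(rel)
    total = 0.0
    for state in product((0, 1), repeat=n):
        if not (state[source] and state[target]):
            continue  # source or terminal node is down: no connection
        prob = 1.0
        for up, r in zip(state, rel):
            prob *= r if up else 1.0 - r
        # Depth-first search restricted to working nodes.
        seen, stack = {source}, [source]
        while stack:
            u = stack.pop()
            for v in range(n):
                if adj[u][v] and state[v] and v not in seen:
                    seen.add(v)
                    stack.append(v)
        if target in seen:
            total += prob
    return total


# Hypothetical series layout 0 -> 1 -> 2: reliability is the product of the nodes.
chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
print(round(st_reliability(chain, [0.9, 0.8, 0.7], 0, 2), 6))
```

Enumeration grows as 2^n in the number of nodes, so it is only practical for small networks such as the seven-node case considered later.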


3 A CASE STUDY 


A sewage treatment system serving two cities is analyzed in this paper.
There are two primary sewage treatment plants, two secondary sewage
treatment plants and one tertiary sewage treatment plant. Their node
numbers and reliabilities are shown in Table 1. It should be noted that
there is some pressurization equipment in the cities; therefore the
cities also have a reliability. Because the locations of cities and
sewage treatment plants are given in Fig. 2, the cost matrix is also
known, and it is shown in Eq. (7). At the same time, the threshold R*
is 0.8.


1 2 3 4 5 6 7 
0 100 11 43 30 55 60 


100 0 39 14 42 28 58 
11 39 0 50 28 37 42 


(7)


3.1 Building the optimization model


Table 1. The node number and reliability information.

Component name                        Node number   Reliability
City A                                1             0.9895
City B                                2             0.9882
Primary sewage treatment plant A      3             0.9535
Primary sewage treatment plant B      4             0.9886
Secondary sewage treatment plant A    5             0.9328
Secondary sewage treatment plant B    6             0.9761
Tertiary sewage treatment plant       7             0.9585


Figure 2. The locations of cities and components in the sewage system.


There are seven nodes in this case; therefore the dimension of the
adjacency matrix is 7 x 7. In addition, there are some more design
requirements on the connection sequence of cities and sewage treatment
plants: 1) a city can only connect with the other city or the primary
sewage treatment plants, 2) a primary sewage treatment plant can only
connect with the other primary sewage treatment plant or the secondary
sewage treatment plants, 3) a secondary sewage treatment plant can only
connect with the other secondary sewage treatment plant or the tertiary
sewage treatment plant. Thus, the adjacency matrix is defined in
Eq. (8), and the decision variable X in this case is defined as
$X = (x_1, x_2, \ldots, x_{16})$.


$$
\text{Adjacency Matrix} =
\begin{array}{c|ccccccc}
  & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ \hline
1 & 0 & x_1 & x_2 & x_3 & 0 & 0 & 0 \\
2 & x_4 & 0 & x_5 & x_6 & 0 & 0 & 0 \\
3 & 0 & 0 & 0 & x_7 & x_8 & x_9 & 0 \\
4 & 0 & 0 & x_{10} & 0 & x_{11} & x_{12} & 0 \\
5 & 0 & 0 & 0 & 0 & 0 & x_{13} & x_{14} \\
6 & 0 & 0 & 0 & 0 & x_{15} & 0 & x_{16} \\
7 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{array}
\tag{8}
$$


The objective function calculates the cost through the Hadamard product
of the cost matrix and the adjacency matrix. The Hadamard product is a
new matrix whose dimension is still 7 x 7. Its non-zero elements are
the costs of the connections that exist in the designed layout, and
their sum is the total cost of the layout.
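The cost evaluation can be sketched as follows. The positions of x1 to x16 follow the three connection rules stated for Eq. (8) (read row by row), which is an assumption about the variable ordering; the uniform cost matrix is hypothetical and does not use the values of Eq. (7). The bit string is the "optimal design" entry of Table 2.

```python
# Positions (row, column), 1-based, assumed for x1..x16 in Eq. (8), row by row.
FREE_POSITIONS = [
    (1, 2), (1, 3), (1, 4),
    (2, 1), (2, 3), (2, 4),
    (3, 4), (3, 5), (3, 6),
    (4, 3), (4, 5), (4, 6),
    (5, 6), (5, 7),
    (6, 5), (6, 7),
]


def adjacency_from_decision(x, n=7):
    """Place the 16 decision bits into an n x n adjacency matrix."""
    adj = [[0] * n for _ in range(n)]
    for bit, (r, c) in zip(x, FREE_POSITIONS):
        adj[r - 1][c - 1] = bit
    return adj


def hadamard(a, b):
    """Element-wise product of two equally sized matrices."""
    return [[p * q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def total_cost(cost, adj):
    """Sum of the entries of the Hadamard product of cost and adjacency matrices."""
    return sum(sum(row) for row in hadamard(cost, adj))


# Hypothetical cost matrix: every possible link costs 10 (NOT the values of Eq. (7)).
cost = [[0 if i == j else 10 for j in range(7)] for i in range(7)]
x = [int(ch) for ch in "0011000000010001"]
print(total_cost(cost, adjacency_from_decision(x)))  # 4 links at cost 10 each -> 40
```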

Besides the reliability requirement, there are some more constraints in
this optimization model. They guarantee that: 1) the sewage of cities A
and B is discharged, 2) the sewage discharged by cities A and B is
treated by the primary sewage treatment plants, 3) the sewage
discharged by the primary sewage treatment plants is treated by the
secondary sewage treatment plants, 4) the sewage discharged by the
secondary sewage treatment plants is treated by the tertiary sewage
treatment plant. All these requirements are defined in Eq. (9).
Therefore, based on all of the above, the optimization model is defined
in Eq. (10).


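$$
\begin{aligned}
x_1 + x_2 + x_3 &\ge 1 \\
x_4 + x_5 + x_6 &\ge 1 \\
x_2 + x_3 + x_5 + x_6 &\ge 1 \\
x_8 + x_9 + x_{11} + x_{12} &\ge 1 \\
x_{14} + x_{16} &\ge 1
\end{aligned}
\tag{9}
$$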


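$$
\begin{aligned}
\text{Min:}\quad & c = F(A, X) \\
\text{s.t.}\quad & g(R, X) > R^{*} \\
& x_1 + x_2 + x_3 \ge 1 \\
& x_4 + x_5 + x_6 \ge 1 \\
& x_2 + x_3 + x_5 + x_6 \ge 1 \\
& x_8 + x_9 + x_{11} + x_{12} \ge 1 \\
& x_{14} + x_{16} \ge 1
\end{aligned}
\tag{10}
$$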
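The discharge and treatment requirements can be checked mechanically. The sketch below encodes them as five inequalities over x1 to x16, which is one reading of the four requirements listed above under the assumed variable ordering (cities first, then primary, then secondary plants). It then tests the two decision variables of Table 2 and counts how many of the 2^16 bit patterns pass. The function name is illustrative, and the reliability constraint is not included.

```python
from itertools import product


def satisfies_flow_constraints(x):
    """x is a sequence of 16 bits; x[0] is x1, ..., x[15] is x16."""
    x1, x2, x3, x4, x5, x6 = x[0:6]
    x8, x9, x11, x12 = x[7], x[8], x[10], x[11]
    x14, x16 = x[13], x[15]
    return (
        x1 + x2 + x3 >= 1                 # city A discharges its sewage
        and x4 + x5 + x6 >= 1             # city B discharges its sewage
        and x2 + x3 + x5 + x6 >= 1        # city sewage reaches a primary plant
        and x8 + x9 + x11 + x12 >= 1      # primary output reaches a secondary plant
        and x14 + x16 >= 1                # secondary output reaches the tertiary plant
    )


original = [int(ch) for ch in "1000010000100100"]  # Table 2, original design
optimal = [int(ch) for ch in "0011000000010001"]   # Table 2, optimal design
print(satisfies_flow_constraints(original), satisfies_flow_constraints(optimal))

feasible = sum(1 for x in product((0, 1), repeat=16) if satisfies_flow_constraints(x))
print(feasible, "of", 2 ** 16, "bit patterns satisfy the flow constraints")
```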


Figure 3. Cost in iterations.


Figure 4. The layout of the sewage treatment system.


3.2 Solving the optimization model 


A Genetic Algorithm is adopted to solve the optimization model. The
population size is 200. The crossover rate and mutation rate are 0.8
and 0.5, respectively. The stopping criterion is 200 iterations. This
set of parameters is chosen for an efficient solution, so that the
performance of applying the constraints based on prior knowledge can be
evaluated. The result shown in Fig. 3 indicates that the cost converges
to 971. The layout of the sewage treatment system, according to the
decision variable X, is shown in Fig. 4. The cost in earlier
generations, such as 1338 in the third generation and 1154 in the
eighth generation, is higher than the converged value, which indicates
that the optimized result is obtained. At the same time, the solution
without the constraints in Eq. (8) is obtained. Although the final
solution is the same as the solution obtained with the constraints
based on prior knowledge, the solving process has to be executed many
times to find a solution that meets the design requirements.
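For orientation, a generic bit-string genetic algorithm with the stated population size, crossover rate, mutation rate and iteration count might look like the sketch below. The paper does not describe its selection, crossover or mutation operators, so binary tournament selection, single-point crossover, single-bit mutation and elitism are assumptions made here, and the toy fitness function is hypothetical (it is not the cost or reliability model of the case study).

```python
import random


def genetic_search(fitness, n_bits=16, pop_size=200, generations=200,
                   crossover_rate=0.8, mutation_rate=0.5, seed=0):
    """Minimise `fitness` over bit lists with a simple elitist genetic algorithm."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = min(population, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        children = [best[:]]  # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            # Binary tournament selection of two parents.
            p1 = min(rng.sample(population, 2), key=fitness)
            p2 = min(rng.sample(population, 2), key=fitness)
            child = p1[:]
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_bits)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            if rng.random() < mutation_rate:
                pos = rng.randrange(n_bits)  # flip one randomly chosen bit
                child[pos] = 1 - child[pos]
            children.append(child)
        population = children
        best = min(population, key=fitness)
        history.append(fitness(best))
    return best, history


def toy_cost(x):
    """Hypothetical fitness: number of links, heavily penalised if the last link is absent."""
    return sum(x) + (100 if x[-1] == 0 else 0)


best, history = genetic_search(toy_cost, generations=50)
print(best, history[0], history[-1])
```

Because the best individual is carried over in every generation, the recorded best cost never increases, which is the kind of convergence curve a plot such as Fig. 3 shows.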


4 CONCLUSION 


This study has formally defined a reliability based topology
optimization design problem through a sewage treatment system, in order
to find the optimal layout of a sewage treatment system that serves two
cities. The optimal layout of the sewage treatment system should have
the minimum cost subject to the reliability constraint R*. A modeling
and analysis method is proposed in this paper for the reliability based
topology optimization design problem. The method can be used to build
the objective function and the constraints. In addition, the prior
knowledge about the design requirements is used to determine the
elements in the adjacency matrix, which limits the solution space and
makes the algorithm efficient for solving the problem. Finally, a
Genetic Algorithm is adopted to solve the optimization model. Following
the procedure shown in Fig. 5, the optimal layout of the sewage
treatment system is obtained. The cost of the optimal layout is 971,
and its reliability is 0.8266, which meets the reliability requirement.
Compared with the original design shown in Fig. 6, the change in the
optimal decision variable is presented in Table 2. The method moves the
design toward a more feasible one. In particular, the application of
constraints based on prior knowledge makes the solution efficient.
Therefore, this method is shown to be helpful for the reliability based
topology optimization design problem.


[Flowchart: Analyze the system; define the objective and criterion of optimization; build the objective function and assemble the prior knowledge; build the constraints; solve through the optimization algorithm.]

Figure 5. The procedure of the method.


Figure 6. The original design of the sewage system.


Table 2. The comparison between the original design and the optimal
design.

Type                  Decision variable
The original design   1000010000100100
The optimal design    0011000000010001


NOMENCLATURE

g      The graph
V      The set of nodes in graph
E      The set of arcs in graph
A      The cost matrix
X      The decision variable
F(.)   The objective function
R      The reliability of components
g(.)   The function of system reliability
c      The total cost of a sewage treatment system
R*     The threshold in reliability requirement


ABSTRACT: Risk reduction can be conducted through constructive, technical, organizational or
personnel measures. State of the art is to determine the necessary risk reduction, taking into account
control-dependent protective measures in the sense of functional safety. Complementary protective
measures are a special subgroup of protective measures.

Such additional safety functions extend the usual quantified sensor-logic-actuator chain of safety
functions by the operator and the necessary human-machine interface. Standard methods of
quantification only apply to the reliability of the technical part of such complementary safety
functions. This paper deals with the challenge of taking the human-machine interface into account
to a large extent. It proposes a systematic approach using a combination of methods offered by
standards and literature to determine and quantify the risk reduction of the overall complementary
safety instrumented functions. The paper summarizes the proposal of an extension for the
assessment of complementary protective functions.


1 INTRODUCTION 


1.1 Motivation 


The ongoing technological changes affect tradi- 
tional workplaces. Workers change their role from 
part of the production chain to part of the control 
system as the level of automation is ever increas- 
ing. This affects how safety is ensured. Therefore it 
is important that human-machine interfaces fulfil 
the requirements of safety-related functions. 

Some of these interfaces are represented by 
mechanical control elements that initiate electronic 
safety functions. These functions are grouped 
together under complementary protective meas- 
ures. Typically these measures are not operated 
as often as a control-dependent safety function, 
and are considered to be a complementary safety 
device. 

ISO 12100 provides details on complementary 
protective measures. These are measures which are 
neither inherently safe design nor safeguarding, 
but are required due to intended use or reasonably 
foreseeable misuse of the machine (ISO 12100, 
2010). 

Nevertheless, these human-machine interfaces 
must be taken into account for an overall risk 
assessment of machinery and systems. 

The scope of IEC 61508 covers the overall risk assessment of systems in
which electrical/electronic/programmable electronic safety-related
systems contribute significantly to the overall risk reduction. This
standard also defines human error as a systematic failure which has to
be considered in the overall risk assessment (IEC 61508, 2011). Hence,
if complementary safety functions are part of the risk reduction
strategy in integration, operation or maintenance, they have to be
considered in the overall risk assessment and evaluation. This holds
regardless of whether complementary safety functions are treated as
standard safety functions and regardless of their risk reduction
effect. For instance, their risk reduction could be considered as a
safety buffer (safety factor, back-up safety function) in addition to
more automated safety functions.

Other standards like the EN 61511 also require that the design of a
safety instrumented function shall take into account human error
(IEC 61511, 2004). This means that the assessment of the control
elements of human-machine interfaces is part of the overall assessment
of safety functions.

Nonetheless, the available standards do not give a sufficient answer to
the question of how to assess the human-machine interfaces. For
instance, it has to be discussed whether the assessment should be
qualitative or quantitative.

For this reason, there is a need for research to assess such interfaces
as part of the safety chain.


1.2 Challenge/main topic


The Machinery Directive 2006/42/EC prescribes a 
hierarchy of measures for risk reduction (Machin- 
ery Directive 2006/42/EC, 2006): 


• Inherently safe design measures,
• technical protective measures, and
• information for users.


Technical protective measures can be divided into control-independent
and control-dependent measures. If risks are reduced by
control-dependent measures, a validation must be conducted to prove
sufficient risk reduction.

For this verification, the assessment results are usually summarized in
a Reliability Block Diagram (RBD) (IEC 61508, 2011). Based on
stochastic data, calculations and estimation procedures, a quantitative
reliability value (e.g. called SIL osim) is determined for each safety
function block and the overall system.

However, complementary protective measures take a special position.
They are initiated by an operator and not by a sensor. Hence, in order
to evaluate complementary protective measures, the reliability block
diagram is proposed to be extended as shown in Figure 1 to include the
operator and the human-machine interface.

It is necessary to review in which form the operator and the mechanical
interface are included in the standards of reliability assessment.
Therefore, the state of the art for both areas is reviewed in the
paper, with a focus on quantitative approaches:


• State of practice to assess the operator is to use the human
reliability analysis approaches.

• State of practice to assess electrical/electronic/programmable
electronic safety-related systems is to use component failure values,
calculation and estimation methods.


It has to be discussed which of the two method sets can be used to
assess the design of the human-machine interface properly.

To do so, complementary protective measures are first introduced in
section 2 and subdivided into different technologies and operating
procedures.


Figure 1. Extended reliability block diagram.


In section 3 a taxonomy is developed to choose 
appropriate methods and requirements for needed 
extensions are listed. 

Section 4 then reviews the appropriate methods 
presented by standards and literature and discusses 
approaches to extend existing methods, i.e. meth- 
ods that are not yet mentioned in standards and 
thus are beyond current best practices. 

This extension should enable determining and quantifying the
reliability of mechanical human-machine interfaces. The aim is to
support the determination of a total reliability value for
complementary protective measures.

In section 5, the most suitable methods for determining human
reliability and technical safety for complementary protective measure
interfaces are listed. Finally, suitable additions to the existing
methodology to assess safety functions with human-machine interfaces
are described.


2 HUMAN MACHINE INTERFACE 


Complementary protective measures are among the
few safety-related human-machine interfaces.
Directive 2006/42/EC stipulates that any machine must
be equipped with one or more complementary
protective measures, unless such measures would not
reduce the risk.

Areas of application of complementary protective
measures include machines, process engineering and
emergency braking devices on trains, as well as
disconnectors as used in the automotive sector.
Emergency shutdowns in laboratories or control
rooms are an integral part of such facilities.

Since these control elements initialize a safety 
function, they need to be highly available. For this 
reason they have to be clearly recognizable, easily 
visible and quickly accessible (ISO 13850, 2014; 
Machinery Directive 2006/42/EC, 2006). 

The design of the control elements depends on 
the requirements of the application. Due to this, 
they can be designed as pushbuttons, wires, ropes, 
rails, handles or handlebars (Gehlen and Rudnik, 
2015). In addition, there are also large scale control 
elements or specially designed foot switches. 

The selection also depends on the type of actuation,
e.g. by hand or foot. Actuation by other parts of
the body, such as elbows or knees, is less common but
also used (ISO 13850, 2014; IEC 60947, 2015;
Gehlen and Rudnik, 2015).

Special designs are becoming increasingly
important in order to ensure adaptation to the
operator or the working environment, for example
as compensation for physical disabilities or as
additions to control elements to prevent
accidental actuation.

Due to this variety of common control elements,
an extension of the methodology is necessary in
order to allow the inclusion of control elements
when determining reliability.

To this end, section 3 defines requirements for
choosing suitable methods. These requirements are
needed to categorize and select promising existing
and emerging methods of human reliability and
technical safety.


3 CLASSIFICATION TAXONOMY OF 
METHODS FOR HUMAN-MACHINE 
INTERFACE ASSESSMENT 


The necessity to assess the human-machine inter- 
face was described in section 1. This section 
develops a taxonomy for choosing appropriate 
assessment methods and points out requirements 
for the needed extensions. 

To support an overall risk assessment in the
sense of IEC 61508, valid input data are necessary.
Since IEC 61508 does not list many methods for
risk assessment, those listed by ISO/IEC 31010
are discussed.

In order to use the output of the methods within
the reliability block diagram, it should be
possible to convert the output into quantitative
values.

For the human-machine interface of comple- 
mentary protective measures there is a lack of this 
kind of data. 

The method or combination of methods to be
selected should be able to make a substantial
contribution to the quantification of such interfaces.

The method should consider the design requirements
defined in Directive 2006/42/EC. Interface
devices must be


• clearly identifiable,
• clearly visible and
• quickly accessible.


This has to hold for different product types and 
all kinds of actuations mentioned in section 2. 

Furthermore, performance shaping factors such as
design, temperature, noise or workflow should be
measurable by the selected set of methods.

The set of methods should build on at least
partially validated approaches as documented
in reviewed literature and be already accepted by, or
acceptable to, experts as well as practitioners.

In summary, the methods are examined with 
regard to the applicability of the following items: 


1. Delivery of quantitative data;

2. Existing scientific acceptance and level of
validation to build on;

3. Possibility to link the method set to the concept
of Safety Integrity Levels (SIL) as defined
in IEC 61508 and related standards;

4. Possibility to assess the fulfillment of existing
design requirements according to Machinery
Directive 2006/42/EC;

5. High expected scientific acceptance and practical
feasibility of the methods.


4 REVIEW OF METHODS 


This section reviews and discusses methods,
techniques and measures offered by major standards
in general and in the domain of functional safety
and machine safety. First, methods for human
reliability prediction are reviewed and discussed.
In doing so, all aspects listed in section 3 are
considered.


4.1 Standard of human reliability analysis 


Human Reliability Analysis (HRA) methods are
commonly divided into three generations. The first
generation focuses on quantification in terms of
success/failure of actions. The second generation
focuses on cognitive aspects that cause errors by
taking into account performance shaping factors.
The third generation focuses on the relations and
dependencies between human performance factors
(Di Pasquale et al., 2015).

Methods of all generations are divided into
analytical and expert estimation procedures, as
characterized by Sträter (1997).

Examples of expert estimation methods are the
Human Error Assessment and Reduction Technique
(HEART) (Swain and Guttmann, 1983), SLIM and
Standardised Plant Analysis Risk-Human Reliability
Analysis (SPAR-H). In these procedures, the cognitive
area is the main focus. The data employed are based
on studies, surveys and extensive literature research.

Examples of analytical methods are the Technique
for Human Error Rate Prediction (THERP)
and the Accident Sequence Evaluation Programme
(ASEP) (Swain and Guttmann, 1983) as well
as Méthode d'Evaluation de la Réalisation des
Missions Opérateur pour la Sûreté (MERMOS),
Connectionism Assessment of Human Reliability
(CAHR), Cognitive Reliability and Error Analysis
Method (CREAM) (Zhang and Tan, 2018),
Systematic Human Error Reduction and Prediction
Approach (SHERPA) (Bligård and Osvalder,
2014) and A Technique for Human Error Analysis
(ATHEANA). In such methods, the assessment
of the human error probability (HEP) is based on
lists of error probabilities and uncertainty factors
(Kirwan, 1998). Values for these lists were determined
by observations and tests in simulated and
quasi-real situations.

To evaluate human error scenarios, fault tree
analysis can be used, as was done for the maintenance
procedures of a pump in (Noroozi et al., 2014).
Quantitative data from human error prediction tables,
as first introduced by THERP, are the empirical
basis of this method. The output of these
methods can also be presented quantitatively.

However, the data presented by human error 
prediction tables consider only the operator and 
not the human machine interface, in particular 
not the interface of complementary protective 
measures. 

Nonetheless, these methods deliver interesting
approaches to quantify qualitatively estimated data
and to interpret data from test scenarios. They can
furthermore be used to define test settings. In the
further course of the project, it is proposed to
scrutinize the HRA methods further (Petruni et al., 2017).

Design requirements are not within the scope of
these methods because of their focus on human
error prediction. For the same reason, they are
of limited applicability for the assessment of the
mechanical part of the human-machine interface
and its interaction with humans.


4.2 Standard technical safety methods 


ISO/IEC 31010 provides risk assessment methods.
They are categorized into look-up methods, statistical
methods, control assessment methods, function
analysis, scenario analysis, supporting methods
and others (IEC/ISO 31010, 2009; Tixier et al.,
2002).

Look-up methods basically use hazard lists to
generate input for further analysis. It is common to
use these methods in early phases of risk
identification, construction or design processes.
Therefore, they only indicate whether risks are
possible or not. Examples are check-lists, preliminary
hazard analysis (PHA) and brainstorming
(IEC/ISO 31010, 2009).

Delivering quantitative output is not their purpose,
so they offer no approaches to quantify the
output. Owing to their simple structure, they are
not suited to proving the reliability of
complementary protective measures. Nevertheless,
in an assessment process they are seen as a valid
starting point and as a means to achieve
completeness of risk assessments.

Control assessment methods take into account
all layers of protection to assess risks (IEC/ISO
31010, 2009). Examples are bow-tie
analysis and layer of protection analysis (LOPA).
These methods are also used for risk assessment in
the field of functional safety as in IEC 61508.

These methods evaluate systems and their different
protection measures, including the wide
range of technical, intercompany, organizational
and emergency response measures (Gowland, 2006).

Typical output data are management recommendations
and the prioritization of risk measures. This
kind of output cannot be used for SIL quantification.
Therefore, a transfer to the evaluation of
complementary protective measures is not indicated.

Due to their focus on global protection
layers, the methods do not address single design
factors.

Function analysis deals with the functional units
or main functions of technical systems. Examples
are the Hazard and Operability study
(HAZOP) and Hazard Analysis and Critical Control
Points (HACCP) (IEC/ISO 31010, 2009).

In these methods, the considered system is divided
into components, units or subsets. Nominal
functions are assigned to these subgroups.

Subsequently, parameters are formulated that
lead to a deviation from the nominal function. A
qualitative discussion then addresses the extent to
which the identified deviations lead to an increase
of risks (IEC/ISO 31010, 2009).

These methods are also used for risk assessment
in the field of functional safety as in IEC 61508.
However, the focus is on the identification of
risks that could be minimized later on by
control-independent protection measures. This approach
therefore determines the necessary degree of risk
reduction and thus the requirements for standard
safety integrated functions. It does not assess the
achieved value of the actual technical implementation
or the human-machine interface.

Finally, function analysis yields a level of the
necessary risk reduction. This leads to further
requirements for the assessment process, such as
the use of statistical data in the reliability block
diagram describing the proposed architectures.

According to the discussion given, this method
type cannot be used to prove the achieved level of
reliability of the human-machine interface.
Accordingly, the scope of the method does not cover
design and performance shaping factors either.

Other methods are e.g. consequence/probability 
matrix and risk indices. 

These methods assign the values of a system in
matrices or tables to determine a risk index.

Whether quantitative values are delivered depends
on quantitative input data. When using convertible
data, a quantitative overall value can be determined.
Normally, semi-quantitative scales are used (IEC/ISO
31010, 2009).

Due to their general approach, these methods are
not specific; they only rank input data. In
general, these methods are not stand-alone methods,
but in combination with other methods they are
of interest for complementary protective measures.
In this way, it would be possible to rank the
design data.

Scenario analysis comprises scenario-based
methods. To derive solutions, these methods determine
possible effects and causes for defined scenarios.
In doing so, top events can be analyzed, but not all
risks can be identified.

Event tree analysis (ETA) follows an inductive
approach to identify consequences which result
from an initiating event. Due to this approach,
this method is not usable for risk evaluation
(Stapelberg, 2009).

Fault tree analysis (FTA) is a deductive process.
By subdividing a system into its basic components,
this method identifies factors which contribute to
an undesired event (Vesely et al., 1981). Besides
software and hardware failures, human errors can
also be considered.

In order to be able to create fault-trees there must 
be detailed knowledge of the system. To obtain 
additional detailed quantitative values, quantita- 
tive input data is required. For complementary 
protective measures these cannot be provided. 

By using quantitative input data FTA is able 
to predict failure probability for a specific event. 
However, used stand-alone, it cannot be used to 
predict SIL values. 
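
The quantitative use of a fault tree can be sketched as follows, assuming statistically independent basic events combined by AND and OR gates. The tree structure, the event names and all probabilities are hypothetical and serve only to show the arithmetic; as noted above, such input data are not available for complementary protective measures.

```python
from math import prod


def and_gate(probabilities):
    """Output event occurs only if all independent input events occur."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output event occurs if at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical basic-event probabilities per demand (not from the paper).
p_not_seen = 0.02          # control element is not recognized in time
p_not_reached = 0.01       # control element cannot be reached in time
p_contact_fault_a = 0.001  # redundant contact A fails
p_contact_fault_b = 0.001  # redundant contact B fails

p_interface_fails = or_gate([p_not_seen, p_not_reached])
p_hardware_fails = and_gate([p_contact_fault_a, p_contact_fault_b])
p_top_event = or_gate([p_interface_fails, p_hardware_fails])
```

With these illustrative numbers the top event is dominated by the interface branch, which is the kind of contribution the reviewed standard methods do not yet quantify.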

Due to its general approach, FTA can be used
in many application areas. The focus can
be placed on design and performance shaping factors.
In this way, fault tree analysis could be used to
assess design failures which can lead to a delayed
response.

Supporting methods are methods guided by a
survey manager. In contrast to normal survey
methods, negative group effects can be counteracted
by the survey manager. Examples are structured or
semi-structured interviews and
the Delphi technique (IEC/ISO 31010, 2009).

These methods deliver no quantitative data, but
they can be used to provide input data for other
methods. In combination with fault tree analysis,
supporting methods could help to identify factors
which contribute to an undesired event.

Supporting methods deliver no data to prove
the achieved SIL.

Supporting methods are applicable to every area.
Through the survey manager, the focus can be placed
on design and performance shaping factors.

Besides contributing input for fault tree analysis,
supporting methods could be used in test scenarios
to capture the subjective opinions of the test
persons.

Statistical methods use mathematical models. By
using reliability data (e.g. failure rates), these
methods are able to predict the possible development
of a protective measure. Examples are
Markov chains, Monte Carlo simulation and Bayes
nets (IEC/ISO 31010, 2009).

In the field of functional safety, methods like
parts count and parts stress are also used in Failure
Modes, Effects and Diagnostic Analysis (FMEDA)
(Smith, 2017) and fault tree applications. Such
methods often offer a high resolution with respect
to failure modes, e.g. short circuit, drift and
interruption failures, at least for simple
components (as opposed to complex components
such as ASICs or microcontrollers). With such
highly standardized methods, an overall reliability
value of the safety function is calculated from
single values for each component as well as from
diagnostic coverage, safe failure fraction and
hardware failure tolerance values (Smith, 2017).

[Figure 2: flow chart with the steps "requirements,
specifications", "subdivision of factors",
"assessment, categorization and systematization of
the results" and "ranked parameters/factors".]

Figure 2. Planned method chain.

Statistical methods cannot be used in the first 
instance, if there is no quantitative reliability data 
available. However, if such values can be deter- 
mined by other methods, system state analysis and 
reliability prediction methods can be used to deter- 
mine the listed overall reliability values. 

Due to the generic mathematical model these 
inductive and deductive methods are suitable for 
almost every area of application. 

Design or performance shaping factors are not
covered specifically, but if corresponding input
data are fed into these mathematical models,
they can take these factors into account. For
instance, depending on the design of the physical
interface, certain failure modes are feasible.


5 EXTENSION BY MEASURING 
METHODS 


As shown in section 4, the influence of the physical
design of the mechanical human-machine interfaces
of complementary protective measures is sufficiently
taken into account neither by methods for
human reliability analysis nor by those for technical
safety.

This requires extending the reviewed
methods with a new measurement method in
order to evaluate the interface. The result should
allow for a more precise evaluation of this type of
protective measure and for the development of
the SIL models.

The following method chain is proposed for
assessing the physical human-machine interface
of complementary protective measures. The listed
methods are subdivided into quantitative and
qualitative (cf. Figure 3) assurance methods and
their use for the assessment is explained.
Subsequently, the possibility of transferring the gained
data to SIL values is explained.


5.1 Qualitative methodological approaches 


First, qualitative methods are used to evaluate
requirements and specifications that influence
the reliability of the mechanical part of a
human-machine interface used as a complementary
protective measure. These have to adhere to legal
requirements as defined in Directive 2006/42/EC and
to normative regulations.

With this input, scenario-based analyses are
conducted to identify contributing factors which
influence the fault-free and highly available usability
of complementary protective measures.

Figure 3. Exemplary heatmap for emergency stop
button.

Exemplary factors are size, shape and color.
In the further procedure, these factors are subdivided
into qualitatively and quantitatively measurable
factors.

Qualitative factors are analyzed, e.g. by structured
interviews with experts and with the test persons of
the test procedure. In addition, types of actuation are
video-monitored during the test procedure.

The aim is to evaluate the subjective impressions
of the evaluators and the test persons together with
the measured values from the usage test. Compliance
with legal and normative regulations is part of the
evaluation process.


5.2 Quantitative methodological approaches 


The quantitatively measurable factors determined
in the first steps will be examined in depth.

One promising approach to measure these
parameters is eye tracking (Harezlak et al., 2014;
Guo et al., 2016; Khalighy et al., 2015). This video
analysis approach from the area of usability can
be used to measure different parameters, including
(Khalighy et al., 2015):


• Appropriateness, which indicates what design
elements are preferred by consumers for a certain
function

• Novelty, which includes unexpected and not
previously experienced design elements for a
certain function


Eye-tracking also provides quantitative output 
such as number of fixations, standard deviation 
of fixation or common area of fixation (Khalighy 
et al., 2015; Takahashi et al., 2017). An example 
can be seen in Figure 3. 
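
A minimal sketch of how such output could be derived from recorded fixation centres is given below. The coordinate unit (cm), the use of the population standard deviation per axis and the bounding box as a crude stand-in for the area of fixation are assumptions for illustration; the function name and sample data are hypothetical.

```python
from statistics import pstdev


def fixation_metrics(fixations):
    """Summarize a list of (x, y) fixation centres, here assumed to be in cm."""
    if not fixations:
        raise ValueError("no fixations recorded")
    xs = [x for x, _ in fixations]
    ys = [y for _, y in fixations]
    return {
        "count": len(fixations),
        "std_x": pstdev(xs),
        "std_y": pstdev(ys),
        # Bounding-box area as a simple stand-in for the area of fixation.
        "bbox_area": (max(xs) - min(xs)) * (max(ys) - min(ys)),
    }


# Hypothetical fixation centres around a control element.
sample = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
metrics = fixation_metrics(sample)
```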

Further advantages of this method are its ease of
use in simulations, mock test scenarios and real test
scenarios (Hahn and Lüdtke, 2013).

Eye tracking can be used to measure the reaction
and the behavior of humans but also their interaction
with the mechanical elements of the human-machine
interface (Khalighy et al., 2015; Kim et al.,
2017). Eye tracking is therefore well suited to
measuring the defined design requirements
identifiability, visibility and accessibility (cf.
Machinery Directive 2006/42/EC, 2006).

In addition to the parameters that can be 
measured in the eye tracking procedure, the actu- 
ation time is also measured during the test set- 
tings, since it can be used to assess accessibility, 
for example. 

Depending on their importance, certain qualitative
and quantitative factors are evaluated, categorized
and systematized. Matrices or indexes can be
used for this purpose. The final goal is to develop
a way to make use of the parameters examined in
SIL assessments.

Table 1. Comparison table based on IEC 61508, 2011.

Actuation time   No. of fixations   Area of fixation   Factor
12 ms            16                 3 cm²              0.01
20 ms            10                 5 cm²              0.05


5.3 Transfer to SIL 


Transferring the measured data, e.g. reaction and
response times, to values of functional safety can
be achieved in several ways:


• One-to-one transfer (using SIL models)
• Statistical assessment
• Expert estimations


One option is a one-to-one transfer into
existing SIL values. So far, however, there is no
indication that such a transfer is easily possible
beyond assessing the fulfillment of specified
requirements of safety functions, e.g. reaction
within 1 s.

It is more likely that a statistical assessment of
multiple eye tracking experiments is necessary. In
this way, probabilities of failure of the human or
the physical human-machine interface can be estimated.
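
A minimal sketch of such a statistical assessment follows, assuming that each experiment yields one actuation time and that a trial counts as a failure when this time exceeds a required reaction time. The Wilson score interval is one common way to express the uncertainty of an estimated proportion; its use here, the function name and the sample data are assumptions, not part of the paper.

```python
from math import sqrt


def failure_probability(actuation_times_s, limit_s, z=1.96):
    """Return (estimate, lower, upper) for the probability of exceeding limit_s.

    The bounds are a Wilson score interval; z=1.96 gives roughly 95 % coverage.
    """
    n = len(actuation_times_s)
    if n == 0:
        raise ValueError("at least one trial is required")
    failures = sum(1 for t in actuation_times_s if t > limit_s)
    p = failures / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical actuation times in seconds, checked against a 1 s requirement.
times = [0.4, 0.6, 1.2, 1.5]
estimate, lower, upper = failure_probability(times, limit_s=1.0)
```

With only a handful of trials the interval is very wide, which illustrates why multiple experiments are needed before a failure probability can be used in a reliability calculation.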

The measurements can be used to derive
comparison tables similar to the table for
calculating the beta factor according to IEC 61508. In
this way, an analogous transfer of the results can
be proposed. A simplified example table is shown
in Table 1.

Experts can estimate frequencies of the observed 
reaction types, thus resulting in an overall reliabil- 
ity assessment. This approach can be further sup- 
ported by historical accident data analysis. 


6 CONCLUSION 


Complementary protective measures need a proof
of reliability according to IEC 61508. It is needed
for overall risk assessment as well as overall safety
evaluation. To deliver a proof of reliability,
quantitative data are required.

It was shown that there is a lack of such data for
complementary protective measures.

After reviewing typical methods from the field
of human reliability analysis and technical safety,
no directly applicable method was identified.
However, several methods support the quantification
of the reliability of the mechanical
human-machine interface.

To deliver quantitative data, the use and extension
of existing standard methods were described.
To this end, a tool chain was proposed with the
main methods intended to be used.

In particular, the video-analysis-based method of eye
tracking was identified as a promising key method for
delivering the required data.

Since it does not directly generate reliability 
data, it was also discussed how the measured val- 
ues can be further evaluated and combined with 
standard reliability values from IEC 61508. 

In conclusion, the present work sketches a way
forward for addressing the challenge of quantifying
the reliability of physical human-machine interfaces
used as complementary safety functions in
safety-relevant systems.

Simulation analysis of aerodrome CNS system reliability

ABSTRACT: Ensuring safety, regularity and continuity of air traffic are priority objectives of
aeronautical regulations and procedures. They concern, among others, Air Traffic Control services (ATC),
Communications, Navigation and Surveillance systems (CNS), Visual/Instrument Flight Rules (VFR/
IFR), minimum Flight Meteorological Conditions (FMC) and air traffic procedures. This paper focuses
on aerodrome CNS system reliability in relation to aerodrome traffic operations and ATC procedures.
The aerodrome CNS system has a complex technical and functional structure. The elements forming the
system are very reliable and accurate. However, the possibility of their operational use depends not only
on their technical state. The aerodrome ATC unit (TWR) can issue a clearance for performing an operation
(approaching, landing, take-off and climbing) only when the actual conditions meet the requirements
resulting from FMC, ATC procedures and air traffic rules, the approved Flight Plan and the operational
status of CNS devices. This means that different combinations of operational situations and reliability
status of CNS devices should be considered. In this paper, Petri nets are proposed to model the aerodrome
CNS system reliability structure. The model was used to perform a simulation analysis of CNS system
reliability for different air traffic and meteorological conditions. The applicability of the method is
shown using the example of a selected scheduling season.


1 INTRODUCTION 


Ensuring safety, regularity and continuity of 
aerodrome operations is a fundamental require- 
ment of aviation law. This results in specific tasks 
for the aerodrome managing body. They include 
appropriate planning and maintenance of airport 
infrastructure elements as well as implementation 
of appropriate technical and operational proce- 
dures. The realization of these tasks is connected 
with the necessity of incurring costs, which are an 
important element of the budget of the aerodrome 
operator. Aviation law regulations do not specify 
the requirements for the categories of procedures 
and navigational aids that must be implemented 
and installed at a given aerodrome. This is the deci- 
sion of the aerodrome operator. These decisions 
are related to the safety and economic objectives 
taken into account in the aerodrome’s investment 
plans. Usually, investment decisions are based on
cost-benefit analyses, which consider the issue of
aerodrome continuity. It can be effectively analyzed
on the basis of reliability theory with the use of
operational readiness measures (Kozłowski, 2015,
2016). The obtained results make it possible to define
adequate investment processes, including cost aspects.
This also applies to the decision to retain the current
category of procedures and navigation aids or to
upgrade it, which entails significant costs but gives
the opportunity to increase revenues from serviced air
traffic and transport. An element of these analyses
is the assessment of the reliability of the CNS system
in relation to meteorological flight conditions
and planned air traffic.

This work aims at finding an effective and objective
method of assessing the adequacy of CNS
infrastructure with respect to FMC, ATC procedures
and traffic rules. This will help to identify necessary
changes and to verify investment plans.


2 AERODROME INFRASTRUCTURE 


2.1 Airside infrastructure 


An aerodrome is an area on land, including any
buildings, installations and equipment, intended
to be used for the arrival, departure and surface
movement of aircraft (Annex 14 ICAO). The aerodrome
infrastructure includes two basic groups of
elements:


— the movement area, 
— communications, navigation and surveillance 
(CNS) devices. 


All elements of the aerodrome infrastructure
have specific characteristics which must meet
the requirements and be consistent with the
specifications (nominal values and tolerances
of parameters) given in legal regulations. These
characteristics define technical, operational and
reliability parameters, and their values must be
consistent with the established requirements specified,
e.g. in Annex 10 ICAO, Annex 11 ICAO and
Annex 14 ICAO.

The movement area is that part of an aerodrome
to be used for the take-off, landing and taxiing
of aircraft, consisting of the maneuvering area
(including runways and taxiways) and the aprons
(Annex 14 ICAO). A runway is a rectangular area
on the aerodrome intended for the landing and
take-off of aircraft. Depending on the
types of installed CNS devices and their parameters,
two types of runways are established: non-instrument
runways (NI), intended for the operation
of aircraft using visual approach procedures, and
instrument runways, intended for instrument approach
procedures. The latter may be divided into the
following types:


— non-precision (I-NP) runway, served by visual 
aids and a non-visual aid providing directional 
guidance adequate for a straight-in approach, 

— category I (I-PI) runway, served by ILS or MLS 
and visual aids intended for operations with 
a decision height (DH) not lower than 60 m 
(200 ft) and either a visibility not less than 800 m 
or a runway visual range (RVR) not less than 
550 m, 

— category II (I-PII) runway, served by ILS or 
MLS and visual aids intended for operations 
with a DH lower than 60 m (200 ft) but not 
lower than 30 m (100 ft) and a RVR not less than 
300 m, 

— category III (I-PIII A/B/C) runway, served by
ILS or MLS and: (A) intended for operations
with a DH lower than 30 m (100 ft), or no
decision height, and a RVR not less than 175 m,
(B) intended for operations with a DH lower
than 15 m (50 ft), or no DH, and a RVR less than
175 m but not less than 50 m, (C) intended for
operations with no DH and no RVR limitations.
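
The precision approach thresholds listed above can be sketched as a simple lookup. This is an illustrative simplification: it uses only the decision height and RVR values quoted in the list, ignores the visibility alternative for category I, and represents "no DH" or "no RVR limitation" by `None`. The function name is hypothetical.

```python
def runway_category(dh_m, rvr_m):
    """Return the precision runway category for given operating minima.

    dh_m:  decision height in metres, or None for no decision height
    rvr_m: runway visual range in metres, or None for no RVR limitation
    Returns None if no category from the list matches.
    """
    if dh_m is None and rvr_m is None:
        return "I-PIII C"
    if rvr_m is None:
        return None
    if dh_m is not None and dh_m >= 60 and rvr_m >= 550:
        return "I-PI"
    if dh_m is not None and 30 <= dh_m < 60 and rvr_m >= 300:
        return "I-PII"
    if (dh_m is None or dh_m < 30) and rvr_m >= 175:
        return "I-PIII A"
    if (dh_m is None or dh_m < 15) and 50 <= rvr_m < 175:
        return "I-PIII B"
    return None
```

Combinations that fall between the listed bands, for example a DH of 70 m with an RVR of 400 m, deliberately return `None` in this sketch rather than being assigned to a neighbouring category.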


2.2 CNS infrastructure 


CNS systems comprise three technologies that are
used in Air Traffic Management (ATM) to support
air operations.


1. Aerodrome control radio station (COM) is the 
standard aerodrome air-ground communication 
device ensuring two-way voice communication 
between TWR and aircraft. 

2. The aerodrome navigation systems include the 
following device groups: 

— visual aids for navigation (VAN), 

— radio aids for navigation (NAV). 

The standard aerodrome VAN aids consist of: 
— indicators and signaling devices, 

— markings, 


— signs (mandatory and information), 

— lights: approach lighting systems (ALS) - Cal- 
vert or ALPA-ATA, visual approach slope 
indicator systems—VASIS or PAPI, lights 
installed on runways, taxiways and aprons. 

The standard aerodrome NAV aids consist of: 

— non-directional radio beacon (NDB), 

— the VHF omnidirectional radio range (VOR), 

— distance measuring equipment (DME), 

— instrument landing system (ILS) or micro- 
wave landing system (MLS). 

3. The surveillance radar (SUR) is equipment
used by air traffic control (ATC) to determine
the position of an aircraft in range and azimuth.
The standard aerodrome SUR systems
used by TWR are:

— primary surveillance radar (PSR), which uses
reflected radio signals,

— secondary surveillance radar (SSR), which uses
transmitters/receivers (interrogators) and
transponders.


All of the CNS facilities, in accordance with
the requirements of legal regulations (Annex 10
ICAO), are characterized by very high technical
reliability. However, their operational status
depends on many factors. These factors
determine operational reliability, considered as
operational readiness. A CNS facility failure is
understood as any unanticipated occurrence which
gives rise to an operationally significant period
during which a facility does not provide service
within the specified tolerances.

The reliability of aerodrome CNS devices depends on
a combination of factors, in particular:


— flight meteorological conditions (FMC), 
expressed in terms of visibility, distance from 
cloud, and ceiling, 

— aerodrome infrastructure maintenance pro- 
grams and procedures (AMP), 

— level of redundancy in the reliability structure. 


The reliability R (as a percentage) of a CNS
facility whose failures follow a Poisson
distribution, defined as the probability that the
facility will be operative within the specified
tolerances for a time t, is expressed as:


\[ R = 100\, e^{-t/\mathrm{MTBF}} \qquad (1) \]


where: 

e = base of natural logarithms, 

t = time period of interest, 

MTBF = mean time between CNS device failures. 
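
Equation (1) can be evaluated directly, as in the following minimal sketch. The function name and the example figures (an MTBF of 4000 h and a period of interest of 0.25 h) are hypothetical and not taken from the paper; t and MTBF only need to share the same time unit.

```python
from math import exp


def facility_reliability_percent(t, mtbf):
    """Reliability R in percent for a period t, following R = 100 * exp(-t / MTBF).

    t and mtbf must be given in the same time unit (e.g. hours).
    """
    if mtbf <= 0:
        raise ValueError("MTBF must be positive")
    if t < 0:
        raise ValueError("time period must not be negative")
    return 100.0 * exp(-t / mtbf)


# Hypothetical example: MTBF of 4000 h, period of interest of 0.25 h.
r_example = facility_reliability_percent(0.25, 4000.0)
```

For t much smaller than MTBF the result is approximately 100 (1 − t/MTBF), so short periods of interest yield values very close to 100 %.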


The following reliability values of CNS devices
were adopted, as shown in Table 1.

Table 1. Exemplary values of CNS facility reliability.

COM      NAV      VAN      SUR
0.9995   0.9950   0.9990   0.9800
         0.9900   0.9980
         0.9900   0.9995
         0.9970   0.9990

(The device subscripts of the individual R values are
not legible in the source.)

Figure 1. Operational structure of aerodrome CNS system.


CNS devices are needed in accordance with
established procedures to ensure safe operations
in aerodrome traffic. Their functional structure is
shown in Figure 1. This structure has been mapped
in the model presented in Section 4.


3 AERODROME TRAFFIC 


3.1 Aircraft operations and procedures 


Aerodrome traffic is defined as all traffic on the 
aerodrome maneuvering area and all aircraft flying 
in the vicinity of an aerodrome (Annex 11 ICAO). 
Aircraft may fly in accordance with the visual
flight rules (VFR) or instrument flight rules (IFR), 
and conduct approach to landing, landing, taxi- 
ing, take-off and departure operations. The largest 
range of operational requirements is assigned to 
approach and landing operations. 

In this paper, special attention is paid to 
approach and landing procedures, in particular 
to instrument approach procedure (IAP), i.e. a 
series of predetermined maneuvers from the initial 
approach fix or from the beginning of a defined 
arrival route to a point from which a landing 
can be completed (ICAO Doc 4444). Instrument 
approach procedures are classified as follows: 


— non-precision approach (NPA), which utilizes
lateral guidance but not vertical guidance,

— approach with vertical guidance (APV), which
utilizes lateral and vertical guidance but does not
meet the requirements established for precision
approach and landing operations,

— precision approach (PA) using precision lateral 
and vertical guidance with minima as deter- 
mined by the category of operation. 


Some standard procedures are usually imple- 
mented in aerodrome traffic (Annex 11 ICAO): 


— standard instrument arrival (STAR) - an IFR 
arrival route linking a point on an ATS route 
with a point from which an instrument approach 
procedure can be commenced, 

— standard instrument departure (SID) - an IFR 
departure route linking the aerodrome with a 
specified point on a designated ATS route. 


Analysis presented in this paper applies to a 
controlled aerodrome, where the air traffic control 
service is provided by TWR (ICAO Doc 4444). 
To perform an operation in controlled aerodrome 
traffic, the aircraft needs TWR clearance, i.e. 
authorization to proceed under specified condi- 
tions. Before issuing the clearance, TWR checks 
the compliance between the requirements and 
specifications, procedures, operational status of 
CNS devices, traffic situation, current FMC and 
aerodrome operating minima (AOM). 


3.2 Aerodrome operating minima 


Aerodrome operating minima express the limits of 
usability of an aerodrome for: 


— take-off, expressed in terms of RVR or visibility 
and cloud conditions, 

— landing, expressed in terms of RVR and DH 
or minimum descent height (MDH) and cloud 
conditions as appropriate to the category of the 
operation (Annex 6 ICAO). 


The Aerodrome Operator (AO) establishing the
AOM should take into account the
following factors:


— types, performance and handling characteristics 
of the aircraft, 

— dimensions, characteristics and categories of the 
runways, 

— categories and performance of the available 
CNS devices and systems, 

— aerodrome ATC/TWR procedures, 

— obstacle locations and dimensions, obstacle
clearance height (OCH) and necessary MDH,

— shape and sizes of obstacle free zone (OFZ) and 
normal operating zone (NOZ), 

— descent profile determined for vertical guidance 
during a final approach (glide path—GP). 


To ensure the adequate level of safety the Aero- 
drome Operator shall specify incremental values 
for height of cloud base (CB) and visibility (RVR), 
to be added to the operator’s established AOM. 
The following AOM were adopted, as shown in 
Table 2. 

For the given RWY and approach procedure
categories and for the aerodrome operating minima,
the minimal alternative paths of operational
readiness were determined, as shown in Figure 2.
Table 2. Controlled aerodrome operating minima.


RWY Cat/PA Cat    min RVR    min CB (DH)
NI                5000 m     450 m
I-NP              550 m      150 m (OCH)
I-PI              550 m      60 m
I-PII             300 m      30 m
I-PIIIA           175 m      30 m
I-PIIIB           175 m      15 m
I-PIIIC           0 m        0 m


Figure 2. Identified minimal paths of operational
readiness at controlled aerodrome Cat IIIB.


Table 3. Time percentage of FMC occurrence in SS and WS.

FMC                                           Scheduling period
                                              SS     WS
RVR ≥ 5000 m and CB ≥ 450 m                   18%    6%
5000 m > RVR ≥ 550 m; 450 m > CB ≥ 150 m      27%    12%
5000 m > RVR ≥ 550 m; 150 m > CB ≥ 60 m       18%    19%
550 m > RVR ≥ 300 m; 60 m > CB ≥ 30 m         13%    23%
300 m > RVR ≥ 175 m; 60 m > CB ≥ 30 m         10%    16%
300 m > RVR ≥ 175 m; 30 m > CB ≥ 15 m         11%    13%
175 m > RVR and 15 m > CB                     3%     11%


3.3 Operation of the aerodrome 


The subject of the research presented in this paper is
the airport, i.e. an aerodrome intended to serve
commercial air transport.

The operation process of the airport is carried
out in two scheduling periods, i.e. the summer
(SS) and winter (WS) scheduling seasons used in
the schedules of air carriers. The operating conditions
and the structure of the air traffic differ between
the scheduling seasons.

In subsequent studies, the following data and 
assumptions were adopted, as shown in Tables 3 
and 4. 


Table 4. Percentage of IFR and VFR
operations in SS and WS.


FR     Scheduling period
       SS     WS
IFR    90%    95%
VFR    10%    5%


4 MODEL OF AERODROME CNS SYSTEM 


As indicated above, calculation of the opera- 
tional readiness of the aerodrome depends not 
only on the reliability of the CNS system, but 
also on other factors described by random vari- 
ables, the nature of which has been identified. 
The form of the density function of these ran- 
dom variables indicates the need to use simula- 
tion methods to study the operational readiness 
of the aerodrome. 

The model of the aerodrome CNS system for reliability
and operational readiness analysis was created
using Petri nets (Jensen, 1997, Marsan et al., 1999,
Reisig, 2013). Originally, they were developed
for modeling computer systems working syn- 
chronously. However, their high versatility has 
resulted in their having many other applications 
in recent years, including modeling and support of 
air traffic management processes (Davidrajuh & 
Lin, 2011, Oberheid & Söffker, 2008, Skorupski, 
2011, 2015a, 2016, Werther et al., 2007) and sys- 
tems reliability (Vismari & Camargo, 2011, Pinna 
et al., 2013, Song et al., 2017, Nývlt et al., 2015, 
Skorupski, 2015b). 

The model of the aerodrome CNS system has been
implemented as a colored, hierarchical, priority
Petri net. The network hierarchy is represented in
the form of so-called “pages” responsible for different
parts of the model: I-NP or I-PIIIB, CNS,
FMC.


4.1 General characteristics of Petri nets 


The basis for building a Petri net is a bipartite 
graph containing two disjoint sets of vertices called 
places (designated by ellipses) and transitions
(rectangles). The arcs in this graph are directed. A
characteristic feature of the graph used in Petri
nets is that each arc must connect vertices of
different types.

The set of places P corresponds to CNS systems 
operational states or FMC parameters observed at 
the airport. The set of transitions T corresponds 
to the generators of these values. The set of arcs 
defines the process of changing CNS and FMC 
parameters in subsequent experiments. 
4.2 Petri net for modeling airport operations


The airport traffic model presented in this paper
can therefore be written as


$S_{AT} = \{P, T, I, O, H, M_0, \Gamma, C, G, E, B\}$ (2)


where:

$M_0: P \to \mathbb{Z}_+ \times \mathbb{R}$ is the initial marking,

$\Gamma$ is a nonempty, finite set of colors,

$C$ is the function determining what color of tokens can
be stored in a given place: $C: P \to \Gamma$,

$G$ is the function defining the conditions that must be
satisfied for the transition before it can be fired;
these are expressions containing variables
belonging to $\Gamma$, for which the evaluation can be
made, giving as a result a Boolean value,

$E$ is the function describing the so-called weight of
arcs, i.e. expressions containing variables of
types belonging to $\Gamma$ for which the evaluation
can be made, giving as a result a multiset over
the type of color assigned to the place that is at the
beginning or the end of the arc,

$B: T \to \mathbb{R}_+$ is the function determining the priority of
transition $t$; this function applies only to transitions
that are simultaneously active; in this
situation a free choice of the transition to be fired
is possible.
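
To make the roles of the guard, arc expressions and priority more concrete, the following is a minimal sketch of a colored Petri net step in Python. It is not the CPN Tools model used in the paper: the class `Transition`, the functions `binding` and `step`, and the two-device example are hypothetical, and the convention that a lower priority value fires first is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Marking = Dict[str, List[object]]  # place name -> tokens (colors) stored there


@dataclass
class Transition:
    name: str
    inputs: Tuple[str, ...]                           # places one token is taken from
    guard: Callable[[Tuple[object, ...]], bool]       # plays the role of G
    outputs: Callable[[Tuple[object, ...]], Marking]  # plays the role of E on output arcs
    priority: int = 0                                 # plays the role of B (assumed: lower fires first)


def binding(t: Transition, m: Marking) -> Optional[Tuple[object, ...]]:
    """Return the tokens t would consume, or None if an input place is empty."""
    if any(not m.get(p) for p in t.inputs):
        return None
    return tuple(m[p][0] for p in t.inputs)


def step(transitions: List[Transition], m: Marking) -> Optional[str]:
    """Fire one enabled transition with the best priority and return its name."""
    enabled = []
    for t in transitions:
        b = binding(t, m)
        if b is not None and t.guard(b):
            enabled.append(t)
    if not enabled:
        return None
    chosen = min(enabled, key=lambda t: t.priority)  # ties: first in list (simplification)
    tokens = tuple(m[p].pop(0) for p in chosen.inputs)
    for place, new_tokens in chosen.outputs(tokens).items():
        m.setdefault(place, []).extend(new_tokens)
    return chosen.name


# Hypothetical example: the token placed in CNS1 is True only if both devices work.
marking: Marking = {"COM": [True], "SUR": [False]}
rel = Transition(
    name="rel",
    inputs=("COM", "SUR"),
    guard=lambda tokens: True,
    outputs=lambda tokens: {"CNS1": [all(tokens)]},
)
fired = step([rel], marking)
```

After the single step, the input places are empty and CNS1 holds one Boolean token, so a second call to `step` finds no enabled transition.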


4.3 Mapping the aerodrome CNS system 
reliability 


The model of aerodrome CNS system reliability 
represents operational states of individual CNS 
devices, FMC parameters (visibility and height 
of cloud base) and aerodrome operating minima 
through the use of colored Petri nets. The main
advantage of this approach is that the tokens
located in the places that determine the system states
make it easy to calculate the reliability of the whole
aerodrome CNS system.

For example, in Figure 3 the reliability structure 
of the CNS system implemented for controlled 
aerodrome of category Cat IIIB is presented. 

Places COM, SUR, NDB, VOR, DME, ALS,
PAPI, RWY, ILS, TWY contain a token describing
the operational state of the corresponding system.
Places CNS1 and CNS2 are responsible for storing
the current reliability status of the entire CNS
system. The determination of this state takes place
in the rel() procedure. In addition, there is a place
S whose task is to synchronize the control in the
simulation program.

Figure 3. Petri net (CNS page) representing reliability
structure of the CNS system of controlled aerodrome
Cat IIIB.

Figure 4. Petri net (FMC page) representing the flight
meteorological conditions for controlled aerodrome
Cat IIIB.

Figure 4 shows the part of the model responsible
for simulating atmospheric conditions. It is
represented by the FMC page.

Places RVR, CB and FR represent visibility, 
height of cloud base and flight conditions (IFR or 
VFR) respectively. The place S is used to synchro- 
nize the control (similarly as on the CNS page) and 
the place FMC stores information on whether the 
current meteorological conditions are sufficient to 
perform the landing operation in accordance with 
the current flight conditions. The determination of 
compliance of these conditions is implemented in 
the fmc() procedure. 
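
The compliance check described above can be sketched as follows. This is a hypothetical Python reconstruction based only on the minima in Table 2 and the VFR limits quoted in Section 5, not the authors' fmc() procedure from the CPN Tools model; the function name `fmc_ok` and its signature are assumptions.

```python
# Aerodrome operating minima from Table 2: category -> (min RVR [m], min CB [m]).
AOM = {
    "NI": (5000, 450),
    "I-NP": (550, 150),
    "I-PI": (550, 60),
    "I-PII": (300, 30),
    "I-PIIIA": (175, 30),
    "I-PIIIB": (175, 15),
    "I-PIIIC": (0, 0),
}


def fmc_ok(rvr_m: float, cb_m: float, flight_rules: str, category: str) -> bool:
    """Return True if visibility and cloud base meet the minima for the flight rules.

    Assumption: VFR operations use the NI row (5000 m, 450 m), while IFR
    operations use the row of the aerodrome's approach category.
    """
    if flight_rules == "VFR":
        min_rvr, min_cb = AOM["NI"]
    elif flight_rules == "IFR":
        min_rvr, min_cb = AOM[category]
    else:
        raise ValueError("flight_rules must be 'IFR' or 'VFR'")
    return rvr_m >= min_rvr and cb_m >= min_cb
```

For example, an RVR of 200 m with a cloud base of 20 m would satisfy the I-PIIIB minima under IFR but not the I-NP minima.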


4.4 Computer tool in CPN Tools environment 


The presented model has been implemented as
computer software in the CPN Tools 4.0 environment
(Jensen et al. 2007). It is a very convenient
tool because it allows the user to create a
model, simulate it with different input parameters
and analyze the results in the state
space.

The model of aerodrome CNS system reliability
is implemented as a hierarchical Petri net in
which different parts of the model are created
independently and are synchronized during the
simulation by means of special mechanisms.

Figure 5 shows the page I-PIIIB responsible 
in the model hierarchy for calculating the results. 
Places marked with CNS1, CNS2 and FMC labels 
in the bottom left corner, called “fused places”, 
are used to synchronize with relevant pages rep- 
resenting the remaining levels in the hierarchy. All 
fused places marked with the same label are iden- 
tical, regardless of which part of the model they 
are placed in. The mechanism of fused places also
gives the ability to synchronize the elements within
one page of the model.

Figure 5. Petri net for calculation of the results.

Place P1 counts those simulation experiments in 
which the landing operation could be performed, 
and the place N1 counts those experiments in 
which the meteorological conditions or the reli- 
ability state of the CNS system were insufficient 
to perform the operation. Similarly, the P2 and N2 
places register situations of full reliability of all 
CNS system components and partial failure of the 
system. 
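
The counting logic can be illustrated with a small Monte Carlo sketch in plain Python. It is not the CPN Tools implementation: device failures and weather are assumed to be independent, the summer-season figures of Tables 3 and 4 are used as an assumption, and the counter `cns_ok_count` is a hypothetical helper for the structural reliability of the CNS system. The counters `p1` and `n1` mirror places P1 and N1.

```python
import random

# Device reliabilities as read from Table 1.
R = {"COM": 0.9995, "SUR": 0.9800, "NDB": 0.9950, "VOR": 0.9900, "DME": 0.9900,
     "ILS": 0.9970, "ALS": 0.9990, "PAPI": 0.9980, "RWY": 0.9995, "TWY": 0.9990}

# Summer-season FMC classes from Table 3: (lower RVR bound [m], lower CB bound [m], share).
# The class bounds coincide with the minima, so comparing lower bounds is sufficient.
FMC_SS = [(5000, 450, 0.18), (550, 150, 0.27), (550, 60, 0.18), (300, 30, 0.13),
          (175, 30, 0.10), (175, 15, 0.11), (0, 0, 0.03)]
IFR_SHARE_SS = 0.90        # Table 4, summer season
VFR_MINIMA = (5000, 450)   # Table 2, row NI


def simulate(n, required, one_of, ifr_minima, seed=2018):
    """Simulate n landing attempts and count feasible (P1) and infeasible (N1) ones."""
    rng = random.Random(seed)
    classes = [(rvr, cb) for rvr, cb, _ in FMC_SS]
    weights = [share for _, _, share in FMC_SS]
    p1 = n1 = cns_ok_count = 0
    for _ in range(n):
        up = {name: rng.random() < r for name, r in R.items()}
        cns_ok = all(up[d] for d in required) and (not one_of or any(up[d] for d in one_of))
        rvr, cb = rng.choices(classes, weights)[0]
        min_rvr, min_cb = ifr_minima if rng.random() < IFR_SHARE_SS else VFR_MINIMA
        fmc_ok = rvr >= min_rvr and cb >= min_cb
        if cns_ok and fmc_ok:
            p1 += 1
        else:
            n1 += 1
        cns_ok_count += cns_ok
    return {"P1": p1, "N1": n1,
            "cns_reliability": cns_ok_count / n, "readiness": p1 / n}


# Cat NP aerodrome: five devices in series plus at least one of NDB, VOR, DME.
result_np = simulate(100_000, required=("COM", "SUR", "ALS", "PAPI", "RWY"),
                     one_of=("NDB", "VOR", "DME"), ifr_minima=(550, 150))
```

With a fixed seed the run is reproducible; the sample size of 100,000 is chosen only to keep the sketch fast.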

The model and the computer tool have been
validated with the use of real data by comparing
the data obtained from measurements with the
probabilities obtained from the model. Due to
space limitations, details of the validation are
not shown.


5 SIMULATION EXPERIMENTS 


Using the developed model and computer tool, a 
series of simulation experiments were performed 
to determine both the reliability of the CNS sys- 
tem at the airport and the possibility of perform- 
ing air operations, i.e. operational readiness of the 
airport. 

The first experiment was made for a poorly
equipped airport of category Cat NP. The following
elements of the CNS system must be operational
to perform the landing operation under IFR
conditions: COM, SUR, ALS, PAPI, RWY lights and
at least one of the following: NDB, VOR, DME.
In addition, the FMC conditions must be at least
550 m visibility and 150 m cloud base. Landing in
VFR conditions requires visibility of at least 5000
m and cloud base of at least 450 m.


The experiment consisted in simulating $10^7$
landing operations taking into account the real
parameters of random variables. The reliability of
the CNS system obtained from the simulation is
0.976 and operational readiness is only 0.41.

The second experiment concerned a well-equipped
airport of category Cat IIIB. The following
elements of the CNS system must be operational
to perform the landing operation under IFR
conditions: COM, SUR, DME, ILS, ALS, PAPI, RWY
lights, TWY lights. In addition, the FMC conditions
must be at least 175 m visibility and 15 m
cloud base. As before, landing under VFR requires
at least 5000 m visibility and at least 450 m cloud
base.

The experiment consisted in simulating $10^7$
landing operations taking into account the real
parameters of random variables. The reliability of
the CNS system obtained from the simulation is
0.962 and operational readiness is 0.857.

The reliability of the CNS system is slightly
lower for the Cat IIIB airport, which is due to the
larger number of necessary devices. At the same
time, these devices allow for much more precise
navigation, which significantly increases
operational readiness.
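
The simulated figures can be cross-checked with a short closed-form calculation. The sketch below assumes that device failures are independent of each other and of the weather, and that the summer-season shares of Tables 3 and 4 apply; both are assumptions of this illustration, and the helper names are hypothetical. Under these assumptions the results agree with the reported values to within about 0.001.

```python
from math import prod

# Device reliabilities as read from Table 1.
R = {"COM": 0.9995, "SUR": 0.9800, "NDB": 0.9950, "VOR": 0.9900, "DME": 0.9900,
     "ILS": 0.9970, "ALS": 0.9990, "PAPI": 0.9980, "RWY": 0.9995, "TWY": 0.9990}


def series(names):
    """All listed devices must work."""
    return prod(R[n] for n in names)


def parallel(names):
    """At least one of the listed devices must work."""
    return 1.0 - prod(1.0 - R[n] for n in names)


def readiness(cns_reliability, p_fmc_ifr, p_fmc_vfr, ifr_share):
    """Probability that both the CNS structure and the weather allow a landing."""
    p_fmc = ifr_share * p_fmc_ifr + (1.0 - ifr_share) * p_fmc_vfr
    return cns_reliability * p_fmc


# Summer-season weather shares from Table 3 and the IFR share from Table 4.
P_VFR_OK = 0.18                 # RVR >= 5000 m and CB >= 450 m
P_IFR_OK_NP = 0.18 + 0.27       # RVR >= 550 m and CB >= 150 m
P_IFR_OK_IIIB = 1.0 - 0.03      # RVR >= 175 m and CB >= 15 m
IFR_SHARE = 0.90

r_np = series(["COM", "SUR", "ALS", "PAPI", "RWY"]) * parallel(["NDB", "VOR", "DME"])
r_iiib = series(["COM", "SUR", "DME", "ILS", "ALS", "PAPI", "RWY", "TWY"])

ready_np = readiness(r_np, P_IFR_OK_NP, P_VFR_OK, IFR_SHARE)
ready_iiib = readiness(r_iiib, P_IFR_OK_IIIB, P_VFR_OK, IFR_SHARE)
```

The calculation makes the trade-off visible: the Cat IIIB structure has more devices in series and hence a lower reliability, but its lower minima admit far more weather conditions.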


6 CONCLUSIONS 


Comparison of the results of the
simulation experiments indicates a lack of
dependence between the reliability of the CNS system
and the aerodrome operational readiness. This
shows that decisions regarding the modernization
of the CNS infrastructure, and thus the change
of the airport category and investment projects,
should be made with a view to ensuring business
continuity.

The results and conclusions obtained from the use
of the Petri net model will be used for further
research on determining the airport's minimum
business continuity and other parameters,
such as minimum recovery time or maximum
tolerable period of disruption (ISO 22301).


ABBREVIATIONS AND ACRONYMS 


ALS — Approach lighting system.

AMP — Aerodrome maintenance program.

AO — Aerodrome operator.

AOM — Aerodrome operating minima.

APV — Approach with vertical guidance.

ATC — Air traffic control.

ATM — Air traffic management.

ATS — Air traffic services.

Cat. — Category.

CB — Cloud base.

CNS — Communications, navigation and surveillance.

COM — Communications.

DH — Decision height.

DME — Distance measuring equipment.

FMC — Flight meteorological conditions.

FR — Flight rules.

ft — Feet (dimensional unit).

ICAO — International Civil Aviation Organization.

IFR — Instrument flight rules.

ILS — Instrument landing system.

I-NP — Instrumental non-precision approach.

I-P — Instrumental precision approach.

MTBF — Mean time between failures.

NDB — Non-directional radio beacon.

NI — Non-instrumental approach.

PAPI — Precision approach path indicator.

R — Reliability.

RVR — Runway visual range.

RWY — Runway.

SID — Standard instrument departure.

SS — Summer schedule season.

STAR — Standard instrument arrival.

SUR — Surveillance.

TWR — Aerodrome control tower.

TWY — Taxiway.

VFR — Visual flight rules.

VOR — VHF omnidirectional radio range.

WS — Winter schedule season.

ABSTRACT: As a result of the digitalization of the power business in Norway and Europe, many new
possibilities and challenges arise. In 2014 an expert committee outlined a proposal for the future grid
company structure in Norway (Reiten, 2014). In addition, new technologies are being implemented in the
system: wind power, solar power, unregulated small hydro power production, domestic and industrial
battery storage, and electrification of transport. Transmission System Operators (TSOs) have a
responsibility to supply industry and communities with reliable electric power. However, the operators have been
virtually blind to slowly occurring changes in the load profile that reduce the expected regularity of the
power supply. This paper focuses on the possibilities and challenges the power business is facing. The
paper describes which technologies are needed, i.e. real-time probabilistic risk calculations, artificial
intelligence, machine learning and smart grid technology. The main question is: can the power business and
the introduction of new system tools manage without probabilistic risk calculation for making use of the
digitalization and the corresponding big data?


1 INTRODUCTION 


1.1 History of the electric grid 


Modern Norway was built and industrialized by
utilizing rivers and waterfalls
for power generation. Hydropower is still the
cornerstone of the Norwegian power system, but wind
power and solar energy are becoming an increasing
part of the energy system. The grid has been
developed over 150 years since the first small hydro 
plants were installed to supply small local indus- 
tries. Hydro power plants were constructed over 
time as the industrial development moved forward. 
Initially the generation supplied local and regional 
consumers, but as transmission technology devel- 
oped regions were connected via high voltage lines. 

Now, the main grid is the most important part 
of the grid system, as failure here could mean 
power outage for very many consumers. The main 
grid was built largely from the 1950s to the 1980s. 
However, regional islands existed until 1994 when 
the main grid was finally established throughout 
Norway. Deregulation and competition were
introduced in 1991 with the purpose of improving the
socio-economic efficiency in the energy sector.
Now, the Norwegian main grid is aging and in the
process of being replaced and upgraded by
construction of new 420 kV lines in combination with
digitalization of the power system (Statnett, 2017).


1.2 The change in the power system, smart grid, 
solar, wind, battery etc. 


The energy system is a critical part of a well-func- 
tioning society. Norway is largely electrified and 
power transmission is an important prerequisite 
for value creation. 

Although the power grids are largely built as 
before, the power system changes at a rapid pace. 
Hence, the Transmission System Operator (TSO) 
must be an enabler and be prepared for the future. 
The Norwegian aging main grid is in the process 
of being replaced and upgraded. The power grid
takes a long time to plan and build and has a long
lead time, which contrasts strongly with an energy
sector undergoing rapid change. The load is increasing
and more generation is being installed. The Green 
Certificate Scheme provides incentives to expand 
renewable power generation, including small scale 
hydropower, wind power and solar power.
Implementation of Automatic Meter Reading and
Control Systems at the consumer level will allow for
activation of consumer flexibility. Consumption
patterns in the energy sector are changing rapidly
(NVE, 2016).

Hence, the growing expansion of renewable
energy and activation of flexible loads increase
the complexity of balancing generation and
demand in the power system. The energy-shifting
and fast-ramping capability of energy storage has
led to increasing interest in batteries to facilitate
the integration of renewable resources (Amrouche,
2016).

The future power system will become even more
dynamic, and the need for real-time information for
monitoring the system status and for taking the
proper control actions is increasing.


1.3 The need for a solution 


Global challenges regarding energy and climate
change, the environment, safety, technology and
renewable solutions, use and conservation of
energy, use of batteries and the connection of
electric vehicles require greater effort. The changing
landscape of the power and utilities industry
is resulting in new expectations for IT. Transmission
system operators are struggling to fulfil their
traditional mission of maintaining security of
supply in a rapidly evolving environment driven
by digitalization. Digital transformation has
become a common trend, and it has reached
the power and utilities sector
rather quickly (Digitalization & Energy,
2016). The physical power system cannot function
effectively without a well-functioning power
market with smart ICT systems. The power system 
must be able to cope with the increasing variability 
in load and generation. Hence, in a complex power 
system, new solutions in the field of ICT and new 
market models are required to ensure the reliability 
and security of supply, to ensure that the transmis- 
sion capacity is optimally utilized and that control 
actions are taken when needed. 


2 DIGITALIZATION OF THE POWER 
BUSINESS—THE SOLUTION? 


2.1 The goal with digitalization 


The concept of a Digital Power System (DPS) has
been discussed for many years. The DPS may be
defined as the digital, figurative and real-time
description and reproduction of the physical structure,
technical characteristics, management system and
personnel information system of a real power system
that is in operation. The DPS will be able to make a
significant contribution to more scientific
administration and decision-making (Chakrabortty, 2017).

The share of unregulated renewable power generation
is rising and the power system is changing
rapidly. Tomorrow's energy system must be able
to handle changes like this. Hence, the reasons and
goals for implementing the digital power system
are multiple:


• Monitoring and controlling all components by
equipping them with sensors

• Measuring the condition of the power system:
flows, angle differences, stability margins, and
hence the reliability and security

• Improving security and stability online, devising
and implementing economical operation
strategies online, and carrying out emergency and
anti-fault control, etc.

• Better utilization of the facilities

• Precise state information resulting in increased
capacity and fewer faults

• More efficient maintenance and increased lifetime


In the end, the primary goal is to increase the 
value creation while maintaining the reliability and 
security of the system. 


2.2 Big data 


Data has always been an important asset in every 
industry. Since the early days of the information 
age, business intelligence and descriptive statistics 
have been used as the standard tools for extracting
information and making important decisions from
all kinds of collected data. However, as the cost of
collecting, storing, and processing data has been 
dropping exponentially, the amount and the diver- 
sity of the data has reached the point where tradi- 
tional approaches are no longer feasible. The term 
Big Data is often used to refer to any data that 
requires new techniques and tools in order for it 
to be processed and analyzed. Big Data could also
be viewed from the point of view of the new set
of technologies that are helping to solve the chal- 
lenges in collecting, managing, and analyzing Big 
Data. These technologies include cloud computing 
and cluster computing for data storage and manip- 
ulation, Artificial Intelligence (AI) and machine 
learning for data analysis (L’Heureux, 2017). 

As in many other sectors, big data analytics
and machine learning are also entering
the energy sector, and tools are being developed.
They are used, for example, to forecast electricity
demand at substation level, segment customers
based on their power consumption patterns,
implement demand response strategies, and support
power system condition monitoring and control.

The value of big data may come from several use
cases: as a source of analytics, as a source for control
actions and as an enabler for new products and
services. An energy company could, for example, track,
collect and store all available data from its system
(from customers, components, system data, GPS
trails, and geographical and meteorological data), then
combine them and use big data analytics to
produce high-value actionable insights and controls.
Use of big data technologies may also open up 
completely new business models and introduce 
new products and services in the energy sector. 


2.3 Smart grid 


The smart grid would be an enhancement of the 
electrical grid, using two-way communications and 
distributed intelligent devices (Smart Grids Euro- 
pean Technology Platform, 2011). Two-way flows 
of electricity and information could improve the 
delivery network. A smart grid would allow the 
power industry to observe and control parts of the 
system at higher resolution in time and space. One 
of the purposes of the smart grid is real time infor- 
mation exchange to make operation as efficient as 
possible. It would allow management of the grid 
on all time scales from high-frequency switching 
devices on a microsecond scale, to wind and solar 
output variations on a minute scale, to the future 
effects of the carbon emissions generated by power 
production on a decade scale. 

The management system in a smart grid is the
subsystem that provides advanced management and
control services. Most of the work in this area aims
to improve energy efficiency, demand profile, utility,
flexibility and cost on the basis of the infrastructure,
by using optimization, machine learning and game theory.
Within the advanced infrastructure framework of 
smart grid, more and more new management serv- 
ices and applications are expected to emerge and 
eventually revolutionize consumers’ daily lives. 

The protection system of a smart grid provides 
grid reliability analysis, failure protection, and 
security and privacy protection services. While 
the additional communication infrastructure of a 
smart grid provides additional protective and secu- 
rity mechanisms, it also presents a risk of external 
attack and internal failures (Pandey, 2017). 


3 POWER SYSTEM OPERATION 


3.1 Balancing the system 


Electricity must be produced at the same time as 
the power is consumed. In addition, the produc- 
tion must be equal to the power consumed. This 
is called the instantaneous balance of the power 
system. The power market is the central tool for 
balancing supply and demand for power. The 
results of the daily pricing calculation in the
day-ahead market are the basis for the Norwegian TSO
Statnett’s planning and maintenance of the
balance in the following operating day. The
continuous balancing of production and consumption
is very important for the reliability of the system.
In case of imbalances, the system operator
implements measures to restore the balance, such as
adjusting output or consumption.

Statnett has been given the system responsibility 
in the Norwegian power system. System Require- 
ments in the Power System (Regulations on system 
responsibility in the power system, 2002) emphasize 
that the system operator shall provide frequency 
regulation, ensure instantaneous balance in the 
power system, develop market solutions that con- 
tribute to the efficient development and utilization 
of the power system, and to the greatest extent
possible use instruments based on market principles.
The System Responsible company coordinates the 
operation of the power system, provides for the 
determination of capacity for the market, bottle- 
neck handling and trade with other countries. 

A well-designed power system has the following 
characteristics: 


• Supplies all consumption regardless of
geographical location

• Supplies consumption at all times

• Handles variability in consumption and
production

• Delivers supply of good quality that meets
defined quality requirements

• Is based on the economically ‘optimal’ principle

• Meets required and defined security goals


The delivered power must meet certain mini- 
mum delivery quality requirements. The following 
determines the quality: 


• System frequency must be kept around the
specified 50 Hz, with variation within ±0.1 Hz

• The voltages are kept within narrow, prescribed
limits around the normal value. Generally, the
voltage variation should be within ±10% (or
5% in some systems)


To ensure that voltages and frequencies are kept 
within their limits, voltage and frequency regula- 
tion is required for efficient operation of the power 
system (Gjengedal, 2017). 
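
The two quality limits above translate into a simple check. The following is a minimal sketch; the function name, the parameter names and the 420 kV nominal voltage used in the usage note are illustrative assumptions.

```python
NOMINAL_FREQUENCY_HZ = 50.0


def within_quality_limits(freq_hz: float, voltage_kv: float, nominal_kv: float,
                          freq_tol_hz: float = 0.1, voltage_tol: float = 0.10) -> bool:
    """Check frequency against 50 Hz +/- 0.1 Hz and voltage against nominal +/- 10%.

    voltage_tol is a fraction of the nominal voltage; pass 0.05 for systems
    that apply the stricter 5% band.
    """
    if nominal_kv <= 0:
        raise ValueError("nominal voltage must be positive")
    freq_ok = abs(freq_hz - NOMINAL_FREQUENCY_HZ) <= freq_tol_hz
    voltage_ok = abs(voltage_kv - nominal_kv) <= voltage_tol * nominal_kv
    return freq_ok and voltage_ok
```

For instance, `within_quality_limits(50.05, 410.0, 420.0)` is satisfied, whereas a frequency of 50.2 Hz or a voltage of 370 kV on a 420 kV system is not.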


3.2 Traditional operation and planning: N-1


Operating the network according to the N-1 crite- 
rion means that failure of a component does not 
result in interruptions in the supply to the end user. 
It is referred to as reduced reliability when the N-1 
criterion is no longer met, in cases where it should 
normally be met. Statnett, as the system operator,
has the means through the regulations on system
responsibility (Regulations on system responsibility
in the power system, 2002) to change
the grid configuration and to demand up- or
down-regulation of production. Such means can
help ensure operation according to the N-1
criterion. However, the authorities do not require
the main grid to meet the N-1 criterion
at all times.
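
The N-1 idea can be illustrated with a deliberately simplified sketch: a single load fed by parallel components, each with a transfer capacity, where the criterion holds if the load can still be supplied after the loss of any one component. Real N-1 assessment requires power flow and stability analysis; the function name and the capacity model are assumptions made only for illustration.

```python
from typing import Sequence


def meets_n_minus_1(capacities_mw: Sequence[float], load_mw: float) -> bool:
    """Return True if the load is still covered after any single component outage.

    Simplification: capacities add up linearly and no network constraints apply.
    """
    if not capacities_mw:
        return load_mw <= 0
    total = sum(capacities_mw)
    return all(total - c >= load_mw for c in capacities_mw)
```

Three 300 MW lines feeding a 500 MW load pass the check, since 600 MW remains after any single outage. A 400 MW line and a 200 MW line feeding 300 MW fail it, because losing the larger line leaves only 200 MW.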


3.3 Power system operation challenges 


In power system operation, traditionally slow- 
changing and predictable parameters are now 
changing fast. In addition, new system parameters 
are being introduced in the power system. This will 
influence the inherent properties of the system as 
well as the risk for outages. While the power system 
complexity is increasing, the available operational
response time is decreasing fast. For example, the
introduction of solar and wind production
represents power production that is difficult to
manage when the wind stops or clouds cover the sun.

Combine this with a high degree of automation 
and smart grid technology, and the existing opera- 
tional «know how» may not be sufficient to deal with 
the properties of the current and future power system. 

What the neighboring power system is doing
will affect the current power system with regard to
dynamics and risk. Until now, power system
operators have evaluated the system risk for loss of load
qualitatively, based on experience. This
experience-based “gut feeling” will not be
valid in the future without nurturing new skills and
competencies.

The first step is to be able to assess the system
risk level in the same way for each power company. The
only way to achieve this is to assess it
quantitatively with a probabilistic approach.

The power business has lacked tools to quantitatively
evaluate the system risk level in near real time and to
assess possible risk-reducing actions. This challenge was
addressed by the GARPUR project (GARPUR, 2017).


3.4 How will the future power system 
risk develop? 


New production with challenging properties, such as
solar and wind, is being added to the power system at an
increasing rate. This alone will increase the risk in
the system because of its characteristic intermittency.
More severe weather affecting the power system is
expected, resulting in higher risk. Higher peak loads are
expected as a consequence of electrifying transportation
and petroleum production, and this will increase risk in
periods of peak load. New smart grid technology, such as
disconnection of loads when needed, will reduce the risk
in the system. IoT (Internet of Things) technology may
also increase the challenge of balancing the system
because of rapid connection and disconnection of load
driven by new sets of criteria not known to the system
operators. This technology alone represents a threat to
the system because third parties may hack into the
equipment and connect or disconnect it without anybody
noticing, possibly resulting in outages of large areas.

Furthermore, battery storage is being introduced.
This will most likely reduce the risk of outages in the
system. In addition, artificial intelligence and machine
learning are already on our doorstep. They can help the
system if done right, or increase the risk if done wrong.

All these factors combined are a tall order for
the human mind to process in real time. It is often
seen that the risk driver is the combination of
many seemingly unrelated factors and events, and
not a single cause-and-effect scenario.

Example: A given power system has a unique
dynamic characteristic and inherent risk property.
The power system will typically be in an N-0 situation
in periods with high peak load. The total system risk
depends on the size of the load(s) affected by the
reduced power delivery reliability (see Figure 1). Above
the red line, the expected “not delivered energy” exceeds
the acceptable level with regard to expected outage cost
or the power companies’ goals. Increase the system load
for the same power system and the risk for outage will
increase (see Figure 2). By operating the same power
system with the increased load more optimally, the total
risk can be reduced to within the acceptable risk level
(see Figure 3).

Figure 1. System risk for a given power system.

Figure 2. Increase only the load and the risk for outage
increases.

Figure 3. Operate the power system optimally, and the
same increase in load can be managed in terms of risk
for outage.

Figure 4. Prediction for future system risk development.

With all the changes that are being introduced in
the power system, our prediction is that the power
system will experience system risk levels that range
from a very high level to the theoretical lowest
level, changing continuously by the minute
(see Figure 4).


4 PROBABILISTIC RISK ASSESSMENT 


4.1 The starting point for a consistent risk 
management for the whole value chain 


To have consistent risk management throughout
the value chain, it is vital to be able to calculate the
risk level of the current state of the power system.
As in any power system analysis, the current state will
be the base case for the next step of analysis
and evaluation. Furthermore, everything that is
done and planned is done to keep the power system in
operation. Therefore, the risk assessments made in the
operation of a power system, by operators or in
operational planning, should be the foundation of the
risk evaluation communicated within the power grid
company as input for all future evaluations and analyses.


4.2 Identifying the power system’s inherent risk
properties


To be able to identify the power system’s inherent
risk properties, the following objects and parameters
have to be included in a mathematical representation
of the power system:


• Production, spinning reserve, location
• Production type
• Power system configuration
  o load flow
  o load demand
  o system dynamics
• Component reliability
• Maintenance interval and prioritizing
• Weather influence
• Energy storage possibilities and smart grid technology
• System operators’ actions and strategy
• Influence of other grid companies’ actions in
  their own power systems


These factors have both slow-changing and
fast-changing properties. Since a power system changes
every minute of the year’s 8760 hours, the total system
risk graph will also vary by the minute. It is therefore
important to model the power system in great detail so
that every change that occurs is reflected by the
mathematical model. Furthermore, since historical data
of the power system (i.e. configuration, production, load
flow, load level and failure rate for each component in
the system) are stored and available, the mathematical
model can be validated against previously recorded risk
levels.


4.3 Calculation of the probabilistic risk level 
in near real time 


A short description of the calculation sequence in 
PROMAPS: 


1. Calculate the reliability of each grid segment. Each
component in a grid segment can be described
with multiple possible states, for instance Functioning,
Intermediate fault or Lasting fault.
PROMAPS uses Markov models to represent
each individual component in the grid segments:


$\dot{P}_i = A_i P_i$, (1)


where $A_i$ is a Markov model containing fault
rates and repair rates.

2. It is possible to build reliability models of whole
grid segments by simply combining the
Markov models of the individual components
as Kronecker sums, as follows:


$A = A_1 \oplus A_2 \oplus \dots \oplus A_n$, (2)


States are then combined into common states, thus
reducing the number of states in the grid segment model
to a few unique states.

3. Calculate the probability of each system state.
A system reliability model is calculated by combining
all grid segment models using Kronecker sums, as
described in the last step.

4. Discard all system states with probability 
below some probability threshold. The prob- 
ability threshold is dependent on how many 
states should remain in the set for further 
assessments. 

5. Calculate the maximum power transmission capacity
for each state in the set.

6. Calculate the expected power shortage at each load
point.

7. Calculate the expected power supply reliability and
the mean time between loss of supply.

8. Calculate various auxiliary variables including 
economic data. 
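
Steps 1 to 3 can be illustrated with a minimal sketch. It is not the PROMAPS implementation: it assumes two-state components (functioning or failed), independent components, and hypothetical fault and repair rates. The function names are illustrative only. The sketch builds the combined model as a Kronecker sum and checks that the product of the component steady-state probabilities is a stationary solution of the combined model.

```python
from itertools import product
from math import prod


def generator(fault_rate, repair_rate):
    """Two-state component: state 0 = functioning, state 1 = failed.

    Convention dP/dt = A P, so each column of A sums to zero.
    """
    return [[-fault_rate, repair_rate],
            [fault_rate, -repair_rate]]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def kron(a, b):
    """Kronecker product of two square matrices (nested lists)."""
    nb = len(b)
    size = len(a) * nb
    return [[a[i // nb][j // nb] * b[i % nb][j % nb] for j in range(size)]
            for i in range(size)]


def kron_sum(a, b):
    """Kronecker sum: A (+) B = A (x) I + I (x) B."""
    left = kron(a, identity(len(b)))
    right = kron(identity(len(a)), b)
    return [[x + y for x, y in zip(row_l, row_r)]
            for row_l, row_r in zip(left, right)]


def steady_state(fault_rate, repair_rate):
    """Stationary [P(functioning), P(failed)] of one component."""
    total = fault_rate + repair_rate
    return [repair_rate / total, fault_rate / total]


# Hypothetical (fault rate, repair rate) pairs per hour.
components = [(0.001, 0.1), (0.002, 0.05), (0.0005, 0.2)]

# Step 2: combine the component models into one segment model.
segment = generator(*components[0])
for rates in components[1:]:
    segment = kron_sum(segment, generator(*rates))

# Step 3: joint state probabilities, ordered like the Kronecker product.
marginals = [steady_state(*rates) for rates in components]
joint = [prod(combo) for combo in product(*marginals)]

# A p should vanish for a stationary distribution.
residual = [sum(a * p for a, p in zip(row, joint)) for row in segment]
```

The combined model has 2^n states for n two-state components, which is why the sequence goes on to merge and discard states.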


The analysis can be performed for various load 
profiles. For online reliability assessments, parts 
of the calculation sequence are repeated whenever 
new online data is available. 

The PROMAPS risk assessment principles for online
calculations have been presented in detail at
PMAPS 2012 (Svendsen, 2012). The concept has
also been researched in the recently completed
pan-European project GARPUR (GARPUR,
2017). Common to these methodologies for real-time
risk assessment is that they consist of two
main parts:


1. Calculate the probability of all sequences of
events in the power grid.
2. Calculate the consequence of these events.


Since there is an “infinite” number of possible
events in a power grid, there also needs to be
some principle for discarding events with negligible
risk. The simplest approach is to discard all
contingencies with probability below some probability
threshold. The uncertainty of the risk assessment
is related to the summed risk of all events that have
been discarded. 
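
The two parts and the discarding principle can be sketched as follows. This is an illustration under stated assumptions, not the PROMAPS or GARPUR method: components are assumed independent, the unavailabilities are hypothetical, and the consequence function `lost_load_mw` is a made-up rule. The sketch returns the truncated risk together with the discarded probability mass, which multiplied by the worst-case consequence bounds the neglected risk.

```python
from itertools import product
from math import prod


def enumerate_states(unavailability):
    """Yield (failed_components, probability) for every in/out combination.

    Assumes independent components, each with a stationary probability
    of being out of service.
    """
    names = list(unavailability)
    for pattern in product([False, True], repeat=len(names)):
        failed = tuple(n for n, down in zip(names, pattern) if down)
        prob = prod(unavailability[n] if down else 1.0 - unavailability[n]
                    for n, down in zip(names, pattern))
        yield failed, prob


def truncated_risk(unavailability, consequence, threshold):
    """Return (expected consequence of kept states, discarded probability)."""
    risk = 0.0
    discarded = 0.0
    for failed, prob in enumerate_states(unavailability):
        if prob < threshold:
            discarded += prob          # contingency dropped from the set
        else:
            risk += prob * consequence(failed)
    return risk, discarded


# Hypothetical example: load (100 MW) is lost only when line_a and line_b
# are out at the same time.
unavailability = {"line_a": 0.01, "line_b": 0.02, "line_c": 0.001}


def lost_load_mw(failed):
    return 100.0 if {"line_a", "line_b"} <= set(failed) else 0.0


risk, discarded = truncated_risk(unavailability, lost_load_mw, threshold=1e-5)
full_risk, _ = truncated_risk(unavailability, lost_load_mw, threshold=0.0)
neglected_bound = discarded * 100.0    # discarded mass times worst case
```

Lowering the threshold keeps more states and tightens the bound, at the cost of more consequence calculations.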


5 AI AND MACHINE LEARNING IN
OPERATION OF POWER SYSTEM


5.1 What data is needed 


Numerous data are available as input for
AI and machine learning, e.g. current
and historical load, production, spinning reserve,
sensor data (current, voltage, frequency), load flow,
configuration of the power system, component
data (type, characteristic, age), component health
indexes, all previous outages and their causes, weather
type at the point of outage, dynamic data on the
strength of the system from PMUs, protection
schemes and other functionalities.

In addition, near real-time probabilistic risk
calculations are available as input (Tollefsen, 2015).
These represent large calculated data sets that give
new insight into the inherent properties of the power
system.

Essential data for operation of the system: Volt- 
age, frequency, production, spinning reserve and 
regulation possibilities. 


5.2 AI agents assigned to perform tasks
in the power system


Artificial intelligence in general, and supervised
deep learning in particular, tends to work well with
large amounts of data (Schmidhuber, 2015). In
supervised scenarios, the deep learning algorithms
learn from known correct examples and pick up
trends and patterns that depict specific scenarios.
In these cases, there is a trade-off between data
quality and data size. The lower the quality of the
data, the more data is needed to extract the correct
patterns. The most common example where this
works is social media such as Facebook, which contains
enormous amounts of data of varying quality, enabling
complex artificial intelligence algorithms.

The same basic concept is true for power systems,
and it is therefore crucial that large amounts
of data from power systems, including smart
meters, are collected. The Norwegian power system
manager Statnett has a particularly important role
here. A concrete example of an application area where
artificial intelligence is expected to play a pivotal
role is predicting electrical consumption peaks to
avoid power outages (Goodwin, 2016). Overconsumption
may have serious consequences such as
power outage. By predicting future peaks in the
consumption, techniques such as load balancing
could be applied to avoid the problems. This
clearly has to be done before the consumption
peak happens, but knowing the consumption
before it occurs is difficult. For this particular
case, positive and negative examples should be
collected, which here are examples of normal
power flow and of overconsumption. The artificial
intelligence networks are trained with the data
and learn which consumption
trends lead to a peak. After the training
phase, the network is put into practice and predicts
future peaks, which can either be used directly in
an automated system to initiate load balancing
or as input to a decision support system. Other
examples where artificial intelligence could play
a similar role are operation of the power system and
production prediction.

The operation of a power system is based on
a set of rules and constraints within which the power
system operator must operate the system. The
constraints are a set of limits related to the
physical properties of the different power system
components. Examples are thermal limits for
power lines, transformers or other components;
the normal frequency deviation, which should be within
±0.1 Hz; voltage variations, which should be within ±10%;
and the angle difference in three-phase current, which
should be within maximum limits when reconnecting
different parts of the power system.
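
Such limits can be expressed as a simple check that an operator tool or a learning agent could apply to each measurement. The sketch below is illustrative only. It assumes a 50 Hz nominal frequency and per-unit voltage and line loading values; the class and function names are hypothetical. The ±0.1 Hz and ±10% limits are the ones quoted above.

```python
from dataclasses import dataclass

NOMINAL_FREQUENCY_HZ = 50.0        # assumption: 50 Hz system
MAX_FREQUENCY_DEVIATION_HZ = 0.1   # normal frequency deviation limit
MAX_VOLTAGE_DEVIATION_PU = 0.10    # +/- 10 % of nominal voltage


@dataclass
class Measurement:
    frequency_hz: float
    voltage_pu: float        # voltage relative to nominal
    line_loading_pu: float   # current relative to the thermal limit


def violated_constraints(m):
    """Return the names of the operating limits that the measurement breaks."""
    violations = []
    if abs(m.frequency_hz - NOMINAL_FREQUENCY_HZ) > MAX_FREQUENCY_DEVIATION_HZ:
        violations.append("frequency")
    if abs(m.voltage_pu - 1.0) > MAX_VOLTAGE_DEVIATION_PU:
        violations.append("voltage")
    if m.line_loading_pu > 1.0:
        violations.append("thermal")
    return violations


normal = violated_constraints(Measurement(50.02, 1.03, 0.80))
stressed = violated_constraints(Measurement(49.85, 0.88, 1.10))
```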

The rules and constraints connected to the operation
of a power system are notably different from the rules
of board games such as chess. A power system is a
stochastic system influenced by physical laws and
human behavior such as consumption. The rules
of chess and the behavior of the game are
deterministic, which means that future states are
easily predictable, albeit many. However, there are
similarities as well. A vast number
of states and complex behavior are identifiable in
both complex board games and power systems. We
can imagine that machine learning can also be
applied to the operation of power systems, based on
the principle applied by DeepMind in the new
chess program AlphaZero (Silver, 2017) and on the
improved operation strategy obtained for Google’s
data centers by use of Machine Learning Applications
for Data Center Optimization (Gao, 2016).


5.3 How can we evaluate the AI actions
and gain trust?


Artificial intelligence techniques range from
statistical approaches based on probabilistic induction,
to knowledge-based approaches, to neuron-based deep
learning. For deep learning, which is undoubtedly
the most promising artificial intelligence technique
in use, a confidence level is available as part of the
supervised classification output. This confidence is
very different from a probability, but can in any case
be used as part of a trust scheme. If a deep learning
network is used to predict future problems, whenever
it outputs an expected problem it can at the same
time output how confident it is that this is an actual
problem. If this is part of a decision support
system, the confidence can be used to inform a human
decision maker in a decision support control room.
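
One way such a confidence could be surfaced to an operator is sketched below. It assumes the classifier produces raw class scores that are normalised with a softmax; the labels, the 0.9 threshold and the function names are hypothetical. As noted above, the resulting value is a confidence score and not a calibrated probability.

```python
import math


def softmax(scores):
    """Normalise raw class scores to values in (0, 1) that sum to one."""
    peak = max(scores)                       # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def advise(scores, labels, min_confidence=0.9):
    """Return (predicted label, confidence, suggested handling)."""
    confidences = softmax(scores)
    best = max(range(len(confidences)), key=confidences.__getitem__)
    confidence = confidences[best]
    if confidence >= min_confidence:
        handling = "alert operator"
    else:
        handling = "flag as uncertain"
    return labels[best], confidence, handling


labels = ["problem", "normal"]
clear_case = advise([4.0, 0.0], labels)
unclear_case = advise([0.2, 0.0], labels)
```

Routing low-confidence predictions to a separate channel keeps the human decision maker aware of when the network is guessing.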


6 HUMANS, DECISION SUPPORT
SYSTEM AND AI


6.1 Decision support for system operation


The ability to deal with the real-time fluctuations
of the power system is not only a question of
creating new technology and algorithms. Humans
are still in the loop and the energy system is thus
not only a technical system. It is a socio-technical 
system where the sense making, decisions and 
interventions of control room operators play an 
important part in the reliability of the system as a 
whole. This means that decision-support technol- 
ogy can play a key role in upholding the security of 
supply, but also that we need to take into account 
the human part of decision-making in control 
rooms. New decision-support systems will meet 
existing competence and experience, both at the 
individual and team level. In order to make sure 
that decision-support systems have the intended 
effects, the human perspective must be included in 
the development of the systems to ensure a good 
match between humans, technology and the organ- 
ization of decision-making. 

Advances in modelling and machine learning 
allow for information processing and problem 
solving that surpasses the capacity of an individ- 
ual human decision-maker. Nevertheless, there is a 
need to find a balance between man and machine 
in the distribution of decision-making functions. 
Also, any period of technological transition will
face challenges related to the competence of the
existing workforce in the use of new technology,
as well as to a warranted level of trust in what new
technology can and cannot do.

The importance of taking into account the
relationship between human decision-making and
algorithms can be illustrated by an example from
New Scientist, Volume 236, describing how the domination
of robots in the financial markets means that
the human trader era is fading: “There are still a
lot of unanswered questions surrounding the last
bond market ‘flash crash’. On 15 October 2014”,
“the US Treasury market crashed for about 10
minutes. Experts hypothesized that “activities of
electronic trading algorithms” bore part of the
blame, but reserved judgement for when they had
more information. Three years later, no one is any
wiser” (Adee, 2017).

This can also be the case for the future power
system, where smart grid technology, IoT and
machine learning are put into operation. In order to
avoid similar algorithm-induced “flash crashes” in
the power system, it is vital to understand the
system dynamics and the inherent risk
properties, and to apply deep learning approaches
as decision support.


7 CONCLUSIONS 


The power system is rapidly changing towards a
digital power system through the use of advanced ICT
solutions, big data, smart grids, AI, machine learning
and other advanced instruments.

The digitalization of the power business in Norway,
Europe and other parts of the world gives rise to
numerous new possibilities, but also challenges.

To gain trust in the machine learning technology
being introduced to the power system, and to
avoid problems in the power system similar to those the
financial markets experienced with the ‘flash crash’
of 2014, new insight is needed. The need for
understanding and tracking, in near real time, the
power system’s inherent properties with regard to
power system dynamics and risk level becomes
evident.


ABSTRACT: Nowadays, a fundamental requirement for a prosperous society is a reliable energy supply.
Complex network theory provides an excellent basis to explore the functionality of such systems
in response to severe component failures. In this case study, the European natural gas system is analyzed.
The actual natural gas consumption is geospatially allocated to the infrastructure network. The network
is abstracted and the flow capacity of the network is computed. A scenario analysis is conducted in order
to identify the impact of storage facilities on the actual maximum possible flow. Furthermore, the natural
gas supply shortage caused by each pipeline in case of a potential pipeline shut-down or failure is
estimated. Finally, potential strategic locations of storage facilities for a more reliable natural gas network
are identified.


1 INTRODUCTION 


The consistent and continuous flow of goods and
energy is an essential requirement for a reliable and
prosperous economy in today’s world. In
Europe, for instance, a notable share of more than
20% of the primary energy consumption is covered
by natural gas (Eurostat, 2017). However, this
also creates a dependency on reliable supply, which
cannot always be guaranteed to the full extent.
In the last decade, several unforeseen natural gas
disruptions with consequences for the economy
occurred (Austvik, 2016). For example, the terrorist
attack on the In Amenas natural gas plant caused
a reduction of more than 10% of the natural gas
production of Algeria (Chrisafis et al., 2015). A
very recent example is the explosion in
Austria, where the accident disrupted natural gas
flows toward Croatia, Italy and Slovenia (Tirone
and Wabl, 2017). Not only can actual disruptions
cause damage, but the awareness of a potential
supply shortage can also trigger uncertainty about
business continuity. This happened, for example, in
March 2013 in the United Kingdom, when natural
gas equivalent to only six hours of supply
was available in the storage facilities (Plimmer and
Chazan, 2013).


Natural gas is transported and distributed by a
well-developed system. The design of such a system
requires long-term planning and large infrastructure
investments. Such investments
lock the capital in long-term contracts, often involving
policy decisions and agreements at national
or regional levels (e.g. Carvalho et al., 2014, Mišík
and Nosko, 2017). The European natural gas system
has become over time a large infrastructure network
with many components, such as compressor
stations, storage facilities, gas processing plants,
Liquefied Natural Gas (LNG) terminals, and LNG
liquefaction and regasification facilities, aiming
at assuring high reliability of the supply system.
These components should be well planned and
coordinated to guarantee a continuous and adequate
natural gas flow.

The natural gas demand cannot be covered
by the European countries’ available natural gas
resources (BP, 2016). Therefore, natural gas has to be
transported by pipelines from the
East and South to Europe. In addition, natural
gas is imported via LNG terminals.

The natural gas infrastructure network has
to be able to compensate for potential pipeline
shutdowns or failures, among others. This can
be achieved through the construction of strategic
storage facilities and by increasing the network
connectivity by means of new pipelines or diversification
of entry points (e.g. LNG terminals).

Because of its importance, the European natural
gas network is under constant analysis in order
to improve the network’s security of supply.
Besides natural hazard analysis (Poljansek
et al., 2012, Lustenberger et al., 2017), a transmission
network model, the so-called ProGasNet, was
developed (Praks et al., 2015) and complex network
methodologies were employed for analysis of the
network properties (Carvalho et al., 2009). All the
applied methodologies are based on graph theory.
Moreover, Monte-Carlo-based reliability and vulnerability
assessments of the network have been applied
(Praks et al., 2017), and different flow algorithms and
ways to allocate the natural gas in case of a disruption
have been tested (Carvalho et al., 2014). All the
developed methodologies enable the identification
of bottlenecks in case of a network component
failure. A methodology to systematically analyse
the impact of natural gas storage facilities on the
network, however, has not yet been developed. Furthermore,
the impact of natural gas storage facilities
on the entire European natural gas network has
not been previously investigated.

The aim of this study is to develop a methodol- 
ogy to analyse the European natural gas infrastruc- 
ture network’s capacity to cover the demand not 
only during full operational conditions, but also in 
case of a long-duration pipeline shutdown or fail- 
ure. Because storage facilities can be considered as 
consumption or supply infrastructure components 
in the network, the impact of feeding natural gas 
from natural gas storage facilities into the network 
in case of a potential long-term pipeline shutdown 
or failure is examined. The analytical framework 
employed in the current study comprises four steps 
to achieve this goal: 


1. extending the natural gas infrastructure network 
model with geo-referenced storage facilities, 

2. allocating natural gas demand according to 
geospatial consumption distribution, 

3. analysing the extended natural gas infrastruc- 
ture network by applying complex network 
theory, and 

4. conducting scenario analysis with and without 
storage facilities to analyse the potential impact 
of pipeline failures on the overall network flow. 


This study builds on Lustenberger et al. (2017).
The focus is on the European Network of Transmission
System Operators for Gas (ENTSOG)
because sufficiently detailed data was available for
the analysis. The network is abstracted, and the
impact of a pipeline failure causing supply
losses is estimated for each pipeline segment.
Based on this, the specific impact of each pipeline
segment is quantified. Moreover, the effect of storage
facilities in compensating for such a potential failure
is analysed.


2 METHOD 


In this section, the considered network system, 
including its components, is described, and the 
methods applied are explained. 


2.1 Geospatial natural gas infrastructure network 
model and its components 


The European natural gas infrastructure network 
consists of several thousand components includ- 
ing pipeline segments, entry and exit points, stor- 
age facilities and LNG terminals. In the present 
study, these infrastructure components are taken 
into account. 

The geospatial natural gas infrastructure network
data can be abstracted as a graph
(e.g. Carvalho et al., 2014, Praks et al., 2015),
while important features such as the network topology
are retained. This makes it possible to employ a maximum
flow algorithm well known in graph theory
(Heineman et al., 2016). The graph can be defined
as $G = (V, E, c)$, where $G$ is an undirected graph
with the vertices $V$ and the edges $E$, and $c$ denotes
the capacity of each edge. The vertices $V$ can be
sinks (e.g. consumer vertices) or sources (e.g. natural
gas drilling platforms, LNG import terminals,
and import pipelines).

A special case are the storage facilities and the 
natural gas power plants, which are discussed in 
section 2.2. 

$E(u,v)$ represents the pipeline connecting $V_u$
and $V_v$, where $u$ and $v$ are unique vertex
identities. The flow capacity of $E(u,v)$ is defined as
$c(u,v)$, while the natural gas flow through $E(u,v)$
is $f(u,v)$. If $V_u$ and $V_v$ are connected by multiple
edges, $E(u,v)$ represents the sum of these edges. In
this case, $c(u,v)$ is the sum of the involved pipeline
capacities. The graph is then built according to
Equation 1 and Equation 2.


$E(u,v) = E(v,u)$ (1)

$c(u,v) = c(v,u)$ (2)


With this abstraction, the natural gas infrastructure
network can be transformed into an undirected
(bidirectional edges) graph.
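
A minimal sketch of this abstraction is given below. The nested-dictionary data structure and the function name are assumptions made for illustration; the study does not state how the graph was stored. Parallel pipelines between the same two vertices are merged by summing their capacities, and every edge is stored in both directions so that c(u,v) = c(v,u).

```python
from collections import defaultdict


def build_capacity_graph(pipelines):
    """Build an undirected capacity graph from (u, v, capacity) records.

    Returns a nested mapping with capacity[u][v] == capacity[v][u].
    """
    capacity = defaultdict(lambda: defaultdict(float))
    for u, v, cap in pipelines:
        capacity[u][v] += cap   # parallel pipelines add up
        capacity[v][u] += cap   # symmetric entry, Equation 2
    return capacity


# Hypothetical pipeline segments (capacity in arbitrary flow units).
pipelines = [("A", "B", 10.0), ("B", "A", 5.0), ("B", "C", 7.0)]
graph = build_capacity_graph(pipelines)
```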


2.2 Allocation of natural gas demand 


The actual natural gas consumption is allocated to
the identified sink vertices. This is done considering
the geospatial distribution of the population density.
It is assumed that the population density represents
the geospatial distribution of the natural
gas consumption of all the demand sectors, such
as households or industries. The exception is the
actual consumption of the natural gas power
plants, which are identified as additional large
consumers. This distinction is made because
a linear allocation of the country-level natural gas
consumption to the population density
at raster grid cell resolution does not necessarily
reflect the spatial distribution of the natural gas
power plants. Moreover, natural gas power plants
constitute large consumers, which need to be taken
into account separately in the spatial allocation of
natural gas consumption. The consumption allocation
is conducted according to the following steps.

First, the population density for each country is 
estimated based on the corresponding population 
maps (CIESIN, 2016) and the country boundaries 
(zones) (GADM, 2016). 

Second, the natural gas consumption of the natural
gas power plants is estimated for each country.
The consumption of each power plant is derived
from its CO2 emissions provided by Enipedia
(2016) and summed up at the country level.

Third, the natural gas consumption at the 
country level for the other demand sectors (e.g. 
industries and households) is estimated. This con- 
sumption (OD-Consumption) can be represented 
by subtracting the natural gas consumed by natu- 
ral gas power plants from the total annual natural 
gas consumption at the country level (IEA, 2017). 
The resulting OD-Consumption is normalized per 
capita at the country level. The per capita con- 
sumption can then be allocated to the population 
density for each grid cell (~1 km) resulting in the 
OD-Consumption map of natural gas. 

Fourth, the European area is split into demand 
areas applying a Voronoi algorithm (also called 
Thiessen algorithm) considering the identified sink 
vertices. The Voronoi algorithm cuts the connection 
line to the nearest neighbour of a vertex into half 
and connects the cutting point to an area around 
the corresponding vertex (Aurenhammer, 1991). 
The resulting Voronoi diagram is a partitioning 
of Europe into demand areas, in such a way that 
all points within a given Voronoi cell are closer to 
the corresponding vertex than to any other vertex. 
With this procedure, consumption can be allocated 
to the network, while the topology of the network 
remains the same. 

Fifth, demand areas are spatially joined with the
OD-Consumption map. For this, all the raster
grid cells within each demand area are summed up.

Finally, the natural gas consumption of the 
power plants is added to the corresponding 
demand areas. 
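
The allocation steps can be sketched for a single country as follows. This is a simplified illustration, not the GIS workflow used in the study: grid cells and power plants are reduced to points in a planar coordinate system, and the Voronoi partition is replaced by the equivalent nearest-sink rule. All names and numbers are hypothetical.

```python
from math import hypot


def allocate_demand(cells, sinks, total_consumption, plants):
    """Allocate annual consumption to sink vertices.

    cells:  list of (x, y, population) raster cells
    sinks:  dict name -> (x, y) of the sink vertices
    plants: list of (x, y, consumption) natural gas power plants
    """
    plant_total = sum(cons for _, _, cons in plants)
    population = sum(pop for _, _, pop in cells)
    # OD-Consumption per capita: everything except the power plants.
    per_capita = (total_consumption - plant_total) / population

    def nearest(x, y):
        # A point lies in the Voronoi cell of its nearest sink vertex.
        return min(sinks, key=lambda s: hypot(sinks[s][0] - x, sinks[s][1] - y))

    demand = {name: 0.0 for name in sinks}
    for x, y, pop in cells:
        demand[nearest(x, y)] += per_capita * pop
    for x, y, cons in plants:
        demand[nearest(x, y)] += cons
    return demand


sinks = {"S1": (0.0, 0.0), "S2": (10.0, 0.0)}
cells = [(1.0, 0.0, 100), (2.0, 1.0, 300), (9.0, 0.0, 600)]
plants = [(8.0, 1.0, 200.0)]
demand = allocate_demand(cells, sinks, total_consumption=1200.0, plants=plants)
```

The allocated demand sums to the country total, so no consumption is lost or double counted in the procedure.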


Figure 1. Map of the network infrastructure, demand
areas (Voronoi polygons) and other demand sectors’
consumption (OD-Consumption) on raster grid for
illustration purposes. Legend: natural gas power plants,
storage facilities, Voronoi polygons, sinks, pipelines,
and OD-Consumption from low to high.


This procedure provides an estimate of the 
natural gas consumed in each demand area on an 
annual basis. The data used and the spatial distri- 
bution are illustrated in Figure 1. 

The power plants are connected to the network 
by assigning their consumption to the correspond- 
ing demand area. Storage facilities can serve as 
sinks or sources depending on the actual operation 
of the facility. 

The storage facilities are assigned to the corre- 
sponding demand area. Depending on the scenario 
definition (see section 2.4), the corresponding 
demand area vertex is considered as source and/ 
or sink. 

The result of this procedure is a graph defined as
$G(V, E, c)$ with $s, t \in V$, where $s$ represents the source
vertices and $t$ denotes the sink or target vertices.
Moreover, storage facilities can be switched on
or off by turning the corresponding vertex from a
sink to a source.


2.3 Complex network theory: maximum flow
estimation


The abstract network poses a multi-source/sink 
problem because there are more than one identi- 
fied source and sink. The multi-source/sink prob- 
lem is commonly solved introducing a virtual super 
source and a virtual super sink, connecting all the 
corresponding sources and sinks as illustrated in 
Figure 2 (e.g. Ford and Fulkerson, 1956). 

Moreover, flow capacities can be assigned to represent
the sinks or sources. This is done by
allocating flow capacities to the virtual
edges connecting the virtual super sink or the virtual
super source to the corresponding vertex. In
this study, the capacities of the virtual edges connecting
the source vertices to the virtual super source are
assumed to be infinite, whereas the virtual edges
connecting the sink vertices to the virtual super sink
are limited by the actual consumption estimated at each
corresponding vertex.

Figure 2. Illustration of the graph transformation
from a multi-sink/source problem to a single sink/source
problem.
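
The transformation can be sketched as below, assuming the network is stored as a nested dictionary of capacities. The function name and the vertex labels `SUPER_SOURCE` and `SUPER_SINK` are hypothetical. Source edges receive infinite capacity and sink edges are capped by the estimated consumption, as described above.

```python
def add_super_nodes(capacity, sources, sink_demand,
                    super_source="SUPER_SOURCE", super_sink="SUPER_SINK"):
    """Return a copy of the graph extended with virtual super nodes.

    capacity:    dict u -> dict v -> capacity (undirected, stored both ways)
    sources:     iterable of source vertices
    sink_demand: dict sink vertex -> estimated consumption
    """
    extended = {u: dict(nbrs) for u, nbrs in capacity.items()}
    extended.setdefault(super_source, {})
    extended.setdefault(super_sink, {})
    for s in sources:
        extended.setdefault(s, {})
        extended[super_source][s] = float("inf")   # unlimited supply
    for t, demand in sink_demand.items():
        extended.setdefault(t, {})[super_sink] = demand  # capped by demand
    return extended


network = {"A": {"B": 5.0}, "B": {"A": 5.0}}
extended = add_super_nodes(network, sources=["A"], sink_demand={"B": 3.0})
```

The virtual edges are directed (out of the super source, into the super sink), while the pipeline edges keep their undirected representation.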


Table 1. Ford-Fulkerson algorithm as applied in this
study.

Algorithm: Ford-Fulkerson algorithm (Ford and Fulkerson, 1956)
Data: $G(V, E, c)$, $s$, $t$
Result: $F_{max}$
while $p(s,t) \mid f(s,t) < c(s,t)$ exists do
  1. Find path with unexploited flow
     $p_{unexp} = p(s,t) \mid f(s,t) < c(s,t)$;
  2. Find unexploited capacity
     $c_{unexp} = \min(c(u,v) - f(u,v))$ for $u \in p_{unexp}$ and $v \in p_{unexp}$
  3. Add unexploited capacity to flow on $p_{unexp}$
     $f(u,v) = f(u,v) + c_{unexp}$ for $u \in p_{unexp}$ and $v \in p_{unexp}$
end
$F_{max} = \sum_{u \in V} f(s,u) = \sum_{u \in V} f(u,t)$


The maximum possible flow between the virtual
super source and super sink is computed as described in
Table 1, which is derived from Heineman et al. (2016).


2.4 Scenario analysis 


To estimate the impact in case of a pipeline failure, 
each pipeline is once deleted resulting in G = (VE). 
Whereas E = E\E, with k representing the number 
of removed edges, which is in this case E, = E). 
With this the impact in case of a single pipeline 
loss on the maximum possible flow is estimated. 
Hence, cascading failures in the network are not 
investigated. The dependencies of the network 
components in the sense of pipeline capacity to 
compensate a potential flow loss, however, are con- 
sidered. As the maximum possible flow is not only 
limited by the pipeline capacity, but also by the 
actual consumption, the resulting difference can be 
considered as the loss of service of the European 
natural gas infrastructure network. This impact is 
estimated according to Equation 3. 


$$\Delta F_{max,i} = \frac{F_{max} - F_{max,i}}{F_{max}} \cdot 100\% \quad (3)$$

where $\Delta F_{max,i}$ represents the impact in percent
in case of a failure of pipeline $i$, $F_{max}$ stands for
the initial maximum flow and $F_{max,i}$ for the computed
maximum flow without the pipeline $i$.
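The single-pipeline removal analysis and Equation 3 can be sketched as follows. This is an illustrative example under stated assumptions rather than the study's code: the network, its capacities and the function names are hypothetical, edges are directed, and the maximum flow is computed with a depth-first Ford-Fulkerson search on integer capacities.

```python
def max_flow(capacity, s, t):
    """Ford-Fulkerson with depth-first search for augmenting paths."""
    residual = dict(capacity)
    for (u, v) in capacity:
        residual.setdefault((v, u), 0)  # reverse edges
    neighbours = {}
    for (u, v) in residual:
        neighbours.setdefault(u, []).append(v)

    def augment(u, limit, seen):
        if u == t:
            return limit
        seen.add(u)
        for v in neighbours.get(u, []):
            spare = residual[(u, v)]
            if v not in seen and spare > 0:
                pushed = augment(v, min(limit, spare), seen)
                if pushed > 0:
                    residual[(u, v)] -= pushed
                    residual[(v, u)] += pushed
                    return pushed
        return 0

    total = 0
    while True:
        pushed = augment(s, float("inf"), set())
        if pushed == 0:
            return total
        total += pushed


def single_failure_losses(capacity, pipelines, s, t):
    """Remove each pipeline once and return (F_max, {pipeline: loss in %})."""
    f_max = max_flow(capacity, s, t)
    losses = {}
    for edge in pipelines:
        reduced = {e: c for e, c in capacity.items() if e != edge}
        f_max_i = max_flow(reduced, s, t)
        losses[edge] = (f_max - f_max_i) / f_max * 100.0  # Equation 3
    return f_max, losses


INF = float("inf")
# Hypothetical network: S* and T* are the virtual super source and super sink.
capacity = {
    ("S*", "A"): INF, ("S*", "B"): INF,      # virtual source edges
    ("A", "C"): 6, ("B", "C"): 2,            # pipelines
    ("C", "D"): 4, ("C", "E"): 10,
    ("D", "T*"): 5, ("E", "T*"): 4,          # virtual sink edges (consumption)
}
pipelines = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "E")]
f_max, losses = single_failure_losses(capacity, pipelines, "S*", "T*")
```

Only real pipelines are removed in the loop; the virtual edges stay in place so that the consumption limits continue to apply in every scenario.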

To estimate the impact of a storage facility 
considering a potential pipeline failure, a scenario 
analysis is conducted. For this purpose, the above- 
mentioned simulation is carried out twice, once 
without and once with storage facilities as natural 
gas sources. The storage facilities are added to the 
vertex in the corresponding demand area. The iden- 
tified vertices in the corresponding demand areas 
are then connected to the virtual super source, rep- 
resenting not only a sink but also a source vertex. 

The defined scenarios are with and without stor- 
age facilities. In this way, the impact of a storage 
facility on the network, and the service loss in case 
of a potential pipeline failure can be calculated. 


3 DATA 


This section describes the datasets employed and the
data acquisition. The data used are listed in
Table 2. All data were available in a format directly
usable in this study, except for the European natu- 
ral gas infrastructure network data. The acquisi- 
tion of this data and its processing is described in 
section 3.1. 


3.1 European natural gas infrastructure network 


The natural gas infrastructure data was extracted 
from three map layers, each consisting of 1024 high- 
resolution raster map tiles (images) from ENTSOG 
(2016). The three map layers represent three differ- 
ent ranges of pipeline diameters to which approxi- 
mate flow capacities can be assigned. 

First, the high-resolution images were georefer- 
enced. Second, the raster cells were reclassified and 
the raster-grids vectorized. This resulted in a pipe- 
line network in a digital map format (Lustenberger 
et al., 2017). To each edge a flow capacity c can 
be assigned according to the given diameter of the 
corresponding pipeline using the information from 
Carvalho et al. (2009). 

The identification of the sources was done 
according to the provided map of ENTSOG 
(2016). Drilling platforms and LNG terminals were 
identified as source vertices. In addition, the entry 
points of the pipelines running from the East and 
from the South to Europe were identified as source 
vertices. This was concluded from the natural gas 
import statistics according to BP (2016). 

In addition, natural gas storage information
was processed from ENTSOG (2016). The storage
facilities' information, including facility name and
ENTSOG-ID, was downloaded via the API (Application
Programming Interface) of the ENTSOG Transparency
Platform. The point coordinates were then georeferenced
according to the map provided by ENTSOG (2016).

Table 2. Datasets used.

| Dataset | Format | Source | Description |
|---|---|---|---|
| Population density | Raster | (CIESIN, 2016) | Gridded population of the world on raster grid with cell size of ~1 km. |
| Country boundaries | Polygon | (GADM, 2016) | Worldwide country boundaries (Administrative level 0). |
| National natural gas consumption | Table | (IEA, 2017) | Annual natural gas production, imports and exports on country level in ktoe (2015). |
| Natural gas pipelines | Polyline | (ENTSOG, 2016) | European natural gas pipelines in three different diameter classes (9611 pipelines, 284641 km). |
| Natural gas storage facilities | Point | (ENTSOG, 2016) | European natural gas storage facilities with location, facility name (152 entries). |
| Natural gas sources | Point | (ENTSOG, 2016) | European natural gas sources (LNG ports, drilling platforms and system boundary crossing points - 152 entries). |
| Natural gas power plants | Point | (Enipedia, 2016) | Natural gas power plant coordinates and associated annual CO2 emissions (618 entries, 565 used, 53 incomplete (not used)). |

Figure 3. Impact of a potential pipeline failure or shutdown on the maximum flow for the scenario without storage
facilities.


4 RESULTS AND DISCUSSION


Figure 3 displays the resulting maximum flow loss
($\Delta F_{max,i}$ [%]) without storage facilities. The maxi-
mum flow loss can also be interpreted as loss of
service. For six pipelines (labeled 1 to 6) the esti-
mated loss is higher than 1% (~4.5 · 10^9 m^3/year).



Similarly, Figure 4 shows the resulting maximum 
flow loss considering the storage facilities in 
addition as source vertices. The cases shown in 
Figures 3 and 4 are discussed in Table 3. 

As illustrated in Table 3, the introduction of 
storage facilities is an effective measure to reduce 
the system service loss in case of a long-duration 
pipeline failure or shutdown. This is because stor- 
age facilities can compensate a potential failure 
and provide natural gas during the time of reduced 
flow capacity of the network. 
Figure 4. Impact of a potential pipeline failure or shutdown on the maximum flow for the scenario with storage
facilities.


Table 3. Description of the pipelines with $\Delta F_{max,i}$ higher than 1% resulting from the scenario without natural gas stor-
ages. The index is referring to Figure 3 and Figure 4.

| Index | $\Delta F_{max,i}$ [%] without storage facilities (Figure 3) | Description (scenario without storage facilities) | $\Delta F_{max,i}$ [%] with storage facilities (Figure 4) | Description (scenario with storage facilities) |
|---|---|---|---|---|
| 1 | 4.6 | The Transit Gas pipeline and its southern extensions have the highest impact on the maximum flow in case of a failure. The pipelines connect northern Italy via Gries Pass through Switzerland with the German and French natural gas networks (Transitgas, 2017). The high maximum flow loss results from the fact that this pipeline is the only one crossing the central Alps providing a North/South connection. | 2.0 | With the introduction of the storage facilities the loss in case of a failure of the Transit Gas pipeline can be reduced from 4.6% to 2.0%. Several storage facilities are available in Germany and France to cover the remaining demand at the northern end of the pipeline, while two storage facilities are identified in central Italy (ENTSOG, 2016) to absorb demand reduction at the southern part of the pipeline. |
| 2 | 3.5 | The pipelines connecting Sicily and Calabria have the second highest impact on the maximum flow in case of a failure. The reason is that those pipelines are a part of the Trans-Mediterranean pipeline system (Transmed, 2017) and the Green-Stream project (GreenStream, 2017) connecting Algeria and Tunisia with Continental Europe. These pipelines provide a major share of the natural gas consumed in Europe (Carvalho et al., 2009). | 0.9 | With storage facilities, the loss can be substantially reduced. In case of a pipeline failure, Sicily is supplied by marine pipelines coming from northern Africa, while the supply reduction on the Italian mainland is partly absorbed by two natural gas storage facilities in central Italy. |
| 3 | 1.8 | The Turkish pipeline between the cities Sivas and Malatya connects the south-eastern part of Turkey with the Trans-Anatolian pipeline bringing natural gas from Georgia and Iran to Ankara (TANAP, 2017). The entire natural gas consumption of south east Turkey is provided with this pipeline, which amounts to ~1.8% of the consumption of the investigated system. | 1.8 | This loss remains the same because no storage facilities were identified in south eastern Turkey. |
| 4 | 1.8/1.8 | Two critical pipelines in the region of London were identified. The high loss of the south-eastern pipeline can be explained as it is the only high capacity pipeline delivering natural gas to London and the surrounding area. For the other pipeline in the north-western part the high loss is due to its high capacity as it delivers natural gas to the administrative regions of south west England. | 1.8/0.0 | The south eastern high capacity pipeline shows an unchanged loss as it is a dead-end pipeline. The loss of the other pipeline could be fully compensated by storage facilities. |
| 5 | | The pipeline near Venice connects the LNG port Cavarzere Porto Levante with the Italian mainland (AdriaticLNG, 2017). This indicates that this port plays a major strategic role for the continuous natural gas supply to continental Europe. | 0.2 | The loss of this pipeline can be reduced due to compensation of supply from storage facilities in the center of Italy. |
| 6 | | This pipeline connects the European mainland with the Danish Islands Funen and Zealand. The importance of this pipeline is due to the fact that it is the only pipeline supplying Copenhagen and southern Sweden with natural gas. | 0.0 | Because two storage facilities are located close to Copenhagen, the loss caused by a potential failure or shutdown of this pipeline could be reduced to 0. |


In this study, it is generally assumed that pipe- 
line failures or shutdowns have a duration of one 
year, which is likely to be highly conservative for 
most cases. This does not affect the identification 
of critical pipelines as it is based on topologi- 
cal analysis and is thus independent of the dura- 
tion of the loss of supply. On the other hand, the 
absolute loss of supply on the yearly basis as esti- 
mated here is expected to be much lower due to the 
expected typically much shorter duration of the 
interruptions. 

The dependencies of the different components are 
assessed to the extent that in case of a pipeline shut- 
down, the natural gas flow is rerouted according to 
available pipeline capacities. Hence, in case there is 
no additional pipeline capacity available to compen- 
sate, there is a shortage of natural gas supply and a 
reduction in the maximum flow of the network. 

Given the relatively conservative assumption of
long duration of unavailability of critical pipelines
after disruptions, the estimated losses are signifi-
cant, but relatively limited. Moreover, the losses
can to a certain extent be mitigated by the storage
facilities.


5 CONCLUSIONS 


In this study, the European natural gas infrastruc- 
ture network is analysed considering the impacts 
of postulated failures of critical pipelines, and the 
potential mitigation of capacity losses by storage 
facilities. 

Considering that the network capacity is not 
only limited by actual pipeline flow capacities, 
but also the consumption, leads to more realistic 
maximum flow analysis results. This is in contrast 
to the often-proposed maximum flow application 
with infinite capacities of the identified sources 
and sinks. 




The maximum flow loss analysis showed that 
the failure of a single pipeline in a scenario with- 
out natural gas storage facilities can cause service 
losses >1% for six pipelines, with a maximum of 
4.6% for the Transit Gas pipeline. Taking storage 
facilities into account can significantly reduce losses 
for all pipelines, except for south-east Turkey where 
no storage facilities are identified and a dead-end 
pipeline supplying the London area. 

The Transit Gas pipeline, crossing the Alps
from Switzerland to Italy, results in the highest
maximum flow loss for both scenarios (with and 
without storage facilities). 

For long-duration pipeline failures, the storage 
volume is required to be sufficiently large to act as 
a compensating source in the system throughout 
the whole disruption time. It is assumed that the 
storage capacity is not a limiting factor in reducing 
a pipeline supply loss. Additionally, the flows and 
consumption are aggregated on an annual level. 
This means that this study does not consider sea- 
sonal variabilities of natural gas consumption. The 
natural gas consumption in Europe is expected 
to be higher in winter than in summer (Hauser 
et al., 2017). Furthermore, this study assumes 
deterministically that the critical pipelines will be 
unavailable during the whole year, which is very
conservative. In summary, this study focuses on 
investigation of topological properties of the net- 
work and does not attempt to analyse the likely 
duration of disruptions. 

In practice, not only the strategic placement of 
additional sources like storage facilities or LNG 
terminals, but also a temporal increase of the pipe- 
line capacity, e.g. through increase of pressure in 
a pipeline, could provide sufficient natural gas to 
cover the demand (Su et al., 2017). This rerout- 
ing with temporal capacity increase of pipelines 
in case of a single pipeline failure or shutdown is 
partially considered. The pipeline capacities were 
quite roughly allocated and rather overestimated. 
It can be assumed that all pipelines are running at 
the corresponding allocated maximum capacity. 
This overestimation could possibly represent the 
credit for a temporal capacity increase of the pipe- 
lines in the system. 

Although the implementation of the findings
of this study at large requires some caution due 
to current limitations, the described approach and 
results could provide the relevant stakeholders with 
useful indications for further enhancements of the 
reliability of natural gas infrastructure networks in 
general and of the European natural gas network 
in particular. 

The results, and especially the approach devel- 
oped in this study, could provide potentially help- 
ful inputs for governmental agencies, the private 
industry and policy-makers.


It is recommended that the model be further
improved in terms of the time resolution. This
would take into account the seasonal variation of 
the actual natural gas consumption and the actual 
storage capacities and storage use (source/sink). 
Furthermore, a network analysis methodology 
considering not only edge removal, but also stor- 
age capacities and the given temporal limitations 
could be developed. 

Considering future work, the network model 
presented in this study will be combined with 
probabilistic treatments of potential losses of criti- 
cal pipelines due to technical failures and selected 
natural events. This will provide an analysis of the
actual risks caused by potential pipeline failures 
considering not only consequences but also fre- 
quencies. Moreover, future work will have a risk 
and resilience perspective on the system level and
not a reliability perspective on the component level. 
ABSTRACT: Common-Cause Failures (CCF) impose severe consequences on a complex system's reliability
and overall performance. A more realistic assessment, therefore, of the survivability of the system
requires an adequate consideration of these failures. The survival signature approach opens up a new and 
efficient way to compute system reliability, given its ability to segregate the structural and probabilistic 
attributes of the system. Traditional survival signature-based approaches assume the failure of one com- 
ponent to have no effect on the survival of the others. This assumption, however, is flawed for most realis- 
tic systems, given the existence of various forms of couplings between components. This paper, therefore, 
presents a novel and general survival signature-based simulation approach for non-repairable complex 
systems. Monte Carlo simulation is used to propagate CCFs across the complex
system, because an analytical approach is currently not available. In real-world applications,
however, due to lack of knowledge or data about the behaviour of a certain component, its parameters
can only be reported with a certain level of confidence, normally expressed as an interval. In order to deal
with this imprecision, a double loop Monte Carlo simulation methodology based on the survival
signature is used to analyse the complex system with CCF. Numerical examples are presented at the
end to show the applicability of the approach.


1 INTRODUCTION


Common-Cause Failures (CCFs) are failure events
that affect multiple components simultaneously.
The origin of common cause events can be out-
side the system components they affect, or they
can originate from the components themselves,
causing the other components to fail. The proper
consideration and modelling of CCFs is essential
in complex systems reliability analysis, as they may
have a significantly adverse effect on the system's
overall functionality. They have been shown by
many studies (Dhillon & Anude 1994) to decrease
the reliability and availability of multi-component
systems. They are, therefore, extremely important
in reliability assessment and must be given
adequate treatment, to minimise overestimation
(Modarres 2006).

The CCF event can either impact the overall sys-
tem operation or only affect specific components
within the system (Wierman et al. 2007). Aldemir
(1987) has given an overview of parametric Common-
Cause Failure models. Specifically, at the component
level, the CCF event is a component-level failure.
Rasmuson & Kelly (2008) reviewed the basic
concepts of modelling CCFs in reliability and risk
studies. One of the most commonly used single-parameter
models, defined by Fleming (1975), is the β-factor model,
which is the first parameter model applied to common
cause failures in risk and reliability analysis. He
then generalised the β-factor model to the multiple
Greek letter model in 1986 (Fleming et al. 1986).
The α-factor model, proposed by Mosleh et al. (1988),
develops CCFs from a set of failure ratios and the
total component failure rate. Based on the α-factor
model, Kelly & Atwood (2011) presented a method for
developing Dirichlet prior distributions that have
specified marginal means, but which are otherwise
minimally informative. The binomial failure rate model
(Atwood 1986), on the other hand, estimates the failure
frequency of two or more components in a redundant system.
This is computed as the product of the CCF shock
arrival rate and the conditional failure probability
of the components given the shock.

At the system level, the CCF event is a system
functional-level failure. A number of models have
been developed recently. For instance, George-Williams
& Patelli (2017) proposed an efficient load-flow
simulation approach to assess the availability
of reconfigurable multi-state systems with inter-
dependencies. A robust Bayesian approach to the
α-factor model for common cause failures has also
been proposed by Troffaes et al. (2014). Coolen
& Coolen-Maturi (2015b) presented non-parametric
predictive inference for system reliability
following a common cause failure. However, these
works have two main limitations: they either
(1) treat the components within the system as
exchangeable and of a single type, or (2) evaluate
the system configuration for every reliability
estimation trial, which is time consuming.
Therefore, an extension of the above works is needed.
Specifically, it is necessary to perform reliability
analysis on systems susceptible to CCFs whose
components belong to different types, as is always
the case for realistic complex systems. The survival
signature provides a good way to solve this problem.

The survival signature was first proposed by Coolen
& Coolen-Maturi (2012). It is a powerful methodology
which not only retains the merits of the former
system signature (Samaniego 2007), but can also be
used for complex systems with components belonging
to multiple types. In essence, it does not require
the assumption that components of different
types are exchangeable, which overcomes the long-
standing limitation of the system signature. This
is useful when a system consists of components
that belong to different types, which means their
failure times follow different probability distribu-
tions (Coolen & Coolen-Maturi 2015a).
Therefore, the survival signature is a promising method
for application to complex systems. Building on this
work, Aslett et al. (2015) analysed system
reliability within the Bayesian framework of sta-
tistics. Feng et al. (2016) dealt with the imprecision
within the system by analytical and numerical means,
and also presented new component importance measures.
An imprecise Bayesian non-parametric approach to system
reliability with multiple types of components, using
sets of priors, was developed by Walter et al. (2017).
Patelli et al. (2017) proposed efficient simulation
approaches based on the survival signature for
reliability analysis of large systems.
Reed (2017) put forward an efficient algorithm
for exact computation of the survival signature using
binary decision diagrams.

This paper is organised as follows. Section 2 briefly
introduces the concepts of the survival signature
and the α-factor parameter. The survival signature-
based simulation approach is proposed in Section 3,
which also introduces imprecision within the component
failure times. The applicability and performance of the
proposed approaches are presented in Section 4. Finally,
Section 5 closes the paper with conclusions.


2 SURVIVAL SIGNATURE AND α-FACTOR
PARAMETER


2.1 Survival signature 


Suppose there is a complex system with $m$
components which belong to $K \ge 2$ component types,
with $m_k$ components of type $k \in \{1,2,\ldots,K\}$ and
$\sum_{k=1}^{K} m_k = m$. Assume that the random failure times
of components of the same type are exchangeable,
while full independence is assumed for compo-
nents belonging to different types (iid). The survival
signature, denoted by $\Phi(l_1, l_2, \ldots, l_K)$
with $l_k = 0, 1, \ldots, m_k$ for $k = 1, 2, \ldots, K$, defines
the probability that the system functions given
that $l_k$ of its $m_k$ components of type $k$ work, for
each $k \in \{1,2,\ldots,K\}$. There are $\binom{m_k}{l_k}$ state vec-
tors $\underline{x}^k$ with $\sum_{i=1}^{m_k} x_i^k = l_k$ $(k = 1, 2, \ldots, K)$, where
$\underline{x}^k = (x_1^k, x_2^k, \ldots, x_{m_k}^k)$. Let $S_{l_1, \ldots, l_K}$ denote the set of
all state vectors for the whole system; it follows
that all the state vectors $\underline{x} \in S_{l_1, \ldots, l_K}$ are
equally likely to occur. Therefore, the survival sig-
nature can be expressed as:

$$\Phi(l_1, \ldots, l_K) = \left[ \prod_{k=1}^{K} \binom{m_k}{l_k}^{-1} \right] \times \sum_{\underline{x} \in S_{l_1, \ldots, l_K}} \phi(\underline{x}) \quad (1)$$

where $\phi = \phi(\underline{x}): \{0,1\}^m \to \{0,1\}$ is the system struc-
ture function, i.e., the system status based on all
possible state vectors $\underline{x}$. $\phi$ is 1 if the system func-
tions for state vector $\underline{x}$ and 0 if not.

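Let $C_k(t) \in \{0, 1, \ldots, m_k\}$ denote the number of
type $k$ components working at time $t$. Assume that the
components of type $k$ have a known cumulative
distribution function (CDF) $F_k(t)$ and that the failure
times of components of different types are
independent, then:

$$P\left( \bigcap_{k=1}^{K} \{ C_k(t) = l_k \} \right) = \prod_{k=1}^{K} P(C_k(t) = l_k) = \prod_{k=1}^{K} \binom{m_k}{l_k} [F_k(t)]^{m_k - l_k} [1 - F_k(t)]^{l_k} \quad (2)$$

Hence, the survival function of the system with
$K$ types of components becomes:

$$P(T_s > t) = \sum_{l_1=0}^{m_1} \cdots \sum_{l_K=0}^{m_K} \Phi(l_1, \ldots, l_K) \, P\left( \bigcap_{k=1}^{K} \{ C_k(t) = l_k \} \right) \quad (3)$$
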

Equation 3 shows that the structure of the sys-
tem is separated from the failure times of its
components, which is the typical advantage of the
survival signature. The survival signature is a summary
of the structure function and only needs to be
calculated once for the same system. As a result, it
is an efficient method to perform system reliability
analysis on complex systems with multiple compo-
nent types.
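Equations 1 to 3 can be illustrated with a short Python sketch. It assumes a hypothetical three-component system (one type-0 component in series with a parallel pair of type-1 components) and hypothetical exponential CDFs; the survival signature is obtained by brute-force enumeration of all state vectors, which is only practical for small systems.

```python
from itertools import product
from math import comb, exp, prod


def survival_signature(structure, types):
    """Equation 1: average the structure function over all state vectors
    that have l_k working components of each type k."""
    n_types = max(types) + 1
    m = [types.count(k) for k in range(n_types)]
    sums = {}
    for x in product((0, 1), repeat=len(types)):
        l = tuple(sum(xi for xi, tk in zip(x, types) if tk == k)
                  for k in range(n_types))
        sums[l] = sums.get(l, 0) + structure(x)
    phi = {l: s / prod(comb(m[k], l[k]) for k in range(n_types))
           for l, s in sums.items()}
    return phi, m


def system_survival(phi, m, cdfs, t):
    """Equation 3, with the probability term taken from Equation 2."""
    total = 0.0
    for l, value in phi.items():
        probability = prod(
            comb(m[k], l[k]) * cdfs[k](t) ** (m[k] - l[k]) * (1 - cdfs[k](t)) ** l[k]
            for k in range(len(m))
        )
        total += value * probability
    return total


# Hypothetical system: component 0 (type 0) in series with a parallel
# pair of components 1 and 2 (both type 1).
types = [0, 1, 1]
structure = lambda x: x[0] and (x[1] or x[2])
phi, m = survival_signature(structure, types)

cdfs = [lambda t: 1 - exp(-t), lambda t: 1 - exp(-0.5 * t)]  # assumed CDFs
p_survive = system_survival(phi, m, cdfs, t=1.0)
```

The dictionary `phi` depends only on the structure, so it can be reused for any choice of `cdfs`, which is the separation of structure and failure time distributions that Equation 3 expresses.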


2.2 α-factor model


The α-factor model is particularly useful in
engineering practice, as the α-factor parameters
can be obtained through expert judgement
of the system or from past data on the system.

The parameter $\alpha_r$ of the model is the frac-
tion of the total component failure events caus-
ing the simultaneous failure of an additional $r - 1$
components.

Let us assume there is a system with three
exchangeable components. $\alpha_1$ corresponds to the case
in which the failure of one component does not
influence the status of the other components.
$\alpha_3$ corresponds to the case in which the failure of
one component leads to the simultaneous failure of
the other two components, which means CCFs occur.
$\alpha_2 = p$ expresses that there is a probability $p$ of
one additional component failing, following the
failure of a component in this system. It follows
that $\sum_{r=1}^{3} \alpha_r = 1$.

Similarly, for a complex system with multi-
ple component types, the parameter $\alpha_r^k$
denotes the probability that, if one component of
type $k$ fails due to a common cause event, another
$r - 1$ components fail simultaneously. If there
are $m_k$ components in this group, it follows
that $\sum_{r=1}^{m_k} \alpha_r^k = 1$.

Based on the definition of the α-factor parameter,
the probability of a common cause basic event
involving failure of $k$ components in a system of $m$
components can be calculated by Equation 4.

$$Q_k = \frac{k}{\binom{m-1}{k-1}} \frac{\alpha_k}{\alpha_t} Q_t \quad (4)$$

where $k = 1, 2, \ldots, m$ and $\alpha_t = \sum_{k=1}^{m} k \alpha_k$. $Q_t$ is the
total probability of failure accounting both for
common cause failures and independent failures.
The alpha parameter estimator can be expressed as:

$$\hat{\alpha}_k = \frac{n_k}{\sum_{j=1}^{m} n_j} \quad (5)$$

where $n_k$ is the number of events with $k$ failed
components.

The $\alpha_k$ parameter estimator represents the prob-
ability that exactly $k$ of the $m$ components fail,
given that at least one failure has occurred. It can
be seen from Equation 5 that the sum of all the
$\alpha_k$ will be 1. The advantage of the α-factor model
is its distinction between the total failure rate of a
component $Q_t$, for which we generally have a lot of
information, and common cause failures modelled
by $\alpha_k$, for which we generally have very little infor-
mation (Troffaes et al. 2014).
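As a small worked example of Equations 4 and 5, the following Python sketch estimates the α-factors from event counts and converts them into basic event probabilities. The event counts and the total failure probability are made-up values chosen only for illustration, and the function names are hypothetical.

```python
from math import comb


def alpha_estimates(event_counts):
    """Equation 5: event_counts[k-1] is n_k, the number of observed events
    with exactly k failed components."""
    total = sum(event_counts)
    return [n / total for n in event_counts]


def basic_event_probabilities(alphas, q_total):
    """Equation 4: probability Q_k of a basic event involving k of the m
    components, for k = 1, ..., m."""
    m = len(alphas)
    alpha_t = sum(k * a for k, a in enumerate(alphas, start=1))
    return [
        k / comb(m - 1, k - 1) * a / alpha_t * q_total
        for k, a in enumerate(alphas, start=1)
    ]


counts = [90, 8, 2]          # hypothetical n_1, n_2, n_3 for a group of m = 3
alphas = alpha_estimates(counts)
q = basic_event_probabilities(alphas, q_total=1e-3)

# Consistency check: a given component takes part in C(m-1, k-1) basic
# events of size k, so these terms must add up to the total probability Q_t.
recovered_q_total = sum(comb(len(q) - 1, k - 1) * q_k
                        for k, q_k in enumerate(q, start=1))
```

With these counts the estimates are 0.9, 0.08 and 0.02, and the consistency check returns the assumed total failure probability.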


3 EFFICIENT METHOD FOR ANALYSING 
SYSTEM RELIABILITY WITH COMMON 
CAUSE FAILURES 


3.1 The proposed approach 


To perform reliability assessment on any
kind of system without introducing simplifica-
tions or unjustified assumptions, this section
proposes a simulation method to analyse system
reliability after common cause failures.

Suppose there is a complex system with $m$ com-
ponents which belong to the component groups $X_k$,
with $m_k$ components of type $k \in \{1,2,\ldots,K\}$ and
$\sum_{k=1}^{K} m_k = m$. The common cause group matrix is
denoted by $M_{CCG}$; the α-factor parameters of each
component group, $\alpha_r^k$, are stored in this matrix;
recall that $\sum_{r=1}^{m_k} \alpha_r^k = 1$.
The number of failing components of each type
depends on the number of components still func-
tioning. It is therefore necessary to use the α-factor
model to provide probabilities for any combina-
tion of numbers of components that would fail
when a common cause failure event occurs. It is
assumed here that if one component of type
$k$ fails, it can only influence the components within
the same component group $X_k$ under the CCF
model. This is reasonable, as components of the
same type tend to be influenced by the same com-
mon cause failure event; this is also the reason why
they are grouped in the same type. The procedure then
looks at how many of the components of each type still
function, and assumes exchangeability among them
with regard to the CCF model.

The reliability of the system after common
cause failures can be estimated adopting the fol-
lowing simulation procedure:

Step 1. Initialise the counter $V$ to store the output, define the mission
        time as $t_m$, and the number of samples as $N$;
Step 2. Define the component groups as $X_k$ and the common cause group
        matrix $M_{CCG}$; the α-factor parameters $\alpha_r^k$ are stored
        in the matrix;
Step 3. Sample the failure time $t_i$ of each component, where
        $i = 1, 2, \ldots, m$, and set $t_{old} = 0$; at this time the
        survival signature (production level) is equal to 1;
Step 4. Set the current time $t_{current} = \min(t_i)$;
Step 5. At time $t_{current}$, find out which component fails and which
        common cause group it belongs to. The components affected by a
        failure event due to CCF can be expressed as $V_s$;
Step 6. Update the number of working components of each component group
        after the CCF, and then get the survival signature (production
        level) $\Phi_t$ after the corresponding failure time $t$;
Step 7. Set the failure time of the components (the failed component and
        its common cause failure components) as infinite;
Step 8. Repeat Steps 4 through 7 until $t_{current} > t_m$;
Step 9. Store the production level of the system over time by
        $V(j) = V(j) + \Phi_t$;
Step 10. Repeat Steps 3 through 9 for $N$ times.

Therefore, the survival function of the complex
system after common cause failures is obtained
by averaging the vector collecting the production
level of the system over the number of samples:
$P(T_s > t \mid CCF) = V_t / N$.
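The procedure above can be sketched in Python for readers who want to experiment with it. This is a simplified illustration under several assumptions that are not taken from the paper: components of one type are tracked only through their pending failure times, the number $r$ of components failing in an event is drawn directly from the α-factors of the failing type and capped at the number still working, the additional victims are chosen uniformly at random, and the survival signature is passed in as a plain function `phi`. All names are hypothetical.

```python
import math
import random


def simulate_survival(phi, m, samplers, alphas, t_grid, n_samples, seed=0):
    """Estimate P(T_s > t | CCF) on an ascending time grid t_grid.

    phi(l)      survival signature for l = (l_1, ..., l_K) working components
    m[k]        number of components of type k
    samplers[k] function rng -> failure time for a type-k component
    alphas[k]   [alpha_1, ..., alpha_mk] for type k
    """
    rng = random.Random(seed)
    totals = [0.0] * len(t_grid)
    for _ in range(n_samples):
        # Sorted failure times of the components of each type still working.
        pending = [sorted(samplers[k](rng) for _ in range(m[k]))
                   for k in range(len(m))]
        j = 0
        while j < len(t_grid):
            heads = [(p[0], k) for k, p in enumerate(pending) if p]
            t_next, k = min(heads) if heads else (math.inf, None)
            level = phi(tuple(len(p) for p in pending))
            # Grid points before the next failure see the current level.
            while j < len(t_grid) and t_grid[j] < t_next:
                totals[j] += level
                j += 1
            if k is None or j == len(t_grid):
                break
            pending[k].pop(0)  # the component that fails now
            # Draw the total number r of type-k components failing together.
            r = rng.choices(range(1, len(alphas[k]) + 1), weights=alphas[k])[0]
            for _ in range(min(r - 1, len(pending[k]))):
                pending[k].pop(rng.randrange(len(pending[k])))
    return [total / n_samples for total in totals]


def exponential(rng):
    return rng.expovariate(1.0)  # assumed unit failure rate


def parallel_pair(l):
    return 1.0 if l[0] >= 1 else 0.0  # system works if any component works


grid = [0.5, 1.0]
independent = simulate_survival(parallel_pair, [2], [exponential],
                                [[1.0, 0.0]], grid, 20000)
coupled = simulate_survival(parallel_pair, [2], [exponential],
                            [[0.0, 1.0]], grid, 20000)
```

For the two-component parallel system, setting $\alpha_2 = 1$ makes every failure take out both components, so the estimated survival function drops from roughly $1 - (1 - e^{-t})^2$ to roughly $e^{-2t}$.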

The algorithm of the proposed simulation
method is as follows:

Algorithm 1 The Proposed Approach's Algorithm

Require: $N$, $t_m$, $X_k$, $M_{CCG}$, $N_t$: number of discretisation steps
Set $V(1 : N_t) = 0$                                  ▷ Initialise counters
Set $\Phi$ = Survival Signature                       ▷ Load survival signature
for $n = 1 : N$ do                                    ▷ Loop over number of samples
    Sample $t_i$                                      ▷ Sample failure time of every component in each type
    Set $[t_{current}, X_c] = \min(t_i)$              ▷ Minimum failure time and component index
    while $t_{current} < t_m$ do
        $V_{ch}$ = ceil($[t_{old}, t_{current}]$ / timestep) + 1        ▷ Define channels
        $V(V_{ch}(1) : V_{ch}(2) - 1) = V(V_{ch}(1) : V_{ch}(2) - 1) + \Phi(t_{old})$   ▷ Update counters
        if $M_{CCG}$ is not empty then                ▷ CCF within system
            $CCF_{cumprob}$ = cumsum($\alpha^k$)      ▷ Cumsum CCF probability of the failed component group
            $r$ = find($CCF_{cumprob} \ge$ rand)      ▷ Find the number of components in the CCG to fail
            if $r > 1$ then
                Propagate CCF
            end if
        else                                          ▷ No CCF within the system
            $V_s = X_c$                               ▷ $V_s$ contains affected components
        end if
        $\Phi(t_{old}) = \Phi(t_{current})$           ▷ Update survival signature
        $t_{old} = t_{current}$                       ▷ Update old time
        $[t_{current}, X_c] = \min(t_i)$              ▷ Minimum failure time and component index after CCF
    end while
    if $t_m > t_{old}$ then
        $V(V_{ch}(1) : V_{ch}(2)) = V(V_{ch}(1) : V_{ch}(2)) + \Phi(t_{old})$   ▷ Update counters
    end if
end for


3.2 Imprecision in consideration


In engineering applications, if there is impre-
cision within the component failure time distri-
butions, or if empirical distributions of component
failure times are used, no analytical method can be
used without resorting to some degree of simplifica-
tion or approximation (Beer et al. 2013, Aven 2017).
Instead, the proposed simulation methods can be
applied to any system irrespective of the probabil-
ity distribution used for the component failure times.
To be specific, the system reliability perform- 
ance after common cause failures can be simu- 
lated using survival signature-based Monte Carlo 
method. This double loop simulation method (Du 
& Chen 2004) not only has the advantage of sur- 
vival signature to handle complex system reliability 
problems, but can recur to Monte Carlo simulation 
to deal with the uncertainties within the system. 
Double loop sampling involves two layers of
sampling: the outer loop is called the parameter
loop, since it samples values for the distribution
parameters of all the uncertain quantities, while
the inner loop is called the probability loop,
because it samples from precise probability
distribution functions. In effect, double loop
sampling means sampling from an analytical
distribution whose parameters have themselves been
generated by sampling.

To account for epistemic imprecision in the
component parameters, one only needs to add an
optimization loop around the survival signature-based
simulation method described in Section 3.1 in order to
estimate the bounds. In other words, this can be done
by adding a simple Monte Carlo loop that samples the
values of the component parameters from uniform
distributions.
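
The double loop idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it treats a single Weibull component rather than the full survival signature-based system simulation, the function name survival_bounds and its arguments are hypothetical, and the intervals are those listed for component type 1 in Table 2.

```python
import random


def survival_bounds(scale_iv, shape_iv, t, n_outer=200, n_inner=2000, seed=0):
    """Estimate lower and upper bounds of P(T > t) for a Weibull failure
    time whose scale and shape are only known to lie in given intervals."""
    rng = random.Random(seed)
    lower, upper = 1.0, 0.0
    for _ in range(n_outer):                    # parameter (outer) loop
        scale = rng.uniform(*scale_iv)          # sample imprecise parameters
        shape = rng.uniform(*shape_iv)
        survived = 0
        for _ in range(n_inner):                # probability (inner) loop
            if rng.weibullvariate(scale, shape) > t:
                survived += 1
        estimate = survived / n_inner           # survival estimate for this parameter set
        lower = min(lower, estimate)
        upper = max(upper, estimate)
    return lower, upper


# Intervals of component type 1 in Table 2; t = 1 hour is an arbitrary choice.
lo, hi = survival_bounds((1.68, 1.86), (2.08, 2.32), t=1.0)
print(f"P(T > 1) lies roughly in [{lo:.3f}, {hi:.3f}]")
```

The bounds obtained this way are themselves Monte Carlo estimates, so they tighten as the number of outer and inner samples grows.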


4 NUMERICAL EXAMPLE 


Figure 1 shows an arbitrary 13-component
complex system whose components are arranged
into five groups. The number within each box
denotes the group the component belongs to,
while the number outside gives the index of the
component in the system. The system is assumed
to be non-repairable, and components of the same
group have the same failure time distribution, as
defined in Table 1. In the table, an exponential
distribution is defined by its mean (in hours), while
a Weibull distribution is defined by a pair whose first
element is its scale parameter (in hours) and whose
second element is its shape parameter.

The system is first analysed without CCF using the
proposed simulation model with the data presented
in Table 1, and the result is compared with its
analytical solution.

It is then re-analysed considering common
cause failures with all common cause groups
active. For this system, the common cause group
failure matrix M_CCG, with and without CCF, is
given in Equations 6 and 7, respectively. The
results obtained are shown in Figure 2.


Figure 1. Complex system with thirteen components
which belong to five types. The number inside the
component box represents the type, while the number
outside the box gives the component index.


Table 1. Component failure data with precise distribution parameters.

Component type   Distribution type   Distribution parameters   CCF parameters
1                Weibull             (1.8, 2.2)                {0.95, 0.05}
2                Exponential         1.2                       {0.8, 0.1, 0.05, 0.05}
3                Weibull             (2.3, 1.6)                {1}
4                Weibull             (3.2, 2.6)                {0.9, 0.1}
5                Exponential         2.1                       {0.75, 0.1, 0.1, 0.05}
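
As an illustration of how the CCF parameters in the table might be used inside the simulation, the following Python sketch draws the number of components that fail together. It assumes, as an interpretation not stated explicitly in the table, that the i-th CCF parameter is the probability that exactly i components of the group fail simultaneously; the function name sample_ccf_size is hypothetical.

```python
import random
from itertools import accumulate


def sample_ccf_size(ccf_params, u):
    """Return how many components of a common cause group fail together,
    given a uniform random number u in [0, 1).

    Assumption: ccf_params[i - 1] is the probability that exactly i
    components fail simultaneously.
    """
    for r, cumulative in enumerate(accumulate(ccf_params), start=1):
        if cumulative >= u:      # first entry of the cumulative sum reaching u
            return r
    return len(ccf_params)       # guard against floating-point rounding


params_type_2 = [0.8, 0.1, 0.05, 0.05]   # component type 2 in the table
r = sample_ccf_size(params_type_2, random.random())
print("components failing together:", r)
```

With r > 1 the failure would be propagated to r - 1 further components of the same group, which corresponds to the "Propagate CCF" step of the simulation algorithm.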


Table 2. Component failure data with imprecise distribution parameters.

Component type   Distribution type   Distribution parameters        CCF parameters
1                Weibull             ([1.68, 1.86], [2.08, 2.32])   {0.95, 0.05}
2                Exponential         [1.07, 1.33]                   {0.8, 0.1, 0.05, 0.05}
3                Weibull             ([2.12, 2.51], [1.38, 1.72])   {1}
4                Weibull             ([2.99, 3.41], [2.51, 2.79])   {0.9, 0.1}
5                Exponential         [2.01, 2.28]                   {0.75, 0.1, 0.1, 0.05}

Figure 2. Survival function of the system in Figure 1
with CCF and without CCF obtained by the simulation
method, along with the system reliability without CCF
obtained from the analytical solution.

The accuracy and generality of the proposed
simulation approach are validated by the plots in
Figure 2, given the agreement between the simulation
and analytical results. As shown, the reliability
of the system reduces drastically when the effects
of CCF are factored into the analysis. This illustrates
the need to consider this realistic aspect of a
system's operation in its reliability evaluation.

To deduce the effects of imprecision in the fail- 
ure distribution parameters of components on the 
system survival function, the system is analysed 
using the data presented in Table 2. Instead of 
a single curve, the survival function, in this case, 
could be any of an infinite number of curves lying 
within the bounds shown in Figure 3. 


5 CONCLUSIONS 


Common-Cause Failures (CCF) have an adverse
effect on the reliability and performance of
multi-component systems. They are normally a
consequence of functional couplings between a group
of components, which can arise for a variety of
reasons. Realistic multi-component engineering
systems are therefore inevitably susceptible to these
failures. The need to incorporate CCF considerations
into system analysis is thus overwhelming, as the
alternative may lead to overestimating the reliability
of the system.

This paper puts forward an efficient simulation
method, based on the survival signature, to
perform reliability analysis on complex systems
with common cause failures. This approach
extends the applicability of the survival signature
to systems susceptible to common cause
failures. More importantly, it combines the merits of
the survival signature methodology and Monte
Carlo simulation. The approach is therefore general
and yields the survival function of the system after
common cause failures at each point in time.
Furthermore, the probabilistic uncertainty and
imprecision in the component parameters are taken
into consideration by this general simulation
method. The effectiveness and feasibility of
the proposed approach have been demonstrated by
the numerical example.
ABSTRACT: Modern industrial trends such as Cyber-Physical Systems and System of Systems lead to
continuously increasing complexity and heterogeneity of components and interfaces, as well as more
and more advanced software parts. Classical reliability evaluation methods recommended in current
standards, such as Fault Tree Analysis (FTA) and Failure Mode and Effect Analysis (FMEA), fail to
describe system behavioral aspects in sufficient depth. Therefore, additional, sophisticated and
highly specialized methods for the analysis of the effects of unavoidable faults are required. The recently
introduced Dual-graph Error Propagation Model (DEPM) is a stochastic framework that captures system
properties relevant to error propagation processes, such as control and data flow structures and reliability
characteristics of single components. The DEPM helps to estimate the impact of a fault of a particular
component on the overall system reliability, e.g. to compute the mean number of erroneous values in a
critical system output during a given operation time. A DEPM can be automatically generated from various
semi-formal system representations such as UML/SysML, AADL, or Simulink/Stateflow. However,
despite the common trend towards model-based system development, the functional software parts usually
incorporate manually programmed code. The error propagation properties of this manual code also need
to be analyzed and considered during the reliability evaluation of the complete system. This paper presents
a new method, based on the Low-Level Virtual Machine (LLVM) compiler framework, that allows the
automatic transformation of C-code, or code of another LLVM-supported front-end, into a DEPM. The source
code is compiled into the LLVM Intermediate Representation and instrumented in order to analyze control
and data flow structures of LLVM instructions and control flow transition probabilities. The obtained
information is transformed into the formal DEPM XML for further analysis. The paper describes the
transformation method and its application to the low-level flight control software of a UAV system.


1 INTRODUCTION


1.1 Motivation


Model-Based System Engineering (MBSE) plays
an important role in the development of modern
safety-critical systems. Advanced MBSE toolchains
support the system development starting from
high-level design up to deployment and testing.
MATLAB Simulink (MathWorks 2017a) and Stateflow
(MathWorks 2017b) dominate in the field of
control software development. Current trends such
as Cyber-Physical Systems and System of Systems
lead to continuously increasing complexity and
heterogeneity of components and interfaces, as
well as more and more advanced software parts.
The increasing complexity brings new challenges
for system analysis. Classical reliability and safety
evaluation methods recommended in current
industrial standards, such as Fault Tree Analysis
(FTA) and Failure Mode and Effect Analysis
(FMEA), fail to describe system behavioral aspects
in sufficient depth. Therefore, additional,
sophisticated and highly specialized methods for
the analysis of the effects of unavoidable faults are
required. The recently introduced Dual-graph Error
Propagation Model (DEPM) (Morozov & Janschek
2014) is a stochastic framework that captures
system properties relevant to error propagation
processes. The DEPM helps to estimate the impact
of a fault of a particular component on the overall
system reliability, e.g. to compute the mean number
of erroneous values in critical system outputs.

A DEPM can be automatically generated from
various semi-formal system representations such as
Simulink/Stateflow (Morozov et al. 2016), UML/
SysML (Ding et al. 2016), and AADL (Morozov
et al. 2018). Despite the common MBSE trend, the
functional software parts usually incorporate
manually programmed code. The error propagation
properties of this code should be analyzed with a
DEPM and considered during the reliability
evaluation of the complete system. This paper presents
a new method, based on the Low-Level Virtual
Machine (LLVM) compiler framework, that
allows the automatic transformation of C-code, or
code of another LLVM-supported front-end, into a
DEPM. Figure 1 demonstrates three possible use cases:


Figure 1. Three possible use cases of the proposed
transformation method.


• Use case 1: a Simulink model contains manually
  developed S-Function blocks. We generate
  individual DEPMs separately for these blocks using
  the proposed method and integrate them into a
  top-level DEPM generated from the Simulink
  model using the method presented in (Morozov,
  Janschek, Krüger, & Schiele 2016).

• Use case 2: we apply the proposed method to
  the automatically generated C-code of the entire
  Simulink model.

• Use case 3: we generate DEPMs for manually
  developed code in the case of a non-MBSE
  approach. An example of this use case is shown
  in Section 3.


The rest of the paper is organized as follows. The
two following subsections give an overview of the
background DEPM and LLVM technologies.
Section 2 provides technical details of the proposed
transformation method. Section 3 demonstrates
the method with a case study, showing the
transformation of the low-level flight control software
of a UAV into a DEPM.


1.2 Dual-graph error propagation model 


The DEPM is a mathematical model that captures
control and data flow aspects of a system,
described using the following set-based mathematical
notation:


DEPM := (E, D, A_CF, A_DF, C)    (1)


E is a non-empty set of executable system
elements;

D is a set of data storages;

A_CF is a set of directed control flow arcs, extended
with control flow decision probabilities;

A_DF is a set of directed data flow arcs;

C is a set of conditions of the elements.


Fig. 2a shows a simple DEPM example. An
Element, e.g. A, B, or C, represents an executable
part of a system. Each element may receive input
data and provide output data. A Data, e.g. d1,
d2, or d3, represents a variable that can be read
or written by an Element. For instance, Element
C reads d2 and d3, and writes output. A ControlFlow
arc (black lines in Fig. 2a), weighted with
a probability attribute, represents a control flow
transition between Elements. For instance, after
the execution of A, B will be executed with
probability 0.7 and C with probability 0.3. A DataFlow
arc (purple lines in Fig. 2a) describes data transfer
between the elements. A DataFlow connects an
Element with a Data or vice versa. The DataFlow
arcs are considered to be the paths of the data
error propagation.
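
To make the structure concrete, the tuple of Equation (1) can be mirrored by a small data structure. The Python sketch below is only an illustration and not the ErrorPro data model: the class name DEPM, its field names, and the problems method are assumptions, the conditions C are omitted, and only the arcs explicitly mentioned in the paragraph above are entered.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DEPM:
    """Illustrative container for (E, D, A_CF, A_DF); conditions C are omitted."""
    elements: set
    data: set
    control_flow: dict = field(default_factory=dict)  # (from, to) -> probability
    data_flow: set = field(default_factory=set)       # (from, to) pairs

    def problems(self):
        """Return a list of structural inconsistencies (empty if none)."""
        found = []
        outgoing = defaultdict(float)
        for (src, dst), prob in self.control_flow.items():
            if src not in self.elements or dst not in self.elements:
                found.append(f"control flow {src}->{dst} must link two elements")
            outgoing[src] += prob
        for src, total in outgoing.items():
            if abs(total - 1.0) > 1e-9:
                found.append(f"probabilities leaving {src} sum to {total}")
        for src, dst in self.data_flow:
            element_to_data = src in self.elements and dst in self.data
            data_to_element = src in self.data and dst in self.elements
            if not (element_to_data or data_to_element):
                found.append(f"data flow {src}->{dst} must link an element and a data storage")
        return found


# Only the arcs named in the text: A -> B (0.7), A -> C (0.3), C reads d2, d3, writes output.
abc = DEPM(
    elements={"A", "B", "C"},
    data={"d1", "d2", "d3", "output"},
    control_flow={("A", "B"): 0.7, ("A", "C"): 0.3},
    data_flow={("d2", "C"), ("d3", "C"), ("C", "output")},
)
print(abc.problems())
```

The check that the probabilities of all control flow arcs leaving an element sum to one reflects the reading of these weights as decision probabilities.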

The fault activation and the error propagation
can be specified using probabilistic Conditions in
the elements. During the execution of an element,
faults can be activated and the resulting errors
propagate to its outputs. The elements that can
activate faults are highlighted in red. For instance, in
the element A, highlighted in red in Fig. 2a, faults
can be activated with probability 0.1, as defined in
the conditions of A (see conditions A in Fig. 2b),
and the resulting errors propagate to its output data
d1 and d3.


ErrorPro
System: ABC Model
Number of steps: 100

Conditions of element A
if (True):
    then (d1=ok)&&(d3=ok) with prob 0.9
    then (d1=error)&&(d3=error) with prob 0.1

Conditions of element B
if (d1==ok):
    then (d2=ok) with prob 1.0
if (d1==error):
    then (d2=ok) with prob 0.1
    then (d2=error) with prob 0.9

Conditions of element C
if (d2==ok)&&(d3==ok):
    then (output=ok) with prob 1.0
if (d2==error)||(d3==error):
    then (output=ok) with prob 0.2
    then (output=error) with prob 0.8


(a) A simple DEPM. (b) Conditions of the elements.


Figure 2. An example of a DEPM with specified
conditions of the elements.

The error propagation or error correction
probabilities for each element are also defined using
probabilistic Conditions (see Fig. 2b). The errors
can propagate from the data inputs to the outputs.
For instance, the conditions of the element B
specify that the element B does not activate faults, but
that errors can propagate from d1 to d2 with
probability 0.9.
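
The effect of these conditions can be checked with a short calculation. The following Python sketch is a simplified illustration rather than the ErrorPro computation: it considers only one execution of A followed by one execution of B, and the names P_D1_AFTER_A, B_CONDITIONS and propagate are hypothetical. The probabilities are those given in the conditions of A and B.

```python
# State of d1 after one execution of A (conditions of A: fault activation).
P_D1_AFTER_A = {"ok": 0.9, "error": 0.1}

# Conditions of B: distribution of the state of d2 given the state of d1.
B_CONDITIONS = {
    "ok": {"ok": 1.0, "error": 0.0},
    "error": {"ok": 0.1, "error": 0.9},
}


def propagate(input_dist, conditions):
    """Combine an input state distribution with element conditions
    using the law of total probability."""
    output = {"ok": 0.0, "error": 0.0}
    for in_state, p_in in input_dist.items():
        for out_state, p_out in conditions[in_state].items():
            output[out_state] += p_in * p_out
    return output


p_d2 = propagate(P_D1_AFTER_A, B_CONDITIONS)
print(p_d2)  # error probability of d2 is 0.1 * 0.9, i.e. about 0.09
```

In other words, along the path from A to B, an error reaches d2 only if a fault is activated in A (0.1) and the error is then propagated rather than corrected by B (0.9).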

ErrorPro™ (Morozov et al. 2015) is our software
tool for stochastic error propagation analysis
that allows users to create and compute DEPMs.
Discrete-time Markov chain models that describe
system execution and error propagation processes
are automatically generated and computed
using an interface with the PRISM model checker
(Kwiatkowska, Norman, & Parker 2011). In the
DEPM, the mean number of errors of all data
marked in yellow can be computed. This
reliability metric, the mean number of errors, is the
average cumulative number of errors that occur in
the data during the system execution. For instance,
the mean number of errors in the data storage output
during 100 steps (the execution of one element
is one step) is equal to 3.63, as shown in Fig. 2a.
A DEPM model can be stored in an XML file.
A fragment of such an XML file is shown in
Listing 1.


1.3 LLVM compiler infrastructure


The LLVM Project (Lattner & Adve 2004) is a 
collection of modular and reusable compiler and 
toolchain technologies. The LLVM libraries pro- 
vide a source and target-independent optimizer, 
along with code generation support for many 
popular CPUs. These libraries are built around 
an assembly-like low-level code representation 


<?xml version="1.0" encoding="utf-8"?>
<model n_steps="100" name="ABC Model" version="3.0">
  <element initial="true" name="A"/>

  <data name="d1"/>
  <control_flow from="A" prob="0.7" to="B"/>
  <data_flow from="A" to="d1"/>

  <conditions element.name="A">
    <if statement="True">
      <then prob="0.9" update="d1 = ok; d3 = ok;"/>
      <then prob="0.1" update="d1 = error; d3 = error;"/>
    </if>
  </conditions>

</model>


Listing 1. An example of a DEPM XML file.


C code:                   LLVM IR:
int max(int a, int b)     define i32 @max(i32 %a, i32 %b) #0 {
{                         entry:
    if (b < a)              %retval = alloca i32, align 4
        return a;           %a.addr = alloca i32, align 4
    else                    %b.addr = alloca i32, align 4
        return b;           store i32 %a, i32* %a.addr, align 4
}                           store i32 %b, i32* %b.addr, align 4
                            %tmp = load i32, i32* %b.addr, align 4
                            %tmp1 = load i32, i32* %a.addr, align 4
                            %cmp = icmp slt i32 %tmp, %tmp1
                            br i1 %cmp, label %if.then, label %if.else

                          if.then:
                            %tmp2 = load i32, i32* %a.addr, align 4
                            store i32 %tmp2, i32* %retval
                            br label %return

                          if.else:
                            %tmp3 = load i32, i32* %b.addr, align 4
                            store i32 %tmp3, i32* %retval
                            br label %return

                          return:
                            %tmp4 = load i32, i32* %retval
                            ret i32 %tmp4
                          }


Figure 3. C-code example and its LLVM IR.

[Figure 4: a Module contains global variables (@Global variable a, @Global variable b) and Functions (Function a, Function b); each Function contains Basic Blocks, and each Basic Block contains Instructions.]


Figure 4. The hierarchical structure of LLVM IR.


known as the LLVM Intermediate Representation
(LLVM IR). The LLVM IR is a representation
in between a high-level language
and low-level machine code. Figure 3 shows a
small piece of C-code (left) and the corresponding
LLVM IR code (right). The LLVM IR uses the
Static Single Assignment (SSA) form: each
variable is defined exactly once and cannot be
re-defined. All variables in the IR code are
explicit and strongly typed.

The LLVM IR code is structured into modules 
that contain functions as it is shown in Figure 4. 
The functions are decomposed into basic blocks. A 
basic block contains a sequence of single instruc- 
tions. The conditional jumps between basic blocks 
form the control flow structure. 

Figure 5 shows a typical LLVM-based workflow.
An input source code, e.g. C-code or code in any
other of the more than 20 LLVM-supported languages,
can be compiled with clang into the LLVM IR. Using
the LLVM API, an optimizer can be built that
produces an optimized version of the compiled LLVM
IR, which, in turn, can be further compiled with
llc for x86, ARM, PowerPC or other platforms.
The optimizer can be built with so-called analysis
or transformation passes. An analysis pass does
not change the input IR but provides analytical
information; for instance, it counts the number of
instructions or builds a call graph. In contrast, a
transformation pass changes and optimizes the
original LLVM IR.


Figure 5. Common LLVM-based workflow.
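
The kind of information an analysis pass provides can be illustrated with a toy instruction counter. Note that the Python sketch below is not an LLVM pass and does not use the LLVM API: it merely scans the textual form of a small, hand-written IR function, and the name count_instructions and the simple line-based parsing rules are assumptions made for illustration.

```python
IR = """\
define i32 @max(i32 %a, i32 %b) {
entry:
  %cmp = icmp slt i32 %b, %a
  br i1 %cmp, label %if.then, label %if.else
if.then:
  ret i32 %a
if.else:
  ret i32 %b
}
"""


def count_instructions(ir_text):
    """Count instructions per basic block without modifying the input,
    in the spirit of an analysis pass (toy text-based version)."""
    counts = {}
    block = None
    for raw in ir_text.splitlines():
        line = raw.split(";", 1)[0].strip()      # drop IR comments
        if not line or line == "}" or line.startswith(("define", "declare")):
            continue
        if line.endswith(":"):                   # a label opens a new basic block
            block = line[:-1]
            counts[block] = 0
        elif block is not None:                  # any other line is an instruction
            counts[block] += 1
    return counts


print(count_instructions(IR))
```

A real pass would iterate over functions, basic blocks and instructions through the LLVM API instead of parsing text, but the result has the same shape: a read-only summary of the IR.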


2 TRANSFORMATION METHOD 


Figure 6 shows the architecture of the proposed
transformation method. Rounded rectangles with
blue borders represent activities and gray rectangles
represent data. The red frames highlight
our contribution: we have developed two LLVM
passes and a Python script that processes the
outputs of the passes and generates a DEPM XML
file for further analysis with ErrorPro™. All the
steps are automated and can be run with a single
shell script that calls the LLVM tools (compilers,
linkers, etc.), as well as our passes and the DEPM
generation Python script.


2.1 Transformation pass for control flow analysis 


The first pass, a transformation pass, helps to identify
the elements of the future DEPM, the control flow
structure, and the control flow transition probabilities.
An overview of the transformation pass is shown
in Figure 7. The pass takes the LLVM IR of the
original C-code, generated with clang, as input and
performs two tasks. (1) The first task is to generate
the list of DEPM elements that represent single
LLVM IR instructions. Using the LLVM API, the
pass iterates over all LLVM IR instructions and
stores the information in the elements.txt file.
The pass gives each instruction a unique name,
taking into account its location in the
LLVM IR hierarchy. The structure of elements.txt
is shown on the right-hand side of Figure 7. (2) The
second task is the generation of the instruction
execution sequence in order to define the structure
of the DEPM control flow as well as the control
flow transition probabilities. To achieve
this, we link the external C-code of a "print" function
with the original LLVM IR using the llvm-link
tool and append calls to this function after each
instruction of the original LLVM IR. After this
instrumentation, we compile and run the program.
The instruction execution sequence is stored in
sequence.txt.


Figure 6. Top-level architecture of the C-to-DEPM
transformation method.


2.2 Analysis pass for data flow analysis 


The second pass, an analysis pass, makes no changes
to the original LLVM IR but analyzes the code in order
to identify variables and their relations with the
LLVM instructions. An overview of the analysis
pass is given in Figure 8. This pass iterates over the
instructions, identifies the related input and output
variables (operands), and generates unique names
for the future DEPM data storages. The pass also
classifies the variables as local or global, and stores
this information (data records) in the data.txt
file. An example of the data records is shown on
the right-hand side of Figure 8. The general challenge
here is to define rules for the different types of
LLVM instructions.


Figure 8. Structure of the analysis pass.


2.3 DEPM generation 


The last step is the generation of a DEPM model
in XML format. The Python script parses the
elements.txt, sequence.txt, and data.txt files. The
information from elements.txt is transformed into
a set of <element> tags of the DEPM XML file.
The sequence.txt file is processed in order to count
the number of control flow transitions between
the elements and transform this information into
control flow arcs and transition probabilities for
<cf_arc> tags. The data.txt file helps to create a
set of data storages using the <data> tags and to
connect them with the elements using <data_flow>
tags. The generation of DEPM conditions is
outside the scope of this method and will be
considered later.
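
The counting step can be sketched as follows. This Python fragment is a minimal illustration and not the authors' script: the function name transition_probabilities is hypothetical, and the recorded sequence is represented simply as a list of element names, whereas the actual layout of sequence.txt is not specified here.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequence):
    """Estimate control flow transition probabilities as relative
    frequencies of consecutive element pairs in an execution sequence."""
    pair_counts = Counter(zip(sequence, sequence[1:]))
    totals = defaultdict(int)
    for (src, _dst), n in pair_counts.items():
        totals[src] += n                         # transitions leaving src
    return {(src, dst): n / totals[src] for (src, dst), n in pair_counts.items()}


# Hypothetical recorded sequence of executed elements.
trace = ["A", "B", "A", "C", "A", "B", "A", "B"]
probs = transition_probabilities(trace)
for (src, dst), p in sorted(probs.items()):
    print(f"{src} -> {dst}: {p}")
```

Because the probabilities are relative frequencies from one instrumented run, they depend on the inputs used for that run.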


2.4 Hierarchy 


The DEPM supports hierarchical models.
A DEPM element can be compound and contain
another DEPM inside. This helps to map the
LLVM hierarchy to the DEPM, as shown
in Figure 9. The main function of an LLVM
module is modeled with the top-level DEPM,
see Level 1 in Figure 9. The elements of this
DEPM represent the basic blocks of the main
function. Each of these blocks contains a sequence
of instructions that is modeled with the elements
of Level 2 DEPMs. Some of these instructions
might call other functions. The basic blocks of these
functions and their instructions are modeled
using the Level 3 and Level 4 DEPMs, respectively.
The local variables are modeled with the
corresponding DEPM data storages and the global
variables with the data storages of the top-level
DEPM.


Figure 9. Composition of DEPM elements and data
storages in different hierarchical levels.


2.5 Limitations 


The current software implementation of the
transformation method is a proof-of-concept and has a
number of technical restrictions:


• A function can be called only in one place.

• A function cannot call itself (no recursion).

• Pointers and arrays are not supported.

• All the functions have to be defined in a single
  module.


The elimination of these restrictions requires
some effort, but even now the method can be
applied to the analysis of safety-critical control
software, which is usually developed or auto-coded
according to standards like MISRA-C (MISRA
Ltd 2004) that considerably limit the capabilities
of C/C++.


3 CASE STUDY 


A part of the embedded flight control software of an
octocopter UAV has been selected as a case study
for the introduced transformation method, see
Figure 10. This case study was also discussed in
(Morozov & Janschek 2016). The flight vehicle
contains a number of onboard computers with
guidance, navigation, and control software. A
part of the control software, responsible for attitude
and rate control, was selected as one of the
most critical parts of the entire UAV. The main
function of the selected flight control software
is shown in Figure 10. The software is written in
C and contains approximately 800 lines of code.
It is decomposed into six functions. The functions
"read_input", "rate_ctrl", and "ecg" are
invoked in each iteration of the main loop, the
functions "err_quat" and "attd_ctrl" are executed
in every second iteration, and the function
"eul_to_quat" in every fourth iteration. The software reads
sensor data and external inputs ("read_input"),
processes them ("eul_to_quat" and "err_quat"),
performs attitude and rate control ("attd_ctrl"
and "rate_ctrl"), and generates engine commands
("ecg"). The output variable "mtr_cmd" is a critical
system output.


int main(int argc, char *argv[])
{
    input_file = fopen("data/input.txt", "r");
    output_file = fopen("data/output.txt", "w");
    static uint32_t step = 0;
    while (step < 100)
    {
        read_input();
        if (step % 2 == 0)
        {
            if (step % 4 == 0)
                eul_to_quat();
            err_quat();
            attd_ctrl();
        }
        rate_ctrl();
        ecg();
        step++;
        for (uint8_t i = 0; i < 8; i++)
        {
            fprintf(output_file, "%i;", mtr_cmd[i]);
        }
        fprintf(output_file, "\n");
    }
    fclose(output_file);
    fclose(input_file);
    return 0;
}


Figure 10. An octocopter UAV system. Top: an embedded
computer with the low-level flight control software
highlighted with the red frame. Bottom: C-code of the
main function of the low-level flight control software.


This software was successfully transformed into 
a set of hierarchical DEPM models using the intro- 
duced method. The top-level DEPM is shown in 
Figure 11. The DEPM control flow branching cor- 
rectly represents the if-structure of the discussed 
main function. Table 1 gives the numerical evalua- 
tion of the case study size. 
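
The branching frequencies implied by the main loop can be reproduced with a small counting model. The Python sketch below mirrors only the call schedule of the main function in Figure 10 (it does not execute the control software); the function name count_calls is hypothetical.

```python
from collections import Counter


def count_calls(n_steps=100):
    """Count how often each function is invoked by the main loop schedule."""
    calls = Counter()
    for step in range(n_steps):
        calls["read_input"] += 1
        if step % 2 == 0:                # every second iteration
            if step % 4 == 0:            # every fourth iteration
                calls["eul_to_quat"] += 1
            calls["err_quat"] += 1
            calls["attd_ctrl"] += 1
        calls["rate_ctrl"] += 1
        calls["ecg"] += 1
    return calls


calls = count_calls()
# Relative frequency with which each if-branch is taken.
p_outer = calls["err_quat"] / calls["read_input"]
p_inner = calls["eul_to_quat"] / calls["err_quat"]
print(dict(calls), p_outer, p_inner)
```

Over 100 steps both if-branches are taken half of the time they are reached, which is the kind of control flow transition probability the instrumented run provides for the DEPM.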

Figure 11. The top-level DEPM model, automatically generated using the introduced transformation method.


Table 1. Numerical evaluation of the case study size.

Number of LLVM instructions                  2070
Number of LLVM basic blocks                  906
Number of entries in control sequence        507136
Total number of DEPMs                        250
Total number of DEPM elements                2976
Total number of DEPM data storages           2104
Total number of DEPM control flow arcs       1882
Total number of DEPM data flow arcs          5099


4 CONCLUSION


A new LLVM-based method for automatic
transformation of manually developed code to the
stochastic Dual-graph Error Propagation Model
(DEPM) has been presented in this paper. The
DEPM helps to estimate the impact of a fault of
a particular system component or input on the
overall system reliability, e.g. to compute the mean
number of erroneous values in a critical system
output during a given operation time.

The DEPM was originally developed in order 
to support MBSE development approaches. 
However, the functional software parts usually 
incorporate manually programmed code and the 
presented method enables the analysis of the error 
propagation properties of the manual code for 
enhanced reliability evaluation of the complete 
system. 

The introduced method maps control and data
flow structures based on the LLVM IR. Further
development requires the mapping of reliability
properties for single DEPM elements using the
discussed DEPM conditions. The current software
implementation of the transformation method is
a proof-of-concept with a number of serious technical
limitations that are to be eliminated. The
performance of the method will also be optimized.
ABSTRACT: The analysis and verification of the timing behavior is an important aspect in the model-
based development of mechatronic systems. Components of a mechatronic system typically communicate
and share data over a network, which gives rise to real-time requirements. Network-induced imperfections
such as packet loss and delay are stochastic in nature and thus may lead to the violation of a real-time
requirement, which is considered a timing error and has an impact on the system performance and
reliability. Model-based timing analysis is therefore valuable for the evaluation and prediction of the system
reliability. Recently we introduced a method for the model-based analysis of timing errors. It comprises
a mapping of a semi-formal baseline system model into a probabilistic model, namely a discrete-time
Markov chain model. In this paper, we discuss the analysis of the Markov chain model with respect to
the timing requirements with probabilistic model checking techniques. We use the probabilistic model
checking tool PRISM and the probabilistic computation tree logic for the property specification. Model
checking requires the appropriate transformation of the informal timing requirements to temporal
logic expressions. The verification results show whether a specific timing property is satisfied by the model
or not and reveal design flaws in the early design phase. A model of a mobile medical patient table serves
as a demonstrative case study.


1 INTRODUCTION 


Model-based system development allows the
analysis and verification of the system behavior
and its conformity to the requirements early in the
design phase. A challenging task is the verification
of reliability properties. Our research group
focuses on the model-based dependability analysis
of mechatronic systems and recently introduced
a method for the analysis of data errors and their
propagation, with the goal of identifying weak design
parts with respect to reliability (Morozov and
Janschek 2014). Significant effort was spent on the
automatic transformation of baseline system models,
specified e.g. with UML and the Architecture
Analysis & Design Language (AADL, see (Feiler
and Gluch 2012)), into an appropriate formal
model for the error propagation analysis (Ding et
al. 2016; Morozov et al. 2018). Components
of a mechatronic system typically communicate
over a network. In a networked distributed system
with concurrent processes that exchange data
periodically, real-time requirements arise between
the processes which provide data (sender) and the
processes which receive the data (receiver). The
violation of the real-time requirements may lead
to degradation of the system behavior and has an
impact on system reliability. Thus, reliability analysis
in early design phases, in particular the analysis
of the timing, is a valuable but challenging task.

Recently we introduced a method for stochastic 
timing analysis (Mutzke et al. 2016) and extended 
it in (Mutzke et al. 2018). A semi-formal baseline 
system behavioral model which is annotated with 
stochastic timing properties is mapped into an 
intermediate representation, a formal timed Petri 
net model. The analysis of the Petri net model con- 
sidering the stochastic timing properties results in 
a probabilistic Markov chain model. 

In this paper, we discuss the analysis of the
Markov chain model with respect to the timing
requirements with probabilistic model checking
techniques (the highlighted lower part in
Figure 1). Verification techniques like simulation
and testing typically do not cover all possible system
execution scenarios. In contrast, model
checking verifies whether the model satisfies
a particular property or not. We use the
probabilistic model checking tool
PRISM (Kwiatkowska et al. 2011) and the
probabilistic computation tree logic for the property
specification. This requires the appropriate
transformation of the informal timing requirements to
temporal logic expressions and their mapping to
the property language of the PRISM tool. The
verification results show, early in the design phase,
whether a property is satisfied or not
and support design decisions which
help to achieve an acceptable level of reliability. A
model of a mobile medical patient table is used as
a demonstrative case study.


Figure 1. The top-level workflow of the methodology.

This article is organized as follows: Section 2
gives an overview of the related work and
emphasizes the contribution of this paper. In
Section 3, a representative case study model of a
mobile medical patient table is introduced. The
analysis of the Markov chain model with probabilistic
model checking techniques is discussed in
Section 4. Finally, Section 5 presents the results and
gives an outlook on future research topics.


2 RELATED WORK AND CONTRIBUTION 


In Mutzke et al. (2016) we introduced a method
for the model-based, stochastic timing analysis
based on a Discrete-Time Markov Chain (DTMC)
model. The DTMC model is generated from an
annotated baseline system model. In Mutzke et al.
(2018) the method is extended so that the assumptions
regarding the distribution of the execution
times can be relaxed. However, the focus of both papers
was on proposing the methodology, whereas this paper
discusses the analysis of the DTMC model in detail.

Several authors have proposed valuable methods targeting
the mapping of semi-formal UML/SysML diagrams into
formal models for the analysis and verification of timing
properties. In Ali et al. (2015) a formal verification of
SysML internal block diagrams with discrete time
constraints is proposed. It is based on the mapping of the
model and the requirements into a probabilistic timed
automaton and a probabilistic computation tree logic
expression, respectively. The verification is done with the
PRISM model checking tool. However, the annotated
discrete time constraints are non-probabilistic. The
specified properties derived from the user requirements
are thus also non-probabilistic. In Baouya et al. (2015) a
probabilistic and timed verification framework for SysML
state machine diagrams, annotated with time and
probability properties, is introduced. However, the timing
properties, which are assigned to the states, are specified
as an interval with minimum and maximum values, and the
probabilistic properties are attached to the decision
nodes and represent the control flow probabilities. In
Jarraya et al. (2007) an approach for the automatic
verification and performance analysis of time-constrained
SysML Activity Diagrams is presented. The SysML Activity
Diagram is mapped directly to the corresponding DTMC
model and verified with the PRISM model checking tool.


3 CONTRIBUTION 


In this paper, we discuss the analysis of the Markov 
chain model. Probabilistic model checking is used 
to verify the correctness of the model with respect 
to the timing requirements. A model of a mobile 
medical patient table serves as a representative case 
study. We present a set of properties derived from
the timing requirements, which are verified using
probabilistic model checking. The properties are
formalized using the probabilistic computation
tree logic. We use the probabilistic model checking
tool PRISM to verify whether the Markov chain
model, expressed in the PRISM modeling language,
satisfies the defined properties, expressed in the
PRISM property language. The verification results
help to reveal design flaws regarding the timing
behavior and to estimate the system reliability in
order to achieve an acceptable level of reliability.


4 CASE STUDY 


A concept study of a mobile medical patient
table (MPT, Figure 2), introduced in Mutzke et
al. (2018), is used as a demonstrative example. A
typical use case scenario for a mobile MPT is the
patient transport between the preparation room
and the examination room. In the examination
room, the MPT needs to be precisely positioned
relative to specific modalities, e.g. a magnetic
resonance imaging device (MRI). The MPT is
equipped with four electrically driven omnidirectional
wheels. This allows autonomous movements in arbitrary
directions, including lateral movements and rotations
around an arbitrary vertical axis, without a steering
mechanism. The movement may be commanded by an
operator or a modality, e.g. position correction based
on the evaluation of MRI images.

Figure 2. A design concept of a mobile medical patient
table (Siemens Healthcare GmbH).

Figure 3. The UML component diagram of the MPT.

The mechatronic system MPT comprises four
motion controllers, one for each omnidirectional
wheel, and the main controller, which provides the
interfaces and computes the movement trajectory
based on the target position. The main controller
communicates with the motion controllers via a
network. Figure 3 shows the UML component diagram
that describes the interfaces of the discussed MPT
components. The dashed lines represent the dependency
of the components on the power supply on board the MPT.

Network-induced imperfections may lead to a
degradation of the system performance and reliability,
e.g. a delayed position update may lead to a
discontinuity of position, movement, velocity or
acceleration, causing the MPT to leave the planned
movement path and fail to reach the target position.


4.1 Baseline system model 


An abstract behavioral model, the UML Activity
Diagram in Figure 4, serves as a baseline model
of the MPT for the timing analysis (Mutzke et
al. 2018). It represents a behavioral model for the
automated positioning task. It comprises the control
flow as well as the data flow of the system.
The atomic executable elements, the actions, are
annotated with the stochastic timing properties listed
in Table 1. This abstract model is subject to refinement
in follow-up design iterations; however, this
level is sufficient for the timing analysis.


Figure 4. The UML activity diagram of the MPT.


Table 1. Annotated timing properties of the activity diagram in Figure 4.

Action             Time                   Probability
compute            (6, 7, 8)              (0.05, 0.9, 0.05)
send               (3, 4, 5)              (0.9, 0.09, 0.01)
wait               (39)                   (1.0)
sync               (1, 2, 3)              (0.99, 0.009, 0.001)
receive            (10, 20, 30, 40)       (0.9, 0.05, 0.03, 0.02)
positioning        (5, 10, 15)            (0.99, 0.009, 0.001)
(compute, send)    (9, 10, 11, 12, 13)    (0.045, 0.8145, 0.1265, 0.0135, 0.0005)
(wait, sync)       (40, 41, 42)           (0.99, 0.009, 0.001)


The actions on the left-hand side of the activity
diagram, namely compute, send, wait and sync,
represent the main controller process (red).
The actions receive and positioning represent one
of the wheel controller processes (yellow). For
the sake of simplicity, the model contains only one
of the four wheel controller processes. The Activity
starts at the ActivityInitial node. The black horizontal
bars represent fork and join nodes indicating
the beginning and the end of parallel execution paths.
The black edges represent the control flow and the
blue edges the data flow. The action compute plans the
movement trajectory, i.e. time-equidistant steps of the
movement path, for every wheel and provides the data
ref_data_1, which are used by the action send. The data
ref_data_network are an output of the action send and
an input of the action receive of the wheel controller
process. The fork2 node ensures that the action receive
starts after the action send is finished and that both
wait and receive start at the same time. The join1 node
ensures that the action receive has finished processing
the data ref_data_network[k] before processing the data
ref_data_network[k + 1]. The action sync initiates
the activation of the transmitted data ref_data_2
from action receive to action positioning. The
behavioral model implies the timing requirements
(i) that the action receive has to be finished before the
action sync is finished, which can be expressed by
the statement $t_{wait} + t_{sync} > t_{receive}$, and (ii) that the
action positioning has to be finished before the
action send is finished, which can be expressed by
the statement $t_{compute} + t_{send} > t_{positioning}$. The baseline
model is annotated with the discrete timing properties
listed in Table 1. The values and their distributions
are chosen arbitrarily and are subject to refinement
as the design is enhanced. However, there are three
types of distributions: deterministic for the action
wait, monotonically decreasing for the actions send,
sync and receive, and a variance around a
characteristic mean for the action compute.
Figure 5 shows a timing chart for the MPT representing
about three cycles of the nominal, error-free system run.
The horizontal bars visualize the action execution times
and their order of execution in the main controller
process and the wheel controller process, respectively.

Figure 5. The nominal, error-free timing chart of the MPT.


4.2 Stochastic timing analysis 


This section briefly demonstrates the approach for the
stochastic timing analysis of Mutzke et al. (2018),
applied to the case study baseline system model.

The analysis comprises the following steps:


• Reducing the baseline system model and computing
  probabilistic timing properties;

• Mapping of the UML AD to a timed Petri net model;

• Generation of the DTMC model.


4.2.1 Baseline system model reduction 

The baseline system model elements which are
sequentially executed, namely compute, send and
wait, sync, are subject to model reduction. For the
reduced single elements (compute, send) and (wait,
sync), which substitute the elements in the baseline
model, we have to compute the sum of the execution
times. The execution times are assumed to be
independent random variables with a discrete
distribution. The sum is also a random variable and
its distribution can be determined by the convolution
of the distributions of the individual random variables.
For two discrete distributions $m_1$, $m_2$ it follows

$(m_1 * m_2)(j) = \sum_{k} m_1(k) \, m_2(j - k), \quad j, k \in \mathbb{Z}$

where $j$ and $k$ iterate over the values of the sum and
of $m_1$, respectively. The asterisk symbol (*) represents the
convolution operator (Bronstein et al. 2005). The
last two rows in Table 1 show the computed results
for the reduced elements (compute, send) and (wait,
sync) of the case study baseline system model.
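
The convolution step can be illustrated with a minimal Python sketch. The dictionary representation of a distribution (execution time mapped to probability) and the function name `convolve` are assumptions made for illustration only; the input values are those of Table 1.

```python
from collections import defaultdict


def convolve(dist_a, dist_b):
    """Distribution of the sum of two independent discrete random variables.

    Each distribution maps an execution time to its probability.
    """
    result = defaultdict(float)
    for time_a, prob_a in dist_a.items():
        for time_b, prob_b in dist_b.items():
            result[time_a + time_b] += prob_a * prob_b
    return dict(sorted(result.items()))


# Annotated timing properties of the single actions (Table 1).
compute = {6: 0.05, 7: 0.9, 8: 0.05}
send = {3: 0.9, 4: 0.09, 5: 0.01}
wait = {39: 1.0}
sync = {1: 0.99, 2: 0.009, 3: 0.001}

# Reduced elements (compute, send) and (wait, sync).
compute_send = convolve(compute, send)
wait_sync = convolve(wait, sync)

for name, dist in (("(compute, send)", compute_send), ("(wait, sync)", wait_sync)):
    print(name, {t: round(p, 6) for t, p in dist.items()})
```

Under these assumptions the sketch yields the same times and probabilities as the last two rows of Table 1.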


4.2.2 Computing timing properties 
The timing requirements can now be reformulated:
(i) $t_{waitsync} > t_{receive}$ and (ii) $t_{computesend} > t_{positioning}$.
The violation of a timing requirement is considered
as the occurrence of a timing error. Thus,
the probability of the occurrence of a timing
error with respect to the timing requirements can
be expressed as: (i) $Prob(t_{receive} \ge t_{waitsync})$ and
(ii) $Prob(t_{positioning} \ge t_{computesend})$.

The probability of the occurrence of the timing
error is obtained by computing the joint probabilities
which satisfy the condition $t_{receive} \ge t_{waitsync}$.
For (i), with $t_{waitsync} = t_A$ and $t_{receive} = t_B$, this can be
computed with

$Prob(t_A \le t_B) = \sum_{u_A = t_{A\min}}^{t_{A\max}} \; \sum_{u_B = u_A}^{t_{B\max}} p(u_A, u_B)$

where $p(u_A, u_B) = Prob(t_A = u_A) \cdot Prob(t_B = u_B)$ is the joint
probability of a single realization of $t_A$ and $t_B$. The
subscripts min and max indicate the borders of the
individual intervals.

Finally, we obtain the following results for the
probability of the occurrence of a timing error for
the case study model:


i. $Prob(t_{receive} \ge t_{waitsync}) = 0.0198$,
ii. $Prob(t_{positioning} \ge t_{computesend}) = 0.008735$, respectively.


The results show that (i) approximately two out
of one hundred data items ref_data_network will not be
received in time by the wheel controller process
and (ii) in approximately nine out of one thousand
execution cycles the positioning task exceeds the
compute and send tasks.
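
As a plausibility check, the two error probabilities can be recomputed from the values in Table 1 by summing the joint probabilities of all realizations that violate a requirement. The following Python sketch assumes independent execution times, as stated above; the helper name `prob_at_least` is hypothetical.

```python
def prob_at_least(dist_late, dist_deadline):
    """Return Prob(t_late >= t_deadline) for independent discrete times."""
    return sum(
        p_late * p_deadline
        for t_late, p_late in dist_late.items()
        for t_deadline, p_deadline in dist_deadline.items()
        if t_late >= t_deadline
    )


# Distributions taken from Table 1 (time -> probability).
receive = {10: 0.9, 20: 0.05, 30: 0.03, 40: 0.02}
positioning = {5: 0.99, 10: 0.009, 15: 0.001}
wait_sync = {40: 0.99, 41: 0.009, 42: 0.001}
compute_send = {9: 0.045, 10: 0.8145, 11: 0.1265, 12: 0.0135, 13: 0.0005}

error_i = prob_at_least(receive, wait_sync)            # timing error (i)
error_ii = prob_at_least(positioning, compute_send)    # timing error (ii)

print(f"(i)  {error_i:.4f}")
print(f"(ii) {error_ii:.7f}")
```

The exact joint probability for (ii) is 0.0087355, which corresponds to the reported value of 0.008735.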


4.2.3 Mapping into a formal Petri net model 

A semi-formal UML Activity Diagram can be
mapped into a formal Petri net model by applying
the rules defined in Störrle and Hausmann (2004)
and adapted in Mutzke et al. (2016). The Petri net
represents the control flow of the baseline system
model. Figure 6(a) shows the corresponding Petri
net model of the baseline system model, the UML
Activity Diagram in Figure 4, where the immediate
transitions are represented by thin black bars
while the timed transitions are represented by
black rectangles. The ActivityInitial of the Activity
Diagram is mapped to the place p1 in the Petri net
model and is initially marked with a token.

Table 2 shows the assignment of the UML
Activity Diagram nodes to the nodes of the Petri
net model.

In the notation of the Petri net, the timing
requirements can be expressed as (i) the transition
t6 fires before the transition t5 and (ii) the
transition t9 fires before the transition t2. A
precondition is that the pre-places of the transitions
t6, t5 and t9, t2, namely p7, p8 and p2, p11, are
marked simultaneously. The Petri net reveals that
this is ensured by design.


4.2.4 Generation of a discrete-time 
Markov chain model 

A reachability analysis of the generated Petri net
model reveals all markings reachable from the initial
marking $m_0$. The result is a directed graph
RG with the state space $S$. Considering the timing
requirements (i) and (ii), we can identify the states
$S_{nom} \subset S$ and $S_{err} \subset S$, which represent the
nominal operational states and the erroneous states,
respectively.
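
The principle of the reachability analysis can be sketched in Python as a breadth-first exploration of markings. The net below is a hypothetical four-transition fork/join example and is not the Petri net of Figure 6(a); timing is ignored and the net is assumed to be safe (at most one token per place), so that a marking can be represented as a set of places.

```python
from collections import deque

# Hypothetical toy net: transition -> (pre-places, post-places).
TRANSITIONS = {
    "fork": ({"p1"}, {"p2", "p3"}),
    "work_a": ({"p2"}, {"p4"}),
    "work_b": ({"p3"}, {"p5"}),
    "join": ({"p4", "p5"}, {"p1"}),
}


def reachability_graph(initial_marking, transitions):
    """Breadth-first exploration of all markings of a safe Petri net."""
    start = frozenset(initial_marking)
    states, edges = {start}, []
    queue = deque([start])
    while queue:
        marking = queue.popleft()
        for name, (pre, post) in transitions.items():
            if pre <= marking:  # all pre-places are marked: transition enabled
                successor = frozenset((marking - pre) | post)
                edges.append((marking, name, successor))
                if successor not in states:
                    states.add(successor)
                    queue.append(successor)
    return states, edges


states, edges = reachability_graph({"p1"}, TRANSITIONS)

# Markings with more than one outgoing edge: simultaneously enabled transitions.
branching = {m for m in states if sum(1 for e in edges if e[0] == m) > 1}

print(len(states), "markings,", len(edges), "edges")
print("branching markings:", [sorted(m) for m in branching])
```

Markings with several outgoing edges are the points at which state transition probabilities have to be assigned in order to obtain a DTMC.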

Simultaneously activated timed transitions in the
Petri net model are represented by nondeterministic
state transitions in the RG, characterized by states
with two or more outgoing edges. To these edges we
can assign the state transition probabilities computed

Figure 6. Graphical representations of (a) the Petri net
model and (b) its corresponding DTMC model.


Table 2. The assignment of the activity diagram nodes
in Figure 4 to the Petri net nodes in Figure 6(a).

PN nodes                AD nodes
p1                      ActivityInitial
t(1, 3, 7)              fork(1, 2, 3)
p(2, 3, 4)              merge(1, 2, 3)
p(5, 6, ..., 11, 12)    auxiliary
t(4, 8)                 join(1, 2)
t2                      action compute, send
t5                      action wait, sync
t6                      action receive
t9                      action positioning


in Section 4.2.2. The reachability graph annotated with the
state transition probabilities represents a discrete-time
Markov chain (DTMC) model. Figure 6(b) shows the
obtained DTMC model with the states in nominal
operation $S_{nom}$ and the erroneous states $S_{err}$, highlighted
in red, generated from the Petri net model that is
shown in Figure 6(a). Table 3 contains the assignment
of the DTMC states to the Petri net places. In the
notation of the DTMC, the occurrence of a timing error
can be identified with (i) the state $s_9$ being visited and
(ii) the state $s_{12}$ being visited, respectively. The
DTMC model comprises 65 states and 145 edges;
however, Figure 6(b) shows only the states on the
cyclic nominal operation path and two path
fragments, each consisting of three erroneous states,
which are highlighted in red.


5 ANALYSIS OF THE MARKOV 
CHAIN MODEL 


The generated Markov chain model is the foundation
for the analysis of specific properties related
to the timing behavior. Formally, according to
Baier & Katoen (2008), a DTMC is a tuple

$M = (S, P, \iota_{init}, AP, L)$

where

$S = \{s_0, s_1, \ldots, s_n\}$ is a finite set of states;
$P : S \times S \to [0, 1]$ is the transition probability function
such that for all states $s$: $\sum_{s' \in S} P(s, s') = 1$;
$\iota_{init} : S \to [0, 1]$ is the initial distribution such
that $\sum_{s \in S} \iota_{init}(s) = 1$;
$AP = \{nom, err\}$ is a set of atomic propositions;
$L : S \to 2^{AP}$ is a labeling function.


The states of the DTMC model are labeled with 
the atomic propositions nom and err, respectively 
(Table 3). The analysis of the DTMC model is 
done with the probabilistic model checking tool 
PRISM. Listing 1 shows the DTMC specified in 
the PRISM modeling language. 

Lines 31 to 41 in Listing 1 define rewards for
visiting the states $s_9$, $s_{12}$ and the disjunction
$s_9 \vee s_{12}$, which indicate the occurrence of a timing error.
This allows computing the expected number of
errors during a specific number of execution steps.

To specify the properties, we use the probabilistic
computation tree logic (PCTL). This is a temporal
logic based on the computation tree logic
(CTL). A PCTL formula formulates conditions
on states of a Markov chain. The notation used
in this paper follows Baier and Katoen (2008).
Compared to CTL, which uses the universal and
existential path quantifiers $\forall$ and $\exists$, PCTL uses
the probabilistic operator $P_J(\varphi)$, where $\varphi$ is a path
formula and $J \subseteq [0, 1]$ is an interval. The syntax
for a PCTL state formula is given by:


 1 | dtmc
 2 |
 3 | label "nom" = s=0 | s=1 | s=2 | s=3 | s=4 | s=5 | s=6 | s=7 | s=8;
 4 | label "err" = s=9 | s=10 | s=11 | s=12 | s=13 | s=14;
 5 |
 6 | module mpt
 7 |
 8 |   s : [0..14] init 0; // states
 9 |
10 |   [] s=0 -> 1:(s'=1);
11 |   [] s=1 -> 1:(s'=2);
12 |   [] s=2 -> 1:(s'=3);
13 |   [] s=3 -> 1:(s'=4);
14 |   [] s=4 -> 0.9802:(s'=5) + 0.0198:(s'=9);
15 |   [] s=5 -> 1:(s'=6);
16 |   [] s=6 -> 1:(s'=7);
17 |   [] s=7 -> 1:(s'=8);
18 |   [] s=8 -> 0.991265:(s'=1) + 0.008735:(s'=12);
19 |
20 |   [] s=9 -> 1:(s'=10);
21 |   [] s=10 -> 1:(s'=11);
22 |   [] s=11 -> 1:(s'=1);
23 |   [] s=12 -> 1:(s'=13);
24 |   [] s=13 -> 1:(s'=14);
25 |   [] s=14 -> 1:(s'=1);
26 |
27 |
28 | endmodule
29 |
30 |
31 | rewards "num_error_9_12"
32 |   [] s=9 | s=12 : 1;
33 | endrewards
34 |
35 | rewards "num_error_9"
36 |   [] s=9 : 1;
37 | endrewards
38 |
39 | rewards "num_error_12"
40 |   [] s=12 : 1;
41 | endrewards


Listing 1. The DTMC model expressed in the PRISM
modeling language.


Table 3. Assignment of the PN places to the DTMC
states.

DTMC states    PN places       DTMC label
0              (1)             "nom"
1              (2, 3, 4)       "nom"
2              (3, 4, 5)       "nom"
3              (3, 4, 6, 7)    "nom"
4              (4, 7, 8)       "nom"
5              (3, 4, 7)       "nom"
6              (3, 4, 9)       "nom"
7              (2, 3, 4, 10)   "nom"
8              (2, 3, 11)      "nom"
9              (4, 8, 9)       "err"
10             (2, 4, 8, 10)   "err"
11             (2, 8, 11)      "err"
12             (3, 5, 11)      "err"
13             (3, 6, 7, 11)   "err"
14             (7, 8, 11)      "err"


$\Phi ::= true \mid a \mid \Phi_1 \wedge \Phi_2 \mid \neg\Phi \mid P_J(\varphi)$


where $a \in AP$ is an atomic proposition. The syntax
for a PCTL path formula is given by:

$\varphi ::= \bigcirc\Phi \mid \Phi_1 \, U \, \Phi_2 \mid \Phi_1 \, U^{\le n} \, \Phi_2$


where $\bigcirc$ is the next operator. The formula $\bigcirc\Phi$ holds for
a path if $\Phi$ holds for the next state in the path. $\Phi_1 \, U \, \Phi_2$
holds for a path if there is a state in the path where
$\Phi_2$ holds and $\Phi_1$ holds in all states visited before.
$\Phi_1 \, U^{\le n} \, \Phi_2$ is the step-bounded variant, meaning
that $\Phi_2$ will hold within at most $n$ steps.

The PCTL is used to specify the following prop- 
erties relevant for the analysis of the timing: 


const int k;

// (a1) probability of a timing error (i) in k steps
P=? [ F<=k (s=9) ]

// (a2) probability of a timing error (ii) in k steps
P=? [ F<=k (s=12) ]

// (a3) probability of a timing error (i) or (ii) in k steps
P=? [ F<=k (s=9 | s=12) ]

// (b1) cumulative reward: the expected number of errors (i)
// within k steps
R{"num_error_9"}=? [ C<=k ]

// (b2) cumulative reward: the expected number of errors (ii)
// within k steps
R{"num_error_12"}=? [ C<=k ]

// (b3) cumulative reward: the expected number of errors
// (i) or (ii) within k steps
R{"num_error_9_12"}=? [ C<=k ]

// (c) infinitely often "nom"
P>=1 [ G F "nom" ]

// (d) always an error state implies within at most three
// steps a nominal state is reached
P>=1 [ G "err" => (X X X "nom") ]


Listing 2. The properties specified in the PRISM property
language.


a. What is the probability of the occurrence of a
timing error (i), (ii) and the disjunction (i) ∨ (ii)
within k execution steps?

b. Cumulative reward: What is the expected 
number of errors within k execution steps? 

c. The system will always recover from a timing 
error and reach a nominal operation state. 

d. Whenever an error state occurs, in at most 
three steps a nominal operational state is 
reached. 


To verify the properties, they have to be formal- 
ized. In Listing 2 the properties are specified in the 
PRISM property language. 
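
To illustrate what the properties (a) and (b) compute, the following Python sketch evaluates them by iterating the state distribution of the DTMC. It is an independent illustration and not the algorithm used by PRISM. The transition structure is an assumption taken from the reading of Listing 1 adopted here: deterministic steps along the nominal path, the two probabilistic branches at s=4 and s=8, and a return to s=1 after each error path. The function names are hypothetical.

```python
# Assumed transition probabilities (state -> {successor: probability}).
P = {s: {s + 1: 1.0} for s in range(15)}
P[4] = {5: 0.9802, 9: 0.0198}         # branch into timing error (i)
P[8] = {1: 0.991265, 12: 0.008735}    # branch into timing error (ii)
P[11] = {1: 1.0}                      # end of error path (i)
P[14] = {1: 1.0}                      # end of error path (ii)


def step(dist, absorbing=frozenset()):
    """Advance a state distribution by one DTMC step."""
    nxt = {}
    for state, mass in dist.items():
        successors = {state: 1.0} if state in absorbing else P[state]
        for target, prob in successors.items():
            nxt[target] = nxt.get(target, 0.0) + mass * prob
    return nxt


def prob_reach_within(targets, k, start=0):
    """P=? [ F<=k targets ]: make the targets absorbing and iterate k steps."""
    targets = frozenset(targets)
    dist = {start: 1.0}
    for _ in range(k):
        dist = step(dist, absorbing=targets)
    return sum(mass for state, mass in dist.items() if state in targets)


def expected_errors_within(rewarded, k, start=0):
    """R=? [ C<=k ] with reward 1 on every transition leaving `rewarded`."""
    dist, total = {start: 1.0}, 0.0
    for _ in range(k):
        total += sum(mass for state, mass in dist.items() if state in rewarded)
        dist = step(dist)
    return total


for k in (250, 500, 750, 1000):
    print(
        k,
        round(prob_reach_within({9}, k), 4),
        round(prob_reach_within({12}, k), 4),
        round(prob_reach_within({9, 12}, k), 4),
        round(expected_errors_within({9, 12}, k), 3),
    )
```

Making the target states absorbing ensures that a path is counted once it has visited an error state, which matches the meaning of the step-bounded operator F<=k.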


6 RESULTS AND CONCLUSIONS 


Table 4 lists the formal expressions of the defined 
properties in PCTL as well as the PRISM property 
language. 

Figures 7 and 8 show the verification results for
the properties (a) and (b), respectively. The graphs
represent the timing error (i), (ii) and the
disjunction (i) ∨ (ii), highlighted in green, red and blue,
respectively.

For k = 1000 steps, one of the timing errors occurs
almost surely, but error (ii) is less probable than
error (i). Hence, design enhancements should focus on
decreasing the probability of (i). The verification result of
property (c) means that a nominal operational
state is visited infinitely often. Hence, the system
will not remain in an erroneous state. The verification
result of property (d) states that the system
always recovers from an erroneous state within at
most three steps.
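
The qualitative properties (c) and (d) can also be checked structurally on the graph of the DTMC. The Python sketch below assumes the same transition structure as read from Listing 1 here (both error paths return to s=1); the function and variable names are hypothetical. For (d) it follows the X X X formulation, i.e. it inspects the states reached after exactly three steps.

```python
# Assumed transition structure (state -> {successor: probability}).
P = {s: {s + 1: 1.0} for s in range(15)}
P[4] = {5: 0.9802, 9: 0.0198}
P[8] = {1: 0.991265, 12: 0.008735}
P[11] = {1: 1.0}
P[14] = {1: 1.0}

NOM = set(range(0, 9))     # states labeled "nom"
ERR = set(range(9, 15))    # states labeled "err"


def successors_after(state, steps):
    """States reachable with positive probability in exactly `steps` steps."""
    frontier = {state}
    for _ in range(steps):
        frontier = {t for s in frontier for t, p in P[s].items() if p > 0}
    return frontier


def err_subgraph_is_acyclic():
    """True if no path can stay inside the error states forever."""
    frontier = set(ERR)
    for _ in range(len(ERR)):
        frontier = {t for s in frontier for t in P[s] if t in ERR}
    return not frontier


# (c) a nominal state is visited infinitely often on every path.
property_c = err_subgraph_is_acyclic()

# (d) three steps after any error state, the system is in a nominal state.
property_d = all(successors_after(s, 3) <= NOM for s in ERR)

print("property (c):", property_c)
print("property (d):", property_d)
```

If an error path rejoined the nominal cycle directly in front of a branching state, property (d) would no longer hold with probability one, because a new error state could be entered within the three steps.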


Table 4. Formal property specification and results.

Property   PCTL                                                      PRISM                               Result
(a1)       $P_{=?}(\Diamond^{\le k} (s=9))$                          P=? [F<=k (s=9)]                    see Figure 7
(a2)       $P_{=?}(\Diamond^{\le k} (s=12))$                         P=? [F<=k (s=12)]                   see Figure 7
(a3)       $P_{=?}(\Diamond^{\le k} (s=9 \vee s=12))$                P=? [F<=k (s=9|s=12)]               see Figure 7
(b1)       -                                                         R{"num_error_9"}=? [C<=k]           see Figure 8
(b2)       -                                                         R{"num_error_12"}=? [C<=k]          see Figure 8
(b3)       -                                                         R{"num_error_9_12"}=? [C<=k]        see Figure 8
(c)        $P_{\ge 1}(\Box\Diamond\, \text{"nom"})$                    P>=1 [G F "nom"]                    true
(d)        $P_{\ge 1}(\Box\, \text{"err"} \rightarrow (\bigcirc\bigcirc\bigcirc\, \text{"nom"}))$   P>=1 [G "err" => (X X X "nom")]     true


Figure 7. The probability of a timing error within k
steps (horizontal axis: k; curves: P=? [F<=k (s=9|s=12)],
P=? [F<=k (s=9)], P=? [F<=k (s=12)]).


Figure 8. The expected number of timing errors within
k steps (horizontal axis: k; curves: R{"num_error_9_12"}=? [C<=k],
R{"num_error_9"}=? [C<=k], R{"num_error_12"}=? [C<=k]).


In this paper, we discussed the analysis of the 
Markov chain model with respect to the timing 
requirements with probabilistic model checking 
techniques. This requires the appropriate transfor- 
mation of the informal timing requirements to the 
temporal logic expression and its mapping to the 
property language of the PRISM tool. The results 
are valuable for the assessment of the system reli- 
ability and help to achieve an acceptable level of 
reliability. 

A challenging task is to formalize the appropri- 
ate timing requirements from an informal require- 
ment specification or to derive the implicit timing 
requirements from the baseline system model, 
where the latter is done manually in this paper. 
Thus, a future research topic is the automatic
recognition of the timing requirements and their
mapping to a temporal logic expression.


Direct integration of safety analysis in a model based system
engineering process: Lessons learned from Ariane 6 control bench
family RAMS studies

ABSTRACT: As of today, two of the major topics in the Reliability, Availability, Maintainability and Safety
(RAMS) field are how to deal with the growing complexity of new systems and how to reduce the time
required to perform RAMS analysis. Complexity is perceived as the most challenging factor when devel- 
oping “safe proved” critical systems. During the last decade Model Based System Engineering has been 
broadly deployed in the industry in order to manage complexity. Complexity is also managed by adopting 
incremental and/or iterative development lifecycles. It is therefore imperative to integrate RAMS analyses 
with such means and processes. This implies dealing with a live, changing design baseline. To deal with the 
two major topics detailed above, a new practical RAMS modelling methodology is presented in this paper, 
based on a generic Model Based System Engineering (MBSE) tool with an Engineering Model (EM) 
shared between Design and RAMS teams. Requirements, Functional Trees, System Architecture and 
Detailed Design are included in this model. Furthermore the model includes Functional Failure Mode 
and Effect Analyses (FMEAs), Components FMEAs, Feared Events and related Fault Tree Analysis. In 
this way, safety and engineering models are intrinsically linked and RAMS teams can fully exploit trace- 
ability (established and maintained by the design team) between Functions and Components & Interfaces. 
An automated export of RAMS related data, (Reliability Data & Fault Tree topologies) allows numerical 
calculations in a reliability tool. This methodology can deal with both software and hardware technolo- 
gies. It is valid for highly critical systems (safety related) but also for less critical systems with complex 
availability modelling (production means). A real use case for this methodology is presented; the Ariane 6 
Control Bench Family (A6 CBF). A6-CBF is a family of 9 control benches, covering all the ground con- 
trol needs of future Ariane 6 launcher in terms of production, validation, training and launch operations. 


1 INTRODUCTION


Model Based Safety Analysis (MBSA) has become
a trending topic in the dependability field since the
mid-2000s (Lesage & Kruse, 2011; Joshi, Heimdahl,
Miller, & Whalen, 2005). Formal methods
will be the RAMS analyst's Swiss army knife for
building the safety case of complex systems (Rauzy
& Chaire, 2014). They have already been used
successfully in highly regulated industries where
safety certification is mandatory (e.g. certification
of the flight control system of the aircraft Falcon
7X, Bieber et al., 2008). On the other hand, the use
of formal methods or specific languages for
dysfunctional modeling (AADL, AltaRica, Event-B,
Safety Architect from ALL4TEC, Safety Designer
from Dassault) requires specific knowledge and a
considerable amount of time, which is usually only
economically viable in systems to be produced in
large series.

The methodology proposed in this paper is an
intermediate strategy between traditional and
innovative techniques. It allows the use of well-known
methods (Fault Tree Analysis (FTA), Failure Mode,
Effects and Criticality Analysis (FMECA), etc.) with
very complex systems. Our approach ensures the
completeness and consistency of the RAMS studies,
the most challenging parts of “manual” safety
analyses. It also allows elements specific to RAMS
analysis to be traced and linked to any other element of
an Engineering Model (EM). That may sound obvious,
but it adds considerable value for design and operations
teams. It results in something as simple but powerful
as identifying, with two mouse clicks, which diagnostics
cover a particular failure mode and in which test case
the expected behavior has been validated.


2 PROPOSED METHODOLOGY


To integrate RAMS into the MBSE lifecycle, system and
dependability engineering should share a common
EM. RAMS analysis shall be part of the whole system
development cycle, using the information available
in the EM at every moment from the Requirements
Elicitation to the Site Acceptance phases.


2.1 Functional Analysis (FA) 


Functional Analysis (FA) is achieved by defining
the functions to be performed by the system.
Main functions are decomposed recursively into
sub-functions until a sufficiently detailed
functional tree is obtained. This decomposition
allows a complete functional specification of the
system. Every single functional requirement
shall be linked to a function. Function performance
requirements shall usually be identified as
non-functional requirements.


2.2 Functional FMEA (F-FMEA) 


For each function established in the functional 
tree, five generic Functional Failure Modes 
(F-FM) based on (Betancourt, Birla, Gassino, & 
Regnier, 2011) are created. These 5 failure modes 
are defined as follows: 


F-FM1: Fail to perform the function at the 
required time 

F-FM2: Fail to perform the function with correct 
value 

F-FM3: Performance of an unwanted action 
F-FM4: Interference or unexpected coupling with 
another function 

F-FM5: Unable to remove the function 


Once these 5 F-FM are associated with each function,
the effects of the failure modes of a function
are studied in a systematic and recursive way, and
the relevant relationships are established with the
failure modes of the higher-level functions following
the defined functional hierarchy. This builds up
a Functional Failure Modes Fault Tree. For that purpose,
OR and AND gates can be implemented by defining
an appropriate kind of aggregation/decomposition
link between the functions. A crucial element
is provided by F-FM4, which establishes relationships
between F-FM whose functions are not
directly related in the functional hierarchy tree but
where a possible failure in one of them could impact
the correct performance of the other. This feature is
very useful when analyzing transversal Use Cases
requiring a sequence of particular functions.

Figure 1. Functional analysis tree.


2.3 Feared Event (FE) identification 


Feared Events (FE) are identified mainly by the 
requirements defined by the client based on pre- 
vious Preliminary Risk Assessment (PRA). Each 
FE is assigned a certain severity level as defined in 
ECSS Q30 based on the impact that it may have 
on the system. 


2.4 Link between Functional Failure Modes 
(F-FM) and Feared Events (FE) 


Every FE shall be related to one or more F-FM,
which become the causes of the occurrence of this
FE. The higher in the Functional Tree the functions
related to an FE are located, the more F-FM and
consequently the more C-FM are potentially involved.
For each FE a Fault Tree Analysis (FTA) is generated,
whose result is the assignment of a criticality level to
each function. This functional criticality level allows
the allocation of RAMS requirements, usually by using
function performance requirements.


2.5 Traceability between Functional Analysis 
(FA) and System Component Hierarchy 
(SCH) 


During the Preliminary Design Phase a system
Architecture is defined and high level components
in the Product Breakdown Structure (PBS) are
identified. As the Detailed Design Phase progresses,
Functional Analysis and component design are
refined, producing sub-functions implemented by
subcomponents. In this way the function tree is
related to a hierarchy of components in the PBS.
The relationships between the components and the
functions they implement must be established at
the required level of detail. The SCH decomposition
should reach the Line-Replaceable Unit (LRU) level,
the lowest system element that is exchangeable in
the field. In the end, each LRU component shall
implement one or several functions of the lowest
level of the functional tree.

Figure 2. HW/SW component domain model.


2.6 Identification of Component Failure 
Modes (C-FM) 


For every component a set of C-FM is identified.
For HW type components the use of a standard
Failure Mode Distribution database (RMQSI
Knowledge Center, 2016) is recommended. Similar
HW_Components can be grouped in HW_Families
sharing a list of common Failure Modes.

SW Failure modes are identified in relation to
the functionalities of the SW component being
analyzed. For SW components, generic SW failure
modes are available in the literature (Reifer,
1979). One of the more general sets of SW failure
modes is defined in (ECSS, 2017). Three SW-FM
are proposed:


SW-FM1: functionality not performed; 

SW-FM2: functionality performed wrongly (e.g. 
wrong/null data provided, wrong effect); 

SW-FM3: functionality performed with wrong 
timing (e.g. too late). 


The persistence of the error causing the failure
mode and the nature of the software application
to be analyzed have to be taken into account carefully.
In contrast to HW C-FM, where the random
nature of physical ageing plays a major role, most
of the SW-FM can be covered/prevented by some
level of testing (unit/integration/validation testing).
The EM allows traceability between these failure
modes and the related diagnostics and tests to be
implemented in further phases.

Figure 3. RAMS domain model.


2.7 Link between Component Failure Modes 
(C-FM) and Functional Failure Modes 
(F-FM) 


Each Software/Hardware component (SW/HW) 
has its own C-FM which are identified following the 
procedure established in the previous sections. These 
C-FM can be linked to those F-FM that belong to 
the function or functions that the component imple- 
ments. To achieve accurate results this analysis shall 
be performed by the RAMS team together with the 
team responsible for the component design. 


2.8 Results from RAMS studies implemented on 
the engineering model 


2.8.1 Component criticality analysis

Following the traces “Feared Event > Functional 
Failure Mode > Functional Failure Mode > ... > 
Component Failure Mode > Component” it is pos- 
sible to identify all components leading to a Feared 
Event. Components are then classified according
to the criticality of the Feared Event. A set of
non-functional requirements is imposed on each
component depending on its criticality level.
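
The trace-following step described above can be illustrated with a small graph traversal. This is only a sketch under stated assumptions: the trace dictionary, the element names, the `COMP:` prefix and the convention that a lower severity number means a more severe Feared Event are all hypothetical and are not taken from the A6-CBF engineering model.

```python
# Hypothetical trace links: each key points to its direct causes, following
# Feared Event > F-FM > ... > C-FM > Component. All names are illustrative.
TRACES = {
    "FE:UnexpectedCommand": ["FFM:Commanding.Wrong"],
    "FFM:Commanding.Wrong": ["FFM:Encode.Wrong"],
    "FFM:Encode.Wrong": ["CFM:EncoderSW.WrongData"],
    "CFM:EncoderSW.WrongData": ["COMP:EncoderSW"],
    "FE:LossOfArchive": ["FFM:Archive.NotPerformed"],
    "FFM:Archive.NotPerformed": ["CFM:Disk.Open", "CFM:EncoderSW.WrongData"],
    "CFM:Disk.Open": ["COMP:Disk"],
}

# Assumed convention for this sketch: 1 is the most severe level.
SEVERITY = {"FE:UnexpectedCommand": 1, "FE:LossOfArchive": 3}


def components_leading_to(feared_event, traces):
    """Return every component reachable from a Feared Event via the traces."""
    seen, stack, found = set(), [feared_event], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node.startswith("COMP:"):
            found.add(node)
        stack.extend(traces.get(node, []))
    return found


def classify(traces, severity):
    """Give each component the most severe level among the events it can cause."""
    criticality = {}
    for feared_event, level in severity.items():
        for component in components_leading_to(feared_event, traces):
            criticality[component] = min(level, criticality.get(component, level))
    return criticality


if __name__ == "__main__":
    print(classify(TRACES, SEVERITY))
```

A component that contributes to several Feared Events inherits the most severe level, which is one plausible reading of "classified according to the criticality of the Feared Event".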


2.8.2 Quantitative and qualitative RAMS 
requirements verification 

The related Fault Tree diagrams can be exported
from the EM to BlockSim (via a dedicated script
generating one XML file per diagram). This export
is limited to simple Fault Trees (handling
OR & AND gates). However, the script handles
“repeated” events and events present at different
points of the same Fault Tree. Probabilities are
then computed based on the reliability parameters of
the component failure modes. This reliability data is
available in the EM and is exported to
BlockSim beforehand. For availability Fault Trees, a single
failure mode per component is used, aggregating all
failure mode distributions. Fault Trees in BlockSim
can also be used to calculate Minimal Cut Sets,
allowing validation of qualitative requirements such as
the absence of a Single Point of Failure, Fail Operational,
Fail Safe or similar criteria.
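
To make the qualitative and quantitative checks more concrete, the following sketch computes minimal cut sets and a top-event probability for a small AND/OR fault tree with a repeated event. It is not BlockSim's algorithm; the tuple-based tree encoding, the function names and the assumption of independent basic events are choices made for this illustration only.

```python
from itertools import product


def cut_sets(node):
    """Cut sets of a tree encoded as ("basic", name), ("or", [...]) or ("and", [...])."""
    kind = node[0]
    if kind == "basic":
        return {frozenset([node[1]])}
    child_sets = [cut_sets(child) for child in node[1]]
    if kind == "or":
        # Any cut set of any child triggers an OR gate.
        return set().union(*child_sets)
    if kind == "and":
        # One cut set from every child is needed; merging handles repeated events.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate type: {kind}")


def minimal(sets):
    """Drop every cut set that strictly contains another one."""
    return {s for s in sets if not any(other < s for other in sets)}


def top_probability(mcs, p):
    """Exact top-event probability by enumerating all basic-event states.

    Assumes independent basic events; only practical for small trees.
    """
    events = sorted(set().union(*mcs))
    total = 0.0
    for states in product([False, True], repeat=len(events)):
        failed = {e for e, s in zip(events, states) if s}
        if any(cs <= failed for cs in mcs):
            prob = 1.0
            for e, s in zip(events, states):
                prob *= p[e] if s else 1.0 - p[e]
            total += prob
    return total


if __name__ == "__main__":
    # Hypothetical tree: TOP = (A AND B) OR (A AND C) OR D, with A repeated.
    tree = ("or", [
        ("and", [("basic", "A"), ("basic", "B")]),
        ("and", [("basic", "A"), ("basic", "C")]),
        ("basic", "D"),
    ])
    mcs = minimal(cut_sets(tree))
    print("single points of failure:", [set(s) for s in mcs if len(s) == 1])
    print("P(top):", top_probability(mcs, {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.01}))
```

A cut set of size one corresponds to a Single Point of Failure, so a requirement such as "no Single Point of Failure" reduces to checking that no minimal cut set has a single element.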


3 USE CASE: ARIANE 6 CONTROL 
BENCH FAMILY 


The Ariane 6 program, approved in 2014, aims to 
reduce the cost of space launch by half compared 
to Ariane 5. Europe will remain number one in the
commercial launch services market while responding
to the needs of European institutional missions.
In the particular case of Ariane 6, an MBSE
approach is incorporated in the design of its control
benches (Cherqui, Comery, & Lesens, 2016).
Moreover, some MBSA efforts were undertaken within
the Ariane 5 environment (Ercilbengoa, Schoenig,
& Hutinet, 2010). The A6-Control Bench Family (A6-CBF) is
the family of nine Control Benches (seven to be
deployed in continental Europe, two at the European
Spaceport in Kourou), covering all Ground Control
needs of the Ariane 6 launcher in terms of production,
validation, training and launch. A6-CBF
will be developed by a consortium led by GTD,
formed by GTD and CLEMESSY.


3.1 A6-CBF: Core and instances 


A6-CBF is developed in an innovative iterative
incremental life cycle. It is built around a “catalogue”
of components with a reference architecture
implementing a suite of main functions. This catalogue
of components is called the Core. This Core
has to be instantiated for a particular production
site bench, implementing the components that perform
the functions required at this site. Four versions
of the A6-CBF Core are planned, with the first bench
(based on CORE 1) to be deployed operationally
at the ArianeGroup Les Mureaux site in 2018.


3.2 Selected MBSE tools 


3.2.1 Enterprise Architect (EA) 

The A6-CBF engineering process is carried out 
with a commercial Model Based System Engineer- 
ing tool, Enterprise Architect (EA). An Engineer- 
ing Model is shared between Design and RAMS 
teams. This Engineering Model uses a SysML 
language subset to model system design and stores 
all data and elements needed to perform RAMS 
analyses. 


3.2.2 BlockSim/Xfmea by ReliaSoft

Two tools specialized in reliability engineering,
BlockSim and Xfmea from ReliaSoft, are
selected to perform specific RAMS calculations.
RAMS related data (FTA and reliability data) is
exported from the EA model in XML format.
BlockSim is used in the project to perform numerical
and minimal cut set computations. By using
Fault Trees and Reliability Block Diagrams it is
possible to validate the reliability of the system and
the operational availability requirements. The size
of the project implies a large amount of data that
needs to be processed. Hence, Xfmea is used in the
project as a tool to facilitate data management and
reporting during the development of FMECAs.


3.2.3 Enterprise architect scripting 

Since the system is designed in
an iterative process, the design is expected to
change continuously, and the information and elements
needed for RAMS analysis will be updated
continuously. Because the RAMS analyses
are performed outside the EA engineering model
and at the same time as
the system is being developed, there is a need to
constantly update the ReliaSoft database with the
most recent information. To this end, scripts
are developed to import information such as reliability
data, Fault Tree topologies or system architecture
from EA into ReliaSoft and keep it up to date
whenever needed.
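
As a rough illustration of what such an export script might look like, the sketch below serialises a fault tree topology to XML using only the Python standard library. The element and attribute names (`FaultTree`, `Gate`, `Event`, `ref`) are invented for this example; they are not the schema expected by ReliaSoft tools, and the real scripts run inside Enterprise Architect rather than as standalone Python.

```python
import xml.etree.ElementTree as ET


def fault_tree_to_xml(name, node):
    """Serialise a tree of ("basic", id), ("or", [...]), ("and", [...]) tuples.

    The XML layout is hypothetical and only shows the general idea of
    writing one file per Fault Tree diagram.
    """
    root = ET.Element("FaultTree", {"name": name})

    def add(parent, n):
        kind = n[0]
        if kind == "basic":
            # Repeated events simply reference the same failure-mode id.
            ET.SubElement(parent, "Event", {"ref": n[1]})
        else:
            gate = ET.SubElement(parent, "Gate", {"type": kind.upper()})
            for child in n[1]:
                add(gate, child)

    add(root, node)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    tree = ("or", [
        ("basic", "CFM_1"),
        ("and", [("basic", "CFM_1"), ("basic", "CFM_2")]),
    ])
    print(fault_tree_to_xml("FE_demo", tree))
```

Referencing failure modes by identifier rather than copying their data keeps the reliability parameters in one place, which matches the idea of a single engineering model as the source of truth.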


3.3 A6-CBF RAMS modelling objects


An overview of the RAMS modelling objects used is
presented in Figure 4:


1. Client Requirement [YELLOW]: Requirement 
coming from any applicable documents. 

2. CBF Requirements [GREEN]: Requirements
derived by the project team.

3. Feared Event [RED]: Technical Risk related to 
one CBF requirement that shall be analyzed by 
Fault Tree 

4. Fatal Alarm [RED]: Alarm generated by Moni- 
tor Control that will put the bench in a prede- 
fined state. A FearedEvent Fatal_Alarm is used 
to show which conditions are required to activate 
this alarm. 

5. Trigger [BLUE]: Condition that activates a
StateMachine transition (BENCH_Operational >
BENCH_Failed, triggered by Fatal Alarm).

6. Function (CORE or Component) [ROUND]: 
function defined in CBF Functional Analysis 
tree. 

7. Functional Failure Mode [LIGHT ORANGE]:
describes one of the five generic failure modes of a
function.

8. Component Level 1 [PINK]: CBF component to 
be developed & validated autonomously. 

9. Component (HW or SW) [PINK]: decomposi- 
tion of ComponentL1 

10. COTS (HW or SW) [part of an ILS Family] 
[OLIVE]: Commercial off-the-shelf component 

11. Component Failure Mode [DARK ORANGE]:
describes a failure mode of a component.

12. MCS Diagnostic [LIGHT YELLOW]: permits
the diagnosis of a Component/Functional
Failure Mode.

13. Open Issue [linked to IBTReq]: OpenPoint to 
discuss. 

14. Assumption: [within OpenIssue] linked to open 
issues 
Figure 4. RAMS modelling objects in EA.


3.4 A6-CBF functional analysis 


In the A6-CBF, functional decomposition has been
initially performed for Core Functions. Through
the functional definition it is possible to distinguish
two kinds of functions: Core Functions
(Core-F) and Component Functions (Comp-F).
Additionally, some Instance Functions (Inst-F)
may have been used to model specific functionalities
not covered by the Core. The criterion for
using Core-F or Comp-F is whether one or more
components participate in performing the function.
For those functions where only a single
component is expected to participate, Comp-F are used,
whilst for the rest Core-F are used. Comp-F are
located at the lowest levels of the functional
tree and Core-F, resulting from the integration
of various Comp-F, at higher levels of the tree. A
total of 21 main functions (PF) have been defined
(e.g. acquisition, commanding, data archiving, ...).
They have been decomposed following the methodology
described in section 2.1.


This results in more than 250 Core-F, more than
1200 Comp-F and about 10000 related requirements.
In addition, transversal Use Cases have been
used to model system temporal behavior involving
functions not directly linked in the functional tree.


3.5 A6-CBF feared event identification 


Around 50 Feared Events have been identified. 
Feared Events are classified in five severity levels. 


3.6 A6-CBF Functional FMECA


Exporting the system hierarchy, FA, F-FM, C-FM and
diagnostics from the EM makes it possible to work with the
FMECA via a classical template, allowing practical
navigation through the information in a form that is easy
to share with non-specialized stakeholders.


3.7 Traceability between functional analysis and 
system components 


Around 70 different Components L1 (to be developed
and validated as individual units) are currently defined.
Recursive decomposition has produced hundreds
of components L2 (SW & HW). Each component
implements a small subset of the 1200 component
functions, allowing Criticality Analysis.


Figure 5. Example of CBF transversal function.


3.8 Identification of component failure modes 


As of today around 40 HW families have been iden- 
tified. The Failure Mode/Mechanism Distributions 
database (FMD) by Quanterion is used to select 
appropriate failure modes for each HWFamily. 


3.9 Link between component failure modes and 
functional failure modes 


A stereotype object of type “Diagnostic” has been 
used to identify detection mechanisms associ- 
ated to a C-FM or F-FM. In this way, it is easy to 
identify the Failure Modes that are not detectable. 
The diagnoses are identified with the nomencla- 
ture used by the supervision component (log code 
errors). 

The general behavior of the A6 Control Bench
is defined by a general state machine specified by the
Supervision component. The transitions between its
states are Trigger type stereotypes, associated
with Feared Events. The dynamic effects of system
behavior can also be modeled with activity
diagrams or state machines, allowing triggers. In
this way it is possible to know under what conditions
a certain failure mode will produce the transition
from the bench Fail state to the FatalAlarm state.
These modes of operation are used for Supervision
components as well.


Figure 6.


Figure 7. Components vs. Function traces.


3.10 A6-CBF RAMS results 


3.10.1 Component criticality analysis 

A6-CBF Components L1 have been classified
according to criticality levels. Three components
have been identified as the most critical regarding
safety Feared Events. For each of these L1 components,
refinement work is performed in order to
identify and isolate the subcomponents with high
criticality.


Figure 8. Feared event modelling diagram in EA.


Figure 9. Feared event fault tree in BlockSim.


3.10.2 Quantitative and qualitative RAMS 
requirements verification 

In this project phase, exporting the FearedEvent
Fault Trees from EA to BlockSim and computing them
produces preliminary results, taking into
account the current design definition. As the design
evolves, further iterations are performed to ensure
design compliance with the RAMS requirements.


3.10.3 Changes requested safety assessment

A6-CBF shall operate for at least 30 years, so a specific
Integrated Logistic Support (ILS) program has been
established to ensure operational capabilities during
this period. Most of the components of the system
have a useful life of less than 30 years, which is why it is
essential to manage the obsolescence of the equipment.

A component taxonomy has been created (HW
Families) which allows ILS recommendations to be
identified and RAMS requirements to be defined
for each product. Each HWFamily can be implemented
by different pre-qualified Part Numbers.
The required reliability characteristics are defined
at the HWFamily level (MTBFmin, MTTRmax,
etc.). Failure mode distributions are also defined
at this level based on the data available in the
different reliability databases (NTNU, 2017).
Even in this initial phase, during the three years
between the first and the last Control Bench delivery,
it is highly possible that some PartNumbers
evolve due to obsolescence issues, generating specific
configurations for each bench.

Having defined the minimum RAMS requirements
at the HWFamily level ensures that the
system will continue to comply with requirements
regardless of the PartNumber used.


4 OUTLOOK 


This methodology can easily be extended to 
include cyber risk and human factors. It is possible 
to model realistic reliability and operational avail- 
ability taking into account functions required for 
each of the operational phases of the bench (e.g. 
launch campaign, launch chronology, ignition 
sequence, post-launch revalidation phase). 


4.1 Cyber risk modelling


In order to include cybersecurity analyses, func- 
tional analysis must be carried out until reaching a 
decomposition level with all required information 
transport functions (data flows). Data flows rout- 
ing (components, ports and physical links involved) 
would have to be defined. 

For each of these flows and any component
where data is processed or stored, the following failure
modes would have to be created:
— loss of data integrity,
— loss of data confidentiality, 
— unavailability of data 


Following the method presented in 2.4, these
dataflow/component failure modes will induce
functional failure modes which in turn could be
part of an existing safety feared event Fault Tree
(FE: To send an unexpected command). If specific
feared events regarding cyber risks were defined
(e.g. theft of information, deterioration of public
image), every HW/SW component could be classified
according to the criticality of the cyber risks induced.
Cyber risk requirements could be allocated to each
component according to this criticality.


4.2 Human error factors modelling 


Following the same logic used previously, for each
interface between the system and a human-type
actor, Human Error Failure Modes (HE-FM)
could be created. Different HE-FM taxonomies
are used in the literature (e.g. THERP, SHERPA,
HEART; see Pocock, Harrison, Wright,
& Johnson, 2001). These HE-FM can
be linked to the F-FM of functions performed
by components in interface with humans (usually
HMIs). Components could then be classified
according to the criticality of human errors. Human
reliability requirements could be allocated to each
component according to this criticality.


Figure 10. External actors, human type.


4.3 Reliability and operational availability 
by operation phase 


The mission profile of each A6 Control Bench site
is unique. The Ariane 6 Control Bench
Launch Pad (CBLP) is used for 10 days and its
operation is oriented to ensure safety during operations
and availability at H0 (launch time). Production
control benches are operated to ensure continuous
availability during the production checkout phase.
The development of functional analysis by means of
use cases enables the addition of “tag values” identifying
which functions are required in each phase.


5 CONCLUSIONS 


A new methodology for direct integration of safety 
analysis in a MBSE process has been presented. 
System and safety engineering activities are carried 
out in a common Engineering Model. In this way 
functional FMECA tasks can be started at a very
early stage of the project, identifying critical functions.
Particularly for SW components, a detailed
functional decomposition makes it possible to identify and
isolate high-criticality from low-criticality components.
Following design advancement (architecture & detailed
design), FMECAs can be refined by performing
criticality analysis of components/subcomponents.
This kind of incremental approach reduces the
period required for the RAMS analysis.

Coherence between RAMS and design config- 
uration is always ensured since a single and cen- 
tralized engineering model is the source of both 
types of information. Model scripting capabilities
export the selected information to reliability tools,
generating the corresponding Fault Trees and FMECA
in a more traditional format. This facilitates the
exchange with other project actors. Likewise,
quantitative and qualitative requirements can be 
computed numerically. The ability to manage per- 
mits, locks and warnings about design modifica- 
tions on the model allows the RAMS engineers to 
have an accurate view of every design change. This 
guarantees a correct impact analysis. 

Traditional system safety analysis is usually based
on informal specifications and highly dependent
on the skills of the RAMS analyst. Last but not
least, another advantage provided by this
methodology is the reduction of manual effort and
related error-prone tasks. Again, this reduces cost
and time and increases the quality of the outcome.
ABSTRACT: Storage reliability is of importance for the products that largely stay in storage in their
total life-cycle such as warning systems for harmful radiation detection, rescue systems, many kinds of 
defense systems, etc. The storage reliability of a product is commonly defined as the probability that 
the product can perform its specific function for a period of specific storage time under specific storage 
environment. Logically, the failures of the product in storage should be identified with the same criteria 
as in its operation process. However, the failure data in storage may be observed indirectly through the 
maintenance or inspection activities. Nevertheless, when the storage reliability is concerned in general, 
the reliability model should take into consideration the possibility that the operational reliability does 
not start at 100%, for example, the one-shot product may have only 96% operational reliability when 
they are newly produced. In this paper, the storage reliability model with possibly initial failures, which 
are usually neglected at the beginning of storage in most of storage models, is studied on the statistical 
analysis method when the masked data are observed. The parametric estimation procedure, based on the 
Least Squares method, is developed generally by applying an EM-like (Expectation and Maximization) 
algorithm for the storage data in which some information about which components have caused the sys- 
tem failures is not known, namely the failure data are masked. The estimates of the model parameters 
including the initial reliability are formalized. In the case of exponentially distributed storage lifetime and 
series system, a numerical example is provided to illustrate the method and procedure though the method 
is not limited to such case. The results should be useful for planning a storage environment, decision- 
making concerning the maximum length of storage, maintenance strategy optimization and identifying 
the production quality. 


1 INTRODUCTION


Storage reliability generally refers to the ability
of a product to still perform its
required functions after it has been in the storage
state for a certain period of time. There are many
examples of products or systems whose life cycle
is mostly dominated by their storage time.
The failure mechanisms of the product in storage
and operation are completely different, and this
implies that the storage reliability model should be
considered by taking the operation reliability into
account. Several storage models have been
proposed under different assumptions [1-6].

In the literature, storage reliability models
commonly assume that the products are perfect
at the beginning of the storage, but that, as a result
of the slow deterioration process, the operational
reliability degrades as their storage time
increases [5-6]. However, the operational reliability at
the beginning of storage is not always 100% for
many kinds of one-shot products [3]. Specifically,
some one-shot products cannot be evaluated as to
whether they are functioning or not; instead, they are
only assessed as good enough to fulfill the requirements
on their performance measures under
periodic maintenance [7-14]. Note that a one-shot
product refers to a device or piece of equipment
that can be used only once. After use,
it is destroyed or has to be extensively rebuilt.
Some examples are missiles, fire extinguishers, airbags
in cars, etc.

To study the storage reliability with initial failures,
a generalized storage reliability model was
proposed by Zhao & Xie [15-16]. The model is
expressed as

$$R(t) = R_0 R_s(t) \qquad (1)$$

where $R_0$ can be interpreted as the initial reliability
of the product and may not be completely known
for the products in storage. $R_s(t)$ is called the inherent
storage reliability [15] and entirely reflects the
effect of the storage on the products. Note that the
storage reliability can also be studied based on the
knowledge of Physics of Failure (PoF), and various
PoF-based models can be found in [17-23].

Recently, the estimation methods for the model 
expressed by (1) have been considered by Zhang 
et al. [5-6]. In the study presented in [5], an inte- 
grated approach is proposed to estimate the storage 
reliability based on the combinational estimates of 
the failure numbers and current reliabilities at each 
testing times when groups of binomial-type failure 
data are available. The E-Bayesian estimation of 
the failure probabilities is further applied by Zhang 
et al. [6] into the integrated approach proposed in 
[5] for the same types of the testing data. 

In general, the failure data in storage may be 
observed indirectly through the maintenance 
or inspection activities and can also be obtained 
through the measures of the product performance 
particularly based on their components. Further- 
more, the field testing data can be available, but 
the failure causes cannot be always known due 
to various reasons such as high cost, difficulty to 
diagnosis, lack of enough information, etc. This is 
the case where the system lifetime data is masked 
for the causes of some failures are unknown or the 
components resulting in the system failures cannot 
be identified [24-26]. The masked data analysis 
has been considered by several authors. Miyakawa 
[24] proposed both parametric and nonparametric 
estimators in the case of a simple two-component 
series system by assuming that the lifetimes of 
components are exponentially distributed. For a 
general series system, the expression of the likeli- 
hood function was derived by Usher & Hodgson 
[25-26]. The Maximum Likelihood Estimation 
(MLE) was considered in the cases of two and 
three-component series systems and it was shown 
that the closed-form maximum likelihood estima- 
tors are algebraically intractable. The masked data 
analysis was also studied by Hansen & Thyregod 
[27] for a superimposed renewal process. The under- 
lying lifetime distribution of the components are 
assumed to be identically mixed-exponential and 
each failure is masked with the same probability p 
for all components. The parameter p is unknown 
and has to be estimated from the field data. In software
reliability, the ML estimation of software
reliability was also considered for superimposed
nonhomogeneous Poisson processes in the case of
the masked data [27-28]. To the author’s knowledge,
however, there has not been a study of the
masked data analysis under the storage model with
possible initial failures described in (1).

In this paper, the storage reliability model 
with the possible initial failures is studied for the 
masked failure data of systems. The masked data 
are the groups of binomial-type system failure data 
that may not be identified on which components 
caused the system failures. A general method of 
the parametric estimates is developed by applying 
a modified EM (Expectation and Maximization) 
algorithm for parametric ML estimation. In the
case of exponentially distributed storage lifetime
and series system, the method and procedure are
formulated in detail. A numerical example is also
provided to illustrate the method and procedure,
though the method is not limited to such a case.
The results should be useful for planning a stor- 
age environment, decision-making concerning the 
maximum length of storage, maintenance strat- 
egy optimization and identifying the production 
quality. 


2 MODEL DESCRIPTION 
AND DATA STRUCTURE 


2.1 Model formulation 


Following the common definition, the storage 
reliability of products in this paper is defined as 
the probability that the product can perform its 
required function for a period of specific storage 
time under the defined storage environment [15]. 
Note that the definition given here implies that the 
product should be functioning in operation if it is 
said to be good. Consequently, it may not be pos- 
sible for one-shot products to be identified as good 
or not by taking only the maintenance inspection. 

Let T be the storage lifetime of a product.
Although its value can be hard to observe in reality
for some types of products, it is a genuine random
variable and is often known to be greater than, smaller
than or between some storage time points [5-6]. The
storage reliability function given in (1) can be derived as

$$R(t) = P(T > t) = P(T > t \mid \text{product is good before storage}) \cdot P(\text{product is good before storage}) = R_0 R_s(t) \qquad (2)$$

Note that “product is good before storage”
refers to the random event that the product
functions when it is new, before any storage.

The meaning of $R_0$ can simply be the probability
that a one-shot product works successfully when used.
For electronic components, it represents the
proportion of non-defectives in the population,
but mostly, it will be 100% for common products.




For the inherent storage reliability function $R_s(t)$,
some common lifetime distributions, such as exponential,
Weibull or lognormal distributions, can be
applied to model the failure process of the products
due to the material deterioration in storage.
To simplify the method presented in this paper, the
exponential distribution is applied for the inherent
storage reliability, although the method is
valid for a general lifetime distribution. The model
expressed in (1) is therefore rewritten as

$$R(t) = R_0 e^{-\lambda t} \qquad (3)$$

For a series system with k components, the storage
reliability of the system, for the sake of simplicity,
is given as

$$R(t) = R_1(t) R_2(t) \cdots R_k(t) = R_{01} R_{02} \cdots R_{0k} e^{-\lambda t} \qquad (4)$$

where $\lambda = \lambda_1 + \lambda_2 + \cdots + \lambda_k$, and $\lambda_i$ is the failure rate of the
ith component, $i = 1, 2, \ldots, k$. Note that the system
reliability given by (4) assumes that all components
have the same type of lifetimes that are exponentially
distributed with different failure rates.
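
As a quick numerical illustration of the series-system model, the following sketch evaluates the system storage reliability as the product of the component initial reliabilities times a common exponential term. The function name and the numbers in the usage example are illustrative assumptions, not values from the paper.

```python
import math


def system_storage_reliability(t, initial_reliabilities, failure_rates):
    """Series-system storage reliability with possible initial failures.

    Each component is assumed to follow R_0i * exp(-lambda_i * t), so the
    system value is prod(R_0i) * exp(-sum(lambda_i) * t).
    """
    return math.prod(initial_reliabilities) * math.exp(-sum(failure_rates) * t)


if __name__ == "__main__":
    # Hypothetical two-component system observed after 10 time units.
    print(system_storage_reliability(10.0, [0.96, 0.99], [0.01, 0.02]))
```

At t = 0 the function returns the product of the initial reliabilities, which shows how initial failures lower the starting point of the curve below 100%.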


2.2 Masked failure data 


Suppose that the systems in storage have binomial-type
failure data at sequential observation
times $t_1 < t_2 < \cdots < t_m$. The masked data can have
the form shown in Table 1.

Table 1. Binomial-type masked failure data of a series
system with k components.

Causes of        Observation times
failures         t_1               t_2               ...   t_m
*System          (n_1, f_1)        (n_2, f_2)        ...   (n_m, f_m)
Component 1      (n_11, f_11)      (n_12, f_12)      ...   (n_1m, f_1m)
Component 2      (n_21, f_21)      (n_22, f_22)      ...   (n_2m, f_2m)
...
Component k      (n_k1, f_k1)      (n_k2, f_k2)      ...   (n_km, f_km)

In Table 1, $n_j$ and $n_{ij}$ are the numbers of
systems and ith components tested at observation
time $t_j$; $f_j$ and $f_{ij}$ are the failure numbers
of the system and ith components, respectively,
$i = 1, 2, \ldots, k$; $j = 1, 2, \ldots, m$.

Note that if there are no masked data, the storage
reliability of each component at the observation
times can be simply estimated as

$$\hat{R}_i(t_j) = \frac{n_{ij} - f_{ij}}{n_{ij}}, \quad i = 1, 2, \ldots, k; \; j = 1, 2, \ldots, m \qquad (5)$$

It is also easy to write the likelihood function 
of the parameters of initial reliability and failure 
rate for each component, see e.g. [5-6] for detail. 
The parametric estimation of the component reliability
can be made separately by ordinary methods.
However, when masked data are present,
the likelihood function of all parameters of the
components has to be considered and becomes
a complex multivariable function of very high
dimension. It is therefore difficult to find the
MLE (Maximum Likelihood Estimate) of these
parameters.


2.3 Least squares estimation with 
non-masked data 


When the exponential distributions are applied for 
the inherent storage lifetimes of components, it is 
easier to obtain the initial reliability and failure 
rate of each component if the data are not masked. 
The MLE of these parameters can be obtained by
solving the ML equations numerically, see [5-6] for
details. Nevertheless, the Least Squares (LS) estimates
are more convenient in the case of exponential
distributions since analytic formulas can be
written.

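According to the reliability function of each
component, the following linear equations hold:

$$Y_{ij} = a_i + b_i t_j, \qquad (6)$$

where $Y_{ij} = \ln R_i(t_j)$, $a_i = \ln R_{0i}$ and $b_i = -\lambda_i$.
For a fixed i, by using the estimates given in formula
(5) to replace $R_i(t_j)$, $j = 1, 2, \ldots, m$, it can
easily be seen that the Least Squares estimates of
$a_i$ and $b_i$, and therefore of $R_{0i}$ and $\lambda_i$, can be calculated as

$$\hat{\lambda}_i = -\frac{\sum_{j=1}^{m} (t_j - \bar{t}) \left( \ln \hat{R}_i(t_j) - \bar{R}_{\ln,i} \right)}{\sum_{j=1}^{m} (t_j - \bar{t})^2}, \qquad (7)$$

$$\hat{R}_{0i} = \exp\left( \bar{R}_{\ln,i} + \hat{\lambda}_i \bar{t} \right), \qquad (8)$$

$$\bar{R}_{\ln,i} = \frac{1}{m} \sum_{j=1}^{m} \ln\left( \hat{R}_i(t_j) \right), \quad \bar{t} = \frac{1}{m} \sum_{j=1}^{m} t_j. \qquad (9)$$

*Note to Table 1: the system failures are called masked because
it is not identified which component’s failure caused them.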


Note that the LS estimates of the initial reliabil- 
ity and failure rate for each component cannot be 
obtained from formulas (7), (8) and (9) if the masked 
data as displayed in Table 1 are applied. The ML 
estimates are also extremely difficult to obtain since 
the ML equations will contain too many parameters. 
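
The least squares step for a single component with non-masked data can be sketched as an ordinary linear regression of the log reliability estimates on the observation times. This is a minimal illustration assuming the exponential model and unweighted least squares; the function name and the data in the usage example are hypothetical.

```python
import math


def ls_estimate(times, tested, failed):
    """Unweighted LS estimates (R0, lambda) for one component.

    Fits ln(R(t_j)) = ln(R0) - lambda * t_j, where R(t_j) is estimated by
    the surviving fraction (n_j - f_j) / n_j. Requires f_j < n_j for all j
    and at least two distinct observation times.
    """
    m = len(times)
    y = [math.log((n - f) / n) for n, f in zip(tested, failed)]
    t_bar = sum(times) / m
    y_bar = sum(y) / m
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (yj - y_bar) for t, yj in zip(times, y))
    lam = -sxy / sxx                      # slope of the line is -lambda
    r0 = math.exp(y_bar + lam * t_bar)    # intercept is ln(R0)
    return r0, lam


if __name__ == "__main__":
    # Hypothetical inspection data: 100 units tested at each time point.
    times = [1.0, 2.0, 3.0, 4.0, 5.0]
    tested = [100, 100, 100, 100, 100]
    failed = [8, 13, 18, 21, 26]
    print(ls_estimate(times, tested, failed))
```

The failure counts are allowed to be non-integer, which is convenient when expected failure numbers are substituted for observed ones.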


3 PARAMETRIC ESTIMATION USING 
MASKED DATA 


3.1 EM algorithm and its modification 


The EM (Expectation-Maximization) algorithm 
has recently been applied to solve the ML para- 
metric estimation problem in software reliabil- 
ity when the masked data are presented, see e.g. 
[27-28]. The properties of this algorithm can also 
be seen in [30-33]. 

The general idea of the EM algorithm in the
context of the masked data is to give initial values
of the parameters to be estimated, for example
$\theta^0$, and to calculate the expectations of the missing
data. The ML estimates $\theta^1$ can then be obtained as if
the complete data were available, and the procedure
is repeated taking $\theta^1$ as the new initial values until
stable values are obtained. The EM algorithm is
practical only when the ML estimates
can easily be obtained. Unfortunately, the ML estimates
of the component parameters in the storage
reliability model considered here do not have analytic
forms. This implies that the application of the
EM algorithm is also complicated.

Note that the LS estimates of the component
parameters have closed analytic forms as given in formulas
(7), (8) and (9). A modified EM algorithm,
called the ELS algorithm, is therefore proposed and
generally presented as follows:


E-step: Give the initial values of the parameters $\theta^0$,
find the expectations of the missing data;
LS-step: Using the estimated missing data and
unmasked data as if they are complete, find
the LS estimates of the parameters $\theta^1$ that will
be used as the new initial values of the E-step for
the iterations until convergence is reached.
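
The alternation between the two steps can be written as a generic fixed-point loop. The skeleton below is only a sketch: `e_step` and `ls_step` are placeholder callables supplied by the user, and the stopping rule (largest absolute parameter change below a tolerance) is an assumption, since the text only says that iterations continue until convergence.

```python
def els(theta0, e_step, ls_step, tol=1e-8, max_iter=1000):
    """Generic E-step / LS-step iteration.

    e_step(theta)   -> completed data (expected values of the missing data)
    ls_step(data)   -> new parameter list obtained by least squares
    Returns the final parameters and the number of iterations used.
    """
    theta = list(theta0)
    for iteration in range(1, max_iter + 1):
        completed = e_step(theta)
        new_theta = list(ls_step(completed))
        if max(abs(a - b) for a, b in zip(new_theta, theta)) < tol:
            return new_theta, iteration
        theta = new_theta
    raise RuntimeError("ELS iteration did not converge")


if __name__ == "__main__":
    # Toy stand-ins used only to exercise the loop; they are not a
    # reliability model. The fixed point of x -> 0.5 * x + 1 is 2.
    params, n_iter = els([0.0], lambda th: th[0], lambda x: [0.5 * x + 1.0])
    print(params, n_iter)
```

Keeping the two steps as separate callables makes it straightforward to swap the exponential model for Weibull or lognormal variants, as long as a closed-form LS step is available.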


To the author's knowledge, no similar study of the ELS algorithm
is found in the literature. In principle, the ELS algorithm can be
applied to any model for which the LS estimates are easy to obtain,
for example when the storage lifetimes follow Weibull, exponential
or log-normal distributions. In the following, the ELS algorithm is
presented in detail for exponentially distributed component storage
lifetimes.


3.2 ELS algorithm for binomial-type data


For the masked data type shown in Table 1, the masked data are
the numbers of system failures for which the components causing the
failures are not identified. To implement the ELS algorithm, one
needs the expected number of failures of each component at each
observation time point when the system has $f$ failures. For
simplicity, the following discussion omits the time variable.

Let $n$ be the number of tested systems and $f$ the number of
failures. The components have storage reliabilities $R_i$,
$i = 1, 2, \ldots, k$, so the series system has the reliability
$R = R_1 R_2 \cdots R_k$. If the components responsible for these
system failures are not identified, it can easily be shown that,
conditional on $(n, f)$, the expected number of failures of the
$i$th component, $f_i^e$, is given by

$$f_i^e = f \, \frac{1 - R_i}{1 - R}, \quad i = 1, 2, \ldots, k. \qquad (10)$$


Note that the sum of all $f_i^e$ is generally larger than the
number of system failures $f$, since one system failure can be
caused by more than one component. In the ELS algorithm, formula
(10) is used to convert the system binomial data $(n, f)$ into
component binomial data $(n, f_i^e)$, $i = 1, 2, \ldots, k$.
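A small worked instance of formula (10) may make the last remark concrete. The numbers are illustrative assumptions, not data from the paper: take $k = 2$ components with $R_1 = 0.9$ and $R_2 = 0.8$, and suppose $f = 7$ masked system failures.

```latex
R = R_1 R_2 = 0.72, \qquad 1 - R = 0.28, \\
f_1^e = 7 \cdot \frac{1 - 0.9}{0.28} = 2.5, \qquad
f_2^e = 7 \cdot \frac{1 - 0.8}{0.28} = 5.0, \\
f_1^e + f_2^e = 7.5 > f = 7 .
```

The excess of 0.5 is the expected number of system failures in which both components have failed, $7 \cdot (0.1 \cdot 0.2)/0.28 = 0.5$, which is counted once for each component.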

The ELS algorithm can now be specified for the masked data given
in Table 1 and the storage reliability model (4):

E-step: Given the initial values of the component initial
reliabilities $(R_{01}^0, R_{02}^0, \ldots, R_{0k}^0)$ and failure
rates $(\lambda_1^0, \lambda_2^0, \ldots, \lambda_k^0)$, calculate the
following:


• Component and system reliabilities at the observation times
$t_1 < t_2 < \cdots < t_m$:

$$R_{ij} = R_{0i}^0 e^{-\lambda_i^0 t_j}, \quad R_j = R_{1j} R_{2j} \cdots R_{kj},$$
$$i = 1, 2, \ldots, k; \; j = 1, 2, \ldots, m.$$


• For the masked system data $(n_j, f_j)$, calculate the expected
number of failures of each component using formula (10):

$$f_{ij}^e = f_j \, \frac{1 - R_{ij}}{1 - R_j}, \quad i = 1, 2, \ldots, k; \; j = 1, 2, \ldots, m.$$


• Update the component binomial data in Table 1 as

$$n'_{ij} = n_{ij} + n_j, \quad f'_{ij} = f_{ij} + f_{ij}^e, \qquad (11)$$
$$i = 1, 2, \ldots, k; \; j = 1, 2, \ldots, m.$$


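LS-step: Using the estimated component binomial data calculated
by formula (11) and applying formulas (5), (7), (8) and (9), the LS
estimates of the initial reliabilities and failure rates of the
components are obtained as

$$\hat{R}_i(t_j) = 1 - \frac{f'_{ij}}{n'_{ij}}, \quad i = 1, 2, \ldots, k; \; j = 1, 2, \ldots, m, \qquad (12)$$

$$\hat{R}_{0i} = \exp\left(\bar{R}_i + \hat{\lambda}_i \bar{t}\right), \qquad (13)$$

$$\hat{\lambda}_i = -\frac{\sum_{j=1}^{m} (t_j - \bar{t})\left(\ln \hat{R}_i(t_j) - \bar{R}_i\right)}{\sum_{j=1}^{m} (t_j - \bar{t})^2}, \qquad (14)$$

$$\bar{R}_i = \frac{1}{m} \sum_{j=1}^{m} \ln \hat{R}_i(t_j), \quad \bar{t} = \frac{1}{m} \sum_{j=1}^{m} t_j, \qquad (15)$$
$$i = 1, 2, \ldots, k.$$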


When the LS-step is finished, the E-step is repeated with the
initial values $(R_{01}^0, R_{02}^0, \ldots, R_{0k}^0)$ and
$(\lambda_1^0, \lambda_2^0, \ldots, \lambda_k^0)$ replaced by the LS
estimates $(R_{01}^1, R_{02}^1, \ldots, R_{0k}^1)$ and
$(\lambda_1^1, \lambda_2^1, \ldots, \lambda_k^1)$. The iteration of
the E-step and the LS-step is terminated when the LS estimates
become stable.


4 NUMERICAL EXAMPLE 


To illustrate the use of the ELS algorithm for estimating
storage reliability, a series system of two components is
considered. The binomial-type masked data, created manually for the
purpose of illustration, are listed in Table 2.

Note that there are four parameters to be estimated in this
example: the two initial reliabilities $R_{01}$, $R_{02}$ and the
two failure rates $\lambda_1$, $\lambda_2$ of component 1 and
component 2, respectively. By applying the ELS algorithm, the
estimates of these parameters can easily be obtained numerically.
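The iteration described in Section 3.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the storage model is taken as $R_i(t) = R_{0i} e^{-\lambda_i t}$, the LS fit is an ordinary least-squares line through $\ln \hat{R}_i(t_j)$, the pooling of system and component data follows the reading of formula (11) as $n_{ij} + n_j$ and $f_{ij} + f_{ij}^e$, and the component-only LS estimates are used as starting values. The function names `ls_fit` and `els` are hypothetical. The data are those of Table 2, read as pairs (number tested, number failed).

```python
import math

# Table 2 data: (n, f) = (number tested, number failed); times in 1000 h.
TIMES = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
SYSTEM = [(40, 3), (40, 4), (20, 2), (20, 3), (20, 3),
          (15, 2), (12, 2), (15, 3), (20, 4), (15, 3)]
COMPONENTS = [
    [(55, 3), (50, 4), (60, 5), (70, 7), (60, 7),
     (55, 9), (55, 12), (60, 14), (58, 15), (50, 13)],
    [(45, 2), (45, 3), (55, 6), (65, 5), (55, 10),
     (50, 11), (50, 10), (60, 12), (53, 14), (45, 13)],
]


def ls_fit(times, data):
    """Ordinary LS fit of ln R(t) = ln R0 - lam * t to binomial data."""
    y = [math.log(1.0 - f / n) for n, f in data]   # ln of the success ratio
    t_bar = sum(times) / len(times)
    y_bar = sum(y) / len(y)
    sxy = sum((t - t_bar) * (v - y_bar) for t, v in zip(times, y))
    sxx = sum((t - t_bar) ** 2 for t in times)
    lam = -sxy / sxx
    r0 = math.exp(y_bar + lam * t_bar)
    return r0, lam


def els(times, system, components, tol=1e-10, max_iter=500):
    """E-step / LS-step iteration for a series system (assumed form)."""
    # Assumption: start from the LS estimates that ignore the system data.
    params = [ls_fit(times, comp) for comp in components]
    for _ in range(max_iter):
        # E-step: component reliabilities at each observation time.
        rel = [[r0 * math.exp(-lam * t) for t in times] for r0, lam in params]
        updated = []
        for i, comp in enumerate(components):
            rows = []
            for j, ((n_ij, f_ij), (n_j, f_j)) in enumerate(zip(comp, system)):
                r_sys = math.prod(rel[k][j] for k in range(len(components)))
                # Expected masked failures of component i, formula (10).
                f_exp = f_j * (1.0 - rel[i][j]) / (1.0 - r_sys)
                rows.append((n_ij + n_j, f_ij + f_exp))
            updated.append(rows)
        # LS-step: refit each component on the pooled data.
        new_params = [ls_fit(times, rows) for rows in updated]
        change = max(abs(a - b)
                     for p, q in zip(params, new_params)
                     for a, b in zip(p, q))
        params = new_params
        if change < tol:
            break
    return params


if __name__ == "__main__":
    print("LS, component data only:", [ls_fit(TIMES, c) for c in COMPONENTS])
    print("ELS, with system data:  ", els(TIMES, SYSTEM, COMPONENTS))
```

The sketch does not guard against fitted reliabilities above 1, which would make the E-step ratio meaningless; a production version would need to handle that case.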

Figures 1 and 2 display, for Components 1 and 2 respectively,
the original reliability observations estimated by the success
ratio, the reliability estimated by the least squares method
(without system data) and the reliability estimated by the ELS
algorithm (with system data). The differences are clearly visible
but reasonable, since the system failure data should be used when
the component reliability is analyzed.


Table 2. Binomial-type masked failure data of a series
system with 2 components.

Observation time    Causes of failures
(1000 h)            System     Component 1   Component 2
5                   (40,3)     (55,3)        (45,2)
10                  (40,4)     (50,4)        (45,3)
15                  (20,2)     (60,5)        (55,6)
20                  (20,3)     (70,7)        (65,5)
25                  (20,3)     (60,7)        (55,10)
30                  (15,2)     (55,9)        (50,11)
35                  (12,2)     (55,12)       (50,10)
40                  (15,3)     (60,14)       (60,12)
45                  (20,4)     (58,15)       (53,14)
50                  (15,3)     (50,13)       (45,13)

Figure 1. The observed and estimated reliability functions of Component 1.

Figure 2. The observed and estimated reliability functions of Component 2.

The blue curves in Figures 1 and 2 represent the storage
reliability functions estimated by the least squares method without
the system data. The red curves represent the storage reliability
functions estimated by the ELS method.

Table 3 compares the parameter estimates obtained with the two
methods. The estimates of the failure rates of the components
obtained with and without the masked system data differ from each
other. This indicates that the masked system data should be taken
into account when the component parameters are estimated.

Table 3. Comparison results of parametric estimates.

                                Component 1          Component 2
Methods                         R_01      λ_1        R_02      λ_2
Estimates without system data   0.9949    0.0062     0.9927    0.0065
Estimates using system data     0.9920    0.0052     0.9861    0.0053


5 CONCLUSION 


In this paper, masked data analysis is considered for the
storage reliability model with initial failures. Estimating the
component reliability becomes difficult when the causes of system
failures are hidden. To solve this problem, an EM-like algorithm,
the so-called ELS method, is proposed for the series system with
binomial-type failure data, although the ELS algorithm can also be
applied to other system configurations. A numerical example is
provided to illustrate the method. The example shows that the
parameter estimates obtained with and without the system failure
data are quite different. The proposed ELS method greatly simplifies
parametric estimation in the case of masked data.


Probability-based reliability and availability assessments for a lane at a signalised intersection


M. Maslak & K. Ostrowski 


Cracow University of Technology, Cracow, Poland 


ABSTRACT: An original computational approach for assessing the reliability and the availability
of a lane located at an intersection with traffic signals is presented and discussed in detail. Operation
of a lane of this type is described using a probabilistic model of an alternating renewal process.
The lane's failure is defined as the occurrence of a queue of vehicles waiting for the lights to change
in which the number of vehicles is at least equal to an arbitrarily accepted number corresponding to
the critical congestion level. The proposed formal model is calibrated based on the empirical intensity
function, i.e. the rate of occurrence of failures. The availability of the lane is interpreted as the
probability that the lane will be serviceable at a predetermined time-point t, whereas its reliability is
interpreted as the probability that it will be serviceable over the whole analysed time interval.


1 INTRODUCTION 


A reliable assessment of lane’s availability at a 
signalised intersection should be unambiguously 
linked to a detailed analysis of the queues which 
form randomly on the lane, whenever arriving 
vehicles do not encounter the green signal (Chodur, 
2011; Tracz 2012). Of course, the formation of this 
type of queue is inevitable in a process of control- 
ling the individual traffic flows, but the point is 
to ensure that the queue does not turn out to be 
unduly long (Ostrowski, 2014; Ostrowski, 2015). 
In the proposed calculation procedure, the evaluator arbitrarily
sets a certain number of queued vehicles on the analysed lane,
subsequently designated as $Q_{cr}$. This number is treated as a
cut-off between a critical queue, associated with a formal lack of
the lane's availability for effective use, which the authors
designate as a failure (state “0”), and a queue of an acceptable
length, with a number of vehicles lower than $Q_{cr}$, which does
not result in termination of the lane's serviceability (state “1”).
In this way the lane performance is seen as a bi-polar relationship.
In a given event, being equivalent to a single observation, the lane
can be assigned only one of two mutually exclusive assessments:
serviceable (“1”) or unserviceable (“0”). In such an approach, the
process of use of the considered lane, viewed over a 24-hour cycle,
can be modelled as a so-called alternating renewal process with
alternating time-intervals of serviceability and non-serviceability
of random durations. In this model, each observation corresponds to
a full cycle of signal changes of the lane's traffic signals.
Usually, the cycle features a green-yellow-red sequence of a fixed
length. Nevertheless, the constant length of the entire sequence
does not mean a constant duration of each colour signal. In each
sequence, these intervals can have random durations, which is
typical of so-called accommodative signalling. Knowing the unified
duration of a whole single sequence, it is easy to determine the
number of observations, i.e. of complete sequences, per hour $t$.
This number is referred to as $n_t$. In accordance with the above
assumptions, it does not depend on the time at which the
observations were carried out, thus $\forall t \; n_t = n$. The
authors subsequently relate to this value the numbers
$n_t^{(0)} = n^{(0)}(t)$ and $n_t^{(1)} = n^{(1)}(t)$, where
$n_t^{(0)} + n_t^{(1)} = n_t = n$, determined for each hour of the
observations corresponding to the fixed number of signal change
sequences. The first of these two values is the number of events in
which the lack of the lane's serviceability was noted in a given
hour $t$, while the second one is the number of observations
evidencing the complete serviceability of the lane. Both values are
used here to create the empirical histograms: in the first case the
one relating to a random number of the lane's failures and, in the
second case, the one associated with a random number of the lane's
serviceability states. It can be noticed that the proposed model
does not go into detail as to the structure of traffic on the lane,
nor does it distinguish among types of vehicles or their speed
parameters. The authors are merely interested in the measured values
of a random failure rate, specified in each hour, and also in its
empirical distribution, identified taking into account a whole
24-hour observation cycle.


2 EMPIRICAL FUNCTIONAL 
CHARACTERISTICS DETERMINING 
THE RELIABILITY OF THE LANE 
UNDER CONSIDERATION 


2.1 Histograms relating to the random numbers 
of the lane’s failures and to the random 
number of the lane’s serviceability states 


In order to enhance the clarity of the argument, the proposed
procedure is presented here on a computational example interpreting
the results of an experiment conducted by the authors over 100
observation days on a selected lane of one of Cracow's major
signalised intersections. The signalling selected for the experiment
had exactly $n = 25$ full signal change cycles during each hour,
which translates into $N = 24n = 600$ such cycles observed each day.
The results obtained on each observation day were used in histograms
displaying the random number $n_t^{(0)}$ of the lane's failures and,
separately, in histograms displaying the random number $n_t^{(1)}$
of the lane's serviceability states, with a one-hour width of the
class interval ($\Delta t = 1$ h) and with the half hours
$t$ = 0 h 30 m; 1 h 30 m; ...; 23 h 30 m being the centres of the
intervals. The most probable daily histogram of the random number of
the lane's failures is shown in Figure 1a, while the most probable
histogram of the random number of the lane's serviceability states
is shown in Figure 1b. They were developed on the assumption that
$Q_{cr} = 7$. It is easy to notice that at night, specifically
between 20 h 30 m and 6 h 30 m of the next day, the traffic on the
analysed lane was so light that in principle there were no failures.
Such failures could happen sporadically, on separate days, but this
was not reflected in the histogram averaging the results compiled
throughout the experiment. On the other hand, in the morning hours,
between 9 h 30 m and 10 h 30 m, the analysed lane was unavailable on
each day of the study.


2.2 Empirical functions of the lane’s reliability 
and of the lane’s unreliability 


The data in the histogram shown in Figure 1a allow assigning to
the analysed lane the empirical function of its unreliability
$\hat{F}(\tau)$, while the data given in Figure 1b form the
empirical function of the lane's reliability $\hat{R}(\tau)$. In
this approach $\tau$ denotes a selected value of the variable $t$.
The appropriate mathematical formulae look as follows (Migdalski,
1982):

Figure 1. The most probable empirical histograms displaying on a
day-by-day basis the random number of the lane's availability
failures (Figure 1a) and the random number of its serviceability
states (Figure 1b), obtained from the study of the analysed lane.
The length of the critical queue was assumed at $Q_{cr} = 7$.

The graphs of these functions, for the data considered in the
example (and compiled in Tables 1 and 2), are presented in Figure 2a
and in Figure 2b, respectively.


2.3 Empirical probability density functions of the 
random occurrence of the lane’s availability 
and failure states 


As the next step of the analysis, the appropriate empirical
probability density functions (pdfs) are developed: first for a
random occurrence of the lane's availability states and then for a
random occurrence of the lane's lack of availability. These
functions are denoted $\hat{f}(\tau)$ and $\hat{g}(\tau)$,
respectively. Based on (Migdalski, 1982) we have:

$$\hat{f}(\tau) = \frac{n - n^{(0)}(\tau)}{N \Delta t} \qquad (3)$$

$$\hat{g}(\tau) = \frac{n^{(0)}(\tau)}{N \Delta t} \qquad (4)$$


Table 1. Numbers of cases, obtained experimentally for each class
interval, with the availability and with the lack of the
availability of the analysed lane.

t = τ       n^(0)   n^(1)   N_τ^(0)   N_{τ+Δt}^(0)   N_τ
0 h 30 m    0       25      600       600            600
1 h 30 m    0       25      600       600            600
2 h 30 m    0       25      600       600            600
3 h 30 m    0       25      600       600            600
4 h 30 m    0       25      600       600            600
5 h 30 m    0       25      600       597            598.5
6 h 30 m    3       22      597       577            587
7 h 30 m    20      5       577       558            567.5
8 h 30 m    19      6       558       533            545.5
9 h 30 m    25      0       533       518            525.5
10 h 30 m   15      10      518       508            513
11 h 30 m   10      15      508       503            505.5
12 h 30 m   5       20      503       501            502
13 h 30 m   2       23      501       498            499.5
14 h 30 m   3       22      498       493            495.5
15 h 30 m   5       20      493       486            489.5
16 h 30 m   7       18      486       478            482
17 h 30 m   8       17      478       472            475
18 h 30 m   6       19      472       470            471
19 h 30 m   2       23      470       470            470
20 h 30 m   0       25      470       470            470
21 h 30 m   0       25      470       470            470
22 h 30 m   0       25      470       470            470
23 h 30 m   0       25      470       470            470

Figure 2. The unreliability (Figure 2a) and the reliability
(Figure 2b) functions determined experimentally for the analysed
lane.


The graphs of these functions obtained for the data considered
in the example and compiled in Table 2 are shown in Figure 3a and in
Figure 3b, respectively.


2.4 Empirical values relating to the lane’s renewal 
intensity and other ones—determining its 
failure rate 


The next pair of empirical dependencies determined by the
authors for the lane considered in the example involves,
respectively, the intensity of the lane's renewals
$\hat{\delta}(\tau)$, when the lane returns to its availability
state, and the lane's failure rate $\hat{\lambda}(\tau)$. In order
to determine these values for the selected argument $t = \tau$, the
authors first need to prepare the following auxiliary parameters:

$$N_\tau^{(0)} = N - \sum_{t = 0\,\mathrm{h}\,30\,\mathrm{m}}^{\tau} n_t^{(0)} \qquad (5)$$

$$N_{\tau + \Delta t}^{(0)} = N_\tau^{(0)} - n^{(0)}(\tau + \Delta t) \qquad (6)$$

for which:

$$N_\tau = 0.5 \left( N_\tau^{(0)} + N_{\tau + \Delta t}^{(0)} \right) \qquad (7)$$

Based on the above, the following is derived, respectively:

$$\hat{\delta}(\tau) = \frac{n - n^{(0)}(\tau)}{N_\tau \Delta t} \qquad (8)$$

and:

$$\hat{\lambda}(\tau) = \frac{n^{(0)}(\tau)}{N_\tau \Delta t} \qquad (9)$$


Table 2. Detailed values determining the functions characterising
the availability or the lack of the availability of the analysed
lane, calculated on the basis of the empirical data.

t = τ       R(τ)    F(τ)    f(τ)    g(τ)    δ(τ)    λ(τ)
0 h 30 m    1.000   0.000   0.042   0.000   0.042   0.000
1 h 30 m    1.000   0.000   0.042   0.000   0.042   0.000
2 h 30 m    1.000   0.000   0.042   0.000   0.042   0.000
3 h 30 m    1.000   0.000   0.042   0.000   0.042   0.000
4 h 30 m    1.000   0.000   0.042   0.000   0.042   0.000
5 h 30 m    1.000   0.000   0.042   0.000   0.042   0.000
6 h 30 m    0.995   0.005   0.037   0.005   0.038   0.005
7 h 30 m    0.962   0.038   0.008   0.033   0.009   0.035
8 h 30 m    0.930   0.070   0.010   0.032   0.011   0.035
9 h 30 m    0.888   0.112   0.000   0.042   0.000   0.048
10 h 30 m   0.863   0.137   0.017   0.025   0.020   0.029
11 h 30 m   0.847   0.153   0.025   0.017   0.030   0.020
12 h 30 m   0.838   0.162   0.033   0.008   0.040   0.010
13 h 30 m   0.835   0.165   0.038   0.003   0.046   0.004
14 h 30 m   0.830   0.170   0.037   0.005   0.044   0.006
15 h 30 m   0.822   0.178   0.033   0.008   0.041   0.010
16 h 30 m   0.810   0.190   0.030   0.012   0.037   0.015
17 h 30 m   0.797   0.203   0.028   0.013   0.036   0.017
18 h 30 m   0.787   0.213   0.032   0.010   0.040   0.013
19 h 30 m   0.783   0.217   0.038   0.003   0.049   0.004
20 h 30 m   0.783   0.217   0.042   0.000   0.053   0.000
21 h 30 m   0.783   0.217   0.042   0.000   0.053   0.000
22 h 30 m   0.783   0.217   0.042   0.000   0.053   0.000
23 h 30 m   0.783   0.217   0.042   0.000   0.053   0.000

Figure 3. Empirical probability density functions of the
occurrence of the state of the lane's availability (Figure 3a)
and of the state of the lane's unavailability (Figure 3b).


Figure 4. Empirical functions of the lane's renewal intensity
(Figure 4a) and of the lane's failure rate (Figure 4b) relating to
the analysed lane.

The detailed values of both these functions, calculated for
subsequent values of the argument $t = \tau$, are compiled in
Tables 1 and 2. These functions are also shown in Figure 4a and in
Figure 4b, respectively.
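The tabulated characteristics of Section 2 can be recomputed from the hourly failure counts alone. The following Python sketch is an illustration under stated assumptions: the densities and intensities follow formulas (3), (4), (8) and (9) as read above, the auxiliary counts follow (5) to (7), and the expressions used for the reliability and unreliability columns ($N_\tau^{(0)}/N$ and its complement) are assumptions chosen because they reproduce the tabulated values. The function name `empirical_characteristics` is hypothetical.

```python
# Hourly failure counts n^(0)(tau) from Table 1 (class centres 0h30 ... 23h30).
N0_COUNTS = [0, 0, 0, 0, 0, 0, 3, 20, 19, 25, 15, 10,
             5, 2, 3, 5, 7, 8, 6, 2, 0, 0, 0, 0]
N_PER_HOUR = 25   # signal cycles per hour (n)
DELTA_T = 1.0     # class width in hours


def empirical_characteristics(n0_counts, n=N_PER_HOUR, dt=DELTA_T):
    """Return one dict per class interval with the Table 2 quantities."""
    total = n * len(n0_counts)           # N = 24 n = 600
    rows = []
    cumulative = 0
    for idx, n0 in enumerate(n0_counts):
        cumulative += n0
        n_start = total - cumulative     # N_tau^(0), formula (5)
        nxt = n0_counts[idx + 1] if idx + 1 < len(n0_counts) else 0
        n_end = n_start - nxt            # N_{tau+dt}^(0), formula (6)
        n_mid = 0.5 * (n_start + n_end)  # N_tau, formula (7)
        rows.append({
            "R": n_start / total,        # assumed form of the reliability
            "F": cumulative / total,     # assumed form of the unreliability
            "f": (n - n0) / (total * dt),      # formula (3)
            "g": n0 / (total * dt),            # formula (4)
            "delta": (n - n0) / (n_mid * dt),  # formula (8)
            "lambda": n0 / (n_mid * dt),       # formula (9)
        })
    return rows


if __name__ == "__main__":
    for hour, row in enumerate(empirical_characteristics(N0_COUNTS)):
        print(f"{hour} h 30 m", {k: round(v, 3) for k, v in row.items()})
```

Running the sketch on the counts of Table 1 gives, for example, a failure rate of about 0.048 for the 9 h 30 m class, in line with Table 2.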


3 RENEWAL FUNCTION AND RENEWAL 
DENSITY SPECIFIED FOR THE 
WEIBULL-TYPE RENEWAL PROCESS 


3.1 Idea of an alternating renewal process
modelling the use of the lane


As indicated in Section 1, the manner in which the lane is used
during successive full signal change cycles of a fixed duration can
be formally described by the scheme of an alternating renewal
process, in which the time-intervals of the lane's serviceability
(availability) of random length $T_i$ alternate with the
time-intervals of the lane's non-serviceability (failure), which are
also of random length $\Theta_i$. As a result of this assumption, we
are dealing with a sequence
$T_1, \Theta_1, T_2, \Theta_2, \ldots, T_i, \Theta_i, \ldots$
(Figure 5).


3.2 Modelled probability distributions for the 
random durations of the states with the lane’s 
serviceability and of the states with the lane’s 
non-serviceability. 


Figure 5. Scheme of an alternating renewal process considered in
the example.

Careful analysis of the graphs of the empirical probability
density functions, respectively $\hat{f}(\tau)$, related to a random
occurrence at time $\tau$ of a state with the lane's availability,
and $\hat{g}(\tau)$, related to a random occurrence of a state with
the lane's failure, as shown in Figure 3, indicates that it is
difficult to assign one probability distribution, common to the
whole day, to the random durations of these states. This conclusion
is further corroborated by the non-monotonic graphs of the functions
$\hat{\delta}(\tau)$ and $\hat{\lambda}(\tau)$. Consequently, it is
recommended to break the full daily cycle down into shorter
component time-intervals for which the fitted probability
distribution modelling the use of the lane in such a short period
becomes sufficiently reliable. Under this approach, at night, when
the lane's availability seems to be fully ensured during the entire
time of its use, the random duration of the state of the lane's
serviceability should undoubtedly be modelled by a uniform
probability distribution. Much more interesting, however, are the
lane's states (“availability” or “unavailability”) of random
durations which alternate with each other during the day.
Considering the fact that the values of the lane's failure rate did
not turn out to be constant in any sufficiently long time-period,
the authors recommend describing the random durations of the lane's
serviceability states $T_i$, and also the random durations of the
lane's non-serviceability states $\Theta_i$, by two-parameter
Weibull probability distributions, for which the intensity of
failure occurrence is determined from the formula (Kaminskiy, 2013):


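$$\lambda(\tau) = \frac{\beta_\lambda}{\alpha_\lambda} \left( \frac{\tau}{\alpha_\lambda} \right)^{\beta_\lambda - 1} \qquad (10)$$


whereas the intensity of a successive renewal, leading to the
lane's serviceability state, is derived from a similar formula:

$$\delta(\tau) = \frac{\beta_\delta}{\alpha_\delta} \left( \frac{\tau}{\alpha_\delta} \right)^{\beta_\delta - 1} \qquad (11)$$


In this approach, $\alpha_\lambda$ and $\alpha_\delta$ are the
scale parameters, while $\beta_\lambda$ and $\beta_\delta$ are
interpreted as the shape coefficients. In time-intervals when the
function $\lambda(\tau)$ rises non-linearly it is usually assumed
that $\beta_\lambda > 3$. On the other hand, for time-intervals with
a non-linearly decreasing function $\lambda(\tau)$, a value
$\beta_\lambda < 1$ should be assumed.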


3.3 Determining the renewal function
relating to the considered process


The choice of a renewal function that fits well all the
parameters of the considered formal model is difficult, especially
in the case of a non-monotonic empirical failure intensity function
$\hat{\lambda}(\tau)$. In our example this function turns out to be
increasing during certain periods of the day and decreasing during
others. For such complex cases the professional literature
recommends using a mixture of probability distributions (Kaminskiy,
2013). In the simplified model proposed by the authors in this paper
the Weibull probability distribution is chosen to describe both the
random times of the lane's serviceability $T_i$ and the random times
of the lane's non-serviceability $\Theta_i$. With this assumption,
the conventional approach to determining the renewal function, based
on the Laplace transform technique (Bobrowski, 1985), becomes rather
ineffective. Some proposals for determining this function as
accurately as possible for the Weibull-type renewal process were
given, for example, in (Yannaros, 1994) and in (Lomnicki, 1966). In
other papers, e.g. in (Smeitink & Dekker, 1990) and in (Jiang,
2008), more or less simplified estimates were recommended. In our
analysis we have chosen for practical use the approximate formula
given in (Jiang, 2010). This formula is appropriate, however, only
in time-intervals for which the failure rate can be regarded as
increasing. It is constructed as follows:

$$H(\tau) = \varepsilon F(\tau) + (1 - \varepsilon) \Lambda(\tau) \qquad (12)$$

As one can see, the renewal function $H(\tau)$ is calculated here
as a combination of the cumulative distribution function (cdf)
$F(\tau)$ of the Weibull probability distribution and of the
function $\Lambda(\tau)$ measuring the cumulative frequency of the
lane's failures. The combination coefficient $\varepsilon$,
estimated using the least-squares method, is derived in this
approach from the formula:

(13)

Obviously, its value depends on the value of the shape
coefficient, i.e. $\beta = \beta_\lambda$ or $\beta = \beta_\delta$,
respectively. It should be emphasised that the above-mentioned
formula was calibrated on the assumption that the scale coefficient
was set at the level $\alpha = 1$. The cumulative distribution
function of the Weibull probability distribution, described
according to both (10) and (11), is expressed as:

$$F(\tau) = 1 - \exp\left[ -\left( \frac{\tau}{\alpha} \right)^{\beta} \right] \qquad (14)$$

The cumulative failure intensity function for the same
distribution can be calculated directly from the definition, which
gives:

$$\Lambda(\tau) = -\ln[1 - F(\tau)] = \left( \frac{\tau}{\alpha} \right)^{\beta} \qquad (15)$$
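The building blocks of the approximation (12) can be sketched numerically as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the combination coefficient is passed in by the caller as the argument `eps` because its defining formula (13) is not reproduced here. The default scale $\alpha = 1$ mirrors the calibration assumption stated in the text.

```python
import math


def weibull_cdf(tau, beta, alpha=1.0):
    """Weibull cdf, formula (14)."""
    return 1.0 - math.exp(-((tau / alpha) ** beta))


def weibull_cum_hazard(tau, beta, alpha=1.0):
    """Cumulative failure intensity, formula (15)."""
    return (tau / alpha) ** beta


def weibull_hazard(tau, beta, alpha=1.0):
    """Weibull failure intensity of the form used in (10) and (11)."""
    return (beta / alpha) * (tau / alpha) ** (beta - 1.0)


def renewal_approx(tau, beta, eps, alpha=1.0):
    """Weighted combination of cdf and cumulative hazard, formula (12).

    eps is the combination coefficient; it must be supplied by the caller.
    """
    return (eps * weibull_cdf(tau, beta, alpha)
            + (1.0 - eps) * weibull_cum_hazard(tau, beta, alpha))


if __name__ == "__main__":
    # Illustrative values only: beta = 2.5 (increasing intensity), eps = 0.4.
    for tau in (0.25, 0.5, 0.75, 1.0):
        print(tau, weibull_cdf(tau, 2.5), renewal_approx(tau, 2.5, 0.4),
              weibull_cum_hazard(tau, 2.5))
```

Because $1 - e^{-x} \le x$, the cdf never exceeds the cumulative hazard, so for any coefficient between 0 and 1 the approximation lies between these two curves.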


Detailed values of this function can also be derived directly
from the empirical data, on the basis of the histograms presented in
Section 2 of this paper. It is essential that the approach described
above allows an efficient determination of the renewal function
$H(\tau)$ only for those time-intervals of the day when the
empirical failure intensity function $\hat{\lambda}(\tau)$ is
observed to be monotonically increasing. In future research the
authors will try to determine a similar renewal function for
time-intervals of decreasing failure intensity. To attain this
objective, we want to apply the procedure proposed in (Smith &
Leadbetter, 1963), according to which:

$$H(\tau) = \sum_{i=1}^{\infty} \frac{(-1)^{i-1} A_i z^i}{\Gamma(i\beta + 1)} \qquad (16)$$

where

$$z = \left( \frac{\tau}{\alpha} \right)^{\beta} \qquad (17)$$

$$A_1 = \gamma_1 \qquad (18)$$

$$A_i = \gamma_i - \sum_{j=1}^{i-1} \gamma_j A_{i-j} \qquad (19)$$

$$\gamma_i = \frac{\Gamma(i\beta + 1)}{i!} \qquad (20)$$

$$\Gamma(k) = \int_0^{\infty} x^{k-1} e^{-x} \, dx \qquad (21)$$
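The recursive series can be evaluated with a truncated sum, as in the following Python sketch. It assumes the series takes the form $H(\tau) = \sum_i (-1)^{i-1} A_i z^i / \Gamma(i\beta + 1)$ with the recursion for $A_i$ given in (18) to (20); the function name and the truncation length `terms` are illustrative choices. The alternating sum loses accuracy for large $z$, so the sketch is meant for small and moderate arguments only.

```python
import math


def smith_leadbetter_renewal(tau, beta, alpha=1.0, terms=20):
    """Truncated series for the Weibull renewal function (assumed form)."""
    z = (tau / alpha) ** beta
    # gam[i] holds gamma_{i+1} = Gamma((i+1)*beta + 1) / (i+1)!
    gam = [math.gamma(i * beta + 1.0) / math.factorial(i)
           for i in range(1, terms + 1)]
    # coeffs[i] holds A_{i+1}, built by the recursion (18)-(19).
    coeffs = []
    for i in range(terms):
        a_i = gam[i] - sum(gam[j] * coeffs[i - 1 - j] for j in range(i))
        coeffs.append(a_i)
    return sum(((-1) ** i) * coeffs[i] * z ** (i + 1)
               / math.gamma((i + 1) * beta + 1.0)
               for i in range(terms))


if __name__ == "__main__":
    # beta = 1 is the exponential case, where the renewal function is tau/alpha.
    print(smith_leadbetter_renewal(0.7, 1.0))
    # beta < 1 corresponds to a decreasing failure intensity.
    print(smith_leadbetter_renewal(0.3, 0.7))
```

For $\beta = 1$ all coefficients beyond the first vanish and the series collapses to $\tau/\alpha$, which gives a quick check of the recursion.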


3.4 Determining the renewal density 


Starting from equation (12), defined for time-intervals with
increasing values of the function $\lambda(\tau)$, it is easy to
determine the renewal density accompanying this phase of the lane's
use. It can be expressed as:

$$h(\tau) = \frac{dH(\tau)}{d\tau} = \varepsilon f(\tau) + (1 - \varepsilon) \lambda(\tau) = \lambda(\tau) \left[ 1 - \varepsilon + \varepsilon R(\tau) \right] \qquad (22)$$


In this case, the lane's reliability function $R(\tau)$
approximates the empirical function described in formula (2),
whereas the probability density function of the lane's failure
$f(\tau)$ approximates the corresponding empirical function defined
by equation (3). Application of formula (16) for time-intervals with
decreasing values of the function $\lambda(\tau)$ leads to the
dependence given in (Smith & Leadbetter, 1963):


(23)


4 AVAILABILITY AND RELIABILITY 
OF THE ANALYSED LANE 


4.1 Lane's availability in the model of the
alternating renewal process


So far, the random durations of the lane's serviceability state
$T_i$ have been described by a probability distribution
characterised by the probability density function $f(\tau)$
(approximating the empirical function $\hat{f}(\tau)$) and by the
cumulative distribution function $F(\tau)$. Similarly, the lane's
random non-serviceability time-intervals $\Theta_i$ have been
described by a probability distribution with the probability density
function $g(\tau)$ and with the cumulative distribution function
$G(\tau)$. With these assumptions, according to the conventional
renewal theory, the probability distribution of the random
time-intervals between consecutive renewals (interpreted as the
restoration of the lane to the serviceability state), i.e. of the
time-intervals arranged in the sequence

$$T_1 + \Theta_1, \; T_2 + \Theta_2, \; \ldots, \; T_i + \Theta_i, \; \ldots \qquad (24)$$

can be described by the cumulative distribution function:

$$\Phi(\tau) = \int_0^{\tau} F(\tau - t) \, dG(t) \qquad (25)$$

for which the Laplace transform is as follows:

$$\tilde{\Phi}(s) = s \tilde{F}(s) \tilde{G}(s) \qquad (26)$$

A similar formula is true for the Laplace transform of the
probability density functions, giving:

$$\tilde{\varphi}(s) = \tilde{f}(s) \tilde{g}(s) \qquad (27)$$

Based on the above, the authors compute the Laplace transform of
the renewal function of the alternating renewal process analysed in
the experiment described in the example, which produces:

$$\tilde{H}(s) = \frac{\tilde{f}(s) \tilde{g}(s)}{s \left[ 1 - \tilde{f}(s) \tilde{g}(s) \right]} \qquad (28)$$

The function $H(\tau)$ obtained by the appropriate inverse
transform is interpreted in this case as the expected number of the
lane's renewals within the $(0, \tau)$ time-interval. Likewise, for
the same lane, the expected number of the lane's failures can be
expressed as a renewal function $H_1(\tau)$, for which the Laplace
transform is expressed by the formula:

$$\tilde{H}_1(s) = \frac{\tilde{f}(s)}{s \left[ 1 - \tilde{f}(s) \tilde{g}(s) \right]} \qquad (29)$$

The lane's availability $K(\tau)$ in this model expresses the
probability that at a certain time $\tau$ the lane will be observed
in a serviceability state. Therefore, this availability can be
calculated as the sum of the probabilities of two mutually exclusive
random events formulated as follows:

- the first event: the serviceability time-interval until the
first lane's failure is longer than $\tau$, which means that
$T_1 > \tau$,

- the second event: the $n$-th lane's renewal occurs within the
$(t, t + \Delta t)$ time-interval, while within the $(t, \tau)$
time-interval the next lane's failure does not occur.


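Hence:

$$K(\tau) = R(\tau) + \int_0^{\tau} R(\tau - t) \, dH(t) \qquad (30)$$

The Laplace transform of this availability is expressed by the
following formula:

$$\tilde{K}(s) = \left[ 1 - \tilde{f}(s) \right] \left[ \frac{1}{s} + \tilde{H}(s) \right] \qquad (31)$$

which, after a substitution of (28), yields:

$$\tilde{K}(s) = \frac{1 - \tilde{f}(s)}{s \left[ 1 - \tilde{f}(s) \tilde{g}(s) \right]} \qquad (32)$$

Ultimately:

$$\tilde{K}(s) = \tilde{H}(s) - \tilde{H}_1(s) + \frac{1}{s} \qquad (33)$$

hence:

$$K(\tau) = H(\tau) - H_1(\tau) + 1 \qquad (34)$$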


This means that, in the model of the alternating renewal process
presented in this paper, the probability that the analysed lane will
be observed at time $\tau$ in a state of non-serviceability,
expressed as the difference $1 - K(\tau)$, is equal to the
difference between the expected number of the lane's failures and
the expected number of the lane's renewals that have occurred until
that time.
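As a plausibility check of (28), (29) and (34), consider the special case of exponentially distributed durations, which is not the case treated in the paper. The constant failure rate $a$ and the constant renewal rate $b$ are symbols introduced only for this illustration, so that $\tilde{f}(s) = a/(s + a)$ and $\tilde{g}(s) = b/(s + b)$.

```latex
1 - \tilde{f}(s)\tilde{g}(s) = \frac{s\,(s + a + b)}{(s + a)(s + b)}, \\
\tilde{H}(s) = \frac{ab}{s^2 (s + a + b)}, \qquad
\tilde{H}_1(s) = \frac{a\,(s + b)}{s^2 (s + a + b)}, \\
\tilde{K}(s) = \tilde{H}(s) - \tilde{H}_1(s) + \frac{1}{s}
             = \frac{s + b}{s\,(s + a + b)}, \\
K(\tau) = \frac{b}{a + b} + \frac{a}{a + b}\, e^{-(a + b)\tau}.
```

The result is the familiar availability of a two-state system with constant rates: it starts at $K(0) = 1$ and tends to $b/(a + b)$, which suggests that the reconstructed relations are mutually consistent.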


4.2 Reliability of the analysed lane 


The basic difference between the availability $K(\tau)$ and the
reliability $Q(\tau, \Xi)$ determined for the lane considered in the
example is that the first of these values is determined for a
selected time-point $t = \tau$, while the second one is determined
for the entire time of the observation, for example for
$\Xi = 24$ h. In this approach, the reliability of the lane is
interpreted as the probability that the lane will be observed in a
serviceability state throughout the whole day. Therefore the
following formula can be applied for practical use:


(35)


5 CONCLUDING REMARKS 


The main aim of the research undertaken by the 
authors was to verify the possibility of describing 
the process of the use of the lane at a signalised 
intersection, having unambiguously random char- 
acteristics, using the classical formal model of an 
alternative renewal process. In this analysis, failures 
of such lane corresponded to the time-intervals 
when it was non-serviceable because the queue of 
vehicles awaiting before the red traffic signal turned 
out to be too long. However, in subsequent signal 
cycles with a fixed duration of a single cycle, the 
lane was renewed after each occurrence of a non- 
serviceability state, so that it became serviceable 
until the next failure. According to the authors, 
the choice of the formal model itself seems to be 
quite straightforward, although its authoritative 
and reliable mathematical description is difficult. 
The experiment considered in the example revealed 
that the use of the lane analysed in a 24-hour cycle 
is associated with complex, multimodal probabil- 


2579 


ity distributions of subsequent failures and sub- 
sequent renewals of random duration. Therefore, 
a precise description of the process of this type 
seems to require an appropriate combination of 
the probability distributions, which significantly 
complicates both the mathematical formulae them- 
selves and the resulting inferences. In the presented 
analysis the authors attempted to apply in practice 
certain, seemingly acceptable, simplifications. Because
the function defining the lane’s failure intensity did
not turn out to be uniform with respect to the
observation time at any stage of the daily cycle
analysed, the authors were unable to use the simple,
theoretically well described formal model developed
for renewal processes of the exponential type.
Therefore a different, more complex calculation
procedure has been proposed, consisting of a set of
analyses conducted separately over certain
time-intervals, for which one could assume an
unambiguously monotonically increasing or
monotonically decreasing graph of the failure
intensity function. This approach allowed us to
describe the considered renewal process by using a
two-parameter Weibull probability distribution,
reduced further to a one-parameter distribution by
the assumption that the scale parameter was set at
α = 1. The composition of the appropriate failure
rate functions, each of which is specified in a
given time-interval, should allow an accurate
calculation of the reliability of the analysed
lane. The basic problem involved in describing the
actual process of the lane’s use by means of a for- 
mal model of an alternative renewal process with 
random serviceability and non-serviceability time- 
intervals characterised by the Weibull probability 
distributions is that in this case it is impossible to 
effectively use the Laplace transform to specify the 
renewal function. In fact, the precise determination 
of this function in a strict manner becomes possible 
only on the basis of complex and time-consuming 
numerical calculations. For this reason, the authors
recommend using a simplified estimate, presented in
closed mathematical form, which seems to be acceptable,
even if prone to quite a significant approximation
error for high values of the shape coefficient β.
This estimate, however, can only be used during
time-intervals with an increasing failure
intensity function. Wherever the
intensity of the failure occurrence is diminishing, 
the authors suggest using a more complex recur- 
sive procedure, which is well described in profes- 
sional literature. The authors intend to verify the 
accuracy of this type of approach in the future. 
Undoubtedly, if a well-verified renewal function
determining the proposed formal model is
unambiguously identified for the available
experimental data, it should make it possible to
calculate first the lane’s availability at a given
time-point t = T, and subsequently its reliability
over the whole observation time-interval.
As the functions of this type, proposed by the
authors so far, do not seem to be determined accu- 
rately enough, in the final part of this study, the 
authors limit themselves to presenting subsequent 
steps of the planned future procedure, without pro- 
viding numerical results of the applied formulae. 
Nevertheless, the authors hope that this description 
of the process of the lane’s use at a signalised inter- 
section will prove useful in the practical design of 
intersections of this type, even if the mathematical 
model itself seems to be relatively complex. 
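
The numerical route mentioned above can be illustrated with a minimal sketch. It is not the authors' procedure: it treats an ordinary renewal process (not the alternating process of the paper) with Weibull inter-renewal times and a scale parameter of 1, and it approximates the renewal function M(t) by a simple first-order discretisation of the renewal equation M(t) = F(t) + ∫ M(t − x) dF(x). The function names, the grid size and the example shape values are assumptions made for illustration only.

```python
import math


def weibull_cdf(t, shape, scale=1.0):
    """CDF of the two-parameter Weibull distribution (0 for t <= 0)."""
    if t <= 0.0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))


def renewal_function(t_max, shape, scale=1.0, steps=400):
    """Approximate M(t) on a uniform grid over [0, t_max].

    Discretises M(t) = F(t) + integral_0^t M(t - x) dF(x) with
    M evaluated at the left end of each shifted cell, so the
    result slightly underestimates the exact renewal function.
    Returns (grid, values).
    """
    h = t_max / steps
    grid = [i * h for i in range(steps + 1)]
    cdf = [weibull_cdf(t, shape, scale) for t in grid]
    m = [0.0] * (steps + 1)
    for i in range(1, steps + 1):
        acc = cdf[i]
        for j in range(1, i):
            # probability mass of F in (t_{j-1}, t_j] times M(t_i - t_j)
            acc += m[i - j] * (cdf[j] - cdf[j - 1])
        m[i] = acc
    return grid, m


# Sanity check: for shape = 1 (exponential case) the exact result is M(t) = t.
_, m_exp = renewal_function(t_max=2.0, shape=1.0)
# Increasing failure intensity (shape > 1), as in the intervals discussed above.
_, m_wei = renewal_function(t_max=2.0, shape=2.0)
```

The exponential case serves as a check because its renewal function is known in closed form; for other shape values only the numerical values are available, which is the difficulty the authors describe.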

ABSTRACT: Current methods for Technology Qualification (TQ) rely to a large degree on traditional
reliability methods developed for hardware systems. However, as technology becomes more dependent
upon software, leading to more complex systems, there is a need to better reflect the system perspective.
One aspect of complexity is that the system constituents are heavily interconnected and dependent upon
each other, i.e. the system shows emergent behavior. Systemic failures, caused not by component failures
but by how the constituents interact (i.e. emergence), need to be addressed differently from previous
practices. This paper presents a consistent way of classifying failures to help discover the critical
failures and to guide how to handle the identified failures in a system TQ setting. To ensure that the
identification process is consistent and as complete as possible, we advocate that failures should be classified
according to different perspectives, where the classification within each perspective is Mutually Exclusive
and Collectively Exhaustive (MECE). We show how this can be used to guide how evidence is collected in
the technology qualification process.


1 INTRODUCTION 


1.1 Background 


Technology development can either enable a 
project to be realized or it can enhance the value 
of it. Either way the technology developer has to 
build the operator’s confidence in the technology. 
The operator again needs to build the confidence 
of the other stakeholders in the project before a 
decision to implement the technology can be taken. 
To build this confidence, a systematic risk-based 
qualification process must be performed. 

Technology qualification comprises activities 
to assess, improve and safeguard technology. It 
aims at providing evidence that the technology will 
function within specified limits with an acceptable 
level of confidence (DNVGL-RP-A203, 2017). 

To qualify a system, different perspectives are 
needed to identify possible ways in which the sys- 
tem may fail, and to select appropriate ways of col- 
lecting evidence for the qualification process. There 
exist many different definitions of failure and fail- 
ure types and many ways of classifying failures. This 
makes it challenging to get an overview of the com- 
pleteness and consistency with respect to whether 
most failures have been identified and addressed or 
not. In this paper, we suggest a method that pro- 
vides a structured and consistent way of classifying 
and handling failures, and relate it to the TQ proc- 
ess as described in DNVGL-RP-A203. 


1.2 Objective 


The purpose of this work is to develop a system- 
atic method to identify an as-complete-as-possible 
set of relevant failures, that allows for effective
failure identification and that increases the confidence
that most failures have been identified and
addressed. The method is generic and can be 
used in combination with both new and conven- 
tional failure assessment methods in a system TQ 
setting. 


2 DEFINITION OF FAILURE 


The literature is abundant with definitions of fail- 
ure. In this work we use a broad definition: 

Failure is the loss of a function (fully, in part, or 
erroneous delivery). 

We explicitly avoid mentioning where or how
the loss of the function materializes, to highlight
that the failure is on the abstraction level of the
function. This makes it easier to consider failures
without limiting them to actual resources, or
combinations thereof. The term resources is used here
to refer to an individual part, component, device,
functional unit, equipment, subsystem, software,
people, environment, organizational structures,
etc. or combinations of one or more of the above.
This is closely related to how an item is defined in
IEC 60050-192 (2015).

Figure 1. Illustration of the relationship between a failure
event and a fault state (based on Rausand & Øien
1996).

Figure 2. Failure model illustrating the relationship
between failure at one level in a system hierarchy and
fault at another level.


If failure is considered as the event when a 
function was lost, the failed state can then be 
related to the term fault state used by IEC 60050- 
192 (2015). 

One illustration of the relationship between a 
failure event and a fault state is shown in Figure 1. 

Figure 1 shows the relationship between a failure
and a fault with respect to a specific function.
Note that in Figure 1, a failure leads to a fault, and 
not the other way around. Figure 2 illustrates how 
a fault due to a failure at one level in a system hier- 
archy may lead to a failure at a higher level. Note, 
however, that a failed function on one level does 
not necessarily propagate to the next (e.g. if that 
level has redundancy or the failed function was 
non-critical for the next-level function), and that 
a failed function on one level does not necessarily 
imply a failure on a previous level (e.g. emergent 
failures). 


3 EXISTING FAILURE CLASSIFICATION 
SCHEMES 


As can be seen from the definition of failure given
above, any method for failure identification and
analysis strongly depends on the ability to identify
all the required functions of the system subject to
analysis. In this work we assume that these functions
are known and focus on ways in which failures
can be classified.

Various standards classify failures in different
ways. IEC 61508-4 (2010) applies two main
classification schemes, categorizing failures by their cause
or by their mode. Failures categorized by their cause
are either random (in hardware) or systematic (in
hardware or software), where the definition of
systematic failure is the same as that given by
IEC 60050-191 (1990).

IEC 61508-4 classifies all (random hardware) failures
by their mode as: dangerous undetected (DU),
dangerous detected (DD), safe undetected (SU),
and safe detected (SD).

ISO 14224 (2015) classifies failures by failure
mechanism, failure cause, and failure mode
and provides a detailed list of failure modes down
to component level.

ISO/TR 12489 (2013) provides three different
classification schemes for use in reliability modelling
and calculation of safety systems reliability:
according to randomness, according to occurrence
(state of the failing item: when running,
on stand-by, due to demand), and by the way the
failure is detected (revealed, hidden, or due to
demand). Classification of failure according to
randomness distinguishes random and non-random
(systematic) failures, which are further broken
down into hardware, software and human failure.

Blache and Shrivastava (1994) introduced a 
generic classification by using modified definitions 
from IEC-60050-191, which is shown in Figure 3. 

Rausand and Øien (1996) stated that failure 
modes may be classified in three main groups 
related to the function of the item: total loss of 
function, partial loss of function, and erroneous 
function. 


Figure 3. Failure classification scheme for failure modes
by Blache and Shrivastava (1994). Recoverable node labels:
Failure; Extended Failure; Catastrophic Failure;
Degradation Failure.

Rausand and Øien (1996) also described an
alternative classification scheme by failure cause
as an option, where failure cause is “the circumstance
during design, manufacture or use that have
led to failure”. The failure cause is necessary
information to avoid failures or re-occurrence of
failures. Failure causes may be classified in relation
to the life-cycle of an item or a functional block
as illustrated in Figure 4. They also described
classification of failure by its effect and severity (from
MIL-STD 882): catastrophic, critical, marginal,
negligible.

Håbrekke et al. (2013) classified failures using
two main schemes based on the failure definitions
given in IEC 61508-4 as mentioned above: either
by mode or by cause. When classifying by
cause, Håbrekke et al. (2013) use the definitions
given by the IEC 61508-4 standard (differentiating
between random hardware failures and systematic
failures), but give a more detailed breakdown of
the systematic failures, as shown in Figure 5.

When classifying by mode, they categorize
failure (not limited to random hardware failures
only) as (Håbrekke et al., 2013): DU, DD, SU, SD
and non-critical (NC).


Figure 4. Failure classification after its cause (Rausand &
Øien, 1996). Recoverable node labels: Aging failure;
Mishandling failure.

Figure 5. Failure classification after its cause by the
PDS forum (Håbrekke et al., 2013).


The above shows a wide variety of failure
classification schemes, but no method to structure them
or give guidance on their use seems to exist.

In the current work we propose a simple and 
structured method for classification of failures and 
exemplify how it can be used for more consistent 
failure identification in a technology qualification 
context. 


4 A STRUCTURED METHOD FOR
IDENTIFYING FAILURES THROUGH 
PERSPECTIVES 


A challenge with the classification structures
described in the previous section is that it is difficult
to know whether the schemes identify and address
all possible and relevant failures or whether some
failures have been overlooked. Many of the
classification schemes build elaborate hierarchies where
failure classes are broken down into subclasses. A
danger of following such a hierarchical approach is 
that the branches of the classification tree may dif- 
fer with respect to how detailed they are and which 
criteria are used for the subdivision. 

Rather than following a hierarchical approach, 
which may tend to narrow the perspective of 
the analyst, we propose a method where failures 
are identified by viewing the system from differ- 
ent perspectives, and alternating between these 
views, rather than creating one unified failure class 
hierarchy. 

Within each perspective, a Mutually Exclusive
and Collectively Exhaustive (MECE) set of
categories is established. The MECE categories are
intended to help the assessor structure the failure
identification process by introducing complete
sets of failure categories so that all aspects of a
perspective are covered. This will increase the
confidence that the failure identification method
captures the relevant failures.


4.1 Commingled structures 


Each of the perspectives and its MECE categories 
should trigger identification of different failures. 
That is, one should be able to identify and place 
failures in the MECE categories of each and every 
perspective. 

This is fundamentally different from the
hierarchical approaches described in Section 3. However,
with the suggested approach, a variety of
hierarchical classification schemes can be reproduced by
combining the MECE categories of the different
perspectives, and the approach is thus fully
compatible with previous methods.

This approach enables us to concentrate on N
perspectives with M_i individually relevant MECE
categories rather than complex structures of
categories that may not be MECE. Figure 6 shows how
two perspectives (here geometric form and pattern)
can be assessed individually (upper panel), and if
relevant can be combined into a commingled
structure (lower panel).

Figure 6. Independent classification schemes (upper
panel) used to build a full classification structure (lower
panel).

The commingled classification structure will 
remain MECE, and thus complete. Note that not all
combinations may be relevant; however, this
should not be regarded as a problem. Rather than
assuming that some combination is impossible, the 
possibility of every combination is kept open. By 
considering the views independently, one avoids 
debates such as whether software failures can be 
random or not. This may help the analyst avoid 
limitations given by preconceived ideas. 
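
As an illustration of how a commingled structure can be built, the following minimal sketch forms the Cartesian product of two one-dimensional MECE category sets. The choice of the resource and randomness perspectives (introduced later in the paper), the function names, and the simplification that a failure carries exactly one tag per perspective are assumptions of this example.

```python
from itertools import product

# Two one-dimensional MECE views; category names follow the paper's
# resource and randomness perspectives.
perspectives = {
    "resource": ("hardware", "software", "human"),
    "randomness": ("random", "non-random"),
}


def commingle(perspectives):
    """Return every combination of one category per perspective."""
    names = list(perspectives)
    return [dict(zip(names, combo))
            for combo in product(*perspectives.values())]


def classify(tags, perspectives):
    """Map per-perspective tags onto the single commingled class.

    Raises ValueError if a tag is missing or is not one of the
    categories of its perspective.
    """
    for name, categories in perspectives.items():
        if tags.get(name) not in categories:
            raise ValueError(f"invalid or missing tag for {name!r}")
    return tuple(tags[name] for name in perspectives)


classes = commingle(perspectives)  # 3 x 2 = 6 commingled classes
```

Because every combination is generated, a class such as (software, random) is kept open rather than ruled out in advance, which mirrors the point made above about avoiding preconceived limitations.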


5 FAILURE CLASSIFICATION IN 
ONE-DIMENSIONAL VIEWS 


Technology qualification involves executing a work
process comprising a set of steps. In brief, the
first four steps are:


• Establish a qualification basis identifying the
technology, its functions, its intended use, as
well as the expectations of the technology and
the qualification targets.

• Assess the technology by categorizing the degree
of novelty to focus the effort where the related
uncertainty is most significant and identify the
key challenges and uncertainties.

• Assess threats and identify failures and their
risks.

• Develop a plan containing the qualification
activities necessary to address the identified
risks.


Part of the technology assessment is to break
the top-level system functions into sub-functions.
The loss (partial or complete) of a function on
any level in this functional hierarchy constitutes a
failure of that function. Hence, what constitutes a
failure, or which functions fail, is not the focus
of this section. Rather, the focus here is to identify
how functions may fail, by relating the functions
to the resources and processes used to deliver the
specific functions.

TQ is a process of assessing whether a par- 
ticular solution, composed from a specific set of 
resources and involving a set of processes, is able 
to deliver the desired functions with sufficient
confidence. This involves identifying how a function
may fail and the associated risks. Based on this, the
next step in the TQ process is to develop a plan 
of qualification activities necessary to address the 
identified risks. To accommodate this, we propose 
an approach where the system is analyzed from a 
set of different perspectives. Some relevant per- 
spectives may be: 


— Resource perspective
— Life-cycle perspective 
— Randomness perspective 
— System level perspective 


In addition, two perspectives which may be used 
to guide the assessor on how to handle the identi- 
fied failures may be: 


— Controllability perspective 
— Risk perspective 


Each of these perspectives is described below,
and we show how the individual perspectives may 
be used in TQ. For each perspective, we propose 
a list of MECE categories to ensure completeness 
and avoid confusion. Looping through a set of 
perspectives is intended to instruct the assessor to 
consecutively change her point-of-view, and thus 
help her identify a wider range of failures than if 
conventional methods were used. This is an alter- 
native to using more elaborate hierarchical failure 
classification schemes that may be ambiguous 
or contain an unmanageable number of failure 
classes. 

Note that other perspectives can also be relevant 
in TQ. This paper exemplifies the use of relevant
perspectives and categories, focusing on the
principle of building unambiguous perspectives that
can help the assessor in the failure identification 
process, and give guidance on how to handle the 
identified failures. 
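
The idea of looping through perspectives can be sketched as a simple bookkeeping structure. This is only an illustration under stated assumptions: the category names are those proposed in Sections 5.1 to 5.6 (the life-cycle labels are paraphrased from Figure 8), while the function names and the example failure record are hypothetical.

```python
# One MECE category set per perspective (Sections 5.1 to 5.6).
# The life-cycle labels are paraphrased and are an assumption here.
PERSPECTIVES = {
    "resource": {"hardware", "software", "human"},
    "life-cycle": {"concept and design", "manufacturing", "installation",
                   "operation", "decommissioning"},
    "randomness": {"random", "non-random"},
    "system level": {"systemic", "non-systemic"},
    "controllability": {"controllable", "not controllable"},
    "risk": {"acceptable", "not acceptable"},
}


def unreviewed(record):
    """Perspectives from which a failure has not yet been viewed."""
    return [name for name in PERSPECTIVES if not record.get(name)]


def invalid_tags(record):
    """Tags that are not categories of the perspective they are filed under."""
    problems = {}
    for name, tags in record.items():
        unknown = set(tags) - PERSPECTIVES.get(name, set())
        if unknown:
            problems[name] = unknown
    return problems


# Hypothetical failure record, tagged from two perspectives so far.
# A set is used because a failure may involve several resource categories.
record = {
    "resource": {"hardware", "software"},
    "randomness": {"non-random"},
}
remaining = unreviewed(record)
```

Here `remaining` lists the views the assessor still has to take for this failure, which is one way of making the consecutive change of point-of-view explicit.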


5.1 Resource perspective 


The resource perspective provides a classification 
of failure based on its relation to the particular 
hardware, software or human components of a 
system (including their interactions), see Figure 7. 

The core motivation of the resource view is 
to elicit which of the resources of the system are 
involved in the failure. 

Figure 7. Classification of failure based on its origin.


Note that Figure 7 uses a coarse classification 
of resources into the groups hardware, software 
and human. This is only one possible set of MECE 
categories used for exemplification of the resource 
perspective. Other categories or granularity may be
used as long as they are kept MECE.

Note that although the categories themselves 
are MECE (for consistency), a failure does not 
necessarily belong to only one category. The fail- 
ure of a function may for example involve both 
hardware and software, or the interaction between 
them, even if neither of the individual resources 
has failed.

The resource perspective can help the assessor 
to identify the resources that need to be qualified 
to ensure that functions are adequately provided. 


5.2 Life-cycle perspective 


Classification of failure may be done in relation to
time, i.e. when in the life cycle the failure is
introduced, as shown in Figure 8. We call this the
life-cycle perspective.

This is essentially similar to Rausand and Øien 
(1996) who described an alternative classification 
scheme by failure cause as an option, see Figure 4. 
In our case, we have also added failure occurring in 
concept, installation, and decommissioning of the 
system. The granularity of the categories is project 
dependent, depending on the nature of the system 
development and associated activities. 

The importance of the life-cycle view in a TQ 
setting is that failures, or precursors to failures, 
may occur in any phase of the project. Some fail- 
ures will materialize immediately while some may 
not materialize until later in the system’s life-cycle. 
The TQ process is by nature iterative and flexible, 
and is used to reduce the possibility of unidentified 
failures that may propagate into later stages of the 
system life-cycle. Thus, taking a life-cycle perspec- 
tive can help the assessor identify failures intro- 
duced at certain stages in the life-cycle, but that 
may not materialize until later stages. Identifying
when a failure might be introduced and when it
might materialize can help in finding ways in which
it could be prevented or handled. For example,
failures in specification, design, or manufacturing
may be detected and handled through quality
control, third party verification and testing. Other
failures may develop and occur in operation and
the assessor can then identify required contingency
plans, monitoring, inspection, testing or
maintenance plans, etc.


Figure 8. Classification of failure according to when in
the life cycle the failure is introduced. Recoverable node
labels: Failure; Concept & Design; Manufacturing & [illegible];
Installation; Operation; Decommissioning.

Figure 9. Classification of failure according to its
randomness.


5.3 Randomness perspective


A commonly used failure classification scheme
is to classify into random and systematic failure.
Here we suggest using random and non-random
(see Figure 9) since this clearly states whether a
failure is random or not.

Note that we deliberately do not use the term
random hardware failure, since random hardware
failure is a subset of random failure. Both IEC
61508-4 (2010) and the PDS forum (Håbrekke et
al., 2013) classify failures by their cause as either
random hardware failure or systematic. According
to our understanding, neither of these schemes is
MECE.

In a TQ setting, whether a failure is considered
a random (stochastic) process, or whether it is
systematic, is very relevant with respect to how the
failure is handled. If, for example, the assessed system
function is to withstand a weather-related load,
e.g. wave or wind loads, the maximum load the
system will see throughout its lifetime is stochastic.
In this scenario, the usual approach is to estimate
a maximum load distribution, and to require the
structure to have the capacity to withstand some
percentile of this maximum load distribution (e.g.
a 100-year load). On the other hand, if the assessed
system is a small bridge designed for smaller cars,
the bearing capacity (the system function) will
always fail if a heavy-duty truck tries to cross. This
is a systematic, or non-random, failure, and a TQ
process would identify whether the bridge may be
subjected to such loads; the bridge should then either
be re-designed, or limitations on vehicle weights
could be imposed.

Figure 10. Classification of failure according to
systemic behavior.
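
The percentile requirement in the weather-load example of Section 5.3 can be made concrete with a small sketch. It assumes, purely for illustration, that annual maximum loads are independent and follow a Gumbel distribution; the distribution choice, the parameter values and the function names are not taken from the paper.

```python
import math


def gumbel_cdf(x, location, scale):
    """CDF of the Gumbel (type I extreme value) distribution."""
    return math.exp(-math.exp(-(x - location) / scale))


def gumbel_return_level(location, scale, return_period_years):
    """Load exceeded on average once per return period (annual maxima)."""
    p_non_exceed = 1.0 - 1.0 / return_period_years
    return location - scale * math.log(-math.log(p_non_exceed))


def lifetime_exceedance_probability(return_period_years, design_life_years):
    """Probability that the return-level load is exceeded at least once
    during the design life, assuming independent years."""
    annual = 1.0 / return_period_years
    return 1.0 - (1.0 - annual) ** design_life_years


# Hypothetical Gumbel parameters for annual maximum load (arbitrary units).
load_100yr = gumbel_return_level(location=10.0, scale=2.0,
                                 return_period_years=100)
p_exceed_25yr = lifetime_exceedance_probability(100, 25)
```

Under these assumptions a 100-year load still has a probability of roughly 0.22 of being exceeded at least once during a 25-year design life, which illustrates why the choice of percentile is itself a qualification decision.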


5.4 System-level perspective 


Traditional reliability engineering methods derive 
failures on system level from failure on component 
level. However, a system function may fail even if 
none of its sub-functions or resources have failed. 
Such systemic failures are typically related to how 
different resources in a system interact. They may for
instance arise from interaction between the
system and its environment, unintended or
unfortunate interactions, or conflicting actions among
the system resources. This perspective is shown in 
Figure 10. 

Here we have used the ISO/TR 12489 (2013) 
definition of systemic failure: 

“Systemic failure is failure at system level which
cannot be simply described from the individual
component failures of the system.” The question to
ask is: does the failure depend on failures on a
lower system level, or is it a feature that only
manifests itself on this level? A good example is the
Mercedes A-class failure of the “Elk-test” (Teknikens
Värld 2017). Here, the new Mercedes A-class was
put through an evasive maneuver test, and above a
certain speed, the car tipped over when performing
the evasive maneuver. The steering performed
within its intended range and the speed was within
the car’s limitations. Thus, no function was
operated outside its intended range; however, the result
was a total failure of the system’s function, i.e.
being able to drive and steer.

In a TQ setting, the distinction between systemic
and non-systemic failures becomes important with
respect to how they are identified and handled: A
non-systemic failure can be addressed by ensuring
that no subsystems or components fail. However,
handling systemic failures requires that the entire
system is considered as a whole (including
architecture, life cycle, system control structures, etc.).
Systemic failures cannot be identified by analyzing
each of the system’s resources separately.
Some of the systemic failures may be identified
through existing methods like system FMECA
(Subburaman, 2010), STAMP/STPA (Leveson,
2011, 2013), FRAM (Hollnagel, 2004) or through
system integration testing and system pilot testing.
However, the assessor should, where relevant, also
utilize virtual testing using e.g. simulations to
discover systemic failures that cannot be anticipated
based on the understanding of individual
components, and which would be too expensive or
dangerous to test in real life. For the Mercedes A-class
this could, for example, have meant exploring the
limits of the steering, the speed and the road friction.
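
A virtual test of this kind can be sketched as a parameter sweep. The example below uses a deliberately crude quasi-static rollover criterion (the vehicle tips when the lateral acceleration it can actually reach exceeds g times track width divided by twice the centre-of-gravity height); the model, the vehicle dimensions and the parameter ranges are hypothetical and are not data from the test cited above.

```python
from itertools import product

G = 9.81  # gravitational acceleration, m/s^2


def tips_over(speed_ms, turn_radius_m, friction, track_width_m, cg_height_m):
    """Quasi-static rollover check for a steady turn.

    The tyres can transmit at most friction * G of lateral acceleration;
    beyond that the vehicle slides instead of tipping.
    """
    demanded = speed_ms ** 2 / turn_radius_m
    achievable = min(demanded, friction * G)
    rollover_threshold = G * track_width_m / (2.0 * cg_height_m)
    return achievable > rollover_threshold


def sweep(speeds, radii, frictions, track_width_m, cg_height_m):
    """Return all (speed, radius, friction) combinations that tip over."""
    return [(v, r, mu)
            for v, r, mu in product(speeds, radii, frictions)
            if tips_over(v, r, mu, track_width_m, cg_height_m)]


# Hypothetical vehicle and test envelope; each value is individually "in range".
failures = sweep(speeds=(15.0, 20.0), radii=(30.0, 60.0),
                 frictions=(0.9, 1.1),
                 track_width_m=1.5, cg_height_m=0.75)
```

In this toy envelope only the combination of high speed, tight radius and high friction tips the vehicle, although none of the three values is exceptional on its own. That is the kind of interaction effect a component-by-component analysis would not reveal.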


5.5 Controllability perspective 


According to control theory, the ability to control 
a process depends on the ability to observe the 
process, the ability to influence the process, hav- 
ing a defined objective, and sufficient understand- 
ing of how to influence the system, based on the 
above. In the controllability perspective we catego- 
rize failures as controllable or not. In reality, there 
might be degrees of controllability ranging from 
full control to no control at all. The granularity of 
the categorization can be modified according to 
project specific needs. 

In a TQ setting, this perspective is important
to identify which failures can, or need to be,
controlled, and which cannot. The latter must then
either be accepted based on risk evaluations (see 
risk perspective below), or the system cannot be 
qualified. 

For those failures that can be controlled, the 
control can be exercised at different stages in the 
life cycle of a system. Some can be controlled by 
design, e.g. according to standards, best practices, 
quality assurance, etc. Other failures may be
controlled in fabrication through e.g. testing, quality
assurance, procedural control, etc., and similarly in
other phases such as installation and operation.

Sometimes it is possible to choose in which phase
control is implemented, e.g. through robust design,
or alternatively operational monitoring and
mitigation strategies.


5.6 Risk perspective 


It is common to relate failures to risk. By risk we 
mean the consequences of the failure and its associ- 
ated uncertainty (ISO 31000 2010, Society for Risk 
Analysis 2015). Different risk metrics may be used 
to measure the risk (either quantitatively or
qualitatively). Here we only introduce the categories
acceptable and not acceptable. As with the other
perspectives, it is possible to use a more fine-grained
set of categories if that is expedient for the project.

To define the border between acceptable and not
acceptable risk, the ALARP principle (as low as
reasonably practicable) is applied in many industries.

ALARP implies that risk should be reduced 
until the sacrifices necessary to reduce it further 
become disproportionate to the risk to be avoided 
(Health and Safety Executive UK 2017). Note that 
the risk perspective can be seen in conjunction 
with the controllability perspective as the ability to 
reduce risk is related to controllability. This is in 
essence the aim of the TQ process, i.e. to control, 
eliminate, or reduce the effect of, failures that are 
in the unacceptable risk category. 
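
Read together, the controllability and risk perspectives suggest a simple handling rule for each identified failure. The sketch below is one possible reading of the two sections, using the two-category versions of both perspectives; the enumeration and function names are hypothetical.

```python
from enum import Enum


class Handling(Enum):
    ACCEPT = "accept the risk"
    CONTROL = "plan qualification activities to control or reduce the risk"
    NOT_QUALIFIABLE = "the system cannot be qualified as it stands"


def handling(controllable: bool, risk_acceptable: bool) -> Handling:
    """Combine the controllability and risk views for one failure."""
    if risk_acceptable:
        # Acceptable risk may be accepted whether or not it is controllable.
        return Handling.ACCEPT
    if controllable:
        # Unacceptable but controllable: the target of the TQ activities.
        return Handling.CONTROL
    # Unacceptable and not controllable.
    return Handling.NOT_QUALIFIABLE


decisions = {
    (c, r): handling(c, r)
    for c in (True, False)
    for r in (True, False)
}
```

With more fine-grained categories in either perspective, the same table-like structure would simply gain more rows and columns.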


6 CONCLUDING REMARKS 


This paper has presented a consistent way of
classifying failures to help discover the critical failures,
to ensure that all types of failure are addressed,
and to guide handling of identified failures in a
system TQ setting. For failure identification,
assessment and management, it is useful to explore the
failures from different perspectives. To ensure that
the identification process is consistent and as
complete as possible, we suggest using Mutually
Exclusive and Collectively Exhaustive (MECE) failure
classification schemes. These perspectives, or
failure classes, can then be used to guide how evidence
is collected in the qualification process.

The benefit of such an approach is that the
failure identification process will become more
practical and tractable by employing simple
one-dimensional failure classification schemes and
viewing them one at a time, rather than
investigating hierarchical failure structures.

In addition, the method guides the assessor
towards exploring the entire failure space, and does
not exclude any possibilities from the start.

It should be noted that the method proposed
in the current work is fully compatible with
previous methods. For instance, by combining the
one-dimensional MECE schemes, a more complicated
structure is easily created, see Figure 6.

One of the perspectives suggested here explicitly
focuses on identifying and addressing system-level
failures (also called systemic failures) that are not
easily addressed by conventional methods.

In TQ this is useful since it gives guidance on the
identification of different types of failures and
how to handle them, i.e. what the best method will
be to collect the evidence that the critical failures
will be managed in the best possible way.

ABSTRACT: The effects of natural and man-made disasters on critical infrastructures are substantial, as
evident from recent history. Breakdowns of critical systems such as electrical power grids, water supply
networks, communication networks or transportation can have dire consequences for the availability of aid in
such a crisis. That is why reliability analyses of these networks are of paramount importance. Two important
factors must be taken into consideration during reliability analysis. First, the networks are subject to complex
interdependencies and must not be treated as individual units. Second, the reliability analysis is typically
based on some form of data and/or expert knowledge. However, this information is rarely precise or even
available. Therefore, it is important to account for different kinds of uncertainties, namely aleatory
uncertainty and epistemic uncertainty. Aleatory uncertainty represents the natural randomness in a process, while
epistemic uncertainty represents vagueness or lack of knowledge in the model. In this work we present an
approach to the numerical reliability analysis of complex networks and systems extending a previously
developed method based on Monte Carlo simulation and survival signature. The extended method treats both
kinds of uncertainties, thus yielding better results. We show how Monte Carlo simulation controls aleatory
uncertainty and apply sets of distributions (probability boxes) to treat epistemic uncertainties in component
failures. In this framework, dependencies are modelled using copulas. Copulas possess the unique property
of decoupling the modelling of the univariate margins from the modelling of the dependence structure for
continuous multivariate distributions. Analogous to the p-boxes we use sets of copulas to include imprecision
in the dependencies. Finally, the method is applied to an example system of coupled networks.


1 INTRODUCTION 


Modern infrastructure systems are highly complex 
and subject to a multitude of different depend- 
encies. Disasters in recent years have shown how 
critical the impact of these dependencies can be. 
Failures in one network such as a power outage 
will surely impact other dependent systems. In
worst-case scenarios these dependencies can lead
to cascading effects, ultimately breaking down entire
networks (Buldyrev et al. 2010). This highlights the
need for methods of reliability analysis that can 
deal with these complexities. 

Recently, the survival signature (Coolen and 
Coolen-Maturi 2013) has gained in popularity as 
a tool to aid with this task. The survival signature
decouples the structural evaluation from the
probabilistic analysis, allowing for highly efficient
simulation. In Behrensdorf et al. 2017 we introduced
a method for the reliability analysis of complex inter- 
dependent networks. Other efficient algorithms can 
be found for example in Patelli et al. 2017. 


A secondary task during the reliability analysis is 
the accurate modelling of component failures and 
dependencies. Typically, this is done based on data 
or expert assessments. However, both are subject 
to two kinds of imprecisions, namely, aleatory and 
epistemic uncertainty (Beer et al. 2013). Aleatory 
uncertainty represents the randomness inherent 
in a process, such as component degradation and 
external forces affecting the system (natural haz- 
ards, earthquakes, etc.), while epistemic uncertainty 
describes the uncertainty in the model due to a lack 
of or vagueness of knowledge about the system. 
The latter is usually regarded as reducible through
the acquisition of additional data and information.

In this work we expand our previously devel- 
oped technique by inclusion of imprecision. The 
method is based on Monte Carlo simulation and 
as such already deals with aleatory uncertainty. In 
this extension the modelling of component failures 
is refined by applying probability-boxes (p-boxes) 
to account for epistemic uncertainty. Feng et al. 
2016 have shown the advantages of using p-boxes 
in reliability analysis. Additionally, the rather simple
dependency modelling of the initially developed
method is replaced by imprecise copulas (Montes
et al. 2015). Copulas split continuous multivariate
distributions into a dependence structure and univariate
marginals, which in turn allows for separate,
flexible modelling of the two (Joe 2014).

This paper is organized as follows. First we introduce the previously developed method for the reliability analysis of networks. Then, after presenting basic theory and notation on copulas, we discuss how to model dependencies with copulas and how to translate these methodologies into an imprecise setting. Finally, we apply the developed techniques to a simple example. The paper closes with some concluding remarks and an outlook on future work.


2 RELIABILITY ANALYSIS 


This section presents the survival signature based 
method to calculate the reliability of a system, as 
first presented in Behrensdorf et al. 2017. 

The survival signature was developed as an 
extension to the system signature (Samaniego 
2007), overcoming the limitations that restrict the 
system signature to systems of one single compo- 
nent type (Coolen and Coolen-Maturi 2013). The 
main function of the survival signature is to sepa- 
rate the structural information of a network from 
its probabilistic characteristics. 


2.1 Survival signature 


Considering a system with $m$ components, the survival signature for $l$ out of $m$ components working is defined as

$$\Phi(l) = \binom{m}{l}^{-1} \sum_{x \in S_l} \varphi(x), \tag{1}$$

where $x = (x_1, \ldots, x_m)$ denotes the state vector of the system, with $x_i = 1$ and $x_i = 0$ representing a working or failed component respectively, $S_l$ is the set of all state vectors with exactly $l$ working components, and $\varphi(x)$ is the structure function returning the state of the full system, with $\varphi(x) = 1$ indicating a working system and $\varphi(x) = 0$ indicating a failed system.

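Extending the survival signature to systems with $K$ component types, $m_k$ components of each type $k$ ($k = 1, \ldots, K$) and $l_k$ out of $m_k$ components working results in

$$\Phi(l_1, \ldots, l_K) = \left[ \prod_{k=1}^{K} \binom{m_k}{l_k}^{-1} \right] \sum_{x \in S_{l_1, \ldots, l_K}} \varphi(x), \tag{2}$$

where $\varphi(x)$ is the structure function and $S_{l_1, \ldots, l_K}$ is the set of all state vectors with exactly $l_k$ working components of type $k$ for every $k$.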

An efficient algorithm to compute the survival 
signature can be found in Aslett 2012. 
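
The definition of the single-type survival signature can be illustrated with a minimal brute-force sketch. It enumerates all state vectors directly and is therefore only feasible for very small systems; it is not the efficient algorithm of Aslett 2012. The function names and the two example structure functions are assumptions made for illustration.

```python
from itertools import combinations
from math import comb


def survival_signature(m, structure):
    """Return [Phi(0), ..., Phi(m)] for m components of a single type.

    `structure` maps a tuple of 0/1 component states to 1 (system works)
    or 0 (system failed).
    """
    signature = []
    for l in range(m + 1):
        working_systems = 0
        # Enumerate every state vector with exactly l working components.
        for working in combinations(range(m), l):
            state = tuple(1 if i in working else 0 for i in range(m))
            working_systems += structure(state)
        # Divide by the number of such state vectors, binom(m, l).
        signature.append(working_systems / comb(m, l))
    return signature


def parallel(state):
    # A parallel system works if at least one component works.
    return 1 if any(state) else 0


def series(state):
    # A series system works only if all components work.
    return 1 if all(state) else 0


parallel_signature = survival_signature(2, parallel)
series_signature = survival_signature(2, series)
```

For two components in parallel the signature is 0 for no working component and 1 otherwise, which is the structural input used for the simple parallel system discussed later in the paper.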


2.2 Survival function 


The next step in calculating the reliability of a system is the definition of the survival function. This function uses the survival signature to calculate the probability that a system is working at time $t$ and as such calculates the reliability. The survival function is defined as

$$P(T_s > t) = \sum_{l_1=0}^{m_1} \cdots \sum_{l_K=0}^{m_K} \Phi(l_1, \ldots, l_K) \, P\!\left( \bigcap_{k=1}^{K} \{ C_t^k = l_k \} \right), \tag{3}$$

where $T_s$ is the failure time of the system and $C_t^k$ denotes the number of components of type $k$ working at time $t$.

Note especially the separation of structural information (left factor) and probabilistic information (right factor). This means that the structural evaluation of the system needs to be performed only once for the entire reliability analysis. In the last remaining step, the probabilistic part of the survival function is approximated using Monte Carlo simulation in order to be able to include imprecisions and interdependencies in the analysis.


2.3 Monte Carlo simulation


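The simulation starts by selecting a sufficient number of samples $N_{MC}$ and a small time step, followed by the sampling of component failure times from the assumed copula (see section 3). Next, in two nested loops over all combinations $l_1, \ldots, l_K$ where $\Phi(l_1, \ldots, l_K) > 0$ and all time steps $t$, the number of samples in the same configuration, i.e. with exactly $l_k$ components of each type $k$ working at time $t$, is counted as $N_{l_1, \ldots, l_K}(t)$. Then, the probabilistic part of the survival function is approximated by

$$P\!\left( \bigcap_{k=1}^{K} \{ C_t^k = l_k \} \right) \approx \frac{N_{l_1, \ldots, l_K}(t)}{N_{MC}}. \tag{4}$$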


Finally, the partial reliabilities obtained in the 
previous step are multiplied by their probability 
from the survival signature and summed up, yield- 
ing the full reliability of the network. 
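
The counting procedure described above can be sketched for the simplest case of a single component type. This is only an illustration under stated assumptions: the component failure times are drawn as independent exponential variables instead of from a copula, the rate is interpreted as a failure rate, and the function name and numerical values are hypothetical.

```python
import math
import random


def mc_reliability(signature, rate, times, n_samples, seed=0):
    """Estimate P(T_s > t) for one component type with i.i.d. exponential
    failure times (an assumption; the paper samples from a copula)."""
    rng = random.Random(seed)
    m = len(signature) - 1
    # counts[j][l]: number of samples with exactly l components alive at times[j]
    counts = [[0] * (m + 1) for _ in times]
    for _ in range(n_samples):
        failures = [rng.expovariate(rate) for _ in range(m)]
        for j, t in enumerate(times):
            alive = sum(1 for f in failures if f > t)
            counts[j][alive] += 1
    # Weight each estimated probability N_l(t) / N_MC by the survival signature.
    return [
        sum(signature[l] * counts[j][l] / n_samples for l in range(m + 1))
        for j in range(len(times))
    ]


signature = [0.0, 1.0, 1.0]  # two components in parallel
times = [0.0, 0.5, 1.0]
estimate = mc_reliability(signature, rate=1.5, times=times, n_samples=20000)
# Closed-form reliability of two independent parallel components, for comparison.
exact = [1.0 - (1.0 - math.exp(-1.5 * t)) ** 2 for t in times]
```

Terms with a survival signature of zero contribute nothing to the sum, which is why the loops in the method can be restricted to combinations with a positive signature.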


3 MODELLING DEPENDENCIES 


This section introduces the necessary notation of 
copulas and how to apply them to model depend- 
encies in and between networks. In more detail, 
section 3.4 presents how to use copulas to model 
common causes of failure while section 3.5 shows 
how to model interdependencies. This is a very
basic introduction; for a thorough discussion of
copulas see Nelsen 2006.
3.1 Copulas


The basis of copula theory is what is today known as Sklar’s theorem (Sklar 1959). The theorem states that any multivariate distribution $H$ (in dimensions $d \geq 2$) can be separated into its univariate marginal distributions $F_i$ and a copula function $C: [0,1]^d \to [0,1]$.


Theorem 1 (Sklar’s theorem). Let $H$ be a $d$-dimensional distribution function with margins $F_1, \ldots, F_d$. There exists a $d$-dimensional copula $C$ such that for all $x$ in $\mathbb{R}^d$

$$H(x) = C(F_1(x_1), \ldots, F_d(x_d)). \tag{5}$$

If the marginals $F_1, \ldots, F_d$ are continuous, then $C$ is unique; otherwise, $C$ is unique on $\mathrm{Range}(F_1) \times \cdots \times \mathrm{Range}(F_d)$.

Conversely, if $C$ is a $d$-copula and $F_1, \ldots, F_d$ are distribution functions, then the function $H$ defined by Eq. 5 is a $d$-dimensional distribution function with margins $F_1, \ldots, F_d$.

This facilitates separate modelling of the marginal 
distributions from modelling of the dependence 
structure, in turn allowing for effective treatment of 
imprecisions in both parts (see section 4). 

In this work we apply three distinct copula 
families, namely the Gaussian copula, the Inde- 
pendence copula, and the Clayton copula. For an 
encompassing discussion of copula families the 
reader is referred to Nelsen 2006 and Joe 2014. 

The $d$-dimensional Gaussian copula is defined as

$$C_R(u_1, \ldots, u_d) = \Phi_d\!\left( \Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_d); R \right), \tag{6}$$

where $R \in [-1,1]^{d \times d}$ is a positive definite correlation matrix and $\Phi_d(\cdot\,; R)$ is the $d$-variate cumulative distribution function of a $N_d(0, R)$ random vector. $\Phi^{-1}$ denotes the inverse of the univariate standard Gaussian cdf (Joe 2014).

The latter two copulas belong to the class of Archimedean copulas. This class is particularly popular due to its easy construction and wide range of applications (Nelsen 2006). The Archimedean copulas used in this work are one-parameter families, which allows for easy treatment of imprecision as seen in the subsequent section. Any $d$-dimensional Archimedean copula is constructed using a so-called generator function $\varphi: [0,1] \to [0,\infty]$ and its inverse $\varphi^{-1}$ according to

$$C_\varphi(u_1, \ldots, u_d) = \varphi^{-1}\!\left( \varphi(u_1) + \ldots + \varphi(u_d) \right), \tag{7}$$

where $u_1, \ldots, u_d \in [0,1]$ (Mai and Scherer 2012). Table 1 shows the generators and parameter ranges for the Independence and Clayton copula families.


3.2 Dependence measure 


Studies have shown that correlation is not a suitable measure of dependence for copulas (Schirmacher and Schirmacher 2008). Therefore, in this work, Kendall’s tau is selected as the preferred dependence measure. Kendall’s tau is based on concordance. A pair of random variables is said to be concordant if “large” values of one are associated with “large” values of the other and “small” values with “small” values. Formally, two observations $(x_i, y_i)$ and $(x_j, y_j)$ from a vector $(X, Y)$ are concordant if $(x_i - x_j)(y_i - y_j) > 0$ and discordant if $(x_i - x_j)(y_i - y_j) < 0$.

Then, Kendall’s tau for a sample of $n$ observations $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ from a vector of continuous random variables $(X, Y)$ can be defined as

$$\hat{\tau} = \frac{c - d}{c + d} = \frac{c - d}{\binom{n}{2}}, \tag{8}$$

where $c$ and $d$ represent the number of concordant and discordant pairs among all possible pairs of observations. This value may also be interpreted as the probability of concordance minus the probability of discordance for a random pair of observations $(x_i, y_i)$ and $(x_j, y_j)$. Based on this fact, Kendall’s tau for the random variables $X$ and $Y$ can be defined by

$$\tau(X, Y) = P[(X - \tilde{X})(Y - \tilde{Y}) > 0] - P[(X - \tilde{X})(Y - \tilde{Y}) < 0], \tag{9}$$

where $(\tilde{X}, \tilde{Y})$ is an independent copy of $(X, Y)$ (Schirmacher and Schirmacher 2008).

Kendall’s tau is not only used to measure 
dependence but also to find copula parameters 
representing a desired strength of dependence. 
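
The sample version of Kendall’s tau can be computed directly by counting concordant and discordant pairs. The following minimal sketch assumes observations without ties; the function name and the four example observations are hypothetical.

```python
from itertools import combinations


def kendall_tau(pairs):
    """Sample Kendall's tau, (c - d) / (c + d), for observations without ties."""
    concordant = 0
    discordant = 0
    for (x_i, y_i), (x_j, y_j) in combinations(pairs, 2):
        product = (x_i - x_j) * (y_i - y_j)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical observations: four of the six pairs are concordant, two discordant.
observations = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]
tau_hat = kendall_tau(observations)
```

The double loop visits all pairs and therefore scales quadratically with the sample size, which is acceptable for a check on small samples.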


3.3 Vine copulas 


In order to analyse the reliability of complex networks, one must build high-dimensional copulas. However, in comparison to bivariate copulas, the available literature on multivariate copulas is scarce (Mai and Scherer 2012). For this reason, we employ a technique called pair copula construction to break down multivariate copulas into combinations of bivariate copulas. More accurately, we use vine copulas as a graphical tool to model pair copula constructions as sets of trees. An example of a five-dimensional vine copula is shown in Fig. 1.

Table 1. Generators, generator inverses and parameter ranges for the Clayton and Independence copula.

| Name | Generator $\varphi_\theta(t)$ | Generator inverse $\varphi_\theta^{-1}(t)$ | Parameter $\theta$ |
|---|---|---|---|
| Clayton | $\frac{1}{\theta}\left(t^{-\theta} - 1\right)$ | $(1 + \theta t)^{-1/\theta}$ | $\theta \in [-1, \infty) \setminus \{0\}$ |
| Independence | $-\log(t)$ | $\exp(-t)$ | |

A regular vine (R-Vine) $V$ is defined as a set of trees $T_1, \ldots, T_{d-1}$, where $T_1$ consists of nodes $N_1 = \{1, \ldots, d\}$ and edges $E_1$. Every subsequent tree $T_j$ uses the edges $E_{j-1}$ as nodes and connects them with the edges $E_j$. The last property needed to define a regular vine is the proximity property, stating that if $a$ and $b$ are connected by an edge in $T_j$, $j \geq 2$, then $a$ and $b$ must share a common node in $T_{j-1}$.

A plethora of different structures exist for a $d$-dimensional R-Vine based on this definition. Because of this, and because a regular vine possesses $2^{d-1}$ sampling orders, we apply D-Vines instead, which have a much more restrictive structure. As seen in Fig. 2, a D-Vine is characterized by each node $n \in N_1$ having a maximum degree of 2. Sampling from D-Vines is a lot simpler and is performed in this work using the MATLAB toolbox VineCopulaMatlab (Kurz 2016).


3.4 Common cause of failure 


One type of dependent failure tackled in this work is common cause of failure. It is defined as two or more components failing at the same time due to common defects or weaknesses. Causes include but are not limited to: errors in manufacturing, errors during maintenance or operation, and environmental causes such as earthquakes or tsunamis (Hanks 1998). We model common cause of failure by application of Clayton copulas. This family possesses a property called lower-tail dependence, meaning that dependence is stronger in the lower-left quadrant of $[0,1]^2$. By modelling the failures this way, the dependence is much stronger in early component life and as such brings us closer to the traditional bathtub shape of component failure probabilities. Figure 3 shows samples drawn from a Clayton copula. Note how the stronger dependence in the lower tail is clearly visible in the scatter plot.

Figure 1. Five-dimensional copula represented as a regular vine.

Figure 2. Structure of a five-dimensional D-Vine.

Figure 3. Samples drawn from a Clayton copula with the parameter chosen so that $\tau = 0.5$.
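
The lower-tail dependence of the Clayton copula can be checked numerically with a minimal sketch. It assumes the standard relation $\theta = 2\tau / (1 - \tau)$ between Kendall’s tau and the Clayton parameter and samples a bivariate copula by conditional inversion; the paper itself samples through the VineCopulaMatlab toolbox, so the function names and the sampling route used here are illustrative assumptions.

```python
import random


def clayton_theta_from_tau(tau):
    # Standard relation for the Clayton family, valid for 0 < tau < 1.
    return 2.0 * tau / (1.0 - tau)


def sample_clayton(n, theta, seed=0):
    """Draw n pairs (u1, u2) from a bivariate Clayton copula with theta > 0."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        u1 = 1.0 - rng.random()  # uniform on (0, 1], avoids 0 raised to a negative power
        w = 1.0 - rng.random()
        # Invert the conditional distribution of u2 given u1 at level w.
        u2 = (u1 ** -theta * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        samples.append((u1, u2))
    return samples


def joint_tail_fractions(samples, q=0.1):
    """Fractions of samples falling jointly in the lower and upper q-corners."""
    lower = sum(1 for u1, u2 in samples if u1 < q and u2 < q) / len(samples)
    upper = sum(1 for u1, u2 in samples if u1 > 1 - q and u2 > 1 - q) / len(samples)
    return lower, upper


theta = clayton_theta_from_tau(0.5)  # tau = 0.5 as in Fig. 3
samples = sample_clayton(20000, theta)
lower, upper = joint_tail_fractions(samples)
```

With these settings the fraction of samples in the lower-left corner is clearly larger than the fraction in the upper-right corner, which is the asymmetry visible in the scatter plot.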


3.5 Interdependencies 


Interdependencies between nodes and networks 
are handled by application of Gaussian copulas.
Gaussian copulas possess no tail-dependence and 
show good results, although other families could be 
investigated for the same application in the future. 
In addition to the copula there is one more step 
required to accurately model the dependencies. We 
understand interdependencies as the phenomenon 
of one component failing due to the failure of 
another. As such, interdependencies imply causal- 
ity. In order to represent this causality in the model, 
dependence is introduced in the marginals. During 
the transformation of the failure times sampled 
from the copula by the inverse transformation 
method, the marginal distributions are aggregated
from the dependent marginals using Kendall’s tau
of the random variables $u_1$ and $u_2$ as

$$U_1 = (1 - \tau) \, F_1^{-1}(u_1) + \tau \, F_2^{-1}(u_2), \tag{10}$$

where $F_1$ and $F_2$ are the marginals of a copula $C$.
4 HANDLING IMPRECISION


Two types of uncertainties must be taken care of dur- 
ing the reliability analysis, namely, aleatory and epis- 
temic uncertainties. Aleatory uncertainty describes 
the natural randomness inherent in a process, while 
epistemic uncertainty represents the uncertainty due 
to vagueness in information or a lack thereof. 

Aleatory uncertainty is handled automatically by our reliability analysis technique. By assuming failure time distributions for the component failures and sampling these during Monte Carlo simulation, the randomness that our model is subject to is fully included. However, the selection of appropriate failure time distributions is typically based on either data or expert knowledge, neither of which yields perfect results, in turn introducing epistemic uncertainty into the model. This uncertainty can be accounted for by using probability boxes (p-boxes) (Feng et al. 2016).

P-boxes are defined as bounds on the cumulative distribution function of a random variable. The left and right bounds can be found, for example, by selecting an appropriate distribution and giving the parameters as intervals. As such, a p-box comprises both the aleatory and the epistemic uncertainty. An example of an exponential p-box with parameter $\lambda \in [1.2, 2.2]$ is shown in Fig. 4.
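
A parametric p-box of this kind can be sketched in a few lines. The sketch assumes that $\lambda$ is the rate of the exponential distribution, so that the cumulative distribution function increases with $\lambda$ and the bounds are attained at the interval endpoints; the function names are hypothetical.

```python
import math


def exponential_cdf(t, rate):
    # CDF of an exponential distribution with the given rate.
    return 1.0 - math.exp(-rate * t) if t > 0 else 0.0


def exponential_pbox(t, rate_low, rate_high):
    """Return (lower, upper) bounds on the CDF at time t for a rate in
    [rate_low, rate_high]."""
    # The CDF is monotonically increasing in the rate, so the interval
    # endpoints give the bounds.
    return exponential_cdf(t, rate_low), exponential_cdf(t, rate_high)


# Bounds at a few time points for the interval [1.2, 2.2].
bounds = [(t, exponential_pbox(t, 1.2, 2.2)) for t in (0.0, 0.5, 1.0, 2.0)]
```

For distributions with several interval parameters the bounds are in general not attained at a single corner of the parameter box, so the endpoint shortcut used here is specific to the one-parameter exponential case.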

By feeding the bounds of the p-box into the reli- 
ability analysis, the epistemic uncertainty propa- 
gates into the result. Thus, instead of one survival 
function, we obtain an upper and lower bound. 
Figure 5 shows an example of the upper and lower
bounds obtained by performing a reliability analy- 
sis of a simple system of two parallel components 
of the same type, assuming the p-box of Fig. 4 for 
the failure time distributions. 
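
For this simple system the propagation of the p-box can be reproduced in closed form. The sketch below assumes two independent components with exponential failure times and a rate in $[1.2, 2.2]$; it replaces the Monte Carlo procedure of the paper by the analytical reliability of a parallel system, and the function names are hypothetical.

```python
import math


def parallel_reliability(t, rate, n_components=2):
    """Reliability of n independent, identical components in parallel."""
    failure_probability = 1.0 - math.exp(-rate * t)
    return 1.0 - failure_probability ** n_components


def reliability_bounds(t, rate_low, rate_high):
    """Return (lower, upper) reliability bounds for a rate in [rate_low, rate_high]."""
    # Reliability decreases as the failure rate grows, so the largest rate
    # gives the lower bound and the smallest rate the upper bound.
    return parallel_reliability(t, rate_high), parallel_reliability(t, rate_low)


lower_bound, upper_bound = reliability_bounds(1.0, 1.2, 2.2)
```

At $t = 1$ the bounds evaluate to roughly 0.21 and 0.51, which shows how strongly the interval on the rate widens the reliability estimate.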

Similarly to the application of p-boxes to handle epistemic uncertainty in the marginals, we can define the copula parameters as intervals and obtain imprecise copulas for the dependencies (Montes et al. 2015). This works especially well since all copula families applied in the vine copula, including the bivariate Gaussian copula, are defined by a single parameter.


Figure 4. Example of an exponential p-box with $\lambda \in [1.2, 2.2]$.

Figure 5. Upper and lower bounds of the reliability resulting from applying the p-box in Fig. 4 to a simple system of two parallel components.

Figure 6. Structure of the example network. The red lines represent interdependencies between the two subsystems.

Figure 7. D-Vine used to model the common cause of failure in system 2 and the interdependencies between the two networks.

Finally, the bounds of the p-boxes as well as the 
bounds of the copula parameters are fed into the 
reliability analysis. Returning to the simple sys- 
tem of two parallel components and linking the 
components with an imprecise Gaussian copula 
$R \in [0.3, 0.6]$ results in the upper and lower bounds
for the reliability as seen in Fig. 5. 
Figure 8. Bounds on the reliability of system 2 based on the assumed imprecisions.


5 NUMERICAL EXAMPLE 


The methods introduced in the previous sections are now applied to a simple toy example. All marginals and dependencies are considered imprecise in order to account for all aleatory and epistemic uncertainties. Figure 6 presents the example system, built from two systems of two parallel components each, where the first and second components respectively are interconnected.

A four-dimensional D-Vine copula, including the
interdependencies and a common cause of failure 
shared among the components in system 2, is built 
in order to sample the component failure times. 
The structure of the vine is shown in Fig. 7. 

The component failure times for systems 1 and 2 are assumed to be exponentially distributed with $\lambda_1 \in [1.5, 1.7]$ and $\lambda_2 \in [0.7, 1.1]$. The copula parameters are chosen such that $\tau \in [0.2, 0.4]$ for the Clayton copula and $\tau \in [0.4, 0.6]$ for the Gaussian copulas. The resulting bounds on the reliability of system 2 are plotted in Fig. 8.
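
The intervals on Kendall’s tau can be translated into intervals on the copula parameters. The sketch below uses the standard relations $\theta = 2\tau / (1 - \tau)$ for the Clayton copula and $\rho = \sin(\pi \tau / 2)$ for the bivariate Gaussian copula; whether exactly these conversions were used for the example is an assumption, and the function names are hypothetical.

```python
import math


def clayton_theta(tau):
    # Clayton parameter for a given Kendall's tau, valid for 0 < tau < 1.
    return 2.0 * tau / (1.0 - tau)


def gaussian_rho(tau):
    # Correlation parameter of a bivariate Gaussian copula for a given tau.
    return math.sin(math.pi * tau / 2.0)


# Both relations are increasing in tau, so interval endpoints map to endpoints.
clayton_interval = (clayton_theta(0.2), clayton_theta(0.4))
gaussian_interval = (gaussian_rho(0.4), gaussian_rho(0.6))
```

Under these relations the Clayton parameter ranges from 0.5 to about 1.33 and the Gaussian correlation from about 0.59 to about 0.81.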


6 CONCLUSION AND OUTLOOK 


In this work we have presented how to perform reliability analyses of complex interdependent networks in a highly imprecise setting. The necessary theory on copulas and vine copulas was introduced and applied to the modelling of dependencies in and between networks. Finally, the modelling of dependencies and a previously introduced method for the reliability analysis of networks were extended to account for both aleatory and epistemic uncertainties. As a result, we obtained bounds on the network reliability. The method was applied to a numerical toy example to demonstrate its functionality.


This paper serves only as a short introduction to future work. The methods must be validated further and applied to complex real-world networks in order to ensure usability. In particular, the construction of the D-Vine copula for sampling the component failure times must be investigated further. There exists a plethora of possible vine structures for a given problem, and effective automatic construction techniques have to be created.
ABSTRACT: Several unmanned aerial vehicles (UAVs) flying in a formation fleet are used to improve the effectiveness of civilian missions such as firefighting, search and rescue, as well as the success of military applications. The increased use of these cooperative systems in hazardous environments makes reliability improvement essential in order to prevent any catastrophic event. In this article, we aim to ensure successful communication between the drones on the one side, and between the drones and the ground station on the other side. We propose to identify the different fault states and their probabilities during a communication. An Absorbing Markov Analysis approach is developed for these states. This framework can be used to find the riskiest scenarios and elements that need to be addressed in order to improve the reliability.


1 INTRODUCTION 


Interest in unmanned aerial vehicles (UAVs), commonly known as drones, has increased due to their use in several civilian applications such as search, border surveillance, natural disaster monitoring and firefighting (Rabbath & Léchevin, 2010). They are known for 3D missions that are ‘dirty, dull or dangerous’ (Hattenberger, 2008). Lately, the focus has shifted toward cooperative UAV fleet formations for missions in large hazardous environments, since a single UAV has limited energy and payload. A multi-UAV system needs to ensure several properties such as robustness, cooperativeness and scalability. These properties can be attained by assuring the navigation of each UAV, the control of the whole fleet, as well as constant and reliable communication between the drones on one side and between the drones and the ground station control (GSC) on the other side. The different sizes and payloads of the UAVs, their flight times, the distance between two UAVs and the communication ranges all affect the overall performance of the fleet formation flight. The essential role of these cooperative systems reflects the importance of enhancing the reliability in order to avoid the failure of the communication between the aerial vehicles. Coordination between them should always be guaranteed despite the uncertainties of the environment and the network, and despite simple failures in the hardware of a vehicle. Hence, the detection of an anomalous aircraft prevents collisions between the aerial vehicles and the degradation of the team performance. The information flow between UAVs can be collected by an entity on the ground, the GSC, that controls the mission and makes decisions for the aircraft; alternatively, the UAVs share the information among themselves and make collective decisions.

Considering the importance of the reliability of the communication system, this paper presents a novel approach, based on a Markov model, to evaluate it. The proposed framework takes into consideration the internal elements of the system, both hardware and software, as well as the surrounding environment.
The paper is organized as follows. Section 2 presents the related work on the reliability of aerial vehicles, focusing on Markov chains. Section 3 provides a description of the proposed state diagram model for the communication failure between UAVs. Section 4 gives a brief description of the proposed framework. Section 5 concludes the paper.


2 RELATED WORK 


Since the accident and failure rates of UAVs are higher than those of manned aircraft, the reliability analysis of these systems is an important focus for researchers. Fault-tolerant systems and redundant hardware do not always represent an efficient solution for a formation fleet flight due to the incurred costs and weight. Different methods such as Fault Tree Analysis (FTA) (Abdallah, Kouta, Sarraf, Gaber, & Wack, December 2017) and Failure Modes and Effects Analysis (FMEA) have been used to improve the reliability of helicopters. In some cases, several fault trees are needed to represent the different failure conditions of a complex system.

The evaluation of the reliability of a system considers state-space models, such as Markov chains (Frattini, Bovenzi, Alonso, & Trivedi, 2010), that handle the failure/repair of its components and surrounding elements that might impact the reliability model. The Markov chain defines the degraded states of operation, where the functions are not all performed or where they are stopped entirely. In (Kitchin, 1988), distinct techniques for establishing Markov models for the reliability of systems are provided, with emphasis on the exponential model. The approach is devised to detect a failure and the method to recover from it.

The reliability of the flight computer system (FCS) components, including the flight computer and the navigation system, is discussed in (Pashchuk, Salnyk, & Volochiy). The work examines a fault-tolerant model considering two cases: the case where no additional standby microprocessors are implemented and the case of an inherent standby microprocessor. A mathematical model based on a Markov chain is applied to improve the reliability of the FCS components. An explanation of Markov chains and Markov processes is given in (Fuqua, 2003). It clarifies the strong relation between Markov chains and reliability, maintainability and safety (RMS) engineering, emphasizing the International Standards that deal with this approach, such as IEC 61165 and IEC 61508, to estimate the probability of failure of a critical system.

The issue of packet dropout in wireless drone communications is investigated in (Zhou, Li, Lamont, & Rabbath, 2012). The authors proposed a two-state Markov model for the wireless channels that takes into consideration the impact of Ricean fading. Their computer simulations outperform those of the best-known model for wireless channels, the Gilbert-Elliott model (Gilbert, 1960), (Elliott, 1963), since their approach simulates the non-stationary errors.

A distributed computing system (DCS) consists of multiple processors that are interconnected via a network. In a DCS, the information is spread out among the nodes, which comprise the data files, the processing elements, the shared resources and the programs. In order to ensure the exchange of information and the control of the data, the reliability of such a system needs to be studied. This study focuses on the analysis of the distributed program reliability (DPR) and the distributed system reliability (DSR). (Wang, 2004) suggested two stochastic reliability measures for these distributed systems: Markov-chain distributed program reliability (MDPR) and Markov-chain distributed system reliability (MDSR). The article describes the use of one absorbing state for this problem and the probability of transition between the states.

An Adaptive Markov Model Analysis (AMMA) is proposed in order to isolate the faults in critical components. This approach serves to improve the robustness and the availability of the UAV autopilot by incorporating the Fault Detection Isolation (FDI) approach (Krishnaprasad, Nanda, & Jayanthi, 2016). (Kumar & Jackson, 2009) discuss reliability models based on the stochastic approach of Markov analysis, merged with the probabilistic approach of the Weibull distribution in order to approximate the failure attributes of wear-out components. This method is used since components with wear-out failures are characterized by variable failure rates that depend on the operation time of the components. To illustrate this, a state transition diagram for a six-component optical telescope calibration system (OTCS) is shown. Partially observable Markov decision processes (POMDPs) are used in (Ragi & Chong, 2013) in order to determine a path plan for UAVs tracking different targets. The failure analysis of the flight control system of the Air Force Institute of Technology (AFIT) UAV based on Markov analysis is elaborated in (Okafor & Eze, 2016). It shows the failure states and the probability of being in these states.


3 PROPOSED APPROACH 


An Absorbing Markov Chain (AMC), i.e. a chain with at least one absorbing state, is considered. An absorbing state is characterized by the fact that once it is entered, it cannot be left. Each state in the transition diagram can be taken as an absorbing state. The transition between states can take multiple steps before the absorbing state is attained. Two important variables should be calculated: the mean time $t_{mean}$ and the length of the path until absorption. We aim to evaluate the probability of being in each transient state leading to the absorbing state. Transitions between states are based on probabilities that are functions of the failure rates of internal components as well as of the occurrences of related events within the surrounding environment.
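
The mean number of steps before absorption follows from the transient part $Q$ of the transition matrix by solving $(I - Q)\,t = 1$, a standard result for absorbing Markov chains. The sketch below illustrates this with a hypothetical three-state chain (normal state, communication error, absorbing communication failure); the transition probabilities are invented for illustration and are not taken from the paper.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def expected_steps_to_absorption(q):
    """Expected number of steps until absorption from each transient state.

    `q` holds the transition probabilities among the transient states only.
    """
    n = len(q)
    i_minus_q = [[(1.0 if i == j else 0.0) - q[i][j] for j in range(n)] for i in range(n)]
    return solve(i_minus_q, [1.0] * n)


# Hypothetical transient states: 0 = normal, 1 = communication error.
# The remaining probability mass of each row leads to the absorbing failure state.
q = [
    [0.90, 0.08],
    [0.00, 0.70],
]
steps = expected_steps_to_absorption(q)
```

With these invented probabilities the chain needs on average about 12.7 steps to reach the absorbing state from the normal state and about 3.3 steps from the communication error state.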

The main focus is to maintain communication between the drones despite all the uncertainties that can occur. We propose an Absorbing Markov Chain to model the problem and show the transitions between the events that affect the communication.

First, the exchange of information and the communication are considered to be in a normal state. However, several causes can affect this state. The causes can be divided into internal causes, at the level of software and hardware failures, and external causes, which are related to humans and the environment.


3.1 Hardware failure 


The hardware failure can attack the engine, the 
power, the propellers and the antenna of trans- 
mission and reception. The issue is that during a 
flight, a hardware failure cannot be repairable and 
lead to an absorbing state of communication fail- 
ure (Figure 1). 

Figure 1 shows the causes of the hardware 
failure of a drone in addition to the transition 


x 4 
+ 
Virus GPS daa 
Malware OS Fait r inaccuracy 
v 
Y iia 
SW Fault — Commun cat 
-ion error 
- 4 
j >| _ Snipping 
from ennemy 


Figure 1. Hardware failure. 


between the transient states. The antenna failure 
can directly lead to a communication failure. The 
drones cannot send and receive anymore the infor- 
mation between them or to/from their ground sta- 
tion control. Moving to the power failure, it can 
be caused from the ventilation default and the dis- 
ruption of the cables that induce an overheating 
of the drone and consequently a power failure. It 
is also attributed to an overcurrent/undercurrent, 
physical damage, overheating or exhaustion of the 
battery. The loss of the UAV transceiver affects the 
servomotor which its failure involves the actuator 
default and in hence the engine failure. So on, the 


Table 1. Failure rates of the hardware events. 
Minimum Maximum Mean 
Failure Failure Failure 
Rate (Amin) Tate (Ama) Rate (Anean) 
Events (x10) (x10) (x105) 
Loss of UAV trans- 7,285E-01 3,740E+00 9,635E-01 
ceiver (Aas) 
Servomotor default 3,575E-01 3,952E+00 2,961E+00 
Cassa) 
Actuator default 1,218E-01 8,569E-01 3,444E-01 
(Aasaa) 
Disruption of 4,000E-05 4.087E+00 4.800E-05 
cables (A,,,. 4.) 
Ventilation Default 3,483E-01 4.087E+00 2,891E+00 
(Ansa) 
Overheating (Ason) 12,617E+00 778,2E+00 44,421E+00 
Power Failure 2,352E+00 9,104E+00 4,813E+00 
asp) 
Default assembly 8,992E+01 6,296E+02 2,07E+02 


of propellers 


(Aasaa) 
Antenna failure 3.662E+00 11.886E+00 20.467E+00 
(asar) 
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zos 
Collision 
| between 2 or 
E more drones 
Incorrect 2 * 
postions | 
. 
pJ Collision s 
with 
obsades 


2597 


pe ` 


Loss of UAV Servenetor Actuator : 
wmsdver | Faihee Failure 
S Crash of 
\ drome 
a 
* + 3 
Í | Power 
Disruption of qe A Fahre 
of emer 
— 
of ‘Ventilation 
a Default 
> sembly of 
\ the propelless 
Figure 2. Software failure and collision events. 
> Decoding 
= > Synchronize 
‘Ren emor 
en ` = 
a 4 i 
jaše . 
Bad weather gtensios 5 Noise 
b4 
> 
“ inefeme 
- 
y 
Pe al Human arer 
+ 
» _ Haman 
inexperience 
Lage 
stance fom 
‘tamsnitter 


Figure 3. Exterior factors events. 


engine failure, the power failure and the default
assembly of the propellers can lead to damage
of a drone. The crash of the leader of the fleet
formation flight is the riskiest case, since the
leader controls the exchange of information within
the fleet. Since a crashed drone cannot be
repaired, the crash leads to an absorbing
communication failure between the drones. The
failure rates of these events are taken from the
Nonelectronic Parts Reliability Data Publication
(NPRD-2016) database.

Table 1 lists the failure rates of these hardware
events, taking the normal state ns as the previous
state. Each failure rate is denoted λ, indexed by
the state the transition leaves and the state it
enters. The database gives only the failure rates
out of the normal state; the other transition
failure rates are not known.


3.2 Software failure and collision events 


The normal situation can be affected by a software
failure that disrupts the communication between
the drones (Figure 2). Infection by a virus or
malware is an important cause of UAV malfunction.
It disturbs the operating system (OS) of the drone,
and together the two causes engender a software
fault. The virus/malware and the OS fault can also
affect the ground station control (GSC). Software
reliability is supported by regular updates of the
operating system and by a good antivirus. Software
faults affect the GPS data, leading to a
communication error between the drones or between
one drone and the base station: the communication
is not lost, but the information exchanged is
erroneous. Inaccurate GPS data cause confusion
about the positions of the other drones, which
disturbs the coordination of the fleet formation
flight.

Wrong position data received from another
drone might lead to a collision between two or more
UAVs, or to a collision of a drone with an
obstacle such as a building, a tree or a bird. On
the one hand, a collision between two drones
leads to an absorbing state that cannot be
repaired: the loss of communication between
two or more drones, lcma. On the other hand, a
collision with an obstacle, or the sniping
of a drone by an enemy, causes the loss of
communication with only one drone, lca, since that
drone is no longer present in the fleet. This event
is also an absorbing state.
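
To illustrate what "absorbing" means here, the following is a minimal sketch, assuming a continuous-time Markov model with constant transition rates and no repair, reduced to a three-state chain ns -> wpd -> lca. The function name, the parameter names `rate_ns_wpd` and `rate_wpd_lca`, and their values are hypothetical; the paper gives no rates for these transitions.

```python
import math


def state_probabilities(t, rate_ns_wpd, rate_wpd_lca):
    """Return (P_ns, P_wpd, P_lca) at time t for the chain ns -> wpd -> lca.

    Assumptions (not from the paper): constant transition rates, no repair,
    the system starts in ns, and rate_ns_wpd != rate_wpd_lca.
    """
    a, b = rate_ns_wpd, rate_wpd_lca
    p_ns = math.exp(-a * t)                      # still in the normal state
    p_wpd = a / (b - a) * (math.exp(-a * t) - math.exp(-b * t))  # transient state
    p_lca = 1.0 - p_ns - p_wpd                   # absorbed: communication lost
    return p_ns, p_wpd, p_lca


# Hypothetical rates per hour, chosen only to show the trend.
for t in (0.0, 10.0, 100.0, 1000.0):
    p_ns, p_wpd, p_lca = state_probabilities(t, rate_ns_wpd=0.01, rate_wpd_lca=0.05)
    print(f"t={t:7.1f} h  ns={p_ns:.4f}  wpd={p_wpd:.4f}  lca={p_lca:.4f}")
```

Because nothing leaves lca, its probability can only grow with time and tends to 1, whereas the probability of the transient state wpd rises and then decays.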


3.3 Exterior factors 


Various exterior factors influence the communication
of the fleet formation flight. They include animals,
the weather, obstacles and humans. Bad weather is an
important condition to guard against. It includes
temperature, wind, clouds, rain, ice, thunderstorms
and fog. The transmitted signals might be subject to
atmospheric attenuation due to bad weather or other
environmental conditions. Attenuation involves
interference and noise, which contribute to a poor
signal transmitted between the aerial vehicles. If
the communication medium is exposed to jamming,
echoes and noise, these might interfere with what is
transmitted, affecting the overall communication.
Synchronization and decoding of the message are
essential mechanisms in the transmission of the
message and can be affected by interference.

The human plays an important role in the
communication, especially at the GSC. Exhaustion of
the GSC operator, or a lack of experience and
qualification in flying a certain type of drone,
contributes to human error. The distance between the
source and the destination can also influence the
quality of the received information: as the distance
increases, the transmitted signal is more exposed to
attenuation and atmospheric noise.

In order to avoid these events, the operator should
choose a suitable environment for the fleet, taking
into consideration the weather, the season and the
time of flight. Although all these events lead to the
loss of communication with the drone, this state is
not absorbing. It is repairable: the environment can
be changed, the right persons can be chosen to fly
the fleet, and interference affects the signal only
for a certain time, after which the fleet continues
its mission. Figure 3 summarizes the external factors.


4 BRIEF DESCRIPTION OF THE STATES


Unlike a hardware failure, the loss of communication
with the drone, lcr, is a repairable state. It can be
caused by software faults or environmental effects. A
software fault in the communication can be repaired
through alternative channels or by the ground station.
Environmental effects can be controlled by flying the
drones at close distances or by planning the mission
at another time with better weather conditions. Once
lcr occurs, it might contribute to collisions between
the drones or with an obstacle.


Table 2. Brief description of the states.


Code   State
       Description

ns     Normal state
       This is the normal situation where the communication system, between
       drones and with GSC, is functioning normally.

vm     Virus or malware
       The system has been infected by a virus or malware, leading to
       malfunctioning and abnormal behavior.

OSf    Operating system fault
       The operating system of a drone or GSC is not functioning properly
       due to infection by a virus or malware or due to some internal fault
       or error.

swf    Software fault
       A fault in the software that handles GPS data, or in the communication
       system within the drone or within the GSC.

gpsf   GPS data inaccuracy
       GPS data of one or more drones are inaccurate.

wpd    Incorrect positioning data
       Wrong positions of one or more drones have been communicated to other
       drones and/or GSC.

cf     Communication error
       No communication between 2 or more drones and/or between 1 or more
       drones and GSC.

cd     Collision between drones
       A collision between 2 or more drones has occurred.

co     Collision with obstacle(s)
       A drone has collided with obstacle(s).

se     Sniping from enemy
       One or more drones have been shut down by enemies or third parties.

bw     Bad weather
       Bad weather that might have an impact on the communications between
       drones and/or between the drones and GSC.

aa     Atmospheric attenuation
       Transmitted signals might be subject to attenuation due to bad weather
       or other environmental conditions.

no     Noise
       Transmitted signals might be subject to noise.

in     Interference
       Transmitted signals might be subject to interference.

de     Decoding/synchronization errors
       Transmitted data might be subject to decoding and/or synchronization
       errors.

ld     Large distance
       One or more drones have flown away from the transmitters of other
       drones and GSC.

hf     Human fatigue
       The GSC operator is exhausted and tired.

he     Human error
       GSC operator(s) committed error(s), due to fatigue or lack of
       experience and qualification, that might affect the communication
       system.

af     Antenna failure
       Hardware fault affecting the transmitter and/or receiver antennas of
       one or more drones or GSC.

lt     Loss of UAV transceiver
       Loss of an electronic device that transmits and receives the signal.

sd     Servomotor default
       Hardware default of the motor that permits the control of the
       position, the acceleration and the velocity.

ad     Actuator default
       Hardware default of an electronic speed controller that is linked to
       the engine, servomotors and propellers (UAV actuators).

dc     Disruption of cables
       Internal incident that cuts the cables.

vd     Ventilation default
       The cooling system is in failure.

oh     Overheating
       The temperature of the drone is high due to a disruption of cables or
       due to a default in the cooling system.

pf     Power failure
       Due to short-circuit, overcurrent/undercurrent, battery damage,
       overheating.

da     Default assembly of propellers
       Loss of more than two propellers.

l1d    Loss of one drone
       Loss of one drone due to collision with obstacles, sniping by enemies,
       and/or internal operation failure.

lmd    Loss of 2 or more drones
       Loss of 2 or more drones due to collision between them.

lcr    Loss of communication with a drone
       Loss of communication with a drone due to environmental conditions or
       software faults. This state is repairable and the system could go
       back to its normal state (ns).

lca    Loss of communication with a drone due to the loss of the drone
       Loss of communication with a drone due to the loss of the drone itself
       caused by collision, sniping by enemies, environmental conditions
       and/or some internal faults. This state is not repairable and
       therefore it is an absorbing state.

lcma   Loss of communication with multiple drones
       Loss of communication with multiple drones due to collision between
       them. This state is not repairable and therefore it is an absorbing
       state.


5 CONCLUSION 


Different state diagrams are presented in this paper,
showing the causes of loss of communication
between the drones or between the drones and the
ground station control. They include hardware
failures, software failures and the external factors
that affect transmitted signals. A software failure
is a repairable state, as are the external factors,
which can be prevented by choosing a suitable
environment, season and time. On the contrary, as
they are not recoverable, hardware failures lead to
an absorbing state for the loss of communication. In
future work we aim to consider the failure rates of
the external and software events, the transition
failure rates and the repair rates, taking into
consideration the different strategies of fleet
formation flight.


ABSTRACT: Vehicular Ad hoc Networks (VANETs), a subset of Mobile Ad hoc
Networks (MANETs), are a wireless communication technology applied to
transportation that connects a set of smart vehicles on the road. These
vehicles provide communication services among one another (V2V) or with
Road Side Infrastructure (V2I). The main benefits of VANET are enhanced
road safety, reduced energy use and emissions, and information services.
Reliability is one of the most critical issues related to VANET since the
information transmitted is distributed in an open access environment. In
this paper we focus on the reliability of VANET as a function of the
reliability of the hardware and its functionality, taking into consideration
the needed security equipment. Reliability Block Diagrams (RBD) and Fault
Trees (FT) were used to analyze the reliability of the vehicles and of the
Road Side Units (RSU). A simulation using the RBD and FT probabilities was
conducted to validate the proposed approach. The failure rate data were
taken from a professional database covering the types of components present
in the system. Our approach combines qualitative methods (such as functional
analysis and Failure Modes and Effects Analysis (FMEA)) and quantitative
methods (fault trees, probabilistic models of degradation, etc.) for the
VANET. From these data an exponential model of reliability was proposed. The
probability calculation was performed in relation to a reference time of
use. Thereafter a sensitivity analysis of the reliability parameters was
proposed and redesign proposals are developed for the components.


1 INTRODUCTION


Each year 1.25 million people die worldwide as a
result of road traffic accidents, according to the
WHO’s Global status report on road safety 2015
[1]. Connected vehicles such as vehicular ad hoc
networks (VANET) and autonomous vehicles
are proposed as solutions to improve road safety [2].

On-Board Unit (OBU) computers give a smart
vehicle the ability to communicate with other
vehicles (V2V) and with intelligent Road Side Unit
(RSU) infrastructure (V2I). In this context, wireless
vehicular network technologies will allow a significant
reduction of vehicular accidents [2]. Since 90% of
accidents are caused by humans, connected vehicle
systems will help the driver to avoid these accidents
and to divide the accident rate by ten [3].

These smart vehicles must comply with the safety
integrity obligations. Indeed, safety is a major
concern for the user of automotive vehicles. Reliability
is one of the most critical issues related to the
connected vehicle, since the information transmitted is
distributed in an open access environment and also
because of the high mobility, dynamic topology, delay
constraints, varying environments and different
traffic patterns in VANET [4].

With the evolution of the Internet of Things,
ensuring the dependability and reliability of
vehicular networks is becoming critical, and any
shortfall in this requirement can lead to accidents
and catastrophic results [5].

This paper presents a design of reliability block
diagrams for: (1) the OBU based on the Dedicated
Short Range Communication (DSRC) standard,
(2) the RSU and (3) devices of particular interest
for security purposes, such as the Trusted
Platform Module (TPM). A TPM is often mounted on
vehicles to offer reliable storage (e.g. of user
credentials and keys) and to perform cryptographic
computations. TPM hardware is assumed to be tamper
resistant so that attackers cannot gain access, even
with physical presence [6].

In VANET, network reliability must be given special
attention since the system aims at safeguarding road
safety and achieving secure communication. Currently,
most VANET work addresses the evaluation of
communication performance and routing protocols,
without taking into consideration the dependability
and operational safety of the system. Network
reliability relies heavily on the availability of
hardware components and on their life cycle. This
problem is relatively new for applications in mobile
environments using ad hoc networks. It has therefore
become indispensable to develop methods and models
that assess the reliability of smart vehicles by
combining qualitative approaches (such as functional
analysis) with quantitative methods such as
reliability block diagrams (RBD) and fault trees (FT).
Evaluating these problems leads to a sensitivity
analysis of the reliability parameters, in order to
propose a redesign of the components that increases
the reliability and availability of the whole system.

The paper is organized as follows: Section 2
presents the related work; Section 3 describes the
analytical approach and the hardware architecture
of VANET (OBU, RSU, TPM); Section 4 calculates the
reliability of the system using reliability block
diagrams and the fault tree; Section 5 describes
the simulation performed and the results. Finally,
Section 6 concludes the paper with lessons learned
from this work and points out future research
directions on VANET.


2 RELATED WORK 


Generally, reliability is defined as the ability of a 
functional unit to perform a required function 
under given conditions for a given time interval [7]. 

Technically speaking, reliability is the probability
that the VANET system will work without
failure during its running time under wireless
environment operating conditions.

It is important to differentiate between avail- 
ability and reliability concepts; reliability refers 
to failure-free operation during an interval, while 
availability refers to an operation free of failure at 
a given instant of time [8]. 

Many research works have been proposed in the
literature to define reliability for VANET, focusing
on protocols such as DSRC and Wireless Access in
Vehicular Environments (WAVE) [9,10]. Moreover,
most of the proposed approaches emphasize
reliability in terms of successful delivery and
minimizing the number of packet drops. For example,
in [9] communication reliability metrics such as
packet delivery ratio and distribution of consecutive
packet drops are defined, and DSRC based scenarios
are proposed. In [10], the authors investigate
additional metrics, such as reliability, packet
reception ratio, and effective range of coverage.

Le Lann in [11] presents the safety and reliability 
for Intelligent Vehicular Networks (IVN) for pla- 
toons and cohorts. To avoid failure, he introduces 
diversified functional redundancy, giving a replace- 
ment in case of telemetry failures as an example. 

Much research focuses on finding reliable
routing in connected vehicles, and on classifying
routing protocols [12,13,14].

In [12], the authors present a new algorithm with
two notions, virtual equivalent node (VEN) and
differentiated reliable path (DRP), to solve the
problem of link failures; the RSU plays the role of
VEN if the route fails. In [13], the authors introduce
a routing protocol, ROVER (Robust Vehicular Routing),
based on the positions of local-aware vehicles to
define the zones. The protocol broadcasts control
packets and works like the Ad hoc On-Demand Distance
Vector (AODV) protocol to discover the best routing
path. In [14], the existing VANET routing protocols
are classified into five categories according to the
routing metrics they use.

In [15], the authors calculate the reliability of
the OBU as a function of hardware reliability and
channel availability, without taking hardware
security into consideration and using simulations
instead of real data.

Waqar et al. list the papers working on the
reliability of communication networks and classify
them according to the modeling and analysis
techniques used. For each paper listed, they survey
its application in communication networks,
mentioning the strong and weak points of the
different approaches [16]. Concerning VANET
applications, only Ref. [15] mentioned above is
listed.

Further, the authors in [17] proposed an evalu- 
ation method of the performance reliability of the 
Mobile Ad hoc Networks (MANET) and studied 
the effect of interference. However, reliable com- 
munication in VANET is more complex compared 
to MANET, due to its characteristics, such as the 
high mobility and the dynamic topology. 

In all the articles mentioned above, the reliability
of the hardware and the quantitative and qualitative
reliability analysis of the connected vehicle system
are ignored when assessing performance. Hence, we
propose to analyze the reliability of VANET, in terms
of network reliability, as a function of hardware
reliability and functionality while taking into
consideration the needed security equipment. Several
approaches are applied to evaluate the reliability
and the availability of communication systems. In
this paper, the investigation is carried out using
RBDs, for both V2V and V2R communications. RBD
analysis is used to conduct both qualitative and
quantitative analysis of the systems. It identifies
the critical or weak components of the VANET system.
The reliability of the overall system is calculated
on the basis of the reliability of the individual
components in the OBU. Exponential distributions are
used to model the failure characteristics of the OBU
equipment nodes.

The California Department of Motor Vehicles
(CA DMV) is the state agency that registers motor
vehicles and issues driver’s licenses in the U.S. state
of California, and is responsible for permitting and
monitoring the testing of autonomous vehicles (AV).
The DMV publishes reports concerning AV failures,
or disengagements. A disengagement is the failure
event in which the autonomous system of the car (OBU
algorithms) fails to take the right decision. In these
situations control should revert to the human
driver.

A study of these reports covering 2616 disengagement
events from September 2014 to December 2016
shows that 52% of the disengagements were due to
system failure, that is, a hardware and/or software
failure of the vehicle technology that makes it
impossible for the vehicle to continue autonomous
operation. System failure is a broad category that
includes issues such as a discrepancy between onboard
GPS systems, incorrect perception of external objects,
incorrect prediction of other traffic behavior, and
so forth.

The evaluation of these problems is the object of
quantitative and qualitative reliability analysis for
connected cars. As shown later, the reliability
prediction helps us to specify which components in
the OBU and in the RSU should be made redundant in
order to reach a higher, acceptable reliability level.


3 THE PROPOSED APPROACH 


3.1 The analysis approach 


In this paper, we focus on reliability and
availability related to hardware failure issues
because of their importance and wide utilization in
the area of communication networks.

Using reliability analysis methods, we can
identify the problems in VANET, which leads to
improvements in the system design and avoids future
problems. Reliability analysis plays an important
role in predicting the behavior of wireless
networks over time, and gives a clear basis for
deciding how to redesign the system in order to have
an efficient communication network [18].

Fig. 1 shows the reliability analysis procedure.
The approach begins with the functional analysis
and the development of a conceptual behavioral
model of the system.

In the first step, we describe our mode of
communication (wireless in our case) and the desired
network behaviors, such as network protocols
(e.g. DSRC, WAVE), network topologies (ad hoc,
broadcast) and fault tolerance of the connected
vehicular network.

For this purpose, we carried out a preliminary
risk analysis of the smart vehicles. The functional
analysis studies how each of the parameters
identified above is ensured by the components of the
system. Based on this analysis of the functioning
of the VANET, dysfunctions and failures can be
identified.

The failure analysis of the VANET investigates the
sources (causes) of system failures and their
consequences (effects). Two large families of
analyses are commonly used: inductive ones, such as
the hazard and operability study (HAZOP) or the
failure modes and effects analysis (FMEA) used in
our study, and deductive ones, such as fault tree
analysis. Inductive analysis systematically
distinguishes the causes and effects of failures,
whereas deductive analyses are top-down approaches
(step 2).


Figure 1. Reliability analyses procedure (flowchart: Start; functional
analysis of the system (goals and technical); failure mode and effect
analysis of the VANET; reliability metrics: mean-time to failure (MTTF),
mean-time between failure (MTBF); reliability block diagram and fault
tree; reliability analysis technique: simulation, mathematical model;
decision).


The third step is the calculation of basic metrics
of reliability and availability, such as Mean Time
to Failure (MTTF) and Mean Time Between Failures
(MTBF) [19], for each piece of equipment used in
VANET. In our case, we obtain these metrics from a
professional database called Quanterion Automated
Databook (QAD), which uses Electronic Parts
Reliability Data (EPRD). The MTTF and MTBF are
measured in hours. These metrics can be obtained
by statistically calculating failure rates, for
example those of the GPS units embedded in OBUs.
The failure rate of components is calculated as
follows:


\[ \lambda_{\text{failure per hour}} = \frac{f}{n\,t} \]


where f represents the number of failures that
occurred during the observations, n represents the
number of components observed and t represents the
total time of observation.
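
As a worked illustration of this formula, the following is a minimal sketch. The function names and the observation figures are hypothetical, t is assumed to be the observation time per component in hours, and the MTTF relation 1/λ is the standard result for a constant failure rate rather than something taken from QAD.

```python
def failure_rate_per_hour(failures, components, hours_observed):
    """Point estimate lambda = f / (n * t), in failures per component-hour.

    Assumes every one of the n components was observed for t hours.
    """
    if components <= 0 or hours_observed <= 0:
        raise ValueError("components and hours_observed must be positive")
    return failures / (components * hours_observed)


def mttf_hours(rate_per_hour):
    """MTTF = 1 / lambda; valid only under a constant failure rate."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    return 1.0 / rate_per_hour


# Hypothetical observation: 3 failures among 200 units, each observed 8760 h.
lam = failure_rate_per_hour(failures=3, components=200, hours_observed=8760)
print(f"lambda = {lam:.3e} failures/hour, MTTF = {mttf_hours(lam):.0f} h")
```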

This data is used in order to calculate the reli- 
ability of OBU and RSU and then for the system 
as a whole. 

RBD and FT are the most commonly used formalisms
in system reliability modeling. They follow two
different approaches: in a reliability block
diagram, the OBU and the RSU are represented by
components connected according to their functional
or reliability relationships, while fault trees
show which combinations of component failures
will result in a system failure (step 4).

In step 5 the reliability analysis technique is
applied and the simulation is carried out.
Understanding and choosing the right probability
distributions is very important at this stage. The
exponential distribution is widely used to describe
the time between events recurring at random points
in time, such as the time between failures of
electronic equipment of the OBU and the RSU or the
time between arrivals at a service booth. It is
related to the Poisson distribution, which describes
the number of occurrences of an event in a given
interval of time.
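
For reference, the standard relations behind this choice can be written out, assuming a constant failure rate λ. The symbols T (time to failure), R(t), N(t) and k are introduced here for illustration and are not the paper's notation.

```latex
\begin{align*}
R(t) &= P(T > t) = e^{-\lambda t}, \\
\mathrm{MTTF} &= \int_0^{\infty} R(t)\,dt = \frac{1}{\lambda}, \\
P\bigl(N(t) = k\bigr) &= \frac{(\lambda t)^k}{k!}\, e^{-\lambda t}, \qquad k = 0, 1, 2, \dots
\end{align*}
```

Setting k = 0 in the Poisson probability recovers R(t): surviving up to time t is the same as observing no failure in the interval [0, t].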


3.2 Connected vehicles: VANET


The VANET system is composed of several 
sub-systems. 


3.2.1 On Board Unit (OBU) 
An OBU is a mobile or portable wireless device
located inside intelligent vehicles [20]. It works as
a communication device and allows DSRC
communications with other OBUs or RSUs. The OBU
includes other communication systems (e.g. GPS) and
other subcomponents such as application unit
hardware, a Human Machine Interface (HMI) and a power
supply. Each vehicle equipped with an OBU collects
data and information (such as the vehicle’s speed,
position, brake status, signal status, etc.), then
analyzes, processes and encrypts the data in order to
send it as a safety message to other vehicles (V2V) or
RSUs (V2R) through the wireless medium.

An automated vehicle is equipped with an OBU
system designed to ensure a number of functions.
In addition to processing, input/output and
storage functions, an OBU system provides other
major functions such as space-time localization and
scene recognition, longitudinal telemetry and
short-range omnidirectional communications.

The system components required for an OBU are:


• Resource Command Processor (RCP): It directs
the operation of the other units by providing
timing and control signals. All other resources
are managed by the RCP, which is the processing
unit of the system. It must permit local processing
of the data gathered from the infrastructure
and from the smart vehicle.

• GPS system: 360° positioning and global time
keeping, allowing the vehicle to perform
geo-location and communicate its own position.

• Wireless communication: VANET uses DSRC
to provide omnidirectional (360°) wireless radio
communication between moving vehicles.
DSRC refers to 802.11p, which is an improvement
of IEEE 802.11a.

• Antenna: used with a transmitter or a receiver in
order to achieve the dedicated range of wireless
networks.

• HMI: an electronic display screen that is used for
driver assistance in collision avoidance
applications. The HMI shows awareness messages such
as indications, warnings and advice, using different
modes of interaction such as visual (flashing light,
image) and auditory (sound, alarm).

• Vehicle services: interact directly with the body
and chassis systems of the car, performing tactile
and kinesthetic functions such as vibrations of
the driver’s seat or the steering wheel.


3.2.2 Trusted Platform Module (TPM) 
In VANET, data is broadcast over shared
communication media, so a malicious node may easily
intercept, modify or inject data [21, 22]. Data
injection can provoke collisions in a vehicular
system. Weak security leads to many traffic problems,
putting human lives at risk. In a Vehicle-Based
Security System (VBSS), the OBU generates the
Basic Safety Messages (BSM) that carry vehicle
and road condition data. These BSMs should be
certified and signed in order to preserve privacy
and provide essential security services, such
as authentication, integrity, confidentiality and
nonrepudiation.

A secured vehicle has some hardware requirements,
such as a TPM, which can be integrated into
the OBU. The TPM is the hardware module that performs
security operations such as encryption/decryption,
hashing and digital signature. It is able to protect
and store data and keys in shielded locations.

From a hardware point of view, a TPM contains
the following components:


• Assembled controller: A TPM contains a
controller bus for the connection and coordination
of its memory and peripherals.

• NVRAM: Non-volatile random-access memory
is used for permanent storage of the writeable
startup configuration. It is also used for
permanent storage of the hardware revision,
identification information and the cryptographic
keys.

• Crypto unit: as its name indicates, it is
responsible for random number generation, public-key
cryptographic algorithms, cryptographic hash
functions, symmetric-key algorithms, digital
signature generation and verification, and Elliptic
Curve Cryptography (ECC).


3.2.3 Road Side Unit (RSU) 

RSU is installed at the road side [23]. It includes 
communication hardware (e.g. Wi-Fi, UMTS, ITS 
G5, etc.), and serves as a gateway between OBUs 
and the communications infrastructure. It could 
provide location based services and Internet access 
for mobile devices to improve the communication 
connectivity. The main functions of RSUs are as 
follows: 


— Network coverage extension of the Ad Hoc 
network and communication medium between 
OBUs and RSUs. 

— Source of safety and awareness information like 
weather status. 

— Prioritize management messages to and from 
the OBU. 

— Gateways that allow vehicles to establish con- 
nection with the internet. 


The RSU is connected to the V2I communications
network. The RSU also manages the prioritization
of messages to and from the vehicle.

Similar to an OBU, an RSU is composed of a
communication transceiver (802.11 & 1609), a GPS
and a processor. The RSU contains a router that acts
as an interface to the V2I cloud network. It is
also connected to a local safety processor, which is
linked to the traffic light signal controller.


4 THE RELIABILITY 


In this section we use RBD and FT, the two most
effective techniques for modeling the reliability
and availability of communication networks [16].


4.1 RBD 


An RBD, such as the one for the OBU (Fig. 2), is a
graphical structure consisting of blocks representing
the system components and the connections between
these components. The system is functional if at
least one path of properly functioning components
exists from input to output; otherwise it fails [24].

Thus, the reliability of an OBU, denoted R_OBU,
is given by:

\[ R_{OBU} = R_{Ant}\, R_{DSRC}\, R_{GPS}\, R_{PS}\, R_{RCP}\, \bigl[(R_{HMI} + R_{VS}) - R_{HMI} R_{VS}\bigr]\, R_{TPM} \tag{1} \]

where the reliability of the TPM is:

\[ R_{TPM} = R_{Cryp}\, R_{mem}\, R_{con} \tag{2} \]


The RBD of the RSU is shown in Figure 3.

Consequently, the reliability of an RSU, denoted
R_RSU, is given by:

\[ R_{RSU} = R_{Ant}\, R_{DSRC}\, R_{GPS}\, R_{PS}\, R_{process}\, R_{rtr} \tag{3} \]


4.2 Fault tree of OBU failure 


Fault tree analysis is a graphical technique used to
determine the probability of occurrence of the top
event; for example, a DSRC failure event can cause
the whole system to fail. The causes of system
failure are represented in the form of a tree rooted
at the top event, as depicted in Figure 4. Logic
(Boolean) gates are used to link two or more cause
events that provoke one fault. An OR gate is used
when a fault at any single input is enough to cause
the output fault, while an AND gate is used when
the output fault occurs only if all inputs fail (the
inputs are assumed independent).

Figure 2.
the TPM.
Figure 3. Reliability block diagram for RSU.
Figure 4. Fault tree event of OBU.
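As a minimal illustration of the gate logic described above, the following Python sketch computes the output probability of an OR gate and an AND gate for independent input events. The function names and the input probabilities are illustrative assumptions; they are not taken from the fault tree in Figure 4.

```python
import math


def or_gate(probabilities):
    """Probability that at least one independent input event occurs."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)


def and_gate(probabilities):
    """Probability that all independent input events occur."""
    return math.prod(probabilities)


# Hypothetical basic-event probabilities (illustrative only).
p_inputs = [0.10, 0.20]
p_or = or_gate(p_inputs)    # 1 - 0.9 * 0.8
p_and = and_gate(p_inputs)  # 0.1 * 0.2
print(p_or, p_and)
```

The OR gate always yields a probability at least as large as any single input, which is why series-like structures dominate the top-event probability.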

This analysis requires a large amount of quantitative
data, so the data must be carefully targeted. QAD
contains field failure rate data on a variety of
electrical, mechanical, electromechanical, and
microwave parts and assemblies. It is used as a
source of reliability failure rate data. The data
contained in EPRD reflect industry-average failure
rates, especially the summary failure rates, which
were derived by combining several failure rates on
similar parts/assemblies from various sources. At
t = 0, the population of parts has not experienced
operation. As operating time increases, parts in the
original population are replaced and the failure
rate increases.

The probability of events is modelled from component
failure rates. These are defined by the mean failure
rate ($\lambda_{mean}$), the lower and upper failure
rates ($\lambda_{min}$ and $\lambda_{max}$), and the
standard deviation of the failure rate ($\lambda_{SD}$).

The component failures of the system follow an
exponential distribution over a total duration of 1 year
(8760 hours), to take into account the history of
degradation in the estimation of the fault. During this
time, we consider that the OBU system and the RSU
system are operating in real conditions (Figure 4).
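To make the exponential assumption concrete, the sketch below evaluates the standard exponential reliability function R(t) = exp(-λt) after one year of operation. The function name is an assumption, and the failure rate used is the mean power-supply value listed in Table 1, taken here to be expressed per hour.

```python
import math

HOURS_PER_YEAR = 8760


def reliability(failure_rate, hours):
    """Exponential reliability R(t) = exp(-lambda * t)."""
    return math.exp(-failure_rate * hours)


# Mean failure rate of the power supply (E008) from Table 1,
# assumed to be expressed in failures per hour.
lambda_ps = 6.3e-6
r_ps_one_year = reliability(lambda_ps, HOURS_PER_YEAR)
print(f"Power supply reliability after one year: {r_ps_one_year:.4f}")
```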


5 PROBABILISTIC MODELLING 
OF THE FAULT TREE EVENTS 


The objective of this section is to present the
probabilistic approach applied to the data of the fault
tree. The top event is the OBU failure. The reliability
data, expressed as failure rates, are random.

The reliability information for these events is:
$\lambda_{min}$ = minimum degradation rate,
$\lambda_{max}$ = maximum degradation rate,
$\lambda_{mean}$ = mean degradation rate, and
$\lambda_{SD}$ = standard deviation of the
degradation rate.

The exponential distribution, with failure rate
$\lambda$ and time to failure as the random variable,
is used to express the reliability or availability of the
individual components of the OBU. The dependability
of each component is then used to determine the
reliability of the overall VANET system through the
mathematical expression presented in Eq. 1. The
exponential model is the most widely used for
electronic components such as those in the OBU and
the RSU, even for a pessimistic scenario. The failure
rate $\lambda$ is considered as a random variable
defined between two limits. Its mean and standard
deviation are known.
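One possible way to treat a failure rate that is random, bounded, and known through its mean and standard deviation is sketched below. The text does not state which distribution is used, so the truncated normal distribution, the rejection sampler, the sample size and all function names are assumptions made purely for illustration. The numerical inputs are the power-supply values from Table 1.

```python
import math
import random


def sample_failure_rate(rng, lam_min, lam_max, lam_mean, lam_sd):
    """Draw from a normal(lam_mean, lam_sd) truncated to [lam_min, lam_max].

    Rejection sampling: redraw until the value falls inside the bounds.
    The truncated normal is an assumption, not the paper's stated model.
    """
    while True:
        lam = rng.gauss(lam_mean, lam_sd)
        if lam_min <= lam <= lam_max:
            return lam


def mean_reliability(lam_min, lam_max, lam_mean, lam_sd, hours,
                     n_samples=10_000, seed=0):
    """Average exp(-lambda * t) over sampled failure rates."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        lam = sample_failure_rate(rng, lam_min, lam_max, lam_mean, lam_sd)
        total += math.exp(-lam * hours)
    return total / n_samples


# Power supply (E008) values from Table 1, one year of operation.
r_ps = mean_reliability(4.7e-6, 9.1e-6, 6.3e-6, 2.01e-6, hours=8760)
print(f"Mean power-supply reliability after one year: {r_ps:.4f}")
```

Because every sampled rate lies between the two limits, the averaged reliability is guaranteed to lie between the reliabilities computed at the upper and lower failure rates.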

Table 1 lists the failure event data for all the
components of the OBU. The power supply (E008)
has the highest failure rate and a significant
influence on the reliability of the OBU: since the
power supply is implemented in series in the RBD,
the whole system fails whenever the power supply of
the OBU fails. As an improvement, we suggest using
dual power supplies in the OBU.

The redundant power supply, shown in Figure 5,
might be connected in parallel. After redesigning the
RBD for the OBU, we observed an improvement in
reliability of the redesigned OBU with dual power
supplies over the OBU with a single power supply.

Thus, the reliability of an OBU after adding
a redundant power supply $R_{PS2}$, represented by
$R_{OBU}$, is given below:

\[ R_{OBU} = R_{Ant} R_{DSRC} R_{GPS} \left[ (R_{PS1} + R_{PS2}) - R_{PS1} R_{PS2} \right] R_{RCP} \left[ (R_{HMI} + R_{VS}) - R_{HMI} R_{VS} \right] R_{TPM} \tag{4} \]
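The effect of the redundant power supply can be sketched numerically by evaluating the single-supply expression (Eq. 1) and the dual-supply expression (Eq. 4) with exponential component reliabilities. The sketch below uses only the mean failure rates of Table 1, assumed to be per hour, and ignores their randomness, so its output is indicative and need not coincide with the curves of Figure 6. The dictionary keys and function names are assumptions.

```python
import math

HOURS = 8760  # one year of operation

# Mean failure rates from Table 1 (assumed per hour).
LAMBDA_MEAN = {
    "HMI": 1.6e-6, "VS": 1.2e-6, "RCP": 3.3e-6, "PS": 6.3e-6,
    "GPS": 1.6e-6, "DSRC": 1.1e-6, "Ant": 5.5e-7,
    "CRP": 3.3e-6, "MEM": 1.1e-6, "CON": 2.1e-6,
}


def rel(name, hours=HOURS):
    """Exponential reliability of one component."""
    return math.exp(-LAMBDA_MEAN[name] * hours)


def parallel(r_a, r_b):
    """Reliability of two independent blocks in parallel."""
    return r_a + r_b - r_a * r_b


def obu_reliability(hours=HOURS, dual_power=False):
    """Series-parallel OBU model: Eq. 1 (single) or Eq. 4 (dual supply)."""
    r_tpm = rel("CRP", hours) * rel("MEM", hours) * rel("CON", hours)
    r_ps = rel("PS", hours)
    if dual_power:
        r_ps = parallel(r_ps, r_ps)  # two identical supplies in parallel
    r_hmi_vs = parallel(rel("HMI", hours), rel("VS", hours))
    return (rel("Ant", hours) * rel("DSRC", hours) * rel("GPS", hours)
            * r_ps * rel("RCP", hours) * r_hmi_vs * r_tpm)


r_single = obu_reliability()
r_dual = obu_reliability(dual_power=True)
print(f"single supply: {r_single:.3f}, dual supply: {r_dual:.3f}")
```

Since the power supply is in series with the rest of the OBU, replacing it with a parallel pair raises the system reliability at every point in time.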


As shown in Figure 6, after 8760 hours (1 year) of
running, OBU1, powered by a single power supply,
reaches a reliability of 82%, whereas OBU2, powered
by dual power supplies, reaches 88%, an increase of
at least 6 percentage points. This improvement
becomes more evident as time passes.


Table 1. Failure events data.

Basic event         Label  λ_min     λ_max     λ_mean    λ_SD
HMI                 E005   1.4E-06   1.8E-06   1.6E-06   1.8E-07
Vehicular services  E006   4.2E-07   2.3E-06   1.2E-06   8.08E-07
RCP                 E007   2.8E-06   3.7E-06   3.3E-06   3.7E-07
Power Supply        E008   4.7E-06   9.1E-06   6.3E-06   2.01E-06
GPS                 E009   1.4E-06   1.8E-06   1.6E-06   1.8E-07
DSRC                E010   1.05E-06  1.2E-06   1.1E-06   8.18E-08
Antenna             E011   4.8E-07   6.2E-07   5.5E-07   6.18E-08
Crypto Unit         E012   2.8E-06   3.7E-06   3.3E-06   3.7E-07
Memory              E013   4.6E-07   2.5E-06   1.1E-06   9.04E-07
Controller          E014   1.2E-06   3.3E-06   2.1E-06   9.6E-07

Figure 7. Reliability comparison between RSU1 and
RSU2.

Concerning the RSU, since all the components are
implemented in series, the proposed remodeling of
the RSU with dual power supplies increases the
reliability by more than 17%. Figure 7 clearly shows
that the degradation in reliability of RSU2, powered
by dual power supplies, is less severe than that of
RSU1, powered by a single power supply.


6 CONCLUSION AND FUTURE WORK 


Connected vehicles are classified as “safety critical”
because they are directly related to human safety.
For this reason, in this paper we used the
exponential model, which is very conservative and
pessimistic. A failure occurs whenever some function
is lost and no other function can supersede the lost
function in due time.

Any OBU failure can lead to crashes. In similar
domains (e.g. aircraft), the control system and
communication parts may be triplicated or
quadruplicated [25]. The model used in aviation,
where the system reliability is very high and the
failure rate should be in the range of $10^{-9}$ per
hour, can be used as a reference. The solution used
in aircraft could be transferred to the vehicle context
in order to provide distinct hardware redundancy.
However, the trade-off between the disadvantages of
redundancy (cost, weight, power consumption, design
complexity, etc.) and the reliability gained should
always be considered.

In this article we demonstrated the importance
of reliability analysis using a mixed approach that
combines qualitative and quantitative methods. As a
result, we found that the reliability of a system with a
single point of failure, such as the OBU system with
serial subcomponents, degrades considerably with
time. Using a very pessimistic exponential model, as
shown in Figure 6, the reliability of the OBU
degraded by 16% over a short period of one year,
with failure rates of the OBU electronic components
in the range of E-07 to E-06. Moreover, one of the
most important advantages of vehicular networks is
that there are no energy constraints, unlike wireless
sensor networks and other types of mobile devices
used in MANETs, where limited battery life is a major
concern. Taking this feature into consideration,
redundant components can be implemented in the
VANET without worrying about energy consumption.

Autonomous vehicles could also operate in isolation
from other vehicles using internal sensors. The
combination of the autonomous vehicle and the
connected vehicle is called a Connected and
Automated Vehicle (CAV), which leverages both
autonomous and connected vehicle capabilities, thus
making the system more complex. For this reason, it
is very important to analyze the functionality of each
component to identify its failure rate in order to
redesign for redundancy.

This paper mainly focused on the reliability of
connected vehicle networks, primarily with respect to
hardware components and the security equipment
needed in communications. A future study might
tackle reliability degradation using other models,
such as the Weibull model, which is classified as a
realistic model. Another future study will tackle the
reliability of the whole system in operation mode and
evaluate the reliability of communication from sender
to receiver, including the reliability of data
transmission, by applying the proposed modeling
technique under real network and traffic conditions.
ABSTRACT: The Bayesian network approach is a probabilistic method that is increasingly used in the risk
assessment of complex systems. It has proven to be a reliable and powerful tool with the flexibility to include
different types of data (from experimental data to expert judgement). The incorporation of system reliability
methods allows traditional Bayesian networks to work with random variables with discrete and continuous
distributions. On the other hand, probabilistic uncertainty comes from the complexity of reality that
scientists try to reproduce by setting up a controlled experiment, while imprecision is related to the quality of
the specific instrument making the measurements. This imprecision or lack of data can be taken into account
by using intervals and probability boxes as random variables in the network. The system reliability problems
that arise when dealing with these kinds of uncertainties have so far been solved with Monte Carlo
simulation. However, this method is computationally expensive, which prevents a real-time analysis of the
system represented by the network. In this work, the line sampling algorithm is used as an effective method
to improve the efficiency of the reduction process from enhanced to traditional Bayesian networks. This
preserves all the advantages without excessively increasing the computational cost of the analysis. As an
application example, a risk assessment of an oscillating water column is carried out using data obtained in
the laboratory. The proposed method is run using the multipurpose software OpenCossan.


1 INTRODUCTION 


Nowadays, experimental research in engineering
deals with highly complex systems owing to the
number of components used in the procedures.
Occasionally, the limited capacity of the experimental
arrangements to test different configurations and
obtain a sufficient number of relevant measurements
hinders the impact of the study. In addition, the
effects of epistemic uncertainty derived from the
procedure, and the uncontrollable conditions under
which the study is carried out, can yield results with
limited applicability or meaning. Different methods
have been developed for modeling and evaluating the
dependability of large engineering systems, allowing
both qualitative and quantitative information to be
taken into consideration. Among the most used
methods, reliability block diagrams, fault trees and
event trees can be identified as techniques that give
reliable results and have a robust mathematical
background (Hamada et al. 2008). However, several
assumptions are made with these techniques.

The Bayesian Network (BN) is a probabilistic
method to study and analyze the genuine
dependencies or independences of the variables that
make up a system. The concept was proposed by
Judea Pearl in 1988, originally for the artificial
intelligence area (Pearl 1991). Currently, BNs have
many more applications, ranging from system
dependability (Castillo et al. 1997) and risk analysis
(Hudson et al. 2002) to system maintenance (Kang
and Golay 1999). It is worth noticing that this method
has attracted increasing interest during the last 20
years, with a growth reaching 800% according to
Weber et al. (2012). The success of Bayesian
networks rests on the graphical representation of the
system, which renders them intuitive and easy to
understand even for non-experts. In addition, this
method can provide diagnostic reasoning, predictive
reasoning, or a combination of both (Korb and
Nicholson 2004), and it accepts new evidence that
can be used to update the network and adapt the
model to the new parameters. Moreover, information
of different types (e.g. expert judgment, experimental
data, historical records, feedback experience,
theoretical models, etc.) can be merged in the same
network, inside structures called probability tables
(or conditional probability tables in the case of child
nodes). These tables are filled with crisp probability
values, providing a global dependability estimation
(Jensen and Nielsen 2007).

On the other hand, despite their high acceptance for
uncertainty reasoning, traditional Bayesian networks
are limited to the use of crisp probabilities only
(Spiegelhalter 1987). This type of probability leads to
discretization methods or hard assumptions,
impoverishing the quality of the analysis (Tolo et al.
2016a). In order to work with continuous probabilities
that can take into account the uncertainty of the
variables in the network and avoid discretization of
the input information, Straub and Der Kiureghian
(2010) proposed to enhance BNs with structural
reliability methods, since these techniques support
the use of continuous random variables. This
approach retains all the advantages of BNs and
furthermore allows working with discrete and
continuous variables, as well as small probabilities
(ideal for low-probability high-impact events). This
method has been applied, for example, to the risk
assessment of technological facilities considering
climate change (Tolo et al. 2016a).

Although enhanced Bayesian networks consider a
broader spectrum of variables, the information
available rarely meets the requirements of the
method. For example, expert beliefs often do not
agree on a single probability value, or the information
is scarce, in which cases data have to be averaged or
the analyst has to make assumptions to perform the
study. In engineering, it is common to perform
measurements during an experiment; the results
obtained carry an epistemic uncertainty that cannot
be eliminated and should not be underestimated.
Consequently, the incorporation of imprecise
probabilities becomes an imperative need that can
improve the applicability of BNs.

The proposed method attempts a naive
implementation of Credal sets and p-boxes as a way
to characterize imprecise probabilities in discrete and
continuous variables, respectively. This approach
is expected to overcome two main problems when
dealing with uncertainty in Bayesian networks.
The method is implemented in the multipurpose
software OpenCossan (Patelli et al. 2018). In
particular, parametric and non-parametric p-boxes
are used to work with continuous imprecise random
variables and to obtain the quantile bounds of the
final distribution representing the system under
analysis, so that they can subsequently be used with
the structural reliability methods. The network
reduction process can be done with the advanced line
sampling method, since it is capable of approximating
the upper and lower bounds of the failure probability
under the assumption of a monotone system. Once
the network is reduced and the uncertainty of the
continuous variables has been propagated to the
reduced network through the bounded probability of
failure, the result is a Credal network with only
discrete but bounded variables. The method used
allows the exact bounds of the query probability to be
computed in the absence of evidence. When evidence
is introduced in the network, the method provides
intervals within which the true bounds of the query
probability are located.


2 THEORETICAL BACKGROUND 


2.1 Bayesian network 


A Bayesian network, as established before, is
presented in the form of a directed acyclic graph (DAG)
made of nodes and arrows (called links) connecting
those nodes. Each node represents a random variable
with information about observable quantities or
hypotheses of the system, whilst the links show the
dependency among the nodes. Nodes can be
differentiated as parent and child. A child node depends
directly on another node, called the parent node, and
graphically they are connected by a link starting at
the parent and ending at the child. Nodes with no
parents are called roots. The dependence of nodes
is governed by the d-separation concept. Two
variables, A and B, are d-separated if there is an
intermediate variable C, different from A and B, such
that, in a serial or diverging connection, C is
instantiated (with a specific probability value, i.e.
evidence). In the case of a converging connection, A
and B are d-separated if neither C nor any of its
descendants is instantiated (Jensen and Nielsen 2007).

The arrangement of parent and child nodes
connected by links allows performing diagnostic
and predictive reasoning. Diagnostic reasoning is
used to find the causes of an event by querying a
parent node given the information in its child
nodes. Predictive reasoning follows the direction of
the links and uses the causes (information in the
parents) to predict the effects (children).
The probability values denoting the dependency
between a child node and its parents are stored in a
conditional probability table. In this arrangement,
the probability of each state of the child is provided
given each of the states of the parents. The
total dependability of the network is quantified by
the joint probability distribution, which is defined
as the product of all the conditional and
unconditional probabilities specified in the network.
This is governed by the chain rule for Bayesian
networks (Jensen and Nielsen 2007), and it is given
as follows,


\[ P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid pa(X_i)) \tag{1} \]


where $X_i$ represents each of the $n$ random variables
of the network and $pa(X_i)$ denotes the parents of
$X_i$. As an example, the joint probability
of the Bayesian network presented in Figure 1 is
given by the expression,


\[ P(X_1, X_2, X_3, X_4) = P(X_1) P(X_2) \times P(X_3 \mid X_1, X_2) P(X_4 \mid X_3) \tag{2} \]

Figure 1. Example of a traditional Bayesian network.


The joint probability distribution function can
be used to calculate the probability of any
individual variable (Hosseini and Barker 2016). For
instance, to calculate the probability of the node
$X_3$ alone, the rest of the nodes are marginalized out
from Eq. (2). So, $P(X_3)$ is given as follows,


\[ P(X_3) = \sum_{X_1, X_2, X_4} P(X_1) P(X_2) \times P(X_3 \mid X_1, X_2) P(X_4 \mid X_3) \tag{3} \]


Marginalization is a distributive process to cal- 
culate the total probability of a variable of inter- 
est by the summation of the products of all the 
possible combinations of local joint probabilities. 
This process allows to isolate the probability of the 
parameter of interest and to remove the rest of the 
variables from the joint probability distribution 
(Jensen and Nielsen 2007). 
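The chain rule and the marginalization step can be sketched in a few lines of Python. The network assumed here has four binary nodes: two roots, one node with two parents, and one node with a single parent. This structure, the variable names and every probability value are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from itertools import product

# Hypothetical probability tables for binary variables (states 0 and 1).
P_X1 = {0: 0.7, 1: 0.3}                      # root node
P_X2 = {0: 0.6, 1: 0.4}                      # root node
P_X3_IS_1 = {(0, 0): 0.1, (0, 1): 0.5,       # P(X3 = 1 | X1, X2)
             (1, 0): 0.6, (1, 1): 0.9}
P_X4_IS_1 = {0: 0.2, 1: 0.8}                 # P(X4 = 1 | X3)


def bernoulli(p_one, state):
    """Probability of `state` for a binary variable with P(state = 1) = p_one."""
    return p_one if state == 1 else 1.0 - p_one


def joint(x1, x2, x3, x4):
    """Chain rule: product of each node's probability given its parents."""
    return (P_X1[x1] * P_X2[x2]
            * bernoulli(P_X3_IS_1[(x1, x2)], x3)
            * bernoulli(P_X4_IS_1[x3], x4))


def marginal_x3(x3):
    """Marginalize the joint distribution over X1, X2 and X4."""
    return sum(joint(x1, x2, x3, x4)
               for x1, x2, x4 in product((0, 1), repeat=3))


total = sum(joint(*states) for states in product((0, 1), repeat=4))
p_x3_is_1 = marginal_x3(1)
print(total, p_x3_is_1)
```

Summing the joint over every state combination returns one, which is a quick sanity check that the tables define a valid distribution.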

It is possible to remove variables that are not
relevant to the probability of the variable of
interest, namely A. This process is called variable
elimination. It consists of simply removing from
the joint probability the variables that are outside the
Markov blanket of A (i.e. the variables that are
parents or children of A, or that share a child with A).
The eliminated variables (those outside the Markov
blanket) do not influence the probability measures
of the variable of interest.

The type of information that can be adopted in
this method involves real probability values
(discrete nodes) or Gaussian distribution functions.
The latter works with crisp probability values
(Tolo et al. 2016b). However, this characteristic
becomes the main drawback of this technique
when the data come in the form of continuous
distributions. Nevertheless, this disadvantage is
overcome with the use of structural reliability methods.


2.2 Enhanced Bayesian networks 


Structural reliability methods (SRMs) are used
to work out the conditional probability tables of a
BN containing both discrete and continuous random
variables (an enhanced Bayesian network), resulting
in the reduction of the network to a traditional
BN. Suppose the nodes in the network in Figure 2a
correspond to independent random variables.
Here $X_1$ is an interval node (with $f(X_1)$ its
cumulative distribution function, CDF), $X_2$ and $X_4$ are
discrete nodes (representing probability mass
functions $P(X_2)$ and $P(X_4)$, respectively) and $X_3$ is a
continuous node (representing a CDF $f(X_3)$).
According to Straub and Der Kiureghian (2010), the
joint probability of the enhanced Bayesian network
can be computed by approximating the equation below,


\[ P(\mathbf{X}) = \int_{X_1} \int_{X_3} f(X_1) f(X_3) P(X_2) \times P(X_4 \mid X_1, X_2, X_3) \, dX_1 \, dX_3 \tag{4} \]

If the Markov properties are considered, nodes $X_1$
and $X_3$ are d-separated from $X_2$, since $X_4$ has not
received any evidence yet. So, the probability
of $X_4$ given $X_2$, from the equation above, can be
written as,


\[ P(X_4 \mid X_2) = \int_{X_1} \int_{X_3} f(X_1) f(X_3) \times P(X_4 \mid X_1, X_2, X_3) \, dX_1 \, dX_3 \tag{5} \]

It has to be noticed that each entry in the
conditional probability table of $X_4$ is defined by the
domain $\Omega_{x_4, x_2}$ in the continuous space of $X_1$ and
$X_3$ for a given value $x_2$ of the variable $X_2$. So
Eq. 5 can be further reduced to,


\[ P(X_4 \mid X_2) = \int_{\Omega_{x_4, x_2}} f(X_1) f(X_3) \, dX_1 \, dX_3 \tag{6} \]


Figure 2. a) Simple enhanced Bayesian network with
discrete nodes (rectangular shaped), probability function
nodes (circle shaped) and interval discrete nodes
(elliptical shaped). b) Reduced Bayesian network with crisp
probabilities.


The integral shown in Eq. 6 is equivalent to that
of a reliability problem (Tolo et al. 2016b). By solving
a structural reliability problem, i.e. approximating
the system failure probability, a network with
continuous nodes is reduced to one with only
discrete probability values.

There are several approaches for solving problems
like that shown in Eq. 6. These methods range from
numerical approximations like Monte Carlo
simulation to the well-known and widely used
first-order and second-order reliability methods
(Hasofer and Lind 1974). Moreover, some
advanced sampling techniques like Importance
Sampling, Stratified Sampling or Advanced Line
Sampling (Hasofer and Lind 1974), among others,
have been used as an alternative to the
computationally expensive numerical approximations.
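As a rough illustration of how a single conditional probability table entry of the form of Eq. 6 can be approximated, the sketch below uses plain Monte Carlo simulation rather than line sampling. Everything specific in it is an assumption: the two continuous parents are taken as independent standard normal variables, the domain is defined by a hypothetical linear limit state whose threshold depends on the state of the discrete parent, and the state labels and function names are invented for the example. The linear case is chosen because it has a closed-form answer to compare against.

```python
import math
import random

# Hypothetical thresholds: one per state of the discrete parent.
THRESHOLD_BY_STATE = {"low": 2.0, "high": 1.0}


def in_domain(x1, x3, threshold):
    """Hypothetical domain: the child event occurs when x1 + x3 > threshold."""
    return x1 + x3 > threshold


def cpt_entry_monte_carlo(threshold, n_samples=100_000, seed=1):
    """Estimate the integral of f(x1) f(x3) over the domain by sampling."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x1 = rng.gauss(0.0, 1.0)  # assumed standard normal parent
        x3 = rng.gauss(0.0, 1.0)  # assumed standard normal parent
        if in_domain(x1, x3, threshold):
            hits += 1
    return hits / n_samples


def cpt_entry_exact(threshold):
    """Closed form: x1 + x3 ~ N(0, 2), so P(sum > c) = 0.5 * erfc(c / 2)."""
    return 0.5 * math.erfc(threshold / 2.0)


for state, threshold in THRESHOLD_BY_STATE.items():
    estimate = cpt_entry_monte_carlo(threshold)
    print(state, round(estimate, 4), round(cpt_entry_exact(threshold), 4))
```

Each state of the discrete parent requires its own integral, so the number of reliability analyses grows with the size of the conditional probability table; this is the cost that more efficient samplers aim to reduce.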


2.3 Credal networks


Credal networks can be regarded as an extension
of Bayesian networks that manages intervals of
discrete probability values representing the lack of
information and uncertainty about the variables
involved. A Bayesian network is constructed
exclusively with discrete nodes such that only one
probability value is associated with each state of a
variable. Such a probability can belong to the variable
itself, in the case of roots, or can be conditioned
on the parents of that node, in the case of children.
In a Credal network, however, probabilities are
presented in the form of intervals that are
associated with probabilistic inequalities. In this
manner, a Credal network represents the different
variable states of the same Bayesian network, each of
them associated with one specific probability value
inside the interval (Tolo et al. 2018).
The graphical structure of such a network is the
same as in the Bayesian case, as are the Markov
blanket concept and the d-separation of nodes.
Nevertheless, the probability of a variable X is
indicated in the form of the so-called credal set K(X),
whilst the set of joint probability measures
$P(X_i \mid pa(X_i))$ is named a joint credal set, given by
$K(X_i \mid pa(X_i))$ (Cozman 2000). So, two different
interval probabilities of variables X and Y (where
Y is the complement of X) can be characterized by
their lower and upper bounds, $[\underline{p}(X), \overline{p}(X)]$ and
$[\underline{p}(Y) = 1 - \overline{p}(X), \overline{p}(Y) = 1 - \underline{p}(X)] \subseteq [0, 1]$, respectively.
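The relation between the bounds of an event and those of its complement can be checked with a short sketch. The function name and the interval used are illustrative assumptions.

```python
def complement_interval(lower, upper):
    """Bounds of the complement of an event with probability in [lower, upper]."""
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("expected 0 <= lower <= upper <= 1")
    # The lower bound of Y pairs with the upper bound of X, and vice versa.
    return 1.0 - upper, 1.0 - lower


# Hypothetical interval probability for X.
p_x = (0.20, 0.35)
p_y = complement_interval(*p_x)
print(p_y)
```

Note that the width of the interval is preserved: the imprecision about X carries over unchanged to its complement.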


2.4 Probability boxes 


Probability boxes (or p-boxes) allow making fewer
assumptions about the values used in the study
when correlations of the variables employed in the
study are ignored due to the effect of aleatory and
epistemic uncertainties. A p-box is specified for a
random variable X by the interval bounds $[\underline{F}, \overline{F}]$
on a cumulative distribution function F with values
between 0 and 1, such that $\underline{F}(X) \le F(X) \le \overline{F}(X)$
(Ferson et al. 2003). If a lower probability measure
$\underline{p}$ for the random variable X is given, the lower,
$\underline{F}(x)$, and upper, $\overline{F}(x)$, bounds of the p-box can be
computed as follows (Walley 1991),


\[ \underline{F}(x) = \underline{p}(X \le x), \qquad \overline{F}(x) = 1 - \underline{p}(X > x) \tag{7} \]


The p-box has a dual interpretation, i.e. $\overline{F}$
represents the upper bound on the probability (CDF
axis) and the lower bound on the quantile (x-values
axis). The opposite happens with $\underline{F}$. Therefore, this
concept is applicable in cases of imprecise continuous
probability distributions, and two types can be
differentiated: parametric and non-parametric
p-boxes. A parametric p-box is defined when the
shape of the probability distribution is known, but
there is no precise information about the parameters
of that distribution. The non-parametric case is rarer
but can arise especially when an experiment has been
performed and a set of measurements was obtained.
It occurs when parameters of the probability
distribution of a variable, e.g. mean and variance, are
known but no information about the type of
distribution is available (Ferson et al. 2003).
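A parametric p-box can be sketched for the simplest case: a normal distribution whose standard deviation is known and whose mean is only known to lie in an interval. The choice of distribution, the parameter values and the function names are assumptions made for illustration. Because the normal CDF decreases as the mean increases, the upper CDF bound is obtained with the smallest mean and the lower bound with the largest.

```python
import math


def normal_cdf(x, mean, std):
    """CDF of a normal distribution evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def pbox_bounds(x, mean_low, mean_high, std):
    """Lower and upper CDF bounds at x for a mean in [mean_low, mean_high]."""
    lower = normal_cdf(x, mean_high, std)  # largest mean gives the lowest CDF
    upper = normal_cdf(x, mean_low, std)   # smallest mean gives the highest CDF
    return lower, upper


# Hypothetical parameters: mean in [0.5, 1.5], standard deviation 1.0.
f_lower, f_upper = pbox_bounds(1.0, 0.5, 1.5, 1.0)
print(f_lower, f_upper)
```

Any normal CDF with a mean inside the interval lies between the two bounds at every x, which is the defining property of the p-box.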


2.5 Computational tool 


The open source software OpenCossan exploits the
object-oriented programming paradigm that Matlab
offers through the use of classes (entities containing
the data and the functions or methods that the
members of the same class have in common) and
objects (instances of a class). This basis is employed
to efficiently provide solutions to problems in
uncertainty quantification, sensitivity and reliability
analysis, and robust design, among others (Patelli
2015). The object-oriented methodology allows parts
of the code to be reused to create more complex
objects in a systematic and condensed way.
OpenCossan offers wide flexibility to integrate
new methods that enhance, improve or complement
the tools currently available in the software. This
opens the door to new developments that enrich the
robustness of the software in providing solutions.

Within the framework of OpenCossan, three
main toolboxes can be used for Bayesian networks:
BayesianNetwork, EnhancedBayesianNetwork and
CredalNetwork. The first one can be used in cases
where variables only correspond to crisp probability
values, whilst the second one also considers
continuous probability distributions and bounded
variables. The third toolbox is the one chosen for this
study, since it allows working with continuous
probabilities and interval variables representing
imprecise probabilities. However, the graphical
display is produced with the same method,
makeGraph, which belongs to the
EnhancedBayesianNetwork class.


3 CASE STUDY: OSCILLATING WATER 
COLUMN 


An Oscillating Water Column (OWC) is a type
of wave energy converter that captures the energy
that sea waves deposit on reaching the structure.
The popularity of this type of energy converter has
increased over the last few decades, since it is an
alternative for clean energy production (Falcão
2010). The structure of an OWC consists of a
chamber, partially submerged in the sea, generally
with two orifices. One is typically at the top of the
chamber, where a turbine is placed, and the other one
is below the water line facing the incoming sea waves,
see Figure 3. A water column trapped inside the
chamber oscillates as a result of the waves acting on
the structure. This, in turn, drives an air flow through
the orifice coupled to the turbine, thus generating
electricity (Cruz 2008).

Figure 4. Schematic layout from a top view of harbour
walls with different lengths (Daniel Raj et al. 2016a).

Since the system is always surrounded by sea
waves (especially when it is built away from the
shoreline), it is susceptible to overtopping events
that can cause serious damage to the structure,
particularly in operational mode. Horizontal
sliding, overturning, scouring and collapse of
the structure are possible during extreme sea
conditions (Cruz 2008). It is therefore preferable to
study such overtopping possibilities in order to stay
on the conservative side. Generally, a conventional
OWC device is constructed with adequate height so
that an overtopping event is very unlikely.
As the addition of harbour walls increases the wave
amplification, the possibility of an overtopping
event should be carefully addressed.

A very simple Bayesian network was built to test
the inclusion of imprecise probability nodes in the
computational toolbox based on the OpenCossan
software. The network represents the components
involved in experimental work carried out by
Daniel Raj and his team (Daniel Raj et al. 2016)
at the Indian Institute of Technology Madras in
India, to study the influence of the harbour walls of
an OWC on its energy efficiency characteristics. The
present network is used to assess the risk of structure
overtopping triggered by the waves generated in the
laboratory.


Figure 3. Sectional view of the wave flume with the OWC on the left-hand side.


3.1 Description of the experiment 


The wave flume used in the experiment was 72.5 m
long, 2 m wide and 2.5 m deep (see Figure 3). The
scaled OWC model was 0.540 m high, at a 1:20 scale
of a real prototype. The flume is equipped with a
wave maker system that is able to reproduce either
shallow or deep water waves; it was used to generate
random waves with steepness characteristics
within the limits of the operational range of the
system. The generated waves for this experiment
covered a range of relative water depths, d/L, from
0.074 to 0.23 and wave steepnesses, H/L, from
0.0074 to 0.065. Here, d denotes the water depth, H
corresponds to the wave height (in metres)
registered at the first wave gauge (situated 8 m
from the wave maker), and L denotes the wavelength
(in metres). The crest periods, T, adopted
ranged from 1 to 2.5 s in 0.25 s intervals. For more
information about this experiment, please refer to
Daniel Raj et al. (2016a, 2016b).
The experiment was carried out in two stages.
The first stage involved the identification of the
efficient resonating length of the harbour wall, as seen
in Figure 4, which enhances the energy conversion
capacity of the system. Four test configurations were
chosen in the first stage to investigate the effect of the
length of the projecting sidewalls on the efficiency of
the OWC: one without sidewalls (conventional) and
three with projecting sidewalls with c/b ratios of 1,
1.5 and 2. In the second stage, the effect of the
harbour walls on each side of the OWC was studied
by varying the angle of the harbour walls within
the range $[4\pi/8, 7\pi/8]$ at intervals of $\pi/8$ with
respect to the front lip wall of the OWC, as seen in
Figure 5. This angle variation is called wall inclination
from now on. The wall length, c, was kept
constant in this stage at the optimal configuration
identified in the first stage of the experiment.


3.2 The Bayesian network 


The goal of the experiment was to analyse the influence of
different harbour wall configurations on the resonance
frequency of the water waves approaching the structure, so
that the wave amplification was maximized. Maximizing wave
amplification poses a potential risk of structure
overtopping, which can damage the rearward face of the
experimental equipment or bring other unexpected
consequences. The probability of this event is quantified
through the network presented in Figure 6, which takes into
account the epistemic uncertainty characterizing the
measurements of the amplified waves running up the
structure. In addition, the wave properties, as well as the
OWC configurations studied in the experiment, were
considered as influencing factors for the occurrence of
structure overtopping.

Figure 5. Schematic layout of different harbour wall
inclinations (top view) (Daniel Raj et al. 2016b).

Figure 6. Enhanced Bayesian network for OWC experiment.

Since imprecise variables are employed in this 
simple study, the network for this problem is defined 
in the toolbox for Credal networks node by node, in 
the form of computational objects. Once all the var- 
iables are defined, an object of the class Credal Net- 
work is created to store the information regarding 
the nodes on the network, so further calculations 
can be performed. Then, the method makeGraph is 
invoked to display the network shown in Figure 6. 
The nodes are defined as follows. 

Different harbour wall inclinations and lengths were tested
to study their effects on wave amplification. In order to
select the case to be studied with the BN, the
Experiment_case node was defined as an interval node
containing the maximum and minimum length ratio, for the
harbour wall length experiment, or the maximum and minimum
angle, for the wall inclination experiment. The values in
the intervals are defined in such a way that they cover all
the values used in the experiment.


Wave properties were used in the Bayesian network as
follows. The Crest_period interval node contains the mean
level-up-to-down-crossing time of the incident waves. The
crest periods were varied from 1 to 2.5 s in intervals of
0.25 s. The Wave_height node contains the wave height used
in the experiments. The waves studied were 0.03, 0.06 and
0.09 m high for each of the harbour wall configurations. It
should be noted that Wave_height directly influences how
high the amplified wave is.

According to some authors, as shown in the review by Allsop
et al. (1985), the wave amplification phenomenon (or wave
run-up) on steep structure slopes approximately follows a
Rayleigh distribution. For this reason, Wave_amplification
is assumed to follow this probability distribution, with a
scale parameter based on the experimental results obtained
from the wave amplification measurements. However, the
parameters that define this variable in the experiment are
uncertain due to the lack of probabilistic data (i.e. only
one measurement was taken for a given value of wave crest
period, wall inclination or length). The use of p-boxes is
therefore convenient. The Wave_amplification p-boxes were
defined in such a manner that all the values tested
experimentally were enclosed in the p-box of each case.
These values are presented in Table 1 for the given harbour
wall length and inclination configurations and for each of
the different wave heights.
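As a minimal sketch of how such a p-box can be evaluated, the following Python snippet bounds the Rayleigh CDF when only an interval for the scale parameter is known. The function names are illustrative and do not belong to the toolbox used in this work; the scale interval is the harbour-wall length entry of Table 1 for the 0.03 m wave, and the evaluation point of 0.1 m is an arbitrary choice.

```python
import math


def rayleigh_cdf(x, scale):
    """CDF of a Rayleigh distribution with the given scale parameter."""
    if x <= 0.0:
        return 0.0
    return 1.0 - math.exp(-x * x / (2.0 * scale * scale))


def rayleigh_pbox(x, scale_lo, scale_hi):
    """Lower and upper CDF bounds at x for a scale in [scale_lo, scale_hi].

    The Rayleigh CDF decreases as the scale grows, so the largest scale
    gives the lower bound and the smallest scale gives the upper bound.
    """
    return rayleigh_cdf(x, scale_hi), rayleigh_cdf(x, scale_lo)


# Scale interval from Table 1 (length experiment, 0.03 m wave height).
lower, upper = rayleigh_pbox(0.1, 0.038, 0.077)
print(f"P(amplified height <= 0.1 m) lies in [{lower:.3f}, {upper:.3f}]")
```

Because the CDF is monotone in the scale parameter, the two endpoint evaluations are enough to enclose every Rayleigh distribution whose scale lies in the interval.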

Table 1. Wave amplification p-boxes defined with the
Rayleigh-distribution scale parameter for harbour-wall
length and inclination experiments.

Wave height (m)   Length experiment   Inclination experiment   Distribution
0.03              [0.038, 0.077]      [0.045, 0.08]            Rayleigh
0.06              [0.082, 0.142]      [0.11, 0.157]            Rayleigh
0.09              [0.121, 0.213]      [0.183, 0.237]           Rayleigh

Figure 7. Credal network after reduction process for
OWC experiment.

Table 2. Occurrence of structure overtopping for wall
length case. The bounded values in the three Overtopping
columns correspond to the Wave_amplification p-boxes in each
of the Wave_height cases, i.e. 0.03, 0.06 and 0.09 m,
respectively.

Period (s)   Overtopping (0.03 m)   Overtopping (0.06 m)   Overtopping (0.09 m)
1            [0, 0.013]             [0.016, 0.273]         [0.165, 0.555]
1.25         [0, 0.011]             [0.19, 0.279]          [0.165, 0.563]
1.5          [0, 0.01]              [0.021, 0.272]         [0.164, 0.551]
1.75         [0, 0.012]             [0.2, 0.269]           [0.159, 0.557]
2            [0, 0.01]              [0.02, 0.268]          [0.156, 0.562]
2.25         [0, 0.013]             [0.2, 0.275]           [0.161, 0.555]
2.5          [0, 0.009]             [0.2, 0.263]           [0.164, 0.557]

In this work, the Owen equation, as given by Mase et al.
(2013) for the overtopping discharge ratio Q, is used to
describe the Overtopping node, and the probability of
exceeding the maximum admissible wave overtopping is
evaluated from the performance function Z:

$$ Z = Q_{max} - A\,T_c\,g\,H_s \exp\left(-\frac{B\,R_c}{T_c\sqrt{g H_s}}\right) \tag{8} $$

where A and B are dimensionless empirical coefficients
depending on the slope ratio of the structure. In this case,
A and B are given as 0.0079 and 20.12 (Owen 1980),
respectively. $H_s$ is the significant wave height of the
amplified waves in metres, $T_c$ the crest period, g the
gravitational acceleration and $R_c$ the structure
freeboard. If the condition Z < 0 is met, the event is
considered a failure of the system, meaning wave overtopping
of the OWC for a given maximum overtopping rate $Q_{max}$.
Once all the nodes are defined in the toolbox, the network
is reduced using the Adaptive Line Sampling method. Figure 7
shows the reduced network used for this study. The
overtopping events are given in the continuous space of the
Wave_amplification node for each given state of
Experiment_case.
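The performance function built on Owen's equation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the freeboard and admissible discharge used in the example call are placeholder values, since they are not reported in this section. The coefficients A and B are the ones quoted above.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def owen_discharge(h_s, t_c, r_c, a=0.0079, b=20.12):
    """Mean overtopping discharge from Owen's equation.

    h_s: significant height of the amplified waves (m)
    t_c: wave period (s)
    r_c: structure freeboard (m)
    """
    if h_s <= 0.0:
        return 0.0
    r_star = r_c / (t_c * math.sqrt(G * h_s))  # dimensionless freeboard
    return a * t_c * G * h_s * math.exp(-b * r_star)


def limit_state(h_s, t_c, r_c, q_max):
    """Z = Q_max - Q; overtopping failure corresponds to Z < 0."""
    return q_max - owen_discharge(h_s, t_c, r_c)


# Placeholder freeboard (0.1 m) and admissible discharge (1e-3), for illustration only.
z = limit_state(h_s=0.2, t_c=1.5, r_c=0.1, q_max=1e-3)
print("failure" if z < 0 else "safe", z)
```

The discharge grows with the amplified wave height, which is the monotonic behaviour relied upon when the p-box bounds are propagated through the network.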

In this simple network there is only one p-box employed per
simulation (for the Wave_amplification node), so the lower
bound corresponds to the minimum amplification that a wave
can have under those conditions. Thus, the overtopping
probability resulting from this calculation is the minimum
probability that this variable can have, assuming
monotonicity in the system. In other words, the lower bound
of Wave_amplification yields the lower bound of Overtopping.
The same reasoning is used for the upper bounds.
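The bound-to-bound reasoning above can be illustrated with a plain Monte Carlo sketch. Note that this is a substitute for, not a reproduction of, the Adaptive Line Sampling reduction used in this work, and that the freeboard and admissible discharge are placeholder values, so the numbers are not expected to match Table 2. The same random numbers are reused for both scale bounds, so the monotonic relation between amplification and overtopping shows up sample by sample.

```python
import math
import random

G = 9.81  # gravitational acceleration, m/s^2


def owen_discharge(h_s, t_c, r_c, a=0.0079, b=20.12):
    """Mean overtopping discharge from Owen's equation."""
    if h_s <= 0.0:
        return 0.0
    r_star = r_c / (t_c * math.sqrt(G * h_s))
    return a * t_c * G * h_s * math.exp(-b * r_star)


def overtopping_bounds(scale_lo, scale_hi, t_c, r_c, q_max, n=20000, seed=0):
    """Estimate lower and upper overtopping probabilities for a Rayleigh p-box."""
    rng = random.Random(seed)
    fail_lo = fail_hi = 0
    for _ in range(n):
        # Inverse-CDF sample of a unit-scale Rayleigh variable,
        # shared by both bounds (common random numbers).
        base = math.sqrt(-2.0 * math.log(1.0 - rng.random()))
        if owen_discharge(scale_lo * base, t_c, r_c) > q_max:
            fail_lo += 1
        if owen_discharge(scale_hi * base, t_c, r_c) > q_max:
            fail_hi += 1
    return fail_lo / n, fail_hi / n


# Scale interval from Table 1 (length experiment, 0.09 m wave height);
# r_c and q_max are hypothetical.
p_lo, p_hi = overtopping_bounds(0.121, 0.213, t_c=1.5, r_c=0.1, q_max=1e-3)
print(f"overtopping probability in [{p_lo:.3f}, {p_hi:.3f}]")
```

Since the discharge increases with wave height, every sample that fails at the lower scale also fails at the upper scale, so the estimated lower probability can never exceed the upper one.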

The lower and upper bounds of the overtopping probability
are displayed in Table 2 for the harbour wall length
experiment and in Table 3 for the harbour wall inclination
experiment. Each wave height column in the tables contains
the overtopping probability bounds computed for that height,
i.e., 0.03, 0.06 and 0.09 m, for each of the different crest
periods ($T_c$). This was done in order to provide a
combinatorial study of the change in overtopping probability
over all the experimental set-ups.

Table 3. Occurrence of structure overtopping for wall
inclination case. The bounded values in the three
Overtopping columns correspond to the Wave_amplification
p-boxes in each of the Wave_height cases, i.e. 0.03, 0.06
and 0.09 m, respectively.

Period (s)   Overtopping (0.03 m)   Overtopping (0.06 m)   Overtopping (0.09 m)
1            [0, 0.017]             [0.115, 0.34]          [0.455, 0.623]
1.25         [0, 0.015]             [0.109, 0.342]         [0.451, 0.623]
1.5          [0, 0.015]             [0.111, 0.337]         [0.456, 0.631]
1.75         [0, 0.017]             [0.114, 0.343]         [0.453, 0.629]
2            [0, 0.015]             [0.11, 0.341]          [0.449, 0.621]
2.25         [0, 0.017]             [0.111, 0.34]          [0.455, 0.623]
2.5          [0, 0.017]             [0.113, 0.341]         [0.455, 0.621]

It can be observed that $T_c$ does not significantly
influence the occurrence of structure overtopping. In fact,
the changes in the overtopping results obtained for the
different crest periods may be due to the randomness of the
simulation functions used. The major influence comes from
Wave_height and Experiment_case. It is logical that higher
waves increase the height of the amplified wave and, in
consequence, the probability of an overtopping event
increases as well. Comparing the results from both
experimental arrangements, it can be concluded that the wall
inclination factor increases the probability of overtopping
the most. The wave amplification factor of 0.237,
corresponding to a wall inclination of $3\pi/4$ with 0.09 m
wave height, can be regarded as the worst-case scenario.
This can be useful when the values only need to be scaled up
to prototype dimensions. For instance, a structure
overtopping assessment can be provided without performing
any experimental work, as long as the physical behavior of
the variables preserves the same probability distribution.


4 CONCLUSIONS 


From the case study presented here, an approximation to the
worst-case scenario was found using a very simple Bayesian
network with p-boxes and interval variables (credal sets).
The original network, containing continuous, interval and
discrete variables, was reduced using structural reliability
methods (adaptive line sampling) to a simpler network
containing only crisp probabilities. Since the values used
as input in the p-boxes and in the interval variables
contain all the cases considered in the experiment, none of
them should exceed the maximum probability given. Epistemic
uncertainty affecting this experiment can therefore be
quantified with this method. If specific data (different
from that considered here) regarding any of the variables in
the experiment are given inside the bounds, the overtopping
probability can be provided for that specific case, and it
will remain below or equal to the maximum value obtained.

The implementation of imprecise data in Bayesian networks is
a much-needed tool in engineering. However, the methods
currently available are computationally expensive. Most
importantly, when systems with a large number of variables
are studied, the algorithms may suffer combinatorial
explosion. Adaptive line sampling will be further studied to
deal with bounded failure probabilities. In addition, random
set theory (combined with Subset Simulation to improve
calculation speed) appears to be a good option for
overcoming the limitations of the current tool, since the
latter method is especially useful for small-probability
cases and performs well in both low- and high-dimensional
spaces as well as with nonlinear limit state functions.
ABSTRACT: According to Bourdeau (1986), diffusion of stresses in a granular medium can be
described using a probabilistic approach. A point load applied on the surface of a granular medium will
follow an erratic path, depending on the probability of transition between the grains. The diffusion of the
expected vertical stress in the granular medium can be described by a Fokker-Planck type equation. In
terms of expected vertical stresses, an equation of diffusion is obtained and the parameter of diffusion is
shown to approximate the coefficient of lateral pressure of the material at a given depth z. The coefficient
of lateral pressure of the material can be expressed in terms of intervals with upper and lower values to
account for uncertainty.

In the present approach, we propose to solve the diffusion equation using interval-based parameters
to account for uncertainty. Uncertain parameters are considered as discretized fuzzy numbers; they are
combined with the finite difference method to solve the diffusion equation. Comparisons are made with
experimental and available data.


1 INTRODUCTION 


Diffusion of stresses in granular media is a phenomenon that
provokes rearrangement of grains and thus settlements.
Soils, as granular media, consist of an assembly of
particles of different sizes, mineralogies and morphologies.
The particle sizes vary from less than 0.002 mm in clays to
some tens of millimeters in gravel materials. Despite their
granular aspect, they are considered, from the soil
mechanics viewpoint, as a continuum. Theories of continuous
media are often applied to model the behavior of cohesive
materials such as clays. The behavior of cohesionless soils,
on the other hand, is difficult to capture using continuum
approaches. Their granular aspect, especially when they are
in loose states, makes their behavior complex to predict
using conventional theories of continuous media. Harr (1977)
proposed an approach to estimate the expected vertical
stress in a granular medium subjected to surficial loading.
A concentrated load applied on the surface of a
semi-infinite ground will follow an uncertain path between
the grains. The resulting stress at a point is a random
variable. Its distribution will reflect the composition of
the medium. Following a binomial law of diffusion to one
side or the other of a reference mark, the transmission of
the load in the granular medium is expressed as an equation
of diffusion of stresses. The main parameter characterizing
the equation is the coefficient of lateral pressure in
soils. Bourdeau (1986) extended this approach and proposed a
formulation in terms of displacement diffusion in a loose
cohesionless granular medium.

We revisited the formulation in terms of stress diffusion to
account for parameter uncertainty using intervals, which can
also be combined and expressed as discretized fuzzy numbers.
A numerical approach based on the finite difference method
is combined with interval-based parameters to simulate the
diffusion of stresses in granular media, and comparisons are
made with experimental data. The aim of this research is to
study parameter uncertainty and its quantification in the
context of stress diffusion in a granular medium.


2 DIFFUSION OF STRESSES IN A 
GRANULAR MEDIUM 


Diffusion of stresses in soils is a fundamental aspect of
soil mechanics. It is observed in all soil-structure
interaction problems as a consequence of applied loads. Harr
(1977) built a model for stress diffusion in a granular
medium based on the idea of expected stress distribution at
a point. A unit force applied on the surface will follow a
random path between the grains. The vertical normal stress
acting at a point within a medium is the total accumulated
effect of many random variables: the shape and distribution
of particles and the spatial distribution of voids, as well
as their local configurations. The central limit theorem
ensures that the distribution of stress will converge to the
normal distribution as the number of particles becomes large
(Harr, 1977).

Figure 1 shows, schematically, the transmission of vertical
forces between particles. The effect of the boundary force
will spread laterally in the positive and negative
directions. At a representative particle, the input stress
can be taken to divide in the left and right directions,
consistent with a Bernoulli trial. The division of stresses
is expected to be equal: for a random distribution, the
frequency of moving to the left is the same as that of
moving to the right. Figure 2 shows the distribution of
stresses within a homogeneous random medium. If $\Delta x$
is the average spacing of the stresses in a row, with the
rows taken to be $\Delta z$ apart, the expected stress will
follow a binomial distribution with the recurrence
equations (Harr, 1977):


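$$ \bar{S}_z[x, z+\Delta z] = \tfrac{1}{2}\left\{ \bar{S}_z[x-\Delta x, z] + \bar{S}_z[x+\Delta x, z] \right\} \tag{1} $$

Subtracting the expected vertical stress $\bar{S}_z[x, z]$
from each side of the expression and dividing by
$\Delta z$, we get:

$$ \frac{\bar{S}_z[x, z+\Delta z] - \bar{S}_z[x, z]}{\Delta z} = \frac{(\Delta x)^2}{2\Delta z}\, \frac{\bar{S}_z[x+\Delta x, z] - 2\bar{S}_z[x, z] + \bar{S}_z[x-\Delta x, z]}{(\Delta x)^2} \tag{2} $$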


Figure 1. Transmission of vertical forces.

Figure 2. Unit stress distribution in terms of
probabilities.

In the limit as $\Delta x$ and $\Delta z$ become very small,
the equation can be expressed as a differential one:

$$ \frac{\partial \bar{S}_z}{\partial z} = D\, \frac{\partial^2 \bar{S}_z}{\partial x^2} \tag{3} $$

where

$$ D = \lim_{\Delta x, \Delta z \to 0} \frac{(\Delta x)^2}{2\Delta z} $$

The ratio $(\Delta x)^2/\Delta z$ is seen to reflect a
characteristic of the particulate medium. Harr (1977) has
shown that D ($D = \nu z$) is dependent on the depth z and
the coefficient of lateral stress $\nu$. The author also
showed that $\nu$ can be approximated by the coefficient of
earth pressure at rest K (using Jaky's formula for example).

Equation (3) is a diffusion equation for the expected
vertical stresses in a granular medium, with D as the
characterizing parameter.
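The binomial character of recurrence (1) can be checked with a short Python sketch. The function name and the integer lateral index are illustrative choices; the code simply applies the equal left/right split row by row to a unit load.

```python
def spread_unit_load(n_rows):
    """Apply recurrence (1) n_rows times to a unit load placed at x = 0.

    Returns a dict mapping the lateral index k (position x = k * dx)
    to the expected share of the load at depth n_rows * dz.
    """
    stress = {0: 1.0}
    for _ in range(n_rows):
        nxt = {}
        for k, s in stress.items():
            # Half of the stress moves one step left, half one step right.
            nxt[k - 1] = nxt.get(k - 1, 0.0) + 0.5 * s
            nxt[k + 1] = nxt.get(k + 1, 0.0) + 0.5 * s
        stress = nxt
    return stress


profile = spread_unit_load(4)
for k in sorted(profile):
    print(k, profile[k])
```

After four rows the shares are the binomial weights 1/16, 4/16, 6/16, 4/16 and 1/16, and they always sum to the applied unit load.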

The main objective of our research is to study 
the variability of the coefficient D and its influence 
on the diffusion of stresses in a granular medium. 
Uncertainty analysis will be carried out based on 
interval analysis and fuzzy representations of D. 
Different sources of information are used to con- 
sider the coefficient of diffusion. 


2.1 Uncertainty in the diffusion coefficient 


Uncertainties in geotechnics are generally classified into
two categories: “epistemic”, related to lack of data or
knowledge, and “aleatory”, related to natural randomness
(Baecher & Christian, 2003). The diffusion parameter D is
subject to epistemic uncertainties which influence the
diffusion process in the granular medium. The main sources
of uncertainty in D come from the evaluation of the lateral
pressure coefficient K, as D = K z. In situ and laboratory
measurements are used to evaluate the lateral pressure
coefficient in soils via empirical relations (Jaky's formula
for example). Empirical correlations combining Jaky's
formula ($1 - \sin\varphi'$) with the OCR from laboratory
tests are also used to determine the lateral pressure
coefficient at rest $K_0$ (Cai et al. 2011). When
experimental results are given in terms of intervals, the
average values are generally used. Figure 3 shows the
variation of lateral earth pressure based on different tests
(Chen & Fang, 2008).

Figure 3. Distribution of horizontal earth pressure
against model wall (Chen & Fang, 2008).


Chen & Fang (2008) studied variation of lateral 
earth pressure in a block of sand. They presented 
experimental data on the variation of lateral earth 
pressure against a non-yielding retaining wall due 
to soil filling and vibratory compaction. 

Schmertmann et al. (2005) used a K-Box device 
to describe stress diffusion in sand under a surface 
circular plate loading. The experiment intended to 
improve understanding of stress distribution in a 
particulate medium. Comparisons were made with 
obtained stresses from the probabilistic approach 
(Harr 1977). According to the authors the particu- 
late-probabilistic theory seems approximately cor- 
rect when K = K,. 

As we can observe, uncertainties are inherent to the testing
method; in other words, interval values can be suitable for
handling the perturbations. Intervals are often used as a
tool to handle uncertainties which arise during experiments.
In different types of tests, uncertainties propagate through
the measuring procedures and arise from the use of different
devices. With Jaky's formula, for example, the uncertainty
lies essentially in measuring the friction angle of the
material, $\varphi'$. Whatever the technique used for
measuring such a parameter, there are uncertainties to
consider.


3 FORMULATION OF INTERVAL-BASED 
PARAMETERS FOR DIFFUSION 


Interval analysis was introduced by Moore (1966) and is
considered a mathematical discipline that deals with
quantities expressed as intervals, which are common in
engineering problems. Intervals are a convenient tool to
deal with uncertainty when there is a lack of data.
Probabilistic approaches require large amounts of data,
whereas in geotechnics information is generally scarce and
few data are available, so the use of interval analysis can
be of interest. In practice, it may sometimes be difficult
to obtain a large number of experimental data, so we need an
alternative method that can handle the uncertainty with few
experimental data. In our problem, the diffusion parameter D
will be considered as interval-based. The applied pressure
on the ground can also be considered as an uncertain
parameter expressed in terms of an interval.

Finite difference schemes are usually applied to solve the
type of diffusion equation considered here. Either a forward
or a backward scheme can be used to solve it when dealing
with deterministic values of the parameter D:

$$ \frac{\partial \bar{S}_z}{\partial z} = D\, \frac{\partial^2 \bar{S}_z}{\partial x^2} $$

where $D = K z$. (For simplicity we use S instead of
$\bar{S}_z$.)


In the finite difference method, the forward scheme can be
expressed by:

$$ S_{i+1,j} = S_{i,j} + K z_i\, \frac{\Delta z}{(\Delta x)^2} \left( S_{i,j+1} - 2 S_{i,j} + S_{i,j-1} \right) \tag{4} $$

Index i is used in the z direction and j in the x direction.
The formulation of the problem with interval-valued
parameters follows the rules of interval arithmetic. More
details on the derivation and application of interval
arithmetic can be found in Moore et al. (2009) and Hanss
(2005).
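A minimal deterministic sketch of the forward scheme (4) is given below. The grid, the domain width and the loaded strip are illustrative assumptions, not the set-up of the case study; K = 0.412 corresponds to Jaky's formula with a friction angle of 36°. The explicit scheme is only stable when K z Δz/(Δx)² does not exceed 1/2, which the code checks at every depth step.

```python
def forward_step(s_row, k_coef, z_i, dz, dx):
    """Advance the stress profile from depth z_i to z_i + dz with scheme (4)."""
    r = k_coef * z_i * dz / dx**2
    if r > 0.5:
        raise ValueError("explicit scheme unstable: reduce dz or increase dx")
    new_row = list(s_row)  # end values act as fixed boundary conditions
    for j in range(1, len(s_row) - 1):
        new_row[j] = s_row[j] + r * (s_row[j + 1] - 2.0 * s_row[j] + s_row[j - 1])
    return new_row


def solve_diffusion(surface, k_coef, depth, n_steps, dx):
    """March the surface stress profile down to the requested depth."""
    dz = depth / n_steps
    row = list(surface)
    for i in range(n_steps):
        row = forward_step(row, k_coef, i * dz, dz, dx)
    return row


# Illustrative set-up: 40 kPa strip load, 0.1 m wide, on a 1 m wide domain.
dx = 0.02
centre = 25
surface = [40.0 if abs((j - centre) * dx) <= 0.05 else 0.0 for j in range(51)]
profile = solve_diffusion(surface, k_coef=0.412, depth=0.5, n_steps=1200, dx=dx)
print(f"stress on the axis at 0.5 m depth: {profile[centre]:.2f} kPa")
```

Because D = K z vanishes at the surface, the first step leaves the applied load unchanged, and the spreading accelerates with depth.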


3.1 Reminder on intervals 


Figure 4. Scheme and Boundary Conditions.

The interval-valued approach has gained significant use in
engineering, especially when information is uncertain.
Studies have been conducted in different fields, such as
thermal conductivity, to account for uncertainty using
interval approaches and fuzzy sets (Wang 2014). Wang (2014)
proposed a numerical technique, named the fuzzy finite
difference method, to solve heat conduction problems with
fuzzy uncertainties in both the physical parameters and the
initial/boundary conditions. The $\alpha$-level-cut method
is used to study the problem in terms of interval equations.
Kermani & Saburi (2007) presented a numerical method for
solving a “Fuzzy Partial Differential Equation” (FPDE), with
some examples.

In soil mechanics, the solutions for stress diffusion are
based on continuum mechanics, which considers the medium as
a continuum. Even though the soil is made of particles, its
granular aspect is not considered. The probabilistic
approach from Harr (1977) considers the soil as a
particulate medium. The diffusion equation (3) is used with
the parameter of diffusion D as a deterministic value.
Deterministic values of D are not common, but we tend to use
approximations. In the following, the equation of diffusion
of stresses in a granular medium will be considered in terms
of intervals. The parameter of diffusion D and the
initial/boundary conditions are taken as intervals to
account for uncertainty.

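Using S, equation (3) can be written in terms of intervals
as follows:

$$ \frac{\partial \tilde{S}}{\partial z} = \tilde{D}\, \frac{\partial^2 \tilde{S}}{\partial x^2} \tag{5} $$

$\tilde{S}$ is replaced by $S^I$, where $S^I$ denotes the
interval I at a given $\alpha$ level. $\underline{S}(\alpha)$
and $\overline{S}(\alpha)$ are the lower and upper values of
the interval at level $\alpha$:

$$ \frac{\partial S^I}{\partial z} = \frac{\partial \left[ \underline{S}(\alpha), \overline{S}(\alpha) \right]}{\partial z} \tag{6} $$

$$ \frac{\partial S^I}{\partial z} = \frac{S^I_{i+1,j} - S^I_{i,j}}{\Delta z}, \qquad \frac{\partial^2 S^I}{\partial x^2} = \frac{S^I_{i,j+1} - 2 S^I_{i,j} + S^I_{i,j-1}}{(\Delta x)^2} \tag{7} $$

Then the forward scheme of the finite difference approach
using interval-based parameters can be written:

$$ S^I_{i+1,j} = S^I_{i,j} + D^I_i\, \frac{\Delta z}{(\Delta x)^2} \left[ S^I_{i,j+1} - 2 S^I_{i,j} + S^I_{i,j-1} \right] \tag{8} $$

where

$$ D^I_i = K^I z_i \tag{9} $$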


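$\alpha$-level cuts are used to characterize a fuzzy
number.

A triangular fuzzy number, denoted by u = (a, b, c) where
a < b < c, has $\alpha$-cuts

$$ [u]_\alpha = [a + \alpha(b - a),\; c - \alpha(c - b)], \qquad \alpha \in [0, 1] $$

and membership function

$$ \mu_u(x) = \begin{cases} \dfrac{x - a}{b - a} & \text{if } a \le x < b \\ \dfrac{c - x}{c - b} & \text{if } b \le x \le c \\ 0 & \text{otherwise} \end{cases} $$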
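The two definitions above translate directly into code. The sketch below uses hypothetical function names and an arbitrary triangular number (1, 2, 4) chosen only to make the arithmetic easy to follow.

```python
def alpha_cut(a, b, c, alpha):
    """Interval of the triangular fuzzy number (a, b, c) at level alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return a + alpha * (b - a), c - alpha * (c - b)


def membership(x, a, b, c):
    """Membership degree of x in the triangular fuzzy number (a, b, c)."""
    if a <= x < b:
        return (x - a) / (b - a)
    if b <= x <= c:
        return (c - x) / (c - b)
    return 0.0


low, high = alpha_cut(1.0, 2.0, 4.0, 0.5)
print(low, high)                        # end points of the 0.5-cut
print(membership(low, 1.0, 2.0, 4.0))   # both end points have membership 0.5
print(membership(high, 1.0, 2.0, 4.0))
```

The end points of an α-cut are exactly the values whose membership degree equals α, which is why a fuzzy number can be handled as a nested family of intervals.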


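Elementary operations of interval arithmetic are used
(Hanss, 2005):

$$ [a_1, b_1] + [a_2, b_2] = [a_1 + a_2,\; b_1 + b_2] $$

$$ [a_1, b_1] - [a_2, b_2] = [a_1 - b_2,\; b_1 - a_2] $$

$$ [a_1, b_1] \times [a_2, b_2] = [\min(M),\; \max(M)], \qquad M = \{a_1 a_2,\; a_1 b_2,\; b_1 a_2,\; b_1 b_2\} $$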
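These elementary operations can be written as small Python helpers acting on (lower, upper) tuples. The helper names are illustrative.

```python
def iadd(x, y):
    """Interval addition."""
    return (x[0] + y[0], x[1] + y[1])


def isub(x, y):
    """Interval subtraction: lowest minus highest, highest minus lowest."""
    return (x[0] - y[1], x[1] - y[0])


def imul(x, y):
    """Interval multiplication: extremes of the four end-point products."""
    products = [x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1]]
    return (min(products), max(products))


print(iadd((1, 2), (3, 5)))
print(isub((1, 2), (3, 5)))
print(imul((-1, 2), (3, 5)))
print(isub((1, 2), (1, 2)))  # not (0, 0): the two operands are treated as independent
```

The last line illustrates the dependency effect of interval arithmetic: subtracting an interval from itself gives (-1, 1) rather than zero, which is one reason why interval versions of a finite difference scheme can widen faster than the true range of the solution.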


4 CASE STUDY 


Turedi and Ornek (2017) performed laboratory experiments on
sand to investigate the stress and bearing capacity. The
vertical stresses resulting from strip and rectangular
footings were measured at different depths in a model tank
1.25 m long, 1.0 m wide and 1.0 m deep. The loading is
considered under a model footing of breadth B = 0.1 m and
length L = 5B. The measured average peak friction angles
$\varphi'$ were 36° and 42° for loose and dense sands,
respectively. The measured vertical stresses (kPa) are given
in Figure 5.

The model was calibrated using the results from Turedi and
Ornek (2017). Figure 6 shows the vertical stress
distribution using the probabilistic approach when we
consider a deterministic diffusion parameter D = K z. Jaky's
formula is used, $K = 1 - \sin\varphi'$. As the experiment
was performed in loose sand, we used $\varphi' = 36°$ for
our simulation, as given by the authors.

The purpose of our study is to take uncertainties into
account using interval-valued parameters. The proposed model
allows the parameter of diffusion D to be considered in
terms of intervals, or as a triangular fuzzy number. The
loading on the surface can also be taken into account as an
interval.

Figure 5. Vertical stress distribution along the depth
at different loading levels (Experiment Turedi & Ornek,
2017).

Figure 6. Vertical stress distribution along the depth
at different loading levels (Probabilistic approach,
$K = 1 - \sin\varphi'$). Horizontal axis: Vertical Stress (kPa).


Considering ranges of values for the coefficient of lateral
pressure K around the value given by the friction angle of
the sand $\varphi'$, the interval values of K used are given
at different $\alpha$-level cuts. The applied load is taken
as point valued, q = 40 kPa. The triangular fuzzy number for
K can be written in terms of intervals at level $\alpha$ as
$[K]_\alpha = [0.35 + 0.03\alpha,\; 0.41 - 0.03\alpha]$.

The membership function is given as

$$ \mu_K(x) = \begin{cases} \dfrac{x - 0.35}{0.03} & \text{if } 0.35 \le x < 0.38 \\ \dfrac{0.41 - x}{0.03} & \text{if } 0.38 \le x \le 0.41 \\ 0 & \text{otherwise} \end{cases} $$

Figure 7 shows the distribution of vertical stress at the
$\alpha$-cut levels $\alpha_0 = 0$, $\alpha_1 = 1/3$,
$\alpha_2 = 2/3$ and $\alpha_3 = 1$.

Since the parameter D depends on depth, the uncertainty
becomes more pronounced as z increases. An uncertain
parameter of diffusion has an important impact on the stress
distribution, especially at deeper levels. The observed
dispersion suggests that using deterministic values for D
gives only an approximation of the stress, which can
underestimate or overestimate the real level of stress in
the medium. Using interval values for the diffusion
parameter helps to estimate the evolution of stress
spreading with depth.
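The effect of the α-cuts of K on the axis stress can be sketched numerically. The snippet below is an approximation under stated assumptions, not the interval scheme (8): it runs the deterministic forward scheme at both end points of each K interval and takes the pointwise minimum and maximum (a vertex-type evaluation, which presumes the stress at each node varies monotonically with K). The grid, domain width and depth are illustrative choices.

```python
def alpha_cut(a, b, c, alpha):
    """Interval of the triangular fuzzy number (a, b, c) at level alpha."""
    return a + alpha * (b - a), c - alpha * (c - b)


def solve_diffusion(surface, k_coef, depth, n_steps, dx):
    """Deterministic forward scheme (4) with D = K z and fixed end values."""
    dz = depth / n_steps
    row = list(surface)
    for i in range(n_steps):
        r = k_coef * (i * dz) * dz / dx**2
        if r > 0.5:
            raise ValueError("explicit scheme unstable: use more depth steps")
        interior = [
            row[j] + r * (row[j + 1] - 2.0 * row[j] + row[j - 1])
            for j in range(1, len(row) - 1)
        ]
        row = [row[0]] + interior + [row[-1]]
    return row


def stress_envelope(surface, k_interval, depth, n_steps, dx):
    """Pointwise min/max of the profiles computed at both ends of the K interval."""
    runs = [solve_diffusion(surface, k, depth, n_steps, dx) for k in k_interval]
    lower = [min(values) for values in zip(*runs)]
    upper = [max(values) for values in zip(*runs)]
    return lower, upper


# Illustrative set-up: 40 kPa strip load, 0.1 m wide, on a 1 m wide domain.
dx = 0.02
centre = 25
surface = [40.0 if abs((j - centre) * dx) <= 0.05 else 0.0 for j in range(51)]

axis_bounds = []
for alpha in (0.0, 1.0 / 3.0, 2.0 / 3.0, 1.0):
    k_interval = alpha_cut(0.35, 0.38, 0.41, alpha)
    lower, upper = stress_envelope(surface, k_interval, depth=0.3, n_steps=600, dx=dx)
    axis_bounds.append((lower[centre], upper[centre]))
    print(f"alpha = {alpha:.2f}: axis stress in [{lower[centre]:.2f}, {upper[centre]:.2f}] kPa")
```

The stress interval on the axis narrows as α increases and collapses to a single value at α = 1, where K reduces to the crisp value 0.38.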

One can also consider uncertainty in the loading by using
intervals for the surficial load q. The influence of
interval values of q is shown in Figure 8 for
$q = [\underline{q}, \overline{q}] = [38\ \text{kPa}, 40\ \text{kPa}]$.
The parameter of diffusion is kept point valued in this
case, with K = 0.38.

Figure 7. Vertical stress distribution under axis for
different level cuts of K.

Figure 9. Vertical stress distribution for interval
surficial pressure $[\underline{q}, \overline{q}]$ and
interval coefficient $[\underline{K}, \overline{K}]$.


The uncertainty in the load affects the distribution: the
diffusion shows the vertical stress evolving as an interval
from the surface, and this continues with depth. It is
slightly increased for z deeper than half of the domain.

Combining the uncertainty due to K and q has a greater
effect on the evolution of uncertainty in the vertical
stress, as illustrated in Figure 9. The surficial pressure
is kept as q = [38 kPa, 40 kPa] and the lateral pressure
coefficient as K = [0.35, 0.41].

When uncertainty in the surficial pressure q is combined
with uncertainty in the coefficient K, we notice fast
dispersion in the vertical stress S with depth, as already
illustrated in Figure 7, but the effect is more pronounced
because the uncertainty originates from both the load and
the soil parameter.


5 CONCLUSION 


The probabilistic approach from Harr (1977) needs only one
parameter to characterize the particulate medium. It was
shown in (Bourdeau, 1986) that the lateral pressure
coefficient can be a good estimate of the parameter for
vertical stress diffusion in a granular soil. We revisited
the theory from Harr to account for uncertainty in the
diffusion parameter D and in the surficial loading. The
purpose was to show the ability of the interval-valued
method to handle uncertainty due to lack of knowledge and
data. The finite difference scheme is practical for this
type of equation. The construction of such a scheme for
interval analysis is not straightforward, as it must obey a
number of specific conditions. It was shown that uncertainty
in the diffusion parameter significantly affects the
distribution of vertical stress in the granular medium.
When combined with the uncertainty from the surficial
pressure, the stress distribution shows more dispersion with
depth. Interval-valued parameters are suitable when dealing
with lack of data and uncertainty.
ABSTRACT: In the optimization of renewable energy systems, uncertainties associated with technical
and economic parameters are rarely taken into account, which may lead to weak confidence in the results.
In this paper, we propose and investigate a 4-step methodology for uncertainty sensitivity assessment,
with the objective of improving confidence in the assessment results and supporting the decision-making
process following techno-economic optimization of the design and operation of an autonomous power
system. The methodology is applied to an off-grid system including photovoltaic production, battery and
hydrogen components (electrolyser, pressure storage and fuel cell). This energy system is modelled and
optimized with Odyssey, a simulation software developed by CEA-LITEN since 2010. We focus on static
parametrical uncertainties, linked to the energy system parameters.


1 INTRODUCTION 


1.1 Optimization of complex energy systems 


Energy systems are becoming more and more complex and
difficult to assess because of (i) the variability of the
renewable power sources and of the demand, (ii) the
resulting need for storage and (iii) the presence of
different and new energy vectors. The modelling and
simulation software Odyssey (Guinot 2013) enables
techno-economic optimization of the design and operation of
such energy systems. However, many parameters used to
simulate the systems are uncertain (e.g. static component
performances or economic properties, but also time series of
production or load profiles), and it is necessary to
evaluate the impact of these uncertainties on the design and
operation arising from the optimization process in order to
support decision-making about these systems.


1.2 Problem statement 


Up to now, techno-economic studies carried out with Odyssey,
as with most other similar simulation tools, have not taken
uncertainties into account, but have only provided
sensitivity analyses on key uncertain input parameters.
Thus, the objective of our work is to develop a
comprehensive approach to enhance the platform with
uncertainty management capabilities, from the identification
of the main sources of uncertainty to results analysis and
decision-making support.

We identified two main ways to account for the influence of
uncertainty on the results of a techno-economic
optimization. The first one consists in optimizing the
system while taking the uncertain parameters into account,
so as to obtain results that are robust to the considered
uncertainties. The second one consists in optimizing the
system and then applying the uncertainties to evaluate the
sensitivity of this optimized design to them; this paper
presents an application of the second one. In the first
part, we present a 4-step methodology for uncertainty
assessment. In the second part, we present a representative
example of energy system optimization, which addresses the
competition between technologies, energy storage in an
off-grid power system and the optimization of the design and
operation of this system without uncertainty. In the third
part, we apply our methodology to the described case study,
showing how we modelled uncertainties on selected parameters
and how to assess the sensitivity of the results to these
uncertainties. Finally, the fourth part presents our
conclusions and gives directions for our future research
work.
2 PROPOSED METHODOLOGY


The methodology of uncertainty treatment proposed
and implemented in this work is based on the
four-step approach described by de Rocquigny
(de Rocquigny 2006a, b), shown schematically in
Figure 1. This methodology is summarized below.


2.1 Model of the system 


For step A, we assume that the considered energy 
system model is available and implemented in the 
software Odyssey, used as a black box. 

The inputs of this black box are technical and
economic parameters of two types: design variables
and uncertain parameters. The uncertain parameters
are arbitrarily decided by the decision-maker or by
the software user. On the contrary, the design
variables can be set, and even optimized, like the
optimization variables in Table 1. The next part
describes the model of our case study in detail.

The outputs of the black box are technical and 
economic indicators, which assess the performances 
of the design of the system. These indicators will 
be described later. 


[Figure 1: flowchart of the framework. Recoverable block labels:
quantification of sources of uncertainty (random variables, fields);
Step C, uncertainty propagation (response variability);
Step C', sensitivity analysis.]

Figure 1. Uncertainty analysis common framework.

Table 1. Optimization variables.

                                          Optimization borders
Variable                        Units     Minimum    Maximum
Number of Modules PV*           -         1          none
Number of Battery Units**       -         1          150
Number of electrolyser cells    -         3          none
Fuel Cell Stack Max Power       W         1          none
Volume of pressure tank         m³        1          none

*Each module has a peak power of 1 kWp.
**Each unit has a rated capacity of 10 kWh.


2.2 Uncertainty modelling 


Regarding the quantification of sources of
uncertainty, at step B, we model the sources of
uncertainty in a probabilistic framework, with
probability density functions. In this study, we
assume that all the considered uncertainties are
independent.

Three classical probability distributions are
considered: uniform, beta and Weibull. They are
detailed in Table 3 and in Section 4.1, where the
whole uncertainty modelling is applied to our case
study.
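
As an illustration of step B, the following Python sketch draws independent samples from a uniform law and from a beta law rescaled to a [Min; Max] interval. The numerical values are the PV CAPEX and PV OPEX entries of Table 3; the function names, the sample size and the use of NumPy are assumptions made for this example and are not part of Odyssey or Uranie.

```python
import numpy as np


def sample_beta_on_interval(rng, alpha, beta, low, high, size):
    """Draw beta(alpha, beta) samples rescaled from [0, 1] to [low, high]."""
    return low + (high - low) * rng.beta(alpha, beta, size=size)


def sample_uniform(rng, low, high, size):
    """Draw samples from the uniform law U[low; high]."""
    return rng.uniform(low, high, size=size)


rng = np.random.default_rng(0)
n_samples = 10_000  # illustrative sample size

# PV CAPEX in EUR/Wp: B[alpha = 1.8; beta = 6; Min = 0.374; Max = 3.165]
pv_capex = sample_beta_on_interval(rng, 1.8, 6.0, 0.374, 3.165, n_samples)
# PV OPEX in % of CAPEX: U[2; 10]
pv_opex = sample_uniform(rng, 2.0, 10.0, n_samples)

# Theoretical mean of the rescaled beta law, for comparison with the sample mean.
pv_capex_mean = 0.374 + (3.165 - 0.374) * 1.8 / (1.8 + 6.0)
```

Because the uncertainties are assumed independent, each parameter can be sampled separately in this way before being passed to the model.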


2.3 Uncertainty propagation 


At step C, the propagation of uncertainties allows 
us to see how the outputs of the model respond 
to the uncertainties: we achieved this by coupling 
a Monte Carlo launcher provided by the Uranie 
software (Bouloré 2012) and the executable 
Odyssey, as schematized in Figure 2. 


2.4 Sensitivity analysis 


Finally, at step C', the sensitivity analysis makes
it possible to identify the uncertainties that have
the strongest influence on the outputs of the
model. This identification gives us the possibility
to try to reduce the uncertainty of the most
influential sources, in order to reduce the
uncertainty of the outputs and facilitate decision
making (Borgonovo 2016). Among the different
methods of sensitivity analysis, we chose the
Morris method and the computation of Sobol indexes,
which are complementary. The Morris method
(Morris 1991) first allows the uncertain parameters
to be classified into three categories:

— the parameters with negligible effects,

— the parameters with linear effects and without
interaction,

— the parameters with nonlinear effects and/or
interactions (without distinction between these two
effect types).

Table 2. Optimized study cases design and indicators.

Case                                0         01        05       1
Number of Modules PV (-)            735       735       660      600
Number of Battery Units (-)         146       145       135      138
Number of electrolysis cells (-)    8         5         5        5
Fuel Cell Stack Max Power (W)       43,500    10,500    5000     5000
Volume of pressure tank (m³)        31        16        3.5      3.5
Unsatisfied load (%)                0         0.1       0.5      1
LEC (€/MWh)                         404.9     336.1     295.5    280.2
Unused Primary Production (%)       39.2      39.7      33.1     26.8

Table 3. Uncertain parameters and associated probability
distributions (Uniform, Beta or Weibull).

Component
  Parameter              Unit       Law                                                Reference
PV
  CAPEX                  €/Wp       B [α = 1.8; β = 6; Min = 0.374; Max = 3.165]       IRENA (2016)
  OPEX                   % CAPEX    U [2; 10]                                          i.d.*
Battery bank
  CAPEX                  €/Wh       B [α = 1.31; β = 3.5; Min = 0.102; Max = 0.354]    Battke et al. 2013
  OPEX                   % CAPEX    U [2; 10]                                          i.d.*
  Capacity loss          Wh/h       U [1.4E-5; 4.2E-5]                                 Riffonneau et al. 2007
  Self-discharge         W          U [3.75E-5; 1.4E-4]                                IRENA 2017
  Charge efficiency      -          B [α = 1; β = 4; Min = 0.8; Max = 0.9]             Battke et al. 2013
  Discharge efficiency   -          B [α = 1; β = 4; Min = 0.8; Max = 0.9]             Battke et al. 2013
Electrolyser
  CAPEX                  €/W        U [6.5; 13.1]                                      i.d.*
  OPEX                   % CAPEX    U [2; 10]                                          i.d.*
  Degradation            µV/h       U [0.4; 15]                                        Bertuccioli et al. 2014
  Cell voltage**         V          U [1.39; 1.54]                                     i.d.*
FC
  CAPEX                  €/W        U [2.2; 8]                                         i.d.*
  OPEX                   % CAPEX    U [2; 10]                                          i.d.*
  Degradation            µ%/h       U [0.45; 1.35]                                     Kurtz et al. 2015
  Efficiency**           -          U [0.30; 0.34]                                     FutureE Fuel Cell Solutions GmbH
H2 tank
  CAPEX**                €/m³       U [18,055; 28,239]                                 i.d.*
  OPEX                   % CAPEX    U [2; 10]                                          i.d.*

*internal data.
**see 4.1.1.

Figure 2. Odyssey/Uranie coupling.


This screening method has the advantage of sorting
the uncertain parameters at a limited computational
cost. Indeed, the Morris method requires N code
computations, with:


N = r × (d + 1)    (1)


with:


— r ∈ [4; 10]: number of trajectories (repetitions),
— d: number of uncertain parameters.
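
The cost formula (1) can be checked on a small example. The Python sketch below is a simplified one-at-a-time screening in the spirit of the Morris method: it uses only increasing steps on a 4-level grid of the unit hypercube, whereas the original method also allows decreasing steps. The function `morris_screening`, the toy model and all numerical settings are hypothetical and only stand in for a call to the real simulation code.

```python
import numpy as np


def morris_screening(model, d, r, rng, levels=4):
    """Simplified Morris screening on [0, 1]^d with r trajectories.

    Returns mu_star (mean absolute elementary effect), sigma (standard
    deviation of the elementary effects) and the number of model runs.
    """
    delta = levels / (2.0 * (levels - 1))
    # Start values chosen so that x + delta stays inside [0, 1].
    start_grid = np.arange(levels // 2) / (levels - 1)
    effects = np.empty((r, d))
    n_eval = 0
    for t in range(r):
        x = rng.choice(start_grid, size=d)
        y = model(x)
        n_eval += 1
        # Move one parameter at a time, in random order.
        for i in rng.permutation(d):
            x_next = x.copy()
            x_next[i] += delta
            y_next = model(x_next)
            n_eval += 1
            effects[t, i] = (y_next - y) / delta
            x, y = x_next, y_next
    mu_star = np.abs(effects).mean(axis=0)
    sigma = effects.std(axis=0, ddof=1)
    return mu_star, sigma, n_eval


def toy_model(x):
    """Hypothetical model: x[0] is linear, x[1] and x[2] interact, x[3] is inert."""
    return 3.0 * x[0] + x[1] * x[2]


rng = np.random.default_rng(1)
mu_star, sigma, n_eval = morris_screening(toy_model, d=4, r=10, rng=rng)
```

In this toy case the three categories appear directly: the inert parameter has a zero mean effect, the linear parameter has a non-zero mean effect with zero spread, and the interacting parameters have a non-zero spread. The number of runs equals r × (d + 1).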


We used the Morris method to eliminate the
uncertain parameters with negligible effects on the
output indicators before calculating the Sobol
indexes (Sobol 1993). This second part of the
sensitivity analysis requires many more code
computations, which is why it is relevant to apply
the Morris method first. Indeed, the calculation of
the Sobol indexes requires N code computations,
with:


N = n × (d + 2)    (2)


with:
— n: size of the sample,
— d: number of uncertain parameters.


Together, the Morris method and the computation of
the Sobol indexes make up the sensitivity analysis
step.
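
To show where the n × (d + 2) runs of formula (2) come from, the sketch below estimates first-order and total-order Sobol indexes with a common pick-and-freeze scheme (two base sample matrices plus one mixed matrix per parameter). This is a generic estimator written for illustration; it is not claimed to be the estimator implemented in Uranie. The toy model and the sample size are hypothetical.

```python
import numpy as np


def sobol_indexes(model, d, n, rng):
    """Estimate first-order and total-order Sobol indexes on [0, 1]^d.

    Uses two independent n x d samples (a, b) and d mixed samples, i.e.
    n * (d + 2) model runs in total.
    """
    a = rng.random((n, d))
    b = rng.random((n, d))
    f_a = model(a)
    f_b = model(b)
    n_eval = 2 * n
    # Centre the outputs to reduce the variance of the estimators.
    mean = 0.5 * (f_a.mean() + f_b.mean())
    f_a = f_a - mean
    f_b = f_b - mean
    variance = 0.5 * (f_a.var() + f_b.var())
    first = np.empty(d)
    total = np.empty(d)
    for i in range(d):
        ab = a.copy()
        ab[:, i] = b[:, i]  # column i taken from b, all others from a
        f_ab = model(ab) - mean
        n_eval += n
        first[i] = np.mean(f_b * (f_ab - f_a)) / variance
        total[i] = 0.5 * np.mean((f_a - f_ab) ** 2) / variance
    return first, total, n_eval


def toy_model(x):
    """Hypothetical additive model; the third parameter has no effect."""
    return x[:, 0] + 2.0 * x[:, 1]


rng = np.random.default_rng(2)
n, d = 20_000, 3
first, total, n_eval = sobol_indexes(toy_model, d, n, rng)
```

For this additive toy model the exact indexes are 0.2, 0.8 and 0, and the first-order and total-order values coincide because there is no interaction.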


3 MODELLING AND OPTIMIZATION 
OF THE SYSTEM WITHOUT 
UNCERTAINTY 


3.1 Case study description


The case study investigated in this paper is a
stand-alone power system located in Nigeria, shown
in Figure 3. It includes:


— an electrical load (Load),

— a photovoltaic (PV) plant,

— a bank of lead-acid batteries,

— a complete hydrogen chain made of a PEM
electrolyser, a pressurized tank to store the
hydrogen and a PEM fuel cell.


Figure 3. Architecture of the case study.


This example is representative of (i) the operating
competition between batteries and a hydrogen chain,
(ii) the issue of energy storage in off-grid power
systems, and (iii) the PV over-sizing that results
from seeking load satisfaction.

The implemented power management strategy is based
on on/off switching of the electrolyser (ELY) and
the fuel cell (FC), as originally developed by
Ulleberg (Ulleberg 2004) and applied to a similar
case by Guinot et al. (Guinot et al. 2015). The
operation depends on the state of charge (SOC) of
the battery and on the SOC levels that trigger the
switching of the fuel cell and the electrolyser
(FC+, FC-, ELY+ and ELY-, i.e. the operation
parameters) given in Figure 4.
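
A threshold-based on/off strategy of this kind can be sketched as a hysteresis controller driven by the battery SOC. In the Python example below, the four threshold values and the mapping of the switching levels to "on" and "off" actions are assumptions made for illustration; they are not the values or the exact logic used in the case study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical SOC switching levels (fractions of full charge)."""
    fc_start: float = 0.30   # fuel cell starts when SOC falls to this level
    fc_stop: float = 0.45    # fuel cell stops when SOC rises to this level
    ely_start: float = 0.90  # electrolyser starts when SOC rises to this level
    ely_stop: float = 0.80   # electrolyser stops when SOC falls to this level


def update_states(soc, fc_running, ely_running, th=Thresholds()):
    """Return the new (fuel cell, electrolyser) on/off states for a given SOC."""
    if soc <= th.fc_start:
        fc_running = True
    elif soc >= th.fc_stop:
        fc_running = False
    if soc >= th.ely_start:
        ely_running = True
    elif soc <= th.ely_stop:
        ely_running = False
    return fc_running, ely_running


def run(soc_trace):
    """Apply the strategy along a sequence of SOC values."""
    fc_running, ely_running = False, False
    history = []
    for soc in soc_trace:
        fc_running, ely_running = update_states(soc, fc_running, ely_running)
        history.append((fc_running, ely_running))
    return history


history = run([0.50, 0.28, 0.40, 0.46, 0.85, 0.92, 0.85, 0.79])
```

Between its two thresholds each device keeps its previous state, which avoids rapid on/off cycling when the SOC hovers around a single level.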

In this case study, the replacement of the com- 
ponents is not considered. 


3.2 Optimization criteria and variables 


The operation parameters are considered constant
during the whole simulated operating period. The
optimization of the system operation consists in
finding the operation parameters best suited to
minimizing both the electricity cost and the
unsatisfied load, as for any other design
parameter. No distinction is made between plant and
controller optimization problems, as would have
been necessary if the operation parameters had
evolved according to the dynamics of the system
(Fathy et al. 2001). However, in this study the
operation parameters are not optimized.

We selected as optimization variables the five
sizing variables shown in Table 1.

The Odyssey multi-criteria optimization process
uses a genetic algorithm to minimize, on the one
hand, the standard Levelized Electricity Cost (LEC)
in €/MWh and, on the other hand, the unsatisfied
load (UL) in %, i.e. the energy-based percentage of
unmet electrical load. These two objective
functions are therefore in competition. It is often
observed that lowering the load satisfaction, for
example by reducing the size of the storage system,
leads to a lower system cost and thus to a lower
cost of the produced electricity. Conversely,
improving the satisfaction of the load by
oversizing the system tends to increase the cost of
the produced electricity.

The simultaneous optimization of design and
operation parameters makes it possible to take
maximum benefit from each optimized design.

The multi-criteria algorithm used in this work is
the Strength Pareto Evolutionary Algorithm 2
(Zitzler et al. 2001).
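
The notion of competing objectives can be made concrete with a non-dominance filter. The sketch below is not SPEA2; it only extracts, from a list of candidate designs, those for which neither LEC nor UL can be improved without degrading the other. The first four (LEC, UL) pairs are the values of Table 2; the fifth candidate is a hypothetical dominated design added for the example.

```python
def pareto_front(points):
    """Return the non-dominated (lec, ul) pairs when both are minimized."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


candidates = [
    (404.9, 0.0),  # case 0
    (336.1, 0.1),  # case 01
    (295.5, 0.5),  # case 05
    (280.2, 1.0),  # case 1
    (350.0, 0.5),  # hypothetical design, dominated by case 05
]
front = pareto_front(candidates)
```

The four selected cases all remain on the front, which reflects the trade-off described above: each reduction in LEC is paid for by a higher unsatisfied load.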


3.3 Optimization results 


Due to the competition between the two optimization
criteria, LEC and UL, the optimization results take
the shape of a Pareto front, as shown in Figure 5.
On this Pareto front, four design points were
selected, corresponding to different indicator
values (LEC and UL). We selected the points
according to their UL, and we defined four cases
named after their UL value, with the designs given
in Table 2.

The resulting cost distributions for the four
selected cases, given in Figure 6, illustrate the
relative importance of the component costs within
the overall system cost. The component that most
influences the cost is the PV array, which
contributes more than half of it in every case,
followed by the battery bank (electrical storage).
The fuel cell accounts for a significant part of
the cost only in case 0, which is intuitive given
the maximal power of the fuel cell stack (Table 2).

This part described the way we identified and
selected the optimal system designs without
considering uncertainty. In the following, we
investigate the influence of uncertainties on the
selected cases (optimized without uncertainties)
and, through them, on the Pareto front. In fact,
these four cases show similar cost distributions,
and it is therefore interesting to observe whether
applying uncertainties modifies the comparison
between them. As we did not optimize the operating
parameters, it will however not be possible to
check whether operating parameters can
counterbalance the effect of uncertainties on the
design.

[Figure 4: diagram with recoverable labels "Batteries SOC" and
"FC/ELY States".]

Figure 4. Power management strategy of hydrogen chain.

Figure 5. Pareto front resulting from the optimization of the system.

Figure 6. Cost distributions for the 4 different cases.

4 APPLICATION OF THE PROPOSED 
METHODOLOGY TO THE CASE STUDY 


4.1 Uncertainty modelling 


In this paper, we focus on static parametric
uncertainties, linked to the energy system
parameters. We identified 25 parameters as sources
of uncertainty, all of an epistemic nature. Indeed,
the parameters of components that are not
completely mature (such as the electrolyser and the
fuel cell) are not well known. Moreover, even
mature components do not have parameters with
perfectly known values.

An extensive literature search was carried out to
identify existing, validated or accepted
probabilistic uncertainty models for components of
energy systems. Table 3 summarizes the uncertain
characteristics of the system components considered
in the study, with their associated probability
distribution and the reference for the chosen
uncertainty model. Uniform probability
distributions were assigned to the uncertain
characteristics of the innovative components (in
order to reflect the equiprobability of the
possible values) and to the uncertain parameters of
mature components when no better distribution (e.g.
from expert judgement) is available.

There are too many uncertain parameters to be
described exhaustively, but the following
subsections focus on three particular types.

4.1.1 Uncertain parameters modelled by polynomial models

Three uncertain parameters, i.e. the cell voltage
of the electrolyser, the efficiency of the fuel
cell and the CAPEX of the hydrogen pressure tank,
are modelled by polynomials, which are respectively
functions of the current density in the
electrolyser, the pressure (P/Pnominal) of the fuel
cell and the volume of the hydrogen pressure tank.
We assumed that only the constant coefficient was
uncertain, thus generating an area of possible
values instead of a curve.
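
The effect of an uncertain constant coefficient can be sketched as follows. In this Python example the non-constant polynomial coefficients and the normalized current-density axis are hypothetical, and the interval [1.39; 1.54] of Table 3 is assumed to apply to the constant coefficient of the cell-voltage polynomial.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical quadratic and linear coefficients of the polynomial.
shape_coeffs = np.array([0.05, 0.30])
# Assumed range of the uncertain constant coefficient (Table 3, cell voltage).
c0_low, c0_high = 1.39, 1.54


def cell_voltage(current_density, c0):
    """Evaluate the polynomial with constant coefficient c0."""
    return np.polyval(np.append(shape_coeffs, c0), current_density)


current_density = np.linspace(0.0, 1.0, 11)  # normalized, illustrative axis
c0_samples = rng.uniform(c0_low, c0_high, size=200)
curves = np.array([cell_voltage(current_density, c0) for c0 in c0_samples])

# Envelope of the sampled curves: the "area of possible values".
lower = curves.min(axis=0)
upper = curves.max(axis=0)
band_width = upper - lower
```

Since only the constant term varies, every sampled curve is a vertical shift of the same polynomial, so the band has the same width at every abscissa.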


4.1.2 Uncertain parameters of the photovoltaic panels

No technical parameter of the photovoltaic panels
was considered uncertain in this paper. This is
because the solar production is defined directly by
time series data representing the electrical
production, which are not considered uncertain in
this paper.


4.2 Uncertainty propagation through Odyssey 


The immediate effect of the uncertainties is
observed with a Monte Carlo approach using the
Odyssey model. A realization of all the uncertain
parameters is sampled and, based on this
realization, the system is simulated using its
Odyssey model to propagate the uncertainty to the
output performance indicators (UL and LEC). This
simulation is repeated for 300 Monte Carlo
histories. The dispersions of the indicators (LEC
and unmet load) are then analyzed. They differ
between the two indicators and depend on the case,
i.e. on the design of the system. The results are
given in Figure 7.

We observe the relative dispersion of the LEC and
of the unmet load, calculated as the ratio of the
standard deviation to the average (Fig. 8). The
relative dispersion of the unmet load decreases
significantly from case 0 to case 1, i.e. inversely
to the nominal unmet load of the design, while the
relative dispersion of the LEC increases slowly
from case 0 to case 1, i.e. also inversely to the
nominal LEC of the design.
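
The propagation loop and the relative dispersion can be sketched in a few lines of Python. The function `toy_black_box` is a hypothetical stand-in for one Odyssey simulation, and the 20-year horizon is an assumption; only the input laws (PV CAPEX and PV OPEX from Table 3) and the number of histories (300) come from the text.

```python
import numpy as np


def toy_black_box(pv_capex, pv_opex_percent, years=20):
    """Hypothetical cost-like indicator: investment plus yearly OPEX."""
    return pv_capex * (1.0 + years * pv_opex_percent / 100.0)


rng = np.random.default_rng(4)
n_histories = 300

# One realization of the uncertain parameters per Monte Carlo history.
pv_capex = 0.374 + (3.165 - 0.374) * rng.beta(1.8, 6.0, n_histories)
pv_opex = rng.uniform(2.0, 10.0, n_histories)

# Propagate each realization through the model.
indicator = toy_black_box(pv_capex, pv_opex)

# Relative dispersion: standard deviation divided by the average.
relative_dispersion = indicator.std(ddof=1) / indicator.mean()
```

Repeating this computation for each of the four designs would give one relative dispersion per case and per indicator, which is the quantity compared in Figure 8.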


4.3 Sensitivity assessment 


After propagating uncertainties, the sensitivity
analysis described below aims to identify the
uncertain parameters with the most influence on the
output variance.


4.3.1 Application of Morris method 

The Morris method allows us to select the uncertain
parameters that have a non-negligible influence on
each output indicator (marked with + in Table 4).
For both the LEC and the UL, the method eliminates
the same parameters whatever the design
configuration.

However, the eliminated parameters are not the same
for the LEC and the UL. Indeed, the UL is
independent of the economic parameters, so we do
not keep them for the calculation of the Sobol
indexes. On the contrary, the LEC is not influenced
by economic parameters only, since it also depends
on the electricity production. For the calculation
of the Sobol indexes related to the LEC, we
therefore also keep the most influential technical
parameters identified by the Morris method.

Figure 7. LEC and UL indicators for the four selected
design configurations with uncertainties.

Figure 8. Relative dispersion of LEC and UL for the
four study cases.


4.3.2 Analysis with Sobol indexes
The Sobol indexes quantify the contributions of the
considered parameters to the variance of an output
indicator.


4.3.2.1 Unmet load
Considering the unmet load variance, the Sobol
indexes shown in Figure 9 indicate that the most
influential uncertain parameter, whatever the case,
is the capacity loss of the battery, followed by
the discharge efficiency of the battery. The
importance of these two parameters, both linked to
the battery bank, shows the major role played by
this component in load satisfaction. The discharge
efficiency is much more influential than the charge
efficiency because the PV installation is
oversized; the solar production is therefore in
excess, which limits the role of the charge
efficiency. The charge efficiency becomes more
important only in case 05 and case 1 (where it is
responsible for 3% and 5% of the unmet load
variance respectively), in which the PV
installation is smaller (Table 2). The predominance
of the battery over the hydrogen chain is due to
their design and control. The powers delivered by
the battery on the one hand and by the hydrogen
chain (i.e. by the fuel cell) on the other hand
show that the hydrogen chain supplies negligible
electric power, even in case 0, in which the fuel
cell is largest, i.e. in which conditions are most
favourable to hydrogen chain production (Fig. 10).


4.3.2.2 Levelized electricity cost

The Sobol indexes indicate that, whatever the case,
the most influential uncertain parameter on the LEC
variance is the PV CAPEX, far ahead of the PV OPEX
and, to a lesser degree, the battery bank CAPEX.

We notice that the hydrogen chain plays a
significant role in the unmet load variance only in
case 0, i.e. with its largest design: 26% of the
total cost for the whole chain and 19% for the fuel
cell (Fig. 6).

We can observe that, although the Sobol index of a
given parameter is linked to the cost weight of the
corresponding component (studied in Section 2),
there is no direct proportional relation, because
of the influence of the probability distribution of
the input parameter values. For instance, the
battery bank, which plays an important role in the
system cost (between 19% and 26%), has a relatively
small impact (less than 8%) on the LEC variance,
while the PV installation, which is the main
contributor to the system cost but accounts for no
more than 68% of it, is responsible (CAPEX and OPEX
combined) for the overwhelming part (between 88%
and 94%) of the LEC variance.

Table 4. Morris method results.

Component
  Parameter              Unit       LEC    Unmet load
PV
  CAPEX                  €/Wp       +      -
  OPEX                   % CAPEX    +      -
Battery bank
  CAPEX                  €/Wh       +      -
  OPEX                   % CAPEX    +      -
  Capacity loss          Wh/h
  Self-discharge         W
  Charge efficiency      -          -      +
  Discharge efficiency   -
Electrolyser
  CAPEX                  €/W        +      -
  OPEX                   % CAPEX    +      -
  Degradation            µV/h       -      +
  Cell voltage           V          -      +
FC
  CAPEX                  €/W        +      -
  OPEX                   % CAPEX    +      -
  Degradation            µ%/h       -      +
  Efficiency             -          -      +
H2 tank
  CAPEX                  €/m³       +      -
  OPEX                   % CAPEX    +      -


5 CONCLUSIONS AND FUTURE 
RESEARCH WORK 


In this paper, we investigated a comprehensive
four-step methodology to evaluate the impact of
uncertainties, in order to improve confidence in
the assessment results of a complex autonomous
power system modelled and optimized with Odyssey.
We chose to optimize the system first and then to
consider the uncertainties, which were treated as
independent.

The results of the uncertainty propagation and of
the sensitivity analysis show that the most
influential uncertain parameters are linked to the
design of the system and, in our case study, are:


— the capacity loss followed by the discharge
efficiency of the battery for the unmet load,

— the PV CAPEX followed by the PV OPEX and the
battery CAPEX for the LEC.


Several interesting points still have to be
thoroughly investigated. We now want to investigate
(i) the effect of the choice of the probability
distributions associated with the uncertain
parameters, (ii) the inclusion of the uncertainty
relative to time series, (iii) the optimization of
operation parameters as a way to counterbalance
uncertainties on the design of the system, and
mainly (iv) optimization that directly takes the
uncertainties into consideration.

Figure 9. Normalized Sobol indexes (total order) for
the four different cases, related to the unmet load.

Figure 10. Powers supplied by the hydrogen chain and
the battery bank in the case 0.

Figure 11. Normalized Sobol indexes (total order) for
the four different cases, related to the LEC.
Modular global uncertainty analysis of event-driven indicators of
system’s availability


Pawel M. Stano & Michal Spirzewski 
National Centre for Nuclear Research, Swierk, Poland 


ABSTRACT: The purpose of this manuscript is to propose a novel methodology for conducting
uncertainty analysis of event-driven indicators of a system’s availability. In this approach the
system’s availability depends on the frequency of component failures and on their duration. For
all components, the parameters that determine their behavior are obtained from statistical
analysis, which means that they are known only with some uncertainty; hence the system’s
availability is also uncertain. In this paper we present a numerically efficient algorithm that
assesses the uncertainty about the system’s availability. A modular approach has been adopted, in
which the uncertainty about the availability of the complete system is estimated by combining the
information obtained for separate subsystems, using knowledge of the functional relationships
between them. Furthermore, to ensure numerical efficiency, a novel approach has been adopted that
combines a screening approach with quasi-random Sobol sampling. The functionality of the
proposed method is demonstrated on an illustrative example.


1 INTRODUCTION 


In this paper we develop a methodology to
investigate complex input-output stochastic systems
used in reliability studies. We consider systems
whose inputs are defined as stochastic renewal
processes that model occurrences of the so-called
basic events, i.e., the events that cause the
system’s failure, and whose outputs are defined as
RAMI metrics. Basic events can cause a system
failure in two ways: directly, or indirectly, e.g.,
by triggering a chain of events that leads to
system failure. The events and their corresponding
parameters are usually identified during a standard
Failure Mode and Effects Analysis (FMEA) (Stamatis,
2003), which combines a detailed analysis of the
technical specifications of plant components, in
order to identify their failure modes, with a
detailed analysis of operational interactions
within the plant subsystems, in order to discover
the effects of the previously identified failure
modes. Thus, performing an FMEA allows for an
efficient description of systemic structural
dependencies (in the form of logic trees or
functional block diagrams). A properly performed
FMEA is considered a prerequisite of the analysis
described in this paper. In other words, we assume
that the complete collection of basic events,
together with the paths of failure propagation, has
already been identified. With such assumptions in
place, it is possible to analyze operational
metrics of the system. This is normally done by
performing a full Reliability, Availability,
Maintainability and Inspectability (RAMI) analysis
(Stapelberg, 2009). The full RAMI analysis is out
of the scope of this paper. Instead, the focus is
on investigating the impact that the uncertainty in
input parameters has on selected availability
metrics of the system, which, in general, measure
the proportion of time the system is operational
with respect to the system’s total lifetime. We
adopt a block-wise approach, similar to Dynamic
Reliability Block Diagrams (DRBDs) (Distefano &
Xing, 2006), in which a complex system is broken
down into several layers of smaller subsystems. The
operational conditions of components are modelled
as stochastic processes whose evolution is
determined by occurrences of failure events. Each
component is associated with an effect function,
which determines whether the failure of the
component leads to deterioration of the system
(non-critical failure) or to a complete shutdown to
conduct necessary repairs (critical failure).

The paper is organized as follows. Section 2
briefly covers the preliminaries necessary to
understand the content of the paper. Section 3
describes the general procedure for conducting
modular global uncertainty analysis for
event-driven indicators. In Section 4 the algorithm
described in Section 3 is applied to an
illustrative case study based on the examination of
an injector designed for the International Fusion
Materials Irradiation Facility (IFMIF) (Bargallo
Font, 2014). Section 5 concludes the paper with a
discussion of the advantages of the proposed method
and of its limitations in the current state of
development.
2 PRELIMINARIES


The analysis presented in this paper is a two-level
probabilistic approach. In this approach the prior
knowledge Priors(Θ) about the uncertainty
associated with the basic events e_i identified in
the FMEA is fed to the quasi-random uncertainty
sampler. This higher level algorithm generates a
representative sample of the uncertain parameters
θ_1, ..., θ_n of the stochastic input variables,
which are sent in parallel to the lower level
algorithm. In these blocks probabilistic Monte
Carlo simulations of the system are performed to
compute the RAMI metrics for a given set of the
uncertain parameters θ_1, ..., θ_n of the
stochastic input variables. These metrics are
combined in a non-parametric filter for
post-processing, which leads to the final global
RAMI analysis. The structure of this two-level
algorithm is presented in Figure 1.

The structure of the higher level algorithm 
(Uncertainty Sampler) is presented in detail in Sec- 
tion 3. In this section we shall briefly describe how 
the lower level algorithm works. 


2.1 Probabilistic Monte Carlo RAMI simulations 


The lower level algorithm is designed to study the
effects of the basic events mentioned in the
previous section on the functionality of the
system. In this work this was done by running
probabilistic Monte Carlo simulations of the DRBD
model of the system using the Availsim 2.0 program
(Bargallo Font, 2014). With this engine it was
possible to repeatedly simulate the operation of a
system over long periods of time with fixed
parameters but random initial conditions, which in
our case were the times of failure associated with
the identified basic events. Thus, each simulation
run was associated with a different failure
sequence. When a failure occurs, the performance of
the system is reduced by a certain factor. As
failures accumulate, the performance decreases
until it drops below a critical threshold, at which
point the system must be shut down for the time
needed to perform corrective repairs that raise its
performance back above the minimal threshold. Note
that for critical components a single failure leads
to an immediate shutdown of the system. After the
system is restarted, a new operational cycle begins
and lasts until new failures, possibly of the same
components, cause the system to shut down again.
After the simulation is over, the total down time
due to failures is calculated by adding all the
down times caused by individual events. Then, the
expected down time of the system is computed by
averaging the total down times obtained in the
individual Monte Carlo runs of the simulation. Due
to the high complexity of the considered systems,
the time to complete a single run of the simulation
can be significant. Therefore, it is advisable to
stop further Monte Carlo runs as soon as the
expected down time of the system (computed over all
the Monte Carlo runs) stabilizes.

Figure 1. The schematic depiction of the complete two-
level Global Uncertainty Analysis (GUA) algorithm.
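
The stopping rule can be illustrated with a toy example. The Python sketch below does not use Availsim; it simulates a single hypothetical component that alternates exponential up and down times, and it stops adding Monte Carlo runs once the running mean of the total down time changes by less than a relative tolerance over the last `window` runs. All names and numerical values are assumptions chosen for the example.

```python
import numpy as np


def simulate_total_downtime(rng, mtbf=500.0, mean_down=20.0, horizon=10_000.0):
    """Toy run: alternate exponential up and down times until the horizon."""
    t, total_down = 0.0, 0.0
    while True:
        t += rng.exponential(mtbf)          # time to next failure
        if t >= horizon:
            return total_down
        down = min(rng.exponential(mean_down), horizon - t)
        total_down += down
        t += down


def expected_downtime(rng, rel_tol=0.005, window=50, max_runs=5000):
    """Average total down time, stopping once the running mean stabilizes."""
    totals = []
    while len(totals) < max_runs:
        totals.append(simulate_total_downtime(rng))
        if len(totals) >= 2 * window:
            current = np.mean(totals)
            previous = np.mean(totals[:-window])
            if abs(current - previous) <= rel_tol * abs(current):
                break
    return float(np.mean(totals)), len(totals)


rng = np.random.default_rng(5)
mean_downtime, n_runs = expected_downtime(rng)
```

The tolerance and window trade accuracy against computing time: a looser tolerance stops earlier but leaves more Monte Carlo noise in the estimate.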


2.2 Output indicators of availability 


With every single run of the simulation, a
realization of a collection of independent
stochastic renewal processes E_i is obtained. Each
E_i models the occurrence of the failure basic
event e_i, where the time between failures follows
an exponential distribution with expectation
defined by the Mean Time Between Failure (MTBF)
parameter. The time required for the process to
renew, the down time T_{down}, is stochastic
itself, as it is determined by the sum of
independent random variables:

T_{down} = T_{access} + T_{repair} + T_{recovery},    (1)

where the variables are defined as follows:

a. T_{access}: time required to access the
component to be repaired. The variable follows an
exponential distribution with the Mean Access Time
(MAT);

b. T_{repair}: time required to perform repairs of
the broken components. The variable follows an
exponential distribution with the Mean Time To
Repair (MTTR);

c. T_{recovery}: time required for the system to
recover to the nominal operating conditions. The
variable follows an exponential distribution with
the Mean Recovery Time (MRT).
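
Equation (1) translates directly into a sampling routine. In the Python sketch below, the values chosen for MAT, MTTR and MRT are hypothetical; only the structure (a sum of three independent exponential variables) comes from the text.

```python
import numpy as np


def sample_downtime(rng, mat, mttr, mrt, size):
    """Sample T_down = T_access + T_repair + T_recovery (all exponential)."""
    t_access = rng.exponential(mat, size)      # mean MAT
    t_repair = rng.exponential(mttr, size)     # mean MTTR
    t_recovery = rng.exponential(mrt, size)    # mean MRT
    return t_access + t_repair + t_recovery


rng = np.random.default_rng(6)
t_down = sample_downtime(rng, mat=2.0, mttr=8.0, mrt=1.0, size=50_000)
mean_t_down = t_down.mean()
```

The sample mean approaches MAT + MTTR + MRT, while the distribution of the sum is no longer exponential.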
An important distinction between the variables
listed above is that the parameters of the variable
T_{repair} usually depend only on the
characteristics of the components the basic events
are associated with; thus, normally, they vary from
event to event. On the other hand, the parameters
of the variables T_{access} and T_{recovery}
usually depend directly on the system and are often
the same for multiple components (e.g., those that
share the same location within a subsystem).

The stochastic realizations of the variables
T_{down}, T_{access}, T_{repair} and T_{recovery}
are used to define the output indicators of
availability, which are functions of the basic
events that cause system failure (inputs). We
consider standard system availability indicators
(Stapelberg 2009, Lie & Hwang & Tillman, 1977) such
as:
1. Inherent Availability (IA), which is defined as:

IA = \frac{T^{\Sigma}_{up}}{T^{\Sigma}_{up} + T^{\Sigma}_{down}},    (2)

2. Achieved Availability (AA), defined as:

AA = \frac{T^{\Sigma}_{up}}{T^{\Sigma}_{up} + T^{\Sigma}_{down} + T^{\Sigma}_{maintenance}},    (3)

3. Operational Availability (OA), defined as:

OA = \frac{T^{\Sigma}_{up}}{T^{\Sigma}_{total}},    (4)

where:

• T^{\Sigma}_{up} counts the total time the system
is operational;

• T^{\Sigma}_{down} counts the total time the
system is shut down due to failures;

• T^{\Sigma}_{maintenance} counts the total time
the system is shut down due to routine maintenance;

• T^{\Sigma}_{total} counts the total life time of
the system, which includes T^{\Sigma}_{up},
T^{\Sigma}_{down} and T^{\Sigma}_{maintenance}, but
also potential logistic delays typically associated
with the operating cycle (insufficient service
personnel, lack of spare parts, etc.).


Depending on a particular application, the 
uncertainty of any of the three indicators of the 
system availability can be investigated. 
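
The three indicators differ only in what is counted in the denominator, as the following Python sketch shows. The function name and the time totals (in hours) are hypothetical; the logistic delays are passed as a separate argument so that the total life time is built up as described above.

```python
def availability_indicators(t_up, t_down, t_maintenance, t_logistics=0.0):
    """Return (IA, AA, OA) from total up, down, maintenance and delay times."""
    t_total = t_up + t_down + t_maintenance + t_logistics
    inherent = t_up / (t_up + t_down)
    achieved = t_up / (t_up + t_down + t_maintenance)
    operational = t_up / t_total
    return inherent, achieved, operational


ia, aa, oa = availability_indicators(
    t_up=8000.0, t_down=500.0, t_maintenance=300.0, t_logistics=200.0
)
```

Because each indicator adds a further source of non-operating time to the denominator, the values are ordered IA ≥ AA ≥ OA for any non-negative inputs.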


2.3. Uncertainty problem formulation 


With the probabilistic simulations described in the
previous subsections it was possible to determine
the average availability of the system for a given
set of input parameters. Unfortunately, this
information is often not sufficient, because for a
reliable assessment of the system’s functionality
the impact of the uncertainty of parameters such as
MTBF, MTTR, MAT and MRT on the system availability
must be analyzed as well. To determine the exact
nature of this impact, a thorough uncertainty and
sensitivity analysis should be performed. The
latter type of analysis, which aims to identify the
basic events to which the system’s availability is
most sensitive (Saltelli et al., 2008), is beyond
the scope of this paper but is discussed in detail
in (Stano, 2018). Instead, the goal of this paper
is to describe a method for conducting the
uncertainty analysis, which aims to describe how
the uncertainty in the inputs propagates to the
outputs of the system. Note that although the
uncertainty about the inputs is parametrized (the
priors about MTBF, MTTR, etc., are given in
parametric form, usually as lognormal
distributions), the uncertainty about the outputs
is usually nonparametric.


3 GLOBAL UNCERTAINTY ANALYSIS 
FOR EVENT-DRIVEN INDICATORS OF 
RAMI 


This section presents the higher level algorithm mentioned in Section 2 that aims to generate the representative distribution of prior parameters. The structure of the algorithm is schematically depicted in Figure 2. The algorithm is composed of three sequential steps, where the first one is of qualitative nature and the following two are of quantitative nature. Firstly, in Section 3.1 the sources of uncertainty associated with basic events are identified; secondly, in Section 3.2 the most influential events are selected; and thirdly, in Section 3.3 the prior distribution of the input parameters is generated via a quasi-random Sobol sampler. Finally, the modular GUA approach is described in Section 3.4.

Figure 2. The schematic depiction of the uncertainty sampler (identification of main sources of uncertainties; screening for most influential events; quasi-random Sobol sampler).


3.1 Identification of main sources of uncertainty 


Each basic event corresponds to a failure of a component or a group of components that causes either a deterioration of the system's performance (and eventual shutdown) or an immediate shutdown to perform corrective maintenance. In this paper we assume the components to be within their Normal Life cycle, which means that they fail with a constant rate over time. With some extra effort, it is possible to adapt our methodology to events with time-varying failure rates, but this is outside the scope of this paper.

In Section 2.2, we have described four main sources of uncertainty: MTBF, MTTR, MAT and MRT. Of these four, the MTBF and MTTR are directly associated with the inputs to the system and are aleatory uncertainties with known parametric distributions. On the other hand, the MAT and MRT are structural uncertainties associated only with the system construction. Thus, because the stated purpose of GUA is to study the impact the inputs have on the outputs, only the uncertainty about the MTBF and MTTR parameters is investigated. The uncertainty in the parameters MAT and MRT should be studied in the context of structural uncertainty analysis, which is devoted to studies of intrinsic system uncertainties (Schueller, 2009).

The prior uncertainties in MTBF and MTTR are modelled by the lifetime distributions derived from reliability tables of appropriate components. Sometimes it is enough to analyze the impact of one parameter only (e.g., when the MTTR is very short compared to MAT and MRT, the only important uncertainty about inputs lies in MTBF).

The general model for the uncertainty analysis for the event-driven indicators is given by the following equation:

$$Y = F(X_1, \ldots, X_d) \tag{5}$$

where:

• output $Y$ is a system availability defined by (2-4);

• inputs $X_i$ are random variables with lifetime distributions reflecting uncertainty in failure parameters of basic events (MTBF, MTTR).

It is important to distinguish the uncertainty in failure rates, which is modelled by the variables $X_i$, from the uncertainty of failure occurrences, which is modelled by the renewal processes $E_i$, $i = 1, \ldots, d$. Thus, the model (5) should be understood as an expectation of a function of $d$ stochastic processes $E_i$, conditional on realizations of $d$ independent random variables $X_i$.
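The two levels of randomness in model (5) can be made concrete with a small simulation. The sketch below is an illustration under simplifying assumptions that are not taken from the paper: a series system in which any failure stops the whole system, exponential operating times between failures (constant failure rate), deterministic repair times, and components that do not age during downtime. It is not the AvailSIM code, and all function names and numerical values are hypothetical.

```python
import numpy as np


def simulate_ia(mtbf, mttr, horizon, rng):
    """Inner level: one realization of IA driven by the renewal processes E_i."""
    up, down = 0.0, 0.0
    clocks = rng.exponential(mtbf)      # operating time left until each event fires
    while up + down < horizon:
        i = int(np.argmin(clocks))      # next basic event to occur
        dt = clocks[i]
        up += dt                        # system runs until that failure
        clocks -= dt                    # all components accumulate operating time
        down += mttr[i]                 # system is stopped for the repair
        clocks[i] = rng.exponential(mtbf[i])  # renewal of the repaired component
    return up / (up + down)


def expected_ia(mtbf, mttr, horizon, n_runs, rng):
    """Y = F(X): Monte Carlo mean over failure occurrences for fixed parameters X."""
    return float(np.mean([simulate_ia(mtbf, mttr, horizon, rng) for _ in range(n_runs)]))


rng = np.random.default_rng(1)
mttr = np.array([8.0, 24.0, 4.0])                      # hours, illustrative
log_mean = np.log(np.array([2000.0, 5000.0, 1000.0]))  # lognormal priors on MTBF
log_sd = np.array([0.3, 0.3, 0.3])

# Outer level: sample the uncertain parameters X_i, then evaluate F(X).
outputs = np.array([
    expected_ia(rng.lognormal(log_mean, log_sd), mttr, horizon=50_000.0, n_runs=20, rng=rng)
    for _ in range(20)
])
```

The spread of `outputs` reflects the uncertainty in the failure parameters, while the averaging inside `expected_ia` removes most of the variability caused by the random failure occurrences.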


3.2 Screening for most influential events 


The first quantitative part of the analysis is to identify the failures that are most detrimental to the system's availability. This is done by computing for all events the Fractional Contribution (FC), which is the ratio of the unavailability due to an event to the total unavailability due to all events:

$$FC_i = \frac{\text{unavailability due to event } e_i}{\text{total unavailability}} \tag{6}$$

To compute this screening factor, the plant is simulated with nominal parameter settings for all events, i.e., failure and repair rates are fixed at the mean values derived from the reliability tables. Then, the $n_0$ events with the largest FCs, whose cumulative FC exceeds a predefined threshold $1-\varepsilon$, are labeled as the most significant and selected for further analysis, i.e., we select those events for which the following holds:

$$\sum_{i=1}^{n_0} FC_i > 1 - \varepsilon. \tag{7}$$
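The selection rule (6)-(7) can be sketched as follows. The function name and the per-event unavailabilities are hypothetical; in practice the unavailabilities would come from the nominal simulation of the plant.

```python
import numpy as np


def screen_events(unavailability, eps):
    """Return indices of the most influential events and all fractional contributions.

    Events are ranked by FC (eq. 6) and the smallest leading group whose
    cumulative FC exceeds 1 - eps (eq. 7) is kept.
    """
    u = np.asarray(unavailability, dtype=float)
    fc = u / u.sum()                    # eq. (6)
    order = np.argsort(fc)[::-1]        # most influential first
    cumulative = np.cumsum(fc[order])
    # index of the first cumulative value strictly above the threshold
    first_above = int(np.searchsorted(cumulative, 1.0 - eps, side="right"))
    n0 = min(first_above + 1, len(u))
    return order[:n0], fc


# Illustrative unavailabilities of five events (arbitrary units).
selected, fc = screen_events([5.0, 1.0, 3.0, 0.5, 0.5], eps=0.15)
```

With these illustrative numbers the three largest contributors (50%, 30% and 10%) are kept, since together they account for more than 85% of the total unavailability.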


With this restriction, the numerical complexity of the problem is reduced significantly, because events with very low expected contributions to the facility's unavailability due to failures are not taken into consideration. On the downside, by excluding the least influential events from further analysis a certain amount of information about the system is lost. Thus, the cut-off threshold $1-\varepsilon$ should be selected very carefully, so that the price of computational efficiency is not too high. On top of that, the quantitative screening should be accompanied by a qualitative analysis that investigates whether any events of critical importance to the functionality of the analyzed system are among the discarded events with low FCs. If that is the case, it might be advisable to re-incorporate these critical events into the GUA despite their low FC values.


3.3 Sobol sampling sequences 


Although the dimensionality of the problem is greatly reduced by the screening procedure described in Section 3.2, for complex systems the dimensionality of RAMI simulations remains a numerical challenge. Therefore, we apply the Sobol approach, in which the uncertain parameters are generated using quasi-random low-discrepancy Sobol sequences. Thanks to this approach it is possible to obtain a sample that densely covers the sampled space with a smaller number of sampling points than required by an orthogonal grid or the Monte Carlo method. This is because the Sobol sequences satisfy the so-called uniformity properties A and A' (Kucherenko et al., 2015) and at the same time avoid the clustering of samples that commonly occurs with Monte Carlo samplers. The samples generated for the n-dimensional unit cube $[0,1]^n$ are transformed via the inverse cumulative distribution function to get an accurate approximation of the system's priors. This approach is visualized on an example of 512 Sobol 75-dimensional points projected on a two-dimensional plane defined by the uncertain MTBFs of the Acquisition modules and the PLC (see Figure 3). Note that the Sobol points densely cover not only the full space (75 dimensions) but also all the lower-dimensional subspaces.
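The sampling step described above can be sketched with the quasi-Monte Carlo module of SciPy. This is only an illustration: the paper does not state which Sobol implementation was used, the scrambling option is our choice (it avoids a first point located exactly at the origin), and the lognormal parameters below are hypothetical.

```python
import numpy as np
from scipy.stats import lognorm, qmc

dim = 75                                   # number of uncertain MTBFs after screening
sampler = qmc.Sobol(d=dim, scramble=True, seed=0)
u = sampler.random_base2(m=9)              # 2**9 = 512 points in the unit cube [0, 1)^75

# Hypothetical lognormal priors: log-mean mu and log-standard-deviation sigma per event.
mu = np.full(dim, np.log(1.0e4))
sigma = np.full(dim, 0.5)

# Inverse-CDF transform; SciPy parametrizes the lognormal with s = sigma, scale = exp(mu).
mtbf = lognorm.ppf(u, s=sigma, scale=np.exp(mu))

# Two-dimensional projection of the kind shown in Figure 3 (first two coordinates).
projection = mtbf[:, :2]
```

Drawing a power-of-two number of points, as done here with `random_base2`, preserves the balance properties of the Sobol sequence, which is one reason why sample sizes such as 512 are convenient.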


3.4 Modular global uncertainty analysis 


Despite all the steps undertaken above to reduce the numerical load of the GUA algorithm, executing it on the complete system might still be numerically very expensive. Therefore, to conduct a numerically efficient GUA we propose the following approach: firstly, we divide the system into modules that correspond to the most important parts of the system; secondly, we conduct the GUA described in Sections 3.1-3.3 independently for each module, assuming that all the components of the remaining modules are 100% operational; thirdly, we combine the information obtained from the separate modules using the knowledge about the functional relationships between events from different modules. This procedure allows the computations to be parallelized, which significantly reduces the computation time compared with a sequential execution of the algorithm.


Figure 3. Sample of 512 Sobol points projected on a two-dimensional plane, approximating a uniform distribution (left) and a lognormal distribution (right).


It is useful to introduce the following algebraic notation. Let us define System Unavailability (SU) by:

$$SA = 1 - SU, \tag{8}$$

where SA denotes the random variable defined by (2-4). Then let us introduce the following algebraic operation:

$$SA_1 \oplus SA_2 \overset{def}{=} 1 - SU_1 - SU_2, \tag{9}$$

where $SA_1$ and $SA_2$ follow the general form described in (8). The next step necessary to complete the uncertainty analysis for the whole system is to observe that:

$$SA_{sys} = SA_1 \oplus \cdots \oplus SA_N + Residual, \tag{10}$$

where:

• $SA_1, \ldots, SA_N$ are indicators of availability for the $N$ individual modules;

• Residual is the term that accounts for interactions between the modules.


Note that with the availability of the system assessed by (8-10), the Residual is always a nonnegative term. Indeed, in the modular approach we double count all the down times of modules that happen while some other module is already down. Thus, the Residual term measures the influence of all these almost-simultaneous occurrences of failure events from distinct modules with the formula:


Residual = y x SAlel, Nasa nel, ) 


JHMSi <..-<ij SN I<ki <N; 


iki eN, (11) 
Figure 4. Schematic depiction of the modular GUA approach of a typical figure.

This means that even without analyzing the Residual, with (10) we obtain a conservative estimate of the availability of a system:

$$SA_{sys} \ge SA_1 \oplus \cdots \oplus SA_N. \tag{12}$$

The schematic depiction of the modular approach described above is shown in Figure 4.
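The operation (9) and the bound (12) amount to subtracting the unavailabilities of all modules from one. A minimal sketch with hypothetical module availabilities follows; the comparison with the product of availabilities holds only under the additional assumption, not made in the paper, that the modules are in series and their downtimes are independent.

```python
from functools import reduce
from math import prod


def oplus(sa_a: float, sa_b: float) -> float:
    """Eq. (9): combine two availabilities by adding their unavailabilities."""
    return 1.0 - (1.0 - sa_a) - (1.0 - sa_b)


# Hypothetical availabilities of three modules.
module_sa = [0.995, 0.999, 0.998]

# Right-hand side of (12): conservative (lower) estimate of the system availability.
lower_bound = reduce(oplus, module_sa)

# Illustrative assumption: series system with independent module downtimes.
independent_series = prod(module_sa)

# The gap plays the role of the nonnegative Residual under that assumption.
residual = independent_series - lower_bound
```

Because overlapping downtimes are counted once per module in the sum of unavailabilities, the gap shrinks as simultaneous failures of different modules become rarer.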


4 CASE STUDY 


The practical implementation of the proposed 
GUA procedure is illustrated on an example, which 
involves the analysis of the model of the Injector 
System within the International Fusion Materi- 
als Irradiation Facility (IFMIF). The IFMIF is 
an accelerator-based neutron source, developed 
jointly by Europe and Japan, which is conceived 
for fusion materials testing. The main purpose of 
the Injector System of the linear accelerator under 
study is to deliver sufficient beam current to the 
first accelerating cavity (RFQ—Radio Frequency 
Quadrupole) (Bargallo Font, 2014), and thus 
to achieve a 125 mA RFQ output current. The 
Injector is composed of four subsystems: Source 
and Extraction System (SES); Low Energy Beam 
Transport (LEBT); Local Control System (LCS); 
and Auxiliaries System (AUX), which are con- 
nected sequentially reliability-wise. Thus, in case 
of the analyzed Injector, the GUA decomposition 
into modes naturally overlaps with the decomposi- 
tion into subsystems. 

In the case of the Injector, all the identified basic events initiate failure sequences that cause an immediate shutdown of the system, so that the accelerator vault can be accessed by the repair crews to conduct corrective maintenance operations. Out of the considered availability indicators (2-4), we have selected the IA because it measures the direct impact of system failures on the accelerator's availability, whereas the remaining two also account for other events (scheduled maintenance time, logistic delays, etc.).


4.1 Identification of main sources of uncertainty 
in the Injector system 


The analysis of the Injector system within the IFMIF accelerator revealed that the main sources of uncertainty in the system lie in the failure rates (MTBF parameters). Indeed, the MTTRs associated with the basic events are relatively low and dominated by the constant access time (MAT) and recovery time (MRT), which are determined by the accelerator technical specification. Consequently, the variations in MTTR do not influence the IA output indicator significantly. The uncertainties in MTBF are parametrized by lognormal distributions with parameters defined in the reliability tables of events detected for the IFMIF accelerator (for details see Bargallo Font (2014)).


4.2 Screening for the most influential events in the 
Injector system 


According to (Bargallo Font, 2014), there are overall 508 basic events identified for the Injector, each characterized by the uncertainty about its MTBF. It has been established experimentally that for the nominal setting of uncertain parameters, the mean downtime of the Injector system stabilizes after 200 Monte Carlo runs of AvailSIM, as depicted in Figure 5.

Thus, for a grid of $n$ samples from the prior associated with each event and 200 Monte Carlo runs, the computational load of the rough GUA algorithm is:

$$n^{s} \times 200 \times \text{single run}, \tag{13}$$

which is infeasible even for a small $n$. Therefore, following the modular approach of Section 3.4, the space of basic events has been screened independently for each module to identify those events that are the most detrimental to its functionality. The events that cumulatively contribute more than 90% of the inherent unavailability of each mode are selected for further analysis. These, ordered by the strength of their contribution, are presented in Figure 6.


4.3 Sobol sampling sequences in the Injector 
system 


For the events detected in the previous section, the 
uncertainties in MTBF need to be sampled from 
their lognormal priors. This is done with high- 
dimensional low-discrepancy Sobol sequences. 


Figure 5. Mean downtime versus number of Monte Carlo runs (nominal parameters).

Figure 6. Fractional contributions of the most influential events for each mode considered (SES, LEBT, LCS, AUX).


Following the suggestions of Kucherenko et al. (2015), the number of Sobol points is set to 512, which, as they argue, gives lower-discrepancy coverage than both pseudo-random Monte Carlo and Latin Hypercube methods. Thus, the computational load of the algorithm is reduced from (13) to:

$$512 \times 200 \times \text{single run}, \tag{14}$$

which is manageable even for standard machines.
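The difference between the loads (13) and (14) can be checked with a few lines of integer arithmetic. The sketch assumes, as an interpretation that is ours, that the exponent in (13) is the number of basic events, and it uses a hypothetical grid of only two points per event.

```python
mc_runs = 200        # Monte Carlo runs per parameter setting (Section 4.2)
n_events = 508       # basic events identified for the Injector (Section 4.2)
n_grid = 2           # hypothetical: the coarsest possible grid, two points per event

# Load (13): full grid over all events (assumed exponent: number of events).
grid_runs = n_grid ** n_events * mc_runs

# Load (14): 512 Sobol points, independent of the number of events.
sobol_runs = 512 * mc_runs

digits_of_grid_load = len(str(grid_runs))
```

Even with two grid points per event the full-grid load is a number with more than 150 decimal digits, whereas the Sobol design requires 102,400 single runs.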


4.4 Results of GUA for individual modules 


The four subsystems have been simulated with the AvailSIM code, each with the corresponding number of variables identified in the screening process. The computational load defined in eq. (14) was evenly distributed over two Xeon E5-2680 v3 processors with a total number of 96 cores. There was no significant difference in computation time between the subsystems: each took about 15 minutes on average, apart from the AUX subsystem, which required almost 12 hours of calculations.

Since lognormal distributions were applied in the Sobol sequence generation, a lognormal distribution is fitted to the non-parametric (empirical) distribution of the Inherent Unavailability (IU). The estimators of the mean and standard deviation of log(IU), denoted by $\hat{\mu}$ and $\hat{\sigma}$, respectively, are calculated in two ways.

The first method is based on direct calculation of the estimators of $\mu$ and $\sigma$ from the standard equations:

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} \log(IU_i), \tag{15a}$$

$$\hat{\sigma} = \left(\frac{1}{N}\sum_{i=1}^{N} \left(\log(IU_i) - \hat{\mu}\right)^2\right)^{1/2}. \tag{15b}$$

The second method is based on the Python module scipy.stats, specifically the lognorm.fit function. Apart from the $\mu$ and $\sigma$ estimators, this function returns the "loc" parameter, which adds an extra degree of freedom to the estimator by allowing for shifts in the data. For each scenario simulated, the Kolmogorov-Smirnov test was performed on both estimators in order to check the goodness of each fit. Based on the obtained p-values, the better fit was selected to represent the empirical data. All estimators are summarized in Table 1.
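The two fitting routes and the selection by p-value can be sketched as follows. The data here are synthetic (drawn from a lognormal with illustrative parameters), the variable names are ours, and the direct estimator uses the 1/N normalization; the actual IU samples would come from the AvailSIM runs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
iu = rng.lognormal(mean=-7.24, sigma=0.28, size=512)   # synthetic stand-in for simulated IU

# Method 1: direct estimators of the mean and standard deviation of log(IU).
log_iu = np.log(iu)
mu_hat = log_iu.mean()
sigma_hat = log_iu.std()            # population form (ddof=0)
direct_params = (sigma_hat, 0.0, np.exp(mu_hat))       # SciPy order: (s, loc, scale)

# Method 2: three-parameter fit; scale corresponds to exp(mu) and s to sigma.
fit_params = stats.lognorm.fit(iu)

# Kolmogorov-Smirnov test of each fitted lognormal against the data.
ks_direct = stats.kstest(iu, "lognorm", args=direct_params)
ks_fit = stats.kstest(iu, "lognorm", args=fit_params)

candidates = {"direct": ks_direct.pvalue, "lognorm.fit": ks_fit.pvalue}
best = max(candidates, key=candidates.get)             # larger p-value is taken as the better fit
```

Note that p-values of the Kolmogorov-Smirnov test are optimistic when the tested parameters were estimated from the same data, so they are best read here as a relative ranking of the two fits rather than as formal significance levels.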

For the estimators that give the closest fits, the probability density functions (PDF) and the cumulative distribution functions (CDF) were plotted to see how they compare to the data. The PDFs are compared with histograms of the IU data for each considered scenario (see Figure 7), while the CDFs are compared with the empirical CDF of the dataset for each simulated scenario (see Figure 8).

Table 1. Lognormal estimators of IU.

Mode    $\hat{\mu}$    $\hat{\sigma}$    'loc'
AUX     -5.8           0.44              0.001
LCS     -8.12          0.46              5e-5
LEBT    -7.24          0.28              -
SES     -6.53          0.50              -

Figure 7. The PDFs fitted to simulated IU (histogram) obtained with lognorm.fit (blue) or classic estimators (green) for each mode considered (SES, LEBT, LCS, AUX).

Figure 8. The CDFs fitted to simulated IU (histogram) obtained with lognorm.fit (blue) or classic estimators (green) for each mode considered (SES, LEBT, LCS, AUX).


4.5 Complete GUA for the overall Injector system 


After the procedure was performed on all four modes, the subsystem data were combined in order to describe the complete Injector system of the IFMIF in the modular fashion described in Section 3.4.

To complete the analysis it is first necessary to estimate the impact of the Residual term on the uncertainty of the Injector system. To do that, it is required to estimate the probability of coincidental occurrences of events from the four distinct subsystems. With some simple but tedious combinatorial calculations it can be shown that the probability of the coincidence of multiple failure events (i.e., of components coming from distinct subsystems of the Injector) is bounded by 5e-5, where the bound is very conservative because it takes the maximum failure probabilities across all the considered events. This means that the impact of the Residual is similar to the impact of a single failure event. Thus, given that there are more than 500 events considered in the complete Injector system, the impact of the Residual on the uncertainty of the IA can be assessed as negligible.

Consequently, we estimate the IA of the complete Injector with the lower bound given in (12). Note that for each module considered the lognormal uncertainties in the input propagate into quasi-lognormals in the outputs. Thanks to this property we were able to approximate the posterior IU distribution of the complete Injector in a parametric way, i.e., the sample distribution of IU for the complete Injector system is obtained by summing samples generated independently from the lognormals obtained in the previous section. The empirical distribution of 10,000 such generated system data points is presented in Figure 9, which shows that the posterior uncertainty distribution of the complete Injector system can also be well approximated with a lognormal.

Figure 9. The PDF and CDF fitted to simulated IU (histogram, red) obtained with parametric estimators for the complete Injector.

To complete the analysis, Table 2 presents summary statistics for the individual modes and for the complete Injector system.

Table 2. Summary statistics for the IA.

Mode   Mean [%]   Std [%]   1st percentile [%]   Error factor [-]
AUX    99.56      0.17      99.03                1.0057
LCS    99.96      0.019     99.89                1.0008
LEBT   99.93      0.023     99.85                1.0008
SES    99.83      0.104     99.39                1.0047
INJ    99.29      0.177     98.75                1.0057
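The parametric combination of the modules can be sketched by sampling the fitted lognormals and summing the unavailabilities. The parameters are those listed in Table 1, read under the assumption that the tabulated mean refers to log(IU - loc) and that a missing 'loc' entry means zero; the seed and variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(42)

# (mu, sigma, loc) per mode, as read from Table 1 (missing loc taken as 0).
params = {
    "AUX": (-5.8, 0.44, 1e-3),
    "LCS": (-8.12, 0.46, 5e-5),
    "LEBT": (-7.24, 0.28, 0.0),
    "SES": (-6.53, 0.50, 0.0),
}

n_samples = 10_000
# Independent IU samples per mode; the system IU is their sum, cf. the bound (12).
iu_sys = sum(loc + rng.lognormal(mu, sigma, n_samples) for mu, sigma, loc in params.values())
ia_sys = 1.0 - iu_sys

mean_ia_percent = 100.0 * ia_sys.mean()
std_ia_percent = 100.0 * ia_sys.std()
first_percentile = 100.0 * np.percentile(ia_sys, 1)
```

Under these assumptions the sampled mean and standard deviation of the system IA come out close to the INJ row of Table 2, which suggests that this reading of the Table 1 parameters is consistent with the reported summary statistics.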


5 CONCLUSIONS & DISCUSSION 


In this paper a general framework to conduct Modular Global Uncertainty Analysis (GUA) for event-driven RAMI indicators has been introduced. The proposed method is based on two-level probabilistic simulations conducted in a modular manner, which makes it applicable to a broad range of RAMI problems. The main advantages of the proposed method lie in:

a. numerical efficiency: when compared with standard simulation techniques based on Monte Carlo methods, the proposed algorithm significantly reduces the computational complexity of the problem;

b. completeness: performing GUA makes it possible to estimate the posterior uncertainty distribution of the system's outputs from the marginal distributions of the system's inputs;

c. modularity: with the proposed method it is possible to estimate the posterior distributions of the system's modules independently and combine them to produce the complete posterior distribution of the system's outputs.


We have demonstrated that these three proper- 
ties are satisfied by applying the proposed Modular 
GUA method to study the Injector system within 
the International Fusion Materials Irradiation 
Facility (IFMIF). For this system, the indicator of 
inherent availability (IA) can be well approximated 
with IA indicators obtained independently for four 
modules: SES, LEBT, LCS, AUX. This is possible 
due to a weak influence of the Residual that char- 
acterizes the impact of the coincidental failures of 
at least two events from different modules. 

We finish the paper with a discussion of the limitations of the proposed Modular GUA procedure. Firstly, the presented approach is focused on the analysis of the impacts of the uncertainty in the inputs to the system and does not deal with the impacts of structural uncertainties present in the system (e.g., uncertainties about MAT, MRT, logistic processes). Secondly, within the proposed GUA algorithm it is assumed that the components are within a Normal Life Cycle, which means that the failure rates are fixed for the whole duration of the simulated system life cycle. Consequently, further research is required to make the method applicable to problems with time-varying failure rates that are associated with the infant or wear-out stages of the system's components. Thirdly, in general, the impact of the Residual might not be negligible, as it was for the Injector system, and without analyzing it in detail the right-hand side of (12) will underestimate the true uncertainty in the system output. There are several possible ways to analyze the uncertainty associated with a non-negligible Residual, e.g., if the Residual had an impact comparable to that of another module, it could be handled by incorporating a virtual module of equivalent characteristics into online simulations to update the global uncertainty profile. Another possible approach would be to perform (offline) screening on interactions to identify those that matter the most and incorporate them as an interaction module into the simulations; this could be useful in the analysis of the sensitivity of the system. Finally, the events rejected during the screening phase due to their marginal contribution to the system's availability might have an influential impact when taken cumulatively. Therefore, it is desirable to incorporate into the existing procedure a routine that would quantify the uncertainty impact of the least influential events taken collectively. With such a procedure we could validate the correctness of the cut-off threshold setting.

All the aforementioned concerns regarding the proposed Modular GUA procedure are the subject of ongoing research.


ABSTRACT: Belief reliability is a newly developed reliability metric considering aleatory and epistemic uncertainty. Since Performance Margin (PM) is an important concept in belief reliability, there is a great need to develop PM-based belief reliability models. In this paper, we consider the parameter uncertainty of PM and propose a new belief reliability model. In this model, the performance parameter and its threshold, which determine the PM of a system, can be modeled as an uncertain variable or a random variable, and three belief reliability formulas are put forward based on different cases. To illustrate the model, a case study about belief reliability estimation of a contact recording head is performed.


1 INTRODUCTION 


Nowadays, there is a growing interest in physical model-based reliability metrics (cf. the Physics-of-Failure (PoF) model-based metric (Zeng et al. 2016), the structural reliability metric (Choi et al. 2007), etc.), where reliability is predicted using deterministic physical models. The only uncertainty comes from the parameter variations, which are modeled as probability distributions. However, in real cases, besides the random variations (referred to as aleatory uncertainty (Kiureghian & Ditlevsen 2009)) affecting parameters, the physical model is also influenced by epistemic uncertainty caused by our lack of knowledge (Kiureghian & Ditlevsen 2009). For example, the model parameters may not be estimated accurately due to our limited knowledge about the system operation environment (Aven et al. 2014). Considering the effect of epistemic uncertainty, many reliability metrics have been proposed, such as the evidence theory-based reliability metric (Bae et al. 2004), the interval analysis-based reliability metric (Zhang et al. 2017), the fuzzy interval analysis-based reliability metric (Flage et al. 2013) and the posbist reliability metric (Cai 1996). The first three metrics are essentially reliability intervals, which may cause interval extension problems, while the last one is a possibility measure, which does not satisfy the duality property (Kang et al. 2016). Recently, a new reliability metric called belief reliability has been developed, considering both aleatory and epistemic uncertainty (Zeng et al. 2017b). To model uncertainty, belief reliability introduces uncertainty theory and chance theory to measure system reliability. Uncertainty theory was proposed by Liu (2007) as a new branch of axiomatic mathematics, founded on the normality, duality, subadditivity and product axioms, and chance theory can be regarded as a mixture of uncertainty theory and probability theory. According to Kang et al. (2016), since belief reliability overcomes the shortcomings of other reliability metrics, it is regarded as a more appropriate metric in reliability engineering.

The concept of performance margin plays an important role in belief reliability (Zeng et al. 2017a). Performance Margin (PM), denoted as $m$, describes the distance between a critical performance parameter $p$ and its associated threshold $p_{th}$. By analyzing the physical model of a system, we can obtain the PM model. In this process, there may be two types of uncertainties affecting the physical model: parameter uncertainty caused by indeterministic model parameters and model uncertainty due to the inaccuracy of the physical model. Until now, several studies have considered the model uncertainty and applied the belief reliability model to real industrial systems (see Zeng et al. (2017a) and Yu et al. (2017)). However, there is still a strong need to develop PM-based belief reliability models considering parameter uncertainty. Therefore, in this paper, we propose a simple model accounting for parameter uncertainty for future application. To represent aleatory and epistemic uncertainty, the PM of a system is modeled as an uncertain random variable. If the system is mainly influenced by aleatory uncertainty, PM will degenerate to a random variable; if the system is mainly affected by epistemic uncertainty, PM will degenerate to an uncertain variable. Since the parameter uncertainty will essentially affect $p$ and $p_{th}$, we discuss the model from three aspects, i.e., $p$ and $p_{th}$ are both modeled as uncertain variables, $p$ is a random variable while $p_{th}$ is an uncertain variable, and $p$ is an uncertain variable while $p_{th}$ is a random variable. Some basic belief reliability formulas based on uncertainty theory and chance theory are then proposed for the three cases.

The remainder of this paper is organized as follows. In Section 2, we introduce some basic concepts and results of the theoretical basis, i.e., uncertainty theory and chance theory. The developed PM-based belief reliability model is introduced in Section 3. We first give the definition of belief reliability in the sense of PM, and then put forward three formulas. Finally, a real case study is performed in Section 4 to illustrate the model.


2 PRELIMINARY 


In this section, some basic concepts and results 
of uncertainty theory and chance theory are 
introduced. 


2.1 Uncertainty theory 


Uncertainty theory is a new branch of axiomatic 
mathematics built on four axioms, i.e., Normal- 
ity, Duality, Subadditivity and Product Axioms. 
Founded by Liu (2007) in 2007 and refined by Liu 
(2010) in 2010, uncertainty theory has been widely 
applied as a new tool for modeling subjective (espe- 
cially human) uncertainties. In uncertainty theory, 
belief degrees of events are quantified by defining 
uncertain measures: 

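Definition 2.1 (Uncertain measure, Liu (2007)). Let $\Gamma$ be a nonempty set, and $\mathcal{L}$ be a $\sigma$-algebra over $\Gamma$. A set function $\mathcal{M}$ is called an uncertain measure if it satisfies the following axioms:

Axiom 1. Normality Axiom: $\mathcal{M}\{\Gamma\} = 1$ for the universal set $\Gamma$.

Axiom 2. Duality Axiom: $\mathcal{M}\{\Lambda\} + \mathcal{M}\{\Lambda^{c}\} = 1$ for any event $\Lambda \in \mathcal{L}$.

Axiom 3. Subadditivity Axiom: For every countable sequence of events $\Lambda_1, \Lambda_2, \ldots$, we have

$$\mathcal{M}\left\{\bigcup_{i=1}^{\infty} \Lambda_i\right\} \le \sum_{i=1}^{\infty} \mathcal{M}\{\Lambda_i\}.$$

Uncertain measures of product events are calculated following the product axiom (Liu 2009):

Axiom 4. Product Axiom: Let $(\Gamma_k, \mathcal{L}_k, \mathcal{M}_k)$ be uncertainty spaces for $k = 1, 2, \ldots$. The product uncertain measure $\mathcal{M}$ is an uncertain measure satisfying

$$\mathcal{M}\left\{\prod_{k=1}^{\infty} \Lambda_k\right\} = \bigwedge_{k=1}^{\infty} \mathcal{M}_k\{\Lambda_k\},$$

where $\mathcal{L}_k$ are $\sigma$-algebras over $\Gamma_k$, and $\Lambda_k$ are arbitrarily chosen events from $\mathcal{L}_k$ for $k = 1, 2, \ldots$, respectively.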

Definition 2.2 (Uncertain variable, Liu (2007)). An uncertain variable is a function $\xi$ from an uncertainty space $(\Gamma, \mathcal{L}, \mathcal{M})$ to the set of real numbers such that $\{\xi \in B\}$ is an event for any Borel set $B$ of real numbers.

Definition 2.3 (Uncertainty distribution, Liu (2007)). The uncertainty distribution $\Phi$ of an uncertain variable $\xi$ is defined by $\Phi(x) = \mathcal{M}\{\xi \le x\}$ for any real number $x$.

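For example, a linear uncertain variable $\xi \sim \mathcal{L}(a, b)$ has an uncertainty distribution

$$\Phi_1(x) = \begin{cases} 0, & \text{if } x < a \\ \dfrac{x - a}{b - a}, & \text{if } a \le x \le b \\ 1, & \text{if } x > b \end{cases} \tag{2.1}$$

and a normal uncertain variable $\xi \sim \mathcal{N}(e, \sigma)$ has an uncertainty distribution

$$\Phi_2(x) = \left(1 + \exp\left(\frac{\pi (e - x)}{\sqrt{3}\,\sigma}\right)\right)^{-1}, \quad x \in \mathbb{R}. \tag{2.2}$$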
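The linear and normal uncertainty distributions can be evaluated directly, which is convenient for checking numerical results later on. The sketch below uses the standard forms of these two distributions from uncertainty theory; the function names are ours.

```python
import math


def linear_uncertainty_cdf(x: float, a: float, b: float) -> float:
    """Uncertainty distribution of a linear uncertain variable L(a, b), a < b."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)


def normal_uncertainty_cdf(x: float, e: float, sigma: float) -> float:
    """Uncertainty distribution of a normal uncertain variable N(e, sigma), sigma > 0."""
    # Note: math.exp overflows for arguments far in the left tail (x much smaller than e).
    return 1.0 / (1.0 + math.exp(math.pi * (e - x) / (math.sqrt(3.0) * sigma)))


mid_linear = linear_uncertainty_cdf(1.5, 1.0, 2.0)
mid_normal = normal_uncertainty_cdf(0.0, 0.0, 1.0)
```

Both distributions take the value 0.5 at the centre of the variable: the midpoint of the interval in the linear case and the parameter $e$ in the normal case.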


An uncertainty distribution $\Phi$ is said to be regular if it is continuous and strictly increasing with respect to $x$ wherever $0 < \Phi(x) < 1$, and $\lim_{x \to -\infty} \Phi(x) = 0$, $\lim_{x \to +\infty} \Phi(x) = 1$. A regular uncertainty distribution has an inverse function, which is defined as the inverse uncertainty distribution, denoted by $\Phi^{-1}(\alpha)$, $\alpha \in (0,1)$. Inverse uncertainty distributions play a central role in uncertainty theory, since the uncertainty distribution of a function of uncertain variables is calculated using the inverse uncertainty distributions:

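Theorem 2.1 (Operational law 1, Liu (2010)). Let $\xi_1, \xi_2, \ldots, \xi_n$ be independent uncertain variables with regular uncertainty distributions $\Phi_1, \Phi_2, \ldots, \Phi_n$, respectively. If $f(\xi_1, \xi_2, \ldots, \xi_n)$ is strictly increasing with respect to $\xi_1, \xi_2, \ldots, \xi_m$ and strictly decreasing with respect to $\xi_{m+1}, \xi_{m+2}, \ldots, \xi_n$, then $\xi = f(\xi_1, \xi_2, \ldots, \xi_n)$ has an inverse uncertainty distribution

$$\Psi^{-1}(\alpha) = f\left(\Phi_1^{-1}(\alpha), \ldots, \Phi_m^{-1}(\alpha), \Phi_{m+1}^{-1}(1-\alpha), \ldots, \Phi_n^{-1}(1-\alpha)\right).$$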
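Operational law 1 can be illustrated with a difference of two independent linear uncertain variables. The sketch assumes, purely for illustration, a margin of the form $m = p_{th} - p$, which is increasing in the threshold and decreasing in the performance parameter; the numerical intervals and function names are hypothetical.

```python
def linear_inverse(alpha: float, a: float, b: float) -> float:
    """Inverse uncertainty distribution of a linear uncertain variable L(a, b)."""
    return (1.0 - alpha) * a + alpha * b


def margin_inverse(alpha: float, threshold: tuple, parameter: tuple) -> float:
    """Inverse distribution of m = p_th - p via Operational law 1.

    m increases with p_th (evaluate its inverse at alpha) and decreases
    with p (evaluate its inverse at 1 - alpha).
    """
    return linear_inverse(alpha, *threshold) - linear_inverse(1.0 - alpha, *parameter)


# Hypothetical inputs: p_th ~ L(8, 10) and p ~ L(3, 6).
p_th = (8.0, 10.0)
p = (3.0, 6.0)

m_low = margin_inverse(0.0, p_th, p)     # 8 - 6
m_mid = margin_inverse(0.5, p_th, p)     # 9 - 4.5
m_high = margin_inverse(1.0, p_th, p)    # 10 - 3
```

For these inputs the inverse distribution of the margin is linear in $\alpha$, so the margin is itself a linear uncertain variable on the interval from 2 to 7.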


Theorem 2.2 (Operational law 2, Liu (2010)). Let $\xi_1, \xi_2, \ldots, \xi_n$ be independent uncertain variables with continuous uncertainty distributions $\Phi_1, \Phi_2, \ldots, \Phi_n$, respectively. If $f(\xi_1, \xi_2, \ldots, \xi_n)$ is strictly increasing with respect to $\xi_1, \xi_2, \ldots, \xi_m$ and strictly decreasing with respect to $\xi_{m+1}, \xi_{m+2}, \ldots, \xi_n$, then $\xi = f(\xi_1, \xi_2, \ldots, \xi_n)$ has an uncertainty distribution

$$\Psi(x) = \sup_{f(x_1, \ldots, x_n) = x} \left( \min_{1 \le i \le m} \Phi_i(x_i) \wedge \min_{m+1 \le i \le n} \left(1 - \Phi_i(x_i)\right) \right).$$


2.2 Chance theory 


Chance theory is founded by Liu (2013b) as a 
mixture of uncertainty theory and probability 
theory, to deal with problems affected by both 
aleatory uncertainty (randomness) and epistemic 
uncertainty. The basic concept in chance theory 
is the chance measure of an event in a chance 
space. 

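Let $(\Gamma, \mathcal{L}, \mathcal{M})$ be an uncertainty space,
and $(\Omega, \mathcal{A}, \Pr)$ be a probability space. Then
$(\Gamma, \mathcal{L}, \mathcal{M}) \times (\Omega, \mathcal{A}, \Pr)$ is called a chance space.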

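Definition 2.4 (Chance measure, Liu (2013b)). Let
$(\Gamma, \mathcal{L}, \mathcal{M}) \times (\Omega, \mathcal{A}, \Pr)$ be a chance space, and
let $\Theta \in \mathcal{L} \times \mathcal{A}$ be an event. Then the chance measure of
$\Theta$ is defined as

$$\mathrm{Ch}\{\Theta\} = \int_0^1 \Pr\left\{\omega \in \Omega \mid \mathcal{M}\{\gamma \in \Gamma \mid (\gamma, \omega) \in \Theta\} \ge x\right\} dx.$$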


Definition 2.5 (Uncertain random variable, Liu (2013b)). An uncertain random variable is a
function $\xi$ from a chance space $(\Gamma, \mathcal{L}, \mathcal{M}) \times (\Omega, \mathcal{A}, \Pr)$
to the set of real numbers such that $\{\xi \in B\}$ is an event in
$\mathcal{L} \times \mathcal{A}$ for any Borel set $B$ of real numbers.

Random variables and uncertain variables are two special cases of uncertain random variables.
If an uncertain random variable $\xi(\gamma, \omega)$ does not vary with $\gamma$, it
degenerates to a random variable. If an uncertain random variable $\xi(\gamma, \omega)$ does
not vary with $\omega$, it degenerates to an uncertain variable.

Definition 2.6. Let $\xi$ be an uncertain random variable. Then its chance distribution is
defined by $\Phi(x) = \mathrm{Ch}\{\xi \le x\}$ for any $x \in \mathbb{R}$.

Theorem 2.3 (Liu 2013a). Let $\eta_1, \eta_2, \ldots, \eta_m$ be independent random variables
with probability distributions $\Psi_1, \Psi_2, \ldots, \Psi_m$, respectively, and let
$\tau_1, \tau_2, \ldots, \tau_n$ be uncertain variables. Assume $f$ is a measurable function.
Then the uncertain random variable $\xi = f(\eta_1, \ldots, \eta_m, \tau_1, \ldots, \tau_n)$
has a chance distribution

$$\Phi(x) = \int_{\mathbb{R}^m} F(x; y_1, \ldots, y_m) \, d\Psi_1(y_1) \cdots d\Psi_m(y_m),$$

where $F(x; y_1, \ldots, y_m)$ is the uncertainty distribution of the uncertain variable
$f(y_1, \ldots, y_m, \tau_1, \ldots, \tau_n)$.


3 THE BELIEF RELIABILITY MODEL 


This section will develop the belief reliability 
model based on performance margin (PM). In 
subsection 3.1, we first give some basic definitions 
about PM and belief reliability, and discuss the 
source of uncertainty in PM. Then, three theo- 
rems are developed as reliability formulas in sub- 
section 3.2 to calculate belief reliability indexes 
based on PM. 


3.1 Performance-margin-based belief reliability 


In an industrial system, there is usually a critical 
performance parameter with a failure threshold, 
describing the functional behavior and require- 
ment of the system. In most cases, two categories 
of performance parameters exist: 


1. Smaller-the-better (STB) parameters: Failure occurs when $p \ge p_{th}$.

2. Greater-the-better (GTB) parameters: Failure occurs when $p \le p_{th}$.


Definition 3.1 (Performance margin). Let $p$ be the critical performance parameter of a
system, and $p_{th}$ be its associated failure threshold. Then the performance margin $m$
related to $p$ is defined as:

$$m = \begin{cases} p_{th} - p, & \text{if } p \text{ is STB}, \\ p - p_{th}, & \text{if } p \text{ is GTB}. \end{cases} \qquad (3.1)$$

It is easy to see that the PM describes the distance between the performance parameter and
its threshold, and that failure occurs whenever $m \le 0$. In real cases, the PM is usually
affected by both aleatory and epistemic uncertainties. Therefore, in this paper, we model the
PM as an uncertain random variable and we have the following definition of belief reliability
in the sense of PM.

Definition 3.2 (Belief reliability). Let the system performance margin $m$ be an uncertain
random variable. Then the system belief reliability is defined as the chance that $m$ is
greater than 0, i.e.,

$$R_B = \mathrm{Ch}\{m > 0\}. \qquad (3.2)$$

Remark 3.1. If the PM is mainly affected by aleatory uncertainty, $m$ will degenerate to a
random variable, and the belief reliability becomes $R_B = \Pr\{m > 0\}$.

Remark 3.2. If the PM is mainly affected by epistemic uncertainty, $m$ will degenerate to an
uncertain variable, and the belief reliability becomes $R_B = \mathcal{M}\{m > 0\}$.

The uncertainty of $m$ comes from $p$ and $p_{th}$. In practice, the performance parameter is
usually described by a physical model:

$$p = g(x_1, x_2, \ldots, x_n), \qquad (3.3)$$

where $g(\cdot)$ denotes the deterministic model predicting $p$ and $x_i$, $i = 1, 2, \ldots, n$,
denote the input parameters. Due to the effect of aleatory and epistemic uncertainty, the
input parameters are no longer crisp values. To deal with the parameter uncertainty, the
input parameters are usually modelled as random variables or uncertain variables. Therefore,
$p$ can be a random variable, an uncertain variable, or an uncertain random variable. The
failure threshold associated with $p$ is usually regarded as constant in traditional
analysis. However, in some cases, $p_{th}$ is also affected by uncertainty. For example, if
$p_{th}$ is estimated by experts, we model it as an uncertain variable; if $p_{th}$ is
estimated from experimental data, we tend to regard it as a random variable.

According to the above discussion, there are four common conditions for $p$ and $p_{th}$:
(1) $p$ and $p_{th}$ are both random variables; (2) $p$ and $p_{th}$ are both uncertain
variables; (3) $p$ is random while $p_{th}$ is uncertain; (4) $p$ is uncertain while $p_{th}$
is random. Since the first condition is similar to the traditional stress-strength
interference reliability model, we only discuss the latter three conditions and develop
reliability formulas for them in the next subsection.


3.2 Reliability formula 


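3.2.1 Case I: $p$ and $p_{th}$ are both uncertain

Theorem 3.1. Suppose the system performance parameter $p$ and its associated threshold
$p_{th}$ are uncertain variables with uncertainty distributions $\Phi(x)$ and $\Psi(x)$,
respectively. Then the PM-based belief reliability of the system can be calculated by:

$$R_B = \begin{cases} \sup_{y \in \mathbb{R}} \left(\Phi(y) \wedge (1 - \Psi(y))\right), & \text{if } p \text{ is STB}, \\ \sup_{y \in \mathbb{R}} \left((1 - \Phi(y)) \wedge \Psi(y)\right), & \text{if } p \text{ is GTB}. \end{cases}$$

In particular, if $p_{th}$ is constant, the belief reliability will be

$$R_B = \begin{cases} \Phi(p_{th}), & \text{if } p \text{ is STB}, \\ 1 - \Phi(p_{th}), & \text{if } p \text{ is GTB}. \end{cases}$$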


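Proof. Let $\xi = p - p_{th}$. Since $p$ and $p_{th}$ are both uncertain variables, $\xi$ is
also an uncertain variable. Here we assume the uncertainty distribution of $\xi$ is
$\Upsilon(x)$. According to Theorem 2.2, we have

$$\Upsilon(x) = \sup_{x_1 - x_2 = x} \left(\Phi(x_1) \wedge (1 - \Psi(x_2))\right).$$

If $p$ is an STB parameter, the belief reliability can be calculated as

$$R_B = \mathcal{M}\{m > 0\} = \mathcal{M}\{p - p_{th} < 0\} = \Upsilon(0) = \sup_{y \in \mathbb{R}} \left(\Phi(y) \wedge (1 - \Psi(y))\right).$$

When $p_{th}$ is constant, it is easy to find
$R_B = \mathcal{M}\{p < p_{th}\} = \Phi(p_{th})$.

If $p$ is a GTB parameter, the belief reliability can be calculated as

$$R_B = \mathcal{M}\{m > 0\} = \mathcal{M}\{p - p_{th} > 0\} = 1 - \Upsilon(0) = 1 - \sup_{y \in \mathbb{R}} \left(\Phi(y) \wedge (1 - \Psi(y))\right) = \sup_{y \in \mathbb{R}} \left((1 - \Phi(y)) \wedge \Psi(y)\right).$$

When $p_{th}$ is constant, we have
$R_B = \mathcal{M}\{p > p_{th}\} = 1 - \Phi(p_{th})$.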


3.2.2 Case II: $p$ is random while $p_{th}$ is uncertain

Theorem 3.2. Suppose the system performance parameter $p$ is a random variable with a
probability distribution $\Phi(x)$ and the associated threshold $p_{th}$ is an uncertain
variable with an uncertainty distribution $\Psi(x)$. Then the PM-based belief reliability of
the system can be calculated by:

$$R_B = \begin{cases} \int_{-\infty}^{+\infty} \left(1 - \Psi(y)\right) d\Phi(y), & \text{if } p \text{ is STB}, \\ \int_{-\infty}^{+\infty} \Psi(y) \, d\Phi(y), & \text{if } p \text{ is GTB}. \end{cases}$$

Proof. The proof is given in two parts:


1. $p$ is an STB parameter

Let $\xi = p - p_{th}$. Since $p$ is a random variable while $p_{th}$ is an uncertain
variable, $\xi$ is an uncertain random variable. Here we assume the chance distribution of
$\xi$ is $\Upsilon(x)$. According to Theorem 2.3, we have

$$\Upsilon(x) = \int_{-\infty}^{+\infty} F(x; y) \, d\Phi(y), \qquad (3.4)$$

where $F(x; y)$ is the uncertainty distribution of the uncertain variable $\eta = y - p_{th}$.
Based on Theorem 2.1, we have $F^{-1}(\alpha) = y - \Psi^{-1}(1 - \alpha)$, so
$F(x; y) = 1 - \Psi(y - x)$. Then, (3.4) becomes

$$\Upsilon(x) = \int_{-\infty}^{+\infty} \left(1 - \Psi(y - x)\right) d\Phi(y).$$

Therefore, the belief reliability can be calculated by

$$R_B = \mathrm{Ch}\{p - p_{th} < 0\} = \Upsilon(0) = \int_{-\infty}^{+\infty} \left(1 - \Psi(y)\right) d\Phi(y).$$

2. $p$ is a GTB parameter

Let $\zeta = p_{th} - p$. It is easy to see that $\zeta$ is an uncertain random variable.
Assume its chance distribution is $\Upsilon_1(x)$; then we have

$$\Upsilon_1(x) = \int_{-\infty}^{+\infty} F_1(x; y) \, d\Phi(y), \qquad (3.5)$$

where $F_1(x; y)$ is the uncertainty distribution of the uncertain variable
$\varsigma = p_{th} - y$. Similarly, we have $F_1(x; y) = \Psi(x + y)$. Then, (3.5) becomes

$$\Upsilon_1(x) = \int_{-\infty}^{+\infty} \Psi(x + y) \, d\Phi(y).$$

Therefore, the belief reliability can be calculated by

$$R_B = \mathrm{Ch}\{p_{th} - p < 0\} = \Upsilon_1(0) = \int_{-\infty}^{+\infty} \Psi(y) \, d\Phi(y).$$
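
The integrals in Theorem 3.2 can be approximated numerically. The following is a minimal sketch, assuming that $p$ is a Gaussian random variable (so that $d\Phi(y)$ is its density times $dy$) and that $p_{th}$ is a linear uncertain variable; these distributions, the numerical values and the function names are illustrative assumptions, and the trapezoidal rule is one of several possible integration schemes.

```python
from statistics import NormalDist

def linear_cdf(x, a, b):
    """Linear uncertainty distribution (2.1), used here for the threshold."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

def belief_reliability_case2(pdf, psi, lo, hi, stb=True, n=20000):
    """Trapezoidal approximation of Theorem 3.2 on [lo, hi].
    pdf: probability density of p; psi: uncertainty distribution of p_th."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        y = lo + k * h
        g = (1.0 - psi(y)) if stb else psi(y)
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * g * pdf(y)
    return total * h

# Hypothetical inputs: p ~ Normal(1.8, 0.1) (random), p_th ~ L(2.0, 2.5) (uncertain).
p = NormalDist(mu=1.8, sigma=0.1)
psi = lambda y: linear_cdf(y, 2.0, 2.5)
lo, hi = p.mean - 8 * p.stdev, p.mean + 8 * p.stdev

r_stb = belief_reliability_case2(p.pdf, psi, lo, hi, stb=True)
r_gtb = belief_reliability_case2(p.pdf, psi, lo, hi, stb=False)
print(r_stb, r_gtb)
```

Because the two integrands add up to the density of $p$, the STB and GTB values computed for the same pair of distributions sum to one, which is a convenient sanity check.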


3.2.3 Case III: $p$ is uncertain while $p_{th}$ is random

Theorem 3.3. Suppose the system performance parameter $p$ is an uncertain variable with an
uncertainty distribution $\Phi(x)$ and the associated threshold $p_{th}$ is a random variable
with a probability distribution $\Psi(x)$. Then the PM-based belief reliability of the system
can be calculated by:

$$R_B = \begin{cases} \int_{-\infty}^{+\infty} \Phi(y) \, d\Psi(y), & \text{if } p \text{ is STB}, \\ \int_{-\infty}^{+\infty} \left(1 - \Phi(y)\right) d\Psi(y), & \text{if } p \text{ is GTB}. \end{cases}$$


Proof Since the proof of this theorem is similar 
to the previous one, it will not be repeated here. 


4 CASE STUDY 


In this section, we use the proposed model to analyze the static belief reliability of a
contact recording head (Kawakubo, Miyazawa, Nagata, & Kobatake 2003). The contact recording
head is a vital component of a contact recording system, which is used to achieve a recording
density over 100 Gbit/in². Wear is the main failure mechanism of the recording head. To
satisfy the recording density, the wear depth of the head cannot exceed 2 nm (determined by
the overcoat thickness of the head) after 100 h of use. Since a track seek also reciprocates
during the disk rotation, the worn area is donut-shaped.

Because the wear rate of the recording head 
decreases with increasing sliding distance in wear 
tests, to quantify the wear volume, an extended 
Archard’s wear equation developed by Kawakubo, 
Miyazawa, Nagata, & Kobatake (2003) is utilized: 


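[Equation (4.1): extended Archard's wear equation of Kawakubo et al. (2003), giving $V$ as a function of $k_s$, $W$, $L$, $L_s$, $a$, $B$ and $b$; the formula is not legible in this copy.]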


where $V$ denotes the wear volume, $k_s$ denotes the specific wear amount (SWA) at a standard
sliding distance $L_s$, $W$ denotes the sliding load, $L$ denotes the total sliding distance,
$a$ denotes the running-in coefficient ranging from 0 to 1, $B$ denotes the sliding width,
and $b$ denotes the head contact width.

The uncertainty of this physical model may come from the input parameters. Among them, the
coefficient $a$ is obtained relatively precisely from experiments, $k_s$ is calculated based
on $a$ and the material properties, $L_s$ is determined by $k_s$, $L$ is easily controlled to
be a crisp value (the total function time is 100 h and the sliding speed is controlled to be
10 m/s, so $L = 3.6 \times 10^6$ m), and $B$ and $b$ are controlled during the design and
manufacturing phases. The only uncertain parameter is $W$, which may be affected by many
factors in real cases and for which there are not enough data to estimate an accurate value.
Therefore, in this paper, we model $W$ as an uncertain variable following a normal
uncertainty distribution (of the form shown in (2.2)). Through uncertainty propagation, the
uncertainty distribution of $V$ can be easily calculated. In addition, since the threshold of
$V$ is given by experts according to the overcoat thickness of the head, it may be affected
by epistemic uncertainty. To be more precise, we describe $V_{th}$ as an uncertain variable
as well. The values or the distributions of critical parameters are summarized in Table 1.


Table 1. Values or distributions of parameters. (Exponents shown as {?} are not legible in the source.)

Parameter                      Value or distribution
SWA                            k_s = 2.55 × 10^{?} (m²/N)
Standard sliding distance      L_s = 1000 (m)
Running-in coefficient         a = 0.39
Total sliding distance         L = 3.6 × 10^6 (m)
Sliding width                  B = 0.015 (m)
Head width                     b = 10^{?} (m)
Head contact area              10^{?} (m²)
Contact load                   W ~ N(μ = 0.7, σ = 0.03) (mN)
Wear volume threshold          V_th ~ L(a = 2, b = 2.5) (10^{?} m³)


Note: $\mathcal{N}(\mu, \sigma)$ and $\mathcal{L}(a, b)$ are normal and linear uncertainty
distributions in the form of (2.2) and (2.1), respectively.

Based on the operational laws of uncertainty theory, the wear volume of the head after 100 h
of use also follows a normal uncertainty distribution, i.e.,

$$V \sim \mathcal{N}(\mu_V = 1.8606, \sigma_V = 0.07974),$$

expressed in the same unit as $V_{th}$ in Table 1.

Assume the distribution functions of $V$ and $V_{th}$ are $\Phi_V(x)$ and $\Phi_{V_{th}}(x)$,
respectively. Since the wear volume $V$ is an STB parameter, by using Theorem 3.1, the static
belief reliability can be calculated to be

$$R_B = \sup_{x \in \mathbb{R}} \left(\Phi_V(x) \wedge (1 - \Phi_{V_{th}}(x))\right) = 0.97078.$$
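
The reported value can be checked with a short script. The sketch below is not the authors' implementation: it locates the supremum in Theorem 3.1 by bisection on the crossing point of the increasing curve $\Phi_V$ and the non-increasing curve $1 - \Phi_{V_{th}}$, and the function names are illustrative. It also assumes that $V$ is proportional to $W$, which is consistent with the ratio $\sigma_V / \mu_V = \sigma_W / \mu_W$ of the stated values.

```python
import math

def normal_cdf(x, e, sigma):
    """Normal uncertainty distribution (2.2)."""
    t = math.pi * (e - x) / (math.sqrt(3.0) * sigma)
    if t > 700.0:  # avoid overflow far in the lower tail
        return 0.0
    return 1.0 / (1.0 + math.exp(t))

def linear_cdf(x, a, b):
    """Linear uncertainty distribution (2.1)."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

def belief_reliability_stb(phi_p, psi_th, lo, hi, tol=1e-12):
    """sup_y min(phi_p(y), 1 - psi_th(y)) for an STB parameter.
    Assumes the two curves cross inside [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi_p(mid) < 1.0 - psi_th(mid):
            lo = mid
        else:
            hi = mid
    y = 0.5 * (lo + hi)
    return min(phi_p(y), 1.0 - psi_th(y))

# Proportionality check: scaling W ~ N(0.7, 0.03) so that the mean becomes 1.8606.
scale = 1.8606 / 0.7
sigma_v = scale * 0.03

r_b = belief_reliability_stb(
    lambda x: normal_cdf(x, 1.8606, 0.07974),  # distribution of V
    lambda x: linear_cdf(x, 2.0, 2.5),         # distribution of V_th
    2.0, 2.5,
)
print(round(sigma_v, 5), round(r_b, 5))
```

The crossing point lies just above the lower bound of the threshold distribution, so the reliability is dominated by the upper tail of $V$.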


5 CONCLUSIONS 


In this paper, we develop performance-margin-based belief reliability models using
uncertainty theory and chance theory. Considering the uncertainty of the performance
parameter and its threshold, we discuss the model for three cases: $p$ and $p_{th}$ are both
uncertain variables, $p$ is a random variable while $p_{th}$ is an uncertain variable, and
$p$ is an uncertain variable while $p_{th}$ is a random variable. Three theorems are given as
belief reliability formulas. Finally, a case study on the static belief reliability
evaluation of a contact recording head is performed to illustrate the reliability formulas.


Bayesian updating with time dependent models


P. Beaurepaire

ABSTRACT: Bayesian updating is increasingly used in structural engineering; it is applicable as an
inverse method to identify the model of uncertainty which best matches some available experimental data.
This paper introduces a novel method for the definition of the likelihood function in case the numerical
model is a time dependent function. The set of time instants which best describes the experimental data
is identified using the maximum entropy principle. The marginal distributions of the model response are
identified using the maximum entropy principle as well. The dependence between the responses of the model
for the different time instants is implemented using the linear coefficients of correlation obtained after a
mapping into standard normal distributions. The joint probability density function is subsequently used
in the formulation of the likelihood function. The relevance of the method is demonstrated through an
application example.


1 INTRODUCTION 


Inverse methods are widely used in science and 
engineering; they consist of identifying the input 
parameter of a numerical model leading to an 
adequate match with available experimental data. 
Model updating techniques provide an appropri- 
ate framework and received considerable attention 
from structural engineers during the past decades 
(Friswell & Mottershead 1995, Imregun & Visser 
1991, Arora 2011). They are applied in case a for- 
ward numerical model is available but it is not pos- 
sible or numerically too demanding to evaluate the 
inverse model. Such methods are used for instance 
in case sensors collect the vibration data, which are 
subsequently used to identify modal information 
(amplitudes, modes, damping, etc). A finite ele- 
ment model is then implemented and model updat- 
ing is used to identify the input parameters leading 
to the best fit with the experimental data. 

Bayesian updating methods identify not only the optimal parameter values but also the
probability density function associated with them (Beck & Katafygiotis 1998, Katafygiotis &
Beck 1998). Consider a model $\mathcal{M}$ and let $\theta$ be the set of parameters to be
updated from the experimental data $\mathcal{D}$. Bayes' theorem is expressed as:

$$p(\theta \mid \mathcal{D}, \mathcal{M}) = \frac{p(\mathcal{D} \mid \theta, \mathcal{M}) \, p(\theta \mid \mathcal{M})}{p(\mathcal{D} \mid \mathcal{M})} \qquad (1)$$

where $p(\theta \mid \mathcal{M})$ is the prior distribution, which gathers the initial
knowledge on the parameters; $p(\mathcal{D} \mid \theta, \mathcal{M})$ is the likelihood
function, which quantifies the match between the experimental data and the outcome of the
numerical model; $p(\theta \mid \mathcal{D}, \mathcal{M})$ is the posterior distribution, a
probability density function associated with the model parameters which considers the
information provided by the experimental data $\mathcal{D}$; and
$p(\mathcal{D} \mid \mathcal{M})$ is the evidence, a constant guaranteeing that the posterior
distribution in Equation (1) integrates to one.

Much research effort on Bayesian updating is geared towards the implementation of
computationally efficient numerical methods. The most
frequently used approach consists of generating 
realizations of the posterior distribution and sub- 
sequently assuming that this set of realizations fully 
describes the posterior distribution. Markov chain 
Monte Carlo is frequently used in this context, as it 
can be applied to any arbitrarily given distribution. 
For instance, Ching and Chen (2007) proposed an 
efficient algorithm based on a sequence of inter- 
mediate distributions with a gradual convergence 
from the prior distribution to the posterior dis- 
tribution, combined with an appropriate selec- 
tion of the first state of the Markov chains; Beck 
and Zuev (2013) implemented a method based on 
importance sampling, Markov chain Monte Carlo 
and simulated annealing; Straub and Papaioan- 
nou (2015) introduced Bayesian Updating with 
Structural reliability methods (BUS), a procedure 
used to transform the updating problem into a reli- 
ability problem, which is subsequently solved with 
reduced efforts as multiple efficient algorithms are 
available for reliability analysis. 

The focus of this paper is on the formulation of the likelihood involved in Equation (1). The
formulation of this function is not difficult in case the numerical model is a scalar
function; it is possible to use, for instance, kernel density estimation (Goller 2011, Nagel &
Sudret 2016). Specific formulations are available in case the problem involves modal data
(Vanik, Beck, & Au 2000, Ching, Muto, & Beck 2006). In this paper, a novel implementation of
the likelihood function is proposed in case the response of the numerical model is a time
dependent function of the form:

$$y = f(\theta, t) \qquad (2)$$

where $\theta$ denotes the uncertain parameters and $t$ denotes time.

This manuscript is organized as follows: the 
methods of analysis are described in Section 2; 
Section 3 presents application examples and the 
paper closes with conclusions and perspectives in 
Section 4. 


2 METHODS OF ANALYSIS 


2.1 Distribution at a given time instant 


In case Bayesian updating is applied, a set of experimental realizations of the numerical
model is available and used to define the likelihood function. It is assumed here that it
consists of a set of curves of the response of the model expressed in terms of time, of the
form $(y^{(1)}(t), y^{(2)}(t), \ldots, y^{(n)}(t))$. The method developed here is applicable
only in case the response of the model is greater than zero for all the time instants.

The first step of the procedure is the discretization of the time dependent response of the
model (as shown in Figure 1). A finite set of time instants $t = (t_1, \ldots, t_{N_t})$ is
defined and the same discretization points are applied to all the experimental responses.

The response of the numerical model at the time $t_i$ is a function involving uncertain
parameters; it can therefore be modeled as a random variable $Y_i$. The probability density
function associated with the model response needs to be identified for any predefined time
instant. It is not possible to assign a priori a probability distribution (e.g. Gaussian,
lognormal, uniform) to $Y_i$, as the most suitable distribution is problem dependent. Novi
Inverardi and Tagliani (2003) proposed a flexible strategy to model the probability density
function of an arbitrarily given distribution from a set of experimental data; it is
expressed as:

$$p_Y(x \mid t_i) = \exp\left(-\sum_{j=0}^{M_i} \lambda_{i,j} \, x^{\alpha_{i,j}}\right) \qquad (3)$$

where $p_Y(\cdot \mid t_i)$ denotes the probability density associated with the model
response at the time $t_i$, and $\lambda_i = (\lambda_{i,0}, \ldots, \lambda_{i,M_i})$ and
$\alpha_i = (\alpha_{i,0}, \ldots, \alpha_{i,M_i})$ denote a set of parameters identified
using the maximum entropy principle.

The distribution described in Equation (3) may be characterized by its fractional moments at
the time instant $t_i$, which are expressed as:

$$\mu_{i,j} = E\left[Y(t_i)^{\alpha_{i,j}}\right] = \int_0^{\infty} x^{\alpha_{i,j}} \, p_Y(x \mid t_i) \, dx \qquad (4)$$


These fractional moments can be estimated from the experimental data:

$$\mu_{i,j} \approx \hat{\mu}_{i,j} = \frac{1}{n} \sum_{k=1}^{n} \left(y^{(k)}(t_i)\right)^{\alpha_{i,j}} \qquad (5)$$

Figure 1. Schematic aspect of a time dependent response (vertical axis: response of the model). Each curve can be associated with a realization of the available data. (a) Original functions. (b) Discretization of the functions.

The identification of the maximum entropy distribution may be simplified using equispaced
moments (see e.g. (Taufer, Bose, & Tagliani 2009)), i.e. $\alpha_{i,0} = 0$ and
$\alpha_{i,j} = j \alpha_{i,1}$ for $j \ge 1$. The coefficient $\lambda_{i,0}$ is used to
guarantee that Equation (3) integrates to one; it is expressed as:

$$\lambda_{i,0} = \ln\left[\int_0^{\infty} \exp\left(-\sum_{j=1}^{M_i} \lambda_{i,j} \, x^{\alpha_{i,j}}\right) dx\right] \qquad (6)$$

The integral involved in Equation (6) is estimated using Gauss-Laguerre quadrature.

The entropy associated with the distribution is expressed as:

$$\Gamma_i(\lambda_i, \alpha_i) = \lambda_{i,0} + \sum_{j=1}^{M_i} \lambda_{i,j} \, \hat{\mu}_{i,j} \qquad (7)$$

The maximum entropy distribution is subsequently identified using an optimization procedure
to maximize Equation (7). The design variables of the optimization (i.e. the adjustable
parameters) are $\alpha_i$ and $\lambda_i$. The values of these coefficients obtained through
the optimization procedure define a probability density function in good agreement with the
experimental data.
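
The quantities in Equations (5), (6) and (7) can be evaluated with a few lines of code. The sketch below is an illustration under stated assumptions, not the implementation used in the paper: the integral of Equation (6) is approximated with a plain trapezoidal rule on a truncated interval instead of Gauss-Laguerre quadrature (to keep the example free of external libraries), the optimization over the coefficients is omitted, and all names and numerical values are hypothetical.

```python
import math

def fractional_moments(samples, alphas):
    """Empirical fractional moments, Equation (5), for one time instant."""
    n = len(samples)
    return [sum(y ** a for y in samples) / n for a in alphas]

def log_normalizer(lams, alphas, upper=50.0, steps=200000):
    """lambda_0 of Equation (6), via a trapezoidal rule on [0, upper].
    Assumes the integrand is negligible beyond `upper`."""
    h = upper / steps
    total = 0.0
    for k in range(steps + 1):
        x = k * h
        g = math.exp(-sum(l * x ** a for l, a in zip(lams, alphas)))
        total += (0.5 if k in (0, steps) else 1.0) * g
    return math.log(total * h)

def entropy(lams, alphas, mus):
    """Entropy of Equation (7) for given coefficients and fractional moments."""
    return log_normalizer(lams, alphas) + sum(l * m for l, m in zip(lams, mus))

# Hypothetical data at one time instant, with equispaced exponents alpha_j = j * alpha_1.
samples = [0.8, 1.1, 0.5, 1.7, 1.3]
alpha_1, m_i = 0.5, 2
alphas = [j * alpha_1 for j in range(1, m_i + 1)]
mus = fractional_moments(samples, alphas)
print(mus, entropy([1.0, 0.5], alphas, mus))
```

A simple check of the formulas is the case of a single exponent equal to one: the density of Equation (3) is then exponential with rate $\lambda_1$, so Equation (6) gives $\lambda_0 = -\ln \lambda_1$ and, with the exact first moment $1/\lambda_1$, Equation (7) gives $1 - \ln \lambda_1$.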

The probability density function $p_Y(x \mid t_i)$ and its coefficients $\alpha_i$ and
$\lambda_i$ should be expressed in terms of $\mathcal{D}$, as the experimental data are used
to identify the coefficient values. However, this dependence is omitted in the notation to
simplify the formulas.

The reader is referred to (Novi Inverardi & Tagliani 2003) for a detailed derivation of the
equations described above, an in-depth discussion of the method and a procedure applicable to
set the total number of fractional moments $M_i$ to be used.


2.2 Distribution associated with a set of time 
instants 


The experimental values of the numerical model for a set of time instants $t$ are
characterized by the marginal distributions and by the linear correlation coefficients. The
procedure described in Section 2.1 is applied independently for all the time instants to
identify the marginal distributions. The joint probability density function is subsequently
defined using the Nataf model (Nataf 1962, Liu & Der Kiureghian 1986), which is equivalent to
using a Gaussian copula (see e.g. (Mai & Scherer 2012, Schölzel & Friederichs 2008)). The
iso-probabilistic transformation is used and auxiliary random variables with standard normal
distributions are introduced; they are defined as:

$$Z_i = \Phi^{-1}\left(F_{Y_i}(Y(t_i))\right) \qquad (8)$$


where $F_{Y_i}$ denotes the cumulative distribution function associated with the response of
the numerical model at the time instant $t_i$ and $\Phi^{-1}$ denotes the inverse of the
cumulative distribution function of the standard normal distribution. In the general case,
the responses of the numerical model at the time instants $t_i$ and $t_j$ are correlated.
Therefore, the random variables $Z_i$ and $Z_j$ are correlated as well. Their correlation
coefficient is determined by applying the iso-probabilistic transformation to the
experimental data:

$$z_i^{(k)} = \Phi^{-1}\left(F_{Y_i}\left(y^{(k)}(t_i)\right)\right) = \Phi^{-1}\left(1 - \int_{y^{(k)}(t_i)}^{\infty} p_Y(x \mid t_i) \, dx\right) \qquad (9)$$

Gauss-Laguerre quadrature is used for the integral involved in Equation (9).
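
The mapping of Equations (8) and (9), followed by the estimation of a correlation coefficient in the standard normal space, can be sketched as follows. As a simplifying assumption, the cumulative distribution function of the maximum entropy density is replaced here by the empirical distribution function rank/(n + 1); ties between data values are not handled, and all names and data are hypothetical.

```python
from statistics import NormalDist

def normal_scores(values):
    """Map data to the standard normal space, in the spirit of Equation (9).
    The CDF is approximated by rank / (n + 1) (an assumption of this sketch)."""
    n = len(values)
    order = sorted(range(n), key=lambda k: values[k])
    std_normal = NormalDist()
    z = [0.0] * n
    for rank, k in enumerate(order, start=1):
        z[k] = std_normal.inv_cdf(rank / (n + 1))
    return z

def pearson(u, v):
    """Linear correlation coefficient of two equally long samples."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / (s_uu * s_vv) ** 0.5

# Hypothetical responses of five experimental curves at two time instants.
y_t1 = [0.8, 1.1, 0.5, 1.7, 1.3]
y_t2 = [v ** 3 for v in y_t1]  # monotone function of the first response

rho = pearson(normal_scores(y_t1), normal_scores(y_t2))
print(rho)
```

Because the second response is a monotone increasing function of the first, the two sets of normal scores coincide and the correlation coefficient in the transformed space equals one, although the relation between the original responses is nonlinear.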

The coefficients of correlation are directly determined once all the experimental data are
mapped into standard normal distributions. The joint probability density function of the
numerical model's response is:

$$p_Y(x \mid t) = p_Y(x_1 \mid t_1) \, p_Y(x_2 \mid t_2) \cdots p_Y(x_{N_t} \mid t_{N_t}) \, \frac{\varphi_{N_t}(z, R)}{\varphi(z_1) \, \varphi(z_2) \cdots \varphi(z_{N_t})} \qquad (10)$$

where $\varphi$ denotes the probability density function of a standard normal distribution
and $\varphi_{N_t}$ denotes the joint probability density function associated with a
multivariate standard normal distribution, $R$ being the matrix containing the coefficients
of correlation.
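
For two time instants, the ratio of normal densities in Equation (10) has a simple closed form, which makes the structure of the Nataf model easy to inspect. The sketch below evaluates the logarithm of Equation (10) in that bivariate case; exponential marginals are used purely as a stand-in for the maximum entropy densities, and the function names and values are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def nataf_log_density_2d(x1, x2, marginal1, marginal2, rho):
    """Logarithm of Equation (10) for two time instants.
    Each marginal is a (pdf, cdf) pair; rho is the correlation of (Z_1, Z_2)."""
    pdf1, cdf1 = marginal1
    pdf2, cdf2 = marginal2
    z1 = STD_NORMAL.inv_cdf(cdf1(x1))
    z2 = STD_NORMAL.inv_cdf(cdf2(x2))
    # log of phi_2(z, R) / (phi(z1) * phi(z2)) for a 2x2 correlation matrix
    log_ratio = (-0.5 * math.log(1.0 - rho ** 2)
                 - (rho ** 2 * (z1 ** 2 + z2 ** 2) - 2.0 * rho * z1 * z2)
                 / (2.0 * (1.0 - rho ** 2)))
    return math.log(pdf1(x1)) + math.log(pdf2(x2)) + log_ratio

def exponential(rate):
    """Exponential marginal, a placeholder for a maximum entropy density."""
    return (lambda x: rate * math.exp(-rate * x),
            lambda x: 1.0 - math.exp(-rate * x))

m1, m2 = exponential(1.0), exponential(2.0)
print(nataf_log_density_2d(0.5, 1.2, m1, m2, rho=0.0))
print(nataf_log_density_2d(0.5, 1.2, m1, m2, rho=0.6))
```

With a zero correlation the ratio equals one and the joint density reduces to the product of the marginals, which is the expected limiting behaviour of the model.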

Alternative strategies may be considered to account for the correlation between the responses
of the numerical model at different time instants. For instance, copulas offer a rational
framework to consider alternative dependence structures. However, the joint probability
density function described in Equation (10) is involved in the definition of the objective
function of an optimization problem (as discussed in the next subsection). Therefore, a large
number of density functions need to be defined in an iterative way, which may become
challenging in case multiple structures of dependence need to be considered. Only the Nataf
model is implemented in this work for the sake of simplicity.


2.3 Identification of a representative set of time 
instants 


A reduced set of time instants $t$ needs to be identified to define the likelihood function,
and the maximum entropy principle is used, as the quantity of information extracted from the
experimental data can be maximized using this technique (Jaynes 1957a, Jaynes 1957b). The
entropy associated with $t$ is:

$$\Gamma(t) = -\int_0^{\infty} \cdots \int_0^{\infty} p_Y(x \mid t) \, \ln\left(p_Y(x \mid t)\right) dx \qquad (11)$$


The expression of the joint probability density function given in Equation (10) can be
inserted in Equation (11); this formula can be expanded and simplified. This procedure is not
derived in detail herein and the simplified formula of the entropy is:

$$\Gamma(t) = \sum_{i=1}^{N_t} \Gamma_i + \frac{1}{2} \ln \det\left(2 \pi e R\right) \qquad (12)$$

where $\Gamma_i$ is the entropy defined in Equation (7). The first term on the right-hand
side of Equation (12) is the contribution of the marginal distributions to the total entropy
and the second term is the contribution of the correlation. The set of time instants which
best represents the experimental data is obtained by maximizing Equation (12):

$$t^* = \arg\max_t \Gamma(t) \qquad (13)$$

$t^*$ may be interpreted as the set of $N_t$ instants which contains the maximum of
information from the experimental data. It is therefore subsequently used in the definition
of the likelihood function.
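
When the number of candidate time instants is small, the maximization in Equation (13) can be carried out by enumerating all subsets of size $N_t$ and evaluating Equation (12) for each of them. The sketch below follows this brute-force route, which is an assumption of this illustration (the paper does not state which optimizer is used); the marginal entropies and the correlation matrix are hypothetical inputs.

```python
import math
from itertools import combinations

def log_det(matrix):
    """ln det of a symmetric positive definite matrix (Gaussian elimination)."""
    a = [row[:] for row in matrix]
    n = len(a)
    total = 0.0
    for i in range(n):
        total += math.log(a[i][i])
        for r in range(i + 1, n):
            factor = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= factor * a[i][c]
    return total

def joint_entropy(idx, marginal_entropy, corr):
    """Equation (12) for the subset of time instants given by the indices idx."""
    sub = [[corr[i][j] for j in idx] for i in idx]
    n = len(idx)
    # ln det(2*pi*e*R) = n * ln(2*pi*e) + ln det(R)
    correlation_term = 0.5 * (n * math.log(2.0 * math.pi * math.e) + log_det(sub))
    return sum(marginal_entropy[i] for i in idx) + correlation_term

def best_time_instants(marginal_entropy, corr, n_t):
    """Equation (13) by exhaustive search over all subsets of size n_t."""
    candidates = combinations(range(len(marginal_entropy)), n_t)
    return max(candidates, key=lambda idx: joint_entropy(idx, marginal_entropy, corr))

# Hypothetical values for three candidate time instants.
gamma = [1.0, 1.0, 1.0]
corr = [[1.0, 0.95, 0.2],
        [0.95, 1.0, 0.3],
        [0.2, 0.3, 1.0]]
print(best_time_instants(gamma, corr, 2))
```

With equal marginal entropies, the selected pair is the one with the weakest correlation, since strongly correlated instants carry largely redundant information. The number of subsets grows combinatorially, so this enumeration is only practical for small sets of candidates.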


2.4 Bayesian updating 


The likelihood function involves the numerical model $y$, which needs to be evaluated for the
time instants maximizing Equation (12). It takes as an input the values of the uncertain
parameters $\theta$. The formulation of the likelihood function is well known and widely
applied in statistics in case realizations of a random variable and its probability density
function are available. In the present problem, the joint probability density function
described in the previous subsection is used and only one realization is considered; it is
the response of the numerical model associated with $\theta$. The likelihood function is:

$$p(\mathcal{D} \mid \theta, \mathcal{M}) = p_Y\left(y(t^*, \theta) \mid t^*\right) \qquad (14)$$

with $y(t^*, \theta) = \left[y(t_1^*, \theta), y(t_2^*, \theta), \ldots, y(t_{N_t}^*, \theta)\right]$.
The dependence with respect to the experimental data $\mathcal{D}$ is not explicitly stated
in the right-hand side of Equation (14). Nevertheless, this dependence is carried by the
correlation matrix and by the coefficients $\alpha_i$ and $\lambda_i$ involved in the
definition of the marginal distributions. Therefore, the value of the likelihood function
associated with $\theta$ changes in case the experimental data are modified or expanded.
Samples of the posterior distribution expressed in Equation (1) are generated using crude
Monte Carlo simulation, as the application examples are numerically inexpensive. Advanced
methods may be applied as well, as the choice of the updating algorithm does not affect the
relevance of the proposed formulation of the likelihood function.


3 APPLICATION EXAMPLES 


3.1 Damped oscillator 


The methods described in the previous section are applied to an example involving a mass
spring damper system as shown in Figure 2. A force F is applied at the time $t = 0$ and the
quantity of interest is the displacement of the mass during 25 seconds. This problem has an
analytical solution, which is used as the numerical model. The applied force and the mass are
deterministic and arbitrarily set to 1 N and 1 kg, respectively, and the same parameter
values are used during the updating. Twelve samples of the damping and of the stiffness are
selected arbitrarily and the corresponding displacement of the mass is determined. These
trajectories of the displacement of the mass are used as the experimental data and Bayesian
updating is subsequently used to identify the posterior distribution of the uncertain
parameters. Therefore, this example does not consider model uncertainty, as the same model is
used to generate the experimental data and during the updating procedure.

The prior distributions are uniform between 0.5 and 2 for both the stiffness of the spring
and the damping coefficient, which corresponds to a non-informative prior distribution within
the predefined ranges. It is initially assumed that all the regions of the domain are equally
likely to contain samples of the uncertain parameters. Two fractional moments are used to
define the marginal probability density functions associated with the experimental data
expressed in Equation (3), i.e. $M_i = 2$ for all the time instants.

The procedure described in Section 2 does not provide any strategy to set $N_t$, the total
number of representative time instants to be used for the updating. It is arbitrarily set to
1, 2, 3 and 5 and the results are shown in Figure 3.

Figure 2. Mass spring damper system.


3.2. Over-constrained beam model 


The second example considered here is an over-constrained beam model as shown in Figure 4.
The left hand side of the beam is clamped, the right hand side is simply supported and two
forces are applied. All the units are expressed using the metric system (and the forces are
therefore defined in newtons). The Young's modulus of the material is equal to 200 GPa, the
total length of the beam is L = 1 m and the moment of inertia is equal to $10^{?}$ m$^4$
(exponent not legible in the source). The random variables involved in the model are the
forces $f_1$ and $f_2$. Twenty samples are arbitrarily generated; the corresponding
deflection of the beam is computed for all the coordinates of the beam and subsequently used
during the Bayesian updating procedure. Therefore, the quantity of interest (i.e. the beam
deflection) is expressed with respect to a space parameter. Even though a time parameter is
considered in the description of the methods in Section 2, these procedures remain applicable
without loss of generality to any parameter dependent problem and can therefore be applied to
this example. For both forces, the prior distribution is uniform between 4 and 16 kN. Three
fractional moments are used in the definition of the marginal distributions which define the
likelihood function. As in the previous example, the parameter $N_t$ (i.e. the dimensionality
of the set of random variables used in the definition of the likelihood function) is
arbitrarily set to 1, 2, 3 and 5. Figure 5 shows the results obtained from the Bayesian
updating procedure.


Figure 3. Results of the Bayesian updating procedure for the damped oscillator (horizontal axis: stiffness). The contour lines represent the iso-values of the posterior distribution, the dots represent realizations of this distribution and the circles are the samples used to generate the experimental data. (a) $N_t = 1$. (b) $N_t = 2$. (c) $N_t = 3$. (d) $N_t = 5$.


3.3 Discussion


For the two considered examples, the results are strongly influenced by $N_t$, the
dimensionality of the set defining the likelihood function.

In case $N_t = 1$, a poor match between the posterior distributions and the samples used to
generate the experimental data is observed. It is indeed not possible to identify the
bivariate distribution function associated with the input parameters using a single response
of the model, as an infinite number of values of the input parameters can be associated with
the response.

The posterior distribution is smooth and regular for $N_t = 2$. In this case, the results are
also in good agreement with the dependence in the experimental data. Moreover, it is
graphically observed that the experimental input values are slightly positively correlated in
the first example and negatively correlated in the second example. The same trend can be seen
in the posterior distribution for both examples.

In case $N_t = 3$ and $N_t = 5$, the posterior distribution becomes irregular; it seems that
the experimental data are over-fitted. The surface enclosed by a contour line of the
posterior distribution is non-convex and the regions of the parameter domain in the vicinity
of the samples used to generate the experimental data are favored (i.e. the posterior
distribution has higher values in these regions).

[Figure: Results of the Bayesian updating procedure for the beam model. The contour lines represent the iso-values of the posterior distribution, the dots represent realizations of this distribution and the circles are the samples used to generate the experimental data. (a) N = 1. (b) N = 2. (c) N = 3. (d) N = 5.]

When the Bayesian updating problem does not include model uncertainties, N should not be greater than the dimensionality of the problem. For both examples, all the samples of the models’ response used to define the likelihood function lie on a two-dimensional manifold (embedded in a space of dimension three or five). The strategy used to define the joint probability density function associated with the response of the model then becomes inappropriate, leading to the poor fit between the posterior distribution and the samples used to define the experimental data. As linear correlation coefficients are used to model the dependence between the model’s response at the various time instants, only the first two eigenvalues of the correlation matrix are non-negligible; all the other eigenvalues are very small.

These examples suggest empirically that N should be equal to the total number of parameters involved in the model of uncertainty when the problem does not involve model uncertainties.


4 CONCLUSIONS 


This paper describes a method for the formulation of the likelihood function used for Bayesian updating of time-dependent models. The maximum entropy principle is used to (I) define the marginal distributions for all the time instants and (II) identify the set of time instants which best represent the response of the model. The joint probability density function is defined by its marginals and by the matrix of linear correlation coefficients obtained after the Nataf transformation. This probability density function is subsequently used in the formulation of the likelihood; the definition of the likelihood generally used in statistics is applied here.
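
For reference, a joint density built from prescribed marginals and a Nataf (Gaussian copula) correlation structure takes the standard form below. The symbols are chosen here for illustration and are assumptions that may differ from the notation of the earlier sections: $f_i$ and $F_i$ are the marginal pdf and cdf of the i-th response, $\varphi$ and $\Phi$ the standard normal pdf and cdf, $\varphi_n(\cdot\,; \mathbf{R}_0)$ the n-variate standard normal pdf with correlation matrix $\mathbf{R}_0$.

```latex
f_{\mathbf{X}}(\mathbf{x})
  = \varphi_n\!\left(\mathbf{z}; \mathbf{R}_0\right)
    \prod_{i=1}^{n} \frac{f_i(x_i)}{\varphi(z_i)},
\qquad
z_i = \Phi^{-1}\!\left(F_i(x_i)\right), \quad i = 1, \ldots, n .
```

When $\mathbf{R}_0$ is close to singular, as happens when the responses lie on a low-dimensional manifold, $\varphi_n$ becomes ill-conditioned, which is consistent with the behaviour discussed in Section 3.3.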

The method is applied with success to a problem 
involving the response in the time domain of a spring 
damper system with two uncertain parameters: the 
stiffness of the spring and the damping coefficient. 

The example heuristically suggests using a parameter N equal to the total number of uncertain parameters involved in the numerical model (at least when there are no model uncertainties).

Future work is geared towards the application 
of this method to examples involving model uncer- 
tainties and a larger number of random variables. 


REFERENCES 


Arora, V. (2011). Comparative study of finite element 
model updating methods. Journal of Vibration and Con- 
trol 17(13), 2023-2039. 


Beck, J. & K. Zuev (2013). Asymptotically independent 
Markov sampling: A new MCMC scheme for Bayesian 
inference. International Journal for Uncertainty Quantifi- 
cation 3(2), 445-474. 

Beck, J. & L. Katafygiotis (1998). Updating models and 
their uncertainties. i: Bayesian statistical framework. 
Journal of Engineering Mechanics 128(4), 380-391. 

Ching, J. & Y.-C. Chen (2007). Transitional markov chain 
monte carlo method for Bayesian model updating, 
model class selection, and model averaging. Journal of 
Engineering Mechanics 133(7), 816-832. 

Ching, J., M. Muto, & J.L. Beck (2006). Structural model
updating and health monitoring with incomplete modal
data using Gibbs sampler. Computer-Aided Civil and
Infrastructure Engineering 21, 242-257.

Friswell, M. & J.E. Mottershead (1995). Finite element 
model updating in structural dynamics. Springer Science 
& Business Media. 

Goller, B. (2011, June). Stochastic Model Validation 
of Structural Systems. Ph. D. thesis, University of 
Innsbruck. 

Imregun, M. & W.J. Visser (1991). A review of model
updating techniques. Shock and Vibration Digest 23(1),
9-20.

Jaynes, E.T. (1957a). Information theory and statistical 
mechanics. Physical Review. Series II 106(4), 620-630. 

Jaynes, E.T. (1957b). Information theory and statistical
mechanics ii. Physical Review. Series II 108(2), 171-190.

Katafygiotis, L. & J. Beck (1998). Updating models and 
their uncertainties. ii: Model identifiability. Journal of 
Engineering Mechanics 128(4), 463-467. 

Liu, P-L. & A. Der Kiureghian (1986). Multivariate dis- 
tribution models with prescribed marginals and cov- 
ariances. Probabilistic Engineering Mechanics 1(2), 
105-112. 

Mai, J.-F. & M. Scherer (2012). Simulating Copulas (Stochastic
Models, Sampling Algorithms and Applications),
Volume 4 of Series in Quantitative Finance. World
Scientific.

Nagel, J. & B. Sudret (2016). A unified framework for multilevel
uncertainty quantification in Bayesian inverse
problems. Probabilistic Engineering Mechanics 43(Supplement C),
68-84.

Nataf, A. (1962). Détermination des distributions de
probabilités dont les marges sont données (in French).
Technical report, Académie des Sciences, Paris. X.

Novi Inverardi, P.L. & A. Tagliani (2003). Maximum
entropy density estimation from fractional moments.
Communications in Statistics - Theory and Methods
32(2), 327-345.

Schölzel, C. & P. Friederichs (2008). Multivariate non-normally
distributed random variables in climate research
- introduction to the copula approach. Nonlinear Processes
in Geophysics 15(5), 761-772.

Straub, D. & I. Papaioannou (2015). Bayesian updating 
with structural reliability methods. Journal of Engineer- 
ing Mechanics 141(3), 04014134. 

Taufer, E., S. Bose, & A. Tagliani (2009). Optimal predic- 
tive densities and fractional moments. Applied Stochas- 
tic Models in Business and Industry 25(1), 57-71. 

Vanik, M.V., J.L. Beck, & S.-K. Au (2000). Bayesian proba- 
bilistic approach to structural health monitoring. Jour- 
nal of Engineering Mechanics. 




Safety and Reliability - Safe Societies in a Changing World — Haugen et al. (Eds) 
© 2018 Taylor & Francis Group, London, ISBN 978-0-8153-8682-7 


Advanced methodology for uncertainty propagation in computer 
experiments with large number of inputs: Application to accidental 
scenario in a pressurized water reactor 


A. Marrel 
CEA, DEN, DER, SESI, LEMS, Saint-Paul-lez-Durance, France 


B. Iooss 
EDF R&D, Chatou, France 
Institut de Mathématiques de Toulouse, Toulouse, France 


ABSTRACT: Complex computer codes are often too time-expensive to be used directly for uncertainty propagation or sensitivity analysis. One solution to this problem consists in replacing the CPU-time-expensive computer model by a CPU-inexpensive mathematical function, called a metamodel. Among the metamodels classically used in computer experiments, the Gaussian process (Gp) model has shown strong capabilities to solve practical problems. However, in the case of high-dimensional experiments (with typically several tens of inputs), the Gp metamodel building process remains difficult. To address this limitation, we propose a general methodology which combines several advanced statistical tools. First, an initial space-filling design is performed, providing a full coverage of the high-dimensional input space (Latin hypercube sampling with optimal discrepancy property). From this, a screening based on dependence measures is performed. More specifically, the Hilbert-Schmidt independence criterion, which builds upon kernel-based approaches for detecting dependence, is used. It allows ordering the inputs by decreasing primary influence, for the purpose of the metamodeling. Furthermore, significance tests based either on asymptotic theory or on a permutation technique are performed to identify a group of potentially non-influential inputs. Then, a joint Gp metamodel is sequentially built with the group of influential inputs as explanatory variables. The residual effect of the group of non-influential inputs is captured by the dispersion part of the joint metamodel. Finally, a sensitivity analysis based on variance decomposition can be performed through the joint Gp metamodel. The efficiency of the methodology is illustrated on a thermal-hydraulic calculation case simulating an accidental scenario in a Pressurized Water Reactor.


1 INTRODUCTION


Quantitative assessment of the uncertainties tainting the results of computer simulations is nowadays a major topic of interest in both industrial and scientific communities. One of the key issues in such studies is to get information about the output when the numerical simulations are expensive to run. For example, in nuclear engineering problems, one often faces CPU-time-consuming numerical models and, in such cases, uncertainty propagation, sensitivity analysis, optimization processing and system robustness analysis become difficult tasks. In order to circumvent this problem, a widely accepted method consists in replacing CPU-time-expensive computer models by CPU-inexpensive mathematical functions, called metamodels (Fang et al. 2006). This solution has been applied extensively and has shown its relevance especially when simulated phenomena are related to a small number of random input variables (see Forrester et al. (2008) for example).


However, in the case of high-dimensional numerical experiments (with typically several tens of inputs), depending on the complexity of the underlying numerical model, the metamodel building process remains difficult, or even infeasible. For example, the Gaussian process (Gp) model (Santner et al. 2003), which has shown strong capabilities to solve practical problems, has some caveats when dealing with high-dimensional problems. The main difficulty lies in the estimation of the Gp hyperparameters. Manipulating pre-defined or well-adapted Gp kernels (as in Muehlenstaedt et al. (2012), Durrande et al. (2013)) is a current research direction, while coupling the estimation procedure with variable selection techniques has been proposed by several authors (Welch et al. 1992, Marrel et al. 2008, Woods and Lewis 2017).

In this paper, following the latter technique, we propose a rigorous and robust method for building a Gp metamodel with a high-dimensional vector of inputs before using it to perform variance-based sensitivity analysis.

To build this metamodel, we use a sequential methodology whose technical core is updated with more relevant statistical techniques. First, the screening step is improved by the use of recent and powerful variable selection techniques that require only a small number of model runs. Second, contrary to previous works, we do not remove the non-selected inputs from the Gp model: the uncertainty caused by the dimension reduction is kept by using the joint metamodel technique (Marrel et al. 2012). The integration of this residual uncertainty is important for the robustness of subsequent safety studies and sensitivity analysis. Finally, a sensitivity analysis based on variance decomposition is performed through the joint Gp metamodel, yielding both the estimation of the influence of each selected input and the total effect of the group of non-selected inputs.

Each step of our methodology is detailed in a dedicated section and illustrated on a guideline application, namely a thermal-hydraulic calculation case simulating an accidental scenario in a nuclear reactor. This use-case is first described in the following section.


2 THERMAL-HYDRAULIC TEST-CASE 


Our use-case consists of thermal-hydraulic computer experiments, typically used in support of regulatory work and nuclear power plant design and operation. Indeed, some safety analyses consider the so-called “Loss Of Coolant Accident” (LOCA), which takes into account a double-ended guillotine break with a specific size of piping rupture. It is modeled with the code CATHARE 2.V2.5, which simulates the thermal-hydraulic responses during a LOCA in a Pressurized Water Reactor (Mazgaj et al. 2016).

In this use-case, d = 27 scalar input variables of CATHARE are uncertain. They correspond to various system parameters such as initial conditions, boundary conditions, some critical flowrates, interfacial friction coefficients, condensation coefficients, etc. The output variable of interest is a single scalar, the maximal peak cladding temperature during the accident transient.

In our problem, minimal and maximal values are defined for each uncertain input and, in the framework of a probabilistic approach, their uncertainties are modeled by probability laws defined on the domain of variation (uniform, log-uniform, truncated normal and truncated log-normal laws). Moreover, the d inputs are assumed to be independent. Our first objective with this use-case is to provide a good metamodel for sensitivity analysis, uncertainty propagation and, more generally, safety studies. Indeed, the CPU-time cost of this computer code is too high to carry out all the statistical analyses required in a safety study using only direct calculations of the computer code. A metamodel would allow a more complete and robust demonstration to be developed.

In what follows, the system under study is generically denoted

$$Y = g(X_1, \ldots, X_d) \qquad (1)$$

where $g(\cdot)$ is the numerical model (also called the computer code), whose output $Y$ and input parameters $X_1, \ldots, X_d$ belong to some measurable spaces $\mathcal{Y}$ and $\mathcal{X}_1, \ldots, \mathcal{X}_d$ respectively. $X = (X_1, \ldots, X_d)$ is the input vector and we suppose that $\mathcal{X} = \prod_{k=1}^{d} \mathcal{X}_k \subset \mathbb{R}^d$ and $\mathcal{Y} \subset \mathbb{R}$. For a given value of the vector of inputs $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$, a simulation run of the code yields an observed value $y = g(x)$.


3 STEP 1: INITIAL DESIGN OF 
EXPERIMENTS 


The objective of the initial sampling step is to investigate the whole variation domain of the uncertain parameters in order to fit a predictive metamodel which approximates the code as accurately as possible over this whole domain. For this, we use a space-filling design (SFD) of a certain number n of experiments, providing a full coverage of the high-dimensional input space (Fang et al. 2006). This design makes it possible to investigate the domain of variation of the uncertain parameters and provides a learning sample.

For the SFD type, a Latin Hypercube Sample (LHS) with optimal space-filling and good projection properties (Woods and Lewis 2017) would be well adapted. In particular, Fang et al. (2006) and then Damblin et al. (2013) have shown the importance of ensuring good low-order sub-projection properties. Maximum projection designs (Joseph et al. 2015) or low centered $L^2$ discrepancy LHS (Jin et al. 2005) are then particularly well-suited.

Mathematically, this corresponds to the sample $\{x^{(1)}, \ldots, x^{(n)}\}$ which is performed on the model $g$. This yields n model output values denoted $\{y^{(1)}, \ldots, y^{(n)}\}$ with $y^{(i)} = g(x^{(i)})$. The obtained learning sample is denoted $(X_s, Y_s)$ with $X_s = [x^{(1)T}, \ldots, x^{(n)T}]^T$ and $Y_s = [y^{(1)}, \ldots, y^{(n)}]^T$. The goal is to build an approximating model of $g$ using the n-sample $(X_s, Y_s)$.
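
As an illustration of the design step, the following minimal sketch draws a plain random Latin hypercube in the unit hypercube, using only the Python standard library. It is an assumption-laden simplification: it enforces the one-point-per-stratum property in each dimension but does not optimize any discrepancy or projection criterion, unlike the optimized designs cited above. The function name `latin_hypercube` is hypothetical.

```python
import random


def latin_hypercube(n, d, rng):
    """Return n points in [0, 1)^d with exactly one point per stratum in each dimension."""
    columns = []
    for _ in range(d):
        # One uniform draw inside each of the n equal-width strata, then shuffle the strata.
        column = [(i + rng.random()) / n for i in range(n)]
        rng.shuffle(column)
        columns.append(column)
    # Transpose: one row per simulation, one column per input.
    return [[columns[k][i] for k in range(d)] for i in range(n)]


rng = random.Random(0)
design = latin_hypercube(n=500, d=27, rng=rng)  # sizes of the LOCA test case
```

Each row of `design` is one point of the unit hypercube; it still has to be mapped to the physical ranges of the inputs before running the code.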

The number n of simulations is a compromise between the CPU time required for each simulation and the number of input parameters. Some rules of thumb propose to choose n at least as large as 10 times the dimension d of the input vector (Loeppky et al. 2009, Marrel et al. 2008).

To build the metamodel for the LOCA test case, 
N = 500 CATHARE simulations of this test case 
are performed following a space-filling LHS with 
good projection properties as the design of experi- 
ments. The obtained inputs-output sample consti- 
tutes the learning sample. 


Remark 3.1 Note that the input values are sampled following their prior distributions defined on their variation ranges. Indeed, as we cannot be sure of being able to build a sufficiently accurate metamodel, we prefer to sample the inputs following their probability distributions in order to have at least a probabilized sample of the uncertain output, on which statistical characteristics can be estimated. Moreover, as explained in the next section, dependence measures can be directly estimated on this sample, providing first usable results of sensitivity analysis.
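
One common way to sample inputs from their prior distributions while keeping a space-filling design is to push unit-hypercube coordinates through the inverse cdf of each law. The sketch below does this for the four families mentioned in Section 2. The function names and the parameter values in the example are assumptions for illustration only; they are not taken from the LOCA study.

```python
import math
from statistics import NormalDist


def uniform_ppf(u, lo, hi):
    """Inverse cdf of the uniform law on [lo, hi]."""
    return lo + u * (hi - lo)


def log_uniform_ppf(u, lo, hi):
    """Inverse cdf of the log-uniform law: log(X) is uniform on [log(lo), log(hi)], 0 < lo < hi."""
    return math.exp(math.log(lo) + u * (math.log(hi) - math.log(lo)))


def truncated_normal_ppf(u, mean, std, lo, hi):
    """Inverse cdf of a normal law truncated to [lo, hi], for u strictly between 0 and 1."""
    base = NormalDist(mean, std)
    p_lo, p_hi = base.cdf(lo), base.cdf(hi)
    # Rescale u into the probability mass that lies inside [lo, hi].
    return base.inv_cdf(p_lo + u * (p_hi - p_lo))


def truncated_log_normal_ppf(u, mu_log, sigma_log, lo, hi):
    """Inverse cdf of a truncated log-normal law, obtained by truncating log(X)."""
    return math.exp(truncated_normal_ppf(u, mu_log, sigma_log, math.log(lo), math.log(hi)))


# Hypothetical example: map one unit-hypercube point to four physical inputs.
u_point = [0.25, 0.5, 0.75, 0.9]
x_point = [
    uniform_ppf(u_point[0], 2.0, 4.0),
    log_uniform_ppf(u_point[1], 1.0, 100.0),
    truncated_normal_ppf(u_point[2], 0.0, 1.0, -1.0, 1.0),
    truncated_log_normal_ppf(u_point[3], 0.0, 0.5, 0.5, 2.0),
]
```

Because each transformation is monotone, the stratification of a Latin hypercube in the unit cube is preserved in terms of probability mass for each input.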


4 STEP 2: INITIAL SCREENING BASED 
ON DEPENDENCE MEASURE 


From the learning sample, a screening technique is performed in order to identify the primary influential inputs (PII) on the model output variability. It has been recently shown that screening based on dependence measures (Da Veiga 2015, De Lozzo and Marrel 2016, Raguet and Marrel 2018) or on derivative-based global sensitivity measures (Kucherenko and Iooss 2017, Roustant et al. 2017) is very efficient and can be directly applied on a SFD. One of the great advantages of these methods is that, in addition to screening, the sensitivity indices that they provide can be quantitatively interpreted and used to order the PII by decreasing influence, paving the way for a sequential building of the metamodel. Note that Mara et al. (2017) recently compared the efficiency of several sensitivity measures for the factor fixing setting: Sobol’ indices estimated with a sparse polynomial chaos expansion method, density-based dependence measures and derivative-based global sensitivity measures.

In the considered LOCA test case, the adjoint model is not available and the derivatives of the model output are therefore not computed because of their cost. This considerably limits the interest of using derivative-based sensitivity measures. Moreover, as the number of uncertain inputs is large and HSIC dependence measures have shown good convergence properties (De Lozzo and Marrel 2016), we choose to use the latter for the screening step, directly estimated from the inputs-output sample (metamodel-free estimation).


4.1 Screening based on HSIC dependence 
measure 


Da Veiga (2015) and more recently De Lozzo and Marrel (2016) have proposed to use dependence measures for screening purposes, by applying them directly on a SFD. These sensitivity indices are not the classical variance-based measures (see Iooss and Lemaitre 2015 or Borgonovo and Plischke 2016 for a review of global sensitivity analysis methods). They consider higher order information about the output behavior in order to provide more detailed information. Among them, the Hilbert-Schmidt independence criterion (HSIC) introduced by Gretton et al. (2005) builds upon kernel-based approaches for detecting dependence, and more particularly on cross-covariance operators in reproducing kernel Hilbert spaces (RKHS).

If we consider two RKHS $\mathcal{F}_k$ and $\mathcal{G}$ of functions $\mathcal{X}_k \to \mathbb{R}$ and $\mathcal{Y} \to \mathbb{R}$ respectively, the cross-covariance operator $C_{X_k,Y}$ associated with the joint distribution of $(X_k, Y)$ is the linear operator defined for every $f_{X_k} \in \mathcal{F}_k$ and $g_Y \in \mathcal{G}$ by:

$$\langle f_{X_k}, C_{X_k,Y}\, g_Y \rangle_{\mathcal{F}_k} = \mathrm{Cov}(f_{X_k}, g_Y) \qquad (2)$$

$C_{X_k,Y}$ generalizes the covariance matrix by representing higher order correlations between $X_k$ and $Y$ through nonlinear kernels. The HSIC criterion is then defined by the squared Hilbert-Schmidt norm of the cross-covariance operator:

$$\mathrm{HSIC}(X_k, Y)_{\mathcal{F}_k, \mathcal{G}} = \| C_{X_k,Y} \|^2_{HS} \qquad (3)$$


From this, Da Veiga (2015) introduces a normalized version of the HSIC which provides a sensitivity index of $X_k$:

$$R^2_{\mathrm{HSIC},k} = \frac{\mathrm{HSIC}(X_k, Y)}{\sqrt{\mathrm{HSIC}(X_k, X_k)\,\mathrm{HSIC}(Y, Y)}} \qquad (4)$$

Gretton et al. (2005) also propose a Monte Carlo estimator of $\mathrm{HSIC}(X_k, Y)$ and a plug-in estimator can be deduced for $R^2_{\mathrm{HSIC},k}$. Note that Gaussian kernel functions with empirical estimations of the variance parameter are used in our application (see Gretton et al. 2005 for details).
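
The plug-in estimation of the normalized index in equation (4) can be sketched as follows for scalar inputs and outputs, using only the Python standard library. This is a minimal illustration under stated assumptions: it uses the biased (V-statistic) form trace(KHLH)/n^2, whose normalizing constant cancels in the ratio, and a Gaussian kernel whose bandwidth is the empirical standard deviation. The function names and the toy data are hypothetical and are not taken from the LOCA study.

```python
import math
import random


def gaussian_gram(values):
    """Gram matrix of a Gaussian kernel whose bandwidth is the empirical standard deviation."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return [[math.exp(-((a - b) ** 2) / (2.0 * var)) for b in values] for a in values]


def center(gram):
    """Double-centre a symmetric Gram matrix: H K H with H = I - 11^T / n."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    total_mean = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - row_means[j] + total_mean
             for j in range(n)] for i in range(n)]


def hsic(x, y):
    """Biased estimator trace(K H L H) / n^2, computed as sum_ij (HKH)_ij L_ij / n^2."""
    n = len(x)
    kc = center(gaussian_gram(x))
    gram_y = gaussian_gram(y)
    return sum(kc[i][j] * gram_y[i][j] for i in range(n) for j in range(n)) / n ** 2


def r2_hsic(x, y):
    """Normalized index HSIC(x, y) / sqrt(HSIC(x, x) * HSIC(y, y))."""
    return hsic(x, y) / math.sqrt(hsic(x, x) * hsic(y, y))


# Toy data: a non-monotonic dependence and an independent variable.
rng = random.Random(1)
n = 150
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
y_dependent = [xi ** 2 + rng.gauss(0.0, 0.05) for xi in x]
y_independent = [rng.uniform(-1.0, 1.0) for _ in range(n)]
r2_dep = r2_hsic(x, y_dependent)
r2_ind = r2_hsic(x, y_independent)
```

The quadratic dependence has zero linear correlation with `x`, so it is the kind of relationship that a kernel-based measure is expected to pick up while a correlation coefficient would not.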

Then, from the estimated $R^2_{\mathrm{HSIC}}$, independence tests can be performed for a screening purpose. The objective is to separate the inputs into two sub-groups, the significant ones and the non-significant ones. For a given input $X_k$, it aims at testing the null hypothesis “$\mathcal{H}_0^{(k)}$: $X_k$ and $Y$ are independent”, against its alternative “$\mathcal{H}_1^{(k)}$: $X_k$ and $Y$ are dependent”. The significance level¹ of these tests is hereinafter denoted $\alpha$. Several statistical hypothesis tests are available: asymptotic versions, spectral extensions and bootstrap versions for the non-asymptotic case. All these tests are described and compared in De Lozzo and Marrel (2016); guidance on using them for screening purposes is also proposed. At the end of the screening step, the inputs selected as significant are also ordered by decreasing $R^2_{\mathrm{HSIC}}$. This order will be used for the sequential metamodel building in step 3.
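
A non-asymptotic independence test can be sketched with a permutation scheme: the output sample is randomly permuted to mimic the null hypothesis, and the p-value is the proportion of permuted statistics that reach the observed one. This is an illustrative variant written for this purpose, not necessarily the bootstrap procedure of De Lozzo and Marrel (2016); the function names, the number of permutations and the toy data are assumptions.

```python
import math
import random


def gaussian_gram(values):
    """Gaussian-kernel Gram matrix with the empirical standard deviation as bandwidth."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return [[math.exp(-((a - b) ** 2) / (2.0 * var)) for b in values] for a in values]


def centered(gram):
    """Double-centred version of a symmetric Gram matrix."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    total = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - row_means[j] + total
             for j in range(n)] for i in range(n)]


def hsic_permutation_pvalue(x, y, n_perm, rng):
    """p-value of the test 'x and y are independent' based on permuted HSIC statistics."""
    n = len(x)
    kc = centered(gaussian_gram(x))
    gram_y = gaussian_gram(y)

    def statistic(order):
        # Permuting y amounts to permuting rows and columns of its Gram matrix.
        return sum(kc[i][j] * gram_y[order[i]][order[j]]
                   for i in range(n) for j in range(n)) / n ** 2

    identity = list(range(n))
    observed = statistic(identity)
    exceed = 0
    for _ in range(n_perm):
        order = identity[:]
        rng.shuffle(order)
        if statistic(order) >= observed:
            exceed += 1
    # The +1 terms count the observed sample among the permutations.
    return (exceed + 1) / (n_perm + 1)


rng = random.Random(2)
n = 60
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
y_dependent = [xi + rng.gauss(0.0, 0.1) for xi in x]
y_independent = [rng.uniform(-1.0, 1.0) for _ in range(n)]
p_dependent = hsic_permutation_pvalue(x, y_dependent, n_perm=99, rng=rng)
p_independent = hsic_permutation_pvalue(x, y_independent, n_perm=99, rng=rng)

alpha = 0.1                      # significance level used in the LOCA application
selected = p_dependent < alpha   # the input is kept as significant if True
```

In a screening context, the test is repeated for each input and the inputs whose p-value falls below $\alpha$ form the group of significant inputs.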


4.2 Application on LOCA test case 


From the learning sample of N = 500 simulations, the $R^2_{\mathrm{HSIC}}$ dependence measures are estimated and bootstrap tests with $\alpha = 0.1$ are performed. Eleven inputs are selected as significantly influential. Ordering them by decreasing $R^2_{\mathrm{HSIC}}$ reveals the predominant influence of Xj) ($R^2_{\mathrm{HSIC}} = 0.39$), followed by X,, X,, and X,,, ($R^2_{\mathrm{HSIC}} = 0.04$, 0.02 and 0.02 respectively). X,,, X,3, Xo, Xss X 4, Xa and X,, have a lower influence ($R^2_{\mathrm{HSIC}}$ around 0.01) and the other variables are considered as negligible by the statistical tests.

Note that the estimated HSIC and the results of the significance tests are relatively stable when the learning sample size varies from N = 300 to N = 500. Only two or three selected variables with a very low HSIC ($R^2_{\mathrm{HSIC}}$ around 0.01) can differ. This confirms the robustness of the HSIC indices and the associated significance tests for qualitative sorting and screening purposes.

In the next steps, the eleven significant inputs are 
considered as the explanatory variables, denoted 
PII, in the joint metamodel and will be successively 
included in the building process. The other sixteen 
variables will be joined in a so-called uncontrollable 
parameter. 


5 STEP 3: JOINT GP METAMODEL WITH 
SEQUENTIAL BUILDING PROCESS 


Among all the metamodel-based solutions (polynomials, splines, neural networks, etc.), we focus our attention on Gaussian process (Gp) regression, which extends the kriging principles of geostatistics to computer experiments by considering the correlation between two responses of a computer code depending on the distance between input variables. The Gp-based metamodel presents some real advantages compared to other metamodels: exact interpolation property, simple analytical formulations of the predictor, availability of the mean squared error of the predictions and the proved efficiency of the model (Santner et al. 2003).

¹ The significance level of a statistical hypothesis test is the rate of the type I error, which corresponds to the rejection of the null hypothesis $\mathcal{H}_0$ when it is true.

However, for its application to complex indus- 
trial problems, developing a robust implementa- 
tion methodology is required. Indeed, fitting a Gp 
model implies the estimation of several hyperpa- 
rameters involved in the covariance function. In 
complex situations (e.g. large number of inputs), 
some difficulties can arise from the parameter 
estimation procedure (instability, high number of 
hyperparameters, see Marrel et al. 2008 for exam- 
ple). To tackle this issue, we propose a progressive 
estimation procedure which combines the result 
of the previous screening step and a joint Gp 
approach (Marrel et al. 2012). 


5.1 Sequential building process based on 
successive inclusion of explanatory variables 


At the end of the screening step, the inputs selected as significant (group of PII) are ordered by decreasing influence. The sorted PII are successively included in the metamodel explanatory inputs while the other inputs (remaining PII and the sixteen non-selected inputs) are joined in a single macro-parameter which is considered as an uncontrollable parameter (i.e. a stochastic parameter, a notion detailed in section 5.2). Thus, at the j-th iteration, a joint Gp metamodel is built with, as explanatory inputs, the first j sorted PII. The definition and building procedure of a joint Gp is fully described in Marrel et al. (2012) and summarized in section 5.2.

However, building a Gp or a joint Gp requires a numerical optimization in order to estimate all the parameters of the metamodel (covariance hyperparameters and variance parameter). As anisotropic (stationary) covariances are usually considered in computer experiments, the number of hyperparameters increases linearly with the number of inputs. In order to improve the robustness of the optimization process and deal with a large number of inputs, the hyperparameters estimated at the (j-1)-th iteration are used as starting points for the optimization algorithm. This procedure is repeated until all the PII are included. Note that this sequential estimation process is directly adapted from the one proposed by Marrel et al. (2008).
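
The warm-start logic of this sequential process can be sketched independently of any Gp library. In the sketch below, `fit` is a hypothetical user-supplied callable standing for the hyperparameter optimization; `toy_fit` is a placeholder that only serves to make the example runnable and does not estimate anything. The assumption of one hyperparameter per input mirrors the anisotropic covariance setting described above.

```python
def sequential_build(ordered_pii, fit, initial_value=1.0):
    """Fit one model per iteration, adding the next input and warm-starting the optimizer.

    ordered_pii   : input names sorted by decreasing influence
    fit           : callable (active_inputs, start) -> estimated hyperparameters,
                    one hyperparameter per active input (hypothetical interface)
    initial_value : default starting value for the hyperparameter of a newly added input
    """
    theta = []      # hyperparameters estimated at the previous iteration
    history = []
    for j in range(1, len(ordered_pii) + 1):
        active = ordered_pii[:j]
        # Previous estimates are reused; only the newly added input gets a default start.
        start = theta + [initial_value]
        theta = list(fit(active, start))
        history.append((tuple(active), theta))
    return history


# Placeholder optimizer: records its starting point and returns a dummy update.
starts = []


def toy_fit(active, start):
    starts.append(list(start))
    return [0.5 * s + 0.1 for s in start]


history = sequential_build(["X_a", "X_b", "X_c"], toy_fit)
```

With a real optimizer in place of `toy_fit`, each iteration only has to explore one genuinely new direction, which is the robustness argument made in the text.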


5.2 Joint Gp metamodel 


In the framework of stochastic computer codes, Zabalza et al. (1998) proposed to model the mean and dispersion of the code output by two interlinked Generalized Linear Models (GLM), called “joint GLM”. Marrel et al. (2012) extend this approach to several nonparametric models and obtain the best results with two interlinked Gp models, called “joint Gp”. In this case, the stochastic input is considered as an uncontrollable parameter denoted $X_\varepsilon$ (i.e. governed by a seed variable).

We extend this approach to a group of non-explanatory variables. More precisely, the input variables $X = (X_1, \ldots, X_d)$ are divided into two subgroups: the explanatory ones denoted $X_{exp}$ and the others denoted $X_\varepsilon$. The output is thus defined by $Y = g(X_{exp}, X_\varepsilon)$. Under this hypothesis, the joint metamodeling approach consists in building two metamodels, one for the mean $Y_m$ and another for the dispersion component $Y_d$:

$$Y_m(X_{exp}) = E(Y \mid X_{exp}) \qquad (5)$$

$$Y_d(X_{exp}) = \mathrm{Var}(Y \mid X_{exp}) = E\left[(Y - Y_m(X_{exp}))^2 \mid X_{exp}\right] \qquad (6)$$
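
To make the mean and dispersion components concrete, the sketch below evaluates them by brute-force replication on a toy function. Everything here is an assumption for illustration: `toy_code` is a hypothetical stand-in for an expensive code, with a standard normal uncontrollable input, chosen so that the exact values are $Y_m(x) = x$ and $Y_d(x) = (1 + x)^2$. In practice such replications are not affordable, which is why the two components are estimated by metamodels.

```python
import random


def toy_code(x_exp, x_eps):
    """Hypothetical code output: the uncontrollable input x_eps perturbs a trend in x_exp."""
    return x_exp + (1.0 + x_exp) * x_eps


def mean_and_dispersion(x_exp, n_rep, rng):
    """Estimate E[Y | x_exp] and Var[Y | x_exp] by replication over x_eps ~ N(0, 1)."""
    outputs = [toy_code(x_exp, rng.gauss(0.0, 1.0)) for _ in range(n_rep)]
    mean = sum(outputs) / n_rep
    dispersion = sum((y - mean) ** 2 for y in outputs) / (n_rep - 1)
    return mean, dispersion


rng = random.Random(0)
y_m, y_d = mean_and_dispersion(x_exp=0.5, n_rep=20000, rng=rng)
# Exact values for this toy function at x_exp = 0.5: Y_m = 0.5 and Y_d = 1.5 ** 2 = 2.25.
```

The dispersion depends on `x_exp`, i.e. the toy output is heteroscedastic, which is the situation the joint model is designed to capture.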


To fit these mean and dispersion components, we use the methodology proposed by Marrel et al. (2012). First, an initial Gp denoted $Gp_{m,1}$ is estimated for the mean component with a homoscedastic nugget effect. A nugget effect is required to relax the interpolation property of the Gp metamodel, which would yield zero residuals for the whole learning sample. Then, a second Gp, denoted $Gp_{v,1}$, is built for the dispersion component with, here also, a homoscedastic nugget effect. $Gp_{v,1}$ is fitted on the squared residuals from the predictor of $Gp_{m,1}$. Its predictor is considered as an estimator of the dispersion component: it provides an estimation of the dispersion at each point, which is taken as the value of the heteroscedastic nugget effect. The homoscedastic hypothesis is thus removed and a new Gp, denoted $Gp_{m,2}$, is fitted on the data with the estimated heteroscedastic nugget. Finally, the Gp on the dispersion component is updated from $Gp_{m,2}$, following the same methodology as for $Gp_{v,1}$.


Remark 5.1 Note that some parametric choices are made for all the Gp metamodels: a constant trend and a Matérn stationary anisotropic covariance are chosen. All the hyperparameters (covariance parameters) and the nugget effect (when the homoscedastic hypothesis is made) are estimated by a maximum likelihood optimization process.
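
As an illustration of a stationary anisotropic Matérn covariance, the sketch below implements the Matérn 5/2 form with one correlation length per input. The smoothness value 5/2 is an assumption made here for the example; the text does not state which smoothness parameter is used. The function name `matern52` is hypothetical.

```python
import math


def matern52(x, x_prime, length_scales, variance=1.0):
    """Anisotropic Matérn 5/2 covariance between two input points.

    Each input has its own correlation length, so the covariance decays at a
    different rate along each direction (anisotropy); it depends only on the
    difference x - x_prime (stationarity).
    """
    r = math.sqrt(sum(((a - b) / l) ** 2 for a, b, l in zip(x, x_prime, length_scales)))
    s = math.sqrt(5.0) * r
    return variance * (1.0 + s + s * s / 3.0) * math.exp(-s)


# A unit move along the second input (correlation length 2) decorrelates
# the outputs less than the same move along the first input (correlation length 1).
k_dim1 = matern52([0.0, 0.0], [1.0, 0.0], length_scales=[1.0, 2.0])
k_dim2 = matern52([0.0, 0.0], [0.0, 1.0], length_scales=[1.0, 2.0])
```

With d inputs, this kernel has d correlation lengths plus a variance, which is the linear growth in the number of hyperparameters mentioned in section 5.1.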


5.3 Assessment of metamodel accuracy


To evaluate the accuracy of the metamodel, we use the predictivity coefficient $Q^2$:

$$Q^2 = 1 - \frac{\sum_{i=1}^{n_{test}} \left(y^{(i)} - \hat{y}^{(i)}\right)^2}{\sum_{i=1}^{n_{test}} \left(y^{(i)} - \frac{1}{n_{test}} \sum_{j=1}^{n_{test}} y^{(j)}\right)^2} \qquad (7)$$

where $(x^{(i)})_{1 \le i \le n_{test}}$ is a test sample, $(y^{(i)})_{1 \le i \le n_{test}}$ are the corresponding observed outputs and $(\hat{y}^{(i)})_{1 \le i \le n_{test}}$ are the metamodel predictions. $Q^2$ corresponds to the coefficient of determination in prediction and can be computed on a test sample independent from the learning sample or by cross-validation on the learning sample. The closer $Q^2$ is to one, the better the accuracy of the metamodel.


Table 1. Evolution of $Gp_m$ metamodel predictivity during the sequential building process, for each new additional PII (in order of inclusion).

Additional PII (rank)   1     2     3     4     5     6     7     8     9     10    11
Q^2                     0.60  0.64  0.70  0.79  0.81  0.83  0.85  0.85  0.87  0.87  0.87
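
The predictivity coefficient is straightforward to compute once observed outputs and metamodel predictions are available on a test sample. A minimal sketch follows; the function name `q2` and the numerical values are illustrative assumptions.

```python
def q2(observed, predicted):
    """Predictivity coefficient: 1 - (sum of squared errors) / (total sum of squares)."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical test sample: observed code outputs and metamodel predictions.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
score = q2(y_obs, y_hat)
```

A metamodel that always predicts the empirical mean of the observations gets a score of zero, and predictions worse than that give a negative score.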


5.4 Application on LOCA test case 


The joint Gp metamodel is built from the learning sample of size N = 500: the eleven PII identified at the end of the screening step are considered as the explanatory variables while the sixteen others are considered as the uncontrollable parameter. Gps on the mean and dispersion components are built using the sequential building process described in section 5.1, where the PII ordered by decreasing $R^2_{\mathrm{HSIC}}$ are successively included in the Gp. The $Q^2$ coefficient of the mean component $Gp_m$ is computed by cross-validation at each iteration of the sequential building process. The results given in Table 1 show an increasing predictivity until its stabilization around 0.87, which illustrates the robustness of the building process. The first four PII make the major contribution, yielding a $Q^2$ around 0.8; the four following ones yield minor improvements (increase of 0.02 on average for each input) while the last three PII do not improve the Gp predictivity.

Thus, only 13% of the output variability remains unexplained by $Gp_m$. This includes both the inaccuracy of $Gp_m$ (the part of $Y_m$ not fitted by the Gp) and the total effect of the uncontrollable parameter, i.e. the group of non-selected inputs.


6 STEP 4: VARIANCE-BASED 
SENSITIVITY ANALYSIS 


Sensitivity Analysis (SA) methods make it possible to answer the question “How do the input parameters variations contribute, qualitatively or quantitatively, to the variation of the output?” (Saltelli et al. 2008). These tools can detect non-significant input parameters in a screening context, determine the most significant ones, measure their respective contributions to the output or identify an interaction between several inputs which strongly impacts the model output. From this, engineers can guide the characterization of the model by reducing the output uncertainty: for instance, they can calibrate the most influential inputs and fix the non-influential ones to nominal values. Many surveys on SA exist in the literature, such as Kleijnen (1997), Frey and Patil (2002) or Helton et al. (2006). SA can be divided into two sub-domains: Local SA (LSA) and Global SA (GSA). The first one studies the effects of small input perturbations around nominal values on the model output (Cacuci 1981) while the second one considers the impact of the input uncertainty on the output over the whole variation domain of the uncertain inputs (Saltelli et al. 2008). We focus here on one of the most widely used GSA indices, namely Sobol’ indices, which are based on output variance decomposition.


6.1 Sobol’ indices


A classical approach in GSA consists of computing the first-order and total Sobol’ indices, which are based on the output variance decomposition (Sobol 1993, Homma and Saltelli 1996). If the variables $X_1, \ldots, X_d$ are independent and if $E[g^2(X)] < +\infty$, we can apply the Hoeffding decomposition to the random variable $g(X)$ (Hoeffding 1948):

$$g(X) = \sum_{u \subset \{1, \ldots, d\}} g_u(X_u) \qquad (8)$$

where $g_\emptyset = E[g(X)]$, $g_i(X_i) = E[g(X) \mid X_i] - g_\emptyset$ and $g_u(X_u) = E[g(X) \mid X_u] - \sum_{v \subsetneq u} g_v(X_v)$, with $X_u = (X_i)_{i \in u}$ for all $u \subset \{1, \ldots, d\}$. Apart from the constant $g_\emptyset$, the $2^d$ terms in (8) have zero mean and are mutually uncorrelated. This decomposition is unique and leads to the Sobol’ indices. These are the elements of the $g(X)$ variance decomposition according to the different groups of input parameter interactions in (8). More precisely, for each $u \subset \{1, \ldots, d\}$, the first-order and total Sobol’ sensitivity indices of $X_u$ are defined by:

$$S_u = \frac{\mathrm{Var}[g_u(X_u)]}{\mathrm{Var}[g(X)]} \quad \text{and} \quad S_u^T = \sum_{v \supseteq u} S_v .$$

$S_u$ represents the part of the output variance explained by $X_u$ independently from the other inputs, and $S_u^T$ is the part of the output variance explained by $X_u$ considered separately and in interaction with the other input parameters.

In practice, we are usually interested in the first-order sensitivity indices $S_1,\dots,S_d$, the total ones $S_1^T,\dots,S_d^T$ and sometimes in the second-order ones $S_{ij}$, $1 \le i < j \le d$. The model $g$ is devoid of interactions if $\sum_{i=1}^{d} S_i = 1$.


Sobol’ indices are widely used in GSA because they are easy to interpret and directly usable in a dimension reduction approach. However, their estimation (based on Monte Carlo methods, for example) requires a large number of model evaluations, which is intractable for time-expensive computer codes. A common solution consists in using a metamodel to compute these indices. Note that, when the $Q^2$ of the metamodel is estimated on a sample drawn from the probability distribution of the inputs, it provides (through $1 - Q^2$) an estimate of the part of variance unexplained by the metamodel. This should be kept in mind when interpreting the Sobol’ indices estimated with the metamodel.
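
To make the cost argument concrete, the following minimal sketch estimates first-order Sobol’ indices with a symmetrised pick-and-freeze Monte Carlo scheme. It is only an illustration under stated assumptions: the inputs are taken independent and uniform on [0, 1], and the names `first_order_sobol` and `toy_model` are hypothetical, not part of the study. Each index requires a full extra set of model runs, hence n(d + 1) evaluations in total.

```python
import random


def first_order_sobol(g, d, n, seed=0):
    """Pick-and-freeze Monte Carlo estimate of first-order Sobol' indices.

    g : function taking a list of d floats and returning a float
    d : number of independent inputs, each assumed uniform on [0, 1]
    n : Monte Carlo sample size (total cost: n * (d + 1) calls to g)
    """
    rng = random.Random(seed)
    x = [[rng.random() for _ in range(d)] for _ in range(n)]
    x_prime = [[rng.random() for _ in range(d)] for _ in range(n)]
    y = [g(row) for row in x]

    indices = []
    for i in range(d):
        # "Freeze" input i at its value in x and redraw all the other inputs.
        y_i = [g([x[k][j] if j == i else x_prime[k][j] for j in range(d)])
               for k in range(n)]
        mean = sum(a + b for a, b in zip(y, y_i)) / (2 * n)
        cov = sum(a * b for a, b in zip(y, y_i)) / n - mean ** 2
        var = sum(a * a + b * b for a, b in zip(y, y_i)) / (2 * n) - mean ** 2
        indices.append(cov / var)
    return indices


def toy_model(x):
    # Additive toy function: the exact indices are S_1 = 0.2 and S_2 = 0.8.
    return x[0] + 2.0 * x[1]


if __name__ == "__main__":
    print(first_order_sobol(toy_model, d=2, n=20000))
```

For the additive toy function the two estimates should sum to about 1, in line with the remark that a model without interactions satisfies $\sum_i S_i = 1$. With an expensive simulator, `g` would be replaced by the metamodel predictor.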


6.2 Sobol’ indices with a joint Gp metamodel 


In the case where a joint Gp metamodel is used to take into account an uncontrollable input $X_\varepsilon$, we have shown in Marrel et al. (2012) how to deduce Sobol’ sensitivity indices from this joint metamodel. Indeed, the variance of the output variable $Y(X_{exp}, X_\varepsilon)$ can be rewritten and deduced from the two metamodels:

$$\mathrm{Var}[Y(X_{exp}, X_\varepsilon)] = \mathrm{Var}_{X_{exp}}[Y_m(X_{exp})] + E_{X_{exp}}[Y_d(X_{exp})] \qquad (9)$$

where $E_X$ (resp. $\mathrm{Var}_X$) denotes the mean (resp. variance) operator with respect to the pdf of $X$. Furthermore, the variance of $Y$ is the sum of the contributions of all the $d$ controllable inputs $X_{exp} = (X_i)_{i=1,\dots,d}$ and the uncontrollable one $X_\varepsilon$:


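$$\mathrm{Var}(Y) = V_\varepsilon(Y) + \sum_{i=1}^{d} \sum_{|J|=i} \left[ V_J(Y) + V_{J\varepsilon}(Y) \right] \qquad (10)$$

where $V_\varepsilon(Y) = \mathrm{Var}_{X_\varepsilon}[E_{X_{exp}}(Y|X_\varepsilon)]$, $V_i(Y) = \mathrm{Var}_{X_i}[E(Y|X_i)]$, $V_{i\varepsilon}(Y) = \mathrm{Var}_{X_i X_\varepsilon}[E(Y|X_i X_\varepsilon)] - V_i(Y) - V_\varepsilon(Y)$, $V_{ij}(Y) = \mathrm{Var}_{X_i X_j}[E(Y|X_i X_j)] - V_i(Y) - V_j(Y)$, ...

The variance of the mean component $Y_m(X_{exp})$, denoted hereafter $Y_m$, can also be decomposed:

$$\mathrm{Var}(Y_m) = \sum_{i=1}^{d} \sum_{|J|=i} V_J(Y_m) \qquad (11)$$

As $V_J(Y_m) = \mathrm{Var}_{X_J}\left[E_{X_{exp}}\left[E_{X_\varepsilon}(Y|X_{exp}) \mid X_J\right]\right] = V_J(Y)$, the Sobol’ indices of the input variables $X_{exp} = (X_i)_{i=1,\dots,d}$ can be derived and estimated from $Y_m$:

$$S_J = \frac{V_J(Y_m)}{\mathrm{Var}(Y)} \qquad (12)$$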


Similarly, the total sensitivity index of $X_\varepsilon$ is given by:

$$S_\varepsilon^T = \frac{V_\varepsilon(Y) + \sum_{i=1}^{d} \sum_{|J|=i} V_{J\varepsilon}(Y)}{\mathrm{Var}(Y)} = \frac{E_{X_{exp}}[Y_d(X_{exp})]}{\mathrm{Var}(Y)} \qquad (13)$$

Note that, as $Y_d(X_{exp})$ is a positive random variable, positivity of $S_\varepsilon^T$ is guaranteed. In practice, $\mathrm{Var}(Y)$ can be estimated from the data or from simulations of the fitted joint model, using equation (9).

$S_\varepsilon^T$ is interpreted as the total sensitivity index of the uncontrollable process. The limitation of this approach is that only the total part of uncertainty related to $X_\varepsilon$ is estimated; its individual effect is not distinguished from its interaction with the other parameters. However, these potential interactions can be pointed out by considering the first-order and total effects of all the other parameters. The SA of $Y_d$ can also be a relevant indicator: if an input variable $X_i$ is not influential on $Y_d$, we can deduce that $S_{i\varepsilon}$ is equal to zero.
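
As a small numerical illustration of equations (9) and (13), the sketch below estimates $\mathrm{Var}(Y)$ and the total index of the uncontrollable input from a mean component and a dispersion component. Everything specific in it is an assumption made for the example: the toy functions `toy_mean` and `toy_dispersion` stand in for the two fitted metamodels, and the controllable inputs are taken independent and uniform on [0, 1].

```python
import random
import statistics


def total_index_uncontrollable(mean_component, dispersion_component, d, n, seed=0):
    """Estimate Var(Y) with equation (9) and S_eps^T with equation (13)."""
    rng = random.Random(seed)
    sample = [[rng.random() for _ in range(d)] for _ in range(n)]
    y_m = [mean_component(x) for x in sample]        # Y_m(X_exp)
    y_d = [dispersion_component(x) for x in sample]  # Y_d(X_exp), positive
    var_mean = statistics.pvariance(y_m)   # Var over X_exp of Y_m(X_exp)
    mean_disp = statistics.fmean(y_d)      # mean over X_exp of Y_d(X_exp)
    var_y = var_mean + mean_disp           # equation (9)
    return var_y, mean_disp / var_y        # equation (13)


def toy_mean(x):
    # Hypothetical stand-in for the mean-component metamodel.
    return x[0] + 2.0 * x[1]


def toy_dispersion(x):
    # Hypothetical stand-in for the dispersion component (always positive).
    return 0.1 * (1.0 + x[0])


if __name__ == "__main__":
    var_y, s_eps_total = total_index_uncontrollable(toy_mean, toy_dispersion, d=2, n=20000)
    print(var_y, s_eps_total)
```

Because the dispersion component is positive by construction, the estimated total index is positive as well, which mirrors the remark following equation (13).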


6.3 Results on LOCA test case 


From the joint Gp built in section 5.4, the Sobol’ indices of the PII are estimated from the $Gp_m$ metamodel using equation (12), $\mathrm{Var}(Y)$ being estimated with $Gp_m$ and $Gp_d$ using equation (9). For this, intensive Monte Carlo methods are used (see e.g. the pick-and-freeze estimator of Gamboa et al. 2016). The first-order Sobol’ indices of the PII are given in Table 2 and represent 85% of the total variance of the output. X,, remains the major influential input with 59% of explained variance, followed to a lesser extent by X,, and X,, with 8% of variance each. The partial total Sobol’ indices involving only the PII and derived from $Gp_m$ show that an additional 4% of variance is due to interactions between Xp X,, and X. The other PII have negligible influence. Altogether, the PII explain around 89% of the output variance, of which 79% is due to X o, X,, and X,, alone. From the $Gp_d$ metamodel and using equation (13), the total effect of the uncontrollable parameter, i.e. the group of the sixteen non-explanatory inputs, is estimated at 9.7%. This includes the effect of the uncontrollable parameter alone and in interaction with the PII. To further investigate these interactions, Sobol’-based SA and HSIC-based statistical


Table 2. First-order Sobol’ indices of PII (in %), estimated with the $Gp_m$ metamodel.


Input Xi X Xo Xy Xs Xg 


S 


Ist Sobol’ index 59 3 
Input X, 


8 8 2 
Xi, X Xz; 
Ist Sobol? index 2 2 0 0 


> px 


dependence tests are applied on $Y_d$ and reveal that only Xio Xip X,, X,, and X, potentially interact with the uncontrollable parameter.


7 CONCLUSION AND PROSPECTS 


Using an efficient sequential building process, we fitted a predictive joint Gp metamodel on a high-dimensional thermal-hydraulic test case simulating an accidental scenario in a Pressurized Water Reactor (LOCA test case). An initial screening step based on advanced dependence measures and associated statistical tests made it possible to identify a group of significant inputs, allowing dimension reduction. The optimization effort when fitting the metamodel can thus be concentrated on the main influential inputs, which increases the robustness of the metamodeling. Moreover, thanks to the joint metamodel approach, the non-selected inputs are not completely removed: the residual uncertainty due to dimension reduction is integrated in the metamodel and the global influence of the non-selected inputs is thereby controlled.

From this joint Gp metamodel, several statistical analyses, not feasible with the numerical model due to its computational cost, become accessible. Thus, on the LOCA application, a sensitivity analysis based on variance decomposition is performed using the joint Gp: Sobol’ indices are computed and reveal that the output is mainly explained by four uncertain inputs: one input is strongly influential with around 60% of output variance explained, the three others being of minor influence. The much lower influence of all the other inputs is also confirmed.

The next step is to use the joint Gp metamodel to perform uncertainty propagation for the estimation of failure probabilities and quantiles. In the LOCA test case, we are particularly interested in the estimation of high quantiles (of the order of 95% to 99%) of the model output temperature. In nuclear safety, methods of conservative computation of quantiles (Nutt and Wallis 2004) have been widely studied. However, several pieces of complementary information are often useful and are not accessible in a high-dimensional context. We expect that the joint Gp metamodel could help to access this information: the uncertainty of the influential inputs will be directly and accurately propagated through the mean component of the joint metamodel, while a confidence bound could be derived from the dispersion component in order to take into account the residual uncertainty of the other inputs. On this last point, the interest of the heteroscedastic approach in the joint Gp could also be illustrated and compared with its homoscedastic version.


ABSTRACT: By elevating stress levels, Accelerated Degradation Testing (ADT) can obtain degradation data in a limited period of time and then use these data for reliability evaluations. However, because of the high price of the test items or the test equipment, the sample size used in ADT is usually small, which causes a lack of knowledge about the population and thus leads to epistemic uncertainty. The small sample problem makes models based on probability theory, which need large samples, no longer appropriate. To address this problem, based on uncertainty theory, this paper uses the general geometric Liu process to construct an uncertain accelerated degradation model, and gives the corresponding statistical analysis method with objective measures. A carbon-film resistor case is used to illustrate the proposed methodology, and the sensitivity of the proposed methodology to the sample size is discussed. Results show that the proposed methodology is a suitable choice for the reliability evaluation of ADT data under small sample situations.


1 INTRODUCTION 


Products’ reliability and lifetime are usually assessed by life testing, which uses time-to-failure data. But for highly reliable products, there are usually few or even no failures during life testing, which makes it inappropriate to use life testing to assess these products’ reliability and lifetime. Therefore, accelerated degradation testing (ADT) has attracted much attention and been widely applied. ADT can obtain degradation data in a limited period of time by elevating stress levels, and then uses these data for reliability and lifetime assessments.

In a standard ADT analysis, there is a degradation model and an acceleration model. The degradation model describes the degradation paths under each stress level. Some of its parameters are assumed to be functions of the stress level; this relationship is the acceleration model. In general, there are two broad categories of degradation models based on probability theory: degradation path models (Meeker et al., 1998) and stochastic process models (Ye and Xie, 2015). These models are suitable for situations with large samples. But in practical applications, the sample size in ADT is usually small due to the high price of the test items or the test equipment, which causes a lack of knowledge about the population and thus leads to epistemic uncertainty. Therefore, models based on probability theory are not appropriate for the small sample situation.

To quantify the epistemic uncertainty, various methods have been applied that utilize subjective information such as belief degrees, including the Bayesian method (Li and Meeker, 2014), interval analysis (Moore et al., 2009), and fuzzy probability theory (Beer et al., 2013). These methods respectively use prior distributions, intervals or fuzzy variables to encode subjective information and quantify the epistemic uncertainty (Kang et al., 2016).

However, some problems still exist. On the one hand, these methods quantify the epistemic uncertainty by subjective measures, which could lead different researchers to different results. On the other hand, these methods originate from probability theory, which makes them unsuitable for ADT data with small sample sizes.

Motivated by these problems, the uncertainty theory proposed by Liu (Liu, 2015) is introduced to the field of ADT modeling. Uncertainty theory is a branch of mathematics for modeling belief degrees and is used for small sample (or even no sample) situations (Liu, 2012). It has been widely used in many fields such as risk assessment (Liu, 2010), reliability analysis (Zeng et al., 2013), and supply chains (Huang et al., 2016).

In this paper, based on uncertainty theory, a positive uncertain process named the general geometric Liu process is used to construct an uncertain accelerated degradation model, and the corresponding statistical analysis method for parameter estimation is proposed. The proposed methodology quantifies the epistemic uncertainty by objective measures. The rest of the paper is organized as follows. Section 2 introduces preliminaries about uncertainty theory. Section 3 presents the uncertain accelerated degradation model, derives the reliability and lifetime distributions and gives the corresponding statistical analysis method. Section 4 conducts the case study and the sensitivity analysis. Section 5 concludes the paper.


2 PRELIMINARIES 


In this section, we introduce some preliminaries 
about uncertain measure, uncertain variable, and 
uncertain process that will be used in the subse- 
quent sections. 

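Definition 1 (Liu, 2015): Let $\Gamma$ be a nonempty set, and $\mathcal{L}$ be a $\sigma$-algebra over $\Gamma$. Each element $\Lambda$ in $\mathcal{L}$ is called a measurable set. A set function $\mathcal{M}$ from $\mathcal{L}$ to [0, 1] is called an uncertain measure if it satisfies the following axioms:

1. Normality axiom: $\mathcal{M}\{\Gamma\} = 1$ for the universal set $\Gamma$.

2. Duality axiom: $\mathcal{M}\{\Lambda\} + \mathcal{M}\{\Lambda^c\} = 1$ for any event $\Lambda$.

3. Subadditivity axiom: For every countable sequence of events $\Lambda_1, \Lambda_2, \dots$, we have

$$\mathcal{M}\left\{\bigcup_{i=1}^{\infty} \Lambda_i\right\} \le \sum_{i=1}^{\infty} \mathcal{M}\{\Lambda_i\} \qquad (1)$$

4. Product axiom (Liu, 2009): Let $(\Gamma_k, \mathcal{L}_k, \mathcal{M}_k)$ be uncertainty spaces for $k = 1, 2, \dots$; then the product uncertain measure $\mathcal{M}$ is an uncertain measure satisfying

$$\mathcal{M}\left\{\prod_{k=1}^{\infty} \Lambda_k\right\} = \bigwedge_{k=1}^{\infty} \mathcal{M}_k\{\Lambda_k\} \qquad (2)$$

where $\Lambda_k$ are arbitrarily chosen events from $\mathcal{L}_k$ for $k = 1, 2, \dots$, respectively.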

Definition 2 (Liu, 2015): Liu introduced the concept of uncertainty distribution to describe uncertain variables. The uncertainty distribution $\Phi$ of an uncertain variable $\xi$ is defined by

$$\Phi(x) = \mathcal{M}\{\xi \le x\}, \quad \forall x \in \mathbb{R}. \qquad (3)$$

Let $\xi$ be an uncertain variable with regular uncertainty distribution $\Phi(x)$. Then the inverse function $\Phi^{-1}(\alpha)$ is called the inverse uncertainty distribution of $\xi$.

Definition 3 (Liu, 2008): Let $T$ be a totally ordered set (that is usually “time”), and let $(\Gamma, \mathcal{L}, \mathcal{M})$ be an uncertainty space. An uncertain process is defined as a measurable function from $T \times (\Gamma, \mathcal{L}, \mathcal{M})$ to the set of real numbers, i.e., for each $t \in T$ and any Borel set $B$ of real numbers, the set

$$\{X_t \in B\} = \{\gamma \in \Gamma \mid X_t(\gamma) \in B\} \qquad (4)$$

is an event. In other words, an uncertain process is a sequence of uncertain variables indexed by time.

Definition 4 (Liu, 2014): An uncertain process $X_t$ is said to have an uncertainty distribution $\Phi_t(x)$ if at each time $t$, the uncertain variable $X_t$ has the uncertainty distribution $\Phi_t(x)$.

Theorem 1 (Liu, 2014): (Sufficient and Necessary Condition) A function $\Phi_t^{-1}(\alpha): T \times (0,1) \to \mathbb{R}$ is an inverse uncertainty distribution of an independent increment process if and only if 1) at each time $t$, $\Phi_t^{-1}(\alpha)$ is a continuous and strictly increasing function; and 2) for any times $t_1 < t_2$, $\Phi_{t_2}^{-1}(\alpha) - \Phi_{t_1}^{-1}(\alpha)$ is a monotone increasing function with respect to $\alpha$.


3 METHODOLOGY 


In this section, we use an uncertain process called the general geometric Liu process (Liu, 2015) for ADT modeling, derive the reliability and lifetime distributions, and give the corresponding statistical analysis method.


3.1 Accelerated degradation modeling 


In practical applications, the degradation process is usually positive. To describe a positive degradation process under the small sample situation, we consider the following uncertain process

$$X(t) = \exp\left(e \cdot t + \frac{\sigma}{\sqrt{t}}\, C(t)\right), \qquad (5)$$

where $e$ is the log-drift parameter, also known as the degradation rate, $\sigma/\sqrt{t}$ is the log-diffusion parameter, and $C(t)$ is the Liu process, which follows a normal uncertainty distribution ($\mathcal{N}$) with mean 0 and variance $t^2$, i.e. $C(t) \sim \mathcal{N}(0, t)$. Eq.(5) is called the general geometric Liu process. Note that $X(t)$ in Eq.(5) follows a lognormal uncertainty distribution ($\mathcal{LOGN}$), i.e., $X(t) \sim \mathcal{LOGN}(e t, \sigma\sqrt{t})$. Its uncertainty distribution can be expressed as

$$\Phi_t(x) = \left(1 + \exp\left(\frac{\pi\,(e \cdot t - \ln x)}{\sqrt{3}\,\sigma\sqrt{t}}\right)\right)^{-1}. \qquad (6)$$


In ADT modeling, acceleration models are usu- 
ally utilized to describe the relationship between 
the degradation rate and the accelerated stress 
level, which can be expressed as 


$$\ln e(s_i) = a + b \cdot s_i, \qquad (7)$$

where $a$ and $b$ are unknown parameters, and $s_i$ is the normalized stress level, which can be expressed as

$$s_i = \begin{cases} \dfrac{1/S_0 - 1/S_i}{1/S_0 - 1/S_H}, & \text{Arrhenius model}, \\ \dfrac{\ln S_i - \ln S_0}{\ln S_H - \ln S_0}, & \text{Power law model}, \\ \dfrac{S_i - S_0}{S_H - S_0}, & \text{Exponential model}, \end{cases} \qquad (8)$$

where $S_0$ is the normal stress level, $S_i$ is the $i$th accelerated stress level, and $S_H$ is the highest accelerated stress level. From Eq.(8), it is easy to see that $s_0 = 0$ and $s_H = 1$.

For simplicity, our proposed uncertain accelerated degradation model in Eq.(5) and Eq.(7) is denoted by $M$. The unknown parameters in model $M$ are summarized as $\Omega = (a, b, \sigma)$, in which $a$ and $b$ are shown in Eq.(7) and $\sigma$ is shown in Eq.(5).
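
The normalization in Eq.(8) and the log-linear acceleration model in Eq.(7) can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the conversion of the Table 1 temperatures to kelvin is an assumption (the Arrhenius form is normally applied to absolute temperature).

```python
import math


def normalized_stress(s_i, s_0, s_h, model="arrhenius"):
    """Normalized stress level of Eq.(8); all stresses must share one unit."""
    if model == "arrhenius":      # stress is an absolute temperature
        return (1.0 / s_0 - 1.0 / s_i) / (1.0 / s_0 - 1.0 / s_h)
    if model == "power":
        return (math.log(s_i) - math.log(s_0)) / (math.log(s_h) - math.log(s_0))
    if model == "exponential":
        return (s_i - s_0) / (s_h - s_0)
    raise ValueError("unknown acceleration model: " + model)


def degradation_rate(s, a, b):
    """Degradation rate e(s) from the acceleration model of Eq.(7)."""
    return math.exp(a + b * s)


if __name__ == "__main__":
    kelvin = 273.15  # assumed conversion from degrees Celsius
    s_0 = 50 + kelvin
    levels = [83 + kelvin, 133 + kelvin, 173 + kelvin]
    for s_i in levels:
        print(round(normalized_stress(s_i, s_0, levels[-1]), 4))
```

By construction the normal stress level maps to 0 and the highest accelerated level maps to 1, so $e(0) = \exp(a)$ is the degradation rate under normal conditions.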


3.2 First hitting time and reliability distributions 
of the proposed model 


After obtaining the proposed model $M$, we need to derive the corresponding reliability and lifetime distributions.

Define $\omega$ as the failure threshold of the degradation process; then the lifetime $T$ can be defined as the first hitting time (FHT) at which the degradation process $X(t)$ reaches $\omega$. Liu (2013) defined the FHT of an uncertain process as follows:

$$t_\omega = \inf\{t \ge 0 \mid X(t) = \omega\}. \qquad (9)$$

According to Theorem 3 in (Liu, 2013), the uncertainty distribution of the FHT of an independent increment process with a continuous uncertainty distribution at each time can be expressed as follows,

$$\Upsilon(z) = \mathcal{M}\left\{\sup_{0 \le t \le z} X(t) \ge \omega\right\} = 1 - \inf_{0 \le t \le z} \Phi_t(\omega). \qquad (10)$$

Therefore, before deriving the reliability and lifetime distributions, we first need to prove that $X(t)$ in Eq.(5) is an independent increment process with a continuous uncertainty distribution at each time.

Proof: 


1. From Eq.(6), it is easy to know that X(t) has 
a continuous uncertainty distribution at each 
time t. 

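2. For each $\gamma \in \Gamma$, the uncertainty distribution of $X(t \mid \gamma)$ can be expressed as

$$\Phi_t(x \mid \gamma) = \left(1 + \exp\left(\frac{\pi\,(e(s) \cdot t - \ln x)}{\sqrt{3}\,\sigma\sqrt{t}}\right)\right)^{-1}. \qquad (11)$$

Eq.(11) is obviously a continuous function with respect to time $t$.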


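3. From Eq.(6), we can get the inverse uncertainty distribution of $X(t)$ as follows,

$$\Phi_t^{-1}(\alpha) = \exp\left(e(s) \cdot t + \frac{\sqrt{3}\,\sigma\sqrt{t}}{\pi} \ln\frac{\alpha}{1-\alpha}\right), \quad \alpha \in (0,1). \qquad (12)$$

At each time $t$, the derivative of Eq.(12) with respect to $\alpha$ is

$$\left(\Phi_t^{-1}(\alpha)\right)' = \exp\left(e(s) \cdot t + \frac{\sqrt{3}\,\sigma\sqrt{t}}{\pi} \ln\frac{\alpha}{1-\alpha}\right) \cdot \frac{\sqrt{3}\,\sigma\sqrt{t}}{\pi\,\alpha(1-\alpha)}. \qquad (13)$$

Since $\alpha \in (0,1)$, we have $\alpha(1-\alpha) > 0$, so the derivative in Eq.(13) is positive. Hence $\Phi_t^{-1}(\alpha)$ is a continuous and strictly increasing function with respect to $\alpha$.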

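4. For any times $0 < t_1 < t_2$, to prove that $\Phi_{t_2}^{-1}(\alpha) - \Phi_{t_1}^{-1}(\alpha)$ is a monotone increasing function with respect to $\alpha$, we need to prove the following condition:

$$\Phi_{t_2}^{-1}(\alpha) - \Phi_{t_1}^{-1}(\alpha) = \exp\left(e(s) \cdot t_2 + \frac{\sqrt{3}\,\sigma\sqrt{t_2}}{\pi} \ln\frac{\alpha}{1-\alpha}\right) - \exp\left(e(s) \cdot t_1 + \frac{\sqrt{3}\,\sigma\sqrt{t_1}}{\pi} \ln\frac{\alpha}{1-\alpha}\right) > 0. \qquad (14)$$

Since $\exp(\cdot)$ is a monotone increasing function, Eq.(14) is equivalent to the following condition:

$$e(s) \cdot (t_2 - t_1) + \frac{\sqrt{3}\,\sigma}{\pi} \left(\sqrt{t_2} - \sqrt{t_1}\right) \ln\frac{\alpha}{1-\alpha} > 0. \qquad (15)$$

According to the given information, it is easy to prove that the condition in Eq.(15) holds. Thus, $\Phi_{t_2}^{-1}(\alpha) - \Phi_{t_1}^{-1}(\alpha)$ is a monotone increasing function with respect to $\alpha$. Based on Theorem 1 in section 2, we can prove that $X(t)$ in Eq.(5) is an independent increment process.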

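From the above analyses, the uncertain process $X(t)$ in Eq.(5) is an independent increment process with a continuous uncertainty distribution at each time $t$. Thus, the uncertainty distribution of the FHT of $X(t)$ can be expressed as follows,

$$\Upsilon(z) = 1 - \inf_{0 \le t \le z} \Phi_t(\omega) = 1 - \inf_{0 \le t \le z} \left(1 + \exp\left(\frac{\pi\,(e(s) \cdot t - \ln\omega)}{\sqrt{3}\,\sigma\sqrt{t}}\right)\right)^{-1} = 1 - \left(1 + \exp\left(\frac{\pi\,(e(s) \cdot z - \ln\omega)}{\sqrt{3}\,\sigma\sqrt{z}}\right)\right)^{-1}. \qquad (16)$$

The corresponding reliability distribution is

$$R_B(t) = \mathcal{M}\{t_\omega > t\} = \left(1 + \exp\left(\frac{\pi\,(e(s) \cdot t - \ln\omega)}{\sqrt{3}\,\sigma\sqrt{t}}\right)\right)^{-1}, \qquad (17)$$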


where $R_B(t)$ is known as the “belief reliability” with an uncertain measure (Zeng et al., 2013).

Meanwhile, the belief reliable life, i.e. $BL(\alpha)$, can also be derived and expressed as follows

$$BL(\alpha) = \sup\{t \ge 0 \mid R_B(t) \ge \alpha\} = \Upsilon^{-1}(1 - \alpha). \qquad (18)$$
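
The belief reliability of Eq.(17) and the belief reliable life of Eq.(18) can be evaluated numerically as in the sketch below. It is a minimal illustration under explicit assumptions: the function names are hypothetical, the parameter values in the usage part are made up for the example (they are not the estimates of the case study), and the bisection relies on $R_B(t)$ decreasing in $t$, which holds when $e > 0$ and $\ln\omega > 0$.

```python
import math


def belief_reliability(t, e, sigma, omega):
    """Belief reliability R_B(t) of Eq.(17) for t > 0."""
    z = math.pi * (e * t - math.log(omega)) / (math.sqrt(3.0) * sigma * math.sqrt(t))
    # Guard against overflow of exp() when the reliability is essentially 0.
    if z > 700.0:
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


def belief_reliable_life(alpha, e, sigma, omega, t_max=1e9):
    """Largest t with R_B(t) >= alpha (Eq.(18)), found by bisection.

    Assumes e > 0 and ln(omega) > 0, so that R_B decreases with t.
    """
    lo, hi = 1e-12, t_max
    if belief_reliability(hi, e, sigma, omega) >= alpha:
        return hi  # the target level is not reached before t_max
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if belief_reliability(mid, e, sigma, omega) >= alpha:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # Illustrative values only, chosen for the example.
    e, sigma, omega = 1e-3, 0.05, 12.0
    bl = belief_reliable_life(0.9, e, sigma, omega)
    print(bl, belief_reliability(bl, e, sigma, omega))
```

At $t = \ln\omega / e$ the exponent in Eq.(17) vanishes and the belief reliability equals 0.5, which gives a quick sanity check on any implementation.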


3.3 Statistical analysis method for parameter 
estimations 


With different loading profiles, there are different kinds of ADT plans, including constant stress accelerated degradation testing (CSADT), step stress accelerated degradation testing (SSADT), and progressive stress accelerated degradation testing (PSADT). Here, we provide the statistical analysis method for parameter estimation in CSADT.

Liu (2015) employed the principle of least squares for parameter estimation of uncertain variables. In this section, we also use this method to estimate the unknown parameters of the proposed model $M$.

Suppose that $x_{ijk}$ is the $k$th observed degradation value for the $j$th sample under the $i$th stress level, and $t_{ijk}$ is the corresponding measurement time, $i = 1, 2, \dots, K$; $j = 1, 2, \dots, n_i$; $k = 1, 2, \dots, m_i$, where $K$ is the number of accelerated stress levels, $n_i$ is the sample size under the $i$th stress level, and $m_i$ is the number of measurements for the $j$th sample under the $i$th stress level.

The unknown parameters of the proposed model $M$ are $\Omega = (a, b, \sigma)$. We propose a two-step statistical analysis method for parameter estimation: 1) collecting belief degrees; 2) estimating the unknown parameters. The procedure is shown in Figure 1 and detailed as follows:


Figure 1. The two-step statistical analysis method for parameter estimations of the proposed model (Step I: obtain belief degrees by sorting the data of each measurement under each stress level and applying the approximate median rank function; Step II: estimate unknown parameters by minimizing the objective function).


1. Obtain belief degrees

From Section 3.2, it is known that $x_{ik}$ is an uncertain variable. All the degradation data $x_{ijk}$ of the $k$th measurement under the $i$th stress level are observations of $x_{ik}$, i.e., $x_{ik} = (x_{i1k}, x_{i2k}, \dots, x_{iN_ik})$ ($j = 1, 2, \dots, N_i$, and $n_i$ is the upper bound of $N_i$). Each element has a belief degree $\alpha_{ijk}$. In this section, we use the approximate median rank function to obtain the belief degrees, which is expressed as follows,

$$\alpha_{ijk} = \frac{j - 0.3}{N_i + 0.4}, \quad j = 1, 2, \dots, N_i. \qquad (19)$$

For all the degradation data of the $k$th measurement under the $i$th stress level, if some degradation data are equal, then their belief degrees are also equal.

2. Estimate unknown parameters

According to the principle of least squares, the parameter estimates of the proposed model are obtained by solving

$$Q = \min_{\Omega} \sum_{i=1}^{K} \sum_{k=1}^{m_i} \sum_{j=1}^{N_i} \left(\Phi_{t_{ijk}}(x_{ijk} \mid \Omega) - \alpha_{ijk}\right)^2. \qquad (20)$$
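
The two steps above can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names and the data layout are hypothetical, the treatment of ties (equal observations share one rank, and $N_i$ counts distinct values) is one reading of the text, and the minimization itself is left to whichever numerical optimizer is available.

```python
import math


def belief_degrees(observations):
    """Approximate median ranks, Eq.(19), for one measurement time.

    Equal observations share one belief degree; N is taken as the number of
    distinct values, which is an interpretation of the text.
    """
    distinct = sorted(set(observations))
    n = len(distinct)
    rank = {x: (j - 0.3) / (n + 0.4) for j, x in enumerate(distinct, start=1)}
    return [rank[x] for x in observations]


def uncertainty_distribution(x, t, e, sigma):
    """Phi_t(x) of Eq.(6) for the general geometric Liu process."""
    z = math.pi * (e * t - math.log(x)) / (math.sqrt(3.0) * sigma * math.sqrt(t))
    return 0.0 if z > 700.0 else 1.0 / (1.0 + math.exp(z))


def objective(params, data):
    """Sum of squared differences minimized in Eq.(20).

    params : (a, b, sigma)
    data   : list of (s_i, t, observations) tuples, one per stress level and
             measurement time; s_i is the normalized stress level of Eq.(8)
    """
    a, b, sigma = params
    total = 0.0
    for s_i, t, observations in data:
        e = math.exp(a + b * s_i)  # acceleration model, Eq.(7)
        for x, alpha in zip(observations, belief_degrees(observations)):
            total += (uncertainty_distribution(x, t, e, sigma) - alpha) ** 2
    return total
```

Passing `objective` to a general-purpose minimizer over `(a, b, sigma)`, with `sigma` constrained to be positive, would give the least squares estimates; the choice of optimizer and starting point is left open here.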


4 CASE STUDY 


In this section, the carbon-film resistor CSADT dataset (Meeker and Escobar, 1998) is used to illustrate the proposed methodology, and the sensitivity of the proposed methodology to the sample size is discussed.


4.1 The carbon-film resistors CSADT dataset 


In the carbon-film resistor CSADT dataset, there are 9, 10 and 10 samples under the three accelerated stress levels, respectively, which is a small sample situation. Therefore, the proposed methodology can be applied to this case for reliability and lifetime evaluation under normal conditions. Details about this case are shown in Table 1.


Table 1. Basic information about the carbon-film resistors CSADT dataset.

Test information                   Contents
Stress levels (temperature/°C)     83, 133, and 173
Normal conditions (°C)             50
Sample size                        9, 10, and 10
Measurement times                  4, 4, and 4
Failure threshold ω (%)            12


4.2 Reliability and lifetime evaluations under normal conditions


Since the accelerated stress is temperature, the Arrhenius model is selected as the acceleration model. Based on the proposed methodology, the parameter estimates are obtained as listed in Table 2.

Substituting the parameter estimates in Table 2 into Eq.(17) and Eq.(18), the reliability and lifetime evaluations under normal conditions can be obtained. Results are shown in Figure 2.

From Figure 2 (a), it can be seen that the belief reliability starts from the initial value 1 and decreases gradually with increasing time, which agrees with intuition. If decision makers are interested in the belief reliability $R_B = 0.9$, the corresponding belief reliable lifetime is BL(0.9) = 10181 hours. It means that the belief degree that the products will survive under normal conditions after 10818 hours is 0.9.


4.3 Discussions 


For the sensitivity analysis of the proposed methodology with respect to the sample size, we simulate several situations with different sample sizes, and denote each situation by $ST_r(n_{r1}, n_{r2}, n_{r3})$, $r = 1, 2, \dots, 8$, where $ST_r$ represents the $r$th situation and $n_{r1}$, $n_{r2}$, and $n_{r3}$ represent the chosen sample sizes under each stress level. Details are shown in Table 3.


Table 2. Parameter estimation results.

Parameters    a         b        σ
Values        -15.18    3.975    9.060e-04


Figure 2. Reliability and lifetime evaluations of the carbon-film resistors ADT dataset under normal conditions: (a) belief reliability versus time (hours); (b) belief reliable lifetime versus belief degree α.


As shown in Table 3, under each situation there are many different combinations of samples, which lead to many different reliability evaluations. To present the range of reliability evaluations under the different situations, at each monitoring time $t$ under the situation $ST_r$, we choose the minimum and maximum reliability evaluation results as the lower and upper boundaries of the reliability evaluations. The lower and upper boundaries under each situation are thus obtained, and the results are shown in Figure 3.
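
The construction of these boundaries can be sketched as an enumeration over all sample subsets of the chosen sizes, keeping the pointwise minimum and maximum of the resulting reliability curves. The sketch is an assumption-laden illustration: `reliability_envelope` and the `evaluate` callable are hypothetical names, and `evaluate` stands in for the whole fit-then-evaluate procedure of Section 3, which is not reproduced here.

```python
from itertools import combinations, product


def reliability_envelope(samples_per_level, sizes, evaluate, times):
    """Pointwise min/max of reliability curves over all sample subsets.

    samples_per_level : list of lists, the units tested at each stress level
    sizes             : number of units kept at each stress level
    evaluate          : hypothetical callable; takes one subset per stress
                        level and returns a function t -> reliability
    times             : monitoring times at which the curves are compared
    """
    lower = [float("inf")] * len(times)
    upper = [float("-inf")] * len(times)
    subsets = [combinations(units, k) for units, k in zip(samples_per_level, sizes)]
    for chosen in product(*subsets):
        curve = evaluate(chosen)
        for idx, t in enumerate(times):
            r = curve(t)
            lower[idx] = min(lower[idx], r)
            upper[idx] = max(upper[idx], r)
    return lower, upper
```

The number of combinations grows quickly with the sample sizes, so for the larger situations a random subset of combinations may be a practical substitute for full enumeration; that shortcut is a suggestion of this sketch, not something stated in the study.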

Figure 3 shows that as the sample size increases, the lower and upper boundaries gradually approach each other. This indicates that when more samples provide more information, the epistemic uncertainty in the ADT data decreases, which leads to more stable reliability evaluation results. In addition, the reliability evaluation

Table 3. Different situations for discussions. 


Situations | Sample sizes n,a N,> N,3 


r, 


ST. 2 
ST, 3 
ST, 4 
ST: 5 
ST, 6 
ST, 7 
ST, 8 
ST, 9 


SCOMANHDMN SW 
SCOANADMN HW 


— 
— 


Figure 3. Lower and upper boundaries of the reliability evaluation results under different sample sizes (belief reliability versus time in hours).

results under $ST_8$ are included in the lower and upper boundaries under most situations ($ST_3$ to $ST_7$). As for $ST_1$ and $ST_2$, the sample size is very small, which makes the provided information too scarce to get stable reliability evaluation results. The above analysis shows that the proposed methodology is a suitable choice for the small sample situation and can further support subsequent decision making.


5 CONCLUSIONS 


This paper deals with positive degradation processes with small samples in ADT; its conclusions are as follows:


1. Based on uncertainty theory, the general geometric Liu process is used to construct an uncertain accelerated degradation model, which takes the epistemic uncertainty due to small samples in ADT data into consideration.

2. The corresponding statistical analysis method with objective measures is provided for estimating the unknown parameters.

3. The application results show that the reliability evaluation results of the proposed methodology agree with intuition, and the discussion shows that the proposed methodology provides stable reliability evaluation results under small samples, which makes it a suitable choice for the small sample situation and a possible support for subsequent decision making.


In addition to the work of this paper, other issues are worth future research. The proposed model is built on the general geometric Liu process, which is a positive uncertain process. But in practical applications, there are degradation processes which are not only always positive but also strictly monotonic. It is necessary to apply other uncertain processes in ADT to model such degradation processes.


ABSTRACT: For solid state drives (SSDs) with high reliability and long life, it is nearly impossible to obtain a sufficient amount of time-to-failure data within an acceptable testing time by testing such products under normal operating environments. Thus, accelerated degradation testing (ADT) is introduced to solve reliability modeling problems based on the products’ degradation information obtained from accelerated tests. The link between performance degradation data and product failures can be established through the degradation threshold failure mechanism. From the assumed degradation process, we can estimate the failure time for a given threshold value. However, due to the diversity of users and the uncertainty about which explicit level of degradation will cause a failure, a probabilistic, rather than deterministic, threshold value should be taken into account. On the other hand, a complex operating environment results in inevitable noise, so it is necessary to consider the measurement error in the reliability analysis. This paper proposes a reliability assessment method based on a fuzzy failure threshold and measurement errors, aiming at improving the assessment precision. We establish the degradation model with fuzzy failure threshold and measurement errors, and the maximum likelihood estimation method is adopted to estimate the failure time distribution parameters. The reliability model can then be used for subsequent forecasting and decision-making. A commercial off-the-shelf SSD is used as an example to illustrate how to predict the time to failure, with the writing current used as a precursor parameter and directly monitored. Finally, the results show the superior performance of the proposed method over traditional methods.


1 INTRODUCTION 


Solid state drives (SSDs) are the most important and widely used data storage devices in space applications due to their high performance, low energy consumption, small size, and shock resistance compared with traditional hard disk drives (HDDs). As the amount of data stored in SSDs keeps increasing, it is important to understand the reliability characteristics of SSDs under field conditions. However, for SSDs with high reliability and long life, it is nearly impossible to obtain a sufficient amount of time-to-failure data within an acceptable testing time by testing such products under normal operating environments, and sometimes even under harsher conditions (Nelson 2008).

Accelerated degradation testing (ADT) with higher stress levels provides a method to carry out the life test within a reasonable time frame based on the products’ degradation information obtained from historical data or degradation tests (YaoHsu, Chih-YenSu et al. 2014). Generally, a failure occurs when the performance degradation exceeds a specified threshold. The reliability can be predicted by extrapolating the degradation trend to its threshold after tracking the degradation path for a while (Ye and Xie 2015). Thus, a large amount of failure and reliability information can be obtained in the accelerated degradation process.

This field has attracted great attention. Ren proposed an original approach combining ADT and physics of failure (PoF) modeling for effectively assessing connector reliability (Ren, Feng et al. 2015). Gonzalez presented a complete reliability analysis including the determination of the reliability functions and parameters obtained for concentrator multi-junction solar cells (Espinet-Gonzalez, Algora et al. 2015). Chang proposed a reliability qualification test method for pneumatic cylinders based on performance degradation data (Chang, Kwon et al. 2014).

The actual failure point is usually unpredictable or random in practical engineering applications because of the lack of sufficient empirical information and experimental data, leading to ambiguity in the definition of the failure threshold. Therefore, it is important to consider the potential uncertainties in the reliability analysis.

Chen presented a reliability analysis for systems with a random
failure threshold based on the degradation model and the probability
distribution of the failure threshold (Chen, Ma et al. 2014). Hua
proposed a novel performance reliability estimation method based on
an adaptive failure threshold (Hua, Zhang et al. 2013).

In this paper, a reliability assessment method based on measurement
errors and a fuzzy failure threshold is proposed. First, the
life-stress model and the lifetime distribution model are established.
By combining these two models, we use maximum likelihood estimation to
estimate the parameters of the failure time distribution, from which
the reliability based on a fixed failure threshold can be estimated.
Subsequently, we present a reliability assessment method based on a
fuzzy failure threshold by applying the probability formula of the
fuzzy event and the convolution formula. The method is then validated
on a type of commercial off-the-shelf SSD to be used in the Chinese
space station. Finally, we compare the reliability assessment curves
obtained with a fixed failure threshold, with a fuzzy threshold
without measurement errors, and with a fuzzy failure threshold and
measurement errors. The results show that the proposed method has a
higher assessment accuracy and is more flexible.


2 RELIABILITY MODELING AND 
ANALYSIS 


2.1 Drift degradation data analysis and model 
modification 


For electronic products in ADT, performance degradation is accompanied
by a performance drift created by the accelerated stress (Li et al.
2017). The degradation path model can be described by:

$$m_i(t) = x_i(t) + \Delta x_i + \varepsilon_i \tag{1}$$

where $m_i(t)$ is the measured value; $x_i(t)$ is the true degradation
value; $\Delta x_i$ is the performance drift; and $\varepsilon_i$ is
the measurement error, which is independent of $x(t)$, with
$\varepsilon \sim N(0, \sigma^2)$.


2.2 Lifetime distribution and parameter 
estimation with degradation data 


2.2.1 Life-stress model 
As temperature is the acceleration variable, the well-known Arrhenius
model is selected:

$$\xi(T) = A\, e^{\frac{E_a}{kT}} \tag{2}$$

where $\xi(T)$ is the performance characteristic related to life; $A$
is a constant; $E_a$ is the activation energy (eV); $k$ is the
Boltzmann constant ($8.617 \times 10^{-5}$ eV/K); and $T$ is the
absolute temperature.
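As an illustration of how the Arrhenius relation (2) links stress and use temperatures, the following minimal Python sketch evaluates the life-related characteristic and the resulting acceleration factor. The activation energy `EA_ASSUMED` and the function names are hypothetical choices made for this example only; the paper does not report a value for $E_a$.

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def arrhenius_life(temp_celsius, a, ea_ev):
    """Life-related characteristic xi(T) = A * exp(Ea / (k * T)), T in kelvin."""
    temp_kelvin = temp_celsius + 273.15
    return a * math.exp(ea_ev / (K_BOLTZMANN_EV * temp_kelvin))


def acceleration_factor(use_celsius, stress_celsius, ea_ev):
    """Ratio xi(T_use) / xi(T_stress); the constant A cancels out."""
    t_use = use_celsius + 273.15
    t_stress = stress_celsius + 273.15
    return math.exp(ea_ev / K_BOLTZMANN_EV * (1.0 / t_use - 1.0 / t_stress))


EA_ASSUMED = 0.7  # eV, hypothetical value used only for illustration

for stress in (80.0, 90.0, 104.0):
    # Acceleration of each test temperature relative to 25 degC
    print(stress, round(acceleration_factor(25.0, stress, EA_ASSUMED), 1))
```

The acceleration factor depends only on $E_a$ and the two temperatures, which is why the constant $A$ does not need to be known to compare stress levels.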


2.2.2 Lifetime distribution model 

Among the many standard statistical distributions used to model
reliability parameters, the Weibull distribution is the most common
lifetime distribution (Bertsche 2010), because it applies well to
various failure modes through adjustment of the distribution
parameters. In this paper, a two-parameter Weibull distribution is
selected for the degradation life distribution, whose cumulative
distribution function (cdf) is:

$$F(x; t, T) = 1 - \exp\left[-\left(\frac{x}{\eta}\right)^{\beta}\right] \tag{3}$$

where $x$ is the true degradation, $t$ is the measurement time, $T$ is
the accelerated temperature stress level, and $\beta$ and $\eta$ are
the shape parameter and the scale parameter, respectively.

Under the assumption of constant temperature, (2) can be reduced to
the following power-law function of test time and stress level (Ren,
Feng et al. 2015):

$$\eta(t, T) = a \cdot e^{\frac{b}{T}} \cdot t^{c} \tag{4}$$

where $a$, $b$, $c$ are the parameters to be estimated.
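The scale parameter of the Weibull model ($\eta$) depends on the
temperature through the Arrhenius model, and the shape parameter
($\beta$) is assumed constant for different temperatures. Combining
the life-stress model and the lifetime distribution model, we obtain
the combined Arrhenius-Weibull model:

$$F(x; t, T) = 1 - \exp\left[-\left(\frac{x}{a\, e^{b/T}\, t^{c}}\right)^{\beta}\right] \tag{5}$$

where $\beta$, $a$, $b$, $c$ are the parameters to be estimated.
The failure probability density function (pdf) can be expressed as
follows:

$$f(x; t, T) = \frac{\beta}{a\, e^{b/T}\, t^{c}} \left(\frac{x}{a\, e^{b/T}\, t^{c}}\right)^{\beta - 1} \exp\left[-\left(\frac{x}{a\, e^{b/T}\, t^{c}}\right)^{\beta}\right] \tag{6}$$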
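The combined model can be sketched in a few lines of Python. The functions below implement the scale parameter (4), the cdf (5) and the pdf (6); the parameter values in `PARAMS` are hypothetical placeholders, not the estimates obtained in this paper.

```python
import math


def weibull_scale(t, temp_kelvin, a, b, c):
    """Scale parameter eta(t, T) = a * exp(b / T) * t**c, equation (4)."""
    return a * math.exp(b / temp_kelvin) * t ** c


def degradation_cdf(x, t, temp_kelvin, beta, a, b, c):
    """Arrhenius-Weibull cdf of the true degradation, equation (5)."""
    eta = weibull_scale(t, temp_kelvin, a, b, c)
    return 1.0 - math.exp(-((x / eta) ** beta))


def degradation_pdf(x, t, temp_kelvin, beta, a, b, c):
    """Arrhenius-Weibull pdf of the true degradation, equation (6)."""
    eta = weibull_scale(t, temp_kelvin, a, b, c)
    z = x / eta
    return (beta / eta) * z ** (beta - 1.0) * math.exp(-(z ** beta))


# Hypothetical parameter values, chosen only to make the sketch runnable
PARAMS = {"beta": 3.0, "a": 500.0, "b": -1500.0, "c": 0.4}

t_days, temp_k = 40.0, 80.0 + 273.15
print(weibull_scale(t_days, temp_k, PARAMS["a"], PARAMS["b"], PARAMS["c"]))
print(degradation_cdf(30.0, t_days, temp_k, **PARAMS))
```

A negative `b` is used here so that the scale parameter grows with temperature, which matches the assumption of degradation increasing under higher stress.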


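2.2.3 Parameter estimation 
In order to estimate the best-suited parameters of the
Arrhenius-Weibull model (5), the maximum likelihood estimation (MLE)
method can be used. The likelihood function is (Ren, Feng et al.
2015):

$$L(\beta, a, b, c) = \prod_{k=1}^{n} \prod_{i=1}^{m} \frac{\beta}{a\, e^{b/T_{ik}}\, t_{ik}^{c}} \left(\frac{x_{ik}}{a\, e^{b/T_{ik}}\, t_{ik}^{c}}\right)^{\beta - 1} \exp\left[-\left(\frac{x_{ik}}{a\, e^{b/T_{ik}}\, t_{ik}^{c}}\right)^{\beta}\right] \tag{7}$$

where the product runs over all degradation observations $x_{ik}$,
each taken at its measurement time $t_{ik}$ and temperature $T_{ik}$.

Substituting the performance degradation data into (7) and solving
the MLE equations $\partial L / \partial \beta = 0$,
$\partial L / \partial a = 0$, $\partial L / \partial b = 0$,
$\partial L / \partial c = 0$, we obtain the estimates $\hat{\beta}$,
$\hat{a}$, $\hat{b}$, $\hat{c}$. The estimated true degradation
distribution is expressed as:

$$\hat{F}(x; t, T) = 1 - \exp\left[-\left(\frac{x}{\hat{a}\, e^{\hat{b}/T}\, t^{\hat{c}}}\right)^{\hat{\beta}}\right] \tag{8}$$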
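The estimation step can be illustrated with a small, self-contained sketch. It writes the log of likelihood (7) for observations given as `(x, t, temp_kelvin)` triples, simulates synthetic data at the three test temperatures with measurements every 4 days, and profiles the log-likelihood over the shape parameter on a coarse grid while the other parameters are held fixed. The "true" parameter values are hypothetical, and the grid search is a simplification chosen for readability; it is not the solver used by the authors.

```python
import math
import random


def weibull_scale(t, temp_kelvin, a, b, c):
    """Scale parameter eta(t, T) = a * exp(b / T) * t**c."""
    return a * math.exp(b / temp_kelvin) * t ** c


def log_likelihood(data, beta, a, b, c):
    """Log of the Weibull likelihood for (x, t, temp_kelvin) observations."""
    total = 0.0
    for x, t, temp_kelvin in data:
        eta = weibull_scale(t, temp_kelvin, a, b, c)
        z = x / eta
        total += math.log(beta / eta) + (beta - 1.0) * math.log(z) - z ** beta
    return total


def simulate(n_per_cell, beta, a, b, c, seed=0):
    """Draw synthetic degradation values at 80, 90, 104 degC, every 4 days."""
    rng = random.Random(seed)
    data = []
    for temp_celsius in (80.0, 90.0, 104.0):
        temp_kelvin = temp_celsius + 273.15
        for t in range(4, 41, 4):  # measurement times in days
            eta = weibull_scale(float(t), temp_kelvin, a, b, c)
            for _ in range(n_per_cell):
                x = rng.weibullvariate(eta, beta)  # (scale, shape)
                data.append((x, float(t), temp_kelvin))
    return data


# Hypothetical "true" parameters used only to generate synthetic data
TRUE = {"beta": 3.0, "a": 500.0, "b": -1500.0, "c": 0.4}
data = simulate(50, **TRUE)

# Coarse profile of the log-likelihood over beta (a, b, c held fixed)
grid = [1.0 + 0.1 * k for k in range(51)]
best_beta = max(
    grid,
    key=lambda bb: log_likelihood(data, bb, TRUE["a"], TRUE["b"], TRUE["c"]),
)
print(best_beta)
```

In practice all four parameters would be optimised jointly with a numerical optimiser; working with the log-likelihood rather than the likelihood itself avoids numerical underflow of the product in (7).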


The product will not work properly when the degradation value $x(t)$
reaches the threshold level $D$. For a fixed failure threshold, the
reliability $R(t)$ is the probability that the true degradation value
$x(t)$ is smaller than the fixed threshold level $D$. Therefore, the
reliability function based on a fixed failure threshold is given by
the cumulative distribution function of the true degradation value
evaluated at $D$:

$$R(t) = P(x(t) < D) = 1 - \exp\left[-\left(\frac{D}{\hat{a}\, e^{\hat{b}/T}\, t^{\hat{c}}}\right)^{\hat{\beta}}\right] \tag{9}$$


2.3 Reliability assessment method based 
on fuzzy failure threshold 


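The membership function describes the degree to which an element $u$
belongs to a fuzzy set on the domain $U$. In this paper, the
performance degradation data are assumed to increase over time. In
the beginning, the membership function is equal to 1. When the
performance degradation $x(t)$ reaches the lower limit of the fuzzy
failure threshold ($D_{\min}$), the membership function becomes
$\mu_{\tilde{A}}(x)$ and the performance of the product continues to
degrade gradually. When $x(t)$ exceeds the upper limit of the fuzzy
failure threshold ($D_{\max}$), the membership function is equal to 0
(Yan 2014).

Thus, the lower semi-trapezoid distribution membership function is:

$$\mu_{\tilde{A}}(x) = \begin{cases} 1, & x < D_{\min} \\ \dfrac{D_{\max} - x}{D_{\max} - D_{\min}}, & D_{\min} \le x \le D_{\max} \\ 0, & x > D_{\max} \end{cases} \tag{10}$$


2.4 Establishment and assessment of reliability 
function based on measurement errors and 
fuzzy failure threshold 


Considering the measurement uncertainty of the sensors, we assume
that the measurement errors $\varepsilon_i$ follow a normal
distribution, $\varepsilon_i \sim N(0, \sigma^2)$, so that their
probability density function is

$$f_{\varepsilon}(\varepsilon) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\varepsilon^{2}}{2\sigma^{2}}\right) \tag{11}$$

A fuzzy subset showing the good state of the product is expressed as
follows:

$$\tilde{A} = \{\tilde{D} > X(t)\} = \{\tilde{D} > Y(t) - \varepsilon\} \tag{12}$$

where $\tilde{D}$ is the fuzzy failure threshold and
$Y(t) = X(t) + \varepsilon$ is the measured value.

Since the true degradation value $x(t)$ and the measurement errors
$\varepsilon$ follow different distributions, the reliability function
based on measurement errors and fuzzy failure threshold can be
obtained by applying the probability formula of the fuzzy event and
the convolution formula:

$$R_f(t) = P(\tilde{A}) = \iint \mu_{\tilde{A}}(y - \varepsilon)\, f(y - \varepsilon; t)\, f_{\varepsilon}(\varepsilon)\, dy\, d\varepsilon$$

$$= \iint_{y - \varepsilon < D_{\min}} f(y - \varepsilon; t)\, f_{\varepsilon}(\varepsilon)\, dy\, d\varepsilon + \iint_{D_{\min} \le y - \varepsilon \le D_{\max}} \mu_{\tilde{A}}(y - \varepsilon)\, f(y - \varepsilon; t)\, f_{\varepsilon}(\varepsilon)\, dy\, d\varepsilon$$

$$= \iint_{y - \varepsilon < D_{\min}} \frac{\beta}{\eta} \left(\frac{y - \varepsilon}{\eta}\right)^{\beta - 1} e^{-\left(\frac{y - \varepsilon}{\eta}\right)^{\beta}} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{\varepsilon^{2}}{2\sigma^{2}}}\, dy\, d\varepsilon + \iint_{D_{\min} \le y - \varepsilon \le D_{\max}} \frac{D_{\max} - (y - \varepsilon)}{D_{\max} - D_{\min}} \cdot \frac{\beta}{\eta} \left(\frac{y - \varepsilon}{\eta}\right)^{\beta - 1} e^{-\left(\frac{y - \varepsilon}{\eta}\right)^{\beta}} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{\varepsilon^{2}}{2\sigma^{2}}}\, dy\, d\varepsilon \tag{13}$$

with $\eta = a\, e^{b/T}\, t^{c}$.

Substituting the estimators $\hat{\beta}$, $\hat{a}$, $\hat{b}$,
$\hat{c}$ into (13), we obtain the reliability assessment result. If
the measurement errors are ignored, the function reduces to: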


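$$R_f(t) = P(\tilde{A}) = \int \mu_{\tilde{A}}(x)\, f(x; t)\, dx = \int_{0}^{D_{\min}} f(x; t)\, dx + \int_{D_{\min}}^{D_{\max}} \mu_{\tilde{A}}(x)\, f(x; t)\, dx$$

$$= 1 - \exp\left[-\left(\frac{D_{\min}}{a\, e^{b/T}\, t^{c}}\right)^{\beta}\right] + \int_{D_{\min}}^{D_{\max}} \frac{D_{\max} - x}{D_{\max} - D_{\min}} \cdot \frac{\beta}{a\, e^{b/T}\, t^{c}} \left(\frac{x}{a\, e^{b/T}\, t^{c}}\right)^{\beta - 1} \exp\left[-\left(\frac{x}{a\, e^{b/T}\, t^{c}}\right)^{\beta}\right] dx \tag{14}$$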
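Because the integral in (14) has no closed form, it has to be evaluated numerically. The sketch below does this with a composite Simpson rule and compares the result with the fixed-threshold reliability (9). The Weibull scale and shape values passed in are hypothetical; the function names and the choice of Simpson's rule are assumptions of this example rather than the authors' implementation.

```python
import math


def weibull_cdf(x, eta, beta):
    return 1.0 - math.exp(-((x / eta) ** beta))


def weibull_pdf(x, eta, beta):
    z = x / eta
    return (beta / eta) * z ** (beta - 1.0) * math.exp(-(z ** beta))


def membership(x, d_min, d_max):
    """Lower semi-trapezoid membership function, equation (10)."""
    if x < d_min:
        return 1.0
    if x > d_max:
        return 0.0
    return (d_max - x) / (d_max - d_min)


def simpson(func, lo, hi, n=200):
    """Composite Simpson rule with an even number n of sub-intervals."""
    if n % 2:
        n += 1
    h = (hi - lo) / n
    total = func(lo) + func(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * func(lo + k * h)
    return total * h / 3.0


def reliability_fixed(d, eta, beta):
    """Equation (9): probability that the degradation is below D."""
    return weibull_cdf(d, eta, beta)


def reliability_fuzzy(d_min, d_max, eta, beta):
    """Equation (14): fuzzy threshold, measurement errors ignored."""
    ramp = simpson(
        lambda x: membership(x, d_min, d_max) * weibull_pdf(x, eta, beta),
        d_min,
        d_max,
    )
    return weibull_cdf(d_min, eta, beta) + ramp


ETA, BETA = 120.0, 3.0  # hypothetical scale (mA) and shape at some time t
print(reliability_fixed(125.0, ETA, BETA))
print(reliability_fuzzy(100.0, 150.0, ETA, BETA))
```

When the interval between `d_min` and `d_max` shrinks towards a single value, the fuzzy reliability converges to the fixed-threshold value, which is a convenient sanity check for any implementation.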


3 EXAMPLE OF SSD APPLICATION 


3.1 SSD test plan 


An SSD consists of a main control unit, power supply, connector,
cache, and NAND flash. The non-volatile NAND flash memory is the core
component. According to Fowler-Nordheim (FN) tunneling, data are
written to or read from the memory cells of the NAND flash by
electrons repeatedly passing through the tunnel oxide. This process
gradually degrades the tunnel oxide and thereby reduces the
reliability of the memory cells.

As the ambient temperature rises, the thermal energy of the electrons
increases, accelerating the destruction of the tunnel oxide during
tunneling. Therefore, temperature is the sensitive stress that causes
the SSD's performance degradation (Li et al. 2017).

According to reliability enhancement testing, the temperature design
limit of the tested SSD is −40 to 85°C, and its operating limit is
−80 to 125°C. Thus, a preliminary step-stress temperature test was
first carried out by testing all 9 samples from 25°C to 125°C with
the same pre-fixed stress-change time. The degradation effect can be
neglected because the test duration is very short.

The results of the preliminary step-stress temperature test show that
the majority of the performance indexes remain nearly unchanged,
except the read/write current and the quiescent current, which rise
with increasing temperature. Therefore, the relationship between the
current drift $\Delta x$ and the temperature $T$ can be modeled by
regression analysis:

$$\Delta x = f(T) \tag{15}$$

Nine SSD samples are randomly divided into three groups. ADT is
carried out with one group at each of the accelerated stress levels
of 80°C, 90°C and 104°C. The test duration is set to 40 days and
measurements are taken every 96 hours. During the test, the
performance characteristics (read/write current, quiescent current,
read/write speed, read/write response time and bad block increment)
are monitored in real time. All samples operate continuously
throughout the test. The test results indicate that the write current
is the most suitable index for tracking SSD degradation.


3.2 SSD test results and reliability analysis 


Once the relationship between write current drift and temperature has
been determined, the drift impact can be eliminated from the
degradation path model.

The write current at 25°C is taken as the initial value, and the
least squares method is applied to fit the average measured drift
data. The fitted curve is illustrated in Figure 1.

The fitting model is expressed as:

$$\Delta x = a \cdot \exp(b \cdot \Delta T) \tag{16}$$

where $\Delta x$ is the current drift and $\Delta T$ is the
temperature rise over the initial temperature $T_0 = 25$°C. The
parameters are estimated to be $a = 2.088$, $b = 0.03446$.
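The drift model (16) can be evaluated and used to correct measurements as in the following Python sketch. The two constants are the estimates reported above; the function names are illustrative, and the fitting routine uses a log-linearised least squares fit, which is an assumption of this example (the paper states only that the least squares method was used).

```python
import math

A_DRIFT = 2.088     # estimated parameter a of equation (16)
B_DRIFT = 0.03446   # estimated parameter b of equation (16)
T0_CELSIUS = 25.0   # initial temperature


def current_drift(temp_celsius):
    """Current drift from equation (16), with dT = T - T0."""
    return A_DRIFT * math.exp(B_DRIFT * (temp_celsius - T0_CELSIUS))


def remove_drift(measured, temp_celsius):
    """True degradation from equation (1), ignoring the measurement error."""
    return measured - current_drift(temp_celsius)


def fit_exponential(delta_t, drift):
    """Least squares fit of ln(drift) = ln(a) + b * dT; returns (a, b)."""
    n = len(delta_t)
    logs = [math.log(v) for v in drift]
    mean_x = sum(delta_t) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in delta_t)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(delta_t, logs))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


for temp in (80.0, 90.0, 104.0):
    print(temp, round(current_drift(temp), 3))
```

Fitting on the logarithm weights relative rather than absolute errors, so its estimates can differ slightly from a direct nonlinear least squares fit of (16) on noisy data.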

The samples tested at 80°C are named N-0, N-1, N-2; those tested at
90°C are named N-3, N-4, N-5; and those tested at 104°C are named
N-6, N-7, N-8. After the current drift impact has been eliminated,
the true degradation paths of the write current at 80°C, 90°C and
104°C can be calculated; they are shown in Figure 2.

Figure 3 shows the variation over time of the write current
degradation distribution in terms of Weibull probability plots for
the different temperature levels. The data can be considered to
follow the Weibull distribution, as all the scatter points lie close
to the reference line.

Figure 1. Regression analysis of current drift & temperature
increment.

Figure 2. True degradation path of write current.
(a) 80°C. (b) 90°C. (c) 104°C.

Fitting the test observations by MLE with the Arrhenius-Weibull model
presented above, the estimates of $\beta$ and $\eta$ can be obtained.
Further, with the estimates of $\eta$ at different time points and
stress levels, the parameters $a$, $b$, $c$ can be estimated and the
write current degradation distribution model is specified as:


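$$\hat{F}(x; t, T) = 1 - \exp\left[-\left(\frac{x}{\hat{a}\, e^{\hat{b}/T}\, t^{\hat{c}}}\right)^{\hat{\beta}}\right] \tag{17}$$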


3.3 Reliability extrapolation at normal working 
conditions 


Afterwards, the Arrhenius-Weibull model is used to extrapolate the
performance of the SSD to normal temperature. According to the NAND
flash memory manual, the failure threshold at normal temperature
(25°C) is 125 mA (Isobe 2010). The reliability function based on a
fixed failure threshold for the SSD is then:

$$R(t, T) = P(x(t) < D) = 1 - \exp\left[-\left(\frac{125}{\hat{a}\, e^{\hat{b}/T}\, t^{\hat{c}}}\right)^{\hat{\beta}}\right] \tag{18}$$

Figure 3. True degradation path of write current at different
measuring time and temperature levels. (a) 80°C (N-0, N-1, N-2).
(b) 90°C (N-3, N-4, N-5). (c) 104°C (N-6, N-7, N-8).


The lower limit of the fuzzy failure threshold ($D_{\min}$) is
assumed to be 100 mA, and the upper limit ($D_{\max}$) to be 150 mA.
In addition, the variance of the measurement error distribution is
0.8, namely $\varepsilon \sim N(0, 0.8)$. As the integrals in (13)
and (14) have no explicit expressions, numerical integration is used
to evaluate them. The reliability curves based on the fixed failure
threshold, the fuzzy failure threshold without measurement errors,
and the fuzzy threshold with measurement errors at normal temperature
(25°C) are illustrated in Figure 4.

It can be observed that the three reliability curves are close to
each other, but when the fuzzy threshold is considered, the
reliability is slightly lower than with the deterministic threshold.
Moreover, the reliability with measurement errors and fuzzy threshold
is the lowest. Thus, when predicting SSD reliability, it is necessary
to consider the influence of measurement errors and the fuzzy
threshold, which results in a more conservative and practical value.

Figure 4. Reliability curve based on different assessment methods.

Figure 5. Reliability curve under different failure thresholds.

For further analysis, Figure 5 presents the reliability curves
obtained by changing the range of the failure threshold.

It can be seen that the smaller the fuzzy threshold range is, the
closer the reliability assessment curve is to the reliability curve
based on the fixed failure threshold. The comparison shows that the
reliability assessment method based on measurement errors and fuzzy
failure threshold is more accurate and flexible.


4 SUMMARY AND CONCLUSIONS 


In this paper, a reliability assessment method based on a fuzzy
failure threshold and measurement errors is proposed. First, we
establish the life-stress model and the lifetime distribution model,
and the model parameters are estimated by maximum likelihood
estimation. Second, we derive the reliability model based on a fixed
failure threshold by substituting the estimators into the combined
Arrhenius-Weibull model. Third, we present a reliability assessment
method based on a fuzzy failure threshold and measurement errors by
applying the probability formula of the fuzzy event and the
convolution formula. Subsequently, we take a type of SSD as an
example to demonstrate the reliability assessment method. Finally, we
compare the reliability assessment curves obtained in the different
situations. The comparison shows that the proposed method has a
higher assessment accuracy and is more flexible.


ACKNOWLEDGMENTS 


This study was supported by the National Natural 
Science Foundation of China (Grant No. 61703391) 
and Technology and Engineering Center for Space 
Utilization (Grant No. CSU-QZKT201714). 


REFERENCES 


Bertsche, B. (2010). “Reliability in Automotive and 
Mechanical Engineering.” VDI-Buch. 

Chang, M.S., et al. (2014). “Design of reliability qualifica- 
tion test for pneumatic cylinders based on performance 
degradation data.” Journal of Mechanical Science & 
Technology 28(12): 4939-4945. 

Chen, J., et al. (2014). Reliability Analysis for System with 
Random Failure Threshold, Springer Berlin Heidelberg. 

Espinet-Gonzalez, P., et al. (2015). “Temperature acceler- 
ated life test on commercial concentrator III-V triple- 
junction solar cells and reliability analysis as a function 
of the operating temperature.” Progress in Photovoltaics 
Research & Applications 23(5): 559-569. 

Hua, C., et al. (2013). “Performance reliability estimation 
method based on adaptive failure threshold.” Mechani- 
cal Systems & Signal Processing 36(2): 505-519. 

Isobe, K. (2010). NAND flash memory, US. 

Li, P., et al. (2017). “Statistical Analysis of Step-stress 
Accelerated Degradation Testing based on New and 
Used Samples.” Reliability & Maintainability Sympo- 
sium, 2017. 

Li, P., et al. (2017). “Reliability Assessment of NAND 
SSD Based on Acceleration Degradation Test.” 2017 
IEEE International Conference on Industrial Engineer- 
ing and Engineering Management. 

Nelson, W. (2008). Accelerated Testing: Statistical Models, 
Test Plans, and Data Analysis. 

Ren, Y., et al. (2015). “A Novel Model of Reliability Assess- 
ment for Circular Electrical Connectors.” IEEE Trans- 
actions on Components Packaging & Manufacturing 
Technology 5(6): 755-761. 

Yan, W.A. (2014). “Research on the method of storage reli- 
ability for torpedo.” Northwestern Polytechnical Univer- 
sity, China. 

YaoHsu, et al. (2014). “An analytical procedure for esti- 
mating field lifetime and failure rate of electronic pack- 
ages.” Journal of the Chinese Institute of Engineers 37(1): 
36-43. 

Ye, Z.S. and M. Xie (2015). “Stochastic modelling and 
analysis of degradation for highly reliable products.” 
Applied Stochastic Models in Business & Industry 31(1): 
16-32. 




Safety and Reliability - Safe Societies in a Changing World — Haugen et al. (Eds) 
© 2018 Taylor & Francis Group, London, ISBN 978-0-8153-8682-7 


Effect of load-generation variability on power grid cascading failures 


R. Rocchetta & E. Patelli 
Institute for Risk and Uncertainty, Liverpool University, Liverpool, UK 


L. Bing & G. Sansavini 
Reliability and Risk Engineering Laboratory, Department of Mechanical and Process Engineering, Institute 
of Energy Technology, ETH Zurich, Zurich, Switzerland 


ABSTRACT: Cascading failure events are major concerns for future power grids and are generally not
treatable analytically. For a realistic analysis of the cascading sequence, dedicated models for numerical
simulation are often required. These are generally computationally costly and involve many parameters
and variables. Due to the uncertainty associated with cascading failures and the limited or unavailable
historical data on large cascading events, several factors turn out to be poorly estimated or subjectively
defined. In order to improve confidence in the model, sensitivity analysis is applied to reveal which of
the uncertain factors have the highest influence on a realistic DC overload cascading model. The 95th
percentile of the demand not served, the estimated mean number of line failures and the frequency of line
failure are the considered outputs. These are obtained by evaluating random contingency and load scenarios
for the network. The approach makes it possible to reduce the dimensionality of the model input space and
to identify the input interactions that most affect the statistical indicators of the demand not supplied.


1 INTRODUCTION 


Ensuring a highly reliable electric power supply is a major concern
for next-generation power grids. Power grids should be able to
withstand known threats, such as N-1 and N-2 contingencies, but also
poorly understood low-probability, high-consequence events such as
N-k contingencies leading to cascading sequences. Due to the inherent
complexity of cascading failure events, the associated mathematical
models are generally not analytically solvable. This is mainly due to
the high dimensionality of the problem and to the complex, non-linear
and dynamic behaviour characterizing domino failures.

Computational models for the simulation of cascading sequences are
used to provide a solution to the cascade problem. A wide variety of
models have been proposed in the past, aimed at analysing different
system behaviours and with several different objectives. For
instance, models employing the AC power flow (PF) equations, such as
the Manchester model (Nedic et al. 2006) or the linearized AC PF
model (Li et al. 2016), the ORNL-PSerc-Alaska (OPA) model (Dobson,
Carreras, Lynch, & Newman 2001) and DC PF-based models have been
developed to realistically simulate cascading failure sequences.

Numerical models for cascading simulation have to be adequately
designed, calibrated and validated (Bialek et al. 2016). Calibration
and validation should use available historical cascading data, which
are (in particular for large cascade events) quite limited (Rocchetta
et al. 2018) or affected by imprecision (Rocchetta et al. 2018).
Consequently, model verification and calibration are very challenging
and affected by a high level of uncertainty. The uncertainty becomes
particularly prominent when the model is used to simulate rare events
leading to very severe consequences.

To increase confidence in the cascading model results and better
understand the relation between its inputs and outputs, all the
relevant sources of uncertainty affecting the analysis should be
quantified. Dimensionality and complexity issues are often involved
in cascade analysis problems, and the numerical simulators generally
reflect these problems. In fact, the simulators are often
time-consuming and involve a large number of uncertain variables and
parameters.

Sensitivity analysis methods are useful to deal with both
dimensionality and uncertainty issues. These methods can be used to
reveal which sources of uncertainty most affect the model output and
to reduce the dimensionality of the aleatory space by prioritizing
only the most important factors. This is useful information,
necessary to better comprehend input-output relations otherwise
hidden within the complexity of the model.

Global sensitivity analysis methods are often employed by uncertainty
analysts to sharpen the view of the problem. Sensitivity analysis is
sometimes regarded as a fundamental part of any work that involves
the assessment and propagation of uncertainty (Borgonovo and Plischke
2016). By applying global sensitivity analysis methods, insights can
be gained regarding the input-output mapping and the key drivers of
uncertainty can be clearly revealed (Borgonovo and Plischke 2016).

In this paper, an integrated framework for sensitivity analysis and
power grid cascading analysis is proposed. The framework can be used
to identify and prioritize the most relevant uncertain input factors
by revealing their effect on different cascading failure indicators.
Both system-level indicators, describing the overall impact of
cascading failures, and component-level indicators, focusing on the
performance of a single component, are considered. One of the aims of
this work is to provide engineering practitioners with some guidance
on the application of given-data sensitivity analysis and screening
methods, promoting their potential.

The framework is tested on a modified version of the IEEE RTS96
system. Two uncertainty cases are analysed. The first accounts only
for the uncertainty in the load demand. The second, more complex and
realistic case also accounts for randomness in the generator costs,
thus inflating the dimensionality of the input space, i.e. allowing
more flexibility in the generator outputs. The analysis makes it
possible to point out, with a modest computational effort, which of
the load and generator cost uncertainties most affect the outputs of
the cascading failure model.

The rest of the paper is organized as follows: Section 2 introduces
global sensitivity analysis and screening methods. In Section 3 the
algorithm for cascading failure simulation and the performance
indicators are introduced. In Section 4 the framework is tested on a
benchmark case study, the RTS96 system, for which two uncertainty
cases are analysed. Section 5 closes the paper with a discussion of
the results and conclusions.


2 SENSITIVITY ANALYSIS 
AND SCREENING 


This section gives a concise introduction to uncertainty
quantification and methods for global sensitivity analysis.
Traditionally, uncertainty quantification and analysis consist in the
assignment of probability distributions to the model input factors
(variables and parameters). Once the uncertainty has been
characterized, it is propagated through the simulation code via the
Monte Carlo method. First, uncertain factors are characterised by
assigning probability distributions. This is an important step which
has to be performed adequately to assure high quality and consistency
of results (Patelli, Pradlwarter, & Schuller 2010). Then, samples are
obtained from the joint probability distribution of the input
factors, e.g. by Latin hypercube sampling, quasi-random sequences or
crude Monte Carlo inverse transform sampling (Patelli, Broggi,
Angelis, & Beer 2014). Once the i-th input realisation
$X_i = [X_i(1), \ldots, X_i(m)]$ is obtained, the sample is forwarded
to the computational model $M(X)$. This allows obtaining information
about the input-output mapping defined by the computational model as
follows:

$$M: X \to Y, \quad X \mapsto Y = M(X) \tag{1}$$

where $Y$ is the model output, for simplicity and without loss of
generality assumed to be 1-dimensional.
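The sampling and propagation steps can be sketched as follows. The example draws a Latin hypercube sample on the unit hypercube and forwards each realisation to a model; `toy_model` is a hypothetical stand-in for the computational model $M(X)$, and mapping the unit-interval samples to physical distributions (by inverse transform) is left out for brevity.

```python
import random


def latin_hypercube(n_samples, n_factors, seed=0):
    """Latin hypercube sample on [0, 1)^m: one point per stratum and factor."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_factors):
        strata = list(range(n_samples))
        rng.shuffle(strata)  # random pairing of strata across factors
        columns.append([(s + rng.random()) / n_samples for s in strata])
    return [[columns[j][i] for j in range(n_factors)] for i in range(n_samples)]


def propagate(model, samples):
    """Forward every input realisation X_i to the model, Y_i = M(X_i)."""
    return [model(x) for x in samples]


def toy_model(x):
    """Hypothetical stand-in for the computational model M(X)."""
    return x[0] + 2.0 * x[1] ** 2


samples = latin_hypercube(100, 2)
outputs = propagate(toy_model, samples)
print(sum(outputs) / len(outputs))  # Monte Carlo estimate of E[Y]
```

The resulting input-output pairs are exactly the "given data" on which the sensitivity measures of the following subsections can be estimated without further model runs.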

Global sensitivity analysis methods have been developed to identify
the most and the least relevant factors and to gain additional
insight into the input-output mapping defined in Equation (1).
Several global methods have been developed in recent decades.
Screening methods, such as the one-at-a-time design of Morris (Morris
1991), variance-based methods and density-based methods (Borgonovo &
Plischke 2016) are some of the most intensively applied.


2.1 Given data Sobol’s indices 


A variance-based statistic, commonly referred to as the first-order
sensitivity coefficient, quantifies the (additive) effect of each
input factor on the model output as follows (M. Sobol 1990):

$$S_i = \frac{V_{X_i}\left[E_{X_{\sim i}}[Y \mid X_i]\right]}{V[Y]} \tag{2}$$

where $V[Y]$ is the total variance of the output $Y$, $X_i$ is the
i-th uncertain input factor, $X_{\sim i}$ is the matrix of all
uncertain factors but $X_i$, $E_{X_{\sim i}}[Y \mid X_i]$ is the
expectation of the model output $Y$ taken over all possible values of
$X_{\sim i}$ while removing the $X_i$ uncertainty (i.e. keeping $X_i$
fixed), and $V_{X_i}[\cdot]$ is the variance taken over all possible
values of $X_i$. The index $S_i$ reveals the importance of the input
factor $X_i$ for the variance of the output. It is a normalized
index: for an additive model $\sum_i S_i = 1$, while in general
$\sum_i S_i \le 1$.
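A simple way to estimate the first-order index (2) from an existing input-output sample is to sort the data by one factor, split them into equally populated bins, and compare the variance of the bin-wise means of $Y$ with the total variance. The sketch below implements this binning estimator; it is one possible given-data estimator written for illustration (the bin count and function names are assumptions) and is not claimed to be the exact algorithm of the cited references.

```python
import random


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def first_order_index(x_column, y, n_bins=20):
    """Given-data estimate of S_i: variance of the conditional means of Y
    over equally populated bins of X_i, divided by the variance of Y."""
    n = len(y)
    order = sorted(range(n), key=lambda k: x_column[k])
    mean_y = sum(y) / n
    between = 0.0
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:
            continue
        bin_mean = sum(y[k] for k in idx) / len(idx)
        between += len(idx) / n * (bin_mean - mean_y) ** 2
    return between / variance(y)


# Hypothetical additive test model: Y = X1 + 0.1 * X2, inputs uniform on [0, 1)
rng = random.Random(1)
x1 = [rng.random() for _ in range(4000)]
x2 = [rng.random() for _ in range(4000)]
y = [u + 0.1 * v for u, v in zip(x1, x2)]

print(first_order_index(x1, y), first_order_index(x2, y))
```

For this toy model nearly all of the output variance is attributed to the first factor. The number of bins trades bias against noise: too few bins smooth out the conditional mean, too many make each bin mean noisy.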

The main effect index reveals the importance of each uncertain factor
for the uncertainty in the model output. It is relatively cheap to
obtain, as it can be efficiently computed using given-data methods or
from a single Monte Carlo run (Plischke et al. 2013). The main
drawback of the index is that interactions between input factors are
not accounted for by this sensitivity measure. Higher-order Sobol'
effects (second- and higher-order interactions) are included in the
so-called total effect index $S_{Ti}$. This is a variance-based
measure of the influence of an input $i$ accounting for all its
interactions with the other uncertain factors. It is defined as
follows:

$$S_{Ti} = \frac{E_{X_{\sim i}}\left[V_{X_i}[Y \mid X_{\sim i}]\right]}{V[Y]} = 1 - \frac{V_{X_{\sim i}}\left[E_{X_i}[Y \mid X_{\sim i}]\right]}{V[Y]} \tag{3}$$

where $S_{Ti}$ accounts for all the contributions to the total
variance of the output $V[Y]$ that remain when the first-order effect
of $X_{\sim i}$ is removed.
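The first form of (3) can be estimated by sampling. The sketch below uses the Jansen estimator, a standard choice that is swapped in here for illustration and is not taken from the paper: two independent sample matrices are drawn, and for each factor the output is re-evaluated with only that factor replaced. The test model $Y = X_1 X_2$ with an inert third input is hypothetical; for uniform inputs on [0, 1) its total effect indices are 4/7, 4/7 and 0.

```python
import random


def total_effect_indices(model, n_factors, n_base=4000, seed=0):
    """Jansen estimator of S_Ti = E[V(Y | X_~i)] / V[Y] for uniform inputs."""
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(n_factors)] for _ in range(n_base)]
    b = [[rng.random() for _ in range(n_factors)] for _ in range(n_base)]
    y_a = [model(row) for row in a]
    mean = sum(y_a) / n_base
    var = sum((v - mean) ** 2 for v in y_a) / n_base
    indices = []
    for i in range(n_factors):
        acc = 0.0
        for row_a, row_b, ya in zip(a, b, y_a):
            mixed = list(row_a)
            mixed[i] = row_b[i]  # resample factor i only, keep X_~i fixed
            acc += (ya - model(mixed)) ** 2
        indices.append(acc / (2.0 * n_base * var))
    return indices


def product_model(x):
    """Hypothetical test model with a pure interaction and an inert input."""
    return x[0] * x[1]


print(total_effect_indices(product_model, 3))
```

Unlike the given-data estimate of the first-order index, this estimator needs additional model runs for every factor, which is the usual price for capturing interactions.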


2.2 Elementary effects and Morris diagram 


The Elementary Effects (EEs) method is a screening method used to
identify the effect of the input factors $X(i)$, with
$i = 1, 2, \ldots, m$, on the output $Y$ of a mathematical or
computational model $M(X)$. The method consists in the calculation of
$m$ incremental ratios, also called Elementary Effects, which are
used to assess the influence of the input variables and parameters.
The i-th elementary effect of the m-dimensional input vector $X_j$ is
defined as follows:

$$\delta_i(X_j) = \frac{Y\left(X_j(1), \ldots, X_j(i) + \Delta, \ldots, X_j(m)\right) - Y(X_j)}{\Delta} \tag{4}$$

where the quantity $\Delta$ is a given variation in the input factor
whose effect has to be evaluated. Intuitively speaking, the input
factors leading to the highest incremental ratios $\delta_i(X_j)$
have to be considered the most relevant for the output quantity $Y$.
Of course, this relevance metric is valid only locally, in the $X_j$
where $Y$ has been evaluated. Repeated One-At-a-Time (OAT)
evaluations of random vector configurations provide the elementary
effect method with global sensitivity analysis features (Turati et
al. 2017). The mean and standard deviation of the EEs, resulting from
random input vector configurations, can be plotted in the well-known
$\mu(\delta)$-$\sigma(\delta)$ plot proposed by Morris (Morris 1991).
If a factor $X_i$ results in a small absolute value of the mean and a
small variance, it should be considered less relevant for the model.
On the other hand, a factor $X_i$ resulting in a high
$|\mu(\delta_i)|$ has to be considered highly relevant for the model,
i.e. it leads, on average, to a higher variation in the output.
Similarly, a factor $X_i$ resulting in a high $\sigma(\delta_i)$ is
also of interest for the model output. In fact, a high
$\sigma(\delta_i)$ probably indicates a non-linear relation between
the factor $i$ and the output and/or a relevant interaction with
other factors. An example of a Morris plot is presented in Figure 1,
where the standard error of the mean (SEM) is used to decompose the
plot into different areas.

The method has some strengths worth highlighting: 1) It is relatively
easy to implement; 2) It is computationally cheap compared to other
global sensitivity methods, even for a high number of factors; 3) It
uses a sensitivity measure which is simple to communicate to
non-experts (similarity between incremental ratios and partial
derivatives); 4) Compared to variance-based measures, it shows
whether the input factors are (on average) positively or negatively
correlated with the output.

Figure 1. An example of Morris diagram and how to
discern between important and non important factors.


3 THE CASCADING MODEL 


A model for the simulation of steady-state operations of electric
networks was developed and calibrated by Bing Li and Sansavini
(2017). It can be used to simulate the initial contingencies that
trigger the cascading events and to estimate the post-contingency
system states. The initial generation dispatch for each load demand
is computed with a Security Constrained Optimal Power Flow (SCOPF),
which takes into account the generator constraints, line flow
constraints, voltage angle constraints and, optionally, the N-1
security constraints. After line tripping, DC power flow is used to
evaluate the post-contingency power flow. The failures propagate in
the grid through line overloading. Frequency control and protections,
voltage protections and a variety of other automatic and realistic
regulations and remedial actions are also included in the model.

A simplified flow chart of the cascading failure analysis, adapted
from (Bing Li and Sansavini 2017), is presented in Figure 2. The
algorithm starts by loading the power grid data and selecting the
steady-state solver (e.g. DC-SCOPF) and a list of N-k contingencies.
Then, for each N-k contingency, islands are identified, the frequency
deviation is assessed and under-frequency load shedding is performed
if necessary. Once the power balance is restored, line flows are
evaluated using the power flow solver and the lines exceeding their
flow limit are removed from the grid topology. This process is
repeated until grid stability is reached. The considered outputs are
the total Demand-Not-Served (DNS) due to the N-k contingency and the
line failure indicator functions, which indicate whether a line
tripped during the simulation of the N-k contingency.
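The iterative structure of the algorithm (trip lines, redistribute flows, check limits, repeat until no new overloads occur) can be illustrated with a deliberately simplified toy. In the sketch below the flow of tripped lines is shared equally among the surviving lines; this redistribution rule is a hypothetical placeholder and does not represent the DC power flow, islanding, frequency control or load shedding steps of the actual model.

```python
def simulate_cascade(flows, limits, initial_outages):
    """Toy overload cascade.

    flows and limits map line names to pre-contingency flows and flow
    limits; initial_outages is the set of lines removed by the
    contingency. Returns the set of all failed lines.
    """
    flows = dict(flows)
    failed = set(initial_outages)
    to_trip = set(initial_outages)
    while to_trip:
        surviving = [line for line in flows if line not in failed]
        if not surviving:
            break
        # Placeholder rule: share the lost flow equally among survivors
        shed = sum(flows[line] for line in to_trip)
        for line in to_trip:
            flows[line] = 0.0
        for line in surviving:
            flows[line] += shed / len(surviving)
        # Lines exceeding their limit trip in the next iteration
        to_trip = {line for line in surviving if flows[line] > limits[line]}
        failed |= to_trip
    return failed


FLOWS = {"L1": 50.0, "L2": 60.0, "L3": 40.0, "L4": 30.0}    # hypothetical
LIMITS = {"L1": 100.0, "L2": 70.0, "L3": 80.0, "L4": 45.0}  # hypothetical

print(sorted(simulate_cascade(FLOWS, LIMITS, {"L4"})))
print(sorted(simulate_cascade(FLOWS, LIMITS, {"L1"})))
```

With these illustrative numbers, the loss of L4 is absorbed by the remaining lines, whereas the loss of L1 overloads L2 and L4 and eventually trips every line, which shows how the outcome depends on the initiating contingency.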

For simplicity, the contingency list has been obtained by randomly
sampling N-1, N-2 and N-k line contingencies. To better identify and
select critical failure scenarios, methods such as N-2 contingency
screening, e.g. the method presented in (Kaplunovich and Turitsyn
2016), could have been employed. However, a smart exploration of the
contingency space was not the main aim of this work. Once the list is
obtained, repeated N-k contingency analyses are performed as
presented in the algorithm of (Bing Li and Sansavini 2017).

Figure 2. The flow chart of the algorithm for cascading
failures analysis.


3.1 System and components performance 
indicators 


Several output measures can be obtained from the
cascades model. In this work, we focus on 2 system-level
indicators, which provide insights on the grid
performance as a whole, and on $N_l$ component
performance indicators, one for each line in the system.
The indicators are the 95th percentile of the DNS
cumulative distribution function $p_{95}(\mathrm{DNS})$, the
average total number of lines tripped $\mu(N_f)$ and
the line outage frequency $P_l$, defined as follows:

$$\mu(N_f) = \frac{1}{N_c}\sum_{c=1}^{N_c}\sum_{l=1}^{N_l} I_{l,c}, \qquad P_l = \frac{1}{N_c}\sum_{c=1}^{N_c} I_{l,c},$$

where $N_c$ is the total number of contingencies
listed, $N_l$ is the total number of lines in the system
and $I_{l,c}$ is the indicator function for line $l$ and
contingency $c$. The indicator function is 0 if
the line survived the cascading propagation initiated
by contingency $c$, or 1 if the line failed, e.g. due to
flow redistribution leading to an overload.
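
As an illustration, the three indicators can be computed from the simulation outputs as sketched below. The function names and the linearly interpolated percentile estimator are assumptions of this sketch; the source does not state which percentile estimator was used.

```python
from typing import List, Sequence, Tuple


def percentile(values: Sequence[float], q: float) -> float:
    """Linearly interpolated q-th percentile, with 0 <= q <= 100."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def performance_indicators(
    dns: Sequence[float], tripped: Sequence[Sequence[int]]
) -> Tuple[float, float, List[float]]:
    """dns[c] is the demand not served for contingency c;
    tripped[c][l] is 1 if line l failed under contingency c, else 0."""
    n_c = len(dns)
    n_l = len(tripped[0])
    # 95th percentile of the DNS over the contingency list.
    p95_dns = percentile(dns, 95.0)
    # Average total number of lines tripped per contingency.
    mean_tripped = sum(sum(row) for row in tripped) / n_c
    # Outage frequency of each line over the contingency list.
    outage_freq = [sum(row[l] for row in tripped) / n_c for l in range(n_l)]
    return p95_dns, mean_tripped, outage_freq
```

The first two return values are the system-level indicators; the list holds one component-level indicator per line.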


4 A CASE STUDY 


The IEEE RTS96 power grid is used to test the
methods and the cascading model, and Figure 3 displays
the grid layout. The power grid data can be
found in (Grigg et al. 1999) and are not reported
here for the sake of brevity. In this analysis, two
representative uncertainty cases, named Case A and Case B,
are considered. In Case A, the uncertainty associated
with the load demand is explicitly modelled. In
the second case, Case B, random generation
costs are also accounted for, thus introducing uncertainty
in the power dispatch and increasing the
dimensionality of the random input space. The DC
cascading model presented in Section 3 is employed
for the solution of the cascading problem. A predefined
contingency list is selected and includes 2444
line contingencies. The list comprises the full set of
N-1 and N-2 line failures and a set of 1000 random
N-3 line failures. To simplify comparison between
uncertainty cases and the different sensitivity analysis
methods, the contingency list has been kept the
same throughout all the analyses (i.e. the random set
of N-3 contingencies has been sampled just once).
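
A contingency list of this kind could be assembled as in the sketch below. The function name, the fixed seed and the enumeration of N-2 failures as unordered pairs are assumptions of this example; the source does not say how the pairs were enumerated, so the list length produced here need not match the count reported above.

```python
import itertools
import random
from typing import List, Tuple


def build_contingency_list(
    n_lines: int, n_random_n3: int, seed: int = 0
) -> List[Tuple[int, ...]]:
    """Full N-1 and N-2 sets plus a fixed random sample of N-3 line outages."""
    line_ids = range(1, n_lines + 1)
    n1 = [(l,) for l in line_ids]
    n2 = list(itertools.combinations(line_ids, 2))
    # A fixed seed means the random N-3 set is drawn once and can be
    # reused unchanged across all uncertainty cases and methods.
    rng = random.Random(seed)
    max_n3 = n_lines * (n_lines - 1) * (n_lines - 2) // 6
    target = min(n_random_n3, max_n3)
    n3 = set()
    while len(n3) < target:
        n3.add(tuple(sorted(rng.sample(line_ids, 3))))
    return n1 + n2 + sorted(n3)
```

Reusing the same seed reproduces the same list, which is what keeps the comparison between cases and methods consistent.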


4.1 CASE A: Random loads 


Figure 3. The IEEE RTS96 system, the connections
between the 24 nodes, the lumped generators (32 generators)
and the location of the aggregated loads (17 arrows).

The first uncertainty case A assumes that uncertainty
affects the 17 loads in the system due to
inherent variability. The analysts lack better information
regarding the variability affecting the load
at each node; thus, the uncertainty in $L_i$ is simply
modelled by assuming uniform distributions. The
distribution parameters have been selected to cover
a range of values around the design loads,
based on expert opinion:

$$L_i \sim U(0.5\,L_{d,i},\ 1.2\,L_{d,i}), \qquad i = 1, \dots, N_L,$$

where $L_{d,i}$ is the design load of node $i$ as presented
in (Grigg et al. 1999) and the number of loads
is $N_L = 17$.
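
A minimal sketch of how load profiles could be drawn from these distributions is given below. The function name and seed are hypothetical, and any design loads passed in are for illustration only.

```python
import random
from typing import List, Sequence


def sample_load_profiles(
    design_loads: Sequence[float], n_samples: int, seed: int = 0
) -> List[List[float]]:
    """Draw each load independently from U(0.5 * L_d, 1.2 * L_d)."""
    rng = random.Random(seed)
    return [
        [rng.uniform(0.5 * ld, 1.2 * ld) for ld in design_loads]
        for _ in range(n_samples)
    ]
```

Each returned row is one load profile realisation, to be passed to the cascading model once per contingency in the list.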

Once the uncertainty sources are characterized,
a preliminary uncertainty analysis is performed.
The Monte Carlo method is used to propagate 5e4 samples
of the load profile. For each load sample, the
cascading failure model is solved 2444 times, once
for each contingency listed. The percentile of the
demand not served, the average number of failed
lines and the line outage frequencies are computed
for each load sample as described in Section 3.1.
The $p_{95}(\mathrm{DNS})$ results are summarised in
Figure 4. This figure presents a so-called cobweb
plot, also known as a parallel coordinates plot. It is
a simple and effective way of visualising random
input and output spaces in high dimensions. The
X-axis reports the input loads and the percentile
of the DNS (on the far right). The Y-axis reports
the normalized input and output realisations of
the Monte Carlo method. Each of the dark
dashed lines in the background corresponds to one
load profile realisation and the corresponding
$p_{95}(\mathrm{DNS})$ obtained through $N_c$ model evaluations. Red
solid lines are conditional samples, which highlight
only the load combinations leading to the highest
$p_{95}(\mathrm{DNS})$. It can be observed, and is later confirmed by
the Morris and Sobol analyses, that some of the loads
(e.g. in nodes 15 and 18) have a strong influence
on the extremes of the DNS. In particular, when
the power demanded in nodes 15 and 18 is small,
the risk of facing severe DNS scenarios increases.

Figure 4. The parallel plot of the Monte Carlo loads
and $p_{95}(\mathrm{DNS})$ realisations. The red solid lines are the
conditional samples which lead to the highest $p_{95}$, and the
black dashed lines in the background are all the MC realisations.

Morris and Sobol's indices have been computed
to better investigate which of
the uncertain factors are key drivers of the output
uncertainty. The Morris indices are obtained
by selecting 250 random input vector realisations
(saved from the MC) and computing incremental
ratios $\delta$ as described in Section 2.2. The Sobol's
first-order coefficients are obtained using given-data
sensitivity approaches; see (Plischke,
Borgonovo, & Smith 2013) for further details. This
is a very convenient approach,
as the data from the MC run can be reused
with essentially no extra computational cost.
On the other hand, total Sobol's indices require
a higher computational cost, and in this work the Liu
and Owen method (R. Liu 2006) is used for their
computation.
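
The two kinds of index can be illustrated with the sketch below. It is a simplified stand-in, not the authors' implementation: the elementary effects use a plain one-at-a-time step from each base point, and the first-order Sobol index is estimated from existing samples by equal-count binning, a basic given-data estimator. All function names and the bin count are assumptions.

```python
import statistics
from typing import Callable, List, Sequence, Tuple


def elementary_effects(
    model: Callable[[Sequence[float]], float],
    base_points: Sequence[Sequence[float]],
    delta: float,
) -> List[Tuple[float, float]]:
    """Mean and standard deviation of the incremental ratios of each input."""
    stats = []
    for i in range(len(base_points[0])):
        effects = []
        for x in base_points:
            x_step = list(x)
            x_step[i] += delta  # perturb one input at a time
            effects.append((model(x_step) - model(x)) / delta)
        stats.append((statistics.mean(effects), statistics.pstdev(effects)))
    return stats


def first_order_sobol_given_data(
    x_column: Sequence[float], y: Sequence[float], n_bins: int = 20
) -> float:
    """Estimate S_i = Var(E[Y | X_i]) / Var(Y) from existing samples."""
    order = sorted(range(len(y)), key=lambda k: x_column[k])
    y_mean = statistics.mean(y)
    y_var = statistics.pvariance(y)
    size = len(y) // n_bins
    between = 0.0
    for b in range(n_bins):
        start = b * size
        stop = len(y) if b == n_bins - 1 else start + size
        idx = order[start:stop]
        # Conditional mean of the output within this slice of X_i.
        bin_mean = statistics.mean([y[k] for k in idx])
        between += len(idx) * (bin_mean - y_mean) ** 2
    return between / len(y) / y_var
```

The mean elementary effect keeps its sign, so it can show whether an output rises or falls with an input, whereas the variance-based index is non-negative by construction.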

The results for the DNS percentile and the
average total number of failed lines are presented
and compared in Table 1. The Morris statistics
and Sobol's main and total effect indices are also
presented graphically in the $\mu$-$\sigma$ plot in Figure 5
and in Figure 6, respectively. Both methods identify
$L_{15}$ and $L_{18}$ as the most influential factors for
the DNS and the average number of line failures. Less
relevant, but not to be neglected, is the effect of the
loads in nodes 8, 19 and 16. The Morris analysis has the
advantage of revealing an inverse relation between
$L_{15}$, $L_{18}$, $L_{19}$ and the outputs (see Figure 5), which
could not be revealed using Sobol's indices alone.
On the other hand, an increment in load 8 leads to
a higher risk of extreme DNS.

This result can be explained by looking at the generators'
production profile, which is obtained by solving
the pre-contingency DC-SCOPF with the objective
of minimizing generation costs. The generators in
nodes 18, 22 are associated with lower generation
costs. This leads to the maximum exploitation of
their production capacity, independently of the
load profile realisation. Consequently, when electrical
power is consumed in loco (e.g. by the loads close
to these generators, as in nodes 15 and 18), less power
will be flowing from the 'northern' area to the
'southern' area of the network. On the other hand,
if less power is demanded in, for instance, nodes
18 and 15 (or more power in node 8), this increases the
risk of higher loads on lines such as 24, 25 and 26,
which connect the upper part of the grid with
the lower part, and with it the risk of facing more
severe post-contingency scenarios.

Table 1. Sobol's main ($S_i$) and total ($S_{Ti}$) effect indices and
mean $\mu(\delta)$ and standard deviation $\sigma(\delta)$ of the elementary
effects for the uncertainty case A, for the DNS percentile and
average total failed lines outputs.

          p95(DNS)                          mu(N_f)
          Sobol         Morris              Sobol         Morris
Load      S_i    S_Ti   mu      sigma       S_i    S_Ti   mu     sigma
L_1       0.01   0.00    0.01   0.03        0.01   0.00   -0.1   0.4
L_2       0.01   0.00    0.01   0.03        0.00   0.00   -0.1   0.4
L_3       0.02   0.02    0.02   0.08        0.02   0.01   -0.5   0.7
L_4       0.01   0.00    0.01   0.03        0.01   0.00   -0.1   0.3
L_5       0.01   0.00    0.02   0.03        0.01   0.00    0.0   0.3
L_6       0.01   0.02    0.03   0.05        0.01   0.00    0.0   0.4
L_7       0.01   0.03   -0.01   0.06        0.01   0.01    0.0   0.7
L_8       0.04   0.03    0.06   0.08        0.03   0.05    0.4   0.7
L_9       0.01   0.02    0.02   0.05        0.01   0.01   -0.2   0.7
L_10      0.02   0.03    0.04   0.06        0.01   0.01    0.0   0.5
L_13      0.01   0.04   -0.01   0.09        0.01   0.03   -0.4   0.9
L_14      0.02   0.02    0.02   0.09        0.01   0.01    0.0   0.7
L_15      0.29   0.40   -0.18   0.20        0.33   0.33   -2.1   1.7
L_16      0.04   0.06   -0.05   0.09        0.03   0.02   -0.5   0.6
L_18      0.39   0.44   -0.20   0.20        0.47   0.54   -2.7   1.9
L_19      0.06   0.12   -0.08   0.13        0.03   0.05   -0.6   0.8
L_20      0.03   0.06   -0.04   0.08        0.02   0.02   -0.5   0.7

Figure 5. The Morris diagram for uncertainty case A and for the DNS
percentile output. The mean and standard deviation of the EEs are
reported on the X and Y axis, respectively.


4.2 CASE B: Random loads and generator costs 


The second uncertainty case B extends case A by
accounting for generator cost uncertainties. The
generation cost variability is characterised by uniform
probability distributions as follows:

$$C_{g,i} \sim U(0.9,\ 1.1), \qquad i = 1, \dots, N_g,$$

where $C_{g,i}$ is the cost of the generating unit $i$ and
the number of generators $N_g$ is equal to 32. By
assuming costs $C_{g,i}$ distributed uniformly between
0.9 and 1.1, the economic viability of the generators
changes drastically compared to case A.
This leads to a higher variability in the economic
dispatch, i.e. generators in nodes from 18 to 22 will
sometimes produce less than their maximum capacity.
This case study shows the applicability of the
method to larger input spaces and larger power
grids. Furthermore, it shows the impact of different
generation profiles, in combination with load
demands, on the cascading failures.
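
The effect of cost uncertainty on the dispatch can be illustrated with a toy merit-order dispatch. This sketch ignores all network and security constraints, so it is not the DC-SCOPF used in the paper; the function names and the numerical values in the usage note are hypothetical.

```python
import random
from typing import List, Sequence


def sample_cost_multipliers(n_generators: int, seed: int = 0) -> List[float]:
    """Draw one cost realisation per generator from U(0.9, 1.1)."""
    rng = random.Random(seed)
    return [rng.uniform(0.9, 1.1) for _ in range(n_generators)]


def merit_order_dispatch(
    costs: Sequence[float], capacities: Sequence[float], demand: float
) -> List[float]:
    """Load the cheapest units first until the demand is covered."""
    dispatch = [0.0] * len(costs)
    remaining = demand
    for g in sorted(range(len(costs)), key=lambda k: costs[k]):
        dispatch[g] = min(capacities[g], remaining)
        remaining -= dispatch[g]
        if remaining <= 0:
            break
    return dispatch
```

With three 100 MW units, a 150 MW demand and costs of 10.0, 10.5 and 30.0, the first unit runs at full capacity. If the sampled costs become 11.0 and 9.45 for the first two units, the order swaps and the first unit is only partly loaded, which is the kind of dispatch variability introduced in case B.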

Similarly to the uncertainty case A, a Monte
Carlo uncertainty propagation is performed and
the Sobol's $S_i$ indices and Morris $\mu(\delta)$ and $\sigma(\delta)$ have
been calculated. The 5 most influential factors
(among the 17 loads and 32 generator costs) affecting
the 95th percentile of the DNS are reported in
Table 2. Multiple generators can be found at the
same bus and, to simplify the notation, the relevant
costs are presented using the symbol $G_k(j)$, where $j$
is the machine reference number within the bus $k$
where the generator is installed. Differently from
case A, the load in node 8 emerged as the most relevant
factor for the DNS percentile.

Table 2. Comparison between the top 5 most influential
factors according to the Sobol's main index and Morris
mean and standard deviation. The output considered
is the DNS percentile.

rank   S_i      mu(delta)   sigma(delta)
1      L_8      L_?         G_?(1)
2      L_?      G_?(1)      G_?(2)
3      G_?(1)   L_?         L_?
4      G_?(1)   ?           G_?(1)
5      L_18     L_18        L_?

Figure 6. The Sobol's main and total effects obtained
for the uncertainty case A and for the DNS percentile
output. (Legend: main indices $S_i$, total indices $S_{Ti}$; X-axis: input factors.)

Uncertainty in the loads and generator costs
has been propagated to the line outage frequency
indicator $P_l$. The resulting MC realisations are
displayed using a box plot in Figure 7. The X-axis
shows the line identification number and the
Y-axis presents the $P_l$ values (red markers). Each
box indicates the median (the central mark), and
the bottom and top edges of the box indicate the
25th and 75th percentiles, respectively. It can be
observed that the line connecting node 7 to node 8
has the highest failure frequency and that lines
in the lower voltage area of the grid (ID from 1 to
13) are more prone to failure. This result is probably
due to the lower thermal limit (175 MW) and to
the specific combination of grid topology, design
load demanded in nodes 7 and 8 (125 and 171 MW,
respectively) and maximum lumped capacity of the
generators in node 7 (300 MW). Thanks to the sensitivity
analysis, it has been possible to clarify which
factors are responsible for this peculiar behaviour,
i.e. to better understand which variables
contribute the most to $P_{7-8}$.

Main effect sensitivity indices have been computed
for each line outage frequency $P_l$ to reveal which of the
input factors most affects its variability.
Results are presented graphically with a bar plot in
Figure 9 and reported in Table 3. Table 3 presents
only the factors leading to relatively high $S_i$, i.e.
greater than 0.08, and the corresponding components.
It can be observed that the variability in the
line 7-8 outage frequency is mainly affected by
uncertainty in node 7 (generators and load). On
the other hand, uncertainty in $L_8$ does not much
affect the variance of $P_{7-8}$, but it is the most relevant
factor for $P_{8-9}$ and $P_{8-10}$.

Figure 7. The box plot of the $P_l$ realisations corresponding
to different load and generation cost samples.
(Annotated lines: 7-8, 8-9, 8-10, 15-24, 16-17, 17-23.)

Figure 8. The tornado diagram presenting the mean of
the elementary effects for the uncertain factors considered
in case B. (Panels: loads; generator costs.)

Figure 9. The $S_i$ indices calculated for the 49 input factors
and for the $P_l$ outputs. The factors from 1 to 17 are loads
at different locations and the last 32 are the generator costs.
(Axes: factor ID, line ID; Sobol main effect.)

Table 3. The most influential factors for the line failure
probability, i.e. factors leading to $S_i > 0.08$.

From node   To node   Factors
7           8         L_7, G_7(1), G_7(2)
8           9         L_8
8           10        L_8
15          21        L_18, G_18(1), G_?(1)
15          21        L_?, G_18(1), G_?(1)
16          17        L_?
17          18        L_18, G_18(1), G_?(1)
17          22        L_18, G_18(1), G_?(1)
21          22        L_18, G_18(1), G_?(1)


5 DISCUSSION AND CONCLUSIONS 


In this paper, the sensitivity of a cascading failures 
model for power grids has been analysed. Variance- 
based global sensitivity analysis indices, i.e. Sobol’s 
indices, have been computed to reveal which among 
the uncertainty sources is affecting the most the 
variances of the cascading failure model output. 
The Morris screening indices are also obtained and 
compared to variance based indices to improve 
confidence in the results and better understand 
dependencies between output and factors. 

Different system-level and component-level 
indicators have been evaluated using the cascading 
model. The selected metrics were the 95th percentile
of the DNS, the average total number of lines
failed and the frequency of line failure for each
line. The IEEE RTS96 power grid has been selected 
as a representative case study and used to test the 
applicability of the methods to a real-world system. 
Two uncertainty cases (Case A and Case B) have 
been investigated, which were characterised by an 
increasing dimensionality of the aleatory space. 

In Case A, only load variability has been
accounted for, and the results suggested that the
uncertainties in the loads in nodes 15 and 18 are the
major contributors to the extremes of the demand
not served. A similar result is obtained for the
average total number of lines failed. The Morris method had
the advantage of showing a negative relationship
between the DNS and the loads in nodes 15 and 18.
In reality, prices are indeed affected by uncertainty,
so a sensitivity analysis that assumes fixed prices
(and therefore fixed generator dispatch) might be
misleading in identifying critical components in the
power grid. Thus, in the uncertainty Case B, the
variability of both the generator costs and the loads
is considered. The new economic setting
changed the underlying behaviour of the network
and, consequently, of the cascading evaluation process.
The Sobol and Morris analyses are fairly consistent
in pointing out which among the loads and
generator costs are the most relevant for the system
output. The results are quite different compared to
Case A, due to the difference in the economic setting
of the generators. In addition, the sensitivity of the
line outage frequency has been computed.

This analysis was performed to investigate in more
detail some cascading-relevant relationships
between input loads, generator costs and line
failures. From an engineering perspective, at least
2 results can be highlighted which are helpful
in a practical context:


• The vulnerable lines (i.e. prone to failure) and
the most relevant factors affecting $P_l$ are identified
(using sensitivity indices). This information
can be helpful to support reliability-related decisions,
for instance, in deciding whether it is
better to replace the line with one having a higher
capacity (i.e. if $P_l$ is high and is similarly affected
by all the input factors), or whether it may be more
useful to intervene on the factors affecting $P_l$ (i.e. if
$P_l$ is high and sensitive to just a few factors);

• When the uncertainty in the loads is identified as
highly relevant for a system-level indicator, it is
advisable to consider actions such as the allocation
of distributed generators or the adoption of peak-shaving
(load variance reduction) control methods. This
can be beneficial to reduce the uncertainty in the
reliability performance of the network (reducing
its variance).


The framework proved to be flexible and computationally
quite cheap, which is a requirement for
its application to more realistic large-scale power
networks. This will be the focus of future analysis.
ABSTRACT: The management of critical infrastructures is increasingly challenged by the highly
complex and interdependent nature of their operations. The need for methods and tools that better address
these challenges has often been argued in the literature. Improved access to, and accuracy of, information on
local operational conditions is referred to as one of the assets to be pursued, towards better coping with
complexity. Technology currently available facilitates access to a wide range and amount of data and
information. However, putting such data and information to use as an effective support to decision making
remains poorly addressed. Project RESOLUTE proposes an approach to the management, aggregation
and processing of “big data” towards an enhanced adaptive capacity of stakeholders within critical
infrastructures. The approach is based on the understanding of functional interdependencies between
stakeholders and the use of that understanding for the tailoring of existing data to various operational
contexts and conditions. A model of critical infrastructures was developed using FRAM. This model
is then used as a basis for the development of an IT platform that supports coordination and cooperation
between stakeholders.


1 INTRODUCTION 


The increasingly interdependent nature that spans
across all industry sectors poses major challenges.
While most management practices remain
grounded on the control of internal processes, an
increasing number of threats emanate from beyond
the formal boundaries of organisations, where
control-based approaches reveal many shortfalls.
Safety barriers that are continuously raised against
threats identified based on hindsight often show
little effectiveness in the face of the variability and
uncertainty that emerge from an increasingly
interdependent world.

Risk management remains strongly attached to 
the assumption that phenomena can be sufficiently 
known and described, so as to fully master the like- 
lihood and means through which undesired events 
may occur. However, there is a growing awareness 
that managing and coping with uncertainty is an 
unavoidable consequence of the interdependent 
world. Investing solely in the elimination of uncertainty
through hindsight-based statistical and
predictive approaches is no longer sufficient. The
maturity of methods and tools to effectively man- 
age uncertainty remains unsatisfactory in view of 
industry needs and desires of communities at large. 


This paper presents an approach to the manage- 
ment of uncertainty within critical infrastructures 
based on the experience of project RESOLUTE. 
Rather than attempting to eliminate uncertainty 
through knowledge on how and when “things might 
happen”, the focus is set on understanding its
sources as a result of highly interdependent and tightly
coupled operations. Project methodology is outlined
and the key aspects of interdependency and uncer- 
tainty within critical infrastructures are highlighted. 
Main results are then presented and, given that it 
is an ongoing project, the paper concludes with a 
discussion of foreseeable achievements in terms of 
potential enhanced adaptive capacities and ability 
to cope with uncertainty for critical infrastructure 
stakeholders. As argued by Woods (2015), enhanc- 
ing such adaptive capacities can be placed at the 
core of systems resilience, in particular within the 
scope of the resilience engineering approach. 


2 COMPLEXITY, UNCERTAINTY AND 
INTERDEPENDENCY 


The notion of complexity is widely used and
yet remains difficult to define with precision.
Many definitions of the concept underline that
something complex challenges the ability to
describe and also to predict in terms of its states or
behaviours, and of the outcome of any actions taken
or changes introduced to it (Hollnagel, 2012a).
This feature of complexity is particularly relevant
for the scope of this discussion, as it can be placed
at the source of uncertainty. According to Grote
(2009), uncertainty may concern the probability
of an event (state uncertainty), a lack of information
on the outcomes of an event and the underlying
cause-effect relationships (effect uncertainty),
or a lack of information about response options
and their consequences (response uncertainty).
RESOLUTE proposes to address the need for
understanding uncertainty through the modelling
of functional interdependencies. This is addressed
later in this paper.

Another important aspect of complexity is the 
lack of linear behaviours. As pointed out by Flach
(2012), things like a mechanical clock can exhibit
a wide range of possible states and yet, because the
relations between its parts assume a linear nature,
these possible states can be predicted. This, however,
would fit the definition of something that is
complicated rather than something complex.
Complexity is related to non-linearity. Describing and
predicting possible states and outcomes cannot 
be achieved by the knowledge of individual vari- 
ables or components, and of their relations. Within 
complex environments, a given action may lead to 
many different outcomes and a given state can be 
achieved through many different combinations of 
variables or parts (Flach, 2012). 

Interdependency is often at the source of non-linear
behaviours. Variables or components can
interact and combine in many different ways, much 
beyond simple cause-effect relations. Interdepend- 
ency has been studied for a number of decades. 
James Thompson (Thompson, 1967) had antici- 
pated in 1967 the critical role that this phenom- 
enon would play across most social endeavours. 

The pursuit of multiple goals and the need to 
secure the access to a wider diversity of resources 
to respond to this pursuit generate self-reinforced 
cycles. The more interdependencies are generated 
through coalitions and exchanges of resources, the 
higher the number and diversity of goals that are 
put into play. The conflicting priorities that must 
be negotiated, lead to the continuous pursuit of 
more convenient options, which in return can be 
placed at the source of a search for coalitions that 
may potentially offer higher benefits. Thus, multiple 
conflicting goals produce continuous adjustments 
in existing interdependent relations, and also gen- 
erate the potential for the expansion towards new 
interdependencies. As interdependencies increase in
number, diversity and dynamics, complexity emerges
and, with it, uncertainty increases (Flach, 2012).


3 THE RESOLUTE APPROACH 


Complexity and its inherent interdependent nature
pose a challenge to the management of critical
infrastructures. As stated by Flach (2012), “clas- 
sical hierarchical or servomechanism-type control 
systems are inadequate as a basis for dealing with 
the unanticipated variability endemic to complex 
work domains”. 


3.1 The context of critical infrastructures 


Council Directive 2008/114/EC of the
European Union defines critical infrastructure
as an “asset, system or part thereof which is essential
for the maintenance of vital societal functions,
health, safety, security, economic or social well-being
of people, and the disruption or destruction
of which would have a significant impact”.

Critical infrastructures are particularly prone 
to the emergence of complexity and uncertainty 
related issues. On the one hand, they operate at the 
intersection of a wide range of stakeholders, with 
diverse organizational and legal status (e.g. public
service, non-profit and for profit), and with
multiple operational purposes (e.g. maintenance,
oversight and control, support and service delivery).
On the other hand, critical infrastructures
tend to operate within many different scales, both 
geographically and temporally. 

In addition, the critical nature of the service 
they provide exposes these infrastructures and their 
operation to a particularly strong public and politi- 
cal scrutiny. This kind of exposure often generates 
important pressures towards heightened efficiency, 
reliability and safety of operations. This means 
that critical infrastructures are expected to gener- 
ate the capacity to adapt to continuously changing 
environments under highly complex and uncertain 
operational conditions. Generating such adaptive
capacities is at the core of resilience (Woods, 2015),
to which project RESOLUTE aims to contribute.
The challenge then becomes how to integrate
flexibility as a requirement for adaptive capacities, 
whilst ensuring the degrees of coordination and 
alignment between multiple goals necessary to 
respond to efficiency and safety demands. 


3.2 RESOLUTE methodology 


Project methodology was based on the functional
modelling of critical infrastructures as sociotechnical
systems. The modelling activities were carried
out with the Functional Resonance
Analysis Method, FRAM (Hollnagel, 2012b). FRAM enables
modelling activities through the description of
human, technical and organisational elements as
functions within an interdependent system. FRAM
is essentially a tool intended for the modelling of
“real work”. It has been mainly applied to the 
modelling of operations and work on what is often 
considered the “sharp end” where human activity 
and processes tend to be more tangible. Within the 
RESOLUTE approach, the aim was to develop a 
generic functional model at critical infrastructure 
level. The contrast with what is normally considered 
the “sharp end” is that, at critical infrastructure 
level, interactions between human, organisational 
and technical elements tend to be less direct and 
may become more difficult to perceive and describe. 
Interviews with subject matter experts were used 
to develop the necessary insight on operations that 
mainly consisted of decision-making processes dis- 
tributed across multiple infrastructure stakehold- 
ers. Given the broad system scale that was needed 
in this case (a critical infrastructure), the functions 
described are often imprecise in terms of their 
nature (human, technical or organisational). The 
insight obtained through the contact with experts 
was used to ascertain if each function was described 
with an acceptable level of clarity and precision. 

Functions were identified on the basis of “what 
(actions or processes) should be carried out in 
order for a given critical infrastructure to achieve 
its operational purpose (the delivery of a given 
service)”. Functions are then described according 
to six different aspects: input, output, resources, 
preconditions, time and controls. In order to systematise
the gathering of data and information for
the modelling activity, a set of triggering questions
was used. These are shown in Table 1.

Through the identification of these six aspects for 
each of the functions described, the potential cou- 
plings between functions can be identified. Except 
for the function output, the remaining five aspects 
can be considered as inputs to the function that 
rely on couplings with “upstream” functions. These 
potential couplings may then become effective, as 
system operation is instantiated in the model. These 
couplings represent the “functional points” at which 
critical infrastructure stakeholders generate inter- 
dependencies to ensure process flows and overall 
conditions necessary to service delivery. This pro- 
vided an important understanding of various criti- 
cal operational aspects at system level, namely how 
key resources are exchanged amongst stakeholders, 
or operational control is carried out, among others. 
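
The identification of potential couplings can be sketched as a simple data structure, shown below. This is an illustrative reading of the description above, not the RESOLUTE implementation: the class and function names, and the matching of outputs to aspects by identical labels, are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# The five aspects that can receive the output of an upstream function.
ASPECTS = ("input", "precondition", "resource", "control", "time")


@dataclass
class FramFunction:
    name: str
    outputs: Set[str] = field(default_factory=set)
    aspects: Dict[str, Set[str]] = field(default_factory=dict)


def potential_couplings(
    functions: List[FramFunction],
) -> List[Tuple[str, str, str, str]]:
    """List (upstream, item, aspect, downstream) wherever an output of one
    function matches an aspect of another function."""
    couplings = []
    for upstream in functions:
        for downstream in functions:
            if upstream is downstream:
                continue
            for aspect in ASPECTS:
                needed = downstream.aspects.get(aspect, set())
                for item in sorted(upstream.outputs & needed):
                    couplings.append(
                        (upstream.name, item, aspect, downstream.name)
                    )
    return couplings
```

Each tuple marks one functional point at which two stakeholders depend on each other; whether a potential coupling becomes effective depends on the instantiation of the model.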

The main focus of RESOLUTE was urban 
transport systems as a critical infrastructure. The 
generic FRAM model developed was interpreted 
in the context of urban transport systems, in order 
to produce a more concrete understanding of the 
different stakeholders and their interdependencies. 

RESOLUTE proposes to urban transport stakeholders
a cooperation and coordination platform,
aiming at generating adaptive capacities that address
the specific needs to cope with complexity and the
uncertainty that it generates. The platform was
designated as CRAMSS (Collaborative Resilience
Assessment and Management Support System),
and consisted mainly of the use of “big data” to
enhance the ability of sense-making in various
decision-making processes within various management
and operational levels of critical infrastructures.

Table 1. Questions supporting the description of
functions.

Aspect         Triggering questions
Input          • What should start the function?
               • What should the function act on or change?
Output         • What should be the output or results of the function?
               • Should you inform anyone?
               • Do you have to collect or record/report anything? If so, where?
               • Who needs the output? Who will use what is produced?
Precondition   • What should be in place so that you can complete the
                 function normally?
Resource       • What resources do you need to perform the function, such
                 as people, equipment, IT, power, buildings, etc.?
Control        • Should there be any formal procedures or instructions
                 controlling the function?
               • Should there be people, such as supervisors, controlling
                 the function?
               • Should there be any priorities?
               • Should there be specific constraints?
Time           • Should there be any time related to the function?
               • Should there be a certain time at which you have to
                 perform the function?


3.3 From FRAM to CRAMSS


The CRAMSS integrates real-time and historical 
data retrieved from a wide range of operational 
and environmental aspects, namely traffic moni- 
toring, weather conditions, public transport serv- 
ices, and water levels in river banks, among many 
others. 

Information is vital for every kind of decision-making
process. The lack of information is one
of the main sources of uncertainty (Grote, 2009).
Through the powerful technology currently available
in most decision scenarios, access to information
and data can be relatively easy and fast.
However, knowing what information is needed and where to
look for it within vast arrays of big data often poses
many challenges. Determining what is relevant and
needed at a given level and scenario of decision
requires understanding its sources of uncertainty,
based on which available data and information
may be processed and/or filtered.

The FRAM model previously developed was 
used as guidance for the tailoring of the data gath- 
ered and fed to the CRAMSS to the potential needs 
of different decision makers and actors, and across 
the different critical infrastructure stakeholders. In 
a first step, the functions described in the model 
were associated with different roles played by 
stakeholders in a given critical infrastructure. For 
instance, the function “monitor operations” was 
associated with the role and responsibilities attributed 
to operations control rooms such as those 
found in an urban transport network. The 
rationale of this approach is illustrated in Figure 1. 
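
The association between functions, roles and couplings can be sketched in code. The following is a minimal illustration only, not the RESOLUTE implementation: the class `FramFunction`, the helper `find_couplings`, the role names and the item "operational status" are all hypothetical. It assumes that a coupling exists whenever an output of one function is named in an aspect of another function.

```python
from dataclasses import dataclass, field


@dataclass
class FramFunction:
    """A FRAM function described by its six aspects (illustrative only)."""
    name: str
    role: str  # stakeholder role associated with the function (hypothetical)
    inputs: set = field(default_factory=set)
    outputs: set = field(default_factory=set)
    preconditions: set = field(default_factory=set)
    resources: set = field(default_factory=set)
    controls: set = field(default_factory=set)
    times: set = field(default_factory=set)


# Aspects through which a function can receive the output of another one.
DOWNSTREAM_ASPECTS = ("inputs", "preconditions", "resources", "controls", "times")


def find_couplings(functions):
    """List (producer, item, aspect, consumer) for every output of one
    function that another function names in one of its aspects."""
    couplings = []
    for producer in functions:
        for consumer in functions:
            if producer is consumer:
                continue
            for aspect in DOWNSTREAM_ASPECTS:
                shared = producer.outputs & getattr(consumer, aspect)
                for item in sorted(shared):
                    couplings.append((producer.name, item, aspect, consumer.name))
    return couplings


# Hypothetical example: the item and role names are invented for illustration.
monitor = FramFunction(
    name="Monitor operations",
    role="operations control room",
    outputs={"operational status"},
)
coordinate = FramFunction(
    name="Coordinate service delivery",
    role="service coordinator",
    preconditions={"operational status"},
)
couplings = find_couplings([monitor, coordinate])
```

Each coupling found in this way points to a stakeholder role that may need the corresponding data in its user profile.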


Figure 1. Structure of the analysis of function couplings. 


Table 2. Functions in the FRAM model. 


Function: Description 

Develop Strategic Plan: Define the long-term objectives and identify critical resource needs and allocation 
strategy. It also involves the definition of policies, according to which all stakeholders should be 
strategically aligned. This is expected to be carried out by policy makers and regulators, with the 
participation of key stakeholders. 

Manage financial affairs: Develop financial control and plan financial assets in accordance with the financial 
needs of the operation and financial obligations. Often maintenance or renewal investments required by critical 
infrastructures greatly exceed the scope of legal ownership or responsibility of a given stakeholder. Managing 
such large scale projects requires detailed coordination amongst stakeholders and frequently the oversight of 
regulators, in particular for the oversight of financial responsibilities. 

Perform Risk Assessment: Organisations carry out multiple risk assessment activities. Such activities tend to be 
developed within relatively limited scopes (e.g. specific tasks or projects, specific equipment) and limited to 
a given domain of risk (e.g. safety, security, financial, environmental). Assessment tools also tend to 
underestimate risk factors that are not formally recognised and described, in particular those emanating from 
beyond the formal boundaries of an organisation. This function should recognise the added value of integrating 
risk factors of diverse nature and of coordinating with multiple stakeholders, in particular along the supply 
chain of the service supplied by the critical infrastructure. 

Coordinate Service delivery: The delivery of critical infrastructure services requires thorough coordination 
amongst multiple stakeholders. Coordination activities should be carried out at various planning and operational 
stages of service delivery. This function contemplates operation related decision-making and activities that 
directly aim at keeping service delivery aligned with the strategic plan and the overall level of service in 
terms of quality and safety. 

Manage awareness & user behaviour: As providers of fundamental public services, critical infrastructures tend to 
be significantly exposed to individual and collective behaviours, in many cases not just of the service 
end-users, but also of the wider public. Recent technological developments, in particular in relation to ICTs, 
offer great potential for the enhancement of interactions with the public and the use of this potential towards 
an increased effectiveness in managing and deploying operational adjustments to various relevant events and 
circumstances. 

Develop/update procedures: The complete set of procedures forms a body of formal knowledge regarding management 
and operation requirements. They tend to reflect the structure of decision-making and production processes of a 
given organisation, so as to ensure coordination and shared understanding of operations and their goals. While 
this may be relatively well achieved at organisational level, amongst stakeholders of complex sociotechnical 
systems such as critical infrastructures, this is often very challenging. Procedures are essentially tools 
internal to organisations but, within the scope of the function here described, to the extent possible and in 
addition to safety and efficiency requirements, procedures should also reflect the need for synchronisation 
and coordination amongst stakeholders at various CI process stages and supply chain levels. This should 
follow from the scope of a regulator’s initiative, down to an active cooperation amongst stakeholders. 

Manage human resources: Managing human resources within an organisation involves dealing with multiple relations 
between in-house and sub-contracting staff. The contractual boundaries may often be misaligned with real 
operational demands, where tight and dynamic cooperation amongst team members is required, regardless of the 
fact that various stakeholders are likely to be formally involved. Beyond the management of staff contractual 
relations, rosters and other human related operational needs, this function takes into account the need to 
manage the dynamics of close operational cooperation amongst multiple stakeholders and the need to align such 
dynamic relations with the formally established and recognised responsibilities and accountability. 


Training staff: In line with the principles previously outlined in relation to the management of human 
resources, in addition to employee training needs and their quality control, this function must also account 
for the need to provide and control the quality of training of staff that, while working under other 
stakeholders, may operate on a more or less continuous basis within the premises of a given organisation such 
as an infrastructure owner or manager. This relates to initiatives such as cross training and shared expertise 
programmes. 

Manage ICT resources: Provide/maintain/update/develop/repair information and communications services to support 
critical infrastructure operation and management. Information systems may be owned and managed by a given 
organisation but may be strongly reliant on the operation and input from multiple stakeholders. ICT often 
gives shape to many interdependencies and the management of such resources should recognise this critical 
system role, namely by providing an overall system understanding of how these resources are made 
available and used by stakeholders in view of the overall service delivery (system operational purposes). 

Maintain physical/cyber infrastructure: Maintenance activities require increasingly skilled and specialised 
staff and technical resources. Because of their nature, maintenance services are often sub-contracted and 
providers become stakeholders with tight couplings with operational requirements. In addition to the planning, 
delivery and testing of maintenance activities, this function also incorporates the need to continuously assess 
the integration between in-house and sub-contracted maintenance resources, in view of process and technology 
changes, and overall operational environment demands. 

Monitor Safety and Security: Integrated risk management has been recognised as a potentially valuable approach 
but has proven to present many managerial and operational challenges. As two of the fundamental risk domains for 
the operation of all critical infrastructures, safety and security should be managed in the scope of an 
integrating function, aiming to maximise efficiency and effectiveness of assessment and control measures and to 
integrate multiple interdependent risk factors that emanate from both within and beyond organisational 
boundaries. 

Monitor Operations: The monitoring of service delivery performance is often solely based on lagging indicators 
and stakeholders tend to each assess their performance in reference to internal targets to be met, which may 
not necessarily reflect overall needs of the service delivered at critical infrastructure level. This function 
envisages the development of shared performance assessment practices amongst critical infrastructure 
stakeholders, mainly by integrating stakeholders’ targets with overall service delivery needs. This becomes 
fundamental to generate overall system performance understanding. 

Monitor Resource availability: Complex sociotechnical systems such as critical infrastructures rely on 
increasingly diversified and dynamic supply chains. This function focuses on generating an overall coordination 
of resource planning and deployment, taking into account the need to align multiple stakeholder needs with CI 
service delivery. This requires an understanding of resource flows and their main variability trends. 

Monitor user generated feedback: Current technology provides the means to monitor service usage on a wide range 
of parameters and produce, in real time, fundamental support to the deployment of operational adjustments. This 
function deals with the need for an integrated approach to the assessment of user generated feedback, mainly by 
placing this data and information in the context of operational monitoring. This requires coordinated action 
amongst multiple stakeholders under a shared framework. 

Coordinate emergency actions: The operation of critical infrastructures relies on close cooperation amongst 
multiple stakeholders. Emergency response scenarios pose additional challenges, mainly by adding significant 
time pressure and high uncertainty (and therefore heightened risk) to this already complex operational 
environment. Coping with such challenges places even greater emphasis on the coordination needs and increased 
pressure on already limited resources. This function deals mainly with the requirements of an efficient 
distributed decision-making process under emergency response scenarios, namely the availability of accurate and 
timely information and data, and coordinated action of multiple and diverse stakeholders, often under 
unplanned and unforeseen circumstances. 

Restore/Repair operations: Restoring operational capacities after significant damage requires much more than the 
re-allocation of system resources, namely those foreseen under maintenance and renewals projects. Dedicated 
teams are normally put in place to design, plan and execute specific projects, which in the case of critical 
infrastructures, in addition to the need to maintain minimum operation capabilities, are also likely to require 
the containment of impacts on other interdependent infrastructures. 

Provide adaptation & improvement insights: Within the scope of resilience, the operation of complex 
sociotechnical systems is challenged by two opposing needs: sustaining adaptive capacities to continuously 
changing operational conditions (flexibility) and the continued and coherent pursuit of goals within their own 
timescales (rigidity/robustness). For instance, operation and production goals may be reassessed on an annual 
basis, while strategic goals may be addressed on a five year basis. Without compromising the consistency and 
feasibility of planned operations within each of those timescales, a given degree of flexibility must be 
ensured, in order to sustain the ability to respond to unforeseeable operational changes. This relates to 
aspects such as learning from ex-post event analysis, de-briefing, daily operations and providing insights for 
system capacities adaptation, keeping operations records, examining good practices and performing impact 
analysis of suggested actions, among others. 

Collect event information: Collecting in-house and external event data as good practices and/or historical data 
(archiving). From the perspective of resilience, this should not only address the occurrence of undesired events 
but most of all the understanding of factors that under highly variable circumstances become critical for 
achieving successful performance. 
4 PROJECT ACHIEVEMENTS 


The couplings described in the FRAM model were 
used to relate types of data to different user pro- 
files for the CRAMSS. These user profiles were 
defined according to different roles and activities 
carried out by key stakeholders within the delivery 
of urban transport services. Emphasis was placed 
on conveying to users across multiple stakehold- 
ers, information on the conditions or operational 
status of other stakeholders with which relevant 
functional couplings had been identified through 
the FRAM model. The purpose was to create 
conditions for an improved cooperation and coor- 
dination between interdependent stakeholders. 
Providing information and data with a higher 
context- and time-dependent relevancy is expected to 
reduce uncertainty associated with decision-making 
and enhance the capacity to adapt to the continuously 
changing conditions under which such decisions 
are made, and their resulting actions taken. 

For reasons of practicality, the complete FRAM 
model cannot be reproduced here in such a way 
that functions and potential couplings could be 
clearly presented. To illustrate the project outputs 
achieved so far based on the approach previously 
described, Table 2 provides a brief description of 
all the functions identified. 


5 CONCLUSIONS 


Beyond the provision of “big data” to support deci- 
sion-making, the approach here outlined addresses 
the need to tailor a wide range of available data 
to different scenario and operation needs, taking 
into account interdependencies that are critical for 
process flows and the delivery of services. 

The control of operations and processes remains 
strongly grounded on centralised mechanisms and 
on “top-to-bottom” procedural structures. One 
of the shortfalls of such approaches is the rigidity 
that they impose on processes and operations, 
which often fails to account for the need for local 
adaptive capacities to cope with uncertainty and 
continuously changing operational conditions (Woods, 
2015). The conflicts generated by different goals 
and local priorities, and across multiple organisational 
boundaries, frequently escape the prescribed 
work of procedures. The CRAMSS and the underlying 
FRAM model offer each stakeholder an 
enhanced adaptive capacity. In some cases, this 
capacity means the ability to adjust operations 
and conditions in anticipation of critical events. 
In others, it may only provide the means for the 
detection of events in hindsight, but nevertheless 
with an early warning and with increased precision 
and detail on the actual occurrence. Overall, these 
adaptive capacities and the ability to cope with 
complexity are at the core of system resilience. 
Throughout the remaining duration of project 
RESOLUTE, the CRAMSS and the various tools 
on which it is built will be tested under real scenarios 
(a first one in the city of Florence and a second 
one in the city of Athens). The outcome of these 
testing activities will be fed back to the FRAM 
model, aiming to improve the understanding of 
stakeholder interdependencies that it provides. 
ABSTRACT: The study presents an uncertainty and sensitivity analysis exercise of the security of gas 
supply model implemented in the probabilistic gas network simulator ProGasNet. The study showed the 
potential and usefulness of sensitivity study applied to the ProGasNet gas network model. It has not only 
identified the most important model parameters for which more attention should be paid during the esti- 
mation process of the input parameter values, but also provided useful insights into the simulation process 
by confirming, e.g., the heterogeneity of the network from a sensitivity analysis perspective. The model 
was run for four different scenarios representing different disruption situations. The study confirms the 
results already observed in other studies that some disruption scenarios affect only part of the network 
(e.g. specific countries) while other parts of the network are not affected. This clearly indicates heterogeneity 
of the network and the need for further infrastructure development. The most important parameters 
for each country are identified, and the peak demand value is the main parameter for the three countries. 


1 INTRODUCTION 


Natural gas networks can be viewed as complex tech- 
nical systems, which are exposed to various threats, 
for example technical failures, natural disasters and 
human/political uncertainties. As a consequence of 
these threats, a subset of network components might 
fail. In order to simulate the security of gas supply 
under component failures or attacks, the probabilistic 
gas network simulator ProGasNet is being developed. 

ProGasNet is able to model, in a single computer 
model, capacity and reliability constraints of a natural 
gas network. The physical model is based on 
graph theory (maximum flow algorithm), whereas 
component failures are simulated by the Monte 
Carlo method. The ProGasNet simulator has been 
developed with the primary purpose of quantifying 
the security of gas supply in probabilistic 
metrics (Kopustinskas et al., 2012). The simulator 
can be used to perform risk assessment of the 
gas transmission network as required by EC 
Regulation 2017/1938 (EU Regulation, 2017), which 
replaced Regulation 994/2010 (EU Regulation, 2010). In 
addition, the simulator can evaluate infrastructure 
development plans and the proposed PCI projects. 
ProGasNet has also been used in vulnerability assessment, 
which could be of interest to critical infrastructure 
protection policy makers (Praks et al., 2017). 


ProGasNet has been applied to gas transmission 
networks of several EU countries; however, 
geographical information cannot be disclosed. Various 
types of analysis have been performed (reliability, 
vulnerability and security of supply analysis), and 
the results have provided reliability of supply estimates 
under different disruption scenarios (Praks 
et al., 2015). The ProGasNet model also provides 
an indication of the worst network nodes in terms 
of security of supply and their numerical ranking. 
The model is a powerful means to compare and evaluate 
different supply options and new network development 
plans, and to analyse potential crisis situations. 

Despite the insightful information that can be 
obtained with this simulator regarding the security 
of gas supply within European regions, ProGasNet 
remains a strong approximation of the reality of 
gas supply in Europe (simplified assumptions, quasi 
steady-state modelling, lack of knowledge about 
some parameter values, etc.). Therefore, it is important 
to assess the impact of epistemic uncertainties 
in gas network models in order to point out those 
sources of uncertainty that have an impact on the 
model responses of interest. Identifying important 
sources of (epistemic) uncertainty makes it possible 
to guide further investigation for possible 
improvements of gas network models. 

In this study we analyse an EU regional network 
already used for other purposes, bottleneck 
analysis in particular (Kopustinskas et al., 2015). 
The focus of this study is to perform sensitivity 
analysis using the Polynomial Chaos Expansion 
(PCE) method (Sudret, 2008). 


2 PROGASNET SIMULATOR 


The ProGasNet (Probabilistic Gas Network) simulator 
is a software tool developed in-house and 
currently in use at the Joint Research Centre (JRC). 
ProGasNet is used for experimental simulation-based 
security of supply analyses of selected European 
gas transmission networks. Usually, one million 
Monte Carlo simulations are solved automatically 
within one hour on a single-core processor computer. 
The software tool can make use of a multi-core 
computer, as multiple simulations (Monte Carlo 
runs) can be evaluated independently. The gas 
transmission network model implemented in ProGasNet 
is based on the Maximum Flow (MF) algorithm, 
well known in graph theory (Deo, 2008). The 
algorithm was modified to reflect the property of 
gas transport that pressure is lower at points far 
from the supply source. This means that in case of 
disruption and lack of gas in the network, consumers 
that are closer to the source have better chances 
of being served than those that are located far 
away. Therefore, the maximum flow algorithm was 
modified to distribute gas first to the consuming 
nodes close to the supply source, with distance from 
the nearest source used as the priority criterion. 
Although it is not a model requirement, this 
assumption was chosen because the network operator’s 
decision on how to distribute available gas among 
the users is not known. No doubt, in a real crisis 
management situation, the network operator 
has more options on how to supply and distribute 
the available volumes of natural gas. 

The model does not run physical gas pressure 
and flow computations, but uses available results 
of physical models as a set of rules to define flow 
limitations. ProGasNet simulates network 
facility failures (pipeline ruptures, failures of 
compressor stations, unavailability of LNG terminals 
and storages) by the Monte Carlo method, and each 
resulting network configuration is evaluated by the 
modified maximum flow algorithm to determine the 
gas available at each consuming node of the network. 
The statistical results are obtained from 1 million 
Monte Carlo runs. 


3 STUDY CASE AND NETWORK DATA 


Figure 1. Topological layout of the gas network. 


Figure 1 shows the topology of the study case gas 
transmission network. It is based on a real regional 
network topology and data; however, the geographical 
location is not displayed. The transmission network 
GIS data are converted to a graph by creating 
nodes and links (edges). The nodes are: 


— Demand nodes (consumers connected to pressure 
reduction stations of the transmission 
network); 

— Compressor stations; 

— Supply nodes (storages, LNG terminals, import 
points at cross-borders). 


The network links (edges) are typically pipelines. 
The model explicitly considers two parallel pipe- 
lines as two components (double links between 
nodes). 

The basic network data are the same as already 
reported (Kopustinskas et al., 2015). Here we repli- 
cate only the most important network data. 

The demand nodes are characterised by a daily 
demand value (Table 1). This value is a peak 
demand value, but it could also be an average winter 
or summer consumption value, depending on the 
purpose of the study. 

Table 2 shows maximum capacities and type 
(pipeline, UGS or LNG) of input supply nodes. 
In case of Underground Gas Storages (UGS), also 
the output values of not fully loaded storages can 
be used. 

The total maximum supply capacity is 77.5 mcm 
per day. The total network peak demand is 
45.8 mcm/d, so the network has a certain degree of 
spare capacity to compensate for supply disruptions. 
Experience from previous analyses shows that, 
depending on where the disruption happens, 
internal bottlenecks in the network 
prevent full use of this spare capacity. 


Table 1. Network demand nodes, in millions of cubic 
meters per day (mcm/d). 

Node  Demand  Node  Demand 
4     0.1     34    0.5 
5     3.2     35    0.1 
6     0.1     36    4.2 
7     0.3     37    1.3 
9     0.1     39    0.3 
10    1       41    0.6 
13    0.5     42    0.6 
17    0.1     43    0.2 
18    8.5     44    0.7 
20    0.6     45    1.3 
25    0.5     47    0.5 
26    0.8     48    1.8 
27    3       49    0.2 
28    6       51    7 
30    0.5     52    0.6 
33    0.5     53    0.1 


Table 2. Maximum capacity of supply sources. 

Node  Type      Capacity, mcm/day 
2     Pipeline  31 
10    LNG       10.2 
11    Pipeline  7 
19    UGS       30 
29    Pipeline  4 
38    Pipeline  12 

For each network component, failure data must 
be provided. The following components (nodes) 
are considered for failures: 


— Compressor Station (CS) failure: 2.5E-01/yr; 
— Underground storage failure: 1.0E-01/yr 

— LNG terminal failure: 1.5E-01/yr 

— Pipeline failure: 3.5E-05 /km/yr. 


The model uses annual failure data (probability 
of failure per year); however, when simulations are 
performed, a one-month interval is considered. It is 
assumed that the peak consumption in the 
network is constant during this one-month period. 
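
How the annual data are rescaled to the one-month interval is not stated. One common option, shown here purely as an assumption, is to treat each annual figure as a constant failure rate and use the exponential model p = 1 - exp(-rate x t). The function and variable names below are hypothetical; the rates are those listed above.

```python
import math
import random

# Annual failure data as listed above (treated here as constant rates per year).
ANNUAL_FAILURE_RATES = {
    "compressor_station": 2.5e-1,
    "underground_storage": 1.0e-1,
    "lng_terminal": 1.5e-1,
}
PIPELINE_RATE_PER_KM_YEAR = 3.5e-5


def pipeline_annual_rate(length_km):
    """Annual failure rate of a pipeline segment of the given length."""
    return PIPELINE_RATE_PER_KM_YEAR * length_km


def monthly_failure_probability(annual_rate, months=1.0):
    """Failure probability over `months`, assuming an exponential model."""
    return 1.0 - math.exp(-annual_rate * months / 12.0)


def sample_failed_components(components, rng):
    """One Monte Carlo draw: names of the components that fail this month.

    components: {name: annual failure rate}; rng: a random.Random instance.
    """
    return {
        name
        for name, rate in components.items()
        if rng.random() < monthly_failure_probability(rate)
    }


# Hypothetical usage with an invented 120 km pipeline.
components = dict(ANNUAL_FAILURE_RATES)
components["pipeline_120km"] = pipeline_annual_rate(120.0)
failed = sample_failed_components(components, random.Random(0))
```

For small rates the result is close to the simpler rate/12 approximation, so the choice matters mainly for the less reliable components.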

The compressor station node is normally modelled 
as working or failed, with each state determining 
the corresponding capacity of the outgoing 
pipelines. The capacity reduction due to compressor 
station failure is normally estimated by hydraulic 
model computations or expert evaluation. As a 
consequence of a CS failure, a capacity reduction 
of 20% is assumed for the inlet pipelines and for the 
outlet pipelines up to the next connection node. 
This assumption is based on physical 
flow models; however, it is not accurate in all cases, 
and multiple CS failures will have more severe 
effects on the network operation. 

The pipeline import points are not considered 
as failure-prone elements due to the lack of an upstream 
network model; however, they are modelled as on/off 
elements in the scenario analysis. 


4 DISRUPTION SCENARIOS 


In total four different gas disruption scenarios 
were analysed. The first scenario is the reference 
scenario during which the system operates under 
normal conditions without predefined disruptions. 


Scenario 1: All available sources operate. Scenario 
1 represents a basic scenario when all sources 
can be used for supply and the network compo- 
nents can fail randomly according to their reli- 
ability parameters. 

Scenario 2: Node 2 disruption. In this scenario, 
supply node 2 (the largest gas source in capac- 
ity) is not available. This scenario can test the 
system for the largest source disruption which 
can be classified as N-1 situation looking at the 
network globally. 

Scenario 3: Node 19 disruption. Scenario 3 runs 
the model with disconnected node 19, the sec- 
ond largest gas source. 

Scenario 4: Loss of two largest gas sources (Nodes 
2 & 19). The underground gas storage can be 
unavailable due to technical problems, failures 
or inability to fill it up during the summer period. 
Scenario 4 simulates a more challenging 
crisis in which both sources of the highest 
capacity are unavailable. It is used 
to demonstrate the vulnerability of the network 
when the largest and the second largest gas 
sources are lost simultaneously. The network 
can then be supplied only by source nodes 10 and 
11 and the small sources 29 and 38. 


5 SENSITIVITY ANALYSIS 
METHODOLOGY 


Uncertainty in model predictions stems from the 
lack of knowledge about the process of interest 
(in the present case, gas transport), from 
model simplification and from parameter values. For 
some of the model inputs, it is possible to refine 
our knowledge about their probable value (that 
is, to decrease their uncertainty), but such a task is 
time consuming. Our strategy is i) to assign large 
but likely uncertainty ranges to the inputs of the 
gas network model, ii) to check whether this yields 
large uncertainty in the model predictions of interest 
and iii) to identify those inputs that 
are mostly responsible for the predicted uncertainty. 
In step iii), it is expected that only a few of 
the uncertain inputs are identified as influential, 
which reduces the effort required during the subsequent 
input uncertainty refinement. 

Twenty model inputs of the gas transmission 
network model have been deemed as uncertain 
(Table 3). Some are assigned 
uniform distributions within plausible ranges, 
while others are assigned normal distributions. 
It is assumed that the model input values are 
independent of each other. These uncertainties 
reflect the experts’ belief before further investigation. 
The large prior uncertainties reflect the 
fact that the experts (here the modellers) have only 
vague knowledge about the uncertainty of the model inputs. 

This study has applied Polynomial Chaos 
Expansion (PCE) method to estimate sensitivity 
indices. The idea of PCE is to approximate vari- 
ance decomposition terms (Sobol, 1993) by multi- 
dimensional orthonormal polynomials. The rate 
of convergence of such an expansion of course 
depends on the regularity properties of g(x), the 
model response of interest. By exploiting the 
Parseval-Plancherel relationship, one obtains a 
standard variance decomposition equation. Therefore, 
we get (Wiener, 1938), 


$g(x) = \sum_{\alpha \in \mathbb{N}^n} a_\alpha \psi_\alpha(x)$ (1) 


where $\alpha = \alpha_1 \ldots \alpha_n$, with $\alpha_i \in \mathbb{N}$, is a multi-index 
indicating whether $\psi_\alpha(x)$ depends on $x_i$ ($\alpha_i > 0$) 
or not ($\alpha_i = 0$), and $a_\alpha$ is the polynomial coefficient 
associated with $\psi_\alpha(x)$, which is the so-called 
multivariate orthonormal polynomial chaos, written 


$\psi_\alpha(x) = \psi_{\alpha_1}(x_1) \times \cdots \times \psi_{\alpha_n}(x_n)$ 


$\psi_{\alpha_i}(x_i)$ being the $\alpha_i$-th degree univariate 
polynomial basis element ($\psi_0 = 1$). 

The expression of the univariate polynomial 
basis elements depends on the PDF assigned to 
the input variables. If $x_i \sim U(-1,1)$, $\psi_{\alpha_i}(x_i)$ is 
the Legendre polynomial of degree $\alpha_i$, while if 
$x_i \sim N(0,1)$, $\psi_{\alpha_i}(x_i)$ is the Hermite polynomial 
of degree $\alpha_i$. One can rely on the Wiener-Askey 
scheme to choose the appropriate polynomial family 
(Xiu & Karniadakis, 2002). 
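
As an illustration of these basis elements, the sketch below builds the normalised Legendre and (probabilists') Hermite polynomials with their three-term recurrences and checks orthonormality numerically. It is a minimal sketch with hypothetical function names, using a plain midpoint rule for the expectations; it is not taken from the paper.

```python
import math


def legendre(n, x):
    """Legendre polynomial of degree n, normalised so that E[psi^2] = 1
    for x ~ U(-1, 1)."""
    if n == 0:
        return 1.0
    p_prev, p = 1.0, x
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return math.sqrt(2 * n + 1) * p


def hermite(n, x):
    """Probabilists' Hermite polynomial of degree n, normalised so that
    E[psi^2] = 1 for x ~ N(0, 1)."""
    if n == 0:
        return 1.0
    h_prev, h = 1.0, x
    for k in range(1, n):
        h_prev, h = h, x * h - k * h_prev
    return h / math.sqrt(math.factorial(n))


FAMILIES = {"legendre": legendre, "hermite": hermite}


def psi(alpha, x, families):
    """Multivariate basis element: product of the univariate elements."""
    value = 1.0
    for degree, x_i, family in zip(alpha, x, families):
        value *= FAMILIES[family](degree, x_i)
    return value


def expectation_uniform(f, steps=20000):
    """E[f(X)] for X ~ U(-1, 1), midpoint rule."""
    h = 2.0 / steps
    return sum(f(-1.0 + (i + 0.5) * h) for i in range(steps)) * h / 2.0


def expectation_normal(f, half_width=10.0, steps=40000):
    """E[f(X)] for X ~ N(0, 1), midpoint rule on a truncated range."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * h
        total += f(x) * math.exp(-0.5 * x * x)
    return total * h / math.sqrt(2.0 * math.pi)


# Numerical check of E[psi_a * psi_b] = delta_ab for a few degrees.
norm_legendre = expectation_uniform(lambda x: legendre(2, x) ** 2)
cross_legendre = expectation_uniform(lambda x: legendre(1, x) * legendre(3, x))
norm_hermite = expectation_normal(lambda x: hermite(3, x) ** 2)
cross_hermite = expectation_normal(lambda x: hermite(2, x) * hermite(4, x))
```

The two `norm_*` values should be close to 1 and the two `cross_*` values close to 0, which is the orthonormality property used in the variance decomposition.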

Once a PCE expansion such as Eq. (1) is 
obtained, it is straightforward to prove that the 
total variance of g(x) is, 


Table 3. Input uncertainty distributions of the gas network model. 

No.  Parameter                                     Baseline value  Distribution (1)         Type (2)  Accuracy (3) 
X1   Capacity of gas source N2                     31.2            U(16, 31.2)              A         L 
X2   Capacity of gas source N19                    30.0            U(15, 30)                A         L 
X3   Capacity of gas source N10                    10.2            U(5, 10.2)               A         L 
X4   Capacity of gas source N11                    7.0             U(3.5, 7)                A         M 
X5   Compressor station capacity reduction factor  0.2             N(0.2, 0.05^2)           E         M 
X6   Peak demand of Country 1                      15.5            N(15.5, 0.75^2)          E         L 
X7   Peak demand of Country 2                      12.1            N(12.1, 0.6^2)           E         L 
X8   Peak demand of Country 3                      5.3             N(5.3, 0.4^2)            E         L 
X9   Failure frequency of LNG                      0.15            N(0.15, 0.015^2)         E         H 
X10  Failure frequency of storage facility         0.1             N(0.1, 0.01^2)           E         M 
X11  Failure of compressor station                 0.25            N(0.25, 0.025^2)         E         M 
X12  Failure frequency of a pipeline               3.5E-05         N(3.5E-05, (3.5E-06)^2)  E         M 
X13  Capacity of DN1000 pipeline                   30.6            N(30.6, 1.5^2)           E         H 
X14  Capacity of DN800 pipeline                    17.1            N(17.1, 0.86^2)          E         H 
X15  Capacity of DN700 pipeline                    12.1            N(12.1, 0.6^2)           E         H 
X16  Capacity of DN600 pipeline                    8.1             N(8.1, 0.40^2)           E         H 
X17  Capacity of DN500 pipeline                    5.1             N(5.1, 0.25^2)           E         H 
X18  Capacity of DN400 pipeline                    2.8             N(2.8, 0.14^2)           E         H 
X19  Capacity of DN350 pipeline                    2.0             N(2.0, 0.1^2)            E         H 
X20  Capacity of DN300 pipeline                    1.3             N(1.30, 0.065^2)         E         H 

(1) U = Uniform distribution, N(μ, σ^2) = Normal distribution with mean μ and variance σ^2. 
(2) E stands for epistemic uncertainty, as opposed to A, aleatory uncertainty. 
(3) The modellers’ belief regarding the assigned prior uncertainty: L = Low, M = Medium and H = High. 
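$V(g(x)) = D = \sum_{\alpha \in \mathbb{N}^n, \alpha \neq 0} a_\alpha^2$ 


by exploiting the orthonormality property of the 
polynomial basis elements, that is, 


$E(\psi_\alpha(x) \times \psi_\beta(x)) = \delta_{\alpha\beta}$ 


where $\delta_{\alpha\beta}$ is the Kronecker symbol. 
Therefore, it is possible to estimate the Sobol’ 
indices from the PCE coefficients as follows, 


$S_i = \frac{V(E(g(x) \mid x_i))}{V(g(x))} = \frac{\sum_{\alpha: \alpha_i > 0, \alpha_{\sim i} = 0} a_\alpha^2}{\sum_{\alpha \neq 0} a_\alpha^2}$ 

$ST_i = \frac{E(V(g(x) \mid x_{\sim i}))}{V(g(x))} = \frac{\sum_{\alpha: \alpha_i > 0} a_\alpha^2}{\sum_{\alpha \neq 0} a_\alpha^2}$ 

where $x_{\sim i}$ denotes all inputs except $x_i$. 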


Hence, the main task in the PCE approach to
variance-based sensitivity analysis is to estimate the
PCE coefficients. In this work this is achieved with
the Bayesian sparse PCE developed in (Shao et al.,
2017). With this approach, the variance decomposition
is obtained from one single Monte Carlo
sample of size N. This is computationally much
cheaper than other classical approaches
(Saltelli, 2002; Saltelli et al., 1999). The cost of the
analysis is a criterion to keep in mind with a
long-running model such as ProGasNet.
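
As an illustration of the post-processing step described above, the following minimal sketch computes first-order and total-order Sobol' indices from the coefficients of an orthonormal PCE. It is not the Bayesian sparse PCE of Shao et al. (which estimates the coefficients); the function name `sobol_from_pce`, the multi-index layout and the example coefficients are assumptions made only for this example.

```python
def sobol_from_pce(multi_indices, coeffs):
    """Return (first_order, total_order) Sobol' indices for each input.

    multi_indices: list of tuples alpha = (alpha_1, ..., alpha_d), one per basis term
    coeffs:        list of coefficients a_alpha of an orthonormal PCE
    """
    d = len(multi_indices[0])
    # Total variance: squared coefficients of all non-constant terms.
    variance = sum(a * a for alpha, a in zip(multi_indices, coeffs) if any(alpha))
    first, total = [0.0] * d, [0.0] * d
    for alpha, a in zip(multi_indices, coeffs):
        active = [i for i, degree in enumerate(alpha) if degree > 0]
        for i in active:
            total[i] += a * a / variance          # every term involving x_i
        if len(active) == 1:
            first[active[0]] += a * a / variance  # terms involving x_i only
    return first, total


# Hypothetical two-input expansion: constant, x1 only, x2 only, x1*x2 interaction.
indices = [(0, 0), (1, 0), (0, 1), (1, 1)]
coeffs = [2.0, 3.0, 1.0, 0.5]
S, ST = sobol_from_pce(indices, coeffs)
```

In this toy case the difference `ST[i] - S[i]` is the same for both inputs, because the only interaction term involves both of them.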


6 UNCERTAINTY & SENSITIVITY 
ANALYSIS: RESULTS AND DISCUSSION 


6.1 Model output selection 


Given the computational time required to run ProGasNet,
a sample of size 512 was considered. The
input sample was generated according to the
probability density function of each input variable (see
Table 3). The low-discrepancy LPτ sequences of
(Sobol et al. 1992) were employed. Four days of
calculation with ProGasNet were required to propagate
the input uncertainty into the model responses for
the four different scenarios under analysis.

It is possible to analyse the impact of the model
input uncertainty on different model responses.
For this purpose, while propagating the input
uncertainty with a Monte Carlo sample, one has to
save the different model responses of interest after
each model run. In the present work, we have considered
six model responses for each country, namely: S,
the mean volume of gas supply; P(S = 0), the probability
of no gas supply; P(S < 0.2D), the probability
of supplying less than 20% of the demand;
P(S < 0.5D), the probability of supplying less than
50% of the demand; P(S < 0.8D), the probability
of supplying less than 80% of the demand; and
P(S < D), the probability of supplying less than the
demand. Doing so for each country provided a set
of 3 x 6 = 18 different model responses.
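
The six responses listed above can be sketched for one country as follows. This is only an illustration: it assumes that a model run returns a list of simulated daily supply volumes, and the function name `supply_responses` and the example numbers are hypothetical, not part of ProGasNet.

```python
def supply_responses(supply, demand):
    """Summarise simulated supply volumes S against the demand D of one country."""
    n = len(supply)

    def prob(condition):
        # Fraction of simulated values that satisfy the condition.
        return sum(1 for s in supply if condition(s)) / n

    return {
        "mean S": sum(supply) / n,
        "P(S=0)": prob(lambda s: s == 0),
        "P(S<0.2D)": prob(lambda s: s < 0.2 * demand),
        "P(S<0.5D)": prob(lambda s: s < 0.5 * demand),
        "P(S<0.8D)": prob(lambda s: s < 0.8 * demand),
        "P(S<D)": prob(lambda s: s < demand),
    }


# Hypothetical simulated volumes for a country with demand D = 15.5.
responses = supply_responses([0.0, 2.0, 10.0, 15.5], demand=15.5)
```

Repeating the call for each of the three countries gives the 18 responses stored after every model run.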

Figure 2. Monte Carlo predictions of the 18 model responses.

The different model responses are more or less
sensitive to the uncertainty in the model inputs.
Figure 2 shows the Monte Carlo simulation results
obtained for Scenario 2. In Figure 2, outputs marked with
a cross do not change significantly and are not
analysed. Outputs with a green frame are analysed
with the PCE approach. It can be noted that
some output variables do not vary at all (e.g.
P(S = 0) for Countries 1 and 2), some do not vary
significantly (values less than 10°), some others only
take discrete values (e.g. P(S < D)), while the average
volume of gas supply S varies continuously.
Consequently, only those model responses that are
significantly impacted by the input uncertainty are
analysed in the sequel, and only the model responses
that take continuous values are analysed with the
polynomial chaos expansion.


6.2 Results for Country 1 


In the following sections, we analyse the predictive
uncertainty on the mean volume of gas S supplied
by the network to the different countries. The estimated
probability density functions (PDFs) of the mean
volume of gas supply (in million cubic metres per
day) for the four different scenarios are depicted
in Figure 3. We note that the PDFs are the same
for scenario 1 (normal operation) and scenario
3 (Source 19 off). This means that, regarding the gas
supply in Country 1, the system is completely resilient
to the failure of this important source.
When the main source of gas is off (scenario 2), the
network remains quite resilient. However, in
scenario 4 the system completely fails
to satisfy the gas demand (vertical dashed line),
although some quantity of gas is supplied.

Notably, in the first three scenarios, according
to the model, the ability of the network to supply
a sufficient gas volume depends on the true value
of some of the uncertain input variables. Indeed,
Figure 3 indicates that the system might fail to
satisfy the gas demand under certain uncertain
conditions, linked to the capacity of the sources, which is
subject to strong aleatory uncertainty.

Figure 3. Predicted uncertainty of mean gas supply for country 1 with respect to the different disruption scenarios.
[Plot of the probability density function against demand (mcm/day); curves: normal operation, Source 2 (X1) off, Source 19 (X2) off, Sources 2 & 19 off.]

Table 4. Relevant inputs for the predicted mean gas supply for country 1.

Scenario          Relevant inputs                    S_i, %   ST_i, %
1 (Normal)        X6 = Peak demand of Country 1      100      100
2 (X1 off)        X6 = Peak demand of Country 1      80       85
                  X2 = Capacity of Source N19        7        16
                  X4 = Capacity of Source N11        1        9
3 (X2 off)        X6 = Peak demand of Country 1      100      100
4 (X1, X2 off)    X4 = Capacity of Source N11        88       88
                  X6 = Peak demand of Country 1      9        9
                  X18 = Capacity of DN400            3        3

To identify the critical uncertain inputs
responsible for the variability of this model
response, we have carried out a global sensitivity
analysis for each scenario. Given that this model
response takes continuous values, the PCE method
is employed. The first-order (S_i) and total-order
(ST_i) Sobol' indices are shown in Table 4. We recall
that S_i represents the share of the variance of the
predicted mean gas supply explained by the input
variable alone, while the difference (ST_i − S_i) represents
the share due to the interaction of the variable
with the other ones. We note that only scenario 2 is
subject to interactions.
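
As a small worked example of the difference between total-order and first-order indices, the sketch below takes the scenario 2 percentages from Table 4 and computes the interaction share of each relevant input. The dictionary layout and variable names are illustrative only.

```python
# Scenario 2 (main source N2 off): (first-order %, total-order %) from Table 4.
scenario2 = {
    "Peak demand of Country 1": (80, 85),
    "Capacity of Source N19": (7, 16),
    "Capacity of Source N11": (1, 9),
}

# Interaction share of each input: total-order minus first-order index.
interaction = {name: st - s for name, (s, st) in scenario2.items()}
```

For the other three scenarios the two columns of Table 4 coincide, so the same computation returns zero for every input.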

The results indicate that the accuracy of the
model in predicting the mean gas supply to Country
1 heavily depends on the knowledge of the peak
demand in that country. This is particularly crucial
for scenarios 1 & 3. When source N2 is off, the
model response becomes more complicated, involving
sources N10 and N11. When both sources N2
and N19 are off (scenario 4), the capacity of
source N11 prevails for the mean gas volume supplied
to Country 1, the country peak demand (X6)
becoming less relevant. In summary, to predict
accurately the mean gas volume supplied to Country 1
it is crucial to know accurately the values of
the inputs identified in Table 4: the peak demand of
Country 1 (X6), the capacities of sources N11 (X4) and
N19 (X2) and, to a lesser extent, the capacity of the
DN400 pipeline (X18).


6.3 Results for Country 2 


The estimated PDFs of the mean volume of gas
supply for the four different scenarios are not shown
due to lack of space. Note that when source N19 is
unavailable (scenarios 3 & 4), the predicted mean
value of gas supply is much less than the demand
of the country. This indicates that source N19
is very important for Country 2. However, the
system is resilient to the failure of gas source N2
(recall that it is the highest capacity source in the
region). This is due to the location of Country
2 with respect to the sources and to potential
bottlenecks in certain connections. A separate
bottleneck analysis of this network has identified
several such bottlenecks (Kopustinskas
et al., 2015).

The most important input variable is the peak
demand of Country 2 when the system operates
normally and when Source N2 is off. Surprisingly,
we note that when Source N19 fails, the capacity
of the system to provide gas to Country 2 also
depends on the peak demand of Country 3 and
on the peak demand of Country 1 (when both N2
& N19 are unavailable). This result is also a
consequence of the priority algorithm implemented
in ProGasNet. Indeed, the latter supplies the
closer demand nodes first before serving the
more distant nodes. Hence, if a country is far from
the main sources of gas, it might be more exposed
to failure of gas supply.


6.4 Results for Country 3 


For Country 3, the system behavior is simpler. The
system is resilient to the failure of one of the main
sources but it is unable to provide gas to Country
3 when both main sources collapse (scenario 4). This
is an important result, revealing the vulnerability
of the country to crisis. However, the crisis simulated by
scenario 4 is unlikely to happen. The prediction of the
volume of gas supply heavily depends on the knowledge
of the value of the gas demand in the country.


7 CONCLUSIONS 


The study presents an uncertainty and sensitivity 
exercise of the security of gas supply model imple- 
mented in the probabilistic gas network simulator 
ProGasNet. The selected network was an EU gas 
transmission network of several member states. It 
is based on realistic network topology and data, 
but due to sensitivity of information, the network 
is anonymised. 

The study aims to identify and rank the model 
parameters that most significantly affect the secu- 
rity of supply. The main finding is that the risk of 
gas supply shortage in the three countries heavily 
depends on the accurate knowledge of different 
model inputs. In particular, the peak demand in 
each country is the key parameter whose precise 
estimation is very important for accurate model 
results. 
The model was run for four different scenarios
representing different disruption situations. The
study confirms the results already observed in
other studies: some disruption scenarios affect
only parts of the network (e.g. specific countries)
while other parts of the network are not affected.
Moreover, due to existing bottlenecks in the system,
supply disruptions in one part of the network
cannot be compensated from the other part even if there
is enough gas available. This clearly indicates the
heterogeneity of the network and the need for further
infrastructure development.

Looking at each country, for Country 1, the peak 
demand is the most important parameter, followed 
by capacity of sources 19 and 11 and capacity of 
DN400 pipeline. This is important information for 
further development of the model and indicates 
where to target further research efforts. 

For Country 2, again the peak demand is the 
most important parameter, followed by capac- 
ity of source 19, DN500 and DN400 pipelines. 
Interestingly, for some scenarios, peak demands 
of Countries 1 and 3 are identified as important. 
This can be well explained by the fact that in cer- 
tain disruption scenarios, lower demand in neigh- 
bouring countries allows better supply to the other 
countries. 

For Country 3, the supply is secured in
scenarios 2 and 3, but in the unlikely
scenario 4 the security of supply is threatened.

The study showed the potential and usefulness 
of sensitivity study applied to the ProGasNet gas 
network model. It has not only identified the most 
important parameters of the model for which 
more attention should be paid during the estima- 
tion process, but also provided useful insights into 
the simulation process by confirming, for instance, 
the heterogeneity of the network from the sensitiv- 
ity analysis perspective. 
ABSTRACT: This research focuses on the analysis of the Fuzzy Finite Element Method (FFEM)
in the presence of uncertainties. In major engineering science problems, such as damage processes
or loading resulting from real incidents, uncertainties are present. Uncertainty is due to lack of
data, an abundance of information, conflicting information and subjective beliefs. For that reason,
uncertainties need to be taken into account in order to prevent the failure of engineering materials. The goals
of this study are to analyse and determine the application of FFEM, taking into consideration the
epistemic uncertainties involved, to a single edge crack plate and a beam. Since it is crucial to develop
an effective approach to model epistemic uncertainties, a fuzzy system is proposed to deal with the
selected problems. Fuzzy system theory is a non-probabilistic method, and it is more appropriate
than a statistical approach for interpreting uncertainty when data are lacking. Fuzzy
system theory comprises a number of processes, starting with the conversion of crisp input to fuzzy input
through the fuzzification process, followed by the main process, known as the mapping process. In the mapping
stage, a combination of the fuzzy system and the finite element method is proposed. In this study,
the fuzzy inputs are numerically integrated based on the extension principle. The solutions obtained are
depicted in figures and tables to show the efficiency and reliability of the present analysis.


1 INTRODUCTION 


Uncertainty is defined as a gradual assessment
of the truth content of a proposition: doubt arises
as to whether the truth content may be stated with
sufficient accuracy using each of the data models
in all cases. Uncertainties are also defined as
vagueness and lack of information or data
(Farkas 2010). Uncertainty is one of the
biggest challenges in the field of engineering. He
(2007) mentioned that, in general, uncertainty can be
divided into three types: stochastic uncertainty,
epistemic uncertainty and error. Stochastic
uncertainty is due to variations in the system.
Epistemic uncertainty exists as a result
of incomplete information, ignorance and lack of
knowledge caused by the lack of experimental data.
Error, by comparison, is an uncertainty
that can be identified and is due to imperfections
in the modelling and simulation.

For several decades, uncertainty has been modelled
according to the theory of probability. The
probabilistic approach is very effective in solving
problems of stochastic uncertainty, but it
is not suitable for problems involving
a lack of data. Some scholars hold that
non-probabilistic methods are more appropriate
than a statistical approach for interpreting
uncertainty when data are lacking.
Interval analysis (Zimmermann 2012),
convex modelling (Tapaswini et al. 2012) and fuzzy
set theory (Ozkoka & Cebi 2014) are the main categories
of non-probabilistic methods. Specifically,
a fuzzy system is a precisely defined system that
applies the theories built on the basic concepts
of fuzzy set theory. According to Savoia (2002),
the main advantage of fuzzy set theory
over fuzzy probability theory is that it maintains
the intrinsic random nature of the physical
variables without the need to model the probability
density function. The simple justification for
fuzzy system theory is that the real world is too
complicated for a precise explanation and description to
be obtained. Fuzzy systems are knowledge-based
or rule-based systems, and they have
been applied in a wide range of fields.
The Finite Element Method (FEM) has become a
powerful tool for solving numerous complex scientific
and engineering problems. With FEM, the
complicated structure of any material can be
discretized into small finite elements. The elements
are then assembled and the respective conditions
applied to obtain the output. In conventional FEM,
system parameters such
as geometry, material properties, external loads or
boundary conditions are treated as crisp values,
that is, values that can be defined exactly.
In practice, however, only vague, imprecise and
incomplete information about these variables may
be available instead of a particular value. The
presence of uncertainty, which cannot be avoided,
is one of the biggest challenges in engineering design.

The Fuzzy Finite Element Method (FFEM)
is presented to deal with uncertainty;
it merges the fuzzy approach with the
conventional Finite Element Method (FEM). In the
FFEM approach, conventional FEM is used as the
parent method that maps the fuzzy input data
to the output data (Behera, 2012). This paper
presents two illustrative examples that apply
the fuzzy finite element method in
the presence of uncertainties. The first
example is the application of FFEM to a single
edge crack plate with the crack length treated as
a fuzzy variable. The second analyses the
structural reliability of a beam.


2 PRELIMINARIES 


The following paragraphs introduce the notation,
definitions and preliminaries that are used in the
rest of this paper.

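Definition 1:

A fuzzy number is a convex normalized fuzzy set of
the real line such that

\mu_{\tilde{A}}(x) : \mathbb{R} \rightarrow [0, 1]    (1)

Definition 2:

We can define an arbitrary triangular fuzzy
number as \tilde{A} = [a_L, a_N, a_R] and a trapezoidal fuzzy
number as \tilde{B} = [b_L, b_{NL}, b_{NR}, b_R]. The fuzzy number
\tilde{A} is said to be a triangular fuzzy number when the
membership function is given by

\mu_{\tilde{A}}(x) =
\begin{cases}
0, & x \le a_L \\
\dfrac{x - a_L}{a_N - a_L}, & a_L \le x \le a_N \\
\dfrac{a_R - x}{a_R - a_N}, & a_N \le x \le a_R \\
0, & x \ge a_R
\end{cases}    (2)

The triangular fuzzy number \tilde{A} = [a_L, a_N, a_R]
can be transformed into interval form by using the
α-cut as in Equation (3).

\tilde{A} = [a_L, a_N, a_R]
          = [a_L + (a_N - a_L)\alpha,\ a_R - (a_R - a_N)\alpha]    (3)

The fuzzy number \tilde{B} is said to be a trapezoidal
fuzzy number when the membership function is
given by

\mu_{\tilde{B}}(x) =
\begin{cases}
0, & x \le b_L \\
\dfrac{x - b_L}{b_{NL} - b_L}, & b_L \le x \le b_{NL} \\
1, & b_{NL} \le x \le b_{NR} \\
\dfrac{b_R - x}{b_R - b_{NR}}, & b_{NR} \le x \le b_R \\
0, & x \ge b_R
\end{cases}    (4)

The trapezoidal fuzzy number \tilde{B} = [b_L, b_{NL}, b_{NR}, b_R]
can be transformed into interval form by using the
α-cut as in Equation (5).

\tilde{B} = [b_L, b_{NL}, b_{NR}, b_R]
          = [b_L + (b_{NL} - b_L)\alpha,\ b_R - (b_R - b_{NR})\alpha]    (5)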
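
The α-cut transformations of the triangular and trapezoidal fuzzy numbers can be sketched in a few lines. The function names below are chosen for this example only; the trapezoidal numbers used in the check are the crack-length values given later in the paper.

```python
def alpha_cut_triangular(a_l, a_n, a_r, alpha):
    """Interval of the triangular fuzzy number [a_L, a_N, a_R] at level alpha."""
    return (a_l + (a_n - a_l) * alpha, a_r - (a_r - a_n) * alpha)


def alpha_cut_trapezoidal(b_l, b_nl, b_nr, b_r, alpha):
    """Interval of the trapezoidal fuzzy number [b_L, b_NL, b_NR, b_R] at level alpha."""
    return (b_l + (b_nl - b_l) * alpha, b_r - (b_r - b_nr) * alpha)


# alpha = 0 returns the support, alpha = 1 the core of the fuzzy number.
support = alpha_cut_trapezoidal(0.1, 0.32, 0.41, 0.49, 0.0)
half = alpha_cut_trapezoidal(0.1, 0.32, 0.41, 0.49, 0.5)
core = alpha_cut_trapezoidal(0.1, 0.32, 0.41, 0.49, 1.0)
tri_half = alpha_cut_triangular(1.0, 2.0, 4.0, 0.5)
```

As α increases from 0 to 1 the interval shrinks from the support to the core, which for a triangular number is the single point a_N.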


3 FUZZY FINITE ELEMENT METHOD 


The fuzzy finite element method is a combination
of the fuzzy approach and the conventional finite element
method. Figure 1 shows a flow chart of the fuzzy
finite element process, which includes the fuzzification
process, the mapping process, the α-cut level process
and, finally, the defuzzification process.

The analysis procedure using the
fuzzy finite element method starts by transforming
the crisp (real-valued) input into
fuzzy input uncertainty through the fuzzification
process. The variables that carry fuzzy uncertainty are
typically the material properties, the geometry of the
material, the boundary conditions or the loading
(Akapan 2001). Triangular fuzzy numbers are used
to describe the membership functions of all
fuzzy parameters. The FFEM approach
continues with the mapping process, where the fuzzy
input is used in the FEM model at each
α-cut level. The vertex method is a common numerical
method for implementing the extension principle;
it combines interval arithmetic with the α-cut method.
The purpose of this mapping is to map two or
more fuzzy inputs to the fuzzy output using as many
as 2^n binary combinations, where n is the number
of fuzzy input parameters. Figure 2 shows an
example of the α-cut method with four α-levels
for each variable.
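
The mapping step can be illustrated with a minimal sketch of the vertex method. It assumes triangular fuzzy inputs and uses a simple stand-in function in place of a finite element solve; the names `vertex_method` and `model` and the input values are hypothetical. Evaluating only the corners of the input box gives the exact output bounds when the model has no extremum inside the box, for example when it is monotonic in each input.

```python
from itertools import product


def alpha_cut(tri, alpha):
    """Interval of a triangular fuzzy number (a_L, a_N, a_R) at level alpha."""
    a_l, a_n, a_r = tri
    return (a_l + (a_n - a_l) * alpha, a_r - (a_r - a_n) * alpha)


def vertex_method(model, fuzzy_inputs, alphas):
    """Map fuzzy inputs to an output interval at each alpha level."""
    output = {}
    for alpha in alphas:
        intervals = [alpha_cut(tri, alpha) for tri in fuzzy_inputs]
        # All 2**n combinations of lower and upper endpoints (the box corners).
        values = [model(*corner) for corner in product(*intervals)]
        output[alpha] = (min(values), max(values))
    return output


def model(x, y):
    # Stand-in for the FEM response; monotonic in both inputs on this box.
    return x / y


result = vertex_method(model, [(1.0, 1.5, 2.0), (4.0, 4.5, 5.0)], [0.0, 0.5, 1.0])
```

With n fuzzy inputs each α-level costs 2^n model evaluations, which is why the number of fuzzy parameters drives the cost of the analysis.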
Figure 1. Flow chart for FFEM.

Figure 2. α-cut method for each alpha level. Source: Farkas et al. (2010).


After the mapping process, the new probability
distributions of the fuzzy output are created.
The final stage in FFEM is the defuzzification process,
which transforms the fuzzy output into a crisp output;
in this study the final output is the stress intensity factor.
Many different defuzzification methods are available,
such as center of gravity, center of area, mean of
maximum, middle of maximum, fuzzy mean, fuzzy
clustering defuzzification and many others. The center
of gravity or centroid (COG) is a familiar defuzzification
method used by many researchers and in this
study. This technique determines the point that
divides the area into two parts which have
the same value.
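
A minimal sketch of centroid defuzzification is given below. It assumes the usual centroid formula, the ratio of the integral of x·μ(x) to the integral of μ(x), evaluated here with a midpoint rule; the function names and the trapezoid used in the example are illustrative and not taken from the paper.

```python
def trapezoid_membership(x, a, b, c, d):
    """Membership value of x for a trapezoidal fuzzy number (a, b, c, d)."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def centre_of_gravity(membership, lo, hi, steps=10_000):
    """Centroid of a membership function on [lo, hi] using the midpoint rule."""
    h = (hi - lo) / steps
    num = den = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        mu = membership(x)
        num += x * mu * h
        den += mu * h
    return num / den


# Symmetric hypothetical trapezoid (0, 1, 2, 3): the centroid lies at its middle.
cog = centre_of_gravity(lambda x: trapezoid_membership(x, 0.0, 1.0, 2.0, 3.0), 0.0, 3.0)
```

For a symmetric membership function the centroid coincides with the point that splits the area into two equal parts; for an asymmetric one the two points generally differ.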


4 RESULT AND DISCUSSION 
ON ILLUSTRATIVE EXAMPLE 


4.1 A single edge crack plate by considering 
the fuzzy variable of crack length 


The material used in this study is Aluminium Alloy
2024-T351, since this material is widely used in
structural engineering, for example in hydraulic
valve parts, pistons, gears and shafts. In this study,
three parameters of Aluminium 2014-T3, namely
Young's modulus E, Poisson's ratio ν and density ρ,
are used as non-fuzzy parameters. In contrast,
the crack length a is treated as a fuzzy parameter.
The geometry used in this study is based on the
stress analysis of cracks handbook by Tada et al.
(2000), as shown in Figure 3.

Figure 3. Two-dimensional loaded plate.

From the analytical aspect, the stress intensity factor
under mode I can be obtained from
Equation (6):

K_I = \sigma \sqrt{\pi a}\, F(a/b)    (6)

The geometry function F(a/b) in Equation (6)
can be calculated using Equation (7):

F(a/b) = 1.122 - 0.231\,(a/b) + 10.55\,(a/b)^2
         - 21.7\,(a/b)^3 + 30.382\,(a/b)^4    (7)

where a is the crack length, b is the width of the geometry,
π is the constant with the value 3.1415927
and σ is the applied stress. The ratio a/b for
the geometry in Figure 3 is 0.1. The fuzzy value of the
crack length a is represented as a trapezoidal fuzzy
number of the form a = (a_1, a_2, a_3, a_4), written in
α-cut form, so that the interval fuzzy number
can be written as Equation (8):

[a_L, a_R] = [a_1 + (a_2 - a_1)\alpha,\ a_4 - (a_4 - a_3)\alpha]    (8)


As mentioned earlier, this study considers only
the crack length as a fuzzy variable, whereas
Young's modulus and Poisson's ratio are treated
as crisp variables.

Using fuzzy values, the result is presented in
interval form and sketched as a trapezoidal membership
function. The fuzzy number for the crack
length is a = (0.1, 0.32, 0.41, 0.49). By using
Equation 8, the fuzzy intervals in terms of α-cuts
are obtained as shown in Table 1. These fuzzy
intervals are substituted into the SIF formula in
interval form as

K_{I(L)} = \sigma \sqrt{\pi a_L}\, F(a/b)    (9)

K_{I(R)} = \sigma \sqrt{\pi a_R}\, F(a/b)    (10)

and so on until all SIF values are obtained, where
K_{I(L)} and K_{I(R)} are the left and right values of the
interval, respectively, for each α-value. The output for the SIF
with the fuzzy crack length is shown in Table 2
and plotted in Figure 4. When the crack length is
treated as an interval, the interval width
increases as the number of elements increases. In the
fuzzy approach, a large interval width of the membership
function gives a more accurate result.
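
The interval calculation of the stress intensity factor described above can be sketched as follows. The applied stress is not stated in this section, so `sigma = 1.0` is a placeholder; the ratio a/b is held at 0.1 as stated in the text, and the function names are chosen for this example only.

```python
import math


def geometry_factor(ratio):
    """Geometry function F(a/b) of Equation (7)."""
    return (1.122 - 0.231 * ratio + 10.55 * ratio**2
            - 21.7 * ratio**3 + 30.382 * ratio**4)


def sif(sigma, a, ratio):
    """Mode I stress intensity factor, Equation (6)."""
    return sigma * math.sqrt(math.pi * a) * geometry_factor(ratio)


def sif_interval(sigma, crack, alpha, ratio=0.1):
    """Left and right SIF values at one alpha level, Equations (8) to (10)."""
    a1, a2, a3, a4 = crack
    a_left = a1 + (a2 - a1) * alpha
    a_right = a4 - (a4 - a3) * alpha
    return sif(sigma, a_left, ratio), sif(sigma, a_right, ratio)


crack = (0.1, 0.32, 0.41, 0.49)   # trapezoidal fuzzy crack length from the text
sigma = 1.0                       # placeholder stress (assumption)
k_left, k_right = sif_interval(sigma, crack, alpha=0.0)
```

Under these assumptions the ratio `k_right / k_left` at α = 0 equals the square root of 0.49/0.1, about 2.21, which is close to the ratio of the corresponding Table 2 entries, 14.16/6.40.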

Figure 4 depicts the SIF plot for the single edge
crack plate. It may be seen
from the above numerical results that the natural
stress (crisp values) remains constant as the
number of elements increases, as it should, even though
only two elements are considered here. However, the
values obtained in each element differ when the
fuzzy method is used.

Table 1. Fuzzy value of crack length for each α-level.

Beta (α-level)   a_L     a_R
0                0.1     0.49
0.1              0.122   0.482
0.2              0.144   0.474
0.3              0.166   0.466
0.4              0.188   0.458
0.5              0.21    0.45
0.6              0.232   0.442
0.7              0.254   0.434
0.8              0.276   0.426
0.9              0.298   0.418
1                0.32    0.41

Table 2. Interval value of SIF for each α-level.

Beta (α-level)   K_I(L)   K_I(R)
0                6.40     14.16
0.1              7.06     14.04
0.2              7.67     13.92
0.3              8.24     13.81
0.4              8.77     13.69
0.5              9.27     13.57
0.6              9.74     13.45
0.7              10.19    13.32
0.8              10.63    13.20
0.9              11.04    13.08
1                11.44    12.95

Figure 4. Trapezoidal membership function for SIF. [Vertical axis: membership values.]


4.2 Analysis on beam structure 


The analysis of structural reliability in the presence
of uncertainties is performed on a beam structure.
In the analysis of the beam structure in Figure 5,
the moment of inertia and modulus of elasticity
of the beam are considered as fuzzy input parameters
with incomplete data.

Figure 5. The geometry of the beam structure.

[Figure: Fuzzy outputs of normal stress using COG]

The existing data show that the moment of inertia
follows a normal distribution with mean value
6.58 x 10° m^4 and a coefficient of variation (COV) of 0.1,
while the elastic modulus has a normal distribution
with mean and COV values of 73.1 GPa and 0.1,
respectively. The uniformly distributed load
applied to the beam is considered as an input that
has no data. Thus, the opinion of an experienced
specialist should be considered to determine the
likely distribution of that load. In this study, the
distributed load is considered as a normal distribution
with mean value 5.55 MN/m and COV value
of 0.2. The base of the membership function for
each of the three fuzzy input parameters is taken
as an interval of 6 standard deviations (6σ) of the
normal distribution, so that 99% of the distribution
is included in the analysis.
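
The base interval of a membership function can be derived from a mean and a coefficient of variation as in the sketch below. It assumes that the 6-standard-deviation interval is centred on the mean (mean ± 3σ); the function name `six_sigma_base` is chosen for this example only.

```python
def six_sigma_base(mean, cov):
    """Base interval of width 6 sigma centred on the mean, with sigma = COV * mean."""
    sigma = cov * mean
    return (mean - 3 * sigma, mean + 3 * sigma)


load_base = six_sigma_base(5.55, 0.2)      # distributed load, MN/m
modulus_base = six_sigma_base(73.1, 0.1)   # elastic modulus, GPa
```

Under this assumption the load ranges from about 2.22 to 8.88 MN/m and the elastic modulus from about 51.17 to 95.03 GPa.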

In general, the deformation and stress in the
beam play an important role in determining the
reliability of the structure. The output at each
node in the FFEM is in the form of a membership
function. A COG value at each node is calculated
with the centroid technique used in the defuzzification
step. The maximum displacement
is 0.00062 m at node 13. The COG values in Figure 6
and Figure 7 represent the FFEM results considering
uncertainties in the input parameters.
In the reliability analysis of beam structures, the
critical parameter is the stress. The COG value of
the maximum normal stress, at node 1, is
261 MPa, which is less than the yield
strength of the material. The results show that
the reliability of the structure in this illustrative
example is 0.9733, which is close to
1. Therefore, the beam structure is still in the safe
region of the elastic deformation range. By
comparison, the maximum normal stress obtained with
conventional FEM is 228 MPa. This shows
that the FFEM produces a more conservative
value of reliability than conventional FEM.


5 CONCLUSION 


The solutions obtained are depicted in figures
and tables to show the efficiency and reliability of
the present analysis. This study shows that the
FFEM approach is a conservative method for solving
problems involving uncertainty. One way to reduce the
uncertainty in the data is by experiment. The FFEM
developed here does not require a large amount
of data; it only requires enough data to
determine the profile or shape of the membership
function. These data can usually be obtained from
the opinion of an expert knowledgeable in the analysis,
combined with inductive reasoning or a genetic
algorithm. Modelling the input in the form of a membership
function effectively brings the epistemic
uncertainty into the analysis. Although both
illustrative examples are based on the fuzzy approach,
the presence of the finite element method in
the FFEM approach makes it easy to analyse
complex structures or components. In
addition, a factor that affects the developed
FFEM simulation is the number of fuzzy parameters:
the more fuzzy parameters are involved in
the simulation, the more conservative the result of the
analysis. An important finding of the fuzzy
approach is that a large interval width of the
membership function gives a more accurate result.
Trapezoidal and triangular membership functions
are considered in these two illustrative examples.
The results are compared, and it is worth mentioning that
using fuzzy values gives a better result in terms of the
interval width of the membership function and
also of the reliability values.
ABSTRACT: A Natech is a technological accident which is triggered by a natural disaster. The increasing
frequency of natural disasters, along with the growing number of industrial plants, is bound to increase
the risk of Natechs in the future. Due to a lack of accurate field observations and empirical data, risk
assessment of Natechs has largely been reliant on expert opinion and is thus prone to epistemic uncertainty
in addition to the aleatory uncertainty originating from the randomness of natural disasters. An Evidential Network
(EN) is a directed acyclic graph based on Dempster-Shafer Theory that explicitly models the propagation of
epistemic uncertainty in system safety and reliability assessment. In the present study, we illustrate
an application of EN to handling epistemic uncertainty in the risk assessment of flood-induced floatation of
storage tanks.


1 INTRODUCTION 


Technological accidents which are triggered by 
natural disasters such as earthquakes, lightning, 
storms, wildfires, tsunamis, and floods are known 
as Natechs. Natural disasters have reportedly led 
to the release of significant amounts of oil, chemi- 
cals, and radiological substances (Showalter & 
Myre 1994, Rasmussen 1995, Young et al. 2004). 
The occurrence of Natechs in industrial plants,
particularly oil terminals, can result in catastrophic
consequences in terms of large spillage of petroleum
products. In 2005, the floods triggered by
Hurricane Katrina in the U.S. caused a spillage
of ~8 million gallons of oil into the ground
and waterways. In August 2017, Hurricane
Harvey in the U.S. caused damage to storage tanks
in refineries and petrochemical plants, leading to
a substantial release of pollutants. The structural
damage caused by natural events, however, does
not compare with the environmental damage and
revenue losses due to interruption in production
and supply chain: Hurricane Harvey made oil
refineries shut down as a safety precaution, leading
to a loss of more than 1 million barrels
of oil per day in refining capacity (CNBC, 2017).
Natechs have been recognized in quantitative risk
assessment of industrial plants by many researchers
(Young et al. 2004, Godoy 2007, Cruz & Okada
2008, Antonioni et al. 2009, Haptmanns 2010, 
Krausmann et al. 2011, Landucci et al. 2012, Necci 
et al. 2013, Marzo et al. 2015, Mebarki et al. 2016, 
Khakzad & van Gelder 2017, 2018, Kameshwar & 
Padgett 2018). The scarcity of historical data,
especially data with sufficient resolution and accuracy,
has made the majority of previous studies reliant 
on analytical or simulative techniques (e.g., finite 
element modeling) in modeling and calculating 
the probability of failure modes. This mostly has 
been carried out based on modeling the envisaged 
failure mechanisms as a function of loads exerted 
by natural disasters (e.g., impact of tsunami wave) 
and the resistance of impacted vessels. 

The stochastic features of natural disasters as
well as the randomness of failure mechanisms are
naturally modeled via probability density functions.
Either the types or the parameters of such
density functions are usually estimated from
objective data that are insufficient in amount or
accuracy. This lack of objective data is usually
compensated for by expert opinion based upon
experience, knowledge, and even intuition,
inevitably introducing a degree of epistemic
uncertainty into the analysis.

Figure 1. Quantification of epistemic uncertainty through Bel and Pls functions (Rakowsky, 2007).
[Diagram of the interval from 0 to 1: belief Bel(A) and doubt 1 − Bel(A); plausibility Pls(A) and disbelief 1 − Pls(A); the uncertainty lies between Bel(A) and Pls(A).]

The Evidence Theory (Dempster-Shafer Theory,
DST), originally initiated by Dempster (1967)
and further developed by Shafer (1976), is an effective
tool to handle imprecise probabilities and reasoning
under epistemic uncertainty. According to
DST, all the possible states (mutually exclusive and
collectively exhaustive) of a system are represented in
a set known as the frame of discernment Ω. To each
subset A of Ω, an evidential weight m(A)
can be assigned to indicate the degree of evidence
(based on objective data or subjective opinion) in
favor of the claim that a specific state in Ω belongs
to A (Rakowsky 2007). Given m(A), which is also
known as the belief mass, the amounts of belief and
plausibility of A, equivalent to the lower and upper
probability bounds of A, respectively, can be
determined (Shafer 1976). The difference between
the plausibility Pls(A) and the belief Bel(A) represents
the epistemic uncertainty of A (Fig. 1).

An Evidential Network (EN) is a directed acy- 
clic graph to propagate uncertainty based on con- 
ditional belief functions (Xu & Smets 1996). Simon 
and Weber (2009) combined DST with Bayesian 
Network (Pearl, 1988) to take advantage of the 
junction tree algorithm developed by Jensen (1996) 
in propagating and computing the marginal belief 
functions of child nodes based upon those of their 
parent nodes. 

The present study is an attempt to illustrate the 
potential of ENs in system safety where, owing to a 
lack of sufficiently accurate data, the analysis is 
subject to epistemic uncertainty embedded in expert 
judgement. The application of EN will be demonstrated 
via safety assessment of oil storage tanks impacted by 
floods, with the floatation of tanks as the most common 
failure mode (Cozzani et al. 2010). 


2 REASONING UNDER EPISTEMIC 
UNCERTAINTY 


2.1 Dempster-Shafer theory 


Assume that all the states of a system can be presented 
in a frame of discernment as Q = {S1, S2, S3}. 
Accordingly, the set of all the subsets of Q can be 
shown as: 

$$A_i \in \{\emptyset, \{S_1\}, \{S_2\}, \{S_3\}, \{S_1,S_2\}, \{S_1,S_3\}, \{S_2,S_3\}, Q\} \tag{1}$$


According to the available evidence (either objective 
or subjective), an expert may assign a belief mass to 
each $A_i$, as $0 \le m(A_i) \le 1$. Each $A_i$ for 
which $m(A_i) > 0$ is called a focal set. If all the 
states of the system are known, then $m(\emptyset) = 0$. 
Further, it must always hold that: 

$$\sum_{A_i} m(A_i) = 1 \tag{2}$$


Having the belief masses determined, the belief and 
plausibility measures of each focal set can be defined: 

$$\mathrm{Bel}(A_i) = \sum_{B \mid B \subseteq A_i} m(B) \tag{3}$$

$$\mathrm{Pls}(A_i) = \sum_{B \mid B \cap A_i \neq \emptyset} m(B) \tag{4}$$

$\mathrm{Bel}(A_i)$ and $\mathrm{Pls}(A_i)$, which are 
non-additive, can be taken as lower and upper 
probability bounds, respectively, of $A_i$ (Simon & 
Weber 2009): 

$$\mathrm{Bel}(A_i) \le P(A_i) \le \mathrm{Pls}(A_i) \tag{5}$$

$$\mathrm{Bel}(\bar{A}_i) = 1 - \mathrm{Pls}(A_i) \tag{6}$$

$$\mathrm{Pls}(\bar{A}_i) = 1 - \mathrm{Bel}(A_i) \tag{7}$$

where $\bar{A}_i$ is the complement of $A_i$. Having the 
Bel and Pls functions, the belief mass of a focal set 
can be determined using the Möbius transformation as 
(Smets 2002): 

$$m(A_i) = \sum_{B \mid B \subseteq A_i} (-1)^{|A_i - B|}\, \mathrm{Bel}(B) \tag{8}$$

where $|A_i - B|$ refers to the difference between the 
number of elements of $A_i$ and $B$. 


2.2. Evidential network 


Simon & Weber (2009) used a Bayesian network 
(BN) formalism to propagate imprecise probabili- 
ties using the belief mass functions assigned to the 
focal sets. Since the belief masses allocated to the 
focal sets of each component of the system add up 
to unity, they can be used as marginal probabilities 
of the nodes in the BN. 

Combination of the belief mass functions of components 
(nodes) can readily be carried out by means of Boolean 
algebra. For the sake of exemplification, consider a 
system Z comprising two components X and Y as shown in 
Fig. 2. 

In Fig. 2, the components and the system are considered 
as binary nodes, i.e., being in one of the up or down 
states. Thus, for instance, the frame of discernment of 
X and its focal sets can be presented as Q_X = {up, down} 
and A_X = {{up}, {down}, {up,down}}, respectively, where 
{up,down} = {up} ∪ {down}. Among the focal sets of X, 
{up,down} models the uncertainty, indicating that X can 
be in either the up or the down state. Now consider a 
case where X = {up} and Y = {up,down} are connected to Z 
by an AND gate; using Boolean algebra, the state of Z 
can be identified as {up} ∧ {up,down} = ({up} ∧ {up}) ∪ 
({up} ∧ {down}) = {up} ∪ {down} = {up,down}. Likewise, 
in case of an OR gate, the state of Z can be identified 
as {up} ∨ {up,down} = ({up} ∨ {up}) ∪ ({up} ∨ {down}) = 
{up} ∪ {up} = {up}. The results of the AND and OR gates 
are presented as a truth table in Table 1. 

Figure 2. BN for reliability assessment of a 
two-component system. 

Table 1. Truth table used to combine the focal sets of 
components X and Y via AND and OR gates (Simon & 
Weber, 2009). 

X            Y            Z (AND)      Z (OR) 
{up}         {up}         {up}         {up} 
{up}         {down}       {down}       {up} 
{up}         {up,down}    {up,down}    {up} 
{down}       {up}         {down}       {up} 
{down}       {down}       {down}       {down} 
{down}       {up,down}    {down}       {up,down} 
{up,down}    {up}         {up,down}    {up} 
{up,down}    {down}       {down}       {up,down} 
{up,down}    {up,down}    {up,down}    {up,down} 

For the system shown in Fig. 2, assume that the analyst, 
based on his degree of belief, has assigned the marginal 
belief mass distributions to the focal sets of 
components X and Y as m(A_X) = {0.5, 0.4, 0.1} and 
m(A_Y) = {0.4, 0.4, 0.2}. In the next section, we 
demonstrate with a case study how the belief mass 
distributions can be determined using Equations (2)-(8). 
Fig. 3 displays the resulting EN in which X and Y are 
connected to Z via an AND gate. 

As can be seen in Fig. 3, the inference algorithm of BN 
can be used to calculate the marginal belief mass 
distribution of Z based on the marginal mass 
distributions of X and Y and the truth table (see 
Table 1) as m(A_Z) = {0.2, 0.64, 0.16}. Having the 
belief mass distribution of Z, the belief of Z = {up} 
can be calculated using Equation (3): 

$$\mathrm{Bel}(\{up\}) = \sum_{B \mid B \subseteq \{up\}} m(B) = m(\{up\}) = 0.2$$

This is because among the focal sets of Z, i.e., 
A_Z = {{up}, {down}, {up,down}}, only the focal set 
B = {up} is a subset of {up}. Likewise, the plausibility 
of Z = {up} can be calculated using Equation (4): 

$$\mathrm{Pls}(\{up\}) = \sum_{B \mid B \cap \{up\} \neq \emptyset} m(B) = m(\{up\}) + m(\{up,down\}) = 0.2 + 0.16 = 0.36$$

This is because among the focal sets of Z, only the 
intersections of the focal sets {up} and {up,down} with 
{up} are not null. As a result: 
0.2 ≤ P(Z = up) ≤ 0.36. 
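The two-component example can be checked with a short Python sketch. It assumes, as the BN formalism implies, that the belief masses of X and Y are combined as independent marginals; the function and variable names (`gate_and`, `gate_or`, `combine`) are illustrative and do not come from the paper.

```python
from itertools import product

UP, DOWN, UD = "up", "down", "up,down"

def gate_and(x, y):
    """AND gate on focal sets, following the truth table in Table 1."""
    if x == DOWN or y == DOWN:
        return DOWN
    if x == UP and y == UP:
        return UP
    return UD

def gate_or(x, y):
    """OR gate on focal sets, following the truth table in Table 1."""
    if x == UP or y == UP:
        return UP
    if x == DOWN and y == DOWN:
        return DOWN
    return UD

def combine(m_x, m_y, gate):
    """Marginal belief masses of Z, assuming X and Y are independent."""
    m_z = {UP: 0.0, DOWN: 0.0, UD: 0.0}
    for (sx, px), (sy, py) in product(m_x.items(), m_y.items()):
        m_z[gate(sx, sy)] += px * py
    return m_z

# Belief masses of the focal sets {up}, {down}, {up,down} from the text.
m_x = {UP: 0.5, DOWN: 0.4, UD: 0.1}
m_y = {UP: 0.4, DOWN: 0.4, UD: 0.2}

m_z = combine(m_x, m_y, gate_and)
bel_up = m_z[UP]            # Equation (3): only {up} is a subset of {up}
pls_up = m_z[UP] + m_z[UD]  # Equation (4): {up} and {up,down} intersect {up}

print({state: round(mass, 2) for state, mass in m_z.items()})
print(round(bel_up, 2), round(pls_up, 2))
```

Passing `gate_or` instead of `gate_and` to `combine` gives the corresponding masses for an OR gate.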


Figure 3. EN for reliability assessment of a two-com- 
ponent system using belief mass distributions. 


Figure 4. Adding belief (Bel) and plausibility (Pls) 
nodes in order to calculate epistemic uncertainty of 
Z = {up}. 

Table 2. Conditional belief table used to calculate 
Bel(up) and Pls(up) of Z in Fig. 4. 

Z            Bel(up)    Pls(up) 
{up}         1          1 
{down}       0          0 
{up,down}    0          1 

The procedure of calculating belief and plau- 
sibility can be carried out using the developed 
BN (which in fact is an EN) by adding the nodes 
Bel({up}) and Pls({up}) to the network (Fig. 4). 
The conditional belief table used to connect these 
two nodes to node Z is presented in Table 2. It 
should be noted that since Bel and Pls are non- 
additive (see Equations (6) & (7)), they have been 
presented as two separate nodes in the EN. 


3 SAFETY ASSESSMENT OF STORAGE 
TANKS IN CASE OF FLOOD 


3.1 Floatation of storage tanks 


Floatation of storage tanks has reportedly been the 
most frequent failure mode during floods (Cozzani et 
al. 2010). Floatation of a storage tank occurs if the 
upthrust (buoyancy) force of the flood exceeds the bulk 
weight of the storage tank (weight of the tank plus the 
weight of its liquid content). Fig. 5 presents the 
loading force (buoyancy) and resisting forces (bulk 
weight of the tank) contributing to the floatation of 
the storage tank. 

When there is a lack of field or experimental data to 
relate the characteristics of the natural disaster to 
the failure modes and failure probabilities of the 
impacted equipment, one may choose to develop 
Limit-State Equations (LSE) based on influential 
loading and resisting forces. Development of LSEs helps 
the analyst combine his knowledge (though incomplete) 
of the influential parameters with available objective 
data to compensate for the inadequacy of the objective 
data required for estimation of failure probabilities. 

As for the floatation of storage tanks, the relevant 
LSE should take into account the weight of the tank 
$W_t$, the weight of the contained liquid $W_l$, and the 
buoyant force $F_b$. As can be seen from Fig. 5, we have 
considered a self-anchored storage tank (not bolted to 
the foundation), which is a common practice in case of 
atmospheric storage tanks. As such, the only resisting 
forces against the tank's floatation comprise the bulk 
weight of the tank. Considering the direction of the 
forces in Fig. 5, the LSE can be developed as: 

$$\mathrm{LSE} = F_b - W_t - W_l \tag{9}$$

Figure 5. Schematic of the load-resistance forces 
considered for tank floatation. 

$$F_b = \rho_w g \frac{\pi D^2}{4} h \tag{10}$$

$$W_t = \rho_s g \left(\pi D H + 2\,\frac{\pi D^2}{4}\right) t \tag{11}$$

$$W_l = \rho_l g \frac{\pi D^2}{4} L \tag{12}$$

where D: tank's diameter, H: tank's height, t: tank 
shell's thickness, L: height of liquid inside the tank, 
h: height of flood's inundation, $\rho_w$: flood water 
density, $\rho_s$: tank shell's density, $\rho_l$: 
liquid's density, and g: gravitational acceleration. 
Accordingly, the floatation probability of the tank can 
be presented as P(LSE > 0). 


3.2 Failure analysis 


For the sake of exemplification, assume that the 
analyst, based on objective data, knows the values of 
the tank's and flood's parameters as listed in Table 3, 
except the initial amount of chemical liquid (gasoline 
in this example). 

Floatation of a storage tank due to slow submersion can 
result in an instantaneous release of liquid should the 
tank collapse, a continuous release of the entire 
content in a limited time in case of full disconnection 
of large pipelines, or a minor release in case of 
partial disconnection of flanges and pipelines (Cozzani 
et al. 2010). In any of these release scenarios, if the 
initial inventory of the tank was not known before the 
floatation, the estimation of the a priori inventory of 
the tank would be subject to (epistemic) uncertainty. 

Since the analyst would have doubts about the initial 
inventory of the tank before the flood impacted the 
plant, he decides to seek the opinion of two experts 
(e.g., operators working at the storage tank area). The 
first expert comes up with P(L = 0.5 m, L = 1.0 m, 
L = 1.5 m) = (0.2, 0.5, 0.3) whereas the second expert 
comes up with P(L = 0.5 m, L = 1.0 m, L = 1.5 m) = 
(0.5, 0.3, 0.2). As such, the experts' uncertainty about 
the initial inventory of the storage tank can be 
expressed using imprecise probabilities as: 

$$\begin{aligned} 0.2 &\le P(L = 0.5) \le 0.5 \\ 0.3 &\le P(L = 1.0) \le 0.5 \\ 0.2 &\le P(L = 1.5) \le 0.3 \end{aligned} \tag{13}$$

Table 3. Parameters used for risk assessment of 
floatation. 

Parameter        Value 
H (m)            6 
D (m)            10 
t (m)            0.01 
h (m)*           N(μ = 1, σ = 0.2) 
ρ_s (kg/m³)      7900 
ρ_w (kg/m³)      1024 
ρ_l (kg/m³)†     850 

* due to aleatory uncertainty inherent in the flood's forecast. 
† gasoline has been considered as the chemical liquid. 


According to the parameters in Table 3, the tank's 
weight, the weight of the liquid content, and the 
buoyancy force can be calculated as $W_t$ = 219 (kN), 
$W_l$ = 655 L (kN), and $F_b$ = 789 h (kN), 
respectively. The floatation probability can thus be 
calculated as: 

$$P(\mathrm{LSE} > 0) = P(F_b > W_t + W_l) = P(789h > 219 + 655L) = P\left(h > \frac{219 + 655L}{789}\right) \tag{14}$$
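The coefficients in Equation (14) can be reproduced from Table 3 with the Python sketch below. Two assumptions are made here and are not stated in the paper: g = 9.81 m/s², and the tank weight is taken directly as the 219 kN quoted in the text rather than recomputed from the shell geometry. The helper name `floatation_threshold` is illustrative.

```python
import math

G = 9.81         # m/s^2, assumed value of gravitational acceleration
D = 10.0         # m, tank diameter (Table 3)
RHO_W = 1024.0   # kg/m^3, flood water density (Table 3)
RHO_L = 850.0    # kg/m^3, liquid (gasoline) density (Table 3)
W_T_KN = 219.0   # kN, tank weight as quoted in the text

base_area = math.pi * D ** 2 / 4.0              # m^2
fb_per_m = RHO_W * G * base_area / 1000.0       # kN per metre of inundation h
wl_per_m = RHO_L * G * base_area / 1000.0       # kN per metre of liquid height L

def floatation_threshold(liquid_height_m):
    """Inundation height above which LSE > 0 for a given liquid height."""
    return (W_T_KN + wl_per_m * liquid_height_m) / fb_per_m

thresholds = {L: floatation_threshold(L) for L in (0.5, 1.0, 1.5)}

print(round(fb_per_m), round(wl_per_m))
for L, h_min in thresholds.items():
    print(f"L = {L} m: floatation if h > {h_min:.2f} m")
```

The three thresholds are the inundation heights that the flood must exceed for the tank to float at each candidate liquid level.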


3.3 Uncertainty modeling 


Considering L as an uncertain variable with three 
states as L1 = 0.5 m, L2 = 1.0 m, and L3 = 1.5 m, its 
frame of discernment would be: 

Q_L = {L1, L2, L3}. 

Consequently, the set of its focal sets would be: 

A_L: {L1}, {L2}, {L3}, {L1, L2}, {L1, L3}, {L2, L3}, 
{L1, L2, L3}. 


Using the equations in Section 2.1, the belief mass of 
each focal set can be determined. For example, consider 
the first focal set, {L1}, with the lower and upper 
bound probabilities as shown in Equation (13). Based on 
Equation (5), Bel({L1}) = 0.2 and Pls({L1}) = 0.5. 
Since {L1} is a singleton, using Equation (8), 
m({L1}) = Bel({L1}) = 0.2. Similarly, m({L2}) = 0.3 and 
m({L3}) = 0.2. 

As another example, consider the focal set {L1, L2}. 
Since {L1}, {L2}, and {L1, L2} are all the subsets of 
{L1, L2}, using Equation (8), we will have 
m({L1, L2}) = Bel({L1, L2}) - Bel({L1}) - Bel({L2}). 

Further, based on Equation (6), Bel({L1, L2}) = 
1 - Pls({L3}) = 1 - 0.3 = 0.7. 

As a result, m({L1, L2}) = 0.7 - 0.2 - 0.3 = 0.2. 
Following the same procedure, m({L1}, {L2}, {L3}, 
{L1, L2}, {L1, L3}, {L2, L3}, {L1, L2, L3}) = (0.2, 0.3, 
0.2, 0.2, 0.1, 0, 0). Since m({L2, L3}) = 
m({L1, L2, L3}) = 0, they would not be considered as 
focal sets any more. 
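The derivation above can be traced numerically with the following Python sketch. It takes the two expert distributions from Section 3.2, forms the lower and upper bounds of Equation (13), evaluates Bel for every subset (singletons from the lower bounds, pairs from Equation (6), and Bel of the whole frame taken as 1, which is an assumption of a normalized mass function), and applies the Möbius transformation of Equation (8). All function names are illustrative, and the shortcut used in `belief` is only valid for a three-state frame such as this one.

```python
from itertools import combinations

STATES = ("L1", "L2", "L3")
FULL = frozenset(STATES)

# Expert opinions on the initial liquid height (Section 3.2).
expert_1 = {"L1": 0.2, "L2": 0.5, "L3": 0.3}
expert_2 = {"L1": 0.5, "L2": 0.3, "L3": 0.2}

# Lower and upper probability bounds, as in Equation (13).
lower = {s: min(expert_1[s], expert_2[s]) for s in STATES}
upper = {s: max(expert_1[s], expert_2[s]) for s in STATES}

def nonempty_subsets(items):
    """Yield every non-empty subset of items as a frozenset."""
    items = tuple(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def belief(subset):
    """Bel for a subset of the three-state frame."""
    if subset == FULL:
        return 1.0                      # assumed: masses are normalized
    if len(subset) == 1:
        (state,) = subset
        return lower[state]             # Bel is the lower bound, Equation (5)
    (missing,) = FULL - subset
    return 1.0 - upper[missing]         # Equation (6) with a singleton complement

def mobius_mass(subset):
    """Belief mass via the Moebius transformation, Equation (8)."""
    return sum((-1) ** (len(subset) - len(b)) * belief(b)
               for b in nonempty_subsets(subset))

# Round away floating-point noise; adding 0.0 turns -0.0 into 0.0.
masses = {a: round(mobius_mass(a), 10) + 0.0 for a in nonempty_subsets(STATES)}

def plausibility(subset):
    """Pls via Equation (4): sum of masses of sets intersecting subset."""
    return sum(m for b, m in masses.items() if subset & b)

for a, m in masses.items():
    print(sorted(a), m)
```

Recomputing `plausibility` for each singleton returns the upper bounds of Equation (13), which is a quick consistency check on the masses.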


3.4 Probability of floatation 


As can be seen from Equation (14), the only influ- 
ential parameters in estimating the probability of 
floatation are the flood inundation height h and 
the liquid containment height L. To facilitate the 
propagation of uncertainty — aleatory uncertainty 
in h and epistemic uncertainty in L — the EN in 
Fig. 6 can be developed. It is worth noting that 
compared to the EN proposed by Simon & Weber 
(2009), in our EN both the belief and plausibility 
of Floatation have been modeled using a single 
node, considering the fact that: 


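$$\mathrm{Bel}(A_i) + \mathrm{Unc}(A_i) + \mathrm{Dis}(A_i) = 1.0 \tag{15}$$

$$\mathrm{Unc}(A_i) = \mathrm{Pls}(A_i) - \mathrm{Bel}(A_i) \tag{16}$$

$$\mathrm{Dis}(A_i) = 1 - \mathrm{Pls}(A_i) \tag{17}$$

where $\mathrm{Unc}(A_i)$ and $\mathrm{Dis}(A_i)$, 
respectively, refer to the uncertainty and disbelief 
about the focal set $A_i$ (see Fig. 1). 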

In the EN shown in Fig. 6, the states of the node 
L have been represented by its focal sets with the 
respective belief masses as marginal probabilities 
(although belief masses are not probabilities, as 
discussed in Rakowsky (2007)). As opposed to the 
node L, the states of the node h are the discretized 
intervals of h with their (real) marginal probabili- 
ties calculated based on the normal distribution 
presented in Table 3. In this regard: 


$$h = \begin{cases} h1 & \text{if } 0 \le h < 0.8 \\ h2 & \text{if } 0.8 \le h < 1.2 \\ h3 & \text{if } 1.2 \le h < 1.6 \\ h4 & \text{if } 1.6 \le h \le 2.0 \end{cases}$$

Figure 6. Evidential network to estimate the probability 
of floatation under aleatory and epistemic uncertainty. 


The conditional belief table used to calculate the 
marginal masses of the node Floatation is shown in 
Table 4. The conditional masses can readily be 
calculated using Equation (14). 

Table 4. Conditional belief mass distribution for the 
node Floatation in Fig. 6. 

Index  h    L          Bel     Unc     Dis 
1      h1   {L1}       0.098   0       0.902 
2      h1   {L2}       0       0       1 
3      h1   {L3}       0       0       1 
4      h1   {L1,L2}    0.098   0       0.902 
5      h1   {L1,L3}    0.098   0       0.902 
6      h2   {L1}       1       0       0 
7      h2   {L2}       0.133   0       0.867 
8      h2   {L3}       0       0       1 
9      h2   {L1,L2}    0.133   0.867   0 
10     h2   {L1,L3}    0       1       0 
11     h3   {L1}       1       0       0 
12     h3   {L2}       1       0       0 
13     h3   {L3}       0.003   0       0.997 
14     h3   {L1,L2}    1       0       0 
15     h3   {L1,L3}    0.003   0.997   0 
16     h4   {L1}       1       0       0 
17     h4   {L2}       1       0       0 
18     h4   {L3}       1       0       0 
19     h4   {L1,L2}    1       0       0 
20     h4   {L1,L3}    1       0       0 

As an example, consider the combination of h2: 
0.8 < h < 1.2 with the states (focal sets) of L: 

• Case 1 
{L1}: L = 0.5 

$$P\left(h2 > \frac{219 + 655\,L1}{789}\right) = P(h2 > 0.69) = 1.0$$

Since h is always greater than 0.69 (note 
0.8 < h < 1.2), the belief and plausibility of the 
floatation, as the lower and upper bounds of 
probability, are both equal to 1.0. This, in turn, 
yields a zero uncertainty (see Equation (16)) and a zero 
disbelief (see Equation (17)). See the 6th row in 
Table 4. 

• Case 2 
{L2}: L = 1.0 

$$P\left(h2 > \frac{219 + 655\,L2}{789}\right) = P(h2 > 1.11) = P(1.11 < h < 1.2) = 0.133$$

As a result, Bel = Pls = 0.133, Unc = 0.0, and 
Dis = 0.867 (7th row in Table 4). 

• Case 3 
{L3}: L = 1.5 

$$P\left(h2 > \frac{219 + 655\,L3}{789}\right) = P(h2 > 1.52) = 0.0$$

Since h is always smaller than 1.2 (note 
0.8 < h < 1.2), the belief and plausibility of the 
floatation, as the lower and upper bounds of 
probability, are both equal to 0.0. This, in turn, 
yields a zero uncertainty and a disbelief of unity (8th 
row in Table 4). 

• Case 4 
{L1, L2}: L = 0.5 or 1.0 

From Case 1 (L = 0.5) and Case 2 (L = 1.0), the 
probabilities of floatation were calculated as 1.0 and 
0.133, respectively. Accordingly, the lower probability 
can be taken as Bel = 0.133 whereas the upper 
probability as Pls = 1.0. This in turn will result in 
Unc = 0.867 and Dis = 0.0 (9th row in Table 4). 

• Case 5 
{L1, L3}: L = 0.5 or 1.5 

From Case 1 (L = 0.5) and Case 3 (L = 1.5), the 
probabilities of floatation were calculated as 1.0 and 
0.0, respectively. Accordingly, the lower probability 
can be taken as Bel = 0.0 whereas the upper probability 
as Pls = 1.0. This in turn will result in Unc = 1.0 and 
Dis = 0.0 (10th row in Table 4). 
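The step from singleton probabilities to the (Bel, Unc, Dis) entries of a composite focal set can be sketched in Python as follows. The sketch first evaluates the interval probability P(1.11 < h < 1.2) quoted in Case 2 under the N(1, 0.2) distribution of Table 3, and then applies Equations (16) and (17) to Cases 4 and 5. The helper names `normal_cdf` and `bel_unc_dis` are illustrative and not taken from the paper.

```python
import math

def normal_cdf(x, mu=1.0, sigma=0.2):
    """CDF of the normal distribution assumed for the flood height h."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Case 2: interval probability P(1.11 < h < 1.2), quoted as 0.133.
p_case_2 = normal_cdf(1.2) - normal_cdf(1.11)

def bel_unc_dis(floatation_probs):
    """(Bel, Unc, Dis) of a focal set from the floatation probabilities
    of the singleton states it contains."""
    bel = min(floatation_probs)   # lower probability bound
    pls = max(floatation_probs)   # upper probability bound
    return bel, pls - bel, 1.0 - pls   # Equations (16) and (17)

# Floatation probabilities for h2 from Cases 1 to 3.
p_h2 = {"L1": 1.0, "L2": 0.133, "L3": 0.0}

row_9 = bel_unc_dis([p_h2["L1"], p_h2["L2"]])    # focal set {L1, L2}
row_10 = bel_unc_dis([p_h2["L1"], p_h2["L3"]])   # focal set {L1, L3}

print(round(p_case_2, 3))
print(tuple(round(v, 3) for v in row_9))
print(tuple(round(v, 3) for v in row_10))
```

By construction the three values returned by `bel_unc_dis` sum to one, in line with Equation (15).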

As can be seen from Fig. 6, given the marginal 
and conditional probabilities and belief masses, 
the lower bound probability of floatation has been 
calculated as Bel = 0.3. Similarly, the amount of 
uncertainty has been calculated as Unc = 0.2, 
which together with the amount of belief results 
in an upper bound probability of floatation as 
Pls =0.3 + 0.2 =0.5. 
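A rough numerical check of these marginals is sketched below in Python. It rests on several assumptions that the paper does not spell out: the marginal probabilities of h1 to h4 are taken as the N(1, 0.2) probability mass of each discretized interval, h and L are treated as independent root nodes, and the (Bel, Unc, Dis) entries are copied from Table 4 as printed. Under these assumptions the marginals come out close to the reported Bel = 0.3 and Unc = 0.2.

```python
import math

def normal_cdf(x, mu=1.0, sigma=0.2):
    """CDF of the normal distribution assumed for the flood height h."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Assumed marginals of h: normal probability mass in each interval (m).
EDGES = {"h1": (0.0, 0.8), "h2": (0.8, 1.2), "h3": (1.2, 1.6), "h4": (1.6, 2.0)}
p_h = {k: normal_cdf(hi) - normal_cdf(lo) for k, (lo, hi) in EDGES.items()}

# Belief masses of the focal sets of L (Section 3.3).
m_L = {"L1": 0.2, "L2": 0.3, "L3": 0.2, "L1,L2": 0.2, "L1,L3": 0.1}

# (Bel, Unc, Dis) entries copied from Table 4.
TABLE_4 = {
    ("h1", "L1"): (0.098, 0.0, 0.902), ("h1", "L2"): (0.0, 0.0, 1.0),
    ("h1", "L3"): (0.0, 0.0, 1.0), ("h1", "L1,L2"): (0.098, 0.0, 0.902),
    ("h1", "L1,L3"): (0.098, 0.0, 0.902),
    ("h2", "L1"): (1.0, 0.0, 0.0), ("h2", "L2"): (0.133, 0.0, 0.867),
    ("h2", "L3"): (0.0, 0.0, 1.0), ("h2", "L1,L2"): (0.133, 0.867, 0.0),
    ("h2", "L1,L3"): (0.0, 1.0, 0.0),
    ("h3", "L1"): (1.0, 0.0, 0.0), ("h3", "L2"): (1.0, 0.0, 0.0),
    ("h3", "L3"): (0.003, 0.0, 0.997), ("h3", "L1,L2"): (1.0, 0.0, 0.0),
    ("h3", "L1,L3"): (0.003, 0.997, 0.0),
    ("h4", "L1"): (1.0, 0.0, 0.0), ("h4", "L2"): (1.0, 0.0, 0.0),
    ("h4", "L3"): (1.0, 0.0, 0.0), ("h4", "L1,L2"): (1.0, 0.0, 0.0),
    ("h4", "L1,L3"): (1.0, 0.0, 0.0),
}

# Marginalize the node Floatation over its parents h and L.
bel = unc = dis = 0.0
for (h_state, l_set), (b, u, d) in TABLE_4.items():
    weight = p_h[h_state] * m_L[l_set]
    bel += weight * b
    unc += weight * u
    dis += weight * d
pls = bel + unc   # Equation (16) rearranged

print(f"Bel = {bel:.2f}, Unc = {unc:.2f}, Pls = {pls:.2f}, Dis = {dis:.2f}")
```

Because the inputs are rounded table values and the interval probabilities of h are an assumption, agreement with Fig. 6 should only be expected to about two decimal places.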

Modeling the uncertainty of the floatation in a single 
node instead of two nodes (cf. Fig. 4) comes in handy 
in reasoning about the a priori inventory of the 
storage tank via belief updating. For instance, in case 
the storage tank is believed to have lost some of its 
content as a result of floatation, the initial belief 
masses assigned to L can be updated by instantiating 
the amount of Disbelief to zero (Fig. 7). 

Figure 7. Updated belief masses by means of 
Disbelief = 0 as soft evidence. 

The updated belief masses are depicted in Fig. 7, where 
L1 (i.e., L = 0.5 m) is believed to be the likeliest 
amount of initial inventory before the floatation. 


4 CONCLUSIONS 


In the present study we examined the applicability 
of evidential networks to system safety under both 
aleatory and epistemic uncertainties. 

Modeling epistemic uncertainty of a parameter in a 
single node as an aggregation of the degrees of belief, 
uncertainty, and disbelief makes it possible to perform 
belief updating by using a variety of hard and soft 
evidence. We demonstrated the application of evidential 
networks to assess the vulnerability of storage tanks 
against flood-induced submersion. However, the 
methodology, without loss of generality, can be applied 
to system safety and reliability assessment in a wide 
variety of domains. 


Towards an online risk model for dynamic positioning operations 

ABSTRACT: Automation and increasing complexity mean that operators have to handle data and 
alarms and make urgent decisions under the pressure of unexpected and rapidly changing hazardous 
situations. Position loss during marine operations may lead to serious accidents, such as collision, loss of 
well integrity, etc. An online risk model aims at assisting operators in dynamic positioning operations to 
successfully recover the vessel's position in good time. The objective of this paper is to identify generic 
scenarios of position loss during the operational phase and the information that is needed for successful 
recovery action. The results show that position loss normally involves complex human-machine 
interactions, generally in two patterns. Based on the findings, a risk model that considers the time 
aspect is of vital importance in developing an online risk model for DP operations. 


1 INTRODUCTION 


A dynamically positioned (DP) vessel is defined by the 
International Maritime Organization (IMO) as a vessel 
that maintains its position and heading (fixed location 
or pre-determined track) exclusively by means of active 
thrusters (IMO 1994). 
A DP system generally consists of three main sub- 
systems, i.e., the power system, thruster system 
and DP control system. To further measure the 
designed equipment redundancy of the DP sys- 
tem, the IMO MSC Circ. 645 (1994) defines three 
classes, i.e., DP 1, DP 2 and DP 3. For DP class 1, 
position loss may occur given a single failure event 
of an active component. For DP class 2, position 
loss should not occur given a single failure, and for 
DP class 3, position loss should not occur given a 
single failure, including fire and flooding of water- 
tight compartment or fire subdivision. 

Based on more than two decades of experience 
with safety management of DP marine opera- 
tions, it has been shown that risk of position loss is 
intrinsic to all DP vessels (Chen and Nygard 2016). 
A position loss may happen on DP 1 vessels, as well 
as on DP 2 and DP 3 vessels. Meanwhile, offshore 
exploration and exploitation of hydrocarbons have 
opened up an era of DP vessels. There are wide 
applications of DP vessels in the offshore oil and 
gas industry, e.g., diving support vessels, pipe-lay- 
ers, heavy lifting vessels, drilling rigs, subsea con- 
struction vessels, platform support vessels, shuttle 
tankers, etc. The focus of this paper is on offshore 
loading operations using DP shuttle tankers. 

Nowadays, most liquid products (i.e., stabilised 
crude oil, condensate, liquefied petroleum gas, 
liquefied natural gas, etc.) from oil and gas fields 


in the North Sea are transported to refineries 
and terminals by DP shuttle tankers (ST). These 
large vessels with high thrust and power capacity 
may pose a significant collision risk to an adja- 
cent offshore installation in case of position loss. 
Since 2000, there have been two collisions between 
shuttle tankers and facilities on the Norwegian 
Continental Shelf (NCS). In addition, there have 
been four near misses (collision events) and seven 
incidents related to loss of position, with varying 
degrees of severity. 

There are two generic failure modes of position 
loss, i.e., drive-off and drift-off. The primary con- 
cern in this paper is on the drive-off scenario. The 
term drive-off is defined as a tanker moving away 
from its own target/desired position by its own 
power in off-loading operations. It might occur 
in different phases during offloading operations, 
i.e., approach, connection, loading and disconnec- 
tion. Excessive relative motions between the FPSO 
(floating production storage and offloading) and 
tanker, categorized in surging and yawing modes, 
have been identified as the failure prone situations 
(drive-off) in tandem offloading (Chen 2003). Several 
recommendations have been given by Chen (2003) 
regarding how to reduce the occurrence of excessive 
surging and yawing events. For instance, the 
coordination of mean heading control between 
the FPSO and tanker is important to minimize the 
probability of excessive yawing. 

A recent review of DP accidents and incidents by the 
Petroleum Safety Authority in Norway (PSA) has shown 
that there is an increasing tendency in the number of 
DP drive-off accidents and incidents during offshore 
loading with shuttle tankers on the NCS in the past 
fifteen years (Kvitrud, Kleppestø et al. 2012). Vinnem 
et al. (2015) indicate the need for new risk reduction 
measures and outline an overall concept for online risk 
management. A research project has been initiated by 
the Department of Marine Technology at the Norwegian 
University of Science and Technology (NTNU). The main 
goal of the project is to develop an online risk 
monitoring and decision support system. 

An analysis was performed on DP accidents 
and incidents with emphasis on root causes and 
barrier failures (Dong, Rokseth et al. 2017). 
One of the major conclusions was that the most recent 
drive-off accidents and incidents on the NCS involve 
both technical and human/operational failures. The 
development of DP operator (DPO) decision support 
should focus on reducing the combination of causes. 
Five design principles 
for the online risk model, including complemen- 
tarity, integration, early detection, early warning, 
and transparency were proposed by Hogenboom 
et al. (2017). Moreover, it is likely that an online 
risk management system may reduce the risk 
due to human machine interface (HMI) failures. 
Automation and increasing complexity mean that 
DPOs have to handle data and alarms and take 
safety-critical decisions under the pressure of 
unexpected and rapidly changing hazardous situ- 
ations. A previous study shows that human error 
is the most complex and least understood factor 
in the failures of complex systems, accounting 
for as much as 60% to 80% of complex system 
failures (Sudano 1994). As an additional barrier 
function (Vinnem, Utne et al. 2015), the new decision 
support system should aim at providing information to 
operators, reducing the potential for catastrophes 
induced or exacerbated by human errors. The complexity 
of the HMI determines the importance of information 
support for operators' decision-making. 

The objective of this paper is to identify the human 
actions and types of technical failures in the 
initiating event of drive-off. The purpose is to 
identify the challenges and problems for early 
detection during DP operations, for which an online 
risk model and decision support system can provide 
information that improves DPOs' situation awareness, 
decision-making, and understanding of system 
performance. 

The paper is structured as follows: the chal- 
lenges of DPO decision-making are stated in Sec- 
tion 2, where classifications of decision and risk 
information are also introduced. In Section 3, 
human actions in the initiating event of drive-offs 
and classifications of failures are presented, followed 
by the description of the analysis and results in 
Section 4. Based on the analysis and results, 
discussions are given in Section 5. Lastly, conclusions 
are summarized in Section 6. 


2 DP OPERATOR DECISION-MAKING, 
CHALLENGES, TYPES OF DECISION 
AND RISK INFORMATION 


An information-decision-execution model (Figure 1) for 
DPO reaction in a drive-off scenario was introduced by 
Chen (2003). One important factor and dimension that 
needs to be under control is time. As illustrated in 
Figure 1, the model is presented with a time reference. 
It is worth noting that the three stages (Ta; Td; T1) 
do not happen in a purely linear, sequential manner. 
The estimation of the DPO action initiation time (T1) 
is, accordingly, based on an estimation of the 
following three characteristic time intervals, as shown 
in Figure 1: 


— Information time: 0-Ta 
— Decision time: Ta-Td 
— Execution time: Td-T1 


Chen (2003) states that the challenge of human 
intervention is that the DPO typically needs to make a 
decision within 45 seconds to avoid a collision in the 
case of drive-off, given that the typical distance 
between vessels is 75 m. Some previous studies show 
that, when collisions occurred, DPOs took about 
3 minutes before taking manual evasive action. 
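The time budget implied by the information-decision-execution model can be illustrated with a small Python sketch. It simplifies the model by adding the three intervals sequentially, although the stages are noted above not to be purely sequential, and it treats the 45 s figure as the window available for initiating action, which is an interpretation. The class name, function name, and the interval values in `example` are hypothetical; only the 45 s and 3 minute figures come from the text.

```python
from dataclasses import dataclass

AVAILABLE_WINDOW_S = 45.0       # typical window quoted from Chen (2003)
OBSERVED_REACTION_S = 3 * 60.0  # about 3 minutes reported for collision cases

@dataclass
class ReactionTimes:
    information_s: float   # 0 to Ta
    decision_s: float      # Ta to Td
    execution_s: float     # Td to T1

    def action_initiation_time(self) -> float:
        """T1 as a simple sum of the three intervals (a simplification)."""
        return self.information_s + self.decision_s + self.execution_s

def time_margin(t1_s: float, window_s: float = AVAILABLE_WINDOW_S) -> float:
    """Positive if action starts inside the window, negative otherwise."""
    return window_s - t1_s

# Hypothetical interval durations, chosen only for illustration.
example = ReactionTimes(information_s=20.0, decision_s=15.0, execution_s=5.0)

print(time_margin(example.action_initiation_time()))
print(time_margin(OBSERVED_REACTION_S))
```

A negative margin, as for the roughly 3 minute reaction time reported above, indicates that action would start after the available window has closed.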

Loss of situation awareness has been recognized as the 
main reason for the lack of early detection (Chen 
2014). Situation awareness (SA) is defined as "the 
perception of the elements in the environment within a 
volume of time and space, the comprehension of their 
meaning and the projection of their status in the near 
future" (Endsley 1995). Moreover, Endsley (1995) 
developed a three-level model to describe the different 
levels involved in the formation of SA. Level 1, 
perception, refers to the perception of attributes and 
dynamics of elements in the environment. Level 2, 
comprehension, refers to the integration and 
understanding of the information, i.e. it involves the 
human operator's sense-making to establish what is 
happening in the situation. Level 3, projection, refers 
to the operator's estimation of future states of the 
system. The results of the assessment of the current 
situation can be utilized to determine future courses 
of action, thus supporting decision-making. However, 
Overgard et al. (2015) argued that the process of 
gaining SA does not follow sequentially from level 1 SA 
to level 2 SA, as set out by Endsley's model; rather, 
the build-up seemed to be adaptive and related to the 
work system's higher-level goals, such as avoiding 
collision. They also found that in a majority of DP 
accidents and incidents, DPOs did not expect the 
occurrence of an accident or incident. Some of the DPOs 
were not able to identify the relevant initiating 
events (lack of level 1 SA), or to understand the 
relevance of the initiating events (lack of level 2 SA) 
in DP accidents and incidents (Overgard 2015). An 
initiating event is an identified event that upsets the 
normal operations of the system and may require a 
response to avoid undesirable outcomes (Rausand 2011). 
The initiating events that were used in the study 
include environmental impact, DP reference, human 
error, component failure, power management system, and 
DP software failure (Overgard 2015). 

Figure 1. Information-Decision-Execution model for 
DP operator reaction in drive-off scenarios, adopted 
from (Chen 2003). 

To avoid a collision and mitigate the consequences of 
position loss, successful human intervention has been 
considered the main risk reduction measure, while 
efforts should be made on bridge ergonomics, HMI, alarm 
systems, procedures and training. Meanwhile, it is also 
necessary to support DPOs in gaining and maintaining 
situation awareness in order to improve their 
decision-making. 

From a risk assessment perspective, decisions can be 
classified into planning decisions and execution 
decisions. Planning decisions are made by blunt-end 
decision makers and middle-level decision makers, such 
as operational managers. The time lag between decision 
and action is relatively long, so there is enough time 
to systematically identify and evaluate different 
alternatives. Execution decisions are made by sharp-end 
personnel, who monitor or control ongoing operations, 
and by emergency response teams. The time lag between 
decision and action is much shorter. When the DPO is in 
charge of making execution decisions (illustrated in 
Figure 1), these can be further divided into 
instantaneous decisions and emergency decisions. 
Instantaneous decisions are taken spontaneously by 
sharp-end operators, e.g. to follow or deviate from 
procedures, or to ignore or react to deviations in 
normal working conditions. The decision-making 
emphasizes situation assessment and pattern matching. 
This type of decision is normally, although not 
necessarily, taken quickly. Emergency decisions are 
taken in emergencies to avoid or adapt to hazardous 
situations. The time dynamics are often so fast that 
pattern matching may not keep up with the development 
of the situation. Decisions have to be made fast. 


Furthermore, Yang and Haugen (2015) identify six 
different risk types to support the different 
operational decisions. To support execution 
(instantaneous and emergency) decisions, time-dependent 
action risk information is proposed. The time-dependent 
action risk can be estimated or predicted based on the 
margin between the performance of parameters in the 
current situation and the operational limits. 

In terms of early detection, the focus should be on 
signals of deterioration of position to strengthen 
situation awareness during the monitoring (boredom) 
phases. Early warning, including indicators derived 
from operating parameters against operating limits, 
should facilitate early detection and reflect the 
operating limits and capabilities. 


2.1 Human actions in initiating event of drive-off 


To study the HMI in the initiating event of a drive-off, the term human action is used instead of human error. Human error is defined by Reason (1990) as all those occasions in which a planned sequence of mental or physical activities fails to achieve its intended outcome, and when these failures cannot be attributed to the intervention of some agency. Reason (1990) emphasizes that the notions of intention and error are inseparable. Human action can be categorized as intentional action and nonintentional action. Reason argues that human error is only associated with intentional action, and that it has no psychological meaning in relation to nonintentional behaviour. This view is also adopted in this paper, although nonintentional human behaviour may contribute to system failure from a safety point of view.

The human actions and their interactions with technical failure events have been categorized into initiating action, response action and latent action:


• Initiating action is an action that initiates a failure event in the system.

• Response action is an action that responds to the system demands, typically under technical failure events or special external situations. Chen (2003) points out that the response action may save or worsen the situation or cause a transition to another event.

• Latent action is an action that influences (but does not directly initiate) the technical failure, e.g. maintenance action, and/or the above two types of human actions.


2.2 Types of failures 


With respect to the performance of an item, it is necessary to explain the difference between failure, fault and error. The relationship between them is given as follows (Rausand and Høyland 2004):


• A failure is an event that occurs at a specific point in time.

• A fault is the state of an item characterized by inability to perform a required function. While a failure is an event that occurs at a specific point in time, a fault is a state that will last for a shorter or longer period.

• An error is a discrepancy between a computed, observed, or measured value or condition and the true, specified, or theoretically correct value or condition. An error is present when the performance of a function deviates from the target performance (i.e., the theoretically correct performance), but still satisfies the performance requirement. An error will often, but not always, develop into a failure.


An illustration showing the relationship can be found in Figure 2. A failure may originate from an error. When the failure occurs, the item enters a fault state. A failure mode is always related to a required function and the associated performance requirement. A failure mode is a description of a fault (i.e., a state) and not of a failure (i.e., an event). A correct term would, therefore, be a fault mode.

In addition, failures may be classified according to their causes, effects, detectability and several other criteria. It is worth mentioning that common-cause failures (CCFs) form a special category of cause. According to the effect of the failure, IEC 61508 (2010) classifies failures as follows:


• Safe failure: a failure that does not have the potential to put the safety-related system in a hazardous or fail-to-function state.

• Dangerous failure: a failure that has the potential to put the safety-related system in a hazardous or fail-to-function state.

• Non-critical failure: a failure where the main functions of the item are not affected.


Safe and dangerous failures may be classified 
further as either detected (by diagnostics) or unde- 
tected (not detected by diagnostics). 
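
The two-way classification above (effect and detectability) can be sketched as a small lookup. This is only an illustration: the class and function names are hypothetical, and the abbreviation SU for safe undetected failures is assumed by analogy with DU, DD and SD.

```python
from enum import Enum


class FailureClass(Enum):
    SD = "safe detected"
    SU = "safe undetected"  # abbreviation assumed by analogy with the others
    DD = "dangerous detected"
    DU = "dangerous undetected"


def classify(dangerous: bool, detected: bool) -> FailureClass:
    """Combine the effect (safe/dangerous) with detectability by diagnostics."""
    if dangerous:
        return FailureClass.DD if detected else FailureClass.DU
    return FailureClass.SD if detected else FailureClass.SU


print(classify(dangerous=True, detected=False).value)  # dangerous undetected
```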


[Figure 2: performance plotted against time, showing the target value, the acceptable deviation and the actual performance.]


Figure 2. Difference between failure, fault and error (Rausand and Høyland 2004).


3 ANALYSIS AND RESULTS 


The analysis is performed by reviewing the incident investigation reports of recent DP accidents and incidents (a detailed overview can be found in Dong, Rokseth et al. 2017). Based on the classifications of human action and failures stated above, the following keywords are used to analyse the accidents and incidents:


• Performance deviation

• Initiating event

• Failure (event), particularly technical failures

• Human initiating action

• Human response action


The result shows two identified situations 
regarding performance deviations. 


Situation 1: A deviation is observed, and a normal operational activity must be performed.

Situation 2: A deviation is observed, and it represents abnormal performance of the technical system.


For each situation, operator tasks, HMI, type 
of human action and typical technical failures are 
summarised in Table 1. 

For the first situation (shown in Table 1), the operator needs to interact with the technical system to perform an operational activity when deviations in normal working conditions are observed. This situation mostly involves a human initiating action. In addition, DP control logic failure has been identified as a typical technical failure that is initiated by the human action. For instance, the DPO might need to adjust the ST heading to return the backloading hose during disconnection. When giving the new setpoints, the DPO initiates the software logic failure. However, it is challenging for DPOs to identify the DP logic failure, since it is a dangerous undetected failure. Technical failures can be classified into (dangerous or safe) undetected and (dangerous or safe) detected failures. Dangerous undetected (DU) failures prevent activation on demand and are revealed only by testing or when a demand occurs. They are sometimes also called dormant failures (IEC 61508 2010).

The drive-offs involving a human initiating action also show that the DPO lacks information for performing the task. A task analysis is conducted based on a case study of adjusting the shuttle tanker position to return the loading hose during disconnection. It is illustrated in Figure 3. The task analysis follows a six-step decision-making process (D. Husjord 2015) used in navigational training and practice. As shown in Figure 3, the main task
is to adjust ST position to return loading hose

Table 1. Identification of situations that contributed to human action in DP accidents and incidents.

| Situation | Description | Operator task | HMI | Type of human action | Identified technical failures | Type of technical failures |
|---|---|---|---|---|---|---|
| 1 | Deviation is observed. Normal operational activity is required to perform. | Interact with technical system to perform operation activity, i.e., adjust ST heading to return backloading hose during disconnection. | Select an operation mode; give new setpoints in user menu; select DP reference origin. | Initiating action, i.e., DPO gives new setpoints. | DP control logic failure; sensor failure. | Dangerous undetected |
| 2 | Deviations represent abnormal performance of technical system. | Interact with technical system to keep ST position. | Interaction depending on the type of technical failures (i.e., safe failures or dangerous failures) and causes of the technical failures. | Response action, i.e., DPO performs position drop-out to calibrate PRS. | Inaccurate DP offset(s) for PRS(s) and/or gyros deviating from true north. | Safe detected |


during disconnection. To achieve this, a number of steps are required:


Step 1: Define the heading deviations;

Step 2: Make a plan for changing the position;

Step 3: Execute the action for a position change;

Step 4: Monitor the movement of the vessel.


The operator should follow Steps 1-4 in order. However, this is not just a human task. Changing the heading demands human-machine collaboration. While the operator decides whether the heading should be changed or not, execution of the position change needs thrust allocation, which is implemented by the DP control system. The problem is that the DP control system might have limitations that are not stated in the user manual, or hidden failures. Therefore, the operators need information about the real-time performance of the DP control system to avoid undesired outcomes from the commands they give to the control system. Here, real-time means that the weather conditions at the moment are also taken into account.


Figure 3. Task analysis of adjusting ST position to return backloading hose during disconnection.


Regarding the second situation in Table 1, it is normally associated with technical failures. When technical failures appear, the DPO identifies the deviations. To ensure safe operation, the DPO further identifies the risk that can be caused by the technical failures so that they can take action to respond to the failure. Nevertheless, they might misunderstand the technical failures, which means they should be able to distinguish between dangerous and safe detected failures.


• Dangerous detected failures (DD): dangerous failures that are detected immediately when they occur, for instance by an automatic built-in self-test.

• Safe detected failures (SD): safe failures that are detected (normally by automatic self-testing).


The detection of technical failures and the identification of dangerous or safe failures have mainly been assigned to the automation, and are performed by diagnostic self-testing (Rausand and Høyland 2004). Consequently, increasing trust in the reliability of the automatic function (i.e., automatic self-testing) may result in loss of skill among human operators. The operator needs information support to be aware that the actual performance of the DP system is within the acceptable deviation even though a deviation is observed.


4 DISCUSSION 


Based on the classifications of human actions, initiating action and response action are identified from recent DP accidents and incidents. First, the analysis shows that the initiating event of a drive-off does not have to be a technical failure. It can be a human initiating action that triggers a failure in the technical system while performing a normal operational task. The identified lack of information support during the task also points to the deficiency of proof testing in revealing dangerous undetected failures (i.e., DP control logic failure). The DPO needs information for evaluation before executing the decision when time is available. Lack of information may result in loss of situation awareness and overconfidence in the DP system performance. Overconfidence is sometimes referred to as complacency, and can have severe negative consequences if the automation is less than fully reliable (Wickens, Gordon and Liu 1997). Complacency is probably an inevitable consequence of the human tendency to let experience guide our expectancies. When DP systems are marketed as quite reliable, we should prevent DPOs from perceiving the device to be of “perfect reliability”. Otherwise, it becomes a natural tendency for the operator to cease monitoring its operation, or at least to monitor it far less vigilantly than is appropriate. One implication of automation for human intervention related to situational awareness is that people are more aware of the state of processes in which they are actively participating than when they are passive monitors of someone (or something) else. When they must detect a failure in an automated system, they will be less likely to intervene correctly and appropriately if they are out of the loop and do not fully understand the system’s momentary state. All of this information will be essential in order to develop the risk model on which the on-line monitoring will be based.

In addition, a drive-off event can also be trig- 
gered by an observed technical failure with interac- 
tion of human response action. Indeed, it requires 
the DPO to be able to analyse if it is a safe or 
dangerous failure. The operator needs informa- 
tion support for being aware whether the actual 
performance of DP system is within acceptable 
deviation or not. 

To avoid DPO overconfidence in the automation and improve their understanding of actual system performance, an additional supervisory system is needed to assist operators with detection and situational awareness and to counteract skill loss.

One of the purposes of such a system is to support the DPO in the two situations described in Section 3, to avoid initiating actions and response actions that contribute to drive-off during DP operations. The information support should give DPOs a reference for answering the following questions:


- Will the action initiate a dangerous undetected failure?

- Is it a dangerous technical failure? Will the action worsen the situation?


For the first question, the system aims to support the operator in situations where they face deviations in normal working conditions and need to perform an operational action in response, for instance when the DPO needs to adjust the ST position to return the loading hose during disconnection after a heading deviation has been observed. This information will be used in the forthcoming development of a risk model, which will be the main basis of the online risk modeling tool.

Initiation of action is a decision-making process. Therefore, we can call the new system an online decision support system. It will support the operator, with the aim of reducing initiating actions and response actions that contribute to drive-off. Based upon the findings, the online decision support system should address two types of information support. Figure 4 illustrates how the online decision support system provides these two types of information.


1. Support information for the DPO in task planning when the DPO encounters deviations in normal working conditions. The operator needs to be aware of whether their planned action will initiate a dangerous undetected failure. This information support should avoid initiation of drive-off involving human initiating action.

2. Support information for the DPO to analyse the actual system performance. The information is to help the operator be aware of whether their response action will worsen the situation and lead to a drive-off.


[Figure 4 labels: situation due to technical failures; time-dependent risk information; online risk model for DP operations; successful recovery action; major accidents.]

Figure 4. Two types of information support in an online decision support system.


Meanwhile, the importance of the time aspect should be emphasized, since a drive-off situation may develop so fast that it leads to a severe consequence, such as a collision, within a very short time. The operator has to maintain awareness of the situation and keep up with its development.


5 CONCLUSIONS 


Due to the nature of DP operations and human-machine dynamics, loss of situation awareness has been recognized as the main reason for the lack of early detection. One of the reasons is that the initiating event of position loss involves a complex HMI. Tanker drive-off potentially involves not only DP hardware and software, position reference systems, vessel sensors and the local thruster control system, but also the DP operator.

To improve the DP operator’s situational awareness and understanding of system performance, it is necessary to study the human-machine interaction based on the classification of human actions and types of failures. It has been found that many DP accidents and incidents involve human initiating action and response action.

Two situations are identified from the DP acci- 
dents and incidents involving initiating action and 
response action. First, DPO encounters the situ- 
ation that is associated with deviations in normal 
working conditions (ST keeps its position within 
operating limits). Drive-off is initiated due to the 
interaction between human initiating action and 
dangerous undetected failures. Second, DPO faces 
a situation given by technical failures. The opera- 
tor needs information support for being aware 
whether the actual performance of DP system is 
within acceptable deviation or not. 

Based on the situations, some challenges are 
pointed out: 


Challenge 1: In the first situation, the DPO should evaluate whether their action will initiate a dangerous failure that has not been detected. This indicates the need for information support that compensates for the deficiency of proof testing and supports the operator’s evaluation when planning a decision for the normal operational task.

Challenge 2: In the second situation, the DPO should analyze whether the detected failure is a safe or a dangerous failure and whether their response action will worsen the situation. Therefore, it is of vital importance for the operator to be aware of whether the actual system performance is within the acceptable deviation.


An online decision support system will be an advisory tool that addresses the listed challenges and supports the DPO in improving situation awareness and decision support. A benefit of introducing the system is to avoid DPO overconfidence in the DP system.


ABSTRACT: This paper describes the implementation of dynamic safety envelopes for Autonomous Remotely Operated Vehicles (AROVs). A safety envelope is defined as a three-dimensional spatial area around the AROV, which forms a virtual protective barrier against collision with known and unknown obstacles in the subsea environment. The Octree method is used to set up the cuboidal shape of the proposed safety envelope. A Fuzzy Inference System (FIS) is modeled to derive the size of the dynamic safety envelope. The three inputs of the proposed FIS are vehicle velocity, probability of acoustic sensor failure and the time to collision risk indicator. A user interface allows for verification and visualization of the resulting dynamic safety envelope during live laboratory tests. The results show that, similar to vehicular envelopes in other industries, dynamic safety envelopes can be implemented on AROVs. The proposed dynamic safety envelope may be used to model the behavior of AROVs when confronted with different collision scenarios.


1 INTRODUCTION 


Globally, numerous research initiatives are investigating the use of autonomous remotely operated underwater vehicles to perform subsea inspection, maintenance, and repair (IMR) operations (Jamieson et al. 2012, Furuholmen et al. 2013, Mai et al. 2016, Gancet et al. 2016, Schjølberg et al. 2016). Autonomous remotely operated underwater vehicles (AROVs) are tethered or untethered underwater vehicles that can independently control manipulator functions and permit shared control between the vehicle and the human operator. AROVs can navigate autonomously, perform self-diagnostics, and be equipped with remotely operated tool systems requiring limited operator control (Hegde et al. 2015). However, the introduction of autonomy in subsea IMR operations may also result in emerging risk factors. One such risk factor is the risk of collision posed by the use of AROVs (Hegde et al. 2016, Utne and Schjølberg 2014). Delayed IMR operations, loss of vehicle, and loss of structural integrity may be some of the severe consequences of AROV collisions with subsea structures, other AROVs and the seabed. Safeguarding the functions of subsea infrastructure and the AROVs is vital to ensure safe and cost-efficient autonomous subsea IMR operations.

Studies to identify, assess, and avoid collision risk of vehicles are paramount for all vehicular systems. In the early 1970s, the increase in maritime traffic and the need for safe envelopes around marine vessels were highlighted by Fujii and Tanaka (1971). Influenced by the collision avoidance procedures in the aviation industry, Goodwin (1975) coined the term “ship domain”. Goodwin (1975) defined “ship domain” as the “sea around the ship, which the navigator would like to keep free, with respect to other ships and fixed objects”. Over the years, the size, shape, and area covered by the ship domain have evolved continuously (Pietrzykowski and Uriasz 2009, Tam et al. 2009, Lewison 1978, Davis et al. 1980). Currently, in the automotive, maritime (surface vehicles), aviation and space industries, different forms of vehicular safety envelopes are utilized during operations. The primary aim of these vehicular envelopes is to suggest or autonomously modify the behavior of the vehicle when obstacles are detected inside the vehicular safety envelope.

Hegde et al. (2017) utilize the Octree method to design a static safety envelope for AROVs. The term safety envelope can be defined as a 3D spatial area around the underwater vehicle forming a virtual protective barrier (in space and time) against collision with known and unknown obstacles in the subsea environment, influencing the behavior of the AROV (Hegde et al. 2017). In a static safety envelope, the size of the envelope is constant and does not change during live IMR operations. This approach is valid when the AROV is in close proximity to the subsea equipment. However, when the AROV is moving from one location to another, a dynamic safety envelope may assist the AROV and the human operator to adapt and react to different collision scenarios. In addition, a dynamic envelope can reduce the need to detect obstacles by decreasing the area of the envelope. This can result in decreased data processing requirements for the on-board collision detection module. At this time, such dynamic vehicular envelopes do not exist for AROVs (Hegde et al. 2015).

The objective of this paper is to develop dynamic 
safety envelopes for AROVs, i.e., the size of the 
safety envelope changes depending on operational 
parameters of the AROV. 

This paper is organized as follows: Section 2 presents the design of the safety envelope. The elements of the proposed fuzzy inference system are described in Section 3. Section 4 describes the laboratory setup used to test the dynamic safety envelope. The results from the laboratory tests are presented in Section 5. The findings are discussed in Section 6. Section 7 presents the conclusions and future work possibilities.


2 DESIGN OF DYNAMIC SAFETY 
ENVELOPES 


An Octree is used to generate the dynamic safety envelopes. An Octree is a recursive tree data structure that consists of spatial cubes named Octants. Each Octant can be further divided into eight child Octants. Figure 1 illustrates the Level 1 and Level 2 Octree renderings with the AROV in the center of the Octree. In the Level 1 Octree, eight cubes surround the AROV, and in the Level 2 Octree, sixty-four cubes surround the AROV. Each of the cubes is allocated a unique identifier and linked to a safe subsea traffic rule. The subsea traffic rule aims to maximize the horizontal and vertical separation from the identified obstacle. If an obstacle is detected in one or more Octants, a suitable subsea traffic rule is suggested to the AROV or the human operator.

According to Hornung et al. (2013), there are four main reasons to use the Octree method for robot applications.


1. Octrees can establish virtual spatial grids around the robot, which can be used to check for collisions with the obstacles in all three axes.

2. The resolution of Octrees can be increased or decreased, which can result in detailed obstacle tracking, if required.

3. Measurement data from multiple sensors can be probabilistically represented using Octrees.

4. Both active and passive sensors can be used to check for collisions in known and unknown environments.


AROVs are also exposed to collisions with obstacles. Data from multiple active and passive sensors can be used to detect obstacles in the subsea environment. Therefore, use of the Octree method as highlighted by Hornung et al. (2013) can also be extended to underwater vehicle applications.

In addition, the size of the Octree can increase or decrease the computational load in detecting the obstacle. If the safety envelope is static, the constant computations required may consume the AROV's battery power even when there are no obstacles in the vicinity. A dynamic safety envelope may decrease the computations required by limiting the collision detection module to an optimized Octree area. The next section explores the use of fuzzy logic to derive the size of the dynamic safety envelope.


Figure 1. Rendering of static safety envelope for AROVs as proposed by Hegde et al. (2017), showing the Level 1 Octree and the Level 2 Octree.
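
The Octree subdivision described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of Hegde et al. (2017): the `Octant` class, the dotted identifier scheme, the 4 m half-size and the obstacle position are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass
class Octant:
    center: Point
    half_size: float  # half of the cube edge length
    ident: str        # hypothetical identifier scheme: parent id + "." + child index

    def children(self) -> List["Octant"]:
        """Split this cube into its eight child cubes."""
        h = self.half_size / 2.0
        cx, cy, cz = self.center
        kids = []
        for i in range(8):
            dx = h if i & 1 else -h
            dy = h if i & 2 else -h
            dz = h if i & 4 else -h
            kids.append(Octant((cx + dx, cy + dy, cz + dz), h, f"{self.ident}.{i}"))
        return kids

    def contains(self, p: Point) -> bool:
        """True if point p lies inside (or on the boundary of) this cube."""
        return all(abs(a - c) <= self.half_size for a, c in zip(p, self.center))


def build_level(root: Octant, level: int) -> List[Octant]:
    """Return all octants at the given subdivision level."""
    octants = [root]
    for _ in range(level):
        octants = [child for o in octants for child in o.children()]
    return octants


# Envelope centred on the vehicle; the 4 m half-size is an arbitrary example value.
root = Octant(center=(0.0, 0.0, 0.0), half_size=4.0, ident="0")
level1 = build_level(root, 1)
level2 = build_level(root, 2)
print(len(level1), len(level2))  # 8 64

# Octants of the Level 2 grid that contain a hypothetical obstacle position.
hits = [o.ident for o in level2 if o.contains((3.0, -1.0, 0.5))]
print(hits)
```

Each identifier returned in `hits` could then be used to look up the subsea traffic rule linked to that cube.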

3 FUZZY INFERENCE SYSTEM


Fuzzy logic delivers precise outputs from imprecise inputs. Input values are assumed to vary within a given range of values, which resembles real-life scenarios. Figure 2 is adapted from Zadeh (1996, 2002) and describes the overall methodology of a fuzzy inference system (FIS). In a FIS, input and output variables can contain n fuzzy sets, and a value may share membership among several fuzzy sets. The process of converting the crisp input into fuzzy membership values is known as fuzzification. A fuzzy operator is used to connect the antecedent (fuzzy inputs) to a consequent (crisp output) through if-then logic. Defuzzification is achieved by calculating the membership of input variable fuzzy sets against the output variable fuzzy sets. Defuzzification results in a crisp value that can further be used as input to make decisions. Fuzzy inference systems are useful in two main use cases: first, to model systems that are highly complex and whose behavior is only vaguely understood; and second, where an approximate but quicker solution is acceptable.

Scikit-Fuzzy, a fuzzy logic module for the Python programming language, is used to set up the proposed FIS (Warner et al. 2017). Figure 2 provides an overview of the proposed FIS, which has three input variables and one output variable.


3.1 Fuzzification of variables 


Three input variables (operational performance indicators of the vehicle) are identified as influencing the output variable, i.e., the size of the safety envelope. It is thus assumed that the vehicle performance influences safe operation. This subsection describes the fuzzification of the input and output variables.


3.1.1 Vehicle velocity 

Considering the kinetic energy of a moving vehicle, the velocity at which the vehicle moves can influence the collision detection and avoidance ability of the underwater vehicle. An AROV traveling at high velocity approaches a potential obstacle faster. This means that the time available to detect and avoid the collision scenario is inversely related to the velocity of the underwater vehicle. Therefore, vehicle velocity (vv) is used as one of the inputs in the proposed FIS. Three membership functions for the vehicle velocity input are assumed, namely low, medium and high.


Figure 2. Overview of the proposed fuzzy inference system.

The FIS was modeled according to the technical 
specifications of the Blue Robotics BlueROV2. The 
maximum achievable velocity of the BlueROV2 
vehicle is 1 m/s (Blue Robotics 2017). Figure 3 illus- 
trates the resulting membership functions (MFs) 
for vehicle velocity input. The three MFs are low, 
medium and high. The low velocity MF ranges 
from 0 to 0.4 m/s. The medium velocity MF 
ranges from 0.2 to 0.8 m/s and high velocity 
MF ranges from 0.8 to 1 m/s. 
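
The velocity ranges above can be turned into membership degrees as in the sketch below. It uses plain Python rather than Scikit Fuzzy to stay dependency-free. The text only gives the range of each MF, so the shapes used here (a shoulder at 0 m/s for low, a triangle peaking at 0.5 m/s for medium, a shoulder at 1 m/s for high) are assumptions, as are the function names.

```python
def trapmf(x: float, a: float, b: float, c: float, d: float) -> float:
    """Trapezoidal membership: rises from a to b, is 1 from b to c, falls from c to d."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


def velocity_memberships(v: float) -> dict:
    """Fuzzify a vehicle velocity in m/s (MF shapes are assumed, ranges follow the text)."""
    return {
        "low": trapmf(v, 0.0, 0.0, 0.0, 0.4),
        "medium": trapmf(v, 0.2, 0.5, 0.5, 0.8),
        "high": trapmf(v, 0.8, 1.0, 1.0, 1.0),
    }


print({k: round(m, 2) for k, m in velocity_memberships(0.3).items()})
# {'low': 0.25, 'medium': 0.33, 'high': 0.0}
```

A velocity of 0.3 m/s thus belongs partly to the low set and partly to the medium set, which is the shared membership that fuzzification relies on.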


3.1.2 Probability of acoustic sensor failure

Stovner et al. (2017) demonstrate the use of an underwater acoustic sensor grid to aid the localization capabilities of AROVs. A grid of acoustic sensors is used to communicate with and track the position of the AROV during IMR operations. However, failure of one or more subsea acoustic sensors may result in inaccurate position, orientation and velocity estimates. Reliable measurements of vehicle position, orientation, and velocity are important to safely navigate in the subsea environment. Failure of acoustic sensors may lead to increased risk of collision or loss of the AROV. It is therefore important that AROVs use the available acoustic sensor information to make informed decisions.

In this paper, the acoustic grid consists of four acoustic transducers on the bed of the pool and two acoustic transducers on the AROV, as illustrated in Figure 7. The acoustic transducers placed on the bed of the pool can communicate with the transducers on the AROV. This results in an acoustic network with eight possible range (distance) measurements. A minimum of four range values is needed for the acoustic positioning system to be classified as reliable. If there are fewer than four range measurements, the resulting estimates (position, orientation and velocity) are assumed to be unreliable. In such scenarios, the acoustic localization system is categorized as failed, i.e., the probability of acoustic sensor failure is 1. Failure is defined as the termination of the ability of a functional unit to provide a required function, or operation of a functional unit in any way other than as required (IEC 61508 2009). Therefore, the acoustic sensor voting scheme is 4oo8 (four out of eight). In a failed state, it is vital that the safety envelope around the AROV increases in size.

Figure 4 illustrates the membership functions of the probability of acoustic sensor failure. A trapezoidal membership function (TRAMF) is used to signify that over a certain range of input values the membership is unity (1). As shown in Figure 4, the probability of acoustic sensor failure variable consists of three MFs, namely low, medium and high. The low MF ranges from probability 0 to 0.3. The medium MF ranges from probability 0.2 to 0.7 and the high MF ranges from probability 0.6 to 1.
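
The four-out-of-eight (4oo8) vote can be expressed as a simple count of available range measurements. The sketch below is illustrative only: the function name and the use of `None` for a missing measurement are assumptions, and the range values are made up.

```python
from typing import Optional, Sequence


def acoustic_positioning_reliable(ranges: Sequence[Optional[float]], k: int = 4) -> bool:
    """k-out-of-n vote: reliable if at least k range measurements are available."""
    available = sum(1 for r in ranges if r is not None)
    return available >= k


# Eight possible ranges (4 pool transducers x 2 vehicle transducers); None = no measurement.
ranges = [2.1, None, 3.4, None, None, 1.8, None, None]
reliable = acoustic_positioning_reliable(ranges)
print(reliable)  # False: only three ranges are available

# When the vote fails, the text sets the probability of acoustic sensor failure to 1.
pasf = None if reliable else 1.0
print(pasf)  # 1.0
```

When the vote succeeds, the probability would have to come from another source, which is why `pasf` is left undefined in that branch.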


3.1.3 Time to collision

In the automotive and aviation industries, the time remaining for the vehicle to collide with an obstacle is used to suggest collision avoidance maneuvers. The term time to collision (TTC) is used to convey the criticality of the collision scenario. The lower the TTC, the greater the risk of collision with the obstacle. In the Traffic Collision Avoidance System (TCAS), the value of TTC is utilized to calculate the criticality of the obstacle (US Department of Transportation and Federal Aviation Administration 2011).


Figure 4. Membership functions for probability of acoustic sensor failure.


Hegde et al. (2016) apply the TTC as a risk indicator
of collision along a given AROV path. The TTC can be
classified as an operational parameter, in that it
changes as the vehicle velocity and the distance to the
obstacle vary. As AROVs are required to navigate
through the subsea infrastructure, they may face many
obstacles in their path. The TTC risk indicator can aid
in classifying critical obstacles through continuous
monitoring. The TTC is calculated using Equation 1:


$$\text{Time to collision} = \frac{\text{Distance to obstacle}}{\text{Resultant velocity of AROV}} \qquad (1)$$


A recommended standard for autonomous subsea IMR
also highlights the need for monitoring all existing
obstacles in the vicinity of the subsea production
system (Germanischer Lloyd Aktiengesellschaft 2009).
Therefore, the inclusion of the TTC indicator in the
proposed FIS allows the AROV not only to monitor the
obstacles, but also to devise collision avoidance
behavior when the TTC of an obstacle falls below a
threshold value. In the proposed FIS, the threshold
values relate to the MFs. Three MFs are determined for
the TTC input variable, namely low, medium and high.
The low MF of TTC ranges from 0 to 3 s. The medium and
high MFs range from 2 to 5 s and 5 to 10 s,
respectively.
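
The TTC ratio can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the resultant velocity is taken as the magnitude of a North-East-Down velocity vector, a stationary vehicle is given an infinite TTC, and the function name `time_to_collision` is hypothetical.

```python
import math


def time_to_collision(distance_m, velocity_ned):
    """TTC in seconds: distance to obstacle over resultant AROV speed.

    velocity_ned is a (north, east, down) velocity tuple in m/s.
    """
    speed = math.sqrt(sum(v * v for v in velocity_ned))
    if speed == 0.0:
        # Assumption: a stationary vehicle never reaches the obstacle.
        return math.inf
    return distance_m / speed


if __name__ == "__main__":
    print(time_to_collision(10.0, (3.0, 4.0, 0.0)))
```

Because the ratio uses the resultant speed rather than the closing speed towards the obstacle, it is conservative for an obstacle that is not directly ahead.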


3.1.4 Size of safety envelope 

The output variable in the proposed FIS is the
size of the safety envelope. The safety envelope is
realized in the form of a cuboid. Therefore, the FIS
output (the size of the safety envelope) increases
or decreases uniformly along all three axes: North,
East and Down. The size of the safety envelope
is proportional to the velocity of the vehicle and the
probability of acoustic sensor failure (PASF) and
inversely proportional to the time to collision input.
In short, the safety envelope size increases when
vehicle velocity and PASF increase and decreases
when the TTC value increases. A large safety envelope
reflects a low safety margin, whereas a small safety
envelope reflects a high safety margin.

Figure 6. Membership functions for size of safety envelope.

Figure 6 illustrates the MFs for the FIS output 
variable. The size of the safety envelope is classi- 
fied into three MFs, namely small, medium and 
large. The MF for the small safety envelope ranges
from 0 to 4 m. The MF for the medium safety envelope
ranges from 3 to 7 m and the MF for the large
safety envelope ranges from 6 to 10 m.


3.2 Fuzzy rule set 


Once the fuzzy sets of the input and output variables
are determined, the next step is to define fuzzy rules
by combining the input and output variables using
logic statements. Table 1 lists the twenty-seven fuzzy
rules resulting from the three input variables (three
MFs each, i.e., 3 x 3 x 3 combinations). The fuzzy
logic operator AND is used to derive the inference
from the input variables.

Note that the input variables influence the output
variable in different ways. For example, a low TTC is
not favorable, as this would mean that the obstacle is
in close proximity to the AROV. On the other hand, low
vehicle velocity and PASF are favorable. The relative
importance of the inputs is not considered in this
paper. Weights are not allotted to the input variables
and therefore the fuzzy rule set does not favour
certain rules over others. All rules are given equal
importance.
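
The rule set of Table 1 can be sketched as a lookup table. The consequents below are copied from Table 1; taking AND as the minimum of the antecedent memberships and keeping the strongest rule per output set are common Mamdani conventions assumed here, since the text only names the AND operator. The names `RULES` and `fire_rules` are illustrative.

```python
from itertools import product

LEVELS = ("low", "medium", "high")

# Consequents of rules 1..27 in Table 1 order (VV varies slowest, TTC fastest).
CONSEQUENTS = (
    "large", "medium", "small",    # rules 1-3
    "large", "medium", "medium",   # rules 4-6
    "large", "large", "large",     # rules 7-9
    "large", "medium", "medium",   # rules 10-12
    "large", "medium", "medium",   # rules 13-15
    "large", "large", "large",     # rules 16-18
    "large", "large", "medium",    # rules 19-21
    "large", "large", "large",     # rules 22-24
    "large", "large", "large",     # rules 25-27
)

# Map (VV level, PASF level, TTC level) -> envelope size label.
RULES = dict(zip(product(LEVELS, repeat=3), CONSEQUENTS))


def fire_rules(vv, pasf, ttc):
    """Combine fuzzified inputs (dicts label -> degree) into output strengths.

    Assumptions: AND is the minimum, every rule has equal weight, and the
    strongest rule wins for each output label.
    """
    strength = {"small": 0.0, "medium": 0.0, "large": 0.0}
    for (l_vv, l_pasf, l_ttc), out in RULES.items():
        w = min(vv[l_vv], pasf[l_pasf], ttc[l_ttc])
        strength[out] = max(strength[out], w)
    return strength


if __name__ == "__main__":
    vv = {"low": 1.0, "medium": 0.0, "high": 0.0}
    pasf = {"low": 0.5, "medium": 0.5, "high": 0.0}
    ttc = {"low": 0.0, "medium": 0.0, "high": 1.0}
    print(fire_rules(vv, pasf, ttc))
```

Keeping the rules as data rather than nested conditionals makes it straightforward to add per-rule weights later, should the relative importance of the inputs be introduced.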


Table 1. Rule sets in the fuzzy inference system: vehicle
velocity (VV), probability of acoustic sensor failure
(PASF), time to collision (TTC).

Rule     Antecedent:                  Consequent:
number   VV & PASF & TTC              size of safety envelope

1        Low & Low & Low              Large
2        Low & Low & Medium           Medium
3        Low & Low & High             Small
4        Low & Medium & Low           Large
5        Low & Medium & Medium        Medium
6        Low & Medium & High          Medium
7        Low & High & Low             Large
8        Low & High & Medium          Large
9        Low & High & High            Large
10       Medium & Low & Low           Large
11       Medium & Low & Medium        Medium
12       Medium & Low & High          Medium
13       Medium & Medium & Low        Large
14       Medium & Medium & Medium     Medium
15       Medium & Medium & High       Medium
16       Medium & High & Low          Large
17       Medium & High & Medium       Large
18       Medium & High & High         Large
19       High & Low & Low             Large
20       High & Low & Medium          Large
21       High & Low & High            Medium
22       High & Medium & Low          Large
23       High & Medium & Medium       Large
24       High & Medium & High         Large
25       High & High & Low            Large
26       High & High & Medium         Large
27       High & High & High           Large


3.3 Defuzzification


The process of obtaining a crisp value from the fuzzy
output set is known as defuzzification. The Scikit
Fuzzy library supports numerous defuzzification
methods, such as centroid, bisector, mean of maximum
(mom), min of maximum (som) and max of maximum (lom)
(Warner et al. 2017). The centroid defuzzification
method is used in this paper because it provides
consistent crisp output values when compared to other
defuzzification methods within the uncertainty
constraints. The centroid defuzzification method
aggregates the areas under the output membership
functions activated by the fuzzy rules and calculates
the centroid of the combined area (Sivanandam et al.
2007).
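
A library-free sketch of centroid defuzzification is given below for illustration. It is not the scikit-fuzzy implementation: the output MFs are assumed to be trapezoids whose supports follow Section 3.1.4 (0 to 4 m, 3 to 7 m, 6 to 10 m) with assumed shoulder points, each MF is clipped at its rule strength, and the centroid is approximated by midpoint sampling. The names `OUTPUT_MFS` and `centroid_defuzz` are hypothetical.

```python
def trapmf(x, a, b, c, d):
    """Trapezoid: 0 outside (a, d), 1 on [b, c], linear on the two flanks."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# Supports follow the text; the shoulder points are assumed.
OUTPUT_MFS = {
    "small": (0.0, 0.0, 3.0, 4.0),
    "medium": (3.0, 4.0, 6.0, 7.0),
    "large": (6.0, 7.0, 10.0, 10.0),
}


def centroid_defuzz(strength, mfs=OUTPUT_MFS, lo=0.0, hi=10.0, steps=1000):
    """Centroid of the max-aggregate of output MFs clipped at their strength."""
    num = 0.0
    den = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * (i + 0.5) / steps  # midpoint of each slice
        mu = max(min(strength[label], trapmf(x, *abcd))
                 for label, abcd in mfs.items())
        num += x * mu
        den += mu
    if den == 0.0:
        raise ValueError("no rule fired, centroid is undefined")
    return num / den


if __name__ == "__main__":
    print(centroid_defuzz({"small": 1.0, "medium": 0.0, "large": 0.0}))
    print(centroid_defuzz({"small": 0.0, "medium": 1.0, "large": 0.0}))
```

With these assumed shoulders, a fully fired small set defuzzifies to roughly 1.76 m and a fully fired medium set to 5.0 m, which agrees with data points 1 and 4 in Table 2, although that agreement alone does not confirm the MF shapes used in the paper.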


4 LABORATORY SETUP AND TESTING 


Figure 7 illustrates the laboratory setup used to test
the dynamic safety envelopes. The Mission Orientated
Operating Suite (MOOS) middleware (Newman 2006) stores
and retrieves information from the AROV. Four acoustic
sensors are installed on the bed of the pool and two on
the AROV. The acoustic sensors provide the localization
measurements, such as the position of the obstacle and
the AROV, and the velocity and orientation of the AROV.
The localization module shares the data with MOOS. The
shared control module retrieves data from the
localization module via MOOS and utilizes it to control
the AROV either autonomously or via shared control
between the AROV and the AROV supervisor. The
communication to the AROV is established through an
umbilical.

The collision avoidance module posts and retrieves
collision data to and from MOOS. The two parts of the
collision avoidance module are the dynamic safety
envelope and the subsea traffic rules. The subsea
traffic rules are a set of assigned safe navigation
maneuvers that can be performed by the AROV to increase
the vertical and/or horizontal separation (distance)
from the obstacle (see Section V of Candeloro et al.
(2016)). The subsea traffic rules are developed based
on the collision regulations (COLREGs) in the maritime
industry and on the TCAS in the aviation industry. Each
octant in the Level 2 octree in Figure 1 is assigned a
subsea traffic rule to maximize vertical and/or
horizontal separation from the obstacle.


Figure 7. Laboratory setup to test feasibility of dynamic safety envelopes.


The pseudocode implemented to derive the dynamic
envelopes from the proposed FIS is listed in
Listing 1. The first step is to retrieve the velocity
and position variables from MOOS, followed by the
distance to the potential obstacle. Then the available
acoustic ranges are counted and a 4oo8
(four-out-of-eight) sensor voting scheme is used to
derive the PASF. The velocity of the AROV and the
distance to the obstacle are used to calculate the
TTC. When the input data are collected, they are routed
to the FIS as described in Figure 2.
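
The text does not spell out how the range count is mapped to a PASF value. One reading, offered here only as an assumption, is that the PASF is the fraction of the eight acoustic ranges that are unavailable, with localization considered valid while at least four ranges remain. This would be consistent with the PASF values reported in Table 2 (0, 0.12, 0.25, 0.38), which are close to multiples of 1/8. The function name and return convention below are hypothetical.

```python
TOTAL_RANGES = 8      # two AROV sensors x four pool-bed sensors
REQUIRED_RANGES = 4   # four-out-of-eight voting threshold


def pasf_from_ranges(available):
    """Return (pasf, vote_ok) for a count of available acoustic ranges.

    Assumption: PASF is the fraction of missing ranges; vote_ok tells
    whether at least REQUIRED_RANGES ranges are still available.
    """
    if not 0 <= available <= TOTAL_RANGES:
        raise ValueError("available must be between 0 and 8")
    pasf = (TOTAL_RANGES - available) / TOTAL_RANGES
    vote_ok = available >= REQUIRED_RANGES
    return pasf, vote_ok


if __name__ == "__main__":
    for n in (8, 7, 5, 3):
        print(n, pasf_from_ranges(n))
```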

The FIS computes the new size of the safety 
envelope and publishes the new size of the safety 
envelope to MOOS. The 3D renderer updates 
safety envelopes to the new size and the detec- 
tion algorithm updates the potential detection 
volume. 


Listing 1: Pseudocode of FIS implementation in
the underwater collision avoidance system

# Initialization
Get position, velocity of AROV and obstacle
Get count of acoustic sensor ranges
Get envelope size

# Dynamic safety envelope
Set MFs for VV, PASF, TTC
Set fuzzy rule set
Compute PASF and TTC
Compute new size of safety envelope (FIS)
Update envelope size to new size
Update detection volume to new size

# Collision detection
Make empty list collisions
If new envelope collides with world:
    for each octant in new envelope:
        If octant collides with world:
            append octant to collisions
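
The collision detection step of Listing 1 can be illustrated with axis-aligned boxes. This sketch makes several assumptions that are not in the paper: the envelope is a cube whose edge length is the FIS output, it is split once into eight octants (the paper refers to a Level 2 octree), obstacles are axis-aligned boxes in North-East-Down coordinates, and touching faces count as a collision. All function names are hypothetical.

```python
from itertools import product


def boxes_overlap(a, b):
    """a and b are ((min_n, min_e, min_d), (max_n, max_e, max_d)) boxes."""
    return all(a[0][i] <= b[1][i] and b[0][i] <= a[1][i] for i in range(3))


def envelope_octants(center, size):
    """Split a cube of edge `size` centred on `center` into 8 octant boxes."""
    h = size / 2.0
    octants = []
    for signs in product((-1, 1), repeat=3):
        lo = tuple(c + min(0.0, s * h) for c, s in zip(center, signs))
        hi = tuple(c + max(0.0, s * h) for c, s in zip(center, signs))
        octants.append((lo, hi))
    return octants


def detect_collisions(center, size, obstacles):
    """Return the octants of the envelope that overlap any obstacle."""
    h = size / 2.0
    envelope = (tuple(c - h for c in center), tuple(c + h for c in center))
    collisions = []
    # Cheap test on the whole envelope before checking individual octants.
    if any(boxes_overlap(envelope, ob) for ob in obstacles):
        for octant in envelope_octants(center, size):
            if any(boxes_overlap(octant, ob) for ob in obstacles):
                collisions.append(octant)
    return collisions


if __name__ == "__main__":
    obstacle = ((1.0, 1.0, 1.0), (3.0, 3.0, 3.0))
    print(detect_collisions((0.0, 0.0, 0.0), 4.0, [obstacle]))
```

Knowing which octant is hit is what allows a direction-specific avoidance maneuver to be selected, rather than a single generic response.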


5 RESULTS 


In Figure 8, the vehicle velocity is 0.02 m/s
(MF = low), the PASF is 0 (MF = low) and the TTC is
449.12 s (MF = high). These inputs result in
application of Rule 3. The resulting size of the safety
envelope is 1.76 m (MF = small). In Figure 9, the
vehicle velocity is 0.03 m/s (MF = low), the PASF is
0.38 (MF = medium) and the TTC is 391.01 s (MF = high).
These inputs result in application of Rule 6. The
resulting size of the safety envelope is 5.00 m
(MF = medium). Observations from the laboratory tests
are listed in Table 2.


Figure 8. Rendering of data point 1.


Figure 9. Rendering of data point 4.


Table 2. Observations from laboratory tests.

Data     VV                TTC       Size of safety
point    (m/s)    PASF     (s)       envelope (m)

1        0.02     0        449.12    1.76
2        0.24     0        22.12     2.49
3        0.63     0        16.52     5.00
4        0.03     0.38     391.01    5.00
5        0.25     0        18.52     2.58
6        0.20     0.12     12.69     1.83
7        0.33     0.25     5.68      3.68
8        0.36     0.25     10.56     4.11


6 DISCUSSION 


This section discusses the lessons learned from the
process and the implications of the proposed dynamic
safety envelopes for industrial applications.

The key drivers in the development of safety envelopes
in the aviation and maritime industries (TCAS, Ship
Domain) are asset/personnel safety and the ability to
design intelligent collision avoidance systems. By
fusing active and passive sensor technologies, the
aviation, maritime (surface vehicles) and automotive
industries currently utilize dynamic safety envelopes.
In comparison, current remotely operated vehicles are
controlled by human operators. In the future, AROVs
will also need to be able to make decisions both in the
presence and in the absence of human operators. The
development of dynamic safety envelopes can be seen as
the first step towards ensuring asset safety of AROVs
by identifying, assessing, and mitigating the risk of
underwater collisions.

As AROVs are applied to inspect and repair subsea
production systems, offshore aquaculture systems and
offshore wind turbines, and to facilitate subsea
mining, asset safety of AROVs is vital. The proposed
process to build dynamic safety envelopes using fuzzy
logic allows the system developers to tweak the
membership functions and fuzzy rule sets according to
their respective industrial requirements. This allows
for application-specific dynamic safety envelopes. For
example, requirements for dynamic safety envelopes for
subsea IMR operations and subsea mining operations may
vary, as the latter are more vulnerable to seabed
collisions than to collisions with man-made subsea
structures.

Use of fuzzy logic (expert-based systems) ensures that
the system developers can understand the inherent
behaviour of the system under different input
conditions. However, two limitations of fuzzy logic in
engineering applications can be highlighted. First,
fuzzy logic is a form of deductive reasoning, i.e., it
concludes on a specific truth by using generic inputs
(Ross 2009). An example of such reasoning is: the
ground is wet (input); therefore, it must be raining
(truth). Second, the subjective nature of defining the
membership functions of the fuzzy variables and
deriving the fuzzy rule set can be challenging. This is
true for any technical system where experts are needed
to provide input and may disagree with each other’s
judgment. For example, in the proposed fuzzy rule set,
weighting of the different inputs is not implemented.
All rules and input values are given the same
importance. In the future, a modification may be
necessary to include the relative importance of the
input variables. Is the TTC more important than PASF
and VV, or vice versa?

The proposed dynamic safety envelopes are highly
dependent on the availability of reliable sensor
measurements. The laboratory setup used in this paper
consists of a grid of six acoustic sensors providing
eight range measurements. Measurements from the
acoustic sensors were used as the primary input to
calculate the orientation, velocity and position of the
AROV and the obstacle (passive sensor grid). However,
the advantage of the proposed dynamic envelopes is that
they can be easily modified to include input from
either passive or active sensors. For example, active
sensors, such as sonar and LiDAR, can detect both known
and unknown obstacles. This is possible because of the
underlying architecture of the implemented 3D rendering
program, which allows for scalability. In addition, if
the sensor module comprises redundant sensors, failure
of one sensor type can be tolerated by the overall
collision avoidance system. For example, if the
acoustic position sensor fails to measure the depth of
the AROV, measurements from a dedicated depth sensor
can still provide a reliable source to the proposed
FIS.


7 CONCLUSIONS 


This paper proposes a novel approach to devel- 
oping dynamic safety envelopes for autonomous 
remotely operated vehicles (AROVs). A proof- 
of-concept of the dynamic safety envelope is pre- 
sented in this paper. 


The proposed dynamic safety envelope was 
developed by using a fuzzy inference system (FIS) 
to adapt the size of the safety envelope. Three fuzzy 
input variables were used in the FIS, namely vehi- 
cle velocity, probability of acoustic sensor failure 
and time to collision. The FIS was implemented in
an existing underwater collision avoidance system. 
Observations from the laboratory tests performed 
to verify the feasibility of dynamic safety envelopes 
are presented. Results show that the AROV safety 
envelope can increase or decrease in size depend- 
ing on the three input variables. This allows the 
AROV to decrease or increase the obstacle detec- 
tion area in a highly uncertain and sensitive subsea 
environment. 

In the presence of uncertainty, visualizations of
obstacles that pose a risk of collision to the AROV may
aid the situation awareness of human operators. The
size of the safety envelope can be used to make
decisions related to maneuvering of the AROV, either
autonomously by the AROV or remotely by the human
operator. To safely maneuver AROVs during collision
scenarios, further development and testing are required
to implement dynamic safety envelopes together with the
subsea traffic rules.
ABSTRACT: The urgency of applying more ethical approaches to reach a sustainable low-carbon
economy by 2050 requires implementing a model of sustainable mobility that embraces social,
economic and industrial activities. In this horizon, behavioural change can produce considerable
benefits whose effects are distributed over a wider period. In this direction, the GamECAR project aims at
provoking a change in driving style towards a more sustainable and efficient use of private cars by applying
gamification. However, safety must not be affected by the introduction of new technologies, tools or
systems that are not essential for the driving task. This paper presents a methodology for the control of
the inference of a smartphone application suggesting eco-driving hints to the driver, on the basis of the
dynamic assessment of the risk exposure embedded in the current situation. Real-time measurements of
physiological, behavioural and car performance parameters are combined with data-driven driver models
to determine the safe communication of eco-driving suggestions to the driver. The methodology builds
upon the structured approach to operational safety initially applied in aviation and its adaptation to the
road environment during the XCYCLE Project (Funded by the Horizon 2020 Framework Programme of
the European Union, Grant n° 635975).


1 INTRODUCTION 


Road transport is troubled by two major issues:
environmental pollution induced by cars and serious
injuries to people from traffic crashes. A common
denominator of these problems is an aggressive driving
style. Studies found that the driver’s style can affect
fuel consumption by up to 35%. Hard acceleration and
braking, excessive speed and open windows result in
higher emission rates from a vehicle compared with a
calmer driving style.

We are now in a transitional phase, where electric and
hybrid vehicles are becoming more popular, but they are
still not sufficiently widespread to generate a
significant effect. Eco-driving can become an
intermediate tool for inducing a more sustainable and
effective use of current carbon-fuelled private
vehicles, and it can also inspire a behavioural change
that will be beneficial across generations.

Eco-driving has been defined as a decision-making
process that significantly affects the fuel economy and
emission intensity of a vehicle, reducing its
environmental impact. Ecological and economic benefits,
but also road safety and social benefits, can be
derived from adopting eco-driving conduct. However,
changing the behaviour of a driver seems to be a
challenging task. For example, the social context in
which the driver is operating can act as a mediating
factor of behaviour.

In the context of the GamECAR Project (Funded by the
Horizon 2020 Framework Programme of the European Union,
Grant n° 732068), a Decision Support System module
embedding personalized driver models will combine the
calculation of an eco-driving score with the dynamic
evaluation of the risk associated with the context in
which the action is conducted. The most appropriate
eco-related, personalized suggestion will be
communicated to the driver if the operational risk is
acceptable. This paper presents the methods adopted to
combine real-time measurements of physiological,
attitudinal and car performance parameters with
data-driven driver models, to determine the safe
communication of eco-driving suggestions to the driver.
The methodology builds upon the structured approach to
operational safety initially applied in aviation and
its adaptation to the road environment during the
XCYCLE Project (Funded by the Horizon 2020 Framework
Programme of the European Union, Grant n° 635975).
2 THE FOUNDING CONCEPTS


Some keywords inspired the creation of the 
model. 

Dynamicity: the road is, by nature, an ever- 
changing, dynamic environment. In such contexts, 
the discrepancy between the time available and 
time required to make a decision generates time 
pressure. Unfortunately, cognitive models of driv- 
ing behaviour still fail to take into account the 
dynamic context that the road environment offers. 
GamECAR focuses very much on real-time and 
context dependent information, through which it 
can suggest new modelling parameters for a more 
realistic scheme. 

Change: dynamicity causes revolution. Managing
dynamic traffic scenarios is a complex exercise,
and the task of management is itself dynamic,
varying as the demands of the situation increase or
decrease. The driver’s capability also varies and affects
the driver state. For example, the variety of stim- 
uli could affect the driver’s attention by causing 
distraction or by numbing his/her reaction times. 
Moreover, an impaired or fatigued driver may 
not be sufficiently responsive to the changing sur- 
rounding conditions. 

Change is also an innate element of the social 
context in which the driver operates and can repre- 
sent a mediating factor of behaviour. 

Change is a key word for the GamECAR project 
itself, as one of its most important objectives is to 
provoke a behavioural revolution towards a more
sustainable driving style and general attitude.

Interactive: the traffic system is composed of three
main actors, namely road users, vehicles and the road
environment. The frequent and heterogeneous
interactions introduced by traffic impose demands on
drivers, increasing the complexity of the activity.
Considering road transport, at least three elements
compose the scenario: different road users, their means
of transport and the environment in which the driving
task is conducted.

Safety & Ecology: previous research demonstrated that
adopting eco-driving behaviours not only induces
economic paybacks, but also contributes to road safety.
In this respect, it is essential to ensure that any
just-in-time feedback does not compromise the safety of
its user by increasing his/her cognitive workload and
shifting attention from the actual driving act.


3 A NEW USE FOR DYNAMIC RISK
ASSESSMENT OUTCOMES


An adequate, modern and complete risk model for road
transport must include all the above elements as well
as their mutual conditionings.

Beyond facing the challenge of designing a risk model
capable of representing all members and their
interactions through a sufficiently lean structure, the
proposed model, driven by the GamECAR project’s
requirements, adds a significant novelty in the
exploitation of risk assessments. Traditionally,
results from risk analysis, whether dynamic or not,
proactive or retrospective, served to inform the user
about the undesirable situation just experienced or
likely to be encountered, in order to highlight the
need for mitigation actions. In GamECAR, the output of
the risk analysis will feed information to the
platform, so as to decide whether or not it is
appropriate to present the eco-driving information to
the driver. The decision depends on the context in
which the driving action is conducted, the driver’s
current state and personal traits, and the driving
performance expressed by vehicle parameters. The risk
analyses aim to prevent new tools, such as the GamECAR
display, from overloading the driver with unnecessary
information and distracting him/her from the primary
task, despite their noble intentions. The user will not
be warned that s/he is going to experience a hazardous
situation (e.g., through a risk indicator whose colour
communicates the level of risk), nor about what this
circumstance is (e.g., the hazard’s name and
description), nor about the consequences s/he might
face (e.g., an icon depicting the possible negative
outcome). S/he will not be conscious that a risk
analysis is performed, but the effects of these
background operations will be essential to maintain
his/her focus on the primary task and to induce his/her
behavioural change through a safe process.


4 STARTING POINTS 


The GamECAR risk model builds upon the research
conducted in previous EU projects, namely A-PiMod
(FP7/2007-2013, contract number 605141) and XCYCLE
(H2020/2015-2018, contract number 635975). The former
aimed at supporting flight management when hazards and
unexpected conditions are encountered, through the
design of a new adaptive automation concept based on a
hybrid of three elements: multi-modal pilot
interaction, an adaptive distribution of tasks between
flight crew and automation, and real-time risk
assessment. The latter focuses on the development of
technologies to improve active and passive detection of
cyclists, as well as systems informing both drivers and
cyclists of a hazard at junctions. The GamECAR model is
sketched from the progressive experiences and
breakthroughs in the aviation domain, in which a
structured approach to safety has been applied for
decades, and from the adaptation of such an approach to
a different sector, namely road transport.


5 RISK MODEL COMPONENTS 


The GamECAR model is composed of three
components: a LUT (Look Up Table), a calculus
engine and a risk matrix (Figure 1).

The LUT represents a database containing static 
data, such as driver’s name, age, residence, driving 
experience and eco-attitude, but also the list of 
hazards characterizing road types, such as straight 
lines or T-junctions. 

Personal information can be inserted by the 
driver in the Profile section of the GamECAR 
system, either via the application or via the web 
platform. His/her eco-attitude is assessed through 
specifically designed questionnaires, which attempt 
to deduce also other personality traits, such as 
driver aggressiveness. The driver can decide to retake
a questionnaire, updating his/her profile, while the
GamECAR developers could add new questions 
to refine the user modelling. 

The list of hazards is generated by GamECAR 
researchers on the basis of retrospective analysis 
of incidents and accidents related to specific road 
layouts as well as through expert judgment. 

For example, some inputs may derive from the study of
pre-fatal crash manoeuvres identified in the early
stages of the XCYCLE project (Fruhen et al. 2015).
Threats and consequences are the other elements
attached to the hazards, and they contribute to
improving the description of the potentially harmful
situation (event) under examination.

Figure 1. GamECAR risk module’s components.


A threat is an external element that may generate 
or increase the effects of hazards, thus affecting the 
margins of safety and impacting on the develop- 
ment of incidents and accidents. A consequence is 
any possible measurable outcome, whether adverse 
or beneficial, resulting from the evolution of 
threats and hazards. Each element is represented 
in the LUT by variables, which are quantifiable 
by measurable parameters, in order to calculate 
the probability of occurrence or the weight of 
the influencing factors in the current situation. 
Every combination of threat(s), hazard and
consequences generates a single incident sequence,
which is expressed through its probability of
occurrence, or risk.

Even if the LUT contains only static data,
its content can be updated periodically thanks to 
new research findings, newly discovered hazards or 
edited personal data. 

The assessment engine runs two distinct functions,
named DIL (Driver Impairment Level) and ERE
(Environmental Risk Exposure). These functions are
embedded in the GamECAR application. Therefore, every
time a driver activates the GamECAR app and monitors
his/her eco-driving performance, these functions
dynamically evaluate the risk associated with the
driving task. Their inputs derive both from the LUT and
from the data measured in real time through the GamECAR
sensors, such as a wearable device for the monitoring
of physiological parameters and the OBD-II (On-Board
Diagnostics) device reporting on vehicle performance.

Because DIL and ERE are two independent functions,
modelling two different aspects of the overall driving
activity, they must be combined in a specific matrix to
obtain a single outcome. The matrix expresses the
riskiness of distracting the driver by adding
information, together with its tolerability level.
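
The combination of the two functions can be pictured as a table lookup. The sketch below is purely illustrative: it assumes that both DIL and ERE are expressed on a 0 to 1 scale, and the band limits, the cell values of the matrix and all names (`band`, `TOLERABILITY`, `may_show_eco_hint`) are hypothetical placeholders rather than the GamECAR matrix.

```python
def band(value, low_max=0.33, high_min=0.66):
    """Classify a 0..1 score as low, medium or high (limits are assumed)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    if value < low_max:
        return "low"
    if value < high_min:
        return "medium"
    return "high"


# Hypothetical matrix keyed by (DIL band, ERE band).
TOLERABILITY = {
    ("low", "low"): "acceptable",
    ("low", "medium"): "acceptable",
    ("medium", "low"): "acceptable",
    ("medium", "medium"): "tolerable",
    ("low", "high"): "tolerable",
    ("high", "low"): "tolerable",
    ("medium", "high"): "unacceptable",
    ("high", "medium"): "unacceptable",
    ("high", "high"): "unacceptable",
}


def tolerability(dil, ere):
    """Look up the tolerability level of adding information for the driver."""
    return TOLERABILITY[(band(dil), band(ere))]


def may_show_eco_hint(dil, ere):
    """Show an eco-driving hint only when the combined risk is acceptable."""
    return tolerability(dil, ere) == "acceptable"


if __name__ == "__main__":
    print(tolerability(0.1, 0.5), may_show_eco_hint(0.1, 0.5))
```

A lookup of this kind keeps the decision transparent: changing the policy means editing cells of the matrix, not the assessment functions themselves.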

The following three sections will provide more 
insight into: 1) the human factors aspects, through 
the evaluation of the Driver Impairment Level; 2) 
the driving context, i.e., the means of transport and
the surrounding environment; and 3) the matrix, 
discussing the meaning of its cells and how the 
tolerability level could be used to tailor the eco- 
message to display to the driver. 


5.1 The DIL function 


As stated before, the major peculiarities of the road
environment are represented by three specific elements,
namely: the driver in control of the action, the
context in which the driving action is conducted and
the means of transport. The DIL function deals with the
first one, i.e., the person controlling (driving) a
technological tool (i.e., the vehicle). According to
Carsten (2007), two broad types of models of behavior
of the human-in-control can be distinguished in the
literature. The
first type is defined as “descriptive model” and it 
attempts to describe parts or the whole of a task in 
terms of what the operator has to do. The second 
one is named “motivational model” and it aims to 
describe how keen the operator is in managing risk 
or abnormal situations. 

In GamECAR we concentrate on the second 
type, by developing a predictive and context-aware 
model, able to reproduce the effect of a dynamic 
environment on the driver’s performance capabili- 
ties. In fact, if task demand exceeds the driver’s 
capabilities, task overload is experienced, resulting 
in high workload (e.g., De Waard, 1996). Given 
that driving task difficulty fluctuates in a dynamic 
environment, instantaneous driver workload also 
fluctuates. As driving demand changes, driver
capabilities also vary with driver state (for example,
impaired, fatigued or distracted).

The model is based on five categories of driver 
capability, performance and behavior, which are 
related to safety. The outcome is the probability of 
error making generated by human related factors 
and possible operational conditions. 

The Driver Impairment Level (DIL) focuses 
on the human related aspects and environmen- 
tal dynamic circumstances that favour the pos- 
sibility of error making. Five parameters have 
been selected as the most essential quantities that 
influence people’s motivational aspects. They are
not claimed to represent either an exhaustive or an
independent set of quantities that characterize
human behaviour. However, they
are a first instantiation of several theoretical and 
applied research studies and the main charac- 
teristics of this modelling approach will not be 
altered by the implementation of different sets of 
parameters. 

The mechanism to describe DIL attempts to combine
dynamically changing conditions, expressed via Driver
State (DS), Situational Awareness (SA) and
Environmental Conditions (EC), with more static
quantities that reveal the driver’s
Experience/competence (EXP) and Attitudes/personality
(ATT) (Cacciabue 2010).

In more detail:


• Experience/competence (EXP). This parameter accounts
  for age and the number of years the driving licence
  has been held. Skills can be developed through
  practice; when a skill becomes automatic it is no
  longer consciously controlled, and more cognitive
  resources may be devoted to managing unanticipated
  events (such as emergency events) (static parameter);

• Attitudes/personality (ATT). Attitude and personality
  are typical individual traits that result in
  interpersonal differences and specific behavior. In
  the context of GamECAR it refers to the ecological
  and sustainable approach the driver applies not only
  in driving, but in his/her everyday life. The value
  is derived from a specifically designed questionnaire
  and it is updated through the averaged eco-score
  calculated by GamECAR over the different trips the
  driver has played. The driver is asked to fill in the
  form at his/her first access to the application and
  s/he has the opportunity to edit the responses at any
  time. His/her eco-score is measured every time s/he
  performs a trip with the GamECAR App on and it is
  stored in his/her personal section of the Virtual
  Community Platform (quasi-static parameter);

• Situation awareness (SA). SA is a very important
  parameter, as it represents “the perception of the
  elements in the environment within a volume of time
  and space, the comprehension of their meaning and the
  projection of their status in the near future”
  (Endsley 1994) (dynamic parameter);

• Environment Conditions (EC). Since the surrounding
  conditions might affect efficiency also at cockpit
  level, environmental characteristics should not be
  entirely disregarded, but included as descriptors of
  the operational context. Inputs can come either from
  sources external to the GamECAR system (e.g., weather
  applications) or from GamECAR sensors, such as the
  vehicle OBD-II (dynamic parameter);

• Driver state (DS). The Driver State refers to those
  psycho-physiological variables that may affect task
  performance either permanently (e.g., physical/
  cognitive impairments) or temporarily (e.g., impaired
  performance due to fatigue or drowsiness) (dynamic
  parameter).


Driver’s bio signals are monitored via wearable
devices that provide measurements of heart rate,
respiration rate and muscle activity, and transmit
them to the GamECAR App via Bluetooth technology. The
physiological data are correlated to the driver’s
psychological parameters, such as SA and DS. For
example, by detecting the heart rate and setting proper
thresholds, we can derive the driver’s level of anxiety
or calm. This data source and quantification process
are among GamECAR’s main innovations.

Apart from the Environmental Conditions 
parameter, all others reflect individual traits. 

All parameters vary between 0 and 1, with the
meaning expressed in Table 1.

Table 1. DIL parameters and their ranges.

Parameter  0          0.25            0.5       1
EXP        Novice     Low             Medium    High
ATT        Bad        Poor            Medium    Good
SA         Full       Slightly degr.  Degraded  Loss
EC         Good       Regular         Poor      Bad
DS         Very good  Slightly degr.  Degraded  Bad

From purely logical reasoning, the DIL is
assumed to depend on these five parameters as follows:

$$\mathrm{DIL}_{(\mathrm{Driver},t)} = 1 - e^{-\left[\sigma + (2 - \lambda)\right]/4}$$

where:

$$\sigma = SA + DS + EC$$
$$\lambda = ATT + EXP$$

Note that:

• best driver dynamic conditions ⇒ $\sigma = 0$

• best driver static conditions ⇒ $\lambda = 2$

• if ATT = EXP = 1 (best static driver conditions)
and SA = DS = EC = 0 (best dynamic conditions),
then DIL = 0, i.e., the Driver Impairment Level
is null.


The simple, basic hypothesis is that a worsening
of any parameter corresponds to an increase in
the DIL value.
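
As a minimal sketch of this dependency, the snippet below assumes that the DIL takes the form 1 - exp(-(σ + (2 - λ))/4), with σ = SA + DS + EC and λ = ATT + EXP. This form is a reading of the printed formula that agrees with the boundary conditions stated above; the function name and the input check are illustrative assumptions, not part of the GamECAR software.

```python
import math


def driver_impairment_level(exp, att, sa, ds, ec):
    """Return the DIL for five parameters, each expected in [0, 1].

    Assumed form: DIL = 1 - exp(-(sigma + (2 - lam)) / 4), where
    sigma = SA + DS + EC (dynamic part) and lam = ATT + EXP (static part).
    """
    for name, value in (("EXP", exp), ("ATT", att), ("SA", sa),
                        ("DS", ds), ("EC", ec)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    sigma = sa + ds + ec   # 0 = best dynamic conditions
    lam = att + exp        # 2 = best static conditions
    return 1.0 - math.exp(-(sigma + (2.0 - lam)) / 4.0)


# Best case: experienced, well-disposed driver in ideal dynamic conditions.
best = driver_impairment_level(exp=1, att=1, sa=0, ds=0, ec=0)
# Worst case: every parameter at its most unfavourable value.
worst = driver_impairment_level(exp=0, att=0, sa=1, ds=1, ec=1)
```

Under this assumed form the DIL is 0 in the best case and stays below 1 in the worst case, since the exponent is bounded by 5/4.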


5.2. The ERE evaluation 


The Environmental Risk Exposure (ERE) analysis
mainly takes into account the other two elements
of the road environment, i.e., the scenario and the
means of transport. For each of them, it retrieves
inputs from external sources and merges them into
a safety assessment of the situation. This data
extraction is performed by the sensors used in the
GamECAR project, which are on the vehicle, in the
smartphone, and worn by the driver. The vehicle
data are provided via an OBD-II device connected
directly to the CAN bus of the car. Real-time
information includes: active transmission gear,
fuel type, consumption and tank level, throttle
position, engine RPM, vehicle speed, and other data.

Moreover, the sensors integrated in any modern
smartphone, such as the linear accelerometer and
the GPS receiver, make it possible to collect
essential information on the location of the
vehicle (i.e., longitude and latitude) and on
its acceleration.

Finally, other meaningful information about the
environment is gathered from cloud services, e.g.,
street maps, traffic, and weather conditions.


The ERE builds upon three components: an internal
database (the LUT), a Data Organizer module that
retrieves and manages the raw external inputs, and a
Risk Engine, which performs the dynamic calculation
for evaluating the exposure to hazards at a
particular, forthcoming moment in time.

Standard incident sequences, based on previous
retrospective analyses, are stored in the LUT.
They are described by the well-known Bow-Tie
logic, in which the threats (or factors) leading to
a central hazard are linked on its left-hand side,
and the consequences representing the final result
of the possible incident sequence are connected on
its right-hand side. At the beginning, an absolute
probability value is given to each hazard; moreover,
static default values are allocated to each factor,
representing the baseline for the subsequent
calculation steps. Due to the scarcity of recorded
data about road transport, compared to the potential
availability, these values have been assigned on
the basis of the available literature and, mostly,
through expert judgment.

In order to find a suitable trade-off between the
theoretical framework, which holds that several
consequences can derive from the same hazard,
and real-life practice, where possible incident
sequences share many commonalities and are
difficult to differentiate with sufficiently
accurate evidence, we decided to link only one
direct consequence to each hazard.

Another important database section includes a
list of the road types, and a more detailed mapping
of the road sections and junction types, exploiting
open source maps. To each element, a value
expressing its harmfulness potential is assigned,
determined from retrospective evaluations of
incidents that occurred in that particular spot.
Hazards are linked to specific crossroads or streets
on the basis of the possible occurrence of the
associated direct consequence.

If the user sets the itinerary of the trip in
advance, the Risk Engine can analyse the specific
route at a finer resolution, because the required
map information can be pre-loaded and used more
intensively, thus providing more frequent and
more refined outcomes.

Data in the LUT are static, but their quality
can be refined and the dataset can be enriched
with the aim of covering previously unforeseen
scenarios.

The preparation of the database is executed
offline and, when completed, becomes the baseline
for the real-time calculation. The Data Organizer
module collects as many inputs as possible among
those previously described from the on-vehicle
sensors and the external sources, and serves
them to the Risk Engine. The online risk calculation
dynamically modifies the values in the LUT,
transforming the basic probability of each hazard,
$P_h$, into an adapted value, $P'_h$, which
represents the exposure to risk the user experiences
while driving. The Risk Engine is supported by a
function $\eta$ (eta) that is responsible for
updating the probability in accordance with the
characteristics of the current scenario.


$$\eta = \frac{c}{1 + b \, e^{-\gamma}}$$

where:

$$\gamma = \frac{\sum_{i=1}^{n_f} \alpha_i \, l_i}{nn_f} \cdot \mu$$


c and b are two constants that keep the function
from exceeding the range’s limit when it is
multiplied by the hazard probability value;


• $l$ is a vector containing the dynamic values
calculated by exploiting the data received from the
Data Organizer;

• $\alpha$ is a matrix containing the weighting
values for each specific factor linked to the
hazard $h$;

• $n_f$ is the number of evaluated factors, while
$nn_f$ is the number of non-null factors calculated
during the real-time assessment;

• $\mu$ accounts for the risk of each specific hazard
and personalizes the weight of the factor for
different road configurations.


In order to transform the basic probability of
the hazard $h$ into a contextualized one, the
following formula is applied:


$$P'_h = P_h \cdot \eta$$


This final outcome is a value between 0 and 1.
It is combined with the DIL value in a Risk
Matrix designed for the purpose.
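
The update step can be sketched as follows, assuming that the function eta is a logistic curve c / (1 + b·exp(-γ)) and that γ is the weighted sum of the dynamic factor values, divided by the number of non-null factors and multiplied by the hazard-specific term. The sign of the exponent, the default constants b = 1 and c = 2, and all names and numbers below are assumptions made for illustration; they are not values from the GamECAR LUT.

```python
import math


def eta(alpha, l_values, mu, b=1.0, c=2.0):
    """Assumed logistic update factor eta = c / (1 + b * exp(-gamma)).

    alpha:    weights of the factors linked to one hazard (illustrative)
    l_values: dynamic factor values supplied by the Data Organizer
    mu:       hazard- and road-specific multiplier
    b, c:     constants; the defaults are illustrative only
    """
    if len(alpha) != len(l_values):
        raise ValueError("alpha and l_values must have the same length")
    weighted_sum = sum(a * v for a, v in zip(alpha, l_values))
    non_null = sum(1 for v in l_values if v != 0)  # factors actually observed
    gamma = (weighted_sum / non_null) * mu if non_null else 0.0
    return c / (1.0 + b * math.exp(-gamma))


def adapted_probability(p_basic, alpha, l_values, mu, b=1.0, c=2.0):
    """Scale the basic hazard probability by eta for the current scenario."""
    return p_basic * eta(alpha, l_values, mu, b, c)


# No factor observed: eta equals 1 and the basic probability is unchanged.
p_no_input = adapted_probability(0.1, alpha=[0.5, 0.5], l_values=[0.0, 0.0], mu=1.0)
# One active factor raises the exposure above the baseline.
p_one_factor = adapted_probability(0.1, alpha=[0.5, 0.5], l_values=[1.0, 0.0], mu=1.0)
```

With these illustrative constants eta stays between 1 and 2 for non-negative γ, so the sketch does not by itself keep the product within [0, 1]; in the model that role is played by the choice of b and c.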


6 RISK MATRIX 


Given that DIL and ERE are two independent
functions, they must be combined in a matrix
in order to obtain a single, merged “risk” value.
The basis is the well-known risk matrix proposed
by ICAO Doc. 9859 (Third Ed.), improved by
assigning a numerical index to each cell, which
makes it possible to compare the hazards associated
with the same location and to rank them.

The tolerability range is also expressed by a
five-colour code, in order to prevent the risk index
from falling too often into a critical area, which
might induce misunderstanding or overestimation of
the situation. This reasoning arises from experience
with the use of risk matrices in dynamic and
proactive assessments, in contrast to retrospective
analyses, where a three-colour code could be
sufficient. The major difference lies in the meaning
of Severity. In GamECAR, Severity is not represented
by the potential harm of the consequences, but by
the level of impairment of the driver, and thus
by his/her capacity to react promptly to abnormal
situations. Instead of an estimate of the effects of
a consequence, attention is shifted towards the
measurement of human performance.
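
The paper does not list the cell indices or the colour thresholds, so the following is only a sketch of how two values in [0, 1] could be mapped to a cell of a 5 x 5 matrix. The equal-width binning, the product used as cell index and the colour bands are all hypothetical choices made for illustration.

```python
def level(value, n_levels=5):
    """Map a value in [0, 1] to an integer level 1..n_levels (equal-width bins)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    return min(int(value * n_levels) + 1, n_levels)


# Hypothetical colour bands over the index range 1..25 (not from the paper).
BANDS = [(4, "green"), (8, "light green"), (12, "yellow"),
         (16, "orange"), (25, "red")]


def risk_cell(ere, dil):
    """Return (index, colour) with ERE as likelihood and DIL as severity."""
    index = level(ere) * level(dil)  # assumed index: product of the two levels
    colour = next(name for upper, name in BANDS if index <= upper)
    return index, colour


low = risk_cell(0.1, 0.05)   # low exposure, unimpaired driver
high = risk_cell(0.9, 0.7)   # high exposure, impaired driver
```

A numerical index of this kind allows hazards at the same location to be ranked, while the colour band gives the coarser tolerability class.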


7 CONCLUSIONS 


The GamECAR model is still at the design stage,
since the project is currently at its mid-term. A
dedicated software solution is under development
and will be integrated in the project App, linked to
the shared database and providing inputs to the
display function. This activity is scheduled for Spring
2018 and the so-called Evaluation Phase will run in
Summer 2018. For this reason, this paper could not
present a practical case study in which the full
methodology has been tested. Tests will be very
significant for the fine tuning of the threats’
weighting factors. At first, the travelling path will
be pre-defined, thus allowing the researchers to fill
the LUT with more specific and accurate information,
such as hazards, road types and harmfulness potential.
In the meantime, thanks to the experience gained
during the test drives, the hazards list, as well as
the threats list, can be extended, refined and
improved, in order to better represent the complexity
of real driving.

The possible exploitation opportunities of such a
model are various. In the GamECAR project
framework, it determines which visualization
option is more appropriate for the scenario in
which the driving task is conducted, in order not to
jeopardize safety for the benefit of ecology. The same
purpose could be served for other functionalities,
technologies, or tools the vehicle is equipped with.
Generally, these functions support or customize
the driving action, but they are not essential for its
core fulfilment.

The main advantage of this model, and the
main difference from the ones usually applied in
other domains with high automation, such as
aviation, is that the information is not presented to
the final users, but becomes a controlling input for
other background services. This implies that
the user does not have to be trained on risk
principles, risk variables, index and colour meaning,
and that the system is suitable for drivers of any
experience level.
What could adaptive risk management look like in practice? 


J.M. Nisula 


Risk in Motion and ICS-IRIT, Université de Toulouse II, France 


ABSTRACT: Complex adaptive systems pose challenges for risk management, e.g. due to the difficulty
of forecasting evolutions in the system and of predicting system responses to different interventions. Due to
the complex cause-and-effect relationships, straightforward actions would often fail to produce the desired
effects in the system. A more adaptive approach is required. This paper focuses on the challenges of adaptive
risk management and especially adaptive risk treatment. Key features of complex adaptive systems are
described, as well as the consequences of these features for risk management. Different types of possible
interventions are introduced, including simple actions, adaptive policies and portfolios of experiments.
Advantages and challenges of adaptive risk treatment strategies are discussed and some examples are
provided. It is argued that the scientific community should promote the use of adaptive approaches over
the older, less dynamic approaches, which are not well suited to complex systems.


1 INTRODUCTION 


Examination of the features of a complex adap- 
tive system leaves no doubt that in our everyday 
lives we are more and more exposed to—or rather 
agents within—complex systems. These would 
include the various social networks, so much accel- 
erated by social media; logistics systems like the 
transport system in a city; political systems such as 
the system of governance for a country or a com- 
pany; and so on. Complexity makes risk manage- 
ment more difficult, especially when a modern risk 
perspective is adopted, putting a lot of emphasis 
on the knowledge dimension behind the assess- 
ments (see e.g. Cox 2012, Aven 2013, Bjerga et al. 
2016). 

It is argued that the very dynamic nature of 
complex systems requires dynamic risk manage- 
ment techniques. Adaptive risk management has 
been introduced in literature, see e.g. Cox (2012) 
and Bjerga & Aven (2015). The current paper tries 
to take a rather practical perspective on how adaptive
risk management techniques could be used. 
Both risk assessment and risk treatment are dis- 
cussed but the main focus will be on risk treatment. 

Examples are picked from the challenging 
domain of managing transport safety risks. 


2 COMPLEX ADAPTIVE SYSTEMS AND RISK MANAGEMENT


2.1 Features of complex adaptive systems


Cilliers (1998) presents the following 10 character- 
istics of complex systems: 


1. Complex systems consist of a large number of
elements.

2. The elements interact dynamically.

3. The interaction is fairly rich, i.e. any element
in the system influences, and is influenced by,
quite a few other ones (but there can be large
differences between elements in this respect).

4. The interactions are often nonlinear. This
means that small inputs in one part of the system
can have large results in other parts of the
system.

5. Each element is mainly dealing with its immediate
neighbors in its local context. However,
due to the interactions, influences can propagate
through the system, while possibly getting
modulated (enhanced, suppressed or altered).

6. There are feedback loops. Any feedback can
be positive (enhancing, stimulating) or negative
(inhibiting, dampening).

7. Complex systems are usually open systems,
having live interactions with the environment
and there is a multitude of influences
between the system and its environment. The
system borders could be defined in many ways
depending on the purpose and the position of
the observer.

8. Complex systems operate under conditions far
from equilibrium being highly dynamic and
constantly evolving.

9. Complex systems have a history and their past
is co-responsible for their present behavior.

10. Each element in the system is ignorant of the
behavior of the system as a whole, responding
only to information that is available to it
locally.


Reiman et al. (2015) summarize the key features 
of a Complex Adaptive System in the following 
way: 


“A complex adaptive system is a collection of indi- 
vidual agents with freedom to act in ways that are 
not always predictable, and whose actions are inter- 
connected so that one agent’s actions change the 
context for other agents. These agents interact in a 
non-linear way creating system-wide patterns and 
higher and higher levels of complexity. The agents 
differ from each other and none understands the 
system in its entirety. This diversity is a source of 
invention and improvisation. As the agents are 
interdependent of each other, relationships among 
agents can be considered to be the essence of a 
complex adaptive system. Understanding a com- 
plex adaptive system requires understanding of 
patterns of relationships among agents.” 


Complex systems can also be associated with 
the following concepts (Reiman et al. 2015): 


• Emergence: properties and behaviors which
cannot be deduced from the individual system
components appear spontaneously in the system.
Consequently, the properties of the whole
can be significantly different from the properties
of the parts, and the outcomes are not predictable,
even when the initial conditions are known.

• Self-organization: agent interaction and connections
lead to new structures, patterns and new
forms of behaviors. Emergence, feedback and
non-linearity contribute to self-organization.

• Nested systems: complex systems are typically
part of larger complex systems, thus the expression
system of systems.


Meadows (2008, p. 11) argues that a system 
consists of elements, interconnections and a func- 
tion or purpose. To these can be added the more 
visible aspects: events (Senge 2006 p. 21) and sys- 
tem behavior (Meadows 2008, p. 88). Senge points 
out that people are naturally drawn to events and 
their alleged causes, failing to see the longer-term 
behavior of the system. The system behavior and 
the resulting events are visible but from the sys- 
tem point of view they are consequences of the 
purpose, interconnections and elements. Conse- 
quently, the following hierarchy can be established: 


Purpose 
Interactions 
Elements 
Behaviors 
Events 


This hierarchy is helpful both for understanding
behaviors in complex systems and for guiding
interventions. As Dekker (2011, pp. 130-133)
points out, understanding a complex system
requires going “up and out” (synthesis) rather
than following the typical “down and in” (analysis)
approach of the Newtonian-Cartesian worldview.


2.2 Challenges for risk management in complex 
adaptive systems 


A complex adaptive system introduces many chal- 
lenges for risk management, affecting both risk 
assessment and risk treatment. 

Knowledge is a key aspect of new risk perspectives.
Obtaining precise and comprehensive
knowledge in a complex system is particularly
challenging. On the one hand, knowledge is spread all
over the system and cannot be centralized in one place
(cf. point 10 in Cilliers’ list above). This makes
gathering comprehensive knowledge about the system
and its behavior virtually impossible. On the other
hand, due to the unpredictability of the system,
many things are simply unknowable in advance.
The nonlinearity of the system also holds potential
surprises in terms of emerging phenomena and
their magnitude. These knowledge-related challenges
affect the capability for risk assessment.

Risk treatment suffers from the same unpredictability
that affects risk assessment. One may have
a good idea of the desired change in the system;
however, it may be very difficult to establish an
action plan which would achieve the desired outcomes
with some level of certainty or repeatability.
Methods, tools and techniques which work in
simple and complicated systems are not suited
to complex systems (Kurtz & Snowden 2003).
Modeling the system would almost certainly cover
only some parts of the system behavior, and any
model would by definition not reproduce the full
complexity of the real system. Controlled experiments
are not possible because the system is open
and keeps changing all the time, whether or not this
is desirable (Sterman 1994). When a specific
intervention is introduced, nobody can predict with
certainty what the various consequences within
the system will be. Not only can an intervention
fail to reach its desired outcomes; it could also have
different unwanted consequences, some of which
may remain difficult to detect.

For example, a new strict rule introduced by the
regulator in order to eradicate common shortcuts
in aviation maintenance procedures may seem like
the typical easy solution from the point of view of
the regulator. However, if it misses the true reasons
for the shortcuts, it may be fairly ineffective.
Moreover, an additional negative consequence could be
loss of respect for and faith in the regulator among
maintenance professionals when they perceive that the
regulator has not addressed the real problem but
only opted for an apparent solution which, to make
things worse, threatens them with punitive actions.
3 ADAPTIVE RISK MANAGEMENT 


The classical way to act in trying to change a system
is often a specific action at a specific point in time.
This could typically be a new working method, a
new procedure, a new piece of regulation, or a
reorganization. Such interventions are not adaptive,
i.e. they cannot adapt to changing conditions after
the implementation.

The main focus in this paper is on adaptive
interventions in the context of risk treatment.
Two types of adaptive interventions in particular are
discussed: adaptive policies and experiments.


3.1 Adaptive policies 


The main idea behind adaptive policies is to 
develop policies that are not targeted to be opti- 
mal for a best estimate future, but which could be 
robust across a range of futures. Swanson et al. 
(2010) argue that policies designed for a certain 
range of conditions often face unexpected chal- 
lenges outside of that range and this leads to unin- 
tended impacts and failures in accomplishing the 
original goals. Instead of being forced to change
policies on an ad hoc basis repeatedly, adaptive
policies have a built-in capability to adapt to new
conditions. Such an approach is particularly suited
to highly complex, dynamic and uncertain settings,
which are typical of complex adaptive systems.

Another key feature of adaptive policy making
is that not all major uncertainties need to be
tackled prior to the implementation phase. As new
knowledge is gained during the implementation,
the policy is gradually adapted accordingly
(Marchau et al. 2010).

Swanson et al. (2010) propose “seven tools for 
adaptive policymaker”, which can be summarized 
as: 


1. Integrated and forward-looking analysis. The 
idea is to try to identify the key success factors 
for the policy in advance, anticipate problems 
and to mitigate the foreseeable unintended 
impacts in advance. 

2. Built-in policy adjustment. If the future needs
for policy adjustment can be anticipated, fully
or semi-automatic adjustment mechanisms can
be built into the policy.

3. Formal review and continuous learning. Ideally a 
policy is introduced in a phased manner, using 
so-called policy pilots allowing early testing and 
adjustments. 

4. Multi-stakeholder deliberation. Volunteers are 
used in a very structured manner to get valuable 
feedback from stakeholders. 

5. Enabling self organization and social networking.
Policies should promote self organization and
networking so that local solutions can be
discovered without external input.

6. Decentralization of decision-making. Both for- 
mal and informal feedback gets faster and bet- 
ter when the decision-makers are closer to the 
people affected. 

7. Promoting variation. This can be achieved either 
by several parallel experiments, facilitating an 
environment where variation can occur, or by 
using feedback to create variation. 


Going from simple one-shot actions to adap- 
tive policies can already be a significant step in 
embracing the complexity of the surrounding sys- 
tem and creating more suitable solutions for that 
environment. 

Adaptive policy making could very naturally be 
adopted by regulators and other administrative 
agencies. However, the same philosophy can also 
be applied in private companies (and various other 
types of organizations) in trying to replace com- 
pany policies, rules and procedures by more adap- 
tive alternatives. 


3.2 Experiments 


As the heart of the problem with complex systems 
is that any intervention could have unpredictable 
consequences, the most natural course of action 
is to organize probes, i.e. set up experiments and 
use the results to improve the next round of experi- 
ments. As Sterman (1994, p. 310) states, effective 
learning in the world of dynamic complexity and 
imperfect information can be based on continuous 
experimentation and feedback. The final aim is to 
gradually find the most effective solutions through 
trial and error. 

By definition many experiments will fail. It is 
therefore important that the various cost dimen- 
sions (e.g. time, financial cost, potential damages) 
of the experiments are very limited. 

Other key requirements for experiments are: 


• Clear criteria for success and failure are
necessary; otherwise the iteration towards better
experiments will be too slow.

• It must be possible to stop the failing
experiments rapidly.

• There must be a way to scale up the successful
experiments.


The whole point of experiments is that many
diverse strategies can be experimented with.
Therefore, a thorough pre-screening of proposed
experiments is not necessarily recommended. A lot of
different experiments may be run, as long as they
are safe-to-fail. Experiment proposals should not
need to be convincing; they just need to have a
certain level of coherence, showing that they could
possibly work.
In designing experiments, it is good to keep in 
mind the hierarchy of elements within complex 
adaptive systems (see above, section 2.1), i.e. modi- 
fying the purpose or the interconnections within a 
system will probably bring a more tangible and 
durable change than working at the level of ele- 
ments and behaviors (Meadows 2011, pp. 16-17). 
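
The three requirements for experiments (explicit criteria, rapid stopping, scaling up) can be illustrated with a small review routine. This is only a sketch: the indicator scale, the thresholds, the cost cap and the three example entries are invented for illustration and carry no empirical meaning.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    name: str
    effect: float  # observed improvement on the chosen indicator (invented scale)
    cost: float    # accumulated cost in the broad sense (time, money, damage)


def review(experiment, success_at=0.2, failure_at=0.0, cost_cap=1.0):
    """Classify one experiment against explicit, pre-agreed criteria."""
    if experiment.cost > cost_cap or experiment.effect <= failure_at:
        return "stop"        # failing or no longer safe-to-fail
    if experiment.effect >= success_at:
        return "scale up"    # clear success: candidate for wider roll-out
    return "continue"        # inconclusive: keep collecting feedback


portfolio = [
    Experiment("A", effect=0.25, cost=0.4),
    Experiment("B", effect=0.05, cost=0.3),
    Experiment("C", effect=-0.10, cost=0.2),
]
decisions = {e.name: review(e) for e in portfolio}
```

Running such a review at short, regular intervals is what keeps the iteration fast; the results of one round then inform the design of the next round of experiments.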


3.3 Example: Adaptive risk treatment based 
on experiments 


Let’s use as an example the problem of high road 
accident rates of young drivers. The challenge from 
a transport safety agency’s point of view is to find 
effective interventions to reduce this particular 
risk. 

Instead of trying to figure out a single pre- 
ferred solution, the solution could be built through 
experimentation with different types of interven- 
tions. It would be important to involve different 
types of people in creating the experiments, includ- 
ing people outside the typical group of transport 
safety experts. In particular, it would be valuable 
to involve some young drivers who are themselves 
part of the risk group. In this way, different
perspectives on the problem can be included and
relevant knowledge from the real-life context can be
made available.

The working group creating the experiments 
might come up, for example, with the following 
experiments: 


1. Adding a specific short training which highlights
the limits of the driving skills of the
young drivers. Typically, the drivers could learn
about the speeds and conditions at which they
can no longer control the car in a curve or stop
the car by braking. Rationale: reduce the
overconfidence of young drivers by illustrating the
limits of their skills, thereby achieving a more
cautious driving style.

2. Introduce specific driving limitations for young
drivers: forbid driving 10 pm-6 am on Fridays
and Saturdays and forbid transporting more
than one young person in the car, unless at
least one adult is present. Rationale: reduce
exposure to dangerous driving situations at
nighttime and under negative influence of peers.

3. Introduce specific driving limitations II: require
that no weekend driving can take place without
an adult present in the car. Rationale: eliminate
risky driving during weekends.

4. Expose young drivers to their peers who have
suffered a serious accident. Young drivers
would meet face-to-face one or more people
who have had serious consequences after a
road accident as a young driver. Rationale: the
encounter would give an emotional lesson that
carelessness in traffic could have very serious
and long-lasting consequences, and the encounter
would contribute to a better driving style.

5. Oblige young drivers to take care of potential 
victims of their careless driving style. Typically, 
during several weeks the young drivers would be 
taking young children to school both by using 
their own car and also as pedestrians. Rationale: 
young drivers will experience the vulnerability of 
these people in a very tangible way and also cre- 
ate an emotional link with such potential victims. 
Being a pedestrian may create situations where 
the young drivers can observe other drivers’ dan- 
gerous behavior threatening the safety of the 
school children that they are now responsible for. 


The first experiment is close to what a regulatory 
action could be even without experimentation. The 
next two experiments are a little bit more creative 
but still within the regulatory domain. The last 
two experiments propose approaches that regulators
would typically not use as a part of their
interventions. All rationales can be considered coherent
and the costs (in the broad sense) are small, thus all
five experiments can be considered valid.

From the practical point of view, each experi- 
ment could be running in a different region or city 
within the state in question. Probably the most 
challenging aspect would be to measure the impact 
of these experiments. Ideally, if one wants to get 
results in a short time it is not enough to wait 
for accident data. One should be able to capture 
feedback for all kinds of safety related problems 
and also comments from the drivers themselves as 
well as from any other key stakeholders depend- 
ing on the experiment. It should be possible to 
adapt the course of action after a few months of 
experimentation. 

It could happen that one or two approaches turn 
out to be the most effective ones, perhaps depend- 
ing on certain conditions (e.g. big city vs. small 
town). Additionally, it could be that some experi- 
ments could be combined or their lessons learned 
could be fed into the winning strategies as addi- 
tional elements. For example, it could be decided 
that all young drivers should “meet” previous 
young drivers who have had serious accidents, but 
that this could be made virtually through a film. 

Once the winning strategies have been imple- 
mented it is important to keep in mind that some 
fresh experiments should be run in the not-too- 
distant future. 


3.4 Resilience as a risk treatment strategy 


It is difficult to talk about risk management within 
complex systems without touching the concept of 
resilience. Woods (2015) gives four separate inter- 
pretations for the concept of resilience: 
1. Rebound: recovering the functional state after a 
major disturbance. 

2. Robustness: increased ability to absorb pertur- 
bations. 

3. Graceful extensibility: the capability to extend 
adaptive capacity in the face of surprise. 

4. Sustained adaptability: being able to sustain 
adaptability over longer periods of time. 


Resilience is a valid concept at different scales: e.g. 
one could talk about the resilience of a crew or an 
organization or even the resilience of a specific system. 

From a risk management point of view, a highly 
resilient organization could cope with virtually any 
kind of surprising disturbance without losing the 
operational capabilities. Therefore, building high 
resilience in an organization could be a preferred 
strategy compared to having to predict all possible 
failure scenarios and getting prepared for them. 
Moreover, for some types of risks—e.g. low-prob- 
ability scenarios, including black swans—working 
at the level of individual scenarios is not even fea- 
sible due to their almost infinite number. 

Achieving higher levels of resilience could also be 
taken as an objective within an adaptive risk treat- 
ment approach, i.e. one of the key goals for various 
experiments could be to increase the organizational 
resilience level. An innovative approach would be 
to use an inverse logic for such experiments: take 
the increased resilience as a given based on a robust 
scientific reasoning and use the experiments rather 
to test how well the organization can cope with the 
new constraints which come as the downside of the 
resilience-increasing measures. 


3.5 Dynamic risk assessment & treatment 


In the previous example, the experimental iteration 
took place within the risk treatment domain. How- 
ever, it is possible to stretch the loop back to risk 
assessment and create a risk management process 
where both risk assessment and risk treatment are 
part of the adaptive approach. 

Risk assessment is based on assumptions.
Experiments can produce valuable feedback on the
validity of the assumptions and lead to iterative
corrections to the risk assessment. In this case, it is
not only the risk treatment for a given set of risks
which may change; the priority given to treating the
various risks may also change.

A very dynamic risk management response can 
be achieved, if the operational people are fully 
involved in the assessment and management of 
risks, and are empowered to take real-time deci- 
sions based on the understanding of risks, assump- 
tions and the alternatives for action. This type of 
dynamic risk management requires an excellent 
cooperation between the risk management experts 
and the operational people. 


For example, imagine that an airline has made 
a risk assessment of its operation to a specific 
airport. It may have deemed the operational 
risks acceptable based on several assumptions, 
one of which could be certain acceptable weather 
conditions. If the pilots flying to the destination 
are aware of the risk assessment and the assump- 
tions behind it, they are able to compare the real- 
world conditions of the day to the assumed ones 
and detect any mismatches. If the crew one day is 
faced with weather conditions different from the 
assumed ones, it would immediately know that the 
risk assessment is no longer valid. Different alter- 
native courses of action could have been prepared 
in advance, and the crew could now take the most 
suitable one of them, for example divert to a close- 
by airport from which ground transportation to 
the final destination could be arranged. 
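
The comparison the crew makes in this example can be sketched in a few lines of code. This is only an illustration, assuming that each assumption behind the risk assessment can be written as an allowed numeric range; the names `Assumption`, `violated_assumptions`, `crosswind_kt` and `visibility_m`, as well as all limit values, are hypothetical and do not come from any real operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assumption:
    """One assumption behind a risk assessment, stored as an allowed range."""
    name: str
    minimum: float
    maximum: float

    def holds(self, observed: float) -> bool:
        """True if the observed value lies inside the assumed range."""
        return self.minimum <= observed <= self.maximum


def violated_assumptions(assumptions, observations):
    """Return the names of assumptions not satisfied by today's observations.

    A missing observation counts as a violation, because the assumption
    cannot be confirmed without it.
    """
    violated = []
    for assumption in assumptions:
        value = observations.get(assumption.name)
        if value is None or not assumption.holds(value):
            violated.append(assumption.name)
    return violated


# Hypothetical assumptions for one destination airport (illustrative values).
ASSUMPTIONS = [
    Assumption("crosswind_kt", 0.0, 20.0),
    Assumption("visibility_m", 1500.0, float("inf")),
]

today = {"crosswind_kt": 27.0, "visibility_m": 4000.0}
mismatches = violated_assumptions(ASSUMPTIONS, today)
if mismatches:
    # The assessment is no longer valid; a prepared alternative is needed.
    print("Risk assessment no longer valid:", ", ".join(mismatches))
```

Any non-empty result would be the trigger for selecting one of the alternative courses of action prepared in advance.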

Ideally, instead of having a one-time risk assess- 
ment produced before the operation, the dynamic 
approach creates an active risk management tool/ 
process, engaging the expertise both from the risk 
management experts and the operational people. 


3.6 Macro-view to adaptive risk management 


It is easy to see that adopting an adaptive risk man- 
agement approach replaces a one-time exercise by 
a dynamic activity where interventions are iterated 
continually based on feedback from the system. 

Instead of having to make one or two big deci- 
sions, smaller decisions are taken, and they are 
taken more frequently. 

Applying adaptive risk treatment techniques 
also means that at any point in time there will be 
several portfolios of experiments ongoing. One 
needs to develop skills to manage such portfolios. 
Attention may also have to be paid to adaptive 
policies if they are used. 

Based on the results of the individual experi- 
ments, some would be expanded, others would 
be terminated, and some experiments could be 
merged together. 

As explained in the previous section, the process 
can be made even more dynamic if risk assessment 
becomes part of the adaptive loop and if 
operational people are engaged in a shared risk 
management and decision-making process together 
with the risk analysts. 


4 ADVANTAGES OF ADAPTIVE 
TECHNIQUES 


The main advantage of the described adaptive 
techniques should be better effectiveness due to 
the use of techniques which are better adapted to 
complex systems than less dynamic alternatives: as 
it is very difficult to predict future evolutions and 
system responses to interventions, instead of try- 
ing to model the system it is better to use probes. 
Also, in a system which is in constant change, it is 
better to base decisions on fresh feedback from a 
recent experiment rather than follow a plan made 
several months or years ago. 

As a positive spin-off from frequent experimen- 
tation, there is an opportunity to start gradually 
learning interesting aspects of the system behavior. 
This type of learning can be helpful in designing 
future experiments and interventions. 

Finally, big decisions may be risky and fright- 
ening because of the potential losses if the inter- 
vention fails. In addition to the lost improvement 
opportunity and incurred costs, the failure may 
be damaging to the image and credibility of the 
organizations and people involved. The smaller 
decisions within the adaptive approach are much 
less frightening as the stakes will be smaller each 
time, and as the whole context is explicitly a con- 
text of experimentation where graceful failure is an 
acceptable option. 


5 CHALLENGES OF ADAPTIVE 
TECHNIQUES 


There are systems and phenomena which do not 
easily lend themselves to experimentation. First, by 
their nature some systems do not produce enough 
feedback so that the success of the experiment 
can be assessed. For example, a very safe system 
does not produce much feedback in the form of 
events where safety was compromised. Trying to 
test interventions aiming at further safety improve- 
ment will therefore suffer from the fact that there is 
not going to be enough feedback within a reason- 
able time period. 

Secondly, one also has to accept that running a 
specific experiment successfully in a specific place 
at a specific time does not necessarily mean that 
scaling up the same idea in the same or in a dif- 
ferent place, will again be a success. One has to 
remember that these experiments are not control- 
led experiments where various parameters can be 
frozen—because in a complex system, key param- 
eters cannot be frozen. 

Third, there may be man-made restrictions for 
experimentation. For example, it may be very dif- 
ficult to set up experiments in a tightly regulated 
system. Work may be highly proceduralized, and 
testing new procedures in real working conditions 
may be judged too risky. 

Finally, moving from traditional working meth- 
ods to adaptive risk management is a significant 
change and brings with it several psychological barriers. 
Accepting the features of complex systems also means 
accepting the associated unpredictability and uncertainty. 
Both the people carrying out 
the risk management and the public might perceive 
admitting the uncertainty negatively. As Tickner & 
Kriebel (2006) point out, acknowledging uncertainty 
can weaken agency authority by creating an image 
of the agency as unknowledgeable and by threaten- 
ing the objectivity of science-based standards. 

The experiment-based working method itself 
might not look very controlled or scientific either, 
and could further deteriorate the credibility of the 
organization implementing it. 

The challenge grows even bigger if one wants to 
explore unconventional interventions. This would 
typically require testing some very surprising and 
“crazy” ideas. It would also require involving peo- 
ple who are not experts in the subject matter or at 
least not perceived as experts. All such factors raise 
the bar higher for organizations to start applying 
modern adaptive approaches. 


6 CONCLUSIONS 


This paper has described typical features of com- 
plex adaptive systems and discussed the challenges 
that such systems pose to risk assessment and risk 
treatment. 

It is argued that adaptive risk management is 
better adapted to complex systems than other clas- 
sic approaches. The principal aim of this paper has 
been to illustrate how adaptive risk management 
and especially adaptive risk treatment could be 
used in practice. Three different approaches have 
been promoted: adaptive policies, experiments, 
and enlarging the adaptive iteration to cover also 
the risk assessment part of the process. The theo- 
retical content has been complemented with a con- 
crete example from the domain of road safety. 

The key argument for promoting the use of 
adaptive risk management is the expectation of 
reaching better results in managing risks in the 
complex system. 

However, several important challenges in imple- 
menting adaptive approaches have also been 
identified. 

It is proposed that the scientific community 
should help promote adaptive risk management as 
a respected method which could be implemented 
in different organizations—including governmen- 
tal agencies—without the fear of losing credibility 
in the eyes of peers or the public. 


ABSTRACT: In recent years, new risk analysis approaches have been developed for application in major 
hazard establishments. Amongst these, approaches associated with the overall concept of dynamic risk 
are the most relevant. The main reason for the diffusion of Dynamic Risk Analysis (DRA) is the need 
to integrate new notions within risk assessment procedures, accounting for the dynamics of phenomena, 
information on system changes over time and/or the impact of innovative technologies. DRA 
approaches are particularly useful when assessing the risk due to the lifting of loads, especially when 
hazardous substances are handled or in major hazard industries. The reason is the potential for loss 
of containment, which could be followed by catastrophic scenarios (such as fires, explosions and toxic 
dispersions). During crane operations, a common hazardous situation leading to accidents 
is the hindered view of the workspace for the crane-operator. In this regard, a recently 
developed real-time monitoring tool, named Visual Guidance System (VGS), can be used to assist the 
worker during load lifting. It raises an alarm when a potential impact between the handled load 
and any obstacle in the workspace is detected. Data collection, through the feedback provided by 
the VGS, allows appropriate indicators to be derived that are correlated to the safety of crane operations 
in the chemical industry. These indicators are continuously updated; therefore, they are useful parameters 
to be integrated within DRAs. This work aims at the definition of risk indicators for lifting and handling 
operations; the methodology to derive such indicators is described and an application is also given. 


1 INTRODUCTION 


Safety in crane-related operations is a complex 
issue, especially in the chemical industry, where an 
accident, due to incorrect load lifting or handling, 
could lead to a severe accident scenario (Milazzo 
et al., 2016). In this context, Dynamic Risk Analysis 
(DRA) approaches can be particularly useful 
in managing the risk. Crane accidents can be 
the cause of losses of containment, which can be 
followed by events, such as fires, explosions and 
toxic dispersions. More generally, approaches for 
dynamic risk assessment are based on the use of 
models integrating parameters that change over 
the time. Dynamic factors impact on both fre- 
quencies and consequences of incidents and, thus, 
on final risk results. Moreover, it is well-known 
that the integration of real-time monitoring data 
offers the opportunity to achieve a more effective 
control of activities, carried out in the workplace 
in view of worker safety, by allowing the preven- 
tion of accidents and the timely implementation 
of protective actions. The use of DRA is currently 
becoming more widespread; some examples from 
the literature are the following. Eide and 
co-authors carried out dynamic environmental risk 
assessment for oil tankers sailing along the North 
Norwegian coast (Eide et al., 2006; Eide et al., 2007); 
Milazzo et al. (2009) made use of the concept of 
dynamic geoevent to represent the dynamic evolution 
of toxic dispersions. As pointed out by Paltrinieri and 
Reniers (2017) and Paltrinieri and Khan (2016), 
DRA improves decision-making and supports 
critical risk communication; it can also be 
used to describe the impact of innovative technologies 
on the overall safety. 

In the chemical process industry, risk arises from 
complex systems and their management requires a 
large number of control measures (De Rademaeker 
et al., 2014). In this context, a common practice is 
to track the performance of activities by using 
indicators, in order to continuously improve safety 
and operability. As defined by Øien (2001a), 
an indicator is a measurable/operational variable 
that can be used to concisely describe a broader 
phenomenon occurring when a plant is operating. 
A small number of key indicators can monitor the 
status of whole systems. In the chemical industry, 
the most relevant indicators are those used to 
appraise the safety or risk performance of systems. 
The terms safety indicator and risk indicator are 
distinguished by Øien et al. (2011). A risk indicator 
is a risk influencing factor, i.e. an event/condition 
that affects the risk level of a system/activity; 
whereas a safety indicator is a factor that has an 
effect on safety, as it is related to measures 
other than risk metrics (such as the number of 
accidents or incidents). Thus, risk indicators are 
derived from a risk-based approach (Øien, 2001b), 
whereas safety indicators may be developed from a 
safety performance-based approach. Indicators are 
also distinguished as leading and lagging indicators 
(HSE, 2006): leading indicators represent a form of 
proactive monitoring of the effectiveness of a Risk 
Control System (RCS), by providing feedback 
about safety outcomes before an incident occurs; 
whereas lagging indicators represent a form of 
reactive monitoring of the effectiveness of a RCS, 
given that they provide feedback after the occur- 
rence of a negative event. 

Approaches to develop safety and risk indicators 
are grouped in two perspective typologies by Øien 
et al. (2011), i.e. technical-human-organisational 
perspective and predictive-versus-retrospective 
perspective. The first perspective allows developing 
safety indicators as it searches for causes of accidents 
that occurred in the past, starting from technical 
to human and further to organisational causes 
(Leveson, 2004). The second one gives risk indica- 
tors and aims at predicting potential accidents by 
including all possible causes or by trying to estab- 
lish the causes after the event (according to a retro- 
spective point of view); this approach requires the 
use of quantitative risk models. 

The above brief review underlines that the 
prevention of major accidents can benefit from both 
the use of dynamic risk analysis techniques and 
safety/risk indicators. The application of dynamic 
risk assessment techniques based on proactive 
indicators is suggested by Paltrinieri et al. (2016); 
it brings additional benefits, since the risk analysis 
is supplemented by early-warning information, 
which supports the management of unwanted events 
in advance. The integration of a set of collected 
indicators provides the risk assessment with 
dynamic and proactive features. Data collection 
and processing, for the purpose of such a DRA, 
take advantage of information technology 
supporting real-time data collection, sharing, 
processing, visualization, etc. According to Paltrinieri 
et al. (2016), dynamic risk assessment techniques 
based on proactive indicators can be classified in 
four levels by referring to the basic theory and 
the results provided. The first level concerns the use of 
safety indicators and takes into account the effect 
of technical, human and organizational factors; the 
second is related to the use of risk indicators, 
so the application of risk models is needed; the 
third level refers to the application of techniques 
for frequency updating; finally, the fourth level 
concerns the use of techniques for the aggregation of 
the information provided by indicators. 
This aggregation allows an accurate assessment 
of the variation of overall risk, also based on 
real-time data. 

This work aims at the derivation of risk indica- 
tors for the load lifting and handling in chemical 
industry, to be integrated into the dynamic risk 
analysis procedure. The paper is structured as fol- 
lows. Section 2 shows the methodology for the der- 
ivation of these indicators based on the approach 
proposed by HSE (2006); some indications are 
given on how indicators could be integrated into 
the dynamic risk analysis. Section 3 shows the 
description of a case study in which the application 
and validation of the indicators are carried out. 
Section 4 comments on the results obtained from 
this study. Finally, in Section 5 the conclusions of 
the work are presented. 


2 METHODOLOGY 


To derive risk indicators for the purpose of this 
study, the HSE approach (HSE, 2006) has been 
used. Figure 1 shows a simple scheme of the main 
steps of this general procedure, which is usually 
used to develop indicators related to processes. It 
is based on the development of lagging and leading 
indicators for the facility, related to each Risk 
Control System (RCS) and associated with a previously 
identified hazardous event; the aim is to prevent, 
control or mitigate major accidents. 


2.1 Identify the scope of indicators 


According to this methodology, the first step in 
developing key indicators for a facility is to 
define the scope of such indicators. This 
allows selecting the proper information about the 
adequacy of safety controls. It is important to 
focus on a few critical indicators, giving a sufficient 
overview of the performance of systems and/or a 
more detailed picture in a hierarchical form. 

To develop facility level indicators, major accidents 
that could involve the equipment have to be 
identified. Each indicator should refer to a RCS 
of the facility and can then be weighted to reflect 
its importance in guaranteeing safety at that 
facility. Different weights will reflect the criticality 
of each RCS for that equipment, e.g. plant design 
might be the most critical RCS at Facility 1 in 
Figure 2, whereas inspection and maintenance 
could be the most important RCSs at Facility 2. 

Figure 2. Development of facility level indicators. 
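
The weighting of RCS indicators per facility can be illustrated with a short sketch. The weighted mean used here is an assumption made for illustration only, since the HSE approach does not prescribe a particular aggregation rule; the function name `weighted_score`, the RCS keys and all numerical values are hypothetical.

```python
def weighted_score(indicator_values, weights):
    """Weighted mean of RCS indicator values, each expressed on a 0-1 scale.

    Both arguments are dicts keyed by RCS name. The weights do not need to
    sum to one; they are normalised here.
    """
    if set(indicator_values) != set(weights):
        raise ValueError("indicators and weights must cover the same RCSs")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(indicator_values[rcs] * weights[rcs] for rcs in weights)
    return weighted_sum / total_weight


# Hypothetical indicator values shared by two facilities.
values = {"plant_design": 0.9, "inspection_maintenance": 0.6}

# Facility 1 treats plant design as the most critical RCS,
# Facility 2 treats inspection and maintenance as the most critical one.
facility_1 = weighted_score(values, {"plant_design": 0.7, "inspection_maintenance": 0.3})
facility_2 = weighted_score(values, {"plant_design": 0.3, "inspection_maintenance": 0.7})
print(round(facility_1, 2), round(facility_2, 2))
```

With the same indicator values, the facility that gives more weight to the weaker RCS obtains the lower overall score, which is the effect the weighting is meant to capture.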


2.2 Set lagging indicators 


The identification of the hazard scenarios and the 
description of how these events can be generated 
help the analyst to focus on the most relevant 
activities and to identify which indicators should 
be set. The HSE methodology suggests considering 
the primary failure that gives rise to an incident. 
Furthermore, attention must be given to areas where 
potential criticalities are present, i.e. where near-misses 
or past incidents have already occurred and 
where information from audits and inspections 
has been collected. 

Figure 1. Main steps of the HSE approach for the development of risk indicators. 

The next step is to provide a list of the RCSs needed 
to prevent or mitigate the consequences of each 
hazardous scenario; the starting point is to identify 
their primary cause. The related safety 
outcomes should therefore be defined by means of the 
description of these hazard scenarios. The desired 
safety outcome represents the success for each 
RCS. For this reason, the safety outcomes must 
be clearly identified in terms of “success/failure”; 
otherwise it will be impossible to classify the 
related indicators properly. At this point, a lagging 
indicator should be defined for each RCS to highlight 
whether the desired goal is actually achieved. 


2.3 Set leading indicators 


After the identification of proper lagging indica- 
tors, the HSE approach suggests setting leading 
indicators for each critical element of all risk con- 
trol systems (i.e. those actions or processes that 
must function correctly to deliver the outcomes). 
These indicate whether the RCSs provide the 
designed safety outcomes. The following factors 
must be considered to identify what are the most 
important critical aspects that the RCSs should 
cover to deliver the desired outcome: 


• activities that must always be performed correctly; 

• elements of the system that are susceptible to 
deterioration over time; 

• activities that are most frequently performed. 


A range of tolerance should be set for each 
leading indicator. This is important in order to 
capture the analyst's attention when deviations in 
performance are flagged up. Relevant information 
from indicators must be readily obtainable, and the 
presentation of the data collected should be as 
simple as possible to permit prompt comprehension. 
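
A tolerance check of this kind can be sketched as follows. The sketch assumes that a leading indicator is a percentage where higher is better and that the tolerance is an allowed shortfall below a target; the class name `Tolerance`, its fields and the status labels are illustrative choices, not part of the HSE guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tolerance:
    """Tolerance band for a leading indicator expressed as a percentage."""
    target: float             # desired value of the indicator, in percent
    allowed_shortfall: float  # how far below the target is still acceptable

    def status(self, measured: float) -> str:
        """Classify a measured value against the target and the tolerance."""
        if measured >= self.target:
            return "on target"
        if measured >= self.target - self.allowed_shortfall:
            return "within tolerance"
        return "outside tolerance"


# Hypothetical setting: at least 90% of staff trained, 10 points of tolerance.
training = Tolerance(target=90.0, allowed_shortfall=10.0)
for measured in (95.0, 85.0, 70.0):
    print(measured, "->", training.status(measured))
```

Keeping the output to three plain labels follows the requirement that the presentation of the collected data be as simple as possible.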

Finally, a review of all implemented indicators 
should also be carried out, with the aim of highlighting 
poor performance of all or part of 
them. This means that if the leading indicators of a 
RCS show poor performance, whereas the lagging 
indicators give satisfactory results for the same 
system, there is clearly a discrepancy. In this case, 
a system review is recommended to understand 
the reason for such a discrepancy because, to improve 
the performance of the facility or activity, 
each deviation from the intended outcome or 
failure of a critical part of a RCS must be followed 
up. Each review provides an opportunity 
to consider whether improvements should be 
made. 
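
The review logic described above can be written as a small decision rule. This is a sketch under stated assumptions: the text only names the case of poor leading indicators with satisfactory lagging indicators as a discrepancy, whereas the sketch also treats the opposite mismatch as a discrepancy; the function name `review_action`, the returned labels and the example data are hypothetical.

```python
def review_action(leading_ok: bool, lagging_ok: bool) -> str:
    """Suggest a review action for one RCS from its indicator results."""
    if leading_ok and lagging_ok:
        return "no action"
    if leading_ok != lagging_ok:
        # Leading and lagging indicators disagree about the same RCS.
        return "discrepancy: review the RCS and its indicators"
    # Both indicator types show poor performance.
    return "follow up the deviation"


# Hypothetical review results: (leading indicators ok, lagging indicator ok).
results = {
    "RCS2": (False, True),
    "RCS3": (True, True),
    "RCS5": (False, False),
}
for rcs, (leading_ok, lagging_ok) in results.items():
    print(rcs, "->", review_action(leading_ok, lagging_ok))
```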


2.4 Indicators measuring the safety in load lifting 
and handling in chemical industry 


This paper focuses on major accidents due to the 
use of equipment for the lifting and handling of 
loads in the chemical industry, which involve the load 
and objects located in the workspace, i.e. workers 
or other equipment (Cheng and Teizer, 2014; 
Spasojević Brkić et al., 2015). In conducting such 
operations, there could be situations in which the 
operator does not have full visibility of the 
workspace. In these cases, a widespread practice 
is to rely on an intermediary, who usually 
guides the crane-operator in correctly 
navigating the load. Under this condition, the 
crane-operator is subject to a high level of stress, 
as he/she has to trust the judgment of others in 
order to carry out an operation for which he/she 
is fully responsible. Indeed, one of the 
most important causes of accidents is human 
error, which takes one of the following 
forms: distraction, wrong communication, 
inexperience, etc. For this reason, this contribution 
includes planned operator activities to control risk 
within the group of RCSs. 

To avoid accidents associated with 
a partially or totally hindered view of the workspace, 
a real-time computer-aided visual guidance 
system has recently been developed, which 
is simply indicated as VGS (Ancione et al., 2017). 
This safety device is composed of hardware and 
software; it was designed to provide support to 
crane-operators in navigating the load. The use 
of the VGS for the execution of a load lifting or 
handling represents a supplementary RCS, aiming 
at the prevention of the release of hazardous substances 
caused by collisions between the load and 
the equipment containing them. To identify the 
safety-related parameters for the activities associated 
with the crane operability, the HSE method 
has been adapted as shown in Figure 3 by including 
the investigation of the hazards due to the 
interference between the plant normal activity and 
the operations carried out with the crane. 

Figure 3. Methodology for the development of indicators for load lifting/handling in chemical industry. 


3 CASE STUDY 


The case study analysed in this paper is a 
reconstruction of an alkylation unit of a refinery where 
a crane accident occurred. This event happened in 
1987 in Texas (USEPA, 1993); it caused the release 
into the atmosphere of hydrofluoric acid, an 
extremely corrosive substance, forcing the authorities 
to evacuate thousands of people from their 
homes in the area surrounding the refinery. Due 
to the toxic dispersion, more than 800 people were 
sent to hospital. 

The crane was lifting a multi-ton heater, which 
was accidentally dropped onto the top of a 
hydrofluoric acid storage vessel. The heater was 
being moved for repair and maintenance during 
a general plant turnaround. The dropped heater 
severed a 4-inch acid loading line and an inch 
pressure relief line causing the hydrofluoric acid 
to be released. Figure 4 shows the layout of the 
hydrofluoric acid service in the alkylation unit, 
the position of the temporarily installed crane and 
the trajectory for the load handling. All major 
equipment is labeled. The bulk of the acid was 
located in the storage drum, the settler, the acid 
cooler and the piping between the settler and 
cooler. Smaller volumes were contained in the 
fractionator, the fractionator accumulator, the splitter 
column and the acid rerun column. The acid 
unloading equipment was at the southern boundary 
of the overall unit. 

The event was investigated by OSHA (USEPA, 
1993), which concluded that various causes con- 
tributed to the accident: 


• accepted engineering control measures to prevent 
the release were not instituted (i.e. emptying the 
hydrofluoric acid vessel before hoisting a heavy 
load over it, or not hoisting a heavy load over a 
hydrofluoric acid tank); 

• the crane was not properly blocked (wooden 
blocks supporting the crane outriggers were crushed); 

• crane inspection documentation was not prepared; 

• the crane safety devices were not inspected prior 
to use and a malfunction occurred. 


It is also possible that, on noticing an 
anomaly that forewarned of the accident, the 
crane operator decided to drop part of the moving 
system in order to avoid a major accident due 
to impact with the vessel and that, because of the 
limited visibility and the complexity of the unit, he 
dropped it in the wrong place. 

In general, the risk control systems adopted by 
the company to prevent a loss of containment and/ 
or mitigate its risk are: operational procedures, 
emergency procedures, training of workers, periodic 
inspection and scheduled maintenance. The 
presence of the VGS, in that context, would have 
assisted the crane-operator in the navigation of the 
load by indicating where the load would be dropped, 
thus avoiding the chain of events that led to the 
release of the toxic cloud. 

Figure 4. Layout for the hydrofluoric acid service in the alkylation unit. 


4 RESULTS 


The methodology described in Section 2.4 has 
been applied to the case study. Initially, the hazard 
scenarios and initial causes that can lead to a major 
accident were defined. Then, all RCSs that 
are in place to prevent or mitigate the effects of 
major accidents were identified. 

Figure 5 shows a logic scheme for the identification 
of the accident sequence that could occur 
during the crane activity. On the left side, the initial 
cause of the hazard scenario is given; this represents 
what can go wrong when the crane is operating. 
Amongst the various causes of failure for this 
activity, one is associated with the hindered 
view and potentially leads to a major accident. It 
is represented by a dropped load and could be the 
cause of a loss of containment (hazard scenario), 
from which a toxic dispersion of hydrofluoric acid 
originates. 

The risk control systems adopted by the 
Company for the prevention and mitigation 
of losses of containment due to crane-related 
operations are: 


• Plant operating procedures (if these are not 
correctly followed, facilities age and, due to 
impacts, LOCs are more likely) 
• Crane operating procedures 
• Inspection and maintenance procedures 
• Work permit procedure 
• Emergency procedures 
• VGS 

Compared to the use of traditional RCSs (considered 
in the case of the accident described in 
Section 3), the installation of the VGS has been 
included in the assessment. 

Table 1 lists all relevant traditional RCSs that 
are usually adopted by the Company to prevent, 
control and mitigate the loss of containment. 
For each of them the desired outcomes are indicated, 
as well as the lagging indicator controlling 
the achievement of the related desired outcome. 
Then, critical elements of each RCS have also been 
identified in Table 2 with the aim of defining the 
potential leading indicators. 

The most relevant indicators (a lagging indicator 
and one or two leading indicators) for the activity 
under analysis have been selected; this selection 
has been made by referring to the criterion that 
the most relevant criticality per RCS has to be 
represented. 


Figure 5. Schematic view of the identified hazard scenario for the case study. 


Table 1. Lagging indicators. 

RCS1: Operating procedures 
  Desired outcome: 
  - Correct tank selection and operation of the substance transfer 
  - Correct cleaning, isolation and emptying 
  Selected lagging indicator: 
  - No. of times the substance transfer does not proceed as planned 

RCS2: Inspection and maintenance procedures 
  Desired outcome: 
  - No unexpected loss of containment due to failures of the crane, 
    incorrect handling of the load or failure of the control instrumentation 
  - No fires or explosions caused by faulty or damaged electrical elements 
  Selected lagging indicator: 
  - No. of unexpected LOCs due to failures of the crane, incorrect 
    load handling or failure of the control instrumentation 

RCS3: Crane operating procedures 
  Desired outcome: 
  - Correct execution of the load lifting/handling 
  - Identification of lifting restriction areas 
  - No involvement of escape routes in the load trajectory 
  - No interference with other equipment operations 
  Selected lagging indicator: 
  - No. of times the load lifting/handling does not proceed as planned 

RCS4: Work permit procedure 
  Desired outcome: 
  - High-risk maintenance activities are undertaken in a way that 
    will not cause damage/injury 
  Selected lagging indicator: 
  - No. of incidents due to error during maintenance 

RCS5: Emergency procedures 
  Desired outcome: 
  - Minimum consequence in case of LOC 
  Selected lagging indicator: 
  - No. of elements of the emergency procedures that fail 

RCS6: VGS 
  Desired outcome: 
  - Full view of the working area and warning in case of approaching 
    collision between load/crane and any obstacle in the workspace 
  Selected lagging indicator: 
  - No. of impacts load/crane-obstacles 


Table 2. Leading indicators. 



Columns: RCS / Critical elements / Selected leading indicator 


RCS1: Operating procedures 
Critical elements: (1) Procedures contain all elements (key 
actions, tasks including emergency actions); (2) Procedures 
are clearly written and easy to be understood; (3) Procedures 
are kept up to date; (4) Information and training covering: 
hazardous properties of products, communication systems, 
pre-transfer checks, load transfer controls and monitoring, 
post-transfer checks and emergency actions. 
Selected leading indicators: (1) Percentage of procedures 
reviewed and revised within the reference period; 
(2) Percentage of staff attending safety courses within the 
reference period. 


RCS2: Inspection and maintenance procedures 
Critical elements: (1) Scope and frequency of the inspection 
and maintenance; (2) Failures of critical elements of the 
crane and identified malfunctions. 
Selected leading indicator: Percentage of critical elements 
of the crane inspected and repaired. 


RCS3: Crane operating procedures 
Critical elements: (1) Procedures contain all elements (key 
actions, tasks including emergency actions); (2) Procedures 
are clearly written and easy to be understood; (3) Procedures 
are kept up to date; (4) Execution of risk analysis for 
crane-operation; (5) Training covering: hazardous properties 
of products handled, communication systems, load transfer 
controls and monitoring, and emergency actions. 
Selected leading indicators: (1) Percentage of procedures 
reviewed and revised within the reference period; 
(2) Percentage of activities covered by a preliminary risk 
assessment; (3) Percentage of staff trained within the 
reference period. 


RCS4: Work permit procedure 
Critical elements: (1) Scope of activities covered by the 
permit-to-work is clearly identified; (2) Permits specify 
hazards, risks and control measures; (3) Permits are only 
issued according to proper authorisation procedures; 
(4) Duration of the permit; (5) Work is conducted as per 
permit conditions, including demonstration of satisfactory 
completion of work. 
Selected leading indicators: (1) Percentage of permits to 
work issued where the hazards, risks and control measures are 
adequately specified; (2) Percentage of work conducted in 
accordance with permit conditions. 


RCS5: Emergency procedures 
Critical elements: Emergency plan covers all relevant 
elements (testing of emergency plan, raising alarm, 
shutdown/isolation procedures, firefighting, communication, 
evacuation). 
Selected leading indicators: (1) Percentage of elements that 
have not failed; (2) Percentage of staff/contractors who take 
emergency actions correctly. 


RCS6: VGS 
Critical elements: (1) Device correctly shows the workspace, 
including its elements; (2) Alarm activated at the desired 
set points; (3) Knowledge of tasks and relevant experience 
about substances, work processes, hazards and emergency 
actions. 
Selected leading indicators: (1) Percentage of correct 
indication of the view; (2) Percentage of warning at the set 
point; (3) Percentage of staff having at least 10 years of 
experience. 
Figure 6. Example of tolerance set (source: HSE, 2006). 


The HSE approach suggests that, once the 
indicators have been defined, a tolerance value 
should be assigned to each leading indicator so that 
any deviation in the performance of the related RCS 
can be observed. This allows plant management 
to decide which actions to take to correct the 
deviation. For instance, for the indicator 
“Percentage of elements that have not failed during 
simulation”, selected for the RCS “Emergency 
procedures”, the tolerance may be set to zero, which 
means that 100% of actions must be correctly 
performed. Alternatively, a certain degree of 
failure could be accepted if this has previously been 
highlighted by the management team; in that case 
the acceptable value of the indicator could be set 
below 100%. 
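
The tolerance check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the field names and the three status labels (loosely following Figure 6) are hypothetical, and the numeric values are invented for the example rather than taken from the case study.

```python
from dataclasses import dataclass


@dataclass
class LeadingIndicator:
    """A percentage-type leading indicator with a tolerance band (hypothetical)."""

    name: str
    target_pct: float     # desired value of the indicator, in percent
    tolerance_pct: float  # largest accepted shortfall below the target, in percent

    def status(self, successes: int, total: int) -> str:
        """Classify one measurement period against the tolerance band."""
        if total <= 0:
            raise ValueError("total must be positive")
        measured = 100.0 * successes / total
        if measured >= self.target_pct:
            return "acceptable"
        if measured >= self.target_pct - self.tolerance_pct:
            return "within tolerance"
        return "beyond tolerance"


# Zero tolerance: every element must succeed during the simulation.
strict = LeadingIndicator("elements not failed during simulation", 100.0, 0.0)
# Assumed relaxed setting agreed beforehand by the management team.
relaxed = LeadingIndicator("elements not failed during simulation", 100.0, 10.0)

print(strict.status(19, 20))   # 95% measured with zero tolerance
print(relaxed.status(19, 20))  # 95% measured with a 10-point tolerance
```

With zero tolerance a single failed element already counts as a deviation, whereas the relaxed setting absorbs it; which setting applies is a management decision, not something the code can determine.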

Figure 6 shows an example of a tolerance set for 
a leading indicator. However, the aim of this paper 
is the definition of risk indicators for lifting and 
handling operations; the assignment of tolerance 
levels to each leading indicator is left to future 
study. 

As underlined above, the choice of indicators 
that provide a warning when collisions could lead 
to LOCs is based on the ability to monitor whether 
RCSs are operating as desired. Moreover, the use of 
a proper set of risk indicators, which have to be 
collected, processed and evaluated on a regular 
basis, is fundamental to integrating dynamic 
features into risk assessment and carrying out a 
Dynamic Risk Assessment. 


5 CONCLUSION 


Risk indicators for load lifting and handling in 
the chemical industry have been developed according 
to the HSE methodology, which has been adapted 
so that it can be integrated into the dynamic risk 
assessment procedure. 

The application and validation of these 
indicators were carried out with reference to a case 
study of an alkylation unit of a refinery. Firstly, 
the hazardous scenario (loss of containment from a 
hydrofluoric acid tank) that can lead to a major 
accident (toxic dispersion), and its initial cause 
(dropped load), were identified. This preliminary 
analysis allowed the RCSs in place to prevent or 
mitigate the effects of the dispersion to be 
identified. For each RCS, lagging and leading 
indicators have been defined. 

The resulting indicators highlighted criteria for 
optimizing the process while accounting for proper 
safety levels. The use of a safety device, such as 
the VGS, could be an effective solution to control 
crane operations, to guarantee safety with respect 
to LOCs initiated by a hindered view of the 
workspace and, finally, to integrate dynamic 
parameters within risk models with a view to 
carrying out a DRA. 
ABSTRACT: The increasing migration of people to urban areas around the world increases 
man’s adverse impact on the environment and therefore on climate change, which presents 
one of the most important challenges of modern civilization. In urban areas, this impact 
leads to the occurrence of many extreme events, which means multidimensional losses. Over the decades, 
hydrological catastrophes, especially floods, have been studied so as to control and monitor risks. However, 
this type of risk assessment has to meet multiple objectives that often conflict with each other. This paper 
proposes a multidimensional framework which allows researchers to use approaches that support decision 
makers in the modelling and analysis of the risk of flooding in urban areas. This includes rating risk while 
considering multiple factors and responding to risk using actions, such as mitigation measures, which 
prioritize preventive actions to combat disasters throughout a cyclical process. Some potential benefits of 
using this approach are discussed. 


1 INTRODUCTION 


Ever since human beings first formed communities, 
they have felt the need to interact with each other 
and to be in harmony with the environment. In 
fact, living in isolation is not conducive to creating 
a common culture, to using language or to exercis- 
ing the human mind (Chatfield 2016). This means 
not only that people are interdependent—which 
they rarely admit—but also that it is interdepend- 
ence which underpins the progressive development 
of technology which is used to adapt this envi- 
ronment so as to bring comfort to a social group. 
A look at several points in history confirms this 
behavior, thereby revealing why human beings have 
needed to harness and exploit the natural resources 
at their disposal and how they have done this. 

Thus, society has a cyclical co-dependence on 
technology; in other words, they cannot be sepa- 
rated. This reality in post-modern society has 
become much more apparent, for example, in terms 
of communications, forms of transport, modes of 
learning and doing business or of the artefacts 
of comfort. Relevant tools or practices have been 
developed which simplify the way humans do 
things. 

In spite of technology in itself not being harm- 
ful to society, its use to achieve specific goals can 
give rise to negative impacts on human life. 

As to the environmental dimension, some 
features of technology are, in fact, designed to 
control and monitor natural phenomena. However, the 
growing dependence on technology in all spheres 
of life has nevertheless led to global pollution 
levels increasing because of human behavior. This 
has had an impact on climate change which, coupled 
with other negative human behavior, has caused 
natural disasters (Merchant 2014). 

When the focus of attention turns to urban 
areas, it is apparent that the unrestrained urbani- 
zation of cities has created fragmented spaces into 
which people have been segregated, thus making 
cities more vulnerable not only to social inequal- 
ity but also to the consequences of environmental 
degradation. As a result, several natural problems 
adversely affect, in particular, people who suffer 
from social and economic disadvantages. Natural 
catastrophes which are considered to be geophysi- 
cal, meteorological, hydrological and climatological 
events—such as landslides, hurricanes, floods, and 
fires—have affected more than one billion people, 
in this century alone, and caused US$ 2,512 billion 
in economic losses worldwide (MunichRe 2017). 

With this in mind, hydro-meteorological phe- 
nomena deserve special attention, since they 
reinforce the impact of climate changes brought 
about by human activities. For example, these have 
caused increases beyond tolerable proportions in 
the emission of carbon dioxide, deforestation, the 
growth of metropolises and the manipulation of 
river basins, all of which have played a part in there 
being more flood events in urban areas. 
In this context some studies have been 
developed to reach a more specific understanding of 
the relationship between the risk of floods (which 
have occurred more and more frequently over the 
decades) and changes in the climate. In the urban 
context, some research studies seek to improve 
prevention and emergency plans in large cities in 
order to strengthen urban resilience and reduce 
vulnerability (Ke et al. 2012, Kourgialas & 
Karatzas 2016, Pilone et al. 2017, van Wesenbeeck 
et al. 2016). 

Thus, by considering climate change issues that 
affect the environment of urban cities, this article 
sets out to analyze how to control and mitigate 
floods by implementing strategic projects. 

Assessing the risk of floods, which are extreme 
events, and the policies used to mitigate them 
requires taking account of multiple strategic 
objectives that are often conflicting and that 
integrate three dimensions: social, economic and 
environmental. That is why some researchers seeking 
approaches to support decision makers (DMs) in their 
decisions have applied multi-criteria methods. 

This study proposes a multi-criteria framework 
to model and analyze the risks of flooding in urban 
areas that takes human, social and environmental 
factors into account. In addition, this framework 
seeks to help public policy prioritize preventive 
actions so as to combat disasters. It does so by 
setting out a detailed classification of risk. 

The present paper is structured as follows: Sec- 
tion 2 presents a literature review on the impacts 
of climate change on urbanization, in order to 
justify what prompted this paper. Sections 3 and 4 
present research that addresses mitigating the risk 
of disaster, a review of the literature on risk and 
multidimensional analysis for decision support. 
Section 5 describes the proposed framework for 
flood risk management in urban areas, while Sec- 
tion 6 makes some final remarks and suggestions 
for future studies on the subject. 


2 IMPACTS OF CHANGES IN THE 
CLIMATE ON URBANIZATION 


Although the issue of climate is one of the most 
important challenges of modern civilization, stud- 
ies undertaken in the twentieth century pointed 
out that urbanization was one of the main factors 
driving global growth and this had the potential to 
cause natural disasters (Mitchell 1993). 

In fact, the ever greater migration of people 
to urban areas around the world changed the 
interaction between man, natural resources and 
the environment as a whole (the Earth and the 
atmosphere). People’s behavior led to their actions 
having an increasing impact on the environment 
and thereby contributed to climate change. 

Chen & Frauenfeld (2016) investigated what the 
impact on the climate of urbanization in China 
might be in the future. They used a model to project 
what China’s climate could be by 2050 and found 
it would be drier and warmer, thus indicating that 
human activity will have catastrophic effects. 

It needs to be borne in mind that there is con- 
siderable uncertainty associated with climate 
changes, and therefore it is a challenge to establish 
secure projections that can be used to inform the 
management of risk. This is because projections 
about extreme weather events and changes in the 
climate are forecasts of their frequency, intensity, 
duration, and month or season of occurrence, 
which, by definition, cannot be estimated with 
certainty. 

Studies published by IPCC—about global tem- 
perature increase—suggest that, on analyzing dif- 
ferent scenarios, global temperatures will increase 
by about 1 degree Celsius between 2000 and 2040 
(IPCC 2012). 

O’Brien & O’Keefe (2014) assert that while 
such studies indicate that there is enough time to 
develop mitigation actions, this may be misleading 
because there is little “wiggle room” and decisions 
on mitigation trajectories and deadlines need to be 
made soon. 

In addition, the interference of political factors 
contributes to the complexity of managing risks. 
Giddens (2009) affirms through his paradox that 
governments will not act until something really 
goes wrong by which time it is too late to take 
corrective actions. The explanation for this is that 
politicians keep postponing making difficult deci- 
sions if they believe this could be to their electoral 
disadvantage and therefore would lead to their 
not staying in power. Examples include the need 
to increase taxes significantly to pay for major 
projects such as protecting coastal cities; causing 
large-scale unemployment as a result of introduc- 
ing stringent pollution controls or running down 
extractive industries; and building new nuclear 
power plants near heavily-populated areas. 

Given this set of circumstances, little progress 
has been made in international climate 
negotiations, and the planning of preventative and 
mitigating actions for both long-term and extreme 
events is needed urgently. 

Therefore, effective risk management strategies 
are needed to counteract the increasing occurrence 
of natural disasters in the urban environment. 

As an example, NATECH events have, to date, 
had huge impacts worldwide, and consequently, 
modern society has started to pay great attention 
to them and to call for appropriate actions. 
Figure 1. Number of relevant extreme events 
worldwide from 1980 to 2016. Source: MunichRe (2017). 


Thus, Nascimento & Alencar (2016) made a 
systematic review of the literature on these events 
because of their great impact and due to their high 
level of complexity in terms of risk management. 
Their study revealed the surge in seeking improved 
scientific knowledge on how to mitigate extreme 
events, due to their occurring more and more fre- 
quently, year after year. 

However, the nature and severity of the extreme 
impacts on climate cannot be assessed only by 
their intensity, but must also include assessing the 
exposure and vulnerability of the areas most likely 
to be affected (IPCC 2012). 

Therefore, there is an urgent need to understand 
how natural disasters, which are increasing 
in intensity, impact the critical infrastructures of 
urban centers. These infrastructures are very 
often integrated, which is essential for the 
maintenance of community life; consequently, if any 
of them breaks down, the impact on the rest of the 
system is all the more severe and potentially 
catastrophic. 

Due to this need, the literature now contains 
a growing number of studies on how to tackle 
this problem, an example of which is et al. 2017). 
They propose that urban infrastructure systems 
be improved by structuring the problem in detail, 
and do so by involving a wide range of specialists 
including those from the areas of energy, drain- 
age, transportation, and sanitation. Their model 
sets out how to conceptualize projects, actions and 
policies in order to reduce the exposure of areas in 
urban centers to extreme events. 

Figure 1 illustrates the occurrence of relevant 
events, which have been officially registered since 1980, 
worldwide. The statistics show that the frequency of 
hydrological events, of which floods are predominant, 
has grown dramatically (MunichRe 2017). 

The proportion of hydrological events as meas- 
ured against all other extreme events has grown 
from 50% in the 80s, to about 80% in the 2010s 
and, moreover, in 2016 they were nearly equal in 
number to all other events. In view of this, the 
authors were prompted to undertake the research 
that led to producing this article. 


3 FLOOD RISK MANAGEMENT AND THE 
DECISION MAKING PROCESS 


Adverse impacts are considered disasters when 
they produce widespread damage and cause serious 
changes in the normal functioning of urban socie- 
ties. Such events may occur because of extremes 
of climate and the exposure and vulnerability of 
places to these. Changes in climate can arise from a 
wide range of factors, including anthropogenic cli- 
mate change, natural climate variability, and socio- 
economic development (IPCC 2012). 

That is why research studies seek to investigate 
how risk management can be used to reduce expo- 
sure and vulnerability and increase resilience to 
the potential adverse impacts of extreme climates, 
even if risks cannot be completely eliminated. As a 
result, ways to adapt to and mitigate these events 
can complement each other and together can dra- 
matically reduce the risks that arise from climate 
change. 

As to the context of floods, however, the quali- 
tative and quantitative evaluations of these factors 
are subject to various uncertainties (Aven 2012). 
Therefore, it must be understood that the context 
of risk and hydrological uncertainty are important 
concepts that need to be addressed to support deci- 
sion making in many situations. 

The challenge is to know how to describe, 
measure and communicate risk and uncertainty. 




This involves economic factors, including poten- 
tial financial losses, and environmental and social 
impacts, such as estimating the number of fatalities 
and the impact on health services and the sanita- 
tion systems of large cities. 

Several papers in the literature seek to quantify 
and analyze risks for which various tools are used, 
such as decision analysis (Cuellar & McKinney 
2017) and Hydro-Meteorological analysis (Patra 
et al. 2015). However, this paper tackles the multi- 
objective point of view as a motivator to analyze 
in an integrated manner the dimensions involved, 
should an extreme event occur. This is set out in 
the following section. 


4 MULTIDIMENSIONAL RISK 
EVALUATION 


The multi-objective feature of a problem that a 
risk mitigation policy should reveal can be used in 
multicriteria analysis. 

The application of Multiple Criteria Decision 
Making/Aiding (MCDM/A) methods seeks to 
establish preference relations to evaluate several 
alternatives against various previously determined 
criteria during the decision process. 

Due to the plurality of points of view that 
directly influence the decision, de Almeida et al. 
(2015) pointed out that the methods developed 
seek to translate the way people have always made 
their decisions. Despite the diversity of existing 
methods, the basic elements are simple: a set of 
undominated actions evaluated by considering at 
least two criteria that cannot be translated into 
economical evaluation and a decision maker. 
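
The idea of a set of undominated actions evaluated on at least two criteria can be illustrated with a short Python sketch. The action names and scores below are hypothetical and serve only to show how dominated alternatives are filtered out before a DM's preferences come into play; it assumes that every criterion is to be maximized.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every criterion and strictly
    better on at least one (all criteria are to be maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def undominated(actions: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of the actions that no other action dominates."""
    return [
        name
        for name, scores in actions.items()
        if not any(
            dominates(other, scores)
            for other_name, other in actions.items()
            if other_name != name
        )
    ]


# Hypothetical actions scored on two criteria (risk reduction, cost saving),
# both on a 0-10 scale where higher is better.
actions = {
    "levee": (8, 3),
    "drainage upgrade": (6, 6),
    "early warning": (4, 9),
    "sandbag stockpile": (3, 5),
}
print(undominated(actions))
```

Here "sandbag stockpile" is dropped because "drainage upgrade" is better on both criteria; choosing among the remaining three actions requires the DM's preferences, which is where an MCDM/A method is needed.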

Thus, several studies over the decades have 
allowed great advances in modelling problems, 
using multicriteria aggregation methods that come 
ever closer to the reality of their applications. 
Koksalan et al. (2011) and Edwards et al. (2007), 
for example, analyzed the evolution, history and 
perspectives of the MCDM/A area over time and 
emphasized this fact. 

Such advances have led researchers to apply 
MCDM/A methods in many areas: water distribution 
networks (Fontana & Morais 2017), research 
and development (R&D) projects (Karasakal & 
Aker 2017), cleaner energy and electricity market 
(Cucchiella et al. 2017), civil construction 
(Miniotaite 2017), and the financial sector 
(Ferreira et al. 2018), among other applications. 

In the risk management context, there are 
studies in the literature that integrate 
multicriteria analysis with environmental risk. 
Medeiros et al. (2017) enhance previous suggestions 
for a multicriteria decision model that evaluates the 
multidimensional risks of gas transportation by 
pipeline. Their model offers tools that help to 
prioritize maintenance efforts, and therefore to 
optimize the use of human, financial and other 
resources. 


Figure 2. Integration of dimensions for evaluation. 
Source: This paper (2017). 

In fact, MCDM/A methods have been used in 
many risk management contexts, and a systematic 
literature review by de Almeida et al. (2017) 
identified research trends in dealing with 
multicriteria models. These trends include offering 
a multidimensional view of problems and taking 
a DM’s preference structures into account. 

By choosing, ranking or sorting actions, these 
methods can make recommendations on how a 
DM should analyze results. Thus, multi-dimen- 
sional methods admit subjectivity as part of the 
decision process, and support DMs by using algo- 
rithms and methodologies that aid them to build 
their preference structure. This way, the DMs are 
able to obtain more robust and reliable results, 
which might be submitted to sensitivity analysis. 

Thus, a multidimensional risk assessment can be 
carried out in order to incorporate subjective 
elements and elements in the social environment 
that have hitherto been less exploited, as presented 
in Figure 2. This leads to putting forward a 
framework for flood risk management in urban areas, 
in Section 5. 


5 A FRAMEWORK PROPOSAL FOR 
MANAGING RISK FROM FLOODS 


In this research study, a decision support frame- 
work was developed not only so as to understand 
the risks from floods and to integrate the dimen- 
sions of risk, but also so as to use this information 
to improve how projects are selected and portfo- 
lios of projects are managed, where a DM faces 
uncertainty. 
Figure 3. A framework proposal for managing flood 
risk. Source: This paper (2017). 


The framework presented in Figure 3 is divided 
into 5 steps. It starts by collecting data to provide 
the model with enough parameters to model the 
problem. 

Broadly speaking, the framework proposes 
using a multidimensional evaluation for rating risk, 
which has already been developed, to define possi- 
ble actions to avoid risk. In this context, when this 
risk can be eliminated or mitigated, the framework 
then leads to a way to model multicriteria that will 
be used to manage projects that combat risk. These 
projects are held in a portfolio and include alterna- 
tives that perform effectively with regard to avoid- 
ing the impacts of an extreme event. 

The framework proposal conducts the multi- 
dimensional flood risk analysis step by step, as 
described below: 


5.1 Risk identification 


This step corresponds to the initial stage of 
understanding the problem. A detailed 
characterization is made, and data are collected on 
important parameters which will be used as input 
for the risk modeling (considering its inherent 
probabilistic character). 

Flood risk identification covers (Rausand 2011): 


— Identify all flood events that are relevant 
throughout the area to be studied; 

— Describe characteristics such as the way, the 
type, the volume, when and where the flood was 
present in the system under study; 

— Identify possible related triggering events; 

— Assess if flood hazards could cause potential 
hazardous events; and 

— Collect hydrological data, such as historical 
series, return time and characteristics of the 
hydrographic basin under analysis, and clima- 
tological/natural data, such as climate, humidity 
of the air, wind and soil characteristics. 


If social and economic data, such as fatalities, 
economic losses, devastated areas, are available, 
it is important to take them into account and this 
could be done in this step (Steijn et al. 2017). 


5.2 Modelling for assessment and risk analysis 


This stage comprises the process of modeling the 
problem using the data collected in order to quan- 
tify the flood risk in the area of study by consider- 
ing the influence and interaction of natural, social 
and economic aspects of floods. Thus, a mul- 
ticriteria model is used along with the hydrologi- 
cal model to analyze and determine this measure 
of risk, using methods that are appropriate to the 
problem of risk rating in the area of study. 

Thus, several procedures have been presented 
in the literature to support DMs in constructing 
a model. For example, de Almeida et al. (2015) 
divide the procedure into 3 phases—in successive 
refinements: 


a. A preliminary phase, in which the decision mak- 
ers, the objectives, the criteria used to model 
these objectives, the space of actions and the 
problematic are defined. In addition, uncon- 
trolled factors are identified. 

b. A preference modelling phase, in which the 
MCDM/A method to be used is chosen and the 
DMs’ preferences are modelled; and 

c. The finalization phase, during which the evalu- 
ation of alternatives and recommendation are 
presented, and the decision is implemented. 


This way, building MCDM/A models to represent 
real problems can be regarded as a creative 
process, in which the intellectual and cultural 
background of the DMs is important for 
understanding its complexity. 
However, an analyst helps the decision makers in 
all the phases mentioned previously by giving 
factual information about the problem. Through the 
interaction of all the actors in this process, DMs 
improve their perception of the objectives, 
criteria and space of actions, which enriches the 
model used for the evaluation and implementation of 
the decision. 

In the context of risk identification, in general, 
this step 


— estimates unknown parameters of the model; 

— uses an MCDM/A approach to generate an esti- 
mate of the risk indices and considers a broad 
range of different aspects. 


It should be noted that the elements of a deci- 
sion model, such as objectives, criteria and pref- 
erences, are constructed in order to promote a 
realistic assessment of the flood risk. This includes 
making recommendations that arise from rating 
the risks and establishing what the response to the 
risks should be. 
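
As an illustration of how an MCDM/A approach can generate risk indices, the sketch below uses a weighted additive aggregation over the social, economic and environmental dimensions. This is an assumption made for the example: the text does not prescribe a particular aggregation method, and the weights, zone names and scores are all hypothetical. Scores are taken to be already normalised to [0, 1], where 1 is the worst consequence.

```python
from typing import Dict, List

# Hypothetical weights; in practice they would be elicited from the DM.
WEIGHTS: Dict[str, float] = {"social": 0.5, "economic": 0.3, "environmental": 0.2}


def risk_index(scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted additive risk index for one zone."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[c] * scores[c] for c in weights)


def rank_zones(zones: Dict[str, Dict[str, float]]) -> List[str]:
    """Order zones from highest to lowest risk index."""
    return sorted(zones, key=lambda z: risk_index(zones[z]), reverse=True)


# Hypothetical urban zones with normalised consequence scores per dimension.
zones = {
    "riverside": {"social": 0.9, "economic": 0.6, "environmental": 0.4},
    "downtown": {"social": 0.5, "economic": 0.9, "environmental": 0.2},
    "uplands": {"social": 0.2, "economic": 0.1, "environmental": 0.3},
}
print(rank_zones(zones))
```

An additive model assumes that a poor score on one dimension can be compensated by a good score on another; a DM who rejects that trade-off would need a non-compensatory method instead.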


5.3 Response to risk 


As to disaster risk management, several 
procedures divided into stages have been developed 
in the literature; among them, O’Brien & O’Keefe 
(2014) divide it into 4 phases in a cyclical 
process. 

Figure 4 shows that the disaster management 
function is focused on preparedness for an event 
and then responding to that. 

Figure 4. Disaster management cycle. Source: O’Brien & 
O’Keefe (2014). 


The authors commented that the recovery 
stage is often treated as the responsibility of other 
bodies. Mitigation efforts are dealt with similarly, 
but typically there will be links with the disaster 
management function regarding the type of mitigation 
measures; for example, the scale and location of 
flood defenses. 

In the context of MCDM/A methods (de Almeida 
et al. 2015), categorizing risk based on the disaster 
management cycle allows DMs to respond to risk 
in three ways: 


— Accept: there is no action to be implemented, 
since it takes a lot of time to prepare a strat- 
egy to manage risk or high-cost actions will be 
needed to deal with it; 

— Transfer: outsource or share risk to a third party 
or parties that can manage the outcome; 

— Eliminate/mitigate: action is taken to eliminate 
the causes of threats wherever possible or reduce 
the likelihood of occurrence of the risk or result- 
ing consequences. 
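
The three responses listed above can be written down as a small Python enumeration. The selection rule in `choose_response` is purely hypothetical: the text names the categories but gives no rule for choosing among them, so the function only shows one way such a rule might be encoded, with cost and budget as assumed inputs.

```python
from enum import Enum


class RiskResponse(Enum):
    ACCEPT = "accept"
    TRANSFER = "transfer"
    MITIGATE = "eliminate/mitigate"


def choose_response(mitigation_cost: float, budget: float,
                    third_party_available: bool) -> RiskResponse:
    """Illustrative rule only: mitigate when affordable, otherwise transfer
    the risk if a third party can take it on, otherwise accept it."""
    if mitigation_cost <= budget:
        return RiskResponse.MITIGATE
    if third_party_available:
        return RiskResponse.TRANSFER
    return RiskResponse.ACCEPT


print(choose_response(mitigation_cost=80.0, budget=100.0, third_party_available=False))
print(choose_response(mitigation_cost=150.0, budget=100.0, third_party_available=True))
print(choose_response(mitigation_cost=150.0, budget=100.0, third_party_available=False))
```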


Therefore, if the assessed risk can be 
eliminated/mitigated, then potential flood control 
alternatives can be established, while taking the 
critical infrastructures that are affected by it into 
account. This creates a discussion among scientific 
communities, companies, the local population and 
public administration to help DMs generate a 
portfolio to be analyzed at a later stage. 


5.4 Portfolio analysis 


As a set of projects or programs and other activi- 
ties, gathered for effective management, the port- 
folio constructed from the risk analysis of the 
delimited area should be managed in order to 
select the set of projects that maximize satisfaction 
in social (such as risk by population), economic 
(loss reduction), and natural (environmental pres- 
ervation) ways. 

However, it is known that these objectives often 
conflict with each other, which prompts the 
application of multicriteria decision support 
methods to this problem. 

Since DMs seek to prioritize their strategic 
goals, which demonstrates what really matters to 
the organization, a portfolio needs financial and 
human resources but, generally speaking, the 
amounts required exceed the limits available (Lar- 
son & Gray 2011), and the support of MCDM/A 
methods is necessary for its resolution. 

In addition, the interaction of all actors 
is similar to that in the first MCDM/A model (second 
step), and it contributes to better results. 

Broadly speaking, the portfolio problem con- 
sists of choosing, within a set of actions, a sub- 
set that meets the objectives and constraints. This 
results in prioritizing projects and then implement- 
ing them. 
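
The portfolio problem described above, choosing a subset of actions that meets the objectives and constraints, can be sketched as a small exhaustive search. The project names, costs and values are hypothetical; the single budget constraint and the additive portfolio value are simplifying assumptions, and enumeration is only practical for a handful of projects.

```python
from itertools import combinations
from typing import Dict, Tuple

# Hypothetical projects: name -> (cost, aggregated multicriteria value).
projects: Dict[str, Tuple[float, float]] = {
    "detention basin": (40.0, 0.60),
    "drainage upgrade": (30.0, 0.45),
    "early warning system": (15.0, 0.30),
    "riverbank reforestation": (20.0, 0.25),
}


def best_portfolio(projects: Dict[str, Tuple[float, float]],
                   budget: float) -> Tuple[Tuple[str, ...], float]:
    """Enumerate every subset and keep the feasible one with the highest total value."""
    best: Tuple[str, ...] = ()
    best_value = 0.0
    names = list(projects)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            cost = sum(projects[p][0] for p in subset)
            value = sum(projects[p][1] for p in subset)
            if cost <= budget and value > best_value:
                best, best_value = subset, value
    return best, best_value


selected, total_value = best_portfolio(projects, budget=65.0)
print(selected, round(total_value, 2))
```

In this example the most valuable single project is left out, because three cheaper projects together deliver more value within the same budget.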
5.5 Control and monitoring 


This is the last step of the framework and it allows 
a monitoring plan to be implemented. The plan might 
record not only the risks to be addressed, but also 
the approval of required projects and the results of 
the actions developed to mitigate floods. Thus, the 
step promotes a continuous improvement in the ability: 


— to design and execute projects; 

— to assess the risk and its implications; 

— to understand the problem and to model the 
DM’s preferences more coherently; 

— to learn and exploit opportunities to benefit 
affected populations. 


This stage is required to: 


— Ensure the implementation of risk plans and 
assess their effectiveness in reducing risk. 

— Follow the identified risks, including the watch 
list. 

— Monitor residual risks and identify new risks 
arising from project implementation. 

— Assess the perception of risk by the population. 


Thereafter, a cyclical process is implemented 
for continuous risk assessment, which involves 
updating the framework and generating new input 
data. 

It needs to be pointed out, however, that this 
paper seeks to propose a framework, the consoli- 
dation of which will be based on a careful analysis, 
step by step, of the modeling to be done, in order 
to correctly evaluate the problem, according to the 
simplifications, specifications and constraints of 
the model. 


6 FINAL REMARKS AND FUTURE 
STUDIES 


Combating and preventing flood risks requires a 
broad and integrated view of the social, economic 
and environmental dimensions involved. The devel- 
opment of risk responses in urban areas requires a 
multi-dimensional analysis of flood risk. Thus, the 
framework presented seeks to include multivariate 
elements to control risk in a more balanced and 
coherent way, thereby making cities less vulner- 
able and less exposed to extreme events, while their 
resilience grows due to efficient urban planning. 
Some final considerations can be made: 


— When implemented in practice, this framework 
may contribute to diagnosing the resilience of 
urban areas; 

— The multicriteria analysis can give balanced and 
coherent recommendations, in order to increase 
the perception of the risk and reduce exposure 
and vulnerability; 


— The cyclical analysis can be understood as 
a learning process and new information of 
parameters and modelling can be inserted as 
procedures. In addition, elements of the decision 
model—objectives, criteria and evaluation—can 
be better estimated. Thus, it can promote better 
portfolio management. 

— The modeling of preferences becomes a 
challenge for DMs, so prior support is required 
from analysts to understand not only the problem 
but also the MCDM/A method used. Knowledge of 
the multidimensional problem thus makes the 
process more reliable when implementing the 
decision. 


As future studies, we intend to apply this 
framework to cities around the world where floods 
have a considerable impact on critical 
infrastructures as well as on citizens. Therefore, 
accurate data need to be collected so that there will 
be enough parameters to guarantee the benefits 
mentioned previously when applying this approach. 

Once the decision environment of this problem 
is dynamic, the framework can be extended for 
group decision analysis. The aim is to translate 
strategic objectives of public power into effective 
actions in the fight against urban floods in order to 
reduce the damage caused by this natural disaster 
worldwide. 

ABSTRACT: Risk analysis has been widely used in climate adaptation practice. However, traditional
probabilistic risk analysis methods are not capable of tackling the unavailability or incompleteness of
climate risk data. To deal with such challenges, this paper applies an advanced Fuzzy Bayesian
Reasoning (FBR) model to the climate risk analysis of railway systems in the UK. Its novelty lies in the
realisation of climate risk ranking under high uncertainty in data and in its practical contribution to
understanding the risk perception of stakeholders in the UK railway systems. To test the feasibility of
the developed model in the transport industry, a large-scale survey is conducted to collect data on the
timeframe of climate hazards, likelihood of occurrence, severity of consequences, and infrastructure
resilience for the analysis of climate risks threatening British rail systems. The findings will provide
transport planners with useful insights on the identification of high-risk climate hazards to facilitate
the development of cost-effective climate adaptation strategies.


1 INTRODUCTION 


Current variability in climate poses a challenge for 
rail infrastructure and operation. In the major- 
ity of countries, the related activities of transport 
systems are sensitive to different weather extremes, 
which include but are not limited to, changes in 
temperature, precipitation, thunderstorms, winds, 
visibility and sea level (e.g., Love et al. 2010). 

Risk analysis, as a critical element in climate
change adaptation, has been widely utilised
through a variety of approaches and techniques.
The selection of cost-effective climate adaptation
measures requires a systematic analysis of risk
reduction combined with the costs of implementing
these measures. A variety of methods in risk
quantification have been proposed (Wilby et al.
2009). However, the availability and effectiveness
of commonly used methods are challenged in
existing climate risk studies (Yang et al. 2015).
One of the main challenges is that objective data
are often unavailable or incomplete, so that risk
reduction and costs cannot be evaluated precisely.
Owing to the high level of uncertainty in these
data, many conventional risk assessment approaches
(e.g., Quantitative Risk Assessment (QRA)), which
have been widely applied to conduct risk analysis
in many sectors, become unsuitable (UNCTAD 2012).

Fuzzy set methods have been applied to climate
risk assessment on ports in several pioneering
studies to deal with this challenge (e.g., Ng et al.
2013, Yang et al. 2015, 2016, 2017). In assessing
climate change risks on ports, linguistic terms were
regarded as variables, namely, the stakeholders’
perceptions of the impacts of climate change (Yang
et al. 2015). These subjective input data (linguistic
terms) were modelled on the basis of the
stakeholders’ perceptions, collected from their
current interpretations of the frequencies, severity
of consequences and timeframes of the climate
risks. Building on these studies, this paper
develops a new Fuzzy Bayesian model by dividing
the risk parameters into two levels. The
junior-level parameters are “Timeframe (T)”,
“Likelihood (L)” and “Severity of Consequences
(C)”, to which a fourth, “Climate resilience (S)”,
is added. In particular, three sub-parameters of
“Severity of Consequences (C)”, namely “Damage to
Infrastructure (INF)”, “Injuries and/or Loss of
Lives (INJ)” and “Damage to Environment (ENV)”,
are added to extend and specialise the risk
parameters. As a result, this Fuzzy Bayesian
Reasoning (FBR) approach uses subjective
evaluations by domain experts to complement the
unavailable objective data, realising climate risk
ranking more precisely under high uncertainty
in data.

This paper aims to achieve two primary objectives.
Firstly, it develops a new FBR model by
subdividing existing parameters and adding new
ones to quantify the risk perception of the
stakeholders towards climate threats. This new
development will offer a foundation for further
adapting to climate risks and tackling the
uncertainties in climate risks. Secondly, and most
importantly, this FBR model is applied to the
British rail system through a nation-wide survey
amongst 21 major rail stakeholders in the UK. This
application will help to reveal the real climate
risks on UK rail from the perspective of both
industry and academia. We believe that the
outcomes of this study will be of considerable
value to rail planners, decision makers and
industrial practitioners, helping them to create
and implement adaptation plans, strategies
and practices.

The rest of the paper is structured as follows.
A critical review of risk analysis for climate
change is presented in Section 2. The FBR
methodology, including a step-by-step risk analysis
framework, is given in Section 3. Section 4
describes climate risks and adaptation on UK rail
systems, as well as the data collection through a
survey amongst the 21 rail stakeholders across the
UK. Finally, the discussion and conclusion,
including the contributions and implications for
further research, can be found in Section 5.


2 CRITICAL REVIEW: RISK ANALYSIS 
FOR CLIMATE CHANGE 


Current research on climate-related risk analysis
shares a common focus on interpreting and
identifying existing and future risks, estimating
the level of risk and determining the level of
uncertainty (Yang et al. 2015). A variety of methods
in risk quantification have been proposed, which
can be categorised into three groups: methods
requiring limited resources (e.g., sensitivity
analysis), methods with modest resource needs
(e.g., empirical downscaling) and methods with high
resource needs (e.g., dynamical downscaling) (Wilby
et al. 2009). However, traditional Probabilistic
Risk Analysis (PRA) methods are sometimes unable to
tackle the unavailability or incompleteness of
climate risk data (e.g., Yang et al. 2015). Fuzzy
sets and Bayesian Networks (BNs) have been applied
to climate risk assessment on ports in several
pioneering studies to deal with this challenge. In
this section, we review the development of fuzzy
theory and BNs, as well as their applications in
risk analysis.


2.1 Fuzzy set and fuzzy logic 


Mathematically, a fuzzy set can be defined by
assigning to each possible individual in the
universe of discourse a value representing its
grade of membership in the fuzzy set (Klir & Yuan,
1995). One of the main approaches to reasoning
under uncertainty is possibility theory, which is
also called fuzzy logic (Jensen, 2001). It is the
logic of classes with unsharp boundaries and has
been widely utilised to cope with the problem of
computer understanding of natural language (e.g.,
Zadeh, 1992; 2010). In recent decades, fuzzy sets
and fuzzy logic have become widely accepted tools
in risk assessment, providing a theoretical
framework for expert decision making under
uncertainty (e.g., Sii et al., 2002; Andrews & Moss,
2002). For instance, Singh & Benyoucef (2011)
applied F-TOPSIS to MCDM problems in the selection
of supply chain coordination. Yang et al. (2011)
employed F-TOPSIS for vessel selection in an
uncertain environment, and Yazdani-Chamzini (2014)
applied F-AHP and F-TOPSIS to the selection and
evaluation of available handling equipment.
Fuzzy set methods have been successfully applied
to climate risk assessment in recent years (Yang
et al. 2015). For example, in assessing the climate
risks on ports, linguistic terms were regarded as
variables, namely, the stakeholders’ perceptions of
climate impacts (Ng et al. 2013, Yang et al. 2015,
2016, 2017). On the basis of the stakeholders’
perceptions, collected from their interpretations
of the frequencies, consequences and timeframes of
climate risks, these subjective input data were
modelled to estimate climate risks. The fuzzy risk
score R was defined by using a discrete fuzzy set,
$R = C \circ (L \times T)$ (Ng et al. 2013), or a
fuzzy set manipulation approach,
$R = T \otimes C \otimes L$ (Yang et al. 2015).
Nevertheless, these methods could neither avoid the
loss of useful information in fuzzy operations nor
deal with a huge amount of risk input data (Yang
et al. 2016). A novel fuzzy-Bayesian model proposed
by Yang et al. (2017) overcame these issues by
taking advantage of fuzzy rule bases (i.e., IF-THEN
rules) in modelling the non-linear relations
between risk parameters and output, and of BNs in
realising fuzzy rule integration and risk
inference. In fuzzy IF-THEN rules, the antecedent
and conclusion parts contain linguistic variables,
which allows them to model the qualitative features
of experts’ knowledge and reasoning processes when
precise quantitative analysis is lacking.


2.2 Bayesian networks 


When multiple sets of data (from different experts)
are employed, it is difficult to use normal fuzzy
rule inference mechanisms, as the calculation could
take a long time. Bayesian Networks (BNs) are a
sound mathematical method for reducing uncertainty
and increasing knowledge. This is achieved by
combining the probability distributions or
functions of different parameters and updating
their probabilities when new information emerges
(Wang, 2003).

Bayesian modelling is a proven inter-disciplinary
tool (Tebaldi et al. 2005). BNs offer a variety of
benefits, including integrating different types of
variables and data within a single framework and
being updated efficiently when new information and
knowledge become available (Castelletti &
Soncini-Sessa, 2007, Cinar & Kayakutlu, 2010). In
particular, BNs are capable of compensating for the
absence of historical statistics and handling
incompleteness and uncertainty by combining various
pieces of information and making use of expert
judgments (e.g., Tighe et al., 2007). BNs have been
widely applied in multiple fields, such as safety
assessment, intelligent decision making and
computer network diagnosis, which involve
uncertainty due to incomplete data and limited
cognitive capacity (Zhang et al., 2013). They have
been employed for the estimation of future climate
change conditions, such as precipitation mean state
and seasonal cycle in South America (Boulanger
et al., 2007), and for the quantitative prediction
and assessment of long-term shoreline change
associated with sea level rise and its uncertainty
(Gutierrez et al., 2011). Bertone et al. (2015)
integrated BNs into participatory modelling, in
combination with System Dynamics, to develop a risk
assessment tool for managing water-related health
risks associated with extreme events.

Nevertheless, a major disadvantage of BNs in
incorporating expert judgements is that they may
fail to capture subjective fuzziness precisely in
probabilistic terms, as the experts consulted may
lack an understanding of probability theory.
Bayesian approaches have been criticised because
they often require too much information on prior
probabilities, which is hard to obtain in risk
analysis. Also, this information mainly relies on
subjective judgments, such as those drawn from
experts’ knowledge, which might introduce
unexpected bias into BNs (e.g., Tversky & Kahneman,
1975; 1990). BNs might become computationally
inefficient when models have a large number of
variables (Liang & Lee, 2008). They also cannot
process time-series data or address feedback
regulations (Ristevski, 2013).

To compensate for these disadvantages of BNs,
some theoretical research and applications have
identified the benefits of combining fuzzy logic
and Bayesian reasoning, especially in the
application of Fuzzy-Bayesian approaches to safety
and reliability (Bott & Eisenhawer, 2002). However,
more advanced and capable methods are required to
enhance structure learning algorithms, allowing
for constraints based on expert knowledge, with
precise rules for interventional risk management
and decision making (e.g., Zhou et al. 2014,
Constantinou et al. 2016).

Overall, the evaluation of climate change risks
on the rail systems in this paper may contain
various types of uncertainty. Moreover, owing to
the scarcity of historical/statistical data
(UNCTAD, 2012), the evaluation is carried out
through subjective judgments. Thus, combining fuzzy
set theory and BNs is appropriate for modelling
subjective linguistic variables and coping with
the discrete problem, while handling
incompleteness, uncertainty and complicated
calculation.


3 A NEW FBR RISK ANALYSIS 
FRAMEWORK 


This section presents a step-by-step risk
analysis framework based on the FBR approach.
The proposed framework for modelling climate
change risks consists of four main steps, each of
which is required for risk estimation and
assessment.


3.1 Identify environmental drivers 


Based on a review of the previous literature, we
investigated four primary environmental drivers
due to climate change that affect the British
railways: 1) temperature increase, 2) intensive
rainfall/flooding, 3) increased intensity and/or
frequency of high wind and/or storms, and 4) sea
level rise. Hence, the risk analysis is carried out
for each of these environmental drivers to evaluate
the risk level of the corresponding potential
climate threats.


3.2 Identify fuzzy input and output variables 


To define the subjective risk estimates, five
threat-based risk parameters are identified, at the
senior and junior levels respectively. The
senior-level parameter is “Risk Level (RL)”.
Following the definition of RL used in the FBR
climate risk analysis of port systems (e.g., Ng
et al. 2013, Yang et al. 2015, 2016), it is
expressed by such linguistic variables as “Very
High”, “High”, “Medium”, “Low” and “Very Low”.
Three junior parameters closely associated with
climate change risks have been identified in
previous studies, namely “Timeframe (T)”,
“Likelihood (L)” and “Severity of Consequences
(C)”, based on the FMEA approach (Yang et al. 2008,
2009, Ng et al. 2013, Yang et al. 2016). In this
paper, “Severity of Consequences (C)” is further
divided into three subcategories, namely “Damage to
Infrastructure (INF)”, “Injuries and/or Loss of
Lives (INJ)” and “Damage to Environment (ENV)”.
Meanwhile, “Climate resilience (S)”, introduced as
the fourth junior parameter in assessing climate
risks in railways, is characterised in Table 1.
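
As an illustration of how the fuzzy memberships listed in Table 1 could be evaluated, the following Python sketch treats each three-value tuple as a triangular fuzzy number and each four-value tuple as a trapezoidal one. This reading, the normalised [0, 1] resilience axis and the function name `membership` are assumptions made for illustration only; the paper does not prescribe an implementation.

```python
from typing import Sequence

# Fuzzy numbers for "Climate resilience (S)", copied from Table 1.
RESILIENCE_TERMS = {
    "VW": (0.0, 0.0, 0.1, 0.3),  # four values: read here as a trapezoid
    "W": (0.1, 0.3, 0.5),        # three values: read here as a triangle
    "A": (0.3, 0.5, 0.7),
    "S": (0.5, 0.7, 0.9),
    "VS": (0.7, 0.9, 1.0, 1.0),
}


def membership(x: float, points: Sequence[float]) -> float:
    """Membership degree of x in a triangular (a, b, d) or
    trapezoidal (a, b, c, d) fuzzy number (assumed interpretation)."""
    if len(points) == 3:
        a, b, d = points
        c = b  # a triangle is a trapezoid with a zero-width plateau
    elif len(points) == 4:
        a, b, c, d = points
    else:
        raise ValueError("expected 3 or 4 defining points")
    if b <= x <= c:
        return 1.0                  # on the plateau (or at the peak)
    if a < x < b:
        return (x - a) / (b - a)    # rising edge
    if c < x < d:
        return (d - x) / (d - c)    # falling edge
    return 0.0


# Degree to which a normalised resilience score of 0.4 belongs to each term.
degrees = {term: membership(0.4, pts) for term, pts in RESILIENCE_TERMS.items()}
print(degrees)
```

Under this reading, a score halfway between two adjacent peaks belongs equally to the two neighbouring terms and not at all to the others, which is the overlap that linguistic grading relies on.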


3.3 Establish a fuzzy rule-base network with the 
belief structure 


Fuzzy system theory, through collecting fuzzy IF- 
THEN rules from experts or domain knowledge 
and combining the rules into a single system, offers 
a systematic procedure for transforming knowl- 
edge bases to non-linear mappings (Sii & Wang 
2002, Yang 2006). 

In some real cases, subjective degrees of belief
(DoBs) are assigned to the linguistic variables
used to express the conclusion attribute (RL) in
order to model the incompleteness of expert
judgments. Following the climate risk models on
ports by Yang et al. (2016, 2017), in the fuzzy
rule-base network of this study the four
junior-level fuzzy input parameters, comprising 20
(5+5+5+5) linguistic variables, are combined to
generate 625 (5×5×5×5) antecedents, each with a
rational degree of belief (DoB) distribution. In
addition, we constructed a secondary-level network
between the three subcategory parameters (INF, INJ
and ENV) and the junior-level parameter C, in which
15 (5+5+5) linguistic variables are combined to
create 125 (5×5×5) antecedents.
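
The antecedent counts quoted above can be checked with a short enumeration. The Python sketch below is illustrative only: grades are represented by the integers 1 to 5 as a stand-in for the linguistic terms, and the helper name `antecedents` is hypothetical. It does not attempt to assign the DoB distributions, which come from the rule base itself.

```python
from itertools import product

N_GRADES = 5                       # five linguistic grades per parameter
JUNIOR = ("T", "L", "C", "S")      # junior-level parameters feeding RL
SUB_C = ("INF", "INJ", "ENV")      # subcategories feeding C


def antecedents(parameters, n_grades=N_GRADES):
    """List every combination of one grade (1..n_grades) per parameter."""
    grades = range(1, n_grades + 1)
    return [dict(zip(parameters, combo))
            for combo in product(grades, repeat=len(parameters))]


main_rules = antecedents(JUNIOR)   # IF-parts of the RL rule base
sub_rules = antecedents(SUB_C)     # IF-parts of the secondary network for C

print(len(main_rules), len(sub_rules))  # 625 125
print(main_rules[0])                    # {'T': 1, 'L': 1, 'C': 1, 'S': 1}
```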


3.4 Conduct risk inference by BN techniques 


The constructed belief structures can be used
to conduct risk inference using BN techniques.
First of all, the rule base with belief structures
is expressed in the form of conditional
probabilities. Using a BN technique, the
constructed FBR model can be converted into a
five-node converging connection. It includes four
parent nodes, NT, NL, NC and NS (Nodes T, L, C
and S), and one child node, NRL (Node RL).

Table 1. Climate resilience.

| Grade | Linguistic term | Description | Fuzzy memberships |
|---|---|---|---|
| 1 | Very Weak (VW) | Very weak (0-20%) capacity of the transportation system to anticipate, absorb, accommodate, or recover from the effects of a climate event and requiring a very long period (a year) and very high cost of recovery (£10 million above) | (0, 0, 0.1, 0.3) |
| 2 | Weak (W) | Weak (20-39%) capacity of the transportation system to anticipate, absorb, accommodate, or recover from the effects of a climate event and requiring a long period (a month) and high cost of recovery (£1 million above) | (0.1, 0.3, 0.5) |
| 3 | Average (A) | Average (40-59%) capacity of the transportation system to anticipate, absorb, accommodate, or recover from the effects of a climate event and requiring certain length of time (a week) and cost of recovery (£100,000-£1 million) | (0.3, 0.5, 0.7) |
| 4 | Strong (S) | Strong (60-80%) capacity of the transportation system to anticipate, absorb, accommodate, or recover from the effects of a climate event in a relatively timely and efficient manner (a day) and requiring some cost of recovery (£10,000-£100,000) | (0.5, 0.7, 0.9) |
| 5 | Very Strong (VS) | Very strong (80% above) capacity of the transportation system to anticipate, absorb, accommodate, or recover from the effects of a climate event in a very timely and efficient manner (12 hrs) and requiring slight cost of recovery (0-£1,000) | (0.7, 0.9, 1, 1) |

The prior probabilities of NT, NL, NC and NS
can be obtained through questionnaire surveys
(i.e., on rails). The prior probabilities of NT,
p(Ti), for example, can be obtained by asking the
question, “Using the defined linguistic variables
(i.e. VH, H, A, L and VL), how likely is it that
the effect will occur when you first expect to see
this climate threat pose impacts on the rail
network with which your organisation is
associated?”, to collect data on the domain
experts’ risk perception. By analysing the prior
probabilities of the four nodes, the marginal
probability of NRL can be computed as (Jensen,
2001):
$$P(RL_h) = \sum_{i=1}^{5} \sum_{j=1}^{5} \sum_{k=1}^{5} \sum_{l=1}^{5} p(RL_h \mid T_i, L_j, C_k, S_l)\, p(T_i)\, p(L_j)\, p(C_k)\, p(S_l), \quad h = 1, \ldots, 5 \qquad (1)$$
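
To make the marginalisation in Equation (1) concrete, the Python sketch below computes P(RL_h) for a five-node converging connection with NumPy. Everything numeric in it is hypothetical: the response tallies are invented, and the conditional probability table is a placeholder in which the belief in RL_h is simply the share of parent grades equal to h. The paper's actual rule base and survey data are not reproduced here, and the function names are illustrative.

```python
import numpy as np

N_GRADES = 5


def prior_from_counts(counts):
    """Turn tallies of expert ratings per grade into a prior distribution."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()


def placeholder_cpt(n=N_GRADES):
    """Hypothetical CPT: cpt[h, i, j, k, l] = share of the four parent
    grades (i, j, k, l) equal to h. Not the rule base used in the paper."""
    cpt = np.zeros((n, n, n, n, n))
    for parents in np.ndindex(n, n, n, n):
        for grade in parents:
            cpt[(grade, *parents)] += 0.25
    return cpt


def marginal_rl(cpt, p_t, p_l, p_c, p_s):
    """Equation (1): sum over i, j, k, l of
    p(RL_h | T_i, L_j, C_k, S_l) p(T_i) p(L_j) p(C_k) p(S_l)."""
    return np.einsum("hijkl,i,j,k,l->h", cpt, p_t, p_l, p_c, p_s)


# Invented tallies of 17 ratings per parameter (grades 1..5).
p_t = prior_from_counts([2, 5, 6, 3, 1])
p_l = prior_from_counts([1, 4, 7, 4, 1])
p_c = prior_from_counts([3, 6, 5, 2, 1])
p_s = prior_from_counts([1, 2, 6, 5, 3])

p_rl = marginal_rl(placeholder_cpt(), p_t, p_l, p_c, p_s)
print(p_rl, p_rl.sum())
```

With this particular placeholder table the result reduces to the average of the four prior vectors, which gives a simple sanity check; a rule base elicited from experts would not generally have that property.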


To prioritise the climate risks, $RL_h$
($h = 1, \ldots, 5$) requires the assignment of
appropriate utility values $U_{RL_h}$. The
linguistic description can then be converted into a
crisp value using a centroid defuzzification method
(Yang et al. 2009). Accordingly, $U_{RL_h}$
($h = 1, \ldots, 5$) = {0.11, 0.3, 0.5, 0.7, 0.89}.
Hence, a new risk criticality ranking index can be
developed as follows:


$$RI = \sum_{h=1}^{5} P(RL_h)\, U_{RL_h} \qquad (2)$$


where the smaller the value of RI is, the higher the 
risk level of potential climate threats. 


4 CASE STUDY: RISKS ANALYSIS OF 
CLIMATE CHANGE ON UK RAILWAYS 


4.1 Climate risks and adaptation on UK rail 
systems 


In the UK, the transport sector has been recognised
as one of six key sectors which will be most
vulnerable to the impact of climate change
(McKenzie Hedger et al. 2000). The predicted
climate impacts on rail transport in the UK include
an increased number of hot days, a decreased number
of cold days, increased heavy precipitation,
drought, sea level change, seasonal change, extreme
events and wind (e.g., Jaroszweski et al. 2010,
Peterson et al. 2008, Hooper & Chapman 2012).

Extreme events (e.g., heat waves and storms) pose
the most devastating impacts on rail transport.
Higher temperatures in summer will cause rail
buckling as well as decreased thermal comfort;
more intense precipitation in winter will result in
flooding, landslips and bridge scour. However, the
changing climate might also bring benefits, such as
lower winter maintenance. Flooding was regarded as
one of the significant impacts on the rail network
(EPA 2009). The damage caused by climate change on
railway networks accounted for approximately 29% to
71% of the total (Chatterton et al. 2010). Dora
(2012) investigated the leading climate change
impacts on infrastructure operations in UK rail
transport systems resulting from the projected
changes in temperature and precipitation. The
report stressed effects including increases in
track buckling, in days of track maintenance and in
the exposure of staff to heat stress, as well as
overhead power cables sagging in hot weather. Other
issues included air quality in urban areas and
remarkable differences between the North and South
of the UK owing to rising temperatures, together
with the increased possibility of track inundation,
of scouring affecting the stability of river
bridges, and of landslips due to heavy and extreme
precipitation. Overall, although there have been
widespread effects on diverse transport modes, it
is only recently that more attention has been given
to the impacts of climate change on British
railways (Hooper & Chapman 2012).

The lack of data on the impacts of climate
change, as well as of cost-benefit analysis for
climate change, poses a significant challenge for
transportation planners and also contributes to the
failure of adaptation strategies in the transport
sector (e.g., Koetse & Rietveld 2012). Owing to the
high uncertainties related to the future climate,
adaptation measures should be robust so as to
retain the option value of the portfolio of
measures. Hence, through a nation-wide survey of UK
rail systems, this paper tests the new risk
analysis model of Section 3, which overcomes the
shortage of data and the uncertainty of climate
risks, to reveal the real climate risks for
British rail.


4.2 A nation-wide online survey on climate risks 
assessment 


To validate the efficiency of the FBR model, a
nation-wide survey was conducted to collect
first-hand data by examining the perceptions of
rail planners and stakeholders on the impacts of
climate change within the rail systems.

This survey aims to illustrate the general
situation of climate risks in UK rail systems and
to justify the necessity of adaptation planning.
Four main environmental drivers due to climate
change were identified: temperature increase,
intensive rainfall/flooding, increased intensity
and/or frequency of high wind and/or storms, and
sea level rise. The specific potential climate
threats and corresponding adaptation measures were
summarised according to Network Rail’s adaptation
framework (Network Rail 2015). To ensure the
validity and feasibility of the questionnaire
design, a pilot study was conducted in April 2017
with ten professional rail experts and academics in
the UK. From May to September 2017, a nation-wide
online survey was conducted amongst 21 British rail
stakeholders to assess their perception of climate
change risks, including specific impacts on their
operation, performance and infrastructure.

The population of this survey comprises all rail
stakeholders in the UK, drawn from rail companies
and authorities, governmental departments,
academia, NGOs, etc. The databases of the national
rail networks, taken from the national maps, were
used to select the transportation entities (Network
Rail 2016). The participants in the railway survey
were mainly chosen from members of the Railway
Industry Association (RIA) and the Rail Freight
Group (RFG), which represent, respectively, major
UK-based suppliers to the world’s railways and the
leading body for rail freight in the UK (RIA n.d.,
RFG n.d.). Their more than 200 member companies,
spanning the whole range of railway supply with
diverse skills and resources, are typical rail
entities of the UK national railway.

Considering the uniqueness and complexity
of climate change issues (i.e., the characteristics,
geographic distribution, types and levels of risks
posed by climate change on rails), non-probability
sampling, combining judgment sampling and snowball
sampling, was utilised in this survey. Some small
entities in remote regions might lack the necessary
knowledge or experience of climate change issues,
and in judgment sampling the representativeness of
the sample is more critical than its
generalisability (Vogt et al. 2012). Consequently,
a sample of 30 administrators representing the most
critical transport institutions in different
regions of the UK (e.g., Network Rail, Transport
for Greater Manchester, AECOM UK, etc.) was
selected to assess their perceptions of climate
change risks. Snowballing was utilised via one or
two key informants at each entity from the targeted
population.

The 30 questionnaires were distributed online
through BOS Online Survey (BOS 2017). E-mails
and phone calls were used to contact all the
respondents during business days. In the end, 21
out of 30 effective responses were received, a high
response rate of 70%. The questions were divided
into two types: closed-ended questions, which use
multiple choices and a linguistic evaluation
approach to quantify responses, and open-ended
questions, which give respondents more freedom and
produce more precise data.

Data screening was conducted to eliminate missing
and ineffective data, such as incomplete input
information and implausible responses. Accordingly,
4 of the 21 responses were found to be invalid in
the screening process. The consistency of the
remaining 17 sets of data was addressed through the
comparative climate risk analysis. Finally, the
data from the first 11 questions of the
questionnaire were input into the FBR model to rank
and analyse the top potential risks posed by
climate change.

Table 2. Questionnaire results of climate risk analysis on UK railways.

| Environmental driver due to climate change | Potential climate threat on the railway | Result of risk level | Utility value | Ranking |
|---|---|---|---|---|
| Temperature increase | A1. Track buckling causing derailment risks & reducing opportunities for track maintenance | {0.1154, 0.1808, 0.3003, 0.1922, 0.2116} | 0.54 | 6 |
| | A2. Unreliable signalling, power line side systems, failure of temperature controls and overheating of electronic equipment | {0.0858, 0.2016, 0.3228, 0.2116, 0.1783} | 0.54 | 6 |
| Intensive rainfall/flooding | B1. Bridge foundations damaged leading to bridge collapse and derailment risk | {0.1083, 0.3299, 0.2613, 0.2109, 0.0896} | 0.47 | 1 |
| | B2. Landslips caused obstruction in increasing derailment risk | {0.1590, 0.2313, 0.2406, 0.2700, 0.0991} | 0.48 | 2 |
| | B3. Heavy rain affect visibility, scheduled work may have to be rescheduled for safety and welfare reasons | {0.0621, 0.1617, 0.2115, 0.3192, 0.2455} | 0.6 | 7 |
| | B4. Track drainage overloaded leading to flooding of the track | {0.0927, 0.2874, 0.2502, 0.2716, 0.0981} | 0.5 | 3 |
| Increased intensity and/or frequency of high wind and/or storms | C1. Trees falling onto the line | {0.1138, 0.2045, 0.2863, 0.2944, 0.1010} | 0.51 | 4 |
| | C2. High winds affect visibility and scheduled work may have to be rescheduled for safety and welfare reasons | {0.0997, 0.2689, 0.2229, 0.1833, 0.2252} | 0.53 | 5 |
| | C3. Instability of structures | {0.0500, 0.1482, 0.4197, 0.1832, 0.1899} | 0.56 | 8 |
| Sea level rise | D1. Breach of sea wall, flooding and derailment risk | {0.0830, 0.2974, 0.3390, 0.2142, 0.0663} | 0.48 | 2 |
| | D2. Reduced maintenance opportunities, bridges/sea walls may not be safely inspected | {0.0156, 0.2488, 0.3305, 0.2458, 0.1594} | 0.56 | 8 |

Based on the fuzzy Bayesian approach in Section 3,
the climate risk results for each potential climate
threat under each environmental driver related to
UK rail were calculated and are presented in
Table 2. For instance, the impacts of temperature
increase were divided into two potential threats,
namely, “A1. Track buckling causing derailment
risks & reducing opportunities for track
maintenance” and “A2. Unreliable signalling, power
line side systems, failure of temperature controls
and overheating of electronic equipment”. The two
threats were then evaluated on the four
aforementioned risk parameters: Timeframe (T),
Likelihood (L), Severity of Consequences (C) and
Climate Resilience (S).

Utilising Equation (1) and Hugin software
(HUGIN v.8.5 2017), the risk results of “A1.
Track buckling causing derailment risks & reducing
opportunities for track maintenance” and “A2.
Unreliable signalling, power line side systems,
failure of temperature controls and overheating
of electronic equipment” can be calculated as
{11.54% RL1, 18.08% RL2, 30.03% RL3, 19.22%
RL4, 21.16% RL5} and {8.58% RL1, 20.16% RL2,
32.28% RL3, 21.16% RL4, 17.83% RL5},
respectively. According to Equation (2) and Hugin,
their risk index values are both calculated as 0.54.
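
The risk index of 0.54 quoted for both threats can be reproduced by hand from the risk-level distributions above and the utility values {0.11, 0.3, 0.5, 0.7, 0.89} given in Section 3.4. The short Python sketch below does this; the function name `risk_index` is illustrative, and the code is a plain weighted sum rather than a reproduction of the Hugin workflow.

```python
# Utility values assigned to RL1..RL5 (from Section 3.4).
UTILITY = (0.11, 0.30, 0.50, 0.70, 0.89)


def risk_index(p_rl, utility=UTILITY):
    """Weighted sum of the risk-level probabilities and their utilities."""
    if len(p_rl) != len(utility):
        raise ValueError("one probability per risk level is required")
    return sum(p * u for p, u in zip(p_rl, utility))


# Risk-level distributions reported for threats A1 and A2.
a1 = (0.1154, 0.1808, 0.3003, 0.1922, 0.2116)
a2 = (0.0858, 0.2016, 0.3228, 0.2116, 0.1783)

print(round(risk_index(a1), 2), round(risk_index(a2), 2))  # 0.54 0.54
```

Unrounded, the two values are about 0.5399 and 0.5381, so the threats tie only at the two-decimal precision used in Table 2.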

Based on the ranking in Table 2, the highest
potential climate threats to British rail are “B1.
Bridge foundations damaged leading to bridge
collapse and derailment risk”, “B2. Landslips
caused obstruction in increasing derailment risk”
and “B4. Track drainage overloaded leading to
flooding of the track”, due to intensive
rainfall/flooding, as well as “D1. Breach of sea
wall, flooding and derailment risk”, due to sea
level rise. Interestingly, all the top potential
climate threats are related to flooding. By
contrast, the lowest threats are “C3. Instability
of structures”, owing to the increased intensity
and/or frequency of high wind and/or storms, and
“D2. Reduced maintenance opportunities, bridges/sea
walls may not be safely inspected”, posed by sea
level rise.


5 DISCUSSION & CONCLUSION 


This paper presents an innovative mathematical
FBR model to quantify the risks posed by climate
change and applies it to the British rail system
through a nation-wide survey. The Fuzzy Bayesian
Reasoning (FBR) approach uses subjective
evaluations by domain experts to complement the
unavailable objective data and so realise climate
risk ranking under high uncertainty in data.
Building on previous modelling research on climate
adaptation in ports (e.g., Ng et al. 2013, Yang
et al. 2015, 2016, 2017), this Fuzzy-Bayesian
Network is novel in employing climate resilience
(S) as the fourth junior parameter in assessing
climate risks, as well as in dividing the Severity
of Consequences (C) into three subcategories
(“Damage to Infrastructure (INF)”, “Injuries and/or
Loss of Lives (INJ)” and “Damage to Environment
(ENV)”). We believe that this new risk analysis
model will offer a more precise and effective tool
for researchers and rail stakeholders to tackle the
uncertainties in responding to climate risks.

To test the feasibility of the model and provide
empirical evidence of its applicability in the
transport industry, a large-scale survey was
conducted to collect data for the analysis of
climate risks threatening the rail systems in the
UK. Through a review of the previous literature, we
identified four primary environmental drivers due
to climate change. The risk level of each potential
climate threat was evaluated by the timeframe of
climate hazards, likelihood of occurrence, severity
of consequences, and infrastructure resilience.
Unsurprisingly, the top potential climate threats
to British rail are highly related to the
occurrence of intensive rainfall/flooding, which is
consistent with the current priority given to
flooding issues in climate change adaptation
planning in the UK. It should be noted that the
research presented is part of an ongoing project,
and further activities and data collection (e.g., a
second-round survey and in-depth interviews) are
expected. Nevertheless, the outcomes of this study
will provide policy makers and transport planners
with useful insights for the identification of
high-risk climate hazards to facilitate the
development of cost-effective climate adaptation
strategies before the final implementation of the
relevant infrastructure.

One of the current dilemmas in climate risk
assessment is that traditional probabilistic risk
analysis (PRA) methods usually pay insufficient
attention to particular types of climate change
events or transportation assets. This study focuses
on the primary impacts of climate change on British
rail systems, to provide region-specific customisation
and ongoing trend observation. Further
research on other transport modes (e.g., road and
air) and multiple regions (i.e., developing countries)
is vital to establish a practical and robust
adaptation framework for climate change. Hence,
the preliminary analysis presented in this study
will be incorporated into a comparative
study of climate risks and adaptation in the rail
and road systems in the next stage. It is also
suggested that more quantitative estimates,
cost-effectiveness evaluations and more complex
decision models be used to increase the robustness
of the risk model (Adger et al. 2007, Walker et al. 2011,
Wt et al. 2013). Accordingly, the next step of this
research is to investigate the relationship between
climate change risk reduction and the costs of
adaptation measures. A new economic model will
be constructed to complement the current
FBR model and evidential reasoning techniques,
to enhance the consistency and credibility of the
overall modelling.


ABSTRACT: The Florida Public Hurricane Loss Model (FPHLM) is a catastrophe model that estimates
hurricane damage. The FPHLM team has access to three main sources of exposure and claims
data: county tax appraiser databases, National Flood Insurance Program (NFIP) portfolios, and private
wind insurance portfolios. These databases were processed and cross-referenced at the county level. The
FPHLM hazard team assigned estimates of surge and wave height, or fresh water inundation height,
to each NFIP claim, as well as estimates of wind speed to each wind claim. The paper shows how the
combination of more accurate building descriptions and cause of loss facilitates FPHLM flood model
development, calibration and validation. The reliability evaluation, cleaning, geocoding, alignment, and
integration of the different datasets, as well as the methods for the development, calibration and
validation of the model, are described. This is a work in progress and the paper presents some preliminary
vulnerability curves.


1 INTRODUCTION 


The purpose of catastrophe models such as the
Florida Public Hurricane Loss Model (FPHLM) is
to estimate the potential damage caused by hurricane
events, including coastal flood (storm surge),
inland flood, and wind. Cat models have three
main components: a hazard component, which
models the hurricane hazards; a vulnerability
component, which models the effect of the hazard on
the buildings; and an actuarial component, which
translates the damage into insured losses.
Calibration and validation of the FPHLM wind
and flood model outputs are conducted in part
by comparisons against historical National Flood
Insurance Program (NFIP) and wind insurance
claims data. These comparisons require that different
kinds of information be available, including: 1)
actuarial claim data, like values and cause of loss;
2) building claim data, like location, elevation, age,
and building characteristics; and 3) hazard data,
like date and type of event, and hazard intensity.


Once this information has been gathered for all
claims in a portfolio, the teams can use the data for
model development, validation and calibration. For
development, vulnerability curves can be derived
by applying direct or indirect regression analysis
techniques to the enhanced claim data. For
validation, empirical vulnerability curves, obtained
through regression analysis of the claim data vs.
hazard intensity, are compared to the corresponding
model vulnerability curves of the FPHLM flood
and/or wind models.

A challenging aspect of this methodology is
that, very often, the building data in the insurance
portfolios are incomplete, missing, or erroneous,
and the hazard intensity data is not provided. The
purpose of this paper is to describe the process for
gathering missing information, and the methodology
for associating that information with a specific
claim. Aspects of the subsequent model development,
validation and calibration processes are then
described.


2 FLORIDA PUBLIC HURRICANE LOSS
MODEL


The Florida Office of Insurance Regulation 
(FLOIR) sponsored the development of the Flor- 
ida Public Hurricane Loss Model (FPHLM) as a 
tool for insurance regulation in the state of Florida 
(FPHLM, 2016; Hamid et al., 2011). The FPHLM 
originally analyzed portfolios of single family homes, 
including manufactured homes, subject to wind haz- 
ard only. Currently, the scope includes commercial 
residential buildings (either apartment or condo- 
minium buildings), as well as inland and coastal 
flooding. The purpose of the model is to predict 
aggregated insured losses for residential properties in 
the form of annual expected losses (AEL) and prob- 
able maximum losses (PML). Such loss estimates are 
used by insurance companies and state regulators to 
help evaluate rate filings. The model can also be used 
to conduct scenario analyses to estimate losses for 
hypothetical events and historical storms. 

Under the sponsorship of FLOIR, the FPHLM
team has recently expanded the previously wind-only
scope of the FPHLM to include coastal and
inland flood hazards (Baradaranshoraka et al.,
2017). The strategy was to adapt the large body
of tsunami related building fragility curves,
especially the work of Suppasri et al. (2013), to coastal
flood, and to adapt the work of the U.S. Army Corps of
Engineers (USACE, 2006, 2015) for inland flood.
In all the models, building vulnerability curves
provide estimates of mean building damage ratio as a
function of hazard magnitude (wind speed in the
case of wind, and inundation height in the case of
coastal or inland flood) (Pinelli et al., 2011). The
damage ratio is the cost of repair of a damaged
building divided by the replacement value of the
building. Paramount to that effort is the validation
and calibration of the building vulnerability curves.

In addition to building damage, which includes
damage to both the exterior and interior of the
building, insurance companies cover contents
damage. Contents are anything inside the building
but not attached to it (e.g. furniture, rugs,
appliances, etc.). The contents vulnerability curves
are being developed based on regression techniques
using the flood claim data, as described in this paper.


3 DATABASES 


The FPHLM teams use three primary sources of
exposure and claim data:


— The National Flood Insurance Program (NFIP)
database: exposure files and claim files

— Twenty-three wind insurance companies:
exposure files and claim files

— County tax appraiser databases: building
descriptors


Exposure files include all the policies insured 
by a given company, while the claim files include 
only the policies that suffered a loss for a particular 
event. The datasets provided by these sources are 
briefly summarized below. 


3.1 NFIP database 


The NFIP claims and exposure portfolios were 
provided to the FPHLM by FLOIR. The claims 
database contains 153,751 claims between July 
1975 and January 2014 for 126 different flood 
events. The exposure portfolios were provided by 
year from 1992 to 2012. A combined portfolio was 
created containing all policies between 1992 and 
2012. The hazard team analyzed the claims data 
locations and dates to associate a specific haz- 
ard to a given claim. From these datasets, a trial 
dataset of 43,552 claims from the eight storms 
with the most claims was chosen for validating the 
FPHLM. The eight storms are listed in Table 1. 

The NFIP claim files contain information such 
as the date of loss, policy number, physical address, 
cause of damage, total property value, financial 
damage to building and contents, and replacement 
cost. Fields are present in the files for structural 
information such as exterior wall type and founda- 
tion type, but do not contain values for 97% of the 
claims. The exposure files contain policy number, 
flood zone, address, original construction date, 
base flood elevation, and more, but no structural 
building information. 


3.2 Tax appraiser databases 


Tax appraiser (TA) datasets have been collected
for a total of 51 counties, comprising approximately
97% of Florida’s population. However, the
Miami-Dade TA database contains virtually no
information useful for the purposes of FPHLM
calibration and validation.

An ideal TA dataset would include all building
properties and components that are damageable
by a hurricane: interior and exterior wall composition,
roof shape, roof covering, floor covering,
year built, and number of stories. In addition, the
building location, elevation, footprint area, and the
number of apartment units per building are desired.
However, there is no uniform format or content
standard across the different county TA databases,
and the amount and quality of data collected by
each tax appraiser differs from county to county.


Table 1. Hurricane events currently included in the claims
trial dataset from NFIP and wind portfolio databases.

Storm ID    Storm name    Storm year
AL041992    Andrew        1992
AL071998    Georges       1998
AL032004    Charley       2004
AL062004    Frances       2004
AL092004    Ivan          2004
AL112004    Jeanne        2004
AL042005    Dennis        2005
AL252005    Wilma         2005

The TA databases typically consist of data tables, 
which contain the building attribute information, 
and GIS shapefiles, which typically contain the 
polygons defining the geographic boundaries of 
each parcel, and the unique parcel identification 
number in a linked database file. Some counties 
did not provide GIS shapefiles, or the shapefiles 
did not contain parcel identification numbers that 
matched any unique identifiers in the data tables. 
When that occurred, GIS shapefiles were sourced 
from the Florida Department of Revenue which 
contained parcel identification numbers matching 
those provided by the county. 


3.3 Wind insurance portfolios


The wind insurance claim datasets represent 23
different wind insurance companies and nine hurricane
events. There are 667,573 claim records
among all the insurance portfolios and hurricane
events. The data contained in the claim files vary
by company and event, but generally include the
policy ID, zip code, county, year built, construction
type, and loss to structure, contents, appurtenant
structures, and time related losses (additional living
expenses). The exposure portfolios also vary
in the details provided, but generally include policy
ID, zip code, county and construction type (frame
or masonry). A few companies provided more
detailed exposure files that included roof shape,
number of stories, roof cover and opening protection.
In all, the exposure files contain 13.5 million
policies. The latitude and longitude values and/or
addresses of the individual buildings are provided
in the 2012 exposure files, and in
the claim data files for some of the 2004 and 2005
hurricanes. The precise locations of the buildings
in the 2012 exposure portfolios make it possible
to relate the exposure information to the NFIP or tax
appraiser databases, which is the grand challenge
in integrating these datasets. They also make it
possible to identify the location of some claim records,
based on the policy ID.


4 PROCESSING OF THE DATABASES 


4.1 Reformatting 


The first step in processing the databases was to
reformat the building attribute data contained
within the exposure, claims and TA databases
into a consistent nomenclature. The standard
nomenclature for the main building attributes
is provided in Table 2. For each attribute within
each database, a link table was developed to relate
the unique values of that attribute within the
database to a value matching the common nomenclature
provided in Table 2. This process is illustrated
in Table 3, where unique values of exterior
wall type from the Martin County TA database
are recoded to match the standard nomenclature.
Generating the link tables was intuitive for most
counties and attributes, but documentation on the
exact definition of attribute values was generally
not available from the counties, which introduces
uncertainty in the link tables. For example, in
Table 3 it is unclear what type of structural system
“Fire resistant” actually refers to, and so without
any other supporting documentation, it is coded
as “Other”.


Table 2. Common nomenclature for primary building
attributes.

Building use   Building category   Roof shape   Roof cover   Exterior wall
RES            PR                  Gable        Shingle      Timber
MANUF          LR-CR               Hip          Tile         Masonry-unreinforced
RENTAL         MHR-CR              Other        Metal
CONDO                                           Other        Masonry-reinforced
COM
Other                                                        Other


Table 3. Example link table to reformat attributes into
common nomenclature.

TA-assigned exterior wall    Common nomenclature
Wood frame                   Timber
Concrete block               Masonry
Fire resistant               Other
Conc block 12                Masonry
Concrete                     Masonry
Reinforced concrete          MasonryR
Fireproof steel              Other
Metal                        Other
6 Concrete block             Masonry
Stone local cut              Other
8 concrete block             Masonry
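The link-table recoding described in this section can be sketched in a few lines of Python. This is a minimal illustration only: the dictionary reproduces the example rows of Table 3, while the function name `recode`, the case-insensitive matching and the fallback to "Other" for unlisted values are assumptions of the sketch, not documented features of the FPHLM processing.

```python
# Hypothetical link table built from the example rows of Table 3.
# Keys are stored in lower case so that matching ignores capitalisation.
EXTERIOR_WALL_LINK = {
    "wood frame": "Timber",
    "concrete block": "Masonry",
    "fire resistant": "Other",
    "conc block 12": "Masonry",
    "concrete": "Masonry",
    "reinforced concrete": "MasonryR",
    "fireproof steel": "Other",
    "metal": "Other",
    "6 concrete block": "Masonry",
    "stone local cut": "Other",
    "8 concrete block": "Masonry",
}


def recode(raw_value, link_table, default="Other"):
    """Map a raw TA attribute value to the common nomenclature.

    Whitespace is collapsed and case is ignored before the lookup.
    Falling back to `default` for unlisted values is an assumption
    of this sketch.
    """
    key = " ".join(str(raw_value).split()).lower()
    return link_table.get(key, default)


# Example: recode the exterior wall field of a few TA records.
records = [
    {"parcel_id": "A1", "ext_wall": "Wood frame"},
    {"parcel_id": "A2", "ext_wall": "8 concrete block"},
    {"parcel_id": "A3", "ext_wall": "Fire resistant"},
]
for record in records:
    record["ext_wall_common"] = recode(record["ext_wall"], EXTERIOR_WALL_LINK)
```

Keeping one such table per attribute and per county makes the uncertain mappings (such as "Fire resistant") easy to locate and revise when better documentation becomes available.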


4.2 Geocoding and integration of the databases 


For a given building with a claim due to flood-induced
and/or wind-induced damage during
a hurricane, the following data sources can be
joined: 1) standardized building attribute information
from the county TA database, 2) the wind
insurance exposure portfolios, 3) the NFIP exposure
portfolio, 4) the NFIP claims portfolio, 5) the
wind claims portfolios, and 6) the hazard model
output. Combined, these allow the claims to be
classified by building type and, in the case of flood
hazard, by hydrological state, where
four different flood condition states were defined
via the relationship between wave height and
inundation depth. This will facilitate the development
of semi-empirical vulnerability curves, as
explained in the next section, as well as the comparison
of empirical vulnerability curves against
the engineering model curves, to facilitate validation
and calibration.

The various databases are joined by specific links
as illustrated in Figure 1. The links can broadly
be classified in two ways: spatial joins and table
joins. In table joins, a field common to two
tables was used to link different databases together.
Table joins based on policy number were used to
join the NFIP claims and exposure portfolios, and
table joins based on a unique parcel identification
number were used to join the tax appraiser data
tables and the shapefiles containing the individual
parcel polygons. Two methods of spatial joins were
used. For generating the hybrid database linking
the hazard model outputs with the NFIP claim
portfolio, first the physical address contained in the
NFIP exposure database was geocoded to obtain a
GPS coordinate. Then the hazard model generated
coastal flood heights and/or inland flood heights at
the geocoded location for each claim, and the outputs
were joined by matching the GPS coordinates
(match-spatial-join). Finally, for combining the
TA database with the hybrid hazard-NFIP database
and, if necessary, the wind exposure or claim
portfolios, a within-spatial-join was used. This was
accomplished by coding a graphical user interface
(GUI) in Matlab (Mathworks, 2016a) that reads in
the county shapefiles and finds any elements (using
the GPS coordinates obtained from geocoding
the physical address) of the hybrid hazard-NFIP
database or wind exposure or claim portfolios that
fall within a parcel polygon. For each claim that
fell within a parcel polygon, the associated building
attributes from the TA database and the wind
exposure portfolio were appended to the hybrid
hazard-NFIP database, providing a combined
dataset linking claims, hazard intensities and
building attributes.


Figure 1. Process for linking hazard model output,
NFIP claims and building attributes. (Legend entries
recoverable from the source: spatial join (match),
spatial join (within).)
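The two kinds of link described above, a table join on a shared key and a within-spatial-join of points against parcel polygons, can be sketched as follows. The authors implemented their tool in Matlab; this Python version is only an illustration. All field names (`policy`, `lon`, `lat`, `polygon`, `attributes`) are hypothetical, and the point-in-polygon test treats longitude and latitude as planar coordinates, which is an assumption that is reasonable only for parcel-sized polygons.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; `polygon` is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def table_join(left, right, key):
    """Inner join of two lists of dicts on a shared key field."""
    index = {row[key]: row for row in right}
    return [{**index[row[key]], **row} for row in left if row[key] in index]


def within_spatial_join(claims, parcels):
    """Append parcel attributes to each claim located inside a parcel."""
    joined = []
    for claim in claims:
        for parcel in parcels:
            if point_in_polygon(claim["lon"], claim["lat"], parcel["polygon"]):
                joined.append({**claim, **parcel["attributes"]})
                break  # a point is assigned to the first matching parcel
    return joined


# Hypothetical toy data: two claims, one geocoded exposure record, one parcel.
claims = [{"policy": "P1", "loss": 1000.0}, {"policy": "P2", "loss": 500.0}]
exposure = [{"policy": "P1", "lon": 0.5, "lat": 0.5}]
parcels = [{"polygon": [(0, 0), (1, 0), (1, 1), (0, 1)],
            "attributes": {"ext_wall": "Masonry"}}]

located = table_join(claims, exposure, "policy")   # claims with coordinates
enhanced = within_spatial_join(located, parcels)   # claims with attributes
```

Claims that cannot be matched in either step simply drop out of the result, which mirrors how unmatched records reduce the size of the enhanced dataset.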


4.3 Hazard information 


The NFIP claims data contain several attributes
that can link the loss to a hazard event for each
claim. Among these attributes are the “catastrophe
number”, “cause of loss”, “date of loss” and
property location. However, upon close examination
of the data the FPHLM team found that
the information provided by these attributes is
not always complete or accurate, and does not
provide sufficient detail pertaining to the hazard
event. For example, the catastrophe number is
often missing, or multiple hazard events may be
assigned the same number if they occur around
the same time period. The cause of loss often
appears to be incorrect. The team found cases
where the loss was listed as “Tidal Water Overflow”,
even though the property was too far from
the coast for such an event to occur. In addition,
the claims data do not provide important hazard
information, such as the flood elevation or wave
conditions.

To remedy these issues, algorithms
based on observed hazard information from a
variety of sources were used to associate hazard
event information with each claim. New attributes
appended to the claims data carry this information.
Where no observations are available, as is the case
for flood elevation or wave conditions, models
provided the estimates. The paragraphs below
describe the new attributes.

Distance to coast is the distance of each property 
to the nearest coast. This information is useful to 
determine if coastal flood was likely to occur. The 
team computed the distance using the 2011 Multi- 
Resolution Land Characteristics Consortium 
(MRLC) National Land Cover Database (NLCD) 
(Homer et al., 2015). The NLCD has approxi- 
mately 30 meter resolution and has a classification 
for water body. A heuristic algorithm distinguishes 
coastal waters from inland bodies of water. 

Distance to nearest body of water. This attribute
is computed in the same way as the distance to
coast, but gives the distance to the nearest body
of water, regardless of type. This can help determine
if the flooding could be due to river, stream
or lake overflow.

Precipitation maximum. This attribute was derived
from the high resolution (4 km) PRISM daily rainfall
database (Daly et al., 2008). The team selected
the largest precipitation amount that occurred within
±1 day of the date of loss and within 10 km of the
property location. This information helps determine if
the cause of loss could be due to accumulation of rainfall.
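As an illustration of the precipitation maximum attribute, the sketch below selects the largest daily rainfall within a time window and a search radius of a claim. The ±1 day and 10 km values come from the text; everything else is an assumption of the sketch, including the tuple layout of the gridded rainfall records, the use of the haversine great-circle distance, and the choice to return `None` when no grid cell qualifies.

```python
import math
from datetime import date, timedelta

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def precipitation_maximum(lat, lon, loss_date, grid_cells,
                          radius_km=10.0, window_days=1):
    """Largest daily rainfall near a property around the date of loss.

    `grid_cells` is an iterable of (cell_lat, cell_lon, day, rainfall)
    tuples (a hypothetical layout). The result is in the same unit as
    the input rainfall; None means that no cell qualified.
    """
    window = timedelta(days=window_days)
    candidates = [
        rain
        for cell_lat, cell_lon, day, rain in grid_cells
        if abs(day - loss_date) <= window
        and haversine_km(lat, lon, cell_lat, cell_lon) <= radius_km
    ]
    return max(candidates) if candidates else None


# Hypothetical rainfall records around a claim dated 5 September 2004.
cells = [
    (27.00, -80.00, date(2004, 9, 5), 3.0),   # same cell, same day
    (27.05, -80.00, date(2004, 9, 6), 5.5),   # about 5.6 km away, next day
    (27.50, -80.00, date(2004, 9, 5), 9.0),   # about 56 km away: excluded
    (27.00, -80.00, date(2004, 9, 8), 12.0),  # three days later: excluded
]
p_max = precipitation_maximum(27.00, -80.00, date(2004, 9, 5), cells)
```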

Storm identification. Many of the claims are
due to tropical storms or hurricanes. The HURDAT2
database, from the National Hurricane
Center, was used to determine if a storm was in the
vicinity of the property on the claimed date of loss. An
additional attribute indicates the storm intensity
level (depression, tropical storm or hurricane).

Flood elevation and wave height. Since observed
flood elevation data is generally not available, a
coastal flood model provided an estimate of the
flood elevation due to surge, and an inland flood
model provided the estimate in the case of flood
due to the accumulation of rainfall. The coastal
flood model is based on the CEST model (Zhang
et al., 2008), and was driven by observation-based
wind estimates from the H*Wind analyses (Powell
et al., 1998). The inland flood model is based on
the SWMM model and was driven by observed
NEXRAD rainfall data. A simplified wave model
developed at the University of Notre Dame was
used to provide an indication of wave conditions
in the case of coastal flood.

The information from these new attributes
led to a revision of the cause of loss. If
the original cause of loss was tidal water overflow,
but the distance to coast was greater than 20 km,
the team then checked the precipitation maximum.
If it was greater than 1 inch, the cause was revised
to accumulation of rainfall; otherwise it was
designated as questionable. If the original cause of loss
was accumulation of rainfall, but the precipitation
maximum was less than 0.2 inch, then the cause
of loss was marked as questionable. For hurricane
events, if the original cause of loss was unknown for
a property less than 20 km from the coast and
the precipitation maximum was greater than 1 inch,
then the cause of loss was marked as undetermined
(either tidal water overflow or accumulation
of rainfall), with further review needed. Otherwise
the cause of loss was marked as tidal water
overflow.
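The revision rules above can be written as a small decision function. The thresholds (20 km, 1 inch, 0.2 inch) are taken from the text; the label strings, the function name and its arguments are hypothetical. The final "otherwise" in the text is read here as applying only to hurricane-event claims with an unknown cause, which is one interpretation of the wording, and claims that trigger no rule keep their original cause.

```python
TIDAL = "tidal water overflow"
RAINFALL = "accumulation of rainfall"
UNKNOWN = "unknown"
QUESTIONABLE = "questionable"
UNDETERMINED = "undetermined (tidal water overflow or accumulation of rainfall)"


def revise_cause_of_loss(original, distance_to_coast_km, precip_max_in,
                         hurricane_event):
    """Apply the cause-of-loss revision rules to a single claim."""
    # Rule 1: tidal overflow reported far from the coast.
    if original == TIDAL and distance_to_coast_km > 20.0:
        return RAINFALL if precip_max_in > 1.0 else QUESTIONABLE
    # Rule 2: rainfall accumulation reported with almost no rain.
    if original == RAINFALL and precip_max_in < 0.2:
        return QUESTIONABLE
    # Rule 3: unknown cause during a hurricane event.
    if original == UNKNOWN and hurricane_event:
        if distance_to_coast_km < 20.0 and precip_max_in > 1.0:
            return UNDETERMINED  # flagged for further review
        return TIDAL
    # No rule applies: keep the reported cause.
    return original


# Example: a "tidal" claim 35 km inland with 2 inches of rain is revised.
revised = revise_cause_of_loss(TIDAL, 35.0, 2.0, hurricane_event=True)
```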

The end result is that each claim in the NFIP 
portfolio can be assigned to one of four hydrologi- 
cal states: inland flood with no waves; coastal flood 
with minor waves; coastal flood with moderate 
waves; and, coastal flood with severe waves. 


A similar process assigns a wind speed to each
claim in both the NFIP and wind insurance
portfolios, but it is not reported here.


5 DEVELOPMENT, VALIDATION 
AND CALIBRATION OF FLOOD 
VULNERABILITY CURVES 


5.1 Introduction 


The processing of the data is still in progress and
will result in enhanced NFIP and wind claims subsets
that contain loss, value, hazard and hazard
intensity, and structural characteristics. When complete,
the enhanced NFIP claims data will be used
for the development of the flood content vulnerability
curves, and the validation and calibration of
the FPHLM flood building vulnerability curves.

The outputs of the model building and content
vulnerability curves are expected damage ratios
(mean damage over replacement value), where the
replacement value is the cost to replace the property
with a new item of like kind and quality. In the
claim data, building and content coverage are used
as proxies for the respective building and content
replacement values. In the case of the NFIP data,
no reliable replacement value data are provided,
and the coverage limits in the NFIP policies
are not a true measure of the value of a building or
its contents. In the future, this issue will be addressed
through a triangulation between the TA databases,
the NFIP exposure files, and the wind insurance
exposure files. The idea is to use the coverage limits
of the wind policies (which are a better measure of
the true value of building or contents) as a proxy
for the replacement values of the building and contents
in the NFIP claims and exposure portfolios.

Another issue is that extreme hazard events have
very large return periods and are underrepresented
or altogether absent from the historical claim data.
For example, the number of NFIP claims within
the State of Florida with a hazard intensity of
more than 1.5 m (5 ft) is not significant, and this
problem becomes more pronounced as the flood
height increases. Hurricanes such as Katrina produced
storm surges of more than 20 feet (the maximum
high water mark observation was 27.8 feet at
Pass Christian, MS) (NOAA, 2006); however, the
storm surge height in the State of Florida only
reached up to 5 feet, and the FPHLM is constrained
to claims in the State of Florida. The result is that
any validation and calibration of a vulnerability
model based on claim data is more applicable to
lower hazard intensities.

This last issue also complicates the development
of vulnerability models based on simple regression
over the claim data: not only is the assignment
of hazard intensity to the claims subject
to caution, but there might be no data to regress
upon for higher hazard intensities. To resolve this
problem, the FPHLM team developed the method
for creating contents vulnerability curves described
in the next section.


5.2 Development and calibration of contents
vulnerability curves for coastal and inland flood


The FPHLM team used the NFIP claim data in
the development of the content damage component
of the coastal and inland flood model. The building
vulnerability curves are converted into content
vulnerability curves using a relationship derived
from the NFIP claim data. To derive this relationship,
the building damage ratio (building damage
to building coverage) and content damage ratio
(content damage to content coverage) are calculated
for each claim.

Figure 2 shows a plot of content damage ratios
vs. building damage ratios for a subset of the NFIP
claim data of personal residential buildings that
have a masonry structure, are not elevated, and are
one or two stories. The data are from 16 counties
for which the structural information and number of
stories have been added (around 6000 of the 43,500
paid claims for the eight hurricane events listed
in Table 1). The 16 counties are spread throughout
the State of Florida and contain more than half of
the NFIP claim data for the entire state. The NFIP
claim data do not provide information regarding
the structure type or the number of stories;
these were added by matching the NFIP claim data
with the tax appraiser databases.

[Figure 2 plot: "Building Damage to Content Damage Relationship"; x-axis: Building Damage to Building Coverage (0% to 100%); y-axis: Content Damage to Content Coverage.]

Figure 2. Example of building damage ratio to content
damage ratio relationship.


[Figure 3 plot: "Building damage to content damage relationship"; x-axis: Building Damage Ratio (%). The fitted-curve annotation is partly illegible in the source; it appears to read y = 0.9705x^3 - 1.3281x^2 + 1.2672x, R^2 = 0.855.]

Figure 3. Example of the plot of the building vs. content
damage ratios and the polynomial line fitted to the
mean values.


Figure 4. Converting the building vulnerability curves
into content vulnerability curves using the building and
content damage ratios relationship.


A statistical analysis using a two-way histogram
defines the relationship between building damage
and content damage ratios. Figure 3 shows a
plot of the content damage ratio vs. the building
damage ratio relationship derived using statistical
analysis with the corresponding curve fit.
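To make the statistical step more concrete, the sketch below computes the two damage ratios for each claim and then the mean content damage ratio within bins of building damage ratio, which are the kind of mean values to which a polynomial could subsequently be fitted. It is a simplified stand-in for the two-way histogram analysis, not the authors' procedure. The function names, the number of bins, the capping of ratios to the interval [0, 1] and the toy claim values are all assumptions.

```python
def damage_ratio(damage, coverage):
    """Damage divided by coverage, capped to [0, 1].

    Returns None when the coverage is not positive. Capping is an
    assumption of this sketch.
    """
    if coverage <= 0:
        return None
    return min(max(damage / coverage, 0.0), 1.0)


def binned_means(pairs, n_bins=10):
    """Mean content damage ratio per bin of building damage ratio.

    `pairs` holds (building_ratio, content_ratio) values in [0, 1].
    Returns (bin_centre, mean_content_ratio, count) for non-empty bins.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for building, content in pairs:
        i = min(int(building * n_bins), n_bins - 1)  # ratio 1.0 goes in last bin
        sums[i] += content
        counts[i] += 1
    return [((i + 0.5) / n_bins, sums[i] / counts[i], counts[i])
            for i in range(n_bins) if counts[i]]


# Hypothetical claims: (building damage, building coverage,
#                       content damage, content coverage)
claims = [
    (10000.0, 100000.0, 3000.0, 20000.0),
    (15000.0, 100000.0, 5000.0, 20000.0),
    (80000.0, 100000.0, 18000.0, 20000.0),
]
pairs = []
for b_dmg, b_cov, c_dmg, c_cov in claims:
    b_ratio, c_ratio = damage_ratio(b_dmg, b_cov), damage_ratio(c_dmg, c_cov)
    if b_ratio is not None and c_ratio is not None:
        pairs.append((b_ratio, c_ratio))

means = binned_means(pairs)
```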

Figure 4 illustrates the process of converting the 
building vulnerability curves into content vulner- 
ability curves for a two-story on grade masonry 
structure susceptible to coastal flood with severe 
waves. 

The process starts with the building vulnerability
curve. For a specific hazard intensity (“A”
box on the top left graph of Figure 4) we identify
the building damage ratio (“B” box on the top left
graph). Then, we move to the building and content
damage ratio relationship curve. For the specific
building damage ratio found in the first curve (“B”
box on the lower left graph), the content damage
ratio is identified (“C” box on the lower left graph).
Now, for the specific hazard intensity (“A” box on
the lower right graph) we know the equivalent
content damage (“C” box on the lower right
graph). This gives a point on the content vulnerability
curve. Repeating this process for different
hazard intensities yields the content vulnerability
curve.


[Figure 5 plot: "Content Vulnerability Curves (2-story Masonry Structure on Grade)"; y-axis: Expected Damage Ratio (%).]

Figure 5. Example of content vulnerability curves
(1 ft = 0.305 m).
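The A to B to C procedure amounts to composing two functions: the building vulnerability curve and the building-to-content damage relationship. The sketch below shows that composition. Both curves used here are placeholders invented for the illustration (an exponential building curve and a capped linear relationship); they are not the FPHLM curves or the fit shown in Figure 3.

```python
import math


def building_vulnerability(depth_ft):
    """Hypothetical building curve: mean damage ratio vs. inundation depth (ft)."""
    return 1.0 - math.exp(-0.25 * max(depth_ft, 0.0))


def content_from_building(building_ratio):
    """Hypothetical monotone building-to-content relationship on [0, 1]."""
    return min(1.0, 1.2 * building_ratio)


def content_vulnerability_curve(depths_ft, building_curve, relationship):
    """For each hazard intensity A, look up the building damage ratio B,
    then map B to the content damage ratio C, giving the point (A, C)."""
    return [(a, relationship(building_curve(a))) for a in depths_ft]


# Content vulnerability points at a few inundation depths (ft).
curve = content_vulnerability_curve(
    [0.0, 2.0, 4.0, 8.0], building_vulnerability, content_from_building
)
```

Because the content curve inherits its hazard axis from the building model, it is defined over the whole range of hazard intensity even where claim data are sparse.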

This methodology resolves the issue of the
uncertainty attached to the assignment of a specific
hazard intensity to each claim. Instead, the
claims are simply grouped according to which of
the four hydrological states they belong to.
It also resolves, to a certain extent, the lack of claim
data for higher hazard intensities, by combining
the claim data with the building vulnerability
model, which extends over the whole range of hazard
intensity. Finally, it ensures compatibility
between the building vulnerability model, the
claim data, and the resulting content vulnerability
model.

Figure 5 shows an example of content vulner- 
ability curves for different hydrological states of 
a two-story on grade masonry structure. These 
results are preliminary and intended for the devel- 
opment of the methodology. Significant refine- 
ments to the building vulnerability curves are in 
progress, and the resultant content vulnerability 
will differ from that shown in Figure 5. 


6 CONCLUSIONS AND 
RECOMMENDATIONS 


The analysis of insurance claims data provides a
way of developing, validating, and calibrating
hurricane risk model outputs. However, before
the data can be used they need to go through extensive
processing and interpretation. This paper describes
some of the challenges risk modelers face while
handling insurance claims data.

The paper presents a methodology to convert
building vulnerability curves into content
vulnerability curves using the building damage to
content damage relationship derived using NFIP
insurance claims data. This method produces
contents vulnerability curves compatible with
both the claim data and the building vulnerability
models.

This is a work in progress. Additional work
includes: the validation of the building vulnerability
curves, within the limitations of the data; the
validation of the wind model; and the validation
of the combined wind and flood model.

In addition, the merging of tax appraiser databases
with the wind and flood insurance portfolios
has the potential to increase the accuracy of the
portfolio analyses, since more data will be available
during the analyses.


ABSTRACT: Slippery runways represent a significant risk to aircraft, especially during the winter season.
In order to apply the appropriate braking action, the pilots need reliable information about the runway
conditions. Unfortunately, the accuracy of runway reports can sometimes be unsatisfactory. In order
to obtain more precise and up-to-date information about the current conditions, a warning system based
on various types of weather data was suggested by Huseby & Rabbe (2012). See also Huseby & Rabbe
(2008) and Huseby et al. (2010). The system is based on a set of scenarios known to cause slippery
conditions. By monitoring meteorological parameters like air and ground temperature, humidity, visibility and
precipitation, and comparing these to the given scenarios, the system can issue warnings to the ground
personnel. This system is currently being used at 16 Norwegian airports. In the present paper this warning
system is reviewed. Ideally, the warning system should issue warnings whenever the estimated runway
conditions are medium or worse. At the same time the system should not issue warnings when the runway
conditions are good. Thus, there are two types of errors we need to take into consideration. Type 1 errors
occur when the system does not issue a warning even though the conditions are medium or worse, while
Type 2 errors occur if a warning is issued when the conditions are good. When designing the system, we
need to find the optimal balance between these types of errors, taking into account that a Type 1 error is
to a certain degree considered to be worse than a Type 2 error. The paper describes how the system can be
optimized using a combination of weather data and flight data.


1 INTRODUCTION 


Slippery runways represent a significant risk to
aircraft, especially during the winter season.
Accidents, such as the Southwest Airlines jet skidding
off a runway at Chicago Midway Airport
in December 2005, as well as the similar accident 
with the Delta Connection flight at the Cleveland 
Hopkins International Airport in Ohio in Febru- 
ary 2007, show that this is indeed a serious prob- 
lem. More recently, an aircraft skidded off the 
runway at Oslo Airport in May 2015 due to wet 
conditions. Fortunately this accident only caused
minor damage.

In order to apply the appropriate braking 
action, the pilots need reliable information about 
the runway conditions. Unfortunately the accuracy 
of runway reports can sometimes be unsatisfac- 
tory. During the Southwest Airlines accident the 
pilots based their landing on an assumption that 
conditions were fair. However, computer calcula- 
tions after the crash showed that the actual con- 
ditions were in fact worse than poor. Given the 
correct information about the landing conditions, 
including a significant tailwind of 8-9 kt, the flight 


should in fact have been diverted. For more details 
about this accident see Rosenker et al. (2007). For 
a discussion of the effect of contaminated run- 
ways on aircraft braking performance see Giesman 
(2005). 

Having reliable methods for identifying slippery
runway conditions is very important. In practice,
however, runway reports fall short of this, for two
main reasons. Firstly, measuring the runway friction
with a satisfactory precision is very difficult. While
many different measurement devices have been developed, it is hard to
find equipment that produces stable and consist- 
ent results. The second problem is that in order to 
measure friction, the runway needs to be closed 
for traffic. Thus, in order to avoid severe delays, 
such measurements cannot be carried out too fre- 
quently. As a result the runway reports are not as 
useful as one could hope. This is especially true 
during heavy snowfalls, or when the temperature 
suddenly drops below the freezing point, where the 
conditions change very rapidly. 

Rosenker et al. (2007) discuss the difficulties
with assessing runway conditions, and note that
no standardized and universally accepted correlation
exists to define the relationship between the
runway surface condition, using any of the avail- 
able runway surface assessment methods, and an 
airplane’s braking ability. 

In Haugen et al. (2002) an alternative approach 
to this problem was developed. Contaminated run- 
ways were characterised in terms of a function of 
local weather parameters. The main idea was that 
this function would be easier to update compared 
to friction measurements which are based on the 
last runway report. Thus, by using weather data 
one could bridge the gaps between the runway 
reports. 

Based on a large-scale study of runway condi- 
tions carried out during two winter seasons at two 
Norwegian airports the ideas suggested in Haugen 
et al. (2002) were developed further. A complete 
report from this project, referred to as the SWOP- 
project, is given in Aarrestad et al. (2007). The 
study was carried out by Avinor¹ with contributions
from the three airlines SAS, Norwegian and
Widerøe. As in Haugen et al. (2002) the main goal 
was developing methodology for predicting run- 
way conditions utilizing weather data in addition 
to runway reports. Throughout the two seasons 
various kinds of weather data were collected, such 
as air and surface temperature, humidity, pre- 
cipitation, visibility and wind. Using these data a 
scenario based weather model for slippery condi- 
tions was developed. At a given point of time, the 
weather model compares the current conditions to 
a set of different scenarios. If the conditions match 
any of these scenarios, the model classifies the run- 
way conditions as potentially slippery. 

Using the experiences from the SWOP-project, 
an integrated runway information system, called 
IRIS, has been developed. This system is now implemented
at 16 Norwegian airports. For a description
of this system see Soderholm et al. (2009). This
system consists of three parts: a weather model, 
a runway model and a development model. The 
weather model uses a scenario approach to identify 
slippery conditions. A description of an early ver- 
sion of this model can be found in Huseby & Rabbe 
(2008), while a revised version is presented in Huseby
& Rabbe (2012). The weather model currently in use 
is yet another revision based on more recent data. 
The runway model uses mainly runway report data 
and assesses runway conditions on a five level scale 
ranging from poor to good. This model is discussed 
in Huseby et al. (2010). See also Klein-Paste et al. 
(2012). The development model combines runway
report data and precipitation and temperature data
in order to issue warnings when the runway conditions
appear to be deteriorating.


1. Avinor is a state owned limited company that operates
most of the civil airports in Norway.


In the present paper we focus on the weather 
model and show how this model can be optimized. 
In order to analyze and optimize the model, a large 
data set including weather data, runway report data 
and flight data has been collected. The full data set 
consists of data from 16 Norwegian airports. In 
the present paper, however, we only consider data 
from the airports at Oslo and Tromsø. For both 
airports we have observations from 9 winter seasons,
starting with the winter of 2008/2009. The weather
observations are sampled every minute starting 
at midnight November 1 and ending at midnight 
April 30. The flight data sets are provided by Scan- 
dinavian Airlines Services (SAS) and Norwegian 
Air Shuttle AS. In this paper, however, only the 
flight data sets from SAS are used. At Oslo air- 
port there are two runways, West and East. In the 
analysis these runways are treated separately. 


2 FRICTION LIMITED LANDINGS 


In order to optimize the weather model, flight data 
was obtained from the Quick Access Recorder of 
Boeing 737-600/700/800 NG airplanes. The data 
was provided with approval of the Pilots Asso- 
ciations of the cooperating airlines. Starting at 
the time of touchdown, a 60 seconds record was 
taken including among others the following main 
parameters: 


Airplane weight
Longitudinal acceleration
Airspeed
Ground speed
Flaps settings
Spoiler settings
Engine rotational speed
Brake pressures
Auto brake settings
Longitude and latitude positions


The flight data was analyzed using the Boeing
Airplane Performance model, which is based on
general equations of motion for an airplane along
the length direction of the runway. The model
gives an airplane braking coefficient, denoted by $\mu_B$.
This parameter is used to represent the contribution
of the wheel brakes to stopping the airplane.
The coefficient $\mu_B$ is the ratio of the stopping force
contribution of the wheel brakes to the average airplane
weight on wheels. In general $\mu_B$ will include
both the wheel braking and the effect of contaminant
drag force.

A key concept when analysing flight data is
the notion of friction limited landings. Unless the
pilot challenges the runway friction during the
landing, the maximum friction available will not
be utilized. In this case $\mu_B$ reflects the amount of
tire-pavement friction that was used. When wheel
brakes are applied fully or to a high degree on a
slippery runway, the maximum attainable friction
from the runway is used during the stop. In
this case the airplane's deceleration and stopping
capability is limited by the friction available from
the runway. The obtained $\mu_B$ will then reflect the
amount of tire-pavement friction that was available.
It is therefore crucial to determine whether or
not the stop was limited by the friction available
from the runway. A landing where this is the case
is said to be friction limited. For more details about
the Boeing Airplane Performance model see
Klein-Paste et al. (2012).

Given the airplane braking coefficient $\mu_B$, it
is possible to obtain the so-called runway braking
action (BA) associated with the landing. This
is a simplified measure given according to a five
level scale ranging from poor to good. The relation
between $\mu_B$ and BA is given in Table 1.
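
The lookup in Table 1 can be sketched as a search over the upper interval bounds. This is only an illustration of the table, not code from the IRIS system; the function name `braking_action` and the choice to reject negative coefficients are assumptions made here.

```python
from bisect import bisect_left

# Upper interval bounds from Table 1; every interval is closed on the right.
_UPPER_BOUNDS = [0.050, 0.075, 0.100, 0.150, 0.200]
_BA_LABELS = ["NIL", "Poor", "Medium to poor", "Medium", "Medium to good", "Good"]


def braking_action(mu_b: float) -> tuple[int, str]:
    """Map an airplane braking coefficient to (BA-level, label) per Table 1."""
    if mu_b < 0.0:
        # Assumption: negative coefficients are treated as invalid input.
        raise ValueError("braking coefficient must be non-negative")
    # bisect_left returns the index of the first bound >= mu_b, so a value
    # exactly on a bound falls into the lower class (right-closed intervals).
    level = bisect_left(_UPPER_BOUNDS, mu_b)
    return level, _BA_LABELS[level]
```

For example, `braking_action(0.12)` falls in the interval (0.100, 0.150] and is classified as level 3, Medium, which is the highest level still counted as slippery in the analysis.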

In the validation of the weather model only
the friction limited landings are used. In Table 2,
Table 3 and Table 4 the total number of landings as
well as the number of friction limited landings for
the three runways are listed. In the same tables we
have also included the number of landings with
braking action level less than or equal to 3 (i.e.,
medium).


Table 1. The relation between $\mu_B$ and runway Braking
Action (BA).

$\mu_B$           BA-level   BA
[0.000, 0.050]    0          NIL
(0.050, 0.075]    1          Poor
(0.075, 0.100]    2          Medium to poor
(0.100, 0.150]    3          Medium
(0.150, 0.200]    4          Medium to good
(0.200, ∞)        5          Good


Table 2. Number of friction limited landings at Oslo West.

                            Count    Percentage
Total number of landings    57324    100.0%
Friction limited landings   997      1.7%
Landings with BA ≤ 3        845      1.5%


Table 3. Number of friction limited landings at Oslo
East.

                            Count    Percentage
Total number of landings    54794    100.0%
Friction limited landings   981      1.8%
Landings with BA ≤ 3        764      1.4%


Table 4. Number of friction limited landings at Tromsø.

                            Count    Percentage
Total number of landings    12362    100.0%
Friction limited landings   2214     17.9%
Landings with BA ≤ 3        1866     15.1%


3 THE WEATHER MODEL 


In this section we review briefly the scenario based 
weather model which is a central part of the run- 
way condition prediction methodology. This model 
is based on the work in Rabbe (1974), and includes 
eight different scenarios which are known to cause 
slippery runway conditions. In this context we will 
not describe all these scenarios in detail. Instead 
we refer to Huseby & Rabbe (2012) for a more 
complete description. 

All scenario evaluations are typically done relative
to a given point of time $T$ representing the
touchdown point of time for a given flight. In
order to describe the weather conditions and check
whether or not any of the scenarios has occurred
at $T$, weather data from two time intervals
$[T - S_2, T - S_1]$ and $[T - S_1, T]$ is used. In our study
$S_1$ = 1 hour, while $S_2$ = 4 hours. The length of the first
interval is $(T - S_1) - (T - S_2) = S_2 - S_1$ = 3 hours,
while the length of the second and most recent
interval is $T - (T - S_1) = S_1$ = 1 hour. Thus, the two
intervals represent a total of 4 hours of observations.
Throughout this period the different weather
parameters are ideally sampled once every minute,
so given the four hours of observations, each
weather parameter is sampled 180 times during the
first interval and 60 times during the last interval.
In real life, the number of observations is typically
slightly less than this, but still there is more
than enough data for the model calculations.
We assume that all the data are indexed, and let $I_1$
and $I_2$ denote the index sets corresponding to the
first and second interval respectively. Moreover, for
$i \in I_1 \cup I_2$, we introduce:


$p_i$ = the ith precipitation type,
$t_i$ = the ith air temperature,
$t'_i$ = the ith runway temperature,
$h_i$ = the ith relative humidity,
$v_i$ = the ith horizontal visibility.
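
As a rough sketch of how the two index sets could be built from time-stamped observations, the following assumes per-minute timestamps and assigns a sample falling exactly on the shared boundary, one hour before touchdown, to the most recent interval. The function name `split_index_sets` and that boundary convention are assumptions made for illustration, not part of the published model.

```python
from datetime import datetime, timedelta

S1 = timedelta(hours=1)  # length of the most recent interval
S2 = timedelta(hours=4)  # total look-back period before touchdown


def split_index_sets(timestamps, touchdown):
    """Return (I1, I2): indices of samples in the earlier and the recent interval."""
    i1, i2 = [], []
    for i, ts in enumerate(timestamps):
        if touchdown - S2 <= ts < touchdown - S1:
            i1.append(i)  # earlier three-hour interval
        elif touchdown - S1 <= ts <= touchdown:
            i2.append(i)  # most recent one-hour interval
    return i1, i2


# Usage: 240 per-minute samples ending one minute before touchdown.
T = datetime(2015, 1, 10, 12, 0)
stamps = [T - S2 + timedelta(minutes=k) for k in range(240)]
first, last = split_index_sets(stamps, T)
print(len(first), len(last))
```

With a complete record this yields the 180 and 60 samples mentioned above; gaps in the data simply produce shorter index sets.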


The scenarios are divided into two groups, where
the first group includes five scenarios with precipitation,
while the second contains the remaining
three scenarios with no precipitation.


3.1 Scenarios with precipitation


The following five scenarios are all characterized 
by the presence of some sort of precipitation. 


SCENARIO 1. Dry snow

Dry snow is usually not as dangerous as wet snow.
However, if this condition persists over some time, 
the runway may become polished by the snow- 
flakes, which can result in slippery conditions. 


SCENARIO 2. Snow 

Snowfalls can have dramatic effects on braking 
performance. Falling snow is typically a mixture of 
ice crystals and water with a temperature close to 
0°C. Dry snow containing less water is less slippery 
than snow with a higher water content. This sce- 
nario is defined to occur when the air temperature 
is between −6°C and +2°C. Severe conditions occur
with heavy snowfalls, temperatures close to the 
freezing point, and relative humidity close to 100%. 


SCENARIO 3. Freezing rain/drizzle 

Freezing rain or freezing drizzle occurs when warm 
air tries to replace cold air from above. If rain or 
drizzle falls from the warm layer through the cold 
layer, it will be supercooled on its way down. When 
these supercooled drops hit the frozen ground, 
they will freeze immediately, and as a result the 
runway becomes extremely slippery. The weather 
sensor can sometimes identify this condition as 
a specific type of precipitation. However, in our 
scenario definition, we also include another set of 
conditions based on the precipitation type rain. 


SCENARIO 4. Freezing fog 

When temperatures at ground level drop to or 
below freezing, the water droplets making up fog 
often freeze on contact. This condition is called 
freezing fog. The result can be black ice, which 
makes the runway very slippery and dangerous. 


SCENARIO 5. Rain or drizzle on ice-coated or
supercooled runway

When rain or drizzle falls on an ice-coated or
supercooled runway, some of it will start freezing.
As a result the braking action will be reduced to
a minimum.


3.2 Scenarios without precipitation 


The scenarios included in this subsection do not
involve precipitation, at least not during the most
recent interval, $[T - S_1, T]$.


SCENARIO 6. Wet runway, clearing sky 

This scenario occurs when the weather is clearing
after overcast and rain. Due to the outgoing
radiation, the temperature both in the air and in
the runway surface gradually drops below zero. The
water left on the runway will then start to freeze,
and the friction coefficient may drop quickly.


SCENARIO 7. Stratus/fog, air temperature below 0°C

When a low stratus cloud or a fog layer at freezing
temperatures flows into the airport, small drops 
will settle on the cold runway surface and partially 
freeze. As a result the runway becomes slippery. 


SCENARIO 8. Rime, sublimation, ice crystals 

On clear autumn and winter nights the temperature
in the air and on the ground can fall below zero.
At the same time the humidity increases towards
100%. Sometimes this results in fog, while other
times moisture may settle on the runway as rime.
When water vapor sublimates on solid objects, the
result is ice crystals. Some of this may melt under
the wheels of the aircraft, and create a slippery
ice-coat on the runway.

In the analysis the weather model is calculated
each minute throughout the 9 winter seasons based
on the data from these seasons. As a result we have
as many as 2348640 realizations of this model for the
three runways in our data set. Figure 1, Figure 2 and
Figure 3 show the rates of occurrences of the eight
scenarios (denoted S1, ..., S8 in the diagrams). We have
also included the null-scenario (denoted S0 in the
diagrams) representing the fraction of the realizations
with no identified scenarios. Note that the scenario
definitions allow several scenarios to occur at the
same time. Thus, the sum of the rates of occurrences
(including the null-scenario) is greater than 100%.
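
The number of realizations can be checked with simple date arithmetic. The sketch below assumes that each season runs from November 1 through April 30 inclusive, that the first season is 2008/2009, and that the model is evaluated once per minute; the function name `minutes_in_seasons` is hypothetical. Under these assumptions the result equals the 2348640 quoted above, which suggests that the figure counts the minutes of one nine-season series. That reading is an inference from the arithmetic, not a statement made in the paper.

```python
from datetime import date

MINUTES_PER_DAY = 24 * 60


def minutes_in_seasons(first_start_year: int, n_seasons: int) -> int:
    """Minutes from November 1 through April 30 (inclusive), summed over seasons."""
    total_days = 0
    for year in range(first_start_year, first_start_year + n_seasons):
        # Date subtraction accounts for February 29 in leap years.
        total_days += (date(year + 1, 5, 1) - date(year, 11, 1)).days
    return total_days * MINUTES_PER_DAY


print(minutes_in_seasons(2008, 9))  # 1631 days * 1440 minutes = 2348640
```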

We observe that the two dominating scenarios
are S2 (snow) and S8 (rime, sublimation, ice crystals).
Scenario S2 is relatively easy to identify.
Thus, we will not discuss this scenario further here.
Scenario S8, on the other hand, is much more difficult
to identify. The scenario definition depends
on several parameters which need to be fine-tuned
in order to optimize the results. In the next section
we show how this problem can be attacked.


Figure 1. Rates of occurrences of the weather scenarios
at Oslo West.

Figure 2. Rates of occurrences of the weather scenarios
at Oslo East.

Figure 3. Rates of occurrences of the weather scenarios
at Tromsø.


4 OPTIMIZING THE RIME SCENARIO 


In this section we show how to optimize the rime
scenario, i.e. scenario S8. We say that this scenario
has occurred if the following conditions hold:


• The number of minutes with precipitation² during
  the last 4 hours should not exceed 10.

• The air temperature is decreasing during the last
  4 hours, and the last value is less than or equal to
  $\alpha$ °C,

  OR

  the air temperature is less than or equal to 0°C
  during the last hour.

• The runway temperature is less than or equal to
  0°C during the last hour.

• The relative humidity is in the interval $[\beta\%, 100\%]$
  at least half of the time during the last
  four hours.


2. In this context mist is not considered as precipitation.



The quantities $\alpha$ and $\beta$ mentioned in the conditions
are parameters which will be subject to
simultaneous optimization. Based on meteorological
insight, however, it was decided that the
parameters should be chosen within the following
intervals: $\alpha \in [0, 3]$ and $\beta \in [70, 85]$. As a base case
we let $\alpha = 2$ and $\beta = 75$. In the optimization only
the parameters $\alpha$ and $\beta$ will be considered here.
However, as scenario S8 is only one out of eight
scenarios, the contributions of the other scenarios
need to be taken into account as well.
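
A minimal sketch of the rime rule may help to show where the two parameters enter. It is not the IRIS implementation. The record layout `Observation`, the function name `rime_scenario`, the reading of "decreasing" as "the last value is below the first value", and the combination of the four conditions with a logical AND are all assumptions made here.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One per-minute weather sample (hypothetical record layout)."""
    has_precipitation: bool  # mist is not counted as precipitation
    air_temp: float          # degrees Celsius
    runway_temp: float       # degrees Celsius
    humidity: float          # relative humidity in percent


def rime_scenario(first, last, alpha=2.0, beta=75.0):
    """Return True if scenario S8 is identified under the stated assumptions.

    `first` holds the samples of the earlier three-hour interval and `last`
    those of the most recent hour, both in chronological order.
    """
    window = first + last
    if not window or not last:
        return False

    few_precip = sum(o.has_precipitation for o in window) <= 10

    # Assumption: "decreasing" means the last value is below the first one.
    cooling = (window[-1].air_temp < window[0].air_temp
               and window[-1].air_temp <= alpha)
    cold_air = all(o.air_temp <= 0.0 for o in last)

    cold_runway = all(o.runway_temp <= 0.0 for o in last)

    humid = sum(beta <= o.humidity <= 100.0 for o in window) >= len(window) / 2

    return few_precip and (cooling or cold_air) and cold_runway and humid
```

In this sketch a larger value of `alpha` or a smaller value of `beta` can only turn a negative result into a positive one, so the rule becomes less restrictive in that direction.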

One of the difficulties with optimizing the 
weather model is the lack of precision with 
respect to the response. Even when the runway is 
very slippery, this does not need to be reflected in 
the flight data if the pilot does not challenge the 
runway friction during the landing. On the other 
hand even when no weather scenario is identified, 
the flight data might indicate slippery conditions 
if this is due to the presence of older contamina- 
tion on the runway. Thus, when the weather model 
is applied, a substantial number of both Type 1 
and 2 errors are to be expected. Still, finding the 
best balance between the two types of errors is 
important. 

We now consider an arbitrary point of time t, 
and introduce the following events: 


A = A scenario is identified at time t
B = The true BA-value is 3 or less at time t

A Type 1 error corresponds to the event $A^c \cap B$,
while a Type 2 error corresponds to the event
$A \cap B^c$. In order to balance the two types of errors,
we introduce a loss function $L$ defined as follows:

$$
L = \begin{cases}
K_1 & \text{if } A^c \cap B \text{ occurs} \\
K_2 & \text{if } A \cap B^c \text{ occurs} \\
0 & \text{otherwise}
\end{cases}
$$

The expected loss is then given by:

$$
E[L] = K_1 P(A^c \cap B) + K_2 P(A \cap B^c)
$$

The constants $K_1$ and $K_2$ are relative numbers
chosen in order to reflect that a Type 1 error is usually
much worse than a Type 2 error. In the analysis
we have chosen to let $K_1 = 25$ and $K_2 = 1$.

The probabilities $P(A^c \cap B)$ and $P(A \cap B^c)$ are
easily estimated based on the available data, and
depend on the values of the parameters $\alpha$ and $\beta$.
In particular the event A is identified using the
weather data, while the event B is identified using
flight data.

In particular, by using the data in Table 2, 
Table 3 and Table 4 we get the following estimated 
probabilities for the event B: 


$$
P(B) = \begin{cases}
845/57324 = 1.47\% & \text{at Oslo West} \\
764/54794 = 1.39\% & \text{at Oslo East} \\
1866/12362 = 15.09\% & \text{at Tromsø}
\end{cases}
$$

It is easy to see that $P(A)$ is increasing in $\alpha$ and
decreasing in $\beta$. Hence, the Type 1 error probability,
$P(A^c \cap B)$, is decreasing in $\alpha$ and increasing in
$\beta$, while the Type 2 error probability, $P(A \cap B^c)$, is
increasing in $\alpha$ and decreasing in $\beta$. Moreover, for
the base case where $\alpha = 2$ and $\beta = 75$, we get the
error probabilities and expected losses listed in
Table 5.
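
The expected losses in Table 5 follow directly from the two error probabilities and the loss constants. The sketch below recomputes them from the rounded percentages in the table, so small deviations in the last digit are to be expected; the function name `expected_loss` is chosen here for illustration.

```python
K1 = 25.0  # loss assigned to a Type 1 error (no warning, slippery runway)
K2 = 1.0   # loss assigned to a Type 2 error (warning, good conditions)


def expected_loss(p_type1: float, p_type2: float,
                  k1: float = K1, k2: float = K2) -> float:
    """Expected loss: k1 * P(Type 1 error) + k2 * P(Type 2 error)."""
    return k1 * p_type1 + k2 * p_type2


# Base-case error probabilities from Table 5, written as fractions.
base_case = {
    "Oslo West": (0.0023, 0.3101),
    "Oslo East": (0.0029, 0.4470),
    "Tromsø": (0.0337, 0.2887),
}

for runway, (p1, p2) in base_case.items():
    print(f"{runway}: {expected_loss(p1, p2):.4f}")
```

Passing `k1=10` for Tromsø gives a value close to the base-case entry of Table 9, which illustrates how strongly the result depends on the weight given to Type 1 errors.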

In order to optimize the weather model we run it
for all the relevant combinations of the parameters
$\alpha$ and $\beta$. The resulting expected losses are shown in
Table 6, Table 7 and Table 8. The parameter combinations
yielding the minimum losses are marked
with an asterisk.

For Oslo West and East we observe that the
optimal values are $\alpha = 0$ and $\beta = 85$. This is the
most restrictive combination which implies that
the probability of a Type 1 error is at its maximum
while the probability of a Type 2 error is at
its minimum. From a meteorological perspective,
these parameter values are hardly realistic. The
reason why this combination still comes out as
the best is that Oslo Airport has
a fairly proactive runway maintenance strategy
which prevents the different scenarios from causing
problems. As a result $P(B)$ is relatively small
for this airport. Thus, if we use $\alpha = 0$ and $\beta = 85$
instead of the base case values $\alpha = 2$ and $\beta = 75$
at Oslo West, the Type 1 error probability increases slightly
from 0.23% to 0.24%, while the Type 2 error
probability decreases significantly from 31.01%
to 26.06%.


Table 5. Type 1 and Type 2 error probabilities and
expected losses.

Runway      $P(A^c \cap B)$   $P(A \cap B^c)$   $E[L]$
Oslo West   0.23%             31.01%            0.367
Oslo East   0.29%             44.70%            0.520
Tromsø      3.37%             28.87%            1.132


Table 6. Expected losses for various combinations of $\alpha$
and $\beta$ at Oslo West. Optimal combination is marked with
an asterisk.

              $\beta=70$   $\beta=75$   $\beta=80$   $\beta=85$
$\alpha=0$    0.3685       0.3586       0.3431       0.3204*
$\alpha=1$    0.3741       0.3634       0.3477       0.3245
$\alpha=2$    0.3782       0.3668       0.3508       0.3265
$\alpha=3$    0.3792       0.3674       0.3513       0.3268


Table 7. Expected losses for various combinations of $\alpha$
and $\beta$ at Oslo East. Optimal combination is marked with
an asterisk.

              $\beta=70$   $\beta=75$   $\beta=80$   $\beta=85$
$\alpha=0$    0.5310       0.5155       0.4973       0.4716*
$\alpha=1$    0.5351       0.5188       0.5004       0.4742
$\alpha=2$    0.5362       0.5195       0.5007       0.4748
$\alpha=3$    0.5370       0.5203       0.5012       0.4752


Table 8. Expected losses for various combinations of
$\alpha$ and $\beta$ at Tromsø. Optimal combination is marked with
an asterisk.

              $\beta=70$   $\beta=75$   $\beta=80$   $\beta=85$
$\alpha=0$    1.104        1.1446       1.2399       1.4198
$\alpha=1$    1.0980       1.1370       1.2369       1.4185
$\alpha=2$    1.0929       1.1320       1.2369       1.4175
$\alpha=3$    1.0857*      1.1264       1.2330       1.4175


Table 9. Expected losses for various combinations of $\alpha$
and $\beta$ at Tromsø given that $K_1 = 10$. Optimal combination
is marked with an asterisk.

              $\beta=70$   $\beta=75$   $\beta=80$   $\beta=85$
$\alpha=0$    0.6312       0.6265       0.6405       0.6906
$\alpha=1$    0.6320       0.6262       0.6412       0.6916
$\alpha=2$    0.6330       0.6260       0.6423       0.6919
$\alpha=3$    0.6306       0.6240*      0.6408       0.6919

For Tromsø we have the opposite situation
where the optimal values are $\alpha = 3$ and $\beta = 70$. This
is the least restrictive combination which implies
that the probability of a Type 1 error is at its minimum
while the probability of a Type 2 error is at
its maximum. At this airport it is more common to
have a contaminated runway during parts of the
winter season. As a result avoiding Type 1 errors
is more important.

It should be noted that the optimal parameter
values also depend on the choice of the loss factors
$K_1$ and $K_2$. If less weight is put on
Type 1 errors, i.e., if $K_1$ is reduced, the optimal
parameter values will change accordingly. As an
illustration we have computed the expected losses
for Tromsø airport given that $K_1$ is reduced to 10.
The results are shown in Table 9. In this case
the optimal parameter combination is $\alpha = 3$ and
$\beta = 75$.
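
The search over parameter combinations is a plain grid search. As a sketch, the minimum can be located directly from the published loss values; the grids below are copied from Table 6 and Table 9, and the names `ALPHAS`, `BETAS` and `best_combination` are chosen here for illustration.

```python
ALPHAS = [0, 1, 2, 3]
BETAS = [70, 75, 80, 85]

# Expected losses at Oslo West (Table 6) and at Tromsø with K1 = 10 (Table 9).
# Rows follow ALPHAS, columns follow BETAS.
OSLO_WEST = [
    [0.3685, 0.3586, 0.3431, 0.3204],
    [0.3741, 0.3634, 0.3477, 0.3245],
    [0.3782, 0.3668, 0.3508, 0.3265],
    [0.3792, 0.3674, 0.3513, 0.3268],
]
TROMSO_K1_10 = [
    [0.6312, 0.6265, 0.6405, 0.6906],
    [0.6320, 0.6262, 0.6412, 0.6916],
    [0.6330, 0.6260, 0.6423, 0.6919],
    [0.6306, 0.6240, 0.6408, 0.6919],
]


def best_combination(losses):
    """Return (alpha, beta, loss) for the smallest entry of a loss grid."""
    candidates = (
        (a, b, losses[i][j])
        for i, a in enumerate(ALPHAS)
        for j, b in enumerate(BETAS)
    )
    return min(candidates, key=lambda item: item[2])


print(best_combination(OSLO_WEST))     # (0, 85, 0.3204)
print(best_combination(TROMSO_K1_10))  # (3, 75, 0.624)
```

In a full analysis each grid entry would come from rerunning the weather model and re-estimating the two error probabilities, which is where the computational cost lies.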


5 CONCLUSIONS AND FUTURE WORK 


In the present paper we have reviewed the weather 
model used in the integrated runway information 
system IRIS. We have shown how the model can be 
optimized by using a combination of weather data 
and flight data. The methodology is demonstrated 
on a simplified problem with only two parameters. 
In a full scale analysis, a multidimensional opti- 
mization must be carried out. Furthermore, the 
model needs to be fine-tuned in order to work in
combination with the other models in the system. 
In order to carry out such a complex optimization 
it is important to screen the parameters and choose 
the relevant ranges carefully. Moreover, in order to 
handle the enormous amount of data, an efficient 
database structure must be applied. 

This work is just a small part of a much larger 
study which includes all the weather scenarios, the 
use of all available flight data as well as separate
analyses for all 16 airports where the IRIS system is
installed. A more extensive report from this analysis
will be available later.


ABSTRACT: To cope with natural disasters, it is necessary to establish a strategy for adaptation after
analyzing the risk originating from natural hazards. When conducting risk analysis, the first step is 
estimating the frequency of occurrence, followed by calculating the probability of occurrence. This study 
aimed to estimate the frequency of landslide occurrence, using the correct units, and to present how 
the probability of occurrence can be calculated utilizing the concept of reliability, which corresponds 
well to the outcomes obtained when a Poisson distribution is employed. For the analysis, a pixel unit of 
GIS-based spatial information was considered as a component to denote the unit for the frequency of 
landslide occurrence, and the equation for the probability of occurrence was derived from the definition
of reliability. As a result, the frequency of landslide occurrence in the study region was demonstrated, and 
a sample calculation of the probability of landslide occurrence is presented. 


1 INTRODUCTION 


Disasters caused by natural hazards are especially 
dangerous because they affect larger areas, with 
greater intensity, than disasters caused by human 
activities. In addition, the frequency of natural dis- 
asters has increased in recent years due to climate 
change (IPCC, 2013). In order to cope with natu- 
ral disasters, it is necessary to establish a strategy 
for adaptation after analyzing the risk originating 
from natural hazards, to mitigate the risk below
acceptable criteria, and to manage it constantly to 
prevent unwanted loss of life, property, or environ- 
ment. Landslides are a type of natural disaster that 
frequently have caused significant damage, espe- 
cially near mountainous regions (Guzzetti, 2000). 
Thus, risk analysis for landslides was conducted in 
this study, utilizing the concept of reliability. 

The first step in conducting risk analysis is to 
estimate the frequency of occurrence (Corominas 
et al., 2014), followed by calculating the probability 
of occurrence. However, estimating the frequency 
can be challenging because of the long interval 
between relevant events over time, and the lack of 
accumulated inventory information available in a 
study region (van Westen et al., 2006, Jaiswal et al., 
2010). In addition, frequency units considering 
spatial and temporal probability are often applied 
in misinformed ways in risk analysis. Regarding 
the probability of natural disaster occurrence,
calculations are often based on a Poisson distribution,
sometimes without a clear explanation.

The objective of this study was to estimate the 
frequency of landslide occurrence, using the cor- 
rect units, and to present how the probability of 
occurrence can be calculated utilizing the con- 
cept of reliability. For the analysis, a pixel unit of 
GIS-based spatial information was considered as 
a component to denote the unit for the frequency 
of landslide occurrence, and the equation for the
probability of occurrence was derived from the 
definition of reliability (Lee et al., 2017). 

To estimate the frequency of landslide occur- 
rence, the point locations where landslide events 
have occurred were considered as failed compo- 
nents, and the other areas, in which no landslide 
has occurred, were regarded as surviving compo- 
nents. To find the specified time between land- 
slide events, a rainfall threshold was established 
to estimate the frequency of landslide occur- 
rence due to the limitation of landslide inventory 
data. A number of methodologies for setting the 
rainfall threshold have been examined (Frat- 
tini et al., 2009, Polemio and Sdao, 1999, Caine, 
1980, Aleotti, 2004); however, the validity of
those methods depends on the local
geospatial properties (Martelloni et al., 2012,
Jakob et al., 2006). Therefore, the rainfall threshold
value for this study was determined based on
local research.

To express the equation of the probability of
landslide occurrence, the reliability function, which 
was derived from the definition of reliability, was 
used. The equation derived from the concept of 
reliability has the same outcomes as those reported 
by Crovelli (2000) based on a Poisson distribution 
model, which is commonly employed to model the 
random disastrous events caused by natural haz- 
ards over time. 
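
The agreement between the two views can be illustrated numerically. The equation itself is not given in this passage, so the sketch below rests on a standard assumption: with a constant occurrence rate, the reliability over a period is the exponential of minus the rate times the period, and the probability of at least one event is one minus that reliability. The function names and the parameters `rate` and `duration` are hypothetical.

```python
import math


def probability_of_occurrence(rate: float, duration: float) -> float:
    """P(at least one event within `duration`) for a constant event rate."""
    if rate < 0 or duration < 0:
        raise ValueError("rate and duration must be non-negative")
    reliability = math.exp(-rate * duration)  # probability of no event
    return 1.0 - reliability


def poisson_pmf(k: int, rate: float, duration: float) -> float:
    """Poisson probability of exactly k events within `duration`."""
    mean = rate * duration
    return math.exp(-mean) * mean ** k / math.factorial(k)


# One minus the Poisson probability of zero events gives the same value.
print(probability_of_occurrence(0.1, 10.0))
print(1.0 - poisson_pmf(0, 0.1, 10.0))
```

The rate and duration must share the same time unit, for example events per year and a period in years.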

As a result, the frequency of landslide occur- 
rence was determined, and a sample calculation for 
the probability of landslide occurrence is presented 
for the study region, Gangwon Province in South 
Korea, from which landslide inventory data and a 
landslide hazard map were available for the analy- 
sis. The resulting model can be used for risk man- 
agement of landslides, and also can be extended to 
assess various mitigation measures to handle the 
risk from natural hazards. 


2 METHOD 


2.1 Study site and inventory data 


Gangwon Province is well known as the region 
where Pyeongchang County, the host of the 2018 
Olympic Winter Games, is located. The province 
includes high mountainous regions in its bound- 
aries, as shown in Figure 4. The elevation of its 
northeast area is higher than other parts of the 
country, and the area has steeper terrain. Due 
to its topographical features, Gangwon Province 
is prone to landslides. The population of Gang- 
won Province is currently growing, reflecting the 
high demand for leisure activities and property 
investment. Consequently, housing is expanding, 
both for permanent residences and for tourism, 
into mountainous areas that are susceptible to 
landslides. 

The landslide inventory data were provided by 
the local government of Gangwon Province. The 
inventory was conducted when devastating dam- 
age was reported after Typhoon Ewiniar brought 
heavy rainfall to the region in July 2006. The 
damage and losses caused by the typhoon were 
recorded as a historic natural disaster, as it caused 
62 casualties in total, and the loss of property was 
estimated to be over 1.5 billion USD (N.E.M.A., 
2007). The landslide inventory data used for the 
study were part of the second and third rounds of 
data collection after the field survey that covered 
the overall region of Gangwon Province, and the 
inventory data contained a summary of observed 
landslide events, with information about the loca- 
tion and damaged area. The survey results were 
recorded in a polygon-type file in a GIS system 
and plotted on a 1:25000 scale, including the GPS 
location data and the areas affected by landslides. 


2.2 Landslide hazard map 


In order to differentiate the frequency of landslide 
occurrence, we adapted the landslide hazard map 
produced by the Korea Forest Service in 2012, 
which was made based on logistic regression analy- 
sis considering the following nine factors of moun- 
tain properties: slope inclination, slope orientation, 
slope length, slope curvature, topographic wetness 
index, the type of forest, the age of the forest, soil 
depth, and bedrock (Korea Forest Service, 2012). 
The map classified the analyzed area into five haz- 
ard classes, from grade 1 (the highest) to grade 5 
(the lowest), with pixel units sized 10 m x 10 m, and 
the map was projected on a 1:25000 scale. Grade 
5 was excluded from our analysis because those 
pixel data were marked with a null value, includ- 
ing areas of water and flat land with no hazard of 
landslides. 


2.3 Frequency of landslide occurrence 


Landslide risk can be briefly expressed by an equa- 
tion in which the probability of a hazardous event 
is multiplied by the probability of loss of life or 
property (A.G.S., 2000). For risk analysis, the fre- 
quency of landslide occurrence must be identified 
prior to calculating the probability of landslide 
occurrence. To estimate the frequency of landslide
occurrence, a unit pixel of GIS-based spatial
information was treated as a component item,
which defines the unit for the frequency of
landslide occurrence. The frequency of landslide
occurrence can be interpreted as the instantaneous
failure rate, expressed as the number of failures
per unit time, and it is based on measurements of
the quantity of components exposed to a stressful
environment (Goble and Cheddie, 2005).

Based on the assumption above, the frequency 
of landslide occurrence can be derived from the 
concept of the failure rate as below: 


$\lambda = (N - N_s)/(N_s \times \Delta t)$ (1)


where λ = the failure rate of the pixel components,
which corresponds to the frequency of landslide
occurrence; N = the number of total items;
Ns = the number of surviving items; and Δt = the
time between landslide occurrences.

Given that Ns is defined as the number of pixel
components with no landslide initiation, N - Ns
in Formula 1 can be regarded as the number of
landslide events, because the point locations of
landslide initiation were regarded as the failed
components. The time to failure, which refers to
the time to landslide recurrence, is given by Δt
in Formula 1, and it expresses the probabilistic
period between landslides. It can be estimated by
establishing the rainfall threshold. More
information is provided below regarding the
rainfall threshold for landslide initiation.

As per Formula 1, the units for the frequency 
of landslides can be expressed quantitatively as 
below, provided that the unit pixel (10 m x 10 m) 
is obtained from the data of a landslide hazard 
map: 


$F_l = \text{landslide event} \times \text{pixel}^{-1} \times \text{year}^{-1}$ (2)


where $F_l$ is the frequency of landslide occurrence
and the area of a unit pixel corresponds to 100 m².

Meanwhile, Corominas and Moya (2008) 
described two separate approaches to the spatial 
probability and temporal probability of landslide 
occurrence. The relative frequency is the ratio 
of the number of landslides recorded to the unit 
area, which allows multiple regional landslide 
events to be described. The units for the relative 
temporal frequency are the same as those given 
in Formula 2. The relative temporal frequency of 
landslides could be identified in this study because 
the landslide inventory data included multiple 
landslide events in the province region that were 
triggered by a heavy rain event. 

The overall procedure for analyzing the fre- 
quency of landslide occurrence based on the con- 
cept of the failure rate in the reliability study and 
the differentiated frequency according to landslide 
hazard grades is shown in Figure 1. 

Notably, when the landslide events were classified
by landslide hazard grade, the maximum landslide
hazard grade was assigned to each landslide event
by overlaying the inventory polygon data on the
landslide hazard map on the GIS platform.
Moreover, as a conservative approach to risk
analysis, the area where landslides did not occur
was measured when determining the total area
corresponding to each landslide hazard grade.
It should also be noted that the estimated value of
the frequency varies depending on the size of the
area. The frequency values obtained in this study
are only representative of Gangwon Province.

Figure 1. The frequency of landslide occurrence based
on the concept of the failure rate.


2.4 Probability of landslide occurrence 


To estimate the probability of random disastrous 
events caused by natural hazards over time, a Pois- 
son distribution is often employed. Crovelli (2000) 
presented a Poisson distribution model to express 
the probability of landslide occurrence in continu- 
ous time in natural environments, as below: 


$P\{N(t) = n\} = e^{-\lambda t} \frac{(\lambda t)^n}{n!}$ (3)


where n = 1, 2, 3...; λ = the rate of occurrence of
landslides; t = the specified time; and N(t) = the
number of landslides that occurred during time t.

The probability of one or more landslides occur- 
ring in time t, which is referred to as the exceedance 
probability, is expressed as below, when λ is much
less than one (λ << 1):


$P\{N(t) \geq 1\} = 1 - e^{-\lambda t}$ (4)


This model of the probability of landslide 
occurrence can also be presented using the defi- 
nition of reliability. The definition of reliability 
usually contains four basic elements: probability, 
adequate performance, time, and operating condi- 
tions (Billinton and Allan, 1992), and one of the 
definitions in general terms can be introduced as 
follows: the probability that an item will perform 
a required function without failure under stated 
conditions for a stated period of time (O’Connor 
and Kleyner, 2012). The reliability function R(t) in 
mathematical terms is expressed as follows (Kapur 
and Pecht, 2014): 


$R(t) = \frac{N_s(t)}{N}$ (5)


where Ns = the number of surviving items; and 
N = the number of total items. 
The unreliability F(t) is given as:


$F(t) = 1 - R(t) = \frac{N - N_s(t)}{N}$ (6)

and,

$f(t) = \frac{dF(t)}{dt} = -\frac{1}{N} \frac{dN_s(t)}{dt}$ (7)


When the unreliability rate f(t) is normalized by
the number of surviving items Ns(t) instead of the
total number of items N, we obtain the hazard rate
h(t), as shown below, with a more conservative
meaning:


$h(t) = -\frac{1}{N_s(t)} \frac{dN_s(t)}{dt}$ (8)


The integral of the hazard rate h(t) over the time 
from 0 to t is: 


$\int_0^t h(\tau) \, d\tau = -\ln R(t)$ (9)


Then, R(t) is:

$R(t) = e^{-H(t)}$ (10)


where H(t) is the number of hazards in time t and
can be expressed as H(t) = λt.
Finally, we obtain F(t) as:


$F(t) = 1 - R(t) = 1 - e^{-H(t)} = 1 - e^{-\lambda t}$ (11)


By using Formula 11 above, the probability of 
landslide occurrence can be calculated using the 
frequency of landslide occurrence estimated from 
the concept of the failure rate, as expressed in 
Formula 2. 
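
The agreement between the Poisson model (Formulas 3 and 4) and the reliability-based result (Formula 11) can be checked numerically. The following is a minimal sketch; the function names and the rate value are illustrative assumptions, not taken from the study.

```python
import math


def poisson_pmf(n: int, rate: float, t: float) -> float:
    """P{N(t) = n} for a Poisson process with occurrence rate `rate` (Formula 3)."""
    mean = rate * t
    return math.exp(-mean) * mean ** n / math.factorial(n)


def exceedance_probability(rate: float, t: float) -> float:
    """P{N(t) >= 1} = 1 - exp(-rate * t) (Formulas 4 and 11)."""
    # expm1 keeps precision when rate * t is very small.
    return -math.expm1(-rate * t)


# Hypothetical values, chosen only to exercise the formulas.
rate = 0.3  # events per year
t = 5.0     # years

p_direct = exceedance_probability(rate, t)
# Summing the Poisson probabilities of one or more events ...
p_summed = sum(poisson_pmf(n, rate, t) for n in range(1, 60))
# ... or taking the complement of "no event" gives the same value.
p_complement = 1.0 - poisson_pmf(0, rate, t)

print(p_direct, p_summed, p_complement)
```

The three values coincide up to floating-point rounding, which is the equivalence between the Poisson model and the reliability function described above.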


2.5 Time period estimation by rainfall threshold


In order to estimate the time period between 
landslide occurrences, a rainfall threshold was 
established, since the mechanism of landslide 
occurrence is triggered by the increase in pore 
water pressure and rain water seepage forces 
(Cullen et al., 2016). Since Caine’s research (1980) 
examined the relationship between the minimum 
rainfall duration and intensity required to cause a 
landslide, a number of methodologies to identify 
the rainfall threshold have been examined, with the 
goal of finding the most suitable correlation with 
landslide initiation (Aleotti, 2004). 

However, the examined proposals are valid only 
with the local geo-spatial properties (Martelloni 
et al., 2012, Jakob et al., 2006). Thus, domestic 
research results were applied to reflect the features 
of local geology, vegetation, and topography. Kim 
and Chae (2009) reported that landslides tend to 
occur in South Korea when the consecutive rainfall 
is over 200 mm for 48 hours. Based on their results, 
cumulative precipitation of more than 200 mm for 
48 hours was adopted as a criterion for the rainfall 
threshold. 

To determine the daily rainfall intensity, which
was another factor used to determine the
threshold, daily precipitation records were
reviewed from
when Typhoon Ewiniar caused heavy rainfall in
July 2006. This decision was based on the assump- 
tion that landslides are likely to occur in the future 
in similar environmental conditions. Despite the 
lack of continuous landslide inventory data, this 
method provides a basis for estimating the fre- 
quency of landslide occurrence. Rainfall data from 
the Automatic Weather Station (AWS) located in 
the center of Gangwon Province were adopted as 
a representative sample to estimate landslide fre- 
quency, considering the geographic location and 
the high severity of damage caused by the typhoon. 

The daily rainfall records in the region for the 
last 10 years are plotted in Figure 2. The graph 
shows that daily precipitation exceeded 150 mm on 
both July 15 and July 16, 2006, and it is possible 
to assume that landslides occurred after the rain- 
storms on those dates. 

Therefore, a daily rainfall of more than 150 mm 
was set as another factor contributing to the rain- 
fall threshold. As a result, the rainfall threshold 
was established as including both 48-hour cumula- 
tive precipitation over 200 mm and daily precipita- 
tion over 150 mm. 

Figure 2. Sampled daily rainfall in Gangwon Province
for 10 years (2006-2015).

Figure 3. Scatter diagram of daily rainfall and 48-hour
cumulative rainfall showing that 3 events exceeded the
rainfall threshold in 10 years (2006-2015).

By screening using the rainfall threshold, as
shown in Figure 3, the average landslide
occurrence interval was estimated by counting the
dates that satisfied these criteria, which resulted
in 3 events during the reviewed 10-year period. Thus,
the probabilistic period of landslide occurrence
was estimated as 3.3 years for the analysis of the
probability of landslide occurrence in this study.
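
The screening step can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the 48-hour cumulative rainfall is approximated as the sum of the current and previous daily totals, consecutive qualifying days are merged into a single event, and the rainfall series is synthetic. All function names are hypothetical.

```python
def count_threshold_events(daily_mm, daily_min=150.0, cum48_min=200.0):
    """Count rainfall events that exceed both threshold criteria.

    Assumptions (not from the source): 48-hour rainfall is the sum of the
    current and previous day, and consecutive qualifying days form one event.
    """
    events = 0
    in_event = False
    previous = 0.0
    for rain in daily_mm:
        cum48 = previous + rain
        exceeds = rain > daily_min and cum48 > cum48_min
        if exceeds and not in_event:
            events += 1  # a new event starts on this day
        in_event = exceeds
        previous = rain
    return events


def mean_interval_years(n_events, record_years):
    """Average time between threshold exceedances over the record length."""
    if n_events == 0:
        raise ValueError("no event exceeded the threshold in the record")
    return record_years / n_events


# Synthetic daily rainfall (mm), standing in for a 10-year record.
synthetic_daily_mm = [10, 60, 160, 0, 0, 120, 155, 170, 0, 210, 0]

n_events = count_threshold_events(synthetic_daily_mm)
interval = mean_interval_years(n_events, record_years=10)
print(n_events, round(interval, 1))
```

With three qualifying events in a 10-year record, the mean interval is 10/3 years, which corresponds to the value of about 3.3 years used in this study.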


3 RESULT 


3.1 Estimation of landslide frequency 


The locations in the inventory data where land- 
slides have occurred in Gangwon Province are pre- 
sented in Figure 4. 

An analysis of the landslide hazard map of
Gangwon Province shows that grades 2 and 3
predominated throughout the study region, while
grade 1 areas were sparsely scattered near
mountainous areas. The resulting estimation of the
frequency of landslide occurrence is summarized in
Table 1. The total area of each landslide hazard
grade is shown, except for grade 5, which had
null data. Grade 3 occupied the largest area
(5815.2 km²), followed by grade 2 (4611.6 km²),
grade 4 (250.0 km²), and grade 1 (186.5 km²).

When the number of landslide events was 
counted and classified along with the landslide 
hazard grades, a total of 72 landslide events were 
found in grade 1 areas, followed by 700 landslides 
in grade 2 areas, 433 in grade 3 areas, and 8 land- 
slides in grade 4 areas. 
Figure 4. Locations of landslide occurrence in the
study area.


Table 1. The analyzed frequency of landslide
occurrence.

Landslide      No. of       Total area*   Landslide occurrence
hazard grade   landslides   (km²)         frequency (λ)**
1              72           186.5         1.17E-05
2              700          4611.6        4.60E-06
3              433          5815.2        2.27E-06
4              8            250.0         9.70E-07

*The unit pixel (10 m x 10 m) is used for frequency
estimation.

**The unit of the landslide occurrence rate (λ) is
landslide event x pixel⁻¹ x year⁻¹.


The largest number of landslides occurred in grade
2 areas, which, together with grade 3 areas,
accounted for most of the study area. However, if
we examine the occurrence ratio, which is
defined as the number of landslide events divided
by the total area, it can be seen that the highest
number of landslide events per area occurred in
grade 1 areas.

Thus, our results indicate that areas with a
landslide hazard of grade 1 had the highest value of
landslide occurrence frequency (1.17E-05 landslide
events x pixel⁻¹ x year⁻¹). The frequency decreased
from grade 2 to grade 4, which had the lowest
value of landslide occurrence frequency (9.70E-07
landslide events x pixel⁻¹ x year⁻¹). It should,
however, be kept in mind that the estimated landslide
occurrence frequencies are only valid for the
Gangwon Province area.
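
The frequencies in Table 1 can be approximately reproduced from Formula 1. The sketch below assumes that the tabulated total area of each grade represents the surviving (non-failed) pixels Ns, that N - Ns equals the number of landslide events, and that Δt = 3.3 years; the function and variable names are illustrative.

```python
PIXEL_AREA_M2 = 10.0 * 10.0  # unit pixel of the landslide hazard map
DELTA_T_YEARS = 3.3          # probabilistic period between landslides

# Hazard grade -> (number of landslides, total area in km^2), from Table 1.
table_1 = {
    1: (72, 186.5),
    2: (700, 4611.6),
    3: (433, 5815.2),
    4: (8, 250.0),
}


def occurrence_frequency(n_events, area_km2, delta_t=DELTA_T_YEARS):
    """Formula 1: lambda = (N - Ns) / (Ns * delta_t), per pixel per year.

    Assumption: the tabulated area is converted to pixels and used as Ns.
    """
    surviving_pixels = area_km2 * 1.0e6 / PIXEL_AREA_M2
    return n_events / (surviving_pixels * delta_t)


frequencies = {
    grade: occurrence_frequency(n_events, area_km2)
    for grade, (n_events, area_km2) in table_1.items()
}

for grade, lam in frequencies.items():
    print(f"grade {grade}: {lam:.2E} landslide events per pixel per year")
```

Under these assumptions the computed values agree with Table 1 to within about 1%; the small difference for grade 3 presumably reflects rounding of the inputs.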


3.2 The probability of landslide occurrence 


Given the estimated frequency of landslide occur- 
rence along with the landslide hazard grades, the 
probability of landslide occurrence was calculated 
following Formula 11 and plotted with a loga- 
rithmic scale on the Y-axis. The graph in Figure 5 
shows an increase in the overall probability of 
landslide occurrence over time, as well as pre- 
senting discrete curves according to the landslide 
hazard grade. The resulting graph indicates that 
grade 1 areas had the highest value of probability 
of landslide occurrence, followed by grade 2 areas, 
grade 3 areas, and grade 4 areas, sequentially. 

Our results indicate that locations with different 
grades of landslide hazard are exposed to different 
risk levels, which can be analyzed by calculating 
the probability of landslide occurrence. The prob- 
ability of a landslide, which is a disaster caused by 
a natural hazard, can be estimated based on the 
concept of reliability. The calculated probabil- 
ity value can be used as a basis for landslide risk 
management. 
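
As an illustration of how the curves for each hazard grade can be obtained, the following sketch evaluates Formula 11 with the frequencies from Table 1. The time horizons are arbitrary example values and the names are hypothetical.

```python
import math

# Landslide occurrence frequency per pixel per year, from Table 1.
frequency_by_grade = {1: 1.17e-05, 2: 4.60e-06, 3: 2.27e-06, 4: 9.70e-07}


def probability_of_occurrence(rate, years):
    """Formula 11: F(t) = 1 - exp(-lambda * t)."""
    return -math.expm1(-rate * years)


# Example time horizons in years (illustrative, not from the study).
horizons = (1, 10, 50, 100)

curves = {
    grade: [probability_of_occurrence(rate, t) for t in horizons]
    for grade, rate in frequency_by_grade.items()
}

for grade, probabilities in curves.items():
    print(grade, ["%.2e" % p for p in probabilities])
```

Because λt is very small for all grades over these horizons, the probability is close to λt, so the curves are nearly parallel on a logarithmic axis and remain ordered by hazard grade.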
Figure 5. Increases in the probability of landslide
occurrence depending on the landslide hazard grade.


4 DISCUSSION 


In this study, it was shown that the frequency of
landslide occurrence can be estimated and the
probability of landslide occurrence can be calculated
using the concept of reliability analysis, which is 
commonly used in mechanical studies and other 
engineering fields. Since the occurrence of disas- 
ters due to natural hazards is affected by numerous 
environmental variables, estimating their frequency 
and calculating the probability of their occurrence 
are usually difficult tasks. However, when we con- 
sider a pixel unit of spatial information in a GIS 
platform as the component for reliability analysis, 
it becomes possible to calculate the frequency of 
occurrence with correct units and to calculate the 
probability of occurrence. 

A set of cross-sectional inventory data obtained 
after the destructive damage caused by a typhoon 
was used due to the lack of landslide inventory 
data in this study. Therefore, continuous data col- 
lection in accordance with appropriate landslide 
inventory frameworks should be implemented, 
and updating landslide event observation data is 
important to maintain valid data regarding the fre- 
quency of occurrence. 

To cope with natural disasters, it is necessary to 
establish a strategy for adaptation after an analy- 
sis of the risk originating from natural hazards. 
The method presented in this paper for estimating 
the frequency and the resulting probability value 
can be used as the basis for landslide risk analysis 
and management. After baseline risk analysis is 
conducted, various safety barriers for risk reduc- 
tion can be applied to manage the risk quantita- 
tively within acceptable criteria. Monitoring of 
potentially unstable slopes is a potential safety 
barrier serving as a preventive measure to reduce 
the likelihood of landslide occurrence (Uhlemann 
et al., 2016). Adopting an early warning system is 
a safety barrier to prevent unwanted loss by timely 
detection (Sattele et al., 2015). Major engineering
work to reinforce an unstable slope with a retaining
wall or to install a fence or net can be a mitigation 
measure to reduce the severity of the consequences 
of a landslide (Dai et al., 2002). Safety barriers are 
associated with specific magnitudes of risk reduc- 
tion, and the appropriate level of risk reduction 
through the use of safety barriers can be decided 
by quantitatively considering the exposed risk, 
which is estimated from the probability of land- 
slide occurrence. 

Analysis of the risk originating from natural 
hazards through the concept of reliability can be 
considered a convincing approach to manage risk. 
Landslide risk management based on reliability
analysis can also be applied in other regions,
using suitable local frequency data, in order to
prevent unwanted loss of life, property, or the
environment.


5 CONCLUSION 


In this study, the frequency of landslide occurrence 
was estimated by reliability analysis and a sample 
calculation of the probability of landslide occur- 
rence was presented for Gangwon Province in 
South Korea, from which landslide inventory data 
and a landslide hazard map were available. For the 
analysis, a pixel unit of GIS-based spatial
information was considered as a component to denote
the unit for the frequency of landslide occurrence.

It was found that more landslide events were 
initiated in areas with a hazard grade of 2 or 3 
than in grade 1 areas because of their greater total
areas; however, the most landslide events per area 
occurred in grade 1 areas. It was shown that areas 
with a greater landslide hazard had higher values 
of the landslide occurrence frequency in the study 
region, and the estimated frequency data were used 
to calculate the probability of landslide occurrence 
in an analysis based on the concept of reliability. 

By examining natural hazards through reliabil- 
ity analysis, we have ascertained that the estimated 
frequency and the resulting probability value can 
be used as the basis of landslide risk analysis and 
management. This technique can also be extended 
to assess various mitigation measures to handle the 
risks that stem from natural hazards. 
Probabilistic seismic hazard assessment for offshore structures
in Andaman Sea

ABSTRACT: A set of probabilistic seismic hazard maps for offshore structures in the Andaman Sea has
been derived using procedures developed for the latest US National Seismic Hazard Maps. In contrast
to earlier hazard maps for this region, which were mostly computed using delineated seismic source zones,
the presented maps are based on a combination of smoothed gridded seismicity, crustal-fault, and
subduction source models. The ground motion hazard maps are presented over a 10 km grid in terms of peak
ground acceleration and spectral acceleration at a 1.0 s undamped natural period and a 5% critical damping
ratio for 10% and 2% probabilities of exceedance in 50 years, which have generally been used for the seismic
analysis and design of offshore structures.


1 INTRODUCTION 


The Andaman Sea is situated in an active back-arc
basin lying above and behind the Sunda subduction
zone. The Indo-Australian and Eurasian boundary
zone comprises the convergent margins, including
the Burma oblique subduction zone, Andaman thrust
and Sunda arc, to the northwest, west and south,
respectively. Within the Andaman Sea, several
earthquakes with magnitude greater than 4 were
observed during the years 1964 to 2017. However,
the earthquake activity rate is much lower than
that near the plate boundary. For this study, the
southern part of the Andaman Sea (the blue dashed
line in Figure 1; the study area is bounded by
latitudes 5° to 10° and longitudes 94° to 98°) is of
special interest due to ongoing human activities
related to gas pipelines and offshore facilities:
several platforms and subsea gas pipelines have
been developed offshore. In addition, onshore
support facilities, such as a control and
maintenance centre along with gas metering
stations, have been developed in this region. The
largest instrumental earthquake within this zone is
the Mw 6.3 event of 16 May 1933 at 10 km depth. In
addition, this region is situated about 200 and
300 km from the Sumatran faults and the Sumatra
subduction zone, respectively. These seismically
active structures have historically produced
moderate to large earthquakes with long-period
ground motions that were felt in high-rise
buildings in Singapore and Malaysia (Pan and
Sun, 1996; Pan, 1997; Pan et al., 2001). A
probabilistic seismic hazard assessment for offshore
structures located in this area has been carried out.
Figure 1. (a) Map showing possible relative motions
of the Sumatran and Burma forearc plates relative to
adjacent Indian-Australian and Eurasian (Sunda) plates.
Adapted with permission from Curray (2005). The vector
diagrams show the partitioning of the total convergence
of Australia relative to Eurasia (vector AE) into
components of subduction (vectors AS and AB) and strike-slip
(vectors SE and BE). (b) Plot of earthquake slip vector
azimuths (red dots). The blue dashed line shows the area
of the current study.


The main objective of this work is to determine
appropriate earthquake ground motion parameters
for the seismic design of offshore structures
based on available scientific information. These
ground motion parameters include horizontal
Peak Ground Acceleration (PGA) and Spectral
Acceleration (SA) values at the project site for
return periods of 475 and 2475 years.
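
The return periods of 475 and 2475 years correspond to the 10% and 2% probabilities of exceedance in 50 years stated in the abstract, provided that earthquake occurrence is treated as a Poisson process. A minimal sketch of this conversion is given below; the function names are illustrative.

```python
import math


def return_period(prob_exceedance, exposure_years):
    """Return period T such that P = 1 - exp(-exposure_years / T)."""
    return -exposure_years / math.log1p(-prob_exceedance)


def annual_rate(prob_exceedance, exposure_years):
    """Mean annual rate of exceedance, the reciprocal of the return period."""
    return 1.0 / return_period(prob_exceedance, exposure_years)


t_10_in_50 = return_period(0.10, 50.0)
t_2_in_50 = return_period(0.02, 50.0)

print(round(t_10_in_50), round(t_2_in_50))
```

Rounded to the nearest year, the two cases give the 475-year and 2475-year return periods used for the design ground motions.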


2 SEISMOTECTONIC SETTINGS 


The Indo-Australian and Eurasian boundary
zone comprises the convergent margins, including
the Burma oblique subduction zone, Andaman
thrust and Sunda arc, to the northwest,
west and south, respectively. The plate kinematics
of the Sumatran region is, in a broad sense, the 
simple interaction of the Indian-Australian and 
Eurasian plates (Figure 1). However, in detail it 
is much more complex than that. Deformation 
rates across these plate boundaries are variable. 
The observed seismicity and seismotectonic
settings of these plate boundaries clearly indicate
the capability of producing large events, such as
the 26 December 2004 earthquake. A convergence
rate of 65-70 mm/year as a result of Australia
moving toward Southeast Asia is reported
by McCaffrey (1996).

Deformation of the overriding plate leads to 
larger complexities in plate motions. Sumatra 
sits at the southwestern edge of the Sunda plate 
(Bird 2003), which moves at a few millimeters per 
year to a centimeter per year eastward relative to 
Eurasia (Chamot-Rooke & Le Pichon 1999, Bock 
et al. 2003) (Figure 1). The resulting convergence 
between the Sunda plate and the oceanic plates to 
the southwest is somewhat slower than it would be 
relative to Eurasia. The rate and direction of sub- 
duction of the lithosphere under the Sunda forearc, 
however, are further modified by the independent 
motion of the forearc. Fitch (1972) explained the
presence of the Sumatran fault and other similar
faults inboard of subduction zones by the process
now known as slip partitioning. That is, in some
cases of oblique subduction, where the two plates
do not converge at a right angle to the strike of
the trench, a smaller overall shear force is
required if the shearing (trench-parallel) component
of the relative motion is shared between two separate
faults instead of being taken up on one fault. In the
case of partitioning,
one fault is the subduction thrust, which takes up 
all of the trench-normal slip (the dip-slip compo- 
nent) and some fraction of the trench-parallel slip 
(the strike-slip component). A second fault, within 
the overriding plate and commonly strike-slip in 
nature, takes up a portion of the trench-parallel 
motion. The subduction thrust and strike-slip fault 
isolate a wedge of forearc called the sliver plate. 
The slip rates on the separate faults can be inferred 
from their geometries and knowledge of the overall 
convergence. 


The motion of the Sunda forearc (sliver plate) 
is not known well, particularly in the Anda- 
man section, and hence the subduction vector is 
highly uncertain. Earlier estimates of the relative 
motions assuming a rigid forearc sliver plate failed 
to predict convergence in the Andaman section 
of the trench, which probably indicates, as is now 
accepted, that there is extensive internal deforma- 
tion within the forearc. One possibility, evident in 
the change in earthquake slip directions along the 
margin, is that the Andaman section (called the 
Burma plate) and Sumatran sections of the forearc 
move independently with a break near 6° to 7°N 
(Subarya et al. 2006). 


3 EARTHQUAKE CATALOGUE 


The earthquake catalogue for the current study is
composed of instrumental earthquake records from
different international earthquake observatories,
including:


1. International Seismological Centre-Global
Earthquake Model (ISC-GEM) (1907-2009),

2. Engdahl (EHB)’s earthquake catalog (1960-2007),

3. USGS/NEIC preliminary earthquake database
(1900-2013), and

4. Global centroid moment tensor (GCMT)
(1976 - December 2010).


The compiled instrumental earthquake catalogue
covers the period from 1900 to 2013 (Figure 1). All
reported event magnitudes have been converted
to moment magnitude by using the Scordilis (2006)
relationships for mb-Mw and Ms-Mw. Subsequently,
all duplicated events have been removed,
and earthquake magnitude and location corrections
have been performed manually to remove any obvious
errors. The largest event is the magnitude 9.2
Great Sumatra-Andaman earthquake, which occurred in
northern Sumatra on 26 December 2004 at 07:58 local time.


3.1 Magnitude conversion 


In the final updated earthquake catalogue, sev- 
eral different magnitude scales are used to define 
the earthquake magnitude. For example, the 20-s 
surface-wave magnitude (Ms) and the short-period
P-wave magnitude (mb) are commonly used in the 
data from USGS, ISC, and other international 
database sources, and the moment magnitude 
(Mw) is reported in the Global Centroid Moment 
Tensor catalogue. It is necessary to convert all these 
different magnitude scales into a single magnitude 
scale. In this study, the moment magnitude scale is 
chosen as the single representative scale. Since the
accuracy of reported magnitudes is dependent on
magnitude definitions, the more reliable magnitude
is preferred for use in magnitude conversion, in the
following order: Mw, Ms, mb, and ML. Conversions
between magnitude scales are made using the
equations provided in Table 2 in Ornthammarath et al.
(2011). After the magnitude conversion, we merged
duplicate entries (from different data sources) into
a single entry for each earthquake event.


3.2 Declustering 


One basic assumption of the adopted seismic haz- 
ard assessment methodology is that earthquake 
occurrences are statistically independent (the 
Poisson assumption). Therefore, the earthquake 
catalogue to be used for seismic hazard assessment 
must be free of dependent events, such as
foreshocks and aftershocks. The process of eliminating
dependent events from earthquake catalogues is
called “declustering”. The Gardner and Knopoff (1974)
declustering algorithm is chosen for the present
study. This approach assumes that foreshocks and
aftershocks depend on the size of the main event
(a non-Poissonian process), and that these
earthquake events need to be removed in accordance
with space and time windows. Normally, a large
main earthquake event leads to larger aftershocks
over a larger area and for a longer time. Therefore,
the time- and distance-window parameters
for larger main events are greater than those for
smaller events. Declustering eliminates about 60%
of the 64,866 events in the catalogue. The final
declustered catalogue includes 25,654 earthquake
events (4218, 7585, and 13851 events for shallow,
intermediate, and deep earthquakes, respectively)
with Mw greater than or equal to 3.0 in the study
region from 1900 to 2013.
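
A window-based declustering pass of this kind can be sketched as follows. The window formulas are a commonly quoted analytical approximation to the Gardner and Knopoff (1974) table; whether the present study used exactly these coefficients is not stated, so they should be read as an assumption. For brevity the sketch applies the same time window before and after each mainshock, and the event list is synthetic.

```python
import math


def gk_distance_km(mag):
    """Approximate Gardner-Knopoff distance window (km); assumed coefficients."""
    return 10 ** (0.1238 * mag + 0.983)


def gk_time_days(mag):
    """Approximate Gardner-Knopoff time window (days); assumed coefficients."""
    if mag >= 6.5:
        return 10 ** (0.032 * mag + 2.7389)
    return 10 ** (0.5409 * mag - 0.547)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two epicentres in km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def decluster(events):
    """Keep mainshocks only. events: list of (day, lat, lon, magnitude)."""
    # Process the largest events first so that they act as mainshocks.
    order = sorted(range(len(events)), key=lambda i: events[i][3], reverse=True)
    dependent = [False] * len(events)
    for i in order:
        if dependent[i]:
            continue
        day_i, lat_i, lon_i, mag_i = events[i]
        d_max, t_max = gk_distance_km(mag_i), gk_time_days(mag_i)
        for j, (day_j, lat_j, lon_j, mag_j) in enumerate(events):
            if j == i or dependent[j] or mag_j > mag_i:
                continue
            close_in_time = abs(day_j - day_i) <= t_max
            if close_in_time and haversine_km(lat_i, lon_i, lat_j, lon_j) <= d_max:
                dependent[j] = True  # foreshock or aftershock of event i
    return [e for e, dep in zip(events, dependent) if not dep]


# Synthetic catalogue: (day, latitude, longitude, magnitude).
catalogue = [
    (100, 7.0, 96.0, 7.0),    # mainshock
    (110, 7.1, 96.1, 5.0),    # nearby event shortly after the mainshock
    (120, 15.0, 105.0, 5.5),  # distant event
    (2100, 7.0, 96.0, 4.5),   # same location, long after the mainshock
]
mainshocks = decluster(catalogue)
print(len(mainshocks))
```

In this synthetic case only the second event falls inside both windows of the magnitude 7.0 mainshock, so three events remain after declustering.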


3.3 Catalogue completeness


It is recognized that earthquake data in the cata- 
logue are not complete, and that failure to correct 
for the data incompleteness may lead to underes- 
timation of the mean rates of earthquake occur- 
rence. The correction can be made by identifying 
the time period of complete data for prescribed 
earthquake magnitude ranges. Reliable mean rates 
of earthquake occurrence for the given magnitude 
ranges can then be computed from the complete 
data. 

Two methods were employed for the completeness
analysis of the catalogue: (a) the Visual
Cumulative method (CUVI) and (b) Stepp’s method.
Both algorithms provided similar results; hence,
the former technique was adopted. We divide the
study region into three zones for shallow,
intermediate, and deep earthquakes (BG-I, BG-II, and
BG-III, respectively). The data completeness analysis
is carried out separately for each of these zones, and
the results are presented in Table 2 in
Ornthammarath et al. (2011).


4 MODELING OF EARTHQUAKE 
SOURCES 


To properly describe the complex earthquake 
environments in the region, they are modeled as a 
mixture of background seismicity, subduction area 
sources, and crustal faults. These are described in 
more detail below. 


4.1 Background seismicity model 


The background seismicity model represents ran- 
dom earthquakes in the whole study region except 
the subduction zones. The model accounts for all 
earthquakes in areas with no mapped seismic faults 
and for smaller earthquakes in areas with mapped 
faults. In this approach, it is not necessary to divide 
the region into many small areas. One large area 
may be used, but the rate of seismicity is assumed 
(or allowed) to vary from place-to-place within 
the area. The rate of seismicity is determined by
first overlaying a grid with a given spacing, in the
current case 0.10° in latitude and 0.10° in
longitude (approximately 10 by 10 km), onto the study
region, and counting the number of earthquakes
with magnitude greater than a reference value
(Mref) in each grid cell. The rate of seismicity is
computed by dividing the number of earthquakes
by the time period of the earthquake data. The rate
is then smoothed spatially by a Gaussian-function
moving average and compared with the observed
seismicity. By this approach, the spatially varying
seismicity can be modeled with confidence relating
to source uncertainty.
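
The counting and smoothing steps can be sketched as follows. This is a simplified illustration, not the implementation used in the study: the kernel form exp(-d²/c²) with a correlation distance c, the value of c, the grid dimensions, and the epicentre list are all assumptions made for the example.

```python
import math


def count_rate_grid(epicentres, lat0, lon0, n_lat, n_lon, cell_deg, years):
    """Annual rate of events per grid cell (events counted, divided by years)."""
    grid = [[0.0] * n_lon for _ in range(n_lat)]
    for lat, lon in epicentres:
        i = int(math.floor((lat - lat0) / cell_deg))
        j = int(math.floor((lon - lon0) / cell_deg))
        if 0 <= i < n_lat and 0 <= j < n_lon:
            grid[i][j] += 1.0 / years
    return grid


def gaussian_smooth(grid, cell_km, corr_km):
    """Gaussian-weighted moving average of the gridded rates.

    Each smoothed value is a normalized sum of all cell rates weighted by
    exp(-d^2 / corr_km^2), where d is the distance between cell centres.
    """
    n_lat, n_lon = len(grid), len(grid[0])
    smoothed = [[0.0] * n_lon for _ in range(n_lat)]
    for i in range(n_lat):
        for j in range(n_lon):
            weighted_sum = 0.0
            weight_total = 0.0
            for k in range(n_lat):
                for m in range(n_lon):
                    d2 = ((i - k) ** 2 + (j - m) ** 2) * cell_km ** 2
                    w = math.exp(-d2 / corr_km ** 2)
                    weighted_sum += w * grid[k][m]
                    weight_total += w
            smoothed[i][j] = weighted_sum / weight_total
    return smoothed


# Synthetic example: three epicentres in one cell of a 5 x 5 grid, 10-year record.
epicentres = [(5.25, 94.25), (5.26, 94.27), (5.22, 94.21)]
raw = count_rate_grid(epicentres, lat0=5.0, lon0=94.0, n_lat=5, n_lon=5,
                      cell_deg=0.1, years=10.0)
smooth = gaussian_smooth(raw, cell_km=10.0, corr_km=20.0)
print(raw[2][2], smooth[2][2], smooth[0][0])
```

The smoothing spreads the rate of the occupied cell over its neighbours, so cells with no recorded events still receive a non-zero rate that decays with distance from the observed seismicity.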

In hazard calculations, earthquakes smaller than 
magnitude 6.0 are characterized as point sources 
at the centre of each grid cell, whereas earthquakes 
larger than magnitude 6.0 are assumed to be hypo- 
thetical finite vertical or dipping faults centered 
on the source grid cell. Lengths of finite faults are
determined using the Wells and Coppersmith (1994)
relations. Subsequently, the pre-calculated
average source-to-site distance from virtual faults
with uniformly distributed strike directions is
employed (Petersen et al. 2008).

Figure 2. Observed shallow seismicity (upper) and the
smoothed activity rate (10^a value) of BG-I for shallow
earthquakes (lower). Black circles represent magnitudes
greater than 7.0; red circles, magnitudes between 6.0
and 7.0; yellow circles, magnitudes between 5.0 and
6.0; green circles, magnitudes between 4.0 and 5.0.

Figure 3. Subduction zones SD-A, SD-B, and SD-C
considered in this study.

The whole study region is divided into six source
zones: BG-I, BG-II, BG-III, SD-A, SD-B, and SD-C
(see Fig. 3). The zones SD-A, SD-B, and SD-C
are subduction zones, which are described in
detail below. Zone BG-I is a background seismicity
zone covering the whole study area except the
subduction zones, and zones BG-II and BG-III are
background seismicity zones for intermediate and
deep earthquakes outside the subduction zones.
Earthquake data in BG-I, BG-II, and BG-III,
particularly for earthquakes greater than magnitude
3.0, have been reliably recorded since 2004 owing
to the high earthquake detection capability of a
fairly dense seismograph network in Malaysia and
the surrounding region. Hence, the accuracy of the
estimated seismicity rate in BG-I can be significantly
improved by including small earthquakes
(3.0 < M_w < 5.0) in the seismicity rate calculation.
Furthermore, this activity rate computation is also
based on the observation that moderate earthquakes
generally occur in areas where there have been a
significant number of magnitude 3 events. On the
other hand, in BG-II and BG-III only earthquake data
with M_w > 5.0 can be used for computing the
seismicity rate because the small-earthquake data are
incomplete. Nevertheless, the lack of small-earthquake
data is not a major problem because the seismicity
rate in these zones is relatively high; thus, the rate
can be reliably estimated from moderate-sized
earthquakes. In addition, the overall influence of BG-II
and BG-III on the seismic hazard in the Andaman Sea
is lower than that of BG-I. We model the
magnitude-dependent characteristic of the seismicity rate
in each background seismicity zone by a truncated
exponential model (Gutenberg-Richter model):


$$\log_{10} N(M_w) = a - b\,M_w \tag{1}$$


where N(M_w) is the annual occurrence rate of
earthquakes with magnitude greater than or equal
to M_w, and a and b are the Gutenberg-Richter
model parameters. The b-value is assumed to be 
uniform throughout the whole background region. 
Hence, we used complete earthquake data with
magnitude greater than 4.0 in the current study area
to compute a single regional b-value. The obtained
regional b-value is 0.90, and this value is used for
BG-I, BG-II, and BG-III. The a-value varies from
place to place within each zone. It is computed by
using a grid with spacing of 0.10° in latitude and
longitude and is spatially smoothed using a
two-dimensional Gaussian moving average operator
with a correlation distance parameter C (Frankel
1995). Earthquake data with M_w > 3.0 and C = 50
km are used for BG-I, while earthquake data with
M_w > 5.0 and C = 50 km are used for BG-II and
BG-III. The correlation distance is chosen based
on Frankel (1995) and is comparable to the
earthquake location error. Note that at present there are
no fixed rules or guidelines to determine an appro- 
priate C value. If the value of C is too small, the 
resulting smoothed seismicity will be concentrated 
around the epicenters of past recorded earthquakes. 
On the other hand, if the value of C is too large, 
the resulting smooth seismicity will be blurred and 
will not reflect the true spatial variation pattern 
of seismicity. The chosen C values are believed to
be suitable, as the computed smoothed rate (10^a) values
(presented in Fig. 2) are in agreement with the observed
spatial pattern of seismicity.
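
A minimal sketch of the smoothing step follows, assuming the Gaussian kernel form usually associated with Frankel (1995), in which each cell value is replaced by a weighted mean with weights exp(-d^2/C^2). The planar coordinates, the function name `smooth_counts`, and the five-cell example are assumptions made for illustration and are not the computation used in this study.

```python
import math

def smooth_counts(cells, c_km):
    """Gaussian moving average of gridded counts.

    cells: list of (x_km, y_km, count) for each grid-cell centre
           (planar coordinates are an assumption made for simplicity).
    c_km:  correlation distance C.
    Each smoothed value is a kernel-weighted mean with weights exp(-d^2 / C^2).
    """
    smoothed = []
    for xi, yi, _ in cells:
        num = den = 0.0
        for xj, yj, nj in cells:
            d2 = (xi - xj) ** 2 + (yi - yj) ** 2
            w = math.exp(-d2 / c_km ** 2)
            num += w * nj
            den += w
        smoothed.append(num / den)
    return smoothed

# Hypothetical strip of five cells, 10 km apart, with a single active cell.
cells = [(10.0 * k, 0.0, n) for k, n in enumerate([0, 0, 8, 0, 0])]
narrow = smooth_counts(cells, c_km=5.0)    # small C: rate stays near the epicentres
broad = smooth_counts(cells, c_km=50.0)    # large C: rate is spread out
```

With C = 5 km nearly all of the activity remains in the cell that contains the events, whereas with C = 50 km it is spread almost evenly over the strip, which mirrors the trade-off described above.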

In the truncated Gutenberg-Richter models of
BG-I, BG-II, and BG-III, the minimum earthquake
magnitude is set equal to 4.5 because earthquakes
with smaller magnitudes are judged not
to cause damage to buildings and structures. The
maximum (upper bound) magnitude is set to 7.0
for BG-I and 8.0 for BG-II and BG-III to account
for the largest earthquake magnitudes that have been
observed in these zones, as shown in Fig. 2. The
average depths used in the model for BG-I, BG-II,
and BG-III are 20, 75, and 120 km, respectively.
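
The effect of the magnitude bounds on Eq. (1) can be sketched as follows. The truncation formula is one common convention for a doubly truncated exponential model and is an assumption here, since the text does not state which form was used. The a-value is hypothetical; the b-value and the BG-I bounds are those given above.

```python
def gr_rate(m, a, b):
    """Untruncated Gutenberg-Richter annual rate of events with magnitude >= m (Eq. 1)."""
    return 10.0 ** (a - b * m)

def truncated_gr_rate(m, a, b, m_min, m_max):
    """Annual rate of events with magnitude >= m under a doubly truncated model.

    Uses one common convention: the Eq. 1 rate at m_min is kept, and the
    exponential tail is renormalised so that the rate falls to zero at m_max.
    """
    if m >= m_max:
        return 0.0
    m = max(m, m_min)
    tail = 10.0 ** (-b * (m_max - m_min))
    shape = (10.0 ** (-b * (m - m_min)) - tail) / (1.0 - tail)
    return gr_rate(m_min, a, b) * shape

A_VALUE = 3.0   # hypothetical a-value for a single grid cell
B_VALUE = 0.90  # regional b-value reported in the text
rate_m45 = truncated_gr_rate(4.5, A_VALUE, B_VALUE, 4.5, 7.0)
rate_m60 = truncated_gr_rate(6.0, A_VALUE, B_VALUE, 4.5, 7.0)
```

Under this convention the truncated rate equals the untruncated one at the minimum magnitude and falls below it as the magnitude approaches the upper bound.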


4.2 Subduction zone model 


As explained earlier, the megathrust Sunda
subduction zone is divided into three sub-zones based
on seismicity characteristics: the Burma zone
(SD-A), the Northern Sumatra-Andaman zone
(SD-B), and the Southern Sumatra zone (SD-C).
Each sub-zone is modeled as a seismic area source
with a uniform rate of seismicity (the traditional
area source model), and the magnitude-dependent
seismicity rate is modeled by a truncated
Gutenberg-Richter relation. The geometry and
recurrence times of large earthquakes associated with
these active tectonic structures are largely based on
available paleotsunami and seismic history studies
(Jankaew et al. 2011) and on summarized geodetic data
reported in Berryman et al. (2013).

The calculated Gutenberg-Richter model parameters
(a and b values) are shown in Table 3 in
Ornthammarath et al. (2011). The minimum earthquake
magnitude in the Gutenberg-Richter model is set to
6.5 because the subduction zones are very near to the
study area, and hence large earthquakes are also
important for long-period structures in the southern
Andaman Sea. The maximum magnitude for each zone
is set to the maximum observed magnitude plus 0.5
magnitude units. The maximum magnitude for zones
SD-B and SD-C is set to 9.2 in view of the 2004 Sumatra
earthquake and the 2005 Nias earthquake.


4.3 Crustal fault source model 


Two different approaches are employed to model
the magnitude-dependent characteristic of the
seismicity rate of the crustal faults: the
Gutenberg-Richter model and the Characteristic Earthquake (CE)
model. In the Gutenberg-Richter model, a
magnitude-frequency distribution for the crustal fault model
is assumed from the minimum magnitude of 6.5 to
the upper-bound magnitude (Mmax). To account
for the uncertainty in estimating Mmax, we consider
three different cases with Mmax set equal
to M_C - 0.2, M_C, and M_C + 0.2. Probabilistic
weights of 0.2, 0.6, and 0.2 are assigned to
these cases, respectively. In each case, the b-value is
set equal to the regional b-value of 0.90, and the
a-value is determined from the seismic moment
rate, which is computed from the fault slip rate.

In the characteristic earthquake model, three
characteristic earthquake magnitudes are
also considered: M_C - 0.2, M_C, and M_C + 0.2.
Probabilistic weights of 0.2, 0.6, and 0.2 are
assigned to these cases, respectively. In each case,
the earthquake occurrence rate is computed from
the characteristic magnitude and the fault slip rate
(to match the seismic moment rate of the
fault). The occurrence rate, which is the reciprocal
of the recurrence interval, is determined for the
characteristic model from:


$$\frac{1}{\text{Recurrence interval}} = \frac{\mu\,u\,L\,W}{M_{0C}} \tag{2}$$


where μ is the shear modulus, 3.0 × 10^11 dyne/cm^2, L
is the rupture length, W is the rupture width, u is the
fault slip rate, and M_0C is the characteristic earthquake
moment, which is calculated from:


$$\log_{10} M_{0C} = 1.5\,M_C + 16.05 \tag{3}$$


and the magnitude is assumed to be normally
distributed around the characteristic value with
a standard deviation of 0.12. The properties and
parameters of the Sumatra faults considered in this
study are shown in Table 2 in Petersen et al. (2007).
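
The moment-balancing calculation behind Eqs. (2) and (3) can be sketched as follows. The quotient of the moment rate and the characteristic moment has units of events per year, so the sketch treats it as the occurrence rate and takes its reciprocal to obtain the recurrence interval. The fault dimensions, slip rate, and characteristic magnitude are hypothetical values chosen for illustration; only the shear modulus, the moment-magnitude relation, and the branch weights come from the text.

```python
MU = 3.0e11  # shear modulus in dyne/cm^2, as given in the text

def characteristic_moment(m_c):
    """Seismic moment in dyne-cm from Eq. 3."""
    return 10.0 ** (1.5 * m_c + 16.05)

def occurrence_rate(m_c, length_km, width_km, slip_mm_per_yr):
    """Annual rate of characteristic events that balances the fault moment rate."""
    length_cm = length_km * 1.0e5
    width_cm = width_km * 1.0e5
    slip_cm = slip_mm_per_yr / 10.0
    moment_rate = MU * slip_cm * length_cm * width_cm  # dyne-cm per year
    return moment_rate / characteristic_moment(m_c)

# Hypothetical fault: 100 km long, 15 km wide, slipping 10 mm/yr, M_C = 7.5.
BRANCHES = [(-0.2, 0.2), (0.0, 0.6), (0.2, 0.2)]  # (magnitude shift, weight)
m_c = 7.5
branch_rates = [(occurrence_rate(m_c + dm, 100.0, 15.0, 10.0), w) for dm, w in BRANCHES]
mean_rate = sum(rate * w for rate, w in branch_rates)
central_interval_yr = 1.0 / occurrence_rate(m_c, 100.0, 15.0, 10.0)
```

For this hypothetical fault the central branch gives a recurrence interval of roughly 443 years. Because the moment grows exponentially with magnitude, the weighted mean rate over the three branches is somewhat higher than the central-branch rate.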


5 GROUND MOTION PREDICTION 
EQUATIONS (GMPES) 


In this study, three Next Generation Attenuation
(NGA) models developed for shallow crustal
earthquakes in the Western United States and similar
active tectonic regions were applied to background
earthquakes in BG-I and to earthquakes
from crustal faults in the study region. These NGA
models were developed by Boore and Atkinson
(2008), Campbell and Bozorgnia (2008), and Chiou
and Youngs (2008) during the NGA project.

In addition, based on a comparison of several
different Ground Motion Prediction Equations
(GMPEs) with recorded ground motion from
interface subduction earthquakes by
Ornthammarath et al. (2014), three subduction GMPEs
were selected for probabilistic seismic hazard
analysis in this region. The probabilistic weights
assigned to these models are 0.25, 0.25, and 0.50
for Atkinson and Boore (2003; 2008), Youngs et al.
(1997), and Zhao et al. (2006), respectively. These
weights are broadly consistent with the residuals
between the observed data and the values estimated
by the three selected GMPEs. To calculate ground
motion for intermediate- and deep-depth
earthquakes, the equations of Youngs et al.
(1997) and Atkinson and Boore (2003) are
considered with equal weights.
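
The way such logic-tree weights enter a hazard calculation can be sketched as follows. The text does not say whether the weights are applied to hazard curves or to ground-motion medians; this sketch assumes a weighted mean of hazard curves. The exceedance rates are hypothetical placeholders, not output from the cited GMPEs.

```python
# Logic-tree weights for the interface subduction GMPEs, as stated in the text.
GMPE_WEIGHTS = {
    "Atkinson and Boore (2003; 2008)": 0.25,
    "Youngs et al. (1997)": 0.25,
    "Zhao et al. (2006)": 0.50,
}

def weighted_hazard(curves, weights):
    """Weighted mean of annual exceedance rates, level by level.

    curves: {model name: [rate at each ground-motion level]}.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("logic-tree weights must sum to 1")
    n_levels = len(next(iter(curves.values())))
    return [sum(weights[name] * curves[name][k] for name in weights)
            for k in range(n_levels)]

# Hypothetical exceedance rates at PGA = 0.1g, 0.2g and 0.4g for one site.
curves = {
    "Atkinson and Boore (2003; 2008)": [4.0e-3, 1.0e-3, 2.0e-4],
    "Youngs et al. (1997)": [6.0e-3, 2.0e-3, 4.0e-4],
    "Zhao et al. (2006)": [5.0e-3, 1.5e-3, 3.0e-4],
}
mean_curve = weighted_hazard(curves, GMPE_WEIGHTS)
```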
6 PROBABILISTIC SEISMIC
HAZARD ANALYSIS


The PSHA results for the southern Andaman Sea,
shown in Figures 4 and 5, are presented as seismic
hazard maps for 475- and 2475-year return periods
at bedrock condition. For the southern Andaman Sea,
the observed seismicity in and around the Sumatra
subduction zone and the Sumatra faults controls the
hazard for most of the structural periods considered.
The estimated bedrock PGA near the subduction zone
at the 2475-year return period ranges from 0.6g to
1.0g; however, for the area near the coasts of Thailand
and Myanmar, the estimated bedrock PGA at the
475-year return period is much lower, varying from
0.05g to 0.15g. This is primarily because the area
is far removed from any major active structure
and has a low observed rate of background
seismicity.
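
For readers who think in terms of exceedance probability rather than return period, the two are linked by the usual Poisson assumption. The sketch below assumes a 50-year exposure time, which is a conventional choice and is not stated in the text.

```python
import math

def exceedance_probability(return_period_yr, exposure_yr):
    """Probability of at least one exceedance in exposure_yr for a Poisson process."""
    return 1.0 - math.exp(-exposure_yr / return_period_yr)

p_475 = exceedance_probability(475.0, 50.0)    # about 0.10
p_2475 = exceedance_probability(2475.0, 50.0)  # about 0.02
```

Under these assumptions the 475-year and 2475-year maps correspond to roughly 10% and 2% probability of exceedance in 50 years, respectively.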

For long structural periods (Figure 5), a large part
of the southern Andaman Sea is subjected to moderate
long-period ground motion owing to the lower
attenuation rate at long periods. Subduction zone
earthquakes contribute high seismic hazard to
long-period offshore structures in the southern Andaman
Sea. The estimated bedrock SA (T = 1s) near the
subduction zone at the 2475-year return period ranges
from 0.4g to 1.0g; however, for the area near the coasts
of Thailand and Myanmar, the estimated bedrock PGA at
the 2475-year return period is lower, varying
from 0.10g to 0.20g. Ground motion records for
time history analysis of long-period structures in
the southern Andaman Sea should therefore be
selected from large earthquakes at long distances.
In addition, the computed ground
motions are comparable to those in Petersen et al.
(2007), with minor differences in short-period ground
motion near the subduction zones.

Figure 4. PGA (g) map for southern Andaman Sea at
475- (left) and 2475-year (right) return periods.

Figure 5. SA (T = 1s) (g) map for southern Andaman
Sea at 475- (left) and 2475-year (right) return periods.
Power outage forecasting: Methods, results, and uncertainty 


S.D. Guikema 
University of Michigan, Ann Arbor, Michigan, USA 


ABSTRACT: Power outage forecasting for severe events such as hurricanes provides valuable informa- 
tion to those managing power systems and those dependent on electric power such as other utilities and 
individuals in society. A number of different approaches to power outage forecasting have been developed. 
This paper provides an overview of different approaches with a focus on how these approaches work in 
practice. It then focuses on the role and representation of uncertainty in power outage forecasts. What 
are the sources of uncertainty? How do different outage forecasting approaches handle these sources of 
uncertainty? How is uncertainty represented in the resulting forecasts? 


1 INTRODUCTION 


Severe weather is a major cause of power outages 
and expensive restorations associated with these 
outages. A critical aspect of managing weather-in- 
duced power outages for a utility is being prepared 
to restore power quickly and cost-effectively. Power 
outage forecasting models are an important part of 
this process, providing estimates of both total out- 
ages and the spatial distribution of these outages in 
advance of a storm. An increasing number of utili- 
ties are reportedly using power outage forecasting 
models. For example, approximately 85% of utili- 
ties responding to a recent benchmarking survey 
reported that they have some sort of storm power 
outage prediction model (Guikema et al. 2017). 

A number of different approaches for predict- 
ing weather-induced power outages have been 
developed. The two primary approaches are fra- 
gility-based models and statistical models. Fragil- 
ity-based models (e.g., Winkler et al. 2010) take 
as input estimates of key aspects of the weather 
hazard, such as gust wind speeds, and they then 
estimate the probability of damage at the level of 
individual assets via a fragility function, a math- 
ematical function that translates the value of a haz- 
ard parameter (e.g., gust wind speed) into a damage 
probability. These asset level damage probabilities 
are then used to simulate a number of replications 
of sets of damaged assets. For each simulated set 
of damaged assets, a power flow or connectivity 
model is used to estimate which customers have 
power and which do not. This provides the overall 
estimate of power outages together with its spatial 
distribution. 
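
The fragility-based workflow described above can be sketched in a few lines of Python. Everything specific in the sketch is an assumption made for illustration: the lognormal form and parameters of the fragility curve, the radial feeder layout, and the rule that a damaged pole cuts off everything downstream. It is not the model of Winkler et al. (2010).

```python
import math
import random

def fragility(gust_ms, median_ms=45.0, beta=0.3):
    """Lognormal fragility curve: P(damage | gust speed). Parameters are hypothetical."""
    z = (math.log(gust_ms) - math.log(median_ms)) / beta
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_outages(feeder, gusts, n_reps=2000, seed=1):
    """Monte Carlo estimate of mean customers without power on a radial feeder.

    feeder: list of customers served at each pole, ordered from the substation.
    gusts:  forecast gust speed (m/s) at each pole.
    Connectivity rule (assumed): a damaged pole cuts off itself and everything downstream.
    """
    rng = random.Random(seed)
    p_damage = [fragility(g) for g in gusts]
    total = 0
    for _ in range(n_reps):
        damaged = [rng.random() < p for p in p_damage]
        first = damaged.index(True) if any(damaged) else len(feeder)
        total += sum(feeder[first:])
    return total / n_reps

feeder = [120, 80, 60, 40]          # hypothetical customers per pole
calm = simulate_outages(feeder, [15.0, 15.0, 15.0, 15.0])
storm = simulate_outages(feeder, [40.0, 42.0, 45.0, 50.0])
```

Each replication draws one set of damaged assets, and the connectivity rule converts that set into a customer count, so the average over replications estimates the expected number of customers out.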

A statistical power outage forecasting model
(e.g., Han et al. 2009a, 2009b, Guikema et al. 2010,
Nateghi et al. 2011, Guikema and Quiring 2012,
Guikema et al. 2014, Nateghi et al. 2014a, 2014b,
Quiring et al. 2014, McRoberts et al. 2017, He et al.
2017) uses data about the performance of power
systems during past storms such as the number of 
meter outages in defined geographic areas together 
with data about the utility system, environmental 
conditions, and the weather conditions. These data 
are used to train and validate a model that can pre- 
dict the impact of future weather events. The types 
of statistical models used vary widely, from simple 
linear regression models to more advanced ensem- 
ble data mining methods. 

Statistical power outage forecasting models are 
in much wider use within electric power utilities. 
Based on conversations with utility personnel, 
many U.S. utilities have at least an Excel-based 
linear regression model that they use in house. It
should also be noted that, to date, statistical outage
forecasting models have shown substantially better
predictive accuracy than fragility-based models.
Because statistical outage forecasting models are 
both more widespread and, to date, more accurate, 
they will be the focus of this paper. 

In this paper I focus not on the technical details 
of statistical outage forecasting models. Rather, I 
take a perspective grounded in Bayesian probabil- 
ity and risk analysis and provide a critical assess- 
ment of the foundations of these models. To the 
degree possible, I use models that I have developed 
as the basis for the critiques. The goal here is to 
suggest ways in which the approaches underlying 
outage forecasting can advance. It is important to 
note that the focus here is not the meteorological 
side of the problem. That is, I focus on the mod- 
eling approach, not on which additional weather 
parameters might be helpful to include, though 
there is potentially substantial merit to exploring a 
wider array of weather variables.

This paper is structured as follows. The second
section summarizes some of the practical
limitations of the current approaches, drawing on the
discussion in Guikema et al. (2017), laying the
foundation for the following sections. The third
section then focuses on discussing how these
models conceptualize risk and how this is reflected in
the estimates. The fourth section focuses in
particular on how uncertainty is treated in outage
forecasting models. The fifth concluding section
presents some suggestions for paths forward.


2 PRACTICAL LIMITATIONS 


Note, this section is a summary of the discussion 
in Guikema et al. (2017). 

The first limitation is that nearly all of the exist- 
ing models in the academic literature are for a single 
hazard. That is, they were developed and validated 
for a single type of weather event, such as a hur- 
ricane (e.g., Han et al. 2009a, 2009b, Guikema et 
al. 2014, Nateghi et al. 2014a). The two exceptions
to this that I am aware of are the model presented
in Guikema et al. (2017), an “all weather”
model (wind storms, rain storms, lightning events,
heat events, and to a limited extent mountain snow
events), and the model presented in He
et al. (2017), which covers hurricanes,
thunderstorms, and blizzards.

A second limitation is that the data used as 
input to the models can vary significantly, with 
some models leaving out potentially important 
explanatory factors. For example, our previous 
work suggests that variables such as the duration 
of strong wind, soil moisture levels, and soil type 
play an important role in predicting the impact of 
high wind events (e.g., Quiring et al. 2011, Nateghi
et al. 2014a, 2014b). For high-temperature events,
the variable set becomes challenging in other ways
because the duration of high temperatures and
nighttime lows may be particularly important.
Predictive accuracy is limited if an incomplete set of
explanatory variables is used.

A third limitation is that many of the models we 
have seen in use within utilities vary greatly in the 
rigor with which they were developed and validated. 
It is possible to train a statistical model that fits
an outage data set extremely well but offers poor
predictions for new events. Careful validation
testing, using out-of-sample holdout testing, is critical
for properly balancing the trade-off between bias
and variance and maximizing prediction accuracy for
future storms.
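
The point about holdout testing can be illustrated with a small, self-contained sketch. The data are synthetic, and the two models (a least-squares line and a nearest-neighbour "memoriser") are stand-ins chosen for illustration; neither is a model used by any utility or in the cited papers.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = a + b*x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x, _ in points)
    return lambda x: my + b * (x - mx)

def fit_nearest(points):
    """1-nearest-neighbour 'memoriser': reproduces the training data exactly."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

def mae(model, points):
    """Mean absolute error of a fitted model on a set of (x, y) points."""
    return sum(abs(model(x) - y) for x, y in points) / len(points)

# Synthetic storms: outages grow linearly with gust speed plus noise (hypothetical data).
rng = random.Random(0)
data = [(g, 20.0 * g + rng.gauss(0.0, 150.0))
        for g in (rng.uniform(15.0, 45.0) for _ in range(1000))]
train, holdout = data[:700], data[700:]

line, memo = fit_line(train), fit_nearest(train)
in_sample = {"line": mae(line, train), "nearest": mae(memo, train)}
out_of_sample = {"line": mae(line, holdout), "nearest": mae(memo, holdout)}
```

The memoriser has zero error on the data it was trained on, yet it predicts the held-out storms worse than the simple line, which is the behavior that in-sample fit statistics alone would hide.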

A fourth limitation is that most of the available
models, with a few notable exceptions, provide
point estimates of impacts and do not represent
the uncertainty inherent in any prediction of the
impacts of hazardous weather events on power
systems. This point will be discussed further below.


3 CONCEPTUALIZATION OF RISK 
AND TREATMENT OF UNCERTAINTY 
IN POWER OUTAGE FORECASTING 
MODELS 


3.1 Background 


Risk has been conceptualized and measured in 
many different ways. Here I draw on Aven (2012) 
and Guikema and Aven (2015) and focus on three 
conceptualizations of risk: (1) risk is an expected 
value, (2) risk is a function of probabilities, con- 
sequences, and outcome scenarios, and (3) risk 
is a function of uncertainties, consequences, and 
outcome scenarios. These conceptualizations can
be summarized as: (1) R = E, (2) R = f(S,P,C), and
(3) R = f(S,U,C).

The first approach to risk assumes that risk 
can be summarized by an expectation. This could 
be expected outcome, with outcome measured in 
financial terms or some other measure appropri- 
ate for the situation, or it could be expected utility, 
converting the outcomes of the different scenarios 
into utility. The different possible outcome sce- 
narios are accounted for, but the risk measure is 
condensed down to a single measure by summing 
probability multiplied by the consequence meas- 
ure over all possible outcome scenarios. In this 
sense uncertainty is included in the measure, but 
only to the degree the probabilities over the sce- 
narios are included in calculating the expectation. 
This approach implicitly assumes a fixed value 
function, either through assuming risk neutrality 
(for an expected outcome) or a fixed risk attitude 
and utility functional form. This approach also 
assumes that uncertainty can be fully represented 
by probability, with a Bayesian perspective typi- 
cally adopted. 

The second approach is different in that rather 
than summarizing the probabilities and conse- 
quences by a single expected outcome measure, 
the full probability distribution is presented. Kap- 
lan and Garrick (1981) is arguably the paper that 
originated this approach, though it was used in 
the earlier WASH 1400 nuclear reactor study as 
well. It is widely used within many areas of risk 
analysis. The results are often given in the form
of an F-N (Frequency-Number) curve, a special
case of an inverse cumulative distribution. This
approach allows decision-makers to consider more 
than just a point estimate given that the full prob- 
ability distribution is available to them. It also does 
not assume a fixed value function. This means
that the decision maker(s) are able to obtain the
probability-consequence pairs for all scenarios and
use them in their own utility function should
they wish to. This approach does still assume
that uncertainty can be adequately represented by 
probabilities. Typically, a Bayesian perspective is 
taken with regard to the probabilities. 
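
The difference between the first two conceptualizations can be made concrete with a small example. The scenario set below is hypothetical, and the exceedance curve is a simplified stand-in for an F-N curve (it reports probabilities rather than frequencies).

```python
# Hypothetical scenario set: (probability, consequence) pairs whose probabilities sum to 1.
scenarios = [(0.70, 0.0), (0.20, 1_000.0), (0.08, 10_000.0), (0.02, 100_000.0)]

def expected_consequence(pairs):
    """R = E: a single number, the probability-weighted consequence."""
    return sum(p * c for p, c in pairs)

def exceedance_curve(pairs):
    """R = f(S,P,C): probability that the consequence is at least c, for each c."""
    levels = sorted({c for _, c in pairs})
    return [(c, sum(p for p, ci in pairs if ci >= c)) for c in levels]

risk_as_expectation = expected_consequence(scenarios)
curve = exceedance_curve(scenarios)
```

The expectation collapses the scenario set to one value of about 3,000, most of which comes from the rare, severe scenario. The curve keeps that scenario visible as a separate point with a probability of 0.02, which a decision maker can then weigh with their own value function.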

The third approach is similar to the second, 
except in how uncertainty is treated. Probabilities 
are often still provided, again sometimes in the form 
of an F-N curve. However, additional qualitative
descriptions of the information underlying those
probabilities are provided. Flage and Aven (2009)
provide approaches for doing this in an organized, 
grounded manner. This approach is fully Bayesian 
and focuses on describing the background state of 
knowledge underlying any probability assignment 
made within a Bayesian framework. As such, this 
approach provides substantially more information 
to decision-makers, particularly information that 
would allow them to judge how much confidence 
to place in the assessed probabilities. 


3.2 How do outage models conceptualize risk 
and treat uncertainty? 


Power outage forecasting models deal with an 
inherently spatial process. That is, they provide 
spatial estimates of power outages. For example, 
Figure 1 below shows examples of predicted and
actual outages for the hurricane power outage
forecasting model of Nateghi et al. (2014a) for
Hurricane Ivan.

In these types of forecasts, predictions are given 
as spatial point estimates. For example, in the 
model output shown in Figure 1, a prediction
consists of an estimate of the number of power
outages in each 12,000 ft. by 8,000 ft. (3.66 km by 2.44
km) grid cell.

Figure 1. Example power outage forecasts from Nateghi
et al. (2014a). Panels: Ivan - Actual; Ivan - Predicted.

How does such a model conceptualize risk? The
predictions are not probabilistic in any sense. One
could argue here that the predictions are intended
to represent a “best estimate,” with one
interpretation of this being that the point estimate is
intended to represent the mean, mode, or some other
measure of central tendency of the unknown,
un-estimated underlying distribution. This, however,
is reading more into these models than is actually 
there. Many of the current generation of outage 
models (e.g., Nateghi et al. 2014a, 2014b, Guikema
et al. 2014, McRoberts et al. 2017) use distribution-free
ensemble data mining methods such as Random 
Forest. These types of models do not directly 
estimate distributions, and they lack asymptotic 
approximations of distributions as one would have 
with generalized linear models. It is thus problematic
to say exactly what the point estimates represent.
In a sense, these models could be thought of
as an R = E conceptualization, but the details are
unclear.

Even in these point-estimate models, some 
sources of uncertainty have been represented. For 
example, the models are usually in a forecasting 
mode, i.e., predicting outages based on a weather 
forecast, and arguably the largest source of uncer- 
tainty in the outage predictions is uncertainty in the 
weather forecast. The one paper I am aware of that 
explicitly included weather forecast uncertainty in 
outage forecasting is Quiring et al. (2014). In this 
approach, a deterministic model, the one underly- 
ing the results in Figure 1, was coupled with 1,000 
replications of simulated hurricane forecasts. The 
deterministic outage forecasting model was run for 
each replication of the weather forecast, yielding 
an empirical probability density function for the 
outage forecast. An example, taken from Quiring
et al. (2014), is shown in Figure 2.
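
The ensemble idea can be sketched as follows. The outage model here is a hypothetical stand-in, and the weather uncertainty is represented by a Gaussian perturbation of a single gust speed rather than by simulated hurricane forecasts, so the sketch shows only the structure of the calculation, not the method of Quiring et al. (2014).

```python
import random
import statistics

def outage_model(gust_ms):
    """Stand-in deterministic outage model (hypothetical): outages rise sharply with gust speed."""
    return 0.0 if gust_ms <= 20.0 else 50.0 * (gust_ms - 20.0) ** 2

def ensemble_forecast(best_gust_ms, gust_sd_ms, n_reps=1000, seed=42):
    """Run the deterministic model once per simulated weather forecast."""
    rng = random.Random(seed)
    return sorted(outage_model(rng.gauss(best_gust_ms, gust_sd_ms)) for _ in range(n_reps))

single_run = outage_model(35.0)            # deterministic forecast from the best-estimate weather
outages = ensemble_forecast(35.0, 5.0)     # empirical distribution over 1,000 replications
ensemble_mean = statistics.fmean(outages)
p05, p95 = outages[49], outages[949]       # approximate 5th and 95th percentiles
```

Even in this toy setting the deterministic run is a single point inside a wide interval, and because the model is nonlinear the ensemble mean need not coincide with the run based on the best-estimate weather.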

In Figure 2, the red bar shows what the result 
would be if the deterministic model was run for 
the single best estimate weather forecast. The star 
represents the actual number of outages, and the 
green bar represents the mean of the ensemble. 
We see that there is a high degree of uncertainty 
in the outage forecasts due to the uncertainty in 
the weather forecast, uncertainty that is ignored in 
the deterministic forecasts shown in Figure 1. This
approach can be conceptualized as an R = f(S,P,C)
framework, though only a portion of the
uncertainty is accounted for.

As a third approach, consider a model that is 
explicitly trained to estimate quantiles of the dis- 
tribution for outages conditional on the input 
parameter values. This is the approach developed 
by both Guikema et al. (2017) and He et al. (2017). 
Both of these papers used a Quantile Regression 
Forest (QRF), a variant of the popular random 
forest algorithm that is designed to estimate speci- 
fied quantiles of the distribution of the response 
variable. For concreteness, I will use the model 
developed by Guikema et al. (2017) as an example, 
though He et al. (2017) is similar.

Figure 2. Example of including weather forecast
uncertainty in an outage forecasting model. The figure is from
Quiring et al. (2014).


The Guikema et al. (2017) model was developed 
for estimating damage to power systems due to 
adverse weather events. Damage is estimated sepa- 
rately for four classes of assets—poles, transform- 
ers, overhead line spans, and underground cable 
runs—where the quantity of interest for each is 
the number of damaged items in each asset class. 
The model is a two-stage model. The first stage is
a classification model that estimates the probability
of there being damage in any one (or more) of the
asset classes on a given day due to the weather. The
second stage consists of four QRF models, one per
asset class, that estimate quantiles (in 0.01
increments) of the distribution of the number of
damaged assets in that asset class. There is a separate
model that converts this into a probabilistic esti- 
mate of the person-hours needed for restoration, 
but I will focus on the damage model here. 

Figure 3. Example of the probabilistic predictions of
asset damage from Guikema et al. (2017).

To make a prediction for a given day with
the Guikema et al. (2017) model, N (typically
N = 10,000) replications are simulated from the
two-step process, leading to an estimated
probability distribution. Figure 3 shows an example of
this type of prediction, presented as a cumulative
distribution function. We see from this figure that for
this particular day, there is a high probability of
very little damage but that there is a long tail to
the distribution. That is, there is a small chance of
substantial damage on this particular day given the
weather forecast. This approach directly captures
the uncertainty in the forecast conditional on the
weather forecast. It does not, however, explicitly
model and propagate the weather forecast
uncertainty through the predictive model as Quiring
et al. (2014) do. This approach can be thought
of as an R = f(S,P,C) approach, though it captures
different aspects of the uncertainty than
Quiring et al. (2014).
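
A minimal sketch of such a two-step simulation is given below. How the two stages are combined in the published model is not described here in detail, so the sampling scheme (a Bernoulli draw for "any damage", followed by a uniform draw over the predicted quantile levels) is an assumption, and the stage outputs are hypothetical numbers.

```python
import random

def simulate_damage(p_any_damage, quantiles, n_reps=10_000, seed=7):
    """Two-step simulation of the number of damaged assets in one asset class.

    p_any_damage: stage-one probability that the day has any damage.
    quantiles:    stage-two predicted quantiles at 0.01, 0.02, ..., 0.99 (99 values).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_reps):
        if rng.random() < p_any_damage:
            draws.append(rng.choice(quantiles))  # pick a quantile level at random
        else:
            draws.append(0)
    return sorted(draws)

def empirical_cdf(sorted_draws, x):
    """Fraction of simulated days with damage <= x."""
    return sum(1 for d in sorted_draws if d <= x) / len(sorted_draws)

# Hypothetical stage outputs: mostly small counts with a long upper tail.
quantiles = [1] * 60 + [2] * 25 + [5] * 10 + [20, 40, 80, 150]
draws = simulate_damage(p_any_damage=0.15, quantiles=quantiles)
p_no_damage = empirical_cdf(draws, 0)
p_at_most_5 = empirical_cdf(draws, 5)
```

The resulting empirical distribution has the shape described above: most of the probability mass sits at little or no damage, with a small probability of a much larger count.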


4 DISCUSSION AND RECOMMENDATIONS
FOR A PATH FORWARD


4.1 Summary of the state of the literature in 
representing uncertainty in power outage 
forecasting 


From the discussion above, it is clear that power 
outage forecasting models have evolved over time 
in how they handle uncertainty and how they con- 
ceptualize risk. Early models were deterministic 
with at best an asymptotic approximation of the 
predictive distribution. One model that I am aware 
of has explicitly included weather forecast uncer- 
tainty, the largest source of uncertainty, in the 
forecasting. Two other models have used explicitly 
probabilistic prediction models to directly estimate 
probability distributions for outcomes, but they did 
not explicitly include weather forecast uncertainty. 
None of the models to date have moved beyond 
probabilistic representations of uncertainty to 
include descriptions of the strength of the
knowledge underlying the predictions. This is potentially
quite important because there is substantially
more knowledge and data about the performance 
of power systems under some storm conditions 
than others. 


4.2 Suggested path forward 


Much has been achieved in the past 12 years dur- 
ing which power outage forecasting has been 
an active area of research. Model accuracy has 
improved substantially, and an increasing number 
of explanatory factors are being included in out- 
age forecasting models. Progress on incorporat- 
ing uncertainty into the forecasts has been made 
as well, though this has been more limited as dis- 
cussed above. Where should the field move next? 
What types of developments are needed? Here I 
discuss three key developments that are needed, 
focused on the treatment of uncertainty in outage 
forecasting models. 

First, models should be developed that more 
completely characterize uncertainty probabi- 
listically. I am aware of only one paper that has 
propagated weather forecast uncertainty through 
an outage forecasting model, and the model in 
this paper has not been used operationally due to 
computational limitations. The two probabilistic 
outage forecasting papers (i.e., Guikema et al. 2017 
and He et al. 2017) only handle weather forecast 
uncertainty indirectly, to the extent that it is repre- 
sented in the uncertainty in the conditional predic- 
tions given the outage forecast. Models are needed 
that combine these probabilistic conditional fore- 
casts with an explicit modeling of weather forecast 
uncertainty. While this would be technically
challenging, it would give a much more complete
picture of the uncertainty in a given forecast.
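
One way to picture the combination suggested here is a nested simulation: an outer loop over simulated weather forecasts and an inner loop over the conditional outage distribution for each simulated forecast. The sketch below is purely illustrative. The conditional quantile function is a hypothetical stand-in for a trained quantile model, and the Gaussian weather perturbation is an assumption.

```python
import random

def conditional_quantiles(gust_ms):
    """Stand-in for a trained quantile model (hypothetical): 99 outage quantiles given gust speed."""
    centre = max(gust_ms - 20.0, 0.0) * 100.0
    return [centre * (0.5 + q / 100.0) for q in range(1, 100)]

def combined_forecast(best_gust_ms, gust_sd_ms, n_weather=500, n_inner=20, seed=3):
    """Outer loop samples the weather forecast; inner loop samples the conditional distribution."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_weather):
        gust = rng.gauss(best_gust_ms, gust_sd_ms)
        quantiles = conditional_quantiles(gust)
        draws.extend(rng.choice(quantiles) for _ in range(n_inner))
    return sorted(draws)

def spread(sorted_draws):
    """Width of the central 90% interval."""
    n = len(sorted_draws)
    return sorted_draws[int(0.95 * n)] - sorted_draws[int(0.05 * n)]

conditional_only = sorted(random.Random(3).choices(conditional_quantiles(35.0), k=10_000))
combined = combined_forecast(35.0, 5.0)
```

Comparing `spread(combined)` with `spread(conditional_only)` shows how much wider the forecast interval becomes once weather forecast uncertainty is propagated in addition to the conditional uncertainty.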

Second, approaches should be developed that 
move beyond probabilities to provide a clear 
description of the strength of the information 
underlying the prediction. Consider for example 
two situations. In the first, an outage model is to 
be run for a weather event that is similar to a large 
number of past events that have impacted that util- 
ity system, and there is strong confidence in that 
weather forecast for that day. The model was pre- 
sented with a large set of highly relevant data dur- 
ing the training process. In the second, the weather 
forecast is much more uncertain, and the forecast 
conditions have only been experienced by that 
power system a small number of times in the past. 
The model may give similar numerical estimates 
in these two cases, but many would argue that the 
utility should have much more confidence in the 
model in the first case. Qualitative descriptors of 
the strength of knowledge and the similarity of 
the forecast conditions to past events, if properly 
developed and described clearly in practice, could
help utility decision makers better judge the degree
of confidence they should have in the model in a 
given situation. 

Third, better methods for communicating uncer- 
tainty in predictions to utility personnel need to be 
developed. This is related to the second suggestion, 
but the focus here is on how the probabilistic fore- 
casts themselves are communicated to utility per- 
sonnel. In my experience, utility personnel do not 
understand probability distributions as predictions 
or, in some cases, even discrete quantiles from the 
distribution particularly well. They are also mak- 
ing expensive decisions under intense time pres- 
sure after a major storm. If they are confused by 
model output, the model is not helpful to them. If 
probabilistic predictions are to be used in utilities 
in practice, new methods for communicating these 
predictions must be found. 


5 CONCLUDING THOUGHTS 


Much progress has been made in power outage fore- 
casting for storms. Outage forecasting models are in 
increasingly wide use in practice, and the available 
models are increasing in both accuracy and sophis- 
tication. However, challenges remain in how these 
models conceptualize risk and uncertainty and how 
they model and present uncertainty in predictions 
to model users. This paper suggests some specific 
steps that should be taken as this field moves for- 
ward, steps that would help to improve how power 
outage forecasting models deal with the sometimes- 
substantial uncertainty in their predictions. 
ABSTRACT: The issues “risk assessment” and “tolerable risk” are causing conflicting reactions not 
only among Health and Safety experts. Experienced designers in the machinery sector are sometimes 
unsettled, too. The controversies are mainly about numerical probabilistic representations. These are new 
in the field of general machinery safety, and the key term “probability” turned out to be ambiguous. 
Recently introduced probabilistic methods encounter a well-proven practical state-of-the-art, which is 
merely based on qualitatively defined requirements. 

If objective findings are to be taken as a basis, it is obvious that the sequence of the total annual figures 
for reportable accidents can be considered as random independent events in a population of comparable 
elements. Mathematical concepts seem to make sense if the annually recorded (machine-specific) accident 
data are interpreted as the overall result of a huge “random experiment” in a relevant observation 
framework (e.g. DGUV statistics in Germany). 


1 PROBABILITIES IN SAFETY 
OF MACHINERY 


The issues “risk assessment” and “tolerable risk” 
are causing conflicting reactions not only among 
Health and Safety experts. Experienced designers 
are sometimes unsettled, too. The controversies 
are mainly about numerical probabilistic repre- 
sentations. These were recently introduced in the 
field of general machinery safety, when the revised 
European Machinery Directive 2006/42/EG (2006) 
extended the “hazard analysis” of former versions 
to a “risk analysis” by introducing the term “probability” 
in the expression: “estimate the risks, taking 
into account the severity of the possible injury or 
damage to health and the probability of its occur- 
rence”. Since this alteration in a legal text, simpli- 
fied probabilistic methods as in ISO 13849-1 (2008) 
are being developed. They encounter a well-proven 
practical state-of-the-art, which is merely based 
on qualitatively defined requirements. They were 
mainly focussing on hazards as such (and their 
countermeasures) rather than “risks, severities of 
injuries and their probability of occurrence”. Nev- 
ertheless, on the background of customer demands 
for very high availabilities (which is equivalent to 
high inherent safety), it brought about a well-tried 
state-of-the-art following the three-step-reduction 
method of ISO12100 (2010). 

It is hard to believe, but for the time being, the 
normative frame for machinery safety does not con- 
tain a numerical risk model, which summarizes the 


single factors plausibly to an overall risk, in order 
to e.g. compare theoretical results with empirical 
field data. As a consequence, fictitious hazards and 
actual risks are being mashed up again and again, 
so that in controversial discussions in the world of 
safety standardization, mostly the strongest gut 
feeling overtrumps logical considerations. 

The state-of-the-art has been defined for more than 20 years on a non-quantitative, descriptive basis in harmonized safety standards listed in the Official Journal of the European Commission (2017). However, the advantage of the quantitative 
probabilistic theory is clearly visible: the entire 
risk reduction process can be detailed in single 
quantitative factors showing the proportions to 
enable a more effective engineering. Surprisingly, 
the transition from “qualitative” to “quantitative” 
requirements was not so easy, since the key term 
“probability” turned out to be ambiguous: subjec- 
tive probabilities are disturbing the discussions, 
since many traditional experts claim that objective 
probabilities were missing or difficult to derive. Is 
this really so? 


2 REFERENCE TO ACCIDENT RECORDS 


The accident numbers of machine tools (for metal working) in Germany have been decreasing since the introduction of a new statistical framework of the European Union in the year 2004, and also before. This pleasant trend of decreasing overall numbers of reportable accidents is the result of a considerable effort among all stakeholders, on the manufacturer side and on the user side as well (BGHM). 

For instance, German manufacturers of machine 
tools are closely working together in a specific 
working group of VDW (German Machine Tool 
Builders Association, Frankfurt) on safety issues 
in order to exchange their experiences. This is nec- 
essary, because the commenting phases of product 
safety standards, which are being repeated every 
5 years, urge them to review existing standard 
provisions. 

The predominant question is then whether a state-of-the-art can be considered as “proven-in-use” to be safe, or whether there are reports of accidents (or incidents) indicating the need for certain upgrades in the safety design. 

Answering this question is only possible, if the 
relevant safety experts of BGHM (German Occu- 
pational Safety Organisation for metal-working 
machines) are involved, see Kesselkaul and Meyer 
(2016), Adler (2015), Platz (2016) and Preusse (2005). 


2.1 Record low for machine tools was in 2014 and 
it was missed in 2015 


Most recently, the journal “BGHM-Aktuell” 
reports in June 2016 a significant decrease of sum- 
marized accident data for the branches wood- and 
metal-working, see Platz (2016). For the separated 
numbers of metal-working machine tools (i.e. 
without wood-working), the situation of Figure 1 
remains. The decreasing trend in the overall num- 
bers of reportable accidents is also visible here. 
Noteworthy is that in 2015 “only” 17.235 report- 
able accidents are allocated to metal-working 
machine tools (out of the machine perspective). 
The delightful results in the overall numbers of wood and metal are unfortunately different when focussing closer on the subsets of metal. For instance, in Fig. 2 the subset “new pension payments” is displayed for (metal-working) machine tools, and in Fig. 3 “fatal accidents”. 

Obviously, since 2008 the numbers have been oscillating up and down, e.g. 281 in 2014 and 350 in 2015; a stable decrease since 2008 is not visible. Thus, the above-mentioned record low in 2015 does not exist for machine tools, because it happened in the year before: so there is still a need for action. 

Figure 1. Reportable accidents of German machine tools from 2004 to 2015 (Source: DGUV and VDW (2015)). 
Figure 2. New pension payments for machine tools from 2004 to 2015 (Source: DGUV and VDW (2015)). 
Figure 3. Fatal accidents for machine tools from 2004 to 2015 (Source: DGUV and VDW (2015)). 

Remark: the interpretation limits of statistical 
findings of DGUV are considered in an extended 
version of this paper. 


3 CONNECTING THE ACCIDENT 
STATISTICS TO A PROBABILITY SPACE 


Obviously, the sequence of yearly overall num- 
bers for reportable accidents can be regarded as 
random independent events, since each year dif- 
ferent random effects are causing the reportable 
accidents (different machines and operators are 
affected in different situations). 


3.1 Random experiments and accident modelling 


Mathematical concepts as in Barner et al (1983), 
Bauer (1990,1991), Bertsche et al (2004) and Feller 
(1967) seem to be useful for the reduction of real 
accident risks, if the yearly recorded accident data 
are interpreted as the overall outcome of a huge 
“random experiment” in a relevant observation 
frame (which e.g. is given by Occupational Safety 
(DGUV) Statistics in Germany): 


a. the operation of an average number of α comparable machines during the actual record year; 

b. an average number of β operators, 

c. who are busy operating them during an average number of γ hours. 


Consequently, the annual accident statistics must be linked to a clearly defined “probability space” (also known as “sample space”). As a result, Health and Safety experts can monitor yearly accident records on a probabilistic basis. This is recommendable, because the European Machinery Directive 2006/42/EC refers to the term “probability”. 

Regrettably, probability theory is usually not part of the education of engineers, and measure theory is not taught to mechanical engineers either. Since probability theory is based on measure theory, some notations shall be explained graphically and vividly. 


3.1.1 Basic terms 
The term “probability” can be expressed in differ- 
ent ways, see Moedden (2015): 


a. as a real number between 0 and 1, where 0 means 
impossible and 1 means a certain event; also 

b. the expected frequency is used, which is the 
fraction of the number of relevant events over 
all possible events (e.g. expressing a statistical 
finding, empirical or theoretical); if this fraction 
is connected to a given period of time, it means 
the probability of an event per time; 

c. where no statistical data exist, “probability” can 
express a degree of belief about facts (e.g. due 
to subjective “experience”). 


It is obvious that the latter might become a mere 
“gut feeling”. This is the worst kind of a proba- 
bility: maybe that is the reason, why Feller (1967) 
emphasizes in the preface: “it is the purpose of this 
book to treat probability theory as a self-contained 
mathematical subject rigorously avoiding non- 
mathematical concepts”. 

The expected frequency is also often misunderstood. For instance, if there is a lack of experience, e.g. when empirical data or theoretical findings are not available, it happens amazingly often that the probability of an event is mistaken for a mere possibility of occurrence. In doing so, the probability is not calculated with respect to the correct reference frame as a fraction of numerator/denominator; this error is called “denominator neglect”, see Blastland (2014), Spiegelhalter et al (2014). I.e. the reference frame for the calculation of probability is omitted or ignored. 

As an important logical foundation for quantitative probabilities, Kolmogoroff’s axioms are based on an event space of elementary singletons. The entire probability theory can be developed from them. Extract from Feller (1967) for the axiomatic probability definition (quote): 

The “probability” is not defined as such. The modern theory rather uses the “probability” as a term, which has to fulfill certain axioms. The axioms established by Kolmogoroff are: 

1. To every random event A a real number P(A) can be assigned such that $0 \le P(A) \le 1$, which is called the probability of A. This leans on the properties of relative frequencies. 

2. The probability of the sure event E is $P(E) = 1$ (scaling axiom). 

3. If $\{A_i, i \ge 1\}$ are random events, which are pairwise disjoint, i.e. $A_i \cap A_j = \emptyset$ for $i \ne j$, then it follows in the limit $n \to \infty$ (countable additivity axiom): 

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) \qquad (1)$$ 


3.1.2 Four steps to build a probability space 

In order to get one's bearings for an effective risk reduction, several steps are necessary to apply probability theory to the “random experiment” of yearly accidents. To start with, the real accident records can be connected to some notations of mathematical measure theory, see Bauer (1990). 

Obviously, the numbers of accidents are “measures” in a sample space. Then, these measures need to be calibrated such that a probability space is built. 


Step 1: Define the sample space $\Omega$ of this random experiment with all possible outcomes. 

The entirety of all possible elementary outcomes $\omega_i$ ($\omega_i \in \Omega$) for every single machine and every single operator has a variety of numerous cause-to-effect relations. It spreads from technical failures over human errors to force majeure risks (e.g. lightning stroke). So as regards the cause-to-effect relation, the elementary outcomes $\omega$ can be thought of as a countably infinite set of singletons $\{\omega_i, i \ge 1\}$. Nevertheless, the entirety of all possible elementary singletons, which are to be assigned to severities of injuries, can be divided into five subsets (i.e. resulting events in a record year, here sorted by their expected frequency of occurrence from high to low), as shown in Table 1. 

In addition to this sample space $\Omega$, there are some related events (e.g. further dimensions due to possible repetition): At the end of a record year in the above-mentioned observation frame, every single operator in the basic population of $\beta$ operators is going to be either in the fortunate subset $A_1$, or he/she was (once or more) lucky in subset $A_2$ or (once or more) unlucky in subsets $A_3$, $A_4$. Or it could even happen that he/she was in the mortal subset $A_5$. Accordingly for a record period, every single machine of the basic population of $\alpha$ comparable machines can be connected to subset $A_1$ or to one or more of the subsets $A_2$, $A_3$, $A_4$ or $A_5$. And finally, every single hour of the on average $\alpha \cdot \gamma$ hours can be allocated at least to event $A_1$, or to one of the other four subsets. 
Table 1. Sample space $\Omega$ with all possible outcomes, examples given for singletons. 

Subset of $\Omega$ | Meaning | Example of one singleton $\omega \in A_i \subset \Omega$ 
Event $A_1$ | No hazardous event, i.e. undisturbed machine operation (or standing idle), no relevant hazards | Five days of automatic production from Monday to Friday without interrupts. 
Event $A_2$ | A hazardous event without causing an injury (i.e. a near accident occurred) | One Monday morning, a drilling tool broke. It was set free, but retained in the machine enclosure. 
Event $A_3$ | A hazardous event causing an accident with slight or severe reversible injury | During maintenance work, an operator crushed his finger. After months of medical treatment it healed. 
Event $A_4$ | A hazardous event causing an accident with a severe irreversible injury (pension payment) | After having defeated the safeguards of a turning machine, an operator’s eye was hit by ejected swarf so heavily, that he was handicapped afterwards. 
Event $A_5$ | A hazardous event causing an accident with a fatal injury (pension payment to the relatives) | On a Friday afternoon, an operator was entangled by a CNC-lathe, when he was manually polishing the surface of the turning workpiece. 


It has to be remarked that this sample space $\Omega$ is not absolutely “mathematically correct”, because the fact is ignored that the basic populations of machines and operators presumably are slightly different from one year to the next. However, this “transition error” alters the findings below only in degree, not in substance: the conclusions remain the same. 

As regards real design practice, Heisenberg 
(2013) illustrates an (implicitly described) sample 
space for a certain range of products and the 
according event records, which fits exactly to the 
explicit one in Table 1. 


Step 2: A measure space is required, which contains the measurable sets (called events). 

Measure theory illustrates that with a collection of measurable events $F = \{A_i, i \ge 1\}$, we get the measurable space $(\Omega, F)$. A measure on $(\Omega, F)$ is a function $\mu: F \to [0, \infty]$ such that: 

i. $\mu(\emptyset) = 0$ 

ii. If $\{A_i, i \ge 1\}$ is a sequence of disjoint sets in F, then the measure of the union (of countably infinite disjoint sets) is equal to the sum of measures of individual sets, i.e. 

$$\mu\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i) \qquad (2)$$ 

The second property stated above is known as the countable additivity property of measures. From the definition, it is clear that a measure can only be assigned to elements of F. The triplet $(\Omega, F, \mu)$ is called a measure space. $\mu$ is said to be a finite measure if $\mu(\Omega) < \infty$; otherwise, $\mu$ is said to be an infinite measure. In particular, if $\mu(\Omega) = 1$, $\mu$ is said to be a probability measure. 

Since an event is a subset (of comparable singletons) of the sample space $\Omega$, to which a probability will be assigned, the following holds for the measurable subsets of the accident sample space: the disjoint sets in F are here $\{A_i, i = 3, \ldots, 5\} = \{A_3, A_4, A_5\}$ with the property: 

$$\mu\left(\bigcup_{i=3}^{5} A_i\right) = \sum_{i=3}^{5} \mu(A_i) = \mu(A_3) + \mu(A_4) + \mu(A_5) \qquad (3)$$ 


Step 3: Assign measures to the events of F. 

The measure $\mu$ can be understood as the overall number of all accidents of the type reportable (R) in the relevant observation frame above, since the subsets of severe accidents with pension payments (PP) and mortal accidents (M) are contained therein. 

$$\mu(A_3 \cup A_4 \cup A_5) = R \qquad (4)$$ 

However, a closer look at different severities requires a distinction of the numbers of the subsets so that they are disjoint: 

$M$ =: number of all mortal accidents (remains): this number corresponds to event $A_5$ above. 
$PP^*$ =: number of severe accidents with pension payments (PP) minus fatal accidents: $PP^* = PP - M$: this number corresponds to event $A_4$ above. 
$R^*$ =: number of all reportable accidents minus the number of severe accidents with pension payments ($PP^*$) and minus the number of all mortal accidents (M): $R^* = R - PP^* - M$: this corresponds to event $A_3$ above. 

Now, the values $R^*$, $PP^*$, $M$ can be derived from the yearly accident records. They can be assigned to $F = \{A_3, A_4, A_5\}$ following these principles: 

$$\mu(A_3 \cup A_4 \cup A_5) = R = R^* + PP^* + M \qquad (5)$$ 


Remark: Since the measurable space F is simple with only three F-measurable sets $\{A_3, A_4, A_5\}$, it fulfills the conditions of an algebra, a $\sigma$-algebra and a Borel algebra (see Feller (1967) and Jaganathan (2015)). 
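
The split of the nested record numbers R, PP and M into the disjoint counts R*, PP* and M from Step 3 can be written as a small helper. This is a minimal sketch; the function name and argument names are illustrative assumptions. The input values are the 2014 figures quoted in Section 3.2 (R = 17.563, PP = 281, M = 1, where the dot is a thousands separator).

```python
def disjoint_counts(reportable, pension_payments, mortal):
    """Split nested accident counts into disjoint counts (R*, PP*, M).

    Assumes the nesting described in Step 3: mortal accidents are part of
    the pension-payment accidents, which are part of the reportable ones.
    """
    if not (0 <= mortal <= pension_payments <= reportable):
        raise ValueError("expected 0 <= M <= PP <= R")
    pp_star = pension_payments - mortal       # event A4: PP* = PP - M
    r_star = reportable - pp_star - mortal    # event A3: R* = R - PP* - M
    return r_star, pp_star, mortal

r_star, pp_star, m = disjoint_counts(reportable=17563,
                                     pension_payments=281,
                                     mortal=1)
print(r_star, pp_star, m)

# The disjoint counts add up to the total again, as in eq. (5).
assert r_star + pp_star + m == 17563
```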


Step 4: Assign probabilities to the events of F. 

The triplet $(\Omega, F, P)$ is called a probability space, if the following three properties (sometimes referred to as the axioms of probability) are fulfilled: a probability measure is a function $P: F \to [0, 1]$ such that: 

i. $P(\emptyset) = 0$ 

ii. $P(\Omega) = 1$ 

iii. (Countable additivity:) If $\{A_i, i \ge 1\}$ is a sequence of disjoint sets in F, then 

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) \qquad (6)$$ 

Here, we have the single probability measures $P(A_1)$, $P(A_2)$, $P(A_3)$, $P(A_4)$, $P(A_5)$, which are strictly positive. Because of the limitations of the above defined random experiment of yearly accident records, we know that for every combination of $\alpha$, $\beta$, $\gamma$ the following equation holds: $P(A_1) + P(A_2) + P(A_3) + P(A_4) + P(A_5) = 1$. But we don’t know the actual numbers $\alpha$, $\beta$, $\gamma$. Obviously, these probabilities are not easy to determine, because the frame parameters $\alpha$, $\beta$, $\gamma$ of the basic population are not known. Thus, the measures $R^*$, $PP^*$, $M$ of the events $A_3$, $A_4$, $A_5$ are only rough reference numbers, which can be plotted over the years. 


3.2 Available data put into an observation frame 


So, for the sake of showing the proportions of $P(A_1)$, $P(A_2)$, $P(A_3)$, $P(A_4)$, $P(A_5)$ as practical probability measures derived from the record numbers $R^*$, $PP^*$, $M$, the values $\alpha$, $\beta$, $\gamma$ shall be estimated for the above defined observation frame, to start with (see remark below): 

$$\alpha = 500.000, \quad \beta = 50.000, \quad \gamma = 4 \qquad (7)$$ 

The average hours per day $ah_d = \alpha \cdot \gamma$ are then: $ah_d = 500.000 \cdot 4 = 2.000.000$. And per year with on average about 220 working days: $ah_y = 2.000.000 \cdot 220 = 4,4 \cdot 10^8$. 

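The probability of an event $A_3$ can then be expressed as a relative frequency (rf): 

$$P(A_3) = rf(A_3) = \frac{R^*}{ah_y} \qquad (8)$$ 

Accordingly, the probabilities of the events $A_4$ and $A_5$ are: 

$$P(A_4) = rf(A_4) = \frac{PP^*}{ah_y} \quad \text{and} \quad P(A_5) = rf(A_5) = \frac{M}{ah_y} \qquad (9)$$ 

The residual probabilities $P(A_1) + P(A_2)$ are then: 

$$P(A_1) + P(A_2) = 1 - \left[P(A_3) + P(A_4) + P(A_5)\right] \qquad (10)$$ 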


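The accident record for machine tools of 2014 shall serve as an example (see DGUV): 

$M = 1$, $PP^* = 281 - 1 = 280$, $R^* = 17.563 - 280 - 1 = 17.282$. 

$$P(A_5) = rf(A_5) = 1 / (4,4 \cdot 10^8) = 2,3 \cdot 10^{-9}$$ 
$$P(A_4) = rf(A_4) = 280 / (4,4 \cdot 10^8) = 6,4 \cdot 10^{-7}$$ 
$$P(A_3) = rf(A_3) = 17.282 / (4,4 \cdot 10^8) = 3,9 \cdot 10^{-5}$$ 
$$\Rightarrow P(A_1) + P(A_2) = 1 - 3,9 \cdot 10^{-5} - 6,4 \cdot 10^{-7} - 2,3 \cdot 10^{-9} = 0.99996$$ 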
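
The arithmetic of this example can be retraced with a few lines of code. This is only a sketch: the constant and function names are illustrative assumptions, and the frame parameters are the estimates of eq. (7) together with the 220 working days per year assumed in the text.

```python
ALPHA = 500_000      # estimated number of comparable machines, eq. (7)
GAMMA = 4            # estimated average operating hours per day, eq. (7)
WORKING_DAYS = 220   # average working days per year, as assumed in the text

def event_probabilities(r_star, pp_star, mortal, hours_per_year):
    """Relative frequencies of A3, A4, A5 and the residual P(A1) + P(A2)."""
    p3 = r_star / hours_per_year
    p4 = pp_star / hours_per_year
    p5 = mortal / hours_per_year
    return {"A3": p3, "A4": p4, "A5": p5, "A1+A2": 1.0 - (p3 + p4 + p5)}

hours_per_year = ALPHA * GAMMA * WORKING_DAYS   # 4.4e8 machine hours
probs = event_probabilities(r_star=17282, pp_star=280, mortal=1,
                            hours_per_year=hours_per_year)

for event, p in probs.items():
    print(f"P({event}) = {p:.2e}")
```

Rounded to two significant digits, the values reproduce the figures given above for the record year 2014.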


In Figure 4 the proportions are illustrated according to the assumption of eq. (7). The light grey right column represents the probabilities of the subsets $A_1$ and $A_2$, where no injuries happen. The stepwise darker grey columns indicate the probabilities of the subsets $A_3$, $A_4$ and $A_5$, where accidents are reported as slight or severe reversible injuries, severe irreversible injuries or even fatal injuries. Please take note of the fact that only the absolute values of eq. (10) should be questioned, since their relative proportions are largely independent of the presumptions made above in eq. (7). If absolute values were to be acquired, the assumptions in eq. (7) could be spread in terms of empirical worst case, average and best case as a histogram. 
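
The remark on relative proportions can be checked numerically for the accident events: the ratios among P(A3), P(A4) and P(A5) depend only on the recorded counts, because the assumed number of machine hours cancels out. The sketch below uses the 2014 counts from the example; the three scenarios for the hours per year are hypothetical values chosen for illustration only. Note that the residual P(A1) + P(A2) does change with the assumed hours.

```python
def relative_proportions(r_star, pp_star, mortal, hours_per_year):
    """Ratios P(A3) : P(A4) : P(A5), normalised to P(A5) = 1."""
    probabilities = [count / hours_per_year
                     for count in (r_star, pp_star, mortal)]
    smallest = probabilities[-1]
    return [p / smallest for p in probabilities]

# Hypothetical frame assumptions: machines * hours per day * working days.
SCENARIOS = {
    "worst case": 300_000 * 3 * 200,
    "average (eq. 7)": 500_000 * 4 * 220,
    "best case": 700_000 * 6 * 230,
}

for label, hours in SCENARIOS.items():
    ratios = relative_proportions(17282, 280, 1, hours)
    print(label, [round(r) for r in ratios])
```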

Because $\{A_i, i = 3, \ldots, 5\}$ is a sequence of three disjoint sets, eq. (6) is fulfilled: 


[Figure 4: bar chart on a logarithmic probability axis (1.0E-09 to 1.0E-01) with bars for P(A1)+P(A2), P(A3), P(A4) and P(A5).] 

Figure 4. Relative proportions in event probabilities are largely independent of assumptions in eq. (7). 


3.3 Fascinating vision: no accidents (“zero risk”) 


Now that we have built a probability space based on a “σ-algebra”, we can have a closer look at the fascinating vision “zero risk”. Based on the definitions 1. to 4. above, the zero-one laws of probability theory can be applied. They indicate that the probability of events of a certain type (here a set of the “limes superior”) is either 0 or 1. That is to say: those events happen either almost surely or they are almost impossible. 

In particular the Borel-Cantelli lemmas seem to be suitable to check the decreasing trend of accident numbers, whether the fascinating vision “zero risk” can possibly be realized in the future (see Moedden (2017)). Thus, the sum of all single probabilities (of all subsequent events) is the crucial quantity: if the sum increases without limit, the repeated occurrence of events (here accidents) is virtually certain. As a consequence of this insight, safety experts cannot be content with current safety standards and their (implicitly/explicitly) defined tolerable risk. 
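
The role of the sum of probabilities can be illustrated numerically. The first Borel-Cantelli lemma states that if the probabilities of a sequence of events have a finite sum, then with probability 1 only finitely many of the events occur; the second lemma states that for independent events with a divergent sum, infinitely many occur with probability 1. The sketch below compares a constant yearly accident probability (divergent sum) with a probability that shrinks by a fixed factor every year (convergent sum). The starting value 0.01 and the factor 0.9 are hypothetical and serve only as an illustration.

```python
def partial_sum(yearly_probability, years):
    """Sum of the yearly event probabilities over the first `years` years."""
    return sum(yearly_probability(n) for n in range(1, years + 1))

def constant(n):
    # Tolerable risk level that is never reduced (hypothetical value).
    return 0.01

def shrinking(n):
    # Tolerance level reduced by 10 % every year (hypothetical values).
    return 0.01 * 0.9 ** (n - 1)

for years in (10, 100, 1_000, 10_000):
    print(years,
          round(partial_sum(constant, years), 4),
          round(partial_sum(shrinking, years), 4))
```

With the constant level the partial sums grow without bound, whereas with the shrinking level they stay below the geometric limit 0.01 / (1 - 0.9) = 0.1, which is the situation covered by the first lemma.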

Obviously, the tolerance level has to be continuously reduced for reaching a “zero risk”. To start with the foundation of the Borel-Cantelli lemmas in Moedden (2017), two event types (or subsets) of a σ-algebra F that are relevant for convergence and divergence shall be explained. 


3.3.1 “Limes inferior” 

The sequence of subsets $\{A_i, i = 1, \ldots, n\}$ in a $\sigma$-algebra F shall be given, e.g. $A_i$ could be a single record of reportable accidents of year “i” in an observation frame. Now, we form a subset of elements $\omega$, which are contained in (only) almost all $A_i$, i.e. these elements are missing (at the utmost) from only a finite number of $A_i$. 

The example in Figure 5 indicates that for a singleton $\omega$ there are two subsets: i. $\{\omega \notin A_1, A_2, A_3\}$ and ii. $\{\omega \in A_4, A_5, A_6, A_7, A_8, A_9, \ldots\}$. The sequence of both subsets is called the “limes inferior”, which is a kind of lower accumulation point of a sequence of random events. It is defined as the union ($n = 1, \ldots, \infty$) of the intersections of all partial sequences ($i = n, \ldots, \infty$): 

$$A_{\inf} = \liminf A_n = \bigcup_{n=1}^{\infty} \bigcap_{i=n}^{\infty} A_i \qquad (12)$$ 

Figure 5. Example for the lower accumulation point “limes inferior”. 

The subset $A_{\inf}$ is the set of all elements $\omega \in \Omega$, which are contained in almost all (i.e. all but finitely many) of the subsets $A_i$. Let us have a closer look at a single element $\omega$, which could e.g. mean that $\omega$ is the event “no fatal accident at grinding machines in an entire record year”. As shown in Figure 5, the singleton $\omega$ is missing from only finitely many $A_i$. That is to say, there is a time point $n^*$ such that for all $n \ge n^*$: $\omega \in A_n$. Then $\omega$ is in the intersection: $\omega \in \bigcap_{i \ge n^*} A_i$, which means (in this example) no more fatal accidents would occur at grinding machines in the relevant observation frame beyond $n \ge n^*$ (“zero risk”). If all time points are considered, we get: $A_{\inf} = \liminf A_n = \bigcup_{n=1}^{\infty} \bigcap_{i=n}^{\infty} A_i$. The reverse argument needs to be proven. Let $\omega$ be one singleton with $\omega \in \bigcup_{n=1}^{\infty} \bigcap_{i=n}^{\infty} A_i$; then there is a number $n^*$ with $\omega \in \bigcap_{i=n^*}^{\infty} A_i$. From this time point on, $\omega \in A_i$ holds for all $i \ge n^*$. For $1 \le i < n^*$ (finitely many), we don’t know how often $\omega$ is contained. Consequently, the element $\omega$ is contained in (only) almost all $A_i$. 


3.3.2 “Limes superior” 

The sequence of subsets $\{A_i, i = 1, \ldots, n\}$ in a $\sigma$-algebra F shall be given. We form another subset of elements, which are contained in infinitely many $A_i$. 

The example in Figure 6 indicates that for a singleton $\omega$ there are two subsets: 

i. $\{\omega \notin A_2, A_3, A_5, A_7, \ldots\}$ and 

ii. $\{\omega \in A_1, A_4, A_6, A_8, \ldots\}$. So $\omega$ is contained in the non-prime number subsets, but not in the prime number subsets. This sequence of subsets is called the “limes superior”, which is a kind of upper accumulation point of a sequence of random events. It is defined as the intersection ($n = 1, \ldots, \infty$) of the union of all partial sequences ($i = n, \ldots, \infty$): 

$$A_{\sup} = A_n \text{ i.o.} = \limsup A_n = \bigcap_{n=1}^{\infty} \bigcup_{i=n}^{\infty} A_i \qquad (13)$$ 

Figure 6. Example for the upper accumulation point “limes superior”. 

The subset $A_{\sup}$ is the set of all elements $\omega \in \Omega$, which are contained in infinitely many of the subsets $A_i$, and obviously $\omega \in A_n$ applies for infinitely many $n \in \mathbb{N}$, i.e. “$A_n$ i.o.” with “i.o. = infinitely often”. $A_{\sup}$ is a nested intersection of unions. The union $\bigcup_{i=n}^{\infty} A_i$ can be thought of as a sequence $(B_n)_{n \in \mathbb{N}}$ with $B_n \supseteq B_{n+1} \supseteq B_{n+2} \supseteq \ldots$. The $(B_n)$ are nested decreasing subsets. This resembles the Matryoshka nested doll in Fig. 7, where the smallest innermost doll is the “intersection” of all dolls. 

Figure 7. Matryoshka nested doll (Source: Elisabeth Mödden). 

Formally, let $B_n = \bigcup_{i=n}^{\infty} A_i$ for $n \in \mathbb{N}$, then $(A_n \text{ i.o.}) = \bigcap_{n=1}^{\infty} B_n$. In doing so, $B_n$ is the event that at least one of the events $A_n, A_{n+1}, \ldots$ occurs (that is a bigger set than the event that at least one of $A_{n+1}, A_{n+2}, \ldots$ occurs). It is sometimes referred to as the “end-tail” event, which means that from n onwards at least one of the events $A_n, A_{n+1}, \ldots$ occurs. Then $(A_n \text{ i.o.})$ is the intersection of the “end-tail” events; it means that for every n an event $B_n$ occurs. That is to say, no matter how big n is, at least one of the $A_i$ with $i \ge n$ will occur. 

$$P(A_n \text{ i.o.}) = P\left(\bigcap_{n=1}^{\infty} B_n\right) = \lim_{n \to \infty} P(B_n) = \lim_{n \to \infty} P\left(\bigcup_{i=n}^{\infty} A_i\right) \qquad (14)$$ 

It is obvious that $A_{\inf} \subseteq A_{\sup}$, because if one singleton $\omega$ is missing from only finitely many $A_i$, it is also contained in infinitely many $A_i$. Let one singleton $\omega$ be contained in infinitely many $A_i$. How can this be expressed without using the term “infinitely”? For every n there is one $k \ge n$: $\omega \in A_k$. For every n, $\omega \in \bigcup_{i=n}^{\infty} A_i$ holds, and therefore $\omega \in \bigcap_{n=1}^{\infty} \bigcup_{i=n}^{\infty} A_i$. The reversion is obvious. 
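
The two set limits can be tried out on a small example. The sketch below is a finite-horizon approximation only: a true “limes inferior” or “limes superior” needs the whole infinite sequence, so the tails are cut off at a horizon and the outer index runs over a shorter window. The three elements "a", "b" and "c" and their membership rules are hypothetical: "a" is missing only from the first three sets (as in Figure 5), "b" is contained in the non-prime numbered sets (as in Figure 6), and "c" is contained in the first three sets only.

```python
def is_prime(k):
    return k >= 2 and all(k % d for d in range(2, int(k ** 0.5) + 1))

def build_sequence(horizon):
    """Sets A_1 .. A_horizon following the hypothetical membership rules."""
    sets = []
    for i in range(1, horizon + 1):
        members = set()
        if i >= 4:
            members.add("a")      # in almost all A_i
        if not is_prime(i):
            members.add("b")      # in infinitely many A_i, but not almost all
        if i <= 3:
            members.add("c")      # in finitely many A_i only
        sets.append(frozenset(members))
    return sets

def tail_intersection(sets, n):
    """Intersection of A_n .. A_horizon (n is 1-based)."""
    return frozenset.intersection(*sets[n - 1:])

def tail_union(sets, n):
    """Union of A_n .. A_horizon (n is 1-based), the end-tail event B_n."""
    return frozenset().union(*sets[n - 1:])

def lim_inf(sets, window):
    """Union over n = 1..window of the tail intersections, cf. eq. (12)."""
    return frozenset().union(
        *(tail_intersection(sets, n) for n in range(1, window + 1)))

def lim_sup(sets, window):
    """Intersection over n = 1..window of the tail unions, cf. eq. (13)."""
    return frozenset.intersection(
        *(tail_union(sets, n) for n in range(1, window + 1)))

sequence = build_sequence(horizon=50)
a_inf = lim_inf(sequence, window=25)
a_sup = lim_sup(sequence, window=25)
print(sorted(a_inf), sorted(a_sup))
```

In this sketch only "a" ends up in the lower limit, while the upper limit contains "a" and "b", so the lower limit is a subset of the upper one. If the window is extended up to the horizon itself, both truncated expressions collapse to the last set of the sequence, which is the simplification used for the finite record sequence in Section 3.3.3.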


As regards the application of the zero-one laws to the accident records, it is fortunate that they consist of scalar values summarizing comparable singletons. So they resemble a sequence of values $(a_n)$ which sum up a collection of $\omega_i \in \Omega$ that are assigned to a selected kind of events in the subset sequence $A_n$. For a sequence $(a_n)$, the necessary and sufficient condition for convergence of sequences can be applied (see Barner (1983)): $(a_n)$ is convergent if and only if $(a_n)$ is bounded and: 

$$\lim a_n = A_{\inf} = A_{\sup}$$ 

This is equivalent to: 

$$A_{\inf} = A_{\sup} \qquad (15)$$ 


3.3.3 Interpretation for available accident records 
The probability space from 3.1.2 is the platform for the interpretation: For the sake of simplicity, the accident records in Figure 1 can be interpreted as a finite sequence (i.e. as a partial sequence of an infinite sequence in the future), and the available records contain twenty-two random events $A_1, A_2, \ldots, A_{22}$ until the year 2015 (i.e. the available “measures” of reportable accidents in the observation frame: injuries caused by machine tools in Germany). Since the sequence is monotonically decreasing (at least during the last ten years for the overall numbers), the “limes superior” of the only 22 (instead of $\infty$) events can be thought of as the minimum number of accidents, which happened in (almost) every one of these twenty-two years. Apparently, this is the number of accidents in the last recorded year $A_{22} = 17.563$. So, the “limes superior” for the partial sequence $A_1, A_2, \ldots, A_{22}$ is the value of the lowest number of accident records (the “smallest doll” contained in all of the Matryoshka nested dolls from above, see Jaganathan (2015)), which is here equal to the last record: 

$$A_{\sup} = A_n \text{ i.o.} = \limsup A_n = \bigcap_{n=1}^{22} \bigcup_{i=n}^{22} A_i = A_{22} = 17.563 \qquad (16)$$ 


It has to be remarked that this simplification is 
only “almost mathematically correct”, because the 
intersection of the union of twenty-two different 
subsets ignores the fact that the basic populations 
of machines and operators presumably have been 
slightly different from one year to the next etc. 
However, this reservation does not alter the fact 
that the Borel-Cantelli lemmas can illustrate the 
vision “zero risk” mathematically. 


4 SUMMARY AND OUTLOOK 


Admittedly, theoretical risk assessment starts in the hypothetical “what if” domain, where theoretical risk can be estimated only logically in cause and effect (at the most), but not on an absolute scale. However, actual relative risk reduction effects between different yearly accident records can be calculated and compared quite exactly, because the real risk can actually be “measured” precisely by the DGUV records, as well as by individual records of machine tool builders. Therefore, this paper tries to support plausible risk considerations connecting theory and reality. Hopefully, it also helps to improve a common understanding of the term “probability”. Only then can the first Borel-Cantelli lemma be met, which leads the way to the fascinating vision “zero risk”, at least approaching it yearly step by step. 

Of course it is indispensable that the above-mentioned accident investigation of BGHM and DGUV is continued meticulously. VDW is offering support with this paper. This is a precondition such that a more complete Pareto diagram (than the one in Figure 12 of Moedden (2017)) can be brought about, showing potential safety gaps in current design and operational practices, which need to be mended first. Subsequently, possible upgrades in the safety design against not significant hazards can be looked at: doing the first thing and not leaving the other option out. In addition, individual manufacturers of specific product ranges can monitor their own yearly records separately, as convincingly demonstrated in Heisenberg (2013). Moreover, VDW is supporting the member companies accordingly in research projects (see Nowizki et al (2016)) in order to make German machine tools as safe as reasonably possible. 


SYMBOLS AND OTHER DEFINITIONS 


Symbol        Dim.  Meaning

α, β, γ       -     Parameters of the observation frame of acc. statistics
μ             -     Measure of a subset in a sample space Ω
σ, F          -     Indicator for certain algebra
k, k,         -     Adjustment parameter
Ω             -     Sample space
ω             -     Singleton of the sample space Ω
A, to A;      -     Accident specific subsets of the sample space Ω
A, P(A)       -     General events in the sample space Ω and their probabilities P
lim inf A_n   -     Limes inferior of a sequence A_n, also indicated as: A_n f.o.
lim sup A_n   -     Limes superior of a sequence A_n, also indicated as: A_n i.o.
R, PP, M      -     Number of accidents: reportable, pension pay., mortal
∈, ∉, ∀       -     Element of, not an element of, holds for all
ABSTRACT: If hazards arising from machine tools cannot be completely eliminated by design,
protective devices must be provided. Separating guards prevent people from accessing or entering the
danger area; in addition, they retain any parts that may have been released in the work area. An attempt
shall be made here to supplement the currently purely intuitive (qualitative) consideration of the protection
effects by a probabilistic scaling of the risk reduction effects. By scaling these effects, the Pareto principle
can be applied: achieving the best possible benefit with minimal effort. This procedure is indispensable for
machine tool manufacturers to master economic risks in global competition. Since there is no plausible
risk model in current safety standards for machine safety with which risk reduction effects can be scaled,
a simplified quantitative risk model is presented in this paper for this purpose (and two further papers are
presented at ESREL 2018).


1 INTRODUCTION 


In accordance with the guiding standard ISO
12100 (2010), risk is a function of the possible
severity of damage and its probability of occurrence,
whereby the latter can be represented as the
frequency of the damage. This is because, over a large
basic population of comparable situations, the
mathematical-theoretical “probability” of an event
becomes an empirical (countable) “relative
frequency”, which makes the event verifiable within
the totality of all possible events, see Haigh (2012),
Taschner (2013).
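
The transition from a theoretical probability to a countable relative frequency can be made tangible with a small simulation. The following is a minimal sketch only: the event probability p = 0.01, the trial counts and the function name relative_frequency are hypothetical choices for illustration and are not taken from the accident records discussed in this paper.

```python
import random


def relative_frequency(p: float, n: int, seed: int = 0) -> float:
    """Return the observed relative frequency of an event with
    probability p in n independent simulated trials."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    hits = sum(1 for _ in range(n) if rng.random() < p)
    return hits / n


if __name__ == "__main__":
    p = 0.01  # hypothetical event probability, chosen only for illustration
    for n in (100, 10_000, 200_000):
        freq = relative_frequency(p, n)
        print(f"n = {n:>7}: relative frequency = {freq:.4f}")
```

With a growing basic population n, the counted relative frequency settles near the assumed probability, which is the sense in which a probability becomes verifiable in field data.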

As a result, risk can be reduced by avoiding or
reducing the extent of damage (effect-related
measures) and/or by reducing the probability or
frequency of occurrence of the damage or of the
associated hazard (cause-related measures).
Separating protective devices (guards) play an
elementary role in both kinds of measure and also
offer a high cost-benefit advantage. They provide
significant risk reduction both for protection
against accidental access and for protection
against flying parts and substances and, in
particular, for fire protection.

As for conventional machine tools with drive
powers in the range of 100 kW and powerful
hydraulic clamping devices, it cannot be denied
that even the (very low) failure probabilities of
technologically highly reliable machine functions or
safety functions include a basic possibility of
serious damage or injury. This is demonstrated, e.g., by
the depressing fact that in Germany machine tool
operators suffer around 100 life-changing accidents
per year, and fatal accidents in the single-digit
range, see Kesselkaul & Meyer (2016).
The typical distribution of severity is also known
as the “accident pyramid” and can be displayed in
a histogram, see Moedden (2017).

How can a designer live with this frustrating
insight into the world of real probabilities, i.e. the
annually recurring serious or even fatal accidents?
One might even ask: can a machine tool be built
safely at all?

The answer is “yes”, because if the extent of
injury cannot itself be reduced, the relative
frequency of injuries, i.e. how often damage
repeats itself in a given period of time, remains
as a means of reduction. With reference to a basic
population with comparable elements, this is
equivalent to their probability.

The risk reduction of machine tools is therefore
all about reducing the “probability of hazardous
situations”, so that damage is avoided as far as
practicable. The likelihood of hazard occurrence
(see also ISO 12100 (2010), Section 5.5.2.3.2
“Occurrence of hazard events”) can be reduced
in several respects and is scalable as a parameter
O, e.g. with 0 < O < 1 (see Fig. 2 in ISO 12100,
“Risk reduction process from the viewpoint of the
designer”).

The starting point is always that technological 
machine functions (which can also become safety 
functions when the safety doors are open) already 
have very low failure probabilities. How the pos- 
sible resulting hazards can be massively reduced 
even further with the help of separating protective 
devices shall be considered here and scaled plau- 
sibly. It is not the accuracy of the post-decimal 
value that matters, but rather the power of 10, as 
is usual in risk representations, see Baruch et al 
(2011). 


2 SAFETY MEASURES IN PRODUCT 
STANDARDS 


Separating guards for risk reduction on machine
tools should be designed in such a way that they
are sufficiently capable of holding back the ejected
parts that are normally to be expected, see ISO 14120
(2002). The retention capability of a guard describes
its resistance against penetration by flying elements.
Machine-specific design conventions are
anchored in standards such as ISO 23125 (2010).
For impact energies beyond the design conventions,
e.g. due to the release of large workpieces or
clamping jaws, the probability of failure increases,
see Meister (2017).


2.1 Failure hypotheses 


Basic risk assessments have been carried out in
product standards, and corresponding safety
measures have been defined for some typical machine
configurations, such as machine tools, woodworking
machines and balancing machines. The harmonized
standards provide detailed dimensioning
recommendations for separating protective devices
in order to ensure the required retention capability.
The starting point is a typical failure hypothesis
for released elements that leads to defined masses,
velocities and the energies derived from them, as in
ISO 23125 (2010). Fragment energies that may exceed
these hypotheses are therefore further reduced by
control technology, e.g. by limiting the spindle
speed (r.p.m.), or by instruction, e.g. detailed
guidance in the operating instructions on the safe
clamping of workpieces. This ought to reduce the
risk of operator injury to a level that is
“As Low As Reasonably Practicable” (“ALARP”),
UK-HSE (1974).

The following constructional distinction is 
applied to protect against released elements in 
order to minimize the residual risk of injury: 


i. Closed door (automatic operation): The safety 
guard must have sufficient impact resistance 
against ejected parts. 

ii. Open door (setting mode, special or service 
mode): The rotational speeds of rotating parts 
are safely reduced, and unnecessary movements 
are safely stopped, so that the risk of ejected ele- 
ments is reduced. 


The application duration of operating modes 
according to ii.) should therefore be minimized 
as far as possible in order to minimize the risk. 
In the machining process, on the other hand, 
the observability of a process is sometimes 
indispensable. 


2.2 Guiding standard for safety guards 
ISO 14120 


For machine tools, the full enclosure as in Figure 1
in particular is a very effective means of risk
reduction (see below the transition in probabilities
from PFHd to PHE; this is assigned to step two of
the three-step method in Figure 2). For this
purpose, important aspects are regulated in a guiding
standard for separating protective devices,
e.g. access to hazardous areas (EN ISO 14120
(2002), Section 5.1.2), retention of ejected parts and
other impacts (EN ISO 14120, Section 5.1.3),
retention of hazardous substances (EN ISO 14120,
Section 5.1.4), retention capacity (EN ISO 14120,
Section 5.5) and corrosion resistance (EN ISO
14120, Section 5.6).

In particular, the type of construction accord- 
ing to Section 3.5.2 is widespread in machine 
tools: 


Figure 1. Automatic vertical turning centre acc. to ISO
23125 (2010) (Source: INDEX).

Figure 2. Risk reduction according to the three-step
method in the Farmer Diagram, Baruch et al. (2011).


Interlocking guard with guard locking

Guard associated with an interlocking device and
a guard locking device so that, together with the
control system of the machine, the following
functions are performed:

— the hazardous machine functions “covered” 
by the guard cannot operate until the guard is 
closed and locked, 

— the guard remains closed and locked until the 
risk due to the hazardous machine functions 
“covered” by the guard has disappeared and 

— when the guard is closed and locked, the haz- 
ardous machine functions “covered” by the 
guard can operate (the closure and locking of 
the guard do not by themselves start the
hazardous machine functions).


3 SEPARATING PROTECTIVE DEVICES 
(GUARDS) AND PROBABILISTIC 
SAFETY 


The Machinery Directive 2006/42/EC requires the
preparation of risk assessments for all life cycle
phases of a machine as proof of compliance with
the safety and health requirements in accordance with
Annex I. In point 1, it deals directly with the risk
elements “severity of possible injuries” and
“probability of their occurrence”. Since 2012, ISO 13849
(2008) has been the new guiding standard for
control safety in mechanical engineering. It
determines the safety reliability of a control chain via
the “Performance Level” (PL). A new feature is the
quantitative treatment of safety: the PL is a
theoretical characteristic value, the “average probability
of a dangerous failure per hour”, and it is given as
a PFHd value. This concept is not yet fully
compliant with the Machinery Directive 2006/42/EC and
the guiding standard for risk assessment, as
explained by Steiger (2014). This paper contributes
to a better understanding.


3.1 Intuition and better approaches 


It is hard to believe, but for the time being the
normative framework for machinery safety does not
contain a numerical risk model that plausibly
combines the single factors into an overall risk,
e.g. in order to compare theoretical results with
empirical field data. As a consequence, fictitious
hazards and actual risks are mixed up again and again,
so that in controversial discussions in the world of
safety standardization, the strongest (subjective)
gut feeling mostly trumps logical (objective)
considerations.

The controversies are mainly about numerical 
probabilistic representations. These were recently 
introduced in the field of general machinery safety, 
when the revised European Machinery Direc- 
tive 2006/42/EC extended the hazard analysis of 
former versions to a risk analysis by introducing 
the term “probability” in the expression: “estimate 
the risks, taking into account the severity of the pos- 
sible injury or damage to health and the probability 
of its occurrence”. 

Intuitively, it is immediately obvious that a
separating protective device (e.g. a full enclosure, as
shown in Figure 1) is an effective means of risk
reduction, both with regard to the probability
of occurrence and, in particular, with regard to
the extent of damage. This applies to technical
causes of failure on the one hand and to human
error on the other. It is also clear that separating
protective devices cause a considerable risk
reduction because they significantly reduce the
duration of the operator’s exposure to hazards,
not only in the case of protection against
unintentional access, but also for protection
against ejected parts and substances and especially
in the case of fire protection. This can significantly
reduce both the expected severity of an injury and,
in particular, its expected frequency.

An attempt shall be made at ESREL 2018 to 
supplement the intuitive consideration with an 
additional scaling of the risk reduction effects. 
Therefore, the key term “probability of hazard- 
ous situations” is explained in Moedden (2018). 
Therein, the scaling of risk reduction is explained 
by means of a detailed probabilistic risk model, see 
Moedden (2016). 

For full enclosures, a high probability can be 
assumed that risks from the workspace are com- 
pletely controlled in cases of failure that lie below 
the design convention (e. g. chips from machining). 

For the range of energies up to the design
convention (e.g. the chuck jaw hypothesis according to
ISO 23125 (2010), see also Mewes (1999), Ising
(2001)), the standard-compliant safety guards
contribute considerably every day to protecting the
operators. Beyond the design convention, however, the
probability of retention decreases significantly, see
Meister et al (2017). As a separating protective
device must be opened regularly, e.g. when changing
workpieces manually, during troubleshooting or when
setting up the machining process, its protective
effect is sometimes (partly) lost, too. The
consequences of technical failure and human error can
then be dramatic. This applies in particular if
automatic processing is carried out with the
protective devices “defeated” (manipulation), as
explained in Kesselkaul & Meyer (2016), Preusse (2005)
and Meyer (2017).


3.2 Inherent safety and diagnostic functions 


As far as the reduction of technical failures is
concerned, a reliable technological function is basically
the starting point (step one in Figure 2), because
the more reliable a function is, the less frequent
failures are, and failures critical to safety are rarer
still. This is because the latter are a subset of all
failure events; in any case, safety-relevant failures
cannot be more frequent than all failure events
as a whole. Obviously, non-safety-relevant and
safety-relevant failures are complementary to each
other and together form the entirety of all failures,
see Bornemann et al (2015). Diagnostic functions can
be used to anticipate some safety-relevant failure
events, so that safety-related reactions can be
carried out which lead to a (safe) standstill
of the machine in the event of an imminent critical
failure, see Moedden (2016). If this is done in time,
a potentially safety-relevant failure becomes a
non-safety-relevant failure, resulting in machine
downtime without any hazard from the motion
control of the machine tool. This applies in particular
if a full enclosure is closed when a safety-relevant
failure is detected in order to prevent access to
the working area (e.g. in the case of fire
extinguishing with CO2 flooding after detection of a
fire in the working area, ISO 23125 (2010)). After
this step two in Figure 2, only the ever-present
possibilities of stumbling, falling etc. remain as
hazards.

The entirety of all failures can be represented as
a repetition rate in defined time periods, e.g. as
“probability of failure per hour”, PFH. In
principle, the PFH value of a technological function
does not increase due to an additional diagnosis,
unless the diagnostic function is hypersensitive
and itself fails to perform correctly (as is often
the case, e.g., with smoke detectors in living
quarters).

A high reliability of a technological function
(i.e. with low PFH values) thus contributes to
the inherent safety required by the Machinery
Directive 2006/42/EC and ISO 12100 (2010) as
the first step of the three-step risk reduction
method, see Figure 2 (Legend: EF: expected
frequency, S: degree of severity). The diagnostic
function itself is not an inherent safety element,
but an additional safety measure according to step
two of the three-step method. ISO 13849-1 (2008)
uses the term “safety function” for this purpose.


3.3 Target “zero risk” is missed


With the measures described so far, an expected
frequency of safety-relevant technical failures
remains, because not all of them can be shut down
safely with diagnostic functions. Therefore, if
“zero risk” acc. to Platz (2016) cannot be achieved,
a certain repetition rate of safety-critical failures
(which can cause hazards) must be expected, e.g.
as “probability of dangerous failure per hour”,
defined as PFHd in ISO 13849-1 (2008).

A full enclosure primarily reduces exposure to
hazards. However, this is not the only risk
reduction achieved by safety guards. The controllability
of a hazardous situation can also be increased by
separating protective devices, e.g. by partial
claddings around the primary flight circle of chips, so
that the operator can either choose a safe location
himself or is compelled to take up a safe position
by the fixed location of the control panel.

A further protective effect of a full enclosure
results from the ratio of partial load operation to
full load operation. A paper at ESREL 2016
explains this important “secret of success in machine
tool safety”: the implicit error detection in
the process on the basis of a full enclosure,
Moedden (2016). Probability theory indicates that
most failures are to be expected in automatic
operation, i.e. when the safety doors are closed,
so that they occur (nearly) without hazard.

However, there is still a probability of
occurrence of a hazard event with a full enclosure in
automatic mode (index “1”): PHE1 > 0, because a
full enclosure cannot safely hold back all
conceivable failures but only those up to the respective
design convention, as already mentioned, see Ising (2001).

With correct dimensioning, however, by far the
most common of the conceivable hazards are
mastered, i.e. those within the dimensioning convention
and partly those beyond it. For example, the chuck
jaw hypothesis of ISO 23125 (2010) partially covers
workpieces rotating around the main axis
if a chuck loss leads to a symmetrical release of the
workpiece, Ising (2001).

In automatic mode, the numerical value of
PHE1 is close to PHE1 = 0 because, according to
accident records, damage caused by parts released
from the working area does occur in the basic
population of all machines acc. to Kesselkaul &
Meyer (2016), but is extremely rare in
standard-compliant machines, see Moedden (2017). Meister
et al (2017) made a first attempt to describe the
withstand probability and the probability of
failure as a probability function depending on the
translational energy of fragment impacts.

In the setting mode (index “2”), a numerical
value PHE2 > 0 must be assumed for the risk
PHE2, since, according to accident statistics, most
of the damage in the basic population of all
machine tools occurs when the safety doors are open. The
same also applies to all other operating modes with
open safety doors, in particular to
“manipulation”, Kesselkaul & Meyer (2016), Preusse (2005).


3.4 The “accident pyramid” at the boundary 
of a “zero risk” 

According to Moedden (2018), the equation of
technical risk reduction for the above-mentioned
cases one and two in the interval t_a < t < t_b
is (eq. 1):

\[
PHE_{1,2} = (1 - C_{1,2}) \cdot (t \cdot f)_{\mathrm{exp},1,2} \cdot \int_{t_a}^{t_b} PFH_{d,\mathrm{tech},1,2} \, dt \cdot O_{1,2} \tag{1}
\]


Eq. (1) contains all risk reduction principles from
the three-step method as illustrated in Figure 2.

Nevertheless, PHE > 0 remains as a finding,
i.e. even with very reliable controls (ΣPFHd ≈ 0)
and standard-compliant full enclosures,
residual hazards remain. But does PHE > 0 really
mean that there is a corresponding percentage of
severe damage in the basic population? The answer is
“no”, because the probability of a hazard must still
be associated with the degree of severity of an injury.

Furthermore, in the operational environment,
there is a supplementary indirect, intuitive way of
avoiding injuries. It is a kind of “statistical
warning principle”, based on the so-called “accident
pyramid”, see Manuele (2011). Considered as
relative frequencies in probability theory against the
background of a clear cause-and-effect relation, this
means that, before a serious accident occurs, a
lot of “near-accidents” occur first, then the less
frequent accidents with real injuries, from slight
and reversible through severe to fatal. The severity
of an injury needs to be connected to the risk
reduction equation, eq. (1); it can be roughly
estimated with the “accident pyramid”.

As “near-accidents” happen comparatively
more often than real accidents, the critical
situations in the operational area should be largely
known. The warning effect takes place only against an
averaged statistical background, because a
“near-accident” takes place many times before it comes
to an injury, and even more times before it comes
to a serious or fatal injury. This means that
experienced operators can also protect themselves
intuitively by being and remaining particularly vigilant
in critical areas. Inexperienced operators should
therefore pay close attention to the warnings of
their experienced colleagues.


3.5 Comparison of machine tool and robot


In order to illustrate the probabilistic risk model in
Eq. (1), a calculation example shall be used to
compare an automatic machine tool with full
enclosure, as shown in Fig. 1, with a collaborative robot
without protective housing (using optical and/or
tactile proximity sensors instead), as in Fig. 3.

In order to make the comparison easier to
understand, the estimated average values in Table 1
are compiled on the basis of “risk snapshot
parameters” of typical activities; these can, of course, be
adjusted on a case-by-case basis, e.g. on the basis of
evaluated video records. For statistical purposes, they
can be collected in a histogram, showing the
bandwidth and share of the parameters.

A manual workpiece change on the machine
tool, which is assumed to last 3 minutes and to be
repeated 11 times per hour, thus means 33 minutes
of possible hazard exposure per hour. The robot is a
support device for collaborative manual activities
as shown in Figure 3, which are assumed to last
1.8 minutes and to be repeated 25 times per hour,
i.e. 45 minutes of hazard exposure per hour.
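
The exposure figures above follow from multiplying the duration of one activity by its repetitions per hour. As a minimal sketch of this arithmetic (the function names are hypothetical and only serve the illustration):

```python
def exposure_minutes_per_hour(duration_min: float, repetitions_per_hour: float) -> float:
    """Minutes per hour during which the operator is exposed to the hazard."""
    return duration_min * repetitions_per_hour


def exposure_fraction(duration_min: float, repetitions_per_hour: float) -> float:
    """Share of each hour spent exposed (dimensionless, between 0 and 1)."""
    return exposure_minutes_per_hour(duration_min, repetitions_per_hour) / 60.0


if __name__ == "__main__":
    # Values quoted in the text: machine tool (MT) and robot (ROB)
    for name, duration, reps in (("MT", 3.0, 11), ("ROB", 1.8, 25)):
        minutes = exposure_minutes_per_hour(duration, reps)
        share = exposure_fraction(duration, reps)
        print(f"{name}: {minutes:.0f} min per hour, share of time {share:.2f}")
```

Expressed in hours, the same products are 0.05 h · 11 = 0.55 for the machine tool and 0.03 h · 25 = 0.75 for the robot.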

Using an 8-hour shift, the risk reduction of
a machine tool (MT) is compared with that of
a robot (ROB). The safety design of the latter
depends almost entirely on reliable control
technology, e.g. in the case of a safety function “safe
operational stop”; three superimposed movements
are assumed to be stopped safely. The safety
concept of an automatic machine tool, on the other
hand, depends heavily on a standard-compliant,
fully enclosed work area and on the presumption
that it is used as intended. For the sake of
simplicity, one safety function is considered at a time:
for the robot, the above-mentioned “Safe Operational
Stop” (SOS), whose associated hazard can be an
impact to the operator’s head with potentially
serious injuries; and for the machine tool, a “Safe
Clamping of the Workpiece” (SCW). The associated
hazard during manual activities with open guards
could be a loss of the workpiece due to gravity,
with a hazard of crushing the operator’s hands in
the work area and possibly serious injuries.

Figure 3. Coll. robot without protective enclosure, see
Kentsch (2016).

Table 1. “Risk snapshots” of a) machine tool with full
enclosure, and b) collaborating robot without protective
housing.

Object / action /          ΣPFHd,tech     tb − ta   O     t_exp   f_exp    C     PHE
safety function            [1/h]          [h]       [-]   [h]     [1/h]    [-]   [-]

a) MT / manual workpiece   6.5 · 10⁻⁶     8         0.5   0.05    11       0.5   7.15 · 10⁻⁶
   change / SCW
b) ROB / manual support /  3 × 5 · 10⁻⁷   8         1     0.03    25       0.1   8.1 · 10⁻⁶
   SOS

In the right-hand column of Table 1, the prob- 
ability of occurrence of a corresponding hazard 
based on the input parameters acc. to Eq. (1) is 
estimated. 

The simple example in Table 1 shows that: 


a. A safety function (SF) with PL = b (i.e. PFHd
in the range of 3 · 10⁻⁶ h⁻¹ to 1 · 10⁻⁵ h⁻¹, here
averaged: 6.5 · 10⁻⁶ h⁻¹) is active as a
technological work function within a full enclosure
of a machine tool, but becomes a safety function
when the safety doors are open. If the SF
fails, the occurrence probability is assumed to
be O = 0.5 and the avoidability to be C = 0.5.
Concerning the risk, it can be equivalent to:

b. A safety function with PL = d (i.e. PFHd
between 1 · 10⁻⁷ h⁻¹ and 1 · 10⁻⁶ h⁻¹, here averaged:
5 · 10⁻⁷ h⁻¹, for three superimposed movements)
of a robot without a full enclosure. Such safety
functions are used without full enclosure for
collaborating robots according to ISO 10218
[27]. If the SF fails, the occurrence probability
is assumed to be O = 1 and the avoidability to be
C = 0.1.
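
The averaged PFHd values in items a. and b. can be checked against the band limits quoted there. The sketch below assumes that “averaged” means the arithmetic mean of the two band limits; the dictionary and function names are hypothetical. Under that assumption the PL = b band yields exactly the 6.5 · 10⁻⁶ h⁻¹ used in the text, while the PL = d band yields 5.5 · 10⁻⁷ h⁻¹, which the text appears to round to 5 · 10⁻⁷ h⁻¹.

```python
# PFHd bands [1/h] as quoted in the text for PL = b and PL = d
PFHD_BAND = {
    "b": (3e-6, 1e-5),
    "d": (1e-7, 1e-6),
}


def mean_pfhd(performance_level: str) -> float:
    """Arithmetic mean of the band limits (an assumed reading of 'averaged')."""
    low, high = PFHD_BAND[performance_level]
    return (low + high) / 2.0


def summed_pfhd(pfhd_single: float, n_functions: int) -> float:
    """Sum of n identical dangerous failure rates, e.g. for
    several superimposed movements that must all be stopped."""
    return pfhd_single * n_functions


if __name__ == "__main__":
    print(f"PL b, mean of band: {mean_pfhd('b'):.2e} 1/h")
    print(f"PL d, mean of band: {mean_pfhd('d'):.2e} 1/h")
    # Three superimposed robot movements with the value used in the text
    print(f"ROB, 3 x 5e-7:      {summed_pfhd(5e-7, 3):.2e} 1/h")
```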


Corresponding to the risk model in eq. (1), for
the first case in Table 1 an occurrence probability
of a hazard of PHE_MT = 7.15 · 10⁻⁶ per 8-hour
shift can be derived. For the second case, the
occurrence probability of a hazard is estimated as
PHE_ROB = 8.1 · 10⁻⁶.
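
The two results can be reproduced numerically. The following sketch reads Eq. (1) as the product (1 − C) · O · (t_exp · f_exp) · PFHd · (t_b − t_a), i.e. with a failure rate that is constant over the shift so that the integral reduces to a multiplication. This reading, the class name RiskSnapshot and the field names are assumptions made for illustration; the input values are those stated in the text for the machine tool and the robot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskSnapshot:
    pfhd_per_hour: float   # summed dangerous failure rate of the function [1/h]
    interval_hours: float  # observation interval t_b - t_a [h]
    occurrence: float      # O, probability that a failure leads to the hazard
    t_exp_hours: float     # duration of one exposure [h]
    f_exp_per_hour: float  # exposures per hour [1/h]
    avoidability: float    # C, probability that the operator avoids the harm


def hazard_probability(s: RiskSnapshot) -> float:
    """Probability of occurrence of a hazard event (PHE) over the interval,
    assuming a constant failure rate (integral = rate * interval)."""
    failure_probability = s.pfhd_per_hour * s.interval_hours
    exposure_share = s.t_exp_hours * s.f_exp_per_hour
    return (1.0 - s.avoidability) * s.occurrence * exposure_share * failure_probability


# Machine tool, safe clamping of the workpiece (PL = b)
MT = RiskSnapshot(6.5e-6, 8.0, 0.5, 0.05, 11.0, 0.5)
# Robot, safe operational stop, three superimposed movements (PL = d)
ROB = RiskSnapshot(3 * 5e-7, 8.0, 1.0, 0.03, 25.0, 0.1)

if __name__ == "__main__":
    print(f"PHE_MT  = {hazard_probability(MT):.3e}")
    print(f"PHE_ROB = {hazard_probability(ROB):.3e}")
```

Both products agree with the values stated above (7.15 · 10⁻⁶ and 8.1 · 10⁻⁶ per 8-hour shift), which supports this reading of the equation without proving it.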

For the robot, this means that the probability of
a hazard occurrence is largely comparable with
that of a fully enclosed machine tool, despite the
higher reliability of the safety function (PL = d).
The reason why the machine tool achieves a safety
level comparable to a robot (despite a
comparatively low PL = b) is that, in the event of a control
failure, the safety guard on the machine tool can still
protect the operator (which is not the case for the
robot). This comparison demonstrates the significant
risk reduction effect of a full enclosure!

This protective effect of a full enclosure is not
sufficiently taken into account in ISO 13849-1
(2008), especially in the derivation of the
recommended performance level (PLr). This is currently
the reason why the technological clamping functions
of machine tools, which can also be used as safety
functions when safety guards are open, are
underestimated (allegedly max. PL = b, c according to
(2014), this standard is not correctly based on the 
three-step method of risk reduction in ISO 12100 
(2010). The greatly simplified approaches of ISO 
13849-1 (2008) may be suitable for the situation 
on robots, but not for the complex interactions 
involved in risk reduction of safety guards of 
machine tools. 


3.6 Best practice at the EMO safety day 2017 


The actual implementation of the three-step 
method described above depends on the type of 
machine. However, there are also cross-cutting 
issues, such as the operating modes of a machine, 
see Steger (2017), and the equipment for fire 
protection, see ISO 19353 (2016). Therefore, the 
machine-specific measures are summarized in 
product safety standards (here the type C standard 
EN ISO 23125 (2010)). 

At EMO Safety Day 2017, field data evaluations
of more than 93,000,000 operating hours of
turning machines (Figure 1) without accident were
presented. These results and the subsequent
discussion, presented by Nowizki (2016), proved that
the provisions in type C standards can be applied
successfully when a sufficiently dimensioned full
enclosure is available.
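
To indicate what such accident-free field data can say about a rate, the following sketch computes a one-sided upper confidence bound for zero observed events. This is a standard statistical technique brought in here for illustration and is not the evaluation method of the cited studies; it assumes a constant event rate, independent operating hours (a Poisson model) and a 95 % confidence level, and the function name is hypothetical.

```python
import math


def zero_failure_upper_bound(hours: float, confidence: float = 0.95) -> float:
    """One-sided upper confidence bound on a constant event rate [1/h]
    after observing zero events in the given number of operating hours.

    Under a Poisson model, P(no event) = exp(-rate * hours); the bound is
    the rate at which this probability drops to 1 - confidence.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    return -math.log(1.0 - confidence) / hours


if __name__ == "__main__":
    bound = zero_failure_upper_bound(93_000_000)
    print(f"Upper 95 % bound on the accident rate: {bound:.2e} per hour")
```

Under these assumptions, 93,000,000 accident-free hours are compatible with an accident rate of at most about 3.2 · 10⁻⁸ per operating hour.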

The “secret of success” in reducing accidents
lies, as explained above, to a large extent in the
massive risk reduction effects of combining a)
fixed and moveable guards for the full enclosure
of stationary machines (comparable to a
driver’s cab in mobile work machines), and b) fault
detection in the process, explicitly and
implicitly, Moedden (2016). These two factors together
result in a kind of “systematic safety integrity” at
the machine level. In doing so, the very high
availability of machine tools should never be ignored
because this contributes to a high level of
inherent safety.

The proven-in-use studies of the VDW and
Nowizki (2016) thus confirm that the theoretically
possible probabilities of hazards, as estimated in
Table 1, can actually be significantly reduced by
suitable measures, e.g. the protective effect of a
properly designed full enclosure. The full
enclosure as shown in Figure 1 provides a practically
proven state of the art which has been defined in
product standards for more than ten years on a
kind of “macroscopic” level (i.e. thicknesses of
protective doors and windows are actually visible
and measurable). The safety measures in it
follow the three-step reduction method of ISO 12100
(2010) and create a significant probabilistic
distance between the failure probability (e.g. of a
component) and the occurrence probability of
the corresponding hazards.

On the other hand, the discussion about control
failures takes place merely on a kind of
“microscopic” level, where the PFHd values are invisible
and not measurable. However, they must be brought
to the “macroscopic” level, as suggested in eq. (1),
in order to connect plausibly to the probability of
hazards occurring.


4 LACK OF MAINTENANCE 


Another important aspect of operational
safety is maintenance in the plants. The
ageing of sight protection windows has been known
for more than ten years, see Wirz et al (2002),
Duchstein (2010), Mewes et al (1999), Wahrlich
et al (1999), Mewes (2003) and Uhlmann et al
(2012), and the countermeasures are anchored
in standards such as ISO 23125 (2010), see also
Adler (2013), Mewes (2001). Nevertheless, the
criteria for replacing embrittled or damaged
protection windows (the most important
component here is polycarbonate) are still being
ignored by some persons in charge. This can pose
a considerable risk to machine tool operators,
who may not be sufficiently protected against
fragments of material released from the work
area.

This item also includes preventive maintenance,
which is intensively marketed for components that
are relevant to machine availability and
machining quality. A great deal of effort is made to
ensure that replacement and maintenance are carried
out within the specified time intervals so that
availability and quality do not suffer.


Unfortunately, in contrast to this, the com- 
ponent “safety window” which directly affects 
the safety of the operator is often seen as a 
“cost driver”. Can availability and quality be 
more important than the safety of the operators 
who ultimately make availability and quality 
possible by exposing themselves to the above 
restrictions? 


5 SUMMARY AND OUTLOOK 


Since 1996, the Institute for Machine Tools and
Factory Management (IWF) at the Technische
Universität Berlin, in cooperation with VDW,
has used impact and ageing tests to define design
recommendations for separating protective devices
(i.e. fixed and moveable guards with vision
panels) and to supplement existing knowledge. Both
this database and the experimental facilities
in Berlin are used in many other
industries for the dimensioning of guards in order
to close gaps in the existing standardization. In
addition, current research projects have
considerably expanded the experimental equipment at the
IWF in Berlin, see Prasol et al (2017), Meister et
al (2014). Close scientific relations also exist with
comparable research institutes in Italy, see Landi
et al (2017), and Japan, see Yui (2017). The
long-standing cooperation of VDW and IWF has led
to the German machine tool industry having
a high level of safety, as proven by Nowizki et
al (2016). This can be seen above all in
the low number of accidents resulting from
machine-related faults, see Kesselkaul & Meyer
(2016), Moedden (2017).

Nevertheless, there is still a need for further
research and knowledge transfer. On the one
hand, this is aimed at machine tool users,
since they are the innovation drivers and define
the requirements for the manufacturers. On the
other hand, machine tool manufacturers are
a further target group, as
they should also be made more aware of this issue.
In particular, it is important to convey that safety
does not in principle go hand in hand with an
increase in costs. As an example, reference is made to
the dimensioning of separating guards on grinding
machines, see Adler et al (2017). In the
marketing of safe machine tools, it would be
possible to use safety as a “Unique Selling Proposition
(USP)”, similar to the “Blue Competence”
label, in order to stand out more strongly
from competitors. This has long been practiced
successfully in the automobile industry,
with reference to the different motivations of consumers
and also of people who purchase industrial
goods, see Duchstein (2011).
This paper is also an appeal for a professional
use of the term “probability”. An unprofessional
application of the term leads to arbitrary
interpretation, which is not an option in effective
safety engineering. The aim is always to reduce the
number of accidents that are recorded annually by
the strong effect of the “Pareto principle”. These
represent the sum of all probabilities of occurrence
of hazards, with reference to a yearly period
for the respective basic population. In Moedden
(2017) the situation for machine tools in Germany
is explained in detail; a transfer to other machine
types and other countries is possible.


6 CONCLUSION 


The current safety standards are difficult to
understand and partly contradictory, especially
when it comes to the use of the term “probability”.
Although, taken together, they contain a
vast number of obligations, there is no common
risk model in them. For effective risk reduction,
however, a logically structured risk model that is
also probabilistically scalable is
indispensable. For this purpose, further proposals
are presented here, in particular for a quantification
of the significant risk reduction effect of
full enclosures. The proposals also aim to relate
two representations that were previously not
logically connected: a) a dimensionless
probability (e.g. of the Machinery Directive
and ISO 12100 (2010)), and b) a probability
per time unit, in the context of control functions
often indicated as the PFHd value (e.g. in ISO
13849-1 (2008)).
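
The link between a probability per time unit and a dimensionless probability can be illustrated with a minimal sketch. It assumes a constant event rate per hour, which is an assumption made here for illustration and not necessarily the model behind Eq. (1); the function name and the example values are hypothetical.

```python
import math


def probability_over_period(rate_per_hour: float, hours: float) -> float:
    """Dimensionless probability of at least one event within `hours`.

    Assumes a constant event rate per hour (illustrative assumption).
    """
    return 1.0 - math.exp(-rate_per_hour * hours)


# Hypothetical example: a rate of 1e-6 per hour over one year (8760 h).
p_year = probability_over_period(1e-6, 8760)

# For small rate * hours the result stays close to the simple product.
approx = 1e-6 * 8760
```

For small products of rate and duration the two values nearly coincide, which is why a per-hour value is often multiplied by a mission time to obtain a dimensionless figure.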

Theoretically and empirically, it could be 
demonstrated that fixed and moveable guards 
make a significant contribution to reducing 
the probability of occurrence of hazard events. 
However, they cannot completely reduce the 
probability of hazard occurrence to zero for two 
reasons: 


1. Fixed and moveable guards need to be opened 
for certain activities, see Moedden (2018). 

2. In the closed state, they have only a defined 
retention capacity against released parts in the 
working area, see Meister (2017). 


In order not to lose the protective effect of separating
protective devices, two appeals to the operators
remain important:


1. Stay within the boundaries of the intended use of the
machines, so that manipulation of full enclosures
does not happen.

2. Preventive maintenance of embrittled polycarbonate
vision panels must be carried out
carefully.


SYMBOLS AND OTHER DEFINITIONS 


Symbol       Dim.   Meaning

PFHd,tech    h^-1   Acronym stands for “probability of (dangerous)
                    technical failure per hour”. It is shown in [11]
                    that this is an average density function.

PHE          h^-1   Acronym stands for “probability of a hazardous
                    event (technical and human sources)”. In Eq. (1)
                    the association with the PFHd,tech value is shown.

O            -      Parameter: probability of occurrence of hazard
                    events

t_exp        sec    Exposure time, expressed also as an interval:
                    t1 < t < t2

f_exp        h^-1   Repetition rate of an exposure time

F_ratio      -      Relative share of exposure

A, C         -      Avoidability, controllability
Analysis of fatal fires in Norway over a decade: a retrospective observational study


C. Sesseng, K. Storesund & A. Steen-Hansen 
RISE Fire Research AS, Trondheim, Norway 


ABSTRACT:  Five hundred and seventy-one fatalities were registered in the official fire statistics in
Norway between 2005 and 2014. However, little is known about the victims. This study collected information
from several sources to build a holistic database and gain more knowledge about the technical and social
aspects of the incidents, forming a basis for more targeted measures. Human behaviour greatly affects
the risk of fire, which is why social aspects of incidents should be considered when identifying
risk factors associated with the victims. The results showed a clear distinction between victims above and
below the age of 67 with respect to risk factors. For the elderly, reduced mobility, impaired cognitive ability,
mental disorders and smoking were observed risk factors. For the younger victims, known substance
abuse, mental illness, influence of alcohol and smoking were observed, mostly in combination. This shows
that fire is a social problem that should be prevented through customised measures.


1 INTRODUCTION 


In this study, information from fire statistics of the 
Norwegian Directorate of Civil Protection (DSB) 
and other sources has been analysed to gain more 
detailed knowledge than before about who dies in 
fires and why. This will help to implement more 
targeted measures in order to reduce the number 
of people perishing in fires. Results from previous
studies in Norway and other countries have provided
important data for designing the study, for the
interpretation of results, and for comparisons of the
evolution of fatal fires over time. Demographic and
cultural differences over time and between countries
will, among other things, affect attitudes and
risk behaviour concerning fires. Variations in the
use of various preventive measures will
also be reflected in statistics on fatal fires. Factors
that may affect the proportion of registered
victims in fatal fires are the methods of data collection,
different definitions of fire fatalities, the
extent to which fires are investigated, the way in
which fires are registered, and the thoroughness with
which the consequences of a fire are followed up
afterwards.


2 METHOD AND ANALYSES 


2.1 Data collection 


2.1.1 Sample 
This study surveys persons having died in fires 
in Norway during the time period of 2005-2014. 


According to DSB’s fire statistics 517 fatal fires 
with 571 fatalities occurred during this period. 


2.1.2 Sources 

Our data material consisted of DSB’s fire statistics,
police investigation reports, the fire victims’ medical
records, and the Norwegian Cause of Death
Registry (NCoDR) from the Norwegian Institute
of Public Health. Generally, the police investigation
reports also include post-mortem reports.

In order to obtain access to these sources, 
authorisations were applied for and granted from 
several authorities. 

DSB’s statistics provided the basis for which 
police investigation reports we requested access to. 
Hence, cases not listed in these statistics are not 
included in the study. 

By means of the police reports the fatalities were
identified by name and national identity number.
This formed the basis for which medical records
we asked to access, and for which persons we
requested data from NCoDR. The medical records
are not accessible through a common register;
each medical record resides with the doctor who
was general practitioner/family doctor at the time
of death. This meant that the Norwegian Directorate
of Health (NDH) needed to provide us with a
key to connect each fatality to the GP/family doctor
at the time of death.


2.1.3 Data registration and categorization 
To register data from the various sources, a database
was set up to handle the fire data and to
link data on fire victims. DSB’s fire statistics were
imported directly into the database, and were used
as a basis for the requests to access investigation
reports sent to the police districts. Additionally,
variables to handle relevant data were added.

To ensure consistent extraction and storage of
data from the various cases between project collaborators,
an electronic form was prepared which
was completed for each incident. Some of the input
fields were pre-defined multiple-choice categories, others were
free text. This made interpretations more
objective, and it became easier to quantify qualitative
data.

Data were also extracted from NCoDR, which contained
information regarding cause of death.
Of all the identified persons there were
29 with an incomplete ID-number (lack of information
in the police reports). Sixteen of these were
found in NCoDR by searching for their name. The
remaining 13 persons were foreigners without a
Norwegian ID-number or D-number (ID number
for temporary residents in Norway) who had died
in Norway. The extraction from NCoDR was also
imported into our database and linked to each separate
person. Of the NCoDR data we got access
to, the categories underlying cause of death and
injury code were of greatest interest.
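
The linkage described above, with the identity number as the primary key and a name search as fallback, can be sketched as follows. This is a minimal illustration and not the project's actual database logic; all field names and records are hypothetical.

```python
from typing import Optional


def link_victim(victim: dict, by_id: dict, by_name: dict) -> Optional[dict]:
    """Return the registry record for a victim, or None if nothing matches."""
    id_number = victim.get("id_number")
    if id_number and id_number in by_id:
        return by_id[id_number]
    # Fallback when the identity number is missing or incomplete.
    return by_name.get(victim.get("name"))


# Hypothetical registry records, for illustration only.
registry = [
    {"id_number": "A1", "name": "Person One", "cause": "code-1"},
    {"id_number": "B2", "name": "Person Two", "cause": "code-2"},
]
by_id = {record["id_number"]: record for record in registry}
by_name = {record["name"]: record for record in registry}

matched = link_victim({"id_number": None, "name": "Person Two"}, by_id, by_name)
unmatched = link_victim({"id_number": None, "name": "Person Three"}, by_id, by_name)
```

A name-only match can be ambiguous in a real register, so such matches would normally be checked against further details such as date of birth.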


2.2 Statistical analysis 


All data registered in the database was exported to 
the statistics program Statistica, version 12 (Dell 
Inc 2015). 

Statistical tests were conducted to examine
apparent differences between sub-groups in the
population. Mainly non-parametric tests
were employed, such as the Mann-Whitney U-test,
Fisher's exact test, the chi-square test and regression
analysis. For all analyses a significance limit of p
< 0.05 was employed. A p-value in the range 0.10 ≥
p > 0.05 is considered a trend.
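
The decision rule can be written down as a small helper. This is a sketch that reads the trend band as 0.05 < p ≤ 0.10; the function name and the labels are hypothetical.

```python
def classify_p_value(p: float) -> str:
    """Label a p-value: significance at p < 0.05, trend for 0.05 < p <= 0.10."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.05:
        return "significant"
    if 0.05 < p <= 0.10:
        return "trend"
    return "not significant"


# p-values of the kind reported in the results section.
labels = [classify_p_value(p) for p in (0.147, 0.052, 0.0004)]
```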

Multiple Correspondence Analysis was carried 
out in order to identify whether the various char- 
acteristics of the victims frequently occurred simul- 
taneously. This includes e.g. examining whether 
victims with an established substance abuse also 
tend to be smokers. This method cannot quantify 
any similarities or dissimilarities, but it may pro- 
vide a qualitative impression and give a basis for 
further analyses. 


3 RESULTS 


3.1 Registration of fatalities 


DSB’s fire statistics contain some data on the fire 
itself, as well as the number of injured persons and 
the gender and age of the fire casualties. Requests


for access to police reports were made for all these 
fires, and we received 347 police reports (68% 
response) that were reviewed. Data from these 
reports were combined with data from DSB’s sta- 
tistics. The cases were geographically dispersed 
across the entire country, although we did not 
receive any police reports for three out of 19 coun- 
ties in Norway. 

From the received police reports 387 fatalities 
were registered. The vast majority of these were 
identified through their ID-number, and a request
for access to the medical records was sent to the 
respective GP/family doctor at the time of death. 
A small number lacked ID-number data and were 
consequently excluded from the analysis of medical 
records. There were also a number of cases where 
only the name and date of birth were stated, which 
often sufficed to find the medical record of the per- 
son in question. We received 248 medical records, 
which represent 64% of the identified victims. 


3.2 The fatalities 


3.2.1 Sample description 

Table 1 shows the distribution of various charac- 
teristics of the fatalities in fires during the 2005- 
2014 period. 

The category «non-native speaker»
is meant to include persons who are assumed
not to communicate adequately in a Nordic language
or English, and with whom it will be difficult
to communicate relevant information about
fire safety. According to a rough estimate based on
Norwegian statistics on immigration, there were
5-6% non-native speakers (to a varying extent) in
Norway in 2014 (Sesseng et al., 2017).


3.2.2 Age and gender 

The persons who perished were, for the most part, well
into adulthood. Half of the fatalities were
between 44 and 78 years old, see Figure 1 showing the
age distribution of the fatalities.

When taking the number of people in each age
group into consideration, one sees an almost
exponential relationship between age and the
number of fire fatalities, see Figure 2.


Table 1. Sample description.

Gender              Male     Female                        n
                    56.1%    43.9%                         387

Age [years]         Median   Interquartile range  Min-Max  n
                    59       44-78                1-97     386

Non-native speaker  No       Yes                  Unknown  n
                    88.6%    7.8%                 3.6%     387
Table 2. Registered risk factors related to the fatalities.
The category «unknown» means that the medical records
hold no data. Light grey cells mark a high proportion of
observations of risk factor for persons above the age of
67, while dark grey cells mark the same for persons below
the age of 67.

Vision                        Normal   Visually impaired  Blind        n
All                           86.8%    13.2%              0.0%         257
<67 years                     92.4%    7.6%               0.0%         145
≥67 years                     79.5%    20.5%              0.0%         112

Hearing                       Normal   Hearing impaired   Deaf         n
All                           89.9%    10.1%              0.0%         257
<67 years                     95.2%    4.8%               0.0%         145
≥67 years                     83.0%    17.0%              0.0%         112

Reduced mobility              Normal   Reduced            Immobile     n
All                           69.2%    27.4%              3.4%         266
<67 years                     84.6%    12.1%              3.4%         149
≥67 years                     49.6%    47.0%              3.4%         117

Impaired cognitive abilities  No       Yes                Unknown      n
All                           16.0%    18.7%              65.3%        262
<67 years                     20.7%    7.6%               71.7%        145
≥67 years                     10.3%    32.5%              57.3%        117

Known substance abuse         No       Yes                Unknown      n
All                           9.1%     36.5%              54.4%        263
<67 years                     9.2%     [illegible]        [illegible]  152
≥67 years                     9.0%     25.2%              64.8%        111

Mental illness                No       Yes                Unknown      n
All                           6.5%     44.3%              49.2%        262
<67 years                     6.6%     [illegible]        [illegible]  151
≥67 years                     6.3%     34.2%              59.5%        111

Alcoholic influence           No       Yes                Unknown      n
All                           38.9%    41.2%              19.9%        386
<67 years                     28.4%    [illegible]        12.7%        229
≥67 years                     54.1%    [illegible]        [illegible]  157
Women                         50.0%    27.6%              22.4%        170
Men                           30.1%    51.9%              18.1%        216

Smoker                        No       Yes                Unknown      n
All                           9.3%     34.6%              56.1%        387
<67 years                     6.6%     [illegible]        57.6%        229
≥67 years                     13.4%    [illegible]        [illegible]  157


Figure 1. Age distributed on gender for fatalities,
n = 386.

Figure 2. The graph shows the ratio between number of
fatalities and number of inhabitants in each age group for
fatal fires during the 2005-2014 period. Legend: fatalities
per 100 000 inhabitants in age group; exponential fit.


During the 2005-2014 period 56% of fatalities
were men. Men are overrepresented in all age
groups under 70 years, also when taking into account the
gender distribution in these age groups.

According to Statistics Norway, in 2007 there
were about as many men as women around the
age of 63 in Norway. After this age the proportion
of women increased, and with age the surplus
of women continued to rise. Among persons
over the age of 80 around two thirds were women,
and among those over 90 years three fourths were
women (Falnes-Dalheim & Slaastad 2007).

After the age of 80, more women than men perish
in fires (approx. 60% of the fatalities). However,
taking the gender distribution in the age groups into
consideration there is an equal risk for both genders
(Statistics Norway 2017).


3.2.3 Risk factors

Table 2 shows the registered risk factors linked to
the fire fatalities in our selection. The results show
that the majority of the victims had normal vision
as well as hearing. On the other hand, we see that
half of the victims at pension age had reduced
mobility. Further, one third of the same age group
had impaired cognitive abilities, and an equally
large share suffered from mental illness. For the
younger age group we see that half of the victims
had known substance abuse. Equally
many suffered from mental illness, and a similar
proportion was under the influence of alcohol during
the fire.

When comparing the fatalities confirmed to have been
under the influence of alcohol with those
not under the influence at the time of death,
regardless of age, we find no difference as to the
time of day they perished when dividing the day
into four 6-hour periods (00-06, 06-12, 12-18 and
18-24) (p = 0.147). Even though there is no statistically
significant difference between morning/
day and night (06-18 and 18-06), there is a trend
suggesting that more fatalities were intoxicated at
night time (p = 0.052).

There is a significant difference between women
and men when it comes to being under the influence
of alcohol during the fire (p < 0.001). When
disregarding the cases where it was not possible
either to prove or to disprove that there was alcohol
in body fluids, one sees that around two thirds of
all male fatalities were under the influence of alcohol.
The opposite is the case for women, where one
third was under the influence of alcohol.
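
The shares quoted above can be reproduced from the rows for women and men in Table 2. The sketch below rebuilds approximate counts from the rounded percentages (an assumption, since the exact counts are not given) and computes a Pearson chi-square statistic without continuity correction. Which test the reported p-value stems from is not stated, so the statistic is purely illustrative.

```python
def share_excluding_unknown(yes: int, no: int) -> float:
    """Share under the influence among cases with a known status."""
    return yes / (yes + no)


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Approximate counts rebuilt from rounded percentages (assumption):
# women n = 170 (27.6% yes, 50.0% no), men n = 216 (51.9% yes, 30.1% no).
women_yes, women_no = 47, 85
men_yes, men_no = 112, 65

women_share = share_excluding_unknown(women_yes, women_no)
men_share = share_excluding_unknown(men_yes, men_no)
chi2 = chi_square_2x2(women_yes, women_no, men_yes, men_no)
```

With one degree of freedom the 5% critical value of the chi-square distribution is about 3.84, and the statistic obtained from these approximate counts lies well above it.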

There is no significant difference in the distribu- 
tion of the cause of fires in fires where the victim 
was under alcoholic influence compared with fires 
where the victims were not under alcoholic influ- 
ence (p = 0.433, the rarest categories were excluded 
from the analysis). 

To conduct a Multiple Correspondence Analysis
the population was divided into two sub-groups:
persons with age < 67 years, and age ≥ 67 years
(Norwegian retirement age). Further, analyses were
made of factors representing somatic ailments and 
factors relating to intoxication, psychiatry and life- 
style. The reason why psychiatry is placed in the 
same category as substance abuse, smoking and 
alcoholic influence during the fire, is because psy- 
chiatric cases may be triggered by drug abuse, and 
therefore drug abuse and psychiatry occur simulta- 
neously in many cases. 

For the group with age ≥ 67 years the material
shows that persons with mental illness are also frequently
smokers. Moreover, we have seen that
being under the influence of alcohol during a fire 
does not appear systematically in combination with 
other factors (no pattern). As regards somatic ail- 
ments we do not see any evident patterns. This may 
signify that combinations of e.g. reduced mobility 
and vision impairment are not overrepresented 


in the fatal fire statistics for this group. Neverthe- 
less, 61% of this group have either impaired vision, 
hearing, or mobility, which may have contributed 
to the death. 

For the group of fatalities below the age of 67 our 
data material shows that factors «known substance 
abuse», «alcoholic influence during fire», «mental 
illness» and «smoking» often occur together. Only 
13% of the fatalities in this age group had none 
of the mentioned factors, while the majority (66%) 
had various combinations of two or more factors. 

As concerns somatic functional impairment
no pattern is identified, except for the fact that the
factors normal mobility, normal vision, and non-reduced
cognitive ability often occur together.


3.2.4 Cause of death 
Figure 3 shows the distribution of cause of death 
for the fatalities in the sample. Asphyxiation is the 
chief cause (57%), followed by burns (15%). In 
addition to that, a combination of asphyxiation 
and burns was concluded in 10% of cases. In 13% 
of the cases the cause of death is unknown, which
may be attributed to the fact that the victim was so
heavily burnt that it was difficult to carry out an
examination or draw any conclusions.

Information about the underlying incident and 
causes of death was collected from NCoDR. The 
underlying incident says something about the inci- 
dent leading to the person’s death (e.g. exposure to 
open flame), while the causes say something about 
the actual cause of death (e.g. burn). 

The data showed that exposure to open flame
accounts for 88.4% of all incidents, and that 5.8%
of incidents involved intentional self-harm (suicide).
Correspondingly, it was seen that some form
of asphyxiation accounts for 74.7% of the
causes of death, while some form of burn accounts
for 21.6% of the causes of death. The asphyxiation
rate in NCoDR (74.7%) is somewhat higher than
what was registered from post-mortem reports in
our selection (57%). By adding the fatalities where
it was concluded that the cause of death was a
combination of fire injury and asphyxiation, one
gets a rate of 67%, which is still somewhat lower
than in NCoDR. There therefore seems to be an inconsistency
between the post-mortem reports and
what is being reported to NCoDR.


Figure 3. Cause of death for fatalities in fires during the
2005-2014 period, N = 387. Segments: asphyxiation 57%,
burns 15%, unknown 13%, burns and asphyxiation 10%,
other traumas 4%, dead before fire 1%.


4 DISCUSSION 


4.1 Cause of death 


The distribution of causes of death for fire fatalities
in Norway is in stark contrast to the corresponding
US numbers for 2012-2014 where e.g.
asphyxiation alone accounted for 37% of the causes
of death (DHS 2016a). We are unaware of the
reason for these differences, but an assumption is
that the causes of death are defined and registered
differently in the US compared to Norway. This
emphasizes the uncertainty when making such
comparisons between countries.

The content in fire smoke is complex, consisting 
of a number of components which have various 
effects on humans. Asphyxiation may occur at an 
early stage in the course of a fire, even before the 
fire has developed enough to have a visible flame. 
In order to prevent deaths caused by asphyxiation 
it is therefore important to focus on preventing fires 
from arising in the first place, and to ensure that 
outbreaks of fire are detected and extinguished as 
early as possible. Materials with a high combusti- 
bility should be avoided, and an adequate control 
of potential ignition sources should be in place. 


4.2 Number of persons involved and fire outcome 


From the police reports we retrieved data on the
number of people present in the building or living
unit when the fire started and the number of persons
who perished or were injured. The results, presented in
Table 3, show that in the majority of the fires only
one person was present at the start of the fire, that in
over nine of ten fires there was only one fatality, and that in
an almost equal proportion there were no injuries
in addition to the fatality.


4.3 Risk factors 


For the purpose of the study the population is
divided into different sub-populations, e.g. victims
below the age of 67 and victims of 67 years
or more. This makes it possible to identify the factors
that often recur for these groups. These results
should be used by home care services, the GP/family
doctor and other agencies who are in contact
with the inhabitants of the municipality, and who
provide care or fire prevention measures, in order
to implement the right actions for the individual.
The subsequent subsections deal with the various
risk factors.


Table 3. Distribution of number of persons present at
fire start, number of fatalities and injured.

Present at start    1 pers.   2 pers.          3 pers.         ≥4 persons      n
                    71.1%     15.2%            5.0%            8.7%            342

Fatalities          1         2                3               ≥4              n
                    92.6%     5.7%             1.2%            0.5%            513

Number of injured   0 pers.   1 pers.          2 pers.         ≥3 pers.        n
                    86.1%     8.1%             3.5%            2.3%            347

Where victims       Point of  Neighbouring     Other room in   Outside living  n
were found          origin    room of origin   living unit     unit
                    40.1%     11.3%            42.7%           5.9%            354


4.3.1 Age and health 

During the 2005-2014 period seven children aged 
0-7 years died in a fire, which constitutes 2.2% of 
the fatalities. Six of the children died in fires occur- 
ring during the night while the family was asleep. 
In comparison 15.1% of all fatalities during the 
1970-1979 period were children aged 0-7 years, 
and for the 1990 to 1992 period the rate was 4.5% 
(Storesund 2013). This confirms
that the number of children perishing in fires is
comparatively low and that it has decreased since
the 1970s. Childcare in Norway has undergone
major changes since the 1970s. Children spend less
time on their own and more time in buildings with
high fire safety standards (schools and kindergartens),
which together have probably contributed
to improving fire safety, thereby reducing the risk
of children perishing in a fire.

Studies establish that elderly people are at greater
risk of perishing in a fire than younger persons,
and we have found the same in our study
(DHS 2016a, 2016b, Xiong et al. 2015). Half of
the fatalities are 59 years old or older. When taking
into account the number of individuals within
the various age groups, one finds an exponential
increase in the number of fatalities with increasing
age. The increase is particularly noticeable for age
groups 70-79 years and 80+. For example, the latter group
has a 9.3 times higher probability of perishing in
a fire than persons between 30-39 years according
to our material. DSB’s study of fatal fires during
the 1986-2009 period showed that elderly people
over the age of 70 had a probability of perishing
in a fire that was around 4 times higher than the
population as a whole (DSB 2010). Figures from
the US show that elderly people (age 85+) had a 4.1
times higher probability of perishing in a fire than
the population at large. The corresponding figure
from our study is 4.9 times, which is
comparable.
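
To give a feel for what an exponential relationship implies, the sketch below turns the factor of 9.3 between the age groups 30-39 and 80+ into a growth constant and a doubling interval. It assumes a pure exponential model and representative ages of 35 and 85 years for the two groups; both are assumptions made for illustration, not values from the study.

```python
import math

# Ratio of fatality risk between age group 80+ and age group 30-39 (from the text).
RISK_RATIO = 9.3

# Assumed representative ages of the two groups (illustrative assumption).
AGE_YOUNG = 35.0
AGE_OLD = 85.0

# Exponential model: rate(age) = r0 * exp(k * age), so ratio = exp(k * age_gap).
growth_per_year = math.log(RISK_RATIO) / (AGE_OLD - AGE_YOUNG)

# Number of years over which the modelled risk doubles.
doubling_years = math.log(2) / growth_per_year
```

Under these assumptions the modelled risk doubles roughly every 15 to 16 years of age; a different choice of representative ages shifts this interval.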

With an anticipated population growth as well as
a growth in the proportion of elderly people (67+),
it will be important to target preventive measures
toward this group in society, in particular in view of
the fact that more of them will live longer in their
own homes than before, and often also alone. To
date, the number of places in Norwegian nursing
homes and similar institutions has remained
rather stable in recent years (Kristiansen 2016). Nursing and care
services have changed from being provided in nursing
homes and similar institutions to being provided to a larger
degree at home, while places in nursing homes
are mainly offered to the frailest elderly
people. There is reason to believe that this trend
will continue (Steen-Hansen et al., 2010).

For those who have reached retirement age we 
principally see four recurring risk factors: reduced 
mobility (47%), impaired cognitive ability (the eld- 
erly people who fell into this category often had 
Alzheimer’s disease or other forms of dementia) 
(33%), mental illness (34%) and smoking (33%). 
However, no recurring patterns of combinations 
of various risk factors were identified, except men- 
tal illness and smoking. The report entitled Correct 
measures in the right place (English translation) 
does not define age as a risk factor in itself, but 
argues that age may involve physical and cognitive 
challenges (Storesund et al. 2015, Gjosund et al. 
2016, Storesund et al. 2016). The thought behind 
such an approach is that one should not see elderly 
people as a homogeneous group facing the same 
challenges, but rather implement safety meas- 
ures meeting individual challenges. The results of 
this study of fatal fires support this perspective. 
The fact that these analyses group persons hav- 
ing reached pensionable age in a separate group 
does not signify that one must or should intro- 
duce measures targeting all elderly people, but one 
should rather be particularly attentive to the risk 
factors that stand out for this age group. 

For those below retirement age the most conspicuous
risk factors are known substance abuse
(54%), mental illness (52%), influence of alcohol
(59%) and smoking (36%). It was also found that
these risk factors often occur in combination with


each other. This shows that some of those who per- 
ish in a fire in this age group carry several risk fac- 
tors, which increases the risk of perishing in a fire. 
Actually, 87% of the fatalities in this group have 
one or more of the mentioned risk factors. With- 
out having any basis for asserting it in this study, 
it cannot be excluded that these factors may affect 
their independent living skills, and consequently 
also the risk of a fire occurring in their living unit. 


4.3.2 Gender 

More men than women perish in fires. However,
the gender distribution varies for the different age
groups. In age groups 20-49 and 60-69 around
twice as many men as women perish,
while there is a preponderance of women in age group
80+ and between 10 and 19 years. The variations
between women and men for some age groups may
be explained by population distribution and risk
behaviour.

The variations also show us that being male is 
not a risk factor. There is still a high proportion 
of women who perish in fires, and prevention 
measures must be directed toward this group on an 
equal footing with men. 

Also for the 80+ age group the picture is more
nuanced when taking the population into consideration.
Sixty-two percent of the fatalities in age
group 80+ are women. Figures from Statistics Norway
show that the proportion of women in this age
group is 62.3% (Statistics Norway 2017).
These figures are from 2017, which is a few years
after the focus period of this study, but as far as
we know there is nothing to suggest that the
gender composition has changed considerably
during these years. Thus one may conclude that
women in this age group probably do not carry a
higher risk of dying in a fire than men.
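
The comparison above amounts to dividing a group's share of the fatalities by its share of the population. A minimal sketch with the two percentages quoted for the 80+ age group follows; the function name is hypothetical, and the men's shares are simply taken as the complements.

```python
def relative_representation(share_of_fatalities: float, share_of_population: float) -> float:
    """Ratio above 1 means overrepresented among fatalities, below 1 underrepresented."""
    return share_of_fatalities / share_of_population


# Women aged 80+: 62% of fatalities, 62.3% of the population in that age group.
women_ratio = relative_representation(0.62, 0.623)

# Men aged 80+: complements of the women's shares.
men_ratio = relative_representation(1 - 0.62, 1 - 0.623)
```

Both ratios lie within about one percent of 1, which is what an equal risk for both genders in this age group looks like in these terms.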


4.3.3 Smoking

Thirty-five percent of the fatalities were confirmed 
smokers (56% unknown), where 57% of the smok- 
ers were men and 43% women. Approximately 13% 
of fires were caused by smoking. In comparison, 
figures from Statistics Norway show that the rate 
of daily smokers in Norway has been reduced from 
around 25% to 13% during the period we studied 
(Kristiansen 2016). Since the share of smokers in 
fatal fires is a lot higher than in society in general, 
it appears that smoking is a risk factor. 


4.3.4 Influence of alcohol and intoxication 

The fact that known substance abuse and mental
illness are relatively commonly occurring factors
in fatal fires is not surprising, and is consistent with
previous studies made in Norway and abroad.
Studies show that fatal fires more often than other
fires are caused by the victim himself/herself; in
our selection the fire was directly caused by the
victim in almost 40% of the fires. This points to
the importance of being able to prevent and handle
a fire, and it shows that the probability of fatal
fire increases when these abilities are limited. Substance
abuse and some mental illnesses may lead to
impaired judgment, both as concerns the incidents
leading up to fire start, and in connection with the
fire itself.

In around 40% of the fatalities the post-mortem
examination identified traces of alcohol in
the body fluids, while no alcohol was found in an
equally large proportion. For the remaining 20%
this is unknown, as it was not reported in the post-mortem
reports, or the post-mortem report was
unavailable.

The figures also show that there are variations 
between women and men. When considering men 
only, one finds that the figures are not dissimilar to 
previously reported findings (Skaar 2013). There 
is, however, a larger discrepancy where women
are concerned compared to the same study. That
study reported that only 20% of women were under
the influence of alcohol during the fire, while our
figures show a proportion almost twice as large (36%).

One explanation for this discrepancy may be the
difference in focus periods between the two studies,
which were 1993-2008 and 2005-2014, respectively.
If this is the explanation, it may point to a change 
in the drinking pattern of women (frequency and 
volume) over the period. However, we did not 
examine this aspect any further. 


4.3.5 Single persons 

In over 70% of the fires in our data set there was
only one person present at fire start. Forty percent
of the victims were found in the room of origin,
which suggests that the victims had little
opportunity to handle the fire and escape at an early
stage of the fire.

These figures show that there is an inherent risk 
associated with being alone. The likelihood of the fire
being detected in time to survive is reduced, and 
it is harder to escape if one is alone at the start of 
fire. Persons living alone are probably more often 
alone in the living unit than persons living with 
others. This means that there is probably an indi- 
rect, increased risk associated with living alone. 


4.3.6 Culture, attitudes and language 

Attitudes to fire safety impact on the likelihood 
of an occurrence of fire. Attitudes may be related 
to risk behaviour, the willingness to use preventive 
equipment, and tidiness and maintenance. In our
study we registered whether the victim was a
non-native speaker, which was not common. One
reason was to examine whether the ability to read
and communicate in the Norwegian language had
any large impact on the risk of perishing in fire.
We did not find any detectable connection between
being a non-native speaker and the risk of
perishing in a fire.

Examination of socio-economic factors 
(income, education, ethnicity, profession, etc.) was 
not a part of this study. 


4.4 Social circumstances 


Social problems are described as challenges in the 
relationship between the individual and the soci- 
ety. The term embraces the individual’s problems 
caused by societal circumstances and the individ- 
ual’s challenges in adapting to societal structures 
and norms. The causes of social problems are 
many and often interdependent, but factors such 
as poor working conditions, lack of employment 
and poor housing conditions are mentioned as 
examples. In a medical context, it is said that social
problems comprise a bodily, a mental and a social 
component, which all interact (Bruusgaard 2017). 

The dominating risk factors in our material all
fall under social problems in the medical context,
as they either affect the individual’s bodily, mental 
or social functions, or a combination of these. 

As discussed and exemplified by Storesund
et al. (2015), it is therefore of key importance
not only to focus on the physical circumstances of
the dwelling when initiating fire preventive
measures, but also to consider the individual’s specific
challenges and needs (bodily and mental) and the
individual’s social circumstances. For example, a
person with impaired hearing will have other
challenges regarding preventing, perceiving and
escaping a fire than a person with impaired vision.
A person with reduced cognitive abilities would
have other challenges again.

By conducting a thorough risk analysis, the 
individual’s needs could be identified and suitable 
organisational and/or technical measures could be 
initiated. 


4.5 Critique of methodology 


The police reports vary when it comes to richness in 
detail, which may impact on our analyses. In some 
cases we see that investigation reports are highly 
inadequate, while in other cases they are exhaus- 
tive. This makes it difficult to draw categorical 
conclusions as regards some factors. We have taken 
care not to colour the information with our own 
interpretations, and where some factors are not
mentioned we have labelled them “unknown”
to underline the lack of data. In a small number of
cases where the police concluded with an unknown
cause of fire, but where we, based on a professional
assessment, believe that one cause of fire is
overwhelmingly likely, we have stated this as the cause.
This was a conscious choice which we believe gives
a more correct picture of the actual causes, even
though it may give an imprecise picture of the
police’s clearance rate as concerns fatal fires.

The medical records were also very thin in some 
cases. In these cases the majority of categories were 
therefore marked as unknown. Similarly, cases 
potentially relating to cognitive ability, substance 
abuse and mental illness were labelled as unknown, 
as these conditions most often are not evaluated 
in cases where the doctor does not have any
suspicion. It is therefore reasonable to assume that
in a large percentage of the cases where these
conditions are labelled unknown, the victim in fact
had no impaired cognitive ability, known substance
abuse or mental illness. In cases where the victim
had rarely been to see the doctor, the medical records
tended to be old, with little updated information.
The relevance of data has therefore been evaluated 
from one case to the next. 

To what extent one is affected by a certain
blood-alcohol level depends on a variety of
factors, e.g. weight and the person’s alcohol tolerance.
It can therefore not be stated with any certainty
to what extent alcohol actually affected the victim
and the outcome of the fire.

Based on information in the investigation 
reports and the post-mortem reports we therefore 
registered categorically whether or not the fire 
victims were under the influence of alcohol at the 
time of death. 


5 CONCLUSIONS 


Individuals who have died in fire cannot be divided 
into groups of common denominators, but there 
are some combinations of factors that we have 
seen repeatedly: 

For those who have reached retirement age, we 
mainly see four risk factors: reduced mobility, impaired 
cognitive ability, mental disorders and smoking. 

For those under retirement age, the risk factors
are known substance abuse, mental illness, influence
of alcohol and smoking, which appear either alone
or in combination with each other.

There is an increasing risk of dying in a fire 
with increasing age. Generally speaking, men do 
not have a higher risk than women, but in some 
age groups, the risk of fatality is greater for men. 
Alcohol constitutes a greater risk factor for men 
than for women. 

Fatal fires in Norway are a social problem, 
which calls for measures individually adapted to 
the persons with the identified risk factors. 

The current study has generated extensive data 
which can serve as a baseline to track whether 
the fatality rate for different groups increases or 
decreases over time. This could also be valuable 
when evaluating the effect of future fire preventive 
measures. 
Probabilities in safety of machinery – Markov model for the scaling 
of risk reduction effects due to limiting the hazard exposure 


H. Mödden 
German Machine Tool Builders’ Association (VDW), Frankfurt am Main, Germany 


ABSTRACT: The most recent product safety standards for machine tools such as ISO 16090 for the 
safety of milling machines follow the three-step-method of risk reduction, as it is explained in ISO 12100. 
Step 1 is always the design of inherent safety of the machine. Step 2 follows with the design of additional 
safety measures, and step 3 focusses on instruction for use (including warning signs at the machine) and 
training. A very effective risk reduction can be achieved, if the work area of the machine is fully enclosed 
(this belongs to step 2). In doing so, the automatic machining processes can take place without exposing 
the operator to the hazard. However, operator access to the work zone is sometimes necessary for manual 
intervention, such as setting actions inside the work area or workpiece change. Then careful instructions 
are necessary, which shall enable the operator to fully control the situation and protect himself by aware- 
ness of the respective hazards. 

The many different kinds of operator activities, which vary with the selected machining process, can 
be allocated to specific modes of operation. In ISO 16090, a modes of operation concept comprises five 
selectable options: 0: manual mode/1: automatic machining/2: setting mode/3: special mode with limited 
manual intervention/and separately for selected operators only: service mode. The purpose of such a 
concept is to reduce the relative exposure of the operator to occurring hazards as far as possible for the 
intended use. 

The combination of full enclosure and modes of operation concept brings about a significant risk 
reduction; however, the effect has so far been discussed only qualitatively, on an intuitive basis. 
Unfortunately, a scalable model that could quantify the risk reduction effects, e.g. for the purpose of 
parameter optimization, is still missing. 

In order to improve the engineering abilities, a first, simplified Markov model is presented here, 
which is founded on a probabilistic concept for the description of the operator activities at a machine. As 
a result, the above-mentioned risk reduction effects can be scaled probabilistically. This is an advantage 
when it comes to the quantitative reliability requirements of safety functions according to ISO 13849-1, 
such as a safe standstill of gravity-loaded vertical axes. Because of the potentially severe hazards of those 
axes in setting mode or during manual intervention in the work area, the required reliability is high, e.g. 
performance level PL = d according to ISO 13849-1. This paper explains, for typical manual interventions, 
how this requirement and supplementary safety provisions can be justified probabilistically. 


1 PROBABILITIES IN SAFETY 
OF MACHINERY 


Numerical probabilistic representations were 
recently introduced in the field of general 
machinery safety, when the revised European 
Machinery Directive 2006/42/EC extended the 
“hazard analysis” of former versions to a “risk 
analysis” by introducing the term “probability” 
in the expression: “estimate the risks, taking into 
account the severity of the possible injury or damage 
to health and the probability of its occurrence”. 
Since this alteration in a legal text, simplified 
probabilistic methods as in ISO 13849-1 (2015) 
are being developed. They encounter a well-proven 
practical state-of-the-art, which is merely based 
on qualitatively defined requirements. These were 
mainly focussing on hazards as such (and their 
countermeasures) rather than “risks, severities of 
injuries and their probability of occurrence”. 
Nevertheless, against the background of customer demands 
for very high availabilities (which is equivalent to 
high inherent safety), it brought about a well-tried 
state-of-the-art following the three-step-reduction 
method of ISO 12100 (2010). The state-of-the-art 
has been defined on a non-quantitative descriptive 
background in harmonized safety standards for more 
than 20 years. 
2 CONDITIONS OF A MARKOV MODEL 
IN SHORT FORM 


Markov chains are suitable for describing stochastic 
processes that change between finitely many 
states, as is the case during machine operation. 
Markov modelling with constant transition 
probabilities is illustrated, for example, in Feller (1967). 
Discrete system states are defined with corresponding 
probabilities of either a) remaining in a state or b) 
changing to another state. A representation 
with time-dependent transition probabilities on the 
basis of constant failure rates, as with an exponential 
distribution, is also possible. The relative frequencies 
of the system states (as an approximation 
of the corresponding probabilities) could be derived, 
for example, from a video evaluation of typical 
activities, because probability values can be 
represented as relative frequencies if a large basic 
population or a large number of repetitions of the 
observed random events is given. The latter is the 
case over the service life of a machine with regard 
to the system states defined in section 3 below. 
Accordingly, a Markov model shall be created here 
for manual intervention in the workspace of a 
machine tool. For this, certain conditions need 
to be fulfilled. 


Check the conditions: 


Condition 1: The probability of the state $Z_i$, 
which occurs at the i-th time step, depends only on 
the state $Z_{i-1}$, which was present at the (i-1)-th time 
step (Markov property). 


Reflection: The probability of the state Z, “Manual 
intervention” at the i-th time step depends only 
on the state Z, which was taken to the (i-1)-th 
state: namely, if the (i-1)-th state was “automatic 
processing”, there is a certain (average) probability 
to change into the state “manual intervention”. 

The Markov property may not be fully attain- 
able with very detailed state models, for example, 
if certain sequential patterns are passed through. 
In this case, the transition probability of the next 
state may not only depend on the last state, but also 
on the previous one. For a first simplified model 
with three states in Figure 1, however, the Markov 
property can be considered fulfilled. 


Condition 2: In a homogeneous Markov process 
with a discrete range of parameters and finitely 
many states, the transition matrix B(i,i+1) must not 
depend on i. At discrete time steps $t_i = i \cdot t_p$, 
this means that the transition matrix is independent 
of time, so it must be stationary. 


Reflection: The use of a machine leads to changing 
operating modes. Their occurrence probabilities 
can be approximated as averaged constants at all 
transitions (i, i+1). The requirement for a transition 
matrix that is independent of “i” might not be 
fully satisfied in the short term because, for example, 
certain sequences of machining steps that differ 
from the sequences averaged over long periods 
of time occur on a machine during a day shift. For 
long periods of time, however, the requirement for 
a transition matrix that is independent of “i” (i.e. 
independent of time) is approximately met because of 
the averaging effect. This holds only for the “inner 
states” of the model, however, because with the 
“absorbing states” a temporal increase relative to 
global time comes into play (see below). Unforeseeable 
failures during the repeated change of operating 
modes above, whether of technical or of human 
cause, require “absorbing states”. 

Then, a homogeneous Markov model needs 
to be expanded to an inhomogeneous Markov 
process. The reason is that technical faults have 
a time-dependent probability of occurrence, e. g. 
exponentially distributed. Even an inhomogene- 
ous Markov model still fulfils the Markov condi- 
tion that the further course of the process is clearly 
defined at all times. 


As a result, both conditions above are fulfilled 
and a Markov model can be created for manual 
intervention in the workspace of a machine tool. 
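
As noted in this section, the transition probabilities can be approximated by relative frequencies, for example from a video evaluation of typical activities. The following is a minimal sketch of that idea, assuming the evaluation yields one observed state (1, 2 or 3) per time step. The function name `estimate_transition_matrix` and the observation log are hypothetical and serve only as illustration; they are not taken from the paper.

```python
from collections import Counter


def estimate_transition_matrix(states, n_states=3):
    """Estimate transition probabilities as relative frequencies.

    `states` is a sequence of observed states (1..n_states), one per
    time step. Row i of the result approximates the probabilities of
    moving from state i+1 to each state in the next step.
    """
    counts = Counter(zip(states[:-1], states[1:]))
    matrix = []
    for i in range(1, n_states + 1):
        row_total = sum(counts[(i, j)] for j in range(1, n_states + 1))
        if row_total == 0:
            # State i was never observed as a starting state.
            matrix.append([0.0] * n_states)
        else:
            matrix.append(
                [counts[(i, j)] / row_total for j in range(1, n_states + 1)]
            )
    return matrix


# Hypothetical observation log (illustrative values only):
# 1 = automatic mode, 2 = setting mode, 3 = manual intervention.
observed = [2, 1, 1, 3, 1, 1, 1, 3, 1, 1, 2, 3, 1]
B_hat = estimate_transition_matrix(observed)
for row in B_hat:
    print([round(b, 3) for b in row])
```

With a log this short the estimates are of course very rough; the argument in the text relies on a large number of repetitions over the service life of a machine.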


3 SYSTEM STATES OF THE MODEL 


In accordance with the normatively anchored 
operating modes, such as in ISO 16090-1 (2016), 
only 3 states of the machine are considered here 
at first: 

System State 1: Production in automatic mode 
with closed prot. doors (with Z1 as state prob.) 

System State 2: Setting mode to prepare auto- 
matic op. with opened prot. doors (with Z2 as state 
prob.) 

System State 3: Manual intervention with 
opened protection doors, e. g. to carry out a work- 
piece change or to correct a fault; either from the 
previous state 1 or from state 2 (with Z3 as state 
prob.) 

Figure 1 shows a possible sequence of changes 
of states: To prepare automatic operation from 
time 1 on, setting mode is started at time 0. At time 
point 2, a malfunction occurs (transition from state 1 to 
3) that is rectified manually up to time 3.15. From 
then on, automatic operation continues undisturbed. 
At point 7, the workpiece is finished and is 
replaced by an unmachined part, which lasts 
until point 7.7, after which the automatic machining 
continues. Fig. 2 shows another possible continuation 
of state changes. At time point 10, the 
automatic processing of Fig. 1 is still running. 


Figure 1. Example of state changes in automatic mode 
(x-axis: time in minutes, y-axis: indicator for state). 


Figure 2. Example of state changes in setting mode 
(x-axis: time in minutes, y-axis: indicator for state). 

The workpiece is finished at time point 11 and 
is removed until point 11.7. Afterwards, a 
new setup operation takes place, in which manual 
operations are necessary up to time point 12.6. From 
time point 13.5 on, a new workpiece is clamped. 

Afterwards, automatic processing is carried 
out up to time point 16, where a malfunction 
occurs. This leads to a switch-over into setting 
mode, whereby manual intervention is necessary 
from point 16.5 to 17.3. From time point 17.3, the 
automatic processing can be continued. At 
time point 19, another workpiece change takes 
place, which is finished at time point 19.4. The new 
workpiece is then to be processed automatically. 

The activities in Figs. 1 and 2 can thus be grouped 
into four types of human-machine interaction: 

Activity A: Manual workpiece change in automatic 
mode (Z1 → Z3) 

Activity B: Manual troubleshooting in automatic 
mode (Z1 → Z3) 

Activity C: General setting mode (Z2) or manual 
workpiece change during setting (Z2 → Z3) 

Activity D: Manual troubleshooting in setting 
mode (Z2 → Z3). 

In the following section, another activity occurring 
in real practice shall also be introduced for the 
preparation of a follow-up paper: 

Activity E: Manual troubleshooting in setting 
mode (Z2 → Z3), which could not be resolved. 


4 PRESENTATION OF THE STATE AND 
TRANSITION PROBABILITIES 


The real key question in this paper is: How likely or 
relatively frequent is it that the operator is exposed 
to hazards, which could possibly occur during 
manual work in the work area? 

In order to answer the question quantitatively 
with scalable probabilities, a homogeneous 
Markov model is suitable as a quantitative reference 
frame. This should be adapted to the above 
three system states. Fig. 3 shows a graphical 
representation. The state probabilities P(i) at subsequent 
time steps “i” result from the so-called “B-matrix 
of state and transition probabilities” of a homogeneous 
Markov process, see Eq. (2). 


Figure 3. State and transition probabilities of the 
homogeneous Markov model. 


The states 1, 2 and 3 are called “inner states” 
because they can merge into each other. 

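For time step i = 0, an initial start vector P(0) of 
state probabilities is suitable as a representation: 


$$P(0) = \begin{pmatrix} Z1(0) \\ Z2(0) \\ Z3(0) \end{pmatrix} \qquad (1)$$


$$B = \begin{pmatrix} b_{1,1} & b_{1,2} & b_{1,3} \\ b_{2,1} & b_{2,2} & b_{2,3} \\ b_{3,1} & b_{3,2} & b_{3,3} \end{pmatrix} \qquad (2)$$


The progressively developing connection 
probabilities are described in Feller (1967): 


$$P(i+k) = P(i) \cdot B^k \quad \text{or} \quad P(k) = P(0) \cdot B^k \qquad (3)$$


Thus, forecasts of repetitions of an assumed 
reference period can be deduced at time point $k \cdot t_p$. 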


4.1 Derivation of the transition probabilities 


As explained in section 2, probabilities over the 
lifetime of a machine and for a basic population of 
comparable machines can be presented as relative 
frequencies. 

For the duration and repetition rates of the five 
above-mentioned activities, an averaging over several 
8-hour shifts shall be assumed. This serves as a frame of 
reference for the hypothetical estimation of the 
needed transition probabilities. In this paper, the 
values listed in Table 1 are assumed, e.g. a manual 
workpiece change on the machine tool takes 
1.2 min. (= 0.02 hours) and repeats itself 6 times 
per hour; this means 7.2 min. of non-productive 
time per hour. The relative exposure is 0.12. 


Table 1. Five typical activities on a machine tool. 


Action  Duration [h]  Repetitions [1/h]  Fraction [1]  Coeff. ($b_{i,j}$)       Connecting coeff. 
A       0.02          6                  0.12          $(b_{1,3})_A = 0.12$     $b_{1,3} = (b_{1,3})_A + (b_{1,3})_B = 0.1325$ 
B       0.05          0.25               0.0125        $(b_{1,3})_B = 0.0125$ 
C       8             0.025              0.2           $(b_{1,2})_C = 0.2$      $b_{1,2} = 0.2$ 
D       0.1           0.05               0.005         $(b_{2,3})_D = 0.005$    $b_{2,3} = (b_{2,3})_D + (b_{2,3})_E = 0.008$ 
E       0.5           0.006              0.003         $(b_{2,3})_E = 0.003$ 

Remaining coefficients: 
$b_{1,1} = 1 - b_{1,2} - b_{1,3} = 0.6675$; 
with $b_{2,2} = 0$ and $b_{2,3} = 0.008$: $b_{2,1} = 1 - b_{2,3} = 0.992$; 
with $b_{3,2} = (b_{2,3})_D$ and $b_{3,3} = (b_{2,3})_E$ follows: $b_{3,1} = 1 - b_{3,2} - b_{3,3} = 0.992$. 


From Table 1, the state and transition probabilities 
can be analytically derived from state 1 (with 
probability Z1) with $b_{1,j}$ (j = 1, ..., 3) to: 


$$b_{1,1} = 0.6675; \quad b_{1,2} = 0.2; \quad b_{1,3} = 0.1325 \qquad (4)$$


In addition, it is clear that the setting mode (Z2) 
must always switch to automatic mode (Z1), either 
directly or via the troubleshooting of a fault (Z3), 
i.e.: 


$$b_{2,2} = 0 \qquad (5)$$


With the additional assumptions in Table 1, the 
remaining coefficients can be determined analytically 
as follows: 99.2% of the general setting mode 
(Z2) is transferred to automatic operation (Z1) 
without manual intervention (Z3). Conversely, 
a 0.8% usage share of the setting 
mode for troubleshooting can be deduced: 


$$b_{2,1} = 0.992; \quad b_{2,3} = 0.008 \qquad (6)$$


In addition, it shall be assumed that in automatic 
mode (Z1), 100% of the faults are either eliminated 
or assigned to setting mode (Z2). For a follow-up 
paper, it could be further assumed that only 5/8 of 
all faults in the setting mode can be rectified and 
3/8 cannot. 

Assuming that the unrecoverable faults (E) 
occur in state (Z3) via setting mode (Z2), the result 
would be: $b_{3,3} = (b_{2,3})_E$. And because the remediable 
faults (D) are dealt with in setting mode, the result 
would be: $b_{3,2} = (b_{2,3})_D$. It follows: 


$$b_{3,1} = 0.992; \quad b_{3,2} = 0.005; \quad b_{3,3} = 0.003 \qquad (7)$$


Therefore, the state and transition probabilities 
required for the matrix representation of a Markov 
model are available as coefficients. The so-called 
“B-matrix of the state and transition probabilities” 
of a homogeneous Markov process results in: 


$$B = \begin{pmatrix} 0.6675 & 0.2 & 0.1325 \\ 0.992 & 0 & 0.008 \\ 0.992 & 0.005 & 0.003 \end{pmatrix} \qquad (8)$$


Plausibility check: each row sum of B must be 1. 
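
The derivation of the B-matrix from Table 1 can be retraced numerically. The following is a minimal sketch under stated assumptions: the relative exposure of each activity is taken as duration times repetition rate, and the coefficient $b_{2,3}$ is read as the sum of the fractions of activities D and E, which reproduces the value 0.008 of Eq. (6). The variable names are illustrative and not part of the paper.

```python
# Duration [h] and repetition rate [1/h] of activities A to E (Table 1).
activities = {
    "A": (0.02, 6),
    "B": (0.05, 0.25),
    "C": (8, 0.025),
    "D": (0.1, 0.05),
    "E": (0.5, 0.006),
}

# Relative exposure (fraction of time) = duration * repetition rate.
fraction = {name: d * r for name, (d, r) in activities.items()}

# Row 1: transitions out of automatic mode (Z1).
b13 = fraction["A"] + fraction["B"]
b12 = fraction["C"]
b11 = 1 - b12 - b13

# Row 2: transitions out of setting mode (Z2), with b22 = 0 (Eq. 5).
# Assumption: b23 is the sum of the D and E fractions (0.005 + 0.003).
b22 = 0.0
b23 = fraction["D"] + fraction["E"]
b21 = 1 - b22 - b23

# Row 3: transitions out of manual intervention (Z3), see Eq. (7).
b32 = fraction["D"]
b33 = fraction["E"]
b31 = 1 - b32 - b33

B = [
    [b11, b12, b13],
    [b21, b22, b23],
    [b31, b32, b33],
]

# Plausibility check: each row of B must sum to 1.
for row in B:
    assert abs(sum(row) - 1.0) < 1e-12

for row in B:
    print([round(b, 4) for b in row])
```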


4.2 State vector after repetitions 


For the i-th repetition of a state with a Markov 
model, the following vector of the state probabilities 
can be defined: 


$$P(i) = \begin{pmatrix} Z1(i) \\ Z2(i) \\ Z3(i) \end{pmatrix} \qquad (9)$$


For example, to initialize the model with a completed 
setting operation to be followed by an automatic 
process, an initial condition (i = 0) can be: 


$$P(0) = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} \qquad (10)$$

4.3 State vector after one week/one year 


As assumed above, the reference frame for 
estimating the transition probabilities here is an 
8-hour shift. The coefficients of the B-matrix are 
derived from Table 1 above in Eq. (8). In this way, 
the values of the time proportions of the states for 
a reference time period can be concluded (e.g., here 
a 40-hour week with five 8-hour days): 


Z1(i) Proportion of total usage time in state 1 
(automatic operation), 

Z2(i) Proportion of total usage time in state 2 
(setting mode), 

Z3(i) Proportion of total usage time in state 3 (fault). 


As a representation in vector form: 


$$P(i)^T = (Z1(i), Z2(i), Z3(i))$$
Table 2. Short oscillation of the Markov model. 


Duration   Repetitions i   $P(i)^T$ 

1 day      1               (0.6675, 0.20, 0.1325) 
1 week     5               (0.748, 0.151, 0.101) 
1 year     220             (0.749, 0.150, 0.100) 


Table 2 shows the rapid transient oscillation of 
the Markov model, such that a quasi-stationary 
vector of the state probabilities already appears 
after 1 week. For the steady state after 1 week in 
Table 2, the proportion of undisturbed production 
results in: Z1(i = 5) = 0.748 = 74.8%. 

The proportion of interrupted production results 
in: 1 - 0.748 = 0.252 = 25.2%. Of 
this, the system state 3 “manual intervention” has a 
share of Z3 = 0.1 = 10%, based on the assumptions 
above. This provides a complete and scalable 
framework for plausibly estimating the probability of a 
hazard or injury during manual intervention. 
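
The transient behaviour summarized in Table 2 can be retraced by repeatedly applying the B-matrix of Eq. (8) to a start vector, following $P(k) = P(0) \cdot B^k$. The following is a minimal sketch, assuming a start in state 1, i.e. the start vector (1, 0, 0); this choice is an assumption made here because it reproduces the first row of Table 2. The helper names `step` and `propagate` are illustrative.

```python
# B-matrix of state and transition probabilities, Eq. (8).
B = [
    [0.6675, 0.2, 0.1325],
    [0.992, 0.0, 0.008],
    [0.992, 0.005, 0.003],
]


def step(p, matrix):
    """One repetition: row vector p multiplied by the transition matrix."""
    n = len(matrix)
    return [sum(p[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def propagate(p0, matrix, k):
    """State probabilities after k repetitions, P(k) = P(0) * B^k."""
    p = list(p0)
    for _ in range(k):
        p = step(p, matrix)
    return p


# Assumed start vector: the machine starts in state 1 (automatic mode).
p0 = [1.0, 0.0, 0.0]

for label, k in [("1 day", 1), ("1 week", 5), ("1 year", 220)]:
    p = propagate(p0, B, k)
    print(label, k, [round(z, 4) for z in p])
```

Because the second and third rows of B are almost identical, the state vector settles within a few repetitions, which is the quasi-stationary behaviour described above.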


5 HAZARDS OF MANUAL ACTIVITIES 


A whole range of hazards are conceivable during 
manual intervention in the work area: crushing of 
hands or arms, shearing of fingers, entanglement 
of clothing etc. are basically possible. Risk assess- 
ment is in the first step aimed at the completeness 
of all possible hazards. This was formerly known 
as a hazard analysis. However, in order to become 
a risk analysis, a continuation step needs to follow: 
every “possibility of a hazard” is to be converted 
into a “probability of this hazard”. 

Here, state 3 is “Manual intervention” to carry 
out a workpiece change or to correct a malfunction; 
either from the previous state 1 or from state 2. The 
subsequent state after 3 could be a hazard with an 
injury. It shall be combined in a further state: 

System State 4: Hazards can occur that could 
possibly cause injuries (with Z4 as the probability 
of their occurrence). Because these injuries could 
be severe, a standstill of the machine for accident 
investigation can be assumed such that (Z4 = 1) 
would apply. Thus, state 4 is an "absorbing state". 

It is supplemented in Fig. 4. Once an absorbing 
state has been entered, it is not left again ($b_{4,4} = 1$). 
Absorbing states represent critical states of a system. 
Therefore, one is interested in the probability 
with which such critical states occur during the 
process. 

The extension of the model with state 4 turns the 
above homogeneous Markov model into an 
inhomogeneous Markov model, because state 4 is not 
time-independent, i.e. the transition probability $b_{3,4}$ 
is not constant, but depends on time: $b_{3,4} = f(t)$. 
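
The effect of adding an absorbing state can be illustrated with a short sketch. It rests on several assumptions that are not taken from the paper: the function $f(t)$ is modelled as an exponential distribution with a hypothetical constant rate per time step, the remaining entries of row 3 are rescaled so that the row still sums to 1, and the chain starts in state 1. The names `b34`, `extended_matrix` and `propagate` are illustrative only.

```python
import math

# Inner transition matrix of Eq. (8).
B_INNER = [
    [0.6675, 0.2, 0.1325],
    [0.992, 0.0, 0.008],
    [0.992, 0.005, 0.003],
]


def b34(i, rate_per_step=1e-6):
    """Hypothetical f(t): probability of moving from state 3 to state 4.

    Assumed form for illustration: exponential distribution with a
    constant (hypothetical) rate, evaluated at the elapsed step count.
    """
    return -math.expm1(-rate_per_step * (i + 1))


def extended_matrix(i):
    """4x4 transition matrix at step i, with absorbing state 4."""
    q = b34(i)
    rows = [row + [0.0] for row in B_INNER[:2]]
    # Assumption: row 3 is rescaled so that it still sums to 1.
    rows.append([(1.0 - q) * b for b in B_INNER[2]] + [q])
    rows.append([0.0, 0.0, 0.0, 1.0])  # absorbing state: b44 = 1
    return rows


def propagate(p0, steps):
    """Return the list of state vectors P(0), P(1), ..., P(steps)."""
    p = list(p0)
    history = [p]
    for i in range(steps):
        m = extended_matrix(i)
        p = [sum(p[r] * m[r][c] for r in range(4)) for c in range(4)]
        history.append(p)
    return history


history = propagate([1.0, 0.0, 0.0, 0.0], 220)
z4 = [p[3] for p in history]
print(f"Z4 after 220 steps: {z4[-1]:.3e}")
```

Since state 4 cannot be left, its probability Z4 can only grow with the number of steps, which is why the extended model is no longer stationary.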


Figure 4. Extension to an inhomogeneous Markov 
model: state and transition probabilities and an 
absorbing state 4. 


In simplified terms, the exponential distribution 
over the $PFH_D$-values assumed in ISO 13849-1 
(2015) can be regarded as the trickle rate of an 
“hourglass” and thus a time proportionality can be 
applied, see Moedden (2014). The following 
hypothetically derived values for the above activities A, 
B, C, D and E are intended to provide the connection 
for an answer to the key question above. 

It is a very first proposal of “risk snapshots”. 
The purpose is merely to estimate the risk reduc- 
tion of guards (e.g. full enclosure) in the context of 
an exemplary modes of operation concept. 

A calculation example shall be used to com- 
pare the five activities A, B, C, D and E as “risk 
snapshots” in a scaled form, as assumed above, see 
Table 3. An 8-hour shift is used again. When man- 
ually intervening in the working area, the operator 
has to completely trust in reliable control technol- 
ogy, e. g. for the safety functions “Safe operation 
stop” and “Safe reduced speed”. 

The relevant failure probabilities can be 
assigned as $\sum PFH_{D,tech}$ values in a superimposed 
form. The typical hazards can be superimposed, 
too. The probability of occurrence of relevant hazards 
(risk element O according to equation 11) and 
their controllability (or avoidability, risk element C 
according to equation 15) are simply estimated on 
the safe side, just for illustration purposes. 

Activity A: Manual workpiece change in automatic 
mode (Z1 → Z3), with: 


— Safety function “Safe operational stop (SOS): 

— 2 rotary axes (X PFH, oo ee 
which can cause an entanglement hazard (EN), 

— and 3 translational axes (PFH, een = 3 - (4.5 - 
10*)+), which can cause crushing (CR), 


Activity B: Manual fault rectification in automatic 
operation (Z1 → Z3) with the same safety 
function and hazards as for activity A, 

Activity C: General setting mode (Z2) or manual 
workpiece change (Z2 → Z3) with the same safety 
function as for activity A, 

Activity D: Manual troubleshooting in setting 
mode (Z2 → Z3) with: 
Table 3. “Risk snapshots” of five typical activities on a 
fully enclosed machine tool. 


Activity/Safety      $\sum PFH_{D,tech}$   O     Fa    C 
function/Hazards     [1/h]                 [1]   [1]   [1] 

A: /SOS/EN+CR 2.(6.5-10*) 1 0.12 0 
+3-(4.5-10) 

B: /SOS/EN+CR 2-(6.5-10%) 1 0.0125 0 
+3-(4.5-10°) 

C: /SOSIEN+CR  2-(6.5-10°°) 1 0.2 0 
+3-(4.5-10°°) 

D: /RSP/ÆEN+CR 1-(6.5-10°*) 0.2 0.005 0.5 
+1-(4.5-10°) 

E:/RSP/EN+CR  1.(6.5-10*) 1 0.003 0 
+1-(4.5-10°) 


— Safety funct. “Safely reduced speed (RSP) with: 
—1 rotary axis PFH, tech =1.(6.5-10-)+), 

which can cause an entanglement hazard (EN), 
— and | translational axis (ZPFH,,,,., = 1 - (4.5 - 


10*)+), which can cause crushing (CR), 


Activity E (for a follow-up paper): Manual 
troubleshooting in setting mode without success with 
the same safety function and hazards as in activity 
D; but without controllability: C = 0 and with very 
high probability of occurrence of a hazard: O = 1. 

The likelihood of a hazard occurring after a 
control failure (e. g. causing an unexpected move- 
ment) has been assumed for activities A, B, C with 
O = 1. This means that a failure is very likely to 
lead to a hazard in the respective situation. For 
activity D, only O = 0.2 is assumed. For the first 
three activities a controllability is not possible: 
C = 0, and for the fourth activity it shall be esti- 
mated with C =0.5. 

The superimposed failure probabilities are compared 
in Figure 5 as $\sum PFH_{D,tech}$ values. 

These base values can be used to carry out a risk 
assessment for manual interventions A, B, C, D 
and E. For this purpose, the risk elements probability 
of occurrence of a hazard (after a control 
failure or after a human error), relative exposure 
and controllability ought to be reasonably estimated for 
Eq. (14). In addition, the terms used should be 
clear in order to obtain a logical risk model. 


5.1 Key term “Probability of hazardous events” 


Hazards can be caused by technical 
or human factors or force majeure. The 
designer cannot influence the latter, but he/she 
must concentrate on mastering the causes of 
technical failures (index “tech”) and also on foreseeable 
human errors. For quantitative aspects, an 
objectively scalable probability plays a special role. 


Figure 5. Bar chart of the $\sum PFH_{D,tech}$ values of 
the “risk snapshots” (x-axis: indicator for activity, y-axis: 
values in 1/h). 


For this purpose, a correctly defined probability 
space is indispensable, see Moedden (2017). Two 
quantitative representations of the term “probability” 
are established: a) in most cases, generally 
the dimensionless probability as illustrated in 
Feller (1967); and b) in the context of control chains 
for safety functions, often a probability per time 
unit ($PFH_D$-value, see ISO 13849-1 (2015)). 
The first can be derived from the latter via a time 
integral, as explained by means of an “hourglass 
analogy” in Moedden (2014). 
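
The step from a probability per time unit to a dimensionless probability can be illustrated numerically. The following is a minimal sketch, assuming a constant failure rate, so that the time integral yields the exponential distribution $1 - e^{-\lambda T}$, which for small $\lambda T$ is close to the time-proportional value $\lambda T$. The rate and the time window below are hypothetical values chosen for illustration only.

```python
import math


def probability_from_rate(rate_per_hour, hours):
    """Dimensionless failure probability within a time window.

    Assumes a constant rate (exponential distribution):
    P = 1 - exp(-rate * time).
    """
    return -math.expm1(-rate_per_hour * hours)


pfh_d = 1e-6   # hypothetical probability of failure per hour [1/h]
window = 8.0   # hypothetical time window: one 8-hour shift [h]

exact = probability_from_rate(pfh_d, window)
linear = pfh_d * window  # time-proportional approximation

print(f"exponential: {exact:.6e}")
print(f"linear:      {linear:.6e}")
```

For rates and time windows of this order, the two values differ only far behind the decimal point, which is consistent with applying a simple time proportionality.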


In this context, the interpretation of the so-called
“Murphy’s Law” must be reasonably restricted.
It reads: “Anything that can go wrong will go
wrong”. Murphy’s Law is an alleged piece of worldly
wisdom that makes a statement about human failure
or sources of error in complex systems. The short
form, with its emphasis on “...will go wrong”,
presumably contains an assumption that must not go
unmentioned here, since the key term “probability”
has to be distinguished from the term “possibility”.
Presumably, the precondition of Murphy’s Law is
either an infinite number of repetitions or an
infinite basic population. For a finite number of
repetitions and/or a finite basic population,
however, Murphy’s Law does not justify the
conclusion “...will go wrong”.
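
The difference between “possible” and “probable” can be
illustrated numerically. The following minimal sketch
assumes independent repetitions with a constant failure
probability p per repetition; this assumption, the
function name and the numerical value of p are
introduced here for illustration only and are not part
of the risk model above. For any finite number of
repetitions n the result stays below certainty; it only
approaches 1 as n grows without bound.

```python
def prob_at_least_one_failure(p: float, n: int) -> float:
    """Probability of at least one failure in n independent repetitions.

    Assumes (for illustration) a constant failure probability p
    per repetition: P = 1 - (1 - p)**n.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    p = 1e-6  # hypothetical per-repetition failure probability
    for n in (1, 1_000, 1_000_000, 100_000_000):
        value = prob_at_least_one_failure(p, n)
        print(f"n = {n:>11,d}: P(at least one failure) = {value:.6f}")
```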


5.2 Scaling of risk reduction measures 


Practical machine design methods, as collected for
a milling machine in ISO 16090-1 (2016), apply a
whole range of measures that bring about a
considerable gap between the probabilities of the
following events:

Event I: a failure in a mechatronic control chain
of a certain safety function, for instance due to the
failure of a mechanical brake actuator in the safety
function “SOS” (see above, activity A), and

Event II: hazards occurring from this failure
(here: crushing and entanglement).

The likelihood of event I is expressed as a
$PFH_d$ value of the safety function. The unit is
a number per hour. The probability of event II in
a certain time frame $t_1 < t < t_2$, i.e. the
occurrence probability of the corresponding hazards,
is given as a dimensionless number: it represents a
probability, and the parameter PHE is used here.

If the above-mentioned over-interpretation of
“Murphy’s Law” were made, the conclusion would be:

$$PHE = PFH_d \cdot (t_2 - t_1)$$


The reality is different: there is a well-tried art
of safety design such that PHE is generally
significantly lower, $PHE \ll PFH_d \cdot (t_2 - t_1)$,
because additional risk reduction measures “surround”
the mechatronic control chains in a cascaded form.
For instance, protective guards such as a full
enclosure of the work area are very effective in
reducing the probability of hazardous events that
might be caused by failures inside the work area.
Thus, within a cascaded safety design, not every
failure of a safety function leads to a hazard. A
hazard can occur only if an operator is exposed to
potentially hazardous control chains. The $PFH_d$
values of control chains represent the “threatening”
failure probabilities. That is why a distinction has
to be made between an open and a closed protective
guard (see state definitions above) when deriving
PHE values. Obviously, hazardous events can occur
when the protective guards are open. With closed
guards, however, an operator is not exposed to
possible hazardous events inside the full enclosure.


The risk reduction effect by limiting the rela- 
tive exposure is explained in section 6.1 below as a 
product of time duration and repetition rate. 

A dimensionless probability with a scaling factor
“O” can be defined as follows for the time
period $t_1 \le t < t_2$. It corresponds to the risk
element “probability of occurrence of a hazard” (ISO
12100, section 5.5.2.3.2):


$$PHE = O \cdot \int_{t_1}^{t_2} \sum PFH_d \, dt = O \cdot \sum PFH_d \cdot [t_2 - t_1], \quad \text{with } 0 < O < 1 \qquad (11)$$
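
The scaled time integral can be sketched in a few lines.
The example below is only an illustration under stated
assumptions: the function names are hypothetical, the
integration limits are written as t1 and t2, and the
numerical values (a summed rate of 1.1e-5 per hour, a
scaling factor of 0.2 and an eight-hour period) are
example inputs. For a constant rate the numerical
integral reduces to the simple product form.

```python
from typing import Callable


def phe_constant(o: float, pfh_sum: float, t1: float, t2: float) -> float:
    """Product form: O * sum(PFH_d) * (t2 - t1) for a constant rate [1/h]."""
    if not 0.0 <= o <= 1.0:
        raise ValueError("scaling factor O must lie in [0, 1]")
    if t2 < t1:
        raise ValueError("t2 must not be smaller than t1")
    return o * pfh_sum * (t2 - t1)


def phe_integrated(
    o: float,
    pfh_of_t: Callable[[float], float],
    t1: float,
    t2: float,
    steps: int = 1000,
) -> float:
    """Integral form: O * integral of PFH_d(t) dt, using the midpoint rule."""
    dt = (t2 - t1) / steps
    integral = sum(pfh_of_t(t1 + (k + 0.5) * dt) for k in range(steps)) * dt
    return o * integral


if __name__ == "__main__":
    pfh_sum = 1.1e-5  # example summed rate in 1/h (illustrative value)
    o = 0.2           # example scaling factor (illustrative value)
    print(phe_constant(o, pfh_sum, 0.0, 8.0))
    print(phe_integrated(o, lambda t: pfh_sum, 0.0, 8.0))
```

The integral form is only needed when the rate is not
constant over the period considered; otherwise the
product form is sufficient.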


6 PROBABILISTIC RISK MODEL 


In automatic operation (index 1 below), closed 
guards of fully enclosed work areas hold back any 
released tool and/or workpiece fragments up to 
the withstand capabilities of the respective dimen- 
sioning regulations as in ISO 16090-1 (2016). Even 
with open protection guards as in setting mode 
(index 2 below), a full enclosure can also consid- 
erably reduce the failure probability of a techno- 
logical function (index “tech”) during setting mode 
indirectly, as explained in the “implicit fault detec- 
tion in the process”, see Moedden (2016). 

As regards parts of tools and/or workpieces released
after a failure in a control chain, the withstand
capability of the enclosure causes a strongly
reduced probability of occurrence of a technically
caused hazardous event, i.e. with $O_{1a} \ll 1$. For
other, non-ejection hazards even $O_{1b} = 0$ holds (with
$O_1 = O_{1a} + O_{1b}$). Therefore, $PHE_1$ during
automatic mode in the period $t_1 < t < t_2$ is:


$$PHE_1 = O_1 \cdot \int_{t_1}^{t_2} \sum PFH_{d,tech,1} \, dt \qquad (12)$$


In setting mode, a numerical value $PHE_2 \gg 0$
ought to be assumed for hazardous events because,
according to accident data for the entire population
of all machines, injuries occur more often when the
protective doors/guards are open than when they are
closed, see Moedden (2017).


6.1 Exposure reduction 


In the case of hazard exposure, reducing the
duration of exposure $t_{exp}$ and reducing the
repetition rate $f_{exp}$ act as a product to reduce
the relative exposure $F_{exp} = t_{exp} \cdot f_{exp}$.
As there is a kind of double proportionality, this can
lead to a considerable risk reduction, for example if
workpiece and tool changes are mainly carried out
automatically (instead of manually), so that the
product $t_{exp} \cdot f_{exp} \ll 1$ reaches very small
values. However, some recurring hazard exposure
events in setting mode (e.g. to prepare for automatic
operation) always remain. For the above-mentioned
setting operation in the period $t_1 \le t < t_2$, this
results in the following expression for the
probability of a hazardous event:


$$PHE_2 = t_{exp,2} \cdot f_{exp,2} \cdot O_2 \cdot \int_{t_1}^{t_2} \sum PFH_{d,tech,2} \, dt \qquad (13)$$


6.2 Increasing controllability/avoidability


If the above risk reduction measures have been
implemented as far as reasonably possible, a “zero
risk” can practically almost be reached acc. to
Moedden (2017). However, even then a very small
(theoretical) technical residual risk remains. For
this purpose, step 3 of the three-step method of risk
reduction is applied, the so-called instructive
safety; in risk models it is mostly described as
avoidability or controllability (parameter C) in
ISO 12100 (2010). In the above case of setting mode
(indicated as “2”), and considering the avoidability
effect with $C_2 > 0$, the remaining probability of
hazardous events in the period $t_1 < t < t_2$ is
calculated as follows in Eq. (14):


$$PHE_2^C = (1 - C_2) \cdot t_{exp,2} \cdot f_{exp,2} \cdot O_2 \cdot \int_{t_1}^{t_2} \sum PFH_{d,tech,2} \, dt \qquad (14)$$


$PHE_2^C$ can be equated with the probability of
injury, i.e. it corresponds to the injury risk during
setting.


6.3 Comparison of different activities


In accordance with the risk model in Eq. (14),
the probability of occurrence of a hazard event is
shown in the right-hand column of Table 4, and a
corresponding bar chart is shown in Figure 6. The
proportions of activities A, B, C, D and E in system
state 3 are taken from Table 3 as $F_{exp}$. The simple
comparison in Table 4 shows that not only the
$PFH_d$ values, in the range of some $10^{-6}\,h^{-1}$,
are decisive for the probability of occurrence of a
hazard event PHE. The other risk elements O,
$F_{exp}$ and C also have a considerable influence.
This can be seen because the bar chart of PHE values
in Fig. 6 differs from the bar chart in Fig. 5, which
merely shows the distribution of the summed $PFH_d$
values. Fig. 6, however, is more important for the
risk of injury than Fig. 5. Fig. 6 shows that the
highest probabilities of occurrence of hazardous
events are found in activities C and A of this
parameter study. Activities B, D and E are far less
hazardous.
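
The comparison of the five activities can be retraced
with a short calculation. The sketch below is an
illustration, not the authors' tool: the class and
variable names are hypothetical, and the two channel
rates are read as 6.5e-6 and 4.5e-6 per hour, an
interpretation of the table that is assumed here because
it reproduces the listed PHE values and their sum of
about 0.00007. Each PHE value is the product of
(1 - C), the relative exposure, the scaling factor O,
the summed rate and the period length, as in Eq. (14)
with a constant rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One 'risk snapshot' row (names are illustrative)."""
    name: str
    pfh_sum: float  # summed PFH_d,tech in 1/h
    hours: float    # length of the period in h
    o: float        # scaling factor O
    f_exp: float    # relative exposure
    c: float        # controllability / avoidability C

    def phe(self) -> float:
        # Eq. (14) with a constant rate over the period
        return (1.0 - self.c) * self.f_exp * self.o * self.pfh_sum * self.hours


# Assumed reading of the two channel rates in 1/h
RATE_1 = 6.5e-6
RATE_2 = 4.5e-6

SNAPSHOTS = [
    Snapshot("A", 2 * RATE_1 + 3 * RATE_2, 8.0, 1.0, 0.12, 0.0),
    Snapshot("B", 2 * RATE_1 + 3 * RATE_2, 8.0, 1.0, 0.0125, 0.0),
    Snapshot("C", 2 * RATE_1 + 3 * RATE_2, 8.0, 1.0, 0.2, 0.0),
    Snapshot("D", 1 * RATE_1 + 1 * RATE_2, 8.0, 0.2, 0.005, 0.5),
    Snapshot("E", 1 * RATE_1 + 1 * RATE_2, 8.0, 1.0, 0.003, 0.0),
]


def total_phe(snapshots: list[Snapshot]) -> float:
    """Sum of the PHE values of all snapshots."""
    return sum(s.phe() for s in snapshots)


if __name__ == "__main__":
    for s in SNAPSHOTS:
        print(f"{s.name}: PHE = {s.phe():.2e}")
    print(f"Sum: {total_phe(SNAPSHOTS):.5f}")
```

Ranking the rows by their PHE value rather than by the
summed rate shows directly why activities C and A
dominate: they share the same summed rate as B, but have
a much larger relative exposure.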

The coefficient $b_{34}$ in Figure 4 and Eq. (8*) is
time-dependent: $b_{34} = f(t) = f(i \cdot t)$. It results
from the sum of the PHE values in the right-hand
column of Table 4 as $b_{34} = 0.00007$; it has to be
multiplied by as many 8-hour shifts as are to be
considered. Because state 4 is an absorbing state,
$b_{44} = 1$. Based on an eight-hour shift, the
B-matrix in Eq. (2) and Eq. (8) can be extended
from 3×3 to 4×4. This means that the probability
of a hazard event can also be estimated.

The new B-matrix would look like this, Eq. (8*):


$$B = \begin{pmatrix} 0.6675 & 0.2 & 0.1325 & 0 \\ 0.992 & 0 & 0.008 & 0 \\ 0.992 & 0.005 & 0.003 & b_{34} \\ 0 & 0 & 0 & 1 \end{pmatrix}$$


Table 4. “Risk snapshots” and hazard probabilities.

Activity:        ΣPFH_d,tech                      t_2-t_1  O     F_exp    C     PHE
SF / HAZ.        [1/h]                            [h]      [-]   [-]      [-]   [-]

A: /SOS/ EN+CR   2·(6.5·10^-6) + 3·(4.5·10^-6)    8        1     0.12     0     2.5·10^-5
B: /SOS/ EN+CR   2·(6.5·10^-6) + 3·(4.5·10^-6)    8        1     0.0125   0     2.6·10^-6
C: /SOS/ EN+CR   2·(6.5·10^-6) + 3·(4.5·10^-6)    8        1     0.2      0     4.2·10^-5
D: /RSP/         1·(6.5·10^-6) + 1·(4.5·10^-6)    8        0.2   0.005    0.5   4.4·10^-8
E: /RSP/ EN+CR   1·(6.5·10^-6) + 1·(4.5·10^-6)    8        1     0.003    0     2.6·10^-7

Figure 6. Bar chart of the PHE values for the “Risk
Snapshots” above (x-axis: indicator for activity, y-axis:
values in [-]).


Table 5. Result for inhomogeneous Markov model.

Duration   i     P(i)^T
1 day      1     (0.6675, 0.20, 0.1325, 0.00007)
1 week     5     (0.748, 0.151, 0.101, 0.00035)
1 year     220   (0.789, 0.150, 0.1000, 0.0154)


The state vector in vector form is:
$P(i)^T = (Z1(i), Z2(i), Z3(i), Z4(i))$


From Table 5, it can be seen that the values Z1,
Z2 and Z3 are convergent, whereas Z4 is divergent,
i.e. steadily increasing.

It can also be derived that the probability
of system state 4 in a basic population of n
comparable machines reaches Z4 = 1 after 1 year if
n = 1/0.01538 = 65. This means that, under the above
assumptions, hazardous events that can cause
injuries are very probable.
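
The arithmetic behind these figures can be checked with
a few lines. The sketch below is a hedged illustration
with hypothetical function names: it assumes that the
coefficient for state 4 grows linearly with the number i
of eight-hour shifts, which matches the Z4 entries
listed for 1, 5 and 220 shifts, and it reads the
population size n as the number of machines for which
n times Z4 equals one.

```python
B34_PER_SHIFT = 0.00007  # sum of the PHE values for one 8-hour shift


def b34(shifts: int) -> float:
    """Coefficient for state 4 after a number of shifts (linear scaling assumed)."""
    if shifts < 0:
        raise ValueError("number of shifts must be non-negative")
    return B34_PER_SHIFT * shifts


def population_for_unit_sum(z4: float) -> float:
    """Population size n for which n * Z4 equals one."""
    if z4 <= 0.0:
        raise ValueError("Z4 must be positive")
    return 1.0 / z4


if __name__ == "__main__":
    for label, shifts in (("1 day", 1), ("1 week", 5), ("1 year", 220)):
        print(f"{label:>6}: i = {shifts:>3}, b34 = {b34(shifts):.5f}")
    print(f"n = {population_for_unit_sum(0.01538):.1f}")
```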


7 SUMMARY AND OUTLOOK 


The system states of a simplified Markov model 
of man-machine interaction in operating machine 
tools have been defined. A diagram for the state 
and transition probabilities was developed for 
parameter studies. 

The Markov model presented here has shown
that reducing operator exposure is as important
a means of reducing risk as the high reliability of
control functions. Consequently, the model can
also be used to compare safety measures with each
other in a scaled form. In doing so, the designer has
the opportunity to optimise effort and benefit. In
addition, the model can also weigh different effects
against each other in order to compensate, for
example, i) an increased exposure due to extending
the operating modes concept by ii) reducing the
probability of failure of relevant control chains that
can generate hazards if they fail. This approach
has proven to be effective over more than a decade.
A consideration of technical reliability versus
human reliability can also be performed in a scalable
frame, for example when clamping workpieces on
machines equipped with the relatively new technology
of vertical turning on milling machines, see
Wittstock (2017).

Five typical activities have already been exam- 
ined here in more detail to answer the key question 
above. Further operating modes are to be exam- 
ined, i.e. mode 3: Special mode and Service mode. 


NOMENCLATURE

Symbol         Dimension     Meaning

t_ref          h             Reference time step for the
                             Markov model
t_exp          h             Time duration of a manual
                             intervention
f_exp          1             Relative frequency of a
                             manual intervention
O, F_exp, C    Probability   Risk elements acc. to the
                             parameters of ISO 12100
PFH_d          1/h           Average prob. of dangerous
                             failure per hour. It is
                             equivalent to the failure
                             rate (i.e. the frequency with
                             which an engineered system
                             fails, often denoted by λ)
                             only for single i/o-channels
t_2 - t_1      h             Time period for integration
PHE            Probability   Prob. of hazardous event
P              Probability   State vector with prob.
B              Probability   Transition matrix with prob.
                             and coefficients b_ij


ABSTRACT: This paper examines the information flow of accident investigation results in the
construction industry, and how this affects learning processes. The aim is to document the state of
the art in the industry on the information flow that follows accident investigations, and to find
out how accident investigation results can be better learned from and used as input for proactive
safety management. A literature review was undertaken on the relation between accidents and
learning. An interview study was undertaken with different actors (clients, contractors and
consulting engineers) in the construction industry on accident investigations and the information
flow of accident investigation results. The preliminary results from the interviews are presented.
Results of accident investigations are mostly shared within a company, and there is no systematic
sharing of information between companies beyond occasional sharing. Further research needs on
information sharing after accident investigations are discussed.


1 INTRODUCTION 


The construction industry is a complex industry, 
with constantly changing processes and activities, 
several actors involved that depend on each other, 
and external factors, such as state of the market, 
that affect construction projects. 

Just as the industry varies a lot in terms of com- 
pany sizes, sizes of construction sites, resources 
available, and competence and experience of man- 
agers and workers, so are accident investigations 
and the results of them influenced by these charac- 
teristics and conditions. 

The aim of this research is to look closer at
safety in the construction industry by studying the
processes after accident investigations, which are
meant to enable learning and the prevention of
future accidents.

This paper presents preliminary results of 
research concerning the results of and the infor- 
mation flow after accident investigations, as it is a 
prerequisite for the learning processes. 


2 BACKGROUND 


2.1 The complexity of the construction industry 


A construction project can be many things: a small
cottage, a highway, a tunnel or a skyscraper.
“The construction of a building can be
regarded as a complex, information dependent,
prototype production process where conception,
design and production phases are compressed,
concurrent and highly interdependent in an
environment where there exists a usually large number
of internal and external uncertainties” (Pryke,
2012, p. 64). This shows how the construction
industry varies with regard to project size and
duration, company size, contract models and so
on. According to Lingard and Rowlinson (2005,
p. 3), the project structure in which companies in
the construction industry operate is an important
characteristic that challenges the safety work in the
industry. Each project site is different, forming a
new, temporary organisation. There is a large
circulation of personnel; some come in for as short as
a day, others stay for the whole project period.
Further, construction projects operate in an
ever-changing environment largely influenced by the
state of the market. All these variations influence
how safety work is implemented and executed in
the daily life of construction projects.

The socio-technical system involved in risk
management by Rasmussen (1997) illustrates the
different levels that influence safety at the
sharp end. The model includes environmental
stressors that can influence the different levels,
which in turn can influence overall safety.

Further, early project phases can affect the
following phases, i.e. decisions in early phases of
projects can influence safety during construction.
Therefore, it is important to look at the whole range
of actors involved in the project, not only during the
construction phase but also in earlier phases.


2.2 The construction industry in Norway 


The Norwegian construction industry employed
almost 235 000 people in 2016, and the industry
comprised more than 57 000 enterprises,
of which approximately 90% had 0-9 employees
and around 1% had more than 50 employees (SSB,
2017a).

The industry is one of the industries with the
highest number of work-related fatalities and
injuries in mainland Norway. Between 2010 and 2015,
there were in total 69 fatalities related to
construction work, i.e. close to 12 fatalities per year
(NLIA, 2016). Most of the fatal accidents involve
falls, followed by collisions, being hit by an object
or being crushed or trapped.

In 2016, there were 9 fatalities and more than
2700 reported injuries in the construction industry
(SSB, 2017b). More than half of these resulted in
long-term absence (more than three days’ absence
from work). The most frequent types of accidents
resulting in serious injuries were: fall from
roof/floor/platform, fall from scaffolding, contact
with a falling object, contact with moving parts of
a machine, being hit by an object in a lifting
operation, fall from ladder, and fall from height when
unsecured (NLIA, 2017).


2.3 Incident reporting and accident investigations 


Incident and accident reporting is an important
tool for understanding accidents, and thus for
learning and future accident prevention. Incidents and
accidents should be reported and investigated
internally to prevent them from reoccurring (Lingard
& Rowlinson, 2005, pp. 163-164), and in some cases
also investigated externally (e.g. incidents with high
or potentially high consequences).

According to the Working Environment Act
§5-2 (2005) in Norway, employers are obligated
to report accidents with fatal or serious outcome
to the Norwegian Labour Inspection Authority
(NLIA) or the Police. Nearly all fatalities are
reported; however, in the “serious injury” category
there are still unrecorded cases (NLIA, 2015a).
The NLIA conducts inspections after all fatal
accidents (NLIA, 2015b), but less severe accidents are
not examined as closely as accidents with severe
outcome.


2.4 Learning from accidents 


Learning from accidents is one of the goals of
accident investigations, both to prevent similar
accidents from reoccurring and to prevent other
accidents.

Learning can refer to either a product or a
process, i.e. something learned or the activity of
learning, respectively (Argyris and Schön, 1996, p. 3).
In relation to construction safety, the product
can be accident understanding, which is one of
the products of an accident investigation. To
improve safety and prevent similar accidents
from reoccurring, the knowledge obtained needs
to be applied. However, taking action might
require major changes (e.g. cultural and
behavioural) (Love et al., 2013), which in practice can
be challenging and will require designated resources
in a company. Correct actions taken after accidents
show the results of learning in practice, by
applying the obtained knowledge. This takes
learning one step further towards improvement.

Organisational learning can be divided into two
types. Single-loop learning refers to
learning that results in changes of action so that
the desired outcome is achieved. It does not change
the “theory of action” (Argyris and Schön, 1996,
p. 20), meaning that the focus is only on the symptoms
of the problem and not the underlying cause.
Therefore, this is a lower level of learning.
Double-loop learning, on the other hand, focuses on
changes “in values in theory in use” (Argyris and
Schön, 1996, p. 21), meaning that it goes deeper into
the root causes of the problem.

Organisation size, complexity, and number of
levels in the organisation are factors affecting
which type of organisational learning (single- or
double-loop) results (Argyris and Schön, 1996,
pp. 25-26).


3 METHODOLOGY 


3.1 Literature review 


A literature review was undertaken to get an
overview of literature treating the relation between
accidents and learning. The focus was on the
construction industry; however, papers discussing
other industries were also considered to find good
examples and general knowledge about the topics
of safety and learning.

Searches for literature were made in the following
databases: Scopus, Google Scholar and Oria. The
search terms used were: accidents, learning, and
construction. Searches were made both on two of
the words at a time and on all three at the same time.

In Scopus, the review was undertaken in a
systematic way, which means it is a replicable,
scientific and transparent process (Bryman, 2012,
p. 102). The search using “accidents” and
“learning” as search string returned 4,983 results. Of
the 4,983 documents found from 1930-2017, the
majority were published in the 2000s. After rounds of
sorting, first based on titles and years and then
on abstracts, in total 34 articles were found to be
the most relevant across different fields; however,
not all were accessible.

The aim of the literature review was to get a
background for the data collection.


3.2 Interviews 


The first round of interviews was undertaken with 
actors in the Norwegian construction industry on 
information flow and knowledge transfer after 
accident investigations. 

An interview guide was made with the fol- 
lowing topics: general introduction, procedures 
for accident investigations, results of accident 
investigations, information flow, learning arenas, 
improvement potential and closing questions. The 
questions in the interview guide were adjusted to 
the three different actors (clients, contractors and 
consulting engineers). 

In total 13 interviews with 19 persons responsi- 
ble for Health, Safety and Environment (HSE) at 
clients, contractors and consulting engineers were 
undertaken. Table 1 presents an overview of the 
interviewees. 

All the interviewed companies are large, profes- 
sional companies that are well established in the 
Norwegian construction industry. 

Interviewees were recruited through convenience 
selection, through contact persons in the industry. 
Further selection of interviewees will be done stra- 
tegically to cover the construction industry widely. 

The interviews were conducted between October
2017 and January 2018. Each interview took between
30 and 80 minutes; most took about an hour.
Eight of the interviews were conducted
in person and five by phone. All the interviews
except one were recorded and transcribed. For
the one that was not recorded, detailed notes were
taken. The interviews were transcribed in NVivo
and coded according to the interview guide.
Additionally, new codes were created while going
through the data. A first, preliminary review of the
data was done, resulting in main topics for discussion.


3.3 Methodological considerations


This research comprises only a small sample
of interviews and the preliminary results from
these. To get a more general picture of the state
of the art of the whole industry, more interviews
will be undertaken, and preferably other methods
should be used as well (e.g. questionnaire).


Table 1. Overview of interviewees.

Actors            Interviews       Documentation
Clients           2 interviews     1 company
Contractors       8 interviews*    3 companies
Consulting Eng.   3 interviews**   1 industry association

* one group interview with three interviewees.
** one group interview with representatives from five
companies.


4 FINDINGS 


4.1 Accident investigations in practice 


The research shows that there are large variations
in accident investigation practices
between companies (within different actors) in the
construction industry, and also between different
projects within a company. The variations concern
resources available, investigation competence
and investigation execution (i.e. methods used).
Companies have their own criteria for when and
how investigations should be performed, and these
also vary between the companies.

Mostly, the accident investigations are undertaken
separately by the different actors; however,
interviews are conducted with persons in other
companies during the investigations if this is relevant
for the investigation. This results in separate
investigations at clients, contractors and
sub-contractors if several of the actors decide to
undertake an investigation.

The consulting engineers reported that they are 
usually not involved in accident investigations, 
unless the unwanted event is directly caused by a 
calculation error performed by them. 

Some companies use external parties to undertake
investigations, others mainly perform internal
investigations. It was also reported by one
interviewee at a client that they sometimes have to
request that contractors perform accident
investigations.

Competence was seen as a success factor for
performing good investigations. However, the research
shows that the knowledge and experience of
HSE-managers about accident investigations varies. It
was pointed out by the interviewees that methods
and tools to be used for accident investigations
should be pre-defined and easy to use, to ensure
that investigations can start quickly after the
accident and are conducted in a way that supports
learning. Some of the HSE-managers
found this to be unclear in their company. One of
the interviewees stated that it is important to go
deeply into the causes to prevent future accidents:


“The most important thing is the learning one can
get out of accident investigations. That must be the
main goal. If you really manage to uncover the root
causes, that is when you have the opportunity to
prevent the same from happening again. That must be
the foremost goal.” (HSE-manager, contractor).

Further, some of the HSE-managers reported
that the roles and responsibilities, in terms of who
is responsible for the accident investigations and
their follow-up in the projects and in the
companies, are sometimes unclear.


4.2 Results of accident investigations 


The results of the accident investigations included
investigation reports, learning sheets and changes
of procedures. One of the companies had put
together experiences from several accidents into a
short film. Most companies finalise accident
investigations with an investigation report. The
reports vary between companies in size and content.

Learning sheets, which are short one-page
summaries of accidents, have become increasingly
popular. The drawback pointed out by some of the
interviewees is that they do not go deeply into the
causes and are more like event descriptions. Further,
some remarked that there is too much focus on these
learning sheets as an answer to the challenge of
learning and knowledge transfer.


“Generally, I think that there is a large focus on
learning sheets and sharing of learning sheets, as if
they solve everything. I think perhaps it is somewhat
too much focus on only this one solution”
(HSE-manager, contractor).


4.3 Information flow of the accident investigation 
results 


The results of the investigations are mainly dis- 
tributed within the company which undertakes the 
investigation. Some companies have systems for 
sharing results after accident investigations within 
the company, such as management systems, proce- 
dures, and best practice databases. 

Information sharing across companies is even
less systematised. There are no automatic
mechanisms for sharing results between companies. In
one contracting company, it was reported that if 
the unwanted event happened at a sub-contractor, 
and the main contractor or client investigated the 
event, the sub-contractor would have to request 
the report in order to get it. In the same way, the 
NLIA can request access to investigation reports 
from companies. It was also reported that the 
NLIA sometimes requests companies to make 
investigation reports. However, it was mentioned 
that this could affect what the companies put into 
the report, as they would not want to face addi- 
tional consequences. 

Further, “breakfast-meetings” that some
companies hold after accidents were perceived as very
good knowledge sharing arenas across companies.
At such breakfast-meetings, a company shares
experiences from an accident it thinks the industry
as a whole can learn from. These meetings are
held rather seldom, and are suited only for
certain types of incidents, e.g. general activities that
resulted in an accident or near accident, and where
good measures to prevent this type of accident have
been found.

Another arena for information sharing
mentioned by several of the interviewees was
workshops held by the NLIA. These workshops
were perceived as good for knowledge sharing.
Additionally, HSE-conferences (e.g. SHA-dagene,
HMS-konferansen) were other examples of
knowledge sharing arenas. These are large yearly
conferences, attended mostly by managers, with
some exceptions. However, not all the interviewees
were aware of these arenas, and it was pointed
out that the events are occasional.

How information is shared between companies
is to a large degree governed by the systems within
the companies and the contracts between companies.
It was also mentioned that a client or a main
contractor can put requirements regarding incident
and accident reporting into contracts to more easily
obtain safety information from projects.


4.4 Utilisation of accident investigation results 


Experiences of the interviewees show that
investigation competence in the investigation team is
important for the outcome of an accident
investigation. Further, the team composition with
regard to the members’ roles in the event is also
important, so that the persons are not too closely
related to the event or to persons affected by it. The
members of the team should not have a conflict of
interest with the investigation.

Further, it was mentioned by the interviewees
that certain events are better suited to learn and
share knowledge from than others. The outcome
or consequence of the event (e.g. fatality, serious
injury etc.) to a large degree influences how the
results are used further. In cases where the events
result in police investigations and legal proceedings,
there can be resistance that is a disadvantage for
the results and for the learning process. In
particular, near-misses and high potential incidents
(HIPOs) which have not become police cases are good
to learn from, as the question of guilt is to a larger
degree eliminated.

Several of the interviewees mentioned the
question of guilt as a factor that impedes knowledge
sharing, as this concerns the reputation of the
company, future projects as well as compensation
for injuries.

Further, it was mentioned that it can be
challenging to share information in cases of serious
injuries where police investigations are undertaken,
as these often take a long time. This leads to the
company accident investigation report being held
back and thus delayed, which also delays the learning
processes.


5 DISCUSSION 


5.1 Deficits with accident investigations 


To be able to learn from something that has
happened, information about what happened and why
it happened is needed. Accident investigations are
important to gain this information. Gibb et al.
(2014) highlight the importance of going in depth
into accidents and finding their underlying causes
for good learning outcomes.

In the construction industry in Norway there
are no standardised methods to investigate
accidents. Accidents vary widely when it comes to
type, size and severity, and different accidents may
therefore require different types of investigations.
One important issue to research in relation to acci- 
dents is how accidents are selected for investigation 
(Lindberg et al., 2010). The criteria for investigat- 
ing accidents and the degree of investigations vary 
between companies, and even between projects 
within a company. This is a challenge when it 
comes to learning after accidents, as the investiga- 
tions vary and thus give different foundations for 
further work with safety. 

The quality of accident investigation results is to
a large degree dependent on the investigation team:
their relation to the accident and to the company,
knowledge about the industry, and investigation
knowledge and experience. The knowledge and
experience of the responsible persons in the companies
vary, as seen in the interviews undertaken, and
this affects the outcomes of the investigations. Le
Coze (2013) highlights the importance of expertise
on accident models in order to apply them properly.
This was also stated by some of the interviewees.

Further, as mentioned, the construction indus- 
try is characterised by having many actors, many 
phases, and constant progress and changes in the 
projects. The cooperation between levels of actors, 
between different phases of construction projects, 
between companies in the same phase performing 
different operations, and between operations within 
a company is important for good safety. From 
what is seen in the interviews, there is not much 
cooperation on accident investigation between 
companies that are involved in an unwanted event. 
Mostly, the investigations are performed separately
by each company when several companies undertake
investigations. The weakness with this is that
important viewpoints and the causes behind the
event can be overlooked, due to lack of specialised
knowledge (e.g. when consulting engineers are not
involved), but also that other involved companies
do not get access to or ownership of the investigation
results and the measures suggested in the investigation
report to prevent future similar accidents.
Lundberg et al. (2009) write in their paper about 
WYLFIWYF (What You Look For Is What You 
Find), which shows the importance of using several 
perspectives in accident investigations, whether it is 
accident models, methodologies or specialists. 

If the aim of the investigation is also to learn 
from what has happened, the learning perspective 
should also be integrated into the investigation, to 
provide for information that will lay the ground- 
work for learning. 


5.2 Knowledge transfer as a premise for learning 


Nonaka and Takeuchi (1995) describe knowledge 
as different from information as it is about beliefs, 
commitment and actions. Both have in common 
that they are about meaning. Simply said: “Information
is a flow of messages, while knowledge is created
by that very flow of information, anchored in
the beliefs and commitment of its holder” (Nonaka 
and Takeuchi, 1995). 

After accident investigations, the knowledge 
obtained needs to be shared if learning from previ- 
ous accidents is the goal. The importance of how 
this knowledge is shared for learning is also high- 
lighted by Lindberg et al. (2010). Drupsteen and 
Guldenmund (2014) point out that there often are
limited processes to follow up learning after accidents,
and that such knowledge is often shared
through one-way communication, which does not
encourage interaction and thus learning processes.
The findings of the current research are similar:
uncertainties about who should follow up the accidents
were found, as well as examples of one-way
dissemination of accident investigation results (e.g.
learning-sheets). 

Further, the way information is shared is another 
challenge in the industry seen from the interviews. 
Internally, companies might have some systems or 
ways to share information, however they are not 
necessarily good enough to share information with 
all levels in the company. Within companies, results 
are often shared through learning sheets. These are 
meant as an information sharing arena for all lev- 
els in the company; from the top to the sharp end. 
However, different users require different degrees 
of details of the information. For example, for other 
HSE-managers, the information which is on the 
learning sheets might be too vague to be useful for 
safety work. 

A further deficit shown by the research is that
this information and knowledge is not shared on a large
scale across companies. A good platform
for sharing information across the industry is missing,
even though there are a few conferences and
other smaller arenas where some experiences can
be shared. A knowledge sharing platform can be
one solution for sharing information and experiences
across companies in the industry, e.g. a common
database. An accident database could be
used to collect all severe accidents in the construction
industry. Accidents with a potentially serious
outcome also need to be registered. Having set criteria
for systemising the accident types, causes and
possible use would be useful for the users of such a
database.

The knowledge from the investigations can serve 
several purposes such as input for risk assessments, 
decision making and to create awareness about 
important circumstances that can affect safety. 
To make use of such information companies and 
the industry need to have certain tools available. 
Information needs to be shared internally in the 
company, and externally for the whole industry to 
improve. 

Lingard and Rowlinson (2005, p.366) write that 
learning from past accidents is important for safety 
management, and in an organisational context an
incident information system must be available
to collect and analyse data and create preventive measures.
However, only having a system is not enough
according to them. It is also important to be aware
of how the organisation is currently running and
to have a vision for the desired safety work and
performance; the management’s safety focus and
safety work being an integrated part of the operations
are also highlighted.


5.3 Learning 


According to Nonaka and Takeuchi (1995) learning 
can be looked at as a dynamic spiral, between the 
two learning loops that Argyris and Schön (1978) 
describe. The spiral goes between tacit and explicit 
knowledge and from explicit to tacit through four 
phases. The spiral goes on as “organisational knowl- 
edge creation is a continuous and dynamic interac- 
tion between tacit and explicit knowledge” (Nonaka 
and Takeuchi, 1995, p. 70). 

Lingard and Rowlinson (2005, p.365) highlight 
the need for collective learning in the construction 
industry and the current lack of this. They write 
that similar accidents reoccur in the industry across 
countries, and yet the industry does not manage to 
improve the occupational safety enough. 

In relation to the construction industry and 
safety, it is therefore important to acknowledge 
the individuals in the organisation when creating 
and implementing measures for accident prevention.
In the same manner, during accident investigations,
tacit knowledge should be a part of the
information foundation in an investigation, as well as
when it goes back as learning points.

Drupsteen and Guldenmund (2014) point out 
that it is hard to identify organisational factors 
and managerial weaknesses that are root causes of 
events, which limits the possibility of double-loop
learning. Le Coze (2013) suggests more cross-disciplinary
research on learning from accidents. This
shows the need for more research on the topic, and 
combining different topics together. 


5.4 Input for safety management 


Which results from an accident investigation can
be used for proactive safety management
depends on the type of accident, the outcome of
the investigation as well as the way the information
is shared. It is suggested to make specifications
and criteria related to characteristics of accidents 
(e.g. types of accidents, causes, processes) in order 
to decide what learning purposes they can serve. 

Results of accident investigations can for example 
be used as input for proactive safety management, 
e.g. in the Safety Management System (SMS) of a 
company, and as an input in building information 
models (BIM) which can include early phase actors 
(i.e. consulting engineers) in the learning loop. One 
of the challenges for consulting engineers when it 
comes to occupational safety during construction, 
is that they only to a small extent get feedback on whether
their solutions could be executed safely in practice by construction
workers. By using new solutions and tools
(e.g. digital solutions), these actors could more easily be
involved in occupational safety work. 

The results can also be adapted to serve safety 
purposes in different processes during a construc- 
tion project. Procurement processes both in early 
phases of a project as well as during construction lay
a foundation and set boundaries for safety. It is important
to transfer knowledge also to these processes
from accident investigations, to reduce the safety 
risks during construction. 


5.5 Coping with the diversity of actors 


The industry is, as mentioned before, diverse and 
numerous actors are involved in construction 
projects. This diversity poses a challenge in relation
to learning, as different actors have different
needs and requirements. This means that adapta- 
tion is required when it comes to ways of sharing 
knowledge and learning. A flow of information is 
required both between the different levels, and at 
each of the levels. 

To analyse this diversity of actors and activities 
Pryke (2012) suggests a graphical representation 
and a social network analysis of how the specific 
actors and activities are related. Such an analysis
could be linked to safety management. Mapping
the relations and information flows after accident 
investigations might give a better understand- 
ing of deficits in communication and knowledge 
sharing. The different actors in the actor chain in
the construction industry introduce boundaries
and conditions that affect safety. A social network 
analysis can also be used to map other factors such 
as frame conditions (e.g. contract conditions) and 
how they affect different actors (Pryke, 2012). 

Different actors have different roles in safety
management, and this also applies to learning.
Mapping relations and information flow in the
construction network, and finding out who has which
needs and who should facilitate whom, can help in
knowledge transfer and learning.

It is suggested to perform a similar analysis as 
the social network analysis on safety information 
flow after accident investigations. 
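
As a rough illustration of what such a mapping could look like, the sketch below represents the flow of investigation results as a directed graph and counts, for each actor, how many others they send results to and receive results from. The actor names and the flows are purely hypothetical and are not taken from the interviews; a full social network analysis in the sense of Pryke (2012) would use richer measures than these simple counts.

```python
from collections import defaultdict

# Hypothetical information flows: (sender, receiver) of investigation results.
FLOWS = [
    ("contractor", "client"),
    ("contractor", "subcontractor"),
    ("client", "contractor"),
    ("contractor", "site workers"),
]
ACTORS = ["client", "consulting engineer", "contractor", "subcontractor", "site workers"]


def degree_table(actors, flows):
    """Return {actor: (out_degree, in_degree)} for a directed flow list."""
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for sender, receiver in flows:
        out_deg[sender] += 1
        in_deg[receiver] += 1
    return {a: (out_deg[a], in_deg[a]) for a in actors}


def isolated(actors, flows):
    """Actors that neither send nor receive any investigation results."""
    table = degree_table(actors, flows)
    return [a for a in actors if table[a] == (0, 0)]


if __name__ == "__main__":
    for actor, (sent, received) in degree_table(ACTORS, FLOWS).items():
        print(f"{actor:20s} sends to {sent}, receives from {received}")
    print("Not in the loop:", isolated(ACTORS, FLOWS))
```

In this made-up example the consulting engineer is the only actor outside the information loop, which is the kind of deficit such a mapping is meant to expose.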


6 CONCLUSIONS 


Preliminary findings of the research show the 
importance of coordination and cooperation 
between actors of construction projects. Accident
investigations are important to prevent
similar accidents from reoccurring; however, there are elements
that hinder knowledge transfer and learning
from earlier accidents. Accident investigations
vary to a large degree between projects, clients and
contractors. Often each actor performs individual 
investigations with limited sharing of the results 
across companies. Consulting engineers are rarely 
involved in investigations, unless the problem has 
clearly been related to calculations. 

Having a good knowledge foundation, based on 
facts including root causes, is crucial to ensure that 
correct measures are taken after accidents and to 
enable learning. This requires competence
and experience in the investigation team.
It was found that experience and knowledge about
accident investigations among the persons responsible for HSE
in companies vary considerably.

To obtain learning after accident investigations, 
information must be shared. Certain types of acci- 
dents are more suitable for sharing and learning 
purposes, e.g. near misses and high potential inci- 
dents. It is suggested to make specifications and 
criteria related to characteristics of accidents for 
specific learning purposes. 

Information sharing after accident investigations
mostly happens within companies. Between
companies, knowledge sharing and learning is not
systemised and occurs only occasionally, and tools to
share information between companies are lacking. 

The large diversity of actors in the industry 
challenges practices, information sharing and
learning processes. To enable learning in the industry
both across organisations and within organisations,
there is a need to understand the different
relations between actors, processes and needs. A 
social network analysis of the information flow of 
the results after accident investigations in the con- 
struction industry might help to find the deficits 
as well as the centre points of communication and 
relations between actors, that might enable knowledge
transfer and learning.

ABSTRACT: Organizations and industries must ensure safe operation of their facilities, employing
rigorous risk management techniques for planning and executing their activities. The use of Personal 
Protective Equipment (PPE) represents the closest layer of protection to workers, and can considerably 
reduce the risk of exposure to hazards, being critical for safety in industrial environments. The hazards 
addressed by protective equipment include physical, acoustic, electrical, heat and chemicals. Despite the 
substantial efforts in increasing awareness about the benefits of PPE to strive towards zero accident phi- 
losophy, operators often neglect its use when not being supervised. However, organizations commonly 
have surveillance cameras installed which might provide useful visual information on correct usage of 
PPE. In this context, computer vision is an interdisciplinary field that seeks to automate tasks that the 
human visual system can do and includes domains of signal and image processing, pattern recognition 
and artificial intelligence. Moreover, object recognition is a prominent technology from computer vision 
for finding and identifying objects in an image or video sequence. This work therefore aims to create an
automatic PPE detection system for surveillance cameras and other video streams using computer vision and
machine learning. Equipment such as helmets, safety glasses, earplugs and other garments is checked
for whether it is being used by operators in real-time monitoring, alerting supervisors to prevent
accidents and ensure a safer environment.


1 INTRODUCTION 


Even with the scientific and technological progress, 
statistics provided by the International Labour 
Organization (ILO) demonstrate that work- 
ing conditions in many countries (e.g. European 
Union) have not changed to such a degree as to 
significantly reduce the problem of occupational 
injuries (Cavazza & Serpe, 2009). Therefore, every 
effort to decrease the number of accidents or, at 
least, maintain its rate at an acceptable range is 
highly important, and can be employed either by 
organizational actions, collective training or indi- 
vidual safeguard. 

The traditional approach to avoid loss is the 
implementation of barriers, which plays a central 
role in the prevention of accidents. Sklet (2006) 
defines safety barriers as ‘physical and/or non- 
physical means planned to prevent, control, or 
mitigate undesired events or accidents’. 

Indeed, there are many opportunities to interrupt
or change an accident sequence of events
before it evolves into a loss. First, one option is to
change the preconditions for an accident to occur 
by eliminating the energy source or modifying 
the energy characteristics from the hazard. Sec- 
ond, barriers may interrupt, dilute, or redirect the 
energy flow during the latter part of the accident 
process (e.g. separating the victim from the energy 
flow). As a last barrier, it is possible to improve the 
victim’s ability to endure the energy flow (e.g. 
wearing some protective equipment), which is the 
ultimate protection to avoid damage (Kjellén and 
Albrechtsen, 2017). 

In this context, Personal Protective Equipment 
(PPE) is usually adopted to protect the individual 
against health or safety risks at work. It includes 
items related to protection of head, face, eye, 
hand, arms, and legs (Health and Safety Execu- 
tive, 2013). There are consolidated regulations 
for the usage of PPE in industries (Occupational 
Safety and Health Administration—OSHA 2004; 
U.S. Homeland Security 2002) that aim to decrease 
the frequency of misuse or absence of PPE. Also,
PPE’s positive impacts are very significant (e.g. the rate
of eye injury and lost work time can be reduced by
50% or more when PPE is worn) (Lipscomb, 2000).

The head, as a vital body part containing possibly
the most important human organ, needs appropriate
attention. Every year, approximately 1.7 million
people are hospitalized or die as a result of
a traumatic brain injury (TBI) in the United
States alone (McCrory et al., 2009). Protective headgear
and helmets decrease the potential for severe TBI 
following a collision by reducing the accelera- 
tion of the head upon impact, thereby decreasing 
both the brain-skull collision, as well as the sud- 
den deceleration induced axonal injury (Newman 
et al., 2005). 

There are several types of head protection, such
as industrial safety helmets, bump caps and firefighters’
helmets. The use of such equipment is
necessary in activities involving low-level fixed objects
with a risk of collision (e.g. pipework, machines,
scaffolding) and transport activities involving the
risk of falling material (e.g. hoists, lifting plant, 
conveyors) (Health and Safety Executive, 2015). 
The energy absorbing material of a helmet com- 
presses itself to absorb force during the collision 
and slowly restores itself to its original shape. This 
compression and restoration has the effect of pro- 
longing the duration of the collision, while reduc- 
ing the total momentum transferred to the head 
(Pellman et al., 2006). 

The problem lies in the fact that, even when
the safety improvement that the use of PPE brings
is understood, its use is often neglected
in industry. The report of the ILO estimates that
2.34 million people die every year in the world due 
to occupational accidents, some of these deaths 
caused by non-use of PPE (International Labour 
Office, 2011). A common approach is to impose
fines and penalties on workers who do not wear
the required PPE when performing specific activities.
However, supervision to guarantee its use is
normally performed in person by a higher-level
employee, which makes it almost impossible to control
all operators during the whole working time.

Indeed, there is an extensive discussion concern- 
ing ethical issues in workplace surveillance, refer- 
ring to management’s ability to monitor, record 
and track employee performance, behaviors and 
personal characteristics in real time (Ball, 2010). 
Most of the discussion involves the so-called Elec- 
tronic Performance Monitoring (EPM) about 
employee’s control in social and technological 
forms (e.g. Internet and email monitoring, location 
tracking, biometrics) and the understanding of 
privacy boundaries surrounding employee information
(Alder, 1998; Allen et al., 2015). However,
our aim is to ensure that proper safety protocol
is followed, preventing injury to employees
as well as avoiding damage to the assets, through a
consistent and trustworthy model.

Therefore, an automatic method for monitoring
PPE usage is presumably of significant value for
industrial safety, representing an impactful opportunity
for the use of Computer Vision (CV). CV is
an interdisciplinary field aiming to investigate and
develop computers with high-level understanding
of digital images or videos, describing the
world that we see and reconstructing its properties
(Szeliski, 2010). From the perspective of engineer- 
ing, it seeks to automate tasks that the human 
visual system can do (Sonka, Hlavac and Boyle, 
2008). The development of high-powered comput- 
ers, the availability of high quality and inexpensive 
video cameras, and the increasing need for automated
video analysis have generated a great deal
of interest in object tracking algorithms in the CV
field (Yilmaz, Javed and Shah, 2006). Therefore,
this paper aims to develop a model for automatic 
PPE detection from industrial video streams using 
computer vision and machine learning, employ- 
ing modern technologies to create tools capable of 
modifying and innovating methods currently used 
in industries. 

The rest of this paper is organized as follows:
Section 2 introduces some ideas and concepts
on CV, while Section 3 presents some works on
object/person detection, describing the methodology
applied in PPE detection. Section 4 presents
the developed model. Section 5 discusses
possible uses for decision support and Section
6 presents concluding remarks.


2 COMPUTER VISION 


Computer Vision (CV) studies the automated 
extraction of information from images and videos. 
Information can mean anything from 3D models, 
camera position, object detection and recognition 
to grouping and searching image content (Solem,
2012). CV gathers knowledge from many
fields, such as image processing, pattern recognition,
mathematics and artificial intelligence. One of
its main goals is to enable computers to reproduce
core functions of human vision, such as motion 
perception and scene understanding. 

Hence, visual object tracking has been studied
extensively and involves three key steps in
video analysis: detection of movement
of objects, tracking of such objects from frame to 
frame, and analysis of object tracks to recognize 
their behavior (Yilmaz et al., 2006). Essentially, the 
basis of visual object tracking is to robustly esti- 
mate the motion state (i.e., location, orientation, 
size, etc.) of a target object in each frame of an 
input image sequence (Li et al., 2013).

Specifically, intelligent visual surveillance systems
deal with the real-time monitoring of persistent
and transient objects within a specific
environment (Valera & Velastin, 2005). The goal
of these systems is not only to put cameras in the
place of human eyes, but to create an entire surveillance
system as automatically as possible (Hu
et al., 2004).

There exist some well-known visual surveillance
systems such as W4 (Haritaoglu et al., 2000),
Haar-wavelet AdaBoost (Enzweiler & Gavrila,
2009) and ViBe (Barnich & Van Droogenbroeck,
2011), mainly developed to detect different vehicle
types, groups of people and pedestrians, and for people access
control. Every system is developed seeking to compensate
for the limited capability of human operators
to monitor an enormous number of cameras
at the same time. Thus, exploring similar tools and
challenges to detect usage of PPE in order to avoid
accidents in industries represents an interesting
case.


3 REAL TIME OBJECT DETECTION 


Techniques from statistical pattern recognition 
have, since the revival of neural networks, obtained 
a widespread use in digital image processing 
(Egmont-Petersen et al., 2002). Due to the out- 
standing work of Krizhevsky et al. (2012) for image 
classification, Deep Neural Networks (DNN) have 
been successfully studied in different fields of 
application such as speech recognition (Hinton 
et al., 2012), vibration analysis (Guo et al., 2016), 
electronic nose data (Langkvist et al., 2013) and 
physiological data (Mirowski et al., 2008). However,
arguably the most promising results are found
in the field of computer vision, bringing impressive
developments in tasks like automatic object and face
recognition.

One promising, open and free project that uses
DNN for object detection is You Only Look Once
(YOLO). YOLO is a system for detecting objects
and was first developed using the Pascal VOC 2012 dataset,
detecting the 20 Pascal object classes, such as
person, bird, dog, car, bicycle, bottle, table and
chair, as can be seen in Fig. 1 (Redmon et al.,
2015).

Figure 1. Example of object detection using YOLO.
Adapted from Redmon et al. (2015).

The developers adopt a different approach from
standard object detection models, which use classifier-based
systems applied at multiple locations
and scales in an image and typically treat
high-scoring regions of the image as detections. In
YOLO, a single DNN is applied to the full image.
This network divides the image into regions and
predicts bounding boxes and probabilities for each
region, with these bounding boxes being weighted
by the predicted probabilities. It has considerable
advantages over other object detection models, since
it looks at the whole image, so that its predictions
are informed by global context in the image
(Redmon et al., 2015). It also makes predictions
with a single network evaluation, making YOLO
extremely fast, allowing usage even on computers
without a Graphics Processing Unit (GPU).
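
To make the weighting of boxes by predicted probabilities more concrete, the following minimal sketch scores each predicted box by multiplying its box confidence with its highest class probability and keeps only boxes above a threshold. The `Prediction` type, the data layout and the 0.5 threshold are assumptions made for illustration; this is not the YOLO implementation itself, which additionally removes overlapping boxes by non-maximum suppression.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Prediction:
    box: Tuple[float, float, float, float]  # (x, y, w, h), normalised to the image
    confidence: float                        # confidence that the box contains an object
    class_probs: Dict[str, float]            # conditional class probabilities


def score_detections(preds: List[Prediction], threshold: float = 0.5):
    """Return (label, score, box) for boxes whose best class score passes the threshold."""
    kept = []
    for p in preds:
        label, prob = max(p.class_probs.items(), key=lambda kv: kv[1])
        score = p.confidence * prob  # box confidence weighted by class probability
        if score >= threshold:
            kept.append((label, score, p.box))
    return kept


if __name__ == "__main__":
    # Two made-up predictions: one confident "person", one weak ambiguous box.
    preds = [
        Prediction((0.5, 0.4, 0.2, 0.3), 0.9, {"person": 0.8, "dog": 0.2}),
        Prediction((0.1, 0.1, 0.1, 0.1), 0.3, {"person": 0.5, "dog": 0.5}),
    ]
    print(score_detections(preds))
```

Only the first box survives in this example, since its combined score (0.9 x 0.8) exceeds the threshold while the second (0.3 x 0.5) does not.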

Still, an improved model, YOLOv2, has already
been developed. More robust and able to detect more than
9000 object categories without losing real-time performance,
YOLOv2 is a state-of-the-art object detection
system with results comparable to or even better
than many other systems (Redmon & Farhadi,
2017). Moreover, the YOLO project is open, well
described, easily explained and user-friendly for
anyone who has some background in computer programming.
It even demonstrates how to include objects
that were not among its original detection classes, and how to process
and train a new model, allowing adaptation for different
purposes.

Hence, using YOLOv2 as a key tool, we trained a 
new model to automatically detect PPE usage. Specifically,
we were interested in identifying whether
or not workers were wearing a safety helmet when
performing some activities in which the protection 
was required. 


4 HELMET DETECTION USING YOLO 


The YOLO project provides a pre-trained model,
which can be used as a basis for detecting new
types of objects. Like any machine learning algorithm,
YOLO requires a training dataset that will
‘teach’ the machine what an unknown object looks
like. For our specific goal, 731 images containing
helmets were used to give sufficient information
about their appearance. All images were obtained
from ImageNet (Deng et al., 2009), and the locations
of helmets in the images were annotated manually.
ImageNet is an image database organized according
to the WordNet (Fellbaum, 1998) hierarchy,
in which each node of the hierarchy is depicted
by hundreds and thousands of images. It is a
useful resource for researchers who need image
data, containing numerous classes of items.
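
Manual annotation typically yields a pixel bounding box for each helmet. As a sketch of how such a box could be turned into a training label, the helper below converts pixel corners into the normalised `class x_center y_center width height` line used by Darknet-style YOLO label files. The function name, the class index 0 for helmets and the example numbers are illustrative assumptions, not details of the annotation procedure described above.

```python
def to_yolo_label(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel bounding box to a normalised YOLO-style label line."""
    if not (0 <= x_min < x_max <= img_w and 0 <= y_min < y_max <= img_h):
        raise ValueError("box must lie inside the image and have positive size")
    # Centre and size are expressed as fractions of the image dimensions.
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


if __name__ == "__main__":
    # A 100x80 pixel helmet box in a 640x480 image (made-up numbers).
    print(to_yolo_label(0, 270, 200, 370, 280, 640, 480))
    # prints: 0 0.500000 0.500000 0.156250 0.166667
```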

The network was trained for about 8 hours, running
on an Nvidia GeForce GTX 960M GPU with
4 GB of video random access memory (VRAM).
Once training was finished, our helmet
detection model could be applied to a specific
image or to a video stream, such as a camera feed,
processing every frame. Our model runs in real
time, maintaining the frame rate of the camera (30
frames per second, FPS). Fig. 2 depicts the model
applied to the video stream of a standard web camera.

Then, a script was created to alert surveillance
operators when an abnormal situation arises
(i.e. helmets are not detected). Owing to the simplifications
adopted, the model is intended to be applied in
a room containing a specific number of employees
who should be wearing helmets. For each frame, our
algorithm detects the helmets in use and counts
how many are present in the scene. If the number
of detected helmets differs from the previously
defined number of people in the room for more
than a brief period (e.g. 10 seconds or 30 seconds),
then an alert is emitted. Fig. 3 shows a computer
screen when an alert is presented (i.e. one person
is not wearing a helmet in the image).
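
The alert rule described above can be sketched as a small state machine: it receives the helmet count for each frame and signals an alert once the count has differed from the expected number of people for longer than a tolerance period. The class and parameter names are hypothetical, the detector itself is left out, and timestamps are passed in explicitly so that the logic can be tried without a camera.

```python
class HelmetMonitor:
    """Signal an alert when the helmet count stays wrong for too long."""

    def __init__(self, expected_people, tolerance_s=10.0):
        self.expected_people = expected_people
        self.tolerance_s = tolerance_s
        self._mismatch_since = None  # time at which the current mismatch began

    def update(self, helmets_detected, now_s):
        """Process one frame; return True if an alert should be shown."""
        if helmets_detected == self.expected_people:
            self._mismatch_since = None  # situation is normal again
            return False
        if self._mismatch_since is None:
            self._mismatch_since = now_s
        return now_s - self._mismatch_since > self.tolerance_s


if __name__ == "__main__":
    monitor = HelmetMonitor(expected_people=2, tolerance_s=10.0)
    # (time in seconds, helmets detected) for a few sample frames
    frames = [(0.0, 2), (1.0, 1), (5.0, 1), (12.0, 1), (13.0, 2), (14.0, 1)]
    for t, count in frames:
        print(t, count, monitor.update(count, t))
```

With these sample frames, the alert is raised only at 12 s, after the mismatch that began at 1 s has lasted longer than the 10 s tolerance, and it is cleared as soon as the count returns to normal.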

Figure 2. Helmet detection model applied to a video
stream.

Figure 3. Alert emitted when the model detects an anomalous
situation.

Moreover, if the number of detected helmets
does not return to its normal value, alerts may
continue to be emitted, but the period between
alerts may be defined as desired (e.g. 30 seconds or
two minutes). Properly adjusting this period avoids
unnecessary alerts that would continually distract
the surveillance operator even after recognition of
the first anomalous situation, while still reminding them that
an abnormal situation is ongoing.
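
The repeat period can be handled separately from the detection itself. A minimal sketch, assuming a hypothetical `AlertThrottle` helper that is asked on every frame whether an ongoing anomaly should be announced again:

```python
class AlertThrottle:
    """Limit how often an ongoing anomaly is announced to the operator."""

    def __init__(self, repeat_every_s=30.0):
        self.repeat_every_s = repeat_every_s
        self._last_alert_s = None  # time of the most recent announcement

    def should_notify(self, anomaly_active, now_s):
        """Return True when an alert should be shown at time now_s."""
        if not anomaly_active:
            self._last_alert_s = None  # reset so the next anomaly alerts at once
            return False
        if self._last_alert_s is None or now_s - self._last_alert_s >= self.repeat_every_s:
            self._last_alert_s = now_s
            return True
        return False


if __name__ == "__main__":
    throttle = AlertThrottle(repeat_every_s=30.0)
    # (time in seconds, is the anomaly still active?)
    for t, active in [(0, True), (10, True), (30, True), (45, True), (50, False), (51, True)]:
        print(t, active, throttle.should_notify(active, t))
```

Choosing a longer `repeat_every_s` trades fewer interruptions for a slower reminder; the value is an operational choice rather than something the sketch prescribes.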


5 SUPPORTING DECISIONS 


In practice, the presented model could be explored 
as a tool in different contexts, supporting decisions 
for the safety manager. The idea was to develop the 
model to be highly adaptable and manageable for 
various situations, providing specific information 
accordingly. 

For example, as mentioned above, the alert period
is easily adjustable to avoid unnecessary warnings.
Still, other types of warnings (e.g. depending on
how long operators remain without PPE, or how
many operators are not wearing the PPE) could
be easily implemented and customised, providing
information for the decision-maker to determine
whether or not someone must be notified. Moreover,
it is also possible to use information and statistics
provided by the model (e.g. how many times
alerts were displayed per day, or how long operators
remained without PPE) as a safety indicator.
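
As an illustration of such indicators, the sketch below summarises a log of alert episodes into the number of alerts per day and the total time without PPE. The log format (a start and an end timestamp per episode), the function name and the example data are assumptions; the alert script would need to record these events for the calculation to apply.

```python
from collections import Counter
from datetime import datetime, timedelta


def summarise_alerts(episodes):
    """Summarise (start, end) datetime pairs during which PPE was missing.

    Returns (alerts_per_day, total_time_without_ppe).
    """
    per_day = Counter(start.date().isoformat() for start, _ in episodes)
    total = sum((end - start for start, end in episodes), timedelta())
    return dict(per_day), total


if __name__ == "__main__":
    log = [  # made-up example data
        (datetime(2018, 6, 18, 9, 0, 0), datetime(2018, 6, 18, 9, 0, 40)),
        (datetime(2018, 6, 18, 14, 5, 0), datetime(2018, 6, 18, 14, 7, 0)),
        (datetime(2018, 6, 19, 10, 30, 0), datetime(2018, 6, 19, 10, 30, 20)),
    ]
    per_day, total = summarise_alerts(log)
    print(per_day)  # alerts counted by the day on which they started
    print(total)    # total duration without PPE
```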

Still, implementation of a real-time alert for
the operators (e.g. a particular warning light lit
somewhere in the room) connected with the model
would emphasize (or create) a sense of self-regulation
among them, reducing (or sharing) the surveillance
workload expected of the supervisor.

With further and wider adaptations, this surveillance
technology could also be implemented to monitor
barriers other than PPE. For the initial
barriers, where reliable sensors are usually already
available, CV could act as a redundancy, rather
than substituting existing technologies, aiming to
improve the effectiveness of hazard detection (e.g. fire,
toxic gases) in order to interrupt the energy flow in
case of accidents. For inner barriers, it is possible
to create alerts and warnings when an operator
approaches a danger zone, based on images of
the area. For both of these barriers, CV
would help to eliminate or reduce the consequences
of unwanted energy flow, rather than dealing with
the last safety impediment represented by the PPE.


6 CONCLUSIONS 


This paper presented an approach for automati- 
cally detecting PPE usage in a controlled environ- 
ment, using object detection with YOLO. By using 
YOLO, this method achieves a reasonable balance
between speed and confidence, running in real
time with relatively low computational
resource usage. Moreover, it is possible to adjust
the model to different scenarios according to specific
requirements. This could lead to beneficial
results for safety engineering, since the detection is
performed automatically and does not require con- 
stant human attention. 

In our ongoing research, we aim at
extending this model for application in a wider
range of situations. For instance, YOLO can be
trained to identify other types of PPE, so that it
could be used for simultaneously monitoring the usage
of different PPE. Also, the script used for alerts
could be improved, allowing this method to cover
a wider range of scenarios. A further step is to
use real surveillance videos as input, detecting the
usage of PPE in a realistic environment, preventing
accidents and providing an improvement to the
safety monitoring systems of industries.

ABSTRACT: The Norwegian police have a tradition for reticent use of force. Norway is one of the very
few countries in the world with unarmed police on daily duty. The aim of this paper is to study the degree
to which training in the use of force at the Norwegian Police University College (NPUC) and in the police districts
reflects the need for reliable handling of the various weapons at the disposal of the police officers,
and how experiences of the use of force gained by police officers in their daily duty are made available for
police training. However, there is no national strategy on collection, dissemination, interpretation and
integration of experiences from the streets of Norway. Even though experienced police officers lecture 
and supervise students and fellow police officers, the lack of systematic collection of experiences hampers 
the quality of the training, especially at the NPUC, but also in the police districts. 


1 INTRODUCTION 


The Norwegian police have a tradition of reticent
use of force, a tradition rooted in Norwegian
society (NOU 1981:35; Norwegian Parliamentary
White Paper No. 42 (2004-2005)). However,
from 2007 to 2016 the Norwegian police experienced
a sharp increase in armed assignments, from
1507 in 2007 to 5816 by November 2016 (POD
2017a). It is mostly ordinary emergency response 
police personnel who handle the acute phases of 
armed assignments (Norwegian Parliamentary 
White Paper No. 21 (2012-2013)). Heightened 
terrorist threats and organised crime may expose 
the police to extreme situations characterized by 
rapid assessments and decisions, based on ambigu- 
ous information and the possibility of fatal conse- 
quences. Changes in street realities may challenge 
both operational training and practice in the use of 
force in the Norwegian police. 

This paper aims to study the degree to which
training in the use of force at the Norwegian Police
University College (NPUC) and in the police districts
reflects the need for reliable handling of the various
weapons at the disposal of police officers
facing street challenges, and how experiences of
the use of force gained by police officers in their
daily duty are collected, disseminated, interpreted
and integrated in police training. The paper
draws on observations made through many years
of experience from operational service in the
police. One of the authors is a former head of the
police counter terrorist unit (Delta), and another
is a former lecturer and head of studies at NPUC.
The paper starts with a historical introduction to 
the police use of force, followed by a presentation 
of the frameworks on crisis, situational awareness, 
decision-making, training, exercising and learning. 
We then present the results on operational use of 
force based on literature studies of the results 
of the Firearms Commission (NOU 2017:9) and of 
reports on the use of force. We also include find- 
ings based on participant observation in training 
sessions at the NPUC and in the police districts. 
These findings are then discussed together with 
the theoretical frameworks before we make some 
concluding remarks on police training on the use 
of force in relation to the operational needs on the 
street. 


2 THE NORWEGIAN POLICE 


A Norwegian national police force organized and 
employed by the state has existed since 1936. The 
Norwegian police is organized under the Ministry 
of Justice and Public Security, and managed by the 
Police Directorate (POD). All police services in the 
12 police districts, and the five national specialized
agencies¹ (POD 2017b), follow the same law,
instructions and guidelines. 


2.1 The Norwegian Police and the Norwegian 
Police University College (NPUC) 


The NPUC is one of the national specialized agencies,
and is responsible for all undergraduate and
postgraduate education, including all police training
in the use of force. The joint national basic
education program ensures the same basic training
in the use of force and coercive means, including 
firearms. The police are divided into five compe- 
tence levels for emergency response personnel (IP), 
according to the level of annual operational train- 
ing (PBS 1:38): 


IP 1: The police counter-terrorist personnel (Delta) 

IP 2: The dignitary protection personnel 

IP 3: The police special response unit personnel 

IP 4: The police emergency response personnel 
(ordinary police personnel) 

IP 5: The police emergency response person- 
nel with limited operational training (not 
licensed to carry firearms). 


To be authorized to carry firearms, emergency 
response personnel have to carry out at least 
48 hours of operational training and pass a stand- 
ardized annual firearms test to reach competence 
level 4 (IP 4). It is NPUC, along with POD, which 
determines the annual joint national operational 
training. The yearly training normally consists 
of firearms training, arrest techniques, first aid, 
tactics and situational training. In recent years 
training on how to handle an active shooter has 
been emphasized for all police emergency response 
personnel (Norwegian Parliamentary White Paper
No. 10 (2016-2017)).


2.2 Directives and instructions on the use of force 


There are no written directives on how the police 
should perform their operational training. NPUC 
develops the professional content on the basis of 
inputs from instructors, collaboration with POD 
and PJS (Police Joint Services), dialogue meet- 
ings with police districts, participation in national 
exercises and regular meetings with the National 
Reference Group for Police Operative Disciplines.


¹ Norwegian Police University College (NPUC), Central
Mobile Police Service (CMPS), National Criminal Investigation
Service (NCIS), Police Joint Services (PJS) and
Norwegian National Authority for Investigation and
Prosecution of Economic and Environmental Crime
(NNAIPEE).


The professional content of the training is detailed 
in relevant academic literature, in instructor manu- 
als and in subject booklets, which ensures an equal 
approach in all police districts. Thus, at a strategic 
level, NPUC has a central role in developing and 
coordinating national training. 

The specific instructions for the use of force are
provided in the Police Act, Police Instructions and
Firearms Instructions (Lovdata.no 20.11.17). The
use of physical force is authorized by law in
the Police Act, is defined in the Police Instructions
and is essentially exercised on the basis of professional
judgment and discretion. The use of coercive
measures, such as pepper spray, baton and firearms,
is strictly regulated in the Firearms Instructions.


2.3 Models for the force continuum 


The Norwegian police use a model, the “Force pyramid”
(see Figure 1), for visualizing coercive measures
in a force continuum. The model shows how
the use of force increases the higher up the pyramid
one moves, based on the amount of physical
and mental injury which may be expected from the
respective coercive measure. The pyramid is not a
complete and generally valid description of the use
of force, but a ranking of different coercive measures
(Lie & Lagestad 2011:10). However, the model
is used in all training in the use of force at NPUC.
In this paper we limit the further discussion to
physical force, shown as the steps above the dotted
horizontal line in Figure 1. The first steps are use
of arrest techniques in more controlled situations,
the use of pepper spray, and the use of baton. The
next step is the use of punches and kicks as self-defence
in more uncontrolled situations. As a last
resort, the police can use firearms. The police use
of force may also be displayed in another model
(see Figure 2).

Figure 1. Force pyramid. (Labels legible in the source: verbal communication; police presence (symbolic force).)

Figure 2. Police use of force model.

This model, unlike the force pyramid model 
(Figure 1), also includes Electronic control devices 
(ECD, for example TASER) and the use of dogs, 
in addition to situational assessment. ECD is under 
review for possible acquisition. The police use of 
force model (Figure 2) is based on the New Zealand 
Police “Tactical options framework” model. This 
modified version was first presented at the annual
research conference at NPUC in 2016 (Vee Henriksen
2016). The inner circle of the model describes the
level of resistance faced by the police. The outer 
circle describes the coercive measures in relation to 
the level of resistance. Proportionate use of force is 
decided based on situational assessment, conducted 
by the police emergency response personnel. 

It must be emphasized in both models that the 
police do not have to try every step before stronger 
force is used; this depends on the development of 
the situation in question. In daily service the Nor- 
wegian police will carry pepper spray and baton in 
the belt, but their firearms will be stored in locked 
boxes in the patrol cars. 


3 CONCEPTUAL FRAMEWORKS 


To be able to understand and discuss police use
of force, we need to clarify our understanding of
incidents, crisis, situational awareness and decision-making.
We then need to look at collection,
dissemination, interpretation and integration of
experience and finally learning about police use of
force.


3.1 Incident and crisis 


A crisis may be defined as “a serious threat to the 
basic structures or the fundamental values and 
norms of a system, which under time pressure and 
highly uncertain circumstances necessitates making 
critical decisions” (Rosenthal et al. 1989: 10). This 
definition points to critical decision-making in the 
midst of the uncertainty and time pressure of a 
dynamic incident or crisis. For the purposes of this paper
we will not distinguish between the terms incident
and crisis. There is some research on differences
between events such as incidents, crises and catastrophes
(Engen et al. 2016), for instance with regard to their
response needs. We will not go into this discussion
in this paper, but use the terms incident and crisis
independently of the size of the event in question.

Some crisis characteristics may be of particular
importance in incident or crisis management. ’t
Hart and Boin (2001) distinguish crises based on
their development and termination patterns. A
fast-burning crisis develops and terminates fast. A
slow-burning crisis has slow development and termination
patterns. Gundel distinguishes crises based on
their predictability and the degree to which they can
be influenced (Gundel 2005). Unexpected crises are
in his terminology difficult to predict, but fairly easy
to influence. Intractable crises are easy to predict
but difficult to influence. It is fair to assume that an
unexpected crisis (Gundel 2005) with a fast-developing
pattern (’t Hart and Boin 2001) may be particularly
challenging for police officers to manage.


3.2 Situational awareness and decision-making 


Situational Awareness (SA) is crucial for decision- 
makers in dynamic incident management. Situ- 
ational awareness is defined by Endsley as “the 
perception of the elements in the environment 
within a volume of time and space, the comprehen- 
sion of their meaning and the projection of their 
status in the near future” (Endsley 1997:97). SA 
makes it possible to operate quickly and efficiently 
even during demanding operations (Lee and Kirlik 
2013) necessitating swift decision-making, also on 
the use of force (including firearms). The quality of 
decision-making on modus operandi, and the abil- 
ity to switch from one practice to another (Schakel 
et al. 2016), to manage the situation at hand, includ- 
ing the use of force, may have a crucial impact on 
the police officers and the public. Decisions may 
be based on analytic and/or intuitive reasoning
(Eid and Johnsen 2006, Helsloot and Ruitenberg
2004). The important issue initially is to reduce 
uncertainty, to find out what is going on. Boin and 
colleagues describe this initial phase as the sense- 
making phase, or the “what the hell is going on” 
phase (2005). In unexpected and fast developing 
situations, characterized by a high degree of uncer- 
tainty, it is fair to assume that intuitive decisions 
will be of particular relevance. Intuitive decisions
are based on previous experience from similar situations
(Eid and Johnsen 2006) and on improvisation
(Weick 1993). Recognition of familiar patterns,
déjà vu (Weick 1993), is a comforting thought for
decision-makers in these situations and is the basis of
recognition-primed decision-making (Klein 1989, 2011). Weick
describes decision-making in this connection as “an
act of interpretation rather than an act of choice”
(1995:185). Furthermore, Weick (1995) points out
that a good decision-maker is as good as his mem- 
ory and the type of information that is stored there, 
and that good decisions are just as much based on 
a correct understanding of what has happened as 
on a correct understanding of what is going on. 
The level of expertise, based on experience, training 
and education, is therefore important for situational 
awareness and decision making. 


3.3 Training, exercises and learning


Paoline and Terrill (2007) focus on experience as 
being the most important factor for learning within 
the police. Experiential learning is a form of learn- 
ing that builds on practical situations and practi- 
cal exercises that provide personal experience as 
well as learning (Kolb 1984), as learning by doing 
(Dewey 1938). Dreyfus and Dreyfus (1988) present
five levels of proficiency: novice, advanced beginner,
competent, proficient and expert. They focus on
the development of occupational knowledge in an
educational perspective: “If, and only if, experience
is gained in this atheoretical manner, and intuitive
behaviour replaces considered reactions, can mastery
be regarded as developed” (Dreyfus and Dreyfus
1986:56). The transfer of experience and the
development of competence at the moment of learning,
however, require occupational experience to be
passed on as knowledge-based learning. Thus, police
officers’ experiential learning (Kolb 1984) needs to be
shared, disseminated, and collectively integrated
and interpreted (Dixon 1999) in the police force and
thereby made available for instructors and training
sessions in the NPUC and the police districts.


4 EMPIRICAL FINDINGS 


The empirical findings stem from participant
observation in training sessions and operational
use of force, statistics from the Norwegian
Police Directorate (POD), and the Firearms Commission’s
report (NOU 2017). Reference will also
be made to earlier surveys relating to the subject.


4.1 Use of force 


Even though risk assessments must not only take 
into account experiences but also cater for surprises 
and uncertainty (NOU 2012:14), proper data gath- 
ering is a vital foundation for these assessments. 
The Norwegian police do not collect data on the
use of force in a structured way. Owing to the lack
of reporting systems in Norway, there is limited
knowledge about the extent to which, and the contexts
in which, the police exercise force. The only use of force
that is reported is in the form of statistics for
armed assignments: when the police have threatened
to use firearms and when they have actually
used them. However, there is no national standard
for reporting on the use of firearms. Thus, routines
may vary amongst the police districts.

Figure 3 shows the development of armed
assignments from 2007 to 2016. It is noted that the
sharp decline in the period 2014-2015 is due to the 
temporary arming of the Norwegian police due to 
the increased threat of terrorism (NOU 2017:9). 
The temporary arming meant that all uniformed 
emergency response personnel approved to carry 
firearms, were required to carry a pistol during 
regular service. In most police districts, assign- 
ments that required armed response were no longer 
registered, as the police already had a general order 
for arming (therefore the decline in Figure 3 in this 
period). Some police districts, however, continued 
the normal practice of registration, despite the fact 
that the police already were armed (NOU 2017:9). 

How many shots are fired by police officers each 
year? Figure 4 shows the numbers of shots fired at 
a person in the period 2005-2014. 

Figure 4 shows an average of 2.5 shots fired at
a person per year. In 2014, with just above 4000
armed assignments (see Figure 3), 2 shots were
fired at a person (Figure 4).

Figure 3. Annually reported armed assignments (POD 2017a:8).

Figure 4. Annually reported shots fired at a person
(POD 2015:3).

In situations where police officers have fired
shots at a person, an investigation must be carried
out according to the Norwegian Prosecution
Instructions (Lovdata 22.08.16). The investigation
assesses the police officer’s actions from
a legal perspective only; the incidents are not
evaluated in order to gather knowledge for learning
purposes. All cases of police use of firearms in the
period specified in Figure 4 have been filed without
any form of legal consequences for the police.
The number of shots fired at persons each year is 
low, and no substantial change was recorded in the 
number of situations where the police threatened 
to use firearms or actually fired shots during the 
period of temporary arming (NOU 2017:9:153). 

Apart from the varying standards of reporting
related to the use of firearms, there is no national
reporting system on the police use of force. Any
documentation will appear only in the local police
orderly book and/or in administrative documents in
the individual criminal cases. The lack of an overall
police system for collection, dissemination, interpretation
and integration of experience in the use
of force has been pointed out in official reports
(NOU 2009:12; NOU 2013:9). A comprehensive
police analysis in 2013 pointed out that: “Despite
the fact that expertise is a crucial input factor, the
police today lack a comprehensive strategy and
approach to the development and management
of competence in the police. Competence and
personnel development work in the police is to a
lesser degree systematic” (NOU 2013:9, p. 209).
In this context, it is reasonable to ask questions 
about the basis for Norwegian police training in 
the use of force. Is the training based on proven 
experience-based knowledge or based on assumed 
needs of the police? Very few surveys related to
police use of force have been conducted in Norway.
Lagestad (2008) conducted a study which
included examining how often the police train and
use arrest techniques. In this study the participating
police officers estimated that they used
physical force approximately once a month. Lie
(2010) prepared a report on the mapping and testing
of police knowledge and skills in arrest and control
techniques. Of the respondents, 90% reported having
used physical force to put someone on the ground
once or twice in the last two years, 16% had used
pepper spray in the last year and 4.5% had used the
telescopic baton.


According to Holmberg’s (2013) report on the use 
of pepper spray in the Scandinavian countries, in 
Oslo Police District the use of pepper spray was 
reported approximately twenty-seven times a year 
on average in the period 2005-2011. It must be 
underlined that the Firearms Commission sug- 
gested that the Norwegian police should report on 
all use of force (NOU 2017:9). It is not known if 
and when this will be implemented. 


4.2 Temporary arming in 2014-2016 


Due to a heightened terrorist threat, ordinary police
officers were armed on daily duty in the period
25th November 2014 to 3rd February 2016. The
Norwegian police officers were armed while on
daily duty for the first time, with no time for preparations.
In the Norwegian Official Report (NOU
2017:9) it is pointed out that the Police Directorate
particularly noticed two main issues regarding the
arming: (1) the risk of accidental discharges (AD),
and (2) the possibility of the police being deprived
of their own firearms. According to the Norwegian
Official Report (NOU 2017:9:151), “it cannot be
said that a central and systematic risk assessment
of accidental discharges was made in advance of
the arming.”


4.3 Accidental Discharges (AD) 


Prior to the period of temporary arming, there 
was no systematic registration of accidental dis- 
charges (AD) in the Norwegian Police Force, nor 
was there any clear specification of what should be 
considered as AD (NOU 2017:9). Guidelines for
preventing ADs were not in place when the arming
occurred. Nor was a system established
for collection, dissemination, interpretation and
integration of experiential learning from such incidents.
However, AD became a very relevant issue
during the temporary arming. In a letter of 10th 
August 2015 the Ministry of Justice and Public 
Security requested the registration of ADs dur- 
ing the temporary arming, and instructed POD 
to retrospectively map ADs in the period 2011- 
2015. Twenty-eight ADs were registered during 
the period of temporary arming, a sharp increase 
from earlier years, but earlier numbers are uncer- 
tain as they have been retrieved retrospectively 
(NOU 2017:9:147), and not based on a continuous 
systematic reporting. Figure 5 shows accidental
discharges registered in the period 2011 to January
2016.

It must be underlined that the registered num- 
bers of ADs do not include ADs during firearms 
training or when weapons have been emptied and 
checked in projectile collectors, unless resulting in 
injuries. 
Figure 5. Accidental discharges (AD) registered in the period 2011 to
January 2016 (NOU 2017:9:148).


4.4 Attempts to deprive the police of firearms 


During the period of temporary arming the 
Police Directorate received twenty-two reports of 
attempts to deprive the police of firearms. The 
numbers might be incorrect because the reporting
was not mandatory until August 2015, nine months
after the Police Directorate gave the order to arm
all police emergency response personnel (NOU
2017:9). The attempts to deprive the police officers
of their firearms occurred in different contexts.
There were reports of several attempts to deprive
the police of their firearms while they were assisting
medical services, including during the transportation
of mentally ill persons. Similar incidents occurred
when the police performed arrests. At least two such
incidents resulted in injuries to police officers while
defending their firearms.


4.5 Police basic education on the use of force 


The Norwegian police basic education lasts for
three years (a bachelor’s degree). The program consists
of one year of theory at one of the NPUC's
three departments, a year of practice where each
student is guided by an experienced police officer/ 
tutor, and a final year of theory back at the NPUC. 
The operational education, in terms of training in 
the use of force, is primarily distributed through- 
out the second and third year. The operational 
training reflects that the Norwegian police offic- 
ers still are unarmed in their daily duty. In the first 
year of studies there is no regular tactics or weap- 
ons training. However, the students receive basic 
training in arrest techniques, and training and 
approval to use OC pepper spray and the telescopic 
baton. In the second year the students undergo a
two-week course in basic tactics, situational training
and basic training on the Heckler & Koch MP5
(sub-machine gun). Tactical and situational training
takes place without students carrying firearms
during this course. The students also attend a basic
course in the use of the Heckler & Koch P30L pistol.
Students will to varying degrees also participate
in the operational training held in the police
district in which they are assigned. This applies 
to both tactics training and weapons training. 


In addition, they continue to receive basic train- 
ing in arrest techniques. During the third year the 
students have a number of exercises in situational 
training. In addition, they receive training in a 
simulator with exercises in situational awareness 
and decision-making connected to the use of force, 
including a focus on the use of firearms. The simulator
training focuses on the force continuum
model (see Figure 1) and the legal framework for
police use of force. During this final year the students
also continue training in arrest techniques,
and have their final exam. During the last semester, 
students participate in a three-week course, which 
places special emphasis on firearms training and 
armed assignments. During this course, students 
also undergo the standardized annual firearms test 
for both firearms (pistol and sub-machine gun). 
On completion, they are approved as police
emergency response personnel (IP 4), authorized to
carry firearms in service.


4.6 Use of experiential learning 


There is no position or function on a national level 
explicitly responsible for the collection of experi- 
ences from operational service, and making these 
experiences available for education and training in 
the police. Use of experiential learning is sporadic
and dependent on individual initiatives. Experience
is not collected in a systematic manner. We
have also seen variations between both departments
and police districts in the collection, dissemination,
interpretation and integration of experience-based
learning. At the same time there are some good
examples where experience-based operational train- 
ing has been developed. The police counter terrorist 
unit (Delta) participated in the development of a 
concept for arresting a person with a knife based on 
experiences from such arrests. This concept is now 
mandatory training for all students at the NPUC. 
The project manager for the development of 
the NPUC simulator training in the use of force, 
including firearms, used all available Decisions of 
Prosecution from the Norwegian Bureau for the 
Investigation of Police Affairs as a basis for the 
professional content of the simulations. All relevant 
legal assessments were reviewed and implemented 
in the training, and many of the scenarios that were 
recorded are based on real events. This training is 
also mandatory for all the students at the NPUC. 


5 DISCUSSION 


Is the police training in the use of force based on 
experiences from incident management on the 
streets, preparing students and police officers for 
facing the realities of street challenges? First of 
all we have to go back to our understanding of
incidents and incident management, and try to find
out what characterizes assignments where coercive 
force may be used. As previously stated, there is a 
lack of reporting in connection with experiences 
gained by police officers in the use of physical force, 
except in the case of firearms. Reporting on the use
of firearms is also fragmented and not based on a
standardised format implemented by all police districts.
There has been a sharp increase in the number
of armed assignments in Norway. At the same time
the number of shots fired by police emergency
response personnel stays at a very low level
(see Figures 3 and 4). Armed assignments may suddenly
arise, leaving limited time for preparation and
decision-making. Some assignments terminate without
any kind of contact with a suspect, some result
in arrests without use of force and a few result in
confrontation with a suspect where various forcible
measures are used. Armed assignments may escalate
to a crisis as they represent a serious threat to
the security of the people involved and the surroundings.
In addition, time pressure and highly
uncertain circumstances may necessitate
critical decision-making (Rosenthal et al. 1989).
Based on the numbers of armed assignments
(see Figure 3) it is fair to assume that new
incidents will arise, but it will be difficult to
predict the time and place of occurrence, and therefore
to influence the development of the
event in question. Unexpected crises (Gundel 2005)
with a fast-developing pattern (’t Hart and Boin
2001) may therefore be difficult for police emergency
response personnel to plan for and manage.
The NPUC, and the rest of the police force, need 
to consider both expected and unexpected scenarios 
when planning and developing the proper compe- 
tence level for police students and officers. It is dif- 
ficult to predict the future challenges the police may 
face, fully implement measures related to prevent- 
ing and preparing for worst case scenarios, and plan 
police preparedness in case of unlikely events. Risk 
assessments must not only take into account expe- 
riences but also cater for surprises and uncertainty 
(NOU 2012:14). Thus, a reliable modus operandi to 
manage the situation at hand may not only be based 
on “following the book”. Reliable decision-making 
in a crisis situation may be the result of an ability 
to switch from one practice to another (Schakel 
et al. 2016), on intuitive decisions based on previ- 
ous experience from similar situations (Eid and 
Johnsen 2006), on a feeling of déjà vu (Weick 1993), 
on improvisation (Weick 1993), and, finally, on recognition
of familiar patterns (Klein 1989). These
are capacities normally associated with highly experienced
response personnel. However, it is mostly ordinary
police emergency response personnel (IP 4) who
manage the initial phase of an acute incident or crisis.
They need to be trained and equipped to face
such situations. How do the police use experience
to manage these types of events in training and education?
Besides the statistics on armed assignments
and the use of firearms there is no national systematic
collection of experience on the use of forcible
measures for use in training and education, either at
the NPUC or in the police districts.

In armed assignments adequate situational 
awareness (SA) is crucial for decision-making 
undertaken by police officers, and thus the out- 
come of the situation. These assignments are often 
characterized by rapid development, uncertainty 
and lack of or ambiguous information. SA makes 
it possible to operate quickly and efficiently even 
during such demanding assignments (Lee and 
Kirlik 2013). We are well aware that SA is one of 
the most central elements in training on the use of 
force, both at NPUC and in the police districts. 
During simulator training at NPUC SA is con- 
stantly referred to in connection with the choice of 
coercive measures based on the force continuum 
models (Lie and Lagestad 2011, Vee Henriksen 
2016). As a result of their SA the police emergency 
response personnel decide whether or not to use 
force, and the appropriate kind of force and for- 
cible means to be used. In situations where the 
police are armed, decision making will be particu- 
larly important since the use of firearms may have 
fatal consequences. To make the right decisions Eid 
and Johnsen (2006) point out that intuitive deci- 
sion making will be of particular relevance as it is 
based on experience from similar situations. Weick 
(1995) also underlines that a good decision maker
is as good as his memory and the available information
that is stored there. However, training in
the use of force is primarily conducted under orderly
conditions, with less influence from the interfering
factors that are present in real situations. Paoline and
Terrill (2007) underline that experience is the most
important factor for learning within the police, and
Kolb (1984) refers to experiential learning based 
on practical situations that provide experience and 
learning. As we have pointed out, there is a severe
lack of a structured approach to the collection, dissemination,
interpretation and integration of experience
on the police use of force in Norway. There
is no documentation of the extent of the use of force,
what kind of force is being used and the outcome
in these situations. How can we then maintain that
the training is experience-based? And how can we
be sure that this training prepares police students
and officers for street challenges? Our experience
shows that training largely depends on the individual
instructor’s experience, and on ad hoc work
based on individual initiatives, such as the
development of the simulator training at NPUC.
This means that there will be varying quality in the
professional experiences conveyed during training.

Norwegian official reports have several times
pointed out a lack of systematic use of experiential
learning in the police (NOU 2009:12, NOU
2012:14, NOU 2013:9, NOU 2017:9). It is a fundamental
principle in learning theory that training
and education should be based on experience. You
train as you fight. The Firearms Commission has
suggested a systematic registration of police use of
force and evaluation of incidents where the police
have used firearms (NOU 2017:9). This would be
an important contribution to the systematic use of
experience in the future, and would support competence
development within police preparedness. The
police response is not likely to be more efficient
than the actual police preparedness when an incident
occurs (Norwegian Parliamentary White Paper
2012-2013). Thus, as also reflected in the report
of the 22nd July commission (NOU 2012:14), the
police emergency response personnel must have
relevant competence and training in the use of
force. The number of accidental discharges in the
police (see Figure 4), the lack of systematic reporting
on the use of force, and the low degree of systematic
competence and personnel development work
in the police (NOU 2013:9, p. 209) indicate that
experience is an underused asset in police training.


6 CONCLUSIONS 


The aim of this paper has been to study the degree
to which the training in the use of force at the NPUC
and in the police districts reflects the need for reliable
handling of the various weapons at the disposal of
police officers facing street challenges, and how
experiences of the use of force gained by police
officers in their daily duty are collected, disseminated,
interpreted and integrated in police training.

Even though experienced police officers lecture
at the NPUC and supervise police students during
the three-year study program, there is no national
overall strategy for the collection, dissemination,
interpretation and integration of experiences from the
street in the Norwegian police force. It is fair to say that
this hampers the quality of the professional
content of the training, especially at the NPUC,
but also in the police districts. As long as there is
no documented experience that can be conveyed in
the form of experience-based knowledge, the quality
of the professional content of the training and
lecturing will depend on the individual instructor's
experience and competence, also in relation to the
street challenges faced by police officers.

The role of employers, safety engineers and safety reps in the improvement
of safety level at enterprises

ABSTRACT: Fifteen Estonian enterprises were investigated to determine their safety level, using the Method
for Industrial Safety and Health Activity Assessment (MISHA). Some of the firms have implemented
OHSAS 18001 or belong to foreign companies. One of the ideas for improving the safety level
at an enterprise is learning through interviews. The interviews were based on the MISHA method. The
safety performance elements were divided into three groups: formal, real and combined. The statistical
analysis used factor analysis with Bartlett’s test, ANOVA and the T-square test with Wilks’ Lambda.
The main way to influence the safety level in a firm is to employ Working Environment
Specialists (WES), as they are better educated and are supported by the law in the work safety area. In small-
and medium-sized enterprises, there are few resources for hiring a WES. Where OHSAS 18001 was
implemented, the assignment of tasks and responsibilities in Occupational Health and Safety (OHS)
was committed to the top management (p = 0.000), the employer revised the safety policy (p = 0.000), and
the personnel’s responsibilities and authorities in OHS were clearly defined (p = 0.013). The role of the
working environment representative was not particularly significant in the investigated enterprises
(p = 0.350). At the same time, the type of firm had a significant effect on supervisor/employee communication
(p = 0.001) and on general communication procedures (p = 0.006).


1 INTRODUCTION 


The work environment is a multifaceted term:
it covers not only the physical work setting,
but also the psychological and psychosocial
features that depend on people’s personalities
and attitudes (Hrenov et al. 2017a). The treatment
of risks at work is a key concern in today’s
working environment (Lafuente & Abad 2018).
Researchers and practitioners have observed
a remarkable change in the role of
safety management, which has developed from a
narrow view of it as a costly managerial
problem to an operational priority with significant
economic and social impact (Abad et al.
2013; Das et al. 2009). According to the European
Agency for Safety and Health at Work (EU-OSHA
2013), the economic costs and operational losses
that work accidents cause to workers, companies
and public administration amount to 3% of the EU’s
gross domestic product.

OHSAS 18001 certification is becoming
the leading international safety system adopted
by managements to engage in processes that
encourage continuous improvement of work safety
conditions (Fernandez-Muniz et al. 2012; Lo et al.
2014). The OHSAS 18001 standard is the basis for the
new ISO 45001 standard, which is likely to become
available in the near future.


There are many different safety management
systems (Li & Guldemund 2018). A safety management
system (SMS) in industry (Robson & Bigelow
2010) can be defined as a planned, documented
safety program that incorporates certain basic
management concepts and activating elements into
a well-organized safety system. The safety activity
areas and supporting elements that comprise this
system act and interact on one another to help
achieve the desired safety or risk level. A total safety
management system consists of the following:
objects, i.e. parameters such as input, process,
output and feedback control; attributes, i.e. properties
of parameters, such as the external manifestation of
the way in which an object is known, observed or
introduced in a process; and relationships, i.e. bonds
that link objects and attributes in the system process.

SMS can be categorized as a set of institution- 
alized interconnected and interacting strategic 
elements designed to establish and attain occupa- 
tional health and safety goals and objectives (Yurio 
et al. 2015; Kim et al. 2016). 

There are different key persons in the enterprise
who have to take care of occupational health and
safety (OHS): the employer (EMP), the working
environment specialist (WES; in some countries, the
safety engineer) and the working environment
representatives (WER, or reps) (Hrenov et al. 2017a).
The roles of these key persons differ between
countries (Paas 2015a, b, c, d; Hrenov
et al. 2016).

In the previous studies on the improvement of safety
and health at the workplace (Paas 2015a; Hrenov et al.
2016, 2017a, 2017b), the roles of the employer, the
working environment specialist and the working
environment representatives were described separately.

The current study compares the
results of the interviews with these three parties
(EMP, WES, and WER).

The aim of the study is to improve the safety level 
at small- and medium-sized enterprises in Estonia 
through the cooperation and close communication 
of the employer, working environment specialist 
and the working environment representative. 


2 THEORETICAL PART 


Safety is an abstract concept: a state of
freedom from anything that could have undesirable
consequences, such as harm to humans or
nature, financial loss, or any other form of damage
or loss. For example, in a hospital, the safety of
patients means keeping patients in a stable state by
avoiding the risk of adverse events (Shojania
et al., 2001). The current paper is concerned with
industrial safety, where unexpected events and risks
arise within the context of manufacturing activities.
However, a zero-risk condition does not exist.
Although some companies achieve a zero-accident
record for a certain period of time, this does not
indicate that they are risk-free. Risk is a measure of the
likelihood and consequence of uncertain future
events, that is, the chance of an undesirable
outcome (Yoe, 2011), while safety is, according to
IEC 61508, “freedom from unacceptable risk”. We
can therefore conclude that the safety of an industry
is judged by its acceptable risk.
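
The relation between risk and acceptable risk can be illustrated with a small sketch. It is not taken from the cited sources: the 1-5 rating scales, the multiplicative score and the acceptance threshold are assumptions chosen only for illustration (a common risk-matrix convention), and the function names and hazards are hypothetical.

```python
# Illustrative sketch only: the scales, the scoring rule and the threshold are
# assumptions, not part of IEC 61508 or of the MISHA method.

def risk_score(likelihood: int, consequence: int) -> int:
    """Combine likelihood and consequence (each rated 1-5) into a risk score."""
    for name, value in (("likelihood", likelihood), ("consequence", consequence)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return likelihood * consequence


def is_acceptable(likelihood: int, consequence: int, threshold: int = 6) -> bool:
    """A hazard counts as acceptable when its score does not exceed the threshold."""
    return risk_score(likelihood, consequence) <= threshold


# Hypothetical hazards: (description, likelihood, consequence)
hazards = [
    ("slippery floor", 3, 2),
    ("unguarded press", 2, 5),
]
unacceptable = [name for name, lik, cons in hazards if not is_acceptable(lik, cons)]
print(unacceptable)
```

In this sketch no hazard ever scores zero, which mirrors the point that a zero-accident record does not mean a risk-free workplace; safety is judged only against the chosen acceptance threshold.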

Safety management (SM) means “a systematic 
control of worker performance, machine perform- 
ance, and the physical environment” (Heinrich 
et al. 1980). 

In the recent past, the occurrence of major
accidents and crises has made it clear that organizations
must still improve their competencies to
address safety through the application of a systematic
and proactive approach. Safety management
systems (SMS) are changing from a prescriptive
style to a more “self-regulatory” and “performance
oriented” model (Frick & Wren, 2000; Bluff
2003) that is more proactive, participative and better
integrated with business activities.

Goetsch (1998) introduces the concept of total
safety management (TSM) as a performance-oriented
approach that gives organizations a sustainable
advantage in the marketplace by creating
a safe work environment that is conducive to
peak performance and continual improvement.
In a total safety approach, business processes are
combined with safety engineering methods within
a continuous improvement culture that affects all levels
in the organization. A work process is a complex
network of interdependencies between physical
items, information, communication and knowledge
flows, and decision-making activities (Zou &
Sunindijo 2015; Kontogiannis et al. 2017).


2.1 Participative risk assessment 


In most cases, a risk assessment is performed
on how jobs should be accomplished rather than on
how they are actually performed in practice. As a
result, critical variations or disruptions of events
are missed in the analysis. To avoid this oversight, a
participative risk assessment is required that
involves people at all organizational levels in certain
stages of the analysis. This has the benefit of capturing
risks associated with how the work is actually done and
also involves staff in SM. Finally, it is easier
to design safety measures and barriers that are
compatible with the competencies and preferences of
workers when they are part of that process, hence
enabling a more efficient human-system interface
(Kontogiannis et al. 2017).


2.2 Processes for appreciating hazards and risks


Risk assessment (RA) is significant because it helps
create awareness of the hazards and risks related to
serious work activities. A participative method of
safety analysis allows workers and supervisors
to update the risk analysis with everyday information
about critical risks in the workplace and in technical
processes. Hence, a process of “safety preview
and screening” is required at the start of
risk analysis. Some challenges related to overcoming
the weaknesses of RA (such as residual risks) should be
addressed through new changes in RA methods
(Kontogiannis et al. 2017). Digital RA methods
are much in demand in this area.

The research question in the current study is
the following: what possibilities do the
employer, the working environment specialist (safety
engineer) and the working environment representative
(rep) have to improve the safety level at small and
medium-sized enterprises (Figure 1)?

In the previous study (Paas 2015a,b,c), the key
elements of the safety management system were
divided into formal elements (such as safety documents,
the content of the policy, the correlation between the
safety activities and the implementation or
non-implementation of OHSAS 18001, revision of the safety
policy, the existence of a written safety policy, and the
assignment of tasks and responsibilities), real elements
(such as top management commitment to the safety policy,
communication, and participation in workplace design)
and combined elements (such as participation in the
preparation of the safety policy, workplace hazard
analysis, and the assessment of the work environment).

Three hypotheses were formulated concerning the
employer’s activities:


H1. Standard OHSAS 18001 has an impact on 
Formal safety performance in companies, 

H2. Standard OHSAS 18001 has an impact on 
Real safety performance, 

H3. Standard OHSAS 18001 has an impact on 
Combined safety performance at enterprises. 


2.3 Learning from experience 


Learning involves a process of monitoring near-misses,
changes and the successes/failures of modifications,
as well as a process of reviewing the strengths/
weaknesses of risk analysis, unreliable risk
assumptions and problematic risk acceptance criteria
(Kontogiannis et al. 2017). This TSM pillar
also includes communication and training
intended to provide employees with the safety
information and skills needed to manage risks.
Communication includes the spread of information
throughout the organizational hierarchy and feedback
from the workforce. Safety training is very important and
can take several forms, such as classroom
seminars, apprentice training, simulator training
and virtual reality training. Although
many small and medium-sized enterprises (SMEs)
have understood the importance of safety
monitoring, they rarely know what data to collect,
which safety parameters are important, how often
to collect the data, and how to assure data reliability
and confidentiality. For example, many workers may hold
significant risk information but do not know
when and how to report it. Other cases involve the
reporting of near misses in standard forms without
the identification of causal factors (Kontogiannis
et al. 2017).

One of the ideas for improving the safety level at
an enterprise is learning through interviews,
in the course of questioning the employers,
WES and WER (workers) using the modified MISHA
method (Paas 2015d). For example, an individual
interview with the employer lasts 2-3 hours
and is focused only on OHS matters. During this
time, the person (employer or worker) learns as
much as he would learn during the compulsory safety
training course (24 hours), which is carried out in groups
of people from different areas and with different
interests; moreover, the lecturers on these external
safety training courses sometimes present
uninteresting and old-fashioned information. The
modified MISHA questionnaire is an educational
tool for the WES: this is the mode of learning
through interviews (Paas 2016d).


3 MATERIAL AND METHODS 


Fifteen Estonian manufacturing enterprises
(Table 1) were examined with the modified MISHA
method (Kuusisto 2000; Paas 2016d) to clarify
the roles of the EMP, WES and WER in OHS matters
and to study the prospects for improving
the safety level of the enterprise through their
co-operation.

There are four areas in the MISHA method: A)
organization and administration; B) participation,
communication, and training; C) work environment;
D) follow-up (altogether 200 questions).

The MISHA questionnaire was modified to take
into account some workplace hazards that
were not included in the original MISHA questionnaire
(Kuusisto 2000). For example, the influence
of vibration and electromagnetic fields on the
workers was asked about in the course of the interview
(Guldemund 2007).

The interviews with learning aims consist
of a questionnaire that includes “whether” and
“how” questions. The original questionnaires,
compiled for the assessment of safety activities at
enterprises, can be used as a tool for learning and
obtaining more information on safety in companies.
Learning is likely to be more effective when
participants are actively involved in dialogue in
which they are co-constructors of the meaning
(Kines et al. 2011).

Table 1. Characterization of the investigated companies
(N = 15).

Id. of the   Activity area                         Size,       OHSAS company/
company                                            employees   corporated company
I            Plastic industry                      50-249      +/-
II           Electronics                           >250        -/+
III          Food industry                         >250        -/+
IV           Electronics                           >250        +/-
V            Textile industry                      50-249      -/-
VI           Printing industry                     <50         -/-
VII          Glass industry                        <50         +/-
VIII         Chemical industry                     50-249      +/-
IX           Chemical industry                     50-249      -/-
X            Metal industry                        50-249      -/+
XI           Metal industry                        >250        -/+
XII          Agriculture farm (milk production)    <50         -/-
XIII         Agriculture farm (grain production)   <50         -/-
XIV          Construction                          <50         -/-
XV           Transport                             <50         -/-

In locally owned companies, where the safety
level is rather low, the managers did not recommend
interviewing the WER because, in their
(the employers’) view, the WER’s knowledge of OHS
tends to be negligible.

In those companies where OHSAS 18001 was
implemented, work instructions, instructions
for safety training and the safety organization’s
activity programs existed. This may not be the
case in companies where OHSAS 18001 was not
implemented.

The statistical analysis was carried out with IBM
SPSS Statistics 22.0 and R 2.15.2. The following
statistical methods were used: correlation,
MANOVA, factor analysis, the principal component
method, and the independent t-test (Field, 2013).
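
As an illustration of the independent t-test named among the methods, the following minimal sketch compares the scores of two groups of companies. It is not the authors' SPSS/R analysis: the scores are hypothetical, the function name is an assumption, and Welch's unequal-variance form of the test is used here for simplicity.

```python
# Minimal sketch of an independent two-sample t-test (Welch's form).
# The scores below are hypothetical, not the values reported in Table 2.
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)  # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


ohsas_scores = [80.0, 85.0, 90.0]      # hypothetical: OHSAS 18001 implemented
non_ohsas_scores = [40.0, 50.0, 60.0]  # hypothetical: not implemented

t_stat, dof = welch_t(ohsas_scores, non_ohsas_scores)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

The p-value would then be read from the t distribution with `dof` degrees of freedom, for example in a statistics package such as those named above.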


4 RESULTS 


The interviews with the enterprises’ representatives
(column 3, Table 2) were carried out and recorded, and
afterwards listened to and analysed independently by
the three authors of the paper and one expert. The
total average score (column 4) is derived
with the MISHA method.

Table 2. Characterization and results of the quantitative
study by the MISHA method in the investigated
enterprises (N = 15).

Id. of the   OHSAS company/       Person interviewed;      Total average
company      corporated company   position, age            score (100 max)
1            2                    3                        4
I            +/-                  Quality manager, 41      78 ± 3.0*
                                  Safety manager, 62       76 ± 2.5
                                  WER, 25                  78 ± 2.5
II           -/+                  Quality manager, 35      84 ± 2.0
                                  WES, 42                  90 ± 1.0
                                  WER, 53                  80 ± 1.0
III          -/+                  WES, 62                  75 ± 2.0
                                  WER 1, 34                80 ± 2.5
                                  WER 2, 39                58 ± 3.0
IV           +/-                  Quality manager, 59      92 ± 1.0
                                  WES, 39                  88 ± 2.0
                                  WER, 39                  78 ± 1.0
V            -/-                  Quality manager, 38      47 ± 23.5
VI           -/-                  Quality manager, 36      29 ± 4.0
VII          -/-                  Quality manager, 41      41 ± 4.0
VIII         +/-                  Manager, 55              88 ± 1.0
                                  WER, 62                  85 ± 1.0
                                  External auditor, 34     78 ± 2.0
IX           +/-                  Manager, 45              87 ± 1.0
                                  WER, 40                  87 ± 1.0
                                  External auditor, 34     78 ± 1.0
X            -/-                  Manager, 40              61 ± 1.5
                                  WER, 53                  55 ± 2.0
                                  External auditor, 53     50 ± 2.0
XI           -/+                  WES, 35                  89 ± 1.0
                                  Trade union rep, 60      86 ± 1.0
XII          -/-                  Employer, 50             46 ± 2.0
XIII         -/-                  Employer, 56             60 ± 1.0
XIV          -/-                  Active manager, 40       50 ± 2.0
XV           -/-                  Personnel manager, 45    65 ± 2.0

*Mean difference in the assessment scores of the four reviewers.


4.1 The employer’s role in safety level formation


From the MISHA questionnaire, the questions
concerning the activities of the employer were
selected. The interviewers’ assessments of
OHSAS-implemented companies were compared
statistically with those of non-implemented
companies (Table 3).

In these safety areas (Table 3), the implementation
of OHSAS 18001 has been successful.

Table 3. SM areas where the employer’s influence on
the safety level is significant.

Safety key element                          Sum of Squares     p-value
                                            (KMO and
                                            Bartlett’s test)

Formal safety elements
A1.4. Assignment of tasks and               13.375             0.000
  responsibilities to the top management
A1.8. Revising the safety policy: has       25.688             0.000
  the employer defined how often the
  policy is revised?
C2.3. Are the personnel’s                   4.576              0.013
  responsibilities and authorities
  clearly defined?

Real safety elements
A1.9. Dissemination of the policy: has      21.007             0.000
  the employer defined how the policy
  is made available to the personnel?
A2.8. Resources: does the company have      22.688             0.000
  the resources for OHS improvement?
B2.1. Does the manager arrange              2.896              0.006
  information meetings on OHS for the
  employees?
D2.1. Does the company have a system        0.047              0.013
  for redesigning the work or
  workplaces of a person with
  disabilities?

Combined safety elements
A1.6. Dissemination of the policy: has      13.375             0.001
  the employer defined how the policy
  is made available to the personnel?
A1.10. Informing external bodies about      17.241             0.001
  the company’s safety policy
B3.1. Does the employer provide safety      2.854              0.004
  training for the personnel on a
  regular basis?
D1.2. The reduction of accidents: has a     4.125              0.007
  plan been elaborated and presented
  to the top manager?
D3.1. Does the company have a system        19.125             0.000
  for measuring social climate?




4.2 The working environment specialist’s role in safety level formation


As in Section 4.1, the questions concerning
the WES activities were selected (Table 4).


Table 4. SM areas where the WES has the strongest influence
on the company’s safety level.

Safety key element                          Sum of Squares     p-value
                                            (KMO and
                                            Bartlett’s test)

Formal safety elements
A1.1. Does the company have a written       22.250             0.000
  policy?
A1.3. Contents of the policy: a             19.285             0.000
  description of the safety tasks?
A1.4. Are the tasks assigned to the         13.375             0.000
  safety and health personnel?
A1.8. Revising the safety policy: who       25.688             0.000
  is responsible?
C2.3. Definition of the personnel’s         4.576              0.0013
  responsibilities: are the persons
  responsible for health and safety
  trained for their responsibilities?
D1.1. Does the company compile              21.000             0.000
  statistics on accident rates and
  summaries of accident causes?

Real safety elements
A1.9. Dissemination of the policy: is       21.007             0.000
  safety involved?
A2.8. Does the company seek advice on       22.688             0.000
  resources for health and safety from
  safety personnel?
B1. Does the safety manager instruct        5.672              0.001
  the personnel?
B2.1. Has the safety manager arranged       2.896              0.006
  the hazards management system in the
  workplace?
B2.4. Does the safety manager arrange       9.797              0.001
  safety campaigns?
B3.4. Has the safety manager defined        6.750              0.004
  which work permits are necessary
  (e.g., a permit to do fire-hazardous
  work)?
C1.8. Is the safety manager involved in     4.500              0.002
  the cleaning of the plant area?

Combined safety elements
A1.5. Participation in preparation of       21.500             0.000
  the safety policy
A1.6. Initial status review                 13.375             0.001
B3.1. Safety training needs: are they       2.854              0.004
  determined for the personnel?
C3.1. Workplace hazard analysis: has        9.491              0.000
  the safety manager carried it out?
D1.2. Accident investigation: are near      4.125              0.007
  accidents investigated?

The main way to influence the safety
level in a company is to employ working environment
specialists, as they are better educated and are
supported by the law in the work safety and health
(OHS) area. Nevertheless, in OHSAS 18001-implemented
companies, the results are better than in
non-implemented ones.


4.3 The working environment representatives’ role
in safety level formation (all combined area)


The questions were selected from the interviews
where WER activities could possibly improve the
safety level at enterprises (Table 5). All of these
questions fell within the combined safety
elements’ area.


Table 5. SM areas where the WER can influence the
safety level at enterprises.

Combined safety elements                    Sum of Squares     p-value
                                            (KMO and
                                            Bartlett’s test)
A1.5. Participation in the preparation      21.250             0.000
  of the policy
A1.6. Initial status review                 13.375             0.001
A1.10. Informing external bodies about      17.241             0.001
  the policy
A3.3. Selection of the line management      3.063              0.017
B3.1. Safety training needs                 8.491              0.000
D1.2. Accident investigation                4.125              0.007
D3.1. Assessment of the social              19.125             0.000
  environment


4.4 Results of the hypotheses


Three hypotheses were formulated concerning the
employer’s activities; the areas in which they were
confirmed are as follows:


H1. Standard OHSAS 18001 has an impact on
Formal safety performance in companies
(p < 0.013): if OHSAS 18001 has been
implemented, then the assignment of tasks
and responsibilities in OHS is committed to
the top management, the employer revises
the safety policy, and the personnel’s
responsibilities in OHS are clearly defined.

H2. Standard OHSAS 18001 has an impact on
Real safety performance (p < 0.013): if
OHSAS 18001 is implemented, then the top
manager promotes dissemination of the safety
policy (the policy is made available to all of
the personnel); resources for improvement are
arranged by the top management; the top
manager arranges meetings on OHS; and there is a
system for redesigning the workplaces of
persons who have difficulties in coping with
the work.

H3. Standard OHSAS 18001 has an impact on
Combined safety performance (p < 0.007):
if OHSAS 18001 is implemented, then the
top management participates in the
preparation of the safety policy; the top manager
reviews the safety policy and informs
external bodies about the effectiveness of the
company’s safety policy; the top manager
arranges safety training for all of the personnel;
there is a plan for the reduction of accidents,
which has been elaborated by the top manager;
and the company has a system for measuring the
social climate in the company.


5 DISCUSSION 


Our study revealed that management plays an essential
role in OHS improvement in the company.
O'Toole (2002) also postulates that the leadership’s
position influences the employees’ perceptions
of the safety management system. Those
perceptions appear to influence the employees’
decisions that relate to at-risk behaviours
on the job. Organizational commitment
affected the perceived safety at work, but not
work accidents.

The current study showed that a
plan for the reduction of accidents, if it is worked out
by the employer, has a very strong influence on the
combined safety at enterprises. If safety standards
(OHSAS 18001 etc.) are implemented, then
the organizational climate will also be at a higher
level (Neal et al. 2000).

Taking into account the results of the previous
studies (Hrenov et al. 2016; Arghami et al. 2014),
where the safety and health level of the enterprise
was measured with the MISHA method
from the viewpoint of the working environment
representative (WER) (Hrenov et al. 2016) and
the employers (Arghami et al. 2014), it can be concluded
that the safety engineers have the best overview
and knowledge of the safety system.

The other key persons (WER, employer) are
hesitant on some questions concerning, for example,
the dissemination of the safety policy to the workers in
the firm. The working environment specialist often
views the current situation realistically, while the WER
and the employer may overestimate the situation,
as their daily work is not connected with safety
matters.

The MISHA method is not the only method for
assessing safety and health at enterprises and for
identifying points for improvement (Kines et al. 2011;
Guldemund 2007). The safety climate questionnaire of
Arghami et al. (2014) is built on a different
principle from the MISHA method. It contains seven
(7) different factors: management commitment
to safety and personnel collaboration (the influence
on the total safety level R = 0.954), safety communication
(R = 0.830), supportive environment
(R = 0.793), work environment (R = 0.803), formal
training (R = 0.774), priority of safety (R = 0.740),
and personal priorities and need for safety (R = 0.547).
These results are comparable with the results in the
current paper: the safety policy might be worked
out properly and at a high level, but it
usually does not reach the personnel from
the top down. One of the lowest scores (R = 0.431)
is given to the question: “my line manager/supervisor
does not always inform me of current concern
and issues”.

Our study examined three different types of
companies: OHSAS certified companies, corporated
companies, and small and medium-sized
locally owned companies. It turned out that “small
enterprises” may be diverse: the definition covers
many types of work activities, which naturally leads
to large differences in the work environment. Small
enterprises are more susceptible to influence from
various “external” sources, e.g., through the ownership
structure. It might be important whether
the small enterprise is part of a larger organization
and whether it is publicly or privately owned
(Sorensen et al., 2007). This problem remains for
future research.

Compared with the Estonian OHS system in companies,
the Nordic OHS regime contains three different
collaborating structures within the company: 1) a
work environment or safety committee with balanced
representation from the parties; 2) safety representatives
elected by the employees; 3) in-house
or external health and safety experts employed by
and representing the management (Lindoe et al.,
2001). According to the OHS Act (1999), based
on EU Framework Directive 89/391, the employer
and employees have to co-operate, and opportunities
for both parties to consult on the relevant OHS
matters should be available. The need to ensure
worker participation is stated in mandatory forms
of national industrial health and safety legislation
and in the EU Framework Directive 89/391.
In Estonia, the WER has to be trained following the
24-h training programme provided in the regulation.
In Norway, the social partners agree that a
40-h course covers the basic training necessary to
function as a WER (Hovden et al., 2008).

From our interviews, we concluded that WERs
assessed the time available for dealing with OHS matters
as unsatisfactory. The results in the Nordic countries
(Hovden et al., 2008) show a similar pattern: WERs
often complained about a lack of time. Examples
of the best experiences of the Nordic countries
should be used in order to increase workers’
participation and representation in health and safety
matters.


6 CONCLUSIONS 


1. OHSAS 18001 implementation helps to improve
the following formal safety elements in which the
safety manager is involved: writing the safety
policy, describing the tasks of the personnel
in the safety area, and clearly determining the
responsibilities of the safety personnel.

2. OHSAS 18001 implementation helps to improve
the following real safety elements: dissemination
of the safety policy; the safety personnel
advising the top management on safety and
health questions; the safety manager thoroughly
instructing the personnel in safety matters; and
the safety personnel advising the top management
on how to allocate resources.

3. OHSAS 18001 implementation in the firm
helps to improve the following combined safety
elements: the safety manager compiles the initial
safety review, the safety training needs of the
personnel are determined, and workplace hazard
analyses are carried out.

4. The position of safety representative often has
a low status in the company. WERs do not have
enough time to fulfil their safety functions to
keep employees safe.
There is a limited understanding among employers
about the role of the WER. The study showed
that in small enterprises, the WER has only a formal
position, although it is required by law. In that
case, employers do not understand the need for
the WER and elect one only formally; the position
has no practical value and employees are often
unaware of it. The interviews also revealed that
it is difficult to find candidates for the WER
position even in larger companies, especially in
locally owned companies, as managers do not know
how to motivate workers to take on an additional
responsibility. The safety management system plays
a role in the effective work of WERs. If the
management does not give enough priority to OHS,
the employees will follow the example of the employer.
A WER should be elected from among peers rather
than using WERs from other departments.

5. The WER of the organization is not well known
or acknowledged by all the employers and
subcontractors. Subcontracted work may cause
several accident and near-accident situations.
The importance of the person (WER) who knows
how to deal with OHS problems becomes evident
only after an accident has occurred or some of
the workers are already seriously ill with an
occupational disease, such as musculoskeletal
disease (MSD). MSD is, at present, the number
one occupational illness in almost every European
country (Kaergaard & Andersen, 2000).

6. Fulfilling WER commitments successfully may
be complicated by conflicting expectations
from the employer and colleagues. The interviews
revealed that nobody in the enterprise wants
to be the one who resolves a risky situation or
even an accident. Therefore, it is particularly
important to prevent these situations by increasing
knowledge of OHS. Here, the WER and his/her
knowledge and activities are a very good solution.
It is important to mention that he/she needs
enough time to gather information on OHS and
that his/her activity has to be acknowledged by
the employer.


ABSTRACT: ISO 45001, published in 2018, is the first international ISO standard in the field of
occupational health and safety management systems. It is the result of an international consensus
on the subject, describes the best international preventive practices and incorporates the
requirements of a management system aligned with the so-called High-Level Structure of the ISO
management system standards. Simultaneously, ISO/IEC 31010:2009 provides guidance on the selection
and application of systematic techniques for risk assessment. However, this standard does not address
safety in a specific way. It is a generic standard for risk management and any reference to safety is
simply informative. Thus, the main objective of the present work is to identify and classify the main
techniques included in the ISO/IEC 31010:2009 standard that are applicable in the field of occupational
safety and in line with the requirements of ISO 45001. As a secondary objective, the work seeks, in the
same context, to identify the main non-standard techniques of a new or emerging nature. This
identification and classification has been carried out by means of a systematic review of the
specialized scientific literature. The results have been classified according to bibliometric indicators
into the following groups: (a) Main standard techniques for application to occupational safety; (b) Main
techniques developed through specific standards (such as IEC 61882:2016, Hazard and operability studies
[HAZOP studies] - Application guide; IEC 61025:2006, Fault Tree Analysis [FTA]); (c) Main non-standard
techniques of a new or emerging nature for application to occupational safety. Finally, the results of
this classification have been analyzed in order to determine the degree of coverage and standardization
of the main risk assessment techniques applied to occupational safety.


1 INTRODUCTION


The European Agency for Safety and Health at
Work, together with the International Labour
Organization, has estimated the cost of poor
occupational safety and health. The estimate
reveals (EU-OSHA, 2017): worldwide, work-related
injury and illness result in the loss of 3.9% of GDP,
at an annual cost of roughly € 2680 billion;
work-related illnesses account for 86% of all deaths
related to work worldwide, and 98% of those in the EU;
123.3 million DALY (disability-adjusted life years)
are lost globally (7.1 million in the EU) as a result of
work-related injury and illness. Of these, 67.8 million
(3.4 million in the EU) are accounted for by fatalities
and 55.5 million (3.7 million in the EU) by disability;
and in most European countries, work-related cancer
accounts for the majority of costs (€ 119.5 billion
or 0.81% of the EU’s GDP), with musculoskeletal
disorders being the second largest contributor.
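
The DALY figures quoted above can be cross-checked with simple arithmetic: the fatality and disability components should add up to the stated totals, globally and for the EU. The short sketch below does only that; the variable and function names are illustrative assumptions, and the numbers are the ones cited from EU-OSHA (2017).

```python
import math

# Figures as quoted in the text from EU-OSHA (2017), in millions of DALY.
total = {"global": 123.3, "eu": 7.1}
fatalities = {"global": 67.8, "eu": 3.4}
disability = {"global": 55.5, "eu": 3.7}


def components_match(region: str) -> bool:
    """True if fatalities + disability equals the stated total (to rounding)."""
    summed = fatalities[region] + disability[region]
    return math.isclose(summed, total[region], abs_tol=0.05)


for region in total:
    print(region, components_match(region))
```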

In order to combat the problem, ISO has developed
a new standard, ISO 45001, Occupational
health and safety management systems - Requirements,
that will help organizations reduce this
burden by providing a framework to improve
employee safety, reduce workplace risks and create
better, safer working conditions all over the world.
This standard follows other generic management
system approaches such as ISO 14001 and ISO
9001 (ISO, 2017).

The current version, ISO/DIS 45001.2:2017,
requires in its section 6.1.2.2 that the organization
establish, implement and maintain one
or several processes to assess the risks for safety
and health at work from the identified hazards,
taking into account legal requirements and other
requirements and the effectiveness of existing
controls (the concept of organization includes, among
others, a company, firm, enterprise, etc.). Nevertheless,
this standard does not specify the techniques
or methodologies to assess the occupational safety
and health risks.

Simultaneously, the standard ISO/IEC 
31010:2009 provides for the selection and applica- 
tion of systematic techniques for risk assessment. 

However, this standard does not address safety 
in a specific way. It is a generic standard for risk 
management and any reference to safety is simply 
informative. 

Thus, the main objective of the present work is to
identify and classify the main techniques included
in the ISO/IEC 31010:2009 standard that are
applicable in the field of occupational safety, in line
with the requirements for risk assessment included
in ISO 45001. (ISO/IEC 31010 is currently under
review; for the moment, ISO/IEC/DIS 31010:2017
is published as under development.)

As a secondary objective, it is sought in the 
same context to identify the main non-standard 
techniques of new or emerging nature. 


2 METHOD 


We conducted a systematic search in the occu- 
pational safety literature. A systematic review of 
the literature is typically based on a detailed and 
comprehensive plan and search strategy derived 
a priori in order to reduce bias (Uman, 2011). 
We aim to present an overview of techniques 
addressed in both quantitative and qualitative 
research on occupational safety, and their general 
direction (e.g. Cornelissen et al., 2017). Below, we 
will elaborate on our systematic selection process 
and analysis. 


2.1 Literature search 


As indicated by Goerlandt et al. (2017), in
traditional indexing systems such as Scopus and Web of
Science, risk analysis is not considered as a separate
category in the scientific research areas. Instead,
contributions related to risk are typically listed
under “mathematics”, “social sciences” or
“engineering”. Hence, general searches in those systems
on terms such as “risk analysis”, “validation” and
“QRA” result in many hits with low relevance to
the above stated aims.

Therefore, another review method has been
applied (November, 2017). We chose a literature
search using broad search terms as a starting
point. To this end, we selected as search criteria
the use of the keywords “risk” and “review” in the
journal title. In addition, we limited the search
to the “engineering” and “chemical engineering” fields
using the ScienceDirect databases. The results were the
following: Engineering field: 90 records published
in all years (1995-2018); Chemical engineering field:
48 records published in all years (1988-2017).

With these criteria, the principal journals identified
in the work of Reniers and Anthone (2012)
have been included, except the journal Risk Analysis.
These authors found that the most well-respected
journal by expert opinion was the Journal of Loss
Prevention in the Process Industries. However, taking
into consideration both the respondents’ results
and the citation-based results,
the Journal of Hazardous Materials is the most
influential journal, followed by Reliability Engineering
and System Safety, Risk Analysis, Accident
Analysis and Prevention and Safety Science.


2.2 Article selection 


The further selection of articles was performed in 
steps, as depicted in Figure 1. 


Records identified through database search: n = 138
(Engineering field n = 90; Chemical engineering field n = 48)
   -> Articles that did not meet the inclusion criteria: n = 99
      (Engineering field n = 70; Chemical engineering field n = 29)

Records for full text screening: n = 39
(Engineering field n = 20; Chemical engineering field n = 19)
   -> Articles that met the exclusion criteria: n = 27
      (Engineering field n = 14; Chemical engineering field n = 13)

Articles included in review: n = 6
(same articles for Engineering field and Chemical engineering field)


Figure 1. Flow diagram of study selection.

First, the titles were evaluated and articles that
did not meet the following inclusion criteria were 
marked: (a) describe safety in an occupational set- 
ting; (b) focus on the application, study or analy- 
sis of techniques of identification, analysis or risk 
assessment applied to occupational safety; (c) 
conducted in the construction, manufacturing, 
offshore, or petrochemical sector; (d) published 
in a peer-reviewed journal; and (e) be written in 
English. 

Second, the full texts of the articles obtained after
the previous step were analyzed.
In this regard, the following exclusion criteria were
applied: (a) only one specific technique is applied;
(b) several techniques are collected only by way of
example; (c) conducted on driving or transportation; (d)
conducted on risks in natural disasters.

Third and finally, the articles obtained were 
included in the review process. In this regard, both 
the Engineering field and the Chemical engineer- 
ing field obtained the same 6 results (Table 1). 
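
The counts reported for the three selection steps can be verified with a small bookkeeping sketch. The class and attribute names below are illustrative assumptions; the numbers are those given in Figure 1 and in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldFlow:
    """Record counts for one ScienceDirect subject field (names are illustrative)."""
    identified: int         # records returned by the database search
    failed_inclusion: int   # removed at title screening (inclusion criteria)
    met_exclusion: int      # removed at full-text screening (exclusion criteria)

    @property
    def full_text(self) -> int:
        return self.identified - self.failed_inclusion

    @property
    def included(self) -> int:
        return self.full_text - self.met_exclusion


flows = {
    "Engineering": FieldFlow(identified=90, failed_inclusion=70, met_exclusion=14),
    "Chemical engineering": FieldFlow(identified=48, failed_inclusion=29, met_exclusion=13),
}

total_identified = sum(f.identified for f in flows.values())
total_full_text = sum(f.full_text for f in flows.values())

for name, flow in flows.items():
    print(name, flow.full_text, flow.included)
print(total_identified, total_full_text)
```

Both fields end with six included articles; because these are the same six articles, the review covers six papers in total rather than twelve.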


2.3 Analysis


The analysis of the articles was developed as
follows.


a. Articles that contain a classification or set of
techniques related to standard ISO/IEC 31010:2009.
First, each article has been analyzed with
the aim of identifying those techniques that
coincide with the list of techniques included
in Annex A of ISO/IEC 31010:2009. Second,
those techniques that are developed through
specific standards have been identified. To do
this, the reference documents of the standard
have been analyzed (ISO/IEC 31010:2009 and
ISO/IEC/DIS 31010:2017). In addition, each
technique has been checked to see whether it has
been developed through any standard published by
any of the following standardization bodies:
International Organization for Standardization (ISO),
European Committee for Standardization
(CEN) and International Electrotechnical
Commission (IEC).

b. Articles containing other techniques not included
in ISO/IEC 31010. The main techniques
have been identified.


Table 1. Articles included in review (n = 6).

Nr  Paper citation            Field of application                   Journal

1   Kjellén and Sklet (1995)  Offshore industry                      SS
2   Tixier et al. (2002)      Risk analysis on an industrial plant   JLPPI
3   Marhavilas et al. (2011)  Work sites                             JLPPI
4   Dallat et al. (2017)      Safety management (risks that may      SS
                              lead to accidents)
5   Villa et al. (2016)       Chemical and process industries        SS
6   Yang et al. (2017a)       Petroleum activities                   SS

SS: Safety Science; JLPPI: Journal of Loss Prevention in
the Process Industries.


3 RESULTS 


The results presented in this section follow the
scheme used for the analysis of the articles
included in the review.


3.1 Main standard techniques for application 
to occupational safety 


Table 2 lists the relations between the techniques
identified in the reviewed articles (1-4) and
the techniques included in Annex A of ISO/IEC
31010:2009.

The following results can be observed in
Table 2: 18 techniques are collected in a single
reference (Nr Paper citation); 7 techniques are
collected in two references; 3 techniques are
collected in three references; and 3 techniques in four
references.


3.2 Main techniques developed through 
specific standards 


Those techniques that are developed through
specific standards are identified in Table 3.

Thus, 10 of the 31 techniques collected by ISO/
IEC 31010:2009 are identified. Comparing these
10 techniques with the results of Table 2, it can be
observed that 4 techniques are collected in a single
reference (techniques Nr 11, 12, 24 and 25); 1 technique
is collected in two references (13); 3 techniques are
collected in three references (6, 20 and 22); and 2
techniques (14 and 15) in four references.
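
The tallies reported in Sections 3.1 and 3.2 are a simple frequency count of how many reviewed papers include each technique. The sketch below reproduces that count from the numbers stated in the text; the dictionary names are illustrative assumptions, and the per-paper assignments themselves are not reconstructed here.

```python
from collections import Counter

# Technique Nr (Annex A of ISO/IEC 31010:2009) -> number of reviewed papers (1-4)
# that include it, for the 10 techniques with specific standards (Section 3.2).
references_per_technique = {
    11: 1, 12: 1, 24: 1, 25: 1,   # single reference
    13: 2,                        # two references
    6: 3, 20: 3, 22: 3,           # three references
    14: 4, 15: 4,                 # four references
}

# How many techniques appear in exactly k papers.
distribution = Counter(references_per_technique.values())

# Overall distribution reported for Table 2 (Section 3.1): k papers -> techniques.
table2_distribution = {1: 18, 2: 7, 3: 3, 4: 3}

print(dict(distribution))
print(sum(table2_distribution.values()))
```

The second check confirms that the Table 2 distribution accounts for all 31 techniques of Annex A.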

Of the 10 techniques listed in Table 3, all
except Reliability Centered Maintenance
(RCM) also have corresponding European
Standards (ENs) published by the European
Committee for Standardization (CEN, 2017).


Table 2. Relations between techniques identified in
reviewed articles and Annex A of ISO/IEC 31010:2009.

Nr  Annex A ISO/IEC 31010:2009

1   Brainstorming
2   Interviews
3   Delphi technique
4   Check lists
5   Preliminary Hazard Analysis (PHA)
6   Hazard and operability studies HAZOP
7   Hazard analysis and critical control points HACCP
8   Environmental Risk Assessment (ERA)
9   Structured what if technique SWIFT
10  Scenario analysis
11  Business Impact Analysis (BIA)
12  Root Cause Analysis (RCA)
13  Failure Modes and Effects Analysis (FMEA)
14  Fault Tree Analysis (FTA)
15  Event Tree Analysis (ETA)
16  Cause consequence analysis
17  Cause and effect analysis
18  Layers of Protection Analysis (LOPA)
19  Decision tree
20  Human Reliability Analysis (HRA)
21  Bow tie analysis
22  Reliability Centered Maintenance (RCM)
23  Short cut risk analysis (SCRAM)
24  Markov analysis
25  Monte Carlo analysis
26  Bayes analysis and Bayesian networks
27  F/N diagrams
28  Risk indices
29  Consequence likelihood matrix
30  Cost-benefit analysis
31  Multi criteria analysis (MCDA)

[The Y/N entries for Papers 1-4 are illegible in the source.]

Y (YES): The technique (Nr) is collected by the Paper (Nr)
N (NOT): The technique (Nr) is not collected by Paper (Nr).


Table 3. Techniques included in Annex A of ISO/
IEC 31010:2009 that are developed through specific
standards.

Nr  Annex A ISO/IEC 31010:2009               Specific standards

6   Hazard and operability studies HAZOP     IEC 61882:2016
11  Business Impact Analysis (BIA)           ISO/TS 22317:2015; ISO 22301:2012
12  Root Cause Analysis (RCA)                IEC 62740:2015
13  Failure Modes and Effects Analysis       IEC 60812:2006
    (FMEA)
14  Fault Tree Analysis (FTA)                IEC 61025:2006
15  Event Tree Analysis (ETA)                IEC 62502:2010
20  Human Reliability Analysis (HRA)         IEC 62508:2010
22  Reliability Centered Maintenance (RCM)   IEC 60300-3-11:2003
24  Markov analysis                          IEC 61078:2016; IEC 61165:2006;
                                             ISO/IEC 15909-1:2004
25  Monte Carlo analysis                     IEC 62551:2012;
                                             ISO/IEC Guide 98-3-SP1:2008


3.3 Main non-standard techniques of a new
or emerging nature for application to
occupational safety


Tables 4 and 5 list the techniques identified
in the reviewed articles (5-6).

Theoretical and practical limitations affecting
results of hazard identification suggest the need for
an improvement of current techniques (Paltrinieri
et al., 2016). Yang et al. (2017a) indicated that several
aspects of operational decision making create a need
for different risk analyses compared to the traditional
Quantitative Risk Analysis (QRA) with principles
and guidelines described in standard ISO 31000:2009,
which are developed more for design purposes.

Yang et al. (2017a) reviewed 11 risk influence
frameworks that integrate organizational and
human factors in a structured way (Table 4). The
intention was to evaluate how these frameworks
and identified Risk Influencing Factors (RIFs) can
be used for activity related risk analysis.


Table 4. Risk influence frameworks (techniques) that
integrate organizational and human factors in a
structured way (adapted from Yang et al., 2017a).

Technique                                Paper citation                  Field of application

Model of Accident Causation using        Embrey (1992)                   General
Hierarchical Influence Network
(MACHINE)
WPAM                                                                     Nuclear
System Action Management (SAM)           Paté-Cornell and Murphy (1996)  General
ω-factor model                           Mosleh et al. (1997);           Nuclear
                                         Mosleh et al. (1999)
Integrated Risk (I-RISK)                 Papazoglou et al. (2002)        Chemical
Organizational Risk Influence Model      Øien (2001a, 2001b)             Oil & Gas
(ORIM)
Barrier and Organizational Risk          Aven et al. (2006)              Oil & Gas
Analysis (BORA)
RISK_OMT                                 Vinnem et al. (2012)            Oil & Gas
Hybrid Causal Logic (HCL)                Røed et al. (2009)              Oil & Gas
Socio-Technical Risk Analysis            Mohaghegh et al. (2009);        General
(SoTeRiA)                                Mohaghegh and Mosleh (2009)
Phoenix                                  Ekanem et al. (2016)            General

Real-time data and periodical risk evaluation
may be considered a key improvement to allow
for effective decision-making support (Bucelli et al.,
2017). However, being static, QRA precludes any
update and integration of the overall risk
figures to reflect the ever-changing real-world
environment or later improvements based on new
risk notions. To overcome this limit, during the
last decade several efforts have been devoted to the
development of novel approaches to risk assessment
and management that can consider
the dynamic evolution of conditions, both internal
and external to the system, affecting risk assessment
(Paltrinieri and Scarponi, 2014). The main purpose
of dynamic risk assessment is the development of
an appropriate technique that allows for the effective
aggregation of heterogeneous information and
provides risk estimation over time reflecting the
current condition of the system (Yang et al., 2017b).
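
To illustrate what "risk estimation over time" can mean in practice, the following minimal sketch updates an estimated failure probability as new observations arrive, using a generic conjugate Beta-Binomial update. It is an assumption-laden illustration, not the procedure of any of the cited methods: the prior, the observation periods and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaBelief:
    """Beta(alpha, beta) belief about a failure-on-demand probability."""
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def update(self, failures: int, demands: int) -> "BetaBelief":
        """Conjugate update with `failures` observed in `demands` trials."""
        if not 0 <= failures <= demands:
            raise ValueError("failures must be between 0 and demands")
        return BetaBelief(self.alpha + failures, self.beta + demands - failures)


# Hypothetical static estimate (prior mean 0.01) and hypothetical
# per-period observations as (failures, demands).
prior = BetaBelief(alpha=1.0, beta=99.0)
observations = [(0, 50), (2, 50), (1, 50)]

belief = prior
history = [belief.mean]
for failures, demands in observations:
    belief = belief.update(failures, demands)
    history.append(belief.mean)

print([round(p, 4) for p in history])
```

A static analysis would keep the first value for the whole lifetime of the system, whereas the updated sequence moves with the observed evidence.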

Villa et al. (2016) reported a brief description
of the most relevant methodologies and applications
of dynamic approaches to risk analysis in the
chemical process industry.

Table 5 shows a list of these dynamic methodologies.
These methodologies (or techniques) have
evolved over time, as Villa et al. (2016)
describe in detail. The techniques
collected in Table 5 are therefore linked to the most
recent citations according to Villa et al. (2016).


Table 5. List of the most relevant methodologies
(techniques) of dynamic approaches to risk analysis in the
chemical process industry (adapted from Villa et al.,
2016).

Technique                               Paper citation

Dynamic Risk Assessment                 Abimbola et al. (2014); Kalantarnia et al.
Methodology (DRA)                       (2010, 2009); Khakzad et al. (2013a)
Dynamic Procedure for atypical          Paltrinieri et al. (2011, 2013a, 2013b)
scenario identification (DyPASI)
Coupling of DRA and DyPASI              Paltrinieri et al. (2014b, 2014c, 2013a,
                                        2013b)
Dynamic risk assessment with            Khakzad et al. (2013b, 2011)
Bayesian networks
Risk barometer                          The Center for Integrated Operations in the
                                        Petroleum Industry (Hauge et al., 2015;
                                        Paltrinieri and Hokstad, 2015;
                                        Paltrinieri et al., 2015, 2014a)


4 ANALYSIS AND DISCUSSION 


The results obtained can be grouped into two 
groups of techniques for application in the field of 
occupational safety: standardized and non-stand- 
ard techniques of a new or emerging nature. 

In relation to the standardized techniques, of the
31 techniques included in Annex A of ISO/IEC
31010, the following are among the
most used: HAZOP, SWIFT, HRA, FTA, ETA
and RCM. In turn, these techniques, except
SWIFT, are developed through specific standards
(IEC-EN).

Regarding non-standard techniques of a new or
emerging nature, new risk influence frameworks as
well as new dynamic techniques are observed.

Among the former, the following can be cited
as the most recent (last decade): HCL (2009),
RISK_OMT (2012), SoTeRiA (2009) and Phoenix
(2016). Among the dynamic techniques, the following
stand out: DRA, DyPASI, DRA-DyPASI and
Risk barometer.

However, the foregoing results should
be considered an approximation in the framework
of occupational safety. This is due to
the characteristics of the method followed,
which entail three important limitations.
The first limitation is the use of a single
database; the second is the inclusion and exclusion
criteria used; and the third is
the lack of differentiating and applicative criteria
between the techniques used in occupational
safety and those used in safety linked to major
accidents.

In relation to the first limitation, the method has
allowed analyzing the main journals that emerge from
the work of Reniers and Anthone (2012). However,
these authors analyzed a total of 35 representative
safety journals, of which only 11 have been
analyzed in the present work.

Regarding the inclusion and exclusion criteria
used, they prevent a broader and deeper analysis of
the evolution of standardized risk assessment
techniques through the scientific literature.

As for the third limitation, this work has not
directly analyzed the dividing line that exists between
occupational safety and safety linked to major
accidents (this aspect can be observed, for example,
in Table 5). It is evident that both
branches of safety are interconnected and
therefore share risk assessment techniques with the
aim of avoiding accidents in processes and industries.

However, the dividing line between these
safety branches is diffuse, so it would be advisable
to deepen the analysis of differentiating and
applicative criteria. The differentiating criteria
could allow a structured and interconnected risk
and loss analysis. These criteria could be aligned
with the results of EU-OSHA (2017). With
the applicative criteria, the scope, use, complexity,
strengths and limitations of each technique in the
field of occupational safety could be defined.


5 CONCLUSIONS 


This work achieved its main objective of identifying
and classifying the main techniques included
in the ISO/IEC 31010:2009 standard that are applicable
in the field of occupational safety. These
techniques could be compatible with the current
version ISO/DIS 45001.2:2017 (section 6.1.2.2).

In addition, a secondary objective has also been
achieved, that is, to identify the main non-standard
techniques of a new or emerging nature (especially
those with dynamic characteristics).

However, three important limitations have been
identified that can point the direction of future
research. Such research could focus on two objectives.
The first objective is to analyze
in greater depth the impact and degree of
development of standardized techniques and
non-standard techniques of a new or emerging nature.
To do this, the method used must be modified
to obtain results with greater representativeness.
This modification should include an update of the
results obtained by Reniers and Anthone (2012).
The second objective is the
analysis of differentiating and applicative criteria
between the techniques used in occupational
safety and those used in safety linked to major
accidents.

In any case, the final publications of ISO 45001
as well as ISO/IEC 31010 should be considered,
both publications being foreseen for 2018.


Indicator on the performance of barriers against fatal accidents in construction


U. Kjellén 
NTNU, Trondheim, Norway 


ABSTRACT: The paper presents work to develop a safety performance indicator suitable for real-time 
management of major accident hazards in construction. Data on 60 fatal accidents in the period 2011-2016,
resulting in 63 fatalities, have been analysed. About 70% of the accidents belonged to three main
categories: fall from height, driver or person outside the cabin killed by moving construction machine/ 
vehicle, and person killed by load or equipment during material handling. The three main categories have 
been further divided into subcategories (seven in all) and analysed to identify barriers to prevent adverse 
consequences. This analysis has resulted in checklists, one for each subcategory. They list observable 
conditions at a construction site that, if found substandard, will indicate that one or more of the important 
barriers are seriously deteriorated. The paper highlights the results of the accident concentration and 
barrier analyses. It also reviews remaining work to develop and test the performance indicator. 


1 INTRODUCTION 


Construction activities are characterized by the 
management of large amounts of energy such as 
in transportation, excavation, assembly, work at 
height etc. Loss of control of the energy flow may 
have major consequences. Natural hazards (rock 
fall, land slide etc.) represent significant additional 
risks. The statistics on severe accidents in construc- 
tion reflect these conditions. According to ILO 
estimates, the fatal accident risk in construction is 
five times the average among workers worldwide 
(Murie 2007). Statistics from Norway for 2009-2014
show a fatal accident frequency rate of three
times the general average for workers (Norwegian 
Labour Inspection Authority 2015). 

Construction work is organised in projects with 
a limited duration. A project goes through differ- 
ent phases from site establishment and excavation 
to installation and completion, and the conditions 
at site and the activities change accordingly. Tra- 
ditional safety management using performance 
measurements (such as the TRI rate) and feedback 
control is inadequate, due to lagging characteris- 
tics of most current safety performance indicators 
(Kjellén 2009, Lingard et al. 2017). There is a need 
for indicators that provide real-time data on safety 
performance to ensure timely feedback for control 
of safety performance (Kjellén 2018). 

Behavioural sampling was developed in the
1950s to meet this requirement (Rockwell 1959).
The method uses observations on deviations
from safe work practices and conditions as data
input. The “TR safety monitoring method”
represents an application of behavioural sampling
to the construction industry (Laitinen et al. 1999,
Laitinen & Päivärinta 2010). Experiences show
that the method produces reliable and valid results
related to the prevention of ordinary occupational
accidents.

There is a general lack of ‘real-time’ perform- 
ance indicators suitable for the control of hazards 
in construction with fatal accident potential. The 
principles behind the “barrier performance
indicators” developed by the process and oil and gas
industries for the prevention of fires and
explosions offer such an opportunity (Health and Safety
Executive 2006, OGP 2011). The indicators
measure the compliance of safety barriers with a standard.

Accident statistics indicate that this approach 
may be valid for the construction industry, despite 
the high variety of activities in the industry. A 
relatively small number of types of central events 
according to the bow-tie accident model, each 
representing the loss-of-control of significant 
amounts of energy, account for a substantial share 
of the fatal accidents in construction (Visser 1998, 
Swuste et al. 2012). By identifying barriers to pre- 
vent these central events and/or reduce their con- 
sequences, input may be provided to performance 
indicators on the risk of fatal accidents in con- 
struction projects. 

Experiences from two case studies support the
validity of this approach. An Indian hydropower
project reported eight fatalities due to road
departures and falling rocks in tunnels during the 2.5 years
after start-up (Kjellén 2012). The project
implemented improved safety routines directed at
barriers to prevent these types of accidents. It was
completed three years later with one additional
fatality not related to these types of events.

A large international construction contrac- 
tor identified six dominating concentrations of 
fatal accidents in their operations world-wide 
(A. Berglund, personal communication, Nov. 17, 
2017). These included falls from height, conflict 
between human and machine, structural failure, 
lifting operations (falling objects), fire/explosions, 
and electric arcs. The contractor implemented life- 
saving rules directed at barriers to prevent these 
types of accidents. They experienced a reduction 
in the frequency of the affected types of fatal 
accidents by 60% in a five-year period after the 
intervention, compared to the previous five-year 
period. The number of construction workers was 
about the same in the two periods. 


1.1 This paper 


The paper presents research that is part of an
ongoing project for the construction industry. Its aim
is to develop safety performance indicators that 
are better suited for the management of safety by 
client and contractor companies than the lagging 
safety performance indicators in use today. 

The research presented here focuses on devel- 
oping a performance indicator of the availability 
of barriers against hazards with fatal accident 
potential. The intention is to provide data in ‘real 
time’ and thereby allow the involved companies to 
accomplish effective feedback control of the fatal 
accident risk at construction sites. 

The paper highlights the first part of the 
research, aiming at identifying dominating fatal 
accident concentrations in the Norwegian con- 
struction industry and critical barriers that will 
prevent the relevant types of accidents from occur- 
ring. In the subsequent steps, these results will be 
used as input to the development of indicators for 
barrier availability for use by safety practitioners 
in the industry. These indicators will be tested and 
evaluated in intervention studies in construction 
projects. 


2 MATERIAL AND METHOD 


2.1 Sources of data 


The analysis consists of two parts, an accident 
concentration analysis and a barrier analysis. A 
set of 60 fatal accidents resulting in 63 fatalities 
in the Norwegian construction industry between 
2011 and 2016 represents the source data for the 
accident concentration analysis. The Norwegian
Labour Inspection Authority (NLIA) provided the
data from their general register of fatal occupational
accidents in Norway.

The accident data was documented in a spread- 
sheet with the following information on each acci- 
dent: date of the accident, registration number, 
type of construction business (classification), type 
of activity (classification), number of people killed, 
type of injury (classification), accident type (classi- 
fication), free text resumé of the sequence of events. 

In the subsequent barrier analysis, data from 
NLIA was supplemented with data from other 
sources: 


— Observations at three construction sites and 
interviews with senior construction managers 

— The author’s library of in-depth investigations 
into fatal accidents and high potential incidents 
that fall within any of the accident concentra- 
tions selected for further analysis 

— Lifesaving rules developed by major construc- 
tion contractors 

— Relevant regulatory requirements in Norway 


The results were reviewed in workshops by high- 
level safety experts from construction client com- 
panies, contractors and an organisation of regional 
safety representatives. 


2.2 Methods 


2.2.1 Accident concentration analysis 

An accident concentration analysis aims at iden- 
tifying clusters of ‘accident repeaters’ with com- 
mon characteristics (Kjellén & Albrechtsen 2017). 
By directing preventive measures at a selection of 
dominating accident concentrations, a significant 
risk reduction is expected. 

The analysis is carried out stepwise in sev- 
eral dimensions to identify accidents with com- 
mon characteristics. A natural starting point for 
the analysis of fatal accidents was to group the 
accidents according to the main type of energy 
involved. 

In the current study, the dataset was first analysed
by accident type according to NLIA’s classification.
For each accident type, the free-text descriptions of
the accidents were reviewed to identify common
patterns. A new, composite classification of the
accidents was then developed, in which each
‘accident concentration’ was made up of at least
three fatal accidents with common characteristics.
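
The grouping rule can be illustrated with a minimal Python sketch. It assumes that each accident has already been given a composite label during the free-text review. The record fields (`id`, `label`), the example labels and the function name `find_concentrations` are hypothetical and are not taken from the NLIA register; only the threshold of at least three accidents per concentration comes from the text.

```python
from collections import defaultdict

MIN_ACCIDENTS = 3  # threshold from the text: at least three fatal accidents


def find_concentrations(accidents, min_size=MIN_ACCIDENTS):
    """Split labelled accidents into concentrations and a remainder.

    `accidents` is a list of dicts with the hypothetical keys "id" and
    "label". Labels shared by at least `min_size` accidents form a
    concentration; all other accidents are returned as the remainder.
    """
    groups = defaultdict(list)
    for record in accidents:
        groups[record["label"]].append(record["id"])

    concentrations = {
        label: ids for label, ids in groups.items() if len(ids) >= min_size
    }
    remainder = sorted(
        acc_id
        for label, ids in groups.items()
        if len(ids) < min_size
        for acc_id in ids
    )
    return concentrations, remainder


# Illustrative records only, not data from the study.
accidents = [
    {"id": 1, "label": "fall through roof"},
    {"id": 2, "label": "fall through roof"},
    {"id": 3, "label": "hit by vehicle"},
    {"id": 4, "label": "fall through roof"},
    {"id": 5, "label": "drowning"},
    {"id": 6, "label": "hit by vehicle"},
]

concentrations, remainder = find_concentrations(accidents)
print(concentrations)  # {'fall through roof': [1, 2, 4]}
print(remainder)       # [3, 5, 6]
```

In this sketch the remainder plays the role of a residual category; in practice the labels themselves would be revised iteratively as the free-text descriptions are reviewed.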


2.2.2 Barrier analysis 
The barrier analysis is rooted in the principles of 
defence in depth and the energy model of loss cau- 
sation (Rasmussen 1993, Haddon 1980). 

A model of the accident sequence encompassing
three successive phases is shown in Figure 1
(Kjellén & Larsson 1981). Nine of Haddon’s ten
accident prevention strategies are introduced in
the model as barrier functions. Each of these may,
dependent on the type of accident, have the capacity
to change the sequence of events and thereby
eliminate or reduce loss.

[Figure 1: diagram of the nine barrier functions along the accident sequence]
1. Prevent build-up of energy
2. Modify the characteristics of the energy
3. Limit the amount of energy
4. Prevent the uncontrolled release of energy
5. Modify the rate and distribution of energy
6. Separate the source of energy and the target in time or space
7. Separate by means of physical barriers
8. Improve the target's ability to endure an energy flow
9. Limit the development of energy or damage

Figure 1. Barrier intervention in the accident sequence
to eliminate or reduce loss (Kjellén & Albrechtsen 2017).

The function of a barrier is realised through a 
barrier system. Whereas passive barriers consist of 
a physical element, active barriers are more com- 
plex with several elements including a control sys- 
tem. One or more human operators may be part of 
the control loop. 

Input data to the barrier analysis consisted of 
the results of an analysis of accident concentra- 
tions among fatal accidents in construction in 
Norway between 2011 and 2016. For each acci- 
dent concentration, barriers that are critical in 
preventing accidents in the accident concentration 
in question were identified. Next, the required
performance of the elements that make up each
identified barrier was determined.


3 RESULTS 


3.1 Accident concentration analysis 


The distribution of the 60 fatal accidents in NLIA’s 
database by accident type is shown in Table 1. This 
analysis represents the starting point for the identi- 
fication of accident concentrations. 

Falls represent the most common accident type
in NLIA’s database, followed by squeezed/caught
and collision/hit by vehicle.

The analysis revealed some inconsistencies in 
the accident type classification by NLIA. A2 and 
A3 include accidents involving vehicles, but similar 
vehicle related accidents were classified as category 
A4. 

The accident concentration analysis identified
12 categories of ‘accident concentrations’, covering
93% of the fatal accidents in the material
(Figure 2). The category ‘miscellaneous’ made up
the remaining 7%.


Table 1. Distribution of accidents in NLIA’s database
on fatal accidents in construction in the period
2011-2016 (N = 60). The table also shows a summary
description of accidents with common characteristics
for each accident type.

Accident type (NLIA)             # of events   Coarse description of accidents

A1  Hit by object                9             Hit by load or machinery part during
                                               material handling, structural collapse,
                                               falling rock, flying object (bolt from
                                               pistol)

A2  Collision, hit by vehicle    11            Road/work area departure by vehicle
                                               (incl. mobile machine), collision, hit
                                               by vehicle

A3  Roll-over                    3             Roll-over of vehicle

A4  Squeezed or caught           13            Hit by vehicle, road/work area departure
                                               by vehicle, roll-over of mobile machine,
                                               squeezed by lifting/transportation
                                               equipment during operation, rock fall,
                                               collapsing trench, structural collapse

A5  Fall                         19*           Electric current through body (resulting
                                               in fall), fall from height (roof, deck,
                                               scaffold, ladder, machinery/equipment)

A7  Electric voltage             2             Electric current through the body from
                                               electric installation or hand tool

A10 Explosion, fire              3**           Accidental blast, explosion

* Two persons killed in one fall accident, ** Three persons
killed in one blast accident.


[Figure 2: bar chart; axis: number of fatal accidents]

Figure 2. Distribution of fatal accidents by ‘accident
concentration’ (N = 60).

Falls from height accounted for 28% of the
accidents, and 23% involved vehicles (including
mobile construction machinery). If we add material
handling by crane, conveyor belt, truck etc., these
categories together cover two thirds of the fatal accidents.

Accident concentrations with four or more 
fatalities were selected for further analysis. These 
included: 


— Road or work area departure by vehicle: The 
driver was killed due to loss-of-control of the 
vehicle followed by departure from a road or 
construction area. 

— Fall from or through roof or deck: The person 
being killed fell either outside the edge of a roof 
or deck or through an opening or weak point
of the roof/deck. 

— Fall from machinery or equipment: The person 
being killed fell when moving or working on 
machinery or equipment. Falls from ladders were
introduced as a special case. 

— Roll over of vehicle (or machine during trans- 
fer): The driver/operator has been killed due 
to roll-over during unloading, during transfer 
(when a machine behaved as a vehicle), or when 
a parked vehicle has accidentally started to move.

— Hit by vehicle: The person being killed was 
present in the danger zone of a vehicle in motion 
and the driver was either not aware of the per- 
son or lost control of the vehicle. 

— Hit by falling object during material handling: 
The fatalities occurred during crane handling 
or unloading of truck or trailer. The accidents 
involved loss of control of load or sudden, 
uncontrolled movements of equipment. 

— Squeezed by movements of personnel lifts or 
material handling equipment: The person being
killed was squeezed because of uncontrolled
machinery movements, because of missing machine
guarding, or because the operator was not aware
of the person being in the danger zone.


3.2 Barrier analysis 


A basic assumption in this research project is
that barriers are additional measures implemented
in a production system to achieve a tolerable level
of risk. This means that the basic design of the
work system in which the fatal accidents occurred
has not been questioned. Surface transportation,
for example, was not considered as a barrier
function to avoid being hit by falling objects during
material handling. For this reason, barrier types 1
and 2 have been excluded from Table 2.

The barrier analysis has been based on some 
additional assumptions: 


3. Limit the amount of energy: Has been included 
in barrier function no. 8 for falls from height 
(use of fall arrest system). 


Table 2. Results of the barrier analysis. A barrier type
marked “X” has a function that is considered essential
to prevent fatalities in the accident concentration in
question.


Type of barrier
(see Figure 1):


Accident concentration 3456789


Road/work area departure by vehicle XX XXX

Fall from or through roof or deck XXX X

Fall from machinery or equipment XXX X

Fall from ladder XXX

Roll over of vehicle (or machine during transfer) XX X

Hit by vehicle X X

Hit by falling object during material handling X X

Squeezed by movements of lifting and material handling equipment X XX


4. Prevent uncontrolled release of energy: This
barrier function is relevant to all analysed main
types of accident concentrations. It is not
relevant, however, to the subsets of accidents in
which a machine operator was not aware of
personnel being present in the danger zone.

5. Modify the rate and distribution of energy: This
applies to the use of a safety belt in case of road/
work area departure or roll-over, and to the use
of a fall arrest system.

6. Separate the source of energy and the target in 
time or space: This barrier function is relevant 
to vehicle and material handling related acci- 
dents. A possible exception is sudden machin- 
ery movements due to mechanical failure, which 
could not reasonably have been foreseen. 

7. Separate by means of physical barriers: This
barrier function is relevant to all types of accident
concentrations. In practice, implementing this
type of barrier is constrained by feasibility
considerations.

8. Improve the target’s ability to endure an energy 
flow: This barrier function is implemented 
through personal protective equipment. It is 
especially relevant to falls, when collective meas- 
ures (physical barriers) are not applied. Use 
of helmet has limited effect in case of falling 
objects with high energy. 

9. Limit the development of energy or damage: The
general organisation of emergency response at
site falls outside the scope of this analysis.
Preparedness directed at specific types of hazards
has been included, such as the use of a mobile
machine close to a waterway. In this case, the
barrier relates to preparedness to save the driver
from drowning in case of departure from road or work area.


Table 3. Requirements related to barrier elements for
each of three barriers essential to prevent fall from ladder.

Limit the amount of energy:
• Ladder used as an exception in lack of feasible alternative?
• Ladder used for a height difference to solid ground < 6 m?

Prevent uncontrolled release of energy:
• Ladder checked and found in adequate condition?
• Ladder positioned on solid ground and secured from tilting?
• Ladder rises > 1 m above roof or landing?

Modify the rate and distribution of the energy transfer:
• Use of safety harness and life line hooked to a solid attachment point


Table 4. Requirements related to barrier elements for
each of two barriers essential to prevent personnel being
hit by vehicle.

Prevent uncontrolled release of energy:
• Vehicle is certified and regularly checked and maintained?
• Driver instructions regarding controlled operation of the vehicle available?
• Adequate training and certification of driver?
• Foundation of the operating area of the vehicle adequately stable,
  inclination and friction satisfactory to ensure full operator control?

Separate the source of energy and the target in time or space:
• Areas for transportation and loading/unloading separated from work
  areas and pedestrian traffic?
• Driver has full overview from the cabin of the operating zone?
• Driver instructions to have control of the operating zone for other personnel?
• Adequate operator training and certification?
• Site instructions and training regulating avoidance of the operating
  zone of mobile machines? Adequately supervised and respected?
• Use of highly visible uniforms by all personnel at the site?
• Adequately illuminated work areas and roads?


The next step in the analysis has aimed at iden- 
tifying performance requirements of barrier ele- 
ments necessary for the applicable barrier function 
to be effective. These requirements provide the 
basis for the barrier performance indicator, which 
will measure average percent compliance with the 
requirements. The results of this analysis are illus- 
trated by two examples in Table 3 and Table 4. 


4 FURTHER WORK 


Work is under way to develop and operationalise 
the identified performance requirements in coop- 
eration with Client and Contractor personnel of a 
large infrastructure construction site. The require- 
ments for successful performance of the different 
barrier elements will be scrutinized by site person- 
nel in cooperation with the researchers and further 
developed to meet certain quality criteria (Kjellén 
& Albrechtsen 2017, Laitinen et al. 1999): 


— Flexible for use at different types of construc- 
tion sites, 

— Transparent and easily understood by site 
personnel, 

— Address site conditions that are observable and 
quantifiable, 

— Sensitive to changes in the safety standard at 
construction sites, 

— Produce reliable results when applied by different 
observers, 

— Robust against manipulation. 


The task of the observers will be to identify 
work at the construction site involving any of the 
seven identified major accident hazards. Next, the 
observers will use the relevant checklist to review 
the status of the different barrier elements and 
classify them as correct, incorrect or not
relevant. Incorrect barrier elements will be qualified
through a short description. The results will be
summarized as the percentage of the checked barrier
elements that are correct. This can be shown in total
and for each type of major accident hazard.
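
As an illustration of how such a summary could be computed, the following Python sketch tallies checklist observations in total and per hazard. The function name `percent_correct`, the data layout and the example observations are hypothetical. The sketch also assumes that elements classified as not relevant are left out of the denominator, which the text does not state explicitly.

```python
from collections import defaultdict

VALID = {"correct", "incorrect", "not relevant"}


def percent_correct(observations):
    """Return (overall, per_hazard) percentages of correct barrier elements.

    `observations` is a list of (hazard, status) pairs. Elements classified
    as "not relevant" are left out of the denominator (an assumption).
    A hazard with no relevant elements yields None.
    """
    counts = defaultdict(lambda: {"correct": 0, "incorrect": 0})
    for hazard, status in observations:
        if status not in VALID:
            raise ValueError(f"unknown status: {status!r}")
        entry = counts[hazard]  # register the hazard even if nothing is relevant
        if status != "not relevant":
            entry[status] += 1

    def share(correct, incorrect):
        checked = correct + incorrect
        return 100.0 * correct / checked if checked else None

    per_hazard = {h: share(c["correct"], c["incorrect"]) for h, c in counts.items()}
    total_correct = sum(c["correct"] for c in counts.values())
    total_incorrect = sum(c["incorrect"] for c in counts.values())
    return share(total_correct, total_incorrect), per_hazard


# Illustrative observations only, not data from the study.
observations = [
    ("fall from ladder", "correct"),
    ("fall from ladder", "incorrect"),
    ("fall from ladder", "correct"),
    ("hit by vehicle", "correct"),
    ("hit by vehicle", "not relevant"),
    ("hit by vehicle", "incorrect"),
    ("hit by vehicle", "correct"),
    ("hit by vehicle", "correct"),
]

overall, per_hazard = percent_correct(observations)
print(round(overall, 1))                         # 71.4
print(round(per_hazard["fall from ladder"], 1))  # 66.7
print(round(per_hazard["hit by vehicle"], 1))    # 75.0
```

Pooling all checked elements gives hazards with many observations more weight in the total; averaging the per-hazard percentages instead would be an alternative design choice.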

To produce reliable results, characteristics of 
the various barrier elements that separate correct 
from incorrect must be defined in the checklists 
to the extent needed when used by experienced 
personnel. In addition to physical observations of
the work, the observers will have to consult available
documentation and conduct interviews to check
items such as operator training and qualifications
and the maintenance standard of vehicles. This makes
the method resemble a system audit more than an
inspection, although it does not fulfil the
requirement for independence.

The next step of the research project will include 
testing and evaluation of the barrier performance 
indicator and underlying requirements for barrier
elements in routine safety practice. A systematic 
approach will be applied in assessing the opera- 
tional experience of the intervention itself, and 
various stages of outcome, including degree of 
implementation and immediate, intermediate and 
end results (Shannon et al. 1999). It will not be 
possible to monitor any effects on the fatal acci- 
dent rate by this trial. 


5 CONCLUSIONS 


The work presented in this paper represents the 
first phase in developing a real-time performance 
indicator aiming at managing the risk of fatal 
accidents. 

An important prerequisite has been the existence
of dominating fatal accident concentrations in
construction that may be prevented through a few
well-defined barriers. This will allow the development
of an indicator based on requirements for barrier
availability that is adequately comprehensive yet still
practicable. Results of the accident concentration
analysis show that this prerequisite generally is satisfied.
There exists variation between individual accidents 
in the concentrations that are more complex. It is 
necessary in the further development and operation- 
alisation of the requirements to check for these vari- 
ations to ensure that the prerequisite is satisfied. 

Another critical issue is whether the data used in
this analysis are representative of the fatal accident
risk in construction in Norway. The period and
number of events covered by the analysis, and the
consistency of the results with findings in international
research and safety practice referred to in this
paper, suggest that they are.

The ultimate test of the viability of the proposed 
performance indicator will take place through integration
in safety practice at selected construction
sites. Here it will be possible to evaluate the indica- 
tor against predefined criteria. Only extensive and 
long-term use will validate the performance indica- 
tor as an efficient tool in preventing fatal accidents. 
ABSTRACT: The present study compares professional seafarers and private leisure boat users in
Norway and Greece. The aims of the present study are to examine the safety behaviours related to 
personal injuries and accidents among these groups and to study the factors influencing these behaviours. 
This will serve as a backdrop to a general discussion of why the level of fatalities is higher among private 
boat users than among professional seafarers and what the former may learn from the latter. The study is 
based on surveys to crew members on Norwegian and Greek cargo and passenger vessels and leisure boat 
users in Norway and Greece. Our study indicates that while unsafe behaviours related to work pressure 
and risk taking are important among professional seafarers (i.e. risk acceptance and violations), unsafe 
behaviours related to the leisure/holiday situation are important for the leisure boat users (i.e. alcohol
use while driving a boat). Additionally, we discuss how the situation of private leisure boat users is less 
regulated than that of professional seafarers. Our study indicates that both in the professional and the 
private setting, norms for interaction and conduct seem to be influenced by norms and expectations 
rooted in different socio-cultural groups, e.g. the national culture, the specific sector in question, the 
organisations and in peer groups. 


1 INTRODUCTION


1.1 Background


Maritime transport is a substantial part of world
trade, as approximately 90% of the goods traded
worldwide are transported by sea. Although safety
improvements have led to a significant decrease in
the mortality rates of seafarers in recent decades,
seafaring is still termed one of the most hazardous
occupations (Oldenburg & Jensen 2012). In a
questionnaire study including 6461 participants in
11 countries, Jensen et al. (2004) found that during
the latest tour of duty, 9% of all seafarers were
injured and 4% had an injury with at least 1 day
of incapacity. According to Nevestad et al. (2015)
there were on average 15 killed and 424 injured
annually on Norwegian ships, i.e. ships in the
Norwegian Ordinary Ship Register (NOR) and the
Norwegian International Ship Register (NIS), in the
period 2004-2013. At EU level, in the period
2011-2016, there were on average 100 fatalities and
935 injuries annually reported in the European
Marine Casualty Information Platform (EMCIP)
(EMSA, 2017).

In Norway, as in many other countries, the 
number of persons killed using different transport 
modes has decreased since 1970. This is especially 
true for road transport, but it also applies to
professional seafarers (Nevestad et al 2015). Leisure
boating, on the other hand, has not had the same
positive development, and in Norway the number
of deaths involving leisure boats has more or less
stayed the same since 1995 (with some annual
variations).

In 2006-2015, on average 33 people died each
year in recreational boating accidents in Norway,
which is about 0.65 persons per 100 000 inhabitants.
Approximately 4.4 persons have died per
100 000 vessels (registered and not). More than
90% of these persons are men, and a majority
of the victims were not wearing personal flotation
devices (PFDs): only 20-30% of the people found
dead were wearing a lifejacket. Most of the accidents
happen during the summer months, when the boats
are in more use.
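
As a rough plausibility check on the two rates quoted above, the following Python sketch back-calculates the population and fleet sizes they imply. The function name `implied_denominator` is hypothetical, and the resulting denominators are inferred from the rounded figures in the text; they are not values reported by the authors.

```python
def implied_denominator(events_per_year, rate_per_100k):
    """Back-calculate the population (or fleet) size implied by a rate."""
    return events_per_year / rate_per_100k * 100_000


deaths_per_year = 33     # average annual deaths 2006-2015, from the text
rate_inhabitants = 0.65  # deaths per 100 000 inhabitants, from the text
rate_vessels = 4.4       # deaths per 100 000 vessels, from the text

implied_inhabitants = implied_denominator(deaths_per_year, rate_inhabitants)
implied_vessels = implied_denominator(deaths_per_year, rate_vessels)

print(f"{implied_inhabitants:,.0f} inhabitants")  # roughly 5.1 million
print(f"{implied_vessels:,.0f} vessels")          # roughly 750 000
```

Both implied denominators are of a plausible order of magnitude for Norway, which suggests that the two rates are consistent with the stated average of 33 deaths per year.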


1.2 Aims 


The present study compares professional seafarers 
and private leisure boat users in Norway and 
Greece. 

The aims of the study are to examine the safety 
behaviours related to personal injuries and acci- 
dents among these groups and to study the factors 
influencing these behaviours. This will serve as a 
backdrop to a general discussion of why the level 
of fatalities is higher among private boat users than 
among professional seafarers and what the former 
may learn from the latter. 

The data in this study have been collected as 
part of the SafeCulture project, which is funded by 
the Norwegian Research Council, and undertaken 
by the Institute of Transport Economics—TOI 
(Norway) and the National Technical University 
of Athens—NTUA (Greece). 

The research on safety culture suggests that if
we are to fully understand its effects on safety in
transport, we should study not only the safety
culture particular to organisations, but also that
particular to peer groups, sectors, regions and
nations. We define transport safety culture (TSC)
as shared norms prescribing certain transport safety
behaviours, shared expectations regarding the
behaviours of others and shared values signifying
what is important (e.g. safety, mobility, respect,
politeness) (Nevestad & Bjornskau, 2012). An
important aspect of our approach is that overall TSC
is a composite of overlapping safety cultures
associated with different types of sociocultural units.
Thus, we apply the safety culture concept to the 
national level, to organisations and to peer groups 
in the present study. 


1.3 Previous research


There seem to be few studies examining the 
relationship between safety behaviours and work 


accidents in the maritime sector, although there 
are some exceptions (cf. Havold and Nesset, 2009). 
The existing studies within this area do, however, 
indicate that demographic factors (age, national- 
ity, position, line of work) influence work accident 
risk, and we should assume that this relationship 
is mediated by some kind of unsafe behaviour 
(e.g. risk taking, violations), resulting in injuries. 
Younger seafarers have a higher risk (Hansen et al 
2002; Jensen et al 2004). Foreigners have a consid- 
erably lower accident risk than local (in the spe- 
cific study, Danish) citizens (Hansen et al 2002). 
Previous research also indicates that alcohol con- 
sumption may be an important risk factor in the 
maritime sector (Akhtar & Bouwer Utne 2014, 
Hetherington et al 2014), and that alcohol and drug 
abuse are greater for seafarers compared to work- 
ers ashore (Nitka 1990; Kariris 2012 in Zhang & 
Zhao 2017), partly because of their working
situation (e.g. social isolation). However, given the
relatively unregulated character of private boat use,
we may perhaps assume that alcohol consumption,
i.e. “boating while under the influence”, is an even
more important risk factor in this sector. Likewise,
we may perhaps also assume that the other risky
behaviours related to boating accidents (e.g.
speeding close to shore) are more prevalent among
the less regulated private boat users.

Based on a review of previous foreign studies of 
recreational boating accidents in Norway, Amund- 
sen (2016) asserts that questions about alcohol use 
and lifejacket use are common in almost all of the 
international surveys. We may infer from this that 
alcohol use and life jacket use are key safety
behaviours influencing the risk of accidents among
private leisure boat users. Amundsen (2016) reports
that the questions used in the different countries
are adapted to the specific use of leisure boats in
each country and to the accident situation. Based on
a review of studies relevant to Norway, Amundsen
and Bjornskau (2017) point to the following
safety behaviours as likely to influence the safety
of private boat users: driving faster than the
permitted speed close to shore, carrying more
passengers than the boat is licensed for, drinking a
beer or a glass of wine before going boating, driving
in the dark without using the lantern/lights, wearing
a life jacket, and carrying enough lifejackets for
everybody onboard the boat. The questions are partly
based on the findings from a review of the safety 
situation for the recreational boaters performed by 
the Norwegian Maritime Authority in 2012. 

Moreover, it is also important to ask whether
the differences between the two groups are due to
differences in private and professional maritime
safety culture in Norway and Greece. The professional
maritime safety culture is closely related
to the safety regulation (e.g. the ISM code) in
professional maritime transport. The International
Safety Management (ISM) code of the International
Maritime Organisation requires shipping
companies to implement Safety Management
Systems (SMS) on board their vessels, including
describing safety roles, goals, procedures,
monitoring, reporting, follow up etc. (Thomas 2012).
Studies indicate that the SMS requirements of the 
ISM code foster a positive safety culture on board 
vessels (Lappalainen et al 2014). Additionally, 
shipping companies also often work to implement 
a positive organizational safety culture, including 
policies for seafarer behavior. Based on previous 
research, we may hypothesize first that organiza- 
tional safety culture influences safety behaviours 
among professional seafarers (cf. Havold & Nesset 
2009, Lu & Tsai 2010). 

Also, professional seafarers have undergone
IMO-approved training in their respective home
countries. Thus, this training, the SMS and the
organizational safety culture are elements which are
likely to influence the professional maritime safety
culture. Additionally, it is important to remember
that professional seafarer culture is also likely to be
influenced by the working conditions of professional
seafaring, which may include high work pressure,
demanding working conditions, fatigue etc. (cf.
Nevestad 2017). Storkersen et al (2011) found that
a third of the respondents in the Norwegian coastal
cargo sector reported that they put themselves in
danger to get the job done, while about 40% violated
procedures to get the job done, especially because of
efficiency demands.

Moreover, research has also highlighted the
importance of national safety culture for the safety
behaviours of professional seafarers (Havold
2005). We compare two countries (Norway and
Greece), and we therefore also compare the influence
of national safety culture. The theoretical
link between safety culture and safety behaviours
is often omitted in research (Ward et al 2010). In
the present study, we conceptualise this relationship
as both direct social pressures and more subtle
social mechanisms, producing important normative
influences on behaviour (Cialdini et al. 1990).
Individuals’ perceptions of peers’ opinions about
a given behaviour are often defined as injunctive
norms, while individuals’ perceptions of what
peers actually do are often defined as descriptive
norms (Ajzen 1991; Rivis & Sheeran 2003; Ward
et al 2010). Since injunctive norms are normative,
they can be expected to directly influence people’s
behaviour (Cialdini et al. 1990). In the present
study, national culture is measured as descriptive
norms. Descriptive norms may influence behaviour
by providing information about what is normal,
but they can also influence behaviour through
the false consensus bias, in which individuals
overestimate the prevalence of risky behaviour among
their peers in order to justify their own behaviour.
The focus on normative influences on behaviour
is important in the theory of planned behaviour
(TPB) (Ajzen 1991, 2006), and in the critique of
it (Rivis & Sheeran 2003). In short, TPB predicts
that our behaviour is the result of our intention
to carry out the behaviour, and that this intention
is influenced by our attitudes towards the behaviour,
injunctive norms and our perceived control over
our behaviour (Ajzen 1991, 2006).

Additionally, research on maritime safety has
found that the framework conditions and safety
level vary considerably between (sub)sectors
(Storkersen 2017; Hansen et al 2002; Jensen et al
2004). The influence of sector and sector safety
culture is examined for professional seafarers.
Studies of private transport operators have found
that other sociocultural groups, e.g. peer groups
(Nevestad et al 2014) and region (e.g. urban vs.
rural) (Rakauskas et al 2009), are important when
it comes to influencing safety behaviours. Thus, we
also seek to examine the influence of peer groups
and regional maritime safety culture. Boat type
and background variables are examined for leisure
boat users.


2 METHOD 


2.1 Recruitment of respondents 


The Norwegian professional seafarers were 
recruited through the Norwegian researchers’ 
contact with Norwegian shipping companies, 
i.e. shipping companies that are located in Nor- 
way, with mainly Norwegian crew members. Web 
links to the questionnaires were distributed by 
the shipping companies to all employees working 
on board vessels, along with an introductory text 
explaining the purpose of the survey, and stress- 
ing that the surveys were confidential. The Norwe- 
gian private boat users were recruited through a) 
the Norwegian researchers’ contact with a boating 
association distributing survey links to members, 
and b) distribution on a member website for leisure
boat owners, which for many years has been
Scandinavia’s largest boat forum (e.g. with 1.6 million
posts submitted by members). The Greek
professional seafarers were recruited through a 
marketing research company in Greece, which 
was under the scientific supervision of research- 
ers from the NTUA. Seafarers working for Greek 
shipping companies, i.e. shipping companies that 
are located in Greece, with mainly Greek crew 
members, were approached. Private boat users in 
Greece were also recruited by the same marketing 
research company. 


2.2 Survey measures: Professional seafarers


The present paper’s analysis of professional
seafarers builds on and extends the knowledge
gained from two previous conference papers. The
first (Nevestad et al 2017) compares organizational
safety culture, working conditions and occupational
injuries in Norwegian cargo and passenger
transport. The second (Nevestad et al 2018)
compares cultural influences on maritime cargo
transport in Norway and Greece. The present paper
extends these two papers by comparing professional
seafarers with private leisure boat users. The
survey of professional seafarers covered the
following measures:


1. Background variables (15 questions): e.g. gen- 
der, nationality, age group, seafarer experience, 
position/area of work, employment status, ves- 
sel type, vessel size, manning on board, ship 
register.

2. Safety performance (5 questions): respondents’ 
occupational injuries on board, ship accidents, 
type of ship accidents, safety compromising 
fatigue and assessment of work place safety 
level (1-10). 

3. Safety behaviours (7 questions): questions on 
safety behaviours. Respondents were asked: 
How often do you think the following events 
tend to occur for every 100 working days/nights 
on board?: 1) I accept small risks because the 
“situation demands it” (e.g. because of time 
pressure, bad weather), 2) I violate procedures to 
get the job done, 3) I work, even though I am 
so tired that safety may be compromised, 4) I 
refrain from using the required protection
equipment in my work, 5) I work while being under
the influence of alcohol (e.g. one beer or more),
or while being hungover. (Answer alternatives: 1)
Never, 2) 1-2 times, 3) 3-5 times, 4) 6-10 times,
5) 11-15 times, 6) 16-20 times, 7) More than 20
times, 8) Do not know/not relevant.)

4. Working conditions (3 questions): How often 
do you think the following events tend to occur 
for every 100 working days/nights on board: 1) 
Your shift change is delayed because of work 
operations, for instance port calls?, 2) You work 
more than 16 hours in the course of a 24-hour 
period?, 3) You are interrupted when you are
off duty? (Answer alternatives: 1) Never, 2)
1-2 times, 3) 3-5 times, 4) 6-10 times, 5) 11-15
times, 6) 16-20 times, 7) More than 20 times, 8)
Do not know/not relevant.)


We removed the eighth answer alternative and
made a “Demanding working conditions index”
of these three questions (Cronbach’s alpha = .728).
The survey also included a question on work
pressure: “Sometimes I feel pressured to continue
working, even if it is not perfectly safe”.


5. Organisational safety culture (7 questions): We 
made an organisational culture index, consisting 
of questions from the GAIN-scale on organisa- 
tional safety culture. We have used this scale in 
previous research from different transport sectors 
(Bjornskau & Longva, 2009; Nevestad & Bjørnskau,
2014). The GAIN-scale is presented in the
“Operator’s Safety Handbook” (GAIN 2001).
The GAIN-scale originally consists of 25 ques- 
tions measuring five themes, but we have reduced 
the scale to 7 questions, e.g. 1) Ship management 
regards safety to be a very important part of all 
work activities, 2) The shipping company regards 
safety to be a very important part of all work 
activities, 3) Ship management detects crew mem- 
bers who work unsafely, 4) Ship management 
often praises crew members who work safely etc. 

6. National safety culture (7 questions): In the 
present study we measure national safety cul- 
ture as descriptive norms (Cialdini 1990) at the 
national level meaning “what respondents expect 
that other seafarers from their own country do” 
expressed through the question “When working on
vessels, I expect the following behaviours from 
other seafarers from my country:” 1) That they 
sometimes violate procedures to get the job 
done, 2) That they sometimes refrain from using 
the required protection equipment in their work, 
3) That they sometimes work, even when they 
are so tired that safety may be compromised, 4) 
That they sometimes work being under the influ- 
ence of alcohol (e.g. one beer or more), or while 
hungover, 5) That they sometimes take small 
risks if the “situation demands it” (e.g. because 
of time pressure, bad weather), 6) That they 
sometimes avoid telling colleagues taking risks 
to work safely, 7) That they sometimes refrain 
from reporting safety problems and unsafe situ- 
ations that they experience in their work to the 
ship management. An exploratory factor analy- 
sis (EFA) was conducted to examine the under- 
lying factor structure of the 7 national safety 
culture (descriptive norms) items. 

7. Sector safety focus (2 questions): We measure 
sector safety focus by means of two questions 
that were selected after a “scale if items deleted” 
analysis (including five items): 1) Safety is more 
important than deadlines to our customers, 2) 
Safety is more important than price to our cus- 
tomers (CA = .875). 
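
The internal-consistency figures reported above (Cronbach’s alpha for the demanding working conditions index and for the sector safety focus items) can be illustrated with a minimal sketch. The response data below are hypothetical, and the handling of “Do not know/not relevant” answers (dropping the whole respondent) and the use of the item mean as index score are assumptions; the paper only states that the eighth answer alternative was removed.

```python
from statistics import variance

DONT_KNOW = 8  # answer alternative 8: "Do not know/not relevant"


def drop_dont_know(rows):
    """Keep only respondents with a substantive answer (1-7) on every item.

    Listwise deletion is an assumption made for this sketch.
    """
    return [row for row in rows if DONT_KNOW not in row]


def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of item scores per respondent."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical answers to the three working-conditions questions.
responses = [
    [1, 2, 1],
    [3, 3, 2],
    [2, 2, 8],  # contains "Do not know" and is dropped
    [5, 4, 5],
    [4, 5, 4],
    [2, 1, 2],
]

clean = drop_dont_know(responses)
alpha = cronbach_alpha(clean)
# One possible index score per respondent: the mean of the item scores.
index_scores = [sum(row) / len(row) for row in clean]
print(round(alpha, 3))
```

Alpha approaches 1 when the items rise and fall together across respondents, which is the property the index construction relies on.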


2.3 Survey measures: Private boat users 


1. Background variables (12 questions): gender, 
nationality, age group, experience as a boat 
driver, participation in organized boat training/ 
educational programme, boat type, use of
navigation equipment, boat length, engine
capacity, maximum boat speed, purpose of boat
use, municipality of residence, education level.

2. Safety performance (4 questions): respondents’
accidents/incidents, injuries in accidents, safety
self-assessment as a boat driver (1-10), boat use
duration.

3. Safety behaviours (12 questions): Respondents
were asked:

A. For every ten times you are driving your
boat, approximately how often do you do the
following things, before you go out: 1) Tell
someone where I will be going and when I
will be back, 2) Check the weather forecast,
3) Check the fuel level, 4) Drink two units of
alcohol (e.g. two beers, two glasses of wine).

B. For every ten times you are driving your boat,
approximately how often do you do the following
things: 1) Personally wear a life jacket
the entire trip, 2) Drink two units of alcohol
(e.g. two beers, two glasses of wine), 3)
Drive faster than the permitted speed close
to shore, 4) Drive so fast or offensively that
passengers or others (e.g. other boat drivers)
express concern or react in other ways,
5) Look down at navigational equipment/
GPS for so long that I have been surprised to
see other boats, islands, skerries etc. when I
look up, 6) Become angered by a certain type
of boat driver and indicate your hostility by
whatever means you can.

C. For every ten times you are driving your boat
with passengers, approximately how often do
you do the following things: 1) Ensure that
adult passengers on your boat wear a lifejacket,
2) Ensure that child passengers on
your boat wear a lifejacket. (Answer alternatives
for A, B and C: 1) Never, 2) 1-2 times, 3)
3-4 times, 4) 5-6 times, 5) 7-8 times, 6) more
than 8 times but not always, 7) Always.)

4. National safety culture (3 questions): National
safety culture is again measured as descrip- 
tive norms at the national level meaning “what 
respondents expect that other boat drivers from 
their own country do” expressed through ques- 
tion “Based on your experience, how many boat 
drivers in your country do you think do the 
following:” 1) Drink two units of alcohol (e.g. 
two beers, two glasses of wine) while driving the 
boat, 2) Drive faster than the permitted speed 
close to shore, 3) Drive so fast or offensively 
that passengers or others (e.g. other boat driv- 
ers) express concern or react in other ways. The 
questions were combined into an index. The 
survey included six additional questions about 
this that are not listed here. 

5. Peer group safety culture (3 questions): The
same principle and questions as for national
safety culture are applied.

6. Safety culture at municipality level (3 questions):
The same principle and questions as for national
safety culture are applied.

3 RESULTS 


3.1 Professional seafarers 


3.1.1 Which behaviours influence personal
injuries?

A logistic regression analysis was conducted with 
personal injuries as dependent variable, to find the 
variables predicting personal injury among our 
respondents (Table 1). In this analysis, the injury 
variable, which originally had four answer alterna- 
tives, was dichotomized, 0 = no personal injury, 
1 = personal injury. B values are presented and they 
indicate whether the risk of personal injuries is 
reduced (negative B values) or increased (positive 
B values), when the independent variables increase 
by one value. 
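
The reading of B values described above can be made concrete with a small sketch. In a logistic regression, a one-unit increase in a predictor multiplies the odds of the outcome by exp(B), so a positive B raises the probability of a personal injury and a negative B lowers it. The intercept and B value below are hypothetical and are not taken from Table 1.

```python
import math


def logistic(z):
    """Probability of the outcome (here: personal injury = 1) for linear predictor z."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


intercept = -2.0  # hypothetical value, for illustration only
b = 0.7           # hypothetical positive B value

# Predicted injury probability before and after a one-unit increase.
p_before = logistic(intercept + b * 1)
p_after = logistic(intercept + b * 2)

# The ratio of the two odds equals exp(B), whatever the starting point.
ratio = odds(p_after) / odds(p_before)
print(round(p_before, 3), round(p_after, 3), round(ratio, 3))
```

The odds ratio is constant across the scale, while the change in probability depends on where the respondent starts.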

Table 1 provides three main results. The first is 
that nationality influences respondents’ work
injuries in the last two years on board. This is the
variable with the strongest contribution. The
Norwegian seafarers reported having been involved
in more injuries than the Greek seafarers. The
variable with the second strongest contribution is the
Risk acceptance/violations index; indicating that 
the more violations and risk accepting behaviour 
you are involved in, the more likely it is that you 
are injured on board. The variable with the third 
strongest contribution is age group, indicating that 
controlled for the other variables, the youngest sea- 
farers have a higher risk of being injured on board. 
In Table 1, the Nagelkerke R² is 0.188, which
indicates that the independent variables explain
about 19% of the variance in the dependent variable.
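
For readers unfamiliar with the measure, the following sketch shows how a Nagelkerke R² is commonly computed from the log-likelihoods of an intercept-only model and the fitted model. It is a pseudo-R², so reading it as a share of explained variance is an approximation. The log-likelihood values in the example are hypothetical; the paper does not report them.

```python
import math


def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke pseudo-R² from two log-likelihoods and the sample size n.

    ll_null  : log-likelihood of the intercept-only model
    ll_model : log-likelihood of the fitted model
    """
    # Cox & Snell R²: 1 - (L_null / L_model)^(2/n)
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    # Largest value Cox & Snell R² can reach for this null model.
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    # Nagelkerke rescales so that a perfect model scores 1.
    return cox_snell / max_cox_snell


# Hypothetical log-likelihoods for a sample of 200 respondents.
r2 = nagelkerke_r2(ll_null=-120.0, ll_model=-105.0, n=200)
print(round(r2, 3))
```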


3.1.2 Which factors influence safety behaviours? 
In Table 2 we show results from a hierarchical, 
linear regression analysis, where independent vari- 
ables are included in successive steps to examine 
the variables predicting respondents’ scores on the 
Risk acceptance/violations index.
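
The logic of a hierarchical regression, where blocks of variables are entered in successive steps and the explained variance is compared between steps, can be sketched as follows. This is a minimal illustration using ordinary least squares on hypothetical data with three invented predictor blocks; it reports R² per step rather than the standardized beta coefficients shown in the paper, and it is not the authors’ analysis.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(columns, y):
    """R² of an OLS regression of y on the given predictor columns plus intercept."""
    n = len(y)
    x = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(x[0])
    xtx = [[sum(x[i][a] * x[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    xty = [sum(x[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xij for bj, xij in zip(beta, row)) for row in x]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def adjusted_r2(r2, n, p):
    """Adjusted R² for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical data for eight respondents.
age = [1, 2, 1, 2, 1, 2, 1, 2]       # step 1: background variable
work = [1, 3, 2, 5, 4, 6, 3, 7]      # step 2: working conditions
culture = [5, 4, 6, 2, 3, 1, 4, 2]   # step 3: safety culture
y = [2, 4, 2, 6, 5, 7, 4, 8]         # risk acceptance/violations score

steps = [[age], [age, work], [age, work, culture]]
r2_steps = [r_squared(cols, y) for cols in steps]
for number, r2 in enumerate(r2_steps, start=1):
    print(number, round(r2, 3), round(adjusted_r2(r2, len(y), number), 3))
```

The increase in R² from one step to the next indicates how much the newly entered block adds beyond the variables already in the model.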

Table 2 provides five main results: first, the 
more demanding working conditions that the 
respondents experience, the more likely they are to 
be involved in Risk acceptance/violations. Second,
we see that the national safety culture—descrip- 
tive norms index contributes positively, indicating 
that the more unsafe behaviours the respondents 
say that they expect from seafarers from their own 
country, the more likely they are to be involved in 
unsafe behaviours themselves. Third, the higher 
organizational safety culture scores the respond- 
ents report, the less unsafe are their behaviours. 

Table 1. Logistic regression. Dependent variable: Personal
injuries on board in the last two years (dichotomized:
0 = no personal injury, 1 = personal injury). B values.

Variables                                           B value
Age group (<26 years = 0, Other = 1)                0.379
Nationality (Greek = 0, Norwegian = 1)              2.226**
Vessel type (Live fish carrier = 0, Other = 1)      0.888
Position/line of work (Deck crew = 0, Other = 1)    0.657
Risk acceptance/violations index                    1.164***
Working under the influence of alcohol/hungover     0.304
Non-reporting/non-intervention index                0.940
Sometimes I feel pressured to continue working
  even if it is not perfectly safe                  1.224
Organisational safety culture index                 1.025
Nagelkerke R²                                       0.188

*p < 0.1, **p < 0.05, ***p < 0.01.

Table 2. Linear regression. Dependent variable:
“Risk acceptance/violations Index”. Standardized beta
coefficients.

Variables                                           Beta coeff.
Age group (26 = 2)                                  0.003
Nationality (Greek = 2)                             -0.030
Position (Apprentice = 2)                           0.052
Vessel type (Tank = 2)                              -0.031
Sometimes I feel pressured to continue working,
  even if it is not perfectly safe                  0.167**
Demanding working conditions index                  0.281**
Organisational safety culture index                 -0.195**
Sector focus on safety                              -0.144**
National safety culture: descriptive norms          0.206**
Adjusted R²                                         0.453

*p < 0.1, **p < 0.05, ***p < 0.01.



Thus, a positive organisational safety culture 
may reduce the negative contribution of demand- 
ing working conditions and safety compromis- 
ing work pressure. The same applies to the index 
“sector focus on safety”. In Table 2, the adjusted
R² is 0.453, which indicates that the independent
variables explain about 45% of the variance in the
dependent variable.


3.2 Private boat users 


3.2.1 Which behaviours influence personal
injuries?

A logistic regression analysis was conducted with
boating incidents (grounding, collision, intake of
water) as dependent variable, to find the variables
predicting boating incidents among our respondents
(Table 3). Seven percent of the respondents had
experienced this. The incident variable was
dichotomized: 0 = incident, 1 = no incident.
B values are presented. Because the outcome is
coded with 1 = no incident, negative B values
indicate an increased incident risk and positive
B values a reduced incident risk when the
independent variables increase by one value.
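
Table 3 codes the outcome as 0 = incident and 1 = no incident. Under that coding the model predicts the probability of no incident, so the sign of a B value has to be read in reverse when speaking about incident risk. The sketch below illustrates this with a hypothetical intercept and a hypothetical negative B value, not with the coefficients from Table 3.

```python
import math


def p_no_incident(intercept, b, x):
    """Modelled probability of the outcome coded 1 (no incident)."""
    return 1.0 / (1.0 + math.exp(-(intercept + b * x)))


def p_incident(intercept, b, x):
    """Incident risk is the complement of the modelled outcome."""
    return 1.0 - p_no_incident(intercept, b, x)


intercept = 2.5  # hypothetical value, for illustration only
b = -0.4         # hypothetical negative B value

risk_low = p_incident(intercept, b, 0)   # predictor at its lowest value
risk_high = p_incident(intercept, b, 3)  # predictor three units higher

# A negative B lowers P(no incident) and therefore raises the incident risk.
print(round(risk_low, 3), round(risk_high, 3))
```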

Table 3 provides two main results. The first is 
that nationality influences respondents’ experi- 
ences with boating incidents in the last two years. 
This is the variable with the strongest contribution. 
The effect is negative, meaning that Greek boat 
users are involved in fewer incidents, controlled for 
the other relevant variables. 

The second result is that alcohol use during trips 
as a boat driver contributes significantly and nega- 
tively, meaning that increased alcohol use increases 
the likelihood of boating incidents. In Table 3, the
Nagelkerke R² is 0.222, which indicates that the
independent variables explain about 22% of the
variance in the dependent variable.


3.2.2 Which factors influence safety behaviours? 
In Table 4 we show results from a hierarchical, 
linear regression analysis, where independent vari- 
ables are included in successive steps to examine 
the variables predicting respondents’ scores on the 
variable “Alcohol use during boat trip as driver”. 
Table 4 provides four main results: first, national 
safety culture, specified as descriptive norms (boat 
users from your own country’s alcohol use,
speeding close to shore and offensive driving)
provides the strongest contribution to respondents’
alcohol use while driving a boat. Respondents who
report unsafe behaviours among boat users in
their country are more likely to drink alcohol while 
boating themselves. We made similar indexes for 
the peer group and the municipality level. The peer 
group level refers to “friends who own a boat”. 


Table 3. Logistic regression. Dependent variable: boating
incidents in the last two years (dichotomized: 0 = incident,
1 = no incident). B values.

Variables                                           B value
Age group (46-55 years = 0, Other = 1)              0.711
Exposure                                            0.002
Boat type (motor boat w/sleeping facilities = 0,
  other = 1)                                        0.328
Alcohol use during trip as boat driver              -0.359**
Nationality (Greek = 0, Norwegian = 1)              -1.683***
Education/training in boat use (No = 1)             -0.179
Navigational equipment on board                     0.826
Nagelkerke R²                                       0.222

*p < 0.1, **p < 0.05, ***p < 0.01.

Table 4. Linear regression. Dependent variable: “Alcohol
use during boat trip as driver”. Standardized beta
coefficients.

Variables                                           Beta coeff.
Age group (Under 56 years = 1, over = 2)            -0.147**
Nationality (Norwegian = 1, Greek = 2)              0.160**
Boat type (Other = 1, motor boat w/sleep = 2)       [illegible]
Purpose of trip (Other = 1, Leisure = 2)            0.119*
Perceived enforcement: police/coast guard           0.028
Peer group safety culture                           0.151*
Municipal safety culture                            -0.157*
National safety culture                             0.218***
Adjusted R²                                         0.167

*p < 0.1, **p < 0.05, ***p < 0.01.


We see that peer group safety culture and
municipality safety culture contribute significantly
only at the 10% level. The contribution of peer
group safety culture is positive, like that of the
national culture variable, but the municipality
contribution is negative. This is unexpected, and we
return to it in the discussion section.

Second, we see that boat type (motor boat with 
sleeping facilities) contributes positively, indicating 
that using this boat type involves a higher incident 
risk in our sample. Third and fourth, we see that 
age (>56 years) and nationality (Greek) give a
lower risk of having experienced incidents. 

Finally, we also see that purpose (i.e. leisure 
and holiday) contributes positively to incidents, 
but only at the 10% level. Thus, we see, not unex- 
pectedly, that compared with other purposes (e.g. 
fishing, transport), boat drivers on leisure/trips 
are more likely to drink alcohol while driving. In 
Table 4, the adjusted R² is 0.167, which indicates
that the independent variables explain about 17% of
the variance in the dependent variable.


4 CONCLUDING DISCUSSION 


The aims of the study were to examine the safety 
behaviours related to personal injuries and acci- 
dents among professional seafarers and private 
leisure boat users in Norway and Greece, and to 
study the factors influencing these behaviours. 


4.1 Factors predicting injuries/accidents


Looking at the factors predicting injuries/acci- 
dents in the two groups, we saw that nationality 
(Norwegian), risk acceptance/violations and age
group (<26 years) predicted professional seafarers’
work injuries in the last two years on board. The
contribution of age and nationality is in accord- 
ance with previous research on professional seafar- 
ers (Hansen et al 2002; Jensen et al 2004). 

Looking at the private boat users, we also saw 
that nationality influenced respondents’ risk of 
boating incidents in the last two years. Second, we 
found that alcohol use during trip as a boat driver 
increased the likelihood of boating incidents. 
Working under the influence did not contribute 
significantly to professional seafarers’ risk of work 
accident. 

This contrasting result is in line with the hypoth- 
esis we mentioned in the introduction; that private 
boat use seems to be a relatively unregulated behav- 
iour compared with professional seafaring. Previ- 
ous research indicates that alcohol consumption 
may be an important risk factor in the maritime 
sector (Akhtar & Bouwer Utne 2014, Hethering- 
ton et al 2014), and that alcohol and drug abuse 
are greater for seafarers compared to workers 
ashore (Nitka 1990; Kariris 2012 in Zhang & Zhao 
2017), partly because of their working situation 
(e.g. social isolation). However, as private boat 
use is less regulated than professional boat use, we 
hypothesized that alcohol consumption (“boating
while under the influence”) would be an even more
important risk factor among private boat users. 
Results indicate that this is the case, at least based 
on our sample. 

As noted, Amundsen (2016) also asserts that 
questions about alcohol use and lifejacket use are 
common in almost all of the international sur- 
veys, indicating the importance of these factors 
for boating safety. Moreover, Norwegian boating
accident statistics report a number of deaths
involving alcohol, and in which the drowned person
did not wear a life jacket (Amundsen & Bjornskau 2017).


4.2 Factors predicting safety behaviours 


Analysing the factors influencing professional sea- 
farers’ risk acceptance and violations, we found 
that demanding working conditions and work 
pressure were important factors. This is in line with 
previous research (Storkersen et al 2011, Nevestad
2017). The former was the most important fac- 
tor. We also found that a positive organisational 
safety culture may reduce the negative contribu- 
tion of demanding working conditions and safety 
compromising work pressure. This has also been 
pointed out in a previous study (Nevestad 2017). 
We also found that “sector focus on safety” may 
reduce the negative influence of demanding work- 
ing conditions on professional seafarers’ safety 
behaviours. 

Additionally, we found that the national safety
culture (descriptive norms) index contributed
positively, indicating that the more unsafe
behaviours the respondents say that they expect from
seafarers from their own country, the more likely
they are to be involved in unsafe behaviours
themselves. We found the same in the analysis of the
private boat users; in fact, this analysis showed that
national safety culture, specified as descriptive
norms (boat users from your own country’s alcohol
use, speeding close to shore and offensive
driving) provided the strongest contribution to
respondents’ alcohol use while driving a boat. To
our knowledge, there are few other studies that 
have examined the influence of national culture 
(specified as descriptive norms) on both profes- 
sional seafarers and private boat users. 

Examining the cultural influences on private boat 
users’ safety behaviours, we also found that peer 
group and municipality safety culture contributed 
significantly at the 10% level. The contribution of 
peer group safety culture is positive, like that of the
national culture variable, but the municipality
contribution is negative. This is likely to be a result
of a collinearity effect, indicating that these two
variables are strongly related and measure “the same
effect”. In practice (and based on observing the means
and the standard deviations on these two variables)
it seems that respondents do not distinguish clearly
between boat users in their own municipality and
their peer group. This is understandable, given
the memory, knowledge and analytical separation
required to do this. Thus, we should exclude the
municipality level from the analysis, as it is likely
that this is the level (compared to peer group) that
respondents know less about.
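
One common way to check for the kind of collinearity suspected here is the variance inflation factor (VIF). The sketch below uses the special case of two predictors, where VIF = 1 / (1 - r²) and r is their Pearson correlation; with more predictors, r² is replaced by the R² from regressing one predictor on all the others. The index scores are hypothetical, and the paper does not state that a VIF was computed.

```python
def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(x, y):
    """Variance inflation factor when the model has exactly two predictors."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Hypothetical index scores for six respondents.
peer_group = [1, 2, 3, 4, 5, 6]
municipality = [1, 2, 3, 5, 4, 6]

r = pearson_r(peer_group, municipality)
vif = vif_two_predictors(peer_group, municipality)
print(round(r, 3), round(vif, 2))
```

A high VIF means the two indexes carry largely the same information, in which case the sign of an individual coefficient can become unstable, as discussed above.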

Analysing the influences on private boat user
behaviour, we also found that boat type (motor boat
with sleeping facilities) involves a higher incident
risk in our sample. We also found that age (>56
years) and nationality (Greek) give a lower risk of
having experienced incidents. We also found that
purpose (i.e. leisure and holiday) contributed posi- 
tively to incidents, but only at the 10% level. This 
brings us to the important differences between 
the two groups that we study. Finally, previous 
research on private boat users has also found such 
background variables to be important for safety 
behaviours, e.g. type of boat used, gender, age, 
experience, what kind of activity they usually use 
the boat for (e.g. fishing, competition, holiday, rec- 
reation), type of location where they usually use 
the boat (cf., Amundsen 2016; Amundsen & Bjorn- 
skau 2017). 


4.3 Why is the number of fatalities higher for
leisure boat users than professional seafarers?


An overarching purpose of our study was to discuss
possible reasons for the higher number of fatalities
for leisure boat users than professional seafarers.
We wanted for instance to examine the kind of 
behaviours that are related to injuries/accidents in 
the two groups, and subsequently to examine the 
factors influencing these behaviours. We may of 
course only speculate based on our study, indicat- 
ing hypotheses that should be examined further in 
future research, but our study indicates that the 
settings and purposes are important to understand 
this difference. 

While unsafe behaviours related to work pres- 
sure and risk taking are important among profes- 
sional seafarers (i.e. risk acceptance and violations), 
unsafe behaviours related to the leisure/holiday 
situation were important for the leisure boat users
(i.e. alcohol use while driving a boat). 

Additionally, it seems that the situation of pri- 
vate leisure boat users is less regulated than that 
of professional seafarers. The International Safety 
Management (ISM) code of the International 
Maritime Organisation requires shipping compa- 
nies to implement Safety Management Systems 
(SMS) on board their vessels, including describing 
safety roles, goals, procedures, monitoring, report- 
ing, follow up etc. (Thomas 2012). Additionally, 
shipping companies also often work to implement 
a positive organizational safety culture, including 
policies for seafarer behavior. Also, professional 
seafarers have undergone an IMO approved train- 
ing in their respective home countries. Thus, this 
training, the SMS and safety culture are elements 
which are likely to influence the professional mari- 
time safety culture. 

Private boat users, on the other hand are not 
part of such a system of international and national 
regulation, involving education, inspections from 
port states, flag states, classification societies, 
transport buyers etc. Relative to the number of
people who go boating in different countries, the
risk of accidental death is quite high compared
to that of other private transport modes. Despite
this, recreational boating is regulated only to a
small extent and the level of enforcement is
low (Amundsen & Bjornskau 2017). Some countries
seem to take safety for leisure boat users more
seriously than others. In a few countries it is, for
instance, now mandatory to report all incidents
experienced while boating, even if no persons
were injured in the incident/accident. Finally, it is
important to remember that the above-mentioned
viewpoints are merely hypotheses that must be
examined in future research.


4.4 Cultural influences on maritime safety 
behaviours 


The theoretical link between safety culture and 
safety behaviours is often omitted in research 
(Ward et al 2010). In the present study, we
conceptualise this relationship as descriptive norms, that
may influence behaviour by providing information 
about what is normal. In the professional (organi- 
sational) setting, managers are an important source 
of social pressure, as well as colleagues, and the 
interaction between people within the organisation 
is important for the creation and maintenance of a 
safety culture influencing behavior, as indicated by 
the effect of organizational safety culture on pro- 
fessional seafarers’ safety behaviors in Table 2. 

In the private setting, there will not be a simi- 
lar strong link from managers to transport safety 
culture. Some peers are, however, likely to exert
stronger social influence than others, and may be 
as important as managers in organizations in exert- 
ing social pressures that shape safety culture and 
influence behaviour. In our study (Table 4), we saw 
that peers are central as advocates of social norms 
related to safety (i.e. drinking alcohol while
boating), but we also saw that the reference to other
people in the boat users’ country was even more
important. In Table 2 we also saw the importance
of sector for professional seafarer behavior. 

To conclude, our study indicates that both in the 
professional and the private setting, norms for inter- 
action and conduct seem to be influenced by norms 
and expectations rooted in different socio-cultural 
groups, e.g. the national culture, the specific sector 
in question, the organisations and in peer groups. 


Accident and disease prevention in working life: Common grounds
and areas for mutual learning

ABSTRACT: Globally, there are more than six times more fatalities caused by a poor working
environment than due to occupational accidents. In this paper we compare the basic strategies involved 
in accident and disease prevention. We find that the basic thinking is the same. The preventive strategies 
involve control of hazards in a hierarchy from elimination to application of personal protective equipment. 
Also, the Norwegian regulation on internal control of HSE is applicable for both accident and disease 
prevention, involving the idea of continuous improvement. Still, the nature of the hazards differs, as well
as the possible consequences. In the area of occupational safety, the hazards are usually visible and the 
consequences of accidents are immediate. In the area of occupational health, the hazards are often invis- 
ible and the consequences of exposure delayed. There is a potential for better integration of the two areas 
in practical management, and a potential for mutual learning from concepts and models in the two fields. 


1 INTRODUCTION 


Protection from harm in relation to work is a 
responsibility for politicians, authorities and 
employers. Still, work related diseases, accidents, 
injuries and fatalities continue to be a significant
challenge. 

According to ILO, globally 2.3 million deaths 
take place due to occupational injuries and diseases 
each year (Takala et al., 2014). Of these, over 2
million deaths are due to occupational diseases,
among them 23% due to work related cancers.

In Norway, there were between 25 and 47 deaths each
year from injuries in the period 2013-2016 (The
Norwegian Labour Inspection Authority 2017),
and 1% of the working population reported work
related lung complaints in 2013 (NOA, 2017). In
addition, it has been estimated that approximately
3000 new cases of work related Chronic Obstructive
Pulmonary Disease (COPD) occur each year in Norway,
and that 200 persons will die from this disease
due to working conditions (Leira, 2011). Although
these figures are high, there seems to be a positive 
trend both in work related accidents and diseases. 
The number of fatalities was more than 100 annu- 
ally in the early 1970s, but has steadily declined. 
In 2016 the number of fatalities was 25 (The Nor- 
wegian Labour Inspection Authority 2017). The 
number of self-reported exposures to chemicals
is declining, as are reports of other work-related
diseases (NOA, 2011, 2017).

Accidents in the construction industry still get 
high attention. However, some highlight the fact
that the workers in this sector have increased risk
for lung diseases (Bergdahl et al., 2004; Robinson, 
Petersen, Sieber, Palu, & Halperin, 1996; Vermeu- 
len, Heederik, Kromhout, & Smit, 2002). Also in 
Norway the statistics from NOA show that workers
in the construction industry have more respiratory
complaints, and more declared work related
respiratory complaints than the national mean 
(Aagestad, 2015). Recent studies have also detected 
an association between exposure and an increased 
risk of COPD in the construction industry (Fell, 
Aasen, & Kongerud, 2014). In addition 59% of 
construction workers state that they inhale
smoke, dust or exhaust in their work situation 
(NOA, 2017). 

The purpose of this paper is to compare the 
main strategies for the prevention of occupational 
diseases and occupational accidents. 


2 PREVENTION OF OCCUPATIONAL 
DISEASES 


2.1 Exposure assessment 


The traditional approach for chemical risk
assessment is to compare the exposure level of the
chemical agent to its Occupational Exposure Limit
(OEL). OELs have been established since the late 19th
century (Jayjock MA, 2000). This approach is still
regarded as the best practice for risk assessment
for chemical agents and noise. These limit
values are given for occupational exposures during
an 8-hour shift, and there are guidelines on how
to assess risk of exposure to noxious agents
(NS-ISO689) and sound (NS-4815-1).
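
As an illustration of comparing an exposure level with an OEL given for an 8-hour shift, the sketch below computes a time-weighted average (TWA) from consecutive air samples and expresses it as a fraction of the limit. The TWA calculation is common occupational hygiene practice rather than a procedure taken from this paper, and the sample concentrations and the OEL value are hypothetical.

```python
def twa_8h(samples, shift_hours=8.0):
    """8-hour time-weighted average exposure.

    samples: (concentration in mg/m3, duration in hours) pairs that
    together cover the whole shift.
    """
    total_hours = sum(hours for _, hours in samples)
    if abs(total_hours - shift_hours) > 1e-9:
        raise ValueError("sample durations must cover the whole shift")
    return sum(conc * hours for conc, hours in samples) / shift_hours


def exposure_index(twa, oel):
    """Exposure as a fraction of the limit; values above 1 exceed the OEL."""
    return twa / oel


# Hypothetical dust measurements over one shift (mg/m3, hours).
samples = [(0.5, 2.0), (1.2, 4.0), (0.2, 2.0)]
oel = 1.0  # hypothetical limit value in mg/m3

twa = twa_8h(samples)
index = exposure_index(twa, oel)
print(round(twa, 3), round(index, 3))
```

A single shift average says little about compliance on its own, because exposure varies within and between workers; repeated measurements per group are therefore needed.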

The sampling and analyses of airborne con- 
taminants and comparing results with the national 
OELs have been challenging for different reasons. 
Due to variability in exposure both within work- 
ers and between workers, several samples have to 
be taken for each group of workers regarded as 
homogeneously exposed (Chen, Chuang, Wu, &
Chan, 2009; Rappaport, Lyles, & Kupper, 1995). 
The cost for each analysis and the resources needed 
to perform the measurements have resulted in a 
limited number of measurements from the differ- 
ent parts of the Norwegian industry. Reported 
measurement results from the construction sector 
are rather sparse, except some from tunnel con- 
struction, cement work and Bricklayeres (Bakke, 
Ulvestad, Thomassen, Woldbek, & Ellingsen, 
2014; Beaudry et al., 2013). 

A new approach to risk analysis of health effects
has been widely used since the introduction of
REACH (Money, 2003; Office, 2017; UK, 2017). This
new approach takes into consideration the health
classification of chemicals or particles and uses
this together with an exposure assessment (Money,
2003). The classification of health effects is
performed on the basis of the agent’s inherent
toxic properties, often by use of the CLP
classification (CLP-ref), but the risk assessment
is also dependent on the exposure. This exposure
assessment may be founded on subjective assessment
and exposure toolkits (Office, 2017; UK, 2017).
The subjective assessment method uses a structured
approach, based on descriptive information about
work activities and the work environment, and has
been validated against exposure measurements
(Cherrie & Schneider, 1999). This new approach
makes it possible to do risk assessment of
chemicals and particles that do not have an OEL,
and to do so without measurements. However, when
the risk analysis shows that there may be an
unwanted risk, measures have to be taken to comply
with the model, or measurements have to be
performed in the traditional way to document that
the risk is acceptable.


2.2 Hierarchy of exposure controls 


Within occupational hygiene, control of exposure
is a fundamental method for protecting workers.
Traditionally, a hierarchy of controls has been
used as a means of determining how to implement
feasible and effective control solutions. The
hierarchy is often illustrated as in Figure 1,
where elimination is the most effective measure
and involves physical removal of the hazard.
Substitution is replacement of the hazard,
engineering controls isolate people from the
hazard, administrative controls change the way
people work, and Personal Protective Equipment
(PPE) is the weakest measure protecting the worker
against the hazard.

Figure 1. Control of exposure.

The principle behind this hierarchy is that the
control methods at the top of the graphic are
potentially more effective and protective than
those at the bottom.
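
The ranking principle can be sketched in a few lines of code. The example below is only an illustration: the enumeration `Control` and the function `preferred_control` are hypothetical names, and the sketch assumes that feasibility of each control type has already been judged by a competent person.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Control(IntEnum):
    # Lower value = higher in the hierarchy (potentially more effective).
    ELIMINATION = 1
    SUBSTITUTION = 2
    ENGINEERING = 3
    ADMINISTRATIVE = 4
    PPE = 5


def preferred_control(feasible: Iterable[Control]) -> Optional[Control]:
    """Return the highest-ranked feasible control, or None if none is feasible."""
    return min(feasible, default=None)


# Hypothetical case where elimination and substitution are not possible.
feasible = [Control.PPE, Control.ENGINEERING, Control.ADMINISTRATIVE]
choice = preferred_control(feasible)
print(choice.name if choice else "no feasible control")
```

In real workplaces several control types are usually combined; the sketch only shows which type the hierarchy asks to be considered first.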

In practice, however, applying the hierarchy is
often more difficult. Using a smelting plant as an
example, the production of a metal or metal alloy
such as aluminium, silicon, silicon carbide or
ferromanganese is the basic idea behind the plant.
The top two measures, elimination and substitution
of the raw material, are impossible, and the
occupational hygienist has only the three
lowest-ranked measures available. The smelting
industry is very energy-intensive, and access to
cheap energy often determines where the factories
are located. The processes are often old and
physically relatively simple, but produce large
amounts of pollutants. In the silicon carbide
industry the cancer risk has been documented since
2000 (Romundstad, Andersen, & Haldorsen, 2001),
and the dust exposure is documented to contain
fibers (Bye, Eduard, Gjonnes, & Sorbroden, 1985),
crystalline silica, silicon carbide (SiC) and
sulphur dioxide (S. Foreland, Bye, Bakke, &
Eduard, 2008). The workers in this industry are
heavily loaded with personal protective equipment:
dust mask, CO alarm, eye protection, hearing
protection, safety helmet, gloves and safety
clothes. The silicon carbide industry is not the
only industry using PPE as its main principle,
even though isolation of the process, local
exhaust ventilation or general ventilation would
have been more effective types of measures.

This was pointed out by Foreland (Solveig Foreland,
Bakke, Vermeulen, Bye, & Eduard, 2013) as late as
2012, stating that “recommendations for exposure
reduction based on this study are (i) to separate
the sorting area from the furnace hall, (ii)
minimize manual work on furnaces and in the
sorting process, (iii) use remote controlled
sanders/grinders with ventilated cabins, (iv) use
closed systems for filling pallet boxes, and (v)
improve cleaning procedures by using methods that
minimize dust generation”.

Following the hierarchy normally leads to the 
implementation of inherently safer systems, where 
the risk of illness or injury has been substantially 
reduced, but as the example shows, it is not always 
possible to use the most favorable measure. 

The same kind of example could be used for the
construction sector, as the work itself produces
pollutants that are impossible to avoid, such as
concrete dust, wood dust, stone dust and exhaust
from vehicles. The only one that may be substituted
is the exhaust, when new vehicle technology is
developed. This makes it necessary to use the
least favorable protective measure: personal
protective equipment.


3 PREVENTION OF OCCUPATIONAL 
ACCIDENTS 


The barrier concept is a basic foundation in
strategies for occupational accident prevention.
An accident can be understood as caused by energy
out of control (Gibson, 1961), where a hazard
(source of energy) releases energy. This energy is
then absorbed by a victim, which leads to loss
(health, life). Following this energy model,
accidents can be prevented by barriers that stop
or prevent a sequence of events that leads to loss
of control of hazards (Kjellén and Albrechtsen,
2017). As shown in Figure 2, the barrier philosophy
follows the same logic as control of exposure
(Figure 1). Seen in a functional perspective,
safety barriers perform tasks, such as preventing
falling objects from hitting people working below.
Such functions or tasks are performed by different
barrier elements that constitute a barrier system
(Rosness et al. 2010). Physical devices, human
actions and administrative procedures serve as
barrier elements meant to protect vulnerable
targets from harm (Sklet, 2006).

Figure 2. Control of hazards.

A ‘defence in depth’ strategy is commonly 
applied in the prevention of accidents. According 
to Reason (1997:12), major accidents occur as a 
result of failures in multiple layers of the defences 
separating potential hazards from people and 
assets. Accident trajectories pass through ‘holes’ 
in these defences, created by active failures—errors 
and violations—and/or latent conditions, such as 
lack of competence, design flaws and unrealistic 
procedures. 
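
The effect of layered defences can be illustrated with a small numerical sketch. It assumes, purely for illustration, that the barriers fail independently of each other, so that the probability of an accident trajectory passing all layers is the product of the individual failure probabilities. This assumption is not made in the sources cited above, and latent conditions such as design flaws or lack of competence tend to weaken several layers at once, so the product should be read as an optimistic lower bound. The function name and all numbers are hypothetical.

```python
from math import prod
from typing import Sequence


def accident_probability(p_initiating: float,
                         barrier_failure_probs: Sequence[float]) -> float:
    """Probability that an initiating event passes every barrier.

    Assumes independent barrier failures (an illustrative simplification).
    """
    for p in (p_initiating, *barrier_failure_probs):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_initiating * prod(barrier_failure_probs)


# Hypothetical figures: one initiating event per ten operations,
# three barriers that each fail on demand one time in ten.
p = accident_probability(0.1, [0.1, 0.1, 0.1])
print(f"accident probability per operation: {p:.6f}")
```

Under the independence assumption each added layer reduces the probability by its own failure probability; a shared latent condition would make the true value higher.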

For example, for a lifting operation, two barrier
functions must be in place: 1) prevent sudden
release of the gravitational energy that the lift
represents, and 2) separate the gravitational
energy that the lift represents from the workers
(establish a safety zone). The first barrier
function (prevent sudden release of energy) is
realized by a set of barrier elements: sling
mechanism, crane driver competence, slinger
competence, signal man, control rope etc.

Haddon (1980) defined ten generic principles for
the prevention of harm (injury) from transfer of
energy. These ten strategies are generic barrier
functions to either control the hazard, separate
the hazard and a victim, or make the victim more
robust to harm. See Figure 3.

Haddon’s (1980) strategies and the energy acci- 
dent model (Gibson, 1961) have had significant 
influence on European legislation and standardisa- 
tion work such as that related to hazardous chemi- 
cals (European Council 1998) and machinery 
safety (European Council 2006). The strategies are 
central components in accident investigation
methods such as OARU, the Management Oversight and
Risk Tree (MORT), and the Safety Management and
Organisation Review Technique (SMORT) (Kjellén and
Albrechtsen, 2017).


Hazard (energy source) -> Separate energy source
and victim -> Victim

Strategies related to the energy source:
1. Prevent build-up of energy
2. Modify the qualities of the energy
3. Limit the amount of energy
4. Prevent uncontrolled release of energy
5. Modify rate and distribution of the released
   energy

Strategies related to separation:
6. Separate, in time or space, the energy source
   and the vulnerable target
7. Separate energy source and the vulnerable
   target by physical barriers

Strategies related to the vulnerable target:
8. Make the vulnerable target more resistant to
   damage from the energy flow
9. Limit the development of the loss
10. Stabilise, repair and rehabilitate the object
    of damage

Figure 3. Haddon’s principles for accident prevention.

Risk mitigation in the European Council (2006)
Directive on Machinery is based on Haddon’s
strategies. ISO 12100:2010 shows a strategy for
selection of safety measures for machinery, see
Figure 4. The strategy reflects that hazards should
be prevented or limited at the design stage, i.e.
designed out. If this is not possible, protective
measures and safety controls should be established
to separate victims from the hazards of the
machinery. Residual risk remaining after these
measures can be accepted, provided that the
producer informs the user about it. The user of
the machinery is responsible for training and safe
operation at work, including providing necessary
personal protective equipment.
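
The priority order described above (design first, then safeguarding, then information to the user) can be sketched as a simple loop. The sketch is a loose illustration and not the procedure of ISO 12100 itself: the function name `reduce_risk`, the numeric risk scale and the idea of representing each measure as a multiplicative reduction factor are all assumptions introduced here.

```python
from typing import Dict, List, Tuple

# Ordered as in the text: design first, then safeguarding, then information.
STEPS = ("inherently safe design", "safeguarding", "information for use")


def reduce_risk(risk: float, acceptable: float,
                reductions: Dict[str, float]) -> Tuple[float, List[str]]:
    """Apply risk reduction steps in priority order until risk is acceptable.

    `reductions` maps a step name to a factor in (0, 1]. A step that is
    missing from the mapping is treated as not possible for this hazard.
    """
    applied: List[str] = []
    for step in STEPS:
        if risk <= acceptable:
            break
        factor = reductions.get(step)
        if factor is None:
            continue  # step not possible, try the next one in the order
        risk *= factor
        applied.append(step)
    return risk, applied


# Hypothetical hazard that cannot be designed out.
residual, used = reduce_risk(
    risk=100.0, acceptable=10.0,
    reductions={"safeguarding": 0.2, "information for use": 0.5},
)
print(f"residual risk {residual:.1f} after: {', '.join(used)}")
```

If the residual risk is still above the acceptance level after all possible steps, further action is required before the machinery can be considered safe to use.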

To implement the correct barrier system (functions
and elements), risk assessments are essential. The
results of a risk assessment serve as
decision-making support for implementing adequate
safety measures. ISO 31000 Risk Management
describes the principles and steps in risk
assessment and risk handling. Briefly, the steps
are: identify hazards and incident scenarios;
analyze causes; analyze frequency and
consequences; analyze risk; evaluate risk
according to risk acceptance criteria; and
mitigate risk. The analysis of risk is made by
systematically collecting available knowledge
about the analysis object and using this knowledge
to express what can go wrong in the future, what
the likelihood of it happening is, and what the
consequences are if it happens (Rausand, 2011).
This risk picture is then evaluated against risk
acceptance criteria to determine whether the risk
is acceptable or not.
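
The last two steps, expressing risk and evaluating it against acceptance criteria, can be illustrated with a minimal risk-matrix sketch. The 1 to 5 ordinal classes, the product score and the two thresholds are hypothetical choices made for this illustration; ISO 31000 does not prescribe them, and an organisation would define its own criteria.

```python
def risk_score(likelihood: int, consequence: int) -> int:
    """Risk as the product of ordinal likelihood and consequence classes (1-5)."""
    for value in (likelihood, consequence):
        if not 1 <= value <= 5:
            raise ValueError("classes must be between 1 and 5")
    return likelihood * consequence


def evaluate(score: int, acceptable_max: int = 4,
             unacceptable_min: int = 15) -> str:
    """Compare a risk score with (hypothetical) acceptance criteria."""
    if score <= acceptable_max:
        return "acceptable"
    if score >= unacceptable_min:
        return "unacceptable"
    return "reduce where practicable"


# Hypothetical scenarios: (description, likelihood class, consequence class).
scenarios = [
    ("falling object during lifting", 2, 5),
    ("minor cut during rigging", 4, 1),
    ("worker struck by vehicle", 3, 5),
]
for name, likelihood, consequence in scenarios:
    score = risk_score(likelihood, consequence)
    print(f"{name}: score {score}, {evaluate(score)}")
```

The middle band corresponds to risks that are neither clearly acceptable nor clearly unacceptable, where measures are weighed against their cost and practicability.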


[Flowchart: define limits of machinery;
identification of tasks for all phases of
machinery life; identification of hazards and
hazardous events; risk evaluation (is the risk
acceptable?); is risk reduction by design
possible?; is risk reduction by safeguarding
possible?; is risk reduction by user information
possible?; is residual risk acceptable?; further
action required; terminate.]

Figure 4. Risk treatment and risk assessment of
machinery (based on ISO 12100).


The Directive on Machinery shows how risk
assessment is the input to risk mitigation, see
Figure 3. ISO 12100:2010 shows the steps for risk
assessment of machinery. It follows the same steps
for risk assessment as described in ISO 31000, but
starts with an identification of the machinery and
the intended/unintended use of the machinery for
all of its life phases.


3.1 Tunnel construction as example 


An analysis of fatal accidents in hydropower
construction projects in Kjellén and Albrechtsen
(2017) shows that there are two types of fatal
accidents in tunnel excavation: falling rock/rock
burst, and workers crushed or run over by
vehicles. Barriers to prevent such fatal accidents
would mainly be to separate the hazard and the
victim in time/space. For blasting work, workers
are moved away from the blasting area.

During construction work other than blasting work,
the accident risks are in particular conflicts
between workers and heavy machinery, falling
objects (rock fall/rock burst) and getting crushed
during rigging.

One of the work activities in tunnel excavation is
the assembly of different materials and
infrastructure in the roof. The accident risks for
such operations are falling objects and crushing.
In Norway, there have been fatal accidents
involving scissor lifts, where the victim has been
crushed between the lift and the roof. Risk
mitigation for this scenario is to design in a
pressure-activated stop of the lifting mechanism.

Work at height will also involve significant
occupational hygiene risk. From an occupational
hygiene perspective it is well known that tunnel
workers are exposed to both particulate and
gaseous air pollution, and that tunnel workers are
at increased risk of long-term and short-term lung
function decline and COPD (Ulvestad et al.).
Different activities cause different exposures for
these workers. Exposure to gases and particles
from diesel emissions has been considered to be
among the dominating burdens during tunnel
construction due to wheeled diesel machines;
during drilling and blasting operations, workers
are also exposed to dust, with α-quartz as the
most important agent. The α-quartz content of dust
from tunnels varies between <1% and more than 50%
(Norwegian Tunnelling Society, Publication No. 13),
and α-quartz exposure may lead to COPD. Exposure
to oil mist and oil vapour is another type of
exposure that may occur during drilling. Exposure
to oil mist may cause occupational asthma and also
pulmonary fibrosis (Robertson et al., 1988). Risk
assessment hence has to be performed with a focus
on all these possible exposures.

If we look at Figure 1 in order to identify
preventive measures, it seems logical that
elimination and substitution are difficult. None
of these activities can be eliminated if the
tunnel is to be constructed. Substitution might be
possible if we look at the type of explosive used
for blasting. Ammonium nitrate fuel oil is used as
explosive, and if this agent is substituted with
site-sensitised emulsion, worker exposure is lower
(Ulvestad). Apart from this example, engineering
controls are often the first possible preventive
measure, where ventilation of the tunnel and
pollution-abatement equipment for diesel vehicles
are some of the measures often used.
High-frequency maintenance routines are an example
of administrative controls, while personal
protective equipment should only be the last
solution, but in real life is often used on a
daily basis.


4 DISCUSSION 


4.1 Common grounds 


The law/regulation on internal control makes it
clear that all enterprises in Norway should have a
system for safeguarding health, safety and
environment. This system includes information on
health and safety issues in the working
environment, requirements for establishing goals
for the HSE work, performing risk analyses for any
hazard, and establishing routines for detecting,
correcting and preventing violations of the law
and regulations.

The internal control regulation that is the source 
for management systems for HSE control is the 
same for both safety and workers health, and builds 
on quality principles. Theories on quality gained
much attention in industry from the 1980s, and
were first related to improvements in the
production processes. Total quality management became
an influential movement, spurred by the works of 
Deming (1986) and Juran. Later, the principles 
were used as a foundation for internal control for 
HSE in Norway (Saksvik & Nytrø, 1996). 

The idea of continuous improvement is evident in
both the prevention of occupational diseases and
the prevention of occupational accidents, as
illustrated in Deming’s circle (Figure 5).

Figure 5. Deming’s circle.

In the prevention of occupational diseases related
to exposure to chemicals, quality principles are
at work when exposure levels are compared to OELs,
and when measures are taken to mitigate or
eliminate the exposure, as illustrated in the
hierarchy of controls (Figure 1).

In the prevention of occupational accidents, the
idea of continuous improvement is the foundation
for safety information systems and the experience
feedback such systems entail. By means of safety
indicators, safety audits and accident
investigations, information is applied to
implement measures for accident prevention.

The barrier philosophy of both areas is the same.
Both aim first at prioritizing efforts directed at
the source of danger by elimination, modification
and limitation. Second, both areas emphasize
avoiding contact between hazards and victims by
substitution and separation. Third, both areas
emphasize engineering control by built-in
solutions in design. Finally, the last measure in
both areas is to approach the victim through
information, training, procedures and, lastly,
personal protective equipment.

Many of the occupational hygiene risk factors 
can contribute to higher accident risk by influenc- 
ing human performance. The human operator is 
an essential barrier element to realize barrier func- 
tions. Stress (fatigue, time load and task load); sit- 
uation/environment (physical and chemical work 
environment); and human-machine interactions 
are among performance shaping factors (Groth 
and Mosleh, 2012) that affect the quality of the 
human barrier element in accident prevention. 


4.2 Differences in the nature of hazards and 
consequences 


One obvious difference between the health and
safety fields is the nature of the hazards that
should be handled. Hazards in the field of safety
are a form of energy that is not properly
controlled. Further, hazards may cause immediate
harm if safety barriers are not in place, or if
they are not functioning as intended.

In the area of occupational health, hazards are
not limited to energy sources, although vibration,
radiation and noise are clearly within the energy
perspective. Toxic fumes and carcinogenic and
poisonous agents also represent hazards, which are
not directly related to energy and are more or
less invisible in nature.

Further, hazards within the occupational health
domain in many instances do not cause immediate
harm; the harm may be delayed. In some instances
it may take several decades from exposure until
the loss is evident. Although there are nuances,
this can be summed up in a general manner as in
Table 1:

Even though more than six times as many
occupational deaths are caused by diseases as by
accidents, there is a tendency for accidents to
get more attention. Accidents are concrete, often
dramatic events with immediate consequences. This
will naturally generate attention from employers,
authorities and the general public, often followed
by demands for measures that ensure that similar
events will not take place in the future.

Occupational diseases are less dramatic, and 
as the consequences are often delayed, there will 
also be an issue of employer responsibility for the 
harm. The employee might have changed employer 
several times before the disease is evident. Diffused 
responsibility, coupled with the more invisible 
nature of the hazards, and the latency period from 
exposure to disease, might result in less attention 
from outside actors. 

In the end it might also lead to fewer resources
for the prevention of occupational diseases,
relative to the needs. The regulatory authorities
should be aware of such mechanisms, and implement
requirements to ensure that the prevention of
occupational diseases gets proper attention and
resources.


4.3 Cross-disciplinary learning? 


Although some of the preventative strategies
related to accidents and diseases seem to be the
same, the professional concepts that are applied
differ. For example, control of exposure vs.
control of hazards refers to the same line of
thinking. The professional concepts in the two
fields have developed over a long period, and are
institutionalized and applied in research and
theory building. The distinct repertoires of
concepts ensure precision and are a foundation for
theory development within the two fields. Thus,
the development of a common professional language
seems unrealistic and also undesirable.

Table 1. Differences in the nature of hazards and
consequences in occupational safety and
occupational health.

                      Hazards     Consequences
Occupational safety   Visible     Immediate
Occupational health   Invisible   Delayed

Still, mutual awareness of the concepts, models 
and theories that have been developed in the respec- 
tive fields can represent a fruitful cross-pollination, 
and instigate new ideas for prevention. For
example, within the safety field, there exist many
accident models: domino models (Heinrich, 1931),
information models (Turner & Pidgeon, 1997), and
the Swiss cheese model (Reason, 1997), to name a
few. There are also perspectives on what kind of
organizational characteristics may prevent
accidents, including the theory of
high-reliability organizations (Weick, Sutcliffe,
& Obstfeld, 1999) and Resilience Engineering
(Hollnagel, 2014). Researchers within occupational
health might find such models to be inspiring and
of relevance.

A basic concept related to the prevention of 
occupational diseases is OELs. Actual exposure 
levels of chemical agents are compared to OELs 
to determine whether the exposure might induce 
harm. Much research lies behind the definitions of 
OELs. This might be an interesting area for occu- 
pational safety. Although the hazards are of a dif- 
ferent nature, setting clear thresholds for exposure 
might be an issue for further exploration, even if 
there exist similar lines of thinking (e.g. the notion 
of acceptance criteria, and the ALARP principle, 
As Low as Reasonably Practicable). 

In many instances, HSE practitioners working 
within companies (e.g. HSE engineers, managers 
etc.) will have responsibilities related to both health 
and safety. From their professional background, 
they will have insight into the different research 
fields, and be in a position to integrate them, and 
treat HSE as a holistic concept. Academics tend 
to be more specialized within one of the fields of 
research. Thus, for researchers interested in cross- 
disciplinary learning, HSE practitioners might be 
a resource for practical knowledge integration. It 
is also a responsibility for academics and educators 
to prepare prospective HSE workers for the
cross-disciplinary challenges they will meet in working life.
Real working life problems do not necessarily follow 
professional boundaries, but require a cross-discipli- 
nary approach. Thus, cross-disciplinary learning in 
the areas of occupational health and safety should 
be an important issue in university education. 


5 CONCLUSIONS 


The paper demonstrates that there are similarities
between management of occupational health and
safety, but also that there are potential
improvements for better integration of the two
areas in practical management.

As indicated in the introduction of the paper,
there are far more deaths and personal harm due to
poor working environment than there are deaths and
injuries due to occupational accidents. However,
accidents and accident prevention seem to get more
attention from HSE practitioners and mass media. A
possible explanation for this picture could be the
different nature of the hazards and the lagging
consequences of occupational exposure compared to
the sudden consequences of an accident.

Common to the prevention of both occupational
disease and occupational injury is that adequate
planning would prevent many events by establishing
adequate barriers.

Topics for further research on the interaction
between occupational health and safety can include:

How is HSE implemented in practice in the building
and construction industry? Does safety receive
more focus than the work environment? And if any
difference can be identified, what is the
explanation behind it? Some hypotheses could be
investigated: A) Simple measurable parameters
within safety, such as accidents per 1000 hours of
work or absence per 1000 hours, make safety easier
to control. B) Control of exposure is not defined
as a project-specific activity and is hence an
activity belonging to the internal control system
of the subcontractors, which in practice means
greater variation in how much focus it gets in
practical work. C) Differences between safety
culture and work environment culture exist.
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Customs—a vital contributor to safe societies? A study of the Norwegian 
customs service 
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ABSTRACT: Prior to the 2011 Norway terrorist attacks, widely referred to as ‘22 July’, the Norwegian 
Customs Service (Customs) had alerted the Norwegian Security Police Service to the importation of 
weapon-related products by a known terrorist. The purchases made by this terrorist initially led officials 
to place him on the watch list of the Norwegian Security Police Service, but no action was taken based 
on the information given. In the federal Report of the 22 July Commission, the poor coordination of the 
Norwegian Security Police with Customs was criticized. In this paper, we address whether Customs, as 
an important player in preventive safety and security, is actually being regarded as such. From a societal 
safety and risk-based regulatory perspective, we discuss the challenges of the state in recognizing and
supporting the effectiveness of Customs as a vital contributor to a safe society.


1 INTRODUCTION 


In today’s struggle to ensure safe, secure and resil- 
ient societies, we strive to avoid industrial and 
technological accidents and hazards, man-made 
disasters and health-threatening hazards and pan- 
demics. We take steps to prepare for natural dis- 
asters and climate change as well as to ensure the 
safety of critical infrastructures and supply chains. 
In this effort to ensure safe societies, federal cus- 
toms services are increasingly important players, 
especially with respect to ensuring border security 
and fighting international crime and terrorism, as 
well as addressing the relatively recent European 
migration challenge. 

A safe society must find a way to regulate and 
control the above-mentioned hazards and threats, 
which might initially seem to be an impossible task. 
Safety and security policy frameworks cover many 
societal levels and sectors. The implementation of 
such policy frameworks presupposes the interac- 
tion and close coordination between politicians, 
bureaucratic organizations, interest groups and 
other relevant players, as well as citizens in gen- 
eral. This again depends on the effective exchange 
of information and communication between the 
relevant players. This information may include, 
for example, policy documents, project or research 
results and field or operational experiences, all of 
which require horizontal and vertical
coordination. Security and risk management in the
areas of border control and criminal networks is
also influenced by international policies covering
various sectors and operations. As such, many
individuals at different levels in a range of
organizations must be aware of each other. Are
they? In this article,
we discuss the extent to which the resources held 
by the Norwegian Customs Service (Customs) 
are recognized and utilized in Norway’s effort 
to ensure a safe society. In our study, one of the 
authors investigated a total of 36 official strategic 
documents both from Customs and other signifi- 
cant organizations that had been generated over a 
17-year period (2000 to 2017). 


2 WHAT IS A SAFE SOCIETY? 


What do we consider to be a safe society? The term 
safe society is related to the threats and dangers 
faced by that society. We can define a safe society 
as follows: The ability of a society to uphold
important social functions and safeguard its
citizens’ lives, health, and basic needs under
different forms of strain (Engen et al. 2016:30).

According to Beck (1992), our perception of 
risk and the development of modern societies are 
closely connected. Beck argues that globalization
and societal development bring new risks and
threats that require new ways of thinking to
effectively protect ourselves. The author divides modern
development into three phases—pre-modern soci- 
ety, modern society and risk society. These phases 
or stages are essentially connected to the periodiza- 
tion of social change, namely pre-modernity, sim- 
ple modernity and reflexive modernity. 

In Beck’s (1992) perspective, modern society is
associated with how old and new risks are organized
and how a society develops its social systems,
governance, and politics with respect to risk
handling.
The risk society developed as an outcome of the 
industrial society. The hallmarks of industrial and 
modern societies are technological and economic 
progress and the development of national demo- 
cratic institutions. In this phase, the state is governed 
by regulations that guarantee economic growth and 
the distribution of wealth, risk, safety and security. 

However, Beck (1992) also argued that modern 
societies’ continuous production of more goods to 
an increasingly wealthy population has drawbacks. 
These include the side effects of chemical, bio- 
logical, and/or environmental pollution, as well as 
complex technological developments that increas- 
ingly intervene into the social life of individuals, 
and which lead to new risks and insecurities. The 
risks associated with technological development 
and globalization are complex, are not always 
obvious or easy to detect and can strike arbitrar- 
ily. Individuals cannot protect themselves against 
these risks, in contrast to the risks associated with 
the industrial society, which are more often aligned 
with social class. The risk society is more individu- 
alized, but is also still an industrial society, due to 
its involvement in the creation of the technologies 
leading to the new risks. 

Beck (1992) argues further that the transition from 
a modern society to a risk society can be described 
as a transition from the distribution of goods to the 
distribution of risk, which implies that new risks can 
strike at random and individually, and they can be 
global and impossible for national institutions to 
manage. This random aspect can be debated as, for 
example, climate change is likely to impact third- 
world countries harder than westernized countries. 

An interesting point in Beck’s (1992) description 
of a risk society is the question of how societies 
can meet the challenges of living in a world with 
so many uncontrolled risks. Can we fully depend 
on existing risk analyses and political strategies to 
effectively address these new risks? 

The hallmark of a safe society is its population’s
trust in the state governance (Engen et al. 2016).
One way states gain this trust is to develop their
abilities, to the extent possible, to appropriately
regulate risks (Baldwin et al. 2012).


1.1 Societal safety and security 


According to Beck (1992), Kaldor (2007) and Gid- 
dens (1999), among others, we have moved from a 
threat perspective, in which state security was the 
main focus, to a perception in which the threats 
to the society have become our essential focus. 
Today, our vulnerability is more dependent on glo- 
bal events and processes, so customs services have 
become even more relevant as societal protectors. 


Engen et al. (2016) argue that the terms risk, 
security and safety have politically explosive force. 
Questions concerning national security, sustain- 
able development, accident prevention and human 
security are examples of this force and give us some 
clue to the most important aspects characterizing 
a safe society. 

Also central to the concept of a safe society is the 
term insecurity. Risk, simply defined, is a product 
of probabilities and consequences that tell some- 
thing about the future. Insecurity is a more interest- 
ing dimension of risk. Insecurity can sometimes be 
offset by broadened knowledge, but always involves 
the absence of future insight, which must be compensated
for by political assumptions and estimations
when making decisions (Engen et al. 2016).
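
Read literally, the simple definition of risk given above corresponds to the familiar expected-loss form. As a sketch only, with illustrative symbols that do not appear in the cited sources, let $p_i$ be the probability of a possible future event $i$ and $c_i$ its consequence:

```latex
R = \sum_{i} p_i \, c_i
```

For a single event this reduces to $R = p \cdot c$. Insecurity, in the sense used here, concerns the situations in which $p_i$ or $c_i$ cannot be estimated with confidence, so that the sum cannot be evaluated without assumptions.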

Threats and vulnerability have different time 
perspectives; some occur suddenly, some develop 
slowly, and all require good and flexible planning, 
regulation and handling (Rosenthal et al. 2001).

What is most important to protect? Some areas 
have already been mentioned, and critical infra- 
structure is certainly an important component. In 
this paper, however, we focus on the public insti- 
tutions that safeguard the important functions 
that ensure the proper functioning and safety of 
societies. These formal organizations, for example, 
customs services, are fundamental and essential 
societal arrangements. 

Our ‘modern threats’ challenge these institu- 
tions in that they often do not recognize national 
borders. Examples include international criminal 
networks, climate change, large migration populations,
epidemics and economic/financial systems,
which all increase the vulnerability of our society. 

In the struggle to ensure a safe society, Norway 
has highlighted the need for key resource organi- 
sations to identify and cooperate with each other 
in appropriate ways by implementing contingency 
preparedness principles that are applicable to pub- 
lic institutions. These principles, including respon- 
sibility, proximity, similarity and coordination (also 
known as risk management principles), underpin 
the support and regulation of governance in the 
crisis management practices adopted in ministries, 
departments and directorates. These kinds of reg- 
ulations, regulations in general and especially risk 
regulation are needed to ensure a safe society. Risk 
regulation is a challenging task in modern socie- 
ties, termed “risk societies” by Beck (1992), due to 
the pace of globalization and change. 


2 RISK REGULATION IN A CHANGING 
SOCIETY 


To build a safe society, the state establishes risk 
management practices through regulations, standardizations
and requirements that stipulate the
responsibilities of different actors in the market- 
place (Lindge et al. 2015). Risk management is 
mainly the balance between the ability to realise 
what is wanted and avoiding what is unwanted. 
According to Lindge et al. (2015), the general eco- 
nomic liberalization characterizing the globalized 
marketplace has led to a focus on effective produc- 
tion and division of labour, as well as the removal 
of trade barriers and the development of harmo- 
nized regulations. 

Risk management governance has been on the 
OECD agenda for some years. Regulation policy 
has evolved from a focus on negative effects and 
costs to a positive regulatory approach as an 
important premise for added value and democratic 
governance. Many policy documents, guidelines, 
reform programmes and procedures have been 
issued to ensure the use of regulation as a posi- 
tive policy instrument (OECD 2010, OECD 2011). 
Risk analysis and risk considerations have also 
become more integrated in matters of common 
governance and inspections (Lindge et al. 2015).

What is risk regulation, why do we regulate and 
what is “good” regulation? These are vital ques- 
tions for political debate and discussion regarding 
regulation to better understand the need for risk 
regulation in a changing society. Selznick (1985) 
defined regulation as “a sustained and focused con- 
trol exercised by a public agency over activities that 
are valued by a community.” 

Baldwin et al. (2012) described regulation as a 
specific set of commands involving a social group, 
such as Health, Safety and Environment (HSE) 
rules, a deliberate state influence (like revenue) 
and all forms of social and economic influence, 
whether they are state-based or have other sources 
such as the self-regulating marketplace. The latter 
is addressed by the theory of ‘smart regulation’,
which states that regulation may be carried out by 
a host of bodies other than the state. These might 
include corporations, self-regulators, voluntary 
organizations or professional or trade bodies. 

Regulation represents a multidisciplinary
approach whereby political, financial, juridical 
and social matters are taken into consideration. 
Regulation discussions are wide-ranging and can 
be based on assessments or official and political 
debates related to, e.g., social protection, future 
planning, the challenge to ensure human rights and 
the principle of division of powers, or to counter 
monopolization and ensure the fair allocation of 
scarce resources (Baldwin et al. 2012).

A conglomeration of laws, regulations, standards,
guidelines, rules and procedures is taken
into consideration in discussions of how to regulate
areas of interest. “Red-light” and “green-light”
concepts are then used, which are based on different
regulatory perceptions. The red-light concept
is based on the idea of regulation as restricting 
behaviour and preventing the occurrence of unde- 
sirable activities. The green-light concept, in con- 
trast, is based on a broader view of regulation as 
more enabling or facilitating of desired behaviours 
or activities. 

With respect to risk regulation, there are par- 
ticular areas of interest related to a safe society. 
These interests might be, for example, ensuring 
the provision of product information to citizens, 
establishing social solidarity (to support a welfare 
state), maintaining important basic state services, 
discouraging unwanted or criminal behaviour, and 
protecting vulnerable interests and citizens from 
harm or hazards in their work environment and in 
general. 

Engen et al. (2016) argue that regulation is both
a controlling mechanism and a judicial body that 
regulates a special area or field. For example, how 
should resources, safety and security be addressed 
in the oil and gas industry? Generally, there will be 
public interest in good governance and risk control 
regarding the threats and hazards associated with 
this industry. Risks that threaten public infrastructure,
public interests, human or technical security
or the environment—both with respect to material 
and intangible values—as well as crime and terror- 
ist attacks are often subject to risk assessment and 
regulation, which are in the public interest. 

Nevertheless, legislation, regulation, govern- 
ance and jurisdiction often lead to dilemmas, such 
as that referred to by Foucault (1999) in his term
“governmentality”, which pertains to how power
(the state) establishes itself and constructs ‘truth’,
which becomes the basis for measures to facilitate 
regulation and control mechanisms. 

According to Baldwin et al. (2012), a number of 
pitfalls must be avoided to ensure that affected citi- 
zens, interests or social groups will accept the given 
regulation. First, there must be a balance between 
efficiency and control—the regulator must be sen- 
sitive to democratic rights. This again presupposes 
regulatory legitimacy, which is a premise for suc- 
cess. The premise of legitimacy is that a regulator 
has the necessary competence and a relationship 
of trust with the affected group. Experience shows 
that, to succeed, the processes of implementation 
must be transparent and fair, as well as efficient 
and easy to implement. 

Baldwin et al. (2010, 2012) argue that there
are some well-recognized reasons for regulation, 
including social protection to ensure social solidar- 
ity, human rights to protect weaker citizens, ration- 
ing to ensure the allocation of scarce commodities 
or goods based on the public interest, and the 
avoidance of moral hazards in sharing costs and 
benefits.
However, many sector and industry regulation
efforts are based on a combination of rationales 
and have corresponding strengths and weaknesses 
related to their design and implementation, includ- 
ing the societal safety and preparedness princi- 
ples operating in Norway, which we discuss in the 
following. 


3 CUSTOMS’ CONTRIBUTION 
TO A SAFE SOCIETY 


Is Customs regarded as an important contribu- 
tor to ensuring safe societies? Customs in Nor- 
way is organized under the Ministry of Finance, 
is divided into six regions and a directorate, and 
employs roughly 1,600 people. 

Societal safety efforts to ensure public security 
and civil protection in Norway are based on four
principles: responsibility, similarity, proximity 
and cooperation, as found in the abovementioned 
contingency preparedness principles applicable 
to public institutions. The first three principles 
were established in 2000 by the Norwegian Offi- 
cial Report NOU 2000:24 A vulnerable society. 
The principle of responsibility implies that an 
organization normally responsible for a task is 
also responsible for the necessary emergency pre- 
paredness and crisis management in that area. The 
principle of similarity means that the organization 
operating during a crisis should be as similar as 
possible to the regular organization. The principle 
of proximity states that a crisis should be handled 
at the lowest possible level. 

The principle of cooperation was introduced 
after the terrorist attacks in 2011, which are 
referred to as 22 July. It requires authorities and 
agencies to have independent responsibility to 
ensure good cooperation with relevant parties in 
prevention-related issues and crisis management 
(White Paper: Meld. St. 10 (2016-2017) Risk in a 
Safe and Secure Society). 

As a contributor to societal safety, Norwegian 
Customs seems to have been given scope for a
greater role following the establishment of the 
principle of cooperation. Customs’ presence along 
the borders of Sweden, Finland and Russia, and at 
international airports and along the coast, points 
to its great potential for becoming a vital contribu- 
tor to the protection of society. 

The goal of this study is to determine the extent 
to which Customs is actually regarded as a con- 
tributor to Norway’s societal safety, and if it is so 
regarded, whether or not this attitude has changed 
over the past 17 years. 

The reason for the rather limited time span is 
that the term societal safety had not yet been established
in Norway prior to this time. The term was
officially defined in 2002 (White Paper St.meld.
nr. 17 (2001-2002)), although it had been in unofficial
and limited use for a couple of years prior.

Nine of the documents we investigated in this 
study were strategic official documents from Nor- 
wegian Customs, 15 were from the Ministry of 
Finance, and 12 were strategic official documents 
related to societal safety published between the 
years 2000 and 2017. The latter were mainly Official
Norwegian Reports, White Papers and draft
resolutions to Parliament (Stortinget). 

In this study, we examined strategic Norwegian 
Customs documents published in the past 17 years,
because it was in 2001 that Customs began to explicitly claim
its role as an important protector of society. Cus- 
toms has been increasingly conscious of this role in 
recent years, and its intention seems to have inten- 
sified after the Norwegian Tax Administration 
took over the responsibility of collecting taxes and 
duties in 2016. In the last document studied (2017), 
high ambitions are expressed in Customs’ statement
that it protects society through risk-based,
customized and effective control measures. 

The strategic documents produced by the Min- 
istry of Finance that we investigated in this study 
do not explicitly acknowledge the role of Customs 
as a protector of society until 2016, but then do so 
emphatically. 

The first Norwegian Official Report included in 
this study is NOU 2000:24 “A vulnerable society.”
Its main focus was to assess societal vulnerability
to strengthen Norway’s levels of security and
preparedness. The role of the Ministry of Finance 
with respect to societal protection is limited to 
banking and financial issues, and Customs is not 
mentioned at all as a relevant contributor. 

Only two of the 12 documents related to soci- 
etal safety clearly recognize Customs as a protector 
of society. After the terror attack of 22 July 2011, 
a government-appointed commission authored 
a Norwegian Official Report (NOU 2012:14 22. 
juli-kommisjonens rapport) evaluating the incident 
and proposing answers as to how such a terrorist 
attack could happen. In that report, Customs is 
mentioned almost 150 times, and it is evident that 
prior to the terrorist attacks, Customs had alerted 
the Norwegian Security Police Service regarding 
the terrorist’s importation of weapon-related prod- 
ucts. The terrorist’s purchases initially resulted in 
his being placed on the watch list of the Norwe- 
gian Security Police Service, but no further action 
was taken in response to the information given. 
The commission states in its report that Customs 
is an important contributor to societal protection, 
and it criticized the Norwegian Security Police’s 
poor coordination with Customs. 

The second document that acknowledges Customs
as a protector of society is the report by the Office of the
Auditor General of Norway, which audited
Customs’ border control in 2014 (Document 3:7
(2013-2014)).

The most recent document included in this study 
is the white paper titled Risk in a Safe and Secure
Society (Meld. St. 10 (2016-2017)). It defines and
discusses eight core areas considered to be highly 
significant for public security, three of which are 
contagious diseases, hazardous substances and 
civil-military cooperation. With respect to conta- 
gious diseases and dangerous substances, the white 
paper refers to the national strategy for Chemical, 
Biological, Radiological, Nuclear, and Explosive 
(CBRNE) preparedness. It also addresses various 
kinds of heavy crime, including the financing of 
terrorism. In all these fields, Customs plays a natu- 
ral role through its administration and control of 
the flow of goods over the border. 

The white paper clearly states the legal authori- 
ties considered to be relevant contributors to a 
safe and secure society, and Customs is not among 
them. Despite the fact that Customs obviously 
protects society from, e.g., CBRNE goods and
the transportation of dangerous goods over the 
border, the agency is only marginally referenced 
and is not mentioned at all in relation to societal 
protection. 

The term societal safety has also changed dur- 
ing the years included in this study. After the 22 
July terrorist attacks, the concept was broadened, 
and today it also includes a preventive dimension, 
i.e., the protection of society. This change affects 
Customs, as the concept is now aligned with Customs’ core
responsibility. Through the administration and control
of the flow of goods, Customs can contribute to 
the protection of society. Specifically, it does so 
by preventing the smuggling of dangerous goods 
and illegal imports/exports that can destabilize the
economy and/or may be directly hazardous to the 
population. One could perhaps expect this defini- 
tion change to naturally move Customs closer to 
its role as societal protector. However, the results 
of this study reveal that there has been no change 
in the way Customs is officially regarded in the 
documents presented to Parliament in 2000 versus 
those in 2017. 


4 DISCUSSION 


Is Customs regarded as a vital contributor to soci- 
etal safety? 

Beck (1992) emphasized that political and pub- 
lic institutions lack the political expertise to handle 
risks and threats. With this in mind, we might ask 
why the administrative system in Norway did not 
follow up on the proposals concerning risk management
coordination and cooperation introduced
in NOU 2000:24 “A vulnerable society”, and
later in the 2006:6 white paper, and most recently
in the 2012 report of the 22 July Commission. By implementing
the societal safety and preparedness principles of 
responsibility, similarity, proximity and coopera- 
tion, Norwegian authorities have made efforts to 
regulate and meet the requirements of societal 
protection. Societal protection, as Baldwin et al. 
(2012) argue, is a well-recognized justification for 
risk regulation. However, in reality, its implemen- 
tation, especially with respect to organisational 
cooperation, has not been an easy task. 

The Ministry of Justice is in charge of the over- 
all horizontal governance between ministries and 
departments for the enforcement of the societal 
safety and preparedness principles. The review of 
the public documents in this research project span- 
ning the past 17 years reveals that cooperation 
between governmental departments is fragmented. 
This situation might be related to the principle of 
responsibility, which is strong in Norwegian governance,
thereby leading ministries to leave this
responsibility to the Ministry of Justice, and to 
forsake their own capacities to implement coop- 
erative measures. 

With the focus on enabling and facilitating 
desired behaviour and activity—the intention 
of the societal safety and preparedness princi- 
ples—regulating public institutions according to 
“green-light” concepts is a challenging task. Public
institutions are to some extent self-governing and 
represent complex organizations with many levels 
and factions. 

The National Audit Office (Riksrevisjonen) 
has noted the difficulties of the Ministry of Jus- 
tice in accomplishing organisational cooperation 
(Document 3:8, 2016-2017). This report found 
the mechanisms for cooperation to be weak or not 
functioning at all, as presupposed. When rules and 
regulations are to be enforced by various authori- 
ties under different departments, the inevitable 
result is weaker cooperation mechanisms. 

Additionally, the fact that Customs is organized 
under the Ministry of Finance, whereas the Min- 
istry of Justice has the responsibility for providing 
risk or societal safety regulations, highlights the 
fragmentation of the hierarchy in maintaining and 
enforcing laws and regulations. Vital societal safety 
actors, such as the Ministry of Justice, seem to lack 
sufficient knowledge about how Customs enforces 
numerous regulations related to various authori- 
ties. This knowledge gap regarding the opportu- 
nity set of Customs might be the main reason for 
the Ministry’s failure to involve Customs in coop- 
eration and coordination efforts to prepare for and 
manage the risks related to societal safety. 

The result of this situation has been fragmentation,
and the dilemma regarding the adoption of
centralized or decentralized management has led
to something in between, or a vacillation between
both. According to Fimreite et al. (2011), an overall
body is missing, one whose main tasks include
the effective cooperation, harmonization, and 
integration of societal safety public authorities in 
Norway. 

The specialization and resources of Customs 
and their broad control authority do not yet seem 
to have been taken into consideration with respect 
to ensuring societal safety. The question is whether 
the Customs’ vague or less visible role as a vital 
societal safety actor can be regarded as an informal 
deregulation, which can only lead to increased risk 
in the form of undesired incidents or accidents that 
threaten societal safety. 


4.1 Risk management in public institutions 


When vertical cooperation in public institutions 
does not function optimally, neither do the com- 
munication and information processes that are 
vitally important in effective risk management. 
Some public institutions have developed a hierar- 
chical steering model incorporating a hierarchical 
command and control system. In these systems, 
the horizontal dimension of cooperation and coor- 
dination can be challenging. 

Risk regulation presupposes that governing units 
hold a combination of specializations, both verti- 
cally and horizontally, which is one reason why the 
task of ensuring overall cooperation, as led by the 
Ministry of Justice, is so important. Criticism from
the National Audit Office (Riksrevisjonen) (Document
3:8, 2016-2017) points especially to the need
to strengthen cooperation between actors across 
sections or sectors during planned exercises and 
to then follow up. Customs’ role in societal safety 
issues would benefit from better horizontal cooperation
and coordination.

Despite the stated ambitions of both the Min- 
istry of Finance and Customs itself regarding the 
important role played by Customs in societal pro- 
tection, there remains a dilemma. This dilemma 
is between the sectorial decentralized responsi- 
bilities and follow up regarding societal safety 
issues on one side and the centralized decision- 
making authorities and resources on the other. 
This leads to weak governance and cooperation 
and, accordingly, the inevitable invisibility of 
the role of Customs to central decision-making 
authorities. 

This contrasts with Baldwin et al.’s (2012) concept
emphasizing that, to be efficient and easy, proc- 
esses of implementation must be transparent and 
fair. How else to ensure legitimacy, which presup- 
poses that the regulator has the necessary compe- 
tence and trust of the populace? 


4.2 An appropriate regulation regime? 


If we consider Norway’s public institutions such as 
Customs and other important cooperating author- 
ities, can we conclude that Norway has an effective 
risk regulation regime? 

The societal safety and preparedness princi- 
ples of responsibility, proximity and similarity are 
based on an understanding that various speciali- 
zations are needed in public institutions. The con- 
cept behind the cooperation principle is to ensure 
that the other principles are not developed using 
a sectorial approach that erects barriers between 
actors. 

In white paper No. 17 (2001-2002), the coop- 
eration responsibility of the Ministry of Justice 
concerning the preparedness and civil protection 
of society was underlined. 

As Customs is subordinate to the Ministry of 
Finance and the Ministry of Justice has respon- 
sibility for societal safety, it is difficult to readily 
achieve the necessary horizontal cross-section 
overview and cooperation necessary for good soci- 
etal safety governance. 

Governance consists of planned and objective- 
oriented activities coordinated between relevant 
actors, which presupposes mutual dependence 
(Engen et al. 2016). If the vertical coordination
(between departments and directorates) is weak 
and controlling authorities lack knowledge about 
Customs’ enforcement of regulations related to 
various authorities, the opportunity set for Cus- 
toms to contribute to safe societies is reduced. 

This highlights the importance of the partici- 
pation of relevant actors in the decision-making 
process. If the Ministry of Finance does not par- 
ticipate in relevant assemblies in which societal 
safety issues are discussed and decided upon, there 
is no avenue for obtaining the support from those 
in charge of societal safety. The successful outcome 
of political decision-making processes depends on 
who participates and the responsibilities or man- 
dates of the actors. 

When these actors do not view Customs as an 
equal participant, the decision-making processes 
and negotiations concerning the role of Customs 
as a societal protector cannot be effective. In addi- 
tion, Customs must be allowed to participate in 
these processes to demonstrate their societal pro- 
tection resources and goals. 

Since societal safety policy is organized piece- 
meal across sectors, it has become fragmented 
due to the spread of responsibilities among many 
actors. Fimreite et al. (2011) stated that with 
improved horizontal cooperation between the 
departments involved, the development of dif- 
ferent policies that undermine each other can be 
avoided. Instead, Norway can make the most of
scarce resources by bringing together various policy
interests to initiate combined action and synergy
in the field of societal safety.

The Ministry of Finance may not be reaching 
the necessary political circles to share the goals and 
resources of Customs, since there is such a clear 
discrepancy between the Ministries of Finance and 
Justice regarding the role of Customs in matters of 
societal safety. 


5 CONCLUSIONS 


Is Customs recognized as an important contributor 
to ensuring a safe society in Norway? In this arti- 
cle, we discussed the extent to which the resources 
held by Customs are known and utilized in this 
effort to ensure such a safe society. 

There will always be dualism in the Customs 
portfolio, i.e., despite its ambitions to do so, it 
can never fully control the flow of goods over its 
borders. Societal safety must be balanced against 
democratic values and the national obligation to 
maintain free movement of legal goods, as aligned 
with what Baldwin et al. (2012) refer to as regulator
sensitivity to democratic rights. However, the
documents we analysed reveal that, although Cus- 
toms and the Ministry of Finance are eager to per- 
ceive Customs as having a vital role in protecting 
society, Customs is not sufficiently visible in the 
national preparedness overview. Our analysis indi- 
cates that the intensified focus of Customs itself as 
well as the Ministry of Finance with respect to its 
societal safety role has not been recognized by the 
Ministry of Justice, which is responsible for soci- 
etal safety and preparedness in Norway. Based on 
the societal safety and preparedness principles of 
responsibility, proximity, similarity and cooperative
governance, Customs has knowledge, presence
and wide statutory authority along Norway’s
borders, and can thereby make a vital contribution
to ensuring a safe society. However, it seems
that Customs has yet to be regarded by the state as 
an important contributor to safe societies; at least 
Customs is not described as such in the documents 
analysed. Furthermore, there has been no change 
in this perspective over the 17-year study period. 
This represents a paradox in light of Beck’s (1992) 
description of a risk society and the increasing glo- 
balization in recent decades. 


The results of this study shed light on the 
potential for better governance with respect to the 
cooperation principle related to risk management 
efforts. By focussing more on cooperative govern- 
ance, Customs’ importance as a societal protector 
in the nation’s preparedness regime might become 
more visible, effective, and valued. 

As yet, however, the complexity of the pol- 
icy frameworks, lack of cooperation between 
vital actors and the resulting fragmentation of 
information can only lead to a continued lack 
of awareness and poor capacity building and 
training initiatives that are so vital for ensuring 
Customs’ ability to be fully operationalised in 
ensuring a safe society.


ABSTRACT: Airports are potential targets for terrorist attacks. The high level of risk inclines airport
managers to take preventive measures. The decision-making process is difficult, as there have been no methods
for the quantitative assessment of changes in security levels as a result of the actions taken. The aim of
this study is to develop a method that quantitatively evaluates the effectiveness of security screening,
based on various operating parameters of the system, which may become decision variables in the safety
management process. The method uses the theory of fuzzy inference for overall assessment of the effectiveness
of three key ways to secure air transport: screening of passengers, hand baggage and checked baggage.
A method based on multi-criteria group decision making was adopted to validate the method
and the calculation tool developed, FASAS. The simulation experiments that were carried out allowed a
practical presentation of a method for increasing security levels over medium and long time horizons. The
method also answers the question of the quantitative effects of the decisions made. Application
of the fuzzy inference theory enables an effective calculation tool to be developed which is usable in
managerial practice.


1 INTRODUCTION


Every air journey is preceded by passenger and baggage
screening. Terrorist threats force airport
management to take measures to ensure the security
of passengers and personnel. These involve
considerable expenses and pose a serious organizational
challenge. Therefore, security control
becomes a significant part of the airport budget and
affects the functioning of the entire company. At
the same time, process management is difficult due
to the lack of proper supporting methods. This
applies particularly to evaluating the effects of the
measures taken in relation to the achievable security
levels.

In this study a quantitative method for evaluating
the effectiveness of the airport passenger and
baggage security screening system is proposed. It
is difficult to accurately describe this ill-defined
problem, so it often comes down to an intuitive or
a ‘trial and error’ approach. Our approach allows
us to formalize the expert knowledge and achieve
more objective results, and certainly makes it possible
to carry out a comparative analysis.

The method is based on fuzzy logic, more precisely
on fuzzy inference systems. The computer-aided
tool FASAS (Fuzzy Airport Security
Assessment System) enables practical support of
airport management in terms of security control.


1.1 Managing the system of an airport passenger
and baggage security screening


The security checks of passengers and baggage
in airports are regulated by an extensive regulatory
system (European Commission 2015). However,
compliance with the regulatory requirements
does not exclude the option of making individual
managerial decisions that can significantly affect
security, capacity or comfort of passengers. The
legislation in force does not give an indication of
how to practically organize the airport control system,
which includes not only the physical activities
visible to the passengers, but also a series of
infrastructural, personal and procedural actions,
requiring expenses relevant to the scale of passenger
traffic.

There are usually several key issues found in
the passenger and baggage security screening
management.

1. Choosing the number of Security Control Areas (SCA).
2. Having a proper number of staff to perform their duties.
3. Selection of SCA equipment.
4. SCA organization.
5. Dynamic modification of system operating parameters.
6. Operational management of the system.

The airports mostly function as economic agents
and attempt to achieve a positive financial result, 
which in turn is a determinant of the investments 
planned for airports. However, the managers respon- 
sible for planning investments (in security equipment 
or training, for example) would like to know the 
measurable effects of such actions. This also applies 
to the comparison of several alternative investment 
decisions. The lack of quantitative methods makes it 
impossible to evaluate their actual effects. 

In the literature on airport security management,
an important issue is the scope of control
operations and its effect on airport capacity
(Hainen et al. 2013, Butler & Poole 2002, Leone &
Liu 2005, Van Boekhold et al. 2014, Kierzkowski
& Kisiel 2015) and on passenger comfort and
satisfaction (Alards-Tomalin et al. 2014, Benda 2015,
Gkritza et al. 2006, Sakano et al. 2016). Increasing
the scope of control operations obviously requires
increased expenditures, which are not always
reasonable (Stewart 2010, Stewart & Mueller 2014,
2015, Gerstenfeld & Berger 2011, Gillen & Morrison
2015, Prentice 2015).


1.2 Concept of the study 


This study assumes the perspective of airport
management and, more precisely, of the person in
charge of passenger and baggage screening who
undertakes medium- and long-term upgrade actions.
The method developed will meet the proactivity 
criterion, which means that it will allow planning 
of actions before security risks are present. We 
will seek an action strategy that will increase the 
effectiveness of control without compromising the 
capacity, assuming that funds to pay for the actions 
are available. 

This study completes our previous works,
in which we proposed methods for the evaluation of
hand baggage (Skorupski & Uchronski 2015a),
checked baggage (Skorupski & Uchronski 2015b)
and passenger (Skorupski & Uchronski 2016a)
screening systems. All those methods are based
on fuzzy inference systems. The objective of this
paper is to integrate them in a hierarchical structure
and to demonstrate that the resulting calculation
tool FASAS can be used for operational management
of an airport security screening system.

The remaining part of the paper is organized as 
follows. Section 2 gives a brief overview of theo- 
retical grounds and presents an integrated fuzzy 
inference system which is the result of the study. 
Section 3 provides an analysis of the passenger 
and baggage security screening system at Katowice 
International Airport (ICAO code: EPKT, IATA 
code: KTW). Section 3.2 analyses several scenarios 
and determines quantitatively the effectiveness of 
the control relative to the upgrade actions taken in 


the medium-time horizon. Section 4 contains the 
summary and conclusions. 


2 FUZZY SETS IN ASSESSING THE 
EFFECTIVENESS OF SECURITY 
SCREENING IN AIR TRANSPORT 


2.1 Uncertainty and subjectivity in making 
decisions in aviation security screening 
systems 


In technical activities, including transport
processes, the information available often tends to be
inaccurate and incomplete. In such cases, decisions
are made under uncertainty. There are many types
of uncertainty, and various mathematical methods
and ways to reduce its adverse effects on the
decisions being made.

In airport passenger and baggage screening sys- 
tems, the knowledge of the effectiveness of screen- 
ing cannot be obtained from measurements. It is 
necessary to acquire the knowledge from experts 
in the field. They usually use informal language 
and their knowledge is expressed inaccurately and 
in approximation. This is therefore a typical exam- 
ple of acting in linguistic uncertainty conditions. 
For that reason, fuzzy inference systems based on 
fuzzy logic were used in this study (Siler & Buckley 
2005). 


2.2 General information on fuzzy inference 
systems 


A fuzzy set is understood as a set of the form

$$A = \{(x, \mu_A(x)) : x \in X,\ \mu_A(x) \in [0, 1]\}$$

where $\mu_A$ is the membership function of this set.
Every object can belong to a fuzzy set to a certain
degree. The degree of membership of an element in a
fuzzy set is determined by the membership function.
Membership functions can have various shapes,
depending on the case.
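
A membership function can be sketched in a few lines of code. The example below is only an illustration: the triangular shape, the 1-5 rating scale and the name `triangular` are assumptions made here, not the functions actually used in FASAS.

```python
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Degree of membership of x in a triangular fuzzy set.

    a and c are the feet of the triangle (membership 0), b is the peak
    (membership 1). Requires a < b < c.
    """
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)   # rising edge
    return (c - x) / (c - b)       # falling edge


# Hypothetical fuzzy set "medium" on an assumed 1-5 rating scale.
def medium(x: float) -> float:
    return triangular(x, 2.0, 3.0, 4.0)


for rating in (3.0, 2.5, 4.5):
    print(rating, medium(rating))
```

A rating of 3.0 belongs fully to the set, 2.5 belongs to it to degree 0.5, and 4.5 does not belong to it at all.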

A linguistic variable refers to a variable whose 
values are words or sentences in a natural or arti- 
ficial language. Such words or sentences are called 
the linguistic values of a linguistic variable. In for- 
mal terms, a linguistic variable can be defined as 
the five-tuple (Czogała & Pedrycz 1980): 


$$\langle L, T, X, G, M \rangle \qquad (1)$$


where:

L - linguistic variable name,

T - a set of syntactically correct values of the
linguistic variable L,

X - consideration space of the linguistic variable L,

G - syntax of the linguistic variable,

M - semantics of the linguistic variable, determined
by a set of algorithms that assign a certain fuzzy
set A to each value of the linguistic variable.


Figure 1. General scheme of the fuzzy model of an
airport passenger and baggage screening system.


2.3 General structure of the fuzzy model Screening
system evaluation


The general structure of the fuzzy model for
evaluating the effectiveness of the passenger and
baggage screening system is shown in Fig. 1. The input
variables of the Screening system evaluation model
are:


$z_h$ - effectiveness of hand baggage screening
(linguistic variable Hand baggage),

$z_c$ - effectiveness of checked baggage screening
(linguistic variable Checked baggage),

$z_p$ - effectiveness of passenger screening (linguistic
variable Passenger screening).

All the input variables are the outputs of other local
models. They are presented in detail in the
following sections.


2.4 Input variable hand baggage 


Hand baggage screening has two aspects. The first 
one of them is scanning luggage with X-rays. The 
second one is manual inspection carried out by the 
security screening operator (SSO). Three input 
variables correspond to those two aspects (Sko- 
rupski & Uchronski 2015a). Two of them—Device 
evaluation (y,) and Type A Errors (x,,)—are related 
to the X-ray scanning of cabin baggage. 

The Device evaluation linguistic variable expresses
the impact of the technical factor on the possibility
of effectively detecting a prohibited article in
baggage (Skorupski & Uchronski 2016b). The
parameters taken into account are: detectability
of different materials, presence and efficiency of
Threat Image Projection (TIP), the number of
detection lines used, and the age of the device.


The other linguistic variable, Type A Errors,
describes the actual skill of the SSO in using the
X-ray device smoothly and efficiently. A type A error
is a failure to indicate a virtual prohibited item
that the TIP system has superimposed on the image
of the scanned baggage.

The third variable used in the model is Manual 
Inspection (y,,). It is used to describe the efficiency 
of manual inspection carried out for some cabin 
baggage. We assumed that the quality of inspec- 
tion depends on linguistic variable Employee 
Evaluation (y,). It is, in turn, dependent on the 
experience of SSO, the amount of time since the 
last comprehensive or current training they have 
undergone, and their overall attitude to work they 
perform (Skorupski & Uchronski 2015c). On the
other hand, the quantity of baggage subjected to
manual inspection also has an impact on the
efficiency of such screening. We have therefore broken
down the concept of manual inspection into two
linguistic variables: Number of type B manual
inspections ($x_B$) and Number of type C manual
inspections ($x_C$).


2.5 Input variable checked baggage 


Evaluation of the effectiveness of checked bag- 
gage screening depends on two factors: efficiency 
of X-ray equipment and efficiency of the checks 
performed at SCA, particularly with participation 
of SSOs (Skorupski & Uchronski 2015b). So the 
local model Checked baggage has two input vari- 
ables—Device’s assessment (y,) explained in Sec- 
tion 2.4 and SCO control (y,). The latter depends 
on the employee evaluation, type A errors (vari- 
ables Employee Evaluation and Type A Errors— 
Section 2.4) and an important variable Control 
organisation option (x,). This variable describes the 
checked baggage screening organisation. 


2.6 Input variable passenger screening 


The Passenger screening variable ($z_p$) depends
on three input variables: WTMD's evaluation
($y_{WTMD}$), Frequency of manual control and
Manual Control. This fuzzy inference system
has a hierarchical structure, as WTMD's
evaluation and Manual control are outputs of local
fuzzy models (Skorupski & Uchronski 2016a).

The evaluation of the effectiveness of a
Walk-Through Metal Detector (WTMD) in passenger
security control depends on: the number of
detection areas, the ability to set sensitivity in different
detection areas, visualisation of the detection areas,
and the system supporting manual control.

The input variable Frequency of manual control
is functionally dependent on two decision
variables: WTMD's sensitivity and Frequency of
additional controls. The former may be set
directly at the WTMD, while the latter is set
arbitrarily, based on the applicable regulations and
the current threat level for the given airport.

The Manual control variable is in fact an output
of a local fuzzy inference model. The inputs
of the model are the linguistic variables Employee
number and Employee evaluation; the latter is
itself the output of a local inference model
(Section 2.4).
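
The hierarchy described in Sections 2.3 to 2.6 can be written down as a small dependency graph. The sketch below is an illustration only: the dictionary layout and the helper names `evaluation_order` and `raw_inputs` are assumptions made here and say nothing about how FASAS is implemented. Variable names are taken from the text; the inputs of Employee evaluation follow the rows of Table 2.

```python
# Each key is the output of a local fuzzy model; the value lists its inputs.
MODEL_INPUTS: dict[str, list[str]] = {
    "Screening system evaluation": ["Hand baggage", "Checked baggage", "Passenger screening"],
    "Hand baggage": ["Device evaluation", "Type A errors", "Manual inspection"],
    "Manual inspection": ["Employee evaluation",
                          "Number of type B manual inspections",
                          "Number of type C manual inspections"],
    "Checked baggage": ["Device evaluation", "SCO control"],
    "SCO control": ["Employee evaluation", "Type A errors", "Control organisation option"],
    "Passenger screening": ["WTMD evaluation", "Frequency of manual control", "Manual control"],
    "Frequency of manual control": ["WTMD sensitivity", "Frequency of additional controls"],
    "Manual control": ["Employee number", "Employee evaluation"],
    "Employee evaluation": ["Experience", "Comprehensive training", "Ongoing training", "Attitude"],
    "Device evaluation": ["Detectability", "TIP presence and efficiency",
                          "Number of detection lines", "Age"],
    "WTMD evaluation": ["Number of detection areas", "Sensitivity setting per area",
                        "Visualisation of detection areas", "Manual control support"],
}


def evaluation_order(output: str, graph: dict[str, list[str]]) -> list[str]:
    """Return local models in an order where each follows the models it depends on."""
    order: list[str] = []

    def visit(name: str) -> None:
        if name not in graph or name in order:
            return  # raw input, or model already scheduled
        for child in graph[name]:
            visit(child)
        order.append(name)

    visit(output)
    return order


def raw_inputs(graph: dict[str, list[str]]) -> set[str]:
    """Variables that are not produced by any local model (decision/measured inputs)."""
    return {child for children in graph.values() for child in children if child not in graph}


print(evaluation_order("Screening system evaluation", MODEL_INPUTS))
print(sorted(raw_inputs(MODEL_INPUTS)))
```

Listing the models in dependency order makes explicit that a shared local model such as Employee evaluation is computed once and reused by several higher-level models.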


2.7 Output variable screening system evaluation 


A general evaluation of the effectiveness of a
security screening system at an airport is described by
the linguistic variable Screening system evaluation,
which is the output of the model. Its membership
functions are the same as those presented in Fig. 2.

A group of experts were asked to define the 
fuzzy inference rules. They are practitioners with 
extensive experience in airport security manage- 
ment. The following problem has been brought 
before them. If we assume that the purpose of the 
whole security system at the airport is a safe flight 
(in which there is no explosion, hijacking or an 
assault on another passenger), which of these three 
types of screening is the most important to achieve 
this objective? The knowledge base consists of 125 
fuzzy rules. Some of them are presented in Table 1. 
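
The size of the knowledge base follows from three input variables with five linguistic values each: 5 × 5 × 5 = 125 rules. The sketch below shows how rules of this kind can be fired, using only the five rules listed in Table 1. It is a simplified illustration under stated assumptions: triangular membership functions with peaks at 1 to 5, `min` as the AND operator, and a weighted average of conclusion peaks as defuzzification are choices made here, not necessarily those used in FASAS.

```python
LEVELS = ["Very low", "Low", "Average", "High", "Very high"]
PEAK = {name: float(i + 1) for i, name in enumerate(LEVELS)}  # assumed 1-5 scale


def membership(level: str, x: float) -> float:
    """Assumed triangular membership: 1 at the level's peak, 0 one unit away."""
    return max(0.0, 1.0 - abs(x - PEAK[level]))


# (Passenger screening, Hand baggage, Checked baggage) -> Screening system evaluation
RULES = [
    (("Low", "Low", "Very low"), "Very low"),            # rule 6
    (("Average", "High", "Very low"), "Low"),            # rule 17
    (("Very high", "Very low", "Average"), "Average"),   # rule 54
    (("High", "High", "High"), "High"),                  # rule 93
    (("Very high", "High", "Very high"), "Very high"),   # rule 120
]


def evaluate(passenger: float, hand: float, checked: float):
    """Fire every rule and return the weighted average of the conclusion peaks.

    Returns None when no rule in this partial rule base fires.
    """
    inputs = (passenger, hand, checked)
    numerator = denominator = 0.0
    for premise, conclusion in RULES:
        strength = min(membership(level, x) for level, x in zip(premise, inputs))
        numerator += strength * PEAK[conclusion]
        denominator += strength
    return numerator / denominator if denominator > 0 else None


print(len(LEVELS) ** 3)           # size of a complete rule base
print(evaluate(4.0, 4.0, 4.0))    # only rule 93 fires
print(evaluate(4.5, 4.0, 4.5))    # rules 93 and 120 fire equally
```

With the complete set of 125 rules every combination of inputs activates at least one rule, so the `None` case does not arise.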


2.8 Validation of the model Screening system 
evaluation 


To validate the rules of the model and to
remove any inconsistencies, the method described
in (Skorupski 2015) was used. It consists in using
expert opinions expressed in terms of multiple
criteria, in the form of both numerical and linguistic
assessments. Experts define the conclusions of rules
as so-called half-marks in order to increase the
method's flexibility. Automatically generated rules
are compared with the rules provided by experts
in order to detect inconsistencies. Thanks to the
new concept of half-marks, it is possible to perform
not only the verification, but also automatic
selection of the final form of a conclusion.


Figure 2. Membership functions of the linguistic
variables: Passenger screening, Hand baggage, Checked
baggage, Screening system evaluation.


Table 1. Fuzzy inference rules for the model Screening
system.

Rule   Passenger    Hand        Checked     Screening system
       screening    baggage     baggage     evaluation
6      Low          Low         Very low    Very low
17     Average      High        Very low    Low
54     Very high    Very low    Average     Average
93     High         High        High        High
120    Very high    High        Very high   Very high


3 EVALUATION OF THE EFFECTIVENESS 
OF AN AIRPORT PASSENGER AND 
BAGGAGE SECURITY SCREENING 
SYSTEM 


This section describes the experiments with the use 
of the method. At first, the reference variant will 
be examined, and then the variants presenting the 
benefits of various actions: extending the scope 
of training, equipment change and organisational 
activities. 


3.1 Example of a security screening system 
effectiveness assessment in KTW airport 


This section presents the reference variant,
which corresponds approximately to the situation
at KTW in 2014. The starting point may differ
between airports; however, the solutions adopted
here constitute a kind of standard that can be
assumed to be a typical situation. The required
parameters of the basic scenario $S_0$ are compiled
in Table 2.

Simulation analysis carried out for typical values 
of parameters listed in Table 2, using the FASAS 
tool, provided the results as given in Table 3. 

The final, defuzzified rating value of the entire sys- 
tem is 2.95, which corresponds to the value medium. 
This is in line with the expectations and typical for 
most airports. Meeting the regulatory requirements 
and acting in typical threat levels causes the system 
to be configured so as to achieve the average effec- 
tiveness in detection of prohibited items. 
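
The relation between a defuzzified rating and its linguistic rating can be sketched as a simple lookup. The thresholds below are an assumption inferred from the ratings reported in the result tables of this paper (for example 2.95 as medium, 3.16 as >medium, 2.30 as low/medium); the paper does not state the exact rule, and `linguistic_rating` is a hypothetical helper, not part of FASAS.

```python
LABELS = {1: "very low", 2: "low", 3: "medium", 4: "high", 5: "very high"}


def linguistic_rating(score: float) -> str:
    """Map a defuzzified rating on the 1-5 scale to a linguistic rating.

    Thresholds (in hundredths above the lower integer) are assumptions
    chosen to reproduce the ratings reported in the paper's tables.
    """
    hundredths = round(score * 100)          # work in integers to avoid float noise
    base, frac = divmod(hundredths, 100)
    if frac <= 5:
        return LABELS[base]                  # practically on the label
    if frac <= 25:
        return ">" + LABELS[base]            # slightly above the label
    if frac <= 75:
        return f"{LABELS[base]}/{LABELS[base + 1]}"   # between two labels
    if frac < 95:
        return "<" + LABELS[base + 1]        # slightly below the next label
    return LABELS[base + 1]


for score in (2.95, 3.16, 2.30, 2.9, 3.75, 4.41):
    print(score, linguistic_rating(score))
```

Under these assumed thresholds the final rating 2.95 falls within 0.05 of 3 and is therefore reported simply as medium.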


3.2 Analysing the possibilities of influencing the 
effectiveness of airport security screening 


This section analyses the effects of implementing 
some measures to improve the security screening 
effectiveness: shortening training intervals, partial 
or complete replacement of equipment, greater 
emphasis on the system condition monitoring. 

Table 2. Input parameters of scenario $S_0$.

Parameter                      Passenger     Hand baggage   Checked baggage
                               screening     screening      screening
Security screening operators
  Experience [mth]             36            36             36
  Comprehensive training [mth] 18            18             18
  Ongoing training [mth]       2             2              2
  Attitude                     2 (average)   2 (average)    2 (average)
  Type A errors                -             15%            20%
X-ray equipment
  Age [yrs]                    -             9              5
  Number of TIPs               -             6000           0
  TIP frequency                -             2.8%           0
  Number of lines              -             1              2
  Detectability                -             6              6
Walk-Through Metal Detectors
  Visualisation                20            -              -
  Detection areas              20            -              -
  Sensitivity [g]              165           -              -
  Support                      1 (yes)       -              -
Other parameters
  Organisation variant         -             -              4
  Number of employees          4             -              -
  Additional checks            13%           -              -
  Checks B                     -             9%             -
  Checks C                     -             0%             -


Table 3. Evaluation of the passenger and baggage
screening system in the scenario $S_0$.

System                      Defuzzified rating   Linguistic rating
Passenger screening         3.16                 >medium
Hand baggage screening      2.30                 low/medium
Checked baggage screening   2.12                 low/medium
Total                       2.95                 medium


3.2.1 Greater emphasis on training and staff awareness
As a standard, comprehensive training is given
every 3 years, so the average time since the last
comprehensive training is 18 months. In the
simulation experiment described below (scenario
$S_1$) it was assumed that the training intervals
were reduced to 6 months, which means that the
mean time, and therefore the value of the
Comprehensive training variable (for all three
systems), is 3 months.
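
The step from the training interval to the mean time since the last training can be made explicit. Assuming (an assumption made here, not stated in the paper) that the moment of assessment is uniformly distributed over a training cycle of length $I$, the time $T$ since the last comprehensive training has the mean value:

```latex
\mathbb{E}[T] = \frac{1}{I}\int_{0}^{I} t \, dt = \frac{I}{2},
\qquad
I = 36~\text{months} \;\Rightarrow\; \mathbb{E}[T] = 18~\text{months},
\qquad
I = 6~\text{months} \;\Rightarrow\; \mathbb{E}[T] = 3~\text{months}.
```

This reproduces both values used as inputs: 18 months for the standard 3-year cycle and 3 months for the shortened 6-month cycle.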

Shortening the training intervals also has a
positive effect on the number of type A errors made
by the SSOs. Measurements using the software
available in the TIP system showed that
comprehensive training decreases the number of
type A errors on average by 3 percentage points for
hand baggage screening operators, and on average
by 8 percentage points for those screening checked
baggage.

The results of evaluation of the passenger and 
baggage screening system in the scenario S, are 
given in Table 4. 

As can be seen, shortening the training cycles does
not give a satisfactory improvement of the airport
security screening system: the grade is 2.98 vs. 2.95
for the standard training system. This is
understandable, since already for the initial parameters
the staff ratings in all subsystems are at the very
high level and there is not much room for improvement.


3.2.2 Equipment change 
The specifications of the equipment used at
Katowice Airport, given in Table 2, indicate that it
is not very advanced. The evaluation made with
the FASAS system shows that the equipment
ratings vary between low and medium.

The effects of the decision to replace the X-ray 
equipment with more advanced devices will be 
analysed. Let Scenario S,,, denote replacing half of 
the equipment and Scenario S,, denote replacing 
all equipment. 

A part of input data that has been changed in 
relation to the reference variant is shown in Table 5. 

The simulation experiments carried out indicate 
a noticeable improvement in the assessments of 
those systems which had been provided with new 
equipment, and also the overall rating of the entire 
screening system. The results are listed in Table 6. 

As can be noted, the effect of an upgrade
activity is considerable. This is particularly true
for the replacement of all equipment. However,
attention should be drawn to the quite considerable
cost of such a project.


Table 4. Evaluation of the passenger and baggage
screening system in the scenario $S_1$.

System                      Defuzzified rating   Linguistic rating
Passenger screening         3.24                 >medium
Hand baggage screening      2.51                 low/medium
Checked baggage screening   2.72                 low/medium
Total                       2.98                 medium


Table 5. Input data for the equipment upgrade decision
(equipment replacement scenarios).

Parameter          Passenger     Hand baggage   Checked baggage
                   screening     screening      screening
Age [yrs]          -             0              0
Number of TIPs     -             6000           1000
TIP frequency      -             2.8%           2.8%
Number of lines    -             4              2
Detectability      -             10             10


Table 6. Evaluation of the passenger and baggage
screening system in the equipment replacement scenarios.

                            Half of equipment replaced   All equipment replaced
System                      Defuzzified   Linguistic     Defuzzified   Linguistic
                            rating        rating         rating        rating
Passenger screening         3.16          >medium        3.16          >medium
Hand baggage screening      2.9           <medium        3.67          medium/high
Checked baggage screening   3.68          medium/high    4.0           high
Total                       3.25          >medium        3.75          medium/high


3.2.3 Combined activity: equipment replacement and
changing the training system
The results of the experiment discussed in
Section 3.2.1 indicate that shortening training
intervals for employees working on low standard
equipment does not give the desired outcome. Let
us consider Scenario $S_3$, in which this solution is
implemented as the second stage of the upgrade,
after upgrading the equipment. Table 7 presents
the experiment results.

In this case, further improvement of grades to
high is noticeable. It is worth mentioning here that
this does not exhaust the possibilities of improving
the effects of security screening system management,
as the changes in equipment did not affect the
passenger screening system at all. There is further
potential for growth in it.


Table 7. Evaluation of the passenger and baggage
screening system in the scenario $S_3$.

System                      Defuzzified rating   Linguistic rating
Passenger screening         3.24                 >medium
Hand baggage screening      4.0                  high
Checked baggage screening   4.0                  high
Total                       4.0                  high


3.2.4 Emphasis on the diagnostics of personnel
status
While the role of ongoing diagnostics of a
facility's condition is considered very important when
analysing the reliability of socio-technical systems, it
seems to be slightly underestimated when it comes to
diagnosing the personnel status.


Table 8. Evaluation of the passenger and baggage
screening system in the scenario $S_4$.

System                      Defuzzified rating   Linguistic rating
Passenger screening         3.24                 >medium
Hand baggage screening      3.[illegible]        medium/high
Checked baggage screening   [illegible]          medium/high
Total                       3.66                 medium/high


The task of diagnosing personnel status in a
security screening system is performed in many
ways. One of them is the use of the TIP system.
The effects of abandoning that system were
analysed (Scenario $S_4$). This is a reasonable
option in practice for little-used passages. Table 8
presents the results of Scenario $S_4$.

As can be seen, abandoning the TIP system
lowers the grades to somewhere between medium
and high. The extent of the reduction can be even
greater, since an SSO who is aware of not being
evaluated may work carelessly and take a very
light-hearted approach, investigating minor doubtful
issues less carefully or not at all. These issues will
be the subject of further studies.

In contrast to abandoning the TIP system, let
us consider how the effectiveness of the screening
system changes if more stringent standards are
established, for example that an SSO is to make no
more than 10% of type A errors (Scenario $S_5$).

The simulation experiment shows that imposing
such high requirements does not result in further
improvement in the effectiveness of the screening
system. Reducing the type A error rate to
12% proves to be sufficient to achieve the best
assessment of a worker. It is impossible to
completely eliminate errors in recognising images
of prohibited items, and moreover, the parameter
discussed is an important, but not the only, factor
describing the quality of screening.

The last scenario (S,) to be considered is the 
influence on the alertness (suspicion) of the SSOs in 
relation to the persons and baggage being checked. 
In the study (Skorupski & Uchronski 2015a) it is 
referred to as the B type checks. 

Simulation experiment for Scenario S, was per- 
formed by modifying the input data for the Sce- 
nario S, in such a manner that we increased the 
number of B type checks from 9% (as observed and 
measured in real work conditions at KTW) to 15%. 

The results of this experiment show an overall
effectiveness grade of 4.41, which corresponds to
somewhere between the high and very high rating.
The result is interesting because it represents a
relatively large increase achieved by ‘soft’ actions
that influence the psychology of the SSOs.
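
The overall grades reported for the individual scenarios can be set side by side to compare the upgrade actions. The snippet below is only a convenience sketch: the total ratings are those quoted in Section 3, while the descriptive scenario names and the helper `gains` are introduced here for illustration.

```python
BASELINE = 2.95  # total rating of the reference variant

# Total defuzzified ratings reported in Section 3, keyed by a short description.
TOTALS = {
    "shorter training intervals": 2.98,
    "half of X-ray equipment replaced": 3.25,
    "all X-ray equipment replaced": 3.75,
    "equipment replaced, then shorter training intervals": 4.0,
    "TIP system abandoned": 3.66,
    "type B checks raised from 9% to 15%": 4.41,
}


def gains(totals: dict[str, float], baseline: float) -> dict[str, float]:
    """Difference between each scenario's total rating and the reference rating."""
    return {name: round(total - baseline, 2) for name, total in totals.items()}


# Print scenarios from the largest to the smallest change against the reference.
for name, gain in sorted(gains(TOTALS, BASELINE).items(), key=lambda kv: -kv[1]):
    print(f"{gain:+.2f}  {name}")
```

Some scenarios modify an earlier scenario rather than the reference variant, so the differences show where each configuration ends up relative to the reference and should not be read as additive effects of single actions.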


4 SUMMARY AND CONCLUSIONS 


In many areas the priority of security is declared as
the governing rule. These areas include transport,
particularly air transport. Surprisingly, however,
the introduction of innovations intended to increase
security levels is often delayed in those areas and
is forced mainly by effective legislative procedures
and enforcement methods. In our opinion, one of
the major causes is the lack of tools providing
an objective and quantitative assessment of the
effects of implementing a given solution.

The method developed, together with the
FASAS computer tool, enables quantitative
assessment of the effects of various managerial
decisions on the effectiveness of an airport security
screening system. It was found that, for
personnel training to bring a positive effect, the
equipment needed an upgrade first. Of course,
equipment replacement alone also contributes to
improved system effectiveness. The reverse approach
proves to be less effective.

An important observation concerns the role of
personnel diagnostics. While monitoring of the
condition of technical equipment is quite common,
ongoing diagnostics of the human factor is rare.
Perhaps this is because there are no proper methods
and tools to do so. An example of such a tool in
airport security screening is the TIP system. It is
not always possible to use a similar solution in other
areas; however, such attempts should be made and
the role of diagnostics in the management process
should be increased.

Further works, including multi-criteria analyses,
are required that take into account both overall
security and capacity maximisation. In general,
activities in an average risk situation should focus
on increasing security levels without compromising
capacity. Regrettably, most of the possible solutions
involve great cost and are time-consuming.

To conclude, simulation experiments indicate
the importance of quantitative methods for
evaluating the effectiveness of airport security control
systems. The lack of such methods makes it
difficult to justify the implementation of innovations.
A commonly observed attitude is that if one's
activity complies with the standards and regulatory
requirements, and a possible innovation is
expensive, there is not much sense in introducing it. A
hypothetical improvement in the security
level is difficult to prove. However, quantitative
methods similar to the one presented here make it
possible to compare the results of various possible
innovative activities, which is critical for sound
management.


REFERENCES 


Alards-Tomalin, D., Ansons, T.L., Reich, T.C., Sakamoto, 
Y., Davie, R., Leboe-McGowan, J.P, Leboe- 
McGowan, L.C., 2014. Airport security measures and 
their influence on enplanement intentions: Responses 
from leisure travelers attending a Canadian University. 
Journal of Air Transport Management 37, 60-68. 

Benda, P., 2015. Commentary: Harnessing advanced 
technology and process innovations to enhance avia- 
tion security. Journal of Air Transport Management 
48, 23-25. 

Butler, V., Poole, R.W., 2002. Rethinking Checked Bag- 
gage Screening. Policy Study, Reason Public Policy 
Institute 297. 

Czogała, E., Pedrycz, W., 1980. Elements and methods of
fuzzy sets theory, Silesian University of Technology,
Gliwice (in Polish).

European Commission, 2015. Commission Implement- 
ing Regulation (EU) 2015/1998 of 5 November 2015 
laying down detailed measures for the implementation 
of the common basic standards on aviation security. 

Gerstenfeld, A., Berger, P.D., 2011. A decision-analysis 
approach for optimal airport security. International 
Journal of Critical Infrastructure Protection 4, 14-21. 

Gillen, D., Morrison, W.G., 2015. Aviation security: 
Costing, pricing, finance and performance. Journal of 
Air Transport Management 48, 1-12.

Gkritza, K., Niemeier, D., Mannering, F., 2006. Airport 
security screening and changing passenger satisfac- 
tion: An exploratory assessment. Journal of Air Trans- 
port Management 12, 213-219. 

Hainen, A.M., Remias, S.M., Bullock, D.M., Manner- 
ing, F.L., 2013. A hazard-based analysis of airport 
security transit times. Journal of Air Transport Man- 
agement 32, 32-38. 

Kierzkowski, A., Kisiel, T., 2015. An impact of the
operators and passengers behavior on the airport's security
screening reliability, in: Nowakowski, T. (ed.), Safety
and Reliability: Methodology and Applications: 2345-2354,
CRC Press/Taylor & Francis/Balkema.




Leone, K., Liu, R., 2005. The key design parameters of
checked baggage security screening systems in airports.
Journal of Air Transport Management 11, 69-78.

Prentice, B.E., 2015. Canadian airport security: The
privatization of a public good. Journal of Air Transport
Management 48, 52-59.

Sakano, R., Obeng, K., Fuller, K., 2016. Airport security 
and screening satisfaction: A case study of U.S. Jour- 
nal of Air Transport Management 55, 129-138. 

Skorupski, J., 2015. Automatic verification of a knowl- 
edge base by using a multi-criteria group evaluation 
with application to security screening at an airport. 
Knowledge-Based Systems, 85, 170-180. 

Skorupski, J., Uchronski, P., 2015a. A fuzzy reasoning 
system for evaluating the efficiency of cabin baggage 
screening at airports. Transportation Research Part C: 
Emerging Technologies, 54, 157-175. 

Skorupski, J., Uchronski, P., 2015b. Fuzzy inference 
system for the efficiency assessment of hold baggage 
security control at the airport. Safety Science, 79, 
314-323. 

Skorupski, J., Uchronski, P., 2015c. A fuzzy model for 
evaluating airport security screeners’ work, Journal of 
Air Transport Management 48, 42-51. 


Skorupski, J., Uchronski, P., 2016a. Managing the proc- 
ess of passenger security control at an airport using 
the fuzzy inference system, Expert Systems with Appli- 
cations, 54, 284-293. 

Skorupski, J., Uchronski, P., 2016b. A fuzzy system to 
support the configuration of baggage screening 
devices at an airport, Expert Systems with Applica- 
tions, 44, 114-125. 

Stewart, M.G., 2010. Risk-informed decision support for 
assessing the costs and benefits of counter-terrorism 
protective measures for infrastructure, International 
Journal of Critical Infrastructure Protection 3 (1), 
29-40. 

Stewart, M.G., Mueller, J., 2014. Cost-benefit analysis of 
airport security: Are airports too safe? Journal of Air 
Transport Management 35, 19-28. 

Stewart, M.G., Mueller, J., 2015. Responsible policy 
analysis in aviation security with an evaluation of 
PreCheck. Journal of Air Transport Management 48, 
13-22. 

Van Boekhold, J., Faghri, A., Li, M., 2014. Evaluating 
security screening checkpoints for domestic flights 
using a general microscopic simulation model. Journal 
of Transportation Security 7, 45-67.




Safety and Reliability - Safe Societies in a Changing World — Haugen et al. (Eds) 
© 2018 Taylor & Francis Group, London, ISBN 978-0-8153-8682-7 


Information power supporting the rail systems safety 
Czech Technical University in Prague, Praha, Czech Republic


ABSTRACT: Railway systems are complex systems interconnected with other concurrent or
superordinate systems by physical, information, territorial and logical linkages, i.e. they form a so-called
system of systems. These linkages cause vulnerabilities of the system of systems, which directly
limit the level of its safety. Railway systems are also complex cyber-physical systems,
because they use an information system (cyber space) that manages the physical system and that is
itself influenced by the physical system. The cyber (information) domain is very important because it ensures
the semiautomatic and automatic control of the system technology under normal, abnormal and even
critical conditions. It is therefore very important to take care of information power, which
is defined as the rate of information effectiveness of systems. Information effectiveness is the
key aspect of coping with risks connected with interconnections in the management of complex systems,
and it is the main prerequisite for ensuring the cyber security of such systems when building up their
safety. The present work focuses on information power: its definition, related issues, and how to manage
and improve this system property. It also provides two examples of selected railway accidents, on which
the information power issues are demonstrated. Using a multicriteria approach, the main criteria
for the size of information power are proposed.


1 INTRODUCTION 


Today's society is dependent on technical and
cybernetic systems that help to satisfy its basic
needs. To function properly, these systems need
correct and timely information from the real physical
environment. For this purpose, information and
communication systems and their technologies are
used; they create links between systems of different
kinds and are able to process information according
to given rules faster than a person.

The stronger the human effort to improve and
streamline processes for higher economic benefit,
the stronger the dependency of human society on
information technologies, and hence the incessant
need for their development. By improving and
streamlining processes in pursuit of economic
benefits, we introduce new links, i.e. dependences,
and thereby create more complex systems, which
are more vulnerable. This vulnerability leads to
their failures in critical conditions, which in
many cases have an impact on human security, the
provision of basic human needs and the main
functions of states. Therefore, in the field of
information technologies we also speak of critical
information infrastructure, which, however, is
rightfully interconnected with other technologies.
A critical cyber-physical system arises by linking
these elements.

To ensure human security, we need to ensure
a secure and, in many cases, safe cyber-physical
system. To formulate appropriate principles for
ensuring secure and safe cyber-physical systems,
it is necessary to use information theory, and
especially to determine the parameters that affect
information power. Information power is the
quantity whose size determines the quality of a
decision: the higher the information power, the
higher the probability of a quality decision, and
vice versa.

Transportation infrastructure and rail systems 
are no different; they are complex systems of systems, 
and as cyber-physical systems they depend on 
information technologies, which contain a number of 
vulnerabilities that can lead to severe traffic accidents. 
Therefore, the work presents a selected part of the 
theory of information related to information 
power. It analyzes two examples of railway accidents 
and demonstrates the role of information 
power in them. The result of the work is a 
proposal of criteria determining the size of information 
power for different places of the railway 
cyber-physical system. 


2 THE INFORMATION TECHNOLOGIES 
INFLUENCE ON RAIL SYSTEMS 
SAFETY 


We influence the level of railway safety by 
improving the qualitative parameters, namely 
transport speed, capacity and the amount of goods 
and people transported, RAM parameters (Reliability, 
Availability, Maintainability), Life Cycle 
Costs and interoperability. In some cases, improving 
the qualitative parameters improves safety, especially 
when safety depends on the reliability and availability 
of the performed function. In other cases, it impairs 
safety; e.g., an increase in speed together with 
shorter intervals between trains carrying a larger 
number of passengers leads to higher danger for people. 

Railway safety has a long-standing tradition 
in terms of technology, but there is still room for 
improvement in terms of human security. The level 
of safety is always limited by the operating conditions 
in which the system operates; if the conditions 
differ greatly from those for which the system 
is designed and certain limits are exceeded, 
the system gets into a dangerous state, i.e. a state 
in which it endangers itself and its surroundings, 
i.e. the surrounding systems, technology, the 
environment, economic links, the lives and health of 
people, and others (Kertis & Prochazkova 2017a, 
b, c). 

Safety of the railway system means ensuring 
that the rail system is able to work correctly in a wide 
range of conditions. If the range limits are exceeded, 
the system needs to recognize the change of conditions 
and pass to another mode of the safety 
management system, e.g. to apply measures and 
activities according to security plans, continuity 
plans, etc. At present, great emphasis is placed 
on the development of well-secured rail systems 
(ARTEMIS 2014, CEN-CENELEC 2017, EU 
2020a); these efforts focus mainly on a 
safe technological platform, which is very important 
in terms of safety, but does not solve the complex 
issue, i.e. human security. Integral safety 
focusing on human security is still mostly ignored 
in practice. The size of investment in safety and 
security is also determined by economic aspects 
(Prochazkova 2013, Kertis 2015, 2016, Kertis & 
Prochazkova 2016). 

Information systems based on information 
technologies are implemented in all the above-mentioned 
areas, i.e. ensuring the quality of railway 
transportation and the security and safety of railway 
systems. Information technologies interpret, help to 
manage or, in the case of automated operations, 
control all of these qualitative and safety parameters. 
Information systems and technologies are an 
integral part of rail systems. Table 1 shows examples 
of how information systems are used in different 
areas of railway transportation. Correctly chosen 
information system parameters ensure the size of 
their information power, i.e. the quality of information 
that enables the system to react effectively to 
possible unacceptable conditions. In this way they 
improve railway safety, not only under normal 
conditions, but also under abnormal and especially 
critical conditions. 


Table 1. Examples of information systems used in 
railway domains. 


Management and planning: 
- evaluation of data from operation, timetabling, 
- breakdown of staff services, 
- decision-making, economic, accounting, 
- communication with rescue forces and the police. 
Management and control of operation: 
- central supervision and management, dispatching, 
- station and track technologies, 
- data collection and processing on the train route, 
- communication between stationary and train systems, 
- protection devices. 
Train operation: 
- train control, vehicle control units, 
- data transmission among train devices, 
- tracking and control of train equipment (door, air 
conditioning, train radio, power equipment), 
- human-machine interfaces (HMI, 
driver-technologies). 
Passenger: 
- information light boards, 
- passenger check-in systems, 
- entertainment systems, Wi-Fi, 
- navigation systems: direction signs, navigation for the disabled. 


3 INFORMATION SYSTEMS AND 
TECHNOLOGIES, INFORMATION 
ORIGINATION PROCESS, 
INFORMATION POWER AND 
SECURITY 


The theory introduced below is based on knowledge 
in the domains of information systems, 
cyber-physical systems, complex systems and critical 
infrastructure (Moos & Malinovsky 2008, 
Prochazkova 2012, 2015, Novobilsky et al. 2016, Kertis 
& Prochazkova 2017b, c). 

Information, information systems and technologies 
cover a very wide domain that forms the couplings 
among systems. Alongside material, energy and 
financial resources, information is today ranked 
among the main factors that determine progress, 
not only in technology but also in other domains of 
human activity. Information flows make up the 
important linkages and couplings between the 
elements and whole systems in complex technological 
facilities. Without a certain level of information, 
it is not possible to create and manage processes 
of any nature, including those of human society. 
Information originates by monitoring certain 
properties of an observed object, or the common 
properties of a group of objects. Each information 
system follows the properties of an entity using a 
particular language, which serves to create 
information on the observed object. The following 
types of information systems are distinguished 
according to how the information is interpreted: 
syntactical information systems, which create the 
set of information images of state quantities of the 
object being observed; and process information 
systems, which represent the set of processes. An 
action information system then affects the observed 
object by feedback, or it creates a model or a real 
object. In railway operation management, action 
process information systems are mainly applied, and 
therefore our work focuses on them. The process of 
origin of information, of an information system, or 
of a new or modified object consists of sub-processes, 
i.e. sets of links and their relationships, which are 
described in Table 2. 


Table 2. Process of information origin. 

| No. | Process/set | Affected abstract nodes | Used information technologies | Process inputs | Process outputs |
|---|---|---|---|---|---|
| 1 | Object identification | Object, observer | Physical receptors (sensors) | Observed condition (physical) quantities of object | Signals |
| 2 | Observation statement | Observer, language (syntax) | Sampling, quantization, coding/encoding | Signals | Data |
| 3 | Communication between the source and receiver of message | Language (of observer resp. system of data acquisition), message receiver | Telecommunication, transmission and communication systems | Data | Data |
| 4 | Interpretation set, information origination | Language (of observer or system of data acquisition, or receiver), information set (see no. 6) | Ontology, language | Data | Information |
| 5 | Relationships of functions and structural arrangement of object, integrity verification | Information (see no. 6), the object | Actuator of system, action information system | Object, information | Information correctness, change of object |
| 6 | Information set in set of information systems | Information systems | Information systems | Information | Information |
| 7 | Interpretation process | Information (see no. 6), the new object | Signalizing and representation technology, artificial intelligence | Information | Image of object, new object |


The qualitative characteristics of information 
systems and technologies can be influenced by 
appropriate settings of their parameters, such as: 
quality of the process of producing information images 
(based on Frege’s concept); information amount 
(equation 1); parameters of the transmission matrix 
(equation 4). These parameters influence the size 
of information power (equations 5 and 6), and thus 
also the ability of the information system to deliver 
a quick and correct decision (equation 7). Frege’s 
functional concept of information image origination 
is composed of the sets: O, the set of rated quantities 
on the object; P, the set of states (observers); Φ, the 
set of syntactic strings (data flow); and I, the set of 
information images of state quantities. The relations 
among these sets, which determine the quality of the 
information image origination process, are described 
by the following parameters: 


- $\alpha_{OP}$: identification, 
- $\alpha_{PO}$: invasivity (danger of breaking the integrity of 
state variables on the object being observed), 
- $\alpha_{P\Phi}$: projection into a set of symbols and syntactical 
strings, 
- $\alpha_{\Phi P}$: uncertainty correction and identification, 
- $\alpha_{\Phi I}$: interpretation, information origin, 
- $\alpha_{I\Phi}$: reflection of language constructs, 
- $\alpha_{IO}$: relation of functions and structural regularity, 
- $\alpha_{OI}$: integrity verification. 


The information measure is mostly characterized 
by the Hartley measure; for a system of binary 
symbols (i.e. for most present cyber-physical 
systems) the information amount is expressed 
by the equation: 


$$I = \frac{1}{\ln 2}\,\ln N = \log_2 N \tag{1}$$


in which N is the number of possible reports (data): 

$$N = S^{n} \tag{2}$$


where “S” means the number of characters in the 
alphabet $A(A_1, A_2, \ldots, A_S)$ and “n” is the number 
of elements in the character string (report). 
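
A minimal sketch of equations (1) and (2), reading equation (1) as the Hartley measure in bits. The function name and the sample values are illustrative assumptions, not part of the cited theory.

```python
import math


def hartley_information(alphabet_size: int, message_length: int) -> float:
    """Information amount in bits: I = log2(N), with N = S**n.

    alphabet_size  -- S, number of characters in the alphabet
    message_length -- n, number of elements in the report
    """
    if alphabet_size < 1 or message_length < 0:
        raise ValueError("S must be >= 1 and n must be >= 0")
    # log2(S**n) = n * log2(S); this avoids building the huge integer S**n.
    return message_length * math.log2(alphabet_size)


# Hypothetical examples: an 8-symbol binary report and a 3-digit decimal report.
bits_binary = hartley_information(alphabet_size=2, message_length=8)
bits_decimal = hartley_information(alphabet_size=10, message_length=3)
print(bits_binary, bits_decimal)
```

For a binary alphabet the information amount equals the report length, which is why the binary case is the natural unit for most cyber-physical systems.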

The process information systems are characterized 
by graphs assigned to the relationships: 


$$I_i = F[P(t), \Phi(t)] \tag{3}$$


This assignment enables structural interpretation 
of complex information systems, and evaluation 
of feedbacks and of the quality of transmission and 
information processing in partial information 
systems. The description of an information segment 
is based on the matrix representation in the 
following form: 


$$\begin{pmatrix} I_2 \\ \Phi_2 \end{pmatrix} = \begin{pmatrix} t_{11} & t_{12} \\ t_{21} & t_{22} \end{pmatrix} \begin{pmatrix} I_1 \\ \Phi_1 \end{pmatrix} = [T_i] \begin{pmatrix} I_1 \\ \Phi_1 \end{pmatrix} \tag{4}$$


where $T_i$ is the transmission matrix of the i-th 
information segment (information systems segment). 
For a real system, it follows from equation (3) that 
the information or set of information $I_i$ is related 
to the set of system conditions and information 
flows in time. We can assign the information segment 
from equation (4) to a data acquisition system, 
where $I_1$ is the input (initial) information, $\Phi_1$ is the 
input information flow and, on the output side, $I_2$ is 
the output information and $\Phi_2$ is the transmitted 
information flow. The parameters $t_{11}, \ldots, t_{22}$ can be 
obtained in both quantitative and qualitative ways 
and, depending on the case, they express: 


- $t_{11}$: interpretation capability (for $t_{11} < 1$ the 
system has very small knowledge and interpretation 
capability, for $t_{11} = 1$ it is capable of interpreting 
the object properties in the information system, 
for $t_{11} > 1$ it is an expert system capable of 
creating its own information about the 
object on the basis of the obtained data), 
- $t_{12}$: filtration capability (for $t_{12} < 1$ the system 
interprets on its output a lower amount of 
information than it obtained in its input information 
flow, for $t_{12} > 1$ vice versa), 
- $t_{21}$: communicativeness (capability to provide 
an output information flow on the basis of input 
information), 
- $t_{22}$: system information throughput (i.e. the 
capability to transfer the input information flow 
to the output information flow; in the case of 
redundancy $t_{22}$ is much higher than 1). 
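
The matrix form of equation (4) can be sketched as follows, assuming that information and information flow are expressed on simple numeric scales. The indices 1 (input) and 2 (output), the class and function names, and all matrix values are illustrative assumptions.

```python
from typing import NamedTuple, Sequence, Tuple

Matrix2x2 = Tuple[Tuple[float, float], Tuple[float, float]]


class SegmentState(NamedTuple):
    information: float  # I: information (hypothetical numeric scale)
    flow: float         # Phi: information flow (hypothetical numeric scale)


def apply_segment(t: Matrix2x2, state: SegmentState) -> SegmentState:
    """Apply one transmission matrix: (I2, Phi2) = T * (I1, Phi1)."""
    (t11, t12), (t21, t22) = t
    return SegmentState(
        information=t11 * state.information + t12 * state.flow,
        flow=t21 * state.information + t22 * state.flow,
    )


def apply_chain(segments: Sequence[Matrix2x2], state: SegmentState) -> SegmentState:
    """Pass the state through consecutive information segments."""
    for t in segments:
        state = apply_segment(t, state)
    return state


# Hypothetical segments: a sensor that interprets half of the incoming flow,
# followed by a link that passes only half of the flow onward (t22 = 0.5).
sensor = ((1.0, 0.5), (0.0, 1.0))
lossy_link = ((1.0, 0.0), (0.0, 0.5))
out = apply_chain([sensor, lossy_link], SegmentState(information=2.0, flow=4.0))
print(out)
```

Chaining segments in this way makes it visible where along the information origin process the flow is reduced, which is the kind of weak point the parameters above are meant to expose.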


Qualitative parameters of systems of systems 
(including rail systems) that directly affect 
human security, such as safety, integrity, reliability, 
quality, availability, continuity and accuracy, 
depend directly on the effectiveness of 
information systems. Information systems provide 
the required accuracy and timeliness of 
information and, in the case of action information 
systems, also the speed of the right decision. 
The efficiency level of the information system is 
expressed by the information power: 


$$P_i(t) = I_i(t) \cdot \Phi_i(t) \tag{5}$$


where $P_i(t)$ is the immediate information power 
(performance). The information power is also equal to 
the amount of eliminated uncertainty E per 
time unit t: 


$$P = I \cdot \Phi = \frac{E}{t} \tag{6}$$

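
Equations (5) and (6) can be illustrated numerically. This is only a sketch under the assumption that information, flow and eliminated uncertainty are given on consistent numeric scales; the function names and the values are hypothetical.

```python
def information_power(information: float, flow: float) -> float:
    """Immediate information power, P = I * Phi (equation 5)."""
    return information * flow


def power_from_uncertainty(eliminated_uncertainty: float, duration: float) -> float:
    """Information power as eliminated uncertainty per time unit, P = E / t (equation 6)."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    return eliminated_uncertainty / duration


# Hypothetical values: the same power expressed in both ways.
p_direct = information_power(information=3.0, flow=2.0)
p_from_e = power_from_uncertainty(eliminated_uncertainty=12.0, duration=2.0)
print(p_direct, p_from_e)
```

The second form shows the practical reading of information power: a system that removes the same uncertainty in half the time has twice the power.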

To ensure the safety of railway control systems, 
it is important to operate information 
systems that provide a correct decision quickly, 
which is closely related to information power. 
The probability of choosing the correct solution from 
a set of alternatives, i.e. the probability of a proper 
decision of the control operation system, PCD 
(Probability of Proper Decision), is given by a 
function that depends on the level of knowledge 
“k” and the information flow in time 
“Φ(t)”: 


$$PCD = F[\Phi(t), k] \tag{7}$$


Figure 1. Probability of proper decision in systems 
control operation (Moos & Malinovsky 2008). 


The relationship is depicted in Figure 1, where 
PL is the PCD maximum and ACL is the acceptable 
PCD level. 

From these facts it follows that control systems 
which rely on a higher level of knowledge are able 
to make a correct decision quickly at a lower load, i.e. 
they decide faster. 

Each system works correctly only under certain 
conditions. Therefore, cyber-physical systems need 
to have: defined limits and conditions, 
which determine their qualitative parameters; and 
mechanisms that react to the conditions of the 
surrounding system. For very different conditions, they 
need to have plans for transition to other activities, 
i.e. operation rules for abnormal and critical 
conditions, which can arise when a disaster occurs. In 
information systems, these issues are handled by 
processes ensuring information security, which are 
based on protecting important cyber assets 
in a way that provides the required degree of 
confidentiality, integrity and availability (CIA) for 
important information. 


4 CASE STUDIES, ACCIDENTS COMMON 
CYBER CAUSES 


To illustrate the causes of traffic accidents 
caused by cyber network failures, we give two real 
cases: the first from the Czech Republic and the 
other from Spain. 


4.1 Moravany 2008 


On May 19, 2008, at 4 h 48 m, a serious railway 
accident occurred at the Moravany railway station: 
a collision of a freight train with a passenger train, 
followed by derailment. The accident caused one death 
and 4 slight injuries, and the direct financial damage 
was CZK 12 643 092 (CR MD 2017). 

The accident investigation report states the 
following direct causes: loss of contact 
between train No. 5011 and track circuit 1K; and 
the reaction of the interlocking system ESA 11 to the 
unexpected change in the occupation status of 
station track No. 1. The underlying causes were: 
incompatibility of railway vehicles and track circuits 
as regards sanding (improving the friction between 
wheel and track); and the internal logic of the 
interlocking system as regards the processing of 
information about station track occupation received 
after track circuit reactivation. In terms of the safety 
management system, the report emphasizes the 
operation of railway vehicles incompatible with track 
circuits without adequate safety measures. 

This event is not unique; the source (CR MD 2017) 
also states: “On August 29, 2008, an event with the 
same background as the event of May 19, 2008 at the 
Moravany railway station took place at the Hulin 
railway station. After the same defect of the same 
sanding device on a railway vehicle of the same series, 
the same response of a stationary protection device 
of the same type occurred at 17 h 46 m 55 s, changing 
the status indication of the 3rd station track to ‘free’, 
although it was still occupied by the standing passenger 
train Os 4256. This event took place without 
consequences only because of favourable circumstances 
and the immediate proper response of the 
participating employees.” 

In terms of common cyber causes 
(Kertis & Prochazkova 2017a), this event is a 
combination of two types of causes, namely faulty SW 
and/or inadequate HW: failure of the sanding device, 
which remained active after being switched off by the 
engine driver (insufficient maintenance, HW); use of 
sand with a larger grain size (organizational error, 
HW); the sanding indication for the driver does 
not show the actual state, only the intention, i.e. 
whether an electronic signal for sanding is given 
(insufficient design, SW/HW); signalling of an 
unoccupied track circuit on an occupied track 
(insufficient design, HW/SW); automatic route setting 
to an occupied track circuit (insufficient design, 
SW); absence of a remote train stop system 
(insufficient design, HW/SW); and insufficient 
communication equipment, as the freight train driver 
did not react to the stop call (insufficient design, 
organizational error, HW). 

The main cause of the accident is the fact that 
information was distorted by both the sanding 
and the inadequate response of the rail protection 
system in the station, i.e. it is an error that occurred 
at the interface of the cybernetic and physical 
(cyber-physical) system. 
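
One way to picture this interface error is a plausibility check on occupancy information: a section that changes from occupied to free while no neighbouring section becomes occupied suggests lost contact rather than a departed train. The sketch below is a hypothetical illustration only; it does not describe the logic of ESA 11, and the section names and the rule itself are assumptions.

```python
from typing import Dict, Iterable


def release_is_plausible(
    section: str,
    neighbours: Iterable[str],
    before: Dict[str, bool],
    after: Dict[str, bool],
) -> bool:
    """Return True if the reported occupancy change of `section` is plausible.

    `before` and `after` map section names to occupancy (True = occupied).
    A change from occupied to free is accepted only if at least one
    neighbouring section is occupied afterwards, i.e. the train went somewhere.
    """
    became_free = before[section] and not after[section]
    if not became_free:
        return True  # no release reported, nothing to check
    return any(after[n] for n in neighbours)


# Hypothetical layout: a station track between two switch zones.
neighbours = ["switch_zone_a", "switch_zone_b"]
before = {"station_track": True, "switch_zone_a": False, "switch_zone_b": False}

# Contact loss: the track reports free, but the train appeared nowhere else.
contact_loss = {"station_track": False, "switch_zone_a": False, "switch_zone_b": False}
# Regular departure: the train is detected in the adjacent switch zone.
departure = {"station_track": False, "switch_zone_a": True, "switch_zone_b": False}

print(release_is_plausible("station_track", neighbours, before, contact_loss))
print(release_is_plausible("station_track", neighbours, before, departure))
```

A check of this kind raises the knowledge level of the system: the decision no longer relies on a single signal whose physical meaning can be distorted.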


4.2 Santiago de Compostela 2013 


The rail accident in 2013, which occurred a few 
kilometres from Santiago de Compostela in Spain, was 
the worst rail accident in Spain in the last forty 
years. In July 2013, at 20 h 41 m, a high-speed 
passenger train derailed at a speed of 179 km/h 
on the Angrois curve (where the speed 
limit is 80 km/h). After the derailment, most of the 
coaches hit the concrete wall along the curve and 
the rear generator car caught fire. The result was 
80 fatalities and 152 injured (almost all the 
passengers) (EU 2015). 

The cause of the accident was the speed of the train, 
and the state professional commission 
of inquiry ultimately blamed the engine driver 
for error and non-observance of rail regulations. 
The commission's conclusions are not open 
to the public, and they were questioned by the 
European Railway Agency (ERA). In its document for 
the EU, this agency described the root causes of the 
accident, i.e. it also revealed weaknesses in the 
overall rail control system. 

We briefly outline the ERA report conclusions 
(EU 2015) and append comments based on our 
experience from the investigation of similar accidents 
(Prochazkova 2017): 


1. The accident involved the Alvia Class 730 high-speed 
passenger train 150/151, a modification of Class 
130; both train ends are equipped with a generator 
car including a diesel tank. Note: spilled diesel is 
precisely what has been the main cause of large fires 
in railway accidents. 

2. Train 150/151 consisted of 13 vehicles: two 
power cars and two generator cars at the train 
ends; eight passenger coaches; and a restaurant 
car. The train weight was 382 tons. Note: 
the train's parameters have a direct influence 
on the impacts and severity of the accident. 

3. The train was equipped with two security systems, 
ASFA Digital and ERTMS/ETCS. 
Due to problems with the reliability and availability 
of the ERTMS configuration on this line, the 
operator (Renfe Operadora) applied to operate 
under a trackside signal block (BSL) with the 
protection of ASFA Digital. Note: it is precisely the 
incompatibility of multiple systems with different 
levels of automation that disrupts safety 
integrity and contributes to serious accidents. 

4. The line in question is equipped with the BSL 
device and ERTMS/ETCS level 1, with the exception 
of its beginning and end, and with the back-up 
ASFA Digital. Note: the transition between 
various control systems often leads to confusion 
of staff and to human errors. 

5. The low-speed curve (max. 80 km/h) with a design 
radius of 402 m is located at the end of the line, in 
the part equipped solely with ASFA Digital. Note: at 
that time this system did not supervise the speed limit 
and allowed the train to pass at an unacceptable 
speed, which contributed to the derailment. 

6. A massive concrete wall runs along the curve. 
The maximum permitted speed is 80 km/h, which 
is indicated on a trackside sign. Note: such signs 
are not clearly visible at high speed and when the 
driver is occupied, which contributed to the accident. 

7. The signals and the path for the train were set 
to indicate “track clear”. Note: 
it is precisely gaps in the automatic control system, 
which allow an incorrect indication that the track 
is free or restricted, that have caused accidents. 

8. The speed change marker before the curve at 
kilometre point (PK) 84 + 273 gave no warning. 
Note: this seemingly trivial fact has 
often been the cause of traffic accidents. 

9. In the engine driver's cabin there were several 
communication systems (e.g. radiotelephones 
between trains; mobile phones 
(GSM-R)) for corporate communication within the 
train control system; they were in working order. 
Note: service calls and communication with 
dispatching, along with other train control tasks, 
especially at changes of control systems, lead 
to a heavy workload for engine drivers and 
contribute to mistakes. 

10. The driver's timetable book showed the speed 
change, i.e. the limitation of speed to 80 km/h at 
PK 84 + 230 (the Angrois curve). Note: this is 
a possible driver's mistake due to oversight. 

11. According to regulations, at this location the driver 
needs to start slowing down (from 200 km/h 
to 80 km/h) without technical support from the 
control system. Note: this is a possible driver's 
mistake. 

12. In the given case, the train was delayed by ca 2 to 
3 minutes. Note: such a delay often increases the 
engine driver's stress because he/she needs to 
adhere to the timetable. 

13. The records showed that the engine driver 
answered a service call from the train manager ca 
6000 metres before the beginning of the curve and 
that the call lasted 100 seconds. Note: this is a 
possible reason why the driver did not react to the 
sign dictating the speed reduction. 

14. The technical analysis showed that the train 
brakes were not sufficiently activated to 
ensure the required speed reduction. Note: 
this indicates a late reaction of the engine driver. 
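
The figures in items 11 and 13 can be combined in a short worked calculation. It is only a sketch: it assumes that the train ran at a constant 200 km/h during the whole call, which the source does not state, and the function names are illustrative.

```python
KMH_TO_MS = 1000 / 3600  # conversion factor from km/h to m/s


def distance_covered(speed_kmh: float, seconds: float) -> float:
    """Distance in metres travelled at constant speed."""
    return speed_kmh * KMH_TO_MS * seconds


def required_deceleration(v_from_kmh: float, v_to_kmh: float, distance_m: float) -> float:
    """Constant deceleration (m/s^2) needed to slow down within distance_m."""
    v1 = v_from_kmh * KMH_TO_MS
    v2 = v_to_kmh * KMH_TO_MS
    return (v1 ** 2 - v2 ** 2) / (2 * distance_m)


# Values from the list above: call answered ca 6000 m before the curve,
# call duration 100 s, required slow-down from 200 km/h to 80 km/h.
call_distance = distance_covered(200, 100)          # roughly 5556 m
remaining = 6000 - call_distance                    # roughly 444 m
decel = required_deceleration(200, 80, remaining)   # roughly 2.9 m/s^2

print(round(call_distance), round(remaining), round(decel, 2))
```

Under this assumption the call consumed almost the whole distance to the curve, so the information needed for the braking decision reached the driver when very little room for a correct reaction was left.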


The ERA concluded that the formal 
investigation by the Commission for investigating 
railway accidents in Spain (CIAF, Comisión de 
Investigación de Accidentes Ferroviarios) did not 
consider all the facts when determining the root 
causes of the accident, because it relied on a narrow 
view of the matter, i.e. that the driver always needs 
to respond correctly and cannot rely on notices from 
security systems, i.e. he/she needs to promptly adjust 
the train speed to the required 80 km/h by braking 
(EU 2015). 

The assessment of the accident, based on 
the ERA conclusions and our notes on the 14 
domains, shows that a combination of a number 
of mistakes occurred in this case; in particular, 
the complexity of the rail traffic safety management 
system contributed to the accident. 

From the viewpoint of the cybernetic causes given 
in (Kertis & Prochazkova 2017a), these are: 
distortion of monitoring data, which according to the 
driver’s testimony confused him about his current 
position; and inadequate HW, since in the dangerous 
section the security level was reduced owing to 
the switching off of ETCS level 1, which led to the 
loss of remote supervision of the permitted line speed 
in that section. 


4.3 Common cyber causes 


From the analysis of the above-mentioned railway 
accidents, their common cybernetic causes are 
clear. They are mainly connected with problems 
at the interfaces of systems that are designed, 
implemented and operated by various subjects with 
different levels of responsibility. These are: 
problems at human-machine interfaces; problems 
at cyber-physical systems interfaces; problems at 
socio-technical systems interfaces; problems in the 
determination of responsibilities, not only 
between subjects but also between system processes 
(mutual compatibility); and problems at interfaces 
among systems with different criticalities (levels 
of automation, levels of security, safety integrity 
levels). 

These problems correspond to the findings of 
the analysis of common causes of railway accidents 
in the Czech Republic (Kertis & Prochazkova 
2017a), which in terms of cybernetics are: 
distortion of monitoring data; faulty software that 
does not consider all possible variants of operating 
conditions; and insufficiently robust hardware that 
causes incorrect or slow processing and evaluation 
of data. In some cases, hacker attacks on the 
control centre of the dispatcher’s workplace were 
also recorded. The underlying root cause of accidents 
is insufficient validity of system decision-making, i.e. 
the low information power and low knowledge 
of information systems, i.e. their low information 
amount. 


5 METHOD OF CONSTRUCTION OF 
MEASURES IMPROVING RAIL SAFETY 


To reach the highest level of safety of the rail control 
system, and thus also of information power, all 
stakeholders need to introduce TQM approaches 
(Zairi 1991) that take integral risks into 
consideration. The approaches need to be adapted 
to the specific features of rail systems. Therefore, 
multi-criteria approaches need to be used for work 
with risks. Because some criteria for reaching 
safety integrity are conflicting, the optimal solution 
needs to be determined so that the limits are 
respected over a sufficiently wide range of 
conditions (Prochazkova 2013, 2015, Kertis 2015). 
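
A weighted-sum evaluation is one simple instance of a multi-criteria approach; it is shown here only as a sketch, not as the method of the cited works. The criteria, weights, limits and variant names are all hypothetical.

```python
from typing import Dict, Optional

Scores = Dict[str, float]


def weighted_score(scores: Scores, weights: Scores) -> float:
    """Weighted average of criterion scores (all scores in the range 0..1)."""
    total_weight = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_weight


def best_feasible(options: Dict[str, Scores], weights: Scores, limits: Scores) -> Optional[str]:
    """Pick the best-scoring option that respects every minimum limit."""
    feasible = {
        name: scores
        for name, scores in options.items()
        if all(scores[c] >= limit for c, limit in limits.items())
    }
    if not feasible:
        return None
    return max(feasible, key=lambda name: weighted_score(feasible[name], weights))


# Hypothetical design variants rated on conflicting criteria.
options = {
    "A": {"safety": 0.9, "availability": 0.6, "cost": 0.4},
    "B": {"safety": 0.7, "availability": 0.9, "cost": 0.8},
    "C": {"safety": 0.5, "availability": 0.95, "cost": 0.9},
}
weights = {"safety": 0.5, "availability": 0.3, "cost": 0.2}
limits = {"safety": 0.6}  # variants below this safety level are rejected outright

best = best_feasible(options, weights, limits)
print(best)
```

Treating the limits as hard constraints and only then trading off the remaining criteria reflects the idea that the optimum must not be bought at the cost of exceeding the safety limits.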


6 MEASURES IMPROVING RAIL SAFETY 


In the present case, we have adjusted the 
above-described approach for the rail information 
systems characterized above. According 
to (Kertis & Prochazkova 2017b), Figures 2 to 4 
show, with a certain degree of abstraction, a gradual 
change of the system during the implementation 
of various techniques for increasing information 
power and system security. 


Figure 2. Level of CPS security (Common system of 
systems). 


Figure 4. Level of CPS security (the secured system 
with higher information power). 

Figure 2 shows an open system of systems, in 
which processes and linkages are implemented to 
perform defined functions. Given the large number 
of system linkages and interactions among the 
many entities involved, the system performs the 
required functions under normal conditions, but 
it is prone to errors, especially in the case of larger 
operational deviations and imbalances in the 
external system environment. 

Figure 3 shows the system of systems with 
optimized information power, which increases the 
system’s durability and maintainability, and 
thereby its security. Through optimization, i.e. 
streamlining the flows and improving the information 
power parameters, the system becomes less prone 
to internal mistakes in the case of various known 
deviations in operation and in the surroundings. 

To increase information power and minimize 
the resources necessary for building these 
systems, a number of methods can 
be used in practice, e.g.: COBIT for auditing 
information systems from the top management 
point of view (Vitous 2000); ITIL for the management 
of information systems and services, parts of 
which are standardized (ISO 2011); and refactoring, 
i.e. changes in a software system that do not influence 
the external behaviour of the information system 
but improve its internal structure (Moos & 
Malinovsky 2008). 

Introducing management systems (ISO 2015, 
2017) with the support of information systems 
designed using the methods listed above greatly 
contributes to improving information power 
and the transmission matrix parameters according to 
formula (4), but only if the context of the 
management systems covers the rail system as 
a whole, with uniform terminology and a focus on 
the interfaces of the systems across all stakeholders, 
i.e. the subjects concerned. 

Figure 4 shows a well-secured system with 
optimized information power, which we obtain from the 
case shown in Figure 3 by protecting it against 
significant external and internal influences, i.e. we 
introduce preventive and mitigating measures, 
and prepare reactive measures for incidents 
as well as measures for rapid recovery 
with the help of verified continuity plans. 

In the management and governance of the railway 
system and related organizations, systems and 
methods according to (ISA 2007, ISO 2009, 2013) 
are gradually being introduced. However, they need to 
be implemented by all stakeholders and, mainly, the 
problems associated with system interfaces need 
to be overcome; i.e., it is necessary to consider the 
whole information origin process, the information 
technologies used and the quality of their parameters 
according to Chapter 3. 

It follows that for safe operation it is 
very important not only to deal with the security of 
information and technological assets, but also to 
provide the required information power of the 
system across all levels and subjects concerned. 
Parameters of information power are influenced in 
all sub-processes of the information origin process, i.e. 
the optimization needs to take into consideration 
each such sub-process and the appropriate 
information technology listed in Table 2. 

Security of such a system then focuses on the 
identification and management of important assets. 
Since not everything can be protected, we need to 
select critical assets, i.e. critical processes, 
information flows or other supporting information and 
physical assets. On the basis of their criticalities, we 
assess the primary risks and introduce appropriate 
preventive measures. If a disaster (including 
a cyber-attack) occurs, we perform response 
and recovery according to the established policy 
(Kertis & Prochazkova 2017b, c). 

According to the system of systems safety 
principles, the whole rail system needs to be built 
according to the Defence-in-Depth approach 
(Prochazkova 2015), and different types of safety 
management need to be introduced to reflect the 
expected operating conditions of the system; 
eventually, for severe disasters, there should also be 
a way of management that protects assets other than 
those of the system under consideration, in which we 
monitor physical, organizational and cyber assets. 


7 CONCLUSION 


Examples of rail accidents presented in the paper 
point to several common cybernetic causes, which 
consist of combinations of several errors, mainly 
in the design of the safety management system and 
its information systems; these systems should increase 
not only the economic benefit of operation, but also 
the level of system safety with regard to the public 
assets of the European space. Since today’s processes 
of railway management, i.e. the considered safety 
management and other parts of 
the railway system (control systems, infrastructure, 
vehicles and devices), cannot do without information 
systems, it is very important to look for faults 
in the cyber domain as well. 

The analysis of common cybernetic causes, performed
on the basis of two case studies, points to issues
connected with system interfaces that are
administered by different subjects or have a
different physical nature. On the basis of the
above-mentioned outcomes and knowledge, it is
necessary to use multi-criteria approaches and to
propose measures for coping with the vulnerabilities
found, i.e. improving the parameters that heighten
the information power and ensuring information
security in the critical processes of railway
management systems, namely in the whole information
origin process and in information processing. If the
proposed principles are applied, the work
contributes to increasing railway safety.

ABSTRACT: Many autonomous systems are safety-critical, e.g., autonomous cars, boats, or aerial
vehicles. Autonomous systems rely on software and communications. Security vulnerabilities in software
and communication give adversaries opportunities to attack and to compromise security and safety.
Therefore, when analysing safety, security should be co-analysed. In this study, we explored three safety
and security co-analysis methods: Systems-Theoretic Process Analysis (STPA) plus STPA-Security
Analysis (STPA-Sec), Failure Mode, Vulnerabilities and Effect Analysis (FMVEA), and Combined Harm
Assessment of Safety and Security for Information Systems (CHASSIS). The purpose is to compare the
applicability, efficiency, and hazards identified by the different methods. An autonomous boat is used
as the case study. The results show that STPA plus STPA-Sec and CHASSIS can be more time
consuming to use than FMVEA. However, STPA plus STPA-Sec and CHASSIS can help analysts
identify more hazards of autonomous systems than FMVEA. The results also reveal weaknesses of each
method in analysing autonomous systems with different levels of autonomy. We therefore propose possible
improvements and combinations of the methods.


1 INTRODUCTION 


Autonomous systems such as drones, driverless cars,
and autonomous boats are being developed. The key
characteristic of an autonomous system is its
ability to operate independently of a human
operator. The system sustains situation awareness
and decision-making capability when an expected or
unexpected event occurs. By shifting degrees of
situation awareness and decision-making
responsibility from humans to the system, we can
design autonomous systems with different levels of
autonomy. As an example, the Society of Automotive
Engineers has described six levels of driving
automation (SAE, 2016): no automation, driver
assistance, partial automation, conditional
automation, high automation, and full automation.
Without systematic safety/security analysis and
design of autonomous systems, mishaps can happen
and harm users and the environment. For example, on
24 July 2015, Fiat Chrysler Automobiles ordered a
recall of 1.4 million vehicles that were vulnerable
to remote control and hijacking (Guzman, 2015). In
2013, Samy Kamkar demonstrated with the Parrot AR
that it was possible to hijack other drones, using
what he called SkyJack (Kamkar, 2013). Google
self-driving cars had few accidents but were
sometimes involved in rear-end collisions with
human-driven cars, because the human drivers did
not anticipate the actions of the autonomous system
(Teoh and Kidd, 2017).
Traditionally, system safety analysis focuses on 
accidental component failures or software bugs. 
As industrial and autonomous control systems 
are increasingly interconnected through networks, 
system safety can also be compromised by security 
breaches. “Although of great importance, it is not 
sufficient to address accidental threats (hazards) of 
such systems—also threats of intentional origin need 
to be covered” (Aven, 2007). “Security functions
are not meant to cope with physical hazards and 
failures; likewise, safety functions might not detect 
and respond to attacks that target the digital compo- 
nents of the system. We infer that safety and security 
are complementary and should be treated jointly to 
improve risk management” (Kriaa et al., 2015).
Several methods have been proposed to combine
safety and security analysis of industrial control
systems. Some studies have empirically compared
different security and safety co-analysis methods
using specific systems. For example, FMVEA (Failure
Mode, Vulnerabilities and Effect Analysis) and
CHASSIS (Combined Harm Assessment of Safety and
Security for Information Systems) were compared
using an automotive cyber-physical system
(Schmittner et al., 2015). The comparison focused
on level of abstraction, comparability of repeated
analysis, reusability of analysis artefacts, scope of
analysis, suitability for risk rating, and
adaptability to changing context. However, we
believe that more empirical comparisons of different
security and safety co-analysis methods are needed,
because many methods have been proposed but not
thoroughly evaluated. In addition, few studies have
used autonomous systems as cases for evaluation.
We have been interested in several methods,
including the STAMP-based method STPA
(System-Theoretic Process Analysis), since it takes
a modern systems approach that looks at key control
issues.

In this paper, we present an empirical study that
compares three security and safety co-analysis
methods, using an autonomous boat that is under
development as the case. The autonomous boat Revolt
(www.dnvgl.com/technology-innovation/revolt/index.html),
with its present design and sensor fitting, is not
yet purely autonomous but a remotely operated,
dynamically positioned boat. To become more
autonomous, the boat still lacks sensors and
functions for tracking other objects. This study is
only the first step in analysing the safety and
security of the autonomous boat. We will follow the
development of Revolt and re-analyse it when new
functions are added. Such follow-up analyses will
give us insights into the different safety and
security issues of autonomous systems with different
levels of autonomy. The key focus of our study was
to compare the applicability, efficiency, and hazards
identified by the different methods. The methods
piloted, in the sequence of the pilot, are 1) FMVEA,
2) STPA plus STPA-Sec, and 3) CHASSIS.

Results of the study show that STPA plus STPA- 
Sec and CHASSIS are potentially more time con- 
suming than FMVEA. Results also illustrate that 
different methods have different strengths and 
weaknesses for identifying different hazards. Based 
on results of this study, we propose to improve and 
combine the methods to meet the requirements of 
security and safety analysis of different autono- 
mous systems. 

The rest of this paper is organized as follows. 
Section 2 defines relevant terminologies. Sec- 
tion 3 introduces the state of the art of security and 
safety co-analysis, focusing on the three methods 
we evaluate. Section 4 presents our study design 
and results. Section 5 discusses evaluation results, 
and Section 6 concludes. 


2 DEFINITIONS 


There are many definitions of security and safety.
Usually, safety is used to describe accidental
harm, while security is used to describe intentional
harm. In (Firesmith, 2003), safety is defined as
“the degree to which accidental harm is prevented,
reduced and properly reacted to”, and security is
defined as “the degree to which malicious harm
is prevented, reduced and properly reacted to.” In
the SEMA reference framework (Piètre-Cambacédès
and Chaudet, 2010), safety and security are
graphically mapped on a conceptual grid with two
dimensions. The first dimension distinguishes
between accidental and malicious threats. The
second dimension differentiates safety and security
based on origin and consequences. In the SEMA
reference framework, the origin and consequence
of safety are the system and the environment,
respectively. For security, the origin and
consequence could be environment to system, system
to environment, and system to system. In (Schmittner
et al., 2016), the authors clarified the terminology
to be used for STPA plus STPA-Sec analysis as
follows. We follow the safety and security related
definitions in (Schmittner et al., 2016) in our
study.


• Accident: Event which causes undesired losses
of life, asset damage, data, availability etc.

• Hazard: Dangerous system states which can lead
to accidents.

• Threat: Potential cause of an unwanted incident,
which may result in harm to a system and/or
environment.

• Vulnerability: Weakness of an asset or control
that can be exploited by one or more threats.

• Attack: Attempt to gain unauthorized access to
or make unauthorized use of an asset.


3 STATE OF THE ART 


3.1 Security and safety co-analysis 


Many studies listed in (Kriaa et al., 2015) propose
that it is necessary to consolidate security and
safety co-analysis, because security breaches can
bring risks to system safety. However, the study
(Eames and Moffett, 1999) identifies possible
disadvantages of security and safety co-analysis:
“We believe that consolidation of safety and
security could reduce developers’ understanding of
the system being analysed, and prevent a thorough
analysis of either property.” In addition, the same
study says that “An additional danger is that a
unified approach might actually hide the
requirements conflicts that it aims to resolve.” To
address the possible disadvantages, it is critical to
closely examine the various kinds of
interdependencies between safety and security.
Safety-security interactions can be classified into
four categories (Piètre-Cambacédès, 2010).


• Conditional dependency: Satisfaction of safety
requirements conditions security or vice-versa.

• Mutual reinforcement: Satisfaction of safety
requirements or safety measures contributes to
security, or vice-versa, thereby enabling resource
optimization and cost reduction.

• Antagonism: When considered jointly, safety
and security requirements or measures lead to
conflicting situations.

• Independency: No interaction.


The security and safety co-analysis methods can
generally be classified into three categories (Kriaa
et al., 2015). One category is generic approaches,
such as FMVEA (Schmittner et al., 2014a) and
Fault Tree Analysis (Kornecki and Liu, 2013).
Another category is model-based graphical methods,
such as CHASSIS (Raspotnig et al., 2012) and a
method using Bayesian Belief Networks (Kornecki
et al., 2013). The third category is model-based
non-graphical methods, such as STPA (Young and
Leveson, 2013) and a unified framework (Asare
et al., 2013). Autonomous systems are often
cyber-physical systems that integrate computation,
networking, and physical processes. In addition,
autonomous systems need to have proper situation
awareness using various sensors, and need to make
correct decisions based on the sensor information.
Thus, we decided to evaluate, from each category
mentioned in (Kriaa et al., 2015), one method that
is relevant to cyber-physical systems. We chose
FMVEA, CHASSIS, and STPA plus STPA-Sec, because
FMVEA and CHASSIS have been shown to be applicable
to automotive cyber-physical systems (Schmittner
et al., 2015), and STPA plus STPA-Sec focuses
strongly on software-dependent systems.


3.2 FMVEA 


FMVEA (Schmittner et al., 2014a, Schmittner
et al., 2014b) is an FMEA (Failure Mode and Effect
Analysis) technique extended with security
analysis. FMVEA is based on a three-level Data Flow
Diagram (DFD). The first step of the method is to
model the system and then to identify failure and
threat modes of each component of the system. The
failure mode covers the safety aspect, by describing
the way the component could potentially fail. The
threat mode covers the security aspect, describing
the way the component could potentially be misused.
The threat modes are based on the STRIDE model,
developed by Microsoft (Microsoft, 2002). The
STRIDE classification (spoofing/authentication,
tampering/integrity, repudiation/non-repudiation,
information disclosure/confidentiality, denial of
service/availability, elevation of
privilege/authorization) enables possible attacks on
such components to be found. Creating failure and
threat modes depends on knowledge about the system.
The potential risks, and the effects they could
have, are each related to a component (context
level).

In addition to identifying vulnerabilities, threat
modes, threat effects, and system effects, FMVEA
also tries to quantify the attack probability by
estimating system susceptibility and threat
properties.
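
As a rough illustration of how one FMVEA threat-mode row could be recorded and ranked, consider the sketch below. The ordinal scales, the additive attack probability, and the multiplicative risk are assumptions made for this example only; they are not taken from (Schmittner et al., 2014a), and the two rows are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Stride(Enum):
    """STRIDE threat categories and the security property each one violates."""
    SPOOFING = "authentication"
    TAMPERING = "integrity"
    REPUDIATION = "non-repudiation"
    INFORMATION_DISCLOSURE = "confidentiality"
    DENIAL_OF_SERVICE = "availability"
    ELEVATION_OF_PRIVILEGE = "authorization"


@dataclass
class ThreatModeRow:
    component: str
    threat_mode: Stride
    severity: int               # assumed ordinal scale (higher is worse)
    system_susceptibility: int  # assumed ordinal scale
    threat_properties: int      # assumed ordinal scale

    def attack_probability(self) -> int:
        # Assumption: the two estimates are simply added.
        return self.system_susceptibility + self.threat_properties

    def risk(self) -> int:
        # Assumption: risk is severity times attack probability.
        return self.severity * self.attack_probability()


# Hypothetical rows, for illustration only.
rows = [
    ThreatModeRow("wireless link", Stride.DENIAL_OF_SERVICE, 3, 3, 2),
    ThreatModeRow("embedded computer", Stride.TAMPERING, 4, 1, 2),
]

# Rank the rows so that the highest-risk threat mode comes first.
ranked = sorted(rows, key=lambda row: row.risk(), reverse=True)
for row in ranked:
    print(row.component, row.threat_mode.name, row.risk())
```

Any monotone combination of the estimates would serve the same ranking purpose; the point is only that each row carries both qualitative fields and the numbers used for ordering.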


3.3 CHASSIS


CHASSIS (Raspotnig et al., 2012) defines a unified
process for safety and security assessments. The
process includes the use of Misuse Cases (MUC)
(Sindre and Opdahl, 2005) and Misuse Sequence
Diagrams (MUSD) (Katta et al., 2010) for visual
modelling in security analysis. MUC is also used for
safety assessment, but there it is combined with
Failure Sequence Diagrams (FSD) instead of MUSD for
detailed failure analysis (Raspotnig and Opdahl,
2012). As shown in Figure 1, there are three stages
and eight steps in CHASSIS. The first stage (steps
1-3) is to draw Use Case and Sequence Diagrams based
on operational and environmental descriptions of the
system. In the second stage (steps 4-6), MUC
diagrams are created by applying a set of hazard and
operability study (HAZOP) guidewords (Kletz, 1997)
to the use cases. The MUC diagrams are then
described in textual MUC templates (step 5). FSDs
and MUSDs are used to refine the harm scenarios
defined in the templates (step 6). When the textual
misuse cases are finished, HAZOP tables are
prepared (step 7) and the corresponding safety or
security requirements are defined (step 8).
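
The guideword step can be pictured as crossing each use case with a list of guidewords to obtain prompts for the analysts. The sketch below is only an illustration: the guideword list is a commonly used HAZOP subset chosen for this example (the text does not list the set used), and the function name and the use case wording are hypothetical.

```python
# A commonly used subset of HAZOP guidewords (assumed for illustration).
HAZOP_GUIDEWORDS = [
    "no", "more", "less", "as well as", "part of",
    "reverse", "other than", "early", "late",
]


def misuse_case_prompts(use_case, guidewords=HAZOP_GUIDEWORDS):
    """Return one analyst prompt per guideword for the given use case."""
    return [
        f"{word.upper()}: what harm could follow if '{use_case}' "
        f"deviates in this way?"
        for word in guidewords
    ]


# Hypothetical use case wording.
prompts = misuse_case_prompts("operate and monitor the boat remotely")
for prompt in prompts:
    print(prompt)
```

The prompts still have to be answered by experts, which is where the dependence of the method on expert judgement comes from.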


Figure 1. CHASSIS’ unified process (stages: eliciting functional requirements, eliciting safety/security requirements, specifying safety/security requirements).
3.4 STPA and STPA-Sec


STPA-Sec (Young and Leveson, 2013, Young and
Leveson, 2014) extends STPA (System-Theoretic
Process Analysis), which is a safety analysis method
(John P., 2013, Leveson, 2012). The extension adds
security analysis. STPA-Sec “Shifts the focus of the
security analysis away from threats as the proximate
cause of losses and focuses instead on the broader
system structure that allowed the system to enter a
vulnerable system state that the threat exploits to
produce the disruption leading to the loss” (Young
and Leveson, 2013).
The main steps of STPA plus STPA-Sec are:


• Identifying what essential services and functions
must be protected or what represents an
unacceptable loss.

• Identifying system hazards and constraints.

• Drawing the system control structure, physical
hardware and network structure, and identifying
unsafe control actions.

• Determining the potential causes of the unsafe
control actions. The potential causes could be
security vulnerabilities and threats. To facilitate
the security analysis, some guide words, such as
tampered feedback, injection of manipulated control
algorithm, and intentional congestion of feedback
path, are added (Schmittner et al., 2016).


Compared to other security analysis methods, 
STPA-Sec does not focus on countermeasures that 
should be taken. STPA-Sec focuses mainly on iden- 
tifying those scenarios that could lead to losses. 


4 STUDY DESIGN AND RESULTS 


4.1 Scope: Autonomous boat 


The autonomous boat Revolt shown in Figure 2
was built by Stadt Towing Tank (STT),
commissioned by DNVGL, in 2014. The model is a
1:20 scale model of the concept ship. The model
ship has a length of 3 meters and weighs 257 kg.

Although Revolt is still under development and
is not a fully autonomous boat, we still want to
use it as a case, since it gives us the opportunity
to explore the hazards and threats related to two
main issues: 1) safety and security of autonomous
steering of the ship (i.e. losing control; ship
damaged/destroyed) and 2) security of data
communication between onshore and offshore
(sensitive data compromised).


4.2 Security and safety co-analysis using 
FMVEA 


The FMVEA analysis focuses on the embedded
computer. The attack surface is highest for the
embedded computer in the Revolt, since the other
components are in some way connected to it.
Microcontrollers are connected to (and controlled
by) the embedded computer via USB. Analogue
components (water sensor etc.) are connected to
the microcontrollers.

Figure 2. Overview of Revolt and its components.

To perform the FMVEA analysis, we fill in the
table as proposed in (Schmittner et al., 2014a). The
table includes columns for qualitative safety and
security analysis, such as component, failure mode,
threat mode, failure effect, threat effect, system
status, and system effect. The table also includes
columns for quantitative analysis and for ranking
the hazards, such as severity, system
susceptibility, threat properties, attack/failure
probabilities, and risks.


4.3 Security and safety co-analysis using STPA 
plus STPA-Sec 


When performing the STPA plus STPA-Sec analysis,
we start with the following unacceptable
losses/accidents and safety constraints.


• Collision with vessels, objects, humans/mammals,
structures, grounding

• Fire or explosion

• Foundering (sinking, failing or plunging)

• Loss of cargo

• Loss of mission objectives

• Loss of information


Then, we read the network structure and the
control structure documents of the boat to identify
unsafe control actions. We follow the systematic
method proposed in (John P., 2013): we enumerate
all combinations of possible values of the process
variables and evaluate whether a control action can
be unsafe if it is given, is not given, is given too
early or too late, or is given with too large or too
small a value. The control actions (CAs) we analyse
include:


CA1: Control the position of the vessel 

CA2: Control the speed of the vessel 

CA3: Control the course of the vessel 

CA4: Control the access to the vessel’s system
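
The enumeration described above can be sketched as a cross product of process-variable values and guide words. The sketch below is an assumption-laden illustration: the two process variables and their values are hypothetical (the study's actual variables are not listed in the text), and each generated row is only a candidate that an analyst must still judge as safe or unsafe.

```python
from itertools import product

# The six ways in which a control action can be unsafe, as listed in the text.
GUIDE_WORDS = [
    "given", "not given", "given too early", "given too late",
    "given with too large a value", "given with too small a value",
]

# Hypothetical process variables and values, for illustration only.
PROCESS_VARIABLES = {
    "obstacle_ahead": [True, False],
    "link_to_shore": ["up", "down"],
}


def candidate_rows(control_action, variables=PROCESS_VARIABLES,
                   guide_words=GUIDE_WORDS):
    """Enumerate (control action, guide word, context) candidates."""
    names = list(variables)
    rows = []
    for values in product(*(variables[name] for name in names)):
        context = dict(zip(names, values))
        for word in guide_words:
            rows.append((control_action, word, context))
    return rows


rows = candidate_rows("CA2: Control the speed of the vessel")
print(len(rows))  # 2 * 2 contexts times 6 guide words
```

The number of rows grows multiplicatively with every added process variable, which is why the analysis can become time consuming.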


After identifying the Unsafe Control Actions
(UCAs), the last step of the analysis is to identify
possible causal factors of the UCAs, including
possible security breaches that can lead to them. In
this last step, the STPA-Sec analysis is applied
using the guide words proposed in (Schmittner et
al., 2016).


4.4 Security and safety co-analysis using 
CHASSIS 


To perform the CHASSIS analysis, we first identify
use cases and draw use case diagrams. The use case
we focus on is “operating and monitoring Revolt
remotely through the Revolt Intelligent System
(RIS)”. Then we create security and safety misuse
cases using the HAZOP keywords proposed in
(Schmittner et al., 2015). Examples of the safety
and security misuse cases are shown in Figure 3.


4.5 Comparisons of effort spent on co-analysis 


The inputs to the methods are very different.
FMVEA analysis focuses on components. STPA plus
STPA-Sec analysis focuses on control actions.
CHASSIS analysis focuses on use cases. Thus, it is
difficult to compare directly the effort spent on
applying the methods. However, by analysing the
hours spent on each activity shown in Table 1, we
can still observe that STPA plus STPA-Sec and
CHASSIS can be more time-consuming than FMVEA,
because more activities are included and each
activity requires more effort.

Figure 3. Examples of security and safety misuse cases (security misuse case shown: Provide operating and monitoring - Obtain access to the Revolt).
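
The per-method totals behind this observation can be checked with a few lines of arithmetic. The sketch below copies the hour values from Table 1; it assumes that the first FMVEA entry, which is printed unclearly, is 5 hours and that "20,5" is a decimal comma for 20.5.

```python
# Hours per activity, copied from Table 1.
# Assumption: the first FMVEA value is 5 and "20,5" means 20.5.
EFFORT_HOURS = {
    "FMVEA": [5, 3, 10, 27, 10],
    "STPA plus STPA-Sec": [7, 5, 30, 20, 20, 23],
    "CHASSIS": [4, 9, 9, 9, 15, 8, 8, 20.5],
}

totals = {method: sum(hours) for method, hours in EFFORT_HOURS.items()}
activities = {method: len(hours) for method, hours in EFFORT_HOURS.items()}

for method in EFFORT_HOURS:
    print(f"{method}: {totals[method]} h over {activities[method]} activities")
```

Under these assumptions the totals are 55 hours for FMVEA, 105 hours for STPA plus STPA-Sec, and 82.5 hours for CHASSIS.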


4.6 Comparisons of safety hazards identified 


As with the comparison of effort, it is difficult to
compare directly the safety issues identified by the
different methods, because the methods have
different inputs. However, by comparing the safety
issues identified by each method, we can observe the
strengths and weaknesses of each method. FMVEA
helps us identify mostly the hazards that are
related to single component failures, e.g., a
communication connection is lost or an update fails.
FMVEA does not require as many inputs as the two
other methods. It requires only a list of the
components of the system and how they are
connected. This is an advantage. However, it may
also be a restriction for early analysis, because
early in system development there might not yet be
a system design.
Compared to FMVEA, the STPA plus STPA-Sec
method helps us identify more hazards that are
related to interactions between different
components or actors. STPA is a top-down approach
that looks at the operative picture and identifies
unsafe system operation. STPA analysis covers not
only the physical system, but also human operators
and actors. One hazard example identified by STPA is
“setting route for shipment and launch position
when the shipping dock has not permitted the
action, because other ships are dispatching at the
same time”. STPA can identify such hazards because
it uses process model variables to identify
hazardous control actions, which generates scenarios
that would otherwise be omitted. Another advantage
of STPA is that it does not assume or need full
observability to be designed into the system from
the beginning; this can be achieved through several
iterations. We usually find some UCAs based on the
preliminary design of the system. When we explore
the causal factors of the UCAs, we may find new
constraints or requirements for observability and
control to handle the identified accident causes, or
a new need to obtain proof that the accident causes
will not occur in practice. However, the challenge
of STPA plus STPA-Sec analysis (John P., 2013) is
that it relies heavily on enumerating process
control variables. If many process control variables
are present, the analysis can be time consuming.

Table 1. Effort spent on each method.

Method             Activity                                           hr.
FMVEA              System level analysis                              5
FMVEA              Selection of component                             3
FMVEA              Identify functions of component                    10
FMVEA              Failure Mode, Vulnerabilities and Effect Analysis  27
FMVEA              Risk assessment                                    10
STPA and STPA-Sec  Define unacceptable losses and services            7
STPA and STPA-Sec  Identify hazards and safety constraints            5
STPA and STPA-Sec  Create functional control structure                30
STPA and STPA-Sec  Identify hazardous control actions                 20
STPA and STPA-Sec  Identify causal factors and scenarios              20
STPA and STPA-Sec  Identify mitigations                               23
CHASSIS            Elicitation functions                              4
CHASSIS            Use case diagram                                   9
CHASSIS            Safety misuse case diagram                         9
CHASSIS            Security misuse case diagram                       9
CHASSIS            Final misuse case with mitigations                 15
CHASSIS            Misuse Sequence Diagram                            8
CHASSIS            Failure Sequence Diagram                           8
CHASSIS            Fill in HAZOP table                                20.5

Compared to FMVEA and STPA, the strength
of CHASSIS is that it helps us find hazards that
are related to operation sequences. One example
hazard identified by using CHASSIS is “the operator
performs operations on the Revolt before having
done security and safety procedures, and the
Revolt’s components are having feedback delays
and commands are executed too late”. The weakness
of CHASSIS is that it relies more on expert
judgement than FMVEA and STPA. As observed in
(Schmittner et al., 2015), the possible risk could
be that “if a CHASSIS analysis is repeated by a
new group, due to the differences in the experts,
new viewpoints can be introduced that change the
results.” A restriction of CHASSIS that we identify
is that its starting point is the use case. If the
use case is too broad, the steps that follow in the
process might be difficult to perform.


4.7 Comparisons of security issues identified 


FMVEA security analysis uses the STRIDE
classification. The identified security threats are
limited to threats targeted at a single component,
e.g., the wireless connection is the target of
jamming. When using FMVEA, the safety and security
analyses can be done independently. Thus,
safety-security interactions may be overlooked.

CHASSIS identifies threats and vulnerabilities
using misuse cases. The hazards identified by
CHASSIS are mostly related to the operation and use
of the system, e.g., the communication system might
have vulnerabilities that could lead to modification
of system files. By integrating security misuse
cases and safety misuse cases, it is possible to
analyse safety-security interactions. However, as in
the safety analysis, the possible weakness of
CHASSIS is that it relies heavily on expert
knowledge. Thus, the analysis results may not be
replicable.

The security analysis of STPA-Sec focuses on
identifying security vulnerabilities that may lead
to unsafe control actions. An example is providing
CA2 (control the speed of the vessel) too late
from shore to the boat when the WIFI connection
is jammed. Compared to FMVEA, the strength
of STPA plus STPA-Sec is that it focuses more on
safety-security interactions. However, the
limitation of STPA-Sec is that the security analysis
focuses mainly on vulnerabilities that can be the
causal factors of safety hazards. Security
vulnerabilities that may lead to information leakage
or privacy issues, but will not lead to safety
hazards, may be overlooked. The study (Schmittner
et al., 2016) proposes to enhance STPA-Sec with
more focus on losses related to confidentiality. In
our study, we list “loss of information” as an
accident and find some threats that can lead to this
loss. However, the STPA plus STPA-Sec method uses
enumeration of process control variables to identify
possible information loss. If certain security
vulnerabilities, e.g. improper encryption of stored
data, are not reflected directly in the existing
process control variables, the vulnerabilities may
not be identified in the analysis. Thus, we believe
that integrating STPA-Sec with more
security-oriented analysis methods, e.g., misuse
cases or threat modelling, can be beneficial.


5 DISCUSSIONS 


5.1 Comparison with related studies 


The study (Schmittner et al., 2015) compared 
FMVEA and CHASSIS. Our study included STPA 
and STPA-Sec in the comparison. 


• Level of abstraction: CHASSIS is a quite
high-level approach. It can be applied in the early
requirement and concept phase, when a system is not
clearly defined and little information is known.
In contrast, FMVEA needs at least a list of
system elements and the connections between the
elements to generate meaningful results
(Schmittner et al., 2015). STPA plus STPA-Sec
requires information on the hardware, the network
nodes, and the network input/output lists to
identify process control variables and unsafe
control actions.

• Replicable analysis results: CHASSIS depends
more on expert knowledge. In contrast, FMVEA
and STPA are more likely to provide comparable
results, even if the analysis is performed by
different persons.

• Reusability of analysis artefacts: All three
methods use guidewords. FMVEA uses failure modes
and the STRIDE classification. STPA plus STPA-Sec
uses the guide words proposed in (Schmittner
et al., 2016). CHASSIS uses HAZOP keywords.
In all three methods, the quality and completeness
of the keywords will strongly influence the
quality of the analysis.

• Scope of analysis: FMVEA and STPA plus
STPA-Sec depend to a higher degree on the
accuracy of the system model and control
structure. For CHASSIS, “it is possible to expand
the consideration of risk scenarios which do not
arise directly from the system model” (Schmittner
et al., 2015).

• Suitability for a risk rating: FMVEA aims at
rating the risks. STPA plus STPA-Sec and CHASSIS
focus mostly on generating a list of possible
safety and security issues rather than rating them.
A possible combination of the methods is to perform
STPA plus STPA-Sec or CHASSIS analysis to identify
hazards and then use FMVEA for quantitative
comparisons of certain hazards.

• Adaptability to changing context: It is easier for
CHASSIS to consider different usage scenarios
and a changing environment than for FMVEA,
because CHASSIS is less formal and focuses on
high-level analysis. The results of STPA plus
STPA-Sec and FMVEA analyses need to be updated
when the system design changes.


5.2 Applicability of the methods for analysing 
autonomous systems 


Autonomous systems have different levels of
autonomy. Based on our observations of the strengths
and weaknesses of the three methods, we propose
applying different methods for analysing systems
with different levels of autonomy.


• For systems with high automation, STPA plus
STPA-Sec may be more applicable than FMVEA
for analysing interactions between systems, and
interactions between systems and the environment.

• For systems with many sensors, STPA plus
STPA-Sec may be more applicable than CHASSIS.
CHASSIS focuses on sequential messages. In
contrast, STPA deals better with the fusion of
sensor messages that arrive at the same time.
However, STPA plus STPA-Sec also needs to be
improved. The current STPA proposed in (John P.,
2013) is limited to analysing a single control
action. For many cyber-physical systems and
autonomous systems such as an autonomous boat, some
control actions are mutually dependent and might
be issued in pairs. For example, in emergency
cases, the boat needs to change course and slow
down at the same time to avoid collision. Our
solution for analysing mutually dependent control
actions is to add a control action as a process
control variable of another control action, if the
other control action depends on it. For example, in
the table used to analyse control action CA3 (i.e.
control the course), CA2 (i.e. control the speed)
is added as a process control variable with the
values “speed up” and “slow down”.

• For autonomous systems with high-level
intelligence and learning capability, none of the
three methods will be sufficient. AI will make it
harder to review the system due to its increasing
“black box” and “black code” nature and its
learning capability. For those systems, STPA plus
STPA-Sec or CHASSIS analysis may outperform
FMVEA, because the operational level is the same
regardless of the system implementation. STPA and
CHASSIS are good at analysing operational safety
together with the system interaction. For
autonomous systems with learning capability,
however, it is necessary to have continuous
verification along with the learning.
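
The treatment of mutually dependent control actions described in the second bullet (adding CA2 as a process control variable when analysing CA3) can be sketched as follows. The two base process variables are hypothetical; only the CA2 values “speed up” and “slow down” come from the text.

```python
from itertools import product


def contexts(variables):
    """Return every combination of process-variable values as a dict."""
    names = list(variables)
    return [dict(zip(names, values))
            for values in product(*variables.values())]


# Hypothetical process variables for CA3 (control the course).
ca3_variables = {
    "obstacle_ahead": [True, False],
    "traffic_on_starboard": [True, False],
}

# CA2 (control the speed) added as a further process variable of CA3.
ca3_with_ca2 = {**ca3_variables, "CA2_speed": ["speed up", "slow down"]}

plain = contexts(ca3_variables)
extended = contexts(ca3_with_ca2)
print(len(plain), len(extended))
```

Each dependent control action added in this way multiplies the number of contexts to review by its number of values, so the approach trades completeness against analysis effort.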


5.3 Limitations of the study 


One main limitation of this study is that the safety
and security hazards identified may not be
complete, because completeness depends heavily on
domain knowledge and on the guide words. However,
the purpose of the study is to compare the three
methods rather than to identify all hazards of the
system. We believe that, even if other researchers
identify slightly more security and safety hazards
than we did, or identify different hazards for
Revolt, our observations of the main differences
between the three methods remain valid.


6 CONCLUSIONS AND FUTURE WORK 


Many security and safety co-analysis methods have 
been proposed from academia and industry. How- 
ever, few empirical studies have been performed to 
compare and evaluate the methods. In this study, 
we have evaluated three methods using an autono- 
mous boat, called Revolt, as a case study. Results 
of the study show advantages and disadvantages 
of each method. In future work, we will extend and
strengthen existing methods to analyse the safety and
security issues of intelligent and complex control
actions of autonomous systems. In addition, we
need to check the validity of the method by
observing the performance and incidents of the Revolt
system.

ABSTRACT: This contribution examines the special features of physical assets for defence. In particular,
the management of defence assets has to consider not only reliability, availability, maintainability and
other factors frequently used in asset management; such systems should also take into account their
adaptation to changing operating environments as well as their capability to adapt to changes in the
technological context. This study addresses the current situation in which, due to the diversity of
conflicts in the international context, the same type of defence system must be able to provide services
under different boundary conditions in different areas of the globe. At the same time, new concepts from
Industry 4.0 bring rapid changes that should be considered along the life cycle of a defence asset. As a
consequence, these variations in operating conditions and in technology may accelerate asset degradation by
modifying its reliability, its up-to-date status and, in general terms, its end-of-life estimation,
depending on a diversity of factors. This accelerated deterioration of the asset is often known as
“obsolescence”, and its implications are often evaluated (when possible) in terms of costs of different
natures. The originality of this contribution is the introduction of a discussion on how a proper analysis
may help to reduce errors in the decision-making process regarding whether to repair, replace, or modernize
the asset or system under study. In other words, obsolescence analysis, from a reliability and technological
point of view, could be used to decide whether to retain a specific asset fleet, and to understand the
effects of variation in operational and technological factors on the functionality and life cycle cost of
physical assets for defence.


1 INTRODUCTION


Throughout history, technological advances
have always been applied in armed conflicts to
gain superiority over the enemy. Such
conflicts have in many cases served as fields of
experimentation to validate advances that have
subsequently found application in the civilian
world, and vice versa. Likewise, today’s conflicts
apply technology in which the digital transformation
is more and more present. In fact, there is
a historical constant whereby the same means used
to create wealth are those used for warfare, and
vice versa [1]. That is, there are dual-use technologies,
so that commercial developments and innovations
benefit the military sector and vice
versa. To this it must be added that enemies
can also easily obtain these high-tech devices.

Many functions of devices that emerged in the
past keep their purpose today, adapted to the
new circumstances, both operational and technological.
In the case of an armoured vehicle, for example,
the basic characteristics that emerged during
the First World War (protection, mobility and
firepower) are the same that are expected today,
though with the technological advances of
our time. These basic functions are under
continuous development and improvement as
new technologies emerge to
meet specific operational needs. To these functions
is now added a fundamental property, the reliability
of the systems themselves, as well as other
concepts such as availability, maintainability and
safety. This contribution deals with these technical
characteristics from the standpoint of a military
asset, introducing terms such as obsolescence and
useful life of these systems, and relating
these assets and their characteristics to the
new technological tools included under the
concept of Industry 4.0.


2 CONCEPTS OF ASSET MANAGEMENT 
AND ITS ADAPTATION TO THE CASE 
OF DEFENSE 


2.1 RAMS parameters of an asset and its life 
cycle cost 


Reliability, availability and maintainability are
parameters used in asset management whose analysis
is usually known by the acronym
RAM. Frequently an S is added for security
and/or safety, although some authors use it
for sustainability, hence the term
“RAMS analysis”.


• Reliability: capacity of the system not to fail,
i.e., the probability that the system performs
as expected of it

• Availability: proportion of time that the asset is
ready for use (in principle, it does not have
to be operating, although an operational
availability of the asset can also be considered)

• Maintainability: ease with which the asset is kept
in normal operation, i.e. it is inversely
proportional to the effort spent on maintenance activities

• Safety/security: set of measures that are
taken to avoid and prevent accidents, or to protect
from illegal activities


Stricter definitions of the above-mentioned concepts
can be found in references [2], [3] or [4].
These terms are closely linked to the concept of
useful life, which is associated with the time during
which the system continues to fulfil its functions [5].
Over its useful life, an asset must
maintain its value. Each stage of
the life cycle of an asset has associated
costs, and the life cycle cost is the sum of all these
costs. In other words, a life cycle cost analysis must
take into account: the initial acquisition costs of the
physical asset (covering the costs of development
and investment); operational costs; maintenance
costs (planned, corrective, and overhauls);
and the costs associated with the divestiture of
the asset or the dismantling of the installation. If the
asset is still in use beyond its expected life, this last term
may be considered a residual value instead of a
cost (if, for example, the asset is sold to a third party). All
of the above is usually treated from the standpoint
of an industrial physical asset, which tends to have
a defined operating profile and to work, to a greater
or lesser extent, in a controlled environment.
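
The life cycle cost described above, a sum of stage costs in which the final term may be a residual value rather than a cost, can be written as a minimal sketch. The class name, field names and all figures are hypothetical, and no discounting of future costs is applied; a real analysis would normally account for the time value of money.

```python
from dataclasses import dataclass


@dataclass
class LifeCycleCosts:
    """Cost categories named in the text; all figures below are hypothetical."""
    acquisition: float            # development and investment
    operation: float
    maintenance: float            # planned, corrective and overhauls
    disposal: float = 0.0         # divestiture or dismantling
    residual_value: float = 0.0   # e.g. proceeds from a sale to a third party

    def total(self) -> float:
        # The residual value offsets the other costs instead of adding to them.
        return (self.acquisition + self.operation + self.maintenance
                + self.disposal - self.residual_value)


vehicle = LifeCycleCosts(acquisition=1000.0, operation=400.0,
                         maintenance=300.0, disposal=50.0, residual_value=150.0)
print(vehicle.total())  # 1600.0
```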

A military asset, by contrast, operates under the
paradigm of having
to deploy to different environments, with changing
mission profiles and where logistics is critical
to maintaining its performance. That is, when a machine
is acquired for a production process, this process
can be more or less predictable or stable, while in
the case of an asset for defence one must add
the uncertainty of the mission (surveillance, defence,
humanitarian support and other profiles), the
conditions in which it will operate (deserts,
icy areas, forests, urban locations...), and of course
technological change. Consequently, the lifespan
estimated for an industrial asset does not
necessarily hold for a military system. This end of
life of an asset is known as obsolescence, which
can be slowed down if redesigns and updates
throughout the life of the asset are considered.


2.2 Obsolescence and modernization of an asset 


The term “obsolescence” was first used
in relation to basic products [6], [7]. It has since
been employed in the industrial environment
[8], [9], in relation to functional factors (changes in
use), economic factors (the cost of continuing to use
the asset compared with the cost of replacing it with an
alternative), technological factors (the efficiency of the
asset's technology in comparison with the new
alternatives available), or social factors
(user trends, changes in legislation or health and
safety regulations...) [10], [11]. These
shifts change the durability and obsolescence of the asset [12].
Therefore, there are tangible factors, such as the
functional and economic ones, more related to
the depreciation of the asset, and intangible factors,
such as the technological and social ones, which are more
subjective or related to market prospects [13].
Accordingly, maintenance activities are linked
to functional and economic obsolescence,
preserving the value of the asset in a physical sense
and combating its economic depreciation.
In this sense, there is a time-dependent relationship
between obsolescence and reliability-based
analysis according to functional and
economic factors [14] (life cycle costs). In other
words, maintenance should be assessed by comparing
the value of the asset with the cost involved
in repair or replacement. In this analysis, it may be
the case that the decision to repair the asset results
in a disproportionate expense to preserve the value
of the investment [15]. Modernization activities,
on the other hand, are more closely linked
to combating social and technological obsolescence.

In both cases, an improvement in maintenance
and/or a possible modernization need not
go exclusively through direct changes to the asset
itself. It can also relate to assuring the execution
of tasks, training, the distribution of spare parts and
tools, the logistics required to coordinate all
these aspects, etc., allowing the system to be available
and in operation as long as possible, maintaining
and improving its performance during its lifetime.
That is, new technologies can and must
influence improvements not only to the assets
themselves, but also to
the integrated logistics support provided to them.


3 INNOVATION IN MILITARY ASSETS 


As seen in the previous section, in order to
slow down the obsolescence of a military asset,
modernizations can be incorporated
(in the asset and/or in its logistical support) that
take advantage of the new technologies now
at our disposal. It is important to emphasize
that these advantages apply not only to
the assets themselves; they can also facilitate and improve
all the surrounding activities and needs
that are essential to keep the system “alive”. With that
intention, it is relevant to deal with the Industry 4.0
concept, providing a panoramic view of possible
applications in military assets, focusing on the
possibilities in terms of integrated logistic support
with a view to improving precisely the reliability,
availability, maintainability and safety of assets for
defence.


3.1 Evolution to the “4.0”


The term Industry 4.0 refers to the
fourth Industrial Revolution, on the understanding that
the first arose with the steam engine at the end of
the XVIII century, the second with the introduction
of electricity and series production
in the last third of the XIX century, and the
third with the automation of factories
in the XX century. Today,
the fourth Industrial Revolution has to do with the
digital transformation applied to industry in
the search for connectivity and operational excellence.
It was named for the first time in a study
conducted in Germany in 2011: “Smart Manufacturing
for the Future”, Germany Trade and Invest
[16]. Nine technologies are usually associated with
the term Industry 4.0, as shown
in the following table (Table 1).


Table 1. Technologies associated with the industry 4.0 concept.

Technology / Description

Robotics and computer vision
• Cooperative and autonomous robots.
• Numerous integrated sensors and standardized interfaces.

Additive manufacturing and 3D scanner
• 3D printing, especially for prototypes and spare parts.
• Decentralized 3D installations to reduce inventory and transportation distances.
• Flexibility of forms, speed from not needing tools, cost savings.

Augmented and virtual reality
• Augmented reality for maintenance, logistics and all kinds of operational procedures.
• Supporting information display, for example, through smart glasses.

Simulation and modelling
• Simulation of value networks (flows).
• Optimization based on data in real time from intelligent systems.

Horizontal and vertical integration
• Integration of data between companies based on standards of data transfer.
• Precondition for a fully automated value chain (the company customer, the plant management).

Industrial Internet (IoT, Internet of things)
• Network of products and machines.
• Multi-directional communication among objects in the network.
• Cyber-physical systems.

Cloud computing
• Management of huge volumes of data in open systems.
• Communication in real time for production systems.

Cybersecurity
• Operation in networks and open systems.
• High level of networking between machines, products and intelligent systems.

Big data and data analysis
• Complete evaluation of the available data (for example, ERP, SCM, CRM, and data of machine).
• Support and optimization in real time for decision-making.


Apart from these nine technologies, others are
sometimes added, such as autonomous or
unmanned vehicles and aircraft (drones),
new materials (graphene), artificial intelligence,
digital platforms..., which have their place in the nine
mentioned categories. All of these technologies
undoubtedly have applications in the military
sphere. In this sense, one of the first objectives of
digitization can be precisely:


• Support the concept of the life cycle of military
assets.

• Determine the needs for services and infrastructure
in the evolution of the means of
defence.

• Study the incorporation of software for managing
large volumes of data that allows the use of artificial
intelligence methods.

• Evaluate trends.

• Capture lessons learned.

• Control and manage the asset life cycle and
engineering.


In general, include the concept of logistics
support 4.0.

That is, apart from the operational improvements
that may benefit the assets themselves, a first
advantage is that the management of defence assets
can be improved in maintenance (corrective actions
and scheduled inspections), in the management of
spare parts and material (supply of spares, tools and
consumables at the right time and place), in the adaptation
of systems to operational requirements, in the control of
asset configuration (design modifications) and, in
general terms, in the determination, evaluation and
improvement of support along the life cycle through
the updating of the army's support capabilities.


3.2 Technological trend of Defence 4.0 


The first half of the XXI century is characterized
as a period of great political, social and ecological
changes (shifts of geopolitical focus,
growth of the world population, consumption
of large amounts of resources, effects on the
biosphere...). All this happens in a highly globalized
world and in parallel with the greatest scientific
and technological transformation in history
[17]. This context implies complex scenarios with
increased uncertainty in which the effect of technology
cannot be ignored. Recent times are
the most technologically advanced in the history of
mankind, driven by the digital revolution, but also
by biology and nanotechnology [1]. In the digital
realm, a growing convergence of the cyber, physical
and biological domains appears, with more complex
and higher value-added products and services.
Therefore, to the conventional operating environments
(land, sea and air) must now be added the virtual
or cyber space (Figure 1).


Figure 1. New operational environments.


In general terms, we find an increasingly globalized
combat environment, more connected and
with a greater presence of digitized and automated
means, which (by comparison with industry)
could be called “Defence 4.0”. All this obliges
armies to adapt better to change,
because there are factors that radically transform
the character of warfighting. The defence sector
must therefore draw on the digital revolution
in aspects such as real-time connectivity (internet
of things, “low cost” sensing), collaborative
robotics (drones, autonomous vehicles, 3D printing,
exoskeletons) and automation (augmented
and virtual reality, artificial intelligence, big data,
cloud processing...).

The application of these technologies must
take into account the platform on which they are
incorporated (aircraft, vehicles, weapons, general
equipment systems...), the means of implementation
(software, hardware...), as well as logistics and
tactical elements (support equipment, training
and simulation systems, digital stands for technical
documentation, mission planning systems...) [18].


4 CONCLUSIONS 


Today, advances in technology and changes in
operating environments oblige the designs of
defence assets to be flexible and to evolve
over the life of the system. In other words,
keeping a fleet of military assets up to date
and fulfilling operational demands require
redesigns and upgrades throughout the life of the
items, as well as streamlining everything related
to their logistical support. To emphasize
the scientific issues addressed in this paper,
it is relevant to explore the current willingness of
industry, together with universities and research
centers, to collaborate and constitute a kind of
community of interest in the areas of maintenance
[19], logistics support [20] and, in general terms,
the integration of technological solutions [21], in
order to overcome the problem of obsolescence in
the defense sector.

As observed throughout this document,
the application of technologies associated
with the Industry 4.0 concept is without doubt a
great help in increasing the usefulness and efficiency
of assets, improving both their operation and
maintenance. For example, large-scale data collection allows
maintenance to be scheduled according to the different
mission profiles, reducing costs by preventing
unnecessary actions and tasks. All of this aims
at greater ease, flexibility
and immediacy in the supply of spare parts, improved
and customized maintenance policies, and
other aspects which, in summary, affect the
reliability, availability, maintainability and safety
parameters of the assets.

As future lines of action and research, the
development of unified and simplified protocols
for analysis and work processes is suggested.
Depending on the case, it will be desirable to
examine the applicability of artificial intelligence
methods (Machine Learning, Deep Learning),
framed initially (for simplicity) in the concept of
Integrated Logistic Support (ILS) 4.0, for the
establishment of decision-making processes, data
flows and organizational structures that can
be implemented in the ILS of the assets themselves. As an
academic added value, this paper is aligned with
the principles established by the current
international defense sector, in the search for greater
logistic efficiency [22] through an interconnected
system among the members (armies, governmental
organizations and companies from the allied
countries), where innovation is encouraged and
promoted [1], [16]. Finally, note that having
operational armed forces today requires an advanced
and competitive technological and industrial
manufacturing base. This should be a priority strategic
objective for the industrialization and modernization of
the defence sector.

ABSTRACT: A military operational plan helps prepare the armed forces for future security challenges.
Planning necessitates making assumptions about the future, and these assumptions entail operational
risks, i.e. possible negative operational effects if the assumptions fail. Hence, a thorough risk analysis
should be part of the operational planning process. In this paper, we propose a method for analysing
operational risks and vulnerabilities in support of operational planning. In particular, we focus on risks
related to assumptions about the availability of critical capabilities. Using a bow-tie model where the
undesired event is a capability gap, we explore likely causes and consequences by combining gaps with
identified vulnerabilities. The outcome of the analysis is an overview of operational risks. The method is
applied to an example within a fictitious planning scenario. While developed primarily for military
planning, the method is also relevant for security risk analysis in the civilian domain.


1 INTRODUCTION 


A military operational plan is developed to pre- 
pare the armed forces for future security chal- 
lenges. The future is inherently uncertain; hence, 
we should aspire to develop plans that are robust 
and adaptable both in the short-term and long- 
term perspective. This also necessitates making 
planning assumptions, i.e. assertions about
the future that underlie the plan (Dewar et al.
1993). Uncertainty and the associated assumptions
entail operational risks, which can be expressed
as the combination of the probability that something
goes wrong and the related operational
consequences, i.e. negative effects on the operational
effectiveness.

In order to facilitate development of relevant 
plans, we need to balance their desired properties, 
such as robustness, adaptability and flexibility, with 
the planning assumptions. These could be assumptions
about the future threat environment and the
availability of critical capabilities and resources to
handle different situations. These assumptions are
vulnerable to future changes and thus should be
monitored and followed up in order to ensure
relevant plans.

Most military plans are developed following the 
Operational Planning Process (OPP) described in 
the Comprehensive Operations Planning Directive 
(COPD) (NATO 2013). COPD focuses on what to 
do, i.e. which planning products to produce, but 
not so much on how to do it. The COPD planning 
process results in comprehensive plans that include 


planning assumptions and risk assessments. How- 
ever, the process usually stops short of identify- 
ing adequate indicators which can help determine 
when assumptions are violated or failed, i.e. when 
it is time to revise the plans. 

There are various sources of uncertainty to 
include in the planning process, e.g. adversaries’ 
intentions and capabilities, own and allied capabil- 
ities and the properties of the operational environ- 
ment. In order to develop relevant plans, all these 
sources need to be considered. Ignoring uncer- 
tainty and associated risks is not a viable option. 
According to Walker et al. (2013), ignorance
may reduce the ability to take timely corrective
actions and increase the risk of missing emerging
opportunities.

In order to support development of relevant 
plans, we need to explore uncertainties, vulnerable 
planning assumptions and possible consequences 
if assumptions are violated. This information can 
be obtained by a risk and vulnerability analysis. 
Hence, in this paper we discuss how risk and vul- 
nerability analysis can support development of rel- 
evant plans. 

For this reason, we propose a method for risk
and vulnerability analysis based on the well-known
bow-tie model; see for instance Aven (2015) and
Rausand & Utne (2009). The starting point is an
initiating, undesired event, and the model is used
to explore likely causes for and possible consequences
of this event. The model is a multi-method
approach, which means applying different combinations
of quantitative and qualitative methods that
work well together to solve the problem. This
makes the method flexible and adaptable to user
requirements.

The proposed method can be applied to sup- 
port development of relevant plans by testing the 
validity of planning assumptions: What happens 
if planning assumptions fail or are violated? The 
output of this analysis supports development of 
more adaptable plans, and it enables decisions 
about the need to either revise an existing plan or 
develop a new one. Further, the method can be 
used to analyse risks and vulnerabilities in support 
of plan development following the Operational 
Planning Process (OPP). The result of the risk and 
vulnerability analysis can be summarized in a risk 
picture providing necessary information for risk 
management. 

In particular, we focus on risks related to capa- 
bility gaps, i.e. gaps in the ability to perform a cer- 
tain task or function. We explore likely causes and 
consequences by combining capability gaps and 
identified vulnerabilities, where we also assess the 
impact of barriers and implemented measures. The 
outcome of the analysis is an overview of opera- 
tional risks that can be used to validate planning 
assumptions. 


2 BACKGROUND 


A relevant plan is one that is useful for its purpose, 
i.e. preparing for future decision making. Developing
a relevant plan is not trivial, and it involves recog- 
nizing uncertainties and finding the right balance 
between robustness, adaptability and planning 
assumptions. Planning assumptions are neces- 
sary due to the irreducible uncertainties related 
to the future state of the world. Making plan- 
ning assumptions is always challenging, because 
they have a direct influence on the relevancy of 
the plan. Hence, it is important to ensure that the 
assumptions are robust and relevant. The validity 
of assumptions is not static; it will change over 
time and depend on the planning time-horizon. It 
is more likely that the assumptions are valid in the 
short term than in the long term. 

Some assumptions are more critical and vulner- 
able than others, i.e. violation or failure may have 
huge impacts on the operations covered by the 
plan. Thus, it is important to identify assumptions, 
both the explicit and implicit, and to assess how 
critical these are with respect to the plan. Accord- 
ing to Rosenhead & Mingers (2001), a robust plan 
will perform satisfactorily on selected assessment
criteria across a wide range of plausible futures, 
i.e. be less vulnerable to changes (anticipated vari- 
ations) in factors like the behaviour of adversar- 
ies and the operational environment. An adaptive 


plan is a plan that facilitates updating when new 
information becomes available. 


2.1 Assumption-based planning 


Dewar et al. (1993) introduce “Assumption-Based 
Planning” (ABP) as a way to make more robust and 
adaptable plans. ABP is not a tool for developing 
plans, but for improving the robustness and adapt- 
ability of existing plans by identifying and examin- 
ing underlying planning assumptions. Events that 
violate the assumptions can originate from differ- 
ent sources, such as potential opponents/adversar- 
ies, the capability of own and allied forces, and the 
operational environment. ABP is about identifying
and applying relevant signposts or indicators
to detect changes in the vulnerability of planning 
assumptions that may have severe consequences 
for the operation. However, ABP does not give a 
clear recipe for how best to identify these signposts 
or how to determine appropriate threshold values. 
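
The idea of signposts with threshold values can be sketched as follows. This is only an illustration of how an indicator might be tied to a planning assumption and checked against observations; the class, the assumptions, the indicators and the threshold values are all hypothetical, and ABP itself prescribes none of them.

```python
from dataclasses import dataclass


@dataclass
class Signpost:
    """An indicator tied to one planning assumption (all values hypothetical)."""
    assumption: str
    indicator: str
    threshold: float
    higher_is_worse: bool = True

    def is_triggered(self, observed: float) -> bool:
        # The signpost fires when the observation crosses the threshold
        # in the direction that weakens the assumption.
        if self.higher_is_worse:
            return observed >= self.threshold
        return observed <= self.threshold


signposts = [
    Signpost("Critical resources are ready in time", "reaction time (days)", 10.0),
    Signpost("Sufficient supplies are available", "stock level (days of supply)",
             30.0, higher_is_worse=False),
]
observations = {"reaction time (days)": 14.0, "stock level (days of supply)": 45.0}

triggered = [s.assumption for s in signposts
             if s.is_triggered(observations[s.indicator])]
print(triggered)  # ['Critical resources are ready in time']
```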

Risk and vulnerability analysis may be a suitable 
approach to identify vulnerable assumptions and 
possible consequences if these are violated or fail. 
The bow-tie model provides a generic approach to 
risk analysis using an undesired event, e.g. a violated 
or failed assumption, to initiate an analysis of pos- 
sible causes for the event and likely consequences. 
A vulnerability analysis is an integrated part of the 
bow-tie approach, supporting both the cause and 
effect analysis. We believe that a thorough risk and 
vulnerability analysis can support identification 
of signposts with pertaining threshold values, and 
relate these to consequences for the operation. 
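
As a rough illustration of the bow-tie structure described above, the sketch below places an undesired event between its causes and its consequences and lists the barriers on each side. The two causes are taken from the examples given later in this paper; the consequences, the barriers and the class itself are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class BowTie:
    """Undesired event in the centre, causes on the left, consequences on the right."""
    undesired_event: str
    causes: dict = field(default_factory=dict)        # cause -> preventive barriers
    consequences: dict = field(default_factory=dict)  # consequence -> mitigating barriers

    def unprotected(self) -> list:
        """Return the causes and consequences that have no barrier at all."""
        items = {**self.causes, **self.consequences}
        return sorted(name for name, barriers in items.items() if not barriers)


bow_tie = BowTie(
    undesired_event="Capability gap",
    causes={
        "Lack of relevant resources": ["Reserve agreement"],  # barrier is hypothetical
        "Low preparedness of critical resources": [],
    },
    consequences={
        "Reduced performance of a task": ["Alternative course of action"],
        "Objective not attained": [],
    },
)
print(bow_tie.unprotected())
```

Listing the unprotected elements is one simple way to point the vulnerability analysis at the parts of the plan where no barrier or measure has yet been identified.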

In this paper, our primary focus is on assump- 
tions related to the availability and performance of 
own capabilities required to accomplish a specific 
mission. A typical undesired event in this context 
is a capability gap, i.e. the discrepancy between 
the capability requirement and available resources. 
However, the assumption about the availability of
required capabilities is only one of many assumptions
that are usually made in a military contingency
plan. Other assumptions are made related
to the behaviour of the opponent/adversary, own 
Courses of Action (CoA) and properties of the 
operational environment. 
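
A capability gap, understood as the discrepancy between the capability requirement and the available resources, can be computed with a minimal sketch such as the one below. The capability names and quantities are hypothetical, and counting units per capability is a simplifying assumption; real capability requirements are rarely a single number.

```python
def capability_gaps(required: dict, available: dict) -> dict:
    """Return the shortfall per capability; capabilities that are covered are omitted."""
    gaps = {}
    for capability, needed in required.items():
        shortfall = needed - available.get(capability, 0)
        if shortfall > 0:
            gaps[capability] = shortfall
    return gaps


# Hypothetical requirement and availability, counted in units per capability.
required = {"surveillance": 4, "transport": 6, "air defence": 2}
available = {"surveillance": 4, "transport": 3}

gaps = capability_gaps(required, available)
print(gaps)  # {'transport': 3, 'air defence': 2}
```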

For more information about ABP and related 
planning methods, see for instance Walker et al. 
(2013), who provide a review of planning
approaches for adaptation under deep uncertainty.


2.2 Risk analysis 


An overview of risks is a natural part of planning
and decision processes. In our definition, risk
is associated with negative consequences of
uncertainty; however, uncertainty may also introduce
emerging opportunities. In this case, the challenge is
to exploit the “window of opportunity” in order to
gain an advantage. This underlines the importance
of being aware of uncertainties, planning assumptions
and risks.

The bow-tie model provides a basis for devel- 
oping more advanced and detailed risk analysis 
methods. According to Aven (2015), there are three 
main categories of risk analysis methods: simpli- 
fied risk analysis (qualitative), standard risk analy- 
sis (qualitative or quantitative) and model-based 
risk analysis (primarily quantitative). All these 
categories are relevant to support development of 
robust and adaptable plans. 

Regardless of the choice of method, it is hard 
to identify all the vulnerable planning assump- 
tions and to assess their associated risks. A plan 
may contain many underlying assumptions, both 
explicitly and implicitly given, which need to be 
exposed. Further, it is necessary to develop rel- 
evant and plausible scenarios that include violated 
or failed assumptions. Based on this, it should be 
possible to assess the likelihood of a violated or 
failed assumption constituting an undesired event 
for the risk analysis. The next step is to assess pos- 
sible consequences of the undesired event and 
in particular reveal events that may have a huge 
impact on the planned operations, i.e. a high nega- 
tive influence on the attainment of objectives and 
effects. Lastly, the results of the risk analysis are to 
be presented in a risk picture suitable for making 
decisions about the relevance of the plan and the 
need for revision. 
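
The last steps above, assessing likelihood and consequence and presenting the result as a risk picture, can be illustrated with a small sketch. The scoring rule (likelihood multiplied by consequence on 1 to 5 ordinal scales) is an assumption chosen for simplicity, not the method proposed in this paper, and the listed assumptions and scores are hypothetical.

```python
# Hypothetical assessments: (violated assumption, likelihood 1-5, consequence 1-5).
assessments = [
    ("Host nation support available", 2, 5),
    ("Critical capability ready within 10 days", 4, 4),
    ("Weather permits air operations", 3, 2),
]


def risk_picture(items):
    """Rank undesired events by likelihood x consequence (an assumed scoring rule)."""
    scored = [(name, likelihood * consequence)
              for name, likelihood, consequence in items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


picture = risk_picture(assessments)
for name, score in picture:
    print(f"{score:>2}  {name}")
```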

Thus, some basic requirements for the risk and
vulnerability analysis in support of plan development
are:


— Provide a structured approach to identify vul- 
nerable assumptions. 


— Identify likely causes for violated/failed 
assumptions. 
— Identify possible consequences of violated 
assumptions. 


— Identify indicators to detect violated or failed 
assumptions. 


In the next section, we present our suggested 
method. 


3 A METHOD TO SUPPORT 
DEVELOPMENT OF RELEVANT PLANS 


Our method aims to combine ABP (Dewar et al. 
1993) with a standard risk analysis. We use the 
requirements stated in the previous section as guid- 
ing principles for the method development. 
Although the method is developed for analysing 
risks related to capability gaps, the approach 
is generic and can thus be applied to analyse other 
planning assumptions as well. 


3.1 Risk analysis 


Figure 1 shows the main steps of a risk and 
vulnerability analysis based on the bow-tie model (Aven 
2015). 

The first step is to perform proper problem 
structuring and framing. In order to produce rel- 
evant results, it is crucial to understand what deci- 
sions the results of the analysis should support. In 
addition, it is necessary to limit the scope of the 
analysis and determine the analysis object (sys- 
tem). Decision makers and stakeholders should 
participate in formulating the problem and deter- 
mining the scope of the analysis. 

In our method, an undesired event is any event 
associated with violated or failed planning assump- 
tions. Various events and vulnerabilities can be 
underlying causes for the occurrence of undesired 
events like capability gaps. A violated assumption 
can result in reduced performance of an activity/ 
task, which further can affect the outcome of the 
operation. 

In the cause analysis we aim to reveal the causes 
behind an undesired event. There can be several 
causes for a capability gap, e.g. lack of relevant 
resources to fill the capability requirement, lengthy 
reaction times due to low preparedness of critical 
resources, or insufficient sustainability due to lack 
of fuel, food/water or manpower. In order to avoid 
or reduce capability gaps, barriers and functions 
are implemented. A probability estimate of the 
occurrence of a gap should include assessments 
of the effect of both vulnerabilities and barriers. 
The severity of undesired events varies, depending 
on how critical the capability is for the operation. 
Thus, consequences depend on the scenario, the 
chosen Courses of Action (CoA) and the concept 
of operation. 

Figure 1. Risk analysis process. [Figure panels: Cause analysis (resources, activities, vulnerabilities, barriers/measures, probability); Consequence analysis (performance, effectiveness, vulnerabilities, barriers/measures).] 

The results of the cause analysis and the conse- 
quence analysis are combined in a risk picture that 
provides the required information to support the 
planning and decision process. 


3.2 Risk analysis of capability gaps 


Figure 2 gives an overview of the main compo- 
nents of the proposed method for analysing risks 
of capability gaps. 

The method combines capability and risk 
analysis. First, we derive capability requirements 
and compare these to available resources. Identi- 
fied capability gaps are further used as initiating 
events in a risk and vulnerability analysis, looking 
at the likelihood of occurrence of capability gaps 
and possible consequences for the outcome of the 
operation. 

The different steps in Figure 2 are explained in 
more detail in the following. A possible application 
of the method is illustrated by an example. 

3.2.1 Problem structuring and scenario 
development 

Proper problem structuring and framing should 
be the first step of a risk and vulnerability analy- 
sis. Examples of questions we want to answer are: 
What is the purpose of the analysis? Who are the 
end-users? What are their expectations? Which 
resources are available for the analysis? Do rel- 
evant scenarios exist that can be adapted (Malerud 
& Fridheim 2013)? 

Problem structuring can be done by informal 
discussions or by applying more formal methods 
and techniques, such as structured brainstorming, 
soft systems methodology or cognitive mapping. 


For an overview of problem structuring methods, 
see for instance (Rosenhead & Mingers 2001, Pidd 
2003). 

In this paper we use risk and vulnerability 
analysis to identify vulnerable planning assump- 
tions and possible consequences if these are vio- 
lated or fail. More specifically, we look at planning 
assumptions related to availability of critical capa- 
bilities. In Assumption-Based Planning (ABP), as 
presented in the background section, signposts 
are used as indicators for discovering when the 
vulnerability of an assumption is changed. It is 
particularly important to keep track of vulnerable 
assumptions that may have severe consequences 
for the planned operation. 

In the example used to illustrate the method, we 
identify and analyse vulnerable assumptions about 
the availability of capabilities to handle small- 
scale operations as a part of daily, routine military 
operations. The actual plan is a contingency plan, 
which assumes access to capabilities necessary to 
perform specific tasks and functions in order to 
achieve the planned effects and objectives. Capabil- 
ity requirements are expressed using the following 
parameters: Capacity, reaction time, sustainability 
and interoperability. The plan covers small-scale 
crisis events at sea, and our task is to test and vali- 
date the plan in one particular scenario designed to 
challenge certain planning assumptions. Our focus 
is on capability gaps that may hamper or impede 
the planned operation. 

Some relevant assumptions are: 


— We are able to build a relevant situation picture 
to support situation awareness in a certain area 
of operation (AOO) at sea. 

— We are able to react to, document (for forensic 
purposes) and handle an acute situation involving 
illegal fishing. 


The scenario should in particular challenge the 
reaction time of critical capabilities and their abil- 
ity to cooperate and exchange information. 

Scenario development: Fit-for-purpose scenar- 
ios that challenge the planning assumptions are a 
premise for performing the risk and vulnerability 
analysis. Usually, a plan is developed for handling 
a scenario or a class of scenarios; thus, there is a 
scenario base that can be adjusted to fit the purpose 
of the risk analysis. Otherwise, we need to 
develop new scenarios. We will not go into details 
with respect to scenario development here, but 
rather refer to Malerud & Fridheim (2016), who 
present a method for developing fit-for-purpose 
scenarios. This method can be applied to develop 
new scenarios or adapt existing scenarios to fit the 
problems of the analysis. 

Example scenario: Illegal fishing. Fishing vessels 
from a foreign shipping company X perform illegal 
fishing in a Norwegian fisheries protection zone 
in the Barents Sea. In phase I, the illegal activity is 
discovered and triggers a response from Norwegian 
authorities. Surveillance units are allocated to the 
area to collect information and help build a relevant 
situation picture. In addition, units that are capable 
of ending the illegal activity are deployed. 

In phase II, an inspection team from the Norwegian 
Coast Guard (CG) is taken hostage 
on a foreign fishing vessel. This vessel speeds up 
and heads for international waters. 

This incident comes without advance warning, 
and the armed forces are performing daily, routine 
operations. 


3.2.2 Mission and capability analysis 

In the mission and capability analysis, informa- 
tion from a plan is used to develop a mission with 
effects, objectives and a desired end-state (what we 
want to achieve). A particular Course of Action 
(CoA) is developed comprising tasks and activities 
that need to be performed in order to achieve 
the desired effects and objectives of the mission. 
Examples of effects and objectives for our scenario 
are: 


— A relevant situation picture is established and 
maintained. 

— The illegal activities are stopped and documented 
for forensic purposes. 

— The hostage situation is resolved and the hos- 
tage takers arrested. 

— End state: Situation is back to normal. 


An example of a CoA for handling this situa- 
tion is to allocate necessary surveillance units to the 
area, allocate coast guard vessels (CG) to inspect 
and document illegal fishing activities, arrest ves- 
sels that don’t comply, and finally to prepare for 
boarding of the vessel holding the hostages and to 
set them free. 

Required capabilities are: In phase I: maritime sur- 
veillance, command and control (C2), action-team 
for boarding and inspection. In phase II: in addition 
to the above, action-team for hostage release. 

In the following, we will focus on the maritime 
surveillance capability. 


3.3 System analysis 


As shown in Figure 2, the system analysis utilises 
a defined force structure as input, with Structure 
Elements (SE) such as CG vessels, surveillance aircraft 
and maritime Task Forces (TF). 

These SEs are combined into sub-systems 
to fulfil capability requirements. If a capability 
requirement is not entirely fulfilled, the result is 
a capability gap that might have impact on the 
operation. There are many sources that can cause 
capability gaps, e.g. lack of resources, loss/degradation 
or faults/defects of relevant SEs and low 
availability of SEs because they are busy doing 
other missions. Capability gaps arise when these 
sources are combined with vulnerabilities such 
as lack of redundancy, dependencies between 
capabilities, low preparedness and insufficient 
resilience. On the other hand, barriers and mitigating 
actions may have been implemented to 
counter these deficiencies. Hence, both vulner- 
abilities and barriers should be considered in the 
gap analysis. 

A part of the system analysis is to assess the per- 
formance of the SEs or sub-systems on the four 
performance parameters: capacity, reaction time, 
sustainability and interoperability. Input to these 
assessments can be expert judgements, historical 
data and simulations. 

Gaps are identified by comparing system performance 
to capability requirements. Table 2 shows 
an example of a gap analysis of the maritime 
surveillance capability. 


Table 1. Capability requirements for maritime surveillance. 

Task/activity         Capability     Requirement 
Build and maintain    Maritime       Capacity: surveillance aircraft + CG vessel 
a maritime situation  Surveillance   Reaction time: Within 2 hours 
picture                              Sustainability: Duration of operation 
                                     Interoperability: Ability to exchange 
                                     information with other systems and to 
                                     build a common operational picture 


Table 2. Capability analysis of selected capabilities 
related to example scenario. 

Capability     SE/Sub 
requirement    system       Vulnerabilities        Gap 
Maritime       CG vessels   Dependency on other    Reaction time 
Surveillance   Aircraft     capabilities           Interoperability 
               Satellite    Low redundancy 
                            Low preparedness 
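
The comparison of system performance against capability requirements can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `CapabilityLevel`, the function `find_gaps`, and all numeric values except the 2-hour reaction time are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class CapabilityLevel:
    """Values for the four capability parameters named in the text."""
    capacity_units: int       # number of structure elements (SEs) involved
    reaction_time_h: float    # hours until the capability is in place
    sustainability_h: float   # hours the capability can be sustained
    interoperable: bool       # able to exchange information with other systems


def find_gaps(required: CapabilityLevel, available: CapabilityLevel) -> list[str]:
    """Return the parameters on which performance falls short of the requirement."""
    gaps = []
    if available.capacity_units < required.capacity_units:
        gaps.append("capacity")
    if available.reaction_time_h > required.reaction_time_h:
        gaps.append("reaction time")
    if available.sustainability_h < required.sustainability_h:
        gaps.append("sustainability")
    if required.interoperable and not available.interoperable:
        gaps.append("interoperability")
    return gaps


# Requirement: two units (aircraft + CG vessel) within 2 hours.
# The 72-hour duration and the assessed performance are hypothetical values.
required = CapabilityLevel(2, 2.0, 72.0, True)
available = CapabilityLevel(2, 5.0, 72.0, False)
gaps = find_gaps(required, available)
print(gaps)
```

With these assumed inputs the sketch reports gaps in reaction time and interoperability, the same two parameters listed in the gap column of Table 2.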

A capability gap constitutes an undesired event 
that can have an impact on the operations. It is possible 
to estimate the probability of capability gaps of 
varying severity, e.g. by comparing the historical 
availability of resources to the capability 
requirements, or through assessments by Subject Matter 
Experts (SME). Probabilities can be expressed on 
a qualitative scale, e.g. high, medium and low. In 
this case the qualitative scale has to be defined, 
i.e. what is the meaning of a high probability. It is 
also possible to quantify the probabilities by applying 
methods to elicit probabilities, see for instance 
(O’Hagan et al. 2006). 
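
As a rough sketch of the first option, the probability of a gap could be estimated as the fraction of historical observations in which the required resource was unavailable, and then mapped to a qualitative scale. The function names and the thresholds 0.1 and 0.5 are assumptions made for illustration; the paper does not define the scale.

```python
def gap_probability(availability_history):
    """Fraction of observed periods in which the resource was NOT available."""
    if not availability_history:
        raise ValueError("need at least one observation")
    unavailable = sum(1 for available in availability_history if not available)
    return unavailable / len(availability_history)


def to_qualitative(p, low_max=0.1, high_min=0.5):
    """Map a probability to low/medium/high using explicitly defined thresholds."""
    if p < low_max:
        return "low"
    if p < high_min:
        return "medium"
    return "high"


# Hypothetical record: resource available in 8 of 10 comparable periods.
history = [True] * 8 + [False] * 2
p_gap = gap_probability(history)
print(p_gap, to_qualitative(p_gap))
```

Making the thresholds explicit parameters is one way to meet the requirement that the meaning of "high", "medium" and "low" is defined before the scale is used.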

The severity of consequences depends on the 
type and size of the gap. The gap may have huge 
or minor consequences depending on the scenario 
and how critical the capability is. 

In the next step, we assess possible consequences 
of capability gaps, with emphasis on gaps related 
to critical capabilities. 


3.3.1 Consequence analysis 

A capability gap results in reduced capability 
performance, and hence can have a negative effect 
on the output of an activity, which in turn has a 
negative impact on the achievement of the effects and 
objectives of an operation. 

Figure 3 shows an example of an event tree 
analysis of a gap in the maritime surveillance 
capability. 

A moderate gap in the maritime surveillance 
capacity can result in reduced sensor coverage of 
the area of interest, delayed overview of the situa- 
tion (slow reaction time), time periods with insuf- 
ficient surveillance (sustainability) and inadequate 
common understanding of the situation due to 
insufficient interoperability. 

The consequence analysis should comprise a 
thorough assessment of vulnerabilities that can 
lead to severe consequences, as well as associated 
barriers and mitigating actions that are imple- 
mented to avoid or reduce the consequences. 
Some examples of vulnerabilities are dependencies 
between capabilities, low level resilience, 
lack of alternatives and too high ambitions with 
respect to effects and timelines. At the branch 
points in the event tree model, identified vulnerabilities 
are compared to relevant barriers and mitigating 
measures in order to assess consequences 
for output and outcome of the activity/task. Such 
an event tree can be constructed for each capability 
gap. 

Figure 3. Example of an event tree of a gap in maritime 
surveillance capacity. 

Some relevant questions related to the event tree 
are: 


— B: To what extent does the gap in maritime sur- 
veillance capacity cause a significant reduction 
in sensor coverage of AOI (vulnerabilities and 
barriers)? 

— C: Does the reduced sensor coverage result in an 
insufficient situation picture? 

— D: To what extent does an insufficient situa- 
tion picture contribute to insufficient situation 
awareness? 

— How severe is the gap with respect to achieve- 
ment of planned objectives and effects of the 
operation (EH, H, M, L)? Insufficient situation 
awareness influences the ability to make 
timely decisions. 


The event tree is a suitable tool for assess- 
ing consequences. It can be supplemented by e.g. 
quantitative tools such as fault trees and Bayesian 
networks, see for instance Aven (2015). 
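
If branch probabilities are available, the event tree can be evaluated numerically. The sketch below assumes a simple sequential tree in which questions B, C and D are asked in turn and a "no" answer ends the path. The branch probabilities and the assignment of severity levels (L, M, H, EH) to end states are hypothetical and chosen only to show the calculation.

```python
def event_tree(p_gap, p_b, p_c, p_d):
    """Return the probability of each end state of a sequential event tree.

    p_gap: probability of the initiating event (the capability gap)
    p_b:   P(significant reduction in sensor coverage | gap)
    p_c:   P(insufficient situation picture | reduced coverage)
    p_d:   P(insufficient situation awareness | insufficient picture)
    """
    return {
        "L": p_gap * (1 - p_b),               # coverage not significantly reduced
        "M": p_gap * p_b * (1 - p_c),         # coverage reduced, picture still sufficient
        "H": p_gap * p_b * p_c * (1 - p_d),   # picture insufficient, awareness maintained
        "EH": p_gap * p_b * p_c * p_d,        # insufficient situation awareness
    }


# Hypothetical branch probabilities, e.g. elicited from subject matter experts.
outcomes = event_tree(p_gap=0.2, p_b=0.5, p_c=0.6, p_d=0.7)
for severity, probability in outcomes.items():
    print(f"{severity}: {probability:.3f}")
```

The end-state probabilities sum to the probability of the initiating event, which gives a quick consistency check on the tree.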


3.3.2 Risk picture 

Our main task is to identify and analyse vulnerable 
planning assumptions that can cause severe conse- 
quences if they are violated or fail. We have limited 
our scope to look at assumptions about the avail- 
ability of relevant capabilities. The results of the 
risk analysis are combined and presented in a risk 
picture that is suitable for identifying and deciding 
on mitigating measures, in order to support deci- 
sions about the need for revising the plan. The risk 
picture contains a description of critical capabil- 
ity gaps, their probability of occurrence and pos- 
sible consequences of the gaps for the operation 
(operational risks). Figure 4 shows how an opera- 
tional risk picture can be presented for the example 
scenario. 
Figure 4. Example of a risk picture. 
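
One simple way to assemble such a risk picture is to combine a qualitative probability level with the consequence scale (EH, H, M, L) in a risk matrix. The sketch below is an assumption-laden illustration: the scoring rule, the three risk classes and the example entries are hypothetical and are not the content of Figure 4.

```python
PROBABILITY_LEVELS = ["low", "medium", "high"]
CONSEQUENCE_LEVELS = ["L", "M", "H", "EH"]
RISK_ORDER = {"revise plan": 0, "monitor": 1, "acceptable": 2}


def risk_level(probability, consequence):
    """Combine the two qualitative levels into a risk class (hypothetical rule)."""
    score = PROBABILITY_LEVELS.index(probability) + CONSEQUENCE_LEVELS.index(consequence)
    if score <= 1:
        return "acceptable"
    if score <= 3:
        return "monitor"
    return "revise plan"


def risk_picture(gaps):
    """gaps: list of (name, probability, consequence); most severe rows first."""
    rows = [(name, p, c, risk_level(p, c)) for name, p, c in gaps]
    return sorted(rows, key=lambda row: RISK_ORDER[row[3]])


# Hypothetical assessments for the example scenario.
picture = risk_picture([
    ("maritime surveillance: interoperability", "low", "M"),
    ("maritime surveillance: reaction time", "medium", "H"),
    ("hostage release team: reaction time", "high", "EH"),
])
for row in picture:
    print(row)
```

Sorting by risk class puts the gaps that most strongly suggest a plan revision at the top, which matches the stated purpose of the risk picture.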
4 DISCUSSION 


In this paper we study the question: how can risk 
and vulnerability analysis support development of 
relevant military plans? What we want to achieve are 
robust and adaptable plans that help decision makers 
achieve overarching goals and objectives, as described 
in Dewar et al. (1993) and Walker et al. (2013). 

A plan will always, due to irreducible uncer- 
tainty, rely on a set of underlying assumptions. 
Making assumptions is necessary, but this should 
be done with great care because of the potentially 
huge consequences if the assumptions are violated 
or fail. 

In Assumption-Based Planning (ABP) (Dewar 
et al. 1993), signposts are identified and applied 
to monitor if the vulnerability of a certain plan- 
ning assumption is changed. ABP is one feasible 
approach to obtain more adaptive plans. It is, however, 
not clear how signposts are identified. 

Many military plans are developed using the 
COPD process (NATO 2013). COPD recognizes 
the importance of adaptability and flexibility; 
however, it gives little guidance on how to achieve 
adaptable and flexible plans. In addition, the 
COPD process lacks adequate measures for when 
and how a plan should be revised. 

Hence, it is necessary to establish a method for 
detecting invalid assumptions that potentially can 
have huge impact on an operation. A proper risk 
and vulnerability analysis seems to be a promising 
line of approach. 

This paper outlines a method for combining a 
capability and risk analysis to test the validity of 
planning assumptions, in particular assumptions 
related to the availability of critical capabilities for 
a military mission. Based on the bow-tie model, 
likely causes for a violated or failed assumption 
(capability gap) are explored. There are various 
sources that can cause a violated assumption, e.g. 
properties of the opponent/enemy, availability and 
performance of own and allied forces and proper- 
ties of the operational environment. By identifying 
and assessing vulnerabilities, relevant barriers and 
measures implemented to reduce or avoid violation 
of assumptions, it is possible to obtain an overview 
of vulnerable assumptions and the likelihood that 
they are violated. This again will aid the develop- 
ment of signposts to monitor the vulnerabilities. 
This analysis relies on fit-for-purpose scenarios 
that challenge the planning assumptions. Ideally, 
one should test the planning assumptions in a wide 
variety of scenarios. However, in practice it is only 
feasible to include a few scenarios in the analysis. 
Thus, it is crucial to determine scenarios that cover 
different planning assumptions. To support this 
process, we suggest applying a scenario development 
method as described in Malerud & Fridheim 
(2016). Simulations may be a feasible approach to 
test assumptions in a larger set of scenarios and 
to identify the most challenging scenarios with 
respect to the assumptions. 

In order to reveal which assumptions are most 
critical for achieving the objectives of the opera- 
tion, it is necessary to perform a consequence 
analysis. We apply the proposed method to assess 
consequences of capability gaps by first considering 
how gaps influence the output of the activity 
performed by the capability. Depending on the 
type and size of the gap, it will affect the output 
differently. This assessment also comprises the 
effects of identified vulnerabilities and barriers. In 
the next step, we assess consequences of reduced 
performance on the achievement of effects and 
objectives of the operation. Depending on how 
much the performance is reduced and the critical- 
ity of the capability, the gap will affect the outcome 
of the operation differently. 

The proposed risk and vulnerability analysis is a 
multi-methodology that can adopt different com- 
binations of methods covering the three main cat- 
egories of risk analysis methods outlined in Aven 
(2015): simplified risk analysis which relies on 
qualitative methods, standard risk analysis which 
can be both qualitative and quantitative and model-based 
risk analysis which is mostly quantitative. 
The actual choice of method depends, among other 
things, on availability of data and information, time 
and resources available for the analysis, and stake- 
holder expectations/requirements. This should be 
clarified in the initial problem structuring. 

The example used to illustrate an application of 
the method is simplified and qualitative. However, 
it is possible to utilize more quantitative methods 
to enhance the quality of the risk analysis. Meth- 
ods such as event trees, fault trees and Bayesian 
networks are promising candidates. 

Although the proposed method has only been tested on 
simple cases, we believe risk and vulnerability analysis 
is a promising way to support the development 
of robust and adaptable plans. In order to refine and 
validate the method, we need to apply it to more 
cases involving real military contingency plans. 


5 SOME FINAL REMARKS 


We are currently not in a position to draw any 
firm conclusions about the applicability of the 
method; the results are preliminary. However, our 
experience so far indicates that the 
method is appropriate for supporting development 
of relevant plans: 


— It supports deriving signposts indicating when 
the vulnerability of an assumption changes. 
— It can provide a risk picture comprising vulnerable 
planning assumptions, possible events/ 
situations that can cause an undesired event 
(capability gap), and possible operational conse- 
quences if these assumptions are violated or fail 
(operational risk). 

— It gives traceability between underlying causes 
for capability gaps and consequences for output 
and outcome of the operation. 

— It provides a structured and traceable approach. 


Although the method is developed to support 
military operational planning, we also believe it 
has a wider area of application, for instance to 
help develop robust and adaptable contingency 
plans in the civilian domain. 
ABSTRACT: With the fast-paced development of the Internet of Things and its applications within 
the emerging field of Industry 4.0 (decentralizing decisions by remotely monitoring data and automata), 
the security and reliability of the whole communication pipeline between the connected devices 
taking part in this smart industry become crucial. In the context of embedded systems, microcontrollers 
are widely preferred over microprocessors because they are cheaper, smaller and consume less energy. 
Unfortunately, implementing security features such as signing and ciphering functions on 
microcontrollers can greatly reduce the availability of embedded systems, because these functions are 
energy consuming and computationally complex. Thus, a trade-off has to be found between the prescribed 
levels of availability and security. This trade-off depends greatly on how the embedded 
systems will be used, how they are supposed to communicate with each other, and whether a central node 
with high computing resources is available. For instance, a common architecture consists of several 
embedded systems communicating up and down with a single server. This architecture is used in 
several areas where a monitor must supervise and process data, which is why this setup is chosen here. 
The present paper proposes a method to reach the right trade-off between security and availability, 
depending on the available resources. This problem is difficult to address because it is hard 
to measure the security or the availability of a system. Solutions featuring those characteristics 
and a generic approach are presented to find the most suitable trade-off, in the use case of Industry 4.0. 


1 INTRODUCTION 


Embedded systems have communicated with each 
other since the beginning of computer engineering. 
Today, these objects communicate on 
a network that grows larger every day: the Internet. The 
expression “the Internet of Things” has been in use 
for several years now. However, the expression “embedded 
systems” covers many things: smartphones, 
automobile electronics, sensors, and miniaturized 
computers like the Raspberry Pi. These objects have 
different functionalities, hence different computational 
power. For example, a communicating 
sensor embedded in a mechanical part has 
physical constraints. Components such as the micro-controller, 
battery, antenna, etc. will be affected 
by these constraints. If a smartphone, because of 
these constraints, is less powerful than a computer, 
a communicating sensor will be much less powerful 
than a smartphone. These constraints alone can define 
connected objects and their issues (Agrawal & 
Das 2011). 

The ecosystem of embedded systems has changed 
drastically since their emergence. Embedded systems 
used to be connected to a physical interface, which 
could be physically secured. But with the growing 
number of communicating objects, exchanging data with 
a server (for supervision purposes) or with other 
objects, physical security to access an object is no 
longer sufficient (Sagstetter et al. 2013). 
Whether the communication is radio-based, wired, 
or other, it is necessary to secure data exchanges. 
Security is a word that brings together many concepts. 
These concepts can be split into three categories: 


• Confidentiality: The data are exchanged without 
anyone else being able to understand them. 

• Integrity: A modification of the exchanged data 
is detectable. 

• Authentication: The data are signed by an emitter. 
The emitter cannot repudiate the message. 
Another entity cannot impersonate the emitter. 


These three categories give rise to several needs that 
embedded systems must meet. Regardless of the 
system, the tasks addressing these needs will take 
time and energy. Moreover, embedded systems 
will have different options to address these needs, 
depending on their capabilities. 

Comparing all possible embedded systems 
is beyond the scope of this paper. That is why this 
study focuses on a specific constrained embedded 
system based on a small, procedural, real-time 
micro-controller with a transceiver. Possible applications 
for such a micro-controller include the 
mass production of communicating sensors for 
Industry 4.0 or home automation. 




2 PROBLEM AND CONTEXT 


2.1 Security definitions 


Authentication, integrity and confidentiality are 
based on mathematically complex functions. From 
the emitter's point of view, authentication can be 
provided by a signature algorithm (S), integrity 
by a hash algorithm (H), and 
confidentiality by an encryption 
algorithm (E). From the receiver's perspective, 
authentication can be provided by a verification 
algorithm (V), integrity by the 
same hash algorithm (H), and confidentiality 
by a decryption algorithm (D). 
These functions take some time to execute: 


$T_S$: the time taken by S to sign a message 
$T_H$: the time taken by H to hash a message 
$T_E$: the time taken by E to encrypt a message 
$T_V$: the time taken by V to verify a signature 
$T_D$: the time taken by D to decrypt a message 


Moreover, security also influences auton- 
omy because these functions use computational 
power, hence energy, and reduce the autonomy 
of an embedded system. A balance must be found 
between security and energy consumption, reli- 
ability and real time constraints (Jiang et al. 2017). 
Nevertheless, this study focuses on the availability 
of the receiver from a real-time 
perspective. Autonomy, influenced by the energy 
consumption of the cryptographic functions, is 
not considered in this paper. 


2.2 Embedded systems secured communications 


Consider two embedded systems as in Figure 1: an 
emitter and a receiver. The emitter must: 


1. Hash its message 
2. Sign the hash of its message 
3. Encrypt its message 


The computing time for the emitter will be 


$T_{\text{emitter}} = T_H + T_S + T_E$ (1) 

Figure 1. RSA signature. 


And the receiver must: 


4. Decrypt the message 

5. Hash the decrypted message 

6. and 7. Verify the signature and compare with 
the hash previously calculated. 


The computing time for the receiver will be 

$T_r = T_D + T_H + T_V$ (2) 


The signature (emitter side) and decryption 
(receiver side) have to be done with a private key. 
The receiver must be able to communicate with any 
emitter, and because of memory limitations, it is 
not possible to store the keys of all the emitters on 
the receiver memory. 

A private key, $k_{d_1}$, is used on the receiver for 
decryption: only the receiver can decrypt the messages 
from the emitter. These messages are 
encrypted with the public key $k_{e_1}$ on the emitter. 

Another private key, $k_{d_2}$, is used on the emitter for 
signing: only the emitter can sign its own messages. 
These signatures are verified by the receiver 
with the public key $k_{e_2}$. Two key pairs are 
thus needed: $K_1 = (k_{e_1}, k_{d_1})$ and $K_2 = (k_{e_2}, k_{d_2})$. 

The receiver potentially manages many emitters. 
For example, a receiver in a nuclear power plant 
receives messages from hundreds of emitters. Once 
the messages are received, they must be processed 
quickly so that the data are supervised in real time. 

When the receiver receives too many messages at 
the same time, messages are lost because 
the receiver is busy decrypting and verifying 
earlier messages. 

This study offers some methods to approach 
and quantify what “too many messages” and “at the 
same time” mean. 


3 APPROACH AND METHODOLOGY 


3.1 Security level and algorithms 


First, some variables need to be fixed: the security 
level chosen is the most secure in the sense that it 
covers the confidentiality, the integrity and the 
authentication of the data. 

Not all messages in real life need to be so 
secured, but for the purpose of this article, the 
problem is simplified to focus on the workload of 
the receiver. Indeed, the receiver will be busy for a certain 
amount of time checking these three security 
parameters. This duration must first be calculated. 

RSA-1024 bits with SHA1 is chosen, mainly for 
the simplicity of the implementation, thanks to a 
C library from Microchip. Each function is assigned 
a specific algorithm: 

S: RSA-1024 bits sign 

H: SHA1 

E: RSA-1024 bits encrypt 
V: RSA-1024 bits verify 
D: RSA-1024 bits decrypt 


The RSA encryption-decryption functions 
can be considered the same as the RSA signature-verification 
ones since the mathematical operations 
are the same (Pawar & Ghumbre 2016). 
Indeed, in our case, the only difference is the key 
used for each operation: 


• Encryption and decryption: 
$m^{e_1} = c \pmod{n}$ (3) 
$c^{d_1} = m \pmod{n}$ (4) 


• Signature and verification: 
$m^{d_2} = s \pmod{n}$ (5) 
$s^{e_2} = m \pmod{n}$ (6) 
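
The four operations are the same modular exponentiation with different keys, which can be seen in a toy sketch. The key values below are hypothetical textbook-sized numbers, far too small for real use and unrelated to the RSA-1024 keys of the paper. Each key pair gets its own modulus here, and no padding or hashing is applied.

```python
# Toy key pairs (hypothetical values, for illustration only).
# Pair 1 protects confidentiality: public exponent E1, private exponent D1.
N1, E1, D1 = 3233, 17, 2753    # N1 = 61 * 53
# Pair 2 authenticates the emitter: private exponent D2, public exponent E2.
N2, E2, D2 = 143, 7, 103       # N2 = 11 * 13


def encrypt(m):
    """Emitter side: c = m^e mod n, using the receiver's public key."""
    return pow(m, E1, N1)


def decrypt(c):
    """Receiver side: m = c^d mod n, using the receiver's private key."""
    return pow(c, D1, N1)


def sign(m):
    """Emitter side: s = m^d mod n, using the emitter's private key."""
    return pow(m, D2, N2)


def verify(m, s):
    """Receiver side: accept if s^e mod n equals the message."""
    return pow(s, E2, N2) == m


message = 65                   # must be smaller than both moduli
ciphertext = encrypt(message)
signature = sign(message)
print(decrypt(ciphertext) == message, verify(message, signature))
```

Python's built-in three-argument `pow` performs the modular exponentiation. On a micro-controller the same operation dominates the execution time, which is why the size of the exponent matters for the workload.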


The PKCS standard is used for the decryption 
and verification (on the receiver), so the function 
is mathematically optimized. Moreover, to balance 
the workload on the receiver, smaller exponents 
are used on its side, so the workload for one 
ciphering or signature is larger on the emitter. 
The workload per message is thus smaller 
on the receiver than on the emitter, but the receiver 
manages several messages. 

That is why equations 4 and 6 take a smaller 
amount of time than equations 3 and 5. 

Using a smaller exponent as a private key on the 
receiver side for decryption may be a security issue: 
attackers can guess the private key more easily if 
they suppose it is a small exponent. 


3.2 Embedded systems definition 


The embedded system is an important variable 
in this problem. A modest micro-controller has 
been chosen to amplify the workload of both the 
receiver and emitter. Choosing a micro-controller 
means fixing the CPU speed, the Program Mem- 
ory and the RAM. Each of these three variables 
can influence the security in some ways, because 
of memory limitation for example. 

The measurements were conducted on a 
dsPIC30F3014 from Microchip with the specifica- 
tions described in Table 1. 

An output of the micro-controller was set to 1 
during the operation, so the amount of time spent 
by the micro-controller to do each operation can be 
checked. 


Table 1. dsPIC30F3014 specifications. 


Parameter Value 
Architecture 16-bit 
CPU Speed (MIPS) 30 
Memory Type Flash 
Program Memory (KB) 24 
RAM (KB) 2 


3.3 Variables 


Let us denote by $T_r$ the total duration needed for the
receiver to verify the security (see 2), by $T_{em}$ the
duration between two emissions of the same emitter,
and by $N$ the number of emitters.

Other durations, such as the time to send data to a
central server, the time to read an input on the
receiver or the time to read the message from the
antenna, can influence the availability of the system.
However, these functions were deliberately omitted to
simplify the problem and to expose how security
specifically influences the availability (Jiang et al.
2012).

From these data, the number of messages a
receiver can process without being overloaded can be
determined, depending on the security (more
specifically, on the time spent on it):


$$N = \left\lfloor \frac{T_{em}}{T_r} \right\rfloor \tag{7}$$


4 RESULTS AND DISCUSSIONS 


4.1 Measurements 


Figure 2 shows that, for a specific $T_r$
and $T_{em}$, the receiver is fully occupied. Graphically,
it can be seen that beyond 10 emitters the receiver
will miss some messages.

The assumption that messages follow one another
was made to simplify the problem. Real-life emitters
can send messages at the same time, overlapping
each other's time frames. A CSMA/CA-like scheme can
be implemented by introducing a random delay between
messages, but it would eventually return to this
simple case of receiver overload.
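
The overload argument can be explored with a small simulation. The sketch below is an assumption-laden model rather than the authors' setup: emitters start evenly staggered, an optional uniform random delay stands in for the CSMA/CA-like back-off, and the receiver simply drops any message that arrives while it is still busy. The function name `missed_fraction` and the 10-second horizon are hypothetical; the two durations are the ones derived from Table 2.

```python
import random


def missed_fraction(n_emitters, t_em_ms, t_r_ms, horizon_ms, jitter_ms=0.0, seed=0):
    """Fraction of messages that arrive while the receiver is still busy."""
    rng = random.Random(seed)
    arrivals = []
    for k in range(n_emitters):
        t = k * t_em_ms / n_emitters       # evenly staggered start offsets
        while t < horizon_ms:
            arrivals.append(t + rng.uniform(0.0, jitter_ms))
            t += t_em_ms                   # next emission of the same emitter
    arrivals.sort()

    busy_until = 0.0
    missed = 0
    for arrival in arrivals:
        if arrival < busy_until:
            missed += 1                    # receiver still processing: message lost
        else:
            busy_until = arrival + t_r_ms  # receiver starts processing this message
    return missed / len(arrivals)


T_EM_MS, T_R_MS = 317.6, 13.6              # durations derived from Table 2
for n in (23, 24):
    ideal = missed_fraction(n, T_EM_MS, T_R_MS, horizon_ms=10_000.0)
    jittered = missed_fraction(n, T_EM_MS, T_R_MS, horizon_ms=10_000.0, jitter_ms=5.0)
    print(n, "emitters: missed", ideal, "without jitter,", jittered, "with jitter")
```

With perfectly staggered emitters the model loses no message as long as the spacing between arrivals exceeds the receiver's processing time. Random delays only redistribute the arrivals and do not create extra receiver capacity.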

The execution times of the different operations
are listed in Table 2 and were measured with an
oscilloscope.

The total duration of unavailability for the 
receiver when receiving a message is 


$$T_r = 6.8 + 6.8 = 13.6\ \text{ms}$$


The minimum time frame for an emitter between
two message emissions is:


$$T_{em} = 158.4 + 159.2 = 317.6\ \text{ms}$$


Figure 2. Receiver process time occupation.

Table 2. Operations and execution times.

Operation                       Execution time
RSA 1024 signature + SHA1       158.4 ms
RSA 1024 verification + SHA1    6.8 ms
RSA 1024 encryption             159.2 ms
RSA 1024 decryption             6.8 ms
If emitters never sleep and continuously send
messages to the receiver, each emitter delivers a
message every 317.6 ms. The number of emitters a
receiver can manage in the optimal scenario, where
messages are sent one after another, is therefore:


$$N = \left\lfloor \frac{T_{em}}{T_r} \right\rfloor = \left\lfloor \frac{317.6}{13.6} \right\rfloor = 23 \tag{8}$$


Thus, 23 emitters can be associated with a
receiver.
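
The arithmetic behind equations 7 and 8 can be reproduced in a few lines. This is a minimal sketch using the execution times of Table 2; the function name `max_emitters` and the constant names are assumptions introduced for illustration.

```python
import math

# Execution times from Table 2, in milliseconds.
T_SIGN_MS, T_VERIFY_MS = 158.4, 6.8
T_ENCRYPT_MS, T_DECRYPT_MS = 159.2, 6.8


def max_emitters(t_em_ms: float, t_r_ms: float) -> int:
    """Equation (7): N = floor(T_em / T_r)."""
    return math.floor(t_em_ms / t_r_ms)


t_r = T_VERIFY_MS + T_DECRYPT_MS     # receiver busy time per message
t_em = T_SIGN_MS + T_ENCRYPT_MS      # emitter time per message

n = max_emitters(t_em, t_r)
receivers_for_30 = math.ceil(30 / n)  # receivers needed for a 30-sensor deployment
print(round(t_r, 1), round(t_em, 1), n, receivers_for_30)
```

Swapping in timings measured for another algorithm or micro-controller gives the corresponding N directly, which is how the comparison described in the next section would be carried out.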


4.2 Comparisons and improvements 


This result can be adapted to other scenarios. For
example, RSA-2048 operations will probably take
more time on both emitters and receivers and
thus change the number of emitters N.

Totally different encryption/decryption and 
signing/verifying algorithms can be used and com- 
pared on this embedded system. For example, a 
combination of AES and RSA will probably give 
a different N. 

This comparison tool can also be used to compare
embedded systems instead of algorithms.
For example, a measurement made with AES128-CCM
(a symmetric solution providing encryption,
integrity and authentication) on a CC2538SF53
micro-controller with hardware acceleration
for cryptography gave a computing time of 210 µs
(encryption) and 120 µs (decryption).

Symmetric algorithms, however, make key management
difficult. This paper does not take this complexity
into account, but engineers could use the
asymmetric algorithms to exchange a symmetric key.


This means that several types of messages will
then be exchanged: some encrypted with RSA and some
with AES, for example. This implies different
durations for these messages, but it can easily be
accommodated by this method.


5 CONCLUSIONS 


These results can be used to determine how many
sensors can be assigned to a receiver in a specific
area. Methods such as a header containing the ID
of the sensor can help the receiver decide whether it
must verify a message or ignore it, and thus save time.

If an industrial site needs to deploy 30 sensors,
for instance, then in the case studied previously
two receivers will be needed.

Moreover, this method can be used to compare
micro-controllers, and it can help embedded systems
designers to choose components that better match
their needs.

Further research can add other constraints to the
equation, such as the impact of security on the
battery of the emitter (receivers are assumed to be
plugged into a power source). This way, embedded
systems designers will be able to choose their level
of security depending on the real-time and autonomy
requirements.

ABSTRACT: When an unspecified terrorist threat against Norway was identified in the summer of
2014, the security level was heightened at all of Norway’s 600-plus International Ship and Port Facility
Security (ISPS) Code ports. A type of classification system was necessary in order to identify which ports
required security measures dependent on the different types of situations which could potentially arise. 
However, classifying infrastructures by criticality is a complex task with little guidance available. The 
main challenge was how to decide which level of protection the facilities would require. We evaluated 
whether the ad hoc approach, used to return security levels to baseline, was a good approach, and if there 
were other best practices available. Furthermore, we asked how the more than 600 different facilities could 
apply the same classification system. This paper proposes an approach for how to arrange different port 
facilities into “security profiles”. A general threat assessment was made, based on a literature review and 
contact with security authorities, in order to determine what kind of scenarios would be relevant for the 
ports. We then continued to map the different types of ports, and common denominators for security 
issues. Through the literature review, workshops and input from the Norwegian Coastal Administration, 
we developed value-based categories for criticality. The findings in this paper can help clarify and present 
solutions which might help practitioners overcome challenges related to assessing security and threats 
to their facilities. The approach presented in this paper may also be a useful framework for other critical
infrastructures to help select categories for ranking or classification.


1 INTRODUCTION 


Knowing which infrastructure is the most critical 
or most exposed in security incidents is a great 
challenge, and there are few systematic methods 
available. Several approaches have been developed 
to identify and assess critical infrastructures over 
the years, yet classification and prioritisation of 
asset or value protection remains a difficult task 
given the occurrence of security threats or risks. In 
this paper, we present a method developed to clas- 
sify port facilities into different security profiles. 

This method may also be a useful framework for 
other types of infrastructure when it comes to assess- 
ing criticality. Even though the method presented in 
this paper is customised for the nature, rules and reg- 
ulations of the maritime sector, the broader frame- 
work and the mindset may be applicable for further 
use in other infrastructure classifications. This paper 
aims to present the background for — and develop- 
ment of — this methodological framework. 

Port facilities play a key role in global
maritime transportation, as over 90 percent of
the world’s trade is freighted by sea (IMO, 2017).
Ships engaged in international voyages that
hold an International Ship Security Certificate
(ISSC) can only be served by ISPS-approved port
facilities, creating a security network for global
maritime transportation.

The Norwegian Defence Research Establish- 
ment (FFI) and the Norwegian Defence Estates 
Agency (NDEA) conducted a study in order to 
develop a user-oriented method for classifying 
Norwegian ports and port facilities. The study 
provided guidance in relation to the risk-based 
supervision conducted by the Norwegian Coastal 
Administration (KYV), and on how to determine 
the maritime security level in Norwegian ports and 
port facilities, particularly if the general threat 
level was increased. The focus for this article is the 
latter, where we have developed a method for clas- 
sifying port facilities based on their criticality. 

Various categories for criticality were developed 
in this study, which included ranking tables. Exam- 
ples of categories developed include the number of 
annual passengers, the type of goods being stored 
and transported, and the port facility’s strategic 
importance. Eight ranking tables were used for each
port facility to determine the overall score, and thus
determine which security profile the port facility
belonged to (security profile 1 is low criticality, 2 is
medium criticality, and 3 is high criticality).

It was also recommended that KYV base their 
risk-based supervision on the ports’ security profile 
as well as the Port Facility Security Assessments 
(PFSA). 


2 BACKGROUND 


In the summer of 2014, the general terror threat
level in Norway was raised by the Police Security
Service (PST) (PST 2014). The terror threat was
high, but unspecified, and due to the raised threat
level, KYV decided to raise the security level for all
Norwegian port facilities to security level 2 from
24 July 2014. This resulted in increased costs and
workloads at each port, and when the threat level
remained high, KYV started to lower the
security level to normal (security level 1) by the end
of July, depending on the features of each facility.
Following this situation, KYV wanted a more
substantiated method for determining the ports’
inherent security classification in order to better
differentiate the security levels in situations where
the general threat level is raised, or where certain
threats are specified. Against this background, FFI
and NDEA worked in 2015 and 2016, in close
collaboration with KYV, to develop such a method.


2.1 The ISPS code and the SOLAS convention 


Several international and national laws and
regulations apply to ports and ships in Norway. The
ISPS Code was created as a result of the terrorist
attacks in the USA on 11 September 2001, and was
adopted by the International Maritime Organization
(IMO) in the Safety of Life at Sea (SOLAS)
Amendments in 2002. The objective of the ISPS
Code and amendments was to enhance international
maritime security.

The ISPS Code Part A, and some of Part B, is
applicable in Norwegian regulations. The Code
applies to port facilities that serve the following
ships in international traffic: passenger ships, cargo
vessels over 500 gross tonnage (GT), mobile offshore
drilling units and special purpose ships (SPS). All
ships with an ISSC are considered to be in
international traffic. The ISPS framework
is not applicable to warships, military auxiliary
vessels, ships in state-run, non-commercial
operation, fishing vessels, ships without propulsion
machinery, primitive wooden ships and pleasure
vessels (IMO 2003 and IMO 2012).

The ISPS Code defines three
security levels for port facilities (IMO 2012: 34-36).
KYV is responsible for changes in security level,
and for notifying the Port Security Officers and
Port Facility Security Officers (PSO/PFSO) about
changes in their areas of operation. All port facilities
are required to have a Port Facility Security
Assessment (PFSA) and, in some cases, a Port Facility
Security Plan (PFSP).


2.2 General threat assessment


The authors conducted a literature study and a 
general threat assessment where trends in the mari- 
time sector were described. The analysis was based 
on trend reports from the NCIS (Kripos), the PST, 
and the Norwegian Intelligence Service as well as 
FFI’s research reports and international databases
of terrorist incidents. We also had a conversation 
with PST on general threat trends in Norway, and 
the KYV headquarters presented their incident log 
with various types of threats and security incidents 
in their sector. 

The literature study consisted of reports concern- 
ing Norwegian sea transport risk assessment (Egg- 
ereide et al 2007; Rutledal 2002a and Rutledal 2002b), 
as well as a report describing maritime threat trends 
(Tønnessen 2007) and a threat assessment done by 
PST in 2013. International sources and databases 
(RAND 2006) were used to identify international 
incidents relevant for the general threat assessment. 
The threat assessment includes considerations about 
terrorism, intelligence/espionage, sabotage and other 
crime. Piracy has been left out due to the scope of 
this study focusing on Norwegian ports. 

It is worth noting that there is wide academic 
agreement regarding the high uncertainty with 
security related risks, and, in particular, terror- 
ism risk (Aven 2015; Bruvoll, 2017; Fischhoff 
2002; Renn 2008; Jore and Nja 2012; Weiss 2007; 
DSB 2014; NSM 2016; Busmundrud et al. 2015; 
Maal et al. 2016). This will of course impact risk 
analyses and assessments where these risks are 
addressed. The traditional way of assessing risk, 
where the parameters likelihood and consequence 
result in an estimated, quantitative risk score, can 
be less expedient for security related events (Jore 
and Nja 2012; Aven et al. 2004; Renn 2008, and 
Pettersen and Engen 2010). Maritime terrorism
has been declining in the Northern Hemisphere
in recent years, and one of the main
recommendations from the general threat assessment
was to focus risk assessments on other security
threats as well, not just terrorism.


3 SCOPE 


The general threat assessment that was conducted
in the original body of work will not be further
described in this article. This article focuses on the
method developed, and the preconditions for the
choice of approach.


3.1 Definitions 


Value assessment has the purpose of mapping the
organisation’s values (sometimes also referred to as
“assets”), and considers which of these are most
important for the organisation’s mission and
deliveries. This requires a systematic review of the
consequences that would follow if the values were
affected (NSM 2016:11).

According to Norwegian Standard 5830 the 
term value is defined as “a resource that if affected 
by unwanted influence will cause a negative conse- 
quence for the ones owning, managing or benefit- 
ing from the resource” (NS 5830:2012). A threat 
is defined as a “possible unwanted action that can 
have a negative consequence for a port facility’s 
security” (ibid.). Vulnerability is defined as “lack- 
ing ability to resist an unwanted event or restore 
a new stable condition if a value is affected by 
unwanted influence” (ibid.). 


3.2 Port facilities and ports 


A port facility is described as land, buildings, facil- 
ities and other infrastructures used in port opera- 
tions, including quays, terminal buildings, loading, 
unloading and transhipment facilities, and storage 
and administration buildings (KYV, 2011). 

A port is defined as areas that are for use by 
vessels that load or unload goods or transport 
passengers as part of maritime transport or other 
business activities (ibid.). A port may contain sev- 
eral port facilities within its perimeters. 


4 METHODS 


The methods were based on reviews of primary 
and secondary literature. Furthermore, the authors 
held several workshops with relevant national and 
regional stakeholders in the Norwegian Coastal 
Administration. 


4.1 Literature review 


The primary literature reviewed consisted of 24 risk 
assessments from selected port facilities from all 
regions in Norway as well as a classification guide 
for ISPS facilities developed in Denmark (Danish 
Transport Authority, 2015). This primary material 
gave unique insight into the main threats, as well 
as the diversity in Norwegian port facilities. Fur- 
thermore, the Danish example gave a baseline for 
a best practice method. The secondary literature
studied ranged from relevant rules and regulations
to academic theory about risk, risk-based
supervision and vulnerability, as well as previous
reports and threat assessments about maritime
security.


4.2 Workshops and interviews 


Several workshops were conducted with relevant 
stakeholders and users from all the regions where 
KYV are situated, as well as more limited work- 
shops with selected stakeholders. This was to gain 
an overview of the diverse ports and port facilities 
in Norway, as well as ensure the user perspective 
throughout the study. The representatives from 
KYV also provided important expertise for the 
study. 

Semi-structured interviews were conducted with 
experts outside of KYV for the threat assessment 
and risk based supervision. 


4.3 Validation 


When the first draft of the methodology for clas- 
sifying port facilities was done, KYV were able to 
test the framework at several facilities, and give 
feedback on whether the framework would be 
applicable or required further adjustment. This 
was an iterative process where every draft of the 
classification system was circulated and considered 
for improvements in order to ensure sufficient vali- 
dation of the system. 


4.4 Limitations 


The number of port facilities regulated by the ISPS 
code in Norway is more than 600. Thus, assessing 
all of them individually was out of scope for this 
study, and an approach based on “types” of ports 
was selected and done on a general basis in order 
to include all ISPS port facilities. 

The analysis has included risk for intentional 
acts (security). Considerations regarding ICT- 
security or personnel security were not included. 


5 CLASSIFICATION OF CRITICALITY 


Categorising all Norwegian port facilities is highly
demanding and very complex, as the facilities are
very different. A balance must be struck between
categories that are too broad and categories that are
too narrow. After several iterations, it was
recommended that the port facilities themselves
should determine which security profile they belong
to, based on certain criteria. It is recommended
that the PSO/PFSO is involved in the
classification, which can then be quality assured by
KYV. Based on this classification, KYV can identify
some common values linked to the total score
and security profiles of each port and its facilities.

The classification method consists of eight different
assessment categories, each of which yields
a point score. The categories are all described
quantitatively (by numbers and points) and by a
qualitative description that may influence which
profile the port facility belongs to in order to
classify criticality.


5.1 Applied research and best practices 


As previously described, the original framework 
that was used to lower the security level of the 
port facilities to security level 1 in 2014 was inade- 
quate. The new framework needed to consider and 
include features for all the port facilities in Nor- 
way, and also adhere to the ISPS Code and existing 
best practices. Hence, the starting point included 
the rules and regulations, combined with regional 
experience from Norway and the best practice 
framework from Denmark. 


5.2 Categorising port facilities 


According to the ISPS Code part B, 15.5-10 the 
identification and consideration of important 
values and critical infrastructure is a process that 
needs to consider the potential for loss of life, 
the port’s economic significance, symbolic value 
and presence of government installations. This is 
reflected in the damage assessment form, where the 
consequence classes are (i) down time/operative 
ability, (ii) environment, (iii) life and health and 
(iv) reputation. From these parent categories, we 
developed some sub categories to assess criticality/ 
importance, as presented in Figure 1. This was nec- 
essary as the users needed more specific reference 
points to assess the different port facilities. 


5.2.1 Environment, life and health 

The ISPS Code reads that the main focus should 
be to “avoid death or damage”. According to the 
Norwegian Security Act (Sikkerhetsloven 1998), 
one should consider if there is a potential haz- 
ard for the environment or the population’s life 
and health. Therefore, we have used the category 
“number of yearly passengers” to capture how 
many passengers could be affected. The number of 
employees in a port facility was also included to 
capture how many people are present daily. 

It was also important to see how events in a 
port facility could affect the life and health of the 
population situated around the area. Therefore, 
we added the category “port facility proximity to 
populated places”. Populated places are areas
where many people gather, such as industrial
areas, infrastructure such as train stations,
residential areas etc. Another element is the presence
of dangerous goods and hazardous materials at the
port facility, which could have consequences for
environment, life and health. The category for
hazardous goods and substances captures this aspect.
These categories are visualised on the left side of
Figure 1.


Figure 1. Categories for assessing criticality.


5.2.2 Operative ability 
In the centre of Figure 1, the category “strategic 
importance” was included. According to the ISPS 
Code part B/15.6 it is important to consider if the 
port facility can still function without the given 
asset (or value), and to what degree it is possible
to restore normal operations quickly. Downtime 
for the entire operation or delays in important 
deliveries can result in substantial costs. Address- 
ing how some ports/port facilities can have such 
critical deliveries to society, is essential since any 
downtime will have impacts on society. This is why 
we added a category for strategic importance and 
redundancy. This is connected to national security 
and sovereignty, as well as critical infrastructure. 
Today, there is no agreed method to assess 
whether the organisation is a critical infrastruc- 
ture, and qualitative, knowledge based consid- 
erations are necessary. According to the Security 
Act, one needs to consider if downtime can have 
consequences for national security and defending 
the country, as well as critical societal functions for 
civilians. The assessment questions that addressed 
this aspect were: (i) Does the port facility have
import and export goods of strategic importance?
(ii) Is your port facility the only one that has spe- 
cialised operations or has special tools and facili- 
ties (e.g. a type of cables that are only produced/ 
delivered for a certain port or goods for supply 
safety)? (iii) Does the port facility have strategic
importance for the Armed Forces and proximity
to defence installations (such as critical equipment
like RoRo and LoLo (roll-on/roll-off and
lift-on/lift-off))? (iv) Is the port facility a former
“national designated port”?


5.2.3 Specifications for port facilities 

From a societal perspective, it is important to 
identify ports that are bigger and have more traf- 
fic than others. The sub categories “number of 
operations in the port facility” and “expected 
annual calls/arrivals by ship” cover this dimension. 
The thought was to also capture the complexity 
including actors, operations etc. that is ongoing 
in the port. We have also included “terminal type” 
with categories for different types of cargo. The 
categories were developed together with KYV, and 
include a ranking of different goods: (i) dry bulk, 
one-terminal facilities for timber, stone, gravel, 
asphalt, scrap iron etc., (ii) bigger goods and bulk
facilities and groupage, (iii) containers and LPG, 
(iv) supply bases, oil- and gas production and 
cruise, (v) RO/PAX (roll-on/roll-off passenger) for 
international traffic, ferry terminals for interna- 
tional shipments. The reasoning for this weight- 
ing is based on classified threat assessments and 
what could be seen as attractive targets for a threat 
actor. These categories can be seen on the right 
side of Figure 1. 


5.3 Assessing and ranking criticality 


After workshops with KYV, we developed a gen- 
eral model to assess and rank criticality. The rank- 
ing tables associated with each category collect 
all the quantitative information. Several of the 
threshold values in the tables are based on Danish 
best practice adjusted to Norwegian conditions. 
KYV has tested the model on port facilities they 
have great knowledge of, and have returned with 
input regarding threshold values. However, we also 
included a field for each ranking table where the 
assessments should be described. 

KYV also specified two conditions for the clas- 
sification. If a port facility has more than 200 000 
yearly passengers (score 5, red colour), the port 
facility will automatically end up in security pro- 
file 3 “high criticality”. The other condition is tied 
to the category “terminal type”. If the port facility 
has RO/PAX (score 5, red colour), the port facil- 
ity will automatically end up in security profile 3 
“high criticality”. 
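
The two conditions specified by KYV can be written as a single check. The sketch below is a hypothetical illustration: the function and parameter names are assumptions, and it only encodes the two automatic rules stated above, not the full scoring.

```python
def automatic_high_criticality(yearly_passengers: int, handles_ro_pax: bool) -> bool:
    """True when either KYV condition forces security profile 3."""
    # Condition 1: more than 200 000 yearly passengers.
    # Condition 2: the terminal type is RO/PAX.
    return yearly_passengers > 200_000 or handles_ro_pax


# Example facilities (made-up inputs for illustration only).
print(automatic_high_criticality(250_000, False))  # passenger condition applies
print(automatic_high_criticality(0, True))         # RO/PAX condition applies
print(automatic_high_criticality(5_000, False))    # neither condition applies
```

When neither condition applies, the security profile follows from the total score and the qualitative assessments described in this chapter.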


As mentioned, the method for classification is 
based on both quantitative and qualitative assess- 
ments. The quantitative data should be relatively 
easy to collect through the PFSAs, as well as 
the KYV regional personnel and PFSOs’ local 
knowledge. 

The quantitative assessment will lead to a total 
score, which implies the criticality of the port and 
port facilities. Risk assessments for security related 
risks are difficult to manage solely based on quan- 
titative approaches, so it was deemed necessary to 
supplement these with discretionary assessments. 
The qualitative assessments can both increase 
and lower the total score based on the quantita- 
tive assessments. They are meant as a supplement 
where e.g. quantitative numbers cannot describe 
the complexity or simplicity of the systems. Under 
each ranking table, there is a field where the quali- 
tative assessments should be described. Including 
this will achieve a more holistic overview of the 
different ports and port facilities. The combined 
quantitative and qualitative data provides grounds 
for a nuanced overview in order to make an 
informed decision about the final security profile. 
Table 1 is one example of a ranking table including 
quantitative and qualitative assessment. 

The security classification should make the 
process for raising and lowering the security level 
for different port facilities and ports more effective 
during and after changes in threat level. However, 
it is preferable to have continuous, updated infor- 
mation about the threat level and types of threats 
if available. The responsible stakeholders, such as 
KYV and PFSO/PSOs, should stay updated on 
openly accessible threat assessments from the rel- 
evant authorities. 

The score for the port facility must be considered
against the scores of the other ISPS port facilities
in the port. The highest score in a port will affect
the overall score for the port. This is also in line
with EU Port Security Directive 2005/65/EC.

The following figure shows the different catego- 
ries explained in this chapter, and is a visualisation 
that will be further described in chapter 6. 


Table 1. Example of a ranking table.


Ranking table: Number of personnel in the port facility

Number    <50    51-150    151-300    301-500    >500
Score     1      2         3          4          5

Describe the assessment:
x: There are normally around … persons in the port
facility on a daily basis. However, during July this
number will be 60 as there are more tourists and
boats visiting. We have assessed that the port
should still score 1.


6 USING THE METHOD


The first six ranking tables that should be com- 
pleted use a score from 1 to 5, and the last two use 
a tripartite score, 0-3-5. However, the assessment 
criteria for each table differ. The quantification of 
the categories was done in close collaboration with 
KYV, who has extensive knowledge about the dif- 
ferent types of facilities. 

To better describe the step-by-step method 
for the users, we created a fictional port facil- 
ity, “Hjertvik port facility”, based on several real 
PFSAs. This example will also be described here in 
order to explain the approach. 

The starting point for using the method is to 
count the “number of operations” in the port facil- 
ity, which entails number of operations, people 
associated with the various operations etc., rang- 
ing from one to more than five. In the Hjertvik 
example, there were four operations in the port 
facility, which gave a score of 4 points. 

The second step of the classification is to deter- 
mine the “port facility proximity to populated 
places”. This is described as proximity to popula- 
tions or crowded places (cities, villages, industries, 
bases, installations with symbolic value and other 
infrastructure). This is measured in kilometres/ 
metres ranging from less than 300 metres to more 
than 3 kilometres. Hjertvik port facility was more
than 3 kilometres from the nearest densely populated
area, which gave a score of 1 point.

As the third step, the “number of annual passen- 
gers” should be accounted for. The expected num- 
bers should be filled in on a scale from less than 1000 
to more than 200 001, and if the port facility does 
not handle passengers, the score is 1 (green). In this 
category, it should also be considered whether this is 
seasonal. The Hjertvik port facility did not handle 
passengers, and hence got a score of 1 point. 

Fourth, the “number of personnel working daily 
at the port facility” should be counted based on the 
expected number on a scale from 0 to more than 
200. Hjertvik had 21-50 people working on a daily 
basis, and got a score of 3 points in this category. 

The fifth step is to assess the “expected annual 
calls/arrivals by ship” on a scale from less than 50 
to more than 500. At Hjertvik, the annual arrivals 
were 51-150, and they got 2 points in this category. 

Sixth, the “terminal type” should be described.
This was done by giving examples for the different
scores: does the terminal handle wet bulk, dry bulk,
containers, general cargo, gas, building, RO/PAX and
maintenance, as described in chapter 5.2.3? If the
port facility handles different types of goods, the
starting point is the type of cargo that gives the
highest score. Here, Hjertvik had containers and
LPG (liquefied petroleum gas, wet bulk), and
ended up with a score of 3 points.


The seventh step includes “hazardous goods and 
substances”. The term “hazardous substances” 
is used for any substance that may constitute an 
unreasonable risk to the health and safety of the 
operators, personnel or the environment if it is not 
handled and processed properly by storage, pro- 
duction, processing, packing, use, destruction or 
transportation. When assessing the dangers that 
may be potentially harmful to humans, the Haz- 
Mat “diamond” is often used (United Nations 
2011 and 2015). The scores were determined by 
consequence scores from the HazMat diamond, 
and whether the hazardous substances are stored 
in, or in proximity to, the port facility. Hjertvik 
had several types of dangerous goods and got a 
score of 5 points. 

The eighth, and last, step is to determine “stra- 
tegic importance”. In this category it should be 
assessed whether the port facility has strategic sig- 
nificance for national security and sovereignty (e.g. 
territorial sovereignty and integrity, national free- 
dom of action, relations with other states, demo- 
cratic governance). It is also important to consider 
the importance of the port facility in the market 
and if it is of national symbolic value. In addition 
to the description, we prepared several guiding 
questions to help with this category, as described in 
chapter 5.2.2. The three scores were: no impact on 
national security and sovereignty, strategic import/ 
export, uniqueness in operations, and importance 
for the armed forces, or a previous “national desig- 
nated harbour”. Hjertvik port facility did not cur- 
rently have strategic importance, and hence got a 
score of 0 points. 

In order to summarise the port facility’s over- 
all security profile, the point scores are added 
together and placed in the security profiles pre- 
sented in Table 2. 

Hjertvik port facility got a total score of 19 
points, and ended up in “Security profile 2: 
Medium criticality”, which is visualised in yellow 
in Table 2. 
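
The summation for Hjertvik can be retraced with a small sketch. The six category scores below are the ones stated in the steps above. The scores of the remaining categories are not restated in this part of the text, so they are entered as a combined remainder of 5 points, which is an assumption implied by the reported total of 19. All names are illustrative, and the mapping from total score to security profile is left out because the score ranges of Table 2 are not reproduced here.

```python
# Category scores for the Hjertvik port facility as stated in the text.
hjertvik_scores = {
    "passengers": 1,
    "personnel_working_daily": 3,
    "expected_annual_calls": 2,
    "terminal_type": 3,
    "hazardous_goods_and_substances": 5,
    "strategic_importance": 0,
}

# Assumption: combined score of the categories not restated here,
# derived from the reported total (19) minus the six scores above (14).
REMAINING_CATEGORIES = 5


def total_score(scores, remainder=0):
    """Add up all category scores plus an optional remainder."""
    return sum(scores.values()) + remainder


print(total_score(hjertvik_scores, REMAINING_CATEGORIES))
```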


6.1 End user experience 


During the spring of 2017, KYV completed the 
classification of all ISPS-approved Norwegian 
port facilities by applying the described classification 
method. KYV employees who have their main 
occupation in port security conducted the work, in 
association with the PFSO when needed. Even though 
the aim was to have the PSO/PFSO determine their own 
security profile, with KYV only quality-assuring the result, 
KYV ended up doing most of the classifications 
due to its familiarity with the method. 

Table 2. The port facility security profiles (columns: security classes, score). 

The method has a step-by-step approach mak- 
ing it intuitive to conduct, and does not require 
comprehensive knowledge about security risk 
assessments. Yet, it requires a joint interpretation 
of the ranking tables and their initial questions. 
During the process, KYV experienced that the 
questions were perceived differently among those 
who conducted the classification. This entailed 
that more or less similar port facilities could be 
assigned unequal security profiles. 

As a response to this, KYV had to make some 
further refinements for how to understand the dif- 
ferent ranking tables. For example, in the ranking 
table showing the number of operations in a port 
facility, KYV had to specify that operations in this 
context are related to the main ship-port activities. 
This means that a port handling container ships 
and dry bulk ships has two operations. For future 
use, a more accurate definition of each ranking 
table would improve the reliability of the method. 
Another remark is that classification must be con- 
sidered as a continuous process. KYV has completed 
classifications of all Norwegian port facilities, but a 
change in a port facility’s operational pattern can 
also entail change in the security profile. 

Overall, the method provides a time efficient 
framework that applies to all Norwegian port facil- 
ities. Decision-makers in KYV are now provided 
with a simple, yet efficient tool when a situation 
calls for an increased security level in port facilities. 


7 CONCLUSION AND 
RECOMMENDATIONS 


In this paper, we have described a method devel- 
oped to classify Norwegian port facilities and ports 
into different security profiles. This paper describes 
our approach and limitations to the study, as well 
as the data that has been analysed. 

In the study, categories to assess criticality have 
been developed, with associated ranking tables. 
Eight ranking tables are used for each port facility 
in order to determine what score it gets, and 
thereby which security profile it belongs to (security 
profile 1 is low criticality, 2 is medium criticality, 
and 3 is high criticality). Following this, the 
port facility’s score must be compared with other 
ISPS port facilities in the port. All ISPS port facilities 
in a port must be dimensioned after the highest 
score in a threat situation. 
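
The dimensioning rule can be written down in a few lines. This is only a sketch: the facility names and all scores except the 19 points reported for Hjertvik are invented for illustration.

```python
def port_dimensioning_score(facility_scores):
    """Return the highest total score among the ISPS port facilities of one port."""
    if not facility_scores:
        raise ValueError("a port needs at least one ISPS port facility")
    return max(facility_scores.values())


# Hypothetical port with three ISPS port facilities.
example_port = {"facility_a": 12, "facility_b": 19, "facility_c": 15}
print(port_dimensioning_score(example_port))
```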


We also described general threat trends in the 
maritime sector, focusing on the categories ter- 
rorism, intelligence/espionage, sabotage and other 
crime, and recommend that KYV stay updated on 
the available threat information in order to deter- 
mine relevant security levels. 

It is recommended that the security profile clas- 
sification is used as one of the baseline informa- 
tion sources upon which KYV carry out their risk 
based supervisions. The frequency of supervisions 
should be considered with regard to both the criti- 
cality of the port facility and experiences from pre- 
vious supervisions. The basis for the supervision 
should be well embedded, and the port facilities 
(e.g. PFSO/PSO) should be involved in the process 
of classification and risk assessments. 

The approach presented here is a case that 
can be used as an example for other infrastruc- 
tures as well. The background for the classifica- 
tion is rooted in a more general way of assessing 
consequences of security incidents, and may also 
be applicable to different kinds of critical infra- 
structures. Lessons from this case can also be rel- 
evant to consider for critical infrastructure owners 
or operators who are unsure of how they should 
address classification. 

ABSTRACT: The IP Exchange (IPX) connects telecommunication networks with each other. It enables 
features like roaming and data access while traveling. Designed as a closed network, it is now opening up, 
and unauthorized entities now misuse the IPX network for their purposes. The interconnection networks 
suffer from many Signaling System No 7 (SS7) protocol attacks. Advanced operators now use Diameter 
based LTE roaming. We will illustrate how, under certain network configurations, an attacker can collect 
a sufficient amount of insider information and then modify the subscriber profile to change the access 
point configuration in different core network nodes, and by that place himself in the middle of the data 
traffic path for the user. We will close with recommendations on how to prevent such an attack. 


1 INTRODUCTION 


It is taken for granted that we can use our phone 
for data and calls when abroad. We rarely consider 
what happens in the background when we switch 
on our phone after our arrival in another country. 
You connect to a network that, at that point in 
time, knows nearly nothing about you; still, in the end 
you can make calls, access your webmail and Twitter, 
and are charged on your home-network 
bill. This is all possible because operator networks 
communicate through private signalling networks, 
i.e. an Interconnection Network. All network 
operators in the world are connected through it 
with each other, sometimes directly, sometimes 
indirectly via service providers. 

The first Interconnection Network was the 
so called Nordic Mobile Telephone Network 
between Norway, Finland, Sweden and Denmark 
[1] in 1981. At that time most network operators 
were state-owned and there was trust between the 
partners. The main goal was to enable services for 
their users. They designed protocols and messages 
to serve that goal. The Signalling System No. 7 
(SS7) is a network signalling protocol stack used 
worldwide between network elements and between 
different types of operator networks, service pro- 
viders on the interconnection and within operator 
networks. It was standardised by the International 
Telecommunication Union, Telecommunication 
Standardisation Sector (ITU-T) more than 35 
years ago [2]. 

At that point in time, security was not the main 
design concern, as SS7 was expected to be used 
only in a closed network between trusted 
partners. Later, with VoIP and other IP services 
being handled by mobile devices, it became necessary 
to use an IP-based Interconnection network for 
mobile operators. This led to the IP Exchange 
network (IPX), which has been supported and 
partially standardized by the GSMA Association 
since 2004. 


2 DIAMETER BACKGROUND 


Diameter is the evolution of SS7 and its Mobile 
Application Part (MAP) [3] protocol, and 
is used within and between 4G Long Term 
Evolution (LTE) networks. LTE uses the Diameter 
protocol and the Session Initiation Protocol 
(SIP) for communication between the network elements 
inside a network and between networks. We 
will focus on the Diameter part of the network in 
this article. In a Diameter-based network architecture, 
all elements are connected via an IP interface. 
The network nodes all support the Diameter 
base protocol specified in IETF RFC 6733 [4]. In 
Diameter, each interface has its own application, 
which is defined 
in a separate specification document and specifies 
application-specific additions to the base protocol. 
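
To make the split between base protocol and application concrete, the following Python sketch packs and parses the fixed 20-byte Diameter message header of RFC 6733, in which the application is selected by the Application-ID field. The application identifier 16777251 for S6a/S6d and the command code 316 for Update Location are quoted from the 3GPP registrations as we recall them and should be checked against TS 29.272; the function names are illustrative only.

```python
import struct

DIAMETER_VERSION = 1
HEADER_LEN = 20
FLAG_REQUEST = 0x80  # 'R' bit of the command flags

# Assumed values, to be verified against 3GPP TS 29.272.
APP_ID_S6A = 16777251
CMD_UPDATE_LOCATION = 316


def pack_header(command_code, application_id, hop_by_hop, end_to_end,
                flags=FLAG_REQUEST, message_length=HEADER_LEN):
    """Build the 20-byte Diameter header (message without AVPs)."""
    version_and_length = (DIAMETER_VERSION << 24) | (message_length & 0xFFFFFF)
    flags_and_code = (flags << 24) | (command_code & 0xFFFFFF)
    return struct.pack("!IIIII", version_and_length, flags_and_code,
                       application_id, hop_by_hop, end_to_end)


def parse_header(data):
    """Split the first 20 bytes of a Diameter message into its header fields."""
    if len(data) < HEADER_LEN:
        raise ValueError("Diameter header needs 20 bytes")
    v_len, f_code, app_id, hbh, e2e = struct.unpack("!IIIII", data[:HEADER_LEN])
    flags = f_code >> 24
    return {
        "version": v_len >> 24,
        "length": v_len & 0xFFFFFF,
        "request": bool(flags & FLAG_REQUEST),
        "command_code": f_code & 0xFFFFFF,
        "application_id": app_id,
        "hop_by_hop": hbh,
        "end_to_end": e2e,
    }


header = parse_header(pack_header(CMD_UPDATE_LOCATION, APP_ID_S6A, 0x11, 0x22))
print(header["command_code"], header["application_id"])
```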

To connect two Diameter-based LTE operators, 
an IPX or interconnection 
service provider is often used. Smaller operators utilize 
IPX service providers and aggregators to be able 
to offer their customers a large roaming selection. 
Operator networks usually deploy a Diameter 
Edge Agent (DEA) that resides on the border of 
the network as the first contact point for messages 
coming over the interconnection link. The 
most important nodes from the security point of 
view are the Home Subscriber Server (HSS), which 
holds the subscriber profile information, and the 
Mobility Management Entity (MME), which takes 
care of the user’s mobility (often combined with a 
Visitor Location Register, VLR). 

Diameter-based communication can be secured 
using Network Domain Security (NDS/IP) as specified 
in 3GPP TS 33.210 [5], i.e. IPSec (Figure 1). 
However, even though IPSec is implemented in many 
core network nodes and separate gateways (e.g. 
SeGWs), it is commonly not used, i.e. Figure 1 is 
wishful thinking in most cases. The reasons for that 
are manifold. Since this is an international network, 
the question of trustworthy root certificate 
authorities, revocation lists, certificate chains, 
trustworthiness of self-signed certificates, key generation 
etc. becomes a political one. In addition, 
there are Interconnection Service Providers, i.e. 
messages often traverse several “hops” between the 
operators. The routes between operators are chosen 
based on costs; therefore, the realm-based routing 
to and from an operator might be different. 
Supporting security would require the creation 
of a large Public Key Infrastructure, including 
dynamic certificate status validation, which entails 
substantial investments, also for 
players who did not invest in security in the past (as 
the IPX network was closed). Moreover, some operators 
just do not have the financial resources or expertise 
to secure their network communications to their 
partners. We will focus on the usage of Diameter 
on the S6a/S6d interface between HSS and MME as specified 
in 3GPP TS 29.272 [6] and the Sh interface 
between the HSS and an IMS application server as 
specified in TS 29.329 [7] and TS 29.328 [8]. 


Figure 1. Interconnection between LTE operators using 
Diameter with NDS/IP security. 


Diameter is constantly being extended, also for 5G 
and IoT (Internet of Things) connected devices. 
Even though Diameter is a different protocol than 
what is used within SS7, the underlying functional 
requirements, e.g. authenticating the user to enable 
an IMS-SIP based session setup, transferring user 
information etc., remain much the same, so there 
are many similarities between the Diameter messages 
and the SS7 MAP protocol messages. 
However, there is no exact one-to-one mapping 
between MAP messages and Diameter commands. 


3 RECENT SECURITY BREACHES 


For SS7 many attacks via the Interconnection 
link are known and observed e.g. location track- 
ing, eavesdropping, SMS interception, fraud, DoS, 
One-time password theft, credential theft, unblock- 
ing of stolen phones etc. 

The first interconnection vulnerabilities were 
published for the 4G Diameter protocol: 


— Location tracking [9], [12]. 
— Denial of Service/Fraud [10], [12]. 
— SMS interception [11]. 


Recently, an interesting GPRS data attack was 
observed during a live assessment, in which 
the CAMEL profile of the user was 
manipulated [13] (Figure 2). We perform a different 
type of profile modification, but it was inspired 
by the approach of [13]. 

In this attack, an incoming SS7-MAP Insert 
Subscriber Data (ISD) packet was detected being 
sent to the SGSN that a subscriber was currently 
attached to. Within the ISD packet the subscriber 
profile was modified with these specific CAMEL 
settings: 


1. A GPRS-CSI Trigger Point was set; this set 
GPRS-TriggerDetectionPoint to a value of 11 
(pdp-ContextEstablishment). 

2. A gsmSCF-Address was set; this is the SS7 node 
address that the SGSN was to send any resultant 
CAMEL packet to. This SS7 address was 
the same as the source of the ISD packet (an 
SS7 node based in another continent). 

3. DefaultGPRS-Handling was set to 0 
(continueTransaction). 


Figure 2. Data interception for GPRS [13]. 


The overall effect of this trigger point was to 
instruct the SGSN to inform the attacker node of 
any PDP context establishment procedure started 
by the subscriber, i.e. if they tried to use the Internet 
or set up a data connection. At this point the 
attacker node, having received the IDP-GPRS, 
would override the APN indicated by the user (i.e. 
the one that is stored in his phone) with its own 
APN in its CONNECT-GPRS response. 

The attacker sets the DefaultGPRS-Handling 
field so that, if the attacker node does not 
respond, the SGSN continues to use 
the original APN. This removes the need to generate 
an explicit response each time and reduces the 
attacker traffic. 
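
The decision logic described above can be modelled in a few lines. This is a simplified sketch of the SGSN behaviour as described in the text, not of any product implementation; the function name and the APN strings are hypothetical, and the value 1 for releasing the transaction is our reading of the MAP definition of DefaultGPRS-Handling.

```python
CONTINUE_TRANSACTION = 0  # value used in the observed attack
RELEASE_TRANSACTION = 1   # assumed counterpart value


def apn_after_trigger(requested_apn, scf_reply_apn, default_handling):
    """Return the APN the SGSN ends up using once the trigger has fired.

    scf_reply_apn is the APN from the CONNECT-GPRS response of the gsmSCF,
    or None if the gsmSCF did not answer.
    """
    if scf_reply_apn is not None:
        # The gsmSCF answered: its APN overrides the one from the terminal.
        return scf_reply_apn
    if default_handling == CONTINUE_TRANSACTION:
        # No answer: carry on with the APN the user asked for.
        return requested_apn
    # No answer and no permission to continue: no data session is set up.
    return None


print(apn_after_trigger("internet", "rogue.example", CONTINUE_TRANSACTION))
print(apn_after_trigger("internet", None, CONTINUE_TRANSACTION))
```

The second call shows why a value of 0 is convenient for the attacker: a silent attacker node leaves the user's session untouched.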

The main obstacle for an attacker is to gain 
access to the private Interconnection network. In 
theory, this should not be possible for a private 
individual or an attacker, however the legal rules 
for network operators for renting out access to the 
interconnection to service providers differ between 
countries, and some IP-using nodes are attached 
and visible on the Internet (e.g. GGSN nodes 
via shodan.io [18]). Therefore, attackers with suf- 
ficient technical skills or financial resources have 
found ways to breach the privacy of the network. 
The GSMA Association has provided their mem- 
bers with a set of protection measures for SS7 and 
Diameter interconnection security. That these 
attack vectors are also very likely exploited in practice 
can be seen from the information contained in 
the leaked e-mails of HackingTeam [14] from 2013, 
which state that they were already developing 
exploits for LTE/Diameter at that time. 


4 ATTACK DESCRIPTION 


The attack we will describe has three major phases, 
which are independent of each other. Each of 
those phases has some assumptions on the con- 
figurations and also some variants and alternatives 
which an attacker may apply to reach the final goal 
of data interception. 

We assume that the network does not have a sig- 
naling aware filtering software deployed at the edge 
of the network, typically represented by a Diameter 
Edge Agent (DEA) or a Diameter Routing Agent. 
Our assumption on the Diameter edge node is that 
it was placed there by the operator under the presumption 
that it runs as a sort of router between 
trusted entities in a private and closed network. 

The attacker is in possession of the phone 
number of the victim (MSISDN) and has access to 
the Interconnection network. 

For attack preparation, the attacker acquires 
information about the network operators he wants 
to attack and about the network node he pretends 
to be. Operators that offer roaming have a so-called 
IR.21 document, which describes 
various details of their architecture to enable the 
configuration of the roaming connections with 
their partners. Those IR.21 documents are not 
public documents and should only be used by the 
roaming partners for proper configuration of the 
roaming interface; nevertheless, some of them can 
be found on the Internet. In addition, operators 
often use blocks of addresses for their nodes. 
Both IR.21 leakage and block usage make it easier 
for an attacker to brute-force, e.g., SGSN addresses. 

The attacks are based on results obtained from 
specifications defining the behavior of the nodes 
and the test network that is usually used to verify 
correctness and security of new software code / 
updates for real operator networks before the new 
software is rolled-out. 


4.1 Subscriber data harvesting 


4.1.1 Getting the IMSI 

The first step for an attacker is to gather information 
about the user. He starts by obtaining the user’s 
International Mobile Subscriber Identity (IMSI). 
This user identifier is needed for core network communication, 
and it is also the critical piece of information 
on which subsequent attacks are based. There are several 
ways of obtaining IMSIs. One can set up a false base 
station and ask all devices in the area to send 
their IMSI. Alternatively, one can use a WiFi access point 
that is able to issue an EAP-SIM request to the device 
[15], or the SS7 MAP SRI_SM (Send Routing 
Info for Short Message) command [16]. 

We will focus on how to obtain the IMSI via the 
Diameter-based Interconnection, as we assume 
that the attacker does not want to travel to his victim. 
The attacker impersonates a Short Message 
Center (SMSC), i.e. he claims to have an SMS for a 
user that he wants to deliver, and therefore needs 
the “contact details” from the user’s Home Subscriber 
Server (HSS), which contains all the important 
subscriber information elements (Figure 3). 
This is a common and valid roaming scenario. 

For that purpose, the attacker sends a Diameter 
Send Routing Info for SM (Short Message) 
Request to the HSS of the user’s operator. 

This message contains the MSISDN (phone 
number) of the user. In response, the HSS will provide 
a Send Routing Information for SM Answer 
with the IMSI and the serving nodes for the user, 
i.e. serving MME and Serving GPRS Support 
Node (SGSN). The information will later be 
re-used in the second step of the attack. 

Figure 3. IMSI Retrieval using SMSC impersonation. (Message flow: the Send Routing Information for SM Request contains the MSISDN; the Send Routing Information for SM Answer contains the IMSI and the serving MME and SGSN; both over the S6c interface.) 
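
The asymmetry of this exchange, in which a public identifier goes in and internal identifiers come out, can be illustrated with a toy model. The sketch below is not a Diameter implementation; the table, the function name and every value in it are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingInfo:
    """Content of the answer as described in the text."""
    imsi: str
    serving_mme: str
    serving_sgsn: str


# Hypothetical subscriber table of a toy HSS; all values are invented.
HSS_TABLE = {
    "+4790000001": RoutingInfo(
        imsi="242010000000001",
        serving_mme="mme1.visited.example",
        serving_sgsn="sgsn1.visited.example",
    ),
}


def send_routing_info_for_sm(msisdn):
    """Toy HSS handler: the MSISDN alone is enough to obtain the answer."""
    return HSS_TABLE.get(msisdn)


print(send_routing_info_for_sm("+4790000001"))
```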

It should be noted that years may go by between the 
“harvesting of the IMSI” and the actual usage of the IMSI in 
an attack. In some cases, e.g. in the IP 
Multimedia Subsystem (IMS), the IMSI is not even 
needed at all, e.g. in the User Data Request (UDR) 
message. 

The second piece of information the attacker 
craves is the subscriber profile. The subscriber 
profile contains detailed technical information 
and key attributes about the subscription the user 
holds. 


4.1.2 Subscriber profile retrieval 

4.1.2.1 From MME 

The attacker is now in possession of the IMSI of 
the user and the identity of the serving Mobility Management Entity 
(MME). As the next step, the attacker can perform 
a location update, i.e. the attacker claims that this 
user has “landed” in his network. For this he sends 
a Diameter Update Location Request (ULR) over 
the S6a interface according to 3GPP TS 29.272 
[6], using the information obtained in the previous 
IMSI retrieval attack (Figure 4). 

In this location update request he does NOT set 
the ULR-Flag “Skip subscriber data”. This indicates 
to the HSS that the MME requests a fresh 
copy of the subscriber profile for synchronization 
purposes, which is a common roaming scenario. 
The HSS then sends back an Update Location 
Answer (ULA). This answer contains the requested 
subscriber profile. 
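
The role of the flag can be shown as a simple bit test. The bit positions below are quoted as we recall them from the ULR-Flags definition in TS 29.272 and should be treated as assumptions to be checked against the specification; the function name is illustrative.

```python
# Assumed bit positions within the ULR-Flags bit mask (verify in TS 29.272).
S6A_S6D_INDICATOR = 1 << 1
SKIP_SUBSCRIBER_DATA = 1 << 2


def profile_requested(ulr_flags):
    """True if the requesting node asks for a fresh subscriber profile.

    Simplified model: with the skip bit cleared, the HSS is expected to
    return the subscription data in the Update Location Answer.
    """
    return not (ulr_flags & SKIP_SUBSCRIBER_DATA)


print(profile_requested(S6A_S6D_INDICATOR))
print(profile_requested(S6A_S6D_INDICATOR | SKIP_SUBSCRIBER_DATA))
```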

We assume that once an attacker holds a complete 
subscriber profile of a user of one operator, 
he can deduce the operator-specific details of the 
structure of the profile, and then figure out how 
to make a fake entry look real for this or another 
subscription. A subtle attacker would 
also reset the MME back (Figure 5), assuming again that 
the DEA does not properly differentiate between 
internally and externally originated traffic (in a really 
fully trusted and closed network this kind of differentiation 
would not be needed). 

Figure 4. Profile extraction using ULR. (Message flow: the attacker impersonates a V-MME, with an address from the Internet, and sends an Update Location Request (ULR) containing the IMSI, with the ULR-Flag “Skip subscriber data” not set, over the S6a interface via the DEA to the HSS; the Update Location Answer contains the subscriber profile.) 

Figure 5. Setting back the MME to “old home MME”. 

If the network has a very basic filtering in place 
i.e. H-MME messages should not come via inter- 
connection link, then this setting back does not 
work, but for the main attack itself this ‘extra’ 
setting back of the MME is not needed, it is only 
a way for an attacker to reduce the risk of being 
noticed. 
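
The “very basic filtering” mentioned above amounts to one comparison at the network edge. The sketch below assumes that the edge node knows its own home realm and can tell that a message arrived over the interconnection link; the function name and the realm strings are hypothetical.

```python
def accept_on_interconnect(origin, home_realm):
    """Decide whether a message received on the interconnection link may pass.

    origin is the Origin-Realm or Origin-Host claimed by the message.
    Anything that claims to come from the home realm is rejected, because
    home nodes are not expected to send via the interconnection link.
    """
    claimed = origin.lower().rstrip(".")
    home = home_realm.lower().rstrip(".")
    return not (claimed == home or claimed.endswith("." + home))


HOME = "epc.mnc001.mcc242.3gppnetwork.org"
print(accept_on_interconnect("epc.mnc002.mcc244.3gppnetwork.org", HOME))
print(accept_on_interconnect("mme1." + HOME, HOME))
```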


4.1.2.2 From HSS 

An alternative method is to request the informa- 
tion from the HSS, this though is more difficult 
for Diameter. It may be possible as IMS networks 
usually deploy many Application Servers (AS) and 
not all of them reside directly in the core network. 
The Sh interface is between an Application Server 
and the HSS and is specified in TS 29.329 and TS 
29.328 [7], [8] and should be used as a network 
internal interface only. 

However, an operator usually has many application 
servers, and not all AS services might be provided 
by that operator himself. They might be provided 
by the mother company of that operator or 
some other service provider; the Sh interface then 
becomes an interface that is open on the interconnection 
link (it might not be visible, but it still processes 
Sh messages). 

In the case that the DEA passes Sh messages to 
the HSS (e.g. if the operator uses an application 
server that does not reside directly in its network), 
the attacker can send a User Data Request 
(UDR) message, which then returns the 
user profile. In IMS, this message can be populated 
with the MSISDN only as a user identity, and the 
returned user profile would also contain the IMSI 
(Figure 6). 

Figure 7 shows a snapshot of such a user data 
request and Figure 8 the corresponding answer. 

As the Application Server would be on the other 
side of the DEA, the HSS would see only the IP 
address of the DEA and any IP address access 
control list for IMS Application Servers would 
not work, as that IP address control mechanism 
was designed for AS that reside in the same core 
network. 
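
Why the address-based access control fails can be seen in a toy model. It assumes, as described above, that the DEA forwards Sh messages on its own connection to the HSS, so that the HSS has to allow the DEA address for the legitimate external Application Server to work. All addresses and names are invented.

```python
DEA_IP = "10.0.0.5"        # hypothetical internal address of the DEA
HSS_ALLOWED_IPS = {DEA_IP}  # must be allowed for the external AS to work


def source_ip_seen_by_hss(external_peer_ip):
    """The DEA relays the message, so the peer address is not visible."""
    return DEA_IP


def hss_acl_allows(source_ip):
    """IP-based access control list as designed for in-network servers."""
    return source_ip in HSS_ALLOWED_IPS


legitimate_as = "198.51.100.7"
attacker = "203.0.113.9"
print(hss_acl_allows(source_ip_seen_by_hss(legitimate_as)))
print(hss_acl_allows(source_ip_seen_by_hss(attacker)))
```

Both calls give the same result, which is the point of the paragraph above: behind the DEA the two senders are indistinguishable by IP address.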


Figure 6. Subscriber profile and IMSI retrieval using 
UDR over Sh. (The attacker impersonates an IMS Application Server, e.g. from the mother company of a large operator.) 

Figure 8. UDA user data answer “ok”. 


4.2 Inserting the false access point node 


Depending on the network configuration 
and on the targeted data communication, 
the attack discussed in this paper comes in several flavors. 
We will now outline how to place modified 
APN settings in various nodes. 


4.2.1 GPRS—profile modification in serving 
GPRS Support Node (SGSN) and MME 
The attacker aims to modify the subscriber data 
stored in the Serving GPRS Support Node 
(SGSN). From here on, the attack follows the discovery 
of [13] and evolves it into a Diameter attack. 
The attacker pretends to be the user’s home HSS 
issuing an Insert Subscriber Data Request (IDR) 
to the serving SGSN. The serving SGSN information 
has been obtained as outlined above or by 
extracting the subscriber profile; alternatively, an 
attacker just brute-forces all SGSNs of that operator. 

The serving node information is part of the 
subscriber profile i.e. MME/VLR and SGSN. This 
malicious IDR message contains the subscription 
data AVP (attribute value pair) of type grouped. In 
particular, the subscription data contains: 


— GPRS Subscription Data 
— Access Point Name (APN) Configuration 
Profile (with APN configuration details) 


In [13], which is based on SS7, the CAMEL settings were 
modified directly. In the LTE case this approach 
does not work, as the S6d interface does not support 
SGSN CAMEL subscription data. But this 
obstacle can be circumvented by directly modifying 
the APN-Configuration-Profile AVP in the 
SGSN, which contains the APN-Configuration 
AVP (TS 29.272 [6], section 7.3.35), in the subscriber 
profile in the IDR Diameter message. 


APN-Configuration ::= <AVP header: 1430 10415> 
{ Context-Identifier } 
{ PDN-Type } 
{ Service-Selection } 


The Context-Identifier in the APN-Configuration 
AVP uniquely identifies the 4G Evolved 
Packet System (EPS) APN configuration per subscription. 
Each APN configuration profile can 
contain several APN configurations; therefore, an 
attacker can set specific fake APNs for GPRS, EPS 
etc. (TS 29.272, sections 7.3.34 and 7.3.35). Service-Selection 
contains the name of the subscribed 
APN in use (Figure 9). 
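
In the AVP definition, the notation “AVP header: 1430 10415” gives the AVP code (1430) and the vendor identifier (10415, the 3GPP vendor ID). The following sketch shows how these two numbers appear in the generic AVP header of RFC 6733. It only handles the header and padding; the function names are illustrative, and the nested AVPs of the grouped APN-Configuration AVP are deliberately left out.

```python
import struct

AVP_FLAG_VENDOR = 0x80     # 'V' bit: a Vendor-ID field follows
AVP_FLAG_MANDATORY = 0x40  # 'M' bit


def pack_avp(code, data, vendor_id=None, flags=AVP_FLAG_MANDATORY):
    """Encode one AVP: header, optional Vendor-ID, data, padding to 4 bytes."""
    header_len = 8
    if vendor_id is not None:
        flags |= AVP_FLAG_VENDOR
        header_len = 12
    length = header_len + len(data)  # AVP Length excludes the padding
    out = struct.pack("!II", code, (flags << 24) | length)
    if vendor_id is not None:
        out += struct.pack("!I", vendor_id)
    return out + data + b"\x00" * (-len(data) % 4)


def parse_avp_header(raw):
    """Return (code, vendor_id or None, length) of the first AVP in raw."""
    code, flags_and_length = struct.unpack("!II", raw[:8])
    flags = flags_and_length >> 24
    length = flags_and_length & 0xFFFFFF
    vendor_id = None
    if flags & AVP_FLAG_VENDOR:
        vendor_id = struct.unpack("!I", raw[8:12])[0]
    return code, vendor_id, length


print(parse_avp_header(pack_avp(1430, b"", vendor_id=10415)))
```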

Depending on the deployed filtering, the placement 
of a false APN in the SGSN is harder while 
the user is in his home network. There the attacker 
would need to know the address of the home HSS 
(potentially again from leaked IR.21 documents), 
but that address is “less public” than, for example, 
a DEA address. 

Figure 9. Placing of modified profile in the SGSN. (The message flow ends with the Insert Subscription Data Answer (IDA).) 

There are two “flavors” to this IDR attack. If 
the user is roaming and the serving SGSN resides 
in the visited network, then the attack has high 
chances of succeeding. The attacker, when sending 
the IDR with the malicious APN, would impersonate 
the Home-HSS, but due to roaming the 
visited network would only see the DEA address 
of the home network (which can be spoofed by setting 
the origin realm and origin host). As the 
answer to the IDR does not really need to get 
through, spoofing the origin is no issue. The DEA 
address could be found from various IR.21 documents 
on the Internet or by brute-forcing the operator 
ranges. 

If the user is not roaming, the attack only succeeds 
if the network does not even have the lowest 
of all checks, i.e. it does not check whether a message 
that arrives on the interconnection edge claims to 
come from an internal node. Therefore, the attack 
is considered harder when the target user is not 
roaming. The modified profile would stay active 
until the SGSN synchronizes again with the HSS 
and indicates that it needs a fresh profile. 


4.2.2 Profile modification in HSS Using Sh 

An alternative method is to modify the subscriber profile 
in the HSS via the Sh interface. It 
is very interesting for an attacker to change the 
“master copy” of the subscriber profile, which 
resides in the HSS; we will describe how that might be 
achieved in a poorly secured network. 
An attacker can pose as an Application Server and 
send a Profile Update Request (PUR) message to 
the HSS (Figure 10). This “synchronization message” 
can contain a modified subscriber profile, i.e. 
it would have modified APN settings. Actually, 
an AS is not supposed to touch that part of the 
subscriber profile, but the specification does not 
mandate fine-grained profile processing on the 
HSS side. So, depending on the implementation, 
the HSS will just process the incoming profile data, 
regardless of whether it is “transparent” profile data or not. 
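
The fine-grained processing that the specification does not mandate could, in principle, look like the following sketch. It is a conceptual model only: the profile is represented as a plain dictionary, and the field names are hypothetical rather than taken from the Sh data model.

```python
# Hypothetical names for the parts of the profile an AS may write.
TRANSPARENT_FIELDS = {"repository_data"}


def filter_profile_update(update):
    """Split a profile update into accepted and rejected parts.

    Only transparent data is accepted; everything else, such as APN
    settings, is reported back as rejected instead of being stored.
    """
    accepted = {k: v for k, v in update.items() if k in TRANSPARENT_FIELDS}
    rejected = sorted(k for k in update if k not in TRANSPARENT_FIELDS)
    return accepted, rejected


accepted, rejected = filter_profile_update(
    {"repository_data": "service settings", "apn_configuration": "rogue.example"}
)
print(accepted, rejected)
```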


Figure 10. Update of profile in HSS using Sh interface 
and PUR message. (Message flow: the attacker impersonates an IMS Application Server, e.g. from the mother company of a large operator, and sends a Profile Update Request (PUR) with a modified subscriber profile over Sh via the DEA to the HSS, which answers with a Profile Update Answer (PUA) “ok”.) 


4.3 Modification of APN settings in HSS 
using S6a 


The main focus of this paper is to highlight the risk 
that comes with an unprotected IMS Sh interface. 
But it should be noted that parts of the attack can 
also be performed using other interfaces, like the 
S6a interface. 

The update location message over the S6a 
interface allows the MME or SGSN to include 
dynamic APN information to restore the Packet 
Data Network Gateway (PDN GW) data in the 
HSS, e.g. for the case that a reset of the HSS has 
occurred and the APN information needs to be 
restored (for details see TS 29.272 [6], section 5.2.1.1.2). 
The Active-APN AVP contains 
the list of active APNs stored by the MME 
or SGSN, including the identity of the PDN GW 
assigned to each APN. The attacker would then fill 
in his APN information (Figure 11). The following 
information is required to be present: 


— Context-Identifier: context id of subscribed APN 
in use 

— Service-Selection: name of subscribed APN in 
use 

— MIP6-Agent-Info: including PDN GW identity in 
use for subscribed APN 

— Visited-Network-Identifier: identifies the PLMN 
where the PDN GW was allocated (which would 
be the network the attacker has rented its access 
from) 


In case the above approach does not 
work, the second alternative is to use a Wildcard 
APN by inserting the following information: 


— Context-Identifier: context id of the Wildcard 
APN 

— Specific-APN-Info: list of APNs in use and 
related PDN GW identity when the subscribed 
APN is the wildcard APN 


4.4 Redirection of user 


When the user now requests data services, he would 
first use the APN configuration in the terminal. 
This APN configuration would not match the one in the 
subscriber profile in the SGSN/MME. There is 
an error process for this, as the basic assumption 
of the network would be that there is a misconfiguration 
in the user terminal. This assumption 
comes from the times when users had to manually 
configure their terminals. Therefore, the SGSN 
(GPRS) / MME (EPS) contacts the operator’s DNS 
using a NAPTR query, which resolves the corresponding 
APN configuration to the attacker’s node, and 
the GTP (GPRS Tunneling Protocol) tunnel is established 
to the attacker’s GGSN. 

Figure 11. APN insertion with ULR over S6a. (Message flow: the attacker impersonates an MME, the DEA being visible only in roaming, and sends an Update Location Request containing the Active-APN AVP with the attacker’s fake APN as a dynamic AVP via the DEA to the HSS, which replies with an Update Location Answer.) 

Any profile modifications in the SGSN and also 
in the MME are of a “temporary nature”, as those 
nodes synchronize with their Home-HSS at some 
point in time. Those synchronization configurations 
are operator dependent and depend on how 
much load an HSS usually has. If an operator has a 
lot of load on his HSS, those synchronizations will be 
rarer. The attacker has to use trial and error 
to see whether it is sufficient to change only the MME 
subscriber profile or whether he needs to change it in the 
HSS. 

Some products offer a feature to configure a default
APN in the APN remap tables for such an error case,
but those remapping approaches can potentially be
“tricked” by modifying the entry in the HSS/HLR.
Also, some SGSNs automatically append
“mncXXX.mccXXX.gprs” before sending out the request
(similar for EPS), which would make such an attack
difficult. However, this kind of automatic appending
is not part of the mandatory standard processing and
behavior in TS 29.303 [17]; it is a product-specific
extension.
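
To make the effect of such automatic appending concrete, the following minimal Python sketch builds the names that a node would hand to DNS. The naming patterns follow the APN conventions of 3GPP TS 23.003 as we understand them; the function names, the example APN labels and the MCC/MNC values (001/01) are hypothetical and are not taken from any product described here.

```python
def operator_identifier(mcc: str, mnc: str) -> str:
    """Build the 'mncXXX.mccXXX' labels; a 2-digit MNC is left-padded with 0."""
    if not (mcc.isdigit() and len(mcc) == 3):
        raise ValueError("MCC must be exactly three digits")
    if not (mnc.isdigit() and len(mnc) in (2, 3)):
        raise ValueError("MNC must be two or three digits")
    return f"mnc{mnc.zfill(3)}.mcc{mcc}"


def gprs_apn(network_identifier: str, mcc: str, mnc: str) -> str:
    """Always append the operator identifier (assumed product behavior)."""
    ni = network_identifier.strip().strip(".").lower()
    return f"{ni}.{operator_identifier(mcc, mnc)}.gprs"


def eps_apn_fqdn(network_identifier: str, mcc: str, mnc: str) -> str:
    """EPS counterpart of the same idea, using the 3gppnetwork.org domain."""
    ni = network_identifier.strip().strip(".").lower()
    return f"{ni}.apn.epc.{operator_identifier(mcc, mnc)}.3gppnetwork.org"


if __name__ == "__main__":
    # A legitimate APN and a hypothetical attacker-chosen APN label.
    print(gprs_apn("internet", "001", "01"))
    print(gprs_apn("apn.attacker.example", "001", "01"))
```

Under these assumptions the attacker-chosen label ends up below the operator's own "mnc001.mcc001.gprs" suffix, so the query is looked up in the operator's namespace rather than in an arbitrary external domain, which is why the text above regards this behavior as making the attack more difficult.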

It should be noted that an operator should deploy
different DNS servers for internal and external
resolution, or at least different DNS views based on
the requestor, but for efficiency reasons this is
not always done. Note, however, that even if different
DNS servers are in place for internal and external
resolution, and the internal DNS only contains
entries of home and roaming partner APNs, it is
possible that the malicious APN could be resolved
from the domain of a trusted roaming partner that
has provisioned the malicious APN (to the attacked
network). After all, the attacker needs to have some
entry vector into the interconnection network, and
that means he is “in some operator” already. This,
though, is a more complex and invasive method.

Figure 12. DNS resolution message flow. (Recoverable
annotations: depending on whether the MME always syncs
or not, the modified profile with the fake APN is
placed in the MME or the HSS; the attacker controls
the fake APN.)

With those different approaches to modifying and
updating the APN_Configuration_Profile AVP in
the subscriber profile and setting it to the APN of
an attacker server, the attacker can vary his attack
strategy according to the configuration of the victim.
The attacker does not know whether the MME would
synchronize or not, so he would find out by trial and
error whether he needs to modify the APN setting in
the MME or in the HSS. When a normal flow then takes
place and the DNS is serving both internal and
external requests, the MME would redirect the
terminal to the attacker’s server (Figure 12).

In general, profile modification and data
interception attacks pose a real threat and have been
observed in the wild for GPRS. We therefore expect
that, as technology advances towards LTE, attackers
will extend their attack portfolio to cover those
attacks as well.


5 PROTECTION MEASURES 


This article shows that an attacker is not
bound to one technology or protocol choice when
attempting to intercept data and modify the user
profile via the interconnection link.

Attacks can be modular in structure, and the
attacker would have a toolbox to draw on depending on
the actual configuration encountered in the victim
network. From this toolbox he can use a mix-and-match
approach to achieve his goals by combining several
typical deployment weaknesses in Diameter networks.
In our test network, we validated
specification-conformant messages and nodes and their
behavior under the different configurations. The
main point of this article is to show that an
information leak and “harmless looking” configuration
weaknesses in one interface, combined with typical
roaming messages, can lead to a data breach.
The countermeasures depend on the actual
deployments, but the following general
recommendations would make this kind of attack harder:


— Interface checking at the DEA, i.e. the DEA should
not answer Sh messages; these and related messages
need to be blocked and logged.

— If the network relies on an Application Server
outside of the core network security zone, then
this should be secured with a direct
certificate-secured tunnel and explicit
certificate-based access control and authentication.

— Usage of whitelists and location-distance checks
at the DEA (e.g. for the S6a interface) and deployment
of a dedicated protocol-aware signaling firewall for it.

— Validation of whether a reset procedure was done
before executing a dynamic APN in a ULR message.
In addition, validation of the source of the update
location, i.e. whether this update location is
feasible for this particular user
(location-distance-travel-time checking).

— Deployment of a dedicated DNS for internal
queries only, placed behind the firewall, e.g.
multiple DNS servers or DNS views for each
domain; alternatively, no shared DNS configurations
or physical DNS servers for trusted and
untrusted domains.

— DNS security, i.e. protection against cache
poisoning; the internal DNS should not be in front
of the firewall.

— IMSI-based address resolution (to avoid
automatic resolution of the attacker's address).


These methods would make the described 
attacks at least much harder, if not impossible. 
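
As an illustration of the whitelist and location-distance-travel-time recommendations, the following Python sketch shows one way such a plausibility check could look. It is a simplified sketch under stated assumptions, not a description of any deployed signaling firewall: the class and function names, the whitelisted realm, the coordinates and the speed limit of 1000 km/h are all hypothetical, and a real deployment would first have to map the visited network to a geographic position.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0


@dataclass(frozen=True)
class LocationUpdate:
    origin_realm: str   # realm announced by the sender (hypothetical field)
    latitude: float     # assumed position of the visited network, degrees
    longitude: float
    timestamp_s: float  # time of the update, in seconds


def great_circle_km(a: LocationUpdate, b: LocationUpdate) -> float:
    """Haversine distance between the positions of two updates."""
    lat1, lon1, lat2, lon2 = map(
        math.radians, (a.latitude, a.longitude, b.latitude, b.longitude)
    )
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def is_plausible(previous: LocationUpdate, current: LocationUpdate,
                 allowed_realms: frozenset,
                 max_speed_kmh: float = 1000.0) -> bool:
    """Accept only whitelisted senders and physically feasible travel."""
    if current.origin_realm not in allowed_realms:
        return False
    elapsed_h = (current.timestamp_s - previous.timestamp_s) / 3600.0
    if elapsed_h < 0:
        return False
    return great_circle_km(previous, current) <= max_speed_kmh * elapsed_h


if __name__ == "__main__":
    realms = frozenset({"epc.mnc001.mcc001.3gppnetwork.org"})
    first = LocationUpdate("epc.mnc001.mcc001.3gppnetwork.org", 59.91, 10.75, 0.0)
    second = LocationUpdate("epc.mnc001.mcc001.3gppnetwork.org", 63.43, 10.40, 600.0)
    # Roughly 390 km in ten minutes is not feasible at 1000 km/h.
    print(is_plausible(first, second, realms))
```

An update that fails either test would be blocked and logged rather than forwarded to the HSS; the threshold is a policy choice and would need tuning against legitimate roaming behavior.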
Finding your aim—choosing your game 


T. Grunnan & H. Fridheim 


Norwegian Defence Research Establishment (FFI), Kjeller, Norway 


ABSTRACT: There are many ways to conduct crisis management games and exercises, with many for- 
mats to choose from. A recurring challenge is to find the most useful format for reaching the specific 
objectives of a given game or exercise. We have conducted a Limited Objective Experiment (LOE) to study 
organizational uncertainty related to this and to look at ways to support game and exercise planners. The 
experiment shows that precise aims and objectives are important for choosing suitable formats for games 
and exercises. Additionally, it shows that planners need guidance and support tools to make the right 
choices, even when the aim is specified. This is likely related to uncertainty associated with terminology, 
methodological biases and the fact that there are numerous game and exercise formats available. The 
findings are relevant to practitioners within the fields of both security and safety. 


1 INTRODUCTION 


There are many ways to conduct crisis manage- 
ment games and exercises and many types and 
formats to choose from. Relevant formats include 
table tops, seminar games, wargames and full-scale 
exercises. A recurring challenge is to find the most 
useful format for reaching the specific objectives of 
a given game or exercise. In our experience, there 
are often uncertainties within organizations on 
how best to do this. 

Handbooks on wargaming and exercise plan- 
ning are often descriptive, in the sense that they 
explain how to set up and conduct various types of 
games and exercises (DHS 2013, MSB 2014, Burns 
et al. 2015, NVE 2015, DSB 2016, MOD 2017). 
The handbooks often take the form of instruction 
manuals, teaching the reader to plan and conduct 
a game or an exercise. However, Grunnan &
Fridheim (2017) show that planners and
decision-makers are often unaware of how choices in
both design and execution affect the results from
games and exercises. Handbooks often describe what
to do, but to a lesser extent why a particular format
or type is the best for a given purpose, what works
and what does not.

In this paper, we will expand on our previous 
research (Grunnan & Fridheim 2016, Grunnan & 
Fridheim 2017, Fridheim et al. 2017) and argue 
that finding the aim of an activity is essential and 
a pre-requisite for choosing the right type of game 
or exercise to support it. However, we often notice 
that game/exercise planners are uncertain of what 
the most suitable format is, even when aims and 
objectives are given. The uncertainty is partly due
to the bewildering array of overlapping and
contradictory terms related to gaming and exercises,
and partly due to the methodological biases of planners.
Thus, we have started work on how to manage 
this uncertainty, using our own organization as 
a case. We have conducted a Limited Objective 
Experiment (LOE) to study whether the perceived 
uncertainty is real or not. We have also looked at 
possible tools to support game or exercise planners 
in choosing the right formats. This paper describes 
our initial findings. 

The target audience for the paper is anyone 
involved in game/exercise planning, for example 
contingency planners, exercise planners, military 
planners, crisis managers, and consultants. In 
addition, the findings are relevant for decision- 
makers, researchers and analysts who use games 
and exercises as a method to investigate a problem 
and collect data. 


2 BACKGROUND AND THEORY 


2.1 The organizational challenge 


The Norwegian Defence Research Establishment
(FFI) has a long history of planning and conducting
games and exercises, both generally to support
analysis in running projects and specifically for
various military and civilian customers. There is a
wide range of activities that fall within the terms
“games” and “exercises”, but their purposes can
roughly be divided in two:


1. Learning: Activities where the main purpose is 
to provide learning and training for the partici- 
pants, so they are better prepared to make deci- 
sions in the future. 
2. Analysis: Activities where the main purpose 
is to collect or develop data and information 
which can support future decisions. 


While there are many activities that can support 
these purposes, they have several common features, 
similar to the elements of a wargame as listed in 
(MOD 2017): 


— Aims and objectives for the activity 

— Players/participants and their decisions 

— A scenario and a setting that provides an
immersive environment for the participants

— A simulation of the real world 

— Rules, procedures and adjudication of decisions 

— Data to create the scenario and the simulation 

— Supporting personnel and subject matter experts 

— Analysis and data collection 


Although customer feedback on the conducted 
games and exercises at FFI is generally good, there 
are still areas for improvement. We have run a sim- 
ple in-house problem-structuring session to help 
identify the challenges that must be addressed to 
develop more relevant and useful games and exer- 
cises in the future. Here we observed an uncertainty 
within the organization on which game and exercise 
types are best suited to support different purposes. 
Given the huge variety of possible formats, which 
should be chosen to give the most relevant answers 
and outputs for any given aim and objective? 

Some of the causes for this uncertainty were 
identified to be: 


— There is no common understanding of the terms 
related to “games” and “exercises”. 

— There is too little collaboration in the planning 
and conduct of games and exercises. Instead of 
sharing knowledge with each other, different 
project groups have established local practices 
and approaches. 

— There is a practical approach to planning and 
conduct of games and exercises, instead of an 
academic one. The main goal is often to find a 
format that works given available resources, with- 
out considering options to find the best approach. 


These causes are not unique to FFI. For
instance, the lack of collaboration in exercise
planning and conduct is also identified in OECD
(2014), where local practices are described as
“...knowledge as a canon best passed on through
mentor-mentee communication and informal
apprenticeship or their designs as proprietary trade
secrets best left undocumented and un-diffused”.


2.2 Games or exercises? 


The terminology related to games and exercises has
long been subject to debate, and there are several
definitions of both terms, with none being universally
accepted. How you define the terms also depends on
your area of work. Of interest to FFI is the
difference between military and civilian approaches.

The military often uses the term wargaming or 
war gaming (tellingly, there is no consensus on 
whether the words should be combined or sepa- 
rate). One widely used definition of wargaming is 
from Perla (1990): “...a warfare model or simula- 
tion that does not involve the operations of actual 
forces, in which the flow of events affects and is 
affected by decisions made during the course of 
those events by players representing the oppos- 
ing sides”. The basic components of this defini- 
tion are identified by the US Naval War College 
(USNWC 2017): “War-games involve PEOPLE 
making DECISIONS in a context of competition 
or CONFLICT (with themselves, other people, 
or their environment)”. Exercises may differ from 
wargames in the sense that they use actual military 
forces instead of simulated ones (Simpson Jr 2015). 

On the civilian side, one definition of exercises 
is made by the Department of Homeland Security 
(DHS 2013): “An exercise is an instrument to train 
for, assess, practice, and improve performance in 
prevention, protection, mitigation, response, and 
recovery capabilities in a risk-free environment.” 
The same publication defines a game as “... a simu- 
lation of operations that often involves two or more 
teams, usually in a competitive environment, using 
rules, data, and procedures designed to depict an 
actual or hypothetic situation. Games explore the 
consequences of player decisions and actions”. 
While this is in line with the definition of wargam- 
ing above, the military separation between the real 
(exercises) and the simulated (games) is muddled 
when DHS classifies games as a “discussion-based 
exercise technique”. A similar approach is taken
by the Norwegian Directorate for Civil Protection,
which defines “game exercise” as one of four exercise
formats in its guidelines for planning, conduct
and evaluation of exercises (DSB 2016).

At FFI, one result of these terminology uncer- 
tainties is that different terms are used about simi- 
lar activities, and vice versa. On the military side, a 
tabletop discussion could be labelled as a seminar 
game. A similar activity on the civilian side could 
be classified as a crisis management exercise. How- 
ever, for all practical purposes, these are the same 
activities, with the same aims and objectives, using 
similar formats. This confusion adds to the uncer- 
tainty of choosing the right game or exercise for- 
mat for any given purpose. 


2.3 Choosing your game or exercise style 


There are many different styles of games and 
exercises available, and naturally, there have been 
many attempts at categorizing them. Related to 
wargaming, a recent and widely used categoriza- 
tion is made by Pournelle (2017). He identifies six 
wargame categories, depending on the purpose of 
the games (creating knowledge, conveying knowl- 
edge or entertainment) and whether they address 
structured or unstructured problems, see Table 1. 

Within the six categories, Pournelle (2017)
identifies four major styles of operational wargames:
seminar games, matrix games, free kriegsspiel
and rigid kriegsspiel. In the order listed, these
styles become less creative and more predictable, and
provide more analytic rigour. In general, the effort
necessary to arrange the games also increases in the
same order.

Within these general styles, you will find vari- 
ous types of wargames. Mouat (2017) lists several 
types, mostly depending on how the operational 
environment is represented: Computer wargame, 
map wargame, board wargame, seminar wargame, 
sand table wargame and “soft issues” wargame 
(matrix game). 

Related to exercises, DHS (2013) categorizes 
exercises as either: 


— Discussion-based: seminars, workshops,
tabletop exercises and games.

— Operations-based: drills, functional exercises
and full-scale exercises.


Similar to Pournelle’s game styles, the opera- 
tions-based exercise categories are more effort- 
intensive than the discussion-based. 

Within the range of styles, opinions naturally 
differ on which is the most suited game/exercise 
format for a given aim or objective. In many cases, 
the choice of format is subject to less thought than 
it should be. As discussed in (Grunnan & Frid- 
heim 2016), customers often decide on a particu- 
lar exercise format without finalizing the exercise 
goals. Similarly, game and exercise planners are 
very much subject to the “law of the instrument”, 
as expressed by Maslow (1966): “I suppose it is 
tempting, if the only tool you have is a hammer, 
to treat everything as if it were a nail.” The result 
is that old habits die hard when game and exercise 
planners re-use the methods they are familiar with, 
without considering new or improved approaches 
for the given problem. 


Table 1. Six wargame categories (Pournelle 2017).

Problem        Creating knowledge   Conveying knowledge   Entertainment
Unstructured   Discovery games      Education games       Roleplaying
Structured     Analytic games       Training games        Commercial kriegsspiel


Similar effects can be noticed at FFI, where 
those working on the strategic and operational 
level tend to go for simpler discussion-based for- 
mats, while those working on the tactical level usu- 
ally reach for more rigid formats with support of 
modelling and simulation. In many cases, this is a 
natural consequence of the problems that are to be 
studied, but not always. 

In summary, there is a need for more clarity on
terms, as well as a need for tools or frameworks
that enable us to choose the right game or exercise
format for a given purpose.


3 METHOD 


In this chapter, we explain the methods we have 
used in order to identify and manage uncertainty 
related to matching aims and objectives with suit- 
able game and exercise formats. We have combined 
initial problem structuring with a simple Limited 
Objective Experiment. 


3.1 Problem structuring 


In order to 1) identify relevant challenges, and 2) 
demonstrate and discuss which game/exercise for- 
mat is suitable and applicable for given purposes, 
we conducted two problem-structuring sessions 
in-house. The first session was a brainstorming 
session where the aim was to identify different 
formats/types of games and exercises, as well as to 
identify critical factors and criteria for matching 
formats and objectives. 

After the initial problem structuring session
(described in chapter 2.1), we used a creative
technique called “Starbursting” (NATO 2017).
The technique is used to generate questions about
a problem, and it provides a useful structure for
getting started with a research topic. We drew a
star with six points and wrote one of who, what, why,
when, where and how at each point. In the middle we
put the topic of discussion: how to find the best
format of a game or exercise in order to reach
specific objectives. We systematically went
through the six points, brainstorming possible
questions. Afterwards, we organized and prioritized
the questions.

The final step was to formulate a plan on how to 
proceed, based on an analysis of the findings. The 
questions laid the foundation for further literature 
studies and the organization of a seminar with a 
simple experiment. Starbursting is an effective 
method, especially in the early phases of a project, 
since many people find it easier to raise questions 
than to find answers (NATO 2017). 
3.2 Limited objective experiment 


The next stage was to conduct a Limited Objective
Experiment (LOE). An LOE is a narrowly scoped,
analytically focused assessment or validation event.
We invited 24 researchers and military officers at
FFI to take part in the experiment in the context
of a half-day seminar. The participants’ experience
and expertise related to games and exercises
ranged from none to much, see
Table 2 in Chapter 4.2.

The seminar had three objectives. First, we 
wanted to strengthen internal competence and 
awareness of what games and exercises are and 
how they can be used at FFI. The authors pre- 
sented a range of formats for games and exercises, 
using real examples. Second, we wanted to facilitate 
more collaboration on games and exercises across 
the organization, to enable learning and strengthen 
the quality of future games and exercises. Finally, 
and most importantly, we conducted the LOE with 
the participants, to study the perceived uncertainty 
related to game and exercise formats. In addition to 
being a competence-enhancing measure, the presen- 
tations and discussions were mood setters intended 
to prepare the participants for the experiment. 

The purpose of the LOE was two-fold: First, 
we wanted to study whether the uncertainty we 
thought was present in the organization actually 
was there. Second, we wanted to see if a simple 
supporting form would help participants select the 
format of a game or exercise, when the aims and 
objectives were given. 

The participants were given a form with the title: 
“How to choose the format of a game/exercise”. 
On one page we described three cases. They were 
inspired by actual assignments FFI had received in 
the past. On the flip page, there was a form with 
boxes to fill out, as shown in Figure 1. The figure is 
a matrix with two dimensions, operational environ- 
ment and adjudication. Operational environment 
is defined by the authors as “how the operational 
environment is represented in the game or exercise”, 
i.e. how reality is simulated. Adjudication is “the 
process of judging how the game progresses and 
develops when the participants make decisions”. 
The figure had these two dimensions represented on 
each axis. On each of the axes there were a number 
of values associated with the current dimension, 8 
for adjudication and 6 for operational environment. 
Combinations of one value from each axis would 
constitute a format for a game or exercise. 
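
The structure of the form can be sketched in a few lines of Python, purely as an illustration. The value names are taken from Figure 1 and from the results in Section 4.3; the ordering of the operational environment values between the two extremes (plenary and field) is an assumption, and the `tally` helper and the example responses are hypothetical and were not part of the experiment.

```python
from collections import Counter
from itertools import product

# Eight values on the adjudication axis (names as read from Figure 1).
ADJUDICATION = (
    "Real operations", "Computer models", "Policy or rules/tables",
    "Experts", "Moderator", "Consensus", "Random/dice", "Scripted",
)

# Six values on the operational environment axis (ordering assumed).
OPERATIONAL_ENVIRONMENT = (
    "Plenary", "Workplace", "Map", "Board game",
    "Simulator/data model", "Field",
)

# Every cell of the form is one candidate game or exercise format.
FORMATS = tuple(product(ADJUDICATION, OPERATIONAL_ENVIRONMENT))


def tally(responses):
    """Count how often each valid (adjudication, environment) cell was chosen."""
    valid = set(FORMATS)
    counts = Counter()
    for response in responses:
        if response not in valid:
            raise ValueError(f"not a cell of the form: {response!r}")
        counts[response] += 1
    return counts


if __name__ == "__main__":
    print(len(FORMATS))  # 8 x 6 = 48 possible formats
    # Invented responses for one case, only to show the data shape.
    example = [("Consensus", "Plenary"), ("Consensus", "Plenary"),
               ("Moderator", "Map")]
    print(tally(example).most_common(1))
```

With 8 adjudication values and 6 operational environment values, the form offers 48 possible formats per case, which helps explain why individual responses can spread widely even when the aim is given.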


Table 2. Participants’ experience with games/exercises.

None   Little   Some   Much   Total
2      9        8      3      22


Figure 1. Example of form used in experiment. (Matrix
form: the adjudication axis lists real operations,
computer models, policy or rules/tables, experts,
moderator, consensus, random/dice and scripted; the
other axis is operational environment.)


The participants were asked to write C1, C2 or
C3 in the figure where they would intuitively place
cases 1, 2 and 3. When everyone had filled out the
form, there was a short break to mingle, reflect
and discuss with the other participants.
After the break, participants were allowed to make
changes by drawing an arrow to a new format they
considered more appropriate, see example in Figure 1.
Finally, the forms were submitted to the organizers.


4 FINDINGS 


In this chapter, we present the three cases we used in 
the experiment, before we describe and discuss the 
findings. The results from each case are presented 
in figures. The participants had the opportunity to 
modify the answers they made initially, and these 
changes are described in writing. We comment 
on the overall results, and we discuss how various 
factors may have impacted on the results from the 
experiment. 


4.1 Three cases 


The three cases chosen for the experiment were 
inspired by assignments given to FFI in the past. 
They have been conducted as games or exercises by 
researchers at the institute, including the authors, 
but not by the participants in the experiment. 

The participants were asked to read the descrip- 
tion of the cases and note in the form which game/ 
exercise formats they intuitively would choose 
related to the two dimensions, operational environ- 
ment and adjudication. 
4.1.1 Case 1 — Bilateral ministerial game 

As part of strengthening their bilateral coopera- 
tion, ministries in two countries want to carry out 
a wargame with a crisis in the Arctic region. The 
purposes of the game are: 


1. Gain insight into the countries’ strategic thinking, 
threat assessment and action in a potential crisis. 

2. Help improve understanding of situational 
awareness and explore opportunities and limi- 
tations for cooperation in crisis and war. 


4.1.2 Case 2 — Internal wargame 

As part of internal learning processes, a research 
organization wants to conduct a wargame for 
its employees. The scenario is a bilateral conflict 
between two countries, which ends in a military 
attack on one of the countries. 

The purpose of the game is to assess the two 
countries’ practices and courses of action in differ- 
ent phases of the conflict, based on a given Order 
of Battle (i.e. available forces). 


4.1.3 Case 3 — Educational wargame 

A ministry is to participate in an annual interna- 
tional crisis management exercise. In order to pre- 
pare for the exercise, the ministry wants to conduct 
an internal game. The purposes of the game are: 


1. Develop work methodology and procedures 
internally, within and between the various ele- 
ments of the crisis management organization. 

2. Prepare the players for the scenario and key 
issues related to the international exercise. 

3. Exercise the ministry’s crisis management pro- 
cedures in relation to processes in international 
crisis management organizations. 


4.2 Experience with games and exercises 


24 employees took part in the experiment. How- 
ever, the sample for “experience in games or exer- 
cises” is 22, as one participant did not fill out the 
table, and another marked both little experience 
with exercises and much experience with games. 
9 participants had little experience with games or 
exercises, 8 participants had some experience, three 
had much experience, while two had no experience 
at all. This is shown in Table 2. 

In the forms, a few answers differed consider- 
ably from the others. We examined these outliers 
and found that they were made by participants 
with no experience. 


4.3 Results 


The results in Figures 2-4 show the intuitive
responses from the participants on which combination
of adjudication and operational environment
they would choose for the three cases. The results
are their immediate reflections, based on their own
skills and knowledge and on the information that
was given to them in a presentation before the
experiment. The number of respondents varies
from case to case, due to incorrect use of the form.


4.3.1 Results — Case 1 

For Case 1 — Bilateral Ministerial Game, 10 of 24 
participants chose a plenary format for the opera- 
tional environment, as shown in Figure 2. Map was 
the second choice, with 8 respondents finding this 
environment the most suitable for the given case. 

Adjudication based on consensus between play- 
ers was the preferred format for both plenary and 
map operational environment, with a total of 10 
responses. Another preferred adjudication method 
was moderator, with 7 responses. No respondents 
chose simulator/data models as the operational 
environment or scripted and policy or rules/tables 
as means of adjudication for case 1. 

Four participants made changes for Case 1 after
the break. One participant changed the adjudication
method from consensus to scripted, but kept plenary
as the operational environment. One changed the
operational environment from workplace to plenary,
but kept moderator as the means of adjudication, and
another changed the operational environment from
board game to simulator, but kept computer models
for adjudication. The last modification was a
change made on both axes, from real operations
combined with field to policy or rules/tables
combined with workplace.


Figure 2. Results from Case 1. (Response matrix:
adjudication on the vertical axis, operational
environment on the horizontal axis.)


4.3.2 Results — Case 2

Map was clearly the most preferred format for
operational environment for Case 2 (Internal Wargame),
with 13 of 24 respondents choosing it, see Figure 3.
Simulator/Data model followed with 6 participants
finding this environment the most suitable. Nobody
chose plenary or field, the two “extremes” on the
operational environment axis, and nobody chose
real operations as a possible adjudication method.

Figure 3.

Adjudication by experts was the preferred 
method with 9 responses, while computer models 
followed with 7 responses. Experts were chosen as 
the most preferred adjudication method in com- 
bination with a map operational environment, 
among four other choices. Nobody chose experts 
in combination with simulator/data model, which, 
evidently, was related to computer models. 

After the break, five participants modified their
responses for Case 2. Four of the changes were
made from the combination of experts as the means of
adjudication and map as operational environment.
Two of these kept experts, but changed map to
simulator. Two others kept map, but adjusted the
method of adjudication to consensus and computer
model. One participant kept computer model
for adjudication, but changed operational
environment from simulator to board.


4.3.3 Results — Case 3 

For Case 3 — Educational Wargame, 16 of 23 
respondents chose workplace as the ideal opera- 
tional environment, see Figure 4. No one preferred 
simulator or field for this case. Scripted adjudica- 
tion was preferred with 9 answers, followed by 
experts which got 6 answers. 

Four participants made changes after the break.
All of them kept their initial choice of operational
environment, but changed the method of adjudication.
Two participants changed adjudication from
expert to scripted and one participant changed from
moderator to scripted. These participants’ preferred
operational environment was workplace. The last
participant had map as operational environment, but
modified the method of adjudication from consensus
to scripted. The modifications strengthened scripted
as the favorite means of adjudication for Case 3.

Figure 4. Results — Case 3. (Response matrix:
adjudication on the vertical axis, operational
environment on the horizontal axis.)


4.4 Summary of results 


When combining the results from all three cases,
represented in Figures 2-4, we find that the most
popular choices for operational environment overall
are map, workplace and plenary. Field is not
considered particularly suitable for the cases. The
most preferred adjudication method overall is the
use of experts, followed by moderator, with consensus
and scripted sharing third place. The choices
on the adjudication axis are more spread out than
the ones on the operational environment axis.

The preferred choices in the experiment matched 
well with the formats used in the real games that 
the three cases were inspired by, see Table 3. The 
three means of adjudication used in the real games 
were moderator, experts and a scripted design, 
whereas the operational environment used were 
plenary sessions, map and workplace. The partici- 
pants were presented the actual formats used after 
the experiment was completed, and they were able 
to comment on the differences. 

In the experiment, the preferred combination in
Case 1 was adjudication by consensus and a plenary
representation of the operational environment. As
shown in Table 3, the format for adjudication used
in the real game was moderator. This was the second
most popular choice in the experiment. For Case 2
and Case 3, the preferred formats in the experiment
were similar to those used in the actual games.
Table 3. Actual formats used in real games.

                          Case 1      Case 2    Case 3
Adjudication              Moderator   Experts   Scripted
Operational environment   Plenary     Map       Workplace


Overall, the experiment results correspond very
well with real choices. However, Figures 2-4 show
that the results were widely spread out, which
indicates some uncertainty in the answers and that the
participants intuitively thought differently.


4.5 Discussion of results 


Various factors may have impacted on the results 
from the experiment. They are discussed in the 
following. 


4.5.1 Selection of cases 

There were similarities between the real games that 
the three cases were based on, in the sense that they 
were all discussion-based and tabletop-oriented. 
More rigorous games supported by models and 
simulations were not included, and having these 
as cases in the experiment could have provided 
clearer trends. However, we decided to choose cases 
that are representative for the assignments that 
the authors have worked on, where we knew the 
rationale behind the real choices of formats. Thus, 
it was interesting to see how a larger group of par- 
ticipants would respond to three tabletop-oriented 
cases. While the overall results from the experi- 
ment matched well with the real formats, individual 
responses were spread out across possible formats. 

A major challenge with the cases is that they 
include several different objectives in the same 
assignment. This makes choosing just one format, 
or combination of adjudication and operational 
environment, difficult. This is especially relevant 
in Case 3. These challenges were present when FFI 
received these tasks from customers, and the partic- 
ipants also found this challenging when filling out 
the form. They gave feedback on this matter in the 
break between the two sessions. In reality, game and 
exercise planners often receive assignments with 
different objectives. Unlike the experiment, real 
assignments are often accompanied by an order or 
expressed wish for the type or format the customer 
wants. That brings us back to the initial scope of 
the paper: We argue that finding and defining the 
purpose is crucial for the choice of game or exer- 
cise, but that there is a need for supplementary 
guidance to find the most suitable format. 

The cases could also have been presented with 
more detail, and the participants had little time 
to reflect on the cases and their choices. However, 
we were looking for the initial reactions from a 


group of researchers/officers interested in games 
and exercises. The results provide good indica- 
tions that there is uncertainty within the organi- 
zation about how best to match aims and goals 
with the myriad of available formats for games 
and exercises. 


4.5.2 Selection of dimensions 

In the experiment, we used two dimensions in the 
form to help choose game formats: adjudication 
and operational environment. The two dimensions 
illustrate the tradeoff between feasibility 
and robustness/complexity in game or exercise 
planning, where effort and necessary resources 
generally increase along the axes. Running a 
simulator involves more work than a plenary session, 
and data models are more resource-intensive 
to use for adjudication than a moderator. At the 
same time, the analytical rigor of the results will 
likely improve along the axes. This is in line with 
the categorizations and classifications referred to 
in Chapter 2, and therefore, it was natural to use 
these factors in the experiment. 

In the experiment cases, we did not include 
how much time and resources were available for 
planning and conducting the games. In reality, 
time and resources are important entry values 
(Grunnan & Fridheim 2016), and the experiment 
participants commented on the lack of this infor- 
mation when discussing their selections. In other 
words, our form does not cover all relevant factors 
for choosing game formats. However, it gives the 
opportunity to quickly assess alternatives, prefer- 
ably in dialogue with the client (Grunnan & Frid- 
heim 2017). 


4.5.3 Selection of values 

The selection of values for the two dimensions in the 
form, presented on the axes in Figures 1-4, is influ- 
enced by the fact that both games and exercises are 
covered in the same form. Due to the uncertainties 
regarding terminology as discussed in Chapter 2, we 
have added values related to both games and exercises. 
Terms such as rules, tables, models, dice, maps, boards 
and simulators are often related to (war) games, while 
real operations, experts, moderators, workplace and 
field are often included in exercise vocabularies. There 
is a possibility that merging games and exercises in the 
same set-up may have confused the respondents and 
made them more uncertain. 


4.5.4 Number of participants 

It would have been beneficial to have more partici- 
pants attending the experiment. Still, 24 partici- 
pants is a fairly large number when reflecting on 
the number of experts working on this topic within 
the organization. Table 2 in Section 4.2 shows an 
even distribution of experienced and less 
experienced participants. 


5 CONCLUSION 


The LOE shows that purpose should guide the 
choice of format for a game or an exercise, and it 
is therefore important to be precise when defin- 
ing aims and objectives. A clear and precise task 
or assignment will make choosing an appropriate 
format easier (Grunnan & Fridheim 2017). How- 
ever, our experiment indicates that even though 
you know the purpose, different people will still 
instinctively choose different formats. This is likely 
related to the uncertainty related to terminology, 
methodological biases and the fact that there are 
numerous games and exercise formats to choose 
from. Therefore, it is useful to have a quick support 
tool to assist the planners, like the form used in our 
experiment and shown in Figures 1-4. Although 
our findings demonstrate that there was a good 
match between the formats chosen in the experiment 
and the actual choices used in real life, there 
was also a large spread of answers for each case. 

The LOE is a pilot study, as it is the first time 
we have tested our ideas on this topic on a group 
of researchers. We carried out a first impressions 
session at the end of the experiment. The form (or 
matrix, as it was referred to by some participants) 
was especially considered useful for expanding the 
range of possibilities in the planning phase. The 
challenge is to choose formats when factors such 
as time and resources are included. For instance, 
which formats are best for a one-day or a week- 
long event respectively? Also, the participants 
found it difficult to choose only one format, espe- 
cially in the cases which had several objectives in 
the assignment text. Often there is a need for differ- 
ent formats to cover all objectives, and this is not 
easy to visualize in one form. 

There are many possibilities for future research 
on this topic. In coming experiments, it is possible 
to make adjustments such as adding more cases, 
choosing other dimensions in the form, and analyzing 
results from experienced and inexperienced 
participants separately. Furthermore, for internal 
organizational use, we recognize that there may be 
a need for an internal classification of different 
game/exercise formats and their overall strengths 
and weaknesses for different purposes. 

While the exercises and games covered by the 
paper are mainly related to security challenges, our 
findings are relevant also for more safety related 
issues. The form used in the experiment provides 
an opportunity to evaluate and discuss options 
and formats independently of the subject area. 


Optimizing security patrolling scheduling in chemical industrial parks 
by using game theory 

Delft, The Netherlands 


ABSTRACT: Protecting chemical clusters from intentional attacks has been a hot topic during the last 
decade. Besides intrusion security countermeasures such as cameras and entrance control, patrolling 
also plays an important role in the security of chemical facilities and industrial parks. Current patrolling 
strategies in industry are mainly single-plant driven and either purely randomized or based on the patroller’s 
preference. In a chemical industrial park, such an approach can neither cover the more hazardous 
facilities more heavily than the less hazardous plants within the park, nor deal with strategic 
(intelligent) human adversaries with respect to terrorism. This paper therefore investigates 
a game theoretic model for optimizing the patrolling schedule in chemical clusters. The industrial 
defender and the intelligent/adaptive attackers are modelled as two players in the game. The defender 
aims at increasing the probability of detecting the attacker by randomly but strategically scheduling her 
patrolling route. The attacker aims at causing maximal consequences with the highest success probability by 
choosing a proper attack time and a proper target. The model is further illustrated by a case study. 


1 INTRODUCTION 


Since the 9/11 attacks, the protection 
of critical infrastructures has been an urgent topic, 
both in academia and in practice. Chemical industries 
have an important role in modern society. 
They provide materials for daily human 
necessities such as clothes, food, energy, etc. However, 
chemical facilities may also pose a huge threat 
to modern society. The Bhopal disaster (a toxic 
gas leak), among others, caused more than 3000 
deaths and life-long suffering for over 300,000 people [1]. 
On the security side, the malicious attack on a 
chemical plant in France reminded people that 
a successful attack on chemical facilities is 
possible. An investigation carried out by Orum 
and Rushing [2] concluded that a successful attack 
on one of the 101 most dangerous chemical plants 
in the U.S. may result in more than one million casualties. 
Furthermore, for economic and management 
reasons, chemical plants are nowadays geographically 
clustered, forming chemical clusters, e.g., the 
Antwerp port chemical cluster and the Rotterdam 
port chemical cluster. Besides intrusion security 
countermeasures within each plant, patrolling 
is also scheduled to secure these chemical 
facilities. The patrolling can be either single-plant 
oriented, scheduled by the plant itself, or 
multiple-plant oriented, scheduled 
by a multiple plant council (MPC) [3]. Both types 
of patrolling share the drawback of not being able to 
deal with intelligent attackers. Some patrollers 
follow a fixed patrolling route, so the attacker 
can predict the patroller’s position at a certain time. 
Other patrollers purely randomize their patrolling 
without taking into consideration the hazard 
level of each facility/plant, so the attacker 
can target the more dangerous facilities/plants, since all 
facilities/plants are equally patrolled. 

Game theory has been introduced to the 
security domain to optimally allocate security 
resources. In a security problem, the (human) 
attacker is able to plan his attack according 
to the defender’s defence, while the defender 
knows this and can therefore also defend 
accordingly. This procedure is called the ‘intelligent 
interactions’ between the defender and the 
attacker. Game theory was developed to model 
strategic decision making in multiple-actor systems, 
and is thus well suited to modelling 
the ‘intelligent interactions’ in the security 
domain. Tambe and his co-authors [4] employed 
game theory to optimize patrolling for protecting 
ferries, wild animals, etc. Alpern 
and his co-authors [5] theoretically studied the 
optimization problem of patrolling on a graph. 
Amirali et al. [6] introduced a game theoretic 
model for optimally scheduling pipeline patrolling. 
No literature has investigated the use of 
game theory to optimize patrolling in chemical 
clusters, either for single-plant patrolling or 
for multiple-plant patrolling. 

This paper proposes a Chemical Cluster Patrolling 
(CCP) game, a game theoretic model that answers 
the question of how to optimally randomize the 
patrolling to better secure a chemical cluster. 
The remainder of this paper is organized as 
follows: Section 2 briefly presents the CCP 
game. A case study is introduced in Section 3 and 
the results of the case study are given in Section 4. 
Conclusions are drawn in Section 5. 


2 THE CHEMICAL CLUSTER 
PATROLLING (CCP) GAME 


2.1 Graphic modelling 


A chemical cluster can be described as a graph G(V, 
E). The vehicle entrances of each plant and the 
cross points of the vehicle roads form the nodes 
of the graph. The vehicle roads between different 
plants (more precisely, between different 
entrances) are modelled as 
edges of the graph. Furthermore, all entrance 
nodes that belong to the same plant are modelled 
as fully connected, which means that an edge also 
exists between every two such nodes. 

Based on the graphic model, the chemical cluster 
patrolling can be described as a graphic patrolling 
problem: 1) a patroller (team) starts her patrolling from 
a node (the base camp); 2) she moves in the graph; 
3) when arriving at a node, she decides whether 
to stay at the node for a specific period of time $t^p$ 
(i.e., patrol the plant) or not (i.e., move to another 
plant without patrolling the current plant); 4) after a 
period $T$, the patroller terminates the patrolling. 
(In this paper, we denote the patroller/defender as 
she/her/her, and the attacker as he/him/his.) 

A directed patrolling graph pG(pV, pE) is defined 
based on the graphic model of the chemical cluster. A 
node of pG is defined as a tuple $(t, i)$, in which $t$ 
denotes the time dimension and $i$ denotes 
a node in graph G(V, E) (i.e., a plant (entrance) in the 
chemical cluster). Node $(t, i)$ means that at time $t$ the 
patroller arrives at or leaves node $i$. A directed edge of 
pG from node $(t_1, i_1)$ to node $(t_2, i_2)$ therefore means 
that the patroller moves from node $i_1$ at time $t_1$ to 
node $i_2$, and arrives at $t_2$. Figure 3 shows the 
patrolling graph of the case study. 
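
The construction of pG described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `build_patrolling_graph`, the toy cluster (`base`, `P`, `Q`) and its travel and patrol times are hypothetical, and the sketch simply stops expanding a node once the horizon T is reached, ignoring the prolonged patrolling rule used later in the case study.

```python
from collections import deque

def build_patrolling_graph(drive, patrol, base, horizon):
    """Return (nodes, edges) of a time-expanded patrolling graph.

    Nodes are (t, i) tuples. From (t, i) the patroller may drive to a
    neighbour j, arriving at t + drive[i][j], or patrol plant i,
    ending at t + patrol[i]. No new action starts at or after `horizon`.
    """
    start = (0, base)
    nodes, edges = {start}, set()
    queue = deque([start])
    while queue:
        t, i = queue.popleft()
        if t >= horizon:
            continue  # the patrol has ended; do not expand further
        successors = [(t + d, j) for j, d in drive.get(i, {}).items()]
        if i in patrol:
            successors.append((t + patrol[i], i))  # stay and patrol plant i
        for nxt in successors:
            edges.add(((t, i), nxt))
            if nxt not in nodes:
                nodes.add(nxt)
                queue.append(nxt)
    return nodes, edges

# Hypothetical toy cluster (not the case-study data), times in minutes.
drive = {"base": {"P": 2, "Q": 3}, "P": {"base": 2}, "Q": {"base": 3}}
patrol = {"P": 4, "Q": 5}
nodes, edges = build_patrolling_graph(drive, patrol, "base", horizon=6)
print(len(nodes), len(edges))
```

Every edge in the result moves strictly forward in time, so pG is acyclic even though the underlying cluster graph contains cycles.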


2.2 Game theoretic modelling 


A game theoretic model consists of players, strate- 
gies, and payoffs. 


Players 

Players of the chemical cluster patrolling (CCP) 
game are the patroller team and the potential 
attackers. The CCP game is a two-player game, and 
both players are assumed to be perfectly rational. 


Strategies 

An attacker’s strategy consists of three parts: i) 
which plant to attack; ii) when to attack; and iii) what 
attack scenario to use. It can thus be expressed as: 


$$s_a = (t, i, k_i) \tag{1}$$


in which $t$ denotes the attack start time, $i$ represents 
the target plant, and $k_i$ is the attack period (e.g., 7 
minutes), which is determined by both the 
attack scenario and the target plant. 

A mathematical formulation of the defender’s 
strategy is shown in Formula (2). 


$$s_d = \prod_{(s,e) \in pE} c_{se} \tag{2}$$


in which $c_{se}$ denotes the probability 
assigned to the edge (of pG) from node $s$ to node $e$, 
and $\prod$ denotes the Cartesian product over all edges in pG 
(i.e., all $(s, e) \in pE$). 

An important property of these probabilities 
is that, for each node (of pG), the sum of all 
incoming probabilities must equal the sum of all 
outgoing probabilities. Formula (3) expresses this 
property for a node $pv \in pV$. 


$$\sum_{in \in \{s \in pV \mid (s, pv) \in pE\}} c_{in\text{-}pv} = \sum_{out \in \{e \in pV \mid (pv, e) \in pE\}} c_{pv\text{-}out} \tag{3}$$
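
The conservation property in Formula (3) is easy to check numerically. The following sketch is illustrative only: the helper name `conservation_violations` and the toy edge probabilities are hypothetical, and it assumes that the property is only checked at pass-through nodes, i.e. nodes with at least one incoming and one outgoing edge.

```python
from collections import defaultdict

def conservation_violations(edge_prob, tol=1e-9):
    """Return the nodes of pG where Formula (3) fails.

    edge_prob maps (start_node, end_node) -> probability c.
    Source nodes (no incoming edge) and sink nodes (no outgoing edge)
    are skipped; this is an assumption of the sketch.
    """
    inflow, outflow = defaultdict(float), defaultdict(float)
    for (s, e), c in edge_prob.items():
        outflow[s] += c
        inflow[e] += c
    interior = set(inflow) & set(outflow)
    return sorted(n for n in interior if abs(inflow[n] - outflow[n]) > tol)

# Hypothetical edge probabilities on a tiny patrolling graph.
ok = {
    ((0, "base"), (2, "P")): 0.6,
    ((0, "base"), (3, "Q")): 0.4,
    ((2, "P"), (6, "P")): 0.6,
    ((3, "Q"), (6, "base")): 0.4,
}
broken = dict(ok)
broken[((2, "P"), (6, "P"))] = 0.5  # inflow 0.6, outflow 0.5 at (2, "P")

print(conservation_violations(ok))
print(conservation_violations(broken))
```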


Payoffs 

Formulas (4) and (5) define the payoffs of the patroller 
and the attacker, in which $f$ is the probability that 
the attacker fails. If the attacker fails, 
the patroller gets a reward $R^d$ (e.g., obtaining a bonus) 
and the attacker suffers a penalty $P^a$ (e.g., being sent 
to prison). If the attacker succeeds, the patroller 
suffers a loss $L^d$ and the attacker obtains a gain $G^a$. 


$$u_d = R^d \cdot f - L^d \cdot (1 - f) \tag{4}$$
$$u_a = G^a \cdot (1 - f) - P^a \cdot f \tag{5}$$


Computing f 


The probability that the attacker is 
detected can be calculated by Formula (6), in which 
$f_{IDS}$ denotes the probability that the intrusion 
detection systems (IDS) in the target plant detect 
the attacker, and $f_p$ is the probability that the patroller 
detects the attacker. 


$$f = 1 - (1 - f_{IDS}) \cdot (1 - f_p) \tag{6}$$


Note that $f_{IDS}$ is a plant-specific parameter (a 
number in [0, 1]), while $f_p$ can be calculated 
by Formula (7), in which $r$ denotes an overlap 
situation between the patroller’s stay in the plant and 
the attacker’s intrusion and attack procedure, and $\sigma_r$ is 
the detection probability in situation $r$. Furthermore, 
the probability that the patroller is in 
situation $r$ is denoted as $\tau_r$. 


$$f_p = \sum_r \sigma_r \cdot \tau_r \tag{7}$$


Denote the defender’s strategy in vector form 
as $\vec{c}$. It is worth noting that $\tau_r$ is a linear 
polynomial of $\vec{c}$, and $f_{IDS}$ and $\sigma_r$ are user-provided 
parameters. Therefore, $f$ is a linear polynomial of 
$\vec{c}$ as well. 
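
The linearity claim can be illustrated numerically. The sketch below is a hedged illustration rather than the authors' code: it assumes that each overlap situation r corresponds to a single edge of pG, so that the probability of being in situation r is just that edge's probability, and all names and numbers are hypothetical. It checks that f evaluated at a mixture of two strategies equals the same mixture of the two f values, which is what being a linear polynomial (with a constant term) implies.

```python
def detection_probability(c, sigma, f_ids):
    """f as a function of the edge-probability vector c."""
    f_p = sum(c[e] * sigma[e] for e in sigma)   # Formula (7), one edge per situation
    return 1.0 - (1.0 - f_ids) * (1.0 - f_p)    # Formula (6)

def payoffs(f, reward, loss, gain, penalty):
    """Return (u_d, u_a) from Formulas (4) and (5)."""
    u_d = reward * f - loss * (1.0 - f)
    u_a = gain * (1.0 - f) - penalty * f
    return u_d, u_a

# Hypothetical detection probabilities and two hypothetical strategies.
sigma = {"e1": 0.2, "e2": 0.35}
c_a = {"e1": 0.1, "e2": 0.3}
c_b = {"e1": 0.4, "e2": 0.0}
lam = 0.25
c_mix = {e: lam * c_a[e] + (1 - lam) * c_b[e] for e in sigma}

f_mix = detection_probability(c_mix, sigma, f_ids=0.4)
f_avg = (lam * detection_probability(c_a, sigma, f_ids=0.4)
         + (1 - lam) * detection_probability(c_b, sigma, f_ids=0.4))
print(abs(f_mix - f_avg) < 1e-12)
```

Because the payoffs in Formulas (4) and (5) are linear in f, they inherit the same property for a fixed attacker strategy.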


Stackelberg equilibrium 

In the CCP game, the attacker is assumed 
to be able to collect information about the 
patroller’s patrolling routes. A Stackelberg equilibrium 
$(s_d^*, s_a^*) = (\vec{c}^*, (t^*, i^*, k_i^*))$ for the CCP game is a 
defender-attacker strategy pair that satisfies the 
following conditions: 


$$(t^*, i^*, k_i^*) = \arg\max_{(t, i, k_i)} u_a(\vec{c}, (t, i, k_i)) \tag{8}$$


$$\vec{c}^* = \arg\max_{\vec{c}} u_d(\vec{c}, (t^*, i^*, k_i^*)) \tag{9}$$


Formula (8) states that, having observed the 
defender’s strategy $\vec{c}$, the attacker plays the strategy 
that maximizes his own payoff (i.e., a best 
response). Formula (9) states that the defender 
can also work out the attacker’s best response to 
her strategy, and thus plays accordingly. 
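
The leader-follower logic of Formulas (8) and (9) can be sketched as follows. This is a deliberately simplified illustration, not the solution method of the paper: instead of optimizing over all edge-probability vectors, it only compares a finite list of candidate defender strategies, each summarized by the detection probability f it induces for every attack. All names and numbers are hypothetical, and ties are broken by iteration order.

```python
def attacker_utility(attack, f, gain, penalty):
    """Formula (5) for an attack (t, i, k) with detection probability f."""
    target = attack[1]
    return gain[target] * (1 - f) - penalty[target] * f

def attacker_best_response(f_by_attack, gain, penalty):
    """Formula (8): the attack that maximizes the attacker's payoff."""
    return max(
        f_by_attack,
        key=lambda a: attacker_utility(a, f_by_attack[a], gain, penalty),
    )

def best_candidate(candidates, reward, loss, gain, penalty):
    """Formula (9), restricted to a finite set of defender strategies."""
    best = None
    for name, f_by_attack in candidates.items():
        attack = attacker_best_response(f_by_attack, gain, penalty)
        f, target = f_by_attack[attack], attack[1]
        u_d = reward[target] * f - loss[target] * (1 - f)  # Formula (4)
        if best is None or u_d > best[2]:
            best = (name, attack, u_d)
    return best

# Hypothetical two-plant example; attacks are (start time, plant, period).
gain, penalty = {"X": 10, "Y": 6}, {"X": 3, "Y": 3}
reward, loss = {"X": 1, "Y": 1}, {"X": 16, "Y": 11}
candidates = {
    "uniform": {(0, "X", 10): 0.45, (0, "Y", 10): 0.45},
    "focus_X": {(0, "X", 10): 0.70, (0, "Y", 10): 0.35},
}
name, attack, u_d = best_candidate(candidates, reward, loss, gain, penalty)
print(name, attack, round(u_d, 2))
```

In this toy example, concentrating patrols on the more valuable plant pushes the attacker towards the other plant and leaves the defender better off than uniform coverage.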


3 ILLUSTRATIVE CASE STUDY 


Figure 1 shows the layout of a chemical cluster 
in the Antwerp port (data source: Google Maps). 
There are 5 plants in this cluster, indexed as plant 
‘A’, plant ‘B’, and so forth. The yellow dotted lines 
show the vehicle routes, and the patroller 
only drives on these routes. Figure 2 shows 
the graph model of the cluster shown in Figure 1. 


Figure 1. Layout of a chemical park in Antwerp port. 


Figure 2. Graphic modelling of the chemical park. 


As we may see, each plant (i.e., ‘A’, ‘C’, ‘D’, ‘E’) 
in Figure 1 is modelled as a node (with the same 
name) in Figure 2. The cross point of the vehicle 
road between plants ‘D’ and ‘E’ in Figure 1 is 
also denoted as a node in Figure 2 (i.e., node ‘cr’). 
Moreover, plant ‘B’ has two vehicle entrances, and 
two nodes (i.e., nodes ‘B1’ and ‘B2’) are used in 
Figure 2 to denote the two different entrances of 
plant ‘B’. Edges ‘e1’ to ‘e6’ reflect the vehicle roads 
between different plants, while edge ‘e7’ is added 
between nodes ‘B1’ and ‘B2’ because these 2 nodes 
belong to the same plant and hence should be fully 
connected. 

We set $t^d_1 = 2$, $t^d_2 = 3$, $t^d_3 = 4$, $t^d_4 = 3$, $t^d_5 = 2$, $t^d_6 = 2$, 
and further set $t^p$(‘A’, ‘B’, ‘C’, ‘D’, ‘E’) 
= [9, 7, 6, 5, 7]. Here $t^d_i$ represents the driving time of 
edge ‘ei’ in Figure 2. For instance, $t^d_1$ is the driving 
time from node ‘A’ to ‘B1’. $t^p$(‘X’) denotes the 
time needed to patrol plant ‘X’. If the patroller 
may use multiple patrolling intensities in a plant, 
then $t^p$ should be a set of numbers rather than 
a single number. In this paper, we only consider 
one patrolling intensity in each plant, and all 
temporal data are expressed in minutes. 

Table 1 shows the time needed to move from one 
node to another. For instance, moving from node ‘A’ 
to node ‘B1’ takes $t^d_1 = 2$ minutes. It is worth noting 
that i) numbers on the diagonal denote the time 
needed to patrol the plant, e.g., patrolling plant ‘A’ 
takes $t^p$(‘A’) = 9 minutes; ii) the number from 
one entrance node to another entrance node of the 
same plant (e.g., from node ‘B1’ to ‘B2’) also 
represents the time needed to patrol the plant. Case (ii) 
means that the patroller enters and leaves the 
plant through different entrances. 

Figure 3 shows the patrolling graph pG for the 
chemical cluster shown in Figure 1, using the data 
in Table 1 and assuming a patrolling time T 
= 30. The patroller’s base camp is assumed to be close 
to the cross-road node, thus ‘cr’ is chosen as the 
patroller’s base camp. 

Figure 3. Patrolling graph of the illustrative example. 

Table 1. Superior connection matrix for Figure 2 with 
the illustrative numbers. 

In Figure 3, the x axis denotes the time dimension, 
while the y axis represents the different nodes 
in Figure 2. Therefore, any coordinate in Figure 3 
can be a possible node of pG. As we may see, node 
1 (at the left-hand side of the figure) in Figure 3 
is (0, ‘cr’), which means that at time 0 the patroller 
starts from her base camp (i.e., ‘cr’). Thereafter she 
has 3 choices: i) to go to plant ‘B’ (more accurately, 
entrance ‘B2’), reaching node 2 after the driving 
time of the connecting edge; ii) to go to plant ‘D’, 
reaching node 3 after the driving time of the 
connecting edge; and iii) to go to plant ‘E’, reaching 
node 4 after the driving time of the connecting edge. 
Subsequently, at each new node (e.g., 2, 3, or 
4), the patroller faces the same choice problem, that 
is, to patrol the current plant or to go to another 
plant. In Figure 3, the indexes of some nodes and 
the weights of some edges are not shown, for the 
clarity of the figure. 

Table 2. Model inputs. 

Plant   R^d   L^d    G^a    P^a   f_IDS
A       1     16     10     3     0.45
B       1     11.2   6      3     0.3
C       1     14     8.3    3     0.42
D       1     12     7.4    3     0.45
E       1     15     10     3     0.5


For example, the bold (and black) line 
in the figure denotes the patrolling route: 
‘cr’ → ‘D’ → ‘C’ → patrol plant ‘C’ → ‘B1’ → 
patrol plant ‘B’ → leave plant ‘B’ from 
‘B2’ → ‘cr’ → ‘E’ → ‘cr’. 

Finally, when the patrol time comes to an end, 
the patroller terminates the patrolling and returns 
to her base camp. In this research, to keep the 
coverage of each plant continuous, the patroller 
is required to prolong her patrolling in a plant 
until the next patroller team could 
arrive at that plant. For instance, in Figure 3, although 
the patrolling time is set as T = 30, the 
patrolling in plant ‘A’ is not stopped until t = 41. 
The reason is that the shortest time in which the next 
patrolling team can reach plant ‘A’ (from ‘cr’) is 11 
(by following the path ‘cr’ → ‘B2’ → ‘B1’ → ‘A’). If 
the current patroller team does not prolong its 
patrolling, and the next patroller team starts at time 
30 from its base camp (i.e., ‘cr’), then 
plant ‘A’ will definitely not be covered during the time 
interval (30, 41). This approach may increase the patroller’s 
workload. However, the problem can be solved by setting 
T slightly smaller than the patroller’s real workload. 
For example, if a patroller team’s workload 
is 240 minutes per day, we may set T = 220. 

For the sake of clarity, only one type of attacker 
and only one attack scenario are considered. We 
further assume that the intrusion and attack procedure 
of the employed scenario lasts for 10 
minutes. For instance, the two horizontal bold dotted 
red lines in Figure 3 represent attack strategies that 
attack plant ‘A’ starting at time 9 (the lower line) 
and plant ‘E’ starting at time 4 (the upper line), 
respectively, each with an intrusion and attack 
period of ten time units. 

Table 2 gives the model inputs, i.e., the defender’s 
reward (loss) from detecting (not detecting) an attacker; 
the attacker’s gain (penalty) from a successful (unsuccessful) 
attack; and the probability that the intrusion detection 
system (IDS) detects the attacker. The probability 
that the patroller detects the attacker (i.e., σ) 
should also be provided by security experts. However, 
in this paper, we simply assume that in each time 
unit in which the attacker and the patroller are in the same 
plant (i.e., overlap), there is a probability of 0.05 that 
the attacker is detected by the patroller. 
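
The per-time-unit assumption translates into a simple overlap computation. The sketch below is illustrative: the function name `overlap_detection` and the patroller's stay interval (5 to 13) are hypothetical, while the rate of 0.05 per time unit and the attack starting at time 9 with a period of 10 come from the text.

```python
def overlap_detection(stay_start, stay_end, attack_start, attack_len, rate=0.05):
    """Return (overlap interval, detection probability) for one patrol stay.

    The detection probability is `rate` times the number of time units
    in which the patroller's stay and the attack overlap.
    """
    lo = max(stay_start, attack_start)
    hi = min(stay_end, attack_start + attack_len)
    units = max(0, hi - lo)
    interval = (lo, hi) if units > 0 else None
    return interval, units * rate

# Hypothetical stay from t = 5 to t = 13; attack from t = 9 lasting 10 units.
interval, sigma = overlap_detection(5, 13, attack_start=9, attack_len=10)
print(interval, round(sigma, 2))
```

An overlap of [9, 13] gives 4 time units and hence a detection probability of 0.20, the same pairing of overlap and σ that appears in the first row of Table 3.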


4 RESULTS 


4.1 Stackelberg equilibrium 


Figure 4 shows the Stackelberg equilibrium 
(SE) of the case study. The black (and narrow) 
lines show the patroller’s optimal patrolling 
strategy. The number associated with each line 
denotes the probability that the defender will take 
this action. For instance, $c_1 = 0.22747$ means that 
at time 0, the patroller should drive to node ‘B2’ 
with probability 0.22747. Furthermore, in patrolling 
practice, if the patroller arrives at a node in the 
figure, the conditional probabilities of the following 
actions can be calculated as $cP = c / sP$, in which 
$c$ denotes the probability assigned to the edge and $sP$ 
denotes the probability that the patroller is 
at the node. For instance, the probability that 
the patroller arrives at the red node (6, ‘C’) 
in Figure 4 is $sP = 0.41734$, and the conditional 
probabilities that she takes the 2 actions are 
$cP_1 = 0.4979$, $cP_2 = 0.5021$. 
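
The conversion from edge probabilities to conditional probabilities can be sketched as below. The two edge probabilities used here are not read from Figure 4; they are back-computed from the reported values (0.41734 multiplied by 0.4979 and by 0.5021, rounded) purely for illustration. The sketch also assumes that the arrival probability at a node equals the sum of the probabilities on its outgoing edges, which follows from the conservation property for pass-through nodes.

```python
def conditional_probabilities(out_edges):
    """Return c / sP for each outgoing edge of a node.

    out_edges maps an action label to its edge probability c.
    sP is taken as the sum of the outgoing edge probabilities.
    """
    s_p = sum(out_edges.values())
    if s_p == 0:
        return {action: 0.0 for action in out_edges}
    return {action: c / s_p for action, c in out_edges.items()}

# Illustrative values back-computed from the reported sP and cP.
out_edges = {"action_1": 0.20779, "action_2": 0.20955}
cond = conditional_probabilities(out_edges)
print({action: round(p, 4) for action, p in cond.items()})
```

A patroller standing at the node only needs these conditional values, since they already sum to one.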


Table 3. The patroller’s actions that may detect the 
attacker. 

Edge   τ         Overlap   σ
25     0.00220   [9,13]    0.20
41     0.09935   [9,16]    0.35
85     0.11143   [11,18]   0.35
159    0.09935   [16,19]   0.15
186    0.00220   [17,19]   0.10
206    0.11143   [18,19]   0.05


Table 4. Comparison of the CCP strategy and the 
purely randomized strategy. 

Edge   Overlap   c        rc       σ
82     [11,19]   0.1926   0.0046   0.4
98     [12,19]   0.1942   0.0139   0.35
156    [15,19]   0        0.0019   0.2
176    [16,19]   0        0.0071   0.15
196    [17,19]   0        0.0024   0.1
216    [18,19]   0        0.0039   0.05
425    [9,10]    0        0.0100   0.05
430    [9,11]    0.3358   0.0274   0.1

Figure 4. The optimal patrolling strategy and the attacker’s best response (Stackelberg equilibrium). 

The attacker’s best response in the SE is to 
attack plant ‘E’ at time 9, shown in the figure as 
a red bold line. The short lines above the attacker’s 
best response line represent the defender’s patrolling 
actions that overlap with the 
attacker’s strategy. Table 3 shows detailed 
information on these patrolling actions. 

Based on the results in Table 3, we have: 
$f_p = \sum_r \tau_r \cdot \sigma_r = 0.09491$, $f = 1 - (1 - 0.5) \cdot (1 - f_p) = 0.54745$, 
$u_a = 2.88311$, $u_d = -6.24074$. 
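
These quantities can be recomputed directly from the tables. The sketch below evaluates Formulas (7), (6), (4) and (5) on the rows of Table 3 and the plant ‘E’ inputs of Table 2 (reward 1, loss 15, gain 10, penalty 3, IDS detection probability 0.5). The variable names are illustrative and the code is not taken from the paper.

```python
# (tau_r, sigma_r) pairs from Table 3: attack on plant 'E' starting at time 9.
table3 = [
    (0.00220, 0.20),
    (0.09935, 0.35),
    (0.11143, 0.35),
    (0.09935, 0.15),
    (0.00220, 0.10),
    (0.11143, 0.05),
]

# Plant 'E' inputs from Table 2.
reward, loss, gain, penalty, f_ids = 1, 15, 10, 3, 0.5

f_p = sum(tau * sigma for tau, sigma in table3)   # Formula (7)
f = 1 - (1 - f_ids) * (1 - f_p)                   # Formula (6)
u_d = reward * f - loss * (1 - f)                 # Formula (4)
u_a = gain * (1 - f) - penalty * f                # Formula (5)

print(f"f_p={f_p:.5f} f={f:.5f} u_a={u_a:.5f} u_d={u_d:.5f}")
```

With the table values as printed, the patroller's detection probability comes to about 0.0949 and the overall detection probability to about 0.5475, which reproduces the stated payoffs of 2.88311 for the attacker and -6.24074 for the defender to within rounding.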


4.2 Comparing to random patrolling 


In current patrolling practice, patrollers may 
schedule their patrol route randomly. In terms of 
Figure 3, this simply means assigning 
the same probability to all edges that start from 
the same node. For instance, at the starting node 
(i.e., (0, ‘cr’)), the patroller would go to plant 
(entrance) ‘B2’, ‘D’, and ‘E’ with the same 
probability, namely 1/3. 
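
A purely randomized strategy of this kind can be generated mechanically from the patrolling graph. The sketch below is illustrative only: the function name and the small edge list are hypothetical, and it relies on every edge moving strictly forward in time, so that visiting nodes in order of their time stamp processes each node after all of its predecessors.

```python
from collections import defaultdict

def uniform_random_edge_probs(edges, start):
    """Split each node's arrival probability equally over its outgoing edges."""
    out = defaultdict(list)
    for s, e in edges:
        out[s].append(e)
    arrive = defaultdict(float)
    arrive[start] = 1.0
    probs = {}
    # Nodes are (t, i) tuples; sorting by t gives a valid processing order.
    for node in sorted(out, key=lambda n: n[0]):
        share = arrive[node] / len(out[node])
        for nxt in out[node]:
            probs[(node, nxt)] = share
            arrive[nxt] += share
    return probs

# Hypothetical patrolling graph (not the case-study graph).
edges = [
    ((0, "base"), (2, "P")), ((0, "base"), (3, "Q")),
    ((2, "P"), (4, "base")), ((2, "P"), (6, "P")),
    ((3, "Q"), (6, "base")), ((3, "Q"), (8, "Q")),
    ((4, "base"), (6, "P")), ((4, "base"), (7, "Q")),
]
probs = uniform_random_edge_probs(edges, (0, "base"))
print(probs[((0, "base"), (2, "P"))], probs[((4, "base"), (6, "P"))])
```

The resulting edge probabilities satisfy the conservation property by construction, so they can be compared directly with an optimized strategy on the same graph.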

In the case study, if the defender purely 
randomizes her patrolling, then the attacker’s best 
response is to attack plant ‘A’ at time 9. 
The attacker and the defender would obtain payoffs 
of 4.0653 and -8.2393, respectively. Compared 
with the result of the CCP game, the defender 
suffers a higher loss. 

Table 4 illustrates the differences between the 
CCP strategy and the purely randomized strategy. 
The ‘Edge’ column in Table 4 lists the edges in 
the patrolling graph that overlap with the 
attacker’s strategy (i.e., attack plant ‘A’ at time 9). 
The ‘Overlap’ column shows which period of the 
attack procedure is overlapped by the edge. The ‘c’ 
and ‘rc’ columns show the probability that the 
patroller will take the edge, resulting from the CCP game 
and from the randomized strategy respectively. The 
σ column shows the probability that the attacker 
is detected by the patroller on this edge; it 
is simply calculated as 0.05 multiplied by the number 
of overlapped time units. According to the results in Table 4, 
we can calculate the probability that the attacker 
is detected by the patroller (see Formula 7) under 
each strategy: $f_p^{c} = 0.17860$ and $f_p^{rc} = 0.01183$. 
These results reveal that the CCP strategy has a 
higher probability of detecting the attacker at plant 
‘A’, and thus shifts the attacker’s best-response 
target from plant ‘A’ to plant ‘E’. 
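
The comparison can be reproduced from Table 4 and the plant ‘A’ inputs of Table 2 (reward 1, loss 16, gain 10, penalty 3, IDS detection probability 0.45). The following sketch uses illustrative names and is not the authors' code.

```python
# Rows of Table 4: (edge, c from the CCP game, rc from randomization, sigma).
table4 = [
    (82, 0.1926, 0.0046, 0.40),
    (98, 0.1942, 0.0139, 0.35),
    (156, 0.0, 0.0019, 0.20),
    (176, 0.0, 0.0071, 0.15),
    (196, 0.0, 0.0024, 0.10),
    (216, 0.0, 0.0039, 0.05),
    (425, 0.0, 0.0100, 0.05),
    (430, 0.3358, 0.0274, 0.10),
]

# Plant 'A' inputs from Table 2.
reward, loss, gain, penalty, f_ids = 1, 16, 10, 3, 0.45

def patroller_detection(rows, column):
    """Formula (7) using the chosen probability column (1 = c, 2 = rc)."""
    return sum(row[column] * row[3] for row in rows)

def payoffs(f_p):
    """Return (u_d, u_a) for an attack on plant 'A' (Formulas 4 to 6)."""
    f = 1 - (1 - f_ids) * (1 - f_p)
    return reward * f - loss * (1 - f), gain * (1 - f) - penalty * f

f_ccp = patroller_detection(table4, 1)
f_rand = patroller_detection(table4, 2)
u_d_rand, u_a_rand = payoffs(f_rand)
u_d_ccp, u_a_ccp = payoffs(f_ccp)

print(f"f_p: CCP={f_ccp:.5f} random={f_rand:.5f}")
print(f"random strategy: u_a={u_a_rand:.4f} u_d={u_d_rand:.4f}")
print(f"attack on 'A' under CCP: u_a={u_a_ccp:.4f}")
```

Under the randomized strategy this gives payoffs close to the reported 4.0653 and -8.2393. Under the CCP strategy the attacker's payoff from attacking plant ‘A’ at time 9 drops to about 2.87, slightly below the 2.88311 obtained from attacking plant ‘E’, which is consistent with the shift of the best-response target.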


5 CONCLUSION 


Terrorism has been a global problem. The chemi- 
cal industry can be an attractive target for terror- 
ists, due to the existence of hazardous materials. 
A chemical cluster is formed by multiple chemical 
plants, and can be of extra interest for attackers. 

Besides intrusion security countermeasures at 
each plant, security countermeasures at the cluster 
level are also recommended. Current patrolling 
in chemical clusters is either single-plant based 
or purely randomized, which is neither economically 
efficient nor theoretically optimal. The security 
adversaries are human beings, and they may learn 
the patroller’s daily patrolling routes and plan their 
attack accordingly. 

This paper therefore proposes a chemical cluster 
patrolling (CCP) game. The CCP game generates 
randomized but strategic patrolling routes for the 
cluster patrolling team. The intelligent interactions 
between the patroller and the potential attackers 
are modelled in the CCP game. An illustrative case 
study shows that the patrolling routes generated 
by our CCP game outperform the purely 
randomized patrolling strategy. 


ABSTRACT: The aim of the study was to examine judgement of security problems when using public 
transportation. A self-completion questionnaire survey was carried out among representative samples of 
residents in six Norwegian urban areas (n = 1047). Respondents who most frequently use public travel 
modes assessed the security problems to be larger compared to less frequent users. Frequency of use as 
well as past personal experience of a security problem enhanced the assessment of future probability of 
experiencing such an event. Perceived risk consistency was also measured and a median split on both 
these variables was carried out and four groups emerged. The first group was a group of risk insensitive 
respondents, the second group consisted of risk inconsistent respondents, the third consisted of risk con- 
sistent respondents and the fourth was risk sensitive respondents. Travel mode use and personal experi- 
ence of a security-related problem were positively associated with risk sensitivity. 


1 INTRODUCTION 


More knowledge is needed to understand how 
future choices could be moved into a more pro- 
environmental and safe direction by use of pub- 
lic transport as well as walking and cycling. The 
current study focuses on the role of perception of 
security risks by use of public travel modes. Sev- 
eral studies carried out previously have examined 
perceived risk related to security (probability of 
violence, acts of terror, etc.) in public transport. 
Roche-Cerasi, Rundmo, Sigurdson and Moe 
(2014) showed that there were differences in overall 
perceived risk evaluations between frequent users 
of private travel modes and frequent users of public 
travel modes in representative samples in Paris and 
Oslo. The group including those who most frequently 
used public travel modes perceived the probability 
of experiencing violence and acts of terror on public 
travel modes to be larger than did those who most 
frequently used a car. Frequent public travel mode users 
were also found to be more worried about security 
issues on public transportation. In a representative 
sample of the Norwegian public, Rundmo, Nord- 
fjærn, Iversen et al. (2011) also reported differences 
in perceived risk evaluations and worry with regard 
to criminality and acts of terror on public travel 
modes when comparing a group of respondents 
who used public travel modes most frequently with 
a group of respondents who most frequently used 
private motorised travel modes. Frequent users of 
public transport perceived the security problems to 
be more probable than did frequent users of their 
own car. Nordfjærn, Şimşekoğlu, Lind et al. (2014) 
showed that perceived risk evaluations and worry 
related to terrorism, sabotage, theft, harassment 
and other uncomfortable episodes, as well as vio- 
lence significantly predicted travel mode use. 

The studies presented above primarily examined 
the role of perceived risk and worry in travel mode 
use. Accordingly, the current study also hypoth- 
esises perceived risk evaluations and worry to be 
significant predictor variables of travel mode use, 
i.e. use of public travel modes versus car. In addition,
perceived risk evaluations have been distinguished
from risk sensitivity (Sjöberg, 1996). Risk
sensitivity is the general tendency to perceive all 
risks as large or small. Consequently, risk sensitiv- 
ity is linked to perceived risk. The current study 
aims to examine coincidence between perception 
and sensitivity in risk judgements. 

Rundmo and Moen (2006) showed that per- 
ceived risk evaluations in various types of trans- 
port were significantly associated. Those who 
perceived the risk to be large concerning one type 
of transport also perceived the risk to be large in 
other areas, and vice versa. Sjöberg (1996, 2004) 
found that risk amplification-attenuation was a 
significant predictor variable of personal as well 
as general perceived risk evaluations. The risk 
amplification-attenuation framework attempts to 
explain the process by which risks are amplified, 
receiving public attention, or attenuated, receiving
less public attention. In a random sample of
the Swedish public, Sjöberg (1996) showed that
risk sensitivity was a significant predictor variable
of general perceived risk evaluations among
other factors, e.g. trust in authorities, attitudes, 
and the pooled original psychometric dimensions 
(see Fischhoff, Slovic, Lichtenstein et al., 1978). 
In these studies perceived risk evaluations and risk 
sensitivity were distinguished and measured as two 
different constructs. However, if risk is rated to be 
high in one domain it is more likely to be rated as 
high in another domain, which implies risk sensi- 
tivity. Thus, it could be that risk sensitivity is iden- 
tical to perceived risk evaluations. 

In the current study risk sensitivity is conceived
to be pooled perceived risk evaluations. While risk
evaluations have been found to be equal to hazard
perception (Rundmo & Nordfjærn, 2017), risk sensitivity
adds to the understanding of the concept
by introducing an element of perceived risk that
is a general tendency in risk perception and is
independent of the object that is perceived. Thus,
perceived risk evaluations consist of two elements.
The first is linked to the characteristics of the
object that is perceived (i.e. subjective assessment
of the probability of experiencing a negative event
and judgement of severity of consequences if it
should occur). The second is a general characteristic
reflected in the sensitivity to risk in general. Risk sensitivity
should also be distinguished from perceived risk
consistency. Respondents may vary in risk sensitivity,
i.e. the level of perceived risk when judging
a number of risk sources, as well as in risk consistency,
i.e. how consistent or stable the perceived risk level
is across all the sources. The current study
hypothesises that risk sensitivity and risk stability
significantly predict travel mode use. The current
study aims to investigate the role of priority of 
security and risk sensitivity in use of public travel 
modes versus car among an urban public. An 
additional objective is to discuss the relationship 
between risk perception and risk sensitivity. 


2 METHODS 


The results of the current study are based on a 
self-completion questionnaire survey carried out 
among residents above 18 years of age in the six 
most urbanised areas in Norway. A total of 1043
respondents replied to the questionnaire, giving a
response rate of 18%. The respondents
were asked to assess the occurrence probability of a 
security problem on a seven-point evaluation scale 
ranging from ‘not at all probable’ to ‘very probable’ 
(see also Rundmo et al., 2011). Transport mode 
use was measured by asking the respondents about 
their weekly ordinary use of transport modes. The 
questionnaire also contained questions about the 
demographic characteristics of the respondents, 
including gender, age, level of education and car
access. They were also asked about whether they
themselves had been victimised due to a security- 
related problem when using public transportation 
and how they prioritised security measures to pre- 
vent theft, harassment and acts of terror. 


3 RESULTS 


The respondents assessed the probability of “theft” 
and “harassment” to be larger compared to the 
other types of security problems that were meas- 
ured. The probability of “sexual assault” and “ter- 
rorism” was perceived to be low. A MANCOVA 
aimed to examine differences in the respondents’ 
assessment of probability for experiencing security 
problems in such transport was carried out (signifi- 
cant differences shown in bold). Those who most 
often used public travel modes were compared to 
those who most frequently used car. There was a 
significant overall difference in judgement of security
problems (Wilks’ λ = .95, p < .001). Gender,
age group and educational level were covariates in 
the analysis. The frequent public travel mode users 
perceived the probability of security problems 
in general to be larger compared to the group of 
frequent users of private motorised travel modes. 
This was the case for assessment of the probabil- 
ity for “theft” (F = 9.17, p < .01), “sexual assault” 
(F = 7.68, p < .01), “harassment” (F = 7.20, p < 
.01), and “terrorism” (F = 9.37, p<.01). There were 
also tendencies, however not significant, in the 
same direction for assessments of “blind violence” 
and “sabotage”. Thus, respondents who most fre- 
quently used public travel modes assessed the secu- 
rity problems related to use of such modes to be 
larger compared to less frequent users. 

The respondents were also asked to assess the 
probability of “being too late at work” due to 
public travel mode delay. In this case the group dif- 
ference was in the opposite direction. Those who 
most seldom used public transportation assessed 
the probability of delay to be larger when using 
public travel modes compared to the cluster group 
of frequent users of such modes (F = 12.33, p < 
.001). 

It could be argued that “being too late at work” 
is not a security problem in line with the other 
security problems in public transport. While the 
other items have to be conceived as the probabil- 
ity of outside inflicted damages, the probability 
of “being too late at work” is either caused by the 
traveller or by the operating travel company. In this 
case it is not a problem inflicted to a victim during 
the travel. Most often the consequences are more
trivial compared to other security problems concerning
public transportation. The internal consistency
of the judgements of risk sensitivity measured
on a single dimension was satisfactory (α = .839).
When excluding the item “being too late at work”,
the reliability improved marginally (α = .859).
There were large and significant positive associa- 
tions between the probability assessments of the 
security problems, i.e. those who assessed one secu- 
rity problem to be large also tended to assess the 
other security problems in the same way and vice 
versa, indicating the presence of risk sensitivity. 
The coefficients varied between .15 and .65. 
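
The reliability figures above read like Cronbach's α, although the text only prints the symbol. Assuming that is the coefficient meant, the following minimal sketch shows how α, and α with one item excluded, could be computed. The function names and the ratings are hypothetical and are not the survey data.

```python
from statistics import pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha for rows of item scores (one row per respondent)."""
    k = len(rows[0])  # number of items in the scale
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across respondents.
    item_vars = [pvariance([row[i] for row in rows]) for i in range(k)]
    # Variance of each respondent's summed score.
    total_var = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def alpha_without_item(rows, drop):
    """Alpha recomputed after removing the item at index `drop`."""
    kept = [[v for j, v in enumerate(row) if j != drop] for row in rows]
    return cronbach_alpha(kept)


# Hypothetical 1-7 ratings: three security items plus a delay item (last column).
ratings = [
    [2, 3, 2, 6],
    [5, 5, 6, 2],
    [1, 2, 1, 5],
    [6, 6, 7, 3],
    [3, 4, 3, 4],
]
print(round(cronbach_alpha(ratings), 3))
print(round(alpha_without_item(ratings, drop=3), 3))
```

Dropping an item that runs against the others raises α, which is the pattern reported when “being too late at work” is excluded.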

In addition to probability assessments the 
respondents were also asked whether or not they 
had been exposed to one or more security risks 
when using public transportation during the last 
five years. A total of 14.5 per cent reported that 
they had experienced themselves either “theft”, 
“sexual assault”, “harassment”, “terrorism”, or 
“sabotage” during this time period. Thus, in addition
to frequency of use of public travel modes,
the respondents’ previous experience of public
travel mode security hazards has to be taken into
consideration. A MANCOVA was carried out to
examine the differences in probability assessments 
of public travel security problems due to risk expo- 
sure (frequent users of public versus private trans- 
portation) and previous personal experience with 
security problems when using public transporta- 
tion. As shown there were significant differences in 
the judgement of probability due to previous hazard
experience (Wilks’ λ = 0.87, p < .001).

A median split on risk sensitivity as well as risk
stability was carried out and four groups emerged.
The first group consisted of respondents who scored
low on risk sensitivity as well as on risk consistency
(n = 342). In addition to rating the probability as
low they were characterised by a low SD, indicating
that they judged the overall probability to be low, i.e.
a consistently low score. This group could be defined
as consisting of risk ignorant respondents. The next
group consisted of respondents who rated some of
the risks to be large while their evaluations
were characterised by a low SD (n = 152). This
group was characterised by risk instability and consisted
of risk steady respondents. The third group,
people who rated most of the probabilities to be
low, was however characterised by a high SD, indicating
variety in judgements (n = 188). This group
was labelled risk fluctuating respondents. Finally,
there were respondents who rated the probability of
all risks to be high, and consequently the group was
characterised by a low SD (n = 351). This group was
labelled risk sensitive respondents. A total of 52.2
per cent of the respondents scored high on risk stability
and 48.7 per cent on risk sensitivity.
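
The grouping procedure can be sketched as follows, under stated assumptions: risk sensitivity is taken as each respondent's mean probability rating, risk stability as the standard deviation of those ratings, and scores equal to the median are counted as "low". The function name, the tie rule and the demo data are hypothetical. The sketch returns neutral (level, spread) labels because the text does not map every cell of the 2 x 2 split unambiguously to a named group.

```python
from statistics import mean, median, pstdev


def median_split_groups(ratings):
    """Map each respondent to a (level, spread) pair of 'low'/'high' labels."""
    level = {rid: mean(vals) for rid, vals in ratings.items()}
    spread = {rid: pstdev(vals) for rid, vals in ratings.items()}
    level_cut = median(level.values())
    spread_cut = median(spread.values())
    groups = {}
    for rid in ratings:
        # Assumption: scores equal to the median count as "low".
        lvl = "high" if level[rid] > level_cut else "low"
        spr = "high" if spread[rid] > spread_cut else "low"
        groups[rid] = (lvl, spr)
    return groups


# Hypothetical respondents with three probability ratings each.
demo = {"a": [1, 1, 1], "b": [1, 4, 1], "c": [6, 6, 6], "d": [7, 3, 7]}
for rid, cell in median_split_groups(demo).items():
    print(rid, cell)
```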

The number of risk sensitive female respondents
was larger than expected by statistical inference
compared to male respondents, χ² = 30.06, p < .001.
There were also more risk sensitive respondents
than expected among those who had a vocational
practical education and a university education
of less than three years. The number of risk insensitive
respondents was larger than expected by statistical
inference among those who had a university
education lasting for more than three years compared
to the other groups, χ² = 23.71, p < .05. The
age groups were also compared and risk sensitivity
decreased with age, χ² = 28.71, p < .01.
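
The comparisons of observed and expected group sizes correspond to a chi-square test of independence on a contingency table. A minimal sketch of the statistic is given below. The counts are hypothetical and are not the survey data, and the function name is an assumption.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a 2-D table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows = gender, columns = risk sensitive / not sensitive.
stat, dof = chi_square([[30, 20], [15, 35]])
print(round(stat, 2), dof)
```

The p-value would then be read from a chi-square distribution with `dof` degrees of freedom.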

The respondents were also asked to rate how 
they prioritized security in transport, i.e. how 
important it was for themselves. This evaluation 
included the importance of implementing counter- 
measures to prevent theft, harassment and terror. 
The results showed a significant overall difference 
in priority of security due to risk sensitivity, Wilks’
λ = .96, p < .001, gender, Wilks’ λ = .7, p < .001,
educational level, Wilks’ λ = .97, p < .001, and age
group, Wilks’ λ = .97, p < .01. However, there were
no significant differences in priority of security due
to past security risk experience, Wilks’ λ = .99, NS.
Risk ignorant and risk steady respondents priori- 
tised security measures to a lower extent compared 
with risk fluctuating and risk sensitive respondents. 

The post hoc tests showed that there were no
statistically significant differences between the 
security priorities of the groups of risk insensi- 
tive and risk stable respondents. The tests showed 
an identical pattern of differences for all the three 
assessed factors. The priority of security measures 
in the risk sensitive group differed significantly 
from all the other three groups. However, both the 
two last groups (the groups of risk steady and risk 
sensitive respondents) differed significantly from 
risk fluctuating and risk sensitive respondents, p 
< .001. This was the case for priority of counter- 
measures to reduce the risk of theft, harassment as 
well as terror. Of note, there were no significant dif- 
ferences between the security priorities of the risk 
fluctuating and risk sensitive respondents. The pat- 
tern was the same for all the three prioritised areas. 
Thus, risk fluctuation and risk sensitivity seem to
enhance the priority of security measures to reduce 
the risk of theft, harassment as well as terror. Risk 
perception varied due to previous risk security 
experience. However, there were no significant dif- 
ferences in priority of risk reduction measures due 
to past experience of security problems in public 
transport. 

The next step was to examine how priority of 
security and risk sensitivity predicted travel mode 
use. Hierarchical and k-means cluster analysis was 
carried out separately for leisure and work trav- 
els to identify mode user groups. The first group 
consisted of those who most frequently used pub- 
lic transport and the second group consisted of 
respondents who mainly used car. The same cluster 
groups emerged for leisure as well as work travels.
The first analysis concerned leisure travelling. As
expected the first block consisting of demographic
variables, distance from home to the nearest public
transport point, and car access significantly
predicted leisure travel mode use (χ² = 213.74,
p < .001). Adding priority of security significantly
improved the model (χ² = 16.05, p < .05) and adding
risk sensitivity as the final block added further
to the explained variance (χ² = 4.03, p < .05).

When all the blocks were adjusted for in the 
model, level of education (OR = 1.36, p < .001), 
annual income (OR = 0.67, p < .01), and car access 
(OR = 0.02, p < .001) significantly predicted travel 
mode use. Annual income and access to car were 
negatively associated with use of public transportation.
Gender, minutes to walk from home to the
nearest public transport point and previous personal
experience with security hazards were non-significant
predictor variables. In the second block
the significant predictor variables were priority of
security against theft (OR = 1.80, p < .05) and harassment
(OR = 1.41, p < .001). Priority of countermeasures
to reduce theft was largest among
frequent car users and countermeasures for reducing
harassment were prioritised among public travel
mode users. Finally, the probability of belonging
to the group of frequent public travel mode
users was larger when risk sensitivity increased
(OR = 1.24, p < .05).
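
The odds ratios and pseudo-R² values suggest a blockwise (hierarchical) logistic regression, although the text does not name the method. Assuming that model, with the outcome coded as membership in the frequent public travel mode group, the reported odds ratios relate to the regression coefficients as follows. The symbols are introduced here for illustration.

```latex
\[
\ln\frac{P(\text{public})}{1 - P(\text{public})}
  = \beta_0 + \sum_{j} \beta_j x_j ,
\qquad
\mathrm{OR}_j = e^{\beta_j}
\]
```

Under this reading, OR = 1.24 for risk sensitivity means that a one-unit increase multiplies the odds of belonging to the public travel mode group by 1.24, while OR = 0.02 for car access means the odds are about 98 per cent lower for respondents with access to a car.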

Concerning work travels, the predictor variables
of the first block also significantly predicted work
travel mode use (χ² = 57.62, p < .001). This was
also the case for the third block (χ² = 4.37, p <
.05). However, the predictors of the second block
seemed not to be equally important for mode use
on work travels (χ² = 1.53, NS) compared to leisure
travels. The significant predictor variables in
the final analysis were annual income (OR = 0.69, p
< .05) and access to car (OR = 0.25, p < .001) from
the first block. Risk sensitivity also significantly
predicted mode use (OR = 1.28, p < .05). The prediction
of leisure travel mode use (Cox & Snell’s
R² = 0.35, Nagelkerke’s R² = 0.49) was more successful
than predicting mode use on work travels
(Cox & Snell’s R² = 0.11, Nagelkerke’s R² = 0.20).
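
The two pseudo-R² measures quoted above are both derived from the log-likelihoods of the fitted model and of the intercept-only model. The sketch below uses the standard definitions. The function names and the numeric inputs are hypothetical and are not values from the study.

```python
import math


def cox_snell_r2(ll_null, ll_model, n):
    """Cox & Snell R^2 from log-likelihoods of the null and fitted models."""
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)


def nagelkerke_r2(ll_null, ll_model, n):
    """Cox & Snell R^2 rescaled by its maximum attainable value."""
    max_r2 = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell_r2(ll_null, ll_model, n) / max_r2


# Hypothetical log-likelihoods for a sample of 500 respondents.
print(round(cox_snell_r2(-340.0, -250.0, 500), 3))
print(round(nagelkerke_r2(-340.0, -250.0, 500), 3))
```

Because Nagelkerke's measure is the Cox & Snell value divided by its ceiling, dividing the reported pairs gives implied ceilings of roughly 0.71 for leisure travels and 0.55 for work travels.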


4 DISCUSSION 


The current study showed that priority of security
and risk sensitivity were significant predictors
of travel mode use among an urban public when
demographic factors were controlled for. In studies
carried out previously (e.g. Sjöberg, 1994, 2004)
risk sensitivity was conceived to be a predictor of 
“risk perception”. However, risk sensitivity is the 
tendency to perceive all risks to be high and risk 
insensitivity the opposite. Explaining a large percentage
of the variance in “perceived risk” (Sjöberg,
1996, 2004) could partly have been caused by the
use of risk sensitivity as a predictor variable which
is coincident with the criterion variable. Rundmo
and Nordfjærn (2017) found no support for the fit
of the data to a model where risk perception was 
conceived to be a formative construct of subjec- 
tive assessments of probability and judgement of 
severity of consequences. Subjective judgements 
of risk should be conceptualised as perceived risk 
assessments when the judgements are not consid- 
ered to be a formative construct. 

The current study was restricted to examining 
the probability component of the perceived risk 
evaluations. In future research the role of sever- 
ity of consequences should also be included. This 
could add to explained variance in prediction of 
travel mode use. Sjöberg (1999) as well as Rundmo
and Moen (2006) showed that the subjective judgement
of severity of consequences if a negative event
should occur was a more significant predictor of
demand for risk mitigation compared to the probability
assessment. Studies carried out previously
have shown that subjective assessments of risk and
judgement of severity of consequences may predict
worry, and worry has been found to be associated
with demand for risk mitigation in transport
(Rundmo & Moen, 2007) as well as mode use
preferences (Nordfjærn et al., 2014; Rundmo et
al., 2011). Probability assessment was found to be
rather insignificant for such demands. The cur- 
rent research did not aim to re-examine the role of 
worry in travel mode use. It is interesting to note 
that the assessment of the probability-component 
of risk sensitivity alone contributed significantly 
to prediction of travel mode use (public transpor- 
tation versus use of car). 

The results of the current study showed that the
same set of predictor variables explained a significantly
larger proportion of the variance in
leisure travel mode use compared to work travel
mode use. There could be several explanations for 
this. Most obvious is that the freedom to choose 
travel modes could be different on the two types 
of travels. It may be that the freedom of choice is 
larger for leisure travels compared to work travels. 
As expected, access to private travel modes was an
important predictor of travel mode use. In addi- 
tion, the power of this predictor variable was sig- 
nificantly larger for travel mode use on work travels 
compared to leisure travels, indicating that it may 
be easier to choose other travel modes for leisure 
travels. Also demographic factors seemed to be of 
less importance for travel mode use in leisure time 
than for work travels. Other possible explanations, 
which could be further investigated, include pos- 
sible differences in the role of habits. Because work 
travels could have a more repetitive nature than
many travels conducted during the leisure time,
habits could play a larger role in mode use on these 
travels (see also Bamberg et al., 2003). 

In this study, perceived risk evaluations and risk 
sensitivity have been considered to contain the same 
data of evaluations of probability assessments and 
judgement of severity of consequences. Conse- 
quently, the concepts of perceived risk evaluations 
and risk sensitivity are unquestionably woven 
together. They are two parts of the same intuitive 
evaluation of risk; a direct assessment of probabil- 
ity of an event with negative consequences and the 
stability or consistency in the evaluation of various 
risk sources. The first element may vary because 
it relates closely to the hazard or object of evalu- 
ation. The second element is not primarily related 
to the object, but is a general tendency influencing
the direct evaluation of risk. This element
consists of two elements. The first is risk sensitiv- 
ity, i.e. judging a set of hazards or risk sources on 
a continuum varying from high (indicating risk 
sensitivity) to low (indicating risk ignorance). The 
second element is risk stability, which varies from 
high (indicating risk stability) to low (indicating 
risk flexibility). Perceived risk evaluation is only 
the “basis material” for calculating risk sensitiv- 
ity. Therefore, risk sensitivity is the main concept, 
covering the perceived risk evaluations, including 
intuitive judgments of probability as well as sever- 
ity of consequences across a set of risk sources. 

Further research should examine predictors of 
risk sensitivity as well as risk stability in further 
detail. It could be interesting to investigate how 
aspects of subjective risk judgements not directly 
associated with characteristics of the risk source 
influence judgement. Therefore, research on risk 
sensitivity should be given priority, not the mere 
analysis of single hazard or risk source evaluation. 
Further investigations could focus on associations 
between personality variables and risk sensitiv- 
ity, which in previous research have been found to 
be associated with risk perception as well as risk- 
taking behaviour. Such behaviour has been con- 
nected to personality factors e.g. sensation seeking. 
Another hypothesis is that attitudinal factors, e.g. 
attitudes towards risk-taking and risky behaviour, 
may stabilise risk sensitivity on a high or low level,
working more or less independently in the judgement
of single hazards or risks. The current study
showed that priority of security also was associ- 
ated with travel mode use. The relations between 
priority of security, personality factors and risk 
sensitivity should be investigated more thoroughly 
in future research.


ABSTRACT: The development of digital and network technology has led to major changes in industry,
especially in ICS (Industrial Control System) and SCADA (Supervisory Control And Data Acquisition)
systems. Analogue components in facilities are being replaced by digital components, and new ICS
and SCADA facilities are composed of various digital instrumentation and control systems.
Because of these changes, the security of ICS and SCADA systems has become an important factor in
industry. Nevertheless, there is little research on cyber-attack taxonomy for ICS and SCADA systems.
Even though some papers have suggested a cyber-attack taxonomy, none is sufficiently comprehensive
for industry-oriented ICS and SCADA systems. Therefore, this paper proposes a classification
scheme for the cyber-attack taxonomy of PLC (Programmable Logic Controller), DCS
(Distributed Control System), and network equipment, which are core components of ICS and SCADA
systems. In this paper, cyber-attacks are subdivided into foot printing/scanning, password cracking,
spoofing, sniffing, hijacking, MITM (Man In The Middle), virus, DoS (Denial of Service), backdoor
installation, and hiding files. These ten cyber-attack categories are related to the cyber-attack scenario, which
is composed of prior preparation, gaining access, maintaining access, and clearing tracks. The grouped
categories mentioned above are subdivided according to the characteristics and principles of each
cyber-attack. The cyber-attack access path is classified into physical/network, internal/external, and
accidental/intentional access. After that, the detailed method of access is investigated. The consequences
that can be caused by cyber-attacks are defined as disclosure, modification, destruction, and interruption.
Finally, for the prevention and mitigation of cyber-attacks, specific ways to reduce or avoid the
damage caused by a cyber-attack are suggested.


1 INTRODUCTION 


In ICS (Industrial Control System) and
SCADA (Supervisory Control And Data Acquisition)
systems, analog instrumentation & control
devices are being replaced by digital instrumentation
& control devices because digital control
systems provide better control accuracy and ease
of maintenance. However, digital control systems
are vulnerable to cyber-attack.
An example of the seriousness of the situation
is Stuxnet, which was discovered in 2010 and
caused great damage to the Iranian nuclear
infrastructure and major industrial infrastructure
in China [1]. This example implies that ICS &
SCADA systems, which have closed networks with
an air gap, are no longer safe from cyber-attack.
Therefore, a new digital instrumentation & control
device with a cyber security function is required to protect
against malicious cyber-attacks. In order to apply a
new measurement control device with a cyber security
function to ICS & SCADA systems such as nuclear
power plants, a compatibility test for cyber security
must be performed along with the development of
design requirements. In order to conduct a conformance
test for cyber security, a systematic classification
of attack types and a cyber-attack taxonomy investigation
including the latest attack types must come first.
Most of the preceding cyber-attack taxonomies
and research focused on IT (Information Technology).
However, such a classification scheme is not suitable
for ICS & SCADA systems, which are composed
of closed networks with an air gap and for which
the characteristics of equipment such as DCS
(Distributed Control System) and PLC (Programmable
Logic Controller) must be taken into account. Therefore, this paper suggested
a systematic classification scheme for cyber-attack
taxonomy that selectively considered the characteristics
of ICS & SCADA systems. The paper classified
cyber-attacks as foot printing & scanning, password
cracking, spoofing, sniffing, hijacking, MITM
(Man In The Middle), virus, DoS (Denial of Service),
backdoor installation, and hiding files according to
the general cyber-attack scenarios. Then, each category
of attacks is subdivided into detailed attacks,
access means, result of cyber-attack, vulnerability,
and defense against cyber-attack. In addition, the classification
of cyber-attack taxonomy can be applied to
identify the attackable digital devices such as DCS,
PLC, and network device and the interval of cyber-attack
test such as DCS to network device, DCS to
PLC, and so on.
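
One way to picture a single record in such a classification scheme is sketched below. It is a hypothetical data structure, not the authors' implementation: the class and field names are assumptions, the access-path and consequence values follow the abstract, and the example entry is illustrative only.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    FOOTPRINTING_SCANNING = "foot printing & scanning"
    PASSWORD_CRACKING = "password cracking"
    SPOOFING = "spoofing"
    SNIFFING = "sniffing"
    HIJACKING = "hijacking"
    MITM = "MITM"
    VIRUS = "virus"
    DOS = "DoS"
    BACKDOOR_INSTALLATION = "backdoor installation"
    HIDING_FILES = "hiding files"


class Asset(Enum):
    PLC = "PLC"
    DCS = "DCS"
    NETWORK_DEVICE = "network device"


class Consequence(Enum):
    DISCLOSURE = "disclosure"
    MODIFICATION = "modification"
    DESTRUCTION = "destruction"
    INTERRUPTION = "interruption"


@dataclass(frozen=True)
class AccessPath:
    medium: str  # "physical" or "network"
    origin: str  # "internal" or "external"
    intent: str  # "accidental" or "intentional"


@dataclass(frozen=True)
class TaxonomyEntry:
    category: Category
    detailed_attack: str
    access: AccessPath
    consequences: frozenset
    targets: frozenset
    vulnerability: str
    defense: str


# Illustrative entry only; the field values are not taken from the paper.
example = TaxonomyEntry(
    category=Category.PASSWORD_CRACKING,
    detailed_attack="dictionary attack",
    access=AccessPath("network", "external", "intentional"),
    consequences=frozenset({Consequence.DISCLOSURE}),
    targets=frozenset({Asset.PLC}),
    vulnerability="weak or default password",
    defense="strong password policy and account lockout",
)


def entries_for(asset, entries):
    """Return the entries whose targets include the given asset."""
    return [e for e in entries if asset in e.targets]


print(len(entries_for(Asset.PLC, [example])))
```

Filtering by target asset mirrors the stated use of the taxonomy for identifying attackable devices such as DCS, PLC, and network devices.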


2 BACKGROUND 


2.1 General cyber-attack process 


As mentioned above, this paper classified cyber-attacks
into ten categories. Before the systematic
classification scheme for cyber-attack taxonomy is
described, a brief description of the ten categories
is presented.

Foot printing & scanning is the preliminary
task of gathering information about the system to
be attacked. Password cracking is an attack that
extracts passwords through various methods and
means; an attacker can gain full access rights to the
system through password cracking. Spoofing means
deceiving: the attacker masquerades as a trusted party,
and the attack is possible on any connection
that exists on the Internet or locally. Sniffing
is the act of eavesdropping on packets exchanged by other
parties on the network, similar to the dictionary
meaning of ‘sniff’. Session hijacking is a cyber-attack
technique that steals and uses another
person’s session state. MITM is a way of intercepting,
and possibly altering, the information exchanged between two
parties communicating with each other. A computer
virus is a type of malicious software program
that can replicate itself by modifying other computer
programs and inserting its own code.

A DoS is an attack in which the attacker
depletes the resources of the target so that legitimate
users are no longer able to receive the services that
depend on those resources. Backdoor installation is
a method of bypassing authentication to
ensure remote access. Hiding files is a process
that clears the traces of the attack and deletes the
related logs to avoid tracing, and also hides key files
or information for later use or for confidential use
of information stored in the victim system.


2.2 Digital assets for ICS 


Analyzing the digital devices is necessary to relate
each cyber-attack type to a digital asset.
This paper analyzed PLC, DCS, and network
devices, which are core components of ICS &
SCADA systems.


2.2.1 PLC 

The PLC is a digital control device that can store
commands and execute control algorithms that
perform various functions such as logic, operation,
counting, and sequential processing to control
various types of machines or processes. The PLC
basically consists of a central processing unit that manages
and controls all the functionality, a process input/output
unit for exchanging signals, a memory for storing
programs, a power unit for supplying electricity
to the PLC, and other peripherals [2,16]. The differences
between PLC and analog type relays are
shown in Table 1.


2.2.2 DCS 

The DCS divides the functions of a single central
processing unit and distributes them. In the overall system
configuration, computers, each with a small
number of central processing units, are connected via a
communication network. The DCS was developed to replace the
PID (Proportional Integral Derivative) controller
and is generally used to control complex processes. The
basic feature of the DCS is to distribute the process control
functions to several computers to improve
reliability and minimize the ripple effect in the event
of an error. In addition, it facilitates data processing
and operation management by concentrating
the information, operation, and management
functions of the distributed computers on the Distributed
Operate Console. The basic components of the
DCS are a CPU (Central Processing Unit) that processes
data in real time, a data highway, a LAN (Local
Area Network) connecting the communication lines, a
signal input/output unit, a power supply interface,
and a power supply [2]. The
comparison between PLC and DCS is shown in
Table 2.


2.2.3 Network 

The network has a fieldbus, which is a type of digital
network system. The fieldbus has been developed
to digitally replace analog signals, providing
high accuracy and reliability. The fieldbus is a
path for transferring network data from the field
where PLC-based control equipment exists to the
network system between equipment. In addition,
there are the fieldbus as a controller subnetwork, which
digitizes and networks the connection between analog type
control devices and field devices, and the sensor bus,
which is a network that transmits sensor signals
at high speed. The fieldbus as a controller subnetwork
is also classified as an FA (Factory Automation)
system or a PA (Process Automation)
system. The fieldbus requires only one signal line
to transmit signals to the main cable to which a
plurality of cells can connect, thereby reducing
the wiring cost. It not only makes it easy for
the operator to identify all the devices included in 
the system, but also facilitates mutual operation 
between individual devices, thereby lowering the 
maintenance cost and ensuring reliability. In addition,
it has the advantage that system performance
can be improved by facilitating the information
collection of field devices. Typical types of
these fieldbuses are Modbus, Interbus, Profibus,
RS232C/RS485, and Ethernet TCP (Transmission
Control Protocol)/IP (Internet Protocol) [9].


Table 1. Differences between Relay and PLC.

                      Relay                                              PLC
Control method        Hard Logic                                         Soft Logic
Control function      • Relay (AND, OR by serial/parallel)               • Relays (AND, OR, NOT, etc.)
                      • Timer                                            • Up/down counter
                      • Counter (The function is limited and             • Arithmetic operation
                        enlarged according to size)                      • Logical operation
                                                                         • Transmission (Function is limited and
                                                                           can be enlarged)
Control element       • Reed switch (Limited lifetime, low               • Solid contact (High reliability,
                        speed control)                                     long lifetime, high speed control)
Change content        Disconnection and rewiring of wire                 Change through program change
Construction period   • Control panel production after specification     • Specification can be made in parallel
                      • Prolonged inspection and commissioning period      with assembly of hard decision
                                                                         • Reduced inspection and
                                                                           commissioning period
Integrity             Difficult to repair                                High reliability and easy maintenance
Scalability           Difficulty in expanding the system                 • Easy to expand system
                                                                         • Transmission of work information
                                                                           is possible by connecting with computer
Size                  Difficult to miniaturize                           Possible to miniaturize


Table 2. Differences between PLC and DCS.

                 PLC                                              DCS
Control target   Unit control                                     Process system
Constitution     PLC                                              • PLC
                                                                  • Distributed I/O
                                                                  • Touch panel
Response time    Fast (relatively)                                Late (relatively)
                 Suitable for situations requiring
                 real-time action
Scalability      Manage thousands of I/O                          • Manage hundreds of thousands of I/O
                                                                  • Suitable for advanced process control
Complexity       Impossible to request PLC based control system   Better than PLC
Process change   Suitable for processes that are not              Ideal for complex or coordinated
                 frequently changed                               analysis that combines large amounts
                                                                  of data


Figure 1. Typical structure of PLC.
Figure 2. Structure of DCS.


3 CLASSIFICATION SCHEME FOR CYBER-ATTACK


3.1 Subdivision of attack categories 


This paper classifies the cyber-attack scenario into four stages: prior preparation, gaining access, maintaining access, and clearing tracks [18].

Prior preparation for a cyber-attack consists of gathering information about the target, such as its IP range, namespace, vulnerabilities, and protocols. Gaining access means bypassing access controls to gain access to the system. Maintaining access means retaining ownership of the system. Clearing tracks covers the activities carried out by an attacker to hide malicious acts.

After classifying a cyber-attack in terms of the scenario, the next step is to subdivide it further. The reason for the subdivision is that even if the basic principle of two cyber-attacks is the same, the conditions and vulnerabilities they rely on differ, and so do the attack results and countermeasures. In addition, if cyber-attacks are subdivided and complementary measures are taken for each, the system can be made more stable. As a typical example, password cracking can be subdivided into keylogger attack, dictionary attack, hybrid attack, brute-force attack, and precomputed hash attack.

With these definitions, ten types of attacks were matched to the cyber-attack scenario; Table 3 shows their relationship.

Table 3. Cyber-attack scenario with corresponding attack.

| Cyber-attack scenario | Corresponding attack |
|---|---|
| Prior preparation: gathering information for attack | Footprinting & scanning; spoofing, sniffing, MITM |
| Gaining access: bypass access controls to gain access to the system | Password cracking; hijacking, DoS |
| Maintaining access: retain ownership of the system | Virus, backdoors |
| Clearing tracks: the activities carried out by an attacker to hide malicious acts | Hiding files |

Figure 3. Flowchart of cyber-attack taxonomy.
[Figure: each attack category (prior preparation, gaining access, maintaining access, clearing tracks, with the corresponding attacks of Table 3) is classified by access (internal/external, physical/network, accidental/intentional), result (leakage, modification, destruction, interruption), vulnerability (known vulnerability, vulnerability type), and defense (prevention, mitigation).]
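
As an illustration only, the relationship in Table 3 and the password cracking subdivision can be written down as a small lookup structure. The names `SCENARIO_ATTACKS`, `SUBDIVISIONS` and `stage_of` are assumptions made for this sketch and do not come from the paper.

```python
# Illustrative encoding of Table 3: scenario stage -> corresponding attacks.
# All identifiers here are hypothetical names chosen for this sketch.
SCENARIO_ATTACKS = {
    "prior preparation": ["footprinting & scanning", "spoofing", "sniffing", "MITM"],
    "gaining access": ["password cracking", "hijacking", "DoS"],
    "maintaining access": ["virus", "backdoor"],
    "clearing tracks": ["hiding files"],
}

# Subdivision of one category, as listed in the text.
SUBDIVISIONS = {
    "password cracking": [
        "keylogger attack",
        "dictionary attack",
        "hybrid attack",
        "brute-force attack",
        "precomputed hash attack",
    ],
}


def stage_of(attack: str) -> str:
    """Return the scenario stage that a given attack category belongs to."""
    for stage, attacks in SCENARIO_ATTACKS.items():
        if attack in attacks:
            return stage
    raise KeyError(f"unknown attack category: {attack}")


if __name__ == "__main__":
    print(stage_of("DoS"))
    print(sum(len(attacks) for attacks in SCENARIO_ATTACKS.values()))
```

Counting footprinting & scanning as one category, the four stages together hold the ten attack types referred to in the text.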


3.2 Classification scheme for cyber-attack 


Generally, a cyber-attack proceeds as follows. In the prior preparation stage, the attacker identifies information about the attack target, then attacks through the chosen access path and achieves the desired result. After that, security vendors develop a defense. Following this general scenario, this paper selected the items to be considered in the attack classification system: method of access, outcome, vulnerability, and defense. Figure 3 shows the overall flowchart of the cyber-attack taxonomy.


3.2.1 Access means of cyber-attack

The first classification item is the access means of the cyber-attack. ICS & SCADA systems are usually disconnected from external networks, so internal attacks are the main concern. However, in view of recent ICS & SCADA incidents such as Stuxnet, external attacks must also be considered as an access path. After distinguishing internal from external attacks, the next distinction is between physical access, which includes media such as USB or CD, and network access, which uses the communication network [4]. This distinction helps determine whether physical security or network and software security is required to increase the security of the system. The last distinction is whether an attack is intentional or accidental. Considering only intentional attacks limits how far availability can be guaranteed. Therefore, in order to provide a high level of availability and security for ICS operation, accidental access, such as misconfiguration of ICS components and ICS equipment failure, is also considered [2,7].
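
The three access distinctions described above can be sketched as a small data structure. This is a minimal illustration, not the paper's implementation: the class names, field names and the `required_security` helper are assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    INTERNAL = "internal"
    EXTERNAL = "external"


class Medium(Enum):
    PHYSICAL = "physical"  # e.g. USB or CD
    NETWORK = "network"    # via the communication network


class Intent(Enum):
    INTENTIONAL = "intentional"
    ACCIDENTAL = "accidental"  # e.g. misconfiguration or equipment failure


@dataclass(frozen=True)
class AccessMeans:
    """Hypothetical record holding the three access distinctions."""
    origin: Origin
    medium: Medium
    intent: Intent


def required_security(access: AccessMeans) -> str:
    """Map the access medium to the kind of security measure it calls for."""
    if access.medium is Medium.PHYSICAL:
        return "physical security"
    return "network and software security"


if __name__ == "__main__":
    usb_attack = AccessMeans(Origin.INTERNAL, Medium.PHYSICAL, Intent.INTENTIONAL)
    print(required_security(usb_attack))
```

Keeping the three distinctions as separate fields lets an entry be filtered on any one of them, for example to list all accidental access paths when assessing availability.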


3.2.2 Attack result from cyber-attack 

The next classification is based on the attack result. To derive the attack results, the ten cyber-attack categories mentioned earlier are considered. For example, the purpose of footprinting & scanning is to leak data by secretly viewing it. The goal of DoS is to destroy the system by saturating the system and server. The purpose of MITM is to modify the data exchanged between the client and the server. Considering the purposes of the remaining cyber-attacks as well, the results of an attack are classified as leakage, modification, destruction, and interruption [7].
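
A minimal sketch of this result classification follows, filling in only the three examples the text gives. The names `AttackResult`, `RESULT_BY_ATTACK` and `result_of` are assumptions for illustration.

```python
from enum import Enum


class AttackResult(Enum):
    LEAKAGE = "leakage"
    MODIFICATION = "modification"
    DESTRUCTION = "destruction"
    INTERRUPTION = "interruption"


# Only the three examples stated in the text are filled in; the remaining
# attack categories are deliberately left out rather than guessed.
RESULT_BY_ATTACK = {
    "footprinting & scanning": AttackResult.LEAKAGE,
    "DoS": AttackResult.DESTRUCTION,
    "MITM": AttackResult.MODIFICATION,
}


def result_of(attack: str):
    """Return the classified result, or None if the text gives no example."""
    return RESULT_BY_ATTACK.get(attack)


if __name__ == "__main__":
    print(result_of("MITM"))
```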


3.2.3 Vulnerability with cyber-attack

The next classification is based on the vulnerability. With the development of information technology, the complexity of software has increased and vulnerabilities have increased with it. A vulnerability is a flaw in hardware or software that can become a security threat and can be exploited directly by a hacker or as a means of spreading a virus or malicious program [15]. As a result, some countries have built vulnerability databases and developed countermeasures against the recorded vulnerabilities, which enables early response to specific cyber-attacks. This paper also uses vulnerability as one of the classification items, so that a specific cyber-attack can be matched with its vulnerability and responded to early. Listing the vulnerabilities of the software used in ICS and SCADA systems and investigating where that software is used makes it possible to anticipate cyber-attacks on the system and to prepare countermeasures and mitigations. In addition, newly discovered vulnerabilities can be continuously matched and added to the cyber-attack classification, which effectively upgrades and supplements the system. Vulnerability databases include the USA’s NVD (National Vulnerability Database), Japan’s JVN (Japan Vulnerability Notes) and CNVD (China National Vulnerability Database). In this paper, the widely used NVD, maintained by NIST (National Institute of Standards and Technology), is selected and shown in Table 4 [15].


3.2.4 Defense from cyber-attack

The last classification is the defense from cyber-attack. There are two ways to reduce the damage from a cyber-attack: prevention and mitigation [17].

Considering the typical vulnerabilities of ICS is a way to prevent cyber-attacks. Typical vulnerabilities of ICS are unauthorized protocols, aged hardware, vulnerable user authentication, vulnerable file integrity checks, vulnerable Windows operating systems, and undocumented third-party relationships [13].

The preventive measures derived from these ICS vulnerabilities are antivirus, firewall, one-way gate, and media control. Antivirus uses signature-based engines to detect and treat malware. A firewall allows or blocks information transmitted between ICS networks. A one-way gate establishes a physically unidirectional environment that blocks intrusion when infrastructure network data is linked with an external network. Media control disconnects interface devices, such as USB, that are connected to an ICS network [15]. Mitigation reduces the damage after a cyber-attack: treatment and mitigation are the measures needed to return the system to normal. Like vulnerabilities, treatments and mitigations are varied. As mentioned above, mitigation is linked with vulnerability and can be used for early response after a cyber-attack. With these classifications, Table 4 shows an example of the systematic classification scheme for the cyber-attack taxonomy.

Table 4. Example for usage of cyber-attack taxonomy.

| Item | Entry |
|---|---|
| Attack category | Session hijacking |
| Subdivided attack | RST hijacking |
| Definition of subdivided cyber-attack | RST hijacking is a kind of TCP/IP hijacking in which RST packets are injected. ... (Omitted below) |
| Defense | The countermeasures of RST hijacking are to reduce the SYN-received latency, or to block RST packets at the router or firewall. ... (Omitted below) |
| Access method (1) | Internal, External |
| Access method (2) | NA: TCP/IP |
| Attack method & condition | TCP vulnerability (3-handshake) |
| Attackable OS, CPU, S/W | OS: Window ver, Linux 3.2.24 |
| Attack result | Interruption |
| Known vulnerability | CVE-2002-1778 |
| Vulnerability type | Other |
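
To make the link between vulnerability and defense concrete, the Table 4 entry for RST hijacking can be sketched as a record that is indexed by its vulnerability identifier. The `TaxonomyEntry` class and `index_by_vulnerability` function are hypothetical names for this sketch; the field values are copied from Table 4, and no real vulnerability database API is used.

```python
from dataclasses import dataclass, field


@dataclass
class TaxonomyEntry:
    """Hypothetical record for one row of the taxonomy."""
    category: str
    subdivided_attack: str
    access: list                 # access means (Section 3.2.1)
    result: str                  # attack result (Section 3.2.2)
    known_vulnerabilities: list  # identifiers as listed in Table 4
    defenses: list = field(default_factory=list)  # Section 3.2.4


# Values copied from the Table 4 entry for RST hijacking.
RST_HIJACKING = TaxonomyEntry(
    category="session hijacking",
    subdivided_attack="RST hijacking",
    access=["internal", "external"],
    result="interruption",
    known_vulnerabilities=["CVE-2002-1778"],
    defenses=[
        "reduce the SYN-received latency",
        "block RST packets at the router or firewall",
    ],
)


def index_by_vulnerability(entries):
    """Build a lookup from vulnerability identifier to taxonomy entries."""
    index = {}
    for entry in entries:
        for vuln in entry.known_vulnerabilities:
            index.setdefault(vuln, []).append(entry)
    return index


if __name__ == "__main__":
    index = index_by_vulnerability([RST_HIJACKING])
    for entry in index["CVE-2002-1778"]:
        print(entry.subdivided_attack, "->", entry.defenses)
```

With such an index, a newly reported vulnerability identifier can be matched to the attacks and defenses already recorded, which is the early-response use the text describes.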


3.3 Example of cyber-attack taxonomy utilization


The surveyed cyber-attack taxonomy can be used to determine the attack target, attack area, and test method, taking into account the principles and characteristics of the PLC, DCS, and network devices of the ICS/SCADA system described in Section 2.2. The PLC, DCS, and network devices used in actual industrial settings vary, and so do their security performance and vulnerabilities. The devices mentioned in this paper are assumed to have general performance and characteristics.

RST hijacking serves as an example of how the taxonomy is used. RST hijacking is a kind of TCP/IP hijacking in which RST packets are injected into the network. Like TCP session hijacking, it lets the attacker observe the communication between client and server. It can also be used to take over sessions that rely on trust and sessions that use TCP. The attack exploits weaknesses of the TCP three-way handshake and can overload a device until it stops functioning, similar to a DoS attack. Applying the taxonomy information to the ICS & SCADA system, the DCS and the network can be selected as the attackable devices, because RST hijacking targets communication between a client and a server, as in a DCS. The test interval in a cyber security test is therefore expected to lie between the DCS and the network. Knowing that a particular cyber-attack occurs in a particular section of the ICS and SCADA system can thus reduce the effort and cost of defense and mitigation. Moreover, if many cyber-attacks are systematically classified and accumulated in the same way as RST hijacking, the user or operator will be able to recognize the type of attack quickly when an arbitrary cyber-attack occurs, and can therefore take action quickly.
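
The way a classified attack narrows the scope of a security test can be sketched as a catalogue lookup. Only the RST hijacking entry comes from the text; the catalogue layout, the function name and the fallback of testing every component for an unclassified attack are assumptions made for this illustration.

```python
# Hypothetical catalogue keyed by subdivided attack. Only the RST hijacking
# entry is taken from the text; the field names are assumptions.
CATALOGUE = {
    "RST hijacking": {
        "attackable_devices": ("DCS", "network"),
        "result": "interruption",
    },
}

# Components of the ICS/SCADA system considered in the paper.
ICS_COMPONENTS = ("PLC", "DCS", "network")


def plan_security_test(attack: str) -> dict:
    """Return which ICS components a security test for this attack should cover."""
    entry = CATALOGUE.get(attack)
    if entry is None:
        # Assumed fallback: an unclassified attack gives no basis for narrowing.
        return {"components": ICS_COMPONENTS, "narrowed": False}
    return {"components": entry["attackable_devices"], "narrowed": True}


if __name__ == "__main__":
    print(plan_security_test("RST hijacking"))
    print(plan_security_test("unclassified attack"))
```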


4 CONCLUSION & FUTURE WORK 


This paper classified the cyber-attack scenario into four stages (prior preparation, gaining access, maintaining access, and clearing tracks) and matched them with the ten categories of cyber-attack: footprinting and scanning, password attack, spoofing, sniffing, hijacking, MITM, virus, DoS, backdoor installation, and hiding files. The ten kinds of cyber-attacks are then subdivided and classified by access means, result, vulnerability, and defense according to the classification scheme. Access can be classified as internal/external, physical/network, and intentional/accidental. The result can be classified as leakage, modification, destruction, or interruption. Vulnerability is based on the NVD and can be used for early response to a cyber-attack when combined with the defense classification. Defense can be classified as prevention or mitigation. Through this classification system, cyber-attacks and the possible attack intervals in ICS are exemplified. Future work includes further subdivision of the ten categories of cyber-attack and using the result to demonstrate possible attacks and attack intervals through penetration testing in an experimental environment that can be attacked.


ABSTRACT: The aims of the present study are to 1) develop and test a scale measuring organizational information security culture, and 2) examine its relationships to other aspects of information security. The study focuses on an organization providing critical infrastructure. We developed the scale by conducting qualitative interviews (N = 22) and three focus groups (N = 15) in an organization providing critical infrastructure, and by reviewing previous research on culture in organisations. Based on our literature review and the interviews, we chose to measure organizational information security culture by reformulating one of the few existing general organizational safety culture questionnaires. We first tested the questionnaire in a small pilot survey, and then conducted a questionnaire survey (N = 323) including all departments in the organization. Our examination of the factor structure of the scale indicated two factors. Regression analyses indicate that our adapted GAIN-scale, measuring organizational information security culture, is the most important variable influencing information security behavior in the model.


1 INTRODUCTION 


1.1 Background 


Information security is often defined as protection against breaches of confidentiality, integrity and accessibility. This applies to information that is oral, written or electronic. Confidentiality refers to ensuring that only those who are authorised to access information can access it. Integrity refers to protecting the accuracy and entirety of information and processing methods. Accessibility refers to ensuring that authorised users have access to the information and associated equipment when necessary (Report to the Storting 29, 2011-2012). Ruighaver et al. (2007) assert that it was not until the start of the century that scholars began to recognise the importance of organizational information security culture for information systems security in organisations. The importance of culture for security and safety has also gained recognition in Norwegian society in recent years. One of the most important conclusions of the report of the investigation commission following the terrorist attack in Oslo and Utøya on July 22, 2011 was that future efforts to secure sensitive objects (e.g. people and critical infrastructure) and information should focus on culture, especially on the acknowledgement of risk and on leadership.


The study organization is a provider of critical infrastructure in Norway. As such, it is obliged to follow the requirements of the Security Act (“Sikkerhetsloven”) when it comes to preventive safety work, which includes safety analyses, securing objects, information security and safety drills. Based on these requirements, the study organization decided to map and analyse its own organizational security culture. Critical infrastructure means the facilities and systems that are absolutely necessary to maintain society’s critical functions, which in turn meet society’s basic needs and respond to the population’s need for a perception of safety (NOU 2006).


1.2 Aims 


The aims of the present study are to 1) develop and 
test a scale measuring organizational information 
security culture and 2) examine its relationships to 
other aspects of information security. 


1.3 Research on culture in organisations 


1.3.1 Organisational information security culture 
Although Ruighaver et al. (2007) note that the organisational security culture concept has gained recognition, they also underline that consensus is lacking on how the concept should be defined and conceptualized (cf. Chia et al., 2002). They also assert that, in spite of a large amount of research on organisational security and how it should be improved, this research focuses only on certain aspects of security and not on how these aspects can be analysed as part of a larger organisational culture.

Based on this understanding, Ruighaver et al. (2007) choose to draw on organisational culture research in their analysis of security culture. This approach is similar to that applied by scholars studying organisational safety culture, who analyse safety culture as a focused and safety-relevant aspect of the larger organisational culture (e.g. Hale, 2000; Haukelid, 2008; Antonsen, 2009). Based on this, we may also analyse security culture as the “security relevant” aspects of the larger organisational culture, defining and conceptualising it using models of organisational culture (e.g. Schein, 2004). In this paper, we suggest that the research on information security culture could learn from the research on safety culture. Nosworthy (2000), for instance, asserts that one of the key challenges of information security culture implementation is how to educate the people of the organization to successfully implement the requirements of the information security policy. Considerable effort has been devoted to understanding this in safety culture research, which distinguishes between formal aspects of safety (structure: safety management system, procedures, training, routines, etc.) and informal aspects (culture) (Antonsen, 2009). Additionally, Knapp et al. (2006) depict top management support as a significant predictor of an organization’s security culture and level of policy enforcement. This also reflects a key finding in organizational safety culture and safety culture research, and thus it is relevant to also draw on the knowledge gained in these research fields.


1.3.2 Organisational culture 

The influential scholar Schein defines organiza- 
tional culture as: “(...) a pattern of shared basic 
assumptions that the group learned as it solved 
its problems of external adaptation and internal 
integration that has worked well enough to be con- 
sidered valid and, therefore, to be taught to new 
members as the correct way to perceive, think and 
feel in relation to those problems.” (Schein 1992: 
12). According to Guldenmund (2000: 222-225), 
organizational culture has the following charac- 
teristics: 1. It is a construct in the sense that it is 
an abstract, not a concrete concept, 2. It is rela- 
tively stable, 3. It has multiple dimensionality, in 
the sense that it can be described in many different 
ways, 4. It is shared by groups of people, 5. It consists of various aspects, which means that several different cultures can be identified within organizations, depending on the issue at hand, 6. It constitutes practices, 7. It is functional. Guldenmund
describes organizational culture in the following 
manner: “Overall, organisational culture is a rela- 
tively stable, multidimensional, holistic construct 
shared by (groups of) organisational members 
that supplies a frame of reference and which gives 
meaning to and/or is typically revealed in certain 
practices.” (Guldenmund 2000: 225). 

As the research on organizational safety culture 
seems to have been through many of the challenges 
that the organizational security culture research 
now is facing, we draw on the experiences of the 
former, e.g. when it comes to analyzing security cul- 
ture as a focused aspect of organizational culture. 


1.3.3 Organizational safety culture 

Even though the concept of safety culture has become popular since it was first introduced in the wake of the Chernobyl accident in 1986, it is not well understood (Reason, 1997). Safety culture scholars may disagree on a range of different issues, but they seem to agree that the research on safety culture and its relationship with safety is fragmented and unsystematic (e.g., Cox & Flin, 1998; Pidgeon, 1998; Hopkins, 2006; Guldenmund, 2007; Choudry et al., 2007; Glendon, 2008). In spite of this disagreement, most scholars seem to agree that safety culture refers to shared and safety-relevant ways of thinking or acting that are (re)created through the joint negotiation of people in social settings (Nævestad, 2010a), and, as noted, to safety-relevant aspects of organisational culture (Hale 2000). The element of safety culture that can be measured is often referred to as safety climate. Thus, safety climate can be conceived of as “snapshots”, or manifestations, of safety culture (Cox & Flin, 1998). Quantitative measurements of safety culture can provide leading indicators of safety and consequently offer predictive assessments that enable safety improvements without having to wait for accidents or incidents to happen (Antonsen, 2009). Senior management commitment to safety is the most studied and best-documented characteristic of a good safety culture, independent of sector (Flin et al. 2000; Guldenmund 2000).


2 METHOD 


Our methodological approach is based on a literature review conducted in 2012, interviews (N = 22) and focus groups (N = 15) in 2014, and a survey in 2014 (N = 323).


2.1 Interviews and focus groups 


We started with the qualitative part of the study before conducting the quantitative survey, so that we could form a picture of key issues concerning safety and security work at the study organisation. This was important because we had to adapt the security culture questionnaire, and because it gave us the opportunity to add questions that are central to safety and security work at the study organisation to the questionnaire. We conducted 22 in-depth interviews, primarily with managers, and one group interview with 3 respondents. We also held 2 focus groups, primarily with employees, with a total of 12 persons. This makes a total of 37 interviewees.

We used a semi-structured and relatively open 
interview guide based on the safety and security 
culture topics from the safety culture index. Our 
point of departure was topics related to informa- 
tion security and the protection of critical infra- 
structure. The interview guide had to be open so 
that we could depend on the interviewees’ under- 
standing of how different features of the organisa- 
tion culture in the study organisation have had and 
can have consequences for safety and security. 

The interviews were structured around the following main topics: 1) the department’s and respondent’s responsibilities and roles in general, 2) security focus and its relation to safety across information security, HSE safety and deliverance reliability, 3) organizational framework, management lines and communication, 4) safety culture issues: safety, training, expertise, procedures, etc.


2.2 Literature review 


The literature review was originally conducted as part of another project (cf. Nævestad & Bjørnskau 2012), but we draw on it in the present study because it is also relevant here and because our choice of safety culture scale was based on it. This follows from our ambition, mentioned above, to learn from the safety culture literature when measuring and understanding organisational information security culture.

In this review, we conducted literature searches for articles and reports that document experiences with different safety culture measurement tools. We conducted searches through two key scientific databases, “Science direct” and the ISI Web of Science. The first search, for “safety climate” in “abstract/title/key words”, “safety climate scale” and “safety climate questionnaire” in scientific publications (primarily referenced journal articles, but also some books) for all years, gave 249 results in all. The next search was made in the scientific database “ISI-Web of Knowledge”. Here we searched for articles with “safety climate” in title or subject, for all years, and received 458 hits.

The scales were reviewed according to the following criteria: 1) they are based on a solid scientific approach (e.g. based on previous research and existing theory, validated in several studies), 2) they are universal, 3) they are user-friendly, i.e. they do not include too many themes and questions, and the questions are understandable for people who are not researchers, and 4) the themes and the items in the scales are in accordance with the key results of the interviews and focus groups, which has been a key criterion. Our review resulted in 11 scales that we perceived as relevant enough to be evaluated systematically against these criteria.

In the present study, we chose to reformulate one of the few existing universal organizational safety culture scales, the GAIN-scale for safety culture, into an organizational security culture scale. First, the GAIN scale was chosen because our previous literature review concluded that it is one of very few universal safety culture surveys (Nævestad & Bjørnskau, 2012). The wording of each item can therefore be adapted to different sectors (and presumably also to security) without obviously altering the particular aspect which that item measures. Thus, the scale has the potential to be developed as a generic measure.

Second, the scale was chosen because it rests on a relatively solid scientific foundation. It must be noted that we ended up recommending another scale in the above-mentioned 2012 review. In that review, we chose the NOSACQ-50 scale (Kines et al., 2011) over the GAIN-scale (GAIN, 2001), as the former had been subjected to a more systematic literature review. We have, however, conducted several studies using the GAIN scale since 2012 (e.g. Nævestad & Bjørnskau, 2014; Nævestad et al., 2017; Nævestad, 2017), subjecting it to exploratory and confirmatory factor analyses, and we have also analyzed the relationship between the scale and safety outcomes (e.g. Nævestad, 2017). The scale has also been used to study and compare safety culture in different transport sectors like road, rail, helicopter and aviation (Bjørnskau & Longva, 2009).

Third, the scale was chosen because it is relatively easy to use. The GAIN-scale is, for instance, considerably shorter than the NOSACQ-50; it has only half the items. Additionally, the questions are relatively short, and it is relatively easy to change and adapt the wording to information security culture.


2.3 Survey 


2.3.1 Sample characteristics 

A total of 323 individuals from 11 different departments responded to the survey, giving a response rate of 56%. More than 90 per cent of the respondents are permanent employees and seven per cent are hired consultants. Seven per cent are section or department managers. It should also be mentioned that more than 50% of those who responded to the survey had been employed by the study organisation for five years or less. This is a relatively high percentage. It is consistent with the finding that more than 40 per cent of the respondents have been employed by 3-5 other businesses in their working life before the study organisation. Almost a quarter of those who responded have, however, been employed for more than 20 years. This is an approximate reflection of the actual distribution in the study organisation, but respondents with the shortest seniority are overrepresented. 56 per cent of the respondents are above the age of 46. This is interesting, considering that around half have seniority of five years or less. 61 per cent of the respondents are men and 66 per cent have graduated from university/university college.


2.3.2 Pilot survey 

As we developed several new questions in the survey which had never been tested before, we conducted a small pilot study (N = 12) directed at personnel in the study organisation to obtain feedback and assess how the questions worked. In the pilot study we received some useful feedback, including that we should use the term “my immediate supervisor” rather than “my department manager” in the survey on safety culture and information security culture.


2.3.3 Survey topics 

The survey contains mainly questions about ten 
topics. It first contains a set of background ques- 
tions (e.g. gender, age, education, experience, level) 
that were sent to all respondents. In addition, three 
short indexes follow with questions about three 
different types of security related to information 
security, HSE safety and deliverance reliability. 
The questions are identical and have the same scale 
so that we can directly compare the meanings of 
the three forms of safety and security in the study 
organisation. Furthermore, the questionnaire 13 
contains questions about attitudes and behaviour 
regarding information security culture. 


2.3.3.1 Background variables 

The survey also includes questions on demo- 
graphic background variables and various per- 
formance targets related to safety. The background 
variables include information on: 1) gender, 2) 
age, 3) education, 4) seniority, 5) employment in 
other businesses, 6) level in the organization and 
7) employment status in the organization (perma- 
nent, hired). These background variables are only 
presented at the enterprise level. 


2.3.3.2 The GAIN scale 

Global Aviation Information Network (GAIN) is a voluntary association of airlines, manufacturers, trade unions, governments and other organisations in aviation. The GAIN questionnaire as used here contains 24 questions (we excluded one of the original questions because of its wording) concerning:


1. Management’s attitude to and focus on safety: 


Man 1: My immediate supervisor discovers employ- 
ees who fail to take sufficient considerations to infor- 
mation security in their work 

Man 2: My immediate supervisor often praises 
employees for maintaining information security 

Man 3: My immediate supervisor is aware of the most 
important information security issues in the company 
Man 4: My immediate supervisor often discusses 
information security issues with the employees 

Man 5: My immediate supervisor is personally 
involved in activities to improve information security 
Man 6: My immediate supervisor postpones tasks/ 
activities if information security is not sufficiently 
ensured 

Man 7: My immediate supervisor considers informa- 
tion security to be very important in all tasks and 
activities 

Man 8: My immediate supervisor does everything 
he/she can do to avoid breaches of information 
security 


2. Employees’ attitudes to and focus on safety: 


Emp 1; My colleagues do everything they can to avoid 
breaches of information security 

Emp 2: Employees encourage one another to safe- 
guard information security 

Emp 3: Employees usually report all breaches and 
irregularities related to information security that they 
experience at work 


3. Reporting culture and reactions to incident 
reporting: 


Rep 1: Those who pursue breaches of information 
security in the business attempt to find the real causes 
rather than just blaming the employees 

Rep 2: There are routines and procedures at my 
workplace so that I may report information security- 
related breaches or irregularities 

Rep 3: After a breach of information security, meas- 
ures are implemented to prevent this from happening 
again 

Rep 4: All irregularities and information security
issues that are reported are remedied in a short time

Rep 5: Everyone has plenty of opportunities to forward
suggestions related to information security


4. Safety training and education: 


Tra 1: Employees in my company are provided with 
adequate training in the secure use of ICT systems 
(e.g. e-mail, storage, encryption) 

Tra 2: All new employees are provided with adequate 
training for tasks and the secure use of ICT systems 
(e.g. e-mail, storage, encryption) 

Tra 3: Everyone is provided with sufficient feedback 
on how the enterprise is performing with regard to 
information security 
Tra 4: Everyone is informed of any changes that may
impact information security


5. General questions concerning safety in the organ- 
ization in question: 


Gen 1: There are procedures that must be followed in
the event of an emergency situation in my workplace

Gen 2: Information security in my business is better
than in other businesses

Gen 3: Regular security audits are carried out

Gen 4: Information security is generally well taken
care of at my workplace


Respondents can rate the questions from 1 (totally
disagree) to 5 (totally agree). Thus, a safety culture
index with a minimum value of 24 (1 x 24) and
a maximum value of 120 (5 x 24) can be compared
across companies and sectors. According
to GAIN (2001), organizations with a score of
93-125 points on the safety culture index have a
positive safety culture, 59-92 indicates a bureaucratic
safety culture and 25-58 indicates a poor
safety culture (these cut-offs refer to the original
25-item scale, which ranges from 25 to 125).
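The scoring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the GAIN (2001) cut-offs are applied unchanged to the 24-item index, and fractional department means that fall between two published bands are placed in the lower band.

```python
from typing import Sequence


def culture_index(responses: Sequence[int]) -> int:
    """Sum one respondent's item scores (each an integer from 1 to 5)."""
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    return sum(responses)


def gain_band(score: float) -> str:
    """Label an index score with the GAIN (2001) cut-offs.

    Assumption: the cut-offs published for the 25-item scale (25-125)
    are used as they stand, so a score of 24 on the 24-item version
    lies just below the lowest published band.
    """
    if score > 125:
        raise ValueError("score exceeds the maximum of the original scale")
    if score >= 93:
        return "positive"
    if score >= 59:
        return "bureaucratic"
    if score >= 25:
        return "poor"
    return "below published range"


if __name__ == "__main__":
    lowest = culture_index([1] * 24)    # every item answered "totally disagree"
    highest = culture_index([5] * 24)   # every item answered "totally agree"
    print(lowest, gain_band(lowest))
    print(highest, gain_band(highest))
```

With 24 items the index runs from 24 to 120, so the top of the "positive" band (121-125) cannot be reached and the bottom value (24) falls outside the published bands.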


2.3.3.3 Questions about information security 
Based on the interviews (and a literature review of
organizational security culture scales, e.g. SjekkIT,
developed by NTNU and Sintef for the Norwegian
National Security Authority (NSM)), we
also included 22 additional questions about information
security in the organisation. These covered
themes representing special information security
challenges in the organization.


3 RESULTS FROM INTERVIEWS AND 
FOCUS GROUPS 


In the interviews, we discussed organisational
security management with the key managers, and
we found that they perceived the five GAIN themes
as important and relevant. Management and employee
commitment to safety was perceived as key. The
organisation had also developed a reporting system
covering information security, and it was engaged
in several initiatives to educate employees in
information security issues.

Based on the interviews and focus groups, we
also developed 22 survey questions reflecting the
most important information security challenges in
the organisation. Several of the 22 questions were
previously untested, and nine of these did not work;
some of them had relatively large shares of
“neither/nor” responses. The 13 questions on
information security that remained after removing
these nine may be divided into the following topics.
Space does not allow us to present the results for
all of them here, but we nevertheless reproduce the
topics and questions, as they are an important
result of our qualitative study:


1. Knowledge/attitudes—information security: 


We constructed an index for knowledge of and 
attitudes to information security with five ques- 
tions. All of the questions have five values, such 
that the minimum value for the index is 5 and the 
maximum value is 25 (Cronbach’s Alpha = 0.740). 


In my work, I have a clear understanding of what the 
term information security means. 

I have a clear understanding of what entails a breach 
of information security in my business. 

I feel that I have adequate knowledge on the secure
use of ICT systems (e.g. e-mail, storage, encryption).

All unfamiliar persons at the workplace are noticed,
and one investigates what they are doing there.

When I am asked for information, I always think care- 
fully about whether the information can be used for 
other purposes than originally intended. 


2. Security assessment—PC and cell phone: 


My cell phone contains sensitive information. 


If I’m working on a PC from home, information security
is just as high as it is at work


3. Classified information and accessibility. 


We created an index of the following three questions 
on classified information (Cronbach’s Alpha was 
0.721): 


I am well aware of which type of information that is 
sensitive and classified 

I am well aware of who has access to various types of 
classified information 

I take precautions when I come into contact with sen- 
sitive and classified information 


The respondents were also asked to consider the 
following statement: “Considerations to informa- 
tion security (for example passwords to log on) 
impede my work”. 


4 RESULTS FROM THE QUANTITATIVE 
SURVEY 


4.1 Clear understanding of information security? 


The questionnaire that measures information 
security culture is based on the GAIN safety cul- 
ture index, where the word “safety” is replaced by 
“information security.” The questionnaire opened 
with definitions of information security and 
related sub-concepts. 
4.2 Organisational information security
culture index


We have combined the 24 statements, each with five
response options, covering the five aspects of
information security into an information security
culture index. The index for each department is
the average score of its respondents. Since we have
removed one statement from the GAIN index, the
minimum score is 24 (24 x 1) and the maximum
score is 120 (24 x 5). Cronbach’s Alpha for the 24
questions in the index is 0.913, which indicates very
high internal consistency among the questions and
supports combining them into a single index.
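For readers unfamiliar with the reliability coefficient reported above, the following is a minimal sketch of how Cronbach's Alpha is computed from a respondents-by-items score matrix, using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of the summed score). The function name and the toy data are illustrative assumptions, not the study's data or software.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a matrix with one row per respondent."""
    k = len(rows[0])                       # number of items in the index
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    items = list(zip(*rows))               # one tuple of scores per item
    item_var = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in rows])
    # Raises ZeroDivisionError if all respondents have the same total.
    return k / (k - 1) * (1 - item_var / total_var)


if __name__ == "__main__":
    # Made-up answers from four respondents to three items (1-5 scale).
    toy = [
        [2, 3, 2],
        [4, 4, 5],
        [3, 3, 3],
        [5, 4, 5],
    ]
    print(round(cronbach_alpha(toy), 3))
```

Values close to 1 mean that the items vary together, which is the sense in which the text above speaks of agreement between the questions.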


Factor 1: Information security management     Factor
and commitment                                loadings

Man 4: My immediate supervisor often          0.774
discusses information security issues
with the employees

Gen 4: Information security is generally      0.745
well taken care of at my workplace

Man 7: My immediate supervisor considers      0.743
information security to be very important
in all tasks and activities

Man 6: My immediate supervisor postpones      0.735
tasks/activities if information security is
not sufficiently ensured

Man 8: My immediate supervisor does           0.733
everything he/she can do to avoid
breaches of information security

Rep 3: After a breach of information          0.729
security, measures are implemented to
prevent this from happening again

Man 3: My immediate supervisor is aware       0.691
of the most important information
security issues in the company

Man 5: My immediate supervisor is             0.678
personally involved in activities to
improve information security

Rep 4: All irregularities and information     0.654
security issues that are reported are
remedied in a short time

Emp 2: Employees encourage one another        0.644
to safeguard information security

Man 2: My immediate supervisor often          0.639
praises employees for maintaining
information security

Emp 3: Employees usually report all           0.634
breaches and irregularities related to
information security that they
experience at work

Gen 3: Regular security audits are            0.615
carried out

Rep 1: Those who pursue breaches of           0.598
information security in the business
attempt to find the real causes rather
than just blaming the employees

Man 1: My immediate supervisor discovers      0.593
employees who fail to take sufficient
considerations to information security
in their work

Emp 1: My colleagues do everything they can   0.591
to avoid breaches of information security

Rep 5: Everyone has plenty of opportunities   0.572
to forward suggestions related to
information security

Gen 2: Information security in my business    0.548
is better than in other businesses

Rep 2: There are routines and procedures      0.534
at my workplace so that I may report
information security-related breaches or
irregularities

Gen 1: There are procedures that must be      0.508
followed in the event of an emergency
situation in my workplace


Factor 2: Information security training       Loadings

Tra 2: All new employees are provided         0.691
with adequate training for tasks and the
secure use of ICT systems (e.g. e-mail,
storage, encryption)

Tra 1: Employees in my company are            0.629
provided with adequate training in the
secure use of ICT systems (e.g. e-mail,
storage, encryption)

Tra 3: Everyone is provided with sufficient   0.562
feedback on how the enterprise is
performing with regard to information
security

Tra 4: Everyone is informed of any changes    0.550
that may impact information security


[Figure 1: bar chart of mean index scores for departments 1 to 6 and the total sample.]

Figure 1. Mean scores on the GAIN index applied to
organisational information security culture. Departments
in the study organisation. Minimum 24, maximum
120. (N = 249).

The figure shows that DEP 6 has the highest
score and that DEP 4 has the lowest score. The
difference between the highest and lowest score is
more than 11 points. The differences between the
departments are significant at the 1% level. We are
nevertheless cautious about comparing the departments
on the information security culture questions.
The number of “neither/nor” responses indicates
that many respondents were unwilling or unable to
assess the statements in this area, which may
make the numbers difficult to interpret. It also
entails that respondents who actually expressed an
opinion are a minority for some of the questions.
All in all, we therefore consider this index to be less
robust than the others. This applies to all departments
except DEP 6.

The first topic in the index is “Immediate
supervisor’s attitude to and focus on information
security.” Here DEP 6 had the highest score and
DEP 4 the lowest.

The second topic in the index is “Employees’
attitude to and focus on information security.”
Once again DEP 6 had the highest score, while
DEP 1 had the lowest. The differences are significant
at the 5% level. This should be interpreted in
the light of DEP 1 having responsibility for following
up such work; it is probably more critical in
its assessment.

The third topic in the index is “Reporting culture
and reactions to incident reporting.” DEP 6
and DEP 2 had the highest scores, while DEP 4
and DEP 5 had the lowest. The differences were
significant and amounted to more than two points.

The fourth topic in the index is “Training in
information security thinking.” DEP 2 and DEP 6
had the highest scores, while DEP 1 had the lowest.

The fifth topic in the index is “General information
security issues.” DEP 6 had the highest score,
while DEP 1 had the lowest. The differences are
significant at the 1% level.


4.2.1 Exploratory factor analysis

An Exploratory Factor Analysis (EFA) was conducted
to examine the underlying factor structure
of the 24 items in the sample. Although the original
GAIN scale for safety culture comprises five
themes, we chose EFA because we apply the scale to
a new topic: information security culture. Tests
indicated that the items and the data were suitable
for factor analysis. Bartlett’s test of sphericity
(approx. Chi-square) was 3380.834 (p < 0.001). The
Kaiser-Meyer-Olkin measure of sampling adequacy
showed a value of 0.909. An unrotated principal
component analysis (PCA) was used. We set the
cut-off value for factor loadings at 0.40 or above,
as Matsunaga (2010) suggests that this is perhaps the
lowest acceptable threshold on a conventional
liberal-to-conservative continuum. Results showed
five components with initial Eigenvalues higher
than 1, which explained a total of 64.9% of the
variance. The choice of the number of factors to
retain was based on a combination of a) Eigenvalues,
b) inspecting the scree plot for a bending point,
c) inspecting the factor loadings in the component
matrix, and d) conceptual and theoretical
considerations. By inspecting the scree plot, a bend
was relatively clearly identified between factors 5
and 6, indicating a five-factor solution. This is in
line with the Eigenvalues. However, when looking at
the factor loadings, we saw that all items loaded on
the first component, while seven items were
cross-loading. Four of these had lower factor loadings
on the other factors than on the first factor, and
they were distributed across different factors. They
were therefore kept in the first factor, with one
exception. Three of the cross-loading items all loaded
on the second factor, had higher loadings on the
second factor than on the first, and all concern
security training. They were therefore attributed to
a second factor. Additionally, one of the
first-mentioned cross-loading items had quite similar
loadings on both factors (0.567 vs. 0.562), but as it
matched the second factor conceptually, we attributed
it to this factor.
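The item-assignment reasoning in the paragraph above can be expressed as a small decision rule. The sketch below is an illustration only: the function name and the `tie_margin` tolerance are hypothetical, and the actual assignment in the study was made by the authors' judgement rather than by a fixed numerical rule. The cut-off of 0.40 and the loadings 0.567 and 0.562 are taken from the text.

```python
from typing import Dict, Optional

CUTOFF = 0.40  # minimum loading treated as salient, following the text


def assign_factor(loadings: Dict[int, float],
                  conceptual_factor: Optional[int] = None,
                  tie_margin: float = 0.01) -> Optional[int]:
    """Choose a factor for one item from its loadings per factor.

    tie_margin is a hypothetical tolerance: if the conceptually matching
    factor loads within this margin of the best-loading factor, the
    conceptual match decides the assignment.
    """
    salient = {f: abs(v) for f, v in loadings.items() if abs(v) >= CUTOFF}
    if not salient:
        return None                        # item loads on no factor
    best = max(salient, key=salient.get)   # factor with the highest loading
    if (conceptual_factor in salient and conceptual_factor != best
            and salient[best] - salient[conceptual_factor] < tie_margin):
        return conceptual_factor
    return best


if __name__ == "__main__":
    # The near-tie described in the text: 0.567 on factor 1, 0.562 on factor 2.
    print(assign_factor({1: 0.567, 2: 0.562}))                       # loadings only
    print(assign_factor({1: 0.567, 2: 0.562}, conceptual_factor=2))  # with content match
```

On the loadings alone the item would stay in the first factor; the conceptual match with the training items is what moves it to the second.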

Thus, based on our analysis of the factor loadings
and conceptual and theoretical considerations
(the four latter items all concern information
security training), we chose a two-factor solution,
which explained a total of 50% of the variance, i.e.
about 15 percentage points less than the five-factor
solution.


4.2.2 Regression analysis: What influences 
organisational information security scores? 

The information security culture scores vary
according to conditions such as age, education and
seniority, but we do not know which conditions
are most important in explaining the variation in
information security culture, or whether the effect
we see from education is actually due to age or vice
versa. We have conducted regression analyses to
assess which conditions explain variation in the
information security culture index.

We have used linear regression, as the dependent
variable is continuous. We add the independent
variables in steps, so that we can assess the
isolated effect of each on the dependent variable, i.e.
when the values of the other variables remain
unchanged. In this manner we can, for example,
examine the effect of education controlled for age.

We add the variables gender, age, education,
seniority in the study organisation, and department.
We have converted the department variable
to a dichotomous variable, i.e. one with two values. The
reason is that a nominal-level variable, i.e. one whose
values are mutually exclusive but cannot be ranked,
cannot be entered directly as an independent variable
in a regression analysis. The two values of the department
variable are 1) all other departments, 2) DEP 6. We
have done this because DEP 6 had the highest score
on the information security culture index.
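The recoding of the department variable described above amounts to a one-line mapping. The sketch below uses the codes given in the text (1 = all other departments, 2 = DEP 6); the function name is a hypothetical choice for illustration.

```python
def recode_department(department: int) -> int:
    """Collapse the six departments into the two codes used in the text."""
    if department not in range(1, 7):
        raise ValueError("department must be an integer from 1 to 6")
    return 2 if department == 6 else 1     # 2 = DEP 6, 1 = all other departments


if __name__ == "__main__":
    print([recode_department(d) for d in range(1, 7)])
```

Because the recoded variable has only two values, its regression coefficient can be read as the difference between DEP 6 and the other departments taken together.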

We see that age contributes significantly and
positively in model 2; the older the respondents are,
the better their information security culture score.
In model 3, however, the age variable is no longer
significant, which indicates that the age effect
is actually due to its correlation with education.
This means that younger respondents have higher
education and a lower information security culture
score, and vice versa. Seniority does not make a
significant contribution in any of the models.
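How a coefficient can shrink once a correlated variable is controlled for can be shown with the standard closed-form expression for standardised betas with two predictors: beta1 = (r_y1 - r_y2 * r_12) / (1 - r_12^2). The sketch below is an illustration under stated assumptions: the function name is hypothetical and the correlations in the example are invented numbers chosen to mimic the pattern described above, not values from the study.

```python
from typing import Tuple


def std_betas(r_y1: float, r_y2: float, r_12: float) -> Tuple[float, float]:
    """Standardised betas for y regressed on two predictors.

    Inputs are Pearson correlations: predictor 1 with y, predictor 2
    with y, and the two predictors with each other.
    """
    if abs(r_12) >= 1:
        raise ValueError("predictors must not be perfectly correlated")
    denom = 1 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    return beta1, beta2


if __name__ == "__main__":
    # Invented correlations: age and index +0.10, education and index -0.20,
    # age and education -0.50 (younger respondents more highly educated).
    beta_age, beta_edu = std_betas(0.10, -0.20, -0.50)
    print(round(beta_age, 3), round(beta_edu, 3))
```

In this invented case the positive bivariate association between age and the index disappears entirely once education is held constant, which is the kind of pattern the comparison of models 2 and 3 points to.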

Finally, we see that the department variable
makes the strongest contribution in the regression
analysis in Table 1. Belonging to DEP 6 predicts a
higher score on the information security culture
index. We already knew this, but the regression
analyses in Table 1 show that this also applies
when controlling for gender, age, education and
seniority. We may therefore conclude that DEP
6’s high score on the information security culture
index is not due to underlying variables such as
gender, age, education and seniority.

We see that the adjusted R² value, which indicates
the proportion of the variation in the
dependent variable that is explained by the
independent variables, increases markedly in model
3, when education is included in the analyses (from
0.016 to 0.044), and that it more than doubles when
department is included in model 5 (from 0.045 to
0.097). The full model, in which education and
department are the significant predictors, explains
9.7% of the variation in the information security
culture index.


4.3 Regression analysis: What influences 
information security behaviour? 


We have conducted regression analyses to assess
which conditions explain variation in the variable
“I have never caused a breach of information
security.” This is a variable with five response options,
from 1 (strongly disagree) to 5 (strongly agree). The
overall “neither/nor” share is 26.3% for this question.
What one answers here probably depends to a certain
extent on whether one has a clear
understanding of what information security is, or
what it means for day-to-day work. The answer will
also depend on how many opportunities one has to
breach information security in one’s work. We have
used linear regression, treating the dependent
variable as continuous. We add four independent variables
in steps, so that we can assess their isolated effect
on the dependent variable, i.e. when the values
of the other variables remain unchanged. We add
gender, age, seniority in the study organisation,
and information security culture.
Table 2 shows that seniority and information
security culture contribute significantly to explaining
the variation in the variable “I have never caused
a breach of information security.” Both effects are
positive. The positive effect of seniority means that


Table 1. Linear regression. Dependent variable: Organisational information security culture. Standardised beta
coefficients.

Variable                  Mod. 1     Mod. 2     Mod. 3     Mod. 4     Mod. 5

Gender                    -0.003     0.028      0.021      0.020      0.003
Age                                  0.158**    0.105      0.076      0.034
Edu (Uni = 2)                                   -0.185***  -0.157**   -0.138**
Seniority                                                  0.078      0.059
Department (DEP 6 = 2)                                                0.244***
Adj. R²                   -0.004     0.016      0.044      0.045      0.097


*p < 0.1; **p < 0.05; ***p < 0.01.


Table 2. Linear regression. Dependent variable: “I have never caused a breach of information security.” Standardised
beta coefficients.

Variable                      Mod. 1     Mod. 2     Mod. 3     Mod. 4

Gender                        -0.043     -0.042     -0.046     -0.051
Age                                      0.005      -0.070     -0.087
Seniority                                           0.159**    0.132*
Information security culture                                   0.192***
Adjusted R²                   -0.002     -0.006     0.010      0.041


*p < 0.1; **p < 0.05; ***p < 0.01.
the longer one has been employed by the study
organisation, the higher the likelihood that one has
not caused a breach of information security. This
is perhaps somewhat unexpected, as one would
assume that the longer one has been employed, the
more opportunities (in terms of time) there have
been to breach information security. It further
means that there is reason to assume that the study
organisation would have scored somewhat higher on
security culture if the distribution of respondents had
corresponded to that in the organization: in the
survey, 51 per cent of respondents had 0-5 years of
seniority, while in reality 38 per cent of employees
in the study organisation have 0-5 years of seniority.
There is nevertheless no reason to believe that this
possible skewness alters the fundamental conclusions,
as the difference is too small.

The effect of information security culture is 
however strongest, and this is the most impor- 
tant variable to explain variations in breaches of 
information security in the analyses. The higher 
the information security culture score is, the higher 
the likelihood that one has not caused a breach of 
information security. 

We see that the adjusted R² value, which indicates
the proportion of the variation in the
dependent variable that is explained by the
independent variables, is negative in the first two
models, but that it is 1% and 4.1% in the last two. This
increase occurred when we included seniority and
information security culture. These variables explain
4.1% of the variation in the variable “I have never
caused a breach of information security”.
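A negative adjusted R², as reported for the first two models, is not an error: the adjustment penalises each added predictor, so a model that explains almost nothing ends up slightly below zero. The sketch below uses the standard formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1); the function name is hypothetical, and the sample size of 249 in the example is borrowed from Figure 1 as an assumption, since the number of cases in the regressions is not stated.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for a model with p predictors fitted to n cases."""
    if n - p - 1 <= 0:
        raise ValueError("need more cases than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


if __name__ == "__main__":
    # One predictor that explains nothing (R-squared of 0), assumed n = 249.
    print(round(adjusted_r2(0.0, 249, 1), 4))
    # The same R-squared with two predictors is penalised slightly more.
    print(round(adjusted_r2(0.0, 249, 2), 4))
```

The penalty grows with the number of predictors and shrinks with the sample size, which is why the adjusted value rises only when an added variable contributes real explanatory power.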


5 CONCLUDING DISCUSSION 


Learning from research on organizational culture
and safety culture, we have adapted an organizational
safety culture scale to measure organizational
information security culture. Our examination of
the factor structure of the scale indicated two factors.
Regression analyses indicate that our adapted
GAIN scale, measuring organizational information
security culture, is the most important variable
influencing information security behavior in the model.
ABSTRACT: The aims of the present study are to 1) compare results from a study conducted before
and a study conducted after efforts to improve organizational information security culture in an organization
providing critical infrastructure, and 2) discuss the results of the comparison. In this study, we compare
the results of two surveys done over a period of just over two years; the first early in 2014 (N = 323)
and the second late in 2016 (N = 446). Organizational information security culture was measured with an
index which was made by adding the scores of respondents’ answers to 24 items with answer alternatives
ranging from 1 to 5. Thus, the minimum score of the information security index was 24 and the maximum
score was 120. We found a statistically significant improvement in the index, comparing 2014 to 2016.
Changes are discussed considering respondents’ experiences with the implemented measures, sample
characteristics and other methodological factors. We conclude that it seems that management implementation
of measures aimed at improving organizational security culture has led to improvements.


1 INTRODUCTION 


1.1 Background 


Ruighaver et al (2007) assert that it was not until the
start of the century that scholars began to recognise
the importance of organisational security culture
for information systems security in organisations.
Information security is often defined as protection
against breaches of confidentiality, integrity
and accessibility. This applies to information that
is oral, written or electronic. Confidentiality refers
to ensuring that only those who are authorised to
access information can access it. Integrity refers to
protecting the accuracy and completeness of information
and processing methods. Accessibility refers
to ensuring that authorised users have access to the
information and associated equipment when necessary
(Report to the Storting 29. 2011-2012).

Throughout the report of the 22 July Commission,
following the Oslo and Utøya terror attacks
in 2011, leadership, culture and attitudes are
emphasized as key challenges, both in the past
and for the future. In part VI, “Learning, 19.9”
(page 458), of the 22 July Commission’s report, the
Commission’s main conclusion and recommendations
are presented: the Commission’s most
important recommendation is that leaders at all
levels of the administration work systematically
to strengthen their own and their organisations’
fundamental attitudes and culture in respect of,
e.g., the acknowledgement of risk and leadership.

Critical infrastructure means the facilities
and systems that are completely necessary to
maintain society’s critical functions, which in
turn meet society’s basic needs and respond to
the population’s need for a perception of safety
(NOU 2006).


1.2 Aim 


The aims of the present study are to 1) compare
results from a study conducted before and a study
conducted after efforts to improve organizational
information security culture in an organization
providing critical infrastructure, and 2) discuss the
results of the comparison.

The study organization is a provider of critical
infrastructure in Norway. As a provider of critical
infrastructure, the study organization is obliged
to follow the requirements of the Safety Act
(“Sikkerhetsloven”) when it comes to preventive
safety work, which includes safety analyses, securing
objects, information security and safety drills.
Based on these requirements, the study organization
decided to map and analyse its own organizational
security culture.


1.3 Previous research on culture in organisations 


Although Ruighaver et al (2007) note that the
organisational security culture concept has gained
recognition, they also underline that consensus is
lacking on how the concept should
be defined and conceptualized (cf. Chia et al 2002).
Additionally, they assert that in spite of a large
amount of research on organisational security and
how it should be improved, this research only focuses
on certain aspects of security, and not on how these
aspects can be analysed as part of a larger
organisational culture (Ruighaver et al 2007). Based on
this understanding, they choose to draw on
organisational culture research in their analysis of security
culture. This approach is similar to that applied by
scholars studying organisational safety culture, who
analyse safety culture as a focused and safety-relevant
aspect of the larger organisational culture (e.g.
Hale 2000; Haukelid 2008; Antonsen 2009). Based on
this, we may also analyse security culture as the “security
relevant” aspects of the larger organisational culture,
defining and conceptualising it using models of
organisational culture (e.g. Schein 2004).

The influential scholar Schein defines organi- 
zational culture as: “(...) a pattern of shared basic 
assumptions that the group learned as it solved 
its problems of external adaptation and internal 
integration that has worked well enough to be con- 
sidered valid and, therefore, to be taught to new 
members as the correct way to perceive, think and 
feel in relation to those problems.” (Schein 1992: 
12). According to Guldenmund (2000: 222-225), 
organizational culture has the following charac- 
teristics: 1. It is a construct in the sense that it is 
an abstract, not a concrete concept, 2. It is rela- 
tively stable, 3. It has multiple dimensionality, in 
the sense that it can be described in many different 
ways, 4. It is shared by groups of people, 5. It con- 
sists of various aspects, which means that several 
different cultures can be identified within organi- 
zations, depending on the issue at hand, 6. It con- 
stitutes practices, 7. It is functional. Guldenmund 
describes organizational culture in the following 
manner: “Overall, organisational culture is a rela- 
tively stable, multidimensional, holistic construct 
shared by (groups of) organisational members 
that supplies a frame of reference and which gives 
meaning to and/or is typically revealed in certain 
practices.” (Guldenmund 2000: 225). 

As the research on organizational safety culture
seems to have been through many of the challenges
that the organizational security culture research
is now facing, we draw on the experiences of the
former, e.g. when it comes to analyzing security
culture as a focused aspect of organizational culture.
This also applies to management of culture.
Knapp et al (2006) assert, for instance, that top
management support is a significant predictor
of an organization’s security culture and level of
policy enforcement. Learning from the organizational
culture literature, we may conceptualise this
in terms of Schein’s (2004) “six primary embedding
mechanisms” that managers can use to shape
culture, an approach typical of the integration view:

1) What managers pay attention to, measure and
control on a regular basis, 2) How managers react
to critical incidents and organizational crises, 3)
How managers allocate resources, 4) Deliberate role
modelling, teaching and coaching, 5) How managers
allocate rewards and status, and 6) How managers
recruit, select, promote and excommunicate.
It should however be noted that although organisational
culture is a concept that is often examined
with the intention to influence, organizational culture
scholars give different answers to whether, to
what extent and how safety culture can be managed,
depending on their conceptualization of culture (cf.
Nevestad 2010a). To conclude, the research on culture
interventions in organisations generally indicates
that safety culture change comes about in the
dynamic between “top-down” processes initiated
by the management and “bottom-up” processes
based in sub-groups (Olsen et al 2009).


2 METHOD 


2.1 Interviews 


We used a semi-structured and relatively open inter- 
view guide. Our starting point was topics related to 
information security, follow-up of safety and secu- 
rity work since 2014. The interview guide had to 
be open so that we could depend on the interview- 
ees’ understanding of how different organisational 
and cultural features of the study organisation 
have had and can have consequences for security. 
The interviews were built up around the following 
main topics: a) Reviews and measures related to 
security in the study organisation since 2014, b) 
The organisation’s and employees’ relationship to 
security. Security in everyday life, and as a topic 
of conversation, c) Clarity, management and rules 
related to security, d) Training, reporting/notifica- 
tion and risk assessments. 


2.2 Survey items 


The survey mainly contains questions about ten
topics. It first contains a set of background questions
(e.g. gender, age, experience, education) that
were sent to all respondents. In addition, the
questionnaire contains 13 questions about attitudes and
behaviour concerning information security culture.
We have chosen not to report results for these items
here, as we wanted to focus on the development in
the organizational security culture scale between 2014
and 2016, also comparing departments and measures
implemented after the first measurement.


2.2.1 The GAIN scale 

In this study, we choose to reformulate one of the 
few existing universal organizational safety culture 
scales, the GAIN-scale for safety culture, into an 
organizational security culture scale. Global Avia- 
tion Information Network (GAIN) is a volun- 
tary association of airlines, manufacturers, trade 
unions, governments and other organisations in 
aviation. The GAIN questionnaire contains 25 
questions concerning: 1) management’s attitude to 
and focus on safety, 2) employees’ attitudes to and 
focus on safety, 3) reporting culture and reactions 
to incident reporting, 4) safety training and edu- 
cation, and 5) general questions concerning safety 
in the organization in question. Respondents can 
rate the questions from 1 (totally disagree) to 5 
(totally agree). Thus, a safety culture index with 
a minimum value of 25 (1 x 25) and a maximum 
value of 125 (5 x 25) can be compared across com- 
panies and sectors. According to GAIN (2001), 
organizations with a score of 93-125 points on the 
safety culture index have a positive safety culture, 
59-92 indicates a bureaucratic safety culture and 
25-58 indicates a poor safety culture. 
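
As an illustration of how such an index can be computed, the following Python sketch sums one respondent's 25 ratings and maps the total to the three bands cited from GAIN (2001). The function names and the example respondent are hypothetical, and the treatment of non-integer averages that fall between two bands is our assumption.

```python
from typing import Sequence

N_ITEMS = 25                   # statements in the GAIN questionnaire
MIN_RATING, MAX_RATING = 1, 5  # 1 = totally disagree, 5 = totally agree


def gain_index(ratings: Sequence[int]) -> int:
    """Sum one respondent's ratings into a safety culture index (25-125)."""
    if len(ratings) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} ratings, got {len(ratings)}")
    if any(r < MIN_RATING or r > MAX_RATING for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return sum(ratings)


def gain_band(score: float) -> str:
    """Map an index score to the bands cited from GAIN (2001).

    Assumption: a non-integer average such as 92.5 is placed in the
    lower band, since the text only gives integer limits.
    """
    if not 25 <= score <= 125:
        raise ValueError("score outside the 25-125 range")
    if score >= 93:
        return "positive"
    if score >= 59:
        return "bureaucratic"
    return "poor"


# Hypothetical respondent: rating 4 on 20 statements and 3 on the other 5.
example = [4] * 20 + [3] * 5
score = gain_index(example)
print(score, gain_band(score))
```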

The scale was chosen primarily because research on
organizational safety culture seems to have been
through many of the challenges that organizational
security culture research is now facing.
Although Ruighaver et al (2007) note that the
organisational security culture concept has gained
recognition, they also underline that consensus is
lacking on how the concept
should be defined and conceptualized (cf. Chia et al
2002). Additionally, they assert that in spite of a
large amount of research on organisational security
and how it should be improved, this research
focuses only on certain aspects of security and not
on how these aspects can be analysed as part of a
larger organisational culture. Based on this understanding,
Ruighaver et al (2007) choose to draw on
organisational culture research in their analysis of
security culture. We choose the same strategy in the
present paper, drawing on the experiences of the
safety culture literature and choosing the GAIN
scale. The safety culture literature seems to have
matured somewhat more conceptually and methodologically,
as it has employed the culture perspective for a
few more years than the field of security research.


2.2.2 Three types of safety/security/reliability 

In the 2014 interviews related to the first survey, 
we were given indications that many different types 
or nuances of safety and security are key in the 


study organisation. To assess the importance of, 
or the focus on these types of safety and security, 
we created four indexes with similar formulations 
of statements: one about information security, one 
about HSE, one about deliverance reliability (and 
one about security in connection with sensitive 
objects which is not reported here). The indexes 
consist of three statements: 


I. My head of department considers ... to be very 
important 
II. The study organisation’s senior management 
considers ... to be very important 
III. My colleagues consider ... to be very important


In the four indexes we replace ... with 1) infor- 
mation security, 2) protection of sensitive objects, 
3) HSE and 4) deliverance reliability. All of the 
questions have five values, so the minimum value 
is 1 (totally disagree) and the maximum value is 5 
(totally agree). The four indexes therefore have the 
same minimum and maximum values so that they 
can be compared to each other. In this way we can
see what type(s) of safety and security managers
and employees focus most on in the study organisation
as a whole and in the various departments.


2.3 Samples 


In this study, we compare the results of two sur- 
veys done over a period of just over two years; 
the first in the spring of 2014 and the second in 
the autumn of 2016. While we have used the same 
questionnaire, we added some questions to the last 
form. A total of 323 respondents responded to the 
survey in 2014, while 446 responded in 2016. The 
survey respondents are shown in Table 1. 
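
The response rates in Table 1 are the number of respondents divided by the number of employees in each department. A minimal sketch of that arithmetic, using the 2016 counts from Table 1, is given below. The department labels 1 to 8 are assumed from the row order, and rounding to whole percentages is also an assumption.

```python
# (respondents, employees) per department in 2016, taken from Table 1.
# Department labels 1-8 are assumed from the row order of the table.
table_1_2016 = {
    1: (84, 112),
    2: (26, 31),
    3: (62, 85),
    4: (38, 54),
    5: (115, 162),
    6: (84, 109),
    7: (13, 17),
    8: (24, 28),
}
TOTAL_2016 = (446, 600)  # "Total" row of Table 1


def response_rate(respondents: int, employed: int) -> int:
    """Response rate in whole percent."""
    return round(100 * respondents / employed)


rates = {dep: response_rate(*counts) for dep, counts in table_1_2016.items()}
total_rate = response_rate(*TOTAL_2016)

for dep, rate in rates.items():
    print(f"Dep. {dep}: {rate}%")
print(f"Total: {total_rate}%")
```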


2.4 Analysis 


We calculate the significance of the differences in 
scores for all surveys. In this way, we examine the 
probabilities that the differences are due to statis- 
tical coincidences. This is done by calculating the 


Table 1. Responses, number of employed and response 
rate per department in 2016, compared with response 
rates in 2014. 


Dep.    Res.   Empl.   2016   2014
1        84    112     75%    79%
2        26     31     84%    -
3        62     85     73%    69%
4        38     54     70%    70%
5       115    162     71%    46%
6        84    109     77%    24%
7        13     17     76%    92%
8        24     28     86%    113%
Total   446    600     74%    56%


average scores’ confidence intervals. These indicate
the error margins of the average scores, i.e. the 
range which, with a given probability, contains the 
true number measured. When comparing average 
scores, we can generally say that the differences 
between them are statistically significant if they do 
not lie within each other’s confidence intervals. 

Probability is given as a percentage, and is also
often reported as a so-called P value. In choosing the
confidence interval, one chooses how much uncertainty
to accept. A 90% confidence interval
means that one has decided on a 90% probability
level, indicating that, on average, an erroneous
conclusion will be drawn in one of ten cases. A 95% confidence
interval means that there is a 95% chance that the
“true” value is within this range. We use
confidence intervals of 90%, 95% and 99%, and we
say that the differences are statistically significant
at the 10%, 5% and 1% levels, respectively.
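
The procedure described above can be sketched as follows. The sketch assumes a normal approximation for the interval of a mean (mean ± z · s / √n); the paper does not state which formula was used, and the scores below are made up for illustration and are not data from the study.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_interval(scores, level=0.95):
    """Return (low, high) for the mean of `scores` at the given level.

    Assumption: normal approximation, mean +/- z * s / sqrt(n).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(scores) / sqrt(len(scores))
    centre = mean(scores)
    return centre - half_width, centre + half_width


def differ(scores_a, scores_b, level=0.95):
    """Rule of thumb from the text: the two averages differ significantly
    if neither lies within the other's confidence interval."""
    low_a, high_a = confidence_interval(scores_a, level)
    low_b, high_b = confidence_interval(scores_b, level)
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    a_outside_b = not (low_b <= mean_a <= high_b)
    b_outside_a = not (low_a <= mean_b <= high_a)
    return a_outside_b and b_outside_a


# Made-up index scores for two survey rounds (not data from the study).
round_1 = [24, 27, 25, 28, 26, 23, 27, 25, 26, 29]
round_2 = [29, 30, 28, 31, 27, 30, 29, 32, 28, 30]

for level in (0.90, 0.95, 0.99):
    low, high = confidence_interval(round_1, level)
    print(f"{level:.0%}: [{low:.2f}, {high:.2f}]",
          "significant" if differ(round_1, round_2, level) else "not significant")
```

A higher confidence level gives a wider interval, so a difference that is significant at the 1% level is also significant at the 5% and 10% levels.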


3 RESULTS 


The questionnaire that measures information secu- 
rity culture is based on the GAIN organisational 
safety culture index, where the word “safety” is 
replaced by “information security.” The ques- 
tionnaire opened with definitions of information 
security and related sub-concepts. In order to 
map the respondents’ relationship with informa- 
tion security in their work, we asked them to take 
a position on the statement: “In my work, I have 
a clear understanding of what the term informa- 
tion security means.” The results show that all of 
the departments have an average score above 4 
(= agree somewhat). 


3.1 Management commitment 


The immediate supervisor’s attitude to and focus 
on information security is an index with eight 
questions, with a minimum value of 8 and a maxi- 
mum value of 40. Cronbach’s Alpha for this index 
in the data from 2014 was 0.911, which means very 
good agreement between the questions and that 
the index is good. The index is based on the fol- 
lowing questions: 


My immediate supervisor is personally involved in
activities to improve information security

My immediate supervisor postpones tasks/activities
if information security is not sufficiently ensured

My immediate supervisor considers information
security to be very important in all tasks and
activities

My immediate supervisor does everything he/she can
do to avoid breaches of information security

My immediate supervisor discovers employees who
fail to take sufficient considerations to information
security in their work

My immediate supervisor often praises employees for
maintaining information security

My immediate supervisor is aware of the most
important information security issues in the
company

My immediate supervisor often discusses information
security issues with the employees


The score in 2014 was 26 points, while it was
29 in 2016. When we test the significance of the
difference between the total scores for 2014 and
2016, we see that the differences are significant at
the 1% level. The differences between the departments
are significant at the 1% level in 2014 and
2016. This indicates that there are significant differences
between the departments as to whether
employees feel that their immediate supervisors
focus on information security. DEP 6, DEP 2,
DEP 1 and DEP 5 had the highest scores in 2016.
In addition, we also see that all of the departments
saw improvement in this index in 2016, especially
DEP 1, DEP 2, DEP 6 and DEP 4.


3.2 Employees’ attitude to and focus on
information security


Employees’ attitude to, and focus on information
security is an index with three questions, with
a minimum value of 3 and a maximum value of
15. In 2014, we measured a Cronbach’s Alpha at
0.754, a high value considering that the index consists
of three questions.


My colleagues do everything they can to avoid
breaches of information security

Employees encourage one another to safeguard information
security

Employees usually report all breaches and irregularities
related to information security that they experience
at work


All of the departments saw an improvement on 
this index in 2016, especially DEP 1 and DEP 5. 
DEP 6 had the highest score again in 2016. The gen- 
eral improvement went from 10 to 11 points on the 
index. When we test the significance of the differ- 
ence between the total scores for 2014 and 2016, we 
see that the differences are significant at the 1% level. 


3.3 Reporting culture and reactions to incident 
reporting 


Reporting culture and reactions to incident report- 
ing is an index with five questions, with a minimum 
value of 5 and a maximum value of 25 (Cronbach’s 
Alpha = 0.785 in 2014).


Those who pursue breaches of information security in
the business attempt to find the real causes rather 
than just blaming the employees 

There are routines and procedures at my workplace 
so that I may report information security-related 
breaches or irregularities 

After a breach of information security, measures are 
implemented to prevent this from happening again 

All irregularities and information security issues that 
are reported are remedied in a short time 

Everyone has plenty of opportunities to forward sug- 
gestions related to information security 


Again DEP 6 had the highest score in 2014 and 
2016, followed by DEP 2. All of the departments 
regressed on this index in 2016, especially DEP 1, 
which noted a decline of more than 3 points on 
the index. The average decline for all of the depart- 
ments from 2014 to 2016 was 2.1 points. This could 
be due to a learning effect and/or increased focus 
on what reporting means in the organisation (i.e. 
increased expectations), and/or an actual decrease. 
This change is statistically significant at the 1% level. 


3.4 Training/instruction in information security 
thinking 


Training/instruction in information security 
thinking is an index with four questions, with a 
minimum value of 4 and a maximum value of 20 
(Cronbach’s Alpha = 0.866 in 2014). 


Employees in my company are provided with ade- 
quate training in the secure use of ICT systems 
(e.g. e-mail, storage, encryption) 

All new employees are provided with adequate train- 
ing for tasks and the secure use of ICT systems 
(e.g. e-mail, storage, encryption) 

Everyone is provided with sufficient feedback on how 
the enterprise is performing with regard to infor- 
mation security 

Everyone is informed of any changes that may impact 
information security 


Again DEP 6 had the highest score
in 2016 (and 2014), followed by DEP 4 and DEP 1.
All of the departments saw an improvement on this
index in 2016, especially DEP 1, which increased
by more than 4 points on the index. The average
improvement for all of the departments from 2014
to 2016 was 2.4 points. This change is statistically
significant at the 1% level.


3.5 General information security issues


General information security issues is an index
with four questions, with a minimum value
of 4 and a maximum value of 20 (Cronbach’s
Alpha = 0.720 in 2014).


There are procedures that must be followed in the
event of an emergency situation in my workplace

Information security in my business is better than in
other businesses

Regular security audits are carried out

Information security is generally well taken care of at
my workplace


DEP 6 department again had the highest score 
in 2016 (and 2014), followed by DEP 1. All of the 
departments saw an improvement on this index in 
2016, especially DEP 1, which increased by 3 points 
on the index. The average improvement for all of the 
departments from 2014 to 2016 was 1.4 points. This 
change is statistically significant at the 1% level. 


3.6 Index for information security culture 


We have combined the 24 statements with five 
response options on the five different aspects of 
information security in an information security 
culture index. The indexes for the departments cor- 
respond to the average scores for the respondents. 
Since we have removed a statement from the GAIN 
index (as the wording was unsuitable for the con- 
text), the minimum score will be 24 (24 x 1) and 
the maximum score will be 120 (24 x 5). Cronbach's 
Alpha for the 24 questions in the index was 0.913 in 
2014, which means very good agreement between the 
questions and that the index is very good. The scores are shown in Table 2.
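
For readers unfamiliar with the statistic, Cronbach's Alpha for an index of k items is k/(k-1) multiplied by one minus the ratio between the sum of the item variances and the variance of the respondents' total scores. The sketch below implements this standard formula; the function name and the five example respondents are hypothetical and are not taken from the survey data.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's Alpha for a list of respondents.

    `responses` holds one sequence of item ratings per respondent,
    all of the same length (at least two items, two respondents).
    """
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    items = list(zip(*responses))  # regroup ratings per item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in responses])
    return n_items / (n_items - 1) * (1 - item_variance_sum / total_variance)


# Made-up ratings (1-5) from five respondents on a three-item index.
answers = [
    (4, 5, 4),
    (3, 3, 4),
    (5, 5, 5),
    (2, 3, 2),
    (4, 4, 3),
]
print(round(cronbach_alpha(answers), 3))
```

When every respondent gives the same rating to all items, the items agree perfectly and the function returns 1.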

We see that the DEP 6 department again had 
the highest score in 2016 and 2014, followed by 
DEP 2 and DEP 1. All of the departments saw an 
improvement on this index in 2016, especially DEP 
1, which increased by 12 points on the index. 

The average improvement for all of the depart- 
ments from 2014 to 2016 was 9 points. This change 
is statistically significant at the 1% level. The dif- 
ferences between the departments are significant at 
the 1% level, both in 2014 and 2016. 

If we are to transfer the GAIN scale values from 
organisational safety culture to organisational 


Table 2. Scores on the index for information secu- 
rity culture per department in 2014 and 2016 (min = 24 
points, max = 120 points). 


Department   2014   2016
1            77     89
2            82     89
3            80     82
4            75     84
5            76     87
6            87     95
Total        78     87


information security culture, we see that none of
the departments had a positive information secu- 
rity culture in 2014. However, in 2016 we find that 
DEP 6, DEP 2 and DEP 1 were within the part of 
the scale that we refer to as a positive culture. The 
limits for “positive organisational culture” range 
from 88 points to 120 points on the GAIN index. 
The moderate culture band goes from 47 points
to a maximum of 87 points, and scores of 46 points
or below correspond to a poor culture.
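
The comparison can be reproduced from Table 2 with a few lines of Python. The sketch below uses the limits stated above for the 24-item index (positive from 88 points, moderate from 47 points); treating every score below 47 as poor is our reading of the text, and the variable names are hypothetical.

```python
# Index scores per department as (2014, 2016), copied from Table 2.
table_2 = {
    "DEP 1": (77, 89),
    "DEP 2": (82, 89),
    "DEP 3": (80, 82),
    "DEP 4": (75, 84),
    "DEP 5": (76, 87),
    "DEP 6": (87, 95),
}

POSITIVE_FROM = 88  # lower limit of the positive band given in the text
MODERATE_FROM = 47  # lower limit of the moderate band given in the text


def band(score: int) -> str:
    """Band on the 24-item index (assumption: below 47 counts as poor)."""
    if score >= POSITIVE_FROM:
        return "positive"
    if score >= MODERATE_FROM:
        return "moderate"
    return "poor"


changes = {dep: s16 - s14 for dep, (s14, s16) in table_2.items()}
positive_2014 = [dep for dep, (s14, _) in table_2.items() if band(s14) == "positive"]
positive_2016 = [dep for dep, (_, s16) in table_2.items() if band(s16) == "positive"]

for dep, (s14, s16) in table_2.items():
    print(f"{dep}: {s14} ({band(s14)}) -> {s16} ({band(s16)}), change {changes[dep]:+d}")
```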


3.7 Questions about the department's follow-up 
of the survey in 2014 


We asked the respondents three questions about 
what actions their immediate supervisor has taken 
in his/her own department/section after the evalu- 
ation of the information security culture in 2014. 
We opened the questions with the text: “We want 
to know a little about what actions your immediate 
supervisor has taken in your department/section 
after the evaluation of the information security 
culture in 2014”. 

We made an index based on the three questions
we have reviewed. All of the questions include six
response options: 1 = totally disagree to 5 = totally
agree, and 6 = Have no knowledge about this.
When we made the index, we removed the sixth
response option. The index is based on the following
questions:


The head of my department/section has gone through 
the results of the evaluation with us 

My department/section has taken steps to improve 
the information security culture (e.g. focus on pass- 
words and security-critical information) based on 
the results of the evaluation 

I am satisfied with how my department/section has 
worked on information security over the past two 
years 


Table 3 shows the average index score for actions 
the immediate supervisor has initiated in his/her 
own department/section after the evaluation of the 
information security culture in 2014. 


3.8 Questions about training 


We asked the respondents three questions related 
to whether they have received useful information 
over the past two years that has increased their 
knowledge and awareness of information security: 


During the past two years I have received useful infor- 
mation (e.g. from security coordinator, manager, 
intranet) about what a secure password is 

During the past two years, I have received information
(e.g. from security coordinator, manager,
intranet) that made me more aware of strangers in
our premises

During the past two years, I have received information
(e.g. from the security coordinator, manager,
intranet) that has given me more insight into what
security-critical information is


Table 3. Index for actions the immediate supervisor
has initiated in his/her own department/section after the
evaluation of the information security culture in 2014.
Minimum value: 3, maximum value: 15. (N=313).


Department   Measure index
1            12
2            11
3            11
4            11
5            12
6            14
Total        12


Table 4. The average index score for information security
training in 2014-2016 in combination with scores for
the action index in each department.


Department   Training index
1            13
2            12
3            12
4            13
5            13
6            14
Total        13


Since we, to some extent, assume that this infor- 
mation comes from the security coordinator, man- 
ager and intranet, we also refer to these questions 
as training in information security 2014-2016. 

We made an index based on the three ques- 
tions. All of the questions include six response 
options: 1 = totally disagree and 5 = totally agree 
and 6 = Have no knowledge about this. When we 
created the index, we removed the sixth response 
option. 
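
A minimal sketch of how such a three-question index can be built when the sixth response option is removed is shown below. It assumes that a respondent who answers "Have no knowledge about this" on any of the three questions is left out of the index altogether; the paper does not specify this, and the function names and sample answers are hypothetical.

```python
from statistics import mean

NO_KNOWLEDGE = 6  # sixth response option, removed before building the index


def index_score(answers):
    """Sum the 1-5 ratings, or return None if any answer is 'no knowledge'.

    Assumption: respondents with a 'no knowledge' answer are excluded
    from the index rather than scored on the remaining questions.
    """
    if any(a == NO_KNOWLEDGE for a in answers):
        return None
    if any(a < 1 or a > 5 for a in answers):
        raise ValueError("ratings must lie between 1 and 5")
    return sum(answers)


def mean_index(respondents):
    """Average index score over the respondents that remain."""
    scores = [s for s in map(index_score, respondents) if s is not None]
    return mean(scores) if scores else None


# Made-up answers from four respondents to the three questions.
sample = [(5, 4, 4), (3, 6, 4), (4, 4, 5), (2, 3, 3)]
print(mean_index(sample))
```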

Table 4 shows the average index score for information
security training in 2014-2016 in combination
with scores for the action index in each
department.


3.9 Three types of safety/security/reliability 


Through initial interviews in the first survey 
we found that it was appropriate to distinguish 
between different types or nuances of security, 
safety and reliability in the study organisation. 
To assess the importance of, or the focus on, these
types of safety and security in the study organisation
as a whole, and in different departments, we
created indexes with identical questions: one about 
information security, one about HSE, one about 
deliverance reliability and one about security in 
connection with sensitive objects. The index about 
sensitive objects was not included in 2016. 

In line with what we have commented on, we see 
in Figure 1 that all of the departments did better 
on the indexes in 2016 compared with 2014. How- 
ever, the improvement is greatest on the informa- 
tion security index. 

This is interesting because in 2014 we con- 
cluded that the results indicated that little atten- 
tion was paid to information security compared 
with the “primary task” of deliverance reliability 
and the more traditional focus on HSE. The fact 
that the respondents in 2016 deemed that the focus
on information security is as strong as for
the other types of security can indicate that the
study organisation’s efforts to put this topic on the
agenda have worked as intended.


3.10 Interview results 


The organisation has carried out extensive changes 
in its organisation and practices during the period. 
The scope and significance of the changes are 
emphasised through interviews. Information security
and HSE-related support and correction from
colleagues seem to be a fairly consistent practice.
They also seem to be topics that are the subject of 
conversations. This is well in line with the general 


Figure 1. Total scores for the indexes on information
security (N=323 and 446), deliverance reliability
(N=323 and 446) and HSE (N=85 and 169).


increase in all types of security culture in the agency. 
There has been a strong focus on information secu- 
rity both internally in the organisation and through 
auditing, supervision and ownership requirements. 
Since 2014, a clearer “security organisation”
has been established, to use the organisation’s own
term. Security organisation means roles, ownership,
and processes related to security.

The security culture now seems more strongly 
rooted the higher one gets in the organisation. 
This is partly based on interviews, which quite 
uniformly point to a clear focus on security in the 
agency’s top management, and more varied further 
follow-up. This also correlates with the fact that 
managers state to a greater degree than staff that 
security is being followed up. 


4 CONCLUDING DISCUSSION 


4.1 Improvements and measures 


We saw above that DEP 6 scored the highest on the
information security culture index, and
we also saw that DEP 6 scored the highest on the
index for managers’ measures aimed at information
security culture. This indicates a correlation
between immediate supervisors’ measures since 
2014, and the information security culture score. 
Figure 2 confirms the impression that good 
work on measures produces results in the form 
of high scores on the information security culture 


Figure 2. Information security culture scores in 2014
and 2016, for each department, including a measure
index.


index. On the other hand, the impact of measures
can also be assessed on the basis of which depart- 
ment has made the greatest improvement on the 
index from 2014 to 2016. 

This result is in line with the assertion of
Knapp et al (2006), who depict top management
support as a significant predictor of an organization’s
security culture and level of policy enforcement.
It is also in line with Schein’s (2004) view
on the possibilities to manage culture in organisa- 
tions. As noted, he has coined “six primary embed- 
ding mechanisms” that managers can use to shape 
culture: e.g. what managers pay attention to, meas- 
ure and control on a regular basis, how managers 
allocate resources. Thus, our study may indicate 
how organisations may improve their organiza- 
tional security culture through cultural manage- 
ment. Moreover, the training measures also seem 
to be related to high security culture scores. Thus, 
this organization may provide an example of a 
possible solution to the key challenge addressed by 
Nosworthy (2000): how do we educate the people
of the organization to successfully implement the 
requirements of the information security policy? 
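
The relationship between measures and culture scores discussed in this section can be examined with a simple Pearson correlation over the six departments, using the rounded values in Tables 2 and 3. The paper does not report such a coefficient; the sketch below is only an illustration, and with six rounded data points it carries very little statistical weight. The function and variable names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Rounded values for DEP 1 to DEP 6, copied from Tables 2 and 3.
measure_index = [12, 11, 11, 11, 12, 14]  # Table 3
culture_2014 = [77, 82, 80, 75, 76, 87]   # Table 2
culture_2016 = [89, 89, 82, 84, 87, 95]   # Table 2
improvement = [after - before for before, after in zip(culture_2014, culture_2016)]

r_level = pearson_r(measure_index, culture_2016)   # measures vs 2016 score
r_change = pearson_r(measure_index, improvement)   # measures vs improvement

print(f"measure index vs 2016 score:  r = {r_level:.2f}")
print(f"measure index vs improvement: r = {r_change:.2f}")
```

Comparing the two coefficients separates the two readings mentioned above: whether departments with many measures have high scores, and whether they are also the departments that improved the most.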


4.2 Is the improvement of information security 
culture real? 


Hawthorne effect? We have seen an average increase 
of 9 points from 2014 to 2016 in the study organi- 
sation’s information security culture index. This is 
a significant improvement that can be said to indi- 
cate that the study organisation’s information secu- 
rity measures from 2014 to 2016 have produced 
good results. On the other hand, the improvement 
may be interpreted as a “measurement effect”. This 
means that the improvement has more to do with 
the repeated survey than a real effect. 

Such measurement effects can be due to numer- 
ous mechanisms. On a general basis, one can point 
to the so-called “Hawthorne effect,” which indi- 
cates that being measured repeatedly produces 
improvement in the surveys because the research- 
ers’ focus per se causes the studied elements to do 
better in measurements. This is the “placebo effect” 
of intervention research, i.e., one can expect a cer- 
tain amount of improvement in a repeated survey, 
only because the survey is conducted. 

Learning effect? Another measurement effect 
that occurs in some cases is that the respondents 
learn from the first questionnaire and respond 
differently to the second questionnaire for that 
reason. However, this can yield both positive and 
negative outcomes; the respondents may think that 
they assessed their workplace too “harshly” when 
they think more about the topics from the sur- 
vey and respond more positively the second time. 
However, they may also think that they assessed 


their workplace too “leniently” when they think 
more about the topics from the survey and respond 
more negatively the second time. This can result in 
lower scores in a repeated survey. 

Noise. A third effect that can impact the second 
survey is that reorganisation, noise and the like 
make the respondents more negatively disposed to 
their workplace and the topics that are being meas- 
ured and therefore show their dissatisfaction by 
giving negative answers. This would involve lower
scores, which we observe only for reporting culture.

We can perhaps conclude to some extent that the 
comparisons seem realistic, first since the results 
for the short indexes for safety, security and relia- 
bility also indicate a clear, increased focus on infor- 
mation security in the study organisation. In 2014, 
we saw a greater focus among managers and staff 
in the study organisation on deliverance reliability 
and HSE than on information security. However, in 
2016 we see that the score for information security 
is at the same high level as deliverance reliability 
and HSE. Second, we saw a relationship between 
measures, training and security culture scores. 
Third, interviews indicate increased focus on infor- 
mation security in 2016 compared with 2014. 

4.2.1 Can the improvement be explained by 
sample differences? 

Changed scores should also be discussed in the light 
of possible sample differences in the two surveys. 
The response rate in 2016 was higher than it was
in 2014. The fact that there were more respondents 
in 2016 may have influenced the answers given and 
the scores on the indexes we looked at. In 2016, the 
percentage of respondents who had been employed 
for 16 or more years was down five points com- 
pared with 2014. This should have indicated lower 
scores on the information security culture index, 
since we know that higher age and seniority yield 
higher scores. However, it should be pointed out
that the increased share of respondents without higher education
in the 2016 sample probably did not respond to
the questions about information security culture,
since this higher share was probably due to higher
response rates from DEP 3.

It should also be mentioned that over half 
of those who responded to the survey had been 
employed by the study organisation for five years 
or less in both 2014 and 2016. This is a relatively 
large share. However, we do not know how many 
of the employees have been employed for less than 
two years. 

Finally, it should be noted that an important 
difference between the departments’ samples in 
2014 and 2016 is that all of the departments in 
2016 were given a code for “staff” that counts in 
the department’s average. This is a small number of 
persons in each department, which usually includes
one supervisor and the rest staff. We considered
removing staff from the comparisons between the 
departments in 2014 and 2016, but decided not to 
after a systematic analysis showed that the staff 
groups did not have higher scores than the sections 
in the departments. 

Average scores for GAIN information security 
culture increase with the age of the respondents 
and decline with education. The difference is the 
same between the age groups in 2014 and 2016, 
and about the same for the highest and lowest level 
of education (14 in 2014 and 13 in 2016). There 
is no significant difference between the scores for 
men and women. Finally, we see that those with 
11 and more years of seniority have significantly 
higher scores on the index than those with 10 or 
fewer years of seniority. The difference is 6 points 
in 2014 and 5 points in 2016. 


4.2.2 Information security culture scores with and 
without managers 

Since we also include managers in the survey, we 
will examine whether managers’ responses pull the 
scores up or down in the various departments. The 
differences between the information security cul- 
ture scores of staff, other managers and department 
and section managers were insignificant in 2014. In 
2016, however, the differences were greater. Section 
or department managers consider the information 
security culture to be 6 points higher than staff do. 
This is interesting, but perhaps not unexpected, 
because section or department managers answer 
on the basis of their immediate supervisor, who is 
either the department manager or the study organ- 
isation’s director. The results can therefore indicate 
that efforts and measures aimed at information 
security culture in 2016 are better rooted at the management
level than at the staff level. Similarly, this
may indicate that efforts and measures are better 
rooted at higher management levels in the organi- 
sation, which correlates with statements from the 
interviews. However, the differences between the 
three groups within each measurement point are 
not statistically significant, so these results must 
not be accorded too much weight (P = 0.14). 


5 CONCLUSION 


We conclude that it seems that management imple- 
mentation of measures aimed at improving organ- 
izational information security culture has led to 
improvements. Thus, our study may indicate how 
organisations may improve their organizational 


security culture through cultural management. It 
is important, however, to note that the study lacks 
a control group, and that it is impossible to rule 
out the influence of a possible Hawthorne effect. 
Additionally, we discuss whether sample characteristics
may explain the observed improvements.


ABSTRACT: Most of today’s Machine Learning (ML) methods and implementations are based on
correlations, in the sense of a statistical relationship between a set of inputs and the output(s) under investigation.
The relationship might be obscure to the human mind, but through the use of ML, mathematics
and statistics make it seemingly apparent. However, basing safety critical decisions on such methods
suffers from the same pitfalls as decisions based on any other correlation metric that disregards causality.
Causality is key to ensure that applied mitigation tactics will actually affect the outcome in the desired 
way. This paper reviews the current situation and challenges of applying ML in high risk environments. It 
further outlines how phenomenological knowledge, together with an uncertainty-based risk perspective 
can be incorporated to alleviate the missing causality considerations in current practice. 


1 INTRODUCTION 


In this paper we highlight some pitfalls to avoid in 
the design of Machine Learning (ML) applications 
for high risk engineering applications. We also pro- 
pose some recommendations to consider in model 
development, and emphasize the introduction of 
causality constraints based on phenomenological 
knowledge. 


1.1 Background 


ML is recognized as one of the key enablers for 
the fourth industrial revolution¹. In this setting
it is often communicated as a tool that, together 
with increased access to data and computational 
power, can unlock a huge potential for increased 
efficiency, new insights and ultimately new revenue 
streams. Success stories of businesses that have dis- 
rupted entire industries are often shared to inspire 
investment in similar technologies. However, for 
operation of complex engineering systems in high 
risk environments, the new challenges that appear 
are often not clearly expressed. 

Concerns have been raised regarding the reli- 
ability and trustworthiness of systems relying on 
Artificial Intelligence (AI) in general, and spe- 
cifically related to the current main strategy of 
implementation. Knight (2017) emphasizes that 
“not knowing how the most advanced algorithms 


¹ For a broader discussion see e.g. Lu (2017) and Dopico,
Gomez, De la Fuente, Garcia, Rosillo, & Puche (2016).


do what they do might become a serious problem 
as computers become more responsible for mak- 
ing important decisions”. The problem of acci- 
dents in ML systems is discussed in detail in a 
paper led by Google Brain researchers Amodei, 
Olah, Steinhardt, Christiano, Schulman, & Mané 
(2016). Here the authors motivate the increasing 
need to address these safety problems by some 
general trends, one of which relates to the increas- 
ing autonomy in AI systems. “Systems that sim- 
ply output a recommendation to human users, 
such as speech systems, typically have relatively 
limited potential to cause harm. By contrast, sys- 
tems that exert direct control over the world, such 
as machines controlling industrial processes, can 
cause harms in a way that humans cannot neces- 
sarily correct or oversee.” 

In this paper we turn our attention towards the 
introduction of ML in design and operation of 
complex engineering systems in high risk environ- 
ments. These are systems where today’s methods 
for assessing risk rely heavily on understanding 
the underlying physical processes and our ability 
to model these. This is in contrast to e.g. mass pro- 
duction of components where more data-driven, 
statistically founded methods can be applied to 
estimate rates of failure. 

We acknowledge the need for ML technologies 
to address increasing complexity of engineering 
systems, and the challenges that follow in quantifying 
uncertainty and risk based on detailed numerical 
simulation. However, this type of application 
reduces the tolerance for erroneous model behavior. 
Most of the methods applied in ML are based 
on historic statistical models, enhanced by recent 
breakthroughs in computer science. The assumptions, 
limitations and practical challenges of these 
statistical models still remain, and serve as recurring 
pitfalls in the digital era. 

For a given application both the statistical and 
ML mindsets may have their advantages and 
drawbacks, but we argue that for the high risk 
engineering problems discussed in this paper, uncritical 
application of ML is unwarranted. We believe that 
experienced ML practitioners know this, and the 
main problem lies in how ML is communicated 
in the current digitalization boom, and how it is 
perceived by the increasing number of new prac- 
titioners in the field that may also be strongly 
incentivized to develop solutions for cutting costs 
through automation. 


1.2 Correlation and causation 


Most of today’s ML methods and implementations 
are based on correlations, which we define in the 
general sense as any statistical relationship, whether 
causal or not, between random variables, i.e. the 
degree to which two or more variables tend to vary 
together. 

“Correlation does not imply causation” is a well 
used phrase within statistics, and describes what 
still remains as one of the major pitfalls in the 
analysis of data. The importance of this distinc- 
tion depends on the degree to which one intends 
to intervene, and the consequence of erroneous 
intervention. See e.g. (Pearl 2010). For many ML 
applications this may not be significant. But for 
the use cases considered in this paper it plays an 
important role, and we will argue that causal- 
ity constraints from phenomenological knowl- 
edge should be incorporated in ML models—to 
strengthen model performance for tail events, 
to increase model transparency, and to make it 
easier to falsify models that do not comply with 
observations. 


1.3 Structure of this paper 


Section 2 is included for readers that are unfamiliar 
with or new to the field of ML, and gives a brief 
introduction to some core concepts and their relation 
to classical statistics. The experienced ML practi- 
tioner may jump directly to Section 3, where we 
propose a set of model properties that should be 
accommodated for ML applications in high risk, 
low probability scenarios. Going beyond model 
selection, some general pitfalls are highlighted in 
Section 4, which gives an example illustrating the 
use of ML for anomaly detection and space explo- 
ration in Structural Reliability Analysis (SRA). 
Finally, our conclusions and some final remarks 
are summarized in Section 5. 


2 A BIT OF HISTORY 


Discussions regarding ML and AI can often 
become fairly opaque, colored by data science jar- 
gon, cognitive analogies and marketing buzzwords. 
This section will clarify the relation between ML 
and statistical methods by comparing the classes of 
ML problems with their statistical counterparts, and 
tracing their origins. This section will only briefly 
present the statistical background that underpins 
the main classes of ML methods. See e.g. (Hastie, 
Tibshirani, & Friedman 2001) for a thorough intro- 
duction, or (Domingos 2012) for a more informal 
description on how they are used in practice. 


2.1 A formal definition of machine learning 


Although the field of ML is often defined in cog- 
nitive terms, as giving computers the ability to 
learn without being programmed, a more formal 
definition often cited is the one by Mitchell (1997). 
“A computer program is said to learn from experi- 
ence E with respect to some class of tasks T and 
performance measure P if its performance at tasks 
in T, as measured by P, improves with experience 
E.” Further, ML is generally classified into 
two categories called supervised and unsupervised 
learning. 

Supervised learning: For some unknown relationship 
between variables $x \to y(x)$ where N pairs of data 
are observed, $\{(x_i, y_i)\}_{i=1}^{N}$, the goal is to 
estimate y* for unobserved x*. Here, the experience E 
is the observed data, the task T is to predict y* and 
P is usually a measure of the difference between 
the predicted y* and the true value y(x*). 

Unsupervised learning: The goal is to discover 
patterns in unlabeled data. For instance, based 
only on the x-values in the above example, 
$\{x_i\}_{i=1}^{N}$, estimate the distribution over the 
data or identify clusters, etc. With the task T of 
clustering in mind, the experience E is still the 
observed data and P might be a measure of cluster 
compactness or separability. 
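
For the supervised case, Mitchell's definition can be made concrete with a small sketch. In the Python below, which is an illustration and not part of the original study, the task T is to predict y* from a fitted line, the experience E is a set of noisy samples, and the performance measure P is the mean squared error against the true relationship. The linear relationship, the noise level, the sample sizes and all function names are assumptions chosen for the example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for the model y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def true_y(x):
    """Assumed 'unknown' relationship, used only to generate and score data."""
    return 1.0 + 2.0 * x


def performance(n_train, rng, n_repeats=200):
    """P: mean squared prediction error on a grid, averaged over repeats."""
    grid = [i / 20 for i in range(21)]
    total = 0.0
    for _ in range(n_repeats):
        # E: n_train noisy observations of the relationship.
        xs = [rng.random() for _ in range(n_train)]
        ys = [true_y(x) + rng.gauss(0.0, 0.5) for x in xs]
        a, b = fit_line(xs, ys)
        # T: predict y* at unobserved x* on the grid.
        total += sum((a + b * x - true_y(x)) ** 2 for x in grid) / len(grid)
    return total / n_repeats


rng = random.Random(0)
p_small = performance(5, rng)    # little experience
p_large = performance(50, rng)   # more experience
print(f"mean squared error with N=5:  {p_small:.4f}")
print(f"mean squared error with N=50: {p_large:.4f}")
```

Averaging over repeated draws is a design choice made here so that the comparison reflects the typical effect of more data and not the luck of a single sample.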


2.2 Machine learning and statistical modelling 


The increasing popularity of ML today may be 
credited to recent advancements within computer 
science, although most supervised and unsupervised 
learning methods have roots in traditional statistical 
methods. An overview is illustrated in Figure 1. 
Supervised learning is based on prior knowledge, 
and covers regression and classification (discriminant 
analysis). 

Regression considers a continuous dependent variable 
represented by a model function fitted to data. 
Assumptions are generally made about the data 
generation process, e.g. homoscedasticity (equal finite 
variance), independence and normality. The earliest 
form of regression was the method of least squares, 
published by Legendre (1805) and Gauss (1809). 

Figure 1. Machine learning vs. multivariate statistics. 

Classification or discriminant analysis has the 
same objective as regression, but where the depend- 
ent variable is discrete, and typically comes in a 
categorical form as labels from a finite set. Some 
of the earliest work was by Fisher (1936) leading to 
Fisher’s linear discriminant function. Other popular 
methods are logistic regression, which dates back 
to Verhulst (1838), decision trees (Morgan & 
Sonquist 1963) and k-nearest neighbors (Cover & 
Hart 1967). 

Note that due to the close link between these 
two categories of supervised learning problems, 
regression models may be altered to work for dis- 
criminant analysis and vice versa. 

Unsupervised learning is related to data explo- 
ration problems. The main exploratory methods 
are often classified as clustering or dimensionality 
reduction. 

Clustering analysis considers the task of group- 
ing objects into sets (clusters) such that objects in 
the same set are more similar to each other than 
to those in other sets. No precise definition of a 
cluster exists, and this is one of the reasons why 
there are so many different clustering algorithms 
(Estivill-Castro 2002). Some popular alternatives 
representing different approaches to the problem 
of clustering are k-means (MacQueen 1967), hier- 
archical clustering (Sibson 1973), Gaussian mixture 
models using expectation-maximization (Dempster, 
Laird, & Rubin 1977) and Density-Based Spatial 
Clustering of Applications with Noise (DBSCAN) 
(Ester, Kriegel, Sander, & Xu 1996). 


Dimensionality reduction is the procedure 
of reducing the number of input variables to a 
smaller set of principal variables. One fundamental 
approach is principal component analysis 
(Pearson 1901), developed by Karl Pearson, con- 
sidered the father of modern statistics. 

The methods mentioned above are by no means 
a complete list, and the references cited are meant 
to indicate early work on the different 
subjects, as true origins are often debatable and 
outside the scope of this paper. A lot of research 
has gone into extending or improving on these 
methods since first invented, which has spawned 
the large variety of models and algorithms used in 
ML today. There are also other popular methods 
and widely used techniques within ML that can 
be placed in more than one category in Figure 1, 
e.g. artificial neural networks dating back to Hebb 
(1950) and using the backpropagation algorithm 
developed by Werbos in 1974. 


2.3 Model design vs. trial-and-error 


Many of the similarities between statistics and 
ML are just hidden behind different terminology. 
For instance, where we in ML refer to features in 
a model and that model weights are learned from 
data, a statistician would refer to variables in a 
model where data is used to estimate model param- 
eters. But there are also some important differences 
in how the (same) models are used, and whether 
the main emphasis is on designing a good model or 
obtaining good prediction by trial-and-error. 

In classical statistics, the focus is often on testing 
hypotheses about causes and effects and on the 
interpretability of the models applied. A common 
aphorism in statistics is that “All models are wrong, 
but some are useful”. The analyst should know when 
the model will break, how it breaks, and whether one 
can still use it anyway. The main goal is understanding 
the underlying mechanisms that drive the things we 
observe. 

In contrast, ML focuses more on the predictive 
accuracy of models, with less attention 
towards model interpretation. Model selection is 
often based on trial-and-error through cross-validation 
to evaluate goodness-of-fit criteria, where 
prediction accuracy is evaluated on data that was 
excluded in the learning process. Although the 
main goal is to obtain a good prediction, special 
considerations are made based on what the predic- 
tion is used for in a given application. For high risk 
applications for instance, false positives may be far 
worse than false negatives (or vice versa), and the 
model optimization can be weighted accordingly. 
Still, it is based on observing the desired behavior 
in future or excluded data. 
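
The selection procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the one-dimensional toy data, the midpoint classifier, the tuning parameter `shift`, and the cost weight `fn_weight` for missed failures are all hypothetical. The sketch picks the threshold shift by k-fold cross-validation, first with equal error costs and then with missed failures (false negatives) weighted ten times higher than false alarms.

```python
import random


def make_data(n, rng):
    """Toy 1-D data: label 1 (failure) has a higher mean score than label 0."""
    data = []
    for _ in range(n):
        label = 1 if rng.random() < 0.3 else 0
        x = rng.gauss(2.0 if label else 0.0, 1.0)
        data.append((x, label))
    return data


def train_midpoint(train):
    """Learning step: estimate the class means and return their midpoint."""
    x0 = [x for x, y in train if y == 0]
    x1 = [x for x, y in train if y == 1]
    return 0.5 * (sum(x0) / len(x0) + sum(x1) / len(x1))


def cv_cost(data, shift, fn_weight, k=5):
    """Weighted misclassification cost, evaluated only on held-out folds."""
    cost = 0.0
    for fold in range(k):
        test = data[fold::k]
        train = [d for i, d in enumerate(data) if i % k != fold]
        threshold = train_midpoint(train) + shift
        for x, y in test:
            pred = 1 if x > threshold else 0
            if pred == 1 and y == 0:
                cost += 1.0          # false alarm
            elif pred == 0 and y == 1:
                cost += fn_weight    # missed failure
    return cost / len(data)


def select_shift(data, fn_weight, shifts):
    """Trial-and-error model selection: keep the shift with the lowest CV cost."""
    return min(shifts, key=lambda s: cv_cost(data, s, fn_weight))


rng = random.Random(1)
data = make_data(400, rng)
shifts = [i / 10 for i in range(-20, 21)]
shift_symmetric = select_shift(data, 1.0, shifts)
shift_weighted = select_shift(data, 10.0, shifts)
print("selected shift, equal costs:        ", shift_symmetric)
print("selected shift, missed failure x10: ", shift_weighted)
```

Weighting missed failures more heavily can only move the selected threshold downwards or leave it unchanged, so more cases are flagged. Note that the selection still rests entirely on behavior observed in held-out data, which is the limitation the text points to.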

Although science has always been concerned 
with both prediction and explanation, the different 
philosophies have resulted in much debate between 
the statistics and machine learning communities. 
See for instance the discussion in Breiman (2001). 


3 ML FOR HIGH RISK ENGINEERING 
APPLICATIONS 


This section highlights some of the main challenges 
when using ML for high-risk and low probability 
scenarios. We continue by proposing recommen- 
dations to consider in ML model development to 
address these challenges. 


3.1 Tail events in high risk environments 


Three of the main challenges with ML applied to 
high risk and low probability scenarios are related 
to the following: 


• High risk reduces the tolerance for wrong 
predictions. The consequence might be catastrophic. 

• Critical consequences often relate to tail events, 
for which data is naturally scarce. This increases 
uncertainty and reduces the accuracy of 
predictions. 

• The ML models that are able to fit the data well 
are often opaque. This makes the model less 
falsifiable, increasing uncertainty and reducing 
decision makers’ ability to trust the model. 


The first point relates to the decisions that are 
made based on model predictions. When the high 
risk is associated with a catastrophic consequence, 
the tolerance for a wrong prediction is clearly low. 
Moreover, severe consequences are often related 
to a rare event. This introduces additional 
uncertainty as data is scarce². These first two points are 
in direct contrast to the typical ML applications 
today (e.g. non-consequential recommendation 
engines). In this respect, assurance of safety critical 
systems relying on ML is also receiving increasing 
attention, see e.g. Brandsæter & Knutsen 
(In press). 

The last point on model opaqueness (lack of 
transparency) relates to quantification of model 
discrepancy and the ability to falsify models that 
do not comply with observations. The added 
uncertainty from model opaqueness may at first 
seem purely subjective. However, as we will see in 
the following section, addressing this by increas- 
ing model transparency may relax requirements on 
accuracy and vice versa. 


²Scarce in the sense that the size of data is small compared 
to the number of relevant dimensions. This is because the 
event under consideration, e.g. structural failure, is rare 
and expensive to approximate by experiments. 


3.2 Recommendations on model development 


The challenges stated above will make any ML 
or statistics based model prone to erroneous, 
and potentially dangerous, use. However, many 
of the most common pitfalls in using such cor- 
relation based or data-driven methods may be 
avoided by proper model selection and design. 
To increase confidence in the safe use of ML 
models we propose some recommendations in 
this section. 

For supervised learning, and when we want to 
use ML for prediction in general engineering appli- 
cations, one should seek to adopt or develop mod- 
els with the following attributes: 


1. Flexibility: The model’s ability to represent a 
large class of functions. 

2. Constrainability: The ability to impose model 
constraints based on phenomenological 
knowledge. 

3. Probabilistic inference: Probabilistic represen- 
tation of model output defined on the entire 
model range (i.e. the model output is repre- 
sented by a distribution). 


The first two concepts relate to the problem of 
underfitting and overfitting respectively. The third 
is important when the model is used in assessment 
of risk and reliability. 

From an engineering perspective, flexibility is 
needed to capture the behavior of complex physical 
systems and their response. Most non-parametric 
models fulfill this requirement. The definition of 
“a large class of functions” in this context may not 
be precise, as it certainly depends on the problem 
at hand. A typical example may be all continuous 
or differentiable functions with a finite number of 
discontinuities. 

Constrainability is beneficial for two main rea- 
sons. First, it reduces the possibility of overfitting 
the data, thus increasing the robustness and the 
performance for application on future data. This 
is particularly important for tail behavior prob- 
lems where data is scarce. The second reason is 
that imposing constraints based on phenomeno- 
logical knowledge reduces model opaqueness, i.e. 
increases transparency. 

By a transparent model we mean that the 
relationship between model inputs and outputs can 
be meaningfully understood by humans, and that 
the characteristic properties and limitations 
of the model can be understood without explicit 
numerical computation. The most straightforward 
example is simple linear regression with linear 
basis, i.e. fitting a line. In contrast, for an opaque 
model, the model behavior can only be investigated 
through computation. These models are often 
referred to as black boxes, where the only way to 
fully characterize the model’s properties is through 
exhaustive computation, i.e. evaluating the model for 
all possible inputs. 

The scientific method is based on the principle 
that any model, or hypothesis, should be falsifiable. 
All models, including ML models, are based on 
a set of assumptions. For a model or an assump- 
tion to be falsifiable, it must, in principle, be pos- 
sible to make an observation that would show the 
assumption to be false. Thus, model transparency 
is important as it enables one way of falsification. 
ML models are typically falsified by observing 
poor accuracy. That is, observations are made that 
differ significantly from model predictions. How- 
ever, understanding why such discrepancies occur 
is often difficult in an opaque model. If we assume 
that the observation is not erroneous (observed 
discrepancy is not related to noise), then the root 
cause is one of two things. Either relevant data is 
lacking, i.e. the model is built on too few datapoints 
close to the observed discrepancy, in which case a 
larger degree of uncertainty is expected, or the 
underlying model assumptions are violated, in which 
case the model must be changed. 

In order to develop useful ML models for high 
risk applications, a compromise usually has to be 
made between model transparency and flexibility. 
To counteract the negative tradeoff of an initially 
opaque, but flexible model, we might impose con- 
straints based on phenomenological knowledge 
related to causality. This can be thought of as 
“putting the black box inside a white box”, i.e. 
enabling deduction of bounding model properties 
through the imposed constraints. 

Figure 2. [...] interpolation in data without noise. The 
interpolation function cannot enter the shaded areas. 

Figure 2 illustrates constraints in the form of 
boundedness, monotonicity and convexity. Three 
datapoints have been observed, and for simplicity 
we assume the data does not contain noise. This 
means that we are looking for an interpolation 
model, a function passing through all three points. 
The shaded areas show where the function cannot 
enter due to the imposed constraints. Hence, starting 
with the space of all functions passing through 
the three points, the constraints reduce the space 
of possible interpolation functions. Assuming 
that the constraints are based on phenomenological 
knowledge that holds in reality, the constrained 
models are less prone to overfitting (more robust). 
In the case where an observation is made within 
the shaded area, the model is falsified immediately 
as the assumptions behind the constraints do not 
conform with an observed outcome. Hence, by 
imposing constraints based on phenomenological 
knowledge, either a) performance is increased, or b) 
the model is falsified and the modeler learns 
something fundamentally new about the phenomenon 
studied, which can be applied in future modelling. 
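
The falsification argument can be illustrated with a short sketch for the monotonicity case, optionally combined with boundedness. The code is an illustration under assumptions: the three datapoints, the function names and the nondecreasing direction are hypothetical, and the data is taken to be noise-free as in the text. For any x it returns the interval that a nondecreasing interpolant may occupy. Everything outside that interval corresponds to a shaded area, so a new observation falling there falsifies the constrained model.

```python
import bisect


def monotone_envelope(points, x, lower=float("-inf"), upper=float("inf")):
    """Admissible interval for f(x) when f is nondecreasing, passes through
    all noise-free points, and is bounded by [lower, upper]."""
    pts = sorted(points)
    xs = [p[0] for p in pts]
    i = bisect.bisect_left(xs, x)
    if i < len(xs) and xs[i] == x:
        # At an observed point the interpolant is pinned to the observation.
        return pts[i][1], pts[i][1]
    lo = pts[i - 1][1] if i > 0 else lower       # nearest observation to the left
    hi = pts[i][1] if i < len(xs) else upper     # nearest observation to the right
    return lo, hi


def is_falsified(points, observation, **bounds):
    """True if a new noise-free observation lies in a forbidden region."""
    x, y = observation
    lo, hi = monotone_envelope(points, x, **bounds)
    return not (lo <= y <= hi)


# Three hypothetical noise-free datapoints.
points = [(1.0, 1.0), (2.0, 1.5), (4.0, 3.0)]

print(monotone_envelope(points, 3.0))                        # between two points
print(monotone_envelope(points, 5.0, lower=0.0, upper=4.0))  # right of the data
print(is_falsified(points, (3.0, 2.0)))   # consistent with the constraint
print(is_falsified(points, (3.0, 1.0)))   # inside a forbidden region
```

With noisy data the same check would have to be stated in terms of probabilities, using the assumed noise distribution.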

The constrained models shown in Figure 2 may 
be restricted further by imposing multiple constraints, 
e.g. boundedness and monotonicity or 
monotonicity and convexity. Note also that for 
noisy data, this means interpreting constraints in 
terms of probabilities using the assumed distribu- 
tion of noise. The example is motivated by the more 
general class of constraints in the form of partial 
differential inequalities, for which phenomenologi- 
cal knowledge related to causal effects in various 
physical phenomena may often be available. 

Practically, imposing constraints such as the 
ones illustrated in Figure 2 means translating the 
phenomenological constraints to constraints in 
the ML optimization algorithm. Many techniques 
exist for including constraints in ML through opti- 
mization, usually in order to obtain regularization 
effects, but we emphasize that developing the nec- 
essary links between these constraints and phe- 
nomenological knowledge will be highly beneficial. 
See for instance (Yu 2007) or (Maatouk & Bay 
2017) for some examples and further discussion. 
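
As one simple instance of turning a phenomenological constraint into a constraint on the fitting procedure, the sketch below fits noisy responses by least squares subject to a nondecreasing constraint, using the pool-adjacent-violators algorithm (isotonic regression). This technique is chosen here for brevity and is not the method of Yu (2007) or Maatouk & Bay (2017). The function name and the sample values are assumptions.

```python
def isotonic_fit(ys):
    """Least-squares fit to ys (already ordered by x) under the constraint
    that the fitted values are nondecreasing, computed with the
    pool-adjacent-violators algorithm."""
    blocks = []  # each block holds [sum of values, number of values]
    for y in ys:
        blocks.append([y, 1])
        # Merge backwards while the previous block mean exceeds the last one.
        # Means are compared by cross-multiplication to avoid division.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


# Hypothetical noisy responses, ordered by increasing input.
observed = [1.0, 3.0, 2.0, 4.0]
fitted = isotonic_fit(observed)
print(fitted)  # the violating pair (3.0, 2.0) is pooled to its mean
```

The constraint acts as a regularizer: the fit cannot follow noise that would break monotonicity, and its justification comes from knowledge about the phenomenon, not from the data alone.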

Probabilistic inference on the ML model output 
is needed for risk and reliability analysis 
applications. This means that the model output should 
ideally be in the form of a distribution. Model 
predictions in the form of fixed values and best 
estimates are not applicable. The dangers of expressing 
risk through expected values are well known, 
and modern definitions of risk usually relate to the 
distribution over possible outcomes. Hence, some 
quantification of prediction uncertainty is essential. 
It should be noted that this goes beyond the 
probabilities often used to report model accuracy 
for ML classifiers. The fact that an object is 
correctly classified 99% of the time might be 
insignificant if the outcomes of the remaining 1% are 
associated with severe consequences. 

Figure 3. ML workflow with emphasis on constraints. 


There is a traditional approach to this problem, 
where ML has played a role in mathematical mod- 
els for calculation of risk and reliability. First, the 
models are scrutinized through human quality con- 
trol to ensure that the model accuracy is sufficient 
or at least on the conservative side. This quickly 
becomes infeasible for higher dimensional mod- 
els. Further, a single distribution representing the 
model uncertainty is often established from statis- 
tical analysis of prediction accuracy alone, assum- 
ing uniformly distributed data. This is no longer 
feasible for higher dimensional models or when the 
input data is far from uniformly distributed. 
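
The need for a predictive distribution in place of a best estimate can be shown with a small worked example. The sketch assumes, purely for illustration, that the model output is Gaussian. The numbers (a best estimate of 80 against a limit of 100, in arbitrary units) and the function names are hypothetical. Two models with the same best estimate but different predictive spread imply very different probabilities of exceeding the limit.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def exceedance_probability(mean, std, limit):
    """P(Y > limit) for an assumed Gaussian prediction Y ~ N(mean, std^2)."""
    return 1.0 - normal_cdf((limit - mean) / std)


best_estimate = 80.0   # identical point prediction from both models
limit = 100.0          # hypothetical critical level

p_narrow = exceedance_probability(best_estimate, 5.0, limit)
p_wide = exceedance_probability(best_estimate, 15.0, limit)
print(f"std = 5:  P(exceed limit) = {p_narrow:.2e}")
print(f"std = 15: P(exceed limit) = {p_wide:.2e}")
```

Both models report the same best estimate, yet the exceedance probabilities differ by several orders of magnitude, and that difference is exactly the information a risk assessment needs.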


3.3 Working with constraints 


Figure 3 illustrates the workflow for building 
ML models with emphasis on constraints. For sim- 
plicity we ignore work on data cleaning and fea- 
ture selection that naturally comes prior to model 
development. 

Any hypothesis on model constraints com- 
ing from assumed causalities in the phenomenon 
under consideration must be tested to identify to 
what degree they hold in the observed data. Hence, 
the task of hypothesis testing is emphasized. There 
might also be valid constraints that are not imme- 
diately identifiable from phenomenological knowl- 
edge. It could therefore be valuable to search for 
possible constraints by unsupervised learning, to 
serve as hints to the modeler, and help identify 
additional constraints before further testing and 
possible inclusion in the ML model. 
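
A first, purely descriptive way to examine to what degree a hypothesized constraint holds in observed data is sketched below for monotonicity. It counts the fraction of data pairs whose ordering contradicts a nondecreasing relationship. This is an assumed illustration, not a formal hypothesis test: with noisy data some violations are expected even when the constraint holds, so the rate would have to be judged against the assumed noise level. The function name and the sample values are hypothetical.

```python
from itertools import combinations


def monotonicity_violation_rate(xs, ys):
    """Fraction of pairs with different x whose y-ordering contradicts a
    nondecreasing relationship between x and y."""
    pairs = 0
    violations = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # pairs with equal x carry no ordering information
        pairs += 1
        if (x2 - x1) * (y2 - y1) < 0:
            violations += 1
    return violations / pairs if pairs else 0.0


# Hypothetical observations: one pair, (2, 3) and (3, 2), breaks the ordering.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 4.0]
rate = monotonicity_violation_rate(xs, ys)
print(f"violation rate: {rate:.3f}")
```

A rate close to zero supports keeping the constraint as a candidate, while a high rate suggests that the assumed causal relationship should be revisited before it is built into the model.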

This type of workflow will move the typical 
approach for building ML models today closer to 
traditional statistical modelling. 


4 RELEVANT AREA OF 
APPLICATION - SRA 


In this section we give a concrete example of an 
application area where ML is linked with engineering 
risk analysis: Structural Reliability Analysis 
(SRA), where the recommendations given in 
Section 3 are highly relevant. In addition, we highlight 
two general pitfalls related to a common, but possibly 
misconceived, idea on how ML may be used 
in this context. 


4.1 Structural reliability analysis 


Structural reliability analysis, or SRA for short, is 
the fundamental building block of modern risk- 
based engineering methodologies. For a thorough 
introduction reference is made to Madsen, Krenk, & 
Lind (2006). The underlying theory combines 
structural analysis with statistics and probabilistic 
modelling to assess uncertainties of information 
that contributes to the probability of structural 
failure. 

SRA may generally be described as the problem 
of establishing the probability 


P(G(x)<0) (1) 


where x is a vector of stochastic variables, e.g. 
structural dimensions, material properties, loads 
and model uncertainties. The function G(x) is 
referred to as the limit state, and is defined such 
that G(x) <0 if and only if the scenario represented 
by x results in structural failure, see Figure 4. In 
the literature the limit state is often presented as 


G(x) = R(x)-L(x) (2) 


where R(x) and L(x) represent the structural resist- 
ance, or capacity, and load effect respectively, 
although these are often not separable in practice. 
The main tasks in a structural reliability analysis 
are to establish a suitable limit state function and 
distributions of all the input parameters, so that 
one may estimate the failure probability given by 
Eq. 1 and analyse the sensitivity of parameters and 
decisions that will affect the system. Informally, 
one might say that we use data and domain knowledge 
to establish the input distributions describing 
the current state of the system, and extrapolate 
to states where the system has failed using very 
limited data related to system failure in combination 
with advanced phenomenological simulations 
based on first principle physics. 

Figure 4. Illustration of SRA in two dimensions. 
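
Eq. 1 and Eq. 2 can be illustrated with a minimal sketch in which the resistance R and load effect L are independent Gaussian variables, so that the failure probability has a closed form to compare against. The distributions, their parameters and the function names are assumptions for illustration only. Real limit states are rarely this simple, and as noted above R and L are often not separable.

```python
import math
import random


def limit_state(resistance, load):
    """G = R - L as in Eq. 2; failure corresponds to G < 0."""
    return resistance - load


def failure_probability_mc(n, rng, mu_r=10.0, sd_r=1.0, mu_l=6.0, sd_l=1.5):
    """Crude Monte Carlo estimate of P(G < 0) from Eq. 1."""
    failures = 0
    for _ in range(n):
        resistance = rng.gauss(mu_r, sd_r)
        load = rng.gauss(mu_l, sd_l)
        if limit_state(resistance, load) < 0.0:
            failures += 1
    return failures / n


def failure_probability_exact(mu_r=10.0, sd_r=1.0, mu_l=6.0, sd_l=1.5):
    """Closed form for independent Gaussian R and L: Phi(-beta), where
    beta = (mu_r - mu_l) / sqrt(sd_r^2 + sd_l^2)."""
    beta = (mu_r - mu_l) / math.sqrt(sd_r ** 2 + sd_l ** 2)
    return 0.5 * (1.0 + math.erf(-beta / math.sqrt(2.0)))


rng = random.Random(42)
pf_mc = failure_probability_mc(100_000, rng)
pf_exact = failure_probability_exact()
print(f"Monte Carlo estimate: {pf_mc:.4f}")
print(f"closed form:          {pf_exact:.4f}")
```

The example uses a fairly large failure probability so that crude sampling converges quickly. For the small probabilities typical of structural failure, the number of limit state evaluations needed grows roughly in inverse proportion to the probability, which is one reason the tail behavior of any approximating model matters so much.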


4.2 Machine learning in SRA 


Some of the key challenges in SRA today relate to 
the rapid increase in structural complexity of engi- 
neering systems, including more automation and 
software intensive control systems, together with a 
demand for higher system utilization and the need 
for more accurate and reliable models to support 
decision making under uncertainty. At the same time, 
ubiquitous sensor data provides new information 
that could potentially reduce the uncertainty if the 
information could be incorporated into the models. 

In practice, this means that the function G(x) in 
Eq. 1 and the distribution over the input space x 
will take a more complicated form. To cope with 
this, the technologies we now label ML can be use- 
ful, e.g. to address the following problems: 


• Find x: Establish the distribution over x using all 
relevant data. 

• Find G: Through data related to structural 
behavior, combined with data from past experience, 
experiments and simulations of structural 
failure, establish the limit state function that 
classifies all x’s as Safe or Failure. 


Note that in practice these two problems are 
generally intertwined, in the sense that inference about 
a model parameter (the x’s) may only be possible 
through its observed effect on the system response. E.g. some 
y(x) is observed, where the mapping y(·) is in our 
representation baked into the general function G. 

The use of ML in SRA has traditionally been 
confined to smaller subcomponents where human 
quality control is possible, but for more complex 
systems this quickly becomes infeasible. The more 
general task of approximating functions like the 
limit state G(x) using ML together with a limited 
number of realisations (experiments or numerical 
simulations) has received increasing attention over 
the last decades, within the field of Uncertainty 
Quantification (UQ). See for instance (Sullivan 
2015). UQ aims to quantify the ML uncertainty 
introduced through approximation, as well as 
how uncertainty in input parameters propagates 
through such models. 


4.3 General ML pitfalls in SRA 


As for all ML applications, there are some general 
pitfalls to look out for. This section highlights 
some of the challenges related to how the introduction 
of ML in reliability analysis is often depicted. 
This relates to a growing appetite for ideas like the 
following: 

• Due to the increasing instrumentation of systems, 
more data is available about the loads and 
structural behavior of systems at any time. By 
combining this with historical data from many 
other similar systems where the structural integrity 
is known (we know whether or not they have 
failed, and how), we could detect any abnormal 
behavior. With this information we could create 
warnings before potential failure occurs, and 
possibly also help the system back into normal 
operation. 

Anomaly detection will probably play a larger 
role in risk assessment in the future, but there are 
some pitfalls that the industry needs to be aware of. 
The first relates to the quantification of the safety 
margin, which is often represented as a probability 
of failure or through some other metric relating 
the current physical condition with conditions cor- 
responding to structural failure. 

• For complex engineering systems, quantifying 
the margin of safety based on operational data 
alone is unlikely to be feasible. 

This statement might be obvious, given the many 
different ways a system may fail in practice and the 
assumption that these systems are designed not 
to fail. The next argument, however, is a bit more 
subtle: 


• From a data exploration perspective, when 
observing system states outside normal operation 
one might unknowingly have transitioned away 
from the default system behavior, leaving all previous 
observations biased, and possibly irrelevant. 

This statement impacts the basic assumption 
in ML that future data will come from the same 
distribution as the data the model was trained on. 
This is illustrated by an example in Figure 5. 

Following the SRA setting illustrated in Figure 4, 
we assume that the limit state is defined 
in terms of material over-utilization. Often the 
criterion for when ultimate failure occurs is difficult 
to express mathematically, and conservative 
approaches are applied by defining failure as some 
identifiable prior event. One such limit from material 
science is the yield criterion of ductile materials 
such as steel. The stress-strain relationship of a 
material under some loading is linear up until the 
yield point, and the material behaves elastically in 
this region, i.e. unloading the material will bring it 
back to its original unharmed state³. For continued 
loading beyond the yield point, the material will 
exhibit plastic behavior until rupture. 


³Ignoring other failure modes such as fatigue. 

Figure 5. Illustration of structural response in different 
operating environments. 


A limit state defined from the yield criterion 
can be illustrated as the dashed line in Figure 5, 
whereas the boundary of the structural failure set 
(black line) represents material fracture⁴. Imagine 
an anomaly detection agent that warns whenever 
the system behaves outside the normal operating 
envelope, and estimates procedures for moving the 
system back into normal operation. For elastic 
material response (leftmost point), the data used 
to train the model is still valid. But when materi- 
als are pushed closer to their limits, some proper- 
ties are fundamentally changed. The elastic limit 
will change due to work hardening, and the stress- 
strain curve is no longer valid. Furthermore, when 
the material is over-utilized beyond a certain 
threshold, the reduction of load may initiate failure 
modes that were previously irrelevant, and uninformed 
decisions may be catastrophic. 
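
The breakdown of the same-distribution assumption can be sketched with a toy material model. The code assumes an elastic, perfectly plastic stress-strain law with hypothetical steel-like values, which is a deliberate simplification of the behavior described above (work hardening is ignored). A linear model is fitted only to data from the elastic range, as a purely data driven agent trained on normal operation would be, and is then queried beyond the yield point.

```python
def true_stress(strain, modulus=200e3, yield_stress=250.0):
    """Toy elastic, perfectly plastic law (stress in MPa, hypothetical values)."""
    return min(modulus * strain, yield_stress)


# Training data from normal operation: strains well inside the elastic range.
train_strain = [i * 1e-4 for i in range(1, 11)]
train_stress = [true_stress(e) for e in train_strain]

# Least-squares line through the origin fitted to the elastic data.
slope = (sum(e * s for e, s in zip(train_strain, train_stress))
         / sum(e * e for e in train_strain))


def predict(strain):
    """Data driven prediction, valid only where the training data is valid."""
    return slope * strain


def inside_training_envelope(strain):
    """Crude check of whether a query lies within the observed strain range."""
    return min(train_strain) <= strain <= max(train_strain)


for strain in (0.0005, 0.003):
    print(f"strain={strain}: predicted={predict(strain):.0f} MPa, "
          f"true={true_stress(strain):.0f} MPa, "
          f"inside envelope={inside_training_envelope(strain)}")
```

Inside the elastic range the fitted model is essentially exact. Beyond yield it overpredicts the stress by more than a factor of two, and nothing in the training data signals that the underlying behavior has changed. An envelope check can flag the extrapolation but cannot say what the correct behavior is, which requires phenomenological knowledge.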

This example is an oversimplification, but 
illustrates some challenges with introducing purely 
data driven agents. Due to the increased computational 
capacities and scientific models available 
today there is an increased push to utilise systems 
closer to their limits. In the above example this 
means allowing operation closer to the true failure 
limit, and compensating by increased control, 
uncertainty reduction, and more detailed 
understanding of the failure modes. 

⁴For load controlled scenarios the top of the stress-strain 
curve represents maximum capacity. Unless the load is 
decreased the material will eventually fracture. 


5 CONCLUDING REMARKS 


The field of ML is largely based on statistical meth- 
ods, but with a focus that is shifted more towards 
predictive accuracy and with limited attention 
towards model interpretation and testing hypoth- 
eses on causes and effects. 

For tail events in high risk environments the 
modeler is faced with additional challenges, as 
the tolerance for error is reduced and accuracy 
is needed in distribution tails rather than where 
the main bulk of data is. Because of this, opaque 
black-box type models are difficult to work with as 
the means for falsification may be limited to obser- 
vations of future performance. 

Therefore, research and development of ML 
models for such applications should be guided 
towards enabling incorporation of causality con- 
straints reflecting the modeler’s phenomenological 
knowledge.


ABSTRACT: In the context of Operation and Maintenance of wind energy infrastructure, it is important to develop decision support tools able to guide engineers in the management of these assets. This task is particularly challenging given the multiplicity of uncertainties involved, from the point of view of the aggregated data, the available knowledge with respect to the wind turbine structures, as well as the varying operational and environmental loads. We propose to propagate wind turbine telemetry through a decision tree learning algorithm to detect faults, damage, and abnormal operations. The use of decision trees is motivated by the fact that they tend to be easier to implement and interpret than other quantitative data-driven methods. Furthermore, the telemetry consists of data from condition and structural health monitoring systems, which lends itself naturally to Big Data techniques, as large amounts of data are continuously sampled at a high rate from thousands of wind turbines. In this paper, we review several decision tree algorithms, then train an ensemble bagged decision tree classifier on a dataset from an offshore wind farm comprising 48 wind turbines, and use it to automatically extract paths linking excessive vibration faults to their possible root causes. Finally, we give an outlook on a cloud computing based architecture to implement decision tree learning involving Apache Hadoop and Spark.


1 INTRODUCTION 


Wind Turbine (WT) maintenance and inspection relies on conventional techniques (Yang & Sun 2013), such as visual inspection, non-destructive evaluation and standard signal processing, trend analysis and statistics of data streamed from the Supervisory Control And Data Acquisition (SCADA) system. Specialized Condition Monitoring (CM) systems are only available on specific components such as the gearbox and main bearing (Hameed et al. 2009), while, far from forming part of actual engineering practice (Grasse et al. 2011), Structural Health Monitoring (SHM) systems are deployed mostly for research purposes, or temporarily during the certification stage. In fact, there exists a dislocation between (i) data derived from CM systems (e.g. power output, rotor RPM), (ii) data obtained from specialized SHM (e.g. tower acceleration, strain on blade root), and (iii) specialized maintenance strategies of individual WT components. As a result, there is currently no holistic approach, nor are there systematic, quantitative and automated tools, for monitoring and diagnostics of WTs, for operation and maintenance (O&M) and for decision making within their life-cycle. Towards this end, we propose to perform automated fault diagnostics and root cause analysis of faults on WTs on the basis of decision tree classifiers. A decision tree is a predictive model that maps observations to their target values or labels. The key concept lies in running WT telemetry data through a decision tree learning algorithm for detecting faults, errors, damage patterns, anomalies and abnormal operation (i.e., end states). A decision tree essentially comprises a machine learning tool for classification of event outcomes. For a given initiating event, multiple end states are possible, each linked to an associated probability of occurrence. Once built and trained, and given a new set of measurements, the decision tree may be used to predict end states and classify (discover) previously unknown end states. By examining the paths that lead to failure-predicting leaf nodes, one may distinguish the possible sources (root causes) of error.

The remainder of this article is organized as follows. In section 2 we revisit the decision tree learning theory. In section 3 we train an ensemble of bagged decision tree classifiers with the standard CART algorithm on a condition monitoring data set from the Lillgrund offshore wind farm comprising 48 wind turbines and use it to perform diagnostics to elucidate the root cause of excessive vibrations. Finally, in section 4 we extend the discussion to show how decision tree learning can be expanded to big data based applications for monitoring and diagnostics of wind turbines using the object-oriented decision tree concept (Abdallah 2017).


2 DECISION TREES 


A Decision Tree (also called Classification or Pre- 
diction Tree) is designed to classify or predict a dis- 
crete category from the data. Decision Trees (DTs) 
are a non-parametric supervised learning method 
used for classification (and regression). In the 
machine learning sense, the goal is to create a clas- 
sification model (classification tree) that predicts 
the value of a target variable (also known as label 
or class) by learning simple decision rules inferred 
from the data features (also known as attributes or predictors). In Figure 1, an internal node N denotes a test on an attribute, an edge E represents an outcome of the test, and the leaf nodes C represent class labels or class distributions.

Four reasons motivated us to work with decision tree classifiers. First, they can be learned and updated from data relatively fast compared to other methods. Second, they are visually more intuitive, simpler and easier to assimilate and interpret by humans/engineers. Third, unlike other classification methods, decision tree classifiers allow one to perform data-driven root cause analysis of faults; one can trace a path from the end state (e.g. blade damage) to the initiating event (e.g. wrong parameters in control system), in a way that follows the sequence and chronology of how events are interlinked. Last, it has been shown that the accuracy of decision tree classifiers (especially ensemble decision tree classifiers) is comparable or superior to that of other models, and that they in fact display the best combination of error rate and speed (Lim, Loh, & Shih 1997, Hand 1997, Lim, Loh, & Shih 2000, Caruana & Niculescu-Mizil 2006).

Figure 1. Graphical representation of a decision tree (DT) classifier. DT terminologies are also shown.

A decision tree is a tree-structured classifier
built by starting with a single node that encom- 
passes the entire data and recursively splitting the 
data within a node, generally into two branches 
(some algorithms can perform multiway splits) by 
selecting the variable (dimension) that best clas- 
sifies the samples according to a split criterion, 
i.e. the one that maximizes the information gain 
(Equation 1) among the random subsample of 
dimensions obtained at every point. The splitting 
continues until a terminal leaf is created that meets 
a stopping criterion such as a minimum leaf size or 
a variance threshold. Each terminal leaf contains 
data that belongs to one or more classes. Within 
this leaf, a model is applied that provides a fairly 
comprehensible prediction, especially in situations 
where many variables may exist that interact in a 
nonlinear manner as is often the case on wind tur- 
bines (Carrasco Kind & Brunner 2013). Algorithm 
1 shows pseudocode of a generic decision tree 
learning algorithm. 

Formally, the splitting is done by choosing the attribute that maximizes the Information Gain ($I_G$), which is defined in terms of the impurity degree index:

$$I_G(T, M) = I_d(T) - \sum_{m \in M} \frac{|T_m|}{|T|} I_d(T_m) \qquad (1)$$

where $T$ is the training data in a given node, $M$ is one of the possible dimensions along which the node may be split, $m$ are the possible values of $M$, $|T|$ is the size of the training data, $|T_m|$ is the number of objects for a given subset $m$ within the current node, and $I_d$ is the function that represents the degree of impurity of the information. There are three standard methods to compute the impurity index ($I_d$). The first method is by using the information entropy ($H$), which is defined by:

$$I_d(T) = H(T) = -\sum_{i=1}^{n} f_i \log_2(f_i) \qquad (2)$$

where $i$ is the class to be predicted, $n$ is the number of possible classes, and $f_i$ is the fraction of the training data belonging to class $i$. The second option is to measure the Gini impurity ($G$). In this case, a leaf is considered pure if all the data contained within it have the same class. The Gini impurity can be computed inside each node using the following simplified equation:

$$I_d(T) = G(T) = 1 - \sum_{i=1}^{n} f_i^2 \qquad (3)$$

The third method is to use the classification error ($C_E$):

$$I_d(T) = C_E(T) = 1 - \max_i f_i \qquad (4)$$

where the maximum is taken over the fractions $f_i$ of the data $T$ belonging to each class $i$. Figure 2 shows the three impurity indices, for a node with data that are categorized into two classes, as a function of the fraction of the data having a specific class. If all of the data belong to one class, the impurity is zero. On the other hand, if half of the data have one class and the remaining data belong to the other class, the impurity is at its maximum (Carrasco Kind & Brunner 2013).
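The impurity measures and the information gain defined above can be illustrated with a minimal Python sketch. It is not the implementation used in this study; the function names and the toy labels (`labels`, `gusty`) are hypothetical and chosen only for illustration, and the attribute is assumed to be categorical.

```python
import math
from collections import Counter


def class_fractions(labels):
    """Return the fraction f_i of samples belonging to each class."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def entropy(labels):
    """Information entropy H(T), Equation 2."""
    return -sum(f * math.log2(f) for f in class_fractions(labels))


def gini(labels):
    """Gini impurity G(T), Equation 3."""
    return 1.0 - sum(f * f for f in class_fractions(labels))


def classification_error(labels):
    """Classification error C_E(T), Equation 4."""
    return 1.0 - max(class_fractions(labels))


def information_gain(labels, attribute_values, impurity=entropy):
    """Information gain I_G(T, M), Equation 1, for a categorical attribute M."""
    n = len(labels)
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    # Weighted impurity of the child nodes T_m.
    weighted = sum(len(s) / n * impurity(s) for s in subsets.values())
    return impurity(labels) - weighted


# Hypothetical two-class node: half of the records are faults.
labels = ["NoFault", "NoFault", "Vibr", "Vibr"]
gusty = ["no", "no", "yes", "yes"]  # hypothetical attribute that separates the classes

print(entropy(labels), gini(labels), classification_error(labels))
print(information_gain(labels, gusty))
```

For this balanced two-class node all three indices are at their maximum, and an attribute that separates the classes perfectly recovers the full parent impurity as gain.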

There exist several algorithms for training decision trees from data, including ID3, C4.5, C5.0, J48, SPRINT, FACT, FIRM, SLIQ, CHAID, QUEST, CRUISE, PUBLIC, BOAT, RAINFOREST, MARS, RIPPER and CART. In the following we briefly mention some of the more common algorithms. Ross Quinlan developed ID3 (Quinlan 1986), which stands for Iterative Dichotomiser 3, and its later iterations include C4.5 and C5.0. ID3 attempts to generate the smallest multiway tree. If the problem involves real-valued variables, they are first binned into intervals, each interval being treated as an unordered nominal attribute. C4.5 converts the trained trees into sets of if-then rules (Quinlan 1994) and improves on ID3 by allowing both discrete and continuous attributes, missing attribute values, attributes with differing costs, and tree pruning, which is usually applied to avoid over-fitting and to improve the ability of the tree to generalise to unseen data. The accuracy of each rule is then evaluated to determine the order in which they should be applied. C5.0 is essentially the same as C4.5 but uses less memory and builds smaller rule sets while being more accurate.

Figure 2. Impurity index $I_d$ for a two-class example as a function of the probability of one of the classes $f_1$ using the information entropy, Gini impurity and classification error. In all cases, the impurity is at its maximum when the fraction of data within a node with class 1 is 0.5, and zero when all data are in the same category.

CART (Breiman, Friedman, Olshen, & Stone 1984), which stands for Classification and Regression Trees, is very similar to C4.5, but it differs in that it supports numerical target variables (regression) and constructs binary trees based on a numerical splitting criterion recursively applied to the data instead of constructing rule sets.

According to Lim, Loh, & Shih (2000), C4.5 and QUEST have the best combinations of error rate and speed, but C4.5 tends to produce trees with twice as many leaves as those from QUEST. QUEST is a binary-split decision tree algorithm for classification and regression (Loh & Shih 1997). It uses a significance test to select variables for splitting. An advantage of the QUEST tree algorithm is that it is not biased in split-variable selection, unlike CART, which is biased towards selecting split-variables that allow more splits and those that have more missing values.

CRUISE is one of the most accurate decision tree classifiers (Loh 2011) and can also efficiently perform multiway splits. Kim & Loh (2001) proposed CRUISE, which stands for Classification Rule with Unbiased Interaction Selection and Estimation. It splits each node into multiple subnodes, which precludes the use of greedy search methods. CRUISE is practically free of selection bias (Kim & Loh 2001) and is capable of integrating interactions between variables when growing the tree. CRUISE borrows and improves upon ideas from many sources, but especially from FACT, QUEST, and GUIDE for split selection and CART for pruning.

Finally, a word about RainForest, which is not a decision tree classifier per se but rather a framework. RainForest was proposed to make decision tree construction more scalable. The same holds for BOAT, which is in fact faster than RainForest by a factor of three while constructing exactly the same decision tree, and can handle a wide range of splitting criteria (Gehrke et al. 1999). According to Gehrke et al. (2000), a thorough examination of the algorithms in the literature (including C4.5, CART, CHAID, FACT, ID3 and extensions, SLIQ, SPRINT and QUEST) shows that the greedy schema described in Algorithm 1 can be refined to a generic RainForest tree induction schema. In fact, most decision tree algorithms consider every attribute individually and need the distribution of class labels for each distinct value of an attribute to decide on the splitting criteria. RainForest is a comprehensive approach for decision tree classifiers that separates the scalability aspects of algorithms for constructing a decision tree from the central features that determine the quality of the tree. RainForest concentrates on the growth phase of a decision tree due to the time consuming nature of this step. RainForest closes the gap between the limitations to main memory datasets of algorithms in the machine learning and statistics literature and the scalability requirements of a data mining environment (Singh & Sulekh 2017).

Next we present a demonstration of deci- 
sion tree classifier learning based on a real-world 
SCADA data set from the Lillgrund offshore wind 
farm. 


Algorithm 1: Pseudocode of a generic recursive decision tree learning algorithm.

DTClassifier(TR, Target, Attr)
    Input:  TR: training examples; Target: target label; Attr: set of descriptive attributes
    Output: DT: decision tree classifier

    create a Root node for the tree
    if all examples in TR have the same label t_i then
        return a single-node tree, corresponding to a leaf node with that label
    else if the set of attributes Attr is empty then
        return the single-node tree, i.e. Root, labeled with the most common value of Target in TR
    else
        pick an attribute A from Attr (such that A maximizes I_G) and create a node R for it
        for each possible value v_i of A do
            let TR_vi be the subset of TR that have value v_i for A
            add an out-going edge E to node R labeled with the value v_i
            if TR_vi is empty then
                attach a leaf node to edge E labeled with the most common value of Target in TR
            else
                call DTClassifier(TR_vi, Target, Attr - {A}) and attach the resulting tree as the subtree under edge E
            end
        end
        return the subtree rooted at R
    end
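Algorithm 1 can be turned into a short runnable Python sketch for categorical attributes. This is an illustration under stated assumptions, not the CART implementation used in the demonstration: the entropy-based information gain is used as the split criterion, and the names `dt_classifier`, `predict`, `rows` and `domains`, as well as the toy records, are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Information entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, target, attr):
    """Impurity reduction obtained by splitting rows on a categorical attribute."""
    n = len(rows)
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def dt_classifier(rows, target, attrs, domains):
    """Return a leaf label, or a node {"attr": A, "children": {value: subtree}}."""
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:       # all examples share one label
        return labels[0]
    if not attrs:                   # no attributes left to split on
        return majority
    best = max(attrs, key=lambda a: information_gain(rows, target, a))
    node = {"attr": best, "children": {}}
    remaining = [a for a in attrs if a != best]
    for value in domains[best]:     # one out-going edge per possible value
        subset = [row for row in rows if row[best] == value]
        if not subset:              # empty subset: leaf with the parent's majority label
            node["children"][value] = majority
        else:
            node["children"][value] = dt_classifier(subset, target, remaining, domains)
    return node


def predict(tree, row):
    """Follow the edges matching the row until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attr"]]]
    return tree


# Hypothetical toy records, for illustration only.
rows = [
    {"gust": "yes", "derated": "no", "state": "Vibr"},
    {"gust": "yes", "derated": "yes", "state": "Vibr"},
    {"gust": "no", "derated": "no", "state": "NoFault"},
    {"gust": "no", "derated": "yes", "state": "NoFault"},
]
domains = {"gust": ["yes", "no"], "derated": ["yes", "no"]}
tree = dt_classifier(rows, "state", ["gust", "derated"], domains)
print(tree)
```

In this toy case the attribute `gust` separates the two classes, so it is selected at the root and both children become leaves.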


3 DEMONSTRATION 


Condition monitoring data from 48 wind turbines were collected over a period of 12 months and sampled every 10 minutes, across 64 channels. In total, more than 2.5 million records were available, among which 980 excessive vibration error events were recorded across all wind turbines. The error event of interest in this demonstration is excessive structural vibrations.
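As a rough consistency check on the stated record count, assuming a 365-day year and no missing samples (both are assumptions, not stated in the text), the number of 10-minute records is

$$48 \ \text{turbines} \times 365 \ \text{days} \times 24 \ \text{h/day} \times 6 \ \text{records/h} = 2\,522\,880 \ \text{records},$$

which agrees with the figure of more than 2.5 million records.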

Data Pre-processing. The first step prepares the data for the construction of the prediction tree classifier. Knowledge of the process is helpful in the elimination of parameters (features) that are not significant. The parameters recorded by the SCADA system can be grouped into the following categories: (i) system related data, e.g., turbine number and index or time stamp, which are turbine specific and can therefore be excluded from the decision tree classifier training, (ii) operating performance parameters such as power output, pitch and rotor speed, (iii) environmental parameters such as wind speed and wind direction, (iv) temperature measurements for various components such as gearbox, bearings and generator, and finally (v) dynamics parameters such as tower top accelerations. The attributes chosen in this demonstration are shown in Table 1. The table includes both original sensor attributes such as the maximum generator rotational speed over a 10 min period (GRpm_max) and additional derived parameters such as the difference between max and min wind speed over a 10 min period (DMaxMinV).

A brief aside is in order here. As with most pattern recognition methods, tree-based classification methods work best if the proper features are selected to start with. Preprocessing by a data dimensionality reduction technique such as principal component analysis (PCA) or independent component analysis (ICA), or by optimal feature selection approaches such as the wrapper approach integrated with the genetic or the best-first search algorithms, can be effective because these methods find important axes to be used as a guideline for the selection of the features upon which a decision tree is trained. However, in this demonstration we chose not to do so, in order to test the limits of a decision tree classifier for fault diagnostics. For instance, wind speed, power and RPM are strongly inter-dependent attributes, but for the same wind speed a wind turbine could be found operating at two different power output levels (during distinct time periods) or two different generator RPM levels, and both cases are considered normal operating modes. This happens very often when a wind turbine has to de-rate its power output following a demand from the grid side, or reduce the generator RPM under a specific noise or load control mode.

The target variable TurbineState for classification is shown in Table 2. It can take two labels: NoFault, indicating that the wind turbine is producing electric power under normal operating conditions, and Vibr, indicating an excessive vibration fault that results in the wind turbine shutting down and triggers a message to be sent to the vibration support technical team (in order for some corrective action to take place).

Table 1. The attributes (features) used as input to train the decision tree classifier.

Attribute       Description
HSGenTmp        Mean temperature of gear bearing on generator side
Po_max          Max value of active power
DMaxMeanPow     Difference between max and mean active power
GRpm_max        Max generator RPM
DMaxMeanGRpm    Difference between max and mean generator RPM
Pi_min          Min collective pitch
DMaxMinV        Difference between max and min wind speed
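The derived attributes of Table 1 are simple differences of 10-minute statistics. A minimal sketch of how they could be assembled from one SCADA record is given below. The input field names (`power_max`, `wind_min`, etc.) and the numerical values are hypothetical placeholders, since the raw channel names of the data set are not given here.

```python
def derive_attributes(record):
    """Build the Table 1 attributes from one 10-minute SCADA record.

    The keys of `record` are hypothetical channel names, not those of the
    actual data set.
    """
    return {
        "HSGenTmp": record["hs_gen_tmp_mean"],
        "Po_max": record["power_max"],
        "DMaxMeanPow": record["power_max"] - record["power_mean"],
        "GRpm_max": record["gen_rpm_max"],
        "DMaxMeanGRpm": record["gen_rpm_max"] - record["gen_rpm_mean"],
        "Pi_min": record["pitch_min"],
        "DMaxMinV": record["wind_max"] - record["wind_min"],
    }


# Illustrative values only, not taken from the wind farm data.
record = {
    "hs_gen_tmp_mean": 62.0,
    "power_max": 2300.0,
    "power_mean": 2000.0,
    "gen_rpm_max": 1520.0,
    "gen_rpm_mean": 1400.0,
    "pitch_min": 1.5,
    "wind_max": 19.0,
    "wind_min": 8.0,
}
features = derive_attributes(record)
print(features)
```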

Bagged decision tree construction. The second step is the construction of the bagged decision tree classifier. An ensemble of decision trees is often more accurate than any single tree classifier (Bauer & Kohavi 1999, Dietterich 1996). Bagging (Breiman 1996), boosting (Schapire 1990) and random forest are three popular methods of creating accurate ensembles. Previous research indicates that boosting is more prone to overfitting the training data (Freund & Schapire 1996, Opitz & Maclin 1999). Consequently, the presence of noise causes a greater decrease in the performance of boosting. Therefore, this study uses bagging to create an ensemble of decision tree classifiers (using the standard CART algorithm) to better address the noise in the condition monitoring data. Other decision tree algorithms and ensemble approaches will be investigated in future work. This technique can be summarized as: (i) take a bootstrap sample from the data set, (ii) train an un-pruned classification tree, and (iii) aggregate the trained tree classifiers. In more detail, bagging predictors comprise a method for generating multiple versions of a predictor, each trained on a random subset of the original dataset, and fusing these into a unique final aggregated predictor. This aggregated predictor can typically be used for reducing the variance of a black-box estimator, by introducing randomization into the construction procedure and forming an ensemble (for proof refer to (Breiman 1996, Bühlmann 2012)). The bagging algorithm consists in (1) constructing a bootstrap sample $(X_1^*, Y_1^*), \ldots, (X_n^*, Y_n^*)$ by randomly drawing $n$ times with replacement from the original data $(X_1, Y_1), \ldots, (X_n, Y_n)$, (2) computing the bootstrapped estimator (i.e. tree classifier) $\hat{g}^* = h_n\big((X_1^*, Y_1^*), \ldots, (X_n^*, Y_n^*)\big)$, where the function $h_n(\cdot)$ defines the estimator as a function of the data, and (3) repeating steps 1 and 2 $M$ times, where $M$ is often chosen as 50 or 100, yielding $\{\hat{g}^{*k}, k = 1, \ldots, M\}$; the bagged estimator is $\hat{g}_{Bag} = \sum_{k=1}^{M} \hat{g}^{*k} / M$. In theory, as $M \to \infty$, the bagged estimator is

$$\hat{g}_{Bag} = E^*[\hat{g}^*] \qquad (5)$$

Table 2. Target variable: Turbine State. Turbine state is defined in this demonstration according to two labels.

Class label   Description
NoFault       Normal operation, turbine is producing power, no faults
Vibr          Structural vibrations error resulting in: “Inform Vibration Support”
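The three bagging steps described above can be sketched in plain Python. To keep the example short, a one-split tree (decision stump) stands in for the un-pruned CART tree, and the ensemble is aggregated by majority vote, the classification analogue of the averaged estimator. These substitutions, the function names and the toy data are assumptions made for illustration and are not the configuration used in this study.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Fit a one-split tree: choose the (feature, threshold) with fewest training errors."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # candidate threshold midway between neighbours
            left = [lab for row, lab in zip(X, y) if row[j] < t]
            right = [lab for row, lab in zip(X, y) if row[j] >= t]
            l_label = Counter(left).most_common(1)[0][0]
            r_label = Counter(right).most_common(1)[0][0]
            errors = sum(lab != l_label for lab in left) + sum(lab != r_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, l_label, r_label)
    if best is None:  # no split possible: constant majority predictor
        label = Counter(y).most_common(1)[0][0]
        return lambda row: label
    _, j, t, l_label, r_label = best
    return lambda row: l_label if row[j] < t else r_label


def fit_bagged(X, y, n_estimators=50, seed=0):
    """Steps (1)-(3): bootstrap, train one base classifier per sample, keep all M."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_bagged(models, row):
    """Aggregate the M base classifiers by majority vote."""
    return Counter(model(row) for model in models).most_common(1)[0][0]


def misclassification_rate(models, X, y):
    """Fraction of records whose bagged prediction differs from the label."""
    wrong = sum(predict_bagged(models, row) != lab for row, lab in zip(X, y))
    return wrong / len(y)


# Hypothetical one-feature data (e.g. a wind speed spread), for illustration only.
X_train = [[2.0], [3.0], [4.0], [5.0], [11.0], [12.0], [13.0], [14.0]]
y_train = ["NoFault"] * 4 + ["Vibr"] * 4
X_val = [[1.0], [4.5], [12.5], [15.0]]
y_val = ["NoFault", "NoFault", "Vibr", "Vibr"]

models = fit_bagged(X_train, y_train, n_estimators=50, seed=0)
print(misclassification_rate(models, X_val, y_val))
```

The same `misclassification_rate` function, evaluated for increasing `n_estimators`, gives the kind of curve that relates predictive quality to bag size.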


In the machine learning and statistics literature, 
the two main performance measures for a clas- 
sification tree algorithm are its predictive quality 
and construction time of the resulting tree. In this 
paper the predictive quality is given by the misclas- 
sification rate on the validation data set. As shown 
in Figure 4, the misclassification rate of the trained 
Bagged tree is less than 1% when the bag size is 
more than 10. 

Fault diagnostics (offline) The final step uses 
the newly generated decision tree classifier (e.g. 
Figure 3) to create diagnostics for individual 
target fault classes, i.e. Vibr. One aspect of fault 
diagnostics deals with offline root cause analysis, 
which we will demonstrate here. When diagnos- 
ing faults, we are interested in identifying the root 
causes (or sequence of events) that lead to a large 
portion of the overall abnormal behavior, where 
the decision tree edges leading to faults become 
root cause candidates. In the literature, one can 
find a limited number of proposed programmatic
algorithms by which decision tree classifiers can 
be scanned/probed for root cause analysis (Solé, 
Muntés-Mulero, Rana, & Estrada 2017). One 
implementation is as follows (Zheng, Lloyd, & 
Brewer 2004): 


1. Ignore the leaf nodes that correspond to normal operations (i.e. NoFault) as they will not be useful in diagnosing faults.

2. Identify, in the decision tree, all leaf nodes with the target fault class of interest, i.e. Vibr.

3. Ranking: list the leaf nodes with the target fault class ranked by failure count to prioritize their importance.

4. Noise Filtering: in diagnostics, we are interested in identifying the root causes that result in a large proportion of the overall faults. Thus we retain the leaf nodes accounting for more than a certain threshold of the total number of faults (e.g. threshold = 5%).

5. Extract traces (paths) containing the target fault leaf node (C), and all edges (E) and internal nodes (N) up to the root node (R).

6. Extract rules at each internal node of the traces.

7. Node Merging: we merge nodes on a path by eliminating ancestor nodes that are logically subsumed by successor nodes.

Figure 3. Part of the decision tree classifier based on the SCADA data from 48 wind turbines.


Below is an example of an automated extraction 
of a sequential trace of events (root causes) from 
the trained decision tree classifier (see Figure 3), 
leading from the classified fault to the root of the 
tree: 


Vibr ← 1503.6 <= GRpm_max < 1539.05 ← 110.455 <= DMaxMeanGRpm < 380.51 ← DMaxMinV >= 10.65


This sequential trace of events indicates that the fault Vibr can possibly occur when, over a period of 10 minutes, the maximum generator speed is in the range 1503.6-1549.05 RPM, the difference between max and mean generator speed is in the range of 110.45-380.51 RPM and the difference between the max and min wind speed exceeds 10.65 m/s. Note that it is not possible to infer from the data the time interval during which the large change in wind speed occurred, but it could well be an indication of a gust or excessive turbulence over a short period of time resulting in the excessive vibration fault. Another example of an automated extraction of a sequential trace of events (root causes) leading from the classified fault to the root of the tree is as follows:


Vibr ← Po_max < 2377 ← DMaxMinV >= 8.05 ← DMaxMeanGRpm >= 380.51


This sequential trace of events indicates that the fault Vibr can possibly occur when, over a period of 10 minutes, the maximum electric power output is below 2377 kW, the difference between max and mean generator speed exceeds 380.51 RPM and the difference between the max and min wind speed exceeds 8.05 m/s.
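The extraction steps listed above (identify fault leaves, rank, filter noise, extract traces and merge nodes) can be sketched as follows for a binary tree with threshold tests. The nested-dictionary tree representation, the helper names and the leaf counts are hypothetical. Only the thresholds are borrowed from the first trace shown above, and node merging is implemented here as keeping the tightest lower and upper bound per attribute, which is one possible reading of step 7.

```python
def leaf(label, count):
    """Leaf node: class label and number of training records it holds."""
    return {"label": label, "count": count}


def split(feature, threshold, left, right):
    """Internal node: left branch if feature < threshold, right branch otherwise."""
    return {"feature": feature, "threshold": threshold, "left": left, "right": right}


def fault_paths(node, target, path=()):
    """Steps 1, 2, 5, 6: yield (count, rules) for every leaf of the target fault class."""
    if "label" in node:
        if node["label"] == target:
            yield node["count"], list(path)
        return
    f, t = node["feature"], node["threshold"]
    yield from fault_paths(node["left"], target, path + ((f, "<", t),))
    yield from fault_paths(node["right"], target, path + ((f, ">=", t),))


def merge_conditions(path):
    """Step 7: drop ancestor tests subsumed by tighter tests on the same attribute."""
    bounds = {}
    for f, op, t in path:
        lo, hi = bounds.get(f, (None, None))
        if op == ">=":
            lo = t if lo is None else max(lo, t)
        else:
            hi = t if hi is None else min(hi, t)
        bounds[f] = (lo, hi)
    return bounds


def root_cause_traces(tree, target, min_share=0.05):
    """Steps 3 and 4: rank fault leaves by count and drop those below the threshold."""
    leaves = sorted(fault_paths(tree, target), key=lambda item: item[0], reverse=True)
    total = sum(count for count, _ in leaves)
    return [(count, merge_conditions(path)) for count, path in leaves
            if count > min_share * total]


def format_trace(target, bounds):
    """Write a trace from the fault leaf back to the root."""
    parts = []
    for f, (lo, hi) in reversed(list(bounds.items())):
        if lo is not None and hi is not None:
            parts.append(f"{lo} <= {f} < {hi}")
        elif lo is not None:
            parts.append(f"{f} >= {lo}")
        else:
            parts.append(f"{f} < {hi}")
    return " <- ".join([target] + parts)


# Hypothetical tree fragment; the leaf counts are invented for illustration.
tree = split("DMaxMinV", 10.65,
             leaf("NoFault", 900),
             split("DMaxMeanGRpm", 110.455,
                   leaf("NoFault", 40),
                   split("DMaxMeanGRpm", 380.51,
                         split("GRpm_max", 1503.6,
                               leaf("Vibr", 2),
                               split("GRpm_max", 1539.05,
                                     leaf("Vibr", 60),
                                     leaf("NoFault", 5))),
                         leaf("NoFault", 3))))

traces = root_cause_traces(tree, "Vibr", min_share=0.05)
for count, bounds in traces:
    print(count, format_trace("Vibr", bounds))
```

With these invented counts, the small fault leaf holding 2 of the 62 fault records falls below the 5% threshold and is filtered out, so a single merged trace remains.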

This approach to data-driven root cause analysis 
elegantly elucidates the traces of events that lead to 
a fault. These traces can subsequently be used by an 
engineer to design simulation scenarios to try and 
replicate the faults and to propose mitigating actions. 


4 OUTLOOK: BIG DATA BASED 
MONITORING AND DIAGNOSTICS 
FRAMEWORK 


So far we have demonstrated how offline root cause analysis can be conducted via ensemble based bagged decision tree learning from SCADA data of 48 wind turbines based on the CART algorithm. In this section we outline a monitoring and diagnostics framework that would operate on big data and in real-time. The intent is for this framework to be deployed on a cloud platform such as Azure and to scale as the volume of streamed data increases. Figure 5 shows an overview of the architecture of the framework. The main features of the framework include:

• Hardware-software package, able to acquire and fuse in real-time both condition monitoring (CM) data (e.g. electric power output, rotor speed) and specialized structural health monitoring (SHM) data (e.g. tower accelerations, strains on blade) from WT components (e.g. blades, gearbox, tower, etc.).

• Computational core: object-oriented decision tree learning algorithm, Bayesian network (BN) based computation of probabilities and root-cause discovery algorithm (as demonstrated in the previous section).

• Distributed cloud based data storage and analytics of the high rate of real-time data streaming from the measuring unit on the WT, which interfaces with the real-time decision-tree software unit. Hence, the toolkit does not need to save data and do heavy computations on the remote WT.

• Online user interface to visualize the output from the decision tree (decision support tool).

Figure 4. Misclassification rate of the validation set as a function of tree bag size.

Figure 5. Graphical summary of the proposed framework: decision tree learning with big data for fault diagnostics.

In the following we elaborate a bit more on: (i) the cloud computing, storage and analytics, and (ii) object-oriented decision tree learning aspects.


i. Efficient fusion of the large bulk of information made available from the condition and structural health monitoring systems constitutes a Big Data problem, as large amounts of data are continuously sampled at high and diverse rates from the WTs within a wind farm. Thus, the integration of distributed cloud based data storage and analytics is a natural fit, with a consequent reduction in the cost of handling and manipulating said data (see Figure 6). In the proposed framework, large scale storage of historical data from WTs is done via the Hadoop Distributed File System (HDFS). HDFS is a distributed file system that is fault tolerant and can load large volumes of data, in a distributed manner, even if errors occur. Telemetry and data acquisition is supported by Apache Kafka, which is a distributed streaming platform that aims to provide a unified, high-throughput, low-latency methodology for handling real-time data feeds. Kafka is a fault-tolerant system, meaning that published data remain available if the application or Kafka itself is stopped, goes offline and is then restarted. Thus, data issued from the wind turbines are not lost in such events. Finally, for data processing purposes, Apache Spark is used. Spark is a fast, general, in-memory engine for large-scale data processing. In addition, Spark can scale to thousands of machines in a fault-tolerant manner. These characteristics make Apache Spark very suitable to process and analyze the large volumes of data that are gathered from the wind turbines (Canizo, Onieva, Conde, Charramendieta, & Trujillo 2017).


Figure 6. Cloud computing and storage. Panels: streaming of real-time data feeds (Kafka); long term distributed data storage; data processing, analytics and model building (Spark).


ii. A priori decision trees are built first to serve the fundamental diagnostic tasks by combining engineering knowledge of the system, failure modes and domain knowledge. The dynamic decision tree classifiers are then trained over time from wind turbine telemetry (CM & SHM data). The limitation of traditional decision tree classifiers appears when the component displays a behaviour with feedback (i.e., after a repair/update in the system) or for evolving systems (e.g. when new sensors are integrated or the system ages), which implies a need to establish several decision trees based on the possible ordering of the events or based on new initiating events. One way around this is an innovation that we introduce in this framework in the form of object-oriented decision tree learning (Wyss & Durán 2001). To this end, a WT is viewed as a multi-layered system of objects (e.g. structure, controller, actuator, etc.) that are defined on the basis of abstract super-classes, attributed with specific properties and methods. Decision tree classifiers are further mapped into Bayesian Networks for further assessment of the conditional probabilities (Bearfield & Marsh 2005, Jassens, Wets, Brijs, Van Vanhoof, Arentz, & Timmermans 2006). The integration of the object-oriented decision tree learning and BN delivers an updated probability of occurrence associated with each event and end state, which would form a solid indicator of the risk of future failure of any given component.
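As a simple illustration of how probabilities of end states could be attached to the leaves of a trained tree, the sketch below estimates P(end state | path) from per-leaf class counts with additive (Laplace) smoothing. This is only an assumed starting point for populating conditional probability tables. It is not the Bayesian network mapping of the cited works, and the function name, leaf identifiers and counts are hypothetical.

```python
def end_state_probabilities(leaf_counts, classes, alpha=1.0):
    """Estimate P(class | leaf) for each leaf from training counts.

    alpha is the additive smoothing constant (alpha = 1 is Laplace smoothing),
    which avoids zero probabilities for end states not yet observed in a leaf.
    """
    table = {}
    for leaf_id, counts in leaf_counts.items():
        total = sum(counts.get(c, 0) for c in classes) + alpha * len(classes)
        table[leaf_id] = {c: (counts.get(c, 0) + alpha) / total for c in classes}
    return table


# Hypothetical leaf counts, for illustration only.
leaf_counts = {
    "leaf_a": {"Vibr": 60, "NoFault": 5},
    "leaf_b": {"NoFault": 900},
}
probabilities = end_state_probabilities(leaf_counts, ["NoFault", "Vibr"])
print(probabilities)
```

As new telemetry is classified, the counts and hence the estimated probabilities can be updated without retraining the tree structure.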


5 CONCLUSIONS 


We presented a review of several decision tree classification algorithms. We then demonstrated how data-driven and automated root cause analysis can be conducted via ensemble based bagged decision tree learning from SCADA data of 48 wind turbines based on the CART algorithm. Root cause analysis is here cast in the sense of programmatically discovering the sequence of events (paths) leading from a classified fault at the leaf all the way to the root of the decision tree classifier. These traces can subsequently be used by an engineer to design simulation scenarios to try and replicate the faults and to propose mitigating actions. Finally, we briefly presented an architecture for a monitoring and diagnostics framework that would operate on big data. In particular we highlighted the need for cloud based storage and computing, and an innovative approach based on object-oriented decision tree learning that extends the traditional decision tree classifier concept. In the future, more concrete implementations and results of the proposed framework will be disseminated.
ABSTRACT: The transformation of industry due to the introduction of recent technologies is an evolving 
process whose engines are competitiveness and sustainability, understood in its broadest sense (environmental, 
economic and social). Due to the current state of scientific and technological development, this process 
faces a new and even more important challenge: the transition from discrete technological solutions 
that respond to isolated problems to a global conception in which assets, plants, processes and engineering 
systems are conceived, designed and operated as an integrated complex unit. This vision is evolving 
alongside a set of concepts that, in some way, guide this development: Smart Factories, Cyber-Physical 
Systems (CPS), Factory of the Future or Industry 4.0 are examples. The full integration of the operation and 
maintenance (O&M) processes into the production systems is a key topic within this new paradigm. Moreover, 
this evolution necessarily results in the emergence of new O&M processes and needs, i.e. 
O&M itself will undergo a profound transformation. The transition from today's isolated production 
assets to such an Industry 4.0 with CPS is far from easy. This document presents a proposal to carry out this 
transition by adapting one iteration of the Maintenance Management Model (MMM), integrated into 
ISO 55000, to the complexity of maintaining “System of Systems” CPSs. It involves several 
stages: identification, prioritization, risk management, planning, scheduling, execution, control, and 
improvement, supported by systems engineering techniques and agile/concurrent project management. 


1 INTRODUCTION 


Cyber-Physical Systems (CPS) is a widespread 
concept with several meanings, usually linked with 
embedded and connected systems. 

The term cyber-physical systems was coined by 
Helen Gill in 2006 at the National Science Founda- 
tion in the U.S. to refer to the integration of com- 
putation with physical processes. 

One of the earliest and most cited papers describes 
them as “... are integrations of computation with 
physical processes. Embedded computers and 
networks monitor and control the physical processes, 
usually with feedback loops where physical 
processes affect computations and vice versa.” 
(E.A. Lee 2008). In the early 2010s they were defined as 
“transformative technologies for managing interconnected 
systems between its physical assets and 
computational capabilities” (Baheti & Gill 2011). 
This is a wider definition that avoids any technical 
description. The most recent release of the Framework 
for CPS by the U.S. National Institute 
of Standards and Technology (NIST CPS Public 
Working Group 2017) opens the definition 
even further: “... are smart systems that include 
engineered interacting networks of physical and 
computational components”. At the same time, 
the technical description needs a complete section 
(2.1) of the document, and another section 
(1.1.2) details thirteen main differences from 
conventional product, system, and application design. 
This trend of lengthening the technical description 
while broadening the scope, by shortening the 
definition and using vaguer words, reveals the 
complexity of these systems. 

Despite this complexity, industry will have to 
deal with the integration of CPS into its asset 
portfolio and its maintenance framework. It is a 
straightforward way to improve performance 
in the global, competitive market that companies 
face nowadays. CPS are disruptive for the Operation 
& Maintenance of assets in two ways. The first 
is their ability to share information and self-compare 
their behavior autonomously as a community 
of equipment. For example, a network of CPS 
pumps will be able to predict the failure of one of 
its members based on shared information, without 
the intervention of higher-level supervisors. 
Prognostics and health management has an 
open road ahead. The second disruptive aspect is 
their ability to identify misuse or improper use 
by human operators: two conveyor belts could 
compare themselves and raise a warning if one of 
them is overloaded while the other is underutilized. 
Unfortunately for maintenance personnel, most 
of their interventions will be on legacy, operating 
Systems of Systems that interact with human 
beings: plant operators, maintenance technicians, 
etcetera. Their interventions will also have to deal 
with the seven samurai systems (Martin 2004): 
context, intervention, realization, deployed, collaborating, 
sustainment and competing. 
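
The self-comparison among peers described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the robust median/MAD rule, the factor k and the vibration figures are not taken from the paper or from any CPS standard.

```python
from statistics import median


def flag_outlier_peers(readings, k=3.0):
    """Return the peers whose reading deviates from the group median
    by more than k times the median absolute deviation (MAD)."""
    centre = median(readings.values())
    mad = median(abs(v - centre) for v in readings.values())
    if mad == 0:  # at least half of the peers sit on the median: no usable scale
        return []
    return sorted(name for name, v in readings.items()
                  if abs(v - centre) > k * mad)


# Hypothetical vibration levels (mm/s) shared by four identical pumps.
vibration = {"pump_A": 2.1, "pump_B": 2.3, "pump_C": 2.0, "pump_D": 6.8}
print(flag_outlier_peers(vibration))
```

The same comparison could be run on the load of two or more conveyor belts to flag the overloaded one; no supervisor-level model is needed, only the values shared within the group.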

This paper will propose a framework to intro- 
duce CPS assets and their philosophy for Opera- 
tions & Maintenance in those real environments. 


2 METHOD 


We assume a company that follows the Maintenance 
Management Model (MMM) (Marquez 2007) as a 
maintenance framework that fulfills the ISO 
55000 standard for asset management (Crespo & 
Parra 2018). The starting point for CPS integration 
in the asset portfolio should be a decision taken 
in phase 8 of an iteration (Continuous improvement 
and new techniques utilization): the company 
decides to adopt the Industry 4.0 concepts. 
In doing so, it will implement Cyber-Physical Systems 
(CPS) as future assets and uplift the existing 
ones to this concept. Therefore, a new iteration of 
the complete Maintenance Management Model is 
proposed¹. This iteration will also follow the Cyber-Physical 
Framework. The current target is to reach 
Function III (Cyber Level) of the 5C architecture 
for the implementation of CPS (J. Lee, Bagheri, 
& Kao 2015), as a step prior to full completion 
(Configuration Level) in future iterations. 

The next sections analyze the eight phases of 
the Maintenance Management Model as decision 
areas for implementing our strategy for the 
inclusion of CPS in our asset portfolio. 
There will be no further references to the “business 
as usual” activities of the above-mentioned framework 
related to the maintenance process of the company. 


3 PHASE 1: DEFINITION OF THE 
MAINTENANCE OBJECTIVES AND 
KPIS 


This phase has a conceptualization facet: 
the main effort is to obtain the model that will satisfy 
our requirements under the distinct aspects of the 
CPS framework: Functional, Business, Human, 
Trustworthiness, Timing, Data, Boundaries, Composition 
and Lifecycle. It is one of the phases most 
deeply impacted by CPS implementation. The 
decisions taken during this phase will determine 
Level I of the 5C architecture: the Smart Connection 
Level, with plug & play, tether-free 
communication and sensor network as attributes. 
The basic tool for this phase is the Balanced Scorecard, 
integrating not only economic performance and 
technical indicators of operation and maintenance 
but also the project execution and human factors of 
the iteration that integrates CPS into the asset portfolio. 


¹ Only the new activities due to CPS are mentioned in the 
paper. The business as usual activities of the cycle are not 
included: refer to the original 


3.1 Objectives to add 


1. CPS implementation policy. This policy should 
consider the importance of achieving tangible 
CPS objectives as soon as possible, to engage 
the main stakeholders. The assets to upgrade or 
substitute must therefore be chosen carefully; the 
full transformation of the portfolio can be 
implemented in later MMM iterations. 

2. Maintainability, risk reduction, reliability and 
availability improvement to be achieved by new or 
uplifted CPS assets. Obviously, these parameters 
should be raised above the average. 

3. Economic impact of the overall operation. Capital 
expenditure should be considered carefully: 
in the future, the balance between investment 
and return will be scrutinized for such a 
technology change, and it can jeopardize further 
deployment. 

4. Domains of CPS framework implementation 
should be identified; these are the areas of 
deployment in which stakeholders may have 
domain-specific (manufacturing) and cross-domain 
concerns (energy, transportation). 
Groups of conceptually equivalent or related 
concerns will become the Aspects of the CPS 
framework of interest to us. 


Figure 1. MMM schema. The figure shows the eight phases as a cycle: 
Phase 1: Definition of the maintenance objectives and KPIs. 
Phase 2: Assets priority and maintenance strategy definition. 
Phase 3: Immediate intervention on high impact weak points. 
Phase 4: Design of the preventive maintenance plans and resources. 
Phase 5: Preventive plan, schedules and resources optimization. 
Phase 6: Maintenance execution assessment and control. 
Phase 7: Asset life cycle analysis and replacement optimization. 
Phase 8: Continuous improvement and new techniques utilization. 
5C architecture overview examples (Lee et al.). 


3.2 Identify the critical questions to be answered 
and key decisions to be taken 


1. New stakeholders identification and documentation 
of their expectations. Consultants and 
technology providers are new stakeholders, 
but existing stakeholders with new concerns should 
also be re-examined: maintenance personnel 
and operators should change their approach to 
these systems by improving their technical skills in 
software/hardware, which will imply new training. 
Reporting processes, the chain of command and IT 
systems will also be impacted. 

2. Legal & safety concerns. Due to the recent development 
of CPS, there is a lack of regulation, 
but regulation will have a deep impact in the future on 
systems with safety or environmental risks. Legislation 
will affect them and their cryptographic 
conditions to prevent any unintended access. The CPS 
framework covers these concerns in full detail, 
and the assurance facet is described in 
detail. 

3. Technological state of the art. Due to the lack 
of maturity, it must be evaluated which technologies 
will survive in a few years. Open standards 
will help them survive or, at least, ease the transition 
to new ones in the future. 

4. Identify leaders (internal and external) for the 
transfer and acquisition of new know-how. 

5. Other standards, policies, directives and procedures 
to be adapted. The technological ones have been 
mentioned, but other business domains 
should also be considered. As an example, CPS 
themselves could report earned value to the 
project management software (units produced, 
end of the testing phase, end of the startup process...). 

6. Cyber-level infrastructure and machine-cyber 
interface. 

7. ERP and Enterprise Asset Management software 
data flow impact. 

8. Big data analysis systems and AI deployment. 

9. Training of personnel. 

10. Infrastructures, special tools and test equipment 
needed or affected. 

IT infrastructures play a key role in CPS systems. 
Critical questions must therefore be faced at this 
stage, as a decision-making breakdown structure 
for the IT to be implemented in the CPS: 

At the software level, the continuous evolution 
of information technologies is a risk factor, and the 
use of open, standard methodologies for communications, 
such as REST APIs with HATEOAS 
exchanging data as JSON or XML, should 
be a priority. A wrong step in this direction could 
endanger the rest of the project by making the CPS 
unable to communicate with each other. 
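
As an illustration of the kind of open message format meant here, the sketch below builds a JSON resource that carries its own hypermedia links. The URL scheme, the field names and the "_links"/"href" layout (borrowed from the HAL convention) are assumptions made for the example; the paper does not prescribe any of them.

```python
import json


def pump_status_resource(pump_id, flow_m3h, vibration_mm_s):
    """Build a JSON-serialisable status resource with hypermedia links."""
    base = f"/pumps/{pump_id}"  # hypothetical URL scheme
    return {
        "id": pump_id,
        "flow_m3h": flow_m3h,
        "vibration_mm_s": vibration_mm_s,
        # HATEOAS: the resource tells the client where it can go next.
        "_links": {
            "self": {"href": base},
            "peers": {"href": f"{base}/peers"},
            "maintenance-orders": {"href": f"{base}/maintenance-orders"},
        },
    }


message = json.dumps(pump_status_resource("P-101", 42.5, 2.1), indent=2)
print(message)

# A peer discovers the next request from the links instead of hard-coding it.
received = json.loads(message)
print(received["_links"]["peers"]["href"])
```

Because the links travel with the data, a CPS from another vendor only needs to understand the shared format, not the internal URL layout of its peers.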

Regarding the physical layers of communication 
(wired or wireless), the same principle should be 
followed: look for standard, open technologies. 
Of course, wired TCP/IP should be there, but 
wireless technologies are here to stay, and the 
choice among them is far more confusing: 

Low-rate wireless personal area networks under 
IEEE 802.15.x or Wi-Fi could be a reasonable decision 
for locally deployed systems. 

For wide areas, the choice is between 
licensed LPWA technologies standardized by 3GPP 
(EC-GSM-IoT, LTE MTC Cat M1 and NB-IoT) 
and other commercial LPWA solutions: LoRaWAN 
or Sigfox. These are closed technologies. 

The future will probably pass through commercial 
5G bands categorized into three generic services, 
namely extreme mobile broadband, massive 
machine-type communications, and ultra-reliable 
machine-type communications. 


3.3 KPIs 


1. Human factors: strong leadership is needed 
to overcome resistance and barriers, to change 
mindsets, to push through organizational 
change, to sustain investment, and to keep the 
team involved, specifically during the transition. 
Indicators of these soft factors will help 
to take the pulse of the organization. This is 
important because the benefits of CPS will only 
appear after several of them have been integrated: at 
the beginning of the process, only problems will 
arrive, without any apparent benefit. 

2. Speed of update of infrastructures, assets, etcetera: 
the transition should be implemented fast 
enough to achieve evident benefits during the 
first stages, but without overstressing the organization 
(production shutdowns and start-ups, new 
training, new processes, etcetera). 

3. AI transition and peer-to-peer comparison of assets. 
As this is one of the theoretically most disruptive 
changes that CPS bring, we should monitor 
the efficiency of this behavior through its economic 
impact on our organization. Merely installing 
a modern gadget will fail as an objective. 

4. Earned Value Management indicators (EV, 
CPI, SPI) are a good reference for any project 
that deploys CPS. The project should also use the 
14 points of the U.S. Defense Contract Management 
Agency, such as the Baseline Execution Index, for planning 
and scheduling. These are objective indicators of the 
project’s financial status, one of the main 
concerns of the enterprise. 
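
The earned value indicators named in item 4 follow the standard definitions CPI = EV/AC and SPI = EV/PV. A minimal sketch is given below; the function name and the monetary figures are invented for illustration.

```python
def earned_value_indicators(pv, ev, ac):
    """Standard earned value ratios and variances.

    pv: planned value (budgeted cost of work scheduled)
    ev: earned value  (budgeted cost of work performed)
    ac: actual cost   (actual cost of work performed)
    """
    return {
        "CPI": ev / ac,  # cost performance index; < 1 means over budget
        "SPI": ev / pv,  # schedule performance index; < 1 means behind schedule
        "CV": ev - ac,   # cost variance
        "SV": ev - pv,   # schedule variance
    }


# Hypothetical status of a pump retrofit work package, in thousand EUR.
status = earned_value_indicators(pv=200.0, ev=180.0, ac=240.0)
print(status)
```

In this invented case both indices fall below 1, which would signal a retrofit that is over budget and behind schedule.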


3.4 Audits 


Audits will come in two variants: those 
for control and continuous improvement according 
to MMM (MES, QMEM, etcetera) and specific 
ones to check the efficiency and effectiveness of 
CPS implementation. The latter should focus 
on the incremental evolution of the CPS upgrade: the 
added value of the first units implemented should be 
audited against their objectives. 


4 PHASE 2: ASSETS PRIORITY AND 
MAINTENANCE STRATEGY 
DEFINITION 


The basic tool is Criticality Analysis, upgraded to 
include the risk of project failure during the 
implementation of CPS: it is necessary to prioritize those 
assets with a higher Return on Investment and lower 
technical risks. During the first iteration, one of the 
highest risks is the disaffection of the main stakeholders 
with CPS. The assets to be upgraded at this early 
stage should therefore offer high visibility of the 
improvement, affordable capital expenditure, low 
technological risk and easy O&M in the near future. 


4.1 Determination of actions on existing assets 
based on risk factor analysis to: 


1. Replace/substitute for a new asset. 
2. Uplift and improve the existent ones. 


The actions will be scored using the concerns of the 
CPS framework (CPS-FWK), tailored accordingly. Without 
being exhaustive, in this first iteration special 
attention should be paid to several concerns enumerated in the 
framework: 


1. Functional: monitorability and communication. 

2. Business: cost, time to market, utility and 
interoperability. 

3. Timing: awareness and resilient time. 

4. Boundaries: networkability, responsibility. 

5. Composition: adaptability, constructivity and 
discoverability. 

6. Lifecycle: procureability, deployability and 
maintainability. 

7. Human: usability. 

8. Trustworthiness: safety, reliability and 
cybersecurity. 

These are, in the authors' view, the key 
aspects to ensure stakeholder engagement during 
the first iteration. 
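
One possible way to turn the concerns listed above into a score for the replace/uplift decision is a weighted sum per aspect. The sketch below assumes a 0-10 scale, invented weights and invented scores; the paper does not specify a scoring formula, so this is only one plausible reading.

```python
# Hypothetical weights per CPS framework aspect; they sum to 1.0.
WEIGHTS = {
    "functional": 0.20, "business": 0.20, "timing": 0.05,
    "boundaries": 0.10, "composition": 0.10, "lifecycle": 0.10,
    "human": 0.10, "trustworthiness": 0.15,
}


def weighted_score(scores):
    """Weighted sum of per-aspect scores, each given on a 0-10 scale."""
    return sum(WEIGHTS[aspect] * scores[aspect] for aspect in WEIGHTS)


def rank_options(options):
    """Order candidate actions from highest to lowest weighted score."""
    return sorted(options, key=lambda name: weighted_score(options[name]), reverse=True)


# Invented scores for two candidate actions on the same asset.
options = {
    "replace_with_new_cps": dict(functional=9, business=4, timing=8, boundaries=8,
                                 composition=9, lifecycle=6, human=7, trustworthiness=8),
    "uplift_existing_asset": dict(functional=6, business=8, timing=6, boundaries=6,
                                  composition=5, lifecycle=8, human=8, trustworthiness=6),
}
for name in rank_options(options):
    print(f"{name}: {weighted_score(options[name]):.2f}")
```

The weights are where the tailoring mentioned in the text would take place: a company that fears stakeholder disaffection could, for instance, raise the weight of the Business and Human aspects in the first iteration.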
4.2 Planning and scheduling to retrofit the assets, 
based on the CPS 5C level architecture, with the 
target of the cyber level 


Turnaround, shutdown and outage operations will 
be used to implement the new CPS assets or 
transform the legacy ones into CPS systems. 

The authors recommend using Kanban agile 
practices embedded in a general schedule (Villar-Fidalgo, 
Espinosa-Escudero & Dominguez-Somonte 
2016 & 2017), according to the PMI® Agile 
Practice Guide (2017). The “product owner” value 
team might produce a roadmap to show the anticipated 
sequence of CPS to add or legacy systems to 
upgrade over time. This planning will include the 
conceptual, preliminary and detailed design, development, 
construction, O&M and disposal phases. 
The team re-plans the roadmap based on the 
results obtained. 

Due to the complexity of the CPS concept and its 
lack of maturity, incremental and iterative changes 
over several iterations are preferable to a 
complete transformation in a single endeavor 
with waterfall planning and scheduling. Besides, 
the intrinsic resilience and flexibility of a well-designed 
CPS will favor this approach. 

The agile-lean Kanban Method fits particular 
steps of the proposal perfectly: it starts with the current 
state, it is incremental, and it respects the current processes, 
roles, responsibilities and titles. It will help in dealing 
with emerging requirements. 

The main obstacle is that enterprise culture does not 
always embrace a leadership attitude at all levels. 


4.3 EV baseline evaluation 


This baseline will help to measure the performance 
of the transition. The variances from the baseline 
estimations will show whether the plan needs to be updated 
or the initial target can be maintained. Deviations from the 
baseline will affect not only the financial performance 
of the transition project but also stakeholders’ 
engagement with the CPS vision. 


5 PHASE 3: IMMEDIATE INTERVENTION 
ON HIGH IMPACT WEAK POINTS 


The basic tool is Failure Root Cause Analysis (FRCA), 
to look for high-impact reliability enhancements 
and ensure a very effective definition of the subsequent 
maintenance plan activities. A new 
cause of failure must be added, as recognized in the 
trustworthiness aspect of the CPS framework, which is based 
on the security, privacy, safety, reliability and resilience 
of future CPS. The high complexity of network-based 
software is a new factor to consider in the 
FRCA, on top of the traditional physical, human 
and latent causes inherent to any kind of system. 


Nevertheless, interventions should also target 
assets that are highly favorable to CPS transformation: 
those items that could enhance the 
visibility of CPS advantages. A key reactor of high 
economic value certainly deserves to be upgraded 
to CPS level, but its current monitoring as a 
single asset, with conventional SCADA and PLCs 
plugged into the network, will dilute the improvement 
brought by such an upgrade. On the other hand, a set 
of water flow pumps will not be as spectacular as 
the reactor, but their number and failure rate could 
make them ideal candidates for upgrade if it is 
possible to achieve good failure prognostics 
based on their shared data, including sensor-based 
condition and usage rate. 


5.1 Retrofit of highest priority assets to CPS level 
in first incremental iterations of schedule 


The priority will be based on: 


1. Affordability of implementation and economic 
benefit for asset exploitation. 

2. Feasibility of full CPS application. 

3. Trustworthiness of implementation. 

4. Visibility of CPS enhancements. 


5.2 Data capture and data mining to extract 
the information, probably under “big data” 
considerations 


The first data extracted and the initial behavior of the CPS 
itself should follow the planned strategy. Otherwise, 
it will be necessary to adapt the next steps, preferably 
through agile methodologies. 

The maximum net benefit per system replaced or upgraded 
should be achieved during the earliest 
stages of implementation, because of the priority 
criteria used. If the results are not as expected, 
this is an alarm signal that should draw attention to the 
strategy or, at least, lead to re-planning the implementation 
priorities. During this phase, Function II of the 
5C architecture should be achieved: the Data-to-Information 
Conversion Level. Its attributes are smart 
analytics for component and machine health with 
multi-dimensional data correlation and, finally, 
degradation and performance prediction. 
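
A very simple form of degradation prediction is to fit a trend to a health indicator and extrapolate it to an alarm threshold. The sketch below assumes a linear trend, an invented bearing-temperature series and an invented threshold; real CPS analytics would rely on richer, multi-dimensional models.

```python
def linear_trend(times, values):
    """Ordinary least-squares fit of values = intercept + slope * time."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return mean_v - slope * mean_t, slope


def time_to_threshold(times, values, threshold):
    """Extrapolate the fitted trend to the time it crosses `threshold`.

    Returns None when the indicator is not increasing.
    """
    intercept, slope = linear_trend(times, values)
    if slope <= 0:
        return None
    return (threshold - intercept) / slope


# Invented weekly bearing-temperature readings (deg C) for one pump.
weeks = [0, 1, 2, 3, 4]
temperature = [60.0, 61.0, 62.0, 63.0, 64.0]
print(time_to_threshold(weeks, temperature, threshold=80.0))
```

The returned value is the week at which the fitted trend would reach the threshold, which gives the planner a first, rough horizon for scheduling the intervention.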


6 PHASE 4: DESIGN OF THE PREVENTIVE 
MAINTENANCE PLANS AND 
RESOURCES 


The basic tool is Reliability Centered Maintenance. 
This is where operations and maintenance start to be 
influenced by CPS, depending on the level assigned 
in the risk plan and on the operational mode. The asset 
is still far from Function V of the 5C architecture, the 
Configuration Level, with the self-X attributes of the 
CPS, but it has information that, supported by new 
plans and their optimization, will take it to the desired 
Function III of the 5C architecture: the Cyber Level. 
After implementation, the maintenance plan 
should be oriented to Condition Based Maintenance/Prognostics 
and Health Management to exploit 
all the advantages of CPS. The CPS will probably 
raise emerging requirements due to the higher precision 
of sensors and data analysis (big data, M2M 
peer review), which will fine-tune the detection of 
catastrophic failures. But this fine-tuned detection 
will increase the need for fast action: it will 
detect not only the normal wear of a friction bushing 
but also the fast failure of an axle due to unexpected 
fatigue caused by micro cracks (which the CPS could 
infer from misuse of the asset). The maintenance 
plans should also embrace the Agile philosophy. 


7 PHASE 5: PREVENTIVE PLAN, 
SCHEDULES AND RESOURCES 
OPTIMIZATION 


The basic tool is Risk-Cost Optimization, considering 
the new opportunities that CPS bring. Using 
self-comparison and machine-to-machine 
data exchange (peer monitoring) to identify the 
state of the asset portfolio, together with an accurate prognosis, 
will make it possible to take advantage of dynamic planning 
and scheduling forecasts, with optimal allocation 
of resources for maintenance and operation. The 
capacity of CPS to identify misuse or unbalanced 
workloads will also affect the production plans and 
operator training plans. 

Again, agile methodologies embedded 
in a high-level integrated master plan are a must: the 
plan should be ready to deal with emerging requirements 
that need to be addressed online. Baseline 
updates, with a transition from the original boundary 
conditions (or “samurais”) to updated ones, will 
make it possible to measure performance without losing 
contact with the new reality. This way, audits will 
continue delivering value. 


8 PHASE 6: MAINTENANCE EXECUTION 
ASSESSMENT AND CONTROL 


The basic tool is Operational Reliability Analysis, but 
the specific aspects of the transition phase should 
not be forgotten: it takes place in a project environment 
with a high-risk technological transition. 

All the KPIs defined in phase 1 will help to control 
the evolution of the project and its impact on 
O&M operations: 


1. Measurement, analysis and evaluation of 
earned value indicators for asset retrofit and 
start-up: cost and schedule through CPI, SPI, 
S-curves and the Integrated Program Management 
Report. 

2. Evaluation of indicators of actual performance 
in reliability, maintainability and future 
improvement, based not only on probabilistic 
assessments but also on truthful information 
and prognostics delivered by the new CPS. 

3. Evaluation of risk, with cost/benefit evaluation 
of mitigation plans. 

4. CPS transition speed and acceptance among 
stakeholders. 


Finally, it will be necessary to control the improvement 
achieved with the CPS already installed. If we 
are not able to follow the baseline technical plan, the 
overall project should be reconsidered. 


9 PHASE 7: ASSET LIFE CYCLE ANALYSIS 
AND REPLACEMENT OPTIMIZATION 


This phase, like the first one, is deeply impacted by 
CPS. The basic tool is Life Cycle Cost Analysis, but it 
must include all the facets that the new systems will 
bring to the asset portfolio. 

It will face new cost categories: software 
upgrades, trustworthiness analysis and developments, 
data network deployment and maintenance, 
etcetera. Nevertheless, it can also achieve 
important savings: better use of systems by operators, 
higher accuracy in maintenance prognosis, and 
lower supervision costs due to the “self-awareness” 
of CPS groups. Unfortunately, this kind of 
improvement will very often be impossible to quantify 
because traditional life cycle analysis has not 
included these factors. The maintenance team will 
have to struggle against skepticism to show the 
advantages of CPS. This is one of the reasons 
to make key decisions and answer critical questions 
during the first phase: once the iteration reaches this 
point, there should be evidence from the CPS deployed 
to support the benefits to Life Cycle Cost; otherwise 
there will only be significant capital expenditure 
and intangible benefits, which could lead to the 
disaffection of critical stakeholders such as CFOs. 

Another key aspect to highlight is the improvement 
in risk management and in the “probability/risk 
number” of CPS systems, due to better knowledge 
of their health as a system. 
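
To make the new cost categories and savings comparable, they can be entered explicitly in a discounted life cycle cost. The sketch below assumes a plain net-present-value formulation; the function name, the three-year horizon, the 5% discount rate and every figure are invented for illustration.

```python
def life_cycle_cost(capex, yearly_costs, yearly_savings, discount_rate):
    """Net present life cycle cost: capex plus discounted (cost - saving)."""
    total = capex
    for year, (cost, saving) in enumerate(zip(yearly_costs, yearly_savings), start=1):
        total += (cost - saving) / (1.0 + discount_rate) ** year
    return total


# All figures are invented, in thousand EUR, over a three-year horizon.
conventional = life_cycle_cost(
    capex=100.0,
    yearly_costs=[50.0, 50.0, 50.0],   # maintenance and supervision
    yearly_savings=[0.0, 0.0, 0.0],
    discount_rate=0.05,
)
cps = life_cycle_cost(
    capex=160.0,
    # maintenance plus the new categories: software upgrades, data network
    yearly_costs=[45.0, 45.0, 45.0],
    yearly_savings=[20.0, 20.0, 20.0],  # better prognosis, less supervision
    discount_rate=0.05,
)
print(round(conventional, 1), round(cps, 1))
```

The point of the exercise is the savings column: unless the benefits delivered by the deployed CPS are measured and entered there, the comparison reduces to the higher capital expenditure alone.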


10 PHASE 8: CONTINUOUS IMPROVEMENT 
AND NEW TECHNIQUES UTILIZATION 


It will be necessary to analyze which targets from 
phase 1 were achieved in the concluded iteration. If 
the result is satisfactory, there will be two roads 
ahead: spread the CPS architecture to more assets 
or take another step with the current CPS towards 
Function IV of the 5C architecture: the Cognition 
Level. Probably the wisest decision in these times 
of technological immaturity is to spread the CPS 
concept and study the lessons learnt during the first 
iteration to smooth the continuous uplift and renovation 
of our asset portfolio. 

Functions IV (Cognition Level) and V 
(Configuration Level) of the 5C architecture have an 
intrinsically high degree of technological uncertainty 
that makes them too risky for the engagement 
of the main stakeholders. They should go through a 
System Integration Laboratory or prototype phase 
before full integration into production assets. 


11 DISCUSSION 


The architecture developed by Lee et al. (J. Lee 
et al. 2015) or the Framework released by NIST 
(NIST CPS Public Working Group 2017) are clear 
starting points for deployment of CPS in manufac- 
turing industries. Nevertheless, these documents 
are focused on the CPS itself. 

Our approach is a complementary and holis- 
tic view of the CPS implementation in a dynamic 
environment like the active enterprise. The Human 
Factor is a cornerstone during business transfor- 
mation and should be included in the equation. 

Another intended contribution is the necessary 
link between the sequential workflow of the construction 
of a CPS and the iteration phases of the 
maintenance model and the business itself. 


12 CONCLUSIONS AND FURTHER 
DEVELOPMENT 


This paper presents a framework to incorporate 
CPS, up to Function III of the 5C architecture, into our 
asset portfolio with a holistic view and under a 
consolidated maintenance management model. 
The objective is to avoid early failures that discourage 
stakeholders from supporting this technology 
after the first iteration. 

This work will continue with the evaluation of use 
cases and further iterations to reach Function 
V (Configuration Level), where all the advantages 
of CPS could be exploited for the benefit of the 
maintenance of the full asset portfolio and, therefore, 
the business objectives. 
Automated train driver competency performance indicators using real 
train driving data 
ABSTRACT: On Train Data Recorders (OTDR) are used within the GB railways to collect data relating 
to train operations and the state of various train systems throughout a journey. These data include 
power and brake controller position and driver acknowledgement of signaling system warnings. These data 
could be used to assess driver competency, but an assessment framework is required to extract the data 
sensibly. This paper proposes a train driver competency framework to define aspects that are related to 
train driver functions, based on document analysis, cab-rides and informal interviews. It also explores the 
use of OTDR data in the quantification of the train driver competency framework by introducing a 
number of indicators under each aspect covered by the framework. The proposed indicators demonstrate 
how OTDR data can be useful in routine systematic checks and pre-incident investigation, for example, 
the identification of deviations from recommended rules that may have safety implications. Furthermore, 
the data may allow for an improved understanding of driver performance, which in turn could allow the 
development of more effective safety management strategies. A number of numerical examples are presented to 
illustrate the applicability of the developed algorithm. 


1 INTRODUCTION 


The on-train data recorder (OTDR) offers an opportunity 
to understand the driver's use of power, brake 
and safety systems during a journey. Despite the 
potential of OTDR data, they are not widely used 
to facilitate the automatic analysis of driver 
performance. 

This paper presents a train driver competency 
framework based on official documents and 
reflecting the professional driving policies. It 
also explores the use of OTDR data to quantify different 
areas covered by the proposed train driver 
competency framework, to assess train driver performance 
aspects such as drivers’ use of safety systems 
and braking and, in addition, to derive indicators 
that measure the vigilance level of a driver. 


2 AUTOMATED SAFETY ASSESSMENT 


In the UK, the current practice for assessing driver 
competence is in-cab riding by driver 
managers. A number of train operating companies 
use in-cab assessment to monitor drivers’ 
operational use of the Driver’s Reminder Appliance 
(DRA) (McCorquodale et al., 2002). In some 
cases, digital cameras are installed to record drivers’ 
actions, but they tend to be unpopular (RSSB, 
2004). These techniques have their merits in assessing 
driver performance, as they supply comprehensive 
details about the driver’s behavior. 
However, drivers may behave differently under 
observation, limiting the potential for independent 
driver assessment. In addition, the time and cost of 
traditional methods hinder their use for continuous 
monitoring. 

A number of research studies (Balfe, 2016; El 
Rashidy & Van Gulijk, 2016; Walker & Strathie, 
2014; Green et al., 2011) explored the advantage of 
using OTDR data in different areas such as station 
duties, driver assessment and interaction with 
warning systems. Green et al. (2011) introduced a 
number of indicators to assess driver performance. 
They are: 


• The speed at which power Notch 4 (out of a 
total of 4 notches) is selected when accelerating; 

• The percentage of time in a braking sequence 
that the driver selects brake step 3 (out of a total 
of 4 steps); 

• The speed of the train as it traverses a Train Protection 
and Warning System (TPWS) grid approaching 
a Permanent Speed Restriction (PSR); 

• The speed through a PSR as a percentage of the 
maximum speed and the mean speed when the 
Automatic Warning System (AWS) horn is received; 

• Erroneous events such as wrong-side door 
release and system trips such as TPWS brake 
demand. 


These indicators are compared with the average 
performance of the whole population of train drivers 
to assess an individual’s driving performance in relation 
to the cohort of drivers. The study provided initial 
learning from OTDR analysis but did not make use 
of all the available OTDR data related to driver 
performance. 
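
As an illustration of such a cohort comparison, the sketch below computes one of the indicators listed above (the share of braking time spent in brake step 3) and expresses it as a z-score against the cohort. The record layout, the function names and all the figures are assumptions; actual OTDR formats differ between fleets and are not described in the text.

```python
from statistics import mean, stdev


def brake_step_share(samples, step=3):
    """Percentage of braking time spent in `step`.

    `samples` is a list of (duration_s, brake_step) pairs for one braking
    sequence; brake_step 0 means the brake is released and is ignored.
    """
    braking = [(d, s) for d, s in samples if s > 0]
    total = sum(d for d, _ in braking)
    if total == 0:
        return 0.0
    return 100.0 * sum(d for d, s in braking if s == step) / total


def z_score(value, cohort):
    """How many cohort standard deviations `value` lies from the cohort mean."""
    return (value - mean(cohort)) / stdev(cohort)


# Invented record of one braking sequence: (seconds, brake step).
sequence = [(5.0, 0), (10.0, 1), (20.0, 2), (10.0, 3)]
driver_share = brake_step_share(sequence)          # 10 s of 40 s braking
cohort_shares = [10.0, 12.0, 14.0, 16.0, 18.0]     # invented cohort values
print(driver_share, z_score(driver_share, cohort_shares))
```

A large positive z-score would indicate a driver who relies on heavy braking far more than the cohort, which is the kind of deviation a driver manager might want to review.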

The aim of this paper is similar to that of the papers 
above: to assess driver performance in relation to 
the safe driving of a train, but we propose a more 
comprehensive framework. 


3 METHOD 


The method proposed in this work comprises two steps, viz. developing a train driver competency framework and introducing a number of performance indicators to quantify elements of the framework using OTDR data.


3.1 Train driver competency framework 


The framework is based on document analysis, cab-rides and interviews.

Document analysis clarified the driver function and best practices in relation to safe professional driving. For example, professional driving policies (e.g. SWT, 2012; LM, 2009) were used to identify the recommended travel speed when approaching a red aspect and the braking rules. The Rule Book, Train Driver Manual (GE/RM8000/train driver), was also used to identify rules that the driver should comply with, such as the use of safety systems.

To gain more knowledge about the driver envi- 
ronment and driver reaction under different situ- 
ations, cab-rides were carried out. In addition, 
consultations, in the form of informal interviews, 
with a driver and a driver manager were conducted 
to discuss some operational issues and clarify some 
technical points. 

Based on the above processes, the following aspects were identified:


• The driver’s handling of trains;
• The driver’s compliance with rules;
• The driver’s vigilance.


Under each aspect, a number of indicators were introduced, as presented in Figure 1, to facilitate the conversion of the conceptual framework into operational indicators that can be used to assess each aspect. The train handling aspect covers how the driver uses the brake system and the power system, whereas compliance deals with rules relating to the driver’s handling of safety systems. The vigilance level of the driver is assessed by the number of TPWS brake demands, the number of wrong-side door releases and the percentage of instant cancellations of the AWS horn.


Figure 1. Train driver competency framework.

Figure 2. Brake use for an origin-destination pair.


3.2 Train driver performance indicators 


A bottom-up approach was implemented to develop train driver competency performance indicators (DCPIs), based on the driver competency framework and using OTDR data.


3.2.1 OTDR data 

The OTDR data files used in this paper were sup- 
plied by Southern Railway. They are for the same 
route and the same day but different drivers to 
eliminate the impact of route conditions. 


3.2.2 Initial data handling 
The initial data handling process, presented in 
Figure 2, includes a number of steps as follows: 


• Examine data types and formats and correct them if needed. For example, the format of a relative journey time is converted from “+ 01h24 mn26 s 6” to “5066.6” seconds.

• Compress all variables that occurred at the same time into a single data row. Closer inspection of the data showed that a relative journey time record may appear more than once with different groups of variables (i.e. for the same time record, there was more than one input line from different data channels).

• Process missing data using different logical processes, for example, filling the missing values of train distance with a distance calculated from the available time duration and train speed. This error checking is specific to the Class 455 data used in this study, although it is likely that all OTDR data will need similar error handling and cleaning routines to make them usable.


After this pre-processing, data analysis could 
commence. 
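
The three handling steps can be illustrated with a minimal Python sketch. It is not the authors' R code: the field names (`time`, `speed`, `distance`), the reading of the trailing digit as tenths of a second, and the units (seconds, metres per second, metres) are all assumptions made for illustration.

```python
import re

# Assumed layout of a relative journey time, e.g. "+ 01h24 mn26 s 6".
_TIME_RE = re.compile(r"^\s*\+?\s*(\d+)\s*h\s*(\d+)\s*mn\s*(\d+)\s*s\s*(\d?)\s*$")


def parse_relative_time(text):
    """Convert a relative journey time string to seconds."""
    match = _TIME_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    hours, minutes, seconds, tenths = match.groups()
    whole = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
    # Assumption: the trailing digit is tenths of a second.
    return whole + (int(tenths) / 10 if tenths else 0.0)


def merge_rows(rows):
    """Collapse rows that share a time stamp into one row per time stamp."""
    merged = {}
    for row in rows:
        target = merged.setdefault(row["time"], {"time": row["time"]})
        for key, value in row.items():
            if value is not None:
                target[key] = value
    return [merged[t] for t in sorted(merged)]


def fill_missing_distance(rows):
    """Fill a missing distance from the previous distance, speed and elapsed time."""
    for prev, cur in zip(rows, rows[1:]):
        known = prev.get("distance") is not None and prev.get("speed") is not None
        if cur.get("distance") is None and known:
            elapsed = cur["time"] - prev["time"]
            cur["distance"] = prev["distance"] + prev["speed"] * elapsed
    return rows
```

Here `parse_relative_time("+ 01h24 mn26 s 6")` reproduces the 5066.6 seconds quoted in the text (1 h + 24 min + 26.6 s). A real OTDR export would need its own channel names and unit conversions.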


3.2.3 Detection algorithms

A number of algorithms have been developed in 
the R software package to extract relevant indica- 
tors for each behavior aspect presented in Figure 1. 


3.2.4 Data analysis 

A number of algorithms have been developed in R to identify relevant scenarios for each behavior aspect presented in Figure 1 and then calculate the metric. The metrics are mostly frequencies, averages, or maximum or minimum values. Standard statistical visualizations, such as boxplots, were used to present the results.


4 RESULTS 


4.1 Proposed DCPIs 


Tables 1(a), (b) and (c) summarize the DCPIs developed in this work and show the proposed performance metric and criteria for assessing each indicator.


Table 1. DCPIs based on OTDR data.

(a) Train handling

Aspect                                   Metric
Braking behavior                         Pattern recognition based on braking curve data
Use of brake-step 3 on approach          The maximum percentage of distance travelled using
to stations                              brake-step 3 per station during a journey
Use of brake on approach to stations     The percentage of distance travelled using each brake-step
Speed                                    The speed at AWS* horn (mph) prior to a red aspect
Use of power                             The frequency of train speed <= 3 mph when power
                                         Notch 4 is selected during the journey
                                         The percentage of distance travelled using power
                                         notches 1 to 4 (out of 4)

*AWS stands for Automatic Warning System.

(b) Compliance with rules

Aspect                                   Metric
Use of EBS*                              EBS operated events
Use of TPWS*                             TPWS isolated events
Use of DRA* in front of a red aspect     The number of DRA operated events compared with
Use of DRA at the start of a journey     the number of red aspects the driver experienced
Use of DRA during the coupling/
uncoupling activity
Putting the brake controller into        Number of Step 3 at stand compared with the number
Step 3 once the train is at stand        of stations and red aspects during a journey
Brake test before the first station      The use of brake prior to the first station or an
and the first caution aspect             AWS horn

*EBS, TPWS and DRA stand for Emergency Bypass Switch, Train Protection and Warning System, and Driver’s Reminder Appliance, respectively.

(c) Vigilance

Aspect                                   Metric
Instant cancellation of AWS horn         The percentage of instant AWS cancellations
Wrong side door release                  Number of wrong side door releases
TPWS Demand application                  Number of TPWS Demand applications


4.2 Safety related DCPIs examples


This section gives a few numerical examples of DCPIs that can be related to safety rules or devices. Because the data sample is small, the examples illustrate the use of OTDR data rather than reveal any trend or best-practice rule.

Under the train handling aspect, braking behavior and speed at the AWS horn when approaching a red aspect are presented here, as they are directly related to safety. For braking behavior, Figure 2 shows the proportion of use of each brake step, calculated from the distance travelled in each step. For example, in “Journey 5” the driver applied Step 2 (0.66) in addition to Step 3 (0.28), as shown in Figure 2, owing to late use of the brake, which may create a hazardous condition under different circumstances such as low adhesion. It should be noted that the use of brake Step 3 should be minimized as good practice, unless the driver has to use it due to low track adhesion.
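
The distance-based brake-step proportion can be sketched as follows. This is an illustrative Python reading of the metric, not the authors' R implementation: the `(distance, brake_step)` sample layout, the encoding of "no brake" as step 0, and the choice to attribute each interval to the step active at its start are assumptions.

```python
from collections import defaultdict


def brake_step_shares(samples):
    """Return the share of braking distance travelled in each brake step.

    samples: list of (distance_m, brake_step) tuples ordered by distance,
    where brake_step 0 means no brake applied (an assumed encoding).
    """
    travelled = defaultdict(float)
    for (start, step), (end, _next_step) in zip(samples, samples[1:]):
        if step > 0:
            # Attribute each interval to the step active at its start.
            travelled[step] += end - start
    total = sum(travelled.values())
    if total == 0:
        return {}
    return {step: dist / total for step, dist in sorted(travelled.items())}


# Made-up samples, not data from the paper.
example = [(0, 0), (100, 2), (160, 3), (190, 1), (200, 0)]
shares = brake_step_shares(example)
```

With these made-up samples, 60 m are travelled in Step 2, 30 m in Step 3 and 10 m in Step 1, so the shares are 0.6, 0.3 and 0.1.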

For the speed, Table 2 presents the speed at the AWS horn when the driver is approaching a red aspect. A higher than normal speed when approaching a red aspect may cause a SPAD (signal passed at danger without authorization) or lead to a full brake application to stop the train at the correct location. For one train operator this is 20 meters in advance of the red aspect (LM, 2009). For the OTDR sample used in this paper, all drivers complied with this rule, as shown in Table 2.

Under compliance with rules, the brake test is checked, as it enables the driver to evaluate the performance of the train braking system before it is needed. OTDR data make it possible to check this rule prior to the driver’s first stop (due to a station or a red aspect). For example, Figure 3 shows that the driver carried out the brake test prior to the first AWS horn, in contrast, the driver presented by first AWS horn.
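
A brake-test check of this kind could be expressed as a scan over time-ordered events. The sketch below is a hypothetical Python illustration: the event labels `"brake"`, `"aws_horn"` and `"station_stop"` are invented for the example and do not correspond to named OTDR channels.

```python
def brake_test_done(events):
    """Return True if a brake application precedes the first trigger event.

    events: list of (time_s, kind) tuples sorted by time, where kind is one of
    "brake", "aws_horn" or "station_stop" (hypothetical labels).
    """
    for _time, kind in events:
        if kind == "brake":
            return True
        if kind in ("aws_horn", "station_stop"):
            return False
    # No brake use and no trigger event were recorded.
    return False
```

A journey whose first recorded event is an AWS horn returns False, matching the rule that the brake should be tried before the first station or AWS horn.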

For the vigilance aspect, wrong-side door release is presented. Releasing the doors on the wrong side of the train may have serious consequences, as it could harm railway passengers. Only one of the journeys showed any instances that appeared to be wrong-side door releases. This journey is shown in Figure 5 and is unusual in that the train appears to be stationary for most of the time. It is possible that this file shows a train under maintenance. Figure 5 shows door releases; on each occasion the right-hand side door (shown by the blue circles) is released shortly before the left-hand side door (shown by the red line). Because of the very short period between the two door releases, they appear to occur at the same time in the figure. Whilst this file does not appear to show an instance of an actual hazard, since the train did not appear to be moving, it nevertheless demonstrates that it is possible to use OTDR data to detect instances of wrong-side door release.
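
One way such a detection could work is to compare each recorded door release against the platform side at the station concerned. The Python sketch below is only an assumption about how this might be done: the `platform_side` lookup is hypothetical and would have to come from a source outside the OTDR file, and the event layout is invented for illustration.

```python
def wrong_side_releases(door_events, platform_side):
    """List door releases on the side facing away from the platform.

    door_events: list of (time_s, station, side), side being "left" or "right".
    platform_side: dict mapping station -> "left", "right" or "both"
    (a hypothetical lookup, not part of the OTDR data).
    """
    flagged = []
    for time_s, station, side in door_events:
        expected = platform_side.get(station)
        if expected is None:
            continue  # Unknown platform layout: this release cannot be judged.
        if expected != "both" and side != expected:
            flagged.append((time_s, station, side))
    return flagged
```

Releases at stations with no known platform side are skipped rather than flagged, so the output is a list of candidates for review and not a confirmed count.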


Table 2. Maximum speed at AWS horn prior to red aspect.

Journey number    Number of red aspects    Maximum train speed approaching a red aspect (mph)
Journey 1         0                        NA*
Journey 2         2                        11
Journey 3         1                        14
Journey 4         1                        11
Journey 5         1                        14
Journey 6         2                        13
Journey 7         1                        14

*The driver did not encounter any red aspect signal during his/her journey.

Figure 5. Wrong side door release.


5 DISCUSSION 


Considering the growing interest in harvesting data sources such as OTDR, the development of DCPIs that support that direction is essential. The DCPIs proposed in this paper were developed not only from the available OTDR data but are also supported by official documents such as the Rule Book and a number of professional driving policy documents.

The technological approach described offers 
sensible solutions for the extraction of DCPIs from 
OTDR; it offers rapid analysis of the driver per- 
formance in contrast to in-cab-riding assessment 
by a driver manager that normally takes place every 
six months and only provides the opportunity for 
a driver to be assessed under a limited range of 
conditions. Furthermore, the in-cab-riding assess- 
ment may cause drivers to behave differently under 
observation, whilst DCPIs can be calculated with- 
out disrupting the driver. 

DCPIs do not pass judgement about what is ‘good’ or ‘bad’ but illustrate how data can be extracted in a sensible way from a huge dataset. For qualitative judgement, the allocation of indicators may need further discussion in the implementation stage to take into account the points of view of the related parties.

A few numerical examples of DCPIs are presented to illustrate the practicality of using OTDR to calculate DCPIs. However, the data sample used was too small to detect any trend or best-practice rule.


6 CONCLUSION 


In this paper, a train driver competency framework was introduced to outline the main areas of the driver function, using documentary analysis (e.g. TOCs’ professional driving policies and the Rule Book) in addition to interviews and cab-rides. A number of DCPIs have also been proposed to assess driver performance under real-life conditions using OTDR data. OTDR offers a rich source for developing a comprehensive list of behavior aspects related to driver performance.


ABSTRACT: During the last decade, increasing attention has been focused on environmental protection. For instance, the ecological effects of hydrocarbon releases in the sea are of paramount concern.
One way to assess their environmental impact is to consider the amount of pollutant discharged. Effec- 
tive early detection would help in revealing spills in advance and take the necessary mitigating measures 
to contain the released volume. Standards and guidelines are established for developing effective sensor 
networks in the subsea templates for monitoring purposes and data collection. Sensors provide a het- 
erogeneous amount of information about the template they are monitoring. According to recent studies 
on risk assessment, the level of knowledge about a specific system is an intrinsic feature that should be 
considered during the assessment and evaluation phases for better managing potential increments of the 
risk level. The information provided by sensor networks may be used in this perspective. Sensors may be 
functionally placed in fault tree analyses and update the information about frequency deviation. The work 
in this paper is focused on risk management using such information from subsea sensor networks. A real 
reference case from the oil and gas industry located in an environmentally sensitive area on the Norwegian 
Continental Shelf is provided for testing the suggested approach. The case study refers to subsea monitor- 
ing of oil leakages from the wellhead templates. Insights from the case study highlight how sensor data 
analysis may improve risk management and support operational decision making. 


1 INTRODUCTION


Adding dynamicity to risk assessment and management is a main challenge that today’s researchers have to face. A quantitative assessment of the level of risk for a production installation is required by law, but it is usually performed during the design phase. Effective support during operations is missing (Villa et al., 2016). The chemical and petrochemical industry requires tools and methods to update the risk picture on a real-time basis and thereby improve risk management (Paltrinieri and Khan, 2016). Different approaches have been suggested to dynamically update the risk level. Some of these are based on Bayesian networks (Khakzad et al., 2016, 2014) while others are proactive approaches based on indicators (Paltrinieri et al., 2016).

In this perspective, the Dynamic Risk Management Framework (DRMF) has been developed (Paltrinieri et al., 2014). Figure 1 shows the DRMF. DRMF focuses on the continuous systematization of information on new risk evidence. As shown in Figure 1, its shape opens the process to new information and early warnings by means of continuous monitoring. Side information is an input to each step of risk management through communication and consultation.

Figure 1. Dynamic risk management framework, clockwise (adapted from Paltrinieri et al., 2014).

The available information provided by different sources, such as monitoring and control devices, but also training reports and audits, should be included and exploited when assessing the risk level during operations. As suggested by Aven and Krohn (2014), a new dimension should be added to the definition of risk from Kaplan and Garrick (1981). As shown in Eq. (1), risk (R) is a function of the identified scenario (s), its probability (p), its consequences (c) and of what Aven and Krohn (2014) define as level of knowledge (k).


$R = f(s, p, c, k)$    (1)


The level of knowledge for a specific system is 
an intrinsic feature that should be considered dur- 
ing the assessment and evaluation phases for better 
managing potential increments of the risk level. 

The information provided by sensor networks 
may be used in this perspective. Sensors may be 
functionally placed in fault tree analyses and 
update the information about frequency deviation. 
The current analysis in this paper refers to subsea 
detection networks. 

Subsea leak detection is a considerable challenge for the offshore oil and gas industry, although the main concern for subsea templates is blow-out. As shown by the Macondo accident (Deepwater Horizon Study Group, 2011), the effects of a well blow-out, due to the large amounts of spilt crude, are catastrophic from human, environmental, economic and reputational points of view.

However, as offshore oil and gas production is moving north, towards sub-Arctic and Arctic areas, monitoring and control of crude oil spills are becoming critical issues. These areas are environmentally sensitive (Larsen et al., 2004) and specific requirements (DNV-GL, 2012) must be met during production. For instance, the Barents Sea area is recognized by the World Wildlife Fund (WWF) as critically sensitive from an environmental point of view (Larsen et al., 2004) due to:


— Naturalness; 

— Representativeness; 

— High biological diversity; 

— High productivity; 

— Ecological significance for species; 

— Source area for essential ecological processes or 
life-support systems; 

— Uniqueness; and 

— Sensitivity. 


The current development of large oil and gas templates in the Barents Sea may lead to severe pollution and increased risks of large oil spills (Bioforsk Soil and Environment, 2006), constituting a major threat to the biodiversity of this particularly sensitive area.

Detectors are required to show high sensitivity to small amounts of leaking hydrocarbons and to detect a spill within a reasonable time interval. This is the basis for early detection systems. The threshold value of the leakage rate to be detected by the sensors is a critical parameter that influences the choice and the cost of the device. Furthermore, the detectors have to be available and reliable when in place to effectively provide information to the topside control room. Fault log information may be gathered to evaluate the extent to which the sensor measurements can be trusted.

Moreover, it would be preferable to locate the 
leakage source through the detection system. Col- 
lecting information about where the template is 
spilling oil is useful for both intervention and con- 
sequent maintenance activities. 

The contribution in this paper addresses the 
main challenges related to subsea oil detection 
coupled with risk management for a real case of 
an oil and gas Floating, Production, Storage and 
Offloading (FPSO) unit located in the Barents Sea. 
Available sensor network information is used to 
support risk management. 

The paper is organized as follows: Section 2 provides some fundamentals of signal processing useful for a comprehensive understanding of how the subsea leak detector network works. The case study is described in Section 3, which includes the legislative requirements and the characteristics of both the subsea template and the sensor network. The results of the study and their discussion are provided in Sections 4 and 5. The paper ends with conclusions in Section 6.


2 FUNDAMENTALS OF SIGNAL 
PROCESSING FOR OIL DETECTION 


The purpose of the detection system is to reveal hydrocarbon spills in the sea from the subsea equipment. For the sake of simplicity, this work addresses the oil leakage event in a binary way: the presence of a release is associated with the state $H_1$, while its absence is associated with the state $H_0$. Sensors detect the presence ($H_1$) or absence ($H_0$) of oil leakage from the wellhead. A sensor’s local detection is performed by comparing the registered signal with a fixed threshold.

Typically, multiple distributed sensors are in place to detect the oil leakage. Their number is defined as K and each one is equipped with an acoustic transducer. Every i-th sensor makes a local decision, $y_i$, and this signal is transmitted to a fusion center (FC), which takes a (theoretically more reliable) global decision, d, about the presence or absence of the binary event. The global decision is derived by appropriately combining the received information on local decisions from different sensors. This type of architecture is defined as centralized and it is represented in Figure 2 (Salvo Rossi et al., 2016; Salvo Rossi and Ciuonzo, 2015).

Referring to Figure 2, the present study considers that the local decision from the i-th sensor, $y_i$, does not suffer from disturbance or signal attenuation while it is transferred to the FC. The signal transmitted to the FC from the i-th sensor is named $r_i$. Under the assumptions made, the value of $r_i$ corresponds to $y_i$.

Locally, at sensor level, four different decision 
situations may result considering a binary leak 
event. Such decision situations are summarized in 
Table 1. The present analysis assumes that every 
sensor senses autonomously the environment in a 
defined space cell to detect the presence or absence 
of a target (which in this specific case is the pres- 
ence of oil leaking from the template). 

The probabilities of detection ($P_D$), false alarm ($P_F$) and missed detection ($P_M$) are defined according to Equations 2-4:


$P_D = p(y_i = H_1 \mid H_1)$    (2)
$P_F = p(y_i = H_1 \mid H_0)$    (3)
$P_M = p(y_i = H_0 \mid H_1) = 1 - P_D$    (4)

Figure 2. Distributed detection system (K sensors) with fusion center (adapted from Salvo Rossi and Ciuonzo (2015)).


Table 1. Detection and detection errors.

                  DECISION
EVENT     $d = H_0$                         $d = H_1$
$H_0$     Correct decision                  Error type 2: False alarm
$H_1$     Error type 1: Missed detection    Correct decision (detection)


The sensor local performance may be described by means of different parameters. The present work refers to $P_D$ and $P_F$, according to common practice in communication engineering studies (Salvo Rossi et al., 2016; Salvo Rossi and Ciuonzo, 2015). The present work assumes that sensors are independent of each other. Given this hypothesis, $P_D$ and $P_F$ are also stationary and conditionally independent. The sensors within the network are assumed to have identical local performance (homogeneous network).

The detection system performance is evaluated in terms of the global probability of detection, $Q_D$, the global probability of false alarm, $Q_F$, and the global probability of missed detection, $Q_M$. They are defined according to Equations 5-7:


$Q_D = p(d = H_1 \mid H_1)$    (5)
$Q_F = p(d = H_1 \mid H_0)$    (6)
$Q_M = p(d = H_0 \mid H_1) = 1 - Q_D$    (7)


The FC takes the final decision based on the 
received decisions and using a Fusion Rule (FR) 
(Javadi and Peiravi, 2013). This work applies the 
Counting Fusion Rule (CFR). The sum of sensor 
decisions is compared to a specific threshold at the 
FC to make the final decision (Javadi and Peiravi, 
2013). The CFR is a simple and intuitive strategy 
to count the number of reported detections (Niu 
and Varshney, 2008), but it is far from the optimal 
performance (Javadi and Peiravi, 2013). However, 
it is suitable for the purpose of the current analysis 
as it does not require previous system knowledge 
and it provides a good basis for trade-off analysis. 
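
The counting rule can be illustrated with a short Python sketch. It assumes, as the text does, independent sensors with identical local performance, so the number of positive local decisions follows a binomial distribution; the function names and the 0/1 encoding of local decisions are illustrative choices and not part of the cited works.

```python
from math import comb


def counting_rule(local_decisions, threshold):
    """Global decision: 1 (leak) if at least `threshold` sensors report 1."""
    return 1 if sum(local_decisions) >= threshold else 0


def global_probability(p_local, n_sensors, threshold):
    """P(at least `threshold` of n independent, identical sensors report 1).

    With p_local set to the local detection probability this gives the global
    detection probability; with the local false-alarm probability it gives
    the global false-alarm probability.
    """
    return sum(
        comb(n_sensors, k) * p_local**k * (1 - p_local) ** (n_sensors - k)
        for k in range(threshold, n_sensors + 1)
    )
```

Raising the threshold lowers both the global false-alarm and the global detection probability, which is the trade-off the counting rule exposes.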


3 CASE STUDY 


As previously mentioned, this study focuses on the main challenges of subsea oil detection and risk management for a real case of an oil and gas Floating, Production, Storage and Offloading (FPSO) unit located in the Barents Sea. For this reason, the study aims to evaluate whether the facility detection system is able to:


— Improve subsea safety; 

— Reduce environmental impact by controlling the 
released hydrocarbon quantities; 

— Reduce the need for remotely operated vehicle 
(ROV) inspections. 


In particular, the focus of this work is on early 
detection of oil releases in the subsea template on 
the seabed. 


3.1 Regulations and stakeholders 


Companies operating on the Norwegian Continen- 
tal Shelf (NCS) are required to carry out environ- 
mental monitoring to obtain information about the 
actual and potential environmental impact of their 
activities (Norwegian Environment Agency, 2015). 
Different regulations set the requirements for the 
monitoring of petroleum activities. The regula- 
tions relating to conducting petroleum activities 
(The Activities Regulations) (Petroleum Safety 
Authority Norway, 2016a) dedicate Sections 52-57 
to special requirements for environmental moni- 
toring. These requirements include the monitoring 
of the water column and of the benthic habitats, 
as well as the establishment of an effective remote 
sensing system to detect and map acute pollution. 
The Management Regulations (Petroleum Safety 
Authority Norway, 2016b) require in Section 34 
the operators to report the results obtained from 
monitoring of the external marine environment. 
These requirements have to be satisfied during oil 
and gas operations. 

In 2014, a Joint Industry Project (JIP) led by 
DNV-GL was aimed at developing the best prac- 
tices for designing and implementing detection 
systems (Leirgulen, 2014). Twenty key partners 
joined the project, including different operators, 
integrators and suppliers, as well as authorities, 
the Norwegian Ministry of Climate and Environ- 
ment, and the Petroleum Safety Authority Norway 
(PSA) (Leirgulen, 2014). The JIP identified rel- 
evant functional requirements and general specifi- 
cation for a subsea detection system. The outcomes 
are included in the Recommended Practice F302 
(DNV GL, 2016). The key functional requirements 
identified for the subsea detection system by DNV 
GL (2016) may be summarized as follows: 


• Sensitivity to small releases;

• Responsiveness of the detection system;

• Availability and reliability of the leak detector;

• Ability to locate the leakage source.

Therefore, the detection system must satisfy the requirements set by the standard for oil detection in the subsea template, RP-F302 (DNV GL, 2016). The standard sets qualitative requisites to be fulfilled. First, the Best Available Techniques (BAT) approach for leak detection has to be selected. RP-F302 requires a two-step BAT process in which single techniques are assessed first and then different configurations are compared to identify the most efficient in terms of cost and risk reduction.

However, the analysis of the different standards does not provide straightforward guidelines for the positioning of subsea leak detectors. Different configurations have to be assessed and redundancy margins guaranteed. The main purpose of the subsea network is to push the probability of detecting oil releases towards unity.

Different actors are involved in the response when a subsea leak is detected. The topside operators have to gather relevant information and start preliminary mitigation actions. Moreover, the offshore personnel have to consult experts from the onshore department, and notify the coast guard and the airborne in case their intervention is needed. From the topside, it is possible to monitor and control the amount of oil released from the subsea equipment. The production system needs a detailed and reliable picture of the situation in the subsea template in case a shut-down is needed. The economic impact of unplanned shutdowns can be severe for oil and gas companies (Oil and Gas IQ, 2014). Assessing the risk in a detailed way may help to minimize the time (and costs) of unnecessary production stops. Moreover, the effectiveness of the subsea detection system is also critical for limiting the number of unplanned ROV inspections. ROVs are operated by a crew on board dedicated vessels and are usually used for maintenance activities on the subsea templates. ROV inspections are extremely expensive and require specially dedicated expert personnel. A reliable sensor network able to identify releases due to mechanical failures would help to eliminate the costs of unnecessary ROV inspections. With a detection system that works effectively and identifies (and possibly locates) the leakage sources, the number of required interventions from the topside would decrease, leading to a subsequent drop in operation costs.

In addition, different environmental organizations have raised concerns about oil and gas exploration and drilling, particularly in the sensitive Arctic and sub-Arctic areas, which are critical for biodiversity and ecological significance (Greenpeace, 2017). These organizations may affect public opinion towards the environmental protection policy of a company. The impact on the reputation of oil and gas operators may be severe. For this reason, implementation of advanced and effective strategies and technologies for environmental protection should be a main priority for the operator company.

For all these reasons, the stakeholders of subsea oil and gas activities within Arctic and sub-Arctic regions may be the following:


• Offshore operator;
• Production system;
• Onshore department;
• Coast guard;
• Airborne;
• ROV operator;
• Sensor supplier;
• Petroleum Safety Authority Norway;
• Environmental protection agency; and
• Non-Governmental Organizations (NGOs).


3.2 Subsea template overview 


The subsea template on the seabed is a critical area 
of the oil production installation where a high 
number of valves and joint points are located. These 
critical connections may be potential sources of oil 
leakage due to pressure increments during produc- 
tion disturbances and/or mechanical failures. 

Sensors are placed in the template structure for early detection of oil releases. Although different types of sensors may be available, this analysis refers only to acoustic oil leak detectors.

Figure 3 shows the physical elements needed for the detection of hydrocarbon leakage at the subsea wellhead and X-Tree (adapted from Røsby (2011)).


3.3 Sensors characteristics and configurations 


According to RP-F302 (DNV GL, 2016), there are no unique guidelines to locate the sensors in the distributed detection network. The only relevant requirement concerns the use of the BAT approach for early detection of oil releases.

Two types of sensors are considered, named Type A and Type B. They are set to work with the same $P_F$ (equal to $10^{-2}$), as is common practice in telecommunication engineering studies. However, the sensors have different Receiver Operating Characteristic (ROC) curves and this results in different $P_D$. The better-performing sensor (Type A) has a $P_D$ of 0.90 and the other (Type B) of 0.50 (Salvo Rossi et al., 2016).

Figure 3. Detection system for the wellhead and X-Tree (adapted from Røsby (2011)). Panel title: Detection of hydrocarbon leakage at the wellhead and XT. Labels: router (to topside), capacitive sensors, subsea control mode, acoustic sensors.

Table 2 summarizes the characteristics of the 
sensors. 

Detection and its reliability are key parameters 
during oil and gas operations. Reliably assessing 
that a mechanical rupture has happened and that 
the template is leaking is critical in efficient ROV 
intervention management. 

The sensors are placed in two different configurations. The area of interest is organized in structured square cell grids, as shown in Figure 4. The first configuration considers one single sensor for each grid cell defined in the sensed environment (namely, single configuration). In the second configuration, N redundant sensors monitor the presence (or absence) of the target of interest (namely, redundant configuration). Figure 4 is shown as representative.

The case study compares the detection performance of the
distributed sensor network in two cases. The first scenario
refers to the single configuration, using acoustic sensors
with a high detection probability (Type A). The second
considers the redundant configuration, applying
theoretically cheaper and lower-performance sensors
(Type B). The detection performance of the two sensor types
is described in Table 2. The single configuration uses one
sensor of Type A for each grid node described in Figure 4.
The sensor covers the entire grid cell and sends its local
decision about the presence or absence of oil release to the
FC. The redundant configuration applies N sensors of Type B
for each grid cell. The number N of sensors should be
defined to approximately match the detection probability
obtained with a single Type A sensor. The CFR is applied as
fusion rule at the FC. The threshold is set conservatively
to 1. This means that the FC takes a positive decision on
the presence of oil leakage when at least one detector
monitoring the grid cell sends a signal revealing the
presence of the target.


Table 2. Description of detection performance for sensors
Type A and B.

Sensor    P_D     P_F
Type A    0.90    0.01
Type B    0.50    0.01


Figure 4. Sensor grid in the monitored environment.
[Legend: sensor(s) position; detection coverage; monitored
environment; grid cell.]
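
The threshold-1 rule described above can be illustrated with a
short calculation. The sketch below is not the authors'
implementation: it assumes that the sensors in a grid cell
decide independently of each other, so that the number of
positive local decisions is binomially distributed. The function
names are hypothetical; the local probabilities are those of
Table 2.

```python
from math import comb


def fused_probability(p: float, n: int, threshold: int = 1) -> float:
    """Probability that at least `threshold` of n independent sensors fire,
    when each fires with probability p (binomial tail)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(threshold, n + 1))


def smallest_matching_n(p_single: float, p_target: float, max_n: int = 50) -> int:
    """Smallest N for which the threshold-1 rule reaches the target probability."""
    for n in range(1, max_n + 1):
        if fused_probability(p_single, n) >= p_target:
            return n
    raise ValueError("target not reached within max_n sensors")


# Local sensor performance taken from Table 2.
P_D_TYPE_A, P_D_TYPE_B, P_F_LOCAL = 0.90, 0.50, 0.01

n_redundant = smallest_matching_n(P_D_TYPE_B, P_D_TYPE_A)
p_d_global = fused_probability(P_D_TYPE_B, n_redundant)
p_f_global = fused_probability(P_F_LOCAL, n_redundant)

print(f"N = {n_redundant}, P_D = {p_d_global:.4f}, P_F = {p_f_global:.4f}")
```

Under the independence assumption, the same function gives both
the global detection probability and the global false alarm
probability, which makes the cost of the conservative threshold
visible alongside its benefit.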


4 RESULTS 


The current analysis considers a release trend such as the
one shown in Figure 5. It is worth noting that the release
behaviour has been adopted for demonstrative purposes. The
sensors detect noises from the subsea template and record
them above a defined threshold. Some oscillations are
recorded due to pressure variations in the reservoir. In
that case, the pressure is controlled and reset to its
optimal value without any intervention from the topside (see
the first 50 time steps in Figure 5). This trend may also be
due to a slight overpressure scenario developing in the first
year of production, when the pressure in the reservoir is
higher (Kansas Geological Survey, 2000). The oscillations may
result in fatigue of mechanical components and induce a
mechanical failure of a valve in the X-mas tree and
wellhead. The template is then continuously leaking and
needs dedicated inspections and intervention.

The detection and false alarm probabilities are 
calculated using the fusion rule described in Sec- 
tion 2. Table 3 shows that a number of Type B sen- 
sors equal to 4 in the redundant configuration has 
been found to approximately match the detection 
probability obtained with a single Type A sensor. 


Figure 5. Assumed target trend for the present
analysis. [Plot over time steps 0 to 100; horizontal axis:
TIME.]


Table 3. Detection performance of the subsea template
system in two different configurations.

Configuration                SINGLE    REDUNDANT
Sensor type                  Type A    Type B
Number for each grid cell    1         4
P_D                          0.90      0.94
P_F                          0.01      0.04

The P_D in the redundant configuration is slightly
higher than in the single configuration, while
the P_F is four times higher.


5 DISCUSSION 


Table 3 highlights a relevant increase in detection system
performance using redundant “cheap” sensors (Type B). The
detection probability P_D for a single Type B sensor is 0.50
(see Table 2 in Section 3.3), but it is almost doubled in
the redundant configuration (0.94). Moreover, this type of
configuration slightly exceeds the P_D of the single
expensive Type A sensor. Redundant configurations of Type A
sensors may also be considered. However, the increase in
detection performance would not be as relevant as in the
case of Type B sensors, as their performance is already high.

However, the redundant configuration as described in this
work increases the false alarm probability P_F, as shown in
Table 3. This may have a negative effect at the
organizational level. False alarms may result in unnecessary
unplanned ROV inspections and shut-downs, with strong
increases in operational costs. The fusion rule for the FC
adopts a conservative approach with respect to leak
detection, for which the global decision is the presence of
a leak if at least one detector emits a positive signal.
That explains a global P_F for the redundant configuration
of about four times the single-sensor value.

A more sophisticated decision rule should be 
implemented (Javadi and Peiravi, 2013). The sen- 
sor placement should be investigated and opti- 
mized to guarantee early detection and to track the 
oil spill movement in case intervention is needed. 
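
As a hedged illustration of what a less conservative decision
rule could look like, the sketch below sweeps the counting
threshold for the redundant configuration (N = 4 Type B sensors,
local probabilities from Table 2). It assumes independent local
decisions; the function names are hypothetical and the sweep is
not part of the original study.

```python
from math import comb


def k_out_of_n(p: float, n: int, k: int) -> float:
    """Probability that at least k of n independent sensors report a positive."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def threshold_sweep(p_d: float, p_f: float, n: int):
    """Global (threshold, P_D, P_F) for every counting threshold from 1 to n."""
    return [(k, k_out_of_n(p_d, n, k), k_out_of_n(p_f, n, k))
            for k in range(1, n + 1)]


# Type B local performance (Table 2) and N = 4 sensors per cell (Table 3).
sweep = threshold_sweep(p_d=0.50, p_f=0.01, n=4)
for k, p_d_global, p_f_global in sweep:
    print(f"threshold {k}: P_D = {p_d_global:.4f}, P_F = {p_f_global:.6f}")
```

Raising the threshold lowers the global false alarm probability
but also the global detection probability, so the choice of
threshold and of N would have to be made jointly.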

The sensors detect noises from the subsea template and
record them above a defined threshold. Some oscillations
are recorded due to pressure variations in the reservoir.
In such scenarios, the pressure is controlled and reset to
its optimal value without any intervention from the topside.
This trend may also be due to a slight overpressure scenario
developing in the first year of production, when the
pressure in the reservoir is higher (Kansas Geological
Survey, 2000). These oscillations may result in fatigue of
mechanical components and induce a mechanical failure of a
valve in the X-mas tree and wellhead. The template may then
continuously leak, needing dedicated inspections and
intervention.

According to the results of the performed simulations, the
number of missed detections is lower in redundant
configurations. It is possible to identify whether the
release is due to well fluctuations or mechanical failures
by coupling the signal from the FC with pressure data. This
allows early warnings to be recorded and used for risk
assessment and management. The analysis of near-accident
data is a fundamental step in the framework to forecast
likely accident scenarios. The information from sensor
networks may provide a basis for reactively updating the
risk picture of the installation with respect to subsea
leakage risk. For instance, the data from sensors may be
used as evidence in a Bayesian inference network for
updating release probabilities (Paltrinieri and Khan, 2016).
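
A single-node version of such an update can be sketched with
Bayes' rule. This is only a minimal illustration, not the
Bayesian network of the cited work: the prior release
probability is a hypothetical value, the global P_D and P_F are
those of the redundant configuration in Table 3, and successive
FC decisions are assumed to be conditionally independent.

```python
def update_leak_probability(prior: float, alarm: bool,
                            p_d: float = 0.94, p_f: float = 0.04) -> float:
    """Posterior P(leak) after one FC decision, by Bayes' rule."""
    like_leak = p_d if alarm else 1.0 - p_d
    like_no_leak = p_f if alarm else 1.0 - p_f
    evidence = like_leak * prior + like_no_leak * (1.0 - prior)
    return like_leak * prior / evidence


prior = 0.001  # hypothetical prior release probability per time step
after_one_alarm = update_leak_probability(prior, alarm=True)
after_two_alarms = update_leak_probability(after_one_alarm, alarm=True)
after_no_alarm = update_leak_probability(prior, alarm=False)

print(f"one alarm:  {after_one_alarm:.4f}")
print(f"two alarms: {after_two_alarms:.4f}")
print(f"no alarm:   {after_no_alarm:.6f}")
```

With a small prior, a single alarm still leaves the posterior
release probability low, which is one way to see why the false
alarm probability matters for intervention decisions.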

Reliable information from the subsea may also 
improve communication between different stake- 
holders and decision making processes. 


6 CONCLUSION 


The threshold leakage rate of a sensor defines its
sensitivity and therefore its cost. High sensitivity
(detection of lower leakage rates) results in highly
sophisticated sensors with substantial cost. A solution
would be the application of low-cost redundant sensors
located in a specific network in order to perform early
detection. The decision about the presence of oil leakage
into the sea from the subsea template determines the need
for intervention from the topside. Different (internal and
external) stakeholders are involved in oil and gas
facilities. A reliable subsea detection system may help
avoid unnecessary intervention and improve the overall
company risk management. Moreover, every intervention on
the subsea template from the topside involves substantial
costs that may be reduced with a reliable basis of
information.

The analysis suggests the investigation of dif- 
ferent sensor placement configurations in order to 
enhance early detection and oil leakage tracking. 
Further studies should be considered applying dif- 
ferent and more specific decision fusion rules. 

The information from decision making may be
used in updating the risk picture of the installation
and in improving the decision-making process.


ABSTRACT: With the onset of Industry 4.0, several technological possibilities are offered in industry,
such as big data analytics, digital twins and augmented reality. The result is a more digitalised industry
where faster and better decisions are possible. In the long term this should provide more reliable production
with increased plant capacity and reduced downtime. To succeed with these possibilities, a cyber-physical
system (CPS) must be established for the company. Currently, a dedicated framework for CPS is under
development and is expected to be tailored for Norwegian manufacturing. When building on the principles
of Industry 4.0, big data capability with machine learning will be a fundamental model. Nevertheless,
Industry 4.0 should also include other models for big data capability, such as reliability modelling. The
aim of this article is to present the current status of the CPS framework and how it could be implemented
in manufacturing industries. In particular, the article discusses and demonstrates the balance between
machine learning and reliability engineering in big data analytics.


1 INTRODUCTION 


The European competitive advantage is under pressure,
as customer needs, such as improved delivery accuracy of
products, have changed over time (Smart Industry, 2017).
A challenge for future industry in Europe is how to
implement and digitalize equipment and tools for a safe
and reliable environment.

Several initiatives, like platforms for Industry 4.0,
have been established (Kagermann et al., 2013). National
strategic initiatives have also been established, like
“Smart Industry” in the Netherlands (Smart Industry, 2017)
and “Industry, greener, smarter and more innovative” in
Norway (Ministry of Trade Industry and Fisheries, 2017),
where the focus is on adapting systems for data handling
and digitalization. Several important elements can be
related to Industry 4.0, such as predictive maintenance
(McKinsey&Company, 2015). The benefit of predictive
maintenance is improved reliability through the
opportunities offered by big data and statistics, with
continuous real-time monitoring of assets and alerts given
based on pre-established rules or criticality levels
(Pwc, 2017). It remains to investigate in more detail how
reliability engineering methods can also be combined with
big data analytics in order to improve the reliability of
the production plant.

Another important element of Industry 4.0 is
cyber-physical systems (CPS) (Kagermann et al., 2013).
As an overall understanding, CPS are integrations of
computation with physical processes (Lee, 2008). Since
manufacturing is one essential application of CPS
(Lee, 2008), the term cyber-physical production systems
(CPPS) is often used in manufacturing and production
(Monostori, 2014, Lee et al., 2017, Hehenberger et al.,
2016, Monostori et al., 2016).

Although the economic impact of applying CPS in
manufacturing is significant, today's computing and
network technologies may impede the progress towards this
application (Lee, 2008). For example, the “best effort”
approach in networking technologies makes predictable and
reliable real-time performance difficult. Nevertheless,
certain efforts have been made in which structures and
architectures of CPS have been constructed, ranging from
typical sketches with sensors and actuators (Lee, 2010)
towards more generic architectures, both as level-based
CPS (Lee et al., 2015) and as a CPS architecture with
three dimensions (IEC, 2017).

Several challenges have been addressed for CPS, such as
physical critical infrastructure that calls for preventive
maintenance (Rajkumar et al., 2010). It has also been
pointed out as a challenge to have CPS architectures that
are both “globally virtual and locally physical”
(Rajkumar et al., 2010). Another challenge is the need for
standards (Chaari et al., 2016). Although a pre-standard
for CPS has been published (IEC, 2017), the industry has
already started to test alternative architectures
(Lee et al., 2017) in advance of a standard.

In Norway it is of interest to establish a CPS framework
for Norwegian industry. To create such a framework, a
competence project is ongoing in which the framework,
tools and implementation in demonstrators are in progress
(Eleftheriadis and Myklebust, 2017).

The aim of this article is to present the current
status of a Norwegian CPS framework and how it
could be implemented in manufacturing and process
industries.

To achieve this aim, the following sub-objectives are
outlined:


1. Present existing elements for CPS architecture 

2. Present existing CPS architecture for Norwe- 
gian manufacturing and process industry 

3. Evaluate how it can be further developed based 
on existing CPS theory 

4. Propose reliability-based analysis methods and 
technology for the CPS architecture 

5. Discuss how the CPS architecture will be imple- 
mented in Norwegian industry. 


The remainder of this article is structured as follows:
In Section 2 existing elements for CPS architectures are
presented. Based on these elements the Norwegian CPS
framework is constructed in Section 3. In Section 4 three
relevant CPS analysis methods and technologies are proposed
and elaborated: (1) life cycle profit, (2) safety
perspective, and (3) machine learning related to
reliability engineering. Section 5 elaborates how the CPS
framework can be implemented, while concluding remarks are
made in Section 6.


2 EXISTING ELEMENTS FOR CPS 
ARCHITECTURES 


To ensure successful application of the breakthrough
technologies offered in Industry 4.0 in an organisation, a
concrete architecture for CPS must be established. Today,
several architectures for CPS exist. In particular, three
architectures seem to be of relevance in Industry 4.0.

As a first example of CPS architecture, Lee et al.
(2015) have proposed a 5-level CPS architecture denoted as
the 5C architecture. This architecture provides a
step-by-step guideline for rolling out CPS in
manufacturing, with the following levels:


1. Smart Connection level. Implementing the nec- 
essary instrumentation of machines, “plug & 
play” sensors, and wireless communication. 

2. Data-to-Information Conversion level. The data 
collected from the sensors will be input-data for 
several models that provide information such as 
assessment of degradation. 

3. Cyber level. At this level the digital twin of the
plant is established and more advanced analytics is
possible, with assessment of fleets of machines.

4. Cognition level. To support the decision-maker in
making faster and better decisions, proper presentation
of the acquired knowledge is necessary. This level
visualises e.g. future factory performance and key
performance indicators.

5. Configuration level. This level provides feedback
from the virtual world to the physical world based on
decisions made at level 4. This level also self-optimizes
several properties of the plant.


Some successful case studies of the 5C architecture
have recently been conducted, both for ball screw health
monitoring (Lee et al., 2017) and for a wire rod machine
(Rødseth et al., 2016b).

A second proposed architecture for CPS clas- 
sifies the digitalization of Industry 4.0 into two 
types of value chains: Horizontal and vertical 
value chain (Geissbauer et al., 2014). The horizon- 
tal value chain comprises suppliers, the company 
and its customers, whereas the vertical value chain 
comprises activities in the company such as sales, 
manufacturing, service and product development. 

A third proposed CPS architecture is the “reference
architecture model industry 4.0” (RAMI 4.0). Currently, a
PAS (publicly available specification) has been developed
for RAMI 4.0 (IEC, 2017). This specification does not
fulfil the requirements for a standard, but is at least
made available to the public. The core of RAMI 4.0 is to
ensure cooperation and collaboration between technical
assets that have value for an organisation. RAMI 4.0
comprises a CPS architecture visualised with three
dimensions:


1. Layers. In total six layers represent the informa- 
tion relevant for the technical asset: Business, 
functional, information, communication, inte- 
gration and asset. 

2. Life cycle and value stream. This dimension represents
the life cycle of the technical asset.

3. Hierarchy. The hierarchy classifies the enterprise
system into the following categories: Connected world,
enterprise, work centres, station, control device, field
device and product.


RAMI 4.0 is developed from the more general “Generalized
Enterprise Reference Architecture Methodology” (GERAM),
which was later converted to three standards in the late
nineties. GERAM was an extension of Computer Integrated
Manufacturing (CIM) models, which are an early form of
enterprise or business model (Myklebust, 2002). The
integration of GERAM and RAMI 4.0 from (Industrial
Internet Consortium, 2016) shows the building blocks of an
enterprise model that has interrelationships between
organisational, process and product structures. Members of
the organisation are connected to process roles defining
their work tasks. Competences are connected to process
roles, goals are connected to the processes and products,
and resources are connected to processes.

GERAM, and later RAMI 4.0, has a well-structured design
and fits well with the generic demands of product, process
and organisation. The link to manufacturing system theory
is therefore the last approach needed to include the
product configuration and design process of disciplines
like mechanics, cybernetics and material science on the
physical side, and planning activities, economic aspects
and optimization processes on the logical side. This
theory rests on a geometrical foundation, and its methods
are related to concepts of connections. The analysis of
manufacturing systems is the prime area for the usage of
this theory, which is important for bringing a science
base into manufacturing. However, how to succeed in
developing, managing and operating such an enterprise
model is perhaps still the main challenge.


3 CONSTRUCTING A NEW CPS 
FRAMEWORK 


Figure 1 presents the proposed CPS framework tailored
for Norwegian manufacturing and process industry. With the
motivation of establishing a value chain between the
vendor and the user (Geissbauer et al., 2014), a
horizontal value chain has been outlined. In addition,
inspired partly by the 5C architecture (Lee et al., 2015),
a vertical value chain is also proposed to ensure that
data from sensors will lead to smarter decisions. In
total, a CPS framework with two dimensions is developed.


Figure 1. CPS framework.


The horizontal value chain consists of the vendor (A),
the production (B), and the customer where the end-product
is consumed (C). At the vendor the asset is created, and
support is provided by the vendor. The vendor can, for
example, be a machine builder that provides support
through the maintenance programme. As pointed out by
(Smart Industry, 2017), the maintenance could be totally
outsourced, where all maintenance activities are performed
by the vendor and the machine is leased by the user. This
will also require a more strategic alliance with the
vendor (Batran et al., 2017). The production is where the
asset, such as the machine, is operated. At this location,
a dedicated maintenance management function ensures that
the required technical condition of the asset is achieved
with support from both internal and external maintenance
resources. It is a crucial decision for the maintenance
management to establish the most appropriate maintenance
strategy relevant for the vendor. An important issue for
the maintenance management to decide is the correct degree
of maintenance outsourcing. The end-product is located at
the customer, where it is consumed. The customer value
will be influenced by production, where lack of
maintenance can reduce the production assurance and result
in late product delivery. A defect in production can also
remain undetected and finally be discovered by the
customer. With the application of real-time systems,
changes in customer requests can be accommodated.

The vertical value chain consists of six separate levels:
data from assets (I), smart connection to assets (II), a
digital shadow of the data (III), deep knowledge
application (IV), and smart decision application (V).
Level I is where the asset that provides value for the
organisation is located. From the asset all relevant raw
data is collected. The asset is not only the asset created
by the vendor, such as the machine, but also other
technical objects such as data servers, ERP systems,
algorithms and software programs. At this level the raw
data is extracted from the physical assets, e.g. data
captured from a temperature sensor in a machine. The next
level is smart connection (II), where data is extracted
with SCADA and PLC systems and organized in databases such
as ERP. To ensure that all databases can exchange data, a
dedicated OPC UA level is necessary. In level IV, it will
therefore be possible to apply deep knowledge analytics,
where databases at production and vendor can provide
data-driven analytics in e.g. predictive maintenance.

In level V the deep knowledge analytics will
support the decision maker with visualization
and dashboards. This level can be considered to
be a “digital advisor” for the decision maker. As
an example in production, application of key
performance indicators in integrated planning can
support the planner in improving future activities.


4 CPS ANALYSIS METHODS AND
TECHNOLOGIES


4.1 Life cycle profit 


Life cycle profit (LCP) is in this article defined as the
“accumulated profit of a component or system over its
lifetime”. LCP presents the potential financial losses
over the lifetime of a system due to the different time
losses measured in overall equipment effectiveness (OEE)
(Nakajima, 1989). The profit generated from the system
after these losses is then the LCP. Table 1 presents a
proposed correlation between the time losses in OEE and
LCP.
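
To make the relation between OEE and accumulated profit more
concrete, a minimal sketch is given below. OEE is computed in
the usual way as the product of availability, performance and
quality. Everything else is an assumption made for
illustration and not the authors' model: the `Period` record,
the idea that achieved turnover equals maximum turnover scaled
by OEE, and all numerical values.

```python
from dataclasses import dataclass


@dataclass
class Period:
    """One operating period of the asset (all fields are hypothetical inputs)."""
    availability: float     # share of planned time the machine is running
    performance: float      # actual speed relative to ideal speed
    quality: float          # share of good products
    max_turnover: float     # turnover with no time losses
    operating_costs: float  # maintenance and production costs
    capex_share: float      # part of CAPEX allocated to this period


def oee(period: Period) -> float:
    """Overall equipment effectiveness as the product of its three factors."""
    return period.availability * period.performance * period.quality


def life_cycle_profit(periods: list) -> float:
    """Accumulated profit: turnover after time losses minus costs, summed."""
    return sum(p.max_turnover * oee(p) - p.operating_costs - p.capex_share
               for p in periods)


periods = [
    Period(0.90, 0.95, 0.99, 1000.0, 400.0, 100.0),
    Period(0.80, 0.90, 0.98, 1000.0, 450.0, 100.0),
]
print(f"OEE first period: {oee(periods[0]):.4f}")
print(f"LCP over both periods: {life_cycle_profit(periods):.2f}")
```

In this reading, each time loss category lowers one of the three
OEE factors and thereby the turnover that remains above the
cost level.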

Figure 2 illustrates the LCP model, modified from
Rolstadas et al. (1999). The area above the line x-x
represents the time losses in accordance with Nakajima
(1989). In addition, this area also distinguishes between
planned and unplanned maintenance. The reason for this
distinction is that some of the planned maintenance
requires a shutdown of the machine and, if necessary, the
production plant. If there were no time losses for the
machine, there would be no area above the line x-x and
maximum turnover would be achieved.

Table 1. Time losses and LCP.

Time loss category    LCP element
Availability          Production
                      Maintenance
                      Resources
Performance           Degraded machine, energy loss
Quality               Value of product before it is scrapped


Figure 2. Life cycle profit model, modified from
(Rolstadas et al., 1999). [Diagram labels: turnover/costs
versus time; loss due to unplanned maintenance; life cycle
profit; production costs; CAPEX; installation; burn-in
period; useful life period; wear-out period; disposal.]


The area below the line x-x represents the costs that
occur during operation of the machine. In this model it is
assumed that capital expenditure (CAPEX) is spread evenly
over the operation time. Both the maintenance costs and
the production costs will decrease at the start and
increase at the end of the lifetime. The curve of line B-B
follows the bathtub curve, named for its characteristic
shape, which is due to the failure rate of the system over
its lifetime (Sintef and Oreda, 2009, Rausand and
Hoyland, 2004). The bathtub curve can be divided into
three specific phases:


— Burn-in period. This is an initial phase with high 
failure rate due to undiscovered defects. This is 
also known as “infant mortality”. 

— Useful life period. This phase is considered to 
be the useful period of the system where the 
failure rate is constant due to the maintenance 
activities. 

— Wear-out period. In this phase, the regular maintenance
activities can no longer keep the failure rate constant
and it will increase until the disposal of the system.
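
The three phases listed above can be sketched with a simple
failure rate function. The construction below is one common way
to obtain a bathtub shape, namely adding a decreasing Weibull
hazard (burn-in), a constant rate (useful life) and an
increasing Weibull hazard (wear-out). It is an illustration
only: all parameter values are hypothetical, and the middle
phase is approximately rather than exactly constant.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull failure rate (shape / scale) * (t / scale) ** (shape - 1), t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t: float, base_rate: float = 0.01) -> float:
    """Sum of a burn-in term, a constant term and a wear-out term."""
    burn_in = weibull_hazard(t, shape=0.5, scale=100.0)   # decreasing in t
    wear_out = weibull_hazard(t, shape=5.0, scale=120.0)  # increasing in t
    return burn_in + base_rate + wear_out


for t in (1.0, 50.0, 110.0):
    print(f"t = {t:5.1f}: failure rate = {bathtub_hazard(t):.4f}")
```

The failure rate is high at the start, lowest in the middle of
the lifetime and rising again towards disposal, which matches
the qualitative shape of line B-B.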


The LCP should be developed by the vendor 
with support from production in the CPS frame- 
work. With support from historical operations and 
loads it will be possible to achieve more accurate 
life cycle profit calculations. 


4.2 Safety perspective 


Regarding the safety perspective, the following
statements are important:


— All corrective maintenance is a deviation from the
required function.

— All maintenance activities have a risk potential.

— Good maintenance is a pillar of effective and
safe manufacturing and production.


The safety perspective is relevant in the
following situations:


— Accidents during maintenance 
— Wrong type of maintenance 
— Lack of maintenance 


Table 2 presents a proposal of how these per- 
spectives are relevant for the CPS framework. 


4.3 Machine learning and reliability engineering 


CPS plant positions analytics at level IV, deep
knowledge. It has been pointed out by the European
Commission that intelligent maintenance systems based on
condition prediction mechanisms with computation of
remaining useful life (RUL) will increase reliability,
availability and safety (EFFRA, 2013). Furthermore, more
sophisticated techniques for cause-effect and trend
analyses are also required. The deep analytics has been
developed as an integrated approach from machine learning
and the need for zero defect manufacturing (ZDM). With an
intelligent sensor system, ZDM can be operated for
short-term, medium-term and long-term decisions in the
EU project IFaCOM (intelligent fault correction and
self-optimizing manufacturing systems) (Rødseth et al.,
2016a). It has also been argued that maintenance could be
one part of the IFaCOM concept. When advancing towards
novel predictive maintenance technologies with
reliability-based maintenance approaches, it is pointed
out that this should include quality-maintenance methods
as well as failure modes, effects, and criticality
analysis (FMECA) (European Commission, 2016). Thus it is
of interest in this article to investigate how FMECA can
be balanced with big data analytics such as machine
learning.


Table 2. Safety perspective in the CPS framework.

Safety perspective    Example of position     Example of application
                      of CPS framework        in CPS
Accidents during      B. Production           Application of augmented
maintenance           V. Smart decisions      reality.
Wrong type of         A. Vendor               Real-time notification to
maintenance           IV. Deep knowledge      vendor in maintenance
                                              engineering.
Lack of               B. Production           Estimation of RUL in
maintenance           IV. Deep knowledge      real-time with machine
                                              learning.

The maintenance model called deep digital maintenance
(DDM) comprises an artificial intelligence module that
tested remaining useful life (RUL) prediction based on a
dataset of simulated run-to-failure degradation data of
jet engines (Rodseth et al., 2017). The output of the
prediction model is the probability that RUL is more than
10 cycles at a specific point in time, denoted as
P(RUL > 10). One cycle is a unit of time, e.g. one week.
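
The quantity P(RUL > 10) can be illustrated with a much simpler
estimator than the artificial intelligence module of the cited
work. The sketch below assumes that a set of RUL predictions (in
cycles) is available for one machine, for example from an
ensemble of models, and estimates the probability as the
fraction of predictions above the horizon, together with a
binomial standard error. The sample values and the function
name are hypothetical.

```python
from math import sqrt


def prob_rul_exceeds(rul_samples, horizon: float = 10.0):
    """Estimate P(RUL > horizon) and its standard error from RUL samples."""
    n = len(rul_samples)
    if n == 0:
        raise ValueError("at least one RUL sample is required")
    p = sum(1 for rul in rul_samples if rul > horizon) / n
    standard_error = sqrt(p * (1.0 - p) / n)
    return p, standard_error


# Hypothetical RUL predictions, in cycles, for one machine at one point in time.
samples = [4, 8, 12, 15, 9, 20, 11, 7, 14, 13]
p_hat, se = prob_rul_exceeds(samples)
print(f"P(RUL > 10) = {p_hat:.2f} +/- {se:.2f}")
```

Reporting the standard error next to the probability gives the
planner a rough indication of how much weight the estimate can
carry.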

The prediction model should also include some error
estimate to indicate the accuracy. Predictive maintenance
should improve the maintenance planning capability in the
organization, so that the plant capacity is increased and
the utilization of maintenance resources is improved. The
latter can be controlled by a capacity overview
(Liebstückel, 2014). Depending on the operational
conditions, the degree of prediction in predictive
maintenance and the available capacity of the craft
technicians, the maintenance window needed by the
maintenance planner will vary. Liebstückel (2014) has
exemplified the maintenance window to be 10 weeks when the
planner shall conduct a capacity evaluation.


Figure 3. Risk matrix from FMECA, adapted from
(Rausand and Hoyland, 2004). [Matrix axes: frequency (from
very unlikely to frequent) versus consequence (up to
catastrophic).]

FMECA evaluates the risk of each failure mode and the
risk reducing measures. The risk of each failure mode may
be positioned in a risk matrix, as shown in Figure 3
(Rausand and Hoyland, 2004). The decision criteria for the
risk matrix are as follows:


— Red area. The risk is unacceptable and risk
reducing measures are required.

— Yellow area. Acceptable level of risk. The risk
should be as low as reasonably practicable. Further
investigations should be considered.

— Green area. Acceptable level of risk. Only consider
keeping the risk as low as reasonably practicable.


When the relevant failure modes have been evaluated in
FMECA, it is further possible to evaluate to what degree
implementation of machine learning in predictive
maintenance can reduce the frequency of each failure mode.
In the risk matrix, there are two failure modes denoted
FM1 and FM2. The failure mode FM1 has unacceptable risk,
whereas failure mode FM2 has acceptable risk but should
still be investigated further for risk reducing measures.
For both FM1 and FM2, predictive maintenance with machine
learning is considered. To reduce the risk for FM1 to the
green area, high accuracy in machine learning will be
required. For FM2, the same accuracy in machine learning
in predictive maintenance is not required to reduce the
risk to the green area.
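
The reasoning for FM1 and FM2 can be sketched with a small risk
matrix classifier. The colour boundaries, the product-based risk
index and the positions assigned to FM1 and FM2 below are all
hypothetical, since the actual matrix of Figure 3 is not
reproduced here. The sketch only shows how the required
reduction in frequency class follows from the starting position.

```python
from typing import Optional


def risk_colour(frequency: int, consequence: int) -> str:
    """Classify a failure mode; both inputs are classes from 1 (lowest) to 5."""
    score = frequency * consequence  # hypothetical risk index
    if score >= 15:
        return "red"
    if score >= 6:
        return "yellow"
    return "green"


def green_frequency(consequence: int) -> Optional[int]:
    """Highest frequency class that is still green, or None if there is none."""
    for frequency in range(5, 0, -1):
        if risk_colour(frequency, consequence) == "green":
            return frequency
    return None


# Hypothetical positions (frequency class, consequence class) of the failure modes.
failure_modes = {"FM1": (4, 5), "FM2": (3, 3)}
required_reduction = {}
for name, (frequency, consequence) in failure_modes.items():
    target = green_frequency(consequence)
    required_reduction[name] = frequency - target
    print(f"{name}: {risk_colour(frequency, consequence)}, "
          f"reduce frequency by {required_reduction[name]} class(es)")
```

A failure mode that starts further from the green area needs a
larger reduction in frequency, which in turn places higher
demands on the accuracy of the predictive model.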

The following criteria must be considered when
evaluating the reduction of frequency due to
implementation of predictive maintenance as a risk
reducing measure:


— The needed maintenance window and the accuracy
of the trained data set.

— The similarity of operational conditions between
the training data and the data used for prediction.


5 ROLLING OUT THE CPS FRAMEWORK 
IN ORGANISATIONS 


From a Norwegian perspective, implementation of a CPS
framework has to be followed up by guidelines and tools
to achieve the expected impact. Norwegian industry is
probably one of the most organized labour markets in
Europe and consists mainly of small and medium-sized
businesses, where the labour policy has for decades been
based on tripartite cooperation between the government,
trade unions and enterprise federations. The result is a
flat structure where the involvement of skilled and
self-dependent workers has been essential for competing
in a global market.

The expectation of digitalizing Norwegian industry is
therefore improved performance and higher productivity.
However, through different maturity mappings, literature
and surveys we can see that the complexity in CPS and
Industry 4.0 is broad. On the one hand, there is an image
of a leadership that requests change but does not find
the right tools. On the other hand, there are impatient
workers with high digital competence and a mismatch of
equipment not prepared for digitalization (Eleftheriadis
and Myklebust, 2017).

The development of regulated safety and quality cultures
is one of the benefits of such an organised labour market,
where structures for reliable quality systems, preventive
maintenance and management methods are implemented and
where improvement is a part of the organised culture.


6 CONCLUDING REMARKS 


The aim of this article is to present the current status
of the CPS framework and how it can be implemented in
Norwegian industry. Based on sound concepts of CPS theory,
a framework that is both vertically and horizontally
integrated was proposed for implementation in Norwegian
industry. The analysis methods LCP and FMECA and different
technological applications for the safety perspective were
also recommended for the CPS framework.

The benefit of the CPS framework is that it can 
integrate all relevant data at sensor level up to deci- 
sions at plant level and at the same time connect 
the horizontal value chain including the machine 
builder, industrial user of the machine and the 
customer that consumes the end-product. As an 
impact for the industry it is expected that the value 
creation of this framework will be measured in 
terms of improved asset utilization with improved 
availability as well as improved product quality 
with reduced scrappage. 

A CPS plant will require parallel work on both vertical
and horizontal integration of the CPS framework. Further
work on vertical integration will require a specification
of data capture, including a detailed specification of the
sensors to be applied in the project. For horizontal
integration, the interfaces in the horizontal value chain
should be identified and the value mapped across companies.


Further work on the safety perspective would be to build
a list of recommended applications in the CPS framework
based on Table 2 and to evaluate the reduction of risk.
For the LCP, the horizontal value chain should be mapped
when the vendor develops the LCP with support from
production. The further development of FMECA would require
more cooperation between reliability engineering and
machine learning. In detail, this would require simulation
in which trained machine learning algorithms, with a known
accuracy, calculate the failure rates used as input for
the risk matrix.

For the implementation of the CPS framework, further work
would require a detailed road map for Norwegian industry
based on findings from demonstrations of the CPS framework,
as well as the involvement of several Norwegian companies
within the manufacturing and process industries.
ABSTRACT: In recent years, analysis methods using sensor data and record data acquired through the
provision of maintenance services for social infrastructure equipment have attracted considerable
attention. We focus on optimal replacement of elevator components. Elevators operate continuously for
long periods without an operator and account for a high proportion of social infrastructure equipment.
We define five steps in the analysis of maintenance services for social infrastructure equipment. We
have developed an analysis system consisting of life-limited component analysis, replacement planning
simulations, and service performance analysis. The analysis system combines machine learning with
functions such as ontology processing, text mining, and facility-type clustering in order to handle
various types of facility data.


1 INTRODUCTION 


In recent years, analysis methods using sensor and 
record data acquired through maintenance services 
for social infrastructure equipment have attracted 
considerable attention (Mobley, K. 2008, Narayan, 
V. 2004). Generally, social infrastructure equipment 
such as elevators and escalators require inspections 
and repairs by experts with special knowledge and 
skills. Experts visiting a site to maintain equipment 
usually prepare maintenance reports. In this paper 
we consider maintenance service data accumulated 
from elevators for a period of over twenty years. 
Elevators are in continual operation without operators
and are a very common form of social infrastructure
equipment. Since there are various types of elevators
installed in various building environments, it is
important to consider these variations when analyzing the
elevator data. Experts have compiled large datasets while
providing elevator maintenance services. Elevators have
thousands of components and database formats change over
time, and there are also variations in how individual
experts record maintenance data.

[Figure 1 labels: diagnosis of functions; identification of
factors; prediction of irregularities; planning of
maintenance; maintenance operation; maintenance simulation;
maintenance KPI analysis.]

Figure 1. Integration of analyses for maintenance
optimization.

The remainder of this paper is organized as follows. In
Section 2, we describe an integrated analysis system for
maintenance optimization. In Sections 3 to 5, we describe
the elements of the system. Finally, we present
conclusions in Section 6.


2 FAILURE ANALYSIS 


We define five steps in the analysis of maintenance 
services for social infrastructure equipment: (1) 
function diagnosis, (2) identification of factors 
that account for irregularities, (3) irregularity pre- 
diction, (4) maintenance planning based on predic- 
tions, and (5) verification of facility performance. 
The Plan-Do-Check-Act (PDCA) cycle is a widely
known method for facilitating management tasks. 
This study aims to improve maintenance services 
by developing data mining methods and simula- 
tion methods that consider the five analysis steps 
and the varied forms of elevator maintenance data. 
We pay particular attention to replacement planning for
elevator components. We have developed an analysis system
with three parts: life-limited component analysis (Yano, T.
et al. 2013), which uses statistical survival analyses
(Nelson, W. 2011), a naive Bayes model (Hsu, C.N. et al.
2000), and mixture survival analyses; replacement planning
simulations with facility-type clustering; and facility
and service performance analysis, which applies a naive
Bayes model and ontology processing to maintenance record
data and troubleshooting data. To accommodate various
facility data, the analysis system combines machine
learning with functions such as ontology processing, text
mining, and facility-type clustering.


3 CONSTRUCTION OF SURVIVAL MODEL 


This section describes failure analysis in our sys- 
tem for diagnosing component failure, identify- 
ing equipment attributes that affect failures, and 
predicting failures. Survival analysis is generally 
known as a statistical modeling method for irregu- 
larities, commonly used in medicine, reliability 
engineering, and other fields. From the survival 
model of a component, we can calculate the prob- 
ability of survival at a future time or after some 
number of uses following installation. Construct- 
ing a survival model requires “right-censored data,” 
in which conditions are specified when the com- 
ponents are replaced. We used survival analysis to 
construct survival models of elevator components 
from maintenance records in order to apply those 
models to prediction of component failure. The 
maintenance records were composed of equipment 
master data, troubleshooting data, and regular 
operation data. However, to specify the conditions
under which components were replaced, it was
necessary to use handwritten troubleshooting
reports, whose wording varies from report to
report.


[Figure 2 field labels: trouble date; replacement component
ID; trouble classification; treatment classification.]


To collect troubleshooting data associated with 
a target component, we use a naive Bayes model 
often used for text mining. Figure 2 shows a sam- 
ple of troubleshooting data, which includes the 
date of occurrence, replacement component ID, 
trouble classification, treatment classification, 
and a detailed description. To allow the model to 
search for motor failure in troubleshooting data, 
we added “trouble classification” and “treatment 
classification” categories to the learning data as 
explanatory variables. Additionally, keywords 
such as “motor” and “electric mt” (mt: motor) in 
“detail description” are automatically extracted 
from maintenance reports in the troubleshooting 
data and added to the training data. A naive Bayes 
model is constructed from the data using positive 
and negative examples. The models can be used to 
search data sorted from high to low relevance and 
to select from among them based on a probability 
threshold parameter. Figure 2 shows how process- 
ing occurs. The naive Bayes failure search model 
thus uses threshold parameters to collect various 
component failure data. 
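
The failure search described above can be illustrated with a minimal sketch. It assumes a multinomial naive Bayes classifier with Laplace smoothing; the record fields, the sample reports, and the 0.5 threshold are hypothetical and are not taken from the authors' system.

```python
import math
from collections import Counter

def tokens(record):
    # Classification fields become single tokens; the free text is split on spaces.
    return ([f"trouble={record['trouble']}", f"treatment={record['treatment']}"]
            + record["detail"].lower().split())

def train(records, labels):
    # Count tokens separately for positive (True) and negative (False) examples.
    counts = {True: Counter(), False: Counter()}
    docs = Counter(labels)
    for rec, lab in zip(records, labels):
        counts[lab].update(tokens(rec))
    vocab = set(counts[True]) | set(counts[False])
    return counts, docs, vocab

def prob_positive(model, record):
    counts, docs, vocab = model
    log_score = {}
    for lab in (True, False):
        total = sum(counts[lab].values())
        score = math.log(docs[lab] / sum(docs.values()))
        for tok in tokens(record):
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((counts[lab][tok] + 1) / (total + len(vocab)))
        log_score[lab] = score
    top = max(log_score.values())
    weight = {lab: math.exp(s - top) for lab, s in log_score.items()}
    return weight[True] / (weight[True] + weight[False])

def search(model, records, threshold=0.5):
    # Rank from high to low relevance, then keep records above the threshold.
    scored = sorted(((prob_positive(model, r), r) for r in records),
                    key=lambda pair: pair[0], reverse=True)
    return [(p, r) for p, r in scored if p >= threshold]

# Hypothetical training reports: True marks a motor failure.
training = [
    ({"trouble": "no_start", "treatment": "replace", "detail": "motor burned out"}, True),
    ({"trouble": "noise", "treatment": "replace", "detail": "electric mt bearing worn"}, True),
    ({"trouble": "door", "treatment": "adjust", "detail": "door sensor misaligned"}, False),
    ({"trouble": "button", "treatment": "replace", "detail": "call button stuck"}, False),
]
model = train([r for r, _ in training], [lab for _, lab in training])

reports = [
    {"trouble": "noise", "treatment": "replace", "detail": "motor noise"},
    {"trouble": "door", "treatment": "adjust", "detail": "door sensor dirty"},
]
hits = search(model, reports, threshold=0.5)
```

Lowering the threshold returns more candidate reports at the cost of more false matches, which is the trade-off the threshold parameter controls.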

We next describe construction of component 
survival models. The upper part of Figure 3 shows 
right-censored data for survival analysis, where 
the target component is replaced twice in equipment 1
and once in equipment 2. In the figure, “N”
means normal and “F” means component failure 
at replacement. “N” is collected from regular oper- 
ation data and “F” is collected using failure search 
model from troubleshooting data. The short line 
on the left indicates the start of maintenance serv- 
ice, and the line on the right indicates the analysis 
date on which the data were collected for survival 
analysis. There are only three cases (solid lines) in
the component replacement database. Right-censored
data must include six cases in addition to the
above three cases, as indicated by the broken line.
These three cases are assumed to be under normal
conditions when we analyze the data. The solid
curve in Figure 3 shows a survival model obtained
from these right-censored data.

Figure 2. Component failure search model using text mining.

Figure 3. Censored data with attribute data, and survival models.
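
As an illustration of how a survival curve can be obtained from right-censored data, the following minimal sketch uses the Kaplan-Meier estimator. The source does not name the estimator it uses, so this choice is an assumption, as is treating every case that did not end in failure ("N" or still in service at the analysis date) as censored. The durations are made-up values.

```python
from collections import Counter

def kaplan_meier(durations, failed):
    """Return [(t, S(t))] at each distinct failure time.

    durations: time in service for each case
    failed:    True if the case ended in failure ("F"), False if censored
    """
    failures = Counter(t for t, f in zip(durations, failed) if f)
    curve, survival = [], 1.0
    for t in sorted(failures):
        # Cases still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        survival *= 1.0 - failures[t] / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical cases: years in service and whether the case ended in failure.
durations = [4, 6, 3, 7, 5, 8]
failed = [True, False, True, False, True, False]
curve = kaplan_meier(durations, failed)
for t, s in curve:
    print(f"S({t}) = {s:.3f}")
```

Censored cases never lower the curve directly, but they stay in the at-risk count until their own duration is reached, which is why they must be included alongside the recorded replacements.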

We adopt mixture survival analysis for components with
many failure cases. Figure 3 shows differing equipment
conditions such as number of starts and rated elevator
speed. Under such differing conditions, components do not
all deteriorate over the same fixed period. We thus split
equipment into groups and create a component survival
model for each group. One such group consists of
components with a higher tendency to fail.

As described above, we identify component failures from
similar maintenance reports extracted using a naive Bayes
model. We then generate right-censored data, conduct
mixture survival analyses, and construct survival models
to predict irregularities and to identify factors that
account for irregular functioning. We constructed about
100 fault-search models and about 2000 survival models
using these data mining methods.


4 MAINTENANCE PLANNING 
SIMULATION 


This section describes the maintenance simulation 
used in our analysis system for maintenance plan- 
ning. For maintenance planning using component 
survival models, we adopted simulation for esti- 
mating equipment maintenance costs and failure 
rate. This simulator should be able to accommo- 
date various equipment types and components. 
Obtained simulation results can be applied to com- 
ponent replacement planning or budget manage- 
ment at individual branch offices. 

Figure 4(a) shows an actual system in which users can
visually grasp survival models. The left side of Figure 4(a)
lists elevator types. When a user selects an elevator type,
a list of components that make up the selected type is
shown on the right. When the user selects a component from
the list, a maintenance rule editor for that component is
shown in the center. Users can configure the maintenance
rules to obtain the desired simulation results and can set
acceptable component replacement years through trial and
error. The survival models of individual components are
shown on the right. Component IDs and component names are
shown in the maintenance rule editor. Users can specify the
elevator types to which a rule applies. They can also
configure replacement operations, component acquisition
costs as replacement cost, and maintenance type, and can
perform queries using elevator attributes or replacement
years. When a simulation is executed after these items are
configured, life-cycle maintenance costs and component
failure rates are calculated using the survival models, the
component ontology, and the elevator master data containing
the attributes. Relations between equipment types and
mounted components are organized in the component ontology
data. Equation (1) gives the replacement cost used for
preventive maintenance:


Replacement cost = Acquisition cost + Replacement operation cost    (1)
Figure 4. Life-cycle model and maintenance plan simulation. (a) Maintenance planning simulator. (b) Equipment
grouping. (c) Life-cycle simulation.


To calculate simulation results rapidly for various
elevator types, we used elevator master data in advance to
find groups with similar attributes and adopted these
patterns for the principal elevator types. Figure 4(b)
shows the simulation process for maintenance planning using
equipment groups. When running a simulation, it is
necessary to know in advance the number of elevators
belonging to a group and the patterns of principal types.
It is also necessary to calculate the number of components
attached to each elevator and to multiply this by the
number of elevators to calculate the maintenance costs.
After generating patterns for principal types, we proceed
to life-cycle simulation. With these methods, simulation is
faster than calculating the data for each elevator
individually.

Figure 4(c) shows the calculation of life-cycle 
maintenance costs and component survival rates 
using the survival model. In the graph, the hori- 
zontal axis represents elevator life-cycle time and 
the vertical axis represents component survival 
rate. As the graph shows, the component survival 
rate gradually decreases. Area S represents the 
component failure rate, and dividing this by life- 
time L gives the annual component failure rate. 
Costs are incurred at each replacement interval set 
by the users according to the maintenance rules. 
Life-cycle maintenance costs represent total costs. 
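
The trade-off between replacement interval, failure rate, and cost can be sketched as follows. This is a simplified reading, not the authors' simulator: it assumes a Weibull survival model, a fixed preventive replacement interval, and at most one failure counted per interval. All parameter values are hypothetical; only the replacement cost formula comes from Equation (1).

```python
import math

def weibull_survival(t, scale, shape):
    """Survival probability S(t) under an assumed Weibull model."""
    return math.exp(-((t / scale) ** shape))

def simulate_plan(lifetime, interval, scale, shape, acquisition_cost, operation_cost):
    # Equation (1): replacement cost = acquisition cost + replacement operation cost.
    replacement_cost = acquisition_cost + operation_cost
    # Scheduled replacements over the elevator life cycle.
    n_replacements = math.floor(lifetime / interval)
    # Probability that a component fails before its scheduled replacement.
    p_fail = 1.0 - weibull_survival(interval, scale, shape)
    expected_failures = n_replacements * p_fail
    return {
        "annual_failure_rate": expected_failures / lifetime,
        "life_cycle_cost": n_replacements * replacement_cost,
    }

# Two candidate maintenance rules for the same hypothetical component.
plan_a = simulate_plan(30, 10, scale=20.0, shape=2.0,
                       acquisition_cost=1000, operation_cost=200)
plan_b = simulate_plan(30, 15, scale=20.0, shape=2.0,
                       acquisition_cost=1000, operation_cost=200)
print(plan_a)
print(plan_b)
```

Under these assumptions the shorter interval costs more over the life cycle but gives a lower annual failure rate, which is the kind of comparison users would make through trial and error.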


Users can configure complex maintenance rules in the
mixture survival model using attributes such as number of
starts and rated speed. The simulation must treat operation
data from more than one hundred thousand elevators of many
types. There are too many elevator types to perform exact
simulations for all types because of the amount of time
needed. Therefore, we prepared about 10 to 20 typical
groups using ontology processing to represent elevator
types, principal patterns, and the number of standard
components by elevator type. Utilizing these functions,
organization staff can analyze information associated with
component replacement based on actual maintenance data to
determine repair and maintenance strategies. This is
expected to aid in drafting plans for better maintenance
services. By applying operation data such as those shown in
Figure 4, users can analyze and update current maintenance
records every day.


5 MAINTENANCE KEY PERFORMANCE 
INDICATOR ANALYSIS 


As the previous section shows, we have focused on elevator
component replacement planning that uses cost and failure
rate simulations based on a survival model. This section
describes the Key Performance Indicator (KPI) analysis that
our analysis system applies to maintenance. It checks the
facility performance of maintenance services in order to
verify the consequences of a component replacement plan
obtained through simulation. Performance indicators need to
include the number of component issues that arise when
maintenance rules are changed. Downtime, maintenance costs,
and workloads are frequently used service indicators for
maintenance. We developed analysis functions to calculate
trouble incidence rates and component troubleshooting time.

Figure 5 shows the architecture of the main- 
tenance KPI analysis system. These KPI analysis 
functions use maintenance records such as oper- 
ating data, troubleshooting data, and equipment 
master data. Maintenance staff record regular 
inspections and maintenance reports along with 
troubleshooting reports describing situations of 
urgent site visits for unplanned maintenance. These 
data include handwritten reports as described in 
Section 3. Because there are various elevator types
with tens of thousands of components each, the 
data analysis requires summarization methods. For 
this purpose, we prepared various ontology proc- 
esses and failure search models using naive Bayes 
modeling. 


[Figure 5 labels. Inputs: operating data; troubleshooting
data; equipment master data. Ontologies: equipment
component; operation group; maintenance area; equipment
type. Component failure search models.]


Figure 6 shows a sample of the ontology 
processing function. The central table lists the 
replaced components. We created an equipment 
component ontology to determine equipment and 
components with higher tendencies for failure. 
The ontologies express thousands of component 
groups in four stages, allowing users to summarize 
large classifications such as “machine room” or 
“equipment type” or sub-stages lower in the hier- 
archy. This allows analysis of various component 
types among similar groups. 
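
A minimal sketch of this kind of roll-up is shown below. It assumes each component ID maps to a four-stage path in the ontology; the IDs and labels are hypothetical, and the real ontology data are not described in enough detail to reproduce.

```python
from collections import Counter

# Hypothetical four-stage paths, from the broadest class to the component itself.
ONTOLOGY = {
    "C001": ("machine room", "traction machine", "motor", "motor bearing"),
    "C002": ("machine room", "traction machine", "brake", "brake lining"),
    "C003": ("machine room", "control panel", "relay", "main relay"),
    "C004": ("car", "door", "door operator", "door belt"),
}

def summarize(replaced_ids, stage):
    """Count replacements grouped by the first `stage` levels (1 to 4)."""
    if not 1 <= stage <= 4:
        raise ValueError("stage must be between 1 and 4")
    return Counter(ONTOLOGY[cid][:stage] for cid in replaced_ids)

# Component IDs taken from a hypothetical list of replaced components.
replaced = ["C001", "C002", "C002", "C004", "C003"]
by_area = summarize(replaced, 1)   # broad classes such as "machine room"
by_unit = summarize(replaced, 2)   # one stage lower in the hierarchy
print(by_area.most_common())
print(by_unit.most_common())
```

Choosing a smaller `stage` gives coarse totals, while a larger one drills down to the sub-stages where the failures concentrate.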

We also introduced a failure search model using a naive
Bayes model to treat various handwritten texts. Figure 2
showed a naive Bayes model for searching through
troubleshooting reports of motor failure. We adopted the
same kind of model for searching for similar data in the
maintenance KPI analysis system because the analyzed data
include handwritten reports. For example, natural disasters
such as earthquakes affect the operation of elevators but
do not reflect the diagnosed condition of the equipment.
Therefore, we also constructed a naive Bayes model that
treats “earthquake” as a positive example. The ability to
search for incidents of natural disasters allows users to
analyze incidence rates and to remove data that are related
to natural disasters. Switching components to types with a
longer life also decreased the rate of component failure.
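
The effect of removing disaster-related records on an incidence-rate KPI can be sketched as follows. The `disaster` flag stands in for the output of a search model, and the incident list and exposure figure are hypothetical.

```python
def incidence_rate(incidents, elevator_years, exclude_disasters=True):
    """Trouble incidents per elevator-year, optionally ignoring disaster records."""
    kept = [i for i in incidents if not (exclude_disasters and i["disaster"])]
    return len(kept) / elevator_years

# Hypothetical troubleshooting records; "disaster" marks reports that a
# search model matched to a natural disaster such as an earthquake.
incidents = [
    {"id": 1, "disaster": False},
    {"id": 2, "disaster": True},
    {"id": 3, "disaster": False},
    {"id": 4, "disaster": True},
    {"id": 5, "disaster": False},
]
raw_rate = incidence_rate(incidents, elevator_years=100, exclude_disasters=False)
filtered_rate = incidence_rate(incidents, elevator_years=100)
print(raw_rate, filtered_rate)
```

Comparing the two values shows how much of the apparent trouble rate comes from external events rather than from the equipment itself.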


Figure 5. Maintenance KPI analysis system with ontologies and component failure search models.

Figure 6. Adopting ontology data.


As stated above, we developed a maintenance KPI analysis
system that prompts for updates to the policies and
survival models. System functions include analysis of each
group using ontology data and similarity search using naive
Bayes models. At present, the system is used not only for
survival analysis of components, but also for visualization
of various maintenance service circumstances in order to
optimize field maintenance tasks. It calculates about 1000
KPIs in a few hours every night.


6 CONCLUSION 


This paper described algorithms applied to elevator
analysis systems for prediction-based maintenance planning.
We described failure analysis using a component survival
model, and maintenance simulation that utilizes the failure
analysis. In addition to the simulation, we added a
maintenance KPI analysis that supports the optimization of
maintenance by incorporating long-term maintenance records.
The system is designed for application to various equipment
types and for long-term operation of social infrastructure
equipment. In future research, we intend to investigate
replacement planning based on maintenance planning
simulations and to elucidate factors in component
degradation using maintenance KPI analysis, with the aim of
constructing and adopting an advanced survival model.
ABSTRACT: This paper identifies an extended set of railway Safety Management System (SMS) goals to
manage safety compliance, technological complexities, operational uncertainties and business objectives
from a holistic perspective using an Enterprise Architecture (EA) approach. It also presents a comparative
analysis of EA frameworks for implementation of a railway SMS. We call this system the Safety Enterprise
Architecture System (SEAS). In this technique, a set of selected capabilities based on the proposed railway
SMS goals and system requirements is evaluated from legal, business and information systems perspectives.
The SEAS approach establishes that prevailing new technologies, evolving social realities and
changing market dynamics have made railway safety management more complex. Alongside
regulatory compliance, it should also mandate explicit goals of standardization, innovation, business
transformation, and IT alignment at all levels. It was found that at present no single EA framework
suits the target SEAS approach entirely, but The Open Group standard TOGAF comes closest.


1 INTRODUCTION 


Enterprise Architecture is a strategy for developing an IT
backbone to support the safety infrastructure, which
comprises the railway’s people, processes and systems, in
order to align safety objectives, business requirements and
their complex interfaces through a structured IT system.
For adequate application in railway safety, the
architecture should cover all aspects of safety management
by providing a ‘whole system’ approach to design, planning,
delivery and control as part of a company’s business
(Lodzinski et al. 2007 and RTS & TSLG. 2012). It should
enable safety compliance, facilitate strategic decision
making, support innovation and realize an effective
business transformation (NAS 2014, Aier et al. 2016 and
Cooney & Paxton 2016).

Innovation and business transformation for safety
management systems are part of a continuous improvement
process for Railway Undertakings (RUs) and Infrastructure
Managers (IMs) alike, but they remain fundamentally based
on existing business functions and operations in order to
stay in conformity with the legal requirements
(Lodzinski et al. 2007).

This paper describes an EA strategy to facilitate the IT
business transformation in the organizational pursuit of
institutionalizing safety as an inbuilt quality. Section 2
gives an overview of EA, discusses some popular EA
frameworks and explores the background of railway safety
management. Section 3 lists the target SMS goals. Section 4
discusses the selection of criteria for examining the
proposed strategy. Section 5 presents a comparison table of
some carefully chosen EA frameworks against the selected
capabilities. Section 6 discusses the results and Section 7
ends with concluding remarks.


2 BACKGROUND 


2.1 Enterprise architecture background 


The comprehensive introduction of EA terminology is due to
J. A. Zachman, whose paper “A Framework for Information
Systems Architecture”, published in the IBM Systems
Journal, presented the Zachman grid as an EA ontology
(Zachman 1987). An EA framework presents a skeletal
structure to define suggested architectural artifacts,
their relationships and characteristics. Every EA framework
typically embraces a reference enterprise architecture, a
planning and implementation methodology, guidelines, tools
and a common vocabulary (Ahlemann et al. 2012). Several
alternative EA frameworks have been proposed and evaluated
from various perspectives by academics and practitioners,
each with a different scope and activities, to support all
aspects of the EA lifecycle (Dang & Pekkola 2017, Nikpay
et al. 2015, Susanne & Zellner 2015, Rouhani et al. 2015,
Odongo et al. 2010, and Urbaczewski et al. 2006). Applying
EA to railway safety demands appropriate framework
selection to ensure safety compliance in a rapidly changing
environment of technological complexities, operational
uncertainties and evolving social realities.

Previous studies recognized that railway operational
safety depends on several data-dependent systems including
signaling, infrastructure management, rolling stock,
organizational safety, culture and human factors. Many
accident investigators agree that a lack of linkage between
these systems contributed to the substandard control of
multi-causal accidents (Kyriakidis et al. 2012, Silla &
Kallberg 2011, Evans 2011 and Baysari et al. 2008). The
studies also highlighted weaknesses in the implementation
of railway SMSs, especially in the areas of information
management, organizational structure, roles and
responsibilities, and competence management. They also
observed a significant deficiency of processes for design
and improvement related to risk assessment (Shang et al.
2017). Ultimately, EA paves the way for better integration
of overall data on the railway through alignment of its
business with IT infrastructure.


2.2 Enterprise architecture frameworks 


This paper assesses four major EA frameworks: Zachman,
TOGAF, DODAF and FEAF. They were selected on the basis of
their popularity and customizability. They are described
briefly below.


2.2.1 Zachman framework 
The Zachman framework (Zachman 1987, Sessions 2007)
provides a concise way to structure and model enterprise
architecture. It is a two-dimensional grid with a set of
six perspectives or views and six basic interrogatives. A
perspective constitutes a row that depicts the role of a
stakeholder of the project team. The stakeholders are the
planner, owner, designer, builder, subcontractor and user.
The columns of the grid characterize the information for
each perspective through what (data), how (function), where
(network), who (people), when (timing) and why (motive).
The rows and columns thus intersect to create cells, where
each cell represents an architecture activity depending on
the system’s aspect for a particular stakeholder.

The Zachman framework is the most comprehensive of the EA
frameworks. It offers a series of views and visualization
support as a planning tool to help make better selections
between alternative options. However, it does not focus on
the EA development and governance mechanism. It also does
not provide appropriate software tool configuration or
alignment with innovation and new social realities. The
framework does not explicitly offer support for
nonfunctional requirements or the system development
lifecycle. The Zachman framework, although self-described
as a framework, is more accurately defined as a taxonomy
for organizing architectural artifacts.
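
To make the grid structure concrete, the following minimal sketch builds the cells from the perspectives and interrogatives listed above and files one artifact in a cell. The helper name and the example artifact are hypothetical; the sketch illustrates the taxonomy only and is not part of any Zachman tooling.

```python
from itertools import product

# Rows: stakeholder perspectives as listed in the text.
PERSPECTIVES = ["planner", "owner", "designer", "builder", "subcontractor", "user"]
# Columns: interrogatives and the aspect each one characterizes.
INTERROGATIVES = {
    "what": "data",
    "how": "function",
    "where": "network",
    "who": "people",
    "when": "timing",
    "why": "motive",
}

# Each cell holds the architectural artifacts classified under it.
grid = {cell: [] for cell in product(PERSPECTIVES, INTERROGATIVES)}

def file_artifact(perspective, interrogative, artifact):
    """Classify an artifact under one (perspective, interrogative) cell."""
    grid[(perspective, interrogative)].append(artifact)

# Hypothetical example: a safety policy answers "why" from the owner's view.
file_artifact("owner", "why", "safety policy statement")

empty_cells = [cell for cell, artifacts in grid.items() if not artifacts]
print(len(grid), "cells,", len(empty_cells), "still empty")
```

Listing the empty cells is one simple way to use the grid as a completeness check, which matches its role as a taxonomy rather than a development process.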


2.2.2 TOGAF 

The Open Group Architecture Framework (TOGAF) is very
comprehensive with regard to the actual process involved
(Magoulas 2012, Rouhani et al. 2013). TOGAF’s view of an
enterprise architecture consists of business, application,
data and technical architectures. The most important parts
of TOGAF are the Architecture Development Method (ADM) for
process development, the enterprise continuum for various
architecture views and the knowledge base repository for
resources, implementation guidelines, templates and
background information.

It provides appropriate strategy and governance support
for designing, planning and implementing an architecture
based on the enterprise requirements. It utilizes
appropriate models for both IT and enterprise activities.
The TOGAF ADM is a step-by-step process for creating an EA.
It consists of a preliminary phase followed by eight
transformation phases that guide users through various
levels of architecture maturity in a managed manner.
TOGAF is flexible and allows phases to be performed
incompletely, skipped, combined, reordered, or reshaped as
per the stakeholders’ requirements to fit any
organization’s needs. TOGAF views the world of enterprise
architecture as a continuum of architectures, ranging from
highly generic to highly specific. It documents design
rationale to trace design and architecture decisions.
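
The phase structure and its tailoring can be sketched as follows. The phase names follow the TOGAF 9 ADM; the continuous Requirements Management activity is left out for brevity, and the `tailor` helper is a hypothetical illustration of skipping and reordering, not something TOGAF defines.

```python
# Preliminary phase followed by the eight phases A to H of the ADM.
ADM_PHASES = [
    ("Preliminary", "Preliminary"),
    ("A", "Architecture Vision"),
    ("B", "Business Architecture"),
    ("C", "Information Systems Architectures"),
    ("D", "Technology Architecture"),
    ("E", "Opportunities and Solutions"),
    ("F", "Migration Planning"),
    ("G", "Implementation Governance"),
    ("H", "Architecture Change Management"),
]

def tailor(skip=(), order=None):
    """Return an ADM variant with some phases skipped and/or moved to the front.

    skip:  phase IDs to leave out
    order: phase IDs to run first, in the given order; the rest keep
           their standard sequence
    """
    ids = [pid for pid, _ in ADM_PHASES if pid not in skip]
    if order is not None:
        unknown = set(order) - set(ids)
        if unknown:
            raise ValueError(f"unknown or skipped phases: {sorted(unknown)}")
        ids = list(order) + [pid for pid in ids if pid not in order]
    names = dict(ADM_PHASES)
    return [(pid, names[pid]) for pid in ids]

# Hypothetical tailoring: skip phase E, and run B before A.
without_e = tailor(skip={"E"})
reordered = tailor(order=["Preliminary", "B", "A"])
print([pid for pid, _ in reordered])
```

Keeping the standard sequence as data and deriving variants from it makes each organization's deviations from the default cycle explicit.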

TOGAF lacks instructions that clearly describe the output
specifications of each development cycle, and it does not
define organizational roles and responsibilities. Although
strong on business and architecture perspectives, it is
short on detail regarding planning and maintenance aspects.


2.2.3 DODAF 

The Department of Defense Architecture Framework (DODAF)
was developed by the US Department of Defense to provide a
holistic support platform for the goal transformation of
its agencies. Its overall orientation in solving EA issues
is different. DODAF (Odongo et al. 2010) uses model
templates to collect and disseminate data on a specific
issue, resulting in a view or perspective. DODAF 2.0
prescribes eight perspectives. Its architecture development
process consists of six steps: context definition, scope,
requirements, perspectives, development and application.
DODAF supports the concept, modeling and process phases of
the EA life cycle in an acceptable manner. Its overall
characteristics for implementing the target system are in
the same range as those of TOGAF, although somewhat lower
in grade.

The issue with the framework is that no complete governance
guidance or social, financial and technical analysis is
available to support the system objectives. The lack of
enterprise integration and software configuration support
hinders its application in a dynamic environment. It also
offers limited support for the nonfunctional requirements
of consistency, design traceability and verifiability.


2.2.4 FEAF 

The Federal Enterprise Architecture Framework (FEAF) of the
US government facilitates the shared development of common
processes and information among US federal entities (Odongo
et al. 2010, Sessions 2007). FEAF is based on the Zachman
framework but refers only to the first three columns, which
represent what (data), how (function) and where (network),
and focuses on the top three rows. It provides standard
terms and definitions through the Business Reference Model
(BRM), Components Reference Model (CRM), Technical
Reference Model (TRM), Data Reference Model (DRM), and
Performance Reference Model (PRM), each of which has unique
goals. It uses architecture analysis, architectural
definition, investment and funding strategy, and program
management plan as a four-step process for creating an EA.

FEAF is the most complete of all the method- 
ologies discussed here. It has both a comprehen- 
sive taxonomy, like Zachman, and an architectural 
process, like TOGAF. It supports vision, strategy 
and knowledge base repository for EA planning. It 
allows flexibility in the use of tools and has stand- 
ard practices for interoperability. 

FEAF is primarily a framework for architecture planning
rather than EA development and maintenance. Apart from
system security, FEAF does not explicitly support other
non-functional requirements, and it offers limited support
for plan validation and traceability.


2.3 Safety management background 


Different countries have developed different railway
Safety Management Systems (SMS) based on their operational
circumstances and legal requirements. In Europe, the
systems are guided by the Common Safety Methods (CSM),
which serve as a solid basis to support the design and
implementation of an SMS. This approach has resulted in a
number of SMSs, but a reasonable degree of agreement can be
achieved on what a standard SMS must cover. It shall
contain all the processes and procedures describing
activities related directly or indirectly to railway safety
at both an organizational and an operational level.

Although recent technological advances and the
development of railway SMS guidelines and
standards help operators to perform their duties
efficiently, railway accidents still occur owing to
the complex interaction of component systems and
to interface management shared between many
actors functioning in fragmented collections of
standalone or silo-based systems and processes.
These silo systems include both older systems
developed in house and a number of best-available
commercial off-the-shelf solutions selected to meet
local business needs (Aier et al. 2016, Clayton
2010, Cooney & Paxton 2016). Consequently, to
couple the overall business with current intelligent
communication systems, sophisticated and
extensible customer systems with cross-silo
integration components are purchased at high cost,
without any future roadmap, merely as opportunistic
solutions to keep the system running. This approach
lacks a holistic enterprise perspective of central IT
functionality. Future highly automated railway
operations will require innovation in information
systems for business agility and safety
requirements, leading to standardization and
integration at all levels. An enterprise
architecture, SEAS, can provide the desired
platform to deliver the greatest business value
through reliability, interoperability and safety,
using a whole-system unification of enterprise
entities, railway systems and people.


3 SAFETY ENTERPRISE ARCHITECTURE 


Appraising the proposed SEAS approach for a
railway SMS in ever-changing economic, regulatory
and technical environments involves many
considerations. A typical safety management
system is driven by key guidelines originating from
the CSMs for supervision, monitoring, risk
evaluation and conformity assessment, and from
national legislation for effective safety
management. Key elements of a railway SMS are
listed below (EU Railway Safety Directive 2004):


• Staff commitment
• Transition or migration
• Information management
• Emergency and crisis management
• Compliance assurance
• Accident/Incident reporting
• Continual improvement
• Gradual and step by step implementation
• Roles and responsibilities
• Competence management
• Change management
• Tools for monitoring
• Risk assessment and evaluation
• Strategies to achieve targets
• Documentation
• Assets management and maintenance
• Internal audits
• Standard Glossary


Implementation of these elements is essential for
regulatory compliance, and a railway SMS should
establish specific targets and formulate relevant
assessment criteria for achieving these targets to
ensure safety-compliant operations. In addition to
these legal requirements, a typical railway SEAS
should achieve a number of other goals, listed
below, as part of its strategic plan for a
financially beneficial railway (NAS 2014, Tang
et al. 2004, Aier et al. 2016, RTS & TSLG 2012,
Lange & Mendling 2011, Lim et al. 2009, Magoulas
et al. 2012, EU S2R 2015). The result is a blueprint
for Safety Enterprise Architecture. Ideally, this
blueprint should support the SMS independently of
railway system geography, industry size,
operational domain and architecture style.


• Regulatory Compliance
• Transparency
• Complexity Management
• Business-IT Alignment
• Configuration Management and Control
• Harmonization & Standardization
• Business Transformation
• Innovation
• Risk Control Measure for all risks associated
  with IM/RU, maintenance and assets management,
  subcontractors and other parties, reputational
  risks
• Architecture Maturity
• Agility
• Soft aspects
• Standard Glossary


4 METHOD FOR CRITERIA SELECTION 


This research is divided into foundation
groundwork, identification of system requirements
and SEAS capabilities, and evaluation. For this
purpose, a systematic literature review of relevant
research articles was carried out. A set of research
questions related to SEAS capabilities and
framework selection was formulated and discussed
with railway experts.

RQ1: What system aspects should be considered
for SEAS?

To answer the first question, the SEAS can be
divided into the following five aspects:


• Operational Safety: Railway SMS external goals
  are regulatory driven. The IM/RUs should fulfil
  community requirements to ensure safe railway
  operations through continuous improvement,
  adopting a system-based approach and assigning
  responsibilities (EU railway safety directive
  2004, Lodzinski et al. 2007, ORR 2015).

• Business Support: This aspect covers railway SMS
  internal goals such as cost reduction, new
  business initiatives, improved decision making,
  and risk management (Aier et al. 2011 and
  Lange & Mendling 2016).

• Information Technology System: This aspect is
  related to organization information resources,
  information systems, architecture tools,
  implementation models, software configuration
  and operating platforms (Urbaczewski & Steven
  2006).

• Qualitative or Non-functional: This aspect
  concerns the ability of the system to offer a
  service or product whose characteristics satisfy
  the user's requirements. In the railway SMS case,
  the user's demands must therefore be defined in
  order to understand quality. Under the quality or
  non-functional aspects, the following items are
  collected for EA framework analysis:
  interoperability, flexibility, reusability,
  scalability, standardization, alignment, reduced
  risk, reduced complexity, integration, better and
  faster service, communication, innovation (Lim
  et al. 2009).

• Social and Cultural: Safety has relative levels in
  relation to different situations and environments.
  Also, increasingly unreliable and demanding
  customers require thorough consideration of which
  products or services to offer at any given time to
  ensure safety in operation as well as to mitigate
  reputational risks (Magoulas et al. 2012).

RQ2: What capabilities should the proposed railway
SEAS have?

From the SEAS goals listed in Section 3, in
conjunction with the required system aspects
discussed under RQ1 above, forty-two (42)
capabilities are identified. The system should
attain these capabilities by implementing best
practices and achieving set targets through gradual
transformation and improvement of the system
(Zamermann et al. 2015, Aier et al. 2016, Tung
et al. 2004, Nikpay et al. 2017).

RQ3: Which of the EA frameworks is most relevant
for railway SEAS implementation?

This question deals with the established EA
frameworks in various categories. These include
Zachman from the commercial frameworks, The Open
Group Architecture Framework (TOGAF) from the
open group frameworks, FEAF from the government
frameworks and the Department of Defense
Architecture Framework (DODAF) from the defense
frameworks (Jacco 2014, Nikpay et al. 2017).
TOGAF can be used, with modifications, as a
general framework in any enterprise and therefore
replaces GERAM and RM-ODP in that category.
FEAF is a complete methodology, with a ZF-like
classification and a TOGAF-like structural design
process, and excels over the TEAF in the federally
developed frameworks category (Tang et al. 2004).
DoDAF version 2.0 is an evolution of C4ISR; the
NATO framework was dropped because of its limited
application and replaced by DODAF.


Table 1. Comparison of EA Frameworks for SEAS. 


CAPABILITY ZF TOGAF DODAF FEAF

Strategic Vision and Goals H H
Architecture definition M
Architecture Development Process M
Process Completeness
Architecture Verifiability
Transition Strategy and Plan
Tool Support
Information Reference Resources
Standardization
Step by Step guidelines
Continual
Architecture Evolution Support
Governance guidelines
Integrating enterprise systems
Business Alignment
Socio Cultural Alignment
Functional Alignment
Structural Alignment
Infological Alignment
Contextual Alignment
Dynamic (Innovative)
Business Model
System Model
Information Model
Computational Model
Software Configuration Model
Software Processing Model
Implementation Model
Platforms
Nonfunctional Requirements
EA Security Issues
Views/perspectives
Abstractions
System Development Lifecycle
Change Management
Maintenance Process
Costing and Vendor Supporting
Meta model
Maturity Model
Enterprise Knowledge Base
Taxonomy Completeness
Vendor Neutrality


[Rating columns of Table 1 (H/M/L per framework) are illegible in the source.]

RQ4: How are the different frameworks compared?

Three relative conformance levels (High, Medium
and Low) are defined, based on the support each
framework offers for a capability. If a framework
explicitly supports a capability element in
Table 1, it is marked High (H); if it supports it
partially, it is marked Medium (M); and if it offers
little or no support, it is marked Low (L).


5 EVALUATION OF EA SYSTEMS 


A SEAS capabilities list was developed, as
summarized in Table 1, based on existing research
in this area cited here. It provides a high-level
comparison and analysis of the selected EA
frameworks.

Table 1 summarizes the outcomes for the various
frameworks based on the support they offer for the
capabilities. In the current analysis, considering
only the clearly supportive rating (H) for the
selection of a suitable EA framework, ZF accrued
15 Hs, TOGAF 30 Hs, DODAF 16 Hs and FEAF 13 Hs.
Consequently TOGAF, although it does not meet all
the criteria, emerged as the most suitable choice.
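
The tally described above can be sketched in a few lines of Python. This is only an illustration of the counting rule (one point per H rating, all capabilities weighted equally); the three capabilities and their ratings below are hypothetical placeholders and are not the values of Table 1.

```python
FRAMEWORKS = ("ZF", "TOGAF", "DODAF", "FEAF")

# Hypothetical ratings for three placeholder capabilities.
# These are NOT the ratings reported in Table 1.
ratings = {
    "Capability A": {"ZF": "H", "TOGAF": "H", "DODAF": "M", "FEAF": "L"},
    "Capability B": {"ZF": "L", "TOGAF": "H", "DODAF": "H", "FEAF": "M"},
    "Capability C": {"ZF": "M", "TOGAF": "H", "DODAF": "L", "FEAF": "H"},
}


def count_high(ratings):
    """Count, per framework, how many capabilities are rated High (H)."""
    counts = {name: 0 for name in FRAMEWORKS}
    for per_framework in ratings.values():
        for framework, level in per_framework.items():
            if level == "H":
                counts[framework] += 1
    return counts


def best_framework(ratings):
    """Return the framework with the largest number of H ratings."""
    counts = count_high(ratings)
    return max(counts, key=counts.get)


high_counts = count_high(ratings)
selected = best_framework(ratings)
```

Because only H ratings are counted, a framework with many M ratings gains nothing over one with many L ratings; a weighted score would be needed to reflect that difference.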


6 DISCUSSION AND LEARNING 


The current study provides a platform for
discussion and reconsideration of railway SMS
goals and implementation approaches.


6.1 Revamping railway safety management 


The current study introduces the business and soft
aspects of railway safety management in pursuit of
a whole-system approach. The typical safety
management systems currently in use in the rail and
metro industry rely on hard aspects and focus on
safety compliance in supervision, monitoring,
maintenance and improvement within their existing
operational boundaries, which is an inappropriate
and outdated approach in a rapidly changing
environment. Today the railway, like other
safety-centric industries such as the airline and
space industries, is changing hands from public to
privately operated corporations. The drivers behind
these handovers are fundamentally business
objectives. Market competition and the threat to
survival compel every entity to deliver customer
satisfaction with excellence in operations, with
safety at the center. We therefore suggest
revamping the European Railway Agency SMS wheel
to cater for business objectives and soft aspects in
order to align with new social realities and market
dynamics.


6.2 EA framework selection 


EA implementation can be independent of any
framework; however, a framework provides a roadmap
and tools to capture the requisite information and
model it in an orderly way. The results of the above
comparison, from the perspective of railway SEAS
objectives, indicate that each enterprise
methodology has its strengths and weaknesses and
that none of them is categorically complete. The
meta model, maintenance, dynamic and soft aspects
are issues that are not supported by all of the
selected EA frameworks.


7 CONCLUSION AND FUTURE WORK 


This paper presented the application of EA to the
implementation of a railway SMS in order to
develop SEAS. The new approach, which is based on
a holistic perspective, combines the safety
regulatory requirements and business objectives of
RU/IMs to support business transformation in the
face of system complexity, operational uncertainty
and new social realities. The purpose of the paper
is to evaluate well-known EA framework
methodologies and narrow down the selection in
order to support a future railway SEAS that is
capable of shaping the norms and values of railway
organizations. It also attempts to establish EA as
the definitive solution and to formulate the
parameters of a standard SEAS to ensure its optimal
realization. Any of the existing EA frameworks can
provide guidance for EA implementation to a
certain degree, and TOGAF scores highest in this
case owing to a number of distinctive capabilities.
It structures architecture development as
incremental, step-by-step guidelines and supports
enterprise architecture modeling with detailed
descriptions in order to address critical business
needs effectively. The methodology also establishes
a common set of risk and security concepts and
demonstrates mapping to implementation languages
and tools, building a shared understanding and
response strategies to ensure traceability and
verifiability. Nevertheless, it is not complete
enough to meet the specific requirements of railway
RU/IMs in the areas of business and socio-cultural
alignment, risk management implementation and
innovation support.

The SEAS goals set in this study are not final
and the related capabilities should evolve with
architecture maturity. In the current assessment
scheme all the capabilities are given equal weight
irrespective of their risk exposure and
manageability. In future studies some type of
quantitative analysis based on a relative scoring
technique will be considered for EA framework
selection.


ABSTRACT: Accident reports provide a valuable source of data for any safety management system. In
multi-lingual jurisdictions, accident reports can be provided in more than one language. For example the
Swiss transport authority collects accident reports that are written in either German, French, or Italian.
The unstructured nature of free text makes it difficult to extract information from large numbers of
accident reports. Machine-reading of text is an emerging area of research; however, there are few instances
of information being extracted from text in more than one language.

This paper introduces an ontology-based interactive learning method between a human and computer
software to identify safety-related information by analysing text written in three different languages. The
results of the method were analysed by fluent speakers of each language, who rated the overall accuracy
of the method to be 98.5%.

The method stores and processes the data in a NoSQL graph database, which provides a powerful tool
to readily integrate the analysis with other data sources, for example train movement data, passenger
census data, or even comparative data from other railways.


1 INTRODUCTION 


A large amount of information that is useful to 
safety is contained in natural language text reports, 
for example accident reports, hazard reports, 
or safety audit reports. Whilst these data sources
can contain valuable information, it is not easy to 
extract the information. Human-reading of large 
amounts of textual data is slow and error-prone 
and, if the task is divided amongst a large number 
of readers, then differences in interpretation can 
occur. Machine reading of text is an emerging area 
of research which has the potential to provide useful 
information from large text sources, although there 
are still a number of problems to be overcome. No 
prior work has been found that has attempted to 
extract information from safety-related documents 
written in more than one language. 

The Swiss Federal Office of Transport (FOT) 
collects textual data on safety incidents that occur 
on the nation’s transport system. Switzerland is 
multi-lingual, and the text in the incident reports is 
provided in either German, French, or Italian. Each 
report contains information that could be useful to 
managing safety of the system, however no existing 
process is known whereby the information can be 
collated in a way that supports safety management. 


This paper describes an ontology-based
approach to obtain information from 5065 incident
reports provided by the FOT. Incidents were
classified into a number of categories including
incidents that occurred: whilst passengers were
boarding trains; whilst they were alighting from
trains; or as a result of passengers falling down
stairs, being caught by closing doors, or being
struck by falling baggage.


2 BACKGROUND 


2.1 Safety management of big data 


Modern approaches to safety management require
organisations to collect information on accidents.
It is very common for these data to include text that
describes where and when the accident occurred; the
context within which the accident occurred (for
example weather conditions; activities that were
taking place at the time); what injuries and damage
resulted; what proximate and underlying causes
led to the accident; which risk controls failed,
allowing the accident to occur; and
recommendations to minimise and mitigate
recurrence. The purpose of collecting such data is
to support decision-making processes that consider
information from different sources (for example
safety-critical work procedures, budget data, or
legislative requirements) and take action to
optimise safety management.

Professional safety management systems often 
collect information not only on accidents that have 
been observed, but also on incidents where safety 
risk controls have broken down but no injury or 
damage occurred, so called near-miss or close call 
events (Gnoni, Andriulo et al. 2013, Andriulo and 
Gnoni 2014, Macrae 2014). These accident and 
incident reports themselves can amount to a large 
quantity of data (Hughes et al., 2016a). Extracting 
information from these large data sources can be a 
challenge by itself; the problem is further compli- 
cated when the data is provided as free-text rather 
than structured machine-coded data. Combining 
data from such a large data source with data from 
other, potentially very large, data sources can be 
problematic. This data management challenge
is commonly referred to as big data. There is an 
emerging body of work describing the challenges 
of big data and techniques that can be used to 
extract useful information from the data. To obtain 
information that supports safety management, 
Van Gulijk et al. (2016) introduce the concept of 
Big Data Risk Analysis (BDRA), and describe the 
four enablers of BDRA: 


• data and data-management,
• visualisation interface,
• analytics and software, and
• ontology and knowledge representation.


Figure 1 presents a schematic overview of the
interaction of these enabling components. Each
enabler is described below.

Figure 1. The four enablers of BDRA.

Data and data management is the initiating
reason for BDRA. Modern organisations and their
safety management systems collect large amounts
of data with the intention of using these data to 
improve safety and safety management. Data may 
be collected from manual processes, such as work- 
ers completing forms as part of a safety-critical 
work process; from automatic systems, such as 
supervisory control and data acquisition (SCADA) 
systems; or from external sources, such as informa- 
tion on the internet regarding weather or traffic 
conditions. Collecting data on hazards, incidents 
and accidents is at the heart of modern safety 
management systems. The FOT has a database of 
thousands of reports describing all detected acci- 
dents on the transport network, including minor 
accidents, such as where a passenger fell over and 
sustained only very minor injuries. The data in the 
accident database is a valuable source of informa- 
tion that can be used to understand the causes of 
accidents and any underlying trends, and to deter- 
mine actions that may minimise the likelihood of 
recurrence. 

A visualisation interface is an essential compo- 
nent for understanding large data sources. Humans 
have a capacity for visualising concepts and their 
relationships, which is valuable for understanding 
big data sources which can contain thousands, or 
even hundreds of thousands of inter-related con- 
cepts. For example the safety management of an 
operating railway requires an understanding of 
concepts such as all types of rail vehicles includ- 
ing locomotives, carriages, wagons, and mainte- 
nance machines; all railway locations including 
track locations, stations, sidings, maintenance 
facilities; all organisations who operate within the 
railway including law enforcement organisations; 
users of the railway; routes and timetables etc. It is 
not feasible to expect staff who operate the safety 
management system to be able to visualise all 
these concepts and the multitudinous interactions 
between them without external support. Visual 
analytic tools facilitate the understanding of these 
concepts and include applications to help under- 
stand different categories of safety risks and the 
geospatial distribution of incidents and accidents. 
Such tools may even include visual causation mod- 
els, such as bow-tie diagrams, that describe the 
chains of events that can lead to an accident and 
the risk controls that are in place to reduce the risk. 

Analytics and software are the backbone of 
BDRA: modern data management systems are 
based on software and BDRA requires large soft- 
ware analytical capability such as that available on 
modern cluster computers or provided by cloud 
computing services. Several software services are 
required for BDRA in order to: 


• store data in short-term stores and long-term
  archives;

• organise data into meaningful data units
  and transport it to the locations required for
  processing;

• pre-process data ready for analysis and format it
  as required;

• analyse the data to produce results that assist
  safety-related decision-making;

• collate results of the analytics and aggregate
  them as necessary;

• present the results, in a visual form, to analysts
  to allow understanding of underlying hazards,
  risks, controls, accidents and consequences;

• iterate through analysis depending on the
  findings of earlier analyses;

• store results of analyses to allow subsequent
  analysis to expand on the results of earlier
  work;

• oversee and coordinate all of the above
  processes and distribute computer resources in an
  optimal way.


The software tools to support these tasks may 
include traditional tools such as simple graphing 
tools and SQL databases, as well as technologies 
that have emerged in the past decade such as inter- 
active visualisation tools, graph databases and 
massively parallel, distributed analytical tools. 

Ontology is a structured method to classify enti- 
ties within a domain and the interactions between 
them. An entity is any item that can have
properties and interact with other entities. A
simple example of an entity in a railway safety
context is a train, which has the property of
rolling stock class; individual instances of trains
also have train numbers as properties. Trains
interact with stations, track and passengers. Other
examples of entities include tickets (which have
properties such as valid dates and routes) and
railway staff (who have properties such as job
function and interact with people and
organisations). Entities within an
ontology may also be abstract, such as dates and 
times. Regardless of whether an entity is abstract 
or has a physical form, a defining feature of all 
entities is that they can be referred to by them- 
selves; this differentiates them from relationships 
that require entities in order to be meaningful. 
An example relationship is that a passenger can 
board a train. Both the passenger and the train are 
entities that can exist by themselves. However the 
boarding relationship requires these entities to be 
present in order to be meaningful: boarding can- 
not occur by itself. The ontology provides a struc- 
ture that allows data to be joined to the key entities 
that are relevant to safety management. For exam- 
ple as an instance of data, an accident report may 
be joined in an ontology to the station where it 
occurred; to the members of staff who responded; 
and to the date and time when it occurred. If the
ontology contains an incident causation model,
such as a bow-tie diagram, the accident report can 
also be linked to the particular risk control break- 
downs that led to the accident and the outcomes 
that occurred. Individual stations in the ontology 
may be linked to the general concept of a station; 
individuals may be linked to the organisations for 
whom they work, and so on. In this way it is possi- 
ble to structure queries of the data to request acci- 
dent reports that occurred at particular stations, or 
at stations in general; or to identify staff respond- 
ers from particular organisations. The ontology 
may also be linked to other data in the safety man- 
agement system such as audit reports, or main- 
tenance logs. In this way it is possible to identify 
accidents that occurred at stations where audits 
have not recently been performed; or accidents at 
locations where particular maintenance activities 
have occurred. Where the ontology contains dates 
or chains of causation, these entities can also be 
queried to extract precise and meaningful informa- 
tion to support safety management. 
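
As a minimal sketch of the kind of query described above, the following Python fragment stores a toy ontology as (subject, relationship, object) triples and finds accident reports at stations that have not been audited since a cutoff date. The entity names, relationship labels and dates are all invented for illustration; the paper itself uses a graph database rather than Python lists.

```python
from datetime import date

# Hypothetical instance data: every name and date here is invented.
# Each edge is a (subject, relationship, object) triple.
edges = [
    ("report_1", "occurred_at", "station_a"),
    ("report_2", "occurred_at", "station_b"),
    ("report_3", "occurred_at", "station_b"),
    ("station_a", "is_a", "station"),
    ("station_b", "is_a", "station"),
    ("audit_1", "audited", "station_a"),
    ("audit_2", "audited", "station_b"),
]

# Properties attached to entities (here only the audit dates).
properties = {
    "audit_1": {"date": date(2017, 11, 1)},
    "audit_2": {"date": date(2014, 3, 15)},
}


def subjects(relationship, obj):
    """Return all entities linked to obj by the given relationship."""
    return [s for s, r, o in edges if r == relationship and o == obj]


def last_audit(station):
    """Return the most recent audit date of a station, or None."""
    dates = [properties[a]["date"] for a in subjects("audited", station)]
    return max(dates) if dates else None


def reports_at_unaudited_stations(cutoff):
    """Reports at stations with no audit on or after the cutoff date."""
    result = []
    for station in subjects("is_a", "station"):
        audited_on = last_audit(station)
        if audited_on is None or audited_on < cutoff:
            result.extend(subjects("occurred_at", station))
    return sorted(result)


stale = reports_at_unaudited_stations(cutoff=date(2017, 1, 1))
```

The link from each individual station to the general concept "station" is what lets the query range over stations in general rather than over a hand-written list.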


2.2 Ontologies for data management 
and understanding 


Smith & Welty (2001) describe the traditional 
understanding of the word ontology, as being an 
ancient Greek concept that addresses the funda- 
mental nature of reality. Aristotle established ten 
categories of reality viz.: action; habit; location; 
passion; position; quality; quantity; relation; sub- 
stance or essence; and time (Ritter & Kohonen 
1989). This ontology was established to address 
underlying questions in ancient Greek philosophy 
such as what is real? and what can be said to exist? 
A basic method for establishing an ontology can 
be considered in two stages: firstly observation and 
conceptualisation of the real-world domain; and 
secondly explicit formalisation of the identified
concepts (Evermann & Fang 2010). Such a
formalisation of the world into well-defined concepts
that can be reasoned about in a meaningful way 
provides a framework that is well-suited to compu- 
tational analysis of data (Searle 2006, Smith 1998). 
An open question in ontology research is the degree 
to which an ontology needs to be complete in order 
to support understanding and decision-making. 
Dahlgren (1995) discusses the concept of a naive
ontology, which is an ontology that considers
only objects and their classifications. For example a
naive ontology would consider a passenger train as 
being a type of train, which in turn is a type of vehi- 
cle. However such an ontology may not consider 
abstract concepts such as the concept of lateness 
in relation to a train running to a timetable. This 
naive approach underpins the resource description 
framework established by the World Wide Web
Consortium as well as the approach taken by Noy
& McGuinness (2001). Dahlgren (1995) asserts that 
the approach is sufficient to perform almost 80% 
of common-sense reasoning. Brewster and O’Hara 
(2007) argue that ontologies are particularly useful 
in well-defined domains such as individual organi- 
sations. Noy & McGuinness (2001) assert that the 
ontological approach to data management provides 
a valuable method to reuse domain knowledge. For 
example database queries can be stored within the 
ontology so that information found by one analyst 
can be found again later by other analysts. In this 
way the ontology adds to the amount of data that 
needs to be stored, but reduces the need for repeti- 
tion of work. 
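
The passenger train example of a naive ontology given above can be sketched as a simple chain of "is a type of" links. The dictionary below is an assumption made for illustration; only the passenger train, train and vehicle classification comes from the text, and the freight train entry is added to show a second branch.

```python
# "is a type of" links of a naive ontology (objects and their classes only).
# The freight train entry is a hypothetical addition.
IS_A = {
    "passenger train": "train",
    "freight train": "train",
    "train": "vehicle",
}


def ancestors(concept):
    """Follow the is-a links upwards and return every broader class."""
    chain = []
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain


def is_a(concept, category):
    """True if the concept is classified, directly or not, under category."""
    return category in ancestors(concept)
```

Such a structure can answer classification questions (a passenger train is a vehicle) but, as noted above, has no place for abstract concepts such as lateness.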


2.3 Ontologies for natural language processing 


Popping (2000) classifies natural language process- 
ing (NLP) techniques in three categories, which 
are in order of sophistication: thematic, semantic, 
and network. Thematic analysis considers the rela- 
tive occurrence of words within the source text and 
can be used as a broad categorisation method to 
identify texts (such as accident records) that con- 
tain similar words and therefore may relate to the 
same broad themes. Semantic analysis expands
the thematic approach by considering the function
of words within a sentence (their part of speech,
such as whether a word is a noun, a verb, or an
adjective) to identify subject-verb-object triples.
These triples provide the underpinning for complex
formal ontology systems that develop from a
mereology of objects or actions (Bateman et al. 2010;
Bierwisch & Schreuder 1992; Miller & Fellbaum 
1991; Ritter & Kohonen 1989). 

Network analysis of text establishes the source 
text as a graph consisting of nodes (which usually 
represent single words) and edges (which describe 
relationships between the nodes such as
co-occurrence of words within a single sentence). As
such the network approach establishes a form of 
ontology of text within a document, although such 
an ontology is based on abstract concepts (using 
words as labels for objects) and therefore funda- 
mentally differs from naive ontologies which con- 
sider the nodes in the ontology as representations 
of the objects themselves. Miller and Fellbaum 
(1991) note that a disadvantage with network anal- 
ysis can be the need for additional software tools 
such as graphing and network analysis tools, which 
can complicate the analysis if there is a need to 
transfer data between separate software products. 
The introduction of graphical text analysis tools 
introduces a new domain of research, for example 
as discussed by Figueres-Esteban et al. (2015). 
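
The network representation described above, with words as nodes and sentence-level co-occurrence as edges, can be sketched as follows. This is an illustrative assumption about how such a graph might be built, using whitespace tokenisation and invented English sentences; it is not the tooling used in the cited works.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(sentences):
    """Count, for each pair of words, the sentences in which both occur.

    Each key is an alphabetically ordered (word, word) pair, so the
    graph is undirected; the value is the edge weight.
    """
    edges = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        for pair in combinations(words, 2):
            edges[pair] += 1
    return edges


# Invented example sentences, not taken from any accident report.
sentences = [
    "passenger fell on stairs",
    "passenger struck by door",
    "door closed on passenger",
]
edges = cooccurrence_edges(sentences)
```

Pairs with a high weight, such as "door" and "passenger" here, are candidates for themes that recur across records.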

For general text analysis, Miller & Fellbaum
(1991) propose an ontology of 26 basic concepts
that can be used to form a basis for
domain-specific ontologies; they are:


act, action, activity;
animal, fauna;
artefact;
attribute, property;
body, corpus;
cognition, ideation;
communication;
event, happening;
feeling, emotion;
food;
group, collection;
location, place;
motive;
natural object;
natural phenomenon;
person, human being;
plant, flora;
possession, property;
process;
quantity, amount;
relation;
shape;
society;
state, condition;
substance;
time.


Finally, Van Gulijk et al. (2016) conclude their 
discussion of the use of ontologies in computer 
science by providing the following three counsels. 
Firstly there is no such thing as a perfect ontology; 
rather there can be a number of alternative ontolo- 
gies that serve the same purpose. Secondly, effec- 
tive ontologies are developed iteratively, perhaps 
as users interact with an ontology and seek more 
detail from it. Thirdly, ontologies that relate to 
physical objects are the easiest to create; ontologies 
that relate to abstract concepts can provide con- 
ceptual difficulties for both the ontology builder 
and the analyst using the ontology. 


3 METHOD 


Extraction of information from the text was
performed in a process that consisted of four main
steps. The first step was the preparation of the
data to allow for efficient completion of the later
steps. The second step was the identification of key
terms in the text and the construction of an
ontology to make explicit the relationships between
the terms. The third step involved the execution of
queries to identify records that correspond with
each of the categories of incident; this is an
automated step performed by software. The final step
was a review of the returned results and consequent
refinement of the ontology and queries until
an acceptable result was achieved. Each step is
described in detail below.


3.1 Step 1: Data preparation


The accident reports provided by the FOT were
imported into a graph database. Graph databases
structure data as nodes and edges, rather than
as the relational tables queried with Structured
Query Language (SQL), which have been the prevalent
structure for databases for some decades. As such,
graph databases belong to a class of databases
known as NoSQL.

Data relating to an individual incident was 
imported as a single node in the database: 5065 
record nodes were created in the database. An 
automatic process was used to analyse the text in 
the source records and create a new node for each 
sentence in the text. During this process a simplify- 
ing assumption was made that a full-stop followed 
by a space (. ) would always mark the end of a sen- 
tence. The sentence nodes were linked to the node 
that contained the data from the record. 
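
The import step described above can be sketched as follows. This is a minimal illustration that uses plain Python dictionaries in place of a graph database; the function names, the node identifiers and the HAS_SENTENCE edge label are assumptions made for this example and are not taken from the study.

```python
def split_sentences(text):
    """Split on a full stop followed by a space (the simplifying assumption)."""
    parts = text.split(". ")
    sentences = []
    for i, part in enumerate(parts):
        part = part.strip()
        if not part:
            continue
        # split() removes the full stop from every part except the last one.
        if i < len(parts) - 1 and not part.endswith("."):
            part += "."
        sentences.append(part)
    return sentences


def build_graph(records):
    """Create one record node per record and one linked node per sentence.

    records: dict mapping a record id to the free text of the record.
    Returns (nodes, edges); the HAS_SENTENCE label is hypothetical.
    """
    nodes = {}   # node id -> properties
    edges = []   # (source id, edge label, target id)
    for rec_id, text in records.items():
        rec_node = f"record:{rec_id}"
        nodes[rec_node] = {"type": "record", "text": text}
        for n, sentence in enumerate(split_sentences(text)):
            sent_node = f"sentence:{rec_id}:{n}"
            nodes[sent_node] = {"type": "sentence", "text": sentence}
            edges.append((rec_node, "HAS_SENTENCE", sent_node))
    return nodes, edges
```

The sketch also makes the cost of the simplifying assumption visible: an abbreviation such as "Dr. " would be treated as the end of a sentence.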

In alphabetic languages, such as the European
languages used in Switzerland, the basic unit of
text analysis is a word. Fundamentally, the process
to establish meaning from text is performed by
analysing the occurrence, frequency, and collocation
of words or groups of words. In this work each
sentence was broken into individual words;
punctuation marks were separated from words by
inserting spaces. Each word was converted to
lower-case text and added as a word node in the
graph. During this process the frequency of
occurrence of each word was stored in the word node.
Collocations of pairs of words were shown by the
creation of an edge marked next; the edge also
recorded data on the frequency of the collocation
of the pair. This process of creating word nodes and
next relationships is the same as the process
applied by Lyon (2015).
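
The word and next-edge counting described above can be sketched in a few lines. This is an illustration only: the regular expression used to separate punctuation, the decision to keep apostrophes inside words, and the function names are assumptions, and counters stand in for the word nodes and next edges of the graph database.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lower-case a sentence and separate punctuation marks from words."""
    # Put spaces around every character that is not a word character,
    # whitespace or an apostrophe, then split on whitespace.
    spaced = re.sub(r"([^\w\s'])", r" \1 ", sentence.lower())
    return spaced.split()


def count_words_and_next(sentences):
    """Count word frequencies and frequencies of adjacent word pairs."""
    word_freq = Counter()   # plays the role of the word nodes
    next_freq = Counter()   # (word, following word) -> next edge frequency
    for sentence in sentences:
        tokens = tokenize(sentence)
        word_freq.update(tokens)
        next_freq.update(zip(tokens, tokens[1:]))
    return word_freq, next_freq
```

For two hypothetical sentences that both contain "dame âgée", `word_freq["dame"]` and `next_freq[("dame", "âgée")]` would each be 2, which mirrors the frequencies stored on the word nodes and on the next edge.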


3.2 Step 2: Ontology learning 


The source data were analysed to identify terms 
(words and bigrams) for inclusion in an ontology 
of items. Candidate terms were identified by calcu- 
lating the TFIDF score for each word in the source 
text. Subsequently, each bigram was considered to 
be a single token and a TFIDF score was calcu- 
lated for the bigram. All terms from all the source 
records were compiled in a table of descending 
TFIDF score and presented to an analyst for con- 
sideration in the ontology. 
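
A minimal sketch of this ranking is given below. The paper does not state which TFIDF variant was used or how per-record scores were combined into one table, so two assumptions are made here and should be read as such: the score is the relative term frequency in a record multiplied by the natural logarithm of the inverse document frequency, and each term is ranked by its highest score in any record. Bigrams are treated as single tokens, as described above.

```python
import math
from collections import Counter


def terms_of(tokens):
    """Return the words and the bigrams (joined by a space) of a token list."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def rank_terms(records):
    """Rank terms by their highest TF-IDF score in any record (assumption).

    records: list of token lists, one list per record.
    Returns a list of (term, score) pairs in descending score order.
    """
    n_records = len(records)
    term_counts = [Counter(terms_of(tokens)) for tokens in records]

    # Document frequency: the number of records in which each term occurs.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    best = {}
    for counts in term_counts:
        total = sum(counts.values())
        for term, count in counts.items():
            tf = count / total
            idf = math.log(n_records / doc_freq[term])
            best[term] = max(best.get(term, 0.0), tf * idf)

    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

A term that occurs in every record receives a score of zero under this variant, which is why very common words fall to the bottom of the table presented to the analyst.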

The analyst reviewed the list of candidate terms,
starting with those with the largest TFIDF score,
and selected the terms that appeared to be relevant
to each category of incident. Since the TFIDF
ranking is intended to list terms in order of
relevance, the analyst worked through the list until
reaching a point where it was determined that
the terms were generally irrelevant and no further
terms would be considered. The analyst in this trial
spoke only English, had no fluency in any of
the source languages (German, French, or Italian)
and used simple online translation tools
to understand the candidate terms. Terms were
selected based solely on the analyst’s understanding
of railway operations and safety. After
identifying terms, the analyst created an ontology that
joined matching or similar concepts. The ontology
allowed equivalent terms in different languages to
be joined to the same node. For example, in records
written in German, the words ambulanz and
krankenwagen were both used to refer to an ambulance
and were linked to the ambulance ontology node.
Similarly, the French term ambulance and the
Italian term ambulanza were linked to the same node.
A simplification was made to link plural terms to
singular ontology concepts as a simple form of
lemmatisation of the text.
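
The linking of equivalent terms to a shared ontology node can be illustrated with a simple lookup table. The ambulance entries and the plural uomini are taken from the examples in this paper; the dictionary layout, the concept name "man" as a target for uomini, and the function names are assumptions made for this sketch.

```python
from collections import defaultdict

# Term (lower case, as stored in the word nodes) -> ontology concept.
TERM_TO_CONCEPT = {
    "ambulanz": "ambulance",      # German
    "krankenwagen": "ambulance",  # German
    "ambulance": "ambulance",     # French
    "ambulanza": "ambulance",     # Italian
    "uomini": "man",              # Italian plural linked to a singular concept
}


def build_concept_index(term_to_concept):
    """Invert the mapping: ontology concept -> set of terms linked to it."""
    index = defaultdict(set)
    for term, concept in term_to_concept.items():
        index[concept].add(term)
    return index


def concepts_in(tokens, term_to_concept):
    """Return the ontology concepts mentioned in a list of tokens."""
    return {term_to_concept[t] for t in tokens if t in term_to_concept}
```

Because every language-specific term points at the same concept, a later query only needs to name the concept and does not need to know which language a record was written in.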

The ontology learning process was limited to
the creation of only a naive ontology: only a
single type of relationship was defined, indicating
that each ontological element can be a type of
another element; for example, a woman is a type of
person. The ontology did not contain relationships
such as a door is a part of a train, or a train can
arrive at a station.


3.3 Step 3: Execution of queries based 
on the ontology 


Queries were performed on the data in the graph 
database to identify records related to each cat- 
egory of incident. The queries were started at the 
ontology nodes that defined each category of inci- 
dent and traced via edges in the graph to identify 
records that contained terms relating to the inci- 
dent category. 
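
The traversal described above can be sketched as follows, again with dictionaries standing in for the graph database. The data structures, the function names and the restriction to single-word terms are assumptions for illustration; the type-of links follow the two-layer layout of the ontology, and the three records are invented examples.

```python
def descendants(concept, is_a):
    """Return the concept together with every concept that is a type of it."""
    found = {concept}
    changed = True
    while changed:
        changed = False
        for child, parent in is_a.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def query(concepts, is_a, term_to_concept, record_tokens):
    """Return ids of records that contain a term for every requested concept."""
    result = None
    for concept in concepts:
        wanted = descendants(concept, is_a)
        terms = {t for t, c in term_to_concept.items() if c in wanted}
        matches = {rid for rid, tokens in record_tokens.items()
                   if terms & set(tokens)}
        result = matches if result is None else result & matches
    return result if result is not None else set()


# Hypothetical example data (not taken from the source records).
IS_A = {"old": "person", "man": "person", "stairs": "places"}
TERMS = {"treppe": "stairs", "scale": "stairs",
         "anziani": "old", "âgée": "old", "uomini": "man"}
RECORDS = {
    1: ["due", "uomini", "anziani", "in", "cima", "alle", "scale"],
    2: ["eine", "frau", "auf", "der", "treppe"],
    3: ["dame", "âgée", "tombe"],
}
```

With this data, `query(["old", "stairs"], IS_A, TERMS, RECORDS)` returns only record 1, because it is the only record that contains both a term for an old person and a term for stairs.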


3.4 Step 4: Iteration and reporting 


As noted above, the process of ontology creation 
is iterative. After execution of each query, the ana- 
lyst reviewed the results to determine whether the 
records correctly corresponded to the category. 
Since the analyst had no fluency in the languages 
used to write the records, the TFIDF ranking 
process was re-applied to only the records returned 
as a result of each query. The analyst reviewed the 
terms occurring in the subset of records, using sim- 
ple translation tools again, to determine whether 
terms were occurring that did not appear to relate 
to each query. 

4 RESULTS


This section presents the results of each stage of 
the analysis. 


4.1 Step 1: Data preparation


The 5065 records were loaded into the database
and a total of 16,419 unique word nodes were
created. Relationships were created to show
collocations of words. Figure 2 shows an example of
the pair of words dame and âgée with the NEXT edge
joining them. The figure shows that the word dame
occurs 620 times in the source text, the word âgée
occurs 202 times and, as a collocation, dame âgée
occurs 150 times.


4.2 Step 2: Ontology learning 


The process of TFIDF ranking returned 82,726
candidate terms, being 16,419 single-word terms
(one for each word node in the database) and
66,307 bigrams. From this list the analyst
identified 389 terms that appeared to be related to
the specified categories of incident. It is notable
that, in some cases, the TFIDF calculation produced
similar scores for terms with similar meaning in
different languages. For example, of the 82,726
identified terms, the term ältere dame (German for
elderly lady) was ranked 314th in the list; the term
dame âgée (French for elderly lady) was ranked
316th in the list.

The analyst constructed an ontology based on
the identified terms. For ease of analysis the
ontology was limited to only objects and actions and
structured in two layers. Table 1 shows examples of
the entities included in the ontology.


[Figure: word node dame (frequency 620) linked by a
NEXT edge (collocation frequency 150) to word node
âgée (frequency 202).]

Figure 2. Example of the collocation dame âgée and the
NEXT edge linking the words.


Table 1. Example ontology entries.

Ontology upper layer   Ontology lower layer

person                 doctor, self, customer, person, driver,
                       passenger, months old, years old, baby,
                       female, old, elderly, young, male, man,
                       lady

places                 line, stations, pavement, hospital, ground,
                       platform, stairs

actions                hit, medical, injure, get out, enter, fall,
                       rush

body parts             foot, head, arm, leg

vehicle                carriage, vehicle, ambulance, tram, train,
                       bus

direction              backwards, direction, in between, in front

items                  bag, alcohol, drugs, stairs, footboard,
                       customer information system, ticket, door


4.3 Step 3: Execution of queries based 
on the ontology 


Queries were designed and executed for each
category of incident based on the terms identified
during Step 2. The starting point for each query was
the ontological entries that define the key entities
being sought in the query. For example, to identify
records related to injuries caused to passengers by
closing doors, the query identified the ontology
nodes relating to passengers, closing doors, and
injury. The query then identified the terms
relating to those concepts, followed by the records
that contained those terms. Figure 3 shows an example
of records that relate to the ontological concepts
of an elderly person and stairs. The query can be
thought of as executing from the top of the diagram
down: firstly the ontology elements for stairs and
for a person (in particular an old person) are
identified. These ontology elements are traced in
the query to instances of words that define them,
for example the words Treppe (a German
word for stairs) and scale (Italian for stairs). In
turn, these words are traced to instances of
accident reports where the words occur. It can be seen
that the plural term for men in Italian (uomini) has
been linked to the singular term in the ontology.


4.4 Step 4: Iteration and reporting 


The results of the queries were reviewed by the
analyst for correctness and the process in Steps
2 and 3 was iterated to refine the terms, ontology
and queries to create results that appeared to
better align with each category of incident. After
iteration of the process, the results of the queries
were presented to fluent speakers of each language
for review. The reviewers were independent of
the process and assessed each record entirely on
whether it appeared to correctly describe an event
belonging to each category. As such, the reviewers
assessed only the rate of true positive matches
compared with the overall number of records found to
match for each category. The overall results from
all reviews indicated that the number of true
positives was 98.5% of all positive results returned
by the queries.


[Figure: ontology elements for an old person and
stairs traced to matching terms (uomini anziani,
scale) in an example Italian accident record.]

Figure 3. Schematic diagram of an example query.


5 DISCUSSION 


The overall finding is that this preliminary study 
has demonstrated the potential for the technique to 
be a powerful tool for identifying specific instances 
of safety-related events from multi-lingual acci- 
dent reports. The result of 98.5% true positives in 
all results returned is very strong, especially consid- 
ering that the analyst did not have fluency in any of 
the source languages and used only simple
translation tools. At this stage, however, it is not
clear how many false negatives the queries produced
(i.e. records that described one of the
categories of incident but were not identified by
the queries). Further work would be required to
determine the overall accuracy of the process.
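
The distinction drawn here is the one between precision and recall. The short sketch below states both measures; the counts used in the usage note are hypothetical, chosen only so that the precision equals the reported 98.5%, since the paper reports the percentage and not the underlying counts.

```python
def precision(true_positives, false_positives):
    """Share of returned records that truly belong to the category."""
    returned = true_positives + false_positives
    return true_positives / returned if returned else 0.0


def recall(true_positives, false_negatives):
    """Share of relevant records that the queries actually returned."""
    relevant = true_positives + false_negatives
    return true_positives / relevant if relevant else 0.0
```

For example, 197 correct records among 200 returned records give a precision of 0.985. The recall cannot be computed from the review described above, because the number of relevant records that the queries missed was not assessed.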

The trials of the technique to date have been
limited to only a few categories of incident that
were specified before the process of ontology
creation. Since the process is based on the occurrence
of terms in the text, it appears possible that the
process could be started by examining the text to
determine what categories of incident are being
described, i.e. a bottom-up exploration of the text
to identify categories rather than starting the
analysis with specific categories of incident (a
top-down analysis). Such a bottom-up approach may be
valuable to identify unexpected trends in the data
that could not be presupposed, for example
unexpected categories of incident caused by emerging
technology.

The ontology developed during the process is
based on the terms that occur in the text and, as
such, it appears that the technique should be
applicable to other sources of text that can support
safety management, such as audit reports and
inspection reports, or even to general sources of
textual data.

Another limitation of the technique is that it is
based on the occurrence of simple concepts being
described in the text, but does not consider
compound concepts such as negation. For example,
when stairs are mentioned in the text, it is
presumed that stairs are relevant to the incident.
However, a record may refer to stairs even though they
are not relevant to an incident (e.g. an old man,
whom I had previously seen on the stairs, fell whilst
entering the train). To address this issue the
ontology would need to be updated to include complex
ontological concepts. Further work is being
carried out to align this study with our previous work
(Hughes et al., 2016b) to address these issues in the
technique.


Building cyber resilience through a discursive approach
to “big cyber” threat landscapes

SINTEF Technology and Society, Trondheim, Norway


ABSTRACT: Cyber safety, security and resilience of Critical Infrastructures (CI) and critical societal
functions is a contemporary challenge. To understand the bigger picture, we may build composite threat
landscapes in which vulnerabilities and threats combine and travel across distinct domains between which
expertise, competence, experience and knowledge horizon related to safety, security and risk may differ
substantially. Additional sensitization towards emerging cyber threats is however needed. Inspired by the
post-normal “science of what-if”, the “BigCyber” model advances threat landscapes further into
sensitivity to hidden, dynamic and emergent vulnerabilities. The approach is exemplified in terms of smart
metering of household electricity consumption. The need for discursive support for different stakeholders
relating to threat landscapes is identified, and a discursive framework for stepwise nurturing of
polycentric governance is outlined. The framework can also be used to elaborate and support the idea of
resilience landscapes of autonomous entities, facilitating a polycentric approach to cyber resilience.


1 INTRODUCTION 


The potential consequences of failure and distur- 
bance of Critical Infrastructures (CI) and soci- 
etal functions (e.g., energy, water, transport, and 
logistics) depending on Information and Com- 
munication Technology (ICT) are frightening and 
potentially devastating. The overall risk picture is
increasingly blurred, mixed and constantly
evolving. It is difficult to maintain a sharp divide
between the stable inside of a critical infrastructure
and a more innovative outside. Presumed motivations
of potential adversaries and perpetrators span a wide
range, encompassing cyber conflict and hybrid
warfare, fake information, political influence,
cyber-physical damage, cybercrime, sheer vandalism
or teenage pranks. This adds to the existing
prospects of accidents, failures and unfortunate
incidents in (already) complex systems. Perrow’s
(1984) notion of the “normal accident” is
persistently hard to escape.

The Internet of Things (IoT) is already on the
scene, offering new access (= attack) points, new
magnitudes of automation and cyber-physical
impact, but also boosting the ability to "informate"
(Zuboff 1984); to generate electronic texts about
the use of the infrastructure. Moreover, the
"Internet of Everything" (IoE) has been coined as the
"Big Other" surveillance capitalism (Zuboff, 2015),
fueling a logic of accumulation. This is signified by
the increasing rate of ICT systems rigged for
collecting as much (surplus) data as possible. Vendors
collecting extensive information from installations
without the customer's consent could be coined as
the "industrial Big Other".

In the 1990s, the prospect of "trusted" computer
systems prevailed. Today, few if any ICT systems
are delivered with assurances that support
this. Practically no ICT system, including CI, may
preclude the possibility of intrusion, disturbance
and hacking. Big-scale consumer innovations, e.g.
autonomous cars and home appliances, are seemingly
always lagging in computer security. Some
voices even claim that "computer security is broken
from top to bottom" (Economist, 2017).

Potential countermeasures are often invasive, 
e.g. on privacy, often unduly playing on strings of 
fear and anxiety. Public initiatives, e.g. from the 
EU (Galbusera and Giannopoulos, 2016) aiming 
for public, semantic web descriptions of critical 
infrastructures may also be exploited to enable 
sophisticated attacks. 

We cannot expect holistic, cross-nation,
cross-sector approaches to these challenges. The obstacle
is not just the tremendous information coordination
challenge, but also the incommensurate and
diverse motives and objectives across boundaries of
private vs public, classified vs unclassified, national
vs international. Information cannot be shared, nor
trusted, in one "heap". Motives and objectives are
increasingly located in an atmosphere
of post-fact attitudes, fake news, and information
warfare targeting societal trust, in which even
security agencies may find it difficult to navigate.

This fundamental challenge demands an
attempt at imagining the inconceivable. Societies,
organizations and stakeholders habitually direct
their hope and faith for dealing with such
challenges to risk management and governance, but
these are increasingly acknowledging their
limitations. Illustratively, a new Specialty Group (SG)
on resilience analysis was approved by the Society
for Risk Analysis (SRA) Council on December 10,
2017.

In the following, a diverse portfolio of
strategies and approaches that can be utilized at several
levels, from the national regulator to the
infrastructure owner and stakeholder, is proposed. The
key issues are about building threat landscapes to
increase sensitivity to hidden, dynamic and
emergent (“h/d/e”) vulnerabilities and couplings, and to
employ the concept of resilience in a polycentric
manner catering for diversity, rather than as a
system-wide property on uniform terms.

An example is offered: smart household electric- 
ity metering as part of smart grids. Energy com- 
panies strive to use technology to innovate their 
customer relations, technical maintenance and grid 
stability, fearing cyber threats, but also fearing a 
sudden, technology-driven meltdown of their busi- 
ness models. 


2 THREAT LANDSCAPES—AND BEYOND 


2.1 Threat landscapes


Risk management is traditionally not possible with- 
out making demarcations about a system regarding 
boundaries, threats, vulnerabilities, key events, and 
other inventories (e.g., acting subjects). In today’s 
complex cyber-inflicted systems, such presump- 
tions become increasingly difficult. H/d/e couplings 
between parts that we traditionally would prefer to 
keep apart for analytical clarity, or events and con- 
ditions that would be considered as unlikely or even 
irrelevant in conjunction, challenge organizations’ 
and societies’ experiences, capabilities and skills 
regarding imagination as well as actual resilience 
towards disturbance and surprise. 

A societal perspective will have to address the 
bigger picture by recognizing and combining mul- 
tiple, distinct domains of expertise, competence, 
experience and knowledge horizons related to, e.g., 
safety, security, resilience, threat and risk. In this 
paper, any such distinct domain is “squared out” 
as a picture, with a frame representing demarca- 
tion of inside vs outside, however with the premise 
that there may always be some relevant knowledge 
missing. 

A key challenge is that due to the diversity and
h/d/e ICT-induced couplings of physical as well as
logical nature, “pictures” may suddenly turn out to
be flawed, and new threats may travel across
such experience-based boundaries in unprecedented
ways. The understanding of such composite
threat landscapes requires methods beyond
the practices used to address single domains.
Although it is likely that the (more or less
professional) risk management approaches per se do not
vary dramatically across such “squared” frames, it
is likely that the pragmatic knowledge horizon of
each domain, e.g., the sensitivity to different
phenomena and the ways information and knowledge
are recognized, collected, combined and
appreciated, will differ substantially more.

Due to the presumed heterogeneity of the total 
landscape, it is held unlikely that a joint holis- 
tic picture of threats and vulnerabilities can be 
comprehended from a single knowledge horizon. 
Hence, it is presumed that the “visible landscape” 
that can be created and shared between domains 
is constituted by several “squares”, each of which
represents a specific horizon of knowledge-gathering
(“knowledging”) strategies and actual
experience. To be able to construct such a landscape,
three issues are crucial:


1. The explication of the boundary conditions for
each horizon “squared out” (Figure 1) in terms
of the frame description, the demarcations of
the validity of the inside, and the indicators of
its saturation (that is, when it cannot
accommodate more issues without losing its pragmatic
meaning).

2. The characteristics of overlap zones and the
corresponding h/d/e vulnerabilities and couplings
that may enable threats to propagate.

3. The joint acknowledgement that single frames,
as well as their intersections, are not only
uncertain, but also influenced by a background
landscape encompassing h/d/e phenomena.


The labyrinth background in Figure 1 signi- 
fies the persistent hermeneutical challenge of a 
“moving horizon” (Gadamer 1992) facing each 
“squared” horizon, as it encounters new phenom- 
ena and contesting horizons through the overlap 
zones. 

The resulting threat landscape metaphor
hence implies a loss of traditional presumptions
of clear-cut responsibility and authority
traditionally associated with single pictures/frames, but
also an increased sensitivity to other horizons of
understanding.

To take advantage of this landscape metaphor,
each knowledging agent or community
must acknowledge the need to understand the
foundations of its own horizon, and be able as well
as willing to take a closer look beyond its prevalent
presumptions.

Figure 1. Overlapping threat pictures constituting a
threat landscape on a labyrinth background (Grøtan and
Antonsen, 2016).


In this way, the threat landscape metaphor may 
be used to build a “bigger picture” including h/d/e 
couplings between distinct domains, among which 
expertise, competence, experience and knowledge 
horizon related to safety, security, threat and risk 
are not necessarily commensurate. To establish the 
grounds for extended understanding of the
surrounding landscape, each agent may take
advantage of the “take it to the limits” approach (Grøtan
and Antonsen 2016) in which a sequence of issues
is raised to encircle the boundaries and the
saturation points of each frame, and open up for inputs
from other domains.


2.2 “What-if”: Sensitization and weak signals 


The threat landscape approach above is prima- 
rily useful for utilizing past experience and existing 
knowledge from professional domains, looking for 
new combinations, and for revealing or spotting h/d/e 
vulnerabilities before they have negative impact. 
However, many recent events illustrate that
cyber (h/d/e) vulnerabilities emerge as a big
surprise or a “black swan”. A recent example is social
media allegedly being used for political
communication, tipping elections (Guardian, 2017), indeed
exceeding what may be anticipated by means of
traditional scientific approaches. The public sphere
is very likely to be affected by the consequences,
and involved in the key phenomena, e.g., as
actually being the product, not the customer, for “Big
Others”. The threat landscape approach per se
may thus not be enough. The residual challenge is
to be able to raise and pursue the question
“what-if” based on hints, weak signals or sheer
imagination, and make a serious judgment in time
on issues that normally might be discarded as
belonging to a risk distribution tail. Also, the
“layman” horizon should be included in the process.


To address this residue, inspiration is here gath- 
ered from the “post-normal science” (PNS) (also 
denoted the “science of what-if”) introduced 
by Funtowicz and Ravetz (1993) and Marchi & 
Ravetz (1999), setting out to resolve a science in 
crisis (sic!). 

Konig et al. (2017) identify the conditions char- 
acterizing a post-normal situation: Irreducible 
complexity, deep uncertainties, multiple legiti- 
mate perspectives, value dissent, high stakes, and 
urgency of decision-making. The PNS goal is not 
to attain certain knowledge, but quality, a more 
robust ‘science for policy’. Pointing to how politici- 
zation of science renders classical Mertonian scien- 
tific norms invalid, they identify an ethos for PNS 
which they denominate TRUST (Transparency, 
Robustness, Uncertainty management, Sustain- 
ability, and Transdisciplinarity), considering this a 
nexus for reflexivity practices. They propose that 
the public trust in science advice can be restored 
through the PNS ethos. 

PNS is also portrayed as “both descriptive 
(describing urgent decision problems — post-normal 
issues — characterized by incomplete, uncertain or 
contested knowledge and high decision stakes and 
how these characteristics change the relationship 
between science and governance) and normative 
(proposing a style of scientific inquiry and practice 
that is reflexive, inclusive and transparent in regards 
to scientific uncertainty and moving into a direction 
of democratization of expertise)” (Strand 2017). 

Here, the PNS challenge is responded to in a
more meagre and restricted way: 1) by urging for
sensitivity to weak signals, and 2) by proposing
a cyber-vulnerability sensitization model that
hopefully makes sense to professionals and laymen alike.
Hopefully, this is a contribution to the PNS urge to
invite “extended peers” into the conversation.
Ubiquitous cyber vulnerabilities, and the hope of cyber
resilience, should not only be based on a discourse
among professionals; ultimately it involves us all.

The turn towards PNS for inspiration, and the
return to the less ambitious sensitization model
presented below, is thus a natural next step from
the idea of the threat landscape as a vehicle for
joint comprehension and discourse based on
validation of, or even sense-making from, weak signals.
The notion of “weak signal” is thus not confined to
uncertainty within a familiar domain, but includes
the possibility that something radically outside the
normal frames of reference may “travel” through
the landscape, with significant impact at
unexpected places.


2.3 The BigCyber sensitization model 


This sensitization model is intended as a generic
tool to support a balanced approach to
understanding the temptations as well as the possible
drawbacks related to utilization of the ever-evolving
“cyber space”.


2.3.1 Underlying and formative issues 

2.3.1.1 Potential conflict in cyber space 

The actors who own and operate critical
infrastructures are usually not directly involved in (military)
cyber conflict scenarios, and they have traditionally
not been seen as military actors. Still, they may be
targets for offensive cyber weapons in a potential
conflict situation. By making attacks on critical
infrastructure from afar technically possible,
digital technology also makes these types of attacks
feasible. It is therefore important that these actors
think about the possibility of being targets, and
prepare accordingly. An attack of this type could
be intended to simply disrupt services, sabotage
or even cause physical damage. For example, since the
term was coined by CIA Director Leon Panetta in 2016,
there has been a persistent concern in the US regarding
a potential “Cyber Pearl Harbour” attack.


2.3.1.2 The Internet of Things (IoT) 

IoT implies a network of objects that are able to
collect data through embedded sensors and exchange
this information via the internet, but that are
notoriously hard to secure, and even hard to update
when needed.

Both intended, malicious cyber threats and
unintended system failures and vulnerabilities of IoT
dispersed throughout a CI may lead to severe
disruptions in cyber-physical systems. In 2016, we also
experienced a hint of the future, as the recognized
scale of DDoS attacks increased dramatically due to
the broad availability of tools for compromising and
leveraging the collective, offensive firepower of IoT
devices (poorly secured Internet-based security
cameras, digital video recorders and Internet
routers) (Guardian, 2016). The intentions and motives
behind such attacks may be related to crime and
hackers, ranging from teenagers’ ploys via organized
crime to state actors, but also to cyber conflict and
hybrid warfare.

IoT thus sparks the ability to “informate”, to 
generate electronic texts around the use of the 
infrastructure and technology, boosting the “Big- 
Other” logic of accumulation. 


2.3.1.3 From IoT to Internet of Everything (IoE)

As more and more personal information is being
made more or less public, and the possibility for
combination increases, a new form of information
economy emerges. Zuboff (2015) describes the emergent
logic of accumulation in the networked sphere as an
“Internet of Everything” (IoE) in which personal
information becomes a commodity of high value
for a wide range of (unknown) users. This radical
new form of surveillance capitalism aims to predict
and modify human behavior as a means to produce
revenue and market control. Zuboff (2015) argues
for an ‘information civilization’ addressing
the challenges from “Big Other”: “a ubiquitous
networked institutional regime that records, modifies,
and commodifies everyday experience from
toasters to bodies, communication to thought, all with a
view to establishing new pathways to monetization
and profit” (Zuboff, 2015).

Such an “information civilization” requires a 
new comprehension of cyber safety and security, 
including the multifaceted concept of resilience. 


2.3.1.4 The lack of assurances 

Given the high ambitions related to evaluation crite- 
ria for “trusted” computer systems a couple of dec- 
ades back, there is a striking contemporary silence 
and numbness related to the lack of assurances about 
vulnerability of critical computer systems, at least in 
the non-classified domain. The infamous Stuxnet 
incident has demonstrated that a widely used indus- 
trial control system platform can be used to launch 
very intricate attacks that are very hard to spot. This 
is not only about “zero days”, it is also about an 
inherent technological brittleness, and the possibil- 
ity that industrial plants such as windfarms (Staggs 
et al. 2017) or smart metering systems (Hansen et al. 
2017) demonstrably can be “hacked”, with poten- 
tially severe consequences. This is also about a flawed 
marketplace that does not care to ask for such assur- 
ances at all, or just to a very minor degree. 


2.3.1.5 Privacy 

The Norwegian Data Inspectorate has
recently aired its concern regarding the
implications of this, and the Norwegian Consumer
Council is worried about privacy and consumer
rights in a situation where such consumer data has
become a “goldmine” for infrastructure operators.
The Norwegian telecom operator Telenor is
making data from the cellular network into a
commodity under the label “mobility analytics”. In the US,
a new bill that lifts existing legislation is
criticized because it “not only gives cable companies
and wireless providers free rein to do what they
like with your browsing history, shopping habits,
your location and other information gleaned from
your online activity, but it would also prevent the
Federal Communications Commission from ever
again establishing similar consumer privacy
protections”. It can be doubted whether the individual
customer will be able to value his/her privacy
sufficiently in relation to the “benefits”, or the sheer
volume of “user agreements” that are offered.


2.3.1.6 Enter psychology 

1. Big Five in Big Data 

Psychoinformatics (Montag et al., 2016) is a discipline
on the rise. The “Big Five” model has been
a prevalent model for psychological profiling, with
alleged predictive power on human behaviour and
influence. Recently, the Big Five model has been
a driving force in “Big Data” attempts to collect
enough data to reveal patterns from which
predictions about human behaviour become quite
precise. Some findings seem to suggest strong correlations
between Big Five parameters and social
media (e.g., Facebook) data. For example, an average of 68
“Facebook clicks” seemed to be enough (in 2012)
to predict colour of skin, sexual preference, politi- 
cal preference, intelligence, religious belief, use of 
alcohol/tobacco/drugs, or of having divorced par- 
ents, with reasonably high confidence (Grassegger 
& Krogerus, 2018). With more data, the model 
predictions beat the assessments of a person from 
colleagues, friends, parents and spouse. Ultimately, 
the smart phone is an “enormous psychologi- 
cal questionnaire” feeding us (or someone) with 
more and more detail. With more information, 
the prospect is raised that somebody could know 
“more than the informant think they know about 
themselves”. Inherent in this is the assumed ability 
to predict an informant’s response to a condition/ 
situation. 

But it also works the other way around: the user 
data can also be used as a filter to find and track 
down users/individuals with specific personality 
details; providing a method to “profile” people 
without their knowledge. It is claimed that this
has been used recently in political marketing/communication,
by “micro-targeting” based on assessments
of personalities through Big Five and digital
footprints. Political messages are then
organized according to psychometry rather than
demography, e.g., as designated “messages” in the form of
personality-adapted advertisements or “news” (not
necessarily “fake”). “Dark posts” are paid Facebook ads
shown exclusively in the news feeds of users with specific
personalities. It can also be about microscopic vari- 
ations in the same message to accomplish psycho- 
logical effectiveness, headings, colours, captions, 
stills or videos, targeting villages, neighbourhoods, 
or individuals differently. Hence, digital footprints 
become “real humans” with worries, needs, inter- 
ests and addresses. 

2. Cyber psychology in change 

Another issue with possibly unprecedented implications
is how digital
omnipresence leaks into and potentially changes
our psychology as users and operators, e.g. in
terms of increased conformity (Storseth 2013).


2.3.2 The BigCyber sensitization model 

The BigCyber model summarizes the key issues
at large, as illustrated in Figure 2. The model comprises
five different “Janus-faces”, each of which
offers a huge benefit (inside of the dotted pentagon),
as well as a severe downside (outside
of the dotted pentagon) that can be viewed as
an h/d/e threat or vulnerability.

Figure 2. The BigCyber sensitization model.

BigBrother(s) may offer comfort and security in 
times of crisis and terror, but are giving themselves 
rather free passes to track down and inflict harm 
on any instance or person that may be regarded 
as a present or potential adversary. There are no
international agreements on ethical conduct of 
cyber offense. 

Personalization and customization of services
offer ease of use. But the downside is that we are
enrolled into the BigOther surveillance capital- 
ism (Zuboff 2015) without being properly asked 
or informed; “users” are transformed to products 
and monetized behavioral commodities in a digital 
economy. 

BigData coupled with artificial intelligence and 
machine learning promise an endless range of new 
insight and capabilities, but these are not reserved 
for the “good” purpose. What if the key ideas of 
“insurance” are jeopardized? Intelligent offense 
towards CI and ICT systems is as likely as intel- 
ligent defense. 

The BigFive personality model can probably 
make us even more comfortably numb while effort- 
lessly harvesting the benefits of cyberspace. Will we 
be able at all to resist the narrowed “alternatives” 
presented? Will we develop a “cyber psychology” 
that enables us to recognize and deal with commer- 
cially and politically motivated communication? 

This is also about an aggregated, unevenly dis- 
tributed digital economy and power. BigOther will 
have supreme power to utilize BigData as well as 
BigFive. Every cyber innovation is up for sale,
and BigOther is loaded with cash and ready to buy 
any advantage and “edge” available. 

Societies, organizations as well as individuals are
always hungry for the BigInn(ovation). The lack
of basic assurances is hardly noticed, except for
the invitation to become an “update junkie”, and
the fact that the computer industry is not subject to anything
near the liability issues that, say, automakers
or pharmaceutical industries must consider. Also,
outsourcing with fragmented managerial accountabilities
is too easy an escape when ambitions of
digitalization exceed the available competence to deal
with the vulnerabilities.

The BigCyber model supports understanding of
exposure to unfamiliar intentions and motives, and 
of new attack surfaces and vectors, e.g., cyber-phys- 
ical impact, small and large, massive profiling, crime 
and intrusion and an endless stream of “zero days”. 


3 EXAMPLE: SMART METERING 


A smart meter is a physically separate device 
designed with encrypted communication between 
the energy supplier and the customer for regular 
metering on, e.g., an hourly basis. In one way, the
Smart Meter is just another Industrial Control
System (ICS) or Supervisory Control and Data
Acquisition (SCADA) system that is a precondition
for even conceiving the idea of a smart
grid, depending on control functions and measure- 
ments at an unprecedented scale. 

As illustrated in Figure 3, the smart meter also 
has a “private” physical connector (in Norway 
denoted a “Home Area Network” (HAN) port) 
that enables third parties, e.g., those providing “smart
home” solutions, to read metering data as part 
of their (innovative) services, connected to the 
internet. However, do we understand the potential 
threat landscape of this? 

Smart metering is an entry point to the huge
challenges of protecting the energy grid as a critical
infrastructure. Concerns can be raised, independent
of whether the connection is physical or
not, about both unauthorised access and whether
the end user grasps the implications of granting
additional connections.

The attractiveness of the electricity system for
cyber attacks was demonstrated in Ukraine (2015
and 2016). Disruptions may range from massive
shutdown (leading to imbalance and potential
physical damage) to just poor quality (voltage/frequency).
The potential for targeting vast numbers
of smart meters simultaneously has been demonstrated
(Hansen et al. 2017). We have yet to experience
the full damage potential, but in the UK, MPs
have been warned of a sabotage threat from smart meter
hackers, with experts saying that rogue programmers could
target the £11bn system. A massive shutdown would put
enormous strain on both the supplier and the consumer
side (Financial Times, 2016).

Figure 3. The Smart Meter as part of the Smart Grid. (Panel labels: “(Smart) energy system in transformation, increasingly dependent on SCADA/ICS”; “Smart meter”.)


3.1 The microcosmic threat landscape of ICS 


We start by illustrating the potential of the threat
landscape approach at a very small scale. Resting
on a vocabulary similar to that employed by the European
Union Agency for Network and Information
Security (ENISA), the smart meter, seen as an ICS/
SCADA system, can be depicted as a “microcosmic”
threat landscape in its own right (Figure 4).
The approach is illustrated in terms of a workshop
assessment of an industrial SCADA system in
a networked context. The actual “squared” threat
pictures (left side of Figure 4) are selected and
derived from a similar approach by ENISA. The
actual threat landscape composition was conducted
as part of the (1st Annual) Workshop on Cyber
Safety, Security and Resilience of Critical Energy
Infrastructures, Oslo, Norway, June 2016. Here, each
threat picture was elaborated before being combined into
the landscape. Both the contents of each “frame”
or “horizon”, and their overlaps, turned out to be
surprisingly complex, and added weight to the suspicion
that the ideal design does not cover every vulnerability
in this (microcosmic) threat landscape.


3.2 The BigCyber-sensitized threat landscape 


Can we conceive a bigger picture, a BigCyber-sen- 
sitized smart meter threat landscape? 

Figure 4. “Microcosmic” ICS/SCADA Threat Landscape. (Threat pictures: aging and mix of old and new hardware/software; user competence and awareness; Internet of Things; complex web of actors (complex value chains); intentional acts (sabotage and criminal intent).)

In addition to necessary functionality for building
smart grids, the smart metering solution is
also an excellent example of a connection to “Big
Other”. The joint access to the HAN port is also
a source for building information about “energy 
behavior” with a huge commercial potential, espe- 
cially when it is linked to other sources of individ- 
ual and commercial behavior that can be used to 
profile targeted individuals or groups. The privacy 
issues are imminent, but a hostile “BigBrother” 
may in the ultimate case also weaponize this to 
trigger collective irregular consumer behavior, and 
target key personnel, with the intention of disturb- 
ing the energy system per se. Another possibility 
may be conceived through the infamous Stuxnet 
attack; either by (1) disturbing the crucial grid 
measurements in order to destabilize trust in grid 
operation, or (2) initiate (cyber-)physical damage 
by imposing electrical imbalances. 

Hence, we may see the contours of new attack 
surfaces and vectors of both tangible and intangi- 
ble kinds, that can be combined and cleverly orches- 
trated. Vulnerable equipment can be attacked, 
users and populations can be manipulated and 
influenced, and key personnel in protection of 
critical infrastructure services could also be specifi- 
cally targeted as part of an orchestrated attack, e.g. 
with a criminal intent. For the defenders, a main 
vulnerability is the lack of acknowledgement of 
the coupling. 

In Figure 5, a BigCyber-inspired Threat Land- 
scape for smart metering is indicated. The pros- 
pects of “clinical” attack vectors, triggering of user 
behavior as part of attack, optimization of dam- 
age and targeting of key personnel on the inside, 
are simply not refutable one by one. Maybe not 
even in combination. 


3.3 Weak signals in sight?


Figure 5. “Smart metering paranoia”. (Figure text: smart/net metering paranoia: “clinical” attack vectors; triggering user behavior as part of attack; optimise pain/damage; target key personnel on inside.)

The “metering paranoia” threat landscape (Figure 5)
is hypothetical. Are there weak signals that
support the likelihood that it may become
reality?


• Banks in Asia are already using customers’
smartphone data points, like how (often) they
drain their battery, to determine whether they’re
eligible for a loan (CNN, 2016). Can electricity
metering derived from smart houses contain
behavioral data that could be matched with a
personality assessment?

• Will customers care to use the new European
privacy legislation to demand insight into smart
metering data? Would the trivia of energy consumption
draw the necessary attention?

• Energy companies are now increasingly concerned
about disruptive competition. Who will
take the lead in offering homes and companies the
dual role of producer and consumer, utilizing
solar, wind and (virtual) batteries, e.g. in electric
cars, to optimize energy consumption in a
market in which, e.g., consumption-based pricing
is replaced by capacity-based pricing? Will
access to personal energy consumption be part
of the “price”? Who will have the data edge in
a new market environment? Will we see similar
dynamics as when the “Flash Boys” (Lewis 2015)
changed the stock markets by means of getting
split-second advantages over other actors?

• The grid, however “smart”, will still need some
supervised electrical stability. Who will be
responsible for managing this, given the potentially
severe consequences in terms of physical damage
to electro-mechanical equipment? If, for example,
Google offers an “integrated” energy system to
“prosumers”, would they care about the grid? If
the risk towards the grid level of service is relocated,
who will be in charge?


4 DISCURSIVE SUPPORT FOR 
POLYCENTRIC GOVERNANCE (PCG) 


The challenges described above go beyond the
limits of safety and security as traditional disciplines.
Petersen (2012) argues that we need an analytical
approach “sensitive to conceptual change
and diversity” that “enable us to identify innovations
in political language” and “provide us with
the ability to grasp new developments in the corporate,
governmental or organizational conception
of risk”. There is thus a need for a step change in
the way societies and organizations deal with cyber
risk, from fragmented to polycentric risk governance
(PCG).

The threat landscape metaphor and the BigCyber
sensitization model provide discursive
support for PCG. For example, in the smart metering
case, a regulator can be aware of privacy challenges,
but must reach a risk-informed assessment,
based on current knowledge, without jeopardizing
the objective of transforming the grid. We may
expect that the regulator takes part in a broader,
responsible discourse, but we cannot expect them
to be voluntarily taken hostage for issues beyond
their primary mandate. Hence, we need a discursive
framework that sensitizes not only academics
and analysts, but also the actual decision makers/
processes to the very same issues.

Petersen (2012) claims that “a conceptual discourse
does not exist by itself; rather, it will always be
defined in interaction with other discourses”. Hence, 
a discursive framework will have to be designed with 
the following in mind: the users of the framework 
will come from unique and different “home dis- 
courses”, and we should enable a sustained reso- 
nance between the “home” and the joint discourse, 
as participants move along their unique trajectories. 

As indicated in Figure 6, several stakeholders,
each bound to a home discourse, e.g., of
privacy and consumer rights or of facilitation of
smart grid development, can join forces, overlap
horizons, share threat landscapes (TL) and challenge
themselves and others by using the Generic
BigCyber discursive model, which is the simplistic
(and recursive) formula


TL → BigCyber → TL′,


as exemplified in chapter 3.

Using this discursive framework will contribute to 
an improved coherence between decisions made by 
different stakeholders. The “lay” perspective may be 
voiced through civic participation, NGOs, or prox- 
ied through agencies of consumer rights and privacy. 


5 FROM PCG TO POLYCENTRIC 
RESILIENCE 


During the past decade, the safety field as well
as the societal security and disaster fields have
devoted attention to the concept of “resilience”.


However, the notion of cyber resilience demands 
more than a technological fix. Human and organi- 
zational issues are more inert than the technologi- 
cal, and also for cyber resilience we must respect 
the double-hermeneutic scientific principle of 
understanding understanding subjects, rather than 
explaining them as objects. 

It is important that the concept is properly con- 
textualized. Though it sounds normatively good, 
it carries no guarantee for success. It is an attrac- 
tive idea that invites fallible practices, and hence 
it must be brought under managerial supervision, 
accountability and mandate. If not, we may invite 
expectations that will victimize those that are not 
able to thrive from being exposed to risk, or that 
do not possess the resources or skills in the first 
place. 

We must take the notion of resilience seriously
without depriving it of its content and
origins through mere re-labelling of traditional
risk management practices. Resilience is ultimately
a matter of emergent, “bottom-up” and
situated solutions to unique and idiosyncratic
demands and situations rather than instrumental
responses to stereotypical replications of former
situations.

By implication of the above, cyber resilience 
must be translated to a scheme of composite pro- 
tection comprising a diverse set of (resilient) enti- 
ties that can be orchestrated to a certain degree. 
Grotan and Bergström (2016) propose a theoretical
foundation for exploring the concept of resilience
landscapes: autonomous but interconnected
resilient entities that form a composite scheme of
resilience. Such entities can utilize the same discursive
structure as for PCG (Figure 6), and the evolving
threat landscape can be a basis for dynamic
interfaces and interactive patterns.


6 CONCLUSION 


The threat landscape metaphor and the BigCyber
sensitization model form a promising approach
that makes sense in the smart metering case and
carries a potential for further application to the
emerging cyber threat landscapes. The notions of
polycentric governance and polycentric resilience 
landscapes are logical companions to the former, 
and both can benefit from the discursive support 
structure presented.

Safety principles for autonomous driving

ABSTRACT: In safety technology, the application of safety principles (e.g. fail-safe or safe-life) is used
to design and implement a safe system that eventually fulfils the requirements of the functional safety
standards. Safety principles have already been described and applied to guided transport systems, including
systems with immaterial guidance principles. The different responsibilities of the human driver and the technical
driving system at the different automation levels of autonomous vehicles require the application
of safety principles. We consider which safety principles have to be applied, using general safety principles
and analysing the relevant SAE level based on the experience from projects. For the five levels of automated
driving as defined by the SAE, safety principles are derived. For the levels 0-2, the driver is fully
responsible for driving, whereas starting from level 3, the automated driving equipment monitors the
vehicle. Giving the driver the possibility to intervene means that this must be implemented according to
the relevant safety integrity level and that the driver must have enough time to take over control. The latter
strongly depends on the level of automation, the speed and the environment in which the vehicle
moves. Depending on the level of automation, the technical systems are implemented as fail-silent or as
safe-life. There are also exceptions, e.g. when the technical systems can be implemented as fail-safe. This is
possible when the vehicle can always be brought to a safe stop, e.g. when driving at low speed and on
a controlled territory.


1 INTRODUCTION 


Autonomous driving has become a very important
subject of research and of first pilot projects. In
safety technology, the application of safety principles
such as fail-safe or safe-life is a very important
tool to design and implement a safe system that
eventually fulfils the requirements of the standards
for functional safety. Safety principles have already
been described and applied to guided transport
systems, including systems with immaterial guidance
principles.

In earlier papers, safety principles have been 
described and later applied to guided driving. 

In the present paper, we systematically consider
which safety principles have to be applied for
which SAE level of autonomously driving systems,
and we show how an autonomous system could be
built. This is partially done with the help of general
safety principles, partially by analysing the
relevant SAE level based on the experience from
several projects.

According to UN resolution /UN/ or SAE /SAE/,
autonomous driving on the road is divided into the
following levels:


• 0 No automation
• 1 Driver assistance
• 2 Partial automation
• 3 Conditional automation
• 4 High automation
• 5 Full automation


For the levels 0-2, the driver is fully responsible
for driving; starting from level 3, the automated
driving equipment monitors the vehicle.
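
The split of responsibility over the levels can be written down as a small lookup. The following Python fragment is only an illustrative sketch; the names `SAELevel` and `monitoring_party` are hypothetical and not taken from the SAE document.

```python
from enum import IntEnum


class SAELevel(IntEnum):
    """Automation levels as listed above (identifier names are illustrative)."""
    NO_AUTOMATION = 0
    DRIVER_ASSISTANCE = 1
    PARTIAL_AUTOMATION = 2
    CONDITIONAL_AUTOMATION = 3
    HIGH_AUTOMATION = 4
    FULL_AUTOMATION = 5


def monitoring_party(level: SAELevel) -> str:
    """Return who is responsible according to the text:
    the human driver for levels 0-2, the automated equipment from level 3."""
    if level <= SAELevel.PARTIAL_AUTOMATION:
        return "human driver"
    return "automated driving equipment"
```

For example, `monitoring_party(SAELevel.PARTIAL_AUTOMATION)` yields the human driver, whereas level 3 and above yield the automated driving equipment.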

This different responsibility of human driver
and technical driving system requires the application
of safety principles.

We start with a very simple and abstract model 
of the system and show that there exist different 
possibilities to implement autonomous driving. 
An important result is that an arbiter needs to be 
installed that gives the human driver the possibility 
to override the decisions of the autonomous sys- 
tem to fulfil legal requirements. 

For the five levels of automated driving as
defined by the SAE (2016), safety principles
are derived. For the levels 0-2, the driver is fully
responsible for driving, whereas starting from level
3, the automated driving equipment monitors the
vehicle. Giving the driver the possibility to intervene
means that this must be implemented according
to the relevant safety integrity level and that the
driver must have enough time to take over control.
The latter strongly depends on the level of automation,
the speed and the environment in which
the vehicle moves.

Depending on the level of automation, the technical
systems are implemented as fail-silent or as
safe-life. There are also exceptions, when the technical
systems can be implemented as fail-safe because
the vehicle can always be brought to a safe stop,
e.g. when driving at low speed and on a controlled
territory.

We consider the two main functions of guidance
and braking / acceleration and their role for
autonomous driving. Moreover, detection of and
reaction to fixed and moving obstacles
are discussed.

Two basic requirements for autonomous systems
are that they need to be developed according to the
relevant standards of functional safety, fulfilling an
ASIL (or SIL), and that the capability of the
autonomous driving system must be at least on the
same level as that of a human driver.

We note that Wachenfeld (2016) has proposed a
stochastic approach to show that an autonomous
system fulfils a certain level of performance or
safety. This, however, can only be seen as additional
evidence; the main evidence for a safe system is
an appropriate safety architecture implemented
according to the rules of functional safety, see ISO
26262.


2 GENERAL SAFETY PRINCIPLES 


In this chapter we briefly recall the main
safety principles, see Gülker & Schabe (2006)
and Gayen & Schabe (2008).

Fail safe: If the system has a safe stopping state,
i.e. a stable safe state in which it is not operational
and which can be reached fast enough,
then the fail safe principle can be applied. It means
that the system is brought into this state if a failure
occurs which cannot be tolerated. This principle
can be implemented as inherent fail-safety, reactive
fail-safety or composite fail-safety.

Safe life (fail operational): If the system does not
have a safe stopping state which can be reached fast
enough, then the safety function has to be ensured.
This is mainly done by using redundancies.

Fail silent: The fail silent principle is applied
to a function the loss of which is tolerable since it
is either an assistance function or the function is
implemented in several instances. Then, failure of
the function must be such that there is no repercussion
on the safe functioning of the system. That
means that a fail-silent system must detect its failures
and possible dangerous states and switch itself
off without influencing other systems in a dangerous
way.
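
The three definitions above can be read as a simple decision rule. The sketch below is only an illustration of that reading; the function name, the two boolean inputs and the order in which the conditions are checked are assumptions made here, not part of the cited principles.

```python
from enum import Enum


class Principle(Enum):
    FAIL_SAFE = "fail-safe"
    SAFE_LIFE = "safe-life (fail-operational)"
    FAIL_SILENT = "fail-silent"


def select_principle(loss_tolerable: bool,
                     safe_stop_reachable_in_time: bool) -> Principle:
    """Pick a safety principle for one function (hypothetical helper).

    loss_tolerable: the function is an assistance function or is
        implemented in several instances, so losing it is acceptable.
    safe_stop_reachable_in_time: a stable, non-operational safe state
        exists and can be reached fast enough.
    """
    if loss_tolerable:
        # The function may simply switch itself off without repercussions.
        return Principle.FAIL_SILENT
    if safe_stop_reachable_in_time:
        # The system is brought into its safe stopping state on failure.
        return Principle.FAIL_SAFE
    # No safe stopping state in time: the function must keep working,
    # which is mainly achieved through redundancy.
    return Principle.SAFE_LIFE
```

Checking tolerability of loss first is a design choice of this sketch; a real safety analysis would justify the classification per function.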


3 ABSTRACT MODEL OF THE SYSTEM 


Lotz (2017) proposed an architecture consisting 
of three levels: a navigational level, a manoeuvring 
level and a controller level. We will try to discuss a 
model that is as simple as possible. 

For systems that drive automatically, partially
automatically or autonomously, we will use the
following very simple structure for the system. In
fact, this system must be equipped not only with
a human driver, but also with a technical driving
system that carries out the driving.

The vehicle consists of driving sub-systems such as
steering, braking, acceleration systems etc. in a
very abstract manner. These sub-systems could even be
very simple systems such as a purely mechanical steering
system, pneumatic brake systems etc. The driving
is carried out by the human driver using these
driving sub-systems directly.

The manoeuvring and navigational levels according
to Lotz (2017) have here been combined in one
system (human driver / technical driving system).

If a vehicle is to be operated by a technical system
which does the driving in place of the human
driver or supports the human driver, then this system
must have access to the driving sub-systems.
This is possible only using a driving controller and
actuator. That means that these types of systems
must be present in the vehicle to allow for driving
by a technical system.

This also allows the human driver to access
the driving sub-systems via the driving controllers.

Hence, there are different possibilities to operate
these sub-systems.


a. The driver can directly access the driving sub- 
systems, e.g. the steering wheel is mechanically 
connected to the steered axle. 

b. The driver accesses the driving sub-system via 
a controller and an actuator which operate the 
sub-system electronically. A typical example for 
such a system is an electric parking brake. 

c. The technical driving system accesses the driving
sub-system via a controller and actuator.


We will not consider the aspects of how the driver 
should best access the driving sub-systems yet. 
We need to distinguish two situations: 


a. driving on an open road and 
b. driving on private territory 


Without going into details we must be aware
of the fact that for driving on an open road, the
Convention requires a driver to be always present,
which is implemented in the national law of almost
all countries. For driving on private territory,
traffic law is not applicable: the car would be a
moving machine. Nevertheless, here too, safety
requirements have to be obeyed.


4 SAFETY INTEGRITY LEVELS 


Whenever a function might lead to harm, i.e. injury
or fatalities to persons, material damage or damage
to the environment, functional safety has to be
applied. That means that the risk arising from a
possible functional failure must be reduced to an
acceptable level.

For this purpose, safety integrity levels are defined.
According to ISO 26262 this can be QM or ASIL A
to ASIL D, with ASIL D being the most stringent.
This system has to be applied for road vehicles with
a gross weight of up to 3.5 t. For heavier vehicles
(still) IEC 61508 would be applicable, which defines
the safety integrity levels 1 to 4. Also, for moving
machines, IEC 61508 is applicable.

In practice this means that for all driving functions
and all driving sub-systems, the necessary
safety level (ASIL or SIL) has to be determined
using a risk analysis.
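
The choice of standard described above can be sketched as follows. This is a minimal illustration of the rule as stated in this section only; the names `SafetyStandard` and `applicable_standard` are hypothetical, and the sketch does not perform the risk analysis that determines the actual ASIL or SIL.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SafetyStandard:
    name: str
    levels: Tuple[str, ...]  # ordered from least to most stringent


ISO_26262 = SafetyStandard("ISO 26262", ("QM", "ASIL A", "ASIL B", "ASIL C", "ASIL D"))
IEC_61508 = SafetyStandard("IEC 61508", ("SIL 1", "SIL 2", "SIL 3", "SIL 4"))


def applicable_standard(gross_weight_t: float,
                        is_road_vehicle: bool = True) -> SafetyStandard:
    """Select the standard as described in the text: ISO 26262 for road
    vehicles up to 3.5 t gross weight, IEC 61508 for heavier vehicles
    and for moving machines (e.g. on private territory)."""
    if is_road_vehicle and gross_weight_t <= 3.5:
        return ISO_26262
    return IEC_61508
```

A passenger car of 1.8 t on the open road would thus fall under ISO 26262, while the same vehicle treated as a moving machine on private territory would fall under IEC 61508.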


5 LEVEL ANALYSIS 


In this section we analyse the levels (SAE
(2016)) of automation and draw conclusions
for the safety architecture of a vehicle.

In levels 0-1, execution of steering and acceleration
and deceleration is the responsibility of the
human driver, the driver is responsible for monitoring
and the technical system is able to support
some driving modes (level 1).

That means the human driver is doing the driving
and the technical driving system can only add
some supporting functionality, such as warning the driver
or reacting in cases when he is not able to react (emergency
brake assistant). This means that the technical
driving system must be fail silent, i.e. upon
failure of this system the driving behaviour of the
vehicle must not be influenced or only influenced
in such a manner that safe driving is still possible.
The driver should be warned if such an assistance
system fails to work.

In automation level 2, the system takes responsibility
in some driving modes. The human driver
monitors the technical driving system and is the
fall-back solution. That means that all technical
systems are pure assistance systems and that

R1. The driver must have the technical possibility
to interfere, i.e. to override the technical
systems. That means that each controller
for acceleration, braking and steering that
receives signals from the human driver and
from the technical driving system must have
a voter which always gives priority to the
driver. In fact this means that an electronic
control system needs to be present for these
functions that has an ASIL that coincides with
that of the function; mainly this would be ASIL D.
This control system then must have a priority
input for the driver and another non-priority
input for the technical driving system. The
relevant driving controller must detect when
the driver wants to override the technical driving
system and has to carry out the required
reaction.

R2. The driver must have enough time to detect
wrong or faulty behaviour of the technical
driving system, to react and to be able to bring
the vehicle back to a safe driving state. That
means that the controllers have to limit the
influence of the technical driving system,
e.g. limit the level of acceleration, deceleration
and the steering angles or angular speed
and angular acceleration and jerks so that the
driver always has the time to react. Moreover,
the driver must be trained for this function
or the controllers must be designed in such a
manner that they give enough time for reaction
for any driver.
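
Requirements R1 and R2 can be illustrated with a minimal sketch of one control cycle of such a driving controller. All names (`Limits`, `arbitrate`), the use of a single normalised command value and the numeric limits in the example are assumptions for illustration; a real controller would be developed to the required ASIL and would handle each actuator separately.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limits:
    max_abs_command: float  # largest magnitude the technical system may request
    max_step: float         # largest change allowed per control cycle


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def arbitrate(driver_cmd: Optional[float],
              system_cmd: float,
              previous_cmd: float,
              limits: Limits) -> float:
    """One cycle of a voter with a priority input for the driver.

    driver_cmd is None when the driver gives no input in this cycle.
    """
    # R1: the priority input always wins over the technical driving system.
    if driver_cmd is not None:
        return driver_cmd
    # R2: the influence of the technical driving system is limited in
    # magnitude and in rate of change, so the driver has time to react.
    bounded = clamp(system_cmd, -limits.max_abs_command, limits.max_abs_command)
    return clamp(bounded, previous_cmd - limits.max_step, previous_cmd + limits.max_step)
```

With `Limits(max_abs_command=0.5, max_step=0.1)`, a sudden request of 2.0 from the technical driving system starting from 0.0 is reduced to 0.1 in the first cycle, while any driver input is passed through unchanged.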

Level 3 differs from level 2 in just one point. The technical driving system is responsible for monitoring of the driving equipment. That means that the system must diagnose itself and the environment in order to decide whether it can go on with driving or whether the human driver must act as a fall-back solution. The following requirements are important:

R3. A clear handshake must be defined between human driver and technical driving system. Either the technical driving system must go on functioning until the human driver has accepted to take over control, or

R4. A certain time of e.g. one second is foreseen for the human driver to take over control at any time, if the technical driving system asks him to do so.

In the first case, the technical driving system must be safe life; in the second case, the latency time for the human driver to take over must be ensured by technical systems, either by the safe life property or just by the driving situations and speed. Timing considerations can be found in Vogelpohl et al. (2016).
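
The two handshake variants R3 and R4 can be sketched as a small state machine. This is an illustration only: the names `Mode` and `Handover` are hypothetical, and the `SAFE_STOP` reaction to an expired time budget is an assumption made for the sketch, since the text only requires that the latency time be ensured.

```python
from enum import Enum, auto
from typing import Optional


class Mode(Enum):
    AUTOMATED = auto()
    TAKEOVER_REQUESTED = auto()
    MANUAL = auto()
    SAFE_STOP = auto()  # assumed reaction if the time budget expires


class Handover:
    """budget None models R3 (wait for acceptance); a number models R4."""

    def __init__(self, takeover_budget_s: Optional[float] = None):
        self.takeover_budget_s = takeover_budget_s
        self.mode = Mode.AUTOMATED
        self._elapsed = 0.0

    def request_takeover(self) -> None:
        # The technical driving system asks the driver to take over.
        if self.mode is Mode.AUTOMATED:
            self.mode = Mode.TAKEOVER_REQUESTED
            self._elapsed = 0.0

    def step(self, dt: float, driver_accepted: bool) -> Mode:
        # Outside a pending request nothing changes.
        if self.mode is not Mode.TAKEOVER_REQUESTED:
            return self.mode
        if driver_accepted:
            self.mode = Mode.MANUAL
            return self.mode
        # The system keeps driving while it waits for the driver.
        self._elapsed += dt
        if (self.takeover_budget_s is not None
                and self._elapsed >= self.takeover_budget_s):
            self.mode = Mode.SAFE_STOP
        return self.mode
```

With `Handover()` the system keeps driving until the driver accepts, which is why it must be safe life; with `Handover(takeover_budget_s=1.0)` it has to remain safe only for the foreseen second.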

Levels 4 and 5 are even more advanced. The dif- 
ference between levels 4 and 5 is relatively small, 
since the distribution of responsibilities is the same, 
only in level 4 some driving modes are excluded 
which allows the technical driving system to have 
limited capabilities. However, when this system is
active, it must be able to take full responsibility. 

As a consequence, the technical driving system 
must always ensure safe driving and would need to 
be safe life. 

The relevant requirements are derived from the so-called GAME principle, which can be found e.g. in EN 50126: “All new guided transport systems must offer a level of risk globally at least as good as the one offered by any equivalent existing system”. Here we apply this statement on guided transport systems to an autonomously driving vehicle. We compare the classical vehicle with a human driver with an autonomously driving vehicle. Then, there are two aspects to be considered:


a. Performance and

b. The technical system (vehicle and technical driving system) being sufficiently free from dangerous failures.


Both aspects are considered separately. For performance, the technical driving system must be at least as good as a human driver in the relevant driving situations, see Mazzega et al. (2016). If this cannot be ensured for all driving situations, the set of relevant driving situations must be limited and the human driver must handle the most complex ones. The second aspect, safety, can be handled as for any technical system by defining an appropriate safety level (ASIL or SIL). This leads to

R5. The performance of the technical driving system, such as reaction time, detection and handling of traffic situations etc., with an un-failed system must be at least as good as that of a human driver.

R6. The technical driving system must be developed according to a reasonable SIL / ASIL.

For level 4, a clear handshake must be defined for how to pass over responsibility between technical driving system and human driver. In particular, the driving modes must be defined in which the technical driving system must not be used for reasons of e.g. insufficient performance. The handshake must be carried out either during standstill, or the technical driving system must inform the driver early enough that it wants to pass control to the driver, and the driver must take responsibility. If the driver does not take over, the technical driving system must still have the possibility to stop the vehicle as long as it is in a driving mode where automatic driving is allowed and possible.

If the driver passes responsibility to the technical driving system, he must retain responsibility until the technical driving system informs him that it has taken over responsibility.

When driving on an open road, the driver must be in full responsibility of the driving behaviour of the vehicle, see the Convention. Then, even if the technical driving system is able to perform up to SAE level 5 with the necessary safety integrity, the driver must have the possibility to intervene. So, the requirements R1 and R2 mentioned for SAE level 2 hold if driving on an open road.

Autonomous driving, i.e. driving without intervention of a human driver, is in fact only realized in SAE level 4 (partially) or 5 (completely). This holds even if the laws require a driver to be present.


6 IMPLEMENTATION 


In this section, we will discuss possible implemen- 
tation schemes and principles. 

It is clear that for technical driving systems in 
levels up to 2 the systems must and can be fail silent 
and R1 and R2 must be fulfilled to ensure that the 
driver has the possibility to take over control. 


6.1 Arbitration between human driver 
and technical driving system 


Comparing this with Figure 1 it becomes clear that 
arbitration between the commands of the human 
driver and the technical driving system must take 
place. 

There are different levels on which arbitration 
can take place: 


a. driving subsystems
In this case the force applied by the driving controller and actuator must be so small that the driver can always overrule it without a problem. However, he would either be required to switch off the driving controller and actuator manually, or those systems need to have a built-in function to detect the interference of the driver and switch themselves off.

b. driving controller and actuator
Here, the driving controller has two inputs with different priority. The high priority input is used by the driver, the low priority input by the technical driving system. The arbitration is done by the driving controller, which detects overruling by the human driver and switches off the input from the technical driving system. Many controllers in modern cars (brake controller, steering controller etc.) have an additional input for assistance systems which fulfils just this requirement. This approach assumes that the human driver himself controls the vehicle by x-by-wire via the relevant controllers.

c. technical driving system
Arbitration is between the human driver and the technical driving system. If the human driver overrules the technical driving system, the latter does not generate its own control signals but simply transfers the signals of the human driver to the driving controllers.

The choice of one of the approaches is a choice of the manufacturer of the vehicle. However, this choice influences the suppliers of the driving controllers and actuators. They need to implement different architectures in their controllers.

In case a) they need to detect intervention of the human driver and deactivate the actuator.

In case b) they need to have two inputs with different priority and need to carry out arbitration.

In case c) only one input is necessary and no arbitration is necessary.

We see that x-by-wire is a necessary precondi- 
tion for solutions b) and c). 
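
Case b) can be illustrated by a two-input arbiter inside the driving controller. The sketch below is an assumption-laden illustration, not a production design: the function `arbitrate`, the use of `None` for "no signal" and the `override_threshold` used to detect driver intervention are all hypothetical.

```python
from typing import Optional


def arbitrate(driver_cmd: Optional[float], system_cmd: Optional[float],
              override_threshold: float = 0.05) -> float:
    """Return the command to apply; the driver input has priority.

    Commands are normalised set-points (e.g. brake demand); None means
    that the corresponding input delivers no signal.
    """
    # Priority input: a driver command above the threshold counts as an
    # override and switches off the input of the technical driving system.
    if driver_cmd is not None and abs(driver_cmd) > override_threshold:
        return driver_cmd
    # Non-priority input: used only while the driver does not intervene.
    if system_cmd is not None:
        return system_cmd
    # No system command: pass the (small) driver command or stay neutral.
    return driver_cmd if driver_cmd is not None else 0.0
```

In case c) the same decision would be taken inside the technical driving system, so that the driving controller needs only one input.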

We will guide ourselves by the requirements 
for SAE levels 5 and 2 combined, i.e. for a fully 
autonomous driving vehicle with a possibility for 
the human driver to take over responsibility at any 
time. 


6.2 Technical implementation 
with safety principles 


For the technical implementation we will discuss 
the controllers and the technical driving system 
itself. 

Regarding the driving controllers, they need to 
be developed and implemented according to an 
adequate SIL. This is for the brake (ABS / ESP) 
mainly ASIL D, for the steering ASIL B...ASIL 
D, depending on the function of this controller. 
With such a choice most of the vehicles can per- 
form with velocities up to 250 km/h. 

First of all, we need to determine whether there exists a safe stopping state that can be reached sufficiently quickly. Assume the velocity of the vehicle is limited to a value v, the braking deceleration is a and the reaction time is t; then the vehicle will stop within a distance of

$$ s = v\,t + \frac{v^{2}}{2a}. $$

Assuming that the steering has no limitation, stopping the vehicle will be a safe action if there is no obstacle within a distance of s from the outer boundaries of the vehicle. This area can be made even smaller by taking into account

• the actual direction of steering and the (physical) limitations of changes of the steering angle and

• the physical limitations for changing the driving direction.


In such a case, the technical driving system and the driving controllers could be a completely fail-safe system, stopping the vehicle in case a failure is detected.

Depending on the free space around, the vehicle 
speed is determined. Obviously, the less free space 
available, the slower the vehicle must drive. 
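
As a worked illustration of how the permissible speed follows from the free space, one can read the stopping distance as reaction distance plus braking distance and solve for v. The numbers used (t = 1 s, a = 5 m/s², s = 20 m) are purely illustrative assumptions and are not taken from the analysis.

```latex
% Stopping distance: reaction distance plus braking distance
s = v\,t + \frac{v^{2}}{2a}
% Positive root of the quadratic v^2 + 2 a t v - 2 a s = 0:
v_{\max} = a\left(-t + \sqrt{t^{2} + \frac{2s}{a}}\right)
% Illustrative numbers: t = 1 s, a = 5 m/s^2, s = 20 m
v_{\max} = 5\left(-1 + \sqrt{1 + 8}\right) = 10\ \text{m/s} = 36\ \text{km/h}
```

Halving the free space to 10 m under the same assumptions gives about 6.2 m/s, which shows how quickly the permissible speed drops when less free space is available.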

If the vehicle is intended to move faster, the 
technical driving system and the driving control- 
lers must be safe life, at least as long as the vehicle 
is in motion. 

The implementation using safety principles differs depending on whether we are talking about a road vehicle or a moving machine. In the first case the environment cannot be assumed to be under control; in the second case this can be ensured, since the technical driving system acts on private territory. In this latter case it is much easier to ensure enough free space.

From this consideration it also becomes clear that not all functions must always be implemented with the highest SIL / ASIL. This depends very much on the speed and the environment. If speed is limited by physical or other means, then a lower SIL or ASIL can also be used. In any case this needs to be shown by the risk analysis that has to be carried out based on ISO 26262 or IEC 61508.

The following functions are the main functions 
to be considered: 


• Guidance
How to implement such a function, including the steering, is described in Bouwman, Schabe & Vis (2006). Mostly the steering of the axles needs to be safe life and a safe computer has to be used in the technical driving system to determine the steering angles. Another important function is the determination of the location, for which differential GPS and maps, together with ultrasonic sensors, radar, lasers, cameras or different types of marking physically placed on the lane of the vehicle, can be used. The safe computer will determine the real location, compare it with the location assumed as a result of its steering activities, and correct or stop the vehicle.

• Braking and acceleration
Assuming that the vehicle moves along the desired trajectory, the vehicle needs to start, move and stop. So the vehicle needs to react to these commands. It is important to limit the speed e.g. in curves or at narrow places and to be able to perform an emergency stop if parts of the system fail. In order to perform this function, the system needs to know the location.
Solely with these two functions the vehicle would move without taking into account the environment. Any change in the environment could lead to a collision or to the vehicle leaving its track.
• Reaction to unforeseen events (obstacles)
The vehicle must be able to detect obstacles. By an obstacle we denote any object that is in or near the (planned) trajectory of the vehicle. We need to distinguish fixed obstacles and moving obstacles. To begin with, we consider stopping in front of the obstacle as the only strategy of the vehicle. Moving around the obstacle will be considered later together with moving obstacles.
a. Stationary obstacle: To detect it, the technical driving system needs to have a blueprint of the environment, needs to compare the real environment with that blueprint, and needs to detect differences. This would require certain algorithms for detection and classification of objects. Note that “detection” and “blueprint” do not mean that the technical driving system uses optical means. It can use optical means, but also others, or a combination.
In a first step the obstacle as such needs to be detected. This is possible only at a certain distance and takes a certain time. This performance of the system might limit the speed, since the vehicle must always come to a standstill in front of the object.
In a second step the technical driving system can classify obstacles as small. Note that this classification can be present implicitly if the technical driving system does not detect obstacles of small size. Such a classification is always present due to limitations of the system.
If the obstacle is small enough and not tall, the vehicle might decide to go on driving.
b. Moving obstacles: Moving obstacles must be tracked and their motion must be predicted using the actual position and speed. It must also be taken into account whether the object can accelerate or decelerate or change its direction of motion. The latter factors strongly depend on the nature of the object; e.g. a motorbike can reach other acceleration values than a pedestrian. In order to provide a good prediction, the technical driving system must cluster moving objects according to their capability of motion. Consequently, for each object of the different clusters future positions must be predicted and the technical driving system must define the motion of the vehicle in such a manner that collisions are avoided. This might lead to the decision to stop or to keep the present fixed position.
Depending on the performance of the clustering and prediction algorithms, the technical driving system would behave more or less conservatively. With better algorithms the technical driving system would stop less frequently. We recall that the performance of these algorithms, together with the stopping process in case of doubts about the future trajectory of the obstacle, must be at least as good as that of a human driver. This includes of course strategies to drive around an obstacle.

c. Stationary obstacles that could start moving are in fact a combination of cases a) and b) discussed above. This means that the technical driving system must not only track moving obstacles but must also be able to classify stationary obstacles and provide a judgement on whether they might move, and with which velocity and in which direction. The safest strategy would surely be to stop at a safe distance from any unknown object.
If a proper reaction of the vehicle cannot be ensured for all driving situations, the set of relevant driving situations must be limited and the human driver must handle the more complex ones. This would lead to an SAE level 4 situation. An example would be a strategy where the technical driving system takes over control on a motorway and the human driver in the city.
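
The clustering of moving obstacles by their capability of motion, described under b) above, can be sketched in one dimension along the planned path. Everything in this sketch is an assumption made for illustration: the classes in `MAX_ACCEL`, their acceleration bounds and the function names are hypothetical, and a real system would predict in two dimensions with validated values.

```python
# Assumed acceleration capability per obstacle class in m/s^2 (illustrative only).
MAX_ACCEL = {"pedestrian": 1.5, "motorbike": 6.0}


def reachable_interval(position: float, speed: float, kind: str,
                       horizon: float) -> tuple:
    """Interval of positions the obstacle can reach within `horizon` seconds."""
    a = MAX_ACCEL[kind]
    nominal = position + speed * horizon   # constant-speed prediction
    spread = 0.5 * a * horizon ** 2        # effect of full acceleration or braking
    return nominal - spread, nominal + spread


def may_conflict(vehicle_interval: tuple, obstacle_interval: tuple) -> bool:
    """True if the two position intervals overlap, i.e. a collision is possible."""
    lo_v, hi_v = vehicle_interval
    lo_o, hi_o = obstacle_interval
    return lo_o <= hi_v and lo_v <= hi_o
```

For an obstacle 10 m ahead moving at 1 m/s, a 2 s horizon gives the interval 9 m to 15 m for a pedestrian but 0 m to 24 m for a motorbike, so the same observation leads to a more conservative reaction for the more agile class.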


7 PROBLEMS 


In connection with autonomous driving some problems appear. We will discuss only some of them and try to describe possible technical solutions.


a. Assume an autonomous vehicle cannot prevent an accident and needs to make a choice, e.g. between material damage, environmental damage and injury or, even worse, between injuring or even killing either an older or a younger person, another driver, its own passengers etc., see e.g. EK (2017).
This type of discussion automatically comes up when the responsibility for driving is handed over from the human driver to a technical driving system. The ethical problem behind this discussion cannot be solved in this paragraph. It is obvious that a technical solution to this problem would require distinguishing between persons and objects or animals, discriminating between different persons etc. This would require rather complex algorithms, if it is feasible at all.
The simplest solution to the problem is to apply the principle of driving on sight. That means the rule for the autonomous vehicle would be to drive only at such a velocity that it can stop before each obstacle that appears on the road. This requirement covers:
• Detection of any obstacle above a certain size,
• Prediction of the movement of objects (which is the most complicated part),
• Reducing speed if necessary to come to a standstill before such an obstacle.
Based on such a “safety first” approach, objects of a certain (small) size can later be neglected to ensure performance and to avoid the vehicle stopping in front of a leaf or a plastic bag.
b. Additional information 
A vehicle might optionally use additional infor- 
mation provided by the infrastructure, which 
might lead to better performance regarding 
safety. 
Let us consider the following example. The vehicle uses information from cameras mounted on the street and has the possibility to “look around the corner”. Then it could e.g. detect a child suddenly running out of a house, which a human driver could not.
c. Safety targets 
Since the target of autonomous driving behaviour would always be the performance of a human driver, the technical driving system would have to fulfil this important requirement. However, assume that autonomous systems will set a new target in the future; then the question will arise: Does the driver have the right to switch the automatic system off and decrease the achieved level of safety? It would be somewhat equivalent to a train driver switching off automatic train protection, e.g. to use some speed margins. This simple example shows that the way to autonomous driving would be a one-way street, with no return to manual driving at the end.
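
The driving-on-sight rule from item a) can be sketched as a speed selection that ignores objects below a size threshold and otherwise keeps the stopping distance within the free distance. This is a minimal sketch under assumptions: the obstacle records with `size` and `distance` keys, the `sensor_range` parameter and all function names are hypothetical, and the kinematics assume a constant deceleration after a fixed reaction time.

```python
import math


def max_speed_on_sight(free_distance: float, reaction_time: float,
                       deceleration: float) -> float:
    """Largest v with v*t + v^2/(2a) <= free_distance."""
    a, t, s = deceleration, reaction_time, free_distance
    return a * (-t + math.sqrt(t * t + 2.0 * s / a))


def target_speed(obstacles: list, min_size: float, sensor_range: float,
                 reaction_time: float, deceleration: float,
                 speed_limit: float) -> float:
    """Speed that allows a standstill before the nearest relevant obstacle."""
    # Objects below the size threshold (a leaf, a plastic bag) are neglected.
    relevant = [o["distance"] for o in obstacles if o["size"] >= min_size]
    # Without a relevant obstacle, the free distance is the sensor range.
    free_distance = min(relevant + [sensor_range])
    return min(speed_limit,
               max_speed_on_sight(free_distance, reaction_time, deceleration))
```

With a reaction time of 1 s, a deceleration of 5 m/s² and a relevant obstacle 20 m ahead, the sketch yields 10 m/s; prediction of obstacle movement, the most complicated part, is deliberately left out here.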


8 CONCLUSIONS 


In this paper we have presented some ideas on a possible safety architecture for autonomous driving, deduced from known safety principles and from general requirements. We have analysed the SAE levels and the implications for the safety architecture per level.

Possible implementation principles have been 
described and specific problems of autonomous 
driving have been discussed. 
ABSTRACT: Safety ranks first in the aerospace industry. For this reason, great attention is paid in the fabrication of aircraft components to the precision and flawless execution of all production operations. The aim is to remove the causes of aircraft accidents that stem from errors in the production, assembly and maintenance of aero engines. Detailed research aimed at improving safety was carried out for the company GE Aviation Czech in Praha. It was directed at aero engine production, in which it is necessary to observe the technological procedures and to eliminate the influence of the human factor on the production process. The paper gives results connected with the production of the aero engine protective cowling, specifically the welding of its basic components. It contains a special checklist and the results of a safety audit.


1 INTRODUCTION 


In the aerospace industry, safety comes first. For this reason, great attention is paid in aircraft component production to the precision and flawless execution of all production operations. The aim is to eliminate the causes of aircraft accidents that stem from faults in the production, assembly and maintenance of aircraft engines (GE Aviation 2015). Detailed research aimed at safety improvement was performed for GE Aviation Czech in Praha. It was directed at the domain of aircraft engine production, in which it is necessary to respect the technological procedures and to eliminate the adverse impacts of the human factor on the production process.

In the present paper, partial results of the research are given. These results are connected with the production of the protective cowling, which is located inside the aircraft engine and attached to the outer engine shell that encloses the power turbine. Its task is to protect the aircraft engine against fragments of blades, discs or other parts. On the basis of an evaluation of the information in the technical operation notebooks (GE Aviation Czech 2017a), we concentrated on the welding of the fundamental components of the protective cowling.

For the identification and assessment of risks, risk engineering methods are used, namely process maps, checklists and safety audits (Procházková 2011a). On the basis of the production technological procedure (GE Aviation Czech 2016) and the production technical documentation (GE Aviation Czech 2017a), a detailed checklist for the safety audit was compiled and handed over to the experts of the Company Technical Board for review. After their approval, the safety audit was carried out with the help of this checklist. The outcome for each checklist item was determined as the median of the results obtained from three specialists (company auditor, supervisory department leader, state inspector). The evaluation of the audit results revealed the critical spots of the welding process. Subsequently, measures for improving the safety of both the production and the product were proposed in cooperation with company experts.
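
The aggregation of the three specialists' ratings by the median can be sketched as follows. The checklist items, the rating scale (here assumed to run from 0 to 5, higher meaning better) and the threshold for a critical spot are hypothetical and serve only to illustrate the procedure.

```python
from statistics import median


def audit_outcomes(ratings: dict) -> dict:
    """Median rating per checklist item from the three specialists."""
    return {item: median(scores) for item, scores in ratings.items()}


def critical_items(outcomes: dict, threshold: float) -> list:
    """Items whose median rating is at or below the assumed threshold."""
    return sorted(item for item, score in outcomes.items() if score <= threshold)


# Hypothetical ratings: (company auditor, department leader, state inspector).
ratings = {
    "shielding gas flow checked": [4, 5, 4],
    "welder certification valid": [1, 3, 2],
}
outcomes = audit_outcomes(ratings)
```

Using the median rather than the mean means that a single outlying judgement among the three specialists does not shift the outcome of an item.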


2 SAFETY CULTURE, LOSS PREVENTION 
AND PROCESS SAFETY 


Culture denotes the specific material and spiritual values that humans create by their activities and by which they enhance the life of both individuals and the whole of human society. The culture of a society is an integral system of substances, values and societal norms which the members of the given society follow and which they pass on to the next generation through sharing. It is the collection of values, symbols, company heroes, rituals and own histories that act outwardly and have a big influence on human behaviour in working positions (Procházková 2011b).

On the basis of the culture definition just given, safety culture means that a human in all his / her roles (control worker, employee, employer, citizen or disaster victim) respects safety, i.e. he / she behaves in such a way that he / she does not cause possible risks to be realised and, if a risk is realised, he / she contributes to an effective response, to the renovation of protected interests and to the start of further development. According to some authors, it is a set of attitudes, assumptions, norms and values that exist in a given entity. It is a reflection of the way in which the company is governed, i.e. it includes the general principles of the separation of authority and responsibility, the principles of management, and a certain ratio among the emphasis on working outcomes, authority, care for humans, observance of safety principles and ensuring the functionality of the given entity (Procházková 2011b).

An effective safety culture is the fundamental element of safety management. It reflects the safety concept and is based on the values, attitudes and behaviour of top management and on their communication with all persons involved. It makes participation in solving safety problems a matter-of-course obligation, and it promotes that all persons involved act safely and observe the appropriate legal rules, standards and norms. The safety culture rules need to be incorporated into all activities in each entity and in each territory. Their basis is not a concentration on punishing the originators of faults, but lessons learned from mistakes and the introduction of corrective measures such that mistakes cannot be repeated, or at least that their frequency of occurrence is distinctly reduced.

The safety culture principles according to (Česká technologická platforma bezpečnosti průmyslu 2015) are:

1. A frank, open attitude to weak spots, with action directed at finding a solution.

2. A move away from the culture of assigning responsibility for a fault and punishing the person responsible.

3. Employees, employer and top management behave responsibly, both individually and with an orientation to the team. It means that the safety culture is a part of their life.

4. Safety standards are accepted and integrated into everyday company life.

5. Safety and health protection are an important value for both the company workers and the whole company.


The safety culture level is a quantity that cannot be measured directly and exactly; nevertheless, it has a fundamental influence on workers’ behaviour, the management style and the technology level. Identifying the weak and strong features in the individual parts of safety is important for the safety culture level. A comparison of time series of investigations makes it possible to evaluate the effectiveness of corrective measures.

For a company manufacturing aircraft engines, safety culture means that the company undertakes to carry out manufacturing to the highest safety standards. To reach this aim, it is crucial to have in force an effective reporting system, free of disincentives, for all accidents, incidents, near misses, random events and cases, experiences, doubts and further information and data that might adversely affect an element or component of the produced aircraft engine. Each individual employee is not only encouraged but also obliged to report any information concerning safety.

A report is not the subject of any imputation of blame or subsequent retributive measure. Its main purpose is the management of risk and the prevention of accidents and incidents, not the attribution of blame. No action is taken against an employee who reports any data concerning safety by means of the reporting system, unless the report shows beyond any doubt that a criminal act, gross negligence or an intentional and conscious violation of rules or technological procedures has been committed. The method of collecting, recording and disseminating the safety information secures protection to the full extent of the law, including protection of the identity of the person who reports the information concerning safety (Česká technologická platforma bezpečnosti průmyslu 2015). The individual pillars of safety culture are shown in Figure 1.

Figure 1. Fundamental pillars of safety culture (DEPOSITPHOTOS 2014). It is necessary to introduce: open communication; accent on prevention; incident analysis; support of safe behaviour; and to provide resources for safety formation.

In connection with safety culture, the terms loss prevention and process safety are often used in the current professional literature on technologies. We give their definitions here because they are the tools that, in connection with technologies, serve for the protection of persons and property. Loss prevention is the systematic approach to the prevention of accidents, or at least to the reduction of their impacts. It includes the means for the elimination of sources of risk or for the reduction of the probability of their realisation, and for the mitigation of the impacts connected with this realisation (preventive and consequential measures). Further, it includes the identification of suitable supervisory measures and the identification and application of suitable remedial measures, with the help of which a safe entity is ensured, with an appropriate level of security and sustainable development, that does not pose an unacceptable danger to its vicinity (Procházková 2011c).

Process safety, or better the safety of processes, is a branch of safety directed at safety in industry, where there are series of manufacturing and auxiliary processes that are necessary for producing the final product of a given industry. Alongside production, it is concerned with averting the accidents that have special features characteristic of the given industry. It deals e.g. with the prevention of sudden leakages of chemical substances or energies in harmful amounts and, if such leakages occur, with the reduction of their size, impacts and consequences. It does not include the questions of classic occupational safety and protection of workers at work, i.e. it deals with purely technical problems, in which it differs from system safety, which is directed at all public assets.


3 AIRCRAFT ENGINE SAFETY 


The aircraft is a flying transport vehicle that is heavier than air and has a fixed wing. It is a safe system whose basic parts are: the airframe (wing, fuselage, tail surfaces, flying equipment, landing gear); the equipment (life-saving systems, de-icing system, air-conditioning, cockpit outfit, instruments); and the propulsion unit (ULT 2014). A schematic technical picture of an aircraft is given in Figure 2.

For the aircraft to fly, it needs a certain flying speed and angle of attack; otherwise the lift required for flight might be lost. The required flying speed is ensured by the tractive component, i.e. the aircraft engine. The aircraft engine consists of a gas generator (the engine heart), which is composed of a compressor, a combustor and a generator turbine. The compressor sucks in air from the surrounding atmosphere, compresses it and passes it on to the combustor. In the combustor, fuel is injected and burns, and the hot gases produced drive the generator turbine. Via a common shaft, this turbine drives the compressor and thereby obtains the compressed air that is necessary for fuel combustion (GE Aviation 2015). The energy remaining in the hot gases behind the generator turbine is used to drive the free turbine, which is joined to the propeller via a gearbox with a constant ratio; the purpose of the gearbox is to reduce the high rotational speed of the free turbine to a level that is usable for the propeller (GE Aviation 2015).

Figure 2. Schematic picture of aircraft (ULT 2014).

An important role in engine protection is played by the protective cowling (GE Aviation 2015). It is a rigid, i.e. non-rotating, part whose function is to catch possible fragments of blades, discs or other parts. It is located inside the aircraft engine and attached to the outer engine shell that encloses the power turbine. Because it adjoins the combustor, from which the hot gases are directed to the blades of the output turbine, it is designed so that, in addition to absorbing impact loads, it withstands high temperature and pressure (GE Aviation 2015).

For this reason, the protective cowling needs to be capable of absorbing high kinetic energy. Even though a turbine blade has a relatively small mass, the computation (Marešová 2017) shows that the centrifugal force amounts to nearly 16 000 N, which corresponds to the weight of a mass of about 1.6 t.
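
The conversion of the force into an equivalent mass can be checked with a short calculation. The general expression for the centrifugal force is standard mechanics; the blade mass, radius and rotational speed used in the cited computation are not given here, so only the conversion step is evaluated, with g = 9.81 m/s² as the usual assumption.

```latex
% Centrifugal force on a blade of mass m at radius r and angular speed \omega
F = m\,r\,\omega^{2}
% Mass whose weight equals F = 16 000 N, with g = 9.81 m/s^2
m_{\mathrm{eq}} = \frac{F}{g} = \frac{16\,000\ \mathrm{N}}{9.81\ \mathrm{m/s^{2}}} \approx 1631\ \mathrm{kg} \approx 1.6\ \mathrm{t}
```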

From the viewpoint of construction, the protective cowling is always a compromise. On the one hand, there is the requirement for a sufficiently robust construction, so that safety is sufficiently ensured. On the other hand, there is the requirement that its mass be as small as possible.

The analysis of rotor faults (Marešová 2017) shows that fragments of engine parts are responsible for 60% of engine failures (56% fragments of blades, 4% fragments of discs, box, gasket etc.), and that the protective cowling fails to catch only 15% of blade fragments.
4 DATA FOR PROTECTIVE COWLING


From the technical documentation (GE Aviation 2015), it follows that all parts of the hot section of the aircraft jet engine, i.e. also the protective cowling, need to withstand the impacts of high temperatures and high pressures. For this reason, the only type of material that can be used is nickel and cobalt superalloys. This group includes the superalloy Nimonic 80A, which is used for manufacturing the protective cowling.

The fabrication of the complete protective cowling is
a time-consuming process. The protective cowling is
composed of four parts (Figure 3) that are joined
together by welding:


1. Buffer: the most important part of the protective
cowling, which is the first to come into contact
with the blades. Its function is to catch possible
fragments of the high-pressure turbine (blades).

2. Coat: it is the location of the joint (the joint
itself is not part of the protective cowling). The
joint serves to assemble the complete protective
cowling with the engine envelope.

3. Whorl: it positions the buffer for attachment to
the engine envelope.

4. Strut: it sets the coat in the proper position
relative to the buffer.


The manufacturing process of these components
includes a wide spectrum of manufacturing
technologies, as the process map in (Marešová 2017)
shows. They include machining on standard lathes,
bench fitting work, CNC machining on CNC lathes and
CNC milling machines, and bending. Special processes
are also used, such as annealing to relieve internal
stresses, precipitation hardening, and Tungsten Inert
Gas (TIG) welding. Important operations in material
processing are the inter-operational checks of
dimensions and of correct processing. One such
specific check is a special luminescence technique
for identifying cracks and surface defects in the
material. The analyses of manufacturing procedures in
(Marešová 2017), based on the operation logbooks (GE
Aviation Czech 2017a), showed that TIG welding is
also a critical operation. For this reason, we give
the results of our investigation below.

Figure 3. Layout of the protective cowling and its
components (left) and position of the protective
cowling in the aircraft jet engine (right) (GE
Aviation Czech 2016).

TIG welding belongs among the arc welding methods.
The electric arc burns between a non-consumable
tungsten electrode and the molten weld metal in an
inert shielding gas atmosphere. The gases used are
argon, helium or their mixtures (Hrivňák 2009).

The TIG welding method was introduced at the end of
the 1930s for welding magnesium alloys. It partly
replaced riveting, which at that time was the most
widely used method for joining aluminium and
magnesium components in the aircraft industry.

The method is most important for welding components
made of stainless steel, aluminium, magnesium, copper
and reactive materials (e.g. titanium and tantalum).
The thickness of the welded materials ranges from
several tenths of a millimetre up to several
millimetres (Olson 1993).

According to (Olson 1993), the advantages of the TIG
method over other ways of welding are the following:


1. Manufacture of high-quality welds, i.e. low
deformation of the welded parts and welds with a
minimum amount of impurities, gases, pores and
cracks, whose occurrence reduces the material
capability (resistance to high temperature and high
pressure).

2. No spatter is produced during welding, so none
has to be removed.

3. Welding can be performed with or without filler
material.

4. Almost all kinds of material, including
dissimilar materials, can be welded.

5. Precise control of the welding parameters is
possible.


The TIG welding method is used where high weld
quality is required. Almost all metallic materials
can be welded by this method. During welding, the
welder can precisely control the heat input into the
weld, because the weld area is not obscured by fumes
and gases from the process (Olson 1993).

Besides its advantages, the TIG method also has
disadvantages in comparison with other methods;
according to (Olson 1993) they are:




1. Lower welding productivity in comparison with
other arc welding methods.

2. Higher demands on the welder's skill.

3. Higher production costs in comparison with
welding with a coated electrode.


According to the AWS D17.1 specification (AWS 2013),
welds are divided according to the mutual position
of the welded parts into:


1. Butt welds: joints of two materials (sheets or
pipes) placed together at their front faces.

2. Fillet welds: joints of two materials that form
an angle, with the weld located at the edges of the
welded parts.


The following part gives results only for butt
welds made by the TIG method.


5 USED RISK ENGINEERING METHODS


The following methods were used to detect and assess
the risks: a process map, i.e. a schematic picture of
the production process that shows the places where
conflicts in production are possible because of
insufficient workplace capacity or bad production
planning (ASME 2010); a check list (Procházková
2011a); and a safety audit (GE Aviation 2015,
Procházková 2011a).

A specific check list was compiled according to the
manufacturing technique of the TIG method (GE
Aviation Czech 2016); its critical spots were
determined on the basis of the company book of
operating events (GE Aviation Czech 2017a). It is
given in Table 1.

The safety audit results were assessed according to
the ČSN OHSAS 18001 scale, Table 2 (Procházková
2013).


Table 1. Check list used for risk evaluation in the
welding process; SN: serial number, Y: YES, N: NO,
R: remark.


SN Question Y N R

Preparation before welding

1 Are the joined surfaces of the welded parts metallically clean, i.e. free of scale and rough oxide layers?

2 Are the joined surfaces of the welded parts free of grease?

3 Are the joined surfaces of the welded parts free of other dirt affecting the final weld quality?

4 Is the fit-up of the welded parts, i.e. the size of the weld gap, in accordance with the requirements of the regulation?

5 Is the fit-up of the welded parts, i.e. the size of the overlapping surfaces, in accordance with the requirements of the regulation?

6 Are the edges of the joined materials at the place of the future weld free of burrs?

7 Are the edges of the joined materials at the place of the future weld without gross shrinkage?

8 Are the welded parts placed on a clean workbench surface?

9 Are the welded parts handled with clean cotton gloves that do not release fibres?

Machines and fittings

10 Are only certified machines and apparatuses used for welding?

11 Are only calibrated machines and apparatuses used for welding?

12 Are only the certified machines and apparatuses specified in the technological procedure (SMC 2004) used for welding?

13 Can the welding source be continuously regulated over the whole range of welding parameter values given in the technological procedure (FLICKER 2012)?

14 Can the shielding gas source be continuously regulated over the whole range of welding parameter values given in the technological procedure?

15 Is the welding current source free of signs of defect?

16 Is the welding attachment free of signs of defect?

17 Are the welding cables free of signs of defect?

18 Is the underlay piece on the working table, on which the welded components are placed, flat, without sharp roughness and dents?

19 Is the underlay piece on the working table, on which the welded components are placed, made of electrically conductive material?

20 Are the clamps of the welding apparatus located close to the welded components?

21 Is sufficient conductivity ensured between the clamp and the underlay piece on the working table?

22 Does the shielding gas quality fulfil the minimum purity requirements of the standards in force and of the rule (SMC 2004)?

23 Is the shielding gas specified in the technological procedure (SMC 2004) used, with the obligatory certificate?
24 Is the welding torch rated for the given electric current load?

25 Are the cables rated for the given electric current load?

26 Is the torch nozzle free of noticeable defects?

27 Is the nozzle diameter in the range specified in the technological procedure (SMC 2004), which enables the supply of a sufficient amount of shielding gas for effective protection of the weld metal against the influence of the surrounding atmosphere?

28 Does the tungsten electrode type correspond to the requirements of the technological procedure (SMC 2004)?

29 Does the tungsten electrode diameter correspond to the requirements of the technological procedure (SMC 2004)?

30 Is the tungsten electrode also rated for the maximum electric current load?

Welding personnel

31 Is the welding performed by qualified welding personnel?

32 Are the welding personnel properly trained?

33 Do the welding personnel hold a welding certificate?

34 Do the welding personnel hold a valid medical check?

Welding process

35 Where tacking is required, are the tack welds uniformly arranged?

36 Are the tack welds free of cracks and craters?

37 Is only certified weld wire used as filler material?

38 Is the filler material degreased?

39 Is the filler material free from dust and dirt?

40 Is the filler material properly marked?

41 Is the filler material stored in its original packaging?

42 Is only properly marked filler material used?

43 Is the use of unmarked filler material prohibited?

44 Is weld wire with the required chemical composition used for welding?

45 Is the weld wire specified in the technological procedure (SMC 2004) used for welding?

46 Is weld wire with a diameter in the required range used for welding?

47 Is the welding current regulated according to the technological procedure or the weld certificate (SMC 2004)?

48 Is the welding speed set according to the technological procedure or the weld certificate (SMC 2004)?

49 Is the weld surface cleaned after welding with a stainless steel brush?

50 Is the heat-affected part cleaned after welding with a stainless steel brush?

Check after welding

51 Is the visual check of the weld performed at a workplace designated for such checks?

52 Are records of the minimum required lighting intensity (300 lx) accessible?

53 Is the visual check with the naked eye performed according to the rule for visual weld checks?

54 Is the visual check performed with a magnifying glass of the required magnification according to the rule for visual weld checks (SMC 2004)?

55 Are the personnel performing the visual weld check qualified?

56 Are the personnel performing the visual weld check trained?

57 Are the records of the weld check accessible for the next check?

58 If weld defects are detected, are they eliminated?

59 Are the weld defects eliminated only within the range permitted by the given rule?

Table 2. Risk rate according to scale (Procházková
2013).

Risk rate         Number of answers “NO” in %
Extremely high    More than 95%
Very high         70-95%
High              45-70%
Medium            25-45%
Low               5-25%
Negligible        Lower than 5%


6 SAFETY AUDIT RESULTS AND PROPOSALS FOR SAFETY
IMPROVEMENTS


The check of welding was performed at the “Svarovna”
(welding shop) workplace under the supervision of a
technologist and an auditor. For each answer “NO”,
the reason was recorded. The assessment was
performed by the auditor. In addition to the
assessment by check list, photo documentation was
made and a record was written down and signed by all
participants, including the welder (GE Aviation
Czech 2017c).

From the record (GE Aviation Czech 2017c) it
follows that, of the answers to 59 questions, six
were “NO”, i.e. 10%. From the risk rate viewpoint
(Table 2), this is a low rate of criticality.
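
The classification above can be reproduced with a small sketch that maps the share of “NO” answers to the scale of Table 2. Table 2 does not say which band a boundary value belongs to, so the code assumes that a value lying exactly on a boundary falls into the higher band; the function and variable names are illustrative only.

```python
# Map the share of "NO" answers in a check list to the risk rate of Table 2.
# Assumption: a share lying exactly on a boundary (5, 25, 45, 70 %) is put
# into the higher band, because Table 2 leaves the boundaries ambiguous.

BANDS = [  # (lower bound in %, risk rate), from highest to lowest
    (70.0, "Very high"),
    (45.0, "High"),
    (25.0, "Medium"),
    (5.0, "Low"),
]


def risk_rate(no_answers: int, questions: int) -> tuple[float, str]:
    """Return (share of NO answers in %, risk rate label)."""
    if questions <= 0 or not 0 <= no_answers <= questions:
        raise ValueError("invalid answer counts")
    share = 100.0 * no_answers / questions
    if share > 95.0:
        return share, "Extremely high"
    for lower, label in BANDS:
        if share >= lower:
            return share, label
    return share, "Negligible"


if __name__ == "__main__":
    share, label = risk_rate(6, 59)  # audit result quoted in the text
    print(f"{share:.1f} % -> {label}")
```

For the audit described here (six “NO” answers out of 59), the share is about 10%, which lies in the 5-25% band.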

To remove the four problems that affect the weld
quality, the following measures were proposed:


1. Before the welding itself, the joined surfaces of
the welded parts are not freed of scale, rough oxide
layers, grease and other dirt that negatively affect
the quality of the final welds (GE Aviation Czech
2016, Hrivňák 2009, Olson 1993). Before welding, the
welded parts are only washed in a cleaner bath.
Residues of the cleaner emulsion are evident on the
surface of these parts.

Proposal for improvement:

To reach a high-quality weld joint, all dirt must be
removed from the surface of the parts at the place
of the future weld. The surface oxides and any scale
may be removed mechanically with a stainless steel
brush or a SiC abrasive (abrasive cloth). Degreasing
is best performed chemically, i.e. with acetone or
technical ethyl alcohol. This is important because
grease, and especially sulphur, reacts with nickel
in the weld pool and causes cracks in the weld.

2. The welded parts need to be fitted up so that an
acceptable weld gap is reached (GE Aviation Czech
2016, 2017a).

Proposal for improvement:

To reach a high-quality weld joint, the weld gap at
edge welds must be minimal, ideally zero. The reason
is that the weld metal contracts as it solidifies,
which gives rise to tensile stresses. The larger the
weld gap, the higher the tensile stresses. When the
tensile stresses exceed the strength limit of the
material, cracks form. For this reason, the quality
of the welds is influenced not only by the welding
process itself but also by the preparation, i.e. the
fit-up of the parts. To reach the required fit-up,
the technological procedure (SMC 2004) needs to be
followed, i.e. the set tolerances must be kept,
which requires higher dimensional accuracy of the
produced parts.

3. The filler material for welding is used in the
form of a rod with a diameter of Ø 2.0 mm. Although
the weld with Ø 2.0 mm filler wire is certified as
suitable, for welding the materials from which the
protective cowling is produced it would be more
suitable to use filler wire with a smaller diameter,
ideally Ø 1.6 mm. The reason is the lower amount of
heat necessary for melting the welding wire, and so
the lower amount of heat carried into the weld. In
this way the heat-affected zone around the weld
(HAZ; Czech abbreviation TOO) is smaller (GE
Aviation Czech 2016, SMC 2004). In our case a
hardened material is welded, in which the cracks
originated precisely in the HAZ (GE Aviation Czech
2017a), i.e. the smaller the HAZ, the lower the
probability of defect origin.

Proposal for improvement:

No supplier of filler material with the required
diameter of 1.6 mm operates in the Czech Republic.
Foreign suppliers offer this material only in large
packages, which leads to waste. For this reason this
proposal has not been implemented yet.

4. No welding positioner is available for performing
the welding. Producing such a positioner with
adjustable rotation speed would make the work
easier, increase comfort and, primarily, improve the
work and thereby the weld (Marešová 2017). The
positioner could also be used for welding other
welded parts of the aircraft engine.

Proposal for improvement:

The purchase of a new positioner and the assessment
of its influence on weld quality and work ergonomics
are in progress.
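
The argument in item 3 can be illustrated with a rough calculation. The sketch assumes that the heat needed to melt a unit length of filler wire scales with its cross-sectional area (same alloy, same length of wire fed); this simplification is introduced here for illustration and is not a result from the cited documentation.

```python
# Rough comparison of two filler wire diameters.
# Assumption: heat to melt a unit length of wire is proportional to the
# wire cross-section (same alloy, same length fed into the weld pool).
import math


def cross_section_mm2(diameter_mm: float) -> float:
    """Cross-sectional area of a round wire in mm^2."""
    return math.pi * diameter_mm ** 2 / 4.0


def relative_melt_heat(d_new_mm: float, d_old_mm: float) -> float:
    """Heat per unit length for the new wire relative to the old one."""
    return cross_section_mm2(d_new_mm) / cross_section_mm2(d_old_mm)


if __name__ == "__main__":
    ratio = relative_melt_heat(1.6, 2.0)
    print(f"1.6 mm wire needs about {ratio:.0%} of the heat of 2.0 mm wire")
```

Under this assumption the Ø 1.6 mm wire needs roughly two thirds of the heat per unit length that the Ø 2.0 mm wire needs, which is consistent with the expectation of a smaller heat-affected zone.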


7 CONCLUSION 


Within the co-operation project of our faculty with
GE Aviation Czech, tools for reducing operational
risks are being created step by step, with the aim
of ensuring high safety of aircraft engines. One
such tool, connected with welding of the protective
cowling, is described above. The safety audit
performed with the specific check list revealed four
defects. For safety reasons, three defects were
rectified quickly; the last one is still being
handled because an economically acceptable product
is currently unavailable.

ABSTRACT: The article presents the methodology of material requirement planning in a manufacturing
enterprise. A manufacturing structure consisting of two independent products, A and B,
showing modularity of design was assumed. A mathematical description reflecting the dependencies of
individual product components used to execute a delivery schedule was used for the calculations. The
article describes how to plan the production of products with modular characteristics and determine
the moment of supply so as to maintain the continuity and rhythm of the manufacturing process. This is of
utmost importance for the prompt execution of external orders (independent demand) and thus has
an impact on the dependability of order execution. The suggested methodology
eliminates errors, reduces the stock level, enables full control of particular production stages, improves
the utilization of the available resources and synchronizes both the order and delivery processes of materials
with production needs.


1 MATERIAL REQUIREMENT 
PLANNING IN TERMS OF RELIABILITY 
OF SUPPLIES 


The functional specificity of industrial compa- 
nies involves processing raw materials into fin- 
ished products, as a result of a creative process. 
This process is associated with changing the form, 
shape or properties of a product defined as a final 
product. 

The aim of the article is to present a methodol- 
ogy for describing products of a modular structure, 
and then to utilize it to develop a delivery schedule, 
ensuring reliability of manufacturing orders fulfil- 
ment. It can be achieved through timely execution 
of assembly tasks, precise determination of stock 
levels, defining the size of manufacturing batches, 
and synchronizing and control of the manufactur- 
ing-assembly process. 

Material Requirement Planning (MRP) is defined in
the literature in various ways. In a book by Cecil
Bozarth and Robert B. Handfield, MRP is defined as:
“a planning process enabling to translate the
superior manufacturing plan onto the planned orders
for the parts and components necessary to
manufacture products, of which the completion was
included in the superior plan” (Bozarth & Handfield
2007). Donald Waters, in turn, states in his book
that: “(...) material requirement planning utilizes
the main manufacturing plan in order to schedule the
material supply. Expanding the main manufacturing
plan allows planning material supplies at the very
moment they are needed” (Waters 2007).

B. Śliwczyński has a slightly different perception
of the MRP concept, writing: “(...) planning
material requirement covers every element of the 
final product, in each of the manufacturing stages 
and defines the material requirements, resulting 
from the product range, volume and lead time of 
a manufacturing batch. The lead times of deliveries 
of the materials and elements necessary to manu- 
facture a finished product are calculated as per the 
MRP method, according to the main manufactur- 
ing schedule. As a result of material requirement 
planning, a delivery schedule is developed, which is 
the basis for material supply planning” (Śliwczyński 
2008). On the other hand, S. Krawczyk positions
MRP within the process of handling a manufacturing
enterprise and therefore relates it to, i.a., the level of
customer service. Therefore, the author draws atten- 
tion to the aspect concerning delivery reliability, 
which should be understood in a slightly narrower 
sense as timeliness (Krawczyk 2011). The concept 
of supply reliability, related to a company, may and 
should be considered on many levels. For example, 
supply reliability may be shaped by: lack of damage, 
relevance, timeliness, conformity, completeness, etc. 
T. Nowakowski indicates that supply reliability is 
impacted by: its timely execution, completeness of 
an order and releasing/receiving undamaged goods 
(Nowakowski 2004 & 2011). J. Żurek points to a
direct relationship between reliability and readiness,
and relates these terms to operational processes of
facilities and technical systems (Nizinski & Żurek
2011, Żurek & Jazwinski 2007). With regard to the
reliability of systems associated with stock control, 
the author implies that one of the basic delivery 
reliability methods is to maintain an excess stock, 
understood as a safety reserve. Furthermore, the 
author differentiates between stock levels, depend- 
ing on the cost criterion (prices). The author states 
that stock of cheap elements may be maintained at 
a high sufficiency level, while the stock of expen- 
sive elements at a low or zero sufficiency level. Very 
expensive elements are often purchased only in the 
event of a demand (Waters 2007). S. Werbińska- 
Wojciechowska points also to time redundancy 
in the aspect of evaluating availability of logistics 
system resources in companies, drawing atten- 
tion to the issues of temporary stock availability 
(Werbińska-Wojciechowska 2007, 2013).

An MRP system improves stock management 
and facilitates the creation of requirement plans for 
raw materials and materials necessary for manufac- 
turing, for which the demand depends on market 
needs. They make up the manufacturing schedule, 
defined by calendar time, which takes into account 
both the assembly labor consumption, as well as 
the lead times of component deliveries. Therefore, 
its correct execution impacts the delivery reliabil- 
ity understood (to a narrow extent) as timeliness 
of executing manufacturing tasks. The basic
information of the Material Requirement Planning
(MRP) system forms information streams, whose set
contains (Bozarth & Handfield 2007, Krawczyk 2011,
Skowronek & Sarjusz-Wolski 2012, Śliwczyński 2008):


• main manufacturing schedule;
• product structure;
• stock main set.


MRP system’s information streams are pre- 
sented in Fig. 1. 


[Figure 1 boxes: Forecasts; Orders; Main manufacturing schedule; Stock main set; Product structure]


Figure 1. Information streams supplying an MRP system.


As indicated in Fig. 1, information streams mak- 
ing up an MRP system are formed by: the main 
manufacturing schedule, which is developed in the 
long term on the basis of sales forecasts, and in 
the shorter time horizon, it is confirmed by orders 
coming from the customers. The second stream is 
a set of stock, expanded by assembly-delivery lead 
times, while the third one is the modular structure 
of a product. The components described above 
shall be used for practical development of a deliv- 
ery schedule. 


2 SCHEDULING DELIVERIES 
IN PRACTICE 


An essential component of an MRP system is the 
manufacturing schedule, which includes orders 
from customers, concerning the required number 
of products and the order lead time. In this context, 
a correctly drawn up delivery schedule ensures reli- 
ability, understood as delivery timeliness. In order 
to develop such a schedule, we also need the struc- 
ture of products A and B (Fig. 2 and Fig. 3), and 
its description method, which shall be utilized for 
the calculations. 

In structural terms, the presented final product
consists of products A and B, which include
repeating structural fragments, called modules,
appearing in at least two locations within the
discussed structure.

Two types of functional modules may be distinguished
in the structure of product A:


• module of the 1st order C-1 (consists of the
module of the 2nd order D-2 and four elements marked
as F);
• module of the 2nd order D-2 (consists of two E
type elements and four F type elements).


[Figure 2 diagram: product A at Level 0, with components on Levels 1-3]

Figure 2. Product A structure.

[Figure 3 diagram: product B at Level 0, with components on Levels 1-3]

Figure 3. Product B structure.


Furthermore, product A consists of elements 
located within the structure on three functional 
levels. The first level contains a C-1 type module, 
the second one represents the D-2 module, while 
the third one contains F-2 and E-3 type elements, 
independent in structural terms, subordinate 
directly to product A. 

The structure of product B should be discussed 
by analogy (Fig. 3). We should remember that 
products A and B located in level 0 are structurally 
independent in relation to each other, but combined 
form a single final product. The indicated modular- 
ity of both products causes a specific description of 
their structure, used for calculations in the delivery 
schedule (MRP). Below you can find a presentation 
of the description methodology, according to which 
such a schedule is to be executed, starting with level 
0, and ending with level 3. 

The first structural level of both products making 
up a finished product, contains one C-1 type mod- 
ule (it is a so-called module of the first order). The 
second structural level contains a D-2 type module 
(a so-called module of the second order). The meth- 
odology of describing such a structure is as follows: 


• products A and B are located at the same
level (level 0) and there are no structural links
between them, therefore, they are not subject to
a description;

• level 1 in both products contains a C type module,
the second level contains a D type module, and the
third level contains E and F type elements, which
are indivisible in terms of structure;

• the fact that module C is present in products A
and B is recorded as follows: C(A, B);

• the second product structure level contains a
D-2 type module, whereas in relation to the
structure of products A and B it has a slightly
different location, that is, in the case of product
A it is directly subordinate to A (level 0), while
in relation to product B, it constitutes a part of
a C type module (level 2), and this fact shall be
recorded as D(2A, 2C);

• the last level (level 3) contains elements, which
should be described according to the principles set
out above, therefore: E(3A, B, D) and
F(2A, B, 4C, 2D).


Table 1. Main manufacturing schedule.

Item   Week 6   Week 7   Week 8   Week 9   Week 10
A        1254      123      124      125       870
B         456      543      560      678       340
D         274      228      132      143       270
E         376      223      345      678       450


Table 2. Company’s stock level.

Item   Week 1   Week 7
A         556       70
B         234       60
C         523       50
D         345       60
E         123       70
F         321       72


Table 3. Delivery lead times.

Item   Lead time
A      2 weeks
B      2 weeks
C      1 week
D      1 week
E      1 week
F      1 week


The above methodology of description shall be used
for calculations in a Material Requirement Plan
(MRP).

Table 1 presents the planned manufacturing of
products A and B, a D type module and an element
marked as E.

Table 2 presents the stock level of a company
regarding individual products, modules and elements
in weeks one and seven. They are used to calculate
the net demand, which is defined as the difference
between the gross demand and available stock.

Table 3 presents the delivery times for products,
modules and elements, which enable the determination
of the exact moment of ordering, so as to ensure
manufacturing continuity and an effective
utilization of available stock.

Table 4 presents a Material Requirement Plan (MRP),
for the development of which the data summarized in
Tables 1-3 was used. The above description
methodology, which is characteristic for the
structure of products A and B showing modularity,
was used to calculate the gross demand. This
methodology applies to the description of modules
marked as C-1 and D-2, and the elements defined as E
and F, therefore, it relates to the components
assigned to levels 1-3 of a product structure (cf.
Figs. 2 and 3).


Table 4. Material requirement plan for products A and B.

MRP                    Week:  1     2     3     4     5     6     7     8     9    10
A  gross demand               0     0     0     0     0  1254   123   124   125   870
   available stock          556   556   556   556   556   556    70     0     0     0
   net demand                 0     0     0     0     0   698    53   124   125   870
B  gross demand               0     0     0     0     0   456   543   560   678   340
   available stock          234   234   234   234   234   234    60     0     0     0
   net demand                 0     0     0     0     0   222   483   560   678   340
C  gross demand               0     0     0   920   536   684   803  1210     0     0
   available stock          523   523   523   523     0     0    50     0     0     0
   net demand                 0     0     0   397   536   684   753  1210     0     0
D  gross demand               0     0   794  2468  1474  2028  2898  1872   143   270
   available stock          345   345   345     0     0     0    60     0     0     0
   net demand                 0     0   449  2468  1474  2028  2838  1872   143   270
E  gross demand               0   449  2468  3790  2670  4146  3148  3438   948   450
   available stock          123   123     0     0     0     0    70     0     0     0
   net demand                 0   326  2468  3790  2670  4146  3078  3438   948   450
F  gross demand               0   898  6524  6710  7381  9496  9512  2366   540     0
   available stock          321   321     0     0     0     0    72     0     0     0
   net demand                 0   577  6524  6710  7381  9496  9440  2366   540     0

The presented material requirement plan was 
developed for a time horizon covering a range of 
1-10 weeks. The gross demand for products A and 
B was submitted in weeks 6-10. After taking into 
account the delivery lead times and assembly times, 
the generated internal orders actually covered a 
time period of 1—4 weeks. It is associated with the 
so-called reverse planning, and a bill of material, 
related to the structure of the product in each case. 
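
The reverse planning described above can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it takes the data of Tables 1-3 and the structure notation C(A, B), D(2A, 2C), E(3A, B, D), F(2A, B, 4C, 2D), and assumes that (i) an order for a parent is released its lead time before the week of its net demand and that this release creates the gross demand for its components, and (ii) the week-1 stock is held until used and the stock is reset to the week-7 level of Table 2 in week 7. The week-7 stock of F is taken as 72, the value used in Table 4. All names in the code are illustrative.

```python
# Gross and net demand for the modular structure of products A and B.
# Data come from Tables 1-3; the planning rules are assumptions (see text).

WEEKS = range(1, 11)
ITEMS = "ABCDEF"  # ordered so that parents always precede their components

LEAD = {"A": 2, "B": 2, "C": 1, "D": 1, "E": 1, "F": 1}  # Table 3, in weeks
STOCK = {  # Table 2: (stock in week 1, stock in week 7)
    "A": (556, 70), "B": (234, 60), "C": (523, 50),
    "D": (345, 60), "E": (123, 70), "F": (321, 72),
}
SCHEDULE = {  # Table 1: independent demand in weeks 6-10
    "A": [1254, 123, 124, 125, 870],
    "B": [456, 543, 560, 678, 340],
    "D": [274, 228, 132, 143, 270],
    "E": [376, 223, 345, 678, 450],
}
# Structure notation, e.g. D(2A, 2C): D is used twice in A and twice in C.
USED_IN = {
    "C": {"A": 1, "B": 1},
    "D": {"A": 2, "C": 2},
    "E": {"A": 3, "B": 1, "D": 1},
    "F": {"A": 2, "B": 1, "C": 4, "D": 2},
}


def plan():
    """Return (gross, net): dicts item -> {week: quantity}."""
    gross, net = {}, {}
    for item in ITEMS:
        gross[item], net[item] = {}, {}
        on_hand = STOCK[item][0]
        for week in WEEKS:
            if week == 7:  # assumed: stock is reset to the week-7 level
                on_hand = STOCK[item][1]
            independent = SCHEDULE.get(item, [0] * 5)[week - 6] if week >= 6 else 0
            # Parent orders are released LEAD[parent] weeks before their net demand.
            dependent = sum(
                qty * net[parent].get(week + LEAD[parent], 0)
                for parent, qty in USED_IN.get(item, {}).items()
            )
            gross[item][week] = independent + dependent
            net[item][week] = max(gross[item][week] - on_hand, 0)
            on_hand = max(on_hand - gross[item][week], 0)
    return gross, net


if __name__ == "__main__":
    gross, net = plan()
    for item in ITEMS:
        print(item, "gross", [gross[item][w] for w in WEEKS])
        print(item, "net  ", [net[item][w] for w in WEEKS])
```

Under these assumptions the computed gross and net demands agree with the values listed in Table 4, for example a gross demand of 920 for module C in week 4 and a net demand of 9440 for element F in week 7.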


3 CONCLUSIONS 


The objective of this paper was to develop a 
descriptive methodology for products with a mod- 
ular structure and its application in manufacturing 
scheduling. The planned objective was achieved, 
and the article presents the manner, in which to 
plan manufacturing products of a modular char- 
acteristic and how to determine the moment to 
order them, with the purpose to maintain conti- 
nuity and rhythm of the manufacturing process. 
It is extremely important from the point of view
of timely processing of external orders (independent
demand), therefore, it impacts the reliability
of delivery execution. The presented methodology
eliminates mistakes, decreases the level of main- 
tained stock, enables full control over individual 
manufacturing stages, improves the utilization of 
available resources and synchronizes the material 
ordering and delivery processes with manufacturing
needs. The methodology was developed in 1957 by
APICS (American Production and Inventory Control
Society), and it was developed most intensively in
the 1970s and 1980s. It is intended for variable,
irregular demand and for an enterprise manufacturing
products on a mass scale. For this reason, it should
be applied first of all to serial or direct-line
manufacturing, and secondly to unit manufacturing or
a combination of the aforementioned types.
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ABSTRACT: Growth in the number of certification schemes in the aquaculture industry has been attrib- 
uted to several factors. The schemes contribute to improved traceability of products, provide healthier 
stocks, and provide more information to customers’ decision-making efforts. There is a wide range of cer- 
tification schemes and standards available, addressing food safety, environmental impact, animal welfare, 
and worker conditions, to name a few. The abundance of certification schemes has resulted in concerns 
about consumers becoming confused with the number of labels and that certification schemes themselves 
may become a barrier to trade. This paper examines 5 major certification schemes in the aquaculture sec- 
tor and categorizes them according to their purpose, proprietorship, and process. We investigate what has 
caused this wave of attention to be given to such a diverse range of issues, exploring how the diversity of
these certifications is rooted in their inception and the areas they address.


1 INTRODUCTION 


By 2050, the world population is predicted to have 
increased from 7.6 billion people to 9.8 billion 
(UN 2017). This implies that the need for fish as 
a source of nutrition will increase, and with that, 
there will be increased challenges for wild catch 
and production. 

Capture fisheries have been stagnant since the
1980s, while aquaculture of fish and shellfish has 
more than doubled its growth in the last quarter of 
the twentieth century (FAO 2016). Salmon is one of 
the species that has seen spectacular growth, espe- 
cially in Norway, Chile, Canada, and the UK (FAO 
2003). 

The growth of aquaculture production plays 
an important part in international trade and has 
helped the economy in many developing countries 
(Prein and Scholz 2014). However, this growth does 
not come without negative consequences to people 
or the environment. The “blue revolution” calls for 
problems to be addressed, such as water pollution, 
ecosystem degradation, and poor labor conditions. 
The rapid growth of the salmon farming industry 
has in many countries raised public concern and 
critique from stakeholders and politicians regard- 
ing social, economic, and environmental impacts. 
The concerns are both country-specific and global,
ranging from the effects of aquaculture on biodiversity
and wild fish stocks to socio-economic impacts
(e.g. competition for ocean space, land, and property
value) (Bush et al. 2013). Asche et al. (1999)
categorized salmon farming’s sources of environ- 
mental problems into three categories: (1) organic 
material emission; (2) spread of diseases that may 
affect wild species; and (3) genetic contamination 
of wild stocks by escapees. 

The critiques of salmon aquaculture, combined 
with a general increased focus on environmen- 
tal and social issues, have led to a rise in public 
awareness and a demand for a more sustainable 
industry (Prein and Scholz 2014). Despite a unified
call for ‘sustainability’, there is no shared
consensus as to what that actually entails and how
it can be accomplished (Davidson 2010). With lit- 
tle agreement beyond the common notion of the 
three dimensions of sustainability: environmental 
(ecosystem and biodiversity), economic (long-term 
business viability), and social (social responsibility 
and community well-being) (World Bank 2014), 
the road to ‘a sustainable industry’ has become a 
vague and ambiguous one. 

While the main production of salmon aqua- 
culture is found in Norway, Chile, the UK, and 
Canada, farmed salmon is sold to more than 100 
countries worldwide. Stakeholders are therefore 
not only from the producing countries but from 
quite a large, global marketplace. With demands 
for sustainability coming from, and the actual production
happening in, very different corners of the
world, there has been an increased need for global
consistency in the regulation of the industry
(Busch 2011, Stanton 2012). 

One way to achieve this is through the use of
global standards, certification schemes, and labeling
created by NGOs and retailers (e.g. IKEA,
Tesco). These are a form of private governance or
‘soft law’, meaning that their sanctions do not
carry the force of law and are therefore not mandatory
(Busch 2011). Certification schemes provide
different standards with which producers
can voluntarily choose to comply, and in doing so
obtain a certification from the chosen scheme. In
Europe, the most prevalent standards in aquacul- 
ture are the GLOBALG.A.P. Aquaculture Stand- 
ard and the Aquaculture Stewardship Council 
(ASC) standards. In North America, on the other 
hand, the standards set by the Global Aquaculture
Alliance, the Best Aquaculture Practices (BAP), are widely
used (Prein and Scholz 2014). 

In recent years, the number of certification 
schemes for food production and processing has 
increased significantly, along with a variety of 
actors involved in the development of these stand- 
ards. Attempting to cover the many rising chal- 
lenges in aquaculture, these standards and labels 
relate to issues such as sustainability, food safety, 
organic production, etc. As a consequence, the 
types of schemes, their objectives, and their scope 
vary considerably (Nadvi and Waltring 2002). 

This paper aims to illustrate the multitude of 
standards existing in the market today. As seen from 
the literature, there is a wide range of certification 
schemes and standards available and the arguments 
for the development of these vary between the need 
for consumer legitimacy, market demands, quality 
improvement, etc. This paper explores what has 
caused this wave of attention given to such a diverse 
range of issues, which has led to this sea of certifications.
By comparing the proprietorship,
process, and purpose (hereafter referred to as
the 3 Ps of certification) for 5 major certification
schemes in use for salmon aquaculture, we seek to 
understand how differences in their standards and 
their focus areas can be related to their origin. What 
arguments are being used for each certification 
scheme/standard, and why do they differ in focus 
and demand for improvement? 


2 BACKGROUND 


Certification and labeling are one type of signal or 
attribute giving the consumer the opportunity to 
evaluate a product before purchase/consumption 
(Chen et al. 2015). The FAO distinguishes between ecolabels
and food safety and quality standards (Washington 
and Ababouch 2011). Ecolabels, also referred to as 
‘best practice’ labels, focus on responsible aquaculture
practices, procurement policies of retailers/
brand owners, and support to consumers in their 
purchasing decisions. The food safety standards 
are schemes that provide assurance of the quality
and safety of products and the processes involved. 

Numerous reasons for the emergence of such 
certification schemes have been identified, from the
perspectives of consumers, market actors (e.g. retailers),
and producers. One argument focuses on a lack of
sufficient regulation, arguing that these certification
schemes have emerged where public regulation
is perceived as inefficient or ineffective in its
response to food safety, quality, and environmental 
sustainability (Washington and Ababouch 2011). 

For the retailers and companies selling seafood, 
labels are also viewed as a mechanism to reduce 
risk related to negative publicity concerning pro- 
duction practices (Boyd and Nevin 2011). Achiev- 
ing trust from consumers and supporting producer 
legitimacy are an important part of certification 
schemes (Bush et al. 2013). As summarized by Morris
(1997), the possibility of improving the image and/or
sales of a company, in addition to encouraging
firms to account for the environmental impact of 
their production, are important arguments to sup- 
port certification schemes. 

Certification usually provides product traceabil- 
ity, standardization among global suppliers, and 
transparency of production processes (Washing- 
ton and Ababouch 2011). Standardization can be 
seen as a form of risk management that extends 
a company’s liability to a third-party Certification 
Body (CB), thus allowing the company to claim
due diligence in the event of a predicament (Busch 
2011). In addition to allocation of risk, certifica- 
tion may also deter “real and/or perceived risks 
along the food chain” (Stanton 2012: 247). 

Nevertheless, there are uncertainties about the 
certification schemes’ consequences for sustainabil- 
ity. There is little scientific proof that shows a reduc- 
tion of negative environmental impacts by certified 
farms compared to noncertified farms (Boyd and 
Nevin 2011). Though certification may reduce
impact at farm level, this may not contribute to
an overall improvement in sustainability (Tlusty &
Thorsen 2017). Questions have also been raised as
to whether the increased demand for documentation
and record-keeping placed on aquaculture companies
through these schemes is actually making
production more sustainable (Bush et al. 2013).

Another concern regarding certification 
schemes is that they may act as a barrier to trade 
for smaller companies or companies from develop- 
ing countries who cannot afford the costs and doc- 
umentation requirements of standards originating 
in the industrialized countries (Busch 2011). 

Although private standards are not legally 
required, international markets demand that companies
comply with supposedly voluntary standards
(Stanton 2012). Private standards that have
become the industry norm no longer leave suppliers
a real choice: they must comply in order to
participate or remain in a specific market. Hence,
private schemes become “de facto mandates” as
the demarcation between mandatory requirements
and voluntary standards becomes obscure (Casey 
2009, Stanton 2012). 

From the perspective of the consumer, the large 
number of certification schemes, standards, and
labels available may confuse and complicate
the purchase decision, as well as negatively
influence their attitude towards the food produc- 
ers and owners of the label in use. It has also been 
shown that many consumers do not know the con- 
tent of each label so that decisions are often made on 
other characteristics and heuristics (Grunert 2005). 
Research shows consumers might prefer sustainable 
seafood; however, they do not pay much attention 
to this when buying seafood (Alfnes 2017). 


3 METHODS 


This paper is based on an analysis of documents 
from a range of certification schemes, the con- 
tent of their different standards, and literature on 
certification. The chosen method aims to provide
a comparison of a selected number of certification
schemes and their origin, motivation for
establishment, and content of their standard(s).
The selected standards were established at different
times; some of them are aquaculture- and salmon-specific,
while others are not, and they differ in
their focus on sustainability and/or animal welfare. 
Common for all is their relevance to salmon aqua- 
culture production. The selection of schemes and 
standards is also based on their prevalence in the 
major nations of salmon aquaculture production. 
To illustrate the muddled sea of certifications in 
which production companies find themselves, the 
choice of standards in this study is also meant to 
reflect the diversity of focus areas, motivation, 
and actors involved. After gathering data and cat- 
egorizing them according to characteristics (see 
Table 1), the background for the inception of these 
schemes was also analyzed (see Figure 1). The 
following information, unless otherwise specified, 
comes from the websites of these schemes. 


4 STANDARDS AND CERTIFICATION 
SCHEMES 


4.1 ASC 


Established in 2009, the Aquaculture Stewardship
Council (ASC) originated from the Aquaculture
Dialogue, a multi-stakeholder roundtable founded
by the World Wide Fund for Nature (WWF) in
2004 (WWF Norge 2016). WWF and The Sustainable
Trade Initiative (IDH, which includes businesses,
trade unions, NGOs, and Dutch Ministries for
stimulating sustainable trade) from the Netherlands
worked together in establishing the Aquaculture
Stewardship Council in 2010 (IDH 2017).

Figure 1. Development of schemes.

ASC is the only aquaculture certification scheme 
that is recognized as a full member of the ISEAL 
Alliance Code of Good Practice for Setting Social 
and Environmental Standards. Also, the organiza- 
tion develops standards that are in line with FAO 
guidelines. ASC partners with the Global Aquacul- 
ture Alliance (GAA) and GLOBALG.A.P, and is 
supported by various suppliers, producers, retailers, 
and food brands. Any stakeholder or individual can 
raise issues regarding a certification of a facility as 
the certification documents are available online. 

There are currently 8 aquaculture standards 
that cover 12 different species: abalone, bivalves 
(clams, mussels, oyster, scallop), freshwater trout, 
pangasius, salmon, shrimp, tilapia, seriola, and 
cobia. The ASC Salmon Standard was developed 
in 2012 by over 500 participants (WWF Norge 
2016). The scope of the ASC standard for salmon 
includes: compliance with national and local laws 
and regulations, habitat, biodiversity and ecosystem,
health and genetic integrity of wild populations,
responsible use of resources, managing
disease and parasites responsibly, socially responsible
development and operations, and community
involvement. The standards are reviewed
regularly to ensure that they are compatible
with new scientific developments and
practices. The ASC supervisory board
is composed of representatives from academia, 
NGOs, and the industry while its Technical Advi- 
sory Group (TAG) consists of a group of invited 
technical experts. The Technical Working Groups 
(TWG) and Steering Committees also meet and 
guide ASC standard development. 


4.2 GLOBALG.A.P.


EurepGAP was initiated by European retailers in 
1997 with the goal of establishing a generic standard
for Good Agricultural Practice (GAP) (Kalfagianni
and Pattberg 2013). Prior to its establishment,
European supermarket chains started various 
“Integrated Crop Managements” (ICMs) as an 
effort to gain consumers that preferred ‘sustain- 
able products’ (Casey 2009, Kalfagianni and Fuchs 
2012). The suppliers struggled with achieving the 
many ICMs of different supermarkets. As a way of 
harmonizing these agricultural processes, EurepGAP
was born; it was renamed GLOBALG.A.P.
in 2007 as the standard became widespread on the
international scene (Kalfagianni and Fuchs 2012). 

The GLOBALG.A.P. Aquaculture module was 
included in GLOBALG.A.P. in 2004 and covers 
the entire production chain of a variety of farmed 
fishes, crustaceans, and mollusks from suppliers 
(brood-stock, feeds, seedlings) to the various activities,
such as farming, harvesting, processing, and
post-harvest handling operations (Prein and Scholz 
2014). GLOBALG.A.P. is a business-to-business 
standard, and is classified by FAO as both a stand- 
ard and a code (Washington and Ababouch 2011). 
The scope of the certification for the aquaculture 
module includes site management, reproduction, 
chemical compounds, occupational health and 
safety, fish welfare, management and husbandry, 
sampling and testing, feed management, pest con- 
trol, environmental and biodiversity management, 
water usage and disposal, harvesting and post- 
harvest operations, holding and crowding facilities, 
slaughter activities, depuration, post-harvest mass 
balance and traceability, and social criteria. 

In addition to certification, GLOBALG.A.P. also 
has a consumer label called GGN (GLOBALG.A.P. 
Number) for certified aquaculture products that 
are in accordance with GLOBALG.A.P. (GGN 
2017). Feed that includes captured fish should 
come from fisheries that adhere to the FAO Code 
of Conduct for Responsible Fisheries. 

GLOBALG.A.P. members elect the Board (5 
producers and 5 retailers), which guides the Secre- 
tariat, the Technical Committees (one, out of eleven 
representatives, is from Asia in the Aquaculture 
group), and Focus Groups (voluntary members and 
non-members). The Secretariat gives directions to 
the Benchmarking Committee, Certification Body 
Committee, Integrity Surveillance Committees, and 
the National Technical Working Groups (41 coun- 
tries). The Technical Committees give direction to 
the respective Focus Groups. National Technical 
Working Groups are responsible for translating the 
national interpretation guidelines and local adap- 
tation of the standard. There are two public con- 
sultations or rounds for submitting comments by 
interested parties within a period of 40 to 60 days. 


4.3 RSPCA 


The Royal Society for the Prevention of Cruelty 
to Animals (RSPCA) is an animal welfare charity
organization in England and Wales. The RSPCA
Assured label, which replaced the Freedom Food
label in 2015, is an ethical food label established by
the RSPCA. A report from The Food Ethics
Council and Pickett (2014) identified three drivers
for farm assurance schemes. Firstly, the 1980s
and 1990s in the UK were overshadowed by a number of
highly publicized food scares, such as BSE in
cattle and reports uncovering salmonella-infected
egg production. In addition to the aim of restor- 
ing consumer confidence, the Food Safety Act in 
1990 introduced the requirement of retailers’ due 
diligence which assigned food safety responsibil- 
ity to retailers. A third reason for farm assurance 
schemes to proliferate during this time was the desire 
to promote responsible farming and animal welfare 
(The Food Ethics Council and Pickett 2014). 

Priding itself on being the only farm animal
welfare scheme in the UK, the RSPCA welfare 
standards examine all aspects that are vital to an 
animal’s welfare, such as farm management, hus- 
bandry practices, healthcare, living conditions, 
nutrition, transport, and humane slaughter. The 
RSPCA welfare standards include beef cattle and 
calves, chickens, ducks, hatcheries, laying hens, 
dairy cattle and calves, pigs, pullets, salmon, sheep, 
trout, and turkey. Meetings with the Standards 
Technical Advisory Group (STAG) are conducted 
by RSPCA once a year for each species to ensure 
effective accumulation of the latest scientific, vet- 
erinary, and industry information. STAG members 
include retailers, food companies, farming associ- 
ated industries (e.g. manufacturing), veterinarians, 
environmentalists, or organizations and individuals 
advising the RSPCA Farm Animals Department 
on standard development. STAG membership is 
by invitation only. Membership of the Wider Consultation
Group (WCG) is also by invitation only, issued by the
Farm Animals Department of RSPCA. RSPCA
Assured currently covers more than 140 million 
salmon. Major retailers in the UK offer more than 
2,000 RSPCA Assured products. 


4.4 IFS food standard 


The International Featured Standards (IFS), origi- 
nally called the International Food Standard, was 
established in 2003. IFS is an association of retail- 
ers and industrial companies that aims to set har- 
monized standards for their producers, logistics 
companies, brokers, and agents. Since their expan- 
sion, they now have 8 standards for food products 
and services published in five primary languages 
(English, German, Spanish, French, and Italian). 
The IFS Food Standard deals with food safety and 
quality of the product and the processes of food 
packing and processing companies. The standard 
is recognized by the Global Food Safety Initiative 
(GFSI). The scope of the standard includes senior
management responsibility, quality and food
safety management system, resource management,
planning and production processes, measurements,
analysis and improvements, and food defense and
external inspections. 

Retailers that require suppliers to have IFS certifi- 
cation include Aldi, Lidl, and Metro (Bureau Veritas 
2017). The IFS certification is also sought after by 
retailers from their suppliers in the French and Ger- 
man markets (Washington and Ababouch 2011). 

The IFS Technical Committee (TC) is com- 
posed of representatives from retailers (17, many 
from Germany, Italy, France, and Spain), industry 
(6 manufacturers, 1 food service), and certifica- 
tion bodies (4 from Europe). The TC is responsi- 
ble for content and requirements of the standards. 
National Working Groups (NWGs) from Italy,
France, Germany, Chile, USA, and Spain are 
responsible for supporting and providing the TC 
technical information to the International Working 
Group. Examination Working Groups (EWGs) are 
composed of retailers and experts. A Review Committee
is made up of representatives from retailers, industry, and
CBs. They discuss experiences and changes
to the requirements of the audit report and training.


4.5 BAP 


The Global Aquaculture Alliance (GAA), a non- 
profit organization attending to issues related to 
advocacy, education, and leadership in responsible 
aquaculture, is the owner of the BAP certification 
scheme. GAA was established in 1997 by shrimp 
farmers as a response to criticisms from Greenpeace 
in the 1990s and a global moratorium demanded by 
NGOs and community organizations in Choluteca, 
Honduras (Lee and Connelly 2006). According to 
Aguayo and Barriga (2016), the development of BAP standards was
led by corporate actors in the industry, and there was
no participation by stakeholders not belonging to
the industry.

BAP is an aquaculture standard that promotes 
codes of conduct through best management prac- 
tices (Lee and Connelly 2006). The standards 
are continuously improved through efforts from 
the Technical Committee, the Standards Oversight
Committee (SOC), composed of experts in environmental
conservation, academia, and
industry, and comments from the public, which
are available on their website. The BAP consumer 
eco-label includes a star rating system that shows 
the level of integration in the food chain, with one 
star meaning the product is produced by a BAP-certified
processing plant, while a 5-star label
means that the product has been produced only by 
BAP-certified facilities (processing plant, farms, 
hatchery, and feed mill). The standard covers com- 
munity property rights and regulatory compliance, 
community relations, worker safety and employee
relations, sediment and water quality, fishmeal and
fish oil conservation, control of escapees, predator 
and wildlife interactions, storage and disposal of 
farm supplies, animal health and welfare, biosecu- 
rity and disease management, control of potential 
food safety hazards, and traceability. 

BAP standards are continuously updated. The 
GAA is responsible for coordinating the develop- 
ment of the standards. The technical details are 
developed by the Technical Committee (TC) under 
the guidance of the Standards Coordinator from 
GAA and subject to the review and approval from 
the Standards Oversight Committee (SOC). The 
12-member SOC should consist of equal numbers of 
representatives from academia, conservation groups, 
and industry groups. After the SOC has reviewed the 
document (and modified, if needed), the changes are 
published for a 60-day comment period where the 
public can participate. The SOC carefully considers 
all the public comments for possible inclusion in the 
final draft. The draft is then submitted for approval 
by the SOC and the GAA Board of Directors before 
the standard is implemented. 


5 DISCUSSION 


Figure 1 shows a diagram illustrating how standards
are established for different purposes and
through diverse processes by distinct proprietors. 
As discombobulating as the figure seems, the real- 
ity is far more confounding. This can be explained 
by the many different stakeholders involved, with 
their various motives, interests, and desires to 
tackle the array of challenges that salmon aquacul- 
ture is facing. Despite running the risk of confus- 
ing the consumers, and at worst, resulting in label 
indifference, the schemes continue to evolve with a 
goal of making themselves distinct from the others 
while aiming to expand their terrain. 

To give a more orderly and comprehensive 
understanding of the differences and similarities 
that characterize these schemes and their standards, 
we here provide a summary divided into the 3 Ps of 
certification: purpose, proprietorship, and process. 
Purpose refers to the needs and interests that have 
motivated the development of the different stand- 
ards. Proprietorship deals with the owner(s) of the 
scheme. Process involves how the standards were 
developed and which actors were involved. 


5.1 Purpose 


Each standard was established with a purpose in 
mind. Some were intended to cover very specific 
issues, such as the IFS Food Standard and the 
RSPCA, while others were meant to be more gen- 
eral and all-encompassing. In the latter category, 
the GLOBALG.A.P. Aquaculture standard and
the BAP Aquaculture standard are similar in that
they both cover aspects of food safety and quality,
social and environmental issues, and animal welfare. However,
GLOBALG.A.P. was initiated to unify several
schemes required of suppliers to provide consumers
with sustainable products, while the BAP certifica- 
tion was developed as a response to criticisms from 
environmental groups and NGOs. The ASC Salmon 
Standard, also in the latter category, differs as it is 
a species-specific scheme with less focus on food 
safety, and was developed as a response to increased 
focus on the environment and social responsibility 
of the aquaculture industry. As with many of the 
more general standards, the IFS Food Standard was 
also aimed at providing a unified standard for sup- 
pliers; however, its focus is on general food safety 
and quality. The RSPCA Assured scheme was established
to improve animal welfare and, therefore, focuses 
more or less only on concerns regarding this issue. 


5.2 Proprietorship 


GLOBALG.A.P. and IFS schemes were both 
established by retailers while the ASC and RSPCA 
standards were both initiated by non-governmental 
organizations. Of the five schemes, only the BAP 
Aquaculture Standard was started by producers.
Certification is performed by third-party certifica- 
tion bodies, except for RSPCA, which differentiates 
itself by certifying farms using their own RSPCA 
Assured assessors. Most of these private
schemes are owned by retailers and NGOs,
which means that they are able to exert power over 
the producers by demanding that these require- 
ments be met if they are to be recognized as suppli- 
ers. Moreover, the schemes come from developed 
countries and Northern markets, tipping the scales 
in favor of large companies (Belton et al. 2011). 


5.3 Process 


The development of standards for the different 
schemes is similar in the sense that they include
different stakeholders and expert groups. Some
schemes try to balance the number of representa- 
tives from the different stakeholder groups, such 
as BAP and GLOBALG.A.P. Not all the schemes, 
however, include public consultation. The IFS 
scheme, for instance, does not mention any pub- 
lic consultation nor does it say anything about 
NGO participation. Other schemes only include 
participants by invitation, such as the RSPCA, 
selecting the experts for consultation and standard 
development. Furthermore, the stated numbers of
representatives from each stakeholder group to be
included in a Technical Group are not applied in
practice (e.g. GLOBALG.A.P. and ASC).
According to Fuchs et al. (2011), the retailer-dominated
private standards, such as IFS, are
dominated by the standard owner. The food industry
and certification bodies play only a consultative
role, while civil society is not provided with a
voice. They categorize GLOBALG.A.P. as a stand- 
ard that provides an equal partnership between 
the retailers and producers through elections, and 
certification bodies only act as associate members, 
while civil society and the NGOs may participate 
in the annual meetings. Despite the seemingly 
equal opportunities for stakeholders to take part 
in representing their group, in reality, not all of the 
stakeholders can afford to take part in the development
process as this requires a lot of time and resources. 


6 CONCLUDING REMARKS 


As has been shown here, there are countless chal- 
lenges that follow the proliferation of certifications, 
standards, and labels in the aquaculture industry. 
Increasing pressure from both public and private 
regulatory agencies is causing a continuous build-up 
of demands for production companies. Since stand- 
ards purposely differ from one another in some 
ways and overlap in other aspects, there is often a 
need to comply with more than one standard. This 
means that the new standards which emerge do
not replace others, but add yet more layers. 

Having just one all-encompassing standard could 
possibly curtail certification-related work for produc- 
ers and strengthen consumers’ trust in labeling; but 
would this be attainable? Based on our findings in 
this study, it is unlikely to happen. There are
several explanations for this. For one, the different
certification schemes are in competition with
each other, as certifications, standards, and labels 
have become big business. Furthermore, the stand- 
ards are created at different times and continue to be 
adapted and revised, making a potential unification 
difficult to achieve. Most importantly, the endeavor 
to improve the aquaculture industry, currently under 
the banner of sustainability, is pulling in many dif- 
ferent directions. The numerous challenges that the 
industry is facing are subject to trade-offs and politi- 
cal priorities, as many of them run counter to each 
other. For a single standard to cover everything,
it would necessarily contradict itself.

ABSTRACT: This article proposes an ontological and semantic foundation for safety science, based on
an etymological and etiological study of the concepts of risk and safety. Awareness of the concepts
of safety and risk has evolved in similar ways because of increasingly demanding situations
and events that impact society economically, also linked to the value of human lives. From a
purely negative view on risk and safety, this awareness has grown into a more systemic and even holistic
perspective on these concepts. The proposed foundation is aligned with the semantics and concepts used
in the ISO 31000 risk management standard. Based on this foundation, the article also advocates a theoretical
model and a metaphor on how to look at safety and performance in any organization.


1 INTRODUCTION 


When one talks about safety, risk or performance, 
everyone understands what is being talked about. 
There’s no one who doesn’t grasp what the words 
mean in one’s own perception and how they can be 
understood. However, when opening a discussion 
on what these concepts really are, and how one 
should study or deal with them, the discussion is most likely to
end up in ontological and semantic debates due to 
different views, perceptions and understanding. 

Science is served by clear concepts and well-defined
parameters, because having these concepts
and parameters allows for exact measurement of
observations. In turn, this opens the opportunity
for accurate analysis, which then can be used
to develop sound theories and practices. When 
studying in the field of safety science (a relatively 
young field of science), it is hard to find unambig- 
uous definitions and parameters that clearly link 
safety, performance and risk. 

When reviewing the safety science literature, the 
question “what is safety” is answered in many ways 
and it is very hard to find a clear definition of its 
opposite, which we could also name ‘unsafety’. 
Likewise, there is also a lack of standardization 
when it comes to defining its opposite. Terms like 
accident, incident, mishap, disaster, catastrophe, 
etc. have different meanings depending on the per- 
sons or fields of knowledge using these commonly 
used words. 


Because there is no commonly accepted way to 
define safety and its opposite, it becomes very dif- 
ficult to measure and compare the level of safety 
of situations and organizations in an unambigu- 
ous or objective manner, certainly amongst differ- 
ent sectors or societies. Also, it is more difficult to 
think of proactive solutions that generate safety 
instead of developing reactive methods that pre- 
vent unsafety. 

Although the field of safety science is relatively
novel as a separate and independent domain of
study, many theories, models and metaphors have
already been proposed, attempting to describe
what safety is and how it can be achieved. Often
these theories are drawn from the investigation
of, and lessons learned from, catastrophes and
disasters. As such, these theories are often
justified by explaining how these mishaps came
about. Therefore, in general, efforts to improve
the safety of systems have mostly been driven by
hindsight, both in research and in practice
(Woods and Hollnagel, 2006).


2 EVOLVING PERCEPTIONS REGARDING 
SAFETY (SCIENCE) AND RISK 
(MANAGEMENT) 


Safety and risk are two concepts that are tightly
coupled and that have evolved similarly, both in
their development and in how people have
understood them. The evolution in how people have
dealt with risk and safety is also very much
comparable. Safety and risk are often perceived in
a similar way and are regularly used as antonyms.
Risky often means unsafe, and safe often means
without or protected from risk (as indicated in
many dictionary definitions of safety). Looking at
the past, one can see that ideas about safety and
risk have evolved in a very analogous way and for
comparable reasons.


2.1 A historical perspective on risk management, 
the etymology of risk and its etiology 


2.1.1 Ancient times 

For thousands of years, people considered all that
happened to be the will and acts of the gods
(Bernstein, 1996). So, the general idea was that
whatever one tried, things finally always happened
according to the will of the gods and there was
nothing to do about it but to accept it.

However, this doesn’t mean that concepts such as
risk and safety were unknown to people. In their
article “Risk analysis and risk management: an
historical perspective”, Covello and Mumpower
(1985) describe how in the Tigris-Euphrates
valley, about 3200 B.C., the Asipu already offered
a kind of consultancy service related to risk and
safety.


2.1.2 The Renaissance and modern time period 
The serious study of risk started during the Ren- 
aissance and it took until the work of Pascal in 
the 17th Century to see a sudden progress in the 
understanding of risk and decision making based 
on numbers. 

In this time period science was on the rise, and
it was a period of expanding trade in new and
scarce products, transported overseas. This
created a new reality. Overseas trade to distant
regions and countries was a high-risk endeavor.
This economic factor made people more aware of
the concept of managing risk. Soon, the insurance
industry emerged as an effort to manage risk in
commerce. Wealth was no longer the privilege of
the happy few, but could be earned by investing in
trade and making the right decisions (Bernstein,
1996; Covello & Mumpower, 1985).


2.1.3 20th century 

Although the etymological roots of the term risk
can be traced back as far as the late Middle Ages,
the more modern concepts of risk appeared only
gradually, with the transition from a traditional
to a modern society. With larger and ever more
complex technological systems emerging after the
Second World War (e.g. nuclear installations), the
focus on probability and risk supported a
scientific, mathematically based approach toward
risk and risk assessment (Zachmann, 2014).


Later in the twentieth century, with standards
of living quickly rising after World War II, other
objectives also became important and the concept
of managing risk expanded from a mathematically
based approach to include more qualitative
methods as well. This is the origin of operational
risk management, which can be traced back to the
discipline of safety engineering (Raz & Hillson,
2005).

Continuing losses, injuries and casualties
triggered the US Armed Forces and NASA to develop
a more comprehensive approach to risk management,
called Operational Risk Management (ORM), adapting
the world of risk management to the human factor
involved in day-to-day operations. By the end of
the century, further development of the concept of
operational risk management had expanded the view
on risk from a loss and probability perspective to
a systemic view, shifting attention from
probability to achieving goals.

In the same period of time, due to scandals such
as Barings Bank (1995), the dot-com bubble
(1997-2001) and ENRON (2001), people became ever
more concerned with the management of risk and
with good ethical practices in managing
organizations. During this last decade of the
twentieth century, there was a major surge of
interest in improving the ability to deal with an
uncertain future, at that time still with a focus
on the negative impact at the organizational
level. Operational risk scarcely existed as a
category of practitioner thinking at the beginning
of the 1990s; by the end of that decade,
regulators, financial institutions and
practitioners could talk of little else (Power,
2005). At that time, the first risk-related
standards were published (Raz & Hillson, 2005).


2.1.4 21st century 

The changes that emerged during the last quarter
of the 20th century persisted, and continually
changing ideas concerning risk management
generated an increasing understanding of the
concept of risk. Modern risk management evolved
substantially due to factors such as the rise of
knowledge-intensive work, an expanding view on
stakeholders, a growing importance of project
management, the expanded use of technology,
increased competitive pressure, increased
complexity, globalization and continuing change
(Raz & Hillson, 2005).

This growing concern and increasing awareness
regarding risk management at the turn of this
century led to the development of a whole range
of additional risk management standards. These
standards were issued by governments (Canada
in 1997, United Kingdom 2000, Japan 2001 and
Australia/New Zealand 2004), international
institutions (IEEE-USA 2001, CEI/IEC-CH 2001)
or professional organizations
(IRM/ALARM/AIRMIC-UK 2002, APM-UK 2004, PMI-USA
2004). Each of these standards, coming from a
different perspective, reflects an increasing
understanding of risk and risk management,
proposing different definitions of risk and
comparable processes to manage risks. At that
moment in time, a shift occurred from a purely
negative view on risk, still expressed in the
definitions of some of the older standards
(CAN/CSA-Q850-97:1997 and IEEE 1540:2001), to more
neutral or even very broad definitions of risk in
the other, more modern, standards. Another
remarkable aspect of the “newer” definitions is
the fact that risk is more explicitly linked to
objectives and that the effects of uncertainties
on objectives (consequences) can be positive,
negative or both (Raz & Hillson, 2005).


2.1.5 Enterprise Risk Management (ERM) 

Also in the first decade of this century, due to
a number of scandals similar to ENRON, there was
ever-increasing attention for corporate governance
and the role of operational risk management
in that regard. This resulted in the first
internationally used comprehensive corporate
standard on risk management, the COSO Enterprise
Risk Management Integrated Framework (2004)
(Mestchian et al, 2005). Enterprise Risk
Management (ERM), similar to ORM in the military
and aviation sectors, is the more holistic
approach that is needed to cope with the complex
realities and awareness of risks for the corporate
world in the 21st century.

However, the COSO ERM framework, developed as an
auditing tool to check compliance, failed during
the 2008 financial crisis, because organizations
implementing ERM would still follow the
reductionist approach they were used to. So, the
International Organization for Standardization
(ISO) established a working group to achieve
consistency and reliability in risk management by
creating a standard (ISO 31000) that would be
applicable to all forms of risk and to all kinds
of organizations, creating a foundation for risk
management (Purdy, 2010).

Little real progress could be made with the ISO
standard until all agreed on a definition of risk
that arose from a clear and common understanding
of what risk is and how it occurs. The working
group arrived at: “risk is the effect of
uncertainty on objectives”. When risk is defined
like this, it reveals more clearly that managing
risk is, quite simply, a process of optimization
that makes the achievement of objectives more
likely, objectives being understood in the
broadest sense of the word. Successfully
detecting and understanding risk, including how
it is caused and influenced, makes it possible,
if necessary, to change it so that objectives are
more likely to be achieved, and to reach them
faster, more efficiently, and with improved
results (Purdy, 2010).

The specific way in which risk is regarded by the 
ISO 31000 standard also broadens the understand- 
ing and attention of risk management towards 
performance, instead of solely focusing on compli- 
ance or the prevention of loss. 


2.2 Evolution of awareness in safety science 


Compared to the other sciences, surprisingly little
attention has been given to the history of safety
(Guarnieri, 1992). As such, the following sections
only try to give an indication of how and why the
perspective of safety changed over time.


2.2.1 The industrial revolution 

Safety science, similarly to risk management, 
originated because of a need to cope with uncer- 
tain profit, the failure of maintaining possession 
of valuable assets and the accidental injury or loss 
of workforce. In the same way expanding views 
impacted the etymology of the concepts of risk 
and risk management, ever-increasing awareness 
and knowledge regarding the concepts of safety 
and safety management has also impacted the ety- 
mology of safety and safety science. 

The industrial revolution and the appearance of
new technologies provoked recurring and severe
accidents, damaging valuable assets and causing
severe casualties and injuries to workers. In the
beginning these accidents were just seen as
setbacks, caused by workers’ behavior and part of
the business. However, during the second
industrial revolution, starting at the end of the
19th century, ongoing mechanization and new
technological developments were used to develop
new industries. Furthermore, production
engineering substantially increased productivity
with the advent of mass production. As a result,
life was getting better, incomes were rising and
mortality was declining (Mokyr, 1998). These rapid
economic, technological and social changes also
triggered the dawn of safety as a science.

Accidents have always been a problem. Yet they
did not appear as a major economic and health
issue until the early 1800s, when the declining
death rate from infectious diseases shifted
attention to other causes of mortality (Guarnieri,
1992). Accidents in a production line are costly:
not only the casualties and lost workforce, but
also the loss in production and production
capacity are a burden on the profitability of the
new factories. Furthermore, these accidents were
responsible for a high mortality in the industrial
world, leading to a bad reputation. Due to the
rising prosperity, this was no longer acceptable.
Accidents were no longer acts of God, but man-made
and preventable (Swuste et al, 2010).

From the start, awareness about risk and safety,
risk management, safety management and safety
science was triggered by the possibility of
adverse effects impacting the profitability of
endeavors and related to newly emerging sectors.
Both fields tried to accommodate losses that
affect that profitability. Risk, as such, became
the domain of insurers and the start of a whole
financial industry to compensate for financial
losses. Likewise, safety science started by
focusing on accidents, injuries and casualties
and their prevention. In both cases, people
drifted away from what they really needed, which
is safeguarding and achieving objectives, getting
what they want and being safe.

One of the first theories on safety is about acci- 
dent proneness (Farmer, 1925). However, the acci- 
dent proneness theory only looks at one possible 
cause of accidents and therefore cannot explain 
accidents in a general way. 

Heinrich observed production facilities to dis- 
cover trends and patterns in occupational accidents, 
resulting in Heinrich’s pyramid or triangle (Hein- 
rich, 1931). Heinrich also proposed his Domino 
theory on accident causation when studying the cost 
of accidents and the impact of safety on efficiency, 
opening up the perspective to the role of manage- 
ment in accident prevention (Heinrich, 1941). 
Heinrich’s domino theory became a basis for many 
other studies on accident causation and the role of 
management in accident prevention, dominating the 
world of safety practitioners well beyond the Sec- 
ond World War (Hosseinian & Torghabeh, 2012). 


2.2.2 After World War II

Heinrich’s research and work inspired other
researchers to incorporate the role of management
in their models as well. For instance, Petersen
(1971) developed a model based on “unsafe acts”
and “unsafe conditions”, and Weaver (1971) and
Bird (1974) updated the domino model with more
emphasis on the role of management (Hosseinian
& Torghabeh, 2012; Swuste et al, 2014).

At the beginning of the second half of the
twentieth century, Gibson (1961) and Haddon (1970)
focused on the causation of injuries, discovering
a formula for injury prevention. This shift in
focus also caused safety science to look at
engineering to reduce injuries, leading to safety
belts, bumpers and many other devices capable of
absorbing or deflecting energy (Guarnieri, 1992).
Also to be noted in this period is the
introduction of the “hazard-barrier-target” model
and of tools such as Failure Mode and Effect
Analysis (FMEA), Hazard and Operability Analysis
(HAZOP), the Energy Analysis approach and so on
(Swuste et al, 2014).

Similar to the evolutions in risk management,
safety science further evolved as a result of a
series of accidents which had a huge impact on
society. Flixborough (1 June 1974), Seveso (10 July
1976) and Three Mile Island (28 March 1979) are
part of the history of safety science, provoking a
broader perspective on safety and increased safety
regulations. This enlarged awareness about safety
is also reflected in the increasing political
attention for safety-related issues and the rise
in associated regulations, and is clearly
demonstrated by the advent of a number of
safety-related scientific journals in the last
quarter of the twentieth century (Hale, 2014).
Investigating these accidents expanded the
awareness of safety practitioners from the role of
management to interactions in the entire
socio-technical system.


2.2.3 More major accidents and disasters

The socio-technical concept arose in 1949 (Trist,
1981). However, at that time, in the fifties, the
societal climate was negative towards
socio-technical innovation. This would only change
thirty years later (Walton, 1979). Again, like the
development of risk management and operational
risk, safety science took up this wider
organizational perspective on safety issues from
the early eighties. At the same time, advances in
technology also made safety engineering an
indispensable part of safety science, with the
development of safety equipment, for instance
safety belts, air bags etcetera.

Another result of analyzing, amongst others,
the Three Mile Island accident is Charles
Perrow’s book, Normal Accidents (1984), in which
the ‘normal accident theory’ (NAT) is proposed. It
has been particularly influential among
researchers concerned to understand the
organizational origins of disasters and the
strategies which might be used to make
organizations safer (Hopkins, 1999).

Safety science further developed in the past
thirty years as a result of another series of
significant disasters, such as Bhopal (2-3
December 1984), Challenger (28 January 1986),
Chernobyl (26 April 1986), and the Herald of Free
Enterprise (6 March 1987). Each of these accidents
shows the complexity of socio-technical systems.
As a result, scholars try to model systems in
order to predict their behavior. Building on the
work of Rasmussen (1983), Reason (1990) proposes
the Generic Error Modeling System (GEMS), later to
become known as the Swiss Cheese model (of
defenses) (Reason, 1997, 2016).

By the end of that disastrous decade, people also
looked at human factors and behavior, introducing
the notion of safety culture. It is loosely used
to describe the corporate atmosphere or culture in
which safety is understood to be, and is accepted
as, top priority (Cullen, 1990). A more specific
approach is the concept of Just Culture (Dekker,
2008, 2017). Furthermore, the concepts of ‘High
Reliability Organizations’ (HRO) (LaPorte &
Consolini, 1991; La Porte, 1996; Weick & Sutcliffe,
2001) and ‘Resilience Engineering’ (Woods &
Hollnagel, 2006; Hollnagel, 2013) were introduced,
looking at the whole organization.

Recent years have seen a whole range of models
that try to model the taxonomy and structure of
accidents. Some examples are the Systems-Theoretic
Accident Model and Processes (STAMP) (Leveson,
2011) and the Functional Resonance Analysis Method
(FRAM) (Hollnagel, 2012). Most remarkable is that
FRAM is focused on safety instead of unsafety,
going beyond the failure concept and the concepts
of barriers and controls, and aiming at day-to-day
performance (Hollnagel, 2012). The idea is to
achieve safety proactively, an idea further
developed with the advent of the concepts of
Safety-I and Safety-II (Hollnagel, 2014a). In his
article ‘Is safety a subject for science?’,
Hollnagel (2014b) indicates the difficulty of
changing the mindset in the safety science
community from what is going wrong to what is
going right.

In the new millennium, just as risk management
expanded into a systemic/holistic view with the
advent of Enterprise Risk Management, these
approaches come together in concepts such as
Resilience Engineering, HRO and Safety-I &
Safety-II. Ever more, these modern concepts in
safety focus on what people want and how to
achieve it, instead of trying to protect against
failure. Likewise, scientists are increasingly
looking for significant leading indicators in
order to be more proactive in preventing accidents
by achieving what is aimed for. Concepts have
therefore also evolved from a purely negative view
on risk and safety towards a more encompassing
view that also considers the positive sides of
risk and safety. Now the focus is gradually more
on safety instead of solely on unsafety.


3 INCREASING AWARENESS SEEN FROM 
A SYSTEMIC PERSPECTIVE - THE 
SYSTEMS THINKING ICEBERG 


The above review of risk and risk management, 
safety models, metaphors and theories is far from 
complete. A more complete overview is to be found 
in reading the related references. We only wanted to 
show the changing etymology of the concepts of risk 
and safety, and its etiology, over a period of time. 

A model that can help in getting a comprehensive
view on the evolution of risk and safety is the
systems thinking iceberg model (Bryan et al, 2006).
The systemic iceberg model is a way to look at
reality from a systems perspective. The visible
part of the iceberg (above the waterline)
represents the events that result from the
system(s) involved. When events are observed over
time, patterns and trends can be discovered. It is
at this easy-to-perceive level of awareness of
systems that safety science originated. Still
today, safety practitioners are driven by the
facts that are directly visible, gathered in
statistical data, trying to find and understand
trends or delineate recognizable patterns related
to the observed events. As such, they try to
discover causal relationships and produce better
predictions to prevent these negative events from
happening.

The common approach to safety is to look at the 
events such as loss of life, injury, harm, damage, or 
any other event generating negative effects, trying 
to understand, predict and prevent accidents. As 
an example, the accident proneness theory can be 
seen as a result of that process. 

Increasing awareness of accidents makes it
possible to become aware of the system of causes
and effects that produces the unwanted events and
makes them repeat themselves. This is what every
accident investigation tries to achieve, i.e. an
understanding of how elements act together, in
order to find ways to prevent the events from
happening again. When the system is understood, it
is possible to proactively alter the system and
prevent the same unwanted events from happening.
Throughout history, scholars have been searching
for ways to explain why unwanted events happen and
how disasters can be predicted, trying to discover
the system(s) behind their occurrence. One of the
first to develop a theory on accident causation
was Heinrich, with his domino theory, naming the
elements of the system that are involved in the
creation of accidents and using the metaphor of
domino blocks to represent the subsystems and
their interaction. NAT and the Brownian movements
model (Rasmussen, 1995, 1997) can also be seen as
such.

To be able to predict or to obtain more under- 
standing and control on the systems involved in 
accidents, scientists also try to determine the struc- 
ture of the systems involved, getting a clearer view 
on the dynamics that are at the genesis of accidents. 
Recent years have seen a whole range of models that 
try to model and structure the systems that gener- 
ate unwanted events. STAMP and FRAM can be 
seen as examples and there is also the Swiss Cheese 
model (Reason, 1997, 2016) and the models that 
build on the same human factor approach. 

Finally, scientists and practitioners aim at devel- 
oping understanding on how mental models gener- 
ate the system(s) and how they can be controlled 
and managed. An example of a mental model gen- 
erating safety is for instance the concept of Just 
Culture (Dekker, 2008, 2017). 


4 A MODERN PERSPECTIVE ON RISK 
AND SAFETY 


Safety is often defined as a dynamic non-event and
mostly explained by the events that violated that
state of dynamic non-events (Weick, 2011). The
problem with this approach is that it only covers
the domain of unsafety and leaves any
interpretation of safety open. When safety
thinking is linked with dynamic non-events, it
solely focuses on preventing bad things from
happening. But is this the right approach in
pursuing safety?

Is turning away from unsafety the same as aiming
for safety? When one considers a situation of 100%
safety, is this a situation where nothing is
happening? This seems an impossible assumption.
There will always be something happening: events
and consequences (positive effects on objectives)
one wants, and events and consequences (negative
effects on objectives) one doesn’t want. Both are
important from a modern safety perspective. So,
what is distinctive for safety to emerge, exist
and persist? A modern perspective on safety, in
the same way as a modern perspective on risk and
risk management, looks at the whole picture. It
starts with what people want to achieve and what
needs to be safe, and first makes sure this will
be achieved. It is making certain that the return
on investment is attained when pursuing an
opportunity. Hollnagel (2014) talks of Safety-I
and Safety-II, where Safety-I is the traditional
approach of avoiding losses while Safety-II is
making sure the objectives are accomplished. In
our view, this is how safety and safety science
should evolve. It is about both the absence of
unsafety and the presence of achieved and
safeguarded objectives.

Nancy Leveson says that Safety is an emergent 
property of systems, not a component property 
(Leveson, 2011). It means Safety is something that 
needs to be achieved by the system, repeatedly. 
Obviously, a component can also be considered as 
a system on its own. Every system is made up of 
sub-systems which have other objectives than the 
overarching system. The safety of these sub-systems 
is important to the safety of the overarching system 
and each of them is subjected to a set of risk sources 
that can affect those more specific objectives. 


4.1 Total respect management 


The modern perspective on safety, and the mental
model we propose to achieve safety in a proactive
way, is called Total Respect Management (TR³M).
Respect, in the way it is used for this model, is
an expression originally derived from the Latin
word respectus. Respectus comes from the verb
respicere, meaning ‘to look again,’ ‘to look back
at,’ ‘to regard’ or ‘to consider someone or
something.’ In other words, the original meaning
of the word ‘respect’ holds the connotation of
giving someone or something dedicated attention to
have a better view on the matter or give it
consideration, particularly to come to a better
understanding (Blokland & Reniers, 2017). As much
as possible, the systems, sub-systems (including
the human factor) and their objectives need to be
known and understood.

The reason for this ‘respect’ for systems and
their sub-systems is the conviction that there is
no common structure to all ‘accidents’. This is
also how we intend to look at the Swiss cheese
metaphor. In this metaphor, the whole cheese is a
reflection and representation of a socio-technical
system and its performance. The cheese itself can
be understood as excellent performance, where
objectives have been achieved and are safeguarded
(Safety-II). On the other hand, the holes in the
cheese are the sub-systems of which the objectives
are not achieved or safeguarded. As such, these
are the different factors contributing to
accidents (Safety-I). The model’s hypothesis is
that one can never know with 100% certainty which
sub-systems will fail at a given time, or why and
how they will become connected at a given time to
produce a major accident. Therefore, it is
important to give attention to all failed
objectives and to aim at reducing the number and
magnitude of these failed objectives by increasing
the level of performance.

Each hole in the cheese is considered to be a
“failed” objective and therefore unsafe. The Swiss
cheese is dynamic. The holes constantly change 
positions and dimensions in an unpredictable way 
and could be seen as shifting around, coming and 
going, shrinking and expanding in response to 
operator actions and local demands (Reason, 1997). 

Although the idea of layers of protection is
useful to a certain extent, it only covers
Safety-I. Therefore, the TR³M model also focuses on
performance as an element of safety. The aim of
performance is to achieve objectives and keep
objectives safeguarded. As such, performance
stands for the whole cheese, or the whole
socio-technical system concerned, and the aim of
TR³M is excellent performance (excellence) in that
regard.

The way TR³M approaches the Swiss cheese
metaphor is by stating that each of these latent
conditions (failed objectives) can be seen as an
accident on its own. It is just the level of
importance and the number of objectives involved
that differentiates ‘accidents’ worth
investigating from ‘less important’ holes.
However, for TR³M each one of the holes is
meaningful and needs one’s respect. Hence the name
‘Total Respect Management’. The holes/accidents
result from (sub)systems, created by non-aligned
or defective mental models existing in, or
surrounding, the system (Blokland & Reniers,
2017).

Risk and the level of risk, in this sense, is nothing 
more than the possible effects on one’s objectives 
due to the decisions taken and performance reached 
at a given time, representing a possible future reality. 
This reality will also determine the level of safety. 
The better this future reality can be imagined, the 
more it can be shaped to the desires and needs of 
the beholders by taking the right decisions at the 
appropriate time, generating safety proactively.


4.2 A semantic connection between risk and
safety and performance


Regarding risk, ISO 31000 provides a whole set of
definitions, also to be found in ISO Guide 73.
However, for safety in its broadest sense, there
is no such internationally agreed standard, nor a
modern and commonly used definition. A random
selection of some definitions (Wikipedia, Merriam
Webster, Dictionary.com) gives a traditional view
on safety. Safety is a state, a condition, a
control, a device, a quality or anything else that
keeps us safe. Sure, we all know what safety
means: it is being protected from harm, from bad
things happening. But what does it really mean to
be safe? How can safety and security be defined,
taking into account the most recent ideas on
safety and risk, inspired by the ISO 31000
standard and its definition of risk? When are
systems or people completely safe? Isn’t it when
one has attained all of one’s objectives, when
everything performs well and nothing affects one’s
objectives in a negative way?

Defining risk as the effect of uncertainty on
objectives means that three aspects define risk,
namely ‘Objectives’, ‘Effects on objectives’ and
‘Uncertainty related to the effects and the
objectives’.

The proposed semantic foundation can be seen
as follows: risk is an uncertain effect on
objectives, while the actual performance is the
result of that uncertain effect. Performance is
Safety I + Safety II. This is why both concepts
have evolved over time in similar ways and at a
comparable timing.
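
As a rough illustration of this semantic foundation, the three aspects of risk can be written down as a small data structure. The sketch below is not taken from ISO 31000 or from the Total Respect Management model: the class names, the probability-weighted summary of risk and the split of realised effects into a loss part and an achievement part are all assumptions, chosen only to make the link between risk (uncertain effects) and performance (realised effects) concrete.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Effect:
    """A possible effect on an objective (hypothetical structure)."""
    description: str
    impact: float        # > 0 supports the objective, < 0 harms it
    probability: float   # uncertainty, expressed here as a probability


@dataclass
class Objective:
    """An objective together with the uncertain effects acting on it."""
    name: str
    effects: list[Effect] = field(default_factory=list)

    def risk(self) -> float:
        """Risk as the uncertain effect on this objective.

        Assumption: summarised as a probability-weighted impact,
        which is only one of many possible ways to express it.
        """
        return sum(e.impact * e.probability for e in self.effects)


def performance(realised: dict[str, float]) -> tuple[float, float]:
    """Split realised effects into the two safety views.

    Returns (losses, achievements): the sum of negative effects
    (the Safety I view) and the sum of positive effects
    (the Safety II view). Their sum is the overall performance.
    """
    losses = sum(v for v in realised.values() if v < 0)
    achievements = sum(v for v in realised.values() if v > 0)
    return losses, achievements


delivery = Objective(
    "deliver on time",
    [Effect("storm delay", -10.0, 0.2), Effect("early supplier", 5.0, 0.5)],
)
print(delivery.risk())
print(performance({"storm delay": -3.0, "early supplier": 4.0}))
```

Before the outcome is known, only `risk()` can be evaluated; once the effects have materialised, `performance()` describes the same objectives in terms of what was lost and what was achieved.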

Risk management, in a way, started out closely
related to gambling activities. Professional poker
players know that they don’t win by chance or as a
result of acts of the gods, but through carefully
gathering information and analyzing/considering
options based on that knowledge. It allows them
to increase the probability that they make the
right decision to support their aim of winning the
game, by taking more risk when it is appropriate
to do so and limiting the risks they take when
that is the wiser decision, each time counting on
the fact that the risks run are low for the
decisions they take. However, they will only be
safe when the game is over, all effects of
uncertainty have had their outcome and the profit
has been paid. As such, safety and risk are the
same, where risk, and how it is managed,
determines the future of one’s safety. Therefore,
the same semantic foundation can and should be
used.


4.3 What is the usefulness of defining SAFETY 
in this way? 


This perspective on risk and safety makes it
possible to develop a common terminology for
safety science and risk management. It will also
make it possible to build systems that can measure
safety instantly and in a much broader range than
is currently the case. Or, better said, to measure
‘unsafety’, because measuring safety would require
knowing all objectives present in a system and its
sub-systems. This is impossible, because most of
the time people are not aware of all of their
objectives, and organizations are also unable to
know all the objectives of all their stakeholders.
On the other hand, it is much easier to discover
unsafety, because when an objective has failed, it
is likely to be seen or to trigger a reaction.
Human nature is designed to recognize unsafety,
because it is necessary for survival. To measure
unsafety, it is sufficient to measure all the
holes to get an idea of the level of safety in the
cheese.


4.4 Safety first or first in safety 


The adage “safety first”, certainly from a traditional 
perspective, is fiction. When safety is the prevention 
of bad things happening, this credo is a real show 
stopper, as the safest thing to do for avoiding losses 
is to do nothing and prevent any activity. However, 
when you look at this motto from a fresh and modern 
perspective, it becomes a helpful mental model in 
achieving safety proactively. Safety first then means 
to achieve and protect objectives as a priority, aiming 
at excellent performance (Safety II) and calling for 
action instead of inaction. When this is the governing 
paradigm, it will also become possible to be the first 
in safety, aiming at excellence, because safety 
performance will also be an objective to be achieved 
and safeguarded. 


5 CONCLUSION 


In this article, we have expounded the evolution of 
the concepts of risk and safety: how awareness grew 
due to repeated adverse effects on objectives and the 
search for ways to understand and cope with what had 
happened. We also indicated how the meaning of the 
concepts changed as a result of this increased 
awareness. Finally, we proposed a new paradigm 
regarding safety and performance, linking risk and 
safety as synonyms rather than treating them as 
antonyms. 
ABSTRACT: Since 2017 Dutch flood protection standards are defined as target flood probabilities 
that all primary flood defences have to comply with by 2050. Explicitly accounting for uncertainties in 
probability distributions of load and resistance is an integral part of estimating the actual flood prob- 
ability. Based on such estimates, many flood defences will be reinforced in the coming years, for design 
lifetimes that are generally 25-100 years. Therefore it is important that we correctly take into account 
time-dependence of both load and resistance during the lifetime. Loads are typically uncorrelated from 
year to year, whereas strength parameters exhibit significant correlation over time. This correlation over 
time of strength parameters can significantly reduce the failure rate and increase the lifetime reliability 
of a flood protection structure. In this paper we show the implications of time-dependent reliability for a 
set of illustrative cases. We consider the effect of different degrees of temporal dependence on reliability, 
lifetime and relative cost savings. The cases show that for common configurations, the inclusion of time- 
dependent effects, especially the correlation in time of strength variables, can increase the lifetime of a 
flood protection structure by up to 50%. 


1 INTRODUCTION 


Since January 2017 the Dutch primary flood 
defences have to satisfy new risk-based safety stand- 
ards. Based on economic risk analysis, analysis of 
societal risk (risk of large numbers of casualties) 
and individual risk (risk of dying due to a flood), 
allowable (i.e. target) probabilities of failure for all 
major flood defences have been derived (Kok et al. 
2017). The failure criterion is herein defined as the 
loss of flood retention capacity resulting in flooding 
of a (defined) neighborhood with an average depth of 
more than 0.2 meters. Safety standards are generalized into 
main categories with annual allowable failure prob- 
abilities 1/300, 1/1000, 1/3000, 1/10000 etcetera. 

These safety standards are based upon a Baye- 
sian interpretation of probability, meaning that the 
failure probability should be interpreted as a state 
of belief. A change in the magnitude of uncertain- 
ties due to e.g. new knowledge or measurements 
will then cause a change in the estimated failure 
probability. Hence, when the safety standard is 
not met, reducing dominant uncertainties can be a 
very relevant measure. 

The new failure probability requirements for 
flood defences are formulated as annual probabilities, 
implying that for each separate year the failure 
probability has to satisfy the defined standard. 
In design this is often interpreted to mean that the 
failure probability at the end of the design life has 
to equal the maximum allowable probability of flooding. 
This is different from for instance the failure prob- 
abilities in the Eurocode, where the design criterion 
is expressed as both an annual target reliability and 
a reliability for a lifetime, e.g. 50 years (CEN 2002). 
Depending on the exact definition of the failure 
probability, the difference between annual and life- 
time reliability, which will be discussed in the next 
section, can have significant implications. 
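
To give a feel for how far annual and lifetime reliability can lie apart, the following minimal sketch computes two bounding cases for the lifetime failure probability. It assumes a constant annual failure probability, which is a simplification made here for illustration only; the function names are hypothetical and not taken from any assessment tool.

```python
from statistics import NormalDist


def lifetime_pf_independent(p_annual: float, years: int) -> float:
    """Lifetime failure probability if all years are independent."""
    return 1.0 - (1.0 - p_annual) ** years


def lifetime_pf_fully_dependent(p_annual: float, years: int) -> float:
    """Lifetime failure probability if every year is the same event
    (full dependence between years): it equals the annual value."""
    return p_annual


def beta_from_pf(pf: float) -> float:
    """Reliability index belonging to a failure probability."""
    return NormalDist().inv_cdf(1.0 - pf)


p = 1.0 / 1000.0   # annual target failure probability
n = 50             # design lifetime in years (illustrative)
upper = lifetime_pf_independent(p, n)
lower = lifetime_pf_fully_dependent(p, n)
print(f"lifetime Pf between {lower:.4f} and {upper:.4f}")
print(f"lifetime beta between {beta_from_pf(upper):.2f} and {beta_from_pf(lower):.2f}")
```

The real lifetime failure probability lies between these two bounds, and where it lies depends on the correlation between years.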

The actual reliability of a flood protection struc- 
ture can be assessed by doing probabilistic compu- 
tations using probability distributions of both load 
and strength variables. Historically, the tools for 
design and assessment that were used in the Neth- 
erlands are based on a semi-probabilistic approach. 
Slomp et al. (2016) give a thorough overview of 
the current safety assessment tools. The current 
assessment and design tools allow for both proba- 
bilistic and semi-probabilistic assessment and are 
based on an explicit coupling between (old) semi- 
probabilistic tools and the new probabilistic safety 
standards. 

The quantitative assessment of failure probabilities 
also enables accounting for time-dependent reliability 
effects. Next to time-dependent (uncertain) 
deterioration and uncertain changes in climate, 
correlations between years can also be taken into 
account explicitly. Loads are typically independent 
from year to year (the maximum water level in year i 
is typically not conditional on the maximum water 
level in year i-1). However, the strength variables 
are typically correlated from year to year, as these 
uncertainties are mainly caused by spatial variability 
in combination with limited knowledge. Incorporating 
this correlation could have a significant impact on the 
assessment of reliability during the lifetime. 

Currently there is little attention for the actual 
source of the uncertainty, which poses a problem 
when using concepts of time-dependent reliabil- 
ity. For instance for hydraulic load models, model 
uncertainties are used for water level, wave height 
and wave period, but the source of these uncer- 
tainties is not immediately clear. Also in the dis- 
tributions for strength parameters, there can be 
significant uncertainty, especially for geotechnical 
failure mechanisms. For instance, the failure proba- 
bility for piping is dominated by uncertainty in per- 
meability and grain size (Jongejan and Maaskant 
2015). Strength uncertainties can typically consist 
of natural variability, measurement uncertainty, 
transformation uncertainty or model uncertainty 
(Phoon and Retief 2016). The source of the uncer- 
tainty is important for two main reasons: 


• It determines the optimal method for uncertainty 
reduction: some methods might have the same source 
of uncertainty. These will not (efficiently) increase 
the quality of available data and hence not reduce 
uncertainty nor improve the reliability estimate; 

• It determines the amount of time dependence of 
subsequent years, as some uncertainties might be 
(fully) correlated in time and others may not. 


In this paper we explore different definitions 
of reliability that can be used for flood defences. 
Using illustrative cases with different degrees of 
uncertainties we illustrate the influence of these 
definitions and how that translates to the lifetime 
of flood defences and their life cycle costs. 


2 METHODOLOGY 


2.1. Time-dependent reliability 


Flood defences are generally constructed for design 
periods of 25-100 years which, in the context of 
annual failure probability, implies that in any 
given year in such a period the reliability should 
be higher than the requirement. In reliability 
engineering in general, concepts such as the survival 
time, failure rate and hazard function are used to 
characterize the temporal reliability (Kottegoda 
and Rosso 2008). Especially the hazard function 
is of interest, as this provides the failure rate of 
the system; this is conceptually shown in Figure 1. 
Here three phases are distinguished for the hazard 
rate of a system: 


• The inception phase: here the hazard rate 
decreases as, due to first experiences and quality 
control, errors are corrected. One could say that 
at t = 0 the constructed system is accepted. 

• The phase where neither initial errors nor 
deterioration play a role. In Kottegoda and Rosso 
(2008) this is denoted the useful life. 

• The deterioration phase, where deterioration of 
the system causes the hazard rate to increase 
significantly. 


The inception phase for a flood defence has two 
major aspects: first of all there is the experience 
from initial performance, mainly during construc- 
tion, that improves the reliability as instantaneous 
repairs are carried out. In this paper we consider 
flood defences that have just been delivered, so 
this phase is not considered. Secondly there is the 
dependence of failures on preceding years, mean- 
ing that if a dike doesn’t fail and there is any kind 
of correlation between the years it yields some 
information on its performance. This is an effect 
that will be relevant during the entire life-cycle. 

Figure 1. Hazard rate with distinction of three phases 
(Kottegoda and Rosso 2008). 

When also considering the other two phases, 
the distinction between the three phases does not fit 
that well for flood defences. Most of the deterioration 
processes are gradual and play a role during the 
entire life-cycle (see e.g. Buijs et al. (2009), 
Speijker et al. (2000)). In practice this means that 
there is no second phase (“failures at Poisson (or 
other) rates”), but that after delivery the reliability 
simultaneously decreases due to deterioration and 
increases due to information from non-failures. 
Whether and how these processes are taken into 
account depends on the definition of the failure 
probability that is used. For the failure probability 
in year t, $P_f(t)$, the three main ones are: 


1. $P_f(t)$, which means that the failure probability in 
year t is independent of the failure probability 
in other years. 

2. $P(f_t \cap \bar{f}_{1 \ldots t-1})$, which denotes the 
probability of failure in year t and no failures in the 
previous years. 

3. $P(f_t \mid \bar{f}_{1 \ldots t-1})$, which denotes the 
probability of failure in year t given that no failures 
occurred in the previous years. 


where $f_t$ denotes failure at t and $\bar{f}_{1 \ldots t-1}$ 
denotes no failure in the period 1...t-1. It has to be 
noted that most failure rates considered in literature 
assume a constant failure rate, which is in fact 
comparable to the first definition. The second and 
third are best compared to a description of a 
Decreasing Failure Rate (DFR) as described by 
Finkelstein (2008). However, in order to better connect 
to current flood defence reliability practice, a 
slightly different description is chosen here. 

The choice of definition is dependent on the 
application and on the specifics of the situation. 
For cases where either the correlation between $P_f(t)$ 
and $P_f(t-1)$ is small and/or $P_f(t)$ is small, 
equation 1 holds and there is little difference 
between the three definitions. 


$$P_f(t) \approx P(f_t \cap \bar{f}_{1 \ldots t-1}) \approx P(f_t \mid \bar{f}_{1 \ldots t-1}) \qquad (1)$$


For all other cases the first definition is 
conservative. 
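
The three definitions can be compared numerically for a simple case. The sketch below is an illustration under stated assumptions, not the method used in this paper: it assumes a linearized limit state $Z = \beta - \alpha_R u_R - \alpha_S u_S$ with a strength variable that is identical in every year and a load that is redrawn independently each year, no deterioration, and it uses plain numerical integration over the strength variable. The function name `annual_definitions` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def annual_definitions(beta, alpha_r_sq, year, n_grid=2001, u_max=8.0):
    """Return (p1, p2, p3), the failure probability in `year` (1-based)
    under the three definitions, for a constant beta without deterioration.

    Assumed model: Z = beta - alpha_R*u_R - alpha_S*u_S, with u_R the same
    in every year (rho_R = 1) and u_S drawn independently each year
    (rho_S = 0). Requires 0 <= alpha_r_sq < 1.
    """
    alpha_r = math.sqrt(alpha_r_sq)
    alpha_s = math.sqrt(1.0 - alpha_r_sq)
    du = 2.0 * u_max / (n_grid - 1)
    p1 = p2 = p_survive = 0.0
    for k in range(n_grid):
        u_r = -u_max + k * du
        # trapezoidal weight times the standard normal density of u_R
        weight = STD.pdf(u_r) * du * (0.5 if k in (0, n_grid - 1) else 1.0)
        # annual failure probability given the (time-invariant) strength
        p_cond = STD.cdf(-(beta - alpha_r * u_r) / alpha_s)
        no_failure_before = (1.0 - p_cond) ** (year - 1)
        p1 += weight * p_cond
        p2 += weight * p_cond * no_failure_before
        p_survive += weight * no_failure_before
    p3 = p2 / p_survive
    return p1, p2, p3


for year in (1, 10, 50):
    p1, p2, p3 = annual_definitions(beta=3.0, alpha_r_sq=0.75, year=year)
    betas = [STD.inv_cdf(1.0 - p) for p in (p1, p2, p3)]
    print(year, [round(b, 3) for b in betas])
```

In this setting the first definition always gives the largest failure probability, which is the sense in which it is conservative; the second and third stay close to each other as long as the probability of surviving the preceding years is close to 1.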

Also it has to be noted that dike reinforce- 
ments generally do not entail a complete renewal, 
but rather an improvement of an existing flood 
defence. This means that part of the flood defence 
has already passed the inception period (as well as 
the other periods) and has to some degree proven 
itself. For this paper we consider a completely new 
flood defence and do not take that consideration 
into account although it can be very important if 
part of the dominant uncertainty is in a part of the 
dike body that has existed and survived for multi- 
ple decades or centuries. 


2.2 Temporal dependence in life-cycle reliability 


We consider a simple reliability problem where the 
limit state function at time t is given by: 


$$Z(t) = R(t) - S(t) \qquad (2)$$


with R the resistance and S the load. In such a 
case, if we assume the limit state function can be 
approximated as a linearized hyperplane, the limit 
state function can be written as: 


$$Z(t) = \beta(t) - \alpha_R(t)\,u_R(t) - \alpha_S(t)\,u_S(t) \qquad (3)$$


where $\beta = \Phi^{-1}(1 - P_f)$, where $\Phi^{-1}(\cdot)$ is the 
inverse standard normal distribution function. 
$\alpha_R$ and $\alpha_S$ are the influence coefficients of the 
random variables, indicating the respective 
contribution of their uncertainty towards the failure 
probability. $u_R$ and $u_S$ are standard normal random 
variables. 

If we want to calculate the temporal reliability 
according to definitions 2 and 3 in the previous sec- 
tion, we need to take into account the correlation 
between subsequent years. The correlation of a com- 
ponent of the system in equation 2 is defined by: 


$$\rho(Z_t, Z_{t+1}) = \alpha_{R,t}\,\alpha_{R,t+1}\,\rho_R + \alpha_{S,t}\,\alpha_{S,t+1}\,\rho_S \qquad (4)$$


where $\rho_R$ and $\rho_S$ are the autocorrelations of the 
random variables for strength and load. For the 
strength, provided that there is no deterioration, it 
could be argued that $\rho_R = 1$; the load is 
independent each year, so $\rho_S = 0$. 
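
As a small worked illustration of the correlation between two subsequent years and of the relation between failure probability and reliability index, consider the sketch below. The function names are hypothetical, and the influence coefficients in the usage lines are illustrative values, not results from this paper.

```python
import math
from statistics import NormalDist


def reliability_index(pf: float) -> float:
    """beta = inverse standard normal CDF of (1 - Pf)."""
    return NormalDist().inv_cdf(1.0 - pf)


def correlation_between_years(alpha_r_t, alpha_r_next,
                              alpha_s_t, alpha_s_next,
                              rho_r=1.0, rho_s=0.0):
    """Correlation of the linearized limit state between year t and t+1:
    the sum over variables of alpha_t * alpha_{t+1} * autocorrelation."""
    return (alpha_r_t * alpha_r_next * rho_r
            + alpha_s_t * alpha_s_next * rho_s)


# Illustrative case: alpha_R^2 = 0.75 and alpha_S^2 = 0.25 in both years,
# fully correlated strength and uncorrelated load.
alpha_r = math.sqrt(0.75)
alpha_s = math.sqrt(0.25)
rho_z = correlation_between_years(alpha_r, alpha_r, alpha_s, alpha_s)
print(f"rho(Z_t, Z_t+1) = {rho_z:.2f}")
print(f"beta for Pf = 1/1000: {reliability_index(1 / 1000):.2f}")
```

With a fully correlated strength and an uncorrelated load, and influence coefficients that do not change between the two years, the correlation between years reduces to the squared influence coefficient of the strength.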

For combining correlated components one 
could use numerical integration or probabilistic 
techniques such as Monte Carlo, but a very fast 
and efficient method is the Equivalent Planes 
Method, which is extensively described by Roscoe 
et al. (2015). In Roscoe et al. (2015) it is shown that 
this method is accurate for most cases, although 
some accuracy is lost for very large systems and 
for very strong correlations. However, as the load 
is uncorrelated and values for $\alpha_R^2$ are typically at 
most 0.7, such high values for the correlation will 
not be encountered when studying temporal 
reliability of flood defences. 

The Dutch flood defense act allows for all three 
definitions of the previous section to be applied. 
The Equivalent Planes method therefore provides 
a fast and reliable method for evaluating the second 
and third definition of the annual failure 
probability. For the assessment of existing structures 
the third definition is most sensible, as in such cases 
it would be desirable to take into account that the 
structure did not fail in the previous years, as has 
been done in, for instance, Schweckendiek (2014) 
and Schweckendiek et al. (2017). The second definition 
can be used for design purposes, as it is sensible not 
to account for the probability of failure in year t 
when the built structure has already failed in year 
t-1. 

It has to be noted that for small probabilities of 
failure the second and third definition are almost 
the same, as it follows from the definition of 
conditional probability that the two definitions for 
year t differ by a factor $1/\prod_{i=1}^{t-1}(1 - P_f(i))$. 
This indicates that for small $P_f$ the difference will 
be negligible. 
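
The relation between the second and third definition can be written out from the definition of conditional probability and the chain rule, in the notation of Section 2.1. This is a short worked form, not an additional result:

```latex
P(f_t \mid \bar{f}_{1 \ldots t-1})
  = \frac{P(f_t \cap \bar{f}_{1 \ldots t-1})}{P(\bar{f}_{1 \ldots t-1})},
\qquad
P(\bar{f}_{1 \ldots t-1})
  = \prod_{i=1}^{t-1} \bigl( 1 - P(f_i \mid \bar{f}_{1 \ldots i-1}) \bigr)
```

Here the condition is empty for i = 1. As long as the yearly failure probabilities are small, the denominator stays close to 1 and the two definitions nearly coincide.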

In order to combine different years, it is impor- 
tant that correlations in time between the limit state 
function from year i—1 to year i are correctly esti- 
mated. In many cases the reliability problem will 
not be so easy as the previously described problem, 
but will consist of many random variables that are 
(partially) correlated in time. In order to properly 
determine the temporal correlation of parameters 
and uncertainties it is important to classify uncer- 
tainties based on their original source, as only then 
a reliable classification can be made. This is further 
explained in the following section. 


2.3 Uncertainty in flood defence reliability 


There are various sources of uncertainty in flood 
defence reliability assessments. These are categorized 
by Gelder (2000) as inherent (aleatory) uncertainty in 
time and space, and knowledge (epistemic) uncertainty 
due to model and statistical uncertainty. Other 
categories can be used as well, e.g. Walker et al. 
(2003) distinguish between different levels of 
uncertainty, and how these influence a decision 
problem. In general a distinction is often made 
between reducible and irreducible uncertainty, as these 
influence the optimal action to deal with unacceptable 
failure probabilities (see e.g. Slijkhuis et al. (1997) 
and Schweckendiek (2014)). Inherent uncertainties are 
typically considered irreducible, whereas knowledge 
uncertainty is considered reducible. This framework 
works well for typical loads on flood defences (a 
better model reduces model uncertainty, but inherent 
natural variability in annual maxima of river 
discharges remains irreducible), but is less trivial 
for strength uncertainties. The strength uncertainty of 
flood defences mainly arises due to heterogeneity of 
the subsoil and dike body combined with limited 
knowledge of this subsoil, in combination with 
imperfect models describing the strength of the flood 
defence. Here, most uncertainty could theoretically be 
reduced, but the question is more whether it is 
economically feasible to do so than whether it is 
technically possible (Schweckendiek 2014). It is 
therefore more applicable to use the classification of 
Phoon and Retief (2016), where geotechnical strength 
uncertainties are split into natural variability, 
measurement uncertainty, transformation uncertainty and 
model uncertainty. All of these uncertainties can be 
reduced to some extent, but each requires a different 
measure. For instance, if the source of uncertainty 
in Pre-Overburden Pressure (POP) is mainly natural 
variability, more measurements could be applied. 
However, if the source is an old and inaccurate 
measurement method, a more accurate method should be 
applied, as there will also be a lot of measurement 
uncertainty. Hence it is important to systematically 
distinguish the main uncertainties based on their 
original source. 

When doing a time-dependent reliability analysis 
this becomes even more important, as some uncer- 
tainties (mainly epistemic strength uncertainties 
and model uncertainties) will be correlated in time, 
whereas aleatory uncertainties are not. In order to 
correctly apply the notion of non-failure in pre- 
ceding years, these uncertainties should be clearly 
distinguished. In this paper we will focus on the 
influence of temporal correlation on time-depend- 
ent reliability: it has to be noted that also spatial cor- 
relation can be used as information. For instance if 
the same model is used for different dike sections, 
and model uncertainty is the dominant parameter, 
failures and non-failures at location A might pro- 
vide information on the reliability at location B. 
However this is out of the scope of this paper. 


3 RESULTS 


3.1 Case description 


In order to investigate the effects of different 
formulations of temporal reliability, and the influence 
of different values of uncertainty and correlation for 
different values of the reliability index, we use 
fragility curves to describe the strength of a flood 
defence. This is a broadly used method of aggregating 
failure probabilities from more complex failure models 
(see e.g. Bachmann et al. (2013) and Schweckendiek 
et al. (2017)). The fragility curve expresses the 
critical height $h_c$, which is an integration of the 
joint probability of the strength given a certain water 
level, resulting in the following limit state function: 


$$Z = h_c - h \qquad (5)$$


where $h$ is the water level and $h_c$ the critical height. 
This approach is sound as long as the water level is 
(strongly correlated to) the dominant load for the 
mechanism. In this case we consider flood defence 
reliability described by the aforementioned limit 
state function, where it holds that $h_c$ and $h$ are 
normally distributed. The Equivalent Planes method 
requires information about the influence coefficients 
($\alpha_{i,j}$) of all i random variables per year j, 
reliability indices ($\beta_j$) for each year j and, as 
autocorrelations are constant in time, a correlation 
matrix with dimensions i × i. Using this method 
we can then combine the non-failure and failure 
events for subsequent years. 




3.2 Example 1: Life-cycle reliability of a dike 
without deterioration 


First we investigate the life-cycle reliability of a 
dike without deterioration, so constant $h_c$. In this 
case $P_f(1) = \ldots = P_f(t-1) = P_f(t)$. 
For the temporal autocorrelation it holds that 
the strength is fully correlated ($\rho_{h_c} = 1$), which 
is typical for many strength parameters of flood 
defences. The loads are uncorrelated from year to 
year ($\rho_h = 0$), as the maximum water level in year 
i is typically independent of the maximum in year 
i-1. In the examples we will only consider the 
second definition for the annual reliability index, 
which is $\beta(f_t \cap \bar{f}_{1 \ldots t-1})$, as the difference 
with $\beta(f_t \mid \bar{f}_{1 \ldots t-1})$ is very small. For 
instance, if we assume that $\beta = 3$ and 
$\alpha_{h_c}^2 = \alpha_h^2 = 0.5$, the relative difference in 
$\beta$ after 100 years is only 0.5%. 

As $h_c$ is assumed to be fully correlated in time, 
and $h$ is fully uncorrelated, the influence 
coefficients will have a significant influence on the 
difference between $\beta(t)$ and 
$\beta(f_t \cap \bar{f}_{1 \ldots t-1})$. 

Figure 2 shows the relative change in reliability 
index $\beta$ for different values of $\alpha_{h_c}^2$, for 
$\beta = 3$. As expected, it is observed that for higher 
values of $\alpha_{h_c}^2$ the difference is larger. Typical 
values for the influence coefficient of the strength 
for failure due to overflow are very small (order of 
0.1 or 0.2), but for geotechnical failures these are 
often in the order of $\alpha_{h_c}^2 = 0.75$, meaning that for 
a lifetime of 50 years the various definitions of the 
reliability yield a difference in resulting reliability 
index of 10%. In terms of failure probability this is 
approximately a factor 3, which is equal to the 
difference in safety standard for two subsequent 
categories as defined in the law (e.g. 1/300 to 
1/1000). Another important fact is that the reducing 
effect diminishes over time, which can be explained 
from the change in $\alpha$ over time, see Figure 3. The 
fact that the influence coefficient of the correlated 
variable reduces in time makes intuitive sense, as 
more of the same information will result in 
increasingly less new insight. 

Figure 2. Relative change in $\beta(t)$ for various values 
of the time-correlated $\alpha_{h_c}^2$ and $\beta(t = 0) = 3$. 

A last important investigation of this simple 
case is the level of correlation. As was argued in 
the preceding sections, it is important to distinguish 
different parameters with different uncertainties 
and different temporal correlations. However, 
Figure 4 shows that having a $\rho_{h_c}$ that is slightly 
smaller than 1 has little influence on $\beta(t)$: even 
for a $\rho_{h_c}$ of 0.7, values close to the ones for full 
correlation are found. So even if there is a small 
error due to e.g. combining two (uncertainty) 
parameters with different time correlations, the 
influence on the result will often be relatively small. 

Figure 4. Relative change in $\beta(t)$ for various values 
of the correlation coefficient $\rho_{h_c}$ and 
$\beta(t = 0) = 3$. 


Table 1. Parameters for Example 2. 

Variable       Distribution       Parameters 
$h_c$          $N(\mu, \sigma)$   $\mu = 4.5$, $\sigma = 1.05$ 
$h$            $N(\mu, \sigma)$   $\mu = 0$, $\sigma = 0.6$ 
$\Delta h_c$   Gamma process      mean annual settlement 2 cm, 
                                  coefficient of variation 30% 


3.3 Example 2: Life-cycle reliability of a dike 
with deterioration 


In practice it will not occur that a dike remains 
the same for 50 years. The most common deterioration 
mechanism for dikes is settlement, which can 
be described by parametric models (see e.g. Buijs 
et al. (2009)) or stochastic process models such as the 
Gamma process (Pandey and van Noortwijk 2004). 
Here we use such a Gamma process. We introduce a 
new random variable $\Delta h_c$ which denotes the change 
in critical height compared to the first year. For the 
sake of the example we make an important 
simplification here, as we assume that the critical 
height is fully dominated by the initial crest height 
and its settlement. For many (geotechnical) failure 
mechanisms this is not the case, and other types of 
deterioration will be more dominant. We assume that the 
average annual settlement is 2 cm with a coefficient 
of variation of 30%. By splitting the variables we 
can maintain that $\rho_{h_c} = 1$. We assume 
$\rho_{\Delta h_c} = 0$ as it is a random process; this leads 
to the distributions for the random variables given in 
Table 1. 

The choice of the distributions is such that 
for the initial situation $\alpha_{h_c}^2 \approx 0.75$, comparable 
to the first example. The initial $\beta \approx 3.7$. It has 
to be noted that while we attribute the temporal 
change to settlement in this case (i.e. decrease in 
strength), it could also be attributed to an increase 
in load, for instance due to climate change. The 
behaviour of such a parameter would be similar: 
increasing in time with increasing uncertainty. 
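
The initial values quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the entries 4.5 and 1.05 for the critical height and 0 and 0.6 for the water level in Table 1 are a mean and a standard deviation in metres, which is our reading of the table. The Gamma parameters are obtained by moment matching for a single year of settlement, which is one possible parameterization and not necessarily the one used for the results in this paper. All function names are hypothetical.

```python
import math


def form_normal_difference(mu_r, sigma_r, mu_s, sigma_s):
    """Reliability index and squared influence coefficients for
    Z = R - S with independent normal R and S."""
    sigma_z = math.hypot(sigma_r, sigma_s)
    beta = (mu_r - mu_s) / sigma_z
    alpha_r_sq = (sigma_r / sigma_z) ** 2
    alpha_s_sq = (sigma_s / sigma_z) ** 2
    return beta, alpha_r_sq, alpha_s_sq


def gamma_parameters(mean, cov):
    """Shape and scale of a Gamma distribution with the given mean and
    coefficient of variation (moment matching)."""
    shape = 1.0 / cov ** 2
    scale = mean * cov ** 2
    return shape, scale


# Initial situation of Example 2 (assumed reading of Table 1).
beta_0, alpha_hc_sq, alpha_h_sq = form_normal_difference(4.5, 1.05, 0.0, 0.6)
print(f"initial beta = {beta_0:.2f}, alpha_hc^2 = {alpha_hc_sq:.2f}")

# One year of settlement: mean 0.02 m, coefficient of variation 30%.
shape, scale = gamma_parameters(mean=0.02, cov=0.30)
print(f"Gamma shape = {shape:.2f}, scale = {scale:.4f} m")
```

Under these assumptions the sketch gives an initial reliability index of about 3.7 and a squared influence coefficient of the critical height of about 0.75, in line with the values stated in the text.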

Figure 5. Values for different definitions of $\beta$ with 
deterioration. 


Figure 5 shows the results for the time-dependent 
reliability. Here it can be seen that for this case 
the influence of the definition is rather large: 
when we compare to a minimum required reliability 
$\beta = 3.09$ ($P_f$ = 1/1000 per year), the expected 
extended life when taking into account survival 
is approximately 15 years (or: an extension of the 
lifetime by almost 50%). 

This amount of lifetime extension, however, depends 
on the rate of deterioration, and especially on the 
uncertainty in deterioration. Figure 6 shows the 
$\alpha^2$-values for two rates of deterioration: on the left 
is the same as used in Figure 5, on the right is a 
distribution with higher variation and slightly lower 
mean, such that $\beta(f_t)$ is equal for both cases. 
However, for the deterioration with high variation, 
$\beta(f_t \cap \bar{f}_{1 \ldots t-1})$ is significantly 
smaller than for the case with smaller variation. 
This can be explained by a smaller $\alpha_{h_c}^2$ in the 
design point, meaning that the influence of that 
uncertainty on the reliability is smaller, resulting in 
less valuable non-failure information. 


3.4 Economic implications of time dependent 
reliability 

Generally the goal of flood defence management 
is to maintain flood defences at a desired level of 
reliability, against acceptable costs. In many cases 
Life Cycle Costing (LCC) is used to evaluate costs 
in time, for which the principles were first reported 
by Samuelson (1937). In an LCC analysis the Net 
Present Value, which denotes the value in current 
day prices, is calculated using the following formula: 


$$NPV = \sum_{i=0}^{t} \frac{C_i}{(1+r)^i} \qquad (6)$$


where $C_i$ is the total cost in year i, r is the discount 
rate and t is the evaluation period. One of the major 
implications of this economic theory is that 
postponing an investment yields significant benefits. 
For instance: if we postpone an investment by 10 
years, assuming a discount rate of 3%, the current 
standard in the Netherlands (Werkgroep Discontovoet 
2015), the cost of the postponed investment, expressed 
in present day prices, is only about 75% of the 
original cost. A disadvantage of using LCC is that it 
is slightly harder to compare investments with 
different lifetimes. For such comparisons the NPV can 
be expressed as Equivalent Annual Cost (EAC), which is 
calculated using the following formulas (Schoemaker 
et al. 2016): 


$$EAC = \frac{NPV}{A_{t,r}} \qquad (7)$$


$$A_{t,r} = \frac{1 - (1+r)^{-t}}{r} \qquad (8)$$


where $A_{t,r}$ is the annuity factor for year t and 
discount rate r, which denotes the sum of the discount 
factors compared to t = 0. 

Figure 6. Change of $\alpha^2$ in time (in years) for different 
rates of deterioration. 
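
The discounting relations can be tried out with a short sketch. It assumes that costs are indexed from year 0, so that a cost in year 0 is not discounted, and it uses the 3% discount rate mentioned in the text; the function names are hypothetical. The last lines compare the Equivalent Annual Cost of the same investment spread over 50 and over 70 years.

```python
def discount_factor(rate: float, year: int) -> float:
    """Present value of one unit of cost made in the given year."""
    return (1.0 + rate) ** (-year)


def net_present_value(costs, rate: float) -> float:
    """Sum of discounted costs; costs[i] is the total cost in year i."""
    return sum(c * discount_factor(rate, i) for i, c in enumerate(costs))


def annuity_factor(rate: float, years: int) -> float:
    """Sum of the discount factors for years 1..years."""
    return (1.0 - (1.0 + rate) ** (-years)) / rate


def equivalent_annual_cost(npv: float, rate: float, years: int) -> float:
    return npv / annuity_factor(rate, years)


rate = 0.03
postponed = discount_factor(rate, 10)
print(f"cost postponed by 10 years, in present day prices: {postponed:.2f}")

investment = 1.0  # unit investment at t = 0 (illustrative)
eac_50 = equivalent_annual_cost(investment, rate, 50)
eac_70 = equivalent_annual_cost(investment, rate, 70)
reduction = 1.0 - eac_70 / eac_50
print(f"relative reduction in EAC from 50 to 70 years: {reduction:.1%}")
```

Because the annuity factor grows ever more slowly with the lifetime, each additional year of lifetime reduces the Equivalent Annual Cost by less than the previous one.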

In Figure 5 it was shown that the definition of 
time-dependent reliability can have a significant 
influence on the lifetime of a structure. To further 
investigate this we consider 4 situations, and 
generate distributions for the initial strength $h_c$ 
corresponding to a wide range of $\alpha^2$-values. The 4 
cases contain 3 cases (1, 2 and 3) with different 
reliability requirements and 1 case (case 4) with an 
adapted uncertainty for the settlement (similar to the 
comparison of different deterioration rates in the 
previous section). The different cases are summarized 
in Table 2. It is expected that the case with low 
target reliability has the largest life extension, as 
here a non-failure is more relevant than for the case 
with a very high reliability. Also, based on the 
$\alpha^2$-plots in Figure 6, where we saw a more rapid 
decrease in $\alpha_{h_c}^2$ for a higher variation of the 
settlement, we would expect the increase in lifetime 
for the case with small variation in settlement to be 
larger. 

Figure 7 shows the extension of the lifetime in 
years for the 4 considered cases. Here it can indeed 
be seen that for a lower reliability larger extensions 
are gained, and that for lower uncertainty in deterio- 
ration the effect of non-failures in preceding years is 
also larger. It has to be noted that the lines are a bit 
wobbly, which is due to the fact that the reliability is 
determined per year, resulting in discrete steps and 


Table 2. Cases for analysis of lifetime extension. 


Case BP) vary, 
: 2.326(10-) me 

2 3.090 (107) ue 

3 3.719(10~) 23 

i 3.090(10-) aw 


therefore small wobbles. However the lifetime exten- 
sion doesn’t directly translate into financial benefits. 
Therefore in Figure 8 the relative savings following 
from the postponement of a new reinforcement are 
shown for a discount rate of 3%. Here it follows that 
in the most extreme case (high @, for Case 2) the 
relative savings can amount up to a factor 3. These 
relative savings are independent of other investments 
during the life cycle, and also independent of the 
actual reference year as the relative savings are lin- 
ear in time (due to the exponential character of the 
discount rate). For instance for assessments of exist- 
ing structures this would be a relevant value, as the 
reference year wouldn’t be 0 but somewhere between 
0 and the end-of-life. However in design decision 
making the change in Equivalent Annual Costs is 
more relevant, as this denotes the economic yearly 
cost for a design option. If we assume that at t = 0 
a reinforcement is made for a lifetime of 50 years, 
for a discount rate of 3% a reduction in Equivalent 
Annual Cost of approximately 12.5% is realized for a 
lifetime extension to 70 years, which is in accordance 
with Figure 8. Obviously this will not hold for all 
flood defences in the Netherlands, but mainly in the 


Lifetime extension related toa? 


Uite extension in years 


Figure 7. Life extension in years for different &,- 
values for the 4 considered cases. 


Relative savings in reinforcement cost 


von Cane 1: P,a 10” 


—— Cane 2 Pe 107 


Figure 8. Factor of relative savings for the reinforce- 
ment cost for the 4 considered cases 
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riverine area failure probabilities are dominated by 
geotechnical failure mechanisms, meaning that such 
lifetime extensions will not be uncommon. 
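
The Equivalent Annual Cost comparison discussed above can be sketched as follows. This is a minimal illustration that assumes the conventional capital-recovery definition of EAC (a one-off investment divided by an annuity factor); the text does not state which formula was used, and the normalised investment of 1.0 is hypothetical. Under this assumption the reduction for an extension from 50 to 70 years at a 3% discount rate comes out at roughly 12%, of the same order as the value quoted above.

```python
def annuity_factor(rate: float, years: int) -> float:
    """Present value of 1 unit paid at the end of each year for `years` years."""
    if rate <= 0 or years <= 0:
        raise ValueError("rate and years must be positive")
    return (1.0 - (1.0 + rate) ** -years) / rate


def equivalent_annual_cost(investment: float, rate: float, years: int) -> float:
    """Spread a one-off investment at t = 0 over its lifetime as a constant yearly cost."""
    return investment / annuity_factor(rate, years)


# Hypothetical normalised reinforcement cost of 1.0 at t = 0 (assumption).
eac_50 = equivalent_annual_cost(1.0, rate=0.03, years=50)
eac_70 = equivalent_annual_cost(1.0, rate=0.03, years=70)

# Relative reduction in yearly cost when the same investment lasts 70 instead of 50 years.
reduction = 1.0 - eac_70 / eac_50
print(f"EAC reduction for 50 -> 70 years: {reduction:.1%}")
```

Because the investment appears in both numerator and denominator, the relative reduction depends only on the discount rate and the two lifetimes.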


4 CONCLUSIONS AND 
RECOMMENDATIONS 


In this paper we have explored various aspects 
of temporal dependence in reliability of flood 
defences. As flood defence reliability is often deter- 
mined by highly uncertain but temporally corre- 
lated strength parameters, non-failures in previous 
years constitute information that can improve the 
estimate of the strength, especially when the estimated
reliability is low. In this paper we have explored
some parametric cases where it is shown that the
savings due to accounting for non-failures rather 
than the commonly used conservative approach 
of disregarding temporal dependence in reliabil- 
ity can be significant (up to 20 years in lifetime 
extension). This is of relevance for many aspects 
of flood defence management such as design 
guidelines as well as assessment of existing flood 
defences where often low reliabilities are found 
from models. What the consequences of accounting for
this temporal dependence are for different decision
problems in flood defence management has to be
investigated further, especially by looking in more
detail into the sources of uncertainty
in actual dike reliability analyses. However, based
on the considered cases it is expected that it can 
significantly improve reliability estimates in assess- 
ment and design, especially when there is large 
uncertainty on strength parameters that are cor- 
related in time. It has to be noted that in such cases 
it might be necessary to improve the knowledge on 
strength parameters through obtaining additional 
information. Due to the high strength uncertain- 
ties that are often encountered, the consideration 
of the value of improved information is one of the 
major decision problems to be studied for flood 
defence management. 

Impact assessment of road infrastructure: A holistic approach

Faculty of Engineering, Ostfold University College, Fredrikstad, Norway


ABSTRACT: As construction companies attempt to maximize the value of each road project and
optimize their investment portfolio, it has become vital to assess the impact of road transport projects for
their cost effectiveness in order to support well-informed decision-making. Proper impact assessment will
result in more advanced road design, more efficient operations, and improved environmental protection.
Identifying cost-effective road designs that have a minimal environmental footprint, however, is one of
the biggest challenges for the road construction industry. The purpose of this paper is thus to propose a
holistic methodology for assessing the impacts of road projects by taking into account the available
natural resources and the environment, and by involving all relevant stakeholders. The paper also outlines
steps to assess both monetised and non-monetised impacts. By employing the proposed steps, the pros
and cons of new road networks can be identified in a structured way, thereby highlighting factors of
success and failure.


1 INTRODUCTION 


Across the world, inadequate or poorly designed
and performing road infrastructure leads to
major financial and social challenges, which
governments and businesses need to address (Dobbs
et al., 2013, Ayele et al., 2013). Further, in a broad
sense, road network designs made without a proper
impact assessment result in roads that are not
adequate for meeting the current and future road
transport needs of a given society (Systematics,
2012). Hence, in investment decisions, the relevant
decision-makers often carry out a cost-effectiveness
analysis (CEA) and a cost-benefit analysis (CBA)
of road network projects. When identifying and
categorizing the costs and benefits of road design
alternatives, a reasonable effort has to be made to
identify those costs that will have the most
significant implications for the strategic decision.

To identify cost-effective road network designs
that have a minimal environmental footprint, it
has been argued that two questions are fundamental
(Macias and Gadzinski, 2013, Hanley, 2001,
Ayele et al., 2016b). Firstly, which road design
alternative is estimated to be cost-effective and
environmentally sustainable, based on the prevailing
evidence? Secondly, should further research be
carried out in order to minimise the level of
uncertainty related to the decision? To answer these
questions and to determine the cost-effectiveness of
the road network design, a number of studies have
been carried out; see e.g. Hanley (2001), Macias
and Gadzinski (2013), Nagurney et al. (2010),
and Welde et al. (2013). For instance, Welde et al.
(2013) compared the implementation of CEA and
CBA, the two common economic evaluation
methods, as tools for the allocation of national
public funds in the transport sector. Moreover,
Macias and Gadzinski (2013) discussed the
influence of road transport networks on the natural
environment. Furthermore, for a better understanding
of the environmental impacts, Nagurney et al.
(2010) proposed environmental impact assessment
(EIA) indices for evaluating the environmental
effects of link capacity degradation in transportation
(road) networks.

However, in most of the available CEA and CBA 
literature, either cost or environmental impact is 
the only factor considered; and there is a lack of 
consideration of the available natural resources, 
the environment, and all relevant stakeholders 
when assessing the impacts (monetised and non- 
monetised) of the road projects. Further, the ade- 
quacy of the models used to estimate the various 
costs and benefits of the road network investment 
projects is increasingly being called into question 
(Systematics, 2012). 

This is considered a significant drawback,
since inadequate or poorly designed infrastructure
presents major economic and social challenges.
Further, understanding how a specific decision or
choice of assessment method affects the intended
goals is key to identifying a cost-effective road
network design. Proper CEA and CBA will result
in more advanced road network design, more
efficient operations, and improved environmental
protection. Moreover, they can be used to identify
weaknesses or strengths of existing or new road
networks in a structured way, thereby highlighting
factors of success and failure. They are also a core
element in examining the overall quality of the road
design alternatives.

Hence, the purpose of this paper is to propose
a holistic methodology for assessing the impacts
of road projects by taking into account the
available natural resources and the environment, and
by involving all relevant stakeholders. The paper
also outlines steps to assess both monetised and
non-monetised impacts. The rest of the paper is
organised as follows: Section 2 presents a problem
description. Section 3 discusses the CBA and CEA
processes. Section 4 introduces the
proposed holistic methodology for assessing the
impacts of road networks. Section 5 provides
the concluding remarks.


2 PROBLEM DESCRIPTION 


The problem considered here is an impact assessment
problem for road networks. Suppose we
have a finite number of road network design
alternatives, each with different construction,
maintenance and HSE (health, safety, and environmental)
costs. The idea is to use the most suitable design
alternative with a minimum environmental footprint.
However, when the available natural
resources, the environment, and the involvement of
relevant stakeholders are considered in the CEA and
CBA of road design alternatives, the decision-maker
faces a time-variant decision-making process.
In other words, the decision-maker is faced with
an optimisation problem, since these factors can be
considered as covariates. A covariate, in the context
of this paper, is a factor that can have an influence
on the direct and indirect costs as well as the
decisions of the road projects.

To optimise effectiveness and benefit, and
to determine which road design alternative will
be used, the proposed holistic methodology
continually assesses the impacts of the road projects by
considering the existing natural resources and the
environment, and by involving all relevant stakeholders.
The proposed methodology attempts to capture
the monetised and non-monetised impacts of new
road systems in a structured way, thereby
highlighting factors of success and failure.


3 A GENERIC CEA AND CBA PROCESS 


Assessment tools are structured procedures or
methods which, when used, result in objective and
replicable information, either on the technological
and design appropriateness of the decision choice
or on the local enabling or restricting conditions
(technical or managerial) that can influence the
attainment or non-attainment of the decision choice
(Zurbrügg et al., 2014). A range of assessment
methods are often developed for assessing a
specific aspect of the road networks, such as technical,
environmental and health, economic and financial,
social and institutional, or organizational aspects;
others attempt to provide a more holistic
picture by integrating different assessment
methods into the same tool; see e.g. (Welde et al., 2013,
Hanley, 2001, Nagurney et al., 2010, Macias and
Gadziński, 2013). When assessing an environmentally
sustainable and cost-effective road project, in
general, three main issues govern the final decision.
Figure 1 illustrates these three main issues, which
need to be considered when assessing an
environmentally sustainable and cost-effective road project.


3.1 CBA and CEA 


3.1.1 Cost-Benefit Analysis (CBA) 

CBA is an analytical method for assessing the
total costs and benefits of the planned project
(Finnveden et al., 2007, Skovgaard et al., 2007).
It provides a starting point from which to begin the
evaluation of a project; it can provide quantitative
data to back up qualitative arguments; it
allows interested parties to clearly define the issues
involved; and it allows comparisons to be made
between investments or projects (Gerald Shively
and Marta Galopin, 2014). However, it requires
that the analyst assign monetary values to all
benefits and costs, although numerous benefits and
costs are intangible and therefore
difficult to value; CBA results can be very sensitive
to the choice of the discount rate; and some
future benefits and costs cannot be conceived, much
less measured (Gerald Shively and Marta Galopin,
2014).

[Figure 1 shows EIA, CEA/CBA, and relevant stakeholders
surrounding an environmentally sustainable and
cost-effective road project.]

Figure 1. An overview of the relationship between the
main aspects of road network.


3.1.2 Cost-Effectiveness Analysis (CEA) 

A generic CEA process involves: i) Determining
which cost variables affect the cost-effectiveness
of the chosen road design solution. This includes
determining and analysing: (a) internal cost factors,
which arise because of company decisions
and goals; the company largely manages these cost
factors and, if necessary, they can be changed; and
(b) external cost factors, which are not controlled
by the company but will influence the overall cost
and the decisions. ii) Determining the inherent risk
factors for the chosen highway design practices and
the company’s tolerance for them. iii) Determining
the functional interdependence between the cost
and risk variables and the degree to which each of
these variables can be controlled. Figure 2
illustrates the key steps in the CEA.

CEA focuses on the costs of achieving the goals
of the system and the most efficient way of achieving
them (Finnveden et al., 2007, Ayele et al., 2014).
The benefits of achieving the goals do not have
to be quantified (Finnveden et al., 2007). CEA is
most useful when analysts face constraints which
prevent them from conducting CBA (Tan-Torres
Edejer et al., 2003, Ayele et al., 2014). For instance,
the most common constraint is the inability of
analysts to monetize benefits. Though the basic
cost-effectiveness calculation appears to be simple,
choices about units of measurement, scope
of costs, and prices to be included not only will
alter the numerical results but also will affect the
interpretation of the cost-effectiveness ratio (The
International Bank for Reconstruction and
Development and The World Bank Group, 2006).

[Figure 2 lists the steps in order: identify and categorise
cost elements; project costs and benefits over the life of
the system; monetise (place a 'dollar' value on) costs;
estimate the total cost.]

Figure 2. Key steps in the CEA.


3.1.3 Cost-Effectiveness Ratio (CER) 

The CER is a ratio in which the denominator is
the unit of effectiveness and the numerator is the
present value of the cost of a particular road
infrastructure. Units of effectiveness are a measure
of any quantifiable outcome central to the road
design objectives. In road construction, the total
volume of construction material required would
be the most important outcome and would therefore be
an obvious unit of effectiveness. Mathematically, the
CER for a specific road infrastructure, based on
Cellini and Kee (2010), can be described as:


$$CER_{RInf.} = \frac{PVC_{RInf.}}{U_{RInf.}} \qquad (1)$$

where:

• $U_{RInf.}$ is the unit of effectiveness of road
infrastructure (RInf.), and

• $PVC_{RInf.}$ is the present value of cost of road
infrastructure (RInf.), and is given by:

$$PVC_{RInf.} = TC_{RInf.,1} + \frac{TC_{RInf.,2}}{(1+r)^{1}} + \frac{TC_{RInf.,3}}{(1+r)^{2}} + \dots + \frac{TC_{RInf.,T}}{(1+r)^{T-1}} \qquad (2)$$

where:

• $TC_{RInf.,t}$ is the annualised total cost of road
infrastructure (RInf.) in year $t$,

• $t$ indicates the year from 1 to $T$ (the last year of
the analysis),

• $r$ is the discount rate, which is meant to reflect
society’s impatience or preference for consumption
today over consumption in the future.


CER results are very sensitive to the choice of
the discount rate, so an appropriate choice
of the discount rate is critical. There is ongoing
and considerable debate as to the appropriate rate;
see e.g. Stern (2006), Lopez (2008) and Cellini and
Kee (2010).
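
The CER calculation described in this subsection can be sketched in a few lines. This is an illustration only: the function names and the cost figures are hypothetical, and the sketch assumes that the first year's cost is not discounted (exponent t − 1), which is how Eq. (2) is read here.

```python
from typing import Sequence


def present_value_cost(annual_costs: Sequence[float], rate: float) -> float:
    """Discount a stream of annualised total costs (years 1..T) to present value.

    Assumption: the cost of year 1 is undiscounted, so year t is divided
    by (1 + rate) ** (t - 1).
    """
    if rate <= -1.0:
        raise ValueError("rate must be greater than -1")
    return sum(cost / (1.0 + rate) ** index for index, cost in enumerate(annual_costs))


def cost_effectiveness_ratio(
    annual_costs: Sequence[float], rate: float, units_of_effectiveness: float
) -> float:
    """Present value of cost per unit of effectiveness."""
    if units_of_effectiveness <= 0:
        raise ValueError("units_of_effectiveness must be positive")
    return present_value_cost(annual_costs, rate) / units_of_effectiveness


# Hypothetical example: three years of cost, compared at two discount rates.
costs = [100.0, 100.0, 100.0]
for r in (0.03, 0.07):
    print(r, round(cost_effectiveness_ratio(costs, r, units_of_effectiveness=50.0), 3))
```

Running the loop with different rates makes the sensitivity to the discount rate visible directly: the same cost stream yields a lower ratio at a higher rate.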


3.2 Environmental Impact Assessment (EIA) 


EIA is a process for assessing the environmental
consequences (positive and negative) of the
planned project. Conducting an EIA during the
project design phase will provide information
about the environmental conditions of the area;
early assessment of negative impacts can ensure
that appropriate mitigation measures and
opportunities are identified and implemented; and it can
help to reduce costs in the long term (Jonathan R.
and Emma J., 2010). However, some of the
reasons given for the non-use of EIA tools are that
they are too cumbersome, time consuming, and
generalized; another reason is the lack of evidence
confirming the actual value and success of EIAs;
and, in contrast to other assessment methods, EIA
is generally site-specific (Jonathan R. and Emma
J., 2010, Finnveden et al., 2007).


3.3 Engaging relevant stakeholders 


The impact analysis can only be undertaken when 
stakeholders agree on a minimum range of pre- 
requisites (points). In addition, during the effec- 
tiveness assessment the inclusion of stakeholders, 
intergenerational equity as well as the satisfaction 
of the social needs can be assured. That means 
the involvement of all relevant stakeholders in 
the decision-making process can be corroborated 
(Morrissey and Browne, 2004). 


4 PROPOSED HOLISTIC METHODOLOGY 


The main steps for assessing the impacts of the
road project are depicted in Figure 3. The
methodology has various key elements, and it takes into
account the available natural resources, the
environment, and relevant stakeholders. The first stage in
the proposed methodology is to specify the goals
and criteria based on the available standards,
regulations, and recommended guidelines. One of the
main criteria is that the result of the assessment
should help the decision-maker to clearly
identify the consequences of the chosen road network
design alternative. Moreover, when defining goals
for effective and environmentally sustainable road
network design, the inclusion of all relevant
stakeholders must be ensured.

After defining the objectives, the goals, and
the criteria, the next stage is to specify the road
network design options. In this stage,
the aspects of road network design as well as the
features of transport network expansion capacity
should be outlined. Road network design, in
general, deals with the configuration of the network
to achieve specified objectives (Tom and Mohan,
2003). Further, there are two variations of the
design problem: continuous network design
and discrete network design. Some of the
design problems include determining the width of
the road, setting road user charges, calculating
signal timings, etc. (Tom and Mohan, 2003).




Figure 3. The proposed holistic methodology for assess- 
ing the impact of the road infrastructure. 


Formulating the scope of the impacts of the
road network is the next stage. In this stage, it is
essential to identify the scope of the impacts, such
as the influence of the road networks on
landscape ecology. In a broad sense, with the construction
of road networks, the range of human activities also
expands, which increases the human impact on
the natural ecology. Hence, in this stage, the scope
of the impact analysis should be clearly
defined.

In the next stage, the monetised impact assessment,
or evaluation of the benefits and costs of the
alternatives, should be carried out. Figure 4 depicts
the key parameters that need to be quantified. In
this stage, the analyst assigns monetary values to
all benefits and costs. Since the CBA results can
be very sensitive to the choice of the discount rate,
the analyst should be vigilant. It is also important
to note that some future benefits and costs cannot
be conceived, much less measured (Gerald Shively
and Marta Galopin, 2014).

After assessing and quantifying the monetised impact,
the non-monetised impact assessment
should be performed. In this stage, the main
emphasis is on assessing the costs of achieving
the goals of the network design and the most
efficient way of achieving them. Here, the benefits of
achieving the goals do not have to be quantified
in monetary terms. This stage is most useful when
analysts face constraints which prevent them from
conducting a monetised impact analysis (Tan-Torres
Edejer et al., 2003). For instance, the most common
constraint is the inability of analysts to monetise
benefits.

Figure 4. Monetised impact assessment: Benefit and
cost analysis, modified from Systematics (2012).

The next stage in the proposed methodology
is estimating the net present value (NPV) of each
road network design alternative. Here, by employing
the NPV method, the benefits and costs of a
given road design alternative over an extended
period of time are calculated and then
discounted at a selected discount rate to give the
present value (Sinha and Labi, 2011). Benefits are
treated as positive and costs as negative, and their
summation gives the NPV (Sinha and Labi, 2011).
In general, any road design with a positive NPV is
treated as acceptable. When comparing several
design alternatives, the alternative with the highest NPV
should be preferred.
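
The NPV stage can be sketched as follows. This is a minimal illustration, not the authors' implementation: the alternative names and cash flows are hypothetical, and the sketch assumes that the first year of the analysis is undiscounted.

```python
from typing import Dict, Optional, Sequence


def net_present_value(
    benefits: Sequence[float], costs: Sequence[float], rate: float
) -> float:
    """Sum of discounted (benefit - cost) per year; benefits count positive, costs negative.

    Assumption: the first year is undiscounted (exponent 0).
    """
    if len(benefits) != len(costs):
        raise ValueError("benefits and costs must cover the same years")
    return sum(
        (b - c) / (1.0 + rate) ** year
        for year, (b, c) in enumerate(zip(benefits, costs))
    )


def preferred_alternative(npvs: Dict[str, float]) -> Optional[str]:
    """Return the alternative with the highest positive NPV, or None if none is acceptable."""
    acceptable = {name: value for name, value in npvs.items() if value > 0}
    if not acceptable:
        return None
    return max(acceptable, key=acceptable.get)


# Hypothetical design alternatives: (benefits per year, costs per year).
alternatives = {
    "design_A": ([0.0, 60.0, 60.0], [100.0, 5.0, 5.0]),
    "design_B": ([0.0, 80.0, 80.0], [140.0, 5.0, 5.0]),
}
npvs = {name: net_present_value(b, c, rate=0.03) for name, (b, c) in alternatives.items()}
print(npvs, preferred_alternative(npvs))
```

Filtering out non-positive NPVs before taking the maximum mirrors the two rules in the text: acceptability first, then ranking.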

After estimating the NPV for each alternative,
the next stage is to perform the sensitivity analysis.
The aim of the sensitivity analysis
is to identify the vital cost variables and their
potential impact in terms of changes in the
annualised total cost and the present value cost. In general,
there are two common types of sensitivity analysis:
partial and extreme-case sensitivity analysis. The partial
approach varies one assumption (or one parameter
or number) at a time, holding all else constant
(Cellini and Kee, 2010). Extreme-case sensitivity
analysis, on the other hand, varies all of the uncertain
parameters simultaneously, picking the value
of each parameter that yields either the best- or the
worst-case scenario (Cellini and Kee, 2010, Ayele
et al., 2016a).
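
The two approaches can be contrasted with a short sketch. It is illustrative only: the cost model, parameter names, and ranges are hypothetical, and the extreme-case search simply evaluates every combination of low and high values, which assumes the best and worst cases lie at the ends of the ranges.

```python
from itertools import product
from typing import Callable, Dict, Tuple

Params = Dict[str, float]
Ranges = Dict[str, Tuple[float, float]]  # parameter name -> (low, high)


def partial_sensitivity(
    model: Callable[[Params], float], base: Params, ranges: Ranges
) -> Dict[str, Tuple[float, float]]:
    """Vary one parameter at a time over (low, high), holding all others at base values."""
    results = {}
    for name, (low, high) in ranges.items():
        results[name] = (
            model({**base, name: low}),
            model({**base, name: high}),
        )
    return results


def extreme_case_sensitivity(
    model: Callable[[Params], float], base: Params, ranges: Ranges
) -> Tuple[float, float]:
    """Vary all uncertain parameters together; return the (lowest, highest) outcome."""
    names = list(ranges)
    outcomes = [
        model({**base, **dict(zip(names, combo))})
        for combo in product(*(ranges[n] for n in names))
    ]
    return min(outcomes), max(outcomes)


def pv_cost(p: Params) -> float:
    """Hypothetical model: present value of a constant annual cost over 20 years."""
    return sum(p["annual_cost"] / (1.0 + p["rate"]) ** t for t in range(20))


base = {"annual_cost": 100.0, "rate": 0.04}
ranges = {"annual_cost": (80.0, 120.0), "rate": (0.02, 0.07)}
print(partial_sensitivity(pv_cost, base, ranges))
print(extreme_case_sensitivity(pv_cost, base, ranges))
```

The partial results show which single parameter moves the outcome most, while the extreme-case pair brackets the outcome when every assumption is unfavourable or favourable at once.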

Thereafter, the alternative design options
should be compared. The
comparison should consider the strengths and
weaknesses of each alternative, check the extent
and the validity of the results obtained from the
assessment, and investigate whether a given design
alternative matches the intended plan and outcomes.
Further, at this stage, specific effectiveness
indicators should be clearly identified; the advantages
and disadvantages of the proposed road design
alternatives should be assessed; and compliance
with HSE standards should be ensured by
verifying the consideration and integration of the
social aspects.

Thereafter, the recommendation should be
made, and afterwards the chosen road design
alternative should be implemented. The recommendation
should comply with road design standards
and guidelines. The approved recommendation
should then be monitored by establishing a
monitoring programme. The monitoring
programme has to include the follow-up of the
implementation of the chosen design alternative,
emphasising what works, what does not
work, and what continues to work. Further, there
should be a feedback loop in which the recommendations
help to review the impact assessment
and the choice of alternatives.


5 CONCLUDING REMARKS 


To address the challenges related to inadequate
or poorly designed and performing road infrastructure,
proper impact assessment of road network
design alternatives is vital. This consequently helps
to ensure cost-effective and environmentally
sustainable road infrastructure design. Furthermore,
proper impact assessment can have significant
economic, environmental, and social benefits by
reducing the HSE (health, safety, and environmental)
impacts and identifying cost-effective and
environmentally sustainable road designs.

This paper proposed a holistic methodology for
assessing the impact of road networks. The
proposed methodology is based on the consideration
of possible road design alternatives. The
first part of the paper presents
the problem description and discusses the concepts
of cost-benefit analysis (CBA) and cost-effectiveness
analysis (CEA). The second part of
the paper describes the estimation of the net present
value (NPV) by categorising the impacts of road
networks into two groups: monetised and non-monetised.
By employing the proposed methodology, the pros
and cons of new road systems can be identified in
a structured way, thereby highlighting factors of
success and failure.

ABSTRACT: Development of a Generic Application or Product (GAP) is always challenging, because
it should be designed to be as generic and as simply configurable as possible for the final specific
applications. If such a development is also safety critical, then the complexity increases dramatically, as
the resulting system will have impacts on human life, property and the environment. Safety management
plays a major role in keeping this tough development process under control by avoiding systematic faults.
Setting up a correct organizational structure and applying Verification & Validation (V&V) concepts, which are
two fundamental elements of safety management, in an accurate way are therefore crucial. This paper
discusses railway safety management in terms of organizational structure and V&V with regard to the
current normative status and its drawbacks. Proposals are provided for an updated organization and
more harmonized V&V concepts, including relations with safety management and quality assurance, by
sharing practical experiences.


1 INTRODUCTION 


As Murthy et al. (2008) explain, regulatory requirements,
customer requirements, and technical
requirements shall be fulfilled when producing a
safety instrumented system. Independent safety
assessment against the international and/or European
standards shall be realized for the acceptance
and approval of the system. If necessary, national
standards, regulations and directives are also to be
followed. These are shown in a hierarchical manner
in Figure 1.

Figure 1. Hierarchy of laws [EU Recommendations
commission (2014)].

For railway applications, both on-board and
trackside, the European Norms EN 50126, EN
50129, EN 50128, and EN 50159, dealing with
RAMS, electronic systems, SW and transmission,
respectively, are referenced in European countries,
and they are also well accepted
and applied in many non-European countries.
These norms are derived from the core norm
IEC 61508, like many other European and international
norms, as depicted in Figure 2.

[Figure 2 shows IEC 61508 at the top, with derived
standards grouped by sector: Railway (including EN 50126,
EN 50129, EN 50128), Automotive (ISO 26262), and
Nuclear Power (including IEC 61513, IEC 60880).]

Figure 2. Derivation of standards from IEC 61508.

Although these norms are very detailed and
strict, as they relate to complex safety critical
electronic systems and subsystems, Hokstad and
Corneliussen (2004) draw attention to a couple
of ambiguities about safety unavailability from the
quantitative evaluation perspective that diminish
the usefulness of the standards. This paper focuses
on two main issues from the qualitative perspective
regarding SIL (Safety Integrity Level) 4 development.
It is organized as follows. Section I gives
brief information about the standards. In Section
II, the organization and responsibilities in the
norms are distilled, and experiences are highlighted
by proposing a new role and responsibilities that take
into account the critical duties in a safety critical
project. In Section III, the Verification & Validation
(V&V) concepts are discussed and clarified; they are
often confused, not only because they are understood
differently in other domains such as defense or
avionics, but also because they are described
differently within the CENELEC norms themselves.
Moreover, the relations between verification,
validation, safety management, quality assurance and
assessment are explained. Sections II and III draw on
project experience gained during the SIL 4 GAP
development of the mainline on-board signalling part of
the European Rail Traffic Management System / European
Train Control System (ERTMS ETCS) and of the metro line
on-board signalling part of a Communication Based Train
Control System, together with a wayside interlocking
system, assessed independently by a European Notified
Body. Finally, the paper conveys
some concluding remarks.


2 ORGANIZATION AND 
RESPONSIBILITIES 


Firstly, the organizational independence for SIL
4 is illustrated in Figure 3 as in EN 50129, the
railway standard for “Safety related electronic
systems for signalling”, which refers to system and
HW development. In the illustration, same-person
and same-organization arrangements are depicted
as solid and dashed lines, respectively. In the figure,
PM stands for Project Manager, DI for Designer/
Implementer, VER for Verifier, and VAL for Validator.
Apart from this illustration and the mention of
the independence of roles in Table E.3 of EN
50129, the norm gives no further explanation
of the independence of roles.

Secondly, the norm EN 50128, “Software
for railway control and protection systems”,
gives much more detail about
organizational independence (see Figure 4). It
defines not only new roles such as Requirements
Manager (RQM), Implementer (IMP), Integrator
(INT) and Tester (TST), but also a new type of
vertical line for the reporting relationship, such
that RQM, DES (Designer) and IMP shall report to
the PM; INT, TST and VER can report to the PM;
whereas VAL shall not report to the PM.
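
The reporting rules summarised in the paragraph above can be expressed as a small consistency check. This is a hypothetical sketch that encodes only the three rules as stated in this text; it is not a rendering of the full independence requirements of EN 50128, and the function and variable names are invented for illustration.

```python
from typing import Dict, List

# Rules as summarised in the text above (assumption: nothing beyond these three groups).
MUST_REPORT_TO_PM = {"RQM", "DES", "IMP"}
MAY_REPORT_TO_PM = {"INT", "TST", "VER"}
MUST_NOT_REPORT_TO_PM = {"VAL"}


def check_reporting_lines(reports_to_pm: Dict[str, bool]) -> List[str]:
    """Return a list of violations for a mapping role -> 'reports to the PM?'."""
    violations = []
    for role, reports in reports_to_pm.items():
        if role in MUST_REPORT_TO_PM:
            if not reports:
                violations.append(f"{role} shall report to the PM")
        elif role in MUST_NOT_REPORT_TO_PM:
            if reports:
                violations.append(f"{role} shall not report to the PM")
        elif role not in MAY_REPORT_TO_PM:
            violations.append(f"{role} is not a role covered by these rules")
    return violations


# Hypothetical project organization.
organization = {"RQM": True, "DES": True, "IMP": True, "VER": True, "TST": False, "VAL": True}
for problem in check_reporting_lines(organization):
    print(problem)
```

A check of this kind could be run whenever the project organization chart changes, so that a reporting line that breaks the validator's independence is noticed early.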

In contrast to the norm for electronic systems,
this norm elaborates the roles in thirteen
clauses. In addition, it offers two options for the
roles of validator and verifier. It allows the VAL
and the VER to be the same person, guaranteeing
independence from the PM, in which case there
shall be a further person who reviews the documents
of the VER and who does not report to the PM. A
similar approach is taken for the role of VER, in
that the VER and the INT/TST can be the same
person provided that the VAL also performs verification
tasks, such that there are double checks.

Figure 4. Preferred organizational structure [EN 50128
(2011)].

On closer consideration, similar tasks are also carried
out in the system and HW development phases. For
instance, there will also be HW tests, a test
report and a further verification report. However,
although the integrity level is the same, the
organizational independence requirements differ, which
means that in system and HW development the VER can
perform test activities and write both the test and the
verification report. We therefore propose
that the organizational
structures in these norms be harmonized and that
the one stated in EN 50128 be used for this
purpose, as it declares more roles
and therefore more checks, which are indispensable in
a safety critical system. Another problem for the
independence requirements arises from the need
for several verifiers in a project, as the verification
activities require specialization in the technique concerned.
In this case, a lead verifier can be assigned as the
interface and will be responsible for fulfilling all
the verification activities. If only a limited
number of specialized personnel is available,
personnel availability problems arise.
To overcome this issue, it can be proposed that
the VER of one work-package be allocated as the DI of
another work-package.

Besides, intensive safety analyses (Table 1) shall
be performed in safety critical systems to derive
safety requirements at both system and subsystem
level. These analyses require broad technical
knowledge and long experience, in short
a special qualification that goes far beyond
the specialization of a designer. Additionally,
as Oedewald and Gotcheva (2015) point out,
many safety activities are carried out by subcontractor
networks, which means that the subcontractors
and their deliverables shall be checked
for correctness, completeness and consistency by
an expert of the main system owner. Hence, we
propose adding a new role, namely the “safety responsible
(SR)”, to this structure with the tasks
provided below:


— Developing the safety plan, 

— Performing safety analyses, 

— Creating and maintaining the hazard log, 

— Ensuring that the entries in the hazard log are 
successfully closed or further allocated in an 
appropriate manner, 

— Being responsible for ensuring that safety 
requirements are met successfully, 

— Deciding whether a deviation or change is safety 
relevant or not and ensuring that it is success- 
fully closed, 

— Obtaining safety evidence of the third party 
items including their Safety Related Application 
Conditions, 

— Deriving the Safety Related Application Conditions
for the developed item(s) in the project,

— Composing the safety case.


The addition of the new role “SR” also arises from
another necessity: in terms of the


Table 1. Failure and hazard analysis [EN 50129 (2003)].

Techniques/Measures              SIL 4
Preliminary hazard analysis      HR
Fault tree analysis              HR
Markov diagrams                  HR
FMECA                            HR
HAZOP                            HR
Cause-consequence diagrams       HR
Event tree                       R
Reliability block diagram        R
Zonal analysis                   R
Interface hazard analysis        HR
Common cause failure analysis    HR
Historical event analysis


independence and appropriateness of the tasks of
VAL, a considerable number of tasks that do not
coincide with the primary validation tasks are usually
fulfilled by the VAL. For instance, the norms contain
no statement about who is responsible for the safety
plan. It has been observed in the industry that the
safety plan is usually written by the VAL. However,
the safety plan itself needs both verification and
validation. The same holds for the safety analyses,
the hazard log and the safety case. To make this point
clearer, we analyze the V&V concept later in this
paper.

Besides the safety responsible, who acts on
safety-related technical issues, another work
package in a safety critical project covers the safety
management activities, such as planning and
coordination of the V&V activities, internal and
external audits and assessments, reviews, and
management of subcontractors concerning RAMS
activities. Both the SR and VAL can perform safety
management activities, as these activities do not
conflict with the validation activities. The
proposed organizational structure for all three
development processes (system, HW, SW) is
depicted in Figure 5. The SR is preferably independent
from the PM, considering his/her activities explained
above, such as deciding on the safety relevance of
a change request, to avoid any stress that
can be caused by the PM due to time and/or cost
pressure. Another issue is that EN 50128 does not
allow REQ, DES or IMP to be INT; however,
considering the role of INT in the norm, this is not
contrary to the independence requirements, and thus we
propose to update the organization as in Figure 5.

Furthermore, an important role during safety
critical system development is allocated to the
customer/operator. However, in GAPs, there is
neither customer nor operator, at least during the
design phase, which increases the difficulty, as the
customer in fact has significant tasks in safety
relevant projects, as shown in Table 2. The
customer/operator is incorporated only during the
specific application projects in the field. The
customer is needed especially when identifying hazards
and assessing risks, as the standard IEC 61882
clarifies that both the contractor and the client
should constitute the HAZOP team. To meet these
needs, a person inside the company having
experience in the domain and technical knowledge about
Figure 5. Proposed organizational structure.

Table 2. Responsibilities within the RAMS Process [EN 50126 (1999)], X full, (X) partial responsibility.

Customer/Operator   Approval Authority   (Main) Contractor   Sub-Contractor   Suppliers
Concept Phase X
System Definition & Application X
Risk Analysis X X
System Requirements X (X)
Apportionment of System Requirements (X) X
Design and Implementation X (X)
Manufacture X X X
Installation X (X)
System Validation X X X (X)
System acceptance X X
Operation and Maintenance X (X) (X)
Performance Monitoring X (X) (X)
Modification and Retrofit X X X
De-commissioning and Disposal X (X)

the specifications can be assigned as a customer to
the project. However, it should be noted that the
customer and the PM have very different
responsibilities, which shall be performed in an
independent manner. Customer and PM should therefore
not be the same person, even though the organizational
structure given in the aforementioned norms sets
no such limitation.


3 V&V CONCEPTS 


Although the definitions of verification and
validation should be consistent across the CENELEC
norms, they are defined somewhat differently in these
norms, which causes ambiguity when creating plans
and performing activities.


— Verification in EN 50126: confirmation by exam- 
ination and provision of objective evidence that 
the specified requirements have been fulfilled. 

— Verification in EN 50129: the activity of deter- 
mination, by analysis and test, at each phase of 
the life-cycle, that the requirements of the phase 
under consideration meet the output of the pre- 
vious phase and that the output of the phase 
under consideration fulfils its requirements. 

— Verification in EN 50128: the process of exami- 
nation followed by a judgment based on evidence 
that output items (process, documentation, 
software or application) of a specific develop- 
ment phase fulfil the requirements of that phase 
with respect to completeness, correctness and 
consistency. 


As mentioned in the previous section, TST is
not defined in EN 50129, so the verification activity
there involves testing, whereas in EN 50128
verification is based on review. EN 50129 treats
verification as a process rather than a product. For
each phase in EN 50126, there are verification steps
that consider the requirements of the
phase. EN 50128 makes these more comprehensible
and inclusive. Therefore, to avoid misapplication
during the project and to harmonize
the verification definition, we propose to use
the definition in EN 50128 throughout the project.

Validation requires
the same level of attention. According to the
definitions given below, testing is again emphasized
in EN 50129, while EN 50128 uses tests for checking
the results. For harmonization, we again propose
that the definition in EN 50128 should be used.


— Validation in EN 50126: Confirmation by 
examination and provision of objective evidence 
that the particular requirements for a specific 
intended use have been fulfilled. 

— Validation in EN 50129: The activity applied in 
order to demonstrate, by test and analysis, that 
the product meets in all respects its specified 
requirements. 

— Validation in EN 50128: Process of analysis 
followed by a judgment based on evidence to 
determine whether an item (e.g. process, docu- 
mentation, software or application) fits the user 
needs, in particular with respect to safety and 
quality and with emphasis on the suitability of 
its operation in accordance with its purpose in 
its intended environment. 


Moreover, validation is depicted (see Figure 6)
in all the CENELEC norms as being
performed against system requirements. However,
validation is to be performed not only at the end
against system requirements; as Lundteigen
et al. (2009) state, verification and validation are
important activities in all phases of the project
development process. For instance, at the planning
phase, the tools shall be validated or, if they
are already pre-validated, this shall be checked
against the intended use; likewise, the risk analyses
are to be validated. Another task can be witnessing the
independence during design and test. Therefore,
we propose Figure 7 to represent the V&V
activities.

Figure 6. V&V [EN 50126 (2009)] (top), SW Validation
against SW Requirements [EN 50128 (2011)] (bottom).

Figure 7. Proposed V&V representation.

Figure 8. Relations between Verification, Validation,
Safety Management, Quality Assurance and Assessment
Reports.

Several management reports are to be created in
a safety critical system. The main reports included
in the safety case are the verification, validation,
safety management, quality assurance and assessment
reports. An important issue to be tackled is the
order in which these reports are created, since they
depend on each other. The following
procedure, also approved by an independent
assessor, can be applied. After the deliverables are
produced, the products of the phase and the process
itself shall be verified against EN 50126. Then,
the validator can create his/her report considering
the deliverables and the verification report. Sometimes,
the validator may prefer to combine several
phases, as shown in Figure 8. Following this, the
safety management report can be developed using a
checklist so that every item is closed or, if necessary,
transferred to the next phase in an appropriate
manner. After this, the quality assurance team can
verify the phase related quality requirements,
recording the results in their reports. If the
order illustrated in Figure 8 is applied during the
project, the safety management report to be
incorporated in the safety case only needs to refer to
the individual phase relevant safety management
reports by summarizing their results.
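The report order described above can be illustrated with a minimal Python sketch. It is not part of the proposed method itself: the class name, function name and report identifiers are hypothetical, and the sketch assumes one report of each kind per phase, ignoring the case where the validator combines several phases into one report.

```python
from dataclasses import dataclass, field

# Order taken from the procedure described in the text; identifiers are illustrative.
REPORT_ORDER = ("verification", "validation", "safety_management", "quality_assurance")


@dataclass
class PhaseReports:
    """Tracks which reports already exist for one life-cycle phase."""
    phase: str
    done: list = field(default_factory=list)

    def create(self, report):
        # A report may only be created once all of its predecessors exist.
        if len(self.done) == len(REPORT_ORDER):
            raise ValueError(f"{self.phase}: all reports already created")
        expected = REPORT_ORDER[len(self.done)]
        if report != expected:
            raise ValueError(f"{self.phase}: expected {expected!r} before {report!r}")
        self.done.append(report)

    @property
    def complete(self):
        return tuple(self.done) == REPORT_ORDER


def final_safety_management_summary(phases):
    """The final report only references the phase-level safety management reports."""
    return [f"{p.phase}: see phase safety management report" for p in phases if p.complete]


design = PhaseReports("design")
for name in REPORT_ORDER:
    design.create(name)
summary = final_safety_management_summary([design])
```

Creating a report out of order (for example a validation report before the verification report) raises an error, which mirrors the dependency between the reports.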


4 CONCLUSION 


Safety critical systems, especially those being
developed as SIL 4 GAPs, require much effort and
organizational independence, with intensive verification
and validation activities. The safety standards are
very detailed and evolve continuously. Although
the independence and V&V concepts are crucial
and have direct effects on both the technical and
financial success of the project, it has been shown
that these points are neither clearly comprehensible
in the reference standards nor discussed in
past studies, but are only faced once a project is
under development. It is therefore very important to
define the independence requirements for the roles and
the V&V activities correctly, completely and
consistently at the outset of the project. Defining a new
role, the safety responsible, is very beneficial, as this
role fulfills activities requiring special qualification
and essential independence. Verification covers
both process and product. For different
work-packages, different verifiers can be assigned in an
effective way. Validation is not just performed at
the end against system requirements, but mostly at
each phase during all system, HW and SW life cycles;
hence VAL should play an active role during the
whole life cycle, with agreements or disagreements
that widely affect the project progress. An appropriate
relation between verification, validation, quality
and safety management can be constructed in a
consecutive way, as proposed in this study, such that
the safety management report is produced step by
step, resulting in a ready safety management report
at the end of the project that can simply be
integrated into the safety case to be assessed by the
independent assessor.

ABSTRACT: Big Data Risk Analysis (BDRA) intends to combine the huge volume of information that
railway systems produce from a variety of data sources for safety and risk management. One of the most
challenging issues is how safety scientists can use big data techniques. This is especially important in the
light of data coming from different systems that hold information about critical events, hazards or
controls. Yet, the integration of complex safety-related data is not just another IT problem. Fundamentally,
it requires the expertise of safety experts in order to make sense of the data. A solution lies in the use of
graph databases and ontologies to access this data for safety purposes. This type of database allows for the
handling of huge amounts of data whilst remaining accessible to safety experts who are not gifted
programmers. This approach opens up big data for safety management and enables a plethora of possibilities for
future safety research.


1 INTRODUCTION 


A BDRA safety system is defined as an enterprise
safety management system that performs the
following:


• Extracts information from mixed data sources.

• Processes it quickly to infer and present relevant
safety management information.

• Combines applications to collectively provide
sensible interpretation.

• Uses online interfaces to connect the right people
at the right time.


In order to:


• Provide decision support for safety and risk
management.


This definition guides the development of
BDRA systems that are of use to companies that
work on the GB railways. BDRA aims to use big
data analytics techniques for safety (Van Gulijk
et al. 2018; Van Gulijk et al. 2017).

One of the key challenges of BDRA is to store
and process that massive amount of data and
manage the heterogeneous knowledge from different
information systems to obtain safety insight.
A solution lies in the use of graph databases that
are controlled by ontologies to represent a common
framework of understanding and integrate
data (Figueres-Esteban et al. 2016; Van Gulijk
et al. 2016). The method explained in this paper
opens up big data for safety scientists. The method
is straightforward but powerful and does not rely
on gifted programmers. In theory, the database is
infinitely scalable, so it is hard to predict the
limitations of the approach.


2 DATABASES AND BIG DATA 


In recent decades, relational databases (aka SQL
databases) have dominated the database market
to the point of becoming the standard. They are
structured in tables and have been very
efficient when it comes to rapid access
to data. Nevertheless, in big data environments,
where huge amounts of information have to be
stored and integrated from new, unknown sources,
relational databases become unwieldy (Sadalage &
Fowler 2013).

A solution to bypass this problem is to omit the 
relational table by simply storing data in a system 
that, for lack of a better example, finds its anal- 
ogy in an infinitely scalable library card catalogue 
(Van Gulijk et al. 2018). Databases that work in 
that way are called NoSQL databases. 


2.1 Graph databases 


In a relatively novel development, these NoSQL
databases have been enriched with a sensible visual
interface based on graphs. They are simply called
graph databases. A graph database is a database
management system that stores data in the form
of a property graph (Robinson et al. 2013). Safety
scientists will recognise a property graph as a
collection of nodes and links, as we often see them in
our work.
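
As an illustration of the property graph idea, the following minimal Python sketch stores nodes and links that each carry their own properties. It is an in-memory toy, not the API of any graph database product; the class names, the relationship type CAN_LEAD_TO and the example nodes are assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Link:
    start: int          # node_id of the source node
    end: int            # node_id of the target node
    rel_type: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """A tiny in-memory property graph: labelled nodes, typed links, free-form properties."""

    def __init__(self):
        self.nodes = {}
        self.links = []

    def add_node(self, labels, **properties):
        node = Node(len(self.nodes), set(labels), properties)
        self.nodes[node.node_id] = node
        return node

    def add_link(self, start, end, rel_type, **properties):
        link = Link(start.node_id, end.node_id, rel_type, properties)
        self.links.append(link)
        return link

    def neighbours(self, node, rel_type=None):
        # Follow outgoing links, optionally restricted to one relationship type.
        return [self.nodes[link.end] for link in self.links
                if link.start == node.node_id
                and (rel_type is None or link.rel_type == rel_type)]


graph = PropertyGraph()
hazard = graph.add_node(["Hazard"], name="signal obscured")
event = graph.add_node(["Event"], name="SPAD")
graph.add_link(hazard, event, "CAN_LEAD_TO")  # illustrative relationship type
```

Because nodes need no shared schema, a hazard node and an event node can carry entirely different properties and still be linked.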

The organization of the data in graphs
is extremely useful in terms of understanding
(Figueres-Esteban, Van Gulijk, et al. 2015;
Figueres-Esteban, Hughes, et al. 2015). Graph
databases allow different types of data
models to be represented in a common space in order
to integrate diverse types of data (El Rashidy et al.
2017). This is a key aspect for implementing ontologies
that represent the knowledge of technical
domains such as railways, risk and safety.


3 KNOWLEDGE REPRESENTATION 
IN RAILWAYS 


Railways are complex systems that represent a
rich tapestry of different types of organisational
knowledge, created for different purposes and people
with different expertise, skills and competences
in many different contexts. Bringing together all
the data that railways produce means making
sense of heterogeneous knowledge from different
information systems.

The most common technique used by computer
scientists to represent a common framework
of understanding and manage the knowledge is
an ontology. A formal, and broadly accepted,
definition of an ontology is provided by Gruber
(1995): “An ontology is an explicit specification of a
conceptualization.”

There are different types of ontologies, such as
domain and application ontologies, depending on
the specificity of their knowledge (Guarino 1997).

In the railway domain, the FP6 European 
Integrail project (http://www.integrail.eu/) and 
the RailML community (http://www.railml.org) 
proved the utility of ontologies in the communica- 
tion and integration of data through railway infor- 
mation systems (Van Gulijk & Figueres-Esteban 
2016). 


4 ONTOLOGIES AND GRAPH 
DATABASES FOR BDRA 


Different ontology languages and frameworks have 
been developed to support the implementation of 
an ontology (Corcho et al. 2003). The challenge 
is that a single ontology should be understood by 
people and machines. 

One of the most used frameworks is shown
on the left side of Figure 1. Different data
structures represented in formats such as XML, JSON
and CSV can be integrated through ontologies
implemented in RDF/OWL languages. These
languages support the application of Artificial
Intelligence (AI) in order to reason with the represented
knowledge. The approach that this work is taking

Figure 1. Transformation of the stack of ontology
languages for BDRA. Panel labels: Ontology
implementation; Data schema; Graph Database +
Ontology; XML, JSON, CSV.



bypasses complicated ontology languages and
replaces them with a relatively straightforward visual
interface in a graph database. This is where we
remove the need for gifted programmers.

This paper demonstrates how to use the framework
shown on the right side of Figure 1.


5 METHODOLOGY 


The paper describes the implementation of a rail- 
way domain ontology by safety experts in order 
to connect three different data sources to an event 
related to safety management. 

The current BDRA project focuses on understanding
SPAD risks (passing red signals), but for
the benefit of explaining the method we focus on
part of that risk: the “signal obscured” hazard.
This means that safety records related to obscured
signals and instances of a signal database have to
be found and linked to enrich the analysis of this
type of event. The methodology has three basic
steps:


a. Selection of data sources and storing data in a 
graph database. 

b. Building the signal domain ontology. 

c. Implementing the signal ontology for the inte- 
gration of data. 


5.1 Data sources 


This trial uses four *.csv files extracted from three
information systems: three files of text records
from the SMIS and IFCS systems of the Rail
Safety and Standards Board (RSSB), containing
around 100,000 incidents, and a table of signals
from the Ellipse Asset Management tool of Network
Rail (NR), containing 40,000 descriptions of
signals.

SMIS is a database for recording safety-related 
events that occur on the rail network in Brit- 
ain (RSSB 2017). Railway stakeholders such as 
NR or train/freight operators enter about 75,000 
events per year such as derailments and SPADs. In 
this exercise, we are just using records related to 
obscured signals. IFCS is a database that focuses 
on human performance and underlying causes of 

Figure 2. Description of the data sources used to
support data integration. Panel labels: IFCS Record;
IFCS Factor Record; Event number; Incident reference;
SMIS record reference; SMIS type of event.



rail incidents. These underlying causes are
classified using 10 Incident Factors that are broken
down into different levels of sub-categories (Gibson
et al. 2015). The table of signals is a sample of
descriptions of signals that is part of the Ellipse Asset
Management tool of NR. Figure 2 shows the
properties that were used to integrate the data.


5.2 Signal domain ontology 


The purpose of the signal domain ontology is to
align the data structures of the information systems
with a reference framework accepted by the railways.
For this exercise, the reference framework for the
signal domain is the railway signal standard
in the UK (RSSB 2015).

The sources listed below have been used to
build the ontology:


— The Signals, handsignals, indicators and signs. 
Handbook RS/521 Issue 3 (December 2015). 

— The data model of the SMIS+ program. 

— The schema of the table of signals. 


The standard RS/521 provides a classification and
description of all railway signals in the UK. The data
model of SMIS+ includes a taxonomy of railway
signals that is aligned with other reporting systems.
The data structure of the table of signals does not
provide a signal taxonomy, but one can be extracted
from the field “Item Name”. An example value
of this field is “EZ220 — SIG HEAD — COLOUR
LIGHT - LED”. Note that programmers do not have
the expertise for this exercise, even if they are gifted.
The interpretation and consideration of safety
aspects lies within the remit of safety experts.
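
The extraction of tokens from the “Item Name” field can be sketched in Python as follows. This is a minimal sketch under assumptions: it treats any dash surrounded by whitespace as the separator, which fits the single example value given above but may not hold for the whole table, and the function names and the second item name in the usage lines are hypothetical.

```python
import re

# Assumed separator: a hyphen, en dash or em dash with whitespace on both sides.
_SEPARATOR = re.compile(r"\s+[-\u2013\u2014]\s+")


def tokenize_item_name(item_name):
    """Split an 'Item Name' value into its ordered tokens."""
    return [tok.strip() for tok in _SEPARATOR.split(item_name.strip()) if tok.strip()]


def tokens_by_position(item_names):
    """Collect the distinct tokens seen at each position (1 = first token)."""
    columns = {}
    for name in item_names:
        for position, token in enumerate(tokenize_item_name(name), start=1):
            columns.setdefault(position, set()).add(token)
    return columns


example = "EZ220 — SIG HEAD — COLOUR LIGHT - LED"   # value quoted in the text
tokens = tokenize_item_name(example)

# The second item name is a hypothetical value, used only to show the grouping.
columns = tokens_by_position([example, "EZ101 — SIG HEAD — COLOUR LIGHT - 1 ASPECT"])
```

The per-position token sets are only raw material: deciding which tokens form meaningful signal classes remains a task for safety experts.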


5.3 Implementation of the ontology and
data integration


The data from the information systems were stored
in a Neo4j graph database. Each row of the data
files represents a node in the graph database, and
each node has as many properties as the data file
has columns. In this first step, the database has no
structure and just stores data under a label (data
nodes).
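
The row-to-node step can be sketched in plain Python as below. This is an illustration only and does not use the Neo4j driver: nodes are plain dictionaries, the label "Signal", the column "Region" and the sample values are assumptions, and only the column name “Item Name” comes from the text.

```python
import csv
import io


def rows_to_nodes(csv_text, label):
    """Turn each CSV row into one node whose properties are the row's columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{"label": label, "properties": dict(row)} for row in reader]


# Hypothetical two-column sample; real files would be read from disk instead.
sample_csv = (
    "Item Name,Region\n"
    "EZ220 — SIG HEAD — COLOUR LIGHT - LED,North\n"
    "EZ101 — SIG HEAD — COLOUR LIGHT - 1 ASPECT,South\n"
)

signal_nodes = rows_to_nodes(sample_csv, "Signal")
```

In Neo4j itself, a comparable import could be written with the Cypher `LOAD CSV` clause; the text does not state which import mechanism was actually used.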

In the same database, the signal domain ontology
was implemented in a graph data model (ontology
nodes). The links between the nodes were created
using the properties of the ontology nodes
and analysing the property of the data nodes that
stores the text of the record.

The “signal obscured” event (event node) was
connected to the data nodes of signals, and these
were in turn connected to the data nodes of SMIS/IFCS
records.
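
How links between ontology nodes and data nodes might be created can be sketched as follows. The matching rule is an assumption: the text does not specify how the record text was analysed, so this sketch uses simple case-insensitive keyword matching. The relationship type MATCHES, the node identifiers, the search terms and the record texts are all hypothetical.

```python
def link_by_text(ontology_nodes, data_nodes, text_key):
    """Link a data node to an ontology node when one of its terms occurs in the text."""
    links = []
    for onto in ontology_nodes:
        terms = [term.lower() for term in onto["terms"]]
        for data in data_nodes:
            text = data["properties"].get(text_key, "").lower()
            if any(term in text for term in terms):
                links.append((data["id"], "MATCHES", onto["name"]))
    return links


# Ontology nodes with illustrative search terms.
ontology_nodes = [
    {"name": "Colour light signal", "terms": ["colour light"]},
    {"name": "SPAD indicator", "terms": ["spad indicator"]},
]

# Data nodes: one signal row and two invented incident records.
signal_nodes = [
    {"id": "signal-1", "properties": {"Item Name": "EZ220 — SIG HEAD — COLOUR LIGHT - LED"}},
]
record_nodes = [
    {"id": "record-1", "properties": {"text": "Colour light signal obscured by vegetation"}},
    {"id": "record-2", "properties": {"text": "SPAD indicator lens found dirty"}},
]

links = (link_by_text(ontology_nodes, signal_nodes, "Item Name")
         + link_by_text(ontology_nodes, record_nodes, "text"))
```

Signals and records that match the same ontology node can then be connected to each other, which is the role the ontology plays in the integration.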


6 RESULTS 


Table 1 shows an excerpt of the ontology extracted
from the table of signals. This ontology was
mapped to the explicit ontologies of the standard
RS/521 and the SMIS data model. Table 2 shows
an excerpt of the mapping table. Figure 3 shows a
piece of the final signal ontology.

Figure 4 shows an excerpt of the graph database
that contains part of the implementation of the
signal ontology and instances of signal and SMIS/IFCS
records. The ontology is connected to the


Table 1. Excerpt of the signal taxonomy from the table
of signals.

Item name
First token   Second token   Third token    Fourth token
EZ220         SIG HEAD       COLOUR LIGHT   LED
EZ101                                       1 ASPECT
EZ102                                       2 ASPECT
EZ103                                       3 ASPECT
EZ104                                       4 ASPECT

Table 2. Excerpt of the mapping between the signal
taxonomies of the RS/521 standard, SMIS and the table
of signals.

RS/521                  SMIS               Table of signals
SPAD indicator          SPAD indicator     ES100
Limit of shunt signal   Limit of shunt     EZ160
Point indicator         Points indicator   BR100, EZ170

Figure 3. Excerpt of the UML diagram that represents
the signal ontology.

Figure 4. Excerpt of the graph database that integrates
different types of instances of data with the event “Signal
obscured” by means of the signal domain ontology.


signal nodes, which are connected to the SMIS/IFCS
records and the “signal obscured” event.


7 DISCUSSION 


This paper shows that NoSQL graph databases
guided by ontologies enable safety scientists to work
with big data techniques without the intervention
of IT experts. Some programming is required, but
most of it is not much more complicated than Excel
macros or MATLAB. The expertise of safety experts,
however, is fundamentally required to build safety
and railway domain ontologies to support the
integration of data for further safety analysis.

This work demonstrates that graph databases 
can store complex data structures as single nodes, 
which helps safety scientists navigate through their 
data. Figure 4 displays different nodes that rep- 
resent data from signals and SMIS/IFCS records 
regardless of the internal structure of the data 
source. 

Domain ontologies can be straightforwardly
implemented in the database as a data model to
integrate data. These ontologies support the
analysis of data nodes that use different semantics about
a railway domain. This semantic alignment makes it
possible to interconnect data nodes with each other or
to connect them to specific events related to safety
management. However, ontologies are far from being
populated automatically and require safety expertise and
human effort to build (Figueres-Esteban & Van
Gulijk 2016). Table 1 and Table 2 show the results
of this effort to align three different data
sources with a railway standard in a single ontology
that represents the signal domain (Figure 3).

The alignment of different types of data with
a domain ontology and events related to safety
has important benefits. Firstly, the ontology
provides a framework for querying signals. In this
case, the standard RS/521 was selected as the
reference framework. Secondly, the integration of data
makes it possible to connect all the information
available in the data sources. For example, linking data
nodes of signals to the “signal obscured” event and
SMIS/IFCS records makes it possible to filter records by
specific types of signals in order to improve the safety
understanding related to obscured signals.
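
The filtering benefit can be illustrated with a small Python sketch that walks from the event to its signals and on to the linked records. Everything in it is assumed for illustration: the link triples, the relationship types INVOLVES and REPORTED_IN, and the node identifiers are hypothetical, and a graph database would express the same traversal in its own query language.

```python
def records_for_signal_type(event, signal_type, links, node_types):
    """Return the records reachable from an event through signals of one type."""
    signals = {end for start, rel, end in links
               if start == event and rel == "INVOLVES"
               and node_types.get(end) == signal_type}
    return sorted(end for start, rel, end in links
                  if start in signals and rel == "REPORTED_IN")


# Signal type assigned to each signal node via the ontology (illustrative values).
node_types = {
    "signal:A": "Colour light signal",
    "signal:B": "SPAD indicator",
}

# Links as (start, relationship type, end) triples.
links = [
    ("event:signal_obscured", "INVOLVES", "signal:A"),
    ("event:signal_obscured", "INVOLVES", "signal:B"),
    ("signal:A", "REPORTED_IN", "record:1"),
    ("signal:A", "REPORTED_IN", "record:2"),
    ("signal:B", "REPORTED_IN", "record:3"),
]

colour_light_records = records_for_signal_type(
    "event:signal_obscured", "Colour light signal", links, node_types)
```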


8 CONCLUSIONS 


This paper demonstrates how safety scientists can
enter the realm of big data. It demonstrates that
the challenge of storing large amounts of data
from diverse railway data sources to extract safety
learning requires safety experts who can work with
graph databases.

Graph databases make it possible to store data
regardless of the structure of the data. More
fundamentally, they allow the co-location of domain
ontologies to integrate different data sources and extract
safety learning.

In theory, the database is infinitely scalable so it 
is hard to predict the limitations of the approach. 